On September 7, Russia carried out a massive drone attack on Ukraine’s capital, Kyiv, killing four people and injuring 40. The Associated Press reported that it was the largest aerial attack since the war between the two countries began in 2022.
The Kyiv Post, one of Ukraine’s leading English-language news outlets, covered the story, but no public record of its homepage exists in the Internet Archive’s Wayback Machine for that day. A homepage snapshot — a viewable version of a page captured by the Wayback Machine’s crawlers — does not appear until September 8, more than 24 hours after the attack.
In the first five months of 2025, the Wayback Machine captured snapshots of the Kyiv Post an average of 85 times per day. Between May 17 and October 1, though, the daily average dropped to one. For 52 days between May and October, the Wayback Machine shows no snapshots of the Kyiv Post at all.
Screenshot taken 10/21/2025
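For readers who want to check a gap like this themselves, the Internet Archive exposes a public “availability” API that returns the capture of a URL closest to a requested timestamp. The short Python sketch below is for illustration only; it is not the tooling behind our analysis, and the site and date are simply the example above.

```python
# Ask the Wayback Machine's public availability API for the capture of a
# homepage closest to a given date. Illustrative only: kyivpost.com and the
# September 7, 2025 timestamp are just the example discussed above.
import json
import urllib.parse
import urllib.request

def closest_snapshot(url: str, timestamp: str) -> dict:
    """Return the capture closest to `timestamp` (YYYYMMDD or longer), or {}."""
    query = urllib.parse.urlencode({"url": url, "timestamp": timestamp})
    with urllib.request.urlopen(f"https://archive.org/wayback/available?{query}") as resp:
        data = json.load(resp)
    return data.get("archived_snapshots", {}).get("closest", {})

if __name__ == "__main__":
    snap = closest_snapshot("kyivpost.com", "20250907")
    if snap:
        print(f"Closest capture: {snap['timestamp']} -> {snap['url']}")
    else:
        print("No capture found.")
```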
News outlets’ homepages are vital historical records, providing a real-time view into what a newsroom deems the most important stories of the moment. From a homepage — headlines, word choice, story placement — readers get a sense of a newsroom’s editorial priorities and how they change over time. If homepages aren’t saved, records of those changes are lost.
The Wayback Machine, an initiative from the nonprofit Internet Archive, has been archiving the webpages of news outlets — alongside millions of other websites — for nearly three decades. Earlier this month, it announced that it will soon archive its trillionth web page. The Internet Archive has long stressed the importance of archiving homepages, particularly to fact-check politicians’ claims. In 2018, for instance, when Donald Trump accused Google of failing to promote his State of the Union address on its homepage, Google used the Wayback Machine’s archive of its homepage to disprove the statement.
“[Google’s] job isn’t to make copies of the homepage every 10 minutes,” Mark Graham, the director of the Wayback Machine, said at the time. “Ours is.”
But a Nieman Lab analysis shows that the Wayback Machine’s snapshots of news outlets’ homepages have plummeted in recent months. Between January 1 and May 15, 2025, the Wayback Machine shows a total of 1.2 million snapshots collected from 100 major news sites’ homepages. Between May 17 and October 1, 2025, it shows 148,628 snapshots from those same 100 sites — a decline of 87%. (You can see our data here.)
While our analysis focused on news sites, they’re not the only URLs impacted. We documented a similarly large decrease in the number of snapshots available of federal government website homepages after May 16, during a period when the Trump administration has taken down pages on government sites and made undisclosed changes to others, a practice known as “stealth editing.”
When we contacted Graham for this story, he confirmed there had been “a breakdown in some specific archiving projects in May that caused less archives to be created for some sites.” He did not answer our questions about which projects were impacted, saying only that they included “some news sites.”
Graham confirmed that the number of homepage archives is indicative of the amount of archiving happening across a website. He also said, though, that homepage crawling is just one of several processes the Internet Archive runs to find and save individual pages, and that “other processes that archive individual pages from those sites, including various news sites, [were] not affected by this breakdown.”
After the Wayback Machine crawls websites, it builds indexes that structure and organize the material it’s collected. Graham said some of the missing snapshots we identified will become available once the relevant indexes are built.
“Some material we had archived post-May 16th of this year is not yet available via the Wayback Machine as their corresponding indexes have not yet been built,” he said.
Under normal circumstances, building these indexes can cause a delay of a few hours or a few days before the snapshots appear in the Wayback Machine. The delay we documented is more than five months long. Graham said there are “various operational reasons” for this delay, namely “resource allocation,” but otherwise declined to specify.
According to Graham, the “breakdown” in archiving projects has been fixed and the number of snapshots will soon return to its pre-May 16 level. He did not share any more specifics on the timeframe. But when we re-analyzed our sample set on October 19, we found that the total number of snapshots for our testing period had actually declined since we first conducted the analysis on October 7.
In the process of reporting a different story in September, we noticed something curious about the Wayback Machine: Starting on May 16, 2025, it showed a steep drop in the number of snapshots of news publishers’ homepages.
We analyzed the homepage URLs of 100 top news sites. Our sample included the 50 top English-language news sites in the world and 50 other leading U.S. and international news sites, including The New York Times, the BBC, and Le Monde.
For example, The New York Times homepage averaged 122 snapshots per day in the Wayback Machine between January 1 and May 15, 2025. But from May 17 to October 1, that number dropped to an average of 16 snapshots per day. CNN, the most-archived homepage in our sample, had a total of 34,524 snapshots from January 1 to May 15. In the months since, a total of 1,903 snapshots are available, a decrease of 94%.
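Counts like these can be approximated with the Wayback Machine’s public CDX API, which lists the indexed captures of a URL over a date range; only captures whose indexes have already been built appear, which is why the indexing delays Graham described matter. Below is a rough Python sketch of one way to tally captures per day. It is an illustration of the idea, not the exact pipeline behind our dataset.

```python
# Count indexed Wayback Machine captures of a homepage, per day, over a date
# range, using the public CDX API. A rough sketch: the site and dates are
# illustrative, and "snapshots per day" here simply means captures returned
# by the CDX index for that day.
import json
import urllib.parse
import urllib.request
from collections import Counter

def daily_capture_counts(url: str, start: str, end: str) -> Counter:
    """Return a Counter mapping YYYYMMDD -> number of indexed captures."""
    params = urllib.parse.urlencode({
        "url": url,
        "from": start,       # e.g. "20250101"
        "to": end,           # e.g. "20250515"
        "output": "json",
        "fl": "timestamp",   # we only need each capture's timestamp
    })
    with urllib.request.urlopen(f"https://web.archive.org/cdx/search/cdx?{params}") as resp:
        rows = json.load(resp)
    # With output=json, the first row is a header; the rest are captures.
    return Counter(row[0][:8] for row in rows[1:])

if __name__ == "__main__":
    counts = daily_capture_counts("nytimes.com", "20250101", "20250515")
    total = sum(counts.values())
    print(f"{total} captures across {len(counts)} days with at least one capture")
```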
While all 100 sites we looked at showed the same pattern, they were not all affected equally. For instance, the Wayback Machine’s snapshots of news.google.com totaled 18,300 between January 1 and May 15 and 14,902 between May and October — an 18.5% decrease, the smallest change for any site in our sample.
Meanwhile, newsbreak.com was snapshotted 11,185 times between January and May. But between May and October, the Wayback Machine only has 38 snapshots, representing a 99.6% decrease.
Some homepages that were once snapshotted dozens of times a day had no snapshots at all on some days between May and October. Those include the homepages of the Miami Herald, Minnesota Public Radio, SFGate, PennLive, Business Insider, The Independent (United Kingdom), Rappler (Philippines), the Kyiv Post (Ukraine), and El País (Spain).
Here’s one way this gap played out. The Wayback Machine does not turn up a record of Oregon Public Broadcasting’s homepage on September 28. That was the day after President Trump ordered the deployment of 200 National Guard troops to Portland, Oregon. The next available snapshot was taken on the afternoon of September 29.
Screenshot taken 10/21/2025
For the first several months of this year, until May 16, OPB.org averaged 83 snapshots per day in the Wayback Machine. Between May 17 and October 1, it averaged one snapshot per day. For 57 days in that stretch, the Wayback Machine shows no snapshots of OPB.org at all.
To better understand the extent of the breakdown, we also looked at the homepages of U.S. government websites, including whitehouse.gov, congress.gov, and data.gov. These sites are less likely than news homepages to be updated daily or hourly, but the Trump administration has overhauled many of them. Last month, for instance, the administration purged content related to sexuality, gender identity, and health equity from the Centers for Disease Control and Prevention (CDC) site. Between January 1 and May 15, 2025, the homepage of CDC.gov averaged 162 snapshots per day in the Wayback Machine. In the months since, that number has dropped to an average of 16 snapshots per day.
“This points to a broader problem with how dependent we all are on a single, amazingly useful organization to try and cover the bulk of the work for web archiving,” Trevor Owens, an archivist and author of After Disruption: A Future for Cultural Memory, told us.
It’s unclear why certain days are missing for some publications and not others, but a closer inspection of homepages in the Wayback Machine shows that the sharp drop-off in snapshots is reflected in the activity of “collections” — archiving projects that are sometimes organized under a specific theme.
More generic crowdsourced “collections” that rely on users to initiate archiving, like “ArchiveBot” and “Save Page Now,” consistently display snapshots across the past year. Several prominent news-focused collections overseen by Internet Archive employees — like “Local News – US,” “Tow Center Pink Slime News Sites,” and “Exile Media Central America” — appear to take snapshots of impacted homepages much less frequently after May 16.
Ian Milligan is a digital historian, a professor at the University of Waterloo, and the author of Averting the Digital Dark Age: How Archivists, Librarians, and Technologists Built the Web a Memory. When we asked him about our findings, he said it would be ideal to have daily snapshots of every news site. But the reality is that archiving work is expensive to do and organizations like the Internet Archive have to manage their budgets accordingly. In researching the history of the internet after the September 11 attacks, for example, he found a sharp increase in digital news archives because preservation became a priority.
“It’s less [about] having that one daily snapshot and more about having archives that are responsive so that if something goes on in Minnesota, there’s an ability to turn up the dial and get it more often,” Milligan said.
Archiving homepages isn’t just important for historical records, though. Homepages are also one of the central ways the Wayback Machine finds individual pages to save.
“Your entry point to a crawl is generally the homepage, because the homepage gives you the map to the structure of what is underlying that page,” said Matthew Weber, a communications professor at Rutgers University who researches local news ecosystems using the Internet Archive, adding, “Crawling that initial page is critical to being able to archive and store the article page.”
Crawlers that are set to regularly archive news publications often treat homepages as “seed URLs.” The crawler gets to individual articles by “hopping” to links found on that homepage, such as promoted stories.
“It tends to be the case that a homepage will be the seed and then the way that the crawler gets to the individual articles is by finding links to them off that page,” Owens said. “For any given crawl, the person running it will set multiple seeds, scopes, and configure how many hops it should make.”
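In simplified form, that seed-and-hop model looks something like the Python sketch below. This is a toy illustration of the concept, not the Internet Archive’s production crawler, which is built on tools such as Heritrix, writes what it fetches to WARC files, and applies far more elaborate scoping rules.

```python
# A toy one-hop crawl: fetch a seed homepage, collect the links found on it,
# and fetch a handful of same-site pages those links point to. Illustrative
# only; a real archiving crawl throttles requests, configures scope and hop
# counts per seed, and writes captures to WARC files instead of memory.
import urllib.parse
import urllib.request
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_one_hop(seed_url: str, max_links: int = 5) -> dict:
    """Fetch the seed page, then up to `max_links` same-site links found on it."""
    with urllib.request.urlopen(seed_url) as resp:
        seed_html = resp.read().decode("utf-8", errors="replace")
    pages = {seed_url: seed_html}

    extractor = LinkExtractor()
    extractor.feed(seed_html)
    seed_host = urllib.parse.urlparse(seed_url).netloc

    for href in extractor.links:
        absolute = urllib.parse.urljoin(seed_url, href)
        if urllib.parse.urlparse(absolute).netloc != seed_host:
            continue  # stay in scope: same site as the seed
        if absolute in pages:
            continue
        with urllib.request.urlopen(absolute) as resp:
            pages[absolute] = resp.read().decode("utf-8", errors="replace")
        if len(pages) > max_links:
            break
    return pages
```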
Weber said he would expect a drop in the homepage crawls to result in decreased article page crawls, too. While we did look into the frequency of article page archiving, we weren’t able to conduct a systematic analysis across our entire sample set. Graham told us that individual page archiving processes were not affected by the May breakdown.
According to the Internet Archive’s own annual filings, it employed at least 134 people in 2023 and brought in $23 million in revenue, a shoestring budget for the task at hand — its expenses totaled $32.7 million that year.
“We’re just trying to do our job, and we’re trying to preserve as much of the public web as possible, make it available to people that are curious and want to learn, make it available for future generations,” Graham said. “[We’re trying to] give systems around the world a fighting chance in the face of the avalanche of mis- and disinformation now, especially propelled by the rise of AI.”
None of the experts we consulted for this story had noticed the dip in news homepage snapshots until we brought it to their attention. While they were surprised by our findings, they also highlighted a larger issue in the United States: There’s no real mandate for the internet to be preserved at all.
Outside the Internet Archive, the Library of Congress operates perhaps the second-largest web archiving initiative in the U.S., but on a significantly smaller scale. According to Owens, who used to be the director of digital services at the Library of Congress, the project currently has “on the order of 20 billion archived resource files,” while the Wayback Machine archives more than 500 million URLs per day.
In France, for example, the National Library is required by law to preserve and make accessible all French websites and other digital works. That approach works for .fr domains, but it would be much harder to replicate in the United States, Weber said. “How do you preserve the entirety of the .com?”
“It’s challenging that in the United States, we are reliant on a nonprofit organization that was started in the late 1990s with a lot of different individuals literally donating data to an entity that wanted to collect and aggregate this information,” Weber said. “We’re very grateful that the Internet Archive has built a systematic program out of that. But there is always the risk that threats and challenges cause that program to change in some ways. And it seems like that may be going on right now, which is concerning.”