The Shift

The Data That Powers A.I. Is Disappearing Fast

New research from the Data Provenance Initiative has found a dramatic drop in content made available to the collections used to build artificial intelligence.

Credit: Raven Jiang

By Kevin Roose

Reporting from San Francisco

July 19, 2024

For years, the people building powerful artificial intelligence systems have used enormous troves of text, images and videos pulled from the internet to train their models.

Now, that data is drying up.

Over the past year, many of the most important web sources used for training A.I. models have restricted the use of their data, according to a study published this week by the Data Provenance Initiative, an M.I.T.-led research group.

The study, which looked at 14,000 web domains that are included in three commonly used A.I. training data sets, discovered an “emerging crisis in consent,” as publishers and online platforms have taken steps to prevent their data from being harvested.

The researchers estimate that in the three data sets (called C4, RefinedWeb and Dolma), 5 percent of all data, and 25 percent of data from the highest-quality sources, has been restricted. Those restrictions are set up through the Robots Exclusion Protocol, a decades-old method that lets website owners use a file called robots.txt to tell automated bots not to crawl their pages.
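To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler might consult a robots.txt file before fetching a page, using Python's standard-library parser. The bot name `ExampleAIBot`, the example.com URLs and the rules themselves are hypothetical and are not drawn from the study; real sites publish their own rules, and compliance with them is voluntary.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. The bot name and paths are
# illustrative assumptions, not taken from any real site.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"
    # The hypothetical A.I. crawler is blocked from the whole site.
    print(is_allowed(ROBOTS_TXT, "ExampleAIBot", page))
    # Other bots fall under the wildcard rule and may fetch public pages.
    print(is_allowed(ROBOTS_TXT, "OtherBot", page))
```

In this sketch, a site that wants to keep one crawler out while remaining open to everyone else needs only a few lines of text, which helps explain how restrictions of this kind can spread across many sites quickly.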

The study also found that as much as 45 percent of the data in one set, C4, had been restricted by websites’ terms of service.

“We’re seeing a rapid decline in consent to use data across the web that will have ramifications not just for A.I. companies, but for researchers, academics and noncommercial entities,” said Shayne Longpre, the study’s lead author, in an interview.

Kevin Roose is a Times technology columnist and a host of the podcast “Hard Fork.”

A version of this article appears in print on July 22, 2024, Section B, Page 1 of the New York edition with the headline: The Data That Powers A.I. Is Disappearing Fast.
