The shift
The Data That Powers A.I. Is Disappearing Fast
New research from the Data Provenance Initiative has found a dramatic drop in content made available to the collections used to build artificial intelligence.
Reporting from San Francisco
For years, the people building powerful artificial intelligence systems have used enormous troves of text, images and videos pulled from the internet to train their models.
Now, that data is drying up.
Over the past year, many of the most important web sources used for training A.I. models have restricted the use of their data, according to a study published this week by the Data Provenance Initiative, an M.I.T.-led research group.
The study, which looked at 14,000 web domains that are included in three commonly used A.I. training data sets, discovered an “emerging crisis in consent,” as publishers and online platforms have taken steps to prevent their data from being harvested.
The researchers estimate that in the three data sets — called C4, RefinedWeb and Dolma — 5 percent of all data, and 25 percent of data from the highest-quality sources, has been restricted. Those restrictions are set up through the Robots Exclusion Protocol, a decades-old method for website owners to prevent automated bots from crawling their pages using a file called robots.txt.
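To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler might consult a robots.txt file before fetching a page, using Python's standard-library `urllib.robotparser`. The file contents, the bot name `ExampleAIBot`, the `example.com` URLs and the helper `is_allowed` are all hypothetical and chosen for illustration; they are not taken from the study or from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one named crawler from the whole site,
# and blocks every other crawler only from /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAIBot", "OtherBot"):
        for url in ("https://example.com/articles/1",
                    "https://example.com/private/notes"):
            print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

The protocol is advisory: the file states the site owner's wishes, and it is up to each crawler to run a check like this one and honor the result.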
The study also found that as much as 45 percent of the data in one set, C4, had been restricted by websites’ terms of service.
“We’re seeing a rapid decline in consent to use data across the web that will have ramifications not just for A.I. companies, but for researchers, academics and noncommercial entities,” said Shayne Longpre, the study’s lead author, in an interview.
Kevin Roose is a Times technology columnist and a host of the podcast "Hard Fork."