AO3 Data Scraped for AI Training Dataset

What is happening, and what you can do. Check for potential edits with additions at the end of the post!

What is happening? What do we know?

A few days ago, a user going by "nyuuzyou" on the HuggingFace platform uploaded a dataset containing scraped content from AO3. HuggingFace is a very popular platform, widely used for sharing machine learning and AI models/datasets. The scraped dataset includes fics, fanart, and other fanworks, all taken without permission and intended for use in training gen AI models. You can find more information in this Reddit post.

This dataset is one of several compiled from various websites—at least seven in total. While two datasets have been removed, the AO3 one was only disabled on HuggingFace. This means that it’s not downloadable at the moment but still visible. It may also return if takedown efforts end up being challenged/reversed by that user.

Key Details

  • Scope: On AO3, all content with work IDs between 1 and 63,200,000 has been targeted. The work ID is the number at the end of a work's URL. For example, in https://archiveofourown.org/works/12345678, 12345678 is the work ID. You can find it by opening the work and checking the URL in your browser’s address bar. So, if your work falls in that range and is publicly accessible (i.e., not locked and open to everyone, including guests), it’s most likely included in the dataset. The dataset is currently disabled on HuggingFace, but that doesn't mean it's gone: as of now, this is only a temporary takedown.
  • Takedown notices have been issued, but this user has also uploaded the dataset to other sites after backlash and partial removal.
  • There are talks in the discussion forums of potentially moving this dataset to Telegram, torrents, and/or other private channels.
  • HuggingFace AO3 dataset page
  • Other distributed sites listed here (as per a Reddit comment)
  • Currently deleted from ModelScope
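
If you have many works to check, the work ID rule from the Scope point above can be sketched in a few lines of Python. This is only a rough helper, not an official tool: the function names are made up for illustration, and the upper bound is the figure quoted in this post. It only tells you whether an ID falls in the targeted range, not whether the work was public at the time or actually ended up in the dataset.

```python
import re

# Upper bound of the targeted range, as stated in this post (an assumption
# taken from the post, not verified against the dataset itself).
MAX_SCRAPED_WORK_ID = 63_200_000

# Matches the numeric work ID that follows /works/ in an AO3 URL.
WORK_URL = re.compile(r"archiveofourown\.org/works/(\d+)")


def extract_work_id(url):
    """Return the work ID from an AO3 work URL, or None if there is none."""
    match = WORK_URL.search(url)
    return int(match.group(1)) if match else None


def in_scraped_range(work_id):
    """True if the work ID falls inside the range named in this post."""
    return 1 <= work_id <= MAX_SCRAPED_WORK_ID


if __name__ == "__main__":
    # Replace these with your own work URLs.
    urls = [
        "https://archiveofourown.org/works/12345678",
        "https://archiveofourown.org/works/12345678/chapters/1",
    ]
    for url in urls:
        work_id = extract_work_id(url)
        if work_id is None:
            print(f"{url}: no work ID found")
        elif in_scraped_range(work_id):
            print(f"{url}: work ID {work_id} is in the targeted range")
        else:
            print(f"{url}: work ID {work_id} is outside the targeted range")
```

Chapter links work too, since the work ID always comes right after /works/ in the URL.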

What can you do?

  • Should the dataset return again and you see that your work was affected: file your own DMCA or copyright takedown notice. The uploader, in their own words, "has not agreed to take down the entire repo. At this time, the scraper has agreed with taking down art from the person who owns the copyright. That means each of you will need to request a takedown."
  • Instructions and a sample CSV template to list your work IDs for removal are provided in this guide. You can find more details in this announcement by PaperDemon.
  • Lock your works! This limits visibility to registered users only, and is a very good step towards preventing scraping or unauthorized use. To lock all your works on AO3, go to “My Works,” click “Edit Works,” and select all. Then click “Edit” and check the box labeled “Only show to registered users.” Scroll down and click “Update All Works” to apply the change.

⚠️ | Final Notes:

This user has so far shown no signs of stopping and is continuing to redistribute the data across multiple sites, even after numerous takedown requests (read more here). So, we can only recommend that you stay cautious: lock your works, make use of takedown notices if you're unfortunately affected, and spread the word to fellow creators.

Follow up on this and get the latest updates in the Fanfic Communities Network (FCN) Discord Server!

If you have more information regarding this - e.g. if works from other sites are affected too - please reach out to us in the FCN!!
