Essential tips for web archiving.

In this brief guide, I will share what I've learned about archiving live webpages and recovering deleted webpages using various archive services.

Before you start

I highly recommend installing the Web Archives extension in your web browser (Chrome, Firefox). It provides quick access to archive services and search engine caches.

Overview of archive services & caches

Wayback Machine: Best and largest webpage archive. You probably already use it. Wayback Machine is the only archival service that also performs automated crawling, so its coverage is much better than other archives. See usage tips on Wikipedia.
Archive.today (also known as Archive.is): A mid-sized service that takes a different approach from Wayback Machine; it executes all page JavaScript, then saves the resulting HTML content as a static page. This approach works well for some JavaScript-heavy pages. See information and usage tips on Wikipedia. Notably, Archive.today is able to back up captures from other services, automatically detecting a capture's original URL and indexing it accordingly. This allows other users to find the archived page by searching for its original URL.
Ghostarchive: A smaller archive service that uses the Webrecorder suite to store and replay webpage captures. This works particularly well for interactive JavaScript-heavy pages. Ghostarchive also has custom setups for capturing social media posts on X/Twitter, Instagram, and possibly other services. See usage tips on Wikipedia.
Conifer: A very small and specialized web archiving service that allows registered users to create collections of webpage captures using Webrecorder. Collections can be downloaded as WARC files or shared publicly. To learn how to use Conifer, read the User Guide.
Megalodon: A small web archiving service based in Japan. Refuses to archive pages on sites that use robots.txt. Not particularly good for archiving complex pages, but useful for archiving pages from some Japanese sites. See usage tips on Wikipedia.
FreezePage: A very old webpage capture service that does not execute JavaScript. Captures are deleted after 30 days of account inactivity, or after 3 days for unregistered users! Do not use FreezePage for permanent archival! You should immediately save any FreezePage captures to Archive.today.
Bing cache: Accessible from a dropdown menu in Bing search results. Use the url: operator (example: url:https://example.com) to show results only for the exact URL you're looking for. Not every result will have a cache available.
Yandex cache: Accessible from a dropdown menu in Yandex search results. Yandex's cache is quite comprehensive, but not every result will have a cache available. Searching for URLs on Yandex often triggers a CAPTCHA though, at least for me.
Google cache: Formerly accessible using the cache: operator in Google search. Unfortunately, Google removed its cache feature entirely in September 2024, so this no longer works.
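
As a small illustration of the url: operator described for Bing above, the sketch below builds a search link restricted to one exact URL. It assumes Bing's usual `https://www.bing.com/search?q=...` query format, and the function name is made up for this example.

```python
from urllib.parse import urlencode


def bing_exact_url_search(page_url: str) -> str:
    """Build a Bing search link that applies the url: operator to one URL."""
    # urlencode percent-escapes the colon and slashes inside the query value.
    return "https://www.bing.com/search?" + urlencode({"q": f"url:{page_url}"})


print(bing_exact_url_search("https://example.com"))
```

Opening the printed link in a browser should show the same results as typing `url:https://example.com` into the Bing search box; whether a cached copy is offered still depends on the result.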

Recovering a deleted webpage

If you come across a webpage that has been deleted, you can use the Web Archives extension to try to find an existing archived or cached version of it. Just click the extension icon, then click the service that you want to check. Or click "All Search Engines" to check all of them at once! If the page you're looking for immediately redirects or is completely inaccessible, you can still use the extension. Just click the "Tab" dropdown in the top-left corner of the extension popup, then switch to "URL" and paste in the URL you want to look for.

Generally I check these services in order:

Wayback Machine
Archive.today
Bing cache
Yandex cache
Ghostarchive
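
The first check in this list can be scripted. The following is a minimal sketch, assuming the Internet Archive's public availability endpoint (`https://archive.org/wayback/available`) and the JSON shape it normally returns. The function names are hypothetical, and the other services are not covered here.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen


def availability_query(page_url: str) -> str:
    """Build the availability lookup URL for one page."""
    return "https://archive.org/wayback/available?" + urlencode({"url": page_url})


def closest_snapshot(response: dict) -> Optional[str]:
    """Return the closest capture URL from a parsed response, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def check_wayback(page_url: str, timeout: float = 30.0) -> Optional[str]:
    """Query the endpoint over the network and return a capture URL, if any."""
    with urlopen(availability_query(page_url), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

Calling `check_wayback("https://example.com")` performs a real network request. A result of `None` only means this endpoint reported no capture, so it is still worth checking the remaining services by hand.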

Note that search engine caches are not permanent! If you find a deleted webpage in a search engine cache (Bing or Yandex), you will need to save it to a permanent archive. When saving a page from Bing cache, I recommend using Archive.today because it automatically detects the original URL of a cached page and indexes it accordingly. This allows other users to find the archived page by searching for its original URL. Unfortunately, Yandex seems to block Archive.today and Wayback Machine. Currently I recommend saving pages from Yandex to Megalodon, then saving the Megalodon capture to Archive.today. To do this, you will need to manually navigate to Archive.today and paste the Megalodon URL into the "My url is alive and I want to archive its content" box. Other methods tend to mangle the URL and cause the capture to break. Impressively, Archive.today is able to detect the original URLs of these captures; here is an example.

Here are some extra tips for specific sites:

YouTube Videos: Use the YouTube Video Finder service. Bing cache also has excellent coverage of YouTube if you just need metadata or proof that a video existed.
X/Twitter: Wayback Machine is no longer able to archive tweets in replayable form, but some tweets are still saved in raw JSON form. To access JSON captures, change the URL from x.com to twitter.com and search for the first capture. Here is an example.
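
The x.com to twitter.com rewrite for tweets can be sketched as follows. This is only an illustration: the function names and the sample tweet URL are hypothetical, and the second helper simply points at Wayback Machine's capture listing (`/web/*/`) so the first capture can be picked manually.

```python
from urllib.parse import urlsplit, urlunsplit


def to_twitter_host(tweet_url: str) -> str:
    """Swap an x.com host for twitter.com, leaving the rest of the URL alone."""
    parts = urlsplit(tweet_url)
    if parts.netloc.lower() in ("x.com", "www.x.com"):
        parts = parts._replace(netloc="twitter.com")
    return urlunsplit(parts)


def wayback_captures_page(url: str) -> str:
    """Link to the Wayback Machine listing of all captures for a URL."""
    return "https://web.archive.org/web/*/" + url


tweet = to_twitter_host("https://x.com/exampleuser/status/1234567890")
print(wayback_captures_page(tweet))
```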

Archiving a live webpage

It's important to proactively save webpages you visit to ensure you can return to them later. Wayback Machine's Save Page Now service usually works well for this, but some sites are trickier; here I will provide tips for dealing with them. Audio and video content often must be downloaded manually; the tool I recommend for this is yt-dlp, but if you're not comfortable using the command line, cobalt is a great alternative.
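
For anyone who prefers to script these two steps, here is a rough sketch. It assumes Save Page Now accepts the common `https://web.archive.org/save/<url>` form and that the `yt-dlp` executable is on your PATH; the function names and the default output folder are made up for illustration.

```python
import shutil
import subprocess
from typing import List


def save_page_now_url(page_url: str) -> str:
    """Build a Save Page Now link; opening it asks Wayback to capture the page."""
    return "https://web.archive.org/save/" + page_url


def yt_dlp_command(media_url: str, out_dir: str = "downloads") -> List[str]:
    """Build a yt-dlp command that stores the file under out_dir."""
    # --paths sets the download folder; --output sets the filename template.
    return ["yt-dlp", "--paths", out_dir,
            "--output", "%(title)s [%(id)s].%(ext)s", media_url]


def download_media(media_url: str, out_dir: str = "downloads") -> None:
    """Run yt-dlp, failing loudly if it is missing or the download fails."""
    if shutil.which("yt-dlp") is None:
        raise RuntimeError("yt-dlp is not installed or not on PATH")
    subprocess.run(yt_dlp_command(media_url, out_dir), check=True)
```

Nothing here uploads the downloaded file anywhere; uploading to the Internet Archive remains a separate manual step.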

Airtable: Use Archive.today to save the page, then download each table as a CSV file using the three-dots or dropdown menu and upload the files to the Internet Archive.
Binary/raw files: Wayback Machine and Megalodon support archiving raw files, but Archive.today and Ghostarchive do not; they only support archiving webpages.
Bluesky posts: Use Ghostarchive or Archive.today.
eBay auction pages: Use Archive.today. (Other services do not save full-size images)
Facebook posts: Use Archive.today. To save videos, download them with yt-dlp and upload them to the Internet Archive.
Google Docs: Change the end of the URL from /edit to /mobilebasic to load a plain HTML version of the document. Then save to Wayback or any other archive service.
Google Sheets: Change the end of the URL from /edit to /htmlview to load a plain HTML version of the spreadsheet. Then save to Wayback or any other archive service.
Imgur images: No archive service is able to save full-size images or large albums. Use Archive.today or Megalodon to save the page, then download the image or album and upload it to the Internet Archive.
Instagram posts: Use Ghostarchive or Archive.today. Unfortunately, on posts with more than two images, the latter images may fail to save.
Mastodon posts: Wayback may be unable to save posts from some instances; use Archive.today instead for those.
Microsoft Sway presentations: Use Ghostarchive.
News articles (in general): Use Archive.today. Often bypasses paywalls.
Peatix event pages: Use Megalodon.
Reddit threads:
- For text threads: Change the URL from www.reddit.com to old.reddit.com, then save to Wayback Machine.
- For threads with images: Save to Archive.today. Unfortunately, image replies will not be saved unless you also save each <image> link manually. Reddit blocks Ghostarchive so using that is not an option.
Threads posts: Use Archive.today or Ghostarchive.
Soundcloud tracks: Use Archive.today or Ghostarchive to save the page, then download the track and upload it to the Internet Archive.
- Some tracks are downloadable at their original quality from the three-dots "More" menu. You'll need a Soundcloud account, but the account does not need a verified email address.
- If a track is not downloadable, you can still download it at streaming quality with yt-dlp.
TikTok videos: Use Conifer; it is the only archive service that supports playback of TikTok videos. Alternatively, use Ghostarchive or Wayback to save the page, then save the video and upload it to the Internet Archive. Many TikTok videos can be saved by right-clicking them and selecting "Download video." If downloads are disabled, yt-dlp can still save the video.
Tumblr posts:
- Archive.today works very reliably, but it sometimes fails to save posts from the Tumblr dashboard (URLs of the form www.tumblr.com/exampleblog/ID), and it cannot save videos.
- If Archive.today fails to save a post, try capturing it with FreezePage and immediately saving the FreezePage capture to Archive.today.
- Megalodon often works for blogs with custom themes (URLs of the form exampleblog.tumblr.com/post/ID).
- Wayback Machine and Ghostarchive are able to save text posts, but fail to save images. Strangely, Wayback Machine is sometimes able to save videos; here is an example.
Vimeo videos: Use Archive.today to save the page, then download the video with yt-dlp and upload it to the Internet Archive.
X/Twitter posts:
- For individual tweets with images: Use Archive.today or Megalodon.
- For threads: Use Ghostarchive. It is the only service that shows replies and threads because it uses real Twitter accounts. But it often fails to save media as of 2024-11-29.
- For videos: download the video with yt-dlp and upload it to the Internet Archive.
Yahoo Japan Auctions pages: Blocks Wayback Machine, but all other archive services should work.
YouTube videos: Save the YouTube page to Wayback Machine using Save Page Now, then check back in a few days to make sure the video was saved. Ghostarchive can be used instead for shorter videos. Megalodon may be used to save some of the top comments, excluding any replies.
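
Several of the tips above come down to rewriting a URL before saving it: Google Docs (/edit to /mobilebasic), Google Sheets (/edit to /htmlview), and Reddit text threads (www.reddit.com to old.reddit.com). A minimal sketch of those three rewrites follows. The function name and the sample IDs are hypothetical, and it assumes Docs and Sheets links live under docs.google.com/document/ and docs.google.com/spreadsheets/.

```python
from urllib.parse import urlsplit, urlunsplit


def archive_friendly_url(url: str) -> str:
    """Apply the Docs, Sheets, and Reddit rewrites; return other URLs as-is."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    path = parts.path
    if host == "docs.google.com" and path.endswith("/edit"):
        stem = path[: -len("/edit")]
        if path.startswith("/document/"):
            path = stem + "/mobilebasic"   # plain HTML view of a document
        elif path.startswith("/spreadsheets/"):
            path = stem + "/htmlview"      # plain HTML view of a spreadsheet
    elif host == "www.reddit.com":
        host = "old.reddit.com"            # old layout, for text threads
    # Query string and fragment are passed through unchanged.
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


print(archive_friendly_url("https://docs.google.com/document/d/DOC_ID/edit"))
print(archive_friendly_url("https://www.reddit.com/r/example/comments/abc123/title/"))
```

The rewritten URL is what you would then submit to Wayback Machine or another archive service. For Reddit threads with images, the guidance above still applies and the original URL should be saved to Archive.today instead.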

n0samu/web-archival-guide.md
