Over the last few months, I have been working on how to efficiently send websites to the Internet Archive on a regular basis. The goal: to preserve the online content of queer organizations. Turns out: this is not that easy.
Let’s start at the beginning: How do you save a copy of a website in the Internet Archive? Capturing websites for the Internet Archive is done by the Save Page Now (SPN) mechanism. It works by appending a URL to https://web.archive.org/save/. For example, to send this website: https://web.archive.org/save/https://katharinabrunner.de
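In code, that is a single HTTP request. A minimal Python sketch using requests (unauthenticated, which is fine for one-off saves but more likely to be rate-limited than the authenticated API):

```python
import requests

# Trigger a capture by appending the target URL to the Save Page Now endpoint.
target = "https://katharinabrunner.de"
resp = requests.get(f"https://web.archive.org/save/{target}", timeout=120)
resp.raise_for_status()

# The finished snapshot ends up under https://web.archive.org/web/<timestamp>/<url>
print("Save request accepted:", resp.status_code)
```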
Many different applications use the SPN API, for example:
- the Save Page Now form on the Internet Archive’s website
- Browser extensions for Firefox and Chrome
- archivebox.io, a suite to save websites in different formats, incl. sending them to the Internet Archive
- savepagenow, a Python package (see the short sketch after this list)
- waystation, a Github Action built to archive Github Pages automatically
- wayback-machine-spn-scripts, a sophisticated bash script
- and many more…
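With the savepagenow package, a capture is basically a one-liner. A rough sketch, with function names taken from the package’s documentation as I remember it, so double-check against the current version:

```python
import savepagenow

# Capture a page and get back the URL of the new snapshot.
archive_url = savepagenow.capture("https://katharinabrunner.de")
print(archive_url)

# capture_or_cache() falls back to an existing recent snapshot
# instead of raising an error if the page was archived very recently.
archive_url, newly_captured = savepagenow.capture_or_cache("https://katharinabrunner.de")
print(archive_url, "(fresh capture)" if newly_captured else "(recent cached snapshot)")
```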
The SPN2 API documentation resides in this Google Doc.
Things to consider
- Save just the page, or all of its outlinks as well?
- Save only when there is no snapshot newer than a certain timestamp?
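In my reading of the SPN2 documentation, both questions map to request parameters of the authenticated API. A hedged sketch, assuming capture_outlinks and if_not_archived_within are the right parameter names and that the keys come from your archive.org account:

```python
import requests

# Authenticated SPN2 request; parameter names follow my reading of the Google Doc,
# so verify them there before relying on this.
ACCESS_KEY = "YOUR_ACCESS_KEY"  # placeholder
SECRET_KEY = "YOUR_SECRET_KEY"  # placeholder

resp = requests.post(
    "https://web.archive.org/save",
    headers={
        "Accept": "application/json",
        "Authorization": f"LOW {ACCESS_KEY}:{SECRET_KEY}",
    },
    data={
        "url": "https://katharinabrunner.de",
        "capture_outlinks": "1",          # also capture pages the site links to
        "if_not_archived_within": "30d",  # skip if a snapshot from the last 30 days exists
    },
    timeout=120,
)
print(resp.json())
```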
I went with the spn bash scripts and deployed them with Github Actions. The script is a sophisticated implementation that includes logging. I have a list of roughly 300 URLs, and saving them with all outlinks can take some time. Github stops all running actions after six hours.
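The bash script already takes care of skipping pages that were archived recently; just to illustrate the idea, here is a rough Python sketch of that freshness check against the public Wayback availability API (urls.txt and the 30-day window are made-up example values, not my actual setup):

```python
from datetime import datetime, timedelta, timezone
import time

import requests

WINDOW = timedelta(days=30)  # example freshness window


def archived_recently(url: str) -> bool:
    """Return True if the Wayback Machine has a snapshot newer than WINDOW."""
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={
            "url": url,
            "timestamp": datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S"),
        },
        timeout=60,
    )
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    if not closest:
        return False
    snapshot_time = datetime.strptime(closest["timestamp"], "%Y%m%d%H%M%S").replace(
        tzinfo=timezone.utc
    )
    return datetime.now(timezone.utc) - snapshot_time < WINDOW


# urls.txt: one URL per line (hypothetical file name)
with open("urls.txt") as f:
    for url in (line.strip() for line in f if line.strip()):
        if archived_recently(url):
            continue
        requests.get(f"https://web.archive.org/save/{url}", timeout=120)
        time.sleep(10)  # be polite; real captures with outlinks take much longer
```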
2 comments
@blog @cutterkom Thanks for sharing! You end with saying: “Github stops all running actions after six hours.” Do you mean that this is a problem for your project, or it’s just something to be generally aware of when others want to adapt your approach?
@tootbaack @blog It’s no problem for my project. I just restart the action, and I use a parameter that fetches websites only if they have not been archived in the last xx days or weeks. It’s something to be aware of, because, as you probably know, archiving stuff takes some time.