Archiving websites: The Internet Archive's Save Page Now (SPN) API

Over the last few months, I have been working out how to efficiently send websites to the Internet Archive on a regular basis. The goal: to secure the online content of queer organizations. Turns out: this is not that easy.

Let’s start at the beginning: how do you save a copy of a website in the Internet Archive? Capturing websites for the Internet Archive is done by the Save Page Now (SPN) mechanism. It works by appending a URL to https://web.archive.org/save/. For example, to send this website: https://web.archive.org/save/https://katharinabrunner.de
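A minimal sketch of that basic form as a shell snippet. It assumes only what is stated above: that a plain GET request to the /save/ endpoint triggers a capture.

    #!/usr/bin/env bash
    # Trigger a basic Save Page Now capture by requesting the /save/ endpoint.
    # No API key is needed for this simple GET form; the -w flag prints the
    # HTTP status code so you can see whether the request was accepted.
    URL="https://katharinabrunner.de"
    curl -s -o /dev/null -w "%{http_code}\n" "https://web.archive.org/save/${URL}"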

Many different applications use the SPN API.

The SPN2 API documentation resides in this Google Doc.

Things to consider

  • Save just the page, or all outlinks too?
  • Save only when there is no snapshot newer than a certain timestamp? (Both options appear in the sketch below.)
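Both considerations map onto parameters of the authenticated SPN2 API. The following is a hedged sketch based on my reading of the Google Doc: capture_outlinks=1 asks SPN to also capture the pages the URL links to, and if_not_archived_within skips the capture when a snapshot newer than the given period already exists. The POST form needs archive.org S3-style credentials; verify the exact parameter names against the doc.

    #!/usr/bin/env bash
    # Sketch of an authenticated SPN2 capture request.
    # ACCESS_KEY/SECRET_KEY: archive.org S3-style credentials (placeholders).
    ACCESS_KEY="..."
    SECRET_KEY="..."
    URL="https://katharinabrunner.de"

    curl -s "https://web.archive.org/save" \
      -H "Accept: application/json" \
      -H "Authorization: LOW ${ACCESS_KEY}:${SECRET_KEY}" \
      --data-urlencode "url=${URL}" \
      --data-urlencode "capture_outlinks=1" \
      --data-urlencode "if_not_archived_within=30d"  # skip if a snapshot from the last 30 days exists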

I went with the spn bash scripts and deployed them with GitHub Actions. The script is a sophisticated implementation that includes logging. I have a list of roughly 300 URLs, and saving them with all outlinks can take some time: GitHub stops all running actions after six hours.
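For illustration, a minimal workflow of the kind I mean. The schedule, file names, and script invocation are assumptions for the sketch, not my exact setup; the 360-minute timeout mirrors GitHub's six-hour hard limit for hosted runners.

    # .github/workflows/archive.yml (hypothetical path)
    name: archive-websites
    on:
      schedule:
        - cron: "0 3 * * 0"    # weekly, Sundays at 03:00 UTC (assumed schedule)
      workflow_dispatch:        # allows a manual restart after a timeout
    jobs:
      save:
        runs-on: ubuntu-latest
        timeout-minutes: 360    # GitHub kills hosted jobs after six hours anyway
        steps:
          - uses: actions/checkout@v4
          - name: Run the spn script over the URL list
            run: ./spn.sh urls.txt   # assumed script and list-file names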

2 comments

Stefan Baack, 23 February 2024

@blog @cutterkom Thanks for sharing! You end by saying: “GitHub stops all running actions after six hours.” Do you mean that this is a problem for your project, or is it just something to be generally aware of when others want to adapt your approach?

Katharina Brunner, 23 February 2024

@tootbaack @blog It’s no problem for my project. I just restart the action, and I use a parameter that fetches websites only if they have not been archived in the last xx days or weeks. It’s something to be aware of because, as you probably know, archiving stuff takes some time.
