Over the last few months, I have been working on how to efficiently send websites to the Internet Archive on a regular basis. The goal: to preserve the online content of queer organizations. Turns out: this is not that easy.
Let’s start at the beginning: How do you save a copy of a website in the Internet Archive? Capturing websites for the Internet Archive is done by the Save Page Now (SPN) mechanism. It works by appending a URL to https://web.archive.org/save/. For example, to send this website: https://web.archive.org/save/https://katharinabrunner.de
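In code, that is a single HTTP request. A minimal Python sketch using requests (unauthenticated, which is fine for one-off saves but more likely to be rate-limited than the authenticated API):

```python
import requests

# Trigger a capture by appending the target URL to the Save Page Now endpoint.
target = "https://katharinabrunner.de"
resp = requests.get(f"https://web.archive.org/save/{target}", timeout=120)
resp.raise_for_status()

# The finished snapshot ends up under https://web.archive.org/web/<timestamp>/<url>
print("Save request accepted:", resp.status_code)
```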
Many different applications use the SPN API, for example:
- the Save Page Now form on the Internet Archive’s website
- Browser extensions for Firefox and Chrome
- archivebox.io, a suite to save websites in different formats, incl. sending them to the Internet Archive
- savepagenow, a Python package (see the short sketch after this list)
- waystation, a Github Action built to archive Github Pages automatically
- wayback-machine-spn-scripts, a sophisticated bash script
- and many more…
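With the savepagenow package, a capture is basically a one-liner. A rough sketch, with function names taken from the package’s documentation as I remember it, so double-check against the current version:

```python
import savepagenow

# Capture a page and get back the URL of the new snapshot.
archive_url = savepagenow.capture("https://katharinabrunner.de")
print(archive_url)

# capture_or_cache() falls back to an existing recent snapshot
# instead of raising an error if the page was archived very recently.
archive_url, newly_captured = savepagenow.capture_or_cache("https://katharinabrunner.de")
print(archive_url, "(fresh capture)" if newly_captured else "(recent cached snapshot)")
```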
The SPN2 API documentation resides in this Google Doc.
Things to consider
- Save just the page, or all of its outlinks as well?
- Save only when there is no snapshot newer than a certain timestamp?
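In my reading of the SPN2 documentation, both questions map to request parameters of the authenticated API. A hedged sketch, assuming capture_outlinks and if_not_archived_within are the right parameter names and that the keys come from your archive.org account:

```python
import requests

# Authenticated SPN2 request; parameter names follow my reading of the Google Doc,
# so verify them there before relying on this.
ACCESS_KEY = "YOUR_ACCESS_KEY"  # placeholder
SECRET_KEY = "YOUR_SECRET_KEY"  # placeholder

resp = requests.post(
    "https://web.archive.org/save",
    headers={
        "Accept": "application/json",
        "Authorization": f"LOW {ACCESS_KEY}:{SECRET_KEY}",
    },
    data={
        "url": "https://katharinabrunner.de",
        "capture_outlinks": "1",          # also capture pages the site links to
        "if_not_archived_within": "30d",  # skip if a snapshot from the last 30 days exists
    },
    timeout=120,
)
print(resp.json())
```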
I went with the spn bash scripts and deployed them with Github Actions. The script is a sophisticated implementation that includes logging. I have a list of roughly 300 URLs, and saving them with all outlinks can take some time. Github stops all running actions after six hours.
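The bash script already takes care of skipping pages that were archived recently; just to illustrate the idea, here is a rough Python sketch of that freshness check against the public Wayback availability API (urls.txt and the 30-day window are made-up example values, not my actual setup):

```python
from datetime import datetime, timedelta, timezone
import time

import requests

WINDOW = timedelta(days=30)  # example freshness window


def archived_recently(url: str) -> bool:
    """Return True if the Wayback Machine has a snapshot newer than WINDOW."""
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={
            "url": url,
            "timestamp": datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S"),
        },
        timeout=60,
    )
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    if not closest:
        return False
    snapshot_time = datetime.strptime(closest["timestamp"], "%Y%m%d%H%M%S").replace(
        tzinfo=timezone.utc
    )
    return datetime.now(timezone.utc) - snapshot_time < WINDOW


# urls.txt: one URL per line (hypothetical file name)
with open("urls.txt") as f:
    for url in (line.strip() for line in f if line.strip()):
        if archived_recently(url):
            continue
        requests.get(f"https://web.archive.org/save/{url}", timeout=120)
        time.sleep(10)  # be polite; real captures with outlinks take much longer
```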
2 comments
@blog @cutterkom Thanks for sharing! You end with saying: “Github stops all running actions after six hours.” Do you mean that this is a problem for your project, or it’s just something to be generally aware of when others want to adapt your approach?
@tootbaack @blog It’s no problem for my project. I just restart the action, and I use a parameter that fetches websites only if they have not been archived in the last xx days or weeks. It’s something to be aware of, because, as you probably know, archiving stuff takes some time.