I recently took a deep dive into web site archival for friends who were worried about losing control over the hosting of their work online in the face of poor system administration or hostile removal. This makes web site archival an essential instrument in the toolbox of any system administrator. As it turns out, some sites are much harder to archive than others. This article goes through the process of archiving traditional web sites and shows how it falls short when confronted with the latest fashions in the single-page applications that are bloating the modern web.
The days of handcrafted HTML web sites are long gone. Now web sites are dynamic and built on the fly using the latest JavaScript, PHP, or Python framework. As a result, the sites are more fragile: a database crash, spurious upgrade, or unpatched vulnerability might lose data. In my previous life as a web developer, I had to come to terms with the idea that customers expect web sites to basically work forever. This expectation matches poorly with the "move fast and break things" attitude of web development. Working with the Drupal content-management system (CMS) was particularly challenging in that regard, as major upgrades deliberately break compatibility with third-party modules, which implies a costly upgrade process that clients could seldom afford. The solution was to archive those sites: take a living, dynamic web site and turn it into plain HTML files that any web server can serve forever. This process is useful for your own dynamic sites, but also for third-party sites that are outside of your control and that you might want to safeguard.
For simple or static sites, the venerable Wget program works well. The incantation to mirror a full web site, however, is byzantine:
$ nice wget --mirror --execute robots=off --no-verbose --convert-links \
    --backup-converted --page-requisites --adjust-extension \
    --base=./ --directory-prefix=./ --span-hosts \
    --domains=www.example.com,example.com http://www.example.com/
The above downloads the content of the web page, but also crawls everything within the specified domains. Before you run this against your favorite site, consider the impact such a crawl might have on the site. The above command line deliberately ignores robots.txt rules, as is now common practice for archivists, and hammers the web site as fast as it can. Most crawlers have options to pause between hits and limit bandwidth usage to avoid overwhelming the target site.
The above command will also fetch "page requisites" like style sheets (CSS), images, and scripts. The downloaded page contents are modified so that links point to the local copy as well. Any web server can host the resulting file set, which results in a static copy of the original web site.
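For readers who would rather not hammer a site, the politeness concerns mentioned above can be sketched in a few lines of Python. This is only an illustration built on the standard library's urllib.robotparser; the robots.txt content, the "example-archiver" agent name, and the helper functions are made up for the example and are not part of Wget or any tool discussed here.

```python
import urllib.robotparser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""


def build_policy(robots_text):
    """Parse robots.txt text into a reusable policy object."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed(parser, url, agent="example-archiver"):
    """Return True if the rules permit this (hypothetical) agent to fetch url."""
    return parser.can_fetch(agent, url)


def pause_for(parser, agent="example-archiver", default=1.0):
    """Seconds to wait between hits: the site's Crawl-delay, else a default."""
    delay = parser.crawl_delay(agent)
    return float(delay) if delay is not None else default


policy = build_policy(ROBOTS_TXT)
```

A crawler loop would call allowed() before each request and sleep for pause_for() seconds between requests. Whether to honor robots.txt at all remains the archivist's decision, as discussed above.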
That is, when things go well. Anyone who has ever worked with a computer knows that things seldom go according to plan; all sorts of things can make the procedure derail in interesting ways. For example, it was trendy for a while to have calendar blocks in web sites. A CMS would generate those on the fly and make crawlers go into an infinite loop trying to retrieve all of the pages. Crafty archivers can resort to regular expressions (e.g. Wget has a --reject-regex option) to ignore problematic resources. Another option, if the administration interface for the web site is accessible, is to disable calendars, login forms, comment forms, and other dynamic areas. Once the site becomes static, those will stop working anyway, so it makes sense to remove such clutter from the original site as well.
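The idea behind a reject pattern can be illustrated with a short Python sketch. The pattern below is an assumption chosen for the example (a /calendar/ path or month/year/date query parameters); a real site would need a pattern adapted to its own URL scheme, and this is not how Wget implements its option internally.

```python
import re

# Hypothetical pattern: skip calendar pages and date-navigation query strings,
# which can otherwise generate an endless supply of "new" URLs.
REJECT = re.compile(r"/calendar/|[?&](?:month|year|date)=")


def should_fetch(url):
    """Return True unless the URL matches the reject pattern."""
    return REJECT.search(url) is None


def filter_queue(urls):
    """Drop rejected URLs from a crawl queue, preserving order."""
    return [u for u in urls if should_fetch(u)]
```

For instance, filter_queue() would keep http://www.example.com/about but drop http://www.example.com/calendar/2018/09 and http://www.example.com/events?month=10.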
Unfortunately, some web sites are built with much more than pure HTML. In single-page sites, for example, the web browser builds the content itself by executing a small JavaScript program. A simple user agent like Wget will struggle to reconstruct a meaningful static copy of those sites as it does not support JavaScript at all. In theory, web sites should be using progressive enhancement to have content and functionality available without JavaScript but those directives are rarely followed, as anyone using plugins like NoScript or uMatrix will confirm.
Traditional archival methods sometimes fail in the dumbest way. When trying to build an offsite backup of a local newspaper (pamplemousse.ca), I found that WordPress adds query strings (e.g. ?ver=1.12.4) at the end of JavaScript includes. This confuses content-type detection in the web servers that serve the archive, which rely on the file extension to send the right Content-Type header. When such an archive is loaded in a web browser, it fails to load scripts, which breaks dynamic web sites.
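The failure mode is easy to reproduce in miniature. The following Python sketch mimics a web server that picks a Content-Type purely from the file extension; the extension table, the fallback type, and the function names are assumptions made for illustration, not the behavior of any particular server.

```python
import os

# A tiny, hypothetical extension table standing in for a web server's MIME map.
EXTENSION_TYPES = {
    ".js": "application/javascript",
    ".css": "text/css",
    ".html": "text/html",
}


def guess_content_type(filename):
    """Pick a Content-Type from the file extension only, as a static server might."""
    _, ext = os.path.splitext(filename)
    return EXTENSION_TYPES.get(ext.lower(), "application/octet-stream")


def strip_query_suffix(filename):
    """Remove a trailing '?query' part that a mirroring tool left in the file name."""
    return filename.split("?", 1)[0]
```

With a mirrored file named jquery.js?ver=1.12.4, the extension seen by guess_content_type() is ".4", so the lookup falls through to the generic fallback; after strip_query_suffix() the name ends in ".js" and the lookup succeeds.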
As the web moves toward using the browser as a virtual machine to run arbitrary code, archival methods relying on pure HTML parsing need to adapt. The solution for such problems is to record (and replay) the HTTP headers delivered by the server during the crawl and indeed professional archivists use just such an approach.
At the Internet Archive, Brewster Kahle and Mike Burner designed the ARC (for "ARChive") file format in 1996 to provide a way to aggregate the millions of small files produced by their archival efforts. The format was eventually standardized as the WARC ("Web ARChive") specification that was released as an ISO standard in 2009 and revised in 2017. The standardization effort was led by the International Internet Preservation Consortium (IIPC), which is an "international organization of libraries and other organizations established to coordinate efforts to preserve internet content for the future", according to Wikipedia; it includes members such as the US Library of Congress and the Internet Archive. The latter uses the WARC format internally in its Java-based Heritrix crawler.
A WARC file aggregates multiple resources like HTTP headers, file contents, and other metadata in a single compressed archive. Conveniently, Wget actually supports the file format with the --warc-file parameter. Unfortunately, web browsers cannot render WARC files directly, so a viewer or some conversion is necessary to access the archive. The simplest such viewer I have found is pywb, a Python package that runs a simple web server to offer a Wayback-Machine-like interface to browse the contents of WARC files. The following set of commands will render a WARC file on http://localhost:8080/:
$ pip install pywb
$ wb-manager init example
$ wb-manager add example crawl.warc.gz
$ wayback
This tool was, incidentally, built by the folks behind the Webrecorder service, which can use a web browser to save dynamic page contents.
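To make the earlier description of a WARC file more concrete, here is a deliberately simplified Python sketch that assembles a single "response" record by hand: a block of WARC headers, a blank line, then the raw HTTP response exactly as it came off the wire. The function name and the sample payload are invented for the example, and optional fields such as digests are left out; real tools should be preferred for producing archives meant to be kept.

```python
import gzip
import uuid
from datetime import datetime, timezone


def build_warc_response(target_uri, http_payload):
    """Assemble one simplified WARC 'response' record as bytes.

    http_payload is the raw HTTP response (status line, headers, body).
    """
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {timestamp}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # A record ends with two CRLF pairs after the content block.
    return head + http_payload + b"\r\n\r\n"


# Hypothetical captured response, including the HTTP headers the server sent.
http_payload = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<p>hello</p>"
)
record = build_warc_response("http://www.example.com/", http_payload)

# Records are commonly gzip-compressed one at a time and concatenated
# to form a .warc.gz file.
compressed = gzip.compress(record)
```

The point of the sketch is that the server's original HTTP headers travel with the content, which is what lets a replay tool send the right Content-Type regardless of file names.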
Unfortunately, pywb has trouble loading WARC files generated by Wget because Wget followed an inconsistency in the 1.0 specification, which was fixed in the 1.1 specification. Until Wget or pywb fix those problems, WARC files produced by Wget are not reliable enough for my uses, so I have looked at other alternatives. A crawler that got my attention is simply called crawl. Here is how it is invoked:
$ crawl https://example.com/
(It does say "very simple" in the README.) The program does support some command-line options, but most of its defaults are sane: it will fetch page requirements from other domains (unless the -exclude-related flag is used), but does not recurse out of the domain. By default, it fires up ten parallel connections to the remote site, a setting that can be changed with the -c flag. But, best of all, the resulting WARC files load perfectly in pywb.
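The two defaults described above, staying within the starting domain and capping the number of parallel connections, can be sketched generically in Python. This is not how crawl itself is implemented; the function names are hypothetical, and the fetch callable is injected so that the sketch makes no network requests on its own.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def same_host(url, seed):
    """Return True if url points at the same host as the seed URL."""
    return urlsplit(url).hostname == urlsplit(seed).hostname


def in_scope(urls, seed):
    """Keep only the links a crawler should recurse into."""
    return [u for u in urls if same_host(u, seed)]


def fetch_all(urls, fetch, workers=10):
    """Run fetch(url) over urls with at most `workers` requests in flight."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```

Lowering workers is the simplest way to be gentler on a small site, at the cost of a longer crawl.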
There are plenty more resources for using WARC files. In particular, there's a Wget drop-in replacement called Wpull that is specifically designed for archiving web sites. It has experimental support for PhantomJS and youtube-dl integration that should allow downloading more complex JavaScript sites and streaming multimedia, respectively. The software is the basis for an elaborate archival tool called ArchiveBot, which is used by the "loose collective of rogue archivists, programmers, writers and loudmouths" at ArchiveTeam in its struggle to "save the history before it's lost forever". It seems that PhantomJS integration does not work as well as the team wants, so ArchiveTeam also uses a rag-tag bunch of other tools to mirror more complex sites. For example, snscrape will crawl a social media profile to generate a list of pages to send into ArchiveBot. Another tool the team employs is crocoite, which uses the Chrome browser in headless mode to archive JavaScript-heavy sites.
This article would also not be complete without a nod to the HTTrack project, the "website copier". Working similarly to Wget, HTTrack creates local copies of remote web sites but unfortunately does not support WARC output. Its interactive aspects might be of more interest to novice users unfamiliar with the command line.
In the same vein, during my research I found a full rewrite of Wget called Wget2 that has support for multi-threaded operation, which might make it faster than its predecessor. It is missing some features from Wget, however, most notably reject patterns, WARC output, and FTP support, but it adds RSS, DNS caching, and improved TLS support.
Finally, my personal dream for these kinds of tools would be to have them integrated with my existing bookmark system. I currently keep interesting links in Wallabag, a self-hosted "read it later" service designed as a free-software alternative to Pocket (now owned by Mozilla). But Wallabag, by design, creates only a "readable" version of the article instead of a full copy. In some cases, the "readable version" is actually unreadable and Wallabag sometimes fails to parse the article. Instead, other tools like bookmark-archiver or reminiscence save a screenshot of the page along with full HTML but, unfortunately, no WARC file that would allow an even more faithful replay.
The sad truth of my experiences with mirrors and archival is that data dies. Fortunately, amateur archivists have tools at their disposal to keep interesting content alive online. For those who do not want to go through that trouble, the Internet Archive seems to be here to stay and Archive Team is obviously working on a backup of the Internet Archive itself.
Archiving web sites
Posted Sep 25, 2018 14:48 UTC (Tue) by anarcat (subscriber, #66354) [Link]
As usual, here's the list of issues and patches generated while researching this article: [...] ia commandline tool
Archiving web sites
Posted Oct 4, 2018 17:58 UTC (Thu) by anarcat (subscriber, #66354) [Link]
As it turns out, I couldn't stop working on this topic and opened two more PRs upstream after submitting WARC files to the Internet Archive: [...] ia documentation [...] ia [...]
The Pamplemousse crawl is now available on the Internet Archive; it might end up in the Wayback Machine at some point if the Archive curators think it is worth it.
Another example of a crawl is this archive of two Bloomberg articles which the "save page now" feature of the Internet Archive wasn't able to save correctly (but webrecorder.io could)! Those pages can be seen in the web recorder player to get a better feel of how faithful a WARC file really is.
Archiving web sites
Posted Sep 25, 2018 15:52 UTC (Tue) by bnewbold (subscriber, #72587) [Link]
Thanks for another well researched article! In particular, great to see Archive Team get more attention and love, they are a really impressive community (IMHO).
Two additional resources the Internet Archive has, which don't fit the personal and self-sufficient angle this article focuses on, are the "Save Page Now" feature/API on web.archive.org (which allows anybody to request that the archive crawlers immediately snapshot a single page plus embedded resources), and Brozzler, our new "headless browser" crawler (https://github.com/internetarchive/brozzler), which combines with warcprox (https://github.com/internetarchive/warcprox), a proxy that saves all HTTP(S) traffic to WARC format. The tide seems to be in the direction of using headless browsers over custom crawling tools (like Heritrix), though the cost is *significantly* higher, so it hasn't been used in non-profit crawling at the same scales yet (as far as I know). As some context there, it's my understanding that Google and other search crawlers have been using headless browsers for years (there is a narrative that this is why Chrome/Chromium is "so fast" and had so much sandboxing/security focus in the early days compared to other browsers).
The Archive also has an "archive as a service" offering with hundreds of institutional users, with full control over crawl prioritization and WARC export, but the cost is high for individual users (feels too self-promotional to link, but you can find it easily).
Archiving web sites
Posted Sep 25, 2018 21:21 UTC (Tue) by Kamilion (subscriber, #42576) [Link]
Archiving web sites
Posted Sep 26, 2018 17:09 UTC (Wed) by jond (subscriber, #37669) [Link]
It's great to see mention of your "as a service" offering, which is something I was looking at for something I'm involved in. I'm convinced that your software is a great fit for our needs, but as a small, volunteer-driven, non-profit group, we almost certainly can't afford the SaaS. Is there any chance you would consider open sourcing your service software?
Archiving web sites
Posted Sep 27, 2018 18:12 UTC (Thu) by bnewbold (subscriber, #72587) [Link]
Many of the major components (Heritrix, Brozzler, various Wayback replay tools, trough, warcprox, etc.) are free software, and we have been amenable to, e.g., licensing front-end JavaScript code. Many of us strongly believe in FLOSS principles, but we have limited resources and social capital to spend, and have tried to focus those on the highest impact changes.
I encourage you to keep asking though! In the meanwhile a lot of smaller groups hit our "save page now" API with a cron script to back up smaller websites (reasonable solution for up to a couple thousand URLs), which costs nothing and just takes (volunteer) time.
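A cron-driven script of the kind described here might look roughly like the Python sketch below. It is an illustration only: the URL form (https://web.archive.org/save/ followed by the page URL) is the one that appears in a later comment in this thread, the helper names are hypothetical, and the actual HTTP call is left to an injected request callable because the method and responses the service expects are not confirmed here.

```python
import time

# URL prefix as it appears elsewhere in this thread; treat it as an assumption.
SAVE_PREFIX = "https://web.archive.org/save/"


def save_url(page_url):
    """Build the 'save page now' URL for one page."""
    return SAVE_PREFIX + page_url


def submit_all(page_urls, request, pause=5.0):
    """Call request(save_url) for each page, pausing between submissions.

    `request` is supplied by the caller (for example a thin wrapper around an
    HTTP client), which keeps this sketch free of network code.
    """
    results = {}
    for page_url in page_urls:
        results[page_url] = request(save_url(page_url))
        time.sleep(pause)
    return results
```

The pause between submissions is there to stay polite toward a free service; a list of a few thousand URLs would simply take a few hours to work through.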
Archiving web sites
Posted Sep 27, 2018 18:22 UTC (Thu) by anarcat (subscriber, #66354) [Link]
I think you're being unfair to yourselves. :) Most, if not all, of the archive.org magic sauce is basically public. The hard work is connecting all the pieces together and making them work reliably, in the long term. That's your achievement, and it's amazing. I do encourage people to free their software for others to use, but I know that, in practice, it's not always meaningful or useful, especially for old codebases with who knows what inside... ;)
That said, "hitting save page now" is basically what I'm doing on my blog. I wrote a feed2exec plugin to ping the Wayback Machine when new content is posted on my site, and it has served me well.
But for larger operations, I hope my article shows that there are plenty of tools out there to build your own little internet archive. It might not have all the bells and whistles (multimedia support and library collections, for example), but you can get pretty far with ArchiveBot/crocoite/wpull and a viewer like pywb.
It really depends, after all, what you want to actually do: archive your own website? other websites? old software?
For the latter, by the way, a significant resource might also be the software heritage folks although they are primarily focused on source code...
Archiving web sites
Posted Oct 8, 2018 0:02 UTC (Mon) by pabs (subscriber, #43278) [Link]
$ HEAD https://web.archive.org/save/https://lwn.net/
404 Not Found
Connection: close
Date: Mon, 08 Oct 2018 00:00:44 GMT
Server: nginx/1.13.11
Content-Length: 170
Content-Type: text/html
Client-Date: Mon, 08 Oct 2018 00:00:44 GMT
Client-Response-Num: 1
Client-SSL-Cert-Issuer: /C=US/ST=Arizona/L=Scottsdale/O=GoDaddy.com, Inc./OU=http://certs.godaddy.com/repository//CN=Go Daddy Secure Certificate Authority - G2
Client-SSL-Cert-Subject: /OU=Domain Control Validated/CN=*.archive.org
Client-SSL-Cipher: ECDHE-RSA-AES128-GCM-SHA256
Client-SSL-Socket-Class: IO::Socket::SSL
Archiving web sites
Posted Oct 11, 2018 15:14 UTC (Thu) by anarcat (subscriber, #66354) [Link]
Archiving web sites
Posted Oct 12, 2018 3:31 UTC (Fri) by pabs (subscriber, #43278) [Link]
Archiving web sites
Posted Oct 24, 2018 21:08 UTC (Wed) by anarcat (subscriber, #66354) [Link]
Archiving web sites
Posted Nov 23, 2018 4:03 UTC (Fri) by nikisweeting (guest, #128789) [Link]
I'm the maintainer of Bookmark Archiver, mentioned near the end, and we'd love to add a Brozzler/warcprox system that can crawl pages using headless Chrome to replay JS and other user actions recorded with puppeteer during the browsing session.
The end goal is to have exactly the actions that I took when visiting a site replayable at a later date. I may even save the VM containing the binaries for a browser capable of replaying the archive once a year, so that the sites can be visited far, far in the future on any x86 compatible machine.
Archiving web sites
Posted Sep 25, 2018 16:48 UTC (Tue) by xxiao (guest, #9631) [Link]
Web scraping with Python can work reliably with static sites, but it has to use a headless browser or similar to crawl JavaScript sites, which is very challenging too.
Archiving web sites
Posted Sep 25, 2018 16:59 UTC (Tue) by anarcat (subscriber, #66354) [Link]
I'm sorry to hear that! :) The reason this is not covered more deeply is that the "WARC" approach worked for the site I tested against. It's true it might fail against single-page applications (SPA); for those, other tools are necessary. This is what I covered in the Future work and alternatives section. In particular, I would use the crocoite project for JavaScript-heavy sites, and it's what ArchiveTeam use for their "chromebot". It uses Chrome in headless mode to browse the page, but I haven't tested it. It does seem to work for them however...
Just use Selenium
Posted Sep 26, 2018 8:51 UTC (Wed) by zoobab (subscriber, #9945) [Link]
Not to complain about the Internet Archive, but I run the simplest static website exposing some directories with lighttpd:
https://web.archive.org/web/20170615082215/http://filez.z...
Just click on "allwinner": it basically did not manage to maintain the links.
Just use Selenium
Posted Sep 26, 2018 10:10 UTC (Wed) by tlamp (subscriber, #108540) [Link]
That's just one part of the equation... You still need tooling to crawl and save the whole page (and its internal links) in a future-accessible offline format, like WARC.
I could imagine that Selenium could be used as driver/backend for one of the projects mentioned, though.
Just use Selenium
Posted Sep 26, 2018 11:33 UTC (Wed) by zoobab (subscriber, #9945) [Link]
I found a way to pilot it with the REPL+telnet; I still have to document it. And this plugin rewrites the links as well.
scrapbook
Posted Oct 3, 2018 22:44 UTC (Wed) by debacle (subscriber, #7114) [Link]
(*) This is: $ echo firefox-esr hold | sudo dpkg --set-selections
scrapbook
Posted Oct 4, 2018 1:44 UTC (Thu) by pabs (subscriber, #43278) [Link]
scrapbook
Posted Oct 4, 2018 9:15 UTC (Thu) by debacle (subscriber, #7114) [Link]
scrapbook
Posted Oct 4, 2018 13:50 UTC (Thu) by pabs (subscriber, #43278) [Link]
https://bonedaddy.net/pabs3/log/2018/09/08/webextocalypse/
https://github.com/tahama/scrapbookq/blob/master/src/mani...
/usr/share/mozilla/extensions/{ec8030f7-c20a-464f-9b0e-13a3a9e97384}/tahama@163.com -> git repo
Just use Selenium
Posted Sep 27, 2018 2:11 UTC (Thu) by anarcat (subscriber, #66354) [Link]
Not to complain about the Internet Archive, I run the simplest static website exposing some directories with lighttpd. [...] Just click on "allwinner", it basically did not manage to maintain the links.
That's a bug, I guess. Note that if you append a slash, the links work fine. But when you crawl down, it's true that some files are missing.
I counter that there's a better way to archive those files on archive.org: you could upload those to the Internet Archive software collection. That way there would be meaningful, semantic data associated with those things instead of just an opaque directory listing.
That is more work, of course... Which is why I packaged that commandline tool (because their web UI is hellish). ;)
Archiving web sites
Posted Oct 1, 2018 17:09 UTC (Mon) by ceplm (subscriber, #41334) [Link]
Archiving web sites
Posted Oct 1, 2018 17:24 UTC (Mon) by anarcat (subscriber, #66354) [Link]
(There's something to be said about the sustainability of browser add-ons here, but I'll stay on topic.. ;)
Archiving web sites
Posted Oct 2, 2018 9:19 UTC (Tue) by ceplm (subscriber, #41334) [Link]
2. hideous UI prepared for Internet Archive, but not for humans. Search? really?
I actually don't hate the fact that KDE WAR and MAFF were just a file storing the web page in question, and I would like to store that one page somewhere in my regular files like any other document.
So, if pywb had just a command display with one parameter [directory] which would open as http://localhost:8080 with a list of all archives in the directory and its subdirectories, and just by simple clicking on that name of archive one would open saved page. Nothing more.
Archiving web sites
Posted Oct 2, 2018 15:03 UTC (Tue) by anarcat (subscriber, #66354) [Link]
pywb should now work with wget warcs + other options
Posted Oct 11, 2018 0:38 UTC (Thu) by ikreymer (guest, #127798) [Link]
Hi,
Thanks for mentioning pywb and Webrecorder. I wanted to mention that we've released an updated version of our warcio library, which is used by pywb, and it should now be able to handle WARCs created by wget. Thanks for bringing attention to this issue!
If you've already installed pywb, you can update the warcio library by running pip install -U warcio to get the latest.
I also wanted to mention a few other options for creating WARCs using warcio and pywb:
from warcio.capture_http import capture_http
import requests  # import requests after capture_http so its traffic can be captured

with capture_http('example.warc.gz'):
    # request all urls to be loaded
    requests.get('https://example.com/')
    requests.get('https://google.com/')
pywb has a built-in 'record' mode that allows you to record directly into a pywb collection by, for example, browsing http://localhost:8080/example/record/http://example.com/. You can enable this by running wayback --live --record
More info on this in the pywb docs
pywb also supports proxy mode recording and allows you to set pywb as the HTTP and HTTPS proxy for your browser. You can record directly into a pywb collection, and content recorded in this way is more likely to work when replayed in pywb.
Copyright © 2018, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds