
Archiving web sites


September 25, 2018

This article was contributed by Antoine Beaupré

I recently took a deep dive into web site archival for friends who were worried about losing control over the hosting of their work online in the face of poor system administration or hostile removal. This makes web site archival an essential instrument in the toolbox of any system administrator. As it turns out, some sites are much harder to archive than others. This article goes through the process of archiving traditional web sites and shows how it falls short when confronted with the latest fashions in the single-page applications that are bloating the modern web.

Converting simple sites

The days of handcrafted HTML web sites are long gone. Now web sites are dynamic and built on the fly using the latest JavaScript, PHP, or Python framework. As a result, the sites are more fragile: a database crash, spurious upgrade, or unpatched vulnerability might lose data. In my previous life as a web developer, I had to come to terms with the idea that customers expect web sites to basically work forever. This expectation matches poorly with the "move fast and break things" attitude of web development. Working with the Drupal content-management system (CMS) was particularly challenging in that regard, as major upgrades deliberately break compatibility with third-party modules, which implies a costly upgrade process that clients could seldom afford. The solution was to archive those sites: take a living, dynamic web site and turn it into plain HTML files that any web server can serve forever. This process is useful for your own dynamic sites, but also for third-party sites that are outside of your control and that you might want to safeguard.

For simple or static sites, the venerable Wget program works well. The incantation to mirror a full web site, however, is byzantine:

    $ nice wget --mirror --execute robots=off --no-verbose --convert-links \
                --backup-converted --page-requisites --adjust-extension \
                --base=./ --directory-prefix=./ --span-hosts \
                --domains=www.example.com,example.com http://www.example.com/

The above downloads the content of the web page, but also crawls everything within the specified domains. Before you run this against your favorite site, consider the impact such a crawl might have on the site. The above command line deliberately ignores robots.txt rules, as is now common practice for archivists, and hammers the web site as fast as it can. Most crawlers have options to pause between hits and limit bandwidth usage to avoid overwhelming the target site.
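
Most of that throttling is available in Wget itself; a more polite variant of the above crawl might look something like this (the exact delay and rate limit are, of course, judgment calls):

    $ nice wget --mirror --execute robots=off --no-verbose --convert-links \
                --backup-converted --page-requisites --adjust-extension \
                --base=./ --directory-prefix=./ --span-hosts \
                --wait=1 --random-wait --limit-rate=200k \
                --domains=www.example.com,example.com http://www.example.com/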

The above command will also fetch "page requisites" like style sheets (CSS), images, and scripts. The downloaded page contents are modified so that links point to the local copy as well. Any web server can host the resulting file set, which results in a static copy of the original web site.
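
To sanity-check the result locally before publishing it, any static server will do; for example, Python's built-in one (the directory name is whatever Wget created, assumed here to be www.example.com):

    $ cd www.example.com/
    $ python3 -m http.server 8000

Then browse http://localhost:8000/ and verify that pages and their requisites load from the local copy.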

That is, when things go well. Anyone who has ever worked with a computer knows that things seldom go according to plan; all sorts of things can make the procedure derail in interesting ways. For example, it was trendy for a while to have calendar blocks in web sites. A CMS would generate those on the fly and make crawlers go into an infinite loop trying to retrieve all of the pages. Crafty archivers can resort to regular expressions (e.g. Wget has a --reject-regex option) to ignore problematic resources. Another option, if the administration interface for the web site is accessible, is to disable calendars, login forms, comment forms, and other dynamic areas. Once the site becomes static, those will stop working anyway, so it makes sense to remove such clutter from the original site as well.
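
As a concrete sketch, a crawl that skips calendar pages and login or comment forms could add a pattern like the following to the Wget invocation above (the regular expression is entirely site-specific and shown only as an illustration):

    $ nice wget --mirror --execute robots=off --page-requisites --adjust-extension \
                --reject-regex '(/calendar/|/user/login|/comment/reply)' \
                http://www.example.com/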

JavaScript doom

Unfortunately, some web sites are built with much more than pure HTML. In single-page sites, for example, the web browser builds the content itself by executing a small JavaScript program. A simple user agent like Wget will struggle to reconstruct a meaningful static copy of those sites as it does not support JavaScript at all. In theory, web sites should be using progressive enhancement to have content and functionality available without JavaScript, but those guidelines are rarely followed, as anyone using plugins like NoScript or uMatrix will confirm.

Traditional archival methods sometimes fail in the dumbest way. When trying to build an offsite backup of a local newspaper (pamplemousse.ca), I found that WordPress adds query strings (e.g. ?ver=1.12.4) at the end of JavaScript includes. This confuses content-type detection in the web servers that serve the archive, which rely on the file extension to send the right Content-Type header. When such an archive is loaded in a web browser, it fails to load scripts, which breaks dynamic websites.
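
To make the failure mode concrete: the mirrored script keeps the query string as part of its file name (the path below is hypothetical), so a server that maps MIME types from file extensions no longer sees a ".js" file at all:

    $ ls wp-includes/js/jquery/
    jquery.js?ver=1.12.4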

As the web moves toward using the browser as a virtual machine to run arbitrary code, archival methods relying on pure HTML parsing need to adapt. The solution for such problems is to record (and replay) the HTTP headers delivered by the server during the crawl and indeed professional archivists use just such an approach.

Creating and displaying WARC files

At the Internet Archive, Brewster Kahle and Mike Burner designed the ARC (for "ARChive") file format in 1996 to provide a way to aggregate the millions of small files produced by their archival efforts. The format was eventually standardized as the WARC ("Web ARChive") specification that was released as an ISO standard in 2009 and revised in 2017. The standardization effort was led by the International Internet Preservation Consortium (IIPC), which is an "international organization of libraries and other organizations established to coordinate efforts to preserve internet content for the future", according to Wikipedia; it includes members such as the US Library of Congress and the Internet Archive. The latter uses the WARC format internally in its Java-based Heritrix crawler.

A WARC file aggregates multiple resources like HTTP headers, file contents, and other metadata in a single compressed archive. Conveniently, Wget actually supports the file format with the --warc-file parameter. Unfortunately, web browsers cannot render WARC files directly, so a viewer or some conversion is necessary to access the archive. The simplest such viewer I have found is pywb, a Python package that runs a simple web server to offer a Wayback-Machine-like interface to browse the contents of WARC files. The following set of commands will render a WARC file on http://localhost:8080/:

    $ pip install pywb
    $ wb-manager init example
    $ wb-manager add example crawl.warc.gz
    $ wayback

This tool was, incidentally, built by the folks behind the Webrecorder service, which can use a web browser to save dynamic page contents.
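
For reference, the crawl.warc.gz loaded above can be produced by Wget itself; a minimal sketch might look like this (--warc-file writes a compressed WARC alongside the usual mirror):

    $ wget --mirror --page-requisites --adjust-extension \
           --warc-file=crawl http://www.example.com/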

Unfortunately, pywb has trouble loading WARC files generated by Wget because of an inconsistency in the 1.0 specification, which was fixed in the 1.1 specification. Until Wget or pywb fixes those problems, WARC files produced by Wget are not reliable enough for my uses, so I have looked at other alternatives. A crawler that got my attention is simply called crawl. Here is how it is invoked:

    $ crawl https://example.com/

(It does say "very simple" in the README.) The program does support some command-line options, but most of its defaults are sane: it will fetch page requisites from other domains (unless the -exclude-related flag is used), but does not recurse out of the domain. By default, it fires up ten parallel connections to the remote site, a setting that can be changed with the -c flag. But, best of all, the resulting WARC files load perfectly in pywb.
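
A slightly more conservative invocation, using only the flags mentioned above, might look like this (two parallel connections instead of ten, and no off-domain page requisites):

    $ crawl -c 2 -exclude-related https://example.com/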

Future work and alternatives

There are plenty more resources for using WARC files. In particular, there's a Wget drop-in replacement called Wpull that is specifically designed for archiving web sites. It has experimental support for PhantomJS and youtube-dl integration that should allow downloading more complex JavaScript sites and streaming multimedia, respectively. The software is the basis for an elaborate archival tool called ArchiveBot, which is used by the "loose collective of rogue archivists, programmers, writers and loudmouths" at ArchiveTeam in its struggle to "save the history before it's lost forever". It seems that PhantomJS integration does not work as well as the team wants, so ArchiveTeam also uses a rag-tag bunch of other tools to mirror more complex sites. For example, snscrape will crawl a social media profile to generate a list of pages to send into ArchiveBot. Another tool the team employs is crocoite, which uses the Chrome browser in headless mode to archive JavaScript-heavy sites.
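
As a rough sketch of what a basic Wpull crawl might look like (only the Wget-compatible basics plus WARC output are shown; check wpull --help for the full and current set of options):

    $ pip install wpull
    $ wpull --recursive --page-requisites --warc-file example \
            https://example.com/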

This article would also not be complete without a nod to the HTTrack project, the "website copier". Working similarly to Wget, HTTrack creates local copies of remote web sites but unfortunately does not support WARC output. Its interactive aspects might be of more interest to novice users unfamiliar with the command line.

In the same vein, during my research I found a full rewrite of Wget called Wget2 that has support for multi-threaded operation, which might make it faster than its predecessor. It is missing some features from Wget, however, most notably reject patterns, WARC output, and FTP support, but adds RSS, DNS caching, and improved TLS support.

Finally, my personal dream for these kinds of tools would be to have them integrated with my existing bookmark system. I currently keep interesting links in Wallabag, a self-hosted "read it later" service designed as a free-software alternative to Pocket (now owned by Mozilla). But Wallabag, by design, creates only a "readable" version of the article instead of a full copy. In some cases, the "readable version" is actually unreadable and Wallabag sometimes fails to parse the article. Instead, other tools like bookmark-archiver or reminiscence save a screenshot of the page along with full HTML but, unfortunately, no WARC file that would allow an even more faithful replay.

The sad truth of my experiences with mirrors and archival is that data dies. Fortunately, amateur archivists have tools at their disposal to keep interesting content alive online. For those who do not want to go through that trouble, the Internet Archive seems to be here to stay and Archive Team is obviously working on a backup of the Internet Archive itself.



Archiving web sites

Posted Sep 25, 2018 14:48 UTC (Tue) by anarcat (subscriber, #66354) [Link]

As usual, here's the list of issues and patches generated while researching this article. I also want to personally thank the folks in the #archivebot channel for their assistance and for letting me play with their toys.

Archiving web sites

Posted Oct 4, 2018 17:58 UTC (Thu) by anarcat (subscriber, #66354) [Link]

As it turns out, I couldn't stop working on this topic and opened two more PRs upstream after submitting WARC files to the Internet Archive.

The Pamplemousse crawl is now available on the Internet Archive; it might end up in the Wayback Machine at some point if the Archive curators think it is worth it.

Another example of a crawl is this archive of two Bloomberg articles which the "save page now" feature of the Internet Archive wasn't able to save correctly (but webrecorder.io could!). Those pages can be seen in the Webrecorder player to get a better feel for how faithful a WARC file really is.

Archiving web sites

Posted Sep 25, 2018 15:52 UTC (Tue) by bnewbold (subscriber, #72587) [Link]

Full disclosure: Internet Archive staff

Thanks for another well researched article! In particular, great to see Archive Team get more attention and love, they are a really impressive community (IMHO).

Two additional resources the Internet Archive has, which don't fit the personal and self-sufficient angle this article focuses on, are the "Save Page Now" feature/API on web.archive.org (which allows anybody to request that the archive crawlers immediately snapshot a single page plus embedded resources), and Brozzler, our new "headless browser" crawler (https://github.com/internetarchive/brozzler), which combines with warcprox (https://github.com/internetarchive/warcprox), a proxy that saves all HTTP(S) traffic to WARC format. The tide seems to be in the direction of using headless browsers over custom crawling tools (like Heritrix), though the cost is *significantly* more, so it hasn't been used in non-profit crawling at the same scales yet (as far as I know). As some context there, it's my understanding that Google and other search crawlers have been using headless browsers for years (there is a narrative that this is why Chrome/Chromium is "so fast" and had so much sandboxing/security focus in the early days compared to other browsers).
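
For the curious, a minimal warcprox session might look something like the sketch below; it assumes warcprox's default listening port of 8000, and the -k is needed because warcprox intercepts TLS with its own certificate authority (check warcprox --help for where the WARC output lands and how to configure it):

    $ pip install warcprox
    $ warcprox &
    $ curl -sk -x http://localhost:8000/ https://example.com/ > /dev/null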

The Archive also has an "archive as a service" offering with hundreds of institutional users, with full control over crawl prioritization and WARC export, but the cost is high for individual users (feels too self-promotional to link, but you can find it easily).

Archiving web sites

Posted Sep 25, 2018 21:21 UTC (Tue) by Kamilion (subscriber, #42576) [Link]

Thanks, warcprox was just what I've been seeking for years.
Kinda sick of looking through my browser history trying to find a keyword, and getting ridiculously stupid results.

Archiving web sites

Posted Sep 26, 2018 17:09 UTC (Wed) by jond (subscriber, #37669) [Link]

It's great to see mention of your "as a service" offering, which is something I was looking at for something I'm involved in. I'm convinced that your software is a great fit for our needs, but as a small, volunteer-driven, non-profit group, we almost certainly can't afford the SaaS. Is there any chance you would consider open sourcing your service software?

Archiving web sites

Posted Sep 27, 2018 18:12 UTC (Thu) by bnewbold (subscriber, #72587) [Link]

Bluntly, I don't think it is a priority or likely to happen, as is also the case with our general purpose storage infrastructure behind archive.org ("Petabox", which Wayback and most of our other services are built on top of).

Many of the major components (Heritrix, Brozzler, various Wayback replay tools, trough, warcprox, etc) are free software, and we have been amenable to, e.g., licensing front-end JavaScript code. Many of us strongly believe in FLOSS principles, but we have limited resources and social capital to spend, and have tried to focus those on the highest impact changes.

I encourage you to keep asking though! In the meantime, a lot of smaller groups hit our "save page now" API with a cron script to back up smaller web sites (a reasonable solution for up to a couple thousand URLs), which costs nothing and just takes (volunteer) time.
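
A bare-bones version of such a cron job might be a small shell script like this sketch, which simply requests the /save/ endpoint for each URL in a list (the file name and delay are illustrative):

    #!/bin/sh
    # submit each URL in urls.txt to the Wayback Machine's "save page now" endpoint
    while read -r url; do
        curl -s "https://web.archive.org/save/$url" > /dev/null
        sleep 5
    done < urls.txt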

Archiving web sites

Posted Sep 27, 2018 18:22 UTC (Thu) by anarcat (subscriber, #66354) [Link]

I think you're being unfair to yourselves. :) Most, if not all, of the archive.org magic sauce is basically public. The hard work is connecting all the pieces together and making them work reliably over the long term. That's your achievement, and it's amazing. I do encourage people to free their software for others to use, but I know that, in practice, it's not always meaningful or useful, especially for old codebases with who knows what inside... ;)

That said, "hitting save page now" is basically what I'm doing on my blog. I wrote a feed2exec plugin to ping the wayback machine when new content is posted on my site, and it has served me well.

But for larger operations, I hope my article shows that there are plenty of tools out there to build your own little internet archive. It might not have all the bells and whistles (multimedia support and library collections, for example), but you can get pretty far with ArchiveBot/crocoite/wpull and a viewer like pywb.

It really depends, after all, on what you actually want to do: archive your own website? other websites? old software?

For the latter, by the way, a significant resource might also be the Software Heritage folks, although they are primarily focused on source code...

Archiving web sites

Posted Oct 8, 2018 0:02 UTC (Mon) by pabs (subscriber, #43278) [Link]

I noticed that the "Save Page Now" API of archive.org stopped working with HEAD requests but still works with GET requests. Any idea where I should report this to?

$ HEAD https://web.archive.org/save/https://lwn.net/
404 Not Found
Connection: close
Date: Mon, 08 Oct 2018 00:00:44 GMT
Server: nginx/1.13.11
Content-Length: 170
Content-Type: text/html
Client-Date: Mon, 08 Oct 2018 00:00:44 GMT
Client-Response-Num: 1
Client-SSL-Cert-Issuer: /C=US/ST=Arizona/L=Scottsdale/O=GoDaddy.com, Inc./OU=http://certs.godaddy.com/repository//CN=Go Daddy Secure Certificate Authority - G2
Client-SSL-Cert-Subject: /OU=Domain Control Validated/CN=*.archive.org
Client-SSL-Cipher: ECDHE-RSA-AES128-GCM-SHA256
Client-SSL-Socket-Class: IO::Socket::SSL

Archiving web sites

Posted Oct 11, 2018 15:14 UTC (Thu) by anarcat (subscriber, #66354) [Link]

I wrote info at a.org about this. This started failing for me as well, on October 9th 21h00 UTC-4.

Archiving web sites

Posted Oct 12, 2018 3:31 UTC (Fri) by pabs (subscriber, #43278) [Link]

Me too, but did not get any reply. I assume like most such contact addresses it is overwhelmed with a deluge of questions and spam. It would probably be easier to get in touch with IA via someone who works there.

Archiving web sites

Posted Oct 24, 2018 21:08 UTC (Wed) by anarcat (subscriber, #66354) [Link]

I've just tried exactly that as well.

Archiving web sites

Posted Nov 23, 2018 4:03 UTC (Fri) by nikisweeting (guest, #128789) [Link]

Hi! Thanks for mentioning Brozzler, that's something I've been looking for for a while as well.

I'm the maintainer of Bookmark Archiver, mentioned near the end, and we'd love to add a Brozzler/warcprox system that can crawl pages using headless Chrome to replay JS and other user actions recorded with Puppeteer during the browsing session.

The end goal is to have exactly the actions that I took when visiting a site replayable at a later date. I may even save the VM containing the binaries for a browser capable of replaying the archive once a year, so that the sites can be visited far, far in the future on any x86 compatible machine.

Archiving web sites

Posted Sep 25, 2018 16:48 UTC (Tue) by xxiao (guest, #9631) [Link]

It is still unclear to me how to archive sites that rely heavily on JavaScript to render pages, e.g. SPA sites. Will wget --warc be good enough, or is some headless Chrome solution needed? We all know wget --mirror is good at archiving static sites; after reading this article, I still do not know the best approach to archiving "JavaScript sites".

Web scraping with Python can work reliably with static sites, but has to use a headless browser or similar to crawl JavaScript sites, which is very challenging too.

Archiving web sites

Posted Sep 25, 2018 16:59 UTC (Tue) by anarcat (subscriber, #66354) [Link]

I'm sorry to hear that! :) The reason this is not covered more deeply is that the "WARC" approach worked for the site I tested against. It's true that it might fail against single-page applications (SPAs); for those, other tools are necessary. This is what I covered in the Future work and alternatives section. In particular, I would use the crocoite project for JavaScript-heavy sites, and it's what ArchiveTeam uses for its "chromebot". It uses Chrome in headless mode to browse the page, but I haven't tested it. It does seem to work for them, however...

Just use Selenium

Posted Sep 26, 2018 8:51 UTC (Wed) by zoobab (subscriber, #9945) [Link]

Just use Selenium, which spawns a real browser; curl/wget don't pass most web sites, as we unfortunately have JavaScript nowadays.

Not to complain about the Internet Archive, but I run the simplest possible static web site, exposing some directories with lighttpd:

https://web.archive.org/web/20170615082215/http://filez.z...

Just click on "allwinner"; it basically did not manage to maintain the links.

Just use Selenium

Posted Sep 26, 2018 10:10 UTC (Wed) by tlamp (subscriber, #108540) [Link]

> Just use Selenium, which spawns a real browser; curl/wget don't pass most web sites, as we unfortunately have JavaScript nowadays.

That's just one part of the equation... You still need tooling to crawl and save the whole page (and its internal links) in a future-accessible offline format like WARC.

I could imagine that Selenium could be used as driver/backend for one of the projects mentioned, though.

Just use Selenium

Posted Sep 26, 2018 11:33 UTC (Wed) by zoobab (subscriber, #9945) [Link]

In Firefox, I used the scrapbook plugin to do that, with a configurable depth level for the links (like saving all the pages linked by this page).

I found a way to pilot it with the REPL plus telnet; I still have to document it. And this plugin rewrites the links as well.

scrapbook

Posted Oct 3, 2018 22:44 UTC (Wed) by debacle (subscriber, #7114) [Link]

I use scrapbook heavily, but now I have had to set firefox 52.9.0esr-1 on hold (*), because newer versions (e.g. 60.2.1esr-1) do not support it anymore. There is a different plugin, "scrapbookq", for newer Firefox versions, but I have not tried it yet.

(*) This is: $ echo firefox-esr hold | sudo dpkg --set-selections

scrapbook

Posted Oct 4, 2018 1:44 UTC (Thu) by pabs (subscriber, #43278) [Link]

That version of Firefox has a number of security issues; it would be a good idea to upgrade even if you have to drop scrapbook usage briefly.

scrapbook

Posted Oct 4, 2018 9:15 UTC (Thu) by debacle (subscriber, #7114) [Link]

You are right, Paul, but for some years now a lot of my "web workflow" has depended on scrapbook. I use scrapbook more than bookmarks, for example. As always, convenience wins over security, and I will use the old firefox until either scrapbookq is in Debian or there is another good solution. I will probably move my firefox usage into some sort of container, though.

scrapbook

Posted Oct 4, 2018 13:50 UTC (Thu) by pabs (subscriber, #43278) [Link]

If you aren't willing to package it yourself and do not want to use the version from the Mozilla add-on site, you could run it from a git checkout using the mechanism I discovered that Mozilla still supports. Basically, the extension manifest.json needs a gecko item in it (scrapbookq already has one) and you should create a symlink, named after the gecko id, from the extensions directory to the git repository.

https://bonedaddy.net/pabs3/log/2018/09/08/webextocalypse/
https://github.com/tahama/scrapbookq/blob/master/src/mani...
/usr/share/mozilla/extensions/{ec8030f7-c20a-464f-9b0e-13a3a9e97384}/tahama@163.com -> git repo
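
Concretely, that might look like the following (the checkout location is arbitrary; the gecko id and extensions path are the ones given above):

    $ git clone https://github.com/tahama/scrapbookq ~/src/scrapbookq
    $ sudo ln -s ~/src/scrapbookq/src \
        "/usr/share/mozilla/extensions/{ec8030f7-c20a-464f-9b0e-13a3a9e97384}/tahama@163.com"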

Just use Selenium

Posted Sep 27, 2018 2:11 UTC (Thu) by anarcat (subscriber, #66354) [Link]

> Not to complain about the Internet Archive, but I run the simplest possible static web site, exposing some directories with lighttpd. [...] Just click on "allwinner"; it basically did not manage to maintain the links.

That's a bug, I guess. Note that if you append a slash, the links work fine. But when you crawl down, it's true that some files are missing.

I counter that there's a better way to archive those files on archive.org: you could upload those to the Internet archive software collection. That way there would be meaningful, semantic data associated with those things instead of just an opaque directory listing.

That is more work, of course... Which is why I packaged that commandline tool (because their web UI is hellish). ;)

Archiving web sites

Posted Oct 1, 2018 17:09 UTC (Mon) by ceplm (subscriber, #41334) [Link]

Archiving web pages is one of the most frustrating experiences I have had on the web. One would expect some stability from archiving programs, but I already had to write https://gitlab.com/mcepl/war2maff only to find that MAFF (https://addons.mozilla.org/en-US/firefox/addon/mozilla-ar...) is now dead as well. What is the next format I should convert my archived web pages to?

Archiving web sites

Posted Oct 1, 2018 17:24 UTC (Mon) by anarcat (subscriber, #66354) [Link]

What's wrong with WARC files, which I mentioned in the article? :)

(There's something to be said about the sustainability of browser add-ons here, but I'll stay on topic.. ;)

Archiving web sites

Posted Oct 2, 2018 9:19 UTC (Tue) by ceplm (subscriber, #41334) [Link]

1. super overkill (special directories, what the heck is a collection and why should I care, etc.)

2. hideous UI prepared for Internet Archive, but not for humans. Search? really?

I actually don't hate the fact that KDE WAR and MAFF were just a file storing the web page in question, and I would like to store that one page somewhere in my regular files like any other document.

So, what I would like is for pywb to have just a display command with one parameter, [directory], which would open http://localhost:8080 with a list of all archives in the directory and its subdirectories; simply clicking on the name of an archive would open the saved page. Nothing more.

Archiving web sites

Posted Oct 2, 2018 15:03 UTC (Tue) by anarcat (subscriber, #66354) [Link]

sounds like a great feature request. :)

pywb should now work with wget warcs + other options

Posted Oct 11, 2018 0:38 UTC (Thu) by ikreymer (guest, #127798) [Link]

Hi,

Thanks for mentioning pywb and Webrecorder. I wanted to mention that we've released an updated version of our warcio library, which is used by pywb, and it should now be able to handle WARCs created by Wget. Thanks for bringing attention to this issue!

If you've already installed pywb, you can update the warcio library by running pip install -U warcio to get the latest.

I also wanted to mention a few other options for creating WARCs using warcio and pywb:

  1. Using Python and the latest warcio, you can now create a WARC with a few lines of Python:
    from warcio.capture_http import capture_http
    import requests

    with capture_http('example.warc.gz'):
        # every request made inside this block is written to the WARC
        requests.get('https://example.com/')
        requests.get('https://google.com/')

  2. pywb has a built-in 'record' mode that allows you to record directly into a pywb collection by, for example, browsing http://localhost:8080/example/record/http://example.com/. You can enable this by running wayback --live --record
    More info on this in the pywb docs

  3. pywb also supports proxy mode recording, which allows you to set pywb as the HTTP and HTTPS proxy for your browser. You can record directly into a pywb collection, and content recorded in this way is more likely to work when replayed in pywb.

-Ilya (Webrecorder Lead Dev)

