LWN.net is a subscriber-supported publication; we rely on subscribers to keep the entire operation going. Please help out by buying a subscription and keeping LWN on the net.
This article was contributed by Antoine Beaupré
I recently took a deep dive into web site archival for friends who were worried about losing control over the hosting of their work online in the face of poor system administration or hostile removal. This makes web site archival an essential instrument in the toolbox of any system administrator. As it turns out, some sites are much harder to archive than others. This article goes through the process of archiving traditional web sites and shows how it falls short when confronted with the latest fashions in the single-page applications that are bloating the modern web.
Converting simple sites
For simple or static sites, the venerable Wget program works well. The incantation to mirror a full web site, however, is byzantine:
$ nice wget --mirror --execute robots=off --no-verbose --convert-links --backup-converted --page-requisites --adjust-extension --base=./ --directory-prefix=./ --span-hosts --domains=www.example.com,example.com http://www.example.com/
The above downloads the content of the web page, but also crawls everything within the specified domains. Before you run this against your favorite site, consider the impact such a crawl might have on the site. The above command line deliberately ignores robots.txt rules, as is now common practice for archivists, and hammers the web site as fast as it can. Most crawlers have options to pause between hits and limit bandwidth usage to avoid overwhelming the target site.
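As a hedged sketch of that gentler approach, the wrapper below adds pauses and a bandwidth cap to any Wget invocation. The function name `polite_wget` and the specific values (one second, 200 kilobytes per second) are assumptions chosen for illustration, not settings from the original command; `--wait`, `--random-wait`, and `--limit-rate` are standard Wget options.

```shell
# Hypothetical wrapper: the name and the values are assumptions for illustration.
polite_wget() {
    # --wait=1         pause about one second between requests
    # --random-wait    vary that pause so requests are not evenly spaced
    # --limit-rate     cap download bandwidth (here 200 kilobytes per second)
    nice wget --wait=1 --random-wait --limit-rate=200k "$@"
}

# Example use (not run here):
#   polite_wget --mirror --convert-links --page-requisites http://www.example.com/
```

Any other Wget options, such as those of the mirroring command, can be passed through unchanged; suitable values depend on how much load the target site can take.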
The mirroring command above will also fetch “page requisites” like style sheets (CSS), images, and scripts. The downloaded page contents are modified so that links point to the local copy as well. Any web server can host the resulting file set, which results in a static copy of the original web site.
That is, when things go well. Anyone who has ever worked with a computer knows that things seldom go according to plan; all sorts of things can make the procedure derail in interesting ways. For example, it was trendy for a while to have calendar blocks in web sites. A CMS would generate those on the fly and make crawlers go into an infinite loop trying to retrieve all of the pages. Crafty archivers can resort to regular expressions (e.g. Wget has a --reject-regex option) to ignore problematic resources. Another option, if the administration interface for the web site is accessible, is to disable calendars, login forms, comment forms, and other dynamic areas. Once the site becomes static, those will stop working anyway, so it makes sense to remove such clutter from the original site as well.
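One way to gain confidence in a reject pattern before a long crawl is to try it on a handful of URLs first. The sketch below is only an illustration under stated assumptions: the `/calendar/` and `?month=` URL shapes are hypothetical, and `grep -E` merely approximates Wget's matching (Wget uses POSIX regular expressions by default and tests the pattern against the complete URL).

```shell
# Hypothetical pattern: calendar paths and month-based query strings.
reject='/calendar/|[?&]month='

# Read URLs on standard input and print only those the pattern would keep.
kept_urls() {
    grep -E -v -e "$reject"
}

printf '%s\n' \
    'http://www.example.com/about' \
    'http://www.example.com/calendar/2018/03' \
    'http://www.example.com/events?month=4' |
    kept_urls
# prints: http://www.example.com/about

# The same pattern could then be handed to Wget (not run here):
#   wget --mirror --reject-regex="$reject" http://www.example.com/
```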
As the web moves toward using the browser as a virtual machine to run arbitrary code, archival methods relying on pure HTML parsing need to adapt. The solution for such problems is to record (and replay) the HTTP headers delivered by the server during the crawl, and indeed professional archivists use just such an approach.
Creating and displaying WARC files
At the Internet Archive, Brewster Kahle and Mike Burner designed the ARC (for “ARChive”) file format to provide a way to aggregate the millions of small files produced by their archival efforts. The format was eventually standardized as the WARC (“Web ARChive”) specification, which was released as an ISO standard and later revised. The standardization effort was led by the International Internet Preservation Consortium (IIPC), which is an “international organization of libraries and other organizations established to coordinate efforts to preserve internet content for the future”, according to Wikipedia; it includes members such as the US Library of Congress and the Internet Archive. The latter uses the WARC format internally in its Java-based Heritrix crawler.
A WARC file aggregates multiple resources like HTTP headers, file contents, and other metadata in a single compressed archive. Conveniently, Wget actually supports the file format with the --warc parameter. Unfortunately, web browsers cannot render WARC files directly, so a viewer or some conversion is necessary to access the archive. The simplest such viewer I have found is pywb, a Python package that runs a simple webserver to offer a Wayback-Machine-like interface to browse the contents of WARC files. The following set of commands will render a WARC file on localhost:
$ pip install pywb
$ wb-manager init example
$ wb-manager add example crawl.warc.gz
$ wayback
This tool was, incidentally, built by the folks behind the Webrecorder service, which can use a web browser to save dynamic page contents.
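Because a compressed WARC file is gzip data holding a sequence of records, each introduced by `WARC-` header lines, standard tools are enough for a rough look inside before loading it into a viewer. The helper below is a minimal sketch, not part of pywb or Wget; its name is hypothetical, and it will miscount if a page payload happens to contain a line starting with `WARC-Type:`.

```shell
# Hypothetical helper: count the records of each type in a compressed WARC file.
warc_type_counts() {
    # -a makes grep treat the archive as text, since payloads may be binary;
    # tr strips the carriage returns that end WARC header lines.
    gzip -dc "$1" | grep -a '^WARC-Type:' | tr -d '\r' | sort | uniq -c
}

# Example use (not run here):
#   warc_type_counts crawl.warc.gz
```

A crawl that recorded HTTP traffic would normally show `request` and `response` records alongside a `warcinfo` record describing the archive itself.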
Unfortunately, pywb has trouble loading WARC files generated by Wget because Wget followed an inconsistency in the 1.0 specification, which was fixed in the 1.1 specification. Until Wget or pywb fix those problems, WARC files produced by Wget are not reliable enough for my uses, so I have looked at other alternatives. A crawler that got my attention is simply called crawl. Here is how it is invoked:
$ crawl https://example.com/
The program fetches page requisites from other domains (unless the -exclude-related flag is used), but does not recurse out of the domain. By default, it fires up ten parallel connections to the remote site, a setting that can be changed with the -c flag. But, best of all, the resulting WARC files load perfectly in pywb.
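For a small site, it may be kinder to lower that parallelism. The sketch below assumes that the -c flag takes the number of connections as its value, which is the natural reading of the description above but should be checked against the tool's own help output; the function name `gentle_crawl` and the value 2 are hypothetical.

```shell
# Hypothetical wrapper; assumes "-c N" sets the number of parallel connections.
gentle_crawl() {
    crawl -c 2 "$@"
}

# Example use (not run here):
#   gentle_crawl https://example.com/
```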
Future work and alternatives
This article would also not be complete without a nod to the HTTrack project, the “website copier”. Working similarly to Wget, HTTrack creates local copies of remote web sites but unfortunately does not support WARC output. Its interactive aspects might be of more interest to novice users unfamiliar with the command line.
In the same vein, during my research I found a full rewrite of Wget called Wget2 that has support for multi-threaded operation, which might make it faster than its predecessor. It is missing some features from Wget, however, most notably reject patterns, WARC output, and FTP support, but adds RSS, DNS caching, and improved TLS support.
Finally, my personal dream for these kinds of tools would be to have them integrated with my existing bookmark system. I currently keep interesting links in Wallabag, a self-hosted “read it later” service designed as a free-software alternative to Pocket (now owned by Mozilla). But Wallabag, by design, creates only a “readable” version of the article instead of a full copy. In some cases, the “readable version” is actually unreadable and Wallabag sometimes fails to parse the article. Instead, other tools like bookmark-archiver or reminiscence save a screenshot of the page along with full HTML but, unfortunately, no WARC file that would allow an even more faithful replay.
The sad truth of my experiences with mirrors and archival is that data dies. Fortunately, amateur archivists have tools at their disposal to keep interesting content alive online. For those who do not want to go through that trouble, the Internet Archive seems to be here to stay and Archive Team is obviously working on a backup of the Internet Archive itself.