Bookmark Archives That Don't (Pinboard Blog)

When I first started work on Pinboard, it was because I wanted to have my own Internet archive, a place where if I saved something I stood a reasonable chance of being able to see it again, no matter what happened to the original website. I was astonished at the amount of link rot that had accumulated in my delicious bookmarks after just three years of use.

Of course, other people had this idea as well, and today Pinboard is one of three paid sites that will store archived copies of your bookmarks for an annual fee. In this post, I'd like to take a moment to argue why I think we do a better job at this than anybody else.

Here's a summary of the competing offers:

service	fee	full-text search	what gets saved?
Pinboard	$25/year	PDF, HTML, TXT	page source + dependencies
Diigo	$20/year	PDF, HTML, TXT	page source
Historious	$30/year	HTML, TXT	page source

As you can see, all three sites offer archiving and full-text search, although Historious can't handle PDFs.

The real difference between the sites lies in how they handle embedded content. As an experiment, I purchased paid accounts on Diigo and Historious and bookmarked a recent article from Wired magazine on all three services. After the archived copy of the page became available, I edited it to remove all references to outside domains, then loaded it in my browser. This mimicked what would happen if Wired somehow disappeared from the internet, and I had to rely entirely on the information that had been archived for me.
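The editing step in that experiment can be approximated in code. The following is a minimal sketch, not the procedure actually used for this post: the function name strip_external_refs and the host archive.example are made up for illustration, and the regular expression only looks at quoted src and href attributes, so it will miss references made from CSS or scripts.

```python
import re
from urllib.parse import urlparse

# Matches quoted src="..." or href="..." attributes (single or double quotes).
ATTR = re.compile(r'''\b(src|href)\s*=\s*(["'])(.*?)\2''', re.IGNORECASE | re.DOTALL)


def strip_external_refs(html, archive_host):
    """Blank out every src/href that names a host other than archive_host.

    Relative URLs have no host part, so they are left alone. This is a rough
    approximation of hand-editing the archived copy (an assumption, not the
    method described in the post).
    """
    def blank_if_external(match):
        name, quote, url = match.groups()
        host = urlparse(url).netloc
        if host and host != archive_host:
            return f"{name}={quote}{quote}"
        return match.group(0)

    return ATTR.sub(blank_if_external, html)


page = (
    '<link href="http://www.wired.com/s.css">'
    '<img src="/cache/1.jpg">'
    '<script src="//cdn.example.com/a.js"></script>'
)
print(strip_external_refs(page, "archive.example"))
```

Loading the output in a browser then shows only what the archive itself can supply, which is the condition the screenshots are meant to capture.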

Here's what I saw in Safari:

[Screenshots. Top row: Original Article, Pinboard. Bottom row: Historious, Diigo.]

(The grey line in the two bottom images shows where I cut out a long column of links for the sake of formatting)

As you can see, none of the services was able to completely duplicate the page. But Pinboard came close, storing both the page styling and the prominent image that is the focus of the article. Historious and Diigo could only display a skeletal text version of the original page. All images and formatting were lost.

So what's going on?

When you bookmark a website in Diigo or Historious, the service saves a copy of that site's HTML. But it doesn't examine the HTML file to see what other resources that page will need when it's displayed in a browser.

When you click on a 'cached' bookmark in one of those services, you see a page that looks like an exact copy of the original - for good reason. All of the images, javascript files, stylesheets and other embedded resources on the page are being pulled in by your browser from their original location. This approach is a lot like backing up all your important files with a symlink. It will work beautifully until the moment you need it.

Pinboard, on the other hand, attempts to resolve page dependencies when it first downloads your bookmarked site. The crawler parses the HTML, fetches the images, stylesheets, javascript files, and other embedded elements it thinks a browser will need to render the page, and then rewrites the links on the stored copy of the site so they point to local copies of that content. This technique is not foolproof, but as the example shows, it can make a world of difference when one of your links dies.
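The parse, fetch, and rewrite steps described above can be sketched in a few lines. This is an illustration under stated assumptions, not Pinboard's crawler: the names DependencyFinder and archive_page, the fetch callback, and the dep-N naming scheme are all hypothetical, and only img, script, and stylesheet link tags are considered.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class DependencyFinder(HTMLParser):
    """Collects URLs a browser would likely fetch to render the page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found = []  # list of (raw attribute value, absolute URL)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        raw = None
        if tag in ("img", "script"):
            raw = attrs.get("src")
        elif tag == "link" and "stylesheet" in (attrs.get("rel") or "").lower():
            raw = attrs.get("href")
        if raw:
            self.found.append((raw, urljoin(self.base_url, raw)))


def archive_page(html, base_url, fetch):
    """Return (rewritten_html, stored), where stored maps local names to content.

    fetch is any callable taking an absolute URL and returning its content
    (hypothetical; a real crawler would add error handling and limits).
    """
    finder = DependencyFinder(base_url)
    finder.feed(html)
    local_names = {}  # raw attribute value -> local name
    stored = {}       # local name -> fetched content
    for raw, absolute in finder.found:
        if raw in local_names:
            continue  # same resource referenced twice; fetch it once
        name = f"dep-{len(local_names)}"
        local_names[raw] = name
        stored[name] = fetch(absolute)
        # Naive rewrite: swaps the quoted value wherever it appears in the page.
        html = html.replace(f'"{raw}"', f'"{name}"')
    return html, stored
```

The string replacement is the weakest part of the sketch. It assumes double-quoted attribute values without HTML entities, and it would also rewrite an ordinary link that happens to share a URL with a dependency.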

Properly identifying page dependencies is not easy. There are many cases where our crawler is misled into grabbing too much or too little. For example, some javascript-intensive sites load their dependencies dynamically. Other sites rely on iframes, use recursive import statements in stylesheets, dynamically change their structure at page load time, or hide behind a paywall.
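To make one of these cases concrete, here is a rough sketch of following import statements in stylesheets without looping forever when two sheets import each other. The function name collect_stylesheets and the fetch_text callback are assumptions for illustration, and the regular expression covers only the common forms of @import, not the full CSS grammar.

```python
import re
from urllib.parse import urljoin

# Handles @import "x.css", @import 'x.css', and @import url(x.css) with or
# without quotes. Media queries after the URL are ignored.
IMPORT = re.compile(r'''@import\s+(?:url\(\s*)?["']?([^"')\s;]+)''', re.IGNORECASE)


def collect_stylesheets(url, fetch_text, seen=None):
    """Return the set of stylesheet URLs reachable from url via @import.

    fetch_text is any callable that takes a URL and returns the stylesheet
    text (hypothetical). The seen set stops the recursion when sheets import
    each other.
    """
    seen = set() if seen is None else seen
    if url in seen:
        return seen
    seen.add(url)
    for target in IMPORT.findall(fetch_text(url)):
        collect_stylesheets(urljoin(url, target), fetch_text, seen)
    return seen
```

Dynamically loaded scripts and paywalled pages have no equally simple treatment, since their dependencies are not visible in the static source at all.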

There is also a storage penalty for archiving full content. The average archived bookmark on Pinboard is nearly half a megabyte in size. When you consider that our most prolific subscriber has over 70,000 bookmarks, you can see why just storing HTML is such an attractive option.
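The two figures above give a sense of scale. As a back-of-the-envelope estimate only, assuming every one of that subscriber's bookmarks is archived at the average size and rounding both numbers to 70,000 and 0.5 MB:

```python
bookmarks = 70_000       # "over 70,000", rounded down
average_size_mb = 0.5    # "nearly half a megabyte", rounded up

total_gb = bookmarks * average_size_mb / 1000  # decimal gigabytes
print(f"roughly {total_gb:.0f} GB for one account")
```

That works out to about 35 GB for a single account, against a small fraction of that for the HTML alone.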

However, in 2010 I don't believe it makes any sense to try to archive bookmarks if you're not willing to resolve dependencies. Modern websites are a rich gumbo of javascript, CSS, Flash, images and embedded video, and from a user's perspective an archived copy should behave like the original, no matter what it takes to make that happen.

Though we're still far from reaching this goal ourselves, I think it's important that our users be aware of what's at stake. Whether it's archived bookmarks or your own personal data, there's nothing less fun than believing you had a working backup only to find out after the fact that you were wrong.

If you're curious about Pinboard archiving, I encourage you to try it for yourself. The upgrade comes with a three-day free trial period. We crawl your links, turn on full-text search for your account, and then you decide if the service is worth the price. I think you'll be really happy with the results.

—maciej on November 25, 2010

Pinboard is a bookmarking site and personal archive with an emphasis on speed over socializing.

This is the Pinboard developer blog, where I announce features and share news.
