In my estimation, the number of pages cited on Wikipedia is several times bigger than the number of pages saved in archive.is. The number of new links that appear on Wikipedia daily is also several times bigger than the number of pages archive.is stores per day.
If Wikipedia were to save all of its references into archive.is (you said “afford linking .. en masse”), this would of course cause problems.
Such activity would not be similar to pressing PrintScreen and saving the picture to a photo hosting site. It would amount to web crawling, and then archive.is would have to obey robots.txt (see the sketch at the end of this answer).
It may also incur additional costs for new equipment to cope with the increased load. If the expense exceeds a certain threshold, the question arises of who pays for it: me, Wikipedia, or the visitors coming from Wikipedia to archive.is (any other options here?). In the last case it would hardly be AdSense advertising like archive.is has on the search page; more likely an aggressive fundraising campaign. Fundraising is more familiar to users coming from Wikipedia, because it is exactly the way archive.org and Wikipedia fund themselves, and the expected conversion ratio is higher than from AdSense.
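For illustration, here is a minimal sketch of what obeying robots.txt would mean for each crawled site; the user-agent name is just a placeholder, not what archive.is actually sends:

# Minimal sketch of the robots.txt check a crawler would have to perform
# before fetching pages in bulk. "ExampleArchiveBot" is a placeholder
# user-agent, not the one archive.is uses.
from urllib import robotparser
from urllib.parse import urlparse

def allowed_to_fetch(url, user_agent="ExampleArchiveBot"):
    parsed = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    rp.read()  # download and parse the site's robots.txt
    return rp.can_fetch(user_agent, url)

# A bulk crawler would have to skip any URL for which this returns False.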
Could you explain your idea?
Imgur sometimes (for some clients) redirects from i.imgur.com to imgur.com, i.e. from the raw image to an HTML page with the same image plus ads.
That was the case with the archive. I made a quick fix so the images are loaded (example: http://archive.is/tNdK2), but I am not sure how reliable it is. I have no idea how Imgur’s redirect logic works.
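Roughly, the kind of workaround I mean looks like this; it is only a sketch, and the assumption that the redirect depends on the Accept header is a guess, not confirmed Imgur behaviour:

# Illustration only: try to make the request look like an inline image load
# rather than a page navigation. The Accept header as the deciding factor is
# an assumption about Imgur's redirect logic, which I do not actually know.
import requests

def fetch_imgur_image(url):
    resp = requests.get(url, headers={"Accept": "image/*"}, timeout=30)
    if resp.headers.get("Content-Type", "").startswith("text/html"):
        # Still bounced to the HTML page with ads; the archiver would have
        # to fall back to extracting the image from the page instead.
        raise RuntimeError("Imgur returned an HTML page instead of the image")
    return resp.content

# fetch_imgur_image("https://i.imgur.com/xxxxxxx.jpg")  # hypothetical URL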
What forum?
There is a problem with saving pages from sites behind Incapsula (a DDoS-protection CDN similar to Cloudflare), such as sciencedaily.com, offshoreleaks.icij.org, monsanto.com, …
They do not ask for a CAPTCHA; they just return a blank page for half of the requests, and even retrying via a proxy does not help (see the sketch below).
I will investigate it further.
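For reference, a rough sketch of how the blank responses show up; this only illustrates detecting them, it is not a way around Incapsula:

# Roughly half of the requests come back as HTTP 200 with an empty body and
# no CAPTCHA challenge. This loop only detects and retries such responses;
# in practice retrying (even through a proxy) often fails the same way.
import requests

def fetch_with_blank_detection(url, attempts=3):
    for _ in range(attempts):
        resp = requests.get(url, timeout=30)
        if resp.status_code == 200 and resp.text.strip():
            return resp  # got a non-empty page
    return None  # every attempt came back blank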
I will see, thank you for the report.
ok, fixed.
ok
When such URLs are requested, web.archive.org starts saving the page, and archive.is starts saving web.archive.org while it is still saving; in most cases such a race results in very badly saved pages: http://archive.is/xYpxk
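One way to avoid the race would be to recognize these URLs before archiving starts and archive the target page directly. This is only a suggestion, based on the public form of the “Save Page Now” URLs:

# Sketch: if the submitted URL points at web.archive.org's save endpoint,
# archive the target page itself instead of snapshotting web.archive.org
# while it is still busy saving. Only a suggestion, not current behaviour.
from urllib.parse import urlparse

def rewrite_wayback_save_url(url):
    parsed = urlparse(url)
    if parsed.netloc == "web.archive.org" and parsed.path.startswith("/save/"):
        return parsed.path[len("/save/"):]  # the original target URL
    return url

# rewrite_wayback_save_url("https://web.archive.org/save/https://example.com/")
# -> "https://example.com/"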
I have received several bug reports about archive.is saving empty or 404 pages from Google Cache although some content was expected to be there.
It seems that there is more than one Google Cache, and what you get depends not only on the URL but also on which one of the Google datacenters serves your request.
Examples of pages saved via different proxies:
http://archive.is/https://webcache.googleusercontent.com/search?q=cache:_PVt8WPb4DEJ:*
http://archive.is/https://webcache.googleusercontent.com/search?q=cache:CO15sF9zSrQJ:*
I think the archive should perform a few requests simultaneously and then save all the successful versions.
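A minimal sketch of that idea; the proxy addresses are placeholders and the real archiver works differently, this only shows the “fetch in parallel, keep every non-empty answer” part:

# Fire the same Google Cache request through several proxies at once and
# keep every non-empty answer, since different Google datacenters may serve
# different (or empty) cached copies. The proxy addresses are placeholders.
import concurrent.futures
import requests

PROXIES = ["http://proxy-a.example:3128", "http://proxy-b.example:3128"]

def fetch_via_proxy(url, proxy):
    try:
        resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
    except requests.RequestException:
        return None
    if resp.status_code == 200 and resp.text.strip():
        return resp.text
    return None

def fetch_all_cache_versions(url):
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(PROXIES)) as pool:
        results = pool.map(lambda proxy: fetch_via_proxy(url, proxy), PROXIES)
    # Every non-empty version could then be stored as its own snapshot.
    return [body for body in results if body]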