There is a problem with saving pages from sites behind Incapsula (a DDoS-protection CDN similar to Cloudflare), such as sciencedaily.com, offshoreleaks.icij.org, monsanto.com, …
They do not ask for a CAPTCHA; they just return a blank page for half of the requests, and even retrying via a proxy does not help.
I will investigate it further.
I will look into it. Thank you for the report.
ok, fixed.
ok
When such URLs are requested, web.archive.org starts saving the page, and archive.is starts saving web.archive.org's view of that save in progress;
in most cases such a race results in very badly saved pages: http://archive.is/xYpxk
I have received several bug reports about archive.is saving empty or 404 pages from Google Cache where some content was expected.
It seems that there is more than one Google Cache, and what you get depends not only on the URL but also on which of the Google datacenters serves your request.
Examples of pages saved via different proxies:
http://archive.is/https://webcache.googleusercontent.com/search?q=cache:_PVt8WPb4DEJ:*
http://archive.is/https://webcache.googleusercontent.com/search?q=cache:CO15sF9zSrQJ:*
I think the archive should perform a few requests simultaneously and then save all successful versions.
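A minimal sketch of that idea, not how archive.is actually does it: run several fetch attempts at the same time and keep every distinct non-blank result. The function name `fetch_all_versions`, the proxy labels, and the "non-blank body means success" rule are all assumptions made for illustration; each fetcher is just a callable, so a real one would wrap an HTTP request through one particular proxy and enforce its own timeout.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional


def fetch_all_versions(
    fetchers: Dict[str, Callable[[], Optional[str]]]
) -> Dict[str, str]:
    """Run all fetchers concurrently; return each distinct non-blank body.

    `fetchers` maps a label (hypothetical, e.g. a proxy name) to a callable
    that returns the page body, returns None/"" for a blank page, or raises.
    """
    versions: Dict[str, str] = {}
    seen_bodies = set()
    with ThreadPoolExecutor(max_workers=max(1, len(fetchers))) as pool:
        # Submit everything first so the requests run simultaneously.
        futures = [(name, pool.submit(fn)) for name, fn in fetchers.items()]
        # Collect in submission order so the result is deterministic.
        for name, future in futures:
            try:
                body = future.result()
            except Exception:
                continue  # this attempt failed; others may still succeed
            if not body or not body.strip():
                continue  # a blank page is treated as a failed attempt
            if body in seen_bodies:
                continue  # same version already kept under another label
            seen_bodies.add(body)
            versions[name] = body
    return versions


def unreachable_proxy() -> str:
    raise ConnectionError("proxy unreachable")


# Stand-in fetchers; no network access is performed here.
demo = fetch_all_versions({
    "proxy-a": lambda: "<html>cached copy</html>",
    "proxy-b": lambda: "",
    "proxy-c": unreachable_proxy,
    "proxy-d": lambda: "<html>older cached copy</html>",
})
print(sorted(demo))  # ['proxy-a', 'proxy-d']
```

Each surviving entry would then be stored as its own snapshot. Detecting a 404 or error page is left to the fetcher, which could return None in that case.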
There are too many snapshots from 8ch.net and media.8ch.net with child porn.
I see that blocking the whole of 8ch.net is not a good solution, but I cannot review all the snapshots manually.
Any ideas on how to separate pages with CP from the rest of the 8ch content?
Please email me the link, I will have a look.
Aren’t they?
I just checked the last snapshot saved from YouTube (http://archive.is/hJu9a) and I see the comments expanded.
Sure. Compare http://archive.is/0HIzc (before fix) and http://archive.is/PX1P3 (after fix).