Archive.is blog

Blog of the http://archive.is/ project
  • I would like to make a charitable donation to support internet archiving, like what you do. Any suggestion? I kinda don't want to donate to Internet Archive, because they remove content.
    Anonymous

    https://liberapay.com/archiveis/

    • 14 hours ago
  • Nobody fucking gets their TLS cert revoked over some trolling or bad content. Let's Encrypt literally refuses to play content cop as a CA; it's core to their premise as an organization. Look up their blog entry "The CA's Role in Fighting Phishing and Malware" (spoiler: it's no role at all)
    Anonymous

    I fail to see how a blog post (itself a perfect example of content that can be silently altered or wiped) carries the binding force one could rely on

    • 23 hours ago
    • 1 notes
  • Can you expand hidden GitHub comments? It says '84 hidden items/Load more...' at e.g. /7cxnj
    Anonymous

    Yes.

    They used to be expanded, but something went wrong. Fixed now

    • 2 days ago
  • Can I make a tax-deductible donation to support archive.is?
    Anonymous

    No

    • 1 week ago
    • 1 notes
  • I have been told that “Higbee and Associates” copyright trolls [1][2] (and probably their clones) have been using Archive.Today’s snapshots as “evidence” of the “crimes” of their victims.

    We are not associated with those guys and had never heard of them before. The Archive’s snapshots are neither notarized nor protected by strong anti-forging technologies, so they cannot serve as legal evidence. In short, if you received a copyright claim with the Archive’s snapshots as proof, it is a scam.

    There is a second layer to this: other parties (for example [3]) try to extort more money from the victims of the aforementioned scammers, charging for snapshot removal in order to “protect” them from further attacks.

    It is a scam too, do not pay them.


    1. http://archive.today/2020.02.17-184546/https://pubcit.typepad.com/clpblog/2019/02/consumer-warning-copyright-trolling-by-higbee-and-associates.html

    2. http://archive.today/2020.02.17-184543/https://www.techdirt.com/articles/20190220/13283641640/investigating-higbee-associates-copyright-trolling-operation.shtml

    3. http://archive.today/2020.02.17-183946/https://sumbit.nl/prijs.html

    • 1 week ago
  • Why can't we archive URLs on Plurk? Did they implement some methods to prevent archiving?
    Anonymous

    plurk.com? It works for me. Which page do you have a problem with?

    • 1 week ago
  • What about the CA "Let's Encrypt"? They seem good. Nonprofit. You would still support HTTPS connections, just only when explicitly requested or when HTTPS Everywhere is installed. HTTPS sites get better SEO as well. I fail to see how this site is any more controversial than similar platforms like the Wayback Machine or Megalodon.
    Anonymous

    The Archive’s ability to preserve the short-lived content of social media has turned it into a favorite instrument of troll wars (Alt-Right vs. SJW, Ukraine vs. Russia, …),
    and although the Archive tries to stay neutral in those battles, it has often come under fire from technical and social attackers.
    The pattern of attacks has made our infrastructure similar to that of Wikileaks, SciHub, 8ch, or DailyStormer: many mirror domains, fast-flux IPs for ingress and egress, etc.
    If an attack has already been made against one of the websites in this karass, the rest have to be prepared for it.
    Revocation of an SSL certificate as the result of some social attack is very likely, so I would even argue for using plain HTTP in links to the Archive.

    • 2 weeks ago
  • Why do you no longer archive the 'good' Twitter? Recent archives use the new style, which is very bad
    Anonymous

    This https://www.reddit.com/r/Twitter/comments/ce1bea/reverting_back_to_the_old_twitter_interface_for/ ?

    Just enabled, let’s see.

    • 2 weeks ago
    • 1 notes
  • do you store logs? if so, what do they contain and how long do you keep them?
    Anonymous

    Yes, for approximately 3-6 months. They are useful for debugging and for tracking spammers.

    The logs are not archived to long-term storage; as they fill the webserver’s disk space, they are deleted

    • 2 weeks ago
  • Can you get PDFs to archive again? Before, you used to automatically archive PDFs, which was extremely useful because you could pull links that would take you to direct line items buried deep within the document. Thanks
    Anonymous

    It never worked with PDFs actually.

    It used to prefix links to PDFs with `http://webcache.googleusercontent.com/search?q=cache:` so the poor Google cache’s PDF-to-HTML converter did the job.

    But that approach had obvious drawbacks:

    1. low rendering quality

    2. many PDFs are not in the Google cache, and this hack does not work for them

    Examples can be seen here: archive.today/http://webcache.googleusercontent.com/search?q=cache:*

    If that is what you want, you can always prefix links to PDFs with that magic string before submitting them to the archive
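    That prefixing step is trivial to script. A minimal sketch (the helper name is mine; the cache prefix is the one quoted above):

```python
# Hypothetical helper: route a PDF URL through Google cache's
# PDF-to-HTML view by prepending the prefix mentioned above.
GOOGLE_CACHE_PREFIX = "http://webcache.googleusercontent.com/search?q=cache:"

def with_cache_prefix(pdf_url: str) -> str:
    """Return the URL wrapped in the Google cache prefix (idempotent)."""
    if pdf_url.startswith(GOOGLE_CACHE_PREFIX):
        return pdf_url  # already prefixed, leave it alone
    return GOOGLE_CACHE_PREFIX + pdf_url

print(with_cache_prefix("http://example.com/report.pdf"))
```

    Submit the resulting URL to the archive as usual; whether it works still depends on the PDF actually being present in Google's cache.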

    • 2 weeks ago
    • 1 notes
  • Do you plan on setting up HTTPS fully (i.e. forcing HTTPS even when classic HTTP is requested, like most websites do nowadays)? Good job on this project though
    Anonymous

    No.

    The only reason I see to do it is to improve performance by forcing HTTP/2. But in the case of the archive, the performance bottleneck is not the network but the speed of spindle disks.

    On the other hand, there are two drawbacks to forcing HTTPS:

    1. for bots it is harder to support SSL (for example, Perl does not include SSL libraries by default).

    2. a certificate authority is an additional point of failure which could go mad: there have been cases when the SSL certificates of controversial websites were revoked.

    • 2 weeks ago
  • In the FAQ I read "But take in mind that when you archive a page, your IP is being sent to the website you archive as though you are using a proxy". So if I archive a page, will the website know that a certain IP (mine) visited them through archiveis? Or in other words, will the website owner know that their website has been archived through archiveis by my IP?
    Anonymous

    It did, but that FAQ entry is obsolete: this has not been the case since December 2019′s big update.

    The idea of passing the client’s IP in X-Forwarded-For was to let the server give the archive the same localized version the client had seen. It worked in 2012, but not in 2020.
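    The old behaviour can be sketched in a few lines. The header name X-Forwarded-For is real; the helper itself is illustrative, not the archiver's actual code:

```python
# Sketch of the pre-December-2019 behaviour described above: the archiver
# fetched pages like a proxy, passing the submitter's IP along so the
# origin server could return the same localized version the client saw.
def build_fetch_headers(client_ip: str, user_agent: str) -> dict:
    return {
        "User-Agent": user_agent,
        # The origin server sees the submitter's IP, proxy-style:
        "X-Forwarded-For": client_ip,
    }

headers = build_fetch_headers("203.0.113.7", "Mozilla/5.0 (archiver)")
```

    (203.0.113.7 is a documentation address from RFC 5737, used here as a placeholder.)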

    • 1 month ago
  • why are you imitating cloudflare's captcha page? funny joke?
    Anonymous

    It is well recognizable as an interstitial page caused by too many requests. Explaining this message would require too many words, which no one would read. An orange page with a captcha on the left does it instantly, as a hieroglyph of the universal Internet language.

    • 1 month ago
    • 1 notes
  • Is Archive Today down? When I try to visit on Chrome I get a message which reads "This site can’t be reached". I've also tried visiting individual mirrors like VN, PH, IS, etc. and the website just keeps loading endlessly, but never actually completes. The problem started yesterday morning and since then, I haven't been able to visit the website or any archived page.
    Anonymous

    No, it should work. There were no outages in the last days.

    • 1 month ago
  • The Twitter profile page and replies full-page URL capture no longer works, only individual Tweets; is this deliberate on ArchiveIS’s part, or did Twitter modify the way their pages load to disable it?
    Anonymous

    api.twitter.com responds with “429 Too Many Requests”. It seems I need more Twitter accounts

    • 1 month ago
    • 1 notes
  • How come PDFs won't archive anymore? They used to archive, but now you just get a black page. It used to work though.
    Anonymous

    They never worked.

    PDF support is in my TODO list, but it is not implemented yet.

    So far you can use documentcloud.org or archive.org to store PDFs

    • 1 month ago
    • 2 notes
  • Do you have an API that I could license from you for use in other projects?
    Anonymous

    No

    • 2 months ago
    • 2 notes
  • Why is archiving so slow?

    Webpages appear instantly in browsers, so people wonder why archiving takes dozens of seconds, sometimes 3-5 minutes.

    There are many reasons:

    1. The instantly loaded page might contain nothing but a “loading” spinner, so there are intentional delays.

    2. A webpage might have pictures which load lazily, only when the user scrolls the page down. The archiver scrolls the page up and down to load those images, even if the page has no lazy elements: it has no way to know, so it makes the pessimistic assumption.

    3. A webpage might have analytics scripts which work invisibly in the background. The page looks loaded if you look at the screen, but it is still loading if you look at network events. This makes it difficult to detect the moment when page loading is complete. Moreover, there are pages which never stop loading at all (news feeds, stock market charts, …)

    4. The archiving process has more steps than just loading a page. A better comparison is loading a page and then sending it to a paper printer.
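    Point 3 above boils down to a "wait until the network goes quiet, with a hard cap" heuristic. A simulated sketch, where the function name and thresholds are my own choices, not the archiver's actual code:

```python
# Decide when a page counts as "loaded": no network event for
# `quiet_secs`, but give up after `max_wait` seconds so pages that
# never stop loading (news feeds, tickers) still get captured.
def wait_until_quiet(event_times, quiet_secs=2.0, max_wait=300.0):
    """event_times: sorted timestamps (seconds) of observed network
    events. Returns the simulated moment archiving would proceed."""
    if not event_times:
        return quiet_secs
    deadline = event_times[0] + max_wait  # hard cap
    last = event_times[0]
    for t in event_times:
        if t >= deadline:
            return deadline           # page never went quiet
        if t - last > quiet_secs:
            return last + quiet_secs  # found a quiet window
        last = t
    return min(last + quiet_secs, deadline)
```

    For example, events at 0, 0.5, 1.0 and then 5.0 seconds would let archiving proceed at the 3.0-second mark, while a page emitting an event every second indefinitely would be cut off at the 300-second cap.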

    • 2 months ago
  • Hi. If I archive the same URL again after a certain period, will the previous snapshot be deleted and replaced, when I actually need to keep both the past page and today's?
    Anonymous

    No, both versions will be in the archive, linked with <-prev next-> links

    • 2 months ago
  • I noticed with the new upgrade (which is nice for most of what I've seen!), the rewriting of links so that clicking on links in one captured webpage allowed you to visit another captured webpage (if it was captured) seems to be gone. Could it be possible to have it again, for browsability?
    Anonymous

    Yes, the update broke many things which have to be restored

    • 2 months ago
Page 1 / 38