If the bad guys were to unexpectedly shut down one of your two data centers, how long would it take before a second data center would be back online?
They are both online, normally balancing the load. When one is off for maintenance, the website runs slowly.
Consider the old success story of Reddit's "give gold to support our server uptime" fundraising scheme (granted, in their case it turned into a giant scam). I think the "fundraising progress meter" they used to publicly report on their daily money goals was very effective. It gave a community feel to giving gold, almost as if it were a collective responsibility to keep Reddit ad-free by donating to the common good. Have you considered having a daily donation progress bar as well?
“Daily” is over-optimistic; most days there are zero :)
/8rpYA and similar pages from the same parent domain are an interesting case where the archived page is actually an iframe of another news page. What do you think is the best way to archive this? Would it be a good idea to log it as a redirect and just follow the page to the iframed page, thus removing the frame? Or would the archive still work and be more accurate leaving the iframe?
I removed the CSS which limited the height of the iframe (only for this site). Is it better now?
About: /post/632648485201739776/ - Thanks again!! Could you apply this rule to all new URLs on this portal or do you only fix specific archives?
It will be applied to all new URLs after next deployment (later today or tomorrow)
In Idealista (the leading apartment-search website in Spain), can you fix "Leer comentario completo" (read full comment) and "fotos siguientes" (next photos)? Thanks! /VeJYf
fixed
Is it 1999? Asking because you block access to browsers, and I haven't seen this retarded shit in decades. You block browsers that are identical to ones you've listed as supported.
If you mean Brave, I agree that adding ads to pages and replacing referral links is very 1999-ish. It was called ActiveX malware back then.
Every URL I archive and have archived is blocked by copyright. Why? If I archive with a VPN it doesn't get blocked, but if I don't use a VPN or another IP address it gets blocked. Did I request too many URLs? It says: "In response to a request we received from 'US Digital Millennium Copyright Act' the page is not currently available. If you need it for research, investigation or other purposes, please, inquiry via email, or Search this page in Google Cache Поискать эту страницу в Архив.Орг [Search this page in Archive.Org] Search t"
There could be a bug. What website are you trying to archive?
Can you remove the blocking/login panel that appears on Facebook pages when you are not logged in? It shows up on /4EG79 but not /lHNEb and seems to appear when a person scrolls down the page. Thanks!
4EG79 is saved from Archive.org, not from Facebook. It is dangerous to click buttons like “Not now”, “Hide popup”, … in Archive.org snapshots; they likely won’t work as intended. On the contrary, lHNEb is saved from Facebook and “Not now” has been clicked.
How much space is left on archive.is servers?
Not too much. I plan to change from data duplication to erasure coding to use space more efficiently.
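For a rough sense of why that helps, here is a minimal sketch comparing the raw-storage overhead of plain replication against k+m erasure coding. The parameters are illustrative only, not the archive's actual configuration.

```python
# Rough space-overhead comparison: plain replication vs. k+m erasure coding.
# The shard counts below are hypothetical examples, not archive.today's setup.

def replication_overhead(copies: int) -> float:
    """Raw bytes stored per byte of payload with N full copies."""
    return float(copies)

def erasure_overhead(data_shards: int, parity_shards: int) -> float:
    """Raw bytes stored per byte of payload with k data + m parity shards."""
    return (data_shards + parity_shards) / data_shards

if __name__ == "__main__":
    # 3x replication, as mentioned for textual data in the FAQ
    print(f"3x replication:      {replication_overhead(3):.2f}x raw storage")
    # Hypothetical 10+4 erasure coding: survives the loss of any 4 shards
    print(f"10+4 erasure coding: {erasure_overhead(10, 4):.2f}x raw storage")
```

With these example numbers the same fault tolerance class drops from 3.00x to 1.40x raw storage, which is the kind of saving erasure coding is meant to buy.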
Sometimes it can be important to capture in the archive the original URL that the archived page was redirected from. I noticed that you have this feature, thank you. Sometimes the redirect can pass through several URLs before landing on the page that needs to be archived. Do you capture the intermediate redirects? And if so, how many URLs of the redirect chain do you record? Is it all of them?
Yes, the new archiver (in operation since Dec 2019) records a bit more than the old one: all URLs of intermediate redirects, all URLs of images and scripts, HTTP headers, IP addresses of the servers, etc. I had the idea to visualize it, probably in a form like the “Network” tab of a browser's DevTools, and to use that info to improve the ad blocker.
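As a rough illustration of that kind of metadata logging, here is a minimal sketch using Python's third-party requests library, which exposes intermediate hops via response.history. It is only an illustration of what an archiver can keep, not archive.today's actual code.

```python
# Record the redirect chain and final response headers while fetching a page.
import requests

def fetch_with_redirect_log(url: str) -> dict:
    resp = requests.get(url, allow_redirects=True, timeout=30)
    chain = list(resp.history) + [resp]            # every intermediate hop plus the final page
    return {
        "redirects": [(r.status_code, r.url) for r in chain],
        "final_headers": dict(resp.headers),       # HTTP headers of the final response
    }

if __name__ == "__main__":
    log = fetch_with_redirect_log("http://example.com/")
    for status, hop in log["redirects"]:
        print(status, hop)
```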
When a link in an archived page is clicked, it is checked to see if it has also been archived. If so, the archived page loads; if not, the real URL loads. But what if there are three archived versions of that out-linked page: one with a timestamp one day before the originating page, one with a timestamp one week after, and one with the latest archive? How do you determine which version to link to?
With the closest timestamp to the snapshot you are currently on.
There are also <-prior and next-> buttons to navigate in time in case of multiple versions.
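A minimal sketch of that selection rule, assuming the candidate snapshots are plain datetimes; the real resolver may work differently.

```python
# Pick the snapshot whose capture time is nearest to the snapshot being viewed.
from datetime import datetime

def closest_snapshot(current: datetime, candidates: list[datetime]) -> datetime:
    return min(candidates, key=lambda t: abs(t - current))

if __name__ == "__main__":
    viewing = datetime(2020, 6, 15)
    versions = [datetime(2020, 6, 14), datetime(2020, 6, 22), datetime(2021, 1, 3)]
    print(closest_snapshot(viewing, versions))   # -> 2020-06-14, one day before
```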
I read in your FAQ that you keep images at 2x duplication and textual information at 3x. With many websites using the same JavaScript libraries, how do you deal with storing commonly referenced libraries, say jQuery? Do you use pointers to save on space?
JavaScript libraries are not stored; they are executed at the time of capturing and the result of the execution is archived.
Commonly referenced blobs like background images and fonts are deduplicated, yes.
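A common way to do this is content-addressed storage: hash the blob, store it once under that hash, and let every snapshot keep only the hash as a pointer. A minimal sketch of the idea, purely illustrative rather than the archive's implementation:

```python
# Content-addressed blob store: identical blobs (e.g. a font or background image
# referenced by many snapshots) hash to the same key and are stored only once.
import hashlib

class BlobStore:
    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(key, data)     # stored only on first occurrence
        return key                            # snapshots reference this key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

if __name__ == "__main__":
    store = BlobStore()
    font = b"...binary font data..."
    k1 = store.put(font)                      # first snapshot stores the blob
    k2 = store.put(font)                      # second snapshot gets the same pointer
    print(k1 == k2, len(store._blobs))        # True 1
```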
Does it archive entire social media accounts, like a person's Twitter account, or just specific posts?
just specific posts
Are Wayback Machine links no longer allowed to be backed up in your archive? The archive process seems to keep rejecting them.
There is an issue with Wayback Machine snapshots that have only just been saved to the Wayback Machine.
There seems to be some sort of eventually consistent storage, so if you have just saved a link to the Wayback Machine and immediately send the WM link to a friend (or feed it into Archive.Today), they might see an empty page on WM. In 10-30 minutes the WM page becomes visible to everyone.
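If you script against freshly saved WM links, one workaround is simply to poll until the snapshot stops coming back empty. A hedged sketch in Python using the requests library; the timeout, interval, and emptiness check are assumptions, not WM-documented behavior.

```python
# Poll a freshly saved snapshot URL until it returns a non-empty page,
# up to a deadline, to wait out eventually consistent storage.
import time
import requests

def wait_until_visible(url: str, max_wait: int = 1800, interval: int = 60) -> bool:
    deadline = time.time() + max_wait
    while time.time() < deadline:
        resp = requests.get(url, timeout=30)
        if resp.ok and len(resp.content) > 0:   # non-empty page: snapshot is visible
            return True
        time.sleep(interval)                    # try again later (10-30 min is typical)
    return False
```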
Can Archive.Today have a long screenshot of the whole webpage, like that of the Internet Archive?
No, it would double the costs.
The new Twitter keeps showing up in new archive saves now. Is there any way to revert back to the old Twitter for new archives, or did Twitter just permanently kill off their old site design?
Yes, but the old Twitter (or what is left of it) does not show tweets which are marked as “sensitive content”. Apparently, that is because it is now tailored only for GoogleBot, not for humans.
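The implication is that the legacy markup is served based on the User-Agent. A sketch of fetching a page with a Googlebot User-Agent string; whether Twitter still returns the old UI for it is an assumption, not something confirmed here.

```python
# Fetch a page while presenting a Googlebot User-Agent (a real Googlebot UA string).
# Assumption: the server keys its legacy markup on this UA; that may not hold.
import requests

GOOGLEBOT_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                "+http://www.google.com/bot.html)")

def fetch_as_googlebot(url: str) -> str:
    resp = requests.get(url, headers={"User-Agent": GOOGLEBOT_UA}, timeout=30)
    resp.raise_for_status()
    return resp.text
```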
is neo-nazi material permitted?
I think yes, although I am not sure about the future.
So far, the materials which attract the most govt (or quasi-govt) takedown requests are:
* child porn (from NCMEC, OCLCTIC, ECO.DE, JUGENDSCHUTZ.NET, IHBARWEB, CYBERTIP.CA, MELDPUNT, PAPS.JP, IWF.ORG.UK, HOTLINE.IE, …)
* ISIS propaganda (from CTIRU and EUROPOL)
* Cookbooks for drugs and explosives (mainly from ROSKOMNADZOR)
Sites archived via Google as a proxy (using the "I'm Feeling Lucky" link) are hit with a redirect interstitial page. /IGtuE
Fixed
Why do I still have to do a captcha on your onion site? And why does Tor Browser say my connection to your onion site is insecure and your certificate is not trusted because it's self-signed, once I try to archive or search any page?
There is no easy way to obtain a browser-trusted certificate for a .onion domain. As far as I know, there are only two .onion sites with a valid certificate: Facebook and the New York Times.
Anyway, it is merely for show and has nothing to do with “secure”: when you visit .onion websites, traffic is unencrypted only between the browser and the Tor service running on the same computer, so plain HTTP is OK.
is it possible to save all pages of a blog?
No