Robots.txt meant for search engines don’t work well for web archives

6d6b73 · 1 minute ago

There should be a way to direct archiving bots to a file that has the newest, compressed version of the website for them to download. Wouldn't that be easier for everyone?

reply

metafunctor · 4 hours ago

It appears that IA applies (or did apply) a new version of robots.txt to pages already in their index, even if they were archived years ago. That's silly, and stopping doing that would probably solve much of this problem.

reply

icebraining · 2 hours ago

I don't think it's "silly". IA operates in a sketchy legal environment. There's no fair use exclusion for what they're doing, and it made sense to be extra careful and deferential towards website operators, lest they get hit by a lawsuit.

reply

the8472 · 1 hour ago

> There's no fair use exclusion for what they're doing

Fair use is not the only exception to copyright. US copyright law has a separate section on exceptions for libraries and archives.

reply

mbrookes · 2 hours ago

I first came across that issue during one of the many Facebook privacy scandals. I'd found some juicy bits in (IIRC) a much earlier version of their privacy policy. But when I went back to it later, the robots.txt had been updated, and the earlier archives had been obliterated.

That just seems wrong.

reply

LoSboccacc · 2 hours ago

yeah it's not even hard to solve of a problem, just use the archived version of robots.txt that matches the crawl date

too bad they already lost loads of internet content that way

reply

paol · 2 hours ago

They didn't lose anything. Content excluded in this manner is only made inaccessible to the public, not deleted from the archive. They can change their policy retroactively.

reply

tempay · 2 hours ago

What should happen in the case that a website misconfigures robots.txt and ends up wanting to remove private data?

I think I would be tempted to say that the data can't be removed to avoid abuse from future domain owners (or current ones) but I'm not sure if there would be any legal consequences of this attitude.

reply

CM30 · 2 hours ago

Provide a content removal form? It works for DMCA notices, it can work here. Maybe even have a 'reason' textbox to see why someone may want content removed...

reply

makomk · 1 hour ago

Of course, there's the inevitable risk that the Internet Archive's newfound control over who is allowed to make their past disappear into the memory hole and who has it archived forever will be used for political ends, especially since the ability to manually archive pages is already used this way by staff. (Take a look at Jason Scott's Twitter or that of the Archive Team sometime - lots of conspicuous manual archiving of stuff that's embarrassing to a certain US political party.)

The issue of curators' views biasing the contents of collections seems to be underappreciated in general in the digital age, for some reason.

reply

TeMPOraL · 58 minutes ago

Here I'd lean towards archiving everything indiscriminately. Politicians especially should not have the "right to be forgotten", because what they do is of historical interest.

reply

LoSboccacc · 2 hours ago

Same when an archive crawl illegal/copyrigthed data, a pathway for that needs to exists anyway

reply

frik · 2 hours ago

Upvoted. That's exactly the problem, and the way to solve it.

Example: two months before the movie "The Social Network" got released to theaters in 2010 Facebook decided to add a robots.txt to Facebook.com. Immediately Archive.org deleted/disabled access to the archive how Facebook startpage looked in 2004-2010.

BTW. the correct way would be to activate archive access to Facebook.com for the 2004-2010 time-frame again. The "The Accidental Billionaires: The Founding of Facebook" book and the "The Social Network" film based on that book used of course partly Archive.org and various other research methods to get the facts.

reply

thisacctforreal · 1 hour ago

What about extending robots.txt to include date ranges?

For future domain-owners this is likely far too much control, but maybe that could be mitigated if IA tracks DNS/whois/registration info too

reply

libeclipse · 5 hours ago

On the linked page, I see comments about ignoring the webmasters' wishes et al.

All I can say is f*ck that. It's a free and open internet. If you put content up on a public site, anyone has the right to go and look at it. Stop complaining when someone saves it.

And sure some people complain that scrapers slow down their site and that's why they use robots.txt, but really? Really? It's 2017 and your site is affected by that. I think you have bigger things to worry about.

reply

bkor · 1 hour ago

> Really? It's 2017 and your site is affected by that.

That someone wants to use a robot to completely scrape an entire dynamic website is their goal. A site is not responsible to make that possible. One bot causes _way_ more traffic and CPU usage than just a normal visitor or 1000s of visitors.

Saying '2017' or anything else: meh.

Various network operators are pretty helpful. Sending abuse complaints regarding misbehaving bots has resulted in actions before. I've seen action being taken from universities, ISPs, etc. Though normally the bots are auto-blocked (on IP address or ranges; quite easy to script).

robots.txt is an established / de facto standard. Ignore it, be prepared to explain why. IMO pretty much any country have computer hacking laws which are vague enough that to consciously ignore such a standard can be seen as "invading".

A "not my problem" approach: I think you should really think a little bit more.

reply

madshiva · 48 minutes ago

I totally ignore it and my bot never get caught. If they catch me I will say that the script wasn't working correctly, but what you are saying is wrong, there's is NO LAW stating that /robots.txt must be obeyed. Therefore it's not my problem, I just don't follow your rule, but I have the choice too and you have the choice to block my IP too which I think is more harmful.

Also thanks for spreading bad information.

reply

adventured · 29 minutes ago

> there's is NO LAW stating that /robots.txt must be obeyed. Therefore it's not my problem

You're not wrong about robots.txt, you're wrong in a much more broad way. There is in fact an extremely dangerous law that could easily ensnare what you're talking about:

https://en.wikipedia.org/wiki/Computer_Fraud_and_Abuse_Act

reply

blowski · 3 hours ago

What's your opinion on Google scraping your content and putting minimally attributed snippets?

reply

chii · 4 hours ago

> scrapers slow down their site and that's why they use robots.txt

a poorly written scraper may really slow down your site, especially if it wasn't intended to be scrapped repeatedly. There should be something to be said about frequency which scrapers should follow (specified by the website owner via a robots.txt like spec).

But website owners cannot demand unreasonable frequencies (such as once a year!), and what constitutes unreasonable is up for debate.

reply

eknkc · 4 hours ago

I don't think a poorly written scraper would follow robots.txt rules according to spec. So, in any case the site should have other measures (rate limiting?) anyway.

reply

davb · 4 hours ago

Additionally, if excessive scraping became an issue for my site I'd consider rate limiting client.

reply

dingo_bat · 3 hours ago

> (specified by the website owner via a robots.txt like spec).

Nope, if a website wants such a restriction, it must enforce it. Robots.txt is a request. It's worthless.

reply

bkor · 1 hour ago

If a robot misbehaves, it'll either be blocked or it'll go to the networks abuse section and that bot will be taken down. That a site possibly could have some kind of technical solution to this doesn't matter.

reply

timClicks · 4 hours ago

The Crawl-delay directive is the de facto standard for this.

reply

LoSboccacc · 2 hours ago

they only ever needed to honor the robots.txt at the date of archival.

archive.org fucked up by making robots retroactive, if they used the archived robots.txt as a filter for a site at the relevant date, they'd have had the best of both worlds - respecting how sites work without losing how sites appeared at a date.

reply

pbhjpbhj · 2 hours ago

They'd be the flouting copyright laws, like Google do, but nonetheless tortuously. They're making a copy, which is already an infringement, distributing it seemingly against the owners express wishes is treated as a crime in some jurisdictions.

reply

Retric · 2 hours ago

Robots.TXT is explicit permission to make a copy, otherwise crawling is meaningless and that is not nessisarily reversible. Like putting up a yard sale sign, then trying to get the poeple that show up yesterday arrested for trespassing.

What the archive can do after that point is a different issue, but they clearly can keep a copy. Further, someone else is using the domain they don't nessisarily have anything to do with the archived data.

reply

pbhjpbhj · 27 minutes ago

Robots.txt is usually explicitly permission for a robot not to crawl. But a robot crawling your site and an archive, cache, or duplicate page are all different propositions.

Google and others have enhanced robots.txt to enable permission for crawling (allow, sitemap), meta tags can deny archiving and various means allow permission to be explicitly denied for caching.

To use your analogy of raising a sign: if you don't put up a 'no trespassing' sign then it doesn't make trespassing legal.

FWIW I disprove of this state of affairs and consider copyright to be hugely defective in these respects.

>but they clearly can keep a copy //

It's nuanced but permission to access a page =/= permission to keep a copy. Just as you have explicit permission to access a video on YouTube but in most jurisdictions will not have permission to download it for later (commercial) use.

reply

reitanqild · 1 hour ago

Anyone knows why this was downvoted?

reply

tomjen3 · 1 hour ago

They also make the content available to the public, which is directly competing with the site owners.

Thats inviting lawsuits they can't win and expecting people to pay the bandwidth for it too.

reply

TeMPOraL · 35 minutes ago

Archive.org needs to be able to apply to itself. We could then use archive.org to view how archive.org in the past viewed some interesting site, thus avoiding the whole retroactive robots.txt fail.

;).

reply

yeukhon · 3 hours ago

I remember writing a dumb parser for robots.txt. I have to agree, robots.txt is simplistic but so non-standard. I wonder why search engines can't just say NO to this. Does search engines today still honor robots.txt?

Here's my shameless plug: https://github.com/yeukhon/robots-txt-scanner

I still remember writing most of this on Caltrain one morning heading to SF visiting someone I dearly loved.....

reply

mushiake · 5 hours ago

fastastic news.

Archive Team's take on this[0]

[0]http://www.archiveteam.org/index.php?title=Robots.txt

reply

dingaling · 4 hours ago

It is great news in general, but seems to be done in a clumsy and counterproductive manner that may cause the Internet Archive to be banned from crawling some websites.

The problem: when robots.txt for a website is found to have been made more restrictive, the IA retrospectively applies its new restrictions to already-archived pages and hides them from view. This can also cause entire domains to vanish into the deep-archive. No-one outside IA thinks this is sensible.

Their solution: ignore robots.txt altogether. What? That will just annoy many website operators.

My proposed solution: keep parsing robots.txt on each crawl and obey it progressively, without applying the changes to existing archived material. This is actually less work than what they currently do. If the new robots.txt says to ignore about_iphone.html you just do that and ignore it. Older versions aren't affected.

Basically they're switching from being excessively obedient to completely ignoring robots.txt in order to fix a self-made problem. I can only see that antagonising operators.

reply

duskwuff · 4 hours ago

There's some value in allowing site operators to retroactively remove content which was never intended to be public. A common and unfortunate example is backups (like SQL dumps) being stored in web-accessible directories, then subseqently being indexed and archived when a crawler finds the appropriate directory index.

What needs to be fixed first is just the really common case mentioned in the blog post, where a domain changes ownership and a restrictive robots.txt is applied to the parking page.

reply

Spare_account · 4 hours ago

Here's a slight modification to the GP proposal:

- Respect robots.txt at the time you crawl it.

- If robots.txt appears later, stop archiving from that date forwards.

- Preserve access to old archived copies of the site by default.

- Offer a mechanism that allows a proven site owner to explicitly request retrospective access removal.

If archive.org have recorded the date that they first observed a robots.txt on the sites currently unavailable, they could even consider applying the above logic today retrospectively. Perhaps after a couple of warning emails to the current Administrative Contact for the domain.

reply

pbhjpbhj · 20 minutes ago

>mechanism that allows a proven site owner to explicitly request retrospective access removal. //

It should be "a proven content owner", just buying a site shouldn't allow someone to remove it from archive.

reply

r721 · 3 hours ago

>may cause the Internet Archive to be banned from crawling some websites.

It looks like Facebook banned ia_archiver (recently? I recall it worked a few weeks ago):

>User-agent: ia_archiver

>Disallow: /

https://www.facebook.com/robots.txt

reply

rz2k · 5 hours ago

The logic is sound, and I see that it was mostly written in 2011, but I can also see it being harmful.

How about an IETF RFC to clarify?

Libraries operate under a lot of unwritten social conventions, perhaps even more than most other institutions. (robots.txt even if largely ignored is a popular convention) Aggressive or confrontational wording, regardless of whether they are "right" doesn't seem in libraries' interests.

reply

laumars · 5 hours ago

To be honest I see robots.txt as a failed experiment since it relies on trust rather than security or thoughtful design.

reply

hdhzy · 5 hours ago

I don't think it's about security.

For example I've got a link to do delegated login like /login-with/github. When people click it an oauth flow will start. But it is useless for robots to follow so I disallow it in robots.txt. If they still follow nothing breaks and it's not a security issue but if I can avoid starting unnecessary oauth logins it's an additional benefit.

reply

laumars · 3 hours ago

robots.txt wasn't created for security but it can have security implications if you publish a list of Disallow paths with the intention of hiding sensitive content (sadly I have that seen that happen a lot) where as a better approach would be IP whitelisting and/or user authentication.

However I'm not claiming security is the only reason people use (misuse?) robots.txt. For example in your case you could mitigate your need for a robots.txt with a nofollow attribute[1]. Sure bad bots could still crawl your site and find the authentication URL without probing robots.txt so the security implications there is pretty much non-existent. But you've already got a thoughtful design (the other point I raised) that mitigates the need for robots.txt anyway so adding something like "nofollow" maybe enough to remove the robots.txt requirement altogether.

[1] https://en.wikipedia.org/wiki/Nofollow

reply

dchest · 1 hour ago

This is crazy, that's not what robots.txt is for. How can you complain about the security of a thing that is not meant to provide security?

According to your logic, newspapers are a "failed experiment because they rely on trust rather than security or thoughtful design". I published an article with my treasure map and told people not to go there, but they stole it.

reply

laumars · 1 hour ago

That was an anecdote since the previous poster raised the point about security. I'm definitely not claiming robots.txt should be for security nor was designed for security!

I said following proper security and design practices renders obsolete all the edge cases that people might use robots.txt. I'm saying if you design your site properly then you shouldn't really need a robots.txt. That applies for all examples that HN commentators have raised in terms of their robots.txt usage thus far.

I would rewrite my OP to make my point clearer but sadly I no longer have the option to edit it.

reply

nabla9 · 2 hours ago

robots.text is not a security tool. It's communication tool that gives advice. Just like sitemaps.

If you need to add security (logins) to protect content you don't need to protect you inconvenience users.

reply

laumars · 2 hours ago

I'd already covered the security point replying to another poster (https://news.ycombinator.com/item?id=14163792) but just to be clear, I'm absolutely not claiming robots.txt is a security tool. Quite the opposite, I saying following good security and design practices renders the robots.txt file obsolete.

Your point about sitemaps helps illustrate that point of mine because having a decent sitemap mitigates the need for Allow lines in robots.txt. It's another feature of the web where robots.txt isn't well equipped to handle and thus there have been other, better, tools built to highlight pages of interest to search engines.

reply

bkor · 1 hour ago

Lots of laws are pretty similar. e.g. technically you could steal loads of things. Practically you don't. Defeating/ignoring mechanisms such as robots.txt (vs maybe some security person in a store) still makes stealing not ok.

reply

laumars · 49 minutes ago

The morality of whether bots should obey robots.txt is a separate issue to the point I raised about how you shouldn't trust bots to obey them. To use your example of high street stores: shops have security tags on expensive items / clothing as a method of securing products from theft because you cannot blindly trust everyone not to steal (though wouldn't it be great if that wasn't the case). Equally websites cannot trust that bots will obey robots.txt. Which means any content that doesn't want to be crawled needs to be behind nofollow attributes or (if it's sensitive) user authentication layers and any content that does need to be indexed also needs to be in a sitemap. Once you have all of these extra layers implemented, the robots.txt becomes utterly redundant. Hence why I say it's a failed experiment. The benefits it offers are superseded by better solutions.

reply

kakarot · 3 hours ago

I came here to say something about respecting the wishes of others, etc, but you know what? You're absolutely right. We shouldn't even need to have a conversation about trust and respect.

It should be non-negotiable if you don't want your personal contents indexed by scrapers and archivers, and it should be enforced by design. It's a broken system.

reply

afandian · 3 hours ago

I'm not so sure that even Google respects it. I did some digging about the semantics about robots.txt whilst writing a bot myself, and it seems that Google doesn't follow links that are excluded, but it will visit those pages. Maybe that counts as "paying attention", but I don't think they "respect" it.

reply

rubatuga · 2 hours ago

One thing that should be considered is the right for an individual to be forgotten.

reply

freshhawk · 2 hours ago

Forgotten by whom? Who judges what is "forgotten" and when? How is it enforced?

The specifics here matter a great deal, the versions so far are regularly abused by the wealthy and don't apply to any of the data warehouses that the powerful and well connected have access to.

Where did this "right" come from? What's the legal and ethical basis for it? It is analogous to censorship or book burning at the basic level, destroying information to hide it from the public. It requires a consistent and strong justification as well as justified limited scope because of that, and it better be obviously beneficial to society even accounting for the inevitable misuse by those in power.

reply

beejiu · 2 hours ago

There is a legal basis in the European Union. https://en.wikipedia.org/wiki/Right_to_be_forgotten

reply

merb · 1 hour ago

and the legal basis is pretty fluffy: - http://www.sueddeutsche.de/digital/bgh-grundsatzurteil-namen... - http://www.focus.de/digital/internet/bgh-urteil-keine-staend...

it's german, but basically it says: "This is not a blank check".

reply

TeMPOraL · 49 minutes ago

In my (current) opinion, it's this law that should be forgotten. What's on the public Internet is a matter of public interest. All I can see is this law being used by bad people to hide their bad deeds, especially when those bad deeds should be known.

reply

madshiva · 3 hours ago

Yeah just ignore robots.txt because there's others solution.

If the site don't want to be scanned they can adopt a lot of counter measure and robots.txt will not save it from abuse.

He remind me the old days when my website wasn't working from US because I just fake that the site was down because there's no reason that somebody goes to my site from US (I know it's kind stupid, but when all your content is in french and you are a kid... :) )

reply

sengork · 4 hours ago

Found this[1] via the Wikipedia's Talk page for the robots.txt article. It showcases that early on robots.txt was designated to help maintain bandwidth performance of web servers. Back then it would have been due to bandwidth contention, today it may be bandwidth cost to some operators which robots.txt help mitigate.

[1] https://yro.slashdot.org/comments.pl?sid=377285&cid=21554125

reply

amelius · 3 hours ago

I think we should write down a legal license in our robots.txt file, as a retribution for all those lengthy EULAs these big companies make us read :)

reply

Aissen · 3 hours ago

Finally. A bit late, since a lot of the archive has been removed because of new owner's aggressive (or malicious) new robots.txt

reply

pjc50 · 3 hours ago

Hidden, rather than removed.

reply

droithomme · 4 hours ago

I have some sites where I specifically block archiving from some sections for good reason. (Even if I didn't have a good reason though it would still be my choice.)

I have a very big problem with them disregarding robots directives. Sure some crawlers ignore them: Hostile net actors up to no good. This decision means they are a hostile net actor. I'll have to take extreme measures such as determining all the ip address ranges they use and totally blocking access. This inconveniences me, which means they are now my enemy.

edit- For those interested: Deny from 207.241.224.0/22

reply

Asparagirl · 1 hour ago

Are you under the impression that individual web archivists don't also scrape websites of interest, and submit those WARC's for inclusion into the Wayback Machine, independent of the IA's crawlers?

Because believe me, we do...good luck banning every AWS and DO IP range.

reply

tomjen3 · 55 minutes ago

I actually didn't know that. Do you operate the same crawlers?

I have considered putting a single file that is only accessible via no-follow links and perma-ban any ip that access the file, as a way to punish bad robots.

reply

andrius4669 · 2 hours ago

Why not just block ia_archiver useragent in your web server for these paths instead? Also, I'm curious, what that good reason could be?

reply

cprecioso · 4 hours ago

Can I ask what the good reason is?

reply

omgtehlion · 4 hours ago

I have easier solution for you: just shut down your site and be done with it. This way no malicious actor will be able to save your precious information.

reply