There should be a way to direct archiving bots to a file that has the newest, compressed version of the website for them to download. Wouldn't that be easier for everyone?
It appears that IA applies (or did apply) a new version of robots.txt to pages already in their index, even if they were archived years ago. That's silly, and simply not doing that would probably solve much of this problem.
I don't think it's "silly". IA operates in a sketchy legal environment. There's no fair use exclusion for what they're doing, and it made sense to be extra careful and deferential towards website operators, lest they get hit by a lawsuit.
I first came across that issue during one of the many Facebook privacy scandals. I'd found some juicy bits in (IIRC) a much earlier version of their privacy policy. But when I went back to it later, the robots.txt had been updated, and the earlier archives had been obliterated.
They didn't lose anything. Content excluded in this manner is only made inaccessible to the public, not deleted from the archive. They can change their policy retroactively.
What should happen in the case that a website misconfigures robots.txt and ends up wanting to remove private data?
I think I would be tempted to say that the data can't be removed, to avoid abuse by future domain owners (or current ones), but I'm not sure whether there would be any legal consequences to that attitude.
Provide a content removal form? It works for DMCA notices, it can work here. Maybe even have a 'reason' textbox to see why someone may want content removed...
Of course, there's the inevitable risk that the Internet Archive's newfound control over who is allowed to make their past disappear into the memory hole and who has it archived forever will be used for political ends, especially since the ability to manually archive pages is already used this way by staff. (Take a look at Jason Scott's Twitter or that of the Archive Team sometime - lots of conspicuous manual archiving of stuff that's embarrassing to a certain US political party.)
The issue of curators' views biasing the contents of collections seems to be underappreciated in general in the digital age, for some reason.
Here I'd lean towards archiving everything indiscriminately. Politicians especially should not have the "right to be forgotten", because what they do is of historical interest.
Upvoted. That's exactly the problem, and the way to solve it.
Example: two months before the movie "The Social Network" was released to theaters in 2010, Facebook decided to add a robots.txt to Facebook.com. Archive.org immediately deleted/disabled access to the archives showing how the Facebook start page looked from 2004-2010.
BTW, the correct way forward would be to reactivate archive access to Facebook.com for the 2004-2010 time frame. The book "The Accidental Billionaires: The Founding of Facebook" and the film "The Social Network" based on it of course relied partly on Archive.org, among various other research methods, to get the facts.
On the linked page, I see comments about ignoring the webmasters' wishes et al.
All I can say is f*ck that. It's a free and open internet. If you put content up on a public site, anyone has the right to go and look at it. Stop complaining when someone saves it.
And sure some people complain that scrapers slow down their site and that's why they use robots.txt, but really? Really? It's 2017 and your site is affected by that. I think you have bigger things to worry about.
> Really? It's 2017 and your site is affected by that.
If someone wants to use a robot to completely scrape an entire dynamic website, that's their goal; a site is not responsible for making that possible. One bot can cause _way_ more traffic and CPU usage than a normal visitor, or even thousands of them.
Saying '2017' or anything else: meh.
Various network operators are pretty helpful. Sending abuse complaints regarding misbehaving bots has resulted in action before. I've seen action taken by universities, ISPs, etc. Though normally the bots are just auto-blocked (by IP address or range; quite easy to script, roughly as in the sketch below).
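For the curious, the kind of auto-blocking script I mean is nothing fancy. A minimal sketch in Python, assuming a combined-format access log and iptables; the log path and threshold are made up:

    #!/usr/bin/env python3
    # Count requests per client IP in an access log and drop heavy hitters.
    import collections
    import subprocess

    LOG_PATH = "/var/log/nginx/access.log"  # assumption: adjust to your server
    THRESHOLD = 1000                        # requests in this log before banning

    hits = collections.Counter()
    with open(LOG_PATH) as log:
        for line in log:
            ip = line.split(" ", 1)[0]      # client IP is the first field
            hits[ip] += 1

    for ip, count in hits.items():
        if count > THRESHOLD:
            # Block further traffic from this address at the firewall.
            subprocess.run(["iptables", "-A", "INPUT", "-s", ip, "-j", "DROP"])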
robots.txt is an established, de facto standard. Ignore it and be prepared to explain why. IMO pretty much any country has computer hacking laws vague enough that consciously ignoring such a standard can be seen as "intrusion".
A "not my problem" approach: I think you should really think a little bit more.
I totally ignore it and my bot never gets caught. If they catch me, I'll say the script wasn't working correctly. But what you are saying is wrong: there is NO LAW stating that /robots.txt must be obeyed. Therefore it's not my problem. I just don't follow your rule, but I have that choice, and you have the choice to block my IP too, which I think is more harmful.
> there is NO LAW stating that /robots.txt must be obeyed. Therefore it's not my problem
You're not wrong about robots.txt; you're wrong in a much broader way. There is in fact an extremely dangerous law that could easily ensnare what you're talking about.
> scrapers slow down their site and that's why they use robots.txt
A poorly written scraper may really slow down your site, especially if the site wasn't built to be scraped repeatedly. There should be something said about the frequency scrapers should follow (specified by the website owner via a robots.txt-like spec).
But website owners cannot demand unreasonable frequencies (such as once a year!), and what constitutes unreasonable is up for debate.
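As it happens, something along those lines already half-exists: the de facto Crawl-delay directive, which some crawlers honour and which Python's standard robotparser can read. A small sketch (the robots.txt content and bot name are invented):

    from urllib import robotparser

    # Hypothetical robots.txt a site owner might publish.
    ROBOTS_TXT = """\
    User-agent: *
    Crawl-delay: 10
    Disallow: /private/
    """

    rp = robotparser.RobotFileParser()
    rp.parse(ROBOTS_TXT.splitlines())

    # A polite crawler would wait this many seconds between requests.
    print(rp.crawl_delay("examplebot"))                                 # 10
    print(rp.can_fetch("examplebot", "https://example.com/private/x"))  # False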
I don't think a poorly written scraper would follow robots.txt rules according to spec anyway. So in any case the site should have other measures (rate limiting?) in place.
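For example, per-IP rate limiting in nginx is only a couple of lines (the zone name, size, and rate here are arbitrary; tune them to your traffic):

    # In the http{} block: track clients by IP, allow ~2 requests/second each.
    limit_req_zone $binary_remote_addr zone=perip:10m rate=2r/s;

    server {
        location / {
            # Permit short bursts, reject the excess (503 by default).
            limit_req zone=perip burst=10 nodelay;
        }
    }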
If a robot misbehaves, it'll either be blocked or a complaint will go to the network's abuse desk and that bot will be taken down. That a site could possibly have some kind of technical solution for this doesn't matter.
They only ever needed to honor the robots.txt as of the date of archival.
Archive.org fucked up by making robots.txt retroactive. If they used the archived robots.txt as a filter for a site at the relevant date, they'd have had the best of both worlds: respecting how sites work without losing how sites appeared at a given date.
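Mechanically that filter is not complicated: evaluate each snapshot against the robots.txt captured nearest to it, not against today's. A rough Python sketch, where get_archived_robots is a hypothetical lookup into the archive's own store rather than any real API:

    from urllib import robotparser

    def get_archived_robots(domain: str, date: str) -> str:
        """Hypothetical: return the robots.txt text that was captured for
        `domain` at (or nearest before) `date`."""
        raise NotImplementedError

    def snapshot_visible(domain: str, path: str, date: str) -> bool:
        # Judge the snapshot by the rules in force when it was taken.
        rp = robotparser.RobotFileParser()
        rp.parse(get_archived_robots(domain, date).splitlines())
        return rp.can_fetch("ia_archiver", f"https://{domain}{path}")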
They'd be flouting copyright laws, like Google does, but nonetheless tortiously. They're making a copy, which is already an infringement, and distributing it seemingly against the owner's express wishes is treated as a crime in some jurisdictions.
Robots.txt is explicit permission to make a copy (otherwise crawling would be meaningless), and that permission is not necessarily reversible. It's like putting up a yard sale sign, then trying to get the people who showed up yesterday arrested for trespassing.
What the archive can do after that point is a different issue, but they clearly can keep a copy. Further, if someone else is now using the domain, they don't necessarily have anything to do with the archived data.
Robots.txt is usually an explicit instruction for a robot not to crawl. But a robot crawling your site and an archive, cache, or duplicate of a page are all different propositions.
Google and others have extended robots.txt to grant explicit permission for crawling (Allow, Sitemap), meta tags can deny archiving, and various other means allow permission for caching to be explicitly denied.
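Concretely, the mechanisms I have in mind look like this (these are the common Google/Bing conventions; exact support varies by crawler):

    # robots.txt: opt a sub-path back in and advertise a sitemap
    User-agent: *
    Disallow: /private/
    Allow: /private/press-kit/
    Sitemap: https://example.com/sitemap.xml

    <!-- HTML: allow indexing but forbid cached/archived copies -->
    <meta name="robots" content="noarchive">

    # HTTP header equivalent for non-HTML resources
    X-Robots-Tag: noarchive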
To use your analogy of putting up a sign: not putting up a 'no trespassing' sign doesn't make trespassing legal.
FWIW I disapprove of this state of affairs and consider copyright to be hugely defective in these respects.
>but they clearly can keep a copy //
It's nuanced but permission to access a page =/= permission to keep a copy. Just as you have explicit permission to access a video on YouTube but in most jurisdictions will not have permission to download it for later (commercial) use.
Archive.org needs to be able to apply itself to itself. We could then use archive.org to view how archive.org presented some interesting site in the past, thus avoiding the whole retroactive robots.txt fail.
I remember writing a dumb parser for robots.txt. I have to agree: robots.txt is simplistic but so non-standard. I wonder why search engines can't just say NO to this. Do search engines today still honor robots.txt?
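For what it's worth, a "dumb parser" really can be a handful of lines, which is part of why every implementation differs. Roughly what I mean (handles only Disallow, and ignores Allow, wildcards, and per-agent groups):

    def parse_disallows(robots_txt: str) -> list[str]:
        # Naive: collect every Disallow path, ignoring User-agent groups.
        disallows = []
        for line in robots_txt.splitlines():
            line = line.split("#", 1)[0].strip()   # drop comments
            if line.lower().startswith("disallow:"):
                path = line.split(":", 1)[1].strip()
                if path:
                    disallows.append(path)
        return disallows

    def allowed(robots_txt: str, path: str) -> bool:
        return not any(path.startswith(p) for p in parse_disallows(robots_txt))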
It is great news in general, but seems to be done in a clumsy and counterproductive manner that may cause the Internet Archive to be banned from crawling some websites.
The problem: when robots.txt for a website is found to have been made more restrictive, the IA retrospectively applies its new restrictions to already-archived pages and hides them from view. This can also cause entire domains to vanish into the deep-archive. No-one outside IA thinks this is sensible.
Their solution: ignore robots.txt altogether. What? That will just annoy many website operators.
My proposed solution: keep parsing robots.txt on each crawl and obey it prospectively, without applying the changes to existing archived material. This is actually less work than what they currently do. If the new robots.txt says to ignore about_iphone.html, you just do that and ignore it. Older versions aren't affected.
Basically they're switching from being excessively obedient to completely ignoring robots.txt in order to fix a self-made problem. I can only see that antagonising operators.
There's some value in allowing site operators to retroactively remove content which was never intended to be public. A common and unfortunate example is backups (like SQL dumps) being stored in web-accessible directories, then subsequently being indexed and archived when a crawler finds the appropriate directory index.
What needs to be fixed first is just the really common case mentioned in the blog post, where a domain changes ownership and a restrictive robots.txt is applied to the parking page.
- If robots.txt appears later, stop archiving from that date forwards.
- Preserve access to old archived copies of the site by default.
- Offer a mechanism that allows a proven site owner to explicitly request retrospective access removal.
If archive.org has recorded the date it first observed a robots.txt on the sites currently unavailable, it could even consider applying the above logic retrospectively today. Perhaps after a couple of warning emails to the current Administrative Contact for the domain.
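One way to read that proposal as code (the field names and removal flag are invented; first_robots_seen would come from the archive's own crawl records):

    from datetime import date
    from typing import Optional

    def capture_accessible(snapshot_date: date,
                           first_robots_seen: Optional[date],
                           owner_requested_removal: bool) -> bool:
        # A proven site owner can always opt out retrospectively.
        if owner_requested_removal:
            return False
        # No robots.txt ever observed: everything stays visible.
        if first_robots_seen is None:
            return True
        # Otherwise keep access only to captures made before robots.txt appeared.
        return snapshot_date < first_robots_seen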
The logic is sound, and I see that it was mostly written in 2011, but I can also see it being harmful.
How about an IETF RFC to clarify?
Libraries operate under a lot of unwritten social conventions, perhaps even more than most other institutions. (robots.txt, even if largely ignored, is a popular convention.) Aggressive or confrontational wording, regardless of whether they are "right", doesn't seem to be in libraries' interest.
For example, I've got a link that does delegated login, like /login-with/github. When people click it, an OAuth flow starts. But it is useless for robots to follow, so I disallow it in robots.txt. If they follow it anyway, nothing breaks and it's not a security issue, but if I can avoid starting unnecessary OAuth logins, that's an additional benefit.
robots.txt wasn't created for security, but it can have security implications if you publish a list of Disallow paths with the intention of hiding sensitive content (sadly I have seen that happen a lot), whereas a better approach would be IP whitelisting and/or user authentication.
However I'm not claiming security is the only reason people use (misuse?) robots.txt. For example, in your case you could remove the need for a robots.txt with a nofollow attribute[1]. Sure, bad bots could still crawl your site and find the authentication URL without probing robots.txt, so the security implication there is pretty much non-existent. But you've already got a thoughtful design (the other point I raised) that mitigates the need for robots.txt anyway, so adding something like "nofollow" may be enough to remove the robots.txt requirement altogether.
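Concretely, the two approaches for that login link look like this (paths taken from the example above):

    # robots.txt approach: ask well-behaved bots to skip the OAuth entry point
    User-agent: *
    Disallow: /login-with/

    <!-- nofollow approach: mark the link so crawlers don't follow it -->
    <a href="/login-with/github" rel="nofollow">Sign in with GitHub</a>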
This is crazy, that's not what robots.txt is for. How can you complain about the security of a thing that is not meant to provide security?
According to your logic, newspapers are a "failed experiment because they rely on trust rather than security or thoughtful design". I published an article with my treasure map and told people not to go there, but they stole it.
That was an anecdote since the previous poster raised the point about security. I'm definitely not claiming robots.txt should be for security nor was designed for security!
I said that following proper security and design practices renders obsolete all the edge cases in which people might use robots.txt. I'm saying that if you design your site properly then you shouldn't really need a robots.txt. That applies to all the examples HN commenters have raised about their robots.txt usage so far.
I would rewrite my OP to make my point clearer but sadly I no longer have the option to edit it.
I'd already covered the security point replying to another poster (https://news.ycombinator.com/item?id=14163792) but just to be clear, I'm absolutely not claiming robots.txt is a security tool. Quite the opposite: I'm saying that following good security and design practices renders the robots.txt file obsolete.
Your point about sitemaps helps illustrate that point of mine, because having a decent sitemap mitigates the need for Allow lines in robots.txt. It's another area that robots.txt isn't well equipped to handle, and thus other, better tools have been built to highlight pages of interest to search engines.
Lots of laws are pretty similar. E.g. technically you could steal loads of things; practically, you don't. Defeating or ignoring mechanisms such as robots.txt (vs. maybe some security person in a store) still doesn't make stealing OK.
The morality of whether bots should obey robots.txt is a separate issue to the point I raised about how you shouldn't trust bots to obey them. To use your example of high street stores: shops have security tags on expensive items / clothing as a method of securing products from theft because you cannot blindly trust everyone not to steal (though wouldn't it be great if that wasn't the case). Equally websites cannot trust that bots will obey robots.txt. Which means any content that doesn't want to be crawled needs to be behind nofollow attributes or (if it's sensitive) user authentication layers and any content that does need to be indexed also needs to be in a sitemap. Once you have all of these extra layers implemented, the robots.txt becomes utterly redundant. Hence why I say it's a failed experiment. The benefits it offers are superseded by better solutions.
I came here to say something about respecting the wishes of others, etc, but you know what? You're absolutely right. We shouldn't even need to have a conversation about trust and respect.
If you don't want your personal content indexed by scrapers and archivers, that should be non-negotiable, and it should be enforced by design. As it stands, it's a broken system.
I'm not so sure that even Google respects it. I did some digging into the semantics of robots.txt whilst writing a bot myself, and it seems that Google doesn't follow links that are excluded, but it will still visit those pages. Maybe that counts as "paying attention", but I don't think they "respect" it.
Forgotten by whom? Who judges what is "forgotten" and when? How is it enforced?
The specifics here matter a great deal; the versions so far are regularly abused by the wealthy and don't apply to any of the data warehouses that the powerful and well-connected have access to.
Where did this "right" come from? What's the legal and ethical basis for it? It is analogous to censorship or book burning at the basic level, destroying information to hide it from the public. It requires a consistent and strong justification as well as justified limited scope because of that, and it better be obviously beneficial to society even accounting for the inevitable misuse by those in power.
In my (current) opinion, it's this law that should be forgotten. What's on the public Internet is a matter of public interest. All I can see is this law being used by bad people to hide their bad deeds, especially when those bad deeds should be known.
Yeah, just ignore robots.txt, because there are other solutions.
If a site doesn't want to be scanned, it can adopt a lot of countermeasures, and robots.txt will not save it from abuse.
This reminds me of the old days when my website didn't work from the US, because I just faked that the site was down; there was no reason for somebody from the US to visit my site (I know it's kind of stupid, but when all your content is in French and you are a kid... :) )
Found this[1] via the Wikipedia Talk page for the robots.txt article. It shows that early on, robots.txt was designed to help preserve the bandwidth and performance of web servers. Back then it would have been due to bandwidth contention; today it may be bandwidth cost for some operators, which robots.txt helps mitigate.
I have some sites where I specifically block archiving of some sections for good reason. (Even if I didn't have a good reason, though, it would still be my choice.)
I have a very big problem with them disregarding robots directives. Sure, some crawlers ignore them: hostile net actors up to no good. This decision means they are now a hostile net actor too. I'll have to take extreme measures, such as determining all the IP address ranges they use and blocking access entirely. This inconveniences me, which means they are now my enemy.
edit- For those interested:

    Deny from 207.241.224.0/22
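(That's the Apache 2.2 / mod_access_compat syntax; on Apache 2.4 the equivalent for the same range would be something like:)

    <RequireAll>
        Require all granted
        Require not ip 207.241.224.0/22
    </RequireAll>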
Are you under the impression that individual web archivists don't also scrape websites of interest, and submit those WARCs for inclusion in the Wayback Machine, independently of the IA's crawlers?
Because believe me, we do. Good luck banning every AWS and DO IP range.
I actually didn't know that. Do you operate the same crawlers?
I have considered putting up a single file that is only reachable via nofollow links, and perma-banning any IP that accesses it, as a way to punish bad robots.
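A rough sketch of that trap, Flask-flavoured and purely illustrative (the path and the in-memory ban set are placeholders; in practice you'd feed the IP to a firewall):

    from flask import Flask, abort, request

    app = Flask(__name__)
    BANNED = set()

    @app.before_request
    def reject_banned():
        if request.remote_addr in BANNED:
            abort(403)

    # This path is disallowed in robots.txt and only linked with rel="nofollow",
    # so only a badly behaved robot should ever request it.
    @app.route("/bot-trap/secret.html")
    def trap():
        BANNED.add(request.remote_addr)   # perma-ban on first touch
        abort(403)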
I have an easier solution for you: just shut down your site and be done with it. That way no malicious actor will be able to save your precious information.