Poster: | Fizscy | Date: | Dec 27, 2011 8:29am |
Forum: | web | Subject: | Why does the wayback machine pay attention to robots.txt |
The wayback machine is exempt from copyright issues under fair use doctrine and due to its educational purpose.
Please stop ignoring websites because of ignorant, uninformed, or possessive webmasters.
Poster: | athmanb | Date: | Nov 30, 2016 1:40am |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
And I agree that the way archive.org currently handles this file is pretty dumb.
If the original website owner gave consent to have his data archived, only he should be able to rescind it. Allowing a domain squatter to censor other people's work is not a positive thing.
This even further encourages squatters, since now they don't only hold the domain name hostage. The entire past existence of the website is erased until they get paid.
A robots.txt should definitely be respected during archival itself, but retroactive changes should not be accepted. If somebody wants to remove personal data, there could be legal alternatives modeled on the EU "right to be forgotten", which would require the requester to prove his identity.
As an addendum, archive.org doesn't even handle the robots.txt file correctly. In the one I'm currently looking at, only "baiduspider" and "ips-agent" are banned, yet archive.org still refuses to show site content.
Poster: | peterdaly | Date: | Jan 14, 2017 6:00am |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
I'm a novice here. I am currently involved in a court case and could really do with access to a particular website through the Internet Archive, but the robots.txt problem has now stopped me from accessing the archive, so I can't make a nice little video for court.
The domain is currently with GoDaddy and I can purchase it for £600 ish. I'm just wondering whether I would then be able to remove the robots.txt and regain access to the old screenshots??
Regards
Pete
Poster: | MeditateOrDie | Date: | Apr 24, 2017 10:05am |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
This post was modified by MeditateOrDie on 2017-04-24 17:05:09
Poster: | TechLord | Date: | Jul 4, 2018 12:26pm |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
You phrased my deep opinion so well.
When you wrote this, ytimg actually abused robots.txt, but this stopped at 20120229.
Poster: | #Danooxt3 | Date: | Mar 14, 2016 2:54pm |
Forum: | web | Subject: | Robots.txt |
This message sucks.
I'm totally with you bro!
Poster: | Gameboy Genius (nitro2k01) | Date: | Jan 21, 2016 2:13pm |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
This post was modified by Gameboy Genius (nitro2k01) on 2016-01-21 22:13:25
Poster: | Hobbyboy | Date: | Nov 14, 2014 6:43am |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
When a hostmaster adds a robots.txt, it blocks the whole site on the internet archive from being viewed, including the archived versions, which ends up breaking references from other websites.
Also, it stops you from being able to find a copy of old software that isn't available to download anywhere anymore. For example, I was trying to look for a copy of Ubuntu Studio 8.04, which wasn't on the Ubuntu archive for some reason, but the internet archive had it mirrored. If the Ubuntu archive added a robots.txt, it would be unavailable to download anywhere. Robots.txt is basically putting history in a locked up room, and throwing away the key.
Poster: | Thestral | Date: | Apr 25, 2014 12:50pm |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
This post was modified by Thestral on 2014-04-25 19:44:37
This post was modified by Thestral on 2014-04-25 19:50:33
Poster: | DKL3 | Date: | Apr 9, 2014 3:34pm |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
There is a similar dilemma where a webmaster wants their site excluded from the Wayback Machine (e.g. Nintendo of Europe). Plain ridiculous stuff right there. The UK only had a temporary feud with archive.org, and some sites are still blocked because of it.
This life is clearly losing its edge.
Poster: | carehart | Date: | Jun 14, 2016 2:57pm |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
This post was modified by carehart on 2014-04-20 00:52:43
This post was modified by carehart on 2016-06-14 21:57:40
Poster: | carehart | Date: | Apr 19, 2014 6:04pm |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
http://archive.org/about/exclude.php
So in fact, this could be contributing to the problem. (I still think many of the blocks could be simply because people are naively blocking all spiders, or blocking Alexa but not realizing it's about archive.org).
What I mean is: why don't the archive.org and Alexa folks come up with *ANOTHER directive* that they tell site owners here to put in, which could, for instance, distinguish between whether they want their site crawled going forward and whether they want their past content to remain in the archive?
I just really suspect that at least for some who WOULD read this page, they may well opt for blocking crawling going forward without removal of past archived content.
To any who would say "but there's no robots.txt standard directive that would suit this", I would point out that the robots.txt "standard" is pretty wishy-washy. There are plenty of "standard" directives that some crawlers don't honor. And that means that there are plenty of directives people add to their robots.txt that are ignored by spiders.
More to the point, I mean that if the folks here/alexa came up with directives that were meaningful only to the alexa crawler and specific to archive.org, there would be no harm if folks added them. Because again, these would not be the only directives they may add which would not be meaningful to ALL spiders (that look at the file).
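To make the proposal concrete, here is a minimal sketch of how an archive-side reader might pick up such directives. The directive names `Crawl-Future` and `Retain-Past` are purely hypothetical (no crawler is known to read them), and the yes/no values and the defaults are assumptions made for illustration.

```python
from typing import Dict

# HYPOTHETICAL directive names and defaults; not part of any robots.txt convention.
DEFAULTS: Dict[str, bool] = {"crawl-future": True, "retain-past": True}


def archive_preferences(robots_txt: str, agent: str = "ia_archiver") -> Dict[str, bool]:
    """Read the hypothetical directives from the record addressed to `agent`."""
    prefs = dict(DEFAULTS)
    in_record = False        # are we inside a record that names `agent`?
    prev_was_agent = False   # consecutive User-agent lines share one record
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip().lower() for part in line.split(":", 1))
        if field == "user-agent":
            matches = value == agent.lower()
            in_record = (in_record or matches) if prev_was_agent else matches
            prev_was_agent = True
        else:
            prev_was_agent = False
            if in_record and field in prefs:
                prefs[field] = value == "yes"
    return prefs


SAMPLE = """\
User-agent: ia_archiver
Crawl-Future: no
Retain-Past: yes

User-agent: *
Disallow: /
"""

if __name__ == "__main__":
    print(archive_preferences(SAMPLE))
```

Crawlers that do not know these lines would simply skip them, which is the "no harm" point made above.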
The robots.txt concept is more a "convention" than a "standard", as there is no official standards body. More at http://en.wikipedia.org/wiki/Robots_exclusion_standard. So I really think there'd be no harm in my proposal, "unconventional" though it may be.
Is there anyone following this thread who may be in a position of responsibility to help us know if that might ever even be considered or discussed?
Poster: | DKL3 | Date: | Apr 20, 2014 11:25am |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
That's all it simply is.
Poster: | archivefcc | Date: | Apr 2, 2015 11:36pm |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
Why in the world does http://archive.org/about/exclude.php not ALSO explain how to ALLOW archiving by the internet archive while preserving blocks for other crawlers?
According to http://www.robotstxt.org/robotstxt.html, it would be something like:
User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
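One way to sanity-check rules like these is Python's standard `urllib.robotparser`. This is only a sketch: it shows how that particular parser reads the file, not how the Internet Archive's own crawler does. The agent name `SomeOtherBot` and the example.com URL are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Same rules as quoted above, with a blank line separating the two records.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
"""


def allowed(agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # An empty "Disallow:" means nothing is off limits for that agent.
    print("ia_archiver: ", allowed("ia_archiver", "http://example.com/page.html"))
    print("SomeOtherBot:", allowed("SomeOtherBot", "http://example.com/page.html"))
```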
Poster: | fyiman | Date: | Jun 14, 2016 12:18am |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
I wanted to point out that the link you posted to the archive.org FAQ page has an erroneous (trailing) "." (period) at the end of the URL, which causes the page to open showing the TOP of the page, rather than showing the desired (anchored) section.
Here is the link (without the extra trailing period):
https://archive.org/about/faqs.php#14
I don't know if you are given the option to edit your post but if you are able to do so, perhaps you could consider correcting the link.
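For anyone wondering why one extra character matters: the part after `#` is the fragment, and a browser typically needs it to match the page's anchor exactly. A small sketch with Python's standard `urllib.parse` shows what each version of the link asks for (the helper name `anchor_of` is made up for this example).

```python
from urllib.parse import urlsplit


def anchor_of(url: str) -> str:
    """Return the fragment (the part after '#') of a URL."""
    return urlsplit(url).fragment


if __name__ == "__main__":
    print(anchor_of("https://archive.org/about/faqs.php#14."))  # 14.
    print(anchor_of("https://archive.org/about/faqs.php#14"))   # 14
```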
Poster: | carehart | Date: | Jun 14, 2016 3:42pm |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
This post was modified by carehart on 2016-06-14 22:42:59
Poster: | Infenwe | Date: | Jul 19, 2014 12:36pm |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
That seems like a distinctly suboptimal design decision to me. One way that it seems likely to happen is this:
1) up to 2005-ish: site is held by its original owners and is doing reasonably well.
2) 2006-ish: site goes under. Domain taken over by squatter.
3) 2014-ish: Squatter's account gets suspended due to abuse. The people who put up the suspension notice also put in a /robots.txt to disallow crawling of /.
And voilà! Legitimate content that no one ever wanted to get purged from archive.org is suddenly gone.
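The timeline above can be sketched in a few lines. This is an illustration only, not the Wayback Machine's actual code: it compares the retroactive behaviour described in this thread with the crawl-time rule other posters suggest. The concrete dates stand in for "2005-ish" and "2014-ish" and are assumptions.

```python
from datetime import date
from typing import Optional


def visible_retroactive(capture: date, live_robots_blocks: bool) -> bool:
    """Retroactive rule: today's robots.txt decides for every capture, old or new."""
    return not live_robots_blocks


def visible_crawl_time(capture: date, blocked_since: Optional[date]) -> bool:
    """Crawl-time rule: only captures made after the block appeared are hidden."""
    return blocked_since is None or capture < blocked_since


if __name__ == "__main__":
    original_site = date(2005, 6, 1)   # step 1: capture of the original site (illustrative)
    robots_added = date(2014, 1, 1)    # step 3: suspension notice plus robots.txt (illustrative)

    print("retroactive:", visible_retroactive(original_site, live_robots_blocks=True))
    print("crawl-time: ", visible_crawl_time(original_site, blocked_since=robots_added))
```

Under the retroactive rule the 2005 capture disappears; under the crawl-time rule it stays viewable.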
Case in point: http://web.archive.org/web/20070103112847/http://www.infoceptor.com/
Poster: | DKL3 | Date: | Jul 20, 2014 4:57am |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
This post was modified by DKL3 on 2014-07-20 11:57:26
Poster: | '=-/-=-/=#- | Date: | Jul 30, 2013 4:18pm |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
Poster: | Andy The Penguin Friend | Date: | Nov 5, 2014 11:22am |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
One example of a (slightly) happy ending to this problem: the official Heart of Darkness site (Heartofdarkness.com) had been around since the mid-1990s. It went down sometime around 2004, but the squatters who took the domain didn't add a robots.txt to prohibit web crawling until far later (sometime after 2008).
I of course was devastated. I wanted to revive the site as a fan tribute but was unable to access the archived information through wayback like that. However, I found who owned the domain by going through GoDaddy (I intended to buy it back before they said they'd only sell it for $5000. Yikes.) I explained that the robots.txt was making it impossible for me to view archived data from previous years, and that it broke my heart that I couldn't see it, and they altered the robots.txt so that I could see it again.
If wayback archived the site before the robots.txt appeared, the addition of robots.txt DOES NOT delete the content; it just makes it inaccessible. I encourage people to contact the current domain owners about it. That guy was very nice to me and was happy to help.
Unfortunately, though, the HOD site was also on "AmazingStudio.com", and there's a whole new can of worms because some of the newer files were hosted on that domain instead of heartofdarkness.com. I can't figure out why files from there show up as "forbidden" through wayback.
Anyways, it's a shame that robots.txt has to exist. What's its original intended use anyway?
Poster: | user001 | Date: | Nov 6, 2014 7:24am |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
Poster: | Detective John Carter of Mars | Date: | Dec 27, 2011 3:01pm |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
"The Internet Archive is not interested in preserving or offering access to Web sites or other Internet documents of persons who do not want their materials in the collection."
Poster: | PiRSquared | Date: | Sep 6, 2014 9:20pm |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
Poster: | d0c5i5 | Date: | Jan 21, 2015 2:32pm |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
Why hasn't this been fixed? I used to find so many things that I can't find anymore because these domain pirates are buying up barely used/forgotten/lapsed domain names and often put in robots.txt (along with countless USELESS ads to nowhere)...
Look, I love collecting old hardware or resurrecting old hardware from countless places and doing stuff with it. As with so many Linux/GNU projects, there may be few or scarce references to how it was done, pieces of code, or even small downloads that are completely worthy of being preserved, but as the hardware ages (or the authors literally die), this data gets erased from history, and I'm often left with links to source code/downloads/whatever referenced in forums that point to what was free/open data (even LICENSED as distributable, if GNU/GPL applies, so I doubt the new owner trying to make a buck off all the people that could end up on the domain they snatched has any more claim than I do)....
Hmmm... If I were to name my kid "Disney", and disney died/forgot to fill out a form, etc, would/could I ever wipe out all of the Disney movies from history?
Poster: | d0c5i5 | Date: | Feb 21, 2015 1:26pm |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
As for how this should be handled: imho, robots.txt should only be honored at crawl time. Period. (Especially if the robots.txt wasn't there back on the crawled date.)
If someone wants to remove OLD data for a domain they now own AND they owned in the past, then they should do the leg work. Archive.org could offer a service where if you provide specific proof of ownership, possibly a legitimate claim for why it should be removed, and perhaps a fee to pay a trusted 3rd party to evaluate your request, then and only then, should they consider removing the records.
I just think about this, and fast forward 50 years, and the amount of both unintentional and intentional censorship that will happen, and it makes me sad. I know we are moving into the future, but I think archive.org is one of the shining examples of why the past matters, and it shouldn't be wiped away without a reason.
my 2c,
d0c
Poster: | PiRSquared | Date: | Jan 21, 2015 2:51pm |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
Poster: | rin-q | Date: | Jan 24, 2015 6:23pm |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
So the domain has been bought by a reseller, and since a robots.txt file has been added, none of the information that was available two years ago can be reached via the Wayback Machine.
So a good example website would be obakemono dot com.
A big loss for those interested in Japanese folklore, sadly.
Poster: | PiRSquared | Date: | Mar 12, 2018 10:39am |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
This post was modified by PiRSquared on 2018-03-12 17:39:04
Poster: | billybiscuits | Date: | Oct 22, 2017 2:47pm |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
Poster: | rin-q | Date: | Jan 27, 2015 7:06pm |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
This post was modified by rin-q on 2015-01-28 03:06:38
Poster: | PiRSquared | Date: | Jan 27, 2015 7:33pm |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
Poster: | rin-q | Date: | Jan 28, 2015 10:03am |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
This post was modified by rin-q on 2015-01-28 18:00:57
This post was modified by rin-q on 2015-01-28 18:03:35
Poster: | Goyllo | Date: | Sep 18, 2016 1:46am |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
Poster: | d0c5i5 | Date: | May 25, 2019 8:46pm |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
It would be very simple, and do less harm and more good, to handle this another way. I could probably come up with a dozen easy, effective ways for people who are running a site to prevent copying and protect their content rather than relying on someone ELSE to do their work for them. If it is so important, then the extra step wouldn't be difficult.
This issue of a robots.txt policy here stomping over OLDER sites owned by people who held the registration in earlier years is a key point of contention. If I move to an address and publish something with reference to that address, only the stuff I create should be affected or considered, not anything from the person who lived there before. If I don't want people to copy my book, then they can't, but there is such a thing as fair use and a right to reference specific points of data in my content, and no rule/policy I dream up should affect content published by the previous tenant at the same address.
One simple method that would need no changes to the current setup on behalf of the website owners is to simply use existing data to see if robots.txt was there, and for those dates that precede it only have the policy hold for registrations which did not change during that time period (beyond the obvious changes in renewal date).
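As a rough sketch of that rule (one reading of it, not anything archive.org is known to implement): a capture that predates the robots.txt stays hidden only if the registration did not change hands between the capture and the first appearance of the robots.txt. The ownership-change dates are assumed to come from some registration history; how to obtain them is left open here.

```python
from datetime import date
from typing import List


def hidden_by_robots(capture: date, robots_first_seen: date,
                     ownership_changes: List[date]) -> bool:
    """Decide whether a blocking robots.txt should hide a given capture."""
    if capture >= robots_first_seen:
        return True  # captured while the block was already in place
    # Older capture: honour the block only if the same registrant held the
    # domain the whole time between the capture and the robots.txt appearing.
    changed_hands = any(capture < change <= robots_first_seen
                        for change in ownership_changes)
    return not changed_hands


if __name__ == "__main__":
    changes = [date(2006, 3, 1)]  # illustrative: domain taken over in 2006
    block = date(2014, 1, 1)      # illustrative: robots.txt first seen in 2014
    print(hidden_by_robots(date(2005, 6, 1), block, changes))  # previous owner's content
    print(hidden_by_robots(date(2010, 1, 1), block, changes))  # current owner's content
```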
A change _with_ a website's involvement could be as simple as publishing the IP addresses of the bots and simply denying access.
Another idea would be to honor robots.txt only so long as an md5 hash of the named registration owner/address, and perhaps the domain registration expiration date, added to the robots.txt, is present.
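A minimal sketch of how such a check might look, assuming the hash is carried on a comment line in robots.txt. The `# owner-md5:` marker, the field order, and the normalisation are all made up for illustration. MD5 is used only because the post names it; a stronger hash could be swapped in.

```python
import hashlib

# HYPOTHETICAL marker line; not part of any robots.txt convention.
TOKEN_PREFIX = "# owner-md5:"


def owner_token(owner: str, address: str, expires: str) -> str:
    """MD5 hex digest over the registrant's name, address, and expiration date."""
    record = "|".join(part.strip().lower() for part in (owner, address, expires))
    return hashlib.md5(record.encode("utf-8")).hexdigest()


def robots_is_vouched(robots_txt: str, owner: str, address: str, expires: str) -> bool:
    """True if robots.txt carries a token matching the registration record."""
    expected = owner_token(owner, address, expires)
    for line in robots_txt.splitlines():
        line = line.strip()
        if line.lower().startswith(TOKEN_PREFIX):
            return line[len(TOKEN_PREFIX):].strip().lower() == expected
    return False  # no token present: the block would not be honoured


if __name__ == "__main__":
    token = owner_token("Jane Doe", "1 Example St", "2020-01-01")
    robots = f"{TOKEN_PREFIX} {token}\nUser-agent: *\nDisallow: /\n"
    print(robots_is_vouched(robots, "Jane Doe", "1 Example St", "2020-01-01"))
    print(robots_is_vouched(robots, "Someone Else", "1 Example St", "2020-01-01"))
```

Because the token changes whenever the registrant or expiration date changes, a new owner's robots.txt would not match the old registration record.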
One other idea that could be thrown into the mix: apply the live version of a robots.txt only to a strict period of, say, 12 months back from the current date. That should give owners ample time, and a simple/easy way, to immediately address any issues. It could be paired with a policy of denying blanket domain access to the entire history for the same 12-month period the first time a robots.txt shows up.
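The wording leaves room for interpretation, so the following is a sketch of just one reading: while a blocking robots.txt is less than 12 months old, the whole history is hidden; after that, only captures from the most recent 12 months stay hidden. The 365-day window and the function name are assumptions.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=365)  # "say 12 months"; the exact length is an assumption


def hidden(capture: date, today: date, robots_first_seen: date) -> bool:
    """One reading of the proposal, for a domain whose live robots.txt blocks crawling."""
    if today - robots_first_seen <= WINDOW:
        return True  # grace period: the entire history is hidden at first
    return today - capture <= WINDOW  # afterwards only recent captures stay hidden


if __name__ == "__main__":
    first_seen = date(2014, 1, 1)  # illustrative date
    print(hidden(date(2005, 6, 1), date(2014, 6, 1), first_seen))  # inside grace period
    print(hidden(date(2005, 6, 1), date(2016, 1, 1), first_seen))  # old capture, later on
    print(hidden(date(2015, 9, 1), date(2016, 1, 1), first_seen))  # recent capture
```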
The problem is NOTHING is being done, and it's making me look for alternatives that aren't afraid to do the right thing simply out of fear of litigation.
Poster: | jory2 | Date: | Dec 28, 2011 6:46am |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
Did you somehow miss (or completely misunderstand) all the educational materials and websites made available on the subject of Copyright Law and what is considered a "fair-use" of copyright protected works?
You typed and spelled the words "fair use doctrine" correctly; did you bother to read the guidelines?
"Please stop ignoring websites because of ignorant, uninformed, or possessive webmasters."
That should go over well.
Poster: | Fizscy | Date: | Dec 28, 2011 7:42am |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
"The purpose and nature of the use.
If the copy is used for teaching at a non-profit institution, distributed without charge, and made by a teacher or students acting individually, then the copy is more likely to be considered as fair use."
The web archive is not a search engine crawler or similar robot, yet it seems to follow disallow requests for search crawlers just the same.
Second, the adding of that robots.txt has absolutely ZERO effect on the copyright and the fair use of the site. Nothing, nada, zip, zilch.
Third, domains change hands. Using a robots.txt file today to erase all previous copies on the archive is ridiculous, especially since the copies may be of a different site.
The archive should only exempt sites that have specifically requested, to archive.org by email, that their website not be indexed.
Poster: | jory2 | Date: | Dec 28, 2011 8:39am |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
Good, then I'll assume you're aware that "fair-use" is restricted to the U.S. Copyright Act and not the Canadian Copyright Act. And for what it's worth my field of study is Copyright and Intellectual Property Law.
"The purpose and nature of the use."
I'll assume you understood that to mean that not all Works can be argued under "fair-use"? Unless you applied your own special meaning to the fair-use clause of the U.S. Copyright Act(s)?
"If the copy is used for teaching at a non-profit institution, distributed without charge, and made by a teacher or students acting individually, then the copy is more likely to be considered as fair use."
I'll assume you're aware this website is privately owned and operated and receives private funds on top of government funds, and of course has the Archive-It paid service. This website is considered a non-profit commercial website. It is not legally considered a Library and because of that will not be able to apply the limitations for Libraries as detailed in both the U.S. and Canadian Copyright Acts.
"The web archive is not a search engine crawler or similar robot, yet it seems to follow disallow requests for search crawlers just the same."
What's your point?
"Second, the adding of that robots.txt has absolutely ZERO effect on the copyright and the fair use of the site. Nothing, nada, zip, zilch."
I'll assume you understood that content owners are not legally obligated to put a robots.txt file on their sites to prevent copyright violations.
Unless you have your own special meaning to that as well?
"Third, domains change hands. Using a robots.txt file today to erase all previous copies on the archive is ridiculous, especially since the copies may be of a different site."
This website is not simply copying the name of the domain; this website is making copies of the intellectual properties on privately owned websites without the express permission of the rightful copyright owners.
"The archive should only exempt sites that have specifically requested, to archive.org by email, that their website not be indexed."
This website should only be making copies of websites that they received permission to copy in the first place.
Poster: | Thestral | Date: | Apr 25, 2014 1:12pm |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
This post was modified by Thestral on 2014-04-25 20:12:02
Poster: | Mr Cranky | Date: | Dec 28, 2011 11:28am |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
Poster: | jory2 | Date: | Dec 30, 2011 8:58am |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
I did come across the Internet Archive's stance on robots.txt files however.
Starting January 2010, Archive-It is running a pilot program to test a new feature that allows our partners to crawl and archive areas of sites that are blocked by a site's robots.txt file.
"Partners who have a need for this feature should contact the Archive-It team to let us know what sites and why you would like to use this feature. It would be helpful to know if you have previously contacted the site owner about allowing our crawler to crawl their sites, and what their response was (if any). We ask our partners to use this feature only when necessary. Also, please keep in mind that many things that are blocked by robots.txt are parts of a site that you wouldn't necessary want to archive, so please be sure to review the urls that are blocked in the 'Hosts Report' for your crawl to determine if you need this feature or not."
Oddly enough, this stance seems to be a complete 180 on this website's TOS.
Poster: | jory2 | Date: | Dec 28, 2011 12:13pm |
Forum: | web | Subject: | Re: Why does the wayback machine pay attention to robots.txt |
I am curious, though: did you find it to be an interestingly humorous read, like the misunderstandings that play out in the forums on this website with respect to copyright, fair use, and the legal definitions of libraries?