Ask HN: What’s the legality of web scraping?
124 points by malshe 7 hours ago | 75 comments
I teach machine learning applications to masters students. Many students ask me whether it’s legally OK to scrape websites without using an API and use the data in their course projects. I usually just direct them to use APIs with authentication or use tabular datasets on Kaggle, data.world, etc., because I’m not a lawyer and I don’t know the legality of web scraping. The most relevant article I know is from the EFF (https://www.eff.org/deeplinks/2018/04/scraping-just-automated-access-and-everyone-does-it) but it’s more than a year old.

Can anyone who knows the law please guide me on this issue? Note that the concern is less about what’s ethical and more about what’s legal. This will also help me in my research because these days some reviewers are raising this concern when they see authors used web scraped data. Online there are a ton of opinion pieces but nobody is clear on the legal side of it. Mostly people oppose scraping because they think it’s unethical.






The current state of the art is hiQ v. LinkedIn:

https://www.eff.org/cases/hiq-v-linkedin

Basically: if it's publicly visible, you can scrape it.

Caveat: the case is still making its way to the Supreme Court.

Edit: There's also Sandvig v. Sessions, which establishes that scraping publicly available data isn't a computer crime:

https://www.eff.org/deeplinks/2018/04/dc-court-accessing-pub...

Edit2: Two extra common sense caveats:

- Don't hammer the site you're scraping, which is to say don't make it look like you're doing a denial of service attack.

- Don't sell or publish the data wholesale, as is -- that's basically guaranteed to attract copyright infringement lawsuits. Consume it, transform it, use it as training data, etc. instead.


As someone who used to run a heavily trafficked and heavily scraped site, some tips from an operator:

- Make sure your scraper has both a reasonable delay (one request per second or slower) and a proper backoff. If you start getting errors, back off. We never cared about scrapers until we noticed them, and we only noticed the ones that hit us too hard, were told to back off, and then didn't.

- Look deep for an API. A ton of people would scrape reddit without realizing we had an (admittedly poorly marketed) API for doing just that.

- Respect robots.txt. That was another way to get noticed quickly -- hitting forbidden URLs. If you hit a forbidden URL too often, you'd start getting 500 errors, and if you didn't back off, you'd get banned from using the site. It was an easy way to tell if someone was not a well behaved scraper.
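The first tip above (a one-request-per-second delay plus a proper backoff) can be sketched in Python. This is illustrative only: the delay values, user agent string, and the choice of which HTTP codes count as "back off" signals are my own assumptions, not anything the operator specified.

```python
import time
import urllib.error
import urllib.request

BASE_DELAY = 1.0   # seconds between requests; the operator suggests >= 1s
MAX_DELAY = 60.0   # cap so the backoff doesn't grow without bound

def backoff_delay(consecutive_errors, base=BASE_DELAY, cap=MAX_DELAY):
    """Exponential backoff: 1s, 2s, 4s, ... capped at `cap`."""
    return min(base * (2 ** consecutive_errors), cap)

def polite_fetch(urls, user_agent="my-course-scraper/0.1 (contact@example.edu)"):
    """Fetch URLs slowly, backing off whenever the server starts erroring."""
    errors = 0
    for url in urls:
        time.sleep(backoff_delay(errors))
        req = urllib.request.Request(url, headers={"User-Agent": user_agent})
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                errors = 0          # success: return to the base delay
                yield url, resp.read()
        except urllib.error.HTTPError as e:
            if e.code in (429, 500, 503):   # plausible "back off" signals
                errors += 1
            else:
                raise
```

The identifying user agent doubles as the "way to contact its owner about abuse" that a later comment mentions.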


#3 respect robots.txt

This is a polite thing to do, but I don't think that there is any legal precedent for it being an actual requirement. Notably, both Apple and the Wayback Machine publicly disregard robots.txt files [1]. I would be very curious to read any court ruling that determined a robots.txt file needs to be respected.

[1] - https://intoli.com/blog/analyzing-one-million-robots-txt-fil...
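Whatever its legal status, checking robots.txt is nearly free with Python's standard library. The sample robots.txt below is made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for illustration.
robots_txt = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())
# Against a live site you would load the real file instead:
#   rp.set_url("https://example.com/robots.txt"); rp.read()

print(rp.can_fetch("my-bot", "https://example.com/private/page"))  # False
print(rp.can_fetch("my-bot", "https://example.com/public/page"))   # True
print(rp.crawl_delay("my-bot"))                                    # 5
```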



They look at them, but they don't follow them strictly [1]. They make judgement calls on what they should do rather than treating robots.txt files as a legal contract.

[1] - https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...


It's a pity that robots.txt doesn't let you specify what the crawler can do with the resources it's allowed to fetch. I think that if we had such a feature (or something similar, like a "License" header) standardized early enough, a few issues regarding crawling and search engines would be moot, or at least easier to solve automatically.

True but all the commercial websites would use it to ban scraping then.

If we're talking about being polite, then #4 respect the TOS. Especially requests per minute.

It’s the ToS itself that is legally tenuous, so your best bet is to completely ignore it. There’s no picking and choosing parts of it. Ignore all of it or implicitly accept all of it.

This being the top comment, it must be noted that HiQ v. LinkedIn is very much the exception to the well-established rule.

I'm not a lawyer but I did receive a C&D from a Fortune 100 that ultimately shut my project down. I was not selling or exposing any data directly -- it was purely consumed on the back end.

I was not hammering their site, but aggregating and caching requests such that people who used my project ultimately had orders-of-magnitude lower impact than they would've had otherwise.

The data we were sampling was fundamentally non-copyrightable in the US per Feist v. Rural Telecom; just a compendium of places, dates, and times (in the EU, raw data without substantial creative components is copyrightable), but because it was on their servers, and because we had to extract it from an HTML page that constituted a creative work, the CFAA and the Copyright Act were against us.

I talked to many different lawyers, including lawyers who had successfully defended companies from scraping-related lawsuits, and they all told me, unanimously, that it was hopeless. The law and the legal precedent are 100% in favor of the site being scraped. Essentially, it may not be illegal until they tell you to stop, but after that, it's unquestionably illegal. There is no public right-of-way on the internet.

My case is by no means unusual; it happens to several small companies on a daily basis, and it's a critical component in the ability of BigTechCos to maintain their walled gardens and effectively use legal mechanisms to route around the web's inherent distributed properties. All this "decentralized internet" stuff misses the point that the decentralization is not a technical problem, but a legal and social one.

Eric Goldman's blog [0] is a great resource that has consistently followed law related to scrapers for several years. He discusses hiQ v. LinkedIn at [1].

----

The applicable federal statutes, which are primarily the CFAA and the Copyright Act, don't leave much wiggle room at all on this topic, and neither does the overwhelming majority of case law. Precedents established in the 80s like MAI v. Peak have been consistently misapplied to screen scraping.

There are two particularly onerous prongs of the law here: first, the CFAA's "authorized access" stipulations, and second, interpretations of the Copyright Act that hold RAM copies of data are sufficiently tangible to be potentially-infringing.

The CFAA makes it both a crime and a tort to ever access a server in a manner that "exceeds authorized access" -- essentially, as soon as the company indicates that they don't want you to talk to them, if you talk to them again, you're dead meat (craigslist v. 3taps among others).

Most companies include boilerplate in their Terms of Service saying that the site cannot be accessed by any automated means. They generally argue, successfully, that you were thereby on notice regarding the extent of your authorized access as soon as you did anything that constitutes enactment of that contract -- which generally means accessing anything beyond the front page of the site ("clickwrap" or "linkwrap"), and almost certainly means anything that involves logging in, submitting forms, etc.

Re: the Copyright Act -- until it's modified to clarify that RAM copies are not independent infringements and to enshrine the rights of users to extract their own copyrighted content from another's copyrighted wrapper, it's going to be a potential infringement every time your software downloads someone's page. The real-world analog of the "RAM Copy doctrine", as it's called, would be that every time your eye reflects the image of a copyrighted work into your brain, you've made a new infringing copy. When it gets to court, that's what scrapers deal with -- and they almost always lose.

On the API front you may be able to argue that a simple JSON structure isn't sufficiently creative to qualify for copyright protection, but that would be blazing a new trail (and still leaves the CFAA to worry about). In almost all cases, with something as complex as the JavaScript and HTML that you get from $ANYWEBSITE.com, just loading it on an unapproved device is probably an infringement. That each digital load/transform is a potential infringement is how you hear about millions of infringements in file sharing cases, etc., because they're claiming that each time you copied that data from your hard drive into your RAM, it was a new independent infringing copy.

Seriously, sit down and read the law, and then read the dozens of cases where this has been litigated previously. HiQ v. LinkedIn is a very limited anomaly in this pantheon, still very early in the cycle, and NO ONE should be taking it as a guiding star, at least not until it hits the Supreme Court and they come down reversing all the old precedent around this.

If you are going to build a business that depends on scraping, ONLY do so with the backing of mega-well-funded VCs, etc., who are able and willing to take on the powerful lobbies, and who are funding your company at least as much for its potential to break legal precedent as for its commercial viability.

Final note: expect no help from FAANG et al on this. Without the CFAA, their walled gardens are dead in the water. It is a critical tool used by MegaCos to retain their digital monopolies. "Network effect" means something, but it's only strangling the web to death because there are $1000/hr law firms enforcing it behind the scenes. Without that, we'd have had automatic multiplexed Twitter/G+/FB streams a long time ago. They shut down aggregators because they need to control the direct interface to the user -- if they're relegated to a backend data provider by someone with a better user experience, they're very vulnerable. This realization is what motivated Craigslist's rapid reversal on scraper-friendliness, sunk 3taps, and has been the death of many potentially innovative early-stage companies.

-----

tl;dr: Until Congress passes revisions to the CFAA and the Copyright Act, and/or until the Supreme Court comes down with a wide-ranging, ironclad reversal of the last 30 years of case law on this topic, it's going to be perilous for anyone whose business depends on scraping.

And all this is at the federal level -- many states have enacted similar statutes so they can get in on the "put hackers in jail" action, and these battles will have to be fought at the state level too.

[0] https://blog.ericgoldman.org/ [1] https://blog.ericgoldman.org/archives/2017/08/linkedin-enjoi...


Thanks. I was unaware of Sandvig v. Sessions.

Does this apply to photos, and to using those photos?

I've added a second edit which hopefully answers that question.

A timely reminder that the "new and improved, cool, friendly, loves open source, a different company" Microsoft is - beyond the slick rebranding PR - still quite happy to throw its massive weight around, abusing the law, intent on rewriting the fundamentals of the open internet and access to information to everyone's disadvantage but their own.

> whether it’s legally OK to scrape websites without using an API

I'm not a lawyer either, but making such a frivolous distinction has always bothered me --- HTTP(S) and HTML is an API, and it's the one the web browser uses. Maybe the "official" API offers some better formatting and such, but ultimately you're just getting the same information from the same source. As long as you don't hammer the server to the point that it becomes disruptive to other users, as far as they're concerned you're just another user visiting the site.

IMHO making such a distinction is harmful because it places an artificial barrier to understanding how things actually work. I've had a coworker think that it was impossible to automate retrieving information from a (company internal) site "because it doesn't have an API". It usually takes asking them "then how did you get that information?" and a bit more discussion before they finally realise.

"If you asked a hundred people to go to different pages on a site and tell you what they found, is that legal?"


The distinction is usually based on implied consent. The general legal principle is that if you own property and you grant consent for people to use it for one purpose, they are free to use it for that purpose, but you haven't necessarily granted consent for other purposes. Offering an API is a strong indication that you actually intend to allow people to consume the data with software, because otherwise you wouldn't have bothered. Offering an HTML interface is usually an indication that you intend for people to consume the data with a web browser.

Offering an HTML interface may be an indication that you also consent to allowing machines to read the data through the HTML - that's the idea behind search engines. But that's where it gets complicated, and that's why there's all sorts of other considerations to the legal question. Things like did you include the pages in question in robots.txt, did you say anything explicitly about scrapers in the ToS, does the scraper offer a way to contact its owner about abuse, has the website actually contacted them, has an IP ban been issued, is the scraping for commercial purposes, does it compete directly with the site, does it interfere with legitimate human use, etc.


>IMHO making such a distinction is harmful because it places an artificial barrier to understanding how things actually work. I've had a coworker think that it was impossible to automate retrieving information from a (company internal) site "because it doesn't have an API". It usually takes asking them "then how did you get that information?" and a bit more discussion before they finally realise.

Yes! Similar pet peeve about e.g. "You can't use encryption in Gmail." No, nothing stops you from encrypting the message outside of Gmail and pasting the ciphertext in your message's body. It's that e.g. there might not be native support in the web client.


> but making such a frivolous distinction has always bothered me

Dismissing various legal and social conventions as 'frivolous distinctions' is, in the end, probably a more harmful viewpoint than the inconveniences the 'distinctions' introduce. It's also too easy to apply it in arbitrary and self-serving ways. Scraping data off some website? Frivolous distinction. Someone hoards your personal data? Venal violation of your privacy rights.


I agree. This is exactly why I made that distinction.

You may want to review the court decision in the LinkedIn vs hiQ case[0][1].

> It is generally impermissible to enter into a private home without permission in any circumstances. By contrast, it is presumptively not trespassing to open the unlocked door of a business during daytime hours because "the shared understanding is that shop owners are normally open to potential customers." These norms, moreover govern not only the time of entry but the manner; entering a business through the back window might be a trespass even when entering through the door is not.

[0] https://arstechnica.com/tech-policy/2017/08/court-rejects-li...

[1] https://www.documentcloud.org/documents/3932131-2017-0814-Hi...


Thanks. This is the case the EFF article I linked in the original post also refers to.

The rule of thumb seems to be:

- If the website offers the data publicly (without authentication), it's free to scrape.

- If the data isn't protected by copyright or trademark, (e.g. public data, such as an address of a house), it's free to reuse.

- If you use the data to compete with a big company, they will sue you regardless.

Court resolutions will vary on the court and judge. https://en.wikipedia.org/wiki/Web_scraping#Legal_issues


> If the data isn't protected by copyright or trademark, (e.g. public data, such as an address of a house), it's free to reuse.

At the risk of stating the obvious, most data you'll find is protected by copyright. E.g., this comment is written by me, so according to nearly all jurisdictions in the world I own the copyright (unless HN has a clause I agreed to when I signed up that signs it away, like Stack Overflow has).

Most forums, blogs, essays, articles, news sites, recipes and song lyrics are covered by copyright. I'm pretty sure that a webshop's blurb about why product x is good is covered by copyright.


> If the data isn't protected by copyright or trademark, (e.g. public data, such as an address of a house), it's free to reuse

Careful with this one. It's possible that it could be copyrighted in both the US and Europe (and also have some beyond copyright protection in Europe--more on the European situation later). In the US, a collection of data might count as a "compilation", defined in 17 USC 101:

> A “compilation” is a work formed by the collection and assembling of preexisting materials or of data that are selected, coordinated, or arranged in such a way that the resulting work as a whole constitutes an original work of authorship. The term “compilation” includes collective works

In the case of a compilation like a collection of house addresses, the important thing is whether the selection and arrangement of the data was sufficiently creative. The big case on this was decided by the Supreme Court in 1991. The cite is Feist Publications, Inc., v. Rural Telephone Service Co., 499 U.S. 340 (1991).

Briefly, the compilation in that case was a book of telephone listings. There was no doubt that it had taken a lot of work to produce, and up to that point copyright law followed the "sweat of the brow" doctrine, which basically means that if you put a lot of time and effort into making something in a category that can be covered by copyright, you could get copyright.

In Feist, the Court said that it is a Constitutional requirement for copyright that the work must actually be creative. It didn't take much creativity to qualify, but there had to be a spark of creativity in there. In the Feist case, they found that the telephone book in question was just an alphabetical list of all phone users in a region, which the telephone company was required by law to make. There was no creativity in either the selection or arrangement of the data, so no compilation copyright.

Based on Feist, then, a list of all house addresses in a region, sorted by address, or owner, or something like that, would probably be up for grabs. If it is a subset of the houses, then it is possible that the selection was sufficiently creative to allow copyright. Same goes for a clever arrangement or presentation of the data, although if what you are using it for doesn't copy the arrangement or presentation, the compilation copyright might not cover your copying.

BTW, in the particular case of address data, if your application doesn't actually need specific house addresses but instead just needs to know all the valid streets in a US state, and the address ranges on those streets, look at how that state handles sales tax. Sales taxes are usually based on street address, and the states make available databases that list all streets and the tax rates for each address range within the street.

If the state is one of the states that have joined the Streamlined Sales Tax arrangement, you can get their data here [1]. All the states in the SST group (around half of the states) agreed to a common format for the data. I think most non-SST states also make the data available in a reasonable form, so the approach of using tax data to get address information works in them too, just not as conveniently.

Most of the rest of the world also recognizes some kind of copyright on data collections, similar to the compilation copyright in the US, for data collections that are selected or arranged with sufficient creativity. This is part of the TRIPS trade agreement.

In the case of scraping for academic purposes, it might be OK under fair use even if it would otherwise be a copyright violation. If it is a state-owned school, it might not matter because of sovereign immunity, which greatly limits the ability of citizens to sue a state government for violations of Federal laws.

Some places, including most of Europe, also have a sui generis database right that creates a property right separate from copyright in databases, based on the effort to put together the database (i.e., the old "sweat of the brow" theory). I'll just point to Wikipedia for those who want more on the sui generis database right [2].

Oh, I suppose if the house addresses were for houses in Europe, then besides copyright and the sui generis database right, you might also want to consider whether or not scraping and using the data might have GDPR implications for you.

[1] https://www.streamlinedsalestax.org/Shared-Pages/rate-and-bo...

[2] https://en.wikipedia.org/wiki/Sui_generis_database_right


As an alternative to manual scraping, you can use CommonCrawl[0] or other open data sets, such as those provided by AWS[1]. That should alleviate any legal concerns (I think. I'm not a lawyer, but I'm sure CommonCrawl and Amazon have lawyers), and it's considerably faster than scraping. On top of that, you don't end up placing an unnecessary load on random websites.

[0] https://commoncrawl.org/

[1] https://registry.opendata.aws/
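CommonCrawl also exposes a CDX index API for finding captures of a given URL. A rough sketch follows; the crawl ID below is an assumption (each crawl gets a new ID, so check commoncrawl.org for a current one), and the field names reflect the index's documented JSON-lines output:

```python
import json
import urllib.parse
import urllib.request

# Crawl ID is an assumption -- look up a current one on commoncrawl.org.
INDEX = "https://index.commoncrawl.org/CC-MAIN-2019-26-index"

def parse_cdx(text):
    """The index API returns one JSON object per line; parse into dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def lookup(url_pattern):
    """Ask the index which WARC files contain captures of a URL pattern."""
    query = INDEX + "?" + urllib.parse.urlencode(
        {"url": url_pattern, "output": "json"})
    with urllib.request.urlopen(query, timeout=60) as resp:
        return parse_cdx(resp.read().decode("utf-8"))

# Example (requires network):
#   for rec in lookup("example.com/*"):
#       print(rec["timestamp"], rec["url"], rec["filename"])
```

The records point into WARC archives hosted on AWS, so the actual page bodies come from CommonCrawl's storage rather than from the original site.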


Thanks. I did not know about CommonCrawl

Biggest scraper in the world? Google. Do they obey robots.txt? Not a chance; they really don't care. So do what Google does, which is basically to run it like they own the world, and guess what? It's actually legal.

Complicated. Ask a lawyer. It depends a lot on the specifics of what you're doing, and case-law makes a lot of very subtle distinctions based on exactly who you're scraping, what their ToS says, how they present the ToS, how much data you take, what you do with that data, is it public, is it facts & numbers vs. opinion & expression, how much you might inconvenience their other users and staff, whether you're a direct competitor of them, etc.

I suspect you'll actually get different answers depending on which lawyer you ask. If you've got deep enough pockets you can probably ensure you get the answer you want, and if you have really deep pockets you can probably ensure the court gets the answer you want. But if you're just a student who doesn't want to end up in court, there are potential minefields there.


If you're just a student doing research, the risk that you'll end up in court is near zero regardless of other details. Any company would have to argue that you are 'causing damages' in order to sue you. So your research would have to be harming their servers, siphoning away their customers, or otherwise materially harming the company.

IANAL, but I have done tons of web scraping over the years.

My tips:

- Keep careful control of the rate you scrape. Every time I have ever heard of someone getting negative feedback it is because they have scraped pages at a rate that caused an impact on the website they were scraping. If you don't cause a noticeable increase in traffic/load nobody will check to see what is going on, and generally nobody has a reason to care.

- Some sites are notoriously aggressive at going after people, such as craigslist. I wouldn't try to scrape them.

- Use some kind of proxy!
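The rate-control tip can be made concrete with a small per-host throttle. This is a hedged sketch: the class name and the two-second interval are my own choices, not anything prescribed in the thread.

```python
import time
from urllib.parse import urlsplit

class DomainThrottle:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.last_hit = {}      # host -> timestamp of last request
        self.clock = clock      # injectable for testing
        self.sleep = sleep

    def wait(self, url):
        """Block until it's polite to hit this URL's host, then record the hit."""
        host = urlsplit(url).netloc
        now = self.clock()
        last = self.last_hit.get(host)
        if last is not None and now - last < self.min_interval:
            self.sleep(self.min_interval - (now - last))
            now = last + self.min_interval
        self.last_hit[host] = now
        return host
```

Usage is just `throttle.wait(url)` before each request; different hosts don't delay each other, so scraping several sites in parallel stays polite to each one.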


> Use some kind of proxy!

Many proxies, in random order, would be the best.

That brings up another curious question: What's the legality of posting a site to something like HN or Slashdot and effectively getting it DDoS'd...?


> posting a site to something like HN or Slashdot and effectively getting it DDoS'd

I imagine there's some reading of the CFAA that could theoretically land you in hot water for this, but this is silly.

Intent is very important. Can one sue or prosecute a popular food critic for writing something about a restaurant, causing lines so long that long-time regulars can't get a seat anymore?

On the other hand, you have things like booter services (essentially, DDoS as a service). Continuing the analogy, I imagine if you hired 100 people to physically block the entrance of a restaurant for some reason, you would be on the hook for damages in civil court and something along the lines of "disturbing the peace" in criminal court.


Teach your students to ensure there’s a delay between requests so they aren’t hammering anyone’s server, and follow the rules in the robots.txt. I’ve scraped more than a billion pages without any issues.

He asked about legality, not technical difficulty.

Actually, many of the students are technically competent to do the scraping, mostly using Python, and I am pretty sure they learned not to overwhelm web servers.

Just because it’s technically feasible does not mean it’s legal or ethical.

I'd say the act of web scraping alone is almost never unethical if you are careful not to cause undue load to servers. From an ethics, not legal, perspective, I don't see a whole lot of difference between your computer's silicon eyes and your organic eyes just looking at something that's already in plain view.

It might be illegal in some jurisdiction; IANAL but I think you can just get out of that jurisdiction and scrape away if that is the case. It might violate some ToS but ToS isn't law; the consequences of violating a ToS are usually on the order of getting your IP banned.

What you do with the stuff you scraped can be ethical or unethical.


What makes it unethical?

Why should I be treated differently than search engine spiders?

If somebody doesn’t want their site scraped then they can let people know with robots.txt. Get off your high horse.


They never said it was unethical.

Likewise, just because it's not legal, or in some perspective it's unethical, doesn't mean one should not do it.

> the concern is less about what’s ethical and more about what’s legal.

Please reconsider this position. You're teaching the future generation of engineers and scientists. Even if it's not strictly the topic of your course, please don't teach your students that everything that's technically legal to do is fine. Show that being socially conscious matters as well. Everybody will be better off.


I think I didn't frame my statement well. Although I don't make absolute claims about ethics, I tell the students that some practices are considered unethical by some people because of XYZ reasons. I leave it up to the students to make up their minds because all of them are adults and many of them are actually much older than me. I have lived and worked in several countries around the world, which has taught me that talking about ethics in an absolutist fashion is a terrible idea. Once I was teaching a group of international students predictive models to screen job candidates when the candidate pool is too enormous to tackle manually. Some students felt that it was unethical to use algorithms to decide a human being's job prospects. I know it may sound weird to many folks on HN, but people really have strong feelings about these issues on ethical grounds, and it varies a lot from person to person and culture to culture.

I completely disagree on 2 levels:

First, teachers shouldn’t be teaching morals, especially in college and university. The slippery slope from morals to politics is a dangerous one. I’d rather they focus on their actual course materials.

Finally, there is nothing wrong with scraping from an ethical standpoint if you don’t DDoS the target services. It gave us search engines, and that’s probably one of the most important breakthroughs for humanity in the past few decades.


The question obviously came up during their teaching, so it's become part of the course, whether they want it or not. OP also says that their peers think there are ethical questions in regard to scraping.

I don't see where you see politics in how they handle such questions. I'm not advocating they go on an extended lecture about their personal views on the political system that made the laws and what not.

I'm saying that there's a difference between handling these kind of questions with "if you're not sure, maybe you should kindly ask the publisher of the data if they would be ok with you scraping/using it that way" and "if your lawyer says you're in the clear, fuck them and scrape away."


The OP asked explicitly about legality, not ethics. So no, ethics didn't come up organically.

Although I agree with your point about politics, in the case of ethics I don't take an extreme stance. I certainly discuss the ethical issues surrounding certain practices, but I refrain from preaching what they should be doing. People should be aware of the ethical concerns of others, in particular when the issues are fast evolving.

I agree completely. Unfortunately, not only are teachers currently teaching morals, those in the social sciences teach a radical political position, as well as advocate heavily for activism. It's completely inappropriate, and I can't believe that the institutions they're employed by seem to be complacent at best, responsible at worst.

> First, teachers shouldn’t be teaching morals. Specially in college and university. The slippery slope between morals to politics is a dangerous one. I rather them focusing on their actual course materials

I disagree 100%. "To teach about the human anatomy, we've kidnapped Paul here, and will now cut him apart."

It's great for teaching (how better to observe what happens when you cut open a living person than ... cutting open a living person), but it's unethical (and illegal), and that's an important lesson as well.


Equally importantly, not everything that is illegal is immoral.

I teach software development in a data science master's degree. We learn about web scraping, it is an important skill for a future data scientist IMHO, as the web is the largest and most important dataset in the world.

Google is massively scraping the web and is building products on top of the data, e.g. flight/hotel search. Why shouldn't we be allowed to do the same?

As others pointed out, one should take care about the ToS.


Thanks. Do you have any set guidelines for students when they should not scrape a website?

Several comments suggest you ask a lawyer. AFAIK, a lawyer can’t answer the question without doing the work, the same as a doctor can’t tell if you’re sick without an examination.

I know this isn’t the answer you are seeking, but it might help you find more examples: the area of copyright and fair use has a longer history with digital images. Here’s an example court case showing, as others have noted, that the ruling judge has great impact on the outcome: “Court Rules Images That Are Found and Used From the Internet Are 'Fair Use'” by Jack Alexander, 2018-07-02 [1]

Maybe your educational institution has already done some legal work related to issues of copyright and educational use?

Here is an example from a university where they have done the legal work and constructed further guidelines to determine safe harbor guidelines.

“The use of copyright protected images in student assignments and presentations for university courses is covered by Copyright Act exceptions for fair dealing and educational institution users. [...] In certain circumstances you may be able to use more than a "short excerpt" (e.g. 10%) of a work under fair dealing. SFU's Fair Dealing Policy sets out "safe harbour" limits for working under fair dealing at SFU, but the Copyright Act does not impose specific limits.”

[1]: https://fstoppers.com/business/court-rules-images-are-found-...

[2]: "I want to use another person's images and materials in my assignment or class presentation. What am I able to do under copyright?" https://www.lib.sfu.ca/help/academic-integrity/copyright/stu...


Thanks a lot for sharing these links. I will follow your advice and talk to the university's legal folks about this.

It may depend on how you use the data — for instance, publicly sharing what you scraped is a clear copyright violation in many cases.

And in some cases scraping is a violation of ToS. (Though who knows whether that’s ever been litigated as enforceable.)


ToS presumably only apply to those who have seen them. So if you want to put data behind a ToS, you need to show it and be able to demonstrate that the user has seen it, such as by having users log in and accept the ToS at registration.

It would be extremely surprising to learn otherwise, for example that there is a jurisdiction in which site users are bound by terms they can only find by actively looking for them on the site.


Thanks for pointing out the ToS. Does the ToS apply even when someone is not logged in to an account?

Ask a lawyer? Many are written as if they do; how enforceable that is is beyond my knowledge.

Web crawling by search engines shouldn't be far from web scraping in terms of data collection. I am wondering what the legal boundary of web crawling for search engines is. While web scraping sounds sneaky, why isn't web crawling?

You willingly submit links to a service to crawl your site, there's nothing like "consent" for scraping...

Nope - it just requires someone to link to you. Since there are informational sites that list new domains, that might happen automatically.

You don't, actually; most sites are discovered organically through links on other sites. Submitting links hasn't been common since the days of Yahoo and DMOZ.

You're right that "consent" is the important legal issue, but it's usually implied based on what your site requires re: authentication/authorization, robots.txt, and the controls Google has provided to let you tell them not to index a site.


Respecting robots.txt and using a publicly declared user agent come to mind.

What jurisdiction are we talking about? Laws aren't the same everywhere.

We are based in the US

is it illegal anywhere?

Disclaimers: IANAL. And I run https://serpapi.com. I can give your students free credits for ML uses if you want.

Legality highly depends on where you are.

In the US, scraping of public data is a fair use exception protected by the First Amendment. If you have to sign in to access the data, you then might be bound by the ToS.

In Europe, scraping of public data can be against several laws, notably the GDPR and the new copyright law, and you might be infringing copyrights on databases as defined by the CNIL.


Thanks a lot! We are based in the US. I will get in touch with you.

Hey just registered, really neat. You don't support 'popular times' results for local places. This may not even exist in your area so you may not be aware of it.

I've always really wanted to make a terminal app to keep track of how busy my local places are. Not saying I'd become a customer who would keep the lights on or anything like that, but at the very least it would make a cool demo.


The CFAA is arbitrarily enforced and it is impossible to know if you are safe legally. People in this thread are saying that publicly accessible data is safe to scrape but that certainly wasn't the case in United States v. Andrew Auernheimer.

Sandvig v. Sessions is more recent and says otherwise for publicly available information:

https://www.eff.org/deeplinks/2018/04/dc-court-accessing-pub...


I think this is the most important sentence from that article:

> does not make it a crime to access information in a manner that the website doesn’t like if you are otherwise entitled to access that same information.


I think the legal developments around this topic have been fluid, which has made it difficult to keep track of the current state of the matter. I will check out the case you mention.


Thanks. I was unaware of this case.


