The client-side scraping companion
artoo is a piece of JavaScript code meant to be run in your browser's console to provide you with some scraping utilities.
This nice droid is loaded into the JavaScript context of any webpage through a handy bookmarklet you can instantly install by dropping the above icon onto your bookmark bar.
Now that you have installed artoo let's scrape the famous Hacker News in four painless steps:
artoo.scrape('td.title:has(a):not(:last)', {
title: {sel: 'a'},
url: {sel: 'a', attr: 'href'}
}, artoo.savePrettyJson);
That's it. You've just scraped Hacker News front page and downloaded the data as a pretty-printed json file*.
* If you need a more thorough scraper, check this out.
Please note that artoo has been built having Chrome and Chromium in mind. So, even if artoo may function quite properly on other browsers, some of its features such as instructions recording might not be available on those.
If you think this is unfair and feel that some features can be ported to other browsers, please report it and we'll find a solution together.
« Why on earth should I scrape on my browser? Isn't this insane? »
Well, before quitting the present documentation and run back to your beloved scrapy© spiders, you should pause for a minute or two and read the reasons why artoo has made the choice of client-side scraping.
Usually, the scraping process occurs thusly: we find sites from which we need to retrieve data and we consequently build a program whose goal is to fetch those site's html and parse it to get what we need.
The only problem with this process is that, nowadays, websites are not just plain html. We need cookies, we need authentication, we need JavaScript execution and a million other things to get proper data.
So, by the days, to cope with this harsh reality, our scraping programs became complex monsters being able to execute JavaScript, authenticate on websites and mimic human behaviour.
But, if you sit back and try to find other programs able to perform all those things, you'll quickly come to this observation:
Aren't we trying to rebuild web browsers?
So why shouldn't we take advantage of this and start scraping within the cosy environment of web browsers? It has become really easy today to execute JavaScript in a a browser's console and this is exactly what artoo is doing.
Using browsers as scraping platforms comes with a lot of advantages:
The intention here is not at all to say that classical scraping is obsolete but rather that client-side scraping is a possibility today and, what's more, a useful one.
You'll never find yourself crawling pages massively on a browser, but for most of your scraping tasks, client-side should enhance your productivity dramatically.
Contributions are more than welcome. Feel free to submit any pull request as long as you added unit tests if relevant and passed them all.
To install the development environment, clone your fork and use the following commands:
# Install dependencies
npm install
# Testing
npm test
# Compiling dev & prod bookmarklets
grunt bookmarklets
# Running a test server hosting the concatenated file
npm start
# Running a https server hosting the concatenated file
# Note that you'll need some ssl keys (instructions to come...)
npm run https
artoo is being developed by Guillaume Plique @ SciencesPo - médialab.
Logo by Daniele Guido.
Under a MIT License.