I Don't Need No Stinking API - Web Scraping in 2016 and Beyond

Social media APIs and their rate limits have not been nice to me recently, especially Instagram. Who needs it anyway?

Sites are getting increasingly smart about blocking scraping and data mining attempts. AngelList even detects PhantomJS (I have not seen other sites do this). But if you automate the exact actions you would perform in a browser, can that really be blocked?

I’m going to share everything that I’ve learnt to date from my recent love affair with Selenium automation/scraping/crawling. The purpose of this post is to illustrate some of the techniques I’ve created which I haven’t seen anywhere – as a broader idea to be shared around, rather than a how-to.

First off, in terms of concurrency or the amount of horsepower you get for your hard earned $$$ – Selenium sucks. It’s simply not built for what you would consider ‘scraping’. But with sites being built with more and more smarts these days, the only truly reliable way to mine data off the internets is to use browser automation.

My stack is pretty much all JavaScript (there go a few readers 😑😆): WebdriverIO, Node.js and a bunch of NPM packages, including antigate (thanks to Troy Hunt – Breaking CAPTCHA with automated humans). I’m sure most of my techniques can be applied to any flavour of the Selenium 2 driver; it just happens that I find JavaScript optimal for browser automation.

Faking Human Delays

It’s definitely good practice to add these human-like, random pauses in some places just to be extra safe:

const getRandomInt = (min, max) => {
    return Math.floor(Math.random() * (max - min + 1)) + min
}

browser
    .init()
    // do stuff
    .pause(getRandomInt(2000, 5000))
    // do more stuff

Parsing Data jQuery Style with Cheerio

Below is a snippet from a function that gets videos from a Facebook page which I used for Skater.Life and Hustle.Vision

const getVideos = (url) => {
    browser
        .url(url)
        .pause(15000)
        .getTitle()
        .then((title) => {
            if (!argv.production) console.log(`Title: ${title}`)
        })
        .getSource()
        .then((source) => {
            // load the rendered page source into Cheerio for jQuery-style querying
            const $ = cheerio.load(source)
            $('div.userContentWrapper[role="article"]').each((i, e) => {
                // parse stuff jQuery style here & maybe save it somewhere
                // wheeeeeee
            })
        })
}

I also use a similar method, plus some regex, to parse RSS feeds that can’t be read by command-line, cURL-like scripts.

fastFeed.parse(data, (err, feed) => {
    if (err) {
        console.error('Error with fastFeed.parse() - trying via Selenium')
        console.error(err)
        browser
            .url(rssUrl)
            .pause(10000)
            .getSource()
            .then((source) => {
                // un-escape the XML that the browser rendered as text
                source = source.replace(/&lt;/g, '<').replace(/&gt;/g, '>').replace(/&amp;/g, '&')
                // strip the html/head/body/pre wrapper tags added by the browser
                source = source.replace(/(<.?html([^>]+)*>)/ig, '').replace(/(<.?head([^>]+)*>)/ig, '').replace(/(<.?body([^>]+)*>)/ig, '').replace(/(<.?pre([^>]+)*>)/ig, '')
                if (debug) console.log(source)
                fastFeed.parse(source, (err, feed) => {
                    // let's go further up the pyramid of doom!
                    // ༼ノಠل͟ಠ༽ノ ︵ ┻━┻
                })
            })
    }
})

It actually works pretty well – I’ve tested with multiple sources.
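
The cleanup step can also be pulled out into a small helper so it can be tried without a browser. This is only a sketch: `unwrapRenderedFeed` is a hypothetical name, and it assumes the browser has rendered the feed as escaped text inside `html`, `head`, `body` and `pre` wrappers, which is what the regexes above imply. The wrapper pattern is slightly stricter than the original so that a tag such as `<header>` is left alone.

```javascript
// Sketch: turn the page source a browser shows for an RSS URL back into raw XML.
// unwrapRenderedFeed is a hypothetical helper, not part of fastFeed or WebdriverIO.
const unwrapRenderedFeed = (source) => {
    return source
        // un-escape the entities the browser added when showing the XML as text;
        // &amp; goes last so that '&amp;lt;' is only decoded once
        .replace(/&lt;/g, '<')
        .replace(/&gt;/g, '>')
        .replace(/&amp;/g, '&')
        // drop the wrapper tags the browser put around the feed
        .replace(/<\/?(?:html|head|body|pre)\b[^>]*>/gi, '')
}

const rendered = '<html><head></head><body><pre>&lt;rss&gt;&lt;title&gt;A &amp;amp; B&lt;/title&gt;&lt;/rss&gt;</pre></body></html>'
console.log(unwrapRenderedFeed(rendered))
// <rss><title>A &amp; B</title></rss>
```

The returned string is what would then be handed to `fastFeed.parse()` in the fallback branch.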

Injecting JavaScript

If you get to my level, injecting JavaScript for the client-side becomes commonplace.

browser
    // du sum stufs lel
    .execute(() => {
        let person = prompt("Please enter your name", "Harry Potter")
        if (person != null) {
            alert(`Hello ${person}! How are you today?`)
        }
    })

By the way, this is a totally non-practical example (in case you haven’t noticed). The sections that follow are more practical.

Beating CAPTCHA

GIFLY.co had not updated for over 48 hours and I wondered why. My script, which gets animated GIFs from various Facebook pages, was being hit with the CAPTCHA screen 😮

Cracking Facebook’s captcha was actually pretty easy. It took me exactly 15 minutes to accomplish this. I’m sure there are ways to do this internally but with Antigate providing an NPM package and with costs so low, it was a no-brainer for me.

const Antigate = require('antigate')
const ag = new Antigate('booo00000haaaaahaaahahaaaaaa')

browser
    .url(url)
    .pause(5000)
    .getTitle()
    .then((title) => {
        if (!argv.production) console.log(`Title: ${title}`)
        if (title === 'Security Check Required') {
            browser
                .execute(() => {
                    // injectz0r the stuffs necessary
                    function convertImageToCanvas(image) {
                        var canvas = document.createElement("canvas")
                        canvas.width = image.width
                        canvas.height = image.height
                        canvas.getContext("2d").drawImage(image, 0, 0)
                        return canvas
                    }
                    // give me a png with base64 encoding
                    return convertImageToCanvas(document.querySelector('#captcha img[src*=captcha]')).toDataURL()
                })
                .then((result) => {
                    // apparently antigate doesn't like the first part
                    let image = result.value.replace('data:image/png;base64,', '')
                    ag.process(image, (error, text, id) => {
                        if (error) {
                            throw error
                        } else {
                            console.log(`Captcha is ${text}`)
                            browser
                                .setValue('#captcha_response', text)
                                .click('#captcha_submit')
                                .pause(15000)
                                .emit('good') // continue to do stuffs
                        }
                    })
                })
        }
    })

So injecting JavaScript has become super-handy here. I’m converting an image to a canvas, then running .toDataURL() to get a Base64-encoded PNG image to send to the Antigate endpoint. The function was stolen from a site I steal a lot of things from, shouts to David Walsh. This solves the Facebook captcha, enters the value, then clicks submit.
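
The prefix stripping relies on the data URL being a PNG, which is the default for `.toDataURL()`. A slightly more general version might look like the sketch below; `parseDataUrl` is a hypothetical helper name, and it assumes a base64 data URL of the form `data:<mime type>;base64,<payload>`.

```javascript
// Sketch: split a base64 data URL (as returned by canvas.toDataURL()) into
// its MIME type and payload. parseDataUrl is a hypothetical helper name.
const parseDataUrl = (dataUrl) => {
    const match = /^data:([^;,]+);base64,(.*)$/.exec(dataUrl)
    if (!match) throw new Error('Not a base64 data URL')
    return { mimeType: match[1], base64: match[2] }
}

const parsed = parseDataUrl('data:image/png;base64,aGVsbG8=')
console.log(parsed.mimeType) // image/png
console.log(Buffer.from(parsed.base64, 'base64').toString()) // hello
```

Only the `base64` part would be passed on to the CAPTCHA service.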

Catching AJAX Errors

Why would you want to catch client-side AJAX errors? Because reasons. For example, I automated unfollowing everyone on Instagram and I found that even through their website (not via the API) there is some kind of a rate limit.

browser
    // ... go to somewhere on Instagram
    .execute(() => {
        // global array on the page, so later execute() calls can read it
        window.fkErrorz = []
        jQuery(document).ajaxError(function (e, request, settings) {
            window.fkErrorz.push(e)
        })
    })

// unfollow some people, below runs in a loop
browser
    .click('.unfollowButton')
    .execute(() => {
        return window.fkErrorz.length
    })
    .then((result) => {
        let errorsCount = parseInt(result.value, 10)
        console.log('AJAX errors: ' + errorsCount)
        if (errorsCount > 2) {
            console.log('Exiting process due to AJAX errors')
            process.exit()
            // let's get the hell outta here!!
        }
    })

Because a follow/unfollow invokes an AJAX call, and being rate limited would mean an AJAX error, I inject an AJAX error handler that pushes each error into a global array.

I retrieve this value after each unfollow and terminate the script if I get 3 errors.
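
The "stop after three errors" rule can be kept separate from the browser code, which makes it easier to change the limit later. A minimal sketch, assuming the count is read back from the page after each action; `createErrorBudget` is a hypothetical helper, and the default of 2 simply mirrors the `errorsCount > 2` check.

```javascript
// Sketch: decide when to stop based on the running AJAX error count.
// createErrorBudget is a hypothetical helper; maxErrors = 2 mirrors `errorsCount > 2`.
const createErrorBudget = (maxErrors = 2) => {
    let lastCount = 0
    return {
        // feed it the value read back from the page after each action
        update(count) {
            const parsed = parseInt(count, 10)
            // ignore values that are not numbers and keep the previous count
            if (!Number.isNaN(parsed)) lastCount = parsed
            return lastCount
        },
        // true once the page has reported more errors than allowed
        exceeded() {
            return lastCount > maxErrors
        }
    }
}

const budget = createErrorBudget()
budget.update(2)
console.log(budget.exceeded()) // false
budget.update('3')
console.log(budget.exceeded()) // true
```

Inside the `.then((result) => { ... })` callback this would become `budget.update(result.value)` followed by an exit when `budget.exceeded()` returns true.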

Intercepting AJAX Data

While scraping/crawling/spidering Instagram, I ran into a problem. A tag page did not give me the post date in the DOM. I really needed this data for IQta.gs and I couldn’t afford to visit every post, as I’m parsing about 200 photos every time.

What I did find though, is that there is a date variable stored in the post object that the browser receives. Heck, I ended up not even using this variable but this is what I came up with:

browser
    .url(url) // an Instagram tag page eg. https://www.instagram.com/explore/tags/coding/
    .pause(10000)
    .getTitle()
    .then((title) => {
        if (!argv.production) console.log(`Title: ${title}`)
    })
    .execute(() => {
        // override AJAX prototype & hijack data
        window.iqtags = [];
        (function (send) {
            XMLHttpRequest.prototype.send = function () {
                this.addEventListener('readystatechange', function () {
                    if (this.responseURL == 'https://www.instagram.com/query/' && this.readyState == 4) {
                        let response = JSON.parse(this.response)
                        window.iqtags = window.iqtags.concat(response.media.nodes)
                    }
                }, false)
                // hand over to the original send() so the request still goes out
                send.apply(this, arguments)
            }
        })(XMLHttpRequest.prototype.send)
    })
    // do some secret awesome stuffs (actually, I'm just scrolling to trigger lazy loading to get moar data)
    .execute(() => {
        return window.iqtags
    })
    .then((result) => {
        let nodes = result.value
        if (!argv.production) console.log(`Received ${nodes.length} images`)

        let hashtags = []
        nodes.forEach((n) => {
            // use regex to get hashtags from captions
        })
        if (argv.debug > 1) console.log(hashtags)
    })
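
The hashtag extraction itself is left as a comment in the snippet. One way it might look is sketched below; `extractHashtags` is a hypothetical helper, and the pattern (Unicode letters, digits and underscores after a `#`) is an assumption about what counts as a hashtag, not Instagram's own rule.

```javascript
// Sketch: pull hashtags out of a caption string.
// extractHashtags is a hypothetical helper; the pattern is an assumption
// (Unicode letters, digits and underscores), not Instagram's own rule.
const extractHashtags = (caption) => {
    const matches = (caption || '').match(/#[\p{L}\p{N}_]+/gu) || []
    // drop the leading '#', lower-case, and de-duplicate
    return [...new Set(matches.map((tag) => tag.slice(1).toLowerCase()))]
}

console.log(extractHashtags('Late night #coding in #Melbourne #coding'))
// [ 'coding', 'melbourne' ]
```

Which property of each node holds the caption depends on the response format, so that part is not shown.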

It’s getting a little late in Melbourne.

Other Smarts

So I run all of this in a Docker container running on AWS. I’ve pretty much made my Instagram crawlers fault-tolerant with some bash scripting (goes to check if it is running now)

==> iqtags.grid2.log <==
[2016-08-23 15:01:40][LOG] There are 1022840 images for chanbaek
[2016-08-23 15:01:41][LOG] chanbaek { ok: 1,
  nModified: 0,
  n: 1,
  upserted: [ { index: 0, _id: 57bc654f1007e86f09f70b49 } ] }
[2016-08-23 15:01:51][LOG] Getting random item from queue
[2016-08-23 15:01:51][LOG] Aggregating related tags from a random hashtag in db
[2016-08-23 15:01:51][LOG] Hashtag #spiritualfreedom doesn't exist in db
[2016-08-23 15:01:51][LOG] Navigating to https://www.instagram.com/explore/tags/spiritualfreedom
[2016-08-23 15:02:05][LOG] Title: #spiritualfreedom • Instagram photos and videos

==> iqtags.grid3.log <==
[2016-08-23 15:00:39][LOG] Navigating to https://www.instagram.com/explore/tags/artist
[2016-08-23 15:00:56][LOG] Title: #artist • Instagram photos and videos
[2016-08-23 15:01:37][LOG] Received 185 images
[2016-08-23 15:01:37][LOG] There are 40114945 images for artist
[2016-08-23 15:01:37][LOG] artist { ok: 1, nModified: 1, n: 1 }
[2016-08-23 15:01:47][LOG] Getting random item from queue
[2016-08-23 15:01:47][LOG] Aggregating related tags from a random hashtag in db
[2016-08-23 15:01:47][LOG] Hashtag #bornfree doesn't exist in db
[2016-08-23 15:01:47][LOG] Navigating to https://www.instagram.com/explore/tags/bornfree
[2016-08-23 15:02:01][LOG] Title: #bornfree • Instagram photos and videos
[2016-08-23 15:02:44][LOG] Received 183 images
[2016-08-23 15:02:44][LOG] There are 90195 images for bornfree
[2016-08-23 15:02:45][LOG] bornfree { ok: 1,
  nModified: 0,
  n: 1,
  upserted: [ { index: 0, _id: 57bc658f1007e86f09f70b4c } ] }

Yes, it seems that all 6 IQta.gs crawlers are running fine 🙂 I’ve run into some issues with Docker where an image becomes unusable. I have no idea why, and I did not spend time looking into the root cause, but my bash script detects inactivity, then completely removes the Selenium grid and starts it again from a fresh image.
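
The actual watchdog is a bash script that is not shown here. As a rough sketch of the inactivity check only, written in Node.js to match the rest of the code: it assumes "activity" means the crawler's log file was written to recently. `isStale`, `needsRestart` and the 10 minute default are all assumptions for illustration, and the restart step itself is left out.

```javascript
const fs = require('fs')

// Sketch: treat a crawler as stuck when its log file has not been written to recently.
// isStale, needsRestart and the 10 minute default are assumptions for illustration.
const isStale = (lastWriteMs, nowMs, maxIdleMs = 10 * 60 * 1000) => {
    return nowMs - lastWriteMs > maxIdleMs
}

const needsRestart = (logPath, nowMs = Date.now()) => {
    try {
        // mtimeMs is the last modification time of the file in milliseconds
        return isStale(fs.statSync(logPath).mtimeMs, nowMs)
    } catch (err) {
        // a log file that cannot be read also counts as "not running"
        return true
    }
}

console.log(needsRestart('/path/that/does/not/exist.log')) // true
```

A scheduled job could call `needsRestart()` for each crawler log and trigger the remove-and-recreate step whenever it returns true.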

Random Closing Thoughts

I had this heading written down before I started writing this post and I have forgotten the random thoughts I had back then. Oh well, maybe it will come back tomorrow.

Ah, a special mention goes out to Hartley Brody and his post, which was a very popular article on Hacker News in 2012/2013 – it inspired me to write this.

Those of you wondering what the hell browser is:

const webdriverio = require('webdriverio')
let options
if (argv.chrome) {
    options = {
        desiredCapabilities: {
            browserName: 'chrome', chromeOptions: {
                prefs: {
                    'profile.default_content_setting_values.notifications': 2
                }
            }
        },
        port: port,
        logOutput: '/var/log/fk/iqtags.crawler/'
    }
}
else {
    options = {
        desiredCapabilities: {
            browserName: 'firefox'
        },
        port: port,
        logOutput: '/var/log/fk/iqtags.crawler/'
    }
}
const browser = webdriverio.remote(options)

And argv comes from yargs.

Thanks to my long time friend In-Ho @ Google for proofreading! 🙂

Francis Kim

Since early-mid 2000s, my career's been mostly focused on eCommerce (Magento) and the sites I've worked on so far generate over AUD 150M+ revenue each year. I believe JavaScript and Automation is the future!

Latest posts by Francis Kim (see all)

I Don’t Need No Stinking API – Web Scraping in 2016 and Beyond - 24/08/2016
Running Meteor 1.3 in tmux is Awesome - 11/07/2016
Selenium WebdriverIO issue when typing numbers - 06/07/2016

6 Comments

VividSoftwareSolutions

24/08/2016 — 8:31 AM

Francis, very impressive. I like the Catching Ajax errors part.
Thanks a ton for all the useful snippets!

- Francis Kim
  
  24/08/2016 — 9:48 AM
  
  Thank you, glad you find it useful!
  
Kevin

24/08/2016 — 9:27 AM

Just curious, how would you make money with scraping??

I’m a JS developer by profession, and have to set up Selenium + mocha / nightwatch frequently.

So I wonder if there is some way to make an extra buck 🙂

- Francis Kim
  
  24/08/2016 — 9:47 AM
  
  Not sure to be honest lol. Most scraping jobs seem to be kind of low quality jobs wanting x leads for y dollars – that’s certainly not what I am after. But it has served me well for personal use. I think the money will be in creating a bot that allows smart automation rather than just doing the numbers.
  
- Nick Sweeting
  
  24/08/2016 — 10:17 AM
  
  Lead-gen jobs are all over the place, but they don’t pay very well unless you can scrape multiple sources and cross-reference to improve lead quality (which usually takes some data-science experience).
  
Cupcake89

24/08/2016 — 11:19 AM

Your logo, and especially its colors, resembles speedof.me. Haha
