How Google’s Web Crawler Bypasses Paywalls

by Isoroku Yamamoto

The Wall Street Journal has fixed the old trick of pasting a headline into Google News to get past its paywall. Google's crawler, however, can still index the full content.

Digital publications grant search engines preferential access by inspecting HTTP request headers. The two relevant headers are Referer and User-Agent.

Referer (a historical misspelling of "referrer," preserved in the HTTP specification) identifies the address of the web page that linked to the requested resource. Previously, clicking a link on a Google search results page would send Referer: https://www.google.com/, and that alone unlocked the article. It is no longer enough.

More recently, websites started checking User-Agent, a string that identifies the browser or app that made the request. The Wall Street Journal wants to know not only that you came from Google, but also that you are an agent of Google.
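To see how such a gate might work on the publisher's side, here is a hypothetical sketch. The header names are real; the function name and the exact checks are invented for illustration, not taken from any publisher's actual code:

```javascript
// Hypothetical sketch of a server-side paywall check.
// The publisher inspects the Referer and User-Agent headers
// and serves the full article only to requests that appear
// to come from Google.
function shouldServeFullArticle(headers) {
  var referer = headers["referer"] || "";
  var userAgent = headers["user-agent"] || "";

  var cameFromGoogle = referer.indexOf("https://www.google.com/") === 0;
  var looksLikeGooglebot = userAgent.indexOf("Googlebot") !== -1;

  // Older paywalls checked only Referer; newer ones also
  // require a Googlebot User-Agent.
  return cameFromGoogle && looksLikeGooglebot;
}

// A normal browser visit hits the paywall:
console.log(shouldServeFullArticle({
  "referer": "",
  "user-agent": "Mozilla/5.0 (Macintosh) Chrome/48.0"
})); // false

// A request dressed up as Googlebot gets through:
console.log(shouldServeFullArticle({
  "referer": "https://www.google.com/",
  "user-agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
})); // true
```

Since the server sees nothing but these self-reported headers, the check is trivially spoofable, which is exactly what the extension below exploits.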

By providing this information in request headers, anyone can appear to be a Google web crawler. In fact, I will show you how to make a Chrome extension that does just that.

1. Create a file called manifest.json. Paste the following in the file. Add any sites you would like to read to the permissions list.

{
  "name": "Innocuous Chrome Extension",
  "version": "0.1",
  "description": "This is an innocuous chrome extension.",
  "permissions": ["webRequest", "webRequestBlocking",
                  "*://www.ft.com/*",
                  "*://www.wsj.com/*",
                  "*://www.economist.com/*",
                  "*://www.nytimes.com/*",
                  "*://hbr.org/*",
                  "*://www.newyorker.com/*",
                  "*://www.forbes.com/*",
                  "*://online.barrons.com/*",
                  "*://www.barrons.com/*",
                  "*://www.investingdaily.com/*",
                  "*://realmoney.thestreet.com/*",
                  "*://www.washingtonpost.com/*"
                  ],
  "background": {
    "scripts": ["background.js"]
  },
  "manifest_version": 2
}

2. Create a file called background.js. Paste the following into the file:

var ALLOW_COOKIES = ["nytimes", "ft.com"];

var GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";

function changeRefer(details) {
  var foundReferer = false;
  var foundUA = false;

  var reqHeaders = details.requestHeaders.filter(function(header) {
    // Block cookies by default; keep every other header.
    if (header.name !== "Cookie") {
      return true;
    }
    // Allow cookies only for sites on the allowlist.
    return ALLOW_COOKIES.some(function(site) {
      return details.url.includes(site);
    });
  }).map(function(header) {
    // Rewrite existing Referer and User-Agent headers.
    if (header.name === "Referer") {
      header.value = "https://www.google.com/";
      foundReferer = true;
    }
    if (header.name === "User-Agent") {
      header.value = GOOGLEBOT_UA;
      foundUA = true;
    }
    return header;
  });

  // Append the headers if the request did not already carry them.
  if (!foundReferer) {
    reqHeaders.push({
      "name": "Referer",
      "value": "https://www.google.com/"
    });
  }
  if (!foundUA) {
    reqHeaders.push({
      "name": "User-Agent",
      "value": GOOGLEBOT_UA
    });
  }
  return {requestHeaders: reqHeaders};
}

function blockCookies(details) {
  // Filter rather than splicing in place, so adjacent
  // Set-Cookie headers are not skipped.
  var respHeaders = details.responseHeaders.filter(function(header) {
    return header.name !== "Set-Cookie";
  });
  return {responseHeaders: respHeaders};
}

chrome.webRequest.onBeforeSendHeaders.addListener(changeRefer, {
  urls: ["<all_urls>"],
  types: ["main_frame"]
}, ["requestHeaders", "blocking"]);

chrome.webRequest.onHeadersReceived.addListener(blockCookies, {
  urls: ["<all_urls>"],
  types: ["main_frame"]
}, ["responseHeaders", "blocking"]);
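As a sanity check, the intended cookie-allowlist behavior can be exercised on its own outside the browser. The sketch below is a standalone copy of that logic with an invented mock header list, so it runs in plain Node rather than as an extension:

```javascript
// Standalone copy of the cookie-allowlist filter, so it can
// be tested outside Chrome. Only Cookie headers are ever
// dropped; they survive only for allowlisted sites.
var ALLOW_COOKIES = ["nytimes", "ft.com"];

function keepHeader(header, url) {
  if (header.name !== "Cookie") {
    return true;
  }
  return ALLOW_COOKIES.some(function(site) {
    return url.includes(site);
  });
}

// Mock request headers, invented for illustration.
var headers = [
  {name: "Cookie", value: "session=abc"},
  {name: "Accept", value: "text/html"}
];

// Cookies are stripped for wsj.com ...
var wsj = headers.filter(function(h) {
  return keepHeader(h, "https://www.wsj.com/articles/x");
});
console.log(wsj.length); // 1 (only Accept survives)

// ... but kept for an allowlisted site like ft.com.
var ft = headers.filter(function(h) {
  return keepHeader(h, "https://www.ft.com/content/y");
});
console.log(ft.length); // 2
```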

Save both files in the same directory; they should be the only files in it.

Now type chrome://extensions/ in the browser address bar.

Click Load unpacked extension... (Make sure Developer Mode is checked in the upper right if you do not see the buttons.)


Select the directory where you saved the two files. Enable the chrome extension and visit wsj.com.

Remember: any time you open a privileged access path for a trusted third party, you inevitably end up allowing access to anybody who can impersonate that party.
