Today, we’re going to do some splunking within the deep, dark place which is your browser history.
In order to obtain the data for this tutorial from Google Chrome, go to ~/Library/Application Support/Google/Chrome/Default
on a Mac (on Linux, the default profile lives at ~/.config/google-chrome/Default) or %LocalAppData%\Google\Chrome\User Data\Default
on a Windows PC.
Run the following SQLite command to obtain a text file in reverse chronological order:
(Mac/Linux)
sqlite3 History "select datetime(last_visit_time/1000000-11644473600,'unixepoch'),url from urls order by last_visit_time desc" > ~/hist.txt
(Windows)
sqlite3 History "select datetime(last_visit_time/1000000-11644473600,'unixepoch'),url from urls order by last_visit_time desc" > %userprofile%\hist.txt
Check your user folder. A file called hist.txt should be there. Move the file to a suitable place for this exercise.
(This process brought to you by the brilliant people at https://superuser.com/questions/602252/can-chrome-browser-history-be-exported-to-an-html-file)
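If you don't have the sqlite3 command-line tool installed, a short script using Python's standard-library sqlite3 module should produce the same file. This is just a minimal sketch, and it assumes you've copied the Chrome History database into your working directory (Chrome may keep the original locked while the browser is running):
import sqlite3

QUERY = """
SELECT datetime(last_visit_time / 1000000 - 11644473600, 'unixepoch'), url
FROM urls
ORDER BY last_visit_time DESC
"""

# Assumes the Chrome "History" file has been copied into the working directory
conn = sqlite3.connect('History')
with open('hist.txt', 'w') as out:
    for visited_at, url in conn.execute(QUERY):
        out.write(f'{visited_at}|{url}\n')
conn.close()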
Import the needed libraries, numpy and pandas:
import pandas as pd
import numpy as np
Clean Up Data
The data we pulled is extremely messy. Here’s an example row:
2017-11-12 21:10:11|https://news.ycombinator.com/item?id=15678587
We need to split on that vertical bar while making sure not to split on a pipe that appears in the URL itself. Since pandas’ read_csv doesn’t offer a way to split only on the first occurrence of a delimiter, let’s write a custom import function:
# Open our file
with open('hist.txt') as f:
content = f.readlines()
# Strip whitespace then split on first occurrence of pipe character
raw_data = [line.split('|', 1) for line in [x.strip() for x in content]]
# We now have a 2D list.
print(raw_data[1])
['2017-11-12 21:09:21', 'https://news.ycombinator.com/']
Using our 2D list, let’s make a Pandas DataFrame with custom column headers.
data = pd.DataFrame(raw_data, columns=['datetime', 'url'])
Make sure it is working:
data.head(1)
| | datetime | url |
| --- | --- | --- |
| 0 | 2017-11-12 21:10:11 | https://news.ycombinator.com/item?id=15678587 |
Now, we’re almost done with ingesting the data. Let’s convert the datetime string column into a column of Pandas datetime elements.
data.datetime = pd.to_datetime(data.datetime)
Double-check that it is indeed a Pandas timestamp:
data.datetime[0]
Timestamp('2017-11-12 21:10:11')
As a last cleaning step, let’s strip each URL down to just its domain/subdomain:
from urllib.parse import urlparse
parser = lambda u: urlparse(u).netloc
data.url = data.url.apply(parser)
Let’s check our work again:
data.head(1)
| | datetime | url |
| --- | --- | --- |
| 0 | 2017-11-12 21:10:11 | news.ycombinator.com |
Finally, our data is clean.
Analyzing the Data
Now that the boring part is done, let’s analyze our browsing data.
Most Visited Sites
Let’s generate a list of our top sites sorted by frequency, then print out the first two to get our most visited sites:
# Aggregate domain entries
site_frequencies = data.url.value_counts().to_frame()
# Make the domain a column
site_frequencies.reset_index(level=0, inplace=True)
# Rename columns to appropriate names
site_frequencies.columns = ['domain', 'count']
# Display top 2
site_frequencies.head(2)
| | domain | count |
| --- | --- | --- |
| 0 | www.google.com | 3904 |
| 1 | github.com | 1571 |
It should come as no shock that my top sites, just like any other dev’s, are Google and GitHub.
(I’m secretly surprised that Stack Overflow didn’t make the top ten.)
Now, let’s see our top sites in a beautiful chart:
import matplotlib.pyplot as plt
topN = 20
plt.figure(1, figsize=(10,10))
plt.title(f'Top {topN} Sites Visited')
pie_data = site_frequencies['count'].head(topN).tolist()
pie_labels = None
# Uncomment to get specific domain names
# pie_labels = site_frequencies['domain'].head(topN).tolist()
plt.pie(pie_data, autopct='%1.1f%%', labels=pie_labels)
plt.show()
What else can we do with the data? A lot, I’m sure, since this is the Holy Grail for ad tracking companies.
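For instance, here’s a quick sketch (not part of the main walkthrough) that reuses the data DataFrame and the matplotlib import from above to see which hours of the day I browse the most. Note that the sqlite3 command exported the timestamps in UTC, so the hours aren’t local time:
# Count visits per hour of day (hours are UTC, as exported above)
visits_by_hour = data.datetime.dt.hour.value_counts().sort_index()

plt.figure(2, figsize=(10, 5))
plt.title('Visits by Hour of Day (UTC)')
plt.bar(visits_by_hour.index, visits_by_hour.values)
plt.xlabel('Hour of day')
plt.ylabel('Visits')
plt.show()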
Next week, I’ll try to predict browsing trends using this collected dataset.