Today, we’re going to do some splunking within the deep, dark place which is your browser history.
In order to obtain the data for this tutorial from Google Chrome, go to ~/Library/Application Support/Google/Chrome/Default
on a Mac (on Linux, the default profile lives at ~/.config/google-chrome/Default) or %LocalAppData%\Google\Chrome\User Data\Default
on a Windows PC.
Run the following SQLite command to obtain a text file in reverse chronological order:
(Mac/Linux)
sqlite3 History "select datetime(last_visit_time/1000000-11644473600,'unixepoch'),url from urls order by last_visit_time desc" > ~/hist.txt
(Windows)
sqlite3 History "select datetime(last_visit_time/1000000-11644473600,'unixepoch'),url from urls order by last_visit_time desc" > %userprofile%\hist.txt
Check your user folder. A file called hist.txt should be there. Move the file to a suitable place for this exercise.
(This process brought to you by the brilliant people at https://superuser.com/questions/602252/can-chrome-browser-history-be-exported-to-an-html-file)
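If you don't have the sqlite3 command-line tool installed, a short script using Python's standard-library sqlite3 module should produce the same file. This is just a minimal sketch, and it assumes you've copied the Chrome History database into your working directory (Chrome may keep the original locked while the browser is running):
import sqlite3

QUERY = """
SELECT datetime(last_visit_time / 1000000 - 11644473600, 'unixepoch'), url
FROM urls
ORDER BY last_visit_time DESC
"""

# Assumes the Chrome "History" file has been copied into the working directory
conn = sqlite3.connect('History')
with open('hist.txt', 'w') as out:
    for visited_at, url in conn.execute(QUERY):
        out.write(f'{visited_at}|{url}\n')
conn.close()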
Import the needed libraries, numpy and pandas:
import pandas as pd
import numpy as np
Clean Up Data
The data we pulled is extremely messy. Here’s an example row:
2017-11-12 21:10:11|https://news.ycombinator.com/item?id=15678587
We need to split on that vertical bar while making sure not to split on a pipe that appears in the URL itself. Since pandas’ read_csv doesn’t offer a way to split only on the first occurrence of a delimiter, let’s write a custom import function:
# Open our file
with open('hist.txt') as f:
content = f.readlines()
# Strip whitespace then split on first occurrence of pipe character
raw_data = [line.split('|', 1) for line in [x.strip() for x in content]]
# We now have a 2D list.
print(raw_data[1])
['2017-11-12 21:09:21', 'https://news.ycombinator.com/']
Using our 2D list, let’s make a Pandas DataFrame with custom column headers.
data = pd.DataFrame(raw_data, columns=['datetime', 'url'])
Make sure it is working:
data.head(1)
| | datetime | url |
| --- | --- | --- |
| 0 | 2017-11-12 21:10:11 | https://news.ycombinator.com/item?id=15678587 |
Now, we’re almost done with ingesting the data. Let’s convert the datetime string column into a column of Pandas datetime elements.
data.datetime = pd.to_datetime(data.datetime)
Double-check that it is indeed a Pandas timestamp:
data.datetime[0]
Timestamp('2017-11-12 21:10:11')
As a last cleaning step, let’s strip each URL down to just its domain/subdomain:
from urllib.parse import urlparse
parser = lambda u: urlparse(u).netloc
data.url = data.url.apply(parser)
Let’s check our work again:
data.head(1)
| | datetime | url |
| --- | --- | --- |
| 0 | 2017-11-12 21:10:11 | news.ycombinator.com |
Finally, our data is clean.
Analyzing the Data
Now that the boring part is done, let’s analyze our browsing data.
Most Visited Sites
Let’s generate a list of our top sites sorted by frequency, then print out the first two to get our most visited sites:
# Aggregate domain entries
site_frequencies = data.url.value_counts().to_frame()
# Make the domain a column
site_frequencies.reset_index(level=0, inplace=True)
# Rename columns to appropriate names
site_frequencies.columns = ['domain', 'count']
# Display top 2
site_frequencies.head(2)
| | domain | count |
| --- | --- | --- |
| 0 | www.google.com | 3904 |
| 1 | github.com | 1571 |
It should come as no shock that my top sites, just like any other dev’s, are Google and GitHub.
(I’m secretly surprised that Stack Overflow didn’t make the top ten.)
Now, let’s see our top sites in a beautiful chart:
import matplotlib.pyplot as plt
topN = 20
plt.figure(1, figsize=(10,10))
plt.title(f'Top {topN} Sites Visited')
pie_data = site_frequencies['count'].head(topN).tolist()
pie_labels = None
# Uncomment to get specific domain names
# pie_labels = site_frequencies['domain'].head(topN).tolist()
plt.pie(pie_data, autopct='%1.1f%%', labels=pie_labels)
plt.show()
What else can we do with the data? A lot, I’m sure, since this is the Holy Grail for ad tracking companies.
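For instance, here’s a quick sketch (not part of the main walkthrough) that reuses the data DataFrame and the matplotlib import from above to see which hours of the day I browse the most. Note that the sqlite3 command exported the timestamps in UTC, so the hours aren’t local time:
# Count visits per hour of day (hours are UTC, as exported above)
visits_by_hour = data.datetime.dt.hour.value_counts().sort_index()

plt.figure(2, figsize=(10, 5))
plt.title('Visits by Hour of Day (UTC)')
plt.bar(visits_by_hour.index, visits_by_hour.values)
plt.xlabel('Hour of day')
plt.ylabel('Visits')
plt.show()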
Next week, I’ll try to predict browsing trends using this collected dataset.