What data on myself I collect and why? [see within blog graph]

How I am using 50+ sources of my personal data

This is the list of personal data sources I use or planning to use with rough guides on how to get your hands on that data if you want it as well. It's still incomplete and I'm going to update it regularly.

My goal is collecting almost all of my digital trace, automating data collection to the maximum extent possible and making it work in the background, so one can set up pipelines once and hopefully never think about it again.

This is kind of a follow-up on my previous post on the sad state of personal data, and part of my personal way of getting around this sad state.

If you're terrified by the long list, you can jump straight into "Data consumers" section to find out how I use it. In addition, check out my infrastructure map, which might explain it better!

¶1 Why do you collect X? How do you use your data?

All things considered, I think it's a fair question! Why bother with all this infrastructure and hoard the data if you never use it?

In the next section, I will elaborate on each specific data source, but to start with I'll list the rationales that all of them share:

¶backup

It may feel unnecessary, but shit happens. What if your device dies, account gets suspended for some reason or the company goes bust?

¶lifelogging

Most data in digital form got timestamps, so automatically, without manual effort, constitutes data for your timeline.

I want to remember more, be able to review my past and bring back and reflect on memories. Practicing lifelogging helps with that.

It feels very wrong that things can be forgotten and lost forever. It's understandable from the neuroscience point of view, i.e. the brain has limited capacity and it would be too distracting to remember everything all the time. That said, I want to have a choice whether to forget or remember events, and I'd like to be able to potentially access forgotten ones.

¶quantified self

Most collected digital data is somewhat quantitative and can be used to analyze your body or mind.

¶2 What do I collect/want to collect?

As I mentioned, most of the collected data serve as a means of backup/lifelogging/quantified self, so I won't mention them again in the 'Why' sections.

All my data collection pipelines are automatic unless mentioned otherwise.

Some scripts are still private so if you want to know more, let me know so I can prioritize sharing them.

¶Amazon

How: jbms/finance-dl

Why:

was planning to correlate them with monzo/HSBC transactions, but haven't got to it yet

¶Arbtt (desktop time tracker)

How: arbtt-capture

Why:

haven't used it yet, but it could be a rich source of lifelogging context

¶Bitbucket (repositories)

How: samkuehn/bitbucket-backup

Why:

proved especially useful considering Atlassian is going to wipe mercurial repositories

I've got lots of private mercurial repositories with university homework and other early projects, and it's sad to think of people who will lose theirs during this wipe.

¶Bluemaestro (environment sensor)

How: sensor syncs with phone app via Bluetooth, /data/data/com.bluemaestro.tempo_utility/databases/ is regularly copied to grab the data.

Why:

temperature during sleep data for the dashboard
lifelogging: capturing weather conditions information

E.g. I can potentially see temperature/humidity readings along with my photos from hiking or skiing.

¶Blood

How: via thriva, data imported manually into an org-mode table (not doing too frequently so wasn't worth automated scraping)

Also tracked glucose and ketones (with freestyle libre) for a few days out of curiosity, also didn't bother automating it.

Why:

contributes to the dashboard, could be a good way of establishing your baselines

¶Browser history (Firefox/Chrome)

How: custom scripts, copying the underlying sqlite databases directly, running on my computers and phone.

Why:

better browsing history

¶Emfit QS (sleep tracker)

Emfit QS is kind of a medical grade sleep tracker. It's more expensive than wristband ones (e.g. fitbit, jawbone) but also more reliable and gives more data.

How: emfitexport.

Why:

sleep data for the dashboard

¶Endomondo

How: Endomondo collects GPS data, and HR data (via Wahoo Tickr X strap). Then, karlicoss/endoexport.

Why:

exercise data for the dashboard

¶Facebook

How: manual archive export.

I barely use Facebook, so don't even bother doing it regularly.

¶Facebook Messenger

How: karlicoss/fbmessengerexport

Why:

¶Feedbin

How: via API

Why:

better browsing history

¶Feedly

How: via API

Why:

better browsing history

¶Fitbit

How: manual CSV export, as I only used it for few weeks. Then the sync stopped working and I had to return it. However, it seems possible via API.

Why:

activity data for the #dashboard

¶Foursquare/Swarm

How: via API

¶Github (repositories)

How: github-backup

Why:

capable of exporting starred repositories as well, so if the authors delete them I will still have them

¶Github (events)

How: manually requested archive (once), after that automatic karlicoss/ghexport

Why:

better browsing history
better search in comments/open issues, etc.

¶Gmail

How: imap-backup, Google Takeout

Why:

this is arguably the most important thing you should export considering how heavily everything relies on email
better search
better browsing history

¶Goodreads

How: karlicoss/goodrexport

¶Google takeout

How: semi-automatic.

only manual step: enable scheduled exports (you can schedule 6 per year at a time), and choose to keep it on Google Drive in export settings
mount your Google Drive (e.g. via google-drive-ocamlfuse)
keep a script that checks mounted Google Drive for fresh takeout and moves it somewhere safe

Why:

Google collects lots of data, which you could put to some good use. However, old data is getting wiped, so it's important to export Takeout regularly.
better browsing history
(potentially) search history for promnesia
search in youtube watch history
location data for lifelogging and the dashboard (activity)

¶STRTHackernews

How: haven't got to it yet. It's going to require:

extracting upvotes/saved items via web scraping since Hackernews doesn't offer an API for that. Hopefully, there is an existing library for that.
I'm also using Materialistic app that has its own 'saved' posts and doesn't synchronize with Hackernews.

Exporting them is going to require copying the database directly from the app private storage.

Why: same reasons as Reddit.

¶HSBC bank

How: manual exports of monthly PDFs with transactions. They don't really offer API, so unless you want to web scrape and deal with 2FA, it seems it's the best you can do.

Why

personal finance; used it with karlicoss/hsbc-parser to feed into hledger

¶Hypothesis

How: karlicoss/hypexport

Why:

¶Instapaper

How: karlicoss/instapexport

Why:

better search
better browsing history, in particular implementing overlay with highlights
quick todos via orger

¶Jawbone

How: via API. Jawbone is dead now, so if you haven't exported it already, likely your data is lost forever.

Why:

sleep data for the dashboard

¶Kindle

How: manually exported MyClippings.txt from Kindle. Potentially can be automated similarly to Kobo.

Why:

better search

¶Kobo reader

How: almost automatic via karlicoss/kobuddy. Manual step: having to connect your reader via USB now and then.

Why:

better search
spaced repetition for unfamiliar words/new concepts via orger

¶Last.fm

How: karlicoss/lastfm-backup

¶Monzo bank

How: karlicoss/monzoexport

Why:

automatic personal finance, fed into hledger

¶Nomie

How: regular copies of /data/data/io.nomie.pro/files/_pouch_events and /data/data/io.nomie.pro/files/_pouch_trackers

Why:

could be a great tool for detailed lifelogging if you're into it

¶Nutrition

I tracked almost all nutrition data for stuff I ingested over the course of a year.

How: I found most existing apps/projects clumsy and unsatisfactory, so I developed my own system. Not even a proper app, something simpler, basically a domain-specific language in Python to track it.

Tracking process was simply editing a python file and adding entries like:

# file: food_2017.py
july_09 = F(
  [  # lunch
       spinach * bag,
       tuna_spring_water * can,       # can size for this tuna is 120g
       beans_broad_wt    * can * 0.5, # half can. can size for broad beans is 200g
       onion_red_tsc     * gr(115)  , # grams, explicit
       cheese_salad_tsc  * 100,       # grams, implicit as it makes sense for cheese
       lime, # 1 fruit, implicit
  ],
  [
     # dinner...
  ],
  tea_black * 10,     # cups, implicit
  wine_red * ml * 150, # ml, explicit
)

july_10 = ... # more logs

Comments added for clarity of course, so it'd be more compact normally.

Then some code was used for processing, calculating, visualizing, etc.

Having a real programming language instead of an app let me make it very flexible and expressive, e.g.:

I could define composite dishes as Python objects, and then easily reuse them.

E.g. if I made four servings of soup on 10.08.2018, ate one immediately and froze other three I would define something like soup_20180810 = [...], and then I can simply reuse soup_20180810 when I eat it again. (date was easy to find out as I label food when put it in the freezer anyway)
I could make many things implicit, making it pretty expressive without spending time on unnecessary typing
I rarely had to in nutrient composition manually, I just pasted the product link to supermarket website and had an automatic script to parse nutrient information
For micronutrients (that usually aren't listed on labels) I used the USDA sqlite database

The hard thing was actually not entering, but rather not having nutrition information if you're eating out. That year I was mostly cooking my own food, so tracking was fairly easy.

Also I was more interested in lower bounds, (e.g. "do I consume at least recommended amount of micronutrients"), so not having logged food now and then was fine for me.

Why:

I mostly wanted to learn about food composition and how it relates to my diet, and I did

That logging motivated me to learn about different foods and try them out while keeping dishes balanced. I cooked so many different things, made my diet way more varied and became less picky.

I stopped because cooking did take some time and I actually realized that as long as I actually vary food and try to eat everything now and then, I hit all recommended amounts of micronutrients, so I stopped. It's kind of an obvious thing that everyone recommends, but one thing is hearing it as a common wisdom and completely different is coming to the same conclusion from your data.
nutritional information contributes to dashboard

¶Photos

How: no extra effort required if you sync/organize your photos and videos now and then.

Why:

obvious source of lifelogging, in addition comes with GPS data

¶PDF annotations

As in, native PDF annotations.

How: nothing needs to be done, PDFs are local to your computer. You do need some tools to crawl your filesystem and extract the annotations.

Why:

experience of using your PDF annotations (e.g. searching) is extremely poor

I'm improving this by using orger.

¶Pinboard

How: karlicoss/pinbexport

Why:

¶Plaintext notes

Mostly this refers to org-mode files, which I use for notekeeping and logging.

How: nothing needs to be done, they are local.

Why:

search comes for free, it's already local
better browsing history

¶Pocket

How: karlicoss/pockexport

Why:

better search
better browsing history, in particular implementing overlay with highlights

¶Reddit

How: karlicoss/rexport

Why:

better search
better browsing history
org-mode interface for processing saved Reddit posts/comments, via orger

¶Remember the Milk

How: ical export from the API.

Why:

better search

I stopped using RTM in favor of org-mode, but I can still easily find my old task and notes, which allowed for a smooth transition.

¶Rescuetime

How: karlicoss/rescuexport

Why:

richer contexts for lifelogging

¶Shell history

How: many shells support keeping timestamps along your commands in history.

E.g. "Remember all your bash history forever".

Why:

potentially can be useful for detailed lifelogging

¶Sleep

Apart from automatic collection of HR data, etc., I collect some extra stats like:

whether I woke up on my own or after alarm
whether I still feel sleepy shortly after waking up
whether I had dreams (and I log dreams if I did)
I log every time I feel sleepy throughout the day

How: org-mode, via org-capture into table. Alternatively, you could use a spreadsheet for that as well.

Why:

I think it's important to find connections between subjective feelings and objective stats like amount of exercise, sleep HR, etc., so I'm trying to find correlations using my dashboard
dreams are quite fun part of lifelogging

¶Sms/calls

How: SMS Backup & Restore app, automatic exports.

¶Spotify

How: export script, using plamere/spotipy

Why:

potentially can be useful for better search in music listening history
can be used for custom recommendation algorithms

¶Stackexchange

How: karlicoss/stexport

Why:

¶Taplog

(not using it anymore, in favor of org-mode)

How: regular copying of /data/data/com.waterbear.taglog/databases/Buttons Database

Why:

a quick way of single tap logging (e.g. weight/sleep/exercise etc), contributes to the dashboard

¶Telegram

How: fabianonline/telegram_backup

Why:

¶Twitter

How: twitter archive (manually, once), after that regular automatic exports via API

Why:

¶VK.com

How: Totktonada/vk_messages_backup.

Sadly VK broke their API so the script stopped working. I'm barely using VK now anyway so not motivated enough to work around it.

Why:

¶Weight

How: manually, used Nomie and Taplog, but now just using org-mode and extracting data with orgparse. Could be potentially automated via wireless scales, but not much of a priority for me.

Why:

obvious data source for the dashboard

¶TODOWhatsapp

Barely using it so haven't bothered yet.

How: Whatsapp doesn't offer API, so potentially going to require grabbing sqlite database from Android app (/data/data/com.whatsapp/databases/msgstore.db)

Why:

¶23andme

How: manual raw data export from 23andme website. I hope your genome doesn't change so often to bother with automatic exports!

Why:

was planning to setup some sort of automatic search of new genome insights against open source analysis tools

Haven't really had time to think about it yet, and it feels like a hard project out of my realm of competence.

¶3 Data consumers

¶Instant search

Typical search interfaces make me unhappy as they are siloed, slow, awkward to use and don't work offline. So I built my own ways around it! I write about it in detail here.

In essence, I'm mirroring most of my online data like chat logs, comments, etc., as plaintext. I can overview it in any text editor, and incrementally search over all of it in a single keypress.

¶orger

orger is a tool that helps you generate an org-mode representation of your data.

It lets you benefit from the existing tooling and infrastructure around org-mode, the most famous being Emacs.

I'm using it for:

searching, overviewing and navigating the data
creating tasks straight from the apps (e.g. Reddit/Telegram)
spaced repetition via org-drill

Orger comes with some existing modules, but it should be easy to adapt your own data source if you need something else.

I write about it in detail here and here.

¶promnesia

promnesia is a browser extension I'm working on to escape silos by unifying annotations and browsing history from different data sources.

I've been using it for more than a year now and working on final touches to properly release it for other people.

¶dashboard

As a big fan of #quantified-self, I'm working on personal health, sleep and exercise dashboard, built from various data sources.

I'm working on making it public, you can see some screenshots here.

¶timeline

Timeline is a #lifelogging project I'm working on.

I want to see all my digital history, search in it, filter, easily jump at a specific point in time and see the context when it happened. That way it works as a sort of external memory.

Ideally, it would look similar to Andrew Louis's Memex, or might even reuse his interface if he open sources it. I highly recommend watching his talk for inspiration.

¶`HPI` python package

This python package is a my personal (python) API to access all collected data.

I'm elaborating on it here.

¶4 --

Happy to answer any questions on my approach and help you with liberating your data.

In the next post I'm elaborating on design decisions behind my data export and access infrastructure.

Updates:

[2020-01-14]: added 'Nutrition', 'Shell history' and 'Sleep' sections

Discussion:

23 Comments

Name (optional)

E-mail (optional)

Website (optional)

Ben Congdon•4 years ago

Awesome list! I'm excited to see how your process continues to evolve.

I recently stumbled on the dogsheep project which provides a bunch of exporters to sqlite -- and I've begun to write some of my own.

1 | Reply

karlicoss•4 years ago

Thanks!

Ah, yes, saw dogsheep as well, it will be included in the next part (which is more on 'how' to export).

Generally I'm trying to avoid sqlite as long as I can since it's painful to serialize/deserialize and also change schema, and if you want to play with data interactively, something like pandas dataframe is more convenient anyway.

But sqlite is of course faster, so I'm using a hybrid approach using cachew (e.g. example here ).

1 | Reply

Konstantin•4 years ago

Just a quick question. Can promnesia show annotations from Hypothes.is? Sorry for asking: I am on mobile in a middle of nowhere so I can not just download promnesia and check it myself.

I can see it on the screenshot of v0.8. I hope that Hypothes.is support is still there.

1 | Reply

karlcoss•4 years ago

Hey, no problem! I haven't properly released it yet anyway, so you'd need to build it first anyway. I'm in progress of documenting and releasing it's easier to use by other people.

But yeah, hypothesis support is there. In fact it's pretty agnostic to specific annotation service and can work with pocket, instapaper, etc.

Wondering though, Hypothesis has pretty decent browser add-on that displays annotations inline, what's that you lacking in it that makes you ask about promnesia?

0 | Reply

Konstantin•4 years ago

I liked the idea of gathering notes on a url from different sources to display them in together. Unfortunately, nowadays I have to use different apps and sites and quite a lof of valuable insights are scattered across them

0 | Reply

Dane-git•4 years ago

I too have desired such a feature, however the final view of what such a thing might look like is still a bit hazy to me. Something between hypothesis and a oneN0te type thing. Something like a unified topic view, where highlights from separate sources can easily be integrated, re-arranged, expanded, and notes stuck where desired.

0 | Reply

Anonymous•4 years ago

What the hell? I thought I was being super original having this idea, apparently theres a whole "quantified self"(just learned about this term) community out there.

1 | Reply

karlcoss•4 years ago

Welcome to the quantified self community! :)

Recommend looking at awesome-quantified-self, people have been thinking about it for a while. I'm sure you'll have some original ideas that haven't been done by anyone else though :)

1 | Reply

tilapia•4 years ago

How about to added an interface for better browsing, search, ... and json export/import ?

1 | Reply

silipwn•3 years ago

Hi, Karl! I was interested in knowing how you track things from your smartphone?

0 | Reply

karlicoss•3 years ago

Hi! I don't collect much from my smartphone, but when I do, I usually use Orgzly (as well as using it for other org-mode notes)

0 | Reply

Riley•3 years ago

Where can I go for an update on this? I would pay you over $100 to set this up for me. Right now Im using python to export my imessages and notify me of daily usage statistics and things of the sort. From there I want to write code that will enter it all into a CSV for me and from there into a dashboard.

0 | Reply

kiko•3 years ago

Hi! Cool site, wonder how you build your website and what comment system do you use?

0 | Reply

karlicoss•3 years ago

Thanks! I use a custom script to build it (but the bulk of work is done by Emacs for org-mode and Jupyter for python notebooks), more info is here https://github.com/karlicoss/beepb00p

For comments, at the moment I'm using https://posativ.org/isso

0 | Reply

Albert Zeyer•3 years ago

Hi karlicoss!

I recently tried to manage the Google Takeout in a somewhat reasonable way. As a first try, I manually scheduled a takeout. And it resulted in 358 packages (ZIP files) (each 2GB) (~700 GB in total) (biggest part are some YouTube videos I uploaded, and also all my photos + videos, which are in full quality, because the Pixel phone has unlimited storage on Google Photos). The website did not allow to download all together, so you had to go through the list and manually click on each. To make things even more annoying, it asked you to type in your password after every 10 minutes.

I coded a Chrome extension to automate at least this download part, although this is not really nice: https://github.com/albertz/chrome-ext-google-takeout-downloader

You say you put it into your Google Drive instead. But wouldn't that eat up your quota? In my case, I'm on the free plan, so there is no way the 700 GB would fit. And I read that they will start deleting random files from you when you are over quota.

On Google Drive, is the Takeout still compressed (ZIP or tar.gz) or all uncompressed? Currently when I download the ZIP files, I need to uncompress them and merge them all together into the same directory structure.

Reading/parsing from the Takeout is yet another task. As I understand your infra (https://beepb00p.xyz/myinfra.html), you only take out the (GPS?) location? What about other stuff? E.g. mails and photos. Or your mention the browser history. All the data uses a lot of different formats. You find some JSON, some CSV, but sometimes also just some HTML files which you would need to scrape.

I somehow would want to synchronize and combine that data (e.g. mails and photos) with my other explicit backup of my mails and photos. E.g. maybe the Takeout has some other meta information which I don't have otherwise. Then I would want to add this meta information.

Currently I started to use Git Annex for the storage of all my data, to e.g. deduplicate photos. I also just dump in the whole Google Takeout into it as-is. But I'm not sure yet whether this is the best way.

0 | Reply

karlicoss•3 years ago

Hi! Wow, 700 GB sounds annoying! I don't keep my photos there (and don't have many youtube videos), so mine is two orders of magnitude smaller. So also don't know about any photo-specific metadata as well.

So I don't run into the quota, and after grabbing the export locally, I delete it from Drive.

Yep, indeed, there are many formats! And I don't extract all data (at least yet). Also some of it didn't fit on the diagram :)

https://github.com/karlicoss/HPI/blob/master/my/media/youtube.py
https://github.com/karlicoss/promnesia/blob/master/src/promnesia/sources/takeout.py
I also use emails, but not that often (and just open them in mutt). But would be nice to support in HPI as well.

Yeah, Git Annex sounds reasonable for such thing. Although do you have that many duplicates? Otherwise it might end up just as overhead.

0 | Reply

YamiFrankc•3 years ago

This is fantastic, you have inspired me start doing something similar. I have lost data form years ago that I thought I would never need, and its kinda sad to want to look into what I was doing say 10 years ago and not be able to do it. You also got me interested in org-mode!

0 | Reply

karlicoss•3 years ago

I'm glad :)

0 | Reply

yami•3 years ago

I am wondering if you use/plan to also archive Discord Its seems like there is a lot of relevant data can be taken from there too, even if you are not that active on it

0 | Reply

karlicoss•3 years ago

Yep, it's in my plans! From what I know, Discord API is a bit prohibitive, so have to be careful about exporting data... But it's possible to do a manual data export, and then you can use https://github.com/seanbreckenridge/discord_data to process it

0 | Reply

Anominous•3 years ago

Hi, I was just wondering how do you manage the YouTube videos? Any extractor tools you have used?

0 | Reply

karlicoss•3 years ago

Not really -- so far only been using youtube history from google takeout, but would be nice to do something like this eventually I've got https://github.com/coleifer/micawber#examples bookmarked, but haven't tried yet

0 | Reply

Mel•5 months ago

Thanks a lot for pockexport!

Regarding the limitations [1]. You probably already know but in the case that not: Without a count parameter being set, 5000 items will be returned as long as no offset is defined. If an offset is defined but no count 25 items will be returned.

I tested with count=15000 and it was possible to download 15000 items at once, although the return recipe stated the item count was 5000 the list contained 15000 items.

[1] https://github.com/karlicoss/pockexport#limitations

0 | Reply

What data on myself I collect and why? [see within blog graph]

Table of Contents

¶1 Why do you collect X? How do you use your data?

¶backup

¶lifelogging

¶quantified self

¶2 What do I collect/want to collect?

¶Amazon

¶Arbtt (desktop time tracker)

¶Bitbucket (repositories)

¶Bluemaestro (environment sensor)

¶Blood

¶Browser history (Firefox/Chrome)

¶Emfit QS (sleep tracker)

¶Endomondo

¶Facebook

¶Facebook Messenger

¶Feedbin

¶Feedly

¶Fitbit

¶Foursquare/Swarm

¶Github (repositories)

¶Github (events)

¶Gmail

¶Goodreads

¶Google takeout

¶STRTHackernews

¶HSBC bank

¶Hypothesis

¶Instapaper

¶Jawbone

¶Kindle

¶Kobo reader

¶Last.fm

¶Monzo bank

¶Nomie

¶Nutrition

¶Photos

¶PDF annotations

¶Pinboard

¶Plaintext notes

¶Pocket

¶Reddit

¶Remember the Milk

¶Rescuetime

¶Shell history

¶Sleep

¶Sms/calls

¶Spotify

¶Stackexchange

¶Taplog

¶Telegram

¶Twitter

¶VK.com

¶Weight

¶TODOWhatsapp

¶23andme

¶3 Data consumers

¶Instant search

¶orger

¶promnesia

¶dashboard

¶timeline

¶HPI python package

¶4 --

23 Comments

¶`HPI` python package