created: 4 Dec 2015; modified: 26 May 2017; status: finished; confidence: highly likely; importance: 8
Local archives are useful for personal purposes, but sometimes, in investigations that may be controversial, you want to be able to prove that the copy you downloaded was not modified; you need to timestamp it and prove the exact file existed on or before a certain date. This can be done by creating a cryptographic hash of the file and then publishing that hash to global chains like centralized digital timestampers or the decentralized Bitcoin blockchain. Current timestamping mechanisms tend to be centralized, manual, cumbersome, or cost too much to use routinely. Centralization can be overcome by timestamping to Bitcoin; excessive cost can be overcome by batching up an arbitrary number of hashes and creating just 1 hash/timestamp covering them all; manual & cumbersome can be overcome by writing programs to handle all of this and incorporating them into one’s workflow. So using an efficient cryptographic timestamping service (the OriginStamp Internet service), we can write programs to automatically & easily timestamp arbitrary files & strings, timestamp every commit to a Git repository, and timestamp webpages downloaded for archival purposes. We can implement the same idea offline, without reliance on OriginStamp, but at the cost of additional software dependencies like a Bitcoin client.
The most robust way of timestamping is cryptographic timestamping, where a document (such as a downloaded webpage) is hashed using a cryptographic hash function like SHA-256, and then the hash is published; the hash proves that that exact version of the document existed on/before the date the hash was published on. If published to somewhere like Twitter or one’s blog, though, now one has two problems of timestamping, so it is better to use the Bitcoin blockchain, where one can easily timestamp by methods like sending 1 satoshi to the address corresponding to the document’s hash. (Appropriately, Bitcoin itself is an intellectual descendant of earlier Usenet timestamping services.)
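The mechanics are simple enough to show inline. Hashing a short string (a stand-in for a real document) always yields the same digest, so publishing the digest commits to the exact content:

```shell
# hash a (stand-in) document; anyone re-hashing the identical bytes gets the
# identical digest, so publishing the digest commits to the exact content:
printf 'hello' | sha256sum
# → 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824  -
```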
Making a full Bitcoin transaction for every version of every file one wants to timestamp works, but requires a Bitcoin client installed, can become expensive due to transaction fees, be a hassle to do manually, and bloats the Bitcoin blockchain (inasmuch as clients verifying the blockchain must keep track of all addresses with unspent funds, and every timestamping transaction represents an additional such address).
Remote timestamping service
Using services like Proof of Existence solves the install problem but not the hassle or fees (eg Proof of Existence charges ฿0.005 as of 2 December 2015 or ~$1.8 per timestamp).
We can do better by using a service like OriginStamp (Gipp et al 2015): OriginStamp is a web service which receives hashes from users, and then each day, it batches together all hashes submitted that day, hashes them, and makes a Bitcoin transaction to that master hash.1 This gives one day-level granularity of timestamping (which might sound bad, but usually day-level precision is fine, and in any case, the precision of Bitcoin timestamping is limited by the delay between blocks being mined). To verify any particular hash, one looks up that hash in the OriginStamp archives, finds the day/batch it is part of, hashes the whole batch, and checks that there was a Bitcoin transaction that day. Because OriginStamp only needs to make a single transaction each day, no matter how many hashes are submitted, it has near-zero effect on the Bitcoin blockchain and costs little to run - if one Bitcoin transaction costs 5 cents, then a year of daily transaction fees is <$20 (though OriginStamp accepts donations and I have given ฿0.1).
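The batching scheme can be sketched offline (the document strings here are hypothetical stand-ins): however many hashes are submitted, they reduce to one master hash, and one Bitcoin transaction to that master hash timestamps every document in the batch at once.

```shell
# batch several per-document hashes into one file, then take a single master
# hash over the whole batch; one transaction to that hash covers them all:
BATCH=$(mktemp)
for DOC in "first document" "second document" "third document"; do
    echo "$DOC" | sha256sum | cut --delimiter=' ' --field=1 >> "$BATCH"
done
MASTER_HASH=$(sha256sum "$BATCH" | cut --delimiter=' ' --field=1)
echo "$MASTER_HASH"   # 64 hex characters, regardless of batch size
```

Verification replays the same steps: anyone holding the batch file can recompute the master hash and check the corresponding Bitcoin transaction.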
Timestamping files or strings
We can get a free API key and then, thanks to OriginStamp’s API, write a simple Bash shell script using `curl` & `sha256sum` to timestamp files or strings, which we will name `timestamp`, make executable with `chmod +x timestamp`, and put somewhere in our path:
```bash
#!/bin/bash
set -euo pipefail
API_KEY="73be2f5ae81ffa076480ac4d48fa9b2d"
# loop over input targets, hash them whether file or string, and submit:
for TARGET in "$@"; do
    if [ -f "$TARGET" ]; then
        # since it's a valid file, tell `sha256sum` to read it and hash it:
        HASH=$(sha256sum "$TARGET" | cut --delimiter=' ' --field=1)
    else
        # if it's a string we're precommitting to instead, pipe it into `sha256sum`:
        HASH=$(echo "$TARGET" | sha256sum | cut --delimiter=' ' --field=1)
    fi
    echo -n "$TARGET: "
    curl --request POST --header "Content-Type: application/json" --header "Authorization: Token token=$API_KEY" \
        --data "{\"hash_sha256\":\"$HASH\"}" 'http://www.originstamp.org/api/stamps'
    # print a newline to keep output tidy, since curl doesn't add one after the JSON:
    echo ""
done
```
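The hashing branch can be sanity-checked offline before involving the API (the temporary file here is a stand-in for a real document):

```shell
# exercise the file branch of `timestamp` without the curl call:
TARGET=$(mktemp)
echo "some document text" > "$TARGET"
if [ -f "$TARGET" ]; then
    HASH=$(sha256sum "$TARGET" | cut --delimiter=' ' --field=1)
else
    HASH=$(echo "$TARGET" | sha256sum | cut --delimiter=' ' --field=1)
fi
echo "${#HASH}"
# → 64
```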
Now we can timestamp arbitrary files or strings as we please:
```bash
$ timestamp ~/wiki/catnip.page ~/wiki/images/logo.png
# /home/gwern/wiki/catnip.page: {"hash_sha256":"4b357388100f3cdf330bfa30572e7b3779564295a8f5e6e695fa8b2304fa450e",
# "created_at":"2015-12-02T23:57:56.985Z","updated_at":"2015-12-02T23:57:56.985Z","submitted_at":null,"title":null}
#
# /home/gwern/wiki/images/logo.png: {"hash_sha256":"243d5b9b4f97931a07d02497b8fddb181f9ba72dc37bd914077e3714d0163a2f",
# "created_at":"2015-12-02T23:57:20.996Z","updated_at":"2015-12-02T23:57:20.996Z","submitted_at":null,"title":null}
$ timestamp "Lyndon Johnson was really behind the Kennedy assassination." "Sorry: I ate the last cookie in the jar."
# Lyndon Johnson was really behind the Kennedy assassination.: {"hash_sha256":"4aef69aeaf777251d08b809ae1458c1b73653ee5f78699670d37849f6f92d116",
# "created_at":"2015-12-02T23:58:57.615Z","updated_at":"2015-12-02T23:58:57.615Z","submitted_at":null,"title":null}
#
# Sorry: I ate the last cookie in the jar.: {"hash_sha256":"508190d52a6dfff315c83d7014266737eeb70ab9b95e0cab253639de383a0b44",
# "created_at":"2015-12-02T23:59:03.475Z","updated_at":"2015-12-02T23:59:03.475Z","submitted_at":null,"title":null}
```
Timestamping version control systems
Given this script, we can integrate timestamping elsewhere - for example, into a Git version control system repository of documents using its post-commit hook feature. We could write out the full `curl` call as part of a self-contained script, but we already factored the timestamping out as a separate shell script. So setting it up and enabling it is now as simple as:
```bash
echo 'timestamp $(git rev-parse HEAD)' >> .git/hooks/post-commit
chmod +x .git/hooks/post-commit
```
Now, with each commit we make, the SHA-1 hash of the latest commit will be timestamped, and we can later take the repo and prove that all of its content existed before a certain day; this might be source code, but also anything one might want to track changes to - interviews, web page archives, copies of emails, financial documents, etc.
This approach generalizes to most version control systems built on cryptographic hashes as IDs, where timestamping the ID-hashes is enough to assure the entire tree of content. (I’m not sure about other VCSes; perhaps the post-commit hooks could timestamp entire revisions/patches?)
There have long been concerns that SHA-1 is increasingly weak; as of 2017, collisions can be generated at feasible costs, so timestamps of SHA-1 hashes no longer prove as much as they used to.
It might be possible to use a tool like `git-evtag` for hashing the entire repository history including the changes themselves (rather than just the IDs), and timestamp this master hash instead of the latest-revision hash. Alternately, since there are no worries about SHA-256 being broken anytime soon, one could write a post-commit script to directly parse out a list of modified files & timestamp each file; in which case, every version of every file has its own separate SHA-256-based timestamp. (The disadvantage here is also an advantage, as it enables selective disclosure: if you are timestamping the entire Git repository, then to subsequently prove the timestamp to a third party, you must provide the entire repository so they can replay it, see what the final state of the relevant file is, and check that it contains what you claim it contains and that the relevant revision’s SHA-1 is correctly timestamped; but if you have timestamped each file separately, you can provide just the relevant version of the relevant file from your repository, rather than every version of every file prior. The tradeoff here is similar to that of timestamping a hash of a batch vs timestamping individual hashes.
Probably the best approach is to timestamp each file at the beginning, use VCS timestamps subsequently for regular activity, and every once in a long while timestamp all the files again; then for slow-changing files, one will be probably be able to reveal a useful timestamp without needing to reveal the whole VCS history as well, while still having backup timestamps of the whole VCS in case very fine-grained timestamps turn out to be necessary.)
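The per-file approach can be sketched as an alternative post-commit hook (an untested sketch; it assumes the earlier `timestamp` script is on the `PATH`):

```shell
#!/bin/bash
# sketch: timestamp each file touched by the latest commit separately;
# --root makes this work even for the initial commit of a repository
git diff-tree --root --no-commit-id --name-only -r HEAD | while read -r FILE; do
    # deleted files no longer exist on disk, so only hash ones still present:
    [ -f "$FILE" ] && timestamp "$FILE"
done
```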
Timestamping downloaded web pages
Automatically tracking Git commits is easy because of the hook functionality, but what if we want to download web pages and then timestamp them? Downloading them normally with `wget` and then manually calling `timestamp` on whatever the file winds up being named is a pain, so we want to do it automatically. This gets a little trickier because if we write a script which takes a URL as an argument, we don’t necessarily know what the resulting filepath will be - the URL could redirect us to another version of that page with different arguments, another page on that domain, or to another domain entirely, and then there’s the URL-decoding to deal with.
The simple (and stupid) way is to parse out a filename from the `wget` output, because it conveniently places the destination filename in a pair of Unicode quote marks, which give us a perfect way to parse out the first2 downloaded filename; this turns out to work well enough in my preliminary testing. A script `wget-archive` which does this and works well with my archiver daemon:
```bash
#!/bin/bash
set -euo pipefail
cd ~/www/
USER_AGENT="Firefox 6.4"
FILE=$(nice -n 20 ionice -c3 wget --continue --unlink --page-requisites --timestamping -e robots=off \
        --reject .exe,.rar,.7z,.bin,.jar,.flv,.mp4,.avi,.webm,.ico \
        --user-agent="$USER_AGENT" "$@" 2>&1 \
       | egrep 'Saving to: ‘.*’' | sed -e 's/Saving to: ‘//' | tr -d '’' | head -1 )
timestamp "$FILE"
```
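The fragile part is the filename extraction, which can be exercised against a canned sample of wget output (the sample line and path below are hypothetical), without touching the network:

```shell
# parse the destination filename out of wget's "Saving to:" line, using the
# same egrep/sed/tr/head pipeline as `wget-archive`:
SAMPLE="Saving to: ‘www/example.com/index.html’"
FILE=$(echo "$SAMPLE" | egrep 'Saving to: ‘.*’' | sed -e 's/Saving to: ‘//' | tr -d '’' | head -1)
echo "$FILE"
# → www/example.com/index.html
```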
Local timestamping
As convenient as OriginStamp is, and as nice as it is to have only one Bitcoin transaction made per day covering all OriginStamp users, one may not want to rely on it for any number of reasons: sporadic Internet connectivity, uncertainty that OriginStamp’s data will remain accessible in the far future, uncertainty OriginStamp correctly implements the timestamping algorithm, needing to timestamp so much that it would seriously burden OriginStamp’s bandwidth/storage resources & interfere with other users, not wanting to leak volume & timing of timestamps, etc.
This can be done with yet more scripting, a local Bitcoin client with sufficient funds (~$20 should cover a year of usage), and something to convert hashes to Bitcoin addresses (`bitcoind` & `bitcoin-tool` respectively for the latter two).
A simple architecture here would be to change `timestamper` to create hashes of inputs as before, but instead of sending them off to OriginStamp, they are stored in a local file. This file accumulates hashes from every use of `timestamper` that day. At the end of the time period, another script runs, which:
- archives the master file to a date-stamped file (replacing it with an empty file to receive future hashes)
- hashes the archived file to yield the master hash of that batch of hashes
- converts the master hash to a Bitcoin address
- finally, calls a local Bitcoin client like Electrum or Bitcoin Core’s `bitcoind` to make 1 transaction to the address
So let’s say that hashes are being stored in `~/docs/timestamps/`; the simpler `timestamper` script reads just:
```bash
#!/bin/bash
set -euo pipefail
MASTER_DIR=~/docs/timestamps/
for TARGET in "$@"; do
    if [ -f "$TARGET" ]; then
        HASH=$(sha256sum "$TARGET" | cut --delimiter=' ' --field=1)
    else
        HASH=$(echo "$TARGET" | sha256sum | cut --delimiter=' ' --field=1)
    fi
    echo "$HASH" >> "$MASTER_DIR"/today.txt
done
```
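What accumulates in `today.txt` is just one 64-character hex digest per line, which can be checked with a throwaway directory standing in for `~/docs/timestamps/`:

```shell
# simulate two invocations of the local timestamper and inspect the file:
MASTER_DIR=$(mktemp -d)   # stand-in for ~/docs/timestamps/
echo "first secret"  | sha256sum | cut --delimiter=' ' --field=1 >> "$MASTER_DIR"/today.txt
echo "second secret" | sha256sum | cut --delimiter=' ' --field=1 >> "$MASTER_DIR"/today.txt
wc -l < "$MASTER_DIR"/today.txt
# → 2
```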
The hardest part is converting a SHA-256 hash to a valid Bitcoin address, which involves a number of steps, so in this example I’ll use the lightweight `bitcoin-tool` for that part. To give an example of `bitcoin-tool` use, we can verify an OriginStamp timestamp to make sure we’re doing things the same way. Take this test timestamp:
```bash
$ echo "I have a secret." | sha256sum
7306a744a285474742f4f9ae8ddae8214fb7625348d578fb3077fb0bae92b8f1
```
OriginStamp’s page includes the full batch of hashes (the Transaction Seed), which we can verify includes `7306a..f1`; so far so good. We can then pipe the full list into `sha256sum` using `xclip`, which gives us the master hash:
```bash
$ xclip -o | sha256sum
7ad6b91226939f075d79da12e5971ae6c886a48b8d7284915b74c7340ac6f61e -
```
`7ad6..1e` is the hash that the OriginStamp page claims to use as the Secret, which also checks out. This hash needs to be converted to a Bitcoin address, so we call `bitcoin-tool` with the relevant options:
```bash
$ bitcoin-tool --network bitcoin --input-type private-key --input-format hex --output-type address --output-format base58check \
    --public-key-compression compressed --input "7ad6b91226939f075d79da12e5971ae6c886a48b8d7284915b74c7340ac6f61e"
1DMQELo9krQDvHHK5nPjbKLQFnnLtUdMFm
```
`1DMQ..Fm` is also the same Bitcoin address that OriginStamp claims to send to on that page, so all that remains is to check that some bitcoin was sent to `1DMQ..Fm` that day; looking on Blockchain.info, we see that some bitcoins were indeed sent. So we have independently verified that that list of hashes was timestamped on the day claimed, that OriginStamp is working correctly, and that our use of `bitcoin-tool` to convert a SHA-256 hash to a Bitcoin address likewise works. With that, we can proceed.
To stamp a batch, we can write a script we’ll call `timestamper-flush`:
```bash
#!/bin/bash
set -euo pipefail
MASTER_DIR=~/docs/timestamps/
DATE=$(date +'%s')
mv "$MASTER_DIR"/today.txt "$MASTER_DIR"/"$DATE".txt && touch "$MASTER_DIR"/today.txt
MASTER_HASH=$(sha256sum "$MASTER_DIR/$DATE.txt" | cut --delimiter=' ' --field=1)
BITCOIN_ADDRESS=$(bitcoin-tool --network bitcoin --input-type private-key --input-format hex \
    --output-type address --output-format base58check --public-key-compression compressed \
    --input "$MASTER_HASH")
# assuming no password is needed to unlock the wallet:
## bitcoind walletpassphrase $PASSWORD 1
bitcoind sendtoaddress "$BITCOIN_ADDRESS" 0.00000001 "Timestamp for $DATE" || \
    bitcoind getbalance # no funds?
```
`timestamper-flush` can be put into a crontab as simply `@daily timestamper-flush` (or `@hourly`/`@weekly`/`@monthly`, etc.), and can be called at any time if necessary. (I have not tested these scripts, for lack of disk space to run a full node, but I believe them to be correct; and if not, the idea is clear and one can implement it as one prefers.)
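Verification later just mirrors the flush: recompute the archived batch file’s master hash and re-derive the address. Only the hashing half is runnable below (the batch contents are stand-ins); the `bitcoin-tool` step is left as a comment since it needs the external tool installed.

```shell
# the deterministic half of verifying a local timestamp: re-hash the archive
ARCHIVE=$(mktemp)   # stand-in for the archived ~/docs/timestamps/ batch file
printf '%s\n' "deadbeef" "cafebabe" > "$ARCHIVE"
MASTER_HASH=$(sha256sum "$ARCHIVE" | cut --delimiter=' ' --field=1)
echo "$MASTER_HASH"
# then re-derive the address from the master hash and check a block explorer
# for a transaction to it on the claimed day:
# bitcoin-tool --network bitcoin --input-type private-key --input-format hex \
#     --output-type address --output-format base58check \
#     --public-key-compression compressed --input "$MASTER_HASH"
```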
Now one has an efficient, local, secure timestamping service.
1. OriginStamp apparently does not use `OP_RETURN` as Proof of Existence does, which would be more efficient; but `OP_RETURN` is controversial and limited to 80 bytes, was reduced even further to 40 bytes, and then increased again to 80 bytes, which I’m not sure is enough storage space for a secure hash.↩
2. Which is typically the web page we care about; subsequent files are things like CSS or images which don’t need to be timestamped, but if one is paranoid about this, it should be possible to timestamp all the downloaded files by removing the `| head -1` call and maybe translating the newlines to spaces for the subsequent `timestamp` call.↩