
This has probably never happened in the real world yet, and may never happen, but let's consider it: say you have a Git repository, make a commit, and get very, very unlucky: one of the blobs ends up having the same SHA-1 as another that is already in your repository. The question is, how would Git handle this? Simply fail? Find a way to link the two blobs and check which one is needed according to the context?

More a brain-teaser than an actual problem, but I found the issue interesting.

Once a brain teaser, now potentially an actual problem. – Toby Feb 23 at 13:30

@Toby This question was about a pre-image attack; what Google demonstrated is a collision attack -- similar but slightly different. You can read more about the difference here. – Saheed Feb 23 at 15:07

@Saheed I don't see what part of this question is about a pre-image attack specifically, as the question posed is just about a collision in a git repository, not about exploiting it. – Toby Feb 24 at 18:52

@Toby The original brain teaser was not about an attack (neither pre-image nor collision) but about an accidental collision, which is so unfathomably unlikely that it is not worth considering. I think what Saheed was correctly trying to say is that this is still not an actual problem. However, you are right that the Google collision attack has potentially created a security problem, depending on how Git is used. – Andrew W. Phillips Feb 25 at 6:49

I did an experiment to find out exactly how Git would behave in this case. This is with version 2.7.0~rc0+next.20151210 (the version in the diff below). I basically just reduced the hash size from 160 bits to 4 bits by applying the following diff and rebuilding git:

--- git-2.7.0~rc0+next.20151210.orig/block-sha1/sha1.c
+++ git-2.7.0~rc0+next.20151210/block-sha1/sha1.c
@@ -246,6 +246,8 @@ void blk_SHA1_Final(unsigned char hashou
    blk_SHA1_Update(ctx, padlen, 8);

    /* Output hash */
-   for (i = 0; i < 5; i++)
-       put_be32(hashout + i * 4, ctx->H[i]);
+   for (i = 0; i < 1; i++)
+       put_be32(hashout + i * 4, (ctx->H[i] & 0xf000000));
+   for (i = 1; i < 5; i++)
+       put_be32(hashout + i * 4, 0);
 }
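Rebuilding the patched tree is just the standard Git build. A minimal sketch of the setup (paths are illustrative, not from the original experiment):

# Build the patched git and try it out in a scratch repository
cd git-2.7.0~rc0+next.20151210
make                                      # standard git build; assumes the usual prerequisites
mkdir /tmp/collide && cd /tmp/collide
~/git-2.7.0~rc0+next.20151210/git init    # hypothetical path to the freshly built binary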

Then I did a few commits and noticed the following.

  1. If a blob already exists with the same hash, you will not get any warnings at all. Everything seems to be OK, but when you push, someone clones, or you revert, you will lose the latest version (in line with what is explained in the other answers).
  2. If a tree object already exists and you make a blob with the same hash: everything will seem normal until you either try to push or someone clones your repository. Then you will see that the repo is corrupt.
  3. If a commit object already exists and you make a blob with the same hash: same as #2, corrupt.
  4. If a blob already exists and you make a commit object with the same hash, it will fail when updating the "ref".
  5. If a blob already exists and you make a tree object with the same hash, it will fail when creating the commit.
  6. If a tree object already exists and you make a commit object with the same hash, it will fail when updating the "ref".
  7. If a tree object already exists and you make a tree object with the same hash, everything will seem OK, but when you commit, the whole repository will reference the wrong tree.
  8. If a commit object already exists and you make a commit object with the same hash, everything will seem OK, but when you commit, the commit will never be created and the HEAD pointer will be moved to an old commit.
  9. If a commit object already exists and you make a tree object with the same hash, it will fail when creating the commit.

For #2 you will typically get an error like this when you run "git push":

error: object 0400000000000000000000000000000000000000 is a tree, not a blob
fatal: bad blob object
error: failed to push some refs to origin

or:

error: unable to read sha1 file of file.txt (0400000000000000000000000000000000000000)

if you delete the file and then run "git checkout file.txt".
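In that state, Git's plumbing can tell you what the colliding name actually refers to. A small diagnostic sketch (the hash is the one from the 4-bit experiment above):

# Ask git for the real type of the suspicious object, then check the whole object graph
git cat-file -t 0400000000000000000000000000000000000000   # prints "tree" in case #2
git fsck --full                                            # reports type and connectivity errors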

For #4 and #6, you will typically get an error like this:

error: Trying to write non-commit object
f000000000000000000000000000000000000000 to branch refs/heads/master
fatal: cannot update HEAD ref

when running "git commit". In this case you can typically just type "git commit" again, since this will create a new hash (because of the changed timestamp).

For #5 and #9, you will typically get an error like this:

fatal: 1000000000000000000000000000000000000000 is not a valid 'tree' object

when running "git commit".

If someone tries to clone your corrupt repository, they will typically see something like:

git clone of a repo with a colliding blob (where
d000000000000000000000000000000000000000 is a commit and
f000000000000000000000000000000000000000 is a tree):

Cloning into 'clonedversion'...
done.
error: unable to read sha1 file of s (d000000000000000000000000000000000000000)
error: unable to read sha1 file of tullebukk
(f000000000000000000000000000000000000000)
fatal: unable to checkout working tree
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry the checkout with 'git checkout -f HEAD'

What "worries" me is that in two cases (2,3) the repository becomes corrupt without any warnings, and in 3 cases (1,7,8), everything seems ok, but the repository content is different than what you expect it to be. People cloning or pulling will have a different content than what you have. The cases 4,5,6 and 9 are ok, since it will stop with an error. I suppose it would be better if it failed with an error at least in all cases.

Awesome answer - reducing the hash size to see how it actually behaves is a great idea. – Gnurou Jan 19 '16 at 3:59

@Gnurou I agree and did upvote that answer at the time. Were those cases mentioned to the git mailing list? – VonC Jan 19 '16 at 5:42

Also, what are the plans, if any, to move to another hashing algorithm? – Pete Dec 13 '16 at 18:07

…and now it's real: shattered.it – lapo Feb 24 at 15:49

Must read - Linus Torvalds' explanation: plus.google.com/+LinusTorvalds/posts/7tp2gYWQugL – phil_lgr Mar 1 at 6:27

Original answer (2012) (see shattered.io 2017 SHA1 collision below)

That old (2006) answer from Linus might still be relevant:

Nope. If it has the same SHA1, it means that when we receive the object from the other end, we will not overwrite the object we already have.

So what happens is that if we ever see a collision, the "earlier" object in any particular repository will always end up overriding. But note that "earlier" is obviously per-repository, in the sense that the git object network generates a DAG that is not fully ordered, so while different repositories will agree about what is "earlier" in the case of direct ancestry, if the object came through separate and not directly related branches, two different repos may obviously have gotten the two objects in different order.

However, the "earlier will override" is very much what you want from a security standpoint: remember that the git model is that you should primarily trust only your own repository.
So if you do a "git pull", the new incoming objects are by definition less trustworthy than the objects you already have, and as such it would be wrong to allow a new object to replace an old one.

So you have two cases of collision:

  • the inadvertent kind, where you somehow are very very unlucky, and two files end up having the same SHA1.
    At that point, what happens is that when you commit that file (or do a "git-update-index" to move it into the index, but not committed yet), the SHA1 of the new contents will be computed, but since it matches an old object, a new object won't be created, and the commit-or-index ends up pointing to the old object.
    You won't notice immediately (since the index will match the old object SHA1, and that means that something like "git diff" will use the checked-out copy), but if you ever do a tree-level diff (or you do a clone or pull, or force a checkout) you'll suddenly notice that that file has changed to something completely different than what you expected.
    So you would generally notice this kind of collision fairly quickly.
    In related news, the question is what to do about the inadvertent collision.
    First off, let me remind people that the inadvertent kind of collision is really really really damn unlikely, so we'll quite likely never ever see it in the full history of the universe.
    But if it happens, it's not the end of the world: what you'd most likely have to do is just change the file that collided slightly, and just force a new commit with the changed contents (add a comment saying "/* This line added to avoid collision */") and then teach git about the magic SHA1 that has been shown to be dangerous.
    So over a couple of million years, maybe we'll have to add one or two "poisoned" SHA1 values to git. It's very unlikely to be a maintenance problem ;)

  • The attacker kind of collision because somebody broke (or brute-forced) SHA1.
    This one is clearly a lot more likely than the inadvertent kind, but by definition it's always a "remote" repository. If the attacker had access to the local repository, he'd have much easier ways to screw you up.
    So in this case, the collision is entirely a non-issue: you'll get a "bad" repository that is different from what the attacker intended, but since you'll never actually use his colliding object, it's literally no different from the attacker just not having found a collision at all, but just using the object you already had (ie it's 100% equivalent to the "trivial" collision of the identical file generating the same SHA1).
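The "never overwrite an object you already have" behaviour Linus describes is easy to observe with ordinary identical content, which goes through exactly the same code path a colliding blob would; a minimal sketch:

# Two files with identical bytes map to a single blob: git sees the
# object already exists and simply does not write it again.
echo 'hello' > a.txt
git hash-object -w a.txt    # ce013625030ba8dba906f756967f9e9ca394464a
echo 'hello' > b.txt
git hash-object -w b.txt    # same id; the existing blob is reused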

The question of using SHA-256 is regularly mentioned, but not acted upon for now.


Note (humor): you can force a commit to have a particular SHA-1 prefix with the gitbrute project from Brad Fitzpatrick (bradfitz).

gitbrute brute-forces a pair of author+committer timestamps such that the resulting git commit has your desired prefix.

Example: https://github.com/bradfitz/deadbeef
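Typical usage, going by the gitbrute README (treat the exact flags and install command as illustrative):

# In a repo with at least one commit: rewrite the author/committer timestamps
# of HEAD until the commit id starts with the desired hex prefix.
go install github.com/bradfitz/gitbrute@latest   # or build from a clone of the repo
gitbrute -prefix=deadbeef
git rev-parse HEAD                               # now starts with deadbeef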


In the comments, Daniel Dinnyes points to 7.1 Git Tools - Revision Selection, which includes:

A higher probability exists that every member of your programming team will be attacked and killed by wolves in unrelated incidents on the same night.


More recently (February 2017), shattered.io demonstrated the possibility of forging a SHA-1 collision (see much more in my separate answer, including Linus Torvalds' Google+ post):

  • a/ it still requires over 9,223,372,036,854,775,808 SHA-1 computations, the equivalent processing power of 6,500 years of single-CPU computations and 110 years of single-GPU computations.
  • b/ it would forge one file with the same SHA-1, but with the additional constraint that its content and size also produce the identical SHA-1 (a collision on the content alone is not enough): a blob's SHA-1 is computed based on the content and the size (see "How is the git hash calculated?" and the sketch after this list).
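Point b/ is easy to verify from the shell: a blob id is the SHA-1 of a "blob <size>\0" header followed by the content. A minimal sketch (GNU userland assumed, so that wc prints a bare number):

# Recompute a blob id by hand and compare it with git's answer
printf 'hello\n' > f
git hash-object f                                         # ce013625030ba8dba906f756967f9e9ca394464a
{ printf 'blob %s\0' "$(wc -c < f)"; cat f; } | sha1sum   # same digest, computed without git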

See "Lifetimes of cryptographic hash functions" from Valerie Anita Aurora for more.
In that page, she notes:

Google spent 6500 CPU years and 110 GPU years to convince everyone we need to stop using SHA-1 for security critical applications.
Also because it was cool

See more in my separate answer below.

twist: still hashes the same after adding /* This line added to avoid collision */ :D you can win the lottery twice :P – Janus Troelsen May 18 '13 at 14:19

@JanusTroelsen sure, but it is still a lottery, is it not? ;) (as mentioned in this short note about SHA1) – VonC May 18 '13 at 14:22

@VonC regarding that reference: is an outburst of a global werewolf epidemic - wiping out all humanity and resulting in the gruesome death of all my developers on the same night, even though they were geographically distributed - considered an unrelated incident?? Of course, assuming it happened on a full moon, obviously. Now, such a scenario would change things. Even thinking about it is insanity! That is on an entirely different scale of probability! That would mean we must... STOP USING GIT! NOW!!! EVERYONE RUUUUUN!!!!!!! – Daniel Dinnyes Mar 26 '14 at 15:15

Note that gitbrute doesn't force a particular SHA1 but only a prefix (i.e. a subpart of the whole SHA1). Forcing an entire SHA1 (i.e. a prefix the full length of the hash) would probably take "too long". – mb14 Mar 1 '15 at 16:04

@JanusTroelsen Then you would add: /* This line added to avoid collision of the avoid collision line */ – smg Jun 22 '15 at 22:59

According to Pro Git:

If you do happen to commit an object that hashes to the same SHA-1 value as a previous object in your repository, Git will see the previous object already in your Git database and assume it was already written. If you try to check out that object again at some point, you’ll always get the data of the first object.

So it wouldn't fail, but it wouldn't save your new object either.
I don't know how that would look on the command line, but that would certainly be confusing.

A bit further down, that same reference illustrates the likelihood of such a collision:

Here’s an example to give you an idea of what it would take to get a SHA-1 collision. If all 6.5 billion humans on Earth were programming, and every second, each one was producing code that was the equivalent of the entire Linux kernel history (1 million Git objects) and pushing it into one enormous Git repository, it would take 5 years until that repository contained enough objects to have a 50% probability of a single SHA-1 object collision. A higher probability exists that every member of your programming team will be attacked and killed by wolves in unrelated incidents on the same night.
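As a rough sanity check of those figures (my arithmetic, not Pro Git's), the standard birthday approximation lands in the same ballpark:

p \approx 1 - e^{-n^2/(2N)}, \qquad N = 2^{160} \approx 1.46 \times 10^{48}
n \approx (6.5 \times 10^{9}\ \text{programmers}) \times (10^{6}\ \text{objects/s each}) \times (5\ \text{y} \approx 1.6 \times 10^{8}\ \text{s}) \approx 1.0 \times 10^{24}
\Rightarrow\ p \approx 1 - e^{-0.36} \approx 0.3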

I'd like to see the source for the numbers on the last sentence ;-) – Joachim Sauer Feb 22 '12 at 12:11

@Jasper: that link is good documentation, but it does not contain statistics on the probability of every member of a team being attacked and killed by wolves in unrelated incidents on the same night. – Joachim Sauer Jan 7 '13 at 8:15

@Jasper: Well, the way I read it, the text literally claims that the probability of 6.5 billion team members getting killed by wolves on the same night is higher than 50%. But my main objection to his statement is that such an event would have to be a worldwide phenomenon; it's inconceivable that this could occur due to unrelated incidents. ;) – Keith Robertson Sep 18 '13 at 13:12

@KeithRobertson I am pretty sure the post is talking about the chance of all your actual team members being eaten compared to the chance of a hash collision if everyone on the world was producing insane amounts of code, alongside the time it takes under those circumstances to get to a 50% chance of a collision (i.e. the wolves incident didn't involve the entire world and the 50% was separate from the wolves). You did get the point though, if such an event is inconceivable, so should a git hash collision be. (Of course, one is (almost) purely chance based and the other isn't, but still.) – Jasper Sep 19 '13 at 16:27

Watch out for wolves tonight – Toby Feb 23 at 13:32

To add to my previous answer from 2012, there is now (Feb. 2017, five years later) an example of an actual SHA-1 collision with shattered.io, where you can craft two colliding PDF files: that is, obtain a SHA-1 digital signature on the first PDF file which can also be abused as a valid signature on the second PDF file.
See also "At death’s door for years, widely used SHA1 function is now dead", and this illustration.

Update 26 February: Linus confirmed the following points in a Google+ post:

(1) First off - the sky isn't falling. There's a big difference between using a cryptographic hash for things like security signing, and using one for generating a "content identifier" for a content-addressable system like git.

(2) Secondly, the nature of this particular SHA1 attack means that it's actually pretty easy to mitigate against, and there's already been two sets of patches posted for that mitigation.

(3) And finally, there's actually a reasonably straightforward transition to some other hash that won't break the world - or even old git repositories.


Original answer (25th of February):

Joey Hess tried those PDFs in a Git repo, and he found:

That includes two files with the same SHA and size, which do get different blobs thanks to the way git prepends the header to the content.

joey@darkstar:~/tmp/supercollider>sha1sum  bad.pdf good.pdf 
d00bbe65d80f6d53d5c15da7c6b4f0a655c5a86a  bad.pdf
d00bbe65d80f6d53d5c15da7c6b4f0a655c5a86a  good.pdf
joey@darkstar:~/tmp/supercollider>git ls-tree HEAD
100644 blob ca44e9913faf08d625346205e228e2265dd12b65    bad.pdf
100644 blob 5f90b67523865ad5b1391cb4a1c010d541c816c1    good.pdf

While appending identical data to these colliding files does generate other collisions, prepending data does not.

So the main vector of attack (forging a commit) would be:

  • Generate a regular commit object;
  • use the entire commit object + NUL as the chosen prefix, and
  • use the identical-prefix collision attack to generate the colliding good/bad objects.
  • ... and this is useless because the good and bad commit objects still point to the same tree!

Plus, you can already detect cryptanalytic collision attacks against SHA-1 in each file with cr-marcstevens/sha1collisiondetection.

Adding a similar check in Git itself would have some computation cost.
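The tool ships as a hardened drop-in for sha1sum; going by the project README (binary name and output are per its docs, treat as illustrative):

# Build the collision-detecting SHA-1 tool and scan a file
git clone https://github.com/cr-marcstevens/sha1collisiondetection
cd sha1collisiondetection && make
bin/sha1dcsum shattered-1.pdf   # prints the digest and flags files carrying a collision attack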

On changing the hash, Linus comments:

The size of the hash and the choice of the hash algorithm are independent issues.
What you'd probably do is switch to a 256-bit hash, use that internally and in the native git database, and then by default only show the hash as a 40-character hex string (kind of like how we already abbreviate things in many situations).
That way tools around git don't even see the change unless passed in some special "--full-hash" argument (or "--abbrev=64" or whatever - the default being that we abbreviate to 40).
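That abbreviation machinery already exists today; for instance:

# Git routinely shows shortened, uniqueness-checked object names
git rev-parse --short=12 HEAD      # first 12 hex digits of the full id
git log --oneline --abbrev=16 -3   # abbreviated ids in the log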

Still, a transition plan (from SHA-1 to another hash function) would be complex, but it is actively studied; a convert-to-object_id campaign is in progress.

The collision: 1. The attempt was to create a collision, not one occurring by coincidence. 2. From the PDF report: In total the computational effort spent is equivalent to 2^63.1 SHA-1 compressions and took approximately 6,500 CPU years and 100 GPU years. 3. Although we should move on from MD5 and SHA-1, they are in general fine for file-uniqueness usages. – zaph Feb 25 at 3:48

It's worth noting that WebKit checked in the colliding PDFs for a test. It broke their git-svn mirror infrastructure: bugs.webkit.org/show_bug.cgi?id=168774#c24 – dahlbyk Feb 25 at 15:03

@dahlbyk It is worth noting indeed... in that I noted it in the answer (the link behind "It does have some issue for git-svn though" refers to it, albeit indirectly). – VonC Feb 25 at 15:05

Doh, tried to check if this had already been mentioned. +1 – dahlbyk Feb 25 at 15:07

@Mr_and_Mrs_D no, it does not yet fail with an error. A big patch is in progress which will then help facilitate that collision detection: marc.info/?l=git&m=148987267504882&w=2 – VonC 22 hours ago

I think cryptographers would celebrate.

Quote from the Wikipedia article on SHA-1:

In February 2005, an attack by Xiaoyun Wang, Yiqun Lisa Yin, and Hongbo Yu was announced. The attacks can find collisions in the full version of SHA-1, requiring fewer than 2^69 operations. (A brute-force search would require 2^80 operations.)
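For scale (my arithmetic, not the article's), that is a speed-up over brute force of

2^{80} / 2^{69} = 2^{11} = 2048

i.e. roughly three orders of magnitude.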

The point is that a flaw has been found in SHA1 and that this was about the time when Git was being introduced. Also, probability is non-linear. Just because you play the lottery for fifty years does not mean you have a higher chance of winning. You just have the same chance every single time. The person playing for the first time can still win. – 0xC0000022L Feb 8 '14 at 4:11

This attack only finds collisions, i.e. you can find some x and y such that h(x) == h(y). That is a serious threat for arbitrary data like SSL certificates; however, it does not affect Git, which would only be vulnerable to a second pre-image attack, i.e. given a message x, finding an x' such that h(x) == h(x'). So this attack doesn't weaken Git. Also, Git didn't choose SHA-1 for security reasons. – Łukasz Niemier Feb 2 at 0:08

Now a collision has been found - just not one that bothers git directly yet. stackoverflow.com/questions/42433126/… – Willem Hengeveld Feb 28 at 23:50

There are several different attack models for hashes like SHA-1, but the one usually discussed is collision search, for which Marc Stevens' HashClash tool exists.

"As of 2012, the most efficient attack against SHA-1 is considered to be the one by Marc Stevens[34] with an estimated cost of $2.77M to break a single hash value by renting CPU power from cloud servers."

As folks pointed out, you could force a hash collision with git, but doing so won't overwrite the existing objects in another repository. I'd imagine even git push -f --no-thin won't overwrite the existing objects, but I'm not 100% sure.

That said, if you hacked into a remote repository you could make your false object the older one there, possibly embedding hacked code into an open-source project on GitHub or similar. If you were careful, then maybe you could introduce a hacked version that new users downloaded.

I suspect, however, that many things the project's developers might do could either expose or accidentally destroy your multi-million-dollar hack. In particular, that's a lot of money down the drain if some developer you didn't hack ever runs the aforementioned git push --no-thin after modifying the affected files, or sometimes even without the --no-thin, depending on the circumstances.

