Transitioning to stronger hash function #58
Comments
Also add hash agility to avoid another migration in the future. |
I was just going to mention that. It's a good opportunity to start with a hash tree from scratch |
I agree that this should be rolled into a new merkle tree torrent spec. Clients should only have to support two types of torrents: legacy/SHA1 and merkle/<new hash function>. |
Maybe even going so far as to have a tree per file, but maybe that's too much (it would have quite significant consequences for the internals of clients) |
I'm sort of in favor of having a tree per file, but I recognize the complexity it adds. It would alleviate some headaches about appending to torrents and deduping files. But to be backwards-compatible we'll have to include BEP47 paddings, because it requires piece alignment. In a from-scratch spec they would be implicit. On a more procedural note, how would we handle declaring parts of BEP3 as legacy? |
@dionyziz I think your input on hash functions to consider/benchmark for the purposes of hashing file pieces (and merkle tree creation) for integrity verification (and other operations) of torrent files would be most appreciated given your field of expertise. (My guess is that SHA-256 and SHA-3 would be the first candidates, but I've no clue in terms of performance of these functions vs the old SHA-1) |
@the8472 The body of BEP3 should be left unmodified and all changes described in a new BEP. Depending on how extensive the changes are, it might make sense to use BEP3 as the base for the new BEP to create a new stand-alone protocol spec. Once the new BEP is finalized we can add an "Updated by" or "Obsoleted by" header to BEP3 referencing it. |
The breakage of SHA-1 is a collision rather than a reversal, so it isn't a huge problem for BitTorrent (if someone sends you a torrent with broken data in it they can screw you in lots of other and worse ways), but this may be a good impetus for doing all the backwards-incompatible things we'd like to do. There are three backwards-incompatible things I think are good ideas:

1. Change the infodict to be a bencoded string of its own rather than an included object. This gets around the problem of bencoding not really being canonical. This is a trivial no-brainer if we're already doing something incompatible.
2. Switch to separate hashing for the different files in a multifile torrent. There's no technical downside to this except that it's a pain to implement. We should probably just bite the bullet and do it.
3. Switch to SHA-256 and a merkle tree. For various reasons SHA-256 is clearly the right hash to use. The main problem with BEP30 is that it results in the same file hashing to different roots based on the piece size, which is completely unnecessary. The hashing should go down to fixed-size terminals, no more than 16 KiB but maybe 8 or 4 for more flexibility, and it should not change with the piece size. The piece size should still be specified in the info dict, but it's there so that peers are all speaking the same language when they discuss what they have, without affecting hashes. This requires that piece sizes only be powers of two, so apologies to anyone fond of setting piece sizes to prime numbers.

Suggestions for other worthwhile changes which can only be done incompatibly are welcome. Most extensions don't require incompatibility.
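The piece-size-independent hashing sketched in point 3 could look roughly like this. A toy illustration only: SHA-256, 16 KiB leaves, and a zero-hash padding rule are assumptions of this sketch, not anything specified.

```python
import hashlib

BLOCK = 16 * 1024  # fixed leaf size, independent of the torrent's piece size


def merkle_root(data: bytes) -> bytes:
    """Hash `data` in fixed 16 KiB leaves and fold the leaf hashes into a
    single root. Because the leaf size never changes, the root is the same
    no matter what piece size a torrent declares, which is the property
    argued for above."""
    leaves = [hashlib.sha256(data[i:i + BLOCK]).digest()
              for i in range(0, max(len(data), 1), BLOCK)]
    # Pad the leaf layer to a power of two with zero-hashes (one possible
    # padding rule; a real spec may well choose a different one).
    while len(leaves) & (len(leaves) - 1):
        leaves.append(bytes(32))
    while len(leaves) > 1:
        leaves = [hashlib.sha256(leaves[i] + leaves[i + 1]).digest()
                  for i in range(0, len(leaves), 2)]
    return leaves[0]
```

For data that fits in one leaf, the root is simply the hash of the data; for larger inputs the tree is folded pairwise up to the root.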
For an example, see the .torrent files inside https://keybase.pub/kyhwana/sha-2/ folders a/ and ab/. "a.pdf" inside a/ and ab/ have the same SHA-1 but different SHA-256 hashes (and are different images). I imagine you could position the changed data inside pieces at the right offsets to send malicious pieces and have them verify as well. |
This is a security problem for BitTorrent, because users may trust a .torrent file download, for example using an HTTPS link on a trusted domain and trusting the PKI system, but may not trust their peers, because their network could be compromised. If a SHA1 collision can be found between a malicious binary file (e.g. a virus) and a non-malicious binary file (e.g. a video game), this is a very reasonable attack: It is a reasonable threat model to assume the network is compromised, but HTTPS is not. I recommend upgrading to SHA256. |
To be clear: The current problem with sha1 is generation of collisions, not reversals, so if you can trust the source of the .torrent file but not peers you're still fine, and that's the usual threat model. That said it's a problem worth fixing and the only reason we haven't done so already is that it requires a backwards incompatible change. |
Trusting the source of a .torrent file but not your peers is not fine. The attack could work like this: Initially, the attacker generates two different files, A and B. The attacker also creates a torrent file for A and makes sure that there exists a SHA1 collision with the contents of file B. Let's say these are C++ source code files. A is an innocent and useful program, while B is a malicious program. Then the attacker sends A to a trusted third party, for example the package maintainer of a certain linux distribution. The trusted third party manually verifies that the source code for A is benevolent and that the torrent contains the correct hash and vouches for its validity by hosting the verified .torrent file on their HTTPS website. The attacker subsequently replaces the file contents of A on the network with the contents of B during the download by modifying peer data in transit. The torrent client will not be able to detect this tampering. This attack does not require a reversal, only a collision generation. |
I'd like to put in a plug for BLAKE2. It has a deeper security margin than SHA-256 has and it is much faster. BLAKE2 is widely used. It is already included in your favorite crypto library. BLAKE2 is a descendant of DJB's Salsa20/Chacha20 designs, and its immediate ancestor BLAKE1 was thoroughly reviewed during the SHA3 competition. (Those are slides that I presented at ACNS 2013.) I'd be interested in a measurement of whether hash function performance difference between SHA-256 and BLAKE2 makes a difference in BitTorrent's usage. Even if hash function performance isn't a critical issue in this case, BLAKE2 is a fine choice. |
A reasonable argument can be made for BLAKE2. Its advantages are that it's faster in software and has a larger hash size. Its disadvantages are that it's less standard, meaning people will give me more grief about it if I use it; that it's slower if there's hardware acceleration, which is likely to become commonplace for SHA-256 and nothing else; that the speed of this operation doesn't matter much anyway (yes, I'm talking out of both sides of my mouth here); and that the longer output length takes up more space without a practical improvement to the already ludicrous security parameter. With merkle roots maybe the longer length doesn't matter in practice. Or BLAKE2s could be used for shorter hash lengths and better performance on 512-bit inputs. Or maybe BLAKE2b could be used for the terminals and truncated, and BLAKE2s could be used climbing up the tree. (BLAKE2b is slower on 512-bit inputs because its block size is 1024.) Truth be known I'd be fine with any of these options, I just want to do the thing which will cause the least bullshit complaining. |
You can have BLAKE2b or BLAKE2s generate shorter outputs. This is similar to truncating the outputs, except that the intended output length is mixed in so that H(x, gimme-bytes=24) is not equal to H(x, gimme-bytes=32)[:24]. Either way. BLAKE2 is fairly widely-accepted nowadays, I think. I'm not sure if people would give you grief about using it. It tends to be the default hash function in new crypto designs — things like the Password Hashing Competition, libsodium, etc. |
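Python's `hashlib` exposes exactly this behavior, so the difference between BLAKE2's length-aware output and plain truncation is easy to demonstrate:

```python
import hashlib

# BLAKE2 mixes the requested digest length into its parameter block, so a
# 24-byte digest is NOT a prefix of the 32-byte digest of the same input.
h24 = hashlib.blake2b(b"bittorrent", digest_size=24).digest()
h32 = hashlib.blake2b(b"bittorrent", digest_size=32).digest()
assert h24 != h32[:24]

# Truncating SHA-256 by hand, by contrast, is just taking a prefix:
full = hashlib.sha256(b"bittorrent").digest()
t24 = full[:24]
assert t24 == full[:24] and len(t24) == 24
```

This means two BLAKE2 users who pick different output lengths get unrelated-looking digests, which avoids accidental cross-protocol prefix collisions.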
Is this expected to be a one-time update to a new algorithm, or should some accommodation be made for future algorithm changes, as the8472 suggested? Presumably, a new default algorithm such as SHA-2-256 or Blake2b will be specified. But the spec could also describe a standard way of indicating the hash algorithm in-band. Clients could choose to implement support for additional algorithms, and they could be adopted as needed without further changes to the protocol. One option for indicating this on the wire or in magnet links could be the "multihash" digest format used by IPFS (draft spec seems to be here). This adds a ~2-byte prefix to each digest indicating the algorithm used and the digest length (kind-of similar to the BitTorrent protocol handshake prefix). There would be minimal additional complexity for clients that don't implement algorithm flexibility: they would just verify and discard or add the default two-byte hash prefix (e.g. |
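A toy sketch of such a 2-byte prefix. The `0x12` id happens to match multihash's sha2-256 entry, but treat the table and the helper functions here as illustrative, not as the real multihash wire format:

```python
import hashlib

# Illustrative algorithm table: id -> (name, constructor).
ALGOS = {0x12: ("sha2-256", hashlib.sha256)}


def encode(algo_id: int, data: bytes) -> bytes:
    """Digest `data` and prepend <algo-id><digest-length>."""
    digest = ALGOS[algo_id][1](data).digest()
    return bytes([algo_id, len(digest)]) + digest


def decode(buf: bytes):
    """Split a prefixed digest back into (algorithm name, digest)."""
    algo_id, length = buf[0], buf[1]
    digest = buf[2:2 + length]
    if len(digest) != length:
        raise ValueError("truncated digest")
    return ALGOS[algo_id][0], digest
```

A client without algorithm flexibility would, as suggested, just check that the two prefix bytes equal its single supported default and otherwise refuse the torrent.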
Hash algorithm extensibility is a delusion. If a change is made to use a hash algorithm which isn't currently defined then that change will, by definition, be incompatible, even if you play a semantic game of pretending you were prepared for it. In protocols in general extensibility mechanisms are a huge source of complexity and security problems, and all the new hash functions we're talking about here should be good until the end of time, so it's better to pick one and call it a day. |
This seems unnecessary. Most of those problems are solved or worked around these days. And it would prevent a transition format that supports both legacy and new hashes.
agreed
great idea
Strong disagreement. It obviously is not about forwards compatibility in the sense of supporting not-yet-existing hash functions. It is about making future migrations possible without changing any structural aspects of the protocol. Yes, clients would still need updating to include a new hash function, but all the code branches deciding which hash to use should already be there. To force implementers to spend time on that now we could specify two hash functions. Since functions of vastly different design usually don't fall at the same time that would also mean if one falls we'd still have another one in operation and enough time to add a 3rd one to the spec while deprecating the vulnerable one. All it takes is a "blake2" or "keccak" string in the info dict and possibly magnets. Unlike interactive protocols we wouldn't need any negotiations either. Torrents would simply contain a declaration of what they are. Clients can then either complain that they don't support it yet or that something is obsolete and should not be trusted.
Unlike asymmetric cryptography, the security of these hash functions is not backed by a reduction to a well-studied mathematical hardness assumption. And historically, hash functions have had somewhat limited lifetimes.
We will probably need different truncation in some places since we can accommodate 256bit or even 512bit hashes in some places but have to make do with 160bit lengths in other parts of the protocol (DHT, Trackers, handshake). |
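A sketch of that truncation, assuming SHA-256 for the new info-hash; the function names here are made up for illustration:

```python
import hashlib


def info_hash_v2(info_bencoded: bytes) -> bytes:
    """Full 256-bit info-hash over the bencoded info dictionary
    (assuming SHA-256 is the chosen function)."""
    return hashlib.sha256(info_bencoded).digest()


def truncated_for_dht(full_hash: bytes) -> bytes:
    """DHT keys, tracker announces and the handshake only have room for
    160 bits, so the 256-bit hash is truncated to 20 bytes. A truncated
    SHA-256 is itself a sound 160-bit hash."""
    return full_hash[:20]
```

The full hash stays in the metadata for verification; only the space-constrained protocol fields carry the 20-byte form.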
https://tahoe-lafs.org/~zooko/preimage-attacks-color.html began life as an attempt to update and correct that Val Aurora piece. My piece is currently missing 2 right-hand columns for 2015 and 2016 in which nothing was changed. I haven't checked, but I assume the shattered.io attack on sha1 was just using reference [51] from https://tahoe-lafs.org/~zooko/preimage-attacks-color.html
Something that may also be useful to shrink the metadata size is to use directory trees instead of flat lists. In some multifile torrents the per-file information actually dominates the size and not the pieces.
takes up a lot of space in torrents with many small files. Instead we could do
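The directory-tree idea above can be sketched as follows; a minimal illustration, where the nested-dict layout is a made-up encoding rather than anything from a spec:

```python
def to_tree(files):
    """Convert a flat BEP3-style file list (each entry: list of path
    components plus a length) into a nested directory dict, so shared
    directory prefixes are stored once instead of repeated per file."""
    tree = {}
    for path, length in files:
        node = tree
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = {"length": length}
    return tree


flat = [(["src", "a.c"], 10), (["src", "b.c"], 20), (["doc", "x.md"], 5)]
# -> {"src": {"a.c": {"length": 10}, "b.c": {"length": 20}},
#     "doc": {"x.md": {"length": 5}}}
```

With thousands of files under a few deep directories, the repeated prefix components dominate a flat list's size; the tree stores each directory name exactly once.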
Strongly disagree here. Here's a great example where this could be of real utility: last night I was thinking about the problem Bitcoiners have with the initial sync of the blockchain, and how it could be greatly improved by all nodes using libtorrent. If a torrent's metadata specified what hashing algorithm has been used for its pieces in the info section, Bitcoin clients could seed and announce (themselves on the DHT) each block as a separate merkle torrent, where the merkle hash included in the torrent metadata would correspond to the merkle root specified for the block header in Bitcoin-land. Other cryptocurrencies could make use of this no matter what hashing algorithm they use, if the hashing algo were specified. |
@gubatron admittedly, the bitcoin case doesn't need authentication – the blockchain can be downloaded over untrusted torrent peers and even using an untrusted bittorrent file. Authenticity can subsequently follow by verifying the proof of work contained within. |
@dionyziz yes, in the case of bitcoin, if nodes were to sync using torrents, the only way to trust the torrents (doesn't matter their source) is if their inner merkle-tree sum matches the merkle root on the corresponding block header for that torrent. I guess I might as well ask, as this is the part that is still missing in my mind. Do torrents support variable piece sizes? In the case of regular file sharing, can you have smaller pieces so that a piece doesn't span across the end of a file and the beginning of another one? The idea would be to make each transaction in the block the equivalent of a piece that gets hashed as one of the merkle tree nodes, so that the merkle root in the block header is the same as the one defined inside the torrent. This would be one hell of a custom torrent, but I think it could fly if variable piece sizes are doable in the torrent metadata. |
This is more or less achieved by aligning files in the piece-space via paddings in the end. Clients not aware of paddings download zeroes. Those that are effectively download a shorter piece. If you define one file per block then you can have that today. But I think bitcoin is a distraction from bittorrent's core uses. It's nice if it can be mapped nicely onto BT, but the main goal is to distribute files. |
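A sketch of the padding arithmetic described above, following the BEP47 idea that a pad file fills the gap between the end of one file and the next piece boundary:

```python
def pad_length(file_length: int, piece_length: int) -> int:
    """Size of the pad file needed after a file so that the next file
    starts exactly on a piece boundary. Zero when the file already ends
    on a boundary."""
    remainder = file_length % piece_length
    return (piece_length - remainder) % piece_length
```

Clients unaware of padding simply download those zero bytes; padding-aware clients skip them, effectively downloading a shorter final piece for that file.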
@bramcohen regarding distinguishing the leaf size of a hash tree and the piece size, I think Steven had a good point (in the other thread about a new merkle tree extension). Fundamentally, the only reason to have pieces be larger than 16 KiB is to not spend a lot of bandwidth on the BITFIELD and HAVE messages. However, it might make more sense to provide alternatives that are more compact/efficient than those. I spent some time on an extension for that (but there's some more work to be done there). There's also the binmap paper. |
No one has mentioned magnet links yet. Will existing magnet link URNs also be vulnerable? Do we have to migrate to a new URN like this?

```python
>>> import base64
>>> import hashlib
>>> s = hashlib.sha512(b"Replace SHA-1 with stronger message digest functions!").digest()
>>> base64.b32encode(s)
b'EYVH4MBTKSZQOARCINCZZXUARJFP2SY4GXZITGS27VSZKUANE27XMNUZZ52O6UP6GUKS3LMVSGWYVBKAL3CLY5AXKN4C4HDYLBY4J2A='
>>> new_infohash = base64.b32encode(s).decode().lower().rstrip("=")
>>> new_urn = "bsha"
>>> new_maglink = "magnet:?xt=urn:%s:%s" % (new_urn, new_infohash)
>>> new_maglink
'magnet:?xt=urn:bsha:eyvh4mbtkszqoarcinczzxuarjfp2sy4gxzitgs27vszkuane27xmnuzz52o6up6guks3lmvsgwyvbkal3cly5axkn4c4hdylby4j2a'
``` |
In principle the piece and hash tree hash function could be different than the one used to calculate the info-hash, but it wouldn't make much sense to upgrade SHA-1 of pieces without also upgrading the info-hash function. Magnet links normally use hex-encoding for the info-hash though (early versions used base32, but that's mostly phased out) |
Hashes of already-published data are not vulnerable to collision attacks, because a collision has to be crafted before the original data is published. This is not specific to bittorrent; it's a consequence of what collision attacks do. But it is difficult to tell whether a given hash predates the first collision attack on a hash function.
Are you talking about the syntax of magnets (btih) or the hash function used? We might be able to mostly keep the syntax, or simply extend it to keep both legacy and forward compatibility. But we will need to use a new hash function to avoid certain attack scenarios that would make bittorrent less viable for some distribution use-cases.
It also limits the worst-case memory use for keeping track of which pieces each peer has. If they are at 50% and select their pieces at random instead of in batches, then clever in-memory encodings won't save you. Not to mention that such extensions would increase the complexity of a new protocol if they were strict dependencies. I think being able to specify >16 KiB piece sizes would help keep things simple. |
The hashes likely to have anybody behind them are sha256, sha3, blake2b, and blake2s (and maybe a mix of blake2b and blake2s). I continue to maintain that making it parameterizable is a bad idea, and after getting feedback from some more crypto people it sounds like sha256 is viewed as fine by everybody and is depended on by everything, so that should be viewed as the default. |
The advantage of going smaller than 16 KiB is that if you split smaller requests across peers then you can hash-check them individually. That said, no one seems interested in going smaller on requests, so I'll just drop that thought. But since nobody ever deviates from 16 KiB, why don't we make that the forced amount? Then requests just have to specify an index without specifying a request size. That means less complexity and fewer concerns about what request sizes are and are not acceptable, which is a real concern: peers can force extra data to be sent to them after a choke by requesting large sizes, since an already-started request has to be finished even after a choke happens. |
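With a forced 16 KiB request size, a request reduces to a pair of indices. A sketch, with all names invented here for illustration:

```python
BLOCK = 16 * 1024  # the single, fixed request size


def request_range(piece_index: int, block_index: int,
                  piece_length: int, total_length: int):
    """Map a (piece, block) request to its byte range. No length field
    is needed on the wire: every block is 16 KiB except the final one,
    which is implicitly truncated at end-of-torrent."""
    begin = piece_index * piece_length + block_index * BLOCK
    if begin >= total_length:
        raise ValueError("request past end of torrent")
    end = min(begin + BLOCK, total_length)
    return begin, end - begin
```

Since the receiver can compute the length itself, there is nothing for a peer to inflate after a choke.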
Reducing leaf hash sizes doesn't help there unless you also reduce the piece size. Clients do not have access to the leaf hashes until after they have verified a whole piece and can derive them from the data itself.
At this point? Ontological inertia. Source code exists, let us reuse it if there is no good justification for the effort to change it. We're already doing a fairly big redesign of the metadata, there's no need to touch the wire protocol. At least in principle low-bandwidth clients could do <16KiB requests to keep the congestion-induced latency for control messages low (maybe coupled to µTP-MTU?). I don't think any implementation actually does this, but at least it's possible. |
16 kiB has been (for a long time) the upper bound on request sizes (which I think makes sense). It still makes sense to request smaller blocks for the tail piece, which is done today. With more tail pieces, which the proposed hash tree would create, it becomes even more useful to be able to request smaller blocks. |
Huh? The whole point of going Merkle is that you can verify the very first chunk you get, even when all you've got is the root.
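A minimal sketch of that property: given a sibling-hash path, a single chunk can be checked against the root before anything else arrives. The proof layout (leaf-upward sibling list, SHA-256) is an assumption of this sketch, not a spec.

```python
import hashlib


def verify_chunk(chunk: bytes, index: int, proof: list, root: bytes) -> bool:
    """Verify one leaf chunk against the merkle root. `proof[i]` is the
    sibling hash at level i, starting at the leaf layer. The low bit of
    `index` at each level says whether our node is the right child."""
    h = hashlib.sha256(chunk).digest()
    for sibling in proof:
        if index & 1:
            h = hashlib.sha256(sibling + h).digest()
        else:
            h = hashlib.sha256(h + sibling).digest()
        index >>= 1
    return h == root
```

The proof for a chunk is only log2(n) hashes long, so a peer can ship it alongside the first chunk it sends.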
We need to add hash proofs if nothing else
What's wrong with just truncating the last piece? Certainly the last chunk should be assumed to be truncated, which makes indices work fine. |
That's a fair point. I have a hard time coming up with any other reasons for unaligned or oddly sized requests. However, I think @the8472 has a reasonable point that adding a new PIECE message can be done independently, and deferring it probably eases the transition. |
I think the draft now contains all necessary components for an upgrade plus some of the "always wanted to do that" changes discussed. |
That is one possible use of merkle trees. But as specified by BEP30 it is in conflict with other use-cases (discussed in #29). The new draft retains all the old possibilities provided by piece lists and vastly simplifies deduplication. This makes .torrent files including the root dictionary about as heavyweight as info dictionaries were before. But magnets/metadata exchange can now provide a faster startup by grabbing piece-level hashes on demand. So with some extensions we can get all the goodies of merkle torrents if we want them, but the draft for the new base protocol aims to cover all the use cases enabled by the old base protocol (and then some). |
See also https://biterrant.io/ for a Proof of Concept attack involving good/malicious binaries. |
There's another old idea which we have the opportunity to add now: Instead of peers presenting the infohash directly, they present a hash of the hash and use the actual infohash as a shared secret for kicking off an encrypted connection so that peers who don't have access to the infodict can't download the file. This relates to the general subject of how radical we want to be and how many stages we want to do this in. What I'm going to do now is read through the detailed proposal and critique it primarily from the point of view of forming a rollout plan. |
Well, the attempt at peer protocol encryption for the purpose of traffic shaping evasion kind of failed after working for a short time. Failed in the sense that it is an arms race and we did not continue to refine it to keep winning that race. But there are numerous other security properties that can be achieved with encrypted peer connections. But I think some of them can be better achieved with payload encryption (#18, #20), but I'll have to revise that proposal in light of a new torrent format. |
@bramcohen When using protocol encryption today, peers don't advertise the real info-hash, but a hash derived from it. Primarily what I think would be required to achieve a bit more privacy would be to consistently advertise such derived hash to the DHT, trackers and local peer discovery (and also to reject un-encrypted connections) |
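A sketch of such a derived announce identifier. The domain-separation string and the truncation to 160 bits are assumptions made for illustration; they are not the actual MSE derivation.

```python
import hashlib


def announce_hash(info_hash: bytes) -> bytes:
    """Derive the identifier advertised to the DHT/trackers/LSD from
    the real info-hash. Observers who only see announces cannot invert
    the hash to learn the info-hash, but anyone who already holds the
    torrent computes the same value and can still find peers."""
    # "announce-derivation" is a made-up domain-separation tag.
    return hashlib.sha256(b"announce-derivation" + info_hash).digest()[:20]
```

Rejecting unencrypted connections then forces a connecting peer to prove knowledge of the real info-hash rather than just the advertised one.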
That conflicts with making public content easily accessible. Many search sites depend on or supplement their data with DHT-indexed torrents and thus provide a valuable service to users. Of course there are other privacy aspects. For example passive ISP-level/public wifi surveillance. This could be prevented by unsigned DH-exchanges on the protocol level and for DHT stores. Anyway, my point is we can cover multiple use-cases at the same time but that requires discussion about several distinct threat models and countermeasures. I think that would be better served by a separate discussion. |
Here was my argument that BLAKE2 was safer than SHA-256 for use inside one part of Zcash: zcash/zips#36 (comment)

Skip the other argument also in that thread about "provable security" and read the parts about BLAKE2 vs. SHA-256. In short, BLAKE2 has a bigger security margin and has been at least as well-studied if not better-studied than SHA-256. OTOH, I feel pretty confident about SHA-256. I wouldn't warn you against SHA-256 for BitTorrent; I feel pretty sure SHA-256 will be fine. But when asked to make judgments for *Zcash*, which is my baby and my love, my paranoia kicks in and I think SHA-256 isn't good enough, and we need something even safer.

That argument (especially considering some details such as Samuel Neves posted: zcash/zips#36 (comment)) persuaded Daira Hopwood to use BLAKE2 for that application.
If "good until the end of time" is your goal, in the sense that you do not want another transition in x years, I don't believe you should use SHA-256. While not weakened at full rounds, some notable attacks have been found; there is a partial list over at Wikipedia. Additionally, the US government has been moving secure systems off of SHA-256 since last year, in favor of SHA-384. It doesn't seem logical to me to implement SHA-256 for the long term when there are alternatives, particularly one that is well-vetted, faster, and has a better security margin. |
I for one am not opposed to stronger hash functions. The question is whether the other devs will agree to them. Impact:
I'm going to go out on a limb and say that 128 bits of security should be enough for anyone. It is enough to put brute force attacks firmly in the realm of boiling the oceans. Even if SHA-256 were subject to a break as severe as SHA-1, losing ~20 bits of security, it would still be comfortably secure against practical attacks with conventional computers. It's going to take a radical new attack to find a SHA-256 collision, and such an attack may well render SHA-384 vulnerable as well. Note that while SHA-512 and its truncated derivatives are faster in software on 64-bit processors, such CPUs are increasingly likely to have even faster hardware support for SHA-256. Software performance is more likely to matter on embedded systems with 32-bit CPUs where SHA-256 is faster. Given the trade-offs, I'm willing to take a chance with SHA-256. |
But if we are debating "until the end of time" security then we have to take quantum computers into account. The BHT algorithm claims to reduce collision resistance to 1/3rd of the hash length instead of 1/2. DJB refutes this, but he leaves the question whether that is the final word on the issue open. |
Well, 64-bit systems are making a foray into the embedded landscape, e.g. the raspi3. But looking at some openssl benchmarks it looks like most systems are very anemic anyway, and the difference between sha256 and sha512 is a factor of 2-3 at most. So if sha512 is prohibitively expensive then sha256 won't fare much better. Do we have a case where every ounce of performance matters? I don't really want to be responsible for a "256 bits ought to be enough for everyone" moment a few years down the road if we can avoid it. |
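For anyone who wants local numbers rather than openssl's, a rough throughput comparison is easy to run with hashlib (results will obviously vary by CPU and OpenSSL build):

```python
import hashlib
import time


def throughput(name: str, mib: int = 64) -> float:
    """Rough single-threaded MiB/s for the named hash over `mib` MiB."""
    data = bytes(1024 * 1024)  # one MiB of zeroes
    h = hashlib.new(name)
    start = time.perf_counter()
    for _ in range(mib):
        h.update(data)
    h.digest()
    return mib / (time.perf_counter() - start)


for name in ("sha1", "sha256", "sha512", "blake2b", "blake2s"):
    print(f"{name:8s} {throughput(name):8.1f} MiB/s")
```

On most hardware this makes the sha256/sha512 gap, and the effect of any hardware SHA acceleration, immediately visible.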
I don't think we have a clear enough picture at this point to predict which functions will prevail in a post-quantum world. Practical quantum computing is still a long ways off and the economics of it are totally unknown, so a wait-and-see approach seems appropriate. Regarding performance. ARMv8 has an optional crypto extension which supports SHA-256, so 64 bit embedded devices at least potentially carry hardware support. To be fair it looks like the RPi3 lacks that extension, but it also lacks enough I/O throughput for it to matter. The main reason to care about performance is power consumption. Reducing power draw by a factor of 2 or 3 is a big deal in mobile and datacenter settings. It also makes it easier to run a client on a cheap VPS as they tend to be heavily oversubscribed for CPU time and less so for bandwidth. |
I guess we'll see in a decade or two. Anyway, I'm mostly waiting on review from @bramcohen. Once any differences are settled we should also generate some test-data and add them to the spec. Due to the merkle trees the complexity is a little higher, so hopefully we can avoid implementation errors that way. |
Here's a better link to my arguments for why BLAKE2 is safer than SHA-2: Note in particular the security margin bit: SHA-256 is breakable up to 31 out of 64 rounds, while BLAKE2b is breakable up to 2.5 out of 12 rounds (https://en.wikipedia.org/wiki/Hash_function_security_summary). Note also the "SHA-256's father has turned his back on it" argument, which I present not as an argument that SHA-256 is weak, but that SHA-256 will suffer a worse and worse reputation as time goes by. |
Closing this as the BEP has been added as draft. Feedback is still welcome from implementors, especially if they encounter some roadblocks. At the moment it still is a draft after all. |
the8472 commented on Feb 23, 2017
With the published collision attack on SHA1 bittorrent may become less attractive for use-cases which don't just want simple integrity-checking (for which it still is perfectly fine) but also authentication.
We should come up with a plan to transition to stronger hashes for all uses, ideally while maintaining backwards compatibility for a transition period.
Thoughts:
pieces