[tahoe-dev] hedging our bets -- in case SHA-256 turns out to be insecure

Sun Nov 8 03:30:47 PST 2009

Folks:

We're going to be deploying a new crypto scheme in Tahoe-LAFS next  
year -- the year 2010.  Tahoe-LAFS is used for long-term storage, and  
I won't be surprised if people store files on Tahoe-LAFS in 2010 and  
then rely on the confidentiality and integrity of those files for  
many years or even decades to come.  (People started storing files on  
Tahoe-LAFS in 2008 and so far they show no signs of losing interest  
in the integrity and confidentiality of those files.)

This long-term focus makes Tahoe-LAFS's job harder than the job of  
protecting transient network packets.  If someone figures out in 2020  
or 2030 how to spoof a network transaction that you sent in 2010 (see  
[1]), it'll be far too late to do you any harm, but if they figure  
out in 2030 how to alter a file that you uploaded to a Tahoe-LAFS  
grid in 2010, that might harm you.

Therefore I've been thinking about how to make Tahoe-LAFS robust  
against the possibility that SHA-256 will turn out to be insecure.

A very good way to do this is to design Tahoe-LAFS so that it relies  
as little as possible on SHA-256's security properties.  The property  
that seems to be the hardest for a secure hash function to provide is  
collision-resistance.  We are analyzing new crypto schemes to see how  
many security properties of Tahoe-LAFS we can continue to guarantee  
even if the collision-resistance of the underlying secure hash  
function fails, and similarly for the other properties of the secure  
hash function which might fail [2].

This note is not about that design process, though, but about how to  
maximize the chance that the underlying hash function does provide  
the desired security properties.

We could use a different hash function than SHA-256 -- there are many  
alternatives.  SHA-512 would probably be safer, but it is extremely  
expensive on the cheap, low-power 32-bit ARM CPUs that are one of our  
design targets [3], and the output size of 512 bits is too large to  
fit into Tahoe-LAFS capabilities.  There are fourteen candidates left  
in the SHA-3 contest at the moment.  Several of them have  
conservative designs and good performance, but there is always the  
risk that they will be found to have catastrophic design flaws or  
that a great advance in hash function cryptanalysis will suddenly  
show how to crack them.  Of course, a similar risk applies to SHA-256!

So I turn to the question of how to combine multiple hash functions  
to build a hash function which is secure even if one or more of the  
underlying hash functions turns out to be weak.

I've read several interesting papers on the subject -- such as [4, 5]  
and especially "Robust Multi-Property Combiners for Hash Functions  
Revisited" by Marc Fischlin, Anja Lehmann, and Krzysztof Pietrzak  
[6].  The good news is that it turns out to be doable!  The latter  
two papers show nice strong theoretical results -- ways to combine  
hash functions so that the resulting combination is as strong or  
stronger than the two underlying hash functions.  The bad news is  
that the proposal in [6] builds a combined function whose output is  
twice the size of the output of a single hash function.  There is a  
good theoretical reason for this [4], but it won't work for our  
practical engineering requirements -- we need hash function outputs  
as small as possible (partially due to usability issues)

The other bad news is that the construction proposed in [6] is  
complicated, underspecified, and for the strongest version of it, it  
imposes a limit on the length of the inputs that you can feed to your  
hash function.  It grows to such complexity and incurs such  
limitations because it is, if I may call it this, "too theoretical".   
It is designed to guarantee certain output properties predicated on  
minimal theoretical assumptions about the properties of the  
underlying hash functions.  This is a fine goal, but in practice we  
don't want to pay such a high cost in complexity and performance in  
order to gain such abstract improvement.  We should be able to "hedge  
our bets" and achieve a comfortable margin of safety with a very  
simple and efficient scheme by making stronger, less formal, but very  
plausible assumptions about the underlying hash functions.  Read on.

I propose the following combined hash function C, built out of two  
hash functions H1 and H2:

C(x) = H1(H1(x) || H2(x))

The first observation is that if H1 is collision-resistant then so is  
C.  In practice I would expect to use SHA-256 for H1, so the  
resulting combiner C[SHA-256, H2] will be at least as strong as  
SHA-256.  (One could even think of this combiner C as just being a  
tricky way to strengthen SHA-256 by using the output of H2(x) as a  
randomized salt -- see [7].)

The next observation is that finding a pair of inputs x1, x2 which  
collide in *both* H1 and in H2 is likely to be much harder than  
finding a pair of inputs that collide in H1 and finding a pair of  
inputs that collide in H2 (see [5]).

Now the reason that a combiner like this one is not published in  
theoretical crypto literature is that it obviously could fail if the  
outer hash function H1 fails.  For example, even if H2 is collision- 
resistant, if H1 turns out to be susceptible to collisions, then  
theoretically speaking C[H1, H2] might be susceptible to collisions.   
However, in real life C[H1, H2] would most likely still be collision  
resistant!

All practical attacks on real hash functions so far (if I understand  
correctly) are multi-block attacks in which the attacker is able to  
feed a sufficiently long and unconstrained input to the hash  
functions that the effects of the later parts of his inputs are able  
to manipulate the state generated by the earlier parts of his  
inputs.  My combiner C uses H1 in its outer invocation on a single- 
block-sized input, which means no such multi-block attacks are  
possible on the outer invocation.  In addition, the inputs that the  
attacker gets to feed to the outer invocation of H1 are highly  
constrained.  Basically, he would already have to be very good at  
manipulating the inner invocations H1 and H2 in ways that he isn't  
supposed to before he can even begin to manipulate the outer  
invocation of H1.

A measure of the practical security of a combiner like this one would  
be "how safe would it be if it were instantiated using broken  
practical hash functions such as MD5 and SHA1?".  It appears to me  
(from an admittedly cursory analysis) that there is no realistic way  
to find collisions in C[MD5, SHA1] even though there are realistic  
ways to find collisions in MD5 and in SHA1.  Of course, I'm not  
proposing to use C[MD5, SHA1]!  I'm proposing to use C[SHA-256, _]  
where _ is some other hash function which is believed to be strong.   
The example of instantiating C with MD5 and SHA1 just goes to show  
that C is a hash function which is stronger than either of its two  
underlying hash functions.

The other desirable security properties such as second-preimage  
resistance and pre-image resistance seem to follow the same pattern  
as collision-resistance -- C[H1, H2] seems to be much stronger than  
H1 or H2 alone.

Regards,

Zooko

[1] http://extendedsubset.com/Renegotiating_TLS.pdf
[2] http://allmydata.org/trac/tahoe/wiki/NewCaps/WhatCouldGoWrong
[3] http://bench.cr.yp.to/results-hash.html#arm-apollo
[4] Krzysztof Pietrzak: "Non-Trivial Black-Box Combiners for  
Collision-Resistant Hash-Functions don't Exist"
[5] Jonathan J. Hoch, Adi Shamir: "On the Strength of the  
Concatenated Hash Combiner when All the Hash Functions are Weak"
[6] Marc Fischlin, Anja Lehmann, Krzysztof Pietrzak: "Robust Multi- 
Property Combiners for Hash Functions Revisited"
[7] http://webee.technion.ac.il/~hugo/rhash/rhash.pdf