Opened at 2009-07-10T09:41:24Z
Last modified at 2011-07-28T21:51:53Z
#753 new defect
use longer storage index / cap for collision resistance — at Version 3
Reported by: | warner | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | undecided |
Component: | code-encoding | Version: | 1.4.1 |
Keywords: | newcaps security | Cc: | |
Launchpad Bug: |
Description (last modified by warner)
As we work on new encoding schemes (like DSA-based mutable files), I'm thinking that we want to put a lower bound on the cap/SI length to maintain reasonable margin against collisions. 256 bits would be more than enough. 128 is ok, but a bit tight. 92 bits would make me nervous.
robk's friend Sylvan expressed concerns about Tahoe (and mnet/mojonation before it) because, for something that is meant as a backup system, even the slightest possibility of the CHK-based indexing scheme mapping two documents to the same storage index was too high for him. (I believe he would be more satisfied with a scheme that used centrally-allocated guaranteed-unique storage indices, which we could do but would require more coordination and longer caps, since we could no longer use a randomly-generated readkey to derive the storage index. In exchange for a controllable but non-zero probability of collision, we get to avoid central coordination and use smaller caps).
The specific places where collisions could occur are:
- mapping from file contents to CHK-derived readkey
- mapping from readkey (CHK-derived or randomly generated) to storage index
- mapping from randomly-generated mutable writekey to storage index
The "birthday paradox" determines the chance of collision. If I'm doing my math right, if you want less than 'p' chance of getting any collisions when selecting items out of a namespace of size 'N', then you can't select more than C = sqrt(2*N*p) items. This is called a "paradox" (where "surprise" would be a better term) because that square root causes C to be surprisingly low: for birthdays (in which N=365), p=0.5 leads to C=19. In the Tahoe context, C is the number of files you can add to the grid.
In the current case, our 128-bit storage index (N=2128) means that p=0.5 gets us a nice large 264 number of files, except that p=0.5 is insufficient margin: we'd much prefer a vanishingly small chance of collision, like p=2-64. Fortunately we get two bits of margin for every one bit we reduce from C. The table looks like:
N | numfiles | prob(collision) | |
96 | 248 | -> | 2-1 (0.5) |
96 | 240 | -> | 2-17 |
96 | 232 | -> | 2-33 |
96 | 224 | -> | 2-49 |
128 | 264 | -> | 2-1 (0.5) |
128 | 256 | -> | 2-17 |
128 | 248 | -> | 2-33 |
128 | 232 | -> | 2-65 |
192 | 296 | -> | 2-1 |
192 | 280 | -> | 2-33 |
192 | 264 | -> | 2-65 |
256 | 2128 | -> | 2-1 (0.5) |
256 | 296 | -> | 2-65 |
Note that our N is the minimum of the storage-index size and the top-most cap value (i.e. the readkey for immutable files, or the writekey for mutable files). So a DSA-based mutable file with a 92-bit writecap gives us an N of 292, even if it is expanded into a storage-index of 128 or 256 bits.
Also note that the allmydata.com grid currently has something like 10M objects in it, about C=223.
So, I'm thinking that as much as a nice short 96-bit DSA mutable writecap would be nice, it's too short to provide enough collision margin. I want to be able to put trillions of files into a grid, and I want a the chance of collision to be so small that I don't ever need to worry about it, and 96 bits isn't really there. 128 bits is probably good enough, but doesn't have enough margin to be obviously and unquestionably safe (C=232 is a lot of files but you can imagine people wanting more, p=2-64 is a tiny probability but you can imagine people wanting a bit better). 256 would be plenty (but of course I want my filecaps to be shorter than that).
Change History (3)
comment:1 Changed at 2009-07-10T14:14:42Z by zooko
comment:2 Changed at 2009-07-10T14:27:51Z by zooko
P.S. Oh, what I wrote about people choosing whatever storage index they want is possibly confused because I'm conflating two ideas into the word "storage index". Let's talk about those two uses as "server selector" and "file identifier". The former is how you decide which servers you're going to use (either when uploading a file for the first time or when using a cap to find a file that's already up there). The latter is how you tell a storage server which of the shares that it has are the shares that you are currently interested in. I currently think that the verify cap is a good thing to use for the latter role, but perhaps not for the former.
comment:3 Changed at 2009-07-11T10:25:36Z by warner
- Description modified (diff)
as before, I think I'd like to continue using "storage index" for what you're calling the "file identifier", but yeah split out "server selector" or "peer-selection index" or some similar term for the purpose of determining which servers you're going to be talking to. One way of describing this would be "we used to use the storage-index as the peer-selection index, but these days we put two separate values in the filecap".
I am also starting to think of these as separate concepts, but remember that we've yet to actually implement such a split.
Sylvan's concern was about availability: he considered a backup system to be broken if its design has a built-in probability of file unrecoverability. It's easier to see the problem if we set the encryption-key and hash lengths to infinity, but restrict the storage index to say four bits. Then upload two files, and try to download one of them.. you've got a 1/16 chance of getting a download failure because your two files had the same storage-index, you downloaded the wrong bits, and now they won't pass the integrity check.
Also, when we talk about this, we should be careful to distinguish between the failure modes of mutable files versus immutable files.. they're very distinct. And, collisions at different levels have very different consequences: if the storage index is too small, we'll get availability failures; if the immutable encryption key or mutable writekey is too small, we'll get confidentiality failures. I've been assuming that we'll keep the security parameters sufficiently large.. this ticket was specifically about the availability concerns resulting from a too-small storage index.
If we compress the filecap by deriving the storage-index from the writekey, then clearly we're limited by min(len(writekey),len(storage-index)).
Mostly I ticketed this issue because it's something I want to keep in mind as we design the next revision of the filecap format. If we don't already have a wiki page for it, I'll add one to organize our ideas.. I think they're currently spread across half a dozen tickets.
I updated the table in the description: I think 192-bit caps will let us have an effectively-infinite number of files (264) with an effectively-zero chance of collision (2-65).
[edited to fix trac markup]
Thanks for this analysis. I like your comments at the end that we want a bit of "overkill" in number of files and chance of collision. People who don't want to rely on the collision-resistance of secure hash functions at all are probably not part of our market, but people who are willing to do so in principle, but would feel better with a nice fat margin of safety are definitely in our market.
Note that if you generate a new write cap (private key) and then check and it turns out that the same write cap has been generated by someone else then now you have gained the ability to write to their mutable file or directory! That's why I have been thinking that 96-bits was too few for write caps. Originally I had been thinking something like "It would probably not be worthwhile for any attacker to spend 296 computer power trying to forge a write cap.". But this way of thinking discounts two important factors: chance of success and number of targets. If there are 240 writecaps in use, then if an attacker does a mere 236 work (excluding the cost of checking whether each writecap that they generate is already out there), then they have a 2-20 chance of forging someone's writecap. (Thanks to Daniel J. Bernstein's papers and mailing list postings for helping me understand this important point. http://cr.yp.to/snuffle/bruteforce-20050425.pdf )
However, a storage-index collision doesn't sound nearly as serious to me. No integrity or confidentiality is necessarily lost due to storage-index collision, right? Well, it could complicate the question of "which servers are handling which shares of this mutable file" -- an issue that is already not well managed.
Anyway, nowadays I think of storage-indexes as a separate layer built on top of the crypto layer. People can define their storage indexes as secure hashes of some pieces of the capabilities if they want, or choose random storage indices, or hierarchical ones based on the DNS names, or just not have storage indices at all and require every downloader to query every server. It shouldn't impact the security of the crypto layer, if the crypto layer includes integrity checking using the verifycap itself on the storage-server side.
I think we should write a document called something like "crypto failure modes (What could possibly go wrong?)" that explains what the consequences are of each different type of failure. (As requested by Nathan: http://allmydata.org/pipermail/tahoe-dev/2009-April/001661.html .)
The one discussed above is caused by "two people choose the same write cap (signing key) seed (possibly due to malice)". That one leads to an integrity failure, where one of the people thinks that they are the only one with a write-cap to a specific file or directory, but actually someone else also has the same write-cap.
So I think that is the worst one (because I value integrity highly). Another similar integrity failure could come about from a failure of the digital signature algorithm -- i.e. if someone were able to forge digital signatures even without the writecap. (Note that a collision in the hash function used by the digital signature algorithm could cause this. People who don't want to rely on collision-resistance of secure hash functions at all can't even rely on rsa, dsa, ssh, ssl, or gpg, although I should hasten to add that those algorithms typically include some randomness in the input to their secure hash function, to make it that much harder for attackers to cause collisions.)
After those integrity failures, there are confidentiality failures. The obvious one is someone being able to crack AES-128-CTR without having the readkey. Another one is if the content-hash-key encryption where to generate the same readkey for two different immutable files. I suppose that's another reason why using an added convergence secret is safer (although I hasten to add that I see no reason to distrust the collision-resistance of SHA-256d-truncated-to-128-bits at the present).
Note that of course in practice the dangers from bugs and from operator error (i.e. misplacing your keys) are a lot greater than these algorithmic crypto risks. So much greater that the former are pretty much guaranteed to happen and the latter will probably never. Nonetheless, I value getting the crypto part right so that it is secure and also so that everyone who is willing to rely on crypto in principle is willing to rely on our crypto, so thanks for you help with this.