Written by Zooko Wilcox-O'Hearn, documenting ideas due to Drew Perttula, Brian Warner, and Zooko Wilcox-O'Hearn, 2008-03-20.
Convergent encryption is already known to suffer from a confirmation-of-a-file attack. We show that it suffers also from a learn-the-remaining-information attack. The conditions under which this attack works cannot be predicted by a computer program nor by an unsophisticated user. We propose a solution which trades away part of the space savings benefits of convergent encryption in order to prevent this new attack. Our defense also prevents the old attack. The issues are presented in the context of the Tahoe Least-Authority Filesystem, a secure decentralized filesystem.
Convergent encryption, also known as content hash keying, was first mentioned by John Pettitt on the cypherpunks list in 1996 ¹, was used by Freenet ² and Mojo Nation ³ in 2000, and was analyzed in a technical report by John Doceur et al. in 2002 ⁴. Today it is used by at least Freenet, GNUnet ⁵, flud ⁶, and the Tahoe Least-Authority Filesystem ⁷. The remainder of this note will focus on the Tahoe Least Authority Filesystem. The use of convergent encryption in other systems may have different consequences than described here, because of the different use cases or added defenses that those systems may have.
Convergent encryption is simply encrypting a file using a symmetric encryption key which is the secure hash of the plaintext of the file.
Security engineers have always appreciated (e.g. ⁸) that convergent encryption allows an attacker to perform a confirmation-of-a-file attack -- if the attacker already knows the full plaintext of a file, then they can check whether a given user has a copy of that file.
Whether this confirmation-of-a-file attack is a security or privacy problem depends on the situation. If you want to store banned books or political pamphlets without attracting the attention of an oppressive government, or store pirated copies of music or movies without attracting the attention of copyright holders, then the confirmation-of-a-file attack is potentially a critical problem. On the other hand, if the sensitive parts of your data are secret personal things like your bank account number, passwords, and so forth, then it isn't a problem. Or so I -- and as far as I know everyone else -- thought until March 16, 2008.
I had planned to inform users of Tahoe-LAFS -- then at version 0.9.0 -- about the confirmation-of-a-file attack by adding a FAQ entry:
Q: Can anyone else see the contents of files that I have not shared?
A: The files that you store are encrypted so that nobody can see a file's contents (unless of course you intentionally share the file with them). However, if the file that you store is something that someone has already seen, such as if it is a file that you downloaded from the Internet in the first place, then they can recognize it as being the same file when you store it, even though it is encrypted. So basically people can tell which files you are storing if they are "publically known" files, but they can't learn anything about your own personal files.
However on March 16, 2008 Drew Perttula and Brian Warner came up with an attack that shows that the above FAQ is wrong.
They extended the confirmation-of-a-file attack into the learn-the-remaining-information attack. In this new attack, the attacker learns some information from the file. This is done by trying possible values for unknown parts of a file and then checking whether the result matches the observed ciphertext. For example, if you store a document such as a form letter from your bank, which contains a few pages of boilerplate legal text plus a few important parts, such as your bank account number and password, then an attacker who knows the boilerplate might be able to learn your account number and password.
For another example, if you use Tahoe-LAFS to backup your entire home directory, or your entire filesystem, then the attacker gains the opportunity to try to learn partial information about various files which are of predictable format but have sensitive fields in them, such as .my.cnf (MySQL configuration files), .htpasswd, .cvspass, .netrc, web browser cookie files, etc.. In some cases, files such as these will contain too much entropy from the perspective of the attacker to allow this attack, but in other cases the attacker will know, or be able to guess, most of the fields, and brute force the remainder.
Designers of these systems -- MySQL, Apache, etc. -- know that user secrets are often guessable (increasingly often, nowadays), which is why such applications limit the rate at which clients can attempt to login and try passwords. From a cryptographer's perspective, using Tahoe-LAFS with convergent encryption on all files allows an attacker to use off-line attacks instead of on-line attacks, which renders vulnerable some secrets which were formerly safe.
The amount of information that the attacker can learn using this attack is upper-bounded by two things: first, by the attacker's computational capacity. If the attacker can perform no more than 2^64 computations, then he can learn no more than 64 bits worth of information from any file. Note that he does not have to spend all of this computation attacking a single file owned by a single user -- instead he can attack many users and many files in parallel with the same computation. (He can also burn his resulting values into a rainbow table on DVD and sell it on the Web so that the buyer can attack other users using the result of the attacker's computation. This is currently done for hashed or encrypted passwords.)
The second upper bound is the amount of entropy in the file from the attacker's perspective. If the file contains more entropy, from the attacker's perspective, than his computational capacity, then he learns nothing about the file (except that the file was not any of the things that he guessed that it might be). Note that the amount of entropy is relative to the attacker, not intrinsic to the file. You may think that a PDF file with millions of bytes in it would have a lot of entropy, but if it happens to be a form letter from your bank, and if the attacker happened to receive the same letter with only a few fields different, then it contains little entropy to him. This subtlety may underlie the failure of many, including myself, to understand this issue sooner. Observe also that it is not possible for a computer program to determine whether a given file has sufficient entropy -- the answer to that question depends on what the attacker knows and on how sophisticated and accurate is his model of the victim.
(Note: ideas like this are often Obvious in Retrospect. If this one was Obvious in Forespect to anyone, I would appreciate references. I've scoured the citations mentioned in this note and found no hint of it.)
What can we do about this? Well, in Tahoe-LAFS the application which uses the secure filesystem, or even the human user which uses the application, can choose to use convergent encryption or not on a per-file basis, and there is no backwards-compatibility problem (Tahoe-LAFS v0.9.0 will be able read files which are written with or without convergent encryption).
However, we can do better than that by creating a secret value and mixing that value into the per-file encryption key (so instead of symmetric_key = H(plaintext), you have symmetric_key = H(added_secret, plaintext), where "," denotes an unambiguous encoding of both operands). This idea is due to Brian Warner and Drew Perttula.
The set of people who know this added secret is the set of people whose files can converge, and it is also the set of people who are able to perform either of the two attacks described above. This means that attackers with whom you do not share this added secret are prevented from performing either attack on your files.
One policy which is easy to implement is for each user to keep the added secret to themselves. This would make successive uploads of the same file by the same user converge, but would not achieve convergence with any other user's files, and would fully protect the users against these two attacks.
There are various possibilities for how to automatically decide whether or not to use convergent encryption on a given file or set of files -- heuristics based on file size, name or location, white lists or black lists of files, and perhaps ways that the user interface could make the alternatives apparent to the user. These ideas are out of the scope of this note (and, for now, of the Tahoe-LAFS decentralized filesystem itself -- the application written atop Tahoe-LAFS can decide).
It should be noted that if the exact size of a file is divulged then this information can be used for a confirmation-of-a-file attack, if there are few likely files of that exact size. Adding padding to files before encryption can substantially reduce the effectiveness of that attack vector, as is already well known.
One of the original motivations for convergent encryption, as expressed in Doceur's technical report, is to conserve storage space by coalescing identical files. That technical report cited an experiment by Doceur et al. at Microsoft in 2002 which showed that coalescing all files on a set of 585 Windows workstations resulted in a 50% space savings. My suspicion is that the gains available for modern uses are nowhere near that good -- I wouldn't be surprised if it were less than 5% for typical uses of the Tahoe Least-Authority Filesystem.
My reasoning is:
Allmydata.com, the commercial sponsor of the Tahoe-LAFS project, undertook an analysis of files stored by their customers to date. (Note that those customers were using a previous product from Allmydata.com, not the current product which is based on Tahoe.) For those customers, universal convergent encryption would have saved only 0.1% of the space.
Convergent encryption renders user files vulnerable to a confirmation-of-a-file attack. We already knew that. It also renders user files vulnerable to a learn-the-remaining-information attack in subtle ways. We didn't think of this until now. My search of the literature suggests that nobody else did either.
The addition of an added secret to be mixed into the symmetric encryption key allows you to limit the scope of users with whom your files will converge. Attackers who are outside of this set of users cannot use the new learn-the-remaining-information attack, nor can they use the old confirmation-of-a-file attack.
The Tahoe Least-Authority Filesystem version 1.0 included the feature of mixing in an added secret to the symmetric encryption key when doing convergent encryption.
For discovering this previously unknown weakness in convergent encryption, telling us about it, and helping us design a flexible defense against it, Drew Perttula won the second ever "I Hacked Tahoe-LAFS!" t-shirt.
Here he is at the official awards ceremony at the office of AllMyData, June 26, 2008:
Drew Perttula being awarded with this "I Hacked Tahoe-LAFS!" t-shirt. (The t-shirt was present only as an image file at the awards ceremony, so it was projected onto a white board which was held in front of Drew.)
Here's Drew a few days after the awards ceremony, wearing the actual physical reification of his shirt [*]:
¹ https://groups.google.com/d/msg/mail.cypherpunks/RLHsypJfd7o/CjGOsOxqwT0J
² http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.4919
³ http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.59.9607
⁴ http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.8.7586
⁸ https://groups.google.com/d/msg/mail.cypherpunks/RLHsypJfd7o/CjGOsOxqwT0J