.. -*- coding: utf-8-with-signature -*-

==============================
Specification Document Outline
==============================

While we do not yet have a clear set of specification documents for Tahoe
(explaining the file formats, so that others can write interoperable
implementations), this document is intended to lay out an outline for what
these specs ought to contain. Think of this as the ISO 7-Layer Model for
Tahoe.

We currently imagine 4 documents.

1. `#1: Share Format, Encoding Algorithm`_
2. `#2: Share Exchange Protocol`_
3. `#3: Server Selection Algorithm, filecap format`_
4. `#4: Directory Format`_

#1: Share Format, Encoding Algorithm
====================================

This document will describe the way that files are encrypted and encoded into
shares. It will include a specification of the share format, and explain both
the encoding and decoding algorithms. It will cover both mutable and
immutable files.

The immutable encoding algorithm, as described by this document, will start
with a plaintext series of bytes, encoding parameters "k" and "N", and either
an encryption key or a mechanism for deterministically deriving the key from
the plaintext (the CHK specification). The algorithm will end with a set of N
shares, and a set of values that must be included in the filecap to provide
confidentiality (the encryption key) and integrity (the UEB hash).

The immutable decoding algorithm will start with the filecap values (key and
UEB hash) and "k" shares. It will explain how to validate the shares against
the integrity information, how to reverse the erasure-coding, and how to
decrypt the resulting ciphertext. It will result in the original plaintext
bytes (or some subrange thereof).
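
As a concrete illustration of the data flow described above, here is a
minimal Python sketch of the immutable encode/decode round trip. It reuses
the same building blocks as Tahoe (AES-CTR via the ``cryptography`` package,
Reed-Solomon erasure coding via ``zfec``), but the hash tags, key
derivation, and share layout are simplified stand-ins, not the real share
format that this document will specify::

    import hashlib
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
    from zfec import easyfec

    def encode_immutable(plaintext, k, n):
        # CHK-style convergent key, derived deterministically from the
        # plaintext. (Tahoe uses a tagged SHA-256d hash; this tag is a
        # placeholder.)
        key = hashlib.sha256(b"chk-key:" + plaintext).digest()[:16]
        enc = Cipher(algorithms.AES(key), modes.CTR(b"\x00" * 16)).encryptor()
        ciphertext = enc.update(plaintext) + enc.finalize()
        # Erasure-code the ciphertext into N shares, any k of which suffice.
        shares = easyfec.Encoder(k, n).encode(ciphertext)
        # Stand-in for the UEB hash: a single integrity value carried in
        # the filecap. (The real UEB anchors per-share hash trees so that
        # each share can be validated individually.)
        ueb_hash = hashlib.sha256(b"".join(shares)).digest()
        return key, ueb_hash, shares

    def decode_immutable(key, ueb_hash, numbered_shares, k, n, ciphertext_len):
        # numbered_shares: any k of the N (sharenum, share_bytes) pairs.
        # A real decoder first validates each share against the integrity
        # information; that step is omitted in this sketch.
        sharenums = [num for (num, _) in numbered_shares]
        blocks = [blk for (_, blk) in numbered_shares]
        blocksize = (ciphertext_len + k - 1) // k
        padlen = blocksize * k - ciphertext_len
        ciphertext = easyfec.Decoder(k, n).decode(blocks, sharenums, padlen)
        dec = Cipher(algorithms.AES(key), modes.CTR(b"\x00" * 16)).decryptor()
        return dec.update(ciphertext) + dec.finalize()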

The sections on mutable files will contain similar information.

This document is *not* responsible for explaining the filecap format, since
full filecaps may need to contain additional information as described in
document #3. Likewise it is not responsible for explaining where to put the
generated shares or where to find them again later.

It is also not responsible for explaining the access control mechanisms
surrounding share upload, download, or modification ("Accounting" is the
business of controlling share upload to conserve space, and mutable file
shares require some sort of access control to prevent non-writecap holders
from destroying shares). We don't yet have a document dedicated to explaining
these, but let's call it "Access Control" for now.


#2: Share Exchange Protocol
===========================

This document explains the wire protocol used to upload, download, and modify
shares on the various storage servers.

Given the N shares created by the algorithm described in document #1, and a
set of servers who are willing to accept those shares, the protocols in this
document will be sufficient to get the shares onto the servers. Likewise,
given a set of servers who hold at least k shares, these protocols will be
enough to retrieve the shares necessary to begin the decoding process
described in document #1. The notion of a "storage index" is used to
reference a particular share: the storage index is generated by the encoding
process described in document #1.

This document does *not* describe how to identify or choose those servers;
rather, it explains what to do once they have been selected (by the
mechanisms in document #3).

This document also explains the protocols that a client uses to ask a server
whether or not it is willing to accept an uploaded share, and whether it has
a share available for download. These protocols will be used by the
mechanisms in document #3 to help decide where the shares should be placed.
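
Taken together, the upload, download, and query operations described above
suggest a small per-server interface. The Python sketch below is
hypothetical: the method names are loosely modeled on Tahoe's Foolscap
``RIStorageServer`` interface, but the signatures are illustrative, not the
actual wire protocol::

    from abc import ABC, abstractmethod

    class StorageServerProtocol(ABC):
        """Hypothetical client-side view of a single storage server."""

        @abstractmethod
        def allocate_buckets(self, storage_index, sharenums, allocated_size):
            """Ask whether the server will accept these shares; returns
            the subset of sharenums it is willing to hold. Document #3
            uses this answer to decide where shares should be placed."""

        @abstractmethod
        def put_share(self, storage_index, sharenum, data):
            """Upload one share under the given storage index."""

        @abstractmethod
        def get_share_numbers(self, storage_index):
            """Ask which shares (if any) the server holds for this index."""

        @abstractmethod
        def get_share(self, storage_index, sharenum, offset, length):
            """Download a byte range of one share."""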

Where cryptographic mechanisms are necessary to implement access-control
policy, this document will explain those mechanisms.

In the future, Tahoe will be able to use multiple protocols to speak to
storage servers. There will be alternative forms of this document, one for
each protocol. The first one to be written will describe the Foolscap-based
protocol that Tahoe currently uses, but we anticipate a subsequent one to
describe a more HTTP-based protocol.

#3: Server Selection Algorithm, filecap format
==============================================

This document has two interrelated purposes. With a deeper understanding of
the issues, we may be able to separate these more cleanly in the future.

The first purpose is to explain the server selection algorithm. Given a set
of N shares, where should those shares be uploaded? Given some information
stored about a previously-uploaded file, how should a downloader locate and
recover at least k shares? Given a previously-uploaded mutable file, how
should a modifier locate all (or most of) the shares with a reasonable amount
of work?

This question implies many things, all of which should be explained in this
document:

* the notion of a "grid", nominally a set of servers who could potentially
  hold shares, which might change over time
* a way to configure which grid should be used
* a way to discover which servers are a part of that grid
* a way to decide which servers are reliable enough to be worth sending
  shares to
* an algorithm to handle servers which refuse shares
* a way for a downloader to locate which servers have shares
* a way to choose which shares should be used for download

The server-selection algorithm has several obviously competing goals:

* minimize the amount of work that must be done during upload
* minimize the total storage resources used
* avoid "hot spots": balance load among multiple servers
* maximize the chance that enough shares will be downloadable later, by
  uploading lots of shares, and by placing them on reliable servers
* minimize the work that the future downloader must do
* tolerate temporary server failures, permanent server departure, and new
  server insertions
* minimize the amount of information that must be added to the filecap

The server-selection algorithm is defined in some context: some set of
expectations about the servers or grid with which it is expected to operate.
Different algorithms are appropriate for different situations, so there will
be multiple alternative versions of this document.

The first version of this document will describe the algorithm that the
current (1.3.0) release uses, which is heavily weighted towards the two main
use case scenarios for which Tahoe has been designed: the small, stable
friendnet, and the allmydata.com managed grid. In both cases, we assume that
the storage servers are online most of the time, that they are uniformly
highly reliable, and that the set of servers does not change very rapidly.
The server-selection algorithm for this environment uses a permuted server
list to achieve load-balancing, uses all servers identically, and derives the
permutation key from the storage index to avoid adding a new field to the
filecap.
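
The permuted-list idea is compact enough to sketch here. Assuming each
server is known by an opaque server id, both the uploader and any later
downloader can reconstruct the identical per-file ordering from nothing but
the storage index (the exact hash construction in Tahoe is tagged and
slightly different; this shows the shape of the idea)::

    import hashlib

    def permuted_servers(storage_index, server_ids):
        # Rank every server by a hash of (storage_index, server_id). Each
        # file therefore sees its own pseudo-random ordering of the same
        # server list, which balances load across files without adding a
        # new field to the filecap.
        return sorted(server_ids,
                      key=lambda sid: hashlib.sha256(storage_index + sid).digest())

The uploader walks this list, offering shares to each server in turn; the
downloader recomputes the same list and queries servers in the same order,
so shares are usually found near the front of the list.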

An alternative algorithm could give clients more precise control over share
placement, for example for a user who wishes to make sure that k+1 shares are
located in each datacenter (to allow downloads to take place using only local
bandwidth). This algorithm could skip the permuted list and use other
mechanisms to accomplish load-balancing (or ignore the issue altogether). It
could add additional information to the filecap (like a list of which servers
received the shares) in lieu of performing a search at download time, perhaps
at the expense of the repairer's ability to move shares to a new server after
the initial upload. It might make up for this by storing "location hints"
next to each share, to indicate where other shares are likely to be found,
and obligating the repairer to update these hints.

The second purpose of this document is to explain the format of the file
capability string (or "filecap" for short). There are multiple kinds of
capabilities (read-write, read-only, verify-only, repaircap, lease-renewal
cap, traverse-only, etc). There are multiple ways to represent the filecap
(compressed binary, human-readable, clickable-HTTP-URL, "tahoe:" URL, etc),
but they must all contain enough information to reliably retrieve a file
(given some context, of course). It must at least contain the confidentiality
and integrity information from document #1 (i.e. the encryption key and the
UEB hash). It must also contain whatever additional information the
upload-time server-selection algorithm generated that will be required by the
downloader.

For some server-selection algorithms, the additional information will be
minimal. For example, the 1.3.0 release uses the hash of the encryption key
as a storage index, uses the storage index to permute the server list, and
uses an Introducer to learn the current list of servers. This allows a
"close-enough" list of servers to be compressed into a filecap field that is
already required anyway (the encryption key). It also adds k and N to the
filecap, to speed up the downloader's search (the downloader knows how many
shares it needs, so it can send out multiple queries in parallel).
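
To make that concrete: a 1.3.0-era immutable readcap has the general shape
``URI:CHK:<key>:<UEB hash>:<k>:<N>:<size>``. The sketch below shows that
shape and the key-to-storage-index derivation; the field layout follows the
real format, but the hash tag is a hypothetical placeholder::

    import hashlib
    from base64 import b32encode

    def base32(data):
        # Tahoe writes cap fields as lower-case, unpadded base32.
        return b32encode(data).decode("ascii").rstrip("=").lower()

    def chk_filecap(key, ueb_hash, k, n, size):
        return "URI:CHK:%s:%s:%d:%d:%d" % (
            base32(key), base32(ueb_hash), k, n, size)

    def storage_index_from_key(key):
        # The storage index is a hash of the (already required) encryption
        # key, truncated to 128 bits. (Tahoe uses a tagged SHA-256d here.)
        return hashlib.sha256(b"storage-index:" + key).digest()[:16]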

But other server-selection algorithms might require more information. Each
variant of this document will explain how to encode that additional
information into the filecap, and how to extract and use that information at
download time.

These two purposes are interrelated. A filecap that is interpreted in the
context of the allmydata.com commercial grid, which uses tahoe-1.3.0, implies
a specific peer-selection algorithm, a specific Introducer, and therefore a
fairly specific set of servers to query for shares. A filecap which is meant
to be interpreted on a different sort of grid would need different
information.

Some filecap formats can be designed to contain more information (and depend
less upon context), in the same way that an HTTP URL implies the existence of
a single global DNS system. Ideally a Tahoe filecap should be able to specify
which "grid" it lives in, with enough information to allow a compatible
implementation of Tahoe to locate that grid and retrieve the file (regardless
of which server-selection algorithm was used for upload).

This more-universal format might come at the expense of reliability, however.
Tahoe-1.3.0 filecaps do not contain hostnames, because the failure of DNS or
an individual host might then impact file availability (the Introducer,
however, does contain DNS names or IP addresses).

#4: Directory Format
====================

Tahoe directories are a special way of interpreting and managing the contents
of a file (either mutable or immutable). These "dirnode" files are basically
serialized tables that map child name to filecap/dircap. This document
describes the format of these files.

Tahoe-1.3.0 directories are "transitively readonly", which is accomplished by
applying an additional layer of encryption to the list of child writecaps.
The key for this encryption is derived from the containing file's writecap.
This document must explain how to derive this key and apply it to the
appropriate portion of the table.
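
A hedged sketch of that derivation (the real format uses tagged SHA-256d
hashes and a per-entry salt, so the tag string and layout here are
placeholders; AES-CTR comes from the ``cryptography`` package)::

    import hashlib
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    def encrypt_child_writecap(container_writecap, entry_salt, child_writecap):
        # The key is derived from the *container's* writecap, so a holder
        # of only the readcap can list children but cannot recover their
        # writecaps: the directory is "transitively readonly".
        key = hashlib.sha256(
            b"rwcap-key:" + entry_salt + container_writecap).digest()[:16]
        enc = Cipher(algorithms.AES(key), modes.CTR(b"\x00" * 16)).encryptor()
        return enc.update(child_writecap) + enc.finalize()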

Future versions of the directory format are expected to contain
"deep-traversal caps", which allow verification/repair of files without
exposing their plaintext to the repair agent. This document will be
responsible for explaining traversal caps too.

Future versions of the directory format will probably contain an index and
more advanced data structures (for efficiency and fast lookups), instead of a
simple flat list of (childname, childcap). This document will also need to
describe metadata formats, including what access-control policies are defined
for the metadata.
---|