.. -*- coding: utf-8-with-signature -*-

=============
Mutable Files
=============

1. `Mutable Formats`_
2. `Consistency vs. Availability`_
3. `The Prime Coordination Directive: "Don't Do That"`_
4. `Small Distributed Mutable Files`_

   1. `SDMF slots overview`_
   2. `Server Storage Protocol`_
   3. `Code Details`_
   4. `SDMF Slot Format`_
   5. `Recovery`_

5. `Medium Distributed Mutable Files`_
6. `Large Distributed Mutable Files`_
7. `TODO`_

Mutable files are places with a stable identifier that can hold data that
changes over time. In contrast to immutable slots, for which the
identifier/capability is derived from the contents themselves, the mutable
file identifier remains fixed for the life of the slot, regardless of what
data is placed inside it.

Each mutable file is referenced by two different caps. The "read-write" cap
grants read-write access to its holder, allowing them to put whatever
contents they like into the slot. The "read-only" cap is less powerful, only
granting read access, and not enabling modification of the data. The
read-write cap can be turned into the read-only cap, but not the other way
around.

The data in these files is distributed over a number of servers, using the
same erasure coding that immutable files use, with 3-of-10 being a typical
choice of encoding parameters. The data is encrypted and signed in such a way
that only the holders of the read-write cap will be able to set the contents
of the slot, and only the holders of the read-only cap will be able to read
those contents. Holders of either cap will be able to validate the contents
as being written by someone with the read-write cap. The servers who hold the
shares are not automatically given the ability to read or modify them: the
worst they can do is deny service (by deleting or corrupting the shares), or
attempt a rollback attack (which can only succeed with the cooperation of at
least k servers).


Mutable Formats
===============

History
-------

When mutable files first shipped in Tahoe-0.8.0 (15-Feb-2008), the only
version available was "SDMF", described below. This was a
limited-functionality placeholder, intended to be replaced with
improved-efficiency "MDMF" files shortly afterwards. The development process
took longer than expected, and MDMF didn't ship until Tahoe-1.9.0
(31-Oct-2011), and even then it was opt-in (not used by default).

SDMF was intended for relatively small mutable files, up to a few megabytes.
It uses only one segment, so alacrity (the measure of how quickly the first
byte of plaintext is returned to the client) suffers, as the whole file must
be downloaded even if you only want to get a single byte. The memory used by
both clients and servers also scales with the size of the file, instead of
being limited to the half-a-MB-or-so that immutable file operations use, so
large files cause significant memory usage. To discourage the use of SDMF
outside its design parameters, the early versions of Tahoe enforced a
maximum size on mutable files (maybe 10MB). Since most directories are built
out of mutable files, this imposed a limit of about 30k entries per
directory. In subsequent releases, this limit was removed, but the
performance problems inherent in the SDMF implementation remained.

In the summer of 2010, Google-Summer-of-Code student Kevan Carstensen took on
the project of finally implementing MDMF. Because of my (Brian) design
mistake in SDMF (not including a separate encryption seed in each segment),
the share format for SDMF could not be used for MDMF, resulting in a larger
gap between the two implementations (my original intention had been to make
SDMF a clean subset of MDMF, where any single-segment MDMF file could be
handled by the old SDMF code). In the fall of 2011, Kevan's code was finally
integrated, and first made available in the Tahoe-1.9.0 release.

SDMF vs. MDMF
-------------

The key improvement of MDMF is its use of multiple segments: individual
128-KiB sections of the file can be retrieved or modified independently. The
improvement can be seen when fetching just a portion of the file (using a
Range: header on the webapi), or when modifying a portion (again with a
Range: header). It can also be seen indirectly when fetching the whole file:
the first segment of data should be delivered faster from a large MDMF file
than from an SDMF file, although the overall download will then proceed at
the same rate.
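
For instance, a partial read through the webapi can be requested with a
standard HTTP Range header. This is a hedged sketch: the gateway URL and
port below are the usual defaults, and the read-cap in the URL is elided::

    import urllib.request

    # fetch only the first 64 KiB of a mutable file through the gateway
    req = urllib.request.Request(
        "http://127.0.0.1:3456/uri/URI%3ASSK%3A...",  # read-cap, URL-quoted
        headers={"Range": "bytes=0-65535"})
    with urllib.request.urlopen(req) as resp:
        first_chunk = resp.read()

With an SDMF file the gateway must still download the whole (single-segment)
file to satisfy this request; with MDMF only the segments covering the
requested range are retrieved.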

We've decided to make it opt-in for now: mutable files default to
SDMF format unless explicitly configured to use MDMF, either in ``tahoe.cfg``
(see :doc:`../configuration`) or in the WUI or CLI command that created a
new mutable file.

The code can read and modify existing files of either format without user
intervention. We expect to make MDMF the default in a subsequent release,
perhaps 2.0.

Which format should you use? SDMF works well for files up to a few MB, and
can be handled by older versions (Tahoe-1.8.3 and earlier). If you do not
need to support older clients, want to efficiently work with mutable files,
and have code which will use Range: headers to make partial reads and
writes, then MDMF is for you.



Consistency vs. Availability
============================

There is an age-old battle between consistency and availability. Epic papers
have been written, elaborate proofs have been established, and generations of
theorists have learned that you cannot simultaneously achieve guaranteed
consistency and guaranteed availability. In addition, the closer you get to
a perfect guarantee on either axis, the more the cost and complexity of the
design go up.

Tahoe's design goals are to largely favor design simplicity, then slightly
favor read availability, over the other criteria.

As we develop more sophisticated mutable slots, the API may expose multiple
read versions to the application layer. The Tahoe philosophy is to defer most
consistency recovery logic to the higher layers. Some applications have
effective ways to merge multiple versions, so inconsistency is not
necessarily a problem (i.e. directory nodes can usually merge multiple
"add child" operations).



The Prime Coordination Directive: "Don't Do That"
=================================================

The current rule for applications which run on top of Tahoe is "do not
perform simultaneous uncoordinated writes". That means you need some means
outside of Tahoe to make sure that two parties are not trying to modify the
same mutable slot at the same time. For example:

* don't give the read-write URI to anyone else. Dirnodes in a private
  directory generally satisfy this case, as long as you don't use two
  clients on the same account at the same time
* if you give a read-write URI to someone else, stop using it yourself. An
  inbox would be a good example of this.
* if you give a read-write URI to someone else, call them on the phone
  before you write into it
* build an automated mechanism to have your agents coordinate writes.
  For example, we expect a future release to include a FURL for a
  "coordination server" in the dirnodes. The rule can be that you must
  contact the coordination server and obtain a lock/lease on the file
  before you're allowed to modify it.

If you do not follow this rule, Bad Things will happen. The worst-case Bad
Thing is that the entire file will be lost. A less-bad Bad Thing is that one
or more of the simultaneous writers will lose their changes. An observer of
the file may not see monotonically-increasing changes to the file, i.e. they
may see version 1, then version 2, then 3, then 2 again.


Tahoe takes some amount of care to reduce the badness of these Bad Things.
One way you can help nudge it from the "lose your file" case into the "lose
some changes" case is to reduce the number of competing versions: multiple
versions of the file that different parties are trying to establish as the
one true current contents. Each simultaneous writer counts as a "competing
version", as does the previous version of the file. If the count "S" of these
competing versions is larger than N/k, then the file runs the risk of being
lost completely. (For example, with 3-of-10 encoding N/k is about 3.3, so
three simultaneous writers plus the old version are already enough to put
the file at risk.) [TODO] If at least one of the writers remains running
after the collision is detected, it will attempt to recover, but if S>(N/k)
and all writers crash after writing a few shares, the file will be lost.

Note that Tahoe uses serialization internally to make sure that a single
Tahoe node will not perform simultaneous modifications to a mutable file. It
accomplishes this by using a weakref cache of the MutableFileNode (so that
there will never be two distinct MutableFileNodes for the same file), and by
forcing all mutable file operations to obtain a per-node lock before they
run. The Prime Coordination Directive therefore applies to inter-node
conflicts, not intra-node ones.



Small Distributed Mutable Files
===============================

SDMF slots are suitable for small (<1MB) files that are edited by rewriting
the entire file. The three operations are:

* allocate (with initial contents)
* set (with new contents)
* get (old contents)

The first use of SDMF slots will be to hold directories (dirnodes), which map
encrypted child names to rw-URI/ro-URI pairs.


SDMF slots overview
-------------------

Each SDMF slot is created with a public/private key pair. The public key is
known as the "verification key", while the private key is called the
"signature key". The private key is hashed and truncated to 16 bytes to form
the "write key" (an AES symmetric key). The write key is then hashed and
truncated to form the "read key". The read key is hashed and truncated to
form the 16-byte "storage index" (a unique string used as an index to locate
stored data).

The public key is hashed by itself to form the "verification key hash".

The write key is hashed a different way to form the "write enabler master".
For each storage server on which a share is kept, the write enabler master is
concatenated with the server's nodeid and hashed, and the result is called
the "write enabler" for that particular server. Note that multiple shares of
the same slot stored on the same server will all get the same write enabler,
i.e. the write enabler is associated with the "bucket", rather than the
individual shares.

The private key is encrypted (using AES in counter mode) by the write key,
and the resulting crypttext is stored on the servers, so it will be
retrievable by anyone who knows the write key. The write key is not used to
encrypt anything else, and the private key never changes, so we do not need
an IV for this purpose.

The actual data is encrypted (using AES in counter mode) with a key derived
by concatenating the readkey with the IV, then hashing the result and
truncating to 16 bytes. The IV is randomly generated each time the slot is
updated, and stored next to the encrypted data.
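
The full derivation chain can be sketched in Python. This is a minimal
illustration only: the real implementation uses tagged SHA-256d hashes (see
``allmydata.util.hashutil``), and the tag strings and digest sizes below are
invented for the sketch::

    import hashlib
    import os

    def trunc_hash(tag: bytes, data: bytes, length: int = 16) -> bytes:
        # illustrative stand-in for Tahoe's tagged SHA-256d hashes
        return hashlib.sha256(tag + data).digest()[:length]

    signature_key = b"...serialized RSA private key..."   # placeholder

    write_key     = trunc_hash(b"writekey:", signature_key)   # 16-byte AES key
    read_key      = trunc_hash(b"readkey:", write_key)
    storage_index = trunc_hash(b"storageindex:", read_key)    # 16 bytes

    # The write enabler master is derived from the write key; each server's
    # write enabler binds that master to the server's nodeid, so one server
    # cannot use its copy to modify shares held by another.
    write_enabler_master = trunc_hash(b"wemaster:", write_key, 32)

    def write_enabler(nodeid: bytes) -> bytes:
        return trunc_hash(b"we:", write_enabler_master + nodeid, 32)

    # The bulk-data key mixes the read key with a fresh IV on every update.
    iv = os.urandom(16)
    data_key = trunc_hash(b"datakey:", read_key + iv)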

The read-write URI consists of the write key and the verification key hash.
The read-only URI contains the read key and the verification key hash. The
verify-only URI contains the storage index and the verification key hash.

::

    URI:SSK-RW:b2a(writekey):b2a(verification_key_hash)
    URI:SSK-RO:b2a(readkey):b2a(verification_key_hash)
    URI:SSK-Verify:b2a(storage_index):b2a(verification_key_hash)

Note that this allows the read-only and verify-only URIs to be derived from
the read-write URI without actually retrieving the public keys. Also note
that it means the read-write agent must validate both the private key and the
public key when they are first fetched. All users validate the public key in
exactly the same way.


The SDMF slot is allocated by sending a request to the storage server with a
desired size, the storage index, and the write enabler for that server's
nodeid. If granted, the write enabler is stashed inside the slot's backing
store file. All further write requests must be accompanied by the write
enabler or they will not be honored. The storage server does not share the
write enabler with anyone else.

The SDMF slot structure will be described in more detail below. The important
pieces are:

* a sequence number
* a root hash "R"
* the encoding parameters (including k, N, file size, segment size)
* a signed copy of [seqnum,R,encoding_params], using the signature key
* the verification key (not encrypted)
* the share hash chain (part of a Merkle tree over the share hashes)
* the block hash tree (Merkle tree over blocks of share data)
* the share data itself (erasure-coding of read-key-encrypted file data)
* the signature key, encrypted with the write key

The access pattern for read is:

* hash read-key to get storage index
* use storage index to locate 'k' shares with identical 'R' values

  * either get one share, read 'k' from it, then read k-1 shares
  * or read, say, 5 shares, discover k, either get more or be finished
  * or copy k into the URIs

* read verification key
* hash verification key, compare against verification key hash
* read seqnum, R, encoding parameters, signature
* verify signature against verification key
* read share data, compute block-hash Merkle tree and root "r"
* read share hash chain (leading from "r" to "R")
* validate share hash chain up to the root "R" (see the sketch after this
  list)
* submit share data to erasure decoding
* decrypt decoded data with read-key
* submit plaintext to application

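The share-hash-chain validation step can be sketched as follows. This is a
simplified illustration: the real chain is keyed by Merkle-tree node number,
and Tahoe's trees use tagged SHA-256d hashes rather than the plain SHA-256
used here::

    import hashlib

    def pair_hash(left: bytes, right: bytes) -> bytes:
        # illustrative inner-node hash; not Tahoe's actual tagged hash
        return hashlib.sha256(left + right).digest()

    def validate_share_chain(r, siblings, R):
        """Walk from the per-share root "r" up to the file root "R".

        siblings: list of (sibling_is_left, sibling_hash) pairs, ordered
        from the leaf toward the root (an assumed representation of the
        share hash chain)."""
        h = r
        for sibling_is_left, sibling_hash in siblings:
            if sibling_is_left:
                h = pair_hash(sibling_hash, h)
            else:
                h = pair_hash(h, sibling_hash)
        return h == R
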
The access pattern for write is:

* hash write-key to get read-key, hash read-key to get storage index
* use the storage index to locate at least one share
* read verification key and encrypted signature key
* decrypt signature key using write-key
* hash signature key, compare against write-key
* hash verification key, compare against verification key hash
* encrypt plaintext from application with read-key

  * application can encrypt some data with the write-key to make it only
    available to writers (use this for transitive read-onlyness of dirnodes)

* erasure-code crypttext to form shares
* split shares into blocks
* compute Merkle tree of blocks, giving root "r" for each share
* compute Merkle tree of shares, find root "R" for the file as a whole
* create share data structures, one per server:

  * use seqnum which is one higher than the old version
  * share hash chain has log2(N) hashes, different for each server
  * signed data is the same for each server

* now we have N shares and need homes for them
* walk through peers

  * if share is not already present, allocate-and-set
  * otherwise, try to modify existing share: send testv_and_writev
    operation to each one
  * testv says to accept share if their(seqnum+R) <= our(seqnum+R)
  * count how many servers wind up with which versions (histogram over R;
    sketched after this list)
  * keep going until N servers have the same version, or we run out of
    servers

* if any servers wound up with a different version, report error to
  application
* if we ran out of servers, initiate recovery process (described below)

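The placement loop above can be sketched in Python. This is a hedged,
single-pass simplification: ``server.testv_and_writev`` here is a
hypothetical stand-in for the storage protocol call described in the next
section, and versions are represented as (seqnum, R) tuples compared
lexicographically::

    from collections import Counter

    def place_shares(servers, shares, our_version):
        """Try to place one share per server; returns a histogram mapping
        each (seqnum, R) version to the number of servers holding it."""
        histogram = Counter()
        for server, share in zip(servers, shares):
            # accept our write only if the stored version is not newer:
            # their (seqnum, R) <= our (seqnum, R)
            accepted, old_version = server.testv_and_writev(
                test=("le", our_version), new_share=share)  # hypothetical
            histogram[our_version if accepted else old_version] += 1
        return histogram

If the histogram shows that some server retained a version with a seqnum
equal to or greater than our own, a simultaneous (uncoordinated) write has
been detected.
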
Server Storage Protocol
-----------------------

The storage servers will provide a mutable slot container which is oblivious
to the details of the data being contained inside it. Each storage index
refers to a "bucket", and each bucket has one or more shares inside it. (In a
well-provisioned network, each bucket will have only one share). The bucket
is stored as a directory, using the base32-encoded storage index as the
directory name. Each share is stored in a single file, using the share number
as the filename.

The container holds space for a container magic number (for versioning), the
write enabler, the nodeid which accepted the write enabler (used for share
migration, described below), a small number of lease structures, the embedded
data itself, and expansion space for additional lease structures::

    #   offset   size   name
    1   0        32     magic verstr "Tahoe mutable container v1\n\x75\x09\x44\x03\x8e"
    2   32       20     write enabler's nodeid
    3   52       32     write enabler
    4   84       8      data size (actual share data present) (a)
    5   92       8      offset of (8) count of extra leases (after data)
    6   100      368    four leases, 92 bytes each
                          0   4   ownerid (0 means "no lease here")
                          4   4   expiration timestamp
                          8  32   renewal token
                         40  32   cancel token
                         72  20   nodeid which accepted the tokens
    7   468      (a)    data
    8   ??       4      count of extra leases
    9   ??       n*92   extra leases

The "extra leases" field must be copied and rewritten each time the size of
the enclosed data changes. The hope is that most buckets will have four or
fewer leases and this extra copying will not usually be necessary.

The (4) "data size" field contains the actual number of bytes of data present
in field (7), such that a client request to read beyond 468+(a) will result
in an error. This allows the client to (one day) read relative to the end of
the file. The container size (that is, (8)-(7)) might be larger, especially
if extra size was pre-allocated in anticipation of filling the container with
a lot of data.

The offset in (5) points at the *count* of extra leases, at (8). The actual
leases (at (9)) begin 4 bytes later. If the container size changes, both (8)
and (9) must be relocated by copying.
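
A sketch of parsing this container header in Python, assuming big-endian
integer fields (matching the byte layouts elsewhere in this document); this
illustrates the layout and is not the real server code::

    import struct

    MAGIC = b"Tahoe mutable container v1\n\x75\x09\x44\x03\x8e"

    def parse_container_header(f):
        header = f.read(100)
        assert header[0:32] == MAGIC          # field (1): magic verstr
        we_nodeid = header[32:52]             # field (2)
        write_enabler = header[52:84]         # field (3)
        (data_size,) = struct.unpack(">Q", header[84:92])            # (4)
        (extra_lease_offset,) = struct.unpack(">Q", header[92:100])  # (5)
        # four 92-byte lease slots follow at offset 100; the share data
        # itself, field (7), begins at offset 468
        return we_nodeid, write_enabler, data_size, extra_lease_offset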

The server will honor any write commands that provide the write enabler and
do not exceed the server-wide storage size limitations. Read and write
commands MUST be restricted to the 'data' portion of the container: the
implementation of those commands MUST perform correct bounds-checking to
make sure other portions of the container are inaccessible to the clients.

The two methods provided by the storage server on these "MutableSlot" share
objects are:

* readv(ListOf(offset=int, length=int))

  * returns a list of bytestrings, of the various requested lengths
  * offset < 0 is interpreted relative to the end of the data
  * spans which hit the end of the data will return truncated data

* testv_and_writev(write_enabler, test_vector, write_vector)

  * this is a test-and-set operation which performs the given tests and only
    applies the desired writes if all tests succeed. This is used to detect
    simultaneous writers, and to reduce the chance that an update will lose
    data recently written by some other party (written after the last time
    this slot was read).
  * test_vector=ListOf(TupleOf(offset, length, opcode, specimen))
  * the opcode is a string, from the set [gt, ge, eq, le, lt, ne]
  * each element of the test vector is read from the slot's data and
    compared against the specimen using the desired (in)equality. If all
    tests evaluate True, the write is performed
  * write_vector=ListOf(TupleOf(offset, newdata))

    * offset < 0 is not yet defined, it probably means relative to the
      end of the data, which probably means append, but we haven't nailed
      it down quite yet
    * write vectors are executed in order, which specifies the results of
      overlapping writes

  * return value:

    * error: OutOfSpace
    * error: something else (io error, out of memory, whatever)
    * (True, old_test_data): the write was accepted (test_vector passed)
    * (False, old_test_data): the write was rejected (test_vector failed)

      * both 'accepted' and 'rejected' return the old data that was used
        for the test_vector comparison. This can be used by the client
        to detect write collisions, including collisions for which the
        desired behavior was to overwrite the old version.

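The test-and-set semantics can be sketched in Python. This illustration
omits the write-enabler check and the bounds-checking, and is not the real
server code::

    import operator

    _OPS = {"lt": operator.lt, "le": operator.le, "eq": operator.eq,
            "ne": operator.ne, "ge": operator.ge, "gt": operator.gt}

    def testv_and_writev(data, test_vector, write_vector):
        """data is a bytearray; returns (accepted, old_test_data)."""
        old = [bytes(data[o:o+l]) for (o, l, op, specimen) in test_vector]
        accepted = all(_OPS[op](bytes(data[o:o+l]), specimen)
                       for (o, l, op, specimen) in test_vector)
        if accepted:
            # write vectors are applied in order, so overlapping writes
            # resolve in favor of the later vector
            for (offset, newdata) in write_vector:
                data[offset:offset+len(newdata)] = newdata
        return (accepted, old)
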
In addition, the storage server provides several methods to access these
share objects:

* allocate_mutable_slot(storage_index, sharenums=SetOf(int))

  * returns DictOf(int, MutableSlot)

* get_mutable_slot(storage_index)

  * returns DictOf(int, MutableSlot)
  * or raises KeyError

We intend to add an interface which allows small slots to allocate-and-write
in a single call, as well as do update or read in a single call. The goal is
to allow a reasonably-sized dirnode to be created (or updated, or read) in
just one round trip (to all N shareholders in parallel).

migrating shares
````````````````

If a share must be migrated from one server to another, two values become
invalid: the write enabler (since it was computed for the old server), and
the lease renew/cancel tokens.

Suppose that a slot was first created on nodeA, and was thus initialized with
WE(nodeA) (= H(WEM+nodeA)). Later, for provisioning reasons, the share is
moved from nodeA to nodeB.

Readers may still be able to find the share in its new home, depending upon
how many servers are present in the grid, where the new nodeid lands in the
permuted index for this particular storage index, and how many servers the
reading client is willing to contact.

When a client attempts to write to this migrated share, it will get a "bad
write enabler" error, since the WE it computes for nodeB will not match the
WE(nodeA) that was embedded in the share. When this occurs, the "bad write
enabler" message must include the old nodeid (e.g. nodeA) that was in the
share.

The client then computes H(nodeB+H(WEM+nodeA)), which is the same as
H(nodeB+WE(nodeA)). The client sends this along with the new WE(nodeB), which
is H(WEM+nodeB). Note that the client only sends WE(nodeB) to nodeB, never to
anyone else. Also note that the client does not send a value to nodeB that
would allow the node to impersonate the client to a third node: everything
sent to nodeB will include something specific to nodeB in it.

The server locally computes H(nodeB+WE(nodeA)), using its own node id and the
old write enabler from the share. It compares this against the value supplied
by the client. If they match, this serves as proof that the client was able
to compute the old write enabler. The server then accepts the client's new
WE(nodeB) and writes it into the container.

This WE-fixup process requires an extra round trip, and requires the error
message to include the old nodeid, but does not require any public key
operations on either client or server.
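
A sketch of both sides of this fixup exchange, using plain SHA-256 as a
stand-in for Tahoe's tagged hashes (the function names are ours, invented
for the illustration)::

    import hashlib

    def H(*parts: bytes) -> bytes:
        # illustrative; the real write enablers use tagged SHA-256d hashes
        return hashlib.sha256(b"".join(parts)).digest()

    def client_fixup_values(WEM: bytes, nodeA: bytes, nodeB: bytes):
        old_we = H(WEM, nodeA)       # WE(nodeA), computable by the client
        proof = H(nodeB, old_we)     # proves knowledge of WE(nodeA) to nodeB
        new_we = H(WEM, nodeB)       # WE(nodeB), sent only to nodeB
        return proof, new_we

    def server_accept_fixup(my_nodeid: bytes, stored_we: bytes,
                            proof: bytes, new_we: bytes) -> bytes:
        # the server recomputes the proof from its own copy of WE(nodeA)
        if H(my_nodeid, stored_we) != proof:
            raise ValueError("bad write-enabler fixup proof")
        return new_we                # replaces WE(nodeA) in the container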

Migrating the leases will require a similar protocol. This protocol will be
defined concretely at a later date.

Code Details
------------

The MutableFileNode class is used to manipulate mutable files (as opposed to
ImmutableFileNodes). These are initially generated with
client.create_mutable_file(), and later recreated from URIs with
client.create_node_from_uri(). Instances of this class will contain a URI and
a reference to the client (for peer selection and connection).

NOTE: this section is out of date. Please see src/allmydata/interfaces.py
(the section on IMutableFilesystemNode) for more accurate information.

The methods of MutableFileNode are:

* download_to_data() -> [deferred] newdata, NotEnoughSharesError

  * if there are multiple retrievable versions in the grid, this returns
    the first version it can reconstruct, and silently ignores the others.
    In the future, a more advanced API will signal and provide access to
    the multiple heads.

* update(newdata) -> OK, UncoordinatedWriteError, NotEnoughSharesError
* overwrite(newdata) -> OK, UncoordinatedWriteError, NotEnoughSharesError

download_to_data() causes a new retrieval to occur, pulling the current
contents from the grid and returning them to the caller. At the same time,
this call caches information about the current version of the file. This
information will be used in a subsequent call to update(), and if another
change has occurred between the two, this information will be out of date,
triggering the UncoordinatedWriteError.

update() is therefore intended to be used just after a download_to_data(), in
the following pattern::

    d = mfn.download_to_data()
    d.addCallback(apply_delta)
    d.addCallback(mfn.update)

If the update() call raises UCW, then the application can simply return an
error to the user ("you violated the Prime Coordination Directive"), and they
can try again later. Alternatively, the application can attempt to retry on
its own. To accomplish this, the app needs to pause, download the new
(post-collision and post-recovery) form of the file, reapply their delta,
then submit the update request again. A randomized pause is necessary to
reduce the chances of colliding a second time with another client that is
doing exactly the same thing::

    d = mfn.download_to_data()
    d.addCallback(apply_delta)
    d.addCallback(mfn.update)
    def _retry(f):
        f.trap(UncoordinatedWriteError)
        d1 = pause(random.uniform(5, 20))
        d1.addCallback(lambda res: mfn.download_to_data())
        d1.addCallback(apply_delta)
        d1.addCallback(mfn.update)
        return d1
    d.addErrback(_retry)

Enthusiastic applications can retry multiple times, using a randomized
exponential backoff between each. A particularly enthusiastic application can
retry forever, but such apps are encouraged to provide a means to the user of
giving up after a while.

UCW does not mean that the update was not applied, so it is also a good idea
to skip the retry-update step if the delta was already applied::

    d = mfn.download_to_data()
    d.addCallback(apply_delta)
    d.addCallback(mfn.update)
    def _retry(f):
        f.trap(UncoordinatedWriteError)
        d1 = pause(random.uniform(5, 20))
        d1.addCallback(lambda res: mfn.download_to_data())
        def _maybe_apply_delta(contents):
            new_contents = apply_delta(contents)
            if new_contents != contents:
                return mfn.update(new_contents)
        d1.addCallback(_maybe_apply_delta)
        return d1
    d.addErrback(_retry)

update() is the right interface to use for delta-application situations, like
directory nodes (in which apply_delta might be adding or removing child
entries from a serialized table).

Note that any uncoordinated write has the potential to lose data. We must do
more analysis to be sure, but it appears that two clients who write to the
same mutable file at the same time (even if both eventually retry) will, with
high probability, result in one client observing UCW and the other silently
losing their changes. It is also possible for both clients to observe UCW.
The moral of the story is that the Prime Coordination Directive is there for
a reason, and that recovery/UCW/retry is not a substitute for write
coordination.

overwrite() tells the client to ignore this cached version information, and
to unconditionally replace the mutable file's contents with the new data.
This should not be used in delta application, but rather in situations where
you want to replace the file's contents with completely unrelated ones. When
raw files are uploaded into a mutable slot through the Tahoe-LAFS web-API
(using POST and the ?mutable=true argument), they are put in place with
overwrite().

The peer-selection and data-structure manipulation (and signing/verification)
steps will be implemented in a separate class in allmydata/mutable.py .

SDMF Slot Format
----------------

This SDMF data lives inside a server-side MutableSlot container. The server
is oblivious to this format.

This data is tightly packed. In particular, the share data is defined to run
all the way to the beginning of the encrypted private key (the encprivkey
offset is used both to terminate the share data and to begin the encprivkey).

::

    #    offset    size     name
    1    0         1        version byte, \x00 for this format
    2    1         8        sequence number. 2^64-1 must be handled specially, TBD
    3    9         32       "R" (root of share hash Merkle tree)
    4    41        16       IV (share data is AES(H(readkey+IV)) )
    5    57        18       encoding parameters:
          57        1         k
          58        1         N
          59        8         segment size
          67        8         data length (of original plaintext)
    6    75        32       offset table:
          75        4         (8) signature
          79        4         (9) share hash chain
          83        4         (10) block hash tree
          87        4         (11) share data
          91        8         (12) encrypted private key
          99        8         (13) EOF
    7    107       436ish   verification key (2048 RSA key)
    8    543ish    256ish   signature=RSAsign(sigkey, H(version+seqnum+r+IV+encparm))
    9    799ish    (a)      share hash chain, encoded as:
                             "".join([pack(">H32s", shnum, hash)
                                      for (shnum,hash) in needed_hashes])
    10   (927ish)  (b)      block hash tree, encoded as:
                             "".join([pack(">32s",hash) for hash in block_hash_tree])
    11   (935ish)  LEN      share data (no gap between this and encprivkey)
    12   ??        1216ish  encrypted private key= AESenc(write-key, RSA-key)
    13   ??        --       EOF

(a) The share hash chain contains ceil(log2(N)) hashes, each 32 bytes long.
    This is the set of hashes necessary to validate this share's leaf in the
    share Merkle tree. For N=10, this is 4 hashes, i.e. 128 bytes.
(b) The block hash tree contains ceil(length/segsize) hashes, each 32 bytes
    long. This is the set of hashes necessary to validate any given block of
    share data up to the per-share root "r". Each "r" is a leaf of the share
    hash tree (with root "R"), from which a minimal subset of hashes is put
    in the share hash chain in (8).

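A sketch of parsing the fixed-size prefix and offset table with ``struct``,
under the layout above (the field names are ours, not the implementation's);
big-endian encoding matches the ``pack(">...")`` examples in the table::

    import struct

    PREFIX = ">B Q 32s 16s B B Q Q"   # version, seqnum, R, IV, k, N,
                                      # segment size, data length (75 bytes)
    OFFSETS = ">L L L L Q Q"          # offsets of fields (8)-(13) (32 bytes)

    def parse_sdmf_prefix(share: bytes):
        (version, seqnum, root_hash, iv,
         k, N, segsize, datalen) = struct.unpack(PREFIX, share[:75])
        assert version == 0
        (o_signature, o_hash_chain, o_block_hashes, o_share_data,
         o_enc_privkey, o_eof) = struct.unpack(OFFSETS, share[75:107])
        return dict(seqnum=seqnum, R=root_hash, IV=iv, k=k, N=N,
                    segsize=segsize, datalen=datalen,
                    offsets=(o_signature, o_hash_chain, o_block_hashes,
                             o_share_data, o_enc_privkey, o_eof))
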
Recovery
--------

The first line of defense against damage caused by colliding writes is the
Prime Coordination Directive: "Don't Do That".

The second line of defense is to keep "S" (the number of competing versions)
lower than N/k. If this holds true, at least one competing version will have
k shares and thus be recoverable. Note that server unavailability counts
against us here: the old version stored on the unavailable server must be
included in the value of S.

The third line of defense is our use of testv_and_writev() (described above),
which increases the convergence of simultaneous writes: one of the writers
will be favored (the one with the highest "R"), and that version is more
likely to be accepted than the others. This defense is least effective in the
pathological situation where S simultaneous writers are active, the one with
the lowest "R" writes to N-k+1 of the shares and then dies, then the one with
the next-lowest "R" writes to N-2k+1 of the shares and dies, etc, until the
one with the highest "R" writes to k-1 shares and dies. Any other sequencing
will allow the highest "R" to write to at least k shares and establish a new
revision.

The fourth line of defense is the fact that each client keeps writing until
at least one version has N shares. This uses additional servers, if
necessary, to make sure that either the client's version or some
newer/overriding version is highly available.

The fifth line of defense is the recovery algorithm, which seeks to make sure
that at least *one* version is highly available, even if that version is
somebody else's.

The write-shares-to-peers algorithm is as follows:

* permute peers according to storage index
* walk through peers, trying to assign one share per peer
* for each peer:

  * send testv_and_writev, using "old(seqnum+R) <= our(seqnum+R)" as the test

    * this means that we will overwrite any old versions, and we will
      overwrite simultaneous writers of the same version if our R is higher.
      We will not overwrite writers using a higher seqnum.

  * record the version that each share winds up with. If the write was
    accepted, this is our own version. If it was rejected, read the
    old_test_data to find out what version was retained.
  * if old_test_data indicates the seqnum was equal or greater than our
    own, mark the "Simultaneous Writes Detected" flag, which will eventually
    result in an error being reported to the writer (in their close() call).
  * build a histogram of "R" values
  * repeat until the histogram indicates that some version (possibly ours)
    has N shares. Use new servers if necessary.

* If we run out of servers:

  * if there are at least shares-of-happiness of any one version, we're
    happy, so return. (the close() might still get an error)
  * not happy, need to reinforce something, goto RECOVERY

Recovery:

* read all shares, count the versions, identify the recoverable ones,
  discard the unrecoverable ones.
* sort versions: locate max(seqnums), put all versions with that seqnum
  in the list, sort by number of outstanding shares. Then put our own
  version. (TODO: put versions with seqnum <max but >us ahead of us?).
* for each version:

  * attempt to recover that version
  * if not possible, remove it from the list, go to next one
  * if recovered, start at beginning of peer list, push that version,
    continue until N shares are placed
  * if pushing our own version, bump up the seqnum to one higher than
    the max seqnum we saw

* if we run out of servers:

  * schedule retry and exponential backoff to repeat RECOVERY

  * admit defeat after some period? presumably the client will be shut down
    eventually, maybe keep trying (once per hour?) until then.


Medium Distributed Mutable Files
================================

These are just like the SDMF case, but:

* We actually take advantage of the Merkle hash tree over the blocks, by
  reading a single segment of data at a time (and its necessary hashes), to
  improve the read-time alacrity.
* We allow arbitrary writes to any range of the file.
* We add more code to first read each segment that a write must modify.
  This looks exactly like the way a normal filesystem uses a block device,
  or how a CPU must perform a cache-line fill before modifying a single word.
* We might implement some sort of copy-based atomic update server call,
  to allow multiple writev() calls to appear atomic to any readers.

MDMF slots provide fairly efficient in-place edits of very large files (a few
GB). Appending data is also fairly efficient.


Large Distributed Mutable Files
===============================

LDMF slots (not implemented) would use a fundamentally different way to store
the file, inspired by Mercurial's "revlog" format. This would enable very
efficient insert/remove/replace editing of arbitrary spans. Multiple versions
of the file can be retained, in a revision graph that can have multiple heads.
Each revision can be referenced by a cryptographic identifier. There are two
forms of the URI, one that means "most recent version", and a longer one that
points to a specific revision.

Metadata can be attached to the revisions, like timestamps, to enable rolling
back an entire tree to a specific point in history.

LDMF1 provides deltas but tries to avoid dealing with multiple heads. LDMF2
provides explicit support for revision identifiers and branching.


TODO
====

improve allocate-and-write or get-writer-buckets API to allow one-call (or
maybe two-call) updates. The challenge is in figuring out which shares are on
which machines. First cut will have lots of round trips.

(eventually) define behavior when seqnum wraps. At the very least make sure
it can't cause a security problem. "the slot is worn out" is acceptable.

(eventually) define share-migration lease update protocol. Including the
nodeid who accepted the lease is useful, we can use the same protocol as we
do for updating the write enabler. However we need to know which lease to
update... maybe send back a list of all old nodeids that we find, then try
all of them when we accept the update?

We now do this in a specially-formatted IndexError exception::

    "UNABLE to renew non-existent lease. I have leases accepted by
    nodeids: '12345','abcde','44221' ."

confirm that a repairer can regenerate shares without the private key. Hmm,
without the write-enabler they won't be able to write those shares to the
servers... although they could add immutable new shares to new servers.
---|