.. -*- coding: utf-8-with-signature -*-

=============
Mutable Files
=============

1. `Mutable Formats`_
2. `Consistency vs. Availability`_
3. `The Prime Coordination Directive: "Don't Do That"`_
4. `Small Distributed Mutable Files`_

   1. `SDMF slots overview`_
   2. `Server Storage Protocol`_
   3. `Code Details`_
   4. `SDMF Slot Format`_
   5. `Recovery`_

5. `Medium Distributed Mutable Files`_
6. `Large Distributed Mutable Files`_
7. `TODO`_

Mutable files are places with a stable identifier that can hold data that
changes over time. In contrast to immutable slots, for which the
identifier/capability is derived from the contents themselves, the mutable
file identifier remains fixed for the life of the slot, regardless of what
data is placed inside it.

Each mutable file is referenced by two different caps. The "read-write" cap
grants read-write access to its holder, allowing them to put whatever
contents they like into the slot. The "read-only" cap is less powerful, only
granting read access, and not enabling modification of the data. The
read-write cap can be turned into the read-only cap, but not the other way
around.

The data in these files is distributed over a number of servers, using the
same erasure coding that immutable files use, with 3-of-10 being a typical
choice of encoding parameters. The data is encrypted and signed in such a way
that only the holders of the read-write cap will be able to set the contents
of the slot, and only the holders of the read-only cap will be able to read
those contents. Holders of either cap will be able to validate the contents
as being written by someone with the read-write cap. The servers who hold the
shares are not automatically given the ability to read or modify them: the
worst they can do is deny service (by deleting or corrupting the shares), or
attempt a rollback attack (which can only succeed with the cooperation of at
least k servers).


Mutable Formats
===============

History
-------

When mutable files first shipped in Tahoe-0.8.0 (15-Feb-2008), the only
version available was "SDMF", described below. This was a
limited-functionality placeholder, intended to be replaced with
improved-efficiency "MDMF" files shortly afterwards. The development process
took longer than expected, and MDMF didn't ship until Tahoe-1.9.0
(31-Oct-2011), and even then it was opt-in (not used by default).

SDMF was intended for relatively small mutable files, up to a few megabytes.
It uses only one segment, so alacrity (the measure of how quickly the first
byte of plaintext is returned to the client) suffers, as the whole file must
be downloaded even if you only want to get a single byte. The memory used by
both clients and servers also scales with the size of the file, instead of
being limited to the half-a-MB-or-so that immutable file operations use, so
large files cause significant memory usage. To discourage the use of SDMF
outside its design parameters, the early versions of Tahoe enforced a
maximum size on mutable files (maybe 10MB). Since most directories are built
out of mutable files, this imposed a limit of about 30k entries per
directory. In subsequent releases, this limit was removed, but the
performance problems inherent in the SDMF implementation remained.

In the summer of 2010, Google-Summer-of-Code student Kevan Carstensen took on
the project of finally implementing MDMF. Because of my (Brian) design
mistake in SDMF (not including a separate encryption seed in each segment),
the share format for SDMF could not be used for MDMF, resulting in a larger
gap between the two implementations (my original intention had been to make
SDMF a clean subset of MDMF, where any single-segment MDMF file could be
handled by the old SDMF code). In the fall of 2011, Kevan's code was finally
integrated, and first made available in the Tahoe-1.9.0 release.

SDMF vs. MDMF
-------------

The key improvement of MDMF is its use of multiple segments: individual
128-KiB sections of the file can be retrieved or modified independently. The
improvement can be seen when fetching just a portion of the file (using a
Range: header on the webapi), or when modifying a portion (again with a
Range: header). It can also be seen indirectly when fetching the whole file:
the first segment of data should be delivered faster from a large MDMF file
than from an SDMF file, although the overall download will then proceed at
the same rate.
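
For instance, a partial read through the webapi can be requested with a
standard HTTP Range header. This is a hedged sketch: the gateway URL and
port below are the usual defaults, and the read-cap in the URL is elided::

    import urllib.request

    # fetch only the first 64 KiB of a mutable file through the gateway
    req = urllib.request.Request(
        "http://127.0.0.1:3456/uri/URI%3ASSK%3A...",  # read-cap, URL-quoted
        headers={"Range": "bytes=0-65535"})
    with urllib.request.urlopen(req) as resp:
        first_chunk = resp.read()

With an SDMF file the gateway must still download the whole (single-segment)
file to satisfy this request; with MDMF only the segments covering the
requested range are retrieved.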

We've decided to make it opt-in for now: mutable files default to
SDMF format unless explicitly configured to use MDMF, either in ``tahoe.cfg``
(see :doc:`../configuration`) or in the WUI or CLI command that created a
new mutable file.

The code can read and modify existing files of either format without user
intervention. We expect to make MDMF the default in a subsequent release,
perhaps 2.0.

Which format should you use? SDMF works well for files up to a few MB, and
can be handled by older versions (Tahoe-1.8.3 and earlier). If you do not
need to support older clients, want to efficiently work with mutable files,
and have code which will use Range: headers to make partial reads and
writes, then MDMF is for you.



Consistency vs. Availability
============================

There is an age-old battle between consistency and availability. Epic papers
have been written, elaborate proofs have been established, and generations of
theorists have learned that you cannot simultaneously achieve guaranteed
consistency and guaranteed availability. In addition, the closer you get to
a perfect guarantee on either axis, the more the cost and complexity of the
design go up.

Tahoe's design goals are to largely favor design simplicity, then slightly
favor read availability, over the other criteria.

As we develop more sophisticated mutable slots, the API may expose multiple
read versions to the application layer. The Tahoe philosophy is to defer most
consistency recovery logic to the higher layers. Some applications have
effective ways to merge multiple versions, so inconsistency is not
necessarily a problem (i.e. directory nodes can usually merge multiple
"add child" operations).



The Prime Coordination Directive: "Don't Do That"
=================================================

The current rule for applications which run on top of Tahoe is "do not
perform simultaneous uncoordinated writes". That means you need some means
outside of Tahoe to make sure that two parties are not trying to modify the
same mutable slot at the same time. For example:

* don't give the read-write URI to anyone else. Dirnodes in a private
  directory generally satisfy this case, as long as you don't use two
  clients on the same account at the same time
* if you give a read-write URI to someone else, stop using it yourself. An
  inbox would be a good example of this.
* if you give a read-write URI to someone else, call them on the phone
  before you write into it
* build an automated mechanism to have your agents coordinate writes.
  For example, we expect a future release to include a FURL for a
  "coordination server" in the dirnodes. The rule can be that you must
  contact the coordination server and obtain a lock/lease on the file
  before you're allowed to modify it.

If you do not follow this rule, Bad Things will happen. The worst-case Bad
Thing is that the entire file will be lost. A less-bad Bad Thing is that one
or more of the simultaneous writers will lose their changes. An observer of
the file may not see monotonically-increasing changes to the file, i.e. they
may see version 1, then version 2, then 3, then 2 again.


Tahoe takes some amount of care to reduce the badness of these Bad Things.
One way you can help nudge it from the "lose your file" case into the "lose
some changes" case is to reduce the number of competing versions: multiple
versions of the file that different parties are trying to establish as the
one true current contents. Each simultaneous writer counts as a "competing
version", as does the previous version of the file. If the count "S" of these
competing versions is larger than N/k, then the file runs the risk of being
lost completely. (For example, with 3-of-10 encoding N/k is about 3.3, so
three simultaneous writers plus the old version are already enough to put
the file at risk.) [TODO] If at least one of the writers remains running
after the collision is detected, it will attempt to recover, but if S>(N/k)
and all writers crash after writing a few shares, the file will be lost.

Note that Tahoe uses serialization internally to make sure that a single
Tahoe node will not perform simultaneous modifications to a mutable file. It
accomplishes this by using a weakref cache of the MutableFileNode (so that
there will never be two distinct MutableFileNodes for the same file), and by
forcing all mutable file operations to obtain a per-node lock before they
run. The Prime Coordination Directive therefore applies to inter-node
conflicts, not intra-node ones.



Small Distributed Mutable Files
===============================

SDMF slots are suitable for small (<1MB) files that are edited by rewriting
the entire file. The three operations are:

* allocate (with initial contents)
* set (with new contents)
* get (old contents)

The first use of SDMF slots will be to hold directories (dirnodes), which map
encrypted child names to rw-URI/ro-URI pairs.


SDMF slots overview
-------------------

Each SDMF slot is created with a public/private key pair. The public key is
known as the "verification key", while the private key is called the
"signature key". The private key is hashed and truncated to 16 bytes to form
the "write key" (an AES symmetric key). The write key is then hashed and
truncated to form the "read key". The read key is hashed and truncated to
form the 16-byte "storage index" (a unique string used as an index to locate
stored data).

The public key is hashed by itself to form the "verification key hash".

The write key is hashed a different way to form the "write enabler master".
For each storage server on which a share is kept, the write enabler master is
concatenated with the server's nodeid and hashed, and the result is called
the "write enabler" for that particular server. Note that multiple shares of
the same slot stored on the same server will all get the same write enabler,
i.e. the write enabler is associated with the "bucket", rather than the
individual shares.

The private key is encrypted (using AES in counter mode) by the write key,
and the resulting crypttext is stored on the servers, so it will be
retrievable by anyone who knows the write key. The write key is not used to
encrypt anything else, and the private key never changes, so we do not need
an IV for this purpose.

The actual data is encrypted (using AES in counter mode) with a key derived
by concatenating the readkey with the IV, then hashing the result and
truncating to 16 bytes. The IV is randomly generated each time the slot is
updated, and stored next to the encrypted data.
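
The full derivation chain can be sketched in Python. This is a minimal
illustration only: the real implementation uses tagged SHA-256d hashes (see
``allmydata.util.hashutil``), and the tag strings and digest sizes below are
invented for the sketch::

    import hashlib
    import os

    def trunc_hash(tag: bytes, data: bytes, length: int = 16) -> bytes:
        # illustrative stand-in for Tahoe's tagged SHA-256d hashes
        return hashlib.sha256(tag + data).digest()[:length]

    signature_key = b"...serialized RSA private key..."   # placeholder

    write_key     = trunc_hash(b"writekey:", signature_key)   # 16-byte AES key
    read_key      = trunc_hash(b"readkey:", write_key)
    storage_index = trunc_hash(b"storageindex:", read_key)    # 16 bytes

    # The write enabler master is derived from the write key; each server's
    # write enabler binds that master to the server's nodeid, so one server
    # cannot use its copy to modify shares held by another.
    write_enabler_master = trunc_hash(b"wemaster:", write_key, 32)

    def write_enabler(nodeid: bytes) -> bytes:
        return trunc_hash(b"we:", write_enabler_master + nodeid, 32)

    # The bulk-data key mixes the read key with a fresh IV on every update.
    iv = os.urandom(16)
    data_key = trunc_hash(b"datakey:", read_key + iv)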

The read-write URI consists of the write key and the verification key hash.
The read-only URI contains the read key and the verification key hash. The
verify-only URI contains the storage index and the verification key hash.

::

    URI:SSK-RW:b2a(writekey):b2a(verification_key_hash)
    URI:SSK-RO:b2a(readkey):b2a(verification_key_hash)
    URI:SSK-Verify:b2a(storage_index):b2a(verification_key_hash)

Note that this allows the read-only and verify-only URIs to be derived from
the read-write URI without actually retrieving the public keys. Also note
that it means the read-write agent must validate both the private key and the
public key when they are first fetched. All users validate the public key in
exactly the same way.


The SDMF slot is allocated by sending a request to the storage server with a
desired size, the storage index, and the write enabler for that server's
nodeid. If granted, the write enabler is stashed inside the slot's backing
store file. All further write requests must be accompanied by the write
enabler or they will not be honored. The storage server does not share the
write enabler with anyone else.

The SDMF slot structure will be described in more detail below. The important
pieces are:

* a sequence number
* a root hash "R"
* the encoding parameters (including k, N, file size, segment size)
* a signed copy of [seqnum,R,encoding_params], using the signature key
* the verification key (not encrypted)
* the share hash chain (part of a Merkle tree over the share hashes)
* the block hash tree (Merkle tree over blocks of share data)
* the share data itself (erasure-coding of read-key-encrypted file data)
* the signature key, encrypted with the write key

The access pattern for read is:

* hash read-key to get storage index
* use storage index to locate 'k' shares with identical 'R' values

  * either get one share, read 'k' from it, then read k-1 shares
  * or read, say, 5 shares, discover k, either get more or be finished
  * or copy k into the URIs

* read verification key
* hash verification key, compare against verification key hash
* read seqnum, R, encoding parameters, signature
* verify signature against verification key
* read share data, compute block-hash Merkle tree and root "r"
* read share hash chain (leading from "r" to "R")
* validate share hash chain up to the root "R" (see the sketch after this
  list)
* submit share data to erasure decoding
* decrypt decoded data with read-key
* submit plaintext to application

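The share-hash-chain validation step can be sketched as follows. This is a
simplified illustration: the real chain is keyed by Merkle-tree node number,
and Tahoe's trees use tagged SHA-256d hashes rather than the plain SHA-256
used here::

    import hashlib

    def pair_hash(left: bytes, right: bytes) -> bytes:
        # illustrative inner-node hash; not Tahoe's actual tagged hash
        return hashlib.sha256(left + right).digest()

    def validate_share_chain(r, siblings, R):
        """Walk from the per-share root "r" up to the file root "R".

        siblings: list of (sibling_is_left, sibling_hash) pairs, ordered
        from the leaf toward the root (an assumed representation of the
        share hash chain)."""
        h = r
        for sibling_is_left, sibling_hash in siblings:
            if sibling_is_left:
                h = pair_hash(sibling_hash, h)
            else:
                h = pair_hash(h, sibling_hash)
        return h == R
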
The access pattern for write is:

* hash write-key to get read-key, hash read-key to get storage index
* use the storage index to locate at least one share
* read verification key and encrypted signature key
* decrypt signature key using write-key
* hash signature key, compare against write-key
* hash verification key, compare against verification key hash
* encrypt plaintext from application with read-key

  * application can encrypt some data with the write-key to make it only
    available to writers (use this for transitive read-onlyness of dirnodes)

* erasure-code crypttext to form shares
* split shares into blocks
* compute Merkle tree of blocks, giving root "r" for each share
* compute Merkle tree of shares, find root "R" for the file as a whole
* create share data structures, one per server:

  * use seqnum which is one higher than the old version
  * share hash chain has log2(N) hashes, different for each server
  * signed data is the same for each server

* now we have N shares and need homes for them
* walk through peers

  * if share is not already present, allocate-and-set
  * otherwise, try to modify existing share: send testv_and_writev
    operation to each one
  * testv says to accept share if their(seqnum+R) <= our(seqnum+R)
  * count how many servers wind up with which versions (histogram over R;
    sketched after this list)
  * keep going until N servers have the same version, or we run out of
    servers

* if any servers wound up with a different version, report error to
  application
* if we ran out of servers, initiate recovery process (described below)

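The placement loop above can be sketched in Python. This is a hedged,
single-pass simplification: ``server.testv_and_writev`` here is a
hypothetical stand-in for the storage protocol call described in the next
section, and versions are represented as (seqnum, R) tuples compared
lexicographically::

    from collections import Counter

    def place_shares(servers, shares, our_version):
        """Try to place one share per server; returns a histogram mapping
        each (seqnum, R) version to the number of servers holding it."""
        histogram = Counter()
        for server, share in zip(servers, shares):
            # accept our write only if the stored version is not newer:
            # their (seqnum, R) <= our (seqnum, R)
            accepted, old_version = server.testv_and_writev(
                test=("le", our_version), new_share=share)  # hypothetical
            histogram[our_version if accepted else old_version] += 1
        return histogram

If the histogram shows that some server retained a version with a seqnum
equal to or greater than our own, a simultaneous (uncoordinated) write has
been detected.
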
Server Storage Protocol
-----------------------

The storage servers will provide a mutable slot container which is oblivious
to the details of the data being contained inside it. Each storage index
refers to a "bucket", and each bucket has one or more shares inside it. (In a
well-provisioned network, each bucket will have only one share). The bucket
is stored as a directory, using the base32-encoded storage index as the
directory name. Each share is stored in a single file, using the share number
as the filename.

The container holds space for a container magic number (for versioning), the
write enabler, the nodeid which accepted the write enabler (used for share
migration, described below), a small number of lease structures, the embedded
data itself, and expansion space for additional lease structures::

    #   offset   size   name
    1   0        32     magic verstr "Tahoe mutable container v1\n\x75\x09\x44\x03\x8e"
    2   32       20     write enabler's nodeid
    3   52       32     write enabler
    4   84       8      data size (actual share data present) (a)
    5   92       8      offset of (8) count of extra leases (after data)
    6   100      368    four leases, 92 bytes each
                          0   4   ownerid (0 means "no lease here")
                          4   4   expiration timestamp
                          8  32   renewal token
                         40  32   cancel token
                         72  20   nodeid which accepted the tokens
    7   468      (a)    data
    8   ??       4      count of extra leases
    9   ??       n*92   extra leases

The "extra leases" field must be copied and rewritten each time the size of
the enclosed data changes. The hope is that most buckets will have four or
fewer leases and this extra copying will not usually be necessary.

The (4) "data size" field contains the actual number of bytes of data present
in field (7), such that a client request to read beyond 468+(a) will result
in an error. This allows the client to (one day) read relative to the end of
the file. The container size (that is, (8)-(7)) might be larger, especially
if extra size was pre-allocated in anticipation of filling the container with
a lot of data.

The offset in (5) points at the *count* of extra leases, at (8). The actual
leases (at (9)) begin 4 bytes later. If the container size changes, both (8)
and (9) must be relocated by copying.
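
A sketch of parsing this container header in Python, assuming big-endian
integer fields (matching the byte layouts elsewhere in this document); this
illustrates the layout and is not the real server code::

    import struct

    MAGIC = b"Tahoe mutable container v1\n\x75\x09\x44\x03\x8e"

    def parse_container_header(f):
        header = f.read(100)
        assert header[0:32] == MAGIC          # field (1): magic verstr
        we_nodeid = header[32:52]             # field (2)
        write_enabler = header[52:84]         # field (3)
        (data_size,) = struct.unpack(">Q", header[84:92])            # (4)
        (extra_lease_offset,) = struct.unpack(">Q", header[92:100])  # (5)
        # four 92-byte lease slots follow at offset 100; the share data
        # itself, field (7), begins at offset 468
        return we_nodeid, write_enabler, data_size, extra_lease_offset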

The server will honor any write commands that provide the write enabler and
do not exceed the server-wide storage size limitations. Read and write
commands MUST be restricted to the 'data' portion of the container: the
implementation of those commands MUST perform correct bounds-checking to
make sure other portions of the container are inaccessible to the clients.

The two methods provided by the storage server on these "MutableSlot" share
objects are:

* readv(ListOf(offset=int, length=int))

  * returns a list of bytestrings, of the various requested lengths
  * offset < 0 is interpreted relative to the end of the data
  * spans which hit the end of the data will return truncated data

* testv_and_writev(write_enabler, test_vector, write_vector)

  * this is a test-and-set operation which performs the given tests and only
    applies the desired writes if all tests succeed. This is used to detect
    simultaneous writers, and to reduce the chance that an update will lose
    data recently written by some other party (written after the last time
    this slot was read).
  * test_vector=ListOf(TupleOf(offset, length, opcode, specimen))
  * the opcode is a string, from the set [gt, ge, eq, le, lt, ne]
  * each element of the test vector is read from the slot's data and
    compared against the specimen using the desired (in)equality. If all
    tests evaluate True, the write is performed
  * write_vector=ListOf(TupleOf(offset, newdata))

    * offset < 0 is not yet defined, it probably means relative to the
      end of the data, which probably means append, but we haven't nailed
      it down quite yet
    * write vectors are executed in order, which specifies the results of
      overlapping writes

  * return value:

    * error: OutOfSpace
    * error: something else (io error, out of memory, whatever)
    * (True, old_test_data): the write was accepted (test_vector passed)
    * (False, old_test_data): the write was rejected (test_vector failed)

      * both 'accepted' and 'rejected' return the old data that was used
        for the test_vector comparison. This can be used by the client
        to detect write collisions, including collisions for which the
        desired behavior was to overwrite the old version.

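The test-and-set semantics can be sketched in Python. This illustration
omits the write-enabler check and the bounds-checking, and is not the real
server code::

    import operator

    _OPS = {"lt": operator.lt, "le": operator.le, "eq": operator.eq,
            "ne": operator.ne, "ge": operator.ge, "gt": operator.gt}

    def testv_and_writev(data, test_vector, write_vector):
        """data is a bytearray; returns (accepted, old_test_data)."""
        old = [bytes(data[o:o+l]) for (o, l, op, specimen) in test_vector]
        accepted = all(_OPS[op](bytes(data[o:o+l]), specimen)
                       for (o, l, op, specimen) in test_vector)
        if accepted:
            # write vectors are applied in order, so overlapping writes
            # resolve in favor of the later vector
            for (offset, newdata) in write_vector:
                data[offset:offset+len(newdata)] = newdata
        return (accepted, old)
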
In addition, the storage server provides several methods to access these
share objects:

* allocate_mutable_slot(storage_index, sharenums=SetOf(int))

  * returns DictOf(int, MutableSlot)

* get_mutable_slot(storage_index)

  * returns DictOf(int, MutableSlot)
  * or raises KeyError

We intend to add an interface which allows small slots to allocate-and-write
in a single call, as well as do update or read in a single call. The goal is
to allow a reasonably-sized dirnode to be created (or updated, or read) in
just one round trip (to all N shareholders in parallel).

migrating shares
````````````````

If a share must be migrated from one server to another, two values become
invalid: the write enabler (since it was computed for the old server), and
the lease renew/cancel tokens.

Suppose that a slot was first created on nodeA, and was thus initialized with
WE(nodeA) (= H(WEM+nodeA)). Later, for provisioning reasons, the share is
moved from nodeA to nodeB.

Readers may still be able to find the share in its new home, depending upon
how many servers are present in the grid, where the new nodeid lands in the
permuted index for this particular storage index, and how many servers the
reading client is willing to contact.

When a client attempts to write to this migrated share, it will get a "bad
write enabler" error, since the WE it computes for nodeB will not match the
WE(nodeA) that was embedded in the share. When this occurs, the "bad write
enabler" message must include the old nodeid (e.g. nodeA) that was in the
share.

The client then computes H(nodeB+H(WEM+nodeA)), which is the same as
H(nodeB+WE(nodeA)). The client sends this along with the new WE(nodeB), which
is H(WEM+nodeB). Note that the client only sends WE(nodeB) to nodeB, never to
anyone else. Also note that the client does not send a value to nodeB that
would allow the node to impersonate the client to a third node: everything
sent to nodeB will include something specific to nodeB in it.

The server locally computes H(nodeB+WE(nodeA)), using its own node id and the
old write enabler from the share. It compares this against the value supplied
by the client. If they match, this serves as proof that the client was able
to compute the old write enabler. The server then accepts the client's new
WE(nodeB) and writes it into the container.

This WE-fixup process requires an extra round trip, and requires the error
message to include the old nodeid, but does not require any public key
operations on either client or server.
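
A sketch of both sides of this fixup exchange, using plain SHA-256 as a
stand-in for Tahoe's tagged hashes (the function names are ours, invented
for the illustration)::

    import hashlib

    def H(*parts: bytes) -> bytes:
        # illustrative; the real write enablers use tagged SHA-256d hashes
        return hashlib.sha256(b"".join(parts)).digest()

    def client_fixup_values(WEM: bytes, nodeA: bytes, nodeB: bytes):
        old_we = H(WEM, nodeA)       # WE(nodeA), computable by the client
        proof = H(nodeB, old_we)     # proves knowledge of WE(nodeA) to nodeB
        new_we = H(WEM, nodeB)       # WE(nodeB), sent only to nodeB
        return proof, new_we

    def server_accept_fixup(my_nodeid: bytes, stored_we: bytes,
                            proof: bytes, new_we: bytes) -> bytes:
        # the server recomputes the proof from its own copy of WE(nodeA)
        if H(my_nodeid, stored_we) != proof:
            raise ValueError("bad write-enabler fixup proof")
        return new_we                # replaces WE(nodeA) in the container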

Migrating the leases will require a similar protocol. This protocol will be
defined concretely at a later date.

Code Details
------------

The MutableFileNode class is used to manipulate mutable files (as opposed to
ImmutableFileNodes). These are initially generated with
client.create_mutable_file(), and later recreated from URIs with
client.create_node_from_uri(). Instances of this class will contain a URI and
a reference to the client (for peer selection and connection).

NOTE: this section is out of date. Please see src/allmydata/interfaces.py
(the section on IMutableFilesystemNode) for more accurate information.

The methods of MutableFileNode are:

* download_to_data() -> [deferred] newdata, NotEnoughSharesError

  * if there are multiple retrievable versions in the grid, this returns
    the first version it can reconstruct, and silently ignores the others.
    In the future, a more advanced API will signal and provide access to
    the multiple heads.

* update(newdata) -> OK, UncoordinatedWriteError, NotEnoughSharesError
* overwrite(newdata) -> OK, UncoordinatedWriteError, NotEnoughSharesError

download_to_data() causes a new retrieval to occur, pulling the current
contents from the grid and returning them to the caller. At the same time,
this call caches information about the current version of the file. This
information will be used in a subsequent call to update(), and if another
change has occurred between the two, this information will be out of date,
triggering the UncoordinatedWriteError.

update() is therefore intended to be used just after a download_to_data(), in
the following pattern::

    d = mfn.download_to_data()
    d.addCallback(apply_delta)
    d.addCallback(mfn.update)

If the update() call raises UCW, then the application can simply return an
error to the user ("you violated the Prime Coordination Directive"), and they
can try again later. Alternatively, the application can attempt to retry on
its own. To accomplish this, the app needs to pause, download the new
(post-collision and post-recovery) form of the file, reapply their delta,
then submit the update request again. A randomized pause is necessary to
reduce the chances of colliding a second time with another client that is
doing exactly the same thing::

    d = mfn.download_to_data()
    d.addCallback(apply_delta)
    d.addCallback(mfn.update)
    def _retry(f):
        f.trap(UncoordinatedWriteError)
        d1 = pause(random.uniform(5, 20))
        d1.addCallback(lambda res: mfn.download_to_data())
        d1.addCallback(apply_delta)
        d1.addCallback(mfn.update)
        return d1
    d.addErrback(_retry)

Enthusiastic applications can retry multiple times, using a randomized
exponential backoff between each. A particularly enthusiastic application can
retry forever, but such apps are encouraged to provide a means to the user of
giving up after a while.

UCW does not mean that the update was not applied, so it is also a good idea
to skip the retry-update step if the delta was already applied::

    d = mfn.download_to_data()
    d.addCallback(apply_delta)
    d.addCallback(mfn.update)
    def _retry(f):
        f.trap(UncoordinatedWriteError)
        d1 = pause(random.uniform(5, 20))
        d1.addCallback(lambda res: mfn.download_to_data())
        def _maybe_apply_delta(contents):
            new_contents = apply_delta(contents)
            if new_contents != contents:
                return mfn.update(new_contents)
        d1.addCallback(_maybe_apply_delta)
        return d1
    d.addErrback(_retry)

update() is the right interface to use for delta-application situations, like
directory nodes (in which apply_delta might be adding or removing child
entries from a serialized table).

Note that any uncoordinated write has the potential to lose data. We must do
more analysis to be sure, but it appears that two clients who write to the
same mutable file at the same time (even if both eventually retry) will, with
high probability, result in one client observing UCW and the other silently
losing their changes. It is also possible for both clients to observe UCW.
The moral of the story is that the Prime Coordination Directive is there for
a reason, and that recovery/UCW/retry is not a substitute for write
coordination.

overwrite() tells the client to ignore this cached version information, and
to unconditionally replace the mutable file's contents with the new data.
This should not be used in delta application, but rather in situations where
you want to replace the file's contents with completely unrelated ones. When
raw files are uploaded into a mutable slot through the Tahoe-LAFS web-API
(using POST and the ?mutable=true argument), they are put in place with
overwrite().

The peer-selection and data-structure manipulation (and signing/verification)
steps will be implemented in a separate class in allmydata/mutable.py .

SDMF Slot Format
----------------

This SDMF data lives inside a server-side MutableSlot container. The server
is oblivious to this format.

This data is tightly packed. In particular, the share data is defined to run
all the way to the beginning of the encrypted private key (the encprivkey
offset is used both to terminate the share data and to begin the encprivkey).

::

    #    offset    size     name
    1    0         1        version byte, \x00 for this format
    2    1         8        sequence number. 2^64-1 must be handled specially, TBD
    3    9         32       "R" (root of share hash Merkle tree)
    4    41        16       IV (share data is AES(H(readkey+IV)) )
    5    57        18       encoding parameters:
          57        1         k
          58        1         N
          59        8         segment size
          67        8         data length (of original plaintext)
    6    75        32       offset table:
          75        4         (8) signature
          79        4         (9) share hash chain
          83        4         (10) block hash tree
          87        4         (11) share data
          91        8         (12) encrypted private key
          99        8         (13) EOF
    7    107       436ish   verification key (2048 RSA key)
    8    543ish    256ish   signature=RSAsign(sigkey, H(version+seqnum+r+IV+encparm))
    9    799ish    (a)      share hash chain, encoded as:
                             "".join([pack(">H32s", shnum, hash)
                                      for (shnum,hash) in needed_hashes])
    10   (927ish)  (b)      block hash tree, encoded as:
                             "".join([pack(">32s",hash) for hash in block_hash_tree])
    11   (935ish)  LEN      share data (no gap between this and encprivkey)
    12   ??        1216ish  encrypted private key= AESenc(write-key, RSA-key)
    13   ??        --       EOF

(a) The share hash chain contains ceil(log2(N)) hashes, each 32 bytes long.
    This is the set of hashes necessary to validate this share's leaf in the
    share Merkle tree. For N=10, this is 4 hashes, i.e. 128 bytes.
(b) The block hash tree contains ceil(length/segsize) hashes, each 32 bytes
    long. This is the set of hashes necessary to validate any given block of
    share data up to the per-share root "r". Each "r" is a leaf of the share
    hash tree (with root "R"), from which a minimal subset of hashes is put
    in the share hash chain in (8).

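A sketch of parsing the fixed-size prefix and offset table with ``struct``,
under the layout above (the field names are ours, not the implementation's);
big-endian encoding matches the ``pack(">...")`` examples in the table::

    import struct

    PREFIX = ">B Q 32s 16s B B Q Q"   # version, seqnum, R, IV, k, N,
                                      # segment size, data length (75 bytes)
    OFFSETS = ">L L L L Q Q"          # offsets of fields (8)-(13) (32 bytes)

    def parse_sdmf_prefix(share: bytes):
        (version, seqnum, root_hash, iv,
         k, N, segsize, datalen) = struct.unpack(PREFIX, share[:75])
        assert version == 0
        (o_signature, o_hash_chain, o_block_hashes, o_share_data,
         o_enc_privkey, o_eof) = struct.unpack(OFFSETS, share[75:107])
        return dict(seqnum=seqnum, R=root_hash, IV=iv, k=k, N=N,
                    segsize=segsize, datalen=datalen,
                    offsets=(o_signature, o_hash_chain, o_block_hashes,
                             o_share_data, o_enc_privkey, o_eof))
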
Recovery
--------

The first line of defense against damage caused by colliding writes is the
Prime Coordination Directive: "Don't Do That".

The second line of defense is to keep "S" (the number of competing versions)
lower than N/k. If this holds true, at least one competing version will have
k shares and thus be recoverable. Note that server unavailability counts
against us here: the old version stored on the unavailable server must be
included in the value of S.

The third line of defense is our use of testv_and_writev() (described above),
which increases the convergence of simultaneous writes: one of the writers
will be favored (the one with the highest "R"), and that version is more
likely to be accepted than the others. This defense is least effective in the
pathological situation where S simultaneous writers are active, the one with
the lowest "R" writes to N-k+1 of the shares and then dies, then the one with
the next-lowest "R" writes to N-2k+1 of the shares and dies, etc, until the
one with the highest "R" writes to k-1 shares and dies. Any other sequencing
will allow the highest "R" to write to at least k shares and establish a new
revision.

The fourth line of defense is the fact that each client keeps writing until
at least one version has N shares. This uses additional servers, if
necessary, to make sure that either the client's version or some
newer/overriding version is highly available.

The fifth line of defense is the recovery algorithm, which seeks to make sure
that at least *one* version is highly available, even if that version is
somebody else's.

The write-shares-to-peers algorithm is as follows:

* permute peers according to storage index
* walk through peers, trying to assign one share per peer
* for each peer:

  * send testv_and_writev, using "old(seqnum+R) <= our(seqnum+R)" as the test

    * this means that we will overwrite any old versions, and we will
      overwrite simultaneous writers of the same version if our R is higher.
      We will not overwrite writers using a higher seqnum.

  * record the version that each share winds up with. If the write was
    accepted, this is our own version. If it was rejected, read the
    old_test_data to find out what version was retained.
  * if old_test_data indicates the seqnum was equal or greater than our
    own, mark the "Simultaneous Writes Detected" flag, which will eventually
    result in an error being reported to the writer (in their close() call).
  * build a histogram of "R" values
  * repeat until the histogram indicates that some version (possibly ours)
    has N shares. Use new servers if necessary.

* If we run out of servers:

  * if there are at least shares-of-happiness of any one version, we're
    happy, so return. (the close() might still get an error)
  * not happy, need to reinforce something, goto RECOVERY

Recovery:

* read all shares, count the versions, identify the recoverable ones,
  discard the unrecoverable ones.
* sort versions: locate max(seqnums), put all versions with that seqnum
  in the list, sort by number of outstanding shares. Then put our own
  version. (TODO: put versions with seqnum <max but >us ahead of us?).
* for each version:

  * attempt to recover that version
  * if not possible, remove it from the list, go to next one
  * if recovered, start at beginning of peer list, push that version,
    continue until N shares are placed
  * if pushing our own version, bump up the seqnum to one higher than
    the max seqnum we saw

* if we run out of servers:

  * schedule retry and exponential backoff to repeat RECOVERY

  * admit defeat after some period? presumably the client will be shut down
    eventually, maybe keep trying (once per hour?) until then.


Medium Distributed Mutable Files
================================

These are just like the SDMF case, but:

* We actually take advantage of the Merkle hash tree over the blocks, by
  reading a single segment of data at a time (and its necessary hashes), to
  improve the read-time alacrity.
* We allow arbitrary writes to any range of the file.
* We add more code to first read each segment that a write must modify.
  This looks exactly like the way a normal filesystem uses a block device,
  or how a CPU must perform a cache-line fill before modifying a single word.
* We might implement some sort of copy-based atomic update server call,
  to allow multiple writev() calls to appear atomic to any readers.

MDMF slots provide fairly efficient in-place edits of very large files (a few
GB). Appending data is also fairly efficient.


Large Distributed Mutable Files
===============================

LDMF slots (not implemented) would use a fundamentally different way to store
the file, inspired by Mercurial's "revlog" format. This would enable very
efficient insert/remove/replace editing of arbitrary spans. Multiple versions
of the file can be retained, in a revision graph that can have multiple heads.
Each revision can be referenced by a cryptographic identifier. There are two
forms of the URI, one that means "most recent version", and a longer one that
points to a specific revision.

Metadata can be attached to the revisions, like timestamps, to enable rolling
back an entire tree to a specific point in history.

LDMF1 provides deltas but tries to avoid dealing with multiple heads. LDMF2
provides explicit support for revision identifiers and branching.


TODO
====

improve allocate-and-write or get-writer-buckets API to allow one-call (or
maybe two-call) updates. The challenge is in figuring out which shares are on
which machines. First cut will have lots of round trips.

(eventually) define behavior when seqnum wraps. At the very least make sure
it can't cause a security problem. "the slot is worn out" is acceptable.

(eventually) define share-migration lease update protocol. Including the
nodeid who accepted the lease is useful, we can use the same protocol as we
do for updating the write enabler. However we need to know which lease to
update... maybe send back a list of all old nodeids that we find, then try
all of them when we accept the update?

We now do this in a specially-formatted IndexError exception::

    "UNABLE to renew non-existent lease. I have leases accepted by
    nodeids: '12345','abcde','44221' ."

confirm that a repairer can regenerate shares without the private key. Hmm,
without the write-enabler they won't be able to write those shares to the
servers... although they could add immutable new shares to new servers.
---|