.. -*- coding: utf-8-with-signature -*-

==========================
Tahoe-LAFS Directory Nodes
==========================

As explained in the architecture docs, Tahoe-LAFS can be roughly viewed as
a collection of three layers. The lowest layer is the key-value store: it
provides operations that accept files and upload them to the grid, creating
a URI in the process which securely references the file's contents.
The middle layer is the file store, creating a structure of directories and
filenames resembling the traditional Unix or Windows filesystems. The top
layer is the application layer, which uses the lower layers to provide useful
services to users, like a backup application, or a way to share files with
friends.

This document examines the middle layer, the "file store".

1. `Key-value Store Primitives`_
2. `File Store Goals`_
3. `Dirnode Goals`_
4. `Dirnode secret values`_
5. `Dirnode storage format`_
6. `Dirnode sizes, mutable-file initial read sizes`_
7. `Design Goals, redux`_

   1. `Confidentiality leaks in the storage servers`_
   2. `Integrity failures in the storage servers`_
   3. `Improving the efficiency of dirnodes`_
   4. `Dirnode expiration and leases`_

8. `Starting Points: root dirnodes`_
9. `Mounting and Sharing Directories`_
10. `Revocation`_

Key-value Store Primitives
==========================

In the lowest layer (key-value store), there are two operations that reference
immutable data (which we refer to as "CHK URIs" or "CHK read-capabilities" or
"CHK read-caps"). One puts data into the grid (but only if it doesn't exist
already), the other retrieves it::

 chk_uri = put(data)
 data = get(chk_uri)

We also have three operations which reference mutable data (which we refer to
as "mutable slots", or "mutable write-caps and read-caps", or sometimes "SSK
slots"). One creates a slot with some initial contents, a second replaces the
contents of a pre-existing slot, and the third retrieves the contents::

 mutable_uri = create(initial_data)
 replace(mutable_uri, new_data)
 data = get(mutable_uri)
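
To make the semantics concrete, here is a minimal usage sketch, assuming the
pseudo-functions above are available as Python callables (they are shorthand
for the key-value-store API, not literal Tahoe-LAFS function names)::

 # Immutable data: the same bytes always map to the same CHK read-cap.
 chk_uri = put(b"some file contents")
 assert get(chk_uri) == b"some file contents"

 # Mutable data: the cap names a slot whose identity survives replacement.
 mutable_uri = create(b"version 1")
 replace(mutable_uri, b"version 2")
 assert get(mutable_uri) == b"version 2"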

File Store Goals
================

The main goal for the middle (file store) layer is to give users a way to
organize the data that they have uploaded into the grid. The traditional way
to do this in computer filesystems is to put this data into files, give those
files names, and collect these names into directories.

Each directory is a set of name-entry pairs, each of which maps a "child name"
to a directory entry pointing to an object of some kind. Those child objects
might be files, or they might be other directories. Each directory entry also
contains metadata.
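
As a purely conceptual illustration (the real serialized form is described
under `Dirnode storage format`_, and the field names here are ours, not
Tahoe-LAFS internals), a directory can be pictured as a mapping from child
name to a capability plus metadata::

 directory = {
     "notes.txt": {                  # child name
         "cap": "URI:CHK:...",       # points at an immutable file
         "metadata": {"mtime": 1700000000.0},
     },
     "photos": {                     # a child that is itself a directory
         "cap": "URI:DIR2-RO:...",
         "metadata": {},
     },
 }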

The directory structure is therefore a directed graph of nodes, in which each
node might be a directory node or a file node. All file nodes are terminal
nodes.

Dirnode Goals
=============

What properties might be desirable for these directory nodes? In no
particular order:

1. functional. Code which does not work doesn't count.
2. easy to document, explain, and understand
3. confidential: it should not be possible for others to see the contents of
   a directory
4. integrity: it should not be possible for others to modify the contents
   of a directory
5. available: directories should survive host failure, just like files do
6. efficient: in storage, communication bandwidth, number of round-trips
7. easy to delegate individual directories in a flexible way
8. updateness: everybody looking at a directory should see the same contents
9. monotonicity: everybody looking at a directory should see the same
   sequence of updates

Some of these goals are mutually exclusive. For example, availability and
consistency are opposing, so it is not possible to achieve #5 and #8 at the
same time. Moreover, it takes a more complex architecture to get close to the
available-and-consistent ideal, so #2/#6 is in opposition to #5/#8.

Tahoe-LAFS v0.7.0 introduced distributed mutable files, which use public-key
cryptography for integrity, and erasure coding for availability. These
achieve roughly the same properties as immutable CHK files, but their
contents can be replaced without changing their identity. Dirnodes are then
just a special way of interpreting the contents of a specific mutable file.
Earlier releases used a "vdrive server": this server was abolished in the
v0.7.0 release.

For details of how mutable files work, please see :doc:`mutable`.

For releases since v0.7.0, we achieve most of our desired properties. The
integrity and availability of dirnodes is equivalent to that of regular
(immutable) files, with the exception that there are more simultaneous-update
failure modes for mutable slots. Delegation is quite strong: you can give
read-write or read-only access to any subtree, and the data format used for
dirnodes is such that read-only access is transitive: i.e. if you grant Bob
read-only access to a parent directory, then Bob will get read-only access
(and *not* read-write access) to its children.

Relative to the previous "vdrive server"-based scheme, the current
distributed dirnode approach gives better availability, but cannot guarantee
updateness quite as well, and requires far more network traffic for each
retrieval and update. Mutable files are somewhat less available than
immutable files, simply because of the increased number of combinations
(shares of an immutable file are either present or not, whereas there are
multiple versions of each mutable file, and you might have some shares of
version 1 and other shares of version 2). In extreme cases of simultaneous
update, mutable files might suffer from non-monotonicity.


Dirnode secret values
=====================

As mentioned before, dirnodes are simply a special way to interpret the
contents of a mutable file, so the secret keys and capability strings
described in :doc:`mutable` are all the same. Each dirnode contains an RSA
public/private keypair, and the holder of the "write capability" will be able
to retrieve the private key (as well as the AES encryption key used for the
data itself). The holder of the "read capability" will be able to obtain the
public key and the AES data key, but not the RSA private key needed to modify
the data.

The "write capability" for a dirnode grants read-write access to its
contents. This is expressed in concrete form as the "dirnode write cap": a
printable string which contains the necessary secrets to grant this access.
Likewise, the "read capability" grants read-only access to a dirnode, and can
be represented by a "dirnode read cap" string.

For example,
URI:DIR2:swdi8ge1s7qko45d3ckkyw1aac:ar8r5j99a4mezdojejmsfp4fj1zeky9gjigyrid4urxdimego68o
is a write-capability URI, while
URI:DIR2-RO:buxjqykt637u61nnmjg7s8zkny:ar8r5j99a4mezdojejmsfp4fj1zeky9gjigyrid4urxdimego68o
is a read-capability URI, both for the same dirnode.
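
As a small illustration (the helper below is hypothetical, not part of
Tahoe-LAFS), the access level granted by such a string can be recognized from
its prefix::

 def classify_dirnode_cap(cap: str) -> str:
     """Illustrative only: tell dirnode write-caps from read-caps."""
     if cap.startswith("URI:DIR2-RO:"):
         return "read-only"
     if cap.startswith("URI:DIR2:"):
         return "read-write"
     raise ValueError("not a recognized dirnode cap")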


Dirnode storage format
======================

Each dirnode is stored in a single mutable file, distributed in the Tahoe-LAFS
grid. The contents of this file are a serialized list of netstrings, one per
child. Each child is a list of four netstrings: (name, rocap, rwcap,
metadata). (Remember that the contents of the mutable file are encrypted by
the read-cap, so this section describes the plaintext contents of the mutable
file, *after* it has been decrypted by the read-cap.)

The name is simply a UTF-8-encoded child name. The 'rocap' is a read-only
capability URI to that child, either an immutable (CHK) file, a mutable file,
or a directory. It is also possible to store 'unknown' URIs that are not
recognized by the current version of Tahoe-LAFS. The 'rwcap' is a read-write
capability URI for that child, encrypted with the dirnode's write-cap: this
enables the "transitive readonlyness" property, described further below. The
'metadata' is a JSON-encoded dictionary of type,value metadata pairs. Some
metadata keys are pre-defined, the rest are left up to the application.
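
The framing itself is ordinary netstring encoding. A minimal sketch follows
(Python; the helper names are ours, and the metadata and encrypted rwcap are
assumed to have been prepared elsewhere)::

 def netstring(data: bytes) -> bytes:
     """Standard netstring framing: netstring(b"abc") == b"3:abc,"."""
     return b"%d:%s," % (len(data), data)

 def pack_child(name: str, rocap: bytes, enc_rwcap: bytes,
                metadata_json: bytes) -> bytes:
     """Illustrative only: one child entry is four netstrings, and the
     dirnode body is the concatenation of one such entry per child."""
     return netstring(
         netstring(name.encode("utf-8")) +
         netstring(rocap) +
         netstring(enc_rwcap) +
         netstring(metadata_json))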

Each rwcap is stored as IV + ciphertext + MAC. The IV is a 16-byte random
value. The ciphertext is obtained by using AES in CTR mode on the rwcap URI
string, using a key that is formed from a tagged hash of the IV and the
dirnode's writekey. The MAC is written only for compatibility with older
Tahoe-LAFS versions and is no longer verified.
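
A rough sketch of that construction follows (Python, using the 'cryptography'
package; the tag string, hash choice, and the handling of the legacy MAC
field are illustrative assumptions, not the exact Tahoe-LAFS definitions)::

 import os
 from hashlib import sha256
 from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

 def encrypt_rwcap(rwcap: bytes, writekey: bytes) -> bytes:
     iv = os.urandom(16)            # fresh 16-byte IV for each entry
     # AES key derived from a tagged hash of the IV and the writekey;
     # the tag string here is made up for illustration.
     key = sha256(b"illustrative-rwcap-tag:" + iv + writekey).digest()
     encryptor = Cipher(algorithms.AES(key), modes.CTR(iv)).encryptor()
     ciphertext = encryptor.update(rwcap) + encryptor.finalize()
     # Stand-in for the 32-byte legacy MAC field, which is written only
     # for compatibility and no longer verified (construction omitted).
     mac = b"\x00" * 32
     return iv + ciphertext + mac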

If Bob has read-only access to the 'bar' directory, and he adds it as a child
to the 'foo' directory, then he will put the read-only cap for 'bar' in both
the rwcap and rocap slots (encrypting the rwcap contents as described above).
If he has full read-write access to 'bar', then he will put the read-write
cap in the 'rwcap' slot, and the read-only cap in the 'rocap' slot. Since
other users who have read-only access to 'foo' will be unable to decrypt its
rwcap slot, this limits those users to read-only access to 'bar' as well,
thus providing the transitive readonlyness that we desire.

Dirnode sizes, mutable-file initial read sizes
==============================================

How big are dirnodes? When reading dirnode data out of mutable files, how
large should our initial read be? If we guess exactly, we can read a dirnode
in a single round-trip, and update one in two RTT. If we guess too high,
we'll waste some amount of bandwidth. If we guess low, we need to make a
second pass to get the data (or the encrypted privkey, for writes), which
will cost us at least another RTT.

Assuming child names are between 10 and 99 characters long, how long are the
various pieces of a dirnode?

::

 netstring(name) ~= 4+len(name)
 chk-cap = 97 (for 4-char filesizes)
 dir-rw-cap = 88
 dir-ro-cap = 91
 netstring(cap) = 4+len(cap)
 encrypted(cap) = 16+cap+32
 JSON({}) = 2
 JSON({ctime=float,mtime=float,'tahoe':{linkcrtime=float,linkmotime=float}}): 137
 netstring(metadata) = 4+137 = 141

so a CHK entry is::

 5+ 4+len(name) + 4+97 + 5+16+97+32 + 4+137

And a 15-byte filename gives a 416-byte entry. When the entry points at a
subdirectory instead of a file, the entry is a little bit smaller. So an
empty directory uses 0 bytes, a directory with one child uses about 416
bytes, a directory with two children uses about 832, etc.
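
Plugging a 15-character name into that formula (a quick sanity check, not
production code)::

 name_len = 15
 entry = 5 + (4 + name_len) + (4 + 97) + (5 + 16 + 97 + 32) + (4 + 137)
 assert entry == 416   # matches the 416-byte figure quoted above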

When the dirnode data is encoded using our default 3-of-10, that means we
get 139ish bytes of data in each share per child.

The pubkey, signature, and hashes form the first 935ish bytes of the
container, then comes our data, then about 1216 bytes of encprivkey. So if we
read the first::

 1kB: we get 65 bytes of dirnode data: only empty directories
 2kB: 1065 bytes: about 8 entries
 3kB: 2065 bytes: about 15 entries, or 6 entries plus the encprivkey
 4kB: 3065 bytes: about 22 entries, or about 13 plus the encprivkey

So we've written the code to do an initial read of 4kB from each share when
we read the mutable file, which should give good performance (one RTT) for
small directories.


Design Goals, redux
===================

How well does this design meet the goals?

1. functional: YES: the code works and has extensive unit tests
2. documentable: YES: this document is the existence proof
3. confidential: YES: see below
4. integrity: MOSTLY: a coalition of storage servers can rollback individual
   mutable files, but not a single one. No server can substitute fake data
   as genuine.
5. availability: YES: as long as 'k' storage servers are present and have
   the same version of the mutable file, the dirnode will be available.
6. efficient: MOSTLY:

   network: single dirnode lookup is very efficient, since clients can
   fetch specific keys rather than being required to get or set the entire
   dirnode each time. Traversing many directories takes a lot of
   roundtrips, and these can't be collapsed with promise-pipelining because
   the intermediate values must only be visible to the client. Modifying
   many dirnodes at once (e.g. importing a large pre-existing directory
   tree) is pretty slow, since each graph edge must be created
   independently.

   storage: each child has a separate IV, which makes them larger than if
   all children were aggregated into a single encrypted string

7. delegation: VERY: each dirnode is a completely independent object, to
   which clients can be granted separate read-write or read-only access
8. updateness: VERY: with only a single point of access, and no caching,
   each client operation starts by fetching the current value, so there are
   no opportunities for staleness
9. monotonicity: VERY: the single point of access also protects against
   retrograde motion


Confidentiality leaks in the storage servers
--------------------------------------------

Dirnodes (and the mutable files upon which they are based) are very private
against other clients: traffic between the client and the storage servers is
protected by the Foolscap SSL connection, so they can observe very little.
Storage index values are hashes of secrets and thus unguessable, and they are
not made public, so other clients cannot snoop through encrypted dirnodes
that they have not been told about.

Storage servers can observe access patterns and see ciphertext, but they
cannot see the plaintext (of child names, metadata, or URIs). If an attacker
operates a significant number of storage servers, they can infer the shape of
the directory structure by assuming that directories are usually accessed
from root to leaf in rapid succession. Since filenames are usually much
shorter than read-caps and write-caps, the attacker can use the length of the
ciphertext to guess the number of children of each node, and might be able to
guess the length of the child names (or at least their sum). From this, the
attacker may be able to build up a graph with the same shape as the plaintext
file store, but with unlabeled edges and unknown file contents.


Integrity failures in the storage servers
-----------------------------------------

The mutable file's integrity mechanism (RSA signature on the hash of the file
contents) prevents the storage server from modifying the dirnode's contents
without detection. Therefore the storage servers can make the dirnode
unavailable, but not corrupt it.

A sufficient number of colluding storage servers can perform a rollback
attack: replace all shares of the whole mutable file with an earlier version.
To prevent this, when retrieving the contents of a mutable file, the
client queries more servers than necessary and uses the highest available
version number. This ensures that one or two misbehaving storage servers
cannot cause this rollback on their own.
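
The version-selection step is roughly of this shape (illustrative Python, not
the actual retrieval code; 'responses' maps each queried server to the
sequence number it reported for its share)::

 def pick_latest(responses: dict) -> dict:
     """Keep only the servers reporting the highest sequence number."""
     newest = max(responses.values())
     return {server: seqnum for server, seqnum in responses.items()
             if seqnum == newest}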


Improving the efficiency of dirnodes
------------------------------------

The current mutable-file-based dirnode scheme suffers from certain
inefficiencies. A very large directory (with thousands or millions of
children) will take a significant time to extract any single entry, because
the whole file must be downloaded first, then parsed and searched to find the
desired child entry. Likewise, modifying a single child will require the
whole file to be re-uploaded.

The current design assumes (and in some cases, requires) that dirnodes remain
small. The mutable files on which dirnodes are based are currently using
"SDMF" ("Small Distributed Mutable File") design rules, which state that the
size of the data shall remain below one megabyte. More advanced forms of
mutable files (MDMF and LDMF) are in the design phase to allow efficient
manipulation of larger mutable files. This would reduce the work needed to
modify a single entry in a large directory.

Judicious caching may help improve the reading-large-directory case. Some
form of mutable index at the beginning of the dirnode might help as well. The
MDMF design rules allow for efficient random-access reads from the middle of
the file, which would give the index something useful to point at.

The current SDMF design generates a new RSA public/private keypair for each
directory. This takes some time and CPU effort (around 100 milliseconds on a
relatively high-end 2021 laptop) per directory. We have designed (but not yet
built) a DSA-based mutable file scheme which will use shared parameters to
reduce the directory-creation effort to a bare minimum (picking a random
number instead of generating two random primes).

When a backup program is run for the first time, it needs to copy a large
amount of data from a pre-existing local filesystem into reliable storage.
This means that a large and complex directory structure needs to be
duplicated in the dirnode layer. With the one-object-per-dirnode approach
described here, this requires as many operations as there are edges in the
imported filesystem graph.

Another approach would be to aggregate multiple directories into a single
storage object. This object would contain a serialized graph rather than a
single name-to-child dictionary. Most directory operations would fetch the
whole block of data (and presumably cache it for a while to avoid lots of
re-fetches), and modification operations would need to replace the whole
thing at once. This "realm" approach would have the added benefit of
combining more data into a single encrypted bundle (perhaps hiding the shape
of the graph from a determined attacker), and would reduce round-trips when
performing deep directory traversals (assuming the realm was already cached).
It would also prevent fine-grained rollback attacks from working: a coalition
of storage servers could change the entire realm to look like an earlier
state, but it could not independently roll back individual directories.

The drawbacks of this aggregation would be that small accesses (adding a
single child, looking up a single child) would require pulling or pushing a
lot of unrelated data, increasing network overhead (and necessitating
test-and-set semantics for the modification side, which increases the chances
that a user operation will fail, making it more challenging to provide
promises of atomicity to the user).

It would also make it much more difficult to enable the delegation
("sharing") of specific directories. Since each aggregate "realm" provides
all-or-nothing access control, the act of delegating any directory from the
middle of the realm would require that the realm first be split into the
upper piece that isn't being shared and the lower piece that is. This
splitting would have to be done in response to what is essentially a read
operation, which is not traditionally supposed to be a high-effort action. On
the other hand, it may be possible to aggregate the ciphertext, but use
distinct encryption keys for each component directory, to get the benefits of
both schemes at once.


Dirnode expiration and leases
-----------------------------

Dirnodes are created any time a client wishes to add a new directory. How
long do they live? What's to keep them from sticking around forever, taking
up space that nobody can reach any longer?

Mutable files are created with limited-time "leases", which keep the shares
alive until the last lease has expired or been cancelled. Clients which know
and care about specific dirnodes can ask to keep them alive for a while, by
renewing a lease on them (with a typical period of one month). Clients are
expected to assist in the deletion of dirnodes by canceling their leases as
soon as they are done with them. This means that when a client unlinks a
directory, it should also cancel its lease on that directory. When the lease
count on a given share goes to zero, the storage server can delete the
related storage. Multiple clients may all have leases on the same dirnode:
the server may delete the shares only after all of the leases have gone away.

We expect that clients will periodically create a "manifest": a list of
so-called "refresh capabilities" for all of the dirnodes and files that they
can reach. They will give this manifest to the "repairer", which is a service
that keeps files (and dirnodes) alive on behalf of clients who cannot take on
this responsibility for themselves. These refresh capabilities include the
storage index, but do *not* include the readkeys or writekeys, so the
repairer does not get to read the files or directories that it is helping to
keep alive.

After each change to the user's file store, the client creates a manifest and
looks for differences from their previous version. Anything which was removed
prompts the client to send out lease-cancellation messages, allowing the data
to be deleted.
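
A rough sketch of that bookkeeping (illustrative Python; here a manifest is
modelled as a plain set of refresh-capability strings)::

 def leases_to_cancel(previous_manifest: set, current_manifest: set) -> set:
     """Anything reachable before, but not reachable now, is unneeded."""
     return previous_manifest - current_manifest

 # example: one directory was unlinked between the two manifests
 old = {"refresh-cap-1", "refresh-cap-2", "refresh-cap-3"}
 new = {"refresh-cap-1", "refresh-cap-3"}
 assert leases_to_cancel(old, new) == {"refresh-cap-2"}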


Starting Points: root dirnodes
==============================

Any client can record the URI of a directory node in some external form (say,
in a local file) and use it as the starting point of later traversal. Each
Tahoe-LAFS user is expected to create a new (unattached) dirnode when they
first start using the grid, and record its URI for later use.

Mounting and Sharing Directories
================================

The biggest benefit of this dirnode approach is that sharing individual
directories is almost trivial. Alice creates a subdirectory that she wants
to use to share files with Bob. This subdirectory is attached to Alice's
file store at "alice:shared-with-bob". She asks her file store for the
read-only directory URI for that new directory, and emails it to Bob. When
Bob receives the URI, he attaches the given URI into one of his own
directories, perhaps at a place named "bob:shared-with-alice". Every time
Alice writes a file into this directory, Bob will be able to read it.
(It is also possible to share read-write URIs between users, but that makes
it difficult to follow the `Prime Coordination Directive`_.) Neither
Alice nor Bob will get access to any files above the mounted directory:
there are no 'parent directory' pointers. If Alice creates a nested set of
directories, "alice:shared-with-bob/subdir2", and gives a read-only URI to
shared-with-bob to Bob, then Bob will be unable to write to either
shared-with-bob/ or subdir2/.

.. _`Prime Coordination Directive`: ../write_coordination.rst

A suitable UI needs to be created to allow users to easily perform this
sharing action: dragging a folder from their file store to an IM or email
user icon, for example. The UI will need to give the sending user an
opportunity to indicate whether they want to grant read-write or read-only
access to the recipient. The recipient then needs an interface to drag the
new folder into their file store and give it a home.

Revocation
==========

When Alice decides that she no longer wants Bob to be able to access the
shared directory, what should she do? Suppose she's shared this folder with
both Bob and Carol, and now she wants Carol to retain access to it but Bob to
be shut out. Ideally Carol should not have to do anything: her access should
continue unabated.

The current plan is to have her client create a deep copy of the folder in
question, delegate access to the new folder to the remaining members of the
group (Carol), and ask the lucky survivors to replace their old reference
with the new one. Bob may still have access to the old folder, but he is now
the only one who cares: everyone else has moved on, and he will no longer be
able to see their new changes. In a strict sense, this is the strongest form
of revocation that can be accomplished: there is no point trying to force Bob
to forget about the files that he read a moment before being kicked out. In
addition it must be noted that anyone who can access the directory can proxy
for Bob, reading files to him and accepting changes whenever he wants.
Preventing delegation between communicating parties is just as pointless as
asking Bob to forget previously accessed files. However, there may be value
in configuring the UI to ask Carol not to share files with Bob, or in
removing all files from Bob's view at the same time his access is revoked.
---|