.. -*- coding: utf-8-with-signature -*-

=====================
Lease database design
=====================

The target audience for this document is developers who wish to understand
the new lease database (leasedb) planned to be added in Tahoe-LAFS v1.11.0.


Introduction
------------

14 | A "lease" is a request by an account that a share not be deleted before a |
---|
15 | specified time. Each storage server stores leases in order to know which |
---|
16 | shares to spare from garbage collection. |
---|
17 | |
---|
18 | Motivation |
---|
19 | ---------- |
---|
20 | |
---|
21 | The leasedb will replace the current design in which leases are stored in |
---|
22 | the storage server's share container files. That design has several |
---|
23 | disadvantages: |
---|
24 | |
---|
- Updating a lease requires modifying a share container file (even for
  immutable shares). This complicates the implementation of share classes.
  The mixing of share contents and lease data in share files also led to a
  security bug (ticket `#1528`_).

- When only the disk backend is supported, leases can be read and updated
  synchronously because the share files are local to the storage server.
  For the cloud backend, accessing share files requires an HTTP request,
  and so must be asynchronous. Accepting this asynchrony for lease queries
  would be both inefficient and complex. Moving lease information out of
  shares and into a local database allows lease queries to stay
  synchronous.

Also, the current cryptographic protocol for renewing and cancelling leases
(based on shared secrets derived from secure hash functions) is complex,
and the cancellation part was never used.

The leasedb solves the first two problems by storing the lease information
in a local database instead of in the share container files. The share data
itself is still held in the share container file.

At the same time as implementing leasedb, we devised a simpler protocol for
allocating and cancelling leases: a client can use a public key digital
signature to authenticate access to a foolscap object representing the
authority of an account. This protocol is not yet implemented; at the time
of writing, only an "anonymous" account is supported.

The leasedb also provides an efficient way to get summarized information,
such as total space usage of shares leased by an account, for accounting
purposes.

.. _`#1528`: https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1528
.. _`#1834`: https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1834


Design constraints
------------------

A share is stored as a collection of objects. The persistent storage may be
remote from the server (for example, cloud storage).

Writes to the persistent store objects are in general not atomic. So the
leasedb also keeps track of which shares are in an inconsistent state
because they have been partly written. (This may change in future when we
implement a protocol to improve atomicity of updates to mutable shares.)

Leases are no longer stored in shares. The same share format is used as
before, but the lease slots are ignored, and are cleared when rewriting a
mutable share. The new design also does not use lease renewal or cancel
secrets. (They are accepted as parameters in the storage protocol
interfaces for backward compatibility, but are ignored. Cancel secrets
were already ignored due to the fix for `#1528`_.)

The new design needs to be fail-safe in the sense that if the lease
database is lost or corruption is detected, no share data will be lost
(even though the metadata about leases held by particular accounts has
been lost).


Accounting crawler
------------------

A "crawler" is a long-running process that visits share container files at
a slow rate, so as not to overload the server by visiting them all at once.

The accounting crawler replaces the previous "lease crawler". It examines
each share container file and compares it with the state of the leasedb,
and may update the state of the share and/or the leasedb.

The accounting crawler may perform the following functions (but see ticket
`#1834`_ for a proposal to reduce the scope of its responsibility); a
sketch of the per-share reconciliation follows the list:

- Remove leases that are past their expiration time. (Currently, this is
  done automatically before deleting shares, but we plan to allow
  expiration to be performed separately for individual accounts in future.)

- Delete the objects containing unleased shares — that is, shares that have
  stable entries in the leasedb but no current leases (see below for the
  definition of "stable" entries).

- Discover shares that have been manually added to storage, via ``scp`` or
  some other out-of-band means.

- Discover shares that are present when a storage server is upgraded to
  a leasedb-supporting version from a previous version, and give them
  "starter leases".

- Recover from a situation where the leasedb is lost or detectably
  corrupted. This is handled in the same way as upgrading from a previous
  version.

- Detect shares that have unexpectedly disappeared from storage. The
  disappearance of a share is logged, and its entry and leases are removed
  from the leasedb.

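The following sketch shows what this per-share reconciliation might look
like. The ``leasedb`` object and all of its methods here are illustrative
assumptions, not the actual crawler or leasedb API::

  import logging

  def reconcile_share(leasedb, storage_index, shnum, found_in_storage, now):
      # Drop any leases that are past their expiration time.
      leasedb.remove_leases_expired_before(storage_index, shnum, now)
      entry = leasedb.get_share_entry(storage_index, shnum)
      if found_in_storage and entry is None:
          # A share added out-of-band (e.g. via scp), present from before
          # an upgrade, or orphaned by a lost or corrupted leasedb:
          # record it and give it a "starter lease".
          leasedb.add_share_entry(storage_index, shnum)
          leasedb.add_starter_lease(storage_index, shnum)
      elif not found_in_storage and entry is not None:
          # The share unexpectedly disappeared from storage: log the
          # loss, then remove its entry and leases from the leasedb.
          logging.info("share %s:%d disappeared", storage_index, shnum)
          leasedb.remove_share_and_leases(storage_index, shnum)
      elif entry is not None and not leasedb.has_current_leases(
              storage_index, shnum):
          # A stable entry with no current leases: initiate deletion of
          # the store objects containing the share.
          leasedb.mark_share_going(storage_index, shnum)

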
Accounts
--------

An account holds leases for some subset of shares stored by a server. The
leasedb schema can handle many distinct accounts, but for the time being we
create only two accounts: an anonymous account and a starter account. The
starter account is used for leases on shares discovered by the accounting
crawler; the anonymous account is used for all other leases.

The leasedb has at most one lease entry per account per (storage_index,
shnum) pair. This entry stores the times when the lease was last renewed
and when it is set to expire (if the expiration policy does not force it to
expire earlier), represented as Unix UTC-seconds-since-epoch timestamps.

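For illustration, such a lease table might be declared as in the following
sketch. The table and column names are assumptions made for this example,
not the actual leasedb schema::

  import sqlite3

  SCHEMA = """
  CREATE TABLE IF NOT EXISTS leases
  (
      storage_index   VARCHAR(26) NOT NULL,
      shnum           INTEGER NOT NULL,
      account_id      INTEGER NOT NULL,
      renewal_time    INTEGER NOT NULL, -- Unix UTC seconds since the epoch
      expiration_time INTEGER,          -- as above; NULL if unconstrained
      PRIMARY KEY (storage_index, shnum, account_id)
  );
  """

  def add_or_renew_lease(conn, storage_index, shnum, account_id,
                         renewal_time, expiration_time):
      # There is at most one lease entry per account per (storage_index,
      # shnum) pair, so renewal is an insert-or-replace of a single row.
      conn.execute("INSERT OR REPLACE INTO leases VALUES (?, ?, ?, ?, ?)",
                   (storage_index, shnum, account_id,
                    renewal_time, expiration_time))
      conn.commit()

  conn = sqlite3.connect(":memory:")
  conn.executescript(SCHEMA)
  add_or_renew_lease(conn, "si1", 0, account_id=1,
                     renewal_time=1300000000, expiration_time=1302678400)
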
For more on expiration policy, see :doc:`../garbage-collection`.


Share states
------------

The leasedb holds an explicit indicator of the state of each share.

The diagram and descriptions below give the possible values of the "state"
indicator, what that value means, and transitions between states, for any
(storage_index, shnum) pair on each server::

  #        STATE_STABLE -------.
  #          ^   |   ^ |       |
  #          |   v   | |       v
  #     STATE_COMING | |   STATE_GOING
  #          ^       | |       |
  #          |       | v       |
  #          '------ NONE <----'

**NONE**: There is no entry in the ``shares`` table for this
(storage_index, shnum) in this server's leasedb. This is the initial state.

**STATE_COMING**: The share is being created or (if a mutable share)
updated. The store objects may have been at least partially written, but
the storage server doesn't have confirmation that they have all been
completely written.

**STATE_STABLE**: The store objects have been completely written and are
not in the process of being modified or deleted by the storage server. (It
could have been modified or deleted behind the back of the storage server,
but if it has, the server has not noticed that yet.) The share may or may
not be leased.

**STATE_GOING**: The share is being deleted.

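In code, these states can be represented as small constants recorded in
each share's leasedb entry. The following sketch shows one way to express
the states and the set of valid transitions from the diagram above; the
concrete values, and the helper itself, are assumptions made for this
example::

  # Integer values are illustrative assumptions, not necessarily those
  # used by the actual leasedb.
  STATE_COMING, STATE_STABLE, STATE_GOING = range(3)

  # Valid transitions from the diagram above. None stands for the absence
  # of a leasedb entry for the (storage_index, shnum) pair.
  VALID_TRANSITIONS = {
      None:         {STATE_COMING, STATE_STABLE},
      STATE_COMING: {STATE_STABLE},
      STATE_STABLE: {STATE_COMING, STATE_GOING, None},
      STATE_GOING:  {None},
  }

  def check_transition(old_state, new_state):
      """Raise AssertionError on a transition the diagram does not allow."""
      assert new_state in VALID_TRANSITIONS[old_state], \
          "invalid share state transition: %r -> %r" % (old_state, new_state)
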
State transitions
-----------------

- **STATE_GOING** → **NONE**

  trigger: The storage server gains confidence that all store objects for
  the share have been removed.

  implementation:

  1. Remove the entry in the leasedb.

- **STATE_STABLE** → **NONE**

  trigger: The accounting crawler noticed that all the store objects for
  this share are gone.

  implementation:

  1. Remove the entry in the leasedb.

- **NONE** → **STATE_COMING**

  triggers: A new share is being created, as explicitly signalled by a
  client invoking a creation command, *or* the accounting crawler discovers
  an incomplete share.

  implementation:

  1. Add an entry to the leasedb with **STATE_COMING**.

  2. (In case of explicit creation) begin writing the store objects to
     hold the share.

- **STATE_STABLE** → **STATE_COMING**

  trigger: A mutable share is being modified, as explicitly signalled by a
  client invoking a modification command.

  implementation:

  1. Add an entry to the leasedb with **STATE_COMING**.

  2. Begin updating the store objects.

- **STATE_COMING** → **STATE_STABLE**

  trigger: All store objects have been written.

  implementation:

  1. Change the state value of this entry in the leasedb from
     **STATE_COMING** to **STATE_STABLE**.

- **NONE** → **STATE_STABLE**

  trigger: The accounting crawler discovers a complete share.

  implementation:

  1. Add an entry to the leasedb with **STATE_STABLE**.

- **STATE_STABLE** → **STATE_GOING**

  trigger: The share should be deleted because it is unleased.

  implementation:

  1. Change the state value of this entry in the leasedb from
     **STATE_STABLE** to **STATE_GOING**.

  2. Initiate removal of the store objects.

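As an illustration, the **STATE_COMING** → **STATE_STABLE** transition
amounts to a single update of the state value. The sketch below assumes
the state constants from the sketch above and a ``shares`` table with a
``state`` column, whose layout is an assumption made for this example::

  def mark_share_stable(conn, storage_index, shnum):
      # Flip the entry from STATE_COMING to STATE_STABLE once the storage
      # server has confirmation that all store objects for the share have
      # been completely written.
      cursor = conn.execute(
          "UPDATE shares SET state = ? "
          "WHERE storage_index = ? AND shnum = ? AND state = ?",
          (STATE_STABLE, storage_index, shnum, STATE_COMING))
      conn.commit()
      # Matching on the old state makes the update a no-op if the entry is
      # not in STATE_COMING, which guards against racing transitions.
      return cursor.rowcount == 1
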
The following constraints are needed to avoid race conditions:

- While a share is being deleted (entry in **STATE_GOING**), we do not
  accept any requests to recreate it. That would result in add and delete
  requests for store objects being sent concurrently, with undefined
  results.

- While a share is being added or modified (entry in **STATE_COMING**), we
  treat it as leased.

- Creation or modification requests for a given mutable share are
  serialized.


Unresolved design issues
------------------------

- What happens if a write to store objects for a new share fails
  permanently? If we delete the share entry, then the accounting crawler
  will eventually get to those store objects and see that their lengths
  are inconsistent with the length in the container header. This will
  cause the share to be treated as corrupted. Should we instead attempt to
  delete those objects immediately? If so, do we need a direct
  **STATE_COMING** → **STATE_GOING** transition to handle this case?

- What happens if only some store objects for a share disappear
  unexpectedly? This case is similar to only some objects having been
  written when we get an unrecoverable error during creation of a share,
  but perhaps we want to treat it differently in order to preserve
  information about the storage service having lost data.

- Does the leasedb need to track corrupted shares?


Future directions
-----------------

Clients will have key pairs identifying accounts, and will be able to add
leases for a specific account. Various space usage policies can be defined.

Better migration tools ('tahoe storage export'?) will create export files
that include both the share data and the lease data, and then an import
tool will both put the share in the right place and update the recipient
node's leasedb.