.. -*- coding: utf-8-with-signature -*-

=============================================================
Redundant Array of Independent Clouds: Share To Cloud Mapping
=============================================================


Introduction
============

This document describes a proposed design for the mapping of LAFS shares to
objects in a cloud storage service. It also analyzes the costs for each of the
functional requirements, including network, disk, storage and API usage costs.


Terminology
===========

*LAFS share*
  A Tahoe-LAFS share representing part of a file after encryption and
  erasure encoding.

*LAFS shareset*
  The set of shares stored by a LAFS storage server for a given storage index.
  The shares within a shareset are numbered by a small integer.

*Cloud storage service*
  A service such as Amazon S3 `²`_, Rackspace Cloud Files `³`_,
  Google Cloud Storage `⁴`_, or Windows Azure `⁵`_, that provides cloud storage.

*Cloud storage interface*
  A protocol interface supported by a cloud storage service, such as the
  S3 interface `⁶`_, the OpenStack Object Storage interface `⁷`_, the
  Google Cloud Storage interface `⁸`_, or the Azure interface `⁹`_. There may be
  multiple services implementing a given cloud storage interface. In this design,
  only REST-based APIs `¹⁰`_ over HTTP will be used as interfaces.

*Store object*
  A file-like abstraction provided by a cloud storage service, storing a
  sequence of bytes. Store objects are mutable in the sense that the contents
  and metadata of the store object with a given name in a given backend store
  can be replaced. Store objects are called “blobs” in the Azure interface,
  and “objects” in the other interfaces.

*Cloud backend store*
  A container for store objects provided by a cloud service. Cloud backend
  stores are called “buckets” in the S3 and Google Cloud Storage interfaces,
  and “containers” in the Azure and OpenStack Storage interfaces.

Functional Requirements
=======================

* *Upload*: a LAFS share can be uploaded to an appropriately configured
  Tahoe-LAFS storage server and the data is stored to the cloud
  storage service.

* *Scalable shares*: there is no hard limit on the size of LAFS share
  that can be uploaded.

  If the cloud storage interface offers scalable files, then this could be
  implemented by using that feature of the specific cloud storage
  interface. Alternatively, it could be implemented by mapping from the LAFS
  abstraction of an unlimited-size immutable share to a set of size-limited
  store objects.

* *Streaming upload*: the size of the LAFS share that is uploaded
  can exceed the amount of RAM and even the amount of direct attached
  storage on the storage server. That is, the storage server is required to
  stream the data directly to the ultimate cloud storage service while
  processing it, rather than buffering the data until the client has finished
  uploading and then transferring it to the cloud storage service.

* *Download*: a LAFS share can be downloaded from an appropriately
  configured Tahoe-LAFS storage server, and the data is loaded from the
  cloud storage service.

* *Streaming download*: the size of the LAFS share that is
  downloaded can exceed the amount of RAM and even the amount of direct
  attached storage on the storage server. That is, the storage server is
  required to stream the data directly to the client while processing it,
  rather than buffering the data until the cloud storage service has finished
  serving it and then transferring it to the client.

* *Modify*: a LAFS share can have part of its contents modified.

  If the cloud storage interface offers scalable mutable files, then this
  could be implemented by using that feature of the specific cloud storage
  interface. Alternatively, it could be implemented by mapping from the LAFS
  abstraction of an unlimited-size mutable share to a set of size-limited
  store objects.

* *Efficient modify*: the size of the LAFS share being
  modified can exceed the amount of RAM and even the amount of direct
  attached storage on the storage server. That is, the storage server is
  required to download, patch, and upload only the segment(s) of the share
  that are being modified, rather than downloading, patching, and uploading
  the entire share.

* *Tracking leases*: The Tahoe-LAFS storage server is required to track when
  each share has its lease renewed so that unused shares (shares whose lease
  has not been renewed within a time limit, e.g. 30 days) can be garbage
  collected. This does not necessarily require code specific to each cloud
  storage interface, because the lease tracking can be performed in the
  storage server's generic component rather than in the component supporting
  each interface.


Mapping
=======

This section describes the mapping between LAFS shares and store objects.

A LAFS share will be split into one or more “chunks” that are each stored in a
store object. A LAFS share of size `C` bytes will be stored as `ceiling(C / chunksize)`
chunks. The last chunk has a size between 1 and `chunksize` bytes inclusive.
(It is not possible for `C` to be zero, because valid shares always have a header,
so there is at least one chunk for each share.)

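The following sketch (illustrative only, not taken from the implementation)
shows how the chunk boundaries for a share of `C` bytes could be computed::

  def chunk_ranges(share_size, chunksize):
      """Return a list of (offset, length) pairs, one per chunk.

      share_size must be at least 1, because a valid share always
      contains a header; the last chunk may be shorter than chunksize.
      """
      ranges = []
      offset = 0
      while offset < share_size:
          length = min(chunksize, share_size - offset)
          ranges.append((offset, length))
          offset += length
      return ranges

For example, with a chunk size of 1 MiB, a share of exactly 5 MiB maps to
5 chunks, and a share of 5 MiB plus one byte maps to 6 chunks, the last
holding a single byte.
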
For an existing share, the chunk size is determined by the size of the first
chunk. For a new share, it is a parameter that may depend on the storage
interface. It is an error for any chunk to be larger than the first chunk, or
for any chunk other than the last to be smaller than the first chunk.
If a mutable share with total size less than the default chunk size for the
storage interface is being modified, the new contents are split using the
default chunk size.

*Rationale*: this design allows the `chunksize` parameter to be changed for
new shares written via a particular storage interface, without breaking
compatibility with existing stored shares. All cloud storage interfaces
return the sizes of store objects with requests to list objects, and so
the size of the first chunk can be determined without an additional request.

The name of the store object for chunk `i` > 0 of a LAFS share with storage index
`STORAGEINDEX` and share number `SHNUM` will be

  shares/`ST`/`STORAGEINDEX`/`SHNUM.i`

where `ST` is the first two characters of `STORAGEINDEX`. When `i` is 0, the
`.i` suffix is omitted.

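A sketch of this naming rule in Python (the function name and signature are
illustrative)::

  def chunk_object_name(storageindex, shnum, i):
      """Return the store object name for chunk i of a share.

      storageindex is the storage index as a string, and shnum is the
      small integer share number within the shareset.
      """
      prefix = "shares/%s/%s/" % (storageindex[:2], storageindex)
      if i == 0:
          # Same name as the old single-object layout (see below).
          return prefix + str(shnum)
      return prefix + "%d.%d" % (shnum, i)
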
*Rationale*: this layout maintains compatibility with data stored by the
prototype S3 backend, for which Least Authority Enterprises has existing
customers. This prototype always used a single store object to store each
share, with name

  shares/`ST`/`STORAGEINDEX`/`SHNUM`

By using the same prefix “shares/`ST`/`STORAGEINDEX`/” for old and new layouts,
the storage server can obtain a list of store objects associated with a given
shareset without having to know the layout in advance, and without having to
make multiple API requests. This also simplifies sharing of test code between the
disk and cloud backends.

Mutable and immutable shares will be “chunked” in the same way.


Rationale for Chunking
----------------------

Limiting the amount of data received or sent in a single request has the
following advantages:

* It is unnecessary to write separate code to take advantage of the
  “large object” features of each cloud storage interface, which differ
  significantly in their design.
* Data needed for each PUT request can be discarded after it completes.
  If a PUT request fails, it can be retried while only holding the data
  for that request in memory.


Costs
=====

In this section we analyze the costs of the proposed design in terms of network,
disk, memory, cloud storage, and API usage.


Network usage: bandwidth and number-of-round-trips
--------------------------------------------------

When a Tahoe-LAFS storage client allocates a new share on a storage server,
the backend will request a list of the existing store objects with the
appropriate prefix. This takes one HTTP request in the common case, but may
take more for the S3 interface, which has a limit of 1000 objects returned in
a single “GET Bucket” request.

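The listing step could look like the following sketch; the `list_objects`
method and its return shape are assumptions standing in for whichever cloud
interface adapter is in use, modelling this paginated style of request::

  def list_shareset_objects(adapter, storageindex):
      """List the (name, size) of every store object in a shareset.

      adapter.list_objects(prefix, marker) is a hypothetical method
      returning one page of (name, size) pairs plus a flag saying
      whether the listing was truncated (e.g. at S3's 1000-key limit).
      """
      prefix = "shares/%s/%s/" % (storageindex[:2], storageindex)
      entries = []
      marker = None
      while True:
          page, truncated = adapter.list_objects(prefix, marker)
          entries.extend(page)
          if not truncated:
              return entries
          marker = page[-1][0]   # resume after the last name returned
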
If the share is to be read, the client will make a number of calls, each
specifying the offset and length of the required span of bytes. On the first
request that overlaps a given chunk of the share, the server will make an
HTTP GET request for that store object. The server may also speculatively
make GET requests for store objects that are likely to be needed soon (which
can be predicted since reads are normally sequential), in order to reduce
latency.

Each read will be satisfied as soon as the corresponding data is available,
without waiting for the rest of the chunk, in order to minimize read latency.

All four cloud storage interfaces support GET requests using the
Range HTTP header. This could be used to optimize reads where the
Tahoe-LAFS storage client requires only part of a share.

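As an illustration, the spans of store objects touched by a read of `length`
bytes at `offset` could be computed as follows; the inclusive (start, end)
pair within each chunk is exactly what a Range header such as
`bytes=start-end` would carry::

  def chunks_for_read(offset, length, chunksize):
      """Return (chunk_index, start, end) for each chunk overlapping
      the requested byte span; start and end are inclusive offsets
      within that chunk, suitable for an HTTP Range header.
      """
      assert length > 0
      spans = []
      end = offset + length                      # exclusive
      i = offset // chunksize
      while i * chunksize < end:
          start_in_chunk = max(offset, i * chunksize) - i * chunksize
          end_in_chunk = min(end, (i + 1) * chunksize) - i * chunksize - 1
          spans.append((i, start_in_chunk, end_in_chunk))
          i += 1
      return spans

For example, with a 1 MiB chunk size, a 100-byte read starting 50 bytes
before the end of chunk 0 touches the last 50 bytes of chunk 0 and the first
50 bytes of chunk 1.
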
If the share is to be written, the server will make an HTTP PUT request for
each chunk that has been completed. Tahoe-LAFS clients only write immutable
shares sequentially, and so we can rely on that property to simplify the
implementation.

When modifying shares of an existing mutable file, the storage server will
be able to make PUT requests only for chunks that have changed.
(Current Tahoe-LAFS v1.9 clients will not take advantage of this ability, but
future versions will probably do so for MDMF files.)

In some cases, it may be necessary to retry a request (see the `Structure of
Implementation`_ section below). In the case of a PUT request, at the point
at which a retry is needed, the new chunk contents to be stored will still be
in memory and so this is not problematic.

In the absence of retries, the maximum number of GET requests that will be made
when downloading a file, or the maximum number of PUT requests when uploading
or modifying a file, will be equal to the number of chunks in the file.

If the new mutable share content has fewer chunks than the old content,
then the remaining store objects for old chunks must be deleted (using one
HTTP request each). When reading a share, the backend must tolerate the case
where these store objects have not been deleted successfully.

The last write to a share will be reported as successful only when all
corresponding HTTP PUTs and DELETEs have completed successfully.

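Putting these rules together, a mutable-share update could be sketched as
below; `put_object` and `delete_object` stand in for whichever cloud
interface adapter methods perform the PUT and DELETE requests, and success
is reported only after the whole function completes::

  def update_mutable_share(adapter, names, old_chunks, new_chunks):
      """Upload the new version of a mutable share, chunk by chunk.

      names[i] is the store object name for chunk i; old_chunks and
      new_chunks are the old and new chunk contents as byte strings.
      Only changed chunks are PUT, and store objects for chunks past
      the end of the new contents are DELETEd.
      """
      for i, data in enumerate(new_chunks):
          if i >= len(old_chunks) or old_chunks[i] != data:
              adapter.put_object(names[i], data)
      for i in range(len(new_chunks), len(old_chunks)):
          adapter.delete_object(names[i])
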


Disk usage (local to the storage server)
----------------------------------------

It is never necessary for the storage server to write the content of share
chunks to local disk, either when they are read or when they are written. Each
chunk is held only in memory.

A proposed change to the Tahoe-LAFS storage server implementation uses a sqlite
database to store metadata about shares. In that case the same database would
be used for the cloud backend. This would enable lease tracking to be implemented
in the same way for disk and cloud backends.

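Purely as an illustration of the kind of lease tracking such a database could
support (the real schema is not specified here), a minimal table might look
like::

  import sqlite3

  conn = sqlite3.connect("shares.sqlite")
  conn.execute("""
      CREATE TABLE IF NOT EXISTS leases (
          storage_index TEXT    NOT NULL,
          shnum         INTEGER NOT NULL,
          renewed_at    INTEGER NOT NULL,  -- UNIX time of last renewal
          PRIMARY KEY (storage_index, shnum)
      )
  """)
  conn.commit()

Garbage collection would then select the rows whose `renewed_at` is older
than the expiry time and delete the corresponding store objects.
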

Memory usage
------------

The use of chunking simplifies bounding the memory usage of the storage server
when handling files that may be larger than memory. However, this depends on
limiting the number of chunks that are simultaneously held in memory.
Multiple chunks can be held in memory either because of pipelining of requests
for a single share, or because multiple shares are being read or written
(possibly by multiple clients).

For immutable shares, the Tahoe-LAFS storage protocol requires the client to
specify in advance the maximum amount of data it will write. Also, a cooperative
client (including all existing released versions of the Tahoe-LAFS code) will
limit the amount of data that is pipelined, currently to 50 KiB. Since the chunk
size will be greater than that, it is possible to ensure that, for each allocation,
the maximum chunk data memory usage is the lesser of two chunks and the allocation
size. (There is some additional overhead but it is small compared to the chunk
data.) If the maximum memory usage of a new allocation would exceed the memory
available, the allocation can be delayed or possibly denied, so that the total
memory usage is bounded.

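Under those assumptions, an admission check for a new allocation could be as
simple as the following sketch (the names and the memory accounting are
illustrative)::

  def can_admit_allocation(allocated_size, chunksize, in_use, limit):
      """Return True if a new immutable-share allocation fits within
      the memory budget.  Each allocation holds at most two chunks of
      share data at a time, or the whole allocation if it is smaller.
      """
      worst_case = min(2 * chunksize, allocated_size)
      return in_use + worst_case <= limit
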
It is not clear that the existing protocol allows allocations for mutable
shares to be bounded in general; this may be addressed in a future protocol change.

The above discussion assumes that clients do not maliciously send large
messages as a denial-of-service attack. Foolscap (the protocol layer underlying
the Tahoe-LAFS storage protocol) does not attempt to resist denial of service.


Storage
-------

The storage requirements, including not-yet-collected garbage shares, are
the same as for the Tahoe-LAFS disk backend. That is, the total size of cloud
objects stored is equal to the total size of shares that the disk backend
would store.

Erasure coding causes the size of shares for each file to be a
factor `shares.total` / `shares.needed` times the file size, plus overhead
that is logarithmic in the file size `¹¹`_.


API usage
---------

Cloud storage backends typically charge a small fee per API request. The number of
requests to the cloud storage service for various operations is discussed under
“network usage” above.


Structure of Implementation
===========================

A generic “cloud backend”, based on the prototype S3 backend but with support
for chunking as described above, will be written.

An instance of the cloud backend can be attached to one of several
“cloud interface adapters”, one for each cloud storage interface. These
adapters will operate only on chunks, and need not distinguish between
mutable and immutable shares. They will be a relatively “thin” abstraction
layer over the HTTP APIs of each cloud storage interface, similar to the
S3Bucket abstraction in the prototype.

For some cloud storage services it may be necessary to transparently retry
requests in order to recover from transient failures. (Although the erasure
coding may enable a file to be retrieved even when shares are not stored by or
not readable from all cloud storage services used in a Tahoe-LAFS grid, it may
be desirable to retry cloud storage service requests in order to improve overall
reliability.) Support for this will be implemented in the generic cloud backend,
and used whenever a cloud interface adapter reports a transient failure. Our
experience with the prototype suggests that it is necessary to retry on transient
failures for Amazon's S3 service.

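A retry policy of roughly the following shape would live in the generic cloud
backend; it is shown synchronously for brevity, and the exception name is an
assumption about how an adapter would signal a transient failure::

  import time

  class TransientFailure(Exception):
      """Illustrative: raised by a cloud interface adapter for errors
      worth retrying, such as a 5xx response or a dropped connection."""

  def call_with_retries(operation, attempts=3, initial_delay=1.0):
      """Call `operation` (a no-argument callable performing one HTTP
      request), retrying on TransientFailure with exponential backoff.
      """
      for attempt in range(attempts):
          try:
              return operation()
          except TransientFailure:
              if attempt == attempts - 1:
                  raise
              time.sleep(initial_delay * (2 ** attempt))
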
There will also be a “mock” cloud interface adapter, based on the prototype's
MockS3Bucket. This allows tests of the generic cloud backend to be run without
a connection to a real cloud service. The mock adapter will be able to simulate
transient and non-transient failures.


Known Issues
============

This design worsens a known “write hole” issue in Tahoe-LAFS when updating
the contents of mutable files. An update to a mutable file can require
changing the contents of multiple chunks, and if the client fails or is
disconnected during the operation, the resulting state of the store objects
for that share may be inconsistent: no longer containing all of the old version,
but not yet containing all of the new version. A mutable share can be left in
an inconsistent state even by the existing Tahoe-LAFS disk backend if it fails
during a write, but that has a smaller chance of occurrence because the current
client behavior leads to mutable shares being written to disk in a single
system call.

The best fix for this issue probably requires changing the Tahoe-LAFS storage
protocol, perhaps by extending it to use a two-phase or three-phase commit
(ticket #1755).



References
===========

¹ omitted

.. _²:

² “Amazon S3” Amazon (2012)

   https://aws.amazon.com/s3/

.. _³:

³ “Rackspace Cloud Files” Rackspace (2012)

   https://www.rackspace.com/cloud/cloud_hosting_products/files/

.. _⁴:

⁴ “Google Cloud Storage” Google (2012)

   https://developers.google.com/storage/

.. _⁵:

⁵ “Windows Azure Storage” Microsoft (2012)

   https://www.windowsazure.com/en-us/develop/net/fundamentals/cloud-storage/

.. _⁶:

⁶ “Amazon Simple Storage Service (Amazon S3) API Reference: REST API” Amazon (2012)

   http://docs.amazonwebservices.com/AmazonS3/latest/API/APIRest.html

.. _⁷:

⁷ “OpenStack Object Storage” openstack.org (2012)

   http://openstack.org/projects/storage/

.. _⁸:

⁸ “Google Cloud Storage Reference Guide” Google (2012)

   https://developers.google.com/storage/docs/reference-guide

.. _⁹:

⁹ “Windows Azure Storage Services REST API Reference” Microsoft (2012)

   http://msdn.microsoft.com/en-us/library/windowsazure/dd179355.aspx

.. _¹⁰:

¹⁰ “Representational state transfer” English Wikipedia (2012)

   https://en.wikipedia.org/wiki/Representational_state_transfer

.. _¹¹:

¹¹ “Performance costs for some common operations” tahoe-lafs.org (2012)

   :doc:`../../performance`