Opened at 2008-09-08T22:44:26Z
Last modified at 2023-03-24T19:30:32Z
#510 closed enhancement
use plain HTTP for storage server protocol — at Version 22
Reported by: | warner | Owned by: | taral |
---|---|---|---|
Priority: | major | Milestone: | HTTP Storage Protocol |
Component: | code-storage | Version: | 1.2.0 |
Keywords: | standards gsoc http leastauthority | Cc: | zooko, jeremy@…, peter@… |
Launchpad Bug: |
Description (last modified by daira)
Zooko told me about an idea: use plain HTTP for the storage server protocol, instead of foolscap. Here are some thoughts:
- it could make Tahoe easier to standardize: the spec wouldn't have to include foolscap too
- the description of the share format (all the hashes/signatures/etc) becomes the most important thing: most other aspects of the system can be inferred from this format (with peer selection being a significant omission)
- download is easy: use GET and a URL of /shares/STORAGEINDEX/SHNUM, perhaps with an HTTP Range request header if you only want a portion of the share
- upload for immutable files is easy: PUT /shares/SI/SHNUM, which works only once
- upload for mutable files:
- implement DSA-based mutable files, in which the storage index is the hash of the public key (or maybe even equal to the public key)
- the storage server is obligated to validate every bit of the share against the roothash, validate the roothash signature against the pubkey, and validate the pubkey against the storage index
- the storage server will accept any share that validates up to the SI and has a seqnum higher than any existing share
- if there is no existing share, the server will accept any valid share
- when using Content-Range: (in some one-message equivalent of writev), the server validates the resulting share, which is some combination of the existing share and the deltas being written. (this is for MDMF where we're trying to modify just one segment, plus the modified hash chains, root hash, and signature)
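The acceptance rule in the list above can be sketched as follows. This is a minimal illustration, not a proposed implementation: `MutableShare`, its fields, and `server_accepts` are hypothetical names, and the full validation chain (share against roothash, roothash signature against pubkey, pubkey against storage index) is collapsed into a single boolean.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MutableShare:
    """Hypothetical stand-in for a parsed mutable share."""
    seqnum: int
    # Assumed to hold the result of the full validation chain:
    # share -> roothash -> signature -> pubkey -> storage index.
    validates_up_to_si: bool


def server_accepts(existing: Optional[MutableShare], incoming: MutableShare) -> bool:
    """Apply the validate-the-share write rule described in the ticket."""
    # A share that does not validate up to the SI is never accepted.
    if not incoming.validates_up_to_si:
        return False
    # If there is no existing share, any valid share is accepted.
    if existing is None:
        return True
    # Otherwise the seqnum must be higher than the existing one.
    return incoming.seqnum > existing.seqnum
```

Note that no writecap appears anywhere in the check, which is what would let a repairer create overwritable shares.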
Switching to a validate-the-share scheme to control write access is good and bad:
- + repairers can create valid, readable, overwritable shares without access to the writecap.
- - storage servers must do a lot of hashing and public key computation on every upload
- - storage servers must know the format of the uploaded share, so clients cannot start using new formats without first upgrading all the storage servers
The result would be a share-transfer protocol that would look exactly like HTTP; however, it could not be safely implemented by a simple HTTP server, because the PUT requests must be constrained by validating the share. (A simple HTTP server doesn't really implement PUT anyways.) There is a benefit to using "plain HTTP", but some of the benefit is lost when in fact it is really HTTP being used as an RPC mechanism (think of the way S3 uses HTTP).
It might be useful to have storage servers declare two separate interfaces: a plain HTTP interface for read, and a separate port or something for write. The read side could indeed be provided by a dumb HTTP server like apache; the write side would need something slightly more complicated. An apache module to provide the necessary share-write checking would be fairly straightforward, though.
Hm, that makes me curious about the potential to write the entire Tahoe node as an apache module: it could convert requests for /ROOT/uri/FILECAP etc into share requests and FEC decoding...
Change History (22)
comment:1 Changed at 2008-09-10T20:20:57Z by zooko
comment:2 Changed at 2008-09-24T13:52:07Z by zooko
I mentioned this ticket as one of the most important-to-me improvements that we could make in the Tahoe code: http://allmydata.org/pipermail/tahoe-dev/2008-September/000809.html
comment:3 Changed at 2010-02-23T03:09:25Z by zooko
- Milestone changed from undecided to 2.0.0
comment:4 follow-up: ↓ 5 Changed at 2010-03-01T10:30:52Z by jrydberg
"PUT /shares/SI/SHNUM, which works only once" - Shouldn't POST be used rather than PUT? PUT is idempotent.
comment:5 in reply to: ↑ 4 Changed at 2010-03-02T03:08:01Z by davidsarah
- Keywords standards added
Replying to jrydberg:
"PUT /shares/SI/SHNUM, which works only once" - Shouldn't POST be used rather than PUT? PUT is idempotent.
PUTting a share would be idempotent, because "(aside from error or expiration issues) the side-effects of N > 0 identical requests is the same as for a single request" (http://tools.ietf.org/html/rfc2616#section-9.1). I.e. repeating the request can have no harmful effect. (Note that, assuming the collision-resistance of the hash, there is only one possible valid contents for the share at a given SI and SHNUM.)
HTTP doesn't require that an idempotent request always succeeds. The only ways in which client behaviour is specified to depend on idempotence are:
- If there is an asynchronous close during a sequence of idempotent requests, clients SHOULD retry the request sequence once without user interaction (http://tools.ietf.org/html/rfc2616#section-8.1.4).
- Idempotent requests can be pipelined (http://tools.ietf.org/html/rfc2616#section-8.1.2.2).
These are both desirable for uploading of shares.
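To make the idempotence argument concrete, here is a rough sketch of a write-once share store. The class name, the in-memory dict, and the choice of status codes (201, 200, 409) are assumptions for illustration; the ticket does not specify any of them.

```python
class WriteOnceShareStore:
    """Hypothetical in-memory model of 'PUT /shares/SI/SHNUM, which works only once'."""

    def __init__(self):
        self._shares = {}

    def put(self, storage_index: str, shnum: int, data: bytes) -> int:
        """Return an HTTP-like status code (the specific codes are assumed)."""
        key = (storage_index, shnum)
        existing = self._shares.get(key)
        if existing is None:
            # First upload: store the share.
            self._shares[key] = data
            return 201
        if existing == data:
            # Repeating an identical request leaves the stored state unchanged.
            return 200
        # Different contents for an existing immutable share are refused.
        return 409

    def get(self, storage_index: str, shnum: int) -> bytes:
        return self._shares[(storage_index, shnum)]
```

The response to a repeated PUT may differ from the first one, but the stored state after N identical requests is the same as after one, which is all that idempotence requires.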
comment:6 Changed at 2010-03-04T21:57:36Z by jsgf
- Cc jeremy@… added
comment:7 Changed at 2010-03-12T23:30:26Z by davidsarah
- Keywords gsoc added
comment:8 Changed at 2010-08-15T04:58:06Z by zooko
See also #1007 (HTTP proxy support for node to node communication).
comment:9 Changed at 2010-08-15T04:58:41Z by zooko
- Summary changed from use plain HTTP for storage server protocol? to use plain HTTP for storage server protocol
comment:10 Changed at 2010-11-05T13:18:50Z by davidsarah
- Keywords http added
comment:11 Changed at 2011-06-29T08:26:45Z by warner
Some notes from the 2011 Tahoe Summit:
We can't keep using shared-secret prove-by-present-it write-enablers over a non-confidential HTTP transport. One approach would be to use a verifying key as the write-enabler, and sign the serialized mutation request message, but that would impose a heavy CPU cost on each write (a whole pubkey verification).
A cheaper approach would use a shared-secret write-enabler to MAC the mutation request. To get this shared secret to the server over a non-confidential channel, we need a public-key encryption scheme. The scheme David-Sarah and I cooked up uses one pubkey-decryption operation per server connection, and avoids all but one verification operation per re-key operation. Normal share mutation uses only (cheap) symmetric operations.
Basically, each client/server pair establishes a symmetric session key as soon as the connection is established. This involves putting a public encryption key in the #466 signed-introducer announcement, maybe as simple as a DH g^x parameter (probably an elliptic-curve group element). At startup, the client picks a secret y, creates g^y, sends it in a special message to the server, and the resulting shared g^(xy) is the session key. The client could use a derivative of their persistent master secret for this, or it could be random each time, doesn't matter.
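A toy sketch of that key agreement, to show the shape of the exchange only. It uses plain modular Diffie-Hellman over a small Mersenne prime instead of the elliptic-curve group the comment anticipates, and hashing the shared value with SHA-256 to get the session key is an assumption. The parameters are not secure and all names are hypothetical.

```python
import hashlib
import secrets

# Toy parameters, for illustration only. A real deployment would use a
# vetted elliptic-curve group from a cryptography library.
P = 2**127 - 1  # a Mersenne prime
G = 3


def keypair():
    """Pick a random secret exponent and return (secret, public element)."""
    secret = secrets.randbelow(P - 3) + 2  # uniform in [2, P - 2]
    return secret, pow(G, secret, P)


def session_key(own_secret: int, peer_public: int) -> bytes:
    """Derive a 32-byte session key from the shared group element."""
    shared = pow(peer_public, own_secret, P)
    # Hashing the shared element is an assumed key-derivation step.
    return hashlib.sha256(shared.to_bytes(16, "big")).digest()


# Server side: x is secret, the public element goes in the announcement.
x, server_public = keypair()
# Client side: y is secret, the public element goes in the special message.
y, client_public = keypair()

# Both ends now hold the same session key without ever sending it.
assert session_key(x, client_public) == session_key(y, server_public)
```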
The session key is used in an authenticated-encryption mode like CCM or basic AES+HMAC. When a #1426 re-key operation is performed, the signed please-update-the-write-enabler message is encrypted with the session key, protecting the WE from eavesdroppers. The server checks the re-key request's signature and stores the new WE next to the share.
To actually authorize mutate-share operations, the request is serialized, then MACed using the WE as the secret key. Requests without a valid MAC are rejected. This uses only cheap hash operations for the mutation requests. The expensive pubkey ops are only used once per file per serverid-change (migration) during re-keying, and once per connection to establish the session key.
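The MAC step could look roughly like this. The request fields, the JSON serialization, and the use of HMAC-SHA256 are all assumptions made for the sketch; the comment only says the serialized request is MACed with the WE as the key.

```python
import hashlib
import hmac
import json


def serialize(request: dict) -> bytes:
    """Canonical serialization (assumed: JSON with sorted keys)."""
    return json.dumps(request, sort_keys=True, separators=(",", ":")).encode("utf-8")


def mac_request(write_enabler: bytes, request: dict) -> bytes:
    """Client side: MAC the serialized mutation request, keyed by the WE."""
    return hmac.new(write_enabler, serialize(request), hashlib.sha256).digest()


def server_authorizes(write_enabler: bytes, request: dict, tag: bytes) -> bool:
    """Server side: recompute the MAC with the stored WE and compare."""
    expected = mac_request(write_enabler, request)
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(expected, tag)
```

Only symmetric hash operations are involved, which is the point of the design. This sketch does not address replay of a previously valid request.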
comment:12 Changed at 2011-08-24T15:53:21Z by zooko
- Owner set to taral
comment:13 Changed at 2011-08-29T00:16:13Z by taral
Okay, so a couple things:
- I need a list of the protocol messages. :)
- You guys sound like you're re-inventing TLS. Can someone explain why we shouldn't run the protocol over TLS instead of inventing our own crypto?
comment:14 Changed at 2011-09-01T21:24:09Z by zooko
- Cc zooko added
comment:15 Changed at 2011-09-08T18:17:39Z by zooko
Taral:
The protocol messages are the methods of the classes which subclass RemoteInterface and are listed in interfaces.py. For example, to upload an immutable file, you get a remote reference to an RIBucketWriter and call its methods write() and close().
About crypto:
Note that we're talking only about the encryption used to protect authorization of users to do certain things. There is another use of encryption, which is to protect the confidentiality of the file data, and that we already do in our own custom way (since TLS doesn't really apply to files the way Tahoe-LAFS users use them).
The current version of Tahoe-LAFS protocol does actually run over SSL/TLS and rely on that to protect certain authorization secrets. The most important authorization secret is called the "write enabler", which you can read more about in specifications/mutable.rst, interfaces.py, client-side mutable/publish.py and mutable/filenode.py, and server-side storage/mutable.py.
When developing a new HTTP(S)-based protocol, we have to decide whether to implement our own encryption to manage authorization or to continue using the "enablers" design on top of SSL/TLS (thus making it be an HTTPS-only protocol and not an HTTP protocol). I think it may actually ease deployment and usage to do the former, because SSL/TLS is a bit of a pain to deploy. I think it may actually also simplify the protocol! This is somewhat surprising, but what we need is an authorization protocol and what SSL/TLS provides is a two-party confidential, integrity-preserving channel with server-authentication. It kind of looks like implementing our own crypto authorization protocol (such as described in comment:11) may result in a simpler protocol than implementing an authorization protocol layered on top of a secure channel protocol.
Our custom protocol would also be a bit more efficient, where efficiency is measured primarily by number of required round-trips.
(Note that Brian Warner's foolscap is already a general-purpose authorization protocol built on top of SSL, but it doesn't quite fit into our needs because of a few efficiency considerations including the size of the foolscap authorization tokens (furls). Also, foolscap includes a Python-oriented remote object protocol and the whole point of this ticket is to get away from that. :-))
I don't have time to dredge up all the pros and cons that we've talked about, but if anyone does remember them or find them, please post them to this ticket or link to them from this ticket.
comment:16 Changed at 2011-09-08T19:02:58Z by zooko
There are a few high-level docs for people getting started understanding the basic ideas of Tahoe-LAFS data formats.
These are a good start and one should probably read them first, but they really don't get specific enough so that you could, for example, go off and implement a compatible implementation yourself. Here are some "works in progress" where we hope that such a detailed specification will one day live:
docs/specifications/outline.rst
You could help, possibly by asking specific questions which can only be answered by fleshing out those specification documents.
comment:17 Changed at 2011-09-08T19:21:07Z by taral
Thanks zooko!
comment:18 Changed at 2011-09-09T19:06:35Z by zooko
You're welcome! I look forward to seeing what you do with it.
comment:19 Changed at 2011-09-09T19:11:43Z by zooko
Oh, I forgot to mention another "high level overview" for getting started with. This one was written by someone who I don't really know anything about -- named Mahmoud Ahmed Ismail, they haven't interacted with the Tahoe-LAFS developers much, but they started their own project inspired by Tahoe-LAFS and they wrote a high-level doc which is a really good introduction to Tahoe-LAFS design:
comment:20 Changed at 2011-10-17T20:48:15Z by warner
I put a quick prototype of using HTTP to download immutable share data in my "http-transport" github branch (https://github.com/warner/tahoe-lafs/tree/http-transport , may or may not still exist by the time you read this). It advertises a "storage-URL" through the #466 extended introducer announcement, and uses the new web.client code in recent Twisted (10.0 I think?) and a Range: header to fetch the correct read vector. It does not yet use persistent connections, which I think are necessary to get the performance improvement we're hoping for. It also still uses Foolscap for share discovery (getting from a storage index to a list of share numbers on that server), and doesn't touch mutable shares at all, and of course doesn't even think about uploads or modifying shares.
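For anyone following along, translating a read vector into a Range: header might look like the sketch below. This is not the code from the prototype branch; the function names are hypothetical, the read vector is assumed to be a list of (offset, length) pairs, and the URL layout is the /shares/SI/SHNUM form from the ticket description (see #1565 for the actual URL discussion).

```python
def share_url(base: str, storage_index: str, shnum: int) -> str:
    """Build a share URL using the layout from the ticket description."""
    return f"{base.rstrip('/')}/shares/{storage_index}/{shnum}"


def range_header(read_vector) -> str:
    """Convert [(offset, length), ...] into an HTTP Range header value."""
    parts = []
    for offset, length in read_vector:
        if offset < 0 or length <= 0:
            raise ValueError("offset must be >= 0 and length must be > 0")
        # HTTP byte ranges are inclusive at both ends.
        parts.append(f"{offset}-{offset + length - 1}")
    return "bytes=" + ",".join(parts)


headers = {"Range": range_header([(0, 10), (100, 50)])}
url = share_url("http://storage.example/", "STORAGEINDEX", 3)
```

A multi-range request like this would come back as a multipart response, so a real client would need to parse that or issue one request per range.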
I also added #1565 to discuss the URLs that should be used to access this kind of service.
comment:21 Changed at 2011-11-08T03:21:53Z by taral
Sorry about the delay, folks... things have been busy around here. If anyone else is interested in contributing to this, please feel free.
comment:22 Changed at 2013-05-30T00:13:14Z by daira
- Description modified (diff)
The cloud backend, which uses HTTP or HTTPS to connect to the cloud storage service, provides some interesting data on how an HTTP-only storage protocol might perform. With request pipelining and connection pooling, it seems to do a pretty good job of maxing out the upstream bandwidth to the cloud on my home Internet connection, although it would be interesting to test it with a fatter pipe. (For downloads, performance appears to be limited by inefficiencies in the downloader rather than in the cloud backend.)
Currently, the cloud backend splits shares into "chunks" to limit the amount of data that needs to be held in memory or in a store object (see docs/specifications/backends/raic.rst). This is somewhat redundant with segmentation: ciphertext "segments" are erasure-encoded into "blocks" (a segment is k = shares.needed times larger than a block), and stored in a share together with a header and metadata, which is then chunked. Blocks and chunks are not aligned (for two reasons: the share header, and the typical block size of 128 KiB / 3, which is not a factor of the 512 KiB default chunk size). So,
- a sequential scan over blocks will reference the same chunk for several (typically about 12 for k = 3) consecutive requests.
- a single block may span chunks.
- writes not aligned with a chunk must be implemented using read-modify-write.
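The misalignment can be checked with a little arithmetic. In this sketch the block size is rounded up to a whole number of bytes and blocks are assumed to be laid out contiguously after a header of `header_size` bytes; both are simplifying assumptions, and `chunks_for_block` is a hypothetical helper, not part of the cloud backend.

```python
import math

KiB = 1024
K = 3                        # shares.needed
SEGMENT_SIZE = 128 * KiB
CHUNK_SIZE = 512 * KiB       # default chunk size
BLOCK_SIZE = math.ceil(SEGMENT_SIZE / K)  # 128 KiB / 3, rounded up to bytes


def chunks_for_block(block_index: int, header_size: int = 0) -> list:
    """Return the chunk indices touched by one block (assumed contiguous layout)."""
    start = header_size + block_index * BLOCK_SIZE
    end = start + BLOCK_SIZE - 1  # last byte of the block, inclusive
    return list(range(start // CHUNK_SIZE, end // CHUNK_SIZE + 1))


# Roughly 12 consecutive blocks land in the same chunk for k = 3 ...
blocks_per_chunk = CHUNK_SIZE / BLOCK_SIZE
# ... but the sizes do not divide evenly, so some blocks straddle a boundary.
straddling = [i for i in range(24) if len(chunks_for_block(i)) > 1]
```

With a zero-length header, block 11 is the first one that spans two chunks; a nonzero header shifts which blocks straddle but does not remove the problem.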
The cloud backend uses caching to mitigate any resulting inefficiency. However, this is only of limited help because the storage client lacks information about where the chunk boundaries are and the behaviour of the chunk cache, and the storage server lacks information about the access patterns of the uploader or downloader.
A possible performance improvement and simplification that I'm quite enthusiastic about for an HTTP-based protocol is to make blocks the same thing as chunks. That is, the segment size would be k times the chunk size, and the uploader or downloader would directly store or request chunks, rather than blocks, from the backend storage, doing any caching itself.
Brian: very nice write-up. This is the kind of thing that ought to be posted to tahoe-dev. I kind of think that all new tickets opened on the trac should be mailed to tahoe-dev. That's what the distutils-sig list does, and it seems to work fine.
But anyway, would you please post the above to tahoe-dev? Thanks.