#197 closed enhancement (fixed)

small mutable files

Reported by: warner
Owned by: warner
Priority: blocker
Milestone: 0.7.0
Component: code-encoding
Version: 0.6.1

Description (last modified by warner)

Last week, zooko and I finished designing "Small Distributed Mutable Files". The design notes are in source:docs/mutable.txt. Now it's time to implement them.

Our hope is to have this in an 0.6.2 release within two weeks (since I'm giving a talk on 9-Nov and it'd be nice to have it ready by then), but if that doesn't work out, the goal is for the 0.7 release.

Here's the task list, so we can claim different pieces without colliding too much. Please edit this summary when you claim a piece.

  • adding RSA to allmydata.Crypto
    • not sure quite how yet: see #11
  • backend: new methods in IStorageServer to allocate/locate/read/write mutable buckets, plus a MutableBucketWriter class to implement the server-side container format -CLAIMED BY warner, 80% done-
  • client-side SMDF slot format wrangling, given a chunk of data and the right keys, generate the slot contents
  • client-side peer selection: walk through peers, find existing shares, decide upon which shares go where
    • recovery algorithm
  • client-side filenode class: the API should have a replace() method (see the sketch after this list)
  • client-side dirnode class -CLAIMED BY warner-
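
To make the filenode piece concrete, here is a minimal sketch of the shape such a class might take. The class name, constructor arguments, and method signatures are all assumptions, not the final API:

```python
# Hypothetical sketch of the client-side mutable filenode API; names and
# signatures are illustrative guesses, not the final interface.

class MutableFileNode:
    def __init__(self, uri, client):
        self.uri = uri        # read-write or read-only cap for this slot
        self.client = client  # used to reach the storage servers

    def download_to_data(self):
        """Locate shares, pick a retrievable version, and return the
        current plaintext contents of the slot."""
        raise NotImplementedError("retrieve side not sketched here")

    def replace(self, new_contents):
        """Encode and sign a new version of the slot containing
        new_contents, then publish it to the storage servers."""
        raise NotImplementedError("publish side not sketched here")
```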

Distributed Dirnodes (#115) will be implemented on top of this.

Change History (17)

comment:1 Changed at 2007-10-29T22:43:12Z by warner

  • Description modified (diff)
  • Status changed from new to assigned

I'll start with the backend. We're putting off RSA for a day to give zooko some time to look at alternatives to pycrypto.

comment:2 Changed at 2007-10-31T07:43:46Z by warner

  • Description modified (diff)

Got most of the server-side slots in place. Needs more test coverage to make sure we don't corrupt shares on resize, etc. Leases also need a lot more test coverage.

Tomorrow I'll start looking at the layers above that.

comment:3 Changed at 2007-10-31T15:26:34Z by zooko

  • Priority changed from major to blocker

This is the defining feature of the v0.7.0 release.

If we get this running in time for the next public presentation of Tahoe (which Brian will be giving), then we'll release v0.7. If we don't, then we'll release v0.6.2.

Oh, except that there might be a backwards incompatibility due to sha-256. I haven't yet determined whether our use of sha-256 runs afoul of that bug. In any case, if there is a backwards incompatibility then we have to bump the minor version number, so it will be v0.7 regardless.

comment:5 Changed at 2007-11-01T20:16:27Z by zooko

Please see comment:ticket:102:9, the end of which mentions that perhaps URI isn't a good place for the "dir-or-file" type.

comment:6 Changed at 2007-11-08T11:01:49Z by warner

Whew. Two weeks of intense design, followed by two weeks of *very* intense coding, and Tahoe now has a shiny new mutable file implementation!

It isn't quite ready yet; there are about two days of integration work left. So no release this week, I'm afraid, but early next week.

The integration tasks left:

  • the Client class is where upload/download/create methods should live. That means client.upload(target), client.download_to(target), client.create_dirnode(), and client.get_node_for_uri(uri). Direct access to the uploader/downloader should go away, and everything should go through the Client. We might want to move this to a separate object (or at least a separate Interface), but everything that accesses the grid layer should go through the same object (a rough facade sketch follows this list).
  • the client should create a new-style dirnode upon first boot instead of an old-style one
  • the old dirnode code should be removed, along with the vdrive client-side code and the vdrive server (and the vdrive.furl config file)
  • dirnode2.py should replace dirnode.py
  • URIs for the new mutable filenodes and dirnodes are a bit goofy-looking (URI:DIR2:...). I suggest we leave them as they are for this release but file a task for the next release to go through and clean up all URIs (also settle on shorter hash lengths to make the URIs smaller, and probably start storing binary representations of the read/write caps in dirnodes rather than printable URI strings)
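
As a rough illustration of that facade idea, here is a minimal sketch; the collaborator objects and their method names are assumptions, not existing code:

```python
# Minimal sketch of the Client facade described above: all access to the
# grid layer funnels through one object instead of callers holding the
# uploader/downloader directly. Collaborator names are assumptions.

class Client:
    def __init__(self, uploader, downloader, node_maker):
        self._uploader = uploader
        self._downloader = downloader
        self._node_maker = node_maker

    def upload(self, source):
        return self._uploader.upload(source)

    def download_to(self, uri, target):
        return self._downloader.download(uri, target)

    def create_dirnode(self):
        # first boot would call this to make a new-style dirnode
        return self._node_maker.create_mutable_dirnode()

    def get_node_for_uri(self, uri):
        return self._node_maker.node_from_uri(uri)
```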

Testing tasks left around this stuff:

  • the solaris buildslave is not actually testing new code (#145), so we need to know that pycryptopp is working on solaris before release
  • update docs/README to describe the new dependency on libcrypto++
  • build .debs for pycryptopp so we can install it on the test grid machines

Mutable file tasks, security vulnerabilities, not for this release but soon thereafter:

  • check the share hash chain (Retrieve._validate_share_and_extract_data). At the moment we check the block hash tree but not the share hash chain, so Byzantine storage servers could trivially corrupt our data.
  • rollback attacks: we chose a policy of "first retrievable version wins" on download, but for small grids and large expansion factors (i.e. small values of k) this makes it awfully easy for a single out-of-date server to effectively perform a rollback attack against you. I think we should define some parameter epsilon and use the highest-seqnum'ed retrievable version seen among k+epsilon servers (see the sketch after this list).
  • consider verifying the signature in Publish._got_query_results. It's slightly expensive, so I don't want to do it without thinking it through. It would help prevent DoS attacks where a server makes us think that there's a colliding write taking place.
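
To make the epsilon idea concrete, here is a minimal sketch of the proposed version-selection rule. The function, its arguments, and the one-share-per-server simplification are all hypothetical:

```python
from collections import Counter

def choose_version(answers, k, epsilon):
    """Pick which version of a mutable file to retrieve.

    answers: one (seqnum, root_hash) pair per server that responded.
    Rather than taking the first retrievable version, require at least
    k + epsilon answers, then take the newest version that enough
    servers (>= k shares, i.e. retrievable) agree on.
    """
    if len(answers) < k + epsilon:
        raise RuntimeError("too few servers consulted to resist rollback")
    counts = Counter(answers)
    retrievable = [version for version, n in counts.items() if n >= k]
    if not retrievable:
        raise RuntimeError("no retrievable version found")
    return max(retrievable, key=lambda version: version[0])  # highest seqnum
```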

Mutable file tasks, lower priority:

  • analyze control flow to count the round trips. I was hoping we could get an update done in just one RTT but at the moment it's more like 3 or 4. It's much more challenging than I originally thought.
  • try to always push a share (perhaps N+1) to ourselves, so we'll have the private key around. It would be sad to have a directory whose contents were recoverable but which we could no longer modify because we couldn't get the privkey anymore.
  • choose one likely server (specifically ourselves) during publish to use to fetch our encprivkey. This means doing an extra readv (or perhaps just an extra-large readv) for that one server in _query_peers: the rest can use pretty small reads, like 1000 bytes. This ought to save us a round-trip.
  • error handling: peers throwing random remote exceptions should not cause our publish to fail unless it's a NotEnoughPeersError.
  • the notion of "container size" in the mutable-slot storage API is pretty fuzzy. One idea was to allow read vectors to refer to the end of the segment (like Python string slices using negative index values), for which we'd need a well-defined container size (sketched after this list). I'm not sure this is actually useful for anything, though (maybe grabbing the encrypted privkey, since it's always at the end?). It probably won't matter until MDMF, where you'd want to grab the encprivkey without having to fetch the whole share.
  • tests, tests, tests. There are LOTS of corner cases that I want coverage on. The easy ones are what download does in the face of out-of-date servers. The hard ones are what upload does in the face of simultaneous writers.
  • write-time collision recovery. We designed an algorithm (in docs/mutable.txt) to handle this well. There is a place for it to be implemented in allmydata.mutable.Publish._maybe_recover.
  • Publish peer selection: rebalance shares on each publish, by noticing when there are multiple shares on a single peer and also unused peers in the permuted list. The idea is that shares created on a small grid should automatically spread out when updated after the grid has grown.
  • RSA key generation takes an unfortunately long time (between 0.8 and 3.2 seconds in my casual tests). This will make an RW deep-copy of a large directory structure pretty slow. We should do some benchmarking of this thing to determine key size / speed tradeoffs, and maybe at some point consider ECC if it could be faster.
  • code terminology: share vs slot vs container, "SSK" vs mutable file vs slot. We need to nail down the meanings of some of these and clean up the code to match.
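
For the read-vector idea specifically, here is a sketch of what negative offsets would mean; the function name and error handling are assumptions:

```python
def resolve_read_vector(offset, length, container_size):
    """Translate a read vector with a possibly-negative offset into an
    absolute (offset, length) pair, mimicking Python's negative slice
    indices. Only works if the server has a well-defined container size."""
    if offset < 0:
        offset += container_size  # e.g. offset=-1200 reads the tail,
                                  # where the encprivkey is stored
    if offset < 0 or offset + length > container_size:
        raise ValueError("read vector outside container")
    return (offset, length)
```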

comment:7 Changed at 2007-11-08T18:19:34Z by warner

If we provide web UI access to mutable *files* (as opposed to merely mutable directories), say so that a certain developer could put an HTML presentation together without being limited to a strict tree structure in the HREFs, then I think it should look like the following:

  • the "Upload File" button either acquires a "Mutable?" checkbox, or there are two separate buttons, one for immutable, one for mutable.
  • all entries in the directory listing that are mutable files should provide a way to upload new contents: either a (choose-file, "Replace"-button) pair, or a button labeled "Replace.." that takes you to a separate page from which you can choose a file and hit the real "Replace" button.

comment:8 Changed at 2007-11-08T18:22:48Z by warner

This is an issue for a different ticket, but I'd also like to see our dirnode URIs changed to be a clear wrapper around a filenode URI. When the internal filenode URI is a mutable file (RW or RO), the dirnode is mutable (RW or RO). When the internal filenode URI is an immutable CHK file, that dirnode is immutable. When we implement the deep-copy facility, it will have an option to either create a new tree of mutable RW dirnodes, or a tree of immutable dirnodes, and the latter can be done much more efficiently by creating CHK files (no expensive RSA key generation) instead of mutable files.
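
A sketch of what such a wrapper could look like, reusing the URI:DIR2: prefix mentioned in comment:6; the exact cap syntax here is an assumption, not the shipped format:

```python
# Hypothetical wrap/unwrap helpers for "dirnode URI = wrapper around a
# filenode URI". The exact prefix arithmetic is illustrative only.

DIR_PREFIX = "URI:DIR2:"

def wrap_filenode_uri(filenode_uri):
    """Turn a filenode cap into the corresponding dirnode cap by
    grafting the DIR2 marker onto it."""
    assert filenode_uri.startswith("URI:")
    return DIR_PREFIX + filenode_uri[len("URI:"):]

def unwrap_dirnode_uri(dirnode_uri):
    """Recover the inner filenode cap; its flavor (mutable RW/RO, or
    immutable CHK) then determines the dirnode's mutability."""
    assert dirnode_uri.startswith(DIR_PREFIX)
    return "URI:" + dirnode_uri[len(DIR_PREFIX):]
```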

comment:9 Changed at 2007-11-08T19:48:35Z by warner

I mentioned this one above:

  • consider verifying the signature in Publish._got_query_results. It's slightly expensive, so I don't want to do it without thinking it through. It would help prevent DoS attacks where a server makes us think that there's a colliding write taking place.

I've decided we *should* verify it here; the cryptopp benchmarks show that my Mac at home takes 560 microseconds per RSA2048 verification, so adding an extra 10 checks will only add 5.6ms to the publish time, which is small compared to the 42ms that the same computer will take to do the RSA2048 signature needed to publish the new contents.

comment:10 Changed at 2007-11-08T20:03:29Z by warner

I went ahead and pushed the verify-signatures-in-Publish._got_query_results change. It still needs testing, of course, just like all the other failure cases.

comment:11 Changed at 2007-11-13T18:03:10Z by zooko

I've created ticket #207 to hold all the parts of this task that can be deferred until v0.7.1.

Here are the remaining things that I think we need to fix to close this ticket:

  • There is currently a test failure on solaris. I also get a (different) unit test failure from the foolscap tests on that machine (foolscap #31).
  • check the share hash chain (Retrieve._validate_share_and_extract_data). At the moment we check the block hash tree but not the share hash chain, so Byzantine storage servers could trivially corrupt our data (see the sketch below).
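
For reference, here is a rough sketch of what the missing check involves. The hash-combining function and the flat-tree layout are assumptions; the real implementation belongs in Retrieve._validate_share_and_extract_data:

```python
import hashlib

def hash_pair(left, right):
    # Combining function for interior nodes; the real code uses tagged
    # hashes, so plain SHA-256 here is an assumption.
    return hashlib.sha256(left + right).digest()

def check_share_hash_chain(leaf_index, leaf_hash, chain, signed_root, num_leaves):
    """Walk from this share's leaf hash up to the root, using the sibling
    hashes the server supplied in `chain` (a dict: node index -> hash),
    and compare against the root hash covered by the writer's signature.

    Assumes the usual 1-based heap layout (root at index 1, children of
    node i at 2i and 2i+1), with num_leaves rounded up to a power of two.
    """
    i = num_leaves + leaf_index   # index of our leaf in the flat tree
    h = leaf_hash
    while i > 1:
        sibling = chain[i ^ 1]    # the other child of our parent
        h = hash_pair(h, sibling) if i % 2 == 0 else hash_pair(sibling, h)
        i //= 2
    if h != signed_root:
        raise ValueError("share hash chain mismatch: corrupt or forged share")
```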

comment:12 Changed at 2007-11-14T14:56:04Z by zooko

The test failure on solaris had to do with old, incompatible versions of packages being loaded from the system instead of the current versions. I removed the old packages from the host, so that problem is currently not happening (although we would still like to fix things so that the unit tests do not load packages from the system -- #145).

comment:13 Changed at 2007-11-14T14:56:54Z by zooko

I believe Brian implemented "check the share hash chain" in 8ba10d0155cddeb3, but I'll leave it to him to close this ticket.

comment:14 Changed at 2007-11-14T20:53:10Z by warner

Nope, not yet. I added a test which provokes a failure in a single share, to see if download succeeds anyway, but that test does not specifically check that the corrupted share is detected. There is not yet code to check this hash chain. Maybe today; more likely late tomorrow.

comment:15 Changed at 2007-11-14T21:32:45Z by warner

Ok, I just pushed the validate-the-share_hash_chain fix in 2eeac5cff8baf05d, so this check is now being done. I did an eyeball test, but we still need unit tests (though not for 0.7.0).

#207 has the remaining mutable file issues.

I will close this one shortly; there's an OS X buildbot failure in the test I just pushed, which I need to investigate before I'll consider this part done.

comment:16 Changed at 2007-11-14T22:54:46Z by warner

It looks like the code which is supposed to fall back to other shares when a corrupt share is detected is not working. The lack of deterministic tests (tracked in #207) makes this a random failure rather than an easily repeatable one.

Still investigating...

comment:17 Changed at 2007-11-17T00:27:04Z by warner

  • Resolution set to fixed
  • Status changed from assigned to closed

Ok, the last issue is resolved: getting multiple shares from a single server (where one is bad and the other good) no longer causes the good shares to be ignored. The remaining mutable-file issues (for the release after this one) are in #207; I'm closing this one out.
