[tahoe-dev] plans for tahoe-1.3.0

Tue Sep 2 11:55:29 PDT 2008

Zooko pointed out that it would be nice to have a release soon. Since
this release is going to be mostly filled with checker/repairer work,
it seemed like a good idea to define what our goals are for the
release: what are we missing today, and how much more do we want to get in
before we'd feel comfortable cutting the 1.3.0 tag?

The main thing I want out of this release is a usable
checker/verifier/repairer, which means:

 * checking: just count number of available shares
 * verifying: read share contents, check hashes
 * repair: create new shares as necessary to replace bad/missing ones.
   * Mutable shares are repaired in place. Note that mutable repair requires
     a write-cap, to make sure the write-enabler shared secrets are created
     correctly. It would be nice to be able to repair from just a read-cap or
     a verify-cap, but this may need to wait until we switch to DSA mutable
     files, and/or change the way we control server-side write access.
   * Immutable shares must be manually deleted from the storage servers, so
     repair needs a mechanism to report which shares should be examined and
     removed. Immutable repair really means creating new shares to make up
     for the bad ones.

 * there should be a "check" button for each file to initiate a check, or a
   verify, with or without auto-repair. The page that this button displays
   should contain the results of the operation: which shares were found
   where, how much verification was performed, whether repair was deemed
   necessary, whether repair was actually done, and the success or failure of
   the repair operation.

 * there should be a "deep-check" button for each directory to perform a
   recursive traversal and check/verify/repair everything reachable from that
   point. The page this returns should show aggregate information about the
   check/repair: a count of how many files/dirs were examined, how many were
   healthy, how many had problems, etc. The page should have a line or two
   about each problem.

 * there should be machine-parseable versions of these buttons: POST
   operations that return JSON with the same information as the
   human-targeted HTML described above.

 * serious problems (like hash failures) should be automatically reported to
   some centralized Incident Gatherer, so we can discover bugs, failing
   drives, or malice.

 * allmydata should be able to run a periodic checker/repairer on customer
   rootcaps. We should be able to count the number of missing/bad shares and
   track it over time (to inform us of the impact of bouncing/moving storage
   servers, discover failing drives, etc). We need to find out how long the
   full check/verify/repair process takes, so we can decide upon a suitable
   repeat rate (perhaps once a month).

 * to handle bad immutable shares, we should add a 'tahoe check-share'
   command that can be run on the storage-server side and check all the
   hashes of a single share file on disk. If file-verify observes a bad hash,
   we should be able to go to the disk and use this tool to see if the
   problem is transient or persistent, to make decisions about the stability
   of that disk.
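
The difference between checking, verifying, and repair planning in the list
above can be sketched roughly as follows. This is only an illustration, not
the actual Tahoe code: the function names, the dict-of-shares representation,
and the use of one plain SHA-256 digest per share are all assumptions standing
in for whatever hashes the real share format carries.

```python
import hashlib

# Hypothetical model: `shares` maps share number -> bytes read from a server,
# `expected_hashes` maps share number -> hex digest the share should have.

def check(shares):
    """Checking: just count the shares that were found."""
    return len(shares)

def verify(shares, expected_hashes):
    """Verifying: read each share's contents and compare its hash.

    Returns (good, bad) lists of share numbers.
    """
    good, bad = [], []
    for shnum, data in sorted(shares.items()):
        digest = hashlib.sha256(data).hexdigest()
        if digest == expected_hashes.get(shnum):
            good.append(shnum)
        else:
            bad.append(shnum)
    return good, bad

def shares_to_repair(shares, expected_hashes, total_shares):
    """Repair planning: list share numbers that are bad or missing."""
    good, bad = verify(shares, expected_hashes)
    missing = [n for n in range(total_shares) if n not in shares]
    return sorted(bad + missing)
```

With four expected shares, one missing and one corrupted, check() would still
report three shares present, while verify() would flag the corrupted one and
shares_to_repair() would list both the corrupted and the missing share.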

I'd be happy with a 1.3.0 release that accomplished just those goals. There
are some other tickets we've got floating around with relatively high
priority, but just having a working checker/repairer would be enough for me.

We're fairly close on most of these fronts. The mutable
checker/verifier/repairer is feature-complete and just needs a few more unit
tests before I'd call it done. Zooko is working on the immutable side: last I
heard, I believe the checker is complete, the verifier is halfway there, and
the repairer doesn't exist yet.

The webapi buttons are in place, but I plan to move them around to make
them more usable (in particular, I plan to have a sort of 'Get Info' link for
each file/dir, and the check/verify/repair button will be on that page
instead of on the main directory listing page). These pages return
human-oriented HTML with checker/repair results, but I am not yet
satisfied with the information contained therein, particularly for the
recursive "deep-check" case. There are no JSON-bearing pages at all, which
means the automated allmydata periodic check/repair can't be written yet.

The Foolscap Incident Gatherer is being used to report hash failures, etc,
and with a little bit of testing I'll be happy about it. The usability of
this will be improved by a new version of Foolscap that I should release
today or tomorrow. We don't have a dedicated logging channel for less severe
failures (like missing shares), which might be nice, but I think the JSON
results would be good enough (basically pushing the work over to the ops
side, who need to write the scripts to drive check/repair and accumulate the
results).
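
As a rough idea of what such an ops-side script might do with the JSON
results, here is a minimal sketch that accumulates per-file results into the
kind of aggregate counts described for deep-check. The JSON pages don't exist
yet, so every field name here ("path", "healthy", "good_shares",
"expected_shares") is a hypothetical placeholder, not a defined format.

```python
import json

def summarize(results_json):
    """Aggregate a JSON list of per-file check results.

    The input format is an assumption: a list of objects with the
    hypothetical keys "path", "healthy", "good_shares", "expected_shares".
    """
    results = json.loads(results_json)
    summary = {"examined": 0, "healthy": 0, "unhealthy": 0, "problems": []}
    for r in results:
        summary["examined"] += 1
        if r["healthy"]:
            summary["healthy"] += 1
        else:
            summary["unhealthy"] += 1
            # one line per problem, as on the human-oriented page
            summary["problems"].append(
                "%s: %d of %d shares good"
                % (r["path"], r["good_shares"], r["expected_shares"]))
    return summary
```

Running something like this periodically and storing the counts would give
the missing/bad share numbers to track over time.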

The 'tahoe check-share' tool doesn't exist yet. I figure it would take about
a day to write. I'd be happy to ship 1.3.0 without it.

So, I think we're close. I'm going to be concentrating on the webapi buttons
and on their results, including the JSON form. I believe Zooko will be
concentrating on the immutable verifier and (more important, in my opinion)
the repairer.

I think I can get my portions done by about the end of the week. If Zooko can
do the same for the immutable code, then we might be able to get 1.3.0 out
next week.

Anyways, that's my plan.. let's see what actually happens :)

cheers,
 -Brian