[tahoe-dev] automatic repair/renewal : where should it go?
Shawn Willden
shawn at willden.org
Sat Aug 29 07:33:20 PDT 2009
On Thursday 27 August 2009 02:18:52 am Brian Warner wrote:
> I'd like to provoke a discussion about file repair, to see if we can
> make some progress on improving the process.
I'd like to confuse your question by adding more issues to consider ;-)
One thing that has been concerning me ever since I started doing the math on
reliability is the issue of chained access. The common way for people to use
Tahoe is to have a small number of dircaps (often only one) that they keep
track of, and all of their thousands (or millions) of files are accessible by
traversing directory trees rooted at one of those dircaps.
The probability that a single file will survive isn't just a function of the
encoding parameters and the failure modes of the servers holding its shares;
it also depends on the availability of its cap. If that cap is only held
in a single dirnode, and that dirnode is lost, so is the file. If the
dirnode's cap is held only in another dirnode, then there's another failure
point. And so on. A file that lives at the bottom of a deeply nested
directory tree, with the user holding only the dircap of the root, may be
much more vulnerable to failure than we'd normally expect by just looking at
the share distribution of that single file.
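The chained-access effect is easy to see with a back-of-the-envelope calculation. This is purely an illustrative sketch (not Tahoe code), and it assumes, for simplicity, that every object on the path -- the file plus each dirnode above it -- independently survives with the same probability:

```python
def reachable_probability(p_obj, depth):
    """Survival probability of a file `depth` directories below the root
    cap the user holds, assuming each object on the path (the file plus
    `depth` dirnodes) independently survives with probability p_obj.
    depth=0 means the user holds the file's cap directly."""
    return p_obj ** (depth + 1)

# Even a very reliable per-object figure erodes with depth:
#   reachable_probability(0.999, 0)  -> 0.999
#   reachable_probability(0.999, 10) -> ~0.989
```

The independence assumption is optimistic (shares of different objects often land on the same servers), so the real chained survival can be worse than this sketch suggests.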
More importantly, this means that root and near-root dircaps that serve as
gateways to large numbers of files are disproportionately valuable. IMO, this
means that they should have more conservative encoding parameters selected
and/or be repaired more aggressively.
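To make "more conservative encoding" concrete, here's a rough sketch of how the usual binomial survival model responds to spreading the same dirnode across more shares (larger N for the same k). The numbers are made up and the i.i.d. per-share survival assumption is a simplification:

```python
from math import comb

def survival(p, k, n):
    """P(at least k of n shares survive), with each share independently
    surviving with probability p (a standard binomial tail sum)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# A heavily-used dirnode could, e.g., be encoded 3-of-20 instead of the
# default 3-of-10; with the same per-server reliability this strictly
# raises its survival probability:
#   survival(0.7, 3, 20) > survival(0.7, 3, 10)
```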
I agree that the repair threshold should be fuzzy, and I think it should be
based on some notion of file "importance" in addition to file "weakness".
Ideally, file "importance" should be "importance to the user", but obviously
that's not easy to determine without adding a lot of user-managed metadata.
I have some ideas in that direction (based -- unsurprisingly, I'm sure -- on
applying different reliability target probabilities), but let's not go there
right now.
What is pretty easy to determine while doing a deep check of a directory tree
is how many files are accessible through a given dirnode. It's not generally
possible to know how many other dirnodes can be used to access those same
files, but I think that can be ignored in the name of conservatism.
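The counting itself is a straightforward post-order walk. The sketch below is hypothetical -- `Node`, `children()`, and `is_dir` are stand-ins, not the actual Tahoe deep-check API -- and it deliberately re-counts files reachable through shared dirnodes along each path, which over-counts in the conservative direction described above:

```python
class Node:
    """Stand-in for a filenode/dirnode; kids=None means a plain file."""
    def __init__(self, cap, kids=None):
        self.cap = cap
        self.kids = kids

    @property
    def is_dir(self):
        return self.kids is not None

    def children(self):
        return self.kids

def count_reachable(node, counts):
    """Return the number of files reachable from `node`, recording the
    total for every dirnode (keyed by its cap) in `counts`."""
    if not node.is_dir:
        return 1
    total = sum(count_reachable(child, counts) for child in node.children())
    counts[node.cap] = total
    return total

# Usage: a root holding one file plus a subdirectory of two files.
root = Node("root", [Node("f1"), Node("sub", [Node("f2"), Node("f3")])])
counts = {}
count_reachable(root, counts)   # counts == {"sub": 2, "root": 3}
```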
So, I'd like to see a fuzzy repair threshold function that considers both file
weakness and file importance (for dirnodes) in determining which files to
repair. Ultimately, dirnode "repair" should even deploy additional shares to
further protect a perfectly healthy dirnode that has become the gateway to
large numbers of files. Obviously there are lots of
obstacles in the way of achieving that goal, but I think it's worth keeping
in mind.
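One possible shape for such a fuzzy threshold -- purely illustrative, with made-up scaling and a hypothetical `reachable_files` figure taken from the deep-check walk above: weakness grows as healthy shares drop toward k, and importance grows (here, logarithmically) with the number of files reachable through the object.

```python
from math import log10

def repair_priority(healthy_shares, k, n, reachable_files):
    """Fuzzy repair priority. Weakness is 0 for a fully-healthy object
    and 1 when only k shares remain; importance is 1 for a plain file
    and grows with the files reachable through a dirnode."""
    weakness = (n - healthy_shares) / (n - k)
    importance = 1.0 + log10(max(1, reachable_files))
    return weakness * importance

def should_repair(healthy_shares, k, n, reachable_files, threshold=0.5):
    """Repair when priority crosses a (tunable, arbitrary) threshold."""
    return repair_priority(healthy_shares, k, n, reachable_files) >= threshold
```

Under this sketch, a 3-of-10 dirnode gating 10,000 files gets repaired at 8 healthy shares, while a plain file in the same condition does not -- exactly the "importance in addition to weakness" behavior argued for above.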
Shawn.