[tahoe-dev] automatic repair/renewal : where should it go?
Shawn Willden
shawn at willden.org
Sat Aug 29 07:33:20 PDT 2009
On Thursday 27 August 2009 02:18:52 am Brian Warner wrote:
> I'd like to provoke a discussion about file repair, to see if we can
> make some progress on improving the process.
I'd like to confuse your question by adding more issues to consider ;-)
One thing that has been concerning me ever since I started doing the math on
reliability is the issue of chained access. The common way for people to use
Tahoe is to have a small number of dircaps (often only one) that they keep
track of, and all of their thousands (or millions) of files are accessible by
traversing directory trees rooted at one of those dircaps.
The probability that a single file will survive isn't just a function of the
encoding parameters and the failure modes of the servers holding its shares;
it also depends on the availability of its cap. If that cap is only held
in a single dirnode, and that dirnode is lost, so is the file. If the
dirnode's cap is held only in another dirnode, then there's another failure
point. And so on. A file that lives at the bottom of a deeply nested
directory tree, with the user holding only the dircap of the root, may be
much more vulnerable to failure than we'd normally expect by just looking at
the share distribution of that single file.
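The chained-access effect is easy to see with a back-of-the-envelope calculation. This is purely an illustrative sketch (not Tahoe code), and it assumes, for simplicity, that every object on the path -- the file plus each dirnode above it -- independently survives with the same probability:

```python
def reachable_probability(p_obj, depth):
    """Survival probability of a file `depth` directories below the root
    cap the user holds, assuming each object on the path (the file plus
    `depth` dirnodes) independently survives with probability p_obj.
    depth=0 means the user holds the file's cap directly."""
    return p_obj ** (depth + 1)

# Even a very reliable per-object figure erodes with depth:
#   reachable_probability(0.999, 0)  -> 0.999
#   reachable_probability(0.999, 10) -> ~0.989
```

The independence assumption is optimistic (shares of different objects often land on the same servers), so the real chained survival can be worse than this sketch suggests.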
More importantly, this means that root and near-root dircaps that serve as
gateways to large numbers of files are disproportionately valuable. IMO, this
means that they should have more conservative encoding parameters selected
and/or be repaired more aggressively.
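To make "more conservative encoding" concrete, here's a rough sketch of how the usual binomial survival model responds to spreading the same dirnode across more shares (larger N for the same k). The numbers are made up and the i.i.d. per-share survival assumption is a simplification:

```python
from math import comb

def survival(p, k, n):
    """P(at least k of n shares survive), with each share independently
    surviving with probability p (a standard binomial tail sum)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# A heavily-used dirnode could, e.g., be encoded 3-of-20 instead of the
# default 3-of-10; with the same per-server reliability this strictly
# raises its survival probability:
#   survival(0.7, 3, 20) > survival(0.7, 3, 10)
```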
I agree that the repair threshold should be fuzzy, and I think it should be
based on some notion of file "importance" in addition to file "weakness".
Ideally, file "importance" should be "importance to the user", but obviously
that's not easy to determine without adding a lot of user-managed metadata.
I have some ideas in that direction (based -- unsurprisingly, I'm sure -- on
applying different reliability target probabilities), but let's not go there
right now.
What is pretty easy to determine while doing a deep check of a directory tree
is how many files are accessible through a given dirnode. It's not generally
possible to know how many other dirnodes can be used to access those same
files, but I think that can be ignored in the name of conservatism.
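The counting itself is a straightforward post-order walk. The sketch below is hypothetical -- `Node`, `children()`, and `is_dir` are stand-ins, not the actual Tahoe deep-check API -- and it deliberately re-counts files reachable through shared dirnodes along each path, which over-counts in the conservative direction described above:

```python
class Node:
    """Stand-in for a filenode/dirnode; kids=None means a plain file."""
    def __init__(self, cap, kids=None):
        self.cap = cap
        self.kids = kids

    @property
    def is_dir(self):
        return self.kids is not None

    def children(self):
        return self.kids

def count_reachable(node, counts):
    """Return the number of files reachable from `node`, recording the
    total for every dirnode (keyed by its cap) in `counts`."""
    if not node.is_dir:
        return 1
    total = sum(count_reachable(child, counts) for child in node.children())
    counts[node.cap] = total
    return total

# Usage: a root holding one file plus a subdirectory of two files.
root = Node("root", [Node("f1"), Node("sub", [Node("f2"), Node("f3")])])
counts = {}
count_reachable(root, counts)   # counts == {"sub": 2, "root": 3}
```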
So, I'd like to see a fuzzy repair threshold function that considers both file
weakness and file importance (for dirnodes) in determining which files to
repair. Ultimately, dirnode "repair" should even deploy additional shares to
further protect a perfectly healthy dirnode that has become the gateway to
large numbers of files. Obviously there are lots of
obstacles in the way of achieving that goal, but I think it's worth keeping
in mind.
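One possible shape for such a fuzzy threshold -- purely illustrative, with made-up scaling and a hypothetical `reachable_files` figure taken from the deep-check walk above: weakness grows as healthy shares drop toward k, and importance grows (here, logarithmically) with the number of files reachable through the object.

```python
from math import log10

def repair_priority(healthy_shares, k, n, reachable_files):
    """Fuzzy repair priority. Weakness is 0 for a fully-healthy object
    and 1 when only k shares remain; importance is 1 for a plain file
    and grows with the files reachable through a dirnode."""
    weakness = (n - healthy_shares) / (n - k)
    importance = 1.0 + log10(max(1, reachable_files))
    return weakness * importance

def should_repair(healthy_shares, k, n, reachable_files, threshold=0.5):
    """Repair when priority crosses a (tunable, arbitrary) threshold."""
    return repair_priority(healthy_shares, k, n, reachable_files) >= threshold
```

Under this sketch, a 3-of-10 dirnode gating 10,000 files gets repaired at 8 healthy shares, while a plain file in the same condition does not -- exactly the "importance in addition to weakness" behavior argued for above.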
Shawn.