[tahoe-dev] servers-of-happiness default of 7 prevents first-time installation from working "out of the box"
Brian Warner
warner at lothar.com
Wed Jun 16 11:59:29 PDT 2010
>>> At 2010-06-15 17:14 (-0600), Zooko O'Whielacronx wrote:
>>>
>>>> One possible solution to this would be to lower the default
>>>> servers-of-happiness from 7 to 1. This would require us to also
>>>> lower the default number of shares needed from 3 to 1, because the
>>>> current code won't let you have a servers-of-happiness lower than
>>>> your number-of-shares-needed:
In some quick testing on the train this morning, I created a local test
grid with three nodes, and tried an upload, which failed because of the
servers-of-happiness threshold. (incidentally, the exception message was
really confusing: there are almost 100 words in it, and the phrase "such
that" was repeated a lot. There is probably no good way to express this
exception clearly, but I'd lean towards a short English phrase like
"couldn't achieve sufficient diversity" and then a terse batch of
numbers and punctuation. Maybe name it InsufficientDiversityError
instead of UploadUnhappinessError?)
Then I reduced [client]shares.happy to 3, restarted, and repeated the
upload, which succeeded. 'k' (i.e. shares.needed) was still 3.
Then I reduced shares.happy to 2, shut down the third node (leaving
two), and repeated, which succeeded. So I think that assertion (SOH>=k)
isn't actually true.
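For reference, here's roughly the [client] section I was flipping
between runs (shares.needed and shares.happy are the keys under
discussion; I believe the third key is shares.total, but check the docs
rather than trusting my memory):

    [client]
    # 'k': how many shares are needed to reconstruct a file
    shares.needed = 3
    # 'N': how many shares are produced (key name from memory)
    shares.total = 10
    # the servers-of-happiness threshold being discussed here
    shares.happy = 2
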
This is good (for me), because here's what I'm doing with my personal
backupgrid:
* there are three storage servers: A, B, and C
* my client node uses the default configuration: 3-of-10
* the 1.6 uploader distributes evenly, so I'll get 3sh/3sh/4sh on those
servers
* (the 1.6 repairer is not so clever, so when repair happens, I'll
get less optimal placement, but it looks like the 1.7 repairer
benefits from the better uploader, so this will get better)
* (when we get static share-placement control, I'll change this to
explicitly put sh0/sh1/sh2 on A, sh3/4/5 on B, and sh6/7/8/9 on C,
and maybe switch to 3-of-9 encoding at the same time)
The properties I get from this arrangement (sanity-checked by the quick
sketch after this list) are:
* tolerance to two disks/servers/houses failing
* tolerance to up to 7 share failures per file. Drives that experience
single-sector errors or bitflips in one share will still provide good
service on the other two (or three) shares on that disk.
* (eventually) local repair of single share errors would be faster,
since server A would only need to pull FILESIZE/3 bytes from some
other server to replace a local share.
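Here's that quick sanity check (plain Python, nothing Tahoe-specific;
the 3/3/4 placement is the one described above):

    from itertools import combinations

    K = 3   # shares needed to reconstruct
    # share numbers held by servers A, B, C under the 3sh/3sh/4sh placement
    placement = {"A": {0, 1, 2}, "B": {3, 4, 5}, "C": {6, 7, 8, 9}}
    all_shares = set().union(*placement.values())

    # any two of the three servers failing still leaves >= K shares
    for dead in combinations(placement, 2):
        surviving = all_shares - set().union(*(placement[s] for s in dead))
        assert len(surviving) >= K, dead

    # any 7 individual share failures still leave K shares somewhere
    assert all(len(all_shares - set(lost)) >= K
               for lost in combinations(all_shares, 7))
    print("both failure patterns tolerated")
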
> What it means now is "There are at least $servers-of-happiness servers
> such that any subset of them of at least size K can deliver your
> file."
From this, and the assertion that you made above, it seems like the
servers-of-happiness design really wants you to have only one share per
server. At least, if SOH=k then one-share-per-server qualifies as a
successful upload (even if you only uploaded 'k' shares instead of the
full 'N'), and SOH>k raises the bar from there.
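Here's how I'm reading that definition, as a brute-force sketch (my own
toy checker, not the code in upload.py): it asks whether there is a set
of at least SOH servers such that every size-K subset of them
collectively holds K distinct shares (checking exactly-size-K subsets is
enough, since bigger subsets hold supersets of their shares).

    from itertools import combinations

    def is_happy(sharemap, k, soh):
        # sharemap: dict mapping server name -> set of share numbers held
        for candidate in combinations(sharemap, soh):
            if all(len(set().union(*(sharemap[s] for s in subset))) >= k
                   for subset in combinations(candidate, k)):
                return True
        return False

    # one-share-per-server with SOH=k qualifies, as described above:
    print(is_happy({"A": {0}, "B": {1}, "C": {2}}, k=3, soh=3))   # True
    # my 3/3/4 grid with SOH=3:
    print(is_happy({"A": {0, 1, 2}, "B": {3, 4, 5}, "C": {6, 7, 8, 9}},
                   k=3, soh=3))                                   # True

(One oddity of this literal reading: with SOH < k there are no size-K
subsets to check, so any SOH servers satisfy it vacuously. I don't know
whether the real code treats that case more sensibly.)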
For my grid, I guess I could use SOH=3, and stick with my 3-of-10
defaults. Uploads would fail unless all three servers were online at the
same time. I'd really prefer to set this threshold at 2 servers (because
one of the servers was offline for a week, and I wanted to run a backup
anyway, and I want a future repair/rebalance to fix the situation).
Since the SOH>=k assertion does not seem to be enforced, I could achieve
my goals in 1.7 by setting SOH=2, but if it were enforced, I'd be stuck.
Does the thinking behind SOH say that I should be using 1-of-3 instead
of 3-of-9 or 3-of-10? I don't want to do that (some rough numbers are
sketched after this list), because:
* 1-of-3 is less resilient to failure than 3-of-9 (when single-share
failures are possible: not all disks fail catastrophically)
* local repair bandwidth would be greater (server A would have to pull
FILESIZE/1 bytes to replace a single lost share, not FILESIZE/3)
* I might add a fourth server D some day, at which point I'd like to
rebalance and still keep all of my filecaps. A rebalancer could
change those 3/3/4 distributions into 2/2/3/3 without changing the
shares in any way. Changing to 1-of-3 encoding would require
re-encoding and re-uploading the entire archive.
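Those rough numbers, for what they're worth: this assumes independent
per-share failures with a made-up probability p, which real disks
certainly don't obey, so treat it as an illustration of the shape of the
argument rather than a reliability claim.

    from math import comb

    def survival_prob(k, n, p):
        # probability that at least k of n shares survive, if each share
        # independently fails with probability p (a toy model)
        return sum(comb(n, m) * (1 - p) ** m * p ** (n - m)
                   for m in range(k, n + 1))

    p = 0.1   # assumed per-share failure probability, purely illustrative
    print("1-of-3 survives:", survival_prob(1, 3, p))   # ~0.99900
    print("3-of-9 survives:", survival_prob(3, 9, p))   # ~0.999997

    # repair bandwidth: with k-of-N, each share is FILESIZE/k bytes, so a
    # server replacing one lost share (while still holding its other
    # local shares) pulls one share's worth of data from elsewhere
    FILESIZE = 3 * 10**6
    print("1-of-3 repair pull:", FILESIZE // 1)   # a full copy
    print("3-of-9 repair pull:", FILESIZE // 3)   # one third of the file
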
We picked 3-of-10 because it was conservative and flexible: both 'k' and
'N' are low enough to keep the overhead small, but high enough that
you'll get awesome reliability if you actually have 10 distinct servers
at your disposal. Making Tahoe work well for small grids, even just one
or two servers, is important to me. I'd rather not require users to
change their encoding defaults just because they don't have a lot of
servers.
So, I don't propose that we change anything for 1.7. I'd like to know
more about the supposed SOH>=k assertion, because I want 1.7 to not
enforce it, so I can get that two-servers-are-sufficient property from
my three-node 3-of-10 backupgrid. (if my testing is just getting lucky
or hitting a bug, I'd like to clear that up).
For 1.8, I'd like to investigate the small-grid (and
starts-small-but-grows) behavior, probably starting with user-facing
docs that explain what you get and what you need to do. If those docs
are too large, then I'd want to fix the design to allow them to be
small.
Incidentally, here are some notes I took while starting to review the
new uploader code.
* The SOH value is controlled in tahoe.cfg by [client]shares.happy,
but the value is expressed in units of *servers*. This is really
confusing, and changes the meaning of a pre-existing tahoe.cfg value.
This needs to be renamed to [client]servers.happy . This is also a
backwards-compatibility issue: existing clients which have set
shares.happy will start behaving very differently when they upgrade.
* in upload.py, "servers_of_happiness" is used both as a function name
and as an attribute/variable: very confusing. The function name
should be a verb (like "measure_happiness()"), leaving the attribute
name to be a noun. Further, the noun form gets completely different
names in different places: I see it appear in the following forms:
- "servers_of_happiness"
- "desired" (in locate_all_shareholders())
- "encoding_param_happy"
- part of the "share_counts" tuple
- part of the "params" tuple
The "desired" value should be renamed to servers_of_happiness, and
the "share_counts" tuple should be renamed (or SOH should be moved
out of it) because SOH has nothing to do with shares.
* it looks like the new uploader waits for *all* "readonly" servers to
respond before asking any of the non-readonly servers. This will
induce delays and increase network/disk usage. Stalled servers, even
if they are far from the start of the permuted ring, will stall the
upload. (there's a tradeoff between performance/cost and increasing
the probability that we discover+use existing shares.. this
implementation is at one extreme of that tradeoff).
While a complete uploader rewrite was not in scope for this project,
when we do get around to doing one, I'd like to follow the pattern I've
come to like in mutable/* and in the new (immutable) downloader. This is
a state machine, in which we perform queries until we get enough
information to start the process. The uploader is more constrained than
the downloader (because we really need to commit to a servermap before
we start encoding and uploading shares), but I'd still use the state
machine for the share-placement phase. The state machine should start at
the beginning of the permuted ring and ask servers to hold shares (or,
if we believe that they'll say no because of a size constraint or
read-only flag, just ask them if they already have a share). Keep asking
until we've got a home for all shares, then ask a little bit more.
If we see any evidence of pre-existing shares, we should be willing to
ask even further, in the hopes of finding them all. But we shouldn't
always ask every server. In the ideal+common case (where we're uploading
a new file to servers that all have room), we should only talk to
N+epsilon servers. Some day, when we have tools that automatically
rebalance the grid, this will be the normal case, and I'd like it to be
efficient.
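To make that concrete, here's roughly the placement loop I have in mind.
Everything on the server objects here (is_readonly, query_existing_shares,
allocate) is invented for the sketch; it is not the real storage
protocol, just the shape of the state machine:

    def place_shares(permuted_servers, num_shares, epsilon=2):
        homeless = set(range(num_shares))   # share numbers without a home
        placements = {}                     # shnum -> server
        extra_queries = epsilon             # how far past "done" to probe

        for server in permuted_servers:
            if not homeless and extra_queries <= 0:
                break
            if server.is_readonly() or not homeless:
                # it won't (or needn't) hold anything new; just ask what
                # shares it already has (invented method, for the sketch)
                already = server.query_existing_shares()
            else:
                # invented method: returns (shares it already had,
                # shares it agreed to accept)
                already, accepted = server.allocate(sorted(homeless))
                for shnum in accepted:
                    placements[shnum] = server
                    homeless.discard(shnum)
            if already:
                # evidence of pre-existing shares: be willing to probe a
                # bit further in the hope of finding the rest
                extra_queries = epsilon
                for shnum in already:
                    placements.setdefault(shnum, server)
                    homeless.discard(shnum)
            if not homeless:
                extra_queries -= 1
        return placements, homeless

The real version has to handle errors, timeouts, and duplicate
placements, but the property I care about is that in the clean case (new
file, servers with room) it touches only N+epsilon servers before moving
on to the diversity-rearrangement step.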
(incidentally, one thing I keep in mind when I'm using
storage_client.get_servers_for_index() is how that's going to work when
we have millions of servers. So I pretend that get_servers_for_index()
is a generator that might actually talk to some central server (which
knows about the full serverlist) and get back bundles of dozens or
hundreds of servers at a time. So I think in terms of touching just a
few servers at a time, and rarely enumerating all of them at once.)
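i.e. something with roughly this shape (all names invented for the
sketch; the real get_servers_for_index() doesn't work this way today):

    from itertools import islice

    def fetch_server_batch(storage_index, offset, count):
        # stand-in for "ask whatever knows the full serverlist for the
        # next bundle"; canned data, purely for the sketch
        all_servers = ["server-%03d" % i for i in range(1000)]
        return all_servers[offset:offset + count]

    def iter_permuted_servers(storage_index, batch_size=50):
        # yield servers in permuted order, a bundle at a time, so the
        # caller never has to enumerate the whole grid
        offset = 0
        while True:
            batch = fetch_server_batch(storage_index, offset, batch_size)
            if not batch:
                return
            for server in batch:
                yield server
            offset += len(batch)

    # a caller that only wants N+epsilon servers pulls just one bundle:
    print(list(islice(iter_permuted_servers("some-SI"), 12)))
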
Once we've decided we've spent enough effort locating shares, we can
rearrange the proposed servermap to achieve diversity goals. This may
involve cancelling storage requests that we previously sent out, or
sending new requests to existing servers. (e.g. we allocated sh0 to
server A, but then discovered a copy of sh0 on server B, so we cancel
the sh0 at A and replace it with a sh1 at A). The goal, as always, is to get
the shares living on the earliest-in-the-permuted-ring servers, with the
low-numbered "primary" shares at the beginning. The rebalancer/repairer
will work towards this same goal, which will speed up download and
achieve good (uncorrelated) distribution. We want all of our tools to
push towards the same ideal state, so the long-term behavior is nicely
convergent despite being spread over multiple uncoordinated systems.
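A toy version of that sh0/sh1 shuffle (again, invented names, not the
uploader's actual data structures):

    def dedupe_and_refill(ring_order, proposed, discovered, num_shares):
        # proposed:   shnum -> server we asked to hold it
        # discovered: shnum -> server that already has it
        # Cancel proposed copies of shares that already exist, then give
        # the freed (earliest-in-ring) servers the lowest-numbered shares
        # that still need homes.  Illustration only.
        final = dict(discovered)
        cancelled = []
        for shnum, server in proposed.items():
            if shnum in final:
                cancelled.append((shnum, server))
            else:
                final[shnum] = server
        homeless = sorted(set(range(num_shares)) - set(final))
        freed = sorted((srv for _, srv in cancelled), key=ring_order.index)
        for shnum, server in zip(homeless, freed):
            final[shnum] = server
        return final, cancelled

    ring_order = ["A", "B", "C"]
    proposed   = {0: "A"}       # we had asked A to hold sh0
    discovered = {0: "B"}       # ...but then found an sh0 already on B
    print(dedupe_and_refill(ring_order, proposed, discovered, 10))
    # -> sh0 stays on B, the sh0-at-A request gets cancelled, A gets sh1
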
Finally, once we've got a servermap that we like, we start encoding and
uploading. If we lose any servers during the upload, we keep going
unless the resulting servermap no longer makes us happy.
I really apologize for not getting involved in this issue earlier. It
was a combination of not having time to be involved (new job, the usual
excuses) and incorrectly believing that I knew the scope of the changes
to the uploader. I really appreciate Kevan's hard work and thoroughness
(especially on the design, docs, and tests), and I'm not going to ask
for any changes on the eve of a release. But I'd like to find a way to
achieve my own personal goals, and the small-grid goals that I think are
important for Tahoe users in general, in the 1.8 release, and if
possible in 1.7.
cheers,
-Brian