[tahoe-dev] How many servers can fail?

Shawn Willden shawn at willden.org
Wed Oct 26 12:16:18 UTC 2011


I think it would simplify things greatly to further constrain share
placement so that each server gets no more than one share, so that N,
H and K all refer to servers.  I realize that there are some
interesting things that can be achieved by setting N to be a multiple
of the number of servers available, but in practice I don't think they
add enough value to offset the conceptual complexity.
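
Concretely, I'm picturing placement logic along these lines (just a
rough sketch with made-up names, not the real uploader code):

    import random

    def place_one_share_per_server(num_shares, servers):
        """Give each of the N shares to a distinct server, or refuse."""
        if len(servers) < num_shares:
            raise RuntimeError("need at least N=%d distinct servers, "
                               "only %d connected"
                               % (num_shares, len(servers)))
        chosen = random.sample(servers, num_shares)
        return dict(enumerate(chosen))   # {share number: server}

With that constraint in place, "how many servers can fail" really is
just N minus k.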

Personally, I would like to eliminate the distinction between N and H.
I would set them to the same value, except that I've discovered that
doing so seems to dramatically reduce write availability, even when
more than N=H servers are available.

On Tue, Oct 25, 2011 at 4:53 PM, Brian Warner <warner at lothar.com> wrote:
> On 10/25/11 9:20 AM, Dirk Loss wrote:
>
>> To foster my understanding, I've tried to visualize what that means:
>>
>>  http://dirk-loss.de/tahoe-lafs_nhk-defaults.png
>
> Wow, that's an awesome picture! If we ever get to produce an animated
> cartoon of precious shares and valiant servers and menacing attackers
> battling it out while the user nervously hits the Repair button, I'm
> hiring you for sure :). But yeah, your understanding is correct.
>
> It may help to know that "H" is a relatively recent addition (1.7.0,
> Jun-2010). The original design had only k=3 and N=10, but assumed that
> you'd only upload files in an environment with at least N servers (in
> fact the older design, maybe 1.2.0 in 2009, had the Introducer tell all
> clients what k,N to use, instead of them picking it for themselves). Our
> expectation was thus that you'd get no more than one share per server,
> so "losing 7 servers" was equivalent to "losing 7 shares", leaving you 3
> (>=k) left.
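
Spelling out that arithmetic, assuming one share per server:

    k, N = 3, 10
    servers_lost = 7
    shares_left = N - servers_lost   # 3
    assert shares_left >= k          # still recoverable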
>
> I designed the original uploader to allow uploads in the presence of
> fewer than N servers, by storing multiple shares per server as necessary
> to place all N shares. The code strives for uniform placement (it won't
> put 7 on server A and then 1 each on servers B,C,D, unless they're
> nearly full). My motivation was to improve the out-of-the-box experience
> (where you spin up a test grid with just one or two servers, but don't
> think to modify your k/N to match), and to allow reasonable upgrades to
> more servers later (by migrating the doubled-up shares to new servers,
> keeping the filecaps and encoding the same, but improving the
> diversity).
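
A toy version of that uniform-placement behaviour (assumed round-robin
logic, ignoring the nearly-full-server case Brian mentions):

    def uniform_placement(num_shares, servers):
        placement = {s: set() for s in servers}
        for sharenum in range(num_shares):
            placement[servers[sharenum % len(servers)]].add(sharenum)
        return placement

    # e.g. N=10 shares over a 2-server test grid -> 5 shares on each
    print(uniform_placement(10, ["A", "B"]))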
>
> There was a "shares of happiness" setting in that original uploader, but
> it was limited to throwing an exception if too many servers drop off
> *during* the upload itself (which commits to a fixed set of servers at
> the start of the process). I still expected there to be plenty of
> servers available, so re-trying the upload would still get you full
> diversity.
>
> The consequences of my choosing write-availability over reliability show
> up when some of your servers are already down when you *start* the
> upload (this wasn't a big deal for the AllMyData production grid, but
> happens much more frequently in a volunteergrid). You might think you're
> on a grid with 20 servers, but it's 2AM and most of those boxes are
> turned off, so your upload only actually gets to use 2 servers. The old
> code would cheerfully put 5 shares on each, and now you've got a 2POF
> (dual-point-of-failure). The worst case was when your combination
> client+server hadn't really managed to connect to the network yet, and
> stored all the shares on itself (SPOF). You might prefer to get a
> failure rather than a less-reliable upload: to choose reliability over
> the availability of writes.
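
To make the 2POF/SPOF examples concrete, here's a brute-force check of
how many server losses a given placement survives (illustrative only;
the name and shape of this function are made up):

    from itertools import combinations

    def failures_tolerated(placement, k):
        """placement: {server: set of share numbers}.  Largest f such
        that losing *any* f servers still leaves >= k distinct shares,
        or -1 if the file isn't recoverable even with every server up."""
        servers = list(placement)
        best = -1
        for f in range(len(servers) + 1):
            ok = all(len(set().union(*(placement[s] for s in kept))) >= k
                     for kept in combinations(servers, len(servers) - f))
            if not ok:
                break
            best = f
        return best

    # The 2AM scenario: N=10, k=3, only two servers reachable:
    print(failures_tolerated({"A": {0, 1, 2, 3, 4},
                              "B": {5, 6, 7, 8, 9}}, k=3))   # 1
    # Worst case, every share on the local node:
    print(failures_tolerated({"self": set(range(10))}, k=3)) # 0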
>
> So 1.7.0 changed the old "shares of happiness" into a more accurate (but
> more confusing) "servers of happiness" (but unfortunately kept the old
> name). It also overloaded what "k" means. So now you set "H" to be the
> size of a "target set". The uploader makes sure that any "k"-sized
> subset of this target will have enough shares to recover the file. That
> means that H and k are counting *servers* now. (N and k still control
> encoding as usual, so k also counts shares, but share *placement* is
> constrained by H and k). The uploader refuses to succeed unless it can
> get sufficient diversity, where H and k define what "sufficient" means.
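
A literal-minded reading of that condition, in code (a sketch only,
not the actual peer-selection logic):

    from itertools import combinations

    def satisfies_happiness(placement, k, H):
        """placement: {server: set of share numbers} for the servers
        that accepted shares.  True if at least H servers hold shares
        and every k-sized subset of them together holds >= k distinct
        shares (i.e. enough to recover the file)."""
        if len(placement) < H:
            return False
        return all(
            len(set().union(*(placement[s] for s in subset))) >= k
            for subset in combinations(placement, k))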
>
> (there may be situations where an upload would fail, but your data would
> still have been recoverable: shares-of-happiness is meant to ensure a
> given level of diversity/safety, choosing reliability over
> write-availability)
>
> So the new "X servers may fail and your data is still recoverable"
> number comes from H-k (both counting servers). The share-placement
> algorithm still tries for uniformity, and if it achieves that then you
> can tolerate even more failures (up to N-k if you manage to get one
> share per server).
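
Plugging in the stock defaults (k=3 and N=10 as above, and assuming
the default H=7):

    k, H, N = 3, 7, 10
    guaranteed = H - k   # 4 server failures survivable, guaranteed
    best_case = N - k    # 7, if every share landed on its own server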
>
>
> I'm still not sure the servers-of-happiness metric is ideal. While it
> lets you specify a safety level more accurately/meaningfully in the face
> of insufficient servers, it's always been harder to explain and
> understand. Some days I'm in favor of a more-absolute list of "servers
> that must be up" (I mocked up a control panel in
> http://tahoe-lafs.org/pipermail/tahoe-dev/2011-January/005944.html).
> Having a minimum-reliability constraint is good, but you also want to
> tell your client just how many servers you *expect* to have around, so
> it can tell you whether it can satisfy your demands.
>
> I still think it'd be pretty cool if the client's Welcome page had a
> little game or visualization where it could show you, given the current
> set of available servers, whether the configured k/H/N could be
> satisfied or not. Something to help explain the share-placement rules
> and help users set reasonable expectations.
>
> cheers,
>  -Brian



-- 
Shawn

