source: trunk/docs/proposed/denver.txt

1The "Denver Airport" Protocol
2
3 (discussed whilst returning robk to DEN, 12/1/06)
4
This is a scaling improvement on the "Select Peers" phase of Tahoe2. The
problem it tries to address is the storage and maintenance of the 1M-entry
peer list, and the relative difficulty of gathering long-term reliability
information on a useful number of those peers.

In DEN, each node maintains a Chord-style set of connections to other nodes:
log2(N) "finger" connections to distant peers (the first of which is halfway
across the ring, the second is 1/4 across, then 1/8th, etc). These
connections need to be kept alive with relatively short timeouts (5s?), so
any breaks can be rejoined quickly. In addition to the finger connections,
each node must also remain aware of K "successor" nodes (those which are
immediately clockwise of the starting point). The node is not required to
maintain connections to these, but it should remain informed about their
contact information, so that it can create connections when necessary. We
probably need a connection open to the immediate successor at all times.

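As a rough illustration of the ring structure described above, here is a
minimal sketch (in Python) of how a node could compute its finger targets and
track its K successors. The 160-bit ID space, the SHA-1-derived node IDs, and
the helper names are assumptions made for the example, not anything this note
specifies:

 import hashlib

 ID_BITS = 160                  # assumed ID-space size; not specified in this note
 RING = 2 ** ID_BITS

 def node_id(seed):
     # Hypothetical: derive a ring position from some node-specific seed (bytes).
     return int.from_bytes(hashlib.sha1(seed).digest(), "big")

 def finger_targets(my_id, num_fingers):
     # The i-th finger aims halfway, 1/4, 1/8th ... of the way around the ring.
     return [(my_id + RING // (2 ** (i + 1))) % RING for i in range(num_fingers)]

 def successors(my_id, known_ids, k):
     # The K peers immediately clockwise of us: we track their contact info,
     # but only hold an open connection to the first one.
     clockwise = sorted(known_ids, key=lambda nid: (nid - my_id) % RING)
     return [nid for nid in clockwise if nid != my_id][:k]
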
Since inbound connections exist too, each node has something like 2*log2(N)
plus up to 2*K connections.

Each node keeps a history of the uptime/availability of the nodes that it
remains connected to. Each message that is sent to these peers includes an
estimate of that peer's availability from the point of view of the outside
world. The receiving node averages these reports together to determine what
kind of reliability it should announce to anyone it accepts leases for. This
reliability is expressed as a percentage uptime: P=1.0 means the peer is
available 24/7, P=0.0 means it is almost never reachable.

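This note does not say how the incoming reports are combined; as one possible
sketch, a node might fold them into the figure it announces with a simple
exponential moving average (the smoothing factor and the class shape below are
invented for illustration):

 class AvailabilityEstimate:
     # Tracks what the outside world reports about our availability.
     def __init__(self, alpha=0.1):
         self.alpha = alpha        # smoothing factor: an arbitrary assumption
         self.estimate = None      # P in [0.0, 1.0], unknown until reports arrive

     def add_report(self, reported_p):
         # Fold in one peer's estimate of our uptime (0.0 .. 1.0).
         if self.estimate is None:
             self.estimate = reported_p
         else:
             self.estimate = (1 - self.alpha) * self.estimate + self.alpha * reported_p

     def announced_reliability(self):
         # The P we attach to lease acceptances; the default for a node with
         # no reports yet is another arbitrary choice.
         return self.estimate if self.estimate is not None else 0.5
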
When a node wishes to publish a file, it creates a list of (verifierid,
sharenum) tuples, and computes a hash of each tuple. These hashes then
represent starting points for the landlord search:

 starting_points = [(sharenum,sha(verifierid + str(sharenum)))
                    for sharenum in range(256)]

The node then constructs a reservation message that contains enough
information for the potential landlord to evaluate the lease, *and* to make a
connection back to the starting node:

 message = [verifierid, sharesize, requestor_furl, starting_points]

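A runnable version of the two snippets above might look like the following;
the use of SHA-256 from hashlib, the 256-share count, and the dict layout of
the message are illustrative choices rather than anything this note fixes:

 import hashlib

 def make_starting_points(verifierid, num_shares=256):
     # One (sharenum, hash) starting point per share, as described above.
     # verifierid is assumed to be bytes here.
     return [(sharenum,
              hashlib.sha256(verifierid + str(sharenum).encode()).digest())
             for sharenum in range(num_shares)]

 def make_reservation_message(verifierid, sharesize, requestor_furl, starting_points):
     # Everything a landlord needs to evaluate the lease and call us back.
     return {"verifierid": verifierid,
             "sharesize": sharesize,
             "requestor_furl": requestor_furl,
             "starting_points": starting_points}
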
The node looks through its list of finger connections and splits this message
into up to log2(N) smaller messages, each of which contains only the starting
points that should be sent to that finger connection. Specifically, we send a
starting_point to a finger A if the nodeid of that finger is <= the
starting_point and the nodeid of the next finger B is > the starting_point.
Each message sent out can contain multiple starting_points, each for a
different share.

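A sketch of that splitting rule, assuming the fingers are kept sorted by
nodeid and that starting points have been mapped into the same integer ring
space (e.g. via int.from_bytes on the digest); both representations are left
open by this note:

 def split_among_fingers(finger_ids, starting_points):
     # Group starting points by the finger responsible for them: a point goes
     # to finger A when A <= point and the next finger B > point.
     fingers = sorted(finger_ids)
     buckets = {fid: [] for fid in fingers}
     for sharenum, point in starting_points:
         responsible = None
         for fid in fingers:
             if fid <= point:
                 responsible = fid      # largest finger not beyond the point
             else:
                 break
         if responsible is None:
             responsible = fingers[-1]  # point precedes every finger: wrap around
                                        # (a case this note does not address)
         buckets[responsible].append((sharenum, point))
     return {fid: pts for fid, pts in buckets.items() if pts}
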
When a finger node receives this message, it performs the same splitting
algorithm, sending each starting_point to other fingers. Eventually a
starting_point is received by a node that knows that the starting_point lies
between itself and its immediate successor. At this point the message
switches from the "hop" mode (following fingers) to the "search" mode
(following successors).

While in "search" mode, each node interprets the message as a lease request.
It checks its storage pool to see if it can accommodate the reservation. If
so, it uses requestor_furl to contact the originator and announces its
willingness to host the given sharenum. This message will include the
reliability measurement derived from the host's counterclockwise neighbors.

If the recipient cannot host the share, it forwards the request on to the
next successor, which repeats the cycle. Each message has a maximum hop count
which limits the number of peers which may be searched before giving up. If a
node sees that it is the last such hop, it must establish a connection to
the originator and let it know that this sharenum could not be hosted.

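Putting the last two paragraphs together, a search-mode handler might look
roughly like the following; the storage-pool test, the connection helpers, and
the reply methods are all placeholder names, since this note only fixes the
accept / forward / give-up behaviour:

 def handle_search(node, request):
     # Accept the lease, forward to the next successor, or report failure.
     sharenum, point = request["starting_point"]
     if node.storage_pool.can_accommodate(request["sharesize"]):
         # Accept: call the originator back, quoting our announced reliability P.
         node.connect(request["requestor_furl"]).accept_lease(
             sharenum, reliability=node.announced_reliability())
     elif request["hops_remaining"] > 0:
         # Keep searching clockwise, burning one hop.
         request["hops_remaining"] -= 1
         node.immediate_successor().forward_search(request)
     else:
         # Last hop: tell the originator this sharenum could not be placed.
         node.connect(request["requestor_furl"]).reject_lease(sharenum)
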
The originator sends out something like 100 or 200 starting points, and
expects to get back responses (positive or negative) in a reasonable amount
of time. (Perhaps, if we receive half of the responses within time T, wait for
a total of 2T for the remaining ones.) If no response is received within the
timeout, either re-send the requests for those shares (to different fingers)
or send requests for completely different shares.

Each share represents some fraction of a point "S", such that the points for
enough shares to reconstruct the whole file sum to 1.0. I.e., if we construct
100 shares such that we need 25 of them to reconstruct the file, then each
share represents .04 points.

As the positive responses come in, we accumulate two counters: the capacity
counter (which gets a full S points for each positive response), and the
reliability counter (which gets S*(reliability-of-host) points). The capacity
counter is not allowed to go above some limit (like 4x), as determined by
provisioning. The node keeps adding leases until the reliability counter has
gone above some other threshold (somewhat larger than, but close to, 1.0).

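A small sketch of that accumulation loop; the 4x capacity cap and the 1.2
reliability target are example values standing in for "some limit" and
"somewhat larger than, but close to, 1.0":

 def select_landlords(responses, shares_needed, capacity_limit=4.0,
                      reliability_target=1.2):
     # responses: (peer, reliability) acceptances, in arrival order.
     # Each share is worth S = 1/shares_needed points.
     s = 1.0 / shares_needed
     capacity = 0.0
     reliability = 0.0
     landlords = []
     for peer, peer_reliability in responses:
         if capacity + s > capacity_limit:
             break                    # provisioning cap on raw expansion factor
         landlords.append(peer)
         capacity += s
         reliability += s * peer_reliability
         if reliability >= reliability_target:
             break                    # enough expected recoverability
     return landlords
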
[ at download time, each host will be able to provide the share back with
  probability P times an exponential decay factor related to peer death. Sum
  these probabilities to get the average number of shares that will be
  available. The interesting thing is actually the distribution of these
  probabilities, and what threshold you have to pick to get a sufficiently
  high chance of recovering the file. If there are N identical peers each
  with probability P, the number of recovered shares follows a binomial
  distribution (roughly gaussian for large N), with an average of N*P and a
  stddev of sqrt(N*P*(1-P)). The CDF of this distribution is an S-curve, with
  a sharper slope when N is large. The probability of recovering the file is
  one minus this curve's value at the threshold (the number of necessary
  shares).

  P is not actually constant across all peers; rather, we assume that it has
  its own distribution: maybe gaussian, more likely exponential (power law).
  This changes the shape of the S-curve. Assuming that we can characterize
  the distribution of P with perhaps two parameters (say meanP and stddevP),
  the S-curve is a function of meanP, stddevP, N, and threshold...

  To get 99.99% or 99.999% recoverability, we must choose a threshold value
  high enough to accommodate the random variations and uncertainty about the
  real values of P for each of the hosts we've selected. By counting
  reliability points, we are trying to estimate meanP/stddevP, so we know
  which S-curve to look at. The threshold is fixed at 1.0, since that's what
  erasure coding tells us we need to recover the file. The job is then to add
  hosts (increasing N and possibly changing meanP/stddevP) until our
  recoverability probability is as high as we want.
]

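To make the bracketed analysis concrete in its simplest form, here is a
self-contained calculation of the recovery probability when all N peers are
assumed independent and identically available with probability P; the fuller
meanP/stddevP treatment is left open, as in the note itself:

 from math import comb

 def recovery_probability(n_shares, p_alive, shares_needed):
     # P(at least shares_needed of n_shares survive), each share held by an
     # independent peer that is reachable with probability p_alive.
     return sum(comb(n_shares, k) * p_alive**k * (1 - p_alive)**(n_shares - k)
                for k in range(shares_needed, n_shares + 1))

 # Example: 100 shares, 25-of-100 encoding, peers reachable 40% of the time.
 # print(recovery_probability(100, 0.4, 25))   # roughly 0.999
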
The originator takes all acceptance messages and adds them in order to the
list of landlords that will be used to host the file. It stops when it gets
enough reliability points. Note that it does *not* discriminate against
unreliable hosts: they are less likely to have been found in the first place,
so we don't need to discriminate against them a second time. We do, however,
use the reliability points to acknowledge that sending data to an unreliable
peer is not as useful as sending it to a reliable one (there is still value
in doing so, though). The remaining reservation-acceptance messages are
cancelled and then put aside: if we need to make a second pass, we ask those
peers first.

Shares are then created and published as in Tahoe2. If we lose a connection
during the encoding, that share is lost. If we lose enough shares, we might
want to generate more to make up for them: this is done by using the leftover
acceptance messages first, then triggering a new Chord search for the
as-yet-unaccepted sharenums. These new peers will get shares from all
segments that have not yet been finished, then a second pass will be made to
catch them up on the earlier segments.

Properties of this approach:
 The total number of peers that each node must know anything about is bounded
 to something like 2*log2(N) + K, probably on the order of 50 to 100 total.
 This is the biggest advantage, since in tahoe2 each node must know at least
 the nodeid of all 1M peers. The maintenance traffic should be much less as a
 result.

 Each node must maintain open (keep-alived) connections to something like
 2*log2(N) peers. In tahoe2, this number is 0 (well, probably 1 for the
 introducer).

 During upload, each node must actively use 100 connections to a random set
 of peers to push data (just like tahoe2).

 The probability that any given share-request fails to get a response is
 roughly the number of hops it travels through times the chance that a peer
 dies while holding on to the message. This should be pretty small, as the
 message should only be held by a peer for a few seconds (more if their
 network is busy). In tahoe2, each share-request always gets a response, since
 they are made directly to the target.

I visualize the peer-lookup process as the originator creating a
message-in-a-bottle for each share. Each message says "Dear Sir/Madam, I
would like to store X bytes of data for file Y (share #Z) on a system close
to (but not below) nodeid STARTING_POINT. If you find this amenable, please
contact me at FURL so we can make arrangements.". These messages are then
bundled together according to their rough destination (STARTING_POINT) and
sent somewhere in the right direction.

Download happens the same way: lookup messages are disseminated towards the
STARTING_POINT and then search one successor at a time from there. There are
two ways that the share might go missing: if the node is now offline (or has
for some reason lost its shares), or if new nodes have joined since the
original upload and the search depth (maximum hop count) is too small to
accommodate the churn. Both result in the same amount of localized traffic.
In the latter case, a storage node might want to migrate the share closer to
the starting point, or perhaps just send the intervening newer nodes a note
so they can keep a pointer to the share.

Checking: anyone who wishes to do a filecheck needs to send out a lookup
message for every potential share. These lookup messages could have a higher
search depth than usual. It would be useful to know how many peers each
message went through before being answered: this could drive repair, by
instructing the old host (which is further from the starting point than you'd
like) to push its share closer towards the starting point.