1 | = THIS PAGE DESCRIBES HISTORICAL ARCHITECTURE CHOICES: THE CURRENT CODE DOES NOT WORK AS DESCRIBED HERE. = |
---|
2 | |
---|
3 | When a file is uploaded, the encoded shares are sent to other peers. But to |
---|
4 | which ones? The PeerSelection algorithm is used to make this choice. |
---|
5 | |
---|
6 | In the old (May 2007) version, the verifierid is used to consistently-permute |
---|
7 | the set of all peers (by sorting the peers by HASH(verifierid+peerid)). Each |
---|
8 | file gets a different permutation, which (on average) will evenly distribute |
---|
9 | shares among the grid and avoid hotspots. |
---|
10 | |
---|
11 | This permutation places the peers around a 2^256^-sized ring, like the rim of |
---|
12 | a big clock. The 100-or-so shares are then placed around the same ring (at 0, |
---|
13 | 1/100*2^256^, 2/100*2^256^, ... 99/100*2^256^). Imagine that we start at 0 with |
---|
14 | an empty basket in hand and proceed clockwise. When we come to a share, we |
---|
15 | pick it up and put it in the basket. When we come to a peer, we ask that peer |
---|
16 | if they will give us a lease for every share in our basket. |
---|
17 | |
---|
18 | The peer will grant us leases for some of those shares and reject others (if |
---|
19 | they are full or almost full). If they reject all our requests, we remove |
---|
20 | them from the ring, because they are full and thus unhelpful. Each share they |
---|
21 | accept is removed from the basket. The remainder stay in the basket as we |
---|
22 | continue walking clockwise. |
---|
23 | |
---|
24 | We keep walking, accumulating shares and distributing them to peers, until |
---|
25 | either we find a home for all shares, or there are no peers left in the ring |
---|
26 | (because they are all full). If we run out of peers before we run out of |
---|
27 | shares, the upload may be considered a failure, depending upon how many |
---|
28 | shares we were able to place. The current parameters try to place 100 shares, |
---|
29 | of which 25 must be retrievable to recover the file, and the peer selection |
---|
30 | algorithm is happy if it was able to place at least 75 shares. These numbers |
---|
31 | are adjustable: 25-out-of-100 means an expansion factor of 4x (every file in |
---|
32 | the grid consumes four times as much space when totalled across all |
---|
33 | StorageServers), but is highly reliable (the actual reliability is a binomial |
---|
34 | distribution function of the expected availability of the individual peers, |
---|
35 | but in general it goes up very quickly with the expansion factor). |
---|
36 | |
---|
37 | If the file has been uploaded before (or if two uploads are happening at the |
---|
38 | same time), a peer might already have shares for the same file we are |
---|
39 | proposing to send to them. In this case, those shares are removed from the |
---|
40 | list and assumed to be available (or will be soon). This reduces the number |
---|
41 | of uploads that must be performed. |
---|
42 | |
---|
43 | When downloading a file, the current release just asks all known peers for |
---|
44 | any shares they might have, chooses the minimal necessary subset, then starts |
---|
45 | downloading and processing those shares. A later release will use the full |
---|
46 | algorithm to reduce the number of queries that must be sent out. This |
---|
47 | algorithm uses the same consistent-hashing permutation as on upload, but |
---|
48 | instead of one walker with one basket, we have 100 walkers (one per share). |
---|
49 | They each proceed clockwise in parallel until they find a peer, and put that |
---|
50 | one on the "A" list: out of all peers, this one is the most likely to be the |
---|
51 | same one to which the share was originally uploaded. The next peer that each |
---|
52 | walker encounters is put on the "B" list, etc. |
---|
53 | |
---|
54 | All the "A" list peers are asked for any shares they might have. If enough of |
---|
55 | them can provide a share, the download phase begins and those shares are |
---|
56 | retrieved and decoded. If not, the "B" list peers are contacted, etc. This |
---|
57 | routine will eventually find all the peers that have shares, and will find |
---|
58 | them quickly if there is significant overlap between the set of peers that |
---|
59 | were present when the file was uploaded and the set of peers that are present |
---|
60 | as it is downloaded (i.e. if the "peerlist stability" is high). Some limits |
---|
61 | may be imposed in large grids to avoid querying a million peers; this |
---|
62 | provides a tradeoff between the work spent to discover that a file is |
---|
63 | unrecoverable and the probability that a retrieval will fail when it could |
---|
64 | have succeeded if we had just tried a little bit harder. The appropriate |
---|
65 | value of this tradeoff will depend upon the size of the grid, and will change |
---|
66 | over time. |
---|