Opened at 2008-02-08T02:49:35Z
Last modified at 2013-06-25T15:58:18Z
#302 closed task
stop permuting peerlist, use SI as offset into ring instead? — at Initial Version
Reported by: | warner | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | undecided |
Component: | code-peerselection | Version: | 0.7.0 |
Keywords: | repair newcaps newurls performance preservation upload | Cc: | |
Launchpad Bug: | | | |
Description
We were chatting today about a subject that comes up about every three months: why do we use a consistently-permuted peerlist when choosing where shares should be placed?
The StorageIndex is the output of a hash function and therefore randomly distributed, so the set of all storage indices in the network should be (in the long run) evenly distributed across the SHA-256 2^256 numberspace.
Our current Tahoe2 algorithm uses the file's Storage Index to permute a list of all known storage servers (by hashing the (SI+peerid) string and sorting the result). This adds an additional level of random distribution: each file gets a different ordering of storage servers.
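A minimal sketch of that selection rule, assuming the storage index and peerids are byte strings (the function name is illustrative, not the actual Tahoe code):

```python
from hashlib import sha256

def permuted_servers(storage_index, peerids):
    """Order the known servers for one file by hashing (SI + peerid).

    Each file gets its own pseudo-random ordering of servers, so load is
    decorrelated from where a server happens to sit on the ring.
    """
    return sorted(peerids,
                  key=lambda peerid: sha256(storage_index + peerid).digest())

# the uploader then walks this list, asking each server in turn to
# accept a lease, until all N shares are placed
```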
The alternate approach that Zooko suggested today was to skip the permutation and instead define the algorithm as follows (a sketch follows the list):
- put the storage servers in a circle, placed by their nodeid
- use the storage index as a starting location
- travel clockwise, asking each server in turn to accept a lease
- continue travelling until all shares are placed, or we've run out of servers
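A comparable sketch, assuming peerids and storage indices are comparable values on the same 2^256 ring (again, illustrative names only):

```python
from bisect import bisect_left

def ring_servers(storage_index, peerids):
    """Return the servers in the order the uploader would ask them:
    clockwise around the ring, starting at the storage index."""
    ring = sorted(peerids)                    # servers sit at their peerid
    start = bisect_left(ring, storage_index)  # first server at or past the SI
    return [ring[(start + i) % len(ring)] for i in range(len(ring))]
```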
The reason we went with the permuted list was because we were concerned about the effects of non-uniform storage server capacities. There are two situations to look at: light loading (i.e. nobody is full and all lease requests are accepted), and heavy loading (some or most storage servers are full, and are rejecting lease requests, so the uploader will visit other servers to place their shares).
For light loading, the non-permuted approach causes any individual storage server to experience a load roughly equal to the percentage of the ring that is covered by its N counter-clockwise neighbors. On average this will be the same for all peers, but peerids will not be evenly spaced: a server whose N counter-clockwise neighbors span a larger-than-average arc will see more traffic than the rest.
For heavy loading, once a server is full, all of the traffic that would have landed on it will be redirected clockwise around the ring (imagine rain flowing across the ground until it finds a hole large enough to form a puddle). This will result in a concentration of load on the server just past the full region, and may cause that server to fill quickly. The likely result is a large section of the ring which is full, while there may be more space available on the other side of the ring.
In contrast, the permuted approach removes the correlation between server locations: each file sees all servers in different locations. So instead of having "hot spots" (either caused by randomly-clustered peerids, or randomly-clustered full servers), the load will be distributed more evenly. We would expect the heavily-loaded grid to see all servers get full at roughly the same time.
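That intuition is easy to check with a toy Monte-Carlo comparison of the light-load case (all grid sizes below are made up for illustration): under the ring rule a server's load mirrors the uneven spacing of peerids, while under permutation every server's load stays close to the mean.

```python
import random
from collections import Counter
from hashlib import sha256

RING = 2 ** 256
NUM_SERVERS, NUM_FILES, N = 40, 20000, 10      # illustrative sizes only

peerids = sorted(random.randrange(RING) for _ in range(NUM_SERVERS))
files = [random.randrange(RING) for _ in range(NUM_FILES)]

def ring_choice(si):
    # non-permuted: first N servers clockwise from the storage index
    start = next((i for i, p in enumerate(peerids) if p >= si), 0)
    return [peerids[(start + i) % NUM_SERVERS] for i in range(N)]

def permuted_choice(si):
    # permuted: first N servers in this file's private ordering
    return sorted(peerids,
                  key=lambda p: sha256(b"%d-%d" % (si, p)).digest())[:N]

for name, choose in (("ring", ring_choice), ("permuted", permuted_choice)):
    load = Counter(s for si in files for s in choose(si))
    counts = [load[p] for p in peerids]
    print(name, "shares per server: min", min(counts), "max", max(counts))
```

Running something like this typically shows a much wider min/max spread for the ring rule than for the permuted rule, which is the "hot spot" concern described above.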
There are two arguments in favor of switching to the non-permuted approach. The first is simplicity: it is slightly easier to explain the non-permuted algorithm, and it is easier to predict where any given file's shares will wind up. The second is a potential benefit for repair. The issue is as follows: a storage server has just died (the hard drive experienced a fatal error), and all of those shares are gone. Repair can be invoked to replace the missing shares and bring the file back up to full strength, but which files need to be repaired? The most convenient list of storage indexes that need repairing was on the server that just died. Is there some other way to construct this list of repair work?
(The assumption is that failure-driven repair is more sustainable than constant repair of all known files. This depends heavily upon the numbers we use: how many servers, how many files, how many clients, how many repair processes, what bandwidth is consumed by repair, etc).
The benefit that non-permuted share distribution would offer is in the resulting correlation between shares held by server B and those held by its neighbors A and C. In a lightly-loaded grid, if all servers A+B+C have never rejected a lease request and are equally old, then every storage index on server B will also be on either A or C (assuming k>=2). Thus, if B dies, we can use A and C to construct the list of repair work that needs to be done.
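Under those assumptions, one way a repairer could rebuild B's worklist is a simple recomputation over the neighbors' share lists (a hypothetical helper, reusing the ring_servers() sketch above; it is only as good as the A/B/C-are-equivalent assumption):

```python
def repair_list_for(dead_peerid, neighbor_share_lists, peerids, n_shares):
    """Rebuild the dead server's probable share list from its neighbors.

    neighbor_share_lists maps a neighboring peerid (e.g. A and C) to the
    storage indices it holds.  A storage index is a repair candidate if
    the dead server falls within the first n_shares servers clockwise
    from it.
    """
    candidates = set()
    for shares in neighbor_share_lists.values():
        candidates.update(shares)
    return sorted(si for si in candidates
                  if dead_peerid in ring_servers(si, peerids)[:n_shares])
```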
However, Rob astutely pointed out that there are plenty of other ways to accomplish this same job. For example, each server could be assigned a "buddy", using a simple foolscap protocol, and each time server B adds a new share, it tells its buddy "D" about the storage index. If B dies, we ask the buddy for the share list and dump it into the repair queue. We could use a central server for this purpose, or distribute it out: what really matters is that we have a way to find out who the buddy is.
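A plain-Python sketch of that bookkeeping, with the remote-call machinery waved away (the class and method names are invented, not an existing foolscap interface):

```python
class BuddyBookkeeping:
    """Sketch of the buddy idea: each server mirrors the storage indices
    it accepts to a designated buddy, so the list survives its death."""

    def __init__(self, my_peerid, buddy):
        self.my_peerid = my_peerid
        self.buddy = buddy               # some remote reference to server D
        self.held = set()                # shares we hold locally
        self.mirrored = {}               # peerid -> shares our buddies hold

    def add_share(self, storage_index):
        self.held.add(storage_index)
        # in a real deployment this would be a remote (e.g. foolscap) call
        self.buddy.record_for(self.my_peerid, storage_index)

    def record_for(self, peerid, storage_index):
        # called by the servers that chose us as their buddy
        self.mirrored.setdefault(peerid, set()).add(storage_index)

    def shares_of(self, dead_peerid):
        # when a buddy dies, this list seeds the repair queue
        return sorted(self.mirrored.get(dead_peerid, ()))
```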
We need to think more about this. I like the load distribution properties of permuting, and I'm not particularly concerned about the descriptive complexity, but I too am concerned about the repair design, and would like to leave ourselves some tricks to pull out in case we run into future problems. The B-probably-has-A+C benefit of non-permutation breaks down if any of the servers are full or new, so I'm less convinced that it will be a significant help. But so much of this is guesswork... we don't really know.