Opened at 2011-12-17T22:32:42Z
Last modified at 2014-09-11T22:22:43Z
#1640 new defect
the mutable publisher should try harder to place all shares
| Field | Value | Field | Value |
|---|---|---|---|
| Reported by | kevan | Owned by | nobody |
| Priority | major | Milestone | soon |
| Component | code-peerselection | Version | 1.9.0 |
| Keywords | mutable upload | Cc | zooko |
| Launchpad Bug | | | |
Description
If a connection error is encountered while pushing a share to a storage server, the mutable publisher forgets about the writer object associated with that (share, server) placement. This is consistent with the pre-1.9 publisher. In high-level terms, it means that the publisher treats that share placement as probably invalid and attributes the error to a server failure or something like it. However, the pre-1.9 publisher then attempts to find another home for the share that was placed on the broken server. The current publisher doesn't.
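To make the failure mode concrete, here is a minimal Python sketch of the behavior described above. It is not the Tahoe-LAFS publisher: `StorageServer`, `Writer`, and `publish_without_reassignment` are hypothetical stand-ins, and a failed server is modeled, as an assumption, as one whose writes raise `ConnectionError`. A writer whose push fails is simply forgotten, so its share is left without a home.

```python
from __future__ import annotations

from dataclasses import dataclass


class StorageServer:
    """Hypothetical stand-in for a remote storage server."""

    def __init__(self, name: str, broken: bool = False) -> None:
        self.name = name
        self.broken = broken
        self.stored: dict[int, bytes] = {}

    def write_share(self, shnum: int, data: bytes) -> None:
        # A broken server is modeled as one that raises a connection error.
        if self.broken:
            raise ConnectionError(f"lost connection to {self.name}")
        self.stored[shnum] = data


@dataclass
class Writer:
    """Hypothetical writer bound to a single (share, server) placement."""

    shnum: int
    server: StorageServer
    data: bytes

    def push(self) -> None:
        self.server.write_share(self.shnum, self.data)


def publish_without_reassignment(writers: dict[int, Writer]) -> set[int]:
    """Push each share once; forget any writer whose server fails.

    Returns the share numbers that were left without a home.
    """
    unplaced: set[int] = set()
    for shnum, writer in list(writers.items()):
        try:
            writer.push()
        except ConnectionError:
            # The placement is treated as invalid and forgotten; nothing
            # tries to place this share on another server.
            del writers[shnum]
            unplaced.add(shnum)
    return unplaced


servers = [StorageServer("A"), StorageServer("B", broken=True), StorageServer("C")]
writers = {i: Writer(i, servers[i], b"share-%d" % i) for i in range(3)}
print(publish_without_reassignment(writers))  # {1}
```

Share 1 was assigned to the broken server B, so after the publish it exists nowhere, even though servers A and C were healthy.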
When I first wrote the publisher, I wanted to support streaming upload of mutable files. That made it hard to find a new home for a share placed on a broken storage server: by the time the server failed, we would not necessarily still have all of the parts of the share that we had already generated and placed, so we could not upload them to a new server. We ended up ditching streaming uploads due to other concerns. Instead, we write a share all at once, so everything we will write to a storage server is available to us when we write. Given this, there's no compelling reason that the publisher can't attempt to find a new home for shares placed on broken servers. Ensuring that all shares are placed, if at all possible, makes it more likely that a recoverable version of the mutable file will be available after an update.
In practical terms, the current behavior somewhat increases the chance of data loss, in proportion to the number of servers that fail during a publish operation. If too many storage servers fail during the upload and too much of the initial share placement is lost as a result, the newly placed mutable file might not be recoverable. A fix would involve a way to change the server associated with a writer after the writer is created, and probably some control-flow changes to ensure that write failures result in shares being reassigned.
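The fix outlined above might look roughly like the following Python sketch. It rests on hypothetical assumptions, none taken from the Tahoe-LAFS codebase: each writer holds its entire share in memory (which the all-at-once write makes possible), a writer can be rebound to another server through a `set_server` method, and the publisher has a list of spare servers to fall back on. On a write failure, the writer moves to the next spare and retries; only when the spares run out is the share reported as unplaced.

```python
from __future__ import annotations

from dataclasses import dataclass


class StorageServer:
    """Hypothetical stand-in for a remote storage server."""

    def __init__(self, name: str, broken: bool = False) -> None:
        self.name = name
        self.broken = broken
        self.stored: dict[int, bytes] = {}

    def write_share(self, shnum: int, data: bytes) -> None:
        # A broken server is modeled as one that raises a connection error.
        if self.broken:
            raise ConnectionError(f"lost connection to {self.name}")
        self.stored[shnum] = data


@dataclass
class Writer:
    """Hypothetical writer whose server can change after creation."""

    shnum: int
    server: StorageServer
    data: bytes  # the whole share is kept, so it can be re-sent anywhere

    def set_server(self, server: StorageServer) -> None:
        # Rebind this writer to a new server (hypothetical API).
        self.server = server

    def push(self) -> None:
        self.server.write_share(self.shnum, self.data)


def publish_with_reassignment(
    writers: dict[int, Writer], spare_servers: list[StorageServer]
) -> set[int]:
    """Push each share, moving it to a spare server whenever a write fails.

    Returns the share numbers that could not be placed anywhere.
    """
    spares = list(spare_servers)  # consumed in order; a failed spare is discarded
    unplaced: set[int] = set()
    for writer in writers.values():
        while True:
            try:
                writer.push()
                break
            except ConnectionError:
                if not spares:
                    unplaced.add(writer.shnum)
                    break
                writer.set_server(spares.pop(0))
    return unplaced


servers = [StorageServer("A"), StorageServer("B", broken=True), StorageServer("C")]
writers = {i: Writer(i, servers[i], b"share-%d" % i) for i in range(3)}
print(publish_with_reassignment(writers, [StorageServer("D")]))  # set()
print(writers[1].server.name)  # D
```

The sketch deliberately ignores details a real fix would need to handle, such as avoiding placing a second share on a server that already holds one and excluding servers known to have failed from later placements.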