#877 closed defect (fixed)
reentrant confusion in mutable publish causes incorrect UCWE
Reported by: | warner | Owned by: | |
---|---|---|---|
Priority: | critical | Milestone: | 1.6.0 |
Component: | code-mutable | Version: | 1.5.0 |
Keywords: | availability upload | Cc: | |
Launchpad Bug: |
Description
The incident reports in #786 reveal a problem in the mutable-file publish control loop which is likely to trigger an inappropriate Uncoordinated Write Error when a server goes away between mapupdate and publish. Here's the setup:
- the user modifies an existing healthy mutable file, 3-of-10 encoding
- Mapupdate identifies a bunch of shares on their servers, Retrieve gets the data, Publish prepares to upload a new version. Let's say each share N goes to server N (sh1 to server 1, etc), and that there's an extra copy of share 3 on server 11. And let's say that server 3 disconnected while the Retrieve was running (so it's still in the sharemap but the RemoteReference is now dead).
- loop() is entered the first time
- it calls update_goal() to figure out where all the shares want to live. Since mapupdate found all shares, no new shares need to be placed: each known share will be updated, including the doubled share 3.
- it then calls _send_shares() to deliver messages to the 11 servers that we're using
- _send_shares() builds a dictionary named all_tw_vectors that contains all of the messages it will send, batched by serverid (so it can send exactly one message to each server, even if there are multiple shares per server). Then it loops through this dictionary (let's pretend in sorted order even though .items() is random), and for each peerid, it updates self.outstanding to remember that the message is in flight (so our next pass through loop() won't send duplicate messages), and then sends the message. The relevant code looks like this:
for (peerid, tw_vectors) in all_tw_vectors.items(): for shnum in shnums: self.outstanding.add( (peerid, shnum) ) d = self._do_testreadwrite(peerid, secrets, tw_vectors, read_vector) d.addCallbacks(self._got_write_answer, self._got_write_error) d.addCallback(self.loop)
- when we've processed peerid 1 and 2, and we're working on 3, the
_do_testreadwrite call will see the disconnected server and errback
with DeadReferenceError. This immediately fires
_got_write_error, which log.UNUSUALs the event, removes the share
from self.outstanding, and adds the peer to self.bad_peers so
it becomes ineligible for other shares. Then the Deferred chain moves
onto the next call, and invokes self.loop. This all happens while
the for loop is on the stack: the rest of the server messages have not
been processed yet.
- now loop() is entered the second time, reentrantly. It updates the goal, which removes the sh3-on-X mapping but doesn't add any new shares because sh3 is already happily living on server3. It computes the "needed" list of shares to send, which does not include 1 or 2 or 3 (because those messages are in self.outstanding, but does include all of the others (because they haven't been added to self.outstanding yet, even though they will be as soon as the suspended call to _send_shares gets a chance).
- Then it (re)enters _send_shares(), which dutifully sends requests to the needed list: 4, 5, etc, up to 10. The inner call to _send_shares exits, as does the inner call to loop()
- now the outer call to _send_shares() regains control, in the middle of its for loop. It moves to server 4, adds the messages to the self.outstanding set (which is ignored, because it was already there from the inner call), sends the (duplicate) message, and continues onwards.
- The result is that we get duplicate identical messages for the later servers.
- The first message to server 4 comes back with a successful result: the test vector matches, so the write was performed. That message is removed from self.outstanding with "discard". The servermap is updated with the new share we know to be in place on this server.
- Later, the second message to server 4 comes back. It is removed from
self.outstanding, but it was already gone, and "discard" ignores
this. This message has failed, because the test vector it provided did not
match, because the first message had been processed before the second
message arrived.
- this log.WEIRDs an event (and triggers an Incident). The log messages say things like: somebody modified the share on us: shnum=8: I thought they had #1832:R=lzzf, but testv reported #1832:R=lzzf . The first version in this message comes from the servermap, and indicates that it had been updated upon receipt of the first response. The second version in this message come from the server, and indicates that the first response was successful.
- it also marks the upload as being "surprised", since it appears that someone else has modified the share since our mapupdate was done
- the surprise causes this Publish to fail with UncoordinatedWriteError
The actual numbers of shares in Zooko's #786 104230-vyc6byy incident were different than this simplified example. There were actually 20 shares scattered among 12 servers due to lots of servers being offline during previous updates. The server which was lost only held one share (sh0), and that share was also stored on a different server. The order in which the for loop was traversed is unknown. There were two server messages still outstanding when the duplicate response was received, ending the Publish process. There may have been lots of duplicate messages sent, for which the responses didn't arrive until after the Incident log was closed.
The upshot is that the reentrant call to loop() can cause trouble. The best fix is probably to change the code in _send_shares to fire loop() via an eventual-send, specifically with foolscap's fireEventually Deferred-creating function. Since this is the only place in Publish that invokes loop() (other than the initial priming call), this should be sufficient to prevent reentrancy.
Attachments (10)
Change History (15)
comment:1 Changed at 2010-01-02T22:11:26Z by warner
- Milestone changed from undecided to 1.6.0
- Resolution set to fixed
- Status changed from new to closed
comment:2 Changed at 2010-01-09T04:44:01Z by zooko
- Priority changed from major to critical
- Resolution fixed deleted
- Status changed from closed to reopened
Nejucomo experienced a bug with similar symptoms -- UncoordinatedWriteError and incident report files that mentioned that the uploader was surprised. However, his problem was reproducible: five consecutive runs of tahoe cp -r $LOCAL $TAHOE (on volunteergrid) produced the same error and two or three incident report files for each run. So it was not caused by a server disconnecting in the middle of of a mutable write. Here are the incident report files. Note this may turn out to be a different bug with similar symptoms, but it is definitely related, so I'm taking the liberty of reopening this ticket at least until we can determine that this one is a "different bug".
Nejucomo was using Tahoe-LAFS v1.5.0 as packaged by Ubuntu in Ubuntu 9.10 Karmic.
By the way, Brian, was the bug that you already fixed in this ticket introduced since Tahoe-LAFS v1.5?
comment:3 Changed at 2010-01-09T04:50:07Z by zooko
Tomorrow (Tahoe-LAFS Hacking Weekend!) I'll try reproducing Nathan's problem, and if I can I'll try different versions of Tahoe-LAFS, especially the current trunk which has your fix from earlier in this ticket.
comment:4 Changed at 2010-01-10T07:39:22Z by warner
- Resolution set to fixed
- Status changed from reopened to closed
comment:5 Changed at 2010-01-13T19:39:30Z by davidsarah
- Keywords availability upload added
Fixed, in 419a076f59b93e66. Unfortunately I couldn't think of an easy way to test it, so there's not yet a unit test. I'll think about this further.. it will probably require a NoNetworkGrid and a monkeypatched server reference that will pretend to be dead at just the right time. It might also require changing the for loop to be deterministic (instead of walking the serverlist in random dictionary-key order).