Opened at 2008-01-04T17:23:30Z
Closed at 2009-12-12T04:33:43Z
#253 closed defect (wontfix)
everything stalls after abrupt disconnect
Reported by: | zooko | Owned by: | warner |
---|---|---|---|
Priority: | major | Milestone: | eventually |
Component: | code-network | Version: | 0.7.0 |
Keywords: | reliability | Cc: | arch_o_median |
Launchpad Bug: |
Description
We just set up a 3-node network at Seb's house with my laptop, Seb's, and Josh's. When I turned off the AirPort on my Mac, Seb and Josh couldn't do anything -- uploads, downloads, and "check this file" operations all hung silently.
After a few minutes I reconnected my laptop, but the problem persisted for several minutes -- perhaps 5 -- before Seb's tahoe node recovered and was able to function normally. Josh's node, however, still hadn't recovered by the time we called it a night (maybe 15 minutes later).
I'm attaching all three logs.
Attachments (3)
Change History (14)
Changed at 2008-01-04T17:24:22Z by zooko
comment:1 Changed at 2008-01-04T17:32:23Z by zooko
I've assigned this to Brian in order to draw his attention to it, since it probably involves foolscap connection management, but I'm going to try to reproduce it now.
comment:2 Changed at 2008-01-04T18:09:21Z by zooko
- Milestone changed from undecided to 0.7.0
comment:3 Changed at 2008-01-04T23:28:55Z by warner
Were you all running the latest Foolscap? This was definitely a problem in older versions, but the hope was that we fixed it in 0.2.0 or so.
I'll try to look at those logs when I get a chance. The real information may well be in the foolscap log events, though, which we aren't recording by default, so the important data may be missing. Reproducing the problem would be a big help.
comment:4 Changed at 2008-01-04T23:35:08Z by zooko
I looked into the logs in order to answer the question of which versions of foolscap were in use, and that information isn't there! I thought that we logged all version numbers. I'll investigate that.
comment:5 Changed at 2008-01-05T05:52:00Z by warner
So you shut down your laptop, after which the other nodes would see their TCP packets go unacknowledged. TCP takes about 15 minutes to break the connection in this state (see Foolscap#28 for some experimental timing data). During this period, the other nodes cannot distinguish between your laptop being slow and it being gone.
Assuming your laptop doesn't come back, to allow the other nodes to make progress, we need to modify the Tahoe download code to switch to an alternate source of shares when a callRemote takes too long. One heuristic might be to keep track of how long it took to acquire the previous share, and if the next share takes more than 150% as long, move that peer to the bottom of the list and ask somebody else for it. To allow uploads to make progress, we'd want to do something similar: if our remote_write call doesn't complete within 150% of the time the previous one did (or 150% of the max time that the other peers handled it), assume that this peer is dubious. Either we consider it dead (and wind up with a slightly-unhealthy file), or we buffer the shares that we wanted to send to them (and consume storage in the hopes that they'll come back).
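A rough sketch of the download-side heuristic described above, in plain Python. This is not Tahoe code: the `PeerRotator` class and its method names are made up for illustration, and it assumes the caller measures how long each share fetch takes and enforces the deadline itself.

```python
from collections import deque


class PeerRotator:
    """Hypothetical helper: demote a peer whose share takes too long."""

    def __init__(self, peers, factor=1.5):
        self.peers = deque(peers)   # front of the deque is asked first
        self.factor = factor        # 1.5 == "150% as long as last time"
        self.last_duration = None   # seconds taken by the previous share

    def current(self):
        """The peer we would ask for the next share."""
        return self.peers[0]

    def deadline(self):
        """Seconds to wait for the next share, or None if nothing measured yet."""
        if self.last_duration is None:
            return None
        return self.last_duration * self.factor

    def record_success(self, duration):
        """Remember how long the share we just got took to arrive."""
        self.last_duration = duration

    def record_timeout(self):
        """The current peer blew the deadline: move it to the bottom of the list."""
        self.peers.rotate(-1)


rotator = PeerRotator(["zooko", "seb", "josh"])
rotator.record_success(2.0)   # previous share took 2 seconds
print(rotator.deadline())     # wait at most 3 seconds for the next one
rotator.record_timeout()      # front peer was too slow
print(rotator.current())      # next peer in line gets asked instead
```

The upload side could reuse the same bookkeeping for remote_write, with the extra decision of whether to drop the share or buffer it for the slow peer.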
Now, when your laptop did come back, did you restart your tahoe node? If so, the new node's connections should have displaced the ones from the old node, and any uploads/downloads in progress should have seen immediate connectionLost errors. For upload I think we handle this properly (we abandon that share, resulting in a slightly unhealthy file, and if we don't achieve shares_of_happiness then we declare the upload to have failed). For download I think we explode pretty violently: an indefinite hang is a distinct possibility. (At the very least we should flunk the download, but really we should switch over to other peers as described above.) New operations (done after your new node finished connecting) should have worked normally. If not, perhaps our Introducer code isn't properly replacing the peer reference when the reconnector fires the callback a second time.
But, if you *didn't* restart your tahoe node when you reconnected, now we're in a different state. The other nodes would have outstanding data trying to get to your node, and TCP will retransmit that with an exponential backoff (doubling the delay each time). If your machine was off-net for 4 minutes, you could expect those nodes to not try again for a further 4 minutes. If your node sent data of its own, that might trigger a fast retry, but maybe not, and your node might not have needed to talk to them at that point. Once a retry was attempted, I'd expect data to start flowing quickly and normal operations to resume.
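To make the backoff arithmetic concrete, here is an idealized model of when the next retransmission lands once the laptop is reachable again. It assumes a 1-second initial retransmission timeout and pure doubling. Real TCP stacks derive the timeout from measured round-trip times and usually cap it, so the optional `max_rto` parameter and all the numbers here are illustrative assumptions, not measurements from this ticket.

```python
def next_retry_after(outage_end, initial_rto=1.0, max_rto=None):
    """Return the time (seconds after the first lost packet) of the first
    retransmission that happens at or after outage_end.

    Idealized model: each retry waits twice as long as the previous one,
    optionally capped at max_rto.
    """
    t = 0.0
    rto = initial_rto
    while True:
        t += rto                 # this retransmission goes out at time t
        if t >= outage_end:
            return t             # first retry the peer can actually answer
        rto *= 2                 # exponential backoff
        if max_rto is not None:
            rto = min(rto, max_rto)


# Retries go out at 1, 3, 7, 15, 31, 63, 127, 255 seconds.
# A laptop that comes back at 128 s just missed the 127 s retry,
# so the peer stays silent for roughly as long as the outage lasted.
comes_back = 128.0
retry_at = next_retry_after(comes_back)
print(retry_at - comes_back)     # extra seconds of silence after reconnecting
```

In this model the wait after reconnecting ranges from almost nothing (back just before a retry) to nearly the length of the outage (back just after one), which would be consistent with two peers recovering at very different times.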
Any idea which case it was?
comment:6 Changed at 2008-01-05T20:00:17Z by zooko
- Cc arch_o_median added
I didn't restart my Tahoe node. Seb's tahoe node reconnected within a few minutes of my turning on my wireless card, but Josh's hadn't even after maybe 15 minutes.
Cc: josh (arch)
comment:7 Changed at 2008-01-05T20:34:47Z by zooko
- Milestone changed from 0.7.0 to 0.7.1
I'm ready to call this a known issue for v0.7.0. Bumping it to v0.7.1 Milestone.
comment:8 Changed at 2008-01-23T02:43:48Z by zooko
- Milestone changed from 0.7.1 to undecided
comment:9 Changed at 2008-09-24T13:27:12Z by zooko
comment:10 Changed at 2009-10-28T07:17:34Z by davidsarah
- Keywords reliability added
comment:11 Changed at 2009-12-12T04:33:43Z by zooko
- Resolution set to wontfix
- Status changed from new to closed
I haven't tried to reproduce this problem or further diagnose it in two years, and much has changed since then. I'm going to presumptively close it as 'wontfix'. A future rewrite of the download logic may also fix it if it isn't already fixed -- see comment:4:ticket:193.
twistd.log from seb's laptop