#181 closed defect (fixed)

restart is a bit too sensitive to slow shutdown (or shutdown is a bit too slow)

Reported by: zooko
Owned by: warner
Priority: major
Milestone: 0.7.0
Component: code-nodeadmin
Version: 0.6.1
Keywords:
Cc:
Launchpad Bug:

Description

On cygwin on Parallels, tahoe restart is susceptible to failing to restart due to slowness. In four attempts to restart, I got two successes; one time the webport couldn't be acquired (because it was still "in use"), and one time I got

never saw process go away
not restarting

after about a second or two, and about half a second before the process went away.

I'm not sure what the very best solution to these kinds of issues is, but I'm opening this ticket in order to think about it later. At the moment, I kind of think of tahoe restart as meaning: "You! Stay here and try to kill that thing until it dies, and then try to revive it until it revives, or until I come back and tell you to stop.".
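
For illustration, a minimal sketch of that semantics (hypothetical code; start_node() stands in for whatever actually launches the node, and is not part of tahoe):

    import os
    import signal
    import time

    def restart(pid, start_node):
        # Phase 1: keep trying to kill the old process until it is gone.
        while True:
            try:
                os.kill(pid, signal.SIGTERM)   # ask it to die
            except OSError:
                break                          # no such process: it is dead
            time.sleep(0.5)
        # Phase 2: keep trying to revive it until the new process comes up.
        while not start_node():                # hypothetical launcher; returns True on success
            time.sleep(0.5)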

Change History (11)

comment:1 Changed at 2007-10-19T20:51:47Z by zooko

  • Owner set to zooko
  • Status changed from new to assigned
  • Summary changed from restart is a bit too sensitive to timing on cygwin on Parallels to restart is a bit too sensitive to slow shutdown (or shutdown is a bit too slow)

I just saw the same failure on Mac OS X (on a super-fast MacBook Pro -- Core 2 Duo).

Apparently shutting down in response to SIGTERM sometimes takes too long for the "tahoe restart" command-line.

I prefer the general philosophy of crash-only software (http://www.usenix.org/events/hotos03/tech/full_papers/candea/candea_html/index.html) -- basically that whatever the tahoe process is doing between the time it receives the SIGTERM and the time that the operating system frees up its resources is stuff that we have to be able to do without (in the case that it crashes), and we would be better off exercising the "do without it" code and not spending time working on the shutdown code.

Obviously there are places where this philosophy doesn't fit.

However, I wonder if this is a place where that idea will fit. I intend to examine how things would work if tahoe stop sent SIGKILL instead of SIGTERM.
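
At the syscall level the difference is just the signal number. A sketch of the two options, assuming the PID is read from twistd.pid as usual:

    import os
    import signal

    with open("twistd.pid") as f:        # twistd records the daemon's PID here
        pid = int(f.read().strip())

    # Current behavior: SIGTERM lets Twisted run its shutdown code first.
    # Proposed behavior: SIGKILL cannot be caught; no shutdown code runs at all.
    os.kill(pid, signal.SIGTERM)         # or: os.kill(pid, signal.SIGKILL)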

After all, this ticket is an example of how having shutdown code in existence gives you something that you can spend development time fixing/improving...

comment:2 Changed at 2007-10-19T22:47:20Z by zooko

  • Milestone changed from undecided to 0.6.2
  • Version changed from 0.6.0 to 0.6.1

comment:3 Changed at 2007-10-19T22:57:46Z by zooko

I've inspected the source code and, as I suspected, there is nothing that will go wrong if tahoe stop and tahoe restart issue SIGKILL instead of SIGTERM.

I propose to make that change.

Note that we still have shutdown code for some components, because a component might be used within a process which itself isn't going away -- in a unit test, in other tests or performance measurements, or just because we like to be able to create and destroy these components at will without having to let go of all the other state in the process.

However, there is one bit of shutdown code -- Node -- that comes with the comment "This class can be started and stopped only once.".

I propose to remove this shutdown code.

There is still one final question, which is "What does the user mean when they say tahoe restart -- do they mean 'Try to kill this for up to 5 seconds, and if it dies then start a new one.', or do they mean 'Start trying to kill this. If it dies then start a new one.'?".

Personally I think I tend to mean the latter. I issue tahoe restart on the command-line, and I do not wait 5 seconds to see what happens; I immediately switch away. If the process takes < 5 seconds to die, then I get what I expected. If the process never dies, then whichever way we do this I'll be mildly surprised to find it still running later. If the process takes > 5 seconds to die, then if we do this the current way I'll be mildly surprised to find no process running later, and if we do it the other way I'll get what I expected (just delayed by however long it took the old process to die!)

comment:4 Changed at 2007-10-22T19:35:11Z by zandr

To express my input to the conversation happening 4 feet to the south....

I'd be quite happy with 'try to shut down for 60 seconds, then kill -9', since that's what I'll be doing if the shutdown fails anyway.

comment:5 Changed at 2007-10-22T22:48:20Z by warner

Ok, so the behavior we agreed to is:

  • 'tahoe start' does the log-watching trick I described earlier (the same as used in buildbot).
  • 'tahoe stop' does a kill -9, removes the twistd.pid file, then waits for the process to stop running (by polling the PID with os.kill(pid, 0)); see the sketch after this list. If it doesn't see the process go away within 5 seconds, it warns the user, but keeps waiting until either the process goes away or the user kills the 'tahoe stop' process.
  • the Tahoe codebase is required to handle being stopped at arbitrary times. The existence of "clean shutdown" code is for the benefit of unit tests, to make sure that a component which is shut down (with stopService, obviously *not* SIGKILL) will not interfere with later tests run in the same process.
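
A minimal sketch of that agreed 'tahoe stop' behavior (function name and message wording are illustrative, not the actual implementation):

    import os
    import signal
    import sys
    import time

    def stop(basedir):
        pidfile = os.path.join(basedir, "twistd.pid")
        with open(pidfile) as f:
            pid = int(f.read().strip())
        os.kill(pid, signal.SIGKILL)       # kill -9
        os.remove(pidfile)                 # remove twistd.pid
        start = time.time()
        warned = False
        while True:
            try:
                os.kill(pid, 0)            # signal 0: existence check, sends nothing
            except OSError:
                return 0                   # the process is gone
            if not warned and time.time() - start > 5:
                sys.stderr.write("process %d still running after 5 seconds\n" % pid)
                warned = True
            time.sleep(0.1)                # keep polling until it dies or we are killed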

I'll work on this one if you'd like.

comment:6 Changed at 2007-10-22T23:06:22Z by zooko

  • Owner changed from zooko to warner
  • Status changed from assigned to new

Yes, please!

One detail that we mentioned, and that might not be worth the effort, is to mv the pid file aside into a random tempfile instead of rm'ing it, and then inspect it to make sure you haven't fallen afoul of a rare race condition in which your kill -9 killed the old tahoe process and then someone else started a new tahoe process before you rm'ed the pid file.

If you detect such a problem, squawk loudly about it and try to put the pid file back.
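
A sketch of that mv-aside check, assuming POSIX rename semantics (function name and message wording are illustrative):

    import os
    import sys
    import tempfile

    def remove_pidfile_safely(basedir, killed_pid):
        pidfile = os.path.join(basedir, "twistd.pid")
        fd, stash = tempfile.mkstemp(prefix=".twistd.pid-", dir=basedir)
        os.close(fd)
        os.rename(pidfile, stash)        # mv the pid file aside instead of rm'ing it
        with open(stash) as f:
            pid_in_file = int(f.read().strip())
        if pid_in_file == killed_pid:
            os.remove(stash)             # normal case: it was the file we expected
        else:
            # rare race: a new node wrote its PID between our kill -9 and the mv.
            # Squawk loudly and put its pid file back.
            sys.stderr.write("pid file now names %d, not %d; restoring it\n"
                             % (pid_in_file, killed_pid))
            os.rename(stash, pidfile)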

comment:7 Changed at 2007-10-29T18:47:19Z by warner

I've just pushed the 'tahoe stop' change described above: use kill -9, remove the twistd.pid file, wait for the process to go away, warn after 5 seconds (and every 10 seconds thereafter), don't stop watching.

comment:8 Changed at 2007-10-30T19:24:32Z by warner

And then, the same day I implemented that, I found a reason to not do it that way. The SDMF backend storage container has a critical window during which an abrupt shutdown will corrupt the share. Always doing SIGKILL will increase the rate of share loss.

It's another one of those multi-way tradeoffs. We could tolerate SIGKILL and minimize the window by always copying the share to a new file, doing the edits there, then atomically dropping the new share into place, but that comes at a significant IO expense. We could make 'tahoe stop' do SIGINT and give the node like 5 seconds to finish current IO operations, which would reduce the probability of share loss (crashes or power loss are always a possibility, but we don't have to exacerbate the situation by doing hard kills all the time).
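
A minimal sketch of the copy-edit-rename option, assuming POSIX rename semantics (edit() is a hypothetical callback standing in for the share mutation; this is not the testv_and_writev API):

    import os
    import tempfile

    def update_share_atomically(sharefile, edit):
        fd, tmp = tempfile.mkstemp(prefix=".share-", dir=os.path.dirname(sharefile))
        try:
            with open(sharefile, "rb") as old, os.fdopen(fd, "wb") as new:
                new.write(edit(old.read()))   # rewrite the whole share in a private copy
                new.flush()
                os.fsync(new.fileno())        # force the bytes to disk before the rename
            os.rename(tmp, sharefile)         # atomic: readers see the old or new share, never half
        except BaseException:
            os.unlink(tmp)
            raise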

comment:9 Changed at 2007-10-31T02:43:28Z by zooko

Since SDMFs get overwritten in their entirety each time, why is it more I/O expense? Oh, I know: because of the metadata, such as leases. Could I see a photo of last week's whiteboard showing the data layout?

My question is: how about copying the part that isn't going to change over to a new file, then appending the new share data, then relinking it into place? Is the added reliability of shares worth the added I/O? Oh, and I suppose it doesn't exactly match the testv_and_writev API, since it requires the update vector to be the whole share. Hm.

But aside from that issue, I don't necessarily object to giving the process a SIGTERM warning before the SIGKILL. I think this change to crash-only was useful because it prompted us to think through these kinds of questions about intermediate persistent state, and because it led us to not waste developer time (and sysadmin time) on "clean shutdown" behavior that we didn't really need.

If letting the filesystem I/O buffers flush is some clean shutdown behavior that we *do* really need, I'm okay with that, as long as it doesn't mislead us into adding other (harder to maintain) behavior that we don't really need. (Or lead us to forget about the chance of leaving inconsistent persistent state that we don't know how to deal with afterward.)

Make sense?

I guess there are two orthogonal reasons why I like crash-only:

  1. force us to think through the effects of unclean shutdown
  2. don't add maintenance burden for something we can live without

Basically, I regard the behavior of Python, the operating system, the file system, etc., in response to kill -15 $pid ; sleep 5 ; kill -9 $pid as easy for us to maintain. I regard the behavior of Twisted and any Python code of ours in response to that sequence as relatively hard to maintain. :-)

So overall I'm +1 on leaving the SIGKILL shutdown as is, in order to keep reminding us to think about consistency of intermediate persistent state, but I'm also +0 on adding a SIGTERM in order to let filesystem updates flush.

comment:10 Changed at 2007-11-01T18:00:41Z by zooko

  • Resolution set to fixed
  • Status changed from new to closed

I created #200 ("writing of shares is fragile and/or there is no graceful shutdown") to hold the unresolved part of this ticket, and I'm closing this one as fixed.

comment:11 Changed at 2007-11-01T18:12:49Z by zooko

  • Milestone changed from 0.6.2 to 0.7.0

Milestone 0.6.2 deleted
