[tahoe-dev] [tahoe-lafs] #200: writing of shares is fragile and "tahoe stop" is unnecessarily harsh (was: writing of shares is fragile)

tahoe-lafs trac at allmydata.org
Mon Nov 2 00:16:25 PST 2009


#200: writing of shares is fragile and "tahoe stop" is unnecessarily harsh
--------------------------+-------------------------------------------------
 Reporter:  zooko         |           Owner:  warner    
     Type:  enhancement   |          Status:  new       
 Priority:  major         |       Milestone:  eventually
Component:  code-storage  |         Version:  0.6.1     
 Keywords:  reliability   |   Launchpad_bug:            
--------------------------+-------------------------------------------------

Comment(by warner):

 Hrmph, I guess this is one of my hot buttons. Zooko and I have discussed
 the "crash-only" approach before, and I think we're still circling around
 each other's opinions. I currently feel that any approach that prefers
 fragility is wrong. Intentionally killing the server with no warning
 whatsoever (i.e. the SIGKILL that "tahoe stop" does), when it is perfectly
 reasonable to provide some warning and tolerate a brief delay, amounts to
 intentionally causing data loss and damaging shares for the sake of some
 sort of ideological purity that I don't really understand.

 Be nice to your server! Don't shoot it in the head just to prove that you
 can. :-)

 Yes, sometimes the server will die abruptly. But it will be manually
 restarted far more frequently than that. Here's my list of
 running-to-not-running transition scenarios, in roughly increasing order
 of frequency:

  * kernel crash (some disk writes completed, in temporal order if you're
    lucky)
  * power loss (like kernel crash)
  * process crash / SIGSEGV (all disk writes completed)
  * kernel shutdown (process gets SIGINT, then SIGKILL; all disk writes
    completed and buffers flushed)
  * process shutdown (SIGINT, then SIGKILL: the process can choose what to
    do, and all disk writes are completed; a sketch of such a handler
    follows this list)
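
 The last two cases are the ones we control. A process that wants to exit
 cleanly on SIGINT can do something like the following (an illustrative
 sketch, not Tahoe's actual shutdown code; the names are made up):

 {{{
import signal
import sys

# The handler only sets a flag; the main loop finishes the write it is in
# the middle of and stops between writes, never mid-write.
shutdown_requested = [False]

def request_shutdown(signum, frame):
    shutdown_requested[0] = True

signal.signal(signal.SIGINT, request_shutdown)

def run_write_loop(pending_writes):
    for apply_write in pending_writes:
        apply_write()                  # each share update runs to completion
        if shutdown_requested[0]:
            break
    sys.exit(0)
 }}}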

 The tradeoff is between:
  * performance in the good case
  * shutdown time in the "graceful shutdown" case
  * recovery time after something unexpected/rare happens
  * correctness: the amount of corruption when something unexpected/rare
    happens (i.e. resistance to corruption: what is the probability that a
    share will survive intact?)
  * code complexity

 A modern disk filesystem effectively writes a bunch of highly-correct,
 corruption-resistant but poor-performance data to disk (i.e. the journal),
 then writes a best-effort performance-improving index to very specific
 places (i.e. the inodes and dirnodes and free-block tables and the rest).
 In the good case, it uses the index and gets high performance. In the bad
 case (i.e. the fsck that happens after it wakes up and learns that it
 didn't shut down gracefully), it spends a lot of time on recovery but
 maximizes correctness by using the journal. The shutdown time is pretty
 small but depends upon how much buffered data is waiting to be written (it
 tends to be insignificant for hard drives, but annoyingly long for
 removable USB drives).

 A modern filesystem could achieve its correctness goals purely by using
 the journal, with zero shutdown time (umount == poweroff), and would never
 spend any time recovering anything, and would be completely "crash-only",
 but of course the performance would be so horrible that nobody would ever
 use it. Each open() or read() would involve a big fsck process, and it
 would probably have to keep the entire directory structure in RAM.

 So it's an engineering tradeoff. In Tahoe, we've got a layer of
 reliability over and above the individual storage servers, which lets us
 deprioritize the per-server correctness/corruption-resistance goal a
 little bit.

 If correctness were infinitely important, we'd write out each new version
 of a mutable share to a separate file, then do an fsync(), then perform an
 atomic rename (except on platforms that are too stupid to provide such a
 feature, of course), then do fsync() again, to maximize the period of time
 during which the disk contains a valid, monotonically-increasing version of
 the share.
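
 For concreteness, that write-new-version-then-rename sequence would look
 roughly like the following (an illustrative sketch; the function name and
 the .tmp path are made up, not Tahoe's actual storage code):

 {{{
import os

def replace_share_atomically(share_path, new_contents):
    # Write the complete new version of the share to a temp file, force it
    # to disk, then atomically rename it over the old share.
    tmp_path = share_path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(new_contents)
        f.flush()
        os.fsync(f.fileno())          # first fsync: the new file's contents
    os.rename(tmp_path, share_path)   # atomic on POSIX; fails over an
                                      # existing file on Windows
    # second fsync: sync the directory so the rename itself is durable
    dirfd = os.open(os.path.dirname(share_path) or ".", os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
 }}}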

 If performance or code complexity were infinitely important, we'd modify
 the share in-place with as few writes and syscalls as possible, and leave
 the flushing up to the filesystem and kernel, to do at the most efficient
 time possible.

 If performance and correctness were top goals, but not code complexity,
 you could imagine writing out a journal of mutable share updates, and
 somehow replaying it on restart if we didn't see the "clean" bit that
 means we'd finished doing all updates before shutdown.
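
 A toy version of that journal, assuming a simple append-only record format
 of (offset, length, data) per update (nothing like this exists in the
 current code):

 {{{
import os
import struct

def journal_update(journal_path, offset, data):
    # Append one update record and force it to disk before the in-place
    # edit of the share itself is attempted.
    with open(journal_path, "ab") as j:
        j.write(struct.pack(">QI", offset, len(data)) + data)
        j.flush()
        os.fsync(j.fileno())

def replay_journal(journal_path, share_path):
    # At startup, if the journal is still present (no "clean" bit), re-apply
    # every record so the share reaches its last intended state.
    if not os.path.exists(journal_path):
        return
    with open(journal_path, "rb") as j:
        with open(share_path, "r+b") as share:
            while True:
                header = j.read(12)
                if len(header) < 12:
                    break              # truncated tail: ignore partial record
                offset, length = struct.unpack(">QI", header)
                data = j.read(length)
                if len(data) < length:
                    break
                share.seek(offset)
                share.write(data)
    os.remove(journal_path)            # removal serves as the "clean" bit
 }}}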

 So anyways, those are my feelings in the abstract. As for the specific, I
 strongly feel that "tahoe stop" should be changed to send SIGINT and give
 the process a few seconds to finish any mutable-file-modification
 operation it was doing before sending it SIGKILL. (As far as I'm
 concerned, the only reason to ever send SIGKILL is because you're
 impatient and don't want to wait for it to clean up, possibly because you
 believe that the process has hung or stopped making progress, and you
 can't or don't wish to look at the logs to find out what's going on.)
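
 Concretely, the behavior I'd like "tahoe stop" to have is roughly this (a
 sketch of the proposal, not what the command currently does):

 {{{
import errno
import os
import signal
import time

def graceful_stop(pid, grace_period=5.0, poll_interval=0.1):
    # Ask politely first: SIGINT gives the node a chance to finish any
    # in-progress mutable-share modification and exit on its own.
    os.kill(pid, signal.SIGINT)
    deadline = time.time() + grace_period
    while time.time() < deadline:
        try:
            os.kill(pid, 0)            # signal 0 just checks liveness
        except OSError as e:
            if e.errno == errno.ESRCH:
                return True            # it exited within the grace period
            raise
        time.sleep(poll_interval)
    os.kill(pid, signal.SIGKILL)       # only now do we get impatient
    return False
 }}}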

 I don't yet have an informed opinion about copy-before-write versus
 edit-in-place. As Zooko points out, it would be appropriate to measure the
 IO costs of writing out a new copy of each share and see how bad it looks.
 Code notes:

  * the simplest way to implement copy-before-write would be to first copy
    the entire share, then apply in-place edits to the new version, then
    atomically rename it over the original (a sketch follows this list).
    We'd want to consider a recovery-like scan for abandoned editing files
    (i.e. {{{find storage/shares -name *.tmp |xargs rm}}}) at startup, to
    avoid unbounded accumulation of those tempfiles, except that such a
    scan would be expensive to perform and would rarely find anything.

  * another option is to make a backup copy of the entire share, apply
    in-place edits to the *old* version, then delete the backup (and
    establish a recovery procedure that looks for backup copies and uses
    them to replace the presumably-incompletely-edited original). This
    would be easier to implement if the backup copies were all placed in a
    single central directory, so the recovery process can scan for them
    quickly, perhaps in storage/shares/updates/$SI.
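
 Here's roughly what the first option would look like, including the
 startup scan (an illustrative sketch; the names are made up):

 {{{
import os
import shutil

def update_share_copy_before_write(share_path, apply_edits):
    # Copy the whole share, apply the in-place edits to the copy, then
    # atomically rename the edited copy over the original.
    tmp_path = share_path + ".tmp"
    shutil.copyfile(share_path, tmp_path)
    with open(tmp_path, "r+b") as f:
        apply_edits(f)                 # caller seeks/writes within the copy
        f.flush()
        os.fsync(f.fileno())
    os.rename(tmp_path, share_path)    # atomic on POSIX

def remove_abandoned_tempfiles(storage_root):
    # The startup scan: a Python version of
    # "find storage/shares -name *.tmp | xargs rm". Walking a large storage
    # directory is exactly the expense noted above.
    for dirpath, dirnames, filenames in os.walk(storage_root):
        for name in filenames:
            if name.endswith(".tmp"):
                os.remove(os.path.join(dirpath, name))
 }}}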

 However, my suspicion is that edit-in-place is the appropriate tradeoff,
 because that will lead to simpler code (i.e. fewer bugs) and better
 performance, while only making us vulnerable to share corruption during the
 rare events that don't give the server time to finish its write() calls
 (i.e. kernel crash, power loss, and SIGKILL). Similarly, I suspect that it
 is *not* appropriate to call fsync(), because we lose performance
 everywhere but only improve correctness in the kernel crash and power loss
 scenarios. (A graceful kernel shutdown, or an arbitrary process shutdown
 followed by enough time for the kernel/filesystem to flush its buffers,
 would provide for all write()s to be flushed even without a single fsync()
 call.)
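
 For comparison, the edit-in-place version is nearly trivial, which is part
 of its appeal (again, just an illustrative sketch):

 {{{
def edit_share_in_place(share_path, writes):
    # Apply each (offset, data) write directly to the existing share file.
    # No temp file, no rename, and deliberately no fsync(): flushing is
    # left to the kernel and filesystem.
    with open(share_path, "r+b") as f:
        for offset, data in writes:
            f.seek(offset)
            f.write(data)
 }}}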

-- 
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/200#comment:5>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid

