[tahoe-dev] [tahoe-lafs] #200: writing of shares is fragile and "tahoe stop" is unnecessarily harsh (was: writing of shares is fragile)
tahoe-lafs
trac at allmydata.org
Mon Nov 2 00:16:25 PST 2009
#200: writing of shares is fragile and "tahoe stop" is unnecessarily harsh
--------------------------+-------------------------------------------------
Reporter: zooko | Owner: warner
Type: enhancement | Status: new
Priority: major | Milestone: eventually
Component: code-storage | Version: 0.6.1
Keywords: reliability | Launchpad_bug:
--------------------------+-------------------------------------------------
Comment(by warner):
Hrmph, I guess this is one of my hot buttons. Zooko and I have discussed the
"crash-only" approach before, and I think we're still circling around each
other's opinions. I currently feel that any approach that prefers fragility
is wrong. Intentionally killing the server with no warning whatsoever (i.e.
the SIGKILL that "tahoe stop" does), when it is perfectly reasonable to
provide some warning and tolerate a brief delay, is equivalent to intentionally
causing data loss and damaging shares for the sake of some sort of
ideological purity that I don't really understand.
Be nice to your server! Don't shoot it in the head just to prove that you
can. :-)
Yes, sometimes the server will die abruptly. But it will be manually
restarted far more frequently than that. Here's my list of
running-to-not-running transition scenarios, in roughly increasing order
of frequency:
* kernel crash (some disk writes completed, in temporal order if you're lucky)
* power loss (like kernel crash)
* process crash / SIGSEGV (all disk writes completed)
* kernel shutdown (process gets SIGINT, then SIGKILL; all disk writes
  completed and buffers flushed)
* process shutdown (SIGINT, then SIGKILL: process can choose what to do,
  all disk writes completed)
The tradeoff is between:
* performance in the good case
* shutdown time in the "graceful shutdown" case
* recovery time after something unexpected/rare happens
* correctness: amount of corruption when something unexpected/rare happens
  (i.e. resistance to corruption: what is the probability that a share
  will survive intact?)
* code complexity
A modern disk filesystem effectively writes a bunch of highly-correct
corruption-resistant but poor-performance data to disk (i.e. the journal),
then writes a best-effort performance-improving index to very specific places
(i.e. the inodes and dirnodes and free-block-tables and the rest). In the
good case, it uses the index and gets high performance. In the bad case (i.e.
the fsck that happens after it wakes up and learns that it didn't shut down
gracefully), it spends a lot of time on recovery but maximizes correctness
by using the journal. The shutdown time is pretty small but depends upon how
much buffered data is waiting to be written (it tends to be insignificant
for hard drives, but annoyingly long for removable USB drives).
A modern filesystem could achieve its correctness goals purely by using the
journal, with zero shutdown time (umount == poweroff), and would never spend
any time recovering anything, and would be completely "crash-only", but of
course the performance would be so horrible that nobody would ever use it.
Each open() or read() would involve a big fsck process, and it would probably
have to keep the entire directory structure in RAM.
So it's an engineering tradeoff. In Tahoe, we've got a layer of reliability
over and above the individual storage servers, which lets us deprioritize the
per-server correctness/corruption-resistance goal a little bit.
If correctness were infinitely important, we'd write out each new version of
a mutable share to a separate file, then do an fsync(), then perform an
atomic rename (except on platforms that are too stupid to provide such a
feature, of course), then do fsync() again, to maximize the period of time
when the disk contained a valid monotonically-increasing version of the
share.
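For concreteness, here's a minimal sketch of that maximally-careful sequence
(the function name and paths are made up for illustration, and I'm reading
the second fsync() as an fsync of the containing directory so that the
rename itself is durable):

{{{
import os

def write_share_atomically(share_path, new_contents):
    """Hypothetical sketch: write the new share version to a temp file,
    fsync it, atomically rename it over the old one, then fsync the
    directory so the rename is durable too."""
    tmp_path = share_path + ".tmp"
    f = open(tmp_path, "wb")
    try:
        f.write(new_contents)
        f.flush()
        os.fsync(f.fileno())        # data is on disk before the rename
    finally:
        f.close()
    os.rename(tmp_path, share_path)  # atomic replacement on POSIX
    dirfd = os.open(os.path.dirname(share_path) or ".", os.O_RDONLY)
    try:
        os.fsync(dirfd)              # make the rename itself durable
    finally:
        os.close(dirfd)
}}}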
If performance or code complexity were infinitely important, we'd modify the
share in-place with as few writes and syscalls as possible, and leave the
flushing up to the filesystem and kernel, to do at the most efficient time
possible.
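The in-place variant is about as small as it sounds; a sketch, again with
invented names:

{{{
def edit_share_in_place(share_path, offset, new_data):
    """Hypothetical sketch: overwrite only the changed byte range and let
    the kernel flush its buffers whenever it likes (no fsync)."""
    with open(share_path, "r+b") as f:
        f.seek(offset)
        f.write(new_data)
}}}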
If performance and correctness were top goals, but not code complexity, you
could imagine writing out a journal of mutable share updates, and somehow
replaying it on restart if we didn't see the "clean" bit that means we'd
finished doing all updates before shutdown.
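Very roughly, such a journal could look like the following sketch (the record
format, file names, and "clean" marker are all invented here, just to make
the shape concrete):

{{{
import os, json

JOURNAL = "storage/shares/update.journal"   # invented location
CLEAN_MARKER = "storage/shares/clean"       # invented "clean" bit

def journal_update(share_path, offset, data):
    # append one update record before touching the share itself
    entry = {"path": share_path, "offset": offset, "data": data.hex()}
    with open(JOURNAL, "a") as f:
        f.write(json.dumps(entry) + "\n")
    # ... then apply the same edit to the share in place ...

def replay_journal_if_unclean():
    # on startup: if the "clean" bit is missing, re-apply every record
    if os.path.exists(CLEAN_MARKER) or not os.path.exists(JOURNAL):
        return
    with open(JOURNAL) as f:
        for line in f:
            entry = json.loads(line)
            with open(entry["path"], "r+b") as share:
                share.seek(entry["offset"])
                share.write(bytes.fromhex(entry["data"]))
}}}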
So anyways, those are my feelings in the abstract. As for the specific, I
strongly feel that "tahoe stop" should be changed to send SIGINT and give the
process a few seconds to finish any mutable-file-modification operation it
was doing before sending it SIGKILL. (As far as I'm concerned, the only
reason to ever send SIGKILL is because you're impatient and don't want to
wait for it to clean up, possibly because you believe that the process has
hung or stopped making progress, and you can't or don't wish to look at the
logs to find out what's going on.)
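Something like this sketch is the stop logic I have in mind (the pid-file
name, timeout, and function name are placeholders, not necessarily what
"tahoe stop" actually does today):

{{{
import os, signal, time

def graceful_stop(pidfile="twistd.pid", grace_period=5.0):
    """Hypothetical sketch: ask the node to shut down with SIGINT, and only
    escalate to SIGKILL if it hasn't exited after a few seconds."""
    pid = int(open(pidfile).read().strip())
    os.kill(pid, signal.SIGINT)        # polite: let it finish its writes
    deadline = time.time() + grace_period
    while time.time() < deadline:
        try:
            os.kill(pid, 0)            # probe: is it still alive?
        except OSError:
            return True                # it exited cleanly
        time.sleep(0.1)
    os.kill(pid, signal.SIGKILL)       # impatient fallback
    return False
}}}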
I don't yet have an informed opinion about copy-before-write or
edit-in-place. As Zooko points out, it would be appropriate to measure the IO
costs of writing out a new copy of each share, and see how bad it looks.

Code notes:
* the simplest way to implement copy-before-write would be to first copy the
  entire share, then apply in-place edits to the new copy, then atomically
  rename. We'd want to consider a recovery-like scan for abandoned editing
  files (i.e. {{{find storage/shares -name *.tmp |xargs rm}}}) at startup,
  to avoid unbounded accumulation of those tempfiles, except that such a
  scan would be expensive to perform and would rarely find anything. (A
  sketch of this option follows these notes.)
* another option is to make a backup copy of the entire share, apply
  in-place edits to the *old* version, then delete the backup (and establish
  a recovery procedure that looks for backup copies and uses them to replace
  the presumably-incompletely-edited original). This would be easier to
  implement if the backup copies are all placed in a single central
  directory, so the recovery process can scan for them quickly, perhaps in
  storage/shares/updates/$SI.
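Here's a sketch of that first option, including the startup scan for
abandoned tempfiles (the directory layout and names are guesses for
illustration):

{{{
import os, shutil

def update_share_copy_before_write(share_path, offset, new_data):
    # copy the whole share, edit the copy in place, then rename it back
    tmp_path = share_path + ".tmp"
    shutil.copy2(share_path, tmp_path)
    with open(tmp_path, "r+b") as f:
        f.seek(offset)
        f.write(new_data)
    os.rename(tmp_path, share_path)   # atomic replacement on POSIX

def clean_abandoned_tempfiles(storage_root="storage/shares"):
    # startup recovery, roughly: find storage/shares -name '*.tmp' | xargs rm
    for dirpath, dirnames, filenames in os.walk(storage_root):
        for name in filenames:
            if name.endswith(".tmp"):
                os.remove(os.path.join(dirpath, name))
}}}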
However, my suspicion is that edit-in-place is the appropriate tradeoff,
because that will lead to simpler code (i.e. fewer bugs) and better
performance, while only making us vulnerable to share corruption during the
rare events that don't give the server time to finish its write() calls (i.e.
kernel crash, power loss, and SIGKILL). Similarly, I suspect that it is *not*
appropriate to call fsync(), because we lose performance everywhere but only
improve correctness in the kernel crash and power loss scenarios. (A graceful
kernel shutdown, or an arbitrary process shutdown followed by enough time for
the kernel/filesystem to flush its buffers, would provide for all write()s to
be flushed even without a single fsync() call.)
--
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/200#comment:5>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid