#1280 new defect

deal with fragile, but disposable, bucket state files

Reported by: francois Owned by: daira
Priority: normal Milestone: soon
Component: code-nodeadmin Version: 1.8.1
Keywords: pickle reliability Cc: zancas
Launchpad Bug:

Description (last modified by daira)

If a bucket state file can't be loaded and parsed for any reason (usually because it is 0-length, but any other sort of error should also be handled similarly), then just blow it away and start fresh.


After a hard system shutdown due to power failure, Tahoe node might not be able to start again automatically because files storage/bucket_counter.state or storage/lease_checker.state are empty.

The easy workaround is to manually delete the empty files before restarting nodes.

find /srv/tahoe/*/storage/bucket_counter.state -size 0 -exec rm {} \;
find /srv/tahoe/*/storage/lease_checker.state -size 0 -exec rm {} \;

Here is what a startup attempt looks like in such case.

Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/twisted/application/app.py", line 614, in run
    runApp(config)
  File "/usr/lib/python2.5/site-packages/twisted/scripts/twistd.py", line 23, in runApp
    _SomeApplicationRunner(config).run()
  File "/usr/lib/python2.5/site-packages/twisted/application/app.py", line 330, in run
    self.application = self.createOrGetApplication()
  File "/usr/lib/python2.5/site-packages/twisted/application/app.py", line 416, in createOrGetApplication
    application = getApplication(self.config, passphrase)
--- <exception caught here> ---
  File "/usr/lib/python2.5/site-packages/twisted/application/app.py", line 427, in getApplication
    application = service.loadApplication(filename, style, passphrase)
  File "/usr/lib/python2.5/site-packages/twisted/application/service.py", line 368, in loadApplication
    application = sob.loadValueFromFile(filename, 'application', passphrase)
  File "/usr/lib/python2.5/site-packages/twisted/persisted/sob.py", line 214, in loadValueFromFile
    exec fileObj in d, d
  File "tahoe-client.tac", line 10, in <module>
    c = client.Client()
  File "/opt/tahoe-lafs/src/allmydata/client.py", line 140, in __init__
    self.init_storage()
  File "/opt/tahoe-lafs/src/allmydata/client.py", line 269, in init_storage
    expiration_sharetypes=expiration_sharetypes)
  File "/opt/tahoe-lafs/src/allmydata/storage/server.py", line 97, in __init__
    self.add_bucket_counter()
  File "/opt/tahoe-lafs/src/allmydata/storage/server.py", line 114, in add_bucket_counter
    self.bucket_counter = BucketCountingCrawler(self, statefile)
  File "/opt/tahoe-lafs/src/allmydata/storage/crawler.py", line 449, in __init__
    ShareCrawler.__init__(self, server, statefile)
  File "/opt/tahoe-lafs/src/allmydata/storage/crawler.py", line 86, in __init__
    self.load_state()
  File "/opt/tahoe-lafs/src/allmydata/storage/crawler.py", line 195, in load_state
    state = pickle.load(f)
exceptions.EOFError: 

Change History (15)

comment:1 follow-up: Changed at 2010-12-20T16:06:31Z by warner

oops. Yeah, if pickle.load fails at startup, we should probably delete the file so it can be regenerated.

We should probably also investigate using the "write-to-temp-file/close/rename" dance. One reason to not do this are performance (we should measure the diskio hit for a big server). We should also think through the atomicity expectations: if the rename fails, will the state of the pickle adequately match the state of the servers (probably yes, but it's worth a few more minutes of thought).

comment:2 Changed at 2011-01-04T02:14:04Z by davidsarah

  • Keywords pickle reliability added

See also #1290 (replace all use of pickles with JSON).

comment:3 Changed at 2011-01-04T16:43:22Z by francois

  • Owner set to francois
  • Status changed from new to assigned

comment:4 Changed at 2011-08-18T04:23:36Z by zooko

  • Owner changed from francois to zooko
  • Status changed from assigned to new

I think the current code uses the "write-to-temp-file&&close&&rename" dance. Zancas was working on a patch to use FilePath (#1437) and I saw that his patch simplified the extant "write-to-temp-file&&close&&rename" dance by using FilePath.setContents. I'll look for the details.

comment:5 Changed at 2011-08-18T05:47:15Z by zooko

  • Cc zancas added
  • Resolution set to fixed
  • Status changed from new to closed

Here is where the current trunk code uses the "write-to-temp-file&&close&&rename" dance. Closing this ticket as fixed. :-) Re-open if you disagree.

comment:6 in reply to: ↑ 1 ; follow-up: Changed at 2011-08-18T16:41:26Z by zooko

  • Resolution fixed deleted
  • Status changed from closed to reopened
  • Summary changed from bucket_counter.state and lease_checker.state might get corrupted after hard system shutdown to if bucket_counter.state or lease_checker.state can't be written, stop the node with an error message

Replying to warner:

We should probably also investigate using the "write-to-temp-file/close/rename" dance.

It appears to me that Brian had already implemented the "write-to-temp-file&&close&&rename" dance [20090219034633-4233b-3abc69ace6ea2f453b1fdbefa754e7ecbc6b4516 about a year before he wrote the comment above]. :-)

One reason to not do this are performance (we should measure the diskio hit for a big server). We should also think through the atomicity expectations: if the rename fails, will the state of the pickle adequately match the state of the servers (probably yes, but it's worth a few more minutes of thought).

This is a crawler state file, so if the rename fails then the previous version of the file is left there, right? So the crawler will re-do whatever work it did since the last successful rename. If there is a persistent problem which causes rename to fail (e.g. the permissions on the old state file and its parent directory forbid you to overwrite it?), then this would be a bad failure mode where the crawler always appears to be doing work but never finishes. (The "Sisyphus" Failure Mode. :-))

Oh, except the failure to do the rename will cause the node to stop with an error message, right? Re-opening this ticket and changing it to say "If the write fails, stop the node with a clear error message.". Why stop the node? Because there is no other reliable way to get the operator's attention. Also because there is something screwed-up here, and stopping is the safest course to prevent worse failures.

comment:7 in reply to: ↑ 6 ; follow-up: Changed at 2011-08-19T02:02:54Z by davidsarah

Replying to zooko:

Why stop the node? Because there is no other reliable way to get the operator's attention.

Seems to me that's a bug right there. There should be some other way to get the operator's attention, and if there were, there would be lots of things that could use it without introducing SPoFs by stopping the node. Straw-man suggestion: send (rate-limited) email to an admin address giving a link to the log.

comment:8 in reply to: ↑ 7 Changed at 2011-08-19T05:27:00Z by zooko

Replying to davidsarah:

Replying to zooko:

Why stop the node? Because there is no other reliable way to get the operator's attention.

Seems to me that's a bug right there.

Well, let's open a ticket for it, but I still think that even if we fix that issue by providing a reliable way to get the operator's attention, that stopping the node is still the right thing to do when we discover something wrong like this, because (a) it increases the likelihood of getting the operator's attention, (b) it reduces the chance of damage or exploitation, and (c) it helps the operator with post-mortem diagnostics.

There should be some other way to get the operator's attention, and if there were, there would be lots of things that could use it without introducing SPoFs by stopping the node.

What do you mean SPoFs?

Straw-man suggestion: send (rate-limited) email to an admin address giving a link to the log.

I guess the Incident Gatherer is perfect for this.

comment:9 Changed at 2011-10-13T19:53:12Z by davidsarah

  • Milestone changed from 1.9.0 to 1.10.0

comment:10 Changed at 2012-04-01T05:02:29Z by davidsarah

  • Priority changed from major to normal

comment:11 Changed at 2012-10-16T02:36:28Z by zooko

  • Description modified (diff)
  • Summary changed from if bucket_counter.state or lease_checker.state can't be written, stop the node with an error message to deal with fragile, but disposable, bucket state files

comment:12 follow-up: Changed at 2013-07-09T14:25:47Z by daira

  • Description modified (diff)

I suspect that changeset 3cb99364 effectively fixed this ticket. (It doesn't delete a corrupted state file, but it does use default values, and the corrupted file will be overwritten on the next crawler pass.) Please close this ticket as fixed in milestone 1.9.2 if you agree.

See also #1290.

comment:13 in reply to: ↑ 12 Changed at 2013-07-09T15:36:42Z by zooko

  • Owner changed from zooko to daira
  • Status changed from reopened to new

Replying to daira:

I suspect that changeset 3cb99364 effectively fixed this ticket. (It doesn't delete a corrupted state file, but it does use default values, and the corrupted file will be overwritten on the next crawler pass.) Please close this ticket as fixed in milestone 1.9.2 if you agree.

I agree, so I'll close this ticket, but there's something that I don't like about 3cb99364, which is that it catches arbitrary Exception instead of a set of exceptions that we think it should handle. Catching of general Exception has caused problems in the past, such as when the exception being caught is due to a shallow bug in the the code (like an AttributeError, or a out-of-memory exception, etc. How about if we change it from catching Exception to catching (EnvironmentError, EOFError, KeyError)?

comment:14 Changed at 2013-07-14T16:14:44Z by daira

Short of reading the stdlib source code, I don't know the set of exceptions that pickle.load can raise. I'd be open to changing it so that the try ... except Exception is only around the pickle.load call, though. Perhaps something like:

        state = {"version": 1,
                 "last-cycle-finished": None,
                 "current-cycle": None,
                 "last-complete-prefix": None,
                 "last-complete-bucket": None,
                 }
        try:
            f = open(self.statefile, "rb")
        except EnvironmentError:
            pass
        else:
            try:
                state = pickle.load(f)
            except Exception:
                pass
            finally:
                f.close()
Last edited at 2013-07-14T16:21:40Z by daira (previous) (diff)

comment:15 Changed at 2013-11-14T19:08:24Z by zooko

#834 would fix this.

Version 0, edited at 2013-11-14T19:08:24Z by zooko (next)
Note: See TracTickets for help on using tickets.