Opened at 2009-06-10T14:23:41Z
Closed at 2009-06-30T12:38:12Z
#732 closed defect (fixed)
Not Enough Shares when repairing a file which has 7 shares on 2 servers
| Reported by: | zooko | Owned by: | zooko |
|---|---|---|---|
| Priority: | major | Milestone: | 1.5.0 |
| Component: | code-encoding | Version: | 1.4.1 |
| Keywords: | repair process | Cc: | kpreid |
| Launchpad Bug: | | | |
Description
My demo at the Northern Colorado Linux Users Group had an unfortunate climactic conclusion when someone (whose name I didn't catch) asked about repairing damaged files, so I clicked the check button with the "repair" checkbox turned on, and got this:
NotEnoughSharesError: no shares could be found. Zero shares usually indicates a corrupt URI, or that no servers were connected, but it might also indicate severe corruption. You should perform a filecheck on this object to learn more.
I couldn't figure it out and had to just bravely claim that Tahoe had really great test coverage and this sort of unpleasant surprise wasn't common. I also promised to email them all with the explanation, so I'm subscribing to the NCLUG mailing list so that I can e-mail the URL to this ticket. :-)
The problem remains reproducible today. I have a little demo grid with an introducer, a gateway, and two storage servers. The gateway has its storage service turned off. I have a file stored therein with 3-of-10 encoding, and I manually rm'ed three shares from one of the storage servers. Check correctly reports:
"summary": "Not Healthy: 7 shares (enc 3-of-10)"
Check also works with the "verify" checkbox turned on.
When I try to repair, I get the Not Enough Shares error and an incident report like this one (full incident report file attached):
07:03:12.747 [5977]: web: 127.0.0.1 GET /uri/[CENSORED].. 200 308553
07:03:25.604 [5978]: <Repairer #6>(u7rxp): starting repair
07:03:25.604 [5979]: CHKUploader starting
07:03:25.604 [5980]: starting upload of <DownUpConnector #6>
07:03:25.604 [5981]: creating Encoder <Encoder for unknown storage index>
07:03:25.604 [5982]: <CiphertextDownloader #22>(u7rxpbtbw5wb): starting download
07:03:25.613 [5983]: SCARY <CiphertextDownloader #22>(u7rxpbtbw5wb): download failed!
FAILURE: [CopiedFailure instance: Traceback from remote host --
Traceback (most recent call last):
  File "/Users/wonwinmcbrootles/playground/allmydata/tahoe/trunk/trunk/src/allmydata/immutable/repairer.py", line 69, in start
    d2 = dl.start()
  File "/Users/wonwinmcbrootles/playground/allmydata/tahoe/trunk/trunk/src/allmydata/immutable/download.py", line 715, in start
    d.addCallback(self._got_all_shareholders)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Twisted-8.2.0-py2.5-macosx-10.3-i386.egg/twisted/internet/defer.py", line 195, in addCallback
    callbackKeywords=kw)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Twisted-8.2.0-py2.5-macosx-10.3-i386.egg/twisted/internet/defer.py", line 186, in addCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Twisted-8.2.0-py2.5-macosx-10.3-i386.egg/twisted/internet/defer.py", line 328, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "/Users/wonwinmcbrootles/playground/allmydata/tahoe/trunk/trunk/src/allmydata/immutable/download.py", line 810, in _got_all_shareholders
    self._verifycap.needed_shares)
allmydata.interfaces.NotEnoughSharesError: Failed to get enough shareholders
] [INCIDENT-TRIGGER]
07:03:26.253 [5984]: web: 127.0.0.1 POST /uri/[CENSORED].. 410 234
Attachments (5)
Change History (17)
Changed at 2009-06-10T14:26:02Z by zooko
comment:1 Changed at 2009-06-15T19:45:48Z by zooko
comment:2 Changed at 2009-06-16T19:19:13Z by zooko
Here is the mailing list message on nclug@nclug.org where I posted the promised follow-up.
comment:3 Changed at 2009-06-19T21:43:52Z by kpreid
- Cc kpreid added
I have the same problem; here are my incident reports (volunteer grid). Here is the troublesome directory. (My repair attempts have been from the CLI using the RW cap, not the RO.) Note that the files are all readable, and tahoe deep-check agrees they are recoverable; only repair fails.
comment:4 Changed at 2009-06-20T21:07:35Z by warner
kpreid: could you also upload those incident reports to this ticket? I don't have a volunteergrid node running on localhost.
comment:5 Changed at 2009-06-20T21:31:51Z by zooko
I just set up a public web gateway for volunteergrid:
http://nooxie.zooko.com:9798/
It's running on nooxie.zooko.com, the same host that runs the introducer for the volunteergrid.
comment:6 Changed at 2009-06-20T21:52:21Z by warner
Thanks for the gateway! In the spirit of fewer clicks, http://nooxie.zooko.com:9798/uri/URI%3ADIR2-RO%3Awz2jevwzhgzdkpocyvadxjx6sm%3Aicljlu7etpouvvnduhuzyfgyyv5bvqp4iophltfdbtrwdjy3wuea/ contains kpreid's incident reports, as long as the volunteer grid and zooko's gateway stay up..
(in case it wasn't clear, I'm arguing that the volunteer grid is not as good a place to put bug report data as, say, this bug report :-).
comment:7 Changed at 2009-06-20T22:57:15Z by kpreid
Done.
comment:8 Changed at 2009-06-21T07:57:37Z by warner
ok, I found the bug.. repairer.py instantiates CiphertextDownloader with a Client instance, when it's supposed to be passing a StorageFarmBroker instance. They both happen to have a method named get_servers, and the methods have similar signatures (they both accept a single string and return an iterable), but different semantics. The result is that the repairer's downloader was getting zero servers, because it was asking with a storage_index where it should have been asking with a service name. Therefore the downloader didn't send out any queries, so it got no responses, so it concluded that there were no shares available.
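To make the mismatch concrete, here is a minimal standalone sketch (made-up classes and data, not the actual Tahoe code) of how two objects with a same-named get_servers method can fail silently when the wrong one is passed:

class Client:
    """Illustrative stand-in: looks up servers by service *name*, e.g. "storage"."""
    def __init__(self, services):
        self._services = services  # e.g. {"storage": ["server-1", "server-2"]}

    def get_servers(self, service_name):
        return self._services.get(service_name, [])


class StorageFarmBroker:
    """Illustrative stand-in: looks up servers for a *storage index*."""
    def __init__(self, servers):
        self._servers = servers

    def get_servers(self, storage_index):
        # the real broker returns a permuted peer list for the storage index;
        # the permutation is irrelevant to this illustration
        return list(self._servers)


def start_download(broker, storage_index):
    # the downloader expects StorageFarmBroker semantics: servers for this SI
    servers = broker.get_servers(storage_index)
    if not servers:
        raise RuntimeError("NotEnoughSharesError: no shares could be found")
    return servers


servers = ["server-1", "server-2"]
print(start_download(StorageFarmBroker(servers), "u7rxp..."))  # ['server-1', 'server-2']

try:
    # Passing the Client instead: the storage index is treated as a service
    # name, no service by that name exists, so zero servers come back and the
    # downloader concludes there are no shares at all.
    start_download(Client({"storage": servers}), "u7rxp...")
except RuntimeError as e:
    print(e)

Because both calls accept a string and return an iterable, nothing fails loudly; the empty result just looks like "no shares".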
test_repairer still passes because it's using a NoNetworkClient instead of the regular Client, and NoNetworkClient isn't behaving quite the same way as Client (the get_servers method happens to behave like StorageFarmBroker).
This was my bad.. I updated a number of places but missed repairer.py. The StorageFarmBroker change, in general, should remove much of the need for a separate no-network test client (the plan is to use the regular client but configure it to not talk to an introducer, and to stuff in a bunch of loopback'ed storage servers). But during the transition period, this one fell through.
My plan is to change NoNetworkClient first, so that test_repairer fails like it's supposed to, then change one of the get_servers to a different name (so that the failure turns into an AttributeError), then finally fix repairer.py to pass in the correct object. Hopefully I'll get that done tomorrow.
If you'd like to just fix it (for local testing), edit repairer.py line 53 (in Repairer.start) and change the first argument of download.CiphertextDownloader from self._client to self._client.get_storage_broker(). A quick test here suggests that this should fix the error.
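Roughly, that local edit looks like this (a fragment only, not the actual repairer.py source; the remaining constructor arguments are elided, not reproduced):

# Repairer.start(), around line 53 of repairer.py
# before (buggy): the Client itself is handed to the downloader
dl = download.CiphertextDownloader(self._client, ...)

# after (suggested local fix): hand it the StorageFarmBroker instead
dl = download.CiphertextDownloader(self._client.get_storage_broker(), ...)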
comment:9 Changed at 2009-06-24T04:13:53Z by warner
the patches I pushed in the last few days should fix this problem. Zooko, kpreid, could you upgrade and try the repair again? And if that works, close this ticket?
comment:10 Changed at 2009-06-24T14:36:10Z by kpreid
Seems to work. On my test case I get a different error: ERROR: MustForceRepairError(There were unrecoverable newer versions, so force=True must be passed to the repair() operation) but I assume this is unrelated.
comment:11 Changed at 2009-06-24T21:22:50Z by warner
Yeah, MustForceRepairError is indicated for mutable files, when there are fewer than 'k' shares of some version N, but k or more shares of some version N-1. (the version numbers are slightly more complicated than that, but that's irrelevant). This means that the repairer sees evidence of a newer version, but is unable to recover it, and passing in force=True to the repair() call will knowingly give up on that version.
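The condition can be sketched like this (an illustrative helper, not the actual mutable-file repairer code; the version-number subtleties mentioned above are ignored):

from collections import Counter

def must_force_repair(share_seqnums, k):
    """share_seqnums: one sequence number per share found; k: shares needed."""
    counts = Counter(share_seqnums)
    newest = max(counts)
    recoverable = [seq for seq, n in counts.items() if n >= k]
    # a newer version is visible but unrecoverable, while an older version is
    # recoverable: repairing would knowingly give up on the newer version, so
    # the repairer demands force=True
    return counts[newest] < k and any(seq < newest for seq in recoverable)

# Example: 2 shares of seq5 (fewer than k=3) alongside 7 shares of seq4
print(must_force_repair([5, 5] + [4] * 7, k=3))  # True -> MustForceRepairError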
I don't think there is yet a webapi to pass force=True. Also, I think there might be situations in which the repairer fails to look far enough for newer versions. Do a "check" and look at the version numbers (seqNNN in the share descriptions), to see if the message seems correct.
This can happen when a directory update occurs while the node is not connected to its usual storage servers, especially if the servers that *are* available at the time go away later.
comment:12 Changed at 2009-06-30T12:38:12Z by zooko
- Resolution set to fixed
- Status changed from new to closed
#736 (UnrecoverableFileError on directory which has 6 shares (3 needed)) may be related.