#1628 closed defect (fixed)
UCWE on deep check with recent version
Reported by: | davidsarah | Owned by: | kevan |
---|---|---|---|
Priority: | critical | Milestone: | 1.9.2 |
Component: | code | Version: | 1.9.0 |
Keywords: | tahoe-check verify ucwe regression review-needed | Cc: | kpreid, killyourtv@… |
Launchpad Bug: | | |
Description (last modified by kpreid)
Reported by kpreid: I upgraded my tahoe to a recent development version, in the git mirror:
git://github.com/warner/tahoe-lafs.git commit b73aba98de93c4c0b0013f1dd435c64e73e48f4c
Now, my daily deep-check --repair --add-lease on my four aliases on the volunteer grid consistently fails as follows. The first might have a legitimate uncoordinated write, but the last two are not regularly touched by anything but the repair process, and this identical failure has occurred for the past 4 days.
I'd appreciate it if this is fixed before my leases expire. :-)
ERROR: UncoordinatedWriteError() "[Failure instance: Traceback (failure with no frames): <class 'allmydata.mutable.common.UncoordinatedWriteError'>: "
ERROR: AssertionError() "[Failure instance: Traceback: <type 'exceptions.AssertionError'>: "
/Volumes/Opp/External/Projects/tahoe/src/allmydata/mutable/filenode.py:563:upload
/Volumes/Opp/External/Projects/tahoe/src/allmydata/mutable/filenode.py:661:_do_serialized
/Volumes/Opp/External/Projects/tahoe/support/lib/python2.6/site-packages/Twisted-11.1.0-py2.6-macosx-10.6-x86_64.egg/twisted/internet/defer.py:298:addCallback
/Volumes/Opp/External/Projects/tahoe/support/lib/python2.6/site-packages/Twisted-11.1.0-py2.6-macosx-10.6-x86_64.egg/twisted/internet/defer.py:287:addCallbacks
--- <exception caught here> ---
/Volumes/Opp/External/Projects/tahoe/support/lib/python2.6/site-packages/Twisted-11.1.0-py2.6-macosx-10.6-x86_64.egg/twisted/internet/defer.py:545:_runCallbacks
/Volumes/Opp/External/Projects/tahoe/src/allmydata/mutable/filenode.py:661:<lambda>
/Volumes/Opp/External/Projects/tahoe/src/allmydata/mutable/filenode.py:689:_upload
/Volumes/Opp/External/Projects/tahoe/src/allmydata/mutable/publish.py:402:publish
ERROR: UncoordinatedWriteError() "[Failure instance: Traceback (failure with no frames): <class 'allmydata.mutable.common.UncoordinatedWriteError'>: "
ERROR: UncoordinatedWriteError() "[Failure instance: Traceback (failure with no frames): <class 'allmydata.mutable.common.UncoordinatedWriteError'>: "
Before these failures, here is a typical example of normal results. (The indentation is added by my script.) I had understood the post-repair unhealthiness to be due to disagreement between the "healthy file" and "repair (successful re-upload)", a to-be-fixed bug.
done: 3 objects checked
 pre-repair: 3 healthy, 0 unhealthy
 0 repairs attempted, 0 successful, 0 failed
 post-repair: 3 healthy, 0 unhealthy
repair successful
done: 1 objects checked
 pre-repair: 0 healthy, 1 unhealthy
 1 repairs attempted, 1 successful, 0 failed
 post-repair: 0 healthy, 1 unhealthy
repair successful
done: 5 objects checked
 pre-repair: 4 healthy, 1 unhealthy
 1 repairs attempted, 1 successful, 0 failed
 post-repair: 4 healthy, 1 unhealthy
repair successful
done: 5 objects checked
 pre-repair: 4 healthy, 1 unhealthy
 1 repairs attempted, 1 successful, 0 failed
 post-repair: 4 healthy, 1 unhealthy
Attachments (6)
Change History (34)
comment:1 Changed at 2011-11-30T22:58:04Z by davidsarah
comment:2 Changed at 2011-12-01T00:04:32Z by kpreid
- Description modified (diff)
comment:3 Changed at 2011-12-01T10:17:26Z by killyourtv
I had the same problem after upgrading to 1.9, reported in #1583 (though I neglected to add the full traceback).
comment:4 Changed at 2011-12-06T05:55:16Z by zooko
Thanks for the information, kpreid and killyourtv. Kevin, would you please attempt to reproduce it and see if a foolscap incident report file is generated in ~/.tahoe/logs/incidents/, and then push the "Report A Problem" button as killyourtv did in #1583?
comment:5 follow-up: ↓ 11 Changed at 2011-12-06T13:43:09Z by kpreid
For some reason I cannot upload the second of the four incident files, which contained an AssertionError from mutable/filenode.py:563:upload. I have tried several times, including with a different format and filename, and Trac acts as if it succeeded but doesn't show the file. I have temporarily uploaded it here: http://switchb.org/pri/incident-2011-12-06--13-30-50Z-fhhyszi.flog.bz2
comment:6 Changed at 2011-12-10T22:55:28Z by kevan
- Owner changed from somebody to kevan
- Status changed from new to assigned
comment:7 Changed at 2011-12-13T01:22:29Z by kevan
Hm, it looks like the publisher lost the ability to keep track of multiple copies of a single share during the MDMF implementation. So, in your case, it knows about only one copy of each of a few shares that actually exist on more than one server; when it encounters the other copies while pushing other shares, it interprets them as evidence of an uncoordinated write and breaks. Can you test attachment:fix-1628.darcs.patch and let me know if it resolves the issue for you?
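[To make the described bookkeeping flaw concrete, here is a minimal, hypothetical sketch; the function names, server names, and structure are invented for illustration and do not match the actual code in allmydata/mutable/publish.py.]

# Hypothetical illustration of the bookkeeping bug described above.

def publish_single_server_per_share(placements):
    """Old-style bookkeeping: shnum -> one server.  A second copy of a
    share we already placed looks like someone else's write."""
    known = {}          # shnum -> server
    surprised = False
    for server, shnum in placements:
        if shnum in known and known[shnum] != server:
            surprised = True   # spurious UncoordinatedWriteError
        known[shnum] = server
    return surprised

def publish_set_of_servers_per_share(placements):
    """Fixed-style bookkeeping: shnum -> set of servers.  Extra copies of
    a share are recorded instead of being treated as a surprise."""
    known = {}          # shnum -> set of servers
    for server, shnum in placements:
        known.setdefault(shnum, set()).add(server)
    return False        # duplicate copies alone never trigger a surprise

# Share 0 legitimately lives on two servers (e.g. after a repair):
placements = [("server-A", 0), ("server-B", 0), ("server-C", 1)]
print(publish_single_server_per_share(placements))   # True  -> spurious UCWE
print(publish_set_of_servers_per_share(placements))  # False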
Changed at 2011-12-13T01:23:13Z by kevan
comment:8 Changed at 2011-12-13T01:44:58Z by kpreid
I'm currently trying not to have to deal with darcs. If you can supply that as a unified diff or against the git mirror I can test it.
comment:9 Changed at 2011-12-13T02:23:18Z by kevan
Does attachment:fix-1628.diff work for you? Don't mind the references to a fifth patch; it's not related to this issue.
comment:10 follow-up: ↓ 13 Changed at 2011-12-13T02:49:46Z by kpreid
fix-1628.diff appears to have eliminated the problem.
comment:11 in reply to: ↑ 5 Changed at 2011-12-13T21:48:16Z by zooko
Replying to kpreid:
For some reason I cannot upload the second of the four incident files, which contained an AssertionError from mutable/filenode.py:563:upload. I have tried several times, including with a different format and filename, and Trac acts as if it succeeded but doesn't show the file.
That's due to a misconfiguration of this Trac (#1581). My apologies.
comment:12 Changed at 2011-12-13T21:49:13Z by zooko
So is this a regression in 1.9.0 vs. earlier releases, and could it result in data loss, and should we plan a 1.9.1 release to fix it?
comment:13 in reply to: ↑ 10 Changed at 2011-12-13T22:54:46Z by killyourtv
comment:14 Changed at 2011-12-14T00:47:54Z by davidsarah
From the patch comments:
This tests for two regressions resulting from a design flaw in the 1.9 mutable publisher; specifically, that the publisher doesn't keep track of more than one server for each share. This can lead to spurious UCWEs, as seen in ticket #1628. This also means that the publisher will no longer write shares associated with a new version of a mutable file over all of the existing shares that it can find, which potentially decreases the robustness of the new version of the mutable file.
This sounds to me like a regression serious enough to justify a 1.9.1. Although multiple servers holding the same share shouldn't happen if there have been only publish operations with a stable set of servers, it can easily happen if the grid membership is less stable and there have been repairs.
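[To illustrate the second regression named in the patch comments, here is a small, hypothetical sketch (invented names and data, not Tahoe code) of how updating only one copy of a multiply-placed share leaves a stale copy of the old version behind.]

# Hypothetical sketch: if the publisher updates only one copy of a
# multiply-placed share, a stale copy of the old version survives on
# another server, reducing the robustness of the new version.

shares = {                       # server -> (shnum, seqnum)
    "server-A": (0, 9),
    "server-B": (0, 9),          # second copy of share 0
    "server-C": (1, 9),
}

def publish_new_version(shares, new_seqnum, one_copy_only):
    updated = set()
    for server, (shnum, _) in sorted(shares.items()):
        if one_copy_only and shnum in updated:
            continue             # buggy behavior: skip the extra copy
        shares[server] = (shnum, new_seqnum)
        updated.add(shnum)

publish_new_version(shares, 10, one_copy_only=True)
print(shares)   # server-B still holds seqnum 9 for share 0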
comment:15 Changed at 2011-12-14T00:48:51Z by davidsarah
- Milestone changed from undecided to soon
comment:16 Changed at 2011-12-14T21:12:38Z by killyourtv
- Cc killyourtv@… added
comment:17 follow-up: ↓ 26 Changed at 2011-12-15T03:01:03Z by kevan
I agree with comment:14. While investigating this issue, I noticed a potential regression in the way we handle #546 situations. I haven't had time to investigate that yet, and probably won't have time to investigate until this weekend. Can we wait on 1.9.1 until I make a ticket for that issue, so we can decide if it belongs in 1.9.1?
comment:18 Changed at 2011-12-16T16:52:21Z by zooko
- Milestone changed from soon to 1.9.1
- Priority changed from major to critical
Changed at 2011-12-27T20:38:48Z by kevan
comment:19 Changed at 2011-12-27T20:40:14Z by kevan
- Keywords review-needed added
attachment:fix-1628.darcs.2.patch fixes a flaw in my initial patch. I think it's ready for review.
comment:20 Changed at 2011-12-28T05:39:46Z by warner
oops, I just reviewed and landed the *first* patch, in e29323f68fc5447b. I'll see if I can deduce a delta between the two darcs patches..
comment:21 Changed at 2011-12-28T05:50:43Z by Brian Warner <warner@…>
In 147670fd89a04bad:
comment:22 Changed at 2011-12-28T05:52:05Z by Brian Warner <warner@…>
In 147670fd89a04bad:
comment:23 Changed at 2011-12-28T05:57:49Z by warner
kevan: can you double-check that I got that delta right? I *think* the only part that changed was this bit:
all_shnums = filter(lambda sh: len(self.writers[sh]) > 0,
                    self.writers.iterkeys())
if len(all_shnums) < self.required_shares or self.surprised:
    return self._failure()
with which I fully concur. Since empty lists are falsey, you could also express it like:
all_shnums = set([shnum for shnum in self.writers if self.writers[shnum]])
# or
all_shnums = set([shnum for shnum,writers in self.writers.items() if writers])
# or, relying upon the uniqueness of dict keys:
all_shnums = [shnum for shnum,writers in self.writers.items() if writers]
# or, since we only actually care about the count of unique shnums:
shares = len([shnum for shnum,writers in self.writers.items() if writers])
(also, be aware of the DictOfSets that I use in the immutable code for tracking the shnum->servers mapping)
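[For readers unfamiliar with that structure, here is a rough, hypothetical stand-in for the shnum -> servers mapping built on a plain defaultdict; it is not the real DictOfSets class, just a sketch of the shape of the data and of the share count computed in the snippet above.]

# Rough stand-in (not the real DictOfSets) for the shnum -> servers mapping.
from collections import defaultdict

writers = defaultdict(set)       # shnum -> set of writer/server ids
writers[0].update(["server-A", "server-B"])
writers[1].add("server-C")
writers[2]                       # a shnum with no surviving writers

# The count discussed above: shnums that still have at least one live
# writer (empty sets are falsey, so they are filtered out).
shares = len([shnum for shnum, ws in writers.items() if ws])
print(shares)                    # 2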
Should I leave this ticket open until we get that second test written?
comment:24 Changed at 2011-12-28T19:26:39Z by kevan
I altered test_multiply_placed_shares to fail if some of the shares aren't updated to the newest version on an update, so we don't need to wait for another test. I guess the git changelogs are a little stale; sorry for any confusion from that.
You caught the only important change with your delta. I also removed
self.g.clients[0].DEFAULT_ENCODING_PARAMETERS['n'] = 75
from test_multiply_placed_shares. Placing a lot of shares (so each server holds a few shares) made the test yield an UCWE more reliably, but it still sometimes made it to the multiple version check due to #1641. It didn't seem worthwhile to set a magical encoding parameter if it didn't always work, and the test always failed without the fix in any case, so I took it out. It probably doesn't matter either way, but the test might be a little faster without that line.
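[For context, a minimal, hypothetical illustration (not the actual test; the server count of 10 is an assumption) of why a large N such as 75 packs several shares onto each server of a small test grid, which is what made duplicate placements more likely.]

# Spreading 75 shares round-robin over 10 servers gives each server 7 or 8.
num_servers = 10
n = 75
per_server = {i: 0 for i in range(num_servers)}
for shnum in range(n):
    per_server[shnum % num_servers] += 1
print(per_server)   # {0: 8, 1: 8, 2: 8, 3: 8, 4: 8, 5: 7, ...}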
Thanks for the review, the suggested alternatives, and for landing the fixes.
comment:25 Changed at 2011-12-29T00:00:40Z by warner
- Resolution set to fixed
- Status changed from assigned to closed
Ok, I applied that change too, in 7989fe21cc1465ac. So I think we can close this one now. Thanks!
comment:26 in reply to: ↑ 17 ; follow-up: ↓ 27 Changed at 2011-12-31T04:53:18Z by zooko
Replying to kevan:
I agree with comment:14. While investigating this issue, I noticed a potential regression in the way we handle #546 situations. I haven't had time to investigate that yet, and probably won't have time to investigate until this weekend. Can we wait on 1.9.1 until I make a ticket for that issue, so we can decide if it belongs in 1.9.1?
Kevan: did you do this investigation? Release Manager Brian said "a week or two" and the Milestone is currently marked as due on 2012-01-15, so I think we have time.
comment:27 in reply to: ↑ 26 Changed at 2011-12-31T22:06:27Z by kevan
I did -- the result is ticket #1641.
comment:28 Changed at 2012-03-25T21:48:08Z by amir
Hi all,
I'm a tahoe-lafs novice, but playing with my first shares (4 storage servers, 2 clients on 2 of the storage servers, k=2, H=4, N=5) I managed to mess up my shares within a short time frame. I'm not yet fully sure what we did to confuse them, but in the end we had one share of seq2 and 5 shares of seq10. A deep-check --repair alias: always resulted in this:
$ tahoe deep-check --repair -v sound:28C3
'<root>': not healthy
repair successful
'28c3.Pausenmusik.mp3': healthy
done: 2 objects checked
pre-repair: 1 healthy, 1 unhealthy
1 repairs attempted, 1 successful, 0 failed
post-repair: 1 healthy, 1 unhealthy
It was always the mutable files which became unhealthy, and no number of repairs could get them fixed.
jg71 suggested using git head, as some bugs were fixed there. I did as proposed and just found out that git head fixes all the above problems without fuss. Great work guys!
Yes, that sounds like #766.