[tahoe-dev] [tahoe-lafs] #651: errors on directory write: UncoordinatedWriteError, KeyError
tahoe-lafs
trac at allmydata.org
Wed Mar 4 11:01:13 PST 2009
#651: errors on directory write: UncoordinatedWriteError, KeyError
--------------------------+-------------------------------------------------
Reporter: zooko | Owner:
Type: defect | Status: new
Priority: major | Milestone: 1.3.1
Component: code-mutable | Version: 1.3.0
Keywords: | Launchpad_bug:
--------------------------+-------------------------------------------------
When I tried to update my klog --
http://testgrid.allmydata.org:3567/uri/URI:DIR2-RO:j74uhg25nwdpjpacl6rkat2yhm:kav7ijeft5h7r7rxdp5bgtlt3viv32yabqajkrdykozia5544jqa/wiki.html
(testgrid) -- a couple of days ago, I got an !UncoordinatedWriteError. I
also got an !UncoordinatedWriteError when I tried to make a new unlinked
directory on the test grid. The incident report from one of those is
attached as 'incident-2009-02-27-215731-aj5o5ti.flog.bz2'. It contains
the following lines:
{{{
# 21:57:32.317 [14105]: WEIRD our testv failed, so the write did not happen
# 21:57:32.317 [14106]: somebody modified the share on us: shnum=0: I thought they had #1:R=7ahx, but testv reported #1:R=7ahx
}}}
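That "testv failed" message comes from the test-and-set protocol for
mutable shares: before writing, the client sends each server a test
vector describing what it believes the share currently contains, and the
server applies the write only if the test passes. Roughly like this (a
sketch with hypothetical names, not the actual RIStorageServer
interface):
{{{
from collections import namedtuple

Share = namedtuple("Share", ["checkstring", "data"])

def conditional_write(shares, shnum, expected_checkstring, new_share):
    """Write share `shnum` only if its current checkstring (derived
    from the seqnum and root hash) matches what the writer saw when it
    built its servermap; otherwise report the surprise."""
    current = shares.get(shnum)
    if current is not None and current.checkstring != expected_checkstring:
        # The share changed underneath us; this mismatch is what the
        # writer surfaces as UncoordinatedWriteError.
        return (False, current.checkstring)
    shares[shnum] = new_share
    return (True, new_share.checkstring)

# A write that loses the race:
shares = {0: Share("#1:R=7ahx", "old")}
ok, actual = conditional_write(shares, 0, "#2:R=xxxx", Share("#3:R=yyyy", "new"))
assert not ok and actual == "#1:R=7ahx"
}}}
Note that in the log above the "expected" and "reported" checkstrings
print identically (#1:R=7ahx both times), which looks like the same sort
of confusion that #546 was about.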
This reminds me of an issue that I thought we had fixed before the 1.3.0
release -- #546 (mutable-file surprise shares raise inappropriate UCWE).
At the time I did a check on my klog and saw that most of its shares were
on a single storage server:
{{{
<zooko> check reports all good                                    [18:33]
<zooko> 8 shares
<zooko> 3 hosts with shares
<zooko> 6 of those 8 are on bs3c1
<zooko> recoverable versions 1
<zooko> unrecoverable versions 0
<zooko> Scary to realize that my klog is reliant upon the continued life of a single node.
<zooko> :-(
}}}
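For context: shares land on servers in "permuted order" -- each file
gets its own server ordering, derived from a hash mixing the storage
index with each server's id -- so on a small grid the first server in a
file's permuted list can end up holding many of its shares, which is how
6 of 8 landed on bs3c1. A minimal sketch of that ordering, assuming
SHA-1 and plain byte-string ids (the real logic lives in Tahoe's
peer-selection code):
{{{
import hashlib

def permuted_order(storage_index, peerids):
    # Each file (storage index) induces its own ordering of the
    # servers: sort by a hash mixing the peerid with the storage
    # index. (Sketch only; assumes byte-string ids.)
    key = lambda peerid: hashlib.sha1(peerid + storage_index).digest()
    return sorted(peerids, key=key)
}}}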
This morning I tried again to update my klog, and this time I got a
!KeyError (full HTML output attached as "!KeyError.html"). Doing a check
now shows:
{{{
# Report:
  Unrecoverable Versions: 2*seq20-qb3p
  Unhealthy: some versions are unrecoverable
  Unhealthy: no versions are recoverable
# Share Counts: need 3-of-10, have 2
# Hosts with good shares: 2
# Corrupt shares: none
# Wrong Shares: 0
# Good Shares (sorted in share order):
  seq20-qb3p-sh2 2y7ldksggg447xnf4zwsjccx7ihs6wfm (amduser at tahoebs3:public/bs3c4)
  seq20-qb3p-sh6 6fyx5u4zr7tvz3szynihc4x3uc6ct5gh (amduser at tahoebs1:public/client2)
# Recoverable Versions: 0
# Unrecoverable Versions: 1
# Share Balancing (servers in permuted order):
  u5vgfpug7qhkxdtj76tcfh6bmzyo6w5s (amduser at tahoebs3:public/bs3c2)
  u5vgfpug7qhkxdtj76tcfh6bmzyo6w5s (amduser at tahoebs3:public/bs3c2)
  u5vgfpug7qhkxdtj76tcfh6bmzyo6w5s (amduser at tahoebs3:public/bs3c2)
  jfdpabh34vsrhll3lbdn3v23vem4hr2z (amduser at tahoebs4:public/bs4c2)
  jfdpabh34vsrhll3lbdn3v23vem4hr2z (amduser at tahoebs4:public/bs4c2)
  jfdpabh34vsrhll3lbdn3v23vem4hr2z (amduser at tahoebs4:public/bs4c2)
  2y7ldksggg447xnf4zwsjccx7ihs6wfm (amduser at tahoebs3:public/bs3c4)  seq20-qb3p-sh2
  2y7ldksggg447xnf4zwsjccx7ihs6wfm (amduser at tahoebs3:public/bs3c4)  seq20-qb3p-sh2
}}}
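The "no versions are recoverable" verdict follows directly from the
share counts: with 3-of-10 encoding, a version needs at least 3 distinct
shares, and only sh2 and sh6 of seq20-qb3p were found. A one-function
sketch of that rule (hypothetical names):
{{{
def recoverable_versions(shares_by_version, k):
    """A mutable-file version is recoverable only if at least k
    distinct share numbers for it were found (encoding is k-of-N)."""
    return [v for v, shnums in shares_by_version.items()
            if len(set(shnums)) >= k]

# The check above: encoding is 3-of-10, but only sh2 and sh6 of
# seq20-qb3p were located, so no version is recoverable.
assert recoverable_versions({"seq20-qb3p": [2, 6]}, k=3) == []
}}}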
I just now verified that bs3c1 is connected to my client:
{{{
Nickname:     amduser at tahoebs3:public/bs3c1
Connected?:   Yes: to 207.7.153.161:48663
Since:        13:08:38 02-Mar-2009
Announced:    11:17:09 26-Feb-2009
Version:      allmydata-tahoe/1.3.0
Service Name: storage
}}}
Then I tried to load the write-cap to my klog again, got a !KeyError
again, did a check and got the same results as above, and confirmed that
bs3c1 is still connected.
So, what's going on? Is bs3c1 failing to respond to my client's requests,
or has it somehow deleted the shares of my klog that it held a couple of
days ago?
Oh! I see that I *can* access the read-only view of my klog through
http://testgrid.allmydata.org:3567 even though I can't access the exact
same URL with my local tahoe node. So either there is a networking
problem, or there is a problem with the version of tahoe that I'm running
here (allmydata-tahoe: 1.3.0-r3698, foolscap: 0.3.2, pycryptopp: 0.5.12,
zfec: 1.4.4, Twisted: 8.2.0) that is not present in the version running
on testgrid (allmydata-tahoe: 1.3.0, foolscap: 0.3.2, pycryptopp:
0.5.2-1, zfec: 1.4.0-4, Twisted: 2.5.0).
Here's the result of a check (and verify) on the read-only view of the
directory through testgrid.allmydata.org:
{{{
# Report:
  Recoverable Versions: 8*seq1128-fgyi/3*seq1122-37fb/8*seq1129-6veq
  Unhealthy: there are multiple recoverable versions
  Best Recoverable Version: seq1129-6veq
  Unhealthy: best version has only 8 shares (encoding is 3-of-10)
# Share Counts: need 3-of-10, have 8
# Hosts with good shares: 3
# Corrupt shares: none
# Wrong Shares: 11
# Good Shares (sorted in share order):
  seq1122-37fb-sh3 xiktf6ok5f5ao5znxxttriv233hmvi4v (amduser at tahoebs4:public/bs4c3)
  seq1122-37fb-sh8 lwkv6cjicbzqjwwwuifik3pogeupsicb (amduser at tahoebs4:public/bs4c4)
  seq1122-37fb-sh9 6fyx5u4zr7tvz3szynihc4x3uc6ct5gh (amduser at tahoebs1:public/client2)
  seq1128-fgyi-sh0 ckpjhpffmbmpv5rxc7uzrcdlu2ad6slj (amduser at tahoebs3:public/bs3c3)
                   2y7ldksggg447xnf4zwsjccx7ihs6wfm (amduser at tahoebs3:public/bs3c4)
  seq1128-fgyi-sh1 fcmlx6emlydpmgsksztuvtpxf5gdoamr (amduser at tahoebs4:public/bs4c1)
  seq1128-fgyi-sh2 jfdpabh34vsrhll3lbdn3v23vem4hr2z (amduser at tahoebs4:public/bs4c2)
                   trjdor3okozw4eld3l6zl4ap4z6h5tk6 (amduser at tahoebs5:public/bs5c4)
  seq1128-fgyi-sh4 uf7kq2svc6ozcawfm63e2qrbik2oixvt (amduser at tahoebs5:public/bs5c1)
  seq1128-fgyi-sh5 wfninubkrvhlyscum7rlschbhx5iarg3 (amduser at tahoebs1:public/client1)
  seq1128-fgyi-sh7 iktgow2qpu6ikooaqowoskgv4hfrp444 (nej1)
  seq1129-6veq-sh0 q5l37rle6pojjnllrwjyryulavpqdlq5 (amduser at tahoebs3:public/bs3c1)
  seq1129-6veq-sh1 q5l37rle6pojjnllrwjyryulavpqdlq5 (amduser at tahoebs3:public/bs3c1)
  seq1129-6veq-sh2 q5l37rle6pojjnllrwjyryulavpqdlq5 (amduser at tahoebs3:public/bs3c1)
  seq1129-6veq-sh3 q5l37rle6pojjnllrwjyryulavpqdlq5 (amduser at tahoebs3:public/bs3c1)
  seq1129-6veq-sh4 q5l37rle6pojjnllrwjyryulavpqdlq5 (amduser at tahoebs3:public/bs3c1)
  seq1129-6veq-sh6 q5l37rle6pojjnllrwjyryulavpqdlq5 (amduser at tahoebs3:public/bs3c1)
  seq1129-6veq-sh8 7tlov7egj7ultza3dy2dlgev6gijlgvk (amduser at tahoebs5:public/bs5c3)
  seq1129-6veq-sh9 ivjakubrruewknqg7wgb5hbinasqupj6 (amduser at tahoebs5:public/bs5c2)
# Recoverable Versions: 3
# Unrecoverable Versions: 0
}}}
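When a check finds several recoverable versions like this, the "Best
Recoverable Version" is the recoverable one with the highest sequence
number -- here seq1129 beats seq1128 and seq1122. A sketch of that
selection, using the share counts from the report above (hypothetical
helper; version labels abbreviated to the report's seqNNN-hash form):
{{{
def best_recoverable_version(shares_by_version, k):
    """Pick the recoverable version with the highest sequence number
    (ties would be broken by the root hash)."""
    recoverable = [v for v, shnums in shares_by_version.items()
                   if len(set(shnums)) >= k]
    if not recoverable:
        return None
    return max(recoverable, key=lambda v: int(v[3:].split("-")[0]))

# Distinct share numbers listed in the report above:
found = {
    "seq1122-37fb": [3, 8, 9],
    "seq1128-fgyi": [0, 1, 2, 4, 5, 7],
    "seq1129-6veq": [0, 1, 2, 3, 4, 6, 8, 9],
}
assert best_recoverable_version(found, k=3) == "seq1129-6veq"
}}}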
The most recent incident in my local incidents log says:
{{{
11:53:20.380 [43315]: WEIRD error during query: [Failure instance:
Traceback: <class 'foolscap.ipb.DeadReferenceError'>: Calling Stale Broker
/home/zooko/playground/allmydata/tahoe/trunk/trunk/src/allmydata/mutable/servermap.py:528:_do_query
/home/zooko/playground/allmydata/tahoe/trunk/trunk/src/allmydata/mutable/servermap.py:540:_do_read
/home/zooko/playground/allmydata/tahoe/trunk/trunk/src/allmydata/util/rrefutil.py:26:callRemote
build/bdist.linux-x86_64/egg/foolscap/referenceable.py:395:callRemote
--- <exception caught here> ---
/usr/lib/python2.5/site-packages/Twisted-8.2.0-py2.5-linux-x86_64.egg/twisted/internet/defer.py:106:maybeDeferred
build/bdist.linux-x86_64/egg/foolscap/referenceable.py:434:_callRemote
build/bdist.linux-x86_64/egg/foolscap/broker.py:467:newRequestID
]
FAILURE:
}}}
The full incident is attached as
incident-2009-03-04-115318-ebt4x5a.flog.bz2.
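The DeadReferenceError ("Calling Stale Broker") means the connection to
a storage server went away between when the servermap was built and when
the query's callRemote fired. One way to keep a single stale connection
from aborting the whole map update is to trap that failure per-query; a
sketch, assuming a foolscap RemoteReference `rref` (hypothetical
wrapper, not Tahoe's actual servermap code, though slot_readv is the
real remote method name):
{{{
from foolscap.ipb import DeadReferenceError

def query_server(rref, storage_index):
    """Ask one storage server for its shares of this slot, treating a
    dropped connection as "no answer" instead of a fatal error."""
    d = rref.callRemote("slot_readv", storage_index, [], [(0, 2000)])
    def _dead(f):
        f.trap(DeadReferenceError)  # "Calling Stale Broker"
        return {}  # server went away mid-query; report no shares
    d.addErrback(_dead)
    return d
}}}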
--
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/651>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid