[tahoe-dev] [tahoe-lafs] #651: errors on directory write: UncoordinatedWriteError, KeyError

tahoe-lafs trac at allmydata.org
Wed Mar 4 11:01:13 PST 2009


#651: errors on directory write: UncoordinatedWriteError, KeyError
--------------------------+-------------------------------------------------
 Reporter:  zooko         |           Owner:       
     Type:  defect        |          Status:  new  
 Priority:  major         |       Milestone:  1.3.1
Component:  code-mutable  |         Version:  1.3.0
 Keywords:                |   Launchpad_bug:       
--------------------------+-------------------------------------------------
 When I tried to update my klog --
 http://testgrid.allmydata.org:3567/uri/URI:DIR2-RO:j74uhg25nwdpjpacl6rkat2yhm:kav7ijeft5h7r7rxdp5bgtlt3viv32yabqajkrdykozia5544jqa/wiki.html
 (on the testgrid) -- a couple of days ago, I got an
 !UncoordinatedWriteError.  I also got an !UncoordinatedWriteError when I
 tried to make a new unlinked directory on the test grid.  The incident
 report from one of those is attached as
 'incident-2009-02-27-215731-aj5o5ti.flog.bz2'.  It contains the following
 lines:

 {{{
 # 21:57:32.317 [14105]: WEIRD our testv failed, so the write did not happen
 # 21:57:32.317 [14106]: somebody modified the share on us: shnum=0: I thought they had #1:R=7ahx, but testv reported #1:R=7ahx
 }}}
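
 (For readers not steeped in the mutable-file code: the "testv" is the
 test vector in the servers' test-and-set operation -- the client asks
 each server to apply the write only if the share still carries the
 checkstring, seqnum plus root hash, that the client last observed.  A
 minimal sketch of the idea, with illustrative names rather than the real
 storage-server API:)

 {{{
 # Sketch of the test-and-set check behind "our testv failed".
 # Share / checkstring here are illustrative, not the real server code.

 class Share:
     def __init__(self, seqnum, roothash):
         self.seqnum = seqnum
         self.roothash = roothash

     def checkstring(self):
         # the "#1:R=7ahx"-style tag in the log above: seqnum + root hash
         return (self.seqnum, self.roothash)

 def testv_and_write(current, expected_checkstring, new):
     # Apply the write only if the share on disk still matches what the
     # writer last read; otherwise refuse, and the client reports
     # UncoordinatedWriteError instead of clobbering the other writer.
     if current.checkstring() != expected_checkstring:
         return (False, current.checkstring())   # "our testv failed"
     return (True, new.checkstring())            # write applied
 }}}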

 This reminds me of an issue that I thought we had fixed before the 1.3.0
 release -- #546 (mutable-file surprise shares raise inappropriate UCWE).

 At the time I did a check on my klog and saw that most of its shares were
 on a single storage server:

 {{{
 <zooko> check reports all good
 [18:33]
 <zooko> 8 shares
 <zooko> 3 hosts with shares
 <zooko> 6 of those 8 are on bs3c1
 <zooko> recoverable versions 1
 <zooko> unrecoverable versions 0
 <zooko> Scary to realize that my klog is reliant upon the continued life
         of a single node.
 <zooko> :-(
 }}}
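
 The worry in that excerpt can be stated mechanically: with 3-of-10
 encoding, losing one server is survivable only if at least 3 distinct
 shares remain elsewhere.  A quick sketch (the share->server map is
 illustrative, mirroring the counts above):

 {{{
 def single_points_of_failure(share_locations, k):
     # Servers whose loss, by itself, leaves fewer than k shares.
     fatal = []
     for victim in set(share_locations.values()):
         surviving = [sh for sh, srv in share_locations.items()
                      if srv != victim]
         if len(surviving) < k:
             fatal.append(victim)
     return fatal

 # Illustrative map matching the check above: 8 shares, 6 on bs3c1.
 share_locations = {0: "bs3c1", 1: "bs3c1", 2: "bs3c1", 3: "bs3c1",
                    4: "bs3c1", 5: "bs3c1", 6: "bs3c4", 7: "client2"}
 print(single_points_of_failure(share_locations, k=3))   # ['bs3c1']
 }}}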

 This morning I tried again to update my klog, and this time I got a
 !KeyError (full HTML output attached as "!KeyError.html").  Doing a check
 now shows:

 {{{
 # Report:

 Unrecoverable Versions: 2*seq20-qb3p
 Unhealthy: some versions are unrecoverable
 Unhealthy: no versions are recoverable

 # Share Counts: need 3-of-10, have 2
 # Hosts with good shares: 2
 # Corrupt shares: none
 # Wrong Shares: 0
 # Good Shares (sorted in share order):
 seq20-qb3p-sh2  2y7ldksggg447xnf4zwsjccx7ihs6wfm (amduser at tahoebs3:public/bs3c4)
 seq20-qb3p-sh6  6fyx5u4zr7tvz3szynihc4x3uc6ct5gh (amduser at tahoebs1:public/client2)
 # Recoverable Versions: 0
 # Unrecoverable Versions: 1
 # Share Balancing (servers in permuted order):
 u5vgfpug7qhkxdtj76tcfh6bmzyo6w5s (amduser at tahoebs3:public/bs3c2)
 u5vgfpug7qhkxdtj76tcfh6bmzyo6w5s (amduser at tahoebs3:public/bs3c2)
 u5vgfpug7qhkxdtj76tcfh6bmzyo6w5s (amduser at tahoebs3:public/bs3c2)
 jfdpabh34vsrhll3lbdn3v23vem4hr2z (amduser at tahoebs4:public/bs4c2)
 jfdpabh34vsrhll3lbdn3v23vem4hr2z (amduser at tahoebs4:public/bs4c2)
 jfdpabh34vsrhll3lbdn3v23vem4hr2z (amduser at tahoebs4:public/bs4c2)
 2y7ldksggg447xnf4zwsjccx7ihs6wfm (amduser at tahoebs3:public/bs3c4)  seq20-qb3p-sh2
 2y7ldksggg447xnf4zwsjccx7ihs6wfm (amduser at tahoebs3:public/bs3c4)  seq20-qb3p-sh2
 }}}
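
 The "no versions are recoverable" verdict is simple arithmetic: a version
 needs at least k distinct share numbers to be decodable, and seq20-qb3p
 is down to sh2 and sh6 against k=3.  A sketch of that check:

 {{{
 def recoverable_versions(shares_by_version, k):
     # A version is recoverable iff >= k distinct share numbers survive.
     return [v for v, shnums in shares_by_version.items()
             if len(set(shnums)) >= k]

 # From the check report above: only sh2 and sh6 of seq20-qb3p remain.
 print(recoverable_versions({"seq20-qb3p": [2, 6]}, k=3))   # [] -- none
 }}}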

 I just now verified that bs3c1 is connected to my client:

 {{{
 Nickname:      amduser at tahoebs3:public/bs3c1
 Connected?:    Yes: to 207.7.153.161:48663
 Since:         13:08:38 02-Mar-2009
 Announced:     11:17:09 26-Feb-2009
 Version:       allmydata-tahoe/1.3.0
 Service Name:  storage

 Then I tried to load the write-cap to my klog again and got a !KeyError
 again; a fresh check returned the same results as above, and bs3c1 was
 still connected.

 So, what's going on?  Is bs3c1 failing to respond to my client's requests,
 or has it somehow deleted the shares of my klog that it held a couple of
 days ago?

 Oh!  I see that I *can* access the read-only view of my klog through
 http://testgrid.allmydata.org:3567 even though I can't access the exact
 same URL with my local tahoe node.  So either there is a networking
 problem, or there is a problem with the version of tahoe that I'm running
 here (allmydata-tahoe: 1.3.0-r3698, foolscap: 0.3.2, pycryptopp: 0.5.12,
 zfec: 1.4.4, Twisted: 8.2.0) but not the version running on testgrid
 (allmydata-tahoe: 1.3.0, foolscap: 0.3.2, pycryptopp: 0.5.2-1, zfec:
 1.4.0-4, Twisted: 2.5.0).

 Here's the result of a check (and verify) on the read-only view of the
 directory through testgrid.allmydata.org:

 {{{
 # Report:

 Recoverable Versions: 8*seq1128-fgyi/3*seq1122-37fb/8*seq1129-6veq
 Unhealthy: there are multiple recoverable versions
 Best Recoverable Version: seq1129-6veq
 Unhealthy: best version has only 8 shares (encoding is 3-of-10)

 # Share Counts: need 3-of-10, have 8
 # Hosts with good shares: 3
 # Corrupt shares: none
 # Wrong Shares: 11
 # Good Shares (sorted in share order):
 seq1122-37fb-sh3  xiktf6ok5f5ao5znxxttriv233hmvi4v (amduser at tahoebs4:public/bs4c3)
 seq1122-37fb-sh8  lwkv6cjicbzqjwwwuifik3pogeupsicb (amduser at tahoebs4:public/bs4c4)
 seq1122-37fb-sh9  6fyx5u4zr7tvz3szynihc4x3uc6ct5gh (amduser at tahoebs1:public/client2)
 seq1128-fgyi-sh0  ckpjhpffmbmpv5rxc7uzrcdlu2ad6slj (amduser at tahoebs3:public/bs3c3)
                   2y7ldksggg447xnf4zwsjccx7ihs6wfm (amduser at tahoebs3:public/bs3c4)
 seq1128-fgyi-sh1  fcmlx6emlydpmgsksztuvtpxf5gdoamr (amduser at tahoebs4:public/bs4c1)
 seq1128-fgyi-sh2  jfdpabh34vsrhll3lbdn3v23vem4hr2z (amduser at tahoebs4:public/bs4c2)
                   trjdor3okozw4eld3l6zl4ap4z6h5tk6 (amduser at tahoebs5:public/bs5c4)
 seq1128-fgyi-sh4  uf7kq2svc6ozcawfm63e2qrbik2oixvt (amduser at tahoebs5:public/bs5c1)
 seq1128-fgyi-sh5  wfninubkrvhlyscum7rlschbhx5iarg3 (amduser at tahoebs1:public/client1)
 seq1128-fgyi-sh7  iktgow2qpu6ikooaqowoskgv4hfrp444 (nej1)
 seq1129-6veq-sh0  q5l37rle6pojjnllrwjyryulavpqdlq5 (amduser at tahoebs3:public/bs3c1)
 seq1129-6veq-sh1  q5l37rle6pojjnllrwjyryulavpqdlq5 (amduser at tahoebs3:public/bs3c1)
 seq1129-6veq-sh2  q5l37rle6pojjnllrwjyryulavpqdlq5 (amduser at tahoebs3:public/bs3c1)
 seq1129-6veq-sh3  q5l37rle6pojjnllrwjyryulavpqdlq5 (amduser at tahoebs3:public/bs3c1)
 seq1129-6veq-sh4  q5l37rle6pojjnllrwjyryulavpqdlq5 (amduser at tahoebs3:public/bs3c1)
 seq1129-6veq-sh6  q5l37rle6pojjnllrwjyryulavpqdlq5 (amduser at tahoebs3:public/bs3c1)
 seq1129-6veq-sh8  7tlov7egj7ultza3dy2dlgev6gijlgvk (amduser at tahoebs5:public/bs5c3)
 seq1129-6veq-sh9  ivjakubrruewknqg7wgb5hbinasqupj6 (amduser at tahoebs5:public/bs5c2)
 # Recoverable Versions: 3
 # Unrecoverable Versions: 0
 }}}
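
 (Aside, on "servers in permuted order": for each file the servers are
 ranked by a hash of the server's node ID together with the file's
 storage index, giving every file its own stable permutation of the grid.
 A sketch of the general idea -- the exact hash 1.3.0 uses may differ,
 and the peer IDs below are placeholders:)

 {{{
 import hashlib

 def permuted_servers(storage_index, peerids):
     # Per-file, stable permutation: rank by hash(peerid + storage_index).
     return sorted(peerids,
                   key=lambda p: hashlib.sha256(p + storage_index).digest())

 peers = [b"peerid-1", b"peerid-2", b"peerid-3"]   # placeholders
 print(permuted_servers(b"storage-index-of-my-klog", peers))
 }}}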

 The most recent incident in my local incidents log says:

 {{{
 11:53:20.380 [43315]: WEIRD error during query: [Failure instance:
 Traceback: <class 'foolscap.ipb.DeadReferenceError'>: Calling Stale Broker
 /home/zooko/playground/allmydata/tahoe/trunk/trunk/src/allmydata/mutable/servermap.py:528:_do_query
 /home/zooko/playground/allmydata/tahoe/trunk/trunk/src/allmydata/mutable/servermap.py:540:_do_read
 /home/zooko/playground/allmydata/tahoe/trunk/trunk/src/allmydata/util/rrefutil.py:26:callRemote
 build/bdist.linux-x86_64/egg/foolscap/referenceable.py:395:callRemote
 --- <exception caught here> ---
 /usr/lib/python2.5/site-packages/Twisted-8.2.0-py2.5-linux-x86_64.egg/twisted/internet/defer.py:106:maybeDeferred
 build/bdist.linux-x86_64/egg/foolscap/referenceable.py:434:_callRemote
 build/bdist.linux-x86_64/egg/foolscap/broker.py:467:newRequestID
 ] FAILURE:

 The full incident is attached as
 incident-2009-03-04-115318-ebt4x5a.flog.bz2.
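
 (The DeadReferenceError means the client called a foolscap
 RemoteReference whose connection had already been lost -- a stale
 broker.  The usual client-side handling is to trap that failure and
 treat the server as unreachable for the current query instead of failing
 the whole operation.  A sketch, assuming Twisted-style deferreds;
 slot_readv is the real remote method the servermap uses, but the
 _lost_server hook is hypothetical:)

 {{{
 from foolscap.ipb import DeadReferenceError

 def _lost_server(peerid):
     # hypothetical bookkeeping: drop this server from the current query
     print("lost connection to", peerid)

 def query_server(rref, peerid, storage_index):
     d = rref.callRemote("slot_readv", storage_index, [], [(0, 2000)])
     def _dead(f):
         f.trap(DeadReferenceError)
         _lost_server(peerid)
         return {}          # i.e. "no shares from this server"
     d.addErrback(_dead)
     return d
 }}}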

-- 
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/651>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid

