Opened at 2009-10-05T17:57:35Z
Last modified at 2014-11-28T22:56:18Z
#812 new defect
server-side crawlers: tolerate corrupted shares, verify shares
| Reported by: | zooko | Owned by: | warner |
|---|---|---|---|
| Priority: | major | Milestone: | undecided |
| Component: | code-storage | Version: | 1.4.1 |
| Keywords: | reliability | Cc: | |
| Launchpad Bug: | | | |
Description (last modified by zooko)
From twistd.log on prodtahoe17 data6:
2009/09/25 13:00 -0700 [-] Log opened.
2009/09/25 13:00 -0700 [-] twistd 2.5.0 (/usr/bin/python 2.5.2) starting up
2009/09/25 13:00 -0700 [-] reactor class: <class 'twisted.internet.selectreactor.SelectReactor'>
2009/09/25 13:00 -0700 [-] Loading tahoe-client.tac...
2009-09-25 20:01:14.954Z [-] Loaded.
2009-09-25 20:01:14.956Z [-] foolscap.pb.Listener starting on 39324
2009-09-25 20:01:14.956Z [-] twisted.conch.manhole_ssh.ConchFactory starting on 8226
2009-09-25 20:01:14.956Z [-] Starting factory <twisted.conch.manhole_ssh.ConchFactory instance at 0x8bfe2cc>
2009-09-25 20:01:14.957Z [-] nevow.appserver.NevowSite starting on 9006
2009-09-25 20:01:14.957Z [-] Starting factory <nevow.appserver.NevowSite instance at 0x8db516c>
2009-09-25 20:01:14.957Z [-] Manhole listening via SSH on port 8226
2009-09-25 20:01:14.958Z [-] twisted.internet.protocol.DatagramProtocol starting on 35896
2009-09-25 20:01:14.958Z [-] Starting protocol <twisted.internet.protocol.DatagramProtocol instance at 0x8db576c>
2009-09-25 20:01:14.960Z [-] (Port 35896 Closed)
2009-09-25 20:01:14.961Z [-] Stopping protocol <twisted.internet.protocol.DatagramProtocol instance at 0x8db576c>
2009-09-27 12:57:40.124Z [-] lease-checker error processing /data6/storage/storage/shares/g6/g6rvkc5jdkgoqhljuxgkquzhvq/5
2009-09-27 12:57:40.130Z [-] Unhandled Error
Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/twisted/internet/base.py", line 561, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/lib/python2.5/site-packages/allmydata/storage/crawler.py", line 262, in start_slice
    self.start_current_prefix(start_slice)
  File "/usr/lib/python2.5/site-packages/allmydata/storage/crawler.py", line 321, in start_current_prefix
    buckets, start_slice)
  File "/usr/lib/python2.5/site-packages/allmydata/storage/crawler.py", line 361, in process_prefixdir
    self.process_bucket(cycle, prefix, prefixdir, bucket)
--- <exception caught here> ---
  File "/usr/lib/python2.5/site-packages/allmydata/storage/expirer.py", line 128, in process_bucket
    wks = self.process_share(sharefile)
  File "/usr/lib/python2.5/site-packages/allmydata/storage/expirer.py", line 171, in process_share
    for li in sf.get_leases():
  File "/usr/lib/python2.5/site-packages/allmydata/storage/mutable.py", line 242, in get_leases
    for i, lease in self._enumerate_leases(f):
  File "/usr/lib/python2.5/site-packages/allmydata/storage/mutable.py", line 247, in _enumerate_leases
    for i in range(self._get_num_lease_slots(f)):
  File "/usr/lib/python2.5/site-packages/allmydata/storage/mutable.py", line 227, in _get_num_lease_slots
    num_extra_leases = self._read_num_extra_leases(f)
  File "/usr/lib/python2.5/site-packages/allmydata/storage/mutable.py", line 129, in _read_num_extra_leases
    (num_extra_leases,) = struct.unpack(">L", f.read(4))
  File "/usr/lib/python2.5/struct.py", line 87, in unpack
    return o.unpack(s)
struct.error: unpack requires a string argument of length 4
$ tahoe --version
tahoe-server: 1.4.1, foolscap: 0.4.2, pycryptopp: 0.5.16-r669, zfec: 1.4.0-4, Twisted: 2.5.0, Nevow: 0.9.26, zope.interface: 3.3.1, python: 2.5.2, platform: Linux-Ubuntu_8.04-i686-32bit, pyutil: 1.3.20, simplejson: 1.7.3, argparse: 0.8.0, pyOpenSSL: 0.6, z-base-32: 1.0.1, setuptools: 0.6c8
No incident logs.
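For context on the traceback above: the failure happens while the lease crawler parses the share's lease area, because f.read(4) returns fewer than 4 bytes when the file has been truncated. Below is a minimal, self-contained illustration of that failure mode; it is not Tahoe-LAFS code, and the temporary file stands in for a truncated share.

import struct
import tempfile

# Simulate a share file that was cut off in the middle of the 4-byte
# big-endian "number of extra leases" field.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00\x00")        # only 2 of the expected 4 bytes made it to disk
    truncated_path = f.name

with open(truncated_path, "rb") as f:
    data = f.read(4)            # read() quietly returns just 2 bytes at EOF
    try:
        (num_extra_leases,) = struct.unpack(">L", data)
    except struct.error as e:
        # On the Python 2.5 node this message reads
        # "unpack requires a string argument of length 4"
        print("truncated field:", e)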
Attachments (1)
Change History (11)
comment:1 Changed at 2009-10-05T17:59:49Z by zooko
comment:2 Changed at 2009-10-05T23:49:10Z by warner
hm, the share appears to be truncated. I looked at sh7 (in /data5 on that same box), which appears to be intact. Both are for seqnum 6, and both appear to match up until the truncation point (which occurs in the middle of the encrypted private key).
These shares are supposed to be written atomically, with a write vector of length one (consisting of the entire share) that is written in a single python f.write call. Perhaps the write got interrupted (node reboot or system crash), in such a bad way that it caused only part of the data to be written out to disk? The truncation point isn't on a particularly round boundary (the file is 0xb7e bytes long), so it doesn't feel like a disk block size or anything like that.
Weird.
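One standard defense against ending up with a half-written file after a crash is the write-to-temp-then-rename pattern. The sketch below shows only that generic pattern; it is not how allmydata.storage.mutable actually writes shares, and write_share_atomically is a hypothetical name.

import os

def write_share_atomically(path, data):
    # Write the whole share to a temporary file in the same directory,
    # force it to disk, then rename it into place.  rename() is atomic on
    # POSIX filesystems, so a crash leaves either the old file or the
    # complete new file on disk, never a truncated one.
    tmp = path + ".tmp"
    f = open(tmp, "wb")
    try:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    finally:
        f.close()
    os.rename(tmp, path)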
I suppose the important part is to recover gracefully from it. I believe the share-crawler should keep going after the error; that'd be the first thing to verify.
I guess the second step would be to build a validating share crawler, and/or have some code in the lease-checking share crawler which would respond to problems like this by moving the corrupt share into a junkpile and logging the issue somewhere.
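A rough sketch of what the "move it into a junkpile" idea could look like in a per-share handler. All names here (process_share, the junkpile path, the logger) are hypothetical stand-ins, not existing Tahoe-LAFS APIs.

import os
import shutil

def process_bucket_tolerantly(process_share, sharefile, junkpile_dir, log):
    # Call the real per-share handler; if the share turns out to be corrupt,
    # log it, move it out of the crawler's way, and keep crawling.
    try:
        return process_share(sharefile)
    except Exception as e:
        log("quarantining corrupt share %s: %s" % (sharefile, e))
        if not os.path.isdir(junkpile_dir):
            os.makedirs(junkpile_dir)
        shutil.move(sharefile, junkpile_dir)
        return None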
comment:3 Changed at 2009-10-06T16:50:46Z by zooko
- Summary changed from "exception from attempt to parse leases" to "handle corrupted lease files"
Let's make this ticket be about recovering from this sort of corruption:
- keep share-crawling after encountering corrupted leases
- move the corrupted lease file into a junkpile and log the issue
In the meantime, I assume the workaround is to rm that file, right?
comment:4 Changed at 2009-10-08T16:53:12Z by warner
- Summary changed from "handle corrupted lease files" to "server-side crawlers: tolerate corrupted shares, verify shares"
Sounds good. To be specific, this is unrelated to leases; it's just that the lease-expiring crawler is what first noticed the corruption. So this ticket is about:
- verify that the crawlers keep crawling after an exception in their per-share handler functions (I believe they already do this, but we should verify it)
- implement a share-verifying crawler (the server-side verifier), and have it quarantine any corrupted share in some offline junkpile
And yeah, just rm the file; it's useless to anyone. The next time that directory is modified, a new copy of that share will be created.
comment:5 Changed at 2009-11-03T03:38:09Z by davidsarah
- Keywords reliability added
comment:6 Changed at 2013-11-14T19:06:57Z by zooko
- Description modified (diff)
#1834 would remove the lease-checking and bucket-counting crawlers, making this ticket irrelevant. However, we might then want to invent a share-verifying crawler, just for the purpose of looking for corrupted shares, which would make this ticket relevant again.
comment:7 Changed at 2014-11-28T15:59:12Z by zooko
My consulting client (codename "WAG") (see comment:6:ticket:1278) has corrupt shares. This message is in the storage server's twistd.log:
2014-10-16 20:28:32+0530 [-] lease-checker error processing /GFS2_tahoe/.tahoe-filestore1/storage/shares/jz/jze7ynr5mtrehxj7vchc3dqtpa/5
2014-10-16 20:28:32+0530 [-] Unhandled Error
Traceback (most recent call last):
  File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/base.py", line 824, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/data/allmydata-tahoe-1.10.0/src/allmydata/storage/crawler.py", line 261, in start_slice
    self.start_current_prefix(start_slice)
  File "/data/allmydata-tahoe-1.10.0/src/allmydata/storage/crawler.py", line 320, in start_current_prefix
    buckets, start_slice)
  File "/data/allmydata-tahoe-1.10.0/src/allmydata/storage/crawler.py", line 360, in process_prefixdir
    self.process_bucket(cycle, prefix, prefixdir, bucket)
--- <exception caught here> ---
  File "/data/allmydata-tahoe-1.10.0/src/allmydata/storage/expirer.py", line 128, in process_bucket
    wks = self.process_share(sharefile)
  File "/data/allmydata-tahoe-1.10.0/src/allmydata/storage/expirer.py", line 171, in process_share
    for li in sf.get_leases():
  File "/data/allmydata-tahoe-1.10.0/src/allmydata/storage/mutable.py", line 265, in get_leases
    for i, lease in self._enumerate_leases(f):
  File "/data/allmydata-tahoe-1.10.0/src/allmydata/storage/mutable.py", line 270, in _enumerate_leases
    for i in range(self._get_num_lease_slots(f)):
  File "/data/allmydata-tahoe-1.10.0/src/allmydata/storage/mutable.py", line 250, in _get_num_lease_slots
    num_extra_leases = self._read_num_extra_leases(f)
  File "/data/allmydata-tahoe-1.10.0/src/allmydata/storage/mutable.py", line 133, in _read_num_extra_leases
    (num_extra_leases,) = struct.unpack(">L", f.read(4))
struct.error: unpack requires a string argument of length 4
And this message is in an incident report file generated by the node (which is both the storage server and the gateway):
CorruptShareError server=5e4d7u5c shnum[2]: block hash tree failure: new hash soakg4py7uswum3y6vzvyk4qt3736ia37l7ti22mce3ntesuhe7a does not match existing hash wuoq2uiuyzope6gwyvucwx3b5q47hn46kvynkhsvzh7jndrcowla at [2 of 3] (leaf [1] of 2)
/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py:575:_runCallbacks
/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py:827:_cbDeferred
/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py:380:callback
/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py:488:_startRunCallbacks
--- <exception caught here> ---
/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py:575:_runCallbacks
/data/allmydata-tahoe-1.10.0/src/allmydata/mutable/retrieve.py:797:_validate_block
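For reference, the check that fails in this incident is, schematically, a comparison of the hash of a retrieved block against the leaf recorded in the share's block hash tree. The sketch below is a deliberately simplified stand-in using a bare SHA-256 hex digest; the real code lives in allmydata.hashtree and allmydata.mutable.retrieve and uses tagged SHA-256d hashes with base32 digests.

import hashlib

def validate_block(block, expected_leaf_hash):
    # Hash the block we just fetched and compare it against the leaf hash
    # stored in the block hash tree; a mismatch means the share (or the
    # transfer) is corrupt and gets reported as a CorruptShareError.
    got = hashlib.sha256(block).hexdigest()
    if got != expected_leaf_hash:
        raise ValueError("block hash tree failure: new hash %s does not "
                         "match existing hash %s" % (got, expected_leaf_hash))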
This is with this version of Tahoe-LAFS:
Nevow: 0.11.1
Twisted: 13.0.0
allmydata-tahoe: 1.10.0
foolscap: 0.6.4
mock: 1.0.1
platform: Linux-redhat_6.5-x86_64-64bit_ELF
pyOpenSSL: 0.10
pyasn1: 0.1.7
pycrypto: 2.6.1
pycryptopp: 0.6.0.1206569328141510525648634803928199668821045408958
python: 2.6.6
setuptools: 0.6c16dev4
simplejson: 2.0.9
twisted: 13.0.0
zfec: 1.4.7
zope.interface: unknown
comment:8 Changed at 2014-11-28T22:01:32Z by zooko
Okay, I've been looking into this, and I see that this kind of corruption is in fact handled: it is logged, counted in a counter called corrupt-shares, and the corrupted share is skipped. See expirer.py. The only reason we've thought it was not being handled all these years is that it calls twisted.python.log.err(), which emits a string to twistd.log saying "Unhandled Error", plus a stack trace. So I propose that we just remove that call to twisted.python.log.err() and add a unit test which requires the code under test to detect and skip over corrupted shares without emitting this error log.
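A sketch of the kind of unit test proposed here, assuming we have fixtures for building a truncated mutable share and for running the lease crawler to completion. make_truncated_share_dir() and run_lease_crawler() are hypothetical stand-ins for whatever allmydata.test actually provides; only the corrupt-shares count corresponds to an existing name in expirer.py.

import unittest

class CorruptShareToleranceTest(unittest.TestCase):
    def test_truncated_share_is_counted_and_skipped(self):
        storagedir = make_truncated_share_dir()   # hypothetical fixture
        report = run_lease_crawler(storagedir)    # hypothetical runner
        # The corrupted share must be counted...
        self.assertEqual(report["corrupt-shares"], 1)
        # ...the crawler must finish its cycle instead of aborting...
        self.assertTrue(report["cycle-complete"])
        # ...and nothing may have been routed through twisted.python.log.err,
        # so no "Unhandled Error" entry appears in twistd.log.
        self.assertEqual(report["log-err-calls"], 0)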
comment:9 Changed at 2014-11-28T22:45:29Z by daira
Wait, why do we ever log directly to the Twisted log? Let's remove *all* the places that do that.
comment:10 Changed at 2014-11-28T22:56:18Z by daira
Filed #2343 (remove all direct logging to twisted.python.log/err).
Here is the mutable share in question, attached.