Opened at 2009-06-17T18:37:55Z
Closed at 2009-07-07T19:09:34Z
#738 closed defect (fixed)
failure in block hash tree
Reported by: | midnightmagic | Owned by: | bdew |
---|---|---|---|
Priority: | critical | Milestone: | 1.5.0 |
Component: | code-encoding | Version: | 1.4.1 |
Keywords: | integrity | Cc: | bdew, midnightmagic |
Launchpad Bug: |
Description (last modified by warner)
Running tahoe on the machine on which "python2.5 setup.py test" fails (as reported in trac ticket #737) generates the attached incident report.
Brief summary from flog debugger viewer:
<ValidatedReadBucketProxy #1> (2-<ReadBucketProxy 150097484 to peer [mgnbpxki] SI jow42sylefxjxsns3alv5ptghe>-m2st7xk76cvd): hash failure in block=0, shnum=2 on <ReadBucketProxy 150097484 to peer [mgnbpxki] SI jow42sylefxjxsns3alv5ptghe>
[...]
<BlockDownloader #2>(<ValidatedReadBucketProxy #3>-2): failure to get block
[... etc ...]
failure in block hash tree
.. and so on.
Actual cli error message is:
tahoe get URI:CHK:lapry55oui4psmeiyxhvitfmpi:75mb37to6iauypych6bkqfkxxxfk6nhekhomzipkqzwt46v64hdq:3:5:99758080 meh
Error, got 410 Gone NotEnoughSharesError: no shares could be found. Zero shares usually indicates a corrupt URI, or that no servers were connected, but it might also indicate severe corruption. You should perform a filecheck on this object to learn more.
Finally, dump-share on the 33MB file:
share filename: /v/tahoe/.tahoe/storage/shares/jo/jow42sylefxjxsns3alv5ptghe/2
version: 1
file_size: 99758080
num_segments: 762
segment_size: 131073
needed_shares: 3
total_shares: 5
codec_name: crs
codec_params: 131073-3-5
tail_codec_params: 11529-3-5
crypttext_hash: zodzh33f7mnowsxine5mzejiahlxsilgggpxmop5bfrh4zzzdpha
crypttext_root_hash: nuqsysg5zqkz5nsvpi32n5h6h5ilbepsbvmssji2xv773kqw53tq
share_root_hash: m2st7xk76cvdutgf5lzmkdzbf72h75cxpkytwjegi5jgntir3u5q
UEB_hash: 75mb37to6iauypych6bkqfkxxxfk6nhekhomzipkqzwt46v64hdq
verify-cap: URI:CHK-Verifier:jow42sylefxjxsns3alv5ptghe:75mb37to6iauypych6bkqfkxxxfk6nhekhomzipkqzwt46v64hdq:3:5:99758080
Size of data within the share:
  data: 33252694
  uri-extension: 325
  validation: 196648
Lease #0: owner=0, expire in 2607324s (30 days)
Machine details:
NetBSD quickie 4.99.7 NetBSD 4.99.7 (quickie) #0: Tue Jan 2 14:47:23 PST 2007 root@quickie:/v/src/sys/arch/i386/compile/quickie i386
AMD Athlon(tm) XP 2500+ (single-core, 32-bit)
2.5GB RAM
Python 2.5.2
This is a transitional pthread machine, partway through the M:N -> 1:1 threading model transition. The M:N threads *should* be functional, and for all system and most application purposes they are (KDE, etc.). However, on occasion some software makes assumptions, or is built without threading support, because configure detected anomalous behaviour.
NOTE: The share file IS AVAILABLE UPON REQUEST. I will save it for posterity.
Attachments (6)
Change History (24)
Changed at 2009-06-17T18:38:54Z by midnightmagic
comment:1 Changed at 2009-06-17T18:39:42Z by midnightmagic
- Component changed from unknown to code-storage
- Owner nobody deleted
comment:2 Changed at 2009-06-17T19:22:32Z by zooko
- Component changed from code-storage to code-encoding
- Keywords integrity added
- Milestone changed from undecided to 1.5.0
- Priority changed from major to critical
I want to know if Tahoe-LAFS unit tests pass (excluding the one that locks up as described in #737) and if pycryptopp unit tests pass.
comment:3 Changed at 2009-06-18T03:20:50Z by warner
- Description modified (diff)
(wrapped some of the description text to improve formatting)
I've looked a bit at the share file you sent me, and it seems ok (no corruption that I've seen so far). My next step is to examine the Incident report and see if I can figure out exactly which hash check failed, and compare them against hashes that I'll generate locally from that share.
Another approach will be to get a copy of two more shares, put them in a private grid, and attempt to download the file. If successful, the shares must be ok, and we'll focus on how the download process might be acting differently on your host.
comment:4 follow-up: ↓ 5 Changed at 2009-06-18T18:07:51Z by warner
I've looked at that report and compared it against the scrubber that I wrote (a server-side share verification tool). It looks like your client is generating a different hash for the first data block than it's supposed to. The incident report contains a copy of the first 50 bytes and the last 50 bytes of the block, and they match what I'm getting out of the share.
So, either your client is incorrectly computing the SHA256d hash of that 43kB-ish data, or it is using a block of data that is corrupted somewhere in the middle. Your client seems to compute the rest of the hash tree correctly (and I think you might have said that pycryptopp tests pass on this platform), so it seems like SHA256d is working in general. So that points to either the wrong hash tag (in allmydata.util.hashutil), or some sort of transport-level error that is corrupting or inserting/deleting data in the middle of the block.
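For reference, the construction under suspicion looks roughly like this (a sketch from memory, not copied from allmydata.util.hashutil; the tag string shown is an assumption used only for illustration): the tag is netstring-encoded, prepended to the data, and the result is hashed with SHA256d (SHA-256 applied twice).
```python
# Sketch of a tagged SHA256d hash as discussed above.  NOT copied from
# allmydata.util.hashutil; the block tag below is illustrative only.
import hashlib

def netstring(s):
    return "%d:%s," % (len(s), s)

def tagged_sha256d(tag, data):
    # SHA256d = SHA-256 applied twice, over netstring(tag) + data
    inner = hashlib.sha256(netstring(tag) + data).digest()
    return hashlib.sha256(inner).digest()

def block_hash(blockdata):
    # hypothetical block hasher; the real tag lives in allmydata.util.hashutil
    return tagged_sha256d("allmydata_encoded_subshare_v1", blockdata)
```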
I've just pushed some known-answer-tests to confirm that allmydata.util.hashutil is working correctly: could you pull a trunk tree, build, and then specifically run "make test TEST=allmydata.test.test_util" ? I know that an earlier test is hanging on your system; by running the later test_util directly, we can rule out this one hypothesis.
If that passes, the next step will be to patch the download code to save the full block to disk, so we can examine it and see if it matches what it's supposed to be.
comment:5 in reply to: ↑ 4 Changed at 2009-06-19T19:27:28Z by midnightmagic
Replying to warner:
I've just pushed some known-answer-tests to confirm that allmydata.util.hashutil is working correctly: could you pull a trunk tree, build, and then specifically run "make test TEST=allmydata.test.test_util" ? I know that an earlier test is hanging on your system; by running the later test_util directly, we can rule out this one hypothesis.
If that passes, the next step will be to patch the download code to save the full block to disk, so we can examine it and see if it matches what it's supposed to be.
My system make is a form of bmake. I had to run gmake to execute your tests. However, the result is apparently success:
Ran 63 tests in 5.852s
PASSED (successes=63)
In doing so, I have found that the default, lone python interpreter was actually different from the one I was manually using to build and install tahoe. Additionally, my tahoe installation is installed system-wide. I will (hrm.. manually?) de-install it and try again with a proper symlink from /usr/pkg/bin/python to python2.5. (Long shot with no evidence, I know.. but still.)
Nope. Same problem.
I'm sorry I don't know more about Python or I would be a *lot* more useful to you.
comment:6 follow-up: ↓ 8 Changed at 2009-06-24T01:36:05Z by warner
OK, if you would, please apply the 738-dumpblock.diff patch to a trunk tree, then perform the failing download again. That patch will write the block-with-bad-hash out to logs/badhash-STUFF. Then please attach that file to this ticket, and we'll compare it against the block you were supposed to get, to see whether your machine is receiving bad data or computing the hashes incorrectly.
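(The following is only an illustrative sketch of the idea behind such a patch, not the contents of 738-dumpblock.diff; the filename scheme is hypothetical.)
```python
# Illustrative only: on a block-hash failure, dump the offending block into
# logs/ so it can be compared offline against the data stored in the share.
import os, time

def dump_bad_block(basedir, shnum, blocknum, blockdata):
    logdir = os.path.join(basedir, "logs")
    if not os.path.isdir(logdir):
        os.makedirs(logdir)
    # hypothetical naming scheme, stamped to avoid overwriting earlier dumps
    fn = os.path.join(logdir, "badhash-%d-%d-%d"
                      % (shnum, blocknum, int(time.time())))
    f = open(fn, "wb")
    try:
        f.write(blockdata)
    finally:
        f.close()
    return fn
```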
Also, double-check that "test_known_answers" was in the output of the "test_util" run you just did, to make sure that your tree was new enough to have the tests I added.
thanks!
comment:7 Changed at 2009-06-29T23:22:10Z by midnightmagic
Excellent! I got some badhash files! I *believe* they're the same ones from that same file. I recognise the beginning of the hash reference anyway (jo blah blah).
I'll try to attach them to this note.
Changed at 2009-06-29T23:22:46Z by midnightmagic
comment:8 in reply to: ↑ 6 Changed at 2009-06-29T23:36:11Z by midnightmagic
Also, double-check that "test_known_answers" was in the output of the "test_util" run you just did, to make sure that your tree was new enough to have the tests I added.
Indeed it was, and it is again in a new test run with the latest trunk (via darcs pull), which includes the patch that generated the badhash-* files:
HashUtilTests
  test_chk ...                    [OK]
  test_constant_time_compare ...  [OK]
  test_hashers ...                [OK]
  test_known_answers ...          [OK]
  test_random_key ...             [OK]
  test_sha256d ...                [OK]
  test_sha256d_truncated ...      [OK]
[ ... and etc ...]
TimeFormat
  test_epoch ...                  [OK]
  test_epoch_in_London ...        [OK]
---------------------------------------------------------
Ran 63 tests in 5.815s
PASSED (successes=63)
comment:9 Changed at 2009-06-30T00:58:30Z by warner
pycryptopp#24 opened.
midnightmagic and I were able to narrow this down to a failure in pycryptopp, in which hashing a 128-byte string in two chunks of size (33,95) gets the wrong value on NetBSD. The block data hasher uses a tag (including netstring padding) of length 33. I suspect that the actual problem is with any block size BS such that (33+BS)%128==0.
This smells a lot like pycryptopp#17, which was an ARM-specific alignment issue that corrupted AES output on certain chunk sizes. I haven't looked deeply at the SHA256 code yet, but I suspect the same sort of bug, this time affecting i386.
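A minimal reproduction along these lines (a sketch, assuming pycryptopp's SHA256 class in pycryptopp.hash.sha256 with the usual update()/digest() interface): on an affected machine, the chunked digest differs from the one-shot digest and from hashlib's answer.
```python
import hashlib
from pycryptopp.hash.sha256 import SHA256

data = "x" * 128            # 128-byte input, split as 33 + 95 bytes

one_shot = SHA256()
one_shot.update(data)

chunked = SHA256()
chunked.update(data[:33])   # 33 bytes, like netstring(tag) for block hashes
chunked.update(data[33:])   # remaining 95 bytes

print "chunked == one-shot:", chunked.digest() == one_shot.digest()
print "matches hashlib:    ", one_shot.digest() == hashlib.sha256(data).digest()
```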
comment:10 Changed at 2009-06-30T02:23:04Z by warner
How interesting: Black Dew's debian/i386 buildslave (which has experienced hangs in the test suite that look fairly similar to the hangs midnightmagic has seen) fails the new pycryptopp#24 test in exactly the same way.
comment:11 Changed at 2009-06-30T03:27:21Z by zooko
- Cc bdew midnightmagic added
- Owner set to bdew
Adding Cc: bdew, midnightmagic so they will know that there is something they can do to help. Setting 'assigned to' to bdew at random.
comment:12 Changed at 2009-06-30T19:44:52Z by zooko
A-ha! Now Black Dew's buildslave got an internal compiler error in g++ while building Crypto++:
This suggests to me that the machine has hardware problems.
This raises an interesting question: what, if anything, can Tahoe-LAFS do to be robust and to fail clearly and nicely in the presence of hardware problems such as flaky RAM?
comment:13 Changed at 2009-07-02T20:46:37Z by zooko
See also Black Dew's discoveries over in http://allmydata.org/trac/pycryptopp/ticket/24. Crypto++ is being built to use the MOVDQA instruction, which may be buggy on his AthlonXP.
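MOVDQA is an SSE2 instruction, and the Athlon XP family does not implement SSE2, so one quick sanity check (a Linux-only sketch; NetBSD has no /proc/cpuinfo) is to see whether the CPU actually advertises sse2:
```python
# Linux-only sketch: report whether the CPU advertises the sse2 flag,
# which MOVDQA requires.
def cpu_has_sse2():
    try:
        lines = open("/proc/cpuinfo").read().splitlines()
    except IOError:
        return None   # /proc/cpuinfo not available (e.g. on NetBSD)
    for line in lines:
        if line.startswith("flags"):
            return "sse2" in line.split()
    return None

print "sse2 advertised:", cpu_has_sse2()
```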
comment:14 Changed at 2009-07-03T22:29:53Z by midnightmagic
So! Just as an update: this particular issue may be explained by the fact that Crypto++ on my machine actually fails (when ASM optimizations are turned on) on a test program I wrote. See the aforementioned pycryptopp ticket for more details and the test program.
comment:15 follow-up: ↓ 16 Changed at 2009-07-03T23:01:19Z by midnightmagic
SUCCESS! Rebuilding pycryptopp without ASM optimizations makes it pass the chunked SHA256 test. Setting PYTHONPATH to that top-level directory makes "tahoe ls" Just Work on an existing grid, and the failing command NOW WORKS PERFECTLY.
So there is a patch for setup.py in pycryptopp/ticket/24 which detects the platform involved and turns off assembly optimizations for just that specific platform and bit-width (32-bit); a sketch of this kind of check follows below.
I would say that, if bdew could do the same and it works, we can add platform detection for his machine as well, and likely close all these tickets out until Crypto++ fixes its CPU feature detection.
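Something along these lines, as a sketch of the kind of setup.py check being described (not the actual pycryptopp#24 patch; the platform test and the use of Crypto++'s CRYPTOPP_DISABLE_ASM switch here are illustrative):
```python
# Illustrative sketch for setup.py: on 32-bit i386 NetBSD, build the
# embedded Crypto++ with its assembly paths disabled.
import sys, platform

define_macros = []
if sys.platform.startswith("netbsd") and \
   platform.machine() in ("i386", "i486", "i586", "i686"):
    # CRYPTOPP_DISABLE_ASM is Crypto++'s switch for turning off all
    # assembly / SSE2 code paths.
    define_macros.append(("CRYPTOPP_DISABLE_ASM", 1))
# ... then pass define_macros to the Extension() that builds the bindings
```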
comment:16 in reply to: ↑ 15 Changed at 2009-07-06T05:07:00Z by midnightmagic
Wow, Wei Dai is fast. Check it out, he's fixed the problem already in SVN:
Impressive.
comment:17 Changed at 2009-07-06T23:18:28Z by midnightmagic
pycryptopp in trunk now works perfectly (well, the tests don't fail, anyway) on all three machines, as listed in pycryptopp ticket 24. Using pycryptopp trunk, I now have apparently perfectly-working tahoe nodes where before they were only remotely usable.
Therefore, I believe this ticket can be closed, from my perspective. If 1.5.0 is going to include these fixes, then all's well!
comment:18 Changed at 2009-07-07T19:09:34Z by zooko
- Resolution set to fixed
- Status changed from new to closed
Fixed by 9578e70161009035, which raises the pycryptopp requirement to >= 0.5.15. Note, however, that if you build pycryptopp against an external libcryptopp, you may still have this bug if your libcryptopp has it.
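As a quick check (a sketch using setuptools' pkg_resources; this verifies only the installed Python package version, not an external libcryptopp):
```python
import pkg_resources

try:
    pkg_resources.require("pycryptopp >= 0.5.15")
    print "pycryptopp is new enough"
except pkg_resources.VersionConflict, e:
    print "pycryptopp is too old:", e
except pkg_resources.DistributionNotFound:
    print "pycryptopp is not installed"
```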
Incident report. View with: flogtool web-viewer -p 8081 incident-2009-06-16-211442-qvfj7eq.flog.bz2