Opened at 2009-02-10T08:48:40Z
Closed at 2009-02-20T21:41:06Z
#616 closed defect (fixed)
bug in repairer causes sporadic hangs in unit tests
Reported by: | zooko | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | 1.3.0 |
Component: | code-encoding | Version: | 1.2.0 |
Keywords: | Cc: | tahoe-dev@… | |
Launchpad Bug: |
Description
There is a bug in DownUpConnector._satisfy_reads_if_possible():
It should be putting leftover bytes back into the self.bufs and the rest into the result, not putting all-but-leftover bytes back and the rest into the result! In cases where the input chunks have come in different sizes than the read requests, this bug could lead to a read request getting more or fewer bytes than it requested. This could lead to data corruption (although not irreversibly so -- it would then upload the same sequence of bytes but in different-sized blocks, which would screw up the integrity checking code but not the ciphertext).
Fortunately, in our current code, the writes and the read requests are always of the same sizes (the block size), so this doesn't happen in practice. I've added an assertion in c59940852b94ba45 just to make it fail safely if this were to happen in practice. I have started writing unit tests for DownUpConnector._satisfy_reads_if_possible() -- it turns out that we need unit tests in addition to the functional tests that I already wrote: src/allmydata/test/test_repairer.py.
This explains the sporadic "lost progress" failure in the functional tests. Hm... Could it also explain the "lost progress" behavior that Brian and I witnessed on the testgrid when this code was newly committed to trunk? I hope not, because that would mean that I am wrong about the writes and reads always having the same sizes. But I'm pretty sure I am right about that.
Change History (2)
comment:1 Changed at 2009-02-12T00:00:32Z by warner
- Milestone changed from 1.3.0 to 1.3.1
comment:2 Changed at 2009-02-20T21:41:06Z by zooko
- Milestone changed from 1.3.1 to 1.3.0
- Resolution set to fixed
- Status changed from new to closed
fixed in 1.3.0 by d7dbd6675efa2f25
as mentioned in #611, we disabled the repair-from-corruption tests, and have only rarely seen lost-progress in the remaining repair-from-deletion test.
Zooko fixed one bug in the repairer which would have caused lost-progress, but didn't see any other obvious ones.
I've seen lost-progress in repair-from-deletion twice now (after zooko's fix), but it's pretty rare (and therefore hard to analyze). Since repair-from-deletion is supposed to be deterministic, the only entropy source remaining is the order in which download reads and upload writes are interleaved, which means it's going to be a long hard struggle to capture enough information for analysis.
So we're going to push this one out to 1.3.1 . We'd like to have a perfect repairer in 1.3.0, but we also want to have a 1.3.0 soon, and a repairer which hangs once out of every thousand uses might be good enough for that.