Opened at 2009-11-28T22:03:42Z
Last modified at 2011-10-03T03:35:51Z
#846 assigned defect
allmydata.test.test_system.SystemTest.test_mutable sometimes hangs on a slow machine
Reported by: | zooko | Owned by: | zooko |
---|---|---|---|
Priority: | major | Milestone: | soon |
Component: | code-mutable | Version: | 1.5.0 |
Keywords: | test arm reliability hang heisenbug | Cc: | |
Launchpad Bug: |
Description
On François's lenny-armv5tel box, allmydata.test.test_system.SystemTest.test_mutable sometimes stops making progress and then gets timed out after 3600 seconds, e.g.: http://allmydata.org/buildbot/builders/François lenny-armv5tel/builds/16 and many more. In the cases where that test does pass it takes only a couple of hundred seconds, e.g.: http://allmydata.org/buildbot/builders/François lenny-armv5tel/builds/8/steps/test/logs/stdio where it took 227 seconds. (In that same passing test run other tests took longer than 227 seconds -- see http://allmydata.org/buildbot/builders/François lenny-armv5tel/builds/8/steps/test/logs/timings .)
Brian looked at the test.log files from passing and failing examples and said that there was little information there, but that one difference was that in the passing cases that he saw, the time between the beginning of the test case (e.g. 2009-11-20 18:08:54.346Z [-] --> allmydata.test.test_system.SystemTest.test_mutable <--) and the first message from Node startup (e.g. 2009-11-20 18:08:55.475Z [-] foolscap.pb.Listener starting on 35403) was about 1 second, and in the failing cases, e.g. start test 2009-11-28 13:36:48.970Z [-] --> allmydata.test.test_system.SystemTest.test_mutable <-- and Node startup 2009-11-28 13:36:53.516Z [-] foolscap.pb.Listener starting on 55397 was about 5 seconds.
So it could be that there is some sort of race condition where if it takes the Node longer than 5 seconds to start up (perhaps waiting to bind to a TCP port or something) then some other part of the test gets confused by having won a race that it didn't expect to win.
Hm, I wonder if I could simulate that on a fast computer by inserting some sort of 10s delay before allowing Node startup to complete...
The next step is to make this test reproducible. François, could you please run just this one test, such as with trial --reporter=verbose --until-failure allmydata.test.test_system.SystemTest.test_mutable and see if you can tell when it passes vs. when it fails? (Maybe it has to do with other processes loading the CPU?) Note that which version of Tahoe-LAFS gets imported and tested by that command-line will be determined by your PYTHONPATH.
François: I'd like to get this fixed so that ARM can be a supported platform for the upcoming v1.6 release, so if you can't do this soon then please either give me or Brian an ssh account on your box or just say "Can't work on this now" so that we can think of some alternative strategies. Thanks!
Attachments (2)
Change History (18)
comment:1 in reply to: ↑ description Changed at 2009-11-29T23:47:04Z by francois
comment:2 Changed at 2009-11-30T10:58:14Z by francois
The test 'allmydata.test.test_system.SystemTest?.test_mutable' failed after 20 successful runs, it definitely looks like a race condition. See the attached logfile for details.
Changed at 2009-11-30T10:58:36Z by francois
Changed at 2009-11-30T11:02:42Z by francois
comment:3 Changed at 2009-12-01T00:16:53Z by davidsarah
- Component changed from unknown to code-mutable
- Keywords reliability added
comment:4 Changed at 2009-12-05T17:22:20Z by francois
- Owner changed from francois to warner
The failure seems related to the connect_timeout expiration in foolscap. This is currently outside of my understanding of the Tahoe codebase
Brian, do you see any hint in these logfiles ? Should you need an SSH account on this box, just send me your public key by email
comment:5 Changed at 2009-12-06T04:23:56Z by zooko
Why do you say that it seems related to the connect_timeout expiration?
comment:6 Changed at 2009-12-06T11:47:09Z by francois
Because of that kind of log messages which appears in test.log:
2009-11-30 01:07:46.730Z [-] Reconnector._failed (furl=pb://z3g3meylekyi7yjo4azff2xbqwjbuioa@62.220.138.135:42639,127.0.0.1:42639/kdnymib6zqlgwo6nrihhofctqoo7chfe): [Failure instance: Traceback (failure with no frames): <class 'foolscap.tokens.NegotiationError'>: no connection established within client timeout ]
comment:7 follow-up: ↓ 8 Changed at 2009-12-06T16:50:55Z by davidsarah
- Keywords arm added; ARM removed
#838 was a duplicate. It has another log attached to it; I don't know whether that provides any more information.
comment:8 in reply to: ↑ 7 Changed at 2009-12-06T16:52:09Z by davidsarah
Replying to davidsarah:
#838 ... has another log attached to it
I meant, it links to http://allmydata.org/buildbot/builders/Fran%C3%A7ois%20lenny-armv5tel/builds/12/steps/test/logs/stdio
comment:9 Changed at 2009-12-12T16:26:52Z by zooko
Well, I haven't had time to investigate this, and I guess [our Buildbot Policy wiki:BuildbotPolicy] says that we need to demote ARM from the Supported Platforms category until someone has time to fix it. :-(
I'll do the demotion sometime today unless someone jumps in to rescue the ARM platform from the ignominy.
comment:10 Changed at 2010-01-26T15:44:40Z by zooko
- Milestone changed from 1.6.0 to eventually
comment:11 Changed at 2010-03-24T23:03:08Z by davidsarah
- Keywords hang added
comment:12 Changed at 2010-05-27T22:11:00Z by zooko
- Milestone changed from eventually to 1.8.0
It's really bothering me that mutable file upload and download behavior is so finicky, buggy, inefficient, hard to understand, different from immutable file upload and download behavior, etc. So I'm putting a bunch of tickets into the "1.8" Milestone. I am not, however, at this time, volunteering to work on these tickets, so it might be a mistake to put them into the 1.8 Milestone, but I really hope that someone else will volunteer or that I will decide to do it myself. :-)
comment:13 Changed at 2010-08-14T06:45:57Z by zooko
- Milestone changed from 1.8.0 to soon
comment:14 Changed at 2011-09-09T18:26:02Z by davidsarah
ticket:1084#comment:5 (but probably not the main issue in that ticket) may be an instance of the bug in this ticket. It occurs in a different method of SystemTest, but the error is almost identical to the first one in tahoe-846-test-fail.txt.
comment:15 Changed at 2011-09-28T16:23:03Z by zooko
- Owner changed from warner to zooko
- Status changed from new to assigned
comment:16 Changed at 2011-10-03T03:35:51Z by zooko
- Keywords heisenbug added
Zooko, that's right they are other CPU intensive processes like Mediatomb and MLDonkey running on this box for about 2 weeks. This probably explain why those tests have been failing recently.
The 'trial --until-failure' is currently running on the box, I'll post the results tomorrow morning (CET).