[tahoe-dev] [tahoe-lafs] #846: allmydata.test.test_system.SystemTest.test_mutable sometimes hangs on a slow machine
tahoe-lafs
trac at allmydata.org
Sat Nov 28 14:03:42 PST 2009
#846: allmydata.test.test_system.SystemTest.test_mutable sometimes hangs on a
slow machine
----------------------+-----------------------------------------------------
Reporter: zooko | Owner: francois
Type: defect | Status: new
Priority: major | Milestone: 1.6.0
Component: unknown | Version: 1.5.0
Keywords: test ARM | Launchpad_bug:
----------------------+-----------------------------------------------------
On François's lenny-armv5tel box,
{{{allmydata.test.test_system.SystemTest.test_mutable}}} sometimes stops
making progress and then gets timed out after 3600 seconds, e.g.:
http://allmydata.org/buildbot/builders/François lenny-armv5tel/builds/16
and many more. When the test does pass, it takes only a couple of hundred
seconds, e.g.:
http://allmydata.org/buildbot/builders/François lenny-armv5tel/builds/8/steps/test/logs/stdio
where it took 227 seconds. (In that same passing run, other tests took
longer than 227 seconds; see
http://allmydata.org/buildbot/builders/François lenny-armv5tel/builds/8/steps/test/logs/timings .)
Brian looked at the test.log files from passing and failing runs and said
that they contained little information. One difference he did see: in the
passing cases, the time between the beginning of the test case (e.g.
{{{2009-11-20 18:08:54.346Z [-] -->
allmydata.test.test_system.SystemTest.test_mutable <--}}}) and the first
message from Node startup (e.g. {{{2009-11-20 18:08:55.475Z [-]
foolscap.pb.Listener starting on 35403}}}) was about 1 second, whereas in
the failing cases (e.g. test start {{{2009-11-28 13:36:48.970Z [-] -->
allmydata.test.test_system.SystemTest.test_mutable <--}}} and Node startup
{{{2009-11-28 13:36:53.516Z [-] foolscap.pb.Listener starting on 55397}}})
it was about 5 seconds.
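
To compare more runs without eyeballing the logs, the gap between those two
log lines could be computed with something like the following. This is only
a sketch: the helper names {{{parse_log_time}}} and {{{startup_gap}}} are
made up for illustration, it assumes every line starts with a timestamp in
the format quoted above, and it is written for a current Python 3 rather
than the Python in use on the buildslave.

```python
from datetime import datetime


def parse_log_time(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.mmmZ' timestamp of a log line.

    Assumes the first two whitespace-separated fields are date and time,
    as in the test.log excerpts quoted in this ticket.
    """
    date_part, time_part = line.split()[:2]
    stamp = "%s %s" % (date_part, time_part.rstrip("Z"))
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")


def startup_gap(test_start_line, listener_line):
    """Seconds between the test-start line and the first Listener line."""
    delta = parse_log_time(listener_line) - parse_log_time(test_start_line)
    return delta.total_seconds()


# The two pairs of log lines quoted above (hypothetical variable names).
passing = startup_gap(
    "2009-11-20 18:08:54.346Z [-] --> "
    "allmydata.test.test_system.SystemTest.test_mutable <--",
    "2009-11-20 18:08:55.475Z [-] foolscap.pb.Listener starting on 35403",
)
failing = startup_gap(
    "2009-11-28 13:36:48.970Z [-] --> "
    "allmydata.test.test_system.SystemTest.test_mutable <--",
    "2009-11-28 13:36:53.516Z [-] foolscap.pb.Listener starting on 55397",
)
print(passing, failing)
```

For the lines quoted here this gives roughly 1.1 seconds for the passing run
and roughly 4.5 seconds for the failing one.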
So it could be that there is some sort of race condition: if the Node takes
roughly 5 seconds or longer to start up (perhaps waiting to bind to a TCP
port or something), then some other part of the test wins a race that it
didn't expect to win and gets confused.
Hm, I wonder if I could simulate that on a fast computer by inserting some
sort of 10s delay before allowing Node startup to complete...
The next step is to make this failure reproducible. François, could you
please run just this one test, e.g. with {{{trial --reporter=verbose
--until-failure allmydata.test.test_system.SystemTest.test_mutable}}}, and
see if you can tell what distinguishes the runs that pass from the runs
that fail? (Maybe it has to do with other processes loading the CPU?) Note
that the version of Tahoe-LAFS that this command line imports and tests is
determined by your PYTHONPATH.
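
If it would help to check the CPU-load theory, a small wrapper could record
the 1-minute load average just before each run next to that run's outcome.
The following is only a sketch under assumptions: {{{run_repeatedly}}} is a
hypothetical helper, it needs a Unix-like system for {{{os.getloadavg}}},
and it is written for a current Python 3.

```python
import os
import subprocess
import sys
import time


def run_repeatedly(cmd, runs=5, timeout=3600):
    """Run cmd several times, recording load average, duration and result.

    A returncode of None means the run was killed after `timeout` seconds,
    which is how a hang would show up here.
    """
    results = []
    for i in range(runs):
        try:
            load1 = os.getloadavg()[0]  # 1-minute load average before the run
        except OSError:
            load1 = None  # load average not obtainable on this system
        started = time.monotonic()
        try:
            returncode = subprocess.run(cmd, timeout=timeout).returncode
        except subprocess.TimeoutExpired:
            returncode = None
        results.append({
            "run": i,
            "load1": load1,
            "seconds": time.monotonic() - started,
            "returncode": returncode,
        })
    return results
```

It could be called with the same test as above, e.g.
{{{run_repeatedly(["trial", "--reporter=verbose",
"allmydata.test.test_system.SystemTest.test_mutable"])}}}, and the returned
list then shows whether the failing or timed-out runs line up with a higher
load average.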
François: I'd like to get this fixed so that ARM can be a supported
platform for the upcoming v1.6 release, so if you ''can't'' do this soon
then please either give Brian or me an ssh account on your box or just say
"Can't work on this now" so that we can think of some alternative
strategies. Thanks!
--
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/846>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid