#2017 closed defect

non-deterministic test hang on OpenBSD — at Version 10

Reported by: zooko Owned by: sickness
Priority: normal Milestone: soon
Component: code Version: 1.10.0
Keywords: iputil heisenbug openbsd test hang Cc:
Launchpad Bug:

Description (last modified by daira)

sickness's OpenBSD buildslave showed a test timeout:

===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.internet.defer.TimeoutError: <allmydata.test.test_runner.RunNode testMethod=test_client_no_noise> (test_client_no_noise) still running at 240.0 secs

allmydata.test.test_runner.RunNode.test_client_no_noise
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 0x816eb82c [0.00169348716736s] called=0 cancelled=0 LoopingCall<0.01>(RunNode._poll, *(<function _node_has_started at 0x7ff29ed4>, 1373030506.664452), **{})()>

allmydata.test.test_runner.RunNode.test_client_no_noise
-------------------------------------------------------------------------------
Ran 1139 tests in 1784.336s

FAILED (skips=15, expectedFailures=3, errors=2, successes=1120)

(from https://tahoe-lafs.org/buildbot-tahoe-lafs/builders/sickness%20OpenBSD%205.0%20x86%20py2.7/builds/27)

Rerunning the tests with the exact same build (using Buildbot's "force rebuild" feature) resulted in success:

https://tahoe-lafs.org/buildbot-tahoe-lafs/builders/sickness%20OpenBSD%205.0%20x86%20py2.7/builds/28

In that run (build number 28), those tests took only a few seconds:

 19.917 seconds: allmydata.test.test_runner.RunNode.test_client
 13.758 seconds: allmydata.test.test_runner.RunNode.test_client_no_noise

(from https://tahoe-lafs.org/buildbot-tahoe-lafs/builders/sickness%20OpenBSD%205.0%20x86%20py2.7/builds/28/steps/test/logs/timings)

So there is a non-deterministic bug that exhibits on sickness's buildslave which causes those two tests to hang.

Change History (10)

comment:1 Changed at 2013-07-06T15:17:35Z by zooko

  • Description modified (diff)

Questions:

  1. Does this happen on any other buildslaves?
  1. Did this ever happen before the recent patches which changed the behavior of iputil — [b0883807361830c609dff1677c3cb34fd64d3ebb], [f97b8e5e1df75284aa9b89dd830f8728040eab67], [08590b1f6a880d51751fdcacea6a007ebc568f2e], [16b245563db2f6ca71b9332b06debbe3e1d734b4], [b31a4f6e870cb56efa40c785a868a944b964e8b9], [a493ee0bb641175ecf918e28fce4d25df15994b6], [6104950ed8a7a356eed2218f2df958d074022eea], [f77ec470d75f4b8fb81b1abca4ee3b73f1ad8b22], [8e31d66cd0b0821ccaa2c7c259e7d6f262ad4738], [6a445d73bc5253ec4ae0dec70af02e33bc869cf6]?

I suspect those iputil patches of causing this hang.

sickness: could you please run the unit tests from the current trunk version repeatedly with trial's --until-failure option? ./bin/tahoe debug trial --until-failure allmydata.test (See HowToWriteTests for more options.) If you can reliably reproduce the problem, then would you use git to rewind to before those patches and see if that makes the problem go away? Thanks!

Last edited at 2013-08-28T15:52:23Z by daira (previous) (diff)

comment:2 in reply to: ↑ description Changed at 2013-07-07T08:45:11Z by sickness

did run the tests as requested (rsyncing the build subdir of the buildbot in /tmp/someotherdir because I don't know how to properly checkout trunk) and here's the results:

[...]

===============================================================================
[TODO]   
Reason: 'Fix this after 1.7.1 release.'
Traceback (most recent call last):
  File "/tmp/build/src/allmydata/immutable/upload.py", line 604, in _got_response
    return self._loop()
  File "/tmp/build/src/allmydata/immutable/upload.py", line 455, in _loop
    return self._failed("%s (%s)" % (failmsg, self._get_progress_message()))
  File "/tmp/build/src/allmydata/immutable/upload.py", line 617, in _failed
    raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: shares could be placed on only 3 server(s) such that any 
2 of them have enough shares to recover the file, but we were asked to place shares on at least 4 such
 servers. (placed all 4 shares, want to place shares on at least 4 servers such that any 2 of them hav
e enough shares to recover the file, sent 4 queries to 4 servers, 4 queries placed some shares, 0 plac
ed none (of which 0 placed none due to the server being full and 0 placed none due to an error))

===============================================================================
[SKIPPED]
A non-ASCII argument/output could not be encoded on this platform.

allmydata.test.test_runner.BinTahoe.test_unicode_arguments_and_output
===============================================================================
[TODO]   
Reason: "this isn't fixed yet"
Traceback (most recent call last):
  File "/tmp/build/src/allmydata/immutable/upload.py", line 604, in _got_response
    return self._loop()
  File "/tmp/build/src/allmydata/immutable/upload.py", line 455, in _loop
    return self._failed("%s (%s)" % (failmsg, self._get_progress_message()))
  File "/tmp/build/src/allmydata/immutable/upload.py", line 617, in _failed
    raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: shares could be placed on only 2 server(s) such that any 
2 of them have enough shares to recover the file, but we were asked to place shares on at least 3 such
 servers. (placed all 3 shares, want to place shares on at least 3 servers such that any 2 of them hav
e enough shares to recover the file, sent 3 queries to 3 servers, 1 queries placed some shares, 2 plac
ed none (of which 2 placed none due to the server being full and 0 placed none due to an error))

allmydata.test.test_upload.EncodingParameters.test_problem_layout_comment_187
allmydata.test.test_upload.EncodingParameters.test_problem_layout_ticket_1124
===============================================================================
[TODO]   
Reason: 'Invent a smarter uploader that uploads successfully in this case.'
Traceback (most recent call last):
  File "/tmp/build/src/allmydata/immutable/upload.py", line 604, in _got_response
    return self._loop()
  File "/tmp/build/src/allmydata/immutable/upload.py", line 455, in _loop
    return self._failed("%s (%s)" % (failmsg, self._get_progress_message()))
  File "/tmp/build/src/allmydata/immutable/upload.py", line 617, in _failed
    raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: shares could be placed on only 3 server(s) such that any 2 of them have enough shares to recover the file, but we were asked to place shares on at least 4 such servers. (placed all 4 shares, want to place shares on at least 4 servers such that any 2 of them have enough shares to recover the file, sent 4 queries to 4 servers, 3 queries placed some shares, 1 placed none (of which 1 placed none due to the server being full and 0 placed none due to an error))

allmydata.test.test_upload.EncodingParameters.test_problem_layout_ticket_1128
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.internet.defer.TimeoutError: <allmydata.test.test_runner.RunNode testMethod=test_client_no_noise> (test_client_no_noise) still running at 240.0 secs

allmydata.test.test_runner.RunNode.test_client_no_noise
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 0xb89705ec [0.00525784492493s] called=0 cancelled=0 LoopingCall<0.01>(RunNode._poll, *(<function _node_has_started at 0xb84a09cc>, 1373110342.373227), **{})()>

allmydata.test.test_runner.RunNode.test_client_no_noise
-------------------------------------------------------------------------------
Ran 1139 tests in 2020.513s

FAILED (skips=15, expectedFailures=3, errors=2, successes=1120
Last edited at 2013-07-07T12:47:50Z by zooko (previous) (diff)

comment:3 Changed at 2013-07-07T15:06:45Z by sickness

so then I've run:

git checkout d85a75d7f689cb55ecddb319dc2057f38e4db87a
python setup.py build && ./bin/tahoe --version
./bin/tahoe debug trial --until-failure allmydata.test.test_runner.RunNode.test_client_no_noise
and it failed way earlier than the previous branch with this:
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.internet.defer.TimeoutError: <allmydata.test.test_runner.RunNode testMethod=test_client_no_noise> (test_client_no_noise) still running at 240.0 secs

allmydata.test.test_runner.RunNode.test_client_no_noise
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 0x873c5e2c [0.000361204147339s] called=0 cancelled=0 LoopingCall<0.01>(RunNode._poll, *(<function _node_has_started at 0x873c648c>, 1373184489.75621), **{})()>

allmydata.test.test_runner.RunNode.test_client_no_noise
-------------------------------------------------------------------------------
Ran 1 tests in 240.027s
Last edited at 2013-08-28T15:37:52Z by zooko (previous) (diff)

comment:4 Changed at 2013-07-07T15:31:11Z by daira

  • Keywords test hang added

comment:5 Changed at 2013-07-07T15:34:34Z by daira

If it were the iputil patches, why does it only affect test_client_no_noise? Many other tests depend on the iputil address-finding code.

comment:6 Changed at 2013-07-07T19:58:06Z by daira

Actually sickness' result in comment:3 proves that it wasn't the iputil patches. (d85a75d7f689cb55ecddb319dc2057f38e4db87a/trunk was before those patches.)

comment:7 Changed at 2013-07-09T15:44:44Z by zooko

  • Keywords regression added
  • Milestone changed from undecided to 1.11.0

comment:8 Changed at 2013-08-28T15:42:13Z by zooko

  • Keywords regression removed

sickness:

comment:9 Changed at 2013-08-28T15:43:53Z by daira

  • Description modified (diff)

comment:10 Changed at 2013-08-28T15:51:57Z by daira

  • Description modified (diff)
Note: See TracTickets for help on using tickets.