Opened at 2013-07-06T15:16:55Z
Closed at 2020-01-13T20:36:24Z
#2017 closed defect (wontfix)
non-deterministic test hang on OpenBSD
Reported by: | zooko | Owned by: | sickness |
---|---|---|---|
Priority: | normal | Milestone: | soon |
Component: | code | Version: | 1.10.0 |
Keywords: | iputil heisenbug openbsd test hang | Cc: | |
Launchpad Bug: |
Description (last modified by daira)
sickness's OpenBSD buildslave showed a test timeout:
=============================================================================== [ERROR] Traceback (most recent call last): Failure: twisted.internet.defer.TimeoutError: <allmydata.test.test_runner.RunNode testMethod=test_client_no_noise> (test_client_no_noise) still running at 240.0 secs allmydata.test.test_runner.RunNode.test_client_no_noise =============================================================================== [ERROR] Traceback (most recent call last): Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean. DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug) <DelayedCall 0x816eb82c [0.00169348716736s] called=0 cancelled=0 LoopingCall<0.01>(RunNode._poll, *(<function _node_has_started at 0x7ff29ed4>, 1373030506.664452), **{})()> allmydata.test.test_runner.RunNode.test_client_no_noise ------------------------------------------------------------------------------- Ran 1139 tests in 1784.336s FAILED (skips=15, expectedFailures=3, errors=2, successes=1120)
Rerunning the tests with the exact same build (using Buildbot's "force rebuild" feature) resulted in success:
In that run (build number 28), those tests took only a few seconds:
19.917 seconds: allmydata.test.test_runner.RunNode.test_client
13.758 seconds: allmydata.test.test_runner.RunNode.test_client_no_noise
So there is a non-deterministic bug that exhibits on sickness's buildslave which causes those two tests to hang.
Change History (15)
comment:1 Changed at 2013-07-06T15:17:35Z by zooko
- Description modified (diff)
comment:2 in reply to: ↑ description Changed at 2013-07-07T08:45:11Z by sickness
did run the tests as requested (rsyncing the build subdir of the buildbot in /tmp/someotherdir because I don't know how to properly checkout trunk) and here's the results:
[...] =============================================================================== [TODO] Reason: 'Fix this after 1.7.1 release.' Traceback (most recent call last): File "/tmp/build/src/allmydata/immutable/upload.py", line 604, in _got_response return self._loop() File "/tmp/build/src/allmydata/immutable/upload.py", line 455, in _loop return self._failed("%s (%s)" % (failmsg, self._get_progress_message())) File "/tmp/build/src/allmydata/immutable/upload.py", line 617, in _failed raise UploadUnhappinessError(msg) allmydata.interfaces.UploadUnhappinessError: shares could be placed on only 3 server(s) such that any 2 of them have enough shares to recover the file, but we were asked to place shares on at least 4 such servers. (placed all 4 shares, want to place shares on at least 4 servers such that any 2 of them hav e enough shares to recover the file, sent 4 queries to 4 servers, 4 queries placed some shares, 0 plac ed none (of which 0 placed none due to the server being full and 0 placed none due to an error)) =============================================================================== [SKIPPED] A non-ASCII argument/output could not be encoded on this platform. allmydata.test.test_runner.BinTahoe.test_unicode_arguments_and_output =============================================================================== [TODO] Reason: "this isn't fixed yet" Traceback (most recent call last): File "/tmp/build/src/allmydata/immutable/upload.py", line 604, in _got_response return self._loop() File "/tmp/build/src/allmydata/immutable/upload.py", line 455, in _loop return self._failed("%s (%s)" % (failmsg, self._get_progress_message())) File "/tmp/build/src/allmydata/immutable/upload.py", line 617, in _failed raise UploadUnhappinessError(msg) allmydata.interfaces.UploadUnhappinessError: shares could be placed on only 2 server(s) such that any 2 of them have enough shares to recover the file, but we were asked to place shares on at least 3 such servers. (placed all 3 shares, want to place shares on at least 3 servers such that any 2 of them hav e enough shares to recover the file, sent 3 queries to 3 servers, 1 queries placed some shares, 2 plac ed none (of which 2 placed none due to the server being full and 0 placed none due to an error)) allmydata.test.test_upload.EncodingParameters.test_problem_layout_comment_187 allmydata.test.test_upload.EncodingParameters.test_problem_layout_ticket_1124 =============================================================================== [TODO] Reason: 'Invent a smarter uploader that uploads successfully in this case.' Traceback (most recent call last): File "/tmp/build/src/allmydata/immutable/upload.py", line 604, in _got_response return self._loop() File "/tmp/build/src/allmydata/immutable/upload.py", line 455, in _loop return self._failed("%s (%s)" % (failmsg, self._get_progress_message())) File "/tmp/build/src/allmydata/immutable/upload.py", line 617, in _failed raise UploadUnhappinessError(msg) allmydata.interfaces.UploadUnhappinessError: shares could be placed on only 3 server(s) such that any 2 of them have enough shares to recover the file, but we were asked to place shares on at least 4 such servers. (placed all 4 shares, want to place shares on at least 4 servers such that any 2 of them have enough shares to recover the file, sent 4 queries to 4 servers, 3 queries placed some shares, 1 placed none (of which 1 placed none due to the server being full and 0 placed none due to an error)) allmydata.test.test_upload.EncodingParameters.test_problem_layout_ticket_1128 =============================================================================== [ERROR] Traceback (most recent call last): Failure: twisted.internet.defer.TimeoutError: <allmydata.test.test_runner.RunNode testMethod=test_client_no_noise> (test_client_no_noise) still running at 240.0 secs allmydata.test.test_runner.RunNode.test_client_no_noise =============================================================================== [ERROR] Traceback (most recent call last): Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean. DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug) <DelayedCall 0xb89705ec [0.00525784492493s] called=0 cancelled=0 LoopingCall<0.01>(RunNode._poll, *(<function _node_has_started at 0xb84a09cc>, 1373110342.373227), **{})()> allmydata.test.test_runner.RunNode.test_client_no_noise ------------------------------------------------------------------------------- Ran 1139 tests in 2020.513s FAILED (skips=15, expectedFailures=3, errors=2, successes=1120
comment:3 Changed at 2013-07-07T15:06:45Z by sickness
so then I've run:
git checkout d85a75d7f689cb55ecddb319dc2057f38e4db87a python setup.py build && ./bin/tahoe --version ./bin/tahoe debug trial --until-failure allmydata.test.test_runner.RunNode.test_client_no_noise and it failed way earlier than the previous branch with this: =============================================================================== [ERROR] Traceback (most recent call last): Failure: twisted.internet.defer.TimeoutError: <allmydata.test.test_runner.RunNode testMethod=test_client_no_noise> (test_client_no_noise) still running at 240.0 secs allmydata.test.test_runner.RunNode.test_client_no_noise =============================================================================== [ERROR] Traceback (most recent call last): Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean. DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug) <DelayedCall 0x873c5e2c [0.000361204147339s] called=0 cancelled=0 LoopingCall<0.01>(RunNode._poll, *(<function _node_has_started at 0x873c648c>, 1373184489.75621), **{})()> allmydata.test.test_runner.RunNode.test_client_no_noise ------------------------------------------------------------------------------- Ran 1 tests in 240.027s
comment:4 Changed at 2013-07-07T15:31:11Z by daira
- Keywords test hang added
comment:5 Changed at 2013-07-07T15:34:34Z by daira
If it were the iputil patches, why does it only affect test_client_no_noise? Many other tests depend on the iputil address-finding code.
comment:6 Changed at 2013-07-07T19:58:06Z by daira
Actually sickness' result in comment:3 proves that it wasn't the iputil patches. (d85a75d7f689cb55ecddb319dc2057f38e4db87a/trunk was before those patches.)
comment:7 Changed at 2013-07-09T15:44:44Z by zooko
- Keywords regression added
- Milestone changed from undecided to 1.11.0
comment:9 Changed at 2013-08-28T15:43:53Z by daira
- Description modified (diff)
comment:10 Changed at 2013-08-28T15:51:57Z by daira
- Description modified (diff)
comment:11 Changed at 2013-08-28T15:53:38Z by zooko
sickness: we don't know how to proceed with this. It seems like there might be an old bug, since, per daira's comment:6, d85a75d7f689cb55ecddb319dc2057f38e4db87a/trunk was before the recent iputil patches. So, could you do some investigation of what version(s) of tahoe-lafs exhibit this failure on your !OpenBSD system and report back? You might just start with the Tahoe-LAFS v1.10 release [f9af0633d8da426cbcaed3ff05ab6d7128148bb0] and see if your system could stably run this test on that version. Thanks!
comment:12 Changed at 2013-08-29T07:16:37Z by sickness
# ./bin/tahoe --version allmydata-tahoe: 1.10.0 foolscap: 0.6.4 pycryptopp: 0.6.0.1206569328141510525648634803928199668821045408958 zfec: 1.4.24 Twisted: 13.0.0 Nevow: 0.10.0 zope.interface: unknown python: 2.7.3 platform: OpenBSD-5.2-i386-32bit-ELF pyOpenSSL: 0.13 simplejson: 3.3.0 pycrypto: 2.6 pyasn1: 0.1.7 mock: 1.0.1 setuptools: 0.6c16dev4 # ./bin/tahoe debug trial --until-failure allmydata.test.test_runner.RunNode.test_client_no_noise Test Pass 1 allmydata.test.test_runner RunNode test_client_no_noise ... [OK] ------------------------------------------------------------------------------- Ran 1 tests in 4.987s PASSED (successes=1) Test Pass 2 allmydata.test.test_runner RunNode test_client_no_noise ... [OK] ------------------------------------------------------------------------------- Ran 1 tests in 5.011s PASSED (successes=1) Test Pass 3 allmydata.test.test_runner RunNode test_client_no_noise ... [ERROR] [ERROR] =============================================================================== [ERROR] Traceback (most recent call last): Failure: twisted.internet.defer.TimeoutError: <allmydata.test.test_runner.RunNode testMethod=test_client_no_noise> (test_client_no_noise) still running at 240.0 secs allmydata.test.test_runner.RunNode.test_client_no_noise =============================================================================== [ERROR] Traceback (most recent call last): Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean. DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug) <DelayedCall 0x83fa9dac [0.00449967384338s] called=0 cancelled=0 LoopingCall<0.01>(RunNode._poll, *(<function _node_has_started at 0x83fc83e4>, 1377735752.361727), **{})()> allmydata.test.test_runner.RunNode.test_client_no_noise ------------------------------------------------------------------------------- Ran 1 tests in 240.022s
comment:13 Changed at 2013-08-29T08:45:11Z by sickness
I have to add that after the previous test exits, a process seems to still hang: tahoe --quiet start test_runner/RunNode/test_client_no_noise/c1
and it seems to sit there forever until I manually kill it :/
comment:14 Changed at 2013-08-29T11:41:04Z by zooko
Hm. Could you try editing the source code, i.e. the code of trunk/src/allmydata/test/test_runner.py, and add these lines somewhere near the top:
from twisted.internet import base base.DelayedCall.debug = True
And then re-run that experiment?
comment:15 Changed at 2020-01-13T20:36:24Z by exarkun
- Resolution set to wontfix
- Status changed from new to closed
The OpenBSD builder is gone. This could be re-opened and fixed by a dedicated OpenBSD maintainer but won't be otherwise.
Questions:
I suspect those iputil patches of causing this hang.
sickness: could you please run the unit tests from the current trunk version repeatedly with trial's --until-failure option? ./bin/tahoe debug trial --until-failure allmydata.test (See HowToWriteTests for more options.) If you can reliably reproduce the problem, then would you use git to rewind to before those patches and see if that makes the problem go away? Thanks!