Opened at 2019-04-09T12:36:53Z
Closed at 2019-05-14T08:13:08Z
#3025 closed defect (fixed)
Sometimes the test suite hangs in/around allmydata.test.test_runner.RunNode.test_introducer
| Reported by: | exarkun | Owned by: | GitHub <noreply@…> |
|---|---|---|---|
| Priority: | normal | Milestone: | undecided |
| Component: | unknown | Version: | 1.12.1 |
| Keywords: | review-needed | Cc: | |
| Launchpad Bug: | | | |
Description
This is similar to https://tahoe-lafs.org/trac/tahoe-lafs/ticket/3021
Now, though, instead of hanging on the 2nd RunNode test method, it hangs on the 4th.
Change History (7)
comment:1 Changed at 2019-04-09T13:41:43Z by exarkun
With some logging added I observed:
and then a hang. So the hang happens after the "tahoe start" succeeds but before the "tahoe restart" succeeds.
comment:2 Changed at 2019-04-10T13:42:37Z by exarkun
With some more logging added, a failed CI run had this to say:
2019-04-09 16:19:12.155Z [-] --> allmydata.test.test_runner.RunNode.test_baddir <--
2019-04-09 16:19:15.241Z [-] Main loop terminated.
2019-04-09 16:19:15.242Z [-] --> allmydata.test.test_runner.RunNode.test_client <--
2019-04-09 16:19:19.160Z [-] Polling <function _node_has_started at 0x7f6d596cd410>
2019-04-09 16:19:19.160Z [-] Polling completed successfully
2019-04-09 16:21:15.256Z [-] Main loop terminated.
2019-04-09 16:21:15.257Z [-] --> allmydata.test.test_runner.RunNode.test_client_no_noise <--
2019-04-09 16:21:19.093Z [-] Polling <function _node_has_started at 0x7f6d5aa02e60>
2019-04-09 16:21:19.094Z [-] Polling completed successfully
This is a hang in the _third_ test method of the test case, not the fourth. Unfortunately, the logging in that test is not as comprehensive (heh) as in test_introducer, so all I can really surmise is that the hang happened after "tahoe start" completed and before "tahoe stop" completed.
This is somewhat similar to the test_introducer behavior, I suppose, since in that case it's "tahoe restart" that doesn't complete and "restart" implies "stop".
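For context, the "Polling <function _node_has_started ...>" lines come from the test waiting for the freshly started node to become ready before it moves on. The real tests poll inside the reactor, but the idea is captured by a sketch like this (the file-based readiness check here is an assumption on my part, not the actual test code):

```python
import os
import time

def poll(check, timeout=60, interval=0.5):
    # Call check() repeatedly until it returns True or we give up.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if check():
            return True
        time.sleep(interval)
    raise RuntimeError("condition never became true within %s seconds" % timeout)

def _node_has_started(node_dir):
    # Hypothetical readiness check: a running node writes node.url into its
    # directory once its web server is listening, so waiting for that file
    # is one plausible "has the node started?" signal.
    return os.path.exists(os.path.join(node_dir, "node.url"))

# poll(lambda: _node_has_started("test_runner/RunNode/test_client/c1"))
```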
comment:3 Changed at 2019-05-02T14:44:17Z by exarkun
I managed to catch this live on CI for inspection (strangely the branch for https://tahoe-lafs.org/trac/tahoe-lafs/ticket/3038 produces this failure much more regularly).
ps output looked like this:
USER       PID %CPU %MEM    VSZ    RSS TTY   STAT START  TIME COMMAND
nobody       6  0.1  0.0 123848  18336 ?     Ssl  13:32  0:00 /bin/circleci-agent --config /.circleci-runner-config.json --task-data /.circleci-task-data --outerServerUrl https://circleci-internal-outer-build-agent:5500 _internal step --sshAdvertiseAddr=3.95.66.234:64535 --sshBindPort=54782
nobody      88  0.0  0.0  21340   3600 pts/1 Ss+  13:32  0:00  \_ /bin/bash /tmp/project/.circleci/run-tests.sh /tmp/venv /tmp/project /tmp/artifacts coverage
nobody      90  0.0  0.0  64456  22140 pts/1 S+   13:32  0:00  |   \_ /tmp/venv/bin/python2.7 /tmp/venv/bin/tox -c /tmp/project/tox.ini --workdir /tmp/tahoe-lafs.tox -e coverage
nobody     110 47.8  0.3 620428 282520 pts/1 S+   13:32  4:03  |       \_ /tmp/tahoe-lafs.tox/coverage/bin/python /tmp/tahoe-lafs.tox/coverage/bin/coverage run --branch -m twisted.trial --reporter=subunitv2-file --rterrors allmydata
nobody     431  0.3  0.1 212320 103772 pts/1 S+   13:37  0:00  |           \_ /tmp/tahoe-lafs.tox/coverage/bin/python -m allmydata.scripts.runner --quiet restart test_runner/RunNode/test_client/c1
nobody     441  0.0  0.0  21472   3680 pts/2 Ss   13:41  0:00  \_ /bin/bash
nobody     444  0.0  0.0  37364   3304 pts/2 R+   13:41  0:00      \_ ps fauxwww
nobody       1  0.0  0.0   4628    856 pts/0 Ss+  13:32  0:00 /bin/sh
nobody     429  0.0  0.0      0      0 ?     Zs   13:37  0:00 [python] <defunct>
nobody     430  0.3  0.0      0      0 ?     Z    13:37  0:00 [python] <defunct>
test_runner/RunNode/test_client/c1/logs/twistd.log contains:
2019-05-02T13:37:53+0000 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 19.2.0 (/tmp/tahoe-lafs.tox/coverage/bin/python 2.7.15) starting up.
2019-05-02T13:37:53+0000 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
2019-05-02T13:37:54+0000 [-] Foolscap logging initialized
2019-05-02T13:37:54+0000 [-] Note to developers: twistd.log does not receive very much.
2019-05-02T13:37:54+0000 [-] Use 'flogtool tail -c NODEDIR/private/logport.furl' instead
2019-05-02T13:37:54+0000 [-] and read docs/logging.rst
2019-05-02T13:37:54+0000 [-] NevowSite starting on 41427
2019-05-02T13:37:54+0000 [nevow.appserver.NevowSite#info] Starting factory <nevow.appserver.NevowSite instance at 0x7f7a7fd8e098>
2019-05-02T13:37:54+0000 [-] My pid: 430
2019-05-02T13:37:54+0000 [-] Listener starting on 35327
2019-05-02T13:37:54+0000 [foolscap.pb.Listener#info] Starting factory <Listener at 0x7f7a7fee1a10 on CleanupEndpoint(_wrapped=<twisted.internet.endpoints.AdoptedStreamServerEndpoint object at 0x7f7a7fee1990>, _fd=14, _listened=True) with tub q7kkkqqglfxfgfcae2fhnq2wzmm6wrno>
2019-05-02T13:37:54+0000 [-] Listener starting on 34521
2019-05-02T13:37:54+0000 [foolscap.pb.Listener#info] Starting factory <Listener at 0x7f7a7fee1490 on <twisted.internet.endpoints.TCP4ServerEndpoint object at 0x7f7a7fee14d0> with tub lytxba7m4nb3jfcjqmhcc77wu5m5in5n>
2019-05-02T13:37:54+0000 [-] Listener starting on 38229
2019-05-02T13:37:54+0000 [foolscap.pb.Listener#info] Starting factory <Listener at 0x7f7a7fee1690 on CleanupEndpoint(_wrapped=<twisted.internet.endpoints.AdoptedStreamServerEndpoint object at 0x7f7a7fee1350>, _fd=13, _listened=True) with tub jguyasayigt7a3igreb7h7sdxcsnqhkw>
2019-05-02T13:37:54+0000 [-] client running
Note that this reports the pid of the tahoe process for that directory as 430, which the ps listing shows as a zombie. I wasn't able to find a log that told me what pid 429 was, but possibly that's the parent of the daemonized 430.
I also noticed that as a little time passed, more zombies appeared:
nobody     429  0.0  0.0      0      0 ?     Zs   13:37  0:00 [python] <defunct>
nobody     430  0.0  0.0      0      0 ?     Z    13:37  0:00 [python] <defunct>
nobody     431  0.0  0.0      0      0 ?     Z    13:37  0:00 [python] <defunct>
nobody     468  0.0  0.0      0      0 ?     Zs   13:41  0:00 [python] <defunct>
nobody     469  0.0  0.0      0      0 ?     Z    13:41  0:00 [python] <defunct>
nobody     470  0.0  0.0      0      0 ?     Z    13:41  0:00 [python] <defunct>
nobody     514  0.0  0.0      0      0 ?     Zs   13:45  0:00 [python] <defunct>
nobody     515  0.0  0.0      0      0 ?     Z    13:45  0:00 [python] <defunct>
nobody     516  0.0  0.0      0      0 ?     Z    13:45  0:01 [python] <defunct>
nobody@9a341290331a:/tmp/project/_trial_temp$ grep 'My pid' ./ -r
./test.log:2019-05-02 13:35:05.396Z [-] My pid: 110
./test.log:2019-05-02 13:35:09.600Z [-] My pid: 110
./test_runner/RunNode/test_client_no_noise/c1/logs/twistd.log:2019-05-02T13:41:54+0000 [-] My pid: 469
./test_runner/RunNode/test_introducer/c1/logs/twistd.log:2019-05-02T13:45:54+0000 [-] My pid: 515
./test_runner/RunNode/test_client/c1/logs/twistd.log:2019-05-02T13:37:54+0000 [-] My pid: 430
469, 515, and 430 are all zombies.
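(As a side note for anyone reproducing this: a quick way to confirm a pid is a zombie without eyeballing ps is to read its state field from /proc. A small sketch:)

```python
def proc_state(pid):
    # /proc/<pid>/stat looks like "430 (python) Z 429 ...".  The comm field
    # can contain spaces, so split after the closing parenthesis rather than
    # naively on whitespace; the next field is the one-letter state, and
    # 'Z' means zombie.
    with open("/proc/%d/stat" % pid) as f:
        return f.read().rsplit(")", 1)[1].split()[0]

# e.g. proc_state(430) == 'Z'
```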
This suggests to me that the test suite isn't actually hanging; it's just going too slowly to satisfy the CI runner. It also suggests that the CI environment itself is tied up with the problem: why are these processes zombies? It is probably the CI environment's fault that they are not being reaped. That said, why doesn't "python -m allmydata.scripts.runner --quiet restart test_runner/RunNode/test_client/c1" return when the process is dead? Maybe it is waiting for it to be reaped as well, I suppose.
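I haven't traced the actual stop/restart code path, but if it waits for the old process to disappear with something shaped like the sketch below, the zombies would explain the non-return: kill(pid, 0) still succeeds for a zombie, so such a loop never sees the process as gone.

```python
import errno
import os
import time

def wait_for_exit(pid, interval=1.0):
    # Probe for the process with signal 0 (an existence check; it delivers
    # nothing).  A zombie still "exists" as far as kill() is concerned, so
    # if nothing ever reaps it this loop never terminates.
    while True:
        try:
            os.kill(pid, 0)
        except OSError as e:
            if e.errno == errno.ESRCH:
                return  # the process is really gone
            raise
        time.sleep(interval)
```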
comment:4 Changed at 2019-05-02T14:49:07Z by exarkun
Note:
$ ps -p 1
  PID TTY          TIME CMD
    1 pts/0    00:00:00 sh
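So PID 1 is a plain /bin/sh, which never wait()s for the orphans the kernel reparents to it, so their zombie entries are never cleaned up. For comparison, the extra job a real init (or a shim like dumb-init or tini) does is roughly this sketch (illustrative only, assuming it runs as PID 1, e.g. as the container entrypoint):

```python
import os
import sys

def reaping_init(argv):
    # Fork the real workload, then sit in a wait() loop collecting every
    # child that exits until none remain.  When this runs as PID 1, orphans
    # get reparented to it and are reaped here too; without that step,
    # exited orphans stay zombies forever.
    child = os.fork()
    if child == 0:
        try:
            os.execvp(argv[0], argv)
        except OSError:
            os._exit(127)

    exit_status = 0
    while True:
        try:
            pid, status = os.wait()
        except OSError:          # ECHILD: nothing left to reap
            sys.exit(exit_status)
        if pid == child:
            exit_status = os.WEXITSTATUS(status) if os.WIFEXITED(status) else 1
```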
comment:5 Changed at 2019-05-03T15:51:02Z by exarkun
So, my interpretation here is that "tahoe start" doesn't work in the CI environment because it performs standard *NIX daemonization. This relies on having a correctly functioning init process, which will eventually reap the zombie child process. A number of our CI environments don't have a correctly functioning init process, so the zombies linger forever. "tahoe restart" and "tahoe stop" can't deal with this situation: they wait forever for the zombies to get reaped, which leads the test suite to time out.
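To spell out what that daemonization dependency looks like, the classic double fork goes roughly like this (a sketch of the general pattern, not twistd's actual code):

```python
import os

def daemonize():
    if os.fork() > 0:
        os._exit(0)      # original process exits; its caller reaps it
    os.setsid()          # new session, detach from the controlling tty
    if os.fork() > 0:
        os._exit(0)      # intermediate child exits too
    # The surviving grandchild's parent has exited, so the kernel reparents
    # it to PID 1.  When it eventually exits, only PID 1 can reap it; if
    # PID 1 never calls wait(), it lingers as a zombie.
    # ... redirect stdio, write the pid file, run the node ...
```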
One possible fix is to give the CI environments a correctly functioning init process. That's probably a good idea.
Another possible fix is to stop using "tahoe start" in these tests and use the non-daemonizing "tahoe run" instead. We already want to switch from "start" to "run" in general, so this seems useful in two ways: it avoids the zombie problem here and it advances that migration.
Probably should eventually do both but for now I'm going to do the latter.
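For the record, the attraction of "tahoe run" here is that the node stays a direct, non-daemonized child of whoever starts it, so the test process can signal it and reap it itself, with no dependency on PID 1. Something in this direction (the argument details are illustrative, not the exact change that landed):

```python
import signal
import subprocess

# Start the node in the foreground as our own child process.
node = subprocess.Popen(
    ["python", "-m", "allmydata.scripts.runner", "run",
     "test_runner/RunNode/test_client/c1"],
)
try:
    pass  # ... exercise the running node ...
finally:
    node.send_signal(signal.SIGTERM)  # ask the node to shut down
    node.wait()                       # reap our own child; no zombie, no init needed
```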
comment:6 Changed at 2019-05-14T06:17:44Z by exarkun
- Keywords review-needed added
comment:7 Changed at 2019-05-14T08:13:08Z by GitHub <noreply@…>
- Owner set to GitHub <noreply@…>
- Resolution set to fixed
- Status changed from new to closed
In 3193611/trunk: