Opened at 2022-11-27T02:21:29Z
Closed at 2022-12-12T17:45:47Z
#3945 closed task (wontfix)
Retry moody GitHub Actions steps
Reported by: | sajith | Owned by: | sajith |
---|---|---|---|
Priority: | normal | Milestone: | undecided |
Component: | dev-infrastructure | Version: | n/a |
Keywords: | | Cc: | |
Launchpad Bug: | | | |
Description
Some workflows fail on GitHub Actions either because the tests are moody or because GitHub Actions itself is moody. Example: https://github.com/tahoe-lafs/tahoe-lafs/actions/runs/3556042011/jobs/5973114477
2022-11-27T01:09:13.3236569Z [FAIL]
2022-11-27T01:09:13.3236873Z Traceback (most recent call last):
2022-11-27T01:09:13.3237795Z   File "D:\a\tahoe-lafs\tahoe-lafs\.tox\py310-coverage\lib\site-packages\allmydata\util\pollmixin.py", line 47, in _convert_done
2022-11-27T01:09:13.3238340Z     f.trap(PollComplete)
2022-11-27T01:09:13.3239166Z   File "D:\a\tahoe-lafs\tahoe-lafs\.tox\py310-coverage\lib\site-packages\twisted\python\failure.py", line 480, in trap
2022-11-27T01:09:13.3244610Z     self.raiseException()
2022-11-27T01:09:13.3245778Z   File "D:\a\tahoe-lafs\tahoe-lafs\.tox\py310-coverage\lib\site-packages\twisted\python\failure.py", line 504, in raiseException
2022-11-27T01:09:13.3259779Z     raise self.value.with_traceback(self.tb)
2022-11-27T01:09:13.3260719Z   File "D:\a\tahoe-lafs\tahoe-lafs\.tox\py310-coverage\lib\site-packages\twisted\internet\defer.py", line 206, in maybeDeferred
2022-11-27T01:09:13.3261254Z     result = f(*args, **kwargs)
2022-11-27T01:09:13.3261923Z   File "D:\a\tahoe-lafs\tahoe-lafs\.tox\py310-coverage\lib\site-packages\allmydata\util\pollmixin.py", line 69, in _poll
2022-11-27T01:09:13.3262457Z     self.fail("Errors snooped, terminating early")
2022-11-27T01:09:13.3262935Z twisted.trial.unittest.FailTest: Errors snooped, terminating early
2022-11-27T01:09:13.3263257Z
2022-11-27T01:09:13.3263547Z allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
2022-11-27T01:09:13.3263989Z ===============================================================================
2022-11-27T01:09:13.3264288Z [ERROR]
2022-11-27T01:09:13.3264609Z Traceback (most recent call last):
2022-11-27T01:09:13.3265386Z   File "D:\a\tahoe-lafs\tahoe-lafs\.tox\py310-coverage\lib\site-packages\allmydata\util\rrefutil.py", line 26, in _no_get_version
2022-11-27T01:09:13.3268422Z     f.trap(Violation, RemoteException)
2022-11-27T01:09:13.3269217Z   File "D:\a\tahoe-lafs\tahoe-lafs\.tox\py310-coverage\lib\site-packages\twisted\python\failure.py", line 480, in trap
2022-11-27T01:09:13.3269711Z     self.raiseException()
2022-11-27T01:09:13.3270396Z   File "D:\a\tahoe-lafs\tahoe-lafs\.tox\py310-coverage\lib\site-packages\twisted\python\failure.py", line 504, in raiseException
2022-11-27T01:09:13.3270976Z     raise self.value.with_traceback(self.tb)
2022-11-27T01:09:13.3271553Z foolscap.ipb.DeadReferenceError: Connection was lost (to tubid=4vg7) (during method=RIStorageServer.tahoe.allmydata.com:get_version)
2022-11-27T01:09:13.3271977Z
2022-11-27T01:09:13.3272448Z allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
2022-11-27T01:09:13.3272884Z ===============================================================================
2022-11-27T01:09:13.3273207Z [ERROR]
2022-11-27T01:09:13.3273530Z Traceback (most recent call last):
2022-11-27T01:09:13.3274088Z Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to tubid=4vg7) (during method=RIUploadHelper.tahoe.allmydata.com:upload)
2022-11-27T01:09:13.3274512Z
2022-11-27T01:09:13.3274802Z allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
2022-11-27T01:09:13.3275437Z -------------------------------------------------------------------------------
2022-11-27T01:09:13.3275958Z Ran 1776 tests in 1302.475s
2022-11-27T01:09:13.3276195Z
2022-11-27T01:09:13.3276435Z FAILED (skips=27, failures=1, errors=2, successes=1748)
That failure has nothing to do with the changes that triggered that workflow; it might be a good idea to retry that step.
Some other workflows take a long time to run. Examples: on https://github.com/tahoe-lafs/tahoe-lafs/actions/runs/3556042011/jobs/5973114477, coverage (ubuntu-latest, pypy-37), integration (ubuntu-latest, 3.7), and integration (ubuntu-latest, 3.9). Although in this specific instance integration tests are failing due to #3943, it might be a good idea to retry them after a reasonable timeout, and give up altogether after a number of tries instead of spinning for many hours on end.
Perhaps this would be a good use of actions/retry-step?
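For illustration, the failing unit-test step could be wrapped in a retry action so that a single spurious failure doesn't sink the whole job. A minimal sketch, using the marketplace action nick-fields/retry (actions/retry-step would be wired up similarly); the tox environment, timeout, and attempt count below are assumptions, not the actual workflow's values:

```yaml
# Sketch only: retry the unit-test step a bounded number of times
# instead of failing the job on the first spurious error.
- name: Run unit tests (with bounded retries)
  uses: nick-fields/retry@v2   # one "retry a step" marketplace action
  with:
    timeout_minutes: 60        # give up on a hung run well before GitHub's 6-hour default
    max_attempts: 2            # one retry on a spurious failure, then report it
    command: python -m tox -e py310-coverage   # illustrative; the real step's command may differ
```

Bounding both the timeout and the number of attempts avoids spinning for hours, at the cost of making spurious failures easier to overlook.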
Change History (6)
comment:1 Changed at 2022-11-30T15:00:06Z by exarkun
comment:2 Changed at 2022-11-30T15:26:08Z by sajith
Hmm, that is true. Do you think there's value in using a smaller timeout value, though? Sometimes test runs seem to get stuck without terminating cleanly, like in this case, for example:
https://github.com/tahoe-lafs/tahoe-lafs/actions/runs/3525447679
Integration tests on Ubuntu ran for six hours, which I guess is GitHub's default timeout. From a developer experience perspective, I guess it would be useful for them to fail sooner than that.
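For reference, GitHub Actions has a built-in `timeout-minutes` setting that can be applied per job or per step; the default is 360 minutes, which is the six hours seen above. A minimal sketch, with an assumed job layout and tox invocation rather than the project's actual workflow:

```yaml
# Sketch only: cap a long-running job well below the 360-minute default
# so hung runs fail fast.  Names and commands are illustrative.
jobs:
  integration:
    runs-on: ubuntu-latest
    timeout-minutes: 90          # cancel the whole job after 90 minutes
    steps:
      - uses: actions/checkout@v3
      - name: Run integration tests
        timeout-minutes: 60      # an individual step can get its own, tighter cap
        run: python -m tox -e integration   # assumed invocation
```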
comment:3 follow-up: ↓ 4 Changed at 2022-11-30T18:06:31Z by meejah
A timeout of less than 6 hours sounds good (!!), but yeah, I mostly agree with what jean-paul is saying.
That said, _is_ there a ticket to explore that particular "known" spurious failure? (It seems somewhat "well known" that test_system sometimes has problems...)
comment:4 in reply to: ↑ 3 Changed at 2022-12-01T20:26:04Z by sajith
Replying to meejah:
That said, _is_ there a ticket to explore that particular "known" spurious failure? (It seems somewhat "well known" that test_system sometimes has problems...)
A quick search for "flaky", "spurious", and "test_upload_and_download_convergent" here in Trac turned up #3413, #3412, #1768, #1084, and this milestone: Integration and Unit Testing.
There might be more tickets. I guess all those tickets ideally should belong to that milestone.
Perhaps it might be worth collecting some data about these failures when testing the master branch alone, since PR branches are likely to add too much noise? https://github.com/tahoe-lafs/tahoe-lafs/actions?query=branch%3Amaster does not look ideal. However, since GitHub doesn't keep test logs long enough for organizations on free plans, collecting that data is going to be rather challenging. One way to snapshot the data periodically is sketched below.
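A minimal sketch of what such a collector could look like: a scheduled workflow that pulls recent run results for master from the REST API and stores the JSON as an artifact. The workflow name, schedule, and file names are made up, and the snapshot would eventually have to be shipped somewhere more durable, since artifacts expire too:

```yaml
# Sketch only: snapshot recent CI results on master so flaky-test data
# outlives GitHub's log-retention window.  Names here are illustrative.
name: archive-ci-results
on:
  schedule:
    - cron: "0 6 * * *"          # once a day
jobs:
  archive:
    runs-on: ubuntu-latest
    steps:
      - name: Fetch recent workflow runs on master
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          gh api "repos/tahoe-lafs/tahoe-lafs/actions/runs?branch=master&per_page=100" \
            > "runs-$(date +%Y%m%d).json"
      - name: Keep the snapshot as an artifact
        uses: actions/upload-artifact@v3
        with:
          name: ci-results-snapshot
          path: runs-*.json
```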
Maybe fixing flaky tests is not worth the trouble, given the limited resources and the fact that this never has been annoying enough to become a priority. :-)
comment:5 Changed at 2022-12-01T20:34:45Z by exarkun
Maybe fixing flaky tests is not worth the trouble, given the limited resources and the fact that this never has been annoying enough to become a priority. :-)
I wouldn't say this is the case. I spent a large chunk of time last year fixing flaky tests. The test suite is currently much more reliable than it was before that effort.
comment:6 Changed at 2022-12-12T17:45:47Z by exarkun
- Resolution set to wontfix
- Status changed from new to closed
I don't think automatically doing a rerun of the whole test suite when a test fails is a good idea.
If there is a real test failure then the result is that CI takes N times as long to complete. If there is a spurious test failure that we're not aware of then the result is that we don't become aware of it for much longer. If there is a spurious test failure that we are aware of then the result is that it is swept under the rug and is much easier to ignore for much longer.
These all seem like downsides to me.