Opened at 2019-05-01T11:43:29Z
Closed at 2019-05-14T10:13:52Z
#3038 closed defect (fixed)
Many Python 2.7 macOS builds fail with "Too many open files"
Reported by: | exarkun | Owned by: | exarkun |
---|---|---|---|
Priority: | normal | Milestone: | undecided |
Component: | unknown | Version: | 1.12.1 |
Keywords: | Cc: | ||
Launchpad Bug: |
Description
An example from a run on Travis (but note buildbot macOS fails the same way):
=============================================================================== [ERROR] Traceback (most recent call last): Failure: foolscap.tokens.RemoteException: <RemoteException around '[CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last): File "/Users/travis/build/tahoe-lafs/tahoe-lafs/.tox/py27/lib/python2.7/site-packages/twisted/internet/defer.py", line 501, in errback self._startRunCallbacks(fail) File "/Users/travis/build/tahoe-lafs/tahoe-lafs/.tox/py27/lib/python2.7/site-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks self._runCallbacks() File "/Users/travis/build/tahoe-lafs/tahoe-lafs/.tox/py27/lib/python2.7/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks current.result = callback(current.result, *args, **kw) File "/Users/travis/build/tahoe-lafs/tahoe-lafs/.tox/py27/lib/python2.7/site-packages/twisted/internet/defer.py", lin -- TRACEBACK ELIDED -- /defer.py", line 654, in _runCallbacks current.result = callback(current.result, *args, **kw) File "/Users/travis/build/tahoe-lafs/tahoe-lafs/.tox/py27/lib/python2.7/site-packages/foolscap/broker.py", line 579, in <lambda> File "/Users/travis/build/tahoe-lafs/tahoe-lafs/.tox/py27/lib/python2.7/site-packages/foolscap/broker.py", line 614, in _doCall File "/Users/travis/build/tahoe-lafs/tahoe-lafs/.tox/py27/lib/python2.7/site-packages/foolscap/referenceable.py", line 59, in doRemoteCall File "/Users/travis/build/tahoe-lafs/tahoe-lafs/src/allmydata/storage/server.py", line 300, in remote_allocate_buckets File "/Users/travis/build/tahoe-lafs/tahoe-lafs/src/allmydata/storage/immutable.py", line 209, in __init__ self._sharefile = ShareFile(incominghome, create=True, max_size=max_size) File "/Users/travis/build/tahoe-lafs/tahoe-lafs/src/allmydata/storage/immutable.py", line 53, in __init__ f = open(self.home, 'wb') exceptions.IOError: [Errno 24] Too many open files: u'/Users/travis/build/tahoe-lafs/tahoe-lafs/_trial_temp/system/SystemTest/test_upload_and_download_convergent/client2/storage/shares/incoming/v7/v7tjbaysgdl3br6ymaek73z5sa/5' ]'> ]) ]'> allmydata.test.test_system.SystemTest.test_upload_and_download_convergent =============================================================================== [ERROR] Traceback (most recent call last): Failure: foolscap.tokens.RemoteException: <RemoteException around '[CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last): File "/Users/travis/build/tahoe-lafs/tahoe-lafs/.tox/py27/lib/python2.7/site-packages/foolscap/call.py", line 679, in _done File "/Users/travis/build/tahoe-lafs/tahoe-lafs/.tox/py27/lib/python2.7/site-packages/foolscap/call.py", line 60, in complete File "/Users/travis/build/tahoe-lafs/tahoe-lafs/.tox/py27/lib/python2.7/site-packages/twisted/internet/defer.py", line 460, in callback self._startRunCallbacks(result) File "/Users/travis/build/tahoe-lafs/tahoe-lafs/.tox/py27/lib/python2.7/site-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks self._runCallbacks() --- <exception caught here> --- File "/Users/travis/build/tahoe-lafs/tahoe-lafs/.tox/py27/lib/python2.7/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks current.result = callback(current.result, *args, **kw) File "/Users/travis/build/tahoe-lafs/tahoe-lafs/src/allmydata/immutable/offloaded.py", line 345, in _start_reading exceptions.IOError: [Errno 24] Too many open files: u'/Users/travis/build/tahoe-lafs/tahoe-lafs/_trial_temp/system/SystemTest/test_upload_and_download_random_key/client0/helper/CHK_incoming/hozcimqmlsb7gxruhdbi3gaz4a' ]'> allmydata.test.test_system.SystemTest.test_upload_and_download_random_key -------------------------------------------------------------------------------
This seems to have begun around the middle of April.
Change History (14)
comment:1 Changed at 2019-05-01T11:43:36Z by exarkun
- Owner set to exarkun
comment:2 Changed at 2019-05-01T12:42:09Z by exarkun
comment:3 Changed at 2019-05-01T12:42:31Z by exarkun
allmydata.test.test_system.SystemTest?.test_upload_and_download_convergent doesn't fail in isolation.
comment:4 Changed at 2019-05-01T12:45:54Z by exarkun
allmydata.test.test_system.SystemTest? doesn't fail in isolation.
comment:5 Changed at 2019-05-01T12:49:56Z by exarkun
allmydata.test.test_system doesn't fail in isolation
comment:6 Changed at 2019-05-01T12:59:07Z by exarkun
It's not 100% deterministic when running the whole test suite. Sometimes the tests pass.
comment:7 Changed at 2019-05-01T13:08:45Z by exarkun
Note the max open files limit on macOS defaults to 256 and using lsof to inspect the test process when it gets to test_system shows that it is awfully close yo this (lsof says max open fd number is 255! didn't read through all the other fds to see if there are gaps).
comment:8 Changed at 2019-05-01T13:23:55Z by exarkun
It seems more reproducible using tox than running trial directory. Two tox runs both encountered it, two direct trial runs both passed.
Also, using some grep magic with lsof, I see that allmydata.test.test_system.SystemTest??.test_upload_and_download_convergent seems to want to open up around 150 fds. So if there are already ~100 open then it's going to run into the macOS limit.
comment:9 Changed at 2019-05-01T14:41:02Z by exarkun
I guess ... maybe related ... allmydata.test.test_client leaks file descriptors.
comment:10 Changed at 2019-05-01T14:47:03Z by exarkun
It looks like the leaked descriptors are the dup'd descriptors passed to AdoptedStreamServerEndpoint in _foolscapEndpointForPortNumber.
comment:11 Changed at 2019-05-01T14:53:53Z by exarkun
If AdoptedStreamServerEndpoint is never told to listen then it will never convert its integer file descriptor into a listening port. If the tub that is passed to listenOnUnused is never started then the endpoint will never be told to listen.
A bunch of tests seem to create a client but never start it. In this case, the tub is never started and the descriptor leaks.
comment:12 Changed at 2019-05-01T17:08:35Z by exarkun
I tried making a wrapper for the endpoint with a del that closes the file descriptor if it needs to be closed (if listen hasn't been called). This didn't help because apparently the garbage collector doesn't run frequently enough to clean anything up before file descriptors run out (this makes sense, you only have to leak a hundred or so file descriptors to cause a problem and the GC only wants to run after a lot more has happened than that, I expect (CPython GC is driven by allication/deallocation from Python memory arenas, not file descriptors)).
I expect a weakref-based solution to have the same problem since weakrefs are driven by the CPython GC.
Possibly this issue can be worked around by inserting an explicit gc.collect() call into this port allocation code path. This would tie collection frequency to port allocation frequency which should let us keep on top of the stale resources.
comment:13 Changed at 2019-05-13T10:33:14Z by exarkun
I added some gc.collect calls at carefully chosen places/times. Perhaps this will work. Waiting on CI (while also trying to sift through and fix all the _other_ failures we currently have).
comment:14 Changed at 2019-05-14T10:13:52Z by GitHub <noreply@…>
- Resolution set to fixed
- Status changed from new to closed
In bee18e1/trunk:
Reproducible on the macOS machine (macstadium) I have, hooray.