#2787 closed defect (wontfix)
intermittent "Address Already In Use" error during tests
Reported by: | warner | Owned by: | warner |
---|---|---|---|
Priority: | normal | Milestone: | soon |
Component: | code-network | Version: | 1.11.0 |
Keywords: | Cc: | ||
Launchpad Bug: |
Description
I'm seeing occasional errors during tests like this:
  File "/Users/warner/stuff/tahoe/tahoe/src/allmydata/test/test_introducer.py", line 662, in test_system_v2_server
    return self.do_system_test()
  File "/Users/warner/stuff/tahoe/tahoe/src/allmydata/test/test_introducer.py", line 378, in do_system_test
    self.create_tub()
  File "/Users/warner/stuff/tahoe/tahoe/src/allmydata/test/test_introducer.py", line 321, in create_tub
    tub.listenOn("tcp:%d" % portnum)
  File "/Users/warner/stuff/tahoe/tahoe/.tox/py27/lib/python2.7/site-packages/foolscap/pb.py", line 514, in listenOn
    l.setServiceParent(self)
  File "/Users/warner/stuff/tahoe/tahoe/.tox/py27/lib/python2.7/site-packages/twisted/application/service.py", line 188, in setServiceParent
    self.parent.addService(self)
  File "/Users/warner/stuff/tahoe/tahoe/.tox/py27/lib/python2.7/site-packages/twisted/application/service.py", line 309, in addService
    service.privilegedStartService()
  File "/Users/warner/stuff/tahoe/tahoe/.tox/py27/lib/python2.7/site-packages/twisted/application/service.py", line 278, in privilegedStartService
    service.privilegedStartService()
  File "/Users/warner/stuff/tahoe/tahoe/.tox/py27/lib/python2.7/site-packages/twisted/application/internet.py", line 113, in privilegedStartService
    self._port = self._getPort()
  File "/Users/warner/stuff/tahoe/tahoe/.tox/py27/lib/python2.7/site-packages/twisted/application/internet.py", line 141, in _getPort
    'listen%s' % (self.method,))(*self.args, **self.kwargs)
  File "/Users/warner/stuff/tahoe/tahoe/.tox/py27/lib/python2.7/site-packages/twisted/internet/posixbase.py", line 478, in listenTCP
    p.startListening()
  File "/Users/warner/stuff/tahoe/tahoe/.tox/py27/lib/python2.7/site-packages/twisted/internet/tcp.py", line 984, in startListening
    raise CannotListenError(self.interface, self.port, le)
twisted.internet.error.CannotListenError: Couldn't listen on any:49299: [Errno 48] Address already in use.
[ERROR]
I'm still tracking this down, but it looks like allocate_tcp_port() in iputil.py (which I wrote for Foolscap and copied over a few months ago) is sometimes giving us port numbers that are actually already in use. Those ports are coming from the kernel (we do a bind(port=0) and then ask what port got allocated).
One problem that I know about is that we're binding the test port to 127.0.0.1, and using SO_REUSEADDR, and the combination of those two might make the kernel think it's ok to give us a port that's already bound to something *other* than 127.0.0.1. But in some tests, replacing that with 0.0.0.0 didn't help: I was still given ports that are already in use.
I have to experiment some more to figure out what's going on. I think in the long run, allocate_tcp_port() might need to actually try to listen on the port, and if that fails, grab a different one.
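A minimal sketch of that "try to listen, and retry on failure" idea, assuming Python 3 and plain sockets rather than Twisted. The helper name `listen_on_free_port` and its parameters are hypothetical and are not part of iputil.py or Foolscap. Note that the sketch returns the listening socket itself; if the caller closed it and only kept the port number, the port could still be taken before it is used.

```python
import errno
import socket


def listen_on_free_port(host="127.0.0.1", attempts=10):
    """Return (listening_socket, port). Hypothetical helper, not Tahoe-LAFS API."""
    for _ in range(attempts):
        # Ask the kernel for a candidate port, then release it again.
        probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            probe.bind((host, 0))
            port = probe.getsockname()[1]
        finally:
            probe.close()

        # Actually try to listen on the candidate; if it is taken, pick another.
        listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            listener.bind((host, port))
            listener.listen(5)
        except OSError as e:
            listener.close()
            if e.errno != errno.EADDRINUSE:
                raise
            continue
        return listener, port
    raise OSError(errno.EADDRINUSE,
                  "no free port found after %d attempts" % attempts)


if __name__ == "__main__":
    listener, port = listen_on_free_port()
    print("listening on port", port)
    listener.close()
```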
Change History (2)
comment:1 Changed at 2018-05-23T13:39:52Z by exarkun
- Resolution set to wontfix
- Status changed from new to closed
comment:2 Changed at 2018-05-23T14:38:34Z by exarkun
- Create a private network namespace for the test suite. This removes the possibility of a port collision involving unrelated activities on the same host. It does not remove the possibility of a port collision of Tahoe-LAFS code with other Tahoe-LAFS code though. Network namespaces are highly platform-specific and this would likely involve three or more implementations of the same idea. Also, creating network namespaces likely requires elevated privileges, imposing a practical barrier to deployment.
- Avoid binding to INADDR_ANY. Instead, bind to a specific interface. This avoids collisions with other ports bound to different specific interfaces. It doesn't avoid collisions with other ports bound to INADDR_ANY. Since most collisions are probably with INADDR_ANY-bound sockets, this probably doesn't help a lot.
It's not possible to fix this inside allocate_tcp_port itself. So I'm planning to close this ticket. Instead, we'll have a ticket for each test which can fail this way and they'll have to be fixed one by one.
The reason we cannot fix this inside allocate_tcp_port is that the approach it is a component of suffers from an unavoidable race condition. allocate_tcp_port tries to figure out a specific TCP port number which _will not be in use at a later point in time_. Since there is no part of the system which allows the port number to be reserved or otherwise kept out of use *except by the one piece of code we intend*, it cannot actually know whether any port number it selects will satisfy this requirement.
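The race can be forced deliberately to see what it looks like. The following is only a sketch of the allocate-then-release pattern described above, assuming Python 3 and plain sockets; `allocate_port_then_release` and `demonstrate_race` are hypothetical names and this is not the actual iputil.py code. No SO_REUSEADDR is set, so the second bind is expected to fail with EADDRINUSE.

```python
import errno
import socket


def allocate_port_then_release(host="127.0.0.1"):
    # Ask the kernel for any free port, note its number, and give it back.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, 0))
        return s.getsockname()[1]
    finally:
        s.close()


def demonstrate_race(host="127.0.0.1"):
    """Return the errno seen by the intended user of the port, or None."""
    port = allocate_port_then_release(host)

    # Between allocation and use, some other code takes the same port.
    # Here it is done on purpose; in a test suite it happens by chance.
    intruder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    intruder.bind((host, port))
    intruder.listen(1)

    # The code the port was "allocated" for now tries to use it.
    intended = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        intended.bind((host, port))
        intended.listen(1)
        return None
    except OSError as e:
        return e.errno
    finally:
        intended.close()
        intruder.close()


if __name__ == "__main__":
    print(demonstrate_race() == errno.EADDRINUSE)
```

Nothing in the first function prevents the "intruder" step, which is the point of the paragraph above: the port number is only a guess about the future.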
In practice, it does succeed with high probability. However, due to the large number of cases in which it is used (many times per test suite run and the test suite itself is run many times), even this high probability of success is not good enough. I will make an incredibly naive estimate that there are 2^15 ports available for "random" assignment and that the chance of an unrelated intermediate assignment being made is about 1 in 2 (I suspect some tests themselves trigger an unrelated intermediate port assignment). The chance of a collision is therefore 1 in 2^16 (around a thousandth of a percent). If there are 100 users of allocate_tcp_port in the test suite then the chance of a collision anywhere in the test suite is 100 in 2^16. There are about 15 different CI runners of the test suite. So the chance of a failure on any of them for one build set is 15 * 100 in 2^16. The test suite is run for every pull request and every master revision. If there is one PR merged a day, the chance of a failure in a week is at least 14 * 15 * 100 in 2^16, which reduces to around 32%. Quite easily high enough to be disruptive to development.
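The arithmetic in that estimate can be checked with a few lines of Python. This sketch uses only the figures given above (1 in 2^16 per allocation, 100 allocations per run, 15 runners, 14 runs per week). The second figure additionally assumes that every allocation is an independent trial, which the estimate above does not state; under that assumption the weekly chance of at least one collision comes out somewhat lower than the simple sum, at roughly 27%.

```python
def weekly_collision_estimates(p_single=1 / 2**16,
                               users_per_run=100,
                               runners=15,
                               runs_per_week=14):
    # Total number of allocate_tcp_port calls in one week.
    n = users_per_run * runners * runs_per_week

    # Simple sum of probabilities, as in the estimate above.
    linear = n * p_single

    # Chance of at least one collision if the n trials are independent
    # (an assumption made for this sketch).
    at_least_one = 1 - (1 - p_single) ** n
    return n, linear, at_least_one


if __name__ == "__main__":
    n, linear, at_least_one = weekly_collision_estimates()
    print("allocations per week:", n)
    print("summed estimate: %.1f%%" % (100 * linear))
    print("independent-trials estimate: %.1f%%" % (100 * at_least_one))
```

Either way the figure is far too high for a test suite that is supposed to pass reliably.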
There are several possible general fixes for this issue.
Considering all of these, (2) is my preference. However, there is the matter of Windows support to contend with in that case.