#2787 closed defect (wontfix)

intermittent "Address Already In Use" error during tests

Reported by: warner Owned by: warner
Priority: normal Milestone: soon
Component: code-network Version: 1.11.0
Keywords: Cc:
Launchpad Bug:

Description

I'm seeing occasional errors during tests like this:

  File "/Users/warner/stuff/tahoe/tahoe/src/allmydata/test/test_introducer.py", line 662, in test_system_v2_server
    return self.do_system_test()
  File "/Users/warner/stuff/tahoe/tahoe/src/allmydata/test/test_introducer.py", line 378, in do_system_test
    self.create_tub()
  File "/Users/warner/stuff/tahoe/tahoe/src/allmydata/test/test_introducer.py", line 321, in create_tub
    tub.listenOn("tcp:%d" % portnum)
  File "/Users/warner/stuff/tahoe/tahoe/.tox/py27/lib/python2.7/site-packages/foolscap/pb.py", line 514, in listenOn
    l.setServiceParent(self)
  File "/Users/warner/stuff/tahoe/tahoe/.tox/py27/lib/python2.7/site-packages/twisted/application/service.py", line 188, in setServiceParent
    self.parent.addService(self)
  File "/Users/warner/stuff/tahoe/tahoe/.tox/py27/lib/python2.7/site-packages/twisted/application/service.py", line 309, in addService
    service.privilegedStartService()
  File "/Users/warner/stuff/tahoe/tahoe/.tox/py27/lib/python2.7/site-packages/twisted/application/service.py", line 278, in privilegedStartService
    service.privilegedStartService()
  File "/Users/warner/stuff/tahoe/tahoe/.tox/py27/lib/python2.7/site-packages/twisted/application/internet.py", line 113, in privilegedStartService
    self._port = self._getPort()
  File "/Users/warner/stuff/tahoe/tahoe/.tox/py27/lib/python2.7/site-packages/twisted/application/internet.py", line 141, in _getPort
    'listen%s' % (self.method,))(*self.args, **self.kwargs)
  File "/Users/warner/stuff/tahoe/tahoe/.tox/py27/lib/python2.7/site-packages/twisted/internet/posixbase.py", line 478, in listenTCP
    p.startListening()
  File "/Users/warner/stuff/tahoe/tahoe/.tox/py27/lib/python2.7/site-packages/twisted/internet/tcp.py", line 984, in startListening
    raise CannotListenError(self.interface, self.port, le)
twisted.internet.error.CannotListenError: Couldn't listen on any:49299: [Errno 48] Address already in use.
[ERROR]

I'm still tracking this down, but it looks like allocate_tcp_port() in iputil.py (which I wrote for Foolscap and copied over a few months ago) is sometimes giving us port numbers that are actually already in use. Those ports are coming from the kernel (we do a bind(port=0) and then ask what port got allocated).

One problem that I know about is that we're binding the test port to 127.0.0.1, and using SO_REUSEADDR, and the combination of those two might make the kernel think it's ok to give us a port that's already bound to something *other* than 127.0.0.1. But in some tests, replacing that with 0.0.0.0 didn't help: I was still given ports that are already in use.

I have to experiment some more to figure out what's going on. I think in the long run, allocate_tcp_port() might need to actually try to listen on the port, and if that fails, grab a different one.
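
For illustration, the allocation scheme described above (bind to port 0, ask the kernel which port it chose, then release it) can be sketched as follows. This is not the actual iputil.py code; the function name allocate_port_number and its host parameter are hypothetical.

```python
import socket

def allocate_port_number(host="127.0.0.1"):
    """Ask the kernel for a currently free TCP port number, then release it.

    Hypothetical stand-in for the approach described in the ticket.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind((host, 0))            # port 0: let the kernel choose
        return s.getsockname()[1]    # the port number the kernel picked
    finally:
        s.close()                    # nothing reserves the port after this

portnum = allocate_port_number()
# Between the close() above and a later listen on portnum, any other
# process or test may bind the same port.
```

The returned number only describes a port that was free at the moment of the bind; nothing keeps it free afterwards.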

Change History (2)

comment:1 Changed at 2018-05-23T13:39:52Z by exarkun

  • Resolution set to wontfix
  • Status changed from new to closed

It's not possible to fix this inside allocate_tcp_port itself. So I'm planning to close this ticket. Instead, we'll have a ticket for each test which can fail this way and they'll have to be fixed one by one.

The reason we cannot fix this inside allocate_tcp_port is that the approach it is a component of suffers from an unavoidable race condition. allocate_tcp_port tries to figure out a specific TCP port number which _will not be in use at a later point in time_. Since there is no part of the system which allows the port number to be reserved or otherwise kept out of use *except by the one piece of code we intend*, it cannot actually know whether any port number it selects will satisfy this requirement.

In practice, it does succeed with high probability. However, due to the large number of cases in which it is used (many times per test suite run and the test suite itself is run many times), even this high probability of success is not good enough. I will make an incredibly naive estimate that there are 2^15 ports available for "random" assignment and that the chance of an unrelated intermediate assignment being made is about 1 in 2 (I suspect some tests themselves trigger an unrelated intermediate port assignment). The chance of a collision is therefore 1 in 2^16 (around a thousandth of a percent). If there are 100 users of allocate_tcp_port in the test suite then the chance of a collision anywhere in the test suite is 100 in 2^16. There are about 15 different CI runners of the test suite. So the chance of a failure on any of them for one build set is 15 * 100 in 2^16. The test suite is run for every pull request and every master revision. If there is one PR merged a day, the chance of a failure in a week is at least 14 * 15 * 100 in 2^16 which reduces to around 32%. Quite easily high enough to be disruptive to development.
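
The arithmetic in this estimate can be checked with a few lines of Python. The inputs are the naive figures from the paragraph above. The last line is an addition that is not in the ticket: it treats every allocation as an independent trial, which gives a somewhat lower figure than simply summing the probabilities.

```python
# Naive inputs taken from the estimate above.
ports = 2 ** 15            # ports available for "random" assignment
p_intermediate = 1 / 2     # chance of an unrelated intermediate assignment
users = 100                # users of allocate_tcp_port in the test suite
runners = 15               # CI runners per build set
builds_per_week = 14       # one PR build and one master build per day

p_collision = p_intermediate / ports          # 1 in 2**16
allocations = builds_per_week * runners * users

# Summing the per-allocation probabilities, as the estimate does.
p_week_sum = allocations * p_collision

# Assumption not made in the ticket: allocations are independent trials.
p_week_independent = 1 - (1 - p_collision) ** allocations

print("per allocation: %.4f%%" % (100 * p_collision))
print("per week (sum): %.0f%%" % (100 * p_week_sum))
print("per week (independent): %.0f%%" % (100 * p_week_independent))
```

Either way the weekly failure rate lands in the range of roughly a quarter to a third, which supports the conclusion that the failures are disruptive.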

There are several possible general fixes for this issue.

  1. Add retry logic. If a test randomly allocates a port and then discovers it cannot bind that port, just try the whole process over again. A small number of retries should be able to drive the failure rate down dramatically (the chance of success of each try should be independent; if the chance of failure of 1 try is a thousandth of a percent, the chance of failure of 3 tries is the cube of that - under a billionth of a percent). This solution is conceptually simple but the implementation might not be so. Detecting the failure (asynchronously, often across process boundaries) and backing up to a point where a retry may be made will probably take a lot of effort.
  2. Switch to pre-allocated sockets. Note that allocate_tcp_port is really trying to allocate a TCP port number. If it allocated a bound TCP socket (perhaps marked as listening) and this socket were handed to application code, there is no possibility for a collision in the application code because there is no longer any need to bind there. There is still the possibility for a collision inside the allocation function but it is much reduced compared to the current situation and it is much more amenable to the addition of retry logic. The most likely downside to this approach is lack of support for the underlying operation on Windows.
  3. Switch to UNIX sockets. It's much easier to avoid collisions with UNIX sockets. When using TCP we are working with only 2^15 possible values, they are assigned roughly randomly, and we compete with all other users of the system for them. When using UNIX, we have at least 255^108 possible values, we can allocate them with structure that inherently avoids self-collision, and we need not compete with anyone else on the system. However, UNIX sockets are not necessarily compatible with all of the components which need to accept connections (for example, their "socket name" necessarily differs from TCP/IPv4; and being inherently private, there is less support in tools like HTTP clients for accessing them).
  4. Reverse the allocation relationship. Let the application code randomly allocate a port number. Arrange for the test code to somehow learn of the allocated value. As with option (2), this dramatically reduces the possibility for a collision and makes it significantly easier to add retry logic at the point where that collision may occur. In contrast to (2), it may require implementation of this allocation and retry logic at multiple code sites. There is also the matter of conveying the allocated port number back to the test code which probably also requires several different implementations.
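
As a rough illustration of the retry idea in the first option, here is a minimal synchronous sketch using plain sockets. All names (allocate_port_number, listen_with_retry, start_listening) are hypothetical, and the sketch deliberately ignores the hard part noted above: in the real test suite the failure surfaces asynchronously (for example as Twisted's CannotListenError) and possibly in another process.

```python
import errno
import socket

def allocate_port_number(host="127.0.0.1"):
    """Hypothetical helper: get a port number from the kernel, then release it."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, 0))
        return s.getsockname()[1]
    finally:
        s.close()

def listen_with_retry(start_listening, attempts=3):
    """Call start_listening(portnum) with fresh port numbers until one works."""
    last_error = None
    for _ in range(attempts):
        portnum = allocate_port_number()
        try:
            return portnum, start_listening(portnum)
        except OSError as e:
            if e.errno != errno.EADDRINUSE:
                raise              # unrelated failure: do not mask it
            last_error = e         # port was taken in the meantime: retry
    raise last_error

def start_listening(portnum):
    """Stand-in for application code that binds the given port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("127.0.0.1", portnum))
        s.listen(5)
    except OSError:
        s.close()
        raise
    return s

portnum, listener = listen_with_retry(start_listening)
```

Only "address already in use" triggers a retry; any other error propagates immediately so that genuine bugs are not hidden.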

Considering all of these, (2) is my preference. However, there is the matter of Windows support to contend with in that case.
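
A minimal sketch of option (2) using only the standard library, assuming the application code can be changed to accept an already-listening socket. The helper name allocate_listening_socket is hypothetical. Twisted reactors that implement IReactorSocket provide adoptStreamPort for taking over such a socket, but whether that fits Foolscap and Tahoe-LAFS is not established here.

```python
import socket

def allocate_listening_socket(host="127.0.0.1"):
    """Return a TCP socket that is already bound and listening.

    The socket stays open, so the port stays reserved until the
    caller closes it. Hypothetical helper, not part of iputil.py.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, 0))   # the kernel picks the port
        s.listen(5)
    except OSError:
        s.close()
        raise
    return s

listener = allocate_listening_socket()
listener.settimeout(5)
portnum = listener.getsockname()[1]   # known to the test, still reserved

# The "application" uses the socket directly; it never needs to bind.
client = socket.create_connection(("127.0.0.1", portnum), timeout=5)
conn, _ = listener.accept()
conn.sendall(b"hello")
received = client.recv(5)

for sock in (conn, client, listener):
    sock.close()
```

Because the socket is never closed between allocation and use, there is no window in which another process can take the port.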

Version 1, edited at 2018-05-23T14:14:27Z by exarkun

comment:2 Changed at 2018-05-23T14:38:34Z by exarkun

  1. Create a private network namespace for the test suite. This removes the possibility of a port collision involving unrelated activities on the same host. It does not remove the possibility of a port collision of Tahoe-LAFS code with other Tahoe-LAFS code though. Network namespaces are highly platform specific and this would likely involve three or more implementations of the same idea. Also, creating network namespaces likely requires elevated privileges, imposing a practical barrier to deployment.
  2. Avoid binding to INADDR_ANY. Instead, bind to a specific interface. This avoids collisions with other ports bound to different specific interfaces. It doesn't avoid collisions with other ports bound to INADDR_ANY. Since most collisions are probably with INADDR_ANY-bound sockets this probably doesn't help a lot.