Opened at 2009-05-03T09:38:04Z
Closed at 2009-06-30T11:54:14Z
#690 closed defect (fixed)
raise size limit on furls
| Reported by: | adigeo | Owned by: | warner |
|---|---|---|---|
| Priority: | major | Milestone: | 1.5.0 |
| Component: | code-network | Version: | 1.4.1 |
| Keywords: | | Cc: | |
| Launchpad Bug: | | | |
Description
I installed tahoe on a grid of 13 computers running Debian unstable. On two nodes, starting tahoe gives this error and the node does not start:
2009-05-03 11:35:46+0200 [-] Log opened.
2009-05-03 11:35:46+0200 [-] twistd 8.1.0 (/usr/bin/python 2.5.2) starting up
2009-05-03 11:35:46+0200 [-] reactor class: <class 'twisted.internet.selectreactor.SelectReactor'>
2009-05-03 11:35:46+0200 [-] foolscap.pb.Listener starting on 36033
2009-05-03 11:35:46+0200 [-] nevow.appserver.NevowSite starting on 3456
2009-05-03 11:35:46+0200 [-] Starting factory <nevow.appserver.NevowSite instance at 0x8dadc0c>
2009-05-03 11:35:46+0200 [-] twisted.internet.protocol.DatagramProtocol starting on 49645
2009-05-03 11:35:46+0200 [-] Starting protocol <twisted.internet.protocol.DatagramProtocol instance at 0x8dadfec>
2009-05-03 11:35:46+0200 [-] (Port 49645 Closed)
2009-05-03 11:35:46+0200 [-] Stopping protocol <twisted.internet.protocol.DatagramProtocol instance at 0x8dadfec>
2009-05-03 11:35:46+0200 [Negotiation,client] Unhandled Error
Traceback (most recent call last):
File "/usr/lib/python2.5/site-packages/foolscap/call.py", line 736, in receiveClose
self.request.fail(self.failure)
File "/usr/lib/python2.5/site-packages/foolscap/call.py", line 88, in fail
self.deferred.errback(why)
File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 269, in errback
self._startRunCallbacks(fail)
File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 312, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 328, in _runCallbacks
self.result = callback(self.result, *args, **kw)
File "/usr/lib/python2.5/site-packages/allmydata/util/rrefutil.py", line 45, in _wrap_server_failure
raise ServerFailure(f)
allmydata.util.rrefutil.ServerFailure: [CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
File "/usr/lib/python2.5/site-packages/twisted/internet/selectreactor.py", line 146, in _doReadOrWrite
why = getattr(selectable, method)()
File "/usr/lib/python2.5/site-packages/twisted/internet/tcp.py", line 121, in doRead
return Connection.doRead(self)
File "/usr/lib/python2.5/site-packages/twisted/internet/tcp.py", line 362, in doRead
return self.protocol.dataReceived(data)
File "/usr/lib/python2.5/site-packages/foolscap/banana.py", line 638, in dataReceived
self.handleData(chunk)
--- <exception caught here> ---
File "/usr/lib/python2.5/site-packages/foolscap/banana.py", line 796, in handleData
top.checkToken(typebyte, header)
File "/usr/lib/python2.5/site-packages/foolscap/referenceable.py", line 220, in checkToken
self.urlConstraint.checkToken(typebyte, size)
File "/usr/lib/python2.5/site-packages/foolscap/constraint.py", line 115, in checkToken
(tokenNames[typebyte], size, limit))
foolscap.tokens.Violation: Violation (<RootUnslicer>.<methodcall reqID=6 obj=<allmydata.introducer.server.IntroducerService object at 0xa98be8c> iface=RIIntroducerPublisherAndSubscriberService.tahoe.allmydata.com methodname=subscribe>.<arguments arg[0]>.<ref-1>): ('STRING token too large: 252>200',) ]
Change History (17)
comment:1 Changed at 2009-05-03T09:38:54Z by adigeo
comment:2 Changed at 2009-05-03T22:30:55Z by warner
- Owner set to warner
- Status changed from new to assigned
Curious.. I'm guessing you're running into an unexpected constraint on the size of a FURL that can be transmitted over the wire, in this case inside the "subscribe" message that is sent to the Introducer. This message contains information to let the Introducer know where it ought to send future announcements: the FURL of the subscriber. This FURL is ephemeral (it changes each time the client node is restarted), but has the same general form as other, more-persistent ones.
Do those systems have a lot of IP addresses? Maybe lots of multihoming, or Xen/VMware/etc virtual interfaces that don't actually talk to the outside world? Or some IPv6 interfaces?
Look at $NODEDIR/private/control.furl, as it will have the same tubid and "location hints" string as the ephemeral subscriber FURL. It will be in the form pb://TUBID@HOST:PORT,HOST:PORT/SWISSNUM, where both TUBID and SWISSNUM are base32 characters. Each HOST:PORT section (of which there can be an arbitrary number) is just a dotted-quad IP address and a decimal port number.
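To see how many hints a given FURL carries, the format described above can be split apart with a few string operations. This is only a sketch based on the pb://TUBID@HOST:PORT,HOST:PORT/SWISSNUM layout given in this comment; `parse_furl` and `ParsedFurl` are hypothetical names, not part of Foolscap or Tahoe, and the sample TUBID and SWISSNUM are made-up placeholders.

```python
from typing import List, NamedTuple, Tuple


class ParsedFurl(NamedTuple):
    tubid: str
    hints: List[Tuple[str, int]]
    swissnum: str


def parse_furl(furl: str) -> ParsedFurl:
    """Split pb://TUBID@HOST:PORT,HOST:PORT/SWISSNUM into its parts."""
    prefix = "pb://"
    if not furl.startswith(prefix):
        raise ValueError("not a pb:// FURL")
    rest = furl[len(prefix):]
    tubid, sep, rest = rest.partition("@")
    if not sep:
        raise ValueError("missing '@' after TUBID")
    location, sep, swissnum = rest.partition("/")
    if not sep:
        raise ValueError("missing '/' before SWISSNUM")
    hints = []
    for hint in location.split(","):
        # Split on the last ':' so the port is always the final field.
        host, _, port = hint.rpartition(":")
        hints.append((host, int(port)))
    return ParsedFurl(tubid, hints, swissnum)


# Placeholder identifiers, not real credentials.
sample = "pb://" + "a" * 32 + "@10.0.0.8:12345,192.168.0.1:12345/" + "b" * 32
parsed = parse_furl(sample)
print(len(sample), "bytes,", len(parsed.hints), "location hints")
```

Running this against the contents of control.furl shows at a glance whether the node advertised more addresses than expected.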
The node will try to figure out a good set of connection hints to put in this string at startup. The code that does this might conceivably put multiple addresses in there if it thinks that you have multiple externally-visible network interfaces. (also note that different variants of this code are run on different operating systems, so the behavior I get out of linux might not be the same as someone running solaris, etc).
The code inside Foolscap that handles FURLs passed this way (i.e. as serialized Referenceables) has a built-in limit of 200 bytes on the length of the FURL string. This is enough for 129 bytes of location hints, or 5 maximal-length hint strings (i.e. "255.255.255.255:12345,"). If the auto-detect-local-IP-addresses code decided to put 6 or more hints in that string, you'd exceed the 200-byte limit.
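The 129-byte figure can be checked with a little arithmetic. The sketch below assumes that TUBID and SWISSNUM are each 32 base32 characters; that length is an assumption chosen because it reproduces the 129 bytes quoted above, and the constant names are illustrative only.

```python
FURL_LIMIT = 200     # bytes; the limit named in the Violation message
BASE32_ID_LEN = 32   # assumed length of both TUBID and SWISSNUM

# Fixed overhead: "pb://" + TUBID + "@" + "/" + SWISSNUM
overhead = len("pb://") + BASE32_ID_LEN + len("@") + len("/") + BASE32_ID_LEN
hint_budget = FURL_LIMIT - overhead

max_hint = "255.255.255.255:12345,"   # longest dotted-quad hint plus separator
hints_that_fit = hint_budget // len(max_hint)

# Six maximal hints need 6 * 22 - 1 bytes (the last one has no trailing comma).
six_hints = 6 * len(max_hint) - 1

print(overhead, hint_budget, hints_that_fit, six_hints)
```

Under the same assumption, the 252-byte FURL in the reported error would carry 252 - 71 = 181 bytes of location hints, well over the budget.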
The workaround for this would be to manually set the advertised location hint in your tahoe.cfg file. Specifically, choose one or two address+port locations where the client can be reached, then in the [node] section, store the hints in the tub.location field:
[node]
tub.location = 10.0.0.8:12345,255.255.255.255:12345
Once you do that, the client will publish FURLs with the tub.location hints instead of trying to figure them out on its own, which should hopefully avoid the length limitation.
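To double-check which hints a hand-edited tahoe.cfg would advertise, the file can be read with Python's standard `configparser`. This is a sketch for inspection only: it is not how Tahoe itself loads its configuration, `read_tub_location` is a hypothetical helper, and the sample text simply repeats the example above.

```python
import configparser

SAMPLE_CFG = """\
[node]
tub.location = 10.0.0.8:12345,255.255.255.255:12345
"""


def read_tub_location(cfg_text):
    """Return the list of HOST:PORT hints from [node] tub.location."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # An empty fallback covers a missing section or a missing option.
    raw = parser.get("node", "tub.location", fallback="")
    return [hint.strip() for hint in raw.split(",") if hint.strip()]


hints = read_tub_location(SAMPLE_CFG)
print(hints)
```

An empty list means tub.location is unset, in which case the node falls back to detecting addresses on its own.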
(note that, for the purposes of the Introducer-to-subscriber announcement messages, the hints are not strictly necessary: those announcements will be sent back over the same TCP connection that the client used to subscribe in the first place. But if you want your storage server to be reachable by other clients, you'll need to provide a working location here).
Please let us know if this works, and what was causing the hint string to be too long (or if I'm wrong and it's some other problem altogether).
comment:3 Changed at 2009-05-04T13:09:51Z by adigeo
- Resolution set to fixed
- Status changed from assigned to closed
You are right, the machines have 4+ interfaces (public/private networks + bonding on each), and by setting the tub.location manually to one address all nodes were able to start correctly this time.
Thanks for the good hint, it works now.
comment:4 Changed at 2009-05-04T18:09:04Z by zooko
- Summary changed from STRING token too large: 252>200 to raise size limit on furls
Shouldn't we increase the size limit on the hints? For that matter, why does foolscap enforce a size limit on the hints? If someone somehow gets a giant million-character string into the foolscap hint, would it fail in a way that they would be unable to diagnose?
comment:5 Changed at 2009-05-04T18:09:10Z by zooko
- Resolution fixed deleted
- Status changed from closed to reopened
comment:6 Changed at 2009-05-04T18:09:22Z by zooko
- Status changed from reopened to new
comment:7 Changed at 2009-05-31T14:46:50Z by zooko
Argh -- this is (at least one of) the problems that has kept David Abrahams from being able to use Tahoe for several days now:
http://allmydata.org/pipermail/tahoe-dev/2009-May/001884.html
Here is a patch against foolscap trunk to raise the size limit on furls from 200 to 2,000,000 bytes. I still think it would be better to eliminate the size-check entirely (because "there is no limit" is easier to remember and reason about than any other setting), but at least please apply this patch as soon as possible.
diff -r 776c880c14da foolscap/referenceable.py
--- a/foolscap/referenceable.py Fri May 22 17:17:45 2009 -0700
+++ b/foolscap/referenceable.py Sun May 31 08:50:00 2009 -0600
@@ -209,7 +209,7 @@
     interfaceName = None
     url = None
     inameConstraint = ByteStringConstraint(200) # TODO: only known RI names?
-    urlConstraint = ByteStringConstraint(200)
+    urlConstraint = ByteStringConstraint(2000000)
 
     def checkToken(self, typebyte, size):
         if self.state == 0:
@@ -679,7 +679,7 @@
     state = 0
     giftID = None
     url = None
-    urlConstraint = ByteStringConstraint(200)
+    urlConstraint = ByteStringConstraint(2000000)
 
     def checkToken(self, typebyte, size):
         if self.state == 0:
comment:8 Changed at 2009-05-31T15:04:23Z by zooko
One more comment about the question of "Are size limits worth it?". There are three strategies you could take for different fields:
a) Having no limit is easier for everyone to remember and work with, than having a limit of X, for any value of X. (That's what I said above.)
b) For some fields, it is possible to choose a limit X such that no honest, non-buggy code will ever exceed X, but the imposition of a limit prevents malicious or buggy code from soaking up too much RAM.
c) For some fields, it is possible to choose a limit X such that only very rarely will honest, non-buggy code exceed X, but the limit prevents malicious code from soaking up too much RAM.
I am +1 on (a), -2 on (c) (strongly against), and I'm -0 on (b) because I don't think the resulting DoS-resistance is very valuable, and deciding whether a given field is of type (b) or type (c) has its own cost.
comment:9 Changed at 2009-05-31T15:45:42Z by bewst
- Priority changed from major to critical
I agree with zooko. This problem cost me the better part of a day. But if you can't bring yourself to lift the length limits, tahoe should produce an error message that tells a user how to work around the problem!
Incidentally, I'm quite unsure that my workaround is appropriate. I don't know what this tub location is supposed to mean, but I chose 127.0.0.1:3456 because I expect this laptop to be assigned all kinds of different addresses by DHCP.
comment:10 Changed at 2009-05-31T18:12:54Z by zooko
I should add that I really sympathize with Brian's desire for DoS-resistance in foolscap. Foolscap FURLs are nice fine-grained capabilities -- you can give someone a FURL and thus give them the ability to invoke this or that method of this one object without also giving them any other abilities to affect your system. It would be nice if every FURL didn't come with an implicit "... and the ability to drag your system to a halt (Windows) or cause arbitrary processes to be killed (Linux)" etc.
I'm just not sure that it is practical. Certainly I think Brian has erred by trying to make the limit close to the actual "probable max". If he just goes through and multiplies every limit in the foolscap codebase by a factor of 100 then it would probably solve almost all of our problems. (The cost of that is that the malicious client can use up 100 times as much RAM if it maxes out every one of the fields it sends.)
By the way, if you wanted to run network servers in a high-assurance environment, you might want to configure the operating system so that the process that is receiving requests from external sources is the one that gets killed by the OOM killer. With modern Linux you can tell the operating system "These processes here are the ones that talk to foreigners, so if we run out of RAM and have to kill something, kill one of these.".
That's not nearly as fine-grained as the foolscap approach (for example this lets any one of the remote clients of that process make the whole process stop working for all of the other remote clients), but it is the sort of kludge that people might make do with if the foolscap anti-DoS feature turns out to be more trouble than it is worth.
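As a rough illustration of the OOM-killer idea: current Linux kernels expose a per-process `/proc/<pid>/oom_score_adj` file (range -1000 to 1000), and a higher value makes the process a more likely victim. That this is the mechanism meant above is an assumption, and `prefer_for_oom_kill` is a hypothetical helper, not something Tahoe or Foolscap provides.

```python
def prefer_for_oom_kill(score=1000, path="/proc/self/oom_score_adj"):
    """Ask the kernel to pick this process first when memory runs out.

    The path is a parameter so the function can be exercised against an
    ordinary file on systems without this /proc entry.
    """
    if not -1000 <= score <= 1000:
        raise ValueError("oom_score_adj must be between -1000 and 1000")
    with open(path, "w") as f:
        f.write(str(score))
```

A network-facing server would call `prefer_for_oom_kill()` once at startup, before accepting any connections.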
comment:11 Changed at 2009-05-31T22:49:25Z by warner
- Status changed from new to assigned
I've (slowly, reluctantly) come to agree. The Foolscap DoS-resistance approach is impractical. I've created foolscap#127 to remove it. The first step will be to remove the FURL-gift-sizelimit that's causing this problem, something I expect to put into foolscap-0.4.2.
bewst: I don't see any details about your workaround in this ticket (perhaps it is in a different one? or in email?), but I can tell you that tub.location is meant to be your externally-visible host+port, to which other tahoe nodes can connect (when they want to use your node as a storage server). If you don't set tub.location, Tahoe will try to enumerate all of your network interfaces and put all of their addresses in the location, which is probably why you ran into a FURL-length limitation.
If your laptop is changing addresses all the time, and you just want to discourage other hosts from connecting to you at all (maybe you aren't even running a storage server), then set the location to an address that won't resolve like "unreachable.example.org:0" or something. Other clients will still try and connect to that "location", but they'll get an error instead of creating and discarding a bunch of loopback connections. Hm, it might be worthwhile to establish a clear syntax for this (like setting location to "none" or to an empty string or something).
source:docs/configuration.txt has an explanation about what tub.location does and how it's meant to be used.
comment:12 Changed at 2009-06-02T19:07:31Z by zooko
Brian:
If you're not going to rush out another foolscap release in time for Tahoe v1.5 (with a higher or removed limit on furl size), then how about we add some explicit detection in Tahoe: if its generated furl is larger than 200 bytes, we give a nice failure message at startup, or in twistd.log, or however we can communicate to the user. How to communicate to the user about such things is another open issue. Change the Welcome page to say WARNING BIG FURL? All of the above?
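A startup check along those lines could be as small as the following sketch. The function name, logger name, and message wording are all hypothetical; nothing here is taken from Tahoe's code, and where the warning should surface (twistd.log, the Welcome page) is left open, as in the comment above.

```python
import logging

LEGACY_FURL_LIMIT = 200  # bytes accepted by foolscap releases before 0.4.2


def check_furl_length(furl, limit=LEGACY_FURL_LIMIT, log=None):
    """Return True if the FURL fits; otherwise log a warning and return False."""
    log = log or logging.getLogger("furlcheck")
    size = len(furl.encode("ascii"))
    if size <= limit:
        return True
    log.warning(
        "FURL is %d bytes, over the %d-byte limit of older foolscap; "
        "set tub.location in the [node] section of tahoe.cfg to one or two "
        "HOST:PORT hints, or upgrade foolscap.",
        size, limit,
    )
    return False
```

Returning a boolean rather than raising lets the caller decide whether an oversized FURL should stop the node or only produce a warning.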
comment:13 Changed at 2009-06-02T19:50:40Z by warner
I'll kick out a foolscap-0.4.2 this week, with at least the 200-byte limit removed.
comment:14 Changed at 2009-06-10T17:37:47Z by zooko
- Milestone changed from 1.6.0 to 1.5.0
Okay, I guess once Brian releases foolscap-0.4.2 then Tahoe should require foolscap >= 0.4.2. We could also leave the requirement as-is (foolscap >= 0.4.1) since for most people (who don't have too many interfaces) that older version of foolscap already works fine.
comment:15 Changed at 2009-06-19T18:44:02Z by warner
foolscap-0.4.2 is released
comment:16 Changed at 2009-06-24T04:18:33Z by warner
- Priority changed from critical to major
Reducing severity because the latest foolscap fixes this.
comment:17 Changed at 2009-06-30T11:54:14Z by zooko
- Resolution set to fixed
- Status changed from assigned to closed
I'm going to mark this as 'fixed'. If you are using foolscap >= 0.4.2 then this issue won't affect you, but we don't want to raise the requirement on the version of foolscap because foolscap v0.4.1 is fine unless you have too many IP addresses on your system. (Hopefully in the future we'll have multi-versioned dependencies, e.g. #530 (use setuptools's --multi-version mode), so that we can specify that Tahoe-LAFS requires a newer foolscap without requiring the user to upgrade or uninstall older foolscaps which might be in use by other programs on her system).