#237 closed defect (fixed)
webish directory page can't display more than 192 children
Reported by: | warner | Owned by: | warner |
---|---|---|---|
Priority: | major | Milestone: | 0.8.0 (Allmydata 3.0 Beta) |
Component: | code-frontend-web | Version: | 0.7.0 |
Keywords: | twisted deferred | Cc: | |
Launchpad Bug: |
Description (last modified by warner)
While doing some load testing (which involves writing lots and lots of files into a few directories), I discovered an exception when using the webish interface to list one of these large directories. The exception came out of nevow as it attempted to render the 193rd child of the directory:
```
2007-12-18 23:48:36.611Z [-] Unhandled Error
Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 191, in addCallback
    callbackKeywords=kw)
  File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 182, in addCallbacks
    self._runCallbacks()
  File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 317, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/twist.py", line 41, in cb
    _drive(iterable, finished)
--- <exception caught here> ---
  File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/twist.py", line 23, in _drive
    next = iterable.next()
  File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/ten.py", line 83, in iterflatten
    for item in gen:
  File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/flatstan.py", line 103, in TagSerializer
    yield serialize(toBeRenderedBy, context)
  File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/ten.py", line 70, in serialize
    return partialflatten(context, obj)
  File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/ten.py", line 61, in partialflatten
    return flattener(obj, context)
  File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/flatstan.py", line 262, in DirectiveSerializer
    return serialize(renderer, context)
  File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/ten.py", line 70, in serialize
    return partialflatten(context, obj)
  File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/ten.py", line 61, in partialflatten
    return flattener(obj, context)
  File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/flatstan.py", line 245, in MethodSerializer
    return FunctionSerializer(original, context, nocontext)
  File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/flatstan.py", line 234, in FunctionSerializer
    result = original(context, data)
  File "/home/amduser/tahoe-loadnet/source/src/allmydata/webish.py", line 227, in render_row
    ctx.fillSlots("size", target.get_size())
  File "/home/amduser/tahoe-loadnet/source/src/allmydata/filenode.py", line 25, in get_size
    return IFileURI(self.uri).get_size()
  File "/usr/lib/python2.5/site-packages/twisted/python/components.py", line 113, in _hook
    return factory(ob)
  File "/home/amduser/tahoe-loadnet/source/src/allmydata/uri.py", line 428, in from_string_filenode
    u = from_string(s)
  File "/home/amduser/tahoe-loadnet/source/src/allmydata/uri.py", line 391, in from_string
    return CHKFileURI().init_from_string(s)
  File "/home/amduser/tahoe-loadnet/source/src/allmydata/uri.py", line 52, in init_from_string
    self.key = idlib.a2b(key_s)
  File "/home/amduser/tahoe-loadnet/source/src/allmydata/util/idlib.py", line 190, in a2b
    precondition(could_be_base32_encoded(cs), "cs is required to be possibly base32 encoded data.", cs=cs)
exceptions.RuntimeError: maximum recursion depth exceeded
```
I'm not sure if there is actual recursion taking place... it may just be a function that's returning defer.succeed(result), which surprises some other piece of code that doesn't expect the Deferred to fire so quickly.
Change History (11)
comment:1 Changed at 2007-12-19T00:03:44Z by warner
- Description modified (diff)
comment:2 Changed at 2007-12-19T06:53:35Z by warner
grr. I can reproduce this reliably with a live node (the failure sometimes has a nevow traceback on the web page, sometimes it hits the stack limit while rendering the failure and you get the "additionally an error occurred rendering the error page" message). But I can't get my test case to fail (test_web.WebTest.test_GET_DIRURL_large) at all.
The Deferred fired by webish's data_children() is .paused=1 in the live node, because somewhere in the upstream chain a callback returned another Deferred that hasn't fired yet. I can simulate this in test_web by modifying test.common.FakeMutableFileNode.download_to_data() to add a callback that returns fireEventually(data), and that makes .paused=1 like in the live node, but it doesn't trigger the failure.
On the "I'm desperate so try to fix it even without a failing test case" front, adding fireEventually() to data_children() doesn't help. I'm pretty sure the problem is that Nevow's flattening code (iterflatten?) sees a Deferred that fires with a big list of stuff, and when the top deferred fires, it does an addCallback for every item inside it. Each addCallback fires immediately, which reenters the loop, and that's what blows the stack.
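The suspected mechanism can be modeled without Nevow or Twisted at all. The toy drain loop below is hypothetical (it is not Nevow's actual flattener code): it re-enters itself once for every item that is already ready, so enough ready items exhaust the stack in exactly the way described above:

```python
def drain_recursively(items, handle, depth=0):
    # Toy model (NOT Nevow's actual code): every already-ready item is
    # handled by re-entering the drain recursively, so each item costs
    # at least one stack frame instead of one pass of an iterative loop.
    if not items:
        return depth
    handle(items[0])
    return drain_recursively(items[1:], handle, depth + 1)

print(drain_recursively(list(range(50)), lambda item: None))  # 50 items: fine
try:
    drain_recursively(list(range(10000)), lambda item: None)
except RuntimeError:  # RecursionError (a RuntimeError subclass) on Python 3
    print("maximum recursion depth exceeded, just like the 193rd child")
```

The fix sketched below works by breaking exactly this kind of re-entrant loop into separate reactor turns.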
I *was* able to make the live node stop failing with the following ugly hack in data_children():
```python
def data_children(self, ctx, data):
    d = self._dirnode.list()
    from foolscap.eventual import fireEventually
    d.addCallback(lambda dict: sorted(dict.items()))
    def _stall(items):
        return [fireEventually(i) for i in items]
    d.addCallback(_stall)
    return d
```
I.e. force iterflatten to slow down by firing each individual item in a separate turn. Inefficient!
Ok, I have to run, so I'm not going to get this one fixed before I get back from vacation. The fireEventually hack will work if we're desperate, but I really don't want to add it, and also I don't want to add it without being able to produce a failing test case. (the next step for a test case is to do it in test_system() where we use real dirnodes, but that's slower and results in failures that are harder to track down than a focussed test like test_web).
comment:3 Changed at 2007-12-31T21:39:58Z by warner
- Milestone changed from undecided to 0.7.1
- Owner set to warner
I think this is important enough to bring forward into the 0.7.1 release.
comment:4 Changed at 2008-01-09T23:40:34Z by warner
robk pointed out that the test might be passing (when I'd expect it to fail) because the client/consumer is in the same process as the server/producer. Basically the producer flattens a number of entries and writes it to the HTTP socket, but then it has to pause until the socket is drained, and since the reader of that socket is in the same process, it gets a chance to retire more deferreds before it is run again.
If we look at Nevow and don't see any evidence of Producer/Consumer code, that blows this theory out of the water. If this *is* the reason the test is not failing, then a better way to trigger the bug might be to ask nevow to render the directory page to a string, instead of hitting it over HTTP. That might cause it to do the flattening all in a single turn, which ought to trigger the problem. I seem to recall that Nevow had some way of doing this (so that it can be used in library-mode, driven by something other than twisted.web), but I don't remember exactly how to do it right now.
comment:5 Changed at 2008-01-23T02:40:34Z by secorp
- Milestone changed from 0.7.1 to undecided
comment:6 Changed at 2008-02-13T02:52:16Z by warner
I tried using renderString(), but that didn't trigger the problem either:
```python
def _check(dirnode):
    #large_url = "/uri/" + dirnode.get_uri() + "/"
    #return self.GET(large_url)
    # we play some games here to trigger nevow's flattener
    r = webish.Directory("rootname", dirnode, ("dirpath",))
    import nevow.context
    ctx = nevow.context.WovenContext()
    from nevow import inevow
    class MyRequest:
        pass
    req = MyRequest()
    req.prePathURL = lambda: ""
    req.uri = ""
    req.args = {}
    req.fields = {}
    ctx.remember(req, inevow.IRequest)
    ctx.remember(self.s, webish.IClient)
    return r.renderString(ctx)
d.addCallback(_check)
```
comment:7 Changed at 2008-02-13T03:28:48Z by warner
- Milestone changed from undecided to 0.8.0 (Allmydata 3.0 Beta)
- Resolution set to fixed
- Status changed from new to closed
I just thought of a less-inefficient form of that fireEventually() workaround: only insert the turn break once every 100 items, instead of on every item. The following code seems to work:
```python
def _stall_some(items):
    output = []
    for i, item in enumerate(items):
        if i % 100 == 0:
            output.append(fireEventually(item))
        else:
            output.append(item)
    return output
d.addCallback(_stall_some)
return d
```
I still haven't been able to come up with a failing unit test for this, but I'm going to commit this fix and close the ticket anyways.
Fixed in 124b2160b930c180
comment:8 Changed at 2008-11-15T02:34:47Z by warner
Some more notes on this problem, so I can find them again later:
Problem One: long list of callbacks, all of them are ready
Each Deferred (we'll call the first one Deferred A) has a list of callback functions. Each time you do d.addCallback(), this list grows by one element.
When Deferred A fires, the list is executed in a 'while' loop, in Deferred._runCallbacks. If the callbacks all return either a normal value or a Failure, then the list is completely consumed during the one call to _runCallbacks, and everything is fine.
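That happy path can be sketched in plain Python (a simplified model, not Twisted's actual `_runCallbacks`): because the loop is iterative, even a very long chain of plain-value callbacks uses constant stack depth.

```python
def run_callbacks(result, callbacks):
    # Simplified model of Deferred._runCallbacks' happy path: every
    # callback returns a plain value, so the whole list is consumed by
    # one iterative loop and the stack never grows.
    for cb in callbacks:
        result = cb(result)
    return result

print(run_callbacks(0, [lambda r: r + 1] * 5000))  # 5000, no recursion error
```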
However, when a callback returns another Deferred B (chaining), the first Deferred A must wait for the second to finish. The code that does this looks like:
```python
if isinstance(self.result, Deferred):
    self.pause()
    self.result.addBoth(self._continue)
    break
```
The second Deferred B might have already been fired by this point, either because it was born ready (created with defer.succeed, or defer.maybeDeferred), or because whatever was being waited upon has already occurred.
If this occurs, the subsequent callback in Deferred A's chain will fire (with B's result), but it will fire through a 6-frame recursive loop instead of firing on the next pass of the 'while' loop. As a result, each such ready-to-fire Deferred will add 6 stack frames. 166 such loops are enough to put more than 1000 frames on the stack, which will exceed Python's default sys.getrecursionlimit().
The 6-frame cycle is:
```
  File "twisted/internet/defer.py", line 214, in addBoth
    callbackKeywords=kw, errbackKeywords=kw)
  File "twisted/internet/defer.py", line 186, in addCallbacks
    self._runCallbacks()
  File "twisted/internet/defer.py", line 328, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "twisted/internet/defer.py", line 289, in _continue
    self.unpause()
  File "twisted/internet/defer.py", line 285, in unpause
    self._runCallbacks()
  File "twisted/internet/defer.py", line 341, in _runCallbacks
    self.result.addBoth(self._continue)
```
The following sample code will cause this situation:
```python
import traceback
from twisted.internet import defer

def fire(res, which):
    print "FIRE", which, "stack:", len(traceback.extract_stack())
    if which == 2:
        traceback.print_stack()
    return defer.succeed(None)

d = defer.Deferred()
for i in range(5):  # set this to 170 to see failure
    d.addCallback(fire, i)
d.callback("go")
```
The exception that this provokes is caught by the Deferred's Failure mechanisms, but then Twisted has an internal failure while trying to capture it. The actual Unhandled error in Deferred that gets put into the logs is:
```
Unhandled error in Deferred:
Traceback (most recent call last):
  File "twisted/internet/defer.py", line 285, in unpause
    self._runCallbacks()
  File "twisted/internet/defer.py", line 341, in _runCallbacks
    self.result.addBoth(self._continue)
  File "twisted/internet/defer.py", line 214, in addBoth
    callbackKeywords=kw, errbackKeywords=kw)
  File "twisted/internet/defer.py", line 186, in addCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "twisted/internet/defer.py", line 328, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "twisted/internet/defer.py", line 289, in _continue
    self.unpause()
  File "twisted/internet/defer.py", line 285, in unpause
    self._runCallbacks()
  File "twisted/internet/defer.py", line 344, in _runCallbacks
    self.result = failure.Failure()
  File "twisted/python/failure.py", line 265, in __init__
    parentCs = reflect.allYourBase(self.type)
  File "twisted/python/reflect.py", line 542, in allYourBase
    accumulateBases(classObj, l, baseClass)
  File "twisted/python/reflect.py", line 550, in accumulateBases
    accumulateBases(base, l, baseClass)
  File "twisted/python/reflect.py", line 550, in accumulateBases
    accumulateBases(base, l, baseClass)
  File "twisted/python/reflect.py", line 550, in accumulateBases
    accumulateBases(base, l, baseClass)
  File "twisted/python/reflect.py", line 550, in accumulateBases
    accumulateBases(base, l, baseClass)
exceptions.RuntimeError: maximum recursion depth exceeded
```
Problem Two: deep chain of callbacks, e.g. recursive delayed polling
The other kind of recursion-limit-violation failure that occurs with Deferreds involves a long chain that finally fires. The most common way to generate such a chain is with a recursive method that separates each call with a Deferred, such as a polling function that returns a Deferred:
```python
def wait_until_done(self, ignored=None):
    if self.done:
        return True
    else:
        d = Deferred()
        reactor.callLater(1.0, d.callback, None)
        d.addCallback(self.wait_until_done)
        return d
```
If this function must poll more than 331 times, the reactor tick which notices the expired timer and fires d.callback will see a recursion-depth-exceeded exception. The last Deferred fires, which triggers the _continue callback on the next-to-last Deferred, which allows it to fire, which triggers the [-2] Deferred, etc. This recursive cycle is of length 3 and has the following frames:
```
  File "twisted/internet/defer.py", line 328, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "twisted/internet/defer.py", line 289, in _continue
    self.unpause()
  File "twisted/internet/defer.py", line 285, in unpause
    self._runCallbacks()
```
This one is trickier to find, because the root of the stack will be some internal reactor call rather than application code. In particular, the bottom of the stack will look like:
```
  File "/tmp/t.py", line 26, in <module>
    reactor.run()
  File "twisted/internet/base.py", line 1048, in run
    self.mainLoop()
  File "twisted/internet/base.py", line 1057, in mainLoop
    self.runUntilCurrent()
  File "twisted/internet/base.py", line 705, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "twisted/internet/defer.py", line 243, in callback
    self._startRunCallbacks(result)
  File "twisted/internet/defer.py", line 312, in _startRunCallbacks
    self._runCallbacks()
  File "twisted/internet/defer.py", line 328, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "twisted/internet/defer.py", line 289, in _continue
    self.unpause()
  File "twisted/internet/defer.py", line 285, in unpause
    self._runCallbacks()
```
The other tricky thing about this failure is that the application code is sitting on the end of the stack: any callback that is attached to the Deferred that wait_until_done returns will run in a low-stack environment. As a result, recursion-depth-exceeded exceptions will be triggered by seemingly innocent application code. Note how the "DONE" number changes as you modify the self.count comparison value in this example:
```python
#! /usr/bin/python
import traceback
from twisted.internet import reactor
from twisted.internet.defer import Deferred

class Poller:
    count = 0
    def wait_until_done(self, ignored=None):
        self.count += 1
        if self.count > 301:  # 331 works, 332 fails.
            return True
        else:
            d = Deferred()
            reactor.callLater(0.0, d.callback, None)
            d.addCallback(self.wait_until_done)
            return d

p = Poller()
def done(res):
    #traceback.print_stack()
    print "DONE", len(traceback.extract_stack())
d = p.wait_until_done()
d.addCallback(done)
reactor.run()
```
When this fails, the traceback that shows up in the logs looks like:
```
Unhandled error in Deferred:
Traceback (most recent call last):
  File "twisted/internet/defer.py", line 285, in unpause
    self._runCallbacks()
  File "twisted/internet/defer.py", line 328, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "twisted/internet/defer.py", line 289, in _continue
    self.unpause()
  File "twisted/internet/defer.py", line 285, in unpause
    self._runCallbacks()
--- <exception caught here> ---
  File "twisted/internet/defer.py", line 328, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "twisted/internet/defer.py", line 289, in _continue
    self.unpause()
  File "twisted/internet/defer.py", line 285, in unpause
    self._runCallbacks()
  File "twisted/internet/defer.py", line 344, in _runCallbacks
    self.result = failure.Failure()
  File "twisted/python/failure.py", line 265, in __init__
    parentCs = reflect.allYourBase(self.type)
  File "twisted/python/reflect.py", line 542, in allYourBase
    accumulateBases(classObj, l, baseClass)
  File "twisted/python/reflect.py", line 550, in accumulateBases
    accumulateBases(base, l, baseClass)
  File "twisted/python/reflect.py", line 550, in accumulateBases
    accumulateBases(base, l, baseClass)
  File "twisted/python/reflect.py", line 550, in accumulateBases
    accumulateBases(base, l, baseClass)
  File "twisted/python/reflect.py", line 550, in accumulateBases
    accumulateBases(base, l, baseClass)
exceptions.RuntimeError: maximum recursion depth exceeded
```
Combinations
Note that these two problems can interact. Each ready-to-fire callback attached to a single Deferred uses 6 stack frames, and each chained callback uses 3 stack frames. If X*6+Y*3 > 1000, the code will fail.
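As a rough budget check using the frame counts measured above (and ignoring the frames already consumed by the surrounding code, so real headroom is somewhat smaller), something like:

```python
FRAME_BUDGET = 1000  # Python's default sys.getrecursionlimit()
READY_COST = 6       # frames per ready-to-fire callback (Problem One)
CHAIN_COST = 3       # frames per chained-Deferred hop (Problem Two)

def fits_in_stack(ready, chained):
    # True if X ready callbacks plus Y chained hops stay under the limit,
    # per the X*6 + Y*3 > 1000 failure condition above.
    return ready * READY_COST + chained * CHAIN_COST <= FRAME_BUDGET

print(fits_in_stack(166, 0))    # True:  996 frames
print(fits_in_stack(167, 0))    # False: 1002 frames
print(fits_in_stack(100, 100))  # True:  900 frames
```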
Solutions
For problem one, the requirement is that Deferreds never wind up with more than 166 callbacks that are ready to fire. In other words, there must be at least one not-ready-to-fire Deferred in each span of 166 callbacks.
One way to accomplish this is to have every 100th call return foolscap.eventual.fireEventually(result) instead of defer.succeed(result). Having every call do this works too; it just slows things down a bit.
```python
from twisted.internet import defer
from foolscap.eventual import fireEventually  # defer itself has no fireEventually

def fire(res, which):
    return fireEventually(None)

d = defer.Deferred()
for i in range(170):
    d.addCallback(fire, i)
d.callback("go")
```
For problem two, the requirement is that the depth of the tail-recursion chain not exceed 331 cycles, minus some room for the code you're eventually going to attach to the end. One way to accomplish this is to have every 300th call (or every single call, if you are willing to accept the slowdown) add an additional fireEventually to slow things down.
```python
def wait_until_done(self, ignored=None):
    self.count += 1
    if self.count > 301:  # 331 works, 332 fails.
        return True
    else:
        d = Deferred()
        reactor.callLater(0.0, d.callback, None)
        d.addCallback(self.wait_until_done)
        d.addCallback(lambda res: fireEventually(res))
        return d
```
comment:9 Changed at 2010-11-30T05:04:45Z by zooko
Twisted 10.2.0 has the fix for the stack overflow in chained deferreds: http://twistedmatrix.com/trac/ticket/1833
I suppose this means that some year, when everyone has upgraded to Twisted >= v10.2.0 (which I estimate will be in about 2015), we can simplify the Tahoe-LAFS code by depending on that behavior.
comment:10 Changed at 2010-11-30T05:09:40Z by davidsarah
- Keywords twisted deferred added
Keywords so I can find this ticket again. (Yes, I know it is fixed, but comment:8 is a really good description of the problem.)
comment:11 Changed at 2010-11-30T16:24:36Z by warner
hm, I think you meant http://twistedmatrix.com/trac/ticket/411
hrm, the exception got mangled during pasting...