Opened at 2007-08-10T02:29:40Z
Last modified at 2012-08-11T01:49:15Z
#97 assigned defect
reducing memory footprint in share reception
Reported by: | warner | Owned by: | warner |
---|---|---|---|
Priority: | major | Milestone: | undecided |
Component: | code | Version: | 0.4.0 |
Keywords: | performance memory upload large | Cc: | |
Launchpad Bug: |
Description
One source of large memory footprint is the known issue with twisted.web's POST handling, addressed in #29. This ticket is not about that.
The memory footprint as reported by 'make check-memory' for uploading large files directly from disk is currently about 29MB. This seems acceptable for now. This test does not allow the uploading node to hold its own shares.
The "upload-self" series of tests does allow the node to hold its own shares, and this increases the memory footprint considerably (it seems to stabilize at 59MB). I have two theories on this, both of which center on Foolscap's behavior:
- Foolscap might not be freeing the arguments of inbound method calls quickly enough: perhaps the arguments are involved in a reference cycle of some sort. This could cause the shares to be kept around for a while, gradually accumulating memory footprint.
- Foolscap's loopback broker (which is used when you ask the Tub to connect to itself) might have some similar misbehavior, causing the tokens that normally get sent over the wire to be retained longer than expected.
The first theory should be exercised by writing a variant of the check-memory test which, instead of asking the isolated process-under-test to send files, it should instead have it receive shares, and watch its memory usage as it receives large numbers of shares of various sizes. If the first theory is correct, then I'd expect to see its memory usage be high when given small numbers (10s) of large (1MB) shares, since gc isn't going to kick in until we've seen hundreds of objects get allocated and freed. Large numbers (1000s) of small (10-100B) shares ought to trigger gc before the overhead gets too high.
If that theory doesn't hold up, then work on the second theory should start by replacing the code in broker.LoopbackTransport?.write with a queue.append. The theory is that the naieve eventually(self.peer.dataReceived, data) puts dozens of calls (plus the shares themselves) on the eventual-send queue, and then foolscap.eventual._SimpleCallQueue._turn isn't freeing the queue until all events have been processed. The second change to try out is to modify that _turn method to pop events off the queue as they are processed, to miminize the lifetime of those shares.
I also have some suspicions about the way that nested functions work: do they capture the state of all local variables, or just the ones that are referenced in their body? Is there some introspection trick (like locals() or dir() or something) that would obligate Python to keep all local variables alive? For this, we need to be aggressive about using del to get rid of shares/blocks the moment they are no longer necessary.
Also, Deferreds do not do tail-recursion, so any chains that involve things like defer.succeed will cause d.addCallback to run synchronously, and can result in very long call stacks (hundreds of frames long). This means that all the variables that were expected to go out of scope might stick around for a surprisingly long time, and when their lifetimes overlap, that increases the memory footprint. Having methods always return a Deferred (and using defer.succeed when the method is in fact synchronous) is an important pattern, because it improves flexibility for later: "Premature synchronization is the root of all evil (in an event-driven world)". To balance these, I think that foolscap.eventual.fireEventually can be used as a "reactor turn barrier", to ensure that the call chain is cut short. Alternatively, maybe we should investigate Deferred and see how it might be made to implement tail recursion properly.
Another direction that might be promising would be to implement custom foolscap Slicers and Unslicers for blocks. I think we'd have to write the shares to disk during encoding, but if we did that then a Slicer could read just a small amount of data from disk each time the transport buffer opened up (using the producer/consumer model). This could effectively push the memory footprint off to disk.
I think that in the long run, we'll be able to closely model the amount of memory footprint that each concurrent upload will represent (probably as a linear function of min(filesize, max_segment_size)), and then we can build an upload scheduler that sequences the upload work to keep our memory footprint below some configured ceiling. i.e. you can do many small uploads at once, but only a certain number of bigger ones, but also once the file gets larger than 1MB then it doesn't matter how big it is because we're only processing a single segment at a time.
Change History (13)
comment:1 Changed at 2007-08-11T01:43:55Z by warner
comment:2 Changed at 2007-08-14T18:55:18Z by warner
- Component changed from code to code-performance
- Owner somebody deleted
comment:3 Changed at 2007-08-16T17:49:15Z by warner
- Summary changed from reducing memory footprint to reducing memory footprint in share reception
comment:4 Changed at 2007-08-16T17:56:32Z by warner
One data point: I observed the test framework process for the 100MB check_memory.py 'upload' test to hit 600MB. This framework process holds 4 or 5 nodes and is receiving all of the shares emitted by the program-under-test. It is the program-under-test whose memory footprint is stable at 29MB.
This was bad enough that I had to disable the 100MB test. The buildslave that hosts this test is on a somewhat memory-constrained VM, and that 600MB RSS caused the rest of the machine to thrash and choke.
comment:5 Changed at 2007-09-28T02:45:53Z by warner
- Milestone changed from undecided to 1.0
comment:6 Changed at 2007-09-28T02:45:58Z by warner
- Owner set to warner
- Status changed from new to assigned
comment:7 Changed at 2008-03-21T22:32:12Z by zooko
- Milestone changed from 1.0 to undecided
comment:8 Changed at 2008-06-01T21:02:50Z by warner
- Milestone changed from eventually to undecided
comment:9 Changed at 2009-05-04T17:46:02Z by zooko
comment:10 Changed at 2009-12-04T05:13:55Z by davidsarah
- Component changed from code-performance to code
- Keywords performance upload added
comment:11 Changed at 2010-01-15T20:34:49Z by davidsarah
- Keywords large added
comment:12 Changed at 2011-07-23T02:07:57Z by davidsarah
warner wrote:
I also have some suspicions about the way that nested functions work: do they capture the state of all local variables, or just the ones that are referenced in their body?
You are right to be suspicious: in CPython closures keep a reference to the entire enclosing frame. See Owen S.'s answer at http://stackoverflow.com/questions/3145893/how-are-closures-implemented-in-python .
BTW, this approach not only leads to memory leaks, but also gives the wrong -- or at least, unexpected by most programmers -- semantics for captured mutable variables. And since changing that would be an incompatible change, it's not likely that either the memory leaks or the semantics will be fixed.
comment:13 Changed at 2012-08-11T01:49:15Z by davidsarah
Alternatively, maybe we should investigate Deferred and see how it might be made to implement tail recursion properly.
Trunk now depends on a version of Twisted that solves the Deferred tail recursion problem.
There are some useful notes about simultaneous-peer share pushing and the way that foolscap uses memory in #29, even though the ticket as a whole is focusing on the web-POST problem that was just resolved.