Opened at 2010-01-10T06:16:08Z
Last modified at 2012-05-23T00:14:29Z
#891 new defect
web gateway memory grows without bound under load
Reported by: | zooko | Owned by: | warner |
---|---|---|---|
Priority: | critical | Milestone: | soon |
Component: | code-frontend-web | Version: | 1.5.0 |
Keywords: | reliability scalability memory | Cc: | |
Launchpad Bug: |
Description
I watched as two allmydata.com web gateways slowly grew to multiple GB of RAM while consuming max CPU. I kept watching until their behavior killed my ssh session. Fortunately I had left a flogtool tail running, so we got to capture one gateway's final minutes. It looks to me like a client is able to initiate jobs faster than the web gateway can complete them, and the client kept this up at a steady rate until the web gateway died.
Attachments (2)
Change History (8)
Changed at 2010-01-10T06:18:37Z by zooko
Changed at 2010-01-10T06:26:09Z by zooko
Another "flogtool tail --save-as=dump-2.log" run which *overlaps* with the previous one (named dump.log) but which has different contents...
comment:1 Changed at 2010-01-10T06:28:56Z by zooko
So while I was running flogtool tail --save-as=dump.flog I started a second tail, like this: flogtool tail --save-as=dump-2.flog. Here is the result of that second tail, which confusingly doesn't seem to contain a contiguous subset of the first, although maybe I'm just reading it wrong.
comment:2 Changed at 2010-02-27T09:07:13Z by davidsarah
- Keywords memory added
- Milestone changed from undecided to 1.7.0
comment:3 Changed at 2010-06-16T03:58:49Z by davidsarah
- Milestone changed from 1.7.0 to soon
comment:4 Changed at 2010-06-19T18:16:05Z by warner
Incidentally, the best way to grab logs from a doomed system like this is to get the target node's "logport.furl" (from BASEDIR/private/logport.furl), and then run the flogtool tail command from another computer altogether. That way the flogtool command isn't competing with the doomed process for memory. You might have done it this way; it's not immediately obvious to me.
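A sketch of that remote-tail workflow (the hostname and FURL below are placeholders, not values from this ticket):

```
# On the gateway host: read the node's log port FURL
cat BASEDIR/private/logport.furl

# On a separate machine: tail the remote node's log, so the capture
# process isn't competing with the doomed gateway for memory.
flogtool tail --save-as=dump.flog pb://EXAMPLE@gateway.example.com:PORT/logport
```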
I'll take a look at the logs as soon as I can.
comment:5 Changed at 2010-06-21T20:35:48Z by zooko
No, I ran flogtool tail on the same system. If I recall correctly the system had enough memory available; it was just that the Python process was approaching its 3 GB limit (a per-process VM limit whose origin I forget).
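For reference, a process can inspect its own address-space ceiling with the stdlib resource module (POSIX only); on 32-bit systems this kind of limit is what caps a process near 3 GB. This is just an illustration, not something the gateway does:

```python
import resource

# Query this process's address-space (virtual memory) rlimit.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
if soft == resource.RLIM_INFINITY:
    print("no per-process VM limit set")
else:
    print("soft VM limit: %d bytes" % soft)
```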
comment:6 Changed at 2012-05-23T00:14:29Z by warner
Hm, assuming we can reproduce this after two years, and assuming there's no bug causing pathological memory leaks, what would be the best sort of fix? We could impose an arbitrary limit on the number of parallel operations that the gateway is willing to perform. Or (on some OSes) have it monitor its own memory usage and refuse new operations when the footprint grows above a certain threshold. Both seem a bit unclean, but might be practical.
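The first option (capping parallel operations and refusing the rest) could look roughly like the following admission-control sketch. This is hypothetical illustration code, not Tahoe-LAFS source; the class and method names are made up:

```python
import threading

class OperationGate:
    """Refuse new operations once max_parallel are already in flight,
    instead of queueing them and letting memory grow without bound."""

    def __init__(self, max_parallel):
        self._lock = threading.Lock()
        self._active = 0
        self._max = max_parallel

    def try_acquire(self):
        # Returns False when at capacity; the web frontend would then
        # reject the request (e.g. with 503 Service Unavailable).
        with self._lock:
            if self._active >= self._max:
                return False
            self._active += 1
            return True

    def release(self):
        # Call when an operation completes, freeing a slot.
        with self._lock:
            self._active -= 1

gate = OperationGate(max_parallel=2)
assert gate.try_acquire()
assert gate.try_acquire()
assert not gate.try_acquire()  # third concurrent operation is refused
gate.release()
assert gate.try_acquire()      # a slot frees up once one completes
```

Rejecting up front keeps the per-request memory bounded by max_parallel, at the cost of pushing retry logic onto clients; the memory-threshold variant would swap the counter check for a footprint check.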
"flogtool tail --save-as=dump.flog" of the final minutes of the web gateway's life