Opened at 2007-12-18T16:52:32Z
Last modified at 2020-01-17T23:54:04Z
#235 new enhancement
scale up to many nodes
Reported by: | zooko | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | undecided |
Component: | code-network | Version: | 0.7.0 |
Keywords: | scalability memory performance openssl leastauthority | Cc: | |
Launchpad Bug: |
Description (last modified by exarkun)
I updated The UseCases Page to reflect that someone might want to run a managed Tahoe grid comprising one thousand nodes. (If each node has a single 1 TB hard drive, that's a 1 PB grid. Obviously there are lots of other options, such as each node having six 1 TB hard drives in a RAID-6 configuration, resulting in 4 usable TB per node or a 4 PB grid.)
Anyway, we expect that the current Tahoe grid would have problems handling more simultaneously connected nodes. One known problem is that pyOpenSSL uses almost 1 MB of RAM per SSL connection. (See also #11.)
This ticket can be closed when Tahoe is demonstrated to handle one thousand simultaneously connected nodes smoothly.
Change History (6)
comment:1 Changed at 2008-06-01T21:04:59Z by warner
- Milestone changed from eventually to undecided
comment:2 Changed at 2009-12-13T03:02:47Z by davidsarah
- Keywords memory performance openssl added; scaling removed
comment:3 Changed at 2010-01-03T02:56:00Z by davidsarah
comment:4 Changed at 2010-01-06T07:36:57Z by warner
There are a number of hurdles to scale up to lots of nodes. This ticket is sort of a reminder to enumerate some of them, or record known limitations and potential solutions.
The limitation alluded to in the summary is probably the first hurdle: Foolscap, at least, has been observed (as of the creation date of this ticket, some 2 years ago) to consume an unreasonable approx. 1MB of RAM per open connection. I seem to remember doing some analysis and deciding that pyOpenSSL was to blame, rather than foolscap, but that was a long time ago and the tests should be run again before putting too much energy into it. There's no good reason for it to use this much memory.. the connection state and buffers should really fit into a couple of kilobytes.
The next hurdle will be the current practice of maintaining open connections to all known storage servers. If we left the protocols alone, we could change this to open connections on-demand, but that would incur a significant per-file latency (for both upload and download), and of course things like file-check and mutable-file publish would become really really slow because both want to query lots of servers. So changing the peer-selection protocols would probably be necessary to effectively remove this limitation.
comment:5 Changed at 2010-01-16T00:10:07Z by davidsarah
If you like this ticket, you might like #444 (reduce number of active connections: connect on demand).
comment:6 Changed at 2020-01-17T23:54:04Z by exarkun
- Description modified (diff)
- Keywords leastauthority added
From ticket:872#comment:16 :
Note that this shouldn't be needed in order to scale to 1000 nodes -- the size of the location and public key info for 1000 nodes should easily be small enough to fit into memory. Do we need another ticket for scaling to grids with hundreds of thousands of nodes, or am I being too prematurely ambitious? :-)