Some basic notes on performance:

== Memory Footprint ==

We try to keep the Tahoe memory footprint low by continuously monitoring the
memory consumed by common operations like upload and download.

For each currently active upload or download, we never handle more than a
single segment of data at a time. This serves to keep the data-driven
footprint down to something like 4MB or 5MB per active upload/download.

Some other notes on memory footprint:

 * importing sqlite (for the share-lease database) raised the static
   footprint by about 7MB, going from 24.3MB to 31.5MB (as evidenced by the
   munin graph from 2007-08-29 to 2007-09-02).

 * importing nevow and twisted.web (for the web interface) raises the static
   footprint by about 3MB (from 12.8MB to 15.7MB).

 * importing pycryptopp (which began on 2007-11-09) raises the static footprint
   (on a 32-bit machine) by about 6MB (from 19MB to 25MB). The 64-bit machine
   footprint was raised by 17MB (from 122MB to 139MB).
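
One rough way to reproduce this kind of measurement is to compare the process's peak resident set size before and after an import. The sketch below is an assumption about method, not a description of how the munin graphs were produced. The helper names `peak_rss` and `import_cost` are hypothetical, and `ru_maxrss` is reported in kilobytes on Linux (bytes on macOS).

```python
import importlib
import resource


def peak_rss():
    """Peak resident set size of this process (kilobytes on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def import_cost(module_name):
    """Return the growth in peak RSS caused by importing module_name.

    Hypothetical helper: the result is 0 if the module was already
    imported, or if the import did not raise the peak.
    """
    before = peak_rss()
    importlib.import_module(module_name)
    return peak_rss() - before
```

Calling `import_cost("sqlite3")` in a fresh interpreter gives a figure comparable in spirit to the ones above. Since it tracks the peak rather than the current footprint, treat the result as an upper bound.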

The
[http://allmydata.org/tahoe-figleaf-graph/hanford.allmydata.com-tahoe_memstats.html 32-bit memory usage graph]
shows our static memory footprint on a 32-bit machine (starting a node but not doing
anything with it) to be about 24MB. Uploading one file at a time gets the
node to about 29MB. Because we only process one segment at a time, peak memory
consumption occurs when the file is a few MB in size and does not grow beyond
that. Uploading multiple files at once would increase this.

We also have a
[http://allmydata.org/tahoe-figleaf-graph/hanford.allmydata.com-tahoe_memstats_64.html 64-bit memory usage graph], which currently shows a disturbingly large static footprint.
We've determined that simply importing a few of our support libraries (such
as Twisted) results in most of this expansion, before the node is ever even
started. The cause for this is still being investigated: we can think of plenty
of reasons for it to be 2x, but the results show something closer to 6x.

== Network Speed ==

=== Test Results ===

With a 3-server testnet in colo and an uploading node at home (on a DSL line
that gets about 78kBps upstream and has a 14ms ping time to colo), release
0.5.1-34 takes 820ms-900ms per 1kB file uploaded (80-90s for 100 files, 819s
for 1000 files). The DSL results are occasionally worse than usual when the
owner of the DSL line is using it for other purposes while a test is taking
place.

'scp' of 3.3kB files (simulating expansion) takes 8.3s for 100 files and 79s
for 1000 files, 80ms each.

Doing the same uploads locally on my laptop (both the uploading node and the
storage nodes are local) takes 46s for 100 1kB files and 369s for 1000 files.

Small files seem to be limited by a per-file overhead. Large files are limited
by the link speed.

The munin
[/tahoe-figleaf-graph/hanford.allmydata.com-tahoe_speedstats_delay.html delay graph] and
[/tahoe-figleaf-graph/hanford.allmydata.com-tahoe_speedstats_rate.html rate graph] show these Ax+B numbers for a node in colo and a node behind a DSL line.

The [/tahoe-figleaf-graph/hanford.allmydata.com-tahoe_speedstats_delay_SSK.html mutable-file delay graph] shows the "B" per-file latency number
for mutable (aka "SSK") files. In the 0.7.0 release, this is dominated by the RSA keypair generation necessary to create each new mutable file.

The 
[/tahoe-figleaf-graph/hanford.allmydata.com-tahoe_speedstats_delay_rtt.html delay*RTT graph] shows this per-file delay as a multiple of the average round-trip
time between the client node and the testnet. Much of the work done to upload
a file involves waiting for messages to make a round trip, so expressing the
per-file delay in units of RTT helps to compare the observed performance
against the predicted value.

=== Roundtrips ===

The 0.5.1 release requires about 9 roundtrips for each share it uploads. The
upload algorithm sends data to all shareholders in parallel, but these 9
phases are done sequentially. The phases are:

 1. allocate_buckets
 2. send_subshare (once per segment)
 3. send_plaintext_hash_tree
 4. send_crypttext_hash_tree
 5. send_subshare_hash_trees
 6. send_share_hash_trees
 7. send_UEB
 8. close
 9. dirnode update

We need to keep the send_subshare calls sequential (to keep our memory
footprint down), and we need a barrier between the close and the dirnode
update (for robustness and clarity), but the others could be pipelined.
9*14ms=126ms, which accounts for about 15% of the measured upload time.

Doing steps 2-8 in parallel (using the attached pipeline-sends.diff patch)
does indeed seem to bring the time-per-file down from 900ms to about 800ms,
although the results aren't conclusive.
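
The round-trip budget above can be checked with a few lines of arithmetic. This is an idealised sketch: it assumes a single segment per file and that the pipelined steps 2-8 overlap completely, so that together they cost one round trip. The function names are hypothetical.

```python
RTT_MS = 14              # measured ping time to colo
PHASES = 9               # sequential phases in 0.5.1
PIPELINED = range(2, 9)  # steps 2-8, assumed to overlap completely


def sequential_cost_ms(phases=PHASES, rtt_ms=RTT_MS):
    """Latency when every phase waits for the previous one."""
    return phases * rtt_ms


def pipelined_cost_ms(phases=PHASES, rtt_ms=RTT_MS, overlapped=len(PIPELINED)):
    """Latency when the overlapped steps cost one round trip in total."""
    return (phases - overlapped + 1) * rtt_ms


saving = sequential_cost_ms() - pipelined_cost_ms()
print(sequential_cost_ms(), pipelined_cost_ms(), saving)  # 126 42 84
```

Under these assumptions the best-case saving is 84ms per file, the same order as the roughly 100ms improvement observed with the patch.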

With the pipeline-sends patch, my uploads take A+B*size time, where A (the
constant per-file term) is 790ms and B is 1/23.4kBps. 3.3/B is about 77kBps,
the same speed that basic 'scp' gets, which ought to be my upstream bandwidth.
This suggests that the main limitations on upload speed are the constant
per-file overhead and the FEC expansion factor.
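
The linear model above can be evaluated directly. The sketch below uses only the fitted values quoted in the text (A = 790ms, 1/B = 23.4kBps, expansion 3.3); the function and constant names are illustrative.

```python
A_SECONDS = 0.790   # constant per-file overhead
RATE_KBPS = 23.4    # 1/B, in kB of plaintext per second
EXPANSION = 3.3     # share bytes sent per plaintext byte


def upload_time(size_kb):
    """Predicted upload time in seconds for a file of size_kb kilobytes."""
    return A_SECONDS + size_kb / RATE_KBPS


def link_rate_kbps():
    """Share bytes pushed over the link per second (EXPANSION / B)."""
    return EXPANSION * RATE_KBPS


def breakeven_size_kb():
    """File size at which transfer time equals the per-file overhead."""
    return A_SECONDS * RATE_KBPS


for size_kb in (1, 100, 1000):
    print(size_kb, upload_time(size_kb))
print(link_rate_kbps(), breakeven_size_kb())
```

With these numbers a 1kB file is predicted to take about 0.83s, and files smaller than roughly 18kB spend more time in per-file overhead than in data transfer.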

== Storage Servers ==

=== storage index count ===

ext3 (on tahoebs1) refuses to create more than 32000 subdirectories in a
single parent directory. In 0.5.1, this appears as a limit on the number of
buckets (one per storage index) that any StorageServer can hold. A simple
nested directory structure will work around this. The following code, which
uses two levels of two-character base32 prefixes (1024 values each), would
let us manage 33.5G shares (see #150).

{{{
  import os
  from idlib import b2a
  s = b2a(si)  # slice the base32 string, not the raw bytes
  os.path.join(s[:2], s[2:4], s)
}}}
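
A self-contained sketch of such a nested layout follows. It assumes the prefixes are taken from the base32-encoded name, so that two characters give 1024 values and every directory stays under the ext3 limit. The `b2a` defined here is a stand-in built on the standard library, not the real `idlib.b2a`, and `bucket_path` and `capacity` are hypothetical names.

```python
import base64
import os

EXT3_MAX_SUBDIRS = 32000  # per-directory limit cited above


def b2a(data):
    """Stand-in for idlib.b2a: lowercase, unpadded base32 (an assumption)."""
    return base64.b32encode(data).decode("ascii").lower().rstrip("=")


def bucket_path(si):
    """Two levels of two-character prefixes, then the full name."""
    name = b2a(si)
    return os.path.join(name[:2], name[2:4], name)


def capacity():
    """Buckets addressable before any directory exceeds the ext3 limit."""
    prefixes = 32 ** 2  # two base32 characters: 1024 values
    return prefixes * prefixes * EXT3_MAX_SUBDIRS


print(bucket_path(bytes(range(16))))
print(capacity())  # 33554432000, i.e. about 33.5G
```

The capacity works out to 1024 * 1024 * 32000, which matches the 33.5G figure. The ext3 subdirectory limit is then reached only in the innermost directories.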

This limitation is independent of problems of memory use and lookup speed.
Once the number of buckets is large, the filesystem may take a long time (and
multiple disk seeks) to determine if a bucket is present or not. The
provisioning page suggests how frequently these lookups will take place, and
we can compare this against the time each one will take to see if we can keep
up or not. If and when necessary, we'll move to a more sophisticated storage
server design (perhaps with a database to locate shares).

I was unable to measure a consistent slowdown resulting from having 30000
buckets in a single storage server.

== System Load ==

The source:src/allmydata/test/check_load.py tool can be used to generate
random upload/download traffic, to see how much load a Tahoe grid imposes on
its hosts.

=== test one: 10kB mean file size ===

Preliminary results on the Allmydata test grid (14 storage servers spread
across four machines, each a roughly 3GHz P4, plus two web servers): we used
three check_load.py clients running with 100ms delay between requests, an
80%-download/20%-upload traffic mix, and file sizes distributed exponentially
with a mean of 10kB. These three clients get about 8-15kBps downloaded,
2.5kBps uploaded, doing about one download per second and 0.25 uploads per
second. These traffic rates were higher at the beginning of the process (when
the directories were smaller and thus faster to traverse).
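
The traffic mix described above can be modelled in a few lines. This is a hypothetical sketch of the workload shape only (80% downloads, exponentially distributed sizes with a 10kB mean); it is not taken from check_load.py, and it omits the 100ms delay and all network activity.

```python
import random


def generate_ops(count, mean_size=10_000, download_fraction=0.8, seed=0):
    """Yield (operation, size_in_bytes) pairs for a synthetic workload.

    Sizes are drawn from an exponential distribution with the given mean.
    """
    rng = random.Random(seed)
    for _ in range(count):
        op = "download" if rng.random() < download_fraction else "upload"
        size = int(rng.expovariate(1.0 / mean_size))
        yield op, size


ops = list(generate_ops(10_000))
downloads = sum(1 for op, _ in ops if op == "download")
mean = sum(size for _, size in ops) / len(ops)
print(downloads / len(ops), round(mean))
```

An exponential distribution produces many small files and an occasional large one, so the per-file overhead dominates most requests at this mean size.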

The storage servers were minimally loaded. Each storage node was consuming
about 9% of its CPU at the start of the test, 5% at the end. These nodes were
receiving about 50kbps throughout, and sending 50kbps initially (increasing
to 150kbps as the dirnodes got larger). Memory usage was trivial, about 35MB
VmSize per node, 25MB RSS. The load average on a 4-node box was about 0.3.

The two machines serving as web servers (performing all encryption, hashing,
and erasure-coding) were the most heavily loaded. The clients distribute
their requests randomly between the two web servers. Each server was
averaging 60%-80% CPU usage. Memory consumption was minor, 37MB VmSize and
29MB RSS on one server, 45MB/33MB on the other. Load average grew from about
0.6 at the start of the test to about 0.8 at the end. Outbound network traffic
(including both client-side plaintext and server-side shares) was about
600kbps for the whole test, while the inbound traffic started at 200kbps and
rose to about 1Mbps at the end.

=== test two: 1MB mean file size ===

Same environment as before, but the mean file size was set to 1MB instead of
10kB.

{{{
clients: 2MBps down, 340kBps up, 1.37 fps down, .36 fps up
tahoecs2: 60% CPU, 14Mbps out, 11Mbps in, load avg .74  (web server)
tahoecs1: 78% CPU, 7Mbps out, 17Mbps in, load avg .91  (web server)
tahoebs4: 26% CPU, 4.7Mbps out, 3Mbps in, load avg .50  (storage server)
tahoebs5: 34% CPU, 4.5Mbps out, 3Mbps in  (storage server)
}}}

Load is about the same as before, but of course the bandwidths are larger.
For this file size, the per-file overhead seems to be more of a limiting
factor than per-byte overhead.
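
As a consistency check, the client byte rates and file rates from the results above can be divided to recover the average amount of data moved per file. The helper name is illustrative, and 2MBps is taken as 2000kBps.

```python
def kbytes_per_file(rate_kbytes_per_s, files_per_second):
    """Average kB moved per file, given a byte rate and a file rate."""
    return rate_kbytes_per_s / files_per_second


# test two: figures from the "clients" line in the results above
down = kbytes_per_file(2000, 1.37)
up = kbytes_per_file(340, 0.36)
print(round(down), round(up))  # 1460 944
```

Both values come out near the configured 1MB mean, which suggests the reported rates and file counts are mutually consistent.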

=== initial conclusions ===

So far, Tahoe is scaling as designed: the client nodes are the ones doing
most of the work, since these are the easiest to scale. In a deployment where
central machines are doing encoding work, CPU on these machines will be the
first bottleneck. Profiling can be used to determine how the upload process
might be optimized: we don't yet know if encryption, hashing, or encoding is
a primary CPU consumer. We can change the upload/download ratio to examine
upload and download separately.

Deploying large networks in which clients are not doing their own encoding
will require sufficient CPU resources. Storage servers use minimal CPU, so
having all storage servers also be web/encoding servers is a natural
approach.