#878 closed defect (fixed)
warn users about the performance issues of mutable files
Reported by: | zooko | Owned by: | kevan |
---|---|---|---|
Priority: | major | Milestone: | 1.6.0 |
Component: | unknown | Version: | 1.5.0 |
Keywords: | docs mutable performance large reviewed | Cc: | |
Launchpad Bug: |
Description (last modified by zooko)
Performance issues:
- mutable files are stored in their entirety in RAM briefly during upload
- creating a new mutable file requires creating a new RSA public/private key-pair, which can take as many as a billion CPU cycles
Currently, new users can carefully read the Tahoe-LAFS docs and then go on and decide to use mutable files without being aware of these issues. To close this ticket, fix that.
Attachments (1)
Change History (21)
comment:1 Changed at 2010-01-04T22:35:53Z by zooko
- Description modified (diff)
comment:2 Changed at 2010-01-04T22:38:59Z by zooko
comment:3 Changed at 2010-01-05T17:54:36Z by kevan
- Owner changed from nobody to kevan
I'll take care of this.
comment:4 Changed at 2010-01-05T19:07:34Z by kevan
I added the documentation to known_issues.txt, since there are proposals and tickets open that hope to fix this (which would seem to imply that it is a known issue).
Thoughts? Things that should be there but aren't?
comment:5 Changed at 2010-01-05T19:09:44Z by kevan
- Keywords review-needed added
comment:6 Changed at 2010-01-06T17:20:05Z by kevan
After reading a message (http://allmydata.org/pipermail/tahoe-dev/2010-January/003488.html) on tahoe-dev, I realized that I had misunderstood mutable file modification when writing my first patch. While the process I described was accurate for certain operations (specifically directory modification), it didn't apply to file creation using the CLI or the WUI, the places where users would be creating mutable files, and the places where the warning would be relevant. I'm attaching a reworded patch that fixes this issue.
comment:7 Changed at 2010-01-14T04:57:55Z by zooko
This ticket is a subset of #757 (there isn't a doc that says "which operations are efficient").
comment:8 Changed at 2010-01-14T19:52:26Z by zooko
FWIW here are measurements of how many CPU cycles are needed to generate a 2048-bit RSA key: http://bench.cr.yp.to/results-sign.html (the ones labelled "ronald2048"). That is not measuring the same implementation of RSA as the one we use, but it is a good benchmark to show that generating RSA keys is expensive.
comment:9 Changed at 2010-01-14T21:49:55Z by davidsarah
http://allmydata.org/trac/tahoe/attachment/ticket/878/mutable_docs.txt#L21 : "will be invalidated if the file is modified" -> "would be invalidated if the file were modified".
comment:10 Changed at 2010-01-14T21:51:59Z by davidsarah
"tahoe-lafs" -> "Tahoe-LAFS" (three times)
comment:11 Changed at 2010-01-14T22:40:39Z by warner
While "billions of CPU cycles" is technically accurate, it would be more meaningful to users to say "perhaps an entire second on a desktop PC" (and maybe a parenthetical remark about small ARM boxes). We don't want to scare them away from using directories altogether, just help them understand why a loop that creates a million directories might take a million seconds.
Also, I believe the motivation for this ticket was specifically about *large* mutable files, so I'd emphasize the unfortunate-and-we-haven't-fixed-with-MDMF performance aspects (i.e. the cost=O(filesize) parts) rather than the unfortunate-and-we-haven't-fixed-with-ECDSA aspects (like the constant cost of creating new mutable files).
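A rough sketch of the arithmetic behind the two figures discussed here (cycles per key-pair and the cost of a loop). The 2 GHz clock rate and the function names are assumptions chosen for illustration, not measurements of Tahoe-LAFS:

```python
SECONDS_PER_DAY = 86400.0


def cycles_to_seconds(cycles, clock_hz):
    """Convert a CPU cycle count to wall-clock seconds at a given clock rate."""
    return cycles / clock_hz


def loop_cost_days(count, seconds_each):
    """Total time, in days, for `count` operations done one after another."""
    return count * seconds_each / SECONDS_PER_DAY


# Assumption: one billion cycles per key-pair on a hypothetical 2 GHz core.
per_keypair_s = cycles_to_seconds(1e9, 2e9)

# The "million directories at about one second each" case from the comment.
million_dirs_days = loop_cost_days(1_000_000, 1.0)

print("one key-pair: %.2f s" % per_keypair_s)
print("a million directories: %.1f days" % million_dirs_days)
```

At one second per directory, a million sequential creations come to a little over eleven and a half days, which is the scale of cost the warning is meant to convey.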
comment:12 Changed at 2010-01-14T22:54:46Z by zooko
For Jody Harris, seconds elapsed on today's average PC might be more useful (or maybe not -- perhaps he prefers CPU cycles), but for Jonathan Ellis (the bug reporter of #757) CPU cycles are probably more useful. Also I wonder about people who are running their Tahoe-LAFS gateway on a virtual machine. Would seconds-on-an-average-modern-CPU significantly underestimate the cost to them?
comment:13 Changed at 2010-01-14T23:34:28Z by warner
Like I said, "billions of CPU cycles" is more accurate (and more universal), but I think the most likely audience for this document will be well-served by having at least one human-meaningful unit of measure in there somewhere, even if only anecdotally. For example, I tell people that the unit tests currently take about 240s on my 2008-era laptop, and I tell them that "tahoe mkdir" takes about 800ms on the same machine. And I expect that people will know how their own hardware compares to a reference point like that. Let's not refuse to offer them a translation hint just because we can't give them an exact number of seconds for their particular hardware.
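One way a reader might use such a reference point is sketched below. It takes the two figures quoted in this comment (240 s for the unit tests, 800 ms for "tahoe mkdir") and assumes, purely for illustration, that both scale by the same factor on other hardware. The function name is hypothetical and the linear-scaling assumption is unverified:

```python
# Figures quoted in the comment above, for a 2008-era laptop.
REFERENCE_UNIT_TESTS_S = 240.0
REFERENCE_MKDIR_S = 0.8


def estimate_mkdir_seconds(my_unit_tests_s):
    """Estimate 'tahoe mkdir' time on another machine.

    Assumption: mkdir time scales by the same ratio as the unit-test
    run time. This is a translation hint, not a benchmark.
    """
    slowdown = my_unit_tests_s / REFERENCE_UNIT_TESTS_S
    return REFERENCE_MKDIR_S * slowdown


# Example: a machine whose unit-test run takes twice as long (480 s).
print("estimated mkdir time: %.1f s" % estimate_mkdir_seconds(480.0))
```

A machine that needs 480 s for the unit tests would be estimated at roughly twice the reference mkdir time under this assumption.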
comment:14 Changed at 2010-01-15T03:35:24Z by kevan
I'm updating the patch to include David-Sarah's suggestions. Thanks for the feedback. :)
comment:15 Changed at 2010-01-15T04:57:08Z by kevan
zooko and I were talking in IRC, and concluded that the explanation of why RSA is used with mutable files is inappropriate for known_issues.txt. I'll remove it when I work on the cycles versus seconds issue.
comment:16 Changed at 2010-01-15T20:32:09Z by davidsarah
- Keywords large added
comment:17 Changed at 2010-01-15T20:54:11Z by kevan
I think I agree with Brian.
Without a meaningful human figure to put "billions of CPU cycles" into perspective, that paragraph is a tad scarier than it needs to be. My first instinct when reading this exchange was to try to work both figures in there, but the point of that paragraph seems a lot clearer with only seconds than with both cycles and seconds.
I moved the explanation of mutable file performance issues to docs/performance.txt, because that seemed like a more appropriate place for it.
comment:18 Changed at 2010-01-18T02:53:14Z by davidsarah
- Keywords reviewed added; review-needed removed
Looks good to me.
comment:19 Changed at 2010-01-26T14:34:50Z by zooko
- Resolution set to fixed
- Status changed from new to closed
Applied as 26c6b806d7922da1. Thank you!
comment:20 Changed at 2010-01-26T15:04:01Z by zooko
- Milestone changed from undecided to 1.6.0
Here's the thread where new user Jody Harris made it clear that a new user who does read the docs still doesn't learn about these issues: http://allmydata.org/pipermail/tahoe-dev/2010-January/003478.html