Opened at 2009-12-27T04:50:22Z
Last modified at 2011-07-22T13:34:03Z
#873 new defect
upload: tolerate lost or unacceptably slow servers
Reported by: | warner | Owned by: | kevan |
---|---|---|---|
Priority: | major | Milestone: | eventually |
Component: | code-encoding | Version: | 1.5.0 |
Keywords: | upload preservation availability performance hang error | Cc: | |
Launchpad Bug: |
Description
As with download in #287, we'd like upload to gracefully handle servers that silently disconnect during the upload process. This is more difficult than for download, because we don't have the option of switching to a different server. Giving up on a server during upload means giving up on the whole share, which reduces reliability. "Shares of happiness" is the current threshold used to decide how important this abandon-the-share event is.
To implement this, the upload code needs to use a timeout (to distinguish between slow-server and silently-lost-server) and we need some way to decide what that timeout should be.
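A minimal sketch of the idea, not the actual Tahoe-LAFS upload code: each server is presumed lost once it has gone longer than a timeout without acknowledging a write, and the upload stays viable only while the remaining shares meet the shares-of-happiness threshold. The names `ServerProgress`, `is_presumed_lost`, and `check_upload` are hypothetical, the sketch assumes one share per server, and it does not answer the open question of how the timeout value should be chosen.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ServerProgress:
    """Hypothetical per-server upload state (not Tahoe-LAFS's real data model)."""
    name: str
    last_ack: float  # time (in seconds) of the server's last acknowledged write


def is_presumed_lost(server: ServerProgress, now: float, timeout: float) -> bool:
    # A slow server still acknowledges writes within the timeout;
    # a silently lost server never does.
    return (now - server.last_ack) > timeout


def check_upload(servers: List[ServerProgress], now: float, timeout: float,
                 shares_of_happiness: int) -> Tuple[bool, List[str]]:
    """Return (still_viable, names_of_abandoned_servers).

    Assumes one share per server, so abandoning a server abandons one share.
    """
    lost = [s.name for s in servers if is_presumed_lost(s, now, timeout)]
    remaining_shares = len(servers) - len(lost)
    return remaining_shares >= shares_of_happiness, lost
```

With ten servers and a threshold of 7 (illustrative values), abandoning one stalled server leaves the upload viable, while abandoning four does not. The hard part, as the ticket notes, is picking `timeout` so that merely slow servers are not abandoned.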
Attachments (1)
Change History (12)
comment:1 Changed at 2009-12-27T04:52:18Z by warner
- Keywords upload added
comment:2 Changed at 2009-12-27T16:06:42Z by davidsarah
- Keywords preservation availability performance added
comment:3 Changed at 2009-12-29T19:07:26Z by davidsarah
- Keywords hang added
Changed at 2009-12-29T21:36:27Z by kmarkley86
comment:4 Changed at 2009-12-29T21:44:25Z by kmarkley86
I noticed two 'tahoe backup' operations hang on my node, and attached my .tahoe/logs directory as logs.tgz. Here are my versions:
allmydata-tahoe: 1.5.0, foolscap: 0.4.2, pycryptopp: 0.5.17, zfec: 1.4.5, Twisted: 8.2.0, Nevow: 0.9.33-r17222, zope.interface: 3.5.2, python: 2.6.2, platform: OpenBSD-4.6-amd64-Genuine_Intel-R-_CPU_000_@_2.93GHz-64bit-ELF, sqlite: 3.6.13, simplejson: 2.0.9, argparse: 0.9.1, pyOpenSSL: 0.9, pyutil: 1.3.34, zbase32: 1.1.1, setuptools: 0.6c12dev, pysqlite: 2.4.1
comment:5 Changed at 2009-12-30T00:03:22Z by davidsarah
My welcome page says "Connected to 89 of 105 known storage servers" but I don't know how to figure out which servers the hung operations are trying to contact. Here are the Storage Index values from the status pages, if they're worth anything:
- twfhdmkbsoidlnf3zijrcut7jm (hung incremental backup)
- dt5jrwb3ck2yt3tp7etuw6aply (hung backup of a large file; I can see sharemap 8 is missing)
(I'm on the allmydata.com production grid.)
comment:6 Changed at 2010-05-16T05:21:27Z by zooko
- Milestone changed from undecided to 1.8.0
- Owner set to zooko
- Status changed from new to assigned
comment:7 Changed at 2010-06-12T23:44:48Z by davidsarah
- Keywords error added
comment:8 Changed at 2010-07-24T05:38:14Z by zooko
- Milestone changed from 1.8.0 to eventually
It was impulsive of me to put this ticket into the 1.8 Milestone. This ticket will probably get fixed in a complete rewrite of the upload code at some point.
comment:9 Changed at 2010-07-29T04:53:25Z by zooko
- Summary changed from upload: tolerate lost or missing servers to upload: tolerate lost or unacceptably slow servers
comment:10 Changed at 2011-04-21T14:52:28Z by davidsarah
comment:11 Changed at 2011-07-22T13:34:03Z by zooko
- Owner changed from zooko to kevan
- Status changed from assigned to new
Kevan: does #1382 affect this ticket? Also, if you know how to close tickets or clarify the relationships mentioned in comment:10, that might be good.
logs.tgz: Contents of Kyle's .tahoe/logs directory, captured after two hung 'tahoe backup' operations were noticed.