#1636 closed defect (fixed)
Unhandled error in Deferred during retrieve
Reported by: | killyourtv | Owned by: | warner
---|---|---|---
Priority: | critical | Milestone: | 1.9.2
Component: | code-network | Version: | 1.9.0
Keywords: | regression | Cc: | zooko, killyourtv@…
Launchpad Bug: | | |
Description
allmydata-tahoe: 1.9.0-r5387, foolscap: 0.6.2, pycryptopp: 0.5.29, zfec: 1.4.22, Twisted: 11.0.0, Nevow: 0.10.0, zope.interface: unknown, python: 2.7.2+, platform: Linux-debian_wheezy/sid-x86_64-64bit_ELF, pyOpenSSL: 0.13, simplejson: 2.3.0, pycrypto: 2.4.1, pyasn1: unknown, mock: 0.7.2, sqlite3: 2.6.0 [sqlite 3.7.9], setuptools: 0.6c16dev3
...with patches from #1007, #1010, and #1628 applied.
When uploading files to the volunteer grid on I2P, I received the following exception:
    2011-12-14 13:34:49+0000 [-] Unhandled error in Deferred:
    2011-12-14 13:34:49+0000 [-] Unhandled Error
    Traceback (most recent call last):
      File "$INSTALL_LOCATION/tahoe-lafs/src/allmydata/mutable/retrieve.py", line 610, in _download_current_segment
        d = self._process_segment(self._current_segment)
      File "$INSTALL_LOCATION/tahoe-lafs/src/allmydata/mutable/retrieve.py", line 638, in _process_segment
        dl.addErrback(self._validation_or_decoding_failed, [reader])
      File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 308, in addErrback
        errbackKeywords=kw)
      File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 286, in addCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 542, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "$INSTALL_LOCATION/tahoe-lafs/src/allmydata/mutable/retrieve.py", line 736, in _validation_or_decoding_failed
        self._mark_bad_share(reader.server, reader.shnum, reader, f)
      File "$INSTALL_LOCATION/tahoe-lafs/src/allmydata/mutable/retrieve.py", line 595, in _mark_bad_share
        self.notify_server_corruption(server, shnum, str(f.value))
      File "$INSTALL_LOCATION/tahoe-lafs/src/allmydata/mutable/retrieve.py", line 938, in notify_server_corruption
        rref.callRemoteOnly("advise_corrupt_share",
    exceptions.AttributeError: 'NoneType' object has no attribute 'callRemoteOnly'
I do not see any flog files from the time of the error (and it took place when I was AFK).
At http://127.0.0.1:3456/status/ it's still visible in the "Active Operations" section, even though the event took place hours ago.
    Mutable File Publish Status
    Started: 13:34:51 14-Dec-2011
    Storage Index: 7kkjs3zair343ujzz7ghzwx5si
    Helper?: No
    Current Size: 2972
    Progress: 0.0%
    Status: Pushing shares
    Retrieve Results
    Encoding: 3 of 10
    Sharemap:
      0 -> Placed on [q5b52rmg], [bgahn3ft]
      4 -> Placed on [q5b52rmg]
      7 -> Placed on [5bqq6b7f]
      9 -> Placed on [jvgf7m73]
    Timings:
      Total: ()
      Setup: 6.2ms
      Encrypting: 310us (9.59MBps)
      Encoding: 264us (11.25MBps)
      Packing Shares: ()
      RSA Signature: 6.7ms
      Pushing: ()
For what it's worth, there are other failed upload attempts that are remaining in the list of "Active Operations" (even though they failed many, many hours ago). I'm retrying the batch upload; I'll try to generate the flog files if it errors out while near.
(Occasional upload failures on networks like I2P or Tor aren't unusual; what is unusual is that the uploads are still shown as active long after failing, which is not something I saw with 1.8.x.)
Attachments (4)
Change History (53)
comment:1 Changed at 2011-12-14T21:30:00Z by zooko
- Cc zooko added
comment:2 follow-up: ↓ 5 Changed at 2011-12-16T14:30:41Z by killyourtv
I don't know how to best explain this (it may be related to the regression being tackled in #1628), but:
During my attempted re-upload, my system became _really_ sluggish. After what seemed like an eternity I managed to get "htop" running. The system load was at 35, with the tahoe process using ~1.2GB of RAM.
Unfortunately, after 15 minutes or so of trying to get back to a prompt (or ssh in), I grew impatient and hit the power switch.
It may not be very helpful, but attached is a snippet of twistd.log from that period. The fun started at 18:43:28 UTC. The logging stopped at 18:44:48 UTC. The attached file has 49679 lines.
There isn't anything in ~/.tahoe/logs/incidents from that time.
I haven't seen any "Unhandled error in Deferred:" entries in twistd.log from my use of 1.8.x.
Changed at 2011-12-16T14:31:18Z by killyourtv
comment:3 Changed at 2011-12-16T16:09:54Z by zooko
- Keywords regression added
- Milestone changed from undecided to 1.10.0
- Priority changed from major to critical
I'm promoting this to priority critical, adding the regression keyword, and putting it into the 1.10.0 milestone.
comment:4 Changed at 2011-12-16T16:53:46Z by zooko
- Milestone changed from 1.10.0 to 1.9.1
comment:5 in reply to: ↑ 2 ; follow-up: ↓ 10 Changed at 2011-12-17T05:29:02Z by davidsarah
Replying to killyourtv:
I don't know how to best explain this (it may be related to the regression being tackled in #1628), ...
I can't see any evidence that it is the same bug. What makes you suspect that it is?
(#1583 seems to be the same bug as #1628, but they have different symptoms to this one.)
comment:6 Changed at 2011-12-28T23:58:22Z by warner
Ok, I think we have two things going on here:
- notify_server_corruption needs to check whether the rref it gets is None before invoking callRemoteOnly on it (a sketch follows below). This was probably exposed by my IServer refactoring: before that, the servermap held actual RemoteReference objects, whereas now it holds IServer objects (from which you can get a RemoteReference, if the server is still connected, with .get_rref()).
- certain errors during the retrieve code get dropped, and the retrieve hangs. To fix this, the control flow needs to be carefully structured to make sure that all exceptions will get caught in a Deferred chain that leads into the main Retrieve deferred.
Zooko is going to write a patch for the first problem. Brian will walk through the code to fix the second problem.
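For illustration, here is a minimal sketch of the kind of guard the first item describes, assuming a method shaped like the one in the traceback above; the real notify_server_corruption in retrieve.py may differ in detail:

    # Hypothetical sketch only -- not the actual retrieve.py code.
    def notify_server_corruption(self, server, shnum, reason):
        rref = server.get_rref()
        if rref is None:
            # The server disconnected after the servermap was built, so
            # there is nobody to send the corruption report to.
            return
        rref.callRemoteOnly("advise_corrupt_share",
                            "mutable", self._storage_index, shnum, reason)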
comment:7 follow-up: ↓ 11 Changed at 2011-12-29T00:20:14Z by zooko
I believe the current state of this ticket is blocked on someone verifying that it occurs with upstream Tahoe-LAFS (no i2p-specific patches or other patches that aren't in trunk).
Is that right?
killyourtv appears to be a reliable reporter of bugs so far -- his reports are low on speculation and include actual facts like the twistd file...
Okay, I just had a conversation with Brian and David-Sarah on IRC, and it seems clear that there is a bug in the mutable-file code: it assumes that the server object's .rref member will be an object, but it may instead be None due to normal, inevitable network entropy.
There are two possible fixes:
- Find all places that call .get_rref() and make sure they sensibly handle the case where they get None.
- Change the server instance to keep a RemoteReference object in .rref even after it has been notified that the connection was lost; that object will raise DeadReferenceError if it is used.
I'll investigate both approaches and produce a patch for one or both of them.
comment:8 Changed at 2011-12-29T00:20:34Z by zooko
- Owner changed from nobody to zooko
- Status changed from new to assigned
comment:9 Changed at 2012-01-08T06:19:34Z by warner
I've cleaned up the control flow in retrieve.py, in a series of 6 patches ending in 893eea849ba7cf78 / 8db39f0e. I still need to do one more careful inspection, but I think it's pretty solid now.
On the .get_rref front, I had a thought. Since my long-term goal with IServer is to remove knowledge of server protocols from the upload/download code and move it into the IServer class (to facilitate having multiple backends), why not make it IServer.callRemote(..) instead of IServer.get_rref().callRemote(..)? The current IServer class would just do:
    def callRemote(self, methname, *args, **kwargs):
        if self.rref:
            return self.rref.callRemote(methname, *args, **kwargs)
        else:
            return defer.fail(DeadReferenceError())
It'd be easy to grep for that callRemote later, when it's time to hoist the abstraction and turn those calls into putShare or whatever.
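To illustrate the proposed change at a call site (a hypothetical example, using get_version as a stand-in remote method, not a line from the tree):

    # today, each caller must guard against a missing connection:
    rref = server.get_rref()
    if rref is not None:
        d = rref.callRemote("get_version")

    # under the proposal, the guard lives inside IServer.callRemote():
    d = server.callRemote("get_version")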
comment:10 in reply to: ↑ 5 Changed at 2012-02-14T00:43:15Z by killyourtv
Replying to davidsarah:
Replying to killyourtv:
I don't know how to best explain this (it may be related to the regression being tackled in #1628), ...
I can't see any evidence that it is the same bug. What makes you suspect that it is?
(#1583 seems to be the same bug as #1628, but they have different symptoms to this one.)
Sorry, just saw this reply. :|
It may have been stupid speculation on my part. I don't remember why I suspected it might be. Disregard (as you likely have by now).
comment:11 in reply to: ↑ 7 ; follow-up: ↓ 12 Changed at 2012-02-14T01:14:09Z by killyourtv
- Cc killyourtv@… added
Replying to zooko:
[...]
killyourtv appears to be a reliable reporter of bugs so far -- his reports are low on speculation
[...]
That was until comment:2 above. :| *Sigh*.
Anyhow... I'm back to testing with what was the latest git revision as of a few days ago (the webui shows 1.9.0.post77, rev 2da8339dd762aa96e73b1c1d98e217df43802418).
allmydata-tahoe: 1.9.0.post77, foolscap: 0.6.2, pycryptopp: 0.5.29, zfec: 1.4.22, Twisted: 11.1.0, Nevow: 0.10.0, zope.interface: unknown, python: 2.7.2+, platform: Linux-debian_wheezy/sid-x86_64-64bit_ELF, pyOpenSSL: 0.13, simplejson: 2.3.2, pycrypto: 2.5, pyasn1: unknown, mock: 0.7.2, sqlite3: 2.6.0 [sqlite 3.7.10], setuptools: 0.6c16dev3
Once again, patches from #1007, #1010, and #68 have been applied. Everything, so far, seems to be working fine. Or...I should say that things are much better than they had been. It's been running for ~36 hours now.
The only problems I've had (a couple error 500s during upload or repair) can probably be attributed to our not-always-as-stable-as-we'd-like grid. When the failures occurred the tasks remained on the status page, such as
    publish  3l6y3gkgdxxwwsn2mwtdmssjse  No  0B    0.0%  Started
    publish  zyt7tuse472ikqnjc7dmw3ai4y  No  589B  0.0%  Pushing shares
These entries are from around two hours ago (it's currently just after 1am UTC). These tasks are no longer running, but they're still in the "Active Operations" section.
    Started: 22:53:57 13-Feb-2012
    Storage Index: 3l6y3gkgdxxwwsn2mwtdmssjse
    Helper?: No
    Current Size: 0
    Progress: 0.0%
    Status: Started
    Retrieve Results
    Encoding: 5 of 10
    Sharemap:
      0 -> Placed on [3t4enhba]
      1 -> Placed on [eyp5vu5v], [bgahn3ft]
      2 -> Placed on [3ejzcskd]
      3 -> Placed on [uxqmbg6c]
      5 -> Placed on [jvgf7m73]
      6 -> Placed on [3teqr2cf], [n64vwepc]
      7 -> Placed on [q5b52rmg]
      8 -> Placed on [ql3cupul], [p6463dyg]
      9 -> Placed on [poeonyl6]
    Timings:
      Total: ()
      Setup:
      Encrypting: 0us ()
      Encoding: 0us ()
      Packing Shares: ()
      RSA Signature:
      Pushing: ()

    Started: 23:18:25 13-Feb-2012
    Storage Index: zyt7tuse472ikqnjc7dmw3ai4y
    Helper?: No
    Current Size: 589
    Progress: 0.0%
    Status: Pushing shares
    Retrieve Results
    Encoding: 3 of 10
    Sharemap:
      0 -> Placed on [p6463dyg], [3t4enhba]
      1 -> Placed on [n64vwepc]
      2 -> Placed on [basz6g7o]
      3 -> Placed on [5mefqfis]
      4 -> Placed on [eyp5vu5v]
      5 -> Placed on [3ejzcskd]
      6 -> Placed on [jvgf7m73]
      7 -> Placed on [vvnvgne4]
      8 -> Placed on [iqhk5e2t]
      9 -> Placed on [uxqmbg6c]
    Timings:
      Total: ()
      Setup: 5.7ms
      Encrypting: 283us (2.08MBps)
      Encoding: 151us (3.90MBps)
      Packing Shares: ()
      RSA Signature: 6.5ms
      Pushing: ()
This is the only entry in twistd.log from around that time:
    2012-02-13 23:37:48+0000 [-] Unhandled error in Deferred:
    2012-02-13 23:37:48+0000 [-] Unhandled Error
    Traceback (most recent call last):
      File "/usr/lib/python2.7/dist-packages/nevow/flat/twist.py", line 42, in cb
        _drive(iterable, finished)
      File "/usr/lib/python2.7/dist-packages/nevow/flat/twist.py", line 26, in _drive
        finished.callback('')
      File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 362, in callback
        self._startRunCallbacks(result)
      File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 458, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 545, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/usr/lib/python2.7/dist-packages/nevow/appserver.py", line 181, in _cbFinishRender
        self.finishRequest( True )
      File "/usr/lib/python2.7/dist-packages/nevow/appserver.py", line 176, in finishRequest
        server.Request.finish(self)
      File "/usr/lib/python2.7/dist-packages/twisted/web/http.py", line 866, in finish
        "Request.finish called on a request after its connection was lost; "
    exceptions.RuntimeError: Request.finish called on a request after its connection was lost; use Request.notifyFinish to keep track of this
If I'm around if/when it happens again I'll generate an incident report.
All that said, things look *much better*.
comment:12 in reply to: ↑ 11 ; follow-up: ↓ 13 Changed at 2012-02-14T02:51:32Z by zooko
Replying to killyourtv:
The only problems I've had (a couple error 500s during upload or repair) can probably be attributed to our not-always-as-stable-as-we'd-like grid. When the failures occurred the tasks remained on the status page, such as
Hrm. I don't think bad behavior on the part of the servers should ever cause a 5xx error code from the gateway. Maybe I'm wrong about that, but at the very least the gateway should emit a clear explanation of what went wrong. Please report any 5xx errors you see so we can figure out if they actually represent bugs in your gateway or if they represent bad behavior on the part of the servers. In the latter case, the gateway ought to explain clearly in the error message what went wrong.
comment:13 in reply to: ↑ 12 Changed at 2012-02-14T10:34:38Z by killyourtv
Replying to zooko:
[...]
Please report any 5xx errors you see so we can figure out if they actually represent bugs in your gateway or if they represent bad behavior on the part of the servers. In the latter case, the gateway ought to explain clearly in the error message what went wrong.
Filed #1675.
comment:14 Changed at 2012-03-31T23:00:18Z by davidsarah
- Owner changed from zooko to warner
- Status changed from assigned to new
comment:9 seems to suggest that this is fixed. Is that right? Do we have adequate tests for the original problem?
comment:15 follow-up: ↓ 24 Changed at 2012-03-31T23:04:41Z by davidsarah
Wait, I'm confused. The description says that the error occurs on "upload" (i.e. publish), but Brian's patches mentioned in comment:9 only changed the retrieve code.
comment:16 Changed at 2012-04-20T09:11:39Z by killyourtv
- Resolution set to fixed
- Status changed from new to closed
For what it's worth, I cannot reproduce this anymore (current git) so I think this may have indeed been fixed.
allmydata-tahoe: 1.9.0.post133.dev0, foolscap: 0.6.3, pycryptopp: 0.6.0.1206569328141510525648634803928199668821045408958, zfec: 1.4.24, Twisted: 11.1.0, Nevow: 0.10.0, zope.interface: unknown, python: 2.7.3rc2, platform: Linux-debian_wheezy/sid-x86_64-64bit_ELF, pyOpenSSL: 0.13, simplejson: 2.5.0, pycrypto: 2.5, pyasn1: unknown, mock: 0.8.0, sqlite3: 2.6.0 [sqlite 3.7.11], setuptools: 0.6c16dev3
Everything has been working wonderfully during the last week that I've been on this version. Thanks for all of your (collective) work on this. :)
comment:17 Changed at 2012-04-22T18:10:08Z by zooko
- Keywords review-needed added
Thanks for the update (comment:16), kytv. I hope someone who understands the code better than I do can tell us how likely it is that some specific patch or set of patches fixed this specific problem. If not, I'll try to dig into the question myself at some point. I'm adding the keyword "review-needed" to indicate that we need someone to look into the source code and try to connect this particular issue with specific source-code changes.
comment:18 Changed at 2012-04-22T18:10:23Z by zooko
- Resolution fixed deleted
- Status changed from closed to reopened
comment:19 Changed at 2012-04-22T20:06:32Z by zooko
Just to explain why I re-opened this ticket:
While a report from a user like kytv, who actually experienced the bug in the wild, is extremely valuable, and I would prefer not to close a ticket without one, I would also prefer not to close a ticket without a source-code-oriented explanation of what the bug was and how it was fixed.
Bugs are sneaky things, and I don't believe they are really gone until I see the bodies.
comment:20 Changed at 2012-04-22T22:55:25Z by killyourtv
OK. Should I refrain from closing bugs like this in the future? (I don't want to create "more work" for others--in fact, I was trying to help by closing it. ;))
comment:21 Changed at 2012-04-22T23:53:13Z by davidsarah
As I implied in comment:15, I don't think the patches warner mentioned in comment:9 can have fixed this bug, since the bug occurs on upload, not retrieve. I don't know of any other changes that would have affected it, so it probably still exists.
killyourtv: in general closing a bug with "cannot reproduce" should be done with great reluctance, since bugs don't fix themselves, and we like to know what change fixed something. That resolution status is really intended for the case where the relevant code has changed so much since the bug was reported that the original report is probably no longer relevant. It's fine to close bugs after having verified that a particular fix works, though.
comment:22 Changed at 2012-05-11T00:47:00Z by zooko
- Keywords review-needed removed
Okay since David-Sarah says (per comment:21) that the patches mentioned in comment:9 can't have fixed this issue, I'm removing the review-needed tag. (I was reading the most recent Tahoe-LAFS Weekly News and saw this ticket listed as being one of the review-needed tickets.)
comment:23 Changed at 2012-05-11T00:53:03Z by zooko
killyourtv: yeah, one practice that we have in this project which is different from a lot of other projects is that we focus on reducing uncertainty about bugs. A lot of projects promote a sort of vague notion of "degree of bugginess" with associated folklore, like "Oh yeah, version 13.1.6 was way more crashy than version 13.1.3, but then version 22.0.7 was way more stable...".
I prefer for our collective understanding of bugs to be as specific and reproducible as possible. Therefore, since you definitely spotted a bug originally (thank you!) and since we don't have a specific reason to believe that the bug is fixed, we should leave this ticket open.
Thanks again for your good bug reports!
comment:24 in reply to: ↑ 15 ; follow-up: ↓ 26 Changed at 2012-05-13T20:39:41Z by warner
Replying to davidsarah:
Wait, I'm confused. The description says that the error occurs on "upload" (i.e. publish), but Brian's patches mentioned in comment:9 only changed the retrieve code.
Look at the traceback: the exception happens in Retrieve. I imagine that the WUI operation was to upload a file and add it to the directory, so a Retrieve operation occurs when it fetches the previous contents of the directory (so it can modify and publish the new contents). No (immutable) Upload code was implicated in the traceback.
comment:25 Changed at 2012-05-13T20:53:23Z by warner
- Owner changed from warner to zooko
- Status changed from reopened to new
I gather zooko has some specific idea in mind for how this bug can be closed, so I'm handing it off to him.
comment:26 in reply to: ↑ 24 Changed at 2012-05-14T00:52:13Z by davidsarah
- Keywords test-needed added
Replying to warner:
Replying to davidsarah:
Wait, I'm confused. The description says that the error occurs on "upload" (i.e. publish), but Brian's patches mentioned in comment:9 only changed the retrieve code.
Look at the traceback: the exception happens in Retrieve.
Ah, I see. OK, I agree that the patches referenced in comment:9 address the problem in the traceback. However, I don't see any regression tests for that problem.
comment:27 Changed at 2012-05-14T01:05:44Z by davidsarah
- Summary changed from Unhandled error in Deferred while uploading (current darcs build) to Unhandled error in Deferred during retrieve
comment:28 Changed at 2012-05-14T08:00:52Z by zooko
I don't think the patches referenced in comment:9 fix this bug. I read the patches and don't see a fix for this. The path that led to this exception is still in place in trunk. That is: IServer.get_rref() can return None (because the connection-lost handling in storage_client.py sets self.rref to None), and the send-corruption-report-to-server functionality doesn't test for None.
This is an instance of a problem that appears in many places in trunk; the problem and some proposed solutions are described in comment:6, comment:7, and comment:11.
comment:29 Changed at 2012-05-16T22:02:18Z by zooko
- Milestone changed from 1.9.2 to 2.0.0
- Status changed from new to assigned
comment:30 Changed at 2012-05-16T22:02:35Z by zooko
- Milestone changed from 2.0.0 to 1.9.2
comment:31 Changed at 2012-06-12T16:48:17Z by davidsarah
- Owner changed from zooko to warner
- Status changed from assigned to new
comment:32 Changed at 2012-06-12T22:40:12Z by warner
- Keywords review-needed added
- Owner changed from warner to davidsarah
attachment:dead-rref.diff implements zooko's second approach from comment:11, namely to make the IServer retain the dead RemoteReference after disconnect, and continue to return it to anyone who calls get_rref(). This removes the need for callers to guard against receiving a None (as long as the server was connected at least once, which is always the case for servers returned by get_servers_for_psi(), which is how all upload/download code gets servers).
This patch should remove the problem. A later cleanup step could go through the callers of get_rref() and remove the now-redundant guards, but a better change would be to remove get_rref() and have callers do server.callRemote(), which would then be replaced by higher-level actions like server.putShare() as we move away from Foolscap.
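As a rough illustration of the idea (a hypothetical sketch, not the contents of dead-rref.diff; the real NativeStorageServer in storage_client.py carries much more state):

    # Hypothetical sketch of "retain the dead rref" -- illustrative only.
    class NativeStorageServer:
        def __init__(self):
            self.rref = None        # set once the first connection is made
            self.connected = False

        def _got_connection(self, rref):
            self.rref = rref
            self.connected = True

        def _lost(self):
            self.connected = False
            # Deliberately do NOT set self.rref = None here. Anyone who got
            # this server from the servermap keeps receiving the (now dead)
            # RemoteReference from get_rref(); a callRemote() on it errbacks
            # with DeadReferenceError instead of hitting an AttributeError
            # on None.

        def get_rref(self):
            return self.rref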
comment:33 Changed at 2012-06-12T22:51:34Z by davidsarah
- Status changed from new to assigned
Will review.
comment:34 Changed at 2012-06-14T19:55:53Z by zooko
I reviewed attachment:dead-rref.diff. I see no bugs in it, it is a small patch, and I (still) like the idea of doing it that way. It does make me mildly uncomfortable that there are no unit tests in the patch.
comment:35 Changed at 2012-06-14T23:43:24Z by warner
In the new patch (which includes a test), the assignment of node_key_s will eventually need to move out of the conditional init_storage(), which only happens if the node is running a storage server (client-only nodes will get a node key too). But that fix can happen after 1.10, and will happen naturally as part of the client-side accounting work.
comment:36 Changed at 2012-06-15T01:44:04Z by davidsarah
- Component changed from unknown to code-network
- Keywords reviewed added; test-needed review-needed removed
I reviewed dead-rref-test.diff and approve it for 1.9.2. Will apply.
comment:37 Changed at 2012-06-15T02:05:59Z by david-sarah@…
- Resolution set to fixed
- Status changed from assigned to closed
In 1b07d307619049dd:
comment:38 Changed at 2012-06-15T02:09:02Z by david-sarah <david-sarah@…>
In 1b07d307619049dd:
comment:39 Changed at 2012-06-15T02:58:26Z by davidsarah
- Keywords reviewed removed
- Resolution fixed deleted
- Status changed from closed to reopened
It looks like this test failure is in the new test:
    [ERROR]
    Traceback (most recent call last):
    Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean.
    DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
    <DelayedCall 0x70d3cf8 [0.972191095352s] called=0 cancelled=0 Reconnector._timer_expired()>
    allmydata.test.test_system.Connections.test_rref
comment:40 Changed at 2012-06-15T02:59:12Z by davidsarah
- Owner changed from davidsarah to warner
- Status changed from reopened to new
comment:41 Changed at 2012-06-15T20:44:26Z by warner
I traced this to a bug in foolscap, foolscap#196, and fixed it. I'll try to make a new release this weekend, and we can bump tahoe's dependency if you like. (it should only affect tests).
(The bug was that Tub.stopService was shutting off Reconnectors by iterating over the self.reconnectors list, while the Reconnector shutdown process was removing items from that list during the iteration. The shared-state hazard caused some Reconnectors to be skipped, so their timers weren't shut down, and one was still running at the end of the test, triggering the DirtyReactorAggregateError.)
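A minimal, self-contained illustration of that hazard and of the usual fix (iterating over a snapshot of the list); this is not the actual foolscap code:

    # Mutating a list while iterating over it silently skips elements.
    reconnectors = ["r1", "r2", "r3", "r4"]
    for r in reconnectors:
        reconnectors.remove(r)      # stands in for "shut this Reconnector down"
    print(reconnectors)             # ['r2', 'r4'] -- two were never shut down

    # The usual fix: iterate over a copy, mutate the original.
    reconnectors = ["r1", "r2", "r3", "r4"]
    for r in list(reconnectors):
        reconnectors.remove(r)
    print(reconnectors)             # [] -- all of them were shut down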
comment:42 Changed at 2012-06-15T23:21:36Z by david-sarah@…
In 3ab0f33413de23dd:
comment:43 Changed at 2012-06-16T04:36:29Z by davidsarah
- Keywords review-needed added
Review needed for attachment:fix-1636-for-1.9.2.darcs.patch .
comment:44 Changed at 2012-06-16T04:38:39Z by davidsarah
warner: I took your suggestion to compare self.c0.nodeid and s.get_serverid(), which seems to work. Please test with your candidate foolscap 0.6.4 release.
comment:45 Changed at 2012-06-16T17:33:30Z by david-sarah <david-sarah@…>
In 3ab0f33413de23dd:
comment:46 Changed at 2012-06-16T18:20:43Z by david-sarah@…
In 323774698ee0a2e8:
Changed at 2012-06-16T19:06:39Z by davidsarah
[rebased for 1.9.2] After a server disconnects, make the IServer retain the dead RemoteReference, and continue to return it to anyone who calls get_rref(). This removes the need for callers to guard against receiving a None (as long as the server was connected at least once, which is always the case for servers returned by get_servers_for_psi(), which is how all upload/download code gets servers). Includes a test.
comment:47 Changed at 2012-06-21T19:33:29Z by david-sarah@…
- Resolution set to fixed
- Status changed from new to closed
In 5520/1.9.2:
comment:48 Changed at 2012-06-29T13:22:40Z by davidsarah
- Keywords review-needed removed