#1636 closed defect (fixed)

Unhandled error in Deferred during retrieve

Reported by: killyourtv Owned by: warner
Priority: critical Milestone: 1.9.2
Component: code-network Version: 1.9.0
Keywords: regression Cc: zooko, killyourtv@…
Launchpad Bug:

Description

allmydata-tahoe: 1.9.0-r5387,
foolscap: 0.6.2,
pycryptopp: 0.5.29,
zfec: 1.4.22,
Twisted: 11.0.0,
Nevow: 0.10.0,
zope.interface: unknown,
python: 2.7.2+,
platform: Linux-debian_wheezy/sid-x86_64-64bit_ELF,
pyOpenSSL: 0.13,
simplejson: 2.3.0,
pycrypto: 2.4.1,
pyasn1: unknown,
mock: 0.7.2,
sqlite3: 2.6.0 [sqlite 3.7.9],
setuptools: 0.6c16dev3

...with patches from #1007, #1010, and #1628 applied.

When uploading files to the volunteer grid on I2P, I received the following exception:

2011-12-14 13:34:49+0000 [-] Unhandled error in Deferred:
2011-12-14 13:34:49+0000 [-] Unhandled Error
	Traceback (most recent call last):
	  File "$INSTALL_LOCATION/tahoe-lafs/src/allmydata/mutable/retrieve.py", line 610, in _download_current_segment
	    d = self._process_segment(self._current_segment)
	  File "$INSTALL_LOCATION/tahoe-lafs/src/allmydata/mutable/retrieve.py", line 638, in _process_segment
	    dl.addErrback(self._validation_or_decoding_failed, [reader])
	  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 308, in addErrback
	    errbackKeywords=kw)
	  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 286, in addCallbacks
	    self._runCallbacks()
	--- <exception caught here> ---
	  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 542, in _runCallbacks
	    current.result = callback(current.result, *args, **kw)
	  File "$INSTALL_LOCATION/tahoe-lafs/src/allmydata/mutable/retrieve.py", line 736, in _validation_or_decoding_failed
	    self._mark_bad_share(reader.server, reader.shnum, reader, f)
	  File "$INSTALL_LOCATION/tahoe-lafs/src/allmydata/mutable/retrieve.py", line 595, in _mark_bad_share
	    self.notify_server_corruption(server, shnum, str(f.value))
	  File "$INSTALL_LOCATION/tahoe-lafs/src/allmydata/mutable/retrieve.py", line 938, in notify_server_corruption
	    rref.callRemoteOnly("advise_corrupt_share",
	exceptions.AttributeError: 'NoneType' object has no attribute 'callRemoteOnly'

I do not see any flog files from the time of the error (and it took place when I was AFK).

At http://127.0.0.1:3456/status/ it's still visible in the "Active Operations" section, even though the event took place hours ago.

Mutable File Publish Status

    Started: 13:34:51 14-Dec-2011
    Storage Index: 7kkjs3zair343ujzz7ghzwx5si
    Helper?: No
    Current Size: 2972
    Progress: 0.0%
    Status: Pushing shares

Retrieve Results

    Encoding: 3 of 10
    Sharemap:
        0 -> Placed on [q5b52rmg], [bgahn3ft]
        4 -> Placed on [q5b52rmg]
        7 -> Placed on [5bqq6b7f]
        9 -> Placed on [jvgf7m73]
    Timings:
        Total: ()
            Setup: 6.2ms
            Encrypting: 310us (9.59MBps)
            Encoding: 264us (11.25MBps)
            Packing Shares: ()
                RSA Signature: 6.7ms
            Pushing: ()

For what it's worth, there are other failed upload attempts that are remaining in the list of "Active Operations" (even though they failed many, many hours ago). I'm retrying the batch upload; I'll try to generate the flog files if it errors out while near.

(The fact that uploads may fail occasionally on networks like I2P or Tor isn't unusual; that they're still shown as active long after failing is (and it's not something that I saw with 1.8.x)).

Attachments (4)

twistd.log.old.xz (1.9 KB) - added by killyourtv at 2011-12-16T14:31:18Z.
dead-rref.diff (3.0 KB) - added by warner at 2012-06-12T22:36:10Z.
retain rref even after disconnect
dead-rref-test.diff (5.0 KB) - added by warner at 2012-06-14T23:41:53Z.
same patch, plus a test
fix-1636-for-1.9.2.darcs.patch (104.1 KB) - added by davidsarah at 2012-06-16T19:06:39Z.
[rebased for 1.9.2] After a server disconnects, make the IServer retain the dead RemoteReference?, and continue to return it to anyone who calls get_rref(). This removes the need for callers to guard against receiving a None (as long as the server was connected at least once, which is always the case for servers returned by get_servers_for_psi(), which is how all upload/download code gets servers). Includes test

Download all attachments as: .zip

Change History (53)

comment:1 Changed at 2011-12-14T21:30:00Z by zooko

  • Cc zooko added

comment:2 follow-up: Changed at 2011-12-16T14:30:41Z by killyourtv

I don't know how to best explain this (it may be related to the regression being tackled in #1628), but:

During my attempted re-upload, my system became _really_ sluggish. After what seemed like an eternity I managed to get "htop" running. The system load was at 35 with the tahoe process using ~1.2GB of RAM.

Unfortunately, after 15 minutes or so of trying to get back to a prompt (or ssh in), I grew impatient and hit the power switch.

It may not be very helpful but attached is a snippet of twistd.log from that period. The fun started at 18:43:28 UTC. The logging stopped at 18:44:48 UTC. The attached file has 49679 lines.

There isn't anything in ~/.tahoe/logs/incidents from that time.


I haven't seen any "Unhandled error in Deferred:" entries in twistd.log from my use of 1.8.x.

Last edited at 2011-12-16T14:33:18Z by killyourtv (previous) (diff)

Changed at 2011-12-16T14:31:18Z by killyourtv

comment:3 Changed at 2011-12-16T16:09:54Z by zooko

  • Keywords regression added
  • Milestone changed from undecided to 1.10.0
  • Priority changed from major to critical

I'm promoting this to priority critical and adding the keyword regression and putting it into the Milestone 1.10.0.

comment:4 Changed at 2011-12-16T16:53:46Z by zooko

  • Milestone changed from 1.10.0 to 1.9.1

comment:5 in reply to: ↑ 2 ; follow-up: Changed at 2011-12-17T05:29:02Z by davidsarah

Replying to killyourtv:

I don't know how to best explain this (it may be related to the regression being tackled in #1628), ...

I can't see any evidence that it is the same bug. What makes you suspect that it is?

(#1583 seems to be the same bug as #1628, but they have different symptoms to this one.)

comment:6 Changed at 2011-12-28T23:58:22Z by warner

Ok, I think we have two things going on here:

  • notify_server_corruption needs to check if the rref it gets is None before invoking callRemoteOnly on it. This was probably exposed by my IServer refactoring, since before that, the servermap was holding actual RemoteReference objects, and now it just holds IServer objects (from which you can get a RemoteReference, if it is still connected, with .get_rref()).
  • certain errors during the retrieve code get dropped, and the retrieve hangs. To fix this, the control flow needs to be carefully structured to make sure that all exceptions will get caught in a Deferred chain that leads into the main Retrieve deferred.

Zooko is going to write a patch for the first problem. Brian will walk through the code to fix the second problem.

comment:7 follow-up: Changed at 2011-12-29T00:20:14Z by zooko

I believe the current state of this ticket is blocked on someone verifying that it occurs with upstream Tahoe-LAFS (no i2p-specific patches or other patches that aren't in trunk).

Is that right?

killyourtv appears to be a reliable reporter of bugs so far -- his reports are low on speculation and include actual facts like the twistd file...

Okay, just had a conversation with Brian and David-Sarah on IRC and it seems clear that there is a bug in the code for mutable files inasmuch as it assumes that the server object's .rref member will be an object, but instead that may be None due to normal, inevitable network entropy (here).

There are two possible fixes:

  1. Find all places that call .get_rref make sure they sensibly handle the case that they get None.
  2. Change the server instance to keep a RemoteReference object in .rref even after it has been notified that the connection has been disconnected and that object will raise DeadReferenceError if it is used.

I'll investigate both approaches and produce a patch for one or each of them.

comment:8 Changed at 2011-12-29T00:20:34Z by zooko

  • Owner changed from nobody to zooko
  • Status changed from new to assigned

comment:9 Changed at 2012-01-08T06:19:34Z by warner

I've cleaned up the control flow in retrieve.py, in a series of 6 patches ending in 893eea849ba7cf78 / 8db39f0e. I still need to do one more careful inspection, but I think it's pretty solid now.

On the .get_rref front, I had a thought. Since my long-term goal with IServer is to remove knowledge of server protocols from the upload/download code and move it into the IServer class (to facilitate having multiple backends), why not make it IServer.callRemote(..) instead of IServer.get_rref().callRemote(..)? The current IServer class would just do:

    def callRemote(self, methname, *args, **kwargs):
        if self.rref:
            return self.rref.callRemote(methname, *args, **kwargs)
        else:
            return defer.fail(DeadReferenceError())

It'd be easy to grep for that callRemote later, when it's time to hoist the abstraction later and turn those calls into putShare or whatever.

comment:10 in reply to: ↑ 5 Changed at 2012-02-14T00:43:15Z by killyourtv

Replying to davidsarah:

Replying to killyourtv:

I don't know how to best explain this (it may be related to the regression being tackled in #1628), ...

I can't see any evidence that it is the same bug. What makes you suspect that it is?

(#1583 seems to be the same bug as #1628, but they have different symptoms to this one.)

Sorry, just saw this reply. :|

It may have been stupid speculation on my part. I don't remember why I suspected it might be. Disregard (as you likely have by now).

comment:11 in reply to: ↑ 7 ; follow-up: Changed at 2012-02-14T01:14:09Z by killyourtv

  • Cc killyourtv@… added

Replying to zooko:

[...]

killyourtv appears to be a reliable reporter of bugs so far -- his reports are low on speculation

[...]

That was until comment:2 above. :| *Sigh*.

Anyhow...I'm back testing with was the latest git revision as of a few days ago (the webui shows 1.9.0.post77, rev 2da8339dd762aa96e73b1c1d98e217df43802418).

allmydata-tahoe: 1.9.0.post77,
foolscap: 0.6.2,
pycryptopp: 0.5.29,
zfec: 1.4.22,
Twisted: 11.1.0,
Nevow: 0.10.0,
zope.interface: unknown,
python: 2.7.2+,
platform: Linux-debian_wheezy/sid-x86_64-64bit_ELF,
pyOpenSSL: 0.13,
simplejson: 2.3.2,
pycrypto: 2.5,
pyasn1: unknown,
mock: 0.7.2,
sqlite3: 2.6.0 [sqlite 3.7.10],
setuptools: 0.6c16dev3

Once again, patches from #1007, #1010, and #68 have been applied. Everything, so far, seems to be working fine. Or...I should say that things are much better than they had been. It's been running for ~36 hours now.

The only problems I've had (a couple error 500s during upload or repair) can probably be attributed to our not-always-as-stable-as-we'd-like grid. When the failures occurred the tasks remained on the status page, such as

publish 	3l6y3gkgdxxwwsn2mwtdmssjse 	No 	0B 	0.0% 	Started
publish 	zyt7tuse472ikqnjc7dmw3ai4y 	No 	589B 	0.0% 	Pushing shares

These entries are from around two hours ago (it currently just after 1am UTC). These tasks are no longer running but they're still in the "Active Operations" section.

    Started: 22:53:57 13-Feb-2012
    Storage Index: 3l6y3gkgdxxwwsn2mwtdmssjse
    Helper?: No
    Current Size: 0
    Progress: 0.0%
    Status: Started

Retrieve Results

    Encoding: 5 of 10
    Sharemap:
        0 -> Placed on [3t4enhba]
        1 -> Placed on [eyp5vu5v], [bgahn3ft]
        2 -> Placed on [3ejzcskd]
        3 -> Placed on [uxqmbg6c]
        5 -> Placed on [jvgf7m73]
        6 -> Placed on [3teqr2cf], [n64vwepc]
        7 -> Placed on [q5b52rmg]
        8 -> Placed on [ql3cupul], [p6463dyg]
        9 -> Placed on [poeonyl6]
    Timings:
        Total: ()
            Setup:
            Encrypting: 0us ()
            Encoding: 0us ()
            Packing Shares: ()
                RSA Signature:
            Pushing: ()
    Started: 23:18:25 13-Feb-2012
    Storage Index: zyt7tuse472ikqnjc7dmw3ai4y
    Helper?: No
    Current Size: 589
    Progress: 0.0%
    Status: Pushing shares

Retrieve Results

    Encoding: 3 of 10
    Sharemap:
        0 -> Placed on [p6463dyg], [3t4enhba]
        1 -> Placed on [n64vwepc]
        2 -> Placed on [basz6g7o]
        3 -> Placed on [5mefqfis]
        4 -> Placed on [eyp5vu5v]
        5 -> Placed on [3ejzcskd]
        6 -> Placed on [jvgf7m73]
        7 -> Placed on [vvnvgne4]
        8 -> Placed on [iqhk5e2t]
        9 -> Placed on [uxqmbg6c]
    Timings:
        Total: ()
            Setup: 5.7ms
            Encrypting: 283us (2.08MBps)
            Encoding: 151us (3.90MBps)
            Packing Shares: ()
                RSA Signature: 6.5ms
            Pushing: ()

This is the only entry in twistd.log from around that time:

2012-02-13 23:37:48+0000 [-] Unhandled error in Deferred:
2012-02-13 23:37:48+0000 [-] Unhandled Error
        Traceback (most recent call last):
          File "/usr/lib/python2.7/dist-packages/nevow/flat/twist.py", line 42, in cb
            _drive(iterable, finished)
          File "/usr/lib/python2.7/dist-packages/nevow/flat/twist.py", line 26, in _drive
            finished.callback('')
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 362, in callback
            self._startRunCallbacks(result)
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 458, in _startRunCallbacks
            self._runCallbacks()
        --- <exception caught here> ---
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 545, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/usr/lib/python2.7/dist-packages/nevow/appserver.py", line 181, in _cbFinishRender
            self.finishRequest(  True )
          File "/usr/lib/python2.7/dist-packages/nevow/appserver.py", line 176, in finishRequest
            server.Request.finish(self)
          File "/usr/lib/python2.7/dist-packages/twisted/web/http.py", line 866, in finish
            "Request.finish called on a request after its connection was lost; "
        exceptions.RuntimeError: Request.finish called on a request after its connection was lost; use Request.notifyFinish to keep track of this

If I'm around if/when it happens again I'll generate an incident report.

All that said, things look *much better*.

comment:12 in reply to: ↑ 11 ; follow-up: Changed at 2012-02-14T02:51:32Z by zooko

Replying to killyourtv:

The only problems I've had (a couple error 500s during upload or repair) can probably be attributed to our not-always-as-stable-as-we'd-like grid. When the failures occurred the tasks remained on the status page, such as

Hrm. I don't think bad behavior on the part of the servers should ever cause a 5xx error code from the gateway. Maybe I'm wrong about that, but at the very least the gateway should emit a clear explanation of what went wrong. Please report any 5xx errors you see so we can figure out if they actually represent bugs in your gateway or if they represent bad behavior on the part of the servers. In the latter case, the gateway ought to explain clearly in the error message what went wrong.

comment:13 in reply to: ↑ 12 Changed at 2012-02-14T10:34:38Z by killyourtv

Replying to zooko:

[...]

Please report any 5xx errors you see so we can figure out if they actually represent bugs in your gateway or if they represent bad behavior on the part of the servers. In the latter case, the gateway ought to explain clearly in the error message what went wrong.

Filed #1675.

comment:14 Changed at 2012-03-31T23:00:18Z by davidsarah

  • Owner changed from zooko to warner
  • Status changed from assigned to new

comment:9 seems to suggest that this is fixed. Is that right? Do we have adequate tests for the original problem?

comment:15 follow-up: Changed at 2012-03-31T23:04:41Z by davidsarah

Wait, I'm confused. The description says that the error occurs on "upload" (i.e. publish), but Brian's patches mentioned in comment:9 only changed the retrieve code.

comment:16 Changed at 2012-04-20T09:11:39Z by killyourtv

  • Resolution set to fixed
  • Status changed from new to closed

For what it's worth, I cannot reproduce this anymore (current git) so I think this may have indeed been fixed.

allmydata-tahoe: 1.9.0.post133.dev0,
foolscap: 0.6.3,
pycryptopp: 0.6.0.1206569328141510525648634803928199668821045408958,
zfec: 1.4.24,
Twisted: 11.1.0,
Nevow: 0.10.0,
zope.interface: unknown,
python: 2.7.3rc2,
platform: Linux-debian_wheezy/sid-x86_64-64bit_ELF,
pyOpenSSL: 0.13,
simplejson: 2.5.0,
pycrypto: 2.5,
pyasn1: unknown,
mock: 0.8.0,
sqlite3: 2.6.0 [sqlite 3.7.11],
setuptools: 0.6c16dev3

Everything has been working wonderfully during the last week that I've been on this version. Thanks for all of your (collective) work on this. :)

comment:17 Changed at 2012-04-22T18:10:08Z by zooko

  • Keywords review-needed added

Thanks for the update (comment:16), kytv. I hope someone who understands the code better than I can tell us how likely it is that some specific patch or set of patches fixed this specific problem. If not, I'll try to dig into the question myself at some point. Adding the keyword "review-needed" to indicate that we need someone to look into the source code and try to connect this particular issue with some specific source code changes.

comment:18 Changed at 2012-04-22T18:10:23Z by zooko

  • Resolution fixed deleted
  • Status changed from closed to reopened

comment:19 Changed at 2012-04-22T20:06:32Z by zooko

Just to explain why I re-opened this ticket:

While having a report from a user like kytv, who actually experiences the bug in the wild, is extremely valuable, and I would prefer not to close a ticket without such a report, also having a source-code-oriented explanation of what the bug was and how it was fixed is something I would prefer not to close a ticket without.

Bugs are sneaky things, and I don't believe they are really gone until I see the bodies.

comment:20 Changed at 2012-04-22T22:55:25Z by killyourtv

OK. Should I refrain from closing bugs (that I reported) in the future? (I don't want to create "more work" for others--in fact, I was trying to help by closing it. ;))

Last edited at 2012-04-22T22:58:23Z by killyourtv (previous) (diff)

comment:21 Changed at 2012-04-22T23:53:13Z by davidsarah

As I implied in comment:15, I don't think the patches warner mentioned in comment:9 can have fixed this bug, since the bug occurs on upload, not retrieve. I don't know of any other changes that would have affected it, so it probably still exists.

killyourtv: in general closing a bug with "cannot reproduce" should be done with great reluctance, since bugs don't fix themselves, and we like to know what change fixed something. That resolution status is really intended for the case where the relevant code has changed so much since the bug was reported that the original report is probably no longer relevant. It's fine to close bugs after having verified that a particular fix works, though.

comment:22 Changed at 2012-05-11T00:47:00Z by zooko

  • Keywords review-needed removed

Okay since David-Sarah says (per comment:21) that the patches mentioned in comment:9 can't have fixed this issue, I'm removing the review-needed tag. (I was reading the most recent Tahoe-LAFS Weekly News and saw this ticket listed as being one of the review-needed tickets.)

comment:23 Changed at 2012-05-11T00:53:03Z by zooko

killyourtv: yeah, one practice that we have in this project which is different from a lot of other projects is that we focus on reducing uncertainty about bugs. A lot of projects promote a sort of vague notion of "degree of bugginess" with associated folklore, like "Oh yeah, version 13.1.6 was way more crashy than version 13.1.3, but then version 22.0.7 was way more stable...".

I prefer for our collective understanding of bugs to be as specific and reproducible as possible. Therefore, since you definitely spotted a bug originally (thank you!) and since we don't have a specific reason to believe that the bug is fixed, we should leave this ticket open.

Thanks again for your good bug reports!

comment:24 in reply to: ↑ 15 ; follow-up: Changed at 2012-05-13T20:39:41Z by warner

Replying to davidsarah:

Wait, I'm confused. The description says that the error occurs on "upload" (i.e. publish), but Brian's patches mentioned in comment:9 only changed the retrieve code.

Look at the traceback: the exception happens in Retrieve. I imagine that the WUI operation was to upload a file and add it to the directory, so a Retrieve operation occurs when it fetches the previous contents of the directory (so it can modify and publish the new contents). No (immutable) Upload code was implicated in the traceback.

comment:25 Changed at 2012-05-13T20:53:23Z by warner

  • Owner changed from warner to zooko
  • Status changed from reopened to new

I gather zooko has some specific idea in mind for how this bug can be closed, handing it off to him.

comment:26 in reply to: ↑ 24 Changed at 2012-05-14T00:52:13Z by davidsarah

  • Keywords test-needed added

Replying to warner:

Replying to davidsarah:

Wait, I'm confused. The description says that the error occurs on "upload" (i.e. publish), but Brian's patches mentioned in comment:9 only changed the retrieve code.

Look at the traceback: the exception happens in Retrieve.

Ah, I see. OK, I agree that the patches referenced in comment:9 address the problem in the traceback. However, I don't see any regression tests for that problem.

Last edited at 2012-05-16T22:42:27Z by davidsarah (previous) (diff)

comment:27 Changed at 2012-05-14T01:05:44Z by davidsarah

  • Summary changed from Unhandled error in Deferred while uploading (current darcs build) to Unhandled error in Deferred during retrieve

comment:28 Changed at 2012-05-14T08:00:52Z by zooko

I don't think the patches references in comment:9 fix this bug. I read the patches and don't see a fix for this. The path that led to this exception is still in place in trunk. That is: IServer.get_rref() can return None (because the connection-lost handling in storage_client.py set self.rref to None), and the send-corruption-report-to-server functionality doesn't test for None.

This is an instance of a problem that appears in many places in trunk, which problem and some proposed solutions are described in comment:6, comment:7, comment:11.

comment:29 Changed at 2012-05-16T22:02:18Z by zooko

  • Milestone changed from 1.9.2 to 2.0.0
  • Status changed from new to assigned

comment:30 Changed at 2012-05-16T22:02:35Z by zooko

  • Milestone changed from 2.0.0 to 1.9.2

comment:31 Changed at 2012-06-12T16:48:17Z by davidsarah

  • Owner changed from zooko to warner
  • Status changed from assigned to new

Changed at 2012-06-12T22:36:10Z by warner

retain rref even after disconnect

comment:32 Changed at 2012-06-12T22:40:12Z by warner

  • Keywords review-needed added
  • Owner changed from warner to davidsarah

attachment:dead-rref.diff implements zooko's second approach from comment:11, namely to make the IServer retain the dead RemoteReference after disconnect, and continue to return it to anyone who calls get_rref(). This removes the need for callers to guard against receiving a None (as long as the server was connected at least once, which is always the case for servers returned by get_servers_for_psi(), which is how all upload/download code gets servers).

This patch should remove the problem. A later cleanup step could go the callers of get_rref() and remove the now-redundant guards, but a better change would be to remove get_rref() and have callers do server.callRemote(), which will then be replaced by higher-level actions like server.putShare() as we move away from Foolscap.

comment:33 Changed at 2012-06-12T22:51:34Z by davidsarah

  • Status changed from new to assigned

Will review.

comment:34 Changed at 2012-06-14T19:55:53Z by zooko

I reviewed attachment:dead-rref.diff. I see no bugs in it, it is a small patch, and I (still) like the idea of doing it that way. It does make me mildly uncomfortable that there are no unit tests in the patch.

Changed at 2012-06-14T23:41:53Z by warner

same patch, plus a test

comment:35 Changed at 2012-06-14T23:43:24Z by warner

In the new patch (which includes a test), the assignment of node_key_s will eventually need to move out of the conditional init_storage(), which only happens if the node is running a storage server. (client-only nodes will get a node-key too). But that fix can happen after 1.10, and will happen naturally as part of the client-side accounting work.

comment:36 Changed at 2012-06-15T01:44:04Z by davidsarah

  • Component changed from unknown to code-network
  • Keywords reviewed added; test-needed review-needed removed

I reviewed dead-rref-test.diff and approve it for 1.9.2. Will apply.

comment:37 Changed at 2012-06-15T02:05:59Z by david-sarah@…

  • Resolution set to fixed
  • Status changed from assigned to closed

In 1b07d307619049dd:

After a server disconnects, make the IServer retain the dead RemoteReference?, and continue to return it to anyone who calls get_rref(). This removes the need for callers to guard against receiving a None (as long as the server was connected at least once, which is always the case for servers returned by get_servers_for_psi(), which is how all upload/download code gets servers). Includes test. fixes #1636

comment:38 Changed at 2012-06-15T02:09:02Z by david-sarah <david-sarah@…>

In 1b07d307619049dd:

After a server disconnects, make the IServer retain the dead RemoteReference?, and continue to return it to anyone who calls get_rref(). This removes the need for callers to guard against receiving a None (as long as the server was connected at least once, which is always the case for servers returned by get_servers_for_psi(), which is how all upload/download code gets servers). Includes test. fixes #1636

comment:39 Changed at 2012-06-15T02:58:26Z by davidsarah

  • Keywords reviewed removed
  • Resolution fixed deleted
  • Status changed from closed to reopened

It looks like this test failure is in the new test:

[ERROR]
Traceback (most recent call last):
Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 0x70d3cf8 [0.972191095352s] called=0 cancelled=0 Reconnector._timer_expired()>

allmydata.test.test_system.Connections.test_rref

comment:40 Changed at 2012-06-15T02:59:12Z by davidsarah

  • Owner changed from davidsarah to warner
  • Status changed from reopened to new

comment:41 Changed at 2012-06-15T20:44:26Z by warner

I traced this to a bug in foolscap, foolscap#196, and fixed it. I'll try to make a new release this weekend, and we can bump tahoe's dependency if you like. (it should only affect tests).

(the bug was that Tub.stopService was shutting off Reconnectors by iterating over the self.reconnectors list, but the Reconnector shutdown process was removing items from that list during the iteration. The shared-state hazard resulted in some Reconnectors being dropped, so their timers weren't shut down, so one was still running at the end of the test, triggering the DirtyReactorError)

comment:42 Changed at 2012-06-15T23:21:36Z by david-sarah@…

In 3ab0f33413de23dd:

Skip allmydata.test.test_system.Connections.test_rref unless we have foolscap >= 0.6.4, because of http://foolscap.lothar.com/trac/ticket/196 . refs #1636

comment:43 Changed at 2012-06-16T04:36:29Z by davidsarah

  • Keywords review-needed added

comment:44 Changed at 2012-06-16T04:38:39Z by davidsarah

warner: I took your suggestion to compare self.c0.nodeid and s.get_serverid(), which seems to work. Please test with your candidate foolscap 0.6.4 release.

comment:45 Changed at 2012-06-16T17:33:30Z by david-sarah <david-sarah@…>

In 3ab0f33413de23dd:

Skip allmydata.test.test_system.Connections.test_rref unless we have foolscap >= 0.6.4, because of http://foolscap.lothar.com/trac/ticket/196 . refs #1636

comment:46 Changed at 2012-06-16T18:20:43Z by david-sarah@…

In 323774698ee0a2e8:

Make the intent of the loop over servers in test_system.Connections.test_rref clearer, and able to be the same in 1.9.2 and trunk. Remove the now-unused node_key_s attribute of Client. refs #1636

Changed at 2012-06-16T19:06:39Z by davidsarah

[rebased for 1.9.2] After a server disconnects, make the IServer retain the dead RemoteReference?, and continue to return it to anyone who calls get_rref(). This removes the need for callers to guard against receiving a None (as long as the server was connected at least once, which is always the case for servers returned by get_servers_for_psi(), which is how all upload/download code gets servers). Includes test

comment:47 Changed at 2012-06-21T19:33:29Z by david-sarah@…

  • Resolution set to fixed
  • Status changed from new to closed

In 5520/1.9.2:

[rebased for 1.9.2] After a server disconnects, make the IServer retain the dead RemoteReference?, and continue to return it to anyone who calls get_rref(). This removes the need for callers to guard against receiving a None (as long as the server was connected at least once, which is always the case for servers returned by get_servers_for_psi(), which is how all upload/download code gets servers). Includes test, which is now the same as on trunk. fixes #1636 for 1.9.2.

comment:48 Changed at 2012-06-29T13:22:40Z by davidsarah

  • Keywords review-needed removed

comment:49 Changed at 2012-07-16T16:33:54Z by david-sarah@…

In 5882/cloud-backend:

[rebased for cloud-branch] After a server disconnects, make the IServer retain the dead RemoteReference?, and continue to return it to anyone who calls get_rref(). This removes the need for callers to guard against receiving a None (as long as the server was connected at least once, which is always the case for servers returned by get_servers_for_psi(), which is how all upload/download code gets servers). Includes test, which is now the same as on trunk. fixes #1636 for cloud-branch

Note: See TracTickets for help on using tickets.