#1679 closed defect (fixed)

Nondeterministic NoSharesError for direct CHK download in 1.8.3 and 1.9.1

Reported by: nejucomo
Owned by: nejucomo
Priority: major
Milestone: 1.10.0
Component: code-network
Version: 1.8.3
Keywords: download LeastAuthority.com cache
Cc:
Launchpad Bug:

Description

Hi,

I have two separate Tahoe clients on two separate hosts. One is version 1.8.3 and is able to retrieve a capability; the other is 1.9.1 and says (in the browser):

NoSharesError: no shares could be found. Zero shares usually indicates a corrupt URI, or that no servers were connected, but it might also indicate severe corruption. You should perform a filecheck on this object to learn more.

The full error message is:
no shares (need 1). Last failure: None

Both web APIs show the same introducer FURL with a connected status, and the same storage servers with connected status.

The full version blobs from the UI:

This one *succeeds*:

allmydata-tahoe: 1.8.3, foolscap: 0.6.1, pycryptopp: 0.5.29, zfec: 1.4.5, Twisted: 11.0.0, Nevow: 0.10.0, zope.interface: unknown, python: 2.7.2+, platform: Linux-Ubuntu_11.10-x86_64-64bit_ELF, pyOpenSSL: 0.12, simplejson: 2.1.6, pycrypto: 2.3, pyasn1: unknown, mock: 0.7.1, sqlite3: 2.6.0 [sqlite 3.7.7], setuptools: 0.6 [distribute] Warning: dependency twisted-web (version 11.0.0) found by pkg_resources not found by import. Warning: dependency twisted-core (version 11.0.0) found by pkg_resources not found by import.

This one gives the NoSharesError response page:

allmydata-tahoe: 1.9.1, foolscap: 0.6.3, pycryptopp: 0.5.29, zfec: 1.4.22, Twisted: 11.1.0, Nevow: 0.10.0, zope.interface: unknown, python: 2.7.2+, platform: Linux-debian_wheezy/sid-i686-32bit_ELF, pyOpenSSL: 0.13, simplejson: 2.2.0, pycrypto: 2.4, pyasn1: unknown, mock: 0.7.2, sqlite3: 2.6.0 [sqlite 3.7.7], setuptools: 0.6 [distribute]

Attachments (1)

fix-1679.diff.txt (1.8 KB) - added by zooko at 2012-10-28T03:48:19Z.


Change History (35)

comment:1 Changed at 2012-02-23T19:58:55Z by nejucomo

Some more details:

The functioning LAFS client (v1.8.3) lives behind a public proxy:

https://con.struc.tv/uri/URI:DIR2-RO:7qyqs2kopmglnstoquia4stxta:iz5jmy5cq34spq5h47my55mmb4uucfolbnfb6lxxe545su26cgta/index.html

You can poke at that if you're interested.

By using curl against the malfunctioning client (v1.9.1) with the same URL path, I get the NoSharesError. However, I *can* successfully make a request to the containing directory /uri/URI:DIR2-RO:7qyqs2kopmglnstoquia4stxta:iz5jmy5cq34spq5h47my55mmb4uucfolbnfb6lxxe545su26cgta/ - so I made a t=json request on the directory to find the file's capability, and made another curl request to that, which also failed in the same manner.

*Importantly*, the public gateway (v 1.8.3) also results in the same error when the request is made by capability. Example:

https://con.struc.tv/uri/URI:CHK:kqnb2i6tufbyrou4klft7qn36y:5a6alvxpu2vsw45aiqfjriukxaoxno4har2hz2s2isczyppkuhja:1:1:64

It appears the first request (by directory path, through the v1.8.3 gateway) succeeds and the latter (by capability, through the same 1.8.3 gateway) fails consistently. Can anyone else verify this?

Meanwhile, *both* requests fail against my 1.9.1 client.

comment:2 Changed at 2012-02-26T05:59:40Z by zooko

  • Keywords regression download added
  • Priority changed from major to critical

comment:3 Changed at 2012-02-28T21:53:48Z by nejucomo

I just verified that the same 1.9.1 gateway described above *now* has the behavior described for the 1.8.3 gateway. Specifically:

$ curl 'http://127.0.0.1:3456/uri/URI:DIR2-RO:7qyqs2kopmglnstoquia4stxta:iz5jmy5cq34spq5h47my55mmb4uucfolbnfb6lxxe545su26cgta/?t=json'
[
 "dirnode",
 {
  "mutable": true,
  "verify_uri": "URI:DIR2-Verifier:l6onu7huqqij624tbj2sbbuyve:iz5jmy5cq34spq5h47my55mmb4uucfolbnfb6lxxe545su26cgta",
  "ro_uri": "URI:DIR2-RO:7qyqs2kopmglnstoquia4stxta:iz5jmy5cq34spq5h47my55mmb4uucfolbnfb6lxxe545su26cgta",
  "children": {
   "helloworld.txt": [
    "filenode",
    {
     "mutable": false,
     "format": "CHK",
     "ro_uri": "URI:LIT:jbswy3dpeblw64tmmqqqu",
     "metadata": {
      "tahoe": {
       "linkmotime": 1327402892.605141,
       "linkcrtime": 1327402892.605141
      }
     },
     "size": 13
    }
   ],
   "index.html": [
    "filenode",
    {
     "format": "CHK",
     "verify_uri": "URI:CHK-Verifier:cmszlggyh6ljvy4kzfqdfbdtx4:5a6alvxpu2vsw45aiqfjriukxaoxno4har2hz2s2isczyppkuhja:1:1:647",
     "ro_uri": "URI:CHK:kqnb2i6tufbyrou4klft7qn36y:5a6alvxpu2vsw45aiqfjriukxaoxno4har2hz2s2isczyppkuhja:1:1:647",
     "mutable": false,
     "size": 647,
     "metadata": {
      "tahoe": {
       "linkmotime": 1327618975.067045,
       "linkcrtime": 1327405154.386591
      }
     }
    }
   ]
  }
 }
]

And now, requesting the "index.html" cap directly:

$ curl 'http://127.0.0.1:3456/uri/URI:CHK:kqnb2i6tufbyrou4klft7qn36y:5a6alvxpu2vsw45aiqfjriukxaoxno4har2hz2s2isczyppkuhja:1:1:647'
NoSharesError: no shares could be found. Zero shares usually indicates a corrupt URI, or that no servers were connected, but it might also indicate severe corruption. You should perform a filecheck on this object to learn more.

The full error message is:
no shares (need 1). Last failure: None

I can verify the version with:

$ curl 'http://127.0.0.1:3456/' | grep 'My versions'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  5657    0  5657    0     0   372k      0 --:--:-- --:--:-- --:--:--  394k
    <tr><th>My versions:</th> <td>allmydata-tahoe: 1.9.1,

comment:4 Changed at 2012-03-05T22:46:48Z by zooko

  • Milestone changed from undecided to 1.9.2

This is (probably) a regression in 1.8→1.9. I propose we put it into the "1.9.2" Milestone.

comment:5 Changed at 2012-03-06T01:26:52Z by nejucomo

Why do you suggest it's a regression? Both client versions display similar behavior at different times.

It could be a storage server or introducer regression.

Again, to summarize the most salient aspects:

  • The failure is transient.
  • The failure occurs when retrieving content with $CHK_CAP directly and sometimes via $DIR2_RO_CAP/index.html.
  • Sometimes the $CHK_CAP retrieval fails while the $DIR2_RO_CAP/index.html succeeds on the same gateway!
  • Both 1.8.3 and 1.9.1 gateways have displayed similar behavior at different times.

It's very mysterious to me that, of two different retrieval methods for the same content, one fails while the other succeeds.

Maybe the retrieval method is a red herring, and instead there is some network connectivity, caching, or other transient state that changes in a poorly handled/detected edge case.

comment:6 Changed at 2012-03-06T06:39:06Z by zooko

  • Keywords regression removed
  • Milestone changed from 1.9.2 to soon

Oh, this is my mistake. In your original report you said that it happened with a 1.9.1 client but not a 1.8.3 client, right? But later you figured out that it also happens with the 1.8.3 client? Removing the regression keyword and moving it out of the 1.9.2 Milestone.

comment:7 Changed at 2012-03-06T18:01:28Z by davidsarah

  • Component changed from unknown to code
  • Keywords heisenbug added
  • Owner changed from nobody to somebody
  • Summary changed from Potential download regression 1.9.1 versus 1.8.3 to Nondeterministic NoSharesFound error for direct CHK download in 1.8.3 and 1.9.1
  • Version changed from 1.9.1 to 1.8.3

comment:8 Changed at 2012-03-06T18:02:50Z by davidsarah

  • Summary changed from Nondeterministic NoSharesFound error for direct CHK download in 1.8.3 and 1.9.1 to Nondeterministic NoSharesError for direct CHK download in 1.8.3 and 1.9.1

comment:9 Changed at 2012-04-09T21:55:53Z by davidsarah

  • Keywords lae added

comment:10 Changed at 2012-05-07T22:57:00Z by zooko

Here's a link to a Story in the Least Authority Enterprises product backlog to investigate and fix this: http://www.scrumdo.com/projects/project/s3/iteration/3801#story_169884 username=scrumdoguest, password=scrumdoguest. Warning: I think if we move this Story out of the Product Backlog into a Sprint or elsewhere then that link will break.

comment:11 Changed at 2012-05-08T20:24:49Z by zooko

Aha. Trying to reproduce this, I see something that might explain one of the observations. It looks like the immutable file cap was accidentally truncated from https://con.struc.tv/uri/URI:CHK:kqnb2i6tufbyrou4klft7qn36y:5a6alvxpu2vsw45aiqfjriukxaoxno4har2hz2s2isczyppkuhja:1:1:647 to https://con.struc.tv/uri/URI:CHK:kqnb2i6tufbyrou4klft7qn36y:5a6alvxpu2vsw45aiqfjriukxaoxno4har2hz2s2isczyppkuhja:1:1:64, which would explain why the link directly to the immutable file cap never worked even when the link through its parent directory sometimes did.

comment:12 Changed at 2012-07-20T18:35:00Z by zooko

nejucomo:

Please help us narrow this down. First of all, in some of the tests you did earlier, you appear to have been using an accidentally truncated cap -- ending with ":64" instead of ":647". Please try your tests again and make sure to use the correct filecap. Secondly, since you initially had these problems, we've released Tahoe-LAFS v1.9.2, which fixes many problems in the handling of mutable files. Please try your tests again with Tahoe-LAFS v1.9.2 instead of 1.9.1. (Trying the tests again with 1.8.3 and the corrected filecap would also be illuminating.)

Thank you very much for your help!

comment:13 Changed at 2012-07-20T18:49:00Z by zooko

P.S. Oh, and also since your initial bug report we have fixed two issues in the Tahoe-LAFS-on-S3 service which might have caused your problem or at least interacted with it. So that is a third reason to re-run your tests and see if it works now.

comment:14 Changed at 2012-10-26T01:04:02Z by Zancas

  • Owner changed from somebody to nejucomo

Per zooko's "P.S.", the linked changes have been implemented:

https://leastauthority.zendesk.com/entries/21800143-500-internal-error-from-s3-fixed

It's quite possible that this fixed the bug. I'm reluctant to close the ticket without confirmation from nejucomo.

comment:15 Changed at 2012-10-26T02:48:05Z by zooko

I don't think the 5xx errors from S3 could have been the sole cause of this bug, because this bug happened repeatably and those S3 5xx errors were rare. Zancas measured them -- I forget the exact figure, but I think about 3% of requests failed. Anyway, I think that at least in some configurations nejucomo's file would fail to be retrieved *every time*, which rules out the occasional S3 failures as the cause of that bug.

However, this does sound a lot like the bug that thedod recently reported, which was 100% reproducible and which had the strange feature that a "check --verify" on a file would succeed but a "get" on that same file would fail (both with 100% reproducibility). If thedod gives permission, I'll attach his log showing this happening. That bug of thedod's went away when he upgraded the storage client (gateway) from Tahoe-LAFS v1.8.3 to Tahoe-LAFS v1.9.2.

comment:16 Changed at 2012-10-26T21:30:32Z by zooko

Update: thedod's problem was *not* fixed by upgrading to Tahoe-LAFS v1.9.2. It is still happening, with Tahoe-LAFS v1.9.2.

comment:17 Changed at 2012-10-28T03:47:48Z by zooko

Okay, I'm leaving this under this ticket on the assumption that nejucomo's bug was the same as thedod's. If nejucomo could confirm or disconfirm that, I would appreciate it!

Thedod's bug proved to be due to caching of filenode objects in nodemaker.py. An immutable filenode object fails to download (due to thedod's internet connection being flaky), and then a new download of that same file is attempted. For some reason the old filenode object is still in place (I don't know why), so the caching causes the new download to reuse that old filenode object, and it immediately fails too.

Brian diagnosed it in the Tahoe-LAFS Sixth Birthday Party hangout. I wrote a patch which thedod is testing on his live site: https://dubiousdod.org/uri/URI:DIR2-RO:4ihykxpazbbd3b4y527o5ibeom:edmce7aavjq7lh7git6p4bm5errgc4t65ndewhpivz7pqjj37l7q/jetpack.html#[[28%20October%202012%5D%5D

The patch, which I'll attach, solves it by excluding immutables from the filenode cache. Brian and I searched for ways that mutables could suffer from the same sort of problem and concluded that there aren't any, and mutables need the cache for making the write-lock work, so we're leaving it with mutables being cached.
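In outline, the change looks something like the following. This is only a sketch with hypothetical names (NodeMakerSketch, _node_cache, _build_node), not the actual nodemaker.py code; see the attached fix-1679.diff.txt for the real patch.

    # Sketch only: hypothetical names, not the real nodemaker.py.
    # Post-patch behavior: only mutable filenodes are cached.
    class NodeMakerSketch:
        def __init__(self):
            # cache of filenode objects, keyed by cap string
            self._node_cache = {}

        def create_from_cap(self, cap):
            node = self._node_cache.get(cap)
            if node is None:
                node = self._build_node(cap)
                if node.is_mutable():
                    # mutables stay cached: the cache is what makes the
                    # per-file write-lock work
                    self._node_cache[cap] = node
                # immutables are deliberately NOT cached, so a filenode that
                # failed once (e.g. while the network was down) cannot be
                # handed out again for a later download of the same cap
            return node

        def _build_node(self, cap):
            raise NotImplementedError  # placeholder for real node construction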

There could be much nicer solutions to this, which would keep caching the filenode objects but make them resume work after a failure: either when they've reached the end of their rope (I don't like that approach so much), or by constantly renewing their work as part of how they operate (serendipity, trying to call back servers which used to work really well but then disconnected, etc.) (I like that approach a bit better), or in response to a user initiating a new download (I like that approach a lot: when a user indicates interest, the program should pull out all the stops!).

We still need to write a unit test that demonstrates thedod's bug happening, confirm that this patch makes that test go from red to green, and commit this to trunk; then we can talk about whether we want to put out a 1.9.3 release just for this bugfix, or make people wait for 1.10. This hits LeastAuthority customers, because they have only one storage server. Other users probably have this problem masked by automatic fail-over to other servers when the first few servers get stuck in this state.

Also other users probably have a more reliable internet connection than thedod has.

Changed at 2012-10-28T03:48:19Z by zooko

comment:18 Changed at 2012-10-28T03:48:47Z by zooko

  • Component changed from code to code-network

comment:19 Changed at 2012-10-28T03:49:16Z by zooko

  • Keywords test-needed review-needed added

comment:20 follow-up: Changed at 2012-10-30T18:00:59Z by nejucomo

Can I test this manually without waiting for or writing a unittest?

In order to do so, I need some more clarification:

Where does the invalid cache live? Is it in the downloading gateway? Does that mean if the gateway cannot connect to the storage server during an immutable download, then the cache records this fact and is not correctly bypassed later?

If all those are true, a manual test would be:

  a. Pick a known-uploaded CHK cap which has *not* been recently downloaded by the target gateway.
  b. Prevent the gateway from connecting to relevant storage servers. (For LAE service this is easier because there's only one storage node; ifdown $iface can work for a local gateway test, or adding a special temporary black-hole route for the storage node IP might work for a remote gateway.)
  c. Attempt to fetch that cap on the network-impaired gateway.
  d. Repair the network of the gateway.
  e. Attempt to fetch that cap again. If the fetch fails, this is evidence of the bug. If not, there's some flaw in these assumptions.
  f. If e. produces evidence of the bug, then stop that gateway, apply the patch, start the patched gateway and repeat steps a. through e. (with a *new* cap to help control the experiment).
  g. Publish the results of step e. in the first iteration (unpatched) and the second iteration (patched).

At the same time or afterwards, write a unittest.
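
For steps c and e of the protocol above, a small probe script could drive the gateway's web API. This is only a sketch: GATEWAY and CAP are placeholders, and it assumes the operator impairs and then repairs the network by hand between the two prompts.

    # Sketch of a manual probe for steps c and e. GATEWAY and CAP are
    # placeholders; adjust for the gateway under test. The operator is
    # expected to break the network before the first fetch and repair it
    # before the second.
    import urllib.request
    import urllib.error

    GATEWAY = "http://127.0.0.1:3456"   # local gateway web API (placeholder)
    CAP = "URI:CHK:..."                 # a known-uploaded, not-yet-fetched cap

    def fetch(label):
        url = "%s/uri/%s" % (GATEWAY, CAP)
        try:
            body = urllib.request.urlopen(url, timeout=60).read()
            print("%s: OK, %d bytes" % (label, len(body)))
        except urllib.error.HTTPError as e:
            # the gateway reports NoSharesError as an HTTP error page
            print("%s: HTTP %d: %s" % (label, e.code, e.read()[:200]))
        except Exception as e:
            print("%s: failed: %r" % (label, e))

    input("Impair the gateway's network, then press Enter... ")
    fetch("step c (network impaired)")
    input("Repair the network, then press Enter... ")
    fetch("step e (network repaired)")  # a failure here is evidence of the bug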

comment:21 in reply to: ↑ 20 Changed at 2012-10-31T10:02:55Z by zooko

Replying to nejucomo:

Can I test this manually without waiting for or writing a unittest?

In order to do so, I need some more clarification:

Where does the invalid cache live? Is it in the downloading gateway?

The cache is in the downloading gateway.

Does that mean if the gateway cannot connect to the storage server during an immutable download, then the cache records this fact and is not correctly bypassed later?

I think so. I just checked the code (source:git/src/allmydata/immutable/download/finder.py), which I think is at fault. It asks the storage broker for a list of connected servers when it starts, and then it tries to use those servers. I think if a server is excluded from that list by the storage broker because it isn't connected, or if the finder tries to use the server and gets an error (because the connection just failed), then the finder will never again try to use that server. The cache causes new downloads to reuse the same finder object.

If all those are true, a manual test would be:

  a. Pick a known-uploaded CHK cap which has *not* been recently downloaded by the target gateway.
  b. Prevent the gateway from connecting to relevant storage servers. (For LAE service this is easier because there's only one storage node; ifdown $iface can work for a local gateway test, or adding a special temporary black-hole route for the storage node IP might work for a remote gateway.)
  c. Attempt to fetch that cap on the network-impaired gateway.
  d. Repair the network of the gateway.
  e. Attempt to fetch that cap again. If the fetch fails, this is evidence of the bug. If not, there's some flaw in these assumptions.
  f. If e. produces evidence of the bug, then stop that gateway, apply the patch, start the patched gateway and repeat steps a. through e. (with a *new* cap to help control the experiment).
  g. Publish the results of step e. in the first iteration (unpatched) and the second iteration (patched).

At the same time or afterwards, write a unittest.

This sounds like a good protocol!
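
To illustrate the failure mode described above (the finder snapshotting the connected-server list once), here is a sketch with hypothetical names; it is not the real finder.py or filenode code.

    # Sketch of the suspected failure mode, not the real code. The finder
    # snapshots the connected-server list at construction time and never
    # refreshes it, so a cached filenode created while the network was down
    # keeps an empty server list and fails every later download too.
    class NoSharesError(Exception):
        pass

    class ShareFinderSketch:
        def __init__(self, storage_broker):
            # snapshot taken once, never refreshed
            self._servers = list(storage_broker.get_connected_servers())

        def find_shares(self):
            if not self._servers:
                raise NoSharesError("no shares (need 1). Last failure: None")
            return self._servers  # placeholder for the real share-finding logic

    class ImmutableFileNodeSketch:
        def __init__(self, cap, storage_broker):
            self.cap = cap
            # the finder lives as long as the filenode; if the filenode is
            # cached, so is a possibly-empty server snapshot
            self._finder = ShareFinderSketch(storage_broker)

        def download(self):
            return self._finder.find_shares()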

comment:22 Changed at 2012-11-02T01:17:29Z by davidsarah

  • Keywords cache added

comment:23 Changed at 2012-11-02T01:19:56Z by davidsarah

Note that at this point we've still not confirmed that nejucomo's issue is the same as thedod's. The latter is definitely due to the cache. Ideally, it would have been a separate ticket, which we could have marked as a duplicate if and when we confirmed they are the same issue.

comment:24 Changed at 2012-12-06T20:17:47Z by zooko

  • Milestone changed from soon to 1.10.0

We need to release this fix. Into Tahoe-LAFS v1.10!

comment:25 Changed at 2012-12-13T17:52:06Z by zooko

  • Keywords blocker added

comment:26 Changed at 2012-12-20T17:49:26Z by warner

  • Owner changed from nejucomo to warner
  • Status changed from new to assigned

I'll try to write a test for this so we can land it this weekend. I think it will go into test_client.NodeMaker, and just needs to create multiple nodes from the same URI and assert that they are not identical to each other.
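
Something along these lines, perhaps. This is only a sketch of the shape of such a test: the fixture and the create_from_cap-style call are assumptions, not the committed test.

    # Sketch only: hypothetical fixture and API, not the committed test.
    import unittest

    class NodeMakerCacheSketch(unittest.TestCase):
        def make_nodemaker(self):
            # placeholder: the real test would build a client/nodemaker here
            raise unittest.SkipTest("placeholder fixture")

        def test_immutable_nodes_not_cached(self):
            nodemaker = self.make_nodemaker()
            cap = "URI:CHK:..."  # placeholder immutable read cap
            n1 = nodemaker.create_from_cap(cap)
            n2 = nodemaker.create_from_cap(cap)
            # distinct node objects: failure state held by n1 (e.g. a stale
            # share finder) must not leak into a later download via n2
            self.assertIsNot(n1, n2)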

comment:27 Changed at 2013-01-04T20:28:55Z by zooko

  • Keywords LeastAuthority.com added; lae removed

comment:28 Changed at 2013-02-07T16:37:28Z by zooko

  • Keywords heisenbug test-needed removed

Made a unit test:

https://github.com/tahoe-lafs/tahoe-lafs/pull/30

Please review!

comment:29 Changed at 2013-02-09T00:14:35Z by davidsarah

  • Resolution set to fixed
  • Status changed from assigned to closed

Fixed in 45c64ed, with unit test in 8df3a07 and an extra comment in cdf1752.

comment:30 Changed at 2013-02-09T00:16:39Z by davidsarah

  • Resolution fixed deleted
  • Status changed from closed to reopened

Oops, we never actually established that nejucomo's issue was the same as thedod's. Reassigning to nejucomo to test that.

comment:31 Changed at 2013-02-09T00:17:12Z by davidsarah

  • Owner changed from warner to nejucomo
  • Priority changed from critical to major
  • Status changed from reopened to new

comment:32 Changed at 2013-03-07T01:36:27Z by zooko

This ticket is waiting on nejucomo to test whether their issue goes away when the patch is applied.

comment:33 Changed at 2013-03-07T16:07:53Z by zooko

  • Resolution set to fixed
  • Status changed from new to closed

nejucomo says that he isn't going to get around, in the next week, to manually investigating whether his issue is reproducible and whether this patch fixes it. Since we have a unit test, since thedod manually confirmed the fix, and since our understanding of the code is clear, I say we close this as fixed and make it stop being a blocker for 1.10.

comment:34 Changed at 2013-03-10T23:48:30Z by davidsarah

  • Keywords review-needed blocker removed

I disagree. thedod confirmed that the bug he saw was fixed. We have very little evidence to confirm the hypothesis that thedod's bug is the same as nejucomo's. (I really would have preferred that thedod's bug were a separate ticket.)
