#1921 closed defect (wontfix)

leasedb: NonExistentShareError: can't find [share] in `shares` table

Reported by: davidsarah Owned by: daira
Priority: major Milestone: 1.15.0
Component: code-storage Version: 1.9.2
Keywords: leasedb accounting-crawler blocks-cloud-merge Cc:
Launchpad Bug:

Description (last modified by zooko)

The attached incident was seen when running a test with a large number of uploads with the OpenStack cloud backend. The most relevant part seems to be:

local#18690 23:06:18.999: storage: allocate_buckets lp6ibjsxe6vf6ern6v3soepjh4
local#18691 23:06:18.999: OpenStack list objects request GET https://storage101.dfw1.clouddrive.com/v1/MossoCloudFS_a888b70a-4771-40c0-8403-e921454e03fd/test?format=json&prefix=shares%2Flp%2Flp6ibjsxe6vf6ern6v3soepjh4%2F {'User-Agent': ['Tahoe-LAFS OpenStack client'], 'X-Auth-Token': ['be1fea33-921b-47be-b95b-99d4cc5139ea']}
local#18692 23:06:18.999: Starting factory <twisted.web.client._HTTP11ClientFactory instance at 0x4433830>
local#18693 23:06:19.504: OpenStack list objects response: 200 OK
local#18694 23:06:19.505: OpenStack list read 201 bytes, parsed as 1 items
local#18695 23:06:19.505: OpenStack get object request GET https://storage101.dfw1.clouddrive.com/v1/MossoCloudFS_a888b70a-4771-40c0-8403-e921454e03fd/test/shares/lp/lpcjuif2ixx6khivy6zxdmvofe/0 {'User-Agent': ['Tahoe-LAFS OpenStack client'], 'X-Auth-Token': ['be1fea33-921b-47be-b95b-99d4cc5139ea']}
local#18696 23:06:19.506: Starting factory <twisted.web.client._HTTP11ClientFactory instance at 0x442b320>
local#18697 23:06:19.506: Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x46b6c68>
local#18698 23:06:19.520: OpenStack list objects response: 200 OK
local#18699 23:06:19.521: OpenStack list read 2 bytes, parsed as 0 items
local#18700 23:06:19.523: Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x4433830>
local#18701 23:06:19.537: OpenStack put object request PUT https://storage101.dfw1.clouddrive.com/v1/MossoCloudFS_a888b70a-4771-40c0-8403-e921454e03fd/test/shares/lp/lp6ibjsxe6vf6ern6v3soepjh4/0 {'User-Agent': ['Tahoe-LAFS OpenStack client'], 'Content-Type': ['application/octet-stream'], 'X-Auth-Token': ['be1fea33-921b-47be-b95b-99d4cc5139ea']}
local#18702 23:06:19.537: Starting factory <twisted.web.client._HTTP11ClientFactory instance at 0x433fef0>
local#18703 23:06:20.020: OpenStack get object response: 200 OK
local#18704 23:06:20.022: share SI=lp6ibjsxe6vf6ern6v3soepjh4 shnum=0 unexpectedly disappeared [INCIDENT-TRIGGER]
local#18705 23:06:20.059: Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x442b320>
local#18706 23:06:20.062: OpenStack list objects request GET https://storage101.dfw1.clouddrive.com/v1/MossoCloudFS_a888b70a-4771-40c0-8403-e921454e03fd/test?format=json&prefix=shares%2Flq%2F {'User-Agent': ['Tahoe-LAFS OpenStack client'], 'X-Auth-Token': ['be1fea33-921b-47be-b95b-99d4cc5139ea']}
local#18707 23:06:20.062: Starting factory <twisted.web.client._HTTP11ClientFactory instance at 0x4341e60>
local#18708 23:06:20.521: OpenStack put object response: 201 Created
local#18709 23:06:20.527: an inbound callRemote that we [n4zt] executed (on behalf of someone else, TubID uzie) failed
local#18710 23:06:20.527:  reqID=6970, rref=<allmydata.storage.bucket.BucketWriter object at 0x31a9050>, methname=RIBucketWriter.close
local#18711 23:06:20.527:  args=[]
local#18712 23:06:20.527:  kwargs={}
local#18713 23:06:20.527:  the LOCAL failure was:
 FAILURE:
 [CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
   File "/usr/lib/python2.7/dist-packages/foolscap/eventual.py", line 26, in _turn
     cb(*args, **kwargs)
   File "/home/davidsarah/tahoe/bitsys/allmydata-tahoe-1.9.0.post363/src/allmydata/util/deferredutil.py", line 55, in _with_log
     op(res)
   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 368, in callback
     self._startRunCallbacks(result)
   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 464, in _startRunCallbacks
     self._runCallbacks()
 --- <exception caught here> ---
   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 551, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
   File "/home/davidsarah/tahoe/bitsys/allmydata-tahoe-1.9.0.post363/src/allmydata/storage/bucket.py", line 62, in _got_used_space
     self._account.add_or_renew_default_lease(storage_index, shnum)
   File "/home/davidsarah/tahoe/bitsys/allmydata-tahoe-1.9.0.post363/src/allmydata/storage/account.py", line 58, in add_or_renew_default_lease
     return self.add_or_renew_lease(storage_index, shnum, renewal_time, expiration_time)
   File "/home/davidsarah/tahoe/bitsys/allmydata-tahoe-1.9.0.post363/src/allmydata/storage/account.py", line 63, in add_or_renew_lease
     renewal_time, expiration_time)
   File "/home/davidsarah/tahoe/bitsys/allmydata-tahoe-1.9.0.post363/src/allmydata/storage/leasedb.py", line 250, in add_or_renew_leases
     raise NonExistentShareError(si_s, shnum)
 allmydata.storage.leasedb.NonExistentShareError: can't find SI='lp6ibjsxe6vf6ern6v3soepjh4' shnum=0 in `shares` table
 ]
local#18714 23:06:20.533: Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x433fef0>
local#18715 23:06:20.545: storage: aborting write to share None

I'll avoid speculation in the ticket description, but I don't think this is specific to OpenStack.

Attachments (1)

incident-2013-02-20--23-06-20Z-jprthci.flog.bz2 (9.6 KB) - added by davidsarah at 2013-02-21T01:10:40Z.

Download all attachments as: .zip

Change History (23)

comment:1 follow-up: Changed at 2013-02-21T01:35:31Z by davidsarah

Okay, now for the speculation:

Notice that the accounting crawler is running concurrently with a PUT. The accounting crawler is what is causing the list requests for prefix 'lp', 'lq', etc.

What triggers the incident is that the crawler sees a share that unexpectedly disappeared, i.e. it has a leasedb entry but no chunk object(s):

local#18704 23:06:20.022: share SI=lp6ibjsxe6vf6ern6v3soepjh4 shnum=0
                          unexpectedly disappeared [INCIDENT-TRIGGER]

Almost immediately afterward, the concurrent PUT request fails. That request is for the same share that the crawler saw "disappear". It seems as though what happened is that there was a race between the crawler and the share creation:

  1. leasedb entry is created for share, but it isn't stored yet
  2. crawler sees an entry with no stored share; deletes the entry
  3. the share creator goes to add a default lease and fails because the leasedb entry isn't there

I think this will only happen when the accounting crawler is looking at a prefix while a share is being uploaded to it (that is not difficult to reproduce if you do enough uploads).

[And sigh, I just noticed that I leaked the Auth-Token. It's fine, the Auth-Token is only valid for 24h and I'll delete the container contents after that anyway. But that does need fixing.]

comment:2 in reply to: ↑ 1 Changed at 2013-02-21T03:44:11Z by davidsarah

Replying to davidsarah:

[And sigh, I just noticed that I leaked the Auth-Token. It's fine, the Auth-Token is only valid for 24h and I'll delete the container contents after that anyway. But that does need fixing.]

This has been fixed; header values are no longer logged.

comment:3 Changed at 2013-02-27T23:53:05Z by zooko

  • Keywords crawlers added

comment:4 Changed at 2013-03-01T01:40:35Z by davidsarah

  • Keywords accounting-crawler added; openstack cloud-backend crawlers removed

I believe this was fixed in commits 4cd54e36 ("leasedb/accounting crawler: only treat stable shares as disappeared or unleased") and 9ebc0e8b ("OpenStack: fix a type error introduced by the fix to #1921") on the 1909-cloud-openstack branch. Note that this is not on the 1818-leasedb branch, and in general I need to review all patches on 1909-cloud-openstack 1819-cloud-merge to see whether they are applicable to 1818-leasedb. I have just noted this on #1818.

The problem was indeed not specific to OpenStack (or the cloud backend). The leasedb design doc had the correct design, which was for the accounting crawler to treat shares in states other than STABLE as leased, but that requirement had been missed in the implementation. That reminds me that we need better tests for edge cases in the accounting crawler.

I'll leave this ticket open until the fix is on the 1818-leasedb branch.

Last edited at 2013-11-21T22:47:40Z by daira (previous) (diff)

comment:5 Changed at 2013-03-03T22:48:26Z by davidsarah

I saw another instance of this after the patch that was supposed to have fixed it :-( I haven't investigated that yet.

comment:6 Changed at 2013-05-25T22:31:56Z by daira

  • Description modified (diff)

#1987 may be the same bug as this. I'm not marking them as duplicates because I'm not sure of that yet.

comment:7 Changed at 2013-07-04T18:48:32Z by zooko

  • Description modified (diff)

comment:8 Changed at 2013-07-04T19:14:00Z by daira

  • Keywords blocks-cloud-merge added
  • Owner changed from davidsarah to daira
  • Status changed from new to assigned

comment:9 Changed at 2013-07-22T20:51:55Z by daira

  • Milestone changed from soon to 1.12.0

comment:10 follow-up: Changed at 2013-11-14T18:23:08Z by zooko

If this is caused by a race between accounting-crawler and leasedb updates, then it would be fixed by #1833.

comment:11 in reply to: ↑ 10 ; follow-up: Changed at 2013-11-21T22:51:40Z by daira

Replying to zooko:

If this is caused by a race between accounting-crawler and leasedb updates, then it would be fixed by #1833.

#1833 proposes a long-running process that is scheduled to delete shares. (They can't be deleted synchronously because an HTTP DELETE is not synchronous.) This would presumably be based on, or directly copied from the current accounting crawler code, and so would probably have the same race conditions wrt other share operations.

comment:12 in reply to: ↑ 11 Changed at 2013-11-22T05:20:25Z by zooko

Replying to daira:

#1833 proposes a long-running process that is scheduled to delete shares. (They can't be deleted synchronously because an HTTP DELETE is not synchronous.) This would presumably be based on, or directly copied from the current accounting crawler code, and so would probably have the same race conditions wrt other share operations.

You're right, there could still be race conditions. However, wouldn't there be fewer possible race conditions, because now the decref (expire-lease or remove-lease) operation is atomic with the decide-to-rm operation, where with the current lease-crawler design, those two operations can race with each other?

comment:13 Changed at 2013-11-22T13:15:55Z by daira

Nitpick: "synchronous with" != "atomic with". Two things can be synchronous, but an error causes one of them to happen and the other not. The decref would only be synchronous with, not atomic with the decide-to-rm operation if #1833 were implemented. (This is why Noether's failure handling model is better: synchronous operations are always atomic! :-)

I don't know for sure whether #1833 would fix this particular race condition, but my intuition is that it wouldn't because the race isn't between the decref and the decide-to-rm.

Last edited at 2013-11-22T13:17:23Z by daira (previous) (diff)

comment:14 Changed at 2014-02-28T07:16:02Z by Jbenisek

Ok My 4th try at posting.

I would like to test this bug by uploading 1000's of files to the GRID. Please verify which server/client versions I should use.

comment:15 Changed at 2014-03-02T20:56:25Z by daira

This bug is relevant only to the cloud branch. To check out that branch, install git and then do:

git clone https://github.com/tahoe-lafs/tahoe-lafs.git cloud-branch
cd cloud-branch
git checkout 1819-cloud-merge
python setup.py test

Then you would need to set up an account at an OpenStack provider such as Rackspace or HP Cloud (if you want to reproduce the bug as originally discovered) or some other supported cloud service. There is some documentation on that in here, but you'll probably need additional help (ask on the #tahoe-lafs channel on Freenode) since that documentation isn't complete.

Last edited at 2014-04-09T04:08:48Z by daira (previous) (diff)

comment:16 Changed at 2014-03-03T00:35:04Z by daira

By the way, if you're just starting to contribute to Tahoe, then your efforts are certainly welcomed, but I would recommend looking at some other bug than this hard-to-provoke race condition in experimental code!

comment:17 Changed at 2014-04-15T16:26:07Z by daira

#2204 is possibly a duplicate. (Note that that's on the most recent HEAD of the 1819-cloud-merge branch, so if it's a duplicate then this bug is not fixed.)

comment:18 Changed at 2016-03-22T05:02:25Z by warner

  • Milestone changed from 1.12.0 to 1.13.0

Milestone renamed

comment:19 Changed at 2016-06-28T18:17:14Z by warner

  • Milestone changed from 1.13.0 to 1.14.0

renaming milestone

comment:20 Changed at 2017-08-15T13:03:55Z by exarkun

I believe this is a different issue from #2204. I've fixed #2204 in the branch for #2237. It had to due with recovery from a lost leasedb and apparently nothing to do with a race condition as described here (nor with other interactions with the bucket crawler). The fix was contained entirely within the write implementation provided by ShareSet.

comment:21 Changed at 2020-06-30T14:45:13Z by exarkun

  • Milestone changed from 1.14.0 to 1.15.0

Moving open issues out of closed milestones.

comment:22 Changed at 2020-10-30T12:35:44Z by exarkun

  • Resolution set to wontfix
  • Status changed from assigned to closed

The established line of development on the "cloud backend" branch has been abandoned. This ticket is being closed as part of a batch-ticket cleanup for "cloud backend"-related tickets.

If this is a bug, it is probably genuinely no longer relevant. The "cloud backend" branch is too large and unwieldy to ever be merged into the main line of development (particularly now that the Python 3 porting effort is significantly underway).

If this is a feature, it may be relevant to some future efforts - if they are sufficiently similar to the "cloud backend" effort - but I am still closing it because there are no immediate plans for a new development effort in such a direction.

Tickets related to the "leasedb" are included in this set because the "leasedb" code is in the "cloud backend" branch and fairly well intertwined with the "cloud backend". If there is interest in lease implementation change at some future time then that effort will essentially have to be restarted as well.

Note: See TracTickets for help on using tickets.