#1921 closed defect

leasedb: NonExistentShareError: can't find [share] in `shares` table — at Version 6

Reported by: davidsarah Owned by: davidsarah
Priority: major Milestone: 1.15.0
Component: code-storage Version: 1.9.2
Keywords: leasedb accounting-crawler blocks-cloud-merge Cc:
Launchpad Bug:

Description (last modified by daira)

The attached incident was seen when running a test with a large number of uploads with the OpenStack? cloud backend. The most relevant part seems to be:

local#18690 23:06:18.999: storage: allocate_buckets lp6ibjsxe6vf6ern6v3soepjh4
local#18691 23:06:18.999: OpenStack list objects request GET https://storage101.dfw1.clouddrive.com/v1/MossoCloudFS_a888b70a-4771-40c0-8403-e921454e03fd/test?format=json&prefix=shares%2Flp%2Flp6ibjsxe6vf6ern6v3soepjh4%2F {'User-Agent': ['Tahoe-LAFS OpenStack client'], 'X-Auth-Token': ['be1fea33-921b-47be-b95b-99d4cc5139ea']}
local#18692 23:06:18.999: Starting factory <twisted.web.client._HTTP11ClientFactory instance at 0x4433830>
local#18693 23:06:19.504: OpenStack list objects response: 200 OK
local#18694 23:06:19.505: OpenStack list read 201 bytes, parsed as 1 items
local#18695 23:06:19.505: OpenStack get object request GET https://storage101.dfw1.clouddrive.com/v1/MossoCloudFS_a888b70a-4771-40c0-8403-e921454e03fd/test/shares/lp/lpcjuif2ixx6khivy6zxdmvofe/0 {'User-Agent': ['Tahoe-LAFS OpenStack client'], 'X-Auth-Token': ['be1fea33-921b-47be-b95b-99d4cc5139ea']}
local#18696 23:06:19.506: Starting factory <twisted.web.client._HTTP11ClientFactory instance at 0x442b320>
local#18697 23:06:19.506: Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x46b6c68>
local#18698 23:06:19.520: OpenStack list objects response: 200 OK
local#18699 23:06:19.521: OpenStack list read 2 bytes, parsed as 0 items
local#18700 23:06:19.523: Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x4433830>
local#18701 23:06:19.537: OpenStack put object request PUT https://storage101.dfw1.clouddrive.com/v1/MossoCloudFS_a888b70a-4771-40c0-8403-e921454e03fd/test/shares/lp/lp6ibjsxe6vf6ern6v3soepjh4/0 {'User-Agent': ['Tahoe-LAFS OpenStack client'], 'Content-Type': ['application/octet-stream'], 'X-Auth-Token': ['be1fea33-921b-47be-b95b-99d4cc5139ea']}
local#18702 23:06:19.537: Starting factory <twisted.web.client._HTTP11ClientFactory instance at 0x433fef0>
local#18703 23:06:20.020: OpenStack get object response: 200 OK
local#18704 23:06:20.022: share SI=lp6ibjsxe6vf6ern6v3soepjh4 shnum=0 unexpectedly disappeared [INCIDENT-TRIGGER]
local#18705 23:06:20.059: Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x442b320>
local#18706 23:06:20.062: OpenStack list objects request GET https://storage101.dfw1.clouddrive.com/v1/MossoCloudFS_a888b70a-4771-40c0-8403-e921454e03fd/test?format=json&prefix=shares%2Flq%2F {'User-Agent': ['Tahoe-LAFS OpenStack client'], 'X-Auth-Token': ['be1fea33-921b-47be-b95b-99d4cc5139ea']}
local#18707 23:06:20.062: Starting factory <twisted.web.client._HTTP11ClientFactory instance at 0x4341e60>
local#18708 23:06:20.521: OpenStack put object response: 201 Created
local#18709 23:06:20.527: an inbound callRemote that we [n4zt] executed (on behalf of someone else, TubID uzie) failed
local#18710 23:06:20.527:  reqID=6970, rref=<allmydata.storage.bucket.BucketWriter object at 0x31a9050>, methname=RIBucketWriter.close
local#18711 23:06:20.527:  args=[]
local#18712 23:06:20.527:  kwargs={}
local#18713 23:06:20.527:  the LOCAL failure was:
 FAILURE:
 [CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
   File "/usr/lib/python2.7/dist-packages/foolscap/eventual.py", line 26, in _turn
     cb(*args, **kwargs)
   File "/home/davidsarah/tahoe/bitsys/allmydata-tahoe-1.9.0.post363/src/allmydata/util/deferredutil.py", line 55, in _with_log
     op(res)
   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 368, in callback
     self._startRunCallbacks(result)
   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 464, in _startRunCallbacks
     self._runCallbacks()
 --- <exception caught here> ---
   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 551, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
   File "/home/davidsarah/tahoe/bitsys/allmydata-tahoe-1.9.0.post363/src/allmydata/storage/bucket.py", line 62, in _got_used_space
     self._account.add_or_renew_default_lease(storage_index, shnum)
   File "/home/davidsarah/tahoe/bitsys/allmydata-tahoe-1.9.0.post363/src/allmydata/storage/account.py", line 58, in add_or_renew_default_lease
     return self.add_or_renew_lease(storage_index, shnum, renewal_time, expiration_time)
   File "/home/davidsarah/tahoe/bitsys/allmydata-tahoe-1.9.0.post363/src/allmydata/storage/account.py", line 63, in add_or_renew_lease
     renewal_time, expiration_time)
   File "/home/davidsarah/tahoe/bitsys/allmydata-tahoe-1.9.0.post363/src/allmydata/storage/leasedb.py", line 250, in add_or_renew_leases
     raise NonExistentShareError(si_s, shnum)
 allmydata.storage.leasedb.NonExistentShareError: can't find SI='lp6ibjsxe6vf6ern6v3soepjh4' shnum=0 in `shares` table
 ]
local#18714 23:06:20.533: Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x433fef0>
local#18715 23:06:20.545: storage: aborting write to share None

I'll avoid speculation in the ticket description, but I don't think this is specific to OpenStack?.

Change History (7)

comment:1 follow-up: Changed at 2013-02-21T01:35:31Z by davidsarah

Okay, now for the speculation:

Notice that the accounting crawler is running concurrently with a PUT. The accounting crawler is what is causing the list requests for prefix 'lp', 'lq', etc.

What triggers the incident is that the crawler sees a share that unexpectedly disappeared, i.e. it has a leasedb entry but no chunk object(s):

local#18704 23:06:20.022: share SI=lp6ibjsxe6vf6ern6v3soepjh4 shnum=0
                          unexpectedly disappeared [INCIDENT-TRIGGER]

Almost immediately afterward, the concurrent PUT request fails. That request is for the same share that the crawler saw "disappear". It seems as though what happened is that there was a race between the crawler and the share creation:

  1. leasedb entry is created for share, but it isn't stored yet
  2. crawler sees an entry with no stored share; deletes the entry
  3. the share creator goes to add a default lease and fails because the leasedb entry isn't there

I think this will only happen when the accounting crawler is looking at a prefix while a share is being uploaded to it (that is not difficult to reproduce if you do enough uploads).

[And sigh, I just noticed that I leaked the Auth-Token. It's fine, the Auth-Token is only valid for 24h and I'll delete the container contents after that anyway. But that does need fixing.]

comment:2 in reply to: ↑ 1 Changed at 2013-02-21T03:44:11Z by davidsarah

Replying to davidsarah:

[And sigh, I just noticed that I leaked the Auth-Token. It's fine, the Auth-Token is only valid for 24h and I'll delete the container contents after that anyway. But that does need fixing.]

This has been fixed; header values are no longer logged.

comment:3 Changed at 2013-02-27T23:53:05Z by zooko

  • Keywords crawlers added

comment:4 Changed at 2013-03-01T01:40:35Z by davidsarah

  • Keywords accounting-crawler added; openstack cloud-backend crawlers removed

I believe this was fixed in commits 4cd54e36 ("leasedb/accounting crawler: only treat stable shares as disappeared or unleased") and 9ebc0e8b ("OpenStack: fix a type error introduced by the fix to #1921") on the 1909-cloud-openstack branch. Note that this is not on the 1818-leasedb branch, and in general I need to review all patches on 1909-cloud-openstack 1819-cloud-merge to see whether they are applicable to 1818-leasedb. I have just noted this on #1818.

The problem was indeed not specific to OpenStack (or the cloud backend). The leasedb design doc had the correct design, which was for the accounting crawler to treat shares in states other than STABLE as leased, but that requirement had been missed in the implementation. That reminds me that we need better tests for edge cases in the accounting crawler.

I'll leave this ticket open until the fix is on the 1818-leasedb branch.

Last edited at 2013-11-21T22:47:40Z by daira (previous) (diff)

comment:5 Changed at 2013-03-03T22:48:26Z by davidsarah

I saw another instance of this after the patch that was supposed to have fixed it :-( I haven't investigated that yet.

comment:6 Changed at 2013-05-25T22:31:56Z by daira

  • Description modified (diff)

#1987 may be the same bug as this. I'm not marking them as duplicates because I'm not sure of that yet.

Note: See TracTickets for help on using tickets.