#3793 closed task

allocate_buckets() shouldn't hardcode BucketWriter lifetime as mapping to connections — at Version 7

Reported by: itamarst Owned by: itamarst
Priority: normal Milestone: HTTP Storage Protocol
Component: unknown Version: n/a
Keywords: Cc:
Launchpad Bug:

Description (last modified by itamarst)

Original issue: Currently, if you allocate buckets twice with the same inputs, without writing to any buckets, the result is (allocated = empty set(), writeable buckets = empty dict()). This is wrong, each share number should always be one of the two categories.

Later decided the above is wrong, but the more fundamental point is that HTTP has no concept of connections, so we shouldn't tie BucketWriter lifetime to connections. See below comment 4 for the plan for updating these semantics.

Change History (7)

comment:1 Changed at 2021-09-17T14:21:07Z by exarkun

allocate_buckets is the network storage API for preparing to upload new immutable shares.

storage_index and sharenums serve to address the user data and allocated_size tells the storage server how much user data the client wants to upload. Note that the given share numbers are only those the client intends to upload to this storage server at this time. There may be other shares being uploaded to other servers and a client may choose to upload more shares for the same storage index to this server at a later time.

Assuming only clients following Tahoe's CHK protocol, there is at most one complete set of shares per storage index. This means that there is only ever one possible length for a particular storage index. It also means that there is only ever one possible sequence of bytes for any (storage index, share number) tuple.

What's the point? If the missing shares are going to be represented somewhere - where? There are three options:

  • In *allocated*
  • In *writeable buckets*
  • Somewhere else

The points about what data can possibly be uploaded seem relevant in reasoning about the consequences of resolving this issue by putting the shares into *writeable buckets* (thus allowing more than one caller to write data for a given (storage index, share number) tuple). It *seems* somewhat encouraging for this case. If both callers decide to use their bucket writer, they *should* both write exactly the same byte sequence. This means first-wins, last-wins, or reject-duplicate should all result in the same shares on the server in the end.

First-wins and last-wins both let a client upload redundant data. Reject-duplicate *could* avoid allowing redundant uploads if it rejected the upload soon enough. It seems like the current behavior is an attempt at that: if a bucket writer exists already, deny the new caller the chance to get a redundant bucket writer. It does this at the cost of making part of the invariant subtle (subtler than I first appreciated): The union of allocated and writeable buckets equals the union of the requested shares and the *in-progress* uploads.

With that understanding of the invariant, I don't think the current storage server interface or behavior is wrong - though the implicit third group is maybe not ideal.

However, the real motivation question here is how this interface can be mapped to Great Black Swamp. The Foolscap protocol makes this work by maintaining the in-progress uploads as connection state with the client performing the upload. If the connection is destroyed and the upload is incomplete, the data is wiped and the share moved from the in-progress set to the "writeable buckets" set (and a subsequent call can try the upload again).

Because of its HTTP foundation, Great Black Swamp can only have connection state that lives as long as a single request (let's just call it "request state" instead). Great Black Swamp also splits uploads across multiple requests. Therefore the *connection* gives the server no way to recognize an abandoned upload attempt. Say a client allocates (storage index[x], share number[y]) and then writes the first half of the share with one request. If it never makes another request to upload the rest of the data, how does the server know (x, y) is abandoned?

Or put more practically, how do a client and server collaborate to allow interrupted uploads to be resumed? Maybe it is worth noting that "interrupted" here probably involves more than just a network interruption. Both Foolscap and GBS should deal with network interruptions fine. Instead, this interruption is more like the storage server being restarted, the client being restarted, or maybe a mostly unrelated client trying to upload the same data as was previously partially uploaded.

Here are some possible answers:

  1. Require detection that the share is abandoned before allowing a new allocate_buckets call to retry the upload. This is like the current implicit third group of partially uploaded shares that appear nowhere in the response of allocate_buckets.
    • Assume an upload is abandoned after it is untouched for a set period of time.
    • Assume an upload is abandoned after the storage server restarts (this is what is currently implemented).
    • Assume an upload is abandoned when disk space begins to be constrained (preferring to clean up older partial shares first).
    • Require an explicit API call for abandoning an upload (this is also currently implemented).
  2. Give out a bucket writer for a partially written share. This is like putting partially uploaded shares into *writeable buckets* in the allocate_buckets response.
    • First write wins
    • Last write wins
    • Reject overlapping writes (current GBS behavior)
  3. Change the protocol to be able to introduce an explicit third set

I'm just gonna stop there, possibly I should have stopped a while ago. I feel like this problem would benefit a lot from some more rigorous form of analysis than I've approached here. Maybe some of the ideas here at least provide some useful context for such an analysis?

comment:2 Changed at 2021-09-17T18:00:21Z by itamarst

So there's the case where someone else might write different data; basically a security attack by predicting keys somehow based on knowing person A will upload file F. I'm going to assume that's out of scope and either is not possible or will get fixed some other way

There's also the case where a person starts uploading file A, that fails half-way, and they decide to upload file B instead. As far as I can tell this ... won't happen due to the way storage indexes are generated.

So that just leaves the case of uploading file A, and then later trying again with file A. Given uploading the same file, with no changes, idempotent operations seem just fine, ideal in fact.

And HTTP needs to support multiple immutable.ShareFile instances in parallel, otherwise you can't do parallel uploads.

I therefore propose:

  1. Allow parallel uploads by multiple clients or connections.
  2. Implement write-last semantics.
  3. Half-finished uploads get expired by a timeout, e.g. 48 hours (should be long enough that a prospective client has long since given up).

comment:3 Changed at 2021-09-23T17:33:26Z by itamarst

Another problem: in Foolscap, there's notifyOnDisconnect such that if connection is lost, the BucketWriter? aborts itself. In HTTP this doesn't exist, so there is a new state: a half-finished BucketWriter? with no connection? Or maybe it's the same and it's just an "infinite" connection and it's fine.

comment:4 Changed at 2021-09-28T16:47:00Z by itamarst

New plan, summarized and somewhat modified from #3801.

  1. There is a single, persistent BucketWriter whose lifetime is no longer tied to the connection lifetime.
  2. Cancellation has 30-minute timeout.
  3. allocate_buckets() maintains same external behavior.
  4. Foolscap connections specifically still preserve current semantics, for backwards compat with existing clients: disconnection causes cancellation.
Last edited at 2021-09-29T16:38:35Z by itamarst (previous) (diff)

comment:5 Changed at 2021-09-29T17:51:51Z by itamarst

Decided to move timeouts to different issue, #3807, since it'll be easier to do when we have actual HTTP uploads (and might want additional refactoring).

Last edited at 2021-09-29T17:53:26Z by itamarst (previous) (diff)

comment:6 Changed at 2021-09-29T17:54:02Z by itamarst

  • Description modified (diff)
  • Type changed from defect to task

comment:7 Changed at 2021-09-29T18:03:40Z by itamarst

  • Description modified (diff)
  • Summary changed from allocate_buckets() should always list all shares in its response to allocate_buckets() shouldn't hardcode BucketWriter lifetime as mapping to connections
Note: See TracTickets for help on using tickets.