Opened at 2021-09-08T16:09:08Z
Closed at 2021-10-06T20:07:21Z
#3793 closed task (fixed)
allocate_buckets() shouldn't hardcode BucketWriter lifetime as mapping to connections
Reported by: | itamarst | Owned by: | itamarst |
---|---|---|---|
Priority: | normal | Milestone: | HTTP Storage Protocol |
Component: | unknown | Version: | n/a |
Keywords: | Cc: | ||
Launchpad Bug: |
Description (last modified by itamarst)
Original issue: Currently, if you allocate buckets twice with the same inputs, without writing to any buckets, the result is (allocated = empty set(), writeable buckets = empty dict()). This is wrong, each share number should always be one of the two categories.
Later decided the above is wrong, but the more fundamental point is that HTTP has no concept of connections, so we shouldn't tie BucketWriter lifetime to connections. See below comment 4 for the plan for updating these semantics.
Change History (8)
comment:1 Changed at 2021-09-17T14:21:07Z by exarkun
comment:2 Changed at 2021-09-17T18:00:21Z by itamarst
So there's the case where someone else might write different data; basically a security attack by predicting keys somehow based on knowing person A will upload file F. I'm going to assume that's out of scope and either is not possible or will get fixed some other way
There's also the case where a person starts uploading file A, that fails half-way, and they decide to upload file B instead. As far as I can tell this ... won't happen due to the way storage indexes are generated.
So that just leaves the case of uploading file A, and then later trying again with file A. Given uploading the same file, with no changes, idempotent operations seem just fine, ideal in fact.
And HTTP needs to support multiple immutable.ShareFile instances in parallel, otherwise you can't do parallel uploads.
I therefore propose:
- Allow parallel uploads by multiple clients or connections.
- Implement write-last semantics.
- Half-finished uploads get expired by a timeout, e.g. 48 hours (should be long enough that a prospective client has long since given up).
comment:3 Changed at 2021-09-23T17:33:26Z by itamarst
Another problem: in Foolscap, there's notifyOnDisconnect such that if connection is lost, the BucketWriter? aborts itself. In HTTP this doesn't exist, so there is a new state: a half-finished BucketWriter? with no connection? Or maybe it's the same and it's just an "infinite" connection and it's fine.
comment:4 Changed at 2021-09-28T16:47:00Z by itamarst
New plan, summarized and somewhat modified from #3801.
- There is a single, persistent BucketWriter whose lifetime is no longer tied to the connection lifetime.
- Cancellation has 30-minute timeout.
- allocate_buckets() maintains same external behavior.
- Foolscap connections specifically still preserve current semantics, for backwards compat with existing clients: disconnection causes cancellation.
comment:5 Changed at 2021-09-29T17:51:51Z by itamarst
Decided to move timeouts to different issue, #3807, since it'll be easier to do when we have actual HTTP uploads (and might want additional refactoring).
comment:6 Changed at 2021-09-29T17:54:02Z by itamarst
- Description modified (diff)
- Type changed from defect to task
comment:7 Changed at 2021-09-29T18:03:40Z by itamarst
- Description modified (diff)
- Summary changed from allocate_buckets() should always list all shares in its response to allocate_buckets() shouldn't hardcode BucketWriter lifetime as mapping to connections
comment:8 Changed at 2021-10-06T20:07:21Z by GitHub <noreply@…>
- Resolution set to fixed
- Status changed from new to closed
In 0a60553/trunk:
allocate_buckets is the network storage API for preparing to upload new immutable shares.
storage_index and sharenums serve to address the user data and allocated_size tells the storage server how much user data the client wants to upload. Note that the given share numbers are only those the client intends to upload to this storage server at this time. There may be other shares being uploaded to other servers and a client may choose to upload more shares for the same storage index to this server at a later time.
Assuming only clients following Tahoe's CHK protocol, there is at most one complete set of shares per storage index. This means that there is only ever one possible length for a particular storage index. It also means that there is only ever one possible sequence of bytes for any (storage index, share number) tuple.
What's the point? If the missing shares are going to be represented somewhere - where? There are three options:
The points about what data can possibly be uploaded seem relevant in reasoning about the consequences of resolving this issue by putting the shares into *writeable buckets* (thus allowing more than one caller to write data for a given (storage index, share number) tuple). It *seems* somewhat encouraging for this case. If both callers decide to use their bucket writer, they *should* both write exactly the same byte sequence. This means first-wins, last-wins, or reject-duplicate should all result in the same shares on the server in the end.
First-wins and last-wins both let a client upload redundant data. Reject-duplicate *could* avoid allowing redundant uploads if it rejected the upload soon enough. It seems like the current behavior is an attempt at that: if a bucket writer exists already, deny the new caller the chance to get a redundant bucket writer. It does this at the cost of making part of the invariant subtle (subtler than I first appreciated): The union of allocated and writeable buckets equals the union of the requested shares and the *in-progress* uploads.
With that understanding of the invariant, I don't think the current storage server interface or behavior is wrong - though the implicit third group is maybe not ideal.
However, the real motivation question here is how this interface can be mapped to Great Black Swamp. The Foolscap protocol makes this work by maintaining the in-progress uploads as connection state with the client performing the upload. If the connection is destroyed and the upload is incomplete, the data is wiped and the share moved from the in-progress set to the "writeable buckets" set (and a subsequent call can try the upload again).
Because of its HTTP foundation, Great Black Swamp can only have connection state that lives as long as a single request (let's just call it "request state" instead). Great Black Swamp also splits uploads across multiple requests. Therefore the *connection* gives the server no way to recognize an abandoned upload attempt. Say a client allocates (storage index[x], share number[y]) and then writes the first half of the share with one request. If it never makes another request to upload the rest of the data, how does the server know (x, y) is abandoned?
Or put more practically, how do a client and server collaborate to allow interrupted uploads to be resumed? Maybe it is worth noting that "interrupted" here probably involves more than just a network interruption. Both Foolscap and GBS should deal with network interruptions fine. Instead, this interruption is more like the storage server being restarted, the client being restarted, or maybe a mostly unrelated client trying to upload the same data as was previously partially uploaded.
Here are some possible answers:
I'm just gonna stop there, possibly I should have stopped a while ago. I feel like this problem would benefit a lot from some more rigorous form of analysis than I've approached here. Maybe some of the ideas here at least provide some useful context for such an analysis?