Opened at 2013-07-05T19:36:56Z
Closed at 2013-07-09T14:38:12Z
#2016 closed defect (duplicate)
Not enough available servers are found
Reported by: | kapiteined | Owned by: | daira |
---|---|---|---|
Priority: | major | Milestone: | 1.10.1 |
Component: | code-peerselection | Version: | 1.10.0 |
Keywords: | servers-of-happiness upload error | Cc: | |
Launchpad Bug: |
Description (last modified by zooko)
When uploading a file, it fails with the following error:
<class 'allmydata.interfaces.UploadUnhappinessError'>: shares could be placed on only 4 server(s) such that any 3 of them have enough shares to recover the file, but we were asked to place shares on at least 5 such servers. (placed all 5 shares, want to place shares on at least 5 servers such that any 3 of them have enough shares to recover the file, sent 6 queries to 6 servers, 4 queries placed some shares, 2 placed none (of which 2 placed none due to the server being full and 0 placed none due to an error))
There are 12 servers connected to this grid (pubgrid), yet only 6 queries are sent, and because two of the queried servers are full the upload fails (if I interpreted the error right).
Shouldn't there be another round of queries if the first round does not yield enough available servers?
Change History (9)
comment:1 in reply to: ↑ description Changed at 2013-07-05T19:57:05Z by kapiteined
comment:2 Changed at 2013-07-05T20:49:40Z by daira
Here's the most important part of the log:
local#675113 20:33:49.785: CHKUploader starting
local#675114 20:33:49.786: starting upload of <allmydata.immutable.upload.EncryptAnUploadable instance at 0x31a3378>
local#675115 20:33:49.786: creating Encoder <Encoder for unknown storage index>
local#675116 20:33:49.787: file size: 658086
local#675117 20:33:49.789: my encoding parameters: (3, 5, 5, 131073)
local#675118 20:33:49.790: got encoding parameters: 3/5/5 131073
local#675119 20:33:49.790: now setting up codec
local#675120 20:33:49.878: using storage index jbljj
local#675121 20:33:49.878: <Tahoe2ServerSelector for upload jbljj>(jbljj): starting
local#675122 20:33:49.927: <Tahoe2ServerSelector for upload jbljj>(jbljj): asking server psdgefgf for any existing shares
local#675123 20:33:49.954: <Tahoe2ServerSelector for upload jbljj>(jbljj): asking server 5sqtlw for any existing shares
local#675124 20:33:49.964: got result from [hrtib2], 0 shares
local#675125 20:33:49.965: but we're not running, so we'll ignore it
local#675126 20:33:49.966: _check_for_done, mode is 'MODE_READ', 2 queries outstanding, 2 extra servers available, 0 'must query' servers left, need_privkey=False
local#675127 20:33:49.967: but we're not running
local#675128 20:33:49.988: got result from [nszizg], 0 shares
local#675129 20:33:49.989: but we're not running, so we'll ignore it
local#675130 20:33:49.990: _check_for_done, mode is 'MODE_READ', 1 queries outstanding, 2 extra servers available, 0 'must query' servers left, need_privkey=False
local#675131 20:33:49.990: but we're not running
local#675132 20:33:50.083: <Tahoe2ServerSelector for upload jbljj>(jbljj): response to get_buckets() from server psdgefgf: alreadygot=()
local#675133 20:33:50.112: <Tahoe2ServerSelector for upload jbljj>(jbljj): response to get_buckets() from server 5sqtlw: alreadygot=()
local#675134 20:33:50.216: got result from [r7cddi], 0 shares
local#675135 20:33:50.217: but we're not running, so we'll ignore it
local#675136 20:33:50.218: _check_for_done, mode is 'MODE_READ', 0 queries outstanding, 2 extra servers available, 0 'must query' servers left, need_privkey=False
local#675137 20:33:50.219: but we're not running
local#675138 20:33:50.290: <Tahoe2ServerSelector for upload jbljj>(jbljj): response to allocate_buckets() from server i76mi6: alreadygot=(0,), allocated=()
local#675139 20:33:50.457: <Tahoe2ServerSelector for upload jbljj>(jbljj): response to allocate_buckets() from server lxmst5: alreadygot=(2,), allocated=(1,)
local#675140 20:33:50.667: <Tahoe2ServerSelector for upload jbljj>(jbljj): response to allocate_buckets() from server sf7ehc: alreadygot=(3,), allocated=()
local#675141 20:33:50.822: <Tahoe2ServerSelector for upload jbljj>(jbljj): response to allocate_buckets() from server ddvfcd: alreadygot=(4,), allocated=()
local#675142 20:33:50.839: <Tahoe2ServerSelector for upload jbljj>(jbljj): server selection unsuccessful for <Tahoe2ServerSelector for upload jbljj>: shares could be placed on only 4 server(s) such that any 3 of them have enough shares to recover the file, but we were asked to place shares on at least 5 such servers. (placed all 5 shares, want to place shares on at least 5 servers such that any 3 of them have enough shares to recover the file, sent 6 queries to 6 servers, 4 queries placed some shares, 2 placed none (of which 2 placed none due to the server being full and 0 placed none due to an error)), merged=sh0: i76mi6en, sh1: lxmst5bx, sh2: lxmst5bx, sh3: sf7ehcpn, sh4: ddvfcdns
comment:3 follow-up: ↓ 4 Changed at 2013-07-05T20:59:39Z by daira
Here's my interpretation: with h = N = 5, as soon as the Tahoe2ServerSelector decides to put two shares on the same server (here sh1 and sh2 on lxmst5bx), the upload is doomed. The shares all have to be on different servers whenever h = N, but the termination condition is just that all shares have been placed, not that they have been placed in a way that meets the happiness condition.
If that's the problem, then #1382 should fix it. This would also explain why VG2 was unreliable with h close to N.
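To make the interpretation above concrete, here is a minimal sketch that recomputes the happiness value for the merged placement reported in the log. It assumes that happiness is the size of a maximum matching between shares and servers; the function name servers_of_happiness and the server name hypothetical6 are made up for illustration, and this is not the code Tahoe-LAFS itself runs.

```python
def servers_of_happiness(placement):
    """Size of a maximum matching between shares and servers.

    placement maps a share number to the set of server ids holding it.
    Illustrative sketch only, not the Tahoe-LAFS implementation.
    """
    matched = {}  # server id -> share number currently assigned to it

    def try_assign(share, seen):
        # Standard augmenting-path search for bipartite matching.
        for server in placement[share]:
            if server in seen:
                continue
            seen.add(server)
            if server not in matched or try_assign(matched[server], seen):
                matched[server] = share
                return True
        return False

    return sum(1 for share in placement if try_assign(share, set()))


# Placement taken from the "merged=" part of the log above.
merged = {
    0: {"i76mi6en"},
    1: {"lxmst5bx"},
    2: {"lxmst5bx"},
    3: {"sf7ehcpn"},
    4: {"ddvfcdns"},
}

# Same placement, but with sh2 moved to a sixth, hypothetical server.
spread = dict(merged)
spread[2] = {"hypothetical6"}

print(servers_of_happiness(merged))  # 4, below h = 5
print(servers_of_happiness(spread))  # 5, meets h = 5
```

With sh1 and sh2 both on lxmst5bx, only four servers can each be matched to a distinct share, which agrees with the "only 4 server(s)" in the error message.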
comment:4 in reply to: ↑ 3 Changed at 2013-07-05T21:03:15Z by zooko
Daira: excellent work diagnosing this!! Ed: thanks so much for the bug report. Daira: it looks like you are right, and I think this does explain those bugs that the volunteergrid2 people reported and that I never understood. Thank you!
comment:5 Changed at 2013-07-05T21:05:59Z by zooko
- Description modified (diff)
comment:6 Changed at 2013-07-05T21:08:50Z by kapiteined
And to check if that is the case, I changed to 3-7-10 encoding, and now the upload succeeds! Success: file copied
Does this call for a change in code, or for a big warning sticker: "don't choose h and N too close together"?
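For readers trying the same workaround, the sketch below shows what a 3-7-10 encoding might look like as shares.needed / shares.happy / shares.total in the [client] section of tahoe.cfg, and computes the slack N - h. The helper name happiness_slack and the ordering check are assumptions made for this example, not part of Tahoe-LAFS.

```python
import configparser

# tahoe.cfg-style fragment for the 3-7-10 encoding (k = 3, h = 7, N = 10).
EXAMPLE_CFG = """
[client]
shares.needed = 3
shares.happy = 7
shares.total = 10
"""


def happiness_slack(cfg_text):
    """Return (k, h, n, n - h) parsed from a [client] section.

    Hypothetical helper for illustration only.
    """
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    client = parser["client"]
    k = client.getint("shares.needed")
    h = client.getint("shares.happy")
    n = client.getint("shares.total")
    # Sanity check assumed for this sketch.
    if not (1 <= k <= h <= n):
        raise ValueError("expected 1 <= needed <= happy <= total")
    return k, h, n, n - h


print(happiness_slack(EXAMPLE_CFG))  # (3, 7, 10, 3)
```

A slack of 0, as with the original 3/5/5 parameters, means every share must land on a different server, which is exactly the case that fails in this ticket.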
comment:7 Changed at 2013-07-07T19:40:32Z by daira
We intend to fix it for v1.11 (Mark Berger's branch for #1382 already basically works), but there would be no harm in pointing out this problem on tahoe-dev in the meantime.
comment:8 follow-up: ↓ 9 Changed at 2013-07-09T14:33:42Z by daira
- Component changed from unknown to code-peerselection
- Keywords servers-of-happiness upload error added
- Milestone changed from undecided to 1.11.0
- Priority changed from normal to major
Same bug as #1791?
comment:9 in reply to: ↑ 8 Changed at 2013-07-09T14:38:12Z by daira
- Resolution set to duplicate
- Status changed from new to closed
Replying to kapiteined:
Somehow attaching a file to this ticket failed, so I put the error report ( incident-2013-07-05--19-34-13Z-7o6admq.flog.bz2 ) at URI:CHK:7tbpjhxokkmpere6nxwfa5cvey:37ypgfhpwg67veqpyhjve22edmh3w3jwpbds47yfnvjussvalmaq:3:5:74128 in the pubgrid.