#2875 new defect

Multiples storage servers can be registered with the same FURL (UncoordinatedWriteError on I2P grid) — at Version 5

Reported by: nwks Owned by:
Priority: normal Milestone: undecided
Component: code-network Version: 1.12.1
Keywords: i2p introduction Cc:
Launchpad Bug:

Description (last modified by nwks)

On the I2P grid, introducers send announcements with different key_s but with the same FURL.

As NativeStorageServer.get_serverid() returns key_s, it is possible for StorageFarmBroker to register these announcements as separate storage servers. So when publishing, share writers can end up writing to the same destination and then find surprise shares, triggering a UncoordinatedWriteError.

In release 1.11, _tubid was used in get_serverid() instead of key_s, so this issue did not exist. Also it is not yet widespread on the I2P grid as most nodes are still running 1.10.

Workaround: check for duplicated FURLs in NativeStorageServer announcement processing, or restore the return value of NativeStorageServer.get_serverid() to _tubid, instead of key_s.

Original description

I've setup a node on the I2P grid which has ~30 storage server connected. However I am unable to create aliases or directories with the default setting (SDMF) as it always fails with an UncoordinatedWriteError.

> tahoe create-alias newalias
Error during HTTP request: 500 Internal Server Error
Traceback (most recent call last):
Failure: allmydata.mutable.common.UncoordinatedWriteError:

 > tahoe mkdir tahoe:newdir
Error during HTTP request: 500 Internal Server Error
Traceback (most recent call last):
Failure: allmydata.mutable.common.UncoordinatedWriteError:

Here is a flog dump when it happens (flogs are also attached):

Remote Versions:
 Nevow: 0.14.2
 OpenSSL: 1.0.1t
 PyYAML: 3.12
 Twisted: 17.1.0
 appdirs: 1.4.3
 asn1crypto: 0.22.0
 attrs: 17.1.0
 automat: 0.6.0
 cffi: 1.10.0
 characteristic: 14.3.0
 constantly: 15.1.0
 cryptography: 1.8.1
 enum34: 1.1.6
 foolscap: 0.12.6
 idna: 2.5
 incremental: 16.10.1
 ipaddress: 1.0.18
 packaging: 16.8
 platform: Linux-debian_8.8-x86_64-64bit_ELF
 pyOpenSSL: 17.0.0
 pyasn1: 0.2.3
 pyasn1-modules: 0.0.8
 pycparser: 2.17
 pycrypto: 2.6.1
 pycryptopp: 0.7.1.869544967005693312591928092448767568728501330214
 pyparsing: 2.2.0
 python: 2.7.9
 service-identity: 16.0.0
 setuptools: 35.0.2
 simplejson: 3.10.0
 six: 1.10.0
 tahoe-lafs: 1.12.1
 twisted: 17.1.0
 zfec: 1.4.24
 zope.interface: unknown
08:19:43.855 L20 []#172995 Publish(iarly): starting
08:19:43.855 L20 []#172996 starting publish, datalen is 0
08:19:43.856 L10 []#172997 new seqnum will be 1
08:19:43.857 L20 []#172998 building encoding parameters for file
08:19:43.857 L20 []#172999 got segsize 0
08:19:43.857 L20 []#173000 got 0 segments
08:19:43.857 L20 []#173001 got start segment 0
08:19:43.857 L20 []#173002 got end segment -1
08:19:43.857 L10 []#173003 current goal: before update:
08:19:43.857 L10 []#173004 we are planning to push new seqnum=#1
08:19:43.857 L10 []#173005 current goal: after update: , sh0 to [3cqfcfch], sh1 to [ysse5rzw], sh2 to [vl3r33qr], sh3 to [xazwtxob], sh4 to [t6b4wzcs], sh5 to [6i2sk6xm], sh6 to [wbaqeiex], sh7 to [qlnepnf6], sh8 to [svq7zfma], sh9 to [55khl43t]
08:19:43.857 L10 []#173006 we are planning to push new seqnum=#1
08:19:43.859 L20 []#173007 Starting push
08:19:44.306 L20 []#173008 storage: slot_writev iarlyicewprjrxlabf5pvsgaiu
08:19:44.320 L20 []#173009 _got_write_answer from 3cqfcfch, share 0
08:19:44.320 L20 []#173010 found the following surprise shares: set([])
08:19:44.321 L20 []#173011 wrote successfully: adding new share to servermap
08:19:44.864 L20 []#173012 _got_write_answer from qlnepnf6, share 7
08:19:44.864 L20 []#173013 found the following surprise shares: set([])
08:19:44.864 L20 []#173014 wrote successfully: adding new share to servermap
08:19:45.097 L20 []#173015 _got_write_answer from svq7zfma, share 8
08:19:45.097 L20 []#173016 found the following surprise shares: set([7])
08:19:45.098 L30 []#173017 they had shares [7] that we didn't know about
08:19:46.178 L20 []#173018 wrote successfully: adding new share to servermap
08:19:46.194 L20 []#173019 _got_write_answer from 6i2sk6xm, share 5
08:19:46.194 L20 []#173020 found the following surprise shares: set([])
08:19:46.194 L20 []#173021 wrote successfully: adding new share to servermap
08:19:46.208 L20 []#173022 _got_write_answer from 55khl43t, share 9
08:19:46.208 L20 []#173023 found the following surprise shares: set([])
08:19:46.208 L20 []#173024 wrote successfully: adding new share to servermap
08:19:46.877 L20 []#173025 _got_write_answer from ysse5rzw, share 1
08:19:46.878 L20 []#173026 found the following surprise shares: set([])
08:19:46.878 L20 []#173027 wrote successfully: adding new share to servermap
08:19:47.432 L20 []#173028 _got_write_answer from vl3r33qr, share 2
08:19:47.432 L20 []#173029 found the following surprise shares: set([1])
08:19:47.432 L30 []#173030 they had shares [1] that we didn't know about
08:19:47.433 L20 []#173031 wrote successfully: adding new share to servermap
08:19:47.641 L20 []#173032 _got_write_answer from wbaqeiex, share 6
08:19:47.641 L20 []#173033 found the following surprise shares: set([5])
08:19:47.641 L30 []#173034 they had shares [5] that we didn't know about
08:19:47.641 L20 []#173035 wrote successfully: adding new share to servermap
08:19:51.945 L20 []#173036 _got_write_answer from xazwtxob, share 3
08:19:51.945 L20 []#173037 found the following surprise shares: set([])
08:19:51.945 L20 []#173038 wrote successfully: adding new share to servermap
08:19:55.978 L20 []#173039 _got_write_answer from t6b4wzcs, share 4
08:19:55.978 L20 []#173040 found the following surprise shares: set([3])
08:19:55.978 L30 []#173041 they had shares [3] that we didn't know about
08:19:57.085 L20 []#173042 wrote successfully: adding new share to servermap
08:19:57.086 L20 []#173043 Publish failed with UncoordinatedWriteError
08:19:57.106 L20 []#173044 web: 127.0.0.1 POST /uri/[CENSORED]..?t=mkdir 500 95

Creating MDMF files or directories in the WUI, and un-linking throws the following KeyError, but the job actually gets done. (I manually created the tahoe: alias that way.)

 > tahoe rm tahoe:testfile
ERROR: 500 Internal Server Error
"Traceback (most recent call last):
  File \"/home/debian/.local/lib/python2.7/site-packages/twisted/internet/base.py\", line 878, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File \"/home/debian/.local/lib/python2.7/site-packages/foolscap/eventual.py\", line 26, in _turn
    cb(*args, **kwargs)
  File \"/home/debian/.local/lib/python2.7/site-packages/twisted/internet/defer.py\", line 459, in callback
    self._startRunCallbacks(result)
  File \"/home/debian/.local/lib/python2.7/site-packages/twisted/internet/defer.py\", line 567, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File \"/home/debian/.local/lib/python2.7/site-packages/twisted/internet/defer.py\", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File \"/home/debian/.local/lib/python2.7/site-packages/allmydata/mutable/filenode.py\", line 856, in _retry
    f.trap(UncoordinatedWriteError)
  File \"/home/debian/.local/lib/python2.7/site-packages/twisted/python/failure.py\", line 346, in trap
    self.raiseException()
  File \"/home/debian/.local/lib/python2.7/site-packages/twisted/internet/defer.py\", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File \"/home/debian/.local/lib/python2.7/site-packages/allmydata/mutable/filenode.py\", line 854, in <lambda>
    self._modify_once(modifier, first_time))
  File \"/home/debian/.local/lib/python2.7/site-packages/allmydata/mutable/filenode.py\", line 880, in _modify_once
    d = self._try_to_download_data()
  File \"/home/debian/.local/lib/python2.7/site-packages/allmydata/mutable/filenode.py\", line 958, in _try_to_download_data
    d = self._read(c, fetch_privkey=True)
  File \"/home/debian/.local/lib/python2.7/site-packages/allmydata/mutable/filenode.py\", line 980, in _read
    d = r.download(consumer, offset, size)
  File \"/home/debian/.local/lib/python2.7/site-packages/allmydata/mutable/retrieve.py\", line 253, in download
    self._start_download(consumer, offset, size)
  File \"/home/debian/.local/lib/python2.7/site-packages/allmydata/mutable/retrieve.py\", line 265, in _start_download
    self._setup_download()
  File \"/home/debian/.local/lib/python2.7/site-packages/allmydata/mutable/retrieve.py\", line 301, in _setup_download
    shares = versionmap[self.verinfo]
exceptions.KeyError: (12, \"\\x9f\\x02FDS5O\\xed\\x1fEz\\xa2\\x02\\xd1\\x97\\xb9%\\x0b\\xef\\xc2{\\xa6\\xcc\\x0b'&\\xf3]\\x9c\\x8e\\x05\\xa7\", None, 131073, 1108, 3, 10, \"\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x0c\\x9f\\x02FDS5O\\xed\\x1fEz\\xa2\\x02\\xd1\\x97\\xb9%\\x0b\\xef\\xc2{\\xa6\\xcc\\x0b'&\\xf3]\\x9c\\x8e\\x05\\xa7\\x03\\n\\x00\\x00\\x00\\x00\\x00\\x02\\x00\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x04T\", (('EOF', 2585), ('verification_key', 1731), ('enc_privkey', 123), ('share_data', 2167), ('signature', 1475), ('block_hash_tree', 2553), ('share_hash_chain', 1339), ('verification_key_end', 2023)))

I had no issues on the testGrid.

Change History (9)

comment:1 Changed at 2017-06-21T16:20:42Z by nwks

After some additional logging I found that I end up with duplicated servers (i.e. multiples storage server clients pointing to the same destination). And so two share writers end up writing to the same destination, and find that this destination holds other shares beside their own.

I've attached a new full flog containing the whole node startup sequence, as well as the UncoordinatedWriteError resulting from the publish starting at event #7738.

Changed at 2017-06-21T16:23:51Z by nwks

Node startup sequence and UncoordinatedWriteError at #7738

comment:2 Changed at 2017-06-23T16:58:21Z by nwks

Investigated further and found the issue: some servers are misconfigured and share the same FURL.

As NativeStorageServer.get_serverid() returns key_s, it is possible for StorageFarmBroker to register multiples (misconfigured) servers with the same FURL, triggering the behavior described in my previous comment.

In release 1.11, _tubid was used in get_serverid() instead of key_s, so this issue did not exist. Also it is not yet widespread on the I2P grid as most nodes are still running 1.10.

Workaround: restore the return value of NativeStorageServer.get_serverid() to _tubid, instead of key_s.

comment:3 Changed at 2017-06-23T17:02:19Z by nwks

  • Keywords introduction added

comment:4 Changed at 2017-06-23T17:20:25Z by nwks

  • Description modified (diff)
  • Summary changed from UncoordinatedWriteError on SDMF directory creation (I2P grid) to Multiples storage servers can be registered with the same FURL (UncoordinatedWriteError on I2P grid)

comment:5 Changed at 2017-06-23T18:04:06Z by nwks

  • Description modified (diff)
Note: See TracTickets for help on using tickets.