#605 assigned defect

two-hour delay to connect to a grid from Win32, if there are many storage servers unreachable

Reported by: zooko Owned by: zooko
Priority: major Milestone: eventually
Component: code-network Version: 1.2.0
Keywords: windows performance hang introduction Cc:
Launchpad Bug:

Description

Current trunk (91a28da3aa2e98c9) takes something like two hours to connect to all of the testgrid storage servers. Two seconds would be more like it! Strangely, it spent a long time (tens of minutes) with a connection open to only one of the three tahoe server nodes running on each machine. Eventually it connected to a second server node on each machine, and finally to the third.

Attachments (1)

flogtail.txt (439.7 KB) - added by zooko at 2009-02-04T03:51:17Z.


Change History (20)

comment:1 Changed at 2009-02-03T23:21:18Z by zooko

  • Status changed from new to assigned

I'm trying to reproduce it with tahoe-1.2.0 on the same Windows laptop (zaulaw).

comment:2 Changed at 2009-02-03T23:48:10Z by zooko

A-ha! That's interesting! The same behavior is occurring with tahoe-1.2.0. Here is the full version string:

My versions: allmydata: 1.2.0, foolscap: 0.3.2, pycryptopp: 0.5.1, zfec: 1.3.4, twisted: 8.1.0, nevow: 0.9.32, simplejson: 1.7.3, pyopenssl: 0.8, setuptools: 0.7a1

The next experiment is to try an older version of foolscap. Let's see... tahoe-1.2.0 shipped saying that it required foolscap >= 0.2.9: _auto_deps@20080722010403-92b7f-a74ec17a8a9cff0834ecc122ffa275280f563cea, so I'll try that.

comment:4 Changed at 2009-02-04T02:26:02Z by warner

Could there be some sort of firewall that's rate-limiting connections to a single IP address? Or a NAT table which can only hold one entry per destination address?

Try using telnet or netcat to connect to something on the not-yet-connected servers while it's in this state, or point a browser on that box at the server's webapi ports.
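
For the raw-TCP half of that test, a small script works too (a minimal sketch; the host and port are placeholders, to be replaced with a not-yet-connected server's announced address):

import socket

def probe(host, port, timeout=10):
    # Returns True if a plain TCP connection to (host, port) can be established.
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
        return True
    except socket.error as e:
        print("connect to %s:%s failed: %s" % (host, port, e))
        return False

# e.g. probe("207.7.153.162", 55217), using one of the announced storage ports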

comment:5 Changed at 2009-02-04T03:24:34Z by zooko

Those are good ideas. The built-in Windows XP firewall is turned off. I can netcat to one of the servers that hasn't connected, and the TCP connection is established, although I don't know how to type "hello foolscap, begin your STARTTLS negotiation now", so all I can say is that the connection was established and then, a while later, dropped.

It certainly can't be the fault of the wireless router or the DSL modem on my LAN, because four other computers on the same LAN connect to all blockservers quickly, as expected.

The next step is to examine the flog tail while this is happening. If someone else could try to reproduce it on their Windows machine (including allmydata.com's virtual Windows machine), that would be helpful.

Changed at 2009-02-04T03:51:17Z by zooko

comment:6 Changed at 2009-02-04T03:53:16Z by zooko

Here is the output from flogtool tail while the client is running. It has connected to several servers, including:

lwkv6cjicbzqjwwwuifik3pogeupsicb | amduser@tahoebs4:public/bs4c4 | Yes: to 207.7.153.162:51561 | 20:41:20 03-Feb-2009 | 20:41:18 03-Feb-2009 | 1.2.0-r3056 | storage

but not:

xiktf6ok5f5ao5znxxttriv233hmvi4v | amduser@tahoebs4:public/bs4c3 | No | 20:47:59 03-Feb-2009 | 20:41:19 03-Feb-2009 | 1.2.0-r3056 | storage

Here is an excerpt from this flogtail file:

$ grep -iEe"xikt|wkv6" flogtail.txt  | cut -c-180       
20:41:18.953 L20 []#118 new announcement[storage]: ('pb://lwkv6cjicbzqjwwwuifik3pogeupsicb@207.7.153.162:51561,127.0.0.1:51561/yadf5xhmz5hrv2iqc6ygr52w73fhqnnh', 'storage', 'RIStor
20:41:18.953 L20 []#119 attempting to connect to lwkv6cji
20:41:18.953 L20 []#120 TubConnector created from cjzndszpggbecw2r7cpf3hsl3snd4bly to lwkv6cjicbzqjwwwuifik3pogeupsicb
20:41:18.953 L20 []#122 Starting factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to lwkv6cji] at 0x01A5F490>
20:41:18.953 L20 []#124 Starting factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to lwkv6cji] at 0x01A5F470>
20:41:19.030 L20 []#394 new announcement[storage]: ('pb://xiktf6ok5f5ao5znxxttriv233hmvi4v@207.7.153.162:55217,127.0.0.1:55217/duzihdt3iweli5eendljqihnouceczzl', 'storage', 'RIStor
20:41:19.030 L20 []#395 attempting to connect to xiktf6ok
20:41:19.030 L20 []#396 TubConnector created from cjzndszpggbecw2r7cpf3hsl3snd4bly to xiktf6ok5f5ao5znxxttriv233hmvi4v
20:41:19.030 L20 []#398 Starting factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x02845A70>
20:41:19.030 L20 []#400 Starting factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x02845A50>
20:41:19.203 L10 []#521 initClient: to target pb://lwkv6cjicbzqjwwwuifik3pogeupsicb
20:41:19.203 L10 []#522 sendPlaintextClient: GET for tubID lwkv6cjicbzqjwwwuifik3pogeupsicb
20:41:19.796 L30 []#546 Tub.connectorFinished: WEIRD, <foolscap.negotiate.TubConnector object at 0x01A5F550 from cjzndszpggbecw2r7cpf3hsl3snd4bly to 7tlov7egj7ultza3dy2dlgev6gijlgv
20:41:20.187 L10 []#547 dataReceived(isClient=True,phase=1,options={}): 'banana-negotiation-range: 3 3\r\ninitial-vocab-table-range: 0 1\r\nmy-incarnation: bab985ce462e6915\r\nmy-t
20:41:20.187 L10 []#548 evaluateHello(isClient=True): offer={'my-tub-id': 'lwkv6cjicbzqjwwwuifik3pogeupsicb', 'my-incarnation': 'bab985ce462e6915', 'initial-vocab-table-range': '0 
20:41:20.217 L20 []#555 negotiationComplete, <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to lwkv6cji] at 0x01A5F470> won
20:41:20.217 L20 []#557 connectorFinished (<foolscap.negotiate.TubConnector object at 0x01A5F3D0 from cjzndszpggbecw2r7cpf3hsl3snd4bly to lwkv6cjicbzqjwwwuifik3pogeupsicb>)
20:41:20.217 L20 []#558 Stopping factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to lwkv6cji] at 0x01A5F490>
20:41:20.217 L20 []#559 connectorFinished (<foolscap.negotiate.TubConnector object at 0x01A5F3D0 from cjzndszpggbecw2r7cpf3hsl3snd4bly to lwkv6cjicbzqjwwwuifik3pogeupsicb>)
20:41:20.217 L30 []#560 Tub.connectorFinished: WEIRD, <foolscap.negotiate.TubConnector object at 0x01A5F3D0 from cjzndszpggbecw2r7cpf3hsl3snd4bly to lwkv6cjicbzqjwwwuifik3pogeupsic
20:41:20.453 L20 []#567 got connection to lwkv6cji, getting versions
20:41:20.625 L10 []#573  reqID=2, rref=<RemoteReference at 0x28b0690 [pb://lwkv6cjicbzqjwwwuifik3pogeupsicb@207.7.153.162:51561,127.0.0.1:51561/yadf5xhmz5hrv2iqc6ygr52w73fhqnnh]>, 
20:41:20.625 L20 []#575 connected to lwkv6cji, version {'http://allmydata.org/tahoe/protocols/storage/v1': {'maximum-immutable-share-size': 4294967296L}, 'application-version': 'un
20:41:43.015 L30 []#646 Tub.connectorFinished: WEIRD, <foolscap.negotiate.TubConnector object at 0x01A76D30 from cjzndszpggbecw2r7cpf3hsl3snd4bly to 6fyx5u4zr7tvz3szynihc4x3uc6ct5g
20:41:49.062 L20 []#792 Stopping factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x02845A50>
20:41:49.062 L20 []#795 connectorFinished (<foolscap.negotiate.TubConnector object at 0x028459B0 from cjzndszpggbecw2r7cpf3hsl3snd4bly to xiktf6ok5f5ao5znxxttriv233hmvi4v>)
20:41:49.062 L20 []#796 Stopping factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x02845A70>
20:41:53.015 L20 []#1012 TubConnector created from cjzndszpggbecw2r7cpf3hsl3snd4bly to xiktf6ok5f5ao5znxxttriv233hmvi4v
20:41:53.015 L20 []#1014 Starting factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x02996490>
20:41:53.015 L20 []#1016 Starting factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x02996630>
20:42:23.015 L20 []#1254 Stopping factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x02996630>
20:42:23.015 L20 []#1257 connectorFinished (<foolscap.negotiate.TubConnector object at 0x02996650 from cjzndszpggbecw2r7cpf3hsl3snd4bly to xiktf6ok5f5ao5znxxttriv233hmvi4v>)
20:42:23.015 L20 []#1258 Stopping factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x02996490>
20:42:25.578 L20 []#1444 TubConnector created from cjzndszpggbecw2r7cpf3hsl3snd4bly to xiktf6ok5f5ao5znxxttriv233hmvi4v
20:42:25.578 L20 []#1446 Starting factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x029FD350>
20:42:25.578 L20 []#1448 Starting factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x029FD310>
20:42:55.592 L20 []#1659 Stopping factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x029FD350>
20:42:55.592 L20 []#1660 connectorFinished (<foolscap.negotiate.TubConnector object at 0x029FD850 from cjzndszpggbecw2r7cpf3hsl3snd4bly to xiktf6ok5f5ao5znxxttriv233hmvi4v>)
20:42:55.592 L20 []#1661 Stopping factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x029FD310>
20:43:01.733 L20 []#1812 TubConnector created from cjzndszpggbecw2r7cpf3hsl3snd4bly to xiktf6ok5f5ao5znxxttriv233hmvi4v
20:43:01.733 L20 []#1814 Starting factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x029FDAB0>
20:43:01.733 L20 []#1816 Starting factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x029FD970>
20:43:31.750 L20 []#2013 Stopping factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x029FD970>
20:43:31.750 L20 []#2014 connectorFinished (<foolscap.negotiate.TubConnector object at 0x029FD270 from cjzndszpggbecw2r7cpf3hsl3snd4bly to xiktf6ok5f5ao5znxxttriv233hmvi4v>)
20:43:31.750 L20 []#2015 Stopping factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x029FDAB0>
20:43:50.733 L20 []#2210 TubConnector created from cjzndszpggbecw2r7cpf3hsl3snd4bly to xiktf6ok5f5ao5znxxttriv233hmvi4v
20:43:50.733 L20 []#2212 Starting factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x029E85F0>
20:43:50.733 L20 []#2214 Starting factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x029E8230>
20:44:20.750 L20 []#2411 Stopping factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x029E85F0>
20:44:20.750 L20 []#2412 connectorFinished (<foolscap.negotiate.TubConnector object at 0x029E8C30 from cjzndszpggbecw2r7cpf3hsl3snd4bly to xiktf6ok5f5ao5znxxttriv233hmvi4v>)
20:44:20.750 L20 []#2413 Stopping factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x029E8230>
20:45:12.265 L20 []#2616 TubConnector created from cjzndszpggbecw2r7cpf3hsl3snd4bly to xiktf6ok5f5ao5znxxttriv233hmvi4v
20:45:12.265 L20 []#2618 Starting factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x029A2970>
20:45:12.265 L20 []#2620 Starting factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x029A2070>
20:45:42.265 L20 []#2799 Stopping factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x029A2970>
20:45:42.265 L20 []#2800 connectorFinished (<foolscap.negotiate.TubConnector object at 0x029A28B0 from cjzndszpggbecw2r7cpf3hsl3snd4bly to xiktf6ok5f5ao5znxxttriv233hmvi4v>)
20:45:42.265 L20 []#2801 Stopping factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x029A2070>
20:48:06.625 L20 []#3076 TubConnector created from cjzndszpggbecw2r7cpf3hsl3snd4bly to xiktf6ok5f5ao5znxxttriv233hmvi4v
20:48:06.625 L20 []#3078 Starting factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x0282E3D0>
20:48:06.625 L20 []#3080 Starting factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x0282EAD0>
20:48:36.640 L20 []#3186 Stopping factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x0282EAD0>
20:48:36.640 L20 []#3187 connectorFinished (<foolscap.negotiate.TubConnector object at 0x0282E6D0 from cjzndszpggbecw2r7cpf3hsl3snd4bly to xiktf6ok5f5ao5znxxttriv233hmvi4v>)
20:48:36.640 L20 []#3188 Stopping factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x0282E3D0>

That's the complete output of that grep command at this time.

Attached is the current flogtail.txt file.

comment:7 Changed at 2009-02-04T03:57:26Z by zooko

So it looks like there's something funny about foolscap's attempts to connect to xikt but not to lwkv. What's the next step? Perhaps to attach flog gatherers to xikt as well as to this client? Also, of course, I would like someone to try to reproduce this problem elsewhere. To reproduce it:

  1. Start a node pointed at the test grid.
  2. See if it has connected to all the blockservers within a few seconds.

Of course, even better would be automated tests of this functionality! We already have such tests, but buildbot runs them only on a Linux system. Perhaps we could run them on a Windows buildslave as well.
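
Step 2 could even be scripted against the node's welcome page (a rough sketch, assuming the webapi listens on the default port 3456 and that counting the "Yes: to ..." cells shown in comment:6 is a good-enough proxy for the number of connected servers):

import time
import urllib2  # Python 2, matching the era of this codebase

def count_connected(welcome_url="http://127.0.0.1:3456/"):
    # The welcome page renders one "Yes: to IP:PORT" cell per connected server.
    html = urllib2.urlopen(welcome_url).read()
    return html.count("Yes: to ")

def wait_for_grid(expected, timeout=10, poll=1):
    # Pass if we reach `expected` connected servers within `timeout` seconds;
    # with this bug, connecting to the full testgrid can instead take ~2 hours.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if count_connected() >= expected:
            return True
        time.sleep(poll)
    return False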

comment:8 Changed at 2009-02-04T04:31:31Z by zooko

Okay, at 20:52:54 it connected:

20:52:53.625 L10 []#3657 initClient: to target pb://xiktf6ok5f5ao5znxxttriv233hmvi4v
20:52:53.625 L10 []#3658 sendPlaintextClient: GET for tubID xiktf6ok5f5ao5znxxttriv233hmvi4v
20:52:53.625 L10 []#3659 sendPlaintextClient: wantEncryption=True
20:52:53.780 L10 []#3660 dataReceived(isClient=True,phase=0,options={}): 'HTTP/1.1 101 Switching Protocols\r\nUpgrade: TLS/1.0, PB/1.0\r\nConnection: Upgrade\r\n\r\n'
20:52:53.780 L10 []#3661 handlePLAINTEXTClient: header='HTTP/1.1 101 Switching Protocols^M
Upgrade: TLS/1.0, PB/1.0^M
Connection: Upgrade'
20:52:53.780 L10 []#3662 startENCRYPTED(isClient=True, encrypted=True)
20:52:53.780 L10 []#3663 startTLS, client=True
20:52:53.780 L10 []#3664 Negotiate.sendHello (isClient=True): {'my-incarnation': 'd60b43291400a40a', 'last-connection': 'none 0', 'my-tub-id': 'cjzndszpggbecw2r7cpf3hsl3snd4bly', '
initial-vocab-table-range': '0 1', 'banana-negotiation-range': '3 3'}
20:52:53.780 L10 []#3665 dataReceived(isClient=True,phase=1,options={}): 'banana-negotiation-range: 3 3\r\ninitial-vocab-table-range: 0 1\r\nmy-incarnation: 8e3b459d16af4062\r\nmy-
tub-id: ivjakubrruewknqg7wgb5hbinasqupj6\r\n\r\n'
20:52:53.780 L10 []#3666 evaluateHello(isClient=True): offer={'my-tub-id': 'ivjakubrruewknqg7wgb5hbinasqupj6', 'my-incarnation': '8e3b459d16af4062', 'initial-vocab-table-range': '0
 1', 'banana-negotiation-range': '3 3'}
20:52:53.780 L10 []#3667 iAmTheMaster: False
20:52:53.905 L20 []#3668 Stopping factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to sskpc5gr] at 0x01A5F810>
20:52:53.905 L20 []#3669 Stopping factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to hvoivvgm] at 0x01A5F350>
20:52:53.905 L20 []#3670 Stopping factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to ivjakubr] at 0x029DC4F0>
20:52:54.092 L10 []#3671 dataReceived(isClient=True,phase=2,options={}): 'banana-decision-version: 3\r\ncurrent-connection: 8e3b459d16af4062 10\r\ninitial-vocab-table-index: 1 bb33
\r\n\r\n'
20:52:54.092 L10 []#3672 handleDECIDING(isClient=True): banana-decision-version: 3^M
current-connection: 8e3b459d16af4062 10^M
initial-vocab-table-index: 1 bb33
20:52:54.092 L20 []#3673 Banana.acceptDecision: got {'banana-decision-version': '3', 'initial-vocab-table-index': '1 bb33', 'current-connection': '8e3b459d16af4062 10'}
20:52:54.092 L10 []#3674 Negotiate.switchToBanana(isClient=True)
20:52:54.092 L10 []#3675 params: {'banana-decision-version': 3, 'initial-vocab-table-index': 1}
20:52:54.092 L20 []#3676 negotiationComplete, <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to ivjakubr] at 0x029DC670> won
20:52:54.092 L20 []#3677 connectorFinished (<foolscap.negotiate.TubConnector object at 0x029DCD90 from cjzndszpggbecw2r7cpf3hsl3snd4bly to ivjakubrruewknqg7wgb5hbinasqupj6>)
20:52:54.250 L10 []#3678 dataReceived(isClient=True,phase=1,options={}): 'banana-negotiation-range: 3 3\r\ninitial-vocab-table-range: 0 1\r\nmy-incarnation: 962e316d4519b914\r\nmy-
tub-id: xiktf6ok5f5ao5znxxttriv233hmvi4v\r\n\r\n'
20:52:54.250 L10 []#3679 evaluateHello(isClient=True): offer={'my-tub-id': 'xiktf6ok5f5ao5znxxttriv233hmvi4v', 'my-incarnation': '962e316d4519b914', 'initial-vocab-table-range': '0
 1', 'banana-negotiation-range': '3 3'}
20:52:54.250 L10 []#3680 iAmTheMaster: False
20:52:54.296 L20 []#3681 got connection to ivjakubr, getting versions
20:52:54.530 L10 []#3682 dataReceived(isClient=True,phase=2,options={}): 'banana-decision-version: 3\r\ncurrent-connection: 962e316d4519b914 10\r\ninitial-vocab-table-index: 1 bb33
\r\n\r\n'
20:52:54.530 L10 []#3683 handleDECIDING(isClient=True): banana-decision-version: 3^M
current-connection: 962e316d4519b914 10^M
initial-vocab-table-index: 1 bb33
20:52:54.530 L20 []#3684 Banana.acceptDecision: got {'banana-decision-version': '3', 'initial-vocab-table-index': '1 bb33', 'current-connection': '962e316d4519b914 10'}
20:52:54.530 L10 []#3685 Negotiate.switchToBanana(isClient=True)
20:52:54.530 L10 []#3686 params: {'banana-decision-version': 3, 'initial-vocab-table-index': 1}
20:52:54.530 L20 []#3687 negotiationComplete, <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x028B04F0> won
20:52:54.530 L20 []#3688 told to disconnect
20:52:54.530 L20 []#3689 connectorFinished (<foolscap.negotiate.TubConnector object at 0x028B05D0 from cjzndszpggbecw2r7cpf3hsl3snd4bly to xiktf6ok5f5ao5znxxttriv233hmvi4v>)
20:52:54.530 L20 []#3690 Stopping factory <foolscap.negotiate.TubConnectorClientFactory object [from cjzndszp] [to xiktf6ok] at 0x028B0CD0>
20:52:54.530 L20 []#3691 connectorFinished (<foolscap.negotiate.TubConnector object at 0x028B05D0 from cjzndszpggbecw2r7cpf3hsl3snd4bly to xiktf6ok5f5ao5znxxttriv233hmvi4v>)
20:52:54.530 L30 []#3692 Tub.connectorFinished: WEIRD, <foolscap.negotiate.TubConnector object at 0x028B05D0 from cjzndszpggbecw2r7cpf3hsl3snd4bly to xiktf6ok5f5ao5znxxttriv233hmvi
4v> is not in [<foolscap.negotiate.TubConnector object at 0x029DCFF0 from cjzndszpggbecw2r7cpf3hsl3snd4bly to jrpz55j6rlzqjh5ghnx4hlpa4pmnd4th>, <foolscap.negotiate.TubConnector ob
ject at 0x029DCD30 from cjzndszpggbecw2r7cpf3hsl3snd4bly to mnvtm46ne4r4he5qeoavek3gvpry7lpt>, <foolscap.negotiate.TubConnector object at 0x028B0D30 from cjzndszpggbecw2r7cpf3hsl3s
nd4bly to trjdor3okozw4eld3l6zl4ap4z6h5tk6>, <foolscap.negotiate.TubConnector object at 0x028B0930 from cjzndszpggbecw2r7cpf3hsl3snd4bly to reezjqi5hylz4nmsajurqosvo6ou44xc>, <fool

comment:9 Changed at 2009-02-05T07:09:10Z by zooko

ndurner on IRC said that Windows XP SP2 limits the number of concurrent TCP connection attempts, and that if this limit had been hit there would be a note in Start -> Control Panel -> Administrative Tools -> Event Viewer. Sure enough, there it is:

"TCP/IP has reached the security limit imposed on the number of concurrent TCP connection attempts."

It offers to connect to a Microsoft server and fetch more information, and I can't find a URL for this information on the public web, so here it is:

Details 
Product: Windows Operating System 
ID: 4226 
Source: Tcpip 
Version: 5.2 
Symbolic Name: EVENT_TCPIP_TCP_CONNECT_LIMIT_REACHED 
Message: TCP/IP has reached the security limit imposed on the number of concurrent (incomplete) TCP connect attempts.
Explanation 
The TCP/IP stack in Windows XP with Service Pack 2 (SP2) installed limits the number of concurrent, incomplete outbound TCP connection attempts. When the limit is reached, subsequent connection attempts are put in a queue and resolved at a fixed rate so that there are only a limited number of connections in the incomplete state. During normal operation, when programs are connecting to available hosts at valid IP addresses, no limit is imposed on the number of connections in the incomplete state. When the number of incomplete connections exceeds the limit, for example, as a result of programs connecting to IP addresses that are not valid, connection-rate limitations are invoked, and this event is logged.

Establishing connection–rate limitations helps to limit the speed at which malicious programs, such as viruses and worms, spread to uninfected computers. Malicious programs often attempt to reach uninfected computers by opening simultaneous connections to random IP addresses. Most of these random addresses result in failed connections, so a burst of such activity on a computer is a signal that it may have been infected by a malicious program.

Connection-rate limitations may cause certain security tools, such as port scanners, to run more slowly.
User Action 
This event is a warning that a malicious program or a virus might be running on the system. To troubleshoot the issue, find the program that is responsible for the failing connection attempts and, if the program might be malicious, close the program as follows.

To close the program

  1. At the command prompt, type netstat -no.
  2. Find the process with a large number of open connections that are not yet established. These connections are indicated by the TCP state SYN_SENT in the State column of the Active Connections information.
  3. Note the process identification number (PID) of the process in the PID column.
  4. Press CTRL+ALT+DELETE and then click Task Manager.
  5. On the Processes tab, select the processes with the matching PID, and then click End Process. (If you need to select the option to view the PID for processes, on the View menu, click Select Columns, select the PID (Process Identifier) check box, and then click OK.)
 
 

--------------------------------------------------------------------------------
 
Related Knowledge Base articles
You can find additional information on this topic in the following Microsoft Knowledge Base articles:
  • A file sharing connection to a Windows Server 2008-based server drops unexpectedly if the server has Symantec Antivirus installed
    Describes an issue in which a file sharing connection to a Windows Server 2008-based server may drop unexpectedly if the server has Symantec Antivirus installed. Provides a workaround and a resolution.
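
The "User Action" steps above amount to counting connections stuck in SYN_SENT per PID. Here is a quick sketch of the same check (assuming Python 2.7+ for subprocess.check_output; the relevant `netstat -no` columns on Windows are Proto, Local Address, Foreign Address, State, PID):

import subprocess
from collections import defaultdict

def syn_sent_by_pid():
    # Each TCP line of `netstat -no` ends with the State and PID columns.
    out = subprocess.check_output(["netstat", "-no"]).decode("ascii", "replace")
    counts = defaultdict(int)
    for line in out.splitlines():
        parts = line.split()
        if len(parts) == 5 and parts[3] == "SYN_SENT":
            counts[parts[4]] += 1
    return dict(counts)

# A PID with many SYN_SENT entries is the process tripping the XP SP2 limit.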
 
 

comment:10 Changed at 2009-02-05T07:10:17Z by zooko

Note that even though the production grid has more servers than the test grid, and even though this happened with tahoe-1.2.0 and foolscap-0.2.9 in my own tests, this doesn't seem to be happening with tahoe-prod-3.0.0 on the production grid.

comment:11 Changed at 2009-02-05T15:42:43Z by zooko

  • Milestone 1.3.0 deleted

Ah, here's why it doesn't affect the prod grid:

"""During normal operation, when programs are connecting to available hosts at valid IP addresses, no limit is imposed on the number of connections in the incomplete state. When the number of incomplete connections exceeds the limit, for example, as a result of programs connecting to IP addresses that are not valid, connection-rate limitations are invoked, and this event is logged."""

There are too many no-longer-reachable servers on the test grid.

Moving this out of the 1.3.0 Milestone.

comment:12 Changed at 2009-04-13T03:54:15Z by zooko

  • Keywords Win32 added
  • Milestone set to eventually
  • Summary changed from delayed connection on Windows to two-hour delay to connect to a grid from Win32, if there are many storage servers unreachable

Changing the name to indicate the severity.

comment:13 Changed at 2010-02-27T08:50:01Z by davidsarah

  • Keywords win32 performance hang added; Win32 removed
  • Milestone changed from eventually to 1.7.0

I'm not sure whether this is actually fixable; it might be an unavoidable problem with Windows TCP/IP, but we should investigate it further.

comment:14 Changed at 2010-02-27T19:12:02Z by zooko

Changes which reduce the number of attempts to connect to servers, such as #448 (download: speak to as few servers as possible), might help. If a Windows user starts up their Tahoe-LAFS gateway and then tries to download a single file, it might not start too many attempts to connect to unreachable servers, and would thus avoid this penalty from Windows. However, if the user tries to download many files at once, it would still incur the penalty, since each file would trigger attempts to connect to a different set of servers.

Another way to fix, or in some sense "work around", the problem is to implement some sort of presence and routing protocol other than the current naïve one. The current design of introduction, which I rather like for its simplicity, separates out three notions: announcement of a node's existence, routing information (currently: the IP addresses and port numbers you can use to reach the node), and presence notification (whether the node is currently on-line or off-line). Announcement-of-existence and propagation-of-routing-information are both done by the introducer, while presence is done directly from each node to each of its peers. You could imagine an alternative design in which the Introducer (or, hopefully, a decentralized introduction scheme instead (#68, #295)) informed you "Here is a server, but it is currently off-line", or else didn't tell you about off-line servers at all. In that case, the storage client running on Windows would usually open connections only to currently-live servers, and would avoid this penalty from Windows entirely.
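
To make the idea concrete, the client side of that alternative design might look like this (entirely hypothetical: today's announcements carry no presence information, so there is no "online" field to filter on):

def start_connections(tub, announcements):
    # Each announcement is imagined here as a dict carrying the server's FURL
    # plus a hypothetical "online" presence hint from the introducer. The real
    # protocol announces only existence and routing, so the client attempts a
    # connection to every announced server, reachable or not.
    for ann in announcements:
        if ann.get("online", True):
            # foolscap's Tub.connectTo() starts a persistent reconnection loop.
            tub.connectTo(ann["furl"], lambda rref: None)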

comment:15 Changed at 2010-02-27T20:15:07Z by davidsarah

  • Keywords windows introducer added; win32 removed

comment:16 Changed at 2010-03-12T23:48:46Z by davidsarah

  • Keywords introduction added; introducer removed

comment:17 Changed at 2010-05-08T19:42:40Z by zooko

  • Milestone changed from 1.7.0 to eventually

This isn't going to happen for v1.7.0, and also it is blocked on #448 because I want to see what effect #448 has on this issue.

comment:18 Changed at 2010-05-08T19:46:51Z by zooko

If you like this ticket, you might also like the "Brian's New Downloader" bundle of tickets: #800 (improve alacrity by downloading only the part of the Merkle Tree that you need), #798 (improve random-access download to retrieve/decrypt less data), #809 (Measure how segment size affects upload/download speed.), #287 (download: tolerate lost or missing servers), and #448 (download: speak to as few servers as possible).

comment:19 Changed at 2010-08-15T06:16:23Z by zooko

#448 is fixed. I looked in the code, and this issue (#605) will remain unchanged by the fix for #448, because the storage client starts attempting to connect to each storage server as soon as it learns about the server: trunk/src/allmydata/storage_client.py@4131#L102. One way to improve the client's behavior with regard to this ticket would be to make that call to start_connection() lazy, invoked only when the client actually wants to send a message to that particular server. However, this would harm alacrity: it would move the TCP and foolscap handshaking from introduction time to request time, increasing the delay between making the first request to each server and receiving the response. I don't think that is a good trade-off.
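
For illustration, the lazy variant rejected above would look roughly like this (a hypothetical wrapper, not the real storage_client.py code, which calls start_connection() eagerly at announcement time):

class LazyServerConnection(object):
    # Defers the TCP + foolscap handshake until the first request needs this
    # server. That would avoid the startup stall on Windows but, as noted
    # above, moves the handshake cost onto the first request to each server.
    def __init__(self, tub, furl):
        self._tub = tub
        self._furl = furl
        self._rref_d = None

    def get_rref(self):
        if self._rref_d is None:
            # Tub.getReference() returns a Deferred that fires with the
            # RemoteReference once negotiation completes.
            self._rref_d = self._tub.getReference(self._furl)
        return self._rref_d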

That means this ticket is going to stay open until we implement something more along the lines of comment:14—a way for clients to get hints that certain servers are currently off-line without attempting to open a TCP connection to those servers. Oh, also something that might help with this would be μTP (#1179).
