#807 closed defect (cannot reproduce)
exceptions.OverflowError: join() result is too long for a Python string
Reported by: | zooko | Owned by: | warner |
---|---|---|---|
Priority: | major | Milestone: | undecided |
Component: | code | Version: | 1.3.0 |
Keywords: | logging foolscap | Cc: | warner |
Launchpad Bug: |
Description
Here is the complete contents from one "Log opened." to the next of tahoebs5.allmydata.com:/home/amduser/public/bs5c1/logs/twistd.log:
2009/07/15 06:30 -0700 [-] Log opened. 2009/07/15 06:30 -0700 [-] twistd 2.5.0 (/usr/bin/python 2.5.2) starting up 2009/07/15 06:30 -0700 [-] reactor class: <class 'twisted.internet.selectreactor.SelectReactor'> 2009/07/15 06:30 -0700 [-] Loading tahoe-client.tac... 2009-07-15 13:30:44.135Z [-] Loaded. 2009-07-15 13:30:44.139Z [-] foolscap.pb.Listener starting on 45963 2009-07-15 13:30:44.141Z [-] nevow.appserver.NevowSite starting on 9051 2009-07-15 13:30:44.141Z [-] Starting factory <nevow.appserver.NevowSite instance at 0x8b7e74c> 2009-07-15 13:30:44.145Z [-] twisted.internet.protocol.DatagramProtocol starting on 50791 2009-07-15 13:30:44.145Z [-] Starting protocol <twisted.internet.protocol.DatagramProtocol instance at 0x8b7ec2c> 2009-07-15 13:30:44.161Z [-] (Port 50791 Closed) 2009-07-15 13:30:44.162Z [-] Stopping protocol <twisted.internet.protocol.DatagramProtocol instance at 0x8b7ec2c> 2009-07-27 05:00:32.314Z [Negotiation,158,98.202.225.214] got banana ERROR from remote side: internal server error, see logs 2009-08-07 00:49:15.226Z [-] Unhandled error in Deferred: 2009-08-07 00:49:15.259Z [-] Unhandled Error Traceback (most recent call last): Failure: foolscap.ipb.DeadReferenceError: Connection was lost 2009-08-31 01:56:53.611Z [-] Unhandled error in Deferred: 2009-08-31 01:56:53.635Z [-] Unhandled Error Traceback (most recent call last): Failure: foolscap.ipb.DeadReferenceError: Connection was lost 2009-08-31 01:58:11.104Z [-] Unhandled error in Deferred: 2009-08-31 01:58:11.104Z [-] Unhandled Error Traceback (most recent call last): Failure: foolscap.ipb.DeadReferenceError: Connection was lost 2009-08-31 12:32:18.472Z [-] Unhandled error in Deferred: 2009-08-31 12:32:18.496Z [-] Unhandled Error Traceback (most recent call last): Failure: foolscap.ipb.DeadReferenceError: Connection was lost 2009-09-11 03:10:34.056Z [Negotiation,1031,97.118.104.193] Unhandled Error Traceback (most recent call last): File "/usr/lib/python2.5/site-packages/twisted/python/log.py", line 48, in callWithLogger return callWithContext({"system": lp}, func, *args, **kw) File "/usr/lib/python2.5/site-packages/twisted/python/log.py", line 33, in callWithContext return context.call({ILogContext: newCtx}, func, *args, **kw) File "/usr/lib/python2.5/site-packages/twisted/python/context.py", line 59, in callWithContext return self.currentContext().callWithContext(ctx, func, *args, **kw) File "/usr/lib/python2.5/site-packages/twisted/python/context.py", line 37, in callWithContext return func(*args,**kw) --- <exception caught here> --- File "/usr/lib/python2.5/site-packages/twisted/internet/selectreactor.py", line 139, in _doReadOrWrite why = getattr(selectable, method)() File "/usr/lib/python2.5/site-packages/twisted/internet/tcp.py", line 154, in doWrite return Connection.doWrite(self) File "/usr/lib/python2.5/site-packages/twisted/internet/abstract.py", line 104, in doWrite self.dataBuffer = buffer(self.dataBuffer, self.offset) + "".join(self._tempDataBuffer) exceptions.OverflowError: join() result is too long for a Python string
Here is the startup version announcement from an incident report file which was recorded on 2009-06-02:
Application versions (embedded in logfile): Nevow: 0.9.26 Twisted: 2.5.0 allmydata-tahoe: 1.3.0-r3747 argparse: 0.8.0 foolscap: 0.3.2 platform: Linux-Ubuntu_8.04-i686-32bit pyOpenSSL: 0.6 pycryptopp: 0.5.2-1 python: 2.5.2 pyutil: 1.3.16-12 setuptools: 0.6c8 simplejson: 1.7.3 twisted: 2.5.0 zfec: 1.4.0-4 zope.interface: 3.3.1 PID: 4189
Change History (7)
comment:1 Changed at 2009-10-27T03:39:47Z by zooko
- Version changed from 1.5.0 to 1.3.0
comment:2 Changed at 2009-10-27T04:05:37Z by zooko
- Owner changed from somebody to warner
comment:3 Changed at 2009-12-13T02:21:24Z by davidsarah
- Keywords logging foolscap added
comment:4 Changed at 2010-04-13T04:18:03Z by zooko
Brian: I'm still concerned about the meta-issue here. Can we somehow ensure that all exceptions get logged as foolscap logging incidents or at least as twistd.log lines? I really don't like the feeling that exceptions silently disappear sometimes.
comment:5 Changed at 2010-04-13T04:23:50Z by zooko
Okay, that meta-issue about disappearing exceptions is now #1021.
comment:6 Changed at 2011-12-20T18:47:26Z by zooko
- Cc warner added
- Resolution set to cannot reproduce
- Status changed from new to closed
I wonder if we should just close this ticket as "irreproducible and possibly fixed by some other change in the interim". This may have been related to a bug that we had a long time ago in the combination of Tahoe-LAFS+Twisted which caused very large strings to be produced during exceptions (#379). This issue may have been caused by that, and there may be nothing more we need to do about it.
I'm going to add Cc: Brian as I close this so he has once last chance to look at it and think if we should do anything else about it. :-)
comment:7 Changed at 2011-12-21T20:31:49Z by warner
No idea. Smells like an out-of-memory problem, or memory corruption, or something. It's unlikely that this server tried to buffer 2GiB of output data to a client, unless maybe someone had a multi-gigabyte file with k=1 and a client which overparallelized segment downloads (which I suppose we did have briefly, but which is fixed now).
So yeah, CANNOTREPRODUCE sounds fine.
http://svn.python.org/view/python/tags/r252/Objects/stringobject.c?revision=60915&view=markup
Search in text for "join() result is too long for a Python string". It is guarded by if (sz < old_sz || sz > PY_SSIZE_T_MAX). PY_SSIZE_T_MAX is defined in http://svn.python.org/view/python/tags/r252/Include/pyport.h?revision=60915&view=markup to be ((size_t)-1)>>1 which would be 2,147,483,647 on that platform. So, I guess we were trying to .join() together a string that would have been more than 2.1 billion bytes.
I don't see how to investigate this, reproduce it, or determine if it has been fixed in newer versions of Tahoe-LAFS. One of the reasons why not is that the exception raised by the selectreactor's _doReadOrWrite() apparently didn't get propagated to foolscap, because no accompanying incident report file was generated.
Brian: am I interpreting that correctly? Is there a way to make sure that all unhandled exceptions get registered with the foolscap logging system so that they can be reported as incidents? Do you have any other ideas how to learn more about this issue, or should we just close it as "wontfix"?