[tahoe-dev] [tahoe-lafs] #737: python2.5 setup.py test runs CPU to 100% on 32-bit single-core NetBSD "4"
tahoe-lafs
trac at allmydata.org
Sun Jun 21 13:55:55 PDT 2009
#737: python2.5 setup.py test runs CPU to 100% on 32-bit single-core NetBSD "4"
---------------------------+------------------------------------------------
Reporter: midnightmagic | Owner: warner
Type: defect | Status: new
Priority: major | Milestone: 1.5.0
Component: code | Version: 1.4.1
Keywords: | Launchpad_bug:
---------------------------+------------------------------------------------
Comment(by warner):
Wow, fun! A quick look at the python-2.6 source
(Modules/timemodule.c:floattime) doesn't suggest any obvious way to get a
NaN.. it calls the C gettimeofday/ftime/time (depending upon what your
platform has), adds the pieces together, and returns the result.
You said that a simple test case that just calls time.time() repeatedly
didn't ever fail, right? That's unfortunate.. if we didn't think Tahoe was
involved then I'd suggest instrumenting timemodule.c to remember the
pieces it got, build the PyFloat, then if it's NaN immediately print out
the pieces, so we could figure out what gettimeofday() returned that
provoked a NaN.
If there were a low-level threading bug that was clobbering memory, I'd
expect to see exceptions or deeper errors than just a NaN. If time()
couldn't allocate the memory for the PyFloat object, it would raise an
exception instead of returning NaN.
Hm, it's worth noting that floats are formatted to strings (in
Objects/floatobject.c:format_float) by doing snprintf(), so if your
platform's libc does something funky with snprintf(), that might cause
problems. Also, Python doesn't appear to do anything to define or test NaN
directly: it just tells C to do a+b or a>=b or whatever. So something
weird in your C compiler's implementation of floating-point math (or your
CPU) could get involved too.
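To illustrate why a NaN can travel a long way before anything complains, here is a minimal sketch (modern Python, not the 2.5 build from this ticket; the names is_nan and now are made up for the example). It relies only on IEEE-754 behaviour: arithmetic on a NaN yields another NaN, and ordered comparisons against it are false, so neither the a+b nor the a>=b style of code raises an error.

```python
def is_nan(x):
    # NaN is the only float that is not equal to itself. This test also
    # works on Pythons that predate math.isnan (added in 2.6).
    return x != x

nan = float("nan")
now = 1245617755.0  # an arbitrary timestamp, purely illustrative

later = nan + now          # arithmetic silently propagates the NaN
print(is_nan(later))       # True
print(nan >= now)          # False
print(nan < now)           # False: both orderings fail, nothing is raised
```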
If you get into this, you might try:
* modify python's timemodule.c to store the values retrieved from
gettimeofday() in a file-global variable, just before it adds them
together to create the return value for floattime()/time()
* add a function to timemodule.c which retrieves these stored values with
as little interpretation as possible (maybe memcpy them into a string in
addition to interpreting them as floats)
* in your catch-NaN-in-reactor.callLater assertion, retrieve and print
these values
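The Python side of the last step might look roughly like the sketch below. Everything here is an assumption for illustration: check_not_nan and raw_hex are hypothetical helpers, and the pieces argument stands in for whatever (tv_sec, tv_usec) pair the new timemodule.c accessor would hand back. No such accessor exists in stock Python.

```python
import binascii
import struct

def raw_hex(value):
    """Hex dump of a float's 8 IEEE-754 bytes (big-endian), uninterpreted."""
    return binascii.hexlify(struct.pack(">d", value)).decode("ascii")

def check_not_nan(when, pieces):
    """Return `when` unchanged, or raise with diagnostics if it is NaN.

    `pieces` is assumed to be the (tv_sec, tv_usec) pair saved by the
    hypothetical instrumented timemodule.c.
    """
    if when != when:  # true only for NaN
        raise AssertionError(
            "NaN time value: raw=%s tv_sec=%r tv_usec=%r"
            % (raw_hex(when), pieces[0], pieces[1]))
    return when
```

Calling check_not_nan() on the time value at the top of the callLater wrapper would put the raw bytes and the saved pieces into the traceback at the moment the NaN first shows up.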
If we catch gettimeofday() returning something insane, it's either the
kernel or some weird memory corruption that's just not causing anything
else to catch fire. If gettimeofday() is behaving, then we should suspect
the floattime() math or the floating point operations done afterward.
Another idea is to add code to floattime() that stringifies the float
and compares it against NaN right away. Then run everything under gdb and
put a breakpoint on the 'if' side of that comparison, then start using the
tahoe node until it fails in this way. Then look at the local variables in
the debugger and see if anything looks suspicious.
Boy, you know how to find the fun bugs, don't you? :-)
--
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/737#comment:10>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid