#1395 closed defect (fixed)

error doing a check --verify on files bigger than about 1Gbyte

Reported by: sickness Owned by: zooko
Priority: critical Milestone: 1.9.0
Component: code-encoding Version: 1.8.2
Keywords: memory verify error performance test-needed Cc:
Launchpad Bug:

Description (last modified by davidsarah)

on a grid with 10 storage nodes, all on win32, and a gateway/client on win32, with all the nodes up and all the shares ok, I can download all the files ok, and check all of them ok. only thing that doesn't work is check --verify of files bigger than 1Gbyte

here's the error log of the gateway, and after it of a storage node:

######################################################################

2011-05-05 20:33:34+0200 [HTTPChannel,42,10.100.9.71] [Failure instance: Traceback: <type 'exceptions.IOError'>: [Errno 22] invalid mode ('wb') or filename: u'C:\\tahoeclient\\logs\\incidents\\incident-2011-05-05--18:33:34Z-w2qn32q.flog'
2011-05-05 20:33:34+0200 [HTTPChannel,42,10.100.9.71] c:\tahoe-lafs\support\Lib\site-packages\twisted-8.2.0-py2.6-win32.egg\twisted\internet\defer.py:106:maybeDeferred
2011-05-05 20:33:34+0200 [HTTPChannel,42,10.100.9.71] c:\tahoe-lafs\src\allmydata\web\common.py:302:renderHTTP
2011-05-05 20:33:34+0200 [HTTPChannel,42,10.100.9.71] c:\tahoe-lafs\src\allmydata\web\root.py:124:render_POST
2011-05-05 20:33:34+0200 [HTTPChannel,42,10.100.9.71] c:\tahoe-lafs\support\Lib\site-packages\foolscap-0.6.1-py2.6.egg\foolscap\logging\log.py:210:msg
2011-05-05 20:33:34+0200 [HTTPChannel,42,10.100.9.71] --- <exception caught here> ---
2011-05-05 20:33:34+0200 [HTTPChannel,42,10.100.9.71] c:\tahoe-lafs\support\Lib\site-packages\foolscap-0.6.1-py2.6.egg\foolscap\logging\log.py:265:add_event
2011-05-05 20:33:34+0200 [HTTPChannel,42,10.100.9.71] c:\tahoe-lafs\support\Lib\site-packages\foolscap-0.6.1-py2.6.egg\foolscap\logging\incident.py:35:event
2011-05-05 20:33:34+0200 [HTTPChannel,42,10.100.9.71] c:\tahoe-lafs\support\Lib\site-packages\foolscap-0.6.1-py2.6.egg\foolscap\logging\log.py:278:declare_incident
2011-05-05 20:33:34+0200 [HTTPChannel,42,10.100.9.71] c:\tahoe-lafs\support\Lib\site-packages\foolscap-0.6.1-py2.6.egg\foolscap\logging\incident.py:85:incident_declared
2011-05-05 20:33:34+0200 [HTTPChannel,42,10.100.9.71] ]
2011-05-05 20:42:33+0200 [HTTPChannel,43,10.100.9.71] Unhandled Error
	Traceback (most recent call last):
	  File "c:\tahoe-lafs\support\Lib\site-packages\nevow-0.10.0-py2.6.egg\nevow\rend.py", line 569, in _renderHTTP
	    return self.flattenFactory(doc, ctx, writer, finisher)
	  File "c:\tahoe-lafs\support\Lib\site-packages\nevow-0.10.0-py2.6.egg\nevow\rend.py", line 528, in <lambda>
	    flattenFactory = lambda self, *args: flat.flattenFactory(*args)
	  File "c:\tahoe-lafs\support\Lib\site-packages\nevow-0.10.0-py2.6.egg\nevow\flat\__init__.py", line 14, in flattenFactory
	    return deferflatten(stan, ctx, writer).addCallback(finisher)
	  File "c:\tahoe-lafs\support\Lib\site-packages\nevow-0.10.0-py2.6.egg\nevow\flat\twist.py", line 63, in deferflatten
	    _drive(iterable, finished)
	--- <exception caught here> ---
	  File "c:\tahoe-lafs\support\Lib\site-packages\nevow-0.10.0-py2.6.egg\nevow\flat\twist.py", line 24, in _drive
	    next = iterable.next()
	  File "c:\tahoe-lafs\support\Lib\site-packages\nevow-0.10.0-py2.6.egg\nevow\flat\ten.py", line 85, in iterflatten
	    straccum.append(item)
	exceptions.MemoryError: 

######################################################################

now for the storage log:

######################################################################

2011-05-05 20:29:01+0200 [Negotiation,8,10.100.26.102] Unhandled Error
	Traceback (most recent call last):
	  File "c:\tahoe-lafs\support\Lib\site-packages\twisted-8.2.0-py2.6-win32.egg\twisted\python\log.py", line 84, in callWithLogger
	    return callWithContext({"system": lp}, func, *args, **kw)
	  File "c:\tahoe-lafs\support\Lib\site-packages\twisted-8.2.0-py2.6-win32.egg\twisted\python\log.py", line 69, in callWithContext
	    return context.call({ILogContext: newCtx}, func, *args, **kw)
	  File "c:\tahoe-lafs\support\Lib\site-packages\twisted-8.2.0-py2.6-win32.egg\twisted\python\context.py", line 59, in callWithContext
	    return self.currentContext().callWithContext(ctx, func, *args, **kw)
	  File "c:\tahoe-lafs\support\Lib\site-packages\twisted-8.2.0-py2.6-win32.egg\twisted\python\context.py", line 37, in callWithContext
	    return func(*args,**kw)
	--- <exception caught here> ---
	  File "c:\tahoe-lafs\support\Lib\site-packages\twisted-8.2.0-py2.6-win32.egg\twisted\internet\selectreactor.py", line 146, in _doReadOrWrite
	    why = getattr(selectable, method)()
	  File "c:\tahoe-lafs\support\Lib\site-packages\twisted-8.2.0-py2.6-win32.egg\twisted\internet\tcp.py", line 180, in doWrite
	    return Connection.doWrite(self)
	  File "c:\tahoe-lafs\support\Lib\site-packages\twisted-8.2.0-py2.6-win32.egg\twisted\internet\tcp.py", line 431, in doWrite
	    result = abstract.FileDescriptor.doWrite(self)
	  File "c:\tahoe-lafs\support\Lib\site-packages\twisted-8.2.0-py2.6-win32.egg\twisted\internet\abstract.py", line 101, in doWrite
	    self.dataBuffer = buffer(self.dataBuffer, self.offset) + "".join(self._tempDataBuffer)
	exceptions.MemoryError: 

###################################################################### the gateway/client process quickly grow over 1.5 Gbyte in ram, then it shrunk to 180 Mbyte, the storage process grow to 300 Mbyte and stayed there

Attachments (4)

serialize-verifier.diff (1.6 KB) - added by warner at 2011-05-08T20:47:00Z.
patch to serialize block-fetches in Verifier. Multiple shares are still done in parallel.
1395-overparallel.diff (4.6 KB) - added by warner at 2011-06-11T19:37:12Z.
serialize block-fetches, add test
not-too-parallel-test-by-patching.diff (6.8 KB) - added by zooko at 2011-06-17T06:09:26Z.
not-too-parallel-test-by-low-tech-patching.diff (5.0 KB) - added by zooko at 2011-06-17T06:15:36Z.

Download all attachments as: .zip

Change History (42)

comment:1 Changed at 2011-05-05T20:13:44Z by sickness

  • Description modified (diff)

comment:2 Changed at 2011-05-05T21:45:16Z by davidsarah

I suspect that this isn't specific to win32, although the limit on usable process address space for 32-bit Windows might be smaller than for other 32-bit platforms.

There was also an error while reporting the incident:

[Failure instance: Traceback: <type 'exceptions.IOError'>: [Errno 22] invalid mode ('wb') or filename: u'C:\\tahoeclient\\logs\\incidents\\incident-2011-05-05--18:33:34Z-w2qn32q.flog'

but it isn't clear whether that's due to the low-memory condition or an independent issue. 'wb' is a valid mode, so probably the path is incorrect. Does C:\tahoeclient\logs\incidents\ exist?

comment:3 Changed at 2011-05-05T21:55:00Z by sickness

yeah, I can confirm that C:\tahoeclient\logs\incidents\ exists

comment:4 follow-up: Changed at 2011-05-06T15:50:59Z by davidsarah

I see the problem. ':' is not valid in a Windows filename. Will file another ticket for that.

comment:5 in reply to: ↑ 4 Changed at 2011-05-06T16:49:29Z by davidsarah

Replying to davidsarah:

I see the problem. ':' is not valid in a Windows filename. Will file another ticket for that.

#1396 and http://foolscap.lothar.com/trac/ticket/177

comment:6 Changed at 2011-05-06T16:53:21Z by davidsarah

  • Component changed from unknown to code-encoding
  • Description modified (diff)
  • Keywords memory verify added
  • Summary changed from error doing a check --verify on files bigger than 1Gb on win32 to error doing a check --verify on files bigger than about 1Gbyte

comment:7 Changed at 2011-05-06T16:55:06Z by davidsarah

  • Keywords error added

How was this reported by tahoe check --verify?

comment:8 Changed at 2011-05-06T18:14:56Z by sickness

it simply reported that 0 shares could be find for that file

comment:9 Changed at 2011-05-06T18:36:13Z by zooko

So both the gateway and the storage node had MemoryError. Can you reproduce it and tell what the Process Explorer says about how much memory those two different processes are using at different times in the operation? Are they using a large amount of memory continuously, or does it go up just before this failure, or what?

Thank you!

comment:10 Changed at 2011-05-06T19:18:10Z by davidsarah

#1229 is another checker-related memory issue. I've no evidence that it's the same issue, but it's possible that they might interact (i.e. the fact that we need memory proportional to the file size might increase the size of the #1229 memory leak, or the leak might increase the memory required for a single check).

Anyway, we should try to reproduce this on a non-Windows system to confirm that it's not win32-specific.

comment:11 Changed at 2011-05-06T19:23:33Z by davidsarah

Note that performance.rst claims:

memory footprint: N/K*S

for verifying a file (where S is the segment size). That seems to be contradicted by the behaviour in this bug.

Version 0, edited at 2011-05-06T19:23:33Z by davidsarah (next)

comment:12 follow-up: Changed at 2011-05-07T18:33:32Z by warner

Not that it's likely to be the problem here, but we should update that claim to be "N/K*S times a small multiple". I think the multiple is currently about 2 or 3. During encryption, we hold both a plaintext share and a ciphertext share in RAM at the same time (so 2*S), then we drop the plaintext. During erasure-coding, we hold a whole S of ciphertext in memory at the same time as the N/K*S shares, then we drop the ciphertext before pushing. We also pipeline the sends a little bit, I think 10kB or 50kB per server, to get better utilization out of a non-zero-latency wire.

Also Python's memory-management strategy interacts weirdly. Dropping the plaintext segment may not be enough: Python might not re-use that memory space for anything else right away. Although I'd expect it to de-fragment or coalesce free blocks before asking the OS for so much memory that it crashed.

comment:13 in reply to: ↑ 12 Changed at 2011-05-08T01:43:40Z by zooko

Replying to warner:

Not that it's likely to be the problem here, but we should update that claim to be "N/K*S times a small multiple". I think the multiple is currently about 2 or 3.

I would be willing to update these docs to be more precise or more correct, but I'm not entirely sure what you want them to say. (N/K)*S*3+50KB? (But only for immutable repair.)

Note that some of the other numbers in there are marked as approximate by a preceding tilde ~, e.g. performance.rst "Repairing an A-byte file". Maybe we should use the computer science tradition of ignoring constant factors which are independent of the variables (K, N, S, A, B, and G). However, I would rather follow that tradition only when the constant factors that we're ignoring are sufficiently small that our users will be willing to ignore them too. :-)

So in short: +0 from me, but you would need to write a patch for performance.rst yourself. Your attention to that document would be much appreciated by me because I would like for your admirable concern for precision in resource usage to be better represented there.

The pipeline size, which applies only to immutable objects and only to uploads, is 50 KB: immutable/layout.py WriteBucketProxy.

comment:14 Changed at 2011-05-08T12:47:51Z by zooko

I've opened #1398 (make docs/performance.rst more precise and accurate) for the documentation issues and attached a patch there. I do hope Brian looks at it, because my previous pass seems to have left a lot of incorrect stuff in that document and I would like the next version of it to be less wrong.

Changed at 2011-05-08T20:47:00Z by warner

patch to serialize block-fetches in Verifier. Multiple shares are still done in parallel.

comment:15 Changed at 2011-05-08T21:02:54Z by warner

I think I see the problem. The immutable Verifier code (here in checker.py) is overly parallelized. It uses a DeferredList to work on all shares in parallel, and each share worker uses a DeferredList to work on all blocks in parallel. The result is that every single byte of every single share is fetched at the same time, completely blowing our memory budget. As to why the server is crashing, I suspect that when the server gets a gigantic batch of requests for every single byte of the file, it responds to all of them, queueing a massive amount of data in the output buffers, which blows the memory space. A separate issue is to protect our servers against this sort of DoS, but I'm not sure how (we'd need to delay responding to a request if there were more than a certain number of bytes sitting in the output queue for that connection, which jumps wildly across the abstraction boundaries).

The Verifier should work just like the Downloader: one segment at a time, all blocks for a single segment being fetched in parallel. That approach gives a memory footprint of about S*N/k (whereas regular download is about S). We could reduce the footprint to S/k (at the expense of speed) by doing just one block at a time (i.e. completely verify share 1 before touching share 2, and within share 1 we completely verify block 1 before touching block 2), but I think that's too much.

I've attached a patch which limits parallelism to approximately the right thing, given the slightly funky design of the Verifier (the verifier iterates primarily over *shares*, not segments). The patch continues to verify all shares in parallel. However, within each share, it serializes the handling of blocks, so that each share-handler will only look at one block at a time.

The patch needs tests, which should verify a moderate-size artificially-small-segment (thus high-number-of-segment) file, probably with N=1 for simplicity. It needs to confirm that one block is completed before the next begins: I don't know an easy way to do that.. probably needs some instrumentation in checker.py. My manual tests just added some printfs, one just before the call to vrbp.get_block(), another inside _discard_result(), and noticed that there were lots of get_blocks without interleaved _discard_results.

comment:16 Changed at 2011-05-08T21:51:14Z by davidsarah

  • Keywords performance added

Could this patch cause a performance regression for small files, or is each server going to serialize the requests to it anyway?

comment:17 Changed at 2011-05-28T21:09:36Z by zooko

  • Milestone changed from undecided to 1.9.0

This is a major bug. I really think we should fix this for v1.9.0.

comment:18 Changed at 2011-05-28T21:58:57Z by davidsarah

+1 on fixing this for 1.9.0.

comment:19 Changed at 2011-05-28T21:59:21Z by davidsarah

  • Priority changed from minor to major

comment:20 Changed at 2011-06-10T11:01:45Z by sickness

YAY, this patch solves my problem! (and also speeds up the testing and greatly reduces the memory usate of the storage node =)

comment:21 Changed at 2011-06-10T14:35:42Z by ChosenOne

Cannot reproduce the bug on my virtual machine: Windows 7 x64, 1G RAM. Test files were 1G and 2G.

comment:22 Changed at 2011-06-11T18:50:08Z by warner

ok, so sickness' report suggests this was the right problem to fix. Now to figure out how to write a test for it.

Changed at 2011-06-11T19:37:12Z by warner

serialize block-fetches, add test

comment:23 Changed at 2011-06-11T19:39:11Z by warner

  • Keywords review-needed added

Ok, that latest patch should be ready: it adds a test which includes a tiny bit of instrumentation on the immutable verifier (to measure how many block fetches are active at any one time), and it fails without the verifier fix, and passes once that fix is applied. Ready for review.

comment:24 Changed at 2011-06-11T19:44:25Z by zooko

  • Owner changed from nobody to zooko
  • Status changed from new to assigned

comment:25 Changed at 2011-06-12T14:59:27Z by davidsarah

comment:7 "How was this reported by tahoe check --verify?"

comment:8 "it simply reported that 0 shares could be find for that file"

That seems like an error-reporting bug that wouldn't be fixed by attachment:1395-overparallel.diff, even if it would happen in fewer cases.

comment:26 Changed at 2011-06-15T17:40:46Z by warner

davidsarah: hmm, if the storage server got clobbered by all the parallel fetches, then it would throw MemoryError (and either crash completely, or at least return Failures to the remote_read() calls). From the client's point of view, this is identical to there being no shares available: crashing servers are not providing shares.

Maybe the CLI command output could be extended to mention how many servers were contacted, and how many answered, so the user could distinguish crashing-servers (or no-network) from solid NAKs.

This patch is still important to apply, though.. zooko: please don't defer reviewing it because of the CLI deficiency. It'll be easier to test a CLI change with special-purpose test code (building a client with no servers, or with intentionally-failing servers) than by trying to provoke a MemoryError.

comment:27 Changed at 2011-06-17T06:11:20Z by zooko

Okay, I've reviewed the patch and it is good. Please review this alternate patch — attachment:not-too-parallel-test-by-patching.diff — in which the test patches the ValidatedReadBucketClass instead of the code under test (Checker) having code in it for testing purposes. Unfortunately this patch depends on a variant of mock.py named mockutil.py which knows what to do when a test case returns a Deferred instance, so I don't think we should commit this patch as is. I've submitted a patch to the mock project (http://code.google.com/p/mock/issues/detail?id=96 ) and I'll proceed to work up a patch for Tahoe-LAFS which uses a lower-tech patching tool. But please review this patch because it most clearly shows why I like this approach. (If you ignore the whole mockutil.py part.)

comment:28 Changed at 2011-06-17T06:18:12Z by zooko

  • Owner changed from zooko to warner
  • Status changed from assigned to new

Okay here's a version that doesn't depend on mockutil.py: attachment:not-too-parallel-test-by-low-tech-patching.diff. As you can see it makes the test code a little uglier and it doesn't handle exceptions (it'll leave the ValidatedReadBucketProxy patched out for the rest of the trial process if there is an exception raised from the test). I guess I'm a bit on the fence about which of these two patches I prefer. What's your preference?

comment:29 Changed at 2011-06-20T20:31:49Z by zooko

To clarify, the original patch attachment:1395-overparallel.diff passes my review and I approve of applying it to trunk. I also think that attachment:not-too-parallel-test-by-low-tech-patching.diff is better, and hope you (warner) will consider it as a replacement for attachment:1395-overparallel.diff. I also like attachment:not-too-parallel-test-by-patching.diff , but it has the flaw that it will probably break if someone uses it with a newer version of mock.py, so we should probably avoid it.

comment:30 Changed at 2011-06-23T21:39:09Z by warner

  • Keywords review-needed removed
  • Owner changed from warner to zooko

I'm -1 on a test that leaves the code in a broken/patched state after an exception.. too much opportunity for confusion and hair-tearing debugging of other problems later. And I'm -1 on a test that depends deeply upon internals of a library that we don't include a copy of (note that I'm historically inconsistent about this one, q.v. Foolscap vs Twisted, but I'm trying to mend my ways).

At first, I found your -low-tech-patching test hard to follow, but looking more closely, I'm growing fond of it. The original code isn't too hard to patch in the right way, so the mock subclass isn't too too weird. It might be nice if the new variable names were shorter (and since they only appear in the context of the TooParallel unit test, they don't need to be as fully-qualified as if they were in the original checker.py). But after getting used to it, I think I'm ok with the way it stands.

I'd be ok with simple partial fix to the cleanup-after-exception problem. Instead of using setUp/tearDown or pushing the whole middle of the test method into a sub-method, just put the set_up_grid part into a small function so you can get it into the Deferred chain. Something like:

    d = defer.succeed(None)
    def _start(ign):
        self.set_up_grid(num_servers=4)
        c0 = self.g.clients[0]
        c0.DEFAULT_ENCODING_PARAMETERS = { "k": 1, "happy": 4, "n": 4, 
                                           "max_segment_size": 5, } 
        self.c0 = c0
        self.uris = {} 
 	DATA = "data" * 100 # 400/5 = 80 blocks 
 	return c0.upload(Data(DATA, convergence="")) 
    d.addCallback(_start)
    def _do_check(ur):
        n = self.c0.create_node_from_uri(ur.uri)
        return n.check(Monitor(), verify=True)
    d.addCallback(_do_check)

Since the "critical region" where exceptions could cause problems doesn't start until after the ...checker.ValidatedReadBucketProxy = make_mock_VRBP line, it's good enough to just capture the first chunk of the code after that in a function which is run from within the Deferred chain. Any exceptions in set_up_grid or c0.upload (which are fairly big and complex) will still get caught, so the cleanup can happen.

comment:31 Changed at 2011-06-30T14:59:54Z by zooko

  • Status changed from new to assigned

I'll modify the low-tech patcher approach (attachment:not-too-parallel-test-by-low-tech-patching.diff) to clean up in case of exception, like Brian suggested.

Note that Michael Foord, author of mock.py, has proposed a patch for mock.py that would make it do everything we need. He just wants code-review and unit tests. See the mock.py ticket to help with that.

Once a version of mock.py is out with that functionality and we are willing to make Tahoe-LAFS depends on that version of mock then we can have the more succinct attachment:not-too-parallel-test-by-patching.diff style of mockery in our unit tests.

comment:32 Changed at 2011-07-07T05:46:25Z by zooko

  • Keywords test-needed added
  • Status changed from assigned to new

comment:33 Changed at 2011-07-18T21:41:21Z by zooko

  • Status changed from new to assigned

comment:34 Changed at 2011-08-02T06:44:56Z by zooko@…

  • Resolution set to fixed
  • Status changed from assigned to closed

In f426e82287c11b11:

verifier: serialize the fetching of blocks within a share so that we don't use too much RAM

Shares are still verified in parallel, but within a share, don't request a
block until the previous block has been verified and the memory we used to hold
it has been freed up.

Patch originally due to Brian. This version has a mockery-patchery-style test
which is "low tech" (it implements the patching inline in the test code instead
of using an extension of the mock.patch() function from the mock library) and
which unpatches in case of exception.

fixes #1395

comment:35 Changed at 2011-08-02T18:58:09Z by david-sarah@…

In 9f8d34e63aa1aeeb:

(The changeset message doesn't reference this ticket)

comment:36 Changed at 2011-08-02T19:20:10Z by zooko

  • Priority changed from major to critical
  • Resolution fixed deleted
  • Status changed from closed to reopened

comment:37 Changed at 2011-08-02T19:39:48Z by david-sarah@…

In c7f65ee8ad254f3f:

(The changeset message doesn't reference this ticket)

comment:38 Changed at 2011-08-02T20:59:40Z by davidsarah

  • Resolution set to fixed
  • Status changed from reopened to closed

c7f65ee8ad254f3f seems to have fixed the failures of allmydata.test.test_repairer.Verifier.test_corrupt_sharedata on the builders. (Some remaining nondeterministic failures are due to #1084.)

Note: See TracTickets for help on using tickets.