#755 new defect

Allow deep-check to continue after error, and: if there is an unrecoverable subdirectory, the deep-check report (both WUI and CLI) loses other information

Reported by: zooko
Owned by: daira
Priority: critical
Milestone: soon
Component: code-dirnodes
Version: 1.4.1
Keywords: usability error tahoe-check wui verify repair
Cc: ludo@…, zooko, kyle@…
Launchpad Bug:

Description (last modified by daira)

If I do a deep-check on a directory, I start getting results reported on the web page showing the files and subdirectories within that directory. Reloading (or waiting for the automatic self-reload) shows more and more results. That continues until the deep-check reaches an unrecoverable subdirectory, at which point the web page containing the deep-check results is replaced with a web page saying only this:

UnrecoverableFileError: the directory (or mutable file) could not be retrieved, because there were insufficient good shares. This might indicate that no servers were connected, insufficient servers were connected, the URI was corrupt, or that shares have been lost due to server departure, hard drive failure, or disk corruption. You should perform a filecheck on this object to learn more.

To close this ticket, make it so that I can still see all the other results that have already been generated, plus further results about other files and subdirectories that haven't yet been checked, even while there is an unrecoverable subdirectory present.

I'm using the current trunk: 1.4.1-r3982.

Brian: are you willing to take this ticket?

Attachments (2)

755-fix-for-review.diff (4.4 KB) - added by francois at 2010-11-20T23:42:41Z.
patch-755.darcs.diff (30.1 KB) - added by francois at 2010-11-21T22:46:52Z.


Change History (41)

comment:1 Changed at 2009-07-11T23:28:09Z by warner

  • Component changed from code-frontend-web to code-dirnodes
  • Status changed from new to assigned

yeah, I'll work on this. Basically traversal failures during a deep-check or deep-repair operation should increment a counter and move on, instead of throwing an exception and stopping the walker. I don't know if I can finish it in time for 1.5.0 though.
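
The "increment a counter and move on" behaviour described above can be sketched roughly as follows. This is only an illustration on a plain in-memory tree, not Tahoe-LAFS code: `UnrecoverableError`, `WalkStats`, `BROKEN`, `list_children` and `deep_walk` are hypothetical names invented for the example, and the real walker is asynchronous.

```python
from dataclasses import dataclass, field


class UnrecoverableError(Exception):
    """Hypothetical stand-in for an unrecoverable-directory error."""


@dataclass
class WalkStats:
    visited: int = 0
    failures: int = 0
    failed_paths: list = field(default_factory=list)


BROKEN = object()  # marks a directory whose contents cannot be retrieved


def list_children(node):
    """Return the children of a node; files (None) have no children."""
    if node is BROKEN:
        raise UnrecoverableError("insufficient good shares")
    return node if isinstance(node, dict) else {}


def deep_walk(node, path="", stats=None):
    """Visit every reachable node, counting unreadable directories
    instead of letting the exception stop the traversal."""
    stats = stats if stats is not None else WalkStats()
    stats.visited += 1
    try:
        children = list_children(node)
    except UnrecoverableError:
        stats.failures += 1
        stats.failed_paths.append(path or "/")
        return stats  # skip this subtree, keep walking its siblings
    for name, child in children.items():
        deep_walk(child, f"{path}/{name}", stats)
    return stats


tree = {"a.txt": None, "lost": BROKEN, "docs": {"b.txt": None}}
stats = deep_walk(tree)
print(stats.visited, stats.failures, stats.failed_paths)
```

The point of the sketch is that the failure is recorded together with its path, so the final report can still name the object that could not be read.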

comment:2 Changed at 2009-07-15T05:24:36Z by zooko

  • Milestone changed from 1.5.0 to eventually

This isn't really a blocker for v1.5.0.

comment:3 Changed at 2009-08-11T13:54:27Z by zooko

  • Cc ludo@… added

On the mailing list Ludo reported:

$ tahoe deep-check
ERROR: UnrecoverableFileError(no recoverable versions)
[Failure instance: Traceback: <class 'allmydata.mutable.common.UnrecoverableFileError'>: no recoverable versions
/nix/store/i1bkz12nx2vbih8aj37c9gpqnzbjshkx-python-twisted-8.2.0/lib/python2.5/site-packages/Twisted-8.2.0-py2.5-linux-x86_64.egg/twisted/internet/base.py:757:runUntilCurrent
/nix/store/nk39m80fi7ll7460713djzw3qzwgb4kr-python-foolscap-0.4.2/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/eventual.py:26:_turn
/nix/store/i1bkz12nx2vbih8aj37c9gpqnzbjshkx-python-twisted-8.2.0/lib/python2.5/site-packages/Twisted-8.2.0-py2.5-linux-x86_64.egg/twisted/internet/defer.py:243:callback
/nix/store/i1bkz12nx2vbih8aj37c9gpqnzbjshkx-python-twisted-8.2.0/lib/python2.5/site-packages/Twisted-8.2.0-py2.5-linux-x86_64.egg/twisted/internet/defer.py:312:_startRunCallbacks
--- <exception caught here> ---
/nix/store/i1bkz12nx2vbih8aj37c9gpqnzbjshkx-python-twisted-8.2.0/lib/python2.5/site-packages/Twisted-8.2.0-py2.5-linux-x86_64.egg/twisted/internet/defer.py:328:_runCallbacks
/nix/store/yj6q079b58rfnnf8g70ib5vaah6gxlhq-tahoe-1.5.0/lib/python2.5/site-packages/allmydata_tahoe-1.5.0-py2.5.egg/allmydata/mutable/filenode.py:312:_once_updated_download_best_version

Is this an example of the issue in this ticket?

By the way, see also #583 (repairer: test cancel, upload failure, download failure).

comment:4 Changed at 2009-12-31T16:16:49Z by zooko

  • Keywords usability added

I just got bitten by this bug again. I have a directory (on the volunteergrid) that has an unrecoverable subdirectory in it. When I do a deep check in the WUI then it shows useful information about the other contents of the directory until it reaches that subdirectory, at which point I lose the other information. Also, the resulting error message doesn't tell me any identifying information about which file or directory was unrecoverable!

UnrecoverableFileError: the directory (or mutable file) could not be retrieved, because there were insufficient good shares. This might indicate that no servers were connected, insufficient servers were connected, the URI was corrupt, or that shares have been lost due to server departure, hard drive failure, or disk corruption. You should perform a filecheck on this object to learn more.

comment:5 Changed at 2010-01-16T01:10:50Z by davidsarah

  • Keywords error added

comment:6 Changed at 2010-02-02T03:08:44Z by davidsarah

  • Milestone changed from eventually to 1.7.0

comment:7 Changed at 2010-02-14T20:36:22Z by zooko

  • Priority changed from major to critical

This is persistently causing problems for me. I have several important directory structures in which some of the directories or files are sometimes unrecoverable. I really need to be able to see information about the rest of them even at these times. Raising priority to critical to remind myself that I really care about this.

comment:8 Changed at 2010-02-15T18:51:02Z by davidsarah

  • Milestone changed from 1.7.0 to 1.6.1

comment:9 Changed at 2010-02-15T19:50:05Z by davidsarah

  • Keywords tahoe-check wui verify repair added
  • Summary changed from if there is an unrecoverable subdirectory, the web deep-check report loses other information to if there is an unrecoverable subdirectory, the deep-check report (both WUI and CLI) loses other information

Unifying this with #880; this ticket now covers both CLI and WUI.

comment:10 Changed at 2010-02-16T04:16:22Z by zooko

This might be too ambitious to finish for v1.6.1. I would like to get v1.6.1 released this coming weekend of 2010-02-20 so that people who have started packaging or deploying v1.6.0 have the option of quickly upgrading to v1.6.1 before their packages/deployments of v1.6.0 spread too far.

However, I'm leaving it in the Milestone v1.6.1 for now because I don't object to fixing it in v1.6.1.

comment:11 Changed at 2010-02-22T05:04:34Z by zooko

  • Milestone changed from 1.6.1 to 1.7.0

We're not going to fix this in time for v1.6.1. Hopefully in time for v1.7.0!

comment:12 Changed at 2010-05-16T23:40:04Z by zooko

  • Milestone changed from 1.7.0 to eventually

comment:13 Changed at 2010-05-17T02:15:24Z by davidsarah

  • Milestone changed from eventually to soon

comment:14 Changed at 2010-10-28T23:14:39Z by davidsarah

  • Milestone changed from soon to 1.9.0

This is one of our more commonly encountered usability problems, so I think it should be a priority for 1.9.0.

comment:15 Changed at 2010-11-01T11:12:31Z by francois

  • Owner changed from warner to francois
  • Status changed from assigned to new

comment:16 Changed at 2010-11-01T11:12:49Z by francois

I'm willing to try to fix this bug.

Changed at 2010-11-20T23:42:41Z by francois

comment:17 Changed at 2010-11-20T23:44:34Z by francois

  • Keywords review-needed test added

The patch 755-fix-for-review.diff is how I intend to fix this bug. The associated tests are still being worked on.

Changed at 2010-11-21T22:46:52Z by francois

comment:18 Changed at 2010-11-21T22:47:28Z by francois

  • Keywords test removed
  • Owner francois deleted

comment:19 Changed at 2010-11-21T22:48:49Z by francois

The patch patch-755.darcs.diff contains the fix for this issue and associated tests.

comment:20 Changed at 2011-01-01T21:19:51Z by davidsarah

  • Owner set to davidsarah
  • Status changed from new to assigned

comment:21 Changed at 2011-01-06T00:31:29Z by davidsarah

  • Milestone changed from 1.9.0 to 1.8.2

comment:22 follow-ups: Changed at 2011-01-07T05:31:20Z by warner

  • Keywords review-needed removed
  • Owner changed from davidsarah to francois
  • Status changed from assigned to new

Good patch! I like the approach of making filenode.check_and_repair() signal inability to repair by returning CheckAndRepairResults.repair_successful=False instead of by throwing an exception. A few things I'd like to see changed:

  • we usually repair files that are unhealthy but recoverable. If repair fails, the file should still be recoverable. The post-repair-results are pessimistically being set to healthy=False recoverable=False needs_rebalancing=False, when it's probably (and sometimes certainly) more accurate to copy these values from the pre-repair-results. In particular, we shouldn't scare users into thinking that repair failures of "scratched" files (unhealthy but recoverable) indicate unrecoverable files: this makes benign things like UnhappinessError look like data loss. This should be fixed in both mutable and immutable files.
  • the newly-enabled test in test_repairer.Repairer.test_harness (which previously got a self.shouldFail()) should be slightly enhanced to check the return value of check_and_repair(). We should verify that it has crr.repair_attempted=True, crr.repair_successful=False, and crr.post_repair_results.recoverable=False
  • we should add a similar test for mutable files that have had 8 shares deleted. There's something awfully close in test_mutable.Repair.test_unrepairable_1share .. it should be changed to use self._fn.check_and_repair() instead of self._fn.repair() . To be honest, I'm not sure why that test was passing before, because from what I can tell it should have been behaving the same way as immutable repair on an unrecoverable file.
    • it's probably worth checking the code coverage when we exercise test_mutable and make sure the new code is getting run
  • do we have any tests that confirm deep-repair on a tree with an unrecoverable file (or directory) makes it through to the end without an errback? We probably do but I'd like to be sure.. probably something in test_deepcheck exercises this.
  • I see test_deepcheck.py:DeepCheckWebBad.do_deepcheck_broken() asserts that an unrecoverable dirnode causes the traversal to halt. Is this what we want? Is this ticket about making sure an unrecoverable *file* doesn't halt a deep-repair, or about an unrecoverable *dirnode*? (broken dirnodes are more significant than files, because it means you've probably lost access to even more data). We certainly want the deep-traversal to keep going and repair more things, but we also need to make sure the user learns about the dead dirnode.

Otherwise, looks great! With those few changes we can land this one for 1.8.2!
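
As a rough illustration of the approach praised above (signalling a failed repair through the returned results instead of through an exception), here is a simplified, synchronous sketch. `RepairOutcome` is a hypothetical stand-in for CheckAndRepairResults, and this `check_and_repair()` is not the real filenode method; only the attribute names `repair_attempted`, `repair_successful` and `repair_failure` are taken from the discussion.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RepairOutcome:
    """Simplified, hypothetical stand-in for CheckAndRepairResults."""
    repair_attempted: bool = False
    repair_successful: bool = False
    repair_failure: Optional[Exception] = None


def check_and_repair(is_healthy: bool, repair: Callable[[], None]) -> RepairOutcome:
    """Attempt a repair only when needed and report failure in the result."""
    outcome = RepairOutcome()
    if is_healthy:
        return outcome  # nothing to repair
    outcome.repair_attempted = True
    try:
        repair()
    except Exception as exc:
        # Record the failure so the caller (for example a deep traversal)
        # can keep going instead of unwinding.
        outcome.repair_failure = exc
    else:
        outcome.repair_successful = True
    return outcome


def failing():
    raise RuntimeError("not enough servers")


outcome = check_and_repair(False, failing)
print(outcome.repair_attempted, outcome.repair_successful)
```

A caller that walks many files can then inspect `repair_successful` per object and aggregate the failures at the end.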

comment:23 in reply to: ↑ 22 Changed at 2011-01-07T05:55:02Z by davidsarah

Replying to warner:

Good patch! I like the approach of making filenode.check_and_repair() signal inability to repair by returning CheckAndRepairResults.repair_successful=False instead of by throwing an exception.

+1

A few things I'd like to see changed:

  • we usually repair files that are unhealthy but recoverable. If repair fails, the file should still be recoverable. The post-repair-results are pessimistically being set to healthy=False recoverable=False needs_rebalancing=False, when it's probably (and sometimes certainly) more accurate to copy these values from the pre-repair-results.

If there's a failure, then we don't know whether the file is healthy, recoverable or needs rebalancing. Shouldn't unknown fields simply be missing from the results?

(Note: needs_rebalancing=False is not pessimistic.)

  • I see test_deepcheck.py:DeepCheckWebBad.do_deepcheck_broken() asserts that an unrecoverable dirnode causes the traversal to halt. Is this what we want? Is this ticket about making sure an unrecoverable *file* doesn't halt a deep-repair, or about an unrecoverable *dirnode*?

I thought it was both.

comment:24 in reply to: ↑ 22 ; follow-up: Changed at 2011-01-15T16:20:04Z by francois

Thanks for the review! My comments are inline.

Replying to warner:

  • we usually repair files that are unhealthy but recoverable. If repair fails, the file should still be recoverable. The post-repair-results are pessimistically being set to healthy=False recoverable=False needs_rebalancing=False, when it's probably (and sometimes certainly) more accurate to copy these values from the pre-repair-results.

I agree with what davidsarah said in comment:23: it is difficult to know the actual status when an exception was raised during the check operation. However, it seems that simply removing the fields from the results would necessitate other changes, because I guess that many parts of the code expect them to be present.

What would you think about setting healthy to its value before the repair (most likely False) and other fields to None? Something along those lines?

  def _repair_error(f):
    prr = CheckResults(cr.uri, cr.storage_index)
    prr.data = copy.deepcopy(cr.data)
    prr.set_healthy(crr.pre_repair_results.is_healthy())
    prr.set_recoverable(None)
    prr.set_needs_rebalancing(None)
    crr.post_repair_results = prr
    crr.repair_successful = False
    crr.repair_failure = f
    return crr

  • the newly-enabled test in test_repairer.Repairer.test_harness (which previously got a self.shouldFail()) should be slightly enhanced to check the return value of check_and_repair(). We should verify that it has crr.repair_attempted=True, crr.repair_successful=False, and crr.post_repair_results.recoverable=False

Good point, will be done in the next patch.

  • we should add a similar test for mutable files that have had 8 shares deleted. There's something awfully close in test_mutable.Repair.test_unrepairable_1share .. it should be changed to use self._fn.check_and_repair() instead of self._fn.repair().

Will be done in the next patch.

To be honest, I'm not sure why that test was passing before, because from what I can tell it should have been behaving the same way as immutable repair on an unrecoverable file.

I don't know either; I will try to look into this in detail.

  • it's probably worth checking the code coverage when we exercise test_mutable and make sure the new code is getting run

I don't remember how the code coverage infrastructure in the build system actually works. It would be very kind of you to tell me which command I should run.

  • do we have any tests that confirm deep-repair on a tree with an unrecoverable file (or directory) makes it through to the end without an errback? We probably do but I'd like to be sure.. probably something in test_deepcheck exercises this.

This is what I think calling do_web_stream_check() inside DeepCheckWebBad.test_bad() should be doing, isn't it?

  • I see test_deepcheck.py:DeepCheckWebBad.do_deepcheck_broken() asserts that an unrecoverable dirnode causes the traversal to halt. Is this what we want? Is this ticket about making sure an unrecoverable *file* doesn't halt a deep-repair, or about an unrecoverable *dirnode*? (broken dirnodes are more significant than files, because it means you've probably lost access to even more data). We certainly want the deep-traversal to keep going and repair more things, but we also need to make sure the user learns about the dead dirnode.

Oh, that's correct! Yes, the traversal must continue in both cases, but it looks like my patch does not yet support unrecoverable dirnodes. The next version will hopefully do so.

Version 0, edited at 2011-01-15T16:20:04Z by francois (next)

comment:25 in reply to: ↑ 24 Changed at 2011-01-17T09:21:03Z by warner

Replying to francois:

Thanks for the review! My comments are inline.

Replying to warner:

  • we usually repair files that are unhealthy but recoverable. If repair fails, the file should still be recoverable. The post-repair-results are pessimistically being set to healthy=False recoverable=False needs_rebalancing=False, when it's probably (and sometimes certainly) more accurate to copy these values from the pre-repair-results.

I agree with what davidsarah said in comment:23: it is difficult to know the actual status when an exception was raised during the check operation. However, it seems that simply removing the fields from the results would necessitate other changes, because I guess that many parts of the code expect them to be present.

What would you think about setting healthy to its value before the repair (most likely False) and other fields to None? Something along those lines?

  def _repair_error(f):
    prr = CheckResults(cr.uri, cr.storage_index)
    prr.data = copy.deepcopy(cr.data)
    prr.set_healthy(crr.pre_repair_results.is_healthy())
    prr.set_recoverable(None)
    prr.set_needs_rebalancing(None)
    crr.post_repair_results = prr
    crr.repair_successful = False
    crr.repair_failure = f
    return crr

Ok, but set_recoverable() and set_needs_rebalancing() should be copied from the pre-repair values too. For immutable files it's certainly the case that repair cannot make things any worse, so if the file was recoverable before repair, it will be recoverable afterwards too. For mutable files, it's fuzzier, but once we get #1209 fixed, then repair that doesn't involve UCWE collisions or multiple versions should be strictly an improvement too. I think set_needs_rebalancing() is roughly the same.

My big concern is doing a deep-repair while you're missing a few servers: all files are missing a few shares, so they aren't healthy and we try to repair them, but you're missing too many servers to successfully meet the servers-of-happiness threshold, so repair fails. On every single file. All the files are actually recoverable, but the post-repair results suggest that they are not. What I want to avoid is the deep-repair summary message telling users that 4000 out of 4000 files are now unrecoverable and scaring the socks off them.
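
To make the concern concrete, here is a small sketch of how the summary differs between the two policies for post-repair results when every repair fails. The `Status` class and both policy functions are hypothetical simplifications, not the real CheckResults API; the figure of 4000 files is the one from the scenario above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Status:
    """Hypothetical, reduced view of a check result."""
    healthy: bool
    recoverable: bool


def post_repair_pessimistic(pre: Status) -> Status:
    # Policy in the reviewed patch: assume the worst after a failed repair.
    return Status(healthy=False, recoverable=False)


def post_repair_copy_pre(pre: Status) -> Status:
    # Policy suggested in review: a failed repair leaves the file as it was.
    return pre


def count_unrecoverable(pre_results, policy) -> int:
    """Number of files the summary would report as unrecoverable."""
    return sum(1 for pre in pre_results if not policy(pre).recoverable)


# 4000 files, all missing a few shares but still recoverable.
pre_results = [Status(healthy=False, recoverable=True)] * 4000
pessimistic = count_unrecoverable(pre_results, post_repair_pessimistic)
copied = count_unrecoverable(pre_results, post_repair_copy_pre)
print(pessimistic, copied)
```

With the pessimistic policy the summary reports every file as unrecoverable; copying the pre-repair values reports none, which matches what the user can actually still download.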

  • it's probably worth checking the code coverage when we exercise test_mutable and make sure the new code is getting run

I don't remember how the code coverage infrastructure in the build system actually works. It would be very kind of you to tell me which command I should run.

I usually do 'make quicktest-coverage', but I think "python setup.py trial --coverage" (or perhaps "python setup.py trial --coverage --test-suite test_mutable" to be a bit more selective) should do the same. That will create a .coverage file with the raw data. "make coverage-output", or following the commands listed in that section of the Makefile, will give you an HTML summary with color-coded source lines.

  • do we have any tests that confirm deep-repair on a tree with an unrecoverable file (or directory) makes it through to the end without an errback? We probably do but I'd like to be sure.. probably something in test_deepcheck exercises this.

This is what I think calling do_web_stream_check() inside DeepCheckWebBad.test_bad() should be doing, isn't it?

I think that's mostly correct: it looks like set_up_damaged_tree() creates a root directory with 8 files (half mutable, half immutable), some of which are unrecoverable. But 1: do_web_stream_check() doesn't attempt repair, merely deep-check, and 2: there are no directories in that root, only files. Adding an unrecoverable directory is the important bit, since I think deep-repair and deep-check have enough common code paths that exercising deep-check is sufficient. (note that I think the 'broken' directory set up there is not used by do_web_stream_check()).

  • I see test_deepcheck.py:DeepCheckWebBad.do_deepcheck_broken() asserts that an unrecoverable dirnode causes the traversal to halt. Is this what we want? Is this ticket about making sure an unrecoverable *file* doesn't halt a deep-repair, or about an unrecoverable *dirnode*? (broken dirnodes are more significant than files, because it means you've probably lost access to even more data). We certainly want the deep-traversal to keep going and repair more things, but we also need to make sure the user learns about the dead dirnode.

Yes, the traversal must continue in both cases. I was under the impression that unrecoverable immutable files were already supported, and I understand this issue as being about unrecoverable dirnodes.

Yeah, do_web_stream_check() should cover the unrecoverable-immutable-file case (well, unless there's a difference in behavior between a web-based t=stream-deep-check and an internal dirnode-based dirnode.start_deep_check(), which is worth testing). So I agree, unrecoverable dirnodes is the important thing to check.

So my hunch here is that we should add an unrecoverable directory to the 'root' tree created in set_up_damaged_tree(), and adjust the counters to match, and then maybe we should get rid of the 'broken' tree and do_deepcheck_broken().

comment:26 follow-up: Changed at 2011-01-17T09:22:59Z by warner

BTW, if we get a patch for this on monday, I'll review and land it, and it'll be in 1.8.2. If it's not ready by monday or tuesday, then we may need to push it out until after 1.8.2. I want to make sure we get at least a few days of testing on this, since it's kind of invasive.

comment:27 in reply to: ↑ 26 Changed at 2011-01-17T20:47:28Z by francois

Replying to warner:

BTW, if we get a patch for this on monday, I'll review and land it, and it'll be in 1.8.2. If it's not ready by monday or tuesday, then we may need to push it out until after 1.8.2. I want to make sure we get at least a few days of testing on this, since it's kind of invasive.

I guess that it's going to have to wait until after 1.8.2 because spare time in the coming week looks pretty scarce.

comment:28 Changed at 2011-01-17T20:47:41Z by francois

  • Milestone changed from 1.8.2 to 1.9.0

comment:29 Changed at 2011-02-06T06:04:01Z by zooko

  • Cc zooko added

comment:30 Changed at 2011-07-16T20:49:20Z by davidsarah

This needs some work to address the comments and to be rebased to trunk, but has a good chance of getting into 1.9.

comment:31 Changed at 2011-08-02T15:43:50Z by davidsarah

  • Owner changed from francois to davidsarah
  • Status changed from new to assigned

I have a patch in progress that builds on patch-755.darcs.diff and fixes the review comments, including skipping unrecoverable directories and including information that they've been skipped in the output. It's not ready for 1.9 though.

comment:32 Changed at 2011-08-02T15:44:19Z by davidsarah

  • Milestone changed from 1.9.0 to 1.10.0

comment:33 Changed at 2012-08-22T02:12:52Z by davidsarah

I'll try to find the patch mentioned in comment:31, but if I haven't done so in two weeks, it can be assumed that I've lost it.

comment:34 Changed at 2013-04-26T02:00:35Z by daira

#1955 was a duplicate.

comment:35 Changed at 2014-11-19T07:26:39Z by daira

  • Description modified (diff)

#2337 was a duplicate.

Last edited at 2014-11-19T07:26:53Z by daira (previous) (diff)

comment:36 Changed at 2015-02-03T17:43:33Z by zooko

  • Summary changed from if there is an unrecoverable subdirectory, the deep-check report (both WUI and CLI) loses other information to Allow deep-check to continue after error, and: if there is an unrecoverable subdirectory, the deep-check report (both WUI and CLI) loses other information

comment:37 Changed at 2016-01-14T17:45:29Z by daira

  • Cc kyle@… added
  • Owner changed from davidsarah to daira
  • Status changed from assigned to new

Kyle Markley wrote on tahoe-dev:

When tahoe deep-check --repair encounters a file it can't repair, it stops without reporting anything about what file gave it trouble. What do I do about this? I rerun, this time with -v, so I get a listing of what files it is working on. From that list I can often infer which file had the error. Assuming I still have the original file, the corrective action is to tahoe put the file. Then I can restart the deep-check. But in a directory tree with thousands of files, that takes forever! Instead, I can restart the deep-check in a subdirectory closer to the previous failure. But this is a lot of tedious work.

I wish that tahoe deep-check would:

  1. Report which file is unrepairable.
  2. Not stop at the first error, but continue and report all errors upon completion.

When an unrepairable file is an immutable directory, what corrective action should be taken? I have resorted to modifying the directory by creating an empty file, performing a tahoe backup, and then continuing the deep-check --repair. But I cannot then remove the empty file, because that would cause the next backup to point to the original (unrepaired) directory. Can this be improved?

I wish that tahoe backup could be combined with tahoe deep-check --repair. The behavior would be like deep-check, but if any file is unrepairable yet exists in the local filesystem at the corresponding path, upload it. And for bonus points this should guarantee happiness, not just healthiness. Or, it would be almost as good if deep-check would update the backup database so the next invocation of tahoe backup would re-upload the appropriate files and directories.

Essentially, I struggle with the fact that "tahoe backup" completes successfully without guaranteeing the recoverability of files it claims to have backed up. The backup database is out-of-sync with the healthiness of files on the grid, and there is no way to bring them in-sync. Sure, I can delete the backup database, but I don't want to pointlessly re-upload all the healthy files.

comment:38 Changed at 2017-03-09T18:55:14Z by tlhonmey

Kyle: It won't have to re-upload all the healthy files. The deduplication algorithm will find that the data for any unchanged files is already available and will re-use whatever shares it can. It'll just take a bit longer to run because it'll have to scan and encode every file.

Meanwhile: I just lost a bunch of stuff because I didn't know about this issue and assumed a deep-check --repair --add-lease cronjob would take care of things. One file near the beginning of the directory structure got damaged somehow, so neither repair nor leasing was done on the rest, and by the time I came back to check on it, chunks had expired and been deleted and I have to re-upload everything, which will take about a month.

This bug has been open for almost 8 years, and I see a patch for it in the discussion thread... If it's not going to be fixed in the next release, I recommend adding a warning about it to the documentation so new users don't do something stupid like expect the repair operation to behave in a sane manner.

As a work-around, I use:

tahoe manifest alias: | cut -d" " -f 1 | xargs -L1 -P5 tahoe check --add-lease --repair

This, of course, requires time and CPU to start a separate instance of the tahoe program for every data object being checked, so going over the entire directory takes days instead of hours, but at least it actually works.

comment:39 Changed at 2018-08-21T21:48:05Z by tlhonmey

Ok, so tahoe manifest also gives up on the first error it encounters; it just only encounters errors on damaged directories. But it will still bite you hard if you are actually stupid enough to rely on it.

So I've resorted to the following bash script:

#! /bin/bash
tahoe="/home/tahoe/tahoe/bin/tahoe"
THREADS=5
FAILEDLOG="/tmp/failed.txt"


recurser() {
  CHILDREN=""
  echo "checking directory: $1"
  # If both the check and the repair fail, give it 5 minutes before continuing to let the
  # grid come back up in case this is a connection failure. This prevents the entire
  # script from finishing as failures if the network connection goes down.
  $tahoe check --add-lease "$1" || $tahoe check --add-lease --repair "$1" || sleep 5m
  local ITEM
  for ITEM in $($tahoe ls -F "$1"); do
    echo "checking: ${1}${ITEM}"
    echo "$ITEM" | grep "/" >> /dev/null && echo "  Is a directory..." && recurser "${1}${ITEM}"
    ( $tahoe check --add-lease "${1}${ITEM}" | grep -n10 healthy || $tahoe check --repair --add-lease "${1}${ITEM}" || echo "${1}${ITEM}" >> $FAILEDLOG ) &
    # $! is the PID of the background job just started ($? would only be its launch status, always 0).
    CHILDREN="$! $CHILDREN"
    if [[ $(echo "$CHILDREN" | wc -w) == "$THREADS" ]]; then
      wait 
      CHILDREN=""
    fi
  done
}


echo "If it blows up immediately when passed a URI make sure you end it with a /"
recurser "$1"

The careful observer will notice that this script calls "check --add-lease" first and then only calls --repair if that returns an error. This is due to another bug in the --repair functionality which I will be filing shortly.

Is making deep-check note the unrepairable nodes, but then continue to check the rest of the tree really that difficult? I wouldn't think the average user should have to resort to writing their own tools to avoid cascade failures of the storage system...

If you guys want to bundle this tool or some clone or variant thereof into your packages you are more than welcome to do so. We need something to actually keep people's data safe until this bug is fixed.

Edit: Oh for Pete's Sake! tahoe check exits with a 0 even when the checked objects are unhealthy, so I have to scan the output myself to assess it. I sense that at some point I'm going to need to rewrite this in Python or something and use the REST API. Hopefully that's at least somewhat sane...
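
For anyone going the Python route, a possible starting point is to parse the line-oriented output of the web API's t=stream-deep-check operation (mentioned in comment:25) rather than the CLI's exit status. The sketch below is a guess at a minimal parser, not tested against a real node: the JSON field names ("type", "path", "check-results", "results", "healthy", "recoverable") and the convention that a traversal failure shows up as a line starting with "ERROR:" are assumptions to be verified against the web API documentation, and `summarize_stream` is a hypothetical helper. It works on a response body that has already been fetched, so no HTTP code is shown.

```python
import json


def summarize_stream(body: str) -> dict:
    """Summarize a stream-deep-check response body (one JSON unit per line).

    Assumed format: each unit is a JSON object with "type", "path" and
    "check-results"; a traversal failure appears as a line starting with
    "ERROR:" followed by traceback text.
    """
    report = {"checked": 0, "unhealthy": [], "unrecoverable": [], "errors": []}
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("ERROR:"):
            report["errors"].append(line)
            continue
        try:
            unit = json.loads(line)
        except ValueError:
            continue  # not JSON, e.g. traceback text after an ERROR: line
        if not isinstance(unit, dict) or unit.get("type") not in ("file", "directory"):
            continue  # skip stats units and anything unexpected
        results = unit.get("check-results", {}).get("results", {})
        path = "/".join(unit.get("path", [])) or "<root>"
        report["checked"] += 1
        if not results.get("healthy", False):
            report["unhealthy"].append(path)
        if not results.get("recoverable", False):
            report["unrecoverable"].append(path)
    return report


# Hand-written sample in the assumed format, not captured from a real node.
sample = "\n".join([
    json.dumps({"type": "directory", "path": [],
                "check-results": {"results": {"healthy": True, "recoverable": True}}}),
    json.dumps({"type": "file", "path": ["docs", "a.txt"],
                "check-results": {"results": {"healthy": False, "recoverable": True}}}),
    "ERROR: UnrecoverableFileError(no recoverable versions)",
    "(traceback text)",
])
report = summarize_stream(sample)
print(report)
```

Deciding health from the parsed results sidesteps the exit-status problem, and keeping the "ERROR:" lines makes it visible when the traversal itself was cut short.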

Last edited at 2018-08-22T21:11:24Z by tlhonmey (previous) (diff)