[tahoe-dev] Object Health
Brad Rupp
bradrupp at gmail.com
Mon Jul 9 21:16:45 UTC 2012
On 7/9/2012 10:47 AM, Greg Troxel wrote:
>
> Brad Rupp <bradrupp at gmail.com> writes:
>
>> I am running the following command:
>>
>> ~/tahoe/bin/tahoe deep-check --repair --verbose my-alias:
>
> I would include --add-lease, because the servers might be doing expiration.
The servers should not be expiring anything of mine. They should all be
set to expire after 365 days, and my data is only a few weeks old.
Having said that, dumber things have happened. I will check.
Once per week, I do a deep-check with both --repair and --add-lease. I
started running these repairs (--repair only) as a sanity check that my
data was in fact safe.
>
>> The output from repair #1:
>>
>> repair successful
>> done: 11801 objects checked
>> pre-repair: 11725 healthy, 76 unhealthy
>> 76 repairs attempted, 76 successful, 0 failed
>> post-repair: 11801 healthy, 0 unhealthy
>>
>> The output from repair #2:
>>
>> done: 11801 objects checked
>> pre-repair: 11789 healthy, 12 unhealthy
>> 12 repairs attempted, 11 successful, 1 failed
>> post-repair: 11800 healthy, 1 unhealthy
>
> This is a clue that your servers are unstable somehow; it isn't normal.
> I would use tcpdump and see if connections are coming and going.
>
> To measure without changing, I would do deep-check (with --add-lease)
> without using --repair and see if you get stable output.
I will give this a try and let you know.
>
>> As you can see, the first repair found and fixed 76 unhealthy
>> objects. The second repair, approximately 12 hours later, found 12
>> unhealthy objects and fixed 11 of them.
>
> How many servers? Are they all stably present, both uptime and
> connectivity?
20 servers total, 17 up consistently. This is a public grid (Volunteer
Grid 2), so I don't own most of the servers.
>
>> Why would the second repair find 12 unhealthy objects? I would have
>> expected it to find 0 unhealthy objects given that the first repair
>> was performed only 12 hours earlier.
>
> Absent servers not being reachable, you are right.
>
>> This is just one repair run out of many. I can consistently get
>> similar results. I guess the deeper question is: are the objects
>> stored in Tahoe safe? Or when I really need them due to a
>> catastrophic event will I lose a handful of objects due to this?
>
> So far your objects were repairable, so you haven't lost data. But
> there is IMHO something wrong.
There have been cases where objects were not repairable. The runs that
I copied and pasted just happened to have successful repairs both times.
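
To compare runs like the two quoted above without eyeballing them, a
minimal Python sketch along these lines might help. It assumes the summary
lines keep exactly the wording pasted in this thread; parse_summary and the
field names are made up for illustration and are not part of Tahoe.

```python
import re

# Patterns for the summary lines as pasted in this thread.
# Assumption: the wording of these lines is stable across runs.
PATTERNS = [
    (re.compile(r"done: (\d+) objects checked"),
     ("checked",)),
    (re.compile(r"pre-repair: (\d+) healthy, (\d+) unhealthy"),
     ("pre_healthy", "pre_unhealthy")),
    (re.compile(r"(\d+) repairs attempted, (\d+) successful, (\d+) failed"),
     ("attempted", "successful", "failed")),
    (re.compile(r"post-repair: (\d+) healthy, (\d+) unhealthy"),
     ("post_healthy", "post_unhealthy")),
]


def parse_summary(text):
    """Return a dict of the counts found in one deep-check summary."""
    counts = {}
    for line in text.splitlines():
        for pattern, names in PATTERNS:
            match = pattern.search(line)
            if match:
                counts.update(zip(names, map(int, match.groups())))
    return counts


# The two summaries quoted earlier in this message.
RUN_1 = """\
done: 11801 objects checked
pre-repair: 11725 healthy, 76 unhealthy
76 repairs attempted, 76 successful, 0 failed
post-repair: 11801 healthy, 0 unhealthy
"""

RUN_2 = """\
done: 11801 objects checked
pre-repair: 11789 healthy, 12 unhealthy
12 repairs attempted, 11 successful, 1 failed
post-repair: 11800 healthy, 1 unhealthy
"""

run1 = parse_summary(RUN_1)
run2 = parse_summary(RUN_2)

# Objects left healthy by run 1 that run 2 found unhealthy again.
newly_unhealthy = run2["pre_unhealthy"] - run1["post_unhealthy"]
print("newly unhealthy between runs:", newly_unhealthy)
print("failed repairs in run 2:", run2["failed"])
```

Saving the counts from each weekly run this way would make it easier to
see whether the number of newly unhealthy objects tracks which servers
were reachable at the time.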