[tahoe-dev] Object Health

Mon Jul 9 21:16:45 UTC 2012

On 7/9/2012 10:47 AM, Greg Troxel wrote:
>
> Brad Rupp <bradrupp at gmail.com> writes:
>
>> I am running the following command:
>>
>> ~/tahoe/bin/tahoe deep-check --repair --verbose my-alias:
>
> I would include --add-lease, because the servers might be doing expiration.

The servers should not be doing expiration.  They should all be set to
expire leases after 365 days.  My data is only a few weeks old.

Having said that, dumber things have happened.  I will check.

Once per week, I do a deep-check with both --repair and --add-lease.  I
started running these extra repairs (--repair only) as a sanity check
that my data was in fact safe.

>
>> The output from repair #1:
>>
>> repair successful
>> done: 11801 objects checked
>>   pre-repair: 11725 healthy, 76 unhealthy
>>   76 repairs attempted, 76 successful, 0 failed
>>   post-repair: 11801 healthy, 0 unhealthy
>>
>> The output from repair #2:
>>
>> done: 11801 objects checked
>>   pre-repair: 11789 healthy, 12 unhealthy
>>   12 repairs attempted, 11 successful, 1 failed
>>   post-repair: 11800 healthy, 1 unhealthy
>
> This is a clue that your servers are unstable somehow; it isn't normal.
> I would use tcpdump and see if connections are coming and going.
>
> To measure without changing, I would do deep-check (with --add-lease)
> without using --repair and see if you get stable output.

I will give this a try and let you know.
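
For reference, a minimal sketch of how the check-only run could be assembled. It assumes the same binary path and alias as the command quoted above. The helper name `build_deep_check_command` is hypothetical, and it uses only flags already mentioned in this thread:

```python
import os


def build_deep_check_command(alias, repair=False, add_lease=True,
                             tahoe_bin="~/tahoe/bin/tahoe"):
    """Build the argv list for a tahoe deep-check run.

    Hypothetical helper: the binary path and alias are taken from the
    command quoted earlier in this thread; adjust both for your setup.
    """
    cmd = [os.path.expanduser(tahoe_bin), "deep-check", "--verbose"]
    if add_lease:
        # Renew leases while checking, in case a server is expiring shares.
        cmd.append("--add-lease")
    if repair:
        # Leave this off to measure health without attempting repairs.
        cmd.append("--repair")
    cmd.append(alias)
    return cmd


# Check-only run: reports health, renews leases, performs no repairs.
print(" ".join(build_deep_check_command("my-alias:")))
```

Running the resulting command several times a few hours apart, and comparing the reported counts, would show whether the results are stable when nothing is being repaired.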

>
>> As you can see, the first repair found and fixed 76 unhealthy
>> objects. The second repair, approximately 12 hours later, found 12
>> unhealthy objects and fixed 11 of them.
>
> How many servers?  Are they all stably present, both uptime and
> connectivity?

20 servers total, 17 up consistently.  This is a public grid (Volunteer 
Grid 2), so I don't own most of the servers.

>
>> Why would the second repair find 12 unhealthy objects?  I would have
>> expected it to find 0 unhealthy objects given that the first repair
>> was performed only 12 hours earlier.
>
> Absent servers not being reachable, you are right.
>
>> This is just one repair run out of many.  I can consistently get
>> similar results.  I guess the deeper question is: are the objects
>> stored in Tahoe safe?  Or when I really need them due to a
>> catastrophic event will I lose a handful of objects due to this?
>
> So far your objects were repairable, so you haven't lost data.  But
> there is IMHO something wrong.

There have been cases where objects were not repairable.  The runs that 
I copied and pasted just happened to have successful repairs both times.
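
To compare runs over time, the summary lines can be parsed and the counts tracked. This is a rough sketch that assumes the output format matches the two runs pasted above; the function name `parse_repair_summary` is hypothetical, and other Tahoe versions may print the summary differently:

```python
import re

# Patterns assume the summary format shown in the pasted runs above.
_PATTERNS = {
    "checked": re.compile(r"done:\s*(\d+) objects checked"),
    "pre": re.compile(r"pre-repair:\s*(\d+) healthy,\s*(\d+) unhealthy"),
    "repairs": re.compile(
        r"(\d+) repairs attempted,\s*(\d+) successful,\s*(\d+) failed"),
    "post": re.compile(r"post-repair:\s*(\d+) healthy,\s*(\d+) unhealthy"),
}


def parse_repair_summary(text):
    """Extract the counts from a deep-check --repair summary block."""
    summary = {}
    for key, pattern in _PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            raise ValueError("summary line not found: %s" % key)
        summary[key] = tuple(int(g) for g in match.groups())
    return summary


run1 = """done: 11801 objects checked
  pre-repair: 11725 healthy, 76 unhealthy
  76 repairs attempted, 76 successful, 0 failed
  post-repair: 11801 healthy, 0 unhealthy"""
run2 = """done: 11801 objects checked
  pre-repair: 11789 healthy, 12 unhealthy
  12 repairs attempted, 11 successful, 1 failed
  post-repair: 11800 healthy, 1 unhealthy"""

first = parse_repair_summary(run1)
second = parse_repair_summary(run2)

# Net change in the unhealthy count between the end of run 1 and the
# start of run 2; it does not identify which objects changed.
regressed = second["pre"][1] - first["post"][1]
print("newly unhealthy between runs:", regressed)  # prints 12
```

Logging these counts after each weekly run would show whether the number of newly unhealthy objects follows the servers that drop in and out.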