Opened at 2010-03-20T23:13:23Z
Last modified at 2010-03-21T17:12:42Z
#1003 new defect
add-lease may fail to mark a node if the path by which it is reachable changes during marking
Reported by: | davidsarah | Owned by: | somebody |
---|---|---|---|
Priority: | major | Milestone: | undecided |
Component: | code | Version: | 1.6.1 |
Keywords: | leases gc preservation usability docs | Cc: | |
Launchpad Bug: |
Description (last modified by davidsarah)
Tahoe uses a relatively simplistic mark/sweep-style GC algorithm: in the marking phase, tahoe deep-check --add-lease is used to perform a deep traversal from a given root node, and extend the leases of all nodes visited by the traversal.
This is subject to race conditions if the path by which a node is reachable changes during marking. For example, suppose we have two always-reachable mutable directories, A and B, and we also have a subtree that is referenced only from C. When the marking phase begins, there is a link from A to C, but during marking, a tahoe mv or equivalent is used to move that link to be from B to C. If the traversal visits B before the move, and A after it, then it will fail to mark C even though it was reachable throughout.
This is mitigated somewhat by the recommended marking schedule; source:src/docs/garbage-collection.txt#L26 says:
The current recommended values for a small Tahoe grid are to renew the leases once a week, and to give each lease a duration of 31 days.
This gives a node 4 chances to be marked. It could be argued that it is unlikely for the above race condition to happen 4 times for a given node -- but there is no guarantee of that, or even a way to bound the probability of it happening, other than by avoiding mutations during marking.
There are distributed GC algorithms that would handle this correctly, but they are complicated. For the time being, we should document it as a known issue.
Change History (2)
comment:1 Changed at 2010-03-21T15:54:29Z by davidsarah
- Description modified (diff)
- Version changed from 1.6.0 to 1.6.1
comment:2 Changed at 2010-03-21T17:12:42Z by kpreid
It seems to me that a simple fix would be to have a temporary record kept of the fact that A referred to C, and consider this a type of reachability. One such record could be the previous versions of A. The record would need to be kept for as long as an add-lease operation might take; perhaps the cleanup of such records could be integrated into (or be a type of) share deletion, since if your lease duration is shorter than your add-lease you probably have a problem already. I think. This is a quick note; I have missed some aspects of the problem.
clarify description