#1834 new defect

stop using share crawler for anything except constructing a leasedb — at Version 5

Reported by: zooko Owned by:
Priority: normal Milestone: undecided
Component: code-storage Version: 1.9.2
Keywords: leases garbage-collection accounting performance crawlers Cc:
Launchpad Bug:

Description (last modified by daira)

I think we should stop using a "share crawler" — a long-running, persistent, duty-cycle-limited process that visits every share held by a storage server — for everything that we can.

And I think that the only thing we can't do in a different way is construct a leasedb, either when we are first upgrading the server to a leasedb-capable version or when the leasedb has been lost or corrupted.

Here are the other things that are currently done by crawlers and how I think they should be done differently:

  • Updating and/or checking the leases on shares to see if they have expired;

On David-Sarah's 666-accounting branch, this is now done for all shares by a single, synchronous command/query to leasedb. (#666)
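To illustrate what "a single, synchronous command/query" could look like, here is a minimal sketch using Python's sqlite3 module. The `leases` table, its columns, and the `expire_leases` function are hypothetical names invented for this example; they are not taken from the 666-accounting branch, whose actual schema may differ.

```python
import sqlite3

# Hypothetical schema, for illustration only (not the real leasedb schema).
SCHEMA = """
CREATE TABLE leases (
    storage_index   TEXT,
    shnum           INTEGER,
    account_id      INTEGER,
    expiration_time INTEGER
);
"""

def expire_leases(db, now):
    """Remove every lease whose expiration time has passed.

    One DELETE statement covers all shares, so no share file is visited.
    Returns the number of leases removed.
    """
    with db:  # commits on success, rolls back on error
        cur = db.execute(
            "DELETE FROM leases WHERE expiration_time <= ?", (now,))
        return cur.rowcount

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
db.executemany(
    "INSERT INTO leases VALUES (?, ?, ?, ?)",
    [("si1", 0, 1, 100), ("si1", 0, 2, 300), ("si2", 0, 1, 150)])
removed = expire_leases(db, now=200)
```

The cost of this operation depends on the size of the database rather than on the number of backend accesses, which is the property the proposal relies on.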

  • Delete shares that have lost all their leases (by cancellation or expiry);

I propose that this be done instead by the storage server maintaining a persistent set of shares to be deleted. When the lease-updating step (which, in #666, is synchronous and fast) has identified a share that has no more leases, the share's id gets added to the persistent set of shares to delete. A long-running, persistent, duty-cycle-limited process then deletes those shares from the backend and removes their ids from the set of shares-to-delete. This is cleaner and more efficient than using a crawler, which has to visit all shares and which never stops twitching: the deleter has to visit only shares that have been marked as to-delete, and it quiesces when there is nothing to delete. (#1833 — storage server deletes garbage shares itself instead of waiting for crawler to notice them)
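A rough sketch of the persistent to-delete set and its worker follows, assuming an SQLite-backed leasedb. The table names (`leases`, `pending_deletes`), the function names, and the use of a plain dict as a stand-in for the storage backend are all assumptions made for this example, not existing Tahoe-LAFS code.

```python
import sqlite3

# Hypothetical schema, for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS leases (
    storage_index TEXT, shnum INTEGER, expiration_time INTEGER);
CREATE TABLE IF NOT EXISTS pending_deletes (
    storage_index TEXT, shnum INTEGER,
    PRIMARY KEY (storage_index, shnum));
"""

def mark_if_unleased(db, storage_index, shnum):
    """Add the share to the to-delete set if it has no leases left.

    Returns True if the share was marked for deletion.
    """
    with db:
        (n,) = db.execute(
            "SELECT COUNT(*) FROM leases WHERE storage_index=? AND shnum=?",
            (storage_index, shnum)).fetchone()
        if n == 0:
            db.execute(
                "INSERT OR IGNORE INTO pending_deletes VALUES (?, ?)",
                (storage_index, shnum))
        return n == 0

def delete_pending(db, backend, batch_size=10):
    """Delete up to batch_size marked shares from the backend.

    Returns the number of shares processed; 0 means there is nothing
    left to do and the worker can go quiet until something is marked.
    """
    rows = db.execute(
        "SELECT storage_index, shnum FROM pending_deletes LIMIT ?",
        (batch_size,)).fetchall()
    for si, shnum in rows:
        # Delete from the backend first; if we crash before the id is
        # removed from the set, the (idempotent) delete is retried.
        backend.pop((si, shnum), None)
        with db:
            db.execute(
                "DELETE FROM pending_deletes "
                "WHERE storage_index=? AND shnum=?", (si, shnum))
    return len(rows)
```

A caller could limit the duty cycle by sleeping between batches and stopping as soon as `delete_pending` returns 0, in contrast to a crawler that keeps cycling over every share.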

  • Discover newly added shares that the operator copied into the backend without notifying the storage server;

I propose that we stop supporting this use case. It can be replaced by some combination of: 1. requiring the operator to run a Tahoe-LAFS storage client tool (a share migration tool) to upload the shares through the server instead of copying the shares directly into the backend, 2. various kludgy workarounds, 3. a new tool for registering specific storage indexes in the leasedb after the shares have been added directly into the backend, or 4. simply requiring that the operator manually trigger the crawler to start instead of expecting the crawler to run continuously. (#1835 — stop grovelling the whole storage backend looking for externally-added shares to add a lease to)

  • Count how many shares you have;

This can be nicely replaced by leasedb (a simple SQL "COUNT" query), and also the functionality can be extended to compute the aggregate sizes of data in addition to the mere number of objects, which would be very useful for customers of LeastAuthority.com (who pay per byte), among others. (#1836 — stop crawling share files in order to figure out how many shares you have)
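As a sketch of the "simple SQL COUNT query" and its extension to aggregate sizes, assuming the leasedb keeps one row per share with its size: the `shares` table, the `used_space` column, and the `share_stats` function are hypothetical names chosen for this example.

```python
import sqlite3

# Hypothetical schema, for illustration only.
SCHEMA = """
CREATE TABLE shares (
    storage_index TEXT,
    shnum         INTEGER,
    used_space    INTEGER,
    PRIMARY KEY (storage_index, shnum)
);
"""

def share_stats(db):
    """Return the number of shares and their total size in bytes.

    Both figures come from a single query against the database;
    no share file in the backend is opened or listed.
    """
    count, total = db.execute(
        # COALESCE turns the NULL that SUM yields on an empty table into 0.
        "SELECT COUNT(*), COALESCE(SUM(used_space), 0) FROM shares"
    ).fetchone()
    return {"share_count": count, "total_bytes": total}

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
db.executemany(
    "INSERT INTO shares VALUES (?, ?, ?)",
    [("si1", 0, 10), ("si1", 1, 20), ("si2", 0, 30)])
stats = share_stats(db)
```

Per-account totals for billing could presumably be obtained the same way with a GROUP BY over whatever account column the real schema provides.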

What would be better about removing these uses of crawler?

  1. The storage server would be more efficient in terms of accesses to its storage backend. This might turn out to matter when the storage backend is a cloud storage service and you pay per API call. (Or it might not, if the cost is cheap enough and the crawly way to do it is efficient enough.)
  2. The crawling would be a quiescent process — something that finishes its job and then stops, and doesn't start again unless a user tells it to. I like this way of doing things. See wiki:FAQ#Q18_unobtrusive_software.
  3. Some of these operations would be faster and better if done in the newly proposed way instead of by relying on a crawler.

Change History (5)

comment:1 Changed at 2012-10-30T23:14:43Z by zooko

  • Description modified (diff)

comment:2 Changed at 2012-10-30T23:14:57Z by zooko

  • Description modified (diff)

comment:3 Changed at 2012-10-30T23:21:46Z by zooko

  • Description modified (diff)

comment:4 follow-up: Changed at 2012-11-07T16:54:05Z by zooko

I'm pretty interested in taking this design to the extreme to get the best efficiency. In that extreme, we never go to persistent storage for either read or write (or existence check) — which requires at least a disk seek for a direct-attached-storage backend or at least a cloud service API request for a cloud backend — unless the leasedb told us to go to persistent storage. (Except, in the case that we're currently building or rebuilding leasedb by crawling persistent storage.)

comment:5 in reply to: ↑ 4 Changed at 2013-05-28T02:01:55Z by daira

  • Description modified (diff)
  • Keywords performance crawlers added

Replying to zooko:

I'm pretty interested in taking this design to the extreme to get the best efficiency. In that extreme, we never go to persistent storage for either read or write (or existence check) — which requires at least a disk seek for a direct-attached-storage backend or at least a cloud service API request for a cloud backend — unless the leasedb told us to go to persistent storage. (Except, in the case that we're currently building or rebuilding leasedb by crawling persistent storage.)

I agree with returning a positive response to an existence check (DYHB) when the leasedb says we have a share. The case where we turn out not to actually have the share is an error that the downloader, uploader, or repairer should tolerate anyway.

I think that returning a negative response to an existence check when the leasedb says we don't have a share is more debatable. In principle, this shouldn't affect latency of downloads because the downloader should use the first shares.needed servers that respond positively, so the latency of negative responses shouldn't matter.
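The two policies under discussion can be sketched as follows. This is only an illustration: the `shares` table, the `do_you_have_block` function, the `verify_negative` flag, and the `list_backend_shares` callable are hypothetical names, not the storage server's real API.

```python
import sqlite3

# Hypothetical schema, for illustration only.
SCHEMA = """
CREATE TABLE shares (
    storage_index TEXT,
    shnum         INTEGER,
    PRIMARY KEY (storage_index, shnum)
);
"""

def do_you_have_block(db, storage_index, list_backend_shares,
                      verify_negative=False):
    """Answer an existence check (DYHB) for one storage index.

    A positive answer always comes from the leasedb alone. A negative
    answer also comes from the leasedb alone unless verify_negative is
    set, in which case the backend is consulted before saying "no".
    """
    rows = db.execute(
        "SELECT shnum FROM shares WHERE storage_index=? ORDER BY shnum",
        (storage_index,)).fetchall()
    shnums = [shnum for (shnum,) in rows]
    if shnums or not verify_negative:
        return shnums
    # The debatable case: the leasedb says "no", so double-check the backend.
    return sorted(list_backend_shares(storage_index))
```

With `verify_negative=False` the server never touches persistent storage for an existence check; with `verify_negative=True` only the negative path pays for a backend access, which by the argument above should not sit on the critical path of a download.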
