[tahoe-dev] BackupDB proposal, and core functionality improvements for v1.4.0

Brian Warner warner-tahoe at allmydata.com
Thu Sep 25 19:13:56 PDT 2008


>  >  1. record the file's size and modification time
>  >  2. upload the file into the grid, obtaining an immutable file
>  > read-cap
>  >  3. add an entry to the 'caps' table, with the read-cap, and the
>  > current time
>  >  4. extract the read-key from the read-cap, add an entry to
>  > 'keys_to_files'
>  >  5. add an entry to 'last_upload'
> 
> Hm.  There is a failure which happens rarely but which has ugly
> consequences.  (You mentioned this in your design document.)  It could
> happen that the file has an mtime of X, and then our backup process
> does step 1 here and the part of step 2 that reads the file contents,
> and then new file contents are written to that file fast enough that
> its mtime is still X.  (If the new file contents are put into place
> with a relink instead of overwriting the file then the part of step 2
> that happens could be simply opening the file instead of reading it.)

What if we include the devno/inode information (.st_dev and .st_ino, in
python terms)? That should cover the relinking case, at least on unix.
What is the equivalent of relink-into-place on windows? (And is there an
equivalent to the devno/inode?)

That leaves a worst-case scenario of modifying the file in-place without
changing the size (say, modifying the first byte of the file), quickly
enough that the timestamp has not changed since step 1.
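
For concreteness, a minimal sketch of the fingerprint check I have in
mind (the shape of the backupdb record is hypothetical; on unix, at
least, the dev/inode fields would catch the relink case):

  import os

  def fingerprint(path):
      # The fields we'd record in the backupdb: dev/ino catch
      # write-and-relink (on unix), size and mtime catch ordinary
      # in-place modification.
      s = os.stat(path)
      return (s.st_dev, s.st_ino, s.st_size, s.st_mtime)

  def probably_unchanged(path, recorded):
      # If every field matches the recorded entry, skip the upload.
      # The remaining hole is the in-place, same-size, same-timestamp
      # modification described above.
      return fingerprint(path) == recorded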

> Step 1.5: if the mtime that we just read in step 1 is equal to the
> current time, within the granularity of the mtimes in the underlying
> system, then do not back it up, but instead wait until the file's
> mtime is sufficiently old that we would be able to tell if it changed,
> then go back to step 1.
> 
> What do you think?  This would impose no cost in the common case (that
> the mtime is old), and even if the mtime is current, on most of our
> supported platforms the granularity of mtime is sufficiently small
> that this would impose only a momentary delay.

Yeah, I like it. Perhaps any file whose mtime is less than a second old
gets put at the back of the queue and scanned again later. We'd establish
some upper bound on how long the backup process is willing to wait (so
that having, say, an active /var/log/syslog in the tree you're trying to
back up doesn't hang the whole process forever). Cool.

(heh, or worse yet, your own tahoe node's twistd.log :-).
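
Something like the following, perhaps (the granularity and upper bound
are made-up numbers, and backing a busy file up anyway after the
deadline is just one possible policy):

  import os, time

  MTIME_GRANULARITY = 1.0   # worst-case mtime resolution, in seconds
  MAX_WAIT = 10 * 60        # upper bound on how long we will wait

  def ready_to_snapshot(path, first_seen):
      # A file modified within the last mtime-granularity goes to the
      # back of the queue; a perpetually-busy file (an active syslog,
      # say) gets backed up anyway once the deadline passes, rather
      # than hanging the whole process forever.
      age = time.time() - os.stat(path).st_mtime
      if age >= MTIME_GRANULARITY:
          return True
      return (time.time() - first_seen) >= MAX_WAIT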

>  > If the path is present but the mtime differs, the file may have
>  > changed. [...]  The client will compute the CHK read-key for the
>  > file by hashing its contents [...] then checks the 'keys_to_files'
>  > table to see if this file has been uploaded before
> 
> Well, it may or may not be a win to hash the current file content in
> order to check whether that file content has already been uploaded,

Note that we can use this hash-to-decide-about-uploading pass (let's call it
the pre-hash pass) to pre-compute the AES key for the file, saving one of the
two passes that the upload process would normally need to do. Then imagine
the following outcomes:

 1: we never pre-hash, we always try to upload everything that might have
    changed, and rely upon the grid to tell us that the file is already
    present.

  1a: the file was already present on the grid.
      Cost: pass-to-compute-key+SI, network roundtrip

  1b: the file was not already present.
      Cost: pass-to-compute-key+SI, pass-to-encode, lots of network

 2: always pre-hash, upload with pre-computed key if not in the backupdb

  2a: the key was found in our local backupdb.
      Cost: pass-to-compute-key

  2b: the key was not found in the backupdb, but the file *was* on the grid.
      Cost: pass-to-compute-key, network roundtrip

  2c: key not in backupdb, file not on the grid
      Cost: pass-to-compute-key, pass-to-encode, lots of network

So, I think it is always a win to pre-hash the file before submitting it
to the upload process (assuming the storage/CPU cost of managing the
backupdb is not significant), because we get to save that network
roundtrip whenever we get a hit in the backupdb.

(for reference below, let's identify those three costs as "hash",
"hash+roundtrip", and "hash+upload".)

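In rough pseudo-python, outcome 2's decision logic would look like this
(the backupdb/grid method names are invented for illustration, and the
CHK key derivation is simplified to a single SHA-256 pass):

  import hashlib

  def prehash(path):
      # One pass over the contents; in real life this pass would also
      # feed the AES key derivation, so any upload that follows needs
      # only one more pass (the encode pass).
      h = hashlib.sha256()
      with open(path, 'rb') as f:
          for block in iter(lambda: f.read(1 << 16), b''):
              h.update(block)
      return h.digest()

  def backup_file(path, backupdb, grid):
      key = prehash(path)                 # cost so far: "hash"
      readcap = backupdb.lookup_key(key)
      if readcap is None:                 # not outcome 2a
          # outcome 2b ("hash+roundtrip") or 2c ("hash+upload"):
          # let the grid tell us whether the file is already there
          readcap = grid.check_or_upload(path, key)
          backupdb.remember(key, path, readcap)
      return readcap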

> but this is not a good heuristic for when to try that technique.
> 
> Consider several policies for "when do we hash the current file
> content in order to check whether that file content is already
> recorded as uploaded in the db":

The way I would look at this is to trade off the complexity of the
algorithm against the time/network/disk-IO savings for a typical spectrum
of filesystem behavior. Let's list the sorts of things that we expect to
happen to a local filesystem, since the last backup, in rough order of
frequency:

 v-- more frequent
 * add a new file
 * rewrite/append an existing file (changing size and mtime)
   (also changing inodenum, when applications use write-and-relink)
 * modify an existing file (changing mtime, not changing size)
 * move a file (leaving size and mtime alone)
 * copy a file (changing ctime+mtime)
 * touch a file (changing mtime, leaving contents alone)
 * modify a file quickly (mtime doesn't change): use your wait-to-backup trick
 ^-- less frequent

The purpose of the backupdb is to reduce the workload for some reasonable
portion of these actions. Clearly it isn't worth trying to avoid the upload
work for everything, especially not the less-frequent things.

If it makes things significantly simpler, I'm ok with copying or moving a
previously-backed-up file costing hash+roundtrip. It would be nice to
avoid the roundtrip (it probably costs a second or two), but it's not as
frequent an operation as adding new files or modifying existing files.

On the other hand, if it doesn't add a lot of complexity, having such a
check would make typical backup operations that much faster. If I were to
rename some high-level directory on my system (imagine 'mv /home/warner
/home/brianwarner'), the subsequent backup would cost (hash+roundtrip)
for each of the 545000 files on my workstation: about 6.3 days at one
second per file (assuming we only ask about one file at a time; fix that
and maybe make this 10x faster, so say 15 hours). If an extra db table
could reduce that cost to just (hash), the backup would take 45 minutes.
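
The extra table would be something like this (schema and names are
hypothetical; it's essentially a local index over the proposal's
'keys_to_files' data, so the post-hash lookup never leaves the box):

  import sqlite3

  db = sqlite3.connect("backupdb.sqlite")
  db.execute("CREATE TABLE IF NOT EXISTS keys_to_caps"
             " (readkey BLOB PRIMARY KEY, readcap TEXT)")

  def local_lookup(readkey):
      # A hit here means a moved/copied file costs only (hash),
      # not (hash+roundtrip).
      row = db.execute("SELECT readcap FROM keys_to_caps"
                       " WHERE readkey = ?", (readkey,)).fetchone()
      return row[0] if row else None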

(mind you, none of this accounts for the directories, which I've observed to
take 6 seconds to update over a slow DSL line)

I need to rework that flowchart to handle directory management, since I'd
like a nothing-has-changed backup to avoid network IO completely, and I'd
like to avoid changing directories that haven't been modified (i.e. re-use
read-only directories from previous backups). I'll see if I can
incorporate these decisions into that work: not trying to avoid the
upload of moved/copied files, and not automatically re-checking
previously uploaded files.
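
Roughly, in pseudo-python (every helper name below is hypothetical),
the directory handling I'm after would look like:

  def backup_directory(path, backupdb, grid):
      # Returns (dircap, changed). If no child changed, re-use the
      # read-only directory from the previous backup, so an unchanged
      # subtree costs no network IO at all.
      child_caps = {}
      any_changed = False
      for name, child_path in list_children(path):
          cap, changed = backup_node(child_path, backupdb, grid)
          child_caps[name] = cap
          any_changed = any_changed or changed
      previous = backupdb.lookup_dircap(path)
      if previous is not None and not any_changed:
          return previous, False        # zero network IO
      dircap = grid.mkdir_immutable(child_caps)  # the slow 6s step
      backupdb.remember_dircap(path, dircap)
      return dircap, True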

thanks,
 -Brian
