#598 closed enhancement (fixed)

add 'tahoe backup' command: fast versioned readonly backups

Reported by: warner Owned by:
Priority: major Milestone: 1.3.0
Component: code-frontend-cli Version: 1.2.0
Keywords: Cc: tahoe-dev@…
Launchpad Bug:

Description

As a complement to the only-the-latest-version 'tahoe sync' command described in #597, I'd like to have a full-featured multiple-version 'tahoe backup' command too. This would behave like the existing windows-only allmydata.com backup tool:

tahoe backup LOCALDIR ALIAS:BACKUPBASEDIR

LOCALDIR refers to a directory on the local disk. ALIAS:BACKUPBASEDIR will refer to a writeable Tahoe directory; it will be created if it does not already exist.

Each time this is run, ALIAS:BACKUPBASEDIR/$TIMESTAMP will be created, as a read-only directory that contains an exact mirror of the local disk's LOCALDIR subtree. In addition, ALIAS:BACKUPBASEDIR/Latest will be a read-only reference to the same directory. Over time, BACKUPBASEDIR/ will be filled with a series of timestamped directories, containing historical backups.

Whenever possible, $TIMESTAMP[n] will contain references to files and directories created under $TIMESTAMP[n-1]; i.e. backups will share unchanged objects with earlier backups. Each backup, once finished, will not be changed again. If/when Tahoe acquires immutable dirnodes, 'tahoe backup' will take advantage of them. Meanwhile, it will use read-only dirnodes, by throwing out the write-cap for the $TIMESTAMP directory when the backup is done.

This will use the same backupdb as described in #597 to reduce the amount of work that must be done for unchanged files.

A basic backup system could be constructed by simply running 'tahoe backup' in a cron job. It might be a good idea to have a lockfile of some sort to make this usage safer (i.e. to prevent an overrunning backup from overlapping with the next scheduled run).
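
A minimal sketch of such a lockfile, assuming a POSIX-style filesystem where exclusive file creation is atomic. The names `acquire_lock`, `release_lock` and `AlreadyRunning` are hypothetical and are not part of the tahoe CLI:

```python
import errno
import os


class AlreadyRunning(Exception):
    """Raised when another backup run appears to hold the lockfile."""


def acquire_lock(path):
    """Create the lockfile atomically; fail if it already exists."""
    try:
        # O_EXCL makes the creation fail if the file is already present.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except OSError as e:
        if e.errno == errno.EEXIST:
            raise AlreadyRunning(path)
        raise
    # Record our pid so a human can tell which process owns the lock.
    os.write(fd, str(os.getpid()).encode("ascii"))
    os.close(fd)


def release_lock(path):
    """Remove the lockfile so the next scheduled run can proceed."""
    os.remove(path)
```

A cron wrapper would call `acquire_lock()`, run the backup inside a `try:` block, and call `release_lock()` in the `finally:` clause. This sketch does not handle stale lockfiles left behind by a crashed run.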

Change History (9)

comment:1 Changed at 2009-01-31T00:10:03Z by warner

Looks like this is more important than #597.

comment:2 Changed at 2009-01-31T00:22:51Z by warner

The basic flowchart I've got in mind:

  • start with a writecap to the Backups/ directory
  • locate the most recent version, get its readcap (or None)
  • newdircap = process(olddircap, localdir)
  • process(olddircap, localdir):
    • fetch contents of olddircap, if any
    • create empty mapping for new directory contents
    • list localdir
    • for each directory:
      • newdircontents[name] = process(olddircontents[name], localdir+name)
    • for each file:
      • newdircontents[name] = upload-with-backupdb(localdir+name)
    • now compare newdircontents with olddircontents, including metadata
      • if identical, return olddircap
      • if not, mkdir, set_children(newdircontents), return readonly(new-dircap)
  • add top-level newdircap to Backups/$CURRENT_TIMESTAMP
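
The flowchart above can be sketched in Python roughly as follows. This is only an illustration: `FakeGrid` is a hypothetical in-memory stand-in for a Tahoe client (not a real API), caps are plain strings, and metadata and symlinks are ignored.

```python
import hashlib
import os


class FakeGrid(object):
    """Hypothetical in-memory stand-in for a Tahoe client; not a real API."""

    def __init__(self):
        self.dirs = {}    # read-only dircap -> {name: childcap}
        self.mkdirs = 0   # number of directories created so far

    def upload(self, path):
        # Stand-in for upload-with-backupdb: the "filecap" is a content hash.
        with open(path, "rb") as f:
            return "file:" + hashlib.sha256(f.read()).hexdigest()

    def mkdir(self, children):
        # mkdir + set_children + readonly() collapsed into one step.
        self.mkdirs += 1
        cap = "dir-ro:%d" % self.mkdirs
        self.dirs[cap] = dict(children)
        return cap

    def list(self, dircap):
        return self.dirs[dircap]


def process(grid, olddircap, localdir):
    """Return a read-only dircap mirroring localdir, reusing olddircap if possible."""
    old = grid.list(olddircap) if olddircap is not None else {}
    new = {}
    for name in sorted(os.listdir(localdir)):
        path = os.path.join(localdir, name)
        if os.path.isdir(path):
            oldchild = old.get(name)
            if oldchild not in grid.dirs:
                oldchild = None  # missing before, or it used to be a file
            new[name] = process(grid, oldchild, path)
        else:
            new[name] = grid.upload(path)
    if olddircap is not None and new == old:
        return olddircap         # identical contents: share the old directory
    return grid.mkdir(new)


def backup(grid, backups, localdir, timestamp):
    """backups models the writeable Backups/ directory as {timestamp: dircap}."""
    latest = backups[max(backups)] if backups else None
    backups[timestamp] = process(grid, latest, localdir)
    return backups[timestamp]
```

With sortable timestamps such as ISO 8601 strings, `max(backups)` picks the most recent version. Running `backup()` twice on an unchanged tree returns the same dircap and creates no new directories.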

If upload-with-backupdb works as described in http://allmydata.org/pipermail/tahoe-dev/2008-May/000620.html , then the workload of a null backup will be the recursive read of the entire most-recent-version subtree. To avoid even that:

  • maintain a backupdb table that maps from HASH(newdircontents) to dircap
  • instead of comparing newdircontents with olddircontents, hash newdircontents and look for the result in the table
  • if mkdir() must be used, add an entry to the table afterwards
  • allow entries to be removed from the table at some point (perhaps any entry which is not used in a 'tahoe backup' run should be discarded at the end of that run)

With that in place, a null backup should involve nothing but local stat() calls.
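
A minimal sketch of such a table, assuming sqlite3 as the store and SHA-256 over a sorted name-to-cap listing as HASH(newdircontents). The class name, table name and columns are hypothetical, and metadata is left out of the hash for brevity:

```python
import hashlib
import json
import sqlite3


def contents_hash(newdircontents):
    """Hash a {name: childcap} mapping in a canonical (sorted) order."""
    canonical = json.dumps(sorted(newdircontents.items())).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


class DirCache(object):
    """Hypothetical table mapping HASH(newdircontents) to a dircap."""

    def __init__(self, dbfile=":memory:"):
        self.db = sqlite3.connect(dbfile)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS dircache"
            " (contents_hash TEXT PRIMARY KEY, dircap TEXT, used INTEGER)")

    def lookup(self, newdircontents):
        """Return the cached dircap, or None if mkdir() is needed."""
        h = contents_hash(newdircontents)
        row = self.db.execute(
            "SELECT dircap FROM dircache WHERE contents_hash=?",
            (h,)).fetchone()
        if row is None:
            return None
        # Mark the entry as used so end_of_run() keeps it.
        self.db.execute(
            "UPDATE dircache SET used=1 WHERE contents_hash=?", (h,))
        return row[0]

    def add(self, newdircontents, dircap):
        """Record a directory that had to be created with mkdir()."""
        self.db.execute(
            "INSERT OR REPLACE INTO dircache VALUES (?, ?, 1)",
            (contents_hash(newdircontents), dircap))

    def end_of_run(self):
        """Discard entries not used in this run, then reset the marks."""
        self.db.execute("DELETE FROM dircache WHERE used=0")
        self.db.execute("UPDATE dircache SET used=0")
        self.db.commit()
```

In the flowchart, `lookup()` would replace the comparison against olddircontents, and `add()` would be called after each mkdir.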

comment:3 Changed at 2009-02-03T01:52:56Z by warner

Some data points: home directory sizes on some developers' machines:

  • warner@fluxx: 98k dirs, 776k files, 50GB of data
  • warner@luther: 85k dirs, 1065k files, 81GB
  • zandr: 500k dirs, 1.1M files, 1.2TB
  • zandr (notebook): 61k dirs, 400k files, 184GB
  • zooko: 153k dirs, 1306k files, ??

So, to use "tahoe backup" on these systems, the backupdb must be able to efficiently manage a million entries. I think this is too big for a simple pickle to handle well.

I'll do some experimentation, but my current plan is to use a sqlite database, one for the file-oriented backupdb, and a second for the directory-contents db.

Going forward, of course, it would be nice to allow the use of mysql or postgres. But sqlite is in the python2.5 stdlib, has a synchronous interface (which makes the implementation of tahoe_backup.py a bit easier), and doesn't require any external setup, whereas mysql/postgres would require a separate process to be configured and a DB to be set up, along with user-account setup. Another question is whether to use sqlite directly or to use the Axiom layer (which we're using as an experiment in the disk-watcher). I'm inclined to use sqlite directly, again to avoid lots of new dependencies.
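
To illustrate the direct-sqlite option, here is a rough sketch of a file-oriented backupdb using the stdlib sqlite3 module. The schema, the class name, and the "size and mtime both match" test for an unchanged file are assumptions made for this example, not the design from #597:

```python
import os
import sqlite3


class FileBackupDB(object):
    """Hypothetical file-oriented backupdb: local path -> last uploaded filecap."""

    def __init__(self, dbfile=":memory:"):
        self.db = sqlite3.connect(dbfile)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS files"
            " (path TEXT PRIMARY KEY, size INTEGER, mtime REAL, filecap TEXT)")

    def check(self, path):
        """Return the stored filecap if size and mtime still match, else None."""
        st = os.stat(path)
        row = self.db.execute(
            "SELECT size, mtime, filecap FROM files WHERE path=?",
            (path,)).fetchone()
        if row is not None and (row[0], row[1]) == (st.st_size, st.st_mtime):
            return row[2]
        return None

    def record(self, path, filecap):
        """Remember the filecap together with the file's current size and mtime."""
        st = os.stat(path)
        self.db.execute(
            "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
            (path, st.st_size, st.st_mtime, filecap))
        self.db.commit()
```

The primary key on `path` gives indexed lookups, which is the property that matters once the table holds on the order of a million rows. An uploader would call `check()` first and only upload, then `record()`, when it returns None.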

comment:4 Changed at 2009-02-03T02:06:50Z by warner

zooko's system with 153k dirs and 1306k files has about 69GB of data

comment:5 Changed at 2009-02-03T03:10:30Z by warner

cfce8b5eab431772 has the first cut: no backupdb, but the other functionality is there.

comment:6 Changed at 2009-02-03T03:43:58Z by zooko

My system with 153k dirs and 1306k files has 35,350 files which are duplicates -- that set of 35,350 files has only 17,675 unique md5 hashes.

comment:7 Changed at 2009-02-03T04:00:49Z by zooko

  • Cc tahoe-dev@… added

Note that I'm adding Cc: tahoe-dev@… to this ticket, so until that Cc: is removed any comments posted here will be mailed to the list.

comment:8 Changed at 2009-02-06T04:19:15Z by warner

  • Milestone changed from undecided to 1.3.0
  • Resolution set to fixed
  • Status changed from new to closed

Done. 177ffa0870390c6e was the last patch: the "tahoe backup" command now uses the backupdb and avoids uploading any file that looks like it was unchanged. I'll create a separate ticket (#606) for adding a directory cache to the backupdb; that can be a future enhancement that will improve performance even further.

comment:9 Changed at 2009-02-20T23:46:42Z by azazel

I've run a few small benchmarks uploading one of my darcs repos to the production grid. I first uploaded it using "tahoe cp -r -v", and then I uploaded a tar (not zipped) of the same data. The repo is composed of 67 dirs and 4098 files; the tar size is 27 MB. The "cp -r -v" took roughly 3.5 hours; the "cp repo.tar" took 760 seconds. The client is configured to use a helper.

Here are the stats for one of the files involved in the first upload:

    * Timings:
          o File Size: 3424 bytes
          o Total: 3.88s (882Bps)
                + Storage Index: 194us (17.64MBps)
                + [Contacting Helper]: 723ms
                      # [Helper Already-In-Grid Check]: 228ms
                + [Upload Ciphertext To Helper]: 352ms (9.7kBps)
                + Peer Selection: 879ms
                + Encode And Push: 1.05s (69.9kBps)
                      # Cumulative Encoding: 705us (4.86MBps)
                      # Cumulative Pushing: 48ms (71.0kBps)
                      # Send Hashes And Close: 881ms
                + [Helper Total]: 3.37s

Next, the stats for the tar upload:

    * Timings:
          o File Size: 27176960 bytes
          o Total: 760.13s (35.8kBps)
                + Storage Index: 261us (104099.02MBps)
                + [Contacting Helper]: 702ms
                      # [Helper Already-In-Grid Check]: 454ms
                + [Upload Ciphertext To Helper]: 723.25s (37.6kBps)
                + Peer Selection: 461ms
                + Encode And Push: 35.01s (807.5kBps)
                      # Cumulative Encoding: 1.50s (18.13MBps)
                      # Cumulative Pushing: 32.16s (845.1kBps)
                      # Send Hashes And Close: 996ms
                + [Helper Total]: 759.47s
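
As a rough cross-check, the totals quoted in this comment can be divided out to get an average wall-clock cost per uploaded object. The inputs are the figures stated above (4098 files, 67 dirs, roughly 3.5 hours, 760 seconds); the variable names are only for illustration:

```python
files = 4098                 # file count quoted in the comment
dirs = 67                    # directory count quoted in the comment
cp_seconds = 3.5 * 3600      # "roughly 3.5 hours" for "tahoe cp -r -v"
tar_seconds = 760            # single upload of the tar

per_file = cp_seconds / files             # average seconds per file
per_object = cp_seconds / (files + dirs)  # counting directories as well
slowdown = cp_seconds / tar_seconds       # recursive copy vs. single tar

print("%.2f s per file" % per_file)
print("%.2f s per file or directory" % per_object)
print("%.1fx slower than the tar upload" % slowdown)
```

This works out to roughly 3 seconds per file on average, in the same range as the 3.88s total reported for the 3424-byte sample file.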

This small test demonstrated an overhead of 1.5 to 2 seconds for every upload operation. Lastly, here are the results of a "du --si $repo; find $repo -type f |wc -l; find $repo -type d |wc -l" command:

33k     wip/cute/lib/python2.5/site-packages/fixture-1.1.1-py2.5.egg/EGG-INFO
136k    wip/cute/lib/python2.5/site-packages/fixture-1.1.1-py2.5.egg/fixture/command/generate
144k    wip/cute/lib/python2.5/site-packages/fixture-1.1.1-py2.5.egg/fixture/command
25k     wip/cute/lib/python2.5/site-packages/fixture-1.1.1-py2.5.egg/fixture/examples/db
29k     wip/cute/lib/python2.5/site-packages/fixture-1.1.1-py2.5.egg/fixture/examples
115k    wip/cute/lib/python2.5/site-packages/fixture-1.1.1-py2.5.egg/fixture/loadable
29k     wip/cute/lib/python2.5/site-packages/fixture-1.1.1-py2.5.egg/fixture/setup_cmd
78k     wip/cute/lib/python2.5/site-packages/fixture-1.1.1-py2.5.egg/fixture/test/test_command/test_generate
82k     wip/cute/lib/python2.5/site-packages/fixture-1.1.1-py2.5.egg/fixture/test/test_command
213k    wip/cute/lib/python2.5/site-packages/fixture-1.1.1-py2.5.egg/fixture/test/test_loadable
426k    wip/cute/lib/python2.5/site-packages/fixture-1.1.1-py2.5.egg/fixture/test
922k    wip/cute/lib/python2.5/site-packages/fixture-1.1.1-py2.5.egg/fixture
955k    wip/cute/lib/python2.5/site-packages/fixture-1.1.1-py2.5.egg
353k    wip/cute/lib/python2.5/site-packages/zope.schema-3.5.0a2-py2.5.egg/zope/schema/tests
627k    wip/cute/lib/python2.5/site-packages/zope.schema-3.5.0a2-py2.5.egg/zope/schema
635k    wip/cute/lib/python2.5/site-packages/zope.schema-3.5.0a2-py2.5.egg/zope
54k     wip/cute/lib/python2.5/site-packages/zope.schema-3.5.0a2-py2.5.egg/EGG-INFO
689k    wip/cute/lib/python2.5/site-packages/zope.schema-3.5.0a2-py2.5.egg
46k     wip/cute/lib/python2.5/site-packages/zope.interface-3.5.0-py2.5-linux-i686.egg/zope/interface/common/tests
168k    wip/cute/lib/python2.5/site-packages/zope.interface-3.5.0-py2.5-linux-i686.egg/zope/interface/common
267k    wip/cute/lib/python2.5/site-packages/zope.interface-3.5.0-py2.5-linux-i686.egg/zope/interface/tests
1,0M    wip/cute/lib/python2.5/site-packages/zope.interface-3.5.0-py2.5-linux-i686.egg/zope/interface
1,1M    wip/cute/lib/python2.5/site-packages/zope.interface-3.5.0-py2.5-linux-i686.egg/zope
87k     wip/cute/lib/python2.5/site-packages/zope.interface-3.5.0-py2.5-linux-i686.egg/EGG-INFO
1,1M    wip/cute/lib/python2.5/site-packages/zope.interface-3.5.0-py2.5-linux-i686.egg
21k     wip/cute/lib/python2.5/site-packages/zope.event-3.4.0-py2.5.egg/zope/event
29k     wip/cute/lib/python2.5/site-packages/zope.event-3.4.0-py2.5.egg/zope
29k     wip/cute/lib/python2.5/site-packages/zope.event-3.4.0-py2.5.egg/EGG-INFO
58k     wip/cute/lib/python2.5/site-packages/zope.event-3.4.0-py2.5.egg
41k     wip/cute/lib/python2.5/site-packages/zope.component-3.5.1-py2.5.egg/zope/component/bbb
29k     wip/cute/lib/python2.5/site-packages/zope.component-3.5.1-py2.5.egg/zope/component/testfiles
672k    wip/cute/lib/python2.5/site-packages/zope.component-3.5.1-py2.5.egg/zope/component
680k    wip/cute/lib/python2.5/site-packages/zope.component-3.5.1-py2.5.egg/zope
95k     wip/cute/lib/python2.5/site-packages/zope.component-3.5.1-py2.5.egg/EGG-INFO
775k    wip/cute/lib/python2.5/site-packages/zope.component-3.5.1-py2.5.egg
4,0M    wip/cute/lib/python2.5/site-packages
13k     wip/cute/lib/python2.5/distutils
4,0M    wip/cute/lib/python2.5
4,0M    wip/cute/lib
0       wip/cute/include
1,2M    wip/cute/bin
5,3M    wip/cute/cute/_darcs/pristine.hashed
13M     wip/cute/cute/_darcs/patches
21k     wip/cute/cute/_darcs/prefs
435k    wip/cute/cute/_darcs/inventories
19M     wip/cute/cute/_darcs
463k    wip/cute/cute/docs/tutorial/images
517k    wip/cute/cute/docs/tutorial
0       wip/cute/cute/docs/experiments
517k    wip/cute/cute/docs
29k     wip/cute/cute/lib/cute/app
25k     wip/cute/cute/lib/cute/ui/widgets
91k     wip/cute/cute/lib/cute/ui/resources
13k     wip/cute/cute/lib/cute/ui/designer_plugins
13k     wip/cute/cute/lib/cute/ui/ui
8,2k    wip/cute/cute/lib/cute/ui/test
304k    wip/cute/cute/lib/cute/ui
8,2k    wip/cute/cute/lib/cute/db/search
17k     wip/cute/cute/lib/cute/db/source
91k     wip/cute/cute/lib/cute/db
3,6M    wip/cute/cute/lib/cute/tests/sample_data/birt/images/logos
263k    wip/cute/cute/lib/cute/tests/sample_data/birt/images/productlines
4,1M    wip/cute/cute/lib/cute/tests/sample_data/birt/images
4,3M    wip/cute/cute/lib/cute/tests/sample_data/birt
4,3M    wip/cute/cute/lib/cute/tests/sample_data
4,3M    wip/cute/cute/lib/cute/tests
4,8M    wip/cute/cute/lib/cute
4,8M    wip/cute/cute/lib
24M     wip/cute/cute
33M     wip/cute
3647
70