.. -*- coding: utf-8-with-signature -*-

==================
The Tahoe BackupDB
==================

1. `Overview`_
2. `Schema`_
3. `Upload Operation`_
4. `Directory Operations`_

Overview
========

To speed up backup operations, Tahoe maintains a small database known as the
"backupdb". This is used to avoid re-uploading files which have already been
uploaded recently.

This database lives in ``~/.tahoe/private/backupdb.sqlite``, and is a SQLite
single-file database. It is used by the "``tahoe backup``" command. In the
future, it may optionally be used by other commands such as "``tahoe cp``".

The purpose of this database is twofold: to manage the file-to-cap
translation (the "upload" step) and the directory-to-cap translation (the
"mkdir-immutable" step).

The overall goal of optimizing backup is to reduce the work required when the
source disk has not changed (much) since the last backup. In the ideal case,
running "``tahoe backup``" twice in a row, with no intervening changes to the
disk, will not require any network traffic. Minimal changes to the source
disk should result in minimal traffic.

This database is optional. If it is deleted, the worst effect is that a
subsequent backup operation may use more effort (network bandwidth, CPU
cycles, and disk IO) than it would have without the backupdb.

The database uses sqlite3, which is included as part of the standard Python
library with Python 2.5 and later. For Python 2.4, Tahoe will try to install
the "pysqlite" package at build time, but this will succeed only if sqlite3
with development headers is already installed. On Debian and Debian
derivatives you can install the "python-pysqlite2" package (which, despite
the name, actually provides sqlite3 rather than sqlite2). On old
distributions such as Debian etch (4.0 "oldstable") or Ubuntu Edgy (6.10)
the "python-pysqlite2" package won't work, but the "sqlite3-dev" package
will.

Schema
======

The database contains the following tables::

 CREATE TABLE version
 (
  version integer  -- contains one row, set to 1
 );

 CREATE TABLE local_files
 (
  path varchar(1024) PRIMARY KEY, -- index, this is an absolute UTF-8-encoded local filename
  size integer,   -- os.stat(fn)[stat.ST_SIZE]
  mtime number,   -- os.stat(fn)[stat.ST_MTIME]
  ctime number,   -- os.stat(fn)[stat.ST_CTIME]
  fileid integer
 );

 CREATE TABLE caps
 (
  fileid integer PRIMARY KEY AUTOINCREMENT,
  filecap varchar(256) UNIQUE  -- URI:CHK:...
 );

 CREATE TABLE last_upload
 (
  fileid INTEGER PRIMARY KEY,
  last_uploaded TIMESTAMP,
  last_checked TIMESTAMP
 );

 CREATE TABLE directories
 (
  dirhash varchar(256) PRIMARY KEY,
  dircap varchar(256),
  last_uploaded TIMESTAMP,
  last_checked TIMESTAMP
 );

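As a quick sanity check, the schema above can be loaded into a throwaway
in-memory SQLite database from Python. This is a minimal sketch, not Tahoe's
actual initialization code:

```python
# Create the backupdb schema in an in-memory SQLite database and record
# the schema version, as the 'version' table's comment describes.
import sqlite3

SCHEMA = """
CREATE TABLE version (version integer);
CREATE TABLE local_files
(
 path varchar(1024) PRIMARY KEY,
 size integer,
 mtime number,
 ctime number,
 fileid integer
);
CREATE TABLE caps
(
 fileid integer PRIMARY KEY AUTOINCREMENT,
 filecap varchar(256) UNIQUE
);
CREATE TABLE last_upload
(
 fileid INTEGER PRIMARY KEY,
 last_uploaded TIMESTAMP,
 last_checked TIMESTAMP
);
CREATE TABLE directories
(
 dirhash varchar(256) PRIMARY KEY,
 dircap varchar(256),
 last_uploaded TIMESTAMP,
 last_checked TIMESTAMP
);
INSERT INTO version (version) VALUES (1);
"""

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
(version,) = db.execute("SELECT version FROM version").fetchone()
```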
Upload Operation
================

The upload process starts with a pathname (like ``~/.emacs``) and wants to
end up with a file-cap (like ``URI:CHK:...``).

The first step is to convert the path to an absolute form
(``/home/warner/.emacs``) and do a lookup in the 'local_files' table. If the
path is not present in this table, the file must be uploaded. The upload
process is:

1. record the file's size, ctime (which is the directory-entry change time
   or file creation time, depending on OS), and modification time

2. upload the file into the grid, obtaining an immutable file read-cap

3. add an entry to the 'caps' table, with the read-cap, to get a fileid

4. add an entry to the 'last_upload' table, with the current time

5. add an entry to the 'local_files' table, with the fileid, the path,
   and the local file's size/ctime/mtime

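The five steps above can be sketched in Python against the schema shown
earlier. The ``upload_to_grid()`` helper is hypothetical, standing in for the
real grid upload; this is an illustration, not Tahoe's actual code:

```python
# Bookkeeping for a fresh upload: stat the file, upload it (stubbed),
# then record rows in 'caps', 'last_upload', and 'local_files'.
import os
import stat
import time

def upload_to_grid(path):
    # Hypothetical stand-in for the real grid upload; returns a fake
    # immutable read-cap.
    return "URI:CHK:fake-cap-for-%s" % os.path.basename(path)

def record_upload(db, path):
    abspath = os.path.abspath(os.path.expanduser(path))
    s = os.stat(abspath)                        # step 1: size/ctime/mtime
    filecap = upload_to_grid(abspath)           # step 2: upload, get read-cap
    cursor = db.execute("INSERT INTO caps (filecap) VALUES (?)", (filecap,))
    fileid = cursor.lastrowid                   # step 3: caps row gives fileid
    now = time.time()
    db.execute("INSERT INTO last_upload (fileid, last_uploaded, last_checked)"
               " VALUES (?,?,?)", (fileid, now, now))        # step 4
    db.execute("INSERT INTO local_files (path, size, mtime, ctime, fileid)"
               " VALUES (?,?,?,?,?)",
               (abspath, s[stat.ST_SIZE], s[stat.ST_MTIME],
                s[stat.ST_CTIME], fileid))                   # step 5
    return fileid, filecap
```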
If the path *is* present in 'local_files', the easy-to-compute identifying
information is compared: file size and ctime/mtime. If these differ, the
file must be uploaded. The row is removed from the 'local_files' table, and
the upload process above is followed.

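The comparison can be sketched as a single table lookup plus a fresh
``os.stat()``, assuming the schema shown earlier (not Tahoe's actual code):

```python
# Decide whether a file looks unchanged: compare the stored
# size/mtime/ctime row against the file's current stat results.
import os
import stat

def is_unchanged(db, abspath):
    row = db.execute("SELECT size, mtime, ctime FROM local_files"
                     " WHERE path=?", (abspath,)).fetchone()
    if row is None:
        return False  # never seen before: must upload
    s = os.stat(abspath)
    return (row[0] == s[stat.ST_SIZE] and
            row[1] == s[stat.ST_MTIME] and
            row[2] == s[stat.ST_CTIME])
```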
If the path is present but ctime or mtime differs, the file may have
changed. If the size differs, then the file has certainly changed. At this
point, a future version of the "backup" command might hash the file and look
for a match in an as-yet-undefined table, in the hope that the file has
simply been moved from somewhere else on the disk. This enhancement requires
changes to the Tahoe upload API before it can be significantly more
efficient than simply handing the file to Tahoe and relying upon the normal
convergence to notice the similarity.

If ctime, mtime, or size is different, the client will upload the file, as
above.

If these identifiers are the same, the client will assume that the file is
unchanged (unless the ``--ignore-timestamps`` option is provided, in which
case the client always re-uploads the file), and it may be allowed to skip
the upload. For safety, however, we require that the client periodically
perform a filecheck on these probably-already-uploaded files, and re-upload
anything that does not look healthy. The client looks up the fileid in the
'last_upload' table, to see how long it has been since the file was last
checked.

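That lookup amounts to reading the ``last_checked`` timestamp and turning it
into an age. A minimal sketch, assuming the schema shown earlier:

```python
# Compute how long ago a file was last checked, from its
# last_upload.last_checked timestamp (seconds since the epoch).
import time

def seconds_since_last_check(db, fileid, now=None):
    row = db.execute("SELECT last_checked FROM last_upload"
                     " WHERE fileid=?", (fileid,)).fetchone()
    if row is None:
        return None  # never uploaded, so never checked
    now = time.time() if now is None else now
    return now - row[0]
```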
A "random early check" algorithm should be used, in which a check is
performed with a probability that increases with the age of the previous
results. E.g. files that were last checked within a month are not checked,
files that were checked 5 weeks ago are re-checked with 25% probability, 6
weeks with 50%, more than 8 weeks are always checked. This reduces the
"thundering herd" of filechecks-on-everything that would otherwise result
when a backup operation is run one month after the original backup. If a
filecheck reveals the file is not healthy, it is re-uploaded.

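One way to realize that schedule is a piecewise-linear probability: zero up
to four weeks, climbing 25% per week until it reaches 100% at eight weeks.
The exact curve is an assumption for illustration; it matches the example
figures above but is not Tahoe's actual code:

```python
# "Random early check": probability of performing a filecheck grows with
# the age of the previous check result.
import random

WEEK = 7 * 24 * 60 * 60  # seconds

def check_probability(age_seconds):
    weeks = age_seconds / WEEK
    if weeks <= 4:
        return 0.0       # checked within a month: skip
    if weeks >= 8:
        return 1.0       # more than 8 weeks: always check
    return (weeks - 4) * 0.25   # 5 weeks -> 25%, 6 weeks -> 50%, ...

def should_check(age_seconds, rng=random):
    return rng.random() < check_probability(age_seconds)
```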
If the filecheck shows the file is healthy, or if the filecheck was skipped,
the client gets to skip the upload, and uses the previous filecap (from the
'caps' table) to add to the parent directory.

If a new file is uploaded, new entries are put in the 'caps' and
'last_upload' tables, and an entry is made in the 'local_files' table to
reflect the mapping from local disk pathname to uploaded filecap. If an old
file is re-uploaded, the 'last_upload' entry is updated with the new
timestamps. If an old file is checked and found healthy, the 'last_upload'
entry is updated.

Relying upon timestamps is a compromise between efficiency and safety: a
file which is modified without changing the timestamp or size will be
treated as unmodified, and the "``tahoe backup``" command will not copy the
new contents into the grid. The ``--ignore-timestamps`` option can be used
to disable this optimization, forcing every byte of the file to be hashed
and encoded.

Directory Operations
====================

Once the contents of a directory are known (a filecap for each file, and a
dircap for each directory), the backup process must find or create a Tahoe
directory node with the same contents. The contents are hashed, and the hash
is queried in the 'directories' table. If found, the last-checked timestamp
is used to perform the same random-early-check algorithm described for files
above, but no new upload is performed. Since "``tahoe backup``" creates
immutable directories, it is perfectly safe to re-use a directory from a
previous backup.

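The directory-contents hash can be sketched as hashing a canonical
serialization of the sorted (name, cap) pairs, so identical contents always
map to the same 'directories' row. The serialization below is an assumption
for illustration; Tahoe defines its own:

```python
# Hash a directory's contents (child name -> filecap/dircap) into a
# stable key suitable for the 'directories' table's dirhash column.
import hashlib

def dirhash(contents):
    """contents: dict mapping child name -> filecap/dircap string."""
    h = hashlib.sha256()
    for name in sorted(contents):   # sort so insertion order is irrelevant
        h.update(("%s:%s\n" % (name, contents[name])).encode("utf-8"))
    return h.hexdigest()
```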
If not found, the web-API "mkdir-immutable" operation is used to create a
new directory, and an entry is stored in the table.

The comparison operation ignores timestamps and metadata, and pays attention
solely to the file names and contents.

By using a directory-contents hash, the "``tahoe backup``" command is able
to re-use directories from other places in the backed-up data, or from old
backups. This means that renaming a directory and moving a subdirectory to a
new parent both count as "minor changes" and will result in minimal Tahoe
operations and subsequent network traffic (new directories will be created
for the modified directory and all of its ancestors). It also means that you
can perform a backup ("#1"), delete a file or directory, perform a backup
("#2"), restore it, and then the next backup ("#3") will re-use the
directories from backup #1.

The best case is a null backup, in which nothing has changed. This will
result in minimal network bandwidth: one directory read and two modifies.
The ``Archives/`` directory must be read to locate the latest backup, and
must be modified to add a new snapshot, and the ``Latest/`` directory will
be updated to point to that same snapshot.