Opened at 2009-02-16T23:09:47Z
Closed at 2010-06-19T03:52:09Z
#629 closed defect (fixed)
'tahoe backup' doesn't tolerate 8-bit filenames
Reported by: | warner | Owned by: | zooko |
---|---|---|---|
Priority: | major | Milestone: | 1.7.0 |
Component: | code-frontend-cli | Version: | 1.3.0 |
Keywords: | tahoe-backup unicode reviewed | Cc: | freestorm77@… |
Launchpad Bug: |
Description
Sigh, another unicode problem, not unlike #565 or #534, but this time affecting 'tahoe backup'. Sometimes the sqlite database doesn't like to accept 8-bit strings.
I think that the database should hold whatever pathname we get back from os.listdir(), i.e. if os.listdir() returns bytestrings that contain UTF-8 encoded filenames, then the database should hold bytestrings that happen to contain UTF-8 encoded filenames. This means modifying the way we use sqlite to tolerate bytestrings. According to this traceback provided by Andrej Falout, this involves a custom text_factory:
File "/usr/src/tahoe/allmydata-tahoe-1.2.0-r3558/src/allmydata/scripts/tahoe_backup.py", line 243, in process newfilecap, metadata = self.upload(childpath) File "/usr/src/tahoe/allmydata-tahoe-1.2.0-r3558/src/allmydata/scripts/tahoe_backup.py", line 322, in upload must_upload, bdb_results = self.check_backupdb(childpath) File "/usr/src/tahoe/allmydata-tahoe-1.2.0-r3558/src/allmydata/scripts/tahoe_backup.py", line 268, in check_backupdb r = self.backupdb.check_file(childpath, use_timestamps) File "/usr/src/tahoe/allmydata-tahoe-1.2.0-r3558/src/allmydata/scripts/backupdb.py", line 168, in check_file (path,)) sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. Command exited with non-zero status 1
This still leaves the question of what filenames we should pass to Tahoe. The Tahoe webapi expects URL-encoded UTF-8-encoded bytestrings in the URL, and the set_children() command expects JSON-encoded unicode strings as childnames. The main question (as explored by #565/#534) is how to get from the return value of os.listdir() (and the starting point in sys.argv) to a tahoe-suitable unicode object.. that part depends upon what the local encoding is, and on what convention is in use on any particular local filesystem.
Attachments (2)
Change History (22)
comment:1 Changed at 2009-02-24T01:13:31Z by francois
- Owner set to francois
- Status changed from new to assigned
comment:2 Changed at 2009-05-23T21:14:18Z by zooko
comment:3 Changed at 2009-07-20T21:45:29Z by midnightmagic
Uh.. one might be me. :-) I have ill-encoded filenames as a leftover from old, old data stores: I actually have direct backups from my Amiga, for example, that have survived all the machine disasters I've had (so far) and persist in my backup stores. I have files from my old Commodore-64 too, for that matter. Sometimes it's a matter of trying to keep as pristine a backup of those files as possible so that I can use them in things like emulators, or even actual antique hardware, that keeps me from doing any sorts of conversions. Also, the OS filesystem seems to ignore encodings for the most part in its command-line activities: and so these artifacts persist. (And I'd love to store them in a tahoe backup grid without transforming them with.. for example, tar or something.)
comment:4 Changed at 2009-08-07T02:03:58Z by warner
- Milestone changed from undecided to eventually
I can think of two reasons:
- deciding what unicode to use for a given os.listdir() return value is equivalent to solving #731/#734, and that is proving to be difficult and lengthy
- backupdb paths are entirely local: they are not shared with other Tahoe users. They will only be compared against other strings coming from the same host's os.listdir(). So we aren't obligated to find a way to make them useful to other platforms with different encoding preferences.
I think the simplest approach is to take sqlite's advice and "use a text_factory that can interpret 8-bit bytestrings (like text_factory=str)". If I'd known about this up front (or if my development platform had signalled this error), that's what I would have done.
comment:5 Changed at 2009-08-07T02:37:38Z by warner
Hey, can someone who's experienced this problem see if the attached unit test fails? It passes on my system, but I think I have a less-picky version of sqlite, or something (maybe my system is giving my unicode filenames, instead of UTF-8-encoded filenames).
Also, http://docs.python.org/library/sqlite3.html is the place to go for text_factory information. I think we may want to define the local_files.path column to be of a special type, and then tell sqlite that we want to interact with that type as a bytestring. Also, if we *do* get unicode from os.listdir(), I guess we have to encode it (as UTF-8, doesn't really matter as long as it's one-way consistent) before passing it in as this special type.
comment:6 Changed at 2009-08-09T20:29:23Z by zooko
- Owner changed from francois to andrej
- Status changed from assigned to new
comment:7 Changed at 2009-12-06T14:14:15Z by francois
- Keywords tahoe-backup added; backup removed
comment:8 Changed at 2010-02-02T00:16:55Z by davidsarah
- Milestone changed from eventually to 1.7.0
comment:9 follow-up: ↓ 10 Changed at 2010-03-09T19:11:41Z by jsgf
This patch fixes the crash I see with utf-8 encoded characters.
diff -rN -u old-tahoe-lafs/src/allmydata/scripts/cli.py new-tahoe-lafs/src/allmydata/scripts/cli.py --- old-tahoe-lafs/src/allmydata/scripts/cli.py 2010-03-09 10:59:58.371372576 -0800 +++ new-tahoe-lafs/src/allmydata/scripts/cli.py 2010-03-09 10:59:58.529372862 -0800 @@ -279,7 +279,7 @@ self['exclude'] = set() def parseArgs(self, localdir, topath): - self.from_dir = localdir + self.from_dir = localdir.decode(sys.getfilesystemencoding()) self.to_dir = topath def getSynopsis(Self):
comment:10 in reply to: ↑ 9 Changed at 2010-03-25T22:58:28Z by freestorm
- Cc freestorm77@… added
Replying to jsgf:
Hello, I've tested this patch, it work fine for me.
My system: Windows XP SP2 French
Tahoe --version: allmydata-tahoe: 1.6.1, foolscap: 0.4.2, pycryptopp: 0.5.17-r683, zfec: 1.4.5, Twisted: 8.2.0, Nevow: 0.9.33, zope.interface: 3.5.2, python: 2.6.2, platform: Windows-XP-5.1.2600-SP3, sqlite: 3.5.9, simplejson: 2.0.9, pyopenssl: 0. 9, argparse: 0.9.1, twisted: 8.2.0, nevow: 0.9.33-r17222, pyOpenSSL: 0.9, pyutil: 1.3.34, zbase32: 1.1.1, setuptools: 0.6c12dev, pysqlite: 2.4.1
I have tested with this big file (all chars that Windows accept in filename):
"LONG_TEST_STRING !#$%&'()+,-.012345789=@ABCDEFGHIJKQRSTYZbcdefghijklmnopqrstuvwxyz{}~ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×áíóúñѪº¿®¬½¼¡«»ÁÂÀ©¢¥ãäðÐÊËÈÍÎϦÌÓßÔÒõÕµþÞÚÛÙýݯ´±¾¶§÷¸°¨·¹³² .txt"
The result:
C:\>tahoe backup -v c:\temp\tahoe ROOT:TEST/TEST_CHARS processing c:\temp\tahoe skipping c:\temp\tahoe\LONG_TEST_STRING !#$%&'()+,-.012345789=@ABCDEFGHIJKQRSTYZbcdefghijklmnopqrstuvwxyz{}~ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×áíóúñѪº¿®¬½¼¡«»ÁÂÀ©¢¥ãäðÐÊËÈÍÎϦÌÓßÔÒõÕµþÞÚÛÙýݯ´±¾¶§÷¸°¨·¹³² .txt..
re-using old directory for c:\temp\tahoe 0 files uploaded (1 reused), 0 files skipped, 0 directories created (1 reused), 0 directories skipped 0 files checked, 0 directories checked backup done, elapsed time: 0:00:06
C:\>
Fred
This patch fixes the crash I see with utf-8 encoded characters.
diff -rN -u old-tahoe-lafs/src/allmydata/scripts/cli.py new-tahoe-lafs/src/allmydata/scripts/cli.py --- old-tahoe-lafs/src/allmydata/scripts/cli.py 2010-03-09 10:59:58.371372576 -0800 +++ new-tahoe-lafs/src/allmydata/scripts/cli.py 2010-03-09 10:59:58.529372862 -0800 @@ -279,7 +279,7 @@ self['exclude'] = set() def parseArgs(self, localdir, topath): - self.from_dir = localdir + self.from_dir = localdir.decode(sys.getfilesystemencoding()) self.to_dir = topath def getSynopsis(Self):
comment:11 Changed at 2010-05-11T15:57:45Z by zooko
- Keywords review-needed added
- Owner changed from andrej to nobody
I really want to see this patch in trunk in the next 48 hours for Tahoe-LAFS v1.7, but I can't review it myself right now. Unassigning myself and writing a note to tahoe-dev pleading for someone else to review it.
comment:12 Changed at 2010-05-11T19:30:56Z by davidsarah
- Owner changed from nobody to davidsarah
- Status changed from new to assigned
comment:13 Changed at 2010-05-14T04:28:00Z by zooko
- Milestone changed from 1.7.0 to 1.8.0
comment:14 Changed at 2010-06-12T21:00:00Z by davidsarah
- Milestone changed from 1.8.0 to 1.7.0
Changed at 2010-06-18T04:09:59Z by davidsarah
comment:15 Changed at 2010-06-18T04:10:26Z by davidsarah
- Owner changed from davidsarah to zooko
- Status changed from assigned to new
comment:16 follow-up: ↓ 17 Changed at 2010-06-18T05:09:36Z by zooko
- Keywords reviewed added; review-needed removed
- Resolution set to fixed
- Status changed from new to closed
Reviewed! Should we update NEWS to include this change? Otherwise it is good. Applying...
applied as 178401eb4e87589c, 1a0674bf37779751
comment:17 in reply to: ↑ 16 Changed at 2010-06-18T05:27:45Z by davidsarah
Replying to zooko:
Reviewed! Should we update NEWS to include this change?
It was already mentioned in the NEWS for 1.7beta, even though there were no tests then.
comment:18 Changed at 2010-06-18T05:49:21Z by davidsarah
Hang on. Why does this work?
The field in the sqlite3 db that holds local filenames is declared as:
path VARCHAR(1024) PRIMARY KEY, -- index, this is os.path.abspath(fn)
VARCHAR is 8-bit, but source:src/allmydata/scripts/tahoe_backup.py is calling check_file in source:src/allmydata/scripts/backupdb.py with a unicode argument. os.path.abspath returns unicode when given a unicode path, so how come we are able to look that up in the database?
(If it is being converted to UTF-8, that is fine and will work correctly.)
comment:19 follow-up: ↓ 20 Changed at 2010-06-19T02:03:00Z by davidsarah
- Resolution fixed deleted
- Status changed from closed to reopened
This change caused a test failure: http://tahoe-lafs.org/buildbot/builders/Kyle%20OpenBSD-4.6%20amd64/builds/258/steps/test-coverage/logs/stdio . Investigating.
comment:20 in reply to: ↑ 19 Changed at 2010-06-19T03:52:09Z by davidsarah
- Resolution set to fixed
- Status changed from reopened to closed
Replying to davidsarah:
This change caused a test failure: http://tahoe-lafs.org/buildbot/builders/Kyle%20OpenBSD-4.6%20amd64/builds/258/steps/test-coverage/logs/stdio . Investigating.
Fixed in 6b2f99fa9a1f9552.
Why not take sqlite's advice and store only unicode strings in the db? This would be consistent with the direction that we're going on the tahoe cp issues: basically we're not going to support people who want to have ill-encoded filenames or a wrongly configured locale and who want their filenames to make round trips unchanged. I don't think that class of user is big enough (is it even non-null?) or their demands are reasonable enough that we should keep trying to satisfy them as well as satisfying people who have well-encoded filenames.