#629 closed defect (fixed)

'tahoe backup' doesn't tolerate 8-bit filenames

Reported by: warner Owned by: zooko
Priority: major Milestone: 1.7.0
Component: code-frontend-cli Version: 1.3.0
Keywords: tahoe-backup unicode reviewed Cc: freestorm77@…
Launchpad Bug:

Description

Sigh, another unicode problem, not unlike #565 or #534, but this time affecting 'tahoe backup'. Sometimes the sqlite database doesn't like to accept 8-bit strings.

I think that the database should hold whatever pathname we get back from os.listdir(), i.e. if os.listdir() returns bytestrings that contain UTF-8 encoded filenames, then the database should hold bytestrings that happen to contain UTF-8 encoded filenames. This means modifying the way we use sqlite to tolerate bytestrings. According to this traceback provided by Andrej Falout, this involves a custom text_factory:

  File "/usr/src/tahoe/allmydata-tahoe-1.2.0-r3558/src/allmydata/scripts/tahoe_backup.py", line 243, in process
    newfilecap, metadata = self.upload(childpath)
  File "/usr/src/tahoe/allmydata-tahoe-1.2.0-r3558/src/allmydata/scripts/tahoe_backup.py", line 322, in upload
    must_upload, bdb_results = self.check_backupdb(childpath)
  File "/usr/src/tahoe/allmydata-tahoe-1.2.0-r3558/src/allmydata/scripts/tahoe_backup.py", line 268, in check_backupdb
    r = self.backupdb.check_file(childpath, use_timestamps)
  File "/usr/src/tahoe/allmydata-tahoe-1.2.0-r3558/src/allmydata/scripts/backupdb.py", line 168, in check_file
    (path,))
sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless
you use a text_factory that can interpret 8-bit bytestrings (like
text_factory = str). It is highly recommended that you instead just
switch your application to Unicode strings.
Command exited with non-zero status 1

This still leaves the question of what filenames we should pass to Tahoe. The Tahoe webapi expects URL-encoded UTF-8-encoded bytestrings in the URL, and the set_children() command expects JSON-encoded unicode strings as childnames. The main question (as explored by #565/#534) is how to get from the return value of os.listdir() (and the starting point in sys.argv) to a tahoe-suitable unicode object.. that part depends upon what the local encoding is, and on what convention is in use on any particular local filesystem.

Attachments (2)

test_backupdb.py.diff (1.6 KB) - added by warner at 2009-08-07T02:44:48Z.
updated test case
tahoe-backup-unicode-exclude-and-tests.dpatch (59.4 KB) - added by davidsarah at 2010-06-18T04:09:59Z.
Make 'tahoe backup --exclude' option accept Unicode (the other arguments were already fixed as part of #534/#565). Add tests, based partly on the existing #629 patch.

Download all attachments as: .zip

Change History (22)

comment:1 Changed at 2009-02-24T01:13:31Z by francois

  • Owner set to francois
  • Status changed from new to assigned

comment:2 Changed at 2009-05-23T21:14:18Z by zooko

Why not take sqlite's advice and store only unicode strings in the db? This would be consistent with the direction that we're going on the tahoe cp issues: basically we're not going to support people who want to have ill-encoded filenames or a wrongly configured locale and who want their filenames to make round trips unchanged. I don't think that class of user is big enough (is it even non-null?) or their demands are reasonable enough that we should keep trying to satisfy them as well as satisfying people who have well-encoded filenames.

comment:3 Changed at 2009-07-20T21:45:29Z by midnightmagic

Uh.. one might be me. :-) I have ill-encoded filenames as a leftover from old, old data stores: I actually have direct backups from my Amiga, for example, that have survived all the machine disasters I've had (so far) and persist in my backup stores. I have files from my old Commodore-64 too, for that matter. Sometimes it's a matter of trying to keep as pristine a backup of those files as possible so that I can use them in things like emulators, or even actual antique hardware, that keeps me from doing any sorts of conversions. Also, the OS filesystem seems to ignore encodings for the most part in its command-line activities: and so these artifacts persist. (And I'd love to store them in a tahoe backup grid without transforming them with.. for example, tar or something.)

comment:4 Changed at 2009-08-07T02:03:58Z by warner

  • Milestone changed from undecided to eventually

I can think of two reasons:

  • deciding what unicode to use for a given os.listdir() return value is equivalent to solving #731/#734, and that is proving to be difficult and lengthy
  • backupdb paths are entirely local: they are not shared with other Tahoe users. They will only be compared against other strings coming from the same host's os.listdir(). So we aren't obligated to find a way to make them useful to other platforms with different encoding preferences.

I think the simplest approach is to take sqlite's advice and "use a text_factory that can interpret 8-bit bytestrings (like text_factory=str)". If I'd known about this up front (or if my development platform had signalled this error), that's what I would have done.

comment:5 Changed at 2009-08-07T02:37:38Z by warner

Hey, can someone who's experienced this problem see if the attached unit test fails? It passes on my system, but I think I have a less-picky version of sqlite, or something (maybe my system is giving my unicode filenames, instead of UTF-8-encoded filenames).

Also, http://docs.python.org/library/sqlite3.html is the place to go for text_factory information. I think we may want to define the local_files.path column to be of a special type, and then tell sqlite that we want to interact with that type as a bytestring. Also, if we *do* get unicode from os.listdir(), I guess we have to encode it (as UTF-8, doesn't really matter as long as it's one-way consistent) before passing it in as this special type.

Changed at 2009-08-07T02:44:48Z by warner

updated test case

comment:6 Changed at 2009-08-09T20:29:23Z by zooko

  • Owner changed from francois to andrej
  • Status changed from assigned to new

comment:7 Changed at 2009-12-06T14:14:15Z by francois

  • Keywords tahoe-backup added; backup removed

comment:8 Changed at 2010-02-02T00:16:55Z by davidsarah

  • Milestone changed from eventually to 1.7.0

It makes sense to fix this with #534 and #565.

comment:9 follow-up: Changed at 2010-03-09T19:11:41Z by jsgf

This patch fixes the crash I see with utf-8 encoded characters.

diff -rN -u old-tahoe-lafs/src/allmydata/scripts/cli.py new-tahoe-lafs/src/allmydata/scripts/cli.py
--- old-tahoe-lafs/src/allmydata/scripts/cli.py	2010-03-09 10:59:58.371372576 -0800
+++ new-tahoe-lafs/src/allmydata/scripts/cli.py	2010-03-09 10:59:58.529372862 -0800
@@ -279,7 +279,7 @@
         self['exclude'] = set()
 
     def parseArgs(self, localdir, topath):
-        self.from_dir = localdir
+        self.from_dir = localdir.decode(sys.getfilesystemencoding())
         self.to_dir = topath
 
     def getSynopsis(Self):

comment:10 in reply to: ↑ 9 Changed at 2010-03-25T22:58:28Z by freestorm

  • Cc freestorm77@… added

Replying to jsgf:

Hello, I've tested this patch, it work fine for me.

My system: Windows XP SP2 French

Tahoe --version: allmydata-tahoe: 1.6.1, foolscap: 0.4.2, pycryptopp: 0.5.17-r683, zfec: 1.4.5, Twisted: 8.2.0, Nevow: 0.9.33, zope.interface: 3.5.2, python: 2.6.2, platform: Windows-XP-5.1.2600-SP3, sqlite: 3.5.9, simplejson: 2.0.9, pyopenssl: 0. 9, argparse: 0.9.1, twisted: 8.2.0, nevow: 0.9.33-r17222, pyOpenSSL: 0.9, pyutil: 1.3.34, zbase32: 1.1.1, setuptools: 0.6c12dev, pysqlite: 2.4.1

I have tested with this big file (all chars that Windows accept in filename):

"LONG_TEST_STRING !#$%&'()+,-.012345789=@ABCDEFGHIJKQRSTYZbcdefghijklmnopqrstuvwxyz{}~ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×áíóúñѪº¿®¬½¼¡«»ÁÂÀ©¢¥ãäðÐÊËÈÍÎϦÌÓßÔÒõÕµþÞÚÛÙýݯ´­±¾¶§÷¸°¨·¹³² .txt"

The result:

C:\>tahoe backup -v c:\temp\tahoe ROOT:TEST/TEST_CHARS processing c:\temp\tahoe skipping c:\temp\tahoe\LONG_TEST_STRING !#$%&'()+,-.012345789=@ABCDEFGHIJKQRSTYZbcdefghijklmnopqrstuvwxyz{}~ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×áíóúñѪº¿®¬½¼¡«»ÁÂÀ©¢¥ãäðÐÊËÈÍÎϦÌÓßÔÒõÕµþÞÚÛÙýݯ´­±¾¶§÷¸°¨·¹³² .txt..

re-using old directory for c:\temp\tahoe 0 files uploaded (1 reused), 0 files skipped, 0 directories created (1 reused), 0 directories skipped 0 files checked, 0 directories checked backup done, elapsed time: 0:00:06

C:\>

Fred

This patch fixes the crash I see with utf-8 encoded characters.

diff -rN -u old-tahoe-lafs/src/allmydata/scripts/cli.py new-tahoe-lafs/src/allmydata/scripts/cli.py
--- old-tahoe-lafs/src/allmydata/scripts/cli.py	2010-03-09 10:59:58.371372576 -0800
+++ new-tahoe-lafs/src/allmydata/scripts/cli.py	2010-03-09 10:59:58.529372862 -0800
@@ -279,7 +279,7 @@
         self['exclude'] = set()
 
     def parseArgs(self, localdir, topath):
-        self.from_dir = localdir
+        self.from_dir = localdir.decode(sys.getfilesystemencoding())
         self.to_dir = topath
 
     def getSynopsis(Self):

comment:11 Changed at 2010-05-11T15:57:45Z by zooko

  • Keywords review-needed added
  • Owner changed from andrej to nobody

I really want to see this patch in trunk in the next 48 hours for Tahoe-LAFS v1.7, but I can't review it myself right now. Unassigning myself and writing a note to tahoe-dev pleading for someone else to review it.

comment:12 Changed at 2010-05-11T19:30:56Z by davidsarah

  • Owner changed from nobody to davidsarah
  • Status changed from new to assigned

comment:13 Changed at 2010-05-14T04:28:00Z by zooko

  • Milestone changed from 1.7.0 to 1.8.0

comment:14 Changed at 2010-06-12T21:00:00Z by davidsarah

  • Milestone changed from 1.8.0 to 1.7.0

Changed at 2010-06-18T04:09:59Z by davidsarah

Make 'tahoe backup --exclude' option accept Unicode (the other arguments were already fixed as part of #534/#565). Add tests, based partly on the existing #629 patch.

comment:15 Changed at 2010-06-18T04:10:26Z by davidsarah

  • Owner changed from davidsarah to zooko
  • Status changed from assigned to new

comment:16 follow-up: Changed at 2010-06-18T05:09:36Z by zooko

  • Keywords reviewed added; review-needed removed
  • Resolution set to fixed
  • Status changed from new to closed

Reviewed! Should we update NEWS to include this change? Otherwise it is good. Applying...

applied as 178401eb4e87589c, 1a0674bf37779751

comment:17 in reply to: ↑ 16 Changed at 2010-06-18T05:27:45Z by davidsarah

Replying to zooko:

Reviewed! Should we update NEWS to include this change?

It was already mentioned in the relnotes for 1.7beta, even though there were no tests then.

Version 0, edited at 2010-06-18T05:27:45Z by davidsarah (next)

comment:18 Changed at 2010-06-18T05:49:21Z by davidsarah

Hang on. Why does this work?

The field in the sqlite3 db that holds local filenames is declared as:

 path  VARCHAR(1024) PRIMARY KEY, -- index, this is os.path.abspath(fn)

VARCHAR is 8-bit, but source:src/allmydata/scripts/tahoe_backup.py is calling check_file in source:src/allmydata/scripts/backupdb.py with a unicode argument. os.path.abspath returns unicode when given a unicode path, so how come we are able to look that up in the database?

(If it is being converted to UTF-8, that is fine and will work correctly.)

comment:19 follow-up: Changed at 2010-06-19T02:03:00Z by davidsarah

  • Resolution fixed deleted
  • Status changed from closed to reopened

comment:20 in reply to: ↑ 19 Changed at 2010-06-19T03:52:09Z by davidsarah

  • Resolution set to fixed
  • Status changed from reopened to closed
Last edited at 2010-06-19T03:52:51Z by davidsarah (previous) (diff)
Note: See TracTickets for help on using tickets.