[tahoe-dev] [tahoe-lafs] #534: "tahoe cp" command encoding issue

tahoe-lafs trac at allmydata.org
Mon Mar 30 20:48:45 PDT 2009


#534: "tahoe cp" command encoding issue
-----------------------------------+----------------------------------------
     Reporter:  francois           |       Owner:  francois                          
         Type:  defect             |      Status:  assigned                          
     Priority:  minor              |   Milestone:  1.3.1                             
    Component:  code-frontend-cli  |     Version:  1.2.0                             
   Resolution:                     |    Keywords:  cp encoding unicode filename utf-8
Launchpad_bug:                     |  
-----------------------------------+----------------------------------------

Comment(by zooko):

 Francois:  thanks for working on this!  I was planning to amend your patch
 myself, but I'll let you do it.

 Here is my most recent idea about how this should be done:

 http://allmydata.org/pipermail/tahoe-dev/2009-March/001379.html

 Except that this *isn't* my most recent idea after all.  I amended my
 intent a little, as prompted by pointed questions from nejucomo on IRC,
 and by looking at the actual source code where directories are processed:

 http://allmydata.org/trac/tahoe/browser/src/allmydata/dirnode.py?rev=20090313233135-e01fd-de54bf81e1eec0220eaa101a3f1e71ce64f41da7#L168

 Then I tried to write down my ideas in detail, which forced me to
 realize that they were incomplete and wrong, and I had to amend them a
 whole lot more in order to finish this letter.  Finally, I asked JP
 Calderone for help, and he helped me understand how to write filenames
 back into a local Linux filesystem without risking that the user will
 accidentally overwrite their local files with tahoe files (because the
 tahoe files were written out under a different representation than the
 one they were displayed under), how to do normalization, and how to
 cheaply ensure that silent misdecodings could be repaired by some future
 generation.

 Okay, here's the best design yet:

 I think that the unicode representation of the filename should continue to
 be the unique key in the directory (which current Tahoe 1.3.0 requires).

 So there should be a data structure with a required "filename" part, and a
 required "failed_decode" flag, and an optional "alleged_encoding" part.
 The "filename" part is the canonical value of the filename, but we
 recognize that sometimes we can't actually get the *real* filename into
 unicode form.  If our attempt to interpret the filename into unicode
 fails, then we set the "failed_decode" flag and put the
 iso-8859-1-decoding of it into the "filename" part.
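 The record described above could be sketched like this in modern Python
 (the names here are hypothetical illustrations, not the actual Tahoe data
 structure):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FilenameEntry:
    """Hypothetical sketch of the per-filename record described above."""
    filename: str                    # required: canonical unicode form, the unique key
    failed_decode: bool              # required: True if strict decoding failed
    alleged_encoding: Optional[str] = None  # optional: only set when decoding succeeded
```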

 Here are the steps of reading a filename from the filesystem and adding
 that filename into an existing Tahoe directory.

 1.  On Windows or Mac, read the filename with the unicode APIs.  Normalize
 the string with filename = unicodedata.normalize('NFC', filename).  Leave
 out the "alleged_encoding" part and set the "failed_decode" flag to False.

 2.  On Linux read the filename with the string APIs to get "bytes" and
 call sys.getfilesystemencoding() to get "alleged_encoding".  Then, call
 bytes.decode(alleged_encoding, 'strict') to try to get a unicode object.

 2.a.  If this decoding succeeds, then normalize the unicode filename with
 filename = unicodedata.normalize('NFC', filename), store the resulting
 filename and the alleged_encoding, and set the "failed_decode" flag to
 False.
 (Storing the alleged_encoding is for the benefit of future generations,
 who may discover that the decoding was actually wrong even though it
 didn't raise an error, and who could then use the alleged_encoding to undo
 the damage.  For example Shawn Willden has a prototype tool which lets a
 human examine the filename as decoded with different encodings and pick
 the one that means something in a language they know.)

 2.b.  If this decoding fails, then we decode it again with
 bytes.decode('iso-8859-1', 'strict').  Do not normalize it.  Put the
 resulting unicode object into the "filename" part, set the "failed_decode"
 flag to True, and leave the "alleged_encoding" field out.  This is a case
 of mojibake:

 http://en.wikipedia.org/wiki/Mojibake

 The reason to go the mojibake route is that it preserves the information,
 and in theory someone could later decode it and figure out the original
 filename.  This has actually happened at least once, as shown by the
 photograph on that wikipedia page of the package which was delivered to
 the Russian recipient.  Mojibake!  (footnote 1)
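 Steps 2, 2.a and 2.b could be sketched as follows.  (This is modern
 Python 3; the original discussion is about Python 2's string APIs, and
 the function name and tuple return are illustrative assumptions, not
 Tahoe code.)

```python
import sys
import unicodedata

def read_filename(raw: bytes, alleged_encoding: str = None):
    """Sketch of steps 2, 2.a and 2.b: decode a raw on-disk filename.

    Returns (filename, failed_decode, alleged_encoding)."""
    if alleged_encoding is None:
        alleged_encoding = sys.getfilesystemencoding()
    try:
        filename = raw.decode(alleged_encoding, 'strict')
    except UnicodeDecodeError:
        # 2.b: the mojibake route -- iso-8859-1 maps every byte to a code
        # point, so this never fails and preserves the original bytes.
        return raw.decode('iso-8859-1', 'strict'), True, None
    # 2.a: success -- normalize, and remember the encoding for the benefit
    # of future generations who may want to undo a silent misdecoding.
    return unicodedata.normalize('NFC', filename), False, alleged_encoding
```

 For example, valid utf-8 bytes for "café" take the 2.a branch, while the
 latin-1 bytes for the same name fail strict utf-8 decoding and take the
 2.b mojibake branch.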

 How does that sound?

 Phewf.  Okay, now for the trip in the other direction.  Suppose you have a
 Tahoe filename object, and you need to create a file in the local
 filesystem, because for example the user runs "tahoe cp -r $DIRCAP/subdir
 .".  There are four cases:

 Case 1:  You are using a unicode-safe filesystem such as Windows or Mac,
 and you have a unicode object with failed_decode=False.

 This is easy: use the Python unicode filesystem APIs to create the file
 and be happy.

 Case 2:  You are using a unicode-safe filesystem and you have a unicode
 object with failed_decode=True.

 This is easy: use the Python unicode filesystem APIs to create the file,
 passing the latin-1-decoded filename (mojibake!).

 Case 3:  You are using a plain-bytes filesystem such as Linux, and you
 have a unicode object with failed_decode=False.

 This is easy: use the Python unicode filesystem APIs to create the file;
 Python will encode the name with the filesystem encoding.

 Case 4:  You are using a plain-bytes filesystem such as Linux, and you
 have a unicode object with failed_decode=True.

 Now we should *encode* the filename using iso-8859-1 to get a sequence of
 bytes, and then write those bytes into the filesystem using the Python
 string filesystem API.  This is no worse than any alternative, and in the
 case that the target filesystem has the same encoding as the original
 filesystem (for example because it *is* the original filesystem, or
 because it is owned by a friend of the owner of the original filesystem),
 this will restore the file to its proper name.
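 The four write-back cases collapse into a small decision, sketched here
 (again a hypothetical helper in modern Python, not Tahoe code; it returns
 whatever value should be handed to the filesystem API):

```python
def local_name(filename: str, failed_decode: bool, unicode_safe_fs: bool):
    """Sketch of cases 1-4 above.

    Returns a unicode string to pass to the unicode filesystem APIs
    (cases 1, 2 and 3), or raw bytes recovered by reversing the
    iso-8859-1 decode (case 4)."""
    if unicode_safe_fs or not failed_decode:
        # Cases 1, 2 and 3: pass the stored unicode object through
        # unchanged (in case 2 that unicode object is the mojibake name).
        return filename
    # Case 4: plain-bytes filesystem and a failed decode -- encoding with
    # iso-8859-1 exactly reverses the decode and restores the original bytes.
    return filename.encode('iso-8859-1')
```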

 By the way, please see David Wheeler's recent proposal to start enforcing
 filename constraints in Linux: http://lwn.net/Articles/325304 .  His
 proposals include changing Linux to require utf-8-encoding of all
 filenames.

 Regards,

 Zooko

 footnote 1: I know that Alberto Berti has previously argued on tahoe-dev
 and on IRC that mojibake is less clean than the alternative of using
 bytes.decode(alleged_encoding, 'replace').  The latter is lossy, but it
 more clearly shows to the user that some or all of the filename couldn't
 be decoded.  Alberto and others had convinced me of the wisdom of this,
 and I actually wrote this entire document specifying the 'decode-with-
 replace' approach instead of the mojibake approach, but I eventually
 realized that it wouldn't work.  For one thing, it was rather complicated
 to decide how to handle multiple filenames that all decode-with-replace to
 the same unicode name (you could imagine a whole directory full of files
 all named '????' because the locale is wrong).  But the real killer is
 what to do when you are going to write the file back into the local
 filesystem.  If you write a decoded-with-replace file back, then a
 round-trip from linux to tahoe and back can mess up all of your
 filenames.  If you instead write the actual original bytes into the
 filesystem, then you might accidentally overwrite files with a "tahoe
 cp", since "tahoe ls" just shows files with "???" in their names, but
 "tahoe cp" writes files out with the actual characters instead of
 question marks.
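 The collision problem in the footnote is easy to demonstrate in modern
 Python: distinct byte strings can decode-with-replace to the identical
 unicode name, so the information distinguishing them is lost, while the
 iso-8859-1 mojibake route keeps them distinct and reversible:

```python
a = b'caf\xe9'   # latin-1 bytes for 'café'
b = b'caf\xfc'   # latin-1 bytes for 'cafü'

# Both fail strict utf-8 decoding; with 'replace' they collapse to the
# same name, so a directory could not hold both files.
assert a.decode('utf-8', 'replace') == b.decode('utf-8', 'replace') == 'caf\ufffd'

# The mojibake route preserves the distinction, and encoding with
# iso-8859-1 recovers the original bytes exactly.
assert a.decode('iso-8859-1') != b.decode('iso-8859-1')
assert a.decode('iso-8859-1').encode('iso-8859-1') == a
```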

-- 
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/534#comment:47>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid

