[tahoe-dev] [tahoe-lafs] #534: "tahoe cp" command encoding issue
tahoe-lafs
trac at allmydata.org
Mon Mar 30 20:48:45 PDT 2009
#534: "tahoe cp" command encoding issue
-----------------------------------+----------------------------------------
     Reporter:  francois           |        Owner:  francois
         Type:  defect             |       Status:  assigned
     Priority:  minor              |    Milestone:  1.3.1
    Component:  code-frontend-cli  |      Version:  1.2.0
   Resolution:                     |     Keywords:  cp encoding unicode
Launchpad_bug:                     |  filename utf-8
-----------------------------------+----------------------------------------
Comment(by zooko):
Francois: thanks for working on this! I was planning to amend your patch
myself, but I'll let you do it.
Here is my most recent idea about how this should be done:
http://allmydata.org/pipermail/tahoe-dev/2009-March/001379.html
Except that this *isn't* my most recent idea after all. I amended my
intent a little, as prompted by pointed questions from nejucomo on IRC,
and by looking at the actual source code where directories are processed:
http://allmydata.org/trac/tahoe/browser/src/allmydata/dirnode.py?rev=20090313233135-e01fd-de54bf81e1eec0220eaa101a3f1e71ce64f41da7#L168
Then I tried to write down my ideas in detail, which forced me to
realize that they were incomplete and wrong, and I had to amend them a
whole lot more in order to finish this letter. Finally, I asked JP
Calderone for help, and he helped me understand how to write filenames
back into a local Linux filesystem without risking that the user will
accidentally overwrite their local files with tahoe files (because the
tahoe files were written out under a different representation than the
one in which they were displayed), how to do normalization, and how to
cheaply ensure that silent misdecodings could be repaired by some
future generation.
Okay, here's the best design yet:
I think that the unicode representation of the filename should continue to
be the unique key in the directory (which current Tahoe 1.3.0 requires).
So there should be a data structure with a required "filename" part, and a
required "failed_decode" flag, and an optional "alleged_encoding" part.
The "filename" part is the canonical value of the filename, but we
recognize that sometimes we can't actually get the *real* filename into
unicode form. If our attempt to interpret the filename into unicode
fails, then we set the "failed_decode" flag and put the
iso-8859-1-decoding of it into the "filename" part.
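To make that concrete, here's a rough Python 2 sketch of what such an
entry could look like (the class name and layout are just illustrative,
not the actual Tahoe-LAFS code):

    # Illustrative sketch only -- not the actual Tahoe-LAFS directory entry.
    class FilenameEntry(object):
        def __init__(self, filename, failed_decode, alleged_encoding=None):
            assert isinstance(filename, unicode)      # canonical (unique-key) form
            self.filename = filename                  # required
            self.failed_decode = failed_decode        # required: True if we fell back to iso-8859-1
            self.alleged_encoding = alleged_encoding  # optional: e.g. "utf-8", kept for future repair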
Here are the steps of reading a filename from the filesystem and adding
that filename into an existing Tahoe directory.
1. On Windows or Mac read the filename with the unicode APIs. Normalize
the string with filename = unicodedata.normalize('NFC', filename). Leave
out the "alleged_encoding" part. Set the "failed_decode" flag to False.
2. On Linux read the filename with the string APIs to get "bytes" and
call sys.getfilesystemencoding() to get "alleged_encoding". Then, call
bytes.decode(alleged_encoding, 'strict') to try to get a unicode object.
2.a. If this decoding succeeds then normalize the unicode filename with
filename = unicodedata.normalize('NFC', filename), store the resulting
filename and the alleged_encoding, and set the "failed_decode" to False.
(Storing the alleged_encoding is for the benefit of future generations,
who may discover that the decoding was actually wrong even though it
didn't raise an error, and who could then use the alleged_encoding to undo
the damage. For example Shawn Willden has a prototype tool which lets a
human examine the filename as decoded with different encodings and pick
the one that means something in a language they know.)
2.b. If this decoding fails, then we decode it again with
bytes.decode('iso-8859-1', 'strict'). Do not normalize it. Put the
resulting unicode object into the "filename" part, set the "failed_decode"
flag to True, and leave the "alleged_encoding" field out. This is a case
of mojibake:
http://en.wikipedia.org/wiki/Mojibake
The reason to go the mojibake route is that it preserves the information,
and in theory someone could later decode it and figure out the original
filename. This has actually happened at least once, as shown by the
photograph on that wikipedia page of the package which was delivered to
the Russian recipient. Mojibake! (footnote 1)
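Putting steps 1, 2, 2.a, and 2.b together, here's a rough Python 2
sketch of the decode direction (the helper name is made up; this is
just an illustration of the rules above, not actual Tahoe code):

    import sys
    import unicodedata

    def decode_filename(name):
        """Return (filename, failed_decode, alleged_encoding) per the rules above."""
        if isinstance(name, unicode):
            # Step 1: the Windows/Mac unicode APIs already gave us a unicode object.
            return (unicodedata.normalize('NFC', name), False, None)
        # Step 2: the Linux string APIs gave us bytes.
        alleged_encoding = sys.getfilesystemencoding()
        try:
            u = name.decode(alleged_encoding, 'strict')
        except UnicodeDecodeError:
            # Step 2.b: mojibake fallback -- iso-8859-1 never fails and loses no bytes.
            return (name.decode('iso-8859-1', 'strict'), True, None)
        # Step 2.a: record the encoding we believe was used, for future repair.
        return (unicodedata.normalize('NFC', u), False, alleged_encoding)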
How does that sound?
Phewf. Okay, now for the trip in the other direction. Suppose you have a
Tahoe filename object, and you need to create a file in the local
filesystem, because for example the user runs "tahoe cp -r $DIRCAP/subdir
.". There are four cases:
Case 1: You are using a unicode-safe filesystem such as Windows or Mac,
and you have a unicode object with failed_decode=False.
This is easy: use the Python unicode filesystem APIs to create the file
and be happy.
Case 2: You are using a unicode-safe filesystem and you have a unicode
object with failed_decode=True.
This is easy: use the Python unicode filesystem APIs to create the
file, passing the stored filename, which is the iso-8859-1 decoding of
the original bytes (mojibake!).
Case 3: You are using a plain-bytes filesystem such as Linux, and you
have a unicode object with failed_decode=False.
This is easy: use the Python unicode filesystem APIs to create the file.
Case 4: You are using a plain-bytes filesystem such as Linux, and you
have a unicode object with failed_decode=True.
Now we should *encode* the filename using iso-8859-1 to get a sequence of
bytes, and then write those bytes into the filesystem using the Python
string filesystem APIs. This is no worse than any alternative, and in
the case that the target filesystem has the same encoding as the
original filesystem (such as because it *is* the original filesystem,
or because it is owned by a friend of the owner of the original
filesystem), this will restore the file to its proper name.
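And here's the matching Python 2 sketch of the write-back direction,
covering the four cases (again the helper name and the crude platform
test are made up, just for illustration):

    import sys

    def name_for_local_filesystem(filename, failed_decode):
        """Given the stored unicode filename and its failed_decode flag,
        return the object to hand to the local filesystem APIs."""
        unicode_safe = sys.platform in ('win32', 'darwin')  # rough stand-in for "Windows or Mac"
        if unicode_safe:
            # Cases 1 and 2: pass the stored unicode name (for case 2 that *is* the mojibake).
            return filename
        if not failed_decode:
            # Case 3: pass the unicode name; Python encodes it with the filesystem encoding.
            return filename
        # Case 4: re-encode the mojibake to recover the original bytes exactly.
        return filename.encode('iso-8859-1')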
By the way, please see David Wheeler's recent proposal to start enforcing
filename constraints in Linux: http://lwn.net/Articles/325304 . His
proposals include changing Linux to require utf-8-encoding of all
filenames.
Regards,
Zooko
footnote 1: I know that Alberto Berti has previously argued on tahoe-dev
and on IRC that mojibake is less clean than the alternative of using
bytes.decode(alleged_encoding, 'replace'). The latter is lossy, but it
more clearly shows to the user that some or all of the filename couldn't
be decoded. Alberto and others had convinced me of the wisdom of this,
and I actually wrote this entire document specifying the 'decode-with-
replace' approach instead of the mojibake approach, but I eventually
realized that it wouldn't work. For one thing it was rather complicated
to decide how to handle multiple filenames that all decode-with-replace to
the same unicode name (you could imagine a whole directory full of files
all named '????' because the locale is wrong). But the real killer is
what to do when you are going to write the file back into the local
filesystem. If you write a decoded-with-replace name back, then a
round trip from Linux to Tahoe and back can mess up all of your
filenames. If you write the actual original bytes into the filesystem,
then this means that you might accidentally overwrite files with a "tahoe
cp", since "tahoe ls" just shows files with "???" in their names, but
"tahoe cp" writes files out with actual characters instead of question
marks.
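Here's a tiny Python 2 illustration of both problems, using a made-up
filename that is koi8-r on disk while the locale wrongly claims utf-8:

    # "файл" in koi8-r; pretend the locale wrongly claims utf-8.
    original = '\xc6\xc1\xca\xcc'

    # decode-with-replace is lossy: the user sees only replacement
    # characters, and encoding the result back does not reproduce the
    # original bytes.
    lossy = original.decode('utf-8', 'replace')
    print repr(lossy)                          # u'\ufffd\ufffd...' -- "????" to the user
    print lossy.encode('utf-8') == original    # False: a round trip mangles the name

    # The mojibake route preserves the bytes exactly.
    mojibake = original.decode('iso-8859-1', 'strict')
    print mojibake.encode('iso-8859-1') == original   # True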
--
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/534#comment:47>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid