[tahoe-dev] [tahoe-lafs] #534: "tahoe cp" command encoding issue
tahoe-lafs
trac at allmydata.org
Tue Apr 28 09:08:48 PDT 2009
#534: "tahoe cp" command encoding issue
-----------------------------------+----------------------------------------
Reporter: francois | Owner: francois
Type: defect | Status: assigned
Priority: minor | Milestone: 1.5.0
Component: code-frontend-cli | Version: 1.2.0
Resolution: | Keywords: cp encoding unicode filename utf-8
Launchpad_bug: |
-----------------------------------+----------------------------------------
Comment(by zooko):
Hm, so there is this idea by Markus Kuhn called {{{utf-8b}}}. {{{utf-8b}}}
decoding is just like {{{utf-8}}} decoding, except that if the input
string turns out not to be valid {{{utf-8}}} encoding, then {{{utf-8b}}}
stores the invalid bytes of the string as invalid code points in the
resulting unicode object. This means that
{{{utf8b_encode(utf8b_decode(x)) == x}}} for any {{{x}}} (not just for
{{{x}}}'s which are {{{utf-8}}}-encodings of a unicode string).
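
A minimal sketch of that round-trip property, for illustration only. It assumes Python 3, where the {{{surrogateescape}}} error handler from PEP 383 gives utf-8b-style behavior: each undecodable byte 0xNN becomes the lone surrogate U+DCNN. The names {{{utf8b_decode}}} and {{{utf8b_encode}}} are taken from the expression above and are not functions that Python provides.

```python
def utf8b_decode(raw: bytes) -> str:
    # Assumption: surrogateescape stands in for a dedicated utf-8b codec.
    return raw.decode("utf-8", "surrogateescape")


def utf8b_encode(text: str) -> bytes:
    # Turns each escaped surrogate back into the original byte.
    return text.encode("utf-8", "surrogateescape")


valid = "café".encode("utf-8")   # well-formed UTF-8
invalid = b"caf\xe9"             # Latin-1 bytes, not valid UTF-8

# The invalid byte is kept as an invalid code point (a lone surrogate).
assert utf8b_decode(invalid) == "caf\udce9"

# The round trip holds for valid and invalid input alike.
for raw in (valid, invalid, b"\xff\xfe\x00abc"):
    assert utf8b_encode(utf8b_decode(raw)) == raw
```

Note that a string containing lone surrogates cannot be encoded with plain strict {{{utf-8}}}, so code that prints or stores such names has to use the same error handler on the way out.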
I wonder if {{{utf-8b}}} provides a simpler/cleaner way to accomplish the
above. It would look like this. Take the design written in
http://allmydata.org/trac/tahoe/ticket/534#comment:47 and change step 2 to
be like this:
2. On Linux read the filename with the string APIs to get "bytes" and call
{{{sys.getfilesystemencoding()}}} to get "alleged_encoding". If the
alleged encoding is {{{ascii}}} or {{{utf-8}}}, or if it is absent or invalid
or denotes a codec that we don't have an implementation for, then set
{{{alleged_encoding = 'utf-8b'}}} instead. Then, call
{{{bytes.decode(alleged_encoding, 'strict')}}} to try to get a unicode
object.
2.a. If this decoding succeeds then normalize the unicode filename with
{{{filename = unicodedata.normalize('NFC', filename)}}}, store the
resulting filename and if the encoding that was used was ''not''
{{{utf-8b}}} then store the alleged_encoding. (If the encoding that was used was
{{{utf-8b}}}, then don't store the alleged_encoding -- {{{utf-8b}}} is the
default and we can save space by omitting it.)
2.b. If this decoding fails, then we decode it with
{{{bytes.decode('utf-8b')}}}. Do not normalize it. Put the resulting unicode object into the
"filename" part. Do not store an "alleged_encoding".
Using {{{utf-8b}}} to store bytes from a failed decoding instead of
{{{iso-8859-1}}} means that if the name or part of the name is actually
{{{ascii}}} or {{{utf-8}}}, then it will be (at least partially) legible.
It also means that we can omit the "failed_decode" flag, because it makes
no difference whether the filename was originally alleged to be in
{{{koi8-r}}}, but failed to decode using the {{{koi8-r}}} codec, and so
was instead decoded using {{{utf-8b}}}, or whether the filename was
originally alleged to be in {{{ascii}}} or {{{utf-8}}}, and was decoded
using {{{utf-8b}}}. (Right? I think that's right.)
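
The modified step 2 could be sketched roughly as follows. This is an illustration under stated assumptions, not the Tahoe implementation: it assumes Python 3 and uses the PEP 383 {{{surrogateescape}}} handler as a stand-in for a real {{{utf-8b}}} codec. The function {{{decode_filename}}}, its return shape and the {{{UTF8B}}} constant are hypothetical names introduced here.

```python
import codecs
import sys
import unicodedata

UTF8B = "utf-8b"  # the ticket's label; NOT a codec name Python registers


def utf8b_decode(raw):
    # Assumption: PEP 383 surrogateescape stands in for a real utf-8b codec.
    return raw.decode("utf-8", "surrogateescape")


def decode_filename(raw, alleged_encoding=None):
    """Return (filename, stored_encoding); stored_encoding is None for utf-8b."""
    if alleged_encoding is None:
        alleged_encoding = sys.getfilesystemencoding()
    try:
        encoding = codecs.lookup(alleged_encoding).name
    except (LookupError, TypeError, ValueError):
        encoding = UTF8B  # absent, invalid, or no codec available
    if encoding in ("ascii", "utf-8"):
        encoding = UTF8B

    if encoding == UTF8B:
        # Step 2.a with the default codec: cannot fail, nothing extra stored.
        return unicodedata.normalize("NFC", utf8b_decode(raw)), None
    try:
        name = raw.decode(encoding, "strict")
    except UnicodeDecodeError:
        # Step 2.b: fall back to utf-8b, no normalization, no stored encoding.
        return utf8b_decode(raw), None
    # Step 2.a with a non-default codec: normalize and remember the encoding.
    return unicodedata.normalize("NFC", name), encoding


# A name that really is koi8-r keeps its alleged encoding.
koi8 = "тест".encode("koi8-r")
assert decode_filename(koi8, "koi8-r") == ("тест", "koi8-r")

# A utf-8 name is NFC-normalized and stored without an encoding.
assert decode_filename(b"cafe\xcc\x81", "utf-8") == ("caf\u00e9", None)

# A failed decode stays legible apart from the one bad byte.
assert decode_filename(b"report\x81.txt", "cp1252") == ("report\udc81.txt", None)
```

In this sketch the failed {{{cp1252}}} decode and the plain {{{utf-8}}} case both return no stored encoding, which is why a separate "failed_decode" flag carries no extra information here.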
An implementation, including a Python codec module, by Eric S. Tiedemann
(1966-2008; I miss him):
http://hyperreal.org/~est/utf-8b
An implementation for GNU iconv by Ben Sittler:
http://bsittler.livejournal.com/10381.html
A PEP by Martin v. Löwis to automatically use {{{utf-8b}}} whenever you
would otherwise use {{{utf-8}}}:
http://www.python.org/dev/peps/pep-0383
--
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/534#comment:58>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid