[tahoe-dev] #534: "tahoe cp" command encoding issue
Brian Warner
warner at lothar.com
Sun Mar 1 23:40:33 PST 2009
Shawn Willden wrote:
> On Friday 27 February 2009 10:45:24 am Brian Warner wrote:
>> One limitation to keep in mind is that JSON cannot represent arbitrary
>> binary data without application-visible encoding, and that both the
>> webapi GET $dircap?t=json and the dirnode-format metadata dict use
>> JSON. So any "store the original bytes and let the reader sort it out"
>> approach must e.g. base32-encode those bytes on the way in and base32-
>> decode them on the way out, in the CLI tool on the user side of the
>> HTTP connection.
>
> You don't need to use base 32. simplejson can output arbitrary Unicode
> strings, it just spits out ASCII-unrepresentable characters in \uXXXX format.
> This is more convenient and often more compact than base 32.
I suspect I didn't express myself very clearly in my
soft-keyboard-induced brevity. It isn't a question of what simplejson
can do; it's a matter of what JSON itself can and cannot represent.
JSON cannot represent arbitrary bytestrings: the only string type in
JSON is a unicode object, a variable-length sequence of unicode
codepoints. (It happens to represent these in its standard ASCII-based
encoding using mostly-ASCII-plus-\uXXXX-when-necessary, but that's
irrelevant). If you have an unconstrained binary string, like the output
of a hash function, or an encryption function, or os.listdir (heh), then
you can't do js=simplejson.dumps(BINARYSTRING).
(This is a pity: I would find JSON much more useful if it could hold
bytestrings, since I often want to put hashes or encrypted values in
JSON containers. But I vaguely understand why they did it that way.)
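
A minimal sketch of this point, assuming Python 3 and the standard-library json module (whose dumps/loads interface mirrors simplejson's). The message predates Python 3, so the failure mode differs from what simplejson on Python 2 would have shown: here a bytes object is rejected outright with TypeError. A unicode string, by contrast, serializes fine, with non-ASCII code points escaped as \uXXXX.

```python
import hashlib
import json

# 32 unconstrained bytes, e.g. the output of a hash function.
digest = hashlib.sha256(b"example").digest()

try:
    json.dumps(digest)
    accepted = True
except TypeError:
    # Python 3's json module has no representation for bytes.
    accepted = False

# A unicode string is representable; by default non-ASCII code
# points are written as \uXXXX escapes in the ASCII output.
escaped = json.dumps("caf\u00e9")
restored = json.loads(escaped)
```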
Instead, you have to transform it somehow. One way to transform it would
be to use base32 or base64, as in
js=simplejson.dumps(base64.b64encode(BINARYSTRING)). Another (crazy) way
to transform it would be to pretend that your random binary string was
really supposed to be a unicode string that's been encoded into latin-1,
and use js=simplejson.dumps(BINARYSTRING.decode("latin-1")).
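
Both transforms can be sketched as follows, again assuming Python 3 and the standard json module in place of simplejson. In Python 3, base64.b64encode returns bytes, so an extra .decode("ascii") is needed that the one-liner above (written for Python 2) does not show. The sample input covers every possible byte value.

```python
import base64
import json

raw = bytes(range(256))  # every byte value once

# Option 1: base64 on the way in, base64-decode on the way out.
js_b64 = json.dumps(base64.b64encode(raw).decode("ascii"))
back_b64 = base64.b64decode(json.loads(js_b64))

# Option 2: pretend the bytes are latin-1 text. Every byte maps to
# exactly one code point in U+0000..U+00FF, so this is reversible.
js_latin1 = json.dumps(raw.decode("latin-1"))
back_latin1 = json.loads(js_latin1).encode("latin-1")
```

Either way the reader must know which transform was applied; nothing in the JSON itself says so.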
But in any case, you have to transform it one way on the inbound side,
and the reverse way on the outbound side. Your DTD (if you had such a
thing, and if I'm using the term correctly), instead of saying "this
field is a sha256 hash of blah blah", must declare "this field is a hash
blah blah which has been encoded in such-and-such a way, be sure to
encode on the way in and decode it on the way out". The need for
encoding is application-visible.
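
One way to picture "application-visible" is a pair of helpers that both sides of the exchange must agree on. This is only an illustration: the field name "sha256_b64" and the helper names to_wire and from_wire are hypothetical, not anything defined by Tahoe.

```python
import base64
import hashlib
import json


def to_wire(data: bytes) -> str:
    """Writer side: hash the data, then encode the digest for JSON."""
    digest = hashlib.sha256(data).digest()
    # Hypothetical field name; the "_b64" suffix documents the encoding.
    record = {"sha256_b64": base64.b64encode(digest).decode("ascii")}
    return json.dumps(record)


def from_wire(js: str) -> bytes:
    """Reader side: must apply the matching decode to recover the digest."""
    record = json.loads(js)
    return base64.b64decode(record["sha256_b64"])
```

A reader that skips the decode step gets a 44-character ASCII string rather than the 32-byte digest, which is exactly why the encoding has to be part of the field's definition.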
This limitation of JSON hits us in two places. The first is the existing
program-oriented Tahoe webapi to retrieve the contents of a directory,
abbreviated as "GET $dirnode?t=json". This API returns the child names
(filenames or subdirectory names) as keys of a JSON dictionary, and puts
information about each child in the values of that dictionary. So the
child names are well-established as being a regular JSON (unicode)
string: changing the definition of the $dirnode?t=json format to say
"decode the key names with such-and-such before using them" would be a
big change. There's no way to "just store the binary string and let the
reader sort it out" if we're talking about storing it as the keys of the
t=json dictionary.
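
To illustrate why this would be a big change, here is a sketch using a deliberately simplified listing (the real t=json response carries more structure than shown, and the sample filename is made up). It assumes Python 3's json module, where a bytes key is rejected with TypeError, and uses base32 as the hypothetical encoding.

```python
import base64
import json

raw_name = b"caf\xe9.txt"  # latin-1 bytes; not valid UTF-8

# Raw bytes cannot be used as a JSON object key.
try:
    json.dumps({"children": {raw_name: ["filenode", {}]}})
    bytes_key_ok = True
except TypeError:
    bytes_key_ok = False

# Encoding the name makes it storable, but changes what readers see.
encoded_name = base64.b32encode(raw_name).decode("ascii")
listing = json.dumps({"children": {encoded_name: ["filenode", {}]}})

# An existing client that treats keys as child names gets the
# base32 text; only an updated client knows to decode it.
names_seen = list(json.loads(listing)["children"])
names_decoded = [base64.b32decode(n) for n in names_seen]
```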
The other place this JSON limitation shows up is in the dirnode edge
metadata, which is defined to be a dictionary of whatever you like as
long as it can be encoded into JSON. The metadata dictionary is in fact
encoded into JSON and then stored as a netstring in the dirnode data
structure (which is then written into a mutable file). That means that
any unconstrained binary strings that you might want to stuff into
the metadata must also be encoded to make it unicode-safe before you
give it to Tahoe. The webapi also imposes this limitation, because the
metadata dictionary is transferred as JSON to and from the client
(retrieved with GET t=json, and set with POST t=set-children, IIRC).
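
The JSON-then-netstring step can be sketched like this. It is not Tahoe's actual dirnode code: the helper names and the metadata keys ("ctime", "orig_name_b64") are hypothetical, and only the general netstring framing (decimal length, colon, payload, comma) is assumed.

```python
import base64
import json


def netstring(payload: bytes) -> bytes:
    """Frame payload as <decimal length>:<payload>,"""
    return str(len(payload)).encode("ascii") + b":" + payload + b","


def parse_netstring(data: bytes) -> bytes:
    """Extract the payload from a single netstring."""
    length, _, rest = data.partition(b":")
    n = int(length.decode("ascii"))
    if rest[n:n + 1] != b",":
        raise ValueError("malformed netstring")
    return rest[:n]


# Binary values must be made unicode-safe *before* they enter the dict.
metadata = {
    "ctime": 1235979633.0,
    "orig_name_b64": base64.b64encode(b"caf\xe9.txt").decode("ascii"),
}
stored = netstring(json.dumps(metadata).encode("utf-8"))

# Reading it back: unframe, parse JSON, then undo the base64 step.
loaded = json.loads(parse_netstring(stored).decode("utf-8"))
orig_name = base64.b64decode(loaded["orig_name_b64"])
```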
So, my point was simply that any solution we come up with that says
"store the original uninterpreted bytestring and let the reader figure
it out" must use some sort of encoding (probably one that has nothing to
do with unicode, like base64) to get this bytestring into a form that
Tahoe's existing dirnode format and webapi can tolerate.
cheers,
-Brian