[tahoe-dev] #534: "tahoe cp" command encoding issue

Brian Warner warner at lothar.com
Sun Mar 1 23:40:33 PST 2009


Shawn Willden wrote:
> On Friday 27 February 2009 10:45:24 am Brian Warner wrote:
>> One limitation to keep in mind is that JSON cannot represent arbitrary
>> binary data without application-visible encoding, and that both the
>> webapi GET $dircap?t=json and the dirnode-format metadata dict use
>> JSON. So any "store the original bytes and let the reader sort it out"
>> approach must e.g. base32-encode those bytes on the way in and base32-
>> decode them on the way out, in the CLI tool on the user side of the
>> HTTP connection.
> 
> You don't need to use base 32.  simplejson can output arbitrary Unicode 
> strings, it just spits out ASCII-unrepresentable characters in \uXXXX format.  
> This is more convenient and often more compact than base 32.

I suspect I didn't express myself very clearly in my 
soft-keyboard-induced brevity. It isn't a question of what simplejson 
can do, it's a matter of what JSON itself can and cannot represent.

JSON cannot represent arbitrary bytestrings: the only string type in 
JSON is a unicode object, a variable-length sequence of unicode 
codepoints. (It happens to represent these in its standard ASCII-based 
encoding using mostly-ASCII-plus-\uXXXX-when-necessary, but that's 
irrelevant). If you have an unconstrained binary string, like the output 
of a hash function, or an encryption function, or os.listdir (heh), then 
you can't do js=simplejson.dumps(BINARYSTRING).
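
A minimal sketch of that failure, assuming Python 3 and the standard-library 
json module in place of simplejson (the byte values and the try_dump helper 
are made up for illustration):

```python
import json

# Stand-in for unconstrained binary data, e.g. a hash or a raw filename.
binary_string = bytes([0x00, 0xff, 0xfe, 0x80])

def try_dump(value):
    """Return the JSON text for value, or the name of the exception raised."""
    try:
        return json.dumps(value)
    except (TypeError, UnicodeDecodeError) as e:
        return type(e).__name__

text_result = try_dump("caf\u00e9")     # a unicode string serializes fine
bytes_result = try_dump(binary_string)  # a bytestring is rejected
```

The unicode string comes out as ASCII text with a \uXXXX escape, while the 
bytestring never makes it into JSON at all.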

(this is a pity.. I would find JSON much more useful if it could hold 
bytestrings, since I often want to put hashes or encrypted values in 
JSON containers. But I vaguely understand why they did it that way)

Instead, you have to transform it somehow. One way to transform it would 
be to use base32 or base64, as in 
js=simplejson.dumps(base64.b64encode(BINARYSTRING)). Another (crazy) way 
to transform it would be to pretend that your random binary string was 
really supposed to be a unicode string that's been encoded into latin-1, 
and use js=simplejson.dumps(BINARYSTRING.decode("latin-1")).
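
Both transforms can be sketched roughly as follows, again assuming Python 3 
and the standard-library json module rather than simplejson. The variable 
names are illustrative only:

```python
import base64
import json

binary_string = bytes(range(256))  # every possible byte value

# Transform 1: base64. The encoded form is plain ASCII text.
js_b64 = json.dumps(base64.b64encode(binary_string).decode("ascii"))
back_b64 = base64.b64decode(json.loads(js_b64))

# Transform 2: pretend the bytes are latin-1 text, so that each byte
# maps to exactly one unicode code point in the range 0..255.
js_latin1 = json.dumps(binary_string.decode("latin-1"))
back_latin1 = json.loads(js_latin1).encode("latin-1")
```

Either way the reader has to know which transform was applied in order to 
undo it. With the default ASCII-only output, the latin-1 variant spends six 
characters (\u00XX) on every control byte and every byte above 0x7f.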

But in any case, you have to transform it one way on the inbound side, 
and the reverse way on the outbound side. Your DTD (if you had such a 
thing, and if I'm using the term correctly), instead of saying "this 
field is a sha256 hash of blah blah", must declare "this field is a hash 
blah blah which has been encoded in such-and-such a way, be sure to 
encode on the way in and decode it on the way out". The need for 
encoding is application-visible.

This limitation of JSON hits us in two places. The first is the existing 
program-oriented Tahoe webapi to retrieve the contents of a directory, 
abbreviated as "GET $dirnode?t=json". This API returns the child names 
(filenames or subdirectory names) as keys of a JSON dictionary, and puts 
information about each child in the values of that dictionary. So the 
child names are well-established as being a regular JSON (unicode) 
string: changing the definition of the $dirnode?t=json format to say 
"decode the key names with such-and-such before using them" would be a 
big change. There's no way to "just store the binary string and let the 
reader sort it out" if we're talking about storing it as the keys of the 
t=json dictionary.
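
To illustrate why raw bytes cannot serve as those keys, here is a rough 
sketch in Python 3 with the standard-library json module. The flat 
children dictionary is a simplified stand-in for the real t=json response, 
and the filename is a hypothetical example:

```python
import json

# Hypothetical raw filename from the OS: "café.txt" encoded in latin-1,
# which is not valid UTF-8.
raw_name = b"caf\xe9.txt"

def dump_error(obj):
    """Return the name of the exception json.dumps raises for obj, or None."""
    try:
        json.dumps(obj)
        return None
    except Exception as e:
        return type(e).__name__

# Raw bytes cannot be used as a JSON object key.
bytes_key_error = dump_error({raw_name: "child info"})

# Nor can they be decoded to unicode without knowing the right encoding.
try:
    raw_name.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Once the name is a real unicode string, it works as a key.
children = {raw_name.decode("latin-1"): "child info"}
js = json.dumps(children)
```

The last step only works because the writer guessed the correct encoding, 
which is exactly the information a raw bytestring does not carry.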

The other place this JSON limitation shows up is in the dirnode edge 
metadata, which is defined to be a dictionary of whatever you like as 
long as it can be encoded into JSON. The metadata dictionary is in fact 
encoded into JSON and then stored as a netstring in the dirnode data 
structure (which is then written into a mutable file). That means that 
any unconstrained binary strings that you might want to stuff into 
the metadata must also be encoded to make it unicode-safe before you 
give it to Tahoe. The webapi also imposes this limitation, because the 
metadata dictionary is transferred as JSON to and from the client 
(retrieved with GET t=json, and set with POST t=set-children, IIRC).
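
A minimal sketch of that encode-on-the-way-in, decode-on-the-way-out 
discipline for a metadata value, assuming Python 3. The key name 
"sha256_b32" and the two helpers are hypothetical, and the standard-library 
base32 variant used here (uppercase, padded) is not necessarily the one 
Tahoe uses elsewhere:

```python
import base64
import hashlib
import json

def encode_field(raw: bytes) -> str:
    """Turn arbitrary bytes into an ASCII string that JSON can hold."""
    return base64.b32encode(raw).decode("ascii")

def decode_field(text: str) -> bytes:
    """Reverse encode_field on the reading side."""
    return base64.b32decode(text.encode("ascii"))

# An unconstrained binary value: the raw 32-byte digest of some content.
digest = hashlib.sha256(b"example file contents").digest()

# Inbound: encode before the value goes into the metadata dictionary.
metadata = {"sha256_b32": encode_field(digest)}  # hypothetical key name
wire = json.dumps(metadata)

# Outbound: decode after the dictionary comes back as JSON.
recovered = decode_field(json.loads(wire)["sha256_b32"])
```

Both sides must agree on the encoding out of band; nothing in the JSON 
itself says that the field is base32 rather than ordinary text.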

So, my point was simply that any solution we come up with that says 
"store the original uninterpreted bytestring and let the reader figure 
it out" must use some sort of encoding (probably one that has nothing to 
do with unicode, like base64) to get this bytestring into a form that 
Tahoe's existing dirnode format and webapi can tolerate.

cheers,
  -Brian


