#994 new enhancement

support precompressed files

Reported by: davidsarah
Owned by: somebody
Priority: major
Milestone: undecided
Component: code
Version: 1.6.0
Keywords: compression space-efficiency performance bandwidth security integrity backward-compatibility
Cc:
Launchpad Bug:

Description

A "precompressed file" is a file where the plaintext has been compressed using an algorithm supported by HTTP (gzip or deflate -- we'd probably support only one). When the file is served via the webapi, it is served in compressed form with the Content-Encoding HTTP header set appropriately. The Content-Encoding can also be set in a PUT or POST request to upload a precompressed file.

Storage servers would be completely ignorant of precompressed files. The CLI, SFTP and FTP frontends would have to decompress them. The gateway would also have to decompress if it receives an HTTP request that does not have an Accept-Encoding header allowing the compression algorithm used for that file.
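In rough terms (a sketch only, not Tahoe code), the choice the gateway would make looks like this; the decompressing frontends would always take the second branch:

{{{#!python
import gzip

def serve(stored_bytes, file_encoding, accept_encoding):
    # Naive membership test; real negotiation must honor qvalues
    # (see comment:4 below).
    if file_encoding in (accept_encoding or ""):
        return stored_bytes, {"Content-Encoding": file_encoding}
    if file_encoding == "gzip":
        return gzip.decompress(stored_bytes), {}
    raise ValueError("unsupported encoding: %r" % file_encoding)
}}}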

This would provide a performance improvement as long as the HTTP clients have sufficient CPU capacity that the time taken to decompress is outweighed by the savings in bandwidth. CPU-constrained clients (connecting to a less CPU-constrained gateway) are not a problem, because they can simply omit Accept-Encoding.

This would rely on HTTP clients implementing decompression correctly; if they don't, then there is a potential loss of integrity, and the possibility of attacks against the client from maliciously constructed compressed data. It is possible to protect against "decompression bombs" if that is required.
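For example, a bounded decompression along these lines would defuse such bombs (assuming gzip framing; the 100 MiB ceiling is an arbitrary policy choice, not anything specified here):

{{{#!python
import zlib

MAX_OUTPUT = 100 * 1024 * 1024  # assumed policy limit (100 MiB)

def bounded_decompress(data, limit=MAX_OUTPUT):
    d = zlib.decompressobj(wbits=31)  # wbits=31 selects gzip framing
    out = d.decompress(data, limit + 1)  # max_length caps the output
    if len(out) > limit:
        raise ValueError("output exceeds %d bytes; possible bomb" % limit)
    return out
}}}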

Change History (5)

comment:1 follow-up: Changed at 2010-03-12T06:44:45Z by davidsarah

  • Keywords backward-compatibility added

Note that, as pointed out in ticket:992#comment:2, the Content-Encoding must be a property of a file, not of the metadata stored in directory entries. (I think there are ways to compatibly store this in the UEB.)

comment:2 in reply to: ↑ 1 ; follow-up: Changed at 2010-03-12T06:47:17Z by davidsarah

Replying to davidsarah:

... the Content-Encoding must be a property of a file, not of the metadata stored in directory entries. (I think there are ways to compatibly store this in the UEB.)

Actually, we want old clients to fail to download these files (rather than to misinterpret the compressed data as uncompressed).

comment:3 in reply to: ↑ 2 ; follow-up: Changed at 2010-03-12T17:58:03Z by jsgf

Replying to davidsarah:

Actually, we want old clients to fail to download these files (rather than to misinterpret the compressed data as uncompressed).

That seems like a pretty big semantic change for Tahoe. Thus far it is more or less a transparent container for arrays of bytes, with a bit of advisory metadata sprinkled on top. Changing that so that some byte arrays have an innate property which prevents some clients from being able to download them is a big change.

Given that the widespread convention is that content type and encoding are stored (to some extent) in the filename itself as extensions, making these properties more fully expanded in the directory entries has an internal consistency.

As I mention in ticket:992#comment:3, the same bits can be represented as either "foo.txt" "text/plain" "encoding: gzip" or "foo.txt.gz" "application/gzip". The former could be misinterpreted by an old client which fails to pay attention to content-encoding.
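For what it's worth, Python's stdlib mimetypes module already encodes exactly this equivalence in its extension tables:

{{{#!python
import mimetypes
mimetypes.guess_type("foo.txt")     # -> ('text/plain', None)
mimetypes.guess_type("foo.txt.gz")  # -> ('text/plain', 'gzip')
}}}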

But I don't think this is a huge problem; I suspect most webapi clients are already using a general-purpose HTTP library, which will already have to deal with content encoding. We'd need to test that the CLI ends up doing the right thing, of course. I don't know what would happen to apps directly using the python APIs.

comment:4 in reply to: ↑ 3 Changed at 2011-10-11T01:46:01Z by davidsarah

  • Keywords compression space-efficiency bandwidth security added; compress removed

Replying to jsgf:

Replying to davidsarah:

Actually, we want old clients to fail to download these files (rather than to misinterpret the compressed data as uncompressed).

That seems like a pretty big semantic change for Tahoe. Thus far it is more or less a transparent container for arrays of bytes, with a bit of advisory metadata sprinkled on top. Changing that so that some byte arrays have an innate property which prevents some clients from being able to download them is a big change.

The effect of making the file data (as an uncompressed sequence of bytes) dependent on metadata that is detached from the file URI would be an even bigger semantic change. The file URI has to unambiguously determine the file data.

One way of achieving that would be to put the bit that determines whether a file has been stored compressed in the URI, for example "UCHK:gz:..." could be the gzip-decompressed version of "CHK:...".
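A sketch of how a client might interpret such a cap (the "UCHK:gz:" syntax is hypothetical, and fetch() stands in for whatever retrieves the raw stored bytes for a plain CHK cap):

{{{#!python
import gzip

def resolve(cap, fetch):
    # fetch(cap) -> raw stored bytes for a plain CHK cap (assumed helper)
    if cap.startswith("UCHK:gz:"):
        inner = "CHK:" + cap[len("UCHK:gz:"):]
        return gzip.decompress(fetch(inner))
    return fetch(cap)
}}}

An old client that doesn't recognize the UCHK prefix fails outright instead of silently returning the compressed bytes as the file contents, which is the failure mode asked for in comment:2.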

As I mention in ticket:992#comment:3, the same bits can be represented as either "foo.txt" "text/plain" "encoding: gzip" or "foo.txt.gz" "application/gzip". The former could be misinterpreted by an old client which fails to pay attention to content-encoding.

But I don't think this is a huge problem; I suspect most webapi clients are already using a general-purpose HTTP library, which will already have to deal with content encoding.

We can't send Content-Encoding: gzip if the client hasn't sent an Accept-Encoding that includes gzip; that would obviously be incorrect and not compliant with RFC 2616. We can't do much about clients that are sometimes unable to correctly decompress encodings they advertise they accept, such as Netscape 4.x (well, we could blacklist such clients by User-Agent, but yuck).
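For reference, correct negotiation per RFC 2616 section 14.3 amounts to something like the following sketch (deliberately simplified: an explicit "gzip;q=0" should really override a "*" wildcard):

{{{#!python
def accepts_gzip(accept_encoding):
    if not accept_encoding:
        return False  # no header: serve uncompressed, per this ticket
    for item in accept_encoding.split(","):
        parts = [p.strip() for p in item.split(";")]
        q = 1.0
        for param in parts[1:]:
            if param.lower().startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        if parts[0].lower() in ("gzip", "*") and q > 0:
            return True
    return False
}}}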

Given that the widespread convention is that content type and encoding are stored (to some extent) in the filename itself as extensions, making these properties more fully expanded in the directory entries has an internal consistency.

There's no usable consistency in file extensions.

comment:5 Changed at 2011-10-11T01:53:09Z by davidsarah

#1354 is about supporting compression at the storage layer.
