[tahoe-dev] future Tahoe ideas

Brian Warner warner-tahoe at allmydata.com
Tue Mar 24 12:11:04 PDT 2009


I just wanted to do a quick brain-dump on what my personal plans are
for Tahoe over the next few months: just to publish the projects that
are sitting on my longer-term todo list. Many of these have tickets;
I'll add tickets for the ones that don't.

 * overhaul immutable upload (no format or protocol changes):
  * parallelize peer-selection work, instead of spending one RTT per
    server
  * handle existing shares better: the current uploader usually ends up
    placing duplicate shares and doubling up shares on the same server.
    Any evidence of a previous upload should trigger an immediate
    search for all existing shares, and new-share placement should be
    smarter.
  * pipeline segments, maybe just two deep, to improve performance
  * merge small write requests, in particular the various hashes. Small
    files should use a single remote write request.

 * overhaul immutable download (no format or protocol changes):
  * download/populate minimal hash chain instead of whole hash tree (we
    can't currently download 10GB files because our alacrity, the delay
    before the first byte is delivered, is longer than a lot of HTTP
    clients' timeouts)
  * rewrite in terms of random-access, Producer/Consumer, prefetch one
    segment. Remove big Deferred chain, replace with state machine.
  * merge small read requests. Small files should use a single remote
    read request.

 * overhaul immutable repairer: improve share-placement, add
   share-rebalancing, consider adding peer-to-peer share transfer
   (requires protocol additions)
  * consider changing immutable lease-renewal to expire duplicate
    shares
  * new uploads should be a special case of repair, in which there are
    no existing shares and the plaintext is available locally

 * new immutable upload format/protocol:
  * separate "server-selection index" from "storage index", use UEB
    hash as storage index, enable improved server-side local share
    verification (#654)
   * add slow-crawler to verify all local shares on a periodic basis
  * add "upload identifier" to storage-server protocol, allow resumed
    uploads, allow storage index to be decided at the end, enable
    streaming uploads
  * add copy-share-from-peer method to storage-server protocol (for
    share rebalancing that doesn't use the controller's bandwidth)
  * new human-readable cap format with less baggage, smaller UEB hash
  * add binary cap format

 * new mutable format/protocol:
  * make it easier for servers to refuse mutable shares, to honor the
    'read-only' setting better
  * DSA-based mutable files, traversal caps, pubkey-as-storage-index
  * new human-readable cap format: smaller. new binary cap format
  * consider storage-server validity check to replace write-enabler
    (accept write iff new share would be valid and seqnum is higher
    than before): measure CPU cost for server.

 * MDMF (medium-sized mutable files: modify one segment without
   touching other segments). LDMF (large-sized: revlog based,
   append-only backend shares, efficient insert/delete span, revision
   graph, readcaps for individual revisions, writecap for the whole
   thing).

 * new directory format:
  * traversal caps
  * faster to parse: encrypt each column separately. Maybe one
    unencrypted column with [(offset,size), ..] references to the other
    columns, then one column each for names, traversalcaps, readcaps,
    writecaps, metadata. Use binary caps.
  * new human-readable cap format: smaller. new binary cap format.

 * immutable dirnodes: may contain only filecaps and immutable-dircaps.
   Put stats information in the dirnode. Add tool to create "virtual
   CDs".

 * enhance "tahoe backup" CLI tool to share old directories. Create
   immutable dirnodes when available.

 * enhance "tahoe cp" to use backupdb. Use backupdb in tahoe-to-local
   copies. Enable cron-driven "tahoe cp -r" to do minimal work.

 * Accounting. Add management tools, status web pages.
  * use ECDSA pubkeys to manage leases instead of shared secrets (each
    lease is stored with a verifying key, lease renewal is based upon
    signed messages instead of matching renewal-secrets)
  * use Accounting privileges to manage leases instead of pubkeys (the
    right to control a label includes the right to manage leases with
    that label. Individual lease records could still have a pubkey for
    delegation to a renewal agent).

 * Lease-tracking cache database. One-shot share crawler to populate or
   regenerate DB. Once populated, use it for DYHB queries instead of
   filesystem access.

 * non-Foolscap-based storage server. Consider one embedded in HTTP,
   try to reuse connections. Make life easier for implementors.
   Removing the write-enabler and renew-secrets would remove the need
   for transport-layer confidentiality. Consider a
   cert-chain+signed-request message, messageid for responses, replay
   protection.
  * imagine an Apache mod_tahoe_server, mod_tahoe_client
  * imagine a Firefox extension

 * Reorganize node classes, improve Service tree layout. Node should be
   the parent, with StorageServer and Client as children. Replace
   "tahoe create-client" with "tahoe create client" and "tahoe create
   client,storage,helper" (the client is just one thing you might
   create, the construction command takes a list).
  * create plugin architecture. Add an App Store (hey, everybody's
    doing it)

 * local file access through webapi, based upon a ~/.tahoe/private/
   -based secret. Enable JS tool to drag+drop between local disk and
   tahoe FS. Enable web-based configuration of backup process. Add CLI
   command to launch the JS tool with the secret from .tahoe/private,
   encourage folks to bookmark it. Consider a "login page" to bounce
   user to that bookmark.

 * improve automated performance testing: compare "tahoe cp -r" against
   tar+netcat, scp -r, rsync.

 * add content-identifying metadata: hashes, rsync signatures

 * improve "tahoe cp" to capture local metadata, like "tahoe backup"
   does. Tolerate/capture symlinks.

 * webapi to retrieve subtree as .tar/.tar.gz/.zip files. Represent
   cycles as symlinks. Populate owner/mode from "tahoe cp/backup"
   metadata.

 * add symlinks: mutable redirection file, special cap format, webapi
   GET automatically follows. And/or establish convention for "sharing
   slots", which an uploader can give to a downloader before the upload
   is finished, to indicate upload progress, eventual filecap, and
   subsequent revocation.

 * webapi for verifycaps: GET returns ciphertext, PUT accepts
   ciphertext. Enable checking+repair. Imagine JS code to perform
   decryption. webapi for traversalcaps.

 * improved Introducer. Use Introducer for disk-watcher,
   stats-gatherer, helpers. Distributed Introducer, gossip-based
   dissemination.

 * consider improving storage-server-side logging: timestamp, client
   ipaddr, remote tubid, storageop, size.

 * maintenance daemon: configure with rootcaps (or .com DB to pull
   rootcaps from), it does: manifest, check, repair, lease-renewal,
   stats. Give it a squad of worker nodes and configure relative
   priorities. Web status pages with progress, expected cycle times,
   queue depths, account stats, etc.
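
To make the "minimal hash chain" item in the download overhaul a bit more concrete, here is a rough sketch of the idea: to validate one segment you only need the sibling of each node on the path from that leaf up to the root, not the whole tree. Everything here is an assumption for illustration: the function names are made up, the tree is laid out as a flat list with a power-of-two leaf count, and plain SHA-256 stands in for whatever hash the real share format uses.

```python
import hashlib

def build_tree(leaves):
    """Return a flat list: node 0 is the root, children of i are 2i+1 and 2i+2."""
    n = len(leaves)
    assert n and n & (n - 1) == 0, "sketch assumes a power-of-two leaf count"
    tree = [None] * (n - 1) + [hashlib.sha256(leaf).digest() for leaf in leaves]
    for i in range(n - 2, -1, -1):
        tree[i] = hashlib.sha256(tree[2 * i + 1] + tree[2 * i + 2]).digest()
    return tree

def needed_hashes(num_leaves, leafnum):
    """Node indices required to validate one leaf: the sibling at each level."""
    i = num_leaves - 1 + leafnum
    needed = []
    while i > 0:
        needed.append(i + 1 if i % 2 == 1 else i - 1)
        i = (i - 1) // 2
    return needed

def verify_leaf(root, num_leaves, leafnum, leaf, chain):
    """chain maps node index -> hash for the indices from needed_hashes()."""
    i = num_leaves - 1 + leafnum
    h = hashlib.sha256(leaf).digest()
    while i > 0:
        if i % 2 == 1:   # we are a left child, sibling is on the right
            h = hashlib.sha256(h + chain[i + 1]).digest()
        else:            # we are a right child, sibling is on the left
            h = hashlib.sha256(chain[i - 1] + h).digest()
        i = (i - 1) // 2
    return h == root
```

With N segments this fetches about log2(N) hashes before the first segment can be validated, instead of roughly 2N, which is the part that matters for alacrity on very large files.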

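For the "storage-server validity check to replace write-enabler" item under the new mutable format, the rule I have in mind is small enough to sketch. This is only an illustration under stated assumptions: the `Share` fields and `should_accept` are hypothetical names, and the actual signature check is passed in as a callable because it depends on the (undecided) key format.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Share:
    # Hypothetical fields; the real mutable share layout has more in it.
    seqnum: int
    root_hash: bytes
    signature: bytes

def should_accept(existing: Optional[Share], new: Share,
                  is_valid: Callable[[Share], bool]) -> bool:
    """Accept a write iff the seqnum goes up and the new share is valid."""
    # Cheap comparison first, so stale writes never cost a signature check.
    if existing is not None and new.seqnum <= existing.seqnum:
        return False
    # The expensive part: this is where the server's CPU cost would be measured.
    return is_valid(new)
```

Doing the seqnum comparison before the validity check keeps the expensive verification off the path for obviously stale writes, which should help when measuring the CPU cost for the server.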

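And for the lease-tracking cache database item, a minimal sketch of the shape I'm imagining: a one-shot crawler fills a table, and DYHB ("do you have block") queries are then answered from the table instead of the filesystem. The schema and function names are assumptions, the crawler is reduced to "something that yields (storage_index, shnum) pairs", and lease records themselves are left out.

```python
import sqlite3

def open_db(path=":memory:"):
    """Open (or create) the cache DB. The schema is a guess, not a design."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS shares ("
        " storage_index TEXT NOT NULL,"
        " shnum INTEGER NOT NULL,"
        " PRIMARY KEY (storage_index, shnum))"
    )
    return db

def populate(db, found_shares):
    """One-shot crawl result: replace the table contents in one transaction."""
    with db:
        db.execute("DELETE FROM shares")
        db.executemany("INSERT INTO shares VALUES (?, ?)", found_shares)

def do_you_have_block(db, storage_index):
    """Answer a DYHB query from the DB, without touching the filesystem."""
    rows = db.execute(
        "SELECT shnum FROM shares WHERE storage_index = ? ORDER BY shnum",
        (storage_index,),
    )
    return [row[0] for row in rows]
```

Because `populate` rebuilds the table from scratch, the same crawler can be rerun to regenerate the DB if it is ever lost or suspected to be stale.
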
Whew! That was quite a list. I guess the next step is to go through and
locate or add ticket numbers for each.

Comments welcome!
 -Brian

