[tahoe-dev] example GSoC Proposal Re: working in public Re: Google Summer of Code chooses to sponsor Tahoe-LAFS!

Zooko O'Whielacronx zookog at gmail.com
Tue Apr 6 22:42:59 PDT 2010


Folks:

Kevan Carstensen wrote up a good Google Summer of Code proposal for
Redundant Array of Independent Clouds (RAIC), but then he decided not
to propose RAIC for GSoC but instead to propose MDMF. He agreed that I
could make his proposal public so that other students applying for
GSoC can see the sort of detail that we desire in proposals.

If you are the one student who is currently writing up a Proposal to
work on RAIC, you may copy Kevan's proposal and modify it however you like.

If you are one of the students who are currently writing up other
Proposals, you can at least see how Kevan has written a lot of detail
about what sort of code would need to be written and also about
dividing the work into successive steps. The GSoC Mentors will help
you if you choose to add that level of detail into your own Proposals.

If you are logged into the GSoC site, you can view Kevan's RAIC proposal here:

http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/kevan/t127060884801

The current version of it is appended below.

Regards,

Zooko

-------
== Abstract ==

Interesting use cases would open up for Tahoe-LAFS if Tahoe-LAFS storage
servers knew how to write to storage backends other than the filesystem
of the machine that they run on; in particular, if they knew how to write
to commodity cloud storage providers such as Amazon's S3 service, the
Rackspace cloud, and others. To open up these use cases, I will modify
Tahoe-LAFS to support multiple storage backends in a modular and
extensible way, then implement support for as many cloud storage
providers as time allows.

== Background and Example ==

Tahoe-LAFS storage servers, as currently written, ultimately rely
(depending on how a node is configured) on the underlying filesystem of
the machine on which they run to store the shares they are responsible
for. This makes it hard and expensive to build grids that
are robust to failures. Running a grid of several Tahoe-LAFS storage
servers on one machine with one disk is no more robust than simply
copying files to the disk directly, for example, because in both cases a
disk failure will destroy the files. Running a grid of several distinct
storage servers that all write to a centralized NAS is vulnerable to the
failure of the NAS. Running a grid of several distinct storage servers
that all have separate disks in a single datacenter is vulnerable to the
failure of the datacenter. To build a grid that addresses these and
other robustness challenges using Tahoe-LAFS as it is currently written
is expensive, probably beyond the reach of all but the most well-funded
grid operators. However, for users who want a strict assurance of the
confidentiality and integrity of their data while it is in the cloud,
Tahoe-LAFS is ideal; it is designed to give users exactly that.

Services like Amazon AWS S3 [1] and the Rackspace cloud [2] abstract
some of these robustness details neatly away. If you have an Amazon S3
bucket, you can put files there and let Amazon take care of the details
of crisis planning and decentralization. However, these services afford
no quantifiable guarantee of confidentiality -- users must trust that
the cloud providers will be free of malice, security flaws, and other
potentially compromising attributes, or otherwise protect their data
against snooping and tampering.

Extending the Tahoe-LAFS storage server to write to cloud storage
services will make it easy for users to create grids that are robust on
the level of a commodity cloud computing provider, but also provide
strict assurances of confidentiality and integrity. This project will do
that. If successful, users will be able to configure their storage
servers to write to Amazon S3, the Rackspace Cloud, Google Docs, and
perhaps other backends, in addition to the local filesystem of a node.

One use case that this opens up is the "Redundant Array of Independent
Clouds". If this project is successful, any user could create a grid
(for the cost of two or three grid storage subscriptions) that stores
their data, ensures the confidentiality and integrity of their data, and
is as robust as the most robust chosen cloud provider. This use case (or
the resulting robustness, confidentiality, and integrity) would be all
but impossible for most users with Tahoe-LAFS as it is now.

== Backward and Forward Compatibility ==

None of the changes associated with this project should require
significant changes to the remote interface for storage servers. To an
old Tahoe-LAFS client, a storage server writing to Amazon S3 will look
just like a storage server writing to its local disk. Similarly, to a
Tahoe-LAFS client, an old storage server without the changes resulting
from this project will look just like a newer storage server. In other
words, there is no reason for forward or backward compatibility to be
affected by this project.

== What should IStorageProvider look like? ==

The use of the filesystem is fairly tightly coupled into the existing
storage server -- for example, parts of the server that do not directly
write files still rely on the ability to list files and test for their
existence. Given this, it may make sense to
implement an IStorageProvider that is very similar in functionality to
the implicit filesystem API already used by the storage server.
Integration into the existing code would, aside from threading the
IStorageProvider implementation into BucketWriters, BucketReaders,
ShareFiles and MutableShareFiles, mainly consist of replacing calls to
Python's built-in filesystem functions with calls to those defined in
IStorageProvider. The downside of this approach is that it possibly
constrains IStorageProvider implementations by eliminating backends that
do not map well to the semantics of Python's default filesystem
functions. For example, IStorageProviders would need to provide for
callers the functional equivalent of directories, something that might
not be the case for all of the storage backends that we might want to
support.
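A minimal sketch of this first approach might look like the following. All of the names and method signatures here are invented for illustration (the real interface would be settled in step 0, and Tahoe-LAFS defines its interfaces with zope.interface, while this sketch uses abc to stay self-contained):

```python
import os
import tempfile
from abc import ABC, abstractmethod

# Hypothetical IStorageProvider that simply mirrors the filesystem
# calls the storage server already makes.  Names are illustrative.
class IStorageProvider(ABC):
    @abstractmethod
    def put(self, path, data): ...
    @abstractmethod
    def get(self, path): ...
    @abstractmethod
    def exists(self, path): ...
    @abstractmethod
    def listdir(self, path): ...

class LocalDiskProvider(IStorageProvider):
    """Backend that preserves today's behavior: shares on local disk."""
    def __init__(self, root):
        self.root = root
    def _abs(self, path):
        return os.path.join(self.root, path)
    def put(self, path, data):
        full = self._abs(path)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "wb") as f:
            f.write(data)
    def get(self, path):
        with open(self._abs(path), "rb") as f:
            return f.read()
    def exists(self, path):
        return os.path.exists(self._abs(path))
    def listdir(self, path):
        return sorted(os.listdir(self._abs(path)))
```

Note how `listdir` bakes the directory concept into the interface -- exactly the constraint described above, since a flat object store like S3 would have to emulate directories to satisfy it.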

Another approach, though one with a larger up-front analysis cost,
would be to identify high-level operations that rely on the filesystem,
then abstract those out of the core logic of the storage server. Then
IStorageProvider is not necessarily one single interface, but a
collection of objects that together provide the high-level functionality
of a filesystem, as Tahoe-LAFS uses it. This is potentially less
constraining than simply attempting to clone Python's filesystem
built-ins, though at the cost of forcing future development of the core
storage server logic to use our high-level operations and objects
instead of primitives like those provided by the operating system
(i.e., it is constraining, but in a different way, which isn't
necessarily a bad thing -- programming is intrinsically constraining --
but is something to consider).  Further, this assumes that it is
possible to elegantly and intelligently refactor the existing storage
server code into abstractions that are meaningful and useful on their
own. The main concern, given my limited analysis of the existing storage
server -- notably, storage/server.py, storage/mutable.py, and
storage/immutable.py -- is that there is not necessarily enough
filesystem-independent functionality in the existing storage server to
merit having a skeletal filesystem-independent storage server object --
whatever benefit might be realized by reducing code duplication would be
counteracted by the resulting complexity of the more generalized design.
Further, code re-use could be achieved through other means -- the
statistics mechanisms, for example, could be moved to a mixin.
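To make the contrast with the first approach concrete, a sketch of one such high-level abstraction follows. The interface and all its names are hypothetical, not the real Tahoe-LAFS API; the point is that the operations are the ones the storage server needs, not the ones the filesystem provides:

```python
from abc import ABC, abstractmethod

# Hypothetical high-level abstraction: name the operations the storage
# server actually needs, rather than mirroring filesystem primitives.
class ShareStore(ABC):
    @abstractmethod
    def put_share(self, storage_index, shnum, data):
        """Store share number shnum for a storage index."""
    @abstractmethod
    def get_share(self, storage_index, shnum): ...
    @abstractmethod
    def list_shares(self, storage_index):
        """Return the share numbers held for this storage index."""

class MemoryShareStore(ShareStore):
    """Toy backend: a flat mapping with no directories, which is why
    this style suits non-filesystem backends like S3 or Rackspace."""
    def __init__(self):
        self._shares = {}
    def put_share(self, storage_index, shnum, data):
        self._shares[(storage_index, shnum)] = data
    def get_share(self, storage_index, shnum):
        return self._shares[(storage_index, shnum)]
    def list_shares(self, storage_index):
        return sorted(n for (si, n) in self._shares if si == storage_index)
```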

Alternatively, each backend could have its own storage server. This is
conceptually simpler than either of the other approaches -- we simply
require each storage backend to implement RIStorageServer, so that they
all look the same to remote clients. The downside of this is that any
backend-independent code that exists in the current storage server
implementation ends up being duplicated over all of the storage server
implementations, though this functionality could be abstracted and
re-used if necessary or desirable.
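A sketch of this third approach, with the stats re-use idea mentioned above folded in as a mixin. Each class would speak the same remote protocol (standing in for RIStorageServer); every name here is illustrative only:

```python
# Hypothetical "one server per backend" sketch.  Both classes expose
# the same remote_put method, so remote clients cannot tell them
# apart; shared, backend-independent pieces (here, stats) are mixins.
class StatsMixin:
    def __init__(self):
        self.counters = {}
    def count(self, name):
        self.counters[name] = self.counters.get(name, 0) + 1

class DiskStorageServer(StatsMixin):
    def __init__(self):
        super().__init__()
        self.shares = {}   # stands in for local-disk share files
    def remote_put(self, key, data):
        self.count("puts")
        self.shares[key] = data

class S3StorageServer(StatsMixin):
    def __init__(self, bucket):
        super().__init__()
        self.bucket = bucket   # would wrap a real S3 client
        self._objects = {}
    def remote_put(self, key, data):
        self.count("puts")
        self._objects[key] = data   # real code would PUT to the bucket
```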

Which of these is the right approach will probably become clearer when
coding begins. In any case, it is something to think about; in many
important ways, this project depends on how IStorageProvider is defined,
since an incorrect, kludgy, or contrived abstraction will affect the
ability to add other storage backends to Tahoe-LAFS even after work on
the specific Summer of Code project stops.

== Timeline and Deliverables ==

This project breaks down into the following steps.

  0. Present the Redundant Array of Independent Clouds to the community.
     Gather feedback from storage server administrators about which
     features they would like to see implemented as part of the project,
     and which cloud backends they would like to see supported. Work
     with Tahoe-LAFS developers to finalize an approach to decoupling
     the filesystem and storage server logic, or decide that it is
     better to implement discrete storage servers for each backend.

     (this step would be performed well before the start of coding,
     which is why it is called step 0)

  1. Depending on the results of step 0, decouple the storage backend
     independent logic in the current storage server implementation from
     the filesystem-specific logic.  This may result in a new interface,
     IStorageProvider, which provides a very simple API for basic
     filesystem functions, and an implementation of IStorageProvider
     that uses the filesystem as a backend. It may also result in
     nothing, or in only a few pieces of code that get re-used with
     new storage backends. No significant new functionality will be
     introduced at this point; however, this step is necessary to enable
     the later steps.

  2. Determine, using the results of step 0 and common sense, how
     storage server implementations will be configured in general.
     Specifically, we will need to have some way of mapping user
     configuration choices and other necessary information (for example,
     desired login credentials, service-specific configuration, etc) to
     what happens when a storage server is actually started. A
     successful solution to this will need to identify and address the
     implications of placing potentially sensitive credentials in
     configuration files, possibly providing a more palatable
     alternative (e.g., integration with the keychain in OS X).

  3. Document, develop, and test storage server implementations for as
     many interesting storage backends as time allows. At a minimum, it
     would be nice to support Amazon S3, Rackspace Cloud files, and
     possibly Google Docs as Tahoe-LAFS backends.
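The configuration mapping described in step 2 could be sketched as follows. The section and option names in the sample fragment are invented for illustration and would be settled during steps 0 and 2; note that it sidesteps the credential problem raised above by referencing a credentials file rather than placing secrets inline:

```python
import configparser

# Hypothetical tahoe.cfg-style fragment.  All section and option
# names are invented; secrets live in a separate file, not inline.
SAMPLE_CFG = """
[storage]
backend = s3
[storage-s3]
bucket = my-tahoe-shares
credentials_file = private/s3-credentials
"""

def choose_backend(cfg_text):
    """Map user configuration to the backend a server would start with."""
    cfg = configparser.ConfigParser()
    cfg.read_string(cfg_text)
    backend = cfg.get("storage", "backend", fallback="disk")
    section = "storage-%s" % backend
    options = dict(cfg.items(section)) if cfg.has_section(section) else {}
    return backend, options
```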

== About me ==

I'm a student studying Computer Science at California State Polytechnic
University, Pomona. I'll be available to work on Tahoe-LAFS full-time
over the summer.

I've worked with Tahoe-LAFS before; I have contributed several small
improvements and bugfixes to the project, have also contributed
documentation and code review, and have been following its development
(through IRC and the tahoe-dev mailing list) for the better part of a
year. I'm familiar with the codebase, and comfortable with the
requirements (thorough testing; clear, efficient, and robust code)
placed upon contributions.

I've worked as a programmer and system administrator throughout college.
I'm comfortable working with Python, Objective-C, C, and PHP.

Academically, I have an interest in security; particularly capabilities
and systems that use them, and cryptography. Outside of school, work,
and computers, I'm interested in cooking, food, and cars.

== Contact ==


[1] http://aws.amazon.com/s3/
[2] http://www.rackspacecloud.com/

