[tahoe-dev] Alternative backup and security setup

Mike Mike.OpenSource+Tahoe-LAFS at GoogleMail.com
Wed Mar 13 01:17:19 UTC 2013


Hey all,

Over the years, I've been putting together various UNIX hacks in the name
of security and backup, and I wanted to share a few of them and possibly
join forces with other like-minded individuals.  A BSD/Linux user of a
couple of decades, I am also an Apple fan.  Hopefully you will find some of
the tweaks interesting, and maybe there will be cause to co-hack or
collaborate.

You're no doubt familiar with the principle of least privilege, and in that
spirit I've set up a practical backup configuration that stays secure even
if a third-party provider is compromised.  For one, I've been stably
running a TrueCrypt volume on top of an encrypted sparse bundle, and that
sparse bundle is set to back up to a cheap external backup provider.  I
omit unneeded metadata such as the token file from the backup and instead
pool all metadata into a separate secure archive, which is then backed up
as well.  Even if the external provider is compromised, or my
provider-specific private key becomes known to them, they only ever see
encrypted data, and without the token they would have a hard time even
brute-forcing the data to check whether a given passphrase works.  I can
then take advantage of whatever a given provider offers, and can even hand
friends updated band files to back up redundantly in their spare space,
without worrying about confidential information being shared.  I understand
I'm suggesting different tradeoffs, and this has additional overhead
compared to Tahoe-LAFS, for example, since I'm using an intermediate
"staging" server, but my setup includes a server dedicated to that purpose
and the overhead doesn't affect me too much.
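
As a rough illustration (not my actual scripts), here is a minimal Python
sketch of that split; the bundle path, the set of "metadata" file names,
and the archive names are hypothetical placeholders:

#!/usr/bin/env python3
"""Split a sparse-bundle directory into a data archive and a metadata archive.

Hypothetical sketch: paths and the metadata file list stand in for the real layout.
"""
import tarfile
from pathlib import Path

BUNDLE = Path("backup.sparsebundle")                    # hypothetical bundle location
METADATA_NAMES = {"token", "Info.plist", "Info.bckup"}  # kept out of the data archive

def is_metadata(path: Path) -> bool:
    return path.name in METADATA_NAMES

def build_archives(bundle: Path) -> None:
    with tarfile.open("bundle-data.tar.gz", "w:gz") as data_tar, \
         tarfile.open("bundle-metadata.tar.gz", "w:gz") as meta_tar:
        for path in sorted(bundle.rglob("*")):
            if path.is_file():
                target = meta_tar if is_metadata(path) else data_tar
                target.add(path, arcname=str(path.relative_to(bundle.parent)))

if __name__ == "__main__":
    build_archives(BUNDLE)
    # bundle-data.tar.gz goes to the cheap external provider;
    # bundle-metadata.tar.gz goes into the separately secured metadata archive.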

I think there are a few assumptions we can make about people's data.  For
one, metadata can be assumed to be much smaller (<<) than the data itself,
so things like timestamps and checksums can be included in an encrypted
archive meant for easy restoration.  Likewise, frequently used/modified
data << the bulk of the data.  And configuration data, private keychains,
GnuPG and SSH keys, some root files, and in general the user's most
important files can be considered much smaller (<<) than the rest.  So why
not ensure that the most critical data is spread out the most
geographically, in case something goes wrong?  It is possible to make
backup decisions based upon derivation information about where acquired
files came from, so that a video or source code file available at a stable
web location can be considered "reproducible" and thus not as critical to
back up individually (a sketch follows below).  As an extension, it could
be possible to actively trace and map out the derivation of files, so that
a binary file is known to originate from a hierarchy of dependent source
files.
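
As a rough illustration of the "reproducible" classification, a minimal
Python sketch; the manifest of known stable sources and its JSON format are
hypothetical placeholders rather than anything I actually run:

#!/usr/bin/env python3
"""Classify files as "reproducible" (re-fetchable from a stable source) or not.

Hypothetical sketch: the manifest format and directory layout are placeholders.
"""
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def load_manifest(path: Path) -> dict:
    # Maps hex digest -> stable URL where the file can be fetched again.
    return json.loads(path.read_text())

def backup_priority(path: Path, manifest: dict) -> str:
    digest = sha256_of(path)
    if digest in manifest:
        return f"reproducible (from {manifest[digest]}), low priority"
    return "not reproducible, back up individually"

if __name__ == "__main__":
    manifest = load_manifest(Path("known-sources.json"))
    for candidate in Path("acquired").rglob("*"):
        if candidate.is_file():
            print(candidate, "->", backup_priority(candidate, manifest))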

What about a staging server that deliberately sits on a local network and
performs deduplication privately for a group (family, friends, connected
researchers, a group behind the same intranet -- thus without exposing
which files have been pooled/shared), and then, after compression and
encryption, forwards the data on to be backed up?  What about the very
practical compromise of storing relatively inert data encrypted on hard
drives and archived in a vault at a geographically distant location?
Incremental updates can then be combined, the overhead cost is little more
than that of a hard drive, and recovery turnaround is a short day or two.
The staging server can synchronise itself very quickly with any locally
connected devices thanks to network proximity, and presumably internal
bandwidth wouldn't need to be paid for; at the same time the server would
be "staging" in that incremental high-priority updates would be pushed to
the cloud and offsite in the evenings after some postprocessing (which can
include further compression, or incremental delta calculations if the
intermediate data is available as plaintext before it is encrypted for
offsite storage).
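
A minimal sketch of the dedup-then-forward step, assuming content
addressing by SHA-256 and the Python cryptography package for the
encryption; the directory names and the final forwarding step are
placeholders:

#!/usr/bin/env python3
"""Content-addressed staging: dedup locally, then compress and encrypt before forwarding.

Hypothetical sketch: directory layout and the offsite forwarding step are placeholders.
"""
import gzip
import hashlib
from pathlib import Path

from cryptography.fernet import Fernet  # pip install cryptography

STAGING = Path("staging-store")   # local content-addressed store shared by the group
STAGING.mkdir(exist_ok=True)

def stage(path: Path) -> Path:
    """Store a file under its SHA-256 digest; identical files are stored only once."""
    data = path.read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    blob = STAGING / digest
    if not blob.exists():
        blob.write_bytes(data)
    return blob

def prepare_offsite(blob: Path, key: bytes) -> bytes:
    """Compress, then encrypt, the deduplicated blob before it leaves the intranet."""
    return Fernet(key).encrypt(gzip.compress(blob.read_bytes()))

if __name__ == "__main__":
    key = Fernet.generate_key()   # in practice this key lives only on the staging server
    for incoming in Path("incoming").rglob("*"):
        if incoming.is_file():
            blob = stage(incoming)
            ciphertext = prepare_offsite(blob, key)
            # forward `ciphertext` to the offsite provider during the evening push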

I thought it would be interesting to implement a meta-{data backup}
solution that learns what resources are available (network bandwidth and
its associated cost, free space and reliability of nearby drives, overall
budget, use cases for local systems, possible staging servers) and then
uses those resources automatically in an efficient and clever manner.
A lot of very useful metadata fits within 2GB and can be redundantly backed
up to various reliable, geographically diverse sites (most likely for free
or at very low cost).  A nearby university can offer very cheap storage for
archived data that is rarely accessed.  There are various backup sources
and sinks, and these can be configured to achieve an optimal result
overall.  There is also much that has to be tweaked and configured by
default.  What if an over-arching UI could automatically create and
maintain the needed accounts and forward payment as required?  Such a UI
would act as a universal interface to backup functionality.  This is just
one idea, but suffice it to say that there is already a lot of unneeded
complexity and diversity in backup solutions and resources; the flexibility
is there, and based upon high-level guiding principles, perhaps the
parameters can be categorised so that the flexibility is hidden under a
sensible, unified interface.
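
To make the resource-learning idea slightly more concrete, a toy Python
sketch of placement across sinks; the sink list, the numbers, and the
simple metadata/bulk split are invented for illustration:

#!/usr/bin/env python3
"""Toy resource-aware placement: spread the critical/metadata class widely,
send the bulk class to the cheapest sink with room.  Sink names, costs and
the two-class split are hypothetical illustrations, not a real policy engine.
"""
from dataclasses import dataclass

@dataclass
class Sink:
    name: str
    free_gb: float
    cost_per_gb: float   # currency units per GB per month
    reliability: float   # rough 0..1 estimate

SINKS = [
    Sink("university-archive", free_gb=500, cost_per_gb=0.005, reliability=0.95),
    Sink("cheap-cloud", free_gb=2000, cost_per_gb=0.01, reliability=0.99),
    Sink("friend-spare-disk", free_gb=300, cost_per_gb=0.0, reliability=0.80),
]

def place_metadata(size_gb: float, copies: int) -> list[str]:
    """Critical metadata is small, so replicate it to the most reliable sinks."""
    ranked = sorted(SINKS, key=lambda s: s.reliability, reverse=True)
    return [s.name for s in ranked if s.free_gb >= size_gb][:copies]

def place_bulk(size_gb: float) -> str:
    """Bulk data goes to the cheapest sink that has enough free space."""
    candidates = [s for s in SINKS if s.free_gb >= size_gb]
    return min(candidates, key=lambda s: s.cost_per_gb).name

if __name__ == "__main__":
    print("metadata copies ->", place_metadata(2, copies=3))
    print("bulk archive   ->", place_bulk(800))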

Also, for archiving a keyfile I make use of (threshold) secret sharing,
which in my opinion is an under-utilised paradigm.  Although it can take
twice the storage, for a set of critical data like password files that
overhead is insignificant next to a full backup.  The basic idea is to
store E_K1(data + R) and E_K2(R), where R is randomly generated one-time-pad
material and + is XOR.  Those two halves can then be kept separate.  The
statistical properties of each half are highly random and uncorrelated, and
even someone holding both halves in principle needs both keys in their
entirety to decrypt the data.  As abstract and impractical as this may
sound, FreeBSD's GEOM layer offers a gshsec implementation which I use: I
combine two GELI-encrypted devices under a shared secret and form a
filesystem on top of it.
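
For the curious, here is a minimal Python sketch of the same XOR idea at
file granularity (using the cryptography package's Fernet for the per-share
encryption; this is only an illustration and not what gshsec/GELI do at the
block layer):

#!/usr/bin/env python3
"""Two-of-two XOR secret sharing of a keyfile: share1 = data XOR R, share2 = R,
each share encrypted under its own key.  Illustrative sketch only.
"""
import os
from cryptography.fernet import Fernet  # pip install cryptography

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def split(data: bytes, k1: bytes, k2: bytes) -> tuple[bytes, bytes]:
    pad = os.urandom(len(data))                  # R, one-time-pad material
    share1 = Fernet(k1).encrypt(xor(data, pad))  # E_K1(data XOR R)
    share2 = Fernet(k2).encrypt(pad)             # E_K2(R)
    return share1, share2

def recombine(share1: bytes, share2: bytes, k1: bytes, k2: bytes) -> bytes:
    return xor(Fernet(k1).decrypt(share1), Fernet(k2).decrypt(share2))

if __name__ == "__main__":
    k1, k2 = Fernet.generate_key(), Fernet.generate_key()
    secret = b"contents of the keyfile"
    s1, s2 = split(secret, k1, k2)               # keep s1 and s2 in separate places
    assert recombine(s1, s2, k1, k2) == secret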

Technologically, on the hardware side, I'm a big fan of the 2TB My Passport
Studio drive (what a small, elegant little drive that needs no additional
power source -- it can even be a boot volume on OS X, and over FireWire it
can be disconnected while a laptop sleeps) and of a series of reliable $120
refurbished netbooks, and I'm considering combining a Raspberry Pi or
similar with that drive to create a local file-sync and staging server that
continuously backs up to multiple geographically redundant backup providers
(while automatically connecting to local wired/wireless networks and
presenting itself as a server for various protocols).

I hope you'll find some of this intriguing or interesting.

Mike