[Infrastructures] a tapeless backups system for a workgroup (drafty proposal v1)

Will Partain partain@dcs.gla.ac.uk
Fri, 04 Jan 2002 15:19:53 +0000


Greetings, infrastructural folks!  Great to meet some of
you at LISA. 

I have been musing slightly about doing backups without
tapes, and pass along my current notes (below) for your
comment and criticism.

(Once we get beyond the high-level issues that kinda make
sense for the infrastructures list, we should probably chat
elsewhere -- I'd suggest ark-dev
[http://lists.sf.net/mailman/listinfo/ark-dev], as that's
where I'll probably do any detailed implementation chit-chat
(if it happens).)

Will

=========================

A tapeless backups system for a workgroup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

At a LISA 2001 session on "recovery-oriented computing" [1],
Dave Patterson (Berkeley) commented that "tape is dead".
Reason: if tape doesn't cost more per byte than disk yet, it
soon will.  Quick check (2002.01): 120GB EIDE disk for $208
(Pricewatch) = $1.73/GB, 40GB DLT IV tape for $53
(Pricescan) = $1.32/GB.  Close enough for me!
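
The quick check above is just price divided by capacity.  A
minimal sketch of that arithmetic, using only the two quotes
given (the function name is mine, for illustration):

```python
def cost_per_gb(price_usd: float, capacity_gb: float) -> float:
    """Dollars per gigabyte for one piece of media."""
    return price_usd / capacity_gb

disk = cost_per_gb(208, 120)   # 120GB EIDE disk, Pricewatch quote
tape = cost_per_gb(53, 40)     # 40GB DLT IV tape, Pricescan quote

# How much more disk costs per GB than tape, as a ratio.
ratio = disk / tape
print(f"disk ${disk:.3f}/GB, tape ${tape:.3f}/GB, disk/tape = {ratio:.2f}")
```

So disk is still roughly 30% dearer per GB at these prices;
the figures ignore the cost of the tape drive itself.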

And, if you haven't noticed lately, tape is a *pain*.

So, what can we do backup-wise [2] (NB: not
`archiving'... [3]) using only cheap disks?  Assumptions:

* I have a place "far away" (a fire won't get there...) on
  my intranet where I can park a machine with disks hanging
  off of it.

* My "workgroup" world isn't *that* big...  Nor is it a
  nothing-must-ever-be-down-ever-no-matter-what environment. 
  (Even in such an environment, where you will probably
  RAIDify every disk-bit, you still need backups...)

Terminology:

`disk chunk' [4] - a logically-coherent chunk
   of stuff on disk "in one piece" (i.e. under a single
   directory).  Examples: "Fred's home directory" (/home/fred);
   "where we keep our source tarballs" (/system/open-src); ...
   
   [If you have a record of all of your 'disk chunks', you
   can generate your automount maps (for example).]

`replica' - a copy of a disk chunk.  The `primary' replica
  is the one you really use (perhaps read-write); all other
  (`secondary') replicas of a disk chunk are read-only (if
  made available at all).
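
To illustrate the remark that a record of disk chunks is
enough to generate automount maps, here is a rough sketch.
The record layout, the server names and the export paths are
all invented for the example; the output uses the ordinary
"key location" form of an indirect automount map.

```python
from typing import NamedTuple

class DiskChunk(NamedTuple):
    mount_point: str   # where users see the primary replica
    server: str        # host holding it (hypothetical names below)
    export_path: str   # directory on that host

def automount_entries(chunks, top):
    """Indirect-map lines for the chunks mounted directly under `top`."""
    prefix = top.rstrip("/") + "/"
    lines = []
    for chunk in sorted(chunks):
        if not chunk.mount_point.startswith(prefix):
            continue
        key = chunk.mount_point[len(prefix):]
        if "/" in key:
            continue   # nested deeper than one level; not in this map
        lines.append(f"{key}\t{chunk.server}:{chunk.export_path}")
    return lines

# Hypothetical record of disk chunks.
CHUNKS = [
    DiskChunk("/home/fred", "fileserv1", "/export/disk1/fred"),
    DiskChunk("/home/partain", "fileserv2", "/export/disk3/partain"),
    DiskChunk("/system/open-src", "fileserv1", "/export/disk2/open-src"),
]

for line in automount_entries(CHUNKS, "/home"):
    print(line)
```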

So, the idea is very simple: rsync [5] copies of your disk
chunks to a remote diskful machine in an organized way.
Those remote chunks (replicas) are then available themselves
(read-only).  Details:

* A five-nights-a-week thing rsyncs all live 'disk chunks'
  to a remote diskful box.  This set of (secondary) replicas
  is available (read-only) through a /yesterday automount
  map, using the following mapping:

  primary replica called...  secondary called...

  /home/partain              /yesterday/home--partain
  /system/open-src           /yesterday/system--open-src
  (and so on)

* A weekly (every Sunday?) thing rsyncs all live disk chunks
  to a remote diskful box.  These sets of (secondary)
  replicas are available (read-only) through automount maps
  /weekend-1, /weekend-2, /weekend-3, and /weekend-4.  (If
  there's a fifth Sunday, I think I just won't bother.)
  Thus, my home directory as of the first weekend of the
  month: /weekend-1/home--partain

* System-y bits (/usr, /, /opt) -- stuff that came straight
  from the vendor -- are not backed up.  (You could do
  differently, but we're inclined to just rebuild the
  machine.)
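
The naming and scheduling rules above can be sketched in a
few lines.  Assumptions that are mine, not part of the
proposal: the five nightly runs fall on Monday to Friday,
nothing runs on Saturday, and the function names are made up
for illustration.

```python
import datetime

def replica_name(primary_path: str) -> str:
    """Map '/home/partain' to 'home--partain', as in the table above."""
    return primary_path.strip("/").replace("/", "--")

def snapshot_root(day: datetime.date):
    """Automount root refreshed by a run on `day`, or None for no run."""
    weekday = day.weekday()            # Monday == 0 ... Sunday == 6
    if weekday == 6:
        nth_sunday = (day.day - 1) // 7 + 1
        # A fifth Sunday is skipped, per the proposal.
        return f"/weekend-{nth_sunday}" if nth_sunday <= 4 else None
    if weekday < 5:
        return "/yesterday"            # assumed Monday-to-Friday nightly run
    return None                        # Saturday: assumed idle

def replica_path(primary_path: str, day: datetime.date):
    """Full path of the secondary replica written on `day`, or None."""
    root = snapshot_root(day)
    return None if root is None else f"{root}/{replica_name(primary_path)}"

print(replica_path("/home/partain", datetime.date(2002, 1, 6)))
```

One wrinkle: a primary path that already contains "--" would
make the mapping ambiguous, so the real thing would need to
forbid or escape that.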

That's, er, um, *it*!  Use cases:

* User loses a file?  They can go find it themselves, in
  /yesterday and/or /weekend-{1,2,3,4}.

* Data disk dies?  Either: (a) Copy the stuff from
  /yesterday to some other disk, creating a new primary
  replica; adjust automount maps accordingly; or (b) Switch
  affected people's machines' automount maps to point to the
  /yesterday replica (now read-write :-).  Replace disk at
  leisure and shuffle things about.

* System disk dies? (machine now dead)  (a) [unlikely]
  Physically move the data disk to another machine; adjust,
  season, enjoy; or (b) Treat every primary disk-chunk
  replica that was on this machine as in the data-disk
  case above.  Fix machine at leisure.  Note: for critical
  machines, it may be worth having a spare system disk ready
  to go.

* Building burns down?  You've got all the bits far away.

Finer details:

* rsync over ssh: yes

* It's better that the "diskful boxes" *pull* the rsync
  copies, rather than having every client push to them.

* Yes, we'll use the rsync 'excludes' mechanism to avoid
  backing up large recreatable files.

* Database-y disk chunks (e.g. Oracle databases, ClearCase
  VOBs, ...): have enough smarts to wrap the rsyncs in
  a `lock'/`unlock' pair.

* Can 'cut over' from a traditional tape-based backup scheme
  gradually (a disk or machine at a time...).

* Need some kind of tidy-up mechanism: e.g. if a /weekend-1
  replica didn't get made correctly on the Sunday, perhaps
  we should re-try on the Monday?  (Or just live with a
  not-quite-right /weekend-1 replica?)
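
A rough sketch of how the pull, the excludes and the
lock/unlock wrapping might fit together, run on the diskful
box.  Everything named here is hypothetical (host names,
paths, the lock and unlock commands); the rsync options used
are the standard -a, --delete, -e and --exclude.  Using
--delete is my choice, so that a replica mirrors its primary
exactly.

```python
import contextlib
import subprocess

def pull_command(client, src_path, dest_dir, excludes=()):
    """Build the rsync argv that pulls one disk chunk over ssh."""
    argv = ["rsync", "-a", "--delete", "-e", "ssh"]
    argv += [f"--exclude={pattern}" for pattern in excludes]
    # Trailing slash on the source: copy the directory's contents.
    argv += [f"{client}:{src_path.rstrip('/')}/", dest_dir]
    return argv

@contextlib.contextmanager
def locked(client, lock_cmd, unlock_cmd, run=subprocess.run):
    """Run lock_cmd on the client first and unlock_cmd afterwards,
    even if the body fails."""
    run(["ssh", client, lock_cmd], check=True)
    try:
        yield
    finally:
        run(["ssh", client, unlock_cmd], check=True)

def pull_chunk(client, src_path, dest_dir, excludes=(), lock=None,
               run=subprocess.run):
    """Pull one chunk; `lock` is an optional (lock_cmd, unlock_cmd) pair
    for database-y chunks."""
    guard = locked(client, *lock, run=run) if lock else contextlib.nullcontext()
    with guard:
        run(pull_command(client, src_path, dest_dir, excludes), check=True)

# Show the command that would be run; nothing is executed here.
print(" ".join(pull_command("client1", "/home/partain",
                            "/backups/yesterday/home--partain",
                            excludes=["*.o", "core"])))
```

Because failures raise (check=True), a wrapper script could
catch the error and re-try the next day, which is one answer
to the tidy-up question above.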

Variants:

* Have the nightly rsync go to a *local* diskful box; it is
  then really quite painless to switch a user/system to that copy
  (no WAN delays).

  The remote one could then be /day-before-yesterday, taken
  from /yesterday before it gets refreshed...
  (Synchronization issues there...)

* If you just can't get over that tape feeling, attach a
  tape drive to the remote diskful boxes, and take/keep
  copies of the /weekend-1 replicas.

Comments?? What have I forgotten/overlooked?

Notes:

[1] http://roc.cs.berkeley.edu/

[2] `Backing up' solves problems ranging from `I deleted my
    cookie recipe' to `The building burned down'.

[3] `Archiving' takes copies of logical entities (`the
    project that just finished', `the month-end accounts')
    to preserve "forever", perhaps for legal reasons.

[4] `Disk chunk' (or 'dchunk' [pronounced "duh-*chunk*"]) is
    an Arusha Project (Sidai team) term.

[5] http://rsync.samba.org/

== end