I've long wished for a versioned home directory, but svn and its kin seem too heavyweight, and it's nice to have a live copy and not an opaque DB-ball. The right answer is the stunningly elegant Time Machine. The idea isn't actually new, and it can be approximated across platforms and remotely using standard Unix tools. Now you just need a landing spot.
Some remote backup solutions have come to the fore lately. $Zero/year gets you 2GB remote backup from Mozy. $60/year buys unlimited backup space at Mozy, where 'unlimited' means the ~40 GB/month you'll see by leaving your pipe fully saturated 24/7. A price between free and $100/year gets you the more intriguing CrashPlan. These are slick and easy. If you don't know what ssh is, you want one of these. If you do know what ssh is and you don't set your mom's computer up with the free Mozy account, then you're a bad person.
For the uber-uber-nerds, use what I'm using: an $85/year bluehost account. You may think of it as a 600GB ssh'able rsync'able remote backup host that, by the way, can also act as a webserver. You can install svn (as I have) to do versioning over svn+ssh. Two years of bluehost costs the same as a 500GB hard drive+cheap enclosure, and they regularly increase your diskspace allowance at the same monthly price. [Note: the preceding bluehost link will give me a kickback if you sign up through it. Hit bluehost.com directly if that rubs you the wrong way.]
As described below, all my various project shards get sync'ed back to my desktop PC. The desktop then pushes my files out to a bluehost account via rsync, versioned with an rsync-as-poor-man's-Time-Machine script. This gives me a live, versioned backup, accessible to me from anywhere by ssh, on Bluehost's offsite, secure, RAID-, UPS- and diesel-generator-protected colocation, at the end of a fat pipe. After about a week or so for the initial ~50GB backup to roll in, daily incrementals will take an hour or two each day (bandwidth choked to 25 kBps). When I leave town for a week next month I'll start pushing my music collection into its own unversioned directory.
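The script itself isn't reproduced here, but the usual trick behind this kind of setup is rsync's `--link-dest` option: unchanged files in a new snapshot become hard links into the previous one, so every snapshot looks like a full copy while only the changes cost disk space. What follows is a minimal sketch of that idea and not necessarily how my script does it. It only builds the rsync command line; the host `me@example.com`, the paths, and the date-named snapshot folders are made-up placeholders.

```python
from typing import List, Optional


def build_snapshot_command(
    source: str,
    dest_root: str,
    snapshot: str,
    previous: Optional[str] = None,
    bwlimit_kbps: int = 25,
) -> List[str]:
    """Return an rsync argument list for one hard-linked snapshot."""
    cmd = [
        "rsync",
        "-a",                          # archive mode: recurse, keep perms and times
        "--delete",                    # files deleted locally vanish from the new snapshot
        f"--bwlimit={bwlimit_kbps}",   # throttle the transfer (rsync's KB/s units)
    ]
    if previous is not None:
        # Unchanged files are hard-linked to the previous snapshot.
        # A relative --link-dest is resolved against the destination directory.
        cmd.append(f"--link-dest=../{previous}")
    # Trailing slash on the source means "copy the contents of this directory".
    cmd.append(source.rstrip("/") + "/")
    cmd.append(f"{dest_root.rstrip('/')}/{snapshot}/")
    return cmd


if __name__ == "__main__":
    # Hypothetical host, paths and snapshot names, for illustration only.
    cmd = build_snapshot_command(
        "/home/me/now", "me@example.com:backups", "2008-01-02", previous="2008-01-01"
    )
    print(" ".join(cmd))
```

A cron job could hand the resulting list to `subprocess.run` once a day. The very first run has no previous snapshot to link against, which is why `previous` is optional.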
This is only for the stuff I've created or can't replace: not for system software and not for music/movies/media (apart from my iTunes.xml, inbox, and various bookmarks/pref's/stickiesDB/MySQLCachedQueries folders). Unlike most people, I don't worry too much about backing up my system software. Maybe I'm damaged from my Windows upbringing, but just reinstall from scratch if your OS gets hosed (I keep install disks and images around). The nascent defect may be present in the restore; the accumulated OS cruft certainly is. You're already kinda screwed; better to take a certain two days and finish with a clean system than to fight a flaky restore and then spend those two days again. Yes, you are allowed to come point and laugh if this happens to me.
Beyond the backup, I have various levels of defense-in-depth for my data -- data that is created, changes daily and is essential for my economic well-being has four or more levels of redundancy. Data that is intransigently huge but can be sourced elsewhere has no redundancy. (There's no reason to back up my processed wikipedia dump, for instance: only the scripts that process it.)
Right now my fileverse is spread across
- Desktop computer ~ 1TB
- Flivo (my homebrew Tivo computer) ~0.5TB
- School account ~3GB
- sourceforge account
- Four webservers holding sites I operate or caretake ~ 1-5GB each
- GMail account ~4GB
- Flickr account ~6GB
- iPhone, Google Calendar, Yahoo Address book, Plaxo
- blogger/twitter/facebook/etc
This is organized as
- All the stuff I'm currently messing with is in a 'now' folder. This is what syncs to my school account, and every time I go on a trip I burn a DVD of the now folder to toss in my backpack. (You never know when you'll want a file, and it enforces an occasional hardcopy backup.)
- Within the now folder, the stuff I develop for work is versioned with svn. I house a private repository on my bluehost account and connect over svn+ssh. I'm reasonably good about checking in every few hours or when I shift conceptual gears.
- Most of the 'now' folder I keep in sloppy sync with the school account. ('sloppy' because I sync when I think of it and not through a cron script as I should).
- Each year I move everything that isn't under current work out of the now folder and into an 'archive/YEAR' folder; there it sits and changes almost never.
- GMail holds all my mail (sync'ed with IMAP)
- Flickr holds an incomplete and poorly correlated segment of my photos (they own my metadata, and yes that bugs me).
- iPhone sync handles the address book; gcal is still quite difficult.
- The rest of the little metacontent is trapped, meh.
- Each webserver's content is replicated to the desktop. For small changes I'll sometimes diddle the file on the live server and then sync back later (tsk tsk); for heavier work I proceed locally and then deploy.
The usage breaks down like this:
== Work space -- changes ~ daily ==
   2 GB  vizsage project software        Desktop, svn, school, vizsage, bkup
   9 GB  infochimp site & working data   Desktop, svn, school, infochimp, bkup
== Work resources -- changes ~ weekly ==
 100 GB  'huge' datasets                 Desktop, infochimp
  60 GB  live local DB of datasets       Desktop
  30 GB  infochimp website DB            infochimp, bkup
== Slowly changing -- changes ~ semiyearly ==
   3 GB  Other projects, docs, stuff     Desktop, bkup
   4 GB  Archive - doesn't change        Desktop, bkup
         (~ 300 MB / year for 11 years)
   3 GB  Library (prefs,caches,etc)      Desktop
== Metacontent -- changes daily-weekly ==
  12 GB  Photos                          Desktop, Flickr
   3 GB  Mail                            Desktop, GMail, bkup
   ~ MB  iPhone/Addr Book/Calendar       Desktop, iPhone, GCal, Yahoo AddrBk
   ~ MB  this blogger blog               vizsage, Blogger
== Websites ==
   1 GB  website1                        Desktop, website1, bkup
   2 GB  website2                        Desktop, website2, bkup
   6 GB  website3                        Desktop, website3, bkup
== Media ==
many GB  music                           Desktop, some on iPhone, some on DVD
many GB  recorded tv shows, movies, etc  Desktop, data DVD
== System Software ==
some GB  OS & installed programs         on each machine
I'd like to live in a world where I wouldn't have to worry about how these are partitioned across machines. Changes made to 'website1', say, or to a project in 'now', would lazily propagate to each interested shard as well as to the remote time-machineish versioned backup. At any time I could force an immediate sync, whether to deploy a change, to repair a mistake, or to satiate an OCD twinge, if I don't want to wait for automatic synchronization.
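To make "each interested shard" concrete, here is a rough sketch of the bookkeeping such a system would need: a manifest saying where each piece of content lives, plus a lookup for where a change has to go next. The shard sets are copied from a few rows of the table above; the names `MANIFEST`, `propagation_targets` and `unreplicated` are invented for illustration, and nothing here actually moves any files.

```python
from typing import Dict, List, Set

# Where each piece of content lives (a few rows from the usage table).
MANIFEST: Dict[str, Set[str]] = {
    "vizsage project software": {"Desktop", "svn", "school", "vizsage", "bkup"},
    "'huge' datasets": {"Desktop", "infochimp"},
    "live local DB of datasets": {"Desktop"},
    "website1": {"Desktop", "website1", "bkup"},
}


def propagation_targets(item: str, changed_on: str) -> List[str]:
    """Shards that still need a change made to `item` on `changed_on`."""
    shards = MANIFEST[item]
    if changed_on not in shards:
        raise ValueError(f"{item!r} does not live on {changed_on!r}")
    return sorted(shards - {changed_on})


def unreplicated() -> List[str]:
    """Items held in exactly one place, i.e. with no redundancy."""
    return sorted(name for name, shards in MANIFEST.items() if len(shards) == 1)


if __name__ == "__main__":
    # A file diddled on the live server should flow to the other two copies.
    print(propagation_targets("website1", "website1"))
    print(unreplicated())
```

The same manifest doubles as a redundancy audit: anything `unreplicated()` returns had better be something that can be sourced elsewhere.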
I'm actually pretty close to having this out of a MacGyvered patchwork of rsync, svn, Time Machine, IMAP/Aperture+Flickr and distributed file systems, all enforced by cron. I'm planning to soon waste a weekend buttoning up my sync scripts, getting everything to run daily and being superattentive in case I screw it up.
But it sure would be nifty to augment (a cross-platform) Time Machine into a Time and Space Machine. I'd see an overview of my distributed fileverse (versioned in time, distributed in shards according to how I use it), and I could delegate various live realizations, svn/diff-versioned backups or hard-link-versioned backups to each local or remote instance. No single machine would necessarily hold the entire fileverse: note that a few things up there don't propagate back to my main desktop. And hopefully the whole thing would have polished Apple Fit And Finish instead of Mad Max Homebrew Itworksithinkihope.