Amazon S3 as personal backup
Jeremy recently wrote about using Amazon's S3 service as a backup server for his personal data. I ran the numbers and figured the cost was acceptable for my use case (about 15GB of digital photos with a couple hundred MB added each month), so I looked into existing frontend technology for S3 backups. There's JungleDisk and s3sync (sorry, getting too lazy to convert to links; use your favorite search engine), but neither was quite right for my requirements.
The principal problem is that my wife uses the file system (including filenames) as a filing system. This is probably what 98% of computer users on the planet do, too, but as I've written before, adequate search and automated organization technology (such as you'd find in Google's Picasa) make this work superfluous, so I don't do it -- I tend to dump poorly-named files into folders and let indexing software find them when I need them.
But my wife does organize files, and that means that she moves files around on the filesystem and occasionally renames them. This wreaks havoc with programs like s3sync that identify an object by its path. If the path changes, the object at the old path ceases to exist, and the one at the new path must be uploaded all over again. If you're paying for bandwidth, as you do with S3, this means a single top-level folder rename could be quite expensive.
I think the solution is a Venti-style layer over S3. It would work something like this:
- For every file in the directory to be backed up, compute a strong hash of the file contents.
- For each unique hash generated, upload the file corresponding to that hash, keyed by a hex representation of the hash. Rely on S3's built-in capability to avoid re-uploading objects whose contents haven't changed. Update 10/6/2006: As Antony pointed out, this feature doesn't exist; it was just wishful thinking. I will have to first list the bucket contents, and use the result to skip the files already uploaded.
- Upload a representation of the directory structure mapping paths to hashes, as well as whatever other metadata is needed to reconstitute the file at recovery time.
- Maintain a log of objects and refcounts to them in the directory structure. As objects are orphaned (meaning a file was deleted or revised), add them to a queue with timestamp. Once a certain amount of time has elapsed since addition to the queue, such as two weeks, remove them from the list. If they're still orphans, delete them.
If I've thought this through correctly, then this backup system lets you rename (and move) files all you want, and doing so won't cause them to be sent over the wire again. The recovery process isn't too onerous in terms of backup file format; just reassociate paths and metadata with each object, and you're done. You get versioning of individual files for free via the delayed garbage-collection mechanism. And in fact you could set up the system to back up several home PCs and not worry about double-backups of identical files on each PC, assuming the hash store were a big shared soup.
Disadvantages:
- Compression window is limited to a single file. Probably not a horrible loss for the average home dataset, where files don't have much relationship to each other.
- Granularity of versioning is per file. This would be expensive, for example, if you were making small daily edits to a giant Quark file that actually changed only a few well-localized bytes in the file. Perhaps a Bittorrent-style piece mechanism, or whatever rsync does, would address this issue.
- Not entirely convenient backup format. For example, s3sync ends up mirroring your directory structure on S3. So with appropriate security measures you could use your web browser as a convenient filesystem browser. This proposal would give you the browser structure, but the moment you wanted to actually get a file, you'd have to copy and paste the hash key to generate the path to another part of the S3 bucket.
I think this is about 50 lines of Python (famous last words). Maybe I'll try to write it this weekend, unless someone out there beats me to it.
Update 10/6/2006: Looks like the Perl version of s3sync first lists the bucket and collects all the etags (MD5 hashes), and then it skips files already uploaded. If the Ruby port preserves this behavior of the Perl version, then it might handle the move-triggers-reupload issue. So it's possible that s3sync already does enough of what I want.

"Rely on S3's built-in capability to avoid re-uploading objects whose contents haven't changed."
I haven't looked extensively at the S3 documentation yet, but in a quick browse of http://docs.amazonwebservices.com/AmazonS3/2006-03-01/ I don't see this mentioned anywhere in the REST api's PUT command. I think you could do this manually though by just doing an http HEAD request to see if the object already exists before trying to do the PUT. Or perhaps that's what you meant?
Second, are you planning on using the local file system's update timestamps to only consider files which have been created or changed since the last backup? It seems like this would be far faster because it conserves local disk reads, cpu activity, and a small amount of network bandwidth to/from S3 doing the HEAD requests I mention above. I'm not sure if a hard drive's failure rate is impacted significantly by the number of lifetime reads, or if something like number of writes and/or hours spinning dominate, but if reads are significant it might help avoid the drive failure you're trying to insure against.
Antony, for the duplicate detection, I'm actually relying on something I read on the s3sync thread on the Amazon forums. It does seem surprising because it means each put is actually two separate HTTP requests (otherwise you'd learn you didn't need to upload only after you'd uploaded in the request).
I'll do some tests, and if the behavior is what I expect, I'll track down the documentation pointer.
I did write up a quick version of the program last night and not surprisingly, sha1 hashing takes a long time (not to mention bogging down the SATA channel). So I think that yes, I'll record timestamps to avoid the rehash on subsequent backups. Incidentally, I think the ext2 filesystem has extended file attributes that include the functional equivalent of the archive bit on FAT filesystems, but I don't know much beyond that. In either case, the particular files I'm backing up (photos) never change, or change maybe once to rotate portrait photos to the right orientation, so this optimization would pay off nicely.
Hey Mike, I was just thinking on implementing my own file system for my linux computer. I recently went through a lost of my data, because of one silly mistake when reformating my linux server and lost a good chunk of my data. A friend recomended s3.
This is a good idea. Keep us up to date on the progress. Unfortunatly I don't know much of Python.. so I can't help.