Preserving digital files

During the cleanup project I mentioned in my last blog post, I learned a few things about preserving old personal (not business) digital files.

First, what to save. The short answer: everything and then some.

Why everything? Because it's cheaper than picking and choosing. My entire collection of Apple ][ disks, converted to 143KB disk images and compressed, came to 1.7MB. Even when I archived those in the mid-1990s, that would have fit on two floppy disks for a total cost of less than $10.00. Today the numbers are even more lopsided: a one-terabyte external hard drive costs $90.
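To put rough numbers on that, here's a back-of-the-envelope sketch. The 1.7MB and $90 figures come from this post; the 1.44MB floppy capacity and the decimal terabyte (1000GB) are assumptions made for illustration.

```python
import math

ARCHIVE_MB = 1.7         # compressed Apple ][ collection (from the post)
FLOPPY_MB = 1.44         # assumption: 3.5-inch high-density floppy
DRIVE_PRICE_USD = 90.0   # 1TB external drive (from the post)
DRIVE_GB = 1000          # assumption: decimal terabyte

# How many floppies the collection needs, rounding up to whole disks.
floppies_needed = math.ceil(ARCHIVE_MB / FLOPPY_MB)

# What a gigabyte costs on the hard drive, and what the collection costs there.
cost_per_gb = DRIVE_PRICE_USD / DRIVE_GB
archive_cost = cost_per_gb * (ARCHIVE_MB / 1000)

print(f"floppies needed: {floppies_needed}")
print(f"hard drive cost per GB: ${cost_per_gb:.2f}")
print(f"cost to store the collection on the drive: ${archive_cost:.6f}")
```

At that price per gigabyte, the time spent deciding what to delete costs more than the space the files take up.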

What does "and then some" mean? It means don't just save the TurboTax file; also save a PDF of it. Think about why you might want to look at a 2010 tax return in 2025. Is it because you have a computer with TurboTax 2010 installed and feel like editing your old tax return for old times' sake? No, it's because you're running for Congress and have decided to publish your old returns during your campaign. You're much likelier to need to read an old document than to edit it. Save a format that makes it easy to read.

"And then some" also means converting certain physically represented information to digital. Example: take a picture of the CD-ROM case with the serial number, and put the JPEG in the same directory as your files. The picture's usually enough to prove ownership of software, and if you ever have to reinstall TurboTax 2002 to get the IRS off your back, it'll be nice to know what to do when the license-key installer dialog pops up.

Next, which archival format? I like containers rather than individual files. On Windows, that's .zip. On Linux, it's .tar. On a modern Mac, .dmg. These containers are designed to survive transmission over networks, to move easily from an obsolete medium to a modern one, and to preserve metadata. I've helped a friend recover an old Mac file, only to share the grief of finding that the resource fork was gone. I've personally needed to know the last-modified date of an old file because that date was more significant than the file itself, and discovered to my horror that the date was today. Stick your old files in the proper container and take reasonably good care of the container, and you'll get them back out again exactly the way they were put in.
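As a small illustration of the metadata point, here's a sketch using Python's standard tarfile module. It gives a file an old last-modified date, packs it into a .tar, unpacks it somewhere else, and reads the date back. The file name, the function name, and the 1990 timestamp are made up for the example.

```python
import os
import tarfile
import tempfile

def roundtrip_mtime(old_mtime: int = 631152000) -> int:
    """Archive a file carrying an old timestamp, extract it, return the restored mtime.

    631152000 is 1990-01-01 00:00:00 UTC, chosen only for illustration.
    """
    with tempfile.TemporaryDirectory() as work:
        # Create a file and backdate its access and modification times.
        src = os.path.join(work, "report.txt")
        with open(src, "w") as f:
            f.write("freshman year book report\n")
        os.utime(src, (old_mtime, old_mtime))

        # Pack it into a tar container.
        archive = os.path.join(work, "old_files.tar")
        with tarfile.open(archive, "w") as tar:
            tar.add(src, arcname="report.txt")

        # Unpack into a different directory, as if restoring years later.
        out_dir = os.path.join(work, "restored")
        os.mkdir(out_dir)
        with tarfile.open(archive) as tar:
            tar.extractall(out_dir)

        return int(os.stat(os.path.join(out_dir, "report.txt")).st_mtime)

print(roundtrip_mtime())
```

The extracted file should carry the 1990 date rather than today's. A bare copy often won't: Python's own shutil.copy, for instance, copies the contents and permission bits but not the timestamps.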

Which file format? For files that already exist, don't change them. But going forward, pick formats that are (a) open or ubiquitous, and (b) understood by non-DRM applications. Examples:

  • Plain old text files. As is customary, Apple will someday invent another new character for line endings, but otherwise text files are universal.
  • High-resolution, lossless TIFF for scans of old photos. High-quality JPEG is probably OK for photo scans, too, depending on how important they are.
  • PNG for static graphics. The only example I can think of is Eagle PCB files.
  • WAV for important audio like the cassette tape of your dad interviewing his mom when she was 90 years old. But also make an MP3 so you can easily email the interview to your kids.
  • PDF for final versions of electronic documents (see above), and EPS for vector graphics. Both began as proprietary Adobe formats, but their specifications are published (PDF became an ISO standard in 2008), and enough open-source viewers exist that they're unlikely to become unreadable.

Obviously, it's only an educated guess which file formats will be readable decades from now. But ASCII, PNG, TIFF, JPEG, WAV, MP3, and PDF/EPS are good bets for today's documents.
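If you want to see how a pile of files stacks up against that list, something like the following sketch would do. The extension set is one interpretation of the formats named above, and review_formats is a hypothetical helper name. Anything it flags is a candidate for a readable companion copy (like the PDF next to the TurboTax file), not for conversion or deletion.

```python
from collections import Counter
from pathlib import PurePath

# Assumption: these extensions stand in for the formats listed above.
PREFERRED = {".txt", ".tif", ".tiff", ".jpg", ".jpeg",
             ".png", ".wav", ".mp3", ".pdf", ".eps"}

def review_formats(filenames):
    """Return (count per extension, sorted names whose extension isn't preferred)."""
    counts = Counter()
    outliers = []
    for name in filenames:
        ext = PurePath(name).suffix.lower()  # case-insensitive: .WAV == .wav
        counts[ext] += 1
        if ext not in PREFERRED:
            outliers.append(name)
    return counts, sorted(outliers)

counts, outliers = review_formats(
    ["taxes2010.tax", "taxes2010.pdf", "Grandma.WAV", "notes.txt"]
)
print(dict(counts))
print(outliers)
```

Here the only file flagged is taxes2010.tax, which already has its PDF sitting next to it.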

So we have our entire life's Word documents rendered to PDF and stored in a zipfile. Which medium should we use? Two answers, depending on size.

For small file collections (10GB or less) I'd love to recommend CD-R or DVD-R if my personal experience with their longevity weren't so poor. I'd estimate a 50% failure rate for well-stored CD-Rs over 10 years old. Moreover, computers with no moving parts are becoming common; in ten years, perhaps a CD-ROM reader will be as rare as a floppy drive is today. (Update: some have pointed out that the demise of any computer medium won't happen suddenly, and that there will be time to migrate data stored on old media, which is why the CD-ROM obsolescence argument is weak. I agree in principle. In reality, stuff gets put on shelves and discovered years later. Moreover, we're talking about personal files, where the hassle of tracking down a friend with an obsolete reader might be a high enough barrier to recovery.)

That leaves USB drives and SD cards, and I'll pick SD cards, even though they're a little more expensive. Two reasons. First is reliability. I've found SD cards to be more reliable than USB drives, which makes sense: an SD card usually holds the only copy of the pictures taken on a digital camera, so failure is potentially devastating, while the files on a USB drive need to last only long enough for a sneakernet file transfer, so failure isn't a big deal. The second reason is form factor. SD cards are uniform and stackable. (Update: there are questions about flash memory longevity. The point is that if the medium is rewritable, then for archival purposes it has failure built into the design. It might make more sense to make multiple DVD-R copies and store them in different places.)

For big file collections, I'd buy an external 1TB hard drive, fill it up, and put it away. I know that hard drives have lots of moving parts, but they're well-sealed inside their cases, and I've personally had great success getting data off hard drives last used nearly 20 years ago.
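A drive that sits in a closet for years won't tell you when something on it goes bad. One way to find out, sketched below, is to record a checksum for every file before putting the drive away and compare again when you pull it out. This is an add-on idea rather than part of the routine described here, and the function names are invented for the example; only Python's standard hashlib and os modules are used.

```python
import hashlib
import os

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest

def changed_files(root, manifest):
    """List files from an earlier manifest that are now missing or different."""
    current = build_manifest(root)
    return sorted(p for p in manifest if current.get(p) != manifest[p])
```

You'd call build_manifest on the drive before shelving it, keep the result somewhere else (a printout works), and run changed_files against it years later. An empty list means every file recorded in the manifest is still there and unchanged.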

What about online storage? Nope. I haven't yet found a consumer storage service that has a WORM (write once, read many) philosophy about storage; it's too easy for me or a mischievous web weenie to issue a command that erases either my files or my entire account. And any software-as-a-service relationship (including any DRM purchase) is effectively a lopsided, eternal contract with a company. They can change the terms of that contract any time, and if the company goes away, it's likely your files will, too. I love Gmail for the service they provide, but I don't expect them to be my email archiving solution.

Should you encrypt your local backups? Several reasons why I say no. First is that I don't want it; if I die, I do want my wife and kids to be able to intelligently dispose of this stuff. Second, I don't need it; nobody cares about my personal data except me. Third, I can't follow through: either I'll forget the passphrase (or forget to divulge it on my deathbed), or else I'll keep it written down next to the SD cards, in which case it's no more effective than physical security of the SD cards. Your opinions may differ on this one, but let me ask: if your files are so important and secret, how come you don't back up or encrypt the files on your computer today?

And finally, in spite of all this carefully reasoned advice, consider throwing it away instead of saving it. If you don't, your future heirs will have to when you're gone. I admit that it's cool to know that I could call up my high school freshman year book reports on a moment's notice, but I doubt I ever actually will. I don't want to be featured on the digital version of Hoarders in 40 years, after all.


About this Entry

This page contains a single entry by Mike Tsao published on April 27, 2010 10:26 AM.
