TITLE: The Future of Your Data
AUTHOR: Eugene Wallingford
DATE: September 09, 2011 2:19 PM
DESC:
-----
BODY:
In "Liberating and future-proofing your research data", J.B. Deaton describes a recent tedious effort "to liberate several gigabytes of data from a directory of SigmaPlot files". Deaton works in Python, so he had to go through the laborious process of installing an evaluation copy of SigmaPlot, stepping through each SigmaPlot file workbook by workbook, exporting each to Excel, and then converting the files from Excel to CSV or some other format he could process in Python. Of course, this was time spent not doing research.

All of this would have been a moot point if the data had been stored as CSV or plain text. I can open and process data stored in CSV on any operating system with a large number of tools, for free. And I am confident in 10 years time, I will be able to do the same.
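To make the point concrete, here is a minimal sketch of processing CSV data with nothing but Python's standard library. The sample data, the column names, and the `mean_of_column` helper are hypothetical, invented for illustration; they are not taken from Deaton's article.

```python
import csv
import io

# Hypothetical sample data standing in for a real CSV file on disk.
SAMPLE = "trial,voltage_mv\n1,12.5\n2,13.1\n3,11.9\n"


def mean_of_column(csv_file, column):
    """Return the mean of a numeric column read from an open CSV file."""
    reader = csv.DictReader(csv_file)  # uses the first row as field names
    values = [float(row[column]) for row in reader]
    return sum(values) / len(values)


if __name__ == "__main__":
    # io.StringIO lets the sketch run without touching the file system;
    # with real data, pass open("some_file.csv", newline="") instead.
    print(mean_of_column(io.StringIO(SAMPLE), "voltage_mv"))
```

No license, evaluation copy, or export step is involved: the file is plain text, and the `csv` module ships with Python.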

This is a problem we face when we need to work with old data, as Deaton is doing. It's a problem we face when working with current data, too. I wrote recently about how doing a basic task of my job, such as scheduling courses each semester, in a spreadsheet gets in the way of getting the job done as well as I might.

Had Deaton not taken the time to liberate his data, things could have been worse in the long run. Not only would the data have been unavailable to his current project, but it may well have fallen into disuse forever and eventually disappeared.

Kari Kraus wrote this week about the problem of data disappearing. One problem is the evolution of media:

When you saved that unpublished manuscript on [a 5-1/4" floppy disk], you figured it would be accessible forever. But when was the last time you saw a floppy drive?

Well, not a 5-1/4". I do have a 3-1/2" USB floppy drive at home and another in my office. But Kraus has a point. Most of the people creating data aren't CS professionals or techno-geeks. And while I do have a floppy drive, I never use it for my own data. Over the years, I've been careful to migrate my data, from floppy drives to zip drives, from CDs to large, replicated hard drives. Eventually it may live somewhere in the cloud, and I will certainly have to move it to the next new thing in hardware sometime in the future.

Deaton's problem wasn't hardware, though, and Kraus points out the bigger problem: custom-encoded data from application software:

If you don't have a copy of WordPerfect 2 around, you're out of luck.

The professional data that I have lost over the years hasn't been "lost" lost. The problem has always been with software. I, too, have occasionally wished I had a legacy copy of WordPerfect lying around. My wife and I created a lot of files in pre-Windows WordPerfect back in the late 1980s, and I continued to use Mac versions of WP through the 1990s. As I moved to newer apps, I converted most of the files over, but every once in a while I still run across an old file in .wp format. At this point, it is rarely anything important enough to devote the time Deaton spent on his conversion experience. I choose to let that data die.

Fortunately, not all of my data from that era was encoded. I wrote most of my grad school papers in nroff. That's also how I created our wedding invitations.

This is a risk we run as more of our world moves from paper to digital, even when it's just entertainment.

Fortunately, for the last 5 years or so, I've been storing more and more of my data in plain text or, short of that, rich text. Like Deaton, I am pretty confident that I will be able to read and process that data 10 years hence. And, whenever possible, I have used an open-file-formats-only policy with my colleagues.

Rather than having to liberate data in the future, it is wiser to let it live free from birth. That reduces friction now and later. Deaton offers a set of preferences that can help you keep your data as free as possible:

Open source beats closed source.

Ubiquitous beats niche software.

Automation/scripting beats manual processes.

Plain text beats binaries.

READMEs in every project directory.
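As a small illustration of scripting over manual work, the following sketch converts every tab-delimited file in a directory to CSV in a single pass. It is not Deaton's process: the `.tsv` extension, the directory names, and the `convert_tsv_to_csv` helper are assumptions made for the example, and only Python's standard library is used.

```python
import csv
from pathlib import Path


def convert_tsv_to_csv(src_dir, dest_dir):
    """Convert every .tsv file in src_dir to a .csv file in dest_dir.

    Returns the list of CSV paths written, in sorted order.
    """
    src, dest = Path(src_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    for tsv_path in sorted(src.glob("*.tsv")):
        csv_path = dest / (tsv_path.stem + ".csv")
        with tsv_path.open(newline="", encoding="utf-8") as infile, \
                csv_path.open("w", newline="", encoding="utf-8") as outfile:
            reader = csv.reader(infile, delimiter="\t")
            # csv.writer quotes any field that contains a comma.
            csv.writer(outfile).writerows(reader)
        written.append(csv_path)
    return written


# Hypothetical usage: convert_tsv_to_csv("exported_data", "csv_data")
```

Once such a script exists, rerunning the conversion on new files costs one command rather than an afternoon of clicking through export dialogs.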

That third bullet is good advice even if you are not a computer scientist. Deaton isn't. But you don't have to be a computer scientist to reap the benefits of a little programming!
-----