5 Tips for Preserving Your Data Long-Term
In celebration of World Digital Preservation Day 2020 on November 5, we’re sharing a series of posts by University of Pittsburgh Library System librarians and archivists that highlight their expertise and work to preserve the digital!
This post was written by Dominic Bordelon, Research Data Librarian
Like academics everywhere, at the University of Pittsburgh we hope to make valuable contributions to our fields through our publications, which we can expect will outlive us. More recently, thanks to new technological possibilities, we have turned our attention to how other research outputs, such as data and software code, can also be stored for posterity.
How can you get started? Here are five tips from Pitt Libraries that you can begin using right away.
1. Use open file formats
Open file formats are those that are widely adopted, well documented, and unhindered by proprietary restrictions that monopolize the creation, editing, or reading of files. These are formats like CSV (comma-separated values) for tabular data, plain text (.txt) for qualitative data, or PNG (Portable Network Graphics) for images. Proprietary formats tend to create a barrier to access and may even face obsolescence should the vendor go out of business. Both factors reduce the likely longevity of the files’ contents.
For example, users of IBM’s SPSS statistical software will be familiar with .sav files for their data and analyses. However, .sav is a binary format rather than a character-based one, unreadable without special software (such as SPSS). Nor has IBM published official documentation for community use. Consider instead (or in addition) depositing a version of your data in CSV format, which should be easily readable to any future users.
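As a minimal sketch of what a character-based deposit looks like, the Python below writes a small table as CSV using only the standard library and then reads it back. The records and column names are hypothetical stand-ins for your own data; nothing here is specific to SPSS.

```python
import csv
import io

# Hypothetical example records; in practice these come from your own dataset.
rows = [
    {"participant_id": "P001", "age": 34, "score": 7.5},
    {"participant_id": "P002", "age": 29, "score": 6.0},
]


def write_csv(records, handle):
    """Write a list of dicts as CSV, with a header row taken from the first record."""
    writer = csv.DictWriter(handle, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)


# Write to an in-memory text buffer so the example needs no files on disk.
buffer = io.StringIO()
write_csv(rows, buffer)
csv_text = buffer.getvalue()

# The result is ordinary text that any editor or language can open.
print(csv_text)

# Reading it back needs no special software; note that values return as strings.
restored = list(csv.DictReader(io.StringIO(csv_text)))
print(restored[0]["participant_id"], restored[0]["age"])
```

To write a real file instead of a buffer, pass a handle such as open("data.csv", "w", newline="", encoding="utf-8") to write_csv. Because CSV stores every value as text, it is worth documenting each column's intended type alongside the file.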
To find out more about the preservability of the file formats you use, and to see which are recommended, consult the Library of Congress’ Recommended Formats Statement.
"Keat takes notes" by geekcalendar is licensed under CC BY 2.0
2. Describe and annotate your dataset
In order for your data to be useful in the future, readers will need to be able to make sense of it. Data does not usually explain itself. What does the abbreviation in this column name mean? If an instrument was used to record your data, what model? What steps did you follow in your lab to run the experiment? The answers to these questions have important implications for researchers who want to replicate your study or integrate your data in a new study of their own.
There are several ways you can describe the important context around your data:
A detailed abstract in your data repository, and completion of all appropriate metadata fields
Data dictionaries and codebooks which describe column names and values
Documentation of your research protocols (perhaps with a tool like protocols.io)
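A data dictionary can be started semi-automatically. The sketch below is one possible approach, not a standard: it reads a CSV header, collects a few example values per column, and leaves the description and units blank for you to fill in by hand. The function name draft_data_dictionary and the sample columns are hypothetical.

```python
import csv
import io


def draft_data_dictionary(csv_handle, max_examples=3):
    """Return one entry per column: name, example values, and blank fields to fill in."""
    reader = csv.DictReader(csv_handle)
    columns = reader.fieldnames or []
    examples = {name: [] for name in columns}
    for row in reader:
        for name in columns:
            value = row[name]
            # Keep up to max_examples distinct, non-empty values per column.
            if value and value not in examples[name] and len(examples[name]) < max_examples:
                examples[name].append(value)
    return [
        {
            "column": name,
            "examples": "; ".join(examples[name]),
            "description": "",  # to be written by the researcher
            "units": "",        # to be written by the researcher
        }
        for name in columns
    ]


# Hypothetical sample data standing in for a real file.
sample = io.StringIO("pid,temp_c,site\nP001,21.5,A\nP002,19.0,B\n")
dictionary = draft_data_dictionary(sample)
for entry in dictionary:
    print(entry)
```

The example values only jog your memory; the descriptions, such as what "pid" abbreviates or which instrument recorded "temp_c", still have to come from you.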
3. For software, document your dependencies and computing environment
When you run code, it’s important to know what needs to be installed for it to work properly. Which version of Python did you use? If you used a library like Astropy in Python or osmdata in R in your analysis, what version of the library did you use? Without this information, it might be difficult, or even impossible, for future users to run your code and to be confident that they are running it as intended. You can record this information in a plain text file deposited alongside your code, but look also at tools like Docker (or the Dockter project for researchers specifically) to containerize and document your environment.
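For Python projects, a text-file record of the environment can be produced with the standard library alone. The following is a minimal sketch, assuming Python 3.8 or later; the function name describe_environment and the output layout are illustrative choices, not an established format.

```python
import platform
from importlib import metadata


def describe_environment():
    """Return lines naming the Python version, the platform, and installed packages."""
    lines = [
        f"# Python {platform.python_version()}",
        f"# Platform: {platform.platform()}",
    ]
    packages = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        # Skip any distribution whose metadata is incomplete.
        if name and version:
            packages.add(f"{name}=={version}")
    lines.extend(sorted(packages, key=str.lower))
    return lines


snapshot = describe_environment()
print("\n".join(snapshot))
```

Saving the printed lines to a file kept next to your code gives future users a starting point for rebuilding the environment. The list covers everything installed in the current environment, so running it inside a project-specific virtual environment keeps the record focused on what your analysis actually used.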
4. Deposit your data in a trustworthy repository
When choosing a data repository, consider how it is maintained and whether its maintainers seem to have plans for the future. You can find much of this information on the repository’s About pages. For example, is the repository run at a large research institution by a team of dedicated staff, or, at the other extreme of that spectrum, is it a lone researcher’s side project? Do you trust that the repository, or at least its owner, will exist in ten or twenty years? If run by a private company, does it seem well-established with many ties to the academic community? Do persistence and preservation seem to be high priorities for the repository? While other factors might affect one’s choice of repository, we should hold this sense of “trustworthiness” high on the list.
CoreTrustSeal is an organization that certifies research data repositories as trustworthy, i.e., apparently sustainable and stewardship-oriented. Choosing a repository from its list of certified repositories is a safe bet. If the repository in question is not CoreTrustSeal certified, your local data librarians (for example, at Pitt, the ULS Digital Scholarship Services team and the HSLS Data Services team) can help you evaluate the repository.
"elephant ears." by brittanyhock is licensed under CC BY-NC 2.0
5. Dark archive your dataset in your institutional repository
Sharing is great, but preservation is important too. The practice of “dark archiving” is simply depositing material in a nonpublic repository, for purely preservationist purposes. If you are planning to share your data in an open repository, consider also investigating whether your institution has a repository where you could dark archive an additional copy. The idea is that, should the open repository eventually fail, the data could still be restored from the dark archive, and then pointers to the open deposit such as DOIs could be redirected to the restored copy.
Why dark? If your dataset is hosted in multiple places online, some users might find it confusing, especially without knowing any rationale. Intellectual property ownership may also be unclear. Furthermore, the user may reasonably wonder whether the two copies are truly identical.
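If you ever need to confirm that a dark-archived copy matches the open deposit, for instance before restoring from it, one common technique is to compare checksums of the two files. The sketch below uses SHA-256 from Python's standard library; the file names are hypothetical, and temporary files stand in for the two deposits.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute a file's SHA-256 checksum, reading in chunks to handle large files."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(path_a, path_b):
    """Return True when both files have the same SHA-256 checksum."""
    return sha256_of(path_a) == sha256_of(path_b)


# Demonstration: two temporary files stand in for the open and dark copies.
with tempfile.TemporaryDirectory() as tmp:
    open_copy = Path(tmp) / "open_copy.csv"
    dark_copy = Path(tmp) / "dark_copy.csv"
    open_copy.write_bytes(b"pid,score\nP001,7.5\n")
    dark_copy.write_bytes(b"pid,score\nP001,7.5\n")
    identical = copies_match(open_copy, dark_copy)

    # Changing a single character in one copy changes its checksum.
    dark_copy.write_bytes(b"pid,score\nP001,7.6\n")
    differs = not copies_match(open_copy, dark_copy)

print(identical, differs)
```

Recording the checksum at deposit time, for example in your README or metadata, lets anyone repeat the comparison later without needing access to both copies at once.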
Your institution may not advertise a “dark archive,” but look instead for your general institutional repository, such as Pitt’s D-Scholarship.
Let me know how these tips work for you. Happy preserving!
“5 Tips for Preserving Your Data Long-Term” by Dominic Bordelon is licensed under Creative Commons Attribution-ShareAlike 4.0 (https://creativecommons.org/licenses/by-sa/4.0/).