Five Ways to Document Your Data

Overview

How understandable will your data be to other users, including yourself, in the future? This guide considers five common practices for documenting data, ranging from easy to more complex.

  1. Cite the source of your data
  2. Define your data using a data dictionary
  3. Describe the entire data package using a data specification or readme file
  4. Track data lineage to visualize your data processing pipeline
  5. Capture the reproducible environment and workflow of all changes made to your data

Did you know: UW–Madison has requirements for how we describe institutional data sources, elements, processes, integrations, and products? Use the data documentation standard when describing shared university data assets to ensure transparency, clarity, shared understanding, replicability, and ease of use.

1. Cite the source of your data

Simply naming the source of data used in a report or presentation can go a long way toward communicating transparency and trust in your findings. This could be in the form of a data citation or an acknowledgement below your table or graphic.

Example data source acknowledgment in a slideshow presentation or report

Data source figure example

Example citation in a reference list or bibliography

Smith, Tom W., Peter V. Marsden, and Michael Hout. General Social Survey, 1972-2010 Cumulative File. ICPSR31521-v1. Chicago, IL: National Opinion Research Center [producer]. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2011. Web. 23 Jan 2012. doi:10.3886/ICPSR31521.v1

Citations for data should at minimum include:

  • Creator(s) or contributor(s)
  • Date of publication
  • Title of dataset
  • Publisher
  • Identifier (e.g., Handle, ARK, DOI) or URL of source
  • Version, when appropriate
  • Date accessed, when appropriate

2. Define your data using a data dictionary

Providing a data definition and using it consistently removes ambiguity about the meaning of commonly used data fields or data elements (e.g., “Academic year” may have a different meaning depending on the institution) and aids data interpretation.

Tip: UW–Madison has 500+ official data definitions of elements used in institutional data sets and data sources. Search the RADAR Data Glossary.

A comprehensive list of definitions for a dataset is called a data dictionary. A data dictionary lists the short and long name of each element, documents the data format or units of measurement, explains any codes or allowable values, provides the full definition or description that helps users find and understand the data element, and, if appropriate, records any derivations.

Example data dictionary

Name      | Long name         | Measurement unit | Allowed values        | Definition
ID        | Student ID number | Numeric          | 0-999999              | ID number assigned to a student by an admitting office
group_num | Group number      | Numeric          | 200-900               | Group number defining student’s relationship to department
DOB       | Date of birth     | mm/dd/yyyy       | 01/01/1900-01/01/2099 | Student’s date of birth collected in the admissions application
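
A data dictionary is most useful when it is kept in a machine-readable form, because it can then be used to check the data itself. Below is a minimal sketch in Python; the file names dictionary.csv and students.csv, their column headings, and the range check are illustrative assumptions modeled on the example table above, not part of any UW–Madison standard.

import csv

# Load the data dictionary: one row per element, mirroring the example table above.
# (dictionary.csv is a hypothetical file with columns: name, long_name, unit, allowed_values, definition)
with open("dictionary.csv", newline="") as f:
    data_dictionary = {row["name"]: row for row in csv.DictReader(f)}

def allowed_range(spec):
    """Parse a numeric 'low-high' allowed_values entry into (low, high)."""
    low, high = spec.split("-")
    return float(low), float(high)

# Validate a dataset against the dictionary's allowed values.
# (students.csv is a hypothetical data file whose column names match the dictionary's short names)
with open("students.csv", newline="") as f:
    for i, record in enumerate(csv.DictReader(f), start=1):
        for name, value in record.items():
            entry = data_dictionary.get(name)
            if entry is None:
                print(f"row {i}: column '{name}' is not defined in the data dictionary")
            elif entry["unit"] == "Numeric":
                low, high = allowed_range(entry["allowed_values"])
                if not (low <= float(value) <= high):
                    print(f"row {i}: {name}={value} is outside the allowed range {entry['allowed_values']}")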

To increase compatibility with other similar datasets, use a metadata standard: when the data definitions used across different data sources follow the same schema, integrating those sources becomes much easier. Several metadata schemas are in common use (see the tip below).

Tip: UW–Madison Research Data Services offers a list of commonly used metadata standards for research.

3. Describe the entire data package using a data specification or readme file

A data specification goes one step further. It often includes the data dictionary, but also provides more meaning and context about the dataset. For example, a data specification may record basic metadata such as who created the dataset, for what purpose, methodologies used, and any access and reuse conditions. This information can be packaged as a plain text file (Example: a readme.txt file template available from Cornell University) or indexed as a record in a data catalog.

Things to include in your data specification:

  • Title
  • Creator(s), their affiliation, and contact information
  • Access and sharing
    • Access instructions
    • Licenses and terms of use
    • Software needed to use or understand the data
  • Data specific information
    • Data dictionary of elements
    • Overview of the contents and sources
    • Methodology of data collection, or load frequency (how often the load process is run)
    • Data protection and retention information
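
One way to make the checklist above concrete is to keep the specification fields in a small script that writes the readme file alongside the data. The sketch below is illustrative only: the field names, sample values, and the readme.txt output are assumptions, not a prescribed template (see the Cornell University template linked above for a fuller example).

from datetime import date

# Illustrative data-specification fields, following the checklist above.
spec = {
    "Title": "Undergraduate admissions snapshot",
    "Creator(s)": "Jane Doe (Office of the Registrar), jane.doe@example.edu",
    "Access instructions": "Available to authorized campus users via the shared drive",
    "License / terms of use": "Internal use only",
    "Software needed": "Any CSV-capable tool (e.g., Excel, R, Python)",
    "Contents and sources": "One row per applicant, extracted from the student information system",
    "Methodology / load frequency": "Snapshot taken on the tenth day of each term",
    "Data protection and retention": "Contains restricted data; retain for 7 years",
    "Date documented": date.today().isoformat(),
}

# Write the specification as a simple readme.txt that travels with the data files.
with open("readme.txt", "w") as f:
    for field, value in spec.items():
        f.write(f"{field}: {value}\n")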

Tip: Publish a data specification for your institutional dataset, data source, or data product in the UW–Madison RADAR data catalog.

4. Track data lineage to visualize your data processing pipeline

Data lineage not only describes where your data came from but also details all the transformations, calculations, and changes made along the way. Lineage is a good way to visualize the data pipeline, that is, how each data element was coded, transformed, and output as new data elements. Tools like Microsoft Azure or Google Cloud Data Fusion can help visualize this journey.

Example data lineage visualization from Microsoft Azure

There are several techniques to capture data lineage.

  • Source to target map: A low-tech solution is to capture all the transformations in a spreadsheet. You can add comments that detail each stage of the transformation for each data element in your dataset (see the sketch after this list).
  • Data codebook: Tools like SPSS and SAS can help you export a record of the variables, codes, and transformations used in the data. Or you can track your work on your own in a plain text file, like this codebook template from a Coursera course on Data Cleaning.

  • Electronic lab notebook (ELN): ELNs capture more than just data lineage, but they can be a great way to document the journey your data takes during the transformation process, and they also date- and timestamp discoveries (required for some patentable research). There are many ELN tools to choose from. See UW–Madison’s ELN software offerings that are free to campus users.
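
For the source-to-target map described above, even a short script can keep the mapping next to the code that applies it. The following is a minimal sketch in Python; the element names and transformations are hypothetical examples, and the source_to_target.tsv output file is an assumption.

# A simple source-to-target map: each target element records where it came from
# and what transformation produced it, so the lineage can be reviewed or shared.
source_to_target = [
    {"source": "birth_date", "target": "age_years",  "transformation": "years between birth_date and term start"},
    {"source": "group_num",  "target": "dept_group", "transformation": "recode 200-499 -> 'A', 500-900 -> 'B'"},
    {"source": "ID",         "target": "student_id", "transformation": "copied unchanged"},
]

# Export the map as a tab-separated file that can be opened as a spreadsheet.
with open("source_to_target.tsv", "w") as f:
    f.write("source\ttarget\ttransformation\n")
    for row in source_to_target:
        f.write(f"{row['source']}\t{row['target']}\t{row['transformation']}\n")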

5. Capture the reproducible environment and workflow of all changes made to your data

Many data analysis tools, such as notebooks (e.g., Jupyter, R Markdown) and version control systems (e.g., Git), will track the changes made during your workflow and present them in a user-friendly way.

For a dataset to be reproducible (i.e., someone else can follow your steps and get exactly the same result), your documentation must include not only a complete history of all the changes made to the data (computational transformations and analyses) but also the exact run environment, including the same operating system and software versions.

You can capture the complete analysis environment using tools such as containers (e.g., Docker) or package and environment managers (e.g., Conda, renv).
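
Short of a full container image, you can at least record the run environment alongside your outputs. Below is a minimal sketch using only the Python standard library; the environment.txt output file name is an assumption.

import platform
import sys
from importlib import metadata

# Record the operating system, Python version, and installed package versions
# so the analysis environment can be reconstructed later.
with open("environment.txt", "w") as f:
    f.write(f"OS: {platform.platform()}\n")
    f.write(f"Python: {sys.version.split()[0]}\n")
    for dist in sorted(metadata.distributions(), key=lambda d: (d.metadata["Name"] or "").lower()):
        f.write(f"{dist.metadata['Name']}=={dist.version}\n")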

Learn more about reproducible workflows

UW–Madison instructor Karl Broman (Biostatistics & Medical Informatics) teaches Software Carpentry and other training workshops on reproducible workflows. See “Steps toward reproducible research” slides.
