Reproducible Research Checklist

What to Do and What Not to Do

Roger D. Peng, Associate Professor of Biostatistics
Johns Hopkins Bloomberg School of Public Health

DO: Start With Good Science

Garbage in, garbage out
A coherent, focused question simplifies many problems
Working with good collaborators reinforces good practices
Something that's interesting to you will (hopefully) motivate good habits

DON'T: Do Things By Hand

Editing spreadsheets of data to "clean it up"
- Removing outliers
- QA / QC
- Validating
Editing tables or figures (e.g. rounding, formatting)
Downloading data from a web site (clicking links in a web browser)
Moving data around your computer; splitting / reformatting data files
"We're just going to do this once...."

Things done by hand need to be precisely documented (this is harder than it sounds)
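As a minimal sketch of the alternative, the cleaning rule can be written as code instead of being applied by editing a spreadsheet. The data frame, the column names, and the 3-MAD outlier rule below are all assumptions made for illustration, not a recommended rule:

```r
# Hypothetical raw data; in practice this would be read from a file
raw <- data.frame(id = 1:6,
                  value = c(10.2, 9.8, 10.5, 250.0, 10.1, NA))

# Illustrative rule (an assumption): drop missing values and anything more
# than 3 median absolute deviations away from the median
center <- median(raw$value, na.rm = TRUE)
spread <- mad(raw$value, na.rm = TRUE)
keep <- !is.na(raw$value) & abs(raw$value - center) <= 3 * spread

clean <- raw[keep, ]
removed <- raw[!keep, ]   # a record of exactly which rows were dropped
```

Because the rule lives in the script, anyone can see which rows were removed and rerun the same step on the raw file.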

DON'T: Point And Click

Many data processing / statistical analysis packages have graphical user interfaces (GUIs)
GUIs are convenient / intuitive but the actions you take with a GUI can be difficult for others to reproduce
Some GUIs produce a log file or script which includes equivalent commands; these can be saved for later examination
In general, be careful with data analysis software that is highly interactive; ease of use can sometimes lead to non-reproducible analyses
Other interactive software, such as text editors, is usually fine

DO: Teach a Computer

If something needs to be done as part of your analysis / investigation, try to teach your computer to do it (even if you only need to do it once)
In order to give your computer instructions, you need to write down exactly what you mean to do and how it should be done
Teaching a computer almost guarantees reproducibility

For example, by hand, you can

Go to the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/
Download the Bike Sharing Dataset by clicking on the link to the Data Folder, then clicking on the link to the zip file of the dataset, choosing "Save Linked File As...", and saving it to a folder on your computer

DO: Teach a Computer

Or you can teach your computer to do the same thing using R:

# The URL must be one unbroken string; the ProjectData directory must already exist
download.file("http://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip",
              "ProjectData/Bike-Sharing-Dataset.zip")

Notice here that

The full URL to the dataset file is specified (no clicking through a series of links)
The name of the file saved to your local computer is specified
The directory in which the file was saved is specified ("ProjectData")
The code can always be executed in R (as long as the link is available)
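A slightly fuller sketch of the same idea wraps the download.file() call in a helper that also creates the target directory and notes where and when the file was retrieved. The function name fetch_data and the ".source.txt" side file are hypothetical choices for this example:

```r
# Hypothetical helper: download a file once and log its source and retrieval time
fetch_data <- function(url, dest) {
  # Create the destination directory if it does not exist yet
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  if (!file.exists(dest)) {
    download.file(url, destfile = dest, mode = "wb")
    # Record the URL and the retrieval time next to the data file
    writeLines(c(url, format(Sys.time(), tz = "UTC", usetz = TRUE)),
               paste0(dest, ".source.txt"))
  }
  invisible(dest)
}

# Usage with the URL from the example above (requires network access):
# fetch_data("http://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip",
#            "ProjectData/Bike-Sharing-Dataset.zip")
```

Skipping the download when the file already exists is a design choice for this sketch; it keeps reruns fast but means the local copy is never refreshed automatically.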

DO: Use Some Version Control

Slow things down
Add changes in small chunks (don't just do one massive commit)
Track / tag snapshots; revert to old versions
Services like GitHub / Bitbucket / SourceForge make it easy to publish results

DO: Keep Track of Your Software Environment

If you work on a complex project involving many tools / datasets, the software and computing environment can be critical for reproducing your analysis
Computer architecture: CPU (Intel, AMD, ARM), GPUs
Operating system: Windows, Mac OS, Linux / Unix
Software toolchain: Compilers, interpreters, command shell, programming languages (C, Perl, Python, etc.), database backends, data analysis software
Supporting software / infrastructure: Libraries, R packages, dependencies
External dependencies: Web sites, data repositories, remote databases, software repositories
Version numbers: Ideally, for everything (if available)

DO: Keep Track of Your Software Environment

sessionInfo()

## R version 3.0.2 Patched (2014-01-20 r64849)
## Platform: x86_64-apple-darwin13.0.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  base     
## 
## other attached packages:
## [1] slidify_0.3.3
## 
## loaded via a namespace (and not attached):
## [1] evaluate_0.5.1 formatR_0.10   knitr_1.5      markdown_0.6.3
## [5] stringr_0.6.2  tools_3.0.2    whisker_0.3-2  yaml_2.1.8
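One way to keep this information with the project, sketched here under the assumption that a plain text file is enough, is to write the sessionInfo() output to disk each time the analysis runs. The file name "sessionInfo.txt" is an arbitrary choice:

```r
# Write the session details to a text file kept alongside the analysis code
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")

# Individual version strings can also be recorded programmatically
r_version <- R.version.string
stats_version <- as.character(packageVersion("stats"))
```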

DON'T: Save Output

Avoid saving data analysis output (tables, figures, summaries, processed data, etc.), except perhaps temporarily for efficiency purposes.
If a stray output file cannot be easily connected with the means by which it was created, then it is not reproducible.
Save the data + code that generated the output, rather than the output itself
Intermediate files are okay as long as there is clear documentation of how they were created
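As a rough sketch of a documented intermediate file, the helper below reuses a cached result if it exists and otherwise rebuilds it from code, so the file is always tied to the function that creates it. The helper name cached, the file name, and the use of the built-in cars data as stand-in raw data are assumptions for illustration:

```r
# Hypothetical cache helper: reuse an intermediate file if present,
# otherwise rebuild it from the code that defines it
cached <- function(path, build) {
  if (file.exists(path)) return(readRDS(path))
  result <- build()
  saveRDS(result, path)
  result
}

# The built-in 'cars' data stands in for a project's raw data
cache_file <- file.path(tempdir(), "speed_summary.rds")
speed_summary <- cached(cache_file, function() {
  aggregate(dist ~ speed, data = cars, FUN = mean)
})
```

Deleting the cached file and rerunning the script should give back the same table, which is a quick check that the output really can be reconstructed from data + code.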

DO: Set Your Seed

Random number generators generate pseudo-random numbers based on an initial seed (usually a number or set of numbers)
- In R you can use the set.seed() function to set the seed and to specify the random number generator to use
Setting the seed allows for the stream of random numbers to be exactly reproducible
Whenever you generate random numbers for a non-trivial purpose, always set the seed
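A small illustration, with the seed value 10 chosen arbitrarily:

```r
set.seed(10)
x <- rnorm(5)

set.seed(10)
y <- rnorm(5)

identical(x, y)   # TRUE: same seed, same stream of numbers

# The generator can also be named explicitly through the 'kind' arguments
set.seed(10, kind = "Mersenne-Twister", normal.kind = "Inversion")
z <- rnorm(5)
```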

DO: Think About the Entire Pipeline

Data analysis is a lengthy process; it is not just tables / figures / reports
Raw data → processed data → analysis → report
How you got to the end is just as important as the end itself
The more of the data analysis pipeline you can make reproducible, the better for everyone
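The chain raw data → processed data → analysis → report can be sketched as one script in which each stage is a function. The function names and the use of the built-in airquality data as stand-in raw data are assumptions for illustration:

```r
# Each stage is a function so the whole chain can be rerun from the raw data
process_data <- function(raw) raw[complete.cases(raw), ]          # drop incomplete rows
analyze <- function(processed) lm(Ozone ~ Temp, data = processed) # fit a simple model
report <- function(fit) round(coef(summary(fit)), 3)              # coefficient table

raw <- airquality            # stand-in for the project's raw data
processed <- process_data(raw)
fit <- analyze(processed)
results <- report(fit)
```

Running the script from top to bottom regenerates every step, so nothing in the final table depends on work done outside the code.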

Summary: Checklist

Are we doing good science?
Was any part of this analysis done by hand?
- If so, are those parts precisely documented?
- Does the documentation match reality?
Have we taught a computer to do as much as possible (i.e. coded)?
Are we using a version control system?
Have we documented our software environment?
Have we saved any output that we cannot reconstruct from original data + code?
How far back in the analysis pipeline can we go before our results are no longer (automatically) reproducible?