Bioinformatics Software Engineer for cancer research
PhD from the University of Michigan
Volunteered with Girls Who Code & Software Carpentry
Defining reproducibility
Reproducibility is the ability to repeat an analysis with the same methods on the same data and get the same result
Replicability is the ability to repeat an analysis with the same methods on different data and get the same result
Methods           | Same data       | Different data
Same methods      | Reproducibility | Replicability
Different methods | Robustness      | Generalizability
Why care about reproducibility?
Reproducible research practices enable others to validate your work and build on it
Is your work reproducible?
For peers in your field?
For your colleagues?
For yourself?
How to make your work reproducible
Project-oriented organization
Describe the analysis workflow
Define the dependencies
Version control & sharing
Types of (R) projects
One-off analysis script or notebook
Complex analysis with multiple scripts
R package
Project-oriented organization
Scenario: you are given a new project with a dataset to analyze. You create a new R Markdown / Quarto / Jupyter notebook for initial exploratory analysis.
Project-oriented organization: keep it contained
Create a separate, self-contained folder for each project.
Use a consistent subdirectory structure within the project folder
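As a minimal sketch of a self-contained project folder, the commands below create one project directory with a fixed set of subdirectories. The project name lung-cancer and the data/ folder echo the path example in these slides; the code/ and results/ folders and the README.md file are assumptions chosen for illustration, not a required layout.

```shell
set -eu

# Hypothetical project name; use one folder per project.
project_dir="lung-cancer"

# Assumed layout: raw inputs, scripts, and generated outputs kept apart.
mkdir -p "$project_dir/data" "$project_dir/code" "$project_dir/results"

# A README at the top level is a natural place to describe the project.
touch "$project_dir/README.md"

# Show what was created.
ls "$project_dir"
```

Whatever names you pick, reusing the same layout across projects means a script in code/ can always reach its inputs through a relative path such as ../data/.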
library(tidyverse)
# relative path from R script
rna_counts <- read_csv('../data/RNAseq-counts.csv')
# absolute path
rna_counts <- read_csv('/Users/myusername/projects/lung-cancer/data/RNAseq-counts.csv')
Automate the analysis workflow
For complex analyses, workflow managers orchestrate the execution and scale resources up or down as needed
Snakemake
Nextflow
drake
Describe the dependencies
If you use any packages outside of base R, you’ll need to know which packages and (possibly) which versions
Look for library() in your scripts and direct calls such as dplyr::filter()
List them in your README file
Dependencies
This project requires R >= 4.0 and the following packages:
dplyr
ggplot2
here
readr
tidymodels
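One way to look for library() calls and pkg::function() calls is a quick search from the command line. The sketch below is an approximation under stated assumptions: it writes a small hypothetical analysis.R so that it is self-contained, it only matches unquoted calls such as library(dplyr), and it will miss packages loaded in other ways (for example require() or quoted names).

```shell
set -eu

# Hypothetical example script so the sketch runs on its own.
workdir="$(mktemp -d)"
cat > "$workdir/analysis.R" <<'EOF'
library(dplyr)
library(ggplot2)
counts <- readr::read_csv("data/RNAseq-counts.csv")
EOF

# Collect package names from library(pkg) and pkg:: usages, then de-duplicate.
packages="$(
  {
    grep -ohE 'library\([A-Za-z0-9.]+\)' "$workdir"/*.R | sed -E 's/library\(([^)]+)\)/\1/'
    grep -ohE '[A-Za-z0-9.]+::' "$workdir"/*.R | sed -E 's/:://'
  } | sort -u
)"

echo "$packages"
```

Point the two grep commands at your own scripts instead of the temporary folder, and treat the result as a starting list to check by hand before copying it into the README.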
Describe the dependencies in a DESCRIPTION file
Your project doesn’t have to be an R package to use a DESCRIPTION file. Create one with usethis::use_description(check_name=FALSE):
Package: repro-r
Title: What the Package Does (One Line, Title Case)
Version: 0.0.0.9000
Authors@R:
    person("First", "Last", , "first.last@example.com", role = c("aut", "cre"))
Description: What the package does (one paragraph).
License: `use_mit_license()`, `use_gpl3_license()` or friends to pick a
    license
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.2
Add packages to your DESCRIPTION file with usethis::use_package('packagename')
Imports:
    dplyr,
    ggplot2,
    lubridate (>= 1.2)
Install your project’s dependencies with devtools::install_deps()
Describe the dependencies with an environment manager
If you need…
Different package versions for different projects
Packages for other languages too 🐍
An environment manager will help!
renv - primarily for R but can also handle Python. Works seamlessly with DESCRIPTION. Package versions are recorded in renv.lock
conda, mamba - Python, R, CLI tools, anything you can find at anaconda.org. The environment is described in environment.yml
Docker & Singularity containers - if you need to control the operating system too
And document how to restore the environment in your README
Dependencies
Dependencies are listed in DESCRIPTION and can be installed with renv::restore()
Dependencies
Dependencies are listed in environment.yml and can be installed with mamba env create -f environment.yml
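For the conda/mamba route, a minimal sketch of what an environment.yml might contain is shown below, written from the shell so the file can be created in one step. The environment name, the conda-forge channel, and the r- prefixed package names are assumptions for illustration; check that each package you need is actually available in the channel you use.

```shell
set -eu

# Write a hypothetical environment.yml in the current project folder.
cat > environment.yml <<'EOF'
name: lung-cancer
channels:
  - conda-forge
dependencies:
  - r-base>=4.0
  - r-dplyr
  - r-ggplot2
  - r-readr
EOF

# To build the environment (requires mamba to be installed):
#   mamba env create -f environment.yml

# Show the file that was written.
cat environment.yml
```

Keeping this file under version control alongside the README gives collaborators a single command to recreate the environment.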
Use version control
Help yourself keep track of how your project changes over time
Facilitate collaboration with others
Use version control: git + GitHub
git is a command line program for tracking changes in a project. Each project is its own repository (“repo”) which has a history of changes.
GitHub is a website that hosts git repositories for sharing and collaboration.
The only git commands you strictly need
git init – create a new git repo in your project directory
git add <files> – add files to be tracked by git
git commit -m 'add heatmap script' – commit changes to the repo history
The additional git commands you need if you use GitHub
git push – upload commits from your local repo to GitHub
git pull – download commits from GitHub to your local repo
The additional git commands you need if you use GitHub for collaboration
git branch – create a new branch
git switch – switch to a branch
Use GitHub on the web to combine branches with pull requests
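The local commands above can be tried end to end in a throwaway folder. This is a sketch under assumptions: the file heatmap.R, the branch name new-figure, and the user name and email are hypothetical, and git switch needs a reasonably recent git (2.23 or later). git push and git pull are left as comments because they need a remote repository on GitHub.

```shell
set -eu

# Work in a temporary folder so nothing real is touched.
repo="$(mktemp -d)"
cd "$repo"

# Create a new repo in the project directory.
git init --quiet

# Hypothetical script to track.
printf '# placeholder for a plotting script\n' > heatmap.R
git add heatmap.R

# Identity is passed inline here only so the sketch runs on a fresh machine.
git -c user.name='Example User' -c user.email='user@example.com' \
  commit --quiet -m 'add heatmap script'

# Create a branch for new work and switch to it.
git branch new-figure
git switch --quiet new-figure

# With a GitHub remote configured, you would then run:
#   git push
#   git pull

git log --oneline
```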
Conclusion
Project-oriented organization
Describe the analysis workflow
Define the dependencies
Version control & sharing
Document everything
Reproducibility is all about communicating with your collaborators
My analysis is simple, do I really need all this?
No! Pick and choose what works for you and your peers