Overview

Reproducible scholarship is hard, but important for the scientific record and for saving yourself a lot of pain down the line. The same is true in business: if your software makes you money and it breaks down, it stops making you money.

Computational reproducibility requires controlling the code, data and software environment (Buckheit and Donoho 1995; Donoho 2010, 385). Docker is a very helpful but complex solution to reproducibility in R, and Rocker Project images (Boettiger and Eddelbuettel 2017; Nüst et al. 2020) make the process easier. However, the r-ver line of images does not version-lock R packages for the latest version of R.

This problem can persist for a long time if one relies too much on the Docker build cache, as I did.

Possible solutions to the Rocker version-lock problem are:

  1. Use an r-ver image with the non-latest R version
  2. Pull images often
  3. Set the CRAN mirror date explicitly in the Dockerfile (my preference)

The failure to ensure reproducibility can lead to catastrophic errors and wild debugging goose chases.

Read on for a thrilling tale of what can go wrong if you don’t read the documentation carefully.

Reproducible Scholarship

Those who follow my work know that I spend a lot of time and effort to ensure that my scholarship is reproducible. This quote earned a permanent home in the Code section on my website:

An article about computational results is advertising, not scholarship. The actual scholarship is the full software environment, code and data, that produced the result. [Buckheit and Donoho (1995); Donoho (2010), 385]1

Reproducibility is not an easy problem and it means different things for different people in different disciplines at different times, but this zinger is a good start.

Most people who teach and advocate for reproducibility spend a lot of time getting people to write clear analysis pipelines in computer code and, most importantly, getting scholars to publish that code so others can check whether it really does what it is supposed to do. This is important and we should keep doing it. Especially in Empirical Legal Studies and Legal Data Science this is not yet the default.

Data cleaning and curation is just as important, but it is taught less often and valued less. Everyone likes using “advanced mathematics”, “state-of-the-art statistical techniques” and “groundbreaking artificial intelligence models” for their fancy research, because it’s just not the same kind of awesome to talk about how you wrestled those 20 date formats into nice ISO YYYY-MM-DD form and to explain the regex that created all those individual variables you fed into your model. We should be talking more about data curation, we should spend more time on it and we should publish all the data we legally and ethically can, but that is not what this post is about.

This post is about software environments in computational research and how they are important even if everything else is done right. Ensuring a reproducible software environment is partially about principles and supporting the scientific record, but most importantly it is about saving yourself a lot of (avoidable!) trouble down the line.

Note

By the way, computational reproducibility is also a business problem. If you are offering a service or product that earns you money, you want that product/service to keep working and keep making you money instead of randomly breaking down and making your customers look for their pitchforks.

Remember when CrowdStrike crashed half the world in 2024?

The Software Stack in R Projects

In modern software there are always dependencies. No one spells out the ones and zeros for the machine by hand, so we write towers of software built on towers of software built on towers of software. And these dependencies change. Upstream authors fix bugs, introduce new features, change interfaces, patch security vulnerabilities and so on. This is a good thing, in general.

However, it is also a problem. If the behavior of dependencies changes, so does the behavior of our own code. Often this is not a problem, but often enough it is.

For data analysis workflows in R these are some of the most important dependency layers:

  • R packages, e.g. {quanteda} for text analysis
  • The R language itself
  • System dependencies of R packages, e.g. Tesseract for {tesseract}
  • System dependencies of the R language, e.g. the Basic Linear Algebra Subprograms (BLAS)
  • The operating system
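
Most of these layers can be inspected directly from a shell, which is handy when checking what an environment actually contains. A minimal sketch (it assumes R, {quanteda} and Tesseract are installed on a Linux system; adapt the calls to your own stack):

# R version, OS, and the BLAS/LAPACK libraries R is linked against
Rscript -e 'print(sessionInfo())'

# Version of an individual R package, e.g. {quanteda}
Rscript -e 'print(packageVersion("quanteda"))'

# Version of a system dependency, e.g. Tesseract for {tesseract}
tesseract --version

# Operating system release
cat /etc/os-release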

Docker Environments and Rocker Images

There are many technical solutions to maintaining a reproducible computational environment in R. The core mechanism is to lock software versions of dependencies in place and record any changes. {renv} is a popular option, but it only locks the versions of R packages. {rix} is a new arrival to CRAN that also locks the R version and system dependencies.
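
As a rough sketch of the {renv} approach (assuming {renv} is installed; the lockfile it writes records exact package versions that can be reinstalled later):

Rscript -e 'renv::init()'      # set up a project library and create renv.lock
Rscript -e 'renv::snapshot()'  # record the currently used package versions in renv.lock
Rscript -e 'renv::restore()'   # reinstall the recorded versions later or on another machine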

My requirements are a little more challenging so I use Docker to lock the entire software stack from the OS upwards in place. Docker is constructed around a few core concepts:

  • A container is the actual environment in which the code runs. These should be treated as disposable and re-created daily, hourly or after every run.
  • An image is the blueprint for creating a container. Creating a container from an image is essentially just reading from the hard drive and usually takes a few seconds at most.
  • A Dockerfile is the blueprint for creating an image from layered instructions. Building an image usually involves downloading a base image with a particular OS, installing and compiling software and adding config options. This can take minutes or hours.
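
In day-to-day use these three concepts map onto two commands. A minimal sketch, where pipeline-image is a placeholder name and targets::tar_make() stands in for whatever command runs your pipeline:

# Build an image named "pipeline-image" from the Dockerfile in the current directory
docker build -t pipeline-image .

# Create a disposable container from the image, run the pipeline, remove the container afterwards
docker run --rm pipeline-image Rscript -e 'targets::tar_make()'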

Obviously, creating a reproducible R software stack from scratch is a huge amount of work, so I rely on the Rocker Project’s Docker images for R (Boettiger and Eddelbuettel 2017; Nüst et al. 2020) as a baseline. The r-ver line of images freezes the OS and R versions, locks all system libraries to a date 90 days past the Ubuntu version release, and pins the R packages to a CRAN snapshot date linked to the R version.

With this foundation I specify a Dockerfile that installs TeX packages and system dependencies for R, compiles Tesseract from source and installs R packages. See this Dockerfile for an example. I re-create containers several times a day, but prefer to rebuild the image only if the Dockerfile changes and use the Docker build cache to save time.

The Limits of the Rocker R Package Version Freeze

Now, I said that the Rocker images freeze the versions of all R packages at a certain date, as available on CRAN. This is true, but unfortunately not always. The documentation says this:

Non-latest R version images installs all R packages from a fixed snapshot of CRAN mirror at a given date. This setting ensures that the same version of the R package is installed no matter when the installation is performed. (Source)

Note that it says “non-latest R version images”. So with the image containing the latest version of R — even in the reproducible r-ver line of images — the CRAN mirror will NOT be locked to a fixed date.

Why is this a problem?

When the CRAN mirror is not locked to a date the R package manager will keep installing the latest versions of R packages. See below for how this caused a catastrophic failure in one of my projects.

If the Docker image is built from cache and the base image is not refreshed this behavior can persist for a very long time. This also happened to me.
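
A quick way to see whether an image is actually date-locked is to ask R inside a container which repository it is configured to use. A sketch of the check (the tag and the exact URL printed depend on the image you are using):

# Print the CRAN repository baked into the image:
# a URL ending in a date means the snapshot is pinned,
# a URL ending in "latest" means new package versions will keep coming in
docker run --rm rocker/r-ver:4.4.0 Rscript -e 'getOption("repos")'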

Solutions to Ensure Date-Lock of R Packages

Fortunately there are some easy solutions to the r-ver package version freezing problem. The hardest part is knowing that the problem is even there.

Solution 1: Use Non-Latest R Version

The easiest and best option is to use an r-ver image with the non-latest version of R.

I recommend this for everyone.

At the time I couldn’t do this, because I needed to upgrade to R 4.4.0 quickly due to a major security vulnerability.

Solution 2: Pull Images Often

A different option would be to always pull a fresh image (e.g. daily) when running the pipeline. The R version in the project will become “non-latest” after a few months and then fresh images from Rocker will have the package date lock in place. This can be ensured by adding the --pull and --no-cache flags to the docker build command.
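
For example, something along these lines (pipeline-image is a placeholder tag):

# Re-download the base image and ignore every cached layer on each build
docker build --pull --no-cache -t pipeline-image .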

My pipelines always attempt to rebuild the Docker image before they run, but I rely on the cache a lot to ensure speedy development. Rebuilding images without the cache before every run would hit me with 15-minute delays all the time.

I don’t recommend this option because re-pulling the base image will waste a lot of time and resources. It will also not version-lock packages in the period where the R version is still the latest.

Solution 3: Set CRAN Mirror Date Explicitly

The solution I opted for was to set the CRAN mirror date explicitly in the Dockerfile. The Rocker Project conveniently provides a script for doing so. The following line will do the trick (in this case setting the date to 2024-06-13):

RUN /rocker_scripts/setup_R.sh "https://packagemanager.posit.co/cran/__linux__/jammy/2024-06-13"

I prefer this option for my personal workflows because:

  • I might want to adjust the CRAN mirror date without changing the R version
  • It ensures that the date lock is in place even if I am forced to choose the latest r-ver image

The full Dockerfile, including use of ARG variables for clarity, looks like this:

# Build Arguments
ARG R_VERSION="4.4.0"
ARG R_CRAN_MIRROR="https://packagemanager.posit.co/cran/__linux__/jammy/2024-06-13"

# Base Layer
FROM rocker/r-ver:${R_VERSION}

# Re-declare the build argument: ARGs defined before FROM go out of scope after FROM,
# so without this line ${R_CRAN_MIRROR} would be empty in the R layer below
ARG R_CRAN_MIRROR

# LaTeX Layer
RUN apt-get update && apt-get install -y \
    pandoc \
    pandoc-citeproc \
    texlive-science \
    texlive-latex-extra \
    texlive-lang-german

# System Dependency Layer
COPY etc/requirements-system.txt /
RUN apt-get update && apt-get -y install $(cat /requirements-system.txt)

# Tesseract Layer
COPY etc/requirements-tesseract.sh /
RUN sh /requirements-tesseract.sh

# R Layer
COPY etc/requirements-R.txt /
RUN /rocker_scripts/setup_R.sh ${R_CRAN_MIRROR} && \
    Rscript -e 'install.packages(readLines("/requirements-R.txt"))'
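
Because the R version and the CRAN mirror are exposed as build arguments, both can be overridden at build time without editing the Dockerfile. For example (pipeline-image and the alternative snapshot date are placeholders):

# Build with the defaults baked into the Dockerfile
docker build -t pipeline-image .

# Or override the CRAN snapshot date (and, if needed, the R version) at build time
docker build \
    --build-arg R_VERSION="4.4.0" \
    --build-arg R_CRAN_MIRROR="https://packagemanager.posit.co/cran/__linux__/jammy/2024-12-01" \
    -t pipeline-image .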

Tracing the Error

I wrote this essay because this (to me) unexpected behavior of the r-ver Rocker images caused a catastrophic failure in one of the new data pipelines I am developing.

The chain of problems that led me there was epic. The exact error isn’t interesting in itself, because I wouldn’t claim that this chain of events is particularly common. However, it is a valuable lesson in how reproducibility failures can cause arcane errors in practice.

So, what happened?

I recently re-built the Docker image for this new pipeline and was presented with the following error message:

The R temporary directory appears to be within a folder mounted as 'noexec'.
Installation of R packages from sources may fail.
See the section **Note** within `?INSTALL` for more details.

tempdir(): /tmp/RtmpL9N978
  |............................            |  70% [pipeline-run]

Error: targets::tar_make() error

tar_errored()

tar_meta(fields = any_of("error"), complete_only = TRUE)

tar_workspace()

tar_workspaces()

  • Debug: https://books.ropensci.org/targets/debugging.html
  • Help: https://books.ropensci.org/targets/help.html

_store_ there is no package called ‘qs2’

My chain of thoughts and debugging efforts went something like this:

  1. The first part sounded horrific. Temporary directory mounted as “noexec”? Even though I hadn’t changed anything? This sort of thing usually only happens when my hardware is breaking down, and that is usually Bad. Capital B.
  2. Keep on reading. No package called “qs2”? This is ok. I manually installed {qs2} in the container and the pipeline re-ran without errors.
  3. But why didn’t the Dockerfile install {qs2}? In this case {qs2} is a dependency of {targets} (Landau 2021), but it wasn’t installed automatically because it isn’t a formal dependency on CRAN, only an optional dependency for a variant setting that I enabled.
  4. Add {qs2} to the R package list in the Docker config scripts? This is what I did with {qs} and it’s there. But why does the pipeline suddenly require {qs2}? This can’t be good.
  5. {qs2} became a dependency of {targets} in version 1.9.0, released on CRAN on 2024-11-20. The r-ver image for R 4.4.0 is supposed to be locked to CRAN at 2024-06-13. This is bad.
  6. I checked the R package mirror with options() and it was set to “latest”. Very bad.
  7. Went on a deep-dive to discover why the CRAN mirror in r-ver was set at “latest” and discovered that this was linked to using the latest R version.
  8. But it’s December 2024, and 4.4.0 is NOT the latest R version at this time. So why was my package manager setting still set to “latest”? Turns out I pulled the r-ver image months ago and forgot to re-pull the Docker image when the R version became non-latest.
  9. How to prevent this from happening again? Set explicit version lock in Dockerfile.

Conclusion

Reproducible scholarship is hard, but important for the scientific record and for saving yourself a lot of pain down the line. The same is true in business: if your software makes you money and it breaks down, it stops making you money.

Computational reproducibility requires controlling the code, data and software environment (Buckheit and Donoho 1995; Donoho 2010, 385). Docker is a very helpful but complex solution to reproducibility in R, and Rocker Project images (Boettiger and Eddelbuettel 2017; Nüst et al. 2020) make the process easier. However, the r-ver line of images does not version-lock R packages for the latest version of R.

This problem can persist for a long time if one relies too much on the Docker build cache, as I did.

Possible solutions to the Rocker version-lock problem are:

  1. Use an r-ver image with the non-latest R version
  2. Pull images often
  3. Set the CRAN mirror date explicitly in the Dockerfile (my preference)

The failure to ensure reproducibility can lead to catastrophic errors and wild debugging goose chases.

References

Boettiger, Carl, and Dirk Eddelbuettel. 2017. “An Introduction to Rocker: Docker Containers for R.” The R Journal 9 (2): 527–36. https://doi.org/10.32614/RJ-2017-065.

Buckheit, Jonathan B, and David L Donoho. 1995. “Wavelab and Reproducible Research.” In Wavelets and Statistics, 55–81. Springer.

Donoho, David L. 2010. “An Invitation to Reproducible Computational Research.” Biostatistics 11 (3): 385–88. https://doi.org/10.1093/biostatistics/kxq028.

Landau, William Michael. 2021. “The Targets r Package: A Dynamic Make-Like Function-Oriented Pipeline Toolkit for Reproducibility and High-Performance Computing.” Journal of Open Source Software 6 (57): 2959. https://doi.org/10.21105/joss.02959.

Nüst, Daniel, Dirk Eddelbuettel, Dom Bennett, Robrecht Cannoodt, Dav Clark, Gergely Daróczi, Mark Edmondson, et al. 2020. “The Rockerverse: Packages and Applications for Containerisation with R.” The R Journal 12 (1): 437–61. https://doi.org/10.32614/RJ-2020-007.


  1. The formulation is from Donoho (2010), although it is cited as Buckheit and Donoho (1995).