I have authored and continue to maintain a number of open source projects, primarily R code that compiles open access data sets. My open access software repository with CERN can be accessed either via the convenient code.seanfobbe.com or through a direct link. Active development occurs on GitHub.
You can also view and download my Linux configuration (e.g. dot files, package lists, install scripts) for Fedora and Debian in a continuously updated GitHub repository.
When I engage in data science, I am serious about the science part. That is why I am an avid proponent of open source software and strive to make my publications based on computational results fully reproducible. I endorse the famous admonition of Buckheit and Donoho, in the wording of Donoho (2010: 385):
An article about computational results is advertising, not scholarship. The actual scholarship is the full software environment, code and data, that produced the result.[1]
Naturally, this is easier said than done. As a matter of course I try to release all data sets I create as open access, publish the full source code (including version numbers of all dependencies) and make my computational results available with stable identifiers in long-term storage on Zenodo, the scientific repository of the high-energy physics research organization CERN.
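As an illustration only, recording the version numbers of dependencies alongside a data set could look like the following minimal sketch in R. The function name `record_versions` and the output file name are hypothetical and chosen for this example; they are not taken from the published projects.

```r
# Hypothetical helper: write the R version and the versions of the given
# packages to a CSV file that can be published next to the data set.
record_versions <- function(packages, path) {
  # packageVersion() returns the installed version of each package
  versions <- vapply(
    packages,
    function(p) as.character(utils::packageVersion(p)),
    character(1)
  )
  out <- data.frame(
    package = c("R", packages),
    version = c(as.character(getRversion()), unname(versions)),
    stringsAsFactors = FALSE
  )
  utils::write.csv(out, path, row.names = FALSE)
  invisible(out)
}

# Usage: "stats" and "utils" ship with every R installation,
# so this example runs without installing anything.
versions <- record_versions(
  c("stats", "utils"),
  file.path(tempdir(), "package-versions.csv")
)
print(versions)
```

In a real project the package vector would list the actual dependencies (for example data.table or quanteda), and the resulting file would be archived together with the source code.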
My research is only made possible through countless open source tools that I have had the good fortune to download and use for free. I would like to mention a few of my favorites as a way of saying ‘thank you’ and just in case someone else finds them as useful as I do.
- For my day-to-day and scientific computing needs I am a very happy user of the Fedora and Debian distributions of the Linux operating system.
- For reproducible programming I rely on Docker with Docker Compose to manage deployment details. The Rocker Project is my first port of call if I need version-controlled Docker images for R.
- For note-taking, simple documents and the Codebook of my data sets I use Markdown syntax (Gruber and Swartz 2004).
- For writing and typesetting complex documents I am a heavy user of the LaTeX document preparation and typesetting system (Knuth 1978; Lamport 1984).
- These days I do almost all of my serious writing and coding in Emacs (GNU Project 2023) with the Emacs Speaks Statistics (ESS) extension for the R Programming Language (Maechler et al 2021), the AUCTeX extension for LaTeX (GNU Project 2023) and the markdown-mode extension for Markdown (Blevins 2017).
- If you were wondering whether a plaintext-based workflow might be for you, give the Plain Person’s Guide to Plain Text Social Science (Healy 2019) a read!
- My data science workflow relies heavily on the R Programming Language (R Foundation for Statistical Computing 2023), the incredibly fast and concise data.table (Dowle and Srinivasan 2023) and the quanteda framework for the quantitative analysis of text (Benoit et al 2023), as well as rmarkdown (Allaire et al 2023) and knitr (Xie 2021) for writing reproducible reports.
- This website was created with the R Programming Language (R Foundation for Statistical Computing 2023), the blogdown package (Xie, Thomas and Hill 2021), the Hugo framework (Hugo Authors 2023) and the Coder theme (de Prá 2021).
[1] The original sentiment, paraphrasing an idea of geophysicist Jon Claerbout, was published in Buckheit and Donoho 1995, although the exact quote is from Donoho 2010. See: Buckheit, Jonathan B and David L Donoho. 1995. ‘WaveLab and Reproducible Research’. In Wavelets and Statistics, edited by Anestis Antoniadis and Georges Oppenheim, 55–81. New York: Springer. See also: Donoho, David L. 2010. ‘An Invitation to Reproducible Computational Research’. Biostatistics 11 (3): 385–388.