Abstract
In bioinformatics, as in other compute-heavy research fields, there is a need for workflows that can be relied upon to produce consistent output, independent of the software environment or configuration settings of the machine on which they are executed. Indeed, this is essential for making controlled comparisons between different observations or for distributing software to be used by others. Providing this type of reproducibility, however, is often complicated by the need to accommodate the myriad dependencies included in a larger body of software, each of which often exists in multiple versions. In bioinformatics, as in many other fields, these versions are subject to continual change due to rapidly evolving technologies, further complicating problems related to reproducibility. We propose a principled approach for building analysis pipelines and managing their dependencies. As a case study to demonstrate the utility of our approach, we present a set of highly reproducible pipelines for the analysis of RNA-seq, ChIP-seq, Bisulfite-seq, and single-cell RNA-seq data. All pipelines process raw experimental data and generate reports containing publication-ready plots and figures, with interactive report elements and standard observables. Users may install these highly reproducible packages and apply them to their own datasets without any special computational expertise beyond using the command line. We hope such a toolkit will provide immediate benefit to laboratory workers wishing to process their own datasets and to bioinformaticians who want to automate parts or all of their analysis. Our approach to reproducibility may also serve as a blueprint for reproducible workflows in other areas. Our pipelines, their documentation, and sample reports from the pipelines are available at http://bioinformatics.mdc-berlin.de/pigx