PT - JOURNAL ARTICLE
AU - Alexey Alekhin
AU - Evdokim Kovach
AU - Marina Manrique
AU - Pablo Pareja-Tobes
AU - Eduardo Pareja
AU - Raquel Tobes
AU - Eduardo Pareja-Tobes
TI - MG7: Configurable and scalable 16S metagenomics data analysis
AID - 10.1101/027714
DP - 2015 Jan 01
TA - bioRxiv
PG - 027714
4099 - http://biorxiv.org/content/early/2015/09/28/027714.short
4100 - http://biorxiv.org/content/early/2015/09/28/027714.full
AB - As part of the Cambrian explosion of omics data, metagenomics brings to the table a specific, defining trait: its social essence. The meta prefix exerts its influence, with multitudes manifesting themselves everywhere; from samples to data analysis, from actors involved to (present and future) applications. Of these dimensions, data analysis is where needs lie furthest from what current tools provide. Key features are, among others, scalability, reproducibility, data provenance and distribution, process identity and versioning. These are the goals guiding our work in MG7, a 16S metagenomics data analysis system. The basic principle is a new approach to data analysis, where configuration, processes, or data locations are static, type-checked and subject to the standard evolution of a well-maintained software project. Cloud computing, in its Amazon Web Services incarnation, when coupled with these ideas, produces a robust, safely configurable, scalable tool. Processes, data, machine behaviors and their dependencies are expressed using a set of libraries which bring as much checking and validation as possible to the type level, without sacrificing expressiveness. Together they form a toolkit for defining scalable cloud-based workflows composed of stateless computations, with a static reproducible specification of dependencies, behavior and wiring of all steps.
The modeling of taxonomy data is done using Bio4j, where the new paradigm of graph databases allows for both a simple expression of taxonomic assignment tasks and the calculation of taxa abundance values considering the hierarchic structure of the taxonomy tree. MG7 includes a new 16S reference database, 16S-DB7, built with a flexible and sustainable update system, and the possibility of project-driven personalization.