TY  - JOUR
T1  - A cloud-based workflow to quantify transcript-expression levels in public cancer compendia
JF  - bioRxiv
DO  - 10.1101/063552
SP  - 063552
AU  - PJ Tatlow
AU  - Stephen R. Piccolo
Y1  - 2016/01/01
UR  - http://biorxiv.org/content/early/2016/07/12/063552.abstract
N2  - Public compendia of raw sequencing data are now measured in petabytes. Accordingly, it is becoming infeasible for individual researchers to transfer these data to local computers. Recently, the National Cancer Institute funded an initiative to explore opportunities and challenges of working with molecular data in cloud-computing environments. With data in the cloud, it becomes possible for scientists to take their tools to the data and thereby avoid large data transfers. It also becomes feasible to scale computing resources to the needs of a given analysis. To evaluate this concept, we quantified transcript-expression levels for 12,307 RNA-Sequencing samples from the Cancer Cell Line Encyclopedia and The Cancer Genome Atlas. We used two cloud-based configurations to process the data and examined the performance and cost profiles of each configuration. Using “preemptible virtual machines”, we processed the samples for as little as $0.09 (USD) per sample. In total, we processed the TCGA samples (n=11,373) for only $1,065.49 and simultaneously processed thousands of samples at a time. As the samples were being processed, we collected detailed performance metrics, which helped us to track the duration of each processing step and to identify computational resources used at different stages of sample processing. Although the computational demands of reference alignment and expression quantification have decreased considerably, there remains a critical need for researchers to optimize preprocessing steps (e.g., sorting, converting, and trimming sequencing reads). We have created open-source Docker containers that include all the software and scripts necessary to process such data in the cloud and to collect performance metrics. The processed data are available in tabular format and in Google's BigQuery database (see https://osf.io/gqrz9).
ER  - 