RT Journal Article
SR Electronic
T1 Tissue-aware RNA-Seq processing and normalization for heterogeneous and sparse data
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 081802
DO 10.1101/081802
A1 Joseph N. Paulson
A1 Cho-Yi Chen
A1 Camila M. Lopes-Ramos
A1 Marieke L. Kuijjer
A1 John Platig
A1 Abhijeet R. Sonawane
A1 Maud Fagny
A1 Kimberly Glass
A1 John Quackenbush
YR 2016
UL http://biorxiv.org/content/early/2016/10/20/081802.abstract
AB Although ultrahigh-throughput RNA-Sequencing has become the dominant technology for genome-wide transcriptional profiling, the vast majority of RNA-Seq studies typically profile only tens of samples, and most analytical pipelines are optimized for these smaller studies. However, projects are generating ever-larger data sets comprising RNA-Seq data from hundreds or thousands of samples, often collected at multiple centers and from diverse tissues. These complex data sets present significant analytical challenges due to batch and tissue effects, but provide the opportunity to revisit the assumptions and methods that we use to preprocess, normalize, and filter RNA-Seq data – critical first steps for any subsequent analysis. We find analysis of large RNA-Seq data sets requires both careful quality control and that one account for sparsity due to the heterogeneity intrinsic in multi-group studies. An R package instantiating our method for large-scale RNA-Seq normalization and preprocessing, YARN, is available at bioconductor.org/packages/yarn. Highlights: Overview of assumptions used in preprocessing and normalization; Pipeline for preprocessing, quality control, and normalization of large heterogeneous data; A Bioconductor package for the YARN pipeline and easy manipulation of count data; Preprocessed GTEx data set using the YARN pipeline available as a resource.