The case for using Mapped Exonic Non-Duplicate (MEND) read counts in RNA-Seq experiments: examples from pediatric cancer datasets

Holly C. Beale; Jacquelyn M. Roger; Matthew A. Cattle; Liam T. McKay; Drew K. A. Thomson; Katrina Learned; A. Geoffrey Lyle; Ellen T. Kephart; Rob Currie; Du Linh Lam; Lauren Sanders; Jacob Pfeil; John Vivian; Isabel Bjork; Sofie R. Salama; David Haussler; Olena M. Vaske

doi:10.1101/716829

Abstract

Background The accuracy of gene expression as measured by RNA sequencing (RNA-Seq) is dependent on the amount of sequencing performed. However, some types of reads are not informative for determining this accuracy. Unmapped and non-exonic reads do not contribute to gene expression quantification. Duplicate reads can be the product of either high gene expression or technical errors, so they are not definitively informative.

Findings We surveyed bulk RNA-Seq datasets from 2179 tumors in 48 cohorts to determine the fractions of uninformative reads. Total sequencing depth ranged from 0.2 to 668 million reads (median (med.) 61 million; interquartile range (IQR) 53 million). Unmapped reads constitute 1-77% of all reads (med. 3%; IQR 3%); duplicate reads constitute 3-100% of mapped reads (med. 27%; IQR 30%); and non-exonic reads constitute 4-97% of mapped, non-duplicate reads (med. 25%; IQR 21%). Informative reads--Mapped, Exonic, Non-Duplicate (MEND) reads--constitute 0-79% of total reads (med. 50%; IQR 31%). Further, we find that MEND read counts have a Pearson correlation of 0.22 with the number of genes expressed above 1 transcript per million (TPM), while total read counts have a correlation of −0.05.
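Because each percentage above is reported relative to the reads that survive the previous filter (unmapped out of all reads, duplicates out of mapped reads, non-exonic out of mapped non-duplicate reads), the MEND fraction of a single sample is the product of the three surviving fractions. The following Python sketch illustrates that arithmetic only; it is not the authors' script, and the function and parameter names are hypothetical. The example call plugs in the cohort medians purely for illustration; the median of a product need not equal the product of medians, so the result is not expected to match the reported median MEND fraction exactly.

```python
def mend_fraction(unmapped_of_total, duplicate_of_mapped, nonexonic_of_mapped_nondup):
    """Fraction of total reads that are Mapped, Exonic, Non-Duplicate (MEND).

    Each argument is a fraction in [0, 1] measured relative to the reads
    remaining after the previous filter (hypothetical parameter names).
    """
    fractions = (unmapped_of_total, duplicate_of_mapped, nonexonic_of_mapped_nondup)
    for f in fractions:
        if not 0.0 <= f <= 1.0:
            raise ValueError(f"fraction out of range: {f}")

    mapped = 1.0 - unmapped_of_total              # share of total reads that map
    non_duplicate = 1.0 - duplicate_of_mapped     # share of mapped reads kept
    exonic = 1.0 - nonexonic_of_mapped_nondup     # share of mapped, non-duplicate reads kept
    return mapped * non_duplicate * exonic


def mend_reads(total_reads, unmapped_of_total, duplicate_of_mapped, nonexonic_of_mapped_nondup):
    """Approximate MEND read count for a sample with `total_reads` reads."""
    fraction = mend_fraction(unmapped_of_total, duplicate_of_mapped, nonexonic_of_mapped_nondup)
    return round(total_reads * fraction)


if __name__ == "__main__":
    # Illustration using the reported medians: 3% unmapped, 27% duplicate, 25% non-exonic.
    fraction = mend_fraction(0.03, 0.27, 0.25)
    print(f"MEND fraction: {fraction:.3f}")
    print(f"MEND reads at 61 million total: {mend_reads(61_000_000, 0.03, 0.27, 0.25):,}")
```

With the median inputs the product comes to about 0.53, in the same neighborhood as the reported median of 50%.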

Conclusions Since the fraction of uninformative reads varies, we propose using only definitively informative reads, MEND reads, for the purposes of asserting the accuracy of gene expression measured in a bulk RNA-Seq experiment. We provide a Docker image containing 1) the existing required tools (RSeQC, sambamba and samblaster) and 2) a custom script. We recommend that all results, sensitivity studies and depth recommendations use MEND units.

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

This version expands on the survey portion of the original version. A forthcoming manuscript will address the relationship between assay results and sensitivity and specificity, which is addressed in the original version but not the current version.

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.