Abstract
Motivation One of the main benefits of modern RNA-sequencing (RNA-seq) technology is more accurate gene expression estimation. However, for a number of reasons, an RNA-seq read can map to multiple locations on the reference genome with the same alignment score, which occurs in plant, animal, and metagenome samples. Such a read is called a multiple mapping read (MMR). MMRs affect gene expression estimation and all downstream analyses, including differential gene expression and functional enrichment. Current analysis pipelines lack tools to test the reliability of gene expression estimates and therefore cannot ensure the validity of downstream analyses.
Results Our investigation of 95 RNA-seq datasets from seven species (totaling 1,951 GB) indicates that, on average, roughly 22% of all reads are MMRs in plant and animal species. Here we present a tool called GeneQC (Gene expression Quality Control), which can accurately estimate the reliability of each gene’s expression level. The underlying algorithm is built on extracted genomic and transcriptomic features, combined through mathematical and statistical modeling. Using this big data-driven modeling approach, GeneQC allows researchers to identify reliable expression estimates and to conduct further analysis on the genes whose expression estimates are of sufficient quality. The tool also enables researchers to pursue additional analysis aimed at more accurate expression estimates for genes with low reliability.
Availability GeneQC is freely available at http://bmbl.sdstate.edu/GeneQC/home.html.
Contact qin.ma{at}sdstate.edu