Hierarchical Analysis of Multi-mapping RNA-Seq Reads Improves the Accuracy of Allele-specific Expression

Narayanan Raghupathy; Kwangbom Choi; Matthew J. Vincent; Glen L. Beane; Keith Sheppard; Steven C. Munger; Ron Korstanje; Fernando Pardo-Manual de Villena; Gary A. Churchill

doi:10.1101/166900

Abstract

Allele-specific expression (ASE) refers to the differential abundance of the allelic copies of a transcript. Direct RNA sequencing (RNA-Seq) can provide quantitative estimates of ASE for genes with transcribed polymorphisms. However, estimating ASE is challenging due to ambiguities in read alignment. Current approaches do not account for the hierarchy of multiple read alignments to genes, isoforms, and alleles. We have developed EMASE (Expectation-Maximization for Allele Specific Expression), an integrated approach to estimate total gene expression, ASE, and isoform usage based on hierarchical allocation of multi-mapping reads. In simulations, EMASE outperforms standard ASE estimation methods. We apply EMASE to RNA-Seq data from F1 hybrid mice where we observe widespread ASE associated with cis-acting polymorphisms and a small number of parent-of-origin effects at known imprinted genes. The EMASE software is freely available under GNU license at https://github.com/churchill-lab/emase and it can be adapted to other sequencing applications.

Background

Allele-specific expression (ASE) refers to the relative abundance of the two alleles of a transcript in a diploid organism. ASE can result from differential rates of transcription, differences in mRNA stability, or other mechanisms that affect transcript abundance. Allelic differences can range in magnitude from subtle quantitative effects to purely monoallelic expression. ASE is driven by factors that are linked to the gene and act in cis to affect transcript abundance. These cis-acting factors may be local genetic variants or epigenetic marks that distinguish maternal and paternal alleles. In the absence of cis-acting variation, trans-acting factors should exert an equal influence on both allelic copies of a gene. ASE has been associated with genetic predisposition to disease (De La Chapelle, 2009; McKean et al, 2016). Accurate estimation of ASE can provide insight into mechanisms of normal transcriptional regulation and it can reveal allelic dysregulation that may underlie or reflect disease states (Wittkopp et al, 2004).

Early studies of genome-wide ASE used specialized microarray platforms (Ronald et al, 2005) and hybridization artifacts posed significant problems for accurate quantitation (Gresham et al, 2010). The advent of massively parallel RNA sequencing technologies (RNA-Seq) has provided a unique opportunity to measure ASE directly (Lister et al, 2008; Nagalakshmi et al, 2008; Mortazavi et al, 2008; Wang et al, 2009). But the analysis of ASE from RNA-Seq data presents new challenges. In particular, while transcribed genetic variation provides the information needed to discriminate the allelic origin of a transcript, allelic differences can also introduce systematic biases in alignment and ASE estimates (Degner et al, 2009).

Numerous RNA-Seq experiments have been carried out to measure ASE and to evaluate the relative contributions of genetic variation and parental imprinting in humans (Lappalainen et al, 2013; Baran et al, 2015; Babak et al, 2015; Kukurba et al, 2014) and in model organisms, including yeast (Skelly et al, 2011), flies (McManus et al, 2010; Coolon et al, 2012; Graze et al, 2012), and mice (Gregg et al, 2010; DeVeale et al, 2012; Goncalves et al, 2012; Lagarrigue et al, 2013). Estimates of the extent of parent-of-origin effects reported in these studies vary widely. While biological differences among organism, cell type, and developmental stage may account for some of this variability, it is also likely that technical differences in experimental design, analysis methods, and interpretation have contributed to the discordance (DeVeale et al, 2012; Coolon et al, 2012).

Quantification of ASE from RNA-Seq data begins with alignment of sequence reads to a genome or transcriptome — the set of all transcripts derived from a genome. Polymorphisms between the parental genomes enable the allelic origin of some reads to be unambiguously determined and these reads provide the information needed to estimate ASE. Some of the first attempts to estimate ASE from short read RNA-Seq data aligned reads to a haploid reference genome, allowing for mismatches in the alignment, and then counted allelic proportions at known single nucleotide polymorphisms (SNP) loci (Gregg et al, 2010; Andergassen et al, 2015). However, it is now recognized that alignment to a reference genome can bias estimation in favor of the allele that is most similar to the reference (Degner et al, 2009; Stevenson et al, 2013; Munger et al, 2014). Several approaches have been developed to reduce this bias by accounting for known SNPs in the scoring of alignments (Montgomery et al, 2010; Pickrell et al, 2010; Lalonde et al, 2011; Rozowsky et al, 2011; Stevenson et al, 2013; Castel et al, 2015). Other methods explicitly represent haploid maternal and paternal genomes that incorporate known genetic variation but perform alignment sequentially (Rozowsky et al, 2011). ASE estimation can also be improved by assigning reads to alleles of a transcript based on all known SNPs (Wang et al, 2011; Coolon et al, 2012). These approaches reduce but do not fully remove biases that arise from the initial reference alignment. For example, van de Geijn et al (2015) showed that reads from non-reference alleles frequently map to multiple genomic locations and would be discarded by these methods. Ideally, all of these challenges — diploid alignment, using information in multiple SNPs, and accounting for ambiguous read alignments — should be addressed in a unified statistical analysis framework.

In this work we focus on alignment to a diploid transcriptome, which includes sequences from both allelic copies of all transcript isoforms. The diploid transcriptome has a natural hierarchical structure. Genes, the transcribed regions of the genome, are present as two copies, the maternal and paternal alleles, either of which can be transcribed and processed into multiple different isoforms. A transcript originates from one isoform of one allele of one gene but different transcript sequences may be highly similar or even identical to one another. As a result, a short read sequence may align equally well to multiple transcript sequences. Alignment ambiguities can occur at different levels of the hierarchy. Sequence similarity shared across genes can give rise to genomic multi-reads that align to multiple locations in the genome. Exon or exon-junction sharing between transcripts can result in isoform multi-reads that align to more than one isoform of the same gene. Lastly, the absence of distinguishing polymorphisms can give rise to allelic multi-reads that align equally well to both allelic copies of a gene. A single read can display multiple ambiguities at different levels of this hierarchy. Accounting for multi-mapping reads is known to improve estimation of transcript abundance but little attention has been given to the role of these different types of multi-reads in the estimation process.

One approach to resolve multi-reads is to employ an expectation maximization (EM) algorithm to assign probabilistic weights that apportion the read across multiple transcripts. Previously reported EM algorithms for RNA-Seq analysis do not differentiate between genomic, isoform, and allelic multi-reads (Li and Dewey, 2011; Nicolae et al, 2011; Turro et al, 2011; Roberts and Pachter, 2013; Patro et al, 2014; Bray et al, 2016). Here we report an EM algorithm that accounts for the hierarchical structure of the transcriptome. Our method is implemented in open source software, EMASE (https://github.com/churchill-lab/emase). We describe the EMASE algorithm and evaluate its performance using simulated and real data. We use simulated data to evaluate four EMASE models with different hierarchies and compare the performance of EMASE to several widely used methods for estimating ASE. We demonstrate the application of EMASE to real data by analyzing liver RNA-Seq data from a reciprocal F1 hybrid cross between two inbred mouse strains.

Results

Importance of counting multi-reads

RNA-Seq data consist of millions of sequence reads obtained from an RNA sample. We represent the transcriptome as a collection of sequence elements, one for each allele of each isoform of each gene, and we assume that each read originated from exactly one element. Some elements of the transcriptome may be highly similar or even identical to one another. There are sequence similarities across gene families; isoforms of a gene may share exons or exon-junctions; and alleles may have a few or no distinguishing polymorphisms. As a result, a read may align to one or more elements in the transcriptome with equal alignment quality. If the best alignment is unique we assume it is correct. Otherwise, we assume the read originated from exactly one of the elements with equally best alignment quality.

Discarding ambiguous or multi-mapping reads is unfortunately a common practice in RNA-Seq analysis (reviewed in Conesa et al (2016)). In addition to loss of information, selectively discarding reads can bias results. The impact of discarding genomic multi-reads on total gene expression has been documented (Li et al, 2010; Robert and Watson, 2015). Relatively less attention has been paid to the impact of discarding isoform and allelic multi-reads but it remains a standard practice to discard these reads on the assumption that they are uninformative (Kanitz et al, 2015; Castel et al, 2015).

To illustrate the potential impact of multi-reads, we counted the different classes in our F1 cross data (Figure 1). Only ~14% of all aligned reads are unique (U in Figure 1) at all levels and the remaining ~86% of reads are multi-reads for at least one level of the hierarchy. Simple multi-reads are multiply aligned at exactly one level of the hierarchy; they represent 42% of all reads (G+I+A). Complex multi-reads are multiply aligned at 2 or more levels in the hierarchy; they represent 44% of all reads (GI+GA+IA+GIA). Thus complex multi-reads represent a significant fraction of the total data, and information; these are the reads that are apportioned in different ways depending on what we assume about the hierarchy of genes, alleles and isoforms.

Unique reads are simply the complement of multi-reads. They provide critical information needed to assign weights and allocate multi-reads. The majority of reads (83%) are genomic unique (A+I+AI+U). Reads that are both genomic and allelic unique represent 22% (I+U) of the total; these reads are most informative for ASE. In the diploid transcriptome of our F1 animals, 88% of genes have at least one allelic variant site; for genes with no variants there will be no allelic unique reads and we cannot estimate ASE. Reads that are both genomic and isoform unique represent a larger proportion, 48% (A+U) of total reads. However, many of these reads align to single-isoform genes (36% of total reads) and thus only 12% of total reads are informative for distinguishing among isoforms.

Figure 1:

Multi-read proportions in hybrid mouse data. For each read we determined whether it aligns to multiple genomic locations, multiple isoforms of a gene, and multiple alleles. If, for example, a read is a genomic multi-read and is also an isoform multi-read for at least one of its genomic alignments, the read is counted as an isoform multi-read. Complex multi-reads are shown at the intersections of the Venn diagram. The proportion of reads that align uniquely at all levels is 14.1% as shown.

Fitting an EMASE model

We address the problem of resolving multi-reads in RNA-Seq data by assigning probabilistic weights to each alignment of a multi-read. Current approaches to resolve multi-reads treat all alignments equally regardless of whether the multiple alignments involve alleles, isoforms or genes (Li and Dewey, 2011; Nicolae et al, 2011; Turro et al, 2011; Roberts and Pachter, 2013; Patro et al, 2014; Bray et al, 2016). This approach ignores the hierarchical structure of the transcriptome in which a gene may have multiple isoforms and each isoform will have two alleles. As noted above, a single read can be a multi-read at multiple levels in this hierarchy. It is not immediately obvious how to allocate weights for these complex multi-reads.

Figure 2:

Hierarchical allocation of multi-reads

We implemented four EMASE models (M₁, M₂, M₃, and M₄) with distinct hierarchical structures (Figure 2a). Each model apportions a complex multi-read differently. Under M₁, reads are apportioned among genes first, then between alleles, and then among isoforms. Under M₂, reads are apportioned among genes first, then among isoforms, and then between alleles. Under M₃ reads are apportioned among genes first, then among each isoform-allele combination which are treated equally. Model M₄ assumes no hierarchy and multi-reads are apportioned equally among genes, isoforms, and alleles. M₄ is implicitly the model used by other EM approaches. Under M₄, the gene-level allocation of reads will depend on the number of isoforms that are represented in the transcriptome; genes with more isoforms will receive proportionately higher weights in the allocation of reads that are both genomic and isoform multi-reads. To see why this may be problematic, consider a situation where new isoforms of a gene are discovered and added to the transcriptome. With the new transcriptome definition, this gene will receive a larger share of the read allocation but the evidence that the read originated from this gene has not changed.

To illustrate how the four EMASE models allocate multi-reads, we constructed a hypothetical example of an alignment profile (Figure 2b). This is a complex multi-read at all three levels of the hierarchy. M₁ first allocates equal weight to each gene; it then allocates weight between the two alleles of gene g₁; lastly it allocates weights to isoforms within each gene and allele. Models M₂ and M₃ make similar allocations but in different orders resulting in different overall allocation of weights. We note that all three model M₁, M₂, and M₃ given equal weight to each gene. In contrast, M₄ will apportion weights equally to each alignment such that gene g₁ receives weight and gene g₂ receives weight. In this example, we applied equally-weighted allocations to a single read. Next we describe an iterative algorithm for estimating the allocation parameters that uses data from all of the reads.

An EM algorithm is applied to obtain maximum likelihood parameter estimates for an EMASE model (Figure 2c). The EM algorithm for EMASE M₁ begins with initial estimates of the relative expression of each gene(θ_g), the allelic proportions for each gene (ϕ_h|g), and the isoform proportions within each allele of each gene (δ_i_{|g, h}). Initial values can be equally weighted as in Figure 2b. The product of these parameters λ_g,_{h, i} = θ_g · ϕ_h|_g · δ_i_{|g, h} represents the relative abundance of the transcript. In the E-step, current estimates of transcript abundance are used to apportion the unit count of a multi-read among the gene(s), isoform(s) and allele(s) to which it aligns. This process is repeated for each read and the weights are summed across all reads to obtain an expected read count for each transcript. In the M-step, the expected read counts are used to re-estimate the transcript abundance, incorporating an adjustment for the transcript length. The EM algorithm converges to yield maximum likelihood parameter estimates of transcript abundance and expected read counts. We note that expected read counts may not be integers due to the weighted allocation of multireads. Transcript abundance is a proportion among all transcripts and it is typically standardized to transcripts per million (TPM). Convergence of the EMASE fitting algorithm is declared when the sum of the absolute TPM changes by less than 1.0 on consecutive iterations. A detailed description of the EM algorithms is provided in Methods.

Performance of EMASE on simulated data

We used simulations to evaluate the performance of EMASE models for estimating total and allele-specific expression and for comparison of EMASE to other approaches. We simulated 12 independent sets of 10 million 68 bp single-end reads using RSEM (Li and Dewey, 2011) version 1.3.0 with input parameters obtained by applying RSEM analysis to eight samples of F1 hybrid cross between mouse strains NOD/ShiLtJ (NOD) and PWK/PhJ (PWK). For the alignment phase of analysis, we generated NOD and PWK transcriptomes by incorporating known strain-specific SNPs and short indels into the reference transcriptome and combined these to form the diploid transcriptome of a NODxPWK F1 hybrid mouse using g2gtools (https://github.com/churchill-lab/g2gtools). We built the bowtie (Langmead et al, 2009) index using rsem-prepare-reference and aligned each of the simulated read sets to the diploid NODxPWK transcriptome using the bowtie aligner to generated BAM files for each of the 12 simulated data sets. Details of the simulations are provided in Methods.

We fit each of the four EMASE models with these BAM files and compared the accuracy of ASE and total expression estimation on the 12 simulated samples (Table 1). To determine the accuracy of ASE, we computed the proportion of genes or isoforms for which the absolute differences between estimated and true ASE is less than 0.1 (Table 1a). For gene-level estimates models M₂ and M₄ were equally best in performance and M₂ was best for isoform-level ASE estimation. When we compared EMASE with the other EM methods, RSEM and kallisto, they all fell within a few percentage points of one another with kallisto having only marginally lower accuracy. The unique-reads method and WASP have substantially lower accuracy compared to EM methods. At the gene-level, WASP estimates of ASE fell within 10% of truth for fewer than half of genes. Isoform level estimates are not available with WASP.

To evaluate estimation of total gene expression among the EM based methods we computed the proportions of genes or isoforms for which the relative difference between estimated and true values was less than 10% (Table 1b). At both gene and isoform-level, M₄ was most accurate based on number of genes that fell within 10% of the true value (Table 1b). All of the EM based methods have similar accuracy for total gene expression (~85%) and for total isoform expression (~40%). Kallisto has slightly lower accuracy and, as expected, total expression estimates for EMASE models M₁, M₂, and M₃ are essentially identical.

View this table:

Table 1: Accuracy of ASE and total gene expression estimation (See Figure S1 and S2).

We carried out head-to-head comparisons between all pairs of models to determine the proportions of genes or isoform for which each method provided estimates of ASE or total expression that are closer to the simulated truth. For ASE, we report comparisons of EMASE model M₂ against the other methods (Table 2a and Supplemental Figures S4 and S5). For total expression, we report comparisons between EMASE model M₂ and the other EM methods (Table 2b and Supplemental Figures S6 and S7).

For ASE estimation, we considered the estimates obtained from two methods to be equivalent if the difference between the absolute deviations of estimated values from the simulated truth is 5% or less. In comparison to the other EMASE models, the M₂ estimates are more often closer to the true values. The difference is most pronounced in comparison to M₁ at the gene-level where 11.5% of genes were better estimated by M₂ and only 3.3% were better estimated by M₁. The differences among the EMASE model comparisons are less pronounced at the isoform level. In comparisons to the other estimation methods, M₂ is consistently best and the next best performance is from RSEM followed by Unique and kallisto. The performance of WASP is an outlier — M₂ provided substantially better estimates of ASE for 26.2% of all genes.

We compared EMASE model M₂ estimates of total gene expression to the other EM based methods. Estimates were considered to be equivalent if they are within 5% relative difference. At the gene level, estimates from EMASE models M₁, M₂, and M₃ are essentially identical and the minor differences (0.1%) can be attributed to convergence of the EM fitting algorithm. In comparison to model M₄, we see that the M₂ estimates are closer to truth for 1.4% of genes and M₄ estimates are closer to truth for 5.5% of genes. Model M₂ outperforms both RSEM and kallisto at the gene-level. At the isoform level, we see very similar performance among M₁, M₂, and M₃. Model M₄ is closer to truth than M₂ for 12.7% of isoforms and RSEM also outperforms M₂ in this comparison. We conclude that the best estimates of total gene expression are obtained using either EMASE model M₄ or RSEM, but we note that all of the EM methods are performing within a few percentage points of one another. For both ASE and total expression, EMASE model M₂ was overall best but not best in every comparison.

View this table:

Table 2: Head-to-head model comparisons. See Figure S4, S6, S5, and S7.

It is often of interest to classify the ASE state of genes as monoallelic versus bi-allelic expression. In this simulation, we call a gene monoallelic if its estimated allele proportion is less than 2% or greater than 98%. It is also of interest to classify genes as expressed or not-expressed. We call a gene with expected read count less than 1.0 as not-expressed. Based on these classification rules, we compared the precision-recall of each model (Table 3).

The classification results for ASE are summarized in Table 3a. Precision — the proportion of monoallelic calls that are true — is around 50% for all of the ASE estimation methods with the exception of WASP for which only 16.4% of monoallelic calls were correct. Recall — the proportion of true monoallelic expressions that are called — is more variable across the ASE estimation methods with best performance reported for the unique-reads method followed by EMASE model M₄, RSEM, and kallisto. At the isoform level, we can see a similar result. EMASE model M₁ and M₂ achieved the best precision, in excess of 60%, and the unique-reads method had the best recall of 93.7%, closely followed by EMASE model M₄, RSEM, and kallisto.

Classification results for total expression are summarized in Table 3b. Precision and recall are consistently above 97% for each of the EM-based methods. Classification performance at the isoform-level is also consistently high across the EM-based estimation methods.

View this table:

Table 3: Classification accuracy of monoallelic expression and gene expression (See Figure S1 and S2).

We examine the distribution of allele proportions in the simulated data (Figure 3a) and estimated allelic proportions by EMASE M₂, RSEM, kallisto, WASP, and unique-reads methods (Figure 3). In total, ~22% and ~14% of reads are allelic unique at the gene and isoform level, respectively. EM-based methods (Figure 3b, 3c, and 3d) produced a smooth, bell-shaped distribution similar to the true distribution but with increased variation that reflects estimation error. WASP failed to estimate allele proportion in over 2,200 genes compared to the other methods. The overall distribution of ASE obtained from WASP was skewed toward NOD alleles, which are more similar to the mouse reference genome (Figure 3e). For these reasons, we conclude that postprocessing allele specificity after reference alignment is not fully correcting the alignment bias. The unique-reads method (Figure 3f) resulted in a symmetric distribution of allele proportions but with greater estimation error. We observed a 13% increase in variance at the gene-level and 39% increase at the isoform-level compared to EMASE model M₂. Our implementation of the unique-reads method is unbiased due to the alignment to a customized transcriptome but estimates of ASE are more variable than EM and there are more monoallelic calls. EMASE model M₂ reported the smallest number of false monoallelic expression calls: 168_±9 versus 196_±13 (EMASE M₄), 203_±14 (RSEM), 228_±12 (kallisto), 264_±13 (unique-reads method) or 735_±37 (WASP) across 12 samples. See Table 3a for more detail.

Figure 3:

Estimated distribution of gene-level ASE from simulated data. Histograms show the frequency distribution across genes of the PWK allele proportions based on simulated truth (a), EMASE model M₂ (b), RSEM (c), kallisto (d), WASP (e), and unique-reads only (f). The first and last bins in each histogram represent monoallelic expression defined as PWK allele proportion below 0.02 or above 0.98. Error bars are drawn between minimum and maximum values from 12 simulated data.

In summary, the EMASE models M₂ and M₄ estimates consistently provided the best or close to best estimates of both total expression and allelic proportion in our simulations. The same can be said for the RSEM estimates. Estimates obtained using kallisto were also consistently good but often not as accurate as the other EM methods — this may reflect some loss of information in the fast pseudo-alignment strategy. Among the non-EM methods, the unique-reads method (with alignment to the custom diploid transcriptome) provided consistent but less precise estimation. The WASP algorithm, which relies on a reference alignment strategy, performed poorly in all evaluations using simulated data.

ASE in F1 hybrid data

We applied EMASE to RNA-Seq data from a reciprocal F1 hybrid cross between mouse strains NOD and PWK. There were 48 male mice in total with 24 mice from each direction of the cross (NOD×PWK and PWK×NOD). In order to evaluate the extent of ASE, we applied EMASE model M₂ to estimate the PWK allele proportions for 9,102 informative autosomal genes (see Methods). This distribution of estimated ASE is symmetric (Figure 4a) indicating that there are no strain-specific biases. Monoallelic expression was observed for NOD alleles at 173 genes and for PWK alleles at 174 genes in the NOD×PWK samples. Monoallelic expression was observed for NOD alleles at 150 genes and for PWK alleles at 152 genes in the PWK×NOD samples. These are median values over the 24 samples in each cross direction. Numbers of monoallelic expressed genes varied from 115 to 454 in individual samples.

Male F1 mice from the two reciprocal crosses are hemizygous for the X chromosomes. We included both X chromosomes in our transcriptome definition in order to evaluate the misclassification rate of monoallelic expression. The majority of X chromosome genes (85% and 82% in NOD×PWK and PWK×NOD respectively) demonstrated monoallelic expression for the correct X chromosome (Figure 4a). Among the genes that show bi-allelic expression, 65% percent of the genes have fewer than 5 SNPs or indels to distinguish between alleles and others (35%) share sequence similarity and genomic multireads with autosomal genes.

Figure 4:

Allele-specific expression in a reciprocal F1 hybrid mouse population.

We evaluated the statistical significance of ASE using a beta-binomial test to account for overdispersion in (expected) read counts. We identified 4,216 genes (at FDR 5%) for NOD×PWK mice and 3,869 genes (at FDR 5%) for PWK×NOD mice with overlap of 3,084 genes (p<2.2e^-16). This suggests that ASE is pervasive (affecting more then 35% of genes) and consistent across different groups of animals. A scatterplot of ASE estimates from each direction of cross reveals a striking level of concordance (r² = 0.831) (Figure 4b) and suggests that ASE is continuously distributed with variable degrees of allelic imbalance across the genome.

We observed a handful of genes with a reversal in the PWK allele proportion between the two crosses, consistent with a parent-of-origin effect on ASE. In order to evaluate their significance we employed a logistic regression with a quasi-binomial likelihood (Agresti, 2002) and direction of cross as a predictor. We identified 70 genes with significant parent-of-origin effect, at 5% FDR (Figure S9). The strongest effects were restricted to genes that are already known to be imprinted, including Igf2r, Peg3, Zrsrl, H13, and Impact. We conclude that parent-of-origin effects are limited to a small number of well-characterized genes in adult mouse liver (Figure 4b). We also tested the effect of diet and age on ASE using the overdispersed logistic model and found 12 genes with significant diet effect and 112 genes with a significant age effect on ASE at 5% FDR threshold. These results suggest that allele-specificity is relatively insensitive to the diets and range of ages of mice in this study.

Discussion

Until recently access to individual whole genome sequences has been out of reach for most organisms but sequencing of individual genomes of humans and model organisms is now proceeding rapidly. When individual genomes are not directly available, high-density genotyping arrays and variant databases can support imputation and phasing to obtain accurate approximations of individual diploid genomes. Hybrid mouse genomes, such as the NOD×PWK F1 animals used here, are straightforward to construct; they serve as a proof-of-principle for future applications of individually targeted RNA-Seq analysis. ASE estimation based on reference genome alignment suffers from bias even when secondary analyses are applied to account for misaligned reads. We recommend, whenever possible, to use an individually tailored diploid transcriptome incorporating known or imputed genetic variants as an alignment target for RNA-Seq analysis.

EMASE software works with BAM format files that can be produced by most short-read alignment software. It requires alignment to a collection of discrete sequence elements such as a transcriptome. The transcriptomes of human, mouse and other well-studied organisms are refined and well annotated. There is always room for improvement but for the present we rely on the reference transcriptome and adjust individual sequence elements to incorporate known or imputed SNPs and small indels. The impact of errors or individual variations in the set of the transcribed elements is not clear but will result in failure of some RNA-Seq reads to align to their correct origin or failure to align to any element. New alignment strategies that operate on whole genomes but are transcriptome-aware could help to address some of these concerns (Kim et al, 2015).

EMASE achieves up to 1,000 fold data reduction from BAM alignment format to the read alignment matrix. This reduction entails loss of information about the details of read alignments but this does not appear to impact the accuracy of estimation. Computing time for EMASE is substantially faster than RSEM but not as fast as k-mer methods such as kallisto. This suggests that detailed information about the aligned sequences is of limited value and that counting aligned reads is sufficient for accurate estimation.

The EMASE algorithm is readily adaptable to other contexts. All that is required is an alignment target composed of discrete sequence elements and a hierarchy. EMASE has been adapted to estimate allele-specific binding using ChIP-seq data (Baker et al, 2015) by defining sequence elements around DNA binding sites and applying a two-level hierarchy for sites and alleles. We have implemented an exon-junction version of EMASE as an alternative to the transcript isoform model presented here (Raghupathy et al., unpublished). EMASE has been adapted to analyze multiparent populations (http://churchill-lab.github.io/gbrs) with alleles assigned to eight (or any number of) haplotype classes (Chick et al, 2016). We anticipate the development of EMASE applications to allele-specific methylation, allele-specific RNA editing, and more.

Evaluation of RNA-seq analysis methods requires complex and realistic simulated data. Any simulation software makes assumptions that will affect the properties of the simulated data and the outcome of evaluations. After testing several simulation tools (Griebel et al, 2012; Frazee et al, 2015) including our own EMASE simulator, we decided to base our simulation studies on the RSEM simulator (Li and Dewey, 2011), which we found to be well documented and easy to implement. Moreover, we wished to avoid biasing our conclusions by evaluating the EMASE algorithm with data simulated from the EMASE model. We obtained input parameters for the simulations using values estimated from our F1 hybrid mouse data by RSEM. Thus both the input parameters and the simulated data were obtained from a non-hierarchical model that is most similar to EMASE model M₄. It is difficult to assess the extent to which our simulations might have favored RSEM or EMASE model M₄. RNA-Seq simulations can be sensitive to other choices such as whether and how to simulate poly-A tails and how to introduce sequencing errors. We made the choice to compare estimated read counts and allelic- or isoform-proportions of read counts as the outcome measure for comparison of different analysis tools. Alternative measures such as transcripts per million (TPM), are widely used but we found that each of the software tools has its own unique way of computing the transcript length adjustment for converting read counts to TPM. Read counts are the starting point for most normalization and downstream analysis methods, for example in voom (Law et al, 2014), edgeR (Robinson et al, 2010), and DESeq2 (Love et al, 2014), and they are consistently defined by the analysis methods considered here. Establishing standards for simulation-based evaluation of RNA-Seq analysis is an area in need of further attention.

EMASE estimates read counts at the level of transcriptome elements corresponding to individual isoforms and alleles of a gene. It can also aggregate counts at the gene-level and report average ASE across isoforms or isoform usage across alleles. If allelic proportions vary widely between isoforms, the gene-level average ASE may have little meaning. On the other hand when there is a dominant isoform or when ASE is consistent across isoforms, the gene-level summary will be more accurate due to the larger numbers of reads that are available to estimate ASE. Aggregate summaries of ASE are useful but should be viewed with caution. To obtain the best estimates of ASE, we recommend fitting model M₂ and aggregating counts across isoforms — after checking that the isoform level estimates are not widely divergent.

EMASE explicitly models the different types of multi-reads and uses a hierarchical strategy to allocate weights. Our first implementation of EMASE was based on the hierarchy of model M₁. It seemed logical because transcription acts first on an allele and splicing follows to produce the isoform. Yet model M₁ consistently underperforms in comparison to other EMASE models. To understand the difference between models M₁ and M₂ in particular, we note that 86% of all reads are genomic unique, 50.6% of reads are isoform-unique, and 23.1% of reads are allele-unique. Thus we have more information to distinguish among isoforms than we have to distinguish among alleles. In addition, allelic-unique reads are typically defined by one or by a small number of SNPs; whereas isoform-unique reads can be distinguished across most or all of the nucleotides in the read. As a result we have more information to accurately allocate weights across isoforms compared to alleles. And we have the most information available to allocate multi-reads across genes. Our original motivation for constructing the EMASE hierarchy was based on the biology of transcription but a more pertinent consideration for determining the hierarchy is the information content of the data. We obtained the best results when we allocate at the most informative level first and the least informative level last. The differences in performance among the EMASE algorithms reflect this.

Ambiguity in read alignment presents a significant challenge for RNA-seq analysis. While it is tempting to discard or ignore multi-reads, this can lead to bias and reduced precision in estimation. Ambiguity in read alignment can be addressed by proportionately allocating counts using an EM algorithm, a general approach that outperforms methods that discard multi-reads. There are several EM algorithm implementations available, including EMASE, and they all perform well in head-to-head comparisons. EMASE resolves multi-reads by specifying a hierarchy among genes, isoforms, and alleles and we have found that the hierarchy of EMASE model M₂ has generally the best performance. However, the differences reported in our evaluations are small and we would recommend the use of any these EM methods in practice.

Methods

Construction of the diploid transcriptome

We obtained NOD and PWK-specific SNPs and insertions and deletions (indels) of less than 100 bp from the Sanger Mouse Genomes Project SNP and indel Release Version 4 and mm10 genome (ftp://ftp-mouse.sanger.ac.uk/REL-1410-SNPs_Indels/). We used g2gtools (https://github.com/churchill-lab/g2gtools) to construct the strain-specific genome and strain-specific gene annotations using Ensembl release 75 Mus_musculus.GRCm38.75.gtf (ftp://ftp.ensembl.org/pub/release-75/gtf/mus_musculus) (Keane et al, 2011) and constructed strain-specific transcriptomes including all annotated isoforms. We merged the two transcriptomes into a single FASTA file and appended labels to track the strain-specific origin of each isoform.

Read alignment and isoform abundance quantification

We built a bowtie index using rsem-prepare-reference from RSEM version 1.3.0 (Li and Dewey, 2011) with the NOD×PWK diploid transcriptome. We set the polyA-length parameter to attach a 67 bp-long poly-A tail to each transcript sequence element. We aligned our 68 bp singleend RNA-Seq reads to this diploid transcriptome using bowtie version 1.1.2 (Langmead et al, 2009) with parameters ‘–best’, ‘–strata’, ‘-a’, ‘-m100’ and ‘-v3’. With these parameter settings, all read alignments with the best alignment score allowing up to 3 mismatches were retained for analysis. We used the same diploid transcriptome for building kallisto (version 0.43.0) pseudo-alignment index with default k-mer size of 31. We note that kallisto clips off the poly-A tail longer than 10 nt by default. We set the mean and standard deviation of fragment lengths to 250 and 25 respectively for both rsem-calculate-expression and kallisto quant.

RNA-Seq simulations for benchmarking

We estimated isoform expression and allele proportion using rsem-calculate-expression on eight of our NOD×PWK samples (Sample IDs: 532702, 532704, 532706, 532708, 532710, 532714, 532716, and 532720) that contain ~40.4M aligned reads overall. We then simulated 12 independent sets of 10M reads using rsem-simulate-reads with the estimated parameters from the real data. For each simulation, we estimated expression at allele, isoform, and gene levels under all four EMASE models, RSEM, kallisto, and WASP. We started rsem-calculate-expression and EMASE from the same BAM files obtained using bowtie with the same index and parameters specified in the previous section.

Animals and experimental design

We obtained male and female NOD/ShiLtJ (Stock No. 001976) and PWK/PhJ (Stock No. 003715) mice from The Jackson Laboratory (Bar Harbor, ME). Animals were received at 3 to 5 weeks of age, and given free access to food and water. We fed mice one of two diets from Harlan Teklad Custom Research Diets with matched ingredients with the exception of cholecalciferol. The standard diet contained 0.009 g/Kg and the enriched diet contained 0.050 g/Kg of cholecalciferol. NOD and PWK mice were mated using a reciprocal cross design (Figure S8) to obtain 24 male F1 progeny from each cross direction. We euthanized six mice from each cross-by-diet cohort at 15 weeks of age, and the remaining mice were euthanized at age 24 weeks. In total there are eight groups defined by the experimental factors (cross-direction, diet and age) and six biological replicates per group. The Animal Care and Use Committee at JAX approved the procedures in this protocol (AUS #99057).

RNA sequencing

We collected liver tissue from the 48 F1 mice and stored it in RNAlater (Invitrogen) until processing. We isolated total RNA using the Trizol Plus RNA extraction kit (Life Technologies) with on-column DNase digestion, and purified mRNA from the total RNA using biotin tagged poly dT oligonucleotides and streptavidin beads. We fragmented mRNA by enzyme digestion, and generated double-stranded cDNA by random hexamer priming. We generated indexed mRNA-Seq libraries from 1_μg total RNA following the Illumina TruSeq unstranded protocol, and checked quality and quantity of the product using the Agilent Bioanalyzer and the Kapa Biosystems qPCR library quantitation method. We generated 68 bp single-end reads on the Illumina GAIIx sequencing platform. We assigned barcoded samples to sequencing lanes by randomization in a balanced design in which each lane included one sample from each of the treatment groups. We observed no significant differences between the technical replicates and for all subsequent analyses we used pooled data.

Parameter estimation with EMASE

We first introduce some notations. Let N_R be the number of RNA-Seq reads and let r = 1, …, N_R be the index for reads. Let g = 1, …, N_G be the index for genes, let h = 1,2 be the index for alleles, and let i = 1,… N_I_|g be the index for isoforms of gene g. It will be convenient to denote the total number of isoforms in the transcriptome as We denote the length of a transcriptome element as l_ghi and assume the lengths are known constants with an appropriate adjustment for edge effects.

We denote true read origins as R = {R_r : r = 1, …, N_R} a set of indicator matrices where R_r is a matrix of size N_I × 2 containing a single element with value one corresponding to the true origin of read r, and the rest zero. We denote read alignments as A = {A_r : r = 1,…, N_R} a set of indicator matrices where A_r is a matrix with the same dimensions as R_r having one or more elements with value one corresponding to read alignments, and the rest zero. For uniquely aligned reads, the read origin and alignment matrices are assumed to be identical, A_r = R_r. For multiply aligned reads we assume that exactly one of the alignments correspond with the true origin of the read. Thus the element-wise product A_r × R_r = R_r. The read alignment profile, A_r, is observed data, but if r is a multi-read then R_r that represents the true origin of r is not directly observable. We also assume R_r given r’s alignment profile, A_r, follows multinomial distribution of sample size 1 with probabilities, {v_g,h,i|Ar} of length N_I× 2. Our goal is to maximize the complete-data likelihood with respect to {v_g,h,i|Ar}. where I[R_r = (g, h, i)], or I_r(g, h, i) in short, is an indicator for whether the true origin of read r is isoform i of allele h of gene g. Note, given a transcriptome target, the alignment profile of a read is fixed, and therefore, P(A_r |Θ is constant. This equation shows Σ_rI_r (g, h, i) is a sufficient statistic of P(A, R|Θ). Depending on the model of hierarchy, I_r (g, h, i) is defined in a different way and {v_g,h,i|Ar} should be estimated accordingly. For example, for Model M₁, where

We also define additional parameters, θ_g, ϕ_h_|g, and δ_i_|g,h, as the transcript proportion allocated to gene g, the transcript proportion from gene g allocated to allele h, and the transcript proportion from gene g and allele h allocated to isoform i respectively. Based upon these Model M₁ indicators and parameters, we will describe EM-algorithm for M₁. Analogous expression for M₂, M₃, and M₄ are shown in the Appendix section. The complete-data likelihood belongs to the exponential family and its sufficient statistics are derived in Equation (1). The EM-algorithm is simple in that case as it reduces to updating the expectation of the sufficient statistic.

E-step: We compute λ_g,h,_i, the expectation on the sufficient statistic Σ_rI_r (g, h, i) across all the possible combination of (g, h, i) given the observed data A and current estimates of the transcript proportion parameters Θ. λ_g,_h,_i can be thought of as the expected read counts for each transcript. Precisely we compute λ_g,_h,i for EMASE model M₁ using the following equation.

As previously mentioned, the quantities {v_{g h i|Ar}} are the probabilities that govern the multinomial outcome of R_r conditional on A_r. They can also be thought of as weights that allocate read r proportionally across each of the elements to which it aligns.

M-step: We use the expected read counts, λ_g,_h,_i, to estimate transcript proportion parameters Θ. For example, δ_i_|g,_h, ϕ_h_|g, and θ_g for M₁ are obtained as

The length adjustment uses l_g,_h,_i, the effective length of a transcript i of allele h of gene g which is defined to be [the length of the transcript] + [the length of the poly-A tail we added to the alignment target] - [the mean fragment length] + 1 (Nicolae et al, 2011; Li and Dewey, 2011).

EMASE software

The EMASE software (DOI:10.5281/zenodo.291791) is implemented in Python and C++. It is available at https://github.com/churchill-lab/emase and is distributed under the GNU General Public License v3.0. We assume diploid genomes and haplotype-specific gene annotations are available or created using a package, e.g., g2gtools (DOI:10.5281/zenodo.292952) available at https://github.com/churchill-lab/g2gtools. These software tools take reference genome sequence file(s) in fasta format and variant call file(s) in vcf format and creates diploid pseudogenomes. g2gtools also lifts over reference gene annotation file (in GTF format) to a new pseudogenome coordinate system and extracts isoform (or genomic) sequences from the pseudogenome. EMASE command create-hybrid creates a diploid transcriptome by pooling two haploid transcriptome with allele identifiers added as suffixes and builds indices for read alignment. The reads can be aligned to the target sequences using any standard alignment software, such as bowtie. The command bam-to-emase converts the BAM format alignment file (Li et al, 2009) into an alignment incidence profile. The alignment profile file is a 3-dim sparse matrix of size N_r × N_I × 2 stored in PyTables HDF5 format (Alted et al, 2002-2014; The HDF Group, 2000-2010). The sparse matrix has value 1 if a read aligns to a specific isoform of an allele. The alignment profile file can be further reduced in size by collapsing together reads with identical alignment profiles as equivalent classes (Nicolae et al, 2011). The collapsed sparse matrix in HDF5 format reduces storage more than 1,000 fold compared to the original BAM file. For example, 40.4M 68-bp reads (See Methods) aligned to the diploid transcriptome results in a BAM file of size ~4.7GB, while the collapsed alignment profile file in HDF5 format is just about 3.1MB. The command run-emase executes the EM algorithm to quantify expression abundance at gene, allele, and isoform level. The user can optionally specify the convergence criteria and choose one of the four EMASE models (Figure 2a). According to our run test using the same 40.4M read data on a cluster compute node with Intel Xeon E5-2695 v3 @2.3GHz and 526GB RAM, EMASE C++ implementation ran 11.8 times faster than RSEM (5m59s versus 1h10m54s) using 16 cores in parallel, both starting from the same BAM file of ~270M alignments. For the same data and resource, kallisto took 1m41s which includes its own pseudoalignment process.

Comparison methods

There are many software packages and workflows implemented for RNA-Seq analysis that provide estimates of ASE and total or isoform-level gene expression. These implementations differ in many small, and some important, details. For the purpose of evaluating EMASE we chose four comparative methods that reflect current practices in ASE estimation. Each method isolates one key feature in the analysis workflow. Using each of these methods, we estimate maternal (M) and paternal (P) allele-level read counts at the gene or isoform-level and estimate ASE as the proportion of maternal allelic read counts, M/(M + P). One method (WASP) does not produce isoform level estimates of ASE or total gene- and isoform-level expression. We excluded the unique-reads method for comparing total gene- or isoform-level expression because it loses too many reads for proper benchmarking. In order to focus just on the impact of discarding allelic multi-reads, we implemented our own version of this widely-used unique-reads approach to ASE estimation. We first aligned reads to the diploid transcriptome using the same BAM files as the EMASE and RSEM analyses. For gene-level ASE estimation, we discard the genomic multi-reads; and for isoform-level ASE estimation, we discard genomic and isoform multireads. Using the count-alignments feature of EMASE, we count reads that uniquely align to each allele at the gene- or isoform-level. WASP is an ASE estimation method that use a reference genome alignment. We aligned the reads to the mm10 reference genome using TopHat and used WASP software to re-assign the reads containing alternate alleles. The WASP corrected BAM files were used as input to GATK’s ASEReadCounter to get allele-level read counts at SNPs. The SNP level counts provide by WASP were averaged to get gene level ASE. We did not obtain estimates of isoform-level ASE with WASP. RSEM estimates ASE and total gene expression from aligned sequence reads. We used the same BAM format alignments for RSEM and EMASE analysis. RSEM retains multi-reads of all types and uses an EM algorithm to estimate expected counts. RSEM does not distinguish among allele, isoform and genomic multi-reads. kallisto estimates ASE and total gene expression based on a k-mer matching algorithms that generates a pseudo-alignment to RNA-Seq reads. It uses an EM algorithm to resolve multi-reads and, like RSEM, treats all types of multireads equally. We provide the NODxPWK diploid transcriptome to kallisto and obtain estimates of ASE and total expression at the gene- and isoform-level.

Statistical significance of ASE

We used a Beta-Binomial model of individual sample allele-specific read counts across 48 samples to evaluate the statistical significance of ASE. The Beta-Binomial model allows sample-to-sample variation in ASE and has been shown to accurately model overdispersion in RNA-Seq data (Pickrell et al, 2010). If X is the expected read count from the PWK allele and n is the total read count, we assume that X is drawn from a binomial distribution with parameters n and p, X ~ Bin(n,p), where p is a random variable drawn from a beta distribution with parameters α and β, p ~ Beta(α, β). We tested the null hypothesis H₀ : α = β, which corresponds to values of p that are on average , against the alternative hypothesis H₀ : α ≠ β using a likelihood ratio test. We evaluated the significance of ASE separately for each direction of cross, treating each of the 24 samples from one direction as independent replicates and made no adjustment for diet or age at this stage.

In addition, we tested for effects of direction-of-cross, diet, and age on ASE using an additive logistic regression model (Cross+Diet+Age) with a quasi-binomial link function to account for overdispersion. We evaluated the significance of each main effect by dropping the corresponding term from the additive model and computing the likelihood ratio (chi-square) statistic for the reduced model.

Software and Data Availability

The EMASE software is freely available under GNU license at https://github.com/churchill-lab/emase. Raw RNA-Seq data files in fastq format and processed allele-level expression estimates are archived at GEO under the accession #xxxxxx. Simulation data are available at ftp://churchill-lab.jax.org/pub/software/EMASE/.

Author’s contributions

NR, KC, SM, and GA conceived the algorithm and KC, MV, GB, and KS implemented the methods. RK and FP provided data and interpretation of findings. KC and NR analyzed data. KC, NR and GAC wrote the manuscript.

Appendix A: EMASE Models

As described in previous sections, the complete-data likelihood is

P(A, R|Θ) is the sum of multinomial random variables {R_r}, each having sample size of 1 with the probabilities {v_{g h}_i|Ar}. The probability v_{g h i}_|Ar is a function of A_r and λ_g,_h,_i, the alignment profile of read r and the proportion of reads that originate from isoform i of allele h of gene g respectively. Depending on our models of hierarchy, I_r (g, h, i) is defined in the following ways, and v_{g h i}_|Ar should be estimated along this structure. No nested structure is assumed in Model 4.

In E-step, we compute the expectation on the sufficient statistic Σ_rI_r (g, h, i) given the observed data A and the current estimates of the model parameters Θ. Similarly to θ_g, ϕ_h_|g, and δ_i_|g,_h for Model 1, we define ϕ_h_|g,_i and δ_i_|g to be the transcript proportion that originate from allele h given isoform i of gene g and the transcript proportion from isoform i given gene g respectively for Model 2. For Model 3, we also define δ_h,i_|g to be the transcript proportion from isoform i of allele h given gene g, and, for Model 4, δ_g,_h,_i to be the transcript proportion from isoform i of allele h of gene g.

In M-step, we update the maximum likelihood estimator (MLE) of δ, ϕ, and θ in each model. Suppose λ_g,h,i denotes the proportion of reads sampled from transcript i of allele h of gene g, then, for all models,

Since δ, ϕ, θ are all proportions with respect to the transcript count (or equivalently coverage depth), where l_g,h,i is the effective length of isoform i from allele h of gene g: (the length of the isoform) + (the length of poly-A tail we add to the target sequences) - (the RNA-Seq read length) + 1.

Figure S1: Performance of EMASE model M₂, RSEM, and kallisto on ASE at the gene level (a)(c)(e) and at the isoform level (b)(d)(f)

Figure S2: Performance of EMASE model M₂, RSEM, and kallisto on total gene expression (a)(c)(e) and total isoform expression (b)(d)(f)

Figure S3:

Signed deviation plot for head-to-head comparison of methods for ASE and total expression estimation methods. This figure provides a guide for interpreting the supplemental figures S4 through S7, which provide details behind the comparison summaries in Table 2. The x- and y-axes represent deviation of the estimated values from the true simulated values. Therefore, each gene or isoform is represented as a point at coordinate (M_x — Truth, M_y — Truth) where M_x and M_y denote estimated values. The range of total expression estimates is wide and we converted them to log scale by taking logarithm on the absolute value of the differences after adding 1.0 and we keep the sign of the deviations. Thus negative values on each axis denote underestimation and positive values indicate overestimation relative to truth. On either scale, regular or log scale, a point nearer to y-axis (quadrants 3a or 3b) indicates higher accuracy of the M_x estimates and a point nearer to the x-axis (quadrants 4a or 4b) indicates higher accuracy of M_y estimates. Dotted lines are demarcate where the estimates of M_x and M_y have essentially the same deviation from the truth. The proportions reported in Table 2 are obtained by comparing the count of items in 3a,b to those in 4a,b relative to the total numbers of genes or isoforms. This configuration is applied to plots in Figures S4, S5, S6, and S7 and log-scale is used for Figures S6 and S7.

Figure S4:

Head-to-head comparison between EMASE Models on ASE. Since M₂ performed best overall, we show it compared with the other models with respect to deviations from the true proportions (%) of the PWK allele expression at the gene-level (a)(c)(e) and at the isoform-level (b)(d)(f). Frequency values in each hexbin cells in these plots are average across twelve samples. These frequencies are also rescaled in log₁₀ space in order to visualize highly dispersed values.

Figure S5:

Head-to-head comparison between EMASE and other EM-based method, RSEM (a)(b) and kallisto (c)(d), on ASE. Comparisons are performed at the gene-level (a)(c) and at the isoform-level (b)(d). Frequencies in 2-dim histogram (or hexbin) are in log₁₀ scale.

Figure S6:

Head-to-head comparison between EMASE Models with respect to total gene expression (a)(c)(e) and total isoform expression (b)(d)(f). Again, we show how M₂ performs compared with other models. Plots are in log scale (See Figure S3 for how to transform the signed deviation plot to log scale). Frequency values in each hexbin cells in these plots are average across twelve samples. These frequencies are also rescaled in log₁₀ space in order to visualize highly dispersed values.

Figure S7:

Head-to-head comparison between EMASE and other EM-based methods, RSEM (a)(b) and kallisto (c)(d), on total expression. Comparisons are performed at the gene level (a)(c) and at the isoform level (b)(d). Frequencies in 2-dim histogram (or hexbin) are in log10 scale. A darker area at the lower left region in the left quadrant of (a) and (c) indicate that EM methods generally underestimate expression but EMASE does it in a lesser degree compared to RSEM or Kallisto.

Figure S8:

Experimental design of the Reciprocal F1 hybrid cross.

Figure S9:

Effects of direction-of-cross, diet and age on ASE. P-value histograms for the effect of (A) direction of cross, (B) diet, and (C) age on ASE. This shows that direction of cross has small effect on ASE, diet has almost no effect on ASE and age has decent effect on ASE.

References

↵
Agresti A (2002) Categorical Data Analysis. Wiley Series in Probability and Statistics. Wiley-Interscience, 2nd edition
↵
Alted F, Vilata I, et al. (2002-2014) PyTables: Hierarchical Datasets in Python
↵
Andergassen D, Dotter CP, Kulinski TM, Guenzl PM, Bammer PC, Barlow DP, Pauler FM, Hudson QJ (2015) Allelome.PRO, a pipeline to define allele-specific genomic features from high-throughput sequencing data. Nucleic acids research : gkv727
↵
Babak T, DeVeale B, Tsang EK, Zhou Y, Li X, Smith KS, Kukurba KR, Zhang R, Li JB, van der Kooy D, Montgomery SB, Fraser HB (2015) Genetic conflict reflected in tissue-specific maps of genomic imprinting in human and mouse. Nat Genet 47: 544-549
OpenUrl CrossRef PubMed
↵
Baker CL, Kajita S, Walker M, Saxl RL, Raghupathy N, Choi K, Petkov PM, Paigen K (2015) PRDM9 Drives Evolutionary Erosion of Hotspots in Mus musculus through Haplotype-Specific Initiation of Meiotic Recombination. PLoS Genet 11: e1004916
OpenUrl CrossRef PubMed
↵
Baran Y, Subramaniam M, Biton A, Tukiainen T, Tsang EK, Rivas MA, Pirinen M, Gutierrez-Arcelus M, Smith KS, Kukurba KR, Zhang R, Eng C, Torgerson DG, Urbanek C, Li JB, Rodriguez-Santana JR, Burchard EG, Seibold MA, MacArthur DG, Montgomery SB, et al. (2015) The landscape of genomic imprinting across diverse adult human tissues. Genome Res 25: 927-936
OpenUrl Abstract/FREE Full Text
↵
Bray NL, Pimentel H, Melsted P, Pachter L (2016) Near-optimal probabilistic RNA-seq quantification. Nature biotechnology 34: 525-527
OpenUrl CrossRef PubMed
↵
Castel SE, Levy-Moonshine A, Mohammadi P, Banks E, Lappalainen T (2015) Tools and best practices for data processing in allelic expression analysis. Genome Biol 16: 195
OpenUrl CrossRef PubMed
↵
Chick JM, Munger SC, Simecek P, Huttlin EL, Choi K, Gatti DM, Raghupathy N, Svenson KL, Churchill GA, Gygi SP (2016) Defining the consequences of genetic variation on a proteome-wide scale. Nature 534: 500-505
OpenUrl CrossRef PubMed
↵
Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson A, Szczesniak MW, Gaffney DJ, Elo LL, Zhang X, et al. (2016) A survey of best practices for RNA-seq data analysis. Genome Biology 17: 1
OpenUrl CrossRef PubMed
↵
Coolon JD, Stevenson KR, McManus CJ, Graveley BR, Wittkopp PJ (2012) Genomic imprinting absent in Drosophila melanogaster adult females. Cell Rep 2: 69-75
OpenUrl CrossRef PubMed Web of Science
↵
De La Chapelle A (2009) Genetic predisposition to human disease: allele-specific expression and low-penetrance regulatory loci. Oncogene 28: 3345-3348
OpenUrl CrossRef PubMed Web of Science
↵
Degner JF, Marioni JC, Pai AA, Pickrell JK, Nkadori E, Gilad Y, Pritchard JK (2009) Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics 25: 3207-3212
OpenUrl CrossRef PubMed Web of Science
↵
DeVeale B, van der Kooy D, Babak T (2012) Critical evaluation of imprinted gene expression by RNA-Seq: a new perspective. PLoS Genet 8: e1002600
OpenUrl CrossRef PubMed
↵
Frazee AC, Jaffe AE, Langmead B, Leek JT (2015) Polyester: simulating RNA-seq datasets with differential transcript expression. Bioinformatics 31: 2778-2784
OpenUrl CrossRef PubMed
↵
Goncalves A, Leigh-Brown S, Thybert D, Stefflova K, Turro E, Flicek P, Brazma A, Odom DT, Marioni JC (2012) Extensive compensatory cis-trans regulation in the evolution of mouse gene expression. Genome Res 22: 2376-2384
OpenUrl Abstract/FREE Full Text
↵
Graze RM, Novelo LL, Amin V, Fear JM, Casella G, Nuzhdin SV, McIntyre LM (2012) Allelic imbalance in Drosophila hybrid heads: exons, isoforms, and evolution. Mol Biol Evol 29: 15211532
OpenUrl
↵
Gregg C, Zhang J, Weissbourd B, Luo S, Schroth GP, Haig D, Dulac C (2010) High-resolution analysis of parent-of-origin allelic expression in the mouse brain. Science 329: 643-648
OpenUrl Abstract/FREE Full Text
↵
Gresham D, Curry B, Ward A, Gordon DB, Brizuela L, Kruglyak L, Botstein D (2010) Optimized detection of sequence variation in heterozygous genomes using DNA microarrays with isothermal-melting probes. Proc Natl Acad Sci USA 107: 1482-1487
OpenUrl Abstract/FREE Full Text
↵
Griebel T, Zacher B, Ribeca P, Raineri E, Lacroix V, Guigo R, Sammeth M (2012) Modelling and simulating generic RNA-Seq experiments with the flux simulator. Nucleic Acids Res 40: 10073-10083
OpenUrl CrossRef PubMed Web of Science
↵
Kanitz A, Gypas F, Gruber AJ, Gruber AR, Martin G, Zavolan M (2015) Comparative assessment of methods for the computational inference of transcript isoform abundance from RNA-seq data. Genome biology 16: 150
OpenUrl CrossRef PubMed
↵
Keane TM, Goodstadt L, Danecek P, White MA, Wong K, et al. (2011) Mouse genomic variation and its effect on phenotypes and gene regulation. Nature 477: 289-294
OpenUrl CrossRef PubMed Web of Science
↵
Kim D, Langmead B, Salzberg SL (2015) HISAT: a fast spliced aligner with low memory requirements. Nature methods 12: 357-360
OpenUrl
↵
Kukurba KR, Zhang R, Li X, Smith KS, Knowles DA, How Tan M, Piskol R, Lek M, Snyder M, Macarthur DG, Li JB, Montgomery SB (2014) Allelic expression of deleterious protein-coding variants across human tissues. PLoS Genet 10: e1004304
OpenUrl CrossRef PubMed
↵
Lagarrigue S, Martin L, Hormozdiari F, Roux PF, Pan C, van Nas A, Demeure O, Cantor R, Ghaz-alpour A, Eskin E, Lusis AJ (2013) Analysis of allele-specific expression in mouse liver by RNA-Seq: a comparison with Cis-eQTL identified using genetic linkage. Genetics 195: 1157-1166
OpenUrl Abstract/FREE Full Text
↵
Lalonde E, Ha KCH, Wang Z, Bemmo A, Kleinman CL, Kwan T, Pastinen T, Majewski J (2011) RNA sequencing reveals the role of splicing polymorphisms in regulating human gene expression. Genome Res 21: 545-554
OpenUrl Abstract/FREE Full Text
↵
Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10: R25
OpenUrl CrossRef PubMed
↵
Lappalainen T, Sammeth M, Friedlander MR, ’t Hoen PA, Monlong J, Rivas MA, Gonzalez-Porta M, Kurbatova N, Griebel T, Ferreira PG, Barann M, Wieland T, Greger L, van Iterson M, Almlof J, Ribeca P, Pulyakhina I, Esser D, Giger T, Tikhonov A, et al. (2013) Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501: 506-511
OpenUrl CrossRef PubMed Web of Science
↵
Law CW, Chen Y, Shi W, Smyth GK (2014) Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome biology 15: R29
OpenUrl CrossRef PubMed
↵
Li B, Dewey CN (2011) RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12: 323
OpenUrl CrossRef PubMed
↵
Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN (2010) RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics 26: 493-500
OpenUrl CrossRef PubMed Web of Science
↵
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25: 2078-2079
OpenUrl CrossRef PubMed Web of Science
↵
Lister R, O’Malley RC, Tonti-Filippini J, Gregory BD, Berry CC, Millar AH, Ecker JR (2008) Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell 133: 523-536
OpenUrl CrossRef PubMed Web of Science
↵
Love MI, Huber W, Anders S (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome biology 15: 550
OpenUrl CrossRef PubMed
↵
McKean DM, Homsy J, Wakimoto H, Patel N, Gorham J, DePalma SR, Ware JS, Zaidi S, Ma W, Patel N, et al. (2016) Loss of RNA expression and allele-specific expression associated with congenital heart disease. Nature communications 7
↵
McManus CJ, Coolon JD, Duff MO, Eipper-Mains J, Graveley BR, Wittkopp PJ (2010) Regulatory divergence in Drosophila revealed by mRNA-seq. Genome Res 20: 816-825
OpenUrl Abstract/FREE Full Text
↵
Montgomery SB, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J, Guigo R, Der-mitzakis ET (2010) Transcriptome genetics using second generation sequencing in a Caucasian population. Nature 464: 773-777
OpenUrl CrossRef PubMed Web of Science
↵
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5: 621-628
OpenUrl CrossRef PubMed Web of Science
↵
Munger S, Raghupathy N, Choi KB, Simons A, Gatti D, Hinerfeld D, Svenson K, Attie A, Hibbs M, Graber J, Churchill G, Chesler E (2014) RNA-seq alignment to individualized genomes improves transcript abundance estimates in Multiparent populations. Genetics 198: 59-73
OpenUrl Abstract/FREE Full Text
↵
Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M (2008) The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320: 1344-1349
OpenUrl Abstract/FREE Full Text
↵
Nicolae M, Mangul S, Mandoiu I, Zelikovsky A (2011) Estimation of alternative splicing isoform frequencies from RNA-Seq data. Algorithms for Molecular Biology 6:9
OpenUrl
↵
Patro R, Mount SM, Kingsford C (2014) Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat Biotechnol 32: 462-464
OpenUrl CrossRef PubMed
↵
Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras JB, Stephens M, Gilad Y, Pritchard JK (2010) Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 464: 768-772
OpenUrl CrossRef PubMed Web of Science
↵
Robert C, Watson M (2015) Errors in RNA-Seq quantification affect genes of relevance to human disease. Genome biology 16: 177
OpenUrl CrossRef PubMed
↵
Roberts A, Pachter L (2013) Streaming fragment assignment for real-time analysis of sequencing experiments. Nat Methods 10: 71-73
OpenUrl CrossRef PubMed Web of Science
↵
Robinson MD, McCarthy DJ, Smyth GK (2010) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26: 139-140
OpenUrl CrossRef PubMed Web of Science
↵
Ronald J, Akey JM, Whittle J, Smith EN, Yvert G, Kruglyak L (2005) Simultaneous genotyping, gene-expression measurement, and detection of allele-specific expression with oligonucleotide arrays. Genome Res 15: 284-291
OpenUrl Abstract/FREE Full Text
↵
Rozowsky J, Abyzov A, Wang J, Alves P, Raha D, Harmanci A, Leng J, Bjornson R, Kong Y, Kitabayashi N, Bhardwaj N, Rubin M, Snyder M, Gerstein M (2011) AlleleSeq: analysis of allele-specific expression and binding in a network framework. Mol Syst Biol 7: 522
OpenUrl Abstract/FREE Full Text
↵
Skelly DA, Johansson M, Madeoy J, Wakefield J, Akey JM (2011) A powerful and flexible statistical framework for testing hypotheses of allele-specific gene expression from RNA-seq data. Genome Res 21: 1728-1737
OpenUrl Abstract/FREE Full Text
↵
Stevenson KR, Coolon JD, Wittkopp PJ (2013) Sources of bias in measures of allele-specific expression derived from RNA-sequence data aligned to a single reference genome. BMC Genomics 14: 536
OpenUrl CrossRef PubMed
↵
The HDF Group (2000-2010) Hierarchical data format version 5
↵
Turro E, Su SY, Goncalves A, Coin LJ, Richardson S, Lewin A (2011) Haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads. Genome Biol 12: R13
OpenUrl CrossRef PubMed
↵
van de Geijn B, McVicker G, Gilad Y, Pritchard JK (2015) WASP: allele-specific software for robust molecular quantitative trait locus discovery. Nat Methods 12: 1061-1063
OpenUrl CrossRef PubMed
↵
Wang X, Soloway PD, Clark AG (2011) A survey for novel imprinted genes in the mouse placenta by mRNA-seq. Genetics 189: 109-122
OpenUrl Abstract/FREE Full Text
↵
Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10: 57-63
OpenUrl CrossRef PubMed Web of Science
↵
Wittkopp PJ, Haerum BK, Clark AG (2004) Evolutionary changes in cis and trans gene regulation. Nature 430: 85-88
OpenUrl CrossRef PubMed Web of Science

View the discussion thread.

Posted July 22, 2017.

Download PDF

Citation Tools

Subject Area

Bioinformatics

Subject Areas

All Articles

Animal Behavior and Cognition (5214)
Biochemistry (11745)
Bioengineering (8751)
Bioinformatics (29195)
Biophysics (14971)
Cancer Biology (12095)
Cell Biology (17411)
Clinical Trials (138)
Developmental Biology (9421)
Ecology (14178)
Epidemiology (2067)
Evolutionary Biology (18306)
Genetics (12245)
Genomics (16801)
Immunology (11867)
Microbiology (28083)
Molecular Biology (11592)
Neuroscience (60965)
Paleontology (451)
Pathology (1870)
Pharmacology and Toxicology (3238)
Physiology (4959)
Plant Biology (10427)
Scientific Communication and Education (1683)
Synthetic Biology (2885)
Systems Biology (7339)
Zoology (1651)

[1] ↵
Agresti A (2002) Categorical Data Analysis. Wiley Series in Probability and Statistics. Wiley-Interscience, 2nd edition

[2] ↵
Alted F, Vilata I, et al. (2002-2014) PyTables: Hierarchical Datasets in Python

[3] ↵
Andergassen D, Dotter CP, Kulinski TM, Guenzl PM, Bammer PC, Barlow DP, Pauler FM, Hudson QJ (2015) Allelome.PRO, a pipeline to define allele-specific genomic features from high-throughput sequencing data. Nucleic acids research : gkv727

[4] ↵
Babak T, DeVeale B, Tsang EK, Zhou Y, Li X, Smith KS, Kukurba KR, Zhang R, Li JB, van der Kooy D, Montgomery SB, Fraser HB (2015) Genetic conflict reflected in tissue-specific maps of genomic imprinting in human and mouse. Nat Genet 47: 544-549
OpenUrl CrossRef PubMed

[5] ↵
Baker CL, Kajita S, Walker M, Saxl RL, Raghupathy N, Choi K, Petkov PM, Paigen K (2015) PRDM9 Drives Evolutionary Erosion of Hotspots in Mus musculus through Haplotype-Specific Initiation of Meiotic Recombination. PLoS Genet 11: e1004916
OpenUrl CrossRef PubMed

[6] ↵
Baran Y, Subramaniam M, Biton A, Tukiainen T, Tsang EK, Rivas MA, Pirinen M, Gutierrez-Arcelus M, Smith KS, Kukurba KR, Zhang R, Eng C, Torgerson DG, Urbanek C, Li JB, Rodriguez-Santana JR, Burchard EG, Seibold MA, MacArthur DG, Montgomery SB, et al. (2015) The landscape of genomic imprinting across diverse adult human tissues. Genome Res 25: 927-936
OpenUrl Abstract/FREE Full Text

[7] ↵
Bray NL, Pimentel H, Melsted P, Pachter L (2016) Near-optimal probabilistic RNA-seq quantification. Nature biotechnology 34: 525-527
OpenUrl CrossRef PubMed

[8] ↵
Castel SE, Levy-Moonshine A, Mohammadi P, Banks E, Lappalainen T (2015) Tools and best practices for data processing in allelic expression analysis. Genome Biol 16: 195
OpenUrl CrossRef PubMed

[9] ↵
Chick JM, Munger SC, Simecek P, Huttlin EL, Choi K, Gatti DM, Raghupathy N, Svenson KL, Churchill GA, Gygi SP (2016) Defining the consequences of genetic variation on a proteome-wide scale. Nature 534: 500-505
OpenUrl CrossRef PubMed

[10] ↵
Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson A, Szczesniak MW, Gaffney DJ, Elo LL, Zhang X, et al. (2016) A survey of best practices for RNA-seq data analysis. Genome Biology 17: 1
OpenUrl CrossRef PubMed

[11] ↵
Coolon JD, Stevenson KR, McManus CJ, Graveley BR, Wittkopp PJ (2012) Genomic imprinting absent in Drosophila melanogaster adult females. Cell Rep 2: 69-75
OpenUrl CrossRef PubMed Web of Science

[12] ↵
De La Chapelle A (2009) Genetic predisposition to human disease: allele-specific expression and low-penetrance regulatory loci. Oncogene 28: 3345-3348
OpenUrl CrossRef PubMed Web of Science

[13] ↵
Degner JF, Marioni JC, Pai AA, Pickrell JK, Nkadori E, Gilad Y, Pritchard JK (2009) Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics 25: 3207-3212
OpenUrl CrossRef PubMed Web of Science

[14] ↵
DeVeale B, van der Kooy D, Babak T (2012) Critical evaluation of imprinted gene expression by RNA-Seq: a new perspective. PLoS Genet 8: e1002600
OpenUrl CrossRef PubMed

[15] ↵
Frazee AC, Jaffe AE, Langmead B, Leek JT (2015) Polyester: simulating RNA-seq datasets with differential transcript expression. Bioinformatics 31: 2778-2784
OpenUrl CrossRef PubMed

[16] ↵
Goncalves A, Leigh-Brown S, Thybert D, Stefflova K, Turro E, Flicek P, Brazma A, Odom DT, Marioni JC (2012) Extensive compensatory cis-trans regulation in the evolution of mouse gene expression. Genome Res 22: 2376-2384
OpenUrl Abstract/FREE Full Text

[17] ↵
Graze RM, Novelo LL, Amin V, Fear JM, Casella G, Nuzhdin SV, McIntyre LM (2012) Allelic imbalance in Drosophila hybrid heads: exons, isoforms, and evolution. Mol Biol Evol 29: 15211532
OpenUrl

[18] ↵
Gregg C, Zhang J, Weissbourd B, Luo S, Schroth GP, Haig D, Dulac C (2010) High-resolution analysis of parent-of-origin allelic expression in the mouse brain. Science 329: 643-648
OpenUrl Abstract/FREE Full Text

[19] ↵
Gresham D, Curry B, Ward A, Gordon DB, Brizuela L, Kruglyak L, Botstein D (2010) Optimized detection of sequence variation in heterozygous genomes using DNA microarrays with isothermal-melting probes. Proc Natl Acad Sci USA 107: 1482-1487
OpenUrl Abstract/FREE Full Text

[20] ↵
Griebel T, Zacher B, Ribeca P, Raineri E, Lacroix V, Guigo R, Sammeth M (2012) Modelling and simulating generic RNA-Seq experiments with the flux simulator. Nucleic Acids Res 40: 10073-10083
OpenUrl CrossRef PubMed Web of Science

[21] ↵
Kanitz A, Gypas F, Gruber AJ, Gruber AR, Martin G, Zavolan M (2015) Comparative assessment of methods for the computational inference of transcript isoform abundance from RNA-seq data. Genome biology 16: 150
OpenUrl CrossRef PubMed

[22] ↵
Keane TM, Goodstadt L, Danecek P, White MA, Wong K, et al. (2011) Mouse genomic variation and its effect on phenotypes and gene regulation. Nature 477: 289-294
OpenUrl CrossRef PubMed Web of Science

[23] ↵
Kim D, Langmead B, Salzberg SL (2015) HISAT: a fast spliced aligner with low memory requirements. Nature methods 12: 357-360
OpenUrl

[24] ↵
Kukurba KR, Zhang R, Li X, Smith KS, Knowles DA, How Tan M, Piskol R, Lek M, Snyder M, Macarthur DG, Li JB, Montgomery SB (2014) Allelic expression of deleterious protein-coding variants across human tissues. PLoS Genet 10: e1004304
OpenUrl CrossRef PubMed

[25] ↵
Lagarrigue S, Martin L, Hormozdiari F, Roux PF, Pan C, van Nas A, Demeure O, Cantor R, Ghaz-alpour A, Eskin E, Lusis AJ (2013) Analysis of allele-specific expression in mouse liver by RNA-Seq: a comparison with Cis-eQTL identified using genetic linkage. Genetics 195: 1157-1166
OpenUrl Abstract/FREE Full Text

[26] ↵
Lalonde E, Ha KCH, Wang Z, Bemmo A, Kleinman CL, Kwan T, Pastinen T, Majewski J (2011) RNA sequencing reveals the role of splicing polymorphisms in regulating human gene expression. Genome Res 21: 545-554
OpenUrl Abstract/FREE Full Text

[27] ↵
Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10: R25
OpenUrl CrossRef PubMed

[28] ↵
Lappalainen T, Sammeth M, Friedlander MR, ’t Hoen PA, Monlong J, Rivas MA, Gonzalez-Porta M, Kurbatova N, Griebel T, Ferreira PG, Barann M, Wieland T, Greger L, van Iterson M, Almlof J, Ribeca P, Pulyakhina I, Esser D, Giger T, Tikhonov A, et al. (2013) Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501: 506-511
OpenUrl CrossRef PubMed Web of Science

[29] ↵
Law CW, Chen Y, Shi W, Smyth GK (2014) Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome biology 15: R29
OpenUrl CrossRef PubMed

[30] ↵
Li B, Dewey CN (2011) RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12: 323
OpenUrl CrossRef PubMed

[31] ↵
Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN (2010) RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics 26: 493-500
OpenUrl CrossRef PubMed Web of Science

[32] ↵
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25: 2078-2079
OpenUrl CrossRef PubMed Web of Science

[33] ↵
Lister R, O’Malley RC, Tonti-Filippini J, Gregory BD, Berry CC, Millar AH, Ecker JR (2008) Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell 133: 523-536
OpenUrl CrossRef PubMed Web of Science

[34] ↵
Love MI, Huber W, Anders S (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome biology 15: 550
OpenUrl CrossRef PubMed

[35] ↵
McKean DM, Homsy J, Wakimoto H, Patel N, Gorham J, DePalma SR, Ware JS, Zaidi S, Ma W, Patel N, et al. (2016) Loss of RNA expression and allele-specific expression associated with congenital heart disease. Nature communications 7

[36] ↵
McManus CJ, Coolon JD, Duff MO, Eipper-Mains J, Graveley BR, Wittkopp PJ (2010) Regulatory divergence in Drosophila revealed by mRNA-seq. Genome Res 20: 816-825
OpenUrl Abstract/FREE Full Text

[37] ↵
Montgomery SB, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J, Guigo R, Der-mitzakis ET (2010) Transcriptome genetics using second generation sequencing in a Caucasian population. Nature 464: 773-777
OpenUrl CrossRef PubMed Web of Science

[38] ↵
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5: 621-628
OpenUrl CrossRef PubMed Web of Science

[39] ↵
Munger S, Raghupathy N, Choi KB, Simons A, Gatti D, Hinerfeld D, Svenson K, Attie A, Hibbs M, Graber J, Churchill G, Chesler E (2014) RNA-seq alignment to individualized genomes improves transcript abundance estimates in Multiparent populations. Genetics 198: 59-73
OpenUrl Abstract/FREE Full Text

[40] ↵
Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M (2008) The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320: 1344-1349
OpenUrl Abstract/FREE Full Text

[41] ↵
Nicolae M, Mangul S, Mandoiu I, Zelikovsky A (2011) Estimation of alternative splicing isoform frequencies from RNA-Seq data. Algorithms for Molecular Biology 6:9
OpenUrl

[42] ↵
Patro R, Mount SM, Kingsford C (2014) Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat Biotechnol 32: 462-464
OpenUrl CrossRef PubMed

[43] ↵
Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras JB, Stephens M, Gilad Y, Pritchard JK (2010) Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 464: 768-772
OpenUrl CrossRef PubMed Web of Science

[44] ↵
Robert C, Watson M (2015) Errors in RNA-Seq quantification affect genes of relevance to human disease. Genome biology 16: 177
OpenUrl CrossRef PubMed

[45] ↵
Roberts A, Pachter L (2013) Streaming fragment assignment for real-time analysis of sequencing experiments. Nat Methods 10: 71-73
OpenUrl CrossRef PubMed Web of Science

[46] ↵
Robinson MD, McCarthy DJ, Smyth GK (2010) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26: 139-140
OpenUrl CrossRef PubMed Web of Science

[47] ↵
Ronald J, Akey JM, Whittle J, Smith EN, Yvert G, Kruglyak L (2005) Simultaneous genotyping, gene-expression measurement, and detection of allele-specific expression with oligonucleotide arrays. Genome Res 15: 284-291
OpenUrl Abstract/FREE Full Text

[48] ↵
Rozowsky J, Abyzov A, Wang J, Alves P, Raha D, Harmanci A, Leng J, Bjornson R, Kong Y, Kitabayashi N, Bhardwaj N, Rubin M, Snyder M, Gerstein M (2011) AlleleSeq: analysis of allele-specific expression and binding in a network framework. Mol Syst Biol 7: 522
OpenUrl Abstract/FREE Full Text

[49] ↵
Skelly DA, Johansson M, Madeoy J, Wakefield J, Akey JM (2011) A powerful and flexible statistical framework for testing hypotheses of allele-specific gene expression from RNA-seq data. Genome Res 21: 1728-1737
OpenUrl Abstract/FREE Full Text

[50] ↵
Stevenson KR, Coolon JD, Wittkopp PJ (2013) Sources of bias in measures of allele-specific expression derived from RNA-sequence data aligned to a single reference genome. BMC Genomics 14: 536
OpenUrl CrossRef PubMed

[51] ↵
The HDF Group (2000-2010) Hierarchical data format version 5

[52] ↵
Turro E, Su SY, Goncalves A, Coin LJ, Richardson S, Lewin A (2011) Haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads. Genome Biol 12: R13
OpenUrl CrossRef PubMed

[53] ↵
van de Geijn B, McVicker G, Gilad Y, Pritchard JK (2015) WASP: allele-specific software for robust molecular quantitative trait locus discovery. Nat Methods 12: 1061-1063
OpenUrl CrossRef PubMed

[54] ↵
Wang X, Soloway PD, Clark AG (2011) A survey for novel imprinted genes in the mouse placenta by mRNA-seq. Genetics 189: 109-122
OpenUrl Abstract/FREE Full Text

[55] ↵
Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10: 57-63
OpenUrl CrossRef PubMed Web of Science

[56] ↵
Wittkopp PJ, Haerum BK, Clark AG (2004) Evolutionary changes in cis and trans gene regulation. Nature 430: 85-88
OpenUrl CrossRef PubMed Web of Science

Hierarchical Analysis of Multi-mapping RNA-Seq Reads Improves the Accuracy of Allele-specific Expression

Abstract

Background

Results

Importance of counting multi-reads

Fitting an EMASE model

Performance of EMASE on simulated data

ASE in F1 hybrid data

Discussion

Methods

Construction of the diploid transcriptome

Read alignment and isoform abundance quantification

RNA-Seq simulations for benchmarking

Animals and experimental design

RNA sequencing

Parameter estimation with EMASE

EMASE software

Comparison methods

Statistical significance of ASE

Software and Data Availability

Author’s contributions

Appendix A: EMASE Models

References

Citation Manager Formats

Subject Area