Abstract
Current expression quantification methods suffer from a fundamental but under-characterized type of error: the most likely estimates for transcript abundances are not unique. Quantification methods rely on probabilistic models, and the scenario where a model admits multiple optimal solutions is called non-identifiability. This means multiple estimates of transcript abundances explain the observed RNA-seq reads with equal likelihood, and the underlying true expression cannot be determined. The non-identifiability problem is further exacerbated when the incompleteness of the reference transcriptome and the existence of unannotated transcripts are taken into consideration. State-of-the-art quantification methods usually output a single inferred set of abundances, and the accuracy of this single set relative to other equally optimal solutions is unknown. Obtaining the set of equally optimal solutions is necessary for evaluating and extending downstream analyses to take non-identifiability into account. We propose methods to compute the range of equally optimal estimates for the expression of each transcript, accounting for the non-identifiability of the quantification model using several novel graph-theoretical approaches. Our methods work under two scenarios: one assuming the reference transcriptome is complete, the other assuming an incomplete reference and allowing for expression of unannotated transcripts. Our methods calculate a “confidence” range for each transcript, representing its possible abundance across equally optimal estimates. This range can be used to evaluate the reliability of detected differentially expressed (DE) transcripts, as a large overlap of the confidence ranges between DE conditions indicates that the prediction may be unreliable due to this uncertainty. We observe that 5 out of 257 DE predictions are unreliable on an MCF10 cell line dataset and 19 out of 3152 are unreliable on a CD8 T cell dataset. The source code can be found at https://github.com/Kingsford-Group/subgraphquant.
1 Introduction
With the wide usage of gene and transcript expression data, we have seen improvements in probabilistic models of RNA-seq expression quantification [1–4], as well as in the characterization and evaluation of the errors of quantified expression [5–7]. However, there is an under-characterized type of estimation error that is due to the non-uniqueness of solutions to the probabilistic model. When the sequencing data can be explained by multiple sets of transcript expression with equal probability, current methods output only one of the optimal solutions. It is usually unknown what the other optimal solutions are or how different they are from one another.
This type of estimation error, due to the non-uniqueness of optimal solutions, is intrinsic to the quantification model and almost inevitable. The non-uniqueness of solutions to an optimization model is known as model non-identifiability under the assumption of infinite observations. Here we use a relaxed definition of non-identifiability to refer to the non-uniqueness of a model’s solution under a finite set of observations. Whether the model is identifiable is usually determined by the form of the objective function and the dimension of the parameter space in the optimization problem. Figure 1 shows a toy example where the optimal abundances of a set of four transcripts are not unique. The interdependence between parameters is one cause of the non-identifiability phenomenon. For transcript expression quantification in particular, the interdependence comes from exon sharing between transcripts.
Assessing the potential error due to non-identifiability helps increase the accuracy of analyses that are based on transcript expression data. Expression data is used in differential expression detection to study gene function, or treated as features to predict disease and treatment outcomes [8, 9]. Without considering the estimation error in expression, these analyses may miss true signals, output false detections, or make wrong predictions. Recent differential expression detection methods [10, 11] include estimation error in their models and show an improved accuracy of the detected differentially expressed transcripts. Taking the multiple optimal estimates due to non-identifiability into consideration may further improve existing methods.
Non-identifiability in transcript quantification has been discussed in previous works, but in-depth characterization is still lacking: no previous work enumerates the values of expression estimates that are equally optimal. Lacroix et al. [12] and Hiller et al. [13] developed methods to list the transcripts that have non-unique expression abundances, but their methods do not provide information about what values their optimal abundances can take. Roberts and Pachter [14] incorporated the detection of non-identifiable abundances into their quantification method, and designed a tool to output an identifiability flag for each transcript along with a single expression estimate. Bernard et al. [15] proposed a network-flow-based model that has a similar premise and approach to our work. Similar to our approach, their first step is to reparameterize using a network flow on a fragment graph. Bernard et al. [15] then quantified the transcripts with regularization followed by a flow decomposition, without taking the identifiability of transcript expression into consideration. Their approach also only works for single-end, fixed-length reads, as phasing paths that include one another introduce ambiguity in the fragment graph, and the resulting graph can no longer be quantified correctly. Similarly, LeGault and Dewey [16] introduced a network-flow-based model and specifically focused on the identifiability of the model, but the method requires additional assumptions for phasing reads and does not provide further information about the multiple optimal abundances of transcripts.
The reference transcriptome may be incomplete, and cells may express transcripts outside the reference. It is important to study the set of equally optimal estimates under the incomplete-reference assumption, as studies of transcriptome assembly, third-generation sequencing with long reads, and k-mer decomposition show that this assumption likely holds for many datasets [17–22]. To accurately quantify expression under the incomplete-reference assumption, the model should take the expression of novel transcripts into consideration. This greatly expands the dimension of the parameter space, making the model more susceptible to non-identifiability; it can also lead to different quantification results, and it is relatively unstudied. An independent line of research on reference-free transcript quantification (LeGault and Dewey [16] and Bernard et al. [15]) uses specialized splice-graph-centric methods to quantify transcript abundances under this assumption. However, these methods deviate from the mainstream quantification models and leave the effect of non-identifiability unaddressed. We address both concerns with our proposed methods.
In this work, we propose methods to derive confidence ranges for transcript abundances. Specifically, we derive a “range of optima” for each transcript, indicating its range of abundance across equally optimal estimates. The set of equally optimal estimates varies with the assumptions on reference completeness, and we analyze three cases. (1) Assuming a complete reference, we propose a linear programming (LP) formulation to derive the range of optima for each transcript based on a phasing-aware reparameterization of the model (that is, the model properly takes phasing reads into account). (2) Assuming that the reference transcriptome is incomplete but the splice graph is correct, we propose a multi-stage approach: after the reparameterization, we reorganize the quantification problem using a variant of the splice graph inspired by the multiple pattern matching problem, optimize using a specialized EM algorithm, and derive the ranges via efficient max-flow-based algorithms. (3) Under the assumption that the reference is incomplete but a known and given percentage of expression is contributed by the reference transcriptome, we derive the ranges based on the ranges under the two previous assumptions.
Applying our method to the probabilistic model used in the quantification software Salmon [4] on 16 Human Body Map samples, we are able to provide an interval of optimal abundances for each transcript. The ranges of optima can be used to evaluate the “confidence” of the expression estimates of Salmon. We observe that Salmon estimates tend to be near the lower or upper bounds of the range of optima rather than in the middle. This may be related to the initialization or the specific optimization method used for the convex optimization, and deserves further theoretical study. The upper bounds of the ranges of optima are rarely above 10 times Salmon’s estimates, indicating that under-estimation of expression abundance due to non-identifiability is not severe.
Applying our method to sequencing datasets of an MCF10 cell line and of CD8 T cells, we use the ranges of optima to evaluate the differentially expressed transcripts detected using tximport [23] and DESeq2 [24]. Between the groups of samples treated with and without epidermal growth factor (EGF) in the MCF10 dataset, 257 transcripts are detected as differentially expressed based on single estimates. The majority of the calls are reliable after considering the ranges of optima, assuming the reference transcriptome is mostly complete and contributes more than 40% of the expression. There are still a few unreliable detections for which the two groups’ ranges of optima overlap. The tolerance of reference incompleteness and the number of unreliable DE transcript predictions are similar for the CD8 T cell dataset, where we compare the differential transcript expression of naive CD8 T cells and effector memory CD8 T cells.
2 Method Overview
We provide a high-level walkthrough of our proposed methods in this section, and provide more details for each component in Section 3.
Our goal is to investigate and quantify the effect of non-identifiability on transcript quantification. Doing this beyond qualitative statements (saying the expression of a transcript is non-identifiable under some model) requires a deeper understanding of the quantification model, especially its parameterization. Solving the problem also requires the development of original frameworks and tools from graph theory and convex optimization, tailored to the quantification model. The end result of our process is the range of optima [li, hi] for transcript Ti, indicating its inferred abundance across a set of equally optimal transcript abundances.
Our method depends on the following concepts and definitions. A splice graph is a directed acyclic graph representing alternative splicing events in a gene. The graph has two special vertices: S represents the start of transcripts and T represents the termination of transcripts. Every other vertex represents an exon or a partial exon. Edges in the splice graph represent splice junctions, the potential adjacencies between exons in transcripts. A path is a list of vertices such that every adjacent pair is connected by an edge, and an S – T path is a path that starts with S and ends with T. Each transcript corresponds to a unique S – T path in the splice graph, and the set of known transcripts is called the reference transcriptome. S – T paths that are not present in the reference transcriptome correspond to unannotated transcripts. We assume that a correct and complete splice graph is accessible, and use the phrase quantified transcript set to denote a set of transcripts with corresponding abundances.
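As a concrete illustration of these definitions, the following minimal sketch enumerates the S – T paths of a toy splice graph. The gene, exon names, and the helper `st_paths` are invented for illustration and are not part of the paper's software.

```python
from collections import defaultdict

def st_paths(edges, src="S", snk="T"):
    """Enumerate all S-T paths of a DAG splice graph by depth-first search."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
    stack = [(src, (src,))]
    paths = []
    while stack:
        node, path = stack.pop()
        if node == snk:
            paths.append(path)
            continue
        for nxt in adj[node]:
            stack.append((nxt, path + (nxt,)))
    return sorted(paths)

# A hypothetical gene with one skipped exon (exon 2): the graph admits
# exactly two S-T paths, the inclusion and the skipping isoform.
edges = [("S", "1"), ("1", "2"), ("1", "3"), ("2", "3"), ("3", "T")]
print(st_paths(edges))
# -> [('S', '1', '2', '3', 'T'), ('S', '1', '3', 'T')]
```

Each enumerated path corresponds to a potential transcript; paths absent from the reference transcriptome would be unannotated transcripts.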
2.1 Reparameterization
In general, non-identifiability indicates redundancy in the parameter space. Intuitively, if several sets of parameters lead to the same observations, there should be a parameter space of lower dimension that fits the data (in our case, the reads) equally well. This is indeed the case in our toy example presented in Figure 1: assuming the four junction read observations can always be perfectly explained, one of them can be derived from the other three by the flow balance condition on exon 3 (total observed reads on incoming edges equal total observed reads on outgoing edges).
However, there are two important differences between our toy example and actual transcript quantification. First, modern quantification methods are built around generative models that are much more complex than the toy setup, and a perfect fit is usually impossible. Second, phasing reads, that is, read pairs whose inferred fragments span multiple junctions, exist in the input library. The splice graph paths that a given set of fragments can map to are called phasing paths. We overcome these differences by first identifying a core model of transcript quantification, and then showing that the core model can be reparameterized. The reparameterization uses the abundance over each phasing path as the parameters, called the path abundance and denoted {cp}, where p is a phasing path. For genes with abundant alternative splicing events, this process indeed results in a lower-dimensional representation.
With the reparameterization, we can concisely state the condition under which the model is non-identifiable: if two quantified transcript sets yield the same set of path abundances {cp}, they generate the same distribution of fragments under the generative model and fit the reads equally well. Given a reference quantification result, we can turn this observation into a linear program (LP) that yields the minimal and maximal abundance of each transcript across all equivalent parameterizations, which is the range of optima under our first assumption that the reference is complete. See Section 3.1 for the mathematical details of the process.
2.2 Quantification with Incomplete Transcriptome
We next study the core quantification model where all S – T paths on the splice graph can be transcripts. Equivalently, the reference transcriptome is now expanded to include all previously unannotated transcripts.
The reparameterized model is uniquely fit for this study, as the main issue with directly using the core quantification model is the prohibitive size of the expanded reference transcriptome. The reference transcriptome plays a central role in the original optimization problem, as it determines the set of variables and appears in the objective function; after the reparameterization, however, it only appears in the constraints over path abundances.
A natural next step is to restructure the constraints to eliminate this single dependence on the reference transcriptome. The constraints are in place to ensure that the inferred path abundances can be generated by some quantified transcript set. Re-implementing the same set of constraints with the expanded reference transcriptome still results in a large optimization instance, which does not resolve our issue.
To motivate the next step, assume there are no phasing reads (every inferred fragment either resides within an exon or contains exactly one junction). In this case, the phasing paths are vertices or edges in the splice graph. If the quantified transcript set is mapped onto the splice graph, we obtain a network flow, and the path abundance of a phasing path equals the flow through a vertex or an edge. By the flow decomposition theorem, given a network flow on the splice graph, we can decompose it into weighted S – T paths, which then naturally map back to a quantified transcript set. As the two-way mapping (between quantified transcript sets and splice graph flows) preserves path abundances, we conclude that optimizing over quantified transcript sets and optimizing over splice graph flows are equivalent. Specifically, it is easy to restructure the constraints to represent a splice graph flow, and the resulting model is guaranteed to be equivalent to the core quantification model with the expanded reference transcriptome.
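The flow decomposition step referenced above can be sketched as a greedy path-peeling routine, which is the standard constructive proof of the flow decomposition theorem. The function below is an illustrative toy under the assumption of a balanced DAG flow, not our implementation; note that different peeling orders yield different decompositions, which is exactly the non-identifiability at play.

```python
from collections import defaultdict

def decompose_flow(flow, src="S", snk="T"):
    """Greedy flow decomposition on a DAG: repeatedly walk an S-T path
    with positive residual flow and peel off its bottleneck weight.
    `flow` maps edges (u, v) to values and is assumed balanced."""
    f = dict(flow)  # residual edge values, consumed as we peel
    adj = defaultdict(list)
    for (u, v) in f:
        adj[u].append(v)
    pieces = []
    while True:
        path, node = [src], src
        while node != snk:
            nxt = next((v for v in adj[node] if f[(node, v)] > 1e-9), None)
            if nxt is None:
                return pieces  # no residual S-T path left: done
            path.append(nxt)
            node = nxt
        w = min(f[(path[i], path[i + 1])] for i in range(len(path) - 1))
        for i in range(len(path) - 1):
            f[(path[i], path[i + 1])] -= w
        pieces.append((tuple(path), w))

# Toy exon-skipping flow: 2 units through exon 2, 3 units skipping it.
flow = {("S", "1"): 5.0, ("1", "2"): 2.0, ("1", "3"): 3.0,
        ("2", "3"): 2.0, ("3", "T"): 5.0}
pieces = decompose_flow(flow)
# e.g. [(('S', '1', '2', '3', 'T'), 2.0), (('S', '1', '3', 'T'), 3.0)]
```

Each peeled (path, weight) pair corresponds to a transcript with its abundance in the recovered quantified transcript set.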
We need to adapt the splice graph in the presence of phasing reads, as the mapping no longer preserves path abundances. We draw inspiration from pattern matching problems and construct the Aho-Corasick automaton (a classical approach for multiple pattern matching) from the phasing path set. The automaton can be viewed as a directed acyclic graph with states as vertices and transitions as edges, which we call the prefix graph. We can show that the prefix graph, like the splice graph, has a two-way connection with quantified transcript sets: a set can be mapped onto a prefix graph flow, and a decomposition of the flow translates back to a quantified transcript set. Moreover, path abundances are preserved in the process, where each path abundance is now the sum of flows through a set of edges in the prefix graph flow. By the same argument, we can now restructure the constraints with a prefix graph flow, and optimization over it leads to the desired quantification results with the same guarantee.
Even though we now have a tractable optimization instance, efficient inference is still a problem, especially in the presence of multi-mapped reads and multiple genes. We handle both concerns with a modified EM algorithm, where the E-step and the M-step operate on different scales. We describe the process in more detail in Section 3.3.
2.3 Subgraph Quantification
To quantify transcripts under the incomplete reference assumption, after constructing and solving the optimization instance as described in the last section, we need to recover transcript information from the inferred prefix graph flows. This can be done by decomposing the flow, and the result corresponds to a quantified transcript set. However, as shown in our toy example, different flow decompositions may generate the same path abundance set, a direct manifestation of the non-identifiability problem. We therefore determine the minimal and maximal abundance of a particular transcript across all possible decompositions of the same flow, which is the range of optima under our second assumption that the reference is incomplete but the splice graph is correct.
We propose an abstraction of the problem in graph theory language. In our framework, a transcript becomes a list of edge sets from the prefix graph, each corresponding to a junction. An S – T path maps to a given transcript and is called a “good path” if it intersects each edge set in the list corresponding to the transcript (indicating the path includes each junction of the transcript). The problem now becomes finding the minimal and maximal total flow through good paths over all flow decompositions. By different choices of the edge set list, we can quantify substructures other than full transcripts in a similarly non-identifiability-aware manner.
We call the resulting problem AND-Quant and solve it in two steps. The first step transforms the problem into an instance of OR-Quant, in which a path is good if it intersects a given edge subset. We then solve OR-Quant with two invocations of MaxFlow. Both steps are as efficient as a MaxFlow computation on the prefix graph. Details can be found in Section 3.4.
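As background for the second step, the MaxFlow subroutine can be any standard algorithm. Below is a minimal Edmonds-Karp sketch; it is illustrative only, and our actual OR-Quant reduction constructs a different network than this toy.

```python
from collections import defaultdict, deque

def max_flow(cap, src, snk):
    """Edmonds-Karp: repeatedly find a shortest augmenting path by BFS
    in the residual graph and push its bottleneck capacity."""
    residual = defaultdict(int)
    adj = defaultdict(set)
    for (u, v), c in cap.items():
        residual[(u, v)] += c
        adj[u].add(v)
        adj[v].add(u)  # allow traversal of reverse residual edges
    total = 0
    while True:
        parent = {src: None}
        q = deque([src])
        while q and snk not in parent:
            u = q.popleft()
            for v in adj[u]:
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    q.append(v)
        if snk not in parent:
            return total
        # recover the augmenting path, push its bottleneck
        path, v = [], snk
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(residual[e] for e in path)
        for (u, v) in path:
            residual[(u, v)] -= aug
            residual[(v, u)] += aug
        total += aug

# A classic 4-vertex example with integer capacities (invented numbers).
cap = {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1, ("a", "t"): 2, ("b", "t"): 3}
print(max_flow(cap, "s", "t"))  # -> 5
```

Edmonds-Karp runs in O(VE^2) time, which suffices for prefix graphs of practical size; any other MaxFlow routine could be substituted.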
2.4 Hybrid Assumptions
To complete our analysis of the effect of non-identifiability, we propose a hybrid model of read generation in Section 3.5, of which our previous two models can be seen as the two extremes (hence the third model). We combine the results from the two models to derive the range of optima for each transcript under our third assumption: a given percentage of expression is contributed by the reference transcriptome as opposed to the remaining splice graph paths. We demonstrate one use of our framework as a reliability check for differentially expressed transcripts.
3 Methods
We now describe each component of the proposed methods in greater detail. Due to limited space, most detailed descriptions and extended discussions are left to the Supplementary Materials.
3.1 Reparameterized Transcript Quantification
We start with the core model of transcript quantification at the foundation of most modern methods [1–4]. Assume the paired-end reads from an RNA-seq experiment are error-free and uniquely aligned to a reference genome, possibly with gaps, as fragments (these assumptions will be relaxed later). We denote the set of fragments (mapped from paired-end reads) as F, and the set of transcripts as {T1, T2, ⋯, Tn} with corresponding lengths l1, l2, ⋯, ln and abundances (copies of molecules) c1, c2, ⋯, cn. The transcripts per million (TPM) values are {ci} normalized and then multiplied by 10^6. Under the core model, the probability of observing F is:

P(F) = ∏_{f∈F} Σ_{i∈A(f)} P(Ti) P(f | Ti).
P(Ti) denotes the probability of sampling a fragment from transcript Ti, and P(f | Ti) denotes the probability of sampling the fragment f given that it comes from Ti. A(f) is the set of transcript indices onto which f can map. Let D(l) be the distribution of generated fragment lengths. In the absence of bias correction, P(f | Ti) is proportional to D(f) = D(l(f)), where l(f) denotes the fragment length inferred from mapping. Define the effective length of Ti as l̃i = Σ_l D(l)(li − l + 1) (which can be interpreted as the total “probability” for Ti to generate a fragment), and P(f | Ti) = D(f)/l̃i. The probability of generating a fragment from Ti is assumed to be proportional to its abundance times its effective length, meaning P(Ti) = ci l̃i / Σ_k ck l̃k. We discuss other definitions of effective length in Section S1.2 and refer to previous papers [1–4, 25] for a more detailed explanation of the model. This leads to:

P(F) = ∏_{f∈F} ( D(f) Σ_{i∈A(f)} ci ) / ( Σ_k ck l̃k ).
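To make the core model concrete, the following toy computation evaluates this likelihood and exhibits two abundance vectors that fit the same reads equally well. All numbers are invented, and `log_likelihood` is our illustrative helper, not part of any quantification tool.

```python
import math

def log_likelihood(frags, c, eff_len, D):
    """Core-model log-likelihood on a toy instance.
    frags: list of (A(f), fragment_length) pairs, where A(f) holds the
    indices of transcripts the fragment maps to. c and eff_len are the
    abundances and effective lengths; D is the length distribution."""
    norm = sum(ci * li for ci, li in zip(c, eff_len))
    ll = 0.0
    for compat, flen in frags:
        # P(f) = D(f) * sum_{i in A(f)} c_i / sum_k c_k * efflen_k
        ll += math.log(D[flen] * sum(c[i] for i in compat) / norm)
    return ll

# Two transcripts with equal effective length where every fragment maps
# to both: only c1 + c2 matters, a minimal non-identifiable instance.
frags = [((0, 1), 150)] * 10
D = {150: 0.8}
a = log_likelihood(frags, [2.0, 3.0], [300.0, 300.0], D)
b = log_likelihood(frags, [4.0, 1.0], [300.0, 300.0], D)
# a == b: the two abundance estimates are equally optimal
```

Because the two abundance vectors induce identical per-fragment probabilities, no amount of such data can distinguish them, which is precisely the non-identifiability the Reallocation LP of the next paragraphs characterizes.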
We now propose an alternative view of the probabilistic model, based on paths on splice graphs, in order to derive a compact parameter set for the quantification problem. The splice graph is constructed so that each transcript can be uniquely mapped to an S – T path p(Ti) on the graph, and we assume the read library satisfies that each fragment f can be uniquely mapped to a (non-S – T) path p(f) on the graph (this assumption will be relaxed later). With this setup, i ∈ A(f) if and only if p(f) is a subpath of p(Ti), i.e., p(f) ⊂ p(Ti).
We now define the path abundance cp = Σ_{i: p ⊂ p(Ti)} ci, the total abundance of transcripts whose paths include p, and the path effective length l̃p = Σ_{i ≤ j} 1(p(Tk[i, j]) = p) D(j − i + 1) for any transcript Tk including p, where Tk[i, j] is the fragment generated from transcript k from base i to base j and 1(·) is the indicator function. Intuitively, the path effective length is the total probability of sampling a fragment that maps exactly to the given path. Next, let 𝒫 be the set of paths from the splice graph satisfying l̃p > 0. We can re-calculate the normalization term of the model as Σ_k ck l̃k = Σ_{p∈𝒫} cp l̃p (proof in Section S1.3). The likelihood can now be rewritten as follows:

P(F) = ∏_{f∈F} D(f) c_{p(f)} / Σ_{p∈𝒫} cp l̃p.
This reparameterizes the model with {cp}, the path abundances. When the transcriptome is complex, the dimension of {cp} is lower than the number of transcripts (as each path can be shared by multiple transcripts), a direct manifestation of the non-identifiability problem. In practice, we reduce the number of phasing paths considered by discarding long paths with small effective length and no mapped fragments, as they contribute little to the likelihood (see Section S1.4). Our model is also capable of certain bias corrections (see Section S1.5).
With the reparameterized model, we now derive the range of optima for each transcript, assuming all reads are generated from known transcripts. The reference abundances {cRi} are derived with existing tools, such as Salmon [4] or kallisto [3]. We assume {cRi} yields the optimal likelihood. If the model is non-identifiable, there are other sets of transcript abundances {ci} that yield exactly the same likelihood.
As we have shown above, it suffices for {ci} to generate the same set of path abundances as {cRi} in order to be optimal. This also holds in the presence of read pairs mapped to multiple paths, as the probabilities of generating these read pairs are weighted sums of {cp}. Note that identical path abundances are not a necessary condition for co-optimality, and co-optimal solutions with different path abundances are outside the scope of this paper. The set of transcript abundances satisfying this condition, and therefore indistinguishable from cR, is defined by the linear system Σ_{i: p ⊂ p(Ti)} ci = cRp for every phasing path p, with nonnegative ci, where cRp is the path abundance induced by {cRi}. For each transcript Ti, we can calculate the range of ci over this indistinguishable set by setting up two LP instances (called the Reallocation LP), one maximizing ci and one minimizing ci, within the aforementioned system.
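To illustrate what the Reallocation LP computes, consider a hand-worked toy gene. All transcripts and abundances below are invented, and a real instance would be passed to an LP solver rather than solved parametrically as here.

```python
# Four hypothetical transcripts sharing exon C:
#   T1 = A-C-D, T2 = A-C-E, T3 = B-C-D, T4 = B-C-E, with cR = (2, 3, 4, 1).
# Junction path abundances induced by cR:
#   c_AC = c1+c2 = 5, c_BC = c3+c4 = 5, c_CD = c1+c3 = 6, c_CE = c2+c4 = 4.
# The equality system has one degree of freedom t = c1, giving
#   c2 = 5-t, c3 = 6-t, c4 = t-1.

def equivalent_abundances(t):
    """The one-parameter family of abundance vectors matching cR's
    path abundances."""
    return (t, 5 - t, 6 - t, t - 1)

def path_abundances(c):
    c1, c2, c3, c4 = c
    return (c1 + c2, c3 + c4, c1 + c3, c2 + c4)

def range_of_optima_T1():
    """The Reallocation LP bounds for c1: nonnegativity of c1 and c4
    bounds t from below; nonnegativity of c2 and c3 from above."""
    return max(0, 1), min(5, 6)

print(range_of_optima_T1())  # -> (1, 5)
```

Every t in [1, 5] reproduces the reference path abundances (5, 5, 6, 4) and hence the same likelihood, so the single reference estimate c1 = 2 is just one point within a wide range of optima.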
3.2 Representing Path Abundance with Prefix Graphs
Our next goal is to infer the optimal transcript abundances under the incomplete reference transcriptome assumption without iterating over every possible S – T path. This is necessary because the number of S – T paths can be exponential in the splice graph size, whereas, assuming most phasing paths contain few exons, the size of the phasing path set grows only polynomially.
As discussed in the overview, we solve this problem by producing a set of constraints over {cp} that depends only on the splice graph, and by showing that optimization over the new model is equivalent to the core model in which all S – T paths can be transcripts. Equivalently, we want to find a set of constraints such that a set of path abundances satisfies the constraints if and only if there exists a corresponding quantified transcript set, with the expression of novel transcripts allowed.
The most natural idea is to use the splice graph flow to represent {cp}, as shown in the overview section. However, this no longer works when some phasing path p contains three or more exons. Informally, this can be solved by constructing higher-order splice graphs (as defined in [16], for example) or fixed-order Markov models, but the size of the resulting graph grows exponentially fast, and the resulting model may not be identifiable itself. We instead choose to “unroll” the graph just as needed, roughly corresponding to a variable-order Markov model. In a similar spirit, Bernard et al. [15] construct a fragment graph where each vertex corresponds to a valid phasing path, and it is assumed that every transcript can be mapped uniquely to a path on the fragment graph. However, as discussed before, this assumption no longer holds when paired-end reads are taken into consideration. Because of the connection between our model and pattern matching problems, we explain our model starting from the analysis of a seemingly unrelated problem.
Imagine we are given the set of phasing paths, and for a transcript Ti, we want to determine the subset of phasing paths that are subpaths of Ti, reading one exon of Ti at a time. We can treat this as an instance of multiple pattern matching, where the nodes in the splice graph (the set of exons) constitute the alphabet, the phasing paths are the patterns (short strings to match against), and the transcripts are the texts (long strings to identify patterns in). In this setup, the Aho-Corasick algorithm [26] performs the matching quickly by using a finite state automaton (FSA) to maintain the longest suffix of the current text that could extend and match a pattern in the future.
We are interested in the Aho-Corasick FSA constructed from the phasing path set to solve the matching, and we further modify the finite state automaton as follows. Transitions between states of the FSA, called dictionary suffix links, indicate the next state of the FSA given the current state and the upcoming character. We do not need the links for all characters (exons), as we know the text is an S – T path on the splice graph: if x is the last seen character, the next character y must be a successor of x in the splice graph, and we only generate the state transitions for this set of movements.
With an FSA, we now construct a directed graph from its states and transitions as described above. Formally:
Definition (Prefix Graph). Given a splice graph GS and a set of splice graph paths 𝒫 (assuming every single-vertex path is in 𝒫), we construct the corresponding prefix graph G as follows:
The vertices V of G are the splice graph paths p such that p is a prefix of some path in the given set. For p ∈ V, let x be the last exon in p. For every y that is a successor of x in the splice graph, let p′ be the longest path in V that is a suffix of py (p with y appended). We then add an edge from p to p′.
The source and sink of G are the vertices corresponding to splice graph paths [S] and [T], where [x] denotes a single-vertex path. The set AS(p) is the set of vertices p′ such that p is a suffix of p′.
Intuitively, the states of the automaton are the vertices of the graph, each labeled with the suffix under consideration at that state. The (truncated) dictionary suffix links indicate the state transitions upon reading a character, and naturally serve as the edges of the graph. All transcripts start with S and end with T, and no phasing path contains either of them as they are not real exons, so there exist two vertices labeled [S] and [T]. We call them the source and sink of the prefix graph, respectively. For a path p in the given set, AS(p) denotes the set of states in the FSA that recognize p.
The resulting prefix graph has several interesting properties that resemble the splice graph it was constructed from. [S] and [T] work similarly to the source and sink of the splice graph, as every transcript can be mapped to an [S] – [T] path on the prefix graph by feeding the transcript to the finite state automaton and recording the sequence of visited states, excluding the initial state (the first state visited after it is always [S], as the first vertex in an S – T path is S). Conversely, an [S] – [T] path on the prefix graph can also be mapped back to a transcript, as it has to follow dictionary suffix links, which by our construction map back to edges in the splice graph. The prefix graph is also a DAG: a cycle in the prefix graph would imply that an exon appears twice in a transcript.
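The construction in the definition above can be sketched directly in code. The toy splice graph and path set below are invented for illustration, and `build_prefix_graph` follows the simplified definition in the main text rather than the paper's more compact construction (Section S1.6).

```python
def build_prefix_graph(succ, phasing_paths):
    """succ: splice-graph successor lists; phasing_paths: exon tuples
    (every single-vertex path must be included). Returns (V, E) where
    vertices are path tuples and edges follow the truncated links."""
    # vertices: all prefixes of the given paths
    V = {p[:i] for p in phasing_paths for i in range(1, len(p) + 1)}
    E = set()
    for p in V:
        for y in succ.get(p[-1], []):
            py = p + (y,)
            # edge target: the longest suffix of py that is a vertex
            q = max((py[i:] for i in range(len(py)) if py[i:] in V), key=len)
            E.add((p, q))
    return V, E

# Hypothetical gene: S -> {A, B} -> C -> T, with one two-exon phasing
# path (A, C) observed in addition to all single-vertex paths.
succ = {"S": ["A", "B"], "A": ["C"], "B": ["C"], "C": ["T"]}
paths = [("S",), ("A",), ("B",), ("C",), ("T",), ("A", "C")]
V, E = build_prefix_graph(succ, paths)
```

On this toy input the vertex (A,) routes to the "unrolled" state (A, C) while (B,) collapses back to (C,), so a flow on this graph can distinguish fragments phased across A-C from those entering C via B.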
A prefix graph flow is a bridge between the path abundance {cp} and the quantified transcript set {ci}:
Theorem 1. Every quantified transcript set can be mapped to and from a prefix graph flow. The path abundance is preserved during the mapping and can be calculated exactly from the prefix graph flow: cp = Σs∈AS(p) fs, where fs is the flow through vertex s.
Proof. Using the path mapping between the splice graph and the prefix graph, we can map a quantified transcript set onto the prefix graph as a prefix graph flow, and reconstruct a quantified transcript set by decomposing the flow and mapping each [S] – [T] path back to the splice graph as a transcript.
To prove the second part, let {cp} be the path abundance calculated from the definition given a quantified transcript set, and let {c′p} be the path abundance calculated from the prefix graph flow. We will show cp = c′p for any finite decomposition of the prefix graph flow.
For any transcript Ti and any phasing path p, since no exon appears twice in a transcript, if Ti contains p, p is recognized by the FSA exactly once. This means the [S] – [T] path that Ti maps to intersects AS(p) in exactly one vertex in this scenario, and Ti contributes the same abundance to c′p and cp. If Ti does not contain p, by similar reasoning, it contributes to neither c′p nor cp. This holds for any transcript and any path, so the two definitions of path abundance coincide and are preserved in the mapping from quantified transcript set to prefix graph flow. Since the prefix graph flow is preserved in flow decomposition, the path abundance is preserved as a function of the prefix graph flow.
As discussed in the overview, this connection allows us to directly optimize over {cp} by using the prefix graph flow as variables (cp is now represented as in Theorem 1), with flow balance and non-negativity as constraints. The corresponding quantified transcript set is guaranteed to exist via a flow decomposition followed by the mapping process.
We close this subsection by noting that there are more efficient constructions of the prefix graph (a smaller graph for the same splice graph and path set), as described in Section S1.6. Under certain regularity constraints on the path set, such a construction can be proved to be optimal, and it is identifiable for inference as long as the model is also identifiable with the path abundance.
3.3 Inference with Prefix Graph Flows
While the restructuring process described in the last section reduces the size of the optimization problem, we still need to solve it efficiently. With no multi-mapped reads (or fragments) and a single gene, we optimize the reparameterized likelihood of Section 3.1, where the variables and the constraints come from the prefix graph flow, and cp is represented as in Theorem 1. This is a convex problem solvable with existing solvers.
With the presence of multimapped reads (to multiple fragments and/or multiple paths), we can employ a standard EM approach. Optimization over the whole genome is harder, as we need to infer the prefix graph flow for every gene and the instance becomes impractically large. Denote the gene abundance as the total path abundance of gene g (summed over the set of paths in gene g). We decouple the genes during the M-step by fixing the gene abundances, which makes the optimization over each gene independent of the others, while provably maintaining correctness of the process. This results in the following localized EM:
M(f) is the set of paths f maps to, D(f | p) is the fragment length probability if f maps to p, z is the allocation vector in EM, and z′/c′ are values from previous iteration. For clarity, we also hide the flow constraints for {cp}. Detailed derivations can be found in Section S1.7.
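To illustrate the shape of the iteration, the sketch below runs plain EM over the path abundances of a single gene. It deliberately omits the flow-balance constraints on {cp} (which the actual method enforces through the prefix graph flow variables in a constrained M-step), so it is an illustrative simplification with hypothetical names, not the paper's algorithm.

```python
def localized_em(frag_paths, D, n_iter=200):
    """Toy EM over path abundances {c_p} for one gene.

    frag_paths[f] is the set of paths M(f) that fragment f maps to, and
    D(f, p) is the fragment length probability D(f | p).  The flow
    constraints on {c_p} are omitted here for simplicity.
    """
    paths = sorted({p for ps in frag_paths for p in ps})
    c = {p: 1.0 / len(paths) for p in paths}
    for _ in range(n_iter):
        counts = {p: 0.0 for p in paths}
        for f, ps in enumerate(frag_paths):
            # E-step: allocation z_{f,p} proportional to c'_p * D(f | p)
            w = {p: c[p] * D(f, p) for p in ps}
            tot = sum(w.values())
            if tot == 0.0:
                continue
            for p, v in w.items():
                counts[p] += v / tot
        # M-step (unconstrained): c_p proportional to total allocation
        total = sum(counts.values())
        c = {p: counts[p] / total for p in paths}
    return c
```

For example, with three fragments mapping to {p1}, {p1, p2}, and {p2} and a uniform D, the estimate converges to equal abundance for the two paths.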
3.4 Subgraph Quantification
As shown before, given inferred prefix graph flows, we can obtain a quantified transcript set by decomposing the flow. Different decompositions of the same flow are a direct manifestation of the non-identifiability phenomenon. For this section, we define the range of optima for each transcript as the minimum and maximum weight of the corresponding S – T path across all valid decompositions of the input flow.
We will implement two small changes to the prefix graph model as described before. These are necessary as we present a simplified model in the main text (the more compact prefix graph model is described in Section S1.6). First, AS(p) is now an edge subset; a vertex subset corresponds to an edge subset by taking the union of incoming edges of the vertices in the set. Second, we assume each junction is in the phasing path set.
Now each junction, or phasing path in general, is represented by a set of edges AS(p) in the prefix graph, and an S – T path as a transcript includes the junction/path if it intersects with AS(p). This motivates the following formal definition of the subgraph quantification problem:
(AND-Quant). Let G = (V, E) be a directed acyclic graph with an edge flow, and let {Ek} be a list of edge sets satisfying the well-ordering property: if a path visits ei ∈ Ei and ej ∈ Ej with i < j, then ei appears before ej on the path. An S – T path is good if it intersects each Ek. For a flow decomposition of G, the total good flow is the total flow of good S – T paths. AND-Quant(G, {Ek}) asks for the minimum and maximum total good flow over all possible decompositions of G.
For quantifying a full transcript, we can construct one Ek = AS([u, v]) for each of its junctions (u, v), in order, and the range of optima for this transcript is exactly AND-Quant(G, {Ek}). This formulation also allows us to quantify more general substructures. For example, we can use only a subsequence of {Ek}, and AND-Quant will return the minimum and maximum total abundance of transcripts that include the corresponding set of junctions. We discuss this topic in more detail in Section S1.10.
To solve AND-Quant, the first step is to define a similar problem called OR-Quant as follows:
(OR-Quant). Let G = (V, E) be a directed acyclic graph with an edge flow, and E′ be an arbitrary subset of E. An S – T path is good if it intersects E′. The total good flow and the objective are defined in the same way as in AND-Quant.
We now convert an instance of AND-Quant to OR-Quant by constructing a graph GB, called the block graph, that contains every edge that is either in one of Ek, or is between an edge in Ek−1 and an edge in Ek for some k (with some abuse of notation, assume the first edge set is {S} and the last set is {T}). We claim the following for AND-Quant:
An S – T path is good if and only if it is a subset of GB.
We prove this in Section S1.9, which also provides a linear time algorithm for preprocessing and construction of GB. If we know the minimum and maximum total “bad flow” (opposite of good flow), we can obtain the answer to AND-Quant by complementing the result with U, the total flow of G. From the lemma, an S – T path is bad if and only if it intersects with G – GB, which turns the problem into an instance of OR-Quant. To solve OR-Quant(G, E′), we define two auxiliary graphs. G− is a copy of G with all edges in E′ removed. G+ is a copy of G with an extra vertex T′ which replaces T as the flow sink, and all edges in E′ have their destination moved to T′. We claim running MaxFlow on both graphs solves OR-Quant:
Let G+ and G− be constructed as described above. Then OR-Quant(G, E′) = [U – MaxFlow(G−), MaxFlow(G+)].
We prove this in Section S1.8. Note that constructing G+ and G− can be done in linear time, so the time complexity of OR-Quant depends on the MaxFlow. Combining the arguments leads to the solution to AND-Quant (recall U is the total flow on G):
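As a concrete check of this theorem, the two MaxFlow computations can be run with an off-the-shelf solver. The sketch below uses networkx (our choice for illustration, not the paper's implementation) and represents the input edge flow as a dict from edges to flow values.

```python
import networkx as nx
from collections import defaultdict

def or_quant(flow_edges, e_prime, S="S", T="T"):
    """OR-Quant(G, E'): range of total flow on S-T paths touching E',
    over all decompositions of the given edge flow.

    flow_edges maps (u, v) -> flow value; e_prime is a subset of its keys.
    """
    U = sum(f for (u, v), f in flow_edges.items() if u == S)  # total flow

    # G-: delete E'.  MaxFlow(G-) is the largest possible "bad" flow,
    # so the minimum good flow is U - MaxFlow(G-).
    g_minus = nx.DiGraph()
    g_minus.add_nodes_from([S, T])
    for (u, v), f in flow_edges.items():
        if (u, v) not in e_prime:
            g_minus.add_edge(u, v, capacity=f)
    lo = U - nx.maximum_flow_value(g_minus, S, T)

    # G+: reroute every E' edge into a fresh sink T'.  A unit of flow
    # reaches T' iff its path crosses E', so MaxFlow(G+) is the maximum
    # good flow.
    cap = defaultdict(float)
    for (u, v), f in flow_edges.items():
        cap[(u, "T'") if (u, v) in e_prime else (u, v)] += f
    g_plus = nx.DiGraph()
    g_plus.add_nodes_from([S, "T'"])
    for (u, v), f in cap.items():
        g_plus.add_edge(u, v, capacity=f)
    hi = nx.maximum_flow_value(g_plus, S, "T'")
    return lo, hi
```

On a small example with unit flows S→a→m→c→T, S→b→m→d→T (and the crossing edges m→c, m→d both available to either unit) and E' = {(S, a), (c, T)}, the two decompositions pair the marked edges on one path or spread them over two, giving the range (1, 2).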
Let GB be the block graph, and let [l, r] = OR-Quant(G, G – GB). Then AND-Quant(G, {Ek}) = [U – r, U – l].
As described before, the preprocessing steps can be done in linear time, so the time complexity of AND-Quant depends on that of OR-Quant, which in turn depends on the MaxFlow computation.
See Figure 3 for examples of both algorithms. Throughout this section, we define the range of optima by assuming all equally optimal quantified transcript sets map to the same prefix graph flow. This can be overly restrictive, as other sets may yield the same path abundances with a different prefix graph flow (meaning the prefix graph flow itself is non-identifiable). We show that we can integrate the Reallocation LP into the framework to cover this scenario, as discussed in Section S1.10.
3.5 Structured Analysis of Differential Expression
We have discussed non-identifiability-aware transcript quantification under two assumptions. In this section, we model the quantification problem under a hybrid assumption: some fragments are generated from the reference transcriptome while others are generated by combining known junctions. This model is more realistic than either of the two extreme assumptions.
For each transcript Ti, we take its range of optima calculated under the complete reference transcriptome assumption and its range calculated under the incomplete reference transcriptome assumption. We use the parameter λ to indicate the assumed portion of fragments generated by combining known junctions. For 0 < λ < 1, we define the hybrid range by interpolating the endpoints of the two ranges: (1 − λ) times the complete-reference range plus λ times the incomplete-reference range. We cover other possibilities in Section S1.11. The ranges can be useful for analyzing the effects of non-identifiability under milder assumptions, and we now show a usage of this framework in differential expression.
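As a small worked example (function and variable names are ours), the hybrid range can be obtained by interpolating the endpoints of the two extreme ranges:

```python
def hybrid_range(range_complete, range_incomplete, lam):
    """Interpolate the ranges of optima from the two extreme assumptions.

    lam is the assumed portion of fragments generated by combining known
    junctions (0 = complete reference, 1 = incomplete reference).
    """
    lo = (1.0 - lam) * range_complete[0] + lam * range_incomplete[0]
    hi = (1.0 - lam) * range_complete[1] + lam * range_incomplete[1]
    return lo, hi
```

For instance, with a complete-reference range of (10, 12) and an incomplete-reference range of (4, 30), λ = 0.5 gives the hybrid range (7.0, 21.0).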
In differential expression analysis, for each transcript we receive two sets of abundance estimates {Ai}, {Bi} under two conditions, and the aim is to determine whether the transcript is expressed more in {Ai}. With fixed λ, we can instead calculate the corresponding ranges of optima for the two conditions as described above. If the mean of the lower bounds for the {Ai} ranges is less than the mean of the upper bounds for the {Bi} ranges, we define this transcript to be a questionable call for differential expression. Intuitively, if a λ portion of expression can be explained by unannotated transcripts, we cannot determine definitively whether Ai is larger on average than Bi. This filtering of differential expression calls is very conservative (we expect it to filter out few calls), as most differential expression callers require much higher standards for a call.
4 Results
We focus on the quantification method Salmon [4] to calculate the range of optima. Salmon learns and outputs a refined bias model that takes into consideration the GC bias, sequence bias, and positional bias, which we use to calculate the path effective length in inferring graph flows. We retrieve the splice graphs from Gencode transcript annotation (version 26) [27] and assume that they encode all expressed transcripts. Even though splice graphs are constructed in many transcriptome assembly software in a sample-specific manner, we use the splice graph from the reference for all samples so that the ranges of optima across samples are directly comparable.
4.1 Point abundance estimation of Salmon tends to be near the boundaries of ranges of optima across 16 Human Body Map samples
Applying our method to the Human Body Map RNA-seq samples, we characterize the distance between Salmon's single estimate and the other values in the range of optima. This distance represents the potential quantification error due to non-identifiability. The ranges of optima are calculated under both the complete and the incomplete reference assumptions.
Salmon’s estimates tend to lie near either the smallest or the largest values of the range of optima, rather than in the middle, across all 16 Human Body Map samples (Figure 4A). The cause of this behavior is currently unknown to us, but we suspect it is related to the initialization and EM optimization of the convex objective, and further study of this phenomenon is of theoretical interest. The upper bounds of the optimal ranges are on the same scale as, and usually no greater than 10 times, Salmon’s estimate (Figure 4B). These patterns hold under both the complete and incomplete reference assumptions. A transcript determined to be lowly expressed under the complete reference assumption may have higher expression when unannotated transcripts are allowed to be expressed. For these transcripts, Salmon’s estimate is closer to the lower bound of the range of optima, as it is derived under the complete reference assumption.
Note that Figure 4 only shows the subset of transcripts for which the Salmon estimate lies between the lower and upper bounds of the range of optima in the incomplete reference case. Under the incomplete reference assumption, Salmon’s point estimate may fall outside the range of optima (Supplementary Figure S1). This is because the optimal solution over a particular set of parameters (Salmon’s estimates for reference transcripts) may no longer be optimal in an expanded parameter space (when we take novel transcripts into consideration).
4.2 Differentially expressed transcripts are generally reliable when assuming the reference contributes more than 40% to the expression
Applying our method to the MCF10 cell line samples with and without EGF treatment (accession SRX1057418) [28], we analyze the reliability of differential expression (DE) detection at the transcript level. The differential expression pipeline uses Salmon [4] for quantification, then tximport [23] and DESeq2 [24] for detecting DE transcripts. This pipeline uses only Salmon’s estimates and predicts 257 DE transcripts with adjusted p-values no greater than 0.01. We use the overlap between the mean ranges of optima for the two conditions as an indicator of unreliable DE detection (extending Section 3.5). Since the range of optima depends on the percentage of expression from the reference under the hybrid assumption, we compute the overlap for each percentage between 0 and 1 and refer to this percentage as the reference completeness parameter. The overlap is defined as the size of the intersection over the size of the smaller of the two ranges of optima. The threshold of overlap to declare an unreliable DE detection is 25%.
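The overlap criterion can be written down directly. The sketch below (function names are ours) computes the intersection size over the size of the smaller interval and flags an unreliable DE call at the 25% threshold:

```python
def range_overlap(a, b):
    """Overlap between two ranges of optima: size of the intersection
    divided by the size of the smaller interval."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    smaller = min(a[1] - a[0], b[1] - b[0])
    return inter / smaller if smaller > 0 else 0.0

def is_unreliable(range_a, range_b, threshold=0.25):
    """Flag a DE prediction as unreliable when the mean ranges of optima
    for the two conditions overlap by at least `threshold`."""
    return range_overlap(range_a, range_b) >= threshold
```

For example, ranges (0, 10) and (9, 12) share an intersection of length 1 against a smaller interval of length 3, an overlap of about 33%, so the call would be flagged; disjoint ranges such as (0, 4) and (5, 8) would not.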
We identify examples of both reliable and unreliable DE predictions. There are five differentially expressed transcripts whose differential expression statuses are very sensitive to the expression of unannotated transcripts. Figure 5A–C shows the lower and upper bounds of the transcript expression for three of the five transcripts. Even when the reference completeness parameter is larger than 90% (the unannotated transcripts have less than 10% expression), their expression estimates suffer from such great uncertainty that the ranges of optima of the two DE groups overlap. The five genes corresponding to the five transcripts are involved in the following cellular processes or pathways: mRNA degradation, cell apoptosis, glucose transportation, DNA repair, and inhibition and transportation of certain kinases [29]. These genes contain 6–22 isoforms. The sensitivity of these transcripts to unannotated transcript expression suggests that their differential expression status may be unreliable. Instead, combining all annotated and unannotated splice graph paths for these unreliable transcripts and testing differential expression at the gene level may provide a more convincing conclusion.
Table 1 shows the five transcripts and the adjusted p-values of differential expression at both the transcript and gene levels. We use tximport to summarize Salmon’s transcript-level quantification to the gene level, and use the same DESeq2 procedure for gene-level DE detection. Using a threshold of 0.01 on the adjusted p-value, we obtain 34 differentially expressed genes. For some differentially expressed transcripts, the gene-level DE tests are not significant, which indicates the occurrence of differential alternative splicing events. The gene-level DE tests for the five unreliable transcripts are not significant. For these five nonsignificant genes with significant but unreliable DE transcripts, it is very likely that neither differential gene/transcript expression nor differential alternative splicing occurs.
On the other hand, when both a gene and its corresponding transcripts are differentially expressed, the transcripts are usually robust to non-identifiability and have high tolerance to the expression of unannotated transcripts. There are 25 DE transcripts whose corresponding genes also pass the gene-level DE test. All of them can tolerate more than 50% unannotated transcript expression, at which value the ranges of optima of the two DE groups still have zero or only a small overlap. Transcript ENST00000527470.5 is one example (Figure 5D). This transcript belongs to gene NASP, which is involved in histone transportation during cell division [29].
In general, the ranges of optima tend to be wider and more likely to overlap with each other when more expression comes from unannotated transcripts. However, the majority of DE predictions are reliable when the reference transcripts are relatively complete and contribute more than 40% of the expression (Figure 5E). Meanwhile, there exist a few DE predictions that are unreliable and sensitive to even small proportions of expression from unannotated transcripts.
On another dataset, we observe the same pattern in the reliability of transcript-level differential expression (Supplementary Figure S2). The dataset includes RNA-seq of several immune cell types from peripheral blood mononuclear cells (accession GSE107011) [30, 31], and we compare the four replicates of naive CD8 T cells and four replicates of effector memory CD8 T cells for differential expression detection. There are 3152 differentially expressed transcripts at an adjusted p-value threshold of 0.01, out of which 19 are unreliable under less than 10% unannotated transcript expression. The distribution of the reference completeness parameter values under which the DE transcripts become unreliable is similar to that in the MCF10 cell line dataset. The Gencode reference transcriptome annotation is used for both datasets. Thus, the conclusion that most DE transcript predictions are reliable when the reference transcript expression exceeds 40% is potentially generalizable to more datasets under the same reference annotation.
5 Discussion and Conclusion
In this work, we proposed approaches to compute the range of equally optimal transcript abundances in the presence of non-identifiability. The ranges of optima help evaluate the accuracy of the single solution from current quantifiers. Analyzing the reliability of differential expression detection using the ranges of optima on MCF10 cell line samples, we observe that most predictions are reliable. Specifically, the ranges of optimal abundances between the case and control groups of the predicted transcripts do not overlap unless the expression of unannotated transcripts exceeds 60% of the whole expression. However, we still identify five unreliable predictions for which the ranges of optima between conditions largely overlap even when the reference transcriptome contributes more than 90% of the whole expression.
The reference completeness parameter is unknown in our model, and we address this by investigating the ranges of optima under varying reference completeness values. However, determining the reference completeness value that best fits a dataset, as an indicator of reference trustworthiness, is an interesting question in itself, and we believe transcript assembly or related methods might be useful for choosing the correct reference completeness value for each dataset.
The non-identifiability problem in expression quantification is partly due to the contrast between the complex splicing structure of the transcriptome and the short length of observed fragments in RNA-seq. Recent developments in full-length transcript sequencing might be able to close this complexity gap by providing longer-range phasing information. However, full-length transcript sequencing techniques suffer from problems such as low coverage and high error rates. It remains open whether this technology is appropriate for quantification under different assumptions and how current expression quantification methods, including this work, should be adapted to work with full-length transcript sequencing data.
Disclosure Statement
C.K. is a co-founder of Ocean Genomics, Inc.
S1 Omitted Proofs and Discussion
We cover additional discussion and omitted proofs in the supplementary materials, some omitted due to page limits and some for being too technical, as indicated in various places in the main text. The individual sections, unless otherwise specified, are self-contained and can be read out of order. The first section below provides a brief overview of each section and how they support and extend the arguments in the main text.
Note that as we deal with network flows more directly in the supplementary sections, we will use fe to denote the flow through edge e instead of a fragment.
S1.1 Overview
Sections S1.2 to S1.5 complete and extend our discussion of the reparameterization procedure. One core ingredient of the process is the derivation of the path effective length, which is only possible with a suitably defined transcript effective length, which we define differently from most other literature. In Section S1.2, we discuss in detail our definition of transcript effective length and how it relates to the “mainstream” definition (in conclusion, there is not much difference assuming transcripts are longer than reads). In Section S1.3, we complete the derivation of the “effective length swapping” equation, which removes the dependence on individual transcripts from the normalization constant and is critical for the reparameterization. The remaining two sections concern implementation of the reparameterized model. Section S1.4 discusses in detail the effect and necessity of culling the phasing path set. Section S1.5 describes how bias correction, an important aspect of modern transcript quantification tools, can be integrated into the reparameterized model, and the limitations of the procedure.
Section S1.6 describes the compact prefix graph. This is both theoretically and practically important for the whole framework, but it is harder to describe than the model we presented in the main text (which has a direct connection to the Aho-Corasick automaton). As suggested by the name, the compact prefix graph is smaller than the prefix graph given the same input (splice graph and phasing path set), while keeping the same guarantees of preserving both transcripts and path abundances. This is the preferred model for implementing our methods, and we indeed use it in our own code. The theoretical importance of the model comes from the fact that, under some conditions on the phasing path set, the compact prefix graph is the “minimal” representation of the path abundances, in the sense that any change in the compact prefix graph flow changes the path abundances. This compactness means that we provably recover every optimal quantified transcript set by decomposing the optimal compact prefix graph flow, which might not be true for the original prefix graph model, as discussed before.
Section S1.7 describes the inference process in detail, which we call Localized EM. The idea of the process has been laid out in the main text.
Sections S1.8 and S1.9 prove the correctness of the algorithms for AND-Quant and OR-Quant. For OR-Quant, we prove the correctness of the upper and lower bounds separately, with the same general idea: given the maxflow, we generate a decomposition that has the correct amount of good flow, and then prove that the total good flow of any decomposition is limited by the maxflow in certain ways. For AND-Quant, we prove correctness by exploiting the well-ordering property and stratifying the vertices of the prefix graph into layers.
Section S1.10 extends the subgraph quantification algorithms with the Reallocation LP. The main purpose of this process is to consider other prefix graph flows that yield the same path abundances as the current flow, closing a gap in the statement of the optimal solution set under consideration. The combination is useful beyond closing this gap: the main text only introduces edge subsets corresponding to single exons, which is powerful enough to quantify full-length transcripts and partially specified transcripts, but by “injecting” nonexistent phasing paths into the prefix graph, we support much more flexibility in substructure design for the subgraph quantification process, while the Reallocation LP ensures the quantification is accurate in the presence of the injected paths.
Finally, Section S1.11 introduces an alternative modeling under the hybrid assumption. Recall that in the main text we produce the range of optima for each transcript by linearly interpolating the two ranges of optima from the two extreme assumptions. In fact, this corresponds to a rather naive model of read generation, which we describe in detail in that section. We then propose an alternative and more realistic model of hybrid read generation, at the cost of higher computational complexity.
S1.2 Defining the Effective Length of Transcripts
Recall in Section 2.2 we define the effective length for transcript Ti as the doubly-sum l̃i = Σ1≤j≤k≤li D(k − j + 1), where li is the actual length of Ti. In most works, the effective length is instead defined as li − μ(Ti), the actual transcript length li minus the truncated mean of D, where the truncated mean is defined as μ(Ti) = Σl≤li l·D(l) / Σl≤li D(l). The following lemma establishes the connection between the two definitions.
l̃i = (li − μ(Ti) + 1) · Σl≤li D(l).
Proof.
Group the doubly-sum by fragment length l = k − j + 1. For each length l ≤ li there are li − l + 1 start positions, so l̃i = Σl≤li (li − l + 1) D(l) = (li + 1) Σl≤li D(l) − Σl≤li l·D(l) = (li − μ(Ti) + 1) · Σl≤li D(l).
This completes the proof.
Ignoring the difference between li and li + 1, the two definitions differ by a multiplicative factor of Σl≤li D(l) (intuitively, this is the probability of sampling a fragment no longer than li). This factor is exactly 1 if the transcript is longer than any possible fragment, and very close to 1 as long as the transcript is longer than most fragments, which is indeed the case for most transcripts in an RNA-seq experiment. We choose the doubly-sum formula for the effective length for its strong connection to a generative model. Additionally, the generative model is also more natural to state with the effective length defined this way: ci is the copy number of molecules for transcript Ti, and assuming the probability of observing a fragment f from transcript Ti is proportional to ciD(f | Ti), the total probability mass assigned to Ti is naturally proportional to ci l̃i.
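The two definitions can also be compared numerically. The sketch below (our own illustration, with D given as a dict from fragment length to probability) computes the doubly-sum effective length and the truncated-mean form scaled by the probability of sampling a fragment no longer than li; by the lemma above they agree:

```python
def eff_len_double_sum(li, D):
    """Doubly-sum effective length: sum over all start/end positions
    (j, k) within the transcript of D(k - j + 1)."""
    return sum(D.get(k - j + 1, 0.0)
               for j in range(1, li + 1) for k in range(j, li + 1))

def eff_len_truncated(li, D):
    """(l_i - mu(T_i) + 1) scaled by the probability mass of fragments
    no longer than l_i, with mu the truncated mean of D."""
    mass = sum(p for l, p in D.items() if l <= li)
    mu = sum(l * p for l, p in D.items() if l <= li) / mass
    return mass * (li - mu + 1)
```

For example, with D = {2: 0.5, 3: 0.5} and li = 5, both forms evaluate to 3.5, and the scaling factor is 1 because all fragments fit in the transcript.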
S1.3 Re-calculating Normalization Constant
In this section, we prove the effective length swapping equation, which states that the normalization constant Σi ci l̃i can be rewritten as the sum of path abundances weighted by path effective lengths, Σp cp l̃p; this is necessary for the reparameterization process described in Section 2.2. The idea is to break down the normalization constant into a sum over fragments, and regroup the fragments by the path they map to.
The third equation holds because the sum of D(k − j) is the same across any transcript containing path p, as a shift in reference coordinates does not change D(k − j), assuming no bias correction.
S1.4 Trimming Set of Phasing Paths
Recall the likelihood function under our reparameterized model from Section 2.2. Paths with no mapped fragments do not contribute to the likelihood product, as there is no f ∈ F such that p(f) = p. Such a path does play a role in calculating the normalization constant. However, since we only remove paths with very low path effective length, their contribution to the normalization constant is intuitively small. This removal thus causes a small underestimation of the normalization constant, and in turn a small overestimation of transcript abundances (and path abundances).
If there were a removed path with large cp when optimizing under the full model, it would mean that in the inferred quantified transcript set many fragments are assigned to path p, even though exactly zero fragments map to p in the sequencing library. This mostly happens when there is a dominant transcript with extremely high abundance and p is part of that transcript. For this to happen, the transcript must have a huge number of mappable fragments, and the fact that not a single fragment maps to path p indicates that the probability of sampling a fragment from p is incredibly small, or the model might be faulty.
This trimming is necessary because the fragment length distribution D(l) usually has a long tail when inferred from experiments, due to smoothing and potential mapping errors, leading to many extremely long paths from which it is nearly impossible to sample a read. The trimming might affect our later analysis of non-identifiability from a theoretical perspective, as intuitively, trimming down the phasing path set increases the co-dimension of {cp} in {ci}, leading to wider ranges of optima. However, as described above, it is nearly impossible to sample a read from such paths unless we have an infinite amount of data, which is not the case in practice. In other words, the trimming can be seen as a modification of the original likelihood model to account for the fact that we cannot generate infinitely large sequencing libraries.
S1.5 Path Dependent Bias Correction
Our proposed model is also capable of bias correction, but not to the full extent done in Salmon. We define the affinity Ap(j, k) as the unnormalized likelihood of generating a read pair mapped to path p from position j to k in some coordinate system. This is the analogue of P(fj | ti) in the Salmon model, excluding the sequence-specific terms (which can also be integrated into our model directly, but they do not appear in the effective length calculation). For the non-bias-corrected model, we simply have Ap(j, k) = D(k − j + 1). Certain motif-based and GC-content-based corrections, calculated from the genomic sequence spanned by the paired-end alignment, can then be integrated into our analysis naturally. To adapt the likelihood model with bias correction, the definitions of transcript and path effective length are changed by replacing the fragment length probability with the corresponding affinity in the sums:
Note that Ap is still the same for any Ti containing p, and we assume the coordinate system used when calculating Ap coincides with the internal coordinates of Ti. The definition of path abundance remains unchanged, and the inference and subgraph quantification processes stay the same.
Admittedly, there is no way to know the exact location of a read pair within the transcript after reparameterization with path abundances, so transcript-specific bias correction cannot be done exactly. Nonetheless, since the splice graph is known in full, approximate bias correction is possible. In our experiments, let Ai(j, k) denote the affinity value calculated from a full bias correction model for the fragment generated from base j to k on transcript Ti, and let ĉi be the reference abundance of transcript Ti. Then we let Ap(j, k) = Σi ĉi Ai(j′, k′) / Σi ĉi, where the sums run over transcripts containing p, and j′ and k′ are the coordinates of the path sequence in the reference coordinates of Ti. This means we average the affinity values calculated from known transcripts at this locus, weighted by their reference abundance. For simplicity, we use Salmon outputs as reference abundances, but other approaches, including an iterative process of alternately estimating Ai(j, k) and cp, are possible.
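The averaging step can be sketched as follows. All names here are ours: transcript_affinities[i] stands for Ai, ref_abundance[i] for the reference abundance ĉi, and coords maps each transcript containing the path to the fragment's coordinates (j′, k′) in that transcript.

```python
def path_affinity(transcript_affinities, ref_abundance, coords):
    """Approximate A_p(j, k): average of the transcript-level affinities
    A_i(j', k') over the transcripts containing the path, weighted by
    their reference abundances (e.g. Salmon estimates)."""
    num = sum(ref_abundance[i] * transcript_affinities[i](j, k)
              for i, (j, k) in coords.items())
    den = sum(ref_abundance[i] for i in coords)
    return num / den if den > 0 else 0.0
```

With two equally abundant transcripts whose affinities at the mapped coordinates are 1.0 and 3.0, the averaged path affinity is 2.0.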
One limiting factor for bias correction is that calculating the path effective length can be expensive, as we need to compute the affinity for every possible fragment. To speed up this process, we can use a simplified form for Ap(j, k) so that the effective length has a closed-form solution (for example, disallowing bias correction when calculating effective length, similar to existing approaches for transcript effective length), sample from the possible Ap(j, k) when the number of fragments from a particular path is large, or precompute the values for a fixed genome and bias correction model. These approaches may result in slightly inaccurate path effective lengths, and it is still an open question how this affects downstream procedures.
S1.6 Compact Prefix Graph
While we introduced the prefix graph in the main text as the state transition graph of the Aho-Corasick automaton, this is not the best way to construct the graph. Intuitively, this is because the edges of the transition graph store more information than the vertices (states), and we only used the vertices to represent phasing paths before. In this section, we show how to use the transitions to represent phasing paths, yielding what we call the compact prefix graph. The compact prefix graph is smaller than the prefix graph, making flow inference and subgraph quantification easier. Moreover, the compact prefix graph is also guaranteed to be a minimal representation of the path abundances, as we show in a theorem later in this section.
We define the compact prefix graph still by the transition graph of an Aho-Corasick FSA, but with a different set of patterns.
(Aho-Corasick CFSA for phasing path matching). The Compact Aho-Corasick Finite-State Automaton (CFSA) is built as follows:
The alphabet is the set of all vertices in the splice graph.
The patterns to match against are the union of (1) all paths of length 1, and (2) all paths in the phasing path set with the last exon removed.
The dictionary suffix link d[s, a], where s is a state in the FSA and a is a character (a vertex of the splice graph in our case), is calculated if and only if there is an edge from the last vertex of s to a.
Additionally, each suffix link e is tagged with t(e) = sa, the path represented by s followed by a, which by our construction is still a valid path in the splice graph.
A path p in the phasing path set is said to be recognized at a suffix link with tag s′ if p is a suffix of s′. AS(p) is defined as the set of suffix links that recognize p.
Intuitively, the matching of phasing paths now happens over FSA transitions instead of FSA states, which means the states do not need to contain full phasing paths. We construct the compact prefix graph in the same way as in the main text: the states of the CFSA are the vertices and the dictionary suffix links are the edges, with the initial state excluded as usual. The flow over the compact prefix graph, which we denote {fe} for the rest of this section (NOT the fragments), has the same properties as the flow over the original prefix graph.
(Properties of Compact Prefix Graph). Given a compact prefix graph and a flow {fe} mapped from a quantified transcript set {ci}:
The compact prefix graph is a DAG.
There is a one-to-one mapping between S – T paths in splice graph and [S] – [T] paths in compact prefix graph.
The value {cp} can be derived from the compact prefix graph alone, and is preserved during mapping between compact prefix graph flows and quantified transcript sets: cp = Σe∈AS(p) fe.
Proof.
We label each node in the compact prefix graph by the last vertex of its corresponding CFSA state, which we denote l(v) (this is different from the edge tags). If there is an edge (u, v) in the compact prefix graph, there is an edge (l(u), l(v)) in the splice graph. If the compact prefix graph had a cycle, the cycle could be mapped back to a cycle on the splice graph via the labels, a contradiction since the splice graph is assumed to be a DAG.
For an [S] − [T] path of the compact prefix graph, the labels of the vertices along the path constitute a path on the splice graph. Conversely, we can generate a path on the compact prefix graph by feeding a transcript to the CFSA and listing the visited states.
We need to show that exactly one edge in the set AS(p) is visited if a transcript T contains p. This is the transition right before the last exon in p is visited (the destination of this edge has the label that equals the last exon in p), so it happens exactly once as no exons can be visited twice. The rest of the proof is identical to the original prefix graph proof.
The compact prefix graph formulation is the most compact representation of path abundance in a certain sense.
(Compact Prefix Graph Flow can be Minimal Sufficient). With a fixed splice graph and phasing path set, if the phasing path set is prefix-closed (for each phasing path p, every prefix of p is also a phasing path), there is a one-to-one mapping between the feasible set of path abundances {cp} and the feasible set of compact prefix graph flows {fe}.
Proof. The mapping from a compact prefix graph flow to {cp} is direct and unique. For the other direction, assume the theorem is false; then there exist two compact prefix graph flows {fe} with identical {cp}. Since each cp is a sum of several flow values, the difference between the two flows, {Δfe}, satisfies Σe∈AS(p) Δfe = 0 for every phasing path p.
We now prove that Δfe = 0 for all e, using induction on |t(e)| (the length of the tag on e), which we denote k. The induction hypothesis for k is that Δfe = 0 for all edges e with tag no shorter than k. For the base case where k > maxe|t(e)|, there are no edges with tag length k or longer, so the hypothesis holds trivially. Assume the hypothesis holds for k + 1. For each edge e′ with tag length k, let p′ = t(e′). By the construction process of the CFSA, p′ is a prefix of some phasing path, and by our prefix-closure assumption p′ is itself a phasing path. This means there is an equation of the form Σe∈AS(p′) Δfe = 0. The set AS(p′) contains the edge e′ and several other edges with tags longer than p′ (as p′ is a suffix of those tags). By the induction hypothesis, Δfe = 0 for every edge in AS(p′) other than e′, which means Δfe′ = 0 as well.
S1.7 Localized Expectation-Maximization
Technically this section depends on the previous section, as we only use the compact prefix graph (which we simply call the prefix graph) for the actual derivations of the model. However, the only notable difference is that AS(p) is now a set of edges instead of vertices. This applies to all later sections.
In this section, we provide the full derivation and specification of the EM process designed for the reparameterized model. For clarity, we start with the base case, that is, when there are no multimapped read pairs and only one gene. The full optimization problem is defined as follows (again, fe is the flow through an edge of the compact prefix graph), where we let E be the edge set of the prefix graph and V be the vertex set of the prefix graph. In(v) and Out(v) denote the incoming and outgoing edges of v.
Note that we rewrite the optimization target in a slightly different form: the normalization term becomes a constraint instead of dividing cp in the objective function. The two formulations are equivalent, except that cp might be scaled by a constant between the optimal solutions.
We first look at multimapped read pairs. Let M(f) denote the set of paths that a read pair f can map to, and D(f | p) the fragment length probability when the read pair is mapped to path p; the objective function becomes the following, with the constraints the same as before. With a standard EM argument, we can define allocation variables z and perform the following EM step, where c′ and z′ denote values from the last iteration. We omit the prefix graph flow constraints for c in the M-step for brevity.
Next, we adapt our EM algorithm to optimize over multiple genes at once. Let g denote a gene and consider the set of paths belonging to g. The target function during the M-step can be decomposed by genes, but all genes together have to satisfy the normalization constraint. If the gene abundances were known in advance, the optimization could be done on each gene separately (as the prefix graph flow constraints are all local to genes). The following lemma shows this is indeed achievable, even without knowing cg:
For an optimization instance on gene g with gene abundance cg, the optimal solution obtained when cg is set to 1, scaled by cg, is optimal for the original problem.
Proof. We show that if {cp} is optimal for gene abundance cg, then {cp/cg} is optimal for gene abundance 1. First, the flow balance constraints are scale-invariant, so they hold for {cp/cg}. The normalization constraint also holds by definition. We now look at the objective function, which splits into a Const term that is constant with respect to cp and a variable term. The variable term can be seen as the target function of the problem with cg = 1 and cp/cg as variables. This means maximizing the two sides is equivalent.
We now plug this back into the log-likelihood function and solve for cg by taking the derivative of the Lagrangian while treating the within-gene solution as constant. This gives the localized EM process as follows:
The gene abundances are normalized by |F|, because the sum of all allocation variables over the whole library is exactly |F| (they sum to 1 for every fragment). Again, in the M-step, the variables satisfy the prefix graph constraints.
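As a toy illustration of one localized E/M iteration (not the paper's implementation: paths are keyed by hypothetical (gene, transcript) tuples, D(f | p) is passed in as a precomputed per-read weight, and the within-gene update uses the plain multinomial closed form as a stand-in for the flow-constrained M-step):

```python
def localized_em_step(c, reads):
    """One E+M step. c: {(gene, path): abundance}, summing to 1 overall.
    reads: list of dicts {(gene, path): D(f|p)} over each read's candidate mappings."""
    # E-step: allocate each read across the paths it maps to.
    z = {p: 0.0 for p in c}
    for r in reads:
        tot = sum(c[p] * d for p, d in r.items())
        for p, d in r.items():
            z[p] += c[p] * d / tot
    # M-step, localized per gene: the gene abundance is its share of |F|,
    # since the allocation variables of each fragment sum to 1.
    F = len(reads)
    by_gene = {}
    for p in c:
        by_gene.setdefault(p[0], []).append(p)
    c_new = {}
    for g, paths in by_gene.items():
        mass = sum(z[p] for p in paths)
        cg = mass / F
        for p in paths:
            # Solve the within-gene problem with c_g = 1, then rescale by c_g
            # (the scaling lemma above); here the c_g = 1 solution is a plain
            # proportional share, standing in for the flow-projected update.
            c_new[p] = cg * (z[p] / mass if mass > 0 else 0.0)
    return c_new
```

With uniquely mapping reads, one step already recovers the read proportions per gene; the point of the sketch is that the gene-level and within-gene updates decouple.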
S1.8 Algorithms for OR-Quant
In this section, we state the algorithms for OR-Quant from a purely graph theoretical perspective. Recall that our setup consists of a DAG G with predetermined source S and sink T, a flow F = {fe} on the graph (we decouple flows from the graph; also, these are not the fragments as in the main text), and an edge subset E′. We also use C(F) to denote the total flow, and in this way we write U = C(F).
We start with several basic definitions.
(Flow). Given DAG G with predetermined source S and sink T, a flow F is a set of edge weights of G satisfying non-negativity on every edge weight and flow balance on every vertex except S and T.
(Decompositions). Given graph G and a flow F = {fe}, a decomposition R is written as {Ti, ci} where each Ti is an S – T path on G and ci is a nonnegative number, satisfying Σi|e∈Ti ci = fe for every edge e. We use |R| to denote the number of S – T paths in R, and C(R) = Σ ci to denote the total flow / capacity of the decomposition.
(Partial Decompositions). A partial decomposition of a set of non-negative edge weights W = {we} on a graph G is written as {Ti, ci} where each Ti is again an S – T path on G, and ci is a nonnegative number satisfying Σi|e∈Ti ci ≤ we for every edge e. Alternatively, if W is a flow, a partial decomposition can simply be defined as a subset of a (full) decomposition of W.
Note that we define partial decompositions on arbitrary non-negative edge weights, as required for a later proof. As mentioned in the main text, we present the following lemma without proof.
(Finite Decomposition). Every flow on a graph of finite size has a decomposition of finite size.
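As an illustration of the lemma, here is a minimal sketch of the standard greedy argument in Python (the graph encoding and function name are ours, not the paper's implementation): repeatedly walk from S along positive-flow edges and peel off the bottleneck amount; each peel zeroes at least one edge, so at most |E| paths are produced.

```python
from collections import defaultdict

def decompose(edges, S, T, eps=1e-9):
    """Greedy path decomposition of a DAG flow.
    edges: dict {(u, v): weight} satisfying flow balance.
    Returns a list of (path, weight) pairs."""
    flow = dict(edges)
    adj = defaultdict(list)
    for (u, v) in flow:
        adj[u].append(v)
    paths = []
    while True:
        # Walk from S along edges with positive residual flow; by flow
        # balance, such a walk can only stop at T.
        path, u = [S], S
        while u != T:
            nxt = next((v for v in adj[u] if flow[(u, v)] > eps), None)
            if nxt is None:
                break
            path.append(nxt)
            u = nxt
        if u != T:
            break  # no remaining S-T path with positive flow
        # Peel off the bottleneck; this zeroes at least one edge.
        c = min(flow[(path[i], path[i + 1])] for i in range(len(path) - 1))
        for i in range(len(path) - 1):
            flow[(path[i], path[i + 1])] -= c
        paths.append((path, c))
    return paths
```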
We now restate the definition of OR-Quant slightly more formally.
Let G = (V, E) be a DAG with a flow F, and E′ ⊆ E. An S – T path is good if it intersects E′. (We also say the path is bad if it does not intersect E′.) For a decomposition R = {Ti, ci}, the total good flow is defined as Q(R) = Σi | Ti good ci, that is, the total flow from good paths in the decomposition. OR-Quant(G, E′) asks for the minimum and maximum Q(R) over all possible decompositions R of F.
We now state our algorithms and prove the correctness of the algorithms, starting from the lower bound.
(Diff-Flow). The auxiliary graph G− is built with edge set E − E′. Z = U – MaxFlow(G−) is the minimum total good flow Q(R) over all possible decompositions R of F.
Proof. We prove the theorem in two parts: first, we show there is a decomposition R such that Q(R) = Z; then we show any decomposition satisfies Q(R) ≥ Z.
The key observation is that all S – T paths that are bad are fully in G−. For the first part, fix a maximum flow of G− as F−. We let R− be an arbitrary decomposition of F−, and R+ be an arbitrary decomposition of F – F−.
We have Q(R−) = 0 because no path in R− intersects E′. We now claim Q(R+) = Z. Because C(R+) = C(F) – C(R−) = U – MaxFlow(G−) = Z, it suffices to show every path in R+ intersects E′. Assume otherwise; then we can remove a bad path from R+ and add it to R−, which leads to a larger C(R−). This is impossible: it implies F− is not the MaxFlow of G−, because the flow of R− would be a strictly better solution. Having proved Q(R+) = Z, we know R− + R+, a (full) decomposition of F, has exactly Z total good flow.
Now for any decomposition R, we split it into two parts: R+ contains all good paths in R, and R− contains all bad paths in R. We have C(R−) ≤ MaxFlow(G−), because the flow of R− is a flow on G−. Because Q(R−) = 0, we have Q(R) = Q(R+) = C(R+) = U – C(R−) ≥ U – MaxFlow(G−) = Z.
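The Diff-Flow bound can be sketched as follows (our own encoding; a plain Edmonds–Karp max-flow stands in for whatever solver is used in practice): drop E′ to obtain G− and return Z = U − MaxFlow(G−).

```python
from collections import defaultdict, deque

def maxflow(cap, S, T):
    """Edmonds-Karp max flow. cap: dict {(u, v): capacity}."""
    res = defaultdict(float)
    res.update(cap)
    adj = defaultdict(set)
    for (u, v) in cap:
        adj[u].add(v)
        adj[v].add(u)  # allow traversal of residual (backward) edges
    total = 0.0
    while True:
        # BFS for an augmenting path in the residual graph.
        parent = {S: None}
        q = deque([S])
        while q and T not in parent:
            u = q.popleft()
            for v in adj[u]:
                if v not in parent and res[(u, v)] > 1e-9:
                    parent[v] = u
                    q.append(v)
        if T not in parent:
            return total
        path, v = [], T
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        b = min(res[e] for e in path)  # bottleneck capacity
        for (u, v) in path:
            res[(u, v)] -= b
            res[(v, u)] += b
        total += b

def min_good_flow(flow, S, T, Eprime):
    """Diff-Flow: Z = U - MaxFlow(G-), where G- keeps only edges outside E'."""
    U = sum(w for (u, v), w in flow.items() if u == S)
    Gminus = {e: w for e, w in flow.items() if e not in Eprime}
    return U - maxflow(Gminus, S, T)
```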
(Split-Flow). The auxiliary graph G* is built by adding a vertex T′ to G and, for each edge in E′, changing its destination to T′. T′ replaces T as the sink of flows. Y = MaxFlow(G*) is the maximum total good flow over all possible decompositions of F.
Proof. We prove the theorem in a similar style: first we show there is a decomposition R of F such that Q(R) = Y; then we show any decomposition satisfies Q(R) ≤ Y.
For the first part, we first build an arbitrary decomposition R* from a maximum flow of G*. However, R* is not a valid partial decomposition of F, because G* and G have different sets of edges. Our goal is to obtain a partial decomposition R+ of F that is a “reconstruction” of R*. We define several terms for convenience.
Path Mapping for G*
A good S – T path on G is mapped to an S – T′ path on G* by finding the first edge of the path that is in E′, changing the destination of that edge to T′, and discarding the edges after it so the path ends at T′. (We never map a bad S − T path on G this way.)
Similarly, an S − T′ path on G* is mapped to an (incomplete) path on G by moving the destination of its last edge (which was T′) back to its original node before the transformation. We assume a label containing the original destination is kept on each edge that ends at T′; multiple edges can therefore exist between a node and T′, and the movement follows the label. The resulting path is guaranteed to intersect E′, but is not a complete S – T path.
We apply an induction argument, formally defined as follows:
Flow Reconstruction Instance
An instance of flow reconstruction has two inputs: (F, R*). F is a flow on G, and R* is a partial decomposition of F* (on G*), where F* is constructed in the same way as G*, by moving endpoints of edges in E′ to a new node T′. Note that F* is not a flow, but a partial decomposition is still well-defined in this scenario. The output of the instance is R+, a partial decomposition of F, satisfying: (1) each path in R+ is good, and (2) if we map each path in R+ back to G*, the resulting partial decomposition has the same flow as R*.
We apply the induction on |R*|, the number of paths in R*. The base case is |R*| = 0, where we can simply return an empty R+. Assume now that flow reconstruction can be solved for |R*| ≤ k – 1; we solve an instance with |R*| = k. We first state the algorithm, then prove its correctness.
Induction Algorithm
We first pick the path (P, c) in R* whose second-to-last vertex (right before T′) is last among all paths in the topological ordering of G. Denote the last edge of this path, mapped back to G, as (u, v). P without its last edge is then a path from S to u. We let P′ denote P with its last edge changed back to (u, v).
We next generate an arbitrary decomposition R0 of F and let w be the flow on the edge (u, v) in F. We then do the following. First, discard all paths in R0 not containing (u, v), which results in a partial decomposition of F; since R0 was a full decomposition, its total flow through (u, v), and hence the total flow of this partial decomposition, is exactly w. Second, as every remaining path in R0 contains (u, v), change the routing before u to match P. Finally, recalling that c is the flow of the path P we picked from R*, scale the flow of each path in R0 by c/w, making the total flow c.
Given that R* is a partial decomposition of F*, the total flow of R* on the edge (u, T′) mapped from (u, v) does not exceed w. As (P, c) is a path in R*, this further implies c ≤ w and c/w ≤ 1.
We still need to show the resulting R0 is a valid partial decomposition of F. Every path in R0 is an S − T path and every flow value is still nonnegative, so we only need to prove that the total flow of R0 on each edge does not exceed its flow in F. For edges not in P′, the condition holds because the total flow on such an edge can only decrease during the three steps (discarding paths and scaling down flows). For edges in P′, the total flow in R0 is exactly c. As (P, c) is a path in R*, each edge in P has at least c flow in F*, which immediately means each edge in P′ has at least c flow in F.
To recap, R0 has total flow c, is a valid partial decomposition of F, and every path in R0 has P′ as a prefix. We continue the process by recursively calling the procedure on (F′, R′*), where F′ is F minus the flow of R0, and R′* is R* without (P, c). We return the partial decomposition from the recursive call merged with R0.
Proof of Induction Correctness
The new instance has size k – 1, since one path in R* is removed. We first prove that the recursive call is valid, that is, R′* is indeed a partial decomposition of F′* (F′* is constructed from F′ by moving the endpoints of edges in E′ to T′).
If the call is invalid, some edge in F′* has less flow than the total flow of R′* on it. Since R* is a partial decomposition of F*, the condition can only be invalidated for an edge with positive flow in R0, as these are the only edges where the flow on F′* is less than the flow on F*. By construction of R0, such an edge is either an edge in P, or an edge whose starting point is later than v in the topological order of G. If the edge is in P and is not the last edge, exactly c flow is subtracted on this edge both from F to F′ and from R* to R′*, so a violation would contradict the condition that R* is a partial decomposition of F*. If this is the edge (u, T′) mapped from (u, v), again exactly c flow is subtracted both from F to F′ and from R* to R′*.
If this is an edge not in P, we claim the edge has zero flow in R′*. This follows from the way we pick (P, c): the path with the latest second-to-last vertex. If the edge had positive flow in R′*, some path in R′* would include this edge, and the second-to-last vertex of that path would be later than v, hence later than u, meaning that path would have been chosen in place of P.
Given that the recursive call is valid, we can prove the correctness of the algorithm by showing that the R+ output by the algorithm satisfies the two output conditions. For (1), each path in R+ is good because it comes from an R0 at some recursion level, and all paths in R0 have P′ as a prefix, whose last edge (u, v) is in E′. For (2), at each recursion level, R0 mapped to G* becomes the single path P with flow c, which matches (P, c) in R*, so the condition is maintained by a similar induction argument. This finishes the correctness proof for the induction.
Completing the Proof
With the induction algorithm, we can reconstruct a partial decomposition R+ of F from R* (a decomposition of F*), then, similar to the last proof, construct another partial decomposition R− from F minus the flow of R+. By construction of R+, we have Q(R+) = MaxFlow(G*). If some path in R− intersects E′, we can remove that path from R− and add it to R+, then map the new R+ to G* to obtain a flow larger than MaxFlow(G*), a contradiction. This means Q(R−) = 0 and the decomposition R = R+ + R− satisfies Q(R) = MaxFlow(G*).
For the second part of the statement, for a decomposition R of F, we can write R = R+ + R− where R+ contains all good paths in R and R− contains all bad paths. We can map R+ to G* and denote the resulting partial decomposition R* (it is indeed a partial decomposition, noting there is a one-to-one mapping between edges of F* and edges of F). We have Q(R) = Q(R+) = C(R+) = C(R*) ≤ MaxFlow(G*). This completes the whole proof.
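The Split-Flow bound can be sketched in the same spirit (again our own encoding, with a BFS augmenting-path max-flow inlined so the snippet stands alone): redirect every edge of E′ to a fresh sink T′ and take the max flow. Parallel (u, T′) edges are merged, which does not change the max-flow value.

```python
from collections import defaultdict, deque

def maxflow(cap, S, T):
    """Edmonds-Karp max flow. cap: dict {(u, v): capacity}."""
    res = defaultdict(float)
    res.update(cap)
    adj = defaultdict(set)
    for (u, v) in cap:
        adj[u].add(v)
        adj[v].add(u)  # allow traversal of residual (backward) edges
    total = 0.0
    while True:
        parent = {S: None}
        q = deque([S])
        while q and T not in parent:  # BFS for an augmenting path
            u = q.popleft()
            for v in adj[u]:
                if v not in parent and res[(u, v)] > 1e-9:
                    parent[v] = u
                    q.append(v)
        if T not in parent:
            return total
        path, v = [], T
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        b = min(res[e] for e in path)  # bottleneck capacity
        for (u, v) in path:
            res[(u, v)] -= b
            res[(v, u)] += b
        total += b

def max_good_flow(flow, S, T, Eprime):
    """Split-Flow: Y = MaxFlow(G*), with E' edges redirected to a fresh T'.
    (T stays in the graph; flow routed into T simply stops there.)"""
    Tp = object()  # fresh sink, distinct from every vertex of G
    Gstar = defaultdict(float)
    for (u, v), w in flow.items():
        Gstar[(u, Tp) if (u, v) in Eprime else (u, v)] += w
    return maxflow(dict(Gstar), S, Tp)
```

On the small example used for Diff-Flow, the two bounds differ: the flow admits one decomposition with good flow 1 and another with good flow 2.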
S1.9 Algorithms for AND-Quant
In this section, we prove that the complementarity argument mentioned in the main text is correct. Our setup consists of a DAG G with predetermined source S and sink T, a flow F = {fe} on G, and a list of edge sets {Ek}. We define flows and decompositions in the same way as in the last section (Definition S5, Definition S6). Now recall the definition of AND-Quant, written slightly differently with the new notations:
(Well-Ordering Property). A list of edge sets {Ek} satisfies the well-ordering property if a path visiting ei ∈ Ei and then ej ∈ Ej at a later step implies i < j.
(AND-Quant). Let G = (V, E) be a directed acyclic graph with an edge flow F, and {Ek} be a list of edge sets satisfying the well-ordering property. An S – T path is good if it intersects each Ek. For a decomposition of F, the total good flow is the total flow from good S ‒ T paths. AND-Quant(G, {Ek}) asks for the minimum and maximum total good flow over all possible decompositions of F.
We start by discarding every edge e in any Ei such that no good S – T path uses e. By definition, removing these edges does not change the answer to AND-Quant, as no good S – T path is excluded. After this step, every edge e in every Ei is included in some good S – T path.
Now given G is a DAG and the edge sets satisfy the well-ordering property, we can define the following:
(Start/End-Sets and Natural Order). Define Ui = {u: (u, v) ∈ Ei}, Vi = {v: (u, v) ∈ Ei} and for convenience, V0 = {S}, Um+1 = {T}.
We define the natural order u ≤ v if there is a directed path starting at u and ending at v. We define x → Vi as the condition that there is some vi ∈ Vi such that x ≤ vi (intuitively, x can reach Vi); the conditions Vi → x, x → Ui, and Ui → x are defined symmetrically.
(The Block Graph). Define Bi, the ith block subgraph, as the set of edges (u, v) that satisfy Vi–1 → u and v → Ui, for 1 ≤ i ≤ m + 1. The full block graph GB is the union of all Bi and Ei.
Before we continue, we determine the time complexity of the filtering step (which discards any edge e in some Ei that no good path would include) and of the construction of GB. We claim both can be done as quickly as topological sorting of G (up to a constant factor), assuming the edge sets satisfy the well-ordering property, as follows.
We first discuss the filtering algorithm. We first run two breadth-first searches, from S and from T, to mark the sets of vertices that are reachable from S and that can reach T. Vertices failing either condition can never appear on an S – T path and are removed first. We let s(v) denote the largest i such that there is a path from S to v via an edge in each of E1, E2,…, Ei in order. We now generate the topological order of G and let s(S) = 0. For every vertex v in the topological order other than S, the value of s(v) is determined as follows: s(v) is first set to the largest s(u) over all predecessors u of v; then, for every (u, v) ∈ Ej with s(u) = j – 1, we set s(v) = max(s(v), j).
We now show the values s(v) are correctly derived. If v is indeed reachable from S via an edge in each of E1, E2, …, Ei, then s(v) ≥ i: we visit all nodes of such a path in topological order, and after an edge in Ei we always have s(v) ≥ i. Conversely, if s(v) ≥ i, we show v is reachable from S via an edge in each of E1, E2,…, Ei. Let p(v) be the predecessor of v from which s(v) is calculated, either by passing the value or by passing an edge in Ej. By chaining p(v), we obtain a path from S to v where each vertex is responsible for the value of s(v) of its successor on the path. Along this path, the value only increases by passing through an edge in Ej, and each such pass can only bring the value from j – 1 to j, so the path must contain an edge from each of E1, E2,…, Ei in order.
Similarly, we can obtain t(v), defined as the smallest i such that there is a path from v, via an edge in each of Ei, Ei+1,…, Em, to T, by setting t(T) = m + 1 and traversing the graph in reverse topological order. The filtering process then works as follows: each (u, v) ∈ Ei is kept in Ei if and only if s(u) = i – 1 and t(v) = i + 1. We prove this procedure is correct. A good S – T path containing (u, v) must visit an edge in each of {E1, E2,…, Ei–1} in order, visit (u, v) ∈ Ei, then an edge in each of {Ei+1, Ei+2,…, Em}. The first part is possible if and only if s(u) ≥ i – 1, and the last part is possible if and only if t(v) ≤ i + 1. It remains to prove that s(u) ≤ i – 1 (the argument for t(v) ≥ i + 1 is symmetric). Otherwise, there would be an edge in Ei on a path preceding u; as (u, v) is also in Ei, there would exist a path containing two edges in Ei, violating the well-ordering property.
We next discuss the block generation algorithm. In fact, we can simply reuse the values s(v) and t(v): an edge (u, v) not in any Ei is allocated to Bt(v) if s(u) + 1 = t(v). We prove this procedure is correct. If an edge satisfies s(u) + 1 = t(v) = k, there is an edge in Ek–1 on a path leading to u, meaning Vk–1 → u, and similarly v → Uk, which is exactly the definition of Bk. Conversely, if (u, v) ∈ Bk, we show s(u) + 1 = t(v) = k. Given Vk–1 → u and that the edge sets are filtered, there exists vk–1 ∈ Vk–1 with vk–1 ≤ u, so s(u) ≥ s(vk–1) = k – 1. If s(u) ≥ k, there is an edge in Ek on a path leading to u, so Vk → u. Given v → Uk at the same time, we have Vk → u ≤ v → Uk, so there is a path that visits an edge in Ek twice, violating the well-ordering property. We can similarly prove that t(v) = k is necessary and sufficient.
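The two linear-time passes above can be sketched as follows (a simplification with our own conventions: every vertex is assumed to already lie on some S–T path, each edge belongs to at most one Ej, and the topological sort is done by DFS):

```python
from collections import defaultdict

def preprocess(edges, E_sets, S, T):
    """edges: list of (u, v) pairs of a DAG whose vertices all lie on some
    S-T path; E_sets: list [E_1, ..., E_m] of edge sets assumed to satisfy
    the well-ordering property. Returns (filtered E_sets, blocks {i: B_i})."""
    m = len(E_sets)
    which = {e: j for j, Ej in enumerate(E_sets, 1) for e in Ej}
    out, inn = defaultdict(list), defaultdict(list)
    for (u, v) in edges:
        out[u].append(v)
        inn[v].append(u)
    # Topological order by DFS from S (graph assumed acyclic).
    order, seen = [], set()
    def dfs(x):
        seen.add(x)
        for y in out[x]:
            if y not in seen:
                dfs(y)
        order.append(x)
    dfs(S)
    order.reverse()
    # s(v): largest i with a path S -> v through E_1, ..., E_i in order.
    s = {v: 0 for v in order}
    for v in order:
        if v == S:
            continue
        s[v] = max(s[u] for u in inn[v])
        for u in inn[v]:
            j = which.get((u, v))
            if j is not None and s[u] == j - 1:
                s[v] = max(s[v], j)
    # t(v): smallest i with a path v -> T through E_i, ..., E_m in order.
    t = {v: m + 1 for v in order}
    for v in reversed(order):
        if v == T:
            continue
        t[v] = min(t[w] for w in out[v])
        for w in out[v]:
            j = which.get((v, w))
            if j is not None and t[w] == j + 1:
                t[v] = min(t[v], j)
    # Filter each E_i, then assign the remaining edges to blocks.
    filtered = [{(u, v) for (u, v) in Ej if s[u] == j - 1 and t[v] == j + 1}
                for j, Ej in enumerate(E_sets, 1)]
    blocks = defaultdict(set)
    for (u, v) in edges:
        if (u, v) not in which and s[u] + 1 == t[v]:
            blocks[t[v]].add((u, v))
    return filtered, blocks
```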
Combining the results, we have the following lemma for preprocessing (note that topological sorting is a linear time algorithm):
There is an O(n + m) algorithm that filters {Ei} such that every remaining edge in every Ei is included in at least one good S – T path, and constructs {Bi} as described in Definition S12.
We now return to the problem of answering AND-Quant.
(Well-Ordering Property Extended). If u ∈ Ui, v ∈ Vj, v ≤ u, then i > j.
Proof. We can derive this from the well-ordering property of {Ek}, by constructing an S – T path. First, as v ∈ Vj there is an edge in Ej that contains v, and by our previous assumption there is a path from S to v that uses this edge. As v ≤ u, there is a path from v to u. As u ∈ Ui, there is an edge in Ei that contains u, and similarly there is a path from u to T that uses this edge. Combining the three segments together, we have an S – T path that first visits an edge in Ej, then an edge in Ei later, so by the well-ordering property of {Ek}, i > j.
For each vi ∈ Vi, vi → Ui+1.
Proof. Recall that we removed all edges in Ei that do not belong to any good path. vi ∈ Vi implies there is an edge (ui, vi) ∈ Ei contained in a good path, and later on this good path there is an edge in Ei+1 that starts at some vertex in Ui+1. This implies vi → Ui+1.
For each ui ∈ Ui, ui → Uj for every j ≥ i.
Proof. ui ∈ Ui means there is an edge (ui, vi) ∈ Ei, so ui → Ui trivially. By applying the last lemma repeatedly, vi → Uj for j > i. The same then holds for ui, as ui ≤ vi.
We can prove the symmetric statements about Vi.
For a fixed i, each vertex x in GB either satisfies x → Ui or Vi → x, but not both.
Proof. If x ∈ Uj, then x → Ui if i ≥ j by Lemma S11, and Vi → x if i < j (since the edge sets are filtered, a good path reaches x through an edge in each of E1,…, Ej–1).
We can similarly prove that the condition holds for all x ∈ Vj: x → Ui if j < i, and Vi → x if i ≤ j.
If x is in none of the Uj or Vj, it lies on an edge in some Bj, which by the definition of Bj means both Vj–1 → x and x → Uj. Combined with Lemma S11, this gives x → Ui if j ≤ i, and Vi → x if j > i.
We have thus proved that for any vertex x and any i, at least one of x → Ui and Vi → x is satisfied. No vertex can satisfy both: otherwise we would have u ∈ Ui, v ∈ Vi and v ≤ x ≤ u, which would imply i < i by Lemma S9, a contradiction.
(Main Lemma for AND-Quant). An S – T path in G is good if and only if it is contained in GB.
Proof. We first prove that if a path is good, it lies within GB. An edge (u, v) on the path is either in some Ei, or between some edge in Ei–1 and some edge in Ei. In the latter case, we know Vi–1 → u and v → Ui, which directly implies (u, v) ∈ Bi.
To prove the other direction, we show that removing any Ei disconnects S from T in GB. By the previous lemma, the vertices of GB can be partitioned into two disjoint sets: those satisfying x → Ui and those satisfying Vi → x. For an edge (u, v) ∈ GB that starts in the first set and ends in the second:
If (u, v) ∈ Bj, we know Vj–1 → u → Ui (the first part from the definition of Bj and the second from u being in the first set; similarly for the rest of this proof), so j – 1 < i; at the same time Vi → v → Uj, so i < j. But there is no integer i strictly between j – 1 and j, a contradiction.
If (u, v) ∈ Ej, we know u ∈ Uj and u → Ui, so j ≤ i (because Vj–1 → u, using the fact that there is a good path including u, so i < j would put u in the second set as well); and Vi → v with v ∈ Vj, so i ≤ j (because v → Uj+1 by Lemma S10, so i > j would put v in the first set as well). This implies i = j.
So every edge crossing from the first set to the second is in Ei. In other words, removing all edges in Ei disconnects the two sets, so each S – T path in GB uses an edge in Ei, for each i. This means every S – T path in GB is good.
The main lemma means that every bad path intersects G – GB, so we can complement the result of OR-Quant on E′ = G – GB to get the desired result. Combined with our analysis for OR-Quant, we can solve AND-Quant as described in the main text, with time complexity equal to that of a MaxFlow computation.
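Spelling the complementation out (a sketch in our notation, writing EB for the edge set of GB; this symbol is ours): the bad flow of a decomposition is an OR-Quant quantity over E′ = E ∖ EB, so

```latex
\min_R Q_{\mathrm{AND}}(R) \;=\; C(F) - \mathrm{SplitFlow}\big(G,\, E \setminus E_B\big),
\qquad
\max_R Q_{\mathrm{AND}}(R) \;=\; C(F) - \mathrm{DiffFlow}\big(G,\, E \setminus E_B\big) \;=\; \mathrm{MaxFlow}(G_B),
```

where the last equality holds because DiffFlow(G, E′) = C(F) − MaxFlow(G−), and here G− is exactly GB.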
S1.10 Subgraph Quantification with Flow Reallocation
In this section, we describe how to integrate subgraph quantification with the Reallocation LP, and how this combination enables more flexibility in the choice of substructures in the subgraph quantification step. Note that the prefix graphs described here are the compact ones (Section S1.6), and some of the discussion in this section depends on the definition of the compact prefix graph. For the most part, the only visible difference is that in a compact prefix graph AS(p) is a set of edges, instead of vertices.
When we do subgraph quantification, we are finding path decompositions over prefix graph flows. As we proved before, if the phasing path set satisfies the prefix-closure condition, the compact prefix graph (defined in Section S1.6) flows keep track of exactly the set of path abundances required for the likelihood. However, if this condition fails, the compact prefix graph flow keeps more path abundances than necessary: these belong to paths that are prefixes of some other phasing paths, but are not phasing paths themselves. For such a path p, the abundance cp is also preserved during the mapping to and from quantified transcript sets. This means that when we perform subgraph quantification using only the compact prefix graph flow, we also fix the abundance cp of these paths. This leads to the scenario where other compact prefix graph flows might yield the same set of path abundances for the phasing paths but not for these outside paths. Existing subgraph quantification procedures would never consider such flows, which unnecessarily narrows the range of optima.
This restriction can be removed. In each of the subgraph quantification algorithms, we calculate a MaxFlow over some variant of the compact prefix graph, which can be written in LP form, where fe is determined from the compact prefix graph flow and the corresponding algorithm. However, we do not care about the exact compact prefix graph flow: only the path abundances {cp} determine the likelihood of the model. In a similar spirit to the reallocation process under the complete reference assumption (Section 2.2), we propose to reallocate the flow of the prefix graph as an LP. This results in the following LP, which we call reallocated subgraph quantification, where e denotes an edge in the compact prefix graph and the re-allocated flow replaces {fe}.
Note that fe appears only once in the new LP, and we can remove the explicit dependence on fe by using cp on the left-hand side, thanks to the Reallocation LP. In experiments we only report the bounds without this flow re-estimation step because its effect is small.
The idea of re-expressing maxflows as LPs is very powerful and possibly enables other extensions of subgraph quantification, but at the cost of higher computational complexity. For the rest of this section, we discuss how to quantify various substructures of the splice graph with the Reallocation LP.
The Reallocation LP allows us to expand the compact prefix graph arbitrarily. The first step is to modify the splice graph: add an extra pair of vertices S* and T* and an extra pair of edges (S*, S) and (T, T*). S* and T* are now the source and sink of the splice graph. This step is necessary because we will allow the phasing path set to contain paths through S or T (which no longer strictly correspond to phasing paths).
A simple example is quantifying an arbitrary set of transcripts together using an OR-Quant. This was impossible without the Reallocation process, because the variable cp where p corresponds to a whole S – T path does not exist, and the path abundance cannot be determined correctly from the prefix graph flow. However, we can now add all such transcripts, as S – T paths (note that the source and sink are now S* and T*), to the set of phasing paths and construct the new compact prefix graph. We can then quantify the transcripts with an OR-Quant, by setting E′ to be the union of AS(Ti) over the transcripts Ti and using the LP without dependence on fe.
Adding a whole path to the phasing path set is not free, as it increases the size of the resulting compact prefix graph. The number of vertices increases by at most the length of the transcript (measured in exons), and the number of added edges is upper bounded by the sum of out-degrees over the exons in the transcript. This means it is feasible to perform the quantification on a small set of transcripts, but not on a set whose size is comparable to the set of all S – T paths.
Another use of the new LP is to sidestep the well-ordering constraint in AND-Quant to an extent. Assume our goal is to quantify the set of transcripts that use either exon 1 or exon 2, as one of the conditions (so it appears as one of the edge sets Ei), but there is a path from exon 1 to exon 2 (if there is none, we can simply merge AS([1]) and AS([2]) into one Ei, which satisfies the well-ordering property). We will assume the path consists of a single edge, that is, there is an edge from exon 1 to exon 2. In this case, we add to the phasing path set all length-two paths that start at exon 1 or end at exon 2. The union of AS(p) over this new set can now serve as one Ei satisfying the well-ordering property, as none of these edges precedes another.
S1.11 Flow Inference with Partial Reference Completeness Assumptions
By expressing lλ as an affine combination of l0 and l1, we implicitly assume every read is split into a read with count λ that could come from any transcript with known junctions, and a read with count 1 – λ that comes from the reference transcriptome. Such modeling might not be ideal for some applications. A natural alternative is to assume that a λ portion of all fragments comes from any transcript with known junctions, while a 1 – λ portion comes from the reference. In this section, we show how to extend our methods to this alternative assumption via an alternative inference procedure. Note that in our original modeling, we only need to run inference for λ = 1, so this method is inherently more costly, as we need to run an inference instance for every λ under consideration. In addition to the flow values, we add, for each reference transcript Ti, a variable ci denoting its abundance, and change the formula of cp as follows:
We change the normalization term as follows:
The resulting optimization program quantifies the transcripts assuming at least a 1 – λ portion of reads comes from known transcripts. For genome-wide quantification, the same local-global EM algorithm can be used, assuming λ stays unchanged for each gene. To calculate the range of optima, we use the reallocation LP on {ci} and subgraph quantification on {cp}, then add the results together. This method is still limited in the sense that we have to make the assumption on a per-gene basis, and cannot model the scenario where a λ portion of the whole read library comes from novel transcripts (which requires a huge optimization instance, as discussed in Section 2.4). In practice, we stick to the simple affine combination of l0 and l1 as it is the most efficient; however, the algorithm presented in this section might be interesting for future extensions of the model.
Acknowledgements
This research is funded in part by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative through Grant GBMF4554 to C.K., by the US National Science Foundation (CCF-1256087, CCF-1319998) and by the US National Institutes of Health (R01GM122935). This work was partially funded by The Shurl and Kay Curci Foundation. This project is funded, in part, under a grant (#4100070287) with the Pennsylvania Department of Health. The Department specifically disclaims responsibility for any analyses, interpretations or conclusions. We would also like to thank Natalie Sauerwald, Dr. Guillaume Marçais, Xiangrui Zeng and Dr. Jose Lugo-Martinez for insightful comments on the manuscript.