Perturbative formulation of general continuous-time Markov model of sequence evolution via insertions/deletions, Part III: Algorithm for first approximation

Kiyoshi Ezawa; Dan Graur; Giddy Landan

doi:10.1101/023614

Abstract

Background Insertions and deletions (indels) account for more nucleotide differences between two related DNA sequences than substitutions do, and thus it is imperative to develop a stochastic evolutionary model that enables us to reliably calculate the probability of sequence evolution through indel processes. In a separate paper (Ezawa, Graur and Landan 2015a), we established an ab initio perturbative formulation of a continuous-time Markov model of the evolution of an entire sequence via insertions and deletions. We also showed that, under a certain set of conditions, the ab initio probability of an alignment can be factorized into the product of an overall factor and contributions from regions (or local alignments) separated by gapless columns. Moreover, in another separate paper (Ezawa, Graur and Landan 2015b), we performed concrete perturbation analyses on all types of local pairwise alignments (PWAs) and some typical types of local multiple sequence alignments (MSAs). The analyses indicated that even the fewest-indel terms alone can quite accurately approximate the probabilities of local alignments, as long as the segments and the branches in the tree are of modest lengths.

Results To examine whether or not the fewest-indel terms alone can well approximate the alignment probabilities of more general types of local MSAs as well, and as a first step toward the automatic application of our ab initio perturbative formulation, we developed an algorithm that calculates the first approximation of the probability of a given MSA under a given parameter setting including a phylogenetic tree. The algorithm first chops the MSA into gapped and gapless segments, second enumerates all parsimonious indel histories potentially responsible for each gapped segment, and finally calculates their contributions to the MSA probability. We performed validation analyses using more than ten million local MSAs. The results indicated that even the first approximation can quite accurately estimate the probability of each local MSA, as long as the gaps and tree branches are at most moderately long.

Conclusions The newly developed algorithm, called LOLIPOG, brought our ab initio perturbative formulation at least one step closer to a practically useful method to quite accurately calculate the probability of an MSA under a given biologically realistic parameter setting.

[This paper and three other papers (Ezawa, Graur and Landan 2015a,b,c) describe a series of our efforts to develop, apply, and extend the ab initio perturbative formulation of a general continuous-time Markov model of indels.]

Introduction

The evolution of DNA, RNA, and protein sequences is driven by mutations such as base substitutions, insertions and deletions (indels), recombination, and other genomic rearrangements (e.g., Graur and Li 2000; Gascuel 2005; Lynch 2007). Thus far, analyses on substitutions have predominated in the field of molecular evolutionary study, in particular using the probabilistic (or likelihood) theory of substitutions that is now widely accepted (e.g., Felsenstein 1981, 2004; Yang 2006). However, some recent comparative genomic analyses have revealed that indels account for more base differences between the genomes of closely related species than substitutions (e.g., Britten 2002; Britten et al. 2003; Kent et al. 2003; The International Chimpanzee Chromosome 22 Consortium 2004; The Chimpanzee Sequencing and Analysis Consortium 2005). It is therefore imperative to develop a stochastic model that enables us to reliably calculate the probability of sequence evolution via mutations including insertions and deletions.

Since the groundbreaking works by Bishop and Thompson (1986) and by Thorne, Kishino and Felsenstein (1991), there have been many efforts to calculate the alignment probabilities under the probabilistic models aiming to incorporate the effects of indels. Over the past few decades, such methods have greatly improved in terms of the computational efficiency and the scope of application (see, e.g., Rivas 2005; Bradley and Holmes 2007; Miklós et al. 2009). However, these methods, mostly based on hidden Markov models (HMMs) or transducer theories, have two fundamental problems, one regarding the theoretical grounds and the other regarding the biological realism. (See the “background” section in part I (Ezawa, Graur and Landan 2015a) for more details on these problems.)

To solve these two problems, we chose to base our study on an indel evolutionary model that is devoid of the problems from the beginning. The model we chose was a genuine stochastic evolutionary model, more specifically, a general continuous-time Markov model of the evolution of an entire sequence via indels along the time-axis. The model allows any indel rate parameters including length distributions, but it does not impose any unnatural restrictions on indels. In part I of this series of studies (Ezawa, Graur and Landan 2015a), we established an ab initio perturbative formulation of the general continuous-time Markov model. We showed that, when the indel rate parameters satisfy a certain set of conditions, the ab initio probability of an alignment can be factorized into the product of an overall factor and contributions from regions (or local alignments) separated by gapless columns. In part II (Ezawa, Graur and Landan 2015b), we concretely calculated the fewest-indel contributions and the next-fewest-indel contributions to the probability of each local alignment, among all types of local pairwise alignments (PWAs) and some typical types of local multiple sequence alignments (MSAs). Our perturbation analyses indicated that even the fewest-indel contribution can approximate the probability of each local alignment quite accurately, as long as the local alignment is not too long and the branch lengths are at most moderately long. We are confident that this conclusion should be quite general on the local PWAs, because we exhausted all possible types of homology structures (Lunter et al. 2005). However, in order to claim that the conclusion holds generally also on local MSAs, we need a more extensive analysis, by exploring most of the local MSA patterns we could encounter in practical evolutionary processes.

For this purpose, in this study, we developed an algorithm to calculate such a “first-approximate” probability for an input MSA, under a given parameter setting including a phylogenetic tree. To validate our algorithm and the conclusion in part II, we conducted some simulation analyses. Using a genuine molecular evolution simulator, Dawg (Cartwright 2005), we created more than ten million local MSAs and counted the absolute frequency of, as well as the relative frequencies of ancestral states for, each local gap configuration. We used these frequencies as the “correct answers” to be compared to the first-approximate probabilities calculated only from the contributions by the fewest-indel histories. The results indicated that the conclusion in part II seems to hold for a more general set of local MSAs, and thus they demonstrated the use of the first-approximate probabilities under modest settings.

In Results, we describe the results of our validation analyses. In Discussion, we will discuss some possible improvements and applications of our theory and algorithm. The topics include the risks associated with the naïve application of our algorithm to reconstructed alignments. The Methods section details the algorithms and analyses. Subsection M1 of Methods describes our algorithm to calculate the first approximation of the probability of a given MSA. Subsection M2 of Methods describes the details on our validation analyses.

This paper is part III of a series of our papers that documents our efforts to develop, apply, and extend the ab initio perturbative formulation of the general continuous-time Markov model of sequence evolution via indels. Part I (Ezawa, Graur and Landan 2015a) gives the theoretical basis of this entire study. Part II (Ezawa, Graur and Landan 2015b) describes concrete perturbation calculations and examines the applicable ranges of other probabilistic models of indels. Part III (this paper) describes our algorithm to calculate the first approximation of the probability of a given MSA and simulation analyses to validate the algorithm. Finally, part IV (Ezawa, Graur and Landan 2015c) discusses how our formulation can incorporate substitutions and other mutations, such as duplications and inversions.

This paper basically uses the same conventions as part I (Ezawa, Graur and Landan 2015a). See its Section 2 for details if necessary. As in part I, the following terminology is used. The term “an indel process” means a series of successive indel events with both the order and the specific timings specified, and the term “an indel history” means a series of successive indel events with only the order specified. Throughout this paper, the union symbol, such as in A ∪ B and ⋃_{i=1}^{I} A_i, should be regarded as the union of mutually disjoint sets (i.e., those satisfying A ∩ B = ∅ and A_i ∩ A_j = ∅ for i ≠ j (∈ {1,…, I}), respectively, where ∅ is an empty set), unless otherwise stated.

Results

In Subsection 1.2 of part II (Ezawa, Graur and Landan 2015b), we saw that, as long as the indel lengths and the branch lengths are at most moderate, the contributions from the fewest-indel histories alone can well approximate the multiplication factors for any local gap configurations in PWAs. And, in Subsection 1.3 of part II, we saw that this is also the case with some typical gap configurations in MSAs. In MSAs, however, there could be many patterns of gap configurations, in addition to those examined in Subsection 1.3 of part II. Thus, to examine whether or not the contributions by the fewest-indel histories can in general well approximate the multiplication factors for local gap-configurations of MSAs, we conducted simulation analyses.

First, we developed an algorithm that performs the following series of three processes (Figure 1 A). (i) First, it partitions a given MSA into an alternating series of gapped and gapless segments. (ii) Second, it enumerates the fewest-indel local histories (i.e., the parsimonious local indel histories) giving rise to each of the gapped segments. (iii) Third, it calculates the “fewest-indel approximation” of the multiplication factor for each gapped segment (Eq.(1.1.2a) of part II) by summing the contributions from all the fewest-indel local histories. The absolute probability of the given MSA is approximated by the product of the probability (given by Eq.(4.2.9b) of part I (Ezawa, Graur and Landan 2015a)) that a reference root state is kept throughout the tree T, and the approximate multiplication factors for the gapped segments (calculated in parts I & II). (Because the algorithm only enumerates the fewest-indel histories, it ignores “null local indel histories” that leave no traces in the MSA, which were discussed in Subsection 3.3 of part I.) As a by-product, the algorithm also calculates the relative probabilities among the fewest-indel local histories that can give rise to each gapped segment. For details of the algorithm, see Methods M1 and Figures 1-6. The algorithm is currently implemented only under Dawg’s indel model (Cartwright 2005; see also Eqs.(2.4.4a,b,c) of part I), and the indel length distributions can be chosen from power-law and geometric distributions. We provided the current implementation of the algorithm in a prototype package named LOLIPOG (log-likelihood for the pattern of gaps), which we made available at the FTP repository of the Bioinformatics Organization (Ezawa 2013).
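The way process (iii) combines its ingredients can be illustrated with a minimal Python sketch. It is not the LOLIPOG implementation: the function name, the argument layout, and all numerical values are hypothetical, and the root-state probability and the per-history contributions are simply taken as inputs rather than computed from an indel model.

```python
import math

def first_approximation(log_p_root_kept, contributions_per_segment):
    """Combine fewest-indel contributions into an approximate MSA log-probability.

    log_p_root_kept: log of the probability that the reference root state
        is kept throughout the tree (an input here, not computed).
    contributions_per_segment: one list per gapped segment, holding the
        contribution of each fewest-indel local history (hypothetical values).
    """
    log_prob = log_p_root_kept
    relative = []
    for contributions in contributions_per_segment:
        # Fewest-indel approximation of the multiplication factor:
        # the sum over all fewest-indel local histories of this segment.
        factor = sum(contributions)
        log_prob += math.log(factor)
        # By-product: relative probabilities among those histories.
        relative.append([c / factor for c in contributions])
    return log_prob, relative

# Two gapped segments: the first has two parsimonious histories, the second one.
log_prob, relative = first_approximation(math.log(0.5), [[0.02, 0.06], [0.1]])
```

Working in log space keeps the product over many gapped segments numerically stable, which matters once an MSA contains more than a handful of them.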

Figure 1. Overall workflow in our algorithm to calculate the MSA probability.

The entire algorithm consists of steps (ia), (ib), (ic), (ii) and (iii), processing the input (o) into the final output at step (iv). (A) The flowchart. (B) The schematic illustration of the pre-processing steps (ia-ic). The input data [(o)] consists mainly of an MSA (of DNA sequences here) and a phylogenetic tree of the aligned sequences (labeled with boldface numbers). An evolutionary model via indels is assumed to be given but is omitted here. Step (ia) reduces the input MSA to a binary 1/0 pattern, with 1 and 0 representing the “presence” (of a residue) and the “absence” (i.e., a gap), respectively. Step (ib) decomposes the binary pattern into “gap-pattern blocks,” or “blocks” for short, each of which consists of contiguous columns of a given 1/0 pattern. Here each block is represented as a rectangular array of neighboring cells with a particular color. Step (ic) sorts the blocks into gapless segments (each represented as contiguous blue cells enclosed by a blue rectangle labeled B_k (with k = 0,1, 2)) and gapped segments K (each represented as contiguous cells enclosed by a red rectangle labeled (with K = 1, 2)). See M1.1 (in Methods) for more details. [NOTE: The set of all gapped segments, , is a subset of , which is the set of all regions that can accommodate local indel histories along the tree.]
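The pre-processing steps (ia) to (ic) can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the MSA is a list of equal-length strings with "-" as the gap character, no column consists entirely of gaps, and the function names and the toy alignment are hypothetical.

```python
from itertools import groupby

def to_binary(msa):
    """Step (ia): 1 for a residue ("presence"), 0 for a gap ("absence")."""
    return [[0 if ch == "-" else 1 for ch in row] for row in msa]

def to_blocks(binary):
    """Step (ib): group contiguous columns sharing the same 1/0 pattern.

    Each block is (pattern, start, end) with columns start..end-1.
    """
    columns = list(zip(*binary))
    blocks, start = [], 0
    for pattern, run in groupby(columns):
        length = len(list(run))
        blocks.append((pattern, start, start + length))
        start += length
    return blocks

def to_segments(blocks):
    """Step (ic): sort blocks into alternating gapless and gapped segments."""
    segments = []
    for is_gapped, run in groupby(blocks, key=lambda b: 0 in b[0]):
        segments.append(("gapped" if is_gapped else "gapless", list(run)))
    return segments

msa = ["ACG--TAC",
       "ACGTTT-C",
       "A--TTTAC"]
blocks = to_blocks(to_binary(msa))
segments = to_segments(blocks)
kinds = [kind for kind, _ in segments]
```

For this toy alignment the result is three gapless segments alternating with two gapped segments, the first of which contains two gap-pattern blocks.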

Figure 2. Merging indel events in effectively contiguous gap-pattern blocks.

In each panel, given a gapped segment consisting of contiguous gap-pattern blocks (“block”s), and a phylogenetic tree of aligned sequences (left), the Dollo parsimonious history for each block is first inferred (middle), then the indel histories in the effectively contiguous blocks are merged if they are of the same type and occur along the same branch (right). As in Figure 1 B, a “1” and a “0” represent the presence state (i.e., a residue) and the absence state (i.e., a gap), respectively. Note that each column under the “Blocks” (left) represents a gap-pattern block, and not necessarily a single column, in the MSA. In the indel histories in the middle step, “+x” and “-y” represent the insertion of block “x” and the deletion of block “y”, respectively. In the local indel histories in the final step (on the right), blocks in the same parentheses after the “+” or the “-” sign, respectively, are inserted or deleted simultaneously. (A) Merging indel events in literally contiguous blocks. (B) Merging indel events in two blocks separated by a (run of) block(s) in which no downstream nodes with the “presence” state interrupt the merger. (C) In this case, the deletions of block a and block c, both along the exterior branch leading to sequence 1, cannot be merged because they are interrupted by the downstream node with the “presence” state (the red “1”) in block b.
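The per-block inference in the middle step can be illustrated with a textbook Dollo parsimony sketch, in which a block is gained at most once and may be lost several times. This is an assumption-laden illustration, not the procedure of Methods M1: the tree encoding (a dict from internal nodes to children), the node names, and the function name are hypothetical, and the subsequent merging of events across blocks is not attempted.

```python
def dollo_history(children, root, present_leaves):
    """Infer one Dollo-parsimonious history for a single gap-pattern block.

    children: dict mapping each internal node to its child nodes
        (a hypothetical tree encoding; leaves have no entry).
    Returns a list of events ("+", node) or ("-", node), each occurring
    along the branch that leads to `node`.
    """
    count = {}  # number of "present" leaves under each node

    def fill(node):
        kids = children.get(node, [])
        if kids:
            count[node] = sum(fill(k) for k in kids)
        else:
            count[node] = 1 if node in present_leaves else 0
        return count[node]

    total = fill(root)
    if total == 0:
        return []  # block absent everywhere: nothing to explain

    # Descend to the last common ancestor of all "present" leaves.
    origin = root
    while True:
        below = [k for k in children.get(origin, []) if count[k] == total]
        if not below:
            break
        origin = below[0]

    # A single insertion, unless the block was already present at the root.
    events = [] if origin == root else [("+", origin)]

    def walk(node):
        for k in children.get(node, []):
            if count[k] == 0:
                events.append(("-", k))  # the whole subtree lost the block
            else:
                walk(k)

    walk(origin)
    return events

# Tree ((1,2),(3,4)) with hypothetical internal node names.
children = {"root": ["n12", "n34"], "n12": ["1", "2"], "n34": ["3", "4"]}
insertion_only = dollo_history(children, "root", {"3", "4"})
deletion_only = dollo_history(children, "root", {"1", "3", "4"})
single_leaf = dollo_history(children, "root", {"4"})
```

In the first call the block is inserted along the branch leading to the ancestor of sequences 3 and 4; in the second it is present at the root and deleted along the branch leading to sequence 2.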

Figure 3. Looking for parsimonious local indel histories.

For the gap-configuration (under the “Blocks”) and the tree shown in (A), the initial step infers the history in (B), but there is actually another parsimonious history (C). For the segment and the tree shown in (D), using the history in (E) as an “intermediate” point always reachable from the initial history, we can find the actual parsimonious history shown in (F). (G,H) A “branch-and-merge” operation performed on the situation in (D). (G) Looking closely at the indel history in (E), we see that a deletion of a subsequence in block b occurs along the branch of the common ancestor of sequences 1 and 2. With this history as a starting point, in the “branching” step (H), the deletion is re-interpreted as deletions along the child branches. Finally, merging the resulting deletions with the effectively contiguous deletion(s) gives the local indel history in (F) in this example.

Figure 4. Sorting indel events that will undergo “branch-and-merge” processes.

(A) The initial local indel history (right), given a gapped segment (under the “Blocks”) and a sequence tree (left). (B) If the input tree is rooted, it gets unrooted. (C) Then, an insertion event (as in block b in this example) can be re-interpreted as a ‘deletion’ event by reversing the (virtual) time direction (represented by a blue arrow). Here, “+(b):(4)” denotes that block b was inserted into sequence 4, and “-(b):(1,2,3)” denotes that block b was deleted from the (‘last common ancestor’ of) sequences 1, 2, and 3. Similarly, “+(a):(3,4)” in the original history will also be re-interpreted as “-(a):(1,2).” (D) In this way, we can re-interpret all the indel events as ‘deletions’ (left), and sort them in descending order of the number of ‘deleted’ sequences (right).
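The re-interpretation and sorting in panels (C) and (D) can be sketched as follows. The event encoding (a sign, a tuple of blocks, and the set of sequences on the far side of the branch) and the function name are hypothetical choices made for this illustration; on an unrooted tree, an insertion affecting one side of a branch is treated as a 'deletion' affecting the complementary side.

```python
def as_sorted_deletions(events, all_sequences):
    """Re-interpret every indel event as a 'deletion' and sort the result.

    events: list of (sign, blocks, sequences), where `sequences` is the set
        of sequences that gained ("+") or lost ("-") the blocks.
    Returns (blocks, 'deleted' sequences) pairs in descending order of the
    number of 'deleted' sequences.
    """
    everything = frozenset(all_sequences)
    deletions = []
    for sign, blocks, sequences in events:
        affected = frozenset(sequences)
        if sign == "+":
            # Reverse the virtual time direction: the complement 'loses' it.
            affected = everything - affected
        deletions.append((blocks, affected))
    return sorted(deletions, key=lambda d: len(d[1]), reverse=True)

# "+(b):(4)" and "+(a):(3,4)" from the caption, plus a hypothetical "-(c):(1)".
events = [("-", ("c",), {1}), ("+", ("a",), {3, 4}), ("+", ("b",), {4})]
deletions = as_sorted_deletions(events, {1, 2, 3, 4})
```

With four sequences, "+(b):(4)" becomes "-(b):(1,2,3)" and "+(a):(3,4)" becomes "-(a):(1,2)", and the events are returned with the largest 'deleted' set first.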

Figure 5. Composite “branch-and-merge” operation: schematic illustrations.

Panel (A) partially shows an input local indel history. The red tree on the left shows the “presence” and “absence” states (the solid and open red circles, respectively), as well as a single ‘deletion’ event (yellow lightning bolt) along branch e_o, on gap-pattern block c. The symbols up(e_o) and lw(e_o) label the nodes at the ‘upper-end’ and the ‘lower-end,’ respectively, of branch e_o. The dashed lines ‘above’ up(e_o) are the remaining part of the tree, whose details do not matter here. The array of “0” and “1” on the right briefly represents the pairwise alignment of the sequences at nodes up(e_o) and lw(e_o). We assume that gap-pattern block c (in red), which concerns us the most, is effectively flanked by blocks b and e. The block d was skipped because it has the “absence” state at nodes up(e_o) and lw(e_o), as well as at all nodes in the ‘downstream’ of them (not shown). (B, C) Shown on the left are the input indel history on block c and those on the effectively flanking blocks b and e. On the right is the local indel history on the entire segment consisting of blocks b, c and e, superimposed by the history on block c (in red). Both of the histories are after a composite “branch-and-merge” operation on the ‘deletion’ of block c along branch e_o. At each node, (w,…,z) represents that blocks w,…,z are “present.” Along each branch, +(x,…, y) denotes that blocks x,…, y are ‘inserted,’ and −(x,…, y) denotes that blocks x,…, y are ‘deleted.’ (B) The ‘deletions’ involving effectively flanking blocks, b and e, along branches ‘under’ branch e_o (black-contoured lightning bolts), can ‘delete’ all the sequences ‘under’ e_o (left). In this case, the total number of indels reduces by 1 after the “branch-and-merge” operation (right), providing a local history that truly replaces the old candidate histories. (C) In this case, to ‘delete’ all the sequences ‘under’ branch e_o, the ‘deletions’ involving effectively flanking blocks, b and e, are not enough (left). Actually, an additional ‘deletion’ is necessary (“-(c)” along a branch of the tree in the right dashed box). Thus, in this case, the total number of indels does not change after the “branch-and-merge” operation (right), and the resulting local history joins the set of current candidate histories.

Figure 6. ‘Bottom-up’ algorithm to search for minimal set of ‘deletion’ events in composite “branch-and-merge” operation.

This schematic illustration uses the input local indel history on the left of panel (C) in Figure 5 as an example. (A) The algorithm starts from the external branches ‘under’ branch e_o, and goes upwards until it reaches e_o (upward dashed arrows). Here, the branches are numbered from 1 to 6 (in order of being processed), to facilitate the explanation. Each branch is assigned a flanking ‘deletion’ status, “R”, “L”, “RL”, or “” (nothing), in parentheses after a colon, indicating that the branch undergoes (a) ‘deletion(s)’ involving the right-“flanking” block (b in Figure 5), the left-“flanking” block (e in Figure 5), both blocks, or none, respectively. (B) Annotating the branches. Each branch is classified as “directly absorbable” (D), if it itself is not assigned the status “” (branches 1, 2, and 4 in this example); it is classified as “indirectly absorbable” (I), if it is assigned the status “” and all of its child branches are classified as “directly” or “indirectly” “absorbable” (branch 5); or otherwise, it is classified as “non-absorbable” (N) (branches 3 and 6). The set of numbers in braces after the D/I symbol on each branch represents the minimal set of directly absorbable branches the ‘deletions’ along which can ‘delete’ all the sequences ‘under’ the branch in question. Branch 7 in panel C gives a more complex example of an “indirectly absorbable” branch. Branch 7 in panel D is a more complex example of a “non-absorbable” branch. (E,F,G) The algorithm finishes on branch e_o by giving the minimum number of additional ‘deletion’ events (min_AD (e_o)), as well as the minimum set of ‘deletions’ that jointly substitute for the original ‘deletion’ (MS_SD (e_o)). (E) When all ‘child’ branches are (“directly” or “indirectly”) “absorbable.” (F) When some ‘child’ branches are “absorbable” and others are “non-absorbable.” (G) When all ‘child’ branches are “non-absorbable.”
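The D/I/N annotation of panel (B) can be sketched as a post-order traversal. The sketch makes several assumptions: the topology below is a hypothetical one chosen only to be consistent with the classification stated in the caption (branches 1, 2, 4 as D; 5 as I; 3 and 6 as N), an external branch with the empty status is treated as non-absorbable, and the final computation of min_AD(e_o) and MS_SD(e_o) is not attempted.

```python
def classify(children, status, branch, result=None):
    """Annotate `branch` and all branches under it, bottom-up.

    children: dict mapping a branch to its child branches (hypothetical
        encoding; external branches have no entry).
    status: dict mapping a branch to "R", "L", "RL", or "" (nothing).
    Returns {branch: (label, minimal set of directly absorbable branches)}.
    """
    if result is None:
        result = {}
    kids = children.get(branch, [])
    for k in kids:
        classify(children, status, k, result)
    if status.get(branch, ""):
        # The branch itself carries a flanking 'deletion'.
        result[branch] = ("D", {branch})
    elif kids and all(result[k][0] in ("D", "I") for k in kids):
        # No 'deletion' of its own, but every child branch is absorbable.
        result[branch] = ("I", set().union(*(result[k][1] for k in kids)))
    else:
        result[branch] = ("N", set())
    return result

# Hypothetical topology under e_o: branch 5 above 1 and 2, branch 6 above 3 and 4.
children = {5: [1, 2], 6: [3, 4]}
status = {1: "R", 2: "L", 3: "", 4: "RL", 5: "", 6: ""}
labels = {}
for child_of_e_o in (5, 6):
    classify(children, status, child_of_e_o, labels)
```

Branch 5 inherits the set {1, 2} from its children, whereas branch 6 is non-absorbable because branch 3 carries no flanking 'deletion'.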

Second, to validate the component of the algorithm that enumerates all fewest-indel local histories potentially resulting in the gap configuration of a given gapped segment, we applied it to a set of simple MSAs, each accompanied by a phylogenetic tree of the sequences (Methods M2.1 and Figures 7-27). The MSAs and accompanying trees were chosen to extensively cover typical cases of the gap-configurations of the segments and local indel histories that can generate them. We manually confirmed that our implementation of the algorithm indeed enumerates all conceivable fewest-indel histories that can generate each of the gap-configurations, except for some complex cases that are expected to be very rare (Methods M2.1; see also Discussion).

Figure 7. Input data for validation of our indel parsimony algorithm.

Each of the panels, A through R, shows a two-component set of input data. One component is the gap-configuration of a gapped segment in an MSA, consisting of contiguous gap-pattern blocks (columns of “0”s denoting gaps and/or “1”s denoting residues, each enclosed by a dashed rectangle labeled with a bold italic letter). The other is a phylogenetic tree of aligned sequences (labeled with bold Arabic numerals).

Figure 8. Local indel histories output by our parsimony algorithm, given input in Figure 7 A.

The output is schematically illustrated using the phylogenetic tree of the aligned sequences. At each external node (labeled with a bold Arabic numeral), or at each internal node (unlabeled), “(x,…,z)” denotes a set of blocks that are “present” in the corresponding (existing or ancestral) sequence. In particular, the “()” at a node represents the situation where all relevant blocks are “absent” from the corresponding sequence. In a red box, “+(s,…,t)” denotes an insertion of the subsequence consisting of blocks s,…, and t, and “-(u,…,v)” denotes a deletion of the subsequence consisting of blocks u,…, and v. A red arrow points to the branch along which the indel event occurred. (Nominal) time is supposed to run from left to right.