## Abstract

**Background**
Insertions and deletions (indels) account for more nucleotide differences between two related DNA sequences than substitutions do, and thus it is imperative to develop a stochastic evolutionary model that enables us to reliably calculate the probability of sequence evolution through indel processes. Currently, most such probabilistic models are based on either hidden Markov models (HMMs) or transducer theories, both of which give the indel component of the probability of a given sequence alignment as a product of either probabilities of column-to-column transitions or block-wise contributions along the alignment. These models, however, have two fundamental problems: (1) it is unclear how they are related to any *genuine* evolutionary model, which describes the stochastic evolution of an *entire* sequence along the time axis; and (2) they cannot fully accommodate biologically realistic features, such as overlapping indels, power-law indel-length distributions, and indel rate variation across regions.
**Results**
Here, we theoretically tackle the *ab initio* calculation of the probability of a given sequence alignment under a *genuine* evolutionary model, more specifically, a general continuous-time Markov model of the evolution of an *entire* sequence via insertions and deletions. Our model allows general indel rate parameters, including length distributions, and does not impose any unrealistic restrictions on indels. Using techniques of perturbation theory in physics, we expand the probability into a series over different numbers of indels. This perturbation expansion provides a concise version of Feller’s theorem (1940), which underpins the authenticity of the widely used stochastic evolutionary simulation method by Gillespie (1977). We find a sufficient and nearly necessary set of conditions under which the probability can be expressed as the product of an overall factor and the contributions from regions separated by gapless columns of the alignment. The indel models satisfying these conditions include those with some kind of rate variation across regions, as well as space-homogeneous models. We also prove, though with a caveat, that pairwise probabilities calculated by the method of Miklós et al. (2004) are equivalent to those calculated by our *ab initio* formulation, at least under a space-homogeneous model.
**Conclusions**
Our *ab initio* perturbative formulation provides a firm theoretical ground that other indel models can rest on.
[This paper and three other papers (Ezawa, Graur and Landan 2015a,b,c) describe a series of our efforts to develop, apply, and extend the *ab initio* perturbative formulation of a general continuous-time Markov model of indels.]
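
As a rough illustration of the Gillespie-type simulation mentioned in the Results, the following minimal Python sketch samples one indel history of a sequence under a toy continuous-time Markov model. It is not the formulation developed in this paper: the function name `simulate_indels`, the per-site rates `ins_rate` and `del_rate`, the geometric indel-length distribution with mean `mean_len`, and the truncation of deletions at the sequence end are all assumptions made only for this example. Only the sequence length is tracked, so the positions of indels are ignored.

```python
import random


def simulate_indels(length, t_end, ins_rate, del_rate, mean_len=2.0, seed=0):
    """Sample one indel history with a Gillespie-style algorithm.

    Toy assumptions (illustrative only, not taken from the paper):
      * insertions occur at each of the length + 1 inter-site positions
        at rate ins_rate, deletions at each of the length sites at del_rate;
      * indel lengths are geometric with mean mean_len (mean_len >= 1);
      * a deletion longer than the current sequence is truncated.
    Returns the final length and the list of (time, kind, size) events.
    """
    rng = random.Random(seed)
    p_stop = 1.0 / mean_len  # success probability of the geometric law
    t = 0.0
    events = []
    while True:
        r_ins = ins_rate * (length + 1)
        r_del = del_rate * length
        total = r_ins + r_del  # exit rate of the current state
        if total == 0.0:
            break  # no event can ever occur
        t += rng.expovariate(total)  # exponential waiting time
        if t > t_end:
            break
        # Draw the indel length from the geometric distribution.
        size = 1
        while rng.random() > p_stop:
            size += 1
        # Choose the event type in proportion to its total rate.
        if rng.random() * total < r_ins:
            length += size
            events.append((t, "ins", size))
        else:
            size = min(size, length)
            length -= size
            events.append((t, "del", size))
    return length, events


if __name__ == "__main__":
    final_length, history = simulate_indels(100, 1.0, 0.05, 0.05, seed=1)
    print(final_length, len(history))
```

Loosely speaking, a sampled history containing N events corresponds to one contribution to the N-indel term of the series over different numbers of indels; the rates here depend on the state only through the sequence length, which is a simplification chosen for brevity.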