Abstract
The finite state projection (FSP) approach to solving the chemical master equation (CME) has enabled successful inference of discrete stochastic models to predict single-cell gene regulation dynamics. Unfortunately, the FSP approach is computationally intensive for all but the simplest models, an issue that becomes highly problematic when parameter inference and uncertainty quantification require enormous numbers of parameter evaluations. To address this issue, we propose two new computational methods for the Bayesian inference of stochastic gene expression parameters given single-cell experiments. First, we present an adaptive scheme to improve parameter proposals for Metropolis-Hastings sampling using full FSP-based likelihood evaluations. We then formulate and verify an Adaptive Delayed Acceptance Metropolis-Hastings (ADAMH) algorithm for use with reduced Krylov-basis projections of the FSP. We test and compare both algorithms on three example models and simulated data to show that the ADAMH scheme achieves substantial speedup in comparison to the full FSP approach. By reducing the computational costs of parameter estimation, we expect the ADAMH approach to enable efficient data-driven estimation for more complex gene regulation models.
Introduction
An important goal of quantitative biology is to elucidate and predict the mechanisms of gene expression. Evidence increasingly suggests that gene expression processes are inherently stochastic with substantial cell-to-cell variability.1–3 In an isogenic population with the same environmental factors, many of these fluctuations can be attributed to intrinsic chemical noise, which is captured well by the chemical master equation (CME).4 Predictive models for gene expression dynamics can be identified by fitting the solution of the CME to the empirical histogram of single-cell data at several experimental conditions or time-points.5–8
The finite state projection (FSP),9 which approximates the dynamics of the CME with a finite system of linear ODEs, provides a framework to analyze full distributions of stochastic gene expression models with computable error bounds. It has been observed that the full distribution-based analyses using the FSP perform well, even when applied to realistically small experimental datasets on which summary statistics-based fits may fail.10 On the other hand, the FSP requires solving a large system of ODEs that grows quickly with the complexity of the gene expression network under consideration. Our present study borrows from model reduction strategies in other complex systems fields to alleviate this issue by reducing the computational cost of FSP-based parameter estimation.
There has been intensive research on efficient computational algorithms to quantify the uncertainty in complex models.11 A particularly promising approach is to utilize multifidelity algorithms to systematically approximate the original system response. In these approximations, surrogate models or meta-models allow for various degrees of model fidelity (e.g., error compared to the exact model) in exchange for reductions in computational cost. Surrogate models generally fall into two categories: response surface and low-fidelity models.12,13 We will focus on the second category, which consists of reduced-order systems that approximate the original high-dimensional dynamical system using either simplified physics or projections onto reduced-order subspaces.11,14,15 Reduced-order modeling has already begun to appear in the context of stochastic gene expression. When all model parameters are known, the CME can be reduced by system-theoretic methods,16,17 sparse-grid/aggregation strategies,18,19 tensor train representations20–22 and hierarchical tensor formats.23 Model reduction techniques have also been applied to parameter optimization by Waldherr and Haasdonk,24 who projected the CME onto a linear subspace spanned by a reduced basis, and Liao et al.,25 who approximated the CME with a Fokker-Planck equation that was projected onto the manifold of low-rank tensors.26 While these previous works clearly show the promise of reduced-order modeling, there remains a vast reservoir of ideas from the broader computational science and engineering community that has yet to be adapted to the quantitative analysis of stochastic gene expression.
In this paper, we introduce two efficient algorithms, which are based on the templates of the adaptive Metropolis algorithm27 and the delayed acceptance Metropolis-Hastings (DAMH28,29) algorithm, to sample the posterior distribution of gene expression parameters given single-cell data. The adaptive Metropolis approach automatically tunes parameter proposal distributions to more efficiently search spaces of unnormalized and correlated parameters. The DAMH provides a two-stage sampling approach that uses a cheap approximation to the posterior distribution at the first stage to quickly filter out unlikely parameters. Improvements to the DAMH allow algorithmic parameters to be updated adaptively and automatically by the DAMH chain.30,31 The DAMH has been applied to the inference of stochastic chemical kinetics parameters from time-course data.32 Our algorithm is a modified version of DAMH that is specifically adapted to improve Bayesian inference from population snapshots of single-cell data, such as data arising from flow cytometry or fixed-cell microscopy experiments. We employ parametric reduced order models using Krylov-based projections,33,34 which give an intuitive means to compute expensive FSP-based likelihood evaluations.35,36 To improve the accuracy and the DAMH acceptance rate, we allow the reduced model to be refined during parameter space exploration. The resulting method, which we call the ADAMH-FSP-Krylov algorithm, is tested on three common gene expression models. We also provide a theoretical guarantee and numerical demonstrations that the proposed algorithms converge to equivalent target posterior distributions.
The organization of the paper is as follows. We review the background on the FSP analysis of single-cell data, and basic Markov chain Monte Carlo (MCMC) schemes in the Background section. In the Materials and Methods section, we introduce our method to generate reduced FSP models, as well as our way of monitoring and refining their accuracy. These reduced models give rise to an approximation to the true likelihood function, which is then employed to devise an Adaptive Delayed Acceptance Metropolis-Hastings with FSP-Krylov reduced models (ADAMH-FSP-Krylov). We make simple adjustments to the existing ADAMH variants in the literature to prove convergence, and we give the mathematical details in the supplementary materials. We provide empirical validation of our methods on three gene expression models, and we compare the efficiency and accuracy of the approaches in the Numerical Results section. Interestingly, we find empirically that the reduced model learned through the ADAMH run could fully substitute the original FSP model in a Metropolis-Hastings run without incurring a large difference in the sampling results. Finally, we conclude with a discussion of future work and the potential of computational science and engineering tools to analyze stochastic gene expression.
Background
Stochastic modeling of gene expression and the chemical master equation
Consider a well-mixed biochemical system with N ≥ 1 different chemical species that are interacting via M ≥ 1 chemical reactions. Assuming constant temperature and volume, the time-evolution of this system can be modeled by a continuous-time Markov process.4 The state space of the Markov process consists of integral vectors x ≡ (x1,…,xN)T, where xi is the population of the ith species. Each reaction channel, such as the transcription of an RNA species, is characterized by a stoichiometric vector νj (j = 1,…, M) that represents the change when the reaction occurs; if the system is in state x and reaction j occurs, then the system transitions to state x + νj. Given x(t) = x, the propensity αj(x; θ)dt determines the probability that reaction j occurs in the next infinitesimal time interval [t, t + dt), where θ is the vector of model parameters.
Since the state space is discrete, we can index the states as x1,…,xn,… The time-evolution of the probability distribution of the Markov process is the solution of the linear system of differential equations known as the chemical master equation (CME):
$$\frac{d}{dt}p(t) = A(\theta)\,p(t), \quad p(0) = p_0, \qquad (1)$$
where the probability mass vector p = (p1,p2,…)T is such that each component, pi = P(t, xi) = Prob{x(t) = xi}, describes the probability of being at state xi at time t, for i = 1,…, n. The vector p0 = p(0) is an initial probability distribution and A(θ) is the infinitesimal generator of the Markov process. Here, we have made explicit the dependence of A on the model parameter vector θ, which is often inferred from experimental data.
Finite State Projection
The state space of the CME could be infinite or extremely large. To alleviate this problem, the finite state projection (FSP)9 was introduced to truncate the state space to a finite size. In the simplest FSP formulation, the state space is restricted to a hyper-rectangle
$$\{0,\ldots,n_1\} \times \cdots \times \{0,\ldots,n_N\}, \qquad (2)$$
where the nk are the maximum copy numbers of the chemical species.
The infinite-dimensional matrix A and vector p in eq. (1) are replaced by the corresponding submatrix and subvector. When the bounds nk are chosen sufficiently large and the propensities satisfy some regularity conditions, the gap between the FSP and the original CME is negligible and computable.9,37 Throughout this paper, we assume that the bounds nk have been chosen appropriately and that the FSP serves as a high-fidelity model of the gene expression dynamics of interest. Our goal is to construct lower-fidelity models of the FSP using model order reduction and incorporate these reduced models in the uncertainty analysis for gene expression parameters.
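As an illustration of the truncation described above, the following is a minimal sketch for a one-species birth-death model (production at rate `k_prod`, degradation at rate `k_deg * x`). The model, the function names, and the uniformization-based solver are assumptions chosen for a self-contained example; they are not the solver used in this paper. The probability mass that leaks out of the truncated state space serves as the computable FSP error.

```python
import numpy as np

def birth_death_generator(k_prod, k_deg, n_max):
    """FSP generator for 0 -> X (rate k_prod) and X -> 0 (rate k_deg * x),
    truncated to the states x = 0, ..., n_max."""
    n = n_max + 1
    A = np.zeros((n, n))
    for x in range(n):
        # Keep the full outflow rate on the diagonal, even for the jump that
        # leaves the truncated set; the mass lost this way is the FSP error.
        A[x, x] = -(k_prod + k_deg * x)
        if x + 1 < n:
            A[x + 1, x] = k_prod          # production: x -> x + 1
        if x > 0:
            A[x - 1, x] = k_deg * x       # degradation: x -> x - 1
    return A

def expm_action(A, t, p0, tol=1e-13, max_terms=100000):
    """Approximate exp(t A) p0 by uniformization (Poisson-weighted powers)."""
    lam = np.max(-np.diag(A))             # uniformization rate
    P = np.eye(A.shape[0]) + A / lam      # entrywise nonnegative matrix
    weight = np.exp(-lam * t)             # Poisson(lam * t) weight for k = 0
    term = p0.astype(float)
    result = weight * term
    total_weight = weight
    for k in range(1, max_terms):
        if 1.0 - total_weight < tol:      # remaining Poisson mass is negligible
            break
        term = P @ term
        weight *= lam * t / k
        result += weight * term
        total_weight += weight
    return result

k_prod, k_deg, n_max, t = 10.0, 1.0, 60, 1.0
A = birth_death_generator(k_prod, k_deg, n_max)
p0 = np.zeros(n_max + 1)
p0[0] = 1.0                               # start with zero molecules
p = expm_action(A, t, p0)
fsp_error = 1.0 - p.sum()                 # probability mass outside the box
mean_copy_number = np.arange(n_max + 1) @ p
print(fsp_error, mean_copy_number)
```

For this model the exact solution started from zero molecules is a Poisson distribution with mean (k_prod/k_deg)(1 − exp(−k_deg t)), which gives a simple check on the truncated solution.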
Bayesian inference from single-cell data
Data from smFISH experiments5,8,38,39 consist of several snapshots of many independent cells taken at discrete times t1,…,tT. The snapshot at time ti records gene expression in ni cells, each of which can be collected in the data vector cj,i, j = 1,…, ni of molecular populations in cell j at time ti. Let p(t, x|θ) denote the entry of the FSP solution corresponding to state x at time t, with model parameters θ. The FSP-based approximation to the log-likelihood of the data set 𝓓 given parameter vector θ is given by
$$L(\mathcal{D}\mid\theta) = \sum_{i=1}^{T}\sum_{j=1}^{n_i} \log p(t_i, c_{j,i}\mid\theta). \qquad (3)$$
It is clear that when the FSP solution converges to the true solution of the CME, the FSP-based log-likelihood converges to the true data log-likelihood. The posterior distribution of model parameters θ given the data set 𝓓 then takes the form
$$f_{\mathrm{posterior}}(\theta\mid\mathcal{D}) \propto \exp\big(L(\mathcal{D}\mid\theta)\big)\, f_0(\theta),$$
where L(𝓓|θ) is the FSP-based log-likelihood and f0 is the prior density that quantifies prior knowledge and beliefs about the parameters. When f0 is a constant, the parameters that maximize the posterior density are equivalent to the maximum likelihood estimator. However, we also want to quantify our uncertainty regarding the accuracy of the parameter fit, and the MCMC framework provides a way to address this by sampling from the posterior distribution.
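The log-likelihood and unnormalized log-posterior described above can be sketched as follows. The data layout (one probability vector and one integer array of observed copy numbers per time point) and the log-uniform box prior are assumptions made for this one-species illustration; all function names are hypothetical.

```python
import numpy as np

def fsp_log_likelihood(distributions, snapshots):
    """Sum of log p(t_i, c_{j,i} | theta) over all time points i and cells j.

    distributions[i] is a 1-D array of model probabilities at time t_i;
    snapshots[i] is an integer array with the copy number seen in each cell.
    """
    total = 0.0
    for p, cells in zip(distributions, snapshots):
        total += np.sum(np.log(p[cells]))
    return total

def log_posterior(theta, log_likelihood, lower, upper):
    """Unnormalized log-posterior under an assumed log-uniform prior on a box."""
    theta = np.asarray(theta, dtype=float)
    if np.any(theta < lower) or np.any(theta > upper):
        return -np.inf                       # outside the prior support
    log_prior = -np.sum(np.log(theta))       # log-uniform density is prop. to 1/theta
    return log_likelihood(theta) + log_prior

# Toy data: one time point, three cells with 0, 0, and 1 molecules.
distributions = [np.array([0.5, 0.25, 0.25])]
snapshots = [np.array([0, 0, 1])]
ll = fsp_log_likelihood(distributions, snapshots)
lp_inside = log_posterior([2.0], lambda th: ll, lower=1.0, upper=10.0)
lp_outside = log_posterior([20.0], lambda th: ll, lower=1.0, upper=10.0)
print(ll, lp_inside, lp_outside)
```

In a real inference run, `log_likelihood(theta)` would solve the FSP for each proposed parameter vector, which is the expensive step that the reduced models are meant to avoid.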
For convenience, we limit our current discussion to models and inference problems that have the following characteristics:
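The matrix A(θ) can be decomposed into
$$A(\theta) = \sum_{j=1}^{M} g_j(\theta)\,A_j, \qquad (4)$$
where gj are continuous functions and Aj are independent of the parameters.
The support of the prior is contained in a bounded domain of the form
$$\Theta = [\theta_1^{\min}, \theta_1^{\max}] \times \cdots \times [\theta_d^{\min}, \theta_d^{\max}]. \qquad (5)$$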
The first assumption means that the CME matrix depends “linearly” on the parameters, ensuring the efficient assembly of the parameter-dependent matrix. In particular, the factors Aj can be computed and stored in the offline phase before parameter exploration, and only a few (sparse) matrix additions are required to compute A(θ) in the online phase. When the dependence on the parameters is nonlinear, more sophisticated methods such as the Discrete Empirical Interpolation method40 could be applied, but we leave this development for future work in order to focus more on the parameter sampling aspect. Nevertheless, condition (4) covers an important class of models, including all models defined by mass-action kinetics. The second assumption means that the support of the posterior distribution is a bounded and well-behaved domain (in mathematical terms, a compact set). This allows us to derive convergence theorems more straightforwardly. In practice, condition (5) is not a severe restriction since it can be interpreted as the prior belief that physical parameters cannot assume infinite values.
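The offline/online split implied by the first assumption can be sketched for a birth-death model with mass-action kinetics, where the generator is a linear combination of two parameter-independent factors. The model and function names are assumptions for illustration, and dense arrays are used only for brevity; a practical implementation would store the factors as sparse matrices.

```python
import numpy as np

def birth_death_factors(n_max):
    """Offline phase: parameter-independent factors of A = k_prod*A1 + k_deg*A2."""
    n = n_max + 1
    A1 = np.zeros((n, n))     # production, 0 -> X
    A2 = np.zeros((n, n))     # degradation, X -> 0
    for x in range(n):
        A1[x, x] = -1.0
        if x + 1 < n:
            A1[x + 1, x] = 1.0
        A2[x, x] = -float(x)
        if x > 0:
            A2[x - 1, x] = float(x)
    return [A1, A2]

def assemble(theta, factors, g=lambda theta: theta):
    """Online phase: A(theta) = sum_j g_j(theta) * A_j (g is the identity here)."""
    coeffs = g(np.asarray(theta, dtype=float))
    return sum(c * Aj for c, Aj in zip(coeffs, factors))

factors = birth_death_factors(n_max=30)   # computed once, before sampling
A = assemble([10.0, 1.0], factors)        # cheap, repeated for every proposal
print(A[:3, :3])
```

Only the weighted sum in `assemble` is repeated for each new parameter vector, which is what keeps the per-proposal assembly cost low.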
The Metropolis-Hastings and the adaptive Metropolis algorithms
The Metropolis-Hastings (MH) Algorithm41,42 is one of the most popular methods to sample from a multivariate probability distribution (Algorithm 1). The basic idea of the MH is to generate a Markov chain whose limiting distribution is the target distribution. To do so, the algorithm includes a probabilistic acceptance/rejection step. More precisely, let f denote the target probability density. Assume the chain is at state θi at step i. Let θ′ be a proposal from the pre-specified proposal density q(.|θi). The MH computes an acceptance probability of the form
$$\alpha(\theta_i,\theta') = \min\left\{1, \frac{f(\theta')\,q(\theta_i\mid\theta')}{f(\theta_i)\,q(\theta'\mid\theta_i)}\right\}$$
to decide whether to accept θ′ as the next state of the chain. If θ′ is rejected, the algorithm moves on to the next iteration with θi+1 := θi.
There could be many choices for the proposal density q (for example, see the survey of Roberts and Rosenthal43). We will consider only the symmetric case where q is a Gaussian, that is,
$$q(\theta'\mid\theta) \propto \exp\left(-\tfrac{1}{2}(\theta'-\theta)^T\,\Sigma^{-1}(\theta'-\theta)\right),$$
where Σ is a positive definite matrix that determines the covariance of the proposal distribution. With this choice, the MH reduces to the original Metropolis Algorithm.41 For gene expression models, the MH has been combined with the FSP for parameter inference and model selection in several studies.8,10
The appropriate choice of Σ is crucial for the performance of the Metropolis algorithm. Haario et al.27 propose an Adaptive Metropolis (AM) algorithm in which the proposal covariance Σ is updated at every step using the values visited by the chain. This is the version that we will implement for sampling the posterior distribution with the full FSP model. In particular, if θ1,…, θi are the samples accepted so far, the AM updates the proposal covariance using the formula
$$\Sigma_i = \begin{cases}\Sigma_0, & i \le n_0,\\ s_d\,\mathrm{Cov}(\theta_1,\ldots,\theta_i), & i > n_0.\end{cases}$$
Here, the function Cov returns the sample covariance matrix. The constant sd is assigned the value $(2.4)^2/d$, where d is the number of parameters, following Haario et al.27 The matrix Σ0 is an initial choice for the Gaussian proposal density, and n0 is the number of initial steps without proposal adaptations. Using the adaptive Metropolis allows for a more efficient search over unnormalized and correlated parameter spaces and eliminates the need for the user to manually tune the algorithmic parameters. In the numerical results that we will show, the adaptive Metropolis results in reasonable acceptance rates (19% – 23.4%). Although non-adaptive MH algorithms have been considered in the past,8,10 to the best of our knowledge, this is the first adaptive MH algorithm to be proposed for Bayesian inference of gene expression models.
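The adaptive Metropolis scheme described above can be sketched on a toy two-dimensional Gaussian target. This is an illustrative sketch rather than the implementation used in this paper: the function name, the toy target, and the small `eps` regularization added to the sample covariance are assumptions, and the covariance is recomputed from the whole chain at every step for clarity rather than updated recursively.

```python
import numpy as np

def adaptive_metropolis(log_target, theta0, n_iter, sigma0, n0=100, eps=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    d = len(theta0)
    s_d = 2.4 ** 2 / d                       # scaling constant of Haario et al.
    chain = np.empty((n_iter + 1, d))
    chain[0] = theta0
    log_f = log_target(chain[0])
    n_accept = 0
    for i in range(n_iter):
        if i < n0:
            cov = sigma0                     # no adaptation during the first n0 steps
        else:
            # Sample covariance of the chain so far; eps * I is an assumed
            # regularization that keeps the proposal covariance positive definite.
            cov = s_d * np.cov(chain[: i + 1].T) + s_d * eps * np.eye(d)
        proposal = rng.multivariate_normal(chain[i], cov)
        log_f_prop = log_target(proposal)
        # Symmetric proposal: accept with probability min(1, f(proposal)/f(current)).
        if np.log(rng.uniform()) < log_f_prop - log_f:
            chain[i + 1], log_f = proposal, log_f_prop
            n_accept += 1
        else:
            chain[i + 1] = chain[i]
    return chain, n_accept / n_iter

# Toy target: zero-mean Gaussian with correlation 0.8 between the two parameters.
target_cov = np.array([[1.0, 0.8], [0.8, 1.0]])
precision = np.linalg.inv(target_cov)
log_target = lambda th: -0.5 * th @ precision @ th
chain, acc_rate = adaptive_metropolis(log_target, np.zeros(2), 5000, 0.1 * np.eye(2))
print(acc_rate, np.cov(chain[1000:].T))
```

In the setting of this paper, `log_target` would be the FSP-based log-posterior evaluated in log10 parameter space.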
Materials and Methods
Delayed acceptance Metropolis-Hastings algorithm
Previous applications of the MH to gene expression have required $10^4$ to $10^6$ or more iterations per combination of model and data set,10 and computational cost is a significant issue when sampling from a high-dimensional distribution whose density is expensive to evaluate. A practical rule of thumb for balancing between exploration and exploitation for a MH algorithm with the Gaussian proposal is to have an acceptance rate close to 0.234, which was derived by Roberts et al.44 as the asymptotically optimal acceptance rate for random walk MH algorithms. Assuming the proposal density of Algorithm 1 is tuned to have an acceptance rate of approximately 23.4%, one could achieve significant improvement to computation time if one can quickly screen out the remaining rejected proposals without evaluating the expensive posterior density.
The delayed acceptance Metropolis-Hastings (DAMH)28 seeks to alleviate the computational burden of rejections in the original MH by employing a rejection step based on a cheap approximation to the target density (cf. Algorithm 2). Specifically, let f(.) be the density of the target distribution of the parameter θ. Let $\tilde f_{\theta}(\cdot)$ be a cheap state-dependent approximation to f. At iteration i, let θ′ be a proposal from the current parameter θi using a pre-specified proposal density q(.|.). The DAMH promotes θ′ as a potential candidate for acceptance with probability
$$\alpha(\theta_i,\theta') = \min\left\{1, \frac{q(\theta_i\mid\theta')\,\tilde f_{\theta'}(\theta_i)}{q(\theta'\mid\theta_i)\,\tilde f_{\theta_i}(\theta')}\right\}.$$
If θ′ fails to be promoted, the algorithm moves on to the next iteration with θi+1 := θi. If θ′ passes the first inexpensive check, then a second acceptance probability is computed using the formula
$$\beta(\theta_i,\theta') = \min\left\{1, \frac{q^{*}(\theta_i\mid\theta')\,f(\theta')}{q^{*}(\theta'\mid\theta_i)\,f(\theta_i)}\right\}, \quad q^{*}(\theta'\mid\theta_i) = \alpha(\theta_i,\theta')\,q(\theta'\mid\theta_i),$$
and the DAMH algorithm accepts θ′ for the next step with probability β. In this manner, substantial computational savings can be expected if unlikely proposals are quickly rejected in the first step, leaving only the most promising candidates for careful evaluation in the second step. Christen and Fox show that the DAMH converges to the target distribution under conditions that are easily met in practice.28 However, the quality of the approximation affects the overall efficiency. Poor approximations lead to many false promotions of parameters that are rejected at the expensive second step. On the other hand, the first step may falsely reject parameters that could have been accepted using the accurate log-likelihood evaluation. This has led to subsequent developments that seek appropriate approximations and ways to adapt these approximations to improve the performance of DAMH in specific applications.30,45 Specifically, the adaptive DAMH variant in Cui et al., 201445 formulates the approximation $\tilde f$ via reduced basis models that can be updated on the fly using samples accepted by the chain. The adaptive version in Cui et al., 2011,30 allows adaptations for the proposal density and the error model, with convergence guarantees.31 We will borrow these elements in the sampling scheme that we introduce below. However, the stochastic gene expression models that we investigate here differ from the models studied in those previous contexts, since our likelihood function incorporates intrinsic discrete state variability instead of external Gaussian noise.
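The two-stage accept/reject logic described above can be sketched for the special case of a symmetric Gaussian proposal and an approximation that does not depend on the current state. Under these assumptions the first-stage probability reduces to min{1, f̃(θ′)/f̃(θ)} and the second-stage probability to min{1, [f(θ′) f̃(θ)] / [f(θ) f̃(θ′)]}. The one-dimensional toy target, the deliberately imperfect approximation, and all names are hypothetical.

```python
import numpy as np

def damh(log_f, log_f_approx, theta0, n_iter, step, seed=0):
    """Two-stage delayed-acceptance sampler with a symmetric Gaussian proposal
    and a state-independent approximation of the target density."""
    rng = np.random.default_rng(seed)
    theta = float(theta0)
    lf, la = log_f(theta), log_f_approx(theta)
    samples = np.empty(n_iter)
    n_full = 0                                   # expensive evaluations used
    for i in range(n_iter):
        prop = theta + step * rng.standard_normal()
        la_prop = log_f_approx(prop)
        # Stage 1: screen the proposal with the cheap approximation only.
        if np.log(rng.uniform()) < la_prop - la:
            lf_prop = log_f(prop)
            n_full += 1
            # Stage 2: correct for the approximation error with the full density.
            if np.log(rng.uniform()) < (lf_prop - lf) - (la_prop - la):
                theta, lf, la = prop, lf_prop, la_prop
        samples[i] = theta
    return samples, n_full

log_f = lambda x: -0.5 * x ** 2                  # "expensive" target: N(0, 1)
log_f_approx = lambda x: -0.5 * x ** 2 / 1.2     # cheap surrogate with wrong variance
samples, n_full = damh(log_f, log_f_approx, 0.0, 20000, step=2.4)
print(n_full, samples.mean(), samples.var())
```

Even though the surrogate has the wrong variance, the second stage corrects for it, so the samples still target N(0, 1); the surrogate only affects how many expensive evaluations are spent.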
Reduced-order models for the FSP dynamics
Projection-based model reduction
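We approximate the full parameter-dependent FSP dynamics,
$$\frac{d}{dt}p(t) = A(\theta)\,p(t), \quad p(0) = p_0,$$
with a sequence of reduced-order dynamics,
$$\frac{d}{dt}q^{(i)}(t) = B^{(i)}(\theta)\,q^{(i)}(t), \quad t\in[t_{i-1},t_i], \qquad (7)$$
$$q^{(i)}(t_{i-1}) = (\Phi^{(i)})^T\,\Phi^{(i-1)}\,q^{(i-1)}(t_{i-1}), \quad q^{(1)}(0) = (\Phi^{(1)})^T p_0, \qquad (8)$$
where $B^{(i)}(\theta) = (\Phi^{(i)})^T A(\theta)\,\Phi^{(i)}$.
Here, i = 1,…, nB indexes the user-specified subintervals $[t_{i-1}, t_i]$ with t0 = 0. Each matrix $\Phi^{(i)} \in \mathbb{R}^{n\times r_i}$, $r_i \le n$, has orthonormal columns that span the subspace onto which we project the full dynamics. Equation (8) implies that the solution at a previous time interval will be projected onto the subspace of the next interval. While this introduces some extra errors, subdividing the long time interval helps to reduce the subspace dimensions for systems with complicated dynamics. Given an ordered set of reduced bases $\{\Phi^{(i)}\}_{i=1}^{n_B}$, the approximations to the full distributions are given by
$$p(t) \approx p_{\Phi}(t) = \Phi^{(i)}\,q^{(i)}(t), \quad t\in[t_{i-1},t_i].$$
Under assumption (4), the reduced system matrices B(i)(θ) in eq. (7) can be decomposed as
$$B^{(i)}(\theta) = \sum_{j=1}^{M} g_j(\theta)\,B_j^{(i)},$$
where $B_j^{(i)} = (\Phi^{(i)})^T A_j\,\Phi^{(i)}$ and M is the number of terms in the decomposition (4). This decomposition allows us to assemble the reduced systems quickly with $O(M r_i^2)$ complexity.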
We build the reduced basis for the parameter-dependent dynamics by concatenation (see, e.g., Benner et al.15). Specifically, we assume that for any fixed parameter θ, we can construct a set of orthogonal basis matrices $V^{(i)}(\theta)$, i = 1,…, nB. We can sample different bases from a finite set of ‘training’ parameters θ1,…, θntrain. Then, through the iterative updates
$$\Phi^{(i,1)} = V^{(i)}(\theta_1), \quad \Phi^{(i,j)} = \text{Gram-Schmidt}\big(\Phi^{(i,j-1)},\, V^{(i)}(\theta_j)\big), \quad j = 2,\ldots,n_{train},$$
we obtain the bases Φ(i) = Φ(i,ntrain) that provide global approximations for the full dynamical system across the parameter domain. The operation Gram-Schmidt means that the columns in V(i)(θj) are orthogonalized against the columns in Φ(i,j−1) to produce a new matrix with orthonormal columns.
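The concatenation step described above can be sketched as follows. As a labeled substitution, this sketch replaces column-by-column Gram-Schmidt with a projection followed by an SVD; the resulting basis spans the same subspace but also drops directions that are numerically already contained in the current basis. The tolerance and function names are assumptions, and the first local basis is assumed to have orthonormal columns.

```python
import numpy as np

def extend_basis(Phi, V, tol=1e-10):
    """Append to Phi the directions of V that are not already in span(Phi)."""
    R = V - Phi @ (Phi.T @ V)          # remove components along the current basis
    R = R - Phi @ (Phi.T @ R)          # second pass for numerical orthogonality
    U, s, _ = np.linalg.svd(R, full_matrices=False)
    return np.hstack([Phi, U[:, s > tol]])

def build_global_basis(local_bases, tol=1e-10):
    """Concatenate the bases V(theta_1), ..., V(theta_ntrain) one at a time."""
    Phi = local_bases[0]
    for V in local_bases[1:]:
        Phi = extend_basis(Phi, V, tol)
    return Phi

# Two local bases in R^5 that share one direction exactly and one partially.
I = np.eye(5)
V1 = I[:, :2]
V2 = np.column_stack([(I[:, 0] + I[:, 2]) / np.sqrt(2.0), I[:, 1]])
Phi = build_global_basis([V1, V2])
print(Phi.shape)
```

Only one new column is added in this toy case, because the second local basis contributes a single direction that the first one does not already span.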
Krylov subspace approximation for single-parameter model reduction
Consider a fixed parameter combination θ. Let the time points 0 < t1 < … < tB = tf be given. Using a high-fidelity solver, we can compute the full solution at those time points, and we let pi denote the full solution at time ti. Our aim is to construct a sequence of orthogonal matrices V(i) ≡ V(i)(θ) with i = 1… B such that the full model dynamics at the parameter θ on the time interval [ti−1,ti] is well-approximated by a projected reduced model on the span of V(i).
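A simple and effective way to construct the reduced bases is to choose V(i) as the orthogonal basis of the Krylov subspace
$$\mathcal{K}_{m_i}\big(A(\theta), p_{i-1}\big) = \mathrm{span}\left\{p_{i-1},\, A(\theta)\,p_{i-1},\, \ldots,\, A(\theta)^{m_i-1}p_{i-1}\right\}.$$
In order to determine the subspace dimension mi, we use the error series derived by Saad33 which we reproduce here using our notation as
$$\exp\big(\tau_i A(\theta)\big)\,p_{i-1} - \beta_i V^{(i)}\exp\big(\tau_i H^{(i)}\big)\,e_1 = \beta_i\,h^{(i)}_{m_i+1,m_i}\sum_{k=1}^{\infty} e_{m_i}^T\,\varphi_k\big(\tau_i H^{(i)}\big)\,e_1\;\tau_i^{k}\,A(\theta)^{k-1}\,v^{(i)}_{m_i+1}, \qquad (14)$$
with $\tau_i = t_i - t_{i-1}$ and $\beta_i = \|p_{i-1}\|_2$.
Here, $h^{(i)}_{m_i+1,m_i}$ and $v^{(i)}_{m_i+1}$ are the outputs at step mi of the Arnoldi procedure (Algorithm 10.5.1 in Golub and Van Loan46) to build the orthogonal matrix V(i), where $\varphi_k(X) = \sum_{j=0}^{\infty} X^{j}/(j+k)!$ for any square matrix X. The matrix H(i) = (V(i))T A(θ)V(i) is the state matrix of the reduced-order system obtained via projecting A onto the Krylov subspace Kmi. The terms $e_{m_i}^T\varphi_k(\tau_i H^{(i)})\,e_1$ can be computed efficiently using Expokit (Theorem 1, Sidje34). We use the Euclidean norm of the first term of the series (14) as an indicator for the model reduction error. Given an error tolerance εKrylov, we iteratively construct the Krylov basis V(i) with increasing dimension until the error per unit time step of the reduced model falls below the tolerance, that is,
$$\frac{1}{\tau_i}\left\|\beta_i\,h^{(i)}_{m_i+1,m_i}\,\tau_i\,e_{m_i}^T\,\varphi_1\big(\tau_i H^{(i)}\big)\,e_1\,v^{(i)}_{m_i+1}\right\|_2 \le \varepsilon_{Krylov}. \qquad (15)$$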
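A minimal sketch of a Krylov projection for the matrix exponential is given below. It uses a textbook Arnoldi iteration and a small dense exponential written with plain NumPy instead of Expokit, so it illustrates the idea rather than the solver used in this paper. The error indicator is assumed to be the norm of the leading term of Saad's series, which under the usual Arnoldi relations equals β h_{m+1,m} τ |e_m^T φ_1(τH) e_1|; the test matrix (a truncated birth-death generator) and all names are hypothetical.

```python
import numpy as np

def arnoldi(A, v, m):
    """Arnoldi with modified Gram-Schmidt: returns V_m, H_m, h_{m+1,m}, beta."""
    n = v.size
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    beta = np.linalg.norm(v)
    V[:, 0] = v / beta
    for j in range(m):
        w = A @ V[:, j]
        for i in range(j + 1):
            H[i, j] = V[:, i] @ w
            w = w - H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-14:              # invariant subspace found
            return V[:, : j + 1], H[: j + 1, : j + 1], 0.0, beta
        V[:, j + 1] = w / H[j + 1, j]
    return V[:, :m], H[:m, :m], H[m, m - 1], beta

def small_expm(M, terms=20):
    """Dense matrix exponential by scaling and squaring with a Taylor series."""
    norm = np.linalg.norm(M, 1)
    s = max(0, int(np.ceil(np.log2(max(norm, 1e-16)))) + 1)
    Ms = M / 2.0 ** s                        # scaled so that its 1-norm is <= 0.5
    E = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms + 1):
        term = term @ Ms / k
        E = E + term
    for _ in range(s):
        E = E @ E
    return E

def krylov_expm(A, v, tau, m):
    """Approximate exp(tau*A) v in an m-dimensional Krylov subspace."""
    V, H, h_next, beta = arnoldi(A, v, m)
    k = H.shape[0]
    # exp([[tau*H, e1], [0, 0]]) holds exp(tau*H) and phi_1(tau*H) e1 in one call.
    aug = np.zeros((k + 1, k + 1))
    aug[:k, :k] = tau * H
    aug[0, k] = 1.0
    E = small_expm(aug)
    approx = beta * V @ E[:k, 0]
    indicator = beta * h_next * tau * abs(E[k - 1, k])
    return approx, indicator

# Test problem: birth-death generator on 50 states, started from zero molecules.
n = 50
x = np.arange(n)
A = np.diag(-(10.0 + x)) + np.diag(10.0 * np.ones(n - 1), -1) + np.diag(x[1:].astype(float), 1)
v = np.zeros(n)
v[0] = 1.0
tau = 0.1
approx, indicator = krylov_expm(A, v, tau, m=20)
exact = small_expm(tau * A) @ v
error = np.linalg.norm(approx - exact)
print(error, indicator)
```

In an adaptive setting one would increase `m` until `indicator / tau` falls below the chosen tolerance, reusing the Arnoldi vectors already computed.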
Adaptive Delayed Acceptance Metropolis with reduced-order models of the CME
The approximate log-likelihood formula
The reduced bases described above allow us to find reduced-cost approximations p ≈ pΦ to the full FSP dynamics. We can then approximate the full log-likelihood of single-cell data in equation (3) by the reduced-model-based log-likelihood
$$\tilde L(\mathcal{D}\mid\theta) = \sum_{i=1}^{T}\sum_{j=1}^{n_i}\log\max\big(\varepsilon_s,\; p_{\Phi}(t_i, c_{j,i}\mid\theta)\big), \qquad (16)$$
where εs is a small constant, chosen to safeguard against undefined values. We need to include εs in our approximation since the entries of the reduced-order approximation are not guaranteed to be positive (not even in exact arithmetic). We aim for the approximation to be accurate for parameters θ with high posterior density, and crude for those with low density, which should be visited rarely by the Monte Carlo chain.
One can readily plug the approximation (16) into the DAMH algorithm. Since the safeguard εs keeps the approximate likelihood strictly positive for all θ in the support of the prior, the chain will eventually converge to the target posterior distribution (Theorem 1 in Christen and Fox,28 and Theorem 3.2 in Efendiev et al.29). On the other hand, a major problem with the DAMH is that the computational efficiency depends on the quality of the reduced basis approximation. Crude models result in high rejection rates at the second stage, thus increasing sample correlation and computation time. Therefore, it is advantageous to fine-tune the parameters of the algorithm and update the reduced models adaptively to ensure a reasonable acceptance rate. This motivates the adaptive version of the DAMH that we discuss next.
Delayed acceptance posterior sampling with infinite model adaptations
We propose an adaptive version of the DAMH for sampling from the posterior density of the CME parameters given single-cell data (Algorithm 3). We have borrowed elements from the adaptive DAMH algorithms in Cui et al.30,45 The first step proposal uses an adaptive Gaussian similar to the adaptive Metropolis of Haario et al.,27 where the covariance matrix is updated at every step from the samples accepted so far. Here, we generate the proposals in log10 space.
The reduced bases are updated as the chain explores the parameter domain. Instead of using a finite adaptation criterion to stop model adaptation as in Cui et al.,45 we introduce an adaptation probability with which the reduced basis updates are considered. This means that there could be an infinite number of model adaptations that occur with diminishing probability as the chain progresses. This idea is taken from the “doubly-modified example” in Roberts and Rosenthal.47 The advantage of the probabilistic adaptation criterion is that it allows us to prove ergodicity for the adaptive algorithm. The mathematical proofs are presented in the Appendix.
The adaptation probability a(i) is chosen to converge to 0 as the chain iteration index i increases. In particular, we use
$$a(i) = 2^{-i/I_0},$$
where I0 is a user-specified constant. This formula means that the probability for an adaptation to occur decreases by half after every I0 chain iterations. In addition, we further restrict the adaptation to occur only when the error indicator is above a threshold at the proposed parameters. As a consequence of our model updating criteria, the reduced-order bases will be selected at points that are close to the support of the target posterior distribution.
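The adaptation rule described above can be sketched in a few lines. The closed form a(i) = 2^(−i/I0) is an assumption consistent with the statement that the probability halves after every I0 iterations and starts at one; the function names and the way the error indicator is passed in are hypothetical.

```python
import random

def adaptation_probability(i, i0):
    """a(i) = 2^(-i / I0): halves after every I0 chain iterations."""
    return 2.0 ** (-i / i0)

def should_update_basis(i, i0, error_indicator, delta, rng=random):
    """Consider an update with probability a(i), and only carry it out if the
    reduced model is not accurate enough at the proposed parameters."""
    return rng.random() < adaptation_probability(i, i0) and error_indicator > delta

for i in (0, 1000, 2000, 3000):
    print(i, adaptation_probability(i, i0=1000))
```

Because a(i) decays geometrically, updates concentrate in the early part of the chain, while the threshold test prevents updates where the reduced model is already accurate.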
Numerical Results
We conduct numerical tests on several stochastic gene expression models to study the performance of our proposed algorithms. The test platform is a desktop computer running Linux Mint and MATLAB 2017a, with 32 GB RAM and an Intel Core i7 3.4 GHz quad-core processor.
We compare three sampling algorithms:
Adaptive Metropolis-Hastings with full FSP-based likelihood evaluations (AMH-FSP): This version is an adaptation of the Adaptive Metropolis of Haario et al.,27 which updates the covariance of the Gaussian proposal density at every step. The algorithm always uses the FSP-based likelihood (3) to compute the acceptance probability, and it is solved using the Krylov-based Expokit.34 This is the reference algorithm by which we assess the accuracy and performance of the other sampling schemes. To the best of our knowledge, such an adaptive Metropolis scheme has not been used elsewhere for gene expression models.
Adaptive Delayed Acceptance Metropolis-Hastings with reduced FSP model constructed from Krylov subspace projections (ADAMH-FSP-Krylov): This is Algorithm 3 mentioned above. Similar to AMH-FSP, this algorithm uses a Gaussian proposal with an adaptive covariance matrix. However, it has a first-stage rejection step that employs the reduced model constructed adaptively using Krylov-based projection.
Adaptive Metropolis-Hastings with only reduced model-based likelihood evaluations (AMH-ROM): This is similar to AMH-FSP, but we instead use the approximate log-likelihood formula (16). The reduced model is constructed during the run of the ADAMH-FSP-Krylov, and therefore this variant can only be executed after the ADAMH-FSP-Krylov has terminated. We include this variant here in order to study the accuracy and potential speedup when leaving the acceptance/rejection decision fully to the reduced model.
We rely on two metrics for performance evaluation: total CPU time to finish each chain, and the multivariate effective sample size as formulated in Vats et al.48 Given samples θ1,…, θn, the multivariate effective sample size is estimated by
$$\mathrm{mESS} = n\left(\frac{|\Lambda_n|}{|\Sigma_n|}\right)^{1/d},$$
where Λn is an estimate of the posterior covariance using the sample covariance, Σn is the multivariate batch means estimator, and d is the number of parameters. An algorithm whose posterior distribution matches that of the full FSP implementation, but with a lower ratio of CPU time per (multivariate) effective sample, will be deemed more efficient. We use the MATLAB implementation by Luigi Acerbi49 for evaluating the effective sample size from the MCMC outputs.
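The multivariate effective sample size can be sketched with a simple batch means estimator. This is not the MATLAB implementation cited above; the batch size of ⌊√n⌋ and the function name are assumptions made for this illustration.

```python
import numpy as np

def multivariate_ess(samples):
    """mESS = n * (det(Lambda_n) / det(Sigma_n))^(1/d) with a batch-means Sigma_n."""
    samples = np.asarray(samples, dtype=float)
    n, d = samples.shape
    b = int(np.floor(np.sqrt(n)))            # batch size (assumed: sqrt(n))
    a = n // b                               # number of batches
    trimmed = samples[: a * b]
    lam = np.atleast_2d(np.cov(trimmed.T))   # sample covariance of the chain
    batch_means = trimmed.reshape(a, b, d).mean(axis=1)
    sigma = b * np.atleast_2d(np.cov(batch_means.T))
    _, logdet_lam = np.linalg.slogdet(lam)
    _, logdet_sigma = np.linalg.slogdet(sigma)
    return a * b * np.exp((logdet_lam - logdet_sigma) / d)

rng = np.random.default_rng(0)
n = 10000
iid = rng.standard_normal((n, 2))            # independent draws
ar = np.zeros((n, 2))                        # strongly autocorrelated chain
noise = rng.standard_normal((n, 2))
for t in range(1, n):
    ar[t] = 0.9 * ar[t - 1] + noise[t]
ess_iid = multivariate_ess(iid)
ess_ar = multivariate_ess(ar)
print(ess_iid, ess_ar)
```

Independent draws give an estimate close to n, whereas the autocorrelated chain gives a much smaller value, which is the behavior the CPU-time-per-effective-sample metric relies on.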
Implementation details
To achieve reproducible results for each example, we reset the random number generator to Mersenne Twister with seed 0 in MATLAB using the rng('default') command before simulating the single-cell observations with Gillespie’s Algorithm50 and running the ADAMH-FSP-Krylov and AMH-FSP chains. The random number generator is then reset to the 'default' state again before running the AMH-ROM chain.
Two-state gene expression
We first consider the common model of bursting gene expression39,51–54 with a gene that can switch between ON and OFF states and an RNA species that is transcribed when the gene is switched on (Table 1). We simulate data at ten equally spaced time points from 0.1 to 1 hour, with 200 independent observations per time point. The gene states are assumed to be unobserved. We generate the reduced bases on subintervals defined by the time points in the set Tbasis, where Δtdata = 0.1 hr and Δtbasis = 0.01 hr. Thus, Tbasis includes the observation times. We choose the basis update threshold as δ = $10^{-4}$. The prior distribution in our test is the log-uniform distribution on a rectangle, whose bounds are given in Table 2. The full FSP state space is chosen as
We choose a starting point for the sampling algorithms using five iterations of MATLAB’s genetic algorithm with a population size of 100, resulting in 600 full FSP evaluations. We then refine the output of the genetic algorithm with a local search using fmincon with a maximum of 1000 further evaluations of the full model. This is a negligible cost in comparison to the 10,000 iterations that we set for the sampling algorithms.
We summarize the performance characteristics of the sampling schemes in Table 3. The ADAMH-FSP-Krylov requires less computational time (Fig. 1) without a significant reduction in the multivariate effective sample size. In terms of computational time, the ADAMH-FSP-Krylov takes less time to generate an independent sample. This is partly explained by observing that the first stage of the scheme filters out many unlikely samples with the efficient approximation, resulting in 78.34% fewer full evaluations in the second stage (cf. Table 3).
We observe from the scatterplot of log-posterior values of the parameters accepted by the ADAMH-FSP-Krylov that the reduced model evaluations are very close to the FSP evaluations, with the majority of the approximate log-posterior values having a relative error below $10^{-4}$, with an average of $1.09 \times 10^{-6}$ and a median of $8.49 \times 10^{-8}$ across all 2152 accepted parameter combinations (Fig. 1 C). This accuracy is achieved with a reduced set of no more than 168 basis vectors per time subinterval that was built using solutions from only four sampled parameter combinations (Fig. 2). All the basis updates occur during the first tenth of the chain, and these updates consume less than one percent of the total chain runtime (Table 4).
From the samples obtained by the ADAMH-FSP-Krylov, we found that full and reduced FSP evaluations take approximately 0.25 and 0.09 seconds on average, allowing for a maximal speedup factor of approximately 100(0.25 – 0.09)/0.25 ≈ 65.73% for the current model reduction scheme. Here, the term reduced model refers to the final reduced model obtained from the adaptive reduced basis update of the ADAMH-FSP-Krylov. The speedup offered by the ADAMH-FSP-Krylov was found to be 100(2497.70 – 1424.32)/2497.70 ≈ 42.97%, or approximately two thirds of the maximal achievable improvement for the current model reduction scheme. To further investigate the speed and quality of the reduced model learned from the ADAMH-FSP-Krylov run, we performed another run of the adaptive Metropolis-Hastings algorithm with the log-likelihood evaluated solely using the reduced model constructed by the ADAMH-FSP-Krylov. Interestingly, we observe almost identical results using the reduced model alone in comparison to using the full model (Fig. 2 and Table 5), and the 65.03% reduction in computational effort matched the maximal estimated improvement very well.
A gene expression model with spatial components
We consider an extension of the previous model to distinguish between the nucleus and cytoplasmic compartments in the cell, similar to a stochastic model recently considered for MAPK-activated gene expression dynamics in yeast.10 The gene can transition between four states {0,1, 2, 3} with transcription activated when the gene state is in states 1 to 3. RNA is transcribed in the nucleus and later transported to the cytoplasm as a first order reaction.
These cellular processes and the degradation of RNA in both spatial compartments are modeled by a reaction network with six reactions and three species (Table 6).
We simulated a data set of 200 single-cell measurements at five equally-spaced time points between 2 min and 10 min, that is, Tdata = {2, 4, 6, 8, 10} (min). The time points for generating the basis are Tbasis = Tdata ∪ {j × 0.2 min, j = 1,…, 50}. We chose the basis update threshold as δ = $10^{-4}$. The prior distribution in our test is the log-uniform distribution on a rectangle, whose bounds are given in Table 7. The full FSP state space is chosen as
To find the starting point for the chains, we run five generations of MATLAB’s genetic algorithm (implemented in the function ga) with 600 full FSP evaluations. Then, we run another 500 steps of fmincon to refine the output of the ga solver. Using the parameter vector output by this combined optimization scheme as the initial sample, we run both the ADAMH-FSP-Krylov and the AMH-FSP for 10,000 iterations.
The acceleration obtained by using the reduced model is quite evident, with the ADAMH generating an effective sample about twice as fast as the AMH (Table 8). The log-posterior evaluations from the reduced model are accurate (Fig. 3 C and Table 9), with relative error below the algorithmic tolerance of 10⁻⁴, with a mean of 1.11 × 10⁻⁵ and a median of 6.98 × 10⁻⁶. This accurate model was built automatically by the ADAMH scheme using just 18 points in the parameter space (Fig. 4), resulting in a set of no more than 438 vectors per time subinterval. All the basis updates occur during the first fifth of the chain, and these updates consume about 11.25% of the total runtime (Table 10). The high accuracy of the posterior approximation translates into a very high second-stage acceptance of 96.15% of the proposals promoted by the first-stage reduced-model-based evaluation. Such high acceptance rates in the second stage are crucial to the efficiency of the delayed acceptance scheme, since almost all of the expensive FSP evaluations are accepted.30
The close agreement between the first and second stage of the ADAMH algorithm suggests that the reduced model constructed by ADAMH can provide a reliable substitute for the full model. Upon finishing the ADAMH chain, we run another chain with 10,000 iterations using only the reduced-model-based evaluations, where the reduced model is the final model output from the ADAMH-FSP-Krylov run. We observe that the marginal posterior distributions sampled from this chain are not markedly different from the results of the other two chains (see Fig. 4 for a representative example).
From the posterior samples of the ADAMH chain, we estimate that an average full FSP evaluation would take 1.31 seconds, while an average reduced model evaluation takes 0.30 seconds, leading to an average speedup (in terms of total CPU time) of approximately 77.35%. The comparative runtimes shown in Table 8 confirm this estimate, with the AMH-ROM taking about 76.58% less time than the AMH-FSP chain. The speedup of the ADAMH-FSP-Krylov was comparable at approximately 45.91%.
Genetic toggle switch
The final model we consider in our numerical tests is the nonlinear genetic toggle switch55 with the propensity functions listed in Table 11. We use the same parameters as those in Fox and Munsky.56 Using stochastic simulations with the ‘true’ parameters given in Table 12, we generate data at 2, 6 and 8 hours, each with 500 single-cell samples. To build the reduced bases for the FSP reduction, we use the union of ten equally-spaced points between zero and 8 hours and the time points of observations. The prior distribution in our test was chosen as the log-uniform distribution on a rectangle, whose bounds are given in Table 13. The full FSP state space is set as the rectangle {0,…, 100} × {0,…, 100}, corresponding to 10,201 states.
To find the starting point for the chains, we run five generations of MATLAB’s genetic algorithm with 600 full FSP evaluations. Then, we run another 1000 iterations of fmincon to refine the output of the ga solver. Using the parameter vector output by this combined optimization scheme as the initial sample, we run both the ADAMH-FSP-Krylov and the AMH-FSP for 100,000 iterations.
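The state count quoted for the rectangular truncation can be checked by direct enumeration. The sketch below is illustrative only: `rectangular_state_space` and `index_of` are hypothetical names, and the lexicographic ordering is an assumption rather than the ordering used by the authors.

```python
from itertools import product

def rectangular_state_space(max_counts):
    """Enumerate all states (x1, ..., xd) with 0 <= xi <= max_counts[i]."""
    return list(product(*(range(m + 1) for m in max_counts)))

states = rectangular_state_space((100, 100))

# Map each state to the position of its probability in the FSP vector.
index_of = {state: i for i, state in enumerate(states)}
print(len(states))
```

A lookup table of this kind is one simple way to place reaction propensities into the rows and columns of the truncated CME generator.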
The efficiency of the ADAMH-FSP-Krylov is confirmed in Table 14, where the delayed acceptance scheme is 37.16% faster than the AMH-FSP algorithm, compared to a maximum potential savings of 59.82% when exclusively using the reduced FSP model.
Similar to the last two examples, we observe a close agreement between the first and second stage of the ADAMH run, where 98.36% of the proposals promoted by the reduced-model-based evaluations are accepted by the full-FSP-based evaluation. This high second-stage acceptance rate is explained by the quality of the reduced model in approximating the log-posterior values (Fig. 5 C). We also ran another chain using the reduced model output by the ADAMH, which yields similar results to the reference chain (Fig. 6) but with reduced computational time (Table 14). The accurate reduced model consists of no more than 634 basis vectors per time subinterval, with all the basis updates occurring during the first tenth of the chain.
From the samples obtained by the ADAMH, we found that Expokit takes 0.42 sec to solve the full FSP model and 0.17 sec to solve the reduced model.
Discussion and concluding remarks
There is a clear need for efficient computational algorithms for the uncertainty analysis of gene expression models. In this work, we proposed and investigated new approaches for Bayesian parameter inference of stochastic gene expression parameters from single-cell data that employ adaptive tuning of proposal distributions in addition to delayed acceptance MCMC and reduced-order modeling. Numerical tests confirm that the reduced model can be used to significantly speed up the sampling process without incurring much loss in accuracy.
A surprising observation from our numerical results is that once trained, the reduced model constructed by the ADAMH-FSP-Krylov closely matches the original FSP sampling results. This suggests that the ADAMH-FSP-Krylov algorithm could be used as a data-driven method to learn reduced representations of the full FSP-based model, which could then be successfully substituted for the full FSP model in subsequent Bayesian updates. In other words, it could be equally accurate but more efficient to cease full FSP evaluations in the ADAMH scheme once confident about the accuracy of the reduced model. In our numerical tests, the ADAMH basis updates were completed within the first 10–20% of the MCMC chain, at which point the remaining chain could have been sampled using only the reduced model. Perhaps other approaches that substitute function approximations for the expensive likelihood evaluations57,58 could provide additional insights into the reduced order modeling approximations we have used.
While we have achieved a significant reduction in computational time with our implementation of the Krylov subspace projection, other model reduction algorithms may yet improve this performance.59 For example, the reduced models discovered here achieved levels of accuracy (i.e., relative errors of 10⁻⁸ or less) that are much higher than one would expect to be necessary to compare models in light of far less accurate data. In light of this finding and the fact that parameter discrimination can be achieved at different levels of accuracy for different combinations of models and data,60 we suspect that it could be advantageous to build less accurate models that can be evaluated in less time.
Our present work assumes that the full FSP-based solution can be computed, both to learn the reduced model bases and to evaluate the second stage likelihood in the ADAMH-FSP-Krylov algorithm. For many problems, the required FSP state space can be so large that it would be impossible even to keep the full model in computer memory. Representing the FSP model in a low-rank tensor format20 is a promising approach that we plan to investigate in order to overcome this limitation. Our current work has focused on using reduced models for uncertainty quantification, but the equally important task of finding optimal parameter fits should also benefit from reduced order modeling. For example, techniques from other engineering fields, such as trust-region methods,61 may provide valuable improvements for inferring stochastic models from gene expression data. A wealth of algorithms and insights remains to be gained by adapting computational technology from the broader computational science and engineering communities to analyze stochastic gene expression.
Acknowledgments
Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award numbers R25GM105608 and R35GM124747. The work reported here was partially supported by a National Science Foundation grant (DGE-1450032). Any opinions, findings, conclusions or recommendations expressed are those of the authors and do not necessarily reflect the views of the National Science Foundation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Appendix: Mathematical proofs
Preliminaries on adaptive MCMC algorithms
We will derive ergodicity results in the following sections based on Theorem 1 in the paper of Roberts and Rosenthal,47 and we will use some proof techniques of Theorem 1 from Cui et al.31 for part of our analysis. All random variables we will discuss below will be of the form X: Ω → 𝓧 where 𝓧 is a metric space with the associated Borel σ-algebra B(𝓧).
Let 𝓧 be the parameter space, assumed to have a metric space topology, and π: B(𝓧) → [0,1] the target distribution to be sampled from by an adaptive MCMC algorithm. We will assume that π has a density f: 𝓧 → [0, ∞). Let Kγ denote a transition kernel that depends on an adaptation index γ ∈ 𝓨, and assume that each Kγ has π as an invariant distribution. We assume that for each fixed γ, an MCMC algorithm with Kγ as the Markov transition kernel will eventually converge to π, that is, $\lim_{n \to \infty} \| K_\gamma^n(x, \cdot) - \pi(\cdot) \|_{TV} = 0$ for every $x \in \mathcal{X}$, where $\| \mu - \nu \|_{TV} = \sup_{B \in B(\mathcal{X})} |\mu(B) - \nu(B)|$ is the total variation distance between two probability measures on 𝓧.
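For intuition about the total variation distance, consider a finite state space, where the supremum over events can be evaluated by brute force and compared with the equivalent half-L1 formula. This is a small illustrative sketch; the function names and the two example distributions are made up for this purpose.

```python
from itertools import combinations

def tv_distance(mu, nu):
    """Total variation distance via the half-L1 formula (finite state space)."""
    return 0.5 * sum(abs(m - n) for m, n in zip(mu, nu))

def tv_distance_bruteforce(mu, nu):
    """Evaluate sup_B |mu(B) - nu(B)| by enumerating every subset B."""
    idx = range(len(mu))
    best = 0.0
    for k in range(len(mu) + 1):
        for subset in combinations(idx, k):
            gap = abs(sum(mu[i] for i in subset) - sum(nu[i] for i in subset))
            best = max(best, gap)
    return best

mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.3, 0.5]
print(tv_distance(mu, nu), tv_distance_bruteforce(mu, nu))
```

The supremum is attained on the set of states where one distribution exceeds the other, which is why the two functions agree.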
Let Xn be the random variable representing the state of the adaptive MCMC at iteration n, and let Γn be the random variable representing the choice of kernel for updating from Xn to Xn+1. The state of the algorithm is then modeled by the discrete-time stochastic process {(Xn, Γn)}, whose transition between steps is determined by the underlying rules of the algorithm. Finally, let 𝓖n = σ ({X0,…, Xn, Γ0,…, Γn}) denote the filtration generated by {(Xn, Γn)}. Thus, each Γn+1 is a 𝓖n+1 -measurable random variable.
Roberts and Rosenthal proved the following important result, which gives sufficient conditions for ergodicity of an adaptive MCMC.
Theorem 0.1 (Theorem 1 in Roberts and Rosenthal47). Consider an adaptive MCMC algorithm with state space 𝓧 and adaptation index 𝓨, with transition kernels Kγ, γ ∈ 𝓨. The algorithm is ergodic if the following conditions hold:
(Simultaneous uniform ergodicity) For every ε > 0, there exists N = N(ε) such that $\| K_\gamma^n(x, \cdot) - \pi(\cdot) \|_{TV} \le \varepsilon$ for every x ∈ 𝓧, γ ∈ 𝓨, and n > N.
(Diminishing adaptation) limn→∞ Dn = 0 in probability, where $D_n = \sup_{x \in \mathcal{X}} \| K_{\Gamma_{n+1}}(x, \cdot) - K_{\Gamma_n}(x, \cdot) \|_{TV}$ is a 𝓖n+1-measurable random variable.
We immediately get a useful corollary.
Corollary 0.2. Consider an adaptive MCMC with state space 𝓧 and transition kernels Kγ, γ ∈ 𝓨 that are ergodic w.r.t. π. Assume that the following conditions are satisfied:
(i) The algorithm satisfies the diminishing adaptation condition.
(ii) 𝓧 is a compact metric space.
(iii) $\mathcal{Y} = \mathcal{Y}_1 \cup \cdots \cup \mathcal{Y}_m$ where each 𝓨j is a compact metric space.
(iv) For each n = 1, 2,…, and on each set 𝓧 × 𝓨j with the product metric space topology, the mapping $(x, \gamma) \mapsto \| K_\gamma^n(x, \cdot) - \pi(\cdot) \|_{TV}$ is continuous.
Then, the adaptive MCMC algorithm is ergodic.
Our proof is a modification of the proof of Corollary 3 in Roberts and Rosenthal.47 Fix a number ε > 0 and an index j ∈ {1,…, m}. For each n, let $W_n^j$ be the set of all (x, γ) ∈ 𝓧 × 𝓨j such that $\| K_\gamma^n(x, \cdot) - \pi(\cdot) \|_{TV} < \varepsilon$.
Since each kernel is ergodic, for every (x, γ) ∈ 𝓧 × 𝓨j there exists some n such that (x, γ) ∈ $W_n^j$ and that $\| K_\gamma^{n'}(x, \cdot) - \pi(\cdot) \|_{TV} < \varepsilon$ for all n′ > n. We thus have $\mathcal{X} \times \mathcal{Y}_j = \bigcup_{n \ge 1} W_n^j$.
Due to continuity, each $W_n^j$ is an open set. By compactness, there exists a finite subcover $W_{n_1}^j, \ldots, W_{n_{r_j}}^j$ of 𝓧 × 𝓨j. Choose Nj(ε) to be the maximum of n1,…,nrj. Then, choosing N(ε) = N1(ε) + … + Nm(ε), we have $\| K_\gamma^n(x, \cdot) - \pi(\cdot) \|_{TV} < \varepsilon$ for all n > N(ε) and (x, γ) ∈ 𝓧 × 𝓨. Thus, simultaneous uniform ergodicity is satisfied. Combining with diminishing adaptation, the preceding theorem shows that the algorithm is ergodic. □
Convergence of adaptive DAMH with diminishing model adaptations
In this section, we analyze the convergence of an adaptive variant of the DAMH. As seen in the pseudocode of Algorithm 4, this variant modifies the approximation and the proposal density at every step, using the samples accepted so far on the chain. The update of the approximate model occurs randomly, with the update probability at step n pre-specified as a(n).
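To fix ideas before the formal analysis, the two-stage accept/reject mechanism underlying delayed acceptance can be sketched on a one-dimensional toy problem. This is not the ADAMH of Algorithm 4: the sketch assumes a fixed (non-adaptive) symmetric Gaussian proposal and a fixed surrogate density, and all names (`damh_step`, `run_chain`, `log_target`, `log_approx`) are hypothetical.

```python
import math
import random

def damh_step(x, log_f, log_f_approx, step, rng):
    """One delayed-acceptance MH step with a symmetric Gaussian proposal.

    Returns (new_state, promoted, accepted).
    """
    z = x + step * rng.gauss(0.0, 1.0)
    # Stage 1: screen the proposal with the cheap approximation only.
    log_alpha = log_f_approx(z) - log_f_approx(x)
    if rng.random() >= math.exp(min(0.0, log_alpha)):
        return x, False, False
    # Stage 2: correct with the expensive density so the exact target is kept.
    log_beta = (log_f(z) - log_f(x)) + (log_f_approx(x) - log_f_approx(z))
    if rng.random() < math.exp(min(0.0, log_beta)):
        return z, True, True
    return x, True, False

def run_chain(log_f, log_f_approx, n_steps, step=1.0, seed=0):
    """Run the sampler from x = 0 and count promotions and acceptances."""
    rng = random.Random(seed)
    x, samples, promoted, accepted = 0.0, [], 0, 0
    for _ in range(n_steps):
        x, was_promoted, was_accepted = damh_step(x, log_f, log_f_approx, step, rng)
        samples.append(x)
        promoted += was_promoted
        accepted += was_accepted
    return samples, promoted, accepted

def log_target(x):
    return -0.5 * x * x              # standard normal, unnormalized

def log_approx(x):
    return -0.5 * (x / 1.1) ** 2     # slightly too wide surrogate (assumed)

samples, promoted, accepted = run_chain(log_target, log_approx, 20000)
print(f"second-stage acceptance: {accepted / promoted:.3f}")
```

When the surrogate equals the target, the second-stage ratio is exactly one, so every promoted proposal is accepted; this is the mechanism behind the high second-stage acceptance rates reported in the numerical examples.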
Proposition 0.3. Consider an adaptive delayed acceptance Metropolis-Hastings algorithm with the target distribution supported on a state space 𝓧, proposal adaptation space 𝓨, and approximation space 𝓩. Let f be the density of the target distribution π with respect to a finite reference measure λ, that is, π(dx) = f(x)λ(dx). Let the family of approximations to f be indexed by φ ∈ 𝓩. Let qγ be the first-step proposal densities. The algorithm is ergodic under the following conditions:
(i) 𝓧, 𝓨 are compact metric spaces, and $\mathcal{Z} = \mathcal{Z}_1 \cup \cdots \cup \mathcal{Z}_m$ where each 𝓩j is a compact metric space.
(ii) For each fixed γ, φ, the transition kernel Kγ,φ is ergodic.
(iii) λ{x} = 0 for all x ∈ 𝓧.
(iv) The mapping (x, y, γ) ↦ qγ(x, y) is continuous and uniformly bounded on 𝓧 × 𝓧 × 𝓨, which is a compact metric space equipped with the product space metric.
(v) For each y ∈ 𝓨, the mapping is continuous on each 𝓧 × 𝓩j.
(vi) Diminishing adaptation: The chain (Γn, Φn) satisfies $\sup_{x \in \mathcal{X}} \| K_{\Gamma_{n+1}, \Phi_{n+1}}(x, \cdot) - K_{\Gamma_n, \Phi_n}(x, \cdot) \|_{TV} \to 0$ in probability.
Adaptive Delayed Acceptance MH with probabilistic approximation adaptation
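The ADAMH can be viewed as an adaptive MCMC algorithm with state space 𝓧 and adaptation space 𝓨 × 𝓩. In order to apply corollary 0.2, we will prove that for any fixed n = 1, 2,…, and fixed j = 1,…, m, the mapping $(x, \gamma, \varphi) \mapsto \| K_{\gamma,\varphi}^n(x, \cdot) - \pi(\cdot) \|_{TV}$ is continuous on 𝓧 × 𝓨 × 𝓩j. In order to do so, we proceed as in the proof of Theorem 1 in Cui et al.31 Fix (x, γ, φ) ∈ 𝓧 × 𝓨 × 𝓩j. The transition kernel for the DAMH associated with (x, γ, φ) is $K_{\gamma,\varphi}(x, dz) = q_\gamma(x,z)\, \alpha_{\gamma,\varphi}(x,z)\, \beta_{\gamma,\varphi}(x,z)\, \lambda(dz) + (1 - \rho_{\gamma,\varphi}(x))\, \delta_x(dz)$, where $\alpha_{\gamma,\varphi}(x,z)$ is the first step acceptance probability, $\beta_{\gamma,\varphi}(x,z)$ is the second step acceptance probability, and $\rho_{\gamma,\varphi}(x) = \int_{\mathcal{X}} q_\gamma(x,z)\, \alpha_{\gamma,\varphi}(x,z)\, \beta_{\gamma,\varphi}(x,z)\, \lambda(dz)$ is the overall probability for a proposal to be accepted.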
Fix the value of z. Then, due to conditions (iv) and (v), g(x, z, γ, φ) = qγ(x, z)αγ,φ(x, z)βγ,φ(x, z) is jointly continuous in (x, γ, φ) ∈ 𝓧 × 𝓨 × 𝓩j. Furthermore, condition (iv) implies that the functions z ↦ g(x, z, γ, φ) are uniformly bounded for (x, γ, φ) ∈ 𝓧 × 𝓨 × 𝓩j. By the bounded convergence theorem, ργ,φ(x) is jointly continuous in the three variables x, γ, φ.
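By induction, we can show that the n-step transition kernel has the form $K_{\gamma,\varphi}^n(x, dz) = g_n(x, z, \gamma, \varphi)\, \lambda(dz) + (1 - \rho_{\gamma,\varphi}(x))^n\, \delta_x(dz)$, where gn is an appropriate function that is jointly continuous in x, γ and φ. From condition (iii), δx and π are orthogonal measures. Therefore, $\| K_{\gamma,\varphi}^n(x, \cdot) - \pi(\cdot) \|_{TV} = \tfrac{1}{2} \int_{\mathcal{X}} | g_n(x, z, \gamma, \varphi) - f(z) |\, \lambda(dz) + \tfrac{1}{2} (1 - \rho_{\gamma,\varphi}(x))^n$.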
The integral on the right hand side is jointly continuous in x, γ, φ due to the bounded convergence theorem. This shows that $\| K_{\gamma,\varphi}^n(x, \cdot) - \pi(\cdot) \|_{TV}$ is continuous in the variable (x, γ, φ) ∈ 𝓧 × 𝓨 × 𝓩j. From this, conditions (i), (vi) and corollary 0.2 combined show that the algorithm is ergodic. □
Proposition 0.4. Assume the ADAMH with probabilistic model adaptation satisfies conditions (i)-(v) in proposition 0.3. Assume further that the proposal is symmetric, that the approximate posterior adaptation probability a(n) → 0 as n → ∞, and that d𝓨(Γn+1, Γn) → 0 in probability (here d𝓨 denotes the metric on 𝓨). Then, the algorithm satisfies diminishing adaptation.
All conditions for ergodicity in proposition 0.3 are satisfied, except for the diminishing adaptation that we will verify. Fix a value of n. Consider a fixed set of values of adaptivity parameters of the ADAMH chain up to iteration n.
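Fix an event A ∈ B(𝓧) and x ∈ 𝓧. We have $| K_{\gamma_{n+1},\varphi_{n+1}}(x, A) - K_{\gamma_n,\varphi_n}(x, A) | \le D_1 + D_2$, where $D_1 = | K_{\gamma_{n+1},\varphi_{n+1}}(x, A) - K_{\gamma_n,\varphi_{n+1}}(x, A) |$ and $D_2 = | K_{\gamma_n,\varphi_{n+1}}(x, A) - K_{\gamma_n,\varphi_n}(x, A) |$.
We bound each term separately. First of all, we have D2 = 0 if φn = φn+1 and $D_2 \le K_{\gamma_n,\varphi_{n+1}}(x, A) + K_{\gamma_n,\varphi_n}(x, A) \le 2$ if φn ≠ φn+1, with the latter event taking place with probability at most a(n).
Due to the symmetry of the proposal, the first and second step acceptance probabilities do not depend on the choice of γ. This and the uniform continuity of qγ(x,y) give us $D_1 \le C \cdot d_{\mathcal{Y}}(\gamma_n, \gamma_{n+1})$, where C > 0 is independent of x, A, γ and φ.
Combining the bounds on D1 and D2 we get $| K_{\gamma_{n+1},\varphi_{n+1}}(x, A) - K_{\gamma_n,\varphi_n}(x, A) | \le C \cdot d_{\mathcal{Y}}(\gamma_n, \gamma_{n+1}) + 2 \chi(\varphi_n \ne \varphi_{n+1})$, where χ(E) = 1 if E is true and 0 otherwise. Taking the supremum over all x and A we get $D_n \le C \cdot d_{\mathcal{Y}}(\gamma_n, \gamma_{n+1}) + 2 \chi(\varphi_n \ne \varphi_{n+1})$.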
Fix a scalar ε > 0. The set of runs where Dn < ε includes the sample chains for which both φn = φn+1 and C · d𝓨(γn, γn+1) < ε hold. Therefore, the event [Dn ≥ ε] is a subset of the event [C · d𝓨(Γn, Γn+1) ≥ ε] ∪ [Φn ≠ Φn+1]. We therefore have $P(D_n \ge \varepsilon) \le P(C \cdot d_{\mathcal{Y}}(\Gamma_n, \Gamma_{n+1}) \ge \varepsilon) + a(n)$.
The last right hand side of the inequality above converges to 0 as n → ∞. Therefore, Dn converges to 0 in probability. The diminishing adaptation condition is satisfied and the algorithm is ergodic. □
Regularity of the ROM-based likelihood approximation
Let 𝓢j be the set of all n × j matrices Q such that $Q^T Q = I_{j \times j}$. It is known that 𝓢j with the metric defined by the induced matrix 2-norm is a compact metric space (indeed, it is bounded, and it is the inverse image of the closed set $\{I_{j \times j}\}$ under the continuous mapping $A \mapsto A^T A$). Let mmax be the maximum dimension allowed in the reduced basis and let Φ be a particular basis set constructed during a run of the ADAMH chain. Then there exists a tuple $\mathbf{j} = (j_1, \ldots, j_{n_B})$ with 1 ≤ jk ≤ mmax such that $\Phi \in S_{\mathbf{j}} := \mathcal{S}_{j_1} \times \cdots \times \mathcal{S}_{j_{n_B}}$.
Thus, the set of all possible choices of the reduced basis set Φ is the finite union of all $S_{\mathbf{j}}$ with $\mathbf{j}$ bounded elementwise by mmax. Note that each $S_{\mathbf{j}}$ is a compact metric space with the product space topology. Thus, we can apply the theory developed in the previous section to show that the ADAMH-FSP-Krylov is ergodic. The following propositions concern the continuity of the reduced-order approximations with respect to changes in the basis.
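Proposition 0.5. Fix a space Sj as above, and let Φ and Ψ be elements of this space. For every fixed θ ∈ Θ, the approximation to the FSP log-likelihood defined in eq. (16), computed with the basis set Ψ, converges to the same approximation computed with the basis set Φ as Ψ → Φ in Sj.
From eq. (7), it is clear that the mapping Φ ↦ pΦ(tk) is continuous on Sj for all time points tk. The approximate log-likelihood is a composition of the continuous mappings Φ ↦ pΦ(tk) with the continuous mapping from these approximate solutions to the log-likelihood value, and is therefore continuous. □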
Ergodicity of the ADAMH-FSP-Krylov algorithm
The ADAMH-FSP-Krylov algorithm is ergodic.
We apply proposition 0.3 with 𝓧 = Θ. The proposal densities of the first step are Gaussian with γ being the modified empirical covariance matrix as in the adaptive Metropolis Algorithm.27 Similar to the proof of Theorem 1 in Haario et al.,27 we can take 𝓨 to be a closed, bounded subset of the set of positive definite matrices. The reduced model space is 𝓩 = ∪j Sj the finite union of the compact spaces Sj with j ≤ mmax pointwise. These spaces satisfy condition (i), and the proposal density satisfies condition (iv).
The posterior density is and the approximate posterior densities are where these are the densities of the true and approximate posterior distributions with respect to the Lebesgue measure. From Theorem 1 in Christen and Fox,28 condition (ii) is satisfied.
Condition (v) is then satisfied using proposition 0.5.
Since the empirical covariances are computed from values in a bounded set, the modification to the empirical covariance matrix γ at step n is O(1/n), so changes in Γn converge to 0 (see Haario et al.27). Thus, the conditions in proposition 0.4 are satisfied. The algorithm therefore satisfies all sufficient conditions for ergodicity outlined in proposition 0.3. □
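The O(1/n) behaviour of the covariance update can be illustrated in one dimension. The sketch below is only an illustration under simplifying assumptions: it tracks the running population variance of scalar draws from a bounded interval (rather than the modified empirical covariance matrix of the adaptive Metropolis algorithm), and the function name `running_variance_changes` is hypothetical.

```python
import random

def running_variance_changes(samples):
    """Yield (n, |v_{n+1} - v_n|) for the running population variance v_n."""
    mean, m2 = samples[0], 0.0      # Welford accumulators after one sample
    var = 0.0                       # variance of a single sample is zero
    for n, x in enumerate(samples[1:], start=1):   # n = samples seen so far
        delta = x - mean
        mean += delta / (n + 1)
        m2 += delta * (x - mean)
        new_var = m2 / (n + 1)
        yield n, abs(new_var - var)
        var = new_var

rng = random.Random(1)
draws = [rng.random() for _ in range(5000)]   # values in the bounded set [0, 1)
changes = list(running_variance_changes(draws))
worst = max((n + 1) * change for n, change in changes)
print(f"max over n of (n+1)*|v_(n+1) - v_n| = {worst:.3f}")
```

For values in [0, 1], the update identity v_{n+1} − v_n = −v_n/(n+1) + n(x_{n+1} − m_n)²/(n+1)² bounds each change by 1/(n+1), so the printed quantity never exceeds one regardless of the random seed.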
Footnotes
E-mail: Huy.Vo{at}colostate.edu; Munsky{at}colostate.edu