Abstract
Stochastic gene expression in regulatory networks is conventionally modelled via the Chemical Master Equation (CME) (van Kampen 1981). As explicit solutions to the CME, in the form of so-called propagators, are not readily available, various approximations have been proposed (Zechner et al. 2013, Feigelman et al. 2016, Popović, Marr and Swain 2016). A recently developed analytical method (Veerman, Marr and Popović 2017) is based on a scale separation that assumes significant differences in the lifetimes of mRNA and protein in the network, allowing for the efficient approximation of propagators from asymptotic expansions for the corresponding generating functions. Here, we showcase the applicability of that method to a ‘telegraph’ model for gene expression that is extended with an autoregulatory mechanism. We demonstrate that the resulting approximate propagators can be successfully applied for Bayesian parameter inference in the non-regulated model with synthetic data; moreover, we show that in the extended autoregulated model, autoactivation or autorepression may be refuted under certain assumptions on the model parameters. Our results indicate that the method showcased here may allow for successful parameter inference and model identification from longitudinal single cell data.
1 Introduction and background
Gene expression in regulatory networks is an inherently stochastic process (Elowitz et al. 2002). Mathematical models typically take the form of a Chemical Master Equation (CME), which describes the temporal evolution of the probabilities of observing specific states in the network (van Kampen 1981). Recent advances in single-cell fluorescence microscopy (Crane et al. 2014, Filipczyk et al. 2015, Hoppe et al. 2016, Suter et al. 2011, Zenklusen, Larson and Singer 2008) have allowed for the generation of experimental longitudinal data, whereby the fluorescence intensity of mRNA or protein abundances in single cells is measured. While most common models assume the availability of protein abundance data, the abundance of mRNA may equally be of interest, depending on the model (Janicki et al. 2004). Here, we focus exclusively on protein abundances, which we assume to be measured at regular sampling intervals Δt, for the sake of simplicity. Given the resulting data set D, parameter inference is performed on the basis of the log-likelihood

$$\mathcal{L}(\Theta \mid D) = \sum_{i} \log P_{n_{i+1}|n_i}(\Delta t, \Theta), \qquad (1.1)$$

that can be calculated over a range of values for the model parameters to yield a ‘log-likelihood landscape’, the maximum of which corresponds to the most likely parameter set Θ subject to D. Here, the sum runs over all observed transitions, and the propagator $P_{n_{i+1}|n_i}(\Delta t, \Theta)$ encodes the probability for the transition ni → ni+1 between protein numbers ni and ni+1 to occur after time Δt, given Θ. Due to the complex nature of the underlying gene networks, explicit expressions for the propagator are difficult to obtain in general. Hence, a variety of approximations have been developed, which can be either numerical (Zechner et al. 2013, Feigelman et al. 2016) or analytical (Schnoerr, Sanguinetti and Grima 2017, Popović, Marr and Swain 2016), to name but a few.
Here, we apply the analytical method recently developed by the current authors (Veerman, Marr and Popović 2017), which was based on ideas presented by Popović, Marr and Swain 2016, to obtain fully time-dependent approximate propagators; an outline of the method is given in Section 2.
Our aim in the present article is to demonstrate the applicability of these propagators, as well as to evaluate their performance in the context of Bayesian parameter inference for synthetic data. Specifically, we showcase the resulting inference procedure for a family of models for stochastic gene expression. First, in Section 3, we consider a model that incorporates DNA on/off states (‘telegraph model’); see also the work of Raj et al. 2006 and Shahrezaei and Swain 2008. Subsequently, in Section 4, that model is extended with an autoregulatory mechanism, where the protein influences its own production through an autocatalytic reaction. In Section 5, we summarise our results and present an outlook to future research; finally, in Appendix A, we collate the analytical formulae that underlie our inference procedure for the family of models showcased here.
2 Method
2.1 Calculation of propagators
Our method (Veerman, Marr and Popović 2017) is based on an analytical approximation of the probability generating function that is introduced for analysing the CME corresponding to the given gene expression model. Propagators can be calculated from the generating function via the Cauchy integral formula, which implies

$$P_{n_{i+1}|n_i}(t, \Theta) = \frac{1}{2\pi i} \oint_\gamma \frac{F(z; t, n_i, \Theta)}{z^{n_{i+1}+1}}\, dz; \qquad (2.1)$$

here, P denotes the propagator, and F(z; t, ni, Θ) is the generating function of the (complex) variable z, which additionally depends on time t, the protein number ni, and the model parameter set Θ. The integration contour γ is a closed contour in the complex plane around z = 0. The choice of contour is arbitrary in principle; however, it can have a significant effect on computation times and on the accuracy of the resulting integrals; see the work by Bornemann 2011. Here, we choose γ to be a regular 150-sided polygon approximating a circle of radius 0.8 that is centred at the origin of the complex plane, which results in a ‘hybrid analytical-numerical’ procedure for the evaluation of the propagator.
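As a rough illustration of how a probability can be recovered from a generating function by a contour integral, the following minimal sketch integrates along a regular 150-sided polygon of radius 0.8. The function names, the trapezoidal rule along the polygon edges, and the Poisson test function are assumptions made for this example; they are not taken from the article's implementation.

```python
import cmath
import math

def polygon_contour(n_sides=150, radius=0.8):
    """Vertices of a regular polygon centred at the origin (first vertex repeated at the end)."""
    return [radius * cmath.exp(2j * math.pi * l / n_sides) for l in range(n_sides + 1)]

def coefficient(F, n, contour):
    """Approximate the coefficient of z**n in F(z), i.e.
    (1 / (2*pi*i)) * integral of F(z) / z**(n + 1) along the contour,
    using the trapezoidal rule on each polygon edge (an assumed quadrature rule)."""
    total = 0j
    for a, b in zip(contour[:-1], contour[1:]):
        ga = F(a) / a ** (n + 1)
        gb = F(b) / b ** (n + 1)
        total += 0.5 * (ga + gb) * (b - a)
    return (total / (2j * math.pi)).real

# Toy check with a generating function whose coefficients are known:
# for a Poisson distribution, F(z) = exp(mean * (z - 1)).
mean = 2.0
F = lambda z: cmath.exp(mean * (z - 1))
contour = polygon_contour()
approx = coefficient(F, 3, contour)
exact = math.exp(-mean) * mean ** 3 / math.factorial(3)
print(approx, exact)
```

In this toy setting the polygon result agrees with the exact Poisson probability to within a small relative error, which hints at why a modest number of contour points can suffice.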
2.2 Parameter inference
The parameter inference procedure proposed here can be divided into the following steps:
1) Data binning
The simulated data D is presented as a time series {ni}, 0 ≤ i ≤ N, which yields N transitions ni → ni+1. Generically, some of these transitions occur more than once. We bin the data accordingly to create a binned data set D0 = {(kj, (n0 → n) j)}, with 0 ≤ j ≤ N0 for N0 ≤ N, where kj denotes the frequency of the transition (n0 → n) j; see also Figure 1.
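The binning step can be sketched in a few lines. The helper name `bin_transitions` and the short example series are hypothetical and serve only to illustrate the idea of counting repeated transitions.

```python
from collections import Counter

def bin_transitions(series):
    """Count how often each transition (n0 -> n) occurs in a time series of protein numbers."""
    return Counter(zip(series[:-1], series[1:]))

# Hypothetical time series with N = 6 transitions, some of which repeat.
series = [3, 3, 4, 3, 3, 4, 5]
binned = bin_transitions(series)
for (n0, n), k in sorted(binned.items()):
    print(f"{n0} -> {n}: observed {k} time(s)")
```

Each distinct transition then needs to be evaluated only once, with its frequency entering the log-likelihood as a weight.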
2) Marginalisation
Frequently, some of the species involved in a model are not observed, and hence have to be marginalised over. In the models discussed in Sections 3 and 4, this is the case for mRNA. Marginalisation over unobserved species is usually carried out on the transition probabilities in (2.1). However, since the marginalisation procedure is linear, it commutes with the Cauchy integral. Introducing the linear ‘marginalisation operator’ 𝕄, we may therefore apply 𝕄 under the integral sign in (2.1), where 𝕄 now acts on the generating function F. Given the analytical approximation for F resulting from our method (Veerman, Marr and Popović 2017), we define the marginalised generating function 𝕄F, cf. (2.3), which depends only on the subset of parameters that remain after the marginalisation procedure has been applied. Note that this marginalised generating function is still a fully analytical, general expression which depends on the as yet unspecified values of its arguments.
3) Evaluation
We choose a set of numerical values for the parameters that remain after marginalisation. Moreover, we specify the integration contour γ, which we discretise as described in Section 2.1 to approximate the Cauchy integral in (2.1) by a finite sum. Suppose that the contour γ is discretised as {ζ(l)}, with 0 ≤ l ≤ L and ζ(0) = ζ(L); then, the integral of a function G along γ is approximated by a finite sum over the points ζ(l).
Now, for every element (k, n0 → n)j of the binned data set D0, we evaluate the marginalised generating function, as given in (2.3), for the chosen parameter values along the discretised contour. We hence obtain an array of values indexed by l, which we sum over l to find the approximate value of the propagator for the transition (n0 → n) j.
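Once propagator values are available for every binned transition, the log-likelihood is a frequency-weighted sum of their logarithms, which can then be scanned over a parameter grid. The sketch below assumes a hypothetical stand-in propagator (a Poisson-distributed increase with mean `theta`) in place of the contour-integral approximation; all names are illustrative.

```python
import math

def log_likelihood(binned, propagator, theta):
    """Sum of k * log(P(n0 -> n)) over all binned transitions."""
    total = 0.0
    for (n0, n), k in binned.items():
        p = propagator(n0, n, theta)
        if p <= 0.0:
            # A transition with zero probability rules out this parameter value.
            return -math.inf
        total += k * math.log(p)
    return total

def toy_propagator(n0, n, theta):
    """Hypothetical stand-in: Poisson-distributed increase with mean theta, no decreases."""
    jump = n - n0
    if jump < 0:
        return 0.0
    return math.exp(-theta) * theta ** jump / math.factorial(jump)

# Binned data: {(n0, n): frequency}.
binned = {(0, 1): 3, (1, 3): 2, (3, 3): 1}
grid = [0.25 * i for i in range(1, 21)]
best = max(grid, key=lambda th: log_likelihood(binned, toy_propagator, th))
print(best)
```

The early return for a vanishing propagator mirrors the issue discussed in Appendix A.2: a single transition that the approximation deems impossible sends the log-likelihood to minus infinity.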
3 Showcase 1: The telegraph model
To demonstrate our parameter inference procedure, we consider a stochastic gene expression model that incorporates DNA on/off states (‘telegraph model’) (Raj et al. 2006, Shahrezaei and Swain 2008):
In recent work (Veerman, Marr and Popović 2017), we presented an analytical method for obtaining explicit, general, time-dependent expressions for the generating function associated with the CME that arises from the model in (3.1). A pivotal element of the application of that method to (3.1) is the assumption that the protein decay rate d1 is notably smaller than the mRNA decay rate d0, which implies that the parameter ε = d1/d0 is small; hence, the associated generating function is approximated to a certain order O = k, corresponding to a theoretical accuracy that is proportional to ε^k. For more details on the resulting approximation, we refer to Appendix A.
We simulate the model in (3.1) using Gillespie’s algorithm (Gillespie 1977), for fixed values of the (rescaled) parameters, cf. (3.2), on the time interval 0 ≤ t ≤ 10, and measure the protein abundance n with a fixed sampling interval Δt. As our method assumes that Δt is of order ε, cf. again Appendix A, we set Δt = ε = 0.1, which yields N = 100 transitions. Based on the simulated measurement data, we perform the parameter inference procedure described in Section 2. As the data consists of protein abundances only, and as propagators for the model in (3.1) depend on both protein and mRNA abundances, we marginalise over mRNA abundance, assuming that it follows a Poisson distribution. We assume that the values of κ0, κ1, ε, and d1 are known, and calculate the log-likelihood in (2.7) for varying λ and μ. We scan these two parameters in {10^−3 ≤ λ ≤ 10^3, 10^−3 ≤ μ ≤ 10^2}, using a logarithmically spaced grid of 50 × 40 grid points. Figure 2 shows the resulting log-likelihood landscapes and, in particular, a comparison of the performance of the leading (zeroth) order approximation for the generating function, see Figure 2(A), with that of the first order approximation in Figure 2(B).
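For readers who wish to generate comparable synthetic data, the following is a minimal sketch of a Gillespie simulation of a telegraph-type model with protein numbers recorded at regular sampling intervals. The reaction list, the rate names (`k_on`, `k_off`, `k_m`, `k_p`, `d_m`, `d_p`) and their values are assumptions for illustration only; they do not correspond to the rescaled parameters of the article.

```python
import math
import random

def gillespie_telegraph(rates, t_end, dt, seed=0):
    """Simulate DNA switching, transcription, translation and decay;
    return protein numbers sampled at times 0, dt, 2*dt, ..., t_end."""
    rng = random.Random(seed)
    k_on, k_off, k_m, k_p, d_m, d_p = rates
    on, m, p = 0, 0, 0          # DNA state, mRNA number, protein number
    t = 0.0
    n_samples = int(round(t_end / dt)) + 1
    samples = []
    while True:
        # Propensities: DNA on, DNA off, transcription, mRNA decay, translation, protein decay.
        props = [k_on * (1 - on), k_off * on, k_m * on, d_m * m, k_p * m, d_p * p]
        total = sum(props)
        t_next = t + rng.expovariate(total) if total > 0 else math.inf
        # The state is constant until the next reaction, so record it at
        # every sampling time that falls before t_next.
        while len(samples) < n_samples and len(samples) * dt <= t_next:
            samples.append(p)
        if len(samples) == n_samples:
            return samples
        t = t_next
        idx = rng.choices(range(6), weights=props)[0]
        if idx == 0:
            on = 1
        elif idx == 1:
            on = 0
        elif idx == 2:
            m += 1
        elif idx == 3:
            m -= 1
        elif idx == 4:
            p += 1
        else:
            p -= 1

# Hypothetical rates with d_p / d_m = 0.1, sampled with dt = 0.1 on 0 <= t <= 10.
rates = (1.0, 1.0, 10.0, 5.0, 10.0, 1.0)
samples = gillespie_telegraph(rates, t_end=10.0, dt=0.1, seed=1)
print(len(samples) - 1, "transitions")
```

With these settings the run yields 101 samples and hence 100 transitions, matching the size of the data sets described above.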
To quantify the performance of the method developed by Veerman, Marr and Popović 2017 for parameter inference, we compare four different scenarios:
(a) Parameter values as in (3.2), with sampling interval Δt = ε = 0.1 on the time interval 0 ≤ t ≤ 10, corresponding to N = 100 transitions, which is the original setup that yields the results shown in Figure 2.
(b) As in (a), with the time interval increased to 0 ≤ t ≤ 100, which yields N = 1000 transitions.
(c) As in (a), with ε = 0.01; the sampling interval is decreased accordingly to Δt = ε = 0.01; measurements are taken on the time interval 0 ≤ t ≤ 1, which yields N = 100 transitions.
(d) As in (a), with μ = 28.5.
For each scenario, we infer the most likely values of the parameters λ and μ, for increasing approximation order O. The inferred values of λ and μ are compared to the ‘true’ values λtrue and μtrue, where we consider relative errors to quantify the performance of our inference procedure. The results of that comparison are shown in Figure 3.
4 Showcase 2: An autoregulated telegraph model
We extend the telegraph model in (3.1) with an autoregulatory mechanism, where the DNA activation rates are influenced by the presence of protein. Autoregulation is modelled in a catalytic manner, via one of the two following reactions:
The above pair of autoregulation mechanisms was introduced by Hornos et al. 2005, and implemented e.g. by Iyer-Biswas and Jayaprakash 2014; see Section 5 for a discussion of the physical validity of these mechanisms.
To assess the performance of our parameter inference procedure, we fix the parameter values as in (3.2). We generate six data sets, as follows:
(A) Simulate the model in (3.1) without autoregulation (‘null model’; aP = rP = 0) on the time interval 0 ≤ t ≤ 10, which yields N = 100 transitions.
(B) As in (A), with the time interval increased to 0 ≤ t ≤ 100, which yields N = 1000 transitions.
(C) Simulate the extended model {(3.1),(4.1)} with autoactivation for aPδ = 0.3 on the time interval 0 ≤ t ≤ 10, which yields N = 100 transitions.
(D) As in (C), with the time interval increased to 0 ≤ t ≤ 100, which yields N = 1000 transitions.
(E) Simulate the extended model {(3.1),(4.1)} with autorepression for rPδ = 0.3 on the time interval 0 ≤ t ≤ 10, which yields N = 100 transitions.
(F) As in (E), with the time interval increased to 0 ≤ t ≤ 100, which yields N = 1000 transitions.
Every data set consists of 10 runs of equal length.
Generating functions for the autoregulated extension (4.1) of the telegraph model in (3.1) have been derived in the theoretical companion article (Veerman, Marr and Popović 2017) to the current work, under the assumption that the autoregulation rate aP or rP is small compared to the protein decay rate d1. That assumption implies that the corresponding ratios of aP and rP to d1 are small.
Parameter inference now proceeds as follows. We fix a data set, and take a single run from that data set. For that run, we determine the likelihood of the autoactivated model in {(3.1),(4.1a)}, varying 0 ≤ αPδ ≤ 1; likewise, we determine the likelihood of the autorepressed model in {(3.1),(4.1b)}, varying 0 ≤ ρPδ ≤ 1. The likelihood of the non-regulated model in (3.1) is then used to determine the model score according to the Bayesian information criterion (BIC) (Schwarz 1978), expressed as an information difference ΔBIC.
Here, L is the likelihood of an autoregulated extension, with autoregulation as in (4.1), of the model in (3.1), while L0 denotes the likelihood of the non-regulated model in (3.1). The information difference ΔBIC, which is also known as the log Bayes factor, quantifies the evidence for the model in question. Typically, a ΔBIC-value above 3 is considered strong evidence (Kass and Raftery 1995). We repeat the above procedure for all 10 runs in the data set, and determine the mean and standard deviation; the outcome is illustrated in Figure 4.
5 Discussion
In the present article, we showcase a parameter inference procedure that is based on a recently developed analytical method (Veerman, Marr and Popović 2017) which allows for the efficient numerical approximation of propagators via the Cauchy integral formula on the basis of asymptotic series for the underlying generating functions. The resulting hybrid analytical-numerical approach reduces the need for computationally expensive simulations; moreover, due to its perturbative nature, it is highly applicable over relatively short time scales, such as occur naturally in the calculation of the log-likelihood in (1.1).
We present results for synthetic data in a family of models for stochastic gene expression from the literature under the assumption that lifetimes of protein are significantly longer than those of mRNA, which introduces a small parameter ε and, hence, a separation of scales. For an extensive discussion of the validity of our assumption that ε is small, we refer to (Veerman, Marr and Popović 2017).
In Section 3, we discuss a simple (‘telegraph’) gene expression model without autoregulation, showing that our approach can successfully infer relevant model parameters. Unlike in previous work by Feigelman, Popović and Marr 2015, the underlying implementation avoids potential bias due to zero propagator values and large initial protein numbers through the use of ‘implicit’ series expansions in ε; see Appendix A for an in-depth argument.
In Section 4, we perform a model comparison in an autoregulated extension of the standard telegraph model. We consider three types of gene regulation: autoactivation, autorepression, and no regulation of DNA activity (null model). For each type, we simulate data with 100 and 1000 protein transitions, respectively. Throughout, we find that 100 data points are insufficient to reject model hypotheses with our approach. With 1000 data points, however, we can successfully reject the non-regulated and the autorepressed model for simulated data from an autoactivated model, in which case we can even infer the correct order of the autoactivation parameter. For simulated autorepression, we can reject the model with autoactivation, but not the non-regulated model. Our approach fails to identify the correct model for data from a non-regulated model for 1000 transitions, where the autoactivated model is clearly, but wrongly, favoured. We believe that more research is needed into the sources of these discrepancies in dependence on model parameters and the order of our approximation.
In both showcases, we observe a trade-off between the accuracy of inference and the required computation time. Computation times seem to increase exponentially with the approximation order, at least for the setup realised in this article. For practical purposes, we hence propose an algorithm whereby the fastest, leading order approximation is used to obtain a first estimate for the underlying model parameters; that estimate can then be improved by including higher order corrections, resulting in a much more computationally efficient procedure.
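One possible reading of such a two-stage procedure is sketched below: a cheap objective is scanned on a coarse, logarithmically spaced grid, and a costlier objective is then evaluated only on a narrow grid around the first estimate. The function names, grid sizes, window width, and the quadratic toy objectives are all assumptions for illustration; in practice the two objectives would be log-likelihoods based on the leading order and a higher order approximation, respectively.

```python
import math

def logspace(lo_exp, hi_exp, num):
    """num points between 10**lo_exp and 10**hi_exp, equally spaced in the exponent."""
    step = (hi_exp - lo_exp) / (num - 1)
    return [10 ** (lo_exp + i * step) for i in range(num)]

def two_stage_scan(cheap_loglik, costly_loglik, lo_exp, hi_exp,
                   coarse=50, fine=21, width=0.5):
    """Coarse scan with the cheap objective, then local refinement with the costly one."""
    first = max(logspace(lo_exp, hi_exp, coarse), key=cheap_loglik)
    centre = math.log10(first)
    fine_grid = logspace(centre - width, centre + width, fine)
    refined = max(fine_grid, key=costly_loglik)
    return first, refined

# Toy objectives: the cheap one peaks at 10**0.8, the costly one at 10**1.0.
cheap = lambda x: -(math.log10(x) - 0.8) ** 2
costly = lambda x: -(math.log10(x) - 1.0) ** 2
first, refined = two_stage_scan(cheap, costly, -3, 3)
print(first, refined)
```

The costly objective is evaluated 21 times here instead of 50, and the saving grows with the dimension of the parameter grid; the refinement only helps, however, if the first estimate lies within the chosen window of the true optimum.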
It is insightful to compare our results with other recent work on parameter inference in regulated gene expression models. In work by Feigelman et al. 2016, three models for regulated gene expression with a slightly different structure compared to the models studied in the present article were simulated and inferred via a stochastic particle filtering-based inference procedure that employs genealogical information of dividing cells. Interestingly, positive and negative autoregulation could be successfully rejected there for data that was simulated from a no-feedback model. However, the no-feedback model could not be rejected for data originating from the corresponding models with positive or negative feedback. From this comparison with Feigelman et al. 2016, we conclude that the structure of the data, the intensity of regulatory feedback, and the chosen inference procedure together will influence the extent of insight which can be obtained from an inference approach that is based on stochastic models of gene expression.
We emphasise that the application of the analytical method showcased here is not restricted to specific models; the goal of the present article is to demonstrate the applicability of that method, and to investigate its performance, rather than to assess the biological validity of a given model. It is important to note that our approach can equally be extended to recent, physiologically more relevant modifications of the telegraph model with autoregulation (Hornos et al. 2005) by Grima, Schmidt and Newman 2012 and Congxin et al. 2018; another feasible alternative model can be obtained by introduction of a refractory state (Zoller et al. 2015).
The input for our propagator-based approach is the abundance of the involved species, viz. of protein. Such abundances are challenging to obtain experimentally due to the unknown relation with the fluorescence that is observed under a microscope. A linear relation is regularly assumed (Suter et al. 2011, Zechner et al. 2013); an improvement over that assumption may be achievable through recent work by Bakker and Swain 2018.
Finally, the showcases presented in this article are based on synthetic data that was generated in silico; in future work, we plan to consider experimental data, such as can be found in work by Suter et al. 2011.
Acknowledgements
The authors thank Ramon Grima and Peter Swain (both University of Edinburgh) for valuable comments and suggestions. This work has been supported by the Leverhulme Trust, through Research Project Grant RPG-2015-017 (‘A geometric analysis of multiple-scale models for stochastic gene expression’).
A Analytical details
A.1 Approximate generating functions
The generating functions used to approximate propagators in the present article, cf. Section 2, are derived via the analytical method presented by Veerman, Marr and Popović 2017. For the telegraph model in Section 3, the leading order generating function is given by (A.1). All parameters have been rescaled according to (3.2). The generating function has been marginalised over mRNA abundance, using a Poisson distribution. Analogously, the first order approximation of the generating function is given by (A.2).
For the autoregulated model discussed in Section 4, the same expressions for the generating functions are used; however, χ now depends on the autoregulatory mechanism according to
A.2 ‘Implicit’ expansions
It is important to note that neither the leading order generating function in (A.1) nor the first order approximation given by (A.2) are expressed as asymptotic series in powers of ε, as would be expected on the basis of the perturbative approach taken by Veerman, Marr and Popović 2017. The underlying reasoning can be summarised as follows.
First, in the derivation of the generating functions, it was assumed that the sampling time Δt was small, i.e. of order ε; note that this assumption is satisfied in all numerical simulations shown in the current article, where Δt = ε throughout. Thus, we can write Δt as ε multiplied by a rescaled sampling time of order 1.
With the above scaling for Δt, an expansion of the leading order and first order generating functions, as defined in (A.1) and (A.2), respectively, into asymptotic series in ε to the appropriate order yields the series in (A.5) and (A.6).
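From (A.5), one can immediately conclude that the leading order propagator vanishes for every decreasing transition (n0 → n), that is, whenever n < n0.
From the series in (A.6), it follows in the same way that the first order propagator vanishes whenever n < n0 − 1.
More generally, an expansion of the generating function to order k in ε will yield a propagator that vanishes whenever n < n0 − k.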
From these observations, we conclude that decreasing transitions (ni → ni+1), where ni > ni+1 + k, will be assigned a probability that is identically zero. Hence, if such transitions do occur in the data, the model is ruled out immediately, as our perturbative approach excludes the possibility that such transitions can occur. One can understand this phenomenon by recalling that the small parameter ε is defined as the ratio of the protein decay rate d1 over the mRNA decay rate d0. A leading order approximation, with ε = 0, is thus equivalent to taking the protein decay rate d1 → 0 which, in turn, implies that protein abundance cannot decrease at all, since (natural) protein decay is the only reaction in (3.1) that can decrease protein abundance. By the same reasoning, a straightforward expansion of the generating function to order ε^k will restrict the model to transitions (ni → ni+1), where ni+1 − ni ≥ −k. It would follow that either the order O of the method would be limited from below by the data, leading to high-order expansions in ε and, hence, to long computation times, or that the method could only be applied to a subset of the data, which would introduce a bias.
Lastly, an asymptotic expansion such as (A.6) implicitly assumes that all parameters and variables in the model are of order 1 in ε. For the series expansion in (A.6), that assumption would significantly restrict the range of λ; in comparison, in Figure 2, likelihood values for λ up to order ε^−3 are calculated. More importantly, the above assumption would restrict the range of n0, implying that only a subset of the data, namely that with sufficiently low protein numbers, could be used as input for parameter inference.
We emphasise that none of these difficulties occur with the expressions in (A.1) and (A.2), where the expansion order in ε is expressed ‘implicitly’ in the respective functional forms of the two generating functions.
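The mechanism can be made concrete in a toy setting that is much simpler than the generating functions in (A.1) and (A.2). As an assumption for illustration, consider a pure death process in which each of n_0 protein molecules survives the sampling interval independently with probability s = e^{-δ}, where the small quantity δ plays the role of ε. The symbols s, δ, and F_{≤k} are introduced here only for this example.

```latex
% Toy illustration (assumption): pure death process, not the article's model.
\begin{align*}
F(z) &= \bigl(1 - s + s z\bigr)^{n_0}, \qquad s = e^{-\delta}, \\
     &= \bigl(z + (1 - z)\,\delta + O(\delta^2)\bigr)^{n_0} \\
     &= z^{n_0} + n_0\,\delta\,\bigl(z^{n_0 - 1} - z^{n_0}\bigr) + O(\delta^2).
\end{align*}
% Truncating the series at order k in delta gives a polynomial F_{\le k}
% whose lowest power of z is z^{n_0 - k}, whereas the exact propagator is binomial:
\[
[z^n]\,F_{\le k}(z) = 0 \quad \text{for } n < n_0 - k,
\qquad
[z^n]\,F(z) = \binom{n_0}{n}\, s^{n} (1 - s)^{n_0 - n} > 0 \quad \text{for } 0 \le n \le n_0 .
\]
```

In this toy case, the explicit series in δ assigns zero probability to any drop of more than k molecules, while the unexpanded form keeps every such transition possible, which is the behaviour that the ‘implicit’ expansions are designed to retain.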
Footnotes
* Joint corresponding authors.