Abstract
Many complex processes, from protein folding and virus evolution to brain activity and neuronal network dynamics, can be described as stochastic exploration of a high-dimensional energy landscape. While efficient algorithms for cluster detection and data completion in high-dimensional spaces have been developed and applied over the last two decades, considerably less is known about the reliable inference of state transition dynamics in such settings. Here, we introduce a flexible and robust numerical framework to infer Markovian transition networks directly from time-independent data sampled from stationary equilibrium distributions. Our approach combines Gaussian mixture approximations and self-consistent dimensionality reduction with minimal-energy path estimation and multi-dimensional transition-state theory. We demonstrate the practical potential of the inference scheme by reconstructing the network dynamics for several protein folding transitions and HIV evolution pathways. The predicted network topologies and relative transition time scales agree well with direct estimates from time-dependent molecular dynamics data and phylogenetic trees. The underlying numerical protocol thus allows the recovery of relevant dynamical information from instantaneous ensemble measurements, effectively alleviating the need for time-dependent data in many situations. Owing to its generic structure, the framework introduced here will be applicable to modern cryo-electron-microscopy and high-throughput single-cell RNA sequencing data and can guide the design of new experimental approaches towards studying complex multiphase phenomena.
Energy landscapes encapsulate the effective dynamics of a wide variety of physical, biological and chemical systems1,2. Well-known examples include a myriad of biophysical processes3–7, multiphase systems2, thermally activated hopping in optical traps8, chemical reactions1, brain neuronal expression9, and cellular development10–14. Energetic concepts have also been connected to machine learning15 and to viral fitness landscapes, where pathways with the lowest energy barriers may explain typical mutational evolutionary trajectories of viruses between fitness peaks16,17. Recent advances in experimental techniques including cryo-electron microscopy (cryo-EM)3 and single-cell RNA sequencing18, as well as new online social interaction datasets19, are producing an unprecedented wealth of high-dimensional instantaneous snapshots of biophysical and social systems. Although much progress has been made in dimensionality reduction20–22 and the reconstruction of effective energy landscapes in these settings3,11,14,23, the problem of inferring dynamical information such as protein folding or mutation pathways and rates from instantaneous ensemble data remains a major challenge.
To address this practically important question, we introduce here an integrated computational framework for identifying metastable states on reconstructed high-dimensional energy landscapes and for predicting the relative mean first passage times (MFPTs) between those states, without requiring explicitly time-dependent data. Our inference scheme employs an analytic representation of the data based on a Gaussian mixture model (GMM)24 to enable efficient identification of minimum-energy transition pathways25–27. We show how the estimation of transition networks can be optimized by reducing the dimension of a high-dimensional landscape while preserving its topology. Our algorithm utilizes experimentally validated analytical results8 for transition rates1,28,29. Thus, it is applicable whenever the time evolution of the underlying system can be approximated by Fokker-Planck-type Markovian dynamics, as is the case for a wide range of physical, chemical and biological processes1.
Specifically, we illustrate the practical potential by inferring protein folding transitions and HIV evolution pathways. Current standard methods for coarse-graining the conformational dynamics of biophysical structures30 typically estimate Markovian transition rates from time-dependent trajectory data in large-scale molecular dynamics simulations31. By contrast, we show here that protein folding pathways and rates can be recovered without explicit knowledge of the time-dependent trajectories, provided the system is sufficiently ergodic and equilibrium distributions are sampled accurately. The agreement with the trajectory-based estimates suggests that the inference of complex transition networks via reconstructed energy landscapes can provide a viable and often more efficient alternative to traditional time series estimates, particularly as new experimental techniques will offer unprecedented access to high-dimensional ensemble data.
RESULTS
Minimum-energy-path (MEP) network reconstruction
The equilibrium distribution p(x) of a particle diffusing over a potential energy landscape E(x) is the Boltzmann distribution p(x) = exp[-E(x)/kBT]/Z, where kB is the Boltzmann constant, T is the temperature and Z is a normalization constant. Given the probability density function (PDF) p(x), the effective energy can be inferred from $E(x) = -k_B T \ln\left[p(x)/p_{\max}\right]$ (1), where pmax is the maximum value of the PDF, included to fix the minimum energy at zero. Our goal is to estimate the MFPTs between minima on the landscape using only sampled data. We divide this task into three steps, as illustrated in Fig. 1 for test data (Supplementary Information). In the first step, we approximate the empirical PDF by using the expectation maximization algorithm to fit a Gaussian mixture model (GMM) in a space of sufficiently large dimension d (Methods, Fig. 1A). Mixtures with a bounded number of components can be recovered in time polynomial in both d and the required accuracy32. The resulting GMM yields an analytical expression for E(x) via Eq. (1).
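As a minimal illustration of this first step, the sketch below converts a mixture density into an effective energy by inverting the Boltzmann relation, E = -kBT ln(p/pmax). The one-dimensional mixture parameters, the grid and the function names are hypothetical and chosen only for illustration; fitting the mixture by expectation maximization is not shown.

```python
import math

def gmm_pdf_1d(x, weights, means, sigmas):
    """Density of a one-dimensional Gaussian mixture at position x."""
    return sum(
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, m, s in zip(weights, means, sigmas)
    )

def effective_energy(p, p_max, kBT=1.0):
    """E = -kBT * ln(p / p_max), so that the global minimum has E = 0."""
    return -kBT * math.log(p / p_max)

# Hypothetical two-state mixture (illustrative parameters only).
weights, means, sigmas = [0.7, 0.3], [-1.0, 1.5], [0.4, 0.5]

# Evaluate the density on a grid and locate its maximum numerically.
grid = [-3.0 + 0.001 * i for i in range(6001)]
density = [gmm_pdf_1d(x, weights, means, sigmas) for x in grid]
p_max = max(density)
energy = [effective_energy(p, p_max) for p in density]

# Position of the global energy minimum and energy of the second state.
x_global_min = grid[density.index(p_max)]
e_second_state = effective_energy(gmm_pdf_1d(1.5, weights, means, sigmas), p_max)
```

In this toy example the heavier, narrower component defines the zero of energy, and the second component sits at an energy of roughly one kBT above it.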
In the second step, the inferred energy landscape E(x) is reduced to an MEP network whose nodes (states) are the minima of E(x) (Fig. 1B top). Each edge represents an MEP that connects two adjacent minima and passes through an intermediate saddle point (Fig. 1B). The MEPs are found using the nudged elastic band (NEB) algorithm25,26, which discretizes paths with a series of bead-spring segments (Supplementary Information).
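The NEB idea can be sketched in a few lines. The example below is a simplified illustration, assuming a hypothetical two-dimensional test landscape (not one used in this study), a central-difference tangent estimate and plain gradient descent; the function names and all numerical parameters are assumptions, and the implementation used for the results may differ in each of these choices.

```python
import numpy as np

def energy(r):
    """Hypothetical 2D landscape: minima at (-1, 0) and (1, 0), saddle at (0, 1)."""
    x, y = r[..., 0], r[..., 1]
    return (x**2 - 1.0) ** 2 + 2.0 * (y - (1.0 - x**2)) ** 2

def gradient(r):
    """Analytical gradient of the test landscape."""
    x, y = r[..., 0], r[..., 1]
    u = y - (1.0 - x**2)
    return np.stack([4.0 * x * (x**2 - 1.0) + 8.0 * x * u, 4.0 * u], axis=-1)

def neb(start, end, n_images=11, k_spring=1.0, step=0.01, n_steps=5000):
    """Relax a chain of images between two fixed minima towards the MEP."""
    path = np.linspace(start, end, n_images)  # straight-line initial guess
    for _ in range(n_steps):
        # Unit tangent at each interior image from its two neighbours.
        tangent = path[2:] - path[:-2]
        tangent /= np.linalg.norm(tangent, axis=1, keepdims=True)
        # True force: keep only the component perpendicular to the path.
        grad = gradient(path[1:-1])
        grad_perp = grad - np.sum(grad * tangent, axis=1, keepdims=True) * tangent
        # Spring force: acts along the tangent and evens out the image spacing.
        d_next = np.linalg.norm(path[2:] - path[1:-1], axis=1, keepdims=True)
        d_prev = np.linalg.norm(path[1:-1] - path[:-2], axis=1, keepdims=True)
        spring = k_spring * (d_next - d_prev) * tangent
        path[1:-1] += step * (spring - grad_perp)
    return path

path = neb(np.array([-1.0, 0.0]), np.array([1.0, 0.0]))
# Barrier estimate: highest image energy relative to the starting minimum.
barrier = energy(path).max() - energy(path[0])
```

For this landscape the saddle lies at (0, 1) with energy 1, so the relaxed chain should bend away from the straight line between the minima and the barrier estimate should approach 1.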
Markov state model (MSM)
Given the MEP network, the final step is to infer the rates for transitioning from a minimum α to an adjacent minimum β. Assuming overdamped Brownian dynamics, the directed transition α → β can be characterized by the generalized Kramers transition rate1 $k_{\alpha\beta} = \frac{\omega_b}{2\pi\gamma}\,\frac{\prod_{i=1}^{d}\omega_i^{\alpha}}{\prod_{i=1}^{d-1}\omega_i^{S}}\,\exp\left(-\frac{E_b}{k_B T}\right)$ (2), where γ is the effective friction, Eb is the energy difference between the saddle point S on the MEP and the minimum α, $\omega_i^{\alpha}$ are the stable angular frequencies at the minimum α, while $\omega_i^{S}$ and ωb are the stable and unstable angular frequencies at the saddle. Eq. (2) assumes isotropic friction but can be generalized to a tensorial form1 if anisotropies are relevant. In most practical applications, the error from assuming γ to be isotropic is likely negligible compared to other experimental noise sources. In principle, Eq. (2) can be refined further by including quartic (or higher) corrections to the prefactor ωb/γ to account for details of the saddle shape1. Such corrections can be significant for GMMs (Supplementary Information).
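A minimal sketch of the rate calculation, assuming the standard overdamped multidimensional Kramers form with unit mass, in which the angular frequencies are the square roots of the Hessian eigenvalue magnitudes at the minimum and at the saddle. The function name, its arguments and the example numbers are hypothetical; quartic corrections and anisotropic friction are not included.

```python
import math

def kramers_rate(min_eigs, saddle_eigs, barrier, gamma=1.0, kBT=1.0):
    """Overdamped multidimensional Kramers rate for unit mass.

    min_eigs    : Hessian eigenvalues at the minimum (all positive)
    saddle_eigs : Hessian eigenvalues at the saddle (exactly one negative)
    barrier     : E_b = E(saddle) - E(minimum)
    """
    negative = [lam for lam in saddle_eigs if lam < 0.0]
    if len(negative) != 1 or min(min_eigs) <= 0.0:
        raise ValueError("expected a minimum and a first-order saddle")
    omega_b = math.sqrt(-negative[0])  # unstable frequency at the saddle
    stable_min = math.prod(math.sqrt(lam) for lam in min_eigs)
    stable_saddle = math.prod(math.sqrt(lam) for lam in saddle_eigs if lam > 0.0)
    prefactor = omega_b / (2.0 * math.pi * gamma) * stable_min / stable_saddle
    return prefactor * math.exp(-barrier / kBT)

# Hypothetical edge: forward and backward rates share the same saddle but
# differ in the curvature of the starting minimum and in the barrier height.
k_forward = kramers_rate([24.0, 2.0], [-4.0, 4.0], barrier=1.0)
k_backward = kramers_rate([6.0, 2.0], [-4.0, 4.0], barrier=3.0)
```

Because the saddle terms cancel in the ratio, k_forward / k_backward depends only on the two minima, which is the detailed-balance property used when the rates are assembled into a Markov state model.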
Each edge (αβ) has two weights, kαβ and kβα, assigned to it. The rate matrix (kαβ) completely specifies the MSM on the network. Solving the MSM yields the matrix of pairwise mean first passage times (MFPTs) between states (Fig. 1C, Methods). In a simple two-state system, the MFPTs are determined up to a time scale by detailed balance, but for three or more states the influence of landscape topography and the associated state network topology (Methods) can lead to interesting hierarchical ordering of passage times. Identifying these hierarchies, and ways to manipulate them, is key to controlling protein folding or viral evolution pathways.
Topology-preserving dimensionality reduction
To ensure that the inference protocol can be efficiently applied to larger systems with a high-dimensional energy landscape, we derive a general method for reducing the dimension D of an energy landscape while preserving its topology. A probability density function with C well-separated Gaussians in D dimensions can be projected onto the d = C − 1 dimensional hyperplane spanning the Gaussian means using principal component analysis (PCA). In practice, it suffices to choose C to be larger than the number of energy minima if their number is not known in advance. Reduction to fewer than d = C − 1 dimensions does not, in general, allow a correct recovery of the MFPTs.
To preserve the topology under such a transformation, which is essential for the correct preservation of energy barriers and MEPs in the reduced-dimensional space, one needs to rescale GMM components in the low-dimensional space depending on the covariances of the Gaussians in the D − d neglected dimensions (Fig. 1C). Explicitly, one finds that within the subspace spanned by the retained principal components (Supplementary Information) $p(x) \propto \sum_{i=1}^{C} \phi_i \sqrt{\frac{|U_d^{\top}\Sigma_i U_d|}{|\Sigma_i|}}\; p_i(U_d^{\top}x)$ (3), as long as p satisfies certain minimally-restrictive conditions. Here, Ud denotes the first d = C − 1 columns of the matrix of sorted eigenvectors U of the covariance matrix of the Gaussian means, and ϕi, pi and Σi are the mixing weight, reduced-dimensional PDF and covariance matrix of each individual Gaussian in the mixture, respectively (Supplementary Information). Neglecting the determinant scale-factors in Eq. (3), as is often done when GMM models are fitted to PCA-projected data, generally leads to inaccurate MFPT estimates (Fig. 1C, bottom). Note that Eq. (3) does not represent inversion of the transformation performed on the data by PCA, unless all D dimensions are retained; if some dimensions are neglected, Eq. (3) represents a rescaling of the marginal distribution in the retained dimensions to reconstruct the probability density function in the original dimension. In other words, the transition rates are best recovered from the conditional (not marginal) distributions, which are given by Eq. (3) up to a constant factor that does not affect energy differences.
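The effect of the determinant scale-factors can be checked numerically. The sketch below is an illustration under stated assumptions: a hypothetical mixture with C = 2 components in D = 3 dimensions, covariances that are diagonal in a basis aligned with the line through the means (so that the rescaled density matches the full density exactly on that line), and a constant factor (2π)^(-(D-d)/2) included explicitly so the two densities can be compared directly. All names and parameter values are illustrative.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density at a single point x."""
    x, mean, cov = np.atleast_1d(x), np.atleast_1d(mean), np.atleast_2d(cov)
    diff = x - mean
    norm = np.sqrt((2.0 * np.pi) ** len(mean) * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

# Hypothetical mixture: C = 2 components in D = 3 dimensions, so d = C - 1 = 1.
weights = np.array([0.6, 0.4])
means = np.array([[-1.0, 0.5, 0.2], [1.0, 0.5, 0.2]])
covs = [np.diag([0.3, 0.1, 0.1]), np.diag([0.3, 1.0, 2.0])]
D, d = means.shape[1], len(weights) - 1

# PCA on the component means: the d leading eigenvectors form the columns of U_d.
eigvals, eigvecs = np.linalg.eigh(np.cov(means, rowvar=False))
U_d = eigvecs[:, ::-1][:, :d]

def full_pdf(x):
    """Mixture density in the original D-dimensional space."""
    return sum(w * gaussian_pdf(x, mu, c) for w, mu, c in zip(weights, means, covs))

def reduced_pdf(y, rescale=True):
    """Mixture density in the reduced space, with or without determinant factors."""
    total = 0.0
    for w, mu, c in zip(weights, means, covs):
        c_d = U_d.T @ c @ U_d  # covariance of the projected (marginal) Gaussian
        factor = 1.0
        if rescale:
            ratio = np.linalg.det(c_d) / np.linalg.det(c)
            factor = np.sqrt(ratio / (2.0 * np.pi) ** (D - d))
        total += w * factor * gaussian_pdf(y, U_d.T @ mu, c_d)
    return total

# Energy difference (in units of kBT) between the two states, three ways.
y0, y1 = U_d.T @ means[0], U_d.T @ means[1]
gap_full = np.log(full_pdf(means[0]) / full_pdf(means[1]))
gap_rescaled = np.log(reduced_pdf(y0) / reduced_pdf(y1))
gap_marginal = np.log(reduced_pdf(y0, rescale=False) / reduced_pdf(y1, rescale=False))
```

Here the second component is much broader in the neglected dimensions, so the plain marginal distribution underestimates the energy gap between the two states by several kBT, whereas the rescaled density reproduces the gap of the full landscape.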
Dimensionality reduction can substantially improve the efficiency of the NEB algorithm step: when the MEPs in the reduced d-dimensional space have been computed, the identified minima and saddles can be transformed back into the original data dimension D to calculate the Hessian matrices at these points, allowing Kramers’ rates to be calculated as usual (Fig. 1C, Supplementary Information). Alternatively, in specific situations where the MEPs lie outside the hyperplane spanning the means (Supplementary Information), the MEP in the reduced d-dimensional space can be transformed back to the D-dimensional space and used as an initial condition in that space, significantly reducing computational cost. These results present a step towards a general protocol for identifying reaction coordinates or collective variables for projection of a high-dimensional landscape onto a reduced space while quantitatively preserving the topology of the landscape.
Protein folding
To illustrate the practical potential of the above scheme, we demonstrate the successful recovery of several protein folding pathways, using data from previous large-scale molecular dynamics (MD) simulations31. The protein trajectories, consisting of the time-dependent coordinates of the alpha carbon backbone, were pre-processed, treated as a set of static equilibrium measurements, and reduced in dimension before fitting a GMM (Methods). As is typical for high-dimensional parameter estimation with few structural assumptions, the fitting error due to a finite sample size n grows with the dimension d; the approximate scaling is given in the Supplementary Information, and Refs.33,34 describe advanced techniques for tackling sample size limitations. Here, d < 10, so the sample size n ∼ 10^5 suffices for effective recovery (Methods, Supplementary Information).
For each of the four analyzed proteins Villin, BBA, NTL9 and WW, the reconstructed energy landscapes reveal multiple states including a clear global minimum corresponding to the folded state (Fig. 2A,B). To estimate MFPTs, we determined the effective friction γ in Eq. (2) for each protein from the condition that the line of best fit through the predicted vs. measured MFPTs has unit gradient. Although not usually known, γ could in principle be calculated by comparing MD simulations with experimental data. Our MFPT predictions agree well with direct estimates (Supplementary Information) from the time-dependent MD trajectories (Fig. 2C). Detailed analysis confirms that the MFPT estimates are robust under variations of the number of Gaussians used in the mixture (Fig. S1). Also, the estimated MEPs are in good agreement with the typical transition paths observed in the MD trajectories (Fig. S2).
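The friction calibration can be sketched as follows. Because every Kramers rate is proportional to 1/γ, all predicted MFPTs scale linearly with γ, so it is enough to compute them once at γ = 1 and rescale. The sketch assumes an ordinary least-squares fit with intercept, on linear axes, of predicted against measured MFPTs; this is one possible reading of the unit-gradient condition, and the function name and the numbers are hypothetical.

```python
def fit_friction(tau_unit, tau_measured):
    """Return gamma such that regressing gamma * tau_unit on tau_measured has slope 1.

    tau_unit     : MFPTs predicted with gamma = 1
    tau_measured : MFPTs estimated directly from trajectories
    """
    n = len(tau_measured)
    mean_x = sum(tau_measured) / n
    mean_y = sum(tau_unit) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(tau_measured, tau_unit))
    var_x = sum((x - mean_x) ** 2 for x in tau_measured)
    # OLS slope of (gamma * tau_unit) against tau_measured is gamma * cov_xy / var_x.
    return var_x / cov_xy

# Hypothetical MFPTs in arbitrary time units (illustrative values only).
measured = [1.2, 3.5, 0.8, 6.1, 2.4]
unit_friction = [m / 2.5 for m in measured]
gamma = fit_friction(unit_friction, measured)
```

Only a single global time scale is fitted in this way; the relative ordering of the MFPTs is fixed by the landscape and is unaffected by the choice of γ.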
Viral evolution
As a second proof-of-concept application, we demonstrate that our inference scheme recovers the expected evolution pathways between HIV sequences as well as the key features of a distance-based phylogenetic tree (Fig. 3). To this end, we reconstructed an effective energy landscape from publicly available HIV sequences sampled longitudinally at several points in time from multiple patients35, assuming that the frequency of an observed genotype is proportional to its probability of fixation and that the high-dimensional discrete sequence space can be projected onto a continuous reduced-dimensional phenotype space (Fig. 3A; Supplementary Information). First, a Gaussian was fit to the sequences from each patient, and these Gaussians were then combined in a GMM with equal weights, to avoid bias in the fitness landscape towards sequences infecting any specific patient (Supplementary Information). Thereafter, we applied our inference protocol to reconstruct the effective energy landscape, transition network (Fig. 3B) and disconnectivity graph (Fig. 3C), where each state is associated with a separate patient. As expected, states corresponding to patients infected with different HIV subtypes are not connected by MEPs (Fig. 3A,B). The disconnectivity graph reproduces the key features of a coarse-grained patient-level representation of the phylogenetic tree (Fig. 3C). Using our inference scheme, vertical evolution in the tree can be tracked along the minimum energy paths in a reduced-dimensional sequence space (Fig. 3B). The energy barriers, represented by the lengths of the vertical lines in the disconnectivity graph (Fig. 3C), provide an estimate for the relative likelihood of evolution to fixation via point mutations between fitness peaks (energy minima). If mutation rates are known, the MEPs can also be used to estimate the time for evolution to fixation from one fitness peak to another36.
DISCUSSION
Preserving landscape topology under dimensionality reduction
Finding the appropriate number of collective macrovariables to describe an energy landscape is a generic problem relevant to many fields. For example, although some proteins can be described through effective one-dimensional reaction coordinates5, the accurate description of their diffusive dynamics over the full microscopic energy landscape requires many degrees of freedom37. Whenever dynamics are inherently high-dimensional, topology-preserving dimensionality reduction can enable a much faster search of the energy landscape for minima and MEPs. In practice, data dimension is often reduced with PCA or similar methods before constructing an energy landscape37,38. The extent to which commonly used dimensionality reduction techniques alter MEP network topology or quantitatively preserve energy barriers is not well understood. Eq. (3) suggests that reducing dimensions using PCA should not introduce significant errors if the variance of the landscape around each state (energy minimum) in the neglected dimensions is similar. For instance, we found that the protein folding data could be reduced to five dimensions while maintaining accuracy (Fig. S1), although additional higher energy states may become evident in higher dimensions. Overall, our theoretical results demonstrate the benefits of combining an analytical PDF with a linear dimensionality reduction technique so that the neglected dimensions can be accounted for explicitly.
Biological and biophysical applications
Rapidly advancing imaging techniques, such as cryogenic electron microscopy (cryo-EM), will allow many snapshots of biophysical structures to be taken at the atomic level in the near future3. A biologically and biophysically important task will be to infer dynamical information from such instantaneous static ensemble measurements. The protein folding example in Fig. 2 suggests that the framework introduced here can help overcome this major challenge. Another promising area of future application is the analysis of single-cell RNA-sequencing data quantifying the expression within individual cells18. In related recent work, an effective energy landscape of single-cell expression snapshots was inferred using the Laplacian of a k-nearest neighbor graph on the data, allowing lineage information to be derived via a Markov chain13. The GMM-based framework here provides a complementary approach for reconstructing faithful low-dimensional transition state dynamics from such high-dimensional data.
Furthermore, the proof-of-concept results in Fig. 3 suggest that our inference scheme for Markovian network dynamics can be useful for studying viral and bacterial evolution, which are often modeled as movements through a series of DNA or protein sequences39. The fitness landscape of an organism in sequence space is analogous to the negative of an effective energy landscape. The process of fixation by a succession of mutants in a population, whereby each mutant replaces the previous lineage as the population’s most recent common ancestor, has been modeled as a Markov process40. Successive sweeps to fixation have been observed in long-term evolution experiments, promising groundbreaking data for future analysis as whole-genome sequencing technologies improve41.
Outlook and extensions
The inference protocol opens the possibility to analyze previously intractable multi-phase systems: many high-dimensional physical, chemical and other stochastic processes can be described by a Fokker-Planck dynamics1, with phase equilibria corresponding to maxima of the stationary distribution. By taking near-simultaneous measurements of many subsystems within a large multistable Fokker-Planck system, the above scheme allows the inference of coexisting equilibria and transition rates between them. Other possible applications may include neuronal expression9 and social networks19, which have been described in terms of effective energy landscapes.
While we focused here on normal white-noise diffusive behavior, as is typical of protein folding dynamics, the above ideas can in principle be generalized to other classes of stochastic exploration processes. Such extensions will require replacing Eq. (2) with suitable generalized rate formulas, as have been derived for correlated noise1,42. Conversely, the present framework provides a means to test for diffusive dynamics: if the MFPTs of an observed system differ markedly from those inferred by the above protocol, then either important degrees of freedom have not been measured, the system is out of equilibrium on measurement time scales, or the system does not have Brownian transition statistics, necessitating further careful investigation of its time dependence.
To conclude, the conformational dynamics of biophysical structures such as viruses and proteins are characterized by their metastable states and associated transition networks, and can often be captured through Markovian models. Current experimental techniques, such as cryo-EM or RNA-sequencing, provide limited dynamical information. In these cases, transition networks must be inferred from structural snapshots. Here, we have introduced a numerical framework for inferring Markovian state-transition networks via reconstructed energy landscapes from high-dimensional static data. The successful application to protein folding and viral evolution pathways illustrates that high-dimensional energy landscapes can be reduced in dimension without losing relevant topological information. Generally, the inference scheme presented here is applicable whenever the dynamics of a high-dimensional physical, biological or social system can be approximated by diffusion in an effective energy landscape.
METHODS
Population landscapes
A Gaussian mixture model (GMM) was used to represent the probability density function (PDF), or population landscape, of samples. The PDF at position x of a GMM with C mixture components in d dimensions is $p(x) = \sum_{i=1}^{C} \phi_i \frac{\exp\left[-\frac{1}{2}(x-\mu_i)^{\top}\Sigma_i^{-1}(x-\mu_i)\right]}{\sqrt{(2\pi)^{d}\,|\Sigma_i|}}$, where ϕi are the weights of each component, µi are the means and Σi are the covariance matrices. More details on GMMs and how they were fit to data are given in the Supplementary Information.
Mean first passage times
We form a discrete-state continuous-time Markov chain on states given by the minima of the energy landscape. For a pair of states α and β directly connected by a minimum-energy pathway via a saddle, we approximate the transition rate α → β by the Kramers rate kαβ in Eq. (2), while if α and β are not directly connected we set kαβ = 0. Given these rates, the Markov chain has generator matrix Mαβ, where Mαβ = kαβ for α ≠ β and $M_{\alpha\alpha} = -\sum_{\beta \neq \alpha} k_{\alpha\beta}$. Then the matrix ταβ of MFPTs (hitting times) for transitions α → β satisfies $\tau_{\beta\beta} = 0$ and $\sum_{\gamma} M_{\alpha\gamma}\,\tau_{\gamma\beta} = -1$ for α ≠ β.
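The MFPT calculation amounts to one small linear solve per target state. The sketch below uses the standard hitting-time equations for a continuous-time Markov chain: the MFPT into a state from itself is zero, and for every other starting state the generator applied to the MFPT vector equals -1. The function name and the example rate matrix are hypothetical.

```python
import numpy as np

def mfpt_matrix(rates):
    """Return tau with tau[a, b] = mean first passage time from state a to state b.

    rates[a, b] is the transition rate k_ab (zero if a and b are not connected).
    """
    k = np.array(rates, dtype=float)
    np.fill_diagonal(k, 0.0)
    n = k.shape[0]
    M = k - np.diag(k.sum(axis=1))  # generator matrix: each row sums to zero
    tau = np.zeros((n, n))
    for b in range(n):
        others = [a for a in range(n) if a != b]
        # Restrict the generator to the non-target states and solve
        # sum_c M[a, c] * tau[c, b] = -1, with tau[b, b] = 0.
        sub = M[np.ix_(others, others)]
        tau[others, b] = np.linalg.solve(sub, -np.ones(n - 1))
    return tau

# Hypothetical linear chain 0 <-> 1 <-> 2 with unit rates on every edge.
chain_rates = [[0.0, 1.0, 0.0],
               [1.0, 0.0, 1.0],
               [0.0, 1.0, 0.0]]
tau = mfpt_matrix(chain_rates)
```

For this chain, reaching state 2 from state 0 requires passing through state 1, which can also step back, so tau[0, 2] is three time units even though each individual jump takes one unit on average.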
Protein data pre-processing
Protein folding trajectories were obtained from all-atom molecular dynamics (MD) simulations performed by D.E. Shaw Research31. Data were subsampled by a factor of 5 to reduce the size. For some proteins, residues at the flexible tails were removed from the dataset to reduce noise. Pairwise distances between alpha-carbon atoms on the protein backbone were taken, with a cut-off of 6–8 Å, depending on the size of the protein. Samples were reduced in dimension using principal component analysis (PCA). The first five principal components of the protein data were found to be sufficient for inference of energy landscapes and transition networks (Fig. S1).
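The pre-processing steps can be sketched as follows, assuming coordinates stored as an array of shape (frames, atoms, 3). The synthetic random coordinates stand in for real alpha-carbon positions, the function names are hypothetical, and the distance cut-off and tail trimming are omitted because their exact application is not specified here.

```python
import numpy as np

def pairwise_distance_features(coords):
    """Map coordinates (n_frames, n_atoms, 3) to the unique pairwise distances."""
    diff = coords[:, :, None, :] - coords[:, None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    i, j = np.triu_indices(coords.shape[1], k=1)  # each atom pair once
    return dist[:, i, j]

def pca_reduce(features, n_components=5):
    """Project centered features onto their leading principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
trajectory = rng.normal(size=(1000, 10, 3))  # synthetic stand-in for MD frames
snapshots = trajectory[::5]                  # subsample by a factor of 5
features = pairwise_distance_features(snapshots)
reduced = pca_reduce(features, n_components=5)
```

The rows of `reduced` are then treated as static samples, with the time ordering discarded, and would be passed to the mixture-model fit.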
Code availability
The source code used in this study to learn a dynamical transition network and mean first passage times from a Gaussian mixture model is publicly available from GitHub (https://github.com/philippearce/learning-dynamical). Also included are all data processing codes required to convert the raw data used in this study into the appropriate format.
Data availability
Two publicly available datasets were used in this study. Protein folding trajectories31 are available from D.E. Shaw Research (https://www.deshawresearch.com/). HIV sequences35 are available from https://hiv.biozentrum.unibas.ch/.
ACKNOWLEDGMENTS
We thank D.E. Shaw Research for protein folding trajectories and Stefano Piana-Agostinetti of D.E. Shaw Research for helpful discussions. This work was supported by the Royal Society International Exchanges award IE160909 (H.K. and J.D.) and Complex Systems Scholar Award from the James S. McDonnell Foundation (J.D.).