Abstract
Population dynamics can be inferred from genetic sequence data using phylodynamic methods. These methods typically quantify the dynamics in unstructured populations or assume the parameters describing the dynamics to be constant through time in structured populations. Inference methods that allow for structured populations and for parameters that vary through time involve many parameters that have to be inferred, each of which might be only weakly informed by the data. Here we introduce an approach that uses so-called predictors, such as geographic distance between locations, within a generalized linear model to inform the population dynamic parameters, namely the time-varying migration rates and effective population sizes under the marginal approximation of the structured coalescent. Using simulations, we show that we are able to reliably infer the parameters from phylogenetic trees. We then apply this framework to a previously described Ebola virus dataset. We infer incidence to be the strongest predictor for effective population size and geographic distance the strongest predictor for migration. This allows us to show, not only on simulated data but also on real data, that we are able to identify reasonable predictors. Overall, we provide a novel method that allows us to identify predictors for migration rates and effective population sizes and to use these predictors to quantify migration rates and effective population sizes. Its implementation as part of the BEAST2 software package MASCOT allows joint inference of population dynamics within structured populations, the phylogenetic tree, and evolutionary parameters.
Introduction
Genetic sequence data can be used to reconstruct the shared evolutionary history or phylogenetic tree of pathogens. These trees are shaped by migration and transmission dynamics, which in turn can be quantified from these trees by using so-called phylodynamic methods. These methods however typically assume that all sequences are from the same well-mixed population.
Methods that account for population structure, such as the structured coalescent (Takahata, 1988; Hudson, 1990; Notohara, 1990), allow us to infer how lineages coalesce within sub-populations and migrate between them. This is done by inferring effective population sizes and migration rates, with the effective population sizes being related to transmission dynamics (Volz et al., 2009a). Even when only considering parameters that are constant through time, the number of parameters to estimate grows quadratically with the number of sub-populations. When additionally allowing these parameters to change through time using mathematically tractable skyline approaches, this number has to be multiplied by the number of time points considered. The information in a single phylogenetic tree can, however, be too limited to inform all these parameters. This problem is not limited to the structured coalescent process but also applies to structured birth-death models (Stadler and Bonhoeffer, 2013; Kühnert et al., 2016).
Alternatively, one can make use of additional data, such as transportation data, that potentially predict these parameters by using so-called generalized linear models (GLM) (Lemey et al., 2014). This has, for example, been done to study the cross-species transmission of bat rabies virus (Faria et al., 2013) and the spatial spread of dog rabies virus through rural Tanzania (Brunker et al., 2017). It was further used to study Ebola virus dissemination throughout West Africa (Dudas et al., 2017) and the spread of Dengue virus in the Americas (Nunes et al., 2014). The approach has also been extended to inform effective population sizes through time in unstructured populations (Gill et al., 2016).
The underlying migration model (Lemey et al., 2009) used in these structured models, although computationally feasible, relies on simplifying assumptions about the tree generating process. Namely, it is assumed that the process which generated the phylogenetic tree is independent of the migration process. This essentially assumes that any two lineages have the same probability of coalescing, no matter where they are. This in turn can lead to biased estimates of migration rates, for example when sampling is biased (De Maio et al., 2015). Additionally, since only the migration process is modelled, information about the coalescent process in different sub-populations cannot be incorporated.
The structured coalescent (Takahata, 1988; Hudson, 1990; Notohara, 1990) on the other hand does not make this independence assumption. This enables us to model the tree generating process by using coalescence within and migration between sub-populations. It however only allows a very limited number of different sub-populations to be considered (Vaughan et al., 2014), due to computational issues (De Maio et al., 2015).
In order to allow for structured models with many different sub-populations and parameters that change through time, alternative approaches have been developed (Volz, 2012). By formally integrating over every possible migration history instead of sampling migration histories, these approaches allow scenarios with more parameters to be considered (De Maio et al., 2015). These approaches have, however, been subject to strong biases due to simplifying assumptions that were initially not accounted for (Müller et al., 2017). The marginal approximation of the structured coalescent, on the other hand, allows us to integrate over every possible migration history while avoiding such biases (Müller et al., 2017, 2018).
We here introduce an approach based on this approximation that infers the time-varying effective population sizes of and the migration rates between different sub-populations from predictor and sequence data, with predictor data characterizing one location (e.g. population size, location) or the interaction of two locations (e.g. transportation, distance). We do so by using a generalized linear model where we infer to what degree each predictor predicts effective population sizes or migration rates. Using simulations, we show that we are able to retrieve the extent to which each predictor informs the population dynamics parameters. We then apply our GLM approach to a subset of sequence data from the West African Ebola virus (EBOV) dataset (Dudas et al., 2017). This subset comprises lineages descended from the major introduction of EBOV into Sierra Leone (Dudas et al., 2017), further downsampled to only sequences collected in 2014. The Sierra Leonean lineage was sustained via intense endemic transmission, making it the dominant EBOV lineage in the entire epidemic (Dudas et al., 2017). Following its introduction into Sierra Leone, this lineage was also the source of EBOV in neighbouring Liberia and Guinea in the late stages of the epidemic (Dudas et al., 2017). Using the example of Ebola, we demonstrate that our approach is able to retrieve reasonable predictors for the migration rates and effective population sizes.
New Approaches
The marginal approximation of the structured coalescent (Müller et al., 2017) allows datasets with many different sub-populations to be considered. When one wants to consider time-varying rates alongside many different sub-populations, each individual rate in a given time interval is potentially only weakly informed by the phylogenetic tree.
To inform rates, Lemey et al. (2014) introduced a generalized linear model approach to estimate migration between different sub-populations. This allows these rates to be informed not only by the phylogenetic tree, but also by predictor data. We here apply this generalized linear model approach to inform effective population sizes and migration rates that vary through time.
Similar to Lemey et al. (2014), we define the effective population sizes and migration rates as log-linear combinations of coefficients, indicators and time-varying predictors. These predictors describe differences in effective population sizes between sub-populations or differences in migration rates between pairs of sub-populations. Some previously used predictors for migration rates include air traffic data between different locations (Lemey et al., 2014; Nunes et al., 2014) and distances between them (Dudas et al., 2017). The indicators and coefficients quantify if and to what degree each predictor contributes to predicting effective population size or migration rate differences across different sub-populations and points in time. Although the use of indicators is not strictly necessary, they allow us to use priors on the number of active predictors, thereby helping to reduce over-fitting. We implemented this approach as part of the BEAST2 (Bouckaert et al., 2012) package MASCOT (Müller et al., 2018). This allows us to co-infer indicators and coefficients from genetic sequence and predictor data alongside phylogenetic trees and evolutionary parameters.
Results
Inference of predictor contributions from phylogenetic trees
We first tested how well indicators and coefficients can be inferred from phylogenetic trees. We randomly simulated 20 time-varying migration rate and 20 time-varying effective population size predictors. Each value of each predictor at every point in time was drawn from a normal distribution with mean=0 and sigma=1. As in (Lemey et al., 2014), we standardized each predictor to have mean 0 and standard deviation 1. Next, we randomly chose 3 of the 20 migration rate and 3 of the 20 effective population size predictors to be active predictors. As such, they actually predict migration rate and effective population sizes. All other predictors are considered inactive and are used only to see if inactive predictors can be reliably identified as such. Each of the active predictors was then assigned a random coefficient from a normal distribution with mean=0 and sigma=0.5. By using equations 1 and 2, we then calculated the migration rates between every sub-population and the effective population size in every sub-population at any point in time from the active predictors and coefficients. We then used these parameters to simulate phylogenetic trees using MASTER (Vaughan and Drummond, 2013) under the exact structured coalescent with 1000 serially sampled lineages. Next, we inferred which predictors explained patterns of migration and effective population size in simulated phylogenies and their relative contributions (coefficients).
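The predictor setup described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the scripts used for the analyses: the number of values per predictor (`n_values`, standing in for all combinations of sub-populations and time points), the use of the population standard deviation for standardization, and all function names are hypothetical choices that the text does not specify.

```python
import random
import statistics


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (population SD assumed)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def simulate_predictor_setup(n_predictors=20, n_active=3, n_values=50, seed=1):
    """Draw random predictors, pick the active ones, and assign coefficients.

    n_values is a hypothetical stand-in for the number of
    (sub-population, time point) entries of each predictor.
    """
    rng = random.Random(seed)
    # Each predictor value is drawn from N(0, 1), then the predictor is standardized.
    predictors = [
        standardize([rng.gauss(0.0, 1.0) for _ in range(n_values)])
        for _ in range(n_predictors)
    ]
    # Randomly choose which predictors are active (indicator = 1).
    active = set(rng.sample(range(n_predictors), n_active))
    indicators = [1 if i in active else 0 for i in range(n_predictors)]
    # Active predictors get a coefficient from N(0, 0.5); inactive ones get 0.
    coefficients = [
        rng.gauss(0.0, 0.5) if indicators[i] else 0.0 for i in range(n_predictors)
    ]
    return predictors, indicators, coefficients


predictors, indicators, coefficients = simulate_predictor_setup()
```

The same routine would be called once for the migration rate predictors and once for the effective population size predictors, with the resulting rates then passed to the tree simulator.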
In figure 1, we show the inferred coefficient values of active predictors as well as the probability that active predictors are identified as such. The coefficients are inferred well for both migration rate and effective population size predictors. While inactive predictors are reliably excluded and predictors with strong effects (large coefficients) are reliably included, predictors with only minor effects (small coefficients) can be falsely excluded. This is however expected due to a small effect size.
2014 Ebola epidemic in Sierra Leone
We used EBOV sequences sampled in 14 different regions of Sierra Leone in 2014 (Dudas et al., 2017). As migration rate predictors, we used the same time invariant predictors as Dudas et al. (2017). Namely, we used mean travel time to the nearest major settlement of at least 100,000 inhabitants, gridded economic output, population size and density, mean annual temperature and precipitation, and index of precipitation and temperature seasonality. All these predictors can inform migration rates either from or to a particular county and are therefore called origin/destination predictors. Additionally, we added predictors to account for possible random effects of migration between counties. Furthermore, we used the distances between the different counties as a migration rate predictor. For effective population sizes, we used origin/destination predictors from Dudas et al. (2017). These predictors were not used previously for this purpose, since information about the coalescent process in different sub-populations could not be incorporated in previous approaches (Lemey et al., 2014).
Additionally, we incorporated the weekly case data of each location as a time variant predictor of the effective population size. Instead of using 0 for weeks with no reported cases, we used 0.01 so that lineages are not completely excluded from being in a location when no cases were reported there. This also avoids computational issues arising from effective population sizes being 0 and coalescent rates therefore being infinite. Figure 2 shows the inferred maximum clade credibility tree with the different colors denoting the inferred locations of the nodes calculated as shown in Müller et al. (2018). Further, predictors for the migration rates and the effective population sizes with a Bayes Factor of more than 5 are shown. Incidence is inferred to be the strongest predictor of effective population size. This is to be expected since incidence should be approximately proportional to viral effective population sizes in each location given similarity in transmission rates (Volz et al., 2009b).
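The treatment of the weekly case counts can be illustrated with a short sketch. It replaces zero counts by 0.01, moves to log space and standardizes, in line with the description of predictors elsewhere in this paper. Pooling all locations and weeks for the standardization, the function name and the example counts are assumptions made for illustration only; the counts are invented and are not the WHO data.

```python
import math
import statistics


def incidence_to_predictor(weekly_cases, floor=0.01):
    """Turn weekly case counts per location into a standardized log-space predictor.

    weekly_cases: dict mapping location name -> list of weekly case counts.
    Weeks with zero cases are replaced by `floor` before taking the log.
    Standardization pools all locations and weeks (an assumption).
    """
    logged = {
        loc: [math.log(c if c > 0 else floor) for c in cases]
        for loc, cases in weekly_cases.items()
    }
    pooled = [v for values in logged.values() for v in values]
    mean = statistics.fmean(pooled)
    sd = statistics.pstdev(pooled)
    return {loc: [(v - mean) / sd for v in values] for loc, values in logged.items()}


# Invented counts for two hypothetical districts, three weeks each.
example = incidence_to_predictor({"district_A": [0, 5, 20], "district_B": [0, 0, 3]})
```

With this transformation a week without reported cases maps to a low but finite predictor value, so the corresponding effective population size stays strictly positive.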
We infer great circle distances between different locations to be the strongest predictor for migration rates. This means that the highest migration rates are inferred between regions whose population centroids are closer together. The root of the tree is inferred to be in Kailahun with a 63% probability, with the rest of the probability mass approximately evenly distributed across the other locations. The 63% is possibly an overestimate because un-sampled locations outside Sierra Leone are not considered.
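Support for a predictor is reported above as a Bayes Factor. One common way to obtain such a value from indicator variables is to divide the posterior odds of inclusion by the prior odds of inclusion. The sketch below shows this calculation; it is an illustration rather than the exact post-processing used here, and the prior inclusion probability in the example is an assumed value, since the prior on the number of active predictors is not stated in this section.

```python
import math


def inclusion_bayes_factor(posterior_prob, prior_prob):
    """Bayes factor for including a predictor: posterior odds / prior odds.

    posterior_prob: fraction of posterior samples in which the indicator is 1.
    prior_prob: prior probability that the indicator is 1 (assumed by the user).
    """
    if not 0.0 < prior_prob < 1.0:
        raise ValueError("prior_prob must lie strictly between 0 and 1")
    if posterior_prob >= 1.0:
        return math.inf  # indicator was 1 in every posterior sample
    posterior_odds = posterior_prob / (1.0 - posterior_prob)
    prior_odds = prior_prob / (1.0 - prior_prob)
    return posterior_odds / prior_odds


# Hypothetical example: indicator active in 80% of samples, prior inclusion 20%.
bf = inclusion_bayes_factor(0.8, 0.2)
supported = bf > 5  # threshold used for the predictors shown in Figure 2
```

Because the posterior inclusion probability is estimated from a finite number of MCMC samples, very large Bayes Factors computed this way are only a lower bound on the actual support.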
Discussion
We here introduce a method that is able to infer time-varying predictors of effective population sizes and migration rates. Previous GLM approaches were restricted to time invariant (Lemey et al., 2014) and time variant (Bielejec et al., 2014) rates in models that did not allow the migration and coalescent processes to be considered jointly. Other approaches used the GLM framework to inform effective population sizes through time in unstructured populations (Gill et al., 2016). In contrast, we here model the migration and coalescent processes jointly. While other approaches that jointly infer effective population sizes from phylogenetic trees and incidence in structured populations exist (Rasmussen et al., 2011, 2014), they are currently not computationally feasible for larger datasets.
Using simulations, we show that indicators and coefficients of predictors can be inferred reliably. Predictors that do not explain migration rates or effective population sizes are reliably excluded. This, however, also applies to predictors with small effect sizes, which are often inferred to not predict effective population sizes or migration rates at all. In contrast to, for example, Gill et al. (2016), we currently do not allow for error terms in the GLM equation. We therefore essentially assume that all or a subset of the predictors fully explain the migration rates and the effective population sizes through time. Future improvements could fill that gap by allowing for such error terms. This in turn would, however, require efficient operators to sample these terms. Further, it would require the development of reasonable priors on these error terms, similar to the ones used for skyline methods (see for example Drummond et al. (2005) or Minin et al. (2008)). Furthermore, the GLM approach presented here could be applied to inform birth, death, migration and sampling rates through time for structured birth-death models (Stadler and Bonhoeffer, 2013; Kühnert et al., 2016).
Using the example of the 2014 Sierra Leone Ebola virus disease outbreak, we show that our approach is able to infer the effect size of predictors reasonably from real data as well. We infer weekly case numbers to predict effective population sizes best. For migration rate predictors, the distance between population-weighted centres of different locations is inferred to be the strongest predictor. Previously, distances have been identified as an important predictor of geographic spread for Ebola virus in West Africa, by both phylodynamic (Dudas et al., 2017) and epidemiological approaches (Kramer et al., 2016), even when only Sierra Leone is considered (Gustafson and Proctor, 2017). Overall, we infer migration rate predictors similar to those of Dudas et al. (2017), who used the panmictic and time invariant model described in Lemey et al. (2014). We expect large differences in the inference of migration rate predictors between the two approaches mainly when sampling is strongly biased (De Maio et al., 2015).
Sampling of Ebola cases was fairly dense during the outbreak. Whilst Ebola virus sequencing in West Africa has generally kept up well with increasing numbers of cases (Dudas et al., 2017), numerous locations are however known to have been under-sampled or un-sampled altogether. For example, an EBOV lineage established early in Conakry prefecture of Guinea resurfaced at least three times during the epidemic (Carroll et al., 2015; Simon-Loriere et al., 2015; Quick et al., 2016). This suggests the presence of a substantial, yet cryptic, localised transmission chain not seen outside of Conakry. It remains unclear how to treat entirely un-sampled locations and what their effects might be on internal node state reconstruction or inference of predictor importance. Future research will need to study these effects of so-called ghost states (Slatkin, 2005) on the generalized linear model approach.
Overall, this newly introduced method allows predictor data, such as transportation or incidence data, to be included in phylodynamic analyses. This allows us to infer population dynamic parameters as well as the location of ancestral nodes more reliably in a computationally tractable way. Predictor data, such as the movement of people using mobile phone data (Deville et al., 2014; Wesolowski et al., 2015) or the social mixing of different age groups (Mossong et al., 2008), is increasingly being gathered. This in turn means that methods that are able to combine various sources of information in a computationally feasible way will play an ever increasing role in epidemiology.
Methods and Material
Effective Population sizes and migration rates as generalized linear models
Instead of inferring the effective population size $Ne_a(t)$ of state a at time t directly, we define it as a log-linear combination of c different predictors $x_i^a(t)$, coefficients $\beta_i$ and indicators $\delta_i$:
$$Ne_a(t) = \beta_{Ne} \exp\left(\sum_{i=1}^{c} \beta_i \delta_i x_i^a(t)\right) \qquad (1)$$
The coefficients $\beta_i$ can be between $-\infty$ and $\infty$ and denote the extent to which each predictor contributes to predicting effective population sizes. The indicators $\delta_i$ can be 0 or 1 and denote whether a predictor contributes at all. This allows us to put prior distributions on the number of predictors that are actually used and reduces the issue of over-fitting effective population sizes with too many predictors. We therefore only have to infer the coefficients for predictors where the indicator is 1. $\beta_{Ne}$ denotes a scaling parameter, scaling every effective population size at every point in time by the same value. The different predictors are in log space and, in order to have comparable predictors, they are typically standardized such that their mean is 0 and their standard deviation is 1. The values of these predictors vary across different states a as well as different time points t. This parametrization of the generalized linear model is the same as described in Lemey et al. (2014).
We apply the same framework to the forward-in-time varying migration rates between state a and state b, using a separate set of d predictors $y_j^{ab}(t)$, coefficients $\beta_j^{m}$ and indicators $\delta_j^{m}$:
$$m^f_{ab}(t) = \beta_m \exp\left(\sum_{j=1}^{d} \beta_j^{m} \delta_j^{m} y_j^{ab}(t)\right) \qquad (2)$$
where $\beta_m$ is the overall rate scaler, describing the overall magnitude of migration. Since the structured coalescent uses backwards-in-time migration rates, we define the backwards-in-time rates as:
$$m^b_{ab}(t) = m^f_{ba}(t)\,\frac{Ne_b(t)}{Ne_a(t)} \qquad (3)$$
This equality is exact when $\alpha_a = \alpha_b$, where $Ne_a(t) = \alpha_a I_a(t)$ and $Ne_b(t) = \alpha_b I_b(t)$, with $I_a(t)$ denoting the number of infected individuals in population a at time t (Volz, 2012).
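The log-linear parametrization described in this section can be made concrete with a small numerical sketch. A rate is computed as a scaler times the exponential of the sum of coefficient, indicator and predictor products. For the conversion to backwards-in-time rates, the sketch assumes the form in which the forward rate from b to a is multiplied by the ratio of the effective population sizes of b and a, which follows from the proportionality assumption of Volz (2012). Function names and all numbers are hypothetical and are not taken from the MASCOT implementation.

```python
import math


def glm_rate(scaler, coefficients, indicators, predictor_values):
    """Log-linear GLM: scaler * exp(sum_i coefficient_i * indicator_i * predictor_i)."""
    linear = sum(
        b * d * x for b, d, x in zip(coefficients, indicators, predictor_values)
    )
    return scaler * math.exp(linear)


def backward_migration_rate(forward_rate_b_to_a, ne_a, ne_b):
    """Backwards-in-time rate from a to b, assuming m_b(a->b) = m_f(b->a) * Ne_b / Ne_a."""
    return forward_rate_b_to_a * ne_b / ne_a


# Hypothetical values: two predictors, of which only the first is active.
ne_a = glm_rate(2.0, [0.5, 1.0], [1, 0], [2.0, 3.0])   # 2 * exp(0.5 * 2.0)
ne_b = glm_rate(2.0, [0.5, 1.0], [1, 0], [0.0, -1.0])  # 2 * exp(0) = 2
m_forward_ba = glm_rate(0.1, [-0.3], [1], [1.5])        # forward rate from b to a
m_backward_ab = backward_migration_rate(m_forward_ba, ne_a, ne_b)
```

Setting an indicator to 0 removes the predictor from the sum regardless of its coefficient, which is what allows a prior on the number of active predictors to control model complexity.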
Ebola sequence and incidence data
Sequences belonging to the major Sierra Leonean Ebola virus lineage that dominated the country’s epidemic (Dudas et al., 2017) were extracted and down-sampled to sequences collected up to 31st of December 2014, leaving 473 taxa. Stretches of putative hypermutation tracts corresponding to hypothesised ADAR edits were identified and masked as described in Dudas et al. (2017).
Incidence data were compiled from the latest WHO report on Ebola virus disease (EVD) cases in Sierra Leone: http://apps.who.int/gho/data/view.ebola-sitrep.ebola-countrySLE-new-conf-prob-districs-20160511-data?lang=en. These data report the number of new EVD cases for each subnational division of Sierra Leone (district) and epi week, split by whether the cases are confirmed or probable. Additionally, due to the scale of the epidemic across the region, there are two databases for EVD incidence (an earlier patient database and later situation reports) that overlap by around a year (2014 Sep - 2015 Sep) with slightly different reported incidences. Available data are likely underestimates of the true burden of EVD in Sierra Leone. We therefore combine confirmed and probable cases and, for each epi week in which the patient database and the situation reports overlap, keep the higher of the two numbers (Dudas et al., 2017).
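The rule for merging the two incidence sources can be sketched as follows for a single district. The data layout (a mapping from epi week to a pair of confirmed and probable counts), the week labels, the function name and the example numbers are all assumptions made for illustration; the counts are invented and do not reproduce the WHO data.

```python
def combine_incidence(patient_db, sitrep_db):
    """Merge two incidence sources for one district.

    Each argument maps an epi week label to a (confirmed, probable) tuple.
    Confirmed and probable cases are summed, and where both sources report
    the same week, the higher total is kept.
    """
    combined = {}
    for week in sorted(set(patient_db) | set(sitrep_db)):
        totals = [
            sum(source[week]) for source in (patient_db, sitrep_db) if week in source
        ]
        combined[week] = max(totals)
    return combined


# Invented counts: the two sources overlap in week 2014-W41 only.
patient = {"2014-W40": (10, 2), "2014-W41": (7, 1)}
sitrep = {"2014-W41": (9, 0), "2014-W42": (4, 3)}
weekly_cases = combine_incidence(patient, sitrep)
```

Keeping the higher total in overlapping weeks reflects the assumption stated above that reported counts are more likely to underestimate than to overestimate the true number of cases.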
Software
The method above is implemented in our BEAST2 package MASCOT (Marginal Approximation of the Structured COalescenT). Simulations were performed using a backwards-in-time stochastic simulation algorithm of the structured coalescent process using MASTER 5.0.2 (Vaughan and Drummond, 2013) and BEAST 2.4.6 (Bouckaert et al., 2014). Script generation and post-processing were performed in Matlab R2015b. Plotting was done in R 3.2.3 using ggplot2 (Wickham, 2009). Plotting of the EBOV analysis was done using baltic (https://github.com/blab/baltic) and matplotlib (Hunter, 2007). Effective sample sizes for MCMC runs were calculated using coda 0.18-1 (Plummer et al., 2006).
Data availability
The source code of the BEAST 2 package MASCOT and the GLM method is available at https://github.com/nicfel/Mascot.git. All scripts for performing the simulations and analyses presented in this paper are available at https://github.com/nicfel/GLM-Material.git. Output files from these analyses that are not included in the GitHub repository are available upon request from the authors.
Acknowledgement
We would like to thank Oliver Pybus for his helpful discussion on generalized linear models. We would also like to thank Alexei Drummond and Jing Yang for their helpful comments on the implementation of the generalized linear model. NM and TS are funded in part by the Swiss National Science Foundation (SNF; grant number CR32I3_166258). TS is supported in part by the European Research Council under the Seventh Framework Programme of the European Commission (PhyPD: grant agreement number 335529). GD is supported by the Mahan postdoctoral fellowship from the Fred Hutchinson Cancer Research Center and NIH R35 GM119774-01.