Abstract
Identification of partial sweeps, which include both hard and soft sweeps that have not currently reached fixation, provides crucial information about ongoing evolutionary responses. To this end, we introduce a deep learning approach that uses a convolutional neural network for image processing, which is trained with coalescent simulations incorporating population-specific history, to discover selective sweeps from population genomic data. This approach distinguishes between completed versus partial sweeps, hard versus soft sweeps, and regions directly affected by selection versus those merely linked to nearby selective sweeps. We perform several simulation experiments under various demographic scenarios to demonstrate the performance of our deep learning classifier partialS/HIC, which exhibits unprecedented resolution for detecting partial sweeps. We also apply our method to whole genomes from eight mosquito populations sampled across sub-Saharan Africa by the Anopheles gambiae 1000 Genomes Consortium, elucidating both continent-wide patterns as well as sweeps unique to specific geographic regions. These populations have experienced intense insecticide exposure over the past two decades, and we observe a strong overrepresentation of sweeps at insecticide resistance loci. Our analysis thus provides a catalog of candidate adaptive loci that may aid mosquito control efforts. More broadly, the success of our supervised machine learning approach introduces a powerful method to distinguish between completed and partial sweeps, as well as between hard and soft sweeps, under a variety of demographic scenarios. As whole-genome data rapidly accumulate for a greater diversity of organisms, partialS/HIC addresses an increasing demand for useful selection scan tools that can track in-progress evolutionary dynamics.
Author Summary
Recent successful efforts to reduce malaria transmission are in danger of collapse due to evolving insecticide resistance in the mosquito vector Anopheles gambiae. We aim to understand the genetic basis of current adaptation to vector control efforts by deploying a novel method that can classify multiple categories of selective sweeps from population genomic data. In recent years, there has been great progress made in the identification of completed selective sweeps through the use of supervised machine learning (SML), but SML techniques have rarely been applied to partial or ongoing selective sweeps. Partial sweeps represent an important facet of evolution as they reflect present-day selection and thus may give insight into future dynamics. However, the genomic impact left by partial sweeps is more subtle than that left by completed sweeps, making such signatures more difficult to detect. To this end, we extend a recent SML method to partial sweep inference and apply it to elucidate ongoing selective sweeps from Anopheles population genomic samples.
Introduction
Malaria represents an enormous burden on human health, with an estimated 214 million cases and 438,000 deaths in 2015 [1]. As mosquitoes of the Anopheles gambiae species complex are the major vector for Plasmodium parasites, roughly 70% of global malaria relief budgets have been focused on mosquito control, including insecticide-treated bed nets, indoor residual spraying, and larval control through the direct modification of habitats as well as the application of larvicide. While these vector control efforts have successfully produced major reductions of malaria transmission rates over the past 15 years [1], there has been an alarming increase in mosquitoes resistant to insecticides, specifically pyrethroids, observed across nearly all areas of the world covered by anti-malarial efforts [2]. Pyrethroids are the only class of insecticide used in long-lasting insecticidal nets and are applied in many indoor spraying programs; thus, the evolutionary innovation of resistance is a well-recognized Achilles heel of anti-malarial efforts.
The increase in insecticide resistant mosquitoes is to be expected from an evolutionary standpoint: anti-malaria control efforts exert a strong selective pressure to which mosquito populations will respond through the differential survivorship and reproduction of those individuals that can best cope with the applied insecticides. Pyrethroid resistance was reported within African malaria vectors first in Sudan during the 1970s, then later in West Africa during the 1990s, most likely stemming from accidental exposure of mosquitos to crop applications of pyrethroids [3,4]. Subsequent analysis showed this earliest resistance to be a result of mutations in the knockdown resistance locus kdr, which is known to contribute to pyrethroid resistance in other insect species [5]. Mutations conferring resistance at kdr as well as other loci have since spread throughout Africa, and threaten to nullify the gains in malaria control achieved over the past decade [6]. While control efforts are now looking toward non-pyrethroid insecticides [2,7] as well as gene drive technologies [8], it is anticipated that resistance to these control modalities will eventually evolve as well [9,10]. Hence, an important goal in the continued fight against malaria is to identify genomic targets of resistance in Anopheles, especially in such a way that might inform vector managers in the field.
Alleles that confer resistance to control efforts should rapidly increase in frequency within Anopheles populations in a manner consistent with selective sweeps. When an allele increases in frequency under selection, its linked genetic background comes with it in a process known as genetic hitchhiking. Selective sweeps, through this hitchhiking effect, lead to decreased levels of polymorphism [11–13], skewed allele frequency spectra [14,15], and increases in linkage disequilibrium surrounding the site under selection [16]. Classically, methods for finding sweeps have focused on a particular aspect of genetic variation, for instance observing the site frequency spectrum (SFS) at a locus and comparing it to expectations under neutrality and selective sweeps [19]. More recently, the field has made excellent progress in combining signals across multiple features of genetic variation through supervised machine learning (SML) [20–27], which has substantially improved power, accuracy, and robustness in what have been stubbornly difficult inference problems within population genetics [28]. While much attention has been paid to applying SML for the identification and classification of completed selective sweeps in the genome [24,26], less effort has been devoted to using SML to identify sweeps that are incomplete within a population, sometimes called partial sweeps (although see Sugden et al. [27] for a recent example). In these cases, the beneficial allele is not currently fixed within the population, thereby creating a weaker hitchhiking effect in comparison to a completed sweep, and accordingly a more subtle perturbation of patterns of genetic variation [29]. Partial sweeps may arise when recently initiated selective forces cause ongoing adaptation, when directional selection ceases prior to fixation, or when an intermediate allele frequency is favored by balancing, polygenic, and/or pleiotropic selection.
Here we introduce partialS/HIC, an extension of S/HIC [24] and diploS/HIC [26] that includes both hard and soft partial sweeps along with their associated linked classes (i.e. regions adjacent to either a partial hard or soft sweep) as selection states for which a genomic window can be classified. We apply our extended method to data from phase I of the Anopheles gambiae 1000 genomes project (Ag1000G) [6], with particular emphasis on discovering sweeps currently in progress that might be the result of vector control efforts such as insecticide spraying. We find that our method is considerably more powerful for finding partial sweeps than iHS [30], even in the face of complex population size histories such as those found among the Ag1000G samples. Application of our method to the Ag1000G data reveals a large number of partial sweeps as well as completed sweeps from standing genetic variation. Moreover, we find that our sweep candidates are highly enriched for loci that have been previously identified as contributing to insecticide resistance.
Results
Coalescent simulations of feature vector images for partialS/HIC training
We aimed to classify genomic segments into one of nine states: unaffected by selection (i.e. neutral); containing a completed hard, completed soft, partial hard, or partial soft sweep, respectively; or linked to a completed hard, completed soft, partial hard, or partial soft sweep, respectively (Figure 1). To this end, we developed partialS/HIC, a SML classifier that uses a deep convolutional neural network (CNN) to classify a genomic window, which is represented by a two-dimensional matrix constructed from a large collection of summary statistics. To train partialS/HIC, we deployed the program discoal [31] to perform coalescent simulations of completed and partial as well as hard and soft selective sweeps, along with simulations without sweeps, in a manner analogous to Schrider and Kern [24]. This was conducted for each of eight population size histories corresponding to the empirical Ag1000G population datasets, which were previously inferred as part of the initial data release [6]. These Anopheles population datasets from Miles et al. [6] are labeled here as AOM (A. coluzzii from Angola), BFM (A. coluzzii from Burkina Faso), BFS (A. gambiae from Burkina Faso), CMS (A. gambiae from Cameroon), GAS (A. gambiae from Gabon), GNS (A. gambiae from Guinea), GWA (Anopheles of uncertain species from Guinea-Bissau), and UGS (A. gambiae from Uganda). Individual simulations were converted into two-dimensional matrices, or feature vector images, built from 89 rows corresponding to different summary statistics, and 11 columns corresponding to adjacent sub-windows. The 89 statistics include 17 that are implemented in diploS/HIC along with 72 derivatives of the recently developed SNP-specific SAFE statistic [32]. We defined the four completed hard/soft and partial hard/soft selective sweep states as containing a sweep within the central, focal sub-window. 
In contrast, the four linked selection states were defined as having a sweep of the given type within one of the remaining ten sub-windows. Heatmaps constructed from median values across simulations reveal expected spatial patterns, such that values immediately flanking a sweep are substantially different than those further from the focal sub-window, while neutral regions display no discernible pattern among sub-windows (Figure S1). Additionally, spatial patterns of statistics differ qualitatively between selection states. These observations are consistent regardless of mosquito population history, suggesting that there is signal within this collection of summary statistics to isolate the location of a sweep to a specific sub-window as well as distinguish among neutral regions and types of selective sweeps.
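The class definitions above can be illustrated with a minimal Python sketch that assigns one of the nine selection states to a simulated window, given the sweep type and the sub-window containing the selected site, and checks that a feature image has the stated 89 x 11 shape. The function names (`selection_state`, `all_states`, `check_feature_image`) and the zero-based sub-window indexing are assumptions made for illustration; they are not taken from the partialS/HIC code base.

```python
N_STATISTICS = 89           # rows: summary statistics (from the text)
N_SUBWINDOWS = 11           # columns: adjacent sub-windows (from the text)
FOCAL = N_SUBWINDOWS // 2   # central sub-window, zero-based (assumed indexing)

SWEEP_TYPES = (
    "completed hard",
    "completed soft",
    "partial hard",
    "partial soft",
)


def selection_state(sweep_type, sweep_subwindow=None):
    """Return the class label for one simulated window.

    sweep_type: None for a neutral simulation, otherwise one of SWEEP_TYPES.
    sweep_subwindow: zero-based index of the sub-window holding the selected
    site (ignored for neutral simulations).
    """
    if sweep_type is None:
        return "neutral"
    if sweep_type not in SWEEP_TYPES:
        raise ValueError(f"unknown sweep type: {sweep_type!r}")
    if sweep_subwindow is None or not 0 <= sweep_subwindow < N_SUBWINDOWS:
        raise ValueError("sweep_subwindow must be in 0..10")
    # A sweep in the focal sub-window is a sweep state; a sweep in any of
    # the ten remaining sub-windows is the corresponding linked state.
    if sweep_subwindow == FOCAL:
        return sweep_type
    return f"linked {sweep_type}"


def all_states():
    """Enumerate the nine states: neutral, four sweeps, four linked classes."""
    return ["neutral"] + list(SWEEP_TYPES) + [f"linked {t}" for t in SWEEP_TYPES]


def check_feature_image(image):
    """Return True if `image` is a list of 89 rows with 11 values each."""
    return len(image) == N_STATISTICS and all(
        len(row) == N_SUBWINDOWS for row in image
    )
```

For example, `selection_state("partial hard", 5)` yields the partial hard sweep state, whereas the same sweep type placed in sub-window 0 yields the linked partial hard state.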
Deep learning excels in detecting selective sweeps, including partial hard sweeps
We utilized partialS/HIC to train a separate CNN for nine-state classification on each of the eight demographic histories associated with the Ag1000G population samples (Figures 1–2). In order to assess accuracy, each CNN was subsequently tested against another set of simulated data that was generated under the same specifications as the training dataset. Among the eight test sets, there was moderate overall accuracy for this simulation experiment (median accuracy = 66.4%; Table S1). However, confusion matrix heatmaps provide a more informative view of our classifier’s performance, which was generally sufficient for identifying neutral regions, completed hard and soft sweeps, partial hard sweeps, and regions linked to completed hard/completed soft/partial hard sweeps (Figures 3, S2). Assignment accuracy was highest for completed hard sweeps in all demographic scenarios save for AOM (median accuracy = 96.0%). We also had excellent accuracy for identification of linked completed hard regions, demonstrating a strong ability to localize completed hard sweeps. Accuracy for completed soft sweeps was also quite good (median accuracy = 84.2%); when misclassification did occur, it was generally to either the neutral or partial soft sweep state, thus completed soft sweeps were rarely incorrectly classified as hard sweeps. Moreover, sub-windows linked to completed soft sweeps beyond one sub-window away had low levels of misclassification to one of the non-linked sweep states, again allowing for excellent localization of the sweep.
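To make the accuracy metrics reported here concrete, the following Python sketch computes a row-normalized confusion matrix, per-class accuracy, overall accuracy, and the rate at which neutral examples are assigned to a set of sweep states. It is a generic illustration of these standard quantities and not the evaluation code used in this study; all function names are hypothetical.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels, classes):
    """Row-normalized confusion matrix as a nested dict.

    matrix[t][p] is the fraction of test examples of true class t that
    were assigned to class p. Rows with no examples are filled with 0.0.
    """
    counts = Counter(zip(true_labels, predicted_labels))
    matrix = {}
    for t in classes:
        row_total = sum(counts[(t, p)] for p in classes)
        matrix[t] = {
            p: (counts[(t, p)] / row_total if row_total else 0.0)
            for p in classes
        }
    return matrix


def per_class_accuracy(matrix):
    """Diagonal of the row-normalized matrix: accuracy for each true class."""
    return {c: matrix[c][c] for c in matrix}


def overall_accuracy(true_labels, predicted_labels):
    """Fraction of all test examples assigned to their true class."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def neutral_false_positive_rate(matrix, sweep_states, neutral="neutral"):
    """Fraction of neutral examples assigned to any of the given sweep states."""
    return sum(matrix[neutral][s] for s in sweep_states)
```

Row normalization is used so that each row sums to one regardless of how many test examples a class has, which is what allows accuracies to be compared across classes.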
Importantly, the purpose of partialS/HIC is to extend our state space to identify ongoing selective sweeps while distinguishing these from completed sweeps. We find that our ability to identify partial hard sweeps was generally strong across population histories (median accuracy = 74.6%) and is often comparable to that of completed soft sweeps. However, localization of partial hard sweeps along the chromosome was more difficult than for completed sweeps, as can be seen from the moderate levels of confusion between partial hard sweep and linked partial hard sweep sub-windows. Undoubtedly, this is due to the limited amount of time recombination has had to whittle down the haplotype carrying the beneficial mutation.
Identifying partial soft sweeps was a much more challenging task (median accuracy = 45.3%), with a high false negative rate (median rate of misclassification as neutral = 27.6%) as well as a substantial probability of misclassification as a completed soft sweep (median rate = 14.4%). It is encouraging, though, that partial soft sweeps were almost never misclassified as either a completed or a partial hard sweep. Additionally, while our accuracy in classifying partial soft sweeps was poor, false positives were not a substantial concern (median rate of misclassifying neutral regions as partial soft sweep = 3.2%). Therefore, partialS/HIC should underestimate the true number of partial soft sweeps when applied to a given dataset. Furthermore, linked partial soft sweeps beyond one sub-window away from the focal sub-window were rarely mistaken for a sweep state, and likewise partial soft sweeps were seldom confused for being linked (median rate of misclassification as linked partial soft sweep = 4.3%), thus demonstrating that localization of partial soft sweeps may be possible.
In summary, partialS/HIC has excellent ability to distinguish partial from completed sweeps for de novo mutations, and lesser yet still substantial power for sweeps from standing variation. Moreover, we demonstrated very strong performance in differentiating between hard and soft sweeps, regardless of whether a sweep was completed or incomplete. Importantly, this is all while maintaining an acceptable false positive rate across each of the population histories tested (median accuracy for neutral regions = 85.1%; median rate of misclassifying neutral regions as any one of the four non-linked sweep states = 4.6%).
Robustness to demographic model misspecification
To assess robustness to demographic misspecification, we applied a CNN trained on simulations from one population sample to data generated from an alternate demographic history (Figure 1). Specifically, we used training data from the GAS population size history, which was fairly stable over time, and leveraged it against the CMS test dataset, which experienced a dramatic population expansion (overall accuracy = 55.0%; rate of misclassifying neutral regions as any one of the four non-linked sweep states = 1.7%). Despite this misspecification, the confusion matrix (Figure S3) strongly resembles the corresponding matrix that is correctly specified for demography (i.e. for CMS in Figure S2). In particular, accuracies for finding neutral regions, completed hard sweeps, and partial hard sweeps are roughly equivalent between the correctly specified model and misspecified model (Figure S3). For soft sweeps, while confusion between completed and partial sweeps is increased for the misspecified model, the overall ability to distinguish sweeps from neutrality is largely preserved. Moreover, the rates at which examples from the linked classes were mistaken for sweeps are seemingly unaffected. Together, these results indicate that sensitivity for sweep discovery and localization was not strongly impacted by the demographic model misspecification during training.
Partial sweeps are unpredictably misclassified as either completed sweep or neutral when not explicitly considered
Since the predecessors of partialS/HIC (S/HIC and diploS/HIC) did not allow for partial sweep selection states, we were interested in how such five-state classifiers would behave when confronted with partial sweeps. To explore this, we conducted a simulation experiment that first removed partial hard and soft sweeps as well as their associated linked classes from the CNN training process, thus training on only five states rather than all nine (Figure 1). Next, to examine the classification behavior of these five-state CNNs, we applied them to the full test set that included the partial sweep classes. Unsurprisingly, the trend was for partial sweeps to be most often confused for linked selection (Figures 4, S4). Perhaps more concerning is the false negative rate (i.e. rate at which partial sweeps were misclassified as neutral), which was substantial in partial hard sweeps for several populations (median = 8.8%; max = 32.4%; >1% in all populations) and extreme in partial soft sweeps (>50% in three populations, >40% in three more populations, and >24% in all populations). Partial hard sweeps that were discovered were also often misclassified as a completed soft sweep (median rate of misclassification as completed soft sweep = 5.5%). However, when training included partial sweeps, there was a universal and dramatic improvement in both finding sweeps and correctly identifying the model of selection (Figures 3, S2). Meanwhile, overall accuracy remains similar among the five-state and nine-state classifiers with respect to simulations of neutral, completed sweep, and linked classes exclusively (Figures 4, S4; Table S2). As a result, accuracy only stands to benefit from incorporating partial sweeps into training, since ignoring such information leads to unacceptably high false negative rates of partial sweeps being called neutral or linked.
partialS/HIC binary classification outperforms a competing individual summary statistic approach
To assess whether our deep learning method extends inferential resolution beyond the signal conferred by iHS, a statistic explicitly designed for detecting partial sweeps [30], we compared receiver operating characteristic (ROC) curves for the binary classification task of broadly detecting selective sweeps (i.e. any of the four selection states involving a sweep within the focal sub-window) vs. neutral regions or linked sweeps (Figure 1). Producing a ROC curve, which plots true positive against false positive rates given varying thresholds, for partialS/HIC per population required optimizing a separate CNN, keeping the same whole training dataset as well as architecture of network layers (Figure 2) except with two final output responses (i.e. sweep in central sub-window vs. no sweep in central sub-window) instead of nine. Mean, maximum, and proportion of outlier iHS values across SNPs within the central sub-window of the training simulations were used to obtain three more respective ROC curves for each population history. All sub-window variants of the iHS statistic performed quite poorly (median AUC for: E[iHS] = 0.555; maximum iHS = 0.505), especially in contrast to the partialS/HIC binary classifier (median AUC = 0.939), in identifying selective sweeps to the focal sub-window (Figures 5, S5).
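The ROC comparison described above can be sketched in plain Python as follows. The code collapses nine-state labels into the binary task (sweep in the focal sub-window vs. everything else), builds a ROC curve by sweeping a threshold over classifier scores, and integrates it with the trapezoidal rule to obtain the AUC. This is a generic illustration and not the evaluation code used here; the function names and the label strings are assumptions.

```python
FOCAL_SWEEP_STATES = {
    "completed hard",
    "completed soft",
    "partial hard",
    "partial soft",
}


def binary_label(state):
    """1 if the state places a sweep in the focal sub-window, else 0.

    Neutral and all linked classes map to 0, as in the binary task.
    """
    return 1 if state in FOCAL_SWEEP_STATES else 0


def roc_curve(scores, labels):
    """Return ROC points [(fpr, tpr), ...] from (0, 0) to (1, 1).

    scores: higher means more sweep-like (e.g. a classifier probability
    or a summary of a statistic within the central sub-window).
    labels: 1 for positives, 0 for negatives.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score so ties yield one point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

An AUC near 0.5 corresponds to a score that ranks sweeps no better than chance, which is the reference point for interpreting the iHS values reported above.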
Soft and partial sweeps are commonplace among A. gambiae populations
Turning our attention to the Ag1000G phase I data, we applied our nine-state CNNs to the corresponding A. gambiae population datasets, classifying 5 KB segments using a 55 KB full sliding window throughout the whole genome (Figure 1). Each of the eight mosquito populations contains a large number of sub-windows identified as completed soft sweeps (median fraction of total calls genome-wide = 5.01%) as well as partial sweeps (median fraction of total calls genome-wide for partial hard sweep = 2.84%; median fraction of total calls genome-wide for partial soft sweep = 7.24%), coupled with only a handful of completed hard sweep predictions (median fraction of total calls genome-wide = 0.03%) (Figure 6; Table S3). Partial soft sweeps were typically discovered the most often (median proportion of sweep calls = 52.59%), with completed soft sweeps often following (median proportion of sweep calls = 28.80%) and partial hard sweeps usually being the third most numerous class of detected sweep (median proportion of sweep calls = 19.75%). Notably, our estimated false discovery rates (FDRs) are higher for soft sweeps (median FDR for completed soft sweeps = 11.09%; median FDR for partial soft sweeps = 12.20%) compared to hard sweeps (median FDR for completed hard sweeps = 0.00%; median FDR for partial hard sweeps = 0.39%); this implies that individual soft sweep candidates should be viewed with more caution, though we should be able to estimate the genome-wide proportion of these classes well. After false discovery correction, classifications for partial soft sweeps still outnumber those for completed soft sweeps in AOM, BFS, CMS, and UGS, as well as for partial hard sweeps in all populations but GNS. Importantly, had partial sweeps not been accounted for in the training process, our results suggest that we would have both underestimated the total number of sweeps and incorrectly labeled many of our partial sweeps (Figures 4, S4). 
This would have led to the conclusion that adaptation from standing variation rather than de novo mutations dominates selective sweep dynamics in these A. gambiae populations. While it is clear that soft sweeps are indeed more common in these data, our results suggest that hard sweeps often occur as well, though with few reaching fixation. Furthermore, the partialS/HIC classifications indicate that most selective sweeps in these population samples are incomplete, suggesting that we are capturing a view of selection in progress.
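The genome scan geometry described in this section, in which each 5 KB segment is classified from a 55 KB window of 11 sub-windows, can be sketched as a coordinate generator. The step size of one sub-window, the half-open zero-based coordinates, and the dropping of windows that would run past the chromosome end are assumptions for illustration, as is the function name.

```python
SUBWINDOW_BP = 5_000                      # classified segment size (from the text)
N_SUBWINDOWS = 11                         # sub-windows per full window
WINDOW_BP = SUBWINDOW_BP * N_SUBWINDOWS   # 55,000 bp full window


def sliding_windows(chrom_length):
    """Yield (window_start, window_end, focal_start, focal_end) tuples.

    Coordinates are zero-based and half-open. The window advances by one
    sub-window per step (assumed), so each interior 5 kb segment is the
    focal sub-window of exactly one full window.
    """
    focal_offset = (N_SUBWINDOWS // 2) * SUBWINDOW_BP
    start = 0
    while start + WINDOW_BP <= chrom_length:
        focal_start = start + focal_offset
        yield (start, start + WINDOW_BP, focal_start, focal_start + SUBWINDOW_BP)
        start += SUBWINDOW_BP
```

Under these assumptions, the first and last 25 kb of each chromosome arm never serve as a focal sub-window, because five flanking sub-windows are required on each side.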
Selective sweeps are significantly enriched in functional regions of the A. gambiae genome
To elucidate broad characteristics underlying the genomic targets of selection, we used permutation tests of sweep call locations to discover enrichment patterns in the following DNA regions of interest: gene, mRNA, exon, CDS, five-prime UTR, and three-prime UTR (Figure 1). Permutation tests were based on the total number of calls for the four selection states with sweeps occurring within the focal sub-window, as well as the individual number of calls for each of these states respectively. Across all eight population datasets and for all six DNA regions under investigation, there is a statistically significant enrichment of total sweep calls along with completed soft sweeps calls in particular, whereas completed hard sweep calls are not significantly enriched in any single case (Figures 7, S6; Table S4). Conversely, partial sweep enrichment varies among populations as well as individual DNA regions. Specifically, partial hard sweeps are significantly enriched in five of the six DNA regions for BFS and CMS, and four of the DNA regions for UGS; while partial soft sweeps are significantly enriched in all six DNA regions for UGS, five of the DNA regions for BFM, and four of the DNA regions for BFS.
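One way such a permutation test could be set up is sketched below. The permutation scheme is not specified in this section, so the sketch assumes one: all sweep calls on a chromosome are circularly shifted by a random number of sub-windows, which preserves their spacing, and the number of calls overlapping the annotated regions is recounted. The one-sided empirical p-value with a +1 correction, the function names, and the parameters are likewise illustrative assumptions.

```python
import random


def count_overlaps(call_starts, window_bp, regions):
    """Count calls whose [start, start + window_bp) interval overlaps a region.

    regions: list of half-open (start, end) intervals, e.g. exons.
    A simple quadratic scan; adequate for a sketch, not for whole genomes.
    """
    n = 0
    for s in call_starts:
        e = s + window_bp
        if any(s < r_end and r_start < e for r_start, r_end in regions):
            n += 1
    return n


def enrichment_p_value(call_starts, window_bp, regions, chrom_length,
                       n_perm=1000, seed=0):
    """Return (observed overlap count, one-sided empirical p-value).

    Assumes call_starts are multiples of window_bp. Each permutation
    circularly shifts every call by the same random number of sub-windows.
    """
    n_grid = chrom_length // window_bp      # whole sub-windows on the chromosome
    if n_grid == 0:
        raise ValueError("chromosome shorter than one sub-window")
    span = n_grid * window_bp
    rng = random.Random(seed)
    observed = count_overlaps(call_starts, window_bp, regions)
    as_extreme = 0
    for _ in range(n_perm):
        shift = rng.randrange(n_grid) * window_bp
        shifted = [(s + shift) % span for s in call_starts]
        if count_overlaps(shifted, window_bp, regions) >= observed:
            as_extreme += 1
    # The +1 terms count the observed arrangement as one permutation,
    # so the p-value can never be exactly zero.
    return observed, (as_extreme + 1) / (n_perm + 1)
```

A circular shift is only one of several possible null models; shuffling call labels among sub-windows would break up clusters of adjacent calls and could therefore yield different p-values.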
Insecticide resistance loci, especially related to metabolism, are significantly enriched for selective sweeps
We performed a similar permutation analysis for four sets of genes known to confer insecticide resistance (IR) (Figure 1), finding at least one set of IR genes to be statistically significant for every population dataset in enrichment of total sweep calls (i.e. aggregate of all four sweep classes) and completed soft sweeps, respectively (Figures 8, S7; Table S5). In particular, metabolism-related IR genes are significantly enriched for each of these cases. Furthermore, IR genes corresponding to well-characterized resistance loci (i.e. target sites) are significantly enriched for AOM and BFM total sweep calls as well as completed soft sweeps in AOM, BFM, and GNS; IR genes associated with behavior are significantly enriched for BFM and GNS total sweep calls as well as completed soft sweeps in AOM, CMS, and GAS; and IR genes affiliated with cuticular activity are significantly enriched for completed soft sweeps in GWA and UGS. In contrast, completed hard sweeps are only significantly enriched in BFS, GNS, GWA, and UGS for IR genes connected to behavior (as well as metabolism for BFS). For partial sweeps, significant enrichment only occurs within BFM (partial soft sweeps in metabolism as well as behavior IR genes), CMS (partial hard sweeps in metabolism as well as target site IR genes), and UGS (partial hard sweeps in metabolism IR genes).
Completed soft sweeps are significantly enriched within the same GO term annotations across populations
To uncover further functional traits targeted by selection, we used our permuted datasets to ask for which individual gene ontology (GO) terms our sweep candidates are enriched (Figure 1). Only completed soft sweeps, as well as the total of all four sweep states in combination, are significantly enriched for any GO terms, except for two single cellular components GO terms in UGS (“membrane” is enriched for completed hard sweeps and “nuclear cohesion complex” is enriched for partial hard sweeps); partial soft sweeps are therefore not significantly enriched for any GO terms among populations (Table S6). For the completed soft sweep significant enrichments, we found six cases of the same GO term in all eight populations. Three of these belong to the cellular components GO domain (“nucleus”, “membrane” and “integral component of membrane”), and the other three are connected to molecular function, specifically binding (“nucleic acid binding”, “protein binding” and “ATP binding”). All six of these terms, especially those conferring binding function, are also enriched for total sweep calls across multiple datasets: “nucleic acid binding” and “ATP binding” in seven populations; “protein binding” and “nucleus” in six populations; “integral component of membrane” in five populations; and “membrane” in three populations, one of which is significantly enriched for this GO term of completed hard sweeps as well, the single example of such among all populations. Other cases involving the same GO term significantly enriched for completed soft sweeps in over half of the populations include: “binding”, “cytoplasm”, and “zinc ion binding” in seven populations; “RNA binding” in six populations; and “mRNA splicing, via spliceosome” and “ATPase activity” in five populations.
Discussion
partialS/HIC elucidates both species-wide and population-specific sweep dynamics within A. gambiae
The Ag1000G data provided the opportunity to investigate selection at both the continental scale, where wide-reaching impact across the whole species complex could be uncovered, and the regional level, revealing population-specific sweep dynamics. For the former, we observed that A. gambiae populations consistently experienced very few completed hard sweeps, with nearly all sweeps being partial and/or soft. In fact, the impact of completed hard sweeps on the adaptive process within mosquitoes appears to be even more limited than what was observed previously in humans [33]. This is likely a result of the much larger population sizes and concordant levels of genetic variation that are maintained within Anopheles populations. Importantly, we find a large number of ongoing selective sweeps within these populations, particularly in comparison to the number of completed sweeps. There are multiple reasons why this might be the case. A trivial explanation may simply be that we only have power to detect sweeps that have completed in the past few hundred generations, though this seems unlikely. More plausibly, a large number of ongoing sweeps might be expected given the recent change in environment induced by vector control efforts. Another possible explanation is that the frequency dynamics of beneficial alleles within a population are often more complex than assumed and may indeed contain an overdominant component [34]. This would mean that some portion of the partial sweeps that we are observing in Anopheles are actually balanced, or transiently balanced, polymorphisms. A fourth class of explanation is that beneficial mutations may not be able to fix in populations due to competition with beneficial mutations on other genes that have originated in different parts of the species range [35]. Indeed, each of these factors may play some role in our reported abundance of partial sweeps.
Although such genome-wide sweep patterns occur species-wide, enrichment behavior seems much more population-specific. For instance, while every population possesses significant enrichment of completed soft sweeps coupled with no completed hard sweep significant enrichments for the six functional DNA regions studied here, partial sweep enrichments vary widely among datasets. Sweep behavior is even more idiosyncratic for insecticide resistance genes, as the only constant between populations is that metabolism is a recurring target of selection, especially for completed soft sweeps.
These findings from the Ag1000G data provide important genomic resources that will inform continent-wide strategies that apply to the entire A. gambiae species complex, as well as aid management efforts in specializing to certain populations and localities, which can improve malaria control efficacy. Such insight into mosquito vector evolution may also help curb future insecticide resistance adaptation, and in turn prevent impending crises of vector control failure. However, it is important to consider that our partial sweep calls could be capturing more complex selective dynamics at play, for example polygenic and quantitative trait adaptation [36,37], balancing selection [38], and introgression of beneficial alleles. These could lead to different modes of adaptation for the same genomic region across populations; for instance, a favorable SNP that undergoes a soft sweep at its origin and is then carried to neighboring populations (as was suggested in Miles et al. [6]) may appear to be experiencing a partial hard sweep in those recipient populations. Such complicated interactions merit further investigation on the Ag1000G data, which would not only continue advancing methodological development for population genetics, but also address interesting questions for a widespread and ecologically important organism that has crucial ramifications for wildlife management and public health.
partialS/HIC offers powerful and unprecedented detection of partial sweeps
Supervised machine learning approaches are rapidly gaining traction among population geneticists, with deep learning in particular beginning to experience increased attention and methodological development due to its exciting potential to unlock classic population genetics problems. Examples of successful SML implementation in population genomics include demographic model choice [39], demographic parameter inference [40], comparative analysis of independent single-population size changes [41], identification of introgressed regions [42], recombination rate estimation [43–45], and genomic scans of selective sweeps [24]; deep learning specifically has been employed for joint inference of demography and selection [25], discovery of recombination hotspots [46], estimation of demographic and recombination parameters [47], discovery of functional variants [48], and differentiating between hard and soft sweeps from neutral regions [26]. These applications especially benefit from the ability to handle high-dimensional input data and to bypass the need for a likelihood function, since SML uncovers patterns in data by leveraging a priori information through a training algorithm [25,28]. CNNs expand this utility to image processing, which has been demonstrated with diploS/HIC to be a powerful tool for exploiting the genomic spatial distribution of multiple population-level summary statistics to detect selective sweeps [26].
Here, we demonstrated with partialS/HIC that deep learning can be extended to partial sweeps, especially partial hard sweeps, yielding greater accuracy and robustness than has been previously attained. We also showcased consistent performance in the face of several underlying demographic backgrounds. Specifically, partialS/HIC achieves dependable discovery of selective sweeps, excellent simultaneous disambiguation between partial and completed sweeps as well as between hard and soft sweeps, and reliable spatial localization of selection targets in the genome. Moreover, we have shown that partial sweeps remain mostly undetected if they are excluded from the training process, even though such selection may be commonplace throughout a genome, as with the Ag1000G data. As a result, many previous studies that scanned solely for either complete or ongoing selective sweeps (i.e. without jointly inferring both types of selection) may have overlooked an important subset of evolutionary events [35]. Researchers may therefore be interested in reexamining datasets with partialS/HIC to elucidate the relative contributions of fixed versus incomplete sweeps to adaptive evolution.
Importantly, the efficacy of partialS/HIC relies on several factors that are unexplored here, including simulation prior specifications, CNN architecture with respect to construction and parameterization of neural network layers, and data structure. Hence, it is prudent for future implementations to validate performance by testing a range of configurations, given a project’s individual intricacies, to assess robustness and inherent assumptions. In particular, future exploration of alternate image constructions could be of great methodological benefit. Such images could be derived from different ordering schemes and/or suites of summary statistics, or could forgo summary statistics entirely and instead directly exploit sequence alignments [46,47] or even raw reads. More broadly, CNNs can be further extended to address other long-standing efforts in evolutionary biology, such as parameter inference under complex isolation-migration models or phylogenetic reconstruction.
Methods
Simulations for training and testing CNN classifier
We used discoal [31] to simulate training and test datasets corresponding to each A. gambiae population under nine different selection states: neutrally evolving, completed hard sweep, completed soft sweep, partial hard sweep, partial soft sweep, and linked region for every one of the four sweep classes (Figure 1; Table S7). For the four sweep types, the target SNP was located in the exact middle position within the central, or sixth in sequence, sub-window of 11 in total; the selected SNP was placed in the middle within one of the other ten sub-windows for linked sweeps. There were 2,000 training examples per selection state (with selected sub-window randomized for linked sweeps) and 1,000 test examples per class (including for each of the ten linked sweep locations), resulting in a training dataset of 18,000 simulations and a test dataset of 45,000 simulations given each demographic history, thus totaling 144,000 training and 360,000 test simulations. To conduct single-population simulations with discoal, we used the stairway plot [53] point estimates from Miles et al. [6] for size change parameters as well as N0 (present-day effective population size), assumed a mutation rate (μ) of 3.5×10⁻⁹ mutations per base pair per generation, and performed random draws for locus-wide mutation and recombination rates from the following distributions for each independent replicate: , where E[θ] = 4N0μL and L is the length of the simulated sequence with L = 55,000 base pairs; ρ ∼ TEXP(2×E[θ], 6×E[θ]), where ρ = 4N0rL, r is the recombination rate per base pair, and TEXP(β, maximum value) is a truncated exponential distribution with mean β; s ∼ U(1.0×10⁻⁴, 1.0×10⁻²); end time of sweep ∼ U(0, 2,000) generations ago, which represents fixation for completed sweeps and the transition back to neutral evolution for partial sweeps; selected SNP allele frequency at onset of soft sweep ; and selected SNP allele frequency at end of partial sweep ∼ U(0.20, 0.99).
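As a rough illustration of the per-replicate draws described above, the following Python sketch samples the parameters whose priors are stated explicitly. It is not the authors' pipeline: the function names and the example N0 value are hypothetical, β is assumed to be the mean of the exponential before truncation, and the truncation is implemented by simple rejection sampling. The priors for θ and for the initial soft-sweep frequency are left out because they are not given in the text above.

```python
import numpy as np

L = 55_000      # simulated sequence length in base pairs (from the text)
MU = 3.5e-9     # mutation rate per base pair per generation (from the text)

# Dataset sizes implied by the text: 9 training states, 45 test classes, 8 populations.
N_TEST_CLASSES = 1 + 4 + 4 * 10          # neutral + 4 sweeps + 4 linked types x 10 positions
TRAIN_PER_POPULATION = 9 * 2_000         # 18,000
TEST_PER_POPULATION = N_TEST_CLASSES * 1_000   # 45,000
TRAIN_TOTAL = 8 * TRAIN_PER_POPULATION   # 144,000
TEST_TOTAL = 8 * TEST_PER_POPULATION     # 360,000


def draw_truncated_exponential(rng, mean, maximum):
    """Rejection-sample an exponential with scale `mean` (its mean before
    truncation, an assumption) and discard draws above `maximum`."""
    while True:
        value = rng.exponential(mean)
        if value <= maximum:
            return value


def draw_replicate_parameters(rng, n0):
    """Draw one replicate's parameters; `n0` stands in for a population's N0."""
    expected_theta = 4 * n0 * MU * L
    return {
        "expected_theta": expected_theta,
        "rho": draw_truncated_exponential(rng, 2 * expected_theta, 6 * expected_theta),
        "s": rng.uniform(1.0e-4, 1.0e-2),
        "sweep_end_generations_ago": rng.uniform(0, 2_000),
        "partial_sweep_final_frequency": rng.uniform(0.20, 0.99),
    }


rng = np.random.default_rng(1)
params = draw_replicate_parameters(rng, n0=1_000_000)  # hypothetical N0, for illustration only
```

In practice these values would be passed to discoal together with the population's size-change history; the command-line construction is omitted here.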
Constructing two-dimensional feature vector images of summary statistics
The eight training and test datasets, as well as empirical datasets, were converted into two-dimensional feature vector matrices for downstream deep learning (Figure 1); this was performed in Python using the numpy module. Prior to this two-dimensional transformation, the simulated data were modified to better account for two sources of uncertainty within the empirical data: 1) missing data, i.e. sites that lacked any individual calls or could not be polarized against the outgroup and were therefore excluded; and 2) incorrect identification of the derived allele. For the former, each simulation randomly drew from a distribution of 1,552 masking profiles (with test simulations drawing without replacement per selection class of 1,000 simulations), which determined the exact sites to be omitted from further analysis. A masking profile consisted of the site positions within a single full 55 KB window on the A. gambiae genome that lacked a call for at least one sample across the entirety of the Ag1000G data and/or lacked ancestral state information. The total set represented all 1,552 sequential, non-overlapping windows (e.g. 2L: 1–55,000; 2L: 55,001–110,000; 2L: 110,001–165,000; etc.) where the proportion of masked sites did not exceed 75% in any of the constituent sub-windows (i.e. at least 1,250 of the 5,000 sites were retained in each sub-window). To account for mispolarization, estimated rates were obtained from Miles et al. [6] and used as the success probability of a binomial distribution to mispolarize a random subset of SNPs to the other allele per simulation.
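The masking and mispolarization steps can be sketched with numpy as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the haplotype matrix is assumed to be coded 0 (ancestral) and 1 (derived) with one row per SNP, and the mispolarization rate is a placeholder rather than an estimate from Miles et al. [6].

```python
import numpy as np


def apply_mask(haplotypes, positions, masked_positions):
    """Drop SNPs (rows) whose positions appear in the masking profile."""
    keep = ~np.isin(positions, masked_positions)
    return haplotypes[keep], positions[keep]


def mispolarize(haplotypes, rate, rng):
    """Swap ancestral and derived labels at a random subset of SNPs.

    The number of affected SNPs is drawn from Binomial(n_snps, rate),
    and that many rows are then chosen uniformly without replacement.
    """
    n_snps = haplotypes.shape[0]
    n_flip = rng.binomial(n_snps, rate)
    flip_rows = rng.choice(n_snps, size=n_flip, replace=False)
    flipped = haplotypes.copy()
    flipped[flip_rows] = 1 - flipped[flip_rows]
    return flipped, flip_rows


rng = np.random.default_rng(0)
example_haplotypes = rng.integers(0, 2, size=(20, 6))    # 20 SNPs x 6 haplotypes
example_positions = np.arange(100, 2100, 100)            # one position per SNP
masked_haps, masked_pos = apply_mask(
    example_haplotypes, example_positions, np.array([100, 500, 999])
)
flipped_haps, flipped_rows = mispolarize(masked_haps, rate=0.5, rng=rng)  # placeholder rate
```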
The empirical data similarly underwent processing for compatibility with the simulated data. First, chromosomes were delineated into sequential 5 KB sub-windows (e.g. positions 1-5,000 formed the first sub-window, positions 5,001-10,000 formed the second sub-window, etc.), with the aforementioned masking criteria applied across sites and remaining SNPs polarized. Within each population dataset, all polymorphic positions containing more than two alleles were further removed from analysis, such that only polarized monomorphic and biallelic sites comprising a full data matrix with no missing data were left. Sub-windows containing no SNPs or fewer than 25% of the original sites were subsequently discarded, and every configuration of 11 contiguous 5 KB sub-windows among those remaining formed a single full window, which would be classified into one of the nine selection states based upon its central sub-window while using spatial information from the neighboring five sub-windows on either side. To clarify, this eliminated any window that did not contain a consecutive sequence of 11 sub-windows that survived data filtering, and resulted in a sliding window that progressed a single sub-window at a time, such that succeeding full windows could overlap by up to ten sub-windows.
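The sliding-window construction described above can be sketched in plain Python. The function names are hypothetical, and the input is assumed to be a list of booleans, one per 5 KB sub-window along a chromosome, marking whether that sub-window survived filtering.

```python
SUBWINDOW_SIZE = 5_000   # base pairs per sub-window (from the text)
WINDOW_WIDTH = 11        # sub-windows per full window (from the text)


def subwindow_index(position):
    """Map a 1-based position to a 0-based sub-window index
    (positions 1-5,000 map to 0, positions 5,001-10,000 map to 1)."""
    return (position - 1) // SUBWINDOW_SIZE


def full_windows(passing, width=WINDOW_WIDTH):
    """Return (start, center) sub-window indices for every run of `width`
    consecutive sub-windows that all passed filtering.

    The window slides one sub-window at a time, so neighboring full
    windows can share up to `width - 1` sub-windows.
    """
    windows = []
    run = 0
    for i, ok in enumerate(passing):
        run = run + 1 if ok else 0
        if run >= width:
            start = i - width + 1
            windows.append((start, start + width // 2))
    return windows


# Twelve passing sub-windows, one failure, then eleven passing sub-windows.
example_flags = [True] * 12 + [False] + [True] * 11
example_windows = full_windows(example_flags)
```

Here the first run of twelve yields two overlapping full windows, and the failed sub-window removes every window that would have spanned it.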
Every independent simulation totaling 55 KB in length from 11 sub-windows of 5 KB, as well as every empirical sequence of 11 adjacent sub-windows per population, was then transformed into 89 separate summary statistic vectors, each with 11 elements that captured population-level variation across the sampled individuals per sub-window. The first 17 summary statistics, which were π [54], θW [55], Tajima’s D [14], θH [15], Fay-Wu’s H [15], number of haplotypes, H1 [52], H12 [52], H2/H1 [52], ZnS [56], ω [16], E[iHS] [30], maximum iHS [30], proportion of outlier iHS values [30], variance of pairwise genotype distances [26], skewness of pairwise genotype distances [26], and kurtosis of pairwise genotype distances [26], were calculated with the Python package scikit-allel as done with diploS/HIC [26]. Values for iHS were standardized within 50 derived allele frequency bins, following mispolarization in the case of simulated data. Outlier iHS values were defined as those falling within either 2.0% tail of the distribution obtained from simulations of neutral evolution under the appropriate demographic history.
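To make one group of these statistics concrete, the haplotype homozygosity measures H1, H12, and H2/H1 [52] can be computed from haplotype frequencies within a sub-window. The sketch below is a plain numpy illustration of the standard definitions, not the scikit-allel implementation used in the study; the function name and the toy data are hypothetical, and the input is assumed to have one row per SNP and one column per haplotype.

```python
import numpy as np


def haplotype_homozygosity(haplotypes):
    """Return (H1, H12, H2/H1) for a SNP-by-haplotype 0/1 matrix.

    With haplotype frequencies p1 >= p2 >= ... :
      H1    = sum of p_i^2
      H12   = (p1 + p2)^2 + sum of p_i^2 for i > 2
      H2/H1 = (H1 - p1^2) / H1
    """
    _, counts = np.unique(haplotypes.T, axis=0, return_counts=True)
    freqs = np.sort(counts / counts.sum())[::-1]
    h1 = np.sum(freqs ** 2)
    h12 = np.sum(freqs[:2]) ** 2 + np.sum(freqs[2:] ** 2)
    h2_h1 = (h1 - freqs[0] ** 2) / h1
    return h1, h12, h2_h1


# Four haplotypes (columns): two copies of one type and two singletons.
toy = np.array([
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
])
h1, h12, h2_h1 = haplotype_homozygosity(toy)
```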
The remaining 72 summary statistics originated from SAFE, a recently developed statistic [32], and its various components; these SAFE-derived statistics included summaries for the distribution of values for: haplotype allele frequency (HAF), which is the sum of derived allele counts across all the derived alleles present within a sequence; unique HAF score (i.e. each unique HAF value is counted only once, even if representing multiple individuals); φ, which is the sum of HAF scores for sequences harboring the derived allele, divided by the total sum of HAF scores across all sequences; κ, which is the proportion of distinct HAF scores that carry the derived allele; derived allele frequency; and SAFE itself, which is the difference between φ and κ normalized against the derived allele frequency. Notably, HAF is calculated per sequence, whereas φ, κ, derived allele frequency, and SAFE are calculated per polymorphism. The following 12 distribution summaries were used to construct individual values spanning a sub-window: mean, median, mode; 2.5%, 25%, 75%, and 97.5% quantiles; maximum, variance, standard deviation, skewness, and kurtosis. Importantly, each summary statistic vector was normalized, in the same manner as in the predecessors of partialS/HIC [24,26], to capture signal solely from the relative spatial distribution of the summary statistics across the 11 sub-windows, rather than allowing influence from absolute values. Subsequently, the 89 vectors were vertically collated to form a two-dimensional matrix that could then be exploited for image processing. The arrangement of these vectors was such that the 11 columns corresponded to the series of sub-windows from left to right, and the 89 rows of summary statistics were in the order presented here (with the distribution summaries iterating first for every SAFE component, e.g. row 52: skewness of φ values; row 53: kurtosis of φ values; row 54: E[κ]).
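The SAFE components defined above can be illustrated with a small numpy sketch. This is a reading of the verbal definitions in this section, not the reference implementation of [32]: the function name is hypothetical, the matrix is assumed to have one row per haplotype and one column per polymorphic SNP (1 = derived), and the normalization is assumed to be division by the square root of f(1 − f), where f is the derived allele frequency.

```python
import numpy as np


def safe_components(hap):
    """Return (HAF, phi, kappa, derived allele frequency, SAFE).

    HAF is per haplotype; the other four arrays are per SNP.
    Assumes every column is polymorphic, so f * (1 - f) > 0.
    """
    hap = np.asarray(hap)
    n_haplotypes, n_snps = hap.shape
    derived_counts = hap.sum(axis=0)
    # HAF: sum of derived allele counts over the derived alleles a haplotype carries.
    haf = hap @ derived_counts
    # phi: share of the total HAF held by carriers of each derived allele.
    phi = (hap.T @ haf) / haf.sum()
    # kappa: share of distinct HAF scores found among carriers.
    n_distinct = np.unique(haf).size
    kappa = np.array([
        np.unique(haf[hap[:, j] == 1]).size / n_distinct for j in range(n_snps)
    ])
    freq = derived_counts / n_haplotypes
    safe = (phi - kappa) / np.sqrt(freq * (1 - freq))  # assumed normalization
    return haf, phi, kappa, freq, safe


toy = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 1],
])
haf, phi, kappa, freq, safe = safe_components(toy)
```

Each of these arrays would then be reduced to the distribution summaries listed above to give one value per sub-window.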
Importantly, column and row order affect deep learning optimization, and may therefore influence overall efficacy, because of the convolutional and pooling windows employed by the CNN architecture. For this reason, related summary statistics were grouped together (e.g. alternative distribution summaries of a SNP-based statistic, various SAFE derivatives). Heatmap images, based on median values per statistic and sub-window, were generated in R for the neutral case and each of the four sweep states under every population history from the training simulations.
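The row layout and normalization described in this section can be sketched as follows. The label strings and function names are hypothetical. The normalization shown (shift any vector with negative entries so its minimum is zero, then divide by the sum across sub-windows) is an assumption about the scheme used by the predecessors of partialS/HIC [24,26], not a confirmed detail of this study.

```python
import numpy as np

BASE_STATS = [
    "pi", "theta_W", "tajima_D", "theta_H", "fay_wu_H", "n_haplotypes",
    "H1", "H12", "H2_H1", "ZnS", "omega", "mean_iHS", "max_iHS",
    "outlier_iHS", "dist_variance", "dist_skewness", "dist_kurtosis",
]
SAFE_COMPONENTS = ["HAF", "unique_HAF", "phi", "kappa", "DAF", "SAFE"]
SUMMARIES = [
    "mean", "median", "mode", "q2.5", "q25", "q75", "q97.5",
    "max", "variance", "sd", "skewness", "kurtosis",
]
# Summaries iterate first within each SAFE component, as stated in the text.
ROW_LABELS = BASE_STATS + [f"{c}_{s}" for c in SAFE_COMPONENTS for s in SUMMARIES]


def normalize(vector):
    """Rescale one statistic so its values across sub-windows sum to 1."""
    v = np.asarray(vector, dtype=float)
    if v.min() < 0:
        v = v - v.min()
    total = v.sum()
    if total == 0:
        return np.full(v.size, 1.0 / v.size)
    return v / total


def build_image(stat_vectors):
    """Stack normalized statistic vectors into a rows-by-sub-windows matrix."""
    return np.vstack([normalize(v) for v in stat_vectors])


rng = np.random.default_rng(0)
raw = rng.normal(size=(len(ROW_LABELS), 11))   # placeholder statistic values
image = build_image(raw)
```

With 1-based row numbering, `ROW_LABELS[51]` corresponds to row 52 in the text.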
Training and testing CNNs for deep learning implementation
The architecture of our CNN was composed of the following sequential layers: 1) 2D convolutional layer with 256 filters using 3×6 windows and “same” padding; 2) 2D max pooling layer given a 3×3 window; 3) a second 2D convolutional layer of 256 filters based on 3×3 windows, “same” padding, and ReLU activation; 4) a second 2D max pooling layer also with a 3×3 window; 5) dropout layer with p=0.25; 6) flattening layer; 7) fully-connected layer with ReLU activation to 512 responses; 8) a second dropout layer with p=0.50; 9) a second fully-connected layer with ReLU activation to 128 elements; 10) another dropout layer with p=0.50; and 11) softmax activation layer to 9 states (Figures 1–2). This architecture was trained using the Python module Keras [57] given the Adam optimizer [58], with 20 epochs, batch size of 32 simulations per step within an epoch, and 10% of the training data (e.g. 1,800 simulations from the total nine-state training dataset) randomly removed as a validation set during optimization. Training was performed for every population demography under three experimental settings: 1) given the full set of training data distributed across nine selection states; 2) exploiting a subset of the training data from only five of the selection states, specifically those involving neutral regions or completed sweeps; 3) deploying the entire training data, but with binary classification between selective sweeps in the focal sub-window and all unselected classes (i.e. neutral class together with every linked class). The same test datasets were used for the first two simulation experiments, producing overall accuracy measures as well as confusion matrix heatmaps to assess misclassification bias for each of 45 classes, in this case treating different sub-window placements of linked sweeps discretely.
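The layer sequence above might be written in Keras roughly as follows. This is a sketch rather than the authors' code: the input is assumed to be an 89×11 single-channel image, the 3×6 window is assumed to be ordered as (rows, columns), the pooling layers use Keras defaults for stride and padding, and the categorical cross-entropy loss is an assumption since the text does not name a loss function.

```python
from tensorflow import keras
from tensorflow.keras import layers

N_ROWS, N_COLS, N_STATES = 89, 11, 9   # statistics, sub-windows, selection states

model = keras.Sequential([
    keras.Input(shape=(N_ROWS, N_COLS, 1)),
    layers.Conv2D(256, (3, 6), padding="same"),                      # layer 1
    layers.MaxPooling2D(pool_size=(3, 3)),                           # layer 2
    layers.Conv2D(256, (3, 3), padding="same", activation="relu"),   # layer 3
    layers.MaxPooling2D(pool_size=(3, 3)),                           # layer 4
    layers.Dropout(0.25),                                            # layer 5
    layers.Flatten(),                                                # layer 6
    layers.Dense(512, activation="relu"),                            # layer 7
    layers.Dropout(0.50),                                            # layer 8
    layers.Dense(128, activation="relu"),                            # layer 9
    layers.Dropout(0.50),                                            # layer 10
    layers.Dense(N_STATES, activation="softmax"),                    # layer 11
])

# Adam optimizer as stated in the text; the loss is an assumed choice.
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```

Training with the stated settings would then correspond to something like `model.fit(x, y, epochs=20, batch_size=32, validation_split=0.1)`. Note that `validation_split` holds out the last fraction of the arrays as given, so the data would need to be shuffled beforehand to match the random validation set described here.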
Moreover, to explore the effect of demographic misspecification, we conducted a single additional test under the first experimental set-up whereby the CNN trained on the GAS simulations was applied to the CMS test simulations. Regarding the binary classification experiment, the Python module sklearn, which is available at http://scikit-learn.org/stable/ [59], was used for building ROC curves to evaluate accuracy and sensitivity. For comparison, ROC curves were also constructed from the training simulations based on the focal sub-window mean iHS, maximum iHS, and proportion of iHS values that were outliers, respectively.
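For the binary experiment, a ROC curve can be built with scikit-learn from any per-window score, whether that is the classifier's sweep probability or a single statistic such as the focal sub-window's maximum iHS. The sketch below uses `roc_curve` and `auc` from `sklearn.metrics`; the function name and the toy scores are hypothetical and do not come from the study.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve


def sweep_roc(sweep_scores, is_sweep):
    """Return false positive rates, true positive rates, and the area under
    the ROC curve for scores where higher values indicate a sweep."""
    fpr, tpr, _thresholds = roc_curve(is_sweep, sweep_scores)
    return fpr, tpr, auc(fpr, tpr)


# Toy example: 1 = sweep in the focal sub-window, 0 = neutral or linked.
toy_labels = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, area = sweep_roc(toy_scores, toy_labels)
```

Applying the same function to the CNN output and to each iHS-based statistic in turn gives curves that can be compared on a common set of simulations.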
Detecting selective sweeps for A. gambiae population datasets
To scan the genome for signatures of selective sweeps, the nine-state trained CNNs were applied to the eight empirical mosquito datasets, with the underlying simulated demography matched to the sampled population (Figure 1). Calls were corrected for false discovery by exploiting the accuracy and error rates for neutral regions from the nine-state simulation experiment, such that the number of neutral calls was assumed to be underestimated while the numbers of calls for the remaining eight selection states were assumed to be inflated. Subsequently, we produced sets of 10,000 randomly permuted calls across the genome to derive null expectations of sweep enrichment, following Schrider and Kern [33]. Using the gene annotation file “Anopheles-gambiae-PEST_BASEFEATURES_AgamP4.7.gff3.gz” from VectorBase, we exploited these permuted datasets to assess statistically significant enrichment within certain DNA regions, groupings of known IR genes (N. Harding, pers. comm.), and all basic GO term definitions from http://www.geneontology.org (last accessed February 18, 2015). The DNA regions of interest included gene, mRNA, exon, CDS, five-prime UTR, and three-prime UTR; IR genes were assigned to four functional categories: metabolism, target sites, behavior, and cuticular. To be deemed significantly enriched, the number of inferred calls for a particular DNA region or IR gene category had to have a p-value < 0.05 based on the respective distribution of 10,000 permutations; for the GO terms, we applied a corrected q-value threshold of < 0.05 due to concerns of false discovery stemming from the large number of terms tested for enrichment.
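The permutation-based enrichment test can be sketched as below. The function names and toy numbers are hypothetical. The p-value is taken as the fraction of permutations with at least as many overlapping calls as observed, and the q-values use the Benjamini-Hochberg procedure; both are assumptions, since the text does not specify the exact p-value formula or the correction method.

```python
import numpy as np


def permutation_p_value(observed_count, permuted_counts):
    """One-sided empirical p-value: the share of permutations in which the
    overlap count is at least as large as the observed count."""
    permuted_counts = np.asarray(permuted_counts)
    return float(np.mean(permuted_counts >= observed_count))


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (an assumed correction method)."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downward.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.minimum(scaled, 1.0)
    return q


toy_p = permutation_p_value(8, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
toy_q = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

In the analysis described here, `permuted_counts` would hold 10,000 values per annotation category, one for each permuted set of calls.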
Acknowledgments
We thank Jeff Adrion for comments on the manuscript. This work was supported by NIH award no. R01GM117241 to ADK and NIH award no. K99HG008696 to DRS.