Abstract
Characterizing species history and identifying loci underlying local adaptation are crucial in functional ecology, evolutionary biology, conservation and agronomy. The ongoing improvement of next-generation sequencing (NGS) techniques has facilitated the production of a rapidly growing number of genetic markers across the genomes of non-model species.
The study of variation at these markers across natural populations has deepened our understanding of how population history and selection act on genomes. However, this improvement has come with a burst of analytical tools that can confuse users.
This confusion can limit the amount of information effectively retrieved from complex genomic datasets. In addition, the lack of a unified analytical pipeline impairs the diffusion of the most recent analytical tools into fields like conservation biology.
This calls for efforts to provide accessible introductions to these methods. In this paper we describe possible analytical protocols, list more than 70 methods dealing with genome-scale datasets, clarify the strategies they use to infer demographic history and selection, and discuss some of their limitations.
Introduction
Multiple historical and selective factors shape the genetic makeup of populations. The advent of Next-Generation Sequencing (NGS) over the last 20 years has enhanced our understanding of how intermingled these factors are, and how they can impact genomic variation. Important results have been gathered on model species or species of economic interest. Such results include, among other examples, an improved perspective on the human history of migrations, admixture and adaptation (e.g. Sabeti et al., 2002; Abi-Rached et al., 2011; Li and Durbin, 2011), the elucidation of the origin of domesticated species (e.g. Axelsson et al., 2013; Schubert et al., 2014), and the characterization of the genetic bases of local adaptation in model or near-model species (e.g. Legrand et al., 2009; Kolaczkowski et al., 2011; Roux et al., 2013; Kubota et al., 2015). Population genomic data aimed at elucidating the history of natural populations are now abundant and widespread, even for non-model species. Studying genetic variation at the genome level makes it possible to characterize how demographic factors shape species history. In return, this picture of demographic events allows the robust identification of loci under selection, and can even assist conservation efforts by identifying locally adapted genes that can be used to define relevant conservation units (Fraser and Bernatchez, 2001).
Developments in NGS have so far constantly improved the throughput and quality of data while reducing the time and cost of their production. As these data have become more affordable for teams studying evolutionary processes, many methods to infer demography and selection have been developed. Consequently, this increase in data production has come at an analytical cost, with an inflation of methods each claiming to address specific issues, making it difficult to follow the ongoing developments in the field. In addition, the widespread use of sophisticated analytical tools remains challenged by the lack of communication between fields (Shafer et al., 2015), the limited user-friendliness of software and the ever-increasing number of tools made available. Nevertheless, the translation of these methods to non-model species is part of a shift towards genomics in evolutionary sciences that aims at better understanding biological diversity at various scales (Mandoli and Olmstead, 2000; Jenner and Wills, 2007; Abzhanov et al., 2008). Recent breakthroughs brought by the study of initially non-model species (e.g. White et al., 2010; Ellegren et al., 2012; Weber et al., 2013; Poelstra et al., 2014) have confirmed the value of population genomics from this perspective. These advances are needed to broaden our view of the evolutionary process and improve the sampling of distant clades. Ultimately, this process should provide a more balanced picture than the one brought by the study of a few model species (Abzhanov et al., 2008). Genomic approaches also have the potential to improve conservation genetic inference by scaling up the amount of data available (Shafer et al., 2015). Much effort has recently been put into facilitating the diffusion of sometimes complex, state-of-the-art methods; their application to species with little genomic background has nonetheless become more accessible and has the potential to bring valuable information.
In this paper, we propose a decision-making pipeline (Figure 1) to help choose appropriate methods for addressing questions in population genomics and the genetics of adaptation in natural populations. We begin with a succinct review of the methods available to obtain genome-wide polymorphism data (Box 1) before focusing on i) methods devoted to the study of population structure and the identification of selected loci (Tables 1 and 2) and ii) methods aiming at quantitatively characterizing population structure (Table 3). We end this review by detailing how these analyses can be combined, and provide perspectives on the use of these methods for non-SNP datasets. The tables and method summaries from this paper will be kept updated to follow improvements, and are available at www.methodspopgen.com.
Common sequencing methods
Whole genome resequencing: Whole-genome resequencing requires a reference genome (at least at a draft stage) and is much more expensive than reduced-representation approaches, especially for species with large and complex genomes. However, this approach gives a complete overview of structural and coding variation, and enables some of the most powerful methods currently available to track signatures of selection (see below). Pooled sequencing (Futschik and Schlötterer, 2010) can be an option to reduce costs, but restricts the analysis to methods focusing on allele frequencies. Since individual information is not available, variation in linkage disequilibrium (LD) across individuals cannot be exploited. Shallow sequencing (1–5X per individual) may be a way to partly circumvent this issue for a similar cost (Buerkle and Gompert, 2013), but should not be used for methods requiring phasing and unbiased individual genotypes. Shallow shotgun sequencing also allows the retrieval of complete plastomes, owing to the over-representation of mitochondrial or chloroplast sequences. Plastome sequences can provide insightful information about the evolutionary history of populations or species. Recent work has successfully used shallow sequencing to reconstruct mitochondrial or chloroplast sequences in plants (Malé et al., 2014), animals (Hahn et al., 2013) and old, altered museum samples (Besnard et al., 2016). Methods such as MITObim (Hahn et al., 2013) provide an automated and relatively user-friendly way to reconstitute plastome sequences, which can then be analyzed as a single non-recombining marker for phylogeny or population genetics.
RNAseq: RNAseq can be used with or without a reference genome. In the latter case, like any other reduced-representation method, it does not provide information on linkage among genes. It has many applications along the evolutionary time scale. Since it samples mostly coding regions, deep phylogenies can be constructed from conserved orthologs. Depth of coverage depends on gene expression, so genotype calling accuracy varies across genes and this should be taken into consideration (Gayral et al., 2013). RNAseq provides information about the regulation of biochemical processes and pathways between different tissues of the same individual, between individuals and across different environments. Applying RNAseq to non-model organisms to identify differentially expressed genes can be challenging. One reason is that biological variance is much higher in field studies than under controlled conditions, which requires sampling more individuals to achieve sufficient statistical power (Todd et al., 2016). If a reference genome is available, it is possible to call variants (Piskol et al., 2013) and estimate differential gene expression using gene annotations (Love et al., 2014). This can be done from alignments produced by splice-aware RNAseq tools such as TopHat2 (Kim et al., 2013), HISAT (Kim et al., 2015) or STAR (Dobin et al., 2013). By using a reference genome to bridge regions with low coverage, software like Cufflinks and more recent tools such as Bayesembler (Maretty et al., 2014) and StringTie (Pertea et al., 2015) can assemble more lowly expressed transcripts than a de novo approach (Maretty et al., 2014). However, reference-guided methods generally ignore variation from the reference since they focus on the overall exon structure of a transcript.
Therefore, tools for de novo assembly such as Trinity, Oases, SOAPdenovo-Trans and Bridger are more suitable to retrieve information about variation for population genetic inferences when no reference genome is available (Grabherr et al., 2011; Schulz et al., 2012; Xie et al., 2014; Chang et al., 2015).
Targeted sequencing: This method facilitates the development of markers for a single species. Since the specificity of the probes does not have to be very high, the same probes can be used with different related species (Nicholls et al., 2015). Conservation of the targeted genomic region is important: high conservation may lead to higher capture efficiency but can artificially reduce the representation of polymorphic regions. Several technologies allow targeted sequence capture and can be classified by enrichment method (hybridization-based, PCR-based or molecular inversion probe-based; see Mamanova et al., 2010). Targeted sequencing reduces genomic representation compared to whole-genome sequencing and allows multiple individuals to be multiplexed, lowering the sequencing cost per sample. In addition, analysis complexity is reduced compared to WGS since only a subset of genomic regions is sequenced. By making improved spatial and temporal sampling affordable, targeted sequencing can help reconstruct dispersal routes and migration between varieties and subspecies (Nadeau et al., 2012; da Fonseca et al., 2016).
RADseq: Reduced-representation methods allow homogeneous sampling of variants across the genome by sequencing DNA fragments flanking restriction sites. Some of the best-known techniques include RAD-sequencing (Baird et al., 2008) and Genotyping by Sequencing, or GBS (Elshire et al., 2011). Their main appeal is their relatively low cost and the fact that they do not require a reference genome (see Davey et al., 2011 for details). They therefore make it possible to sequence many individuals at low cost, making them widely used for the study of population structure, demography and selection. As a general word of caution, note that RAD-sequencing and related methods display specific properties that can bias genome-wide estimates of diversity, like allelic dropout (Arnold et al., 2013). However, these markers remain valuable for phylogenetic estimation, even for distantly related species (Cariou et al., 2013), and allelic dropout can be mitigated by focusing only on markers sequenced in all individuals. Many pipelines have been specifically designed to account for RAD-seq specificities, including Stacks (Catchen et al., 2011) and TASSEL-UNEAK (Lu et al., 2013), facilitating the reproducibility of analyses. Reduced-representation methods do not cover all mutations in the genome and are thus more likely to miss those actually under selection. Special care in choosing the restriction enzyme and determining the expected marker density is needed to retrieve enough mutations close to genes under selection. The number of SNPs ranges from thousands to millions, which is in most cases enough to retrieve substantial information about demography, and sometimes selection (see Puritz et al., 2014 for a detailed summary of reduced-representation techniques).
Population structure and data description
Exploring population structure
Many tools currently exist to infer population structure (Table 1). An elegant and efficient class of methods relies on multivariate approaches to infer relatedness between individuals and populations without a priori grouping. Since these methods make no assumptions based on population genetics models, they are suitable for analyzing species displaying polyploidy or mixed ploidy (Dufresne et al., 2014). A detailed review of these methods is already available (Jombart et al., 2009), and an exhaustive list of their applications is beyond the scope of this review. A simple approach that does not assume any a priori grouping is Principal Component Analysis (PCA), based on the variance-covariance structure among genotypes, which can be performed on both individual and pooled data. These approaches have been especially useful to study the consistency between geographical and genetic structure in European human populations (Novembre et al., 2008). A recent application of this technique, using Procrustes rotation to match geographical coordinates with PCA axes, has been performed on populations of the freshwater crustacean Daphnia magna using RAD-sequencing, showing how isolation by distance shaped genetic structure.
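A minimal sketch of this approach: PCA of a centered genotype matrix via singular value decomposition. The genotype matrix below is simulated and the two groups of individuals are hypothetical; dedicated tools add SNP scaling, missing-data handling and outlier detection omitted here.

```python
import numpy as np

# Toy genotype matrix: rows = individuals, columns = SNPs,
# entries = count of the alternate allele (0, 1 or 2).
# Two hypothetical "populations" differing in allele frequencies.
rng = np.random.default_rng(0)
pop1 = rng.binomial(2, 0.9, size=(10, 50))
pop2 = rng.binomial(2, 0.1, size=(10, 50))
G = np.vstack([pop1, pop2]).astype(float)

# Center each SNP on its mean genotype (standard PCA pre-processing;
# scaling each SNP by sqrt(p(1-p)) is a common variant).
G_centered = G - G.mean(axis=0)

# Principal components via SVD of the centered matrix.
U, s, Vt = np.linalg.svd(G_centered, full_matrices=False)
pcs = U * s  # individual coordinates on each PC

# Individuals from the two simulated groups separate on PC1.
print(pcs[:, 0])
```

Plotting the first two columns of `pcs` is usually enough to reveal the main axes of structure before moving to model-based clustering.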
Methods estimating the relatedness of individuals are suited to studies relying on pedigree information, or to cases where there are reasons to suspect that familial relationships play a major role in shaping the genetic structure of the population(s) considered. When each individual in a study is sampled from a different location or environment, estimating relatedness also provides a way to assess the genetic distance between individuals in relation to geographical or ecological distance. For example, in a recent study using more than 1000 Arabidopsis thaliana genomes, estimates of relatedness allowed the identification of possible relict populations that may have subsisted in Europe during the last Ice Age (Alonso-Blanco et al., 2016). VCFTools (Danecek et al., 2011) provides two ways of calculating relatedness: the unadjusted Ajk statistic (Yang et al., 2010) and a kinship coefficient also implemented in KING (Manichaikul et al., 2010). Population stratification and relatedness can also be explored in PLINK, based on pairwise identity-by-state (IBS) distance or identity by descent (IBD). These methods can further be used to identify genomic regions displaying high allele sharing, which can suggest positive selection (see below).
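To make the quantities behind these analyses concrete, the toy function below computes a pairwise identity-by-state (IBS) distance matrix from diploid genotype counts. This is a simplified sketch of the idea, not the VCFTools or PLINK implementation, and the genotypes are invented.

```python
import numpy as np

def ibs_distance_matrix(G):
    """Pairwise identity-by-state distance from a genotype matrix
    (rows = individuals, entries = alternate-allele counts 0/1/2).
    Distance = mean |g_i - g_j| / 2: 0 for identical genotypes,
    1 when two individuals are opposite homozygotes at every SNP."""
    G = np.asarray(G, dtype=float)
    n = G.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = np.abs(G[i] - G[j]).mean() / 2
    return D

G = np.array([[0, 0, 2, 1],
              [0, 0, 2, 1],   # identical to individual 0
              [2, 2, 0, 1]])  # mostly opposite homozygotes
D = ibs_distance_matrix(G)
print(D)
```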
Landscape (and seascape) genetics has widely contributed to our understanding of how ecological and geographical variation affects species history and adaptation (Manel and Holderegger, 2013). Of central importance in this field is identifying how populations are connected and how organisms move in the landscape matrix. Complementary to these approaches, identifying how and where populations (or closely related species, see Roux et al. 2016) hybridize is crucial when it comes to characterizing colonization trajectories, tension zones and secondary contacts (Gay et al., 2008; Bierne et al., 2011).
Approaches such as Structure (Pritchard et al., 2000) and fastSTRUCTURE (Raj et al., 2014) have been widely popular in this framework, determining hierarchical population structure and admixed populations by grouping individuals into clusters without any a priori. The optimal number of clusters (K) can then be determined based on likelihood, although examining population structure over a range of K values can help to better identify substructure. Since these methods can be slow for large whole-genome or high-density RAD-seq data, reducing SNP redundancy by subsampling unlinked markers (those known to be in low LD, e.g. due to a large physical distance between them) is a way to reduce computation time while keeping the relevant information. More precise inference and tests for directional introgression can then be performed in software such as TreeMix (Pickrell and Pritchard, 2012).
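Subsampling loosely linked markers can be approximated in a few lines. The greedy distance-based rule below is a crude stand-in for proper LD pruning (as done, e.g., by PLINK's --indep-pairwise); the positions are invented and a single sorted chromosome is assumed.

```python
def thin_by_distance(positions, min_dist=10_000):
    """Greedy thinning: keep a SNP only if it lies at least `min_dist`
    base pairs beyond the last SNP kept (a crude proxy for low LD).
    Positions are assumed sorted within a single chromosome.
    Returns the indices of the retained SNPs."""
    kept = []
    last = None
    for i, pos in enumerate(positions):
        if last is None or pos - last >= min_dist:
            kept.append(i)
            last = pos
    return kept

positions = [100, 5_000, 12_000, 13_000, 40_000, 45_000, 60_000]
print(thin_by_distance(positions))  # indices of retained SNPs
```

True LD pruning measures r2 between nearby markers rather than relying on distance alone, but this window-based shortcut is often sufficient to speed up Structure-like analyses.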
Some methods can explicitly use spatial information to inform clustering. A practical R package is Geneland (Guillot et al., 2012), which determines the optimal number of populations in a dataset by optimizing linkage and Hardy-Weinberg equilibrium within clusters, and can also incorporate geographic coordinates in the model to delineate their spatial organization. It can be useful for characterizing the location and shape of hybrid zones. On the other hand, methods such as BEDASSLE (Bradburd et al., 2013) can complement these approaches by identifying which combination of geographical and ecological distances limits dispersal. However, disentangling these effects has proved complex, and a deeper analysis of the genes most strongly impacted by either geography or ecology may be more informative regarding the proximate causes of reduced dispersal and differentiation, such as biased dispersal (Edelaar and Bolnick, 2012; Bolnick and Otto, 2013) or selection against migrants (Hendry, 2004).
Phylogenetic methods such as RAxML (Stamatakis, 2014) or BEAST2 (Drummond and Rambaut, 2007) have been popular for clustering individuals into populations at the species level. Given their underlying assumptions (e.g. that homoplasy occurs through mutation, not recombination), their use should however be restricted to complexes of populations with low gene flow, little ongoing recombination and sufficient divergence, since even methods dedicated to reconstructing species trees, such as *BEAST, can be strongly biased when estimating divergence times and effective population sizes (Leaché et al., 2014). Methods implemented in SplitsTree (Huson and Bryant, 2006) make fewer assumptions and are therefore better suited for building networks linking individuals. While useful to infer topologies, caution is advised when using branch lengths obtained from SNP-only datasets, e.g. to calculate divergence times between different groups or species (Leaché et al., 2015). For this purpose, it may therefore be preferable to extract both variant and invariant sites at several genes or RAD contigs and analyze the whole sequences in software like BEAST2.
To assess how diversity is partitioned across the groups inferred by the methods described above, it is advisable to perform an Analysis of Molecular Variance (AMOVA). Arlequin (Excoffier and Lischer, 2010) is particularly suited to this task. More generally, investigating patterns of nucleotide diversity, inbreeding, Fst or variation in LD between populations and across the genome is useful to get a preliminary idea of gene flow, admixture and variation in population sizes. These statistics can easily be retrieved with VCFTools or PopGenome (Pfeifer et al., 2014). If a reference genome is available, they can also be used to scan for regions under selection, or regions more likely to display introgression, while controlling for recombination (e.g. with LDHat, Table 1).
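As an illustration of how such summary statistics are computed from allele frequencies, the sketch below implements per-site nucleotide diversity and Hudson's Fst estimator (as a ratio of averages over sites). It is a toy version, not the PopGenome or VCFTools code, and the input frequencies are invented.

```python
import numpy as np

def nucleotide_diversity(alt_counts, n_alleles):
    """Mean per-site pi: 2*p*q*n/(n-1) (Nei and Li, 1979), from
    alternate-allele counts and the number of sampled alleles per site."""
    p = np.asarray(alt_counts, dtype=float) / n_alleles
    return np.mean(2 * p * (1 - p) * n_alleles / (n_alleles - 1))

def hudson_fst(p1, n1, p2, n2):
    """Hudson's Fst estimator between two populations, computed as a
    ratio of averages over sites, from per-site allele frequencies
    p1, p2 and sample sizes n1, n2 (counted in alleles)."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    num = (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num.sum() / den.sum()

# Strongly differentiated toy sites give an Fst close to 1;
# undifferentiated sites give an Fst close to 0.
print(hudson_fst([0.95, 0.9], 40, [0.05, 0.1], 40))
```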
Screening for selection and association
Selection and its impact on sequence variation
Checking for population structure is an essential step when performing analyses on genome-level datasets. Neglecting it can bias demographic inferences (Chikhi et al., 2010; Heller et al., 2013) or the detection of loci under selection (e.g. Nielsen et al., 2007); thus, checking for outlier individuals and assessing the global structure is required prior to any more sophisticated analysis. On the other hand, selection acts on correlations i) between alleles and environment at selected loci and ii) between alleles at different loci, whether directly under selection or not. This is reflected respectively by i) variation in polymorphism within and between populations and ii) linkage disequilibrium (LD) between loci. If selection is widespread in the genome, the study of population history can therefore be biased, making the joint study of selection and population structure necessary.
In the sections that follow, we present tools that can be used to detect signatures of selection (Table 2), but that are also informative for assessing how heterogeneous variation can be at a genome scale, information that can be used, e.g., to retrieve signatures of introgression or to identify loci involved in reproductive isolation. The methods these tools implement fall into three main categories (partly reviewed in Vitti et al., 2013), corresponding to the signature they target: i) study of variation in allele frequencies and polymorphism, ii) study of variation in linkage disequilibrium and iii) reconstruction of allele genealogies using the coalescent. Most of these methods assume that markers are ordered along a genome, although they can also be used to extract individual markers under selection that can then be aligned (except for most LD-based methods).
Researchers sometimes report results obtained from only a few methods when studying selection (François et al., 2015). However, many methods (even popular ones, such as Bayescan) can suffer from high false positive rates under some demographic scenarios (Lotterhos and Whitlock, 2014). Combining methods can therefore help prevent this.
Methods focusing on polymorphism
While demographic forces such as drift and migration affect the whole genome, local effects of selection should produce discrepancies with genome-wide polymorphism (Lewontin and Krakauer, 1973). Selection affects allele frequencies and polymorphism in predictable ways at the scale of single populations. Several statistics summarize these patterns, like π, the nucleotide diversity (Nei and Li, 1979), Tajima's D (Tajima, 1989) or Fay and Wu's H (Fay and Wu, 2000). These statistics are sensitive to population demographic history, which makes them useful as summary statistics (e.g. in ABC analyses). They nonetheless have the potential to highlight genomic regions displaying clear signatures of selection, or to confirm selection at candidate genes. For example, balancing selection should lead to an excess of common polymorphisms, similar to a recent bottleneck, leading to high Tajima's D and π values. Purifying selection leads to the opposite pattern, similar to a recent population expansion, with an excess of rare variants and low diversity. Combining these statistics allows more precise identification of the targets of selection, and has been used to develop composite tests, like the composite likelihood ratio (CLR) test (Nielsen et al., 2005).
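As an example of how these statistics relate to each other, Tajima's D can be computed from just the sample size, the number of segregating sites and the mean pairwise diversity, following the formulas in Tajima (1989); the sketch below is a minimal implementation with invented inputs.

```python
import math

def tajimas_d(n, S, pi):
    """Tajima's D (Tajima, 1989) from the number of sequences n, the
    number of segregating sites S, and pi, the mean number of pairwise
    differences (summed over sites, not per site)."""
    if S == 0:
        return 0.0
    a1 = sum(1 / i for i in range(1, n))      # Watterson's denominator
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    # D = (pi - theta_W) / sqrt(Var), with theta_W = S / a1
    return (pi - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))

# Under the neutral expectation pi ~ S/a1, D is close to 0; an excess
# of rare variants (low pi) drives D negative, as after an expansion
# or under purifying selection.
n, S = 10, 16
a1 = sum(1 / i for i in range(1, n))
print(tajimas_d(n, S, S / a1), tajimas_d(n, S, 1.0))
```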
When an allele is under positive selection in a population, its frequency tends to rise until fixation, unless gene flow from other populations or strong drift prevents it. It is therefore possible to contrast patterns of differentiation between populations adapted to their local environment to detect loci under divergent selection (e.g. those displaying a high Fst). However, it is essential to control for population structure, as it may strongly affect the distribution of differentiation measures and produce high rates of false positives. First attempts to take population structure and variation in gene flow into account included FDIST2 (Beaumont and Nichols, 1996), which modeled populations as islands and aimed at detecting loci under selection by contrasting heterozygosity with Fst between populations. More sophisticated methods dedicated to the detection of outliers in large genomic datasets are now available. Most of them correct for relatedness across samples, and are reviewed extensively by François et al. (2015). Some methods, like LFMM (Frichot et al., 2013), aim at detecting variants correlated with environmental factors.
Other methods perform a “naïve scan” for outliers on the basis of differentiation, like BAYESCAN (Foll and Gaggiotti, 2008), which considers all populations to drift at different rates from a single ancestral pool. More recent methods, like BAYPASS (Gautier, 2015), model demographic history by computing a kinship matrix between populations. These methods are particularly well suited to the study of RAD-sequencing data, for which allele frequencies are often the only information available in the absence of a reference genome.
Detecting an association between environment and allele frequencies does not necessarily imply a role for local adaptation. For example, in the case of secondary contact, intrinsic genetic incompatibilities can lead to the emergence of tension zones that may shift until they reach an environmental barrier where they can be trapped (Bierne et al., 2011). Characterizing population history is therefore required before drawing conclusions about the possible involvement of a genomic region in adaptation to the environment. The sampling strategy must take into account the particular historical and demographic features of the species investigated to gain power (Nielsen et al., 2007), and must also be designed carefully to control for spatial autocorrelation of genotypes due to isolation by distance and shared demographic history.
The methods described above focus on allele frequencies at the population scale, but do not allow the characterization of association with a trait varying between individuals within populations (e.g. resistance to a pathogen, symbiotic association, individual size or flowering time). For this task, methods performing genome-wide association studies (GWAS) are better suited, although the recent development of multivariate methods such as PCAdapt (Duforet-Frebourg et al., 2016) also makes it possible to identify outlier loci in admixed or continuous populations. Methods such as GenABEL in R (Aulchenko et al., 2007) or PLINK (Purcell et al., 2007) are powerful tools for this purpose. Taking relatedness between samples and population history into account is required to limit false positives; this is especially recommended for species that undergo episodes of selfing or strong bottlenecks, for which sampling unrelated individuals may be unfeasible.
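To make the logic of a single-SNP association test concrete, the sketch below implements a basic 1-degree-of-freedom allelic chi-square test on a 2x2 table of allele counts in cases and controls. The counts are invented, and real GWAS software adds the corrections for structure and relatedness discussed above, which are deliberately omitted here.

```python
def allelic_chi2(case_alt, case_n, control_alt, control_n):
    """Basic allelic chi-square test for one SNP: compares alternate
    allele counts between cases and controls. Counts are in alleles
    (2 per diploid individual). No correction for population structure
    or relatedness - dedicated GWAS tools go well beyond this."""
    a, b = case_alt, case_n - case_alt           # case alt / ref alleles
    c, d = control_alt, control_n - control_alt  # control alt / ref alleles
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

# Alternate allele much more common in cases: large chi-square value,
# flagging the SNP for follow-up.
print(allelic_chi2(case_alt=70, case_n=100, control_alt=30, control_n=100))
```

In practice this statistic would be computed for every SNP and compared against a multiple-testing threshold (e.g. Bonferroni or FDR).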
It is important to keep in mind that uncovering the genetic bases of complex, polygenic traits remains challenging, even in model species (Pritchard and Di Rienzo, 2010; Rockman, 2012). It may be unavoidable, in a first step, to focus only on traits under relatively simple genetic determinism. This can however lead to an overrepresentation of loci of major phenotypic effect, a fact that should be acknowledged when discussing the impact of selection on genome variation: that loci of major effect are easier to target does not imply that they are the main substrate of selection (Rockman, 2012). Association methods may help to target variants undergoing soft sweeps, weak selection or polygenic control of traits (Pritchard et al., 2010), for which signatures of selection are subtle and sometimes difficult to retrieve from allele frequency data.
Understanding the origin of genomic regions under selection highlights the evolutionary history of adaptive alleles (e.g. Abi-Rached et al., 2011) and contributes to understanding the origin and maintenance of reproductive isolation. To address the adaptive contribution of introgressed segments, one may first identify these segments, estimate the relative contribution of each parental population (chromosome painting), and then assess whether they display signatures of selection (Racimo et al., 2015). Advantageous alleles can migrate from one population to another, or resist introgression from other populations, and the relative importance of these islands resisting gene flow after secondary contact has been discussed recently (Cruickshank and Hahn, 2014). The many studies focusing on hybrid zones and introgression provide inspiring examples (Hedrick, 2013), as demonstrated by recent work on patterns of heterogeneous gene flow in Mytilus mussels (Roux et al., 2014), localized introgression and inversions at a color locus in Heliconius butterflies (The Heliconius Genome Consortium et al., 2012) or the adaptive introgression of anticoagulant resistance alleles in mice (Song et al., 2011).
Methods aimed at characterizing heterogeneity in introgression rates are useful to detect adaptive introgression and refine demographic history. A common test for introgression, available in PopGenome, is the ABBA-BABA test, summarized by Patterson's D statistic (Durand et al., 2011). Another possibility lies in comparing absolute and relative measures of divergence (Cruickshank and Hahn, 2014), such as dxy and Fst, which can also be calculated in PopGenome. Phylogenetic methods able to contrast gene trees with species trees, such as *BEAST, can be used to infer whether a substantial proportion of loci display inconsistent information. A recent ABC framework has also been proposed to characterize genome-wide heterogeneity in migration rates (Roux et al., 2014).
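As an illustration, Patterson's D can be computed directly from derived-allele frequencies using the frequency-based formulation of Durand et al. (2011). The sketch below is a toy version with invented frequencies, not the PopGenome implementation, and omits the block-jackknife significance test used in practice.

```python
import numpy as np

def patterson_d(p1, p2, p3, p4):
    """Patterson's D from derived-allele frequencies in populations
    P1, P2, P3 and outgroup O (Durand et al., 2011):
      ABBA = (1-p1)*p2*p3*(1-p4),  BABA = p1*(1-p2)*p3*(1-p4)
      D = sum(ABBA - BABA) / sum(ABBA + BABA)
    D > 0 suggests excess allele sharing between P2 and P3 (introgression
    or ancestral structure); D ~ 0 is the tree-like expectation."""
    p1, p2, p3, p4 = (np.asarray(p, dtype=float) for p in (p1, p2, p3, p4))
    abba = (1 - p1) * p2 * p3 * (1 - p4)
    baba = p1 * (1 - p2) * p3 * (1 - p4)
    return (abba - baba).sum() / (abba + baba).sum()

# Toy sites where P2 preferentially shares the derived allele with P3.
p1 = [0.1, 0.1, 0.0]
p2 = [0.8, 0.9, 0.1]
p3 = [0.9, 0.8, 0.9]
p4 = [0.0, 0.0, 0.0]
print(patterson_d(p1, p2, p3, p4))  # positive: P2-P3 sharing
```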
Absolute measures of divergence correlate with the time since coalescence. In the case of local introgression, both Fst and dxy should be reduced. Under balancing selection, by contrast, the decline in Fst is due to an excess of shared ancestral alleles, which should not reduce dxy and may even make it higher than the genomic background. However, these methods do not prevent false positives, and results should be interpreted with caution (Martin et al., 2015).
Detecting selection with methods focusing on LD
LD is increased and diversity is decreased near a selected allele, especially after recent selection. One class of methods aims at targeting regions that display an excess of long homozygous haplotypes, such as the extended haplotype homozygosity (EHH) test (Sabeti et al., 2002). It is also possible to compare haplotype extension across populations, with the XP-EHH test (McCarroll et al., 2007) or Rsb (Tang et al., 2007). Individuals included in the analysis should be as distantly related as possible to improve precision and avoid an excess of false positives. These methods require data to be phased in order to reconstruct haplotypes, which can be done with fastPhase (Scheet and Stephens, 2006), BEAGLE (Browning and Browning, 2011) or SHAPEIT2 (O’Connell et al., 2014). The R package rehh (Gautier and Vitalis, 2012) allows the calculation of these statistics, as do Sweep (http://www.broadinstitute.org/mpg/sweep/index.html) and selscan (Szpiech and Hernandez, 2014). Statistics dedicated to the detection of soft sweeps and selection on standing variation are also available, like the nSL statistic (Ferrer-Admetlla et al., 2014) in selscan or the H2/H1 statistic (Garud et al., 2015), although further studies are still needed to understand to what extent hard and soft sweeps can actually be distinguished (Schrider et al., 2015). When the relative order of markers is not known, as can be the case in RAD-seq studies without a reference genome, LDna (Kemppainen et al., 2015) can be used to target sets of markers displaying strong linkage disequilibrium. This approach can be useful not only to detect selection but also structural variation such as large inversions.
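To make the EHH idea concrete, the sketch below computes haplotype homozygosity extending away from a core marker on a handful of invented phased haplotypes. Real implementations (rehh, selscan) handle missing data, genetic map distances and significance testing, all of which are omitted here.

```python
from collections import Counter

def ehh(haplotypes, core_idx, end_idx):
    """Extended haplotype homozygosity (after Sabeti et al., 2002):
    the probability that two randomly drawn haplotypes are identical
    over all markers from core_idx to end_idx (inclusive). Haplotypes
    are strings (or sequences) of phased alleles. EHH starts near 1 at
    the core and decays with distance; slow decay suggests a recent
    sweep."""
    n = len(haplotypes)
    lo, hi = min(core_idx, end_idx), max(core_idx, end_idx)
    counts = Counter(tuple(h[lo:hi + 1]) for h in haplotypes)
    pairs = sum(c * (c - 1) for c in counts.values())  # ordered pairs
    return pairs / (n * (n - 1))

# Four phased toy haplotypes sharing a long stretch around marker 0;
# EHH decays as the window is extended to the right.
haps = ["00110", "00111", "00101", "11000"]
print([ehh(haps, 0, j) for j in range(5)])
```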
Even hard selective sweeps can be challenging to detect with LD-based statistics (Jensen, 2014), and it is advisable to combine several approaches to improve confidence when pinpointing candidate genes for selection. Methods based on LD alone can sometimes miss the actual variants under selection because of the impact of recombination on local polymorphism, which can mimic soft or ongoing hard sweeps (Schrider et al., 2015).
These approaches are more powerful with a relatively high density of markers, such as those obtained from whole-genome sequencing or high-density RAD-seq, and benefit from being combined with statistics focusing on polymorphism and allele sharing. In a recent study of local adaptation in sticklebacks (Roesti et al., 2015), these methods were applied to dense RAD-sequencing data to characterize the extent of selection at markers displaying high differentiation (FST), pinpointing new candidates and confirming previous ones (such as the Ectodysplasin gene). In addition, the identification of large regions displaying high divergence and LD revealed the importance of large-scale structural variation in shaping genome structure.
Detecting and characterizing selection with the coalescent
When a candidate locus has been identified, coalescent simulations can be used to evaluate the strength of selection and estimate the age of alleles, with software such as msms (Ewing and Hermisson, 2010), which is also available in PopGenome. This requires that the neutral history of populations be known, in order to properly control for, e.g., population structure and gene flow.
An advantage of full coalescent methods is that they provide a relatively complete picture of the history of individual loci, by modeling coalescence and recombination and by considering variation in mutation rate. They are, however, computationally intensive, and thus difficult to apply to whole genomes. Recent computational improvements are making this procedure feasible, as illustrated by ARGWeaver (Rasmussen et al., 2014), which allowed recovering known candidate genes for balancing selection in human data. This method uses ancestral recombination graphs to model the genealogy of each non-recombining block in the genome, extracting a genealogy for each block together with estimates of local recombination rate, coalescence time and local effective population size. This approach is promising for characterizing positive, purifying or balancing selection while taking into account variation in recombination and mutation rates. However, the high stochasticity in parameter estimation can limit resolution when targeting single genes. Other methods use the theoretical framework of the coalescent to target sites under positive selection. A recent method (SCCT) using conditional coalescent trees (Wang et al., 2014) claims to be faster and more precise in targeting selective sweeps, and BALLET (DeGiorgio et al., 2014) is a promising method to characterize ancient balancing selection. Most of these methods are designed for medium-to-high-depth whole-genome resequencing, and require that individual genotypes be phased and well characterized.
Identifying variants of functional interest
Characterizing the amount of synonymous and non-synonymous mutations is another way to detect whether a specific gene undergoes purifying or positive selection. An excess of non-synonymous mutations can signal positive or balancing selection, or a relaxation of selective constraints on a given gene. This requires that an annotated genome be available. Annotation of mutations can be done with SNPdat (Doran and Creevey, 2013), or directly in PopGenome, which can also perform genome-scale tests of selection such as the MK test (McDonald and Kreitman, 1991). Another popular test of selection compares non-synonymous and synonymous mutations between orthologs from different species; it can be performed in packages such as PAML (Yang, 2007). To recover information about the putative function of a gene or a genomic region, it may be useful to perform a gene ontology (GO) enrichment analysis, using tools such as BLAST2GO (Conesa et al., 2005). When interpreting the link between selection and genetic variation, a careful review of the literature can fruitfully complement the conclusions drawn from GO enrichment analyses.
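The arithmetic behind the MK test is simple enough to sketch. The `mcdonald_kreitman` helper below is illustrative only (a real analysis would also attach a p-value to the 2x2 table, e.g. with Fisher's exact test); the toy counts correspond to the classic Drosophila Adh table of McDonald and Kreitman (1991).

```python
def mcdonald_kreitman(pn, ps, dn, ds):
    """McDonald-Kreitman 2x2 table summary.

    pn/ps: non-synonymous / synonymous polymorphisms within species;
    dn/ds: non-synonymous / synonymous fixed differences between species.
    Under neutrality pn/ps ~ dn/ds. Returns the neutrality index
    NI = (pn/ps) / (dn/ds) and alpha = 1 - NI, the estimated fraction
    of non-synonymous substitutions fixed by positive selection.
    """
    ni = (pn / ps) / (dn / ds)
    return ni, 1.0 - ni

# Adh counts: an excess of non-synonymous divergence (dn) relative to
# non-synonymous polymorphism (pn) suggests recurrent positive selection.
ni, alpha = mcdonald_kreitman(pn=2, ps=42, dn=7, ds=17)
print(round(ni, 3), round(alpha, 3))  # 0.116 0.884
```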
While suggestive, genome scans for selection and association in natural populations cannot be considered conclusive evidence for the function of a given gene, and need to be combined with functional evidence (Vitti et al., 2013). Such evidence might be provided by variation in the expression of a candidate highlighted by RNA-sequencing data (but see Box 1), but more generally implies that developmental studies be performed, a step that is not always possible for non-model organisms. Pinpointing the exact genetic mutation leading to a change in phenotype is challenging even when combining several tests for selection, and requires whole-genome sequencing data to obtain a near-exhaustive list of mutations. It has been proposed to combine QTL analyses with population genomics to facilitate the identification of candidate loci (Stinchcombe and Hoekstra, 2008). Basically, controlled crosses allow identifying genomic regions associated with a selected phenotype, while the study of variation in natural populations facilitates the fine-mapping of the variants actually selected in the wild. However, this requires that the species of interest can be raised in the laboratory, which is impractical for many research teams. An alternative is the study of candidate genes for which an extensive description of functional variation is available. For example, in a recent study on bananaquits, GBS data were used to obtain a neutral distribution against which patterns of substitution and differentiation at candidate genes for color variation were compared (Uy et al., 2016). In another study, on color polymorphism in Peromyscus mice, a combination of field experiments, targeted sequencing of candidate genes and neutral regions, and genome scans for selection and association showed how selection on many mutations at the same locus drives adaptive phenotypic divergence (Linnen et al., 2013).
The combination of tests aiming at different signatures of selection can help reduce the size of candidate regions. For example, combining results from environmental association mapping and genomic scans for selection allows the identification of candidate genes for which a function can be proposed (François et al., 2015). Another common approach relies on the combination of tests targeting different signatures of selection, typically those using the allele frequency spectrum and those using haplotype length. A test of this sort, the composite of multiple signals (CMS) test, has been proposed in human genetics (Grossman et al., 2013); specifically, CMS integrates FST with iHS, XP-EHH and other statistics describing the AFS. Nevertheless, signatures of selection can be elusive, and obtaining an exhaustive list of genes under positive selection is unlikely. Further advances will require that methods targeting selection better take into account epistatic interactions and weak selection.
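The idea of combining tests can be sketched as follows. This is a deliberately simplified rank-based composite, not the likelihood-based CMS of Grossman et al.; the statistics, loci and helper names are made up for illustration.

```python
import math

def empirical_pvalues(values):
    """One-sided empirical p-value of each value within its own
    genome-wide distribution (higher value = more extreme)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    p = [0.0] * n
    for rank, i in enumerate(order):
        p[i] = (n - rank) / (n + 1)  # +1 avoids p == 0
    return p

def composite_score(stat_tables):
    """Combine several per-locus statistics (each a list aligned on the
    same loci) into one composite score per locus, Fisher-style:
    the sum of -log(empirical p) across statistics."""
    per_stat_p = [empirical_pvalues(s) for s in stat_tables]
    n_loci = len(stat_tables[0])
    return [
        sum(-math.log(p[i]) for p in per_stat_p)
        for i in range(n_loci)
    ]

# Toy example: locus 2 is extreme for both FST and a haplotype statistic,
# so it gets the strongest joint signal.
fst =       [0.02, 0.05, 0.61, 0.04, 0.08]
hap_score = [0.1,  0.3,  2.9,  0.2,  0.4]
scores = composite_score([fst, hap_score])
print(scores.index(max(scores)))  # 2
```

A locus that is only moderately extreme in each individual test can still stand out in the composite, which is the main appeal of this family of approaches.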
Population history
The coalescent first emerged to provide population geneticists with a way of modeling the genealogy of alleles in a sample taken from a large population. Going backward in time, alleles merge (coalesce) in a stochastic way until reaching their most recent common ancestor (Kingman, 1982). The most well-known coalescent-based tools dedicated to population genetics include IMa (Hey and Nielsen, 2007), Migrate-n (Beerli and Palczewski, 2010) and Lamarc (Kuhner, 2009) (Table 3). Obtaining demographic estimates (e.g. time in years) usually requires that mutation rate and generation time be known, or at least reasonably well estimated, for example from closely related species with similar life history. Since selection impacts allele frequencies, loci that are candidates for selection are commonly removed prior to any analysis with methods using the AFS.
Recent methods have been developed to handle whole-genome datasets and infer variation in population size through time without an a priori model, such as those based on the Pairwise Sequentially Markovian Coalescent (PSMC), which require only a single diploid genome (Li and Durbin, 2011). One general drawback of these methods is that they are limited to rather simple scenarios, not yet handling more than two populations (but see diCal2, Table 2). While powerful, PSMC is sensitive to confounding factors such as population structure (Orozco-terWengel, 2016), which can lead to false signatures of expansion or bottleneck. It also does not allow studying recent demographic events, since coalescence events between only two alleles from a single individual are infrequent in the recent past. However, extensions of the model allowing for several genomes have been developed to resolve population history in the recent past, like MSMC (Schiffels and Durbin, 2014) or diCal (Sheehan et al., 2013). Recently, an ABC framework, implemented in PopSizeABC, has been proposed to infer demographic variation from single genomes (Boistard et al., 2016). A recent extension of these methods takes population structure into account and aims to identify the number of islands contributing to a single genome, assuming it is sampled from a Wright n-island meta-population (Mazet et al., 2015). Such developments should help increase the amount of information retrieved from only a few genomes. However, it is essential to keep in mind that natural populations are structured and connected in complex ways, which can bias demographic inferences, even for popular markers such as mitochondrial sequences (Heller et al., 2013).
A computationally faster approach is to use Approximate Bayesian Computation (ABC) methods, which compare the empirical data with a set of simulated data produced by coalescent simulations under predefined scenarios. By measuring the distance between carefully chosen summary statistics describing each simulation and those from the observed dataset, it is possible to infer which scenario best explains the data. More details on how to perform an ABC analysis are given by Csilléry et al. (2010). The main advantage of ABC is that it can handle any type of marker and arbitrarily complex models, in contrast to methods like IMa where the model is predefined. However, using summary statistics leads to the loss of potentially useful information.
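The rejection-ABC loop can be sketched in a few lines of Python, using a deliberately toy one-parameter model (expected heterozygosity under mutation-drift balance) in place of a coalescent simulator; the prior bounds, mutation rate and tolerance are arbitrary choices for illustration. Real analyses rely on coalescent simulators and dedicated packages (e.g. the R abc package) with several summary statistics and a regression-adjustment step.

```python
import random

def simulate_het(pop_size, n_loci=200, mu=1e-3, rng=random):
    """Toy simulator: fraction of heterozygous loci expected under
    mutation-drift balance, theta/(1 + theta) with theta = 4*N*mu,
    plus binomial sampling noise over `n_loci` loci."""
    theta = 4 * pop_size * mu
    expected = theta / (1 + theta)
    return sum(rng.random() < expected for _ in range(n_loci)) / n_loci

def abc_rejection(observed, prior_draw, n_sims=5000, tolerance=0.01, seed=1):
    """Rejection ABC: draw a parameter from the prior, simulate, and keep
    the draw if the simulated summary statistic lands within `tolerance`
    of the observed one. Accepted draws approximate the posterior."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_sims):
        n = prior_draw(rng)
        if abs(simulate_het(n, rng=rng) - observed) < tolerance:
            accepted.append(n)
    return accepted

# Pseudo-observed data generated with N = 500, then "re-estimated"
# under a uniform prior on N.
obs = simulate_het(500, rng=random.Random(42))
post = abc_rejection(obs, prior_draw=lambda r: r.uniform(100, 2000))
estimate = sum(post) / len(post)
print(round(estimate))
```

The same skeleton scales to model choice: simulate under each scenario, and compare the proportions of accepted draws per scenario.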
Methods based on IBS (Harris and Nielsen, 2013) and IBD tracts (Palamara and Pe’er, 2013) constitute an interesting alternative for model testing when high-density RAD-seq data or whole-genome datasets are available in large numbers (more than 100 individuals were required to infer recent demographic events with DoRIS). The large sample sizes required to infer recent events make these methods mostly relevant to researchers working on near-model species for which a substantial amount of data is available.
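The raw signal these methods exploit, long tracts of identity between chromosomes whose length distribution reflects the timing of shared ancestry, can be illustrated with a short sketch; the positions, haplotypes and `ibs_tract_lengths` helper below are invented for the example.

```python
def ibs_tract_lengths(hap_a, hap_b, positions):
    """Lengths (in bp) of maximal runs of sites where two haplotypes
    carry the same allele. Long tracts point to recent shared ancestry,
    since recombination has had little time to break them up."""
    tracts, start, end = [], None, None
    for a, b, pos in zip(hap_a, hap_b, positions):
        if a == b:
            if start is None:
                start = pos
            end = pos
        elif start is not None:
            tracts.append(end - start)
            start = None
    if start is not None:
        tracts.append(end - start)
    return tracts

positions = [100, 500, 900, 1500, 2200, 2600]
a = [0, 1, 1, 0, 1, 0]
b = [0, 1, 0, 0, 1, 0]
print(ibs_tract_lengths(a, b, positions))  # [400, 1100]
```

Methods such as DoRIS fit the genome-wide distribution of these tract lengths, pooled over many pairs of individuals, to the expectations of explicit demographic models.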
More recently, new methods based on the allele frequency spectrum (AFS), such as dadi, have emerged to facilitate and speed up the analysis of large SNP datasets. Different patterns of gene flow and demographic events shape the AFS in specific ways (e.g. more alleles are likely to be found at similar frequencies in two recently diverged or highly connected populations). These methods generally assume that SNPs are at linkage equilibrium. Including SNPs in strong LD should not particularly bias model comparison, but can be an issue when estimating parameters (see the fastsimcoal manual for more details). Note that the AFS can also be used as a set of summary statistics for ABC inference. Using allele frequencies estimated from pooled datasets is also feasible, as illustrated by a recent study on hybridization in Populus species where the AFS was estimated from pooled whole-genome resequencing data (Christe et al., 2016).
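Building the AFS itself is straightforward to sketch, here in its folded form (using minor allele counts, so no ancestral state is needed); the counts are made up, and real pipelines such as dadi or fastsimcoal2 build the spectrum directly from VCFs and handle missing data and projection to smaller sample sizes.

```python
from collections import Counter

def folded_afs(allele_counts, n_chrom):
    """Folded allele frequency spectrum from per-SNP allele counts.

    `allele_counts` holds, for each SNP, the count of one arbitrary
    allele among `n_chrom` sampled chromosomes; folding takes the minor
    allele, so no outgroup or ancestral state is required.
    """
    spectrum = Counter(min(c, n_chrom - c) for c in allele_counts)
    # bins 1 .. n_chrom // 2 (count 0 would be a monomorphic site)
    return [spectrum.get(i, 0) for i in range(1, n_chrom // 2 + 1)]

# 10 sampled chromosomes; allele counts for eight SNPs. An excess of
# singletons (first bin) is a classic signature of expansion or sweeps.
counts = [1, 9, 2, 5, 1, 3, 8, 5]
print(folded_afs(counts, n_chrom=10))  # [3, 2, 1, 0, 2]
```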
One drawback of using SNP data without considering monomorphic sites is that the mutation rate per generation is not directly taken into account. For example, in DIYABC, it does not matter when a mutation appears in the simulated genealogy, as long as it happens only once before coalescence, a reasonable assumption for SNP markers. However, this prevents any conversion of parameters into demographic estimates through the mutation rate. It is also possible to extract the complete DNA sequence for a set of randomly selected markers and perform analyses on this dataset including monomorphic sites. Another possibility is to calibrate parameter estimates by including a fixed parameter in the analysis, such as population size or divergence time. This approach is also feasible when estimating parameters from the allele frequency spectrum, as in dadi or fastsimcoal2. Reaching a high level of precision in demographic parameter estimation can be challenging when information about the evolutionary history of the species considered is lacking. At larger time scales, the lack of a fossil record can make the calibration of molecular clocks difficult. Thus, for some species, only qualitative interpretation will be possible.
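The conversion itself is simple arithmetic, sketched below under the common conventions theta = 4*Ne*mu per site and times scaled in units of 2*Ne generations. Conventions differ between programs, so the scaling used here is an assumption to check against each program's manual, and all numeric values are hypothetical.

```python
def scale_by_mutation_rate(theta_per_site, mu, t_scaled=None,
                           generation_time=1.0):
    """Convert coalescent-scaled estimates into demographic units.

    theta_per_site = 4*Ne*mu  ->  Ne = theta / (4*mu).
    Times are assumed scaled in units of 2*Ne generations, so
    T_years = t_scaled * 2 * Ne * generation_time (check the manual
    of the program that produced the estimates).
    """
    ne = theta_per_site / (4 * mu)
    out = {"Ne": ne}
    if t_scaled is not None:
        out["T_years"] = t_scaled * 2 * ne * generation_time
    return out

# Hypothetical values: theta = 0.004 per site, mu = 1e-8 per site per
# generation, split time 0.5 (in 2*Ne generations), 2-year generations.
est = scale_by_mutation_rate(0.004, 1e-8, t_scaled=0.5, generation_time=2.0)
print(est["Ne"], est["T_years"])  # Ne ~ 1e5, T ~ 2e5 years
```

Errors in mu and generation time propagate linearly into Ne and T, which is why uncertainty in these constants often dominates the final demographic estimates.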
There is currently a trade-off between methods allowing for arbitrarily complex models defined a priori by the user (e.g. ABC), and methods that more "naively" track changes in population size or migration (e.g. MSMC). While the former better model the actual complexity of most living systems, the latter are less prone to user bias. Using both can therefore help robustly retrieve the evolutionary history of a given system. For example, a recent study of maize demographic and selective history used both dadi and MSMC to characterize the bottleneck and expansion associated with domestication (Beissinger et al., 2016). Note that for this study, all scripts and methods have been made available online, enhancing its reproducibility.
Suggestions and perspectives
Estimating selection and demography jointly along a heterogeneous genome
As stated by Lewontin and Krakauer in 1973, "while natural selection will operate differently for each locus and each allele at a locus, the effect of breeding structure is uniform over all loci and all alleles" (Lewontin and Krakauer, 1973). Since then, traditional studies on selection have mostly considered that demographic processes act on all loci in the same way across a genome, and that positive selection is mostly rare. The traditional approach has thus tended to disconnect the study of selection from the study of demography (Li et al., 2012).
However, this assumption may be incorrect, and a good understanding of demography is crucial to understand how efficient selection can be. Conversely, removing loci under selection is needed to retrieve the actual demographic history of a set of populations. For example, the large effective population sizes of Drosophila have been hypothesized to facilitate a widespread effect of selection across the genome (Sattath et al., 2011; discussed in Li et al., 2012), making both demographic inference and detection of outliers difficult. Other confounding factors include variation in recombination and mutation rates and background selection (Ewing and Jensen, 2016). Even for model species, these confounding effects have only recently been characterized precisely, and obtaining accurate recombination and mutation maps is challenging for non-model species. It has been shown in the last few years that variation in introgression rates along the genome, and coupling between loci involved in reproductive isolation and those involved in local adaptation, can bias inference about selection and demography (Bierne et al., 2011; Roux et al., 2014). A locally low recombination rate can lead to reduced polymorphism and be mistaken for a signature of purifying selection.
These issues can only be addressed by going beyond the categorization of methods as dedicated to either the study of selection or of demography, and by using the results obtained by one method to inform the other. The availability of a reference genome facilitating the positioning of markers is helpful in this regard. In their RAD-sequencing study of the two lineages of the European sea bass, Tine et al. took into account variation in recombination rate along the genome to interpret signatures of reduced polymorphism as being the result of selection or of low recombination (Tine et al., 2014). Since differentiation along the genome seemed to reveal islands resisting gene flow, they could fit a model taking into account variation in introgression rates, providing a better fit to the data and showing that islands of high divergence were more likely due to locally reduced gene flow after secondary contact. This example illustrates how a combination of descriptive statistics and coalescent analyses can be used to retrieve information from genomic data about both selection and demography.
Most methods do not actually estimate demography and selection jointly, but rather rely on a process where neutral expectations are first drawn from a set of putatively neutral SNPs (e.g. intergenic SNPs), followed by a step where the likelihood that a marker is under selection is evaluated. Methods such as BAYPASS or PCAdapt conveniently describe population structure and give first insights into the proportion of loci that do not follow neutral expectations. When this proportion is not too high (a high proportion would suggest recent introgression or an excess of markers displaying high LD due to, e.g., large inversions), outliers can be removed and the remaining loci used to compare neutral models and estimate demographic parameters (e.g. using an ABC framework). These estimated parameters can then be used to simulate sequences or independent SNPs and generate a neutral expectation. Loci that are more likely to be neutral can be used to further calibrate tests for selection such as FLK or BAYPASS (Lotterhos and Whitlock, 2014).
Some recent methods are especially relevant to study both demography and selection at once, while taking into account variation in recombination and mutation rates. For whole-genome data, methods reconstructing ancestral recombination graphs (such as ARGWeaver) have high potential, since they retrieve genealogies along the genome and inform about the timing of coalescence events, and therefore about selection and migration. A recent application of the method in human paleogenomics made it possible to quantitatively characterize introgression between modern humans, Neandertals and Denisovans using only a few whole genomes (Kuhlwilm et al., 2016). This method has, however, a high computational and sequencing cost, and is therefore not suited for the study of many individuals.
Caution must prevail when attempting to apply sophisticated methods to disentangle selection and demography. In a recent review, Cruickshank and Hahn suggested that IMa2, which is commonly used to estimate migration rates, does not reliably distinguish between loci under selection and loci resisting gene flow (Cruickshank and Hahn, 2014). In the specific case they highlight (Oryctolagus cuniculus rabbits; Sousa et al., 2013), a descriptive statistic that should have captured introgression signatures (dxy) did not reveal any evidence for differential gene flow between the loci categorized by IMa2. This controversy illustrates that a basic description of the data is needed in combination with more sophisticated methods, whose assumptions, such as neutrality or no recombination within loci, can be violated.
In general, and for all types of datasets, description of the data is essential to assess the proportion of loci displaying consistent patterns and to characterize the genomic landscape of a species. One may for example plot the distribution of FST between populations, mean linkage disequilibrium, nucleotide diversity or p-values of association with a trait. Such an approach has been used in Ficedula flycatchers, clearly highlighting genomic islands of divergence and the higher differentiation on sex chromosomes due to ongoing reproductive isolation (Ellegren et al., 2012).
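As an illustration of such a descriptive first pass, per-locus FST values can be computed with Hudson's estimator and screened against a simple empirical cutoff. The allele frequencies below are made up, and a real scan would use thousands of loci and a more careful null distribution than a fixed quantile.

```python
def hudson_fst(p1, p2, n1, n2):
    """Hudson's per-SNP FST estimator from allele frequencies p1, p2
    in two populations, with n1, n2 sampled chromosomes."""
    num = ((p1 - p2) ** 2
           - p1 * (1 - p1) / (n1 - 1)
           - p2 * (1 - p2) / (n2 - 1))
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num / den if den > 0 else 0.0

# Per-locus FST distribution; loci in the upper tail are candidates
# worth a closer look (here a simple 95% empirical cutoff).
freqs = [(0.10, 0.12), (0.50, 0.48), (0.05, 0.85), (0.30, 0.35),
         (0.22, 0.18), (0.90, 0.12), (0.40, 0.44), (0.60, 0.55)]
fsts = [hudson_fst(p1, p2, 50, 50) for p1, p2 in freqs]
cutoff = sorted(fsts)[int(0.95 * len(fsts))]
outliers = [i for i, f in enumerate(fsts) if f >= cutoff]
print(outliers)  # [2]
```

Plotting `fsts` as a histogram, rather than only listing outliers, is what reveals whether differentiation is genuinely bimodal or whether the "outliers" are just the tail of a single distribution.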
To sum up, the field of population genomics is now moving towards a better integration of selection into a historical framework, while taking selection into account when reconstructing demographic history. The joint inference of loci under selection and quantification of demographic dynamics is of crucial importance in fields such as landscape genomics or the study of ongoing speciation, as it should provide perspective on how proximate mechanisms and gene flow can promote or impair local adaptation to new habitats. The growing availability of genome-wide data for non-model species is therefore promising, but requires caution and high stringency in our interpretation of observed patterns.

With the decreasing cost of sequencing, it has been suggested that NGS should quickly broaden our perspective on complex evolutionary processes, from biogeography (Lexer et al., 2013) to the genetic bases of traits (Hohenlohe, 2014) or the maintenance of polymorphism (Hedrick, 2006). While genomic heterogeneity in migration, mutation or recombination rates does not necessarily preclude conclusions about evolutionary dynamics, it has the potential to blur inferences. The study of DNA sequence variation, already challenging in itself, therefore needs to be combined with other disciplines such as ecology and functional analyses to be informative (Habel et al., 2015), in accordance with the Dobzhanskyan dictum stating that biological sense can only be derived from evolutionary context. This can be done, for example, by assessing the function of selected genes, checking the consistency of demographic history with information retrieved from the fossil record or geological history, and more broadly by integrating population genomics with other fields and methods whenever possible, such as niche modeling, common garden experiments or the study of macro-evolutionary patterns of selection and diversification.
Data sharing, consistency and robustness
Most literature in population genomics has focused so far on single species at once, or on sets of closely related species and subspecies. However, many questions require a more global approach to provide general insights about key processes such as speciation or genetic bases of convergent evolution. While this approach becomes more feasible given the increase in the amount of NGS data (e.g. (Romiguier et al., 2014; Roux et al., 2016)), it requires i) that datasets are made available by researchers, ii) that methods used for analyzing data can be reproduced through unified pipelines. There is a need for a more collaborative and open culture in biology, allowing the free access to data and favoring good practices to allow repeatability of analyses (Nekrutenko and Taylor, 2012), although this cultural shift remains challenging (e.g. Mills et al., 2015; Whitlock et al., 2015).
However, current challenges are not limited to data sharing; they also include dealing with the inflation of sometimes overlapping bioinformatics tools. Overall, our present survey of methods reveals the lack of a unified pipeline dealing at once with the main aspects of population genomics. Instead of working independently, researchers designing those tools could collaborate to propose free, robust and unified pipelines (Prins et al., 2015). Such initiatives, like Galaxy (Goecks et al., 2010) or Bioconductor (Huber et al., 2015), are nonetheless emerging, and propose clear tutorials facilitating their use. ANGSD (Korneliussen et al., 2014) already provides useful utilities for both extensive pre- and post-processing of data (Table 1, Table 3). R has long been popular among biologists, and now offers a set of packages able to handle genome-wide datasets and compatible with the VCF files produced by most SNP callers (Paradis et al., 2016). For example, one may consider using only R to perform clustering analyses, PCA and outlier identification (using sNMF, SNPRelate, PCAdapt and Bioconductor packages), study population structure (adegenet, geneland), compute summary statistics and test for selection (PopGenome, rehh), study association with environment (LFMM in the LEA package), and perform coalescent simulations and ABC inference for demography (coala and abc packages). Now that population genomics is reaching maturity, it is to be hoped that more integrated pipelines will minimize the time spent looking for appropriate software, letting researchers focus on biological questions.
Beyond SNPs: including structural variation/transposable elements/epigenetics
Most studies of selection and demography have so far focused on SNPs, since they are relatively easy to detect with current technology and their mutation mechanism produces mostly biallelic variants, making them easier to use in statistical tests. However, many other heritable genetic alterations can affect genomes, including transposable element insertions, epigenetic modifications such as methylation, duplications, inversions, deletions and translocations. One of the main issues with this type of variation is that its diversity and its impact on the genome can make it difficult to detect in a systematic way (Iskow et al., 2012), especially for species having only a draft genome. It is however possible to use variation at these markers to study selection, for example through differentiation statistics, association with environment or haplotype extension. Combining information about variant position with SNP variation in flanking regions is also a powerful way to detect variants under selection, as highlighted by a recent study of transposable element insertions in Drosophila (Kofler et al., 2012). Recent work also shows that classical summary statistics such as Tajima’s D can be adapted to non-SNP datasets, such as methylation data (Wang and Fan, 2014).
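Part of what makes Tajima's D attractive to transpose to non-SNP data is that it only needs two summaries of variation: the number of segregating (or variable) sites and the mean pairwise diversity. The sketch below uses the standard constants of the test, with invented input values.

```python
import math

def tajimas_d(segregating_sites, pairwise_pi, n):
    """Tajima's D from the number of segregating sites S, the mean
    pairwise diversity pi, and the number of sampled chromosomes n.

    Negative D suggests an excess of rare variants (sweep or expansion);
    positive D an excess of intermediate-frequency variants (balancing
    selection or contraction)."""
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    S = segregating_sites
    var = e1 * S + e2 * S * (S - 1)
    return (pairwise_pi - S / a1) / math.sqrt(var)

# pi well below the Watterson expectation S/a1 gives a negative D,
# as expected after a sweep or a recent expansion.
print(tajimas_d(segregating_sites=16, pairwise_pi=3.0, n=10) < 0)  # True
```

Applied to methylation or insertion polymorphisms, "segregating sites" becomes "variable marks", which is precisely the adaptation explored by Wang and Fan (2014).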
Sets of neutral SNPs can be used to control for demography and relatedness between samples when inferring selection. For example, this type of approach has recently begun to be explored for studying selection on methylation patterns. In a recent Molecular Ecology issue (Verhoeven et al., 2016), a study using bisulfite data was able to place strongly associated methylation variants near genes known to be involved in response to the environment, while another showed a stronger pattern of isolation by distance for methylation-sensitive AFLPs than for regular AFLPs and microsatellites, suggesting a stronger impact of environment on methylation patterns than expected under neutrality (Herrera et al., 2016).
Another potential issue with this type of variation is the current lack of tools able to simulate their mutation models, complicating any comparison drawn from neutral models built from SNPs. This is the case for transposable elements, for which the assumption of mutation/drift equilibrium is problematic, making comparisons of their allele frequency spectrum with that of neutral SNPs potentially misleading. For example, a recent burst of transposition can lead to an excess of low-frequency elements and recent insertions compared to the expectation under equilibrium, even if transposable elements (TEs) are not under purifying selection (Bergman and Bensasson, 2007; Blumenstiel et al., 2014). More generally, neutral models would benefit from new ways to model the appearance of genomic variation through time for non-SNP data, allowing even more conservative assessments of either negative or positive selection.
Acknowledgements
The University of Basel and New York University Abu Dhabi have supported this research. I want to thank two anonymous reviewers, Stephane Boissinot, Joris Bertrand, Anne Roulin and Ben Warren for their insightful comments on previous versions of the manuscript.