Measuring genetic differentiation from Pool-seq data

Valentin Hivert; Raphaël Leblois; Eric J. Petit; Mathieu Gautier; Renaud Vitalis

doi:10.1101/282400

Abstract

The advent of high throughput sequencing and genotyping technologies enables the comparison of patterns of polymorphisms at a very large number of markers. While the characterization of genetic structure from individual sequencing data remains expensive for many non-model species, it has been shown that sequencing pools of individual DNAs (Pool-seq) represents an attractive and cost-effective alternative. However, analyzing sequence read counts from a DNA pool instead of individual genotypes raises statistical challenges in deriving correct estimates of genetic differentiation. In this article, we provide a method-of-moments estimator of F_ST for Pool-seq data, based on an analysis-of-variance framework. We show, by means of simulations, that this new estimator is unbiased, and outperforms previously proposed estimators. We evaluate the robustness of our estimator to model misspecification, such as sequencing errors and uneven contributions of individual DNAs to the pools. Finally, by reanalyzing published Pool-seq data of different ecotypes of the prickly sculpin Cottus asper, we show how the use of an unbiased F_ST estimator may question the interpretation of population structure inferred from previous analyses.

INTRODUCTION

It has long been recognized that the subdivision of species into subpopulations, social groups and families fosters genetic differentiation (Wahlund 1928; Wright 1931). Characterizing genetic differentiation as a means to infer unknown population structure is therefore fundamental to population genetics, and finds applications in multiple domains, including conservation biology, invasion biology, association mapping and forensics, among many others. In the late 1940s and early 1950s, Malécot (1948) and Wright (1951) introduced F-statistics to partition genetic variation within and between groups of individuals (Holsinger and Weir 2009; Bhatia et al. 2013). Since then, the estimation of F-statistics has become standard practice (see, e.g., Weir 1996; Weir and Hill 2002; Weir 2012), and the most commonly used estimators of F_ST have been developed in an analysis-of-variance framework (Cockerham 1969, 1973; Weir and Cockerham 1984), which can be recast in terms of probabilities of identity of pairs of homologous genes (Cockerham and Weir 1987; Rousset 2007; Weir and Goudet 2017).

Assuming that molecular markers are neutral, estimates of F_ST are typically used to quantify genetic structure in natural populations, which is then interpreted as the result of demographic history (Holsinger and Weir 2009): large F_ST values are expected for small populations among which dispersal is limited (Wright 1951), or between populations that have long diverged in isolation from each other (Reynolds et al. 1983); when dispersal is spatially restricted, a positive relationship between F_ST and the geographical distance for pairs of populations generally holds (Slatkin 1993; Rousset 1997). It has also been proposed to characterize the heterogeneity of F_ST estimates across markers for identifying loci that are targeted by selection (Cavalli-Sforza 1966; Lewontin and Krakauer 1973; Beaumont and Nichols 1996; Vitalis et al. 2001; Akey et al. 2002; Beaumont 2005; Weir et al. 2005; Lotterhos and Whitlock 2014, 2015; Whitlock and Lotterhos 2015).

Next-generation sequencing (NGS) technologies provide unprecedented amounts of polymorphism data in both model and non-model species (Ellegren 2014). Although the sequencing strategy initially involved individually tagged samples in humans (The International HapMap Consortium 2005), whole-genome sequencing of pools of individuals (Pool-seq) is becoming increasingly popular for population genomic studies (Schlötterer et al. 2014). Because it consists of sequencing libraries of pooled DNA samples and does not require individual tagging of sequences, Pool-seq provides genome-wide polymorphism data at considerably lower cost than sequencing of individuals (Schlötterer et al. 2014). However, non-equimolar amounts of DNA from all individuals in a pool and stochastic variation in the amplification efficiency of individual DNAs have raised concerns with respect to the accuracy of the so-obtained allele frequency estimates, particularly at low sequencing depth and with small pool sizes (Cutler and Jensen 2010; Ellegren 2014; Anderson et al. 2014). Nonetheless, it has been shown that, at equal sequencing effort, Pool-seq provides allele frequency estimates that are as accurate as, if not more accurate than, those from individual-based analyses (Futschik and Schlötterer 2010; Gautier et al. 2013). The problem is different for diversity and differentiation parameters, which depend on second moments of allele frequencies or, equivalently, on pairwise measures of genetic identity. With Pool-seq data, however, it is impossible to distinguish pairs of reads that are identical because they were sequenced from a single gene, from pairs of reads that are identical because they were sequenced from two distinct genes that are identical in state (IIS) (Ferretti et al. 2013).

Appropriate estimators of diversity and differentiation parameters must therefore be sought, to account for both the sampling of individual genes from the pool and the sampling of reads from these genes. There have been several attempts to define estimators for the parameter F_ST for Pool-seq data (Kofler et al. 2011; Ferretti et al. 2013), from ratios of heterozygosities (or from probabilities of genetic identity between pairs of reads) within and between pools. In the following, we will argue that these estimators are biased (i.e., they do not converge towards the expected value of the parameter), and that some of them have undesired statistical properties (i.e., the bias depends upon sample size and coverage). Here, following Cockerham (1969), Cockerham (1973), Weir and Cockerham (1984), Weir (1996), Weir and Hill (2002) and Rousset (2007), we define a method-of-moments estimator of the parameter F_ST using an analysis-of-variance framework. We then evaluate the accuracy and the precision of this estimator, based on the analysis of simulated datasets, and compare it to estimates defined in the software package PoPoolation2 (Kofler et al. 2011), and in Ferretti et al. (2013). Furthermore, we test the robustness of our estimator to model misspecifications (including unequal contributions of individuals in pools, and sequencing errors). Finally, we reanalyze the prickly sculpin (Cottus asper) Pool-seq data (published by Dennenmoser et al. 2017), and show how the use of biased F_ST estimators in previous analyses may challenge the interpretation of population structure.

Note that throughout this article, we use the term “gene” to designate a segregating genetic unit (in the sense of the “Mendelian gene” from Orgogozo et al. 2016). We further use the term “read” in a narrow sense, as a sequenced copy of a gene. For the sake of simplicity, we will use the term “Ind-seq” to refer to analyses based on individual data in which we further assume that individual genotypes are called without error.

MODEL

F-statistics may be described as intra-class correlations for the probability of identity in state (IIS) of pairs of genes (Cockerham and Weir 1987; Rousset 1996, 2007), and F_ST is best defined as:

$$F_\mathrm{ST} \equiv \frac{Q_1 - Q_2}{1 - Q_2} \tag{1}$$

where Q₁ is the IIS probability for genes sampled within subpopulations, and Q₂ is the IIS probability for genes sampled between subpopulations. In the following, we develop an estimator of F_ST for Pool-seq data, by decomposing the total variance of gene frequencies in an analysis-of-variance framework. A complete derivation of the model is provided in the Supplemental File S1.

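For the sake of clarity, the notation used throughout this article is given in Table 1. We first derive our model for a single locus, and eventually provide a multilocus estimator of F_ST. Consider a sample of n_d subpopulations, each of which is made of n_i genes (i = 1,…,n_d) sequenced in pools (hence n_i is the haploid sample size of the ith pool). We define c_ij as the number of reads sequenced from gene j (j = 1,…,n_i) in subpopulation i at the locus considered. Note that c_ij is a latent variable that cannot be directly observed from the data. Let X_ijr:k be an indicator variable for read r (r = 1,…, c_ij) from gene j in subpopulation i, such that X_ijr:k = 1 if the rth read from the jth gene in the ith deme is of type k, and X_ijr:k = 0 otherwise. In the following, we use standard dot notations for sample averages, i.e.:

$$X_{ij\cdot:k} \equiv \frac{1}{c_{ij}}\sum_{r} X_{ijr:k}, \qquad X_{i\cdot\cdot:k} \equiv \frac{\sum_{j}\sum_{r} X_{ijr:k}}{\sum_{j} c_{ij}} \qquad \text{and} \qquad X_{\cdot\cdot\cdot:k} \equiv \frac{\sum_{i}\sum_{j}\sum_{r} X_{ijr:k}}{\sum_{i}\sum_{j} c_{ij}}.$$

The analysis of variance is based on the computation of sums of squares, as follows:

$$\begin{aligned}
\sum_{i}\sum_{j}\sum_{r} \left(X_{ijr:k} - X_{\cdot\cdot\cdot:k}\right)^2 &= \sum_{i}\sum_{j}\sum_{r} \left(X_{ijr:k} - X_{ij\cdot:k}\right)^2 + \sum_{i}\sum_{j}\sum_{r} \left(X_{ij\cdot:k} - X_{i\cdot\cdot:k}\right)^2 + \sum_{i}\sum_{j}\sum_{r} \left(X_{i\cdot\cdot:k} - X_{\cdot\cdot\cdot:k}\right)^2 \\
&\equiv SSR_{:k} + SSI_{:k} + SSP_{:k}
\end{aligned} \tag{2}$$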

Table 1 Summary of main notations

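As is shown in the Supplemental File S1, the expected sums of squares depend on the expectation of the allele frequency π_k over all replicate populations sharing the same evolutionary history, as well as on the IIS probability Q_1:k that two genes in the same pool are both of type k, and the IIS probability Q_2:k that two genes from different pools are both of type k. Taking expectations (see the detailed computations in the Supplemental File S1), one has:

$$\mathbb{E}(SSR_{:k}) = 0 \tag{3}$$

for reads within individual genes, since we assume that there is no sequencing error, i.e. all the reads sequenced from a single gene are identical and X_ijr:k = X_ij·:k for all r. For reads between genes within pools, we get:

$$\mathbb{E}(SSI_{:k}) = (C_1 - D_2)\,(\pi_k - Q_{1:k}) \tag{4}$$

where C₁ ≡ Σ_i Σ_j c_ij = Σ_i C₁_i is the total number of reads in the full sample (total coverage), C₁_i is the coverage of the ith pool and D₂ ≡ Σ_i (C₁_i + n_i − 1)/n_i. D₂ arises from the assumption that the distribution of the read counts c_ij is multinomial (i.e., that all genes contribute equally to the pool of reads; see Equation A15 in Supplemental File S1). For reads between genes from different pools, we have:

$$\mathbb{E}(SSP_{:k}) = \left(C_1 - \frac{C_2}{C_1}\right)(Q_{1:k} - Q_{2:k}) + (D_2 - D_2^\star)\,(\pi_k - Q_{1:k}) \tag{5}$$

where C₂ ≡ Σ_i C₁_i² and D₂* ≡ Σ_i C₁_i (C₁_i + n_i − 1)/(n_i C₁) (see Equation A16 in Supplemental File S1). Rearranging Equations 4-5, and summing over alleles (with SSI ≡ Σ_k SSI:k and SSP ≡ Σ_k SSP:k), we get:

$$Q_1 - Q_2 = \frac{(C_1 - D_2)\,\mathbb{E}(SSP) - (D_2 - D_2^\star)\,\mathbb{E}(SSI)}{(C_1 - D_2)\,(C_1 - C_2/C_1)} \tag{6}$$

and:

$$1 - Q_2 = \frac{(C_1 - D_2)\,\mathbb{E}(SSP) + (n_c - 1)\,(D_2 - D_2^\star)\,\mathbb{E}(SSI)}{(C_1 - D_2)\,(C_1 - C_2/C_1)} \tag{7}$$

where n_c ≡ (C₁ − C₂/C₁)/(D₂ − D₂*). Let MSI ≡ SSI/(C₁ − D₂) and MSP ≡ SSP/(D₂ − D₂*). Then:

$$F_\mathrm{ST} \equiv \frac{Q_1 - Q_2}{1 - Q_2} = \frac{\mathbb{E}(MSP) - \mathbb{E}(MSI)}{\mathbb{E}(MSP) + (n_c - 1)\,\mathbb{E}(MSI)} \tag{8}$$

which yields the method-of-moments estimator:

$$\hat{F}_\mathrm{ST}^\mathrm{pool} = \frac{MSP - MSI}{MSP + (n_c - 1)\,MSI} \tag{9}$$

where

$$MSI = \frac{1}{C_1 - D_2}\sum_{k}\sum_{i} C_{1i}\,\hat{\pi}_{i:k}\,(1 - \hat{\pi}_{i:k}) \tag{10}$$

and:

$$MSP = \frac{1}{D_2 - D_2^\star}\sum_{k}\sum_{i} C_{1i}\,(\hat{\pi}_{i:k} - \hat{\pi}_{k})^2 \tag{11}$$

(see Equations A25 and A26 in Supplemental File S1). In Equations 10 and 11, π̂_i:k ≡ X_i··:k is the average frequency of reads of type k within the ith pool, and π̂_k ≡ X_···:k is the average frequency of reads of type k in the full sample. Note that from the definition of X_···:k, π̂_k = Σ_i C₁_i π̂_i:k / Σ_i C₁_i is the weighted average of the sample frequencies with weights equal to the pool coverage. This is equivalent to the weighted analysis-of-variance in Cockerham (1973) (see also Weir and Cockerham 1984; Weir 1996; Weir and Hill 2002; Rousset 2007; Weir and Goudet 2017). Finally, the full expression of the estimator in terms of sample frequencies reads:

$$\hat{F}_\mathrm{ST}^\mathrm{pool} = \frac{(C_1 - D_2)\sum_{k}\sum_{i} C_{1i}\,(\hat{\pi}_{i:k} - \hat{\pi}_{k})^2 - (D_2 - D_2^\star)\sum_{k}\sum_{i} C_{1i}\,\hat{\pi}_{i:k}\,(1 - \hat{\pi}_{i:k})}{(C_1 - D_2)\sum_{k}\sum_{i} C_{1i}\,(\hat{\pi}_{i:k} - \hat{\pi}_{k})^2 + (n_c - 1)\,(D_2 - D_2^\star)\sum_{k}\sum_{i} C_{1i}\,\hat{\pi}_{i:k}\,(1 - \hat{\pi}_{i:k})} \tag{12}$$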

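If we take the limit case where each gene is sequenced exactly once, we recover the Ind-seq model: assuming c_ij = 1 for all (i,j), then C₁ = Σ_i n_i, C₂ = Σ_i n_i², D₂ = n_d and D₂* = 1 (here D₂ and D₂* are evaluated directly from the fixed read counts, as Σ_i Σ_j c_ij²/C₁_i and Σ_i Σ_j c_ij²/C₁, rather than from their expectations under multinomial sampling of reads). Therefore, n_c = (C₁ − C₂/C₁) / (n_d − 1), and Equation 9 reduces exactly to the estimator of F_ST for haploids: see Weir (1996), p. 182, and Rousset (2007), p. 977.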
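As an illustration, the single-locus computation can be sketched in a few lines of Python. This is a minimal sketch, not the authors' R implementation: the function names are hypothetical, and the code assumes the analysis-of-variance quantities described above, with MSI = SSI/(C₁ − D₂), MSP = SSP/(D₂ − D₂*), n_c = (C₁ − C₂/C₁)/(D₂ − D₂*), D₂ = Σ_i (C₁_i + n_i − 1)/n_i and D₂* = Σ_i C₁_i (C₁_i + n_i − 1)/(n_i C₁), i.e., equal contributions of all genes to the reads.

```python
def pool_fst_components(read_counts, pool_sizes):
    """Return (MSP, MSI, n_c) for one locus.

    read_counts[i][k] is the number of reads of allele k in pool i;
    pool_sizes[i] is the haploid size n_i of pool i.
    """
    cov = [sum(pool) for pool in read_counts]   # C_1i, coverage of each pool
    c1 = sum(cov)                               # C_1, total coverage
    c2 = sum(c * c for c in cov)                # C_2, sum of squared coverages
    # D_2 and D_2*, assuming all genes contribute equally to the reads
    d2 = sum((c + n - 1) / n for c, n in zip(cov, pool_sizes))
    d2_star = sum(c * (c + n - 1) / n for c, n in zip(cov, pool_sizes)) / c1

    ssi = 0.0  # sum of squares between genes within pools
    ssp = 0.0  # sum of squares between pools
    for k in range(len(read_counts[0])):
        pi_k = sum(pool[k] for pool in read_counts) / c1   # coverage-weighted mean
        for pool, c in zip(read_counts, cov):
            pi_ik = pool[k] / c
            ssi += c * pi_ik * (1.0 - pi_ik)
            ssp += c * (pi_ik - pi_k) ** 2

    msi = ssi / (c1 - d2)
    msp = ssp / (d2 - d2_star)
    n_c = (c1 - c2 / c1) / (d2 - d2_star)
    return msp, msi, n_c


def pool_fst(read_counts, pool_sizes):
    """Method-of-moments estimate (MSP - MSI) / (MSP + (n_c - 1) MSI)."""
    msp, msi, n_c = pool_fst_components(read_counts, pool_sizes)
    return (msp - msi) / (msp + (n_c - 1.0) * msi)


# Two pools of 10 haploids, 20 reads each, fixed for different alleles
print(pool_fst([[20, 0], [0, 20]], [10, 10]))
```

Like other method-of-moments estimators, this sketch can return slightly negative values when read frequencies are nearly identical across pools.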

As in Reynolds et al. (1983), Weir and Cockerham (1984), Weir (1996) and Rousset (2007), a multilocus estimate is derived as the sum of locus-specific numerators over the sum of locus-specific denominators:

$$\hat{F}_\mathrm{ST}^\mathrm{pool} = \frac{\sum_{l} \left(MSP_l - MSI_l\right)}{\sum_{l} \left[MSP_l + (n_c - 1)\,MSI_l\right]} \tag{13}$$

where MSI and MSP are subscripted with l to denote the lth locus. For Ind-seq data, Bhatia et al. (2013) refer to this multilocus estimate as a “ratio of averages” as opposed to an “average of ratios”, which would consist in averaging single-locus F_ST over loci. This approach is justified in the Appendix of Weir and Cockerham (1984) and in Bhatia et al. (2013), who analyzed both estimates by means of coalescent simulations. Note that Equation 13 assumes that the pool size is equal across loci. Also note that the construction of the estimator in Equation 13 is different from Weir and Cockerham’s (1984). These authors defined their multilocus estimator as a ratio of sums of components of variance (a, b and c in their notation) over loci, which gives the same weight to all loci, whatever the number of sampled genes at each locus. Equation 13 follows Genepop’s rationale (Rousset 2008), which gives instead more weight to loci that are more intensively covered.
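The difference between a “ratio of averages” and an “average of ratios” can be made concrete with a short Python sketch. The function names and the per-locus values below are hypothetical and purely illustrative; each locus is summarized by a tuple (MSP, MSI, n_c), and using a locus-specific n_c when coverage differs across loci is an assumption of this sketch.

```python
def multilocus_fst(components):
    """Ratio of sums: sum of numerators over sum of denominators.

    components is a list of (MSP_l, MSI_l, n_c) tuples, one per locus.
    """
    numerator = sum(msp - msi for msp, msi, _ in components)
    denominator = sum(msp + (n_c - 1.0) * msi for msp, msi, n_c in components)
    return numerator / denominator


def average_of_ratios(components):
    """Mean of single-locus estimates, shown for comparison only."""
    ratios = [(msp - msi) / (msp + (n_c - 1.0) * msi)
              for msp, msi, n_c in components]
    return sum(ratios) / len(ratios)


# Hypothetical values: one informative locus and one nearly uninformative locus
loci = [(0.30, 0.10, 5.0), (0.02, 0.02, 5.0)]
print(multilocus_fst(loci), average_of_ratios(loci))
```

In the ratio of sums, the weakly polymorphic locus contributes little to either sum, whereas in the average of ratios it weighs as much as the informative locus.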

MATERIALS AND METHODS

Simulation study

Generating individual genotypes: we first generated individual genotypes using ms (Hudson 2002), assuming an island model of population structure (Wright 1931). For each simulated scenario, we considered 8 demes, each made of N = 5,000 haploid individuals. The migration rate (m) was fixed to achieve the desired value of F_ST (0.05 or 0.2), using Equation 6 in Rousset (1996) leading, e.g., to M = 2Nm = 16.569 for F_ST = 0.05 and M = 3.489 for F_ST = 0.20. The mutation rate was set at μ = 10^-6, giving θ = 2Nμ = 0.01. We considered either fixed or variable sample sizes across demes. In the latter case, the haploid sample size n was drawn independently for each deme from a Gaussian distribution with mean 100 and standard deviation 30; this number was rounded up to the nearest integer, with min. 20 and max. 300 haploids per deme. We generated a very large number of sequences for each scenario, and sampled independent single nucleotide polymorphisms (SNPs) from sequences with a single segregating site. Each scenario was replicated 50 times (500 times for Figures 3 and S2).
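The sampling of variable pool sizes and the scaled parameters can be sketched as follows. This is an illustrative Python sketch rather than the original R code; the function name is hypothetical, and treating “min. 20 and max. 300” as clamping of out-of-range draws (rather than redrawing them) is an assumption.

```python
import math
import random


def draw_haploid_sample_size(rng, mean=100.0, sd=30.0, lower=20, upper=300):
    """Gaussian draw, rounded up, then clamped to [lower, upper] (assumed)."""
    n = math.ceil(rng.gauss(mean, sd))
    return max(lower, min(upper, n))


N = 5000                  # haploid deme size
mu = 1e-6                 # mutation rate
theta = 2 * N * mu        # scaled mutation rate, 0.01
m = 16.569 / (2 * N)      # migration rate implied by M = 2Nm = 16.569

rng = random.Random(42)   # arbitrary seed, for reproducibility only
sizes = [draw_haploid_sample_size(rng) for _ in range(8)]  # one size per deme
print(theta, m, sizes)
```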

Pool sequencing: for each ms simulated dataset, we generated Pool-seq data by drawing reads from a binomial distribution (Gautier et al. 2013). More precisely, we assume that for each SNP, the number r_i:k of reads of allelic type k in pool i follows:

$$r_{i:k} \sim \mathrm{Bin}\!\left(\frac{y_{i:k}}{n_i},\ \delta_i\right) \tag{14}$$

where y_i:k is the number of genes of type k in the ith pool, n_i is the total number of genes in pool i (haploid pool size), and δ_i is the simulated total coverage for pool i. In the following, we either consider a fixed coverage, with δ_i = ∆ for all pools and loci, or a varying coverage across pools and loci, with δ_i ~ Pois(∆).
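The read-sampling step can be sketched in Python as follows, for a biallelic SNP. This is a minimal illustration under the binomial model described above; the function names are hypothetical, and the binomial and Poisson draws are written by hand (Bernoulli trials and Knuth's multiplication method) only to keep the sketch free of external dependencies.

```python
import math
import random


def draw_poisson(rng, lam):
    """Poisson draw by Knuth's multiplication method (fine for moderate lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def draw_binomial(rng, n_trials, prob):
    """Binomial draw as a sum of independent Bernoulli trials."""
    return sum(1 for _ in range(n_trials) if rng.random() < prob)


def simulate_pool_reads(rng, allele_count, pool_size, coverage):
    """Read counts (focal allele, other allele) for one pool at one SNP."""
    focal = draw_binomial(rng, coverage, allele_count / pool_size)
    return focal, coverage - focal


rng = random.Random(2024)  # arbitrary seed
fixed = simulate_pool_reads(rng, allele_count=30, pool_size=100, coverage=50)
varying = simulate_pool_reads(rng, 30, 100, draw_poisson(rng, 50.0))
print(fixed, varying)
```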

Sequencing error: we simulated sequencing errors occurring at rate μ_e = 0.001, which is typical of Illumina sequencers (Glenn 2011; Ross et al. 2013). We assumed that each sequencing error modifies the allelic type of a read to one of three other possible states with equal probability (there are therefore four allelic types in total, corresponding to four nucleotides). Note that only biallelic markers are retained in the final datasets. Also note that, since we initiated this procedure with polymorphic markers only, we neglect sequencing errors that would create spurious SNPs from monomorphic sites. However, such SNPs should be rare in real datasets, since markers with a low minimum read count (MRC) are generally filtered out.
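A minimal Python sketch of this error model is given below. The function names are hypothetical; reads are represented as a list of nucleotide characters at one position, which is an assumption of the sketch rather than a description of the original implementation.

```python
import random

NUCLEOTIDES = "ACGT"


def add_sequencing_errors(rng, reads, error_rate=0.001):
    """Change each read, with probability error_rate, to one of the three
    other nucleotides chosen with equal probability."""
    noisy = []
    for base in reads:
        if rng.random() < error_rate:
            base = rng.choice([b for b in NUCLEOTIDES if b != base])
        noisy.append(base)
    return noisy


def is_biallelic(reads):
    """True if exactly two allelic types are observed among the reads."""
    return len(set(reads)) == 2


rng = random.Random(11)  # arbitrary seed
reads = ["A"] * 60 + ["G"] * 40
noisy = add_sequencing_errors(rng, reads)
print(is_biallelic(noisy))
```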

Experimental error: non-equimolar amounts of DNA from all individuals in a pool and stochastic variation in the amplification efficiency of individual DNAs are sources of experimental errors in pool sequencing. To simulate experimental errors, we used the model derived by Gautier et al. (2013). In this model, it is assumed that the contribution η_ij = c_ij/C₁_i of each gene j to the total coverage of the ith pool (C₁_i) follows a Dirichlet distribution:

$$\left(\eta_{i1}, \ldots, \eta_{in_i}\right) \sim \mathrm{Dir}\!\left(\frac{\rho}{n_i}, \ldots, \frac{\rho}{n_i}\right) \tag{15}$$

where the parameter ρ controls the dispersion of gene contributions around the value η_ij = 1/n_i, expected if all genes contributed equally to the pool of reads. For convenience, we define the experimental error ∊ as the coefficient of variation of η_ij, i.e.: ∊ ≡ √[(n_i − 1)/(ρ + 1)] (see Gautier et al. 2013). When ∊ tends toward 0 (or equivalently when ρ tends to infinity), all individuals contribute equally to the pool, and there is no experimental error. We tested the robustness of our estimates to values of ∊ ranging from 0.05 to 0.5. The case ∊ = 0.5 could correspond, for example, to a situation where (for n_i = 10) 5 individuals contribute 2.8 × more reads than the other 5 individuals.
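The link between ∊ and ρ can be sketched in Python as follows. This is an illustration under a stated assumption: gene contributions follow a symmetric Dirichlet distribution with all parameters equal to ρ/n_i, for which the coefficient of variation of a single contribution is √[(n_i − 1)/(ρ + 1)]. The function names are hypothetical, and the Dirichlet draw uses the standard normalized-gamma construction.

```python
import math
import random


def rho_from_epsilon(epsilon, pool_size):
    """Invert epsilon = sqrt((n - 1) / (rho + 1)) for rho (assumed model)."""
    return (pool_size - 1) / epsilon ** 2 - 1.0


def draw_contributions(rng, pool_size, rho):
    """Symmetric Dirichlet draw with parameters rho / n (normalized gammas)."""
    gammas = [rng.gammavariate(rho / pool_size, 1.0) for _ in range(pool_size)]
    total = sum(gammas)
    return [g / total for g in gammas]


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(variance) / mean


rng = random.Random(5)  # arbitrary seed
rho = rho_from_epsilon(0.5, pool_size=10)
eta = draw_contributions(rng, 10, rho)
# Five individuals contributing 2.8 times more reads than the other five
print(rho, coefficient_of_variation([1.0] * 5 + [2.8] * 5))
```

For the 2.8-fold example, the coefficient of variation is 0.9/1.9, i.e., about 0.47, which is close to ∊ = 0.5.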

Other estimators

For the sake of clarity, a summary of the notation of the F_ST estimators used throughout this article is given in Table 2.

Table 2 Definition of the F_ST estimators used in the text

PP2_d : this estimator of F_ST is implemented by default in the software package PoPoolation2 (Kofler et al. 2011). It is based on a definition of the parameter F_ST as the overall reduction in average heterozygosity relative to the total combined population (see, e.g., Nei and Chesser 1983):

$$F_\mathrm{ST} \equiv 1 - \frac{H_\mathrm{S}}{H_\mathrm{T}} \tag{16}$$

where H_S is the average heterozygosity within subpopulations, and H_T is the average heterozygosity in the total population (obtained by pooling together all subpopulations to form a single virtual unit). In PoPoolation2, H_S is estimated as the unweighted average of within-subpopulation heterozygosities (Equation 17, using the notation from Table 1). Note that in PoPoolation2, PP2_d is restricted to the case of two subpopulations only (n_d = 2). The two ratios in the right-hand side of Equation 17 are presumably borrowed from Nei (1978) to provide an unbiased estimate, although we found no formal justification for the expression in Equation 17 for Pool-seq data. The total heterozygosity H_T is estimated as in Equation 18 (using the notation from Table 1).

PP2_a : this is the alternative estimator of F_ST provided in the software package PoPoolation2. It is based on an interpretation by Kofler et al. (2011) of Karlsson et al.’s (2007) estimator of F_ST, as:

$$\mathrm{PP2_a} \equiv \frac{\hat{Q}_1^r - \hat{Q}_2^r}{1 - \hat{Q}_2^r} \tag{19}$$

where Q̂_1^r and Q̂_2^r are the frequencies of identical pairs of reads within and between pools, respectively, computed by simple counting of IIS pairs. These are estimates of Q_1^r, the IIS probability for two reads in the same pool (whether they are sequenced from the same gene or not) and Q_2^r, the IIS probability for two reads in different pools. Note that the IIS probability Q_1^r is different from Q₁ in Equation 1, which, from our definition, represents the IIS probability between distinct genes in the same pool. This approach therefore confounds pairs of reads within pools that are identical because they were sequenced from a single gene with pairs of reads that are identical because they were sequenced from distinct, yet IIS genes.
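The consequence of counting read pairs rather than gene pairs can be illustrated with a back-of-the-envelope Python sketch. It rests on assumptions that are ours: all genes contribute equally to the reads, so that two distinct reads from a pool of n genes derive from the same gene with probability 1/n, and the ratio of expectations is used as a rough proxy for the expectation of the ratio. The function names and the values of Q₁ and Q₂ are hypothetical.

```python
def read_pair_iis_within(q1, pool_size):
    """IIS probability for two reads from the same pool: they copy the same
    gene with probability 1/n (always identical), otherwise two distinct
    genes that are IIS with probability q1."""
    same_gene = 1.0 / pool_size
    return same_gene + (1.0 - same_gene) * q1


def read_based_ratio(q1, q2, pool_size):
    """(Q1r - Q2) / (1 - Q2), with read-pair IIS replacing gene-pair IIS."""
    q1r = read_pair_iis_within(q1, pool_size)
    return (q1r - q2) / (1.0 - q2)


# Hypothetical IIS probabilities giving F_ST = (0.6 - 0.5) / (1 - 0.5) = 0.2
for n in (10, 100, 1000):
    print(n, read_based_ratio(0.6, 0.5, n))
```

Under these assumptions the ratio equals 1 − (1 − 1/n)(1 − F_ST), which exceeds F_ST by an amount that shrinks as the pool size grows and does not involve the coverage.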

FRP₁₃ : this estimator of F_ST was developed by Ferretti et al. (2013) (see their Equations 3 and 10-13). Ferretti et al. (2013) use the same definition of F_ST as in Equation 16 above, although they estimate heterozygosities within and between pools as “average pairwise nucleotide diversities”, which, from their definitions, are formally equivalent to IIS probabilities. In particular, they estimate the average heterozygosity within pools as (using the notation from Table 1): and the total heterozygosity among the n_d populations as:

Analyses of Ind-seq data

For the comparison of Ind-seq and Pool-seq datasets, we computed F_ST on subsamples of 5,000 loci. These subsamples were defined so that only those loci that were polymorphic in all coverage conditions were retained, and the same loci were used for the analysis of the corresponding Ind-seq data. For the latter, we used either Nei and Chesser’s (1983) estimator based on a ratio of heterozygosities (see Equation 16 above), hereafter denoted by NC₈₃, or the analysis-of-variance estimator developed by Weir and Cockerham (1984), hereafter denoted by WC₈₄.

All the estimators were computed using custom functions in the R software environment for statistical computing, version 3.3.1 (R Core Team 2017). All these functions were carefully checked against available software packages, to ensure that they provided strictly identical results.

Application example: Cottus asper

Dennenmoser et al. (2017) investigated the genomic basis of adaptation to osmotic conditions in the prickly sculpin (Cottus asper), an abundant euryhaline fish in northwestern North America. To do so, they sequenced the whole genome of pools of individuals from two estuarine populations (CR, Capilano River Estuary; FE, Fraser River Estuary) and two freshwater populations (PI, Pitt Lake and HZ, Hatzic Lake) in southern British Columbia (Canada). We downloaded the four corresponding BAM files from the Dryad Digital Repository (doi: 10.5061/dryad.2qg01) and combined them into a single mpileup file using SAMtools version 0.1.19 (Li et al. 2009) with default options, except the maximum depth per BAM that was set to 5,000 reads. The resulting file was further processed using a custom awk script, to call SNPs and compute read counts, after discarding bases with a Base Alignment Quality (BAQ) score lower than 25. A position was then considered as a SNP if: (i) only two different nucleotides with a read count > 1 were observed (nucleotides with ≤ 1 read being considered as a sequencing error); (ii) the coverage was between 10 and 300 in each of the four alignment files; (iii) the minor allele frequency, as computed from read counts, was ≥ 0.01 in the four populations. The final data set consisted of 608,879 SNPs.
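The three SNP-calling criteria can be sketched in Python as follows. This is not the awk script used for the analysis, and several details are assumptions made for the sake of the example: read counts are summed over the four pools when applying criteria (i) and (iii), and coverage in criterion (ii) is computed from the two retained nucleotides only. The function name and the input layout (one dictionary of nucleotide counts per pool) are hypothetical.

```python
def call_snp(counts_per_pool, min_cov=10, max_cov=300, min_maf=0.01):
    """Return (alleles, per-pool read counts) if the position passes the
    filters, otherwise None.

    counts_per_pool: list of dicts mapping nucleotide -> read count.
    """
    # (i) exactly two nucleotides with more than one read (summed over pools)
    totals = {}
    for pool in counts_per_pool:
        for base, count in pool.items():
            totals[base] = totals.get(base, 0) + count
    alleles = sorted(base for base, count in totals.items() if count > 1)
    if len(alleles) != 2:
        return None

    # (ii) coverage within bounds in every pool (retained alleles only)
    kept = [[pool.get(a, 0) for a in alleles] for pool in counts_per_pool]
    if any(not (min_cov <= sum(row) <= max_cov) for row in kept):
        return None

    # (iii) minor allele frequency computed from read counts over all pools
    total_cov = sum(sum(row) for row in kept)
    minor = min(sum(row[k] for row in kept) for k in range(2))
    if minor / total_cov < min_maf:
        return None
    return alleles, kept


# Hypothetical read counts at one position in four pools
position = [{"A": 12, "C": 8}, {"A": 20, "C": 1, "G": 1},
            {"A": 15, "C": 15}, {"A": 30, "C": 5}]
print(call_snp(position))
```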

Our aim here was to compare the population structure inferred from pairwise estimates of F_ST, using our estimator on the one hand, and PP2_d on the other hand. Then, to conclude on which of the two estimators performs better, we compared the population structure inferred from our estimator and from PP2_d to that inferred from the Bayesian hierarchical model implemented in the software package BayPass (Gautier 2015). BayPass indeed allows the robust estimation of the scaled covariance matrix of allele frequencies across populations for Pool-seq data, which is known to be informative about population history (Pickrell and Pritchard 2012). The elements of the estimated matrix can be interpreted as pairwise and population-specific estimates of differentiation (Coop et al. 2010), and therefore provide a comprehensive description of population structure that makes full use of the available data.

Data availability

The authors state that all data necessary for confirming the conclusions presented in this article are fully represented within the article, figures, and tables. Supplemental Tables S1–S3 and Figures S1–S4 are available at FigShare, along with a complete derivation of the model in the Supplemental File S1 at FigShare.

RESULTS

Comparing Ind-seq and Pool-seq estimates of F_ST

Single-locus estimates from our estimator are highly correlated with the classical estimates WC₈₄ (Weir and Cockerham 1984) computed on the individual data that were used to generate the pools in our simulations (see Figure 1). The variance of our estimator across independent replicates decreases as the coverage increases. The correlation between our estimator and WC₈₄ is stronger for multilocus estimates (see Figure S1A).

Figure 1 Single-locus estimates of F_ST.

We compared single-locus estimates of F_ST based on allele count data inferred from individual genotypes (Ind-seq), using the WC₈₄ estimator, to estimates from Pool-seq data. We simulated 5,000 SNPs using ms in an island model with n_d = 8 demes. We used two migration rates corresponding to F_ST = 0.05 (A) and F_ST = 0.20 (B). The size of each pool was fixed to 100. We show the results for different coverages (20X, 50X and 100X). In each graph, the cross indicates the simulated value of F_ST.

Comparing Pool-seq estimators of F_ST

We found that our estimator has extremely low bias (< 0.5% over all scenarios tested: see Tables 3 and S1-S3). In other words, the average estimate across multiple loci and replicates closely matches the expected value of the F_ST parameter, as given by Equation 6 in Rousset (1996), which is based on the computation of IIS probabilities in an island model of population structure. In all the situations examined, the bias depended neither on the sample size (i.e., the size of each pool) nor on the coverage (see Figure 2). Only the variance of the estimator across independent replicates decreases as the sample size increases and/or as the coverage increases. At high coverage, the mean and root mean squared error (RMSE) of our estimator over independent replicates are virtually indistinguishable from those of the WC₈₄ estimator (see Table S1).
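For reference, the two summary statistics used here, bias and RMSE over independent replicates, can be computed as in the following Python sketch. The function names and the replicate values are hypothetical and serve only to illustrate the definitions.

```python
import math


def bias(estimates, true_value):
    """Mean of the estimates minus the true (simulated) parameter value."""
    return sum(estimates) / len(estimates) - true_value


def rmse(estimates, true_value):
    """Root mean squared error of the estimates around the true value."""
    squared = [(e - true_value) ** 2 for e in estimates]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical multilocus estimates from three replicates, with F_ST = 0.20
replicates = [0.19, 0.21, 0.20]
print(bias(replicates, 0.20), rmse(replicates, 0.20))
```

The RMSE combines bias and variance, so an estimator can have a low bias and still a large RMSE when few reads or few genes are sampled.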

Table 3 Overall F_ST estimates from multiple pools

Overall F_ST was estimated for various conditions of expected F_ST, pool size (n) and coverage (Cov.). For Pool-seq data, we computed our estimator (Equation 13). The mean (RMSE) over 50 independent replicates of the ms simulations are provided, for all populations (n_d = 8). For comparison, we computed WC₈₄ from allele count data inferred from individual genotypes (Ind-seq).

Figure 2 Precision and accuracy of pairwise estimators of F_ST.

We considered two estimators based on allele count data inferred from individual genotypes (Ind-seq): WC₈₄ and NC₈₃. For pooled data, we computed the two estimators implemented in the software package PoPoolation2, that we refer to as PP2_d and PP2_a, as well as the FRP₁₃ estimator and our estimator (Equation 13). Each boxplot represents the distribution of multilocus F_ST estimates across all pairwise comparisons in an island model with n_d = 8 demes, and across 50 independent replicates of the ms simulations. We used two migration rates, corresponding to F_ST = 0.05 (A-B) or F_ST = 0.20 (C-D). The size of each pool was either fixed to 10 (A and C) or to 100 (B and D). For Pool-seq data, we show the results for different coverages (20X, 50X and 100X). In each graph, the dashed line indicates the simulated value of F_ST and the dotted line indicates the median of the distribution of NC₈₃ estimates.

Figure 3 shows the RMSE of F_ST estimates for a wide range of pool sizes and coverage. The RMSE decreases as the pool size and/or the coverage increases. The F_ST estimates are more precise and accurate when differentiation is low. Figure 3 provides some clues to evaluate the pool size and the coverage that are necessary to achieve the same RMSE as for Ind-seq data. Consider, for example, the case of samples of n = 20 haploids. For F_ST ≤ 0.05 (in the conditions of our simulations), the RMSE of F_ST estimates based on Pool-seq data tends to the RMSE of F_ST estimates based on Ind-seq data either by sequencing pools of ca. 200 haploids at 20X, or by sequencing pools of 20 haploids at ca. 200X. However, the same precision and accuracy are achieved by sequencing ca. 50 haploids at ca. 50X.

Figure 3 Root mean squared error (RMSE) of F_ST estimates for a wide range of pool sizes and coverage, with F_ST varying from 0.005 to 0.2 (A – F).

Each density plot gives the RMSE of our estimator, using simple linear interpolation from a set of 44 × 44 pairs of pool size and coverage values. For each pool size and coverage, 500 replicates of 5,000 markers were simulated. Plain white isolines represent the RMSE of the WC₈₄ estimator computed from Ind-seq data, for various sample sizes (n = 5, 10, 20, and 50). Each isoline was fitted using a thin plate spline regression with smoothing parameter λ = 0.005, implemented in the fields package for R (Nychka et al. 2017).

Conversely, we found that PP2_d (the default estimator of F_ST implemented in the software package PoPoolation2) is biased when compared to the expected value of the parameter. We observed that the bias depends on both the sample size and the coverage (see Figure 2). We note that, as the coverage and the sample size increase, PP2_d converges to the estimator NC₈₃ (Nei and Chesser 1983) computed from individual data (see Figure S1B). This argument was used by Kofler et al. (2011) to validate the approach, even though the estimates PP2_d depart from the true value of the parameter (Figure S1B–C).

The second of the two estimators of F_ST implemented in PoPoolation2, that we refer to as PP2_a, is also biased (see Figure 2). We note that the bias decreases as the sample size increases. However, the bias does not depend on the coverage (only the variance over independent replicates does). The estimator developed by Ferretti et al. (2013), that we refer to as FRP₁₃, is also biased (see Figure 2). However, the bias depends neither on the pool size nor on the coverage (only the variance over independent replicates does). FRP₁₃ converges to the estimator NC₈₃, computed from individual data (see Figure 2). At high coverage, the mean and RMSE over independent replicates are virtually indistinguishable from those of the NC₈₃ estimator.

Last, we stress that our estimator provides estimates for multiple populations, and is therefore not restricted to pairwise analyses, contrary to PoPoolation2’s estimators. We show that, even at low sample size and low coverage, Pool-seq estimates of differentiation are virtually indistinguishable from classical estimates for Ind-seq data (see Table 3).

Robustness to unbalanced pool sizes and variable sequencing coverage

We evaluated the accuracy and the precision of the estimator when sample sizes differ across pools, and when the coverage varies across pools and loci (see Figure 4). We found that, at low coverage, unequal sampling or variable coverage causes a negligible departure from the median of WC₈₄ estimates computed on individual data, which vanishes as the coverage increases. At 100X coverage, the distribution of estimates is almost indistinguishable from that of WC₈₄ (see Figure 4 and Tables S2–S3).

Figure 4 Precision and accuracy of F_ST estimates with varying pool size or varying coverage.

Our estimator (Equation 13) was calculated from Pool-seq data over all loci and demes and compared to the estimator WC₈₄, computed from allele count data inferred from individual genotypes (Ind-seq). Each boxplot represents the distribution of multilocus F_ST estimates across 50 independent replicates of the ms simulations. We used two migration rates, corresponding to F_ST = 0.05 (A and C) or F_ST = 0.20 (B and D). In A-B the pool size was variable across demes, with haploid sample size n drawn independently for each deme from a Gaussian distribution with mean 100 and standard deviation 30; n was rounded up to the nearest integer, with min. 20 and max. 300 haploids per deme. In C-D, the pool size was fixed (n = 100), and the coverage (δ_i) was varying across demes and loci, with δ_i ~ Pois(∆) where ∆ ∈ {20, 50, 100}. For Pool-seq data, we show the results for different coverages (20X, 50X and 100X). In each graph, the dashed line indicates the simulated value of F_ST and the dotted line indicates the median of the distribution of WC₈₄ estimates.

Robustness to sequencing and experimental errors

Figure 5 shows that sequencing errors cause a negligible negative bias for estimates. Filtering (using a minimum read count of 4) improves estimation slightly, but only at high coverage (Figure 6B). It must be noted, though, that filtering increases the bias in the absence of sequencing error, especially at low coverage (Figure 6A). With experimental error, i.e., when individuals do not contribute evenly to the final set of reads, we observed a positive bias for estimates (Figure 5). We note that the bias decreases as the size of the pools increases. Figure S2 shows the RMSE of F_ST estimates for a wider range of pool sizes, coverage and experimental error rate. For ∊ ≥ 0.25, increasing the coverage cannot improve the quality of the inference, if the pool size is too small. When Pool-seq experiments are prone to large experimental error rates, increasing the size of pools is the only way to improve the estimation of F_ST. Filtering (using a minimum read count of 4) does not improve estimation (Figure 6C).

Figure 5 Precision and accuracy of F_ST estimates with sequencing and experimental errors.

Our estimator (Equation 13) was computed from Pool-seq data over all loci and demes without error, with sequencing error (occurring at rate μ_e = 0.001), and with experimental error (∊ = 0.5). Each boxplot represents the distribution of multilocus F_ST estimates across 50 independent replicates of the ms simulations. We used two migration rates, corresponding to F_ST = 0.05 (A-B) or F_ST = 0.20 (C-D). The size of each pool was either fixed to 10 (A and C) or to 100 (B and D). For Pool-seq data, we show the results for different coverages (20X, 50X and 100X). In each graph, the dashed line indicates the simulated value of F_ST.

Figure 6 Precision and accuracy of F_ST estimates with and without filtering.

Our estimator (Equation 13) was computed from Pool-seq data over all loci and demes without error (A), with sequencing error (B) and with experimental error (C) (see the legend of Figure 5 for further details). For each case, we computed F_ST without filtering (no MRC) and with filtering (using a minimum read count MRC = 4). Each boxplot represents the distribution of multilocus F_ST estimates across 50 independent replicates of the ms simulations. We used a migration rate corresponding to F_ST = 0.20, and pool size n = 10. We show the results for different coverages (20X, 50X and 100X). In each graph, the dashed line indicates the simulated value of F_ST.

Application example

The reanalysis of the prickly sculpin data revealed larger pairwise estimates of multilocus F_ST using the PP2_d estimator, as compared to our estimator (see Figure 7A). Furthermore, we found that estimates from our estimator are smaller for within-ecotype pairwise comparisons than for between-ecotype comparisons. Therefore, the inferred relationships between samples based on these pairwise estimates show a clear-cut structure, separating the two estuarine samples from the freshwater ones (see Figure 7C). We did not recover the same structure using PP2_d estimates (see Figure 7B). In support of this, the scaled covariance matrix of allele frequencies across samples is consistent with the structure inferred from our estimates (see Figure 7D).

Figure 7 Analysis of the prickly sculpin (Cottus asper) Pool-seq data.

In (A) we compare the pairwise F_ST estimates from PP2_d and from our estimator (Equation 13) for all pairs of populations from the estuarine (CR and FE) and freshwater samples (PI and HZ). Within-ecotype comparisons are depicted as blue dots, and between-ecotype comparisons as red triangles. In (B-C) we show UPGMA hierarchical cluster analyses based on pairwise estimates from PP2_d (B) and from our estimator (C). In (D), we show a heatmap representation of the scaled covariance matrix among the four C. asper populations, inferred from the Bayesian hierarchical model implemented in the software package BayPass.

DISCUSSION

Whole-genome sequencing of pools of individuals is increasingly popular for population genomic research on both model and non-model species (Schlötterer et al. 2014). The development of dedicated software packages (reviewed in Schlötterer et al. 2014) undoubtedly has something to do with the breadth of research questions that have been tackled using pool-sequencing. Yet, the analysis of population structure from Pool-seq data is complicated by the double sampling process of genes from the pool and sequence reads from those genes (Ferretti et al. 2013).

The naive approach that consists in computing F_ST from read counts, as if they were allele counts (e.g., as in Chen et al. 2016), ignores the extra variance brought by the random sampling of reads from the gene pool during Pool-seq experiments. Furthermore, such computation fails to consider the actual number of lineages in the pool (haploid pool size). Altogether, these limitations may result in severely biased estimates of differentiation when the pool size is low (see Figure S3). A possible alternative is to compute F_ST from allele counts imputed from read counts using a maximum-likelihood approach conditional on the haploid size of the pools (e.g., as in Smadja et al. 2012; Leblois et al. 2018), or from allele frequencies estimated using a model-based method that accounts for the sampling effects and the sequencing error probabilities inherent to pooled NGS experiments (see Fariello et al. 2017). However, these latter approaches may only be accurate in situations where the coverage is much larger than the pool size, which reduces the sampling variance of reads (see Figure S3).
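The extra variance brought by the double sampling can be made explicit with a short sketch. It assumes a simplified two-stage model: the n genes of a pool are a binomial sample from a deme with allele frequency p, and the c reads are a binomial sample from that pool, with no sequencing or experimental error. Under these assumptions the variance of the read frequency is p(1 − p)[1/n + (1 − 1/n)/c]. The function names are hypothetical.

```python
def allele_freq_variance(p, n):
    """Sampling variance of an allele frequency estimated from n genes."""
    return p * (1.0 - p) / n


def naive_read_variance(p, c):
    """Variance implied by treating c reads as if they were c genes."""
    return p * (1.0 - p) / c


def read_freq_variance(p, n, c):
    """Variance of the read frequency under two-stage sampling:
    n genes drawn from the deme, then c reads drawn from those n genes."""
    return p * (1.0 - p) * (1.0 / n + (1.0 - 1.0 / n) / c)
```

For example, with p = 0.5, n = 10 and c = 100, `read_freq_variance` gives 0.02725, roughly ten times the value of 0.0025 implied by treating the 100 reads as 100 independent genes.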

Here, we therefore developed a new estimator of the parameter F_ST for Pool-seq data, in an analysis-of-variance framework (Cockerham 1969, 1973). The accuracy of this estimator is barely distinguishable from that of Weir and Cockerham’s (1984) estimator for individual data. Furthermore, our estimator depends neither on the pool size nor on the coverage, and is robust to unequal pool sizes and varying coverage across demes and loci. In our analysis, the frequency of reads within pools is a weighted average of the sample frequencies, with weights equal to the pool coverage. Therefore, our approach follows that of Cockerham (1973), which he referred to as a weighted analysis-of-variance (see also Weir and Cockerham 1984; Weir 1996; Weir and Hill 2002; Weir and Goudet 2017).

With unequal pool sizes, weighted and unweighted analyses differ. As discussed recently in Weir and Goudet (2017), the unweighted approach seems appropriate when the between component exceeds the within component, i.e., when F_ST is large (Tukey 1957). It turns out that optimal weighting depends upon the parameter to be estimated (Cockerham 1973) and is only efficient at lower levels of differentiation (Robertson 1962). In a likelihood analysis of the island model, Rousset (2007) derived asymptotically efficient weights that are proportional to for the sum of squares of different samples (i.e., as in Robertson 1962). To the best of our knowledge, such optimal weighting has never been considered in the literature. Nevertheless, if these arguments are true for estimators of variance components, they do not necessarily apply to estimates of intra-class correlations (Cockerham 1973).

Analysis of variance and probabilities of identity

In the analysis-of-variance framework, F_ST is defined in Equation 1 as an intraclass correlation for the probability of identity in state (Cockerham and Weir 1987; Rousset 1996). Extensive statistical literature is available on estimators of intraclass correlations. Besides analysis-of-variance estimators, introduced in population genetics by Cockerham (1969, 1973), estimators based on the computation of probabilities of identical response within and between groups have been proposed (see, e.g., Fleiss 1971; Fleiss and Cuzick 1979; Mak 1988; Ridout et al. 1999; Wu et al. 2012), which were originally referred to as kappa-type statistics (Fleiss 1971; Landis and Koch 1977). These estimators have later been endorsed in population genetics, where the “probability of identical response” was then interpreted as the frequency with which the genes are alike (Cockerham 1973; Cockerham and Weir 1987; Weir 1996; Rousset 2007; Weir and Goudet 2017).

This suggests that, with Pool-seq data, another strategy could consist in computing F_ST from IIS probabilities between (unobserved) pairs of genes, which requires that unbiased estimates of such quantities are derived from read count data. We have done so in the second section of the Supplemental File S1, and we provide alternative estimators of F_ST for Pool-seq data (see Equations A44 and A48 in Supplemental File S1). These estimators (denoted by and ) have exactly the same form as the analysis-of-variance estimator if the pools have all the same size and if the number of reads per pool is constant (Equation A33). This echoes the derivations by Rousset (2007) for Ind-seq data, who showed that the analysis-of-variance approach (Weir and Cockerham 1984) and the simple strategy of estimating IIS probabilities by counting identical pairs of genes provide identical estimates when sample sizes are equal (see Equation A28 and also Cockerham and Weir 1987; Weir 1996; Karlsson et al. 2007). With unbalanced samples, we found that analysis-of-variance estimates have better precision and accuracy than IIS-based estimates, particularly for low levels of differentiation (see Figure S4). Interestingly, we found that IIS-based estimates of F_ST for Pool-seq data have generally lower bias and variance if the overall estimates of IIS probabilities within and between pools are computed as unweighted averages of population-specific or pairwise estimates (see Equations A39 and A43), as compared to weighted averages. Equation A28 further shows that our estimator may be rewritten as a function close to , except that it also depends on the sums in both the numerator and the denominator. This suggests that if the Q₁_i’s differ among subpopulations, then our estimator provides an estimate of an average of population-specific F_ST (Weir and Hill 2002; Weir and Goudet 2017).

It follows from the derivations in the Supplemental File S1 that the estimator PP2_a (Equation 19) is biased, because the IIS probability between pairs of reads within a pool is a biased estimator of the IIS probability between pairs of distinct genes in that pool (see Equation A34 in Supplemental File S1). This is so because the former confounds pairs of reads that are identical because they were sequenced from a single gene copy with pairs of reads that are identical because they were sequenced from distinct, yet IIS, genes.
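The reasoning above can be sketched numerically. The sketch assumes that the c reads are drawn independently and with equal probability from the n gene copies of a pool, without sequencing error, so that two distinct reads come from the same gene copy with probability 1/n. The expected IIS probability between reads is then 1/n + (1 − 1/n)Q₁, which can be inverted to recover Q₁. This is an illustration of the argument, not a reproduction of Equation A34, and the function names are hypothetical.

```python
def iis_reads(read_counts):
    """Proportion of identical pairs among all pairs of distinct reads.

    read_counts: number of reads carrying each allele in one pool.
    """
    c = sum(read_counts)
    if c < 2:
        raise ValueError("at least two reads are needed")
    return sum(r * (r - 1) for r in read_counts) / (c * (c - 1))


def iis_genes(read_counts, n):
    """IIS probability between distinct genes, corrected for read pairs
    that were sequenced from the same gene copy (probability 1/n)."""
    if n < 2:
        raise ValueError("the pool must contain at least two genes")
    return (n * iis_reads(read_counts) - 1.0) / (n - 1.0)
```

With read counts (6, 4) in a pool of n = 10 genes, the read-based IIS probability is 42/90 ≈ 0.467, whereas the corrected value is about 0.407; the gap shrinks as n grows.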

A more justified estimator of F_ST has been proposed by Ferretti et al. (2013), based on previous developments by Futschik and Schlötterer (2010). Note that, although they defined F_ST as a ratio of functions of heterozygosities, they actually worked with IIS probabilities (see Equations 20 and 21). However, although their Equation 20 is strictly identical to our Equation A34 in Supplemental File S1, we note that they computed the total heterozygosity by integrating over pairs of genes sampled both within and between populations (see Equation 21), which may explain the observed bias (see Figure 2).

Comparison with alternative estimators

An alternative framework to Weir and Cockerham’s (1984) analysis-of-variance has been developed by Masatoshi Nei and coworkers to estimate F_ST from gene diversities (Nei 1973, 1977; Nei and Chesser 1983; Nei 1986). The estimator PP2_d (see Equations 16-18) implemented in the software package PoPoolation2 (Kofler et al. 2011) follows this logic. However, it has long been recognized that both frameworks are fundamentally different in that the analysis-of-variance approach considers both statistical and genetic (or evolutionary) sampling, whereas Nei and coworkers’ approach does not (Weir and Cockerham 1984; Excoffier 2007; Holsinger and Weir 2009). Furthermore, the expectation of Nei and coworkers’ estimators depends upon the number of sampled populations, with a larger bias for lower numbers of sampled populations (Goudet 1993; Excoffier 2007; Weir and Goudet 2017). This is so because the computation of the total diversity in Equations 18 and 21 includes the comparison of pairs of genes from the same subpopulation, whereas the computation of IIS probabilities between subpopulations does not (see, e.g., Excoffier 2007). Therefore, we do not recommend using the estimator PP2_d implemented in the software package PoPoolation2 (Kofler et al. 2011).
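The dependence on the number of sampled populations can be illustrated with a minimal sketch in terms of parametric IIS probabilities. It assumes d subpopulations of equal size, an IIS probability Q₁ within and Q₂ between subpopulations, and ignores all sample-size corrections, so it is not an implementation of PP2_d or of any published estimator. If two genes are drawn at random from the total population, they come from the same subpopulation with probability 1/d, which is how within-subpopulation pairs enter the total diversity. The function names are hypothetical.

```python
def fst_from_iis(q1, q2):
    """F_ST as a ratio of IIS probabilities: (Q1 - Q2) / (1 - Q2)."""
    return (q1 - q2) / (1.0 - q2)


def gst_from_iis(q1, q2, d):
    """Diversity-based ratio (H_T - H_S) / H_T for d equal-size demes.

    H_T mixes within-deme pairs (probability 1/d) with between-deme pairs.
    """
    h_s = 1.0 - q1
    h_t = 1.0 - (q1 / d + (1.0 - 1.0 / d) * q2)
    return (h_t - h_s) / h_t
```

With Q₁ = 0.6 and Q₂ = 0.5, `fst_from_iis` gives 0.2, whereas `gst_from_iis` gives about 0.111 for d = 2 and approaches 0.2 only as d becomes large.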

Applications in evolutionary ecology studies

Pool-seq is being increasingly used in many application domains (Schlötterer et al. 2014), such as conservation genetics (see, e.g., Fuentes-Pardo and Ruzzante 2017), invasion biology (see, e.g., Dexter et al. 2018) and evolutionary biology in a broader sense (see, e.g., Collet et al. 2016). These studies use a large range of methods, which aim at characterizing fine-scale population structure (see, e.g., Fischer et al. 2017), reconstructing past demography (see, e.g., Chen et al. 2016; Leblois et al. 2018), or identifying footprints of natural or artificial selection (see, e.g., Chen et al. 2016; Fariello et al. 2017; Leblois et al. 2018).

Here, we reanalyzed the Pool-seq data produced by Dennenmoser et al. (2017), who investigated the adaptive genomic divergence between freshwater and brackish-water ecotypes of the prickly sculpin C. asper, an abundant euryhaline fish in northwestern North America. Measuring pairwise genetic differentiation between samples using our estimator, we found a clear-cut structure separating the freshwater from the brackish-water ecotypes. Such genetic structure supports the hypothesis that populations are locally adapted to osmotic conditions in these two contrasting habitats, as discussed in Dennenmoser et al. (2017). This structure, which is at odds with that inferred from PP2_d estimates, is not only supported by the scaled covariance matrix of allele frequencies, but also by previous microsatellite-based studies, which showed that populations were genetically more differentiated between ecotypes than within ecotypes (Dennenmoser et al. 2014, 2015).

Limits of the model and perspectives

We have shown that the strongest source of bias for our estimator is unequal contributions of individuals to pools. This is so because we assume in our model that the read counts are multinomially distributed, which supposes that all genes contribute equally to the pool of reads (Gautier et al. 2013), i.e., that there is no variation in DNA yield across individuals and that all genes have equal sequencing coverage (Rode et al. 2018). Because the effect of unequal contribution is expected to be stronger with small pool sizes, it has been recommended to use Pool-seq with at least 50 diploid individuals per pool (Lynch et al. 2014; Schlötterer et al. 2014). However, this limit may be overly conservative for allele frequency estimates (Rode et al. 2018), and we have shown here that we can achieve very good precision and accuracy of F_ST estimates with smaller pool sizes. Furthermore, because genotypic information is lost during Pool-seq experiments, we assume in our derivations that pools are haploid (and therefore that F_IS is nil). Analyzing non-random mating populations (e.g., in selfing species) is therefore problematic.

Finally, our model, as in Weir and Cockerham (1984), formally assumes that all populations provide independent replicates of some evolutionary process (Excoffier 2007; Holsinger and Weir 2009). This may be unrealistic in many natural populations, which motivated Weir and Hill (2002) to derive a population-specific estimator of F_ST for Ind-seq data (see also Vitalis et al. 2001). Even though the use of Weir and Hill’s (2002) estimator is still scarce in the literature (but see Weir et al. 2005; Vitalis 2012), Weir and Goudet (2017) recently proposed a re-interpretation of population-specific estimates of F_ST in terms of allelic matching proportions, which are strictly equivalent to IIS probabilities between pairs of genes. It would therefore be straightforward to extend Weir and Goudet’s (2017) estimator of population-specific F_ST for the analysis of Pool-seq data, using the unbiased estimates of IIS probabilities provided in the Supplemental File S1.

DATA ACCESSIBILITY

An R package, called poolfstat, which implements F_ST estimates for Pool-seq data, is available from the Comprehensive R Archive Network (CRAN): https://cran.r-project.org/web/packages/poolfstat/index.html.

ACKNOWLEDGEMENTS

We thank Alexandre Dehne-Garcia for his assistance in using computer farms. Analyses were performed on the genotoul bioinformatics platform Toulouse Midi-Pyrénées (bioinfo.genotoul.fr) and the CBGP HPC computational platform. This work is part of Valentin Hivert’s Ph.D., who was supported by a grant from the INRA’s Plant Health and Environment (SPE) Division, and by the BiodivERsA project EXOTIC (ANR-13-EBID-0001). Part of this work was supported by the ANR project SWING (ANR-16-CE02-0015) of the French National Research Agency, and by the CORBAM project of the French region Hauts-de-France. We thank two anonymous reviewers for their positive comments and suggestions.

Footnotes

§ These authors are joint senior authors on this work.

Literature Cited

Akey, J. M., Zhang, G., Jin, L., and Shriver, M. D. (2002). Interrogating a high-density SNP map for signatures of natural selection. Genome Res., 12:1805–1814.
Anderson, E. C., Skaug, H. J., and Barshis, D. J. (2014). Next-generation sequencing for molecular ecology: a caveat regarding pooled samples. Mol. Ecol., 23:502–512.
Beaumont, M. A. (2005). Adaptation and speciation: what can F_ST tell us? Trends Ecol. Evol., 20:435–440.
Beaumont, M. A., and Nichols, R. A. (1996). Evaluating loci for use in the genetic analysis of population structure. Proc. R. Soc. Lond. B, 263:1619–1626.
Bhatia, G., Patterson, N., Sankararaman, S., and Price, A. L. (2013). Estimating and interpreting F_ST: the impact of rare variants. Genome Res., 23:1514–1521.
Cavalli-Sforza, L. (1966). Population structure and human evolution. Proc. R. Soc. Lond., B, Biol. Sci., 164:362–379.
Chen, J., Källman, T., Ma, X.-F., Zaina, G., Morgante, M., and Lascoux, M. (2016). Identifying genetic signatures of natural selection using pooled populations sequencing in Picea abies. G3, 6:1979–1989.
Cockerham, C. C. (1969). Variance of gene frequencies. Evolution, 23:72–84.
Cockerham, C. C. (1973). Analyses of gene frequencies. Genetics, 74:679–700.
Cockerham, C. C., and Weir, B. S. (1987). Analyses of gene frequencies. Proc. Natl. Acad. Sci. USA, 84:8512–8514.
Collet, J. M., Fuentes, S., Hesketh, J., Hill, M. S., Innocenti, P., Morrow, E. H., Fowler, K., and Reuter, M. (2016). Rapid evolution of the intersexual genetic correlation for fitness in Drosophila melanogaster. Evolution, 70:781–795.
Coop, G., Witonsky, D., Di Rienzo, A., and Pritchard, J. K. (2010). Using environmental correlations to identify loci underlying local adaptation. Genetics, 185:1411–1423.
Cutler, D. J., and Jensen, J. D. (2010). To pool, or not to pool? Genetics, 186:41–43.
Dennenmoser, S., Nolte, A. W., Vamosi, S. M., and Rogers, S. M. (2015). Phylogeography of the prickly sculpin (Cottus asper) in north-western North America reveals parallel phenotypic evolution across multiple coastal-inland colonizations. J. Biogeogr., 42:1626–1638.
Dennenmoser, S., Rogers, S. M., and Vamosi, S. M. (2014). Genetic population structure in prickly sculpin (Cottus asper) reflects isolation-by-environment between two life-history ecotypes. Biol. J. Linnean Soc., 113:943–957.
Dennenmoser, S., Vamosi, S. M., Nolte, A. W., and Rogers, S. M. (2017). Adaptive genomic divergence under high gene flow between freshwater and brackish-water ecotypes of prickly sculpin (Cottus asper) revealed by Pool-Seq. Mol. Ecol., 26:25–42.
Dexter, E., Bollens, S. M., Cordell, J., Soh, H. Y., Rollwagen-Bollens, G., Pfeifer, S. P., Goudet, J., and Vuilleumier, S. (2018). A genetic reconstruction of the invasion of the calanoid copepod Pseudodiaptomus inopinus across the North American Pacific Coast. Biol. Invasions, 20:1577–1595.
Ellegren, H. (2014). Genome sequencing and population genomics in nonmodel organisms. Trends Ecol. Evol., 29:51–63.
Excoffier, L. (2007). Analysis of population subdivision. In Balding, D. J., Bishop, M., and Cannings, C., editors, Handbook of Statistical Genetics, pages 980–1020, Chichester. John Wiley & Sons, Ltd.
Fariello, M. I., Boitard, S., Mercier, S., Robelin, D., Faraut, T., Arnould, C., Recoquillay, J., Bouchez, O., Salin, G., Dehais, P., Gourichon, D., Leroux, S., Pitel, F., Leterrier, C., and SanCristobal, M. (2017). Accounting for linkage disequilibrium in genome scans for selection without individual genotypes: the local score approach. Mol. Ecol., 26:3700–3714.
Ferretti, L., Ramos-Onsins, S., and Pérez-Enciso, M. (2013). Population genomics from pool sequencing. Mol. Ecol., 22:5561–5576.
Fischer, M. C., Rellstab, C., Leuzinger, M., Roumet, M., Gugerli, F., Shimizu, K. K., Holderegger, R., and Widmer, A. (2017). Estimating genomic diversity and population differentiation - an empirical comparison of microsatellite and SNP variation in Arabidopsis halleri. BMC Genomics, 18:69.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychol. Bull., 76:378–382.
Fleiss, J. L., and Cuzick, J. (1979). The reliability of dichotomous judgements: Unequal numbers of judges per subject. Appl. Psychol. Meas., 3:537–542.
Fuentes-Pardo, A. P., and Ruzzante, D. E. (2017). Whole-genome sequencing approaches for conservation biology: Advantages, limitations and practical recommendations. Mol. Ecol., 26:5369–5406.
Futschik, A. and Schlötterer, C. (2010). The next generation of molecular markers from massively parallel sequencing of pooled DNA samples. Genetics, 186:207–218.
Gautier, M. (2015). Genome-wide scan for adaptive divergence and association with population-specific covariates. Genetics, 201:1555–1579.
Gautier, M., Gharbi, K., Cezard, T., Galan, M., Loiseau, A., Thomson, M., Pudlo, P., Kerdelhué, C., and Estoup, A. (2013). Estimation of population allele frequencies from next-generation sequencing data: pool-versus individual-based genotyping. Mol. Ecol., 22:3766–3779.
Glenn, T. C. (2011). Field guide to next-generation DNA sequencers. Mol. Ecol. Resour., 11:759–769.
Goudet, J. (1993). The genetics of geographically structured populations. PhD thesis, University of Wales, Bangor.
Holsinger, K. E., and Weir, B. S. (2009). Genetics in geographically structured populations: defining, estimating and interpreting F_ST. Nat. Rev. Genet., 10:639–650.
Hudson, R. R. (2002). Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics, 18:337–338.
Karlsson, E. K., Baranowska, I., Wade, C. M., Salmon Hillbertz, N. H. C., Zody, M. C., Anderson, N., Biagi, T. M., Patterson, N., Pielberg, G. R., Kulbokas, E. J., Comstock, K. E., Keller, E. T., Mesirov, J. P., von Euler, H., Kämpe, O., Hedhammar, A., Lander, E. S., Andersson, G., Andersson, L., and Lindblad-Toh, K. (2007). Efficient mapping of Mendelian traits in dogs through genome-wide association. Nat. Genet., 39:1321–1328.
Kofler, R., Pandey, R. V., and Schlötterer, C. (2011). PoPoolation2: identifying differentiation between populations using sequencing of pooled DNA samples (Pool-Seq). Bioinformatics, 27:3435–3436.
Landis, J. R., and Koch, G. G. (1977). A one-way components of variance model for categorical data. Biometrics, 33:671–679.
Leblois, R., Gautier, M., Rohfritsch, A., Foucaud, J., Burban, C., Galan, M., Loiseau, A., Sauné, L., Branco, M., Gharbi, K., Vitalis, R., and Kerdelhué, C. (2018). Deciphering the demographic history of allochronic differentiation in the pine processionary moth Thaumetopoea pityocampa. Mol. Ecol., 27:264–278.
Lewontin, R. C., and Krakauer, J. (1973). Distribution of gene frequency as a test of the theory of the selective neutrality of polymorphism. Genetics, 74:175–195.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., and 1000 Genome Project Data Processing Subgroup (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25:2078–2079.
Lotterhos, K. E., and Whitlock, M. C. (2014). Evaluation of demographic history and neutral parameterization on the performance of F_ST outlier tests. Mol. Ecol., 23:2178–2192.
Lotterhos, K. E., and Whitlock, M. C. (2015). The relative power of genome scans to detect local adaptation depends on sampling design and statistical method. Mol. Ecol., 24:1031–1046.
Lynch, M., Bost, D., Wilson, S., Maruki, T., and Harrison, S. (2014). Population-genetic inference from pooled-sequencing data. Genome Biol. Evol., 6:1210–1218.
Mak, T. K. (1988). Analysing intraclass correlation for dichotomous variables. J. R. Stat. Soc. Ser. C Appl. Stat., 37:344–352.
Malécot, G. (1948). Les Mathématiques de l’Hérédité. Masson, Paris.
Nei, M. (1973). Analysis of gene diversity in subdivided populations. Proc. Natl. Acad. Sci. USA, 70:3321–3323.
Nei, M. (1977). F-statistics and analysis of gene diversity in subdivided populations. Ann. Hum. Genet., 41:225–233.
Nei, M. (1978). Estimation of average heterozygosity and genetic distance from a small number of individuals. Genetics, 89:583–590.
Nei, M. (1986). Definition and estimation of fixation indices. Evolution, 40:643–645.
Nei, M. and Chesser, R. K. (1983). Estimation of fixation indices and gene diversities. Ann. Hum. Genet., 47:253–259.
Nychka, D., Furrer, R., Paige, J., and Sain, S. (2017). fields: Tools for spatial data. R package version 9.6.
Orgogozo, V., Peluffo, A. E., and Morizot, B. (2016). The “mendelian gene” and the “molecular gene”: two relevant concepts of genetic units. In Orgogozo, V., editor, Genes and Evolution, volume 119 of Current Topics in Developmental Biology, pages 1–26. Academic Press.
Pickrell, J. K., and Pritchard, J. K. (2012). Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet., 8(11):e1002967.
R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Reynolds, J., Weir, B. S., and Cockerham, C. C. (1983). Estimation of the coancestry coefficient: basis for a short-term genetic distance. Genetics, 105:767–779.
Ridout, M. S., Demétrio, C. G. B., and Firth, D. (1999). Estimating intraclass correlation for binary data. Biometrics, 55:137–148.
Robertson, A. (1962). Weighting in the estimation of variance components in the unbalanced single classification. Biometrics, 18:413–417.
Rode, N. O., Holtz, Y., Loridon, K., Santoni, S., Ronfort, J., and Gay, J. (2018). How to optimize the precision of allele and haplotype frequency estimates using pooled-sequencing data. Mol. Ecol. Resour., 18:194–203.
Ross, M. G., Russ, C., Costello, M., Hollinger, A., Lennon, N. J., Hegarty, R., Nusbaum, C., and Jaffe, D. B. (2013). Characterizing and measuring bias in sequence data. Genome Biol., 14:R51.
Rousset, F. (1996). Equilibrium values of measures of population subdivision for stepwise mutation processes. Genetics, 142:1357–1362.
Rousset, F. (1997). Genetic differentiation and estimation of gene flow from F-statistics under isolation by distance. Genetics, 145:1219–1228.
Rousset, F. (2007). Inferences from spatial population genetics. In Balding, D. J., Bishop, M., and Cannings, C., editors, Handbook of Statistical Genetics, pages 945–979, Chichester. John Wiley & Sons, Ltd.
Rousset, F. (2008). genepop’007: a complete re-implementation of the genepop software for Windows and Linux. Mol. Ecol. Resour., 8:103–106.
Schlötterer, C., Tobler, R., Kofler, R., and Nolte, V. (2014). Sequencing pools of individuals - mining genome-wide polymorphism data without big funding. Nat. Rev. Genet., 15:749–763.
Slatkin, M. (1993). Isolation by distance in equilibrium and non-equilibrium populations. Evolution, 47:264–279.
Smadja, C. M., Canbäck, B., Vitalis, R., Gautier, M., Ferrari, J., Zhou, J.-J., and Butlin, R. K. (2012). Large-scale candidate gene scan reveals the role of chemoreceptor genes in host plant specialization and speciation in the pea aphid. Evolution, 66:2723–2738.
The International HapMap Consortium (2005). A haplotype map of the human genome. Nature, 437:1299–1320.
Tukey, J. W. (1957). Variances of variance components: II. The unbalanced single classification. Ann. Math. Statist., 28:43–56.
Vitalis, R. (2012). DetSel: An R-Package to detect marker loci responding to selection. In Pompanon, F. and Bonin, A., editors, Data Production and Analysis in Population Genomics: Methods and Protocols, volume 888 of Methods in Molecular Biology, pages 277–293, New York. Humana Press.
Vitalis, R., Boursot, P., and Dawson, K. (2001). Interpretation of variation across marker loci as evidence of selection. Genetics, 158:1811–1823.
Wahlund, S. (1928). Zusammensetzung von Populationen und Korrelationserscheinungen vom Standpunkt der Vererbungslehre aus betrachtet. Hereditas, 11:65–106.
Weir, B. S. (1996). Genetic Data Analysis II. Sinauer Associates, Inc., Sunderland, MA.
Weir, B. S. (2012). Estimating F-statistics: A historical view. Philos. Sci., 79:637–643.
Weir, B. S., Cardon, L. R., Anderson, A. D., Nielsen, D. M., and Hill, W. G. (2005). Measures of human population structure show heterogeneity among genomic regions. Genome Res., 15:1468–1476.
Weir, B. S., and Cockerham, C. C. (1984). Estimating F-statistics for the analysis of population structure. Evolution, 38:1358–1370.
Weir, B. S., and Goudet, J. (2017). A unified characterization of population structure and relatedness. Genetics, 206:2085–2103.
Weir, B. S., and Hill, W. G. (2002). Estimating F-statistics. Annu. Rev. Genet., 36:721–750.
Whitlock, M. C., and Lotterhos, K. E. (2015). Reliable detection of loci responsible for local adaptation: inference of a null model through trimming the distribution of F_ST. Am. Nat., 186:S24–S36.
Wright, S. (1931). Evolution in Mendelian populations. Genetics, 16:97–159.
Wright, S. (1951). The genetical structure of populations. Ann. Eugen., 15:323–354.
Wu, S., Crespi, C. M., and Wong, W. K. (2012). Comparison of methods for estimating the intraclass correlation coefficient for binary responses in cancer prevention cluster randomized trials. Contemp. Clin. Trials, 33:869–880.

Posted July 12, 2018.

Subject Area

Evolutionary Biology

Subject Areas

All Articles

Animal Behavior and Cognition (5213)
Biochemistry (11744)
Bioengineering (8751)
Bioinformatics (29193)
Biophysics (14968)
Cancer Biology (12094)
Cell Biology (17411)
Clinical Trials (138)
Developmental Biology (9421)
Ecology (14178)
Epidemiology (2067)
Evolutionary Biology (18303)
Genetics (12244)
Genomics (16801)
Immunology (11866)
Microbiology (28082)
Molecular Biology (11592)
Neuroscience (60959)
Paleontology (451)
Pathology (1870)
Pharmacology and Toxicology (3238)
Physiology (4957)
Plant Biology (10427)
Scientific Communication and Education (1683)
Synthetic Biology (2885)
Systems Biology (7339)
Zoology (1651)

[1] ↵
Akey, J. M., Zhang, G., Jin, L., and Shriver, M. D. (2002). Interrogating a high-density SNP map for signatures of natural selection. Genome Res., 12:1805–1814.
OpenUrl Abstract/FREE Full Text

[2] ↵
Anderson, E. C., Skaug, H. J., and Barshis, D. J. (2014). Next-generation sequencing for molecular ecology: a caveat regarding pooled samples. Mol. Ecol., 23:502–512.
OpenUrl CrossRef

[3] ↵
Beaumont, M. A. (2005). Adaptation and speciation: what can F_ST tell us? Trends Ecol. Evol., 20:435–440.
OpenUrl CrossRef PubMed Web of Science

[4] ↵
Beaumont, M. A., and Nichols, R. A. (1996). Evaluating loci for use in the genetic analysis of population structure. Proc. R. Soc. Lond. B, 263:1619–1626.
OpenUrl CrossRef

[5] ↵
Bhatia, G., Patterson, N., Sankararaman, S., and Price, A. L. (2013). Estimating and interpreting F_ST: the impact of rare variants. Genome Res., 23:1514–1521.
OpenUrl Abstract/FREE Full Text

[6] ↵
Cavalli-Sforza, L. (1966). Population structure and human evolution. Proc. R. Soc. Lond., B, Biol. Sci., 164:362–379.
OpenUrl CrossRef

[7] ↵
Chen, J., Källman, T., Ma, X.-F., Zaina, G., Morgante, M., and Lascoux, M. (2016). Identifying genetic signatures of natural selection using pooled populations sequencing in Picea abies. G3, 6:1979–1989.
OpenUrl

[8] ↵
Cockerham, C. C. (1969). Variance of gene frequencies. Evolution, 23:72–84.
OpenUrl CrossRef Web of Science

[9] ↵
Cockerham, C. C. (1973). Analyses of gene frequencies. Genetics, 74:679–700.
OpenUrl Abstract/FREE Full Text

[10] ↵
Cockerham, C. C., and Weir, B. S. (1987). Analyses of gene frequencies. Proc. Natl. Acad. Sci. USA, 84:8512–8514.
OpenUrl Abstract/FREE Full Text

[11] ↵
Collet, J. M., Fuentes, S., Hesketh, J., Hill, M. S., Innocenti, P., Morrow, E. H., Fowler, K., and Reuter, M. (2016). Rapid evolution of the intersexual genetic correlation for fitness in Drosophila melanogaster. Evolution, 70:781–795.
OpenUrl CrossRef

[12] ↵
Coop, G., Witonsky, D., Di Rienzo, A., and Pritchard, J. K. (2010). Using environmental correlations to identify loci underlying local adaptation. Genetics, 185:1411–1423.
OpenUrl Abstract/FREE Full Text

[13] ↵
Cutler, D. J., and Jensen, J. D. (2010). To pool, or not to pool? Genetics, 186:41–43.
OpenUrl FREE Full Text

[14] ↵
Dennenmoser, S., Nolte, A. W., Vamosi, S. M., and Rogers S. M. (2015). Phy-logeography of the prickly sculpin (Cottus asper) in north-western North America reveals parallel phenotypic evolution across multiple coastal-inland colonizations. J. Biogeogr., 42:1626–1638.
OpenUrl CrossRef

[15] ↵
Dennenmoser, S., Rogers, S. M., and Vamosi, S. M. (2014). Genetic population structure in prickly sculpin (Cottus asper) reflects isolation-by-environment between two life-history ecotypes. Biol. J. Linnean Soc., 113:943–957.
OpenUrl

[16] ↵
Dennenmoser, S., Vamosi, S. M., Nolte, S. W., and Rogers, S. M. (2017). Adaptive genomic divergence under high gene flow between freshwater and brackish-water ecotypes of prickly sculpin (Cottus asper) revealed by Pool-Seq. Mol. Ecol., 26:25–42.
OpenUrl CrossRef

[17] ↵
Dexter, E., Bollens, S. M., Cordell, J., Soh, H. Y., Rollwagen-Bollens, G., Pfeifer, S. P., Goudet, J., and Vuilleumier, S. (2018). A genetic reconstruction of the invasion of the calanoid copepod Pseudodiaptomus inopinus across the North American Pacific Coast. Biol. Invasions, 20:1577–1595.
OpenUrl CrossRef

[18] ↵
Ellegren, H. (2014). Genome sequencing and population genomics in nonmodel organisms. Trends Ecol. Evol., 29:51–63.
OpenUrl CrossRef PubMed Web of Science

[19] ↵
Balding, D. J.,
Bishop, M., and
Cannings, C.
Excoffier, L. (2007). Analysis of population subdivision. In Balding, D. J., Bishop, M., and Cannings, C., editors, Handbook of Statistical Genetics, pages 980–1020, Chichester. John Wiley & Sons, Ltd.

[20] Balding, D. J.,

[21] Bishop, M., and

[22] Cannings, C.

[23] ↵
Fariello, M. I., and Boitard, S., Mercier, S., Robelin, D., Faraut, T., Arnould, C., Recoquillay, J., Bouchez, O., Salin, G., Dehais, P., Gourichon, D., Leroux, S., Pitel, F., Leterrier, C., and SanCristobal, M. (2017). Accounting for Linkage Disequilibrium in genome scans for selection without individual genotypes : the local score approach. Mol. Ecol., 26:3700–3714.
OpenUrl CrossRef

[24] ↵
Ferretti, L., Ramos Onsins, S., and Pérez-Enciso, M. (2013). Population genomics from pool sequencing. Mol. Ecol., 22:5561–5576.
OpenUrl CrossRef Web of Science

[25] ↵
Fischer, M. C., Rellstab, C., Leuzinger, M., Roumet, M., Gugerli, F., Shimizu, K. K., Holderegger, R., and Widmer, A. (2017). Estimating genomic diversity and population differentiation - an empirical comparison of microsatellite and SNP variation in Arabidopsis halleri. BMC Genomics, 18:69.
OpenUrl CrossRef

[26] ↵
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychol. Bull., 76:378–382.
OpenUrl CrossRef

[27] ↵
Fleiss, J. L., and Cuzick, J. (1979). The reliability of dichotomous judgements: Unequal numbers of judges per subject. Appl. Psychol. Meas., 3:537–542.
OpenUrl CrossRef

[28] ↵
Fuentes-Pardo, A. P.and Ruzzente, D. E. (2017). Whole-genome sequencing approaches for conservation biology: Advantages, limitations and practical recommendations. Mol. Ecol., 26:5369–5406.
OpenUrl CrossRef

[29] ↵
Futschik, A. and Schlötterer, C. (2010). The next generation of molecular markers from massively parallel sequencing of pooled DNA samples. Genetics, 186:207–218.
OpenUrl Abstract/FREE Full Text

[30] ↵
Gautier, M. (2015). Genome-wide scan for adaptive divergence and association with population-specific covariates. Genetics, 201:1555–1579.
OpenUrl Abstract/FREE Full Text

[31] ↵
Gautier, M., Gharbi, K., Cezaerd, T., Galan, M., Loiseau, A., Thomson, M., Pudlo, P., Kerdelhué, C., and Estoup, A. (2013). Estimation of population allele frequencies from next-generation sequencing data: pool-versus individual-based genotyping. Mol. Ecol., 22:3766–3779.
OpenUrl CrossRef Web of Science

[32] ↵
Glenn, T. C. (2011). Field guide to next-generation DNA sequencers. Mol. Ecol. Resour., 11:759–769.
OpenUrl CrossRef PubMed

[33] ↵
Goudet, J. (1993). The genetics of geographically structured populations. PhD thesis, University of Wales, Bangor.

[34] ↵
Holsinger, K. S., and Weir, B. S. (2009). Genetics in geographically structured populations: defining, estimating and interpreting F_ST. Nat. Rev. Genet., 10:639–650.
OpenUrl CrossRef PubMed Web of Science

[35] ↵
Hudson, R. R. (2002). Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics, 18:337–338.
OpenUrl CrossRef PubMed Web of Science

[36] ↵
Karlsson, E. K., Baranowska, I., Wade, C. M., Salmon Hillbertz, N. H. C., Zody, M. C., Anderson, N., Biagi, T. M., Patterson, N., Pielberg, G. R., Kulbokas, E. J., Comstock, K. E., Keller, E. T., Mesirov, J. P., von Euler, H., Kämpe, O., Hedhammar, A., Lander, E. S., Andersson, G., Andersson, L., and Lindblad-Toh, K. (2007). Efficient mapping of Mendelian traits in dogs through genome-wide association. Nat. Genet., 39:1321–1328.
OpenUrl CrossRef PubMed Web of Science

[37] ↵
Kofler, R., Pandey, R. V., and Schlötterer, C. (2011). PoPoolation2: identifying differentiation between populations using sequencing of pooled DNA samples (Pool-Seq). Bioinformatics, 27:3435–3436.
OpenUrl CrossRef PubMed Web of Science

[38] ↵
Landis, J. R., and Koch, G. G. (1977). A one-way components of variance model for categorical data. Biometrics, 33:671–679.
OpenUrl CrossRef Web of Science

[39] ↵
Leblois, R., Gautier, M., Rohfritsch, A., Foucaud, J., Burban, C., Galan, M., Loiseau, A., Sauné, L., Branco, M., Gharbi, K., Vitalis, R., and Kerdelhué, C. (2018). Deciphering the demographic history of allochronic differentiation in the pine processionary moth Thaumetopoea pityocampa. Mol. Ecol., 27:264–278.
OpenUrl CrossRef

[40] ↵
Lewontin, R. C., and Krakauer, J. (1973). Distribution of gene frequency as a test of the theory of the selective neutrality of polymorphism. Genetics, 74:175–195.
OpenUrl Abstract/FREE Full Text

[41] ↵
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., and 1000 Genome Project Data Processing Subgroup (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25:2078–2079.
OpenUrl CrossRef PubMed Web of Science

[42] ↵
Lotterhos, K. E., and Whitlock, M. C. (2014). Evaluation of demographic history and neutral parameterization on the performance of F_ST outlier tests. Mol. Ecol, 23:2178–2192.
OpenUrl CrossRef PubMed

[43] ↵
Lotterhos, K. E., and Whitlock, M. C. (2015). The relative power of genome scans to detect local adaptation depends on sampling design and statistical method. Mol. Ecol., 24:1031–1046.
OpenUrl CrossRef PubMed

[44] ↵
Lynch, M., Bost, D., Wilson, S., Maruki, T., and Harrison, S. (2014). Population-genetic inference from pooled-sequencing data. Genome Biol. Evol., 6:1210–1218.
OpenUrl CrossRef PubMed

[45] ↵
Mak, T. K. (1988). Analysing intraclass correlation for dichotomous variables. J. R. Stat. Soc. Ser. C Appl. Stat., 37:344–352.
OpenUrl

[46] ↵
Malécot, G. (1948). Les Mathématiques de l’Hérédité. Masson, Paris.

[47] ↵
Nei, M. (1973). Analysis of gene diversity in subdivided populations. Proc. Natl. Acad. Sci. USA, 70:3321–3323.
OpenUrl Abstract/FREE Full Text

[48] ↵
Nei, M. (1977). F-statistics and analysis of gene diversity in subdivided populations. Ann. Hum. Genet., 41:225–233.
OpenUrl CrossRef PubMed Web of Science

[49] ↵
Nei, M. (1978). Estimation of average heterozygosity and genetic distance from a small number of individuals. Genetics, 89:583–590.
OpenUrl Abstract/FREE Full Text

[50] ↵
Nei, M. (1986). Definition and estimation of fixation indices. Evolution, 40:643–645.
OpenUrl CrossRef Web of Science

[51] ↵
Nei, M. and Chesser, R. K. (1983). Estimation of fixation indices and gene diversities. Ann. Hum. Genet., 47:253–259.
OpenUrl CrossRef PubMed Web of Science

[52] ↵
Nychka, D., Furrer, R., Paige, J., and Sain, S. (2017). fields: Tools for spatial data. R package version 9.6.

[53] ↵
Orgogozo, V.
Orgogozo, V., Peluffo, A. E., and Morizot, B. (2016). The “mendelian gene” and the “molecular gene”: two relevant concepts of genetic units. In Orgogozo, V., editor, Genes and Evolution, volume 119 of Current Topics in Developmental Biology, pages 1–26. Academic Press.

[54] Orgogozo, V.

[55] ↵
Pickrell, J. K., and Pritchard, J. K. (2012). Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet., 8(11):e1002967.
OpenUrl CrossRef PubMed

[56] ↵
R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

[57] ↵
Reynolds, J., Weir, B. S., and Cockerham, C. C. (1983). Estimation of the coancestry coefficient: basis for a short-term genetic distance. Genetics, 105:767–779.
OpenUrl Abstract/FREE Full Text

[58] ↵
Ridout, M. S., Demktrio, C. G. B., and Firth, D. (1999). Estimating intraclass correlation for binary data. Biometrics, 55:137–148.
OpenUrl CrossRef PubMed Web of Science

[59] ↵
Robertson, A. (1962). Weighting in the estimation of variance components in the unbalanced single classification. Biometrics, 18:413–417.
OpenUrl

[60] ↵
Rode, N. O., Holtz, Y., Loridon, K., Santoni, S., Ronfort, J., and Gay, J. (2018). How to optimize the precision of allele and haplotype frequency estimates using pooled-sequencing data. Mol. Ecol. Resour., 18:194–203.
OpenUrl

[61] ↵
Ross, M. G., Russ, C., Costello, M., Hollinger, A., Lennon, N. J., Hegarty, R., Nusbaum, C., and Jaffe, D. B. (2013). Characterizing and measuring bias in sequence data. Genome Biol., 14:R51.
OpenUrl CrossRef PubMed

[62] ↵
Rousset, F. (1996). Equilibrium values of measures of population subdivision for stepwise mutation processes. Genetics, 142:1357–1362.
OpenUrl Abstract/FREE Full Text

[63] ↵
Rousset, F. (1997). Genetic differentiation and estimation of gene flow from F-statistics under isolation by distance. Genetics, 145:1219–1228.
OpenUrl Abstract/FREE Full Text

[64] ↵
Balding, D. J.,
Bishop, M., and
Cannings, C.
Rousset, F. (2007). Inferences from spatial population genetics. In Balding, D. J., Bishop, M., and Cannings, C., editors, Handbook of Statistical Genetics, pages 945–979, Chichester. John Wiley & Sons, Ltd.

[65] Balding, D. J.,

[66] Bishop, M., and

[67] Cannings, C.

[68] ↵
Rousset, F. (2008). genepop’007: a complete re-implementation of the genepop software for Windows and Linux. Mol. Ecol. Resour., 8:103–106.
OpenUrl CrossRef PubMed Web of Science

[69] ↵
Schlötterer, C., Tobler, R., Kofler, R., and Nolte, V. (2014). Sequencing pools of individuals - mining genome-wide polymorphism data without big funding. Nat. Rev. Genet., 15:749–763.
OpenUrl CrossRef PubMed

[70] ↵
Slatkin, M. (1993). Isolation by distance in equilibrium and non-equilibrium populations. Evolution, 47:264–279.
OpenUrl CrossRef Web of Science

[71] ↵
Smadja, C. M., Canbäck, B., Vitalis, R., Gautier, M., Ferrari, J., Zhou, J.-J., and Butlin, R. K. (2012). Large-scale candidate gene scan reveals the role of chemoreceptor genes in host plant specialization and speciation in the pea aphid. Evolution, 66:2723–2738.
OpenUrl CrossRef PubMed Web of Science

[72] ↵
The International HapMap Consortium (2005). A haplotype map of the human genome. Nature, 437:1299–1320.
OpenUrl CrossRef PubMed Web of Science

[73] ↵
Tukey, J. W. (1957). Variances of variance components: II. The unbalanced single classification. Ann. Math. Statist., 28:43–56.
OpenUrl

[74] ↵
Pompanon, F. and
Bonin, A.
Vitalis, R. (2012). DetSel: An R-Package to detect marker loci responding to selection. In Pompanon, F. and Bonin, A., editors, Data Production and Analysis in Population Genomics: Methods and Protocols, volume 888 of Methods in Molecular Biology, pages 277–293, New York. Humana Press.

[75] Pompanon, F. and

[76] Bonin, A.

[77] ↵
Vitalis, R., Boursot, P., and Dawson, K. (2001). Interpretation of variation across marker loci as evidence of selection. Genetics, 158:1811–1823.
OpenUrl Abstract/FREE Full Text

[78] ↵
Wahlund, S. (1928). Zusammens etzung von populationen und korrelationserscheinungen vom standpunkt der vererbungslehre aus betrachtet. Hered-itas, 11:65–106.
OpenUrl CrossRef Web of Science

[79] ↵
Weir, B. S. (1996). Genetic Data Analysis II. Sinauer Associates, Inc., Sunderland, MA.

[80] ↵
Weir, B. S. (2012). Estimating F-statistics: A historical view. Philos. Sci., 79:637–643.
OpenUrl CrossRef

[81] ↵
Weir, B. S., Cardon, L. R., Anderson, A. D., Nielsen, D. M., and Hill, W. G. (2005). Measures of human population structure show heterogeneity among genomic regions. Genome Res., 15:1468–1476.
OpenUrl Abstract/FREE Full Text

[82] ↵
Weir, B. S., and Cockerham, C. C. (1984). Estimating F-statistics for the analysis of population structure. Evolution, 38:1358–1370.
OpenUrl CrossRef PubMed Web of Science

[83] ↵
Weir, B. S., and Goudet, J. (2017). An unified characterization of population structure and relatedness. Genetics, 206:2085–2103.
OpenUrl Abstract/FREE Full Text

[84] ↵
Weir, B. S., and Hill, W. G. (2002). Estimating F-statistics. Annu. Rev. Genet., 36:721–750.
OpenUrl CrossRef PubMed Web of Science

[85] ↵
Whitlock, M. C., and Lotterhos, K. E. (2015). Reliable detection of loci responsible for local adaptation: inference of a null model through trimming the distribution of F_ST. Am. Nat., 186:S24–S36.
OpenUrl CrossRef PubMed

[86] ↵
Wright, S. (1931). Evolution in Mendelian populations. Genetics, 16:97–159.
OpenUrl FREE Full Text

[87] ↵
Wright, S. (1951). The genetical structure of populations. Ann. Eugen., 15:323–354.
OpenUrl PubMed Web of Science

[88] ↵
Wu, S., Crespi, C. M., and Wong, W. K. (2012). Comparison of methods for estimating the intraclass correlation coefficient for binary responses in cancer prevention cluster randomized trials. Contemp. Clin. Trials, 33:869–880.
OpenUrl CrossRef PubMed
