Abstract
Local adaptation is often studied via 1) multiple common garden experiments comparing performance of genotypes in different environments and 2) sequencing genotypes from multiple locations and characterizing geographic patterns in allele frequency. Both approaches aim to characterize the same pattern (local adaptation), yet the complementary information from each has not yet been coherently integrated into a modeling framework. Here, we develop a genome-wide association model of genotype interactions with continuous environmental gradients (G×E), i.e. reaction norms. We employ an imputation approach to synthesize evidence from common garden and genome-environment associations, allowing us to identify loci exhibiting environmental clines where alleles are associated with higher fitness in home environments. Simulations show our approach can increase power to detect loci causing local adaptation. In a case study on Arabidopsis thaliana, our approach reveals candidate genes for local adaptation based on known involvement in environmental stress response. Most identified SNPs exhibited home allele advantage and fitness tradeoffs along climate gradients, suggesting selective gradients maintain allelic clines. SNPs exhibiting G×E associations with fitness were enriched in genic regions, putative partial selective sweeps, and G×E associations with an adaptive phenotype (flowering time). We discuss extensions for situations where only adaptive phenotypes other than fitness are available. Many types of data may point toward the loci underlying G×E and local adaptation; coherent models of diverse data provide a principled basis for synthesis.
Introduction
Populations commonly exhibit phenotypic differences, often due to local adaptation to environment (Leimu & Fischer 2008; Hereford 2009). Local adaptation is defined as a genotype-by-environment interaction (G×E) for fitness that favors home genotypes (Kawecki & Ebert 2004). Local adaptation has long interested empirical and theoretical biologists (Clausen et al. 1940, 1948; Levene 1953; Slatkin 1973). However, little is known about the genomic basis of local adaptation, such as genetic architecture, major molecular mechanisms, and the extent to which genomic divergence among populations is driven by local adaptation. Because local adaptation involves organismal responses to environmental gradients, understanding the mechanisms of local adaptation has important applications in agriculture and biodiversity conservation under climate change (Aitken & Whitlock 2013; van Oppen et al. 2015; Lasky et al. 2015). Additionally, G×E interactions are important in human phenotypes like disease (Anastasi 1958; Hunter 2005; Gage et al. 2016). Understanding the genomic basis of G×E is an emerging area of biomedical research (Thomas 2010; Keller 2014), as is the genomics of local adaptation (reviewed by Des Marais et al. 2013; Manel & Holderegger 2013; Tiffin & Ross-Ibarra 2014; Adrion et al. 2015; Bragg et al. 2015; Hoban et al. 2016).
A central question in local adaptation is whether selective gradients can maintain allelic clines at individual loci, or whether stochastic processes, like limited dispersal, are required to explain clines at individual loci causing local adaptation (Mitchell-Olds et al. 2007; Anderson et al. 2011b). If selective gradients cause rank changes in the allele with the highest relative fitness at an individual locus, selection may maintain a cline, a pattern known as genetic tradeoff or antagonistic pleiotropy (Ågren et al. 2013). Detecting loci that exhibit antagonistic pleiotropy has been challenging, partly due to limited statistical power of approaches that conduct multiple tests of significance for opposing fitness effects in different environments (Fournier-Level et al. 2011a; Anderson et al. 2013).
Common garden experiments have long been employed to characterize genetic variation in phenotypes (Langlet 1971). In particular, reciprocal common gardens at multiple positions along environmental gradients are a powerful tool to reveal local adaptation (Clausen et al. 1940, 1948). One approach to identifying the loci underlying local adaptation is to combine fitness data from multiple common garden experiments with genomic data (Lowry & Willis 2010; Fournier-Level et al. 2011a; Anderson et al. 2011a; Ågren et al. 2013). However, common gardens are logistically challenging and it is unclear how the typically small spatiotemporal scales of common gardens relate to the scales of processes that generate local adaptation in the wild (Weigel & Nordborg 2015).
An alternative approach to discovering genetic and ecological mechanisms of local adaptation is to study changes in allele frequency along environmental gradients (Hedrick et al. 1976; Tiffin & Ross-Ibarra 2014; Adrion et al. 2015; Bragg et al. 2015; Rellstab et al. 2015; Hoban et al. 2016). In this approach, known as a genome-environment association study, individuals are sequenced from multiple locations along environmental gradients. Genetic markers and environmental gradients showing the strongest correlations are then considered as potentially involved in local adaptation (e.g. Hancock et al. 2008, 2011; Eckert et al. 2010; Turner et al. 2010; Coop et al. 2010; Lasky et al. 2012; Jones et al. 2012; Fitzpatrick & Keller 2015). A challenge of both traditional association studies (genome-phenotype) and genome-environment association studies is that the genomic variation is observational and is not experimentally randomized (as opposed to linkage mapping with experimental crosses) (Devlin & Roeder 1999; Hancock et al. 2008; Kang et al. 2008; Nordborg & Weigel 2008). As a result, many loci may show spurious associations with phenotypes or with environment (Price et al. 2010; Schoville et al. 2012; Bragg et al. 2015). Spurious associations are particularly problematic for environmental gradients that are spatially autocorrelated due to confounding with population structure (Schaffer & Johnson 1974). A technique for dealing with this confounding is to control for putative population structure when testing associations (Coop et al. 2010) by controlling for genome-wide (identity-in-state) similarity among accessions (Yoder et al. 2014; Lasky et al. 2014).
Understanding the genomic basis of adaptation may benefit from synthesizing lines of evidence, for example by combining multiple types of genome scans to strengthen the evidence that a locus is under selection (Lasky et al. 2014; Evans et al. 2014). For example, researchers have identified overlap between outliers for selection statistics and markers associated with putatively adaptive phenotypes (Horton et al. 2012) or between SNPs associated with phenotypes and those associated with climate gradients (Berg & Coop 2014). Lasky et al. (2015) used a Bayesian approach to combine associations with phenotype and environment, first calculating climate associations and then using each marker’s association to determine the prior probability it was associated with G×E for adaptive phenotypes, yielding a posterior. Although combining multiple lines of evidence is potentially useful, the quantitative approaches in past studies have often been ad hoc and lacked reasoned principles. Here we develop a modeling framework to conduct genome-wide association scans for G×E while coherently synthesizing multiple data types. Existing approaches to genome-wide association studies (GWAS) with G×E (sometimes referred to as genome-wide interaction studies, GWIS) have dealt with categorical nominal environments (Murcray et al. 2009; Thomas 2010; Korte et al. 2012; Gauderman et al. 2013; Marigorta & Gibson 2014), benefiting from the statistical convenience of modeling phenotypes in different environments as correlated traits (Falconer 1952). Association models have not been applied to G×E along continuous environmental gradients, such as modeling SNP associations with reaction norms (Jarquín et al. 2014; Tiezzi et al. 2017). Despite the existence of studies where fitness was measured in multiple common gardens for diverse genotyped accessions (Fournier-Level et al. 2011a), studies where linkage mapping was conducted for fitness at multiple sites (Ågren et al. 2013), and studies where authors conducted association mapping for G×E effects on phenotypes (Li et al. 2014), we found no example of association studies of G×E for fitness, which is the basis of local adaptation.
The underlying processes generating local adaptation are the same regardless of whether genome-environment associations or common gardens are used for inference. Thus it is natural to synthesize these data. Furthermore, by combining datasets into a single inferential framework we may increase power and accuracy for detecting causal loci. Here, we simultaneously leverage data from multiple common gardens and genome-environment associations. In the remainder, we describe our approach, present test cases using simulations and published data on Arabidopsis thaliana (hereafter Arabidopsis), and discuss extensions.
Methods
Genome-wide association study of G×E effects on fitness
In common garden experiments, environment is often treated as a factor. When more than two gardens are conducted, variation among them may be considered in a more general fashion. For a given environmental gradient, each common garden may be located along the gradient according to its conditions. Describing common gardens as such may be informative about the specific ecological mechanisms driving selective gradients, taking advantage of the ordered nature of gardens’ environments. We leverage multiple common garden experiments to identify markers (single nucleotide polymorphisms, SNPs) that show the strongest G×E effects, loci where allelic state shows the strongest interaction with environment in its association with fitness.
Local adaptation requires a genotype by environment interaction for fitness at the whole genome-level. Variation in individual phenotypes from multiple environments can be separated into components determined by genotype, environment, and G×E (Yates & Cochran 1938; Falconer 1952). To assess this interaction at an individual locus, one can assume that the relative fitness of individual i in a single location, wi, satisfies the linear model
$$ w_i = \alpha + \beta_E E_i + \beta_G G_{i,l} + \beta_{G \times E} G_{i,l} E_i + \varepsilon_i \qquad (1) $$
where Gi,l is the genotype of individual i at locus l and Ei is the value of a single environmental variable at the location where wi was measured. The βE parameter gives the effect of environment on fitness and βG gives the effect of genotype on fitness. Our primary interest lies in the βG×E parameter, which gives the strength and direction of G×E effects; βG×E determines how responses to environmental gradients are mediated by genotype. The term α gives the fitness intercept. We assume that the vector of errors, ε, can be expressed as
$$ \varepsilon = E v + e \qquad (2) $$
where E is a diagonal matrix of the environmental values, and
$$ v \sim N(0, \sigma_v^2 K), \qquad e \sim N(0, \sigma_e^2 I). $$
Here v and e are independent. The matrix K is calculated as the genome-wide identity in state for each pair of accessions (Kang et al. 2008). Random effects v are included because a substantial portion of G×E may be associated with population structure (Lasky et al. 2015); naively applying standard F-tests to assess the interaction effects can result in a dramatic increase in false positive rates. To ameliorate this issue, the random effect v represents the genetic background interactions with environment (G×E, magnitude of their variance determined by $\sigma_v^2$), while e represents the independent and identically distributed error in the model (variance determined by $\sigma_e^2$). However, it is important to note that incorporating random effects may also decrease power when causal loci covary with genomic background.
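The covariance implied by this error structure can be sketched as follows. This is a minimal illustration, assuming ε = Ev + e with v ~ N(0, σv² K) and e ~ N(0, σe² I), so that Cov(ε) = σv² E K E + σe² I. The function names, the 0/1 allele coding, and the toy values are illustrative assumptions, not the authors' code. Rows index fitness observations, so an accession grown in several gardens would appear once per garden.

```python
import numpy as np

def ibs_kinship(genotypes):
    """Identity-in-state similarity: share of markers with matching alleles.

    genotypes: (n_observations, n_markers) array of allele codes.
    The 0/1 coding (as for inbred lines) is an illustrative assumption.
    """
    g = np.asarray(genotypes)
    # Compare every pair of rows marker by marker, then count matches.
    matches = (g[:, None, :] == g[None, :, :]).sum(axis=2)
    return matches / g.shape[1]

def error_covariance(K, env, sigma2_v, sigma2_e):
    """Cov(eps) = sigma2_v * E K E + sigma2_e * I, with E = diag(env)."""
    E = np.diag(np.asarray(env, dtype=float))
    return sigma2_v * E @ K @ E + sigma2_e * np.eye(len(env))

# Toy data: three observations, three markers (hypothetical values).
genotypes = [[0, 0, 1], [0, 1, 1], [1, 1, 0]]
K = ibs_kinship(genotypes)
V = error_covariance(K, env=[1.0, 2.0, 3.0], sigma2_v=0.5, sigma2_e=1.0)
```

Under this structure, observations at more extreme environmental values contribute larger kinship-structured variance, which is one way to read the role of the kinship×environment random effects.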
Coherent synthesis of common gardens and genome-environment associations via imputation
We now tackle the goal of synthesizing genome-environment associations and G×E observed in multiple common gardens, given that both patterns are expected to inform on the same process of local adaptation. The challenge in synthesizing these approaches is that genome-environment associations are purely observational and lack common garden data. However, an implicit assumption in studies of genome-environment associations is that local adaptation occurs; if a common garden were conducted at each location where genotypes are collected, the home genotype would tend to be most fit. Here, we make this assumption of genome-environment association studies explicit. A formal consequence of this assumption is an (imputed) observation of highest relative fitness for genotypes at home, which we combine with observed genomic marker and environment of origin data (Figure 1). Next, we scale relative fitness within each common garden so that the maximum observed fitness is given a relative fitness of unity, yielding a measure that can be directly observed or imputed in each type of study (common garden and genome-environment association). For imputation, we then assume that each genotype collected from wild populations is locally adapted at its home and thus has a relative fitness = 1 (Figure 1). After imputation, we can calculate marker associations with G×E for fitness, where each fitness observation arises from either (A) observations on a given genotype by common garden combination or (B) imputation on a given genotype collected from its natural home and subsequently sequenced.
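The scaling and imputation steps described above can be sketched as below. This is a minimal illustration under stated assumptions: the record layout, function names, and toy values are hypothetical, and each garden is assumed to contain at least one genotype with positive fitness.

```python
def scale_within_garden(records):
    """Rescale fitness so the best performer in each garden has relative fitness 1.

    Assumes every garden has at least one positive fitness value.
    """
    best = {}
    for r in records:
        best[r["garden"]] = max(best.get(r["garden"], 0.0), r["fitness"])
    return [
        {"genotype": r["genotype"], "env": r["env"],
         "rel_fitness": r["fitness"] / best[r["garden"]], "imputed": False}
        for r in records
    ]

def impute_home_fitness(home_env):
    """Treat each sequenced genotype as locally adapted: relative fitness 1 at home."""
    return [
        {"genotype": g, "env": e, "rel_fitness": 1.0, "imputed": True}
        for g, e in home_env.items()
    ]

# Hypothetical common garden observations and home environments.
garden_records = [
    {"genotype": "acc1", "garden": "north", "env": 2.0, "fitness": 40.0},
    {"genotype": "acc2", "garden": "north", "env": 2.0, "fitness": 10.0},
    {"genotype": "acc1", "garden": "south", "env": 9.0, "fitness": 5.0},
    {"genotype": "acc2", "garden": "south", "env": 9.0, "fitness": 20.0},
]
home_env = {"acc1": 3.0, "acc2": 8.5, "acc3": 6.0}

# Observed (A) and imputed (B) fitness values share one scale and one table.
combined = scale_within_garden(garden_records) + impute_home_fitness(home_env)
```

Keeping an `imputed` flag on each row makes it straightforward to refit a model with or without the imputed observations.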
Comparing approaches and fitting models
We compared four reaction norm approaches to genome-wide G×E association studies in addition to a more common approach to genome associations with the home environment (genome-environment association). In Approach 1, we ignored potential confounding of population structure, using least squares to fit the model in eqn 1 where ε is normal, independent, and identically distributed (excluding random effects v), including only observed fitness data from common gardens and excluding imputed fitness data. In Approach 2, we again used a linear model but included imputed fitness data; these imputations, which use information from the ancillary geographic data, could possibly reduce false positives. In Approach 3, we fit the full mixed-effects model (including random effects v), but excluded imputed fitness. In Approach 4, we fit the full mixed-effects model while including imputed fitness data. To test a genome-environment association approach (Approach 5), we also compared associations between SNPs and home environments using a mixed-effects model in an approach akin to traditional association mapping but where environment is substituted for phenotype (Yoder et al. 2014; Lasky et al. 2014).
To improve computation time for the mixed-model approaches (3 & 4), we used the method of Kang et al. (2010) and first fit the random effects with covariance determined by kinship, and then fixed these effects while testing the effects of each SNP on the phenotype. We included the environmental covariate effect in this initial step, following the recommendation of Kang et al. (2010) for fitting additional non-SNP covariates. In other words, we first fit the model:
$$ w_i = \alpha + \beta_E E_i + \varepsilon_i \qquad (3) $$
Eqn. 3 is the same as Eqn. 1 but with genetic effects omitted. We obtained parameter estimates ($\hat{\alpha}$, $\hat{\beta}_E$, and the estimated variances of the random effects v and the errors e, $\hat{\sigma}_v^2$ and $\hat{\sigma}_e^2$). We then take the variance parameter estimates and use them to estimate the remaining slope coefficients in eqn 1 using generalized least squares.
Because inclusion of the βE term in ordinary least squares regression (Approaches 1-2) led to poor model fit, we excluded the term from those approaches. We fit the mixed model described above using Minimum Norm Quadratic Unbiased Estimation, MINQUE (Rao 1971; Brown 1976; Reimherr & Nicolae 2016). This approach is equivalent to restricted maximum likelihood, REML, but rephrased in a way that more fully exploits the linearity of the model, resulting in a flexible framework that can be quickly computed.
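The second step of this two-step fit, generalized least squares with the error covariance held fixed, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it does not perform MINQUE, the covariance matrix V is a toy stand-in for one built from first-step variance estimates, and all names and values are hypothetical.

```python
import numpy as np

def design_matrix(env, geno):
    """Columns: intercept, environment, genotype, genotype x environment."""
    env = np.asarray(env, dtype=float)
    geno = np.asarray(geno, dtype=float)
    return np.column_stack([np.ones_like(env), env, geno, geno * env])

def gls_fit(X, w, V):
    """Generalized least squares: solve (X' V^-1 X) b = X' V^-1 w."""
    Vinv_X = np.linalg.solve(V, X)
    Vinv_w = np.linalg.solve(V, w)
    return np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_w)

# Toy data: two genotype classes observed at four environmental values.
env = [0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0]
geno = [0, 0, 0, 0, 1, 1, 1, 1]
X = design_matrix(env, geno)
true_beta = np.array([1.0, 0.5, 0.2, 0.3])  # alpha, beta_E, beta_G, beta_GxE
w = X @ true_beta  # noiseless toy fitness values

# Toy positive-definite covariance (assumed), fixed during the SNP test.
V = np.diag(np.arange(1.0, 9.0)) + 0.1
beta_hat = gls_fit(X, w, V)
```

Because V is fixed across SNPs, only the design matrix changes from one marker to the next, which is where the computational saving of the two-step approach comes from.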
Simulation
We used simulations to demonstrate how our imputation technique can improve power to identify loci causing G×E for fitness. To assess scenarios with varying strength of local adaptation, we tested simulations of varying dispersal distances. Forester et al. (2016) previously simulated local adaptation in a square two-dimensional 1024 x 1024 grid-cell landscape along a continuous environmental gradient, using the program CDPOP v1.2 (Landguth & Cushman 2010). Forester et al. (2016) simulated 5,000 diploid individuals with 100 bi-allelic loci, one of which was under selection (99 neutral loci). All loci had a 0.0005 mutation rate per generation, free recombination, and no physical linkage. The authors ran 10 Monte Carlo replicates of the simulation for 1,250 generations, using the first 250 generations as a burn-in, with no selection imposed, to establish a spatial genetic pattern.
In the simulation, selection changed linearly along an environmental gradient, with AA and aa genotypes favored at different ends of the gradient (North and South, Fig. S1). The selection strength of s=0.10 at extreme ends of the gradient was mediated through density-independent mortality determined by an individual’s genotype at the selected locus, where AA experienced 0% mortality at the North extreme and s mortality at the South extreme, while aa genotypes experienced the opposite selection gradient. Aa genotypes experienced uniform selection of s/2 across the gradient.
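The selection scheme described above can be written as a small function. This is a sketch for illustration only, not CDPOP code; the function name and the 0-to-1 position scale (0 at the North extreme, 1 at the South extreme) are assumptions.

```python
def mortality(genotype, position, s=0.10):
    """Density-independent mortality at the selected locus.

    position: 0.0 at the North extreme of the gradient, 1.0 at the South
    extreme (an assumed scale). AA and aa experience opposite linear
    gradients; Aa experiences s/2 everywhere.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must lie in [0, 1]")
    if genotype == "AA":
        return s * position
    if genotype == "aa":
        return s * (1.0 - position)
    if genotype == "Aa":
        return s / 2.0
    raise ValueError("genotype must be 'AA', 'Aa', or 'aa'")

# Mortality of each genotype at the two extremes and the midpoint.
profile = {g: [mortality(g, p) for p in (0.0, 0.5, 1.0)] for g in ("AA", "Aa", "aa")}
```

At the midpoint of the gradient all three genotypes have the same mortality (s/2), and the homozygote with the lowest mortality switches between the two ends, which is the rank change that defines antagonistic pleiotropy.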
Mating pairs of hermaphroditic individuals and dispersal locations of offspring were chosen using a random draw from the inverse-square probability function of distance, truncated at a distance equal to d, the proportion of a landscape edge. We tested three values of d: 0.03, 0.1, and 0.25 (i.e. truncated at 31, 102, and 256 pixels, respectively). These three values resulted in strong, moderate, and weak local adaptation, respectively, with the Pearson’s correlation between selected locus and selective gradient equal to 0.28, 0.24, and 0.11, respectively. Individuals near landscape edges were unable to disperse or mate with individuals beyond the edge such that boundaries were not periodic.
The number of offspring produced from mating (fitness) was determined from a Poisson distribution (λ = 4), which produced an excess of individuals each generation, maintaining a constant population size of 5,000 individuals at every generation. Carrying capacity of the landscape surface was 5,000 individuals, and excess individuals were discarded once all 5,000 locations became occupied.
We sampled 250 individuals randomly from the 5,000 available. We then located four common gardens at equal intervals along the gradient, encompassing the extremes of the selection surface (Fig. S1). For the moderate dispersal and local adaptation scenario, we tested the effect of common gardens that sample only half the environmental gradient (Fig. S1). For the gardens, we subsampled 100 individuals from the full 250, and then averaged fitness for 25 clones of each individual (each with the identical adaptive genotype of their parent clone) in each common garden using the above parameters for selection and fitness. After imputing fitness for the 250 individuals in their home environments, we had a total of 650 observations of fitness × location (250 imputed observations from individuals sampled across the landscape + 400 real observations arising from 100 genotypes, each averaged over 25 clones, in each of four common gardens). For both the simulations and the Arabidopsis case study, we focus on the 1% of SNPs with the lowest p-values and their role in local adaptation. In simulations with 100 SNPs, this was equivalent to the lowest p-value SNP. To determine false positive rates in simulations, we calculated the proportion of simulations for a given scenario where a neutral (as opposed to a causal) SNP had the lowest p-value for βG×E.
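The observation count and the false positive rate defined above can be sketched as follows. The function name, the index of the causal SNP, and the toy p-values are hypothetical; only the counts (250, 100, four gardens) come from the text.

```python
def false_positive_rate(pvalues_by_sim, causal_index=0):
    """Share of simulations in which a neutral SNP has the lowest p-value.

    pvalues_by_sim: one list of per-SNP p-values per simulation replicate.
    causal_index: position of the selected locus (assumed known, as in simulation).
    """
    n_false = 0
    for pvals in pvalues_by_sim:
        best = min(range(len(pvals)), key=pvals.__getitem__)
        if best != causal_index:
            n_false += 1
    return n_false / len(pvalues_by_sim)

# Observation count in each simulated data set.
n_imputed = 250            # sampled individuals, fitness imputed at home
n_observed = 100 * 4       # 100 genotypes in each of four gardens
n_total = n_imputed + n_observed

# Toy p-values for four replicates with three SNPs each (hypothetical).
toy = [
    [0.001, 0.50, 0.20],
    [0.300, 0.01, 0.20],
    [0.002, 0.40, 0.90],
    [0.010, 0.70, 0.03],
]
fpr = false_positive_rate(toy)
```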
Model extensions
Two extensions to our approach could increase its generality by treating unobserved fitness as a parameter rather than using imputation. First, an alternative would be to treat unobserved fitness of a genotype in its home environment as a free parameter. To constrain estimates of unobserved fitness one could use informative priors, such that the prior probability of relative fitness at home for each genotype would be monotonically increasing, i.e. local adaptation is the most likely state, but minor maladaptation is common. Inferences about unobserved fitness could be further constrained using hierarchical models, such that home fitness parameters for multiple genotypes arise from a common distribution (Gelman & Hill 2007). Relaxing the assumption of perfect local adaptation would also generate less biased, if less precise, parameter estimates for βG×E, which are currently conservative because imputation of local adaptation for maladapted genotypes will push βG×E toward zero and weaken estimates for selective gradients.
Second, when fitness is not measured, components of fitness (e.g. survival) or traits thought to be locally adaptive (e.g. physiological or behavioral) can be measured and used to infer the genomic basis of local adaptation. For example, instead of modeling SNP×environment associations with fitness, one could model SNP×environment associations with components of fitness or adaptive traits measured in common gardens, and estimate unobserved traits for sequenced genotypes using informative priors. To be clear, in our case study of Arabidopsis, we had near but not complete lifetime fitness data (missing germination stage). Here we do not fit these model extensions to data, given the current computational challenge of fitting many more parameters in a Bayesian framework.
Case study: local adaptation to climate in Arabidopsis thaliana
We applied these approaches to published data from studies of Arabidopsis thaliana in its native Eurasian range. Fournier-Level et al. (2011a) conducted replicated common gardens at four sites across Europe: Spain, England, Germany, and Finland (Figure 2). With these data, Fournier-Level et al. (2011a; b) and Wilczek et al. (2014) showed evidence that genotypes are locally adapted to their home temperature and moisture regimes and that alleles associated with high fitness in a given garden tended to be found nearer to that garden than alternate alleles, suggesting these loci were involved in local adaptation. At each site the authors transplanted 157 accessions (59 in the case of Finland) on a date in the fall matched to germination of local winter-annual natural populations (Fournier-Level et al. 2011a). The authors calculated survival (out of individuals surviving transplant) and average fecundity (where individuals that died before reproducing had fecundity zero), giving an estimate of absolute fitness (excluding the seed to seedling transition) (Fournier-Level et al. 2011a).
These accessions were part of a panel of 1,307 accessions from around the globe that were genotyped at ~250k SNPs using a custom Affymetrix SNP tiling array (AtSNPtile1), with 214,051 SNPs remaining after quality control (Figure 2) (Horton et al. 2012). This array was generated by resequencing 19 diverse ecotypes from across the range of Arabidopsis (Kim et al. 2007). Of the 1,307 genotyped accessions, we used 1,001 accessions that were georeferenced and likely not contaminant lines (Anastasio et al. 2011), in addition to being from the native range in Eurasia (Hoffmann 2002; Lasky et al. 2012b), and excluding potentially inaccurate high altitude outliers (i.e. > 2000 m). After imputing fitness for accessions in their home environments we had a total of 1,531 observations of fitness × location (1,001 imputed observations + 530 real observations). We removed from association tests SNPs having minor allele frequency (MAF) below 0.01, though we also considered a more conservative threshold of MAF = 0.1.
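As a quick consistency check on these counts, the arithmetic can be written out as below. The sketch assumes one fitness value per transplanted accession per garden, which is an assumption that happens to reproduce the stated totals.

```python
# Accessions transplanted at each common garden site (from the text).
accessions_per_garden = {"Spain": 157, "England": 157, "Germany": 157, "Finland": 59}

# Assumption: each accession contributes one fitness value per garden.
n_observed = sum(accessions_per_garden.values())
n_imputed = 1001  # georeferenced native-range accessions with imputed home fitness
n_total = n_observed + n_imputed
```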
We used climate data compiled previously (Lasky et al. 2012b) from published global climate datasets (Hijmans et al. 2005; Zomer et al. 2008). Here we focus on four climate variables that differ among common gardens, are not strongly correlated, and may be involved in local adaptation: minimum temperature of the coldest month, average monthly minimum temperature in the growing season, coefficient of variation of monthly growing season precipitation, and aridity index.
Genome-wide G×E associations
We separately tested for each SNP’s interaction with each of the four environmental variables. For each of the four approaches using common garden data we fit a model for each combination of SNP and environmental variable. To characterize the types of patterns identified by our approach, we studied variation in the SNPs in the 0.01 lower tail of p-values for the hypothesis test that βG×E = 0 (the coefficient for SNP×environment effects on fitness) for each climate gradient. We considered whether these SNPs showed patterns consistent with home genotype advantage via changes in the allele with greatest relative fitness along the environmental gradient (i.e. local adaptation via antagonistic pleiotropy) versus a pattern where βG×E merely involved changes in fitness difference between alleles (such as conditional neutrality or variance changing G×E), the latter of which cannot stably maintain local adaptation. For these SNPs, we calculated whether the direction of allelic differentiation along environmental gradients was consistent with the sign of βG×E. For example, if one allele was more common in accessions from warmer locations, we assessed whether that same allele showed an increase in relative fitness in warmer common gardens.
Next, we assessed whether our model predicted that different alleles were most fit in the two common gardens at either extreme of a climate gradient, i.e. whether the SNP was associated with rank changes in fitness that are consistent with antagonistic pleiotropy. For example, if one allele was estimated to be most fit in the coldest common garden, we determined whether a different allele was estimated to be most fit in the warmest common garden. Furthermore, we also quantified similarity (rank correlation in SNP scores and proportion of SNPs common to the strongest 0.01 tail of associations) between results from our G×E approach versus those from other recent studies of association with home climate in Arabidopsis (Hancock et al. 2011; Lasky et al. 2012b, 2014).
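The two checks described above can be sketched as follows. This is a minimal illustration assuming alleles coded 0/1, so that the fitted fitness difference between alleles at environment E is βG + βG×E·E under eqn 1; function names and toy values are hypothetical.

```python
def allelic_effect(beta_g, beta_gxe, env):
    """Fitted fitness difference (allele 1 minus allele 0) at environment env."""
    return beta_g + beta_gxe * env

def has_rank_change(beta_g, beta_gxe, env_min, env_max):
    """True if a different allele is fitter at each extreme common garden."""
    low = allelic_effect(beta_g, beta_gxe, env_min)
    high = allelic_effect(beta_g, beta_gxe, env_max)
    return low * high < 0

def home_allele_consistent(alleles, home_env, beta_gxe):
    """True if allele 1 is more common where the G x E term favors it.

    Uses the sign of the covariance between allele code and home environment.
    """
    n = len(alleles)
    mean_a = sum(alleles) / n
    mean_e = sum(home_env) / n
    cov = sum((a - mean_a) * (e - mean_e) for a, e in zip(alleles, home_env))
    return cov * beta_gxe > 0

# Toy SNP: allele 1 is found in warmer homes and gains fitness with warming.
alleles = [0, 0, 1, 1]
home_env = [1.0, 2.0, 8.0, 9.0]
tradeoff = has_rank_change(beta_g=-0.2, beta_gxe=0.1, env_min=0.0, env_max=5.0)
magnitude_only = has_rank_change(beta_g=0.1, beta_gxe=0.1, env_min=0.0, env_max=5.0)
consistent = home_allele_consistent(alleles, home_env, beta_gxe=0.1)
```

In the second toy case the interaction changes only the size of the fitness difference between alleles, not its sign, which corresponds to the pattern that cannot stably maintain local adaptation.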
Enrichment of strong SNP×environment associations across the genome
We studied whether loci we identified as likely being involved in local adaptation exhibited supportive patterns in ancillary datasets. First, to assess whether our association approach is capable of identifying the signal of local adaptation rather than spurious background associations, we tested for enrichment of SNPs in genic versus intergenic regions. These tests are based on the hypothesis that loci involved in adaptation are on average more likely to be found near genes and linked to genic variation, in comparison with loci evolving neutrally (Hancock et al. 2011; Lasky et al. 2012b). For a test statistic, we calculated the portion of SNPs in the 0.01 lower p-value tail that were genic versus intergenic.
Second, we hypothesized that locally-adaptive alleles may have been subject to partial (local) selective sweeps, especially given that much of Arabidopsis’ Eurasian range was recently colonized following the last glacial maximum. We tested for an enrichment of significant (alpha = 0.05) pairwise haplotype sharing (PHS, (Toomajian et al. 2006)) in the SNPs (using PHS calculated by Horton et al. 2012) showing the greatest evidence of G×E for fitness. We also tested evidence that these SNPs are enriched for significant (alpha = 0.05) integrated extended haplotype homozygosity (standardized, iHS (Voight et al. 2006)), an additional metric of partial sweeps. We used ancestral SNP allele determinations from (Horton et al. 2012) (based on alignment with the Arabidopsis lyrata genome) and the R package ‘rehh’ to calculate iHS (Gautier et al. 2012).
Third, we also studied whether loci we identified were associated with plasticity in flowering time, a trait that plays a major role in local adaptation to climate in plants (Hall & Willis 2006; Franks et al. 2007; Keller et al. 2012; Lowry et al. 2014). Recently (Li et al. 2014) tested the flowering time response of 417 natural accessions to simulated warming (up to ~4°C), and then identified SNP associations with changes in flowering time across treatments, G×E for flowering time. We tested whether SNPs we identified as having SNP×environment interactions for fitness (0.01 lower p-value tail) were enriched in nominally significant associations (alpha = 0.05) with G×E for flowering time.
To generate a null expectation for each enrichment while maintaining a signal of linkage disequilibrium in the null model, we circularly permuted SNP categories (e.g. as genic versus intergenic, having significant iHS or not) along the genome and recalculated the test statistics 10,000 times (Hancock et al. 2011; Lasky et al. 2012b).
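A circular permutation test of this kind can be sketched as below. This is an illustration under stated assumptions: SNPs are treated as one ordered sequence, the test statistic is the proportion of outlier SNPs that fall in the category, and the one-sided p-value with a +1 correction is a design choice of this sketch rather than something stated in the text.

```python
import random

def circular_permutation_test(is_outlier, in_category, n_perm=10000, seed=0):
    """Enrichment of a SNP category among outlier SNPs, with a circular-shift null.

    is_outlier, in_category: equal-length boolean lists ordered along the genome.
    Shifting category labels as a block preserves their spatial clustering
    (and hence a signal of linkage disequilibrium) in the null.
    """
    rng = random.Random(seed)
    n = len(is_outlier)
    n_out = sum(is_outlier)

    def statistic(shift):
        hits = sum(1 for i in range(n) if is_outlier[i] and in_category[(i + shift) % n])
        return hits / n_out

    observed = statistic(0)
    null = [statistic(rng.randrange(1, n)) for _ in range(n_perm)]
    p_value = (1 + sum(s >= observed for s in null)) / (n_perm + 1)
    return observed, p_value

# Toy genome of 100 SNPs where outliers coincide exactly with the category.
toy_outlier = [i < 5 for i in range(100)]
toy_category = [i < 5 for i in range(100)]
obs, p = circular_permutation_test(toy_outlier, toy_category, n_perm=1000)
```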
Results
We compared four approaches to genome-wide G×E association study and one approach for genotype-environment association. Approach 1 used only observed (excluding imputed) fitness data, but no correction for population structure. Approach 2 used observed and imputed fitness data, and no correction for population structure. Approach 3 fitted the full mixed-effects model, but only including observed fitness data from common gardens, excluding imputed fitness. Approach 4 fitted the full mixed-effects model while including both observed and imputed fitness data. Approach 5 used a mixed model of genotype associations with environment (no common garden data).
Simulations
Across dispersal scenarios, we found that mixed models decreased false positive rates and increased accuracy of inference as to the SNPs driving G×E for fitness (Figure 3). When dispersal was highest and local adaptation weakest, all approaches exhibited an increase in false positives compared to the moderate dispersal scenario. Among the approaches using common garden data (Approaches 1-4), the mixed models generally had low false positive rates and thus high true positive rates. Based on the low false positive rates and low p-values for causal SNPs in Approach 3, common garden data were a clear source of statistical power to identify causal SNPs (Figure 3). Including imputed data (Approach 4) further reduced false positive rates and resulted in no false positives under the two lower dispersal scenarios. Genotype-environment associations that did not use common garden data (Approach 5) had similarly low false positive rates. However, common garden data combined with imputations (Approach 4) yielded stronger inference for SNPs driving G×E; causal SNPs had lower p-values (Figure 3, median causal SNP p for low, med., high dispersal scenarios: 3.2×10⁻⁶, 4.5×10⁻¹³, 7.4×10⁻⁷) compared to ignoring common garden data (Approach 5, median causal SNP p for low, med., high dispersal scenarios: 2.3×10⁻⁷, 1.9×10⁻¹⁰, 1.2×10⁻⁵). Under a scenario of medium dispersal and common gardens that only covered half the gradient, false positive rates were elevated for approaches excluding fixed effects (Approaches 1-2) or excluding imputations (Approach 3) but not when imputations and random effects were included (Approach 4, Figure S4).
Case study on Arabidopsis
We found that simple linear model tests (Approaches 1-2) of SNP×environment interactions were highly enriched in very low p-values (Figure S5) relative to the theoretical expectation. After incorporating the kinship×environment random effects (but excluding imputed fitness observations, Approach 3), we found that SNP×environment associations with fitness were closer to the theoretical expectation but still highly enriched in low p-values for three climate variables. After incorporating imputed fitness observations into the mixed model (Approach 4, right column, Figure S5), we found p-value distributions hewed closer to the theoretical expectation and were slightly conservative (under-enriched in low p-values) for two climate variables. These approaches tended to identify different SNPs as having the strongest SNP×environment associations with fitness (Table S5).
Based on the results of our simulations and the p-value distributions noted above, we focus the remainder of analyses on results from mixed models with imputed fitness included (Approach 4). We found that climate variables differed in the importance of kinship-climate interaction associations with fitness (proportion of variance in fitness explained by random effects v), suggesting that population structure in Arabidopsis is more strongly correlated with some climatic axes of local adaptation (G×E for fitness) compared to other climate gradients. For growing season minimum temperatures, kinship×environment interactions explained most of the variation in fitness (R² = 0.78, Table 1, row 3). By contrast, kinship×environment interactions for fitness were weaker along a gradient in winter minimum temperature (R² = 0.07).
Approach 4 tended to identify SNPs where SNP×environment interactions favored alleles in climates where they were relatively more common; that is, the sign of allelic differences in home climates was mirrored by the sign of fitted mixed model SNP×environment associations with relative fitness (Table 1, row 1, and see outlier examples in Figure 4). In addition to characterizing SNP×environment associations, we tended to identify SNPs where we estimated a rank change in relative fitness for alternate alleles along the environmental gradient between the two extreme common gardens (where the fitted model expectation was that the allele with higher fitness at one extreme common garden differed from the allele with higher fitness at the other extreme, Table 1, row 2). It appeared that the proportion of SNPs expected to show rank changes in relative fitness among the common gardens was related to how much of each climate variable’s range was covered by gardens (Table 1, row 4). Thus the common gardens may have been limited in their ability to capture rank changing of alleles at some loci involved in local adaptation to aridity and growing season cold.
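The rank-change criterion can be illustrated with a linear reaction norm. The sketch below assumes the fitted difference in relative fitness between alternate alleles is `beta_snp + beta_interaction * env`; the coefficient names and toy values are hypothetical and the parameterization is a simplification of the mixed model.

```python
def allelic_fitness_difference(beta_snp, beta_interaction, env):
    """Fitted difference in relative fitness between alleles at environment env."""
    return beta_snp + beta_interaction * env


def rank_change_between_gardens(beta_snp, beta_interaction, garden_envs):
    """True if the favored allele differs between the two extreme gardens."""
    d_low = allelic_fitness_difference(beta_snp, beta_interaction, min(garden_envs))
    d_high = allelic_fitness_difference(beta_snp, beta_interaction, max(garden_envs))
    return d_low * d_high < 0


def crossover_environment(beta_snp, beta_interaction):
    """Environment value at which the two alleles have equal fitted fitness."""
    if beta_interaction == 0:
        return None  # parallel reaction norms never cross
    return -beta_snp / beta_interaction


# Toy example: the reaction norms cross at env = 2.0.
wide = rank_change_between_gardens(-1.0, 0.5, garden_envs=[0.0, 1.0, 4.0])
narrow = rank_change_between_gardens(-1.0, 0.5, garden_envs=[0.0, 1.0])
cross = crossover_environment(-1.0, 0.5)
```

In the toy example the same locus shows a rank change when gardens span the crossover point (`wide`) but not when they cover only part of the gradient (`narrow`), which mirrors the dependence on garden coverage noted above.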
We found non-random, but very weak overlap between the SNPs we identified and those outliers in previous analyses (Hancock et al. 2011; Lasky et al. 2012b, 2014). When considering mixed model (Lasky et al. 2014) or partial Mantel (Hancock et al. 2011) SNP associations with the same climate variables (genome-environment associations with no common garden data), we found significant overlap among the previously identified SNPs in the 0.01 lower tail of p-values versus those in the 0.01 tail identified here (permutation test, all p<0.05, Table S6). However, rank correlations among SNP scores from previous approaches versus our current approach were very weak (all rho < 0.2, Table S6).
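A simple version of a tail-overlap permutation test is sketched below. It is an assumption-laden illustration, not our exact procedure: SNP labels are shuffled freely, which ignores linkage disequilibrium among neighboring SNPs, and the function names, number of permutations, and seed are hypothetical.

```python
import random


def tail_set(p_values, quantile=0.01):
    """Indices of SNPs in the lower tail of p-values (at least one SNP)."""
    n_tail = max(1, int(len(p_values) * quantile))
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    return set(order[:n_tail])


def overlap_permutation_test(p_a, p_b, quantile=0.01, n_perm=1000, seed=0):
    """Observed tail overlap and a one-sided permutation p-value.

    Assumption: shuffling p_b across SNPs is an adequate null, i.e. SNPs are
    treated as exchangeable and linkage disequilibrium is ignored.
    """
    rng = random.Random(seed)
    tail_a = tail_set(p_a, quantile)
    observed = len(tail_a & tail_set(p_b, quantile))
    shuffled = list(p_b)
    n_at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if len(tail_a & tail_set(shuffled, quantile)) >= observed:
            n_at_least += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return observed, (n_at_least + 1) / (n_perm + 1)


# Toy example: two identical score vectors share their entire 10% tail.
scores = [i / 100 for i in range(1, 101)]
observed, p_perm = overlap_permutation_test(scores, scores, quantile=0.1, n_perm=200)
```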
SNP×environment associations with fitness are enriched in regions suggestive of local adaptation
To assess whether our approach identified certain types of SNPs, we tested for enrichment of genic and intergenic regions for SNP×environment effects on fitness (again focusing on Approach 4: mixed model including imputations). We found that SNP×environment interactions for fitness were significantly enriched in genic regions (Table 2; for reference, SNPs identified via Approach 2, including imputation but without random effects, were not significantly enriched in genic regions). Additionally, we found that SNP×environment interactions for fitness were enriched for high pairwise haplotype sharing (PHS) and high integrated extended haplotype homozygosity (iHS, Table 2). Finally, SNPs associated with G×E for flowering time response to growing temperature (Li et al. 2014) tended to also have strong SNP×growing season minimum temperature interactions for fitness (p < 0.0002) but not for other climate variables (Table 2). Enrichments reported above did not change qualitatively (with respect to statistical significance) when we only considered SNPs with MAF > 0.1.
SNP×environment associations with fitness identify genes potentially involved in local adaptation
Our approach identified a number of strong candidates for local adaptation at the top of lists of SNPs with the strongest SNP×environment associations with relative fitness (Tables S1-S4). For example, the top SNP associated with aridity interaction effects on fitness (chr. 4, position 11005059) fell within LESION SIMULATING DISEASE 1 (LSD1), which affects a number of traits in Arabidopsis, including survival and fecundity under drought (Wituszyńska et al. 2013; Szechyńska-Hebda et al. 2016) (Figure 4), while the third SNP (chr. 2, position 7592008) fell within ATMLO8, MILDEW RESISTANCE LOCUS O 8, homologous with barley MLO, which controls resistance to the fungal pathogen powdery mildew (Büschges et al. 1997). The top SNP associated with winter cold interaction effects on fitness (chr. 5, position 7496047) fell within the coding region of WRKY38, involved in the salicylic acid pathway and pathogen defense (Kim et al. 2008), and was the same locus identified as most strongly associated with multivariate climate in Lasky et al. (2012) (Figure 4). The top SNP associated with variability in growing season precipitation interaction effects on fitness (chr. 2, position 18504858) fell 380 bp from ABA HYPERSENSITIVE GERMINATION 11, AHG11, which mediates the effect of abscisic acid (ABA), a major hormone of abiotic stress response, on germination (Murayama et al. 2012). The fifth highest SNP (and second highest locus) associated with growing season cold interaction effects on fitness (chr. 3, position 8454439) fell within ABERRANT LATERAL ROOT FORMATION 5, ALF5, a gene that confers resistance to toxins (Diener et al. 2001) and belongs to the MATE gene family, whose members play a variety of roles in responding to environment (Shoji 2014).
Discussion
Genetic variation in environmental responses (G×E) is ubiquitous but its genetic and physiological basis and role in local adaptation are poorly understood. Replicated common garden experiments and genome scans for loci exhibiting evidence for local adaptation have been important in understanding the genetic basis of G×E and local adaptation (Hancock et al. 2008; Eckert et al. 2010; Turner et al. 2010; Fournier-Level et al. 2011a; Lasky et al. 2012b; Ågren et al. 2013; Evans et al. 2014; Lasky et al. 2015). However, the complementary information in common gardens and geographic variation in allele frequency has not been coherently synthesized. Previous association studies of G×E have modeled discrete, categorical environmental effects (Murcray et al. 2009; Thomas 2010; Korte et al. 2012; Marigorta & Gibson 2014). The modeling of G×E across discrete, categorical environments is typically conducted, in part, for mathematical convenience, as such a treatment allows the use of models designed for multiple phenotypes, where the same phenotype in different environments is considered as multiple phenotypes (Falconer 1952).
We demonstrated an approach to association study of G×E for fitness and an imputation technique that allowed us to coherently synthesize evidence from common gardens and genome-environment associations. Our imputation method relied on making explicit the implicit assumption of local adaptation that underlies genome-environment association studies (Coop et al. 2010; Hancock et al. 2011; Lasky et al. 2012b). Using simulation, we demonstrated that this imputation can increase power to identify SNPs causing G×E for fitness and local adaptation. Our approach also identified strong candidate genes in Arabidopsis associated with SNPs that exhibit fitness tradeoffs along climate gradients such that locally common alleles had greater relative fitness.
The relative information on selective and adaptive genetic mechanisms contained in the two datasets (common garden, geographical genomic) for a given system will be determined by several factors. First, the power of common gardens depends on the range of sampled covariates (genotype and environment). We found evidence with both our simulations and empirical case study that greater coverage of environmental gradients can increase power to detect loci under selection by selective gradients. Similarly, power may be enhanced by including in common gardens a range of variation at locally adaptive loci using diverse germplasm from across gradients. However, confounding between population structure and adaptive loci and alternate mechanisms of local adaptation across regions suggest that regional stratification in scans for local adaptation may be more powerful (e.g. Horton et al. 2016). Additionally, the power of common gardens is influenced by the match between conditions in gardens and long-term natural selective gradients that give rise to local adaptation (Weigel & Nordborg 2015), and the heritability of adaptive traits and fitness. The information contained in genome-environment associations (and hence imputed fitness data here) is influenced by the strength of local adaptation in sampled populations (Figure 3), which itself is determined by steepness of selective gradients, the level of gene flow, and time populations have had to evolve toward equilibrium allele frequencies (Yeaman & Whitlock 2011; Lotterhos & Whitlock 2014; Forester et al. 2016). It is important to recognize that our simulations covered a limited range of the parameter space relevant in nature (genetic architecture of local adaptation, dimensionality of environmental selective gradients, etc.). Here, populations were given time to reach equilibrium (Forester et al. 2016), which likely enhanced the power of genotype-environment associations compared to scenarios common in nature where populations may still be responding to long-term environmental changes such as glacial cycles. Apart from information on genetic mechanisms of G×E for fitness, common gardens afford a more direct opportunity to study phenotypes under selection, as opposed to genotype–environment associations where information on phenotype is limited to gene annotations.
Above we described a method of imputation based on the assumption of local adaptation, i.e. home genotypes had greater fitness than away genotypes. However, local adaptation in nature is typically imperfect, such that the optimal genotype for a given location might not be the home genotype (Leimu & Fischer 2008; Hereford 2009). Local adaptation may not occur due to immigration of maladaptive alleles (Slatkin 1973), limited genetic variation (Barton 2001), temporal environmental shifts, and other processes (Bridle & Vines 2007). Thus our imputation can be considered a heuristic to be improved by further development.
Genotype-by-environment interactions in genome-wide association studies
Recent advances in association models have included explicit modeling of G×E (Murcray et al. 2009; Thomas 2010; Korte et al. 2012; Marigorta & Gibson 2014; Li et al. 2014; Kooperberg et al. 2016; Windle 2016), but to our knowledge there are no published genome-wide association studies accounting for SNP interactions with continuous environmental gradients (a reaction norm approach, cf. Jarquín et al. 2014; Tiezzi et al. 2017). By employing a reaction norm approach to G×E (as we did here), models can be applied to prediction at new sites, which is not possible using correlated trait approaches to G×E (Falconer 1952; Korte et al. 2012) where sites are treated as idiosyncratic. Some of the aforementioned categorical treatments of SNP×environment interactions were used in association studies for human disease. However, many of the environmental variables that may mediate genetic risk of disease are continuous in nature, such as exposure to ultraviolet radiation and tobacco smoke. Future research on local adaptation and human disease may benefit from exchange of approaches given the shared importance across disciplines of understanding the genomic basis of G×E.
Case study on Arabidopsis thaliana
Our approach identified many SNPs where allelic variation was associated with rank-changing relative fitness tradeoffs along climate gradients (e.g. all 214 of the SNPs with strongest interaction, i.e. 0.001 quantile, with winter minimum temperature association for fitness), loci where selective gradients may stably maintain population differentiation (Anderson et al. 2011b; Ågren et al. 2013). Studies of local adaptation genomics often find limited evidence for loci with antagonistic pleiotropy. A previous study of the common garden data used here (Fournier-Level et al. 2011a) found that the SNPs with the strongest association with fitness in one common garden were rarely among those with the strongest associations in another garden, which the authors interpreted as evidence for conditional neutrality. However, the fact that a locus is not among the strongest associated with fitness at an individual site does not indicate the locus is neutral at that site; it may simply be under relatively weaker selection (see Figure S6 for example illustration). By contrast with previous approaches that model phenotypes at a single site, our model was explicitly focused on detecting alleles with the strongest evidence for SNP×climate interactions favoring home alleles. Thus our explicit model of G×E is more likely to detect loci with patterns indicative of antagonistic pleiotropy compared with approaches that model fitness in a single common garden at a time and therefore do not model G×E.
Local adaptation may often involve complex traits governed by many loci. Loci exhibiting antagonistic pleiotropy and loci exhibiting G×E but no tradeoffs (variance changing or conditional neutrality) may both underlie genome-level local adaptation. Note that our study, like that of Fournier-Level et al. (2011a), is based on association mapping, which may suffer from identification of more false positives compared with linkage mapping approaches (Hall et al. 2010; Anderson et al. 2013; Ågren et al. 2013). Follow-up experimental study of phenotypic effects of variation at individual loci is required to confirm the results of association mapping (Verslues et al. 2014; Broekgaarden et al. 2015).
We found evidence that SNP×climate interaction effects on fitness were enriched in genic regions, suggesting that our model captured a signal of local adaptation rather than population structure. We found that enrichments in genic SNPs only emerged after using a mixed model to control for the putative effects of population structure (genomewide similarity), suggesting that the genic-enriched patterns of divergence we modeled were not simply associated with overall patterns of among-population divergence. This enrichment is consistent with other findings in Arabidopsis (Hancock et al. 2011; Lasky et al. 2012b) and other species ((Coop et al. 2009; Fumagalli et al. 2011; Lasky et al. 2015), but see (Pyhäjärvi et al. 2013)). We do not interpret this enrichment as indicating that changes in amino acid sequences are more important than regulatory evolution in local adaptation, but rather as supporting the hypothesis that local adaptation is more likely to involve sequence evolution near genes as opposed to at locations farther from genes, where many intergenic SNPs are found.
We found evidence that loci we identified as candidates were enriched in evidence for partial selective sweeps (PHS and iHS statistics), suggesting that recent sweeps in particular environments are an important mode of local adaptation (Voight et al. 2006; Toomajian et al. 2006). These local sweeps may be expected based on the range dynamics of Arabidopsis, which has colonized much of its Eurasian range following the retreat of glaciers (Sharbel et al. 2000), a process that likely involved recent local adaptation. It is important to note that extended haplotype patterns suggestive of partial sweeps may occur at the shoulders (away from causal loci) of complete sweeps (Schrider et al. 2015), thus caution is warranted in attributing our observed PHS and iHS enrichment to localized sweeps versus global sweeps at nearby loci.
We found significant overlap between SNPs associated with G×E for fitness along growing season cold gradients and SNPs associated with G×E for flowering time across growing season temperature treatments (Li et al. 2014). Our findings suggest that evolution of plasticity in flowering time is a mechanism of local adaptation along growing season temperature gradients and that our model has captured the signal of this adaptation. For organisms inhabiting seasonal environments, timing of the life cycle can have large impacts on fitness. Previous common garden experiments have provided strong evidence that flowering time is a central trait involved in local adaptation (Hall & Willis 2006; Franks et al. 2007; Keller et al. 2012; Lowry et al. 2014) with molecular study further supporting the role of flowering time (Stinchcombe et al. 2004; Caicedo et al. 2004; Shindo et al. 2005; Lovell et al. 2013) and the role of plasticity (Fraser 2013; Lasky et al. 2014) in local adaptation.
Though there was overlap with signal identified by previous approaches using the same data (Hancock et al. 2011; Lasky et al. 2012b, 2014), overlap was generally weak, indicating our approach identified distinct loci as causing local adaptation. In our case study on Arabidopsis, the SNPs that exhibited the strongest evidence for SNP×climate interaction effects on fitness often fell within the coding regions of strong candidate genes based on known roles in environmental responses, suggesting our approach is useful for identifying loci underlying local adaptation.
Conclusions
Local adaptation to environment involves genotype-by-environment interactions for fitness. Genome-wide association studies are a promising approach for identifying the genomic basis of local adaptation and G×E. Additional approaches like genome-wide expression profiling may also be useful for uncovering the genomic basis of local adaptation (Des Marais et al. 2013). Future approaches that use a principled basis for quantitative synthesis of patterns in multiple data types (Levy Karin et al. 2017) may enhance our ability to characterize adaptation in an integrative fashion.
Data Accessibility
All data were previously published, including fitness (Data Dryad package; Fournier-Level et al. 2011b), environment (Data Dryad package; Lasky et al. 2012a), and SNPs (Horton et al. 2012; available at http://bergelson.uchicago.edu/wpcontent/uploads/2015/04/call_method_75.tar.gz). Simulation data previously published in Forester et al. (2016) are archived in a Data Dryad package (Forester BR et al. 2015). New simulation data, and code for simulations and GWAS, are included in an online supplemental file with this manuscript.
Author contributions
JRL designed research and led paper writing. All authors conducted analyses and contributed to writing.
Acknowledgements
We thank David Lowry, Thomas Juenger, and several anonymous reviewers for comments on earlier versions of this manuscript.