ABSTRACT
Urbanization significantly alters natural ecosystems, and its rate is only expected to increase globally as more humans move into urban centers. Urbanized landscapes are often highly fragmented. Isolated populations within these fragments may adapt in response to novel urban ecosystems, but few studies have found strong evidence of evolutionary responses in urban environments. We used multiple genome scan and genotype-environment association (GEA) approaches to examine signatures of selection in transcriptomes from urban white-footed mice (Peromyscus leucopus) in New York City. We scanned transcriptomes from 48 P. leucopus individuals from six environmentally heterogeneous locations (three urban and three rural) for evidence of rapid local adaptation in isolated urban habitats. We analyzed 154,770 SNPs and identified patterns of genetic differentiation between urban and rural sites and signatures of selection in a large subset of genes. Neutral demographic processes can create allele frequency patterns that are indistinguishable from positive selection. We accounted for this by simulating a neutral SNP dataset under the inferred demographic history for the sampled P. leucopus populations to serve as a null model when choosing outliers. We annotated the resulting outlier genes and further validated them by associating allele frequency differences with environmental measures of urbanization, percent impervious surface and human population density. The majority of candidate genes were involved in metabolic functions, especially dietary specialization. A subset of these genes have well-established roles in metabolizing lipids and carbohydrates, including transport of cholesterol and desaturation of fatty acids. Our results reveal clear genetic differentiation between rural and urban sites that likely resulted from rapid local adaptation in urbanizing habitats. The specific candidate loci that we identified suggest that populations of P. leucopus are using novel food resources in urban habitats or locally adapting through changes in their metabolism. Our data support the idea that cities represent novel ecosystems with a unique set of selective pressures.
Introduction
Traits are adaptive when they increase an organism’s fitness in a specific environment (Barrett & Hoekstra 2011). The identification of specific genotypes underlying adaptive traits is a major goal in evolutionary biology. Many studies have identified the genetic basis underlying adaptation, but they often focus on a small number of well-known, conspicuous traits (Nachman et al. 2003; Pool & Aquadro 2007; Linnen et al. 2009; Storz et al. 2009). In the current era of high-throughput DNA sequencing, where costs continue to drop by orders of magnitude (De Wit et al. 2015), it is now feasible to generate genomic datasets for natural populations of non-model organisms. Researchers can use a reverse-ecology approach where candidate genes behind ecologically relevant, but non-conspicuous, phenotypes are identified based on patterns of variation and signatures of selection in protein-coding sequences (Li et al. 2008). Here we examined local adaptation in isolated urban populations of white-footed mice, Peromyscus leucopus, in NYC. We scanned P. leucopus transcriptomes and identified regions and genes with divergent and skewed allele frequencies indicative of positive selection. We incorporated a neutral SNP dataset from an inferred demographic history directly into our null model. We then examined the statistical association between allele frequencies and environmental measures of urbanization.
Traditional approaches for identifying local adaptation involve reciprocal transplant or common garden experiments (Merila & Hendry 2014), but local adaptation also leaves a predictable pattern of genetic variation and differentiation along environmental gradients across the genome (Savolainen et al. 2013). Measuring changes in the site frequency spectrum (SFS), the distribution of allele frequencies across sites, from genomic data can be an efficient method of detecting past selection (Merila & Hendry 2014). Positive directional selection increases interspecific variation at selected loci compared to the genomic background (Beaumont 2005), decreases nucleotide diversity around the selected locus through genetic hitchhiking (Hermisson 2009), and skews the SFS towards excess low and high frequency variants (Nielsen 2005). Balancing selection leaves a generally opposite pattern with increased intraspecific genetic diversity (Nielsen 2005), low genetic differentiation between sites (Foll & Gaggiotti 2008), and an excess of intermediate frequency alleles (Nielsen 2005). Negative, or purifying, selection reduces genetic diversity and differentiation, and only low frequency variants increase in the SFS (Nielsen 2005).
Local adaptation has increasingly been shown to occur across multiple taxa (Stinchcombe & Hoekstra 2008; Bonin 2008; Linnen et al. 2009; Hohenlohe et al. 2010a; Turner et al. 2010; Ellison et al. 2011; De Wit & Palumbi 2013). Uncovering the genetic basis of local adaptation has provided insight into a variety of evolutionary processes including speciation, maintenance of genetic diversity, range expansion, and species response to changing environments (Savolainen et al. 2013; Tiffin & Ross-Ibarra 2014). Cities represent one of the fastest growing and most rapidly changing environments around the world. Urbanization leads to habitat loss and fragmentation, changes in resource availability, novel species interactions, altered community composition, and increased exposure to pollutants (McKinney 2002; Chace & Walsh 2004; Shochat et al. 2006; Sih et al. 2011). Each of these ecological consequences may exert strong selective pressure, and there is mounting evidence that rapid adaptation occurs in many urban organisms. Another cause of rapidly changing environments is global climate change, where increasing temperatures and altered precipitation patterns strongly influence the life history traits of many species (Franks & Hoffmann 2011). These two processes, urbanization and climate change, are not mutually exclusive, however. Understanding local adaptation in urban habitats may lead to general insights about local adaptation to future climate change threats, both of which represent cases of general rapid evolution in changing environments. What traits are most likely involved in local adaptation? How quickly do populations respond to selective pressures and adapt locally? What environmental variables have the largest impact on populations and drive local adaptation? Are the same genes and alleles involved in local adaptation also involved in similarly changing environments, i.e. is there evidence of convergent local adaptation?
White-footed mice are good candidates for local adaptation because they are widespread and are one of the few native mammals that thrive in extremely small, fragmented urban forests (Pergams & Lacy 2007; Rogic et al. 2013; Munshi-South & Nagy 2014). P. leucopus tend to be found at higher densities in urban patches due to a thick understory and fewer predators and competitors (Rytwinski & Fahrig 2007). Increased density may also be due to limited P. leucopus dispersal between urban sites. Munshi-South (2012) found barriers to dispersal between isolated NYC parks, with migrants only moving along significantly vegetated corridors throughout the city. There is also substantial genetic structure between NYC parks as measured by microsatellites (Munshi-South & Kharchenko 2010), genome-wide SNPs (Munshi-South et al. 2016) and demographic modeling (Harris et al. 2016). We have also previously found evidence of divergence and selection in urban populations of NYC white-footed mice (Harris et al. 2013), though we used much smaller datasets and less sophisticated approaches than presented here. Collectively, strong selective pressures from urbanization, lack of gene flow between NYC parks, genetic structure found between geographically close urban sites, and evidence of urbanization driving neutral allele frequency patterns in urban populations (Munshi-South et al. 2016) make it likely that populations of urban white-footed mice are adapting to strong selective pressures in spite of the influence of genetic drift.
Urbanization and global climate change are relatively recent disturbances that rapidly change native ecosystems. Over short timescales, standing genetic variation, as opposed to novel mutations in organisms, often underlies adaptation (Barrett & Schluter 2008; Stapley et al. 2010). As these pre-existing mutations spread to fixation they produce a detectable signal in the form of ‘hard’ or ‘soft’ selective sweeps (Hermisson & Pennings 2005; Messer & Petrov 2013). Additionally, ecologically important traits involved in local adaptation are often quantitative traits with many genes of small effect involved in producing the desired phenotype (Orr 2005; Rockman 2012). In order to distinguish these more subtle signatures of selection, we used multiple tests that provide greater statistical power and higher resolution at identifying types and age of selection when used together (Grossman et al. 2010; Hohenlohe et al. 2011).
We used transcriptomes sequenced from urban and rural populations of P. leucopus to produce estimates of nucleotide diversity π (Tajima 1983), Tajima’s D (Tajima 1989), and FST (Wright 1951) and made inferences about the evolutionary processes at work in these populations. Several studies have used this suite of population genetic statistics to detect candidate genes that are the target of selection (Stajich & Hahn 2005; Hohenlohe et al. 2010a; Tennessen et al. 2010; Nadeau et al. 2012). Major challenges in solely using π or Tajima’s D are distinguishing between types of selection, and then disentangling demographic processes from selection (Biswas & Akey 2006). The difficulty arises because neutral demographic processes, like population bottlenecks, produce signatures of variation in the genome similar to those produced by selection (Oleksyk et al. 2010; Li et al. 2012). For example, a population bottleneck followed by an expansion will create genomic regions with low genetic diversity that resembles signatures from selection. Alleles present in the few breeding individuals during the bottleneck will become widespread during the expansion (Pavlidis et al. 2010). There has been much discussion on how to deal with the confounding effects of demographic history on identifying selection (Excoffier et al. 2009; Li et al. 2012; Vitti et al. 2013; Lotterhos & Whitlock 2015). The prevailing approach is to produce genome-wide data and assume selection acts on one or a few loci while demographic processes act across the genome. Outlier tests for loci under selection generate a null distribution, usually based on an island model of population differentiation (Excoffier et al. 2009), and then identify candidate genes with genetic differentiation beyond the null model’s limits. The true demographic history of most organisms is much more complex, and computational approaches have been developed to robustly infer demographic parameters (Gutenkunst et al. 2009; Excoffier et al. 2013). The inferred demographic history can then be used to construct a more realistic null model, reducing the rate of false positives in outlier based tests of selection (Excoffier et al. 2009; Yoder et al. 2014).
We used the inferred demographic history of urban populations of P. leucopus (Harris et al. 2016) to simulate comparable SNP datasets to our observed sequence data. We then used two genome scan tests that identify outlier loci based on population differentiation and the SFS, respectively. Bayescan uses a Bayesian approach to identify SNPs that show extreme allele frequency divergence between populations (Foll & Gaggiotti 2008). SweeD is a likelihood based test that finds evidence of selective sweeps by looking for regions with a SFS that deviates from neutral expectations (Pavlidis et al. 2013). We also used an emerging approach for identifying loci underlying local adaptation by examining associations between allele frequencies and environmental variables. Several tests have been developed based on the relationship between genotypes and environmental variables, falling under the general category of genotype-environment association (GEA) tests (Joost et al. 2007; Coop et al. 2010; Frichot et al. 2013; Lotterhos & Whitlock 2015). GEA tests perform better than genome scan based outlier tests under complex demographic scenarios (Lotterhos & Whitlock 2015) but can suffer from a high rate of false positives. Analyses suggest that using genome scan-based outlier tests in conjunction with GEA tests leads to reliable outlier loci identification (De Villemereuil et al. 2014). GEA tests also identify local adaptation in polygenic phenotypes where each polymorphism has a relatively weak effect (Frichot et al. 2013), because correlations between alleles and environmental variables do not rely on the strength of genetic differentiation or SFS skew between populations.
In this study, we examined transcriptomes generated from RNAseq for 48 Peromyscus leucopus individuals from three urban sites in NYC and three rural sites from the surrounding area. Including population pairs that are near each other and genetically similar, but occur in different environments (urban versus rural), increases the power to identify candidate genes under selection (Lotterhos & Whitlock 2015). We used traditional population genetic summary statistics to generate per-site estimates and find loci with patterns of genetic variation that deviate from neutral expectations. Next, we used several tests of selection that use our transcriptome-wide SNP datasets to determine whether these deviations are due to recent selection in urban populations of white-footed mice. To increase power, reduce false positives, identify more subtle signals of selection from standing genetic variation, and find candidate genes involved in polygenic phenotypic traits, we simulated a null background model from the inferred demographic history for NYC populations of P. leucopus. We examined the association between quantitative metrics of urbanization (percent impervious surface and human population density) and polymorphisms between rural and urban populations to identify the candidate genes experiencing selection from ecological pressures in urban habitats. We used overlapping results from multiple tests and environmental associations in order to generate a reliable list of candidate genes involved in the local adaptation of P. leucopus populations to the urban environment. This study is the first to use transcriptome-wide patterns of genetic variation for analyses of local adaptation in cities. Evidence of local adaptation in urban populations reveals how urbanization acts as an evolutionary force, gives insights into important traits for local adaptation, and provides an example of the speed of evolution in rapidly changing environments.
Materials and Methods
Sampling, library preparation, and transcriptome assembly
We sampled white-footed mice from 2009 to 2013. We randomly chose eight individual white-footed mice (equal numbers of males and females) from six sampling locations representative of urban and rural habitats (Fig. 1) (Harris et al. 2013, 2015). Three sampling sites occurred within NYC parks: Central Park in Manhattan (CP), New York Botanical Gardens in the Bronx (NYBG), and Flushing Meadow–Willow Lake in Queens (FM). These sites represented urban habitats surrounded by high-volume roads and dense human infrastructure. The remaining three sites occurred ~100 km outside of NYC in rural, undisturbed habitat representative of natural environments for Peromyscus leucopus. High Point State Park is in the Kittatinny Mountains in New Jersey (HIP), Clarence Fahnestock State Park is located in the Hudson Highlands in New York (CFP), and Brookhaven and Wilde Wood State Parks and neighboring sites occur on the northeastern end of Long Island, New York (BHwwp). We sacrificed mice on site and harvested liver, gonad, and brain tissue in the field for immediate storage in RNAlater (Ambion). In the lab, we extracted total RNA and then removed ribosomal RNA during library preparation. The reverse transcribed cDNA was sequenced using the 454 GS FLX+ and SOLiD 5500 xl systems using standard RNAseq protocols. We called SNPs with the Genome Analysis Toolkit pipeline using a Bayesian genotype likelihood model (GATK version 2.8, DePristo et al. 2011). See Harris et al. 2013 and Harris et al. 2015 for full transcriptome sequencing, assembly and SNP calling details.
Summary statistics
SNP information was stored in a VCF (variant call format) file and summary statistics were calculated using vcftools (Danecek et al. 2011). These analyses were used for general estimates of diversity for each population and were calculated for each site. We calculated per-site nucleotide diversity (π), Tajima’s D, and FST. We also calculated the statistics for each contig (per-site statistic summed across all SNPs per contig divided by total sites) and found the average estimate for each population, including all pairwise population comparisons for FST.
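As a rough illustration of the summary statistics described above, the following Python sketch computes nucleotide diversity and Tajima's D for one contig from minor-allele counts at its segregating sites. It is not the vcftools implementation: the function name `diversity_stats` and the input layout are assumptions made for this example, and it assumes every site was genotyped on the same number of chromosomes.

```python
from math import sqrt

def diversity_stats(minor_counts, n):
    """Return (pi, tajimas_d) for one contig.

    minor_counts: minor-allele count at each segregating site
                  (hypothetical input layout, not a vcftools format).
    n: number of sampled chromosomes, assumed identical at every site.
    """
    S = len(minor_counts)
    if S == 0 or n < 4:
        # The variance of D is zero for n < 4, so D is undefined.
        return 0.0, float("nan")

    # pi: expected pairwise differences, summed over segregating sites.
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in minor_counts)

    # Constants from Tajima (1989).
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)

    variance = e1 * S + e2 * S * (S - 1)
    if variance <= 0:
        return pi, float("nan")
    # D compares pi with Watterson's estimator S / a1.
    return pi, (pi - S / a1) / sqrt(variance)
```

Dividing the returned `pi` by the contig length gives a per-site value comparable to the contig-level averages described above. An excess of rare variants pushes D below zero, whereas an excess of intermediate-frequency variants pushes it above zero.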
Scans for positive selection based on population differentiation
Population structure analyses for protein coding sequences show that the three urban sites and three rural sites comprise two distinct groups, but there was also hierarchical structure within each indicating urban sites represent unique evolutionary clusters (Harris et al. 2015). We used the FST based analysis implemented in Bayescan v. 2.1 (Foll & Gaggiotti 2008) to compare all six population-specific allele frequencies with global averages and identify outlier SNPs. Bayescan identifies markers that show divergence patterns between groups that are stronger than would be expected under neutral genetic processes. Based on a set of neutral allele frequencies under a Dirichlet distribution, Bayescan uses a Bayesian model to estimate the probability that a given locus is under the effect of selection. To generate more realistic allele frequency distributions, we used Bayescan to analyze coalescent simulations of SNP datasets based on the neutral demographic history inferred specifically for P. leucopus populations in Harris et al. 2016. We generated 100 sets of 100,000 SNPs each from a three population, isolation with migration model using the previously inferred parameter estimates for divergence time, effective population size, migration rate, and population size change in the coalescent based software program, fastsimcoal2 (Excoffier et al. 2013). In short, the model represented a deep split of an ancestral population into Long Island, NY and the mainland (including Manhattan) 29,440 generations before present (GBP). Migration was asymmetrical from the mainland into Long Island and an urban population later became isolated 746 GBP. Urban populations were also modeled to include a bottleneck event at the time of divergence. Finally, we allowed migration to occur between all three populations (Harris et al. 2016). Bayescan was run independently on each simulated dataset using default parameters. 
Within the observed SNP dataset, we performed a global analysis, one Bayescan run where all individuals were partitioned into Urban and Rural groups, and finally analyses on all individual pairwise population comparisons. Outlier SNPs were retained if they had a false discovery rate (FDR) value < 0.1 and if the calculated FST and posterior odds probability were higher than for any value calculated from the simulated dataset.
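The retention rule above can be sketched in a few lines of Python. This is only an illustration of the filtering logic, not Bayescan output parsing: the record layout (keys `snp`, `q_value`, `fst`, `log10_po`) and the function name `select_outliers` are hypothetical.

```python
def select_outliers(observed, simulated, fdr=0.1):
    """Keep observed SNPs that pass the FDR cutoff and exceed the neutral null.

    observed, simulated: lists of dicts with keys 'snp', 'q_value', 'fst'
    and 'log10_po' (a hypothetical layout chosen for this sketch).
    """
    # Most extreme values seen in any neutral simulation.
    null_max_fst = max(rec["fst"] for rec in simulated)
    null_max_po = max(rec["log10_po"] for rec in simulated)

    return [
        rec["snp"]
        for rec in observed
        if rec["q_value"] < fdr
        and rec["fst"] > null_max_fst
        and rec["log10_po"] > null_max_po
    ]
```

Using the maximum of the simulated values as the threshold means that no SNP generated under the neutral demographic model would itself have been retained.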
Analysis for selective sweeps
We also scanned the transcriptome to look for contigs where the observed SFS showed an excess of low frequency and high frequency minor alleles, a signal indicative of a recent selective sweep in the region. The composite likelihood ratio (CLR) statistic is used to identify regions where the observed SFS matches the expected SFS generated from a selective sweep (Kim & Stephan 2002; Nielsen et al. 2005; Pavlidis et al. 2010). We calculated the CLR along sliding windows across the transcriptome using the software program SweeD (Pavlidis et al. 2013). SweeD is an extension of the popular Sweepfinder (Nielsen et al. 2005) and is optimized for large next generation sequencing (NGS) datasets. SweeD was run separately for each population and on individual contigs directly from VCF files using default parameters except for setting a sliding window size of 200 bp and using the folded SFS, as we lacked an outgroup to infer the ancestral state. The window within each contig with the highest CLR score is the likely location of a selective sweep. Similar to the method used for Bayescan analyses, statistical significance was chosen from a null distribution generated by running SweeD on SNP datasets simulated under the inferred demographic history for P. leucopus populations (Harris et al. 2016). SweeD does not inherently identify outlier regions, but rather, the CLR statistic is computed using a selective sweep model on the observed dataset and needs to be compared to a neutral model calibrated with the background SFS generated from simulations. As before, we used 100 datasets with 100,000 SNPs each, simulated under the inferred neutral demographic history for urban and rural populations of white-footed mice in NYC. The CLR was calculated using SweeD for all simulated datasets and the resulting distribution was used to set a significance cutoff. For the observed dataset, we lacked a genome to provide clear linkage information, so SweeD was run separately on each contig. 
We identified outlier regions and chose the associated contigs as candidates if their CLR statistic was greater than any produced when calculated for neutral simulations. We also required outliers to fall within the top 0.01% of the CLR distribution for the observed SNPs. Choosing outliers within the top 0.01% of the distribution is a conservative cutoff value. When looking for regions with genetic patterns of a selective sweep, Wilches (2014) filtered regions within the top 5% of the distribution. Selective sweeps from artificial selection in rice, Oryza glaberrima, were identified with a cutoff value of 0.5% (Chen et al. 2014) and regions within the Gorilla genome were identified as significant if CLR scores were in the top 0.5% (McManus et al. 2014). We chose an even more stringent filter of 0.01% because we lacked a reference genome and analyses were restricted to relatively short individual contigs.
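The two-part cutoff described above (exceed every neutral simulation, and fall in the top fraction of the observed distribution) can be sketched as follows. The function name `sweep_candidates`, the dictionary layout, and the rank-based handling of the top fraction are assumptions for illustration; SweeD itself only reports CLR scores.

```python
import math

def sweep_candidates(observed_clr, simulated_clr, top_fraction=0.0001):
    """Return (contig, CLR) pairs that pass both cutoffs, highest CLR first.

    observed_clr: dict mapping contig name -> highest CLR in that contig
                  (hypothetical layout for this sketch).
    simulated_clr: CLR scores from the neutral simulations.
    top_fraction: share of the observed distribution to keep (0.0001 = 0.01%).
    """
    null_max = max(simulated_clr)

    # Rank observed contigs from highest to lowest CLR.
    ranked = sorted(observed_clr.items(), key=lambda item: item[1], reverse=True)

    # Number of contigs that make up the requested top fraction.
    n_top = math.ceil(top_fraction * len(ranked))

    # A candidate must be in the top fraction AND beat the neutral maximum.
    return [(contig, clr) for contig, clr in ranked[:n_top] if clr > null_max]
```

Requiring both conditions keeps the list conservative: the simulated maximum guards against demographic artifacts, and the rank cutoff limits candidates to the most extreme observed regions.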
Genotype-environment association tests for environmental selection
We used LFMM (Frichot et al. 2013), a software program that is one of the recently emerging genotype-environment association (GEA) approaches for identifying selection (Hedrick et al. 1976; Joost et al. 2007; Coop et al. 2010; Frichot et al. 2013; Lotterhos & Whitlock 2015), to associate outlier SNPs and candidate loci identified above with potential environmental selection pressures. Latent factor mixed models (LFMM) test for correlations between environmental and genetic variation while accounting for the neutral genetic background and structure between populations (Frichot et al. 2013). We tested three environmental variables associated with urbanization: the percent impervious surface within a two-kilometer buffer around each sampling site, human density within a two-kilometer buffer around each sampling site, and a binary classification of each site as urban or rural. We tested all individuals and only the outlier SNPs detected in Bayescan and SweeD. An important first step in using the LFMM algorithm is to define the number of latent factors, K, that can be used to define population structure in the genetic background. To identify the appropriate number of K latent factors in our dataset, we used default parameters and performed a PCA followed by a recommended Tracy-Widom test to find the number of eigenvalues with significant p values < 0.01 (Patterson et al. 2006; Frichot & François 2015). Results suggested the use of six latent factors. Thus, we ran LFMM with default parameters except for K = 6, an increased number of MCMC cycles = 100,000, and a burn-in = 50,000. Using author recommendations, we combined 10 replicate runs and readjusted the p values to increase the power of the test. LFMM uses |z|-scores to report the probability of a SNP’s association with an environmental variable. After correcting for multiple testing, we used a cutoff value of q ≤ 0.1.
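One common way to combine replicate LFMM runs is to take the median z-score per SNP, rescale by a genomic inflation factor, and convert the adjusted p values to q values with the Benjamini-Hochberg procedure. The Python sketch below follows that recipe; whether it matches the exact steps used in this study is an assumption, and the function names are hypothetical.

```python
from math import erfc, sqrt
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom (about 0.455).
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2

def combine_runs(z_by_run):
    """Return adjusted p values from replicate runs.

    z_by_run: one list of z-scores per run, same SNP order in every run.
    Assumes at least one SNP has a non-zero median z-score.
    """
    # Median z-score for each SNP across runs.
    z_median = [median(list(zs)) for zs in zip(*z_by_run)]
    # Genomic inflation factor: observed median chi-square over its expectation.
    inflation = median([z * z for z in z_median]) / CHI2_1DF_MEDIAN
    # Upper tail of a 1-df chi-square at z^2 / inflation.
    return [erfc(sqrt(z * z / inflation / 2.0)) for z in z_median]

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q values in the original SNP order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotone q values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values
```

SNPs with q ≤ 0.1 would then be reported as associated with the environmental variable. Note that the inflation factor is less reliable when it is estimated from a small, preselected set of outlier SNPs rather than from a genome-wide background.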
Functional annotation of candidate genes
The contigs containing outlier SNPs identified using the tests for selection above were obtained from the P. leucopus transcriptome. The gene annotation pipeline implemented in Blast2GO (Conesa et al. 2005; Götz et al. 2008) was used to find homologous sequences from the NCBI non-redundant protein database using BLASTX, and associated gene ontology (GO) terms were retrieved. Gene ontology (GO) terms are a standardized method of ascribing functions to genes. Blast2GO retrieves GO terms associated with BLASTX hits and also uses the KEGG database to describe biochemical pathways linking different enzymes (Ogata et al. 1999; Kanehisa et al. 2014).
Results
Genetic diversity statistics
We retained 154,770 total SNPs for use in looking at patterns of genetic variation and performing tests of selection. For each population we obtained estimates of nucleotide diversity, Tajima’s D, and pairwise FST. There were differences in genetic diversity between urban and rural populations greater than one standard deviation. Urban populations had a two-fold decrease in nucleotide diversity compared to the rural populations (Table 1). The average nucleotide diversity for all three rural populations was 0.224 ± 0.034, while the average for urban populations was only 0.112 ± 0.019. The average Tajima’s D calculation within populations did not show substantial differences between populations (Table 1). For all populations, Tajima’s D was slightly positive, with rural populations only slightly more positive than urban populations, though not significantly different. Average pairwise FST calculated using vcftools ranged from a low of 0.018 ± 0.364 between two rural populations (CFP_HIP) to a high of 0.110 ± 0.520 between two urban populations (CP_FM, Table 2). These FST calculations were very similar to calculations made for neutral genome-wide SNP datasets from the same P. leucopus populations (Munshi-South et al. 2016), and supported findings that these populations lack an isolation-by-distance pattern. Comparisons between rural populations had the lowest FST values, urban to rural populations had the second lowest, and urban to urban population comparisons had the highest overall FST values despite being less than 5 km apart (Table 2).
Outlier detection
The test for positive or balancing selection implemented in Bayescan for the global analysis revealed 309 (0.19%) SNPs potentially under the influence of divergent selection. To investigate divergent selection due to urbanization, sampling sites were grouped and classified as urban or rural, and genome scans using Bayescan on this dataset uncovered 40 (0.025%) SNPs with signatures of positive selection (Fig. 2A, Table 3). Eight of these SNPs were found in the global analysis. Individual urban to rural population comparisons did not find any outlier SNPs, and zero SNPs were revealed to be under balancing selection. FST for outlier SNPs ranged from 0.21 to 0.33, much higher than the population average. When Bayescan was run on the simulated neutral dataset, which included bottlenecks during urban population divergence, there were zero identified outlier SNPs. We did, however, only include outlier SNPs from the observed dataset with FDR and posterior odds values that were smaller and larger, respectively, than the most extreme values for the simulated data (FDR ≤ 0.6 and log10(PO) ≥ -0.196).
Outlier regions showing signatures of selective sweeps from the SweeD analysis were identified using comparisons to neutral expectations. To generate the null distribution of the CLR statistic we tested the 100 SNP datasets simulated under the inferred demographic history for NYC populations of P. leucopus. We found that CLR scores in the top 5% of the simulated distribution were generally 2x - 3x lower than those in the top 5% of the observed dataset. We ran SweeD on observed SNPs within individual contigs and identified outliers by filtering for a CLR score ≥ 3.53 (the maximum CLR from simulated data). We also chose regions that fell within the top 0.01% of the observed distribution (Fig. 2B). SweeD identified regions with SFS patterns that fit a selective sweep model in 55 contigs (40,908 contigs in P. leucopus transcriptome, 0.13%) within urban populations (Table 4). Contig 35790-44, which codes for the lipid transporter Apolipoprotein B100, had the highest CLR score, CLR = 8.56, and all outliers had CLR scores ≥ 4.97. There was no overlap of outliers between Bayescan and SweeD.
Environmental associations
We used LFMM to examine statistical associations of outlier SNPs with environmental measures of urbanization. Thirty of 40 outliers identified from Bayescan could be associated with at least one of the three environmental variables tested, which clearly delineate urban and rural sampling locations (Fig. 3A, Table 3). All 30 of the identified SNPs were associated with whether a site was classified as urban or rural. Only seven of the outlier SNPs were associated with percent impervious surface surrounding the sampling site and five were associated with human density. Twenty-six of the 55 outlier contigs in urban populations containing selective sweep regions as identified in SweeD could be associated with one of the environmental variables (Table 4). Again, all 26 significant associations involved classification of a site as either urban or rural. Fourteen outliers from SweeD were associated with percent impervious surface and eight were associated with human density surrounding the sampling location. Some contigs containing outlier SNPs associated with environmental variables were unique to individual urban populations, possibly indicating local adaptation within parks or selection on a polygenic trait.
Functional annotation
The full contig sequences containing the outlier SNPs were obtained from the P. leucopus transcriptome (Harris et al. 2015) and used to identify functional annotations. Of the 40 contigs identified by Bayescan as divergent between urban and rural populations, 36 could be annotated with gene names and functional information (Table 3). Of these, 29 were also associated with urban environmental variables. For the Bayescan outlier sequences, the ten most frequent gene ontology terms attributed to the DNA sequences involved organismal metabolism (Table S1). Some outliers occurred within well-studied genes with known functions and biochemical pathways. These included a farnesoid-x-receptor (FXR, Contig 25795-154) gene, the protein ABCC8 (Contig 26183-148), a Hermansky-Pudlak syndrome gene (Hps1, Contig 36706-36), KDM8, a histone demethylase (Contig 7750-426), a myosin light chain kinase (MYLK, Contig 7975-4180), and the gene SORBS2 (Contig 37967-26). These genes were identified as likely experiencing divergent selection between urban and rural populations and showed environmental associations with urbanization.
When we used results from SweeD, we found regions within 55 contigs that showed a signature of a selective sweep (Table 4). Forty-nine could be annotated with gene names and gene ontology terms, and 25 were also associated with urbanization. Overall, sequences were associated with metabolic processes, similar to the outliers found in Bayescan, and many genes were involved with basic metabolic functions such as glycolysis and ATP production (Table S1). A few contigs were annotated with well-studied genes and clearly understood functions. Contig 35790-44 was annotated as the gene APOB, an apolipoprotein, and Contig 10636-348 was an aflatoxin reductase gene AKR7A1. There was also the gene FADS1, part of the fatty acid desaturase family (Contig 342-1776), a heat-shock protein (Hsp90, Contig 3964-627), and a hepatocyte growth factor activator gene (Contig 8960-388). Most gene annotations did not have known phenotypic traits related to their function, but KEGG analysis revealed several contigs involved in the same biochemical pathways: galactose metabolism, fructose metabolism, and mannose metabolism (Fig. S1).
Discussion
The results of this study provide insight into the genetic basis of local adaptation, which is key for understanding the ecological and evolutionary processes that affect biodiversity and how organisms respond to changing environments. We hypothesized that populations of P. leucopus in urban habitat fragments within NYC adapt in response to selective pressures from urbanization. Previous work supports this claim. Clear evidence of population structure between urban and rural sampling sites from neutral non-coding (Harris et al. 2016) and protein coding datasets (Harris et al. 2015) suggests NYC populations of white-footed mice are genetically isolated. Urbanization also impacts genetic diversity across the genome (Munshi-South et al. 2012, Harris et al. 2015, Harris et al. 2016). P. leucopus populations along an urban-to-rural gradient in NYC had reduced nucleotide diversity and heterozygosity in urban populations (Munshi-South et al. 2016). Additionally, demographic inference indicates that NYC populations became isolated within the timeframe of urban settlement (Harris et al. 2016).
We previously found evidence for older occurrences of divergent selection in NYC white-footed mice by investigating non-synonymous polymorphisms between pooled transcriptome samples (Harris et al. 2013). There was little overlap between previous results and those found here, but that was not surprising, as this data-set was much larger, covered more sampling sites, and looked at recent signatures of selection. Two of the eleven previously identified candidate genes (Harris et al. 2013) were direct matches to outliers in this current analysis (Serine protease inhibitor a3c and Solute carrier organic anion transporter 1A5), and three other genes were from the same gene families or involved in the same biological processes as those described here. One gene was an aldo-keto reductase protein, part of the same gene family as our SweeD identified aflatoxin reductase gene (Contig 10636-348). The aldo-keto reductase gene family comprises a large group essential for metabolizing various natural and foreign substances (Hyndman et al. 2003). Two others, camello-like 1 and a cytochrome P450 (CYPA1A) gene, are involved in metabolism of drugs and lipids. In Peromyscus spp., CYPA1A is directly expressed along with Hsp90 (outlier from current SweeD analysis) when exposed to environmental toxins (Settachan 2001). Collectively, these findings suggest that urban populations of P. leucopus may be adapting in response to selective pressures from urbanization.
In this study, we observed patterns of divergent positive selection between urban and rural populations of P. leucopus, and we were able to annotate the contigs containing outlier SNPs and associate those SNPs with environmental variables representative of urbanization. The majority of candidate genes deal with organismal metabolism, particularly diet-related breakdown of lipids and carbohydrates. We discuss what these findings mean for organisms as they are exposed to novel urban ecosystems, and for understanding the ecological processes and time frame of recent local adaptation in general.
The utility of using genome scan methods to test for selection
Over the past decade, genome scan methods have become a feasible and common way for investigating polymorphisms across the genome in order to detect and disentangle neutral (demographic) and adaptive (selection) evolutionary processes (De Villemereuil et al. 2014). One of the most popular approaches looks at locus specific allele frequency differentiation between sampling locations as measured by FST (Lewontin & Krakauer 1973; Weir & Cockerham 1984). Sites with extremely high allele frequency differences may be subjects of positive directional selection. Bayescan (Foll & Gaggiotti 2008) builds on this idea and identifies outliers using a Bayesian approach. Bayescan calculates the posterior probability of a site being under the influence of selection by testing two models, one that includes selection and one that does not. The model that does not invoke selection is based on a theorized neutral distribution of allele frequencies.
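To make the idea of locus-specific differentiation concrete, the sketch below computes a simple two-population FST for one biallelic SNP as (HT - HS) / HT, where HT is the expected heterozygosity of the pooled populations and HS is the mean within-population heterozygosity. This is a textbook heterozygosity-based estimate chosen for illustration only; it is not the Weir & Cockerham estimator and not the Bayesian model implemented in Bayescan. The function name and the example allele frequencies are hypothetical.

```python
def fst_two_pops(p1: float, p2: float) -> float:
    """Heterozygosity-based FST for one biallelic locus.

    p1, p2: frequency of the same allele in two equally weighted populations.
    Illustrative assumption: no sample-size correction is applied.
    """
    p_bar = (p1 + p2) / 2.0
    # Expected heterozygosity if the two populations were pooled
    h_t = 2.0 * p_bar * (1.0 - p_bar)
    # Mean expected heterozygosity within each population
    h_s = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    if h_t == 0.0:
        return 0.0  # locus is monomorphic overall, so no differentiation
    return (h_t - h_s) / h_t


# Hypothetical urban vs. rural allele frequencies at three SNPs
for urban, rural in [(0.5, 0.5), (0.9, 0.1), (1.0, 0.0)]:
    print(urban, rural, round(fst_two_pops(urban, rural), 3))
```

Loci with identical frequencies give a value of zero and loci fixed for alternate alleles give a value of one; outlier approaches ask which loci sit unusually far toward the upper end of that range given the genome-wide background.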
While Bayescan has been shown to be the most robust differentiation method with respect to confounding demographic processes (Pérez-Figueroa et al. 2010; De Villemereuil et al. 2014), population bottlenecks, hierarchical structure, recent migration, or variable times to most-recent-common-ancestor (MRCA) between populations can artificially inflate FST values (Hermisson 2009; Lotterhos & Whitlock 2014). One way to avoid false positives is to build population structure and a specific demographic history directly into the null distribution of FST. We dealt with the issue of type I errors by running Bayescan on simulated SNP datasets generated under the neutral inferred demographic history for urban populations of P. leucopus in NYC (Harris et al. 2016). We only included outliers if their posterior probability was greater than any found from simulations. The outliers captured when comparing urban to rural sites made up 0.025% of the total number of loci analyzed from the transcriptome. This number is in line with candidates uncovered from a similar study (0.05%) that looked at high and low altitude populations of the plant S. chrysanthemifolius (Chapman et al. 2013). Many studies find higher percentages of outlier loci using Bayescan, for example 4.5% in the American pika across its range in British Columbia (Henry & Russello 2013) and 5.7% in Atlantic herring across their range (Limborg et al. 2012). Our lower overall percentage of outliers may be because we included the known demographic history in our tests, because of the relatively recent isolation of urban populations of P. leucopus, or because we did not have complete transcriptome sequences for our populations.
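The filtering rule described above (keep an observed locus only if its posterior probability exceeds every value obtained from the neutral simulations) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function name, the locus identifiers, and the posterior probabilities are all hypothetical, and reading Bayescan output files is left out.

```python
def outliers_above_null(observed: dict, simulated: list) -> list:
    """Return loci whose posterior probability exceeds the simulated maximum.

    observed:  maps a locus ID to its posterior probability from the real data.
    simulated: posterior probabilities from SNPs simulated under the neutral
               demographic model (the empirical null).
    """
    if not simulated:
        raise ValueError("need at least one simulated value to set a threshold")
    threshold = max(simulated)  # most extreme value seen under neutrality
    # Strict inequality: a locus must beat every neutral simulation
    return sorted(locus for locus, pp in observed.items() if pp > threshold)


# Hypothetical values for illustration only
observed_pp = {"snp_A": 0.99, "snp_B": 0.80, "snp_C": 0.95}
neutral_pp = [0.50, 0.90, 0.95]
print(outliers_above_null(observed_pp, neutral_pp))
```

Using the simulated maximum as the cutoff is deliberately conservative: it trades some power to detect weakly selected loci for a lower rate of false positives caused by demography.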
SweeD, another genome scan approach, looks at patterns in the SFS within a population as opposed to allele differentiation between populations. The statistics developed around the SFS are used to look at genetic hitchhiking around a selected locus that produces a pattern characteristic of a selective sweep (Schlötterer 2003; Pavlidis et al. 2008). The main footprint that selective sweeps leave on the SFS is an excess of rare low frequency and high frequency variants (Nielsen 2005). The SweepFinder method (Nielsen et al. 2005), recently upgraded to the NGS compatible SweeD (Pavlidis et al. 2013), uses a composite likelihood ratio test based on the ratio between the likelihood of a null (neutral evolution model) and the alternative (selective sweep) hypothesis. Like differentiation based methods, the weakness of hitchhiking methods is the confounding effect certain demographic processes have on the SFS. A strong population bottleneck can lead to variances in the genealogical history so that some loci have decreased genetic diversity and an excess of low frequency variants (Hermisson 2009). Again, however, building the known demographic history into the null model readily reduces false positive rates (Pavlidis et al. 2013).
We included the P. leucopus demographic history into our analysis, and found 0.04% of the transcriptome to contain regions with SFS patterns indicative of selective sweeps. This rate is in line with other studies that found 0.5% of regions in domesticated rice to show evidence of selective sweeps, though this might be unusually high due to artificial selection (Wang et al. 2014), 0.02% of loci in black cottonwood experiencing selective sweeps across geographic regions (Zhou et al. 2014), and 0.02% of regions across the entire Gorilla genome to show hitchhiking patterns (McManus et al. 2014).
Individual genome scan approaches look at different aspects of genomic structure and by themselves can miss true outliers (type II errors) or identify false positives (type I errors). Several studies have shown that a general principle to follow in order to avoid these errors is to perform multiple tests looking at various aspects of the genome (Nielsen 2005; Grossman et al. 2010; Hohenlohe et al. 2010b). We used Bayescan and SweeD to identify outliers experiencing positive selection, but did not find any overlapping candidate genes between them. This finding is not necessarily unexpected as the two tests look at different selection scenarios, divergent local selection versus population-wide positive selection in the form of selective sweeps (Hermisson 2009). FST based methods can pick up on divergence between alleles relatively quickly, while models for selective sweeps typically require nearly-fixed derived alleles (Hohenlohe et al. 2010b). Given the recent time frame of urbanization in NYC, not enough generations may have passed since white-footed mice became isolated to find complete selective sweeps in loci that overlap with outliers from Bayescan. In the case of NYC populations of P. leucopus, it is likely that adaptation is occurring from standing genetic variation in the form of soft sweeps (Hermisson & Pennings 2005), which are not readily identified by programs like SweeD (De Villemereuil et al. 2014). In further support of this idea, we found several outliers across the various tests we ran that are unique to specific urban populations. This pattern is characteristic of soft sweeps, as both soft sweeps and polygenic traits can lead to outlier SNPs unique to populations (Messer & Petrov 2013). Despite the lack of overlapping outlier SNPs between the two tests, further evidence that positive selection is acting in urban populations of P. leucopus was found with an additional approach. Independent confirmation of candidate genes came from correlating genotypes and environmental variables, a method that may be more powerful than the genome scans above for identifying SNPs under selection (Savolainen et al. 2013).
Environmental associations strengthen evidence of local adaptation to urbanization
Genotype-environment association tests are a growing class of methods that provide fine scale detail about the ecological processes driving selection by identifying loci with allele frequencies that are correlated with environmental factors. Several have recently been developed (Joost et al. 2007; Coop et al. 2010; Frichot et al. 2013), and here we used LFMM (Frichot et al. 2013) to associate outlier SNPs with environmental measurements that capture the effects of urbanization. LFMM is well suited to our dataset as it has been found to perform better than other methods in the presence of hierarchical structure and when polygenic selection is acting on many loci with small effect (De Villemereuil et al. 2014). In our dataset, there are many layers of structure including urban and rural differentiation (Harris et al. 2015; Harris et al. 2016), patterns of geographic structure between mainland mice and Long Island, NY (Harris et al. 2016), and population structure between individual urban parks (Munshi-South & Kharchenko 2010). LFMM also has more power when the sample size is fewer than 10 individuals per population, there is no evidence of IBD, and the sampling design involves pairs of sites in environmentally heterogeneous habitats (Lotterhos & Whitlock 2015). We sampled eight white-footed mice per population, found no evidence of IBD (Munshi-South et al. 2016), and sampled environmentally heterogeneous rural and urban locations.
Using LFMM, we found that 75 % and 47 % of outliers from Bayescan and SweeD, respectively, could also be associated with one or more environmental variables. These results complement our findings that positive selection is acting on urban populations of white-footed mice. We acknowledge that impervious surface, human density, or classification as urban may be correlated with a different environmental selection force, but our results ultimately support an evolutionary scenario where isolated urban populations are experiencing divergent positive selection that is strongly affected by one or more environmental variables, likely associated with urbanization. These results are also consistent with other studies combining genome scan methods and GEA tests. Limborg et al. (2012) found 62.5 % of the outliers identified in Bayescan to be correlated with temperature or salinity changes in Atlantic herring, and 26.3 % of genome scan outliers could be associated with temperature or latitude in the tree species, A. glutinosa (De Kort et al. 2014).
The percent impervious surface and human density around a park, or the classification of sites as urban or rural, are efficient metrics for determining whether a sampling location has been affected by urbanization (Munshi-South et al. 2016). We can make several predictions about how ecological processes are changing within parks influenced by urbanization. One of the most obvious consequences of human altered environments is habitat loss and fragmentation (McKinney 2002; Sih et al. 2011). The act of fragmentation and the building of infrastructure invariably changes the net primary productivity due to increasing percentages of impervious surface or artificial landscapes, parks and yards (Shochat et al. 2006). Additionally, species interactions change as organisms are forced into smaller areas or separated by infrastructure (Shochat et al. 2006). This includes impediments to migration across the urbanized landscape. Humans often introduce invasive species into habitats (Sih et al. 2011) leading to increased competition or novel predator-prey interactions. Urbanization also changes the types and availability of resources in the altered habitat (McKinney 2002; Sih et al. 2011). Pollution is also a major consequence of urbanization (Donihue & Lambert 2014), and can include chemical, noise, or light pollution (Sih et al. 2011).
Given the rapid alteration of environments during urbanization, behavioral flexibility and phenotypic plasticity are thought to play an important role in a species’ response to novel urban ecosystems (Sih et al. 2011). Climate change, another form of human-induced rapid environmental change, is often used as a model for understanding plastic and evolutionary responses in organisms. Franks et al. (2014), in a comprehensive review of phenotypic changes in plants in response to climate change, reported that the majority of studies showed evidence of plastic responses. They also found many studies showed evidence of adaptation, though not always conclusively. Looking at animal responses to climate, Boutin & Lane (2014) reported similar results but even less conclusive evidence of adaptation versus plasticity, possibly due to the motility of animals and difficulty in establishing common garden or reciprocal transplant experiments. While it is likely that P. leucopus in NYC are displaying some plastic phenotypic responses in urban ecosystems, our results provide evidence of heritable evolutionary responses as well.
Between divergent allele frequencies, a skewed SFS, and environmental associations, we find several overlapping lines of evidence that support rapid divergent positive selection in white-footed mice. Urban ecologists are increasingly finding evidence of selection acting in urban environments (Donihue & Lambert 2014), and our results are in line with other studies that have found rapid local adaptation to ecological pressures from urbanization. Yeh (2004) found sexually selected tail coloration in juncos was rapidly evolving in urban populations compared to rural ones. European blackbirds show reduced migratory behavior in cities, and there is also evidence of selection on genes underlying anxiety behavior across multiple urban areas (Partecke et al. 2006; Mueller et al. 2013). Cheptou et al. (2008) found weeds in urban vegetation plots surrounded by paved surfaces had a higher percentage of non-dispersing seeds and that this trait was genetically based. In marine species living in the polluted waters around urban areas, rapid adaptation for PCB resistance occurred in both killifish and tomcod (Whitehead et al. 2010; Wirgin et al. 2011). The realization that a diverse range of taxa may adapt to human induced landscape change suggests rapid adaptation to anthropogenic driven environmental change may be pervasive in nature.
Functional roles and ecological relevance of candidate genes
The model rodent species Mus musculus, Rattus norvegicus, and Cricetulus griseus, all have deeply sequenced, assembled and annotated reference genomes. These resources allowed us to annotate 89.5 % of contigs containing outlier SNPs and genomic regions with high quality gene information. These annotations provided us with information about the traits affected by candidate genes. Urban P. leucopus specifically exhibited genetic patterns that suggest positive selection in genes from the mitochondria, a potentially significant finding considering mitochondrial genes are often used for demographic inference (Munshi-South & Nagy 2014). Tests for selection also identified genes that protect cellular health in stressful environments, modulate melanism throughout the body, genes that are involved in epigenetic control of gene expression, or involved in digestion and metabolism of lipids and carbohydrates.
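The 89.5 % annotation rate quoted above can be reproduced from the contig counts given in the Results: 36 of 40 Bayescan contigs and 49 of 55 SweeD contigs were annotated. The short check below is only an arithmetic illustration; the variable and function names are hypothetical.

```python
# Contig counts as reported in the Results
bayescan_total, bayescan_annotated = 40, 36
sweed_total, sweed_annotated = 55, 49


def percent(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


overall = percent(bayescan_annotated + sweed_annotated,
                  bayescan_total + sweed_total)
print(f"Bayescan: {percent(bayescan_annotated, bayescan_total):.1f} %")
print(f"SweeD:    {percent(sweed_annotated, sweed_total):.1f} %")
print(f"Overall:  {overall:.1f} %")  # 85 of 95 contigs
```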
Gene ontology vocabulary assigns gene function according to biological process, molecular function, and cellular component. Across all candidate genes and gene ontology terms, involvement with mitochondria was one of the most common assignments (Table S1). Whether genes were involved in energy production through metabolism of food or were actual mitochondrial proteins, it appears evolution in mitochondria and metabolic processes is extremely important for P. leucopus living in urban parks. Mitochondrial genes were traditionally used as neutrally evolving markers, but researchers are finding evidence of selection on mitochondrial DNA across taxa (Oliveira et al. 2008; Balloux 2010). One example includes mitochondrial haplotypes associated with more efficient non-shivering thermogenesis and higher fitness in over-wintering shrews (Fontanillas et al. 2005). In Peromyscus leucopus, Pergams & Lacy (2007) found complete mitochondrial haplotype replacement in present-day white-footed mice living in the urban Chicago environment compared to haplotypes found in museum skins collected from before urbanization. The agent of selection is not clear, but independent research found evidence of negative selection acting on the mitochondrial D-loop gene in NYC P. leucopus (Munshi-South & Nagy 2014). These findings are not surprising. Many mitochondria-related metabolic functions are affected by the same environmental variables that change in response to urbanization, like temperature (Urban = heat island effect) (Balloux 2010), population density (Urban = barriers to dispersal around parks) (Lankau & Strauss 2011; Munshi-South 2012), or resource availability (Urban = increased non-native prey) (Burcelin et al. 2002). In novel urban ecosystems, P. leucopus may be experiencing different energy requirements than rural counterparts.
One example of uniquely urban energy requirements comes from the signature of a selective sweep and a strong correlation with urban site classification found in the heat-shock protein Hsp90. Heat shock proteins are a gene family that has repeatedly been found to play a pivotal role in adaptation to environmental stress (Limborg et al. 2012). In a landmark study, cryptic variation in Hsp90 specifically was found to act as a capacitor for the loss of eyes in cavefish (Rohner et al. 2013). Essentially, under normal environmental conditions, Hsp90 masks phenotypic variation in eye size, but under high stress conditions, Hsp90 is effectively inhibited, allowing for eye size variation and eventual selection for unmasked phenotypic traits. In Peromyscus spp., Hsp90 acts as a chaperone for many proteins, including a suite of metabolizing receptors activated by dioxin-like industrial toxins often found in polluted soil samples (Settachan 2001). When P. maniculatus was exposed to soils inundated with the toxin 2,3,7,8 TCDD, maintenance of their circadian rhythm was affected and mice became active 3 hours earlier than under normal conditions (Settachan 2001). The aldo-keto reductase gene, aflatoxin aldehyde reductase (AKR7), was also an outlier in our analyses and is also important for metabolizing environmental toxins (Hyndman et al. 2003). Aflatoxin is a natural carcinogen often found in cereals and nuts contaminated with the fungus A. flavus and is metabolically activated by cytochrome P450 (Jin & Penning 2007). In experiments on Rattus norvegicus, researchers found AKR7 is upregulated in the liver when exposed to various classes of toxins and quickly acts to metabolize them, protecting cellular health (Ellis et al. 2003). We found P. leucopus caught in NYC had more enlarged, scarred, and fatty livers than those from rural populations (personal observation), and this may be directly related to ecological conditions in urban environments that promote environmental toxin accumulation. Due to proximity to human infrastructure, urban soils consistently show increased levels of heavy metal contamination (McDonnell et al. 1997). Urban ecosystems also experience the heat island effect with higher temperatures than rural locations (McDonnell et al. 1997), leaf litter that quickly decomposes but is of poor quality (Pouyat et al. 1997), and NYC in particular experiences high humidity in warmer months (National Oceanic and Atmospheric Administration, NOAA). The combination of constantly decaying vegetation, high temperatures, and high humidity is ideal for healthy communities of the fungus A. flavus, the primary producer of aflatoxins. Hsp90, AKR7, and cytochrome P450 may be under selective pressures in NYC to efficiently metabolize higher concentrations of toxins in P. leucopus exposed to polluted urban soils or food sources in NYC.
Energy requirements may also be different in urban populations because of dietary shifts. We found a surprising number of candidate genes with functions related to the metabolism and transport of lipids and carbohydrates. These genes were strongly correlated with environmental measures of urbanization, with clearly divergent allele frequencies between urban and rural sites (Fig. 3B). APOB-100 is the primary apolipoprotein that binds and transports lipids, including both forms of cholesterol (HDL and LDL), and Mus musculus knock-out models result in hyperglycemia and obesity (Lloyd et al. 2008). FXR, a farnesoid-x-receptor, is a nuclear receptor antagonist that is involved in bile synthesis and modulates high fat diets, with variation in expression affecting rates of obesity in mice (Li et al. 2013). Manually curated protein annotations show MYLK and SORBS2 are both directly involved in the gastrointestinal system, involved in smooth muscle contractions and absorption of water and sodium in the intestine, respectively (Magrane & Consortium 2011; Consortium 2014). ABCC8 is an ATP-binding cassette transporter, and knock-out mice models lack insulin secretion in response to glucose (Seghers et al. 2000). Finally, KEGG analysis found that two contigs (10636-348 and 27546-129) represent proteins that are both directly involved in galactose, fructose and mannose metabolism (Ogata et al. 1999).
These candidate genes suggest that white-footed mice in isolated urban parks are responding to resource differences between urban and rural habitats. One prediction is that urban P. leucopus consume a diet with higher overall fat content. The typical diet of P. leucopus across its range consists of arthropods, fruits, nuts, various green vegetation, and fungus (Wolff et al. 1985). They are especially reliant on oak mast cycles and are an important predator of gypsy moths (Ostfeld et al. 1996). They are generalists and opportunistic in the food they eat, and thus many different food resources could drive diet differences in urban versus rural systems. Urbanization in NYC has led to relatively small green patches that are surrounded by a dense urban matrix. The high percent of impervious surface is detrimental to the persistence of white-tailed deer in urban parks, leading to their exclusion throughout the majority of NYC. An overabundance of deer, like what occurs in our rural sampling sites, leads to the removal of the vegetative understory and inhibits regeneration of many plants (Stewart 2001). In these heavily browsed habitats lacking a thick vegetative understory, there is a direct correlation between the length of deer browsing in the area and invertebrate species diversity and abundance (Stewart 2001; Allombert et al. 2005). As the understory is cleared by deer there are fewer food resources and habitats for woodland invertebrates.
This is not the case for urban parks that often have extremely thick and healthy understories (Leston & Rodewald 2006). Although the understory of urban forest fragments is typically composed of invasive plants, such an understory can produce a number of novel seed and fruit resources (McKinney 2008), as well as support a high abundance, if not diversity, of invertebrate prey (McDonnell et al. 1997). P. leucopus in NYC are likely so successful in urban ecosystems because they take advantage of the new food sources in urban habitats, including seeds and other plant parts from an invasive understory layer, as well as invertebrates that may thrive in urban fragments. There has been much research on adaptation to diet specialization, especially in human populations. One well known case involves mutations in the human lactase gene that lead to lactase persistence, most likely in response to a cattle domestication event (Enattah et al. 2008). Another study that looked at more subtle shifts in allele frequencies across human populations found outlier SNPs within genes that more efficiently metabolize proteins found in the root and tuber based diets that humans switched to as they moved into polar ecoregions (Hancock et al. 2010). There is also growing evidence of adaptation in native predators in order to consume exotic or toxic prey species (Carlsson et al. 2009), for example, larger mouthparts in the Australian soapberry bug to increase foraging on invasive balloon vines (Carroll et al. 2005).
We hypothesize that urban P. leucopus have much higher fat content in their diets due to increased seed or invertebrate abundance or the inclusion of high-fat human food waste, and that local adaptation is occurring to more efficiently metabolize the increased lipids and carbohydrates. There is strong genetic evidence that divergent positive selection is occurring between urban and rural mice, but in order to confirm these hypotheses, it would be worth performing common garden experiments to measure metabolic rates when mice from different habitats are fed a consistent diet, or sequencing these same candidate genes across a broader range of urban and rural sites to look for similar signatures of selection. It might also be worthwhile to associate outlier SNPs with more fine scale ecological measurements like temperature, environmental pollutant level, or vegetative understory cover. Diet analyses between sites can also be undertaken, and with a metabarcoding approach using next generation sequencing, the entire diet can easily be identified from P. leucopus waste (Pompanon et al. 2012; Soininen et al. 2013).
Conclusions
Results strongly suggest that populations of Peromyscus leucopus within urban parks in NYC are adapting to the effects of urbanization. Focusing on protein-coding regions of the genome, using multiple tests of selection that analyze different parts of genomic structure, and associating outliers with environmental variables that capture the ecological changes imposed by urbanization allowed us to narrow in on specific genes underlying recent adaptation in urban habitats. In line with the definition of an ‘urban adapter’ (McKinney 2002), the generalist P. leucopus is successful in urban parks, and our results suggest white-footed mice may be adapting to changing dietary resources in urban ecosystems and potentially metabolizing increased chemical pollutants in their environment. While we find definitive evidence of genetic variation between urban and rural sampling sites, further work needs to be done to look at specific polymorphisms and their impact on translation and protein folding.
Next steps should include SNP assays or full sequencing of outlier genes in more individuals from an increased number of sites across the urban-rural gradient. With this further confirmation, ecological based studies of diet can be pursued. Humans are increasingly altering the natural landscape through urbanization and indirectly through global climate change.
Despite this, there are few studies with clear evidence of adaptation in novel urban ecosystems. Our study begins to address this issue using the statistical power of genomic datasets and finds that rapid adaptation is possible in recently disturbed ecosystems. By providing further understanding of contemporary evolution in response to urbanization, we have begun to answer important questions about the traits involved in adaptation to human modified landscapes and what environmental variables most likely drive this adaptation. Hopefully, these insights can be used for urban ecosystem management as global biodiversity continues to deal with unprecedented environmental change in the Anthropocene.
Acknowledgments
This research was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number R15GM099055 to JM-S and a NSF Graduate Research Fellowship to SEH. The content is solely the responsibility of the authors and does not represent the official views of the National Institutes of Health.