Abstract
Populations experience a continual input of new mutations with fitness effects ranging from lethal to adaptive. While the distribution of fitness effects (DFE) of new mutations is not directly observable, many mutations likely either have no effect on organismal fitness or are deleterious. Historically, it has been hypothesized that populations carry many mildly deleterious variants as segregating variation, which may decrease the mean absolute fitness of the population. Recent advances in sequencing technology and sequence conservation-based metrics for predicting the functional effect of a variant permit examination of the persistence of deleterious variants in populations. The issue of segregating deleterious variation is particularly important for crop improvement, because the demographic history of domestication and breeding allows deleterious variants to persist and reach moderate frequency, potentially reducing crop productivity. In this study, we use exome resequencing of thirteen cultivated barley lines and genome resequencing of seven cultivated soybean lines to investigate the prevalence and genomic distribution of deleterious SNPs in the protein-coding regions of the genomes of two crops. We find that putatively deleterious SNPs are best identified with multiple prediction approaches, and that SNPs that cause protein truncation make up a minority of all putatively deleterious SNPs. We also report the implementation of a SNP annotation tool (BAD_Mutations) that makes use of a likelihood ratio test based on alignment of all currently publicly available Angiosperm genomes.
Introduction
Mutation produces a constant influx of new variants into populations. Each mutation has a fitness effect that varies from lethal to neutral to advantageous. While the distribution of fitness effects of new mutations is not directly observable (Eyre-Walker and Keightley 2007), most mutations with fitness impacts are deleterious (Keightley and Lynch 2003). Deleterious mutations are typically identified as changes at phylogenetically-conserved sites (Doniger et al. 2008), or loss of protein function (Yampolsky et al. 2005). Strongly deleterious variants (particularly those with dominant effects) are quickly purged from populations by purifying selection. Likewise, strongly advantageous variants increase in frequency, and ultimately fix due to positive selection (Robertson 1960; Smith and Haigh 1974). Weakly deleterious variants have the potential to persist in populations and cumulatively contribute significantly to reductions in fitness (Fay et al. 2001; Eyre-Walker et al. 2006; Doniger et al. 2008).
Considering a single variant in a population, three parameters affect its segregation: the effective population size (Ne), the selective coefficient against homozygous individuals (s), and the dominance coefficient (h). The effects of Ne and s are relatively simple; variants are primarily subject to genetic drift rather than selection if Nes < 1 (Kimura et al. 1963). The effect of h is not as straightforward, as it depends on the frequency of outcrossing. In populations with a high degree of inbreeding, many individuals will be homozygous, which reduces the importance of h in determining the efficacy of selection against the variant. In populations that are outcrossing, an individual deleterious variant will occur primarily in the heterozygous state, and h will determine how “visible” the variant is to selection, with higher values of h increasing the strength of selection (Charlesworth and Charlesworth 1999). A completely recessive deleterious variant may remain effectively neutral as long as the frequency of the variant is low enough that substantial numbers of homozygous individuals are not produced. Conversely, a completely dominant deleterious variant will be quickly purged from the population (Lande and Schemske 1985). On average, deleterious variants segregating in a population are predicted to be partially recessive (Simmons and Crow 1977), allowing them to remain “hidden” from the action of purifying selection, and reach moderate frequencies. Indeed, data from a gene knockout library in yeast (Shoemaker et al. 1996) indicate that protein loss-of-function variants have an average dominance coefficient of 0.2 (Agrawal and Whitlock 2012).
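The interplay of Ne, s, and h described above can be illustrated with a minimal sketch. It assumes the common textbook parameterization of relative genotype fitnesses (1, 1 - hs, 1 - s) and the standard expectation for homozygote frequency under an inbreeding coefficient F; the function names and the example values are hypothetical and are not part of the analyses reported here.

```python
def is_effectively_neutral(ne: float, s: float) -> bool:
    """Return True when Ne * s < 1, i.e. drift dominates over selection."""
    return ne * s < 1.0


def genotype_fitness(s: float, h: float) -> dict:
    """Relative fitness of the three genotypes at a biallelic locus.

    Assumed parameterization: 'A' is the ancestral allele and 'a' the
    deleterious allele, with fitnesses 1, 1 - h*s, and 1 - s.
    """
    return {"AA": 1.0, "Aa": 1.0 - h * s, "aa": 1.0 - s}


def expected_homozygote_frequency(q: float, f: float = 0.0) -> float:
    """Expected frequency of 'aa' homozygotes.

    q is the frequency of the deleterious allele and f the inbreeding
    coefficient (f = 0 for random mating, f = 1 for complete inbreeding).
    """
    return q * q + f * q * (1.0 - q)


if __name__ == "__main__":
    # Hypothetical values: a partially recessive variant at 10% frequency.
    print(is_effectively_neutral(ne=1000, s=0.0001))
    print(genotype_fitness(s=0.1, h=0.2))
    print(expected_homozygote_frequency(q=0.1, f=0.0))
    print(expected_homozygote_frequency(q=0.1, f=1.0))
```

With complete inbreeding the homozygote frequency approaches the allele frequency itself, which is why h matters less in highly selfing species such as barley and soybean.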
Effective rates of recombination also have important impacts on the number and distribution of deleterious mutations in the genome. Low recombination regions are prone to the irreversible accumulation of deleterious variants. This phenomenon is known as the “ratchet effect” (Muller 1964). In finite populations with low recombination, the continual input of deleterious mutations and stochastic variation in reproduction causes the loss of individuals with the fewest deleterious variants. Lack of recombination precludes the selective elimination of chromosomal segments carrying deleterious variants, and thus they can increase in an inexorable fashion (Muller 1964). Nordborg (2000) demonstrates that under high levels of inbreeding, effective recombination can be decreased by almost 20-fold relative to an outbreeding population. While inbreeding populations are especially susceptible to ratchet effects on a genome-wide scale, even outbreeding species have genomic regions with limited effective recombination (Arnheim et al. 2003; McMullen et al. 2009). Both simulation studies (Felsenstein 1974) and empirical investigations in Drosophila melanogaster (Campos et al. 2012, 2014) indicate that deleterious variants accumulate in regions of limited recombination.
Efforts to identify individual deleterious variants and quantify them in individuals have led to a new branch of genomics research. In humans, examination of the contribution of rare deleterious variants to heritable disease has contributed to the emergence of personalized genomics (Abecasis et al. 2010; Cooper et al. 2010; Marth et al. 2011). Current estimates suggest that an average human may carry ~300 loss-of-function variants (Abecasis et al. 2010; Agrawal and Whitlock 2012). Individual humans carry approximately three lethal equivalents (mutations that would be lethal in the homozygous state) (Gao et al. 2015; Henn et al. 2015), and up to tens of thousands of weakly deleterious variants in coding and functional noncoding regions of the genome (Arbiza et al. 2013). These variants are enriched for mutations that are causative for diseases (Kryukov et al. 2007; Marth et al. 2011). As such they are expected to have appreciable negative selection coefficients (Nes) and be kept at low frequencies due to the action of purifying selection.
Humans are not unique in harboring substantial numbers of deleterious variants. It is estimated that almost 40% of nonsynonymous variants in Saccharomyces cerevisiae have deleterious effects (Doniger et al. 2008) and 20% of nonsynonymous variants in rice (Lu et al. 2006), Arabidopsis thaliana (Günther and Schmid 2010), and maize (Mezmouk and Ross-Ibarra 2014) are deleterious. Cruz et al. (2008) identified an excess of nonsynonymous SNPs segregating in domesticated dogs with respect to grey wolves. A similar pattern has been found in horses (Schubert et al. 2014), suggesting that an increased prevalence of deleterious variants may be a “cost of domestication.”
Genetic bottlenecks associated with domestication (Eyre-Walker et al. 1998) may allow deleterious variants to drift to higher frequency (Robertson 1960). The selective sweeps associated with domestication and improvement (Wright et al. 2005) would decrease nucleotide diversity in affected genomic regions (Smith and Haigh 1974; Kaplan et al. 1989), and subsequently reduce the effective recombination rate. The selective and demographic processes of domestication and improvement lead to three basic hypotheses about the distribution of deleterious variants in crop plants: i) the relative proportion of deleterious variants will be higher in domesticates than wild relatives; ii) deleterious variants will be enriched near loci of agronomic importance subjected to strong selection during domestication and improvement; iii) the relative proportion of deleterious variants will be lower in elite cultivars than landraces due to strong selection for yield (Gaut et al. In Review).
Approaches to identify deleterious mutations take one of two forms. Quantitative genetic methods have been proposed that make use of phenotypic measurements to investigate the aggregate impact of potentially deleterious alleles (Kelly 1999). These approaches require phenotypic measurements of pedigreed individuals to estimate the net effect of potentially deleterious alleles on trait variation. While quantitative genetic approaches allow researchers to estimate the contribution of deleterious alleles to additive genetic variance to a particular trait, they do not yield information about any individual genetic variant. Bioinformatic approaches, on the other hand, make use of measures of sequence conservation to identify variants with the greatest probability of being deleterious. When combined with genome-scale resequencing, they permit the identification of large numbers of putatively deleterious variants. Commonly applied approaches include SIFT (Sorting Intolerant From Tolerated) (Ng 2003), PolyPhen2 (Polymorphism Phenotyping) (Adzhubei et al. 2010), and a likelihood ratio test (LRT) (Chun and Fay 2009). These sequence conservation approaches operate in the absence of phenotypic data, but allow assessment of individual sequence variants. Recent advances in resequencing and sequence conservation methods have led to the suggestion that removal of deleterious variants from breeding populations presents a novel path for crop improvement (Morrell et al. 2011).
In this study, we investigate the distribution of deleterious variants in thirteen elite barley (Hordeum vulgare ssp. vulgare) and seven elite soybean (Glycine max) cultivars using exome and whole genome resequencing. We seek to answer four questions about the presence of deleterious variants: i) How many deleterious variants do individual cultivars harbor, and what proportion of these are nonsense (early stop codons) versus nonsynonymous (missense) variants? ii) What proportion of nonsynonymous variation is inferred to be deleterious? iii) How many known phenotype-altering SNPs are inferred to be deleterious? iv) How does the relative frequency of deleterious variants vary with recombination rate? We identify an average of ~1,000 deleterious variants per accession in our barley sample and ~700 deleterious variants per accession in our soybean sample. Approximately 40% of the deleterious variants are private to one individual in both species, suggesting the potential for selection for individuals with a reduced number of deleterious variants. Approximately 3-6% of nonsynonymous variants are inferred to be deleterious by all three approaches, and known causative SNPs annotate as deleterious at a much higher proportion than the genomic average. In soybean, where appropriate recombination rates are available, the proportion of deleterious variants is negatively correlated with recombination rate.
Materials and Methods
Plant Material and DNA Sequencing
The exome resequencing data reported here include thirteen cultivated barleys and two wild barley accessions. Barley exome capture was based on a 60 Mb liquid-phase Nimblegen capture design (Mascher et al. 2013). For the soybean sample, we resequenced whole genomes of seven elite soybean cultivars and used previously generated whole genome sequence of Glycine soja (Kim et al. 2010). Each sample was prepared and sequenced with manufacturer protocols (Illumina, San Diego, CA) to at least 25x coverage of the target with 76 bp, 100 bp, or 151 bp paired-end reads. A summary of samples and sequencing statistics is given in Table S1.
Read Mapping and SNP Calling
DNA sequence handling followed the “Genome Analysis Tool Kit (GATK) Best Practices” workflow from the Broad Institute (broadinstitute.org/gatk/guide/topic?name=best-practices). Our workflow for read mapping and SNP calling is depicted in Figure S1. First, reads were checked for proper length, Phred score distribution, and k-mer contamination with FastQC (bioinformatics.babraham.ac.uk/projects/fastqc/). Primer and adapter sequence contamination was then trimmed from barley reads using Scythe (github.com/vsbuffalo/scythe), using a prior on contamination rate of 0.05. Low-quality bases were then removed with Sickle (github.com/najoshi/sickle), with a minimum average window Phred quality of 25, and window size of 10% of the read length. Soybean reads were trimmed using the fastq-mcf tool in the ea-utils package (code.google.com/p/ea-utils/). Post-alignment processing and SNP calling were performed with the GATK v. 3.1 (McKenna et al. 2010; DePristo et al. 2011).
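The windowed quality trimming step can be sketched in a few lines. This is a simplified illustration of the idea (cut the read where the mean Phred quality in a sliding window first drops below a threshold), not a reimplementation of Sickle; the function name is hypothetical, and only 3' trimming is shown.

```python
def window_trim_3prime(quals, min_avg=25, window_frac=0.10):
    """Return the number of bases to keep from the 5' end of a read.

    quals is a list of per-base Phred scores. A window spanning
    window_frac of the read length slides along the read; the read is
    cut at the start of the first window whose mean quality is below
    min_avg. Simplified sketch, not Sickle's exact algorithm.
    """
    n = len(quals)
    if n == 0:
        return 0
    w = max(1, int(n * window_frac))
    for start in range(0, n - w + 1):
        window = quals[start:start + w]
        if sum(window) / w < min_avg:
            return start
    return n


if __name__ == "__main__":
    # Hypothetical read: 20 high-quality bases followed by 20 poor ones.
    read_quals = [30] * 20 + [2] * 20
    keep = window_trim_3prime(read_quals)
    print(keep, read_quals[:keep])
```

Scaling the window with read length, as in the settings above, keeps the trimming behavior comparable across the 76 bp, 100 bp, and 151 bp read sets.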
Barley reads were aligned to the Morex draft genome sequence (Mayer et al. 2012) using BWA-MEM (Li and Durbin 2009). We tuned the alignment reporting parameter and the gapping parameters to allow ~2% mismatch between the reads and reference sequence, which is roughly equivalent to the highest estimated nucleotide diversity observed at a locus in barley coding sequence (Morrell et al. 2003, 2006, 2014). The resulting SAM file was trimmed of unmapped reads with Samtools (Li et al. 2009), sorted, and trimmed of duplicate reads with Picard tools (picard.sourceforge.net/). We then realigned around indels, using a set of 100 previously known indels from Sanger resequencing of 25 loci (Caldwell et al. 2006; Morrell and Clegg 2007; Morrell et al. 2014). Sequence coverage was estimated with ‘bedtools genomecov,’ using the regions included in the Nimblegen barley exome capture design (https://sftp.rch.cm/diagnostics/sequencing/nimblegen_annotations/ez_barley_exome/barley_exome.zip). Individual sample alignments were then merged into a multisample alignment for variant calling. A preliminary set of variants was called with the GATK HaplotypeCaller with a heterozygosity (average pairwise diversity) value of 0.008, based on average coding sequence diversity reported for cultivated barley (Morrell et al. 2014). This preliminary set of variants was filtered to sites with a genotype score of 40 or greater, heterozygous calls in at most two individuals, and read depth of at least five reads. We then used the filtered variants, SNPs identified in the Sanger resequencing data set, and 9,605 SNPs from genotyping assays: 5,010 from the James Hutton Institute (Comadran et al. 2012), and 4,595 from Illumina GoldenGate assays (Close et al. 2009) as input for the GATK VariantRecalibrator to obtain a final set of variant calls.
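The preliminary site filter can be expressed as a short predicate. The sketch below is illustrative only: it assumes each site is represented as a list of per-sample records with hypothetical keys `gt` (genotype string), `gq` (genotype score), and `dp` (read depth), and it assumes the quality and depth thresholds apply to every called sample at the site. The scripts actually used are in the GitHub repository cited in this section.

```python
def is_het(gt: str) -> bool:
    """True for a called diploid genotype with two different alleles."""
    alleles = gt.replace("|", "/").split("/")
    return len(alleles) == 2 and "." not in alleles and alleles[0] != alleles[1]


def passes_preliminary_filter(samples, min_gq=40, max_het=2, min_dp=5):
    """Apply the three thresholds described in the text to one site.

    samples: list of dicts such as {"gt": "0/1", "gq": 50, "dp": 12}.
    Assumption: samples with missing genotypes are ignored, and all
    remaining samples must meet the score and depth thresholds.
    """
    called = [s for s in samples if "." not in s["gt"]]
    if not called:
        return False
    if sum(is_het(s["gt"]) for s in called) > max_het:
        return False
    return all(s["gq"] >= min_gq and s["dp"] >= min_dp for s in called)


if __name__ == "__main__":
    # Hypothetical site with three samples.
    site = [
        {"gt": "0/0", "gq": 60, "dp": 20},
        {"gt": "1/1", "gq": 45, "dp": 8},
        {"gt": "0/1", "gq": 50, "dp": 10},
    ]
    print(passes_preliminary_filter(site))
```

Capping the number of heterozygous calls is a natural filter for inbred lines, where excess heterozygosity more often reflects paralogous mapping than true segregating variation.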
Processing of soybean samples is as described above, but with the following modifications. Soy reads were aligned to the Williams 82 reference genome sequence (Schmutz et al. 2010). Mismatch and reporting parameters for the cultivated samples were adjusted to allow for ~1% mismatch between reads and reference, which is approximately the highest typical genic sequence diversity in soybean cultivars (Hyten et al. 2006). The alignments were trimmed and sorted as described above. Preliminary variants were called as in the barley sample, but with a heterozygosity value of 0.001, which is the nucleotide diversity reported by Hyten et al. (2006). Final variant calls were obtained in the same way as described for the barley sample, using SNPs on the SoySNP50K chip (Song et al. 2013) as known variants.
Read mapping scripts, variant calling scripts, and variant filtering scripts for both barley and soybean are available on GitHub at (github.com/MorrellLAB/Deleterious_Mutations).
SNP Classification
Barley SNPs were identified as coding or noncoding using the Generic Feature Format v3 (GFF) file provided with the reference genome (Mayer et al. 2012). A custom Python script was then used to identify coding barley SNPs as synonymous or nonsynonymous. Soybean SNPs were assigned functional classes based on primary transcripts using the Variant Effect Predictor (VEP) from Ensembl (ensembl.org/info/docs/tools/vep/index.html). Nonsynonymous SNPs were then assessed using SIFT (Ng 2003), PolyPhen2 (Adzhubei et al. 2010) using the ‘HumDiv’ model, and a likelihood ratio test comparing codon evolution under selective constraint to neutral evolution (Chun and Fay 2009). For the likelihood ratio test, we used the phylogenetic relationships between 37 Angiosperm species based on genic sequence from complete plant genome sequences available through Phytozome (phytozome.jgi.doe.gov/) and Ensembl Plants (plants.ensembl.org/). The LRT is implemented as a Python package we call ‘BAD_Mutations’ (BLAST Aligned-Deleterious Mutations; github.com/MorrellLAB/BAD_Mutations). Coding sequences from each genome were downloaded and converted into BLAST databases. The coding sequence from the query species was used to identify the best match from each species using TBLASTX. The best match from each species was then aligned using PASTA (Mirab et al. 2014), a phylogeny-aware alignment tool. The resulting alignment was then used as input to the likelihood ratio test for the affected codon. The LRT was performed on codons with a minimum of 10 species represented in the alignment at the queried codon. Reference sequences were masked from the alignment to reduce the effect of reference bias (Simons et al. 2014). A SNP was identified as deleterious if the p-value for the test was less than 0.05, with a Bonferroni correction applied based on the number of tested codons, and if either the alternate or reference allele was not seen in any of the other species.
A full list of species names and genome assembly and annotation versions used is available in Table S4.
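The decision rule applied to the LRT output can be summarized in a minimal sketch. The function and argument names are hypothetical and do not reflect the BAD_Mutations code base; the number of tested codons in the example is an arbitrary placeholder.

```python
def lrt_calls_deleterious(p_value, n_tested_codons, n_species_at_codon,
                          ref_seen_in_others, alt_seen_in_others,
                          alpha=0.05, min_species=10):
    """Apply the thresholds described in the text to one tested codon.

    Returns None when too few species are aligned at the codon (the
    test is not performed), otherwise True or False.
    """
    if n_species_at_codon < min_species:
        return None
    # Bonferroni correction based on the number of tested codons.
    threshold = alpha / n_tested_codons
    significant = p_value < threshold
    # At least one of the two alleles must be absent from all other species.
    allele_unseen = (not ref_seen_in_others) or (not alt_seen_in_others)
    return significant and allele_unseen


if __name__ == "__main__":
    # Hypothetical codon: strongly constrained, alternate allele never observed.
    print(lrt_calls_deleterious(1e-8, 50000, 25, True, False))
```

Returning None for untested codons keeps "not deleterious" distinct from "not testable", which matters when computing the proportion of nonsynonymous SNPs inferred to be deleterious.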
Inference of Ancestral State
Prediction of deleterious mutations is complicated by reference bias (Chun and Fay 2009; Simons et al. 2014), which manifests in two ways. First, individuals that are closely related to the reference line used for the reference genome will appear to have fewer genetic variants, and thus fewer inferred nonsynonymous and deleterious variants. Second, when the reference strain carries a derived allele at a polymorphic site, that site is generally not predicted to be deleterious (Simons et al. 2014). To address the issue of reference bias, we polarized all coding variants by ancestral and derived state, rather than reference and non-reference state. Ancestral states were inferred for SNPs in gene regions by inferring the majority state in the most closely related clade from the consensus phylogenetic tree for the species included in the LRT. For barley, the ancestral states were inferred from gene alignments of Aegilops tauschii, Brachypodium distachyon, and Triticum urartu. For soybean, ancestral states were inferred using Medicago truncatula and Phaseolus vulgaris. This approach precludes universal inference of ancestral state for noncoding variants. However, examination of alignments of intergenic sequence in Triticeae species and in Glycine species showed that alignments outside of protein coding sequence are not reliable for ancestral state inference (data not shown).
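Majority-rule inference of ancestral state and the subsequent polarization can be sketched as follows. This is a minimal illustration under the assumption that ties and missing outgroup data leave the site unpolarized; the function names are hypothetical and the handling of edge cases in our scripts may differ.

```python
from collections import Counter

VALID_BASES = {"A", "C", "G", "T"}


def infer_ancestral(outgroup_alleles):
    """Return the majority base among outgroup sequences, or None.

    Gaps, Ns, and empty entries are ignored. A tie between the two
    most common bases is treated as ambiguous (assumption).
    """
    observed = [a.upper() for a in outgroup_alleles if a and a.upper() in VALID_BASES]
    if not observed:
        return None
    counts = Counter(observed).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def polarize(ref, alt, ancestral):
    """Return the derived allele, or None if the site cannot be polarized."""
    if ancestral == ref:
        return alt
    if ancestral == alt:
        return ref
    return None


if __name__ == "__main__":
    # Hypothetical site: the reference carries the derived allele.
    anc = infer_ancestral(["A", "A", "G"])
    print(anc, polarize(ref="G", alt="A", ancestral=anc))
```

In the example, the reference base is the derived state, which is exactly the case in which reference-based polarization would miss a potentially deleterious variant.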
Results
Identification of Deleterious SNPs
Resequencing and read mapping followed by read de-duplication resulted in an average of ~39X exome coverage for our barley samples and ~38X genome coverage in soybean. After realignment and variant recalibration, we identified 652,797 SNPs in thirteen cultivated and two wild barley lines. The majority of these SNPs were noncoding, with 522,863 occurring outside of CDS annotations. Of the coding SNPs, 70,069 were synonymous, and 59,865 were nonsynonymous. The list of differences from reference carried by each barley sample is summarized in Table 1, and a per-approach summary of deleterious variants is given in Table 2. SIFT identified 13,626 SNPs as deleterious, PolyPhen2 identified 13,534 SNPs as deleterious, and the LRT identified 17,865 as deleterious. The intersection of all three methods gives a much smaller set of deleterious variants, with a total of 4,872 nonsynonymous SNPs identified as deleterious. While individual methods identified ~18% of nonsynonymous variants as deleterious, the intersection of methods identifies 5.7%. A derived site frequency spectrum (SFS) of our barley sample is shown in Figure 1A.
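Combining the three prediction approaches amounts to set operations over SNP identifiers. The sketch below uses hypothetical SNP names and a hypothetical helper; it is meant only to show how the "all three methods" and "at least two methods" sets relate, not to reproduce the counts reported above.

```python
from collections import Counter


def called_by_at_least(call_sets, k):
    """Return SNP identifiers called deleterious by at least k approaches.

    call_sets: iterable of collections of SNP identifiers, one per approach.
    """
    counts = Counter(snp for calls in call_sets for snp in set(calls))
    return {snp for snp, n in counts.items() if n >= k}


if __name__ == "__main__":
    # Toy identifiers, not real SNPs from this study.
    sift = {"snp1", "snp2", "snp3"}
    polyphen2 = {"snp2", "snp3", "snp4"}
    lrt = {"snp3", "snp4", "snp5"}
    print(sorted(called_by_at_least([sift, polyphen2, lrt], 3)))
    print(sorted(called_by_at_least([sift, polyphen2, lrt], 2)))
```

Requiring agreement among all three approaches is the most conservative choice; relaxing the requirement to two approaches enlarges the set at the cost of more false positives.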
In soybean, we called 586,102 SNPs in gene regions. Of these, 542,558 occur in the flanking regions of a gene model. We identify 73,577 SNPs with a synonymous consequence, and 99,685 with a nonsynonymous consequence (Table 3). SNPs in the various classes sum to greater than the total as a single SNP in multiple transcripts can have multiple functional classes. For instance, a SNP may be intronic in one transcript, but be in an exon of a different one. SIFT identified 7,694 of the nonsynonymous SNPs as deleterious, PolyPhen2 identified 14,933 as deleterious, and the LRT identified 11,223 as deleterious. As in the barley sample, the proportion of putatively deleterious variants was similar across prediction approaches, with the exception of SIFT, which failed to find alignments for many genes. The overlap of prediction approaches identified 3,041 (2.6%) of nonsynonymous variants to be deleterious (Table 4). Derived allele frequency distributions are shown in Figure 1B. Variants inferred to be deleterious are generally at lower derived allele frequency than other classes of variation, consistent with these variants being truly deleterious.
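A derived site frequency spectrum of the kind shown in Figure 1 can be tabulated from per-site derived allele counts. The following is a minimal sketch with a hypothetical function name and toy input; it assumes sites have already been polarized and that fixed or absent derived alleles are excluded.

```python
from collections import Counter


def derived_sfs(derived_counts, n_chromosomes):
    """Tabulate an unfolded SFS from derived allele counts.

    Returns a list in which entry i - 1 is the number of segregating
    sites with exactly i derived copies, for i = 1 .. n_chromosomes - 1.
    Sites with 0 or n_chromosomes derived copies are not segregating
    in the sample and are dropped.
    """
    tally = Counter(c for c in derived_counts if 0 < c < n_chromosomes)
    return [tally.get(i, 0) for i in range(1, n_chromosomes)]


if __name__ == "__main__":
    # Toy counts for six sampled chromosomes.
    print(derived_sfs([1, 1, 2, 5, 0, 6], n_chromosomes=6))
```

Computing the spectrum separately for each functional class (synonymous, tolerated, deleterious) allows the excess of rare deleterious variants to be compared directly.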
Nonsense variants made up a relatively small proportion of putatively deleterious variants. In our barley sample, we identify a total of 711 nonsense variants, 14.5% of our putatively deleterious variants. In soybean, we identify 1,081 nonsense variants, which make up 15.7% of putatively deleterious variants. Nonsense variants have a higher heterozygosity than tolerated, silent, or deleterious missense variants (Figure S2). While the absolute differences in heterozygosity were small due to the inbred nature of our samples, the pattern suggests that nonsense variants are more strongly deleterious than just missense variants.
Deleterious Mutations and Causative Variants
Bioinformatic approaches to identifying deleterious variants rely on sequence constraint to estimate protein functional impact. An example of a deleterious variant showing a derived base substitution that alters a phylogenetically conserved codon is shown in Figure 2. The variants identified in these approaches should be enriched for variants that cause large phenotypic changes. We identified 23 nonsynonymous variants inferred to contribute to known phenotypic variation in barley and 11 in soybean and tested the effect of these variants in our prediction pipeline. Of the 23 putatively causative mutations in barley, 6 (26%) were inferred to be deleterious (Table S5). Of the 11 putatively causative mutations in soybean, 5 (45%) were inferred to be deleterious. This contrasts with the genome-wide average of ~3-6%, showing that variants that annotate as deleterious are more likely to impact phenotypes.
Deleterious Mutations and Genetic Map Distance
The purging of deleterious variants from populations is greatly affected by the effective recombination rate, which is related to the ratio of genetic distance to physical distance. To examine the relationship between the number of deleterious variants and recombination rate, we used a high-density genetic map from a soybean recombinant inbred line family (Lee et al. 2015). The soybean map was based on a subset of the SoySNP50K genotyping platform (Song et al. 2013). There was a weak but significant negative correlation between recombination rate and the proportion of nonsynonymous SNPs inferred to be deleterious (r² = 0.007, p < 0.001, Figures 3, S3). We did not examine this relationship in barley because the barley reference genome assembly (Mayer et al. 2012) contains limited physical distance information.
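The quantities involved in this comparison can be sketched with the standard library alone. The example assumes recombination rate is expressed as cM/Mb per genomic interval and uses a plain Pearson correlation; the function names and the toy values are hypothetical, and the windowing and significance testing used for Figure 3 are not reproduced here.

```python
import math


def recombination_rate(cm_span, mb_span):
    """Recombination rate as genetic distance over physical distance (cM/Mb)."""
    if mb_span <= 0:
        raise ValueError("physical span must be positive")
    return cm_span / mb_span


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Toy intervals: (cM span, Mb span) and proportion of deleterious SNPs.
    rates = [recombination_rate(c, m) for c, m in [(0.5, 2.0), (2.0, 2.0), (6.0, 2.0)]]
    prop_deleterious = [0.08, 0.05, 0.04]
    r = pearson_r(rates, prop_deleterious)
    print(r, r ** 2)
```

Squaring r gives the proportion of variance explained, the statistic reported above; the sign of r indicates the direction of the relationship.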
Discussion
Questions regarding the prevalence of deleterious variants date back over half a century (Fisher 1930; Muller 1950). In finite populations, the segregation of deleterious mutations can have a substantial impact on population mean fitness (Kimura et al. 1963). While it has been argued that the concept of a reduction of fitness relative to a hypothetical optimal genotype is irrelevant (Wallace 1970), mutation accumulation studies have shown that the accumulation of deleterious mutations has a significant effect on absolute fitness (Schultz et al. 1999; Shaw et al. 2002).
Our results demonstrate that a large number of putatively deleterious variants persist in individual cultivars in both barley and soybean. The approaches used in this study predict the probability that a given amino acid or nucleotide substitution disrupts protein function. Mutations that alter phenotypes may be especially likely to annotate as deleterious, and we show that a high proportion of inferred causative mutations annotate as deleterious. It should be noted that variants identified as deleterious may affect a phenotype that is adaptive in only part of the species range or has a transient selective advantage - i.e., locally or temporally adaptive phenotypes. If the portion of the range in which the phenotype is adaptive is small or the selective advantage is transient, such variants will be kept at low frequencies and be identified as deleterious. Just as few variants are expected to be globally advantageous, a portion of deleterious variants are likely to not be globally disadvantageous. Such variants could be either locally or temporally advantageous, with a fitness advantage under some circumstances contributing to their maintenance in populations (Tiffin and Ross-Ibarra 2014).
At the molecular level, variants occurring in minor transcripts of genes may exhibit conditional neutrality (Tiffin and Ross-Ibarra 2014), and Nes will be too low for purifying selection to act. Gan et al. (2011) identified many isoforms of genes among a diverse panel of Arabidopsis thaliana accessions, as well as compensatory mutations for a majority of frameshift mutations. Genetic variants that annotated as nonsynonymous or nonsense using the A. thaliana reference are frequently spliced out of the transcript such that the gene still produces a full-length and functional product. In a similar vein, deleterious variants are often accompanied by multiple compensatory mutations that alleviate their fitness effects (Poon and Otto 2000; Poon and Chao 2005). The occurrence of the preponderance of putatively deleterious variants in the rarest frequency classes (Figure 1), and a higher level of observed heterozygosity for putatively deleterious variants (Figure S2) are both consistent with the action of purifying selection on variants with negative impacts on fitness. Putatively disease-causing variants in human populations have also been observed to occur at low frequencies and to occur over a more geographically restricted range (Marth et al. 2011).
Comparison of Identification Methods
Each of the methods used here to identify deleterious variants makes use of sequence constraint across a phylogenetic relationship. They differ in terms of the models used to assess the functional effect of a variant. SIFT uses a heuristic, which determines if a nonsynonymous variant alters a conserved site based on an alignment built from PSI-BLAST results (Ng 2003). PolyPhen2 is similar, but additionally identifies potential disruptions in secondary or tertiary structure of the encoded protein (when this information is available) (Adzhubei et al. 2010). Both of these approaches estimate codon conservation from a multiple sequence alignment, but do not use phylogenetic relationships in their predictions. PolyPhen2 identified the largest number of variants as deleterious. The reason for this may be that the data used to train the PolyPhen2 model is from human disease-causing and neutral variants. Nonhuman systems may differ fundamentally as to which amino acid substitutions tend to have strong functional impact, which would reduce prediction accuracy in other species (Adzhubei et al. 2010). The LRT explicitly calculates the local synonymous substitution rate, and uses that to test whether an individual codon is under selective constraint or evolving neutrally (Chun and Fay 2009). It is a hypothesis-driven approach, and compares the likelihood of two evolutionary scenarios. Variants in selectively constrained codons are considered to be deleterious.
The SNPs predicted to be deleterious differ somewhat between prediction approaches. Even though SIFT and PolyPhen2 identify similar proportions of nonsynonymous SNPs as deleterious, they overlap at only ~50% of sites (Table 2). SNPs identified through at least two approaches seem more likely to be deleterious, based on lower average derived allele frequencies (Figure S4). Comparisons of the distribution of Grantham scores (Grantham 1974) show high similarity in the severity of amino acid replacements that are predicted to be deleterious by each approach (Figure S5). The effects of reference bias are apparent in SIFT and PolyPhen2. In barley and soybean, the reference genotypes are ‘Morex’ and ‘Williams 82’ respectively. Even when polarizing by ancestral and derived alleles, these genotypes show considerably fewer inferred deleterious variants (Table 2; Table 4).
Deleterious Variants in Crop Breeding
Identification and elimination of deleterious variants has been proposed as a potential means of improving plant fitness and crop yield (Morrell et al. 2011). Current plant breeding strategies using genome-wide prediction rely on estimating genome-wide marker effects on quantitative traits of interest (Meuwissen et al. 2001). Genome-wide prediction has been shown to be effective in both animals (Schaeffer 2006) and plants (Heffner et al. 2011; Jacobson et al. 2014), but these approaches rely on estimating marker contributions to a quantitative trait (i.e., a measured phenotypic effect). The genetic architecture of quantitative traits suggests that our ability to quantify the effects of individual loci will reach practical limits before we can identify loci contributing to the variance of many agronomic traits (Rockman 2012). Many traits of agronomic interest, particularly yield in grain crops, are quantitative and have a complex genetic basis. As such, they are under the influence of environmental effects and many loci (Falconer and Mackay 1996). QTL mapping approaches to identifying favorable variants for agronomic traits will reach practical limits, even for variants of large effect (King et al. 2012). Current genome-wide prediction and selection methodologies rely on estimating the combined effects of markers across the genome (Meuwissen et al. 2001), but this approach is limited by recombination rate and the ability to measure phenotypes of interest. The identification and purging of deleterious variants should provide a complementary approach to current breeding methodologies (Morrell et al. 2011).
Rise of Deleterious Variants Into Populations
The number of segregating deleterious variants in a species is very different from the number of de novo deleterious mutations in each generation, commonly identified as U. In humans, U is estimated at ~2 new deleterious variants per genome per generation (Agrawal and Whitlock 2012) and estimates from Arabidopsis suggest that U is approximately 0.1 (Schultz et al. 1999). U is the product of the per-base pair mutation rate, the genome size, and the fraction of the genome that is deleterious when mutated (Charlesworth 2012). Even though new mutations are constantly arising, the standing load of deleterious variation greatly exceeds the rate at which they arise (Charlesworth et al. 2004; Charlesworth 2012). However, our results show that ~40% of our inferred deleterious variants are private to individual cultivars, suggesting that they can be purged from breeding programs.
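The definition of U as a product of three quantities can be written out directly. The sketch below uses a hypothetical function name and purely illustrative input values that are not estimates for any species discussed here; whether the genome size is counted per haploid or diploid genome is left to the caller.

```python
def genomic_deleterious_rate(mu_per_bp, genome_size_bp, fraction_deleterious):
    """Return U, the expected number of new deleterious mutations per generation.

    U = per-base-pair mutation rate x genome size x fraction of the
    genome that is deleterious when mutated.
    """
    if not 0.0 <= fraction_deleterious <= 1.0:
        raise ValueError("fraction_deleterious must be between 0 and 1")
    return mu_per_bp * genome_size_bp * fraction_deleterious


if __name__ == "__main__":
    # Illustrative values only (not estimates from this study).
    print(genomic_deleterious_rate(mu_per_bp=2e-9, genome_size_bp=5e8,
                                   fraction_deleterious=0.05))
```

Because U scales linearly with each term, species with larger genomes or a larger constrained fraction are expected to receive more new deleterious mutations per generation at the same per-base mutation rate.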
In the current study, we restricted our analyses to protein coding regions, but additional recent evidence suggests that deleterious variants can accumulate in conserved non-coding sequences, such as transcription factor binding sites (Arbiza et al. 2013). As such, analysis of the protein-coding regions of genomes presents a lower-bound on the estimates of the number of deleterious variants segregating in populations. Efforts to identify deleterious variants in noncoding sequence are limited by scant knowledge of functional constraints on noncoding genomic regions, and difficulty in aligning noncoding regions from all but the most closely related taxa (Doniger et al. 2008). Annotation of noncoding sequence will uncover additional deleterious variants, but a majority of putatively deleterious variants will be in coding regions. The several thousand putatively deleterious variants we identify per individual cultivar should provide ample targets for selection of recombinant progeny in a breeding program.
Author Contributions
TJYK, RMS, PT, and PLM designed the research. KPS and PLM provided input on which barley lines to sample, and RMS provided sequence data for soybean lines. Barley read mapping, variant calling and assessment with SIFT and PolyPhen2 were performed by TJYK and CL. Soybean data analysis was performed by FF with assistance from TJYK. Code for the likelihood ratio test was developed by TJYK, PJH, and JCF. Breeding history and causative mutations list were provided by MM. TJYK and PLM wrote the manuscript.
Acknowledgements
The authors thank Brandon Gaut, Michael Kantar, and Ana Poets for helpful comments on an earlier version of the manuscript. This work was supported by a USDA NIFA National Needs Fellowship and a MnDrive 2014 Food Security Fellowship (in support of TJYK). Support was also provided by the Minnesota Agricultural Experiment Station Variety Development fund and U.S. NSF Plant Genome Program (DBI-1339393). This research was carried out with hardware and software support provided by the Minnesota Supercomputing Institute (MSI) at the University of Minnesota.