1 Abstract
Hybridization between related species results in the formation of an allopolyploid with multiple subgenomes. These subgenomes will each contain complete, yet evolutionarily divergent, sets of genes. Like a diploid hybrid, allopolyploids will have two versions, or homeoalleles, for every gene. Partial functional redundancy between homeologous genes should result in a deviation from additivity. These epistatic interactions between homeoalleles are analogous to dominance effects, but are fixed across subgenomes through self pollination. An allopolyploid can be viewed as an immortalized hybrid, with the opportunity to select and fix favorable homeoallelic interactions within inbred varieties. We present a subfunctionalization epistasis model to estimate the degree of functional redundancy between homeoallelic loci and a statistical framework to determine their importance within a population. We provide an example using the homeologous dwarfing genes of allohexaploid wheat, Rht-1, and search for genome-wide patterns indicative of homeoallelic subfunctionalization in a breeding population. Using the IWGSC RefSeq v1.0 sequence, 23,796 homeoallelic gene sets were identified and anchored to the nearest DNA marker to form 10,172 homeologous marker sets. Interaction predictors constructed from products of marker scores were used to fit the homeologous main and interaction effects, as well as estimate whole genome genetic values. Some traits displayed a pattern indicative of homeoallelic subfunctionalization, while other traits showed a less clear pattern or were not affected. Using genomic prediction accuracy to evaluate importance of marker interactions, we show that homeologous interactions explain a portion of the non-additive genetic signal, but are less important than other epistatic interactions.
2 Introduction
Whole genome duplication events are ubiquitous in the plant kingdom. The impact of these duplications on angiosperm evolution was not truly appreciated until the ability to sequence entire genomes elucidated their omnipresence (Soltis et al. 2009). Haldane (1933), postulated that single gene duplication allowed one copy to diverge through mutation while metabolic function was maintained by the other copy. (Ohno 1970) reintroduced this hypothesis, and it has since been validated both theoretically (Ohta 1987; Walsh 1995; Lynch and Conery 2000), and empirically (Blanc and Wolfe 2004; Duarte et al. 2005; Liu, Baute, and Adams 2011; Assis and Bachtrog 2013). The duplicated gene hypothesis does not, however, generally explain the apparent advantage of duplicating an entire suite of genes. The necessity of genetic diversity for plant populations to survive and adapt to divergent or changing environments may help to explain this pervasive phenomenon.
The need for gene diversity can become more immediate in plants than in animals, where the latter can simply migrate to “greener pastures” when conditions become unfavorable. Plants lack substantial within generation mobility and must therefore change gene expression to cope with changing environmental conditions. Many species maintain gene diversity through alternate splicing, but this has been shown to be less common in plants than in other eukaryotes (Nagasaki et al. 2005). Whole genome duplication can generate the raw materials for the maintenance of genetic diversity (Wendel 2000; Adams and Wendel 2005). Gault (2018) demonstrated that similar sets of duplicated genes were preserved in two related genera, Zea and Tripsacum, millions of years after a shared paleopolyploidization event. This conserved pattern in purifying selection suggests that, at least for some genes, there is a clear advantage to maintaining two copies.
The union of two complete, yet divergent, genomes during the formation of an allopolyploid introduces manifold novel gene pathways that can specialize to specific tissues or environments (Blanc and Wolfe 2004). Similar to diploid hybrids, the formation of an allopolyploid results in a homogeneous population, but heterozygosity is maintained across homeologous sites rather than homologous sites. Unlike diploid hybrids that lose heterozygosity in subsequent generations, the homeoallelic heterozygosity is fixed through selfing in the allopolyploid. Mac Key (1970) postulated a trade off between new-creating (allogamous) and self preserving (autogamous) mating systems, where allopolyploids favor self pollination to preserve diverse sets of alleles across their subgenomes. As such, an allopolyploid may be thought of as an immortalized hybrid, with heterosis fixed across subgenomes (Ellstrand and Schierenbeck 2000; Feldman et al. 2012). While still hotly debated, evidence is mounting that allopolyploids exhibit a true heterotic response as traditional hybrids have demonstrated (Wendel 2000; Adams and Wendel 2005; Chen 2010; Chen 2013).
Birchler et al. (2010) note that newly synthesized allopolyploids often outperform their sub-genome progenitors, and that the heterotic response appears to be exaggerated in wider inter-specific crosses. This seems to hold true even within species, where autopolyploids tend to exhibit higher vigor from wider crosses (Bingham et al. 1994; Segovia-Lerma et al. 2004). The overwhelming prevalence of allopolyploidy to autopolyploidy in plant species (Soltis and Soltis 2009) may suggest that it is the increase in allelic diversity per se that is the primary driver for this observed tendency toward genome duplication. Instead of allowing genes to change function after a duplication event, alleles may develop novel function prior to their reunion during an allopolyploidization event. The branched gene networks of the allopoloyploid may provide the organism with the versatility to thrive in a broader ecological landscape than those of its sub-genome ancestors (Mac Key 1970; Ellstrand and Schierenbeck 2000; Osborn et al. 2003).
Subfunctionalization and neofunctionalization are often described as distinct evolutionary processes. Neofunctionalization implies the duplicated genes have completely novel, non-redundant function (Ohno 1970). Subfunctionalization is described as a partitioning of ancestral function through degenerate mutations in both copies, such that both genes must be expressed for physiological function (Stoltzfus 1999; Force et al. 1999; Lynch and Force 2000). However, barring total functional gene loss, many mutations will have some quantitative effect on protein kinetics or expression (Zeng and Cockerham 1993). Duplicated genes will demonstrate some quantitative degree of functional redundancy until the ultimate fate of neofunctionalization (i.e. complete additivity) or gene loss (pseudogenization) of one copy. It has been proposed that essentially all neofunctionalization processes undergo a subfunctionalization transition state (Rastogi and Liberles 2005).
If the mutations occur before the duplication event, as in allopolyploidy, the two variants are unlikely to have degenerate mutations. Instead, they may have differing optimal conditions in which they function or are expressed. The advantage of different variants at a single locus (Allard and Bradshaw 1964, alleles) or at duplicated loci (Mac Key 1970, homeoalleles) can result in greater plasticity to environmental changes. Allopolyploidization has been suggested as an evolutionary strategy to obtain the genic diversity necessary for invasive plant species to adapt to the new environments they invade (Ellstrand and Schierenbeck 2000; Beest et al. 2011).
Adams et al. (2003) showed that some homeoallelic genes in cotton were expressed in an organ specific manner, such that expression of one homeolog effectively suppressed the expression of the other in some tissues. These results have since been confirmed in other crops such as wheat (Pumphrey et al. 2009; Akhunova et al. 2010; Feldman et al. 2012; Pfeifer et al. 2014), and evidence for neofunctionalization of homeoallelic genes has been observed (Chaudhary et al. 2009). Differential expression of homeologous gene transcripts has also been shown to shift upon challenge with heat, drought (Liu et al. 2015) and salt stress (Zhang et al. 2016) in wheat, as well as water submersion and cold in cotton (Liu and Adams 2007).
Common wheat (Triticum aestivum) provides an example of an allopolyploid that has surpassed its diploid ancestors in its value to humans as a staple source of calories. Hexaploid wheat has undergone two allopolyploid events, the most recent of which occurred between 10 and 400 thousand years ago, adding the D genome to the A and B genomes (Marcussen et al. 2014). The gene diversity provided by these three genome ancestors may explain why allohexaploid wheat has adapted from its source in southwest Asia to wide spread cultivation around the globe (Dubcovsky and Dvorak 2007; Feldman and Levy 2012).
In the absence of outcrossing in inbred populations, selection can only act on individuals, changing their frequency within the population. If the selection pressure changes (e.g. for modern agriculture), combinations of homeoalleles within existing individuals may not be ideal for the new set of environments and traits. This presents an opportunity for plant breeders to capitalize on this feature of allopolyploids by making crosses to form new individuals with complementary sets of homeoalleles. Many of these advantageous combinations have likely been indirectly selected throughout the history of wheat domestication and modern breeding.
A crucial example involves the two homeologous dwarfing genes (Borner et al. 1996) important in the Green Revolution, which implemented semi-dwarf varieties to combat crop loss due to nitrogen application and subsequent lodging. We discuss this example in detail, and use it as a starting point to justify the search for homeologous interactions. While the effect of allopolyploidy has been demonstrated at both the transcript level and whole plant level, we are unaware of attempts to use genome-wide homeologous interaction predictors to model whole plant level phenotypes such as growth, phenology and grain yield traits.
Using a soft winter wheat breeding population, we demonstrate that epistatic interactions account for a significant portion of genetic variance and are abundant throughout the genome. Some of these interactions occur between homeoallelic regions and we demonstrate their potential as targets for selection. If advantageous homeoallelic interactions can be identified, they could be directly selected to increase homeoallelic diversity, with the potential to expand the environmental landscape to which a variety is adapted.
We hypothesize that the presence of two evolutionarily divergent genes with partially redundant function leads to a less than additive gene interaction, and introduce this as a subfunctionalization model of epistasis.
3 Subfunctionalization Epistasis
We generalize the duplicate factor model of epistasis from Hill et al. (2008), by introducing a subfunctionalization coefficient d, that allows the interaction to shift between the duplicate factor and additive models. Let us consider an ancestral allele with an effect a. Through mutation, the effect of this locus is allowed to diverge from the ancestral allele to have effects a* and ã in the two descendant species. When the two divergent loci are brought back together in the same nucleus, the effect of combining these becomes d(a* + ã) (Figure 1).
Values of d < 1, indicate a less-than additive epistasis (Eshed and Zamir 1996), in this case, resulting from redundant gene function. When d = 1/2, and a* = ã, the descendant alleles have maintained the same function and the duplicate factor model is obtained. As d exceeds 1/2, the descendant alleles diverge in function (i.e. subfunctionalization), until d reaches 1, implying that the two genes evolved completely non-redundant function (i.e neofunctionalization). At the point where d =1, the effect becomes completely additive.
For values of d > 1/2, the benefit of multiple alleles is realized in a model analogous to overdominance in traditional hybrids. As alleles diverge they can pick up advantageous function under certain environmental conditions. The homeo-heterozygote then gains an advantage if it experiences conditions of both adapted homeoalleles. Values of d < 1/2 may indicate allelic interference (Herskowitz 1987), a phenomenon that has been observed in many newly formed allopolyploids (Comai et al. 2003; McClintock 1984). Allelic interference, also referred to as dominant negative mutation, can result from the formation of non-functional homeodimers, while homodimers from the same ancestor continue to function properly. This interference effectively reduces the number of active dimers by half (Herskowitz 1987; Veitia 2007).
3.1 Epistasis models
Let us consider the two locus model, with loci B and C. Using the notation of Hill, Goddard and Visscher (2008), the phenotype, y, is modeled as where B and C are the marker allele scores, BC is the pairwise product of those scores, αB and αC are the additive effects of the B and C loci and ααBC is the interaction effect.
We revisit two epistatic models, the “Additive × Additive Model without Dominance or Interactions Including Dominance” (called Additive × Additive hence forth) and the “Duplicate Factor” considered by Hill, Goddard and Visscher (2008) that are relevant for this discussion. We also propose a generalized Duplicate Factor epistatic model to estimate the degree of gene functional redundancy, or subfunctionalization. Letting a be the effect on the phenotype, we can represent these epistatic models for inbreds in Table 1.
When markers are coded {0, 1} for presence of the functional allele, the deviation from the additive expectation, δ, is estimated by ααBC. δ can then be used to calculate the subfunctionalization coefficient, (Figure 2). Omitting the population mean, the least squares expectation of additive and epistatic effects is then
3.2 Epistatic contrasts
Epistatic interaction predictors must be formed from marker scores in order to estimate interaction parameters. These interaction predictors are typically calculated as the pairwise product of the genotype scores for their respective loci. This can lead to ambiguity in the meaning of those interaction effects depending on how the marker scores are coded. Different marker parameterizations can center the problem at different reference points (i.e. different intercepts), and can scale the predictors based on allele or genotype effects (i.e. different slopes).
When loci B and C are coded as {−1, 1} for inbred genotypes, including the product of the marker scores, BC, corresponds to the Additive × Additive model. Changing the reference allele at either locus does not change the magnitude of effect estimates but will change their signs. Using {0, 1} coding, BC corresponds to the subfunctionalization model and estimates δ directly. For this coding scheme, the magnitude and sign can change depending on the reference allele at the two loci. This highlights one of the difficulties of effect sign interpretation, as it is not clear which marker orientations should be paired. That is, which allele should be B as opposed to b, and which should be C as opposed to c? Marker alleles can be oriented to have either all positive or all negative additive effects, but the question remains: which direction should the more biologically active allele have on the phenotype?
Marker scores are typically assigned as either presence (or absence) of the reference, major, or minor allele, which may or may not be biologically relevant. While it has been noted that the two different marker encoding methods do not result in the same contrasts of genotypic classes (He, Wang, and Parida 2015; Martini et al. 2016; Martini et al. 2017), coding does not affect the least squares model fit (Zeng, Wang, and Zou 2005; Álvarez-Castro and Carlborg 2007). Álverez-Castro and Carlborg (2007) show that there exists a linear transformation to shift between multiple parameterizations using a change-of-reference operation (see Appendix A3). This is convenient because all marker orientation combinations can be easily generated by changing the effect signs of a single marker orientation fit for the {−1, 1} marker coding. These effect estimates can subsequently be transformed to the {0, 1} coding effect estimates using the change-of-reference operation for all marker orientation combinations.
4 Materials and Methods
4.1 RIL population
A bi-parental recombinant inbred line (RIL) population of 158 lines segregating for two dwarfing genes was used to illustrate an epistatic interaction between the well known homeologous genes on chromosomes 4B and 4D, Rht-B1 and Rht-D1, important in the Green Revolution. Two genotyping by sequencing (GBS) markers linked to these genes were used to track the segregating mutant (b and d) and wildtype (B and D) alleles. Only one test for epistasis between these two markers was run. This homeologous marker pair was denoted RIL_Rht1. Details of the population can be found in Appendix A1.
4.2 CNLM population
The Cornell small grains soft winter wheat breeding population (CNLM) was used to investigate the importance of homeologous gene interactions in a large adapted breeding population. The dataset and a detailed description of the CNLM population can be found in in Santantonio et al. (2018). Briefly, the dataset consists of 1,447 lines evaluated in 26 environments around Ithaca, NY. Because the data were collected from a breeding population, only 21% of the genotype/environment combinations were observed, totaling 8,692 phenotypic records. Standardized phenotypes of four traits, grain yield (GY), plant height (PH), heading date (HD) and test weight (TW) were recorded. All lines were genotyped with 11,604 GBS markers aligned to the International Wheat Genome Sequencing Consortium (IWGSC) RefSeq v1.0 wheat genome sequence of ‘Chinese Spring’ (IWGSC 2018, accepted), and subsequently imputed.
4.3 Homeologous marker sets
Using the IWGSC RefSeq v1.0 ‘Chinese Spring’ wheat genome sequence (IWGSC 2018, accepted), homeologous sets of genes were constructed by aligning the annotated coding sequences (v1.0) back onto themselves. Alignments of coding sequences was accomplished with BLAST+, allowing up to 10 alignments with an e-value cutoff of 1e-5. Alignments were only considered if they aligned to 80% or more of the query gene. Of the 110,790 coding sequences, 13,111 triplicate sets with one gene on each homeologous chromosome (representing 39,333 genes) were identified with no other alignments meeting the criterion. An additional 5,073 triplicates (representing 15,219 genes) were added by selecting the top 2 alignments if they were on the corresponding homeologous chromosomes. Duplicate sets were also included if there was not a third alignment to one of the three subgenomes, adding an additional 5,612 duplicates. The coding sequences for which we did not identify homeologous genes either appeared to be singletons (24,695 coding sequences) that did not have a good alignment to a gene on a homeologous chromosome, or had many alignments across the genome making it impossible to determine with certainty which alignments were truly homeologous (20,319 coding sequences). The known 4A, 5A, and 7B translocation in wheat (Devos et al. 1995) was ignored for simplicity in this study, but could easily be accounted for by allowing homeologous pairs across these regions.
The resulting 23,796 homeologous gene sets, comprised of 18,184 triplicate and 5,612 duplicate gene sets, sampled roughly 59% of the gene space of hexaploid wheat. Each homeologous gene was then anchored to the nearest marker by physical distance (Supplementary Figures S1 and S2), and homeologous sets of markers were constructed from the anchor markers to each homeologous gene set. Redundant marker sets due to homeologous genes anchored by the same markers were removed, resulting in 6,142 triplicate and 3,985 duplicate marker sets for a total of 10,127 unique homeologous marker sets. These marker sets were then used to calculate marker interactions as pairwise products of the marker score vectors.
As a control, two additional marker sets were produced by sampling the same number of duplicate and triplicate marker sets as the homeologous set (Homeo) described above. These markers sets were sampled either from chromosomes within a subgenome (Within, e.g. markers on 1A, 2A and 3A), or across non-syntenic chromosomes of different subgenomes (Across, e.g. markers on 1A, 2B and 3D). Samples were taken to reflect the same marker distribution of the Homeo set with regard to their native genome, which has a larger proportion of D genome markers relative to their abundance. Note that three-way homeologous interactions have equal proportions of markers belonging to the A, B and D genomes, whereas D genome markers only account for 13% of all markers in the CNLM population (Santantonio, Jannink, and Sorrells 2018). All analyses conducted on the Homeo marker set were conducted on the Within and Across marker sets to determine the importance of homeologous gene interactions relative to non-homeologous gene interactions.
4.4 Determining marker orientation
For each homeologous marker set, additive homeologous marker effects and their multiplicative interaction effects were estimated as fixed effects in the following linear mixed model while correcting for background additive and epistatic effects. The {−1, 1} marker parameterization was used for fixed marker additive and interaction effect estimates. where X is the design matrix, β is the vector of fixed environmental effects, and Z is the line incidence matrix. S-11 is the matrix of genotype marker scores and interactions for each genotype class, while E-11 is the additive and interaction effects that need estimated (Appendix A3). is the incidence matrix for the two- or three-way genotype of each homeologous marker set. Z and differ in that the former links observations to a specific line, whereas the latter links observations to one of the two- or three-way genotype classes for the homeologous marker set. The background genetic effects were assumed to be with population parameters previously determined (Zhang et al. 2010). The additive and epistatic covariances, KG and HI, were calculated as described in Van Raden (2008, method I) and Martini et al. (2016), respectively. This weighted covariance matrix was used to reduce computational burden associated with estimating two variance components in the same fit.
A Wald test was used to obtain a p-value for the null hypothesis that the marker effects or interactions are zero. All marker orientation combinations were generated by changing the estimated effect signs, and then transformed to the {0, 1} marker effect estimates using the change-of-reference operation (Álvarez-Castro and Carlborg 2007). Only marker orientations with all positive or all negative effects were considered.
Four marker orientation schemes were used in further analyses. Markers were oriented to have either all positive (POS) effects, all negative (NEG) effects, or to have the direction chosen for each marker set by one of two methods. The first method, ‘low additive variance high additive effect’ (LAVHAE), selected the marker orientation that minimized the difference (or variance for three-way sets) of the additive main effects while maximizing the mean of the absolute values of the additive main effects. Only additive effects were used to select the marker orientation to keep from systematically selecting marker orientations with a specific interaction pattern for the LAVHAE scheme. A second method was used to select marker orientations solely based on the highest variance of estimated additive and interaction effects, denoted ‘high total effect variance’ (HTEV). The HTEV method biases the interaction term to be in the opposite direction to that of the additive effects, therefore discussion of results using this method is limited.
4.5 Additive only simulated controls
Marker effect and interaction estimates using either {0, 1} or {0, 1} marker parameterizations are not orthogonal, so care must be taken when interpreting the direction and magnitude of the effects estimates. The positive covariance between the marker scores and their interaction leads to a multicollinearity problem, and results in a negative relationship between additive and interaction effects if both additive effects are oriented in the same direction. To determine if any observed pattern was due entirely to multicollinearity, a new phenotype with no epistatic effects was simulated from the data for each trait. The estimate of the marker variance was calculated from the additive genetic variance estimate as , where p is the vector of marker allele frequencies. Then for each trait, a new additive phenotype was simulated as ysim = 1μ + Xβ + ZMusim + εsim where M is the matrix of marker scores, usim was sampled from and ε was sampled from . A Kolmogorov-Smirnov (KS) test was used to determine if the distribution of the estimated interaction effects from the actual data differed from the distribution of effects estimated from simulated data. An additional simulated phenotype was also produced by first permuting each column of M to remove any effects due to LD structure.
4.6 Genomic prediction
To determine the importance of epistatic interactions to the predictability of a genotype, a genomic prediction model was fit as where 1n is a vector of ones, μ is the general mean. The random vectors of additive genotype, epistatic interactions, and errors were assumed to be distributed as , and , respectively.
The additive covariance matrix, K, was calculated using Van Raden (2008), method I. The epistatic covariance matrix H was calculated either as defined by Martini et al. (2016) to model all pairwise epistatic interactions (Pairwise), or in a similar fashion as K for oriented marker sets, where only unique products of marker variables were included instead of the marker variables. For the latter, the matrix was scaled with the sum of the joint marker variances as (2qT(1 − q))−1, where q is the frequency of individuals containing both the non-reference marker alleles.
A small coefficient of 0.01, was added to the diagonals of the covariance matrix to recover full rank lost in centering the matrix of scores prior to calculating the covariance. Five-fold cross validation was performed by randomly assigning individuals to one of five folds for 10 replications. Four folds were used to train the model and predict the fifth fold for all five combinations. All models were fit to the same sampled folds so that models would be directly comparable to one another and not subject to sampling differences. Prediction accuracy was assessed by collecting genetic predictions for all five folds, then calculating the Pearson correlation coefficient between the predicted genetic values for all individuals and a “true” genetic value. The “true” genetic values were obtained by fitting a mixed model to all the data with fixed effects for environments and a random effect for genotypes, assuming genotype independence with a genetic covariance I.
Increase in genomic prediction accuracy from the additive model was used as a proxy to assess the relative importance of oriented marker interaction sets. To determine the proportion of non-additive genetic signal attributable to each interaction set, the ratio of the prediction accuracy increase from the additive model using the interaction set (Homeo, Within and Across) to the prediction accuracy increase from the additive model modeling all pairwise epistatic interactions (Pairwise) was used for comparison of models. The percentage of non-additive predictability was calculated as follows for each interaction set.
4.7 Software
ASReml-R (Gilmour 1997; Butler 2009) was used to fit all mixed models. BLAST+ (Camacho et al. 2009) was used for coding sequence alignment. All additional computation, analyses and figures were made using base R (R Core Team 2015) implemented in the Microsoft Open R environment 3.3.2 (Microsoft 2017) unless noted otherwise. Figures 2 and 1 were created using the ‘tikz’ package (Tantau 2018) for LATEX. Figure 4 was made with the ‘circlize’ R package (Gu et al. 2014). The R package ‘xtable’ (Dahl 2016) was used to generate LATEXtables in R.
4.8 Data availability
Phenotypes and genotypes for the CNLM population can be found in Santantonio et al. (2018). A list of homeologous genes can be found in supplementary file ‘homeoGeneList.txt’. The supplementary file ‘HomeoMarkerSet.txt’ contains non-unique marker sets anchored to each homeologous gene set. Unique marker sets can be found in ‘uniqueHomeoMarker-Set.txt’, ‘WithinMarkerSet.txt’, ‘AcrossMarkerSet.txt’ for the Homeo, Within and Across marker sets used in the analysis, respectively. Marker and marker interaction estimates and p-values for the Homeo set can be found in ‘twoWayInteractions.txt’ and ‘threeWay-Interactions.txt’ for two- and three-way marker interactions, respectively. Phenotypes and genotypes used in the RIL population are included in the ‘NY8080Cal.txt’ file.
5 Results and Discussion
5.1 Rht-1
5.1.1 RIL population
The markers linked to the Rht-1B and Rht-1D genes both had significant additive effects (p < 10−10) and explained 19.6%and 20.5% of the variation in the height of the RIL population (Supplemental Table S1). The test for a homeoallelic epistatic interaction between these Rht-1 linked loci was also significant (p = 0.0025), but only explained 3.5% of the variance after accounting for the additive effects. Had we tested all pairwise marker interactions in this population, this test would not have passed a Bonferroni corrected significance threshold.
Effect estimates for the Rht-1 markers and their epistatic interaction are shown in Table 3, for {0, 1} and {−1, 1} marker parameterizations, and for orientations where the marker main effects are both positive or both negative. By plotting these values on a classic epistasis plot and indicating the genotype class means, it it evident that the {0, 1} parameterization is arguably more intuitive, as effects correspond directly to differences in genotype values (Figure 3). They both contain the same information and are equivalent for prediction, but the interpretation of the {−1, 1} marker coding is less obvious because the slopes are deviations from the expected double heterozygote, which does not exists in an inbred population. The {0, 1} parameterization uses the double dwarf as the reference point, where the effects αB and αc are the two semi-dwarf genotypes. The tall genotype is the sum of the semi-dwarf allele effects plus the deviation coefficient, δ, which corresponds to ααBC.
The estimated d parameter of 0.73 indicates a significant degree of redundancy between the wild type Rht-1 homeoalleles. This suggests that either the gene products maintain partial redundancy in function, or the expression of the two homeoalleles is somewhat redundant. The latter is less likely given that the two functional wild type genes have comparable additive effects relative to the double dwarf. If the two genes were expressed at different times or in different tissues based on their native subgenome, the additive effects would be likely to differ in magnitude. This demonstrates a functional change between homeoalleles that has been exploited for a specific goal: semi-dwarfism.
When the markers are oriented in the opposite direction, to indicate the GA insensitive mutant allele as opposed to the GA sensitive wildtype allele, the interpretation of the interaction effect changes. The additive effect estimates become indicators of the reduction in height by adding a GA insensitive mutant allele. The interaction effect becomes the additional height reduction from the additive expectation of having both GA insensitive mutant alleles, resulting in a d parameter of 1.58. The same interpretation can be made, but must be done so with care. Losing wildtype function at both alleles results in a more drastic reduction in height than expected because there is redundancy in the system. Therefore, the d parameter is most easily interpreted when the functional direction of the alleles is known. Simply put, when function is added on top of function, little is gained, but when all function is removed, catastrophe ensues.
5.1.2 CNLM population
For the CNLM population, the markers with the lowest p-values associated with plant height on the short arms of 4B and 4D did not show a significant interaction with their respective assigned homeologous marker in homeologous sets H4.16516 and H4.23244, respectively. A new homeologous marker set, CNLM_Rht1, was constructed with the SNPs on 4BS and 4DS with the lowest p-values mentioned above. The additive effects of S4B_PART1_38624956 and S4D_PART1_10982050 had p-values of p = 5.5 × 10−4 and p = 3.7 × 10−8, respectively for this set. The pair was tested for an epistatic interaction, but was only significant at a α = 0.05 threshold (p = 0.015). This set was oriented in the same direction as the RIL_Rht1 set using the LAVHAE orientation method. While the magnitude of these effects was reduced (7.13, 7.09 and −4.56 for the 4D, 4B and 4B×4D effects respectively), the CNLM_Rht1 set had a d parameter value of 0.68, similar to that of RIL_Rht1. Had this set alone been tested, we would have concluded that this was a significant homeologous interaction.
Based on the p-value, it appears that the S4B_PART1_38624956 marker is not in high LD with the Rht-1B gene perhaps due to physical distance, multiple alleles or separation of these mutations in time. This SNP may have occurred before the Rht-1B mutation, and is therefore flagging both the wild type and a GA-insensitive allele in the population. It appears that a GBS SNP in high LD with the Rht-1B was not sampled, or was filtered out of the marker set.
These results highlight the difficulty in using single SNP markers to track homeologous regions. If a non-functional mutation that is not in LD with a functional mutation is identified as the homeologous marker, then there will appear to be no interaction, while a marker in high LD with the homeolog but physically further away from the gene was missed. Further research is required to determine if functionally important homeologous marker pairs can be identified based on LD signatures in the population.
One of the challenges of using diverse panels of individuals is that marker proximity to a functional mutation is not necessarily indicative of high LD between the two sites. Significantly older or newer marker mutations may be in weak LD with a functional mutation despite close physical proximity, at least until a genetic bottleneck brings them back into high LD. The best markers are those mutations in close proximity and occurring at approximately the same historical time as the functional mutation in the same individual, such that both polymorphisms are subsequently co-inherited.
Another strategy to determine functional homeologous regions is to relax which sets of markers are considered homeologous. This could be accomplished by allowing pairwise relationships with all markers across entire subgenomes (Santantonio, Jannink, and Sorrells 2018), on syntenic chromosome arms or smaller haplotypes. Haplotypes could be constructed in the population and used to model homeologous haplotype interactions. This strategy may have more power assuming the correct SNPs could be identified to assign functional haplotypes, but that is beyond the scope of this report.
5.2 Significant homeoallelic interactions
The absence of one genotype class in 7,912 interaction terms resulted in 20,641 testable interaction effects out of 28,553 total interaction terms. A trait-wise Bonferroni significance threshold of 0.05/20,641 = 2.4 × 10−6 was therefore used to determine which interaction effects had a significant effect on the phenotype. Few homeoallelic interactions were significant at the trait-wise Bonferroni cutoff (Figure 4). Significant homeoallelic interactions for PH were identified between 4AL and 4DS, as well as 4BL and 4DL. Both of these locations were likely too far away from the Rht-1 alleles to be tagging these genes directly, but they may be regulatory sites for these genes. Another set of interacting sites between the homeologous chromosome arms 3AS, 3BS and 3DS was also identified for PH, but the additive effects were not significant. Two interacting regions on homeolog 1, between 1AS and 1DS and between 1AL and 1DL, and three interacting regions on homeolog 5 also appeared to be influencing HD. One region on the distal end of homeolog 7 affected both HD and TW, with significant two-way and three-way interactions. Although they were tagged with different marker sets for the two traits, these epistatic regions appeared to co-localize within 2 Mbp.
No significant additive or interaction effects were detected for GY, highlighting the highly polygenic nature of the grain yield trait. In several cases, one of the additive effects was significant but the other was not, and it is not clear if this is influencing the detection of interactions. It may be that the significant marker is simply in higher LD with the functional mutation conditional on the presence of the other marker, allowing the interaction to pick up the additional signal from the functional mutation (Wood et al. 2014). However, if this were the case, the interaction would be expected to be in the same direction as the additive effect, which was not generally observed.
We did not detect an interaction between the two significant additive regions on 2B and 2D for the HD trait. While these two markers were not grouped as a homeologous set, they were tested as such based on their proximity to the well described Photoperiod-1 genes, Ppd-B1 and Ppd-D1, on chromosomes 2B and 2D respectively. These genes are known to influence photoperiod sensitivity, and therefore transition to flowering and heading date (Welsh J.R. and R.D. 1973; Law, Sutka, and Worland 1978; Scarth and Law 1983). Certain allele pairs at these genes have been shown to exhibit a high degree of epistasis (Poland 2018, personal communication) in a bi-parental family. It is unclear why no interaction was observed in this population.
Jiang et al. (2017) also investigated the presence of homeologous interactions, but found little evidence in a large population of hybrid wheat. They did not attempt to tag homeologous loci, but instead considered interactions across any markers on homeologous chromosomes to be syntenic. It is possible that interactions at homologous (i.e. heterozygous) loci largely outweighed the interactions across homeologous loci in that population, given it was constructed from highly divergent parents and that progeny were not inbred. Additionally, they tested all pairwise marker combinations, resulting in a strict significance threshold that may have missed small effect homeologous interactions.
Homeologous interactions make up relatively few of the potential two-way interactions within an allopolyploid genome. Given a subgenome with k genes and alloploidy level p (i.e. the number of subgenomes), there are two-way homeologous interactions versus potential two-way non-homeologous gene interactions. For a subgenome size of 30,000 genes, this represents 0.02% and 0.006% of the possible two-way gene interactions for an allotetraploid and an allohexaploid, respectively. That said, homeoallelic interactions should be far more likely to have a true biological interaction than random pairs of genes because they should belong to the same or similar biochemical pathways.
5.3 Estimates of d
There were few cases where at least two additive effects and their corresponding interaction effect were all significantly different from zero. This may be due to the difficulty of assigning functional homeologous gene sets using single SNPs, as well as a lack of statistical power owing to low minor allele frequencies (Hill, Goddard, and Visscher 2008). The lack of a large number of significant interactions is not surprising given that allele frequencies near 0.5 are uncommon in both natural and breeding populations.
To determine whether more homeologous marker sets were displaying a pattern indicative of subfunctionalization than would be expected by chance, marker sets where both additive and two-way interaction effects were significant at a threshold of α = 0.05 were examined (Table 4). The expected number of two-way marker sets with significant additive and interaction effects is about 11 (i.e. 4 traits × 22,411 two-way interactions × 0.053), assuming independence of loci and true additive and interaction effects of zero. Only the Homeo and Across marker sets had more than the expected number. The homeologous marker set had a larger proportion of d coefficients estimated between 0.5 and 1 relative to the strictly additive simulated phenotypes as well as the other non-homeologous marker sets, suggesting that homeologous loci exhibit a pattern indicative of subfunctionalization more so than other marker sets tested. The Across set showed the highest proportion of d < 0.5, suggestive of gene pathway interference. However, it is unclear if these few marker sets are indicative of any global pattern, and may simply be an artifact of sampling.
Because the power to detect significant effects diminishes as more tests are accomplished, it may be prudent to look at global trends between homeologous additive effects and their interactions, regardless of statistical significance.
5.4 Evidence of subfunctionalization
A strong negative relationship between additive and interaction effects was observed when using the {0, 1} marker parameterization (Supplementary Figure S3). This negative relationship was also observed in the phenotypes simulated to be strictly additive (Supplementary Figure S4). The multicollinearity of the additive and epistatic predictors at least partially drives this relationship, where positively correlated additive and epistatic predictors will tend to have effect estimates in opposing directions.
To determine if the interaction effects were greater in magnitude than expected by chance, the ordered interaction effects from the true and simulated phenotypes were plotted against one another to form a quantile-quantile plot (Figure 5). The interaction effects were multiplied by the sign of the corresponding additive effects to highlight the direction of interaction effect relative to the additive effect. Interaction effect distributions were significantly different between the observed and strictly additive simulated data as determined by the Kolmogorov-Smirnov test (KS; p < 0.05) for all traits accept GY.
HD showed a pattern consistent with a subfunctionalization model, with a low dropping tail for interaction effects in the opposite direction than that of the corresponding additive effects. This indicates that the less than additive effects of some estimated interactions are greater than expected by additivity alone. PH showed some evidence of this pattern, but also demonstrated a greater than additive effect for positively related interaction effects. The LAVHAE orientation scheme may have selected the wrong marker coding for those marker sets, resulting in a d parameter greater than 1, or there are true greater than additive interaction responses for positive effect alleles. Greater than additive responses would be indicative of overdominance across homeologous loci. GY and TW showed little evidence of the less than additive pattern, yet TW did show this trend when the HTEV marker orientation was used (Supplemental Figures S6 and S7). These relationships were more pronounced when the markers were permuted to remove LD before simulating the data (Supplemental Figure S5). High LD between homeologous marker sets may result in dampening of the epistatic signal due to unbalanced or missing genotype classes.
These findings are further supported by comparing the homeologous interactions to the Within and Across interaction effect estimates. The Homeo marker set showed more severe less than additive epistasis than both Within and Across for HD but not the other traits (Supplementary Figures S8 and S9). The Within set had more severe less than additive interaction effects than the Homeo set for TW (Supplementary Figure S8), and the Across had more severe less than additive effects for PH (Supplementary Figure S9). Large or moderate effect negative epistasis is expected across subgenomes in allopolyploids, but it is unclear why this was also observed for the Within marker set for TW.
5.5 Homeologous model fit
Comparing variance component estimates across different unstructured covariance matrices can be misleading as variance components can be scaled by pulling a constant out of the covariance matrix. Additionally, variance partitioning is only reliable when the covariance matrices are truly independent (Vitezica et al. 2017; Huang and Mackay 2016; Jiang et al. 2017). Therefore, we do not make an attempt to discern meaning from the variance components per se, and instead focus the discussion on model fit diagnostics, as well as prediction accuracy from cross validation to determine the value of the predictive information included in the model.
All epistatic models using the {−1, 1} marker parameterization provided a superior fit to the additive only model based on Akaike’s Information Criterion for all traits (Table 5). These results were confirmed by a likelihood ratio test to determine if the epistatic variance component was zero for all traits. With the exception of the GY trait, all of the epistatic models using the {0, 1} marker parameterization also had non-zero variance components (Supplementary Table S3), but did not result in a better fit for any models or traits. All LAVHAE oriented marker sets had a better model fit than forcing allele effects to be either all positive (POS) or all negative (NEG), demonstrating that some information was gained from orienting markers (Supplementary Tables S4 and S5). The LAVHAE also resulted in a better model fit than the HTEV marker orientation scheme under either marker parameterization (Supplementary Tables S6 and S7).
Despite resulting in a better fit, adding the epistatic interactions resulted in rather high correlations between additive and epistatic variance component estimates, as obtained from the average information matrix. Variance parameter estimate correlations between the additive and epistatic interactions consistently ranged from 0.57 to 0.65 for Pairwise, Within and Across models, but were notably lower for the Homeo epistatic set, ranging from 0.53 to 0.57. High correlations of variance estimates have been shown to be indicative of over fitting (Bates et al. 2015a; Bates et al. 2015b). However, if addition of epistatic interactions improves prediction accuracy of cross-validation, then the additional information they contain is warranted for inclusion in the model.
The Pairwise, Within and Across epistatic models outperformed the Homeo marker interaction set for all traits. This may be due to poor assignment of homeologous sets, or relatively fewer identifiable interactions and is discussed later (Section 5.7).
5.6 Genomic prediction
All epistatic models resulted in higher prediction accuracies for all traits other than GY, where only marginal increases were seen for certain marker interaction sets and parameterizations (Table 6).
The {−1, 1} marker coding resulted in higher prediction accuracies with a mean increase of 0.045 over the {0, 1} coding, and ranged from 0.007 to 0.084 higher accuracy. This increase may be due to choosing the wrong orientation using the {0, 1} marker coding effects. While these two codings are equivalent for prediction using ordinary least squares, this does not appear to be the case for the mixed model genomic prediction environment. The discrepancy may lie in shrinkage of interaction effects, where the {0, 1} marker coding should result in greater shrinkage than the {−1, 1} marker coding. This can be seen from a simple example with one observation of each genotypic class in {bbcc, bbCC, BBcc, BBCC}. The {−1, 1}coding would have an interaction predictor of {1, −1, −1, 1}, whereas the {0, 1} coding would have an interaction predictor of {0, 0, 0, 1}. This results in different numbers of observations per interaction class, with the {0, 1} coding contrasting 3 and 1, verses 2 and 2 for the {−1,1} coding. Therefore the shrinkage of the {0,1} coding should be greater than for the {−1,1} coding. Martini et al. (2017), also noted that the {−1,1} marker coding has a 50% chance of choosing the wrong marker orientation if chosen at random, whereas the {0,1}marker coding has a 75% chance of being the wrong marker orientation.
Choosing the marker orientation based on the LAVHAE scheme was better for all traits and marker sets for the {−1, 1} coding, with a mean prediction accuracy increase of 0.025 (ranging from 0.004 to 0.053) over the all positive (POS) or all negative (NEG) orientations (Supplemental Tables S8 and S9). Choosing the orientation had little to no effect on the {0, 1}marker coding (mean 0.003, ranging from −0.002 to 0.011). It is unclear if our attempt to choose the marker orientation resulted in the biologically relevant orientation more often than would be expected by chance, as this resulted in higher accuracies for the {−1, 1} but not the {0, 1} marker coding.
The increase in genomic prediction accuracy by choosing the orientation using the LAVHAE scheme over the all positive (POS) or all negative (NEG) scheme suggests that information can be gained from orienting markers relative to one another (Supplementary Tables S9 and S8). This is further supported by a better model fit for the LAVHAE orientation. However, it is unclear what strategy should be used to orient pairs of markers. In this report, marker additive effects were forced to be either all positive or all negative to model the homeologous subfunctionalization hypothesis, but there may be more biologically relevant orientations not explored here. Martini et al. (2017) used a categorical interaction that included a predictor for each pairwise genotype. That model was shown to be less predictive than the {−1,1} multiplicative model, perhaps due to more linearly dependent predictors assumed to have non-zero effects. How an optimal set of orientations might be obtained without losing biological meaning of the orientation warrants further investigation.
An indirect estimate of the proportion of non-additive genetic signal attributable to homeologous gene interaction was determined by taking the ratio of the percent increase in prediction accuracy of the Homeo, Within or Across prediction models from the additive model to the increase in prediction accuracy due to all pairwise interactions (equation 4). All three marker sets resulted in higher genomic prediction accuracy than the additive only GBLUP model (G) when the {−1, 1} marker coding was used. The homeologous marker interaction set explained between 58% and 167% of the additional genetic signal from the additive model. This result supports the idea that homeologous interactions are an important feature in the wheat genome. Conversely, Within and Across epistatic marker sets always resulted in a higher increase in genomic prediction accuracy relative to the Homeo marker set for all traits. This may suggest that the homeologous marker interactions are the least important relative to other epistatic interactions within and across the subgenomes, but could also be due to the paucity of these interactions relative to all possible two-way interactions, as previously discussed.
Another explanation might be provided by the relatively higher degree of LD across Homeo marker sets than found for the Within or Across marker sets. The Within and Across marker interaction sets also resulted in a larger number of unique interaction predictors because they were randomly selected across chromosomes. Homeologous marker sets were selected next to one another along syntenic regions of homeologous chromosome, and more often shared two of the three homeoallelic markers (Supplemental Figure S11). The Within and Across sets appear to have sampled the entire genome better than selecting only homeologous loci, as they track more unique pairs of genomic regions.
5.7 Homeologous LD
The superiority of the Within and Across genomic prediction models to the Homeo genomic prediction model may indicate that homeologous interactions are relatively less important than other sets of interacting loci. However, homeologous marker sets had a much higher tendency to be co-inherited together, as seen by relatively higher standardized linkage disequilibrium values, D’ (Lewontin 1964), than observed for either Within (KS test p-value = 1.1 × 10−6) or Across (KS test p-value = 2.3 × 10−13) marker sets (Figure S10). The greater fixation of allele pairs at homeologous regions may explain the lack of increased prediction accuracy of the Homeo marker set, but this may not diminish the importance of homeologous interactions. As sets of interactions are fixed within the population, the epistatic variance becomes additive (Hill, Goddard, and Visscher 2008). The higher degree of LD, per se, may indicate the importance of homeologous interactions.
The Green Revolution dwarfing genes are an excellent example of how pairs of homeoalleles may become fixed, or develop a tendency for co-inheritance under selection. In this example, the desirable phenotype is a semi-dwarf, due to its resistance to lodging. Therefore, wildtype Rht-1B alleles will usually be paired with a GA-insensitive Rht-1D dwarfing allele, while wildtype Rht-1D alleles will usually be found with a GA-insensitive Rht-1B dwarfing allele to confer the desirable semi-dwarf phenotype. The CNLM_Rht1 set had a large standardized D′ value of 0.66, indicating that pairs of alleles were being fixed in the population, despite the apparent lack of high LD between the B genome marker and the Rht-1B gene.
We recognize that it is also possible that the higher degree of LD observed between homeologous marker pairs could be due to misalignment of markers to the wrong subgenome. Markers assigned to the wrong homeolog would appear in high LD simply because they are physically located near their assigned homeologous partner on the same chromosome. We used strict filtering parameters to reduce the likelihood of misalignment. This included a threshold on observed heterozygosity in the population, which could indicate alignment to more than one subgenome.
5.8 Conclusion
While much epistasis is partitioned to additive variance, it has been shown to be prevalent (Forsberg et al. 2017), and is important for maintaining long term selection (Carlborg et al. 2006; Paixão and Barton 2016). Our results indicate that homeoallelic interactions do not account for a large portion of the non-additive genetic variance in the CNLM population. While they do to contribute to the genetic variance of the population as evidenced by a better model fit and higher prediction accuracy over the additive model, sampling interactions across non-syntenic regions was superior for all traits examined. Homeologous interactions appear to make up the minority of epistatic interactions within this population.
Wagner (2005) suggested that there are two potential drivers of less than additive (Eshed and Zamir 1996) or synergistic (Segre et al. 2005) epistasis. These drivers are i) functional redundancy, as might be expected across homeologous loci, and ii) distributed robustness of function, in which there can be are many pathways that can acheive the same outcome. Our observation that most epistasis is not due to homeologous interactions is supported by the findings of Jannink et al. (2009), who found the synergistic epistasis signal in a wheat dataset to be indicative of Wagner’s distributed hypothesis, and not of the redundancy hypothesis.
HD showed the greatest evidence of subfunctionalization. The LAVHAE orientation appeared to be effective for the HD trait, but it is unclear if this marker orientation scheme was effective for PH, which also demonstrated a greater than additive effect. TW showed little pattern of subfunctionalization, despite having significant epistatic interactions. GY showed no evidence of being affected which may be due to the highly polygenic nature of the trait. Essentially all functional differences in the population should contribute to GY, and it may be that the effects of epistatic interactions are too small to detect. Large effect homeologous interactions for GY are likely to have allele pairs that have been fixed or are well on their way to fixation in the elite wheat genotypes as a consequence of modern plant breeding. The fixation of the semi-dwarf phenotype provides a profound example where specific pairs of homeoalleles result in drastic increases in grain production under modern agriculture.
The apparent lack of substantial interactions across homeoallelic loci may be explained by several factors. It may be that there are few differences in protein function or expression across the three subgenomes, although this seems unlikely given mounting evidence that homeologous copies are differentially expressed in time, tissue and environment (Adams et al. 2003; Liu and Adams 2007; Liu, Baute, and Adams 2011; Chaudhary et al. 2009; Pfeifer et al. 2014; Liu et al. 2015; Mutti, Bhullar, and Gill 2017; Zhang et al. 2016). We were unable to assign homeologous pairs to all genes within the genome, suggesting that many of these potential sites for interacting loci were lost during polyploidization. Rapid loss of genetic material due to genome shock (McClintock 1984) is common in newly synthesized allopolyploids (Chen and Ni 2006), as has been shown in synthetic allopolyploid wheat (Ozkan, Levy, and Feldman 2001; Kashkush, Feldman, and Levy 2002). Other interacting loci may have undergone epigenetic (Comai 2000; Lee and Chen 2001; Comai et al. 2003) or transposon induced silencing of one or more homeoalleles (Kashkush, Feldman, and Levy 2003; Wang et al. 2004).
It may also be that the most important homeoallele pairs have been effectively fixed in the CNLM breeding population, as evidenced by a higher degree of LD between homeologous markers compared to markers sampled from non-homeologous regions. Determining homeologous regions is relatively simple using gene positions and orientation, but tagging those regions with markers that are informative of interacting loci appears to be a challenge, even for well known loci such as Rht-1. Marker imputation using a diverse panel of highly sequenced individuals may increase the marker density and the ability to identify interacting loci. As sequencing becomes more affordable, higher read coverage will allow for better genotyping, and should produce more high quality markers. Other approaches, such as the use of haplotypes, should be developed to assign informative homeologous locus indicators, as opposed to simply using physical position of markers or lumping markers on the same subgenome together (Santantonio, Jannink, and Sorrells 2018).
The TILLING population developed by Kasileva et al. (2017) consisting of 2,735 mutant lines each with thousands of genic mutations could be a useful resource for future investigation into homeoallelic gene interactions. Lines with complementary loss of function homeologous genes could be used to develop bi-parental mapping populations to test the degree of subfunctionalization with the high statistical power afforded by allele frequencies of 0.5.
Other genetic resources also exist that would maximize the likelihood of detecting these interactions. A segregating synthetic hexaploid wheat population was developed from a cross between a spring wheat variety, Opata, and a synthetic hexaploid, W7984, with durum A and B genomes coupled to an Ae. taushii D genome (Sorrells et al. 2011). The population, consisting of roughly 200 doubled haploid and over 2,000 recombinant inbred lines, will have high statistical power to detect interactions between the common wheat homeologs and their durum and Ae. taushii ancestors due to both optimal allele frequencies and high genetic differences.
Prediction of unobserved beneficial homeoallelic epistatic regions may prove difficult, as it currently is in diploid hybrids. Additionally, directed selection for homeoallelic interactions across subgenomes will be challenging in autogamous allopolyploids due to the intensive labor involved in making crosses. However, these tools provide a way to screen large populations for beneficial homeologous interactions such that they may in turn be selected in breeding populations before intensive field trials are conducted, saving phenotyping costs and potentially reducing the time to variety release.
Treating the genome as consisting of purely additive gene action assumes that genes are independent machines, whose products sum to the final value of an individual. While convenient for selection, this is almost certainly not true when we consider the molecular mechanisms of biological organisms. Instead, genes work in concert to produce an observable phenotype. To this day, breeders of allopolyploid crops have treated allopolyploids as diploids for simplicity, but we now have the technical ability to view and start to breed these organisms as the ancient immortal hybrids that they are.
6 Acknowledgments
Funding of this research was provided by the USDA National Needs Fellowship for N. Santantonio, in partial fulfillment of the requirements for a Ph.D in Plant Breeding and Genetics at Cornell University. The field trials comprising the phenotypic data for the CNLM population were funded in part by the Hatch Project # 149-447. Genotyping was funded by the Wheat Coordinated Agricultural Project (WheatCAP). We are also grateful to Jesse Poland’s research group at Kansas State University for their contribution to genotyping of CNLM materials. The authors thank the International Wheat Genome Sequencing Consortium for pre-publication access to IWGSC RefSeq v1.0. The authors would also like to acknowledge Roberto Lonzano Gonzalez for the suggestion of using the coding sequences to identify homeologous genes. Finally, we would like to acknowledge the Cornell small grains staff, particularly David Benscher and James Tanaka, who were vital in implementing, collecting and processing the materials used to build the CNLM dataset.
A1 Appendix 1 RIL population
The population was formed from a cross between two Cornell soft winter wheat lines, NY91017-8080 and Caledonia. Caledonia contains a GA-insensitive 4D allele, d, and a wild-type 4B allele, B, while NY91017-8080 has a GA-insensitive 4B allele, b, and the wild type 4D allele, D. The population consisting of 192 individuals was planted in single row plots in Ithaca NY and measured for plant height in 2008. The population was screened for loci influencing plant height on chromosomes 4B and 4D using genotyping by sequencing (GBS) markers. The markers with the lowest p-value on the short arms of 4B and 4D were used to indicate the Rht-1 gene in this study. Only individuals with homozygous genotype calls for both loci were included to test for epistasis. This resulted in 19 double dwarfs (bbdd), 51 D genome semi-dwarfs (BBdd), 35 B genome semi-dwarfs(bbDD), and 53 tall (BBDD), for a total of 158 individuals. It appears that the Caledonia parent plant used in the cross was heterozygous for the D genome dwarfing allele, resulting in the 1:2 segregation ratio for the d : D alleles, and was confirmed by the genotype call for that plant.
A2 Appendix 2 Coding Sequence Alignment
Alignments of coding sequences was accomplished with BLAST+, allowing up to 10 alignments with an e-value cutoff of 1e-5. Alignments were only considered if they aligned to 80% or more of the query gene. Of the 110,790 coding sequences, 13,111 triplicate sets with one gene on each homeologous chromosome (representing 39,333 genes) were identified with no other alignments meeting the criterion. An additional 5,073 triplicates (representing 15,219 genes) were added by selecting the top 2 alignments if they were on the corresponding homeologous chromosomes. Duplicate sets were also included if there was not a third alignment to one of the three sub-genomes, adding an additional 5,612 duplicates. The coding sequences for which we did not identify homeologous genes either appeared to be singletons (24,695 coding sequences) that did not have a good alignment to a gene on a homeologous chromosome, or had many alignments across the genome making it impossible to determine with certainty which alignments were truly homeologous (20,319 coding sequences).
A3 Appendix 3 Change of reference
Following Álverez-Castro and Carlborg (2007), we demonstrate the change-of-reference operation simplified for inbred populations. For {0, 1} marker coding and allowing G1 to be the reference genotype, the genotypic values at a single locus can be represented as where S01 is the marker score matrix using the {0, 1} marker parameterization and E01 is the vector of expected values. For the two locus epistasis model, the four genotypic values are then
The three locus interaction is extended by
To shift from {−1, 1} coding estimates, β101, to {0, 1} coding estimates, β01 the following transformation exists (Álvarez-Castro and Carlborg 2007). Let S-11 indicate the {−1, 1}marker parameterization then .