Abstract
Our study investigates the possible drivers of recombination hotspots in Theobroma cacao using ten genetically differentiated populations. This constitutes the first time that recombination rates from more than two populations of the same species have been compared, providing a novel view of recombination at the population-divergence time-scale. For each population, a fine-scale recombination map was generated under the coalescent using a standard method based on linkage disequilibrium (LD). These maps revealed higher recombination rates in a domesticated population and a population that has undergone a recent bottleneck. We address whether the pattern of recombination rate variation along the chromosome is sensitive to the uncertainty in the per-site estimates. We find that uncertainty, as assessed from the Markov chain Monte Carlo iterations, is orders of magnitude smaller than the scale of variation of the recombination rates genome-wide. We inferred hotspots of recombination for each population and found that the genomic locations of these hotspots correlate with genetic differentiation between populations (FST). We developed novel randomization approaches to generate appropriate null models for understanding the association between hotspots of recombination and both DNA sequence motifs and genomic features. Hotspot regions contained fewer known retroelement sequences than expected, and were overrepresented near transcription start and termination sites. Our findings indicate that recombination hotspots are evolving in a way that is consistent with genetic differentiation, but are also preferentially driven to regions of the genome that are upstream or downstream from coding regions.
Introduction
Genetic variation is fundamental for evolutionary forces like selection and genetic drift to act. Selection and drift also contribute to a loss of variation, which means that they must act in conjunction with forces that maintain variation along the genome in order for populations to continue evolving over prolonged periods of time. Recombination’s rearranging of genetic material onto different backgrounds generates a larger set of haplotype combinations on which selection can act, reducing the magnitude of Hill-Robertson interference (Felsenstein, 1974). Different regimes of recombination can strongly influence how efficient selection is at purging deleterious mutations and increasing the frequency of beneficial mutations in the population (Felsenstein, 1974).
One way to elucidate the distribution of recombination events along the genome is by using fine-scale recombination maps (Myers et al., 2005; Auton et al., 2012; Brunschwig et al., 2012; Paape et al., 2012; Choi et al., 2013; Hellsten et al., 2013; Singhal et al., 2015; Stevison et al., 2016). These maps are constructed with methods that leverage current patterns of linkage disequilibrium (LD) using the coalescent, in order to estimate historical rates of recombination between sites along the genome (Auton and McVean, 2007). Studies in a wide range of species have shown that recombination rates are not uniform along the genome and general patterns of variation have been described (Begun and Aquadro, 1992; Akhunov et al., 2003; Wu et al., 2003; Anderson et al., 2004; McVean et al., 2004; Mézard, 2006; Kim et al., 2007; Gore et al., 2009; Schnable et al., 2009; Branca et al., 2011; Paape et al., 2012). One of these patterns is the reduced recombination rate in centromeric regions of the chromosomes and the progressive increase of recombination rates as the physical distance to telomeres decreases (Begun and Aquadro, 1992; Akhunov et al., 2003; Wu et al., 2003; Anderson et al., 2004; Gore et al., 2009; Schnable et al., 2009). This pattern has also been shown to arise in simulation studies (e.g. Mackiewicz et al., 2010). Another interesting pattern that has been observed is that of regions with unusually high rates of recombination spread throughout chromosomes: recombination hotspots (McVean et al., 2004; Brunschwig et al., 2012; Paape et al., 2012; Hellsten et al., 2013; Stevison et al., 2016; Shanfelter et al., 2018). In this study, we define hotspots locally, requiring that their recombination rate be unusually high when compared to neighboring regions. 
The importance of recombination hotspots lies in their ability to shuffle genetic variation at higher rates than the rest of the genome, profoundly impacting the dynamics of selection for or against specific mutations (Felsenstein, 1974).
A variety of genomic features have been identified as being associated with regions of high recombination. Recombination hotspots have been linked to transcriptional start sites (TSSs) and transcriptional termination sites (TTSs) in Arabidopsis thaliana, Taeniopygia guttata, Poephila acuticauda, and humans (Myers et al., 2005; Choi et al., 2013; Singhal et al., 2015). In Mimulus guttatus, hotspots were found to be associated with CpG islands (short segments of cytosine and guanine rich DNA, associated with promoter regions) (Hellsten et al., 2013). CpG islands were also associated with increased recombination rates in humans and chimpanzees (Auton et al., 2012). These patterns point to recombination occurring frequently near, but not within, coding regions. The formation of chiasmata is important for the proper disjunction of chromosomes during meiosis (Martinez-Perez et al., 2008), but repeated double-strand breaks can lead to an increased mutation rate (Rodgers and McVey, 2015). In coding regions in particular, this excess mutation rate can have a high evolutionary cost, due to the likelihood of novel deleterious mutations being higher than that of beneficial ones (Haldane, 1937; Crow and Kimura, 1970; Wloch et al., 2001; Sanjuán et al., 2004; Eyre-Walker and Keightley, 2007). Recombination hotspots have also been found to be correlated with particular DNA sequence motifs. In some mammals, including Mus musculus (Brunschwig et al., 2012) and apes (Auton et al., 2012; Stevison et al., 2016), binding sites for PRDM9, a histone trimethylase with a DNA zinc-finger binding domain, have been found to correlate with recombination hotspots. In Arabidopsis thaliana, proteins that limit overall recombination rate have been identified, leading to a genome-wide increase in recombination rate in knockout mutants (Fernandes et al., 2018).
However, these Arabidopsis proteins have not been shown to direct recombination to particular regions, and are therefore not expected to affect the location of recombination hotspots.
Comparisons of recombination hotspots between pairs of populations have yielded varying results. Hinch et al. (2011) found that, at finer scales, the genetic maps of European and African human populations were significantly different. They also found that, when looking at hotspots in the major histocompatibility complex, the African populations showed a hotspot that was not present in Europeans, but all European hotspots were found in African populations (Hinch et al., 2011). Recent work on recombination in apes (Stevison et al., 2016) found little correlation of recombination rates in orthologous hotspot regions when looking between species, but a strong correlation when comparing between two populations of the same species. Other studies have also found very little sharing of hotspots between humans and chimpanzees (Ptak et al., 2005; Winckler et al., 2005). Additionally, the dynamic of changing hotspot locations observed in humans and other apes has been observed in simulations (Mackiewicz et al., 2013). This suggests that recombination hotspots are potentially changing in ways that match demographic patterns, differentiating at a similar rate as genomic sequences.
The identification of ten genetically differentiated populations of the cocoa tree, Theobroma cacao (Motamayor et al., 2008; Cornejo et al., 2018), can be leveraged to study population-level drivers of recombination hotspots. These ten populations originate from different regions of South and Central America, and include one fully domesticated population (Criollo), used in the production of fine chocolate, and nine wilder, more resilient populations which generate higher cocoa yield than the Criollo variety (Fig. 1) (Motamayor et al., 2008; Henderson et al., 2007; Cornejo et al., 2018). These ten populations have been shown to have strong signatures of differentiation between them (FST values ranging from 0.16 to 0.65) and they separate into clear clusters of ancestry (Cornejo et al., 2018). Comparing the locations of hotspots between these ten populations of T. cacao can contribute to the understanding of hotspot turnover at the population-divergence time-scale. These comparisons also contribute to our understanding of how demographics impact the turnover of recombination hotspot locations.
Fine-scale, LD-based recombination maps have been constructed for a number of plant models (Paape et al., 2012; Choi et al., 2013; Hellsten et al., 2013), identifying a variety of features correlated to recombination rate. Unlike these model plants with short generation times, T. cacao is a perennial woody plant with a five-year generation time (Henderson et al., 2007). The size and long generation time of T. cacao makes direct measurements of recombination impractical. However, historical recombination can be estimated for T. cacao using coalescent based methods (Auton and McVean, 2007). Theoretical studies have shown that population structure can generate artificially inflated measures of LD (Ohta, 1982; Li and Nei, 1974), which would be detrimental to our estimates of recombination. For this reason recombination maps were constructed independently for each population. In contrast to previous studies, which have focused primarily on recombination rates, this study attempts to describe the relationship between recombination hotspots and a variety of factors.
We used an LD-based method to estimate recombination rates, which we then analyzed with a maximum likelihood statistical framework to infer the location of recombination hotspots. The locations of hotspots were compared across populations and a novel resampling scheme tailored to the genomic architecture of T. cacao was used to generate null expectations for the distribution of hotspots along the genome. These null distributions were used to identify differential representation of known DNA sequence motifs in ubiquitous recombination hotspots, and of overlap between recombination hotspots and genomic features for each population. The resampling schemes used to identify these associations are novel in the context of this work and were designed to take into account the size and distribution of elements in the genome. In this work we aimed to answer the following questions: (i) How are recombination rates distributed within ten highly differentiated populations of T. cacao, and how do they compare to each other? (ii) How are hotspots distributed along the genome of each of the ten populations of T. cacao, and can these distributions be explained by patterns of population genetic differentiation? (iii) Are there identifiable DNA sequence motifs that are associated with the location of recombination hotspots along the T. cacao genome? (iv) Are there genomic features (e.g. TSSs, TTSs, exons, introns) consistently associated with recombination hotspot locations across T. cacao populations? Our findings suggest that recombination hotspot locations generally follow patterns of diversification between populations, while also having a strong tendency to occur close to TSSs and TTSs. Moreover, we find a strong negative association between the occurrence of recombination hotspots and the presence of retroelements.
Results
Comparing recombination rates between populations
Populations show a mean recombination rate r/kb between 2.1×10−5 and 5.25×10−3 (Table 1), with a variety of distributions (Fig. 2). We observe a higher mean than median r/kb for all populations, indicating that extreme high values are present for all populations. The extreme recombination rate values affect the mean, driving it to values consistently higher than the median. The pattern of recombination rates along the genome varied between populations, as can be seen in the comparison of the third chromosome of Nanay and Purus (Fig. 3). Purus appears to have a higher average recombination rate than Nanay for chromosome three. More specifically, particular regions of the chromosome present peaks in one population that are absent in the other. A similar pattern can also be observed for the density of recombination hotspots, e.g. Purus presenting a high density of hotspots in certain regions that is not observed in Nanay. The median 95% probability interval for recombination rate across the genome for each population was found to be several orders of magnitude larger than the uncertainty per site, estimated as the median 95% credibility interval of the trace for each position in the genome for that population (Table 2).
Overall, the mean recombination rate for most of the populations was higher than estimated mutation rates for multicellular eukaryotes of 10−6 changes per kb per generation (Lynch, 2010; Exposito-Alonso et al., 2018) (Table 1). Two populations, Guianna and Criollo, were notable exceptions, having higher average recombination rates than the other populations by one and two orders of magnitude respectively. Guianna and Criollo also have been estimated to have a lower effective population size (Ne) (Cornejo et al., 2018) by one and two orders of magnitude respectively. However, there was no significant linear trend between mean Ne and r/kb (p = 0.1119), indicating that, for a high enough Ne, the ability to detect recombination events is not dictated by the effective population size. When Criollo and Guianna were excluded, the relationship was also not present (p = 0.3886). When all populations were included, the inbreeding coefficient (F, from Cornejo et al., 2018) showed no significant linear association with mean r/kb (p = 0.3361). We also found no linear trend between sample size and mean r/kb (p = 0.2333). The average recombination rate per population was transformed from r/kb to cM/Mb (Table 1) using the Kosambi mapping function (Kosambi, 1943). The average cM/Mb was 4.6 × 10−04.
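For reference, the Kosambi mapping function relates a recombination fraction r between two loci to a map distance d in centimorgans as d = 25 ln((1 + 2r)/(1 − 2r)). The following is a minimal Python sketch of that formula and its inverse. The function names are hypothetical, and the sketch does not attempt to reproduce the windowing or per-kb scaling used to obtain the values in Table 1.

```python
import math

def kosambi_cm(r):
    """Map distance in centimorgans for a recombination fraction r (0 <= r < 0.5)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))

def kosambi_r(d_cm):
    """Inverse mapping: recombination fraction for a map distance in centimorgans."""
    # d (Morgans) = 0.5 * atanh(2r), hence r = 0.5 * tanh(2d)
    return 0.5 * math.tanh(2.0 * d_cm / 100.0)
```

For very small recombination fractions, such as the per-kb rates reported here, the function is nearly linear (d ≈ 100r cM), so the choice of mapping function has little effect at this scale.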
In order to compare the average recombination rates (r/kb) of the different populations, a Kruskal-Wallis test was performed for every pair of populations. The only pair of populations that did not show a significant difference in mean recombination rate was the pair of Nacional and Nanay (p = 0.3). All other pairwise comparisons were highly significant (p < 2 × 10−16).
Comparing recombination hotspot locations between populations
The majority (55.5%) of hotspots identified were not shared between populations. The 25 most numerous sets of hotspots are represented in Fig. 5. The nine largest of these are sets of hotspots unique to single populations. The hotspots unique to the remaining population (Criollo) formed the eleventh largest set. Effective population size (Ne) is not a good linear predictor of the amount of detected hotspots (p = 0.1489), nor is sample size (p = 0.351).
The recombination rate in hotspot regions for nine of the populations was on average between 22 and 237% higher than the average recombination rate of the genome. The exception was Guianna, which only showed an approximately 1% increase in average recombination rate in hotspot regions when compared to that of the non-hotspot regions. A 1% higher average recombination rate in hotspots may be due to an increased ability to detect hotspots in regions of low recombination for this population. Additionally, Guianna presents unusually large hotspots (average 8.9 kb, Table 6), which points to an especially low resolution in hotspot detection for this population.
Despite the majority of hotspots not being shared between populations, we conducted pairwise Fisher’s exact tests to verify whether there was significantly more hotspot overlap between populations than expected if hotspots were randomly distributed along the genome. For most pairs of populations we found significantly more hotspot overlap than expected (Table 3). There were three comparisons that did not show significantly more overlap than expected: Amelonado-Nacional, Amelonado-Purus, and Criollo-Nacional. A Mantel test comparing distances between populations based on shared hotspots and FST values between populations resulted in a significant correlation between them (r = 0.66, p = 0.002). The correlations between eigenvectors of the hotspot correlation matrix and those of the genetic covariance matrix were also explored. When all populations were included, we found that the first eigenvector from the genetic covariance matrix was not significantly correlated with the first eigenvector from the hotspot correlation matrix (p = 0.7055), but the second genetic eigenvector was (p = 0.009007, r = 0.7711638). However, the first eigenvector of the genetic covariance matrix captured the difference between the Criollo population (the only domesticated variety) and the rest of the populations. The second eigenvector explains most of the natural differentiation across populations (Cornejo et al., 2018). For that reason, we decided to exclude Criollo and repeat the analysis. We found that the first eigenvector from the correlation matrix constructed from shared hotspot information was not significantly correlated with either of the first two eigenvectors of the genetic covariance matrix when Criollo was excluded (eigenvector 1: p = 0.1314, eigenvector 2: p = 0.3376).
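The logic of a Mantel test like the one reported above can be sketched in a few lines of Python: the correlation between the off-diagonal entries of two distance matrices is compared with the correlations obtained after jointly permuting the rows and columns of one matrix. This is an illustrative sketch only. The function names, the number of permutations, and the one-sided p-value are assumptions, and the software used for the analysis in this study is not specified in this section.

```python
import math
import random

def _upper(m):
    """Flatten the strict upper triangle of a square matrix."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]

def _pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mantel(d1, d2, permutations=999, seed=0):
    """Return (r, p): the matrix correlation and a one-sided permutation
    p-value for a correlation at least as large as the observed one."""
    rng = random.Random(seed)
    n = len(d1)
    x = _upper(d1)
    r_obs = _pearson(x, _upper(d2))
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)
        # Permute rows and columns of d2 together to keep it a valid matrix
        permuted = [[d2[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if _pearson(x, _upper(permuted)) >= r_obs:
            hits += 1
    return r_obs, hits / (permutations + 1)
```

Permuting population labels, rather than individual matrix entries, preserves the dependence among distances that share a population, which is why a Mantel test is used instead of an ordinary correlation test.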
To study the effects of demographic history more closely, shared hotspots were converted to dimensions of a multiple correspondence analysis and modeled along a previously constructed drift tree (Cornejo et al., 2018). Modeling the dimension as a Brownian motion was a better fit (AIC=79.4) than modeling it as an Ornstein-Uhlenbeck (OU) process (AIC=81.4), which is consistent with the small number of hotspots shared between populations. The model assuming Brownian motion is consistent with pure drift driving differentiation of a trait along a genealogy, while an OU process is consistent with a higher trait maintenance (stabilizing selection).
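As a reading aid for the two AIC values above, AIC differences can be converted to Akaike weights, which express the relative support for each candidate model. The sketch below is a standard calculation applied to the reported values and is not an analysis taken from this study; the function name and model labels are hypothetical.

```python
import math

def akaike_weights(aic_by_model):
    """Convert AIC scores into Akaike weights (relative support summing to 1)."""
    best = min(aic_by_model.values())
    # Relative likelihood of each model compared with the best-scoring one
    rel = {name: math.exp(-0.5 * (aic - best)) for name, aic in aic_by_model.items()}
    total = sum(rel.values())
    return {name: value / total for name, value in rel.items()}

# AIC values reported in the text for the two models
weights = akaike_weights({"BM": 79.4, "OU": 81.4})
```

With a difference of two AIC units, the weights come out to roughly 0.73 for Brownian motion and 0.27 for the OU process, so the preference for Brownian motion is real but modest.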
Identifying DNA sequence motifs associated with the locations of recombination hotspots
RepeatMasker was used to analyze the set of recombination hotspots that were present in at least eight T. cacao populations (17 total hotspots), as well as the consensus set of recombination hotspots, and the reference genome. In order to determine whether a particular set of DNA sequence repeats was overrepresented in the regions of ubiquitous recombination hotspots, the percentage of DNA sequence that was identified as potentially being from retroelements or DNA transposons was compared to an empirical distribution. The percentage of observations from the distribution which were greater than the observed value is reported in Table 4. While retroelements were found to be underrepresented in the ubiquitous hotspots, DNA transposons were marginally overrepresented.
Identifying genomic features associated with the location of recombination hotspots
An overrepresentation of recombination hotspots was found in all ten of the populations at transcriptional start sites (TSSs) and transcriptional termination sites (TTSs) (Table 5). The level of overrepresentation of hotspots in particular regions was compared to a null expectation based on simulations of hotspots of the same size as the ones detected, distributed randomly along the chromosomes. For all populations, all 1000 simulations showed a lower proportion of overlap with TSSs and TTSs than the observed. In the case of exons and introns, seven populations (Contamana, Criollo, Iquitos, Maranon, Nacional, Nanay, Purus) had an observed value that was lower than all, or almost all (Purus for exons), simulations. Three of the remaining four populations (Amelonado, Curaray, and Nanay) had no clear trend in either direction (Table 5). The final population (Guianna) showed an overrepresentation of hotspots in both exons and introns.
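The null expectation described above, simulated hotspots of the observed sizes placed at random along a chromosome, can be sketched as follows. This is a simplified illustration under stated assumptions: a single chromosome, uniform placement, and hypothetical function names. The resampling scheme used in this study was tailored to the genomic architecture of T. cacao and is not reproduced here.

```python
import random

def overlaps_any(start, end, features):
    """True if the half-open interval [start, end) overlaps any feature interval."""
    return any(start < f_end and f_start < end for f_start, f_end in features)

def overlap_fraction(hotspots, features):
    """Fraction of hotspots overlapping at least one feature."""
    return sum(overlaps_any(s, e, features) for s, e in hotspots) / len(hotspots)

def null_overlap(hotspots, features, chrom_length, n_sim=1000, seed=0):
    """Place hotspots of the observed sizes uniformly at random and return
    the overlap fraction obtained in each simulation."""
    rng = random.Random(seed)
    sizes = [e - s for s, e in hotspots]
    fractions = []
    for _ in range(n_sim):
        placed = []
        for size in sizes:
            start = rng.randint(0, chrom_length - size)
            placed.append((start, start + size))
        fractions.append(overlap_fraction(placed, features))
    return fractions

def fraction_below(observed, simulated):
    """Proportion of simulations with a strictly lower value than observed."""
    return sum(value < observed for value in simulated) / len(simulated)
```

A value of fraction_below close to 1 corresponds to the situation reported for TSSs and TTSs, where every simulation showed less overlap than observed; a value close to 0 corresponds to underrepresentation.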
Discussion
Understanding how recombination rates vary between genetically differentiated populations of the same species is an important step toward disentangling the role of recombination in genetic differentiation. This set of T. cacao populations presents a unique opportunity to infer recombination in wild, long- and recently established populations, as well as a domesticated population (Criollo) (Cornejo et al., 2018; Bartley, 2005). This system has allowed us to explore differences in recombination hotspot locations between populations of the same species. Our results point to a conservation of hotspots between populations that generally mirrors the patterns of genetic differentiation between populations. Also, we find that TSSs and TTSs are strongly associated with recombination hotspots in all populations, which is consistent with previous findings in plants (Paape et al., 2012; Choi et al., 2013; Hellsten et al., 2013). This factor seems to play an important role in determining the location of novel hotspots. Finally, hotspots that are shared by at least eight populations appear to be associated with DNA transposons, pointing to a potential mechanism for the maintenance of recombination hotspots at the population-divergence time-scale.
Comparing recombination rates between populations
We found that the eight long-established, wild T. cacao populations show an average recombination rate (r/kb) greater than multicellular eukaryotic mutation rates (Table 1), while the other two populations (Criollo and Guianna) show unusually high average recombination rates in comparison. Despite a small sample for some populations, we found no linear trend between sample size and recombination rate. Additionally, the rates calculated for the two wild, small-sample populations (Curaray and Nacional) were consistent with those of other wild populations. This makes us confident in our estimates, particularly for the domesticated Criollo population. For all populations, the mean recombination rate was found to be greater than the median. This is consistent with high rate outlier values, an expected result in the presence of recombination hotspots. Using the effective population size for Medicago truncatula (Siol et al., 2007) and the estimate of rho from Paape et al. (2012), we calculated r/kb (= 4 × 10−3) and found that it was comparable with the rate found for the Criollo population (Table 1). We also calculated the median recombination rate in cM/Mb for each chromosome using the Kosambi mapping function (Kosambi, 1943) over non-overlapping, 100 SNP windows. The average cM/Mb for all populations was 4.6 × 10−04, which is lower than has been measured for any Malvale (Kundu et al., 2015), but not as low as the lowest measured for conifers (Chen et al., 2010; Stapley et al., 2017). This places T. cacao on the high end of known recombination rates for its order but comfortably in the range of other long-lived, woody plants. Average recombination rates in cM/Mb varied between populations from Amelonado (4.04 × 10−06) to Criollo (3.91 × 10−03). Previous work has shown that Criollo is the only population showing a strong signature of domestication, as revealed by a much higher drift parameter than that observed for other populations (Cornejo et al., 2018).
Domestication has been observed to increase recombination rates, particularly in plants (Ross-Ibarra, 2004), and is a possible explanation for the higher recombination rate observed for the Criollo population. The high recombination rate observed in Guianna can be explained in a similar way; while Guianna does not show a strong signature of domestication, it is the most recently established population (Bartley, 2005), and it has also undergone a recent bottleneck (Cornejo et al., 2018). We hypothesize from this result that the Guianna population is undergoing the initial stages of domestication and its increased recombination is an early indicator of this. It is possible that the high recombination rates estimated for Criollo and Guianna can be explained by biases in estimation caused by errors associated with small samples or low genetic variation; yet, the recombination rates for Amelonado (another population with low variation) or Purus (a population with small sample size) did not present this problem. Analyses exploring mutations of putative recombination suppression genes (Fernandes et al., 2018) could help disentangle the nature of this extreme variation in recombination rate in the Criollo and Guianna populations.
Despite recombination rates for eight of the ten populations being of the same order of magnitude, pairwise comparisons of average rates indicated that most populations have a significantly different rate of recombination from the others. The only exception was the pair Nacional and Nanay, whose average rates were not significantly different from each other. These two populations, however, are not more closely related to each other than they are to other populations, based on genetic differentiation (Cornejo et al., 2018). We interpret this result as suggesting that their similarity is not due to genetic similarity, but to some other factor, e.g. epigenetics.
The likelihood of detecting hotspots of recombination in the genome will likely be affected by the amount of uncertainty in the estimates of recombination across sites or regions. Yet, we have been unable to identify any study where the magnitude of the uncertainty in the estimates of recombination is assessed to address this issue. We have performed careful comparisons and assessed the magnitude of the uncertainty in the estimation of recombination rates to show that this uncertainty is several orders of magnitude smaller than the variation in recombination rates across the genome (Table 2).
Comparing recombination hotspot locations between populations
Similarly to recombination rates, the location of recombination hotspots can be very informative to questions of divergence between populations. Understanding the pattern and rate of change of recombination hotspots at the population level can elucidate their role in shaping genome architecture, impacting how effectively selection operates (Felsenstein, 1974). We found that a large proportion (55.5%) of hotspots detected are unique to a single population. While we do not detect all the hotspots in these populations and not all the hotspots detected are necessarily true positives, this proportion of unique hotspots can be seen as an indicator that the turnover rate for hotspots is faster than the time it took the 10 populations to differentiate. The detection rate for LDhot is approximately 55% under constant population conditions, and greater when a recent bottleneck has occurred (Auton et al., 2014; Dapper and Payseur, 2017). Only two of the populations in this study (Criollo and Guianna) have a known recent bottleneck (Cornejo et al., 2018). However, Criollo was the only one of these two with an unusually low hotspot count (Table 6). Criollo’s low number of detected hotspots can be a product of its increased genome-wide recombination rate, making the signal of hotspots less pronounced. The observed variability of hotspot location between populations points to demographic history not being the main driver of recombination hotspot location. However, the hotspots tend to appear in similar regions, as demonstrated by the Fisher’s exact tests (Table 3). This dichotomy can be explained by considering that the proportion of the genome occupied by recombination hotspots is very low, so even a small proportion of hotspots from two different populations being in the same region is enough for the Fisher’s exact test to recognize them as significantly similar. 
This small but significant similarity can occur by recombination being limited in its possible positioning along the genome, but not to the point of forcing hotspots to occur consistently in the same locations, and thus maintaining some level of stochasticity. It is important to note that our hotspots are unusually large (Table 6). This is likely a product of our low sample size leading to low resolution when resolving hotspot regions.
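The point that a small shared fraction can still be highly significant is easy to see with a toy 2 × 2 table. The following sketch computes a one-sided Fisher's exact test from the hypergeometric distribution. The window-based tabulation, the counts, and the function name are hypothetical choices made for illustration, since the exact construction of the contingency tables behind Table 3 is not described in this section.

```python
from math import comb

def fisher_greater(both, only_a, only_b, neither):
    """One-sided Fisher's exact test: probability of an overlap at least as
    large as `both`, given the margins of the 2 x 2 table."""
    total = both + only_a + only_b + neither
    n_a = both + only_a  # windows containing a hotspot in population A
    n_b = both + only_b  # windows containing a hotspot in population B
    numerator = sum(
        comb(n_a, k) * comb(total - n_a, n_b - k)
        for k in range(both, min(n_a, n_b) + 1)
    )
    return numerator / comb(total, n_b)

# Toy example: 10,000 windows, 100 hotspot windows per population, 10 shared
p_value = fisher_greater(10, 90, 90, 9810)
```

In this toy example only one shared window is expected by chance, so observing ten yields a very small p-value even though 90% of each population's hotspot windows are not shared.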
Given the significant proportion of overlapping hotspots between populations, it was still important to explore whether the similarities can be explained by shared genetic history. If demographic history explains the evolution of hotspot location, we would expect that more closely related populations would have a higher percentage of overlapping hotspots. A significant relationship was found between population differentiation (FST) and the distance between populations based on shared hotspots (Mantel test, r = 0.66, p = 0.002). The comparison between the hotspot correlation matrix and the genetic covariance matrix supports what was found when comparing the hotspot correlation matrix to the FST matrix. One caveat is that the first genetic eigenvector, which separated Criollo from the other populations, was not correlated with the first hotspot correlation eigenvector, indicating that Criollo’s domestication generated a genetic pattern that deviates from the pattern of shared hotspots. This indicates that, to some extent, the genetic differentiation and the location of hotspots are mirroring each other, which could be due to recombination hotspots being a product of the shared history between the populations. However, since recombination rates were estimated using a coalescent-based method, we expect historical relationships to be represented in our findings. We transformed the information of hotspot overlap to model hotspots as quantitative traits changing along a population tree (Cornejo et al., 2018). Our results show that a Brownian motion model (AIC=79.4) fits the data better than an Ornstein-Uhlenbeck model with stabilizing selection (AIC=81.4), and suggest that, in principle, drift alone could explain the evolution of the location of recombination hotspots. However, the absolute number of hotspots that are shared among populations indicates that demographic history alone is insufficient to explain the evolution of recombination hotspots in this species.
One conclusion that follows from these results is that, while shared recombination hotspots can to some extent be explained by patterns of genetic differentiation, some of the sharing can simply be due to a tendency for hotspots to arise near TSSs and TTSs. It has been observed in other organisms that hotspots of recombination are frequently associated with specific genomic features (including TSSs and TTSs) (Auton et al., 2013; Choi et al., 2013; Hellsten et al., 2013; Myers et al., 2005; Singhal et al., 2015) or DNA sequence motifs (Auton et al., 2012; Brunschwig et al., 2012; Stevison et al., 2016). These factors can affect the landscape of recombination, contributing to the patterns of shared hotspot locations between populations that we are observing in T. cacao. Previous studies looking at apes and finches have explored recombination hotspots in multiple species and as many as two populations of the same species (Singhal et al., 2015; Stevison et al., 2016; Shanfelter et al., 2018), but this study is the first to compare hotspots in more than two populations of the same species at once. The increased number of populations allows us to analyze the relationship between population genetic processes and recombination. Our results suggest that the pattern of gains and losses of recombination hotspots is very dynamic and the landscape of recombination changes rapidly during the process of diversification within a species. This dynamism can have a tremendous impact on the adaptive dynamics of a species, and it should be taken into account, considering that theoretical studies tend to assume that recombination rates are constant during the evolution of populations (Hudson and Kaplan, 1988; Donnelly and Kurtz, 1999).
Identifying DNA sequence motifs associated with the locations of recombination hotspots
The analysis of 17 hotspots shared between at least eight populations of T. cacao found an underrepresentation of retroelements and a marginal overrepresentation of DNA transposons when compared to the entire genome (Table 4). These results are not entirely surprising, as it has already been suggested that transposable elements (TEs) tend to be enriched in areas of low recombination in Drosophila as a consequence of selection against TEs (Rizzon et al., 2002). However, the marginal overrepresentation of DNA transposons in the most conserved recombination hotspots is unexpected, given that previous observations have shown a reduced representation of mobile elements in areas with high recombination rates (Rizzon et al., 2002). It is possible that DNA transposons are at least partly responsible for the maintenance of recombination hotspots as populations diverge, in which case we would expect site-directed recombination to be more frequent in these locations of the genome. However, the low percentage of these sequences observed in the set of all hotspots (Table 4) indicates that these sequences have only a small effect on the maintenance of hotspots. It has been observed in humans that short DNA motifs enriched for repeat sequences determine the location of 40% of hotspots enriched for recurrent non-allelic homologous recombination (McVean, 2010). One potential explanation for why natural selection does not eliminate hotspots in these regions is that these regions do not produce a large enough mutational load for natural selection to remove them from the population (McVean, 2010).
Identifying genomic features associated with the location of recombination hotspots
For all ten populations, an overrepresentation of hotspots was found in the areas immediately preceding and following transcribed regions of the chromosome. This matches the findings of previous studies in Arabidopsis thaliana (Choi et al., 2013), Taeniopygia guttata and Poephila acuticauda (Singhal et al., 2015), and humans (Myers et al., 2005). The most likely explanation is that recombination events within genes are selected against. The rationale is that a recombinant chromosome that undergoes a double-strand break in the middle of a coding region will have a higher risk of being inviable, and will therefore not be represented in the current set of chromosomes for its population. Recombination occurring at transcription start and termination sites, on the other hand, breaks up haplotypes and shuffles alleles onto different genomic backgrounds while preserving the functionality of coding regions. This rationale is supported by previous findings of increased recombination rates in these regions (Choi et al., 2013). It is also supported by results from PRDM9 knock-out Mus musculus, which show a reversion to hotspots located near TSSs (Brick et al., 2012). The enrichment of T. cacao hotspots in TSSs and TTSs is thus a reasonable result given that zinc-finger binding motifs and potential modifiers like PRDM9 have not been identified in this species.
Implications for the evolutionary history of T. cacao
Overall, our results show a largely consistent pattern in which recombination rates in the ten populations of T. cacao are of a similar magnitude to mutation rates, but show high diversity in the location and number of recombination hotspots that cannot be explained solely by the process of diversification of the populations. In fact, the results indicate that the turnover rate of hotspots is faster than the process of divergence among populations. One hypothesis that could explain the rapid turnover of recombination hotspots and the relative differences in recombination among populations is that epigenetic changes are involved in controlling the turnover of recombination in plants. This hypothesis is not unreasonable given the recent observation of epigenetic control of recombination in plants (Yelina et al., 2015). Further theoretical and simulation work should be done in order to better understand the implications of rapidly changing recombination hotspots for adaptive dynamics. We also show that there is an overall underrepresentation of hotspots in exons and introns for most populations, which is consistent with purifying selection acting against changes that could disrupt gene function. On the other hand, we observed an overrepresentation of hotspots in TTSs and TSSs for all ten populations. This could affect the maintenance and spread of beneficial traits in the population by shuffling allelic variants of genes without disrupting their function. We hypothesize that the enrichment of recombination hotspots in TTSs and TSSs can have an important impact on the spread of beneficial mutations across different genomic backgrounds, increasing the rate of adaptation to selective pressures (e.g. selection for improved pathogen response).
Materials and Methods
Comparing recombination rates between populations
Sequence data were downloaded from the Cacao Genome Database and NCBI (Accession PRJNA486011), including the reference sequence for each chromosome and the full genome annotation (Theobroma cacao cv. Matina 1-6 v1.1) (Motamayor et al., 2013). Processing was done using the pipeline from Cornejo et al. (2018), available at the github repository oeco28/Cacao_Genomics. Full genome data were used from a total of 73 individuals across 10 populations (Cornejo et al., 2018): Criollo (N = 4, #SNPs = 309,818), Curaray (N = 5, #SNPs = 1,106,871), Contamana (N = 9, #SNPs = 2,097,618), Amelonado (N = 11, #SNPs = 373,789), Maranon (N = 14, #SNPs = 1,783,226), Guianna (N = 9, #SNPs = 770,729), Iquitos (N = 7, #SNPs = 1,575,711), Purus (N = 6, #SNPs = 1,184,181), Nanay (N = 10, #SNPs = 830,885), and Nacional (N = 4, #SNPs = 718,099). We filtered the single nucleotide polymorphism data and excluded rare variants (minor allele frequency <= 0.05) per population. Separate variant files per population per chromosome were then phased with SHAPEIT2 (Delaneau et al., 2011) under default parameters. Haplotype files were converted back to phased variant call format (VCF) for downstream analysis. We also phased the data with Beagle (Browning and Browning, 2007), using a burn-in of 10000 iterations and estimation over 10000 iterations. No appreciable differences were observed between the two methods and Beagle phasing was maintained for the analyses. The reason for performing the phasing separately for each population is that linkage disequilibrium patterns are expected to be affected by population structure. The ten populations have been shown to be unique clusters with very little admixture between them (Cornejo et al., 2018), and the individuals used in this study were those whose ancestry was clearly from a single population. VCFTools (Danecek et al., 2011) was used to remove all singletons and doubletons. Only bi-allelic single nucleotide polymorphisms (SNPs) were retained and were exported in LDhat format.
In order to estimate recombination rates we used the interval routine of LDhat (Auton and McVean, 2007), a program that implements coalescent resampling methods to estimate historical recombination rates from SNP data. To reduce computation time, each chromosome was split into windows, each containing 2000 SNPs. To counteract the overestimation of recombination rate produced at the ends of the windows, an overlap of 500 SNPs was left between consecutive windows. The final window for each chromosome did not always match the general scheme, so the final 2000 SNPs were taken (making the overlap with the second to last window variable, but never less than 500 SNPs) (Fig. 6). Once these windows were generated, LDhat was run over each window using 100 million iterations, sampling every 10000 iterations (10000 total points sampled), with a block penalty of 5. Lookup tables with a grid of 100 points, a population mutation rate parameter (θ) of 0.1 and a number of sequences (n) of 50 were used for all populations. We used the same θ for all populations since estimates from Cornejo et al. (2018) ranged from π = 0.27% to π = 0.37%, all comfortably within an order of magnitude of each other. The first 50 million iterations were discarded as burn-in. Once recombination rates were calculated, 250 positions were cut off from both windows involved in each overlap, so that the estimates for the first half of the overlap were taken from the end of the preceding window and the estimates for the second half of the overlap were taken from the beginning of the following window. The final overlap in each chromosome was split so as to remove 250 SNPs from the second to last window, regardless of the remaining size of the last window. The remaining rate estimates were then merged in order to obtain recombination rates for the entire chromosome. This was done for each chromosome of each population.
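The windowing and trimming scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names `make_windows` and `trim_windows` are hypothetical, windows are expressed as half-open ranges of SNP indices, and the bookkeeping of rates between adjacent SNPs is ignored.

```python
def make_windows(n_snps, size=2000, overlap=500):
    """Return half-open (start, end) SNP-index windows covering a chromosome.

    Regular windows advance by size - overlap SNPs; the final window is
    always the last `size` SNPs, so its overlap with the second to last
    window is variable but never smaller than `overlap`.
    """
    if n_snps <= size:
        return [(0, n_snps)]
    step = size - overlap
    windows = []
    start = 0
    while start + size < n_snps:
        windows.append((start, start + size))
        start += step
    windows.append((n_snps - size, n_snps))
    return windows


def trim_windows(windows, cut=250):
    """Return the part of each window whose estimates are kept.

    Each window (except the last) loses its final `cut` SNPs, and the next
    window starts contributing exactly where the previous one stops. For a
    regular 500-SNP overlap this splits the overlap in half; for the final,
    variable overlap it removes `cut` SNPs from the second to last window.
    """
    kept = []
    for i, (start, end) in enumerate(windows):
        keep_start = start if i == 0 else windows[i - 1][1] - cut
        keep_end = end if i == len(windows) - 1 else end - cut
        kept.append((keep_start, keep_end))
    return kept


if __name__ == "__main__":
    for window, kept in zip(make_windows(5600), trim_windows(make_windows(5600))):
        print(window, "->", kept)
```

The kept ranges tile the chromosome without gaps or duplicates, so concatenating the per-window estimates in order gives one rate profile per chromosome.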
The estimation of recombination rates with LDhat is approximated using a sampling scheme with a Markov Chain Monte Carlo (MCMC) algorithm as implemented in the interval routine. The inference of recombination rates is the result of the integration of estimated parameter values across iterations with the routine stats. In the majority of recent studies where LDhat or LDhelmet are used (Myers et al., 2005; Auton et al., 2012; Brunschwig et al., 2012; Paape et al., 2012; Auton et al., 2013; Choi et al., 2013; Singhal et al., 2015; Stevison et al., 2016), convergence of the Markov chains has not been explicitly investigated. One study that we are aware of used simulations to assess whether its small sample size affected the ability to obtain reliable estimates of recombination using LDhelmet (Booker et al., 2017), but did not assess the uncertainty of the estimates from the MCMC process itself. We argue that evaluating convergence is important for assessing confidence in the reported estimates, especially if there is interest in analyzing differences in recombination rate along the genome. Visual inspection of pilot runs of the analysis showed that convergence was not achieved after running 40M iterations, which is why the length of the chains was increased to 100M iterations. Additionally, we explored the site-wise uncertainty in the estimates of recombination by integrating over the trace of the recombination rate estimates to infer the 95% Credibility Interval. We then estimated the 95% interval of the range of recombination estimates across all sites in the genome to obtain an overall measure of variation, which we compared to the median 95% Credibility Interval for the trace of each position.
In order to compare recombination rates, the effective population size (Ne) calculated for each population (Cornejo et al., 2018) was used to convert rates in Ner/kb to r/kb. Differences in the mean genome-wide recombination rate between populations were then tested using the Kruskal-Wallis test (kruskal.test function from the stats package in R) (R Core Team, 2018). There were 45 comparisons, making the Bonferroni correction cutoff value: α = 0.0011. To transform per population recombination rates from r/kb to cM/Mb, we divided each chromosome into windows of 100 SNPs and used the Kosambi mapping function (Kosambi, 1943). The median for the windows of a chromosome was then calculated, and the average of each population’s chromosomes was taken as that population’s average recombination rate in cM/Mb.
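The two numerical steps in this paragraph, the Bonferroni cutoff and the Kosambi conversion, can be sketched as below. The Kosambi mapping function itself is standard; the helper `window_cM_per_Mb`, which treats the rate as constant across a window, is a simplifying assumption made for illustration and may differ from how the windows were actually processed.

```python
import math
from itertools import combinations


def bonferroni_alpha(n_groups, alpha=0.05):
    """Number of pairwise comparisons and the corrected cutoff."""
    n_pairs = len(list(combinations(range(n_groups), 2)))
    return n_pairs, alpha / n_pairs


def kosambi_cM(r):
    """Kosambi map distance in cM for a recombination fraction 0 <= r < 0.5."""
    return 25.0 * math.log((1 + 2 * r) / (1 - 2 * r))


def window_cM_per_Mb(rate_per_kb, span_bp):
    """cM/Mb for one window, assuming a constant rate (r/kb) over span_bp."""
    r_window = rate_per_kb * span_bp / 1000.0
    return kosambi_cM(r_window) / (span_bp / 1e6)


if __name__ == "__main__":
    n_pairs, cutoff = bonferroni_alpha(10)
    print(n_pairs, round(cutoff, 4))  # 45 pairwise comparisons among ten populations
```

For small recombination fractions the Kosambi distance is close to 100 r cM, so the mapping function mainly matters for windows with high cumulative rates.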
Comparing recombination hotspot locations between populations
Recombination hotspots were estimated with LDhot (Auton and McVean, 2007), a likelihood-based program that tests whether a single distribution model or a two distribution model better explains the observed recombination rates in 1 kb sliding windows (default), for each chromosome. Each chromosome was run in its entirety, with the number of simulations (nsims) set to 1000. The resulting potential hotspots were refined by an alpha of 0.001, and overlapping hotspots were merged. This method therefore detects hotspots by comparing rates in 1 kb windows to the rates in the surrounding regions.
To determine the set of consensus hotspots, the hotspots from all populations were merged. Two hotspots from different populations were considered to be shared if they both overlapped with the same hotspot in the consensus set. To summarize all shared hotspots, a Boolean matrix was constructed, in which a population having a hotspot that overlaps with a hotspot in the consensus list leads to an indication of presence of the consensus hotspot in that population. This matrix was used to determine hotspots shared by two or more populations.
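A minimal sketch of the consensus set and the Boolean presence matrix follows. It assumes hotspots on a single chromosome given as half-open (start, end) intervals, as in BED files; the function names and the toy population labels are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping half-open intervals into a sorted consensus list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def presence_matrix(hotspots_by_pop):
    """Return (consensus, matrix); matrix[pop][i] is True when the population
    has a hotspot overlapping consensus hotspot i."""
    pooled = [hs for hotspots in hotspots_by_pop.values() for hs in hotspots]
    consensus = merge_intervals(pooled)
    matrix = {
        pop: [any(s < c_end and c_start < e for s, e in hotspots)
              for c_start, c_end in consensus]
        for pop, hotspots in hotspots_by_pop.items()
    }
    return consensus, matrix


def shared_hotspots(consensus, matrix, min_pops=2):
    """Consensus hotspots present in at least min_pops populations."""
    return [c for i, c in enumerate(consensus)
            if sum(row[i] for row in matrix.values()) >= min_pops]


if __name__ == "__main__":
    toy = {"A": [(100, 200), (500, 600)], "B": [(150, 250)], "C": [(580, 700), (900, 950)]}
    consensus, matrix = presence_matrix(toy)
    print(consensus)
    print(shared_hotspots(consensus, matrix))
```

Under this definition, two hotspots that do not overlap each other can still count as shared if a third population's hotspot bridges them in the consensus set.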
A Fisher’s exact test was run for each pair of populations in order to determine whether the hotspots of the two populations overlap significantly more than expected. The BED files containing the locations of the recombination hotspots for each pair of populations were compared using Bedtools:fisher (Quinlan and Hall, 2010). The number of comparisons was 45, making the Bonferroni correction cutoff value: α = 0.0011.
In order to compare the relationships between populations based on shared hotspots we calculated Jaccard distances (distance function, philentropy package, R) (Drost, 2018) and compared them to a published FST matrix (Cornejo et al., 2018) using a Mantel test (mantel.rtest function, ade4 package, R) (Chessel et al., 2004; Dray and Dufour, 2007; Dray et al., 2007; Bougeard and Dray, 2018). The FST estimates from Cornejo et al. (2018) were generated using Weir and Cockerham’s estimator (Weir and Cockerham, 1984).
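The logic of this comparison can be sketched in plain Python. This is an illustration of the general technique (Jaccard distance on presence/absence vectors, followed by a one-sided permutation Mantel test on Pearson correlations), not a reimplementation of the philentropy or ade4 functions; the function names, the number of permutations and the seed are assumptions.

```python
import random
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |intersection| / |union| for two presence/absence vectors."""
    union = sum(1 for x, y in zip(a, b) if x or y)
    inter = sum(1 for x, y in zip(a, b) if x and y)
    return 1.0 - inter / union if union else 0.0


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mantel(d1, d2, n_perm=999, seed=0):
    """One-sided permutation Mantel test between two square distance matrices.

    Rows and columns of d2 are permuted together; the p-value is the share
    of permutations whose correlation is at least the observed one.
    """
    n = len(d1)
    pairs = list(combinations(range(n), 2))
    x = [d1[i][j] for i, j in pairs]
    r_obs = pearson(x, [d2[i][j] for i, j in pairs])
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        perm = list(range(n))
        rng.shuffle(perm)
        r_perm = pearson(x, [d2[perm[i]][perm[j]] for i, j in pairs])
        if r_perm >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)
```

Permuting whole populations, rather than individual matrix entries, is what preserves the dependence among distances that share a population.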
The Boolean matrix for shared hotspots was also used to explore the relationship between hotspot similarities and genetic covariances from a previous study (Cornejo et al., 2018). Singletons were removed from the hotspot matrix, which was converted to a correlation matrix using the mixed.cor function from the psych package in R (Revelle, 2018). The mixed.cor function was used due to its ability to calculate Pearson correlations from dichotomous data. We then used the eigen function in R (R Core Team, 2018) to generate eigenvectors for the hotspot correlation matrix and the genetic covariance matrix. Pearson correlations between the first and second eigenvector of the genetic covariance matrix and the hotspot correlation matrix were then calculated (cor.test function, stats package, R)(R Core Team, 2018). This analysis was done once with all populations included, and once with the Criollo population excluded before correlations were calculated.
In order to model the presence or absence of hotspots along a drift tree, a multiple correspondence analysis was performed on the Boolean matrix of shared hotspots using the MCA function from the FactoMineR package in R (Lê et al., 2008). Nine dimensions were retained and used as traits along a previously generated drift tree (Cornejo et al., 2018). Using the Rphylopars package in R (Goolsby et al., 2016), the dimensions were modeled as Brownian motion and as an Ornstein-Uhlenbeck process. The fit of the two models was compared using the AIC values for the best fitting models of each type.
Identifying DNA sequence motifs associated with the locations of recombination hotspots
Motifs associated with hotspots were found using RepeatMasker (Smith et al., 2016). The entire genome, the set of consensus hotspots, and a set of ubiquitous hotspots (hotspots shared by at least eight of the populations) were examined with RepeatMasker, using normal speed and “theobroma cacao” in the species option. In order to determine whether ubiquitous hotspots were enriched for particular DNA sequences, a set of the same number and size of sequences was randomly selected from the genome using Bedtools:shuffle (Quinlan and Hall, 2010) and examined with RepeatMasker. This simulation was repeated one thousand times and a null distribution against which observed values were compared was constructed from the results.
Identifying genomic features associated with the location of recombination hotspots
Testing whether recombination hotspots were overrepresented near particular genomic features was done by using a resampling scheme to establish null expectations and then comparing the observed value to the empirical distribution. For each feature, locations were retrieved and the number of observed hotspots that overlap with this feature was counted. To determine whether this number of overlapping hotspots was unusually high or low, a set of hotspots that matched the number of hotspots and the size of each hotspot was simulated. These simulated hotspots were placed randomly along the chromosome, using a uniform distribution. When simulated hotspots overlapped, the location of one of them was sampled again. The simulation was run 1000 times and the number of simulated hotspots that overlap with the true genomic features was measured for each simulation. The simulations generate an expected distribution of overlap with the genomic feature, and the true value was then compared to the distribution. Features tested were: transcriptional start sites (TSSs), transcriptional termination sites (TTSs), exons, and introns. TSSs and TTSs are considered to be the 500 bp upstream and downstream of coding regions, respectively.
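The resampling scheme can be sketched as follows for a single chromosome. Intervals are half-open (start, end) pairs; the function names, the seed, and the add-one correction in the empirical p-values are assumptions made for this illustration and are not taken from the scripts used in the study. The sketch also assumes that hotspots cover a small fraction of the chromosome, so that re-drawing overlapping placements terminates quickly.

```python
import random


def overlaps(a, b):
    """True when two half-open intervals share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def count_overlapping(hotspots, features):
    """Number of hotspots that overlap at least one feature."""
    return sum(1 for h in hotspots if any(overlaps(h, f) for f in features))


def simulate_hotspots(sizes, chrom_len, rng):
    """Place one interval per size uniformly along the chromosome,
    re-drawing any placement that overlaps an already placed interval."""
    placed = []
    for size in sizes:
        while True:
            start = rng.randrange(0, chrom_len - size + 1)
            candidate = (start, start + size)
            if not any(overlaps(candidate, p) for p in placed):
                placed.append(candidate)
                break
    return placed


def enrichment_test(hotspots, features, chrom_len, n_sims=1000, seed=0):
    """Return (observed count, p for enrichment, p for depletion)."""
    rng = random.Random(seed)
    sizes = [end - start for start, end in hotspots]
    observed = count_overlapping(hotspots, features)
    null = [count_overlapping(simulate_hotspots(sizes, chrom_len, rng), features)
            for _ in range(n_sims)]
    p_high = (sum(1 for n in null if n >= observed) + 1) / (n_sims + 1)
    p_low = (sum(1 for n in null if n <= observed) + 1) / (n_sims + 1)
    return observed, p_high, p_low
```

Because the simulated set keeps the number and sizes of the observed hotspots, long hotspots are not penalized or favored relative to a null that ignores interval length.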
The reason for the proposed novel resampling scheme is that, if the size and distribution of genomic features and hotspots were not taken into account, it would set unrealistic expectations for the overlap between features under a null model of no association. In this sense, the null model would be inappropriate and potentially inflate the false positive rate.
Data and code availability
Rate and summary files from LDhat runs, as well as hotspots for each population, will be placed in a Dryad repository. Scripts for LDhat and LDhot runs, as well as the resampling schemes used and additional analyses, are available in the following github repository: ejschwarzkopf/recombination-map.
Acknowledgments
The authors would like to thank the Noe Higinbotham endowment and the WSU College of Arts and Science for travel funds to EJS to present earlier versions of this work. We would like to thank the Kamiak High Performance Computing Cluster at WSU for the infrastructure support to run the analyses, and the Cornejo, Kelley, and Busch labs at WSU for feedback and edits on the manuscript.