Abstract
As recombination plays an important role in evolution, its estimation, as well as the identification of hotspot positions, is of considerable interest. We propose a novel approach for estimating historical recombination along a chromosome that involves a sequential multiscale change point estimator. Our method can also take demography into account. It uses a composite likelihood estimate and other summary statistics within a regression model fitted on suitable scenarios. Our proposed method is accurate, computationally fast, and provides a parsimonious solution by ensuring a type I error control against too many changes in the recombination rate. An application to human genome data suggests a good congruence between our estimated and experimentally identified hotspots. Our method is implemented in the R-package LDJump, which is freely available from https://github.com/PhHermann/LDJump.
1 Introduction
Recombination is a process during meiosis starting with the formation of DNA double-strand breaks (DSBs) and resulting in an exchange of genetic material between homologous chromosomes [Baudat et al., 2013]. The process leads to the formation of new haplotypes and increases the genetic variability in populations. In most species, recombination is concentrated in narrow regions known as hotspots, 1-2 kb in length, flanked by large zones with low recombination or cold regions. Meiotic recombination is a tightly regulated process defined mostly by a methyltransferase protein called PR domain zinc finger protein 9 (PRDM9) in most mammals (reviewed in [Baudat et al., 2013, Tiemann-Boege et al., 2017]). PRDM9 binds to a certain sequence motif (Myers motif) with its zinc finger array and recruits the DSB machinery (SPO11) to the hotspot (reviewed in [Tiemann-Boege et al., 2017]). Hotspots vary between species (human vs. chimpanzee see [Auton et al., 2012] or mice see [Smagulova et al., 2011]), populations within species (human populations like Africans and Europeans see [The 1000 Genomes Project Consortium, 2015, Pratto et al., 2014, Berg et al., 2010]), individuals within species (humans see [Pratto et al., 2014]), individuals of different sexes (see [Kong et al., 2010]) as well as between viruses (reviewed in [Pérez-Losada et al., 2015]).
Molecular and evolutionary mechanisms of the process of recombination can be better understood with accurate local estimates of the recombination rate [McVean et al., 2004, Chan et al., 2012]. Moreover, knowledge of the recombination rate variation along DNA sequences improves inference from polymorphism data about, e.g., positive selection [Sabeti et al., 2006] and linkage disequilibrium [Hill and Robertson, 1968], and facilitates an efficient design and analysis of disease association studies [McVean et al., 2004]. For this purpose, we designed LDJump, an algorithm that provides a fast and reliable new estimate of variable genome-wide historical recombination rates by partitioning the DNA sequence into homogeneous regions with respect to recombination, and that can also take demography into account.
Methods differing in their genome-wide coverage, resolution, and reliance on active recombination have been proposed to estimate recombination rates in humans. Experimental approaches include whole genome sequencing or SNP typing of pedigrees of at least 2-3 generations [Coop et al., 2008, Kong et al., 2010, Halldorsson et al., 2016], leading to a resolution of order tens of kilobases, given that not many recombination events are captured. Direct measurements in sperm provide high resolution events at the level of a few hundreds of base pairs, but lack genome-wide coverage [Kauppi et al., 2004, Arnheim et al., 2007, Arbeithuber et al., 2015]. Finally, recombination hotspots have been inferred by the analysis of patterns of linkage disequilibrium [McVean et al., 2004, Myers et al., 2005, Myers et al., 2008]. The latter approach provides genome-wide historical recombination events based on polymorphisms characterized in several individuals within a population.
One of the first approaches to infer the population recombination rate ρ from patterns of linkage disequilibrium was to compute a lower bound on the number of recombination events [Hudson and Kaplan, 1985, Wiuf, 2002, Myers and Griffiths, 2003]. In population genetics, ρ is defined as ρ = 4N_e r, where N_e is the effective population size and r the recombination rate per base pair (bp) and generation. Other methods estimate ρ via maximum likelihood [Kuhner et al., 2000, Fearnhead and Donnelly, 2001] or approximations to the likelihood [Hudson, 2001, Fearnhead and Donnelly, 2002, McVean et al., 2002, Li and Stephens, 2003, Wall, 2004]. The former methods rely on simulations using importance sampling [Fearnhead and Donnelly, 2001] or Markov chain Monte Carlo (MCMC) methods [Kuhner et al., 2000] to become computationally feasible. The latter approaches use a composite likelihood as in [Hudson, 2001], or a modified composite likelihood as in [McVean et al., 2002]. Software implementations such as LDhat [McVean et al., 2004, Auton and McVean, 2007] and LDhelmet [Chan et al., 2012] are also available. [Kamm et al., 2016] extend this approach to account for demographic effects in their software package LDpop. Generally, computing approximate likelihoods requires a somewhat smaller computational effort than full likelihoods at the price of a slight loss in accuracy. An improvement of composite likelihood estimators via optimizing the trade-off between bias and variance has been proposed by [Gärtner and Futschik, 2016]. For a more technical discussion on composite likelihood in general see [Varin et al., 2011, Reid, 2013].
Another approach is to rely on calculating moments or summary statistics [Hudson, 1987, Batorsky et al., 2011]. In [Wall, 2000, Wall, 2004], suitably chosen summary statistics such as the number of haplotypes (haps) are used. There the author performs simulations with given haps, calculates the likelihood for a series of ρ values, and chooses the value of ρ with the highest likelihood as estimator of the recombination rate.
Further well-established frameworks to estimate recombination rates include Lamarc [Kuhner, 2006], OmegaMap [Wilson and McVean, 2006], RDP [Martin et al., 2015], and CodABC [Arenas et al., 2015]. The latter method [Arenas et al., 2015] applies ABC (approximate Bayesian computation) using 26 summary statistics to estimate constant recombination rates for simulated regions of size up to 300 codons for 100 alignments. With the GUI of RDP [Martin et al., 2015], overall patterns of recombination can be examined and tests for hot- and coldspots can be performed through the integration of LDhat [McVean et al., 2004]. Recently, alternative fast estimates of ρ that rely on regression on sliding windows have been proposed by [Lin et al., 2013, Gao et al., 2016]. Their software implementation is called FastEPRR and is recommended for larger samples consisting of 50 sequences or more.
All of these previous methods have at least some limitations, such as being computationally demanding, not being designed for small sample sizes, or producing too many breakpoints in the recombination map. In order to address these issues within one algorithm we developed LDJump. More precisely, we divide the DNA sequence into small segments and estimate the recombination rate per segment via a regression based on the following summary statistics: measures of LD, r², Watterson’s θ, measures on pairwise differences, haplotype heterozygosity, the four gametes test as well as the constant recombination rate estimator of LDhat [McVean et al., 2004]. A frequentist segmentation algorithm [Frick et al., 2014] is then applied to the estimated rates to obtain change-points in recombination. The algorithm controls a type I error and provides confidence bands for the estimator. [Futschik et al., 2014] use a similar approach to partition DNA sequences into homogeneous segments with respect to GC content. In contrast to [Gao et al., 2016] our approach is also designed to work with small sample sizes. As will be shown in the following sections, LDJump allows us to calculate hotspots at high accuracy within a reduced computational time from sample sizes of at least 10 sequences of genomic regions spanning many megabases.
Section 2 contains a detailed description of our proposed method we call LDJump. In section 3 we assess LDJump and compare it with the popular software packages LDhat, LDhelmet, and FastEPRR. We also consider different levels of genetic diversity as well as demographic effects. For different human populations, we apply our approach to a well-characterized region of the human genome. We furthermore estimate population specific recombination maps for the complete human chromosome 16, showing a good overlap between our and experimental estimates of hotspot positions. Finally, we summarize our findings in section 4. Further details on the regression model, bias correction, and more detailed simulation results are provided in the supplementary material.
2 Materials and Methods
Our approach consists of two steps. First, we fit a regression model from simulated data to estimate constant recombination rates on small segments. Subsequently, we apply a segmentation algorithm to estimate breakpoints in the recombination rate which is subject to type I error control against over-estimating the number of identified breakpoints.
2.1 Regression Model for Constant Recombination Rates
We use generalized additive models (GAM) [Wood, 2011] to estimate cubic spline functions f_j(z_j) for covariates z_j, j = 1,…, q and linear (or quadratic) effects for covariates x_k, k = 1,…, l to regress the population recombination rate ρ on summary statistics computed from simulated short DNA segments. The structure of our GAM is
$$\rho_i = \beta_0 + \sum_{j=1}^{q} f_j(z_{ij}) + \sum_{k=1}^{l} \beta_k x_{ik} + \varepsilon_i \qquad (1)$$
for i = 1,…, n. We chose our explanatory variables using an ANOVA on simulated data. The resulting set of (suitably scaled) summary statistics χ includes the constant recombination rate estimator available within the LDhat package, and can be found in Table 1.
For a more detailed description of the regression model as well as the selection of explanatory variables see supplementary material section 1.1. Initial computations revealed variance heterogeneity. Hence, we transformed the population recombination rate ρ using a Box-Cox transformation t(ρ) [Box and Cox, 1964]. In section 1.2 of the supplementary material, we describe the choice of the transformation parameters.
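The role of the transformation can be illustrated with a minimal Python sketch of a two-parameter Box-Cox transform and its inverse. The function names and the parameter values `lam = 0.5` and `shift = 0.001` are hypothetical placeholders chosen for illustration; the parameters actually used are described in the supplementary material and are not reproduced here.

```python
import math


def box_cox(rho, lam, shift=0.0):
    """Two-parameter Box-Cox transform t(rho) = ((rho + shift)**lam - 1) / lam.

    For lam == 0 the transform is defined as log(rho + shift).
    """
    y = rho + shift
    if y <= 0:
        raise ValueError("rho + shift must be positive")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


def inverse_box_cox(t, lam, shift=0.0):
    """Map a transformed value back to the natural scale of rho."""
    if lam == 0:
        return math.exp(t) - shift
    base = lam * t + 1.0
    if base <= 0:
        raise ValueError("value lies outside the range of the transform")
    return base ** (1.0 / lam) - shift


# Hypothetical parameters, for illustration only.
transformed = box_cox(0.01, lam=0.5, shift=0.001)
recovered = inverse_box_cox(transformed, lam=0.5, shift=0.001)
```

In such a scheme the regression would be fitted on the transformed scale and predictions mapped back with the inverse before further processing.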
We observed a systematic overestimation of the background rates as well as underestimated hotspot intensities. Therefore, we performed a simulation based bias correction using quantile regression of the true recombination rate on the above described estimates. For further details on the bias correction see Figure 3 and section 1.3 of the supplementary material.
2.2 Segmentation Algorithm Estimating Variable Recombination Rates
[Frick et al., 2014] introduced a method called SMUCE for detecting change points in a function for observations distributed according to an exponential family. This method starts with a constant function and introduces successively additional jumps, as long as they lead to a significant increase in the likelihood. Using likelihood ratio tests, the probability of overestimating the number of change-points is controlled subject to a user specified type I error probability α. For a given number of jumps, the best fitting locally constant function is chosen by maximizing the likelihood. We use this method with local estimates as input. For a general overview on multiple change-point detection see [Niu et al., 2016].
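To make the step "for a given number of jumps, the best fitting locally constant function" concrete, the following Python sketch finds the least-squares step function with a fixed number of change points by dynamic programming. It is not SMUCE: it assumes Gaussian observations with constant variance (so that maximizing the likelihood amounts to minimizing the residual sum of squares) and it does not perform the multiscale tests that SMUCE uses to choose the number of jumps. The function name `best_step_fit` is our own.

```python
def best_step_fit(y, n_jumps):
    """Least-squares piecewise-constant fit with exactly n_jumps change points.

    Returns (change_points, levels), where change_points are the indices at
    which a new block starts and levels are the block means.
    """
    n = len(y)
    n_blocks = n_jumps + 1
    if not 1 <= n_blocks <= n:
        raise ValueError("need 1 <= n_jumps + 1 <= len(y)")

    # Prefix sums allow the squared error of any block to be computed in O(1).
    s1, s2 = [0.0], [0.0]
    for v in y:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def cost(i, j):
        """Sum of squared deviations of y[i:j] from its mean."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[b][j]: minimal cost of fitting y[:j] with b blocks.
    best = [[inf] * (n + 1) for _ in range(n_blocks + 1)]
    back = [[0] * (n + 1) for _ in range(n_blocks + 1)]
    best[0][0] = 0.0
    for b in range(1, n_blocks + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):
                c = best[b - 1][i] + cost(i, j)
                if c < best[b][j]:
                    best[b][j] = c
                    back[b][j] = i

    # Backtrack the block boundaries from the end of the sequence.
    bounds = [n]
    j = n
    for b in range(n_blocks, 0, -1):
        j = back[b][j]
        bounds.append(j)
    bounds.reverse()
    levels = [
        (s1[bounds[t + 1]] - s1[bounds[t]]) / (bounds[t + 1] - bounds[t])
        for t in range(n_blocks)
    ]
    return bounds[1:-1], levels


# Toy input: a constant background with one elevated block.
change_points, levels = best_step_fit([1, 1, 1, 5, 5, 1, 1], n_jumps=2)
```

The cubic running time of this sketch is acceptable for a few thousand segments at most; it is meant to show the optimization target, not an efficient implementation.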
In the first step LDJump divides the DNA sequence into k short segments. Summary statistics are computed separately for each segment and inserted into our regression model to estimate a local transformed recombination rate. The back-transformed rates follow an approximate normal distribution (natural scale of ρ, see supplementary material section 1.2) and are used as input for the change point estimator. In our simulations, the use of the back-transformed rates led to a better detection of hotspots compared to the transformed rates.
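As an illustration of the first step, the sketch below splits a sequence into k segments and computes one of the listed summary statistics, Watterson's θ per site, for a single segment. It is a simplified Python example rather than the LDJump implementation: haplotypes are assumed to be aligned strings of equal length, every site with more than one allele counts as segregating, and the function names are hypothetical.

```python
def segment_bounds(seq_len, k):
    """Split positions 0..seq_len into k contiguous half-open segments."""
    return [(i * seq_len // k, (i + 1) * seq_len // k) for i in range(k)]


def watterson_theta_per_site(haplotypes, start, end):
    """Watterson's theta per site on haplotypes[start:end].

    theta_W = S / a_n with a_n = sum_{i=1}^{n-1} 1/i, where S is the number
    of segregating sites and n the number of sampled sequences.
    """
    n = len(haplotypes)
    seg_sites = sum(
        1 for pos in range(start, end)
        if len({h[pos] for h in haplotypes}) > 1
    )
    a_n = sum(1.0 / i for i in range(1, n))
    return seg_sites / a_n / (end - start)


# Toy alignment with three sequences and two segregating sites.
sample = ["AAAA", "AATA", "AAAT"]
theta = watterson_theta_per_site(sample, 0, 4)
segments = segment_bounds(10000, 20)  # 20 segments of 500 bp each
```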
3 Results
We used the software package scrm of [Staab et al., 2014] to simulate samples of populations with variable recombination rates and converted its output to fasta-files with the software package ms2dna of [Haubold and Pfaffelhuber, 2013]. In this section we compare LDJump with LDhat, the newer version LDhat2, LDhelmet as well as FastEPRR. We consider both constant and variable recombination rates and look at the performance as well as the runtime. The runtime comparison is based on one core of an Intel Xeon E5-2630v3 2.4 1866, with 64GB DDR4-2133 RAM. Our analysis was performed in R [R Development Core Team, 2017]. Note that all mentioned software packages can also be applied on several cores in parallel.
3.1 Constant Recombination Rate Estimation
We first focus on a constant recombination rate on a DNA segment. In our simulations, LDJump is compared with the functions pairwise of LDhat and max_lk of LDhelmet following the default guidelines. The chosen sample sizes were {10, 16, 20}, and the sequence lengths {1000, 2000, 3000} base pairs. For each of these nine setups we simulated under 111 different values of ρ ∈ [0, 0.1] yielding a total of 999 simulated scenarios. The population mutation rate was chosen θ = 0.01.
Using the root mean squared error (RMSE) and the coefficient of determination R², we compare the accuracy of the mentioned methods. We visualize the estimators and the true values in Figure 1 along with a diagonal black line indicating the true values. Both prediction measures show a slightly better fit of the generalized additive model (purple plus signs: higher R² of 0.4974; smaller RMSE of 0.0256) compared with the software packages LDhat (red dots: 0.4447; 0.0290) and LDhelmet (green triangles: 0.2095; 0.0360). As our method uses the function pairwise as one of the summary statistics, the improved performance may be in part due to an optimized bias-variance trade-off, see [Gärtner and Futschik, 2016].
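For reference, the two accuracy measures can be computed as in the following Python sketch. It assumes that R² is taken as one minus the ratio of the residual to the total sum of squares of the true values; the text does not state which variant of R² was used, so this choice is an assumption.

```python
import math


def rmse(true_values, estimates):
    """Root mean squared error between estimates and true values."""
    squared = [(e - t) ** 2 for t, e in zip(true_values, estimates)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(true_values, estimates):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed variant)."""
    mean_true = sum(true_values) / len(true_values)
    ss_res = sum((t - e) ** 2 for t, e in zip(true_values, estimates))
    ss_tot = sum((t - mean_true) ** 2 for t in true_values)
    return 1.0 - ss_res / ss_tot


# Toy values on the simulated range of rho, for illustration only.
true_rho = [0.0, 0.05, 0.1]
est_rho = [0.01, 0.05, 0.09]
error = rmse(true_rho, est_rho)
fit = r_squared(true_rho, est_rho)
```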
3.2 Variable Recombination Rate Estimation
For humans, large fractions of recombination events are concentrated on short segments which are called hotspots (reviewed in [Arnheim et al., 2007]). Following the literature, we define recombination hotspots as genomic regions that exceed the background rate by more than a threshold factor of five for a length of up to 2kb [McVean et al., 2004].
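This definition translates into a simple rule on a piecewise-constant recombination map. The Python sketch below represents a map as a list of (start, end, rate) blocks, flags blocks whose rate exceeds five times a given background rate, and merges adjacent flagged blocks. Treating "up to 2kb" as a hard length filter is our interpretation, and the function name is hypothetical.

```python
def find_hotspots(blocks, background, factor=5.0, max_len=2000):
    """Return merged (start, end) intervals whose rate exceeds factor * background.

    blocks: sorted, non-overlapping (start, end, rate) tuples.
    Intervals longer than max_len bp are discarded (an assumed reading of
    the 'up to 2kb' part of the definition).
    """
    merged = []
    for start, end, rate in blocks:
        if rate > factor * background:
            if merged and merged[-1][1] == start:
                # Extend the previous hotspot if this block is adjacent to it.
                merged[-1] = (merged[-1][0], end)
            else:
                merged.append((start, end))
    return [(s, e) for s, e in merged if e - s <= max_len]


# Toy map with one elevated region made of two adjacent blocks.
toy_map = [
    (0, 4000, 0.001),
    (4000, 5000, 0.02),
    (5000, 6000, 0.01),
    (6000, 10000, 0.001),
]
hotspots = find_hotspots(toy_map, background=0.001)
```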
We investigated how well hotspots are detected by our method and simulated two types of setup for variable recombination rate estimation: simple setups (sequences of length 10 and 20 kb with one hotspot) and natural setups (sequences of length 1Mb containing 15 hotspots), both using a mutation rate θ of 0.01. These scenarios were investigated with different background rates, sample sizes, hotspot intensities, and hotspot lengths. When comparing our approach with LDhat(2) and LDhelmet, we followed recommendations and used 10^6 iterations for the reversible-jump MCMC procedure, sampled every 4000 iterations, chose a burn-in of 10^5, and different block penalties of 0, 5, and 50. For the computations with LDhelmet, we used a window size of 50 SNPs and 11 Padé coefficients. Results for FastEPRR were obtained using winLength=stepLength (segment lengths) of 500, 1000, 1500, and 2000 base pairs. We applied the implemented function smuceR within the R package stepR [Hotz and Sieling, 2016] to estimate the change-points for our method.
3.2.1 Simple Setups
We simulated samples of sizes {10, 16, 20} with sequence lengths of 10 kb and 20 kb. Our 15 considered background recombination rates were chosen equidistantly within [0.001, 0.03].
We considered hotspot intensities of {5, 10, 15, 20, 40}-fold the background recombination rate. The length of the hotspots varied among five fractions of the sequence length, ranging from 1/50 to 1/5. Due to the large number of resulting setups and the computation times of LDhelmet and LDhat(2), we have restricted this analysis to 2 replicates per setup, yielding in total 4500 simulated recombination maps. We approximated the RMSE (root mean squared error) as our quality measure, and computed the estimation errors on an equidistant grid of 1000 positions along the sequences.
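The grid approximation of the RMSE between two step functions can be sketched as follows in Python. Maps are given as (start, end, rate) blocks; evaluating at the midpoints of 1000 equal-width cells is our assumption about how the equidistant grid is placed.

```python
import math


def rate_at(blocks, pos):
    """Rate of a piecewise-constant map at position pos."""
    for start, end, rate in blocks:
        if start <= pos < end:
            return rate
    raise ValueError("position outside the map")


def grid_rmse(true_blocks, est_blocks, seq_len, n_grid=1000):
    """Approximate the RMSE between two maps on an equidistant grid."""
    step = seq_len / n_grid
    # Assumed grid placement: the midpoint of each of the n_grid cells.
    grid = [(i + 0.5) * step for i in range(n_grid)]
    squared = sum(
        (rate_at(est_blocks, p) - rate_at(true_blocks, p)) ** 2 for p in grid
    )
    return math.sqrt(squared / n_grid)


# Toy example: a 10 kb map with one 1 kb hotspot that the estimate misses.
true_map = [(0, 4000, 0.01), (4000, 5000, 0.1), (5000, 10000, 0.01)]
flat_estimate = [(0, 10000, 0.01)]
error = grid_rmse(true_map, flat_estimate, seq_len=10000)
```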
Table 2 summarizes the performance of the considered methods. More specifically, we computed the mean, median, and standard deviation (across simulations) of the RMSE for LDhat (column 3), LDhat2 (c. 4), LDhelmet (c. 5), FastEPRR (c. 6-9, with different segment lengths) and LDJump (c. 10-15 with different numbers of user-defined segments k). The results using different block penalties for LDhat(2), LDhelmet along with different type I error probabilities for LDJump are listed in separate rows.
As discussed in the supplementary material (section 1.4), segment lengths of at least 333 bp are needed with θ = 0.01 for a good performance of LDJump. Following this recommendation, our method performs equivalently or slightly better than LDhat2, and outperforms LDhat and also LDhelmet. The choice of α did not have a large effect under the considered scenarios. Similarly, the block penalty does not much affect the performance of LDhat2. With LDhat and LDhelmet on the other hand, the choice of the block penalty strongly influences the performance. The performance of LDJump and LDhat2 turned out to be more constant across simulations. Indeed, the standard deviation of the RMSE is more than 20 % lower with LDJump than that of LDhat, which in turn has a more than 30 % lower SD than LDhelmet. With FastEPRR, approximately 57%, 5%, 4%, and 2% of the computations terminated due to errors using segment lengths of 500, 1000, 1500, 2000, respectively. When FastEPRR provided estimates, the performance was comparable with LDJump. A more detailed graphical display of the performance of FastEPRR with respect to segment lengths can be found in Figure 7 in section 2 of the supplementary material.
Figure 2 contains separate results for different sample sizes, recombination rates, hotspot intensities and lengths, as well as sequence lengths. We applied LDJump with 20 segments and a type I error probability of 5%. Hence, the considered segments had a length of 500 and 1000 (for 10kb and 20kb, respectively) nucleotides. We used FastEPRR with a window length of 2kb in order to achieve a small number (32) of runs terminating due to errors. Especially for small to middle background rates (under the considered values) LDJump, FastEPRR, and LDhat2 have on average a lower RMSE than LDhat and LDhelmet. LDJump, FastEPRR and LDhat2 lead on average to a smaller RMSE for all considered sample sizes as well as sequence lengths, with our approach performing best in many cases. Moreover, slightly smaller or equivalent values for the RMSE were computed with LDJump and FastEPRR for hotspot intensities from 5- to 20-fold the background recombination rate and hotspot lengths between 1/50 and 1/10. For a hotspot length of 1/5 of the total sequence length we can see a similar fit for all methods. LDJump and FastEPRR have similar estimation quality with slight preference for our method.
3.2.2 Natural Setups
We simulated samples with 16 sequences and sequence lengths of 1Mb. The setups varied in the background rate which was chosen among 13 equidistant values between 0.001 and 0.01. The 15 hotspots were evenly distributed along the sequence and had different intensities of 8 to 40-fold the background rate. Every setup was replicated 20 times. The same mutation rate θ = 0.01 was again chosen for all setups. In our simulations, we focused on the methods that performed best for the simple scenarios. As FastEPRR using segment lengths of 1kb terminated without producing estimates for 88% of our simulated complex data sets, we mainly restricted our attention to a comparison between LDJump and LDhat2. Additional information on the performance of FastEPRR based on the non-terminating runs only can be found in supplementary material section 3. Notice however that a high proportion of missing results may lead to a biased quality assessment, if the missing probability depends on features of the data set that affect the performance of the estimate.
Figure 3 provides a comparison between LDJump with (grey) and without (purple) bias correction, and LDhat2 (blue). Three samples with different background recombination rates of 0.001 (left), 0.0054 (middle), and 0.01 (right) are presented in dotted black lines. Segment lengths were chosen to be 1kb, with a quantile of 0.35 in the bias correction (see supplementary material section 1.3) and a type I error probability of 0.05. The bias correction decreases the bias in the background rates and increases the intensities of the estimated hotspots.
Quality Assessment. We took the weighted RMSE (WRMSE) as a measure of quality. It is defined as
$$\mathrm{WRMSE} = \sqrt{\sum_{i} w_i \, (\hat{\rho}_i - \rho_i)^2}$$
with w_i denoting the length of the estimated segment i divided by the total sequence length, and with \hat{\rho}_i and \rho_i the estimated and true recombination rates on that segment. We also considered the proportion of correctly identified hotspots (PCH). A hotspot is counted as correctly identified if it has a non-empty intersection with a detected hotspot (i.e. a region with at least five-fold background recombination rate). The proportion of correctly identified background rates (PCB) has been defined analogously. Finally, the weighted average performance is given as WAP = (PCH + PCB)/2.
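The overlap criterion behind PCH and the averaging in WAP can be written down directly. The Python sketch below assumes that true and detected hotspots are given as half-open (start, end) intervals; PCB is passed in as a number, since the text only defines it by analogy. All function names are ours.

```python
def overlaps(a, b):
    """True if the half-open intervals a and b share at least one position."""
    return max(a[0], b[0]) < min(a[1], b[1])


def pch(true_hotspots, detected_hotspots):
    """Proportion of true hotspots that intersect some detected hotspot."""
    hits = sum(
        1 for t in true_hotspots
        if any(overlaps(t, d) for d in detected_hotspots)
    )
    return hits / len(true_hotspots)


def wap(pch_value, pcb_value):
    """Weighted average performance, WAP = (PCH + PCB) / 2."""
    return (pch_value + pcb_value) / 2.0


# Toy intervals: only the first true hotspot is hit by a detected one.
truth = [(1000, 2000), (5000, 6000), (9000, 10000)]
detected = [(1500, 2500), (7000, 8000)]
sensitivity = pch(truth, detected)
```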
To identify the best combination of bias correction and segment lengths, we applied LDJump with k = 500, 1000, 1500, and 2000 segments and estimated the recombination maps using the 0.25, 0.35, 0.45, and 0.5 quantiles in the bias correction (see supplementary material section 1.3). Notice that segment lengths resulting from the chosen values of k are 2kb, 1kb, 666 and 500 bp. As hotspot lengths are either 1 or 2kb, the scenario with k = 1500 is most challenging as the hotspot boundaries will systematically differ from the segment boundaries. A direct comparison with LDhat2 using a block penalty of 50 (based on the results from the simple setups) is provided.
The different choices of k are displayed by the first four groups of boxplots in Figure 4. For each of these four groups, quantiles of 0.25, 0.35, 0.4, and 0.5 are used in the bias correction and are presented in different colors. The rightmost bar in each panel (in blue) summarizes the result of LDhat2. From top-left to bottom-right, we show PCH, PCB, WAP, the estimated number of blocks, and the weighted RMSE.
PCH may be interpreted as a measure of sensitivity, whereas PCB provides a measure of specificity. We can see that our method has very high detection rates irrespective of k with even less variability in performance than LDhat2. On the other hand, LDhat2 has very high PCB proportions. The best PCB values for LDJump are obtained for the smallest quantile. As an overall measure, we display the mean of PCH and PCB as WAP in the bottom-left panel. It turns out that WAP is larger for LDJump regardless of the tuning parameters. In the bottom-middle panel we can see that the number of estimated blocks of LDJump depends on k. When using 500 segments, the estimated number of blocks is still below 31, the true number of blocks in the recombination map (due to 15 hotspots). For larger k the number of blocks is slightly overestimated. LDhat2 estimated many more blocks; indeed, the number of change points in recombination tended to be larger by a factor of more than 3000. The bottom-right plot shows the weighted RMSE as an overall quality measure, indicating a similar level of accuracy across k and in comparison with LDhat2. A more detailed investigation reveals that our method estimates hotspot rates more precisely, but provides less accurate estimators of the background recombination rate.
Our results also show that our method is fairly robust with respect to tuning choices. This is also true for k = 1500, where the hotspots have an unfavorable location relative to the segment boundaries. To obtain a reasonable tradeoff between sensitivity (PCH) and specificity (PCB), segment lengths of 1kb (based on 1000 segments of sequence length 1Mb) and a quantile of 0.35 in the bias correction seem to be a good choice with LDJump.
Figure 5 shows our considered quality measures depending on the background recombination rates. We provide the average performance over 20 replicates. We can see that LDhat2 has constant PCB and decreasing PCH as the background rate increases. LDJump shows constant values for PCH and slightly increasing PCB for higher background rates. The overall measure WAP slightly increases for LDJump and decreases for LDhat2 with increasing background rates, respectively. The weighted RMSE is also plotted. It can be seen that LDhat2 leads to a slightly smaller weighted RMSE with decreasing differences for larger ρ.
Notice that we have obtained an error share of more than 88% using FastEPRR for the natural setups. We provide a comparison of the error-free results in Figure 8 in supplementary material section 3. Based on this smaller number of results for FastEPRR, LDJump is favorable in terms of the WRMSE and PCH, but has lower PCB compared to FastEPRR.
3.3 Populations under Different Levels of Genetic Diversity
Since natural populations differ in the level of genetic diversity, we simulated samples under different mutation rates θ ∈ {0.0025, 0.005, 0.01, 0.02}. In Figure 6 we compare the performance based on the RMSE of LDJump (first panel) with LDhat2. For both methods, the influence of a misspecified θ has also been investigated. We used LDJump with segment lengths of 1kb, and the regression model calibrated under the mutation rate θ = 0.01. Thus the model is misspecified when the true θ ≠ 0.01. For LDhat2, results obtained using the true value of θ are displayed in the second panel, and results under misspecification in the third panel.
LDJump improves with increasing mutation rates due to the higher information available per segment. Interestingly, LDhat2 benefits less from increased levels of genetic diversity. A misspecified θ had little effect on the performance of LDhat2.
Based on our simulations we also evaluate the influence of the SNP density on the performance of LDJump. Figure 7 provides box plots illustrating the performance in terms of the RMSE depending on the mean number of SNPs per base pair within a simulated segment. Our results suggest that the higher the SNP density the more accurate estimates are obtained. When there are fewer than two SNPs in a segment, our software implementation imputes estimates based on the neighboring segments.
3.4 Populations under Demography
It has been observed in [McVean et al., 2002, Chan et al., 2012, Smith, 2005] that ignoring population demography by wrongly assuming a constant population size leads to biased estimates of recombination. As a remedy, [Kamm et al., 2016] computed 2-locus likelihoods under a known variable population size. LDJump permits the natural inclusion of any type of demography or even range of demographic scenarios by simply fitting our regression model under suitable scenarios.
We illustrate this approach and consider a scenario that involves a bottleneck followed by a rapid population growth. This scenario has also been used by [Kamm et al., 2016]. More precisely, we chose time dependent population sizes as follows:
Time is scaled in coalescent units and the simulations were again performed with scrm [Staab et al., 2014]. [Johnston and Cutler, 2012] analyzed a similar demographic scenario and showed that LDhat infers spurious recombination hotspots when falsely assuming a constant population size.
With LDJump, we fitted our regression model using samples simulated under the demographic model (2). We used the same explanatory variables as under neutrality, but added Tajima’s D [Tajima, 1989] as an additional explanatory factor. This additional variable had a significant effect on the model fit in our ANOVA, suggesting that choosing summary statistics dependent on demography can help to improve the accuracy of our estimates. We did not change the parameters in the Box-Cox transformation compared to the constant population size model. To see what can be gained by explicitly considering an underlying demography, we simulated samples under the demographic model (2), otherwise keeping to the previously described natural setups. For these samples, we estimated recombination maps using the regression models trained either under neutrality (misspecified model) or under demography. More specifically, LDJump has been applied with segment lengths of 1kb and a quantile of 0.35. The accuracy of these models was then compared in terms of the indicators PCH, PCB, and WRMSE.
The results are shown in Figure 8. When using the correct demographic model, the hotspot detection rate as well as the proportion of correctly identified regions with background recombination rate increases. We also found the WRMSE to be equal or slightly smaller when using the correct demographic model.
3.5 Runtime
Obtaining estimates of recombination can be computationally demanding, especially for a larger number of sequences and for separate analyses of several populations. Hence, we also provide a comparison with respect to runtime (in seconds) between LDhat, LDhat2, LDhelmet, FastEPRR, and LDJump. We first consider simple setups using our simulated sequences of length 20kb. Again, we looked at different block penalty choices, as well as at different numbers of atomic segments k for LDJump in Table 3. As summaries, we computed the mean (top), median (middle), and SD (bottom) of our measured runtimes. We can see that especially LDhat2 and LDhelmet run ten to fifty times longer than LDJump and FastEPRR. While LDhat is only slightly slower, we have seen before that it leads to considerably less accurate estimates. LDJump also turns out to be faster than FastEPRR when the number of segments k is at least 15.
In Table 4 we show the mean, median, and SD of runtimes in seconds based on natural setups. On average LDJump turns out to be about ten to twenty times faster than LDhat2. Choosing larger values of k reduces the runtime for LDJump. Due to the faster computation of certain summary statistics, runtime was reduced by about 50% when going from 500 to 2000 segments.
In contrast to our approach, the runtimes of LDhat2 strongly depend on the underlying recombination rates, leading to a considerable difference between the median and mean runtimes. In supplementary material (section 4), we compare the runtimes for various background rates and different values of k. Overall, LDJump provides a particularly attractive combination of performance and runtime.
3.6 Validation of LDJump Computed Hotspots with Active Recombination Hotspots
We first tested our algorithm on a 103kb region on human chromosome 21. Specifically, we chose the region between SNPs rs10622653 and rs2299784 containing the PCP4 gene, because this is the only region in which recombination has been characterized at high resolution by sperm typing for such a long continuous stretch [Tiemann-Boege et al., 2006]. Taking data from [The 1000 Genomes Project Consortium, 2015], we randomly chose 50 individuals for each of 4 subpopulations from 4 European regions (TSI, FIN, IBS, GBR). The data has been reformatted from vcf-format to fasta-files with the R packages [Knaus and Grünwald, 2017, Paradis et al., 2004] using two sequences per (diploid) sample and the reference sequence 80.37 (GRCh37) from [The 1000 Genomes Project Consortium, 2015]. We applied LDJump with a segment length of 1kb and chose the 35%-quantile for the bias correction.
In the region from 60-100kb, the estimated recombination maps across populations (using a lookup table of 100 sequences and θ of 0.005) agree well with the map obtained experimentally using sperm typing in [Tiemann-Boege et al., 2006] (see panel A of Figure 9). However, we also find population specific differences in the detected hotspots. We then compared this region with the double strand break (DSB) maps (representing active recombination hotspots) from [Pratto et al., 2014], see panel C of Figure 9. The hotspots inferred by LDJump in this region (60-100kb) also agree with the DSB activity. However, LDJump additionally estimated hotspots before the PCP4 gene around position 45kb. These hotspots were also found by other LD-based algorithms [McVean et al., 2004, Li and Stephens, 2003], see panel B of Figure 9 [Tiemann-Boege et al., 2006].
We further tested the performance of LDJump within a larger genomic region to validate our method. For this purpose, we applied LDJump to the entire chromosome 16, and considered separate samples of 50 sequences from four populations (GBR, TSI, IBS, FIN) taken from [The 1000 Genomes Project Consortium, 2015]. For the data preparation we used the software package vcftools [Danecek et al., 2011] and then ran a parallel version of LDJump with segment lengths of 1kb for each population recombination map. We obtained these results in less than 2 days using in total 15 cores of an Intel Xeon E5-2630v3 2.4 1866, with 64GB DDR4-2133 RAM.
In panel A of Figure 10 we show the estimated recombination maps under the demography model (2) for chromosome 16 with the Italian population (TSI) in black, the Finnish sample in dashed red (FIN), the Spanish sample (IBS) in dotted green, and the British population (GBR) in dash-dotted blue. When we ignored demography and applied LDJump under a neutral scenario, we obtained hotspots with unrealistically high intensities (up to values of 70). Demography model (2) is rather simple, and we stress that LDJump could also be applied under any demographic scenario by training the regression model with a suitable setup. Overall, we observe population specific hotspots, but also hotspots present in more than one population as is also observed in genome-wide DSB maps (Figure 10, panel B) [Pratto et al., 2014].
Furthermore, we evaluated the agreement of the estimated recombination hotspot locations using LDJump with the DSB map hotspots. For identifying LDJump hotspots we used a simple heuristic to define the average background rate. More specifically, we chose the mean of all estimates that fall below the median. This should give a downward biased estimate. With LDJump, we again defined regions with more than five-fold the estimated background rate as hotspots. The DSB hotspots were selected by making use of the indicator variables provided by [Pratto et al., 2014]. Given that DSB hotspots are very narrow, yet the resolution of a DSB into a crossover can occur within 3-5 kb, we added segments of different length (0, 0.5, 1, 2, 3 kb) left and right to the DSB hotspot regions and calculated the respective number of detected hotspots per PRDM9-type. The total number of DSB hotspots for AA1, AA2, AB1, AB2, and AC is 2889. We counted a hotspot as jointly detected if an overlap between a DSB hotspot and an LDJump hotspot occurred in at least one of the four populations (FIN, IBS, GBR, TSI). We display the number of jointly detected hotspots (augmented by segments of different lengths) via a Venn diagram in panel C of Figure 10. Notice that the number of hotspots estimated by LDJump for all considered populations is in total 31423, and therefore approximately 10-fold higher than the number of DSB-hotspots. Our analysis shows that on average about 70% of the DSB-hotspots (when adding 3kb segments to these regions) overlapped with at least one of the estimated LDJump population hotspots. These proportions are in line with the comparison of LD-based recombination maps and DSB hotspots in [Pratto et al., 2014] with an overlap of 56%.
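The hotspot heuristic and the overlap count described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used for the analysis: segments are taken to be 1 kb wide, adjacent segments above the threshold are merged into one hotspot, and all function names, population labels and toy values are hypothetical.

```python
from statistics import mean, median


def background_rate(rates):
    """Mean of all segment estimates below the median (downward biased)."""
    med = median(rates)
    below = [r for r in rates if r < med]
    # Assumption: fall back to the median if nothing lies strictly below it.
    return mean(below) if below else med


def call_hotspots(rates, seg_len=1000, fold=5.0):
    """Intervals (start, end) whose rate exceeds fold times the background rate."""
    threshold = fold * background_rate(rates)
    hotspots = []
    for i, rate in enumerate(rates):
        if rate > threshold:
            start, end = i * seg_len, (i + 1) * seg_len
            if hotspots and hotspots[-1][1] == start:
                # Assumption: merge adjacent segments into one hotspot.
                hotspots[-1] = (hotspots[-1][0], end)
            else:
                hotspots.append((start, end))
    return hotspots


def overlaps(a, b):
    """True if the half-open intervals a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def count_joint(dsb_hotspots, ld_by_pop, pad=0):
    """Number of DSB hotspots (padded by `pad` bp on each side) that overlap
    an LD-based hotspot in at least one population."""
    joint = 0
    for start, end in dsb_hotspots:
        padded = (start - pad, end + pad)
        if any(overlaps(padded, h) for hs in ld_by_pop.values() for h in hs):
            joint += 1
    return joint


# Toy per-segment rate estimates for two hypothetical populations.
rates_pop1 = [0.001, 0.002, 0.001, 0.05, 0.06, 0.002, 0.001, 0.003, 0.04, 0.001]
rates_pop2 = [0.001, 0.002, 0.001, 0.002, 0.001, 0.002, 0.03, 0.001, 0.002, 0.003]
ld_by_pop = {"POP1": call_hotspots(rates_pop1), "POP2": call_hotspots(rates_pop2)}
dsb = [(3400, 3600), (5200, 5400), (9800, 9900)]  # hypothetical DSB hotspots

for pad in (0, 500, 1000, 2000, 3000):
    print(pad, count_joint(dsb, ld_by_pop, pad))
```

In this toy example the number of jointly detected hotspots grows with the padding, which mirrors why the overlap proportions above are reported separately for each added segment length.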
4 Discussion
We introduced a new method called LDJump to estimate heterogeneous recombination rates along chromosomes from population genetic data. Our approach splits a given DNA sequence into segments of proper length in a first step. Subsequently, we use a generalized additive regression model to estimate the constant recombination rates per segment. Then, we apply a simultaneous multiscale change-point estimator (SMUCE) to estimate the breakpoints in the recombination rates across the sequence. We provide detailed comparisons of our method with the recent reversible jump MCMC methods LDhat(2) and LDhelmet as well as the regression based method FastEPRR. Our method is very fast, performs favourably in the detection of hotspots, and shows accuracy levels similar to the best available competitor for simple and natural setups, respectively. These comparisons show that LDJump is a powerful tool to explore recombination rates in organisms with narrow recombination hotspots; for example, PRDM9-defined hotspots in most mammals (reviewed in [Tiemann-Boege et al., 2017]).
We validated our method by computing hotspots in several human populations and compared the estimated hotspots with recombination intensities measured by sperm typing and double strand break maps. These computations revealed population specific hotspots in the region surrounding the PCP4 gene located on chromosome 21. Although on a small scale (103kb) the hotspots computed by LDJump mainly agree with hotspots detected at high resolution with sperm typing and chromatin immunoprecipitation (DSB map), we also observed a region with little congruence at position ~45kb. Given the lack of active recombination at position ~45kb (absence of this hotspot in sperm typing and in the DSB maps for the 2 European donors carrying the PRDM9 allele A, as well as the donor with African descent (carrying the PRDM9 allele C)), we hypothesize that this estimated hotspot might represent a historical hotspot that has become extinct. Alternatively, it could be a population-specific hotspot given that its intensity varies among different European populations. In order to test this latter hypothesis, active recombination maps from different populations would be needed. Experimental data (e.g. hotspot at position 95kb present only in the individual with a PRDM9 C allele) suggest that the intensity of hotspots might vary also within populations.
Differences between hotspot rates estimated from LD patterns compared to estimates based on sperm typing have also been observed by [Jeffreys and Neumann, 2009]. This might be caused by the short life-span of hotspots and their rapid evolution in intensity and genomic position among populations and species [Coop and Myers, 2007, Myers et al., 2010, Jeffreys et al., 2013]. In fact, only ~56% of historical hotspots determined by LD agree with genome-wide DSB maps [Pratto et al., 2014]. Our large-scale validation on chromosome 16 shows that about 70% of the DSB-hotspots were also found by LDJump using four European populations. Fine-scale population specific differences with respect to recombination events have also been highlighted in studies such as [Kong et al., 2010, Berg et al., 2011, Fledel-Alon et al., 2011, Pratto et al., 2014]. Given all this, our observed differences are likely due to underlying biological features. We have implemented our approach as an R-package called LDJump, which can be freely downloaded from https://github.com/PhHermann/LDJump. In our simulations, we obtained particularly good results when applying our method with segment lengths of 1kb and a bias correction using the default quantile of 0.35.
In conclusion, LDJump is a fast algorithm which is able to detect narrow hotspots at high accuracy using segments of approximately 1kb length. Moreover, we also show that LDJump can be applied to populations under demography. We validated our method on a 103kb region of human chromosome 21 as well as the whole chromosome 16 and found a good congruence by comparing LDJump hotspots with recombination hotspots measured with sperm typing or chromatin immunoprecipitation (DSB map).
Data Accessibility Statement
LDJump has been implemented as an R package which can be downloaded and installed from GitHub (https://github.com/PhHermann/LDJump). We also provide example files and a manual in this repository. We downloaded the data of chromosome 16 from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502 and uploaded an R-script with details on the data management as well as the hotspot locations for the estimated population recombination maps to GitHub (https://github.com/PhHermann/Hermann_et_al_2018_LDJump). We downloaded the data for the application on chromosome 21 from http://phase3browser.1000genomes.org/Homo_sapiens/Location/Overview?r=21:41187000-41290679 using the first 50 samples of the four European populations IBS, GBR, TSI, and FIN.
We provide details on the regression model, bias correction, choice of segment lengths, detailed quality assessments and runtime comparisons in the supplementary material.
Author Contributions
PH and AF designed the model and implemented it in the R package. PH and AF focused on the statistical aspects and ITB and AH on the biological aspects. All authors wrote and commented on the manuscript.
Acknowledgements
We are grateful to Kerstin Gärtner and Renato Pereira Salazar for their assistance with the software packages LDhat and LDhelmet as well as the data acquisition for the application, respectively. We are thankful to Katharina Sallinger, Bettina Grün, Renato Pereira Salazar, and Theresa Schwarz for their helpful comments. This work was supported by the ‘Austrian Science Fund’ (FWF) P27698-B22 to I.T-B., and the DOC Fellowship of the Austrian Academy of Sciences (24529) to A.H.