Abstract
Accurate estimation of recombination rates is critical for studying the origins and maintenance of genetic diversity. Because the inference of recombination rates under a full evolutionary model is computationally expensive, an alternative approach using topological data analysis (TDA) has been proposed. Previous TDA methods used information contained solely in the first Betti number (β1) of the cloud of genomes, which relates to the number of loops that can be detected within a genealogy. While these methods are considerably less computationally intensive than current biological model-based methods, these explorations have proven difficult to connect to the theory of the underlying biological process of recombination, and consequently have unpredictable behavior under different perturbations of the data. We introduce a new topological feature with a natural connection to coalescent models, which we call ψ. We show that ψ and β1 are differentially affected by changes to the structure of the data and use them in conjunction to provide a robust, efficient, and accurate estimator of recombination rates, TREE. Compared to previous TDA methods, TREE more closely approximates the results of commonly used model-based methods. These characteristics make TREE well suited as a first-pass estimator of recombination rate heterogeneity or hotspots throughout the genome. In addition, we present novel arguments relating β1 to population genetic models; our work justifies the use of topological statistics as summaries of distributions of genome sequences and describes a new, unintuitive relationship between topological summaries of distance and the footprint of recombination on genome sequences.
Author contributions
D.P.H., M.R.M., M.M., and A.J.B. designed and conducted experiments; D.P.H., M.R.M., and M.M. wrote code for the project; D.P.H. and M.R.M. performed empirical analyses; M.M. contributed theory; and D.P.H., M.R.M., M.M., and A.J.B. wrote the paper.
Recombination is a fundamental source of genetic variation in many natural populations. By bringing existing mutations into novel genomic backgrounds, recombination can accelerate the rate at which adaptation occurs, as well as prevent the buildup of deleterious variants which occurs in asexuals via Muller’s ratchet (1–3). It is therefore critical to measure the rates of recombination in order to understand rates of adaptation. Resolution along the genome is also an important factor, as recombination rates are known to vary substantially along chromosomes. In particular, hotspots of recombination have been found associated with a variety of sequence and structural motifs in natural populations (4–8). In addition to hotspot detection, better estimation techniques for recombination rates can also improve our understanding of observed levels of linkage disequilibrium in genome data (9), and consequently the expected signatures of various evolutionary phenomena such as selective sweeps and epistatic interactions (10).
In practice, detecting genome-wide heterogeneity in recombination rates is challenging. Empirical approaches require building linkage maps through involved procedures such as sperm typing or multi-generational genetic crosses (11). While these are often the most powerful methods for detecting recombination, they are costly and time consuming. With the recent influx of large-scale sequencing data, alternative algorithmic approaches to inferring recombination rates from bulk genomic data have become a focus of attention. These methods, while often faster, have come with their own set of technical challenges. In order to detect patterns and rates of recombination along a genome, model-based algorithms infer properties of the ancestral recombination graph (ARG), an exercise which can be prohibitively computationally expensive on large datasets (12). This problem has driven the development of methods which use either a variety of summary statistics built on quantities such as the distribution of pairwise differences (13), or only compute partial or composite likelihoods, such as LDhat and its sister LDhelmet, two of the most widely-used model-based methods (14, 15). Even with this relaxation, these methods can take a matter of days to run on realistically sized sequence data.
In this paper, we present a method that takes advantage of novel summary statistics based on topological features of the genomes in a given sample to quickly and accurately provide estimates of recombination rate heterogeneity. Our method differs from existing model-based methods in that it is based solely on distances between sequences and consequently scales significantly better on large datasets. We find that a topological data analysis (TDA)-based approach greatly increases feasibility of the inference problem and implicitly ties genetic distances to modern models of population genetics via the coalescent (see Coalescent Intuition for Topological Statistics).
Recently, Camara et al. demonstrated the utility and efficiency of topological data analysis for the inference of recombination rates (16), benchmarked against the methods of Hudson and Kaplan, Myers and Griffiths, and Chan et al. (15, 17, 18). They focused on a topological feature known as the first Betti number (β1, explained in Background on Topological Data Analysis), which captures the number of cycles in an ARG, the canonical graphical representation of recombination events. We have found that another topological feature of lower dimension, which we call ψ, is a better predictor of recombination rate in genomes. Moreover, ψ and β1 used in tandem provide much more accurate estimates than previous TDA-based methods. We investigate these two topological features and their relationships to evolutionary quantities of interest, including recombination rate as well as coalescent tree length, and describe a method of estimating recombination rates from genome samples using these features. We then compare the performance of our estimator on whole-genome data to LDhelmet; we find that our results justify the use of the TDA estimator as a rapid approximation method.
Background on Topological Data Analysis
Topological data analysis (TDA) is a new branch of statistics that applies tools from algebraic topology to describe the shape of data (19–23). TDA has been successfully applied to a range of applications in biology, including the study of breast cancer transcriptional data for the discovery of a cancer subgroup (24), the construction of phylogenetic trees for analyzing tumor evolutionary patterns (25), and the detection of intrinsic structure in neural activity (26). In this section, we briefly review the relevant TDA methodology that we apply to the study of recombination. (For a more detailed review of TDA applications in genomics, see (27).)
TDA associates topological summaries to data sets using a novel invariant called persistent homology (19–23). We begin by giving a brief explanation of homology; note that herein, homology refers to a mathematical concept rather than a biological one. Homology refers to a family of ways of associating an algebraic object (i.e., a vector space) to a geometric object. For example, the zeroth homology, H0, is associated with the number of unique connected components (0-dimensional holes) in the data, such as independent points or groups of connected points. The first homology, H1, captures cycles or loops (1-dimensional holes) in the data, the second homology, H2, captures voids or hollow spheres (2-dimensional holes) in the data, and so on. The rank of the ith dimensional homology group is known as the ith Betti number, denoted βi, and roughly speaking it encodes the number of i-dimensional holes in the dataset (19–23).
For example, a figure-8 consists of a single connected component (all the points in the boundary of the figure are connected) and two loops, so for this shape β0 = 1, β1 = 2, and βk = 0 for k > 1. In contrast, a basketball is one connected component (all points on the surface are connected) with one hollow sphere and no loops so its associated Betti numbers are β0 = 1, β1 = 0, β2 = 1, and βk = 0 for k > 2. To see why β1 = 0 for this example, consider any loop on the basketball and fix some point p on the surface of the ball. Without breaking the loop, it’s possible to slide the loop in a continuous manner (while remaining on the ball) towards p until eventually the loop contracts to p. Consequently all loops on the basketball are trivially equivalent to a point, i.e. β1 = 0.
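For the special case of graphs (1-dimensional complexes), these invariants can be computed directly: β0 is the number of connected components and β1 is the cycle rank |E| − |V| + β0. The following minimal sketch (not part of our pipeline; the encoding of the figure-8 as two triangles glued at a single vertex is an assumption made purely for illustration) recovers the Betti numbers quoted above:

```python
def graph_betti(num_vertices, edges):
    """Betti numbers (b0, b1) of a graph: b0 counts connected components,
    and for 1-dimensional complexes b1 is the cycle rank |E| - |V| + b0."""
    parent = list(range(num_vertices))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path compression
            v = parent[v]
        return v

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv

    b0 = len({find(v) for v in range(num_vertices)})
    b1 = len(edges) - num_vertices + b0
    return b0, b1

# Figure-8 modeled as two 3-cycles sharing vertex 0
figure8 = [(0, 1), (1, 2), (2, 0), (0, 3), (3, 4), (4, 0)]
print(graph_betti(5, figure8))  # (1, 2): one component, two loops
```

Higher Betti numbers such as the basketball's β2 require simplices above dimension 1 and are not captured by this graph-only formula.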
While shapes and surfaces have well-defined homology groups, computing the homology of data is less straightforward. If one were to compute the homology of a dataset consisting of N distinct data points directly, the resulting Betti numbers would be β0 = N and βk = 0 for k > 0, since the data has N connected components (that is, distinct points) and no higher dimensional holes. In this context the Betti numbers are clearly non-informative. Consequently, in data-driven tasks we instead seek to uncover the underlying shape on which the data lie and then compute the homology of that shape (19, 21, 23). Algorithmically, this procedure is carried out by assigning a combinatorial model of a space, called a simplicial complex, to the data.
Given a dataset X = {x0, x1, …, xN} living in some metric space (S, dS) and a distance parameter ∊ > 0, we think of simplicial complexes as an organized set of vertices (0-simplices), edges (1-simplices), triangles (2-simplices), tetrahedra (3-simplices), and higher dimensional simplices. The vertex set of the simplicial complex is the collection of data points {x0, x1, …, xN}. Moreover, if dS(xi, xj) ≤ ∊, then the edge connecting xi to xj is included in the simplicial complex of X with respect to ∊. Similarly, if dS(xi, xj), dS(xi, xk), and dS(xj, xk) are all at most ∊, then the filled triangle with vertices xi, xj, xk is contained in the complex (19, 21, 23). In general, a simplex is included in the simplicial complex representation of X with respect to ∊ if the vertices of the simplex have pairwise distance at most ∊. This simplicial complex is known as the Vietoris-Rips Complex, and it is the way in which we construct a simplicial complex from a set of genomes throughout this paper (where the Hamming distance provides the metric) (19, 21, 23). For a more rigorous definition of simplicial complexes, see Supporting Information and (21).
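The Vietoris-Rips construction is straightforward to sketch in code. The following minimal implementation (names hypothetical, written only to illustrate the definition) enumerates the simplices up to dimension 2 that are present at a given scale ∊, starting from any precomputed distance matrix:

```python
from itertools import combinations

def vietoris_rips(dist, eps, max_dim=2):
    """Vietoris-Rips complex at scale eps: a simplex is included exactly
    when all pairwise distances among its vertices are <= eps."""
    n = len(dist)
    complex_ = {0: [(i,) for i in range(n)]}  # every data point is a vertex
    for dim in range(1, max_dim + 1):
        complex_[dim] = [
            s for s in combinations(range(n), dim + 1)
            if all(dist[i][j] <= eps for i, j in combinations(s, 2))
        ]
    return complex_

# Three points: two close together, one far away
dist = [[0, 1, 5],
        [1, 0, 5],
        [5, 5, 0]]
print(len(vietoris_rips(dist, eps=2)[1]))  # 1 edge at scale 2
print(len(vietoris_rips(dist, eps=5)[2]))  # 1 filled triangle at scale 5
```

Production TDA libraries build the same complex far more efficiently, but the inclusion rule is exactly the one shown here.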
One can now compute the homology of the simplicial complex representation of a dataset to get non-trivial topological summaries. The main issue with this procedure is that it is sensitive to the choice of scale parameter ∊. The idea of persistent homology is to avoid making a choice of ∊ and instead track how the homology changes as ∊ varies (21, 28). For each dimension, persistent homology yields a collection of “homological features” which appear at some parameter value ∊0, known as the birth time, and disappear at some parameter value ∊1 ≥ ∊0, known as the death time. These features are often represented in a barcode diagram, where each bar represents an i-dimensional hole which persists within the interval [∊0, ∊1], as shown in Figure 1a (19, 28). One can think of persistent homology as an extension of hierarchical clustering for higher dimension homology groups, where dimension 0 persistent homology is analogous to single linkage clustering (19). See Figure 1b for an example of persistent homology applied to an arbitrary dataset via the Vietoris-Rips complex.
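The analogy with single-linkage clustering makes dimension 0 persistence especially simple to sketch: every point is born at ∊ = 0, and a component dies at the edge length that merges it into another. A union-find pass over the sorted edges, as in the hypothetical sketch below, recovers the finite H0 death times (one bar per connected dataset never dies and is omitted here; conventions for averaging, such as dividing by n versus n − 1, vary):

```python
def h0_barcode(dist):
    """Finite dimension-0 bars of the Vietoris-Rips filtration. Every point
    is born at scale 0; a component dies at the edge length that merges it
    into another (single linkage), leaving n - 1 finite death times."""
    n = len(dist)
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edges = sorted((dist[a][b], a, b) for a in range(n) for b in range(a + 1, n))
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)  # the bar [0, d) closes at this merge
    return deaths

def mean_bar_length(dist):
    """Mean finite H0 bar length (births are 0, so length equals death)."""
    deaths = h0_barcode(dist)
    return sum(deaths) / len(deaths)

dist = [[0, 2, 9],
        [2, 0, 9],
        [9, 9, 0]]
print(h0_barcode(dist), mean_bar_length(dist))  # [2, 9] 5.5
```

This mean finite bar length is the quantity exploited as ψ in the remainder of the paper.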
Intuitively, in the presence of recombination events the genealogy contains loops which correspond to dimension 1 homological features. This hypothesis was explored in certain cases in (16); the loops do not appear in the absence of recombination and consequently, β1 can be used to predict recombination rate. This is illustrated in Figure 1c (although we note that in the context of standard coalescent assumptions, this connection is more complicated than it appears; see Explaining β1 for details). One of the main discoveries we describe is that in fact the mean barcode length in dimension 0, denoted ψ, is an even more accurate predictor of recombination rate than β1. We note that each bar in the dimension 0 barcode diagram corresponds to an individual in the sample population, and the bar lengths correlate with the distance between the individual, or cluster of individuals, and its closest neighbor in Hamming space. Unexpectedly, this measure of distance between individuals is positively correlated with recombination. An explanation of this behavior is given in Coalescent Intuition for Topological Statistics.
Methodological Overview
We sought to identify topological summary statistics (using β1 as a baseline for comparison) that can serve as features for algorithms to perform recombination rate inference. Utilizing simulated data, we computed a variety of topological summaries of dimensions 0, 1, and 2 from the Hamming distance matrix between sequences.
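As a concrete sketch of the input to this pipeline, the Hamming distance matrix can be computed directly from aligned sequences (a pure-Python illustration with toy, hypothetical sequences):

```python
def hamming_matrix(seqs):
    """Pairwise Hamming distance matrix for equal-length aligned sequences."""
    n = len(seqs)
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = sum(a != b for a, b in zip(seqs[i], seqs[j]))
            dist[i][j] = dist[j][i] = d
    return dist

# Toy alignment, for illustration only
sample = ["ACGTAC", "ACGTTC", "TCGATC"]
print(hamming_matrix(sample))  # [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
```

The resulting matrix is the metric fed into the Vietoris-Rips construction described in Background on Topological Data Analysis.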
The results of our LASSO regression indicated that the topological features with the highest predictive power for recombination rate are, in order: 1) the average dimension 0 barcode length (ψ), 2) the first Betti number (β1), and 3) the variance of the dimension 0 barcode lengths (Φ). We then used a nonlinear combination of these three topological statistics to build a novel TDA-based model for recombination rate inference, the Topological REcombination Estimator (TREE).
We used simulated data to perform an initial validation of the model. For a more rigorous validation, we applied TREE to 22 full genome assemblies from the RG Drosophila population (see Methods for more details) and compared its performance to ρph, the recombination rate estimator introduced by Camara et al. in (16), and to LDhelmet. We also benchmarked TREE on a much larger dataset of Arabidopsis genomes, consisting of 1,135 individuals and up to 50k SNPs.
Coalescent Intuition for Topological Statistics
The main topological statistics of interest here are β1, the first Betti number, and ψ, which corresponds to the mean bar length in the dimension 0 barcode diagram. In order to relate these to the biological process of recombination, we will use the language of coalescent theory. For a detailed introduction to the field, see (29). We note that our approach differs from the recent considerations of Lesnick, Rabadán, and Rosenbloom (30) in that we consider a coalescent model with branch lengths and model H0 behavior. Furthermore, we assume a more restrictive sampling regime where only sequences at contemporaneous terminals of the graph are known, as opposed to sequences all along the genealogy.
Explaining ψ
We will provide a heuristic argument that the value of ψ is elevated in the presence of recombination by demonstrating the desired behavior at the recombination rate extremes. First, we claim that in the absence of recombination, the distribution of H0 feature lengths corresponds to the mutation-scaled distribution of branch lengths in the coalescent tree of the sample, as shown in Figure 1a. Since there is a single, fixed genealogy which describes all positions within the sequence, it is sufficient to calculate the expected length of the coalescent tree and divide by the sample size. Assuming an idealized diploid population of large size N, and that we are sampling K individuals with K sufficiently small relative to N that multiple merger events are rare, the expected waiting time between coalescence events is 4N/(k(k − 1)) generations (31, 32), where k is the number of remaining lineages. The full coalescent tree then contains each kth interval k times, once for each of the lineages remaining at that time. Summing over all these segments and dividing by the sample size gives us the following:

E[ψ] = (μ/K) Σ_{k=2}^{K} k · 4N/(k(k − 1)) = (4Nμ/K) Σ_{i=1}^{K−1} 1/i,

where μ is the per-generation mutation rate. Notably, this is equivalent to the expected number of segregating sites divided by the sample size per Watterson's estimator (31).
We now show that in the infinite recombination limit, the expectation for ψ is strictly larger than in the case where there is a single fixed genealogy. If there is free recombination, every site in the sample has an independent genealogy which we will average over, so all the bars must be of the same length (Figure 2). In other words, the expected value of ψ becomes the scaled average coalescence time for two randomly sampled individuals in the current generation. This is simply 2μN. All that remains is to show that the ratio of the two expectations, (2/K) Σ_{i=1}^{K−1} 1/i, is less than 1. Since the partial sum Σ_{i=1}^{K−1} 1/i is bounded by log(K − 1) + 1, this holds for all values of K larger than 5. This additionally suggests that the variance in the length of the H0 barcode features should decrease as the recombination rate increases, which we observe in simulations. By integrating information both about the length of the coalescent tree as well as the distribution of pairwise differences averaged over multiple topologies, ψ can be viewed as capturing distortions in the expected amount of independent evolution between samples that occurs when sequences contain multiple discordant gene genealogies.
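The inequality is easy to check numerically. This quick sketch computes the ratio of the two expectations and verifies it against the harmonic-sum bound stated above, for several sample sizes used in this study:

```python
import math

def psi_ratio(K):
    """Ratio of E[psi] with no recombination to E[psi] with free
    recombination: (2 / K) * sum_{i=1}^{K-1} 1/i."""
    return (2.0 / K) * sum(1.0 / i for i in range(1, K))

for K in (6, 25, 50, 140):
    ratio = psi_ratio(K)
    bound = (2.0 / K) * (math.log(K - 1) + 1)  # harmonic sum <= log(K-1) + 1
    assert ratio <= bound < 1
    print(K, round(ratio, 3))
```

For sample sizes of 25 or more the ratio is already well below one, so the elevation of ψ under free recombination is substantial in practice.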
Explaining β1
We suggest in addition that the standard intuition for the use of β1 to detect cycles in the ARG (presented in Camara et al. and Chan et al. (16, 33), as well as here in Background on Topological Data Analysis) potentially over-simplifies the rather complicated relationship between recombination events and features in the H1 barcode. For this, we will consider Čech complexes, rather than Vietoris-Rips complexes, as they are closely related (22), and the Čech complex construction allows holes to be formed given only three points (see Figure 3a). Given the graph in Figure 3b, it is clear that if one were to sample the sequences at every node, there would be an H1 feature observed which corresponds precisely to the hole in the graph. However, in many genetic studies, samples of the common ancestors of present-day sequences do not exist. If we restrict our data to the sequences at the terminals, cycle detection with β1 becomes a function of mutation heterogeneity along the graph, and is in fact impossible if we are given only the true coalescent distances between samples along the genealogy. To see this, take terminals a and c from the graph. By hypothesis, the amount of time between them and their most recent common ancestor (MRCA) is the same, which we will call L. It follows that the minimum radius such that balls around these points would intersect is L. Then, for the triple intersection of balls around the terminals to be empty, b must be a distance greater than L from the MRCA. However, each portion of its genome has certainly experienced the same amount of time since the MRCA regardless of recombination history. Therefore, we require that more mutations than expected be generated along the path to b in order for this event to be detected, given the actual sequence data.
It still holds true that β1 reflects only these recombination events, assuming no sequence convergence and infinite sites, but this sampling reality may bias β1 detection in subtle ways. This also implies that the length of the H1 bar will not necessarily be indicative of features of the actual cycle in the graph, as it will increase in length as additional mutations are placed on the lineage leading to b, even if the cycle itself is untouched. We find via simulations (see supplement Filtering β1) that filtering small H1 bars only hurts our inference capabilities, as we would then expect.
Combining ψ and β1
These explanations for the behavior of ψ and β1 also imply differences in behavior between ψ and β1 under different population models. For example, while rapid demographic changes such as exponential population growth will distort ψ-based estimation (Figure 4), β1 counts features generated by recombination at a rate independent of the underlying tree structure, since the relative distances to the MRCA are unchanged with multifurcations. On the other hand, we find empirically that ψ is very robust (especially compared to β1) to perturbations in the form of missing data, which serve only to minorly rescale the H0 bars on average. These differences suggest that a reliable predictor should incorporate both features. A more formal follow-up on the behavior of these statistics under different models of demography and selection, as well as further characterization of the behavior of ψ using the sequentially Markov coalescent (SMC) model (34), will be conducted in future work.
Results
Coalescent Simulations
We generated simulated data for an idealized population of fixed size Ne, with per-generation crossover rate r and mutation rate μ. Given this, we ran regression analyses to discover relationships between various topological summary statistics and ρ = 4Ner, the population recombination rate. The topological statistics analyzed were the mean and median bar lengths, variance of bar lengths, total number of bars, and number of bars above varying noise-filtering thresholds for dimensions 0, 1, and 2.
Our intention was to test various topological features as predictors of known recombination rates and to demonstrate comparable performance of these features to a comparator method, LDhelmet. However, given the computational bottlenecks inherent to LDhelmet, we were unable to run this software over the full set of >3600 alignments. We are nevertheless satisfied that the parameters to be estimated are known from simulation, and so we reserve the use of LDhelmet for our empirical analyses, where the truth is unknown.
As a preliminary analysis, the weight vectors of LASSO regression models provided insight into which barcode statistics were the strongest predictors of ρ. The results of this analysis showed that two key topological features, β1 and the mean dimension 0 bar length, which we will denote as ψ, correlate strongly with ρ given a constant sample size n > 10. Thus, these topological summaries became the main foci of our work. Moreover, we found no correlation between our barcode statistics and θ = 4Neμ, the population mutation rate, as expected, since changes in θ with a constant Ne only linearly rescale the distances.
In studying the information content of these various statistics, we found ψ stood out as an even better predictor of recombination rate than β1, and performed better on its own in predicting recombination rates from simulated datasets. Figure 5 demonstrates the relationships between ρ and β1, and between ρ and ψ for sample sizes n = 50 and n = 140. We note that both topological summaries exhibit an exponential relationship with recombination rate, and that the relationship is tighter for ψ than β1, especially for smaller sample size.
As a baseline, we fit our simulated data with a fixed sample size of 160 to the β1-based model of Camara et al. (2017), ρph (the “ph” stands for persistent homology). This model expresses ρph as a function of β1 through two parameters, g and h, which are coefficients related to sample size and are independently calculated for a given dataset. The best fit over the simulated range of recombination rates is R2 = 0.86. We found a relationship between ρ and ψ with R2 = 0.90, suggesting that this new parameter has comparable or greater power for predicting population recombination rates.
We demonstrated that the two parameters ψ and β1 are differentially stable under violations of assumptions about the data. Notably, ψ is robust to large amounts of missing data, whereas β1 is robust to rapid changes in population size. The former we show empirically: ψ maintains its relationship with ρ with R2 = 0.76 with 10% of each sequence missing as a tandem indel, while β1 loses this relationship quickly, with R2 = 0.036 under the same missing data scheme. The assumption of 10% missing data is somewhat conservative; next-generation sequencing data sets have less missing data, but this figure is realistic for many older sequencing data sets. As one would expect, we note that the performance of each method declines as missing data become more prevalent (see supplement Missing Data). However, if missing data are located randomly throughout the genome, they are unlikely to bias relative measures of recombination rates within a dataset. While β1 is expected to be robust to rapid changes in population size, since these do not change the number of cycles in the ARG, ψ will be more sensitive to these changes, as rapid demographic expansion generates multifurcations in the genealogy which converge to the same branch lengths as in the infinite recombination case.
TREE Model
We implemented several machine learning and regression models to build an accurate and robust model relating topological summaries to ρ, including LASSO, polynomial regression, exponential regression, and linear regression. We varied model parameters and input features (subsets of the aforementioned barcode statistics) across each learning algorithm.
A subset of model comparison results are presented in Table 1, where we show the R2 (goodness-of-fit) values for the model ρ = exp(w · b), where b corresponds to a vector of different barcode statistics and w to a vector of fitted weights. We ran the model separately on datasets generated with varying sample sizes, as well as on a dataset consisting of simulations of varying population size. An R2 value close to 1 signifies a nearly perfect model, whereas an R2 near 0 indicates that the model has low predictive power.
The results show that ψ is an overall stronger predictor than β1, and as expected, recombination rate is predicted more accurately for higher sample sizes. Importantly, β1 fails as a predictor in the case of small sample size, while ψ maintains decent predictive power for sample size as low as 25.
A thorough comparison of the different model outputs showed that an exponential model in ψ^2, β1^2, ψ, β1, and the variance of the dimension 0 bar lengths, which we will denote Φ, is the best predictor for ρ. While we also tested more topological features as inputs to different models, the increase in R2 values was negligible in comparison to the increased risk of over-fitting.
Summarizing, we propose the following Topological REcombination Estimator (TREE) model:

ρTREE = exp(a0 + a1ψ^2 + a2β1^2 + a3ψ + a4β1 + a5Φ),

where the coefficients a0, …, a5 are fit by exponential regression to the simulated training data.
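Once its coefficients are fit, a model of this form is cheap to evaluate, which is the source of TREE's speed advantage. The sketch below evaluates an exponential model in ψ^2, β1^2, ψ, β1, and Φ; the function name and any coefficient values supplied by callers are illustrative placeholders, not the fitted TREE coefficients:

```python
import math

def tree_estimate(psi, b1, phi, coef):
    """Evaluate an exponential model in psi^2, b1^2, psi, b1, and phi.
    coef = (a0, a1, a2, a3, a4, a5) are regression coefficients; the
    placeholder values used below are for illustration only."""
    a0, a1, a2, a3, a4, a5 = coef
    return math.exp(a0 + a1 * psi ** 2 + a2 * b1 ** 2
                    + a3 * psi + a4 * b1 + a5 * phi)

# With all weights zero the model predicts exp(0) = 1 regardless of input
print(tree_estimate(3.2, 4, 0.7, (0, 0, 0, 0, 0, 0)))  # 1.0
```

Evaluating this expression per window requires only the three barcode statistics, with no likelihood computation.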
Figure 7 shows TREE’s performance on blind testing sets for sample sizes of 50 and 140. While the model performs best when restricted to datasets corresponding to high sample size, TREE was trained on mixed sample size data in an attempt to make the model as robust and unbiased as possible.
To analyze the robustness of the model we performed five-fold cross-validation. The results are presented in Table 2, and they show that the model maintains high accuracy regardless of the training set.
Empirical Analysis
We compared the performance of TREE on empirical datasets to a widely used estimator, LDhelmet v1.9. We first benchmarked each method on a large genomic dataset from the Arabidopsis 1000 Genomes project consisting of 1,135 samples. We used subsets of 1k, 10k, and 50k SNPs in 100 SNP windows from the total dataset in order to test the computational speed of each program. We found that LDhelmet cannot process datasets with greater than 50 samples, failing to complete the first step of the likelihood table computations for the smallest of these datasets. However, TREE was able to process each dataset in full within a reasonable time frame (Table 3). By sub-sampling these data, we can illustrate one advantage of TREE’s ability to analyze this quantity of individuals. We find that as the number of samples is increased from 20% to 50% to 100% of the full set, the distribution of recombination events along the genome shifts such that the top 10% of windows contain 17.4%, 19.8%, and 23.9% of events respectively, as the signal of hotspots grows more pronounced. In addition, the importance of efficiency is underscored by the fact that as more samples are included, more SNPs are realized in the data, and more windows are required for a full analysis. As we could not compare recombination estimates on the full Arabidopsis genome due to LDhelmet’s processing times, we turned to a smaller Drosophila dataset with 22 samples and >22 Mbp. For these datasets, LDhelmet takes on the order of hours to complete a run over a single chromosome, whereas TREE terminates on the order of minutes and in all cases finished running in under one hour.
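The windowed analysis above can be sketched as a simple partition of the ordered SNPs into fixed-count windows (100 SNPs per window, as in our comparisons); the function name is hypothetical:

```python
def snp_windows(snp_positions, window_size=100):
    """Partition an ordered list of SNP positions into consecutive
    fixed-count windows; the final window may be partial."""
    return [snp_positions[i:i + window_size]
            for i in range(0, len(snp_positions), window_size)]

windows = snp_windows(list(range(250)), window_size=100)
print([len(w) for w in windows])  # [100, 100, 50]
```

Because windows are defined by SNP count rather than physical length, adding samples (and hence realized SNPs) directly increases the number of windows to analyze, as noted above.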
We first take a broad look at the relative performance of TREE and LDhelmet on genomic datasets, looking for concordance in predicting an increase, decrease, or no change between each window of our analysis. We find that TREE is concordant with LDhelmet in 69.2% of cases where ρ increases and in 69% of cases where ρ decreases. We note that since LDhelmet applies a smoothing while TREE does not, we cannot directly compare the accuracy with which TREE generates adjacent windows of identical recombination rate.
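The concordance measure used here, agreement in the direction of change between adjacent windows, can be sketched as follows (the per-window estimates shown are hypothetical):

```python
def direction_concordance(est_a, est_b):
    """Fraction of adjacent-window transitions where two estimators agree
    on the direction of change: increase (+1), decrease (-1), or none (0)."""
    def directions(xs):
        return [(b > a) - (b < a) for a, b in zip(xs, xs[1:])]
    da, db = directions(est_a), directions(est_b)
    return sum(x == y for x, y in zip(da, db)) / len(da)

# Hypothetical per-window estimates from two methods
print(direction_concordance([1.0, 2.0, 1.5, 1.5], [0.2, 0.9, 0.4, 0.6]))
# two of three transitions agree in direction
```

Comparing directions rather than magnitudes sidesteps the different scales and smoothing behavior of the two estimators.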
To characterize TREE’s behavior in these cases, we looked at the magnitude of the difference between TREE’s prediction and LDhelmet’s. We found that in the cases where LDhelmet predicts no change in ρ, TREE’s predicted change is less than 0.05 in 72.5% of cases, less than 0.01 in 18.1% of cases, and less than 0.001 in 1.8% of cases. These results suggest that TREE is good at detecting large changes in recombination rate, and is thus well suited for hotspot detection. However, it can have difficulty differentiating between subtler changes in recombination rate ranging in magnitude between 0 and 0.01.
Finally, we looked directly at the correlation between the absolute predicted values of TREE and LDhelmet for the most fine-grained comparison. We used two measures of correlation: the Kendall tau rank test and Spearman’s ρ. With the exception of chromosome arm 2L, Kendall’s Tau between TREE estimates and LDhelmet estimates ranges between 0.067 and 0.241 with P-values less than 0.0001. Similarly, Spearman’s ρ ranges between 0.097 and 0.346 with P-values less than 0.0002. These positive correlation coefficients and low P-values suggest global agreement between LDhelmet’s and TREE’s rankings of recombination rates across sliding windows, indicating that TREE is useful in detecting global hotspots of recombination.
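For reference, Kendall's tau in its simplest tie-free form (tau-a) counts concordant versus discordant pairs of windows; library routines such as scipy.stats.kendalltau additionally handle ties and report the P-values quoted above. A minimal sketch:

```python
def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs,
    without tie corrections."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) // 2)

print(kendall_tau_a([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0 (perfect agreement)
print(kendall_tau_a([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0 (perfect reversal)
```

Values near zero with large P-values, as seen for ρph below, indicate rankings that are essentially unrelated.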
Chromosome X and arm 2L are outliers here, as their correlation coefficients are substantially lower and associated P-values substantially higher than those of the remaining chromosome arms in the dataset (Table 4). Despite attempts to discover why these two chromosomes in particular fall outside the average ranges of performance, we were unable to find a compelling reason. It may be that each of these chromosomes has many more cases of subtle recombination rate changes between 0 and 0.01, such that LDhelmet’s estimates are most different from TREE’s.
We also compared the model of Camara et al. to LDhelmet in the same framework, in order to determine whether the addition of the feature ψ is necessary for greater accuracy. We found that the ρph model in β1 alone is dramatically less concordant with LDhelmet estimates across all windows than is TREE (Table 5). For each chromosome arm analyzed using ρph, the rank coefficients of Kendall’s Tau and Spearman’s ρ are less than 0.01, sometimes negative, and not statistically significant. This indicates poor concordance between the two methods and, in some cases, disagreement, suggesting that the addition of ψ to a topological estimator of recombination rate substantially improves accuracy while retaining the computational advantage over LDhelmet. We see evidence of the differences in prediction between TREE and LDhelmet as well as ρph in Figure 7, which shows that TREE approximates the estimates of LDhelmet across the entire span of the chromosome without rescaling, whereas ρph fails to capture similar detail to TREE and requires an informed re-centering to the range of LDhelmet to be competitive.
Discussion
We have discovered a new feature of the distribution of genomes in Hamming space, which we denote ψ, that improves the performance of topological estimators of recombination. While the field of TDA is in its infancy, our work provides a novel demonstration of the power of persistent homology-based estimators for fundamental questions in evolutionary biology. Notably, our feature is related to biologically meaningful quantities in coalescent models; this is some of the first work we are aware of to make such a tight connection between TDA estimators and coalescent theory.
Our ψ-based approach to recombination rate inference can quickly scan large genomic datasets for regions of recombination rate heterogeneity. Due to its speed, it can serve as a first-pass estimate of recombination rate variation prior to targeted use of much more computationally expensive inference methods. While ψ itself can be influenced by distortions to the genealogical structure of a sample, it is naturally complemented by the higher-dimensional topological features of the data (namely β1) explored in prior work (16, 33), and it maintains accuracy in the face of missing data, which confounds β1-only methods.
Similarly to how ψ can supplement and guide usage of evolutionary-model-driven methods, ψ can also add a degree of finer-scale detection and biological intuition to topology-driven methods, bringing us closer to bridging the gap between population genetics and persistent homology. The differing behaviors of H0- and H1-derived statistics on genomic data also point towards the potential of TDA as a source of summary statistics that can tease apart the signatures of demography, selection, or population structure, a fundamental goal of population genetics. The behavior of ψ also suggests that topological quantities could be merged with a fully coalescent model of recombination, as a more rigorous SMC-based model could make explicit predictions for the distribution of the effect sizes of a single recombination event on ψ. In addition, the ordering of the H0 features by duration of persistence hints at additional connections between barcodes and the look-down construction of the coalescent (35, 36). We will explore these avenues in a future theoretical treatment of these statistics.
In summary, we have shown that a combination of TDA and machine learning techniques can detect recombination rate heterogeneity in biological data faster than previously possible, and with greater accuracy than previous TDA-based approaches. We demonstrate that although the behavior of 0-dimensional barcodes has previously been ignored in genealogical inference problems, these features are robust and increase the overall accuracy of inference compared to using 1-dimensional barcodes alone. Our coalescent analyses also suggest a promising future endeavor: building a fully coalescent-motivated model explaining the behavior of Betti numbers on distributions of genome sequences.
Methods
Topological Data Analysis
We created a pipeline which takes as input a sequence file in FASTA format, computes a Hamming distance matrix DH, uses DH to extract the corresponding dimension-0 and dimension-1 persistent homology barcodes, and then calculates barcode summary statistics. The barcodes were computed using Ripser, a publicly available C++ package for computing Vietoris–Rips persistence barcodes (37); all other computations were done in Python 2.7. The scripts are available on our GitHub page at: https://github.com/MelissaMcguirl/TREE.
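The dimension-0 part of this pipeline can be sketched without a persistent-homology library, because the finite dimension-0 bars of a Vietoris–Rips filtration are all born at 0 and die at the edge weights of a minimum spanning tree of the point cloud. The following illustration (the sequences and function names are our own, not the paper's code) computes a Hamming distance matrix and ψ, the average dimension-0 bar length:

```python
import numpy as np

def hamming_matrix(seqs):
    """Pairwise Hamming distances between equal-length sequences."""
    arr = np.array([list(s) for s in seqs])
    n = len(seqs)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = np.sum(arr[i] != arr[j])
    return D

def h0_bar_lengths(D):
    """Finite dimension-0 bar lengths of the Vietoris-Rips filtration.

    All dimension-0 bars are born at 0, and their death times equal the
    edge weights of a minimum spanning tree, found here with Prim's
    algorithm, so Ripser itself is not needed for this statistic."""
    n = D.shape[0]
    remaining = set(range(1, n))
    best = D[0].copy()          # cheapest connection to the growing tree
    deaths = []
    for _ in range(n - 1):
        j = min(remaining, key=lambda k: best[k])
        deaths.append(best[j])
        remaining.remove(j)
        best = np.minimum(best, D[j])
    return np.array(deaths)

seqs = ["ACGTAC", "ACGTAA", "TCGTAA", "ACGGAC"]   # toy alignment
lengths = h0_bar_lengths(hamming_matrix(seqs))
psi, phi = lengths.mean(), lengths.var()          # psi and Phi statistics
```

For the full pipeline, including dimension-1 barcodes, the Ripser package referenced above is required; this sketch only reproduces the H0 summary statistics.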
Simulations
We simulated over 50,000 datasets of genetic sequence data with known recombination rates, each of length 1000 bp, using the programs ms and seq-gen (38). Given an idealized population, we varied the population recombination rate ρ = 4Ner, the sample size n, and the population mutation rate θ = 4Neμ to capture a variety of mutation-recombination regimes in the data. The datasets had population recombination rates ranging from ρ = 0 (no recombination) to ρ = 1000 (effectively free recombination between all sites).
For each population we computed dimension-0 and dimension-1 persistent homology barcodes using Ripser (37). We ran regression analyses to discover topological predictors for recombination and then confirmed that our method predicts ρ directly and does not predict a covariate such as θ.
Model Selection
Initially, we applied polynomial regression of degree two, linear regression, LASSO regression, polynomial LASSO regression, and exponential regression using several different combinations of barcode statistics as inputs for predicting recombination rate using Scikit-learn (39). We sought the simplest model, with the fewest parameters, that could predict recombination rate while maintaining high accuracy.
Each model was trained on a randomly selected subset of the input data, whose size was chosen to be 30% of the total dataset. The model was then tested on the remaining 70% of the input data, where R2 values were computed to assess the goodness-of-fit of the resulting model. This process was repeated several times to test the robustness of the learned parameters with respect to different training sets.
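The train/test procedure above can be sketched as follows; the features here are synthetic stand-ins rather than the paper's barcode statistics, and the 30%/70% split matches the text:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination R^2 on held-out data."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))          # stand-in predictor statistics
y = X @ np.array([2.0, -1.0]) + 3.0    # noiseless linear response

# Train on a random 30% of the data, test on the remaining 70%.
idx = rng.permutation(len(X))
train, test = idx[:30], idx[30:]
A_train = np.column_stack([np.ones(len(train)), X[train]])
coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
A_test = np.column_stack([np.ones(len(test)), X[test]])
r2 = r_squared(y[test], A_test @ coef)
```

In practice the split, fit, and scoring would be repeated with different random seeds, as described above, to check the stability of the learned coefficients.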
Based on the R2 values, we were able to narrow down to three barcode statistics, ψ (average dimension 0 bar length), β1 (first Betti number), and Φ (variance of dimension 0 bar lengths), as inputs for an exponential regression model of the form
The coefficients (α, β, γ, δ, ∊, ζ) were determined via ordinary least squares from simulated training data in scikit-learn (Python 3.0).
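The exact functional form of the paper's model is not reproduced here. As a minimal illustration of how an exponential model can be fit by ordinary least squares, consider a hypothetical form ρ = exp(a + b·ψ + c·β1 + d·Φ), which has fewer coefficients than the model above but becomes linear in its parameters after a log transform:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for (psi, beta1, Phi); NOT the paper's data.
X = rng.uniform(0.5, 5.0, size=(200, 3))
true = np.array([0.3, 0.8, -0.2, 0.1])        # illustrative coefficients
rho = np.exp(true[0] + X @ true[1:])          # noiseless, for clarity

# log(rho) is linear in the coefficients, so OLS recovers them exactly.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, np.log(rho), rcond=None)
```

With noisy data, fitting in log space minimizes relative rather than absolute error; a nonlinear least-squares fit in the original space is an alternative when absolute error matters more.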
This model was tested on input data consisting of 53,461 barcode statistic files corresponding to simulations of varying sample size, mutation rate, and recombination rate. Five-fold cross-validation was performed for different sample sizes and for the complete dataset.
Empirical Analysis
We obtained 22 full genome assemblies from the RG population (from the African survey of Drosophila melanogaster) for our analyses. We ran LDhelmet and TREE in parallel and compared the mean estimates of recombination rate ρ in sliding windows of 500 SNPs. Since LDhelmet provides estimates of ρ between any two SNPs, whereas our method computes ρ within windows of 500 SNPs, we take the mean of every 500 ρ estimates from LDhelmet to compare to our windows. Moreover, since LDhelmet estimates ρ = 2×Ne×r per base pair, whereas TREE predicts ρ = 4×Ne×r, where Ne is the effective population size and r is the probability of a crossover event from ms, we apply a uniform normalization of 1/2 to the TREE predictions for comparison to LDhelmet.
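The windowing and rescaling steps above can be sketched as follows; the function names are our own, and the per-base-pair scaling of LDhelmet's estimates is left aside for simplicity:

```python
import numpy as np

def window_means(rho, window=500):
    """Average consecutive per-SNP-interval LDhelmet estimates so they
    are comparable to TREE's per-window estimates (ragged tail dropped)."""
    rho = np.asarray(rho, dtype=float)
    n = len(rho) // window
    return rho[:n * window].reshape(n, window).mean(axis=1)

def normalize_tree(rho_tree):
    """TREE predicts rho = 4*Ne*r while LDhelmet estimates rho = 2*Ne*r,
    so TREE's predictions are halved before comparison."""
    return np.asarray(rho_tree, dtype=float) / 2.0
```

For example, `window_means(np.arange(1000))` collapses 1000 per-interval estimates into two window means, and `normalize_tree` puts TREE's output on LDhelmet's scale.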
We also ran Camara et al.'s model, using only β1 as a predictor, within our sliding window framework to compare it to our method and to LDhelmet. We compared the results in three ways: 1) in terms of absolute estimates; 2) in terms of concordance in the change in ρ across windows (i.e., whether both methods predict an increase or decrease in ρ in the same window); and 3) in terms of concordance of estimates above the 75th and 90th percentiles of the distribution of estimates. All analyses were done in Python 3.0, and scripts are available on our GitHub page: https://github.com/MelissaMcguirl/TREE.
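Comparisons 2) and 3) admit simple concordance measures; a sketch of one possible implementation (the function names and metric choices are our own, not the paper's code) is:

```python
import numpy as np

def change_concordance(rho_a, rho_b):
    """Comparison 2: the fraction of adjacent window pairs in which
    both methods predict the same direction of change in rho."""
    da = np.sign(np.diff(np.asarray(rho_a, dtype=float)))
    db = np.sign(np.diff(np.asarray(rho_b, dtype=float)))
    return float(np.mean(da == db))

def tail_concordance(rho_a, rho_b, q=75):
    """Comparison 3: agreement on which windows fall above the q-th
    percentile of each method's own distribution of estimates."""
    rho_a = np.asarray(rho_a, dtype=float)
    rho_b = np.asarray(rho_b, dtype=float)
    above_a = rho_a > np.percentile(rho_a, q)
    above_b = rho_b > np.percentile(rho_b, q)
    return float(np.mean(above_a == above_b))
```

Both functions return values in [0, 1], with 1 indicating perfect agreement between the two sets of window-level estimates.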
We additionally included an analysis of 1,135 publicly available Arabidopsis genomes. We converted the raw VCF file to FASTA using a combination of bcftools and VCF-kit's phylo fasta function, subsampling up to 50k SNPs in order to run the software within Stampede2's 48-hour time limit. We ultimately used this dataset to benchmark the computational efficiency of TREE over LDhelmet; because LDhelmet is unable to process the full dataset with so many samples, it could not serve as a comparator of accuracy here.
Acknowledgements
M.M. would like to thank John Wakeley as well as members of the Desai lab for helpful comments and discussion. The authors would like to thank Raul Rabadan and Juan Patino Galindo for very helpful feedback on the manuscript. D.P.H., M.R.M., and M.M. are supported by the National Science Foundation Graduate Research Fellowship Program under Grant Numbers DGE1610403, 1644760, and DGE1745303, respectively. A.J.B. was supported in part by NIH grants 5U54CA193313 and GG010211-R01-HIV and AFOSR grant FA9550-15-1-0302. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.