Abstract
Runs of homozygosity (ROH) are important genomic features that manifest when an individual inherits two haplotypes that are identical-by-descent. Their length distributions are informative about population history, and their genomic locations are useful for mapping recessive loci contributing to both Mendelian and complex disease risk. We have previously shown that ROH, and especially long ROH that are likely the result of recent parental relatedness, are enriched for homozygous deleterious coding variation in a worldwide sample of outbred individuals. However, the distribution of ROH in admixed populations and their relationship to deleterious homozygous genotypes is understudied. Here we analyze whole genome sequencing data from 1,441 individuals from self-identified African American, Puerto Rican, and Mexican American populations. These populations are three-way admixed between European, African, and Native American ancestries and provide an opportunity to study the distribution of deleterious alleles partitioned by local ancestry and ROH. We recapitulate previous findings that long ROH are enriched for deleterious variation genome-wide. We then partition by local ancestry and show that deleterious homozygotes arise at a higher rate when ROH overlap African ancestry segments than when they overlap European or Native American ancestry segments of the genome. These results suggest that, while ROH on any haplotype background are associated with an inflation of deleterious homozygous variation, African haplotype backgrounds may play a particularly important role in the genetic architecture of complex diseases for admixed individuals, highlighting the need for further study of these populations.
Introduction
Runs of homozygosity (ROH) are long stretches of identical-by-descent (IBD) haplotypes that manifest in individual genomes as the result of recent parental relatedness. Originally conceived to improve the accuracy of homozygosity mapping of recessive Mendelian diseases, ROH have formed the foundation of studies investigating the contribution of recessive deleterious variants to the genetic risk for complex diseases and to the determination of complex traits [1]. Moreover, they have provided unique insights into the demographic and sociocultural processes [1] that have shaped genomic variation patterns in contemporary worldwide human populations [2–12], ancient hominins [13–16], non-human primates [17, 18], woolly mammoths [19], livestock [20–26], birds [27, 28], felines [29], and canids [30–39]. Recent population bottlenecks, cultural preferences for endogamy or consanguineous marriage, and natural selection can all increase the rate of ROH in individual genomes, substantially increasing overall homozygosity in such populations.
Several studies of the distribution of ROH in ostensibly outbred human populations have shown that ROH are common and range in size from tens of kilobases to several megabases in length [2–8]. Furthermore, total length and prevalence of ROH are correlated with distance from Africa [5, 7, 8], with more and longer ROH manifesting in individuals from populations a longer distance away. These patterns likely reflect increased IBD among haplotypes as a result of the serial bottlenecking process that humans experienced as they migrated out of Africa.
The prevalence of ROH in individual genomes has also been an important factor for understanding the genetic basis of complex phenotypes [40–43]. High levels of ROH have been associated with heart disease [44–47], cancer [44, 48–52], blood pressure [53–57], LDL cholesterol [57], various mental disorders [58–63], human height [64, 65], and increased susceptibility to infectious diseases [66]. Indeed, these results are consistent with the idea that many rare alleles of small effect may be the cause of increased risk for complex diseases [67–71], especially if these mutations are recessive [4, 72].
We have previously shown that ROH, especially long ROH, are enriched for deleterious homozygous variation [73, 74]. Whereas an overall increase in homozygotes is expected with increasing genomic ROH, we have shown that the rate at which deleterious homozygotes accumulate outpaces the rate at which benign homozygotes accumulate [73, 74] in long ROH (ROH on the order of several megabases). This result is a consequence of young (long) haplotypes with low-frequency variants segregating on them being paired IBD [74]. As low-frequency variants are more likely to be deleterious, the processes that create very long ROH can also generate unusually high numbers of deleterious homozygotes within these regions.
Although a few studies describing the worldwide distribution of ROH patterns have included a small number of admixed populations [5, 7, 8], the number of individuals per admixed population has been fairly small. Even as the number of admixed individuals continues to grow in the United States [75], they are still relatively understudied, which translates to disparities in our understanding of population-specific genetic factors that may influence complex phenotypes [76, 77]. Indeed, admixed populations have unique features compared to other populations, in that genomes from these populations are recent combinations of two or more ancestral populations.
This ancestral mosaicism has been exploited to make inferences about the natural history of human populations [78–88] and to search for ancestral haplotypes that influence complex phenotypes [89–96]. Here we add to the body of work on admixed populations by examining the relationship between ROH, local ancestry, and the accumulation of deleterious alleles. We use 1,441 recently published [97] whole genome sequences (dbGaP accession numbers phs000920 and phs000921) distributed roughly equally across three admixed populations in the Americas: African American (n = 475), Mexican American (n = 483), and Puerto Rican (n = 483). Each of these populations is three-way admixed between European, Native American, and African ancestral populations, although each has a distinct history.
Among the ancestral populations that contributed haplotypes to these admixed populations, it has been shown that the distribution of deleterious heterozygotes and deleterious homozygotes changes with distance from Africa [98–101]. With this in mind, we propose that accumulation of deleterious homozygotes via increased genomic ROH may also differ within admixed populations based on differing ancestral haplotypes. Indeed, with high deleterious heterozygosity, we propose that African ancestral haplotypes may be most susceptible to large increases in deleterious homozygotes when subjected to harsh bottlenecks or inbreeding, as these low frequency deleterious alleles will be paired into homozygotes as a result of increased genomic ROH.
Results
Admixture
Using the subset of sites from our whole-genome sequencing data that intersected with our African, European, and Native American reference panels, we called 3-way local ancestry tracts in all 1,441 samples (see Methods). We also estimated global ancestry proportions by summing the length of all haplotypes inferred to be from a given ancestry and dividing by the total genome length. Fig 1 summarizes the global ancestry proportions for all individuals from each population on a ternary plot. The admixture proportions largely accord with previous results in these populations, with Puerto Ricans having mostly African and European ancestry, Mexican Americans having mostly European and Native American ancestry, and African Americans having mostly African and European ancestry to the near exclusion of any Native American ancestry. However, although African Americans are frequently treated as a 2-way admixed population between European and African sources, we show that several AA individuals have non-trivial proportions of Native American ancestry. This suggests that, in general, a 2-way admixture model may not be uniformly appropriate for studying admixture patterns amongst self-identified African American individuals.
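The global ancestry calculation described above can be sketched as follows. This is a minimal illustration rather than the pipeline used here: the tract layout `(ancestry, start_bp, end_bp)` and the function name `global_ancestry` are hypothetical, and the denominator is taken to be the total length of called tracts across both haplotypes, which stands in for the total genome length.

```python
from collections import defaultdict

def global_ancestry(tracts):
    """Return a dict mapping ancestry -> fraction of total tract length.

    `tracts` is an iterable of (ancestry, start_bp, end_bp) tuples pooled
    over both haplotypes of one individual (hypothetical layout).
    """
    length_by_ancestry = defaultdict(int)
    for ancestry, start, end in tracts:
        length_by_ancestry[ancestry] += end - start
    total = sum(length_by_ancestry.values())
    return {a: length / total for a, length in length_by_ancestry.items()}

# Toy tracts for one individual (made-up coordinates, not real data).
example_tracts = [
    ("AFR", 0, 600), ("EUR", 600, 1000),  # haplotype 1
    ("AFR", 0, 1000),                     # haplotype 2
]
proportions = global_ancestry(example_tracts)
```

Each individual's three proportions sum to one, which is what allows them to be placed on a ternary plot.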
Runs of Homozygosity
We followed the ROH calling pipeline of Pemberton et al. [7] as implemented in the software GARLIC [102] to call ROH from the full whole-genome sequencing data (see Methods). This method identifies three classes of ROH based on the length distribution in each population. We refer to these size classes as short, medium, and long. These classes roughly correspond to ROH formed of IBD haplotypes from different time periods in the population history. Short ROH are tens of kilobases in length and likely reflect the homozygosity of old haplotypes; medium ROH are hundreds of kilobases in length and likely reflect background relatedness in the population; and long ROH are hundreds of kilobases to several megabases in length and are likely the result of recent parental relatedness. Total length of ROH in the genome is correlated with distance from Africa [4, 7]. In the case of our admixed populations, we therefore expect the total length of ROH to be correlated with increased European and Native American admixture fraction. Indeed, Fig 2A illustrates this pattern, with AA individuals having the lowest total ROH, PR individuals having intermediate total ROH, and MX individuals having the highest total ROH (all pairwise Mann-Whitney U tests p < 2.2 × 10⁻¹⁶). Breaking down ROH by size class, we find that the total length of short ROH is comparable between PR and MX individuals (Fig 2B), but the total lengths of both medium ROH (Fig 2C) and long ROH (Fig 2D) are highest on average in MX individuals.
Deleterious Alleles
We used multiple approaches to predict the deleteriousness of all sites in the genome (see Methods), but focus on missense mutations classified as Probably Damaging, Possibly Damaging, or Benign using PolyPhen 2 [103]. As in [73], we combine the Probably Damaging and Possibly Damaging mutations into a single “damaging” class, and we combine all Benign mutations with synonymous mutations into a single “benign” class. For individual i across all sites, we denote by $d_i^k$ and $b_i^k$ the total number of sites with k ∈ {0,1,2} alternate alleles classified as damaging or benign, respectively. In Fig 3A we plot the distribution of deleterious heterozygotes per individual, $d_i^1$, split by population. Consistent with previous work [98–101], we see an increased number of deleterious heterozygotes in populations with more African ancestry, with AA individuals having the most and MX individuals having the fewest (patterns replicate with other deleterious categories, see S5-S10 Figs). Conversely, we would expect an increase of deleterious homozygotes per individual in populations with more non-African ancestry. Indeed, in Fig 3B we plot the distribution of deleterious homozygotes per individual, $d_i^2$, split by population and observe AA individuals having the fewest and MX individuals having the most (these patterns also replicate with other deleterious categories, see S5-S10 Figs). Fig 3C plots the total number of deleterious alleles per individual, $d_i^1 + 2d_i^2$. Contrary to other work [101], we find that total deleterious load is highest on average in AA individuals. Although this pattern replicates across several other deleterious calling methods (S5-S9 Figs), when using GERP scores (as in [101]) the pattern reverses (S10 Fig) and is consistent with [101].
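As a small illustration of the per-individual counts plotted in Fig 3, the sketch below tallies sites by class and alternate allele count, then derives the total number of deleterious alleles as heterozygous sites plus twice the homozygous sites. The input layout `(alt_allele_count, label)` and the function names are assumptions made for this example.

```python
from collections import Counter

def genotype_counts(genotypes):
    """Count sites by (label, number of alternate alleles).

    `genotypes` is an iterable of (alt_allele_count, label) pairs for one
    individual, with label in {"damaging", "benign"} (hypothetical layout).
    """
    counts = Counter()
    for alt_alleles, label in genotypes:
        counts[(label, alt_alleles)] += 1
    return counts

def total_damaging_alleles(counts):
    """Heterozygous sites carry one damaging allele, homozygous sites two."""
    return counts[("damaging", 1)] + 2 * counts[("damaging", 2)]

# Toy genotypes for one individual (made-up values, not real data).
example = [(1, "damaging"), (1, "damaging"), (2, "damaging"),
           (0, "damaging"), (2, "benign"), (1, "benign")]
counts = genotype_counts(example)
load = total_damaging_alleles(counts)
```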
Deleterious Alleles Across Local Ancestry
We next investigate whether there are any differences in deleterious load by local ancestry. Although our local ancestry calls provide us with phased local ancestry inferences, we were limited to a small subset of sites for our reference populations. Since the vast majority of our deleterious alleles come from our unphased whole-genome data, we do not have phase information for the deleterious alleles and cannot assign a specific ancestral haplotype in regions of discordant ancestry. Therefore, we calculate total load based on six different ancestry backgrounds. AFR, EUR, and NAM ancestry regions represent regions that are homozygous for African, European, and Native American ancestries, respectively, and AFEU, EUNA, and AFNA ancestry regions represent regions that are called heterozygous for African/European, European/Native American, and African/Native American ancestries, respectively. We then calculate for each population the number of deleterious alleles per basepair for each ancestry background.
Table 1 shows the number of deleterious alleles per basepair for each population and each ancestry background. We perform two types of tests for independence in order to determine whether there are significant differences in the number of deleterious alleles per basepair. First, we test for independence of the count of deleterious alleles on an ancestry background and the count of basepairs covered by that ancestry across populations. We find that neither African ancestry nor European ancestry shows statistically significant differences in the number of deleterious alleles per Mbp across populations. Further, while NAM, AFEU, and AFNA exhibit statistically significant differences across populations, each difference appears to be driven by a single population (AA, MX, and PR, respectively). Next, we test for independence of these counts across ancestries within each population. Here we find that all populations have statistically significant differences in the distribution of deleterious alleles across ancestry backgrounds (AA p < 2.2 × 10⁻¹⁶; MX p < 2.2 × 10⁻¹⁶; PR p < 2.2 × 10⁻¹⁶), with NAM ancestry having the lowest rate in AA and PR individuals and EUR having the lowest rate in MX individuals. However, we note that the overall differences were very small (a difference of < 0.1 deleterious alleles per Mbp).
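The text does not name the specific test of independence used, so the sketch below assumes a Pearson chi-square statistic on a contingency table. For the first kind of test, rows would be populations and the two columns would hold the deleterious allele count and the basepairs covered on a single ancestry background. The function name and the counts are invented for illustration.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a 2-D table.

    `table` is a list of rows of non-negative counts; every row and column
    total must be positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# Made-up counts, not data from Table 1.
example_table = [[10, 20], [20, 10]]
stat, dof = chi_square_statistic(example_table)
```

A p-value would then come from the chi-square distribution with `dof` degrees of freedom. With very large basepair counts, even tiny rate differences can yield extremely small p-values, which is consistent with the caveat above about small effect sizes.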
Deleterious Alleles in ROH
Next, we turn to examining the distribution of deleterious homozygotes within ROH. It was previously reported [73, 74] that there is a higher proportion of deleterious homozygotes per unit increase of ROH than expected from the proportion of benign homozygotes. Naturally, as the total amount of genomic ROH increases, we expect more homozygotes to fall within ROH. However, [73] and [74] found that the rate of increase of the proportion of deleterious homozygotes was greater than for benign homozygotes. This effect was strongest for long ROH, which are likely the result of recent parental relatedness.
For each individual i and for each ROH class j ∈ {A, B, C, R, N} (A - short ROH, B - medium ROH, C - long ROH, R - all ROH, and N - outside ROH), we define the number of damaging or benign sites with k ∈ {0,1,2} alternate alleles as $d_{i,j}^k$ and $b_{i,j}^k$, respectively. Thus we calculate the proportion of damaging homozygotes in ROH class j as $f^d_{i,j} = d_{i,j}^2 / (d_{i,R}^2 + d_{i,N}^2)$ and the proportion of benign homozygotes in ROH class j as $f^b_{i,j} = b_{i,j}^2 / (b_{i,R}^2 + b_{i,N}^2)$. We also compute, for each individual i and each class j, the fraction of the genome covered in ROH as $G_{i,j}$, the total length of class j ROH in individual i divided by the length of the genome.
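A minimal sketch of these per-individual quantities follows, assuming the proportion of homozygotes in a class is taken relative to all homozygotes genome-wide (inside plus outside ROH) and that ROH coverage is total ROH length over genome length. The function names and numbers are illustrative only.

```python
def homozygote_fraction(hom_in_class, hom_in_all_roh, hom_outside_roh):
    """Fraction of genome-wide homozygotes that fall in one ROH class."""
    genome_wide = hom_in_all_roh + hom_outside_roh
    return hom_in_class / genome_wide

def roh_genome_fraction(roh_lengths_bp, genome_length_bp):
    """Fraction of the genome covered by the given ROH segments."""
    return sum(roh_lengths_bp) / genome_length_bp

# Made-up values for one individual and the long (class C) ROH.
f_damaging_long = homozygote_fraction(12, 30, 170)  # 12 of 200 homozygotes
g_long = roh_genome_fraction([2_000_000, 3_000_000], 100_000_000)
```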
We plot the proportions of ROH homozygotes versus genomic fraction of ROH in Fig 4, which is analogous to Fig 4 from [73]. In order to determine if there is a statistically significant difference in the accumulation of deleterious homozygotes versus benign homozygotes, we construct a linear regression model (as in [73, 74]), $f_{\cdot,j} = \beta_0 + \beta_1 G_{\cdot,j} + \beta_2 D + \beta_3 D G_{\cdot,j} + \varepsilon$, where $f_{\cdot,j}$ is a vector of length 2,882 containing the proportions of both damaging and benign homozygotes in ROH class j for all individuals, $G_{\cdot,j}$ is a vector of genomic class j ROH proportions, and D is an indicator variable taking a value of 1 when the response represents damaging homozygotes and 0 for benign homozygotes. In this framework, a statistically significant β2 suggests an overall higher proportion of damaging homozygotes in ROH compared to benign homozygotes, e.g. β2 = 0.1 means that an extra 10% of genome-wide deleterious homozygotes fall in ROH compared to the distribution of benign homozygotes. A statistically significant β3 suggests a difference in the rate of accumulation per unit increase of ROH, e.g. β3 = 1.0 means that for a 10% increase in genomic ROH, 10% more deleterious homozygotes fall in ROH compared to benign homozygotes. Inferred coefficients for the four regressions corresponding to j ∈ {A, B, C, R} are each given in Table 2.
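To make the structure of this regression concrete, the sketch below fits the four coefficients by ordinary least squares on a stacked response, with benign observations (D = 0) followed by damaging observations (D = 1). It uses only the standard library and solves the normal equations directly. The data are synthetic and noise-free, chosen so the known coefficients are recovered; standard errors and p-values are not computed, and the helper names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def fit_interaction_model(g, d, f):
    """OLS fit of f = b0 + b1*g + b2*d + b3*d*g; returns [b0, b1, b2, b3]."""
    rows = [[1.0, gi, di, di * gi] for gi, di in zip(g, d)]
    k = 4
    xtx = [[sum(r[p] * r[q] for r in rows) for q in range(k)] for p in range(k)]
    xty = [sum(r[p] * fi for r, fi in zip(rows, f)) for p in range(k)]
    return solve_linear_system(xtx, xty)

# Synthetic, noise-free data (not from the study): four individuals,
# each contributing one benign (d = 0) and one damaging (d = 1) response.
g_values = [0.01, 0.02, 0.05, 0.10]
g = g_values + g_values
d = [0] * 4 + [1] * 4
true_betas = [0.02, 1.0, 0.1, 0.5]
f = [true_betas[0] + true_betas[1] * gi + true_betas[2] * di
     + true_betas[3] * di * gi for gi, di in zip(g, d)]
betas = fit_interaction_model(g, d, f)
```

Stacking both responses for each individual is what gives a vector of length 2 × 1,441 = 2,882 in the actual analysis.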
Fig 4A plots these proportions versus total ROH for all ROH classes combined. In agreement with [73], we find that there is an overall greater proportion of damaging homozygotes in ROH compared to benign homozygotes (β2 = 0.1799, p < 2 × 10⁻¹⁶), but in contrast the overall rate of accumulation is not different (β3 = 1.807 × 10⁻², p = 0.0671). When we partition ROH by size class, the distribution of homozygotes in short ROH (Fig 4B) also differs from [73]. Whereas previously there were no statistically significant differences in β2 or β3, here we find a significant positive β2 = 4.810 × 10⁻² (p < 2 × 10⁻¹⁶) and a statistically significant negative β3 = −0.428 (p = 1.10 × 10⁻⁸), suggesting that ROH comprised of old haplotypes accumulate deleterious homozygotes at a slower rate than benign homozygotes. As we expect short ROH to be comprised of old haplotypes that have been segregating for a long time, it is reasonable to think that only haplotypes with relatively few deleterious alleles remain segregating in the population. Our results for medium (Fig 4C) and long ROH (Fig 4D) are consistent with previous work [73, 74]; in particular we find that the difference in rates of gain of deleterious versus benign homozygotes is greatest in long ROH (β3 = 0.229; p < 2 × 10⁻¹⁶).
Deleterious Alleles in ROH Partitioned by Local Ancestry
Now we turn to analyzing the distribution of deleterious homozygotes in ROH comprised of only one particular ancestral haplotype. As shown in Fig 3A and in other work [98–101], populations with more African ancestry tend to have high numbers of deleterious heterozygotes genome-wide. This contrasts with populations that have more European and Native American ancestry, which tend to have more genome-wide deleterious homozygotes (Fig 3B) as a result of the serial bottlenecks they experienced since migrating out of Africa.
However, admixed populations are a recent combination of two or more ancestral populations, and since genome-wide ancestry proportions correlate with numbers of deleterious heterozygotes and homozygotes, we desire to investigate how this mosaicism might affect the accumulation of deleterious homozygotes in ROH. We have already shown (Fig 4) that as total genomic ROH increases the proportion of deleterious homozygotes falling in ROH increases faster than the proportion of benign homozygotes, but here we want to know if the ancestral background of the IBD haplotypes matters. Here we propose that haplotypes sourced from ancestral populations with high deleterious heterozygosity have highest rates of accumulation of deleterious homozygotes when paired IBD to generate ROH.
Why might we expect high deleterious heterozygosity haplotypes to generate large numbers of deleterious homozygotes in ROH? Pemberton and Szpiech [74] recently demonstrated that long ROH are enriched for homozygotes comprised of low-frequency alleles. Low-frequency alleles are more likely to be deleterious and are more likely to manifest in individual genomes as heterozygotes. Under a typical random mating scenario these low-frequency alleles would be likely to segregate in the population largely as heterozygotes; however, severe bottlenecks and cultural practices such as endogamy and consanguineous marriage substantially raise the likelihood of pairing low-frequency alleles as IBD homozygotes. We therefore expect that deleterious homozygotes will be concentrated in large proportion within ROH comprised of African ancestral haplotypes, and that the rate of gain of deleterious homozygotes will be greatest in ROH of African ancestral haplotypes.
To test this proposition, we first partition ROH based on the ancestral background of the underlying IBD haplotypes. Then we compute for each individual (i) the fraction of all deleterious (d) and benign (b) homozygotes across the genome that fall into each ROH class (j) as $f^d_{i,j}(A) = d_{i,j}^2(A) / (d_{i,R}^2 + d_{i,N}^2)$ and $f^b_{i,j}(A) = b_{i,j}^2(A) / (b_{i,R}^2 + b_{i,N}^2)$, where $d_{i,j}^2(A)$ and $b_{i,j}^2(A)$ are the number of deleterious and benign homozygotes, respectively, in individual i in ROH class j on ancestral haplotype background A ∈ {AFR,EUR,NAM}. Similarly, $f^d_{i,j}(A)$ and $f^b_{i,j}(A)$ are the genome-wide fraction of deleterious and benign homozygotes, respectively, in individual i in ROH class j that fall on haplotype background A. Finally, we fit a linear model similar to the one above, $f_{\cdot,j}(A) = \beta_0 + \beta_1 G_{\cdot,j}(A) + \beta_2 D + \beta_3 D G_{\cdot,j}(A) + \varepsilon$, in order to test for differences in the rate of accumulation (β3) of deleterious homozygotes compared to benign homozygotes as a function of $G_{\cdot,j}(A)$, the genomic fraction of ROH on ancestral background A. The results are plotted in Fig 5 for total ROH (j = R; Fig 5A-C) and for long ROH (j = C; Fig 5D-F), and the regression coefficients are also summarized in Table 3.
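The partitioning step can be illustrated with a simple interval intersection. The sketch below assumes ROH are given as `(start_bp, end_bp)` pairs on one chromosome and local ancestry as diploid segments `(start_bp, end_bp, ancestry_1, ancestry_2)`, and it credits ROH length only where both haplotypes carry the same ancestry. This data layout and the handling of discordant regions are assumptions for illustration, not a description of the exact procedure used here.

```python
def roh_length_by_ancestry(roh_segments, ancestry_segments):
    """Total ROH length (bp) overlapping each homozygous ancestry background."""
    totals = {}
    for roh_start, roh_end in roh_segments:
        for seg_start, seg_end, anc_a, anc_b in ancestry_segments:
            if anc_a != anc_b:
                continue  # skip regions where the two haplotypes differ
            overlap = min(roh_end, seg_end) - max(roh_start, seg_start)
            if overlap > 0:
                totals[anc_a] = totals.get(anc_a, 0) + overlap
    return totals

# Toy coordinates (made up): one ROH spanning three ancestry segments.
roh = [(100, 500)]
ancestry = [
    (0, 300, "AFR", "AFR"),
    (300, 400, "AFR", "EUR"),
    (400, 1000, "EUR", "EUR"),
]
lengths = roh_length_by_ancestry(roh, ancestry)
```

Dividing each total by the genome length gives the ancestry-specific ROH coverage used as the predictor in the regression.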
For total ROH, we find significant differences in the rate of accumulation of deleterious homozygotes on all ancestry backgrounds (Fig 5A-C). Furthermore, consistent with our expectations, we find that ROH on African ancestral haplotypes have the highest rate difference (β3 = 1.214, p < 2 × 10⁻¹⁶; Fig 5C), whereas ROH on European ancestral haplotypes have an intermediate rate difference (β3 = 0.648, p < 2 × 10⁻¹⁶; Fig 5B) and ROH on Native American ancestral haplotypes have the lowest rate difference (β3 = 0.510, p < 2 × 10⁻¹⁶; Fig 5A). This pattern is repeated when we consider only long ROH comprised of young haplotypes (Fig 5D-F) and also when we analyze smaller ROH (albeit with weaker effects; S1 Fig).
We next directly compare the rate of increase of deleterious homozygotes across different ancestral haplotype backgrounds. To do this we compute the following regression, $f^d_{\cdot,j}(\cdot) = \beta_0 + \beta_1 G_{\cdot,j}(\cdot) + \beta_2 I(\mathrm{EUR}) + \beta_3 I(\mathrm{NAM}) + \beta_4 I(\mathrm{EUR}) G_{\cdot,j}(\cdot) + \beta_5 I(\mathrm{NAM}) G_{\cdot,j}(\cdot) + \varepsilon$, where $f^d_{\cdot,j}(\cdot)$ is a vector representing the proportion of damaging homozygotes in ROH class j on each local ancestry background across all individuals. $G_{\cdot,j}(\cdot)$ represents the genome-wide fraction of ROH class j falling on each local ancestry background across all individuals, and I(A) is an indicator variable which takes the value 1 if the associated response is on ancestral background A ∈ {AFR,EUR,NAM} and takes the value 0 otherwise. Here we analyze each ROH class: all, long, medium, and short.
We plot the results for “all” and “long” in Fig 6 (“medium” and “short” in S2 Fig) and summarize the inferred regression coefficients for all classes in Table 4. We focus on the regression coefficients β4 and β5, which represent the difference in rate of gain of deleterious homozygotes in ROH on European or Native American haplotypes compared to African haplotypes, respectively. Graphically, in Fig 6 and S2 Fig, a significant β4 corresponds to a significant difference in the slope of the orange and blue line, and a significant β5 corresponds to a significant difference in the slope of the orange and purple line. Since we expect the rate of gain of deleterious homozygotes to be lower in ROH on European and Native American haplotypes than in ROH on African ones, we expect significant negative values for both β4 and β5.
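Assuming African haplotypes serve as the baseline background, as the description of β4 and β5 implies, the slope of the damaging-homozygote proportion with respect to ROH coverage on each background works out as follows. This is a reading of the model under that assumption, with β1 taken to be the coefficient on ROH coverage.

```latex
\begin{align*}
\text{AFR:}\quad & \frac{\partial f^{d}}{\partial G} = \beta_1 \\
\text{EUR:}\quad & \frac{\partial f^{d}}{\partial G} = \beta_1 + \beta_4 \\
\text{NAM:}\quad & \frac{\partial f^{d}}{\partial G} = \beta_1 + \beta_5
\end{align*}
```

Under this reading, negative β4 and β5 mean shallower slopes on European and Native American backgrounds than on African ones.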
Consistent with our expectations, when analyzing all ROH (Fig 6A) we find a significant negative β4 = −0.763 (p < 2 × 10⁻¹⁶) and β5 = −0.852 (p < 2 × 10⁻¹⁶), indicating that the gain rate of damaging homozygotes in ROH on African ancestral haplotypes outpaces that of ROH on the other ancestral haplotypes. This pattern continues when considering only long ROH (β4 = −0.852, p < 2 × 10⁻¹⁶; β5 = −0.727, p < 2 × 10⁻¹⁶; Fig 6B) and smaller ROH (Table 4 and S2 Fig).
To check the robustness of these results, we reran these analyses using several other deleterious classification methods including SIFT [104, 105], Provean [106], and GERP [107]. Since GERP scores sites and not mutations, we restricted the GERP analysis to loci where the ancestral and derived states were inferred to high confidence. As this ancestral polarization results in discarding a large number of loci with ambiguous ancestral allele state, we also reran these analyses for Polyphen 2 [103], SIFT [104, 105], and Provean [106] restricted only to loci for which we have ancestral/derived state information. S3 Fig plots the inferred β3 for each of these analyses for each ROH size class and demonstrates qualitatively similar patterns as shown above.
We further re-analyzed a subset of the ROH and deleteriousness calls from Pemberton and Szpiech [74], which contains data on six admixed populations from the 1000 Genomes Project [108] and used CADD [109] scores as a deleteriousness prediction (S1 Text). After extracting the data relating to the admixed individuals from Pemberton and Szpiech [74] and calling local ancestries, we again find qualitatively similar patterns as above (S4 Fig).
Finally, since Pemberton and Szpiech [74] showed that these enrichment patterns appear to be driven by an abundance of homozygotes in ROH comprised of low-frequency alleles, we reanalyzed our data using categories of minor allele frequency (MAF) instead of deleteriousness. In order to determine MAF category, we use frequencies computed from all TOPMed Freeze 3 whole-genome sequencing data sets (dbGaP accession numbers phs000920, phs000921, phs001062, phs001032, phs000997, phs000993, phs001189, phs001211, phs001040, phs001024, phs000974, phs000956, phs000951, phs000946, phs000988, phs000964, phs000972, phs000954, and phs001143) forming a total sample size of n = 18,581. Using these allele frequencies, we categorize each polymorphic locus in a gene region (exons plus introns) into one of two categories: common (MAF ≥ 0.05) and rare (MAF < 0.05). We then fit the same models as above, except that instead of comparing the proportion of deleterious alternate allele homozygotes to benign homozygotes as a function of ROH coverage, we compare the number of minor allele homozygotes in the rare class to the common class.
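The frequency binning described above amounts to a single threshold on minor allele frequency. The sketch below assumes the input is an alternate allele frequency that is first folded to the minor allele frequency; the function name is hypothetical.

```python
def maf_category(alt_allele_frequency, threshold=0.05):
    """Classify a locus as "common" (MAF >= threshold) or "rare"."""
    maf = min(alt_allele_frequency, 1.0 - alt_allele_frequency)
    return "common" if maf >= threshold else "rare"

categories = [maf_category(freq) for freq in (0.01, 0.05, 0.5, 0.97)]
```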
We summarize the results of these analyses for each ancestral background, each ROH size class, and each low-frequency class in Fig 7. We find that ROH on African haplotype backgrounds gain more low-frequency minor allele homozygotes per unit increase of ROH (and especially long class C ROH) compared to common minor allele homozygotes. Since low-frequency alleles are enriched for deleterious variants relative to high-frequency alleles, this result accords with our previous analyses.
Discussion
The distribution of runs of homozygosity in individual genomes has provided insights into evolutionary, population, and medical genetics [1]. By examining their genomic location and prevalence in a population, we can learn about the history and adaptation of natural populations [2–39], and we can make discoveries about the genetic basis of complex phenotypes [40–66]. Given the importance of demographic history and socio-cultural practices in the generation of ROH in individual genomes, and their relationship to complex phenotypes including many genetic diseases, it naturally follows to study the distribution of deleterious alleles and their relationship to ROH.
Previous work has described the effect of demographic history on the distribution of deleterious alleles [98–101, 110, 111], including a few specifically investigating their relationship with runs of homozygosity [21, 38, 73, 74, 112, 113]. However, little work has been done on the relationship between deleterious alleles and ROH in admixed populations (although see [113]). Since there is evidence of very recent bottlenecks (which generate ROH) within admixed populations living in the Americas [88, 113], the relationship between ROH and the accumulation of deleterious homozygotes may provide valuable insights into the genetic basis of complex phenotypes in these individuals.
Here we analyzed 1,441 individuals across three admixed populations: African American, Puerto Rican, and Mexican American. We found that, consistent with other studies, the proportion of deleterious homozygotes found in ROH increases faster than the proportion of benign homozygotes as a function of total genomic ROH (Fig 4). However, we also proposed that ancestral haplotypes from populations with high deleterious heterozygosity would exhibit even greater increases of deleterious homozygotes per unit ROH. We reason that, under random mating, the larger number of low-frequency deleterious alleles in the population would largely segregate as heterozygotes, whereas, when a harsh bottleneck or consanguinity occurs, these mutations get paired IBD as homozygotes, concentrating more deleterious homozygotes within ROH. Indeed, we found that the genome-wide proportion of deleterious homozygotes in ROH on African ancestral haplotypes increased faster per unit ROH than on either European or Native American ancestral haplotypes (Figs 5 and 6). These patterns are also consistent with population-specific worldwide patterns of deleterious homozygotes in ROH [74], where three of the five African populations analyzed had among the highest rates of enrichment in long ROH.
Whereas ROH on any haplotype background are associated with an increased rate of deleterious homozygotes, ROH on African haplotypes tend to have a larger share of the genome-wide deleterious homozygotes. Indeed, this accords with recent work that has independently associated increased ROH [65] and increased African ancestry [114] with reduced lung function. This suggests that these ROH on African haplotypes may play a particularly important role in the genetic architecture of complex phenotypes in admixed individuals, especially for populations with African ancestry that have undergone very harsh bottlenecks in the recent past.
Methods
Calling Local Ancestry
We used 90 African (YRI) individuals and 90 European (CEU) individuals for ancestry references (genotypes obtained from the Axiom® Genotype Data Set at https://www.thermofisher.com/us/en/home/life-science/microarray-analysis/microarray-data-analysis/microarray-analysis-sample-data/axiom-genotype-data-set.html) and SNPs with less than 95% call rate were removed. For Native American reference genotypes we used 71 Native American individuals previously genotyped on the Axiom® Genome-Wide LAT 1 array [115].
We then subset our 1,441 whole-genome sequences to sites found on the Axiom® Genome-Wide LAT 1 array, leaving 765,321 markers. We then merged these data with our European (CEU), African (YRI), and Native American (NAM) reference panels, which overlapped at 434,145 markers. After filtering multi-allelic SNPs and SNPs with > 10% missing data, we obtained a final merged dataset of 428,644 markers. We phased this combined data set using SHAPEIT2 [116] and called local ancestry tracts jointly with RFMix [117] under a three-way admixture model based on the African, European, and Native American reference genotypes described above.
Calling Runs of Homozygosity
We called runs of homozygosity using the program GARLIC v1.1.4 [102], which implements the ROH calling pipeline of [7], for each population separately on the full whole-genome call set, filtering only monomorphic sites. For the 475 African American (AA) individuals, this left 39,517,679 segregating sites; for the 483 Puerto Rican (PR) individuals, this left 31,961,900 segregating sites; and for the 483 Mexican American (MX) individuals, this left 30,744,389 segregating sites. Instead of asserting a single constant genotyping error rate (as in [7]), we used genotype quality scores provided with the WGS data to give GARLIC a per-genotype estimation of error. Using GARLIC’s rule of thumb parameter estimation, we chose analysis window sizes of 290 SNPs, 250 SNPs, and 210 SNPs and overlap fractions of 0.3688, 0.3553, and 0.3528 for the AA, PR, and MX populations, respectively. GARLIC chose LOD score cutoffs of −47.5169, −70.1977, and −60.9221 for the AA, PR, and MX populations, respectively. Using a three-component Gaussian mixture model, GARLIC determined class A/B and class B/C size boundaries as 38,389 bps and 142,925 bps for AA; as 50,618 bps and 230,079 bps for PR; and as 46,979 bps and 217,054 bps for MX.
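Given the population-specific boundaries reported above, assigning a called ROH to a size class reduces to two comparisons. The sketch below is illustrative and does not reproduce GARLIC's internals: treating a length exactly equal to a boundary as belonging to the larger class is an assumption, as are the names `BOUNDARIES` and `roh_class`.

```python
# Class A/B and B/C boundaries (bp) reported in the text for each population.
BOUNDARIES = {
    "AA": (38_389, 142_925),
    "PR": (50_618, 230_079),
    "MX": (46_979, 217_054),
}

def roh_class(length_bp, population):
    """Return "A" (short), "B" (medium), or "C" (long) for one ROH."""
    ab_boundary, bc_boundary = BOUNDARIES[population]
    if length_bp < ab_boundary:
        return "A"
    if length_bp < bc_boundary:
        return "B"
    return "C"

classes = [roh_class(30_000, "AA"), roh_class(100_000, "AA"),
           roh_class(200_000, "PR"), roh_class(300_000, "MX")]
```

Because the boundaries differ by population, the same physical length can fall in different classes: a 200 kb ROH is long in AA but medium in PR and MX.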
Calling Deleterious Alleles
Using the Whole Genome Sequencing Annotation (WGSA) pipeline [118] to generate annotation data, we extracted PolyPhen 2 [103], SIFT [104, 105], Provean [106], and GERP [107] scores for deleteriousness, as well as ancestral allele state and synonymous annotations, for all mutations in coding regions.
PolyPhen 2 generates three deleteriousness categories: Probably Damaging, Possibly Damaging, and Benign. If a mutation has more than one PolyPhen 2 classification (e.g. Benign and Probably Damaging), it is reassigned to have only the most damaging category of the group. All mutations that have a PolyPhen 2 prediction or that are synonymous are then pooled into two separate categories: “damaging” and “benign.” All Probably Damaging or Possibly Damaging mutations are pooled into the “damaging” category, and all Benign and synonymous mutations are pooled into the “benign” category.
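The reassignment and pooling rule for PolyPhen 2 can be sketched as below. The category strings follow the text; the severity ranking table and the function name are illustrative, and synonymous mutations, which carry no PolyPhen 2 prediction, would be assigned to “benign” separately.

```python
# Higher number means more damaging.
SEVERITY = {"Benign": 0, "Possibly Damaging": 1, "Probably Damaging": 2}

def pooled_polyphen_class(predictions):
    """Keep the most damaging prediction, then pool into two classes."""
    worst = max(predictions, key=SEVERITY.__getitem__)
    return "benign" if worst == "Benign" else "damaging"

pooled = [
    pooled_polyphen_class(["Benign", "Probably Damaging"]),
    pooled_polyphen_class(["Benign"]),
    pooled_polyphen_class(["Possibly Damaging"]),
]
```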
SIFT generates two deleteriousness categories, Intolerant and Tolerant, which we relabel “damaging” and “benign.” If a mutation has more than one SIFT classification, it is reassigned to have only the most damaging category of the group.
Provean generates two deleteriousness categories, Deleterious and Neutral, which we relabel “damaging” and “benign.” If a mutation has more than one Provean classification, it is reassigned to have only the most damaging category of the group.
GERP generates a numerical score at a given locus where a higher score indicates more deleteriousness for a derived allele at that locus. Here we focus on derived alleles that are very likely to be deleterious and combine all derived mutations at sites with GERP ≥ 6 into the category “damaging.” We form our “benign” category with all derived mutations with GERP ≤ 2.
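The GERP thresholds above translate into a three-way outcome, since derived mutations with intermediate scores belong to neither category. In this sketch the function name is hypothetical, and returning `None` for intermediate scores is a choice made for the example.

```python
def gerp_category(score, damaging_min=6.0, benign_max=2.0):
    """Classify a derived mutation by the GERP score at its locus."""
    if score >= damaging_min:
        return "damaging"
    if score <= benign_max:
        return "benign"
    return None  # intermediate scores are left unclassified

labels = [gerp_category(s) for s in (6.0, 2.0, 4.0, -1.5)]
```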
References
- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
- 9.
- 10.
- 11.
- 12.
- 13.
- 14.
- 15.
- 16.
- 17.
- 18.
- 19.
- 20.
- 21.
- 22.
- 23.
- 24.
- 25.
- 26.
- 27.
- 28.
- 29.
- 30.
- 31.
- 32.
- 33.
- 34.
- 35.
- 36.
- 37.
- 38.
- 39.
- 40.
- 41.
- 42.
- 43.
- 44.
- 45.
- 46.
- 47.
- 48.
- 49.
- 50.
- 51.
- 52.
- 53.
- 54.
- 55.
- 56.
- 57.
- 58.
- 59.
- 60.
- 61.
- 62.
- 63.
- 64.
- 65.
- 66.
- 67.
- 68.
- 69.
- 70.
- 71.
- 72.
- 73.
- 74.
- 75.
- 76.
- 77.
- 78.
- 79.
- 80.
- 81.
- 82.
- 83.
- 84.
- 85.
- 86.
- 87.
- 88.
- 89.
- 90.
- 91.
- 92.
- 93.
- 94.
- 95.
- 96.
- 97.
- 98.
- 99.
- 100.
- 101.
- 102.
- 103.
- 104.
- 105.
- 106.
- 107.
- 108.
- 109.
- 110.
- 111.
- 112.
- 113.
- 114.
- 115.
- 116.
- 117.
- 118.