Abstract
Most population isolates examined to date were founded from a single ancestral population. Consequently, there is limited knowledge about the demographic history of admixed population isolates. Here we investigate genomic diversity of recently admixed population isolates from Costa Rica and Colombia and compare their diversity to a benchmark population isolate, the Finnish. These Latin American isolates originated during the 16th century from admixture between a few hundred European males and Amerindian females, with a limited contribution from African founders. We examine whole genome sequence data from 449 individuals, ascertained as families to build mutigenerational pedigrees, with a mean sequencing depth of coverage of approximately 24X. We find that Latin American isolates have increased genetic diversity relative to the Finnish. However, there is an increase in the amount of identity by descent (IBD) segments in the Latin American isolates relative to the Finnish. The increase in IBD segments is likely a consequence of a very recent and severe population bottleneck during the founding of the admixed population isolates. Furthermore, the proportion of the genome that falls within a long run of homozygosity (ROH) in Costa Rican and Colombian individuals was significantly greater than that in the Finnish, suggesting more recent consanguinity in the Latin American isolates relative to that seen in the Finnish. Lastly, we found that recent consanguinity increased the number of deleterious variants found in the homozygous state, which is relevant if deleterious variants are recessive. Our study suggests there is no single genetic signature of a population isolate.
Introduction
The use of population isolates to map Mendelian and complex diseases has been a key feature of medical genomics. In addition to experiencing the bottleneck involved with the migration out of Africa, some populations underwent subsequent bottlenecks and remained in relative seclusion afterward. These populations formed present-day isolates1. There are numerous benefits to use population isolates to map genes underlying disease. First, and perhaps the largest benefit, is the increased homogeneity of genomes in isolates when compared to outbred populations. Second, isolates experience greater genetic drift than the population from which they were founded. Drift can allow disease causing alleles to exist at an appreciable frequency in isolated populations. Third, isolates may be endogamous and the cryptic relatedness of individuals leads to an enrichment in prevalence of the phenotype of interest and an enrichment of homozygous disease variants1–4. Lastly, isolates may also have increased cultural and environmental homogeneity, resulting in a reduction of variability in phenotype due to non-genetic sources.
Generally, the genomes of population isolates are thought to exhibit several hallmark features of genetic diversity. Due to bottlenecks associated with their founding, it is thought that isolates should carry lower levels of genetic diversity and lower haplotype diversity than closely related non-isolated populations. Drift experienced by isolates is magnified by small population size, which generates more linkage disequilibrium (LD) than in non-isolated populations. In addition to increased LD, individuals from isolated populations tend to share more regions of the genome identical by descent (IBD) due to small population sizes. Further, due to the isolation after founding and recent mating practices, isolates may have larger regions of the genome found in runs of homozygosity (ROHs) as a result of recent inbreeding. Lastly, bottlenecks and inbreeding should impact patterns of deleterious variation5–7. Consequently, one would predict that individuals from isolates will have fewer segregating sites, and the remaining deleterious variants will be segregating at a higher frequency8. Indeed, genomic studies over the last decade have documented several of these signatures2,9,10. However, it is known that not all isolates share the same demographic history. Therefore, it is essential that we understand how the factors shaping genetic variation in a population, are influenced by the unique demographic history of the population.
One archetypal human population isolate with a demography that has been extensively studied is the Finnish 2,11–13. Finland was populated through two separate major migrations. The first wave originated 4000 years ago from the west, and the second wave originated from the southern shores approximately 2000 years ago. There was a subsequent internal migration and expansion around the 16th century. Briefly, the small number of founders, relative isolation, serial bottlenecks, and recent expansion in Finland has allowed drift to play a large role in shaping the gene pool of this population. The aforementioned demographic history of Finland has led to an increase in the prevalence of rare heritable Mendelian diseases, which has made this population particularly fruitful for disease gene discovery studies12,14. Most of the initial disease gene discovery studies in Finland exploited LD mapping in affected families and well curated genealogical records to identify causal and candidate variants12. More recently, it has been possible to apply population-based linkage analyses to identify disease associated variants as an alternative to GWAS (unpublished data)15 due to the availability of whole genome sequence data in conjunction with extensive electronic health records.
A number of studies have shown that disease detection power can be improved by studying population isolates other than the Finnish10,16–18. For example, the Greenlandic Inuit (GI) experienced an extreme bottleneck which caused a depletion of rare variants and segregating sites in their genome18. The remaining segregating variants are maintained at higher allele frequencies and a larger proportion of these SNPs are deleterious when compared to non-isolated populations. Another study on Southeast Asian populations showed similar results. Specifically, South East Asian populations have experienced more severe founder effects than the Finnish17, thus causing an excess of rare alleles associated with recessive disease. A study of European population isolates compared the isolates with the closest non-isolated population from similar geographic regions10, and found that the total number of segregating sites was depleted across all isolates relative to the comparison non-isolate. Of the sites that were segregating in isolates, between ~30,000-122,000 sites existed at an appreciable frequency (MAF > 5.6%), while remaining rare (MAF < 1.4%) in all of the non-isolate population samples. These variants could serve as candidate markers in genome-wide association studies (GWAS) for novel associations and included SNPs that had been previously associated with cardio-metabolic traits.
As previously mentioned, population isolates tend to have a depletion of segregating sites and an increased number of homozygous sites in their genomes. Pemberton and colleagues19,20 demonstrated that levels of homozygosity differ across the globe, and that long ROH, 2Mb or greater, are strongly influenced by very recent demography. A study involving multiple Jewish isolates showed a link between historic consanguinity and the amount of long ROH in the genome21. The same correlation between long ROH and parental consanguinity was also observed in Middle Eastern isolates4. In another study, authors observed an increased count of ROH and total length of the genome within an ROH in Greek isolates from the Pomak villages and the island of Crete relative to a non-isolated Greek cohort16. Lastly, a study in Sardinians showed there was sub-structure within the island when comparing the amount of long ROH across sampled populations22. Authors identified an enrichment of long ROH in sub-populations that had experienced recent endogamy, while sub-populations that had not experienced such isolation have ROH levels similar to that of non-Sardinian Italians.
While there have been many studies of genetic variation in population isolates, the studies described above have focused on populations where the founders all came from the same ancestral population. However, the founders of Latin American population isolates have come from distinct continental populations. We sampled individuals from mountainous regions of Costa Rica and Colombia where geographic barriers resulted in populations remaining isolated since their founding in the 16th and 17th centuries, until the mid-20th century23. Both groups share a similar demographic history, having originated primarily from admixture between a few hundred European males and Amerindian females, with a limited contribution from African founders. After the founding event, both populations experienced a subsequent bottleneck and then a recent expansion, within the last 300 years, the expansion increased the population size over 1000-fold since the initial founding event23. The effect that admixture has had on overall patterns of genetic variation in isolates remains elusive, and it is unclear whether these populations share the typical genomic signatures seen in population isolates. While the small founding population size could reduce diversity, because the Costa Rican and Colombian isolates were founded from multiple diverse populations, they could potentially have increased in diversity relative to other population isolates. Lastly, the impact of admixture on deleterious variation remains unclear.
To better understand patterns of genetic variation in admixed isolated populations, we compared the Colombian and Costa Rican population isolates to a benchmark isolate, the Finnish, as well as other 1000 Genomes Project populations24. We observe that relative to the Finnish, Latin American isolates have increased genetic diversity but an excess of IBD segments. Moreover, we detect an increase in the proportion of an individual’s genome that falls within a long ROH in Latin American isolates relative to all other sampled populations and an enrichment of deleterious variation within these long ROH. Demographic simulations indicate that the enrichment of long ROH is a consequence of recent inbreeding in Latin American isolates. We corroborate these results by leveraging extended pedigree data. Pedigree inbreeding coefficients explain approximately 21.8% of the amount of an individual’s genome within a long ROH. Next, we examine the relationship between the proportion of European, Native American, and African ancestry and the amount of the genome within an ROH, as well as the relationship to an individual’s pedigree inbreeding coefficient. To our knowledge, this is the first time long ROH have been examined in admixed isolated populations. Further, we examine demography across both recent and ancient timescales in these isolates. Our work sheds light on how the distinct demographic histories of population isolates affect both genetic diversity and the distribution of deleterious variation across the genome.
Methods
Pedigree Data for Costa Rican and Colombian Individuals
Our study included 10 Costa Rican (CR) and 12 Colombian (CO) multi-generational pedigrees ascertained to include individuals affected by Bipolar Disorder 1. More extensive details about the curation of pedigree data and clinical assessments of diagnosis can be found in Fears et al.25.
Identifying Unrelated Individuals
We defined unrelated individuals as those who are at most third-degree relatives. We chose this threshold of relatedness because the families from CR and CO are known to be cryptically related. We used KING26 to identify 30 unrelated individuals from CR and CO. 24 of the 30 unrelated individuals in the CO are founders in the pedigree and 15 of the 30 unrelated individuals in the CR are founders. The algorithm implemented in KING estimates familial relationships by modeling the genetic distance between a pair of individuals as a function of allele frequency and kinship coefficient, assuming that SNPs are in Hardy-Weinberg equilibrium.
We also used KING26 to identify 30 unrelated individuals from the following 1000 Genomes Project24 populations: Yoruba (YRI), CEPH-European (CEU), Finnish (FIN), Colombian (CLM), Peruvian (PEL), Puerto Rican (PUR), and Mexican from Los Angeles (MXL). We used these 30 unrelated individuals per population for all analyses unless otherwise stated (Supplementary Figure 1).
Genotype Data Processing
We generated a joint variant call file (VCF) containing single nucleotide polymorphisms (SNP) from two separate data sets. The first data set contained 210 whole genome sequences sampled from the aforementioned 1000 Genomes Project populations24. The second data set contained 449 whole genome sequences from Costa Rican and Colombian individuals. Variants in the second data set were called following the GATK best practices pipeline27 with the HaplotypeCaller of GATK. All multi-allelic SNVs and variants that failed Variant Quality Score Recalibration were removed. Genotypes with genotype quality score ≤ 20 were set to missing. Further quality control on variants was performed using a logistic regression model that was trained to predict the probability of each variant having good or poor sequencing quality. Individuals with poor sequencing quality and possible sample mix-ups were removed, and all sequenced individuals had high genotype concordance rate between whole genome sequences and genotypes from microarray data. All sequenced individuals had consistency between the reported sex and sex determined from X chromosome and also between empirical estimates of kinship and theoretical estimates. More information on sequencing and quality control procedures is discussed in Sul et al. 2018 (unpublished data).
We used the following protocol to merge these two datasets. First, we used guidelines from the 1000 Genomes Project strict mask to filter the Costa Rican and Colombian VCFs as well as the 1000 Genomes Project VCFs. Then, we used GATK to remove sites from both sets of VCFs that were not bi-allelic SNPs or monomorphic. Next, we merged the 1000 Genomes Project VCFs with the Costa Rican and Colombian VCFs into a single joint-VCF for each chromosome. We only used autosomes for our analyses. Lastly, we filtered the merged joint-VCF to only contain sites that were present in at least 90% of individuals. There were a total of 57,597,196 SNPs and 1,891,453,144 monomorphic sites in the final data set. We ensured that the merged data sets were comparable by examining the number of derived putatively neutral alleles across the 30 unrelated individuals in all sampled populations, and finding few differences between populations, which is consistent with theory8 (Supplementary Figure 2).
Calculating Genetic Diversity
We computed two measures of genetic diversity from sites called across all 30 unrelated individuals from each population: pi (π) and Watterson’s Theta (θw). The average number of pairwise differences per site (π) was calculated across the genome as: where n is the total number of chromosomes sampled, p is the frequency of a given allele, and L is the length in base pairs of the sampled region. Watterson’s Theta, was computed by counting the number of segregating sites and dividing by Watterson’s constant, or the n-1 harmonic number28.
Site Frequency Spectrum (SFS)
Site frequency spectra were generated using the 30 unrelated individuals from each population, SNPs with missing data were removed from these analyses. There was a total of 16 SNPs out of the 57,597,196 SNPs that were removed due to missing data.
Linkage Disequilibrium Decay
We calculated LD between pairs of SNPs for all unrelated individuals. First, we applied a filter to remove sites that were not at a frequency of at least 10% across all populations. Next, pairwise r2 values were calculated using VCFTools29. SNP pairs were then binned according to physical distance (bp) between each other and r2 was averaged within each bin.
Identifying Identity by Descent Segments
To detect regions of the genome that have shared IBD segments between pairs of individuals, we first removed singleton SNPs in each population since singletons are not informative about shared IBD. Then, we called IBD segments using IBDSeq30. IBDSeq is a likelihood-based method that is designed to detect IBD segments in unphased sequence data. We chose to use IBDSeq because other methods that require computational phasing could be biased when applied to Latin American population isolates, as they do not have a publicly available reference population to aid in phasing. We compared IBDSeq to two well-known phasing methods Beagle31 and GERMLINE32 to determine whether it was feasible to use IBDSeq on an admixed population (Supplementary Figure 3). Beagle33 produced the shortest IBD segments while GERMLINE produced the longest IBD segments. IBDSeq produced segments with a length distribution similar to what we observed in Beagle, though the average segment length was slightly larger, which we expected given that IBDSeq was created to call longer segments that would have previously been broken up when using Beagle for phasing30. We used the default parameters for IBDSeq.
Next, we filtered the pooled IBD segments to remove artifacts. First, we calculated the physical distance spanned by each IBD segment. Then, we totaled the number of SNPs that fell within each segment. We observed an appreciable number of IBD segments that were extremely long but sparsely covered by SNPs (Supplementary Figure 4). IBD segments were removed if the proportion of the IBD segment covered by SNPs was not within one standard deviation of the mean proportion covered across all IBD segments (Supplementary Figure 4). Strong deviations from the mean could indicate that the IBD segment spans a region of the genome with low mappability, and we are only calling the SNPs at the outer ends of the segment. Therefore, the true segment length might be much shorter than what is being calculated by IBDSeq. Lastly, we converted from physical distance to genetic distance using the deCODE genetic map34.
Enrichment analyses of IBD segments
To determine whether certain populations contain more IBD segments than others, we followed the IBD score procedure outlined in Nakatsuka and collegues17. A population’s IBD score was calculated by computing the total length of all IBD segments between 3 and 20 cM. The score difference is the difference between the query population’s IBD score and the Finnish IBD score. The score ratio is the ratio of each population’s IBD score relative to the Finnish IBD score. The significance of enrichment relative to the Finnish was evaluated using a permutation test for each population, where IBD segment length was held fixed and labels of the two populations were permuted. We recalculated the score on a total of 10,000 permutations to generate a null-distribution of scores for each isolate.
Estimating Effective Population Size
We used the output files from IBDSeq to estimate the recent effective population size from the 30 unrelated individuals from each sampled population. We estimated effective population size by using the default settings in IBDNe35. We set the minimal IBD segment length equal to 2cM since that is the suggested setting when using sequence data.
Identifying Runs of Homozygosity
Runs of homozygosity were identified for each individual using VCFTools, which implements the procedure from Auton et al. 200936. Next, we examined the number of callable sites that lie within each ROH. We found that there was a bi-modal distribution of coverage for ROH, where some ROH appeared to contain almost no callable sites, while others had much higher coverage. We only kept ROHs that were at least 2Mb in length, which we called long runs of homozygosity, and were at least 60% covered by callable sites. (Supplementary Figure 5).
Calculating Inbreeding Coefficients
SNP-based inbreeding coefficients were calculated using VCFTools29. VCFTools calculates the inbreeding coefficient F per individual using the equation , where O is the observed number of homozygotes, E is the expected number of homozygotes (given population allele frequency), and N is the total number of genotyped loci.
Pedigree-based inbreeding coefficients were computed using the R package kinship237.
Demographic Simulations
In order to investigate how aspects of the population history affect current day genetic diversity in Latin American isolate populations, we simulated genetic variation data using the forward simulation software SLiM 4.2.238. We simulated a sequence length of 10Mb under uniform recombination rate of 1×10−8 crossing-over events per chromosome per base position per generation and under a mutation rate of 1.5×10−8 mutations per chromosome per base position per generation. Every simulation contained intergenic, intronic, and exonic regions, but only nonsynonymous new mutations experienced natural selection in accordance with the distribution of selection coefficients estimated in Kim et al. 201739. Within coding sequences, we set nonsynonymous and synonymous mutations to occur at a ratio of 2.31:139,40. The chromosomal structure of each simulation was randomly generated, following the specification in the SLiM 4.2.2 manual (7.3), which is modeled after the distribution of intron and exon lengths in Deutsch and Long41.
We assumed an effective population size in the ancestral African population of 10,000 individuals, and a reduction in size to 2,000 individuals, starting 50,000 years ago, reflecting the colonization of the European, Asian, and American continents. The population then recovers to a size of 10,000 individuals 5,000 years ago. The colonization bottleneck is assumed to occur 500 years ago by an admixture event with a European population (70% admixture proportion) and is followed by an immediate reduction in population size to 1,000 individuals. The recent expansion in population size is modeled by an increase in population size to 10,000 individuals 200 years ago. We simulated data with recent inbreeding and without recent inbreeding. In the former case, inbreeding started at the time of the European colonization 500 years ago and continues until the present. Inbreeding is implemented with the “mateChoice” function in SLiM. Because SLiM’s pedigree track function is only valid for at most second-degree related individuals, 50% of the time, mating occurs randomly. However, in the remaining cases, mating occurs between close relatives with a relatedness coefficient bigger than 0.25. This produces levels of consanguinity similar to those seen empirically as measured by F (see Results). Finally, we sampled a total of 60 random individuals and calculated summary statistics on the sample data. The simulation script can be found on GitHub (see Web Resources).
Annotation of Variants
The ancestral allele was determined using the 6-primate EPO alignment (see Web Resources) and we restricted to only those sites called with the highest confidence. After filtering, 54,049,081 SNPs remained.
Subsequently, exonic SNPs were annotated using the SeattleSeq Annotation website (see Web Resources). A total of 693,301 SNPs were annotated as either nonsynonymous or synonymous. We further classified these sites as either putatively neutral or deleterious using Genomic Evolutionary Rate Profiling (GERP) scores42. A GERP score less than two was considered as putatively neutral and a GERP score greater than 4 was considered as putatively deleterious, totaling 404,302 classified SNPs.
Counting Deleterious Variants
We used three different statistics to count the number of deleterious mutations per individual. First, we tabulated the number of deleterious variants (the number of heterozygous plus the number homozygous derived genotypes). Second, we counted the total number of derived deleterious alleles (the number of heterozygous genotypes plus twice the number of homozygous derived genotypes). Third, we computed the total number of derived deleterious homozygous genotypes.
Testing for an enrichment of deleterious variation in ROHs
We were interested in whether there is an enrichment of nonsynonymous or loss-of-function mutations in ROH over non-ROH regions for the three different ways of counting deleterious variants outlined above. To account for differences in neutral variation, we standardized by synonymous variation, which is assumed to be neutral. Then, we calculated the ratio of nonsynonymous over synonymous variation in ROH regions divided by the ratio of nonsynonymous over synonymous variation outside of ROH. We computed significance using a permutation test, where the position of each SNP and its annotation as synonymous versus nonsynonymous was fixed and the positions of the vector of ROH annotations were randomly placed throughout the genome. Thus, the frequency distribution of synonymous and nonsynonymous SNPs, as well as the total amount of ROH and non-ROH annotations, is kept constant when compared to the unpermuted data. We recalculated the ratio for a total of 10,000 permutations to form a null-distribution of ratios and then computed significance.
Calculating Ancestry Proportions
We estimated genome-wide ancestry proportions in members of the CR and CO pedigrees using ADMIXTURE43 (v1.22). Then the genome-wide ancestry proportions from ADMIXTURE were used as the prior in local ancestry analysis using LAMP44. We generated estimates for all 838 pedigree members with SNP array genotype data. Detailed information on the SNP array data can be found in Pagani et al.45. The reference populations were the CEU (n=112) and YRI (n=113) from HapMap46,47, as well as 52 Native American samples from Central or South America. The Native American samples are the Chibchan-speaking subset of those used in Reich et al.48, selected to originate from geographical regions relevant to CR/CO and to have virtually no European or African admixture. In total the admixture analysis used 57,180 LD-pruned SNPs and 1115 individuals.
Accounting for Relatedness
We tested for correlations among several quantities computed for each individual in the Latin American population isolates. Because some of these individuals are closely related, the data points in our linear regression are no longer independent. Therefore, we implemented the R-package GenABEL49 to incorporate kinship when performing statistical tests for our correlations. We used the polygenic_hglm() function where the formula input was the equation for our linear model of interest and the kinship.matrix input was a kinship matrix computed from our pedigree computed using kinship237. Our input took the following form: (FPED ~ Length of genome in ROH, kin = kinshipMatrix, data = df).
Results
Genetic Variation in Population Isolates
We first compared levels of genetic diversity in a sample of 30 unrelated individuals across the 1000 Genomes populations24 and the CO and CR isolates. We split the genome into several different genomic regions and in each region summarized genetic variation using both the average number of pairwise differences (π) and Watterson’s theta (θw) (Figure 1A and B). Overall, we found differences in diversity across the functional category of sequence studied in all populations, with coding regions exhibiting the lowest diversity and intergenic regions the highest. These patterns are consistent with the role of purifying selection affecting coding diversity39. However, if we look genome-wide or focus on intronic regions we see intermediate levels of diversity (Supplementary Table 1 and 2). We suspect that these categories are more strongly influenced by linked selection50–52.
As we are interested in the role of demography in shaping genetic diversity, we focused on comparisons of intergenic levels of diversity as those are most likely to be neutrally evolving (Figure 1A and B). Overall, the YRI had the highest level of diversity (π ≈ 0.0010; θw ≈ 0.0012) (Supplementary Table 1 and 2). The European populations (CEU and FIN) had lower levels of diversity. The CEU and FIN had similar levels of π (approximately 0.0004), despite the FIN being considered an isolated population. However, the FIN had reduced numbers of SNPs as reflected by lower values of θw (CEU ≈ 0.0008 & FIN ≈ 0.0008). The CO and CR had levels of diversity similar to that of several other Latin American populations in the 1000 Genomes Project (CLM and MXL). We found no clear pattern of the population isolates (FIN, CO, CR) having lower diversity than their most similar non-isolated population. Instead, diversity levels tended to be higher across all the sampled Latin American populations (CLM, CO, CR, MXL, and PUR) when compared to the European populations. One exception to this pattern is the PEL population, who had the lowest neutral levels of diversity (π ≈ 0.0007; θw ≈ 0.0007).
Next, we examined the proportional site frequency spectrum (SFS; Figure 1C). Latin American populations had the highest proportion of singletons. The CO and CR had similar proportions of singletons when compared to other 1000 Genomes Project Latin American populations. Conversely, the FIN had the lowest proportion of singletons in comparison to all sampled populations. The depletion of singletons relative to common variation supports the presence of a stronger founder effect during the FIN population history13.
We also examined patterns of linkage disequilibrium (LD), since LD is affected by population size and recent bottlenecks53,54. Figure 1D shows the mean decay of r2 with physical distance over 2Mb intervals across the genome in each population. We found that the YRI had the lowest levels of LD for each bin of physical distance, and the PEL formed the upper bound of the LD decay curves. The remaining Latin American populations (PUR, MXL, CLM, CO, CR) clustered together, close to the YRI, while the CEU and FIN are shifted toward higher values, like those seen in the PEL.
The FIN were previously shown to have more extensive haplotype blocks in their genome in comparison to the Latin American isolates9. In line with these findings, we observed faster LD decay in the Latin American isolates relative to the FIN. When considering pairs of SNPs 150kb or more apart, rates of LD decay become quite similar across all the sampled populations. Analogous to other diversity statistics, LD in the CO and CR closely resembled those of non-isolated Latin American populations. Once again, we found there is no clear pattern of having lower diversity or more LD that holds across all the population isolates (FIN, CO, CR) when compared to their most similar non-isolated population.
Latin American isolates carry more IBD segments than Finnish
Next, we used IBD sharing between pairs of individuals to gain insight about more recent demographic events within populations (Figure 2). We compared the amount of IBD within each population by computing an IBD score. Each population’s IBD score was calculated by totaling the length of IBD segments between 3cM and 20cM. We expressed IBD scores for each population as the ratio of the IBD score for a given population relative to the IBD score in the FIN (Figure 2A). We also tabulated the total count of IBD segments for each population. The CEU showed the lowest number of both called IBD segments and the lowest IBD score relative to the FIN (p-value = 0.0001). Latin American populations formed the upper bounds of both total IBD segments called and IBD enrichment scores (Figure 2A). The PUR had the largest number of IBD segments (1402) and had a 2.1-fold increase in IBD score relative to the FIN (p-value < 1×10−4). The CO and CR isolates had a 1.8-fold and 2-fold increase in their IBD scores relative to the FIN (p-value < 1×10−4), as well as carrying more IBD segments than the FIN (Figure 2B and 2C). However, there were some Latin American populations that exhibited depletions in both IBD segments and IBD scores relative to the FIN. The MXL and PEL have the lowest number of IBD segments for the Latin American populations. Previous work has shown that a larger effective population size in admixed populations likely drove the depletion of IBD segments in these two Latin American populations55.
Inferring the Demographic History of Latin American Isolates
We next leveraged the patterns of IBD described above to estimate the effective population size using IBDNe35 on the 30 unrelated individuals from each population (Figure 3). The use of only 30 unrelated individuals caused limitations for accurate estimation of Ne (see Discussion), but the demographic history of the population was robust to the number of individuals used. First, we found that recent demography differs vastly between the European populations (FIN and CEU). In general, CEU experienced population expansions over much of their demographic history. It was only in the most recent generations that they experienced a decrease in Ne. The FIN, on the other hand, have experienced a long population decline since their founding, approximately 4000 years ago, followed by a recent population expansion.
When analyzing the Latin American isolates, we detected a recent bottleneck, approximately 500 years ago (Figure 3). This bottleneck could correspond to the recorded bottleneck that followed the founding of these populations, and it appears to be much shorter and less severe than the bottleneck seen in the FIN. The strength and duration of bottlenecks varied across each of the Latin American populations. For example, we observed a more severe bottleneck in the CR, CO, CLM, and PUR than in PEL or MXL. However, we detected a subsequent period of growth across all populations following the bottleneck. The rate of growth differed across each population, and the PEL appeared to be growing at a much more rapid rate than any of the other Latin American populations.
Exploring Recent Consanguinity
Isolated populations may have experienced recent consanguinity. To test for this, we began by examining SNP-based inbreeding coefficients (FSNP) (Supplementary Figure 6). YRI individuals had the lowest median inbreeding coefficients and the CO and CR isolates had the highest median inbreeding coefficients. Further, the CO and CR also had the highest maximum FSNP values in the entire sample of unrelated individuals from any population (Supplementary Figure 6). Median levels of FSNP in the CEU suggested that they are more inbred than the FIN, which may be a result of how 1000 Genomes samples were selected. The PEL had the largest variance in FSNP across any of the sampled populations.
Next, we examined patterns of long runs (>2Mb, see Methods) of homozygosity (ROH), since ROH have been linked to recent consanguinity20,21,56–58. The YRI and CEU had the lowest amount of their genome contained within an ROH (Figure 4A). The FIN had higher median amounts of their genome within an ROH in comparison to the CEU. Latin American isolates had the highest median amount of the genome contained within an ROH. Specifically, the CR had the highest median at 21.7 Mb. Further, the Latin American isolates also had the greatest variance in the amount of the genome contained within an ROH. For example, one of the CO individuals had approximately 230 Mb of her/his genome contained in long ROHs.
As expected, we found that the amount of the genome contained in a long ROH strongly correlated with an individual’s FSNP (CO: R2 = 0.8060, p –value = 1.1 × 10−11; CR: R2 = 0.7740, p-value = 9.5 × 10−11; FIN: R2 = 0.1288, p-value = 0.03) (Figure 4B-4D). Indeed, individuals with higher values of FSNP tended to have more of their genome within an ROH. Further, the individual with the highest FSNP (0.133) also had the largest amount of his/her genome in long ROHs (230Mb).
The total number of ROH segments per individual followed a similar pattern as the total amount of genome within an ROH (Supplementary Figure 7). For example, in populations with low values of FSNP, ROH segments were not frequent. One YRI individual and three CEU individuals carried a ROH >4Mb, whereas more than 50% of CO and CR individuals carried an ROH >4Mb. Additionally, the longest ROHs identified (>20MB), only occurred in Latin American populations, where there were the largest values of FSNP (Supplementary Figure 7).
Importantly, the FIN individuals had significantly fewer ROH segments than the CO and CR, and most individuals had an FSNP close to 0 (Figure 4). The Latin American isolates had the most ROH in comparison to any other sampled population, as well as the largest values of FSNP (Figure 4).
Determining the Mechanisms that Generate Runs of Homozygosity
In principle, ROHs can be generated either by recent consanguinity over the last few generations, or by older historical processes, such as bottlenecks19,56,58–60. Based on both historical data23 and inference from IBDNe analyses, Latin American population isolates show evidence of recent population bottlenecks. Therefore, we used two complementary strategies to test whether recent consanguinity or bottlenecks drove the observed increase in ROHs in the Latin American isolates. First, we used the extensive pedigree data for 449 sequenced individuals to calculate a pedigree inbreeding coefficient (FPED) for each individual (Figure 5). Most individuals had a FPED of 0. However, there were several individuals with values of FPED as high as 0.07 in CR and 0.06 in CO. We observed a significant correlation between FSNP and FPED (R2=0.1520 and p-value < 2 x10−16), even after accounting for the non-independence of individuals based on their kinship (Figure 5A; see Methods). These correlations suggest that the recent consanguinity captured within the last few generations in the pedigree was likely sufficient to drive the increase in ROHs in the CO and CR populations. FSNP was a substantially better predictor of the amount of an individual’s genome that falls within a ROH (R2 = 0.7540 and p-value < 2 × 10−16), than FPED (R2 = 0.2180 and p-value < 2 × 10−16) (Figure 5B and C) likely due to the fact that FSNP captured distant background relatedness within the population as well as the realized level of consanguinity, rather than the expected value61. Further, because the pedigrees were ascertained and analyzed separately, connections between pedigrees were not accounted for in FPED, but were likely captured by FSNP.
As a second approach to determine the mechanism driving the increase in ROHs in the CO and CR populations, we conducted forward in time demographic simulations. We simulated a 10Mb region under a demographic model that reflected changes in effective population size during the human expansion across the European, Asian and American continents, as well as the more recent bottleneck during the Spanish colonization about 500 years ago (Figure 5D; see Methods). We compared simulations with no inbreeding to simulations with recent inbreeding. Consanguinity in the populations was modeled to begin 500 years ago, and simulated individuals had an inbreeding coefficient of about 0.075. This level of inbreeding was comparable to the level of inbreeding in some of the CO and CR individuals, based on calculations using pedigree data (see Methods). Our simulations suggested that the recent population bottleneck caused by the Spanish colonization was not capable of generating the large amounts of the genome within an ROH (>2Mb) that we observed for some of the individuals (Figure 5D). Only when simulating recent inbreeding could levels of the genome in an ROH comparable to that we observed be generated. Thus, recent inbreeding was paramount for generating the long ROH that we observed in the CO and CR isolates.
Local Ancestry
Since the Latin American isolates originated from an admixture event between Native Americans, Africans, and Europeans, we tested for a correlation between FPED and the proportion of European, African, or Native American ancestry (Supplementary Figure 8). We used the entire sequenced Costa Rican and Colombian data set (n=449) for the local ancestry analyses and accounted for relatedness of individuals in all the following reported p-values (see Methods). We found that European ancestry was positively correlated with FPED (p-value = 0.0052) while Native American ancestry was negatively correlated with FPED (p-value = 0.0245). African ancestry was also negatively correlated with FPED (p-value = 0.0496).
Then, we asked if the proportion of ancestry was correlated with the amount of the genome within an ROH (Supplementary Figure 8). The correlation between ancestry and amount of the genome within an ROH followed the same trend as the correlation between ancestry and FPED. Native American ancestry and African ancestry are negatively correlated with the amount of the genome within a long ROH (p-value = 3.91 × 10−14 and p-value = 6.76 × 10−07, respectively). European ancestry was positively correlated with the amount of an individual’s genome within an ROH (p-value < 2 × 10−16).
Recent Consanguinity is Correlated with an Increase of Deleterious Variation
It is well known that demography impacts patterns of deleterious variation in populations5,8,20,62–66. Thus, we compared patterns of putatively deleterious variation in the CO and CR to those in the FIN. Variants were classified as putatively deleterious or putatively neutral using GERP scores (see Methods). Recall that we consider three ways of counting deleterious variants in the genome of an individual: first, counting the number of heterozygous genotypes plus twice the number of homozygous derived genotypes (i.e. the total number of derived deleterious alleles), second, counting the number of heterozygous and homozygous derived genotypes (counting variants), and third, counting only the number of homozygote derived genotypes (counting homozygotes). The first quantity is most relevant if deleterious alleles are additive, while the third is most relevant if they are recessive. First, we looked at absolute counts of derived deleterious variation across isolates (Supplementary Figure 9). Then, we used linear regression to test if there was a relationship between the amount of an individuals’ genome in an ROH and the number of nonsynonymous sites in the genome for each counting method (Figure 6).
The FIN carried approximately 1% more derived deleterious nonsynonymous alleles per individual than CO and CR (p-value = 0.0007; p-value = 0.0013). However, there was no significant difference in the number of putatively neutral synonymous derived alleles per individual. These results suggest that the difference seen for putatively deleterious variants is not driven by data artifacts (Supplementary Figure 9), and the FIN indeed have a slightly higher additive genetic load than the CO or CR. Turning to the number of variants per individual, FIN individuals carried significantly more deleterious nonsynonymous variants than the CR but not the CO (p-value = 0.0110). However, CO and CR did not differ significantly in the number of deleterious variants carried per individual (Supplementary Figure 9). When we examined neutral synonymous variants, CO had significantly more variants than either FIN or CR (p-value = 8.56-06; p-value = 0.0054, respectively). Finally, when counting the number of homozygous derived genotypes, we found that the FIN carried 3.3% more deleterious variants in the homozygous state per individual than CO but not the CR (p-value = 0.0003) (Supplementary Figure 9). Additionally, the FIN carried significantly more neutral homozygous genotypes per individual than either population (CO p-value = 1.01 × 10−05; CR p-value = 6.96 × 10−05). The increased deleterious and neutral variation in homozygous form is an expected consequence of the long-term bottleneck that the FIN experienced during their founding.
We next tested whether the amount of the genome in an individual contained within a ROH was correlated with the number of nonsynonymous mutations carried by the individual. Counting nonsynonymous (NS) or synonymous (SYN) allele copies did not show any correlation with the amount of an individuals’ genome that falls within an ROH for the CR or FIN (Figure 6A and D; Supplementary Figures 10-13). However, in the CO, as the amount of the genome within an ROH increased, individuals tended to carry more NS alleles, though this correlation was strongly driven by a single individual, who also had the highest FSNP and FPED (R2 = 0.2393; p-value = 0.0036; Supplementary Figure 10). Importantly, the number of SYN alleles per individual was not correlated with the amount of the genome in an ROH (p-value = 0.2261).
When counting variants per individual, we observed a significant negative correlation with the amount of an individuals’ genome that falls within an ROH in the Latin American isolates (Figure 6B and E; Supplementary Figures 10-12). The negative correlation is a result of heterozygous sites being lost when an ROH is formed due to inbreeding. Conversely, when counting homozygous genotypes per individual, we observed a significant positive correlation with the amount of an individual’s genome that falls within an ROH in both the Latin American isolates and FIN (Figure 6C and F; Supplementary Figures 10-13). Homozygous genotypes were the only statistic that correlated significantly with the amount of the genome in an ROH across all isolated population for both SYN and NS sites. We observed a stronger correlation between the number of NS homozygous genotypes and the amount of an individual’s genome within an ROH in the Latin American isolates (R2 = 0.5000 (CO) & R2 = 0.2165 (CR); p-value = 7.546-06(CO) and p-value = 0.0059(CR)) compared to the FIN (R2 = 0.1130 and p-value = 0.0389) (Supplementary Figures 10-13). This pattern exists because the majority of CO and CR individuals carried larger proportion of their genome within an ROH while the FIN individuals do not harbor many ROH.
We next asked whether there was an enrichment or depletion of NS variants relative to SYN variants within versus outside of an ROH using a permutation test on the three different counting approaches (see Methods). When variants or allele copies were counted, none of the populations produced significant results (Table 1). When homozygous genotypes were counted, ROHs in the MXL and CR were enriched for homozygous NS genotypes relative to SYN homozygous genotypes (p-value = 0.0052 and p-value = 0.0169) (Table 1). Additionally, if we pooled the CR and CO populations, we also observed a significant enrichment of deleterious variation within an ROH compared to non-ROH regions of the genome (p-value = 0.0011).
We tested whether FSNP was correlated with the amount of deleterious variation per individual. We only used isolates for these regressions, because we are particularly interested in how recent consanguinity affected deleterious variation in the genome. We observed the exact same pattern with FSNP as with ROH (Supplementary Figure 14). Briefly, counting NS or SYN allele copies did not show any correlation with FSNP for the CR or FIN, but there was a significant correlation with NS allele copies in CO (Supplementary Figure 14; Supplementary Figures 15-17). Counting NS and SYN variants per individual produced a significant negative correlation with FSNP in the Latin American isolates (Supplementary Figure 12; Supplementary Figures 15-17). Counting the number of NS and SYN homozygous genotypes per individual was positively correlated with FSNP in the both Latin American isolates and FIN (Supplementary Figure 14; Supplementary Figures 15-17). Again, counting homozygotes was the only method with significant results across all isolated populations for both SYN and NS variants. The ability to recapitulate the pattern we observed in ROH using FSNP was reassuring and adds further support to the strong relationship between recent consanguinity and ROH.
Lastly, because we had multi-generational pedigrees for the Latin American isolates, we examined the correlation between putatively deleterious variation and recent consanguinity as measured by (FPED). All the following reported p-values account for kinship (see Methods). When we pooled the CO and CR individuals together, we did not observe any relationship between counting derived deleterious allele copies and FPED after correcting for kinship (Figure 7A). Moreover, we observed a negative correlation between FPED and the number of deleterious variants per individual (R2 = 0.0375, p-value = 6.02 × 10−06). The number of neutral variants per individual was also negatively correlated with FPED (p-value = 2.26 × 10−10) (Figure 7B). Finally, we observed a positive correlation between FPED and derived deleterious homozygotes (R2 = 0.0575, p-value = 1.0 × 10−06) as well as between FPED and the number of neutral derived homozygotes per individual. (p-value = 1.03 × 10−08) (Figure 7C). These results suggest that recent consanguinity during the last few generations has increased the number of derived deleterious homozygous genotypes in these two populations.
Discussion
Here we present the first comprehensive study of genetic diversity, demographic history, identity-by-descent, runs of homozygosity, and deleterious mutations in multiple admixed isolated populations. We show that admixture sufficiently increases genetic diversity of the Colombian and Costa Rican isolates, such that each isolate has diversity levels comparable to a non-isolated population. However, we still observe characteristics in the Latin American isolates that are hallmarks of an archetypal isolate, such as: an excess of IBD segments, cryptic relatedness within the population, and an enrichment of long ROH. Further, we demonstrate that long ROHs contain an enrichment of deleterious variants carried in the homozygous state, which has potential implications for fitness and disease risk.
Taken together our results support historical data which states that a recent admixture event, within the last 500 years, founded the Colombian and Costa Rican population isolates. A bottleneck corresponding to the Spanish Settlement, followed the founding event and then each population has increased in size until the present day23,67. We see evidence of these processes in the inference of demography from IBD patterns. The IBDNe inference shows a severe population bottleneck occurring approximately 500 years ago, coinciding with the historical record23. Importantly, the bottleneck experienced in the Latin American isolates was not as prolonged as that experienced by the Finnish. Further, the Finnish bottleneck occurred thousands of years ago. The difference in bottleneck timescales likely accounts for some portion of the higher genetic diversity observed in Latin American population isolates in comparison to the Finnish. In other words, the bottlenecks captured by IBDNe in the Latin Americans are too recent to markedly impact levels of genetic diversity. Further, the admixture process experienced by the Latin American isolates could increase levels of genetic diversity, especially because some individuals have appreciable levels of African ancestry50. We see little difference in patterns of genetic variation in the 1000 Genomes Colombian samples and the Colombian sample studied in this project. The Latin American isolates occupy areas that were considered as being geographically isolated at the time of sampling – the Central Valley of Costa Rica and the department of Antioquia in Colombia23, while the 1000 Genomes CLM sample was taken from Medellín which is included within the Antioquia region68–71. Thus, there is likely some amount of shared demography between the 1000 Genome CLM and our isolated Colombia population. However, it is worth noting that the individuals in the Latin American isolates are from pedigrees ascertained for Bipolar Disorder 1, rather than a random sample from the area.
Our results beg the question, what constitutes a population isolate? For example, is it a requirement that population isolates have low genetic diversity relative to the source population? Under this definition, the Latin American population isolates would not qualify as population isolates. The bottleneck in the Costa Ricans and Colombians seems to have had little effect on their genetic diversity, as their diversity levels are similar to non-isolated Latin American populations. The Finnish, on the other hand, experienced a long-term bottleneck that has resulted in a depletion of segregating sites, and of the remaining segregating sites, there is an enrichment of deleterious variants relative to non-isolated populations2,13, and would clearly qualify as an isolate. However, if one measures isolation based on IBD we see that there is an enrichment of IBD segments in the Latin American isolates relative to the Finnish. Further, looking at ROH, Latin American individuals from population isolates have a larger burden of ROH than Finnish, thus increasing the chances of identifying more shared genomic regions in the Latin American isolates than the Finnish. By this metric, the Latin American population isolates would certainly qualify as a population isolate. Thus, both the Costa Rican and Colombian populations and the Finnish are isolates but in different ways. For example, the Costa Ricans and Colombians are historical isolates, meaning these populations are not currently isolated but they exhibit many traits of an isolate, whereas the Finnish are contemporary isolates, meaning the population is still isolated and is the archetypal isolate that one would imagine. Our work also suggests that isolated populations have distinct demographic histories that impact genetic variation in different ways; and it is critical that researchers study and quantify the consequences of demography in each population.
We find that Latin American isolates have the largest ROH burden in comparison to any other sampled population. Our work corroborate results from a recent review on ROH where authors state that populations with small Ne and recent consanguinity will harbor the largest amount of ROH72. Because previous research has shown a strong correlation between recent inbreeding, quantified by both FSNP and FPED, and long runs of homozygosity, we were particularly interested in the mechanism behind the generation of long ROH19–21,56–58,73. We used simulations to test which demographic scenarios could produce long ROH (Figure 5). These simulations and availability of extended pedigree data were crucial, because the FSNP metric can also be influenced by a recent bottleneck. Thus, having FPED available allowed us to test whether the correlation we observed between FSNP and ROH was a consequence of a recent bottleneck or recent consanguinity. If small population size or admixture was responsible for generating the ROHs, these processes would not be reflected in FPED. Thus, we would not expect to find a correlation between FPED and the amount of the genome in ROHs. The fact that we observe a correlation between FPED and the amount of the genome in ROH suggests that recent consanguinity (as measured by FPED) is related to the extent of long ROHs in the genome. Further, our simulations show that neither admixture nor a recent population bottleneck could generate the high levels of long ROH that are observed in some individuals. It was only when we incorporated inbreeding into the simulation that levels of ROH comparable to what we observed in our data were produced. Thus, both lines of evidence suggest that the Latin American population isolates have experienced more recent consanguinity than other population isolates, like the Finnish. Further, in Finland it has previously been shown that the frequency of consanguinity, due to first-cousin marriages, is quite low and the best predictors of these unions were socio-economic class and ethnicity, rather than geographic barriers or population density74. On the other hand, for the two Latin American isolates consanguinity could be a consequence of increased geographic barriers preventing movement of individuals over more dispersed areas. It is also important to point out that it is unclear the extent to which ascertaining individuals from large pedigrees may impact the number of ROHs in our sample. Thus, the finding of an increase in ROHs may not be generalizable to Colombian and Costa Rican populations as a whole. However, we observed a similar pattern of increased ROH in the CLM, which suggests that the pedigree ascertainment of the CO and CR may not be generating the increase in ROHs.
We also tested how recent consanguinity affects deleterious variation in the genome. When counting homozygous derived deleterious genotypes, we found a positive correlation between the number of nonsynonymous homozygous genotypes and the amount of an individual’s genome within an ROH (Figure 6). Further, we observed an enrichment for nonsynonymous homozygous derived genotypes relative to synonymous homozygous derived genotypes within ROHs versus the rest of the genome (Table 1). This enrichment can be a result of nonsynonymous mutations generally segregating at lower frequency and typically being carried as a single copy in an individual. When an ROH is formed, the chromosome that was carrying the mutation is copied, thus allowing the mutation to increase the number of homozygotes within the ROH19,20. Since long ROH are a product of recent consanguinity, and these populations have experienced recent consanguinity, we see a corresponding increase in the burden of deleterious variants in the genomes of Costa Rican and Colombian isolates. Since we are more likely to see deleterious variants in the homozygous form in areas of the genome that fall within an ROH, our work is particularly relevant for alleles associated with recessive diseases. Lastly, we provide a mechanism for how recent consanguinity can reduce fitness in natural populations75–77 Specifically, if gene-knockouts and deleterious mutations tend to be recessive40,78–82, as suggested by several studies, then recent consanguinity will increase the number of homozygous derived deleterious variants carried by an individual in a long ROH, thus leading to a reduction of fitness in the sampled population6.
Utilizing estimated ancestry proportions from across the genome, we tested for a correlation between an individual’s ancestry and the amount of their genome that falls within an ROH. To our knowledge, this is the first time that the relationship between proportion of ancestry and the amount of the genome within an ROH has been examined (Supplementary Figure 8). We found a positive correlation between the proportion of European ancestry and the amount of an individual’s genome within a run of homozygosity. These results are consistent with the Latin American isolates originating from a small number of European founders, which would decrease genetic diversity and increase homozygosity for those areas of the genome containing European haplotypes. We observed a negative correlation between Native American ancestry and the amount of the genome contained within an ROH (Supplementary Figure 8). This finding appears to be at odds with previous research19,72 that detected the opposite pattern. Some of this difference may be due to distinct sampling strategies of the Native American source population in our study compared to previous work. The reference Native American population we used was composed of Chibchan-speaking individuals from Reich et al.48. Chibchan-speaking populations inherited their Native American ancestry from admixture between Southern and Northern American lineages, the necessity of admixture was particularly apparent in the Cabecar of Costa Rica48. Because our reference Native American population is admixed, and Native American populations tend to be small, it is likely that drift has affected different alleles in source populations that formed the current Chibchan-speaking populations. The Chibchan speaking populations may have more diversity, fewer fixed homozygous sites, than previously sampled Native American populations which could explain the negative correlation we observed between ancestry and ROH.
While we were able to capture evidence of recent bottlenecks and expansions within Latin American isolates using IBDNe35 (Figure 3), our demographic inferences have some limitations. For example, the current estimates of Ne are unrealistically large. This inflation may be due to low sample size, since we only used 30 individuals, or it may be a result of applying IBDNe to admixed populations. IBDNe was designed to be applied to un-admixed, randomly mating populations. Thus, one area of future research could be exploring the influence of admixture on IBD and developing methods to infer demography using IBD patterns in admixed populations. Interestingly, in our study, the populations with the highest IBD scores were admixed (PUR, CO, CR, and CLM). Furthermore, because IBD segments may contain useful information for identifying regions of the genome that contain disease associated mutations, especially within individuals with the highest amounts of consanguinity, it may be useful to deconvolute ancestry for each segment when identifying disease associated mutations because disease prevalence may differ in each parental population.
Population isolates have frequently been used for mapping Mendelian disease genes17,83–88 and studying complex diseases16,89–95. Isolates are thought to be beneficial in comparison to non-isolated populations because of their increased homogeneity of the gene pool, disease causing alleles potentially existing at an appreciable frequency due to drift, possible enrichment in prevalence of the phenotype of interest1–4, and a likely reduction in the variability of phenotypes. Our work shows that the genetic diversity and genomic background of population isolates varies immensely. Therefore, it is imperative that we understand the unique genetic diversity belonging to each population isolate. Researchers should adapt their study design to integrate the demographic history of the population, to better leverage the power of the unique genetic features of the population of interest. For example, if we knew beforehand that there was a history of consanguineous unions within the study population, researchers could target ROH for disease mapping. This method has previously been used to identify human knockouts, discover novel loci associated with disease, and understand gene function95–98. Further, populations with large amounts of ROH could help us better understand disease architecture because ROH may harbor more recessive mutations that do not have full penetrance, since the prevalence of recessively acting variants in ROH is enriched relative to non-ROH portions of the genome. Most importantly, our work highlights the importance of understanding the demographic history of isolated populations, as differences in demographic history will greatly impact patterns of genetic variation in isolates.
Supplemental Data
Supplemental data include seventeen figures and two tables.
Conflicts of Interest:
Web Resources
6-primate EPO alignment: ftp://ftp.ensembl.org/pub/release-75/fasta/ancestral_alleles/
ADMIXTURE: https://www.genetics.ucla.edu/software/admixture/download.html
IBDSeq: http://faculty.washington.edu/browning/ibdseq.html
IBDNe: http://faculty.washington.edu/browning/ibdne.html#download
KING (version 2.1): http://people.virginia.edu/~wc9c/KING/history.htm
LAMP: http://lamp.icsi.berkeley.edu/lamp/
PLINK: http://www.cog-genomics.org/plink2
ROH simulation script: https://github.com/LohmuellerLab/ROH_Latin_American_Isolates
SeattleSeq Annotation website: http://snp.gs.washington.edu/SeattleSeqAnnotation138/
SLIM: https://messerlab.org/slim/
VCFTools: http://vcftools.sourceforge.net/downloads.html
GATK: https://software.broadinstitute.org/gatk/download/archive
Acknowledgements
The authors would like to acknowledge Charleston Chiang, Jesse Garcia, Malika Kumar, and Sonya McKeown for contributing their time and thoughtful discussion. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant Numbers DGE-1144087 and DGE-1650604 awarded to JAM, as well as partial support from NIH grant MH095454 awarded to NF, and NIH grant R35 GM119856 awarded to KEL.
References
- 1.↵
- 2.↵
- 3.
- 4.↵
- 5.↵
- 6.↵
- 7.↵
- 8.↵
- 9.↵
- 10.↵
- 11.↵
- 12.↵
- 13.↵
- 14.↵
- 15.↵
- 16.↵
- 17.↵
- 18.↵
- 19.↵
- 20.↵
- 21.↵
- 22.↵
- 23.↵
- 24.↵
- 25.↵
- 26.↵
- 27.↵
- 28.↵
- 29.↵
- 30.↵
- 31.↵
- 32.↵
- 33.↵
- 34.↵
- 35.↵
- 36.↵
- 37.↵
- 38.↵
- 39.↵
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.↵
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.
- 52.↵
- 53.↵
- 54.↵
- 55.↵
- 56.↵
- 57.
- 58.↵
- 59.
- 60.↵
- 61.↵
- 62.↵
- 63.
- 64.
- 65.
- 66.↵
- 67.↵
- 68.↵
- 69.
- 70.
- 71.↵
- 72.↵
- 73.↵
- 74.↵
- 75.↵
- 76.
- 77.↵
- 78.↵
- 79.
- 80.
- 81.
- 82.↵
- 83.↵
- 84.
- 85.
- 86.
- 87.
- 88.↵
- 89.↵
- 90.
- 91.
- 92.
- 93.
- 94.
- 95.↵
- 96.
- 97.
- 98.↵