Abstract
Balancing selection maintains advantageous diversity in populations through different mechanisms. While extensively explored from a theoretical perspective, an empirical understanding of its prevalence and targets lags behind our knowledge of positive selection. Here we describe a simple yet powerful statistic to detect signatures of long-term balancing selection (LTBS) based on the expectation that some types of LTBS result in an accumulation of polymorphic sites at moderate-to-intermediate frequencies. The Non-Central Deviation (NCD) quantifies the degree to which SNP frequencies within a window of a pre-defined size depart from deterministic expectations under balancing selection. The statistic can be implemented considering only polymorphisms (NCD1) or also including information on fixed differences (NCD2), and can detect LTBS under different frequencies of the balanced allele(s). Because of its simplicity, NCD can be applied to single loci or genomic data, and to populations with or without known demographic history. We show that, in humans, NCD1 and NCD2 have high power to detect long-term balancing selection, with NCD2 outperforming all existing methods. We applied NCD2 to genome-wide data from African and European human populations, and found that 0.6% of the analyzed windows show signatures of LTBS, corresponding to 0.8% of the base pairs and 1.6% of the SNPs in the analyzed genome. This suggests that, albeit not prevalent, LTBS affects the evolution of a sizable portion of the genome (overlapping ∼8% of protein-coding genes). These SNPs disproportionally overlap sites with protein-coding and amino-acid altering functions, but not putatively regulatory sites. Our catalog of candidates includes known targets of LTBS, but a majority of them have not been previously identified.
As expected, immune-related genes are among those with the strongest signatures, although most candidates are involved in other biological functions, suggesting that balancing selection potentially influences diverse human phenotypes.
Author Summary
With the availability of whole-genome sequences on a population level, genetic variation in humans has been queried for signatures of natural selection. Most of these efforts have focused on positive selection, which results in novel adaptations. Balancing selection, an important form of natural selection that maintains advantageous genetic variants within populations, sometimes for millions of years, has attracted less attention. This is despite the important effects that variants under balancing selection have on phenotypic diversity and susceptibility to disease, as shown by the most eminent target of balancing selection: the Major Histocompatibility Complex Locus (MHC, known as HLA in humans). We developed a statistic that identifies regions of the genome with signatures that are expected under balancing selection. This statistic has very high power to detect long-term balancing selection in humans, and it is simple enough to be used in a wide variety of species, having the potential to improve our understanding of balancing selection across taxonomic groups. When applied to human data, we find that long-term balancing selection has affected genomic regions that define the sequence of protein-coding genes more often than their regulation, and has targeted genes involved in immunity and a diversity of additional biological functions.
Introduction
Balancing selection refers to a class of selective mechanisms that maintains advantageous genetic diversity in populations. Decades of research have established HLA genes as a prime example of balancing selection [1,2], with thousands of alleles segregating in humans [3], extensive support for functional effects of these polymorphisms (e.g. [4,5]), and various well-documented cases of association between selected alleles and disease susceptibility (e.g. [6,7]). The catalog of well-understood non-HLA targets of balancing selection in humans remains small, but includes genes associated with phenotypes such as auto-immune diseases [8,9], resistance to malaria [10], HIV infection [11] or susceptibility to polycystic ovary syndrome [12]. Thus, besides historically influencing individual fitness, balanced polymorphisms shape current phenotypic diversity and susceptibility to disease.
Balancing selection encompasses several mechanisms (reviewed in [13–15]). These include heterozygote advantage (or overdominance), frequency-dependent selection [16,17], selective pressures that fluctuate in time [18,19] or in space in panmictic populations [20,21], and some cases of pleiotropy [22]. For overdominance, pleiotropy, and some instances of spatially variable selection, a stable equilibrium can be reached [16]. For other mechanisms, the frequency of the selected allele can change in time without reaching a stable equilibrium. Regardless of the mechanism, long-term balancing selection (LTBS) has the potential to leave identifiable signatures in genomic data. These include a local site-frequency spectrum (SFS) with an excess of alleles at intermediate frequencies and, when selection is old enough, an excess of polymorphisms relative to substitutions (reviewed in [15]). In some cases, very ancient balancing selection can maintain trans-species polymorphisms in sister species [23,24], while transient heterozygote advantage and other types of recent balancing selection [25] will result in signatures difficult to distinguish from incomplete, recent selective sweeps [15].
While balancing selection has been extensively explored from a theoretical perspective, an empirical understanding of its prevalence lags behind our knowledge of positive selection. This stems from technical difficulties in detecting balancing selection, as well as the perception that it may be a rare selective process [26]. In fact, few methods have been developed to identify its targets, and only a handful of studies have sought to uncover them genome-wide in humans [23,24,27–32]. Different approaches have been used to identify genes [28] or genomic regions [31] with an excess of polymorphisms and intermediate frequency alleles, while other studies have identified trans-species polymorphisms between humans and their closest living relatives (chimpanzees and bonobos) [23,24]. Overall, these studies suggested that balancing selection may act on a small portion of the genome, although the limited extent of data available (e.g., exome data [31], small sample size [28]), and stringency of the criteria (e.g., balanced polymorphisms predating human-chimpanzee divergence [23,24]) may underlie the paucity of detected regions.
Here, we developed two statistics that summarize, directly and in a simple way, the degree to which allele frequencies of SNPs in a genomic region deviate from those expected under balancing selection. We then used these statistics to test the null hypothesis of neutral evolution. We showed, through simulations, that one of our statistics outperforms existing methods under a realistic demographic scenario for human populations. We applied this statistic to genome-wide data from four human populations and used both outlier and simulation-based approaches to identify genomic regions bearing signatures of LTBS.
Results
The Non-Central Deviation (NCD) statistic
Background
Owing to linkage, the signature of long-term balancing selection extends to the genetic neighborhood of the selected variant(s); therefore, patterns of polymorphism and divergence in a genomic region can be used to infer whether it evolved under LTBS [13,21]. LTBS leaves two distinctive signatures in linked variation, when compared with neutral expectations. The first is an increase in the ratio of polymorphic to divergent sites: by reducing the probability of fixation of a variant, balancing selection increases the local time to the most recent common ancestor [33]. The HKA test is commonly used to detect this signature [34]. The second signature is an excess of alleles segregating at intermediate frequencies. In humans, the folded SFS (the frequency distribution of minor allele frequencies, MAF) is typically L-shaped, showing an excess of low-frequency alleles when compared to expectations under neutrality and demographic equilibrium. The abundance of rare alleles is further increased by recent population expansions [35], purifying selection and recent selective sweeps [36]. Regions under LTBS, on the other hand, can show a markedly different SFS, with proportionally more alleles at intermediate frequency (Fig 1A-B). Such a deviation in the SFS is the signature identified by classical neutrality tests such as Tajima’s D (TajD) and newer statistics such as MWU-high [37].
With heterozygote advantage, the frequency equilibrium (feq) depends on the relative fitness of each genotype [16]: under symmetric overdominance, i.e. where the two types of homozygotes have the same fitness, feq = 0.5; under asymmetric overdominance, where the fitness of the two homozygotes is different, feq ≠ 0.5 (S1 Note). Under frequency-dependent selection and fluctuating selection, while an equilibrium may not be reached (S1 Note), feq can be thought of as the frequency of the balanced polymorphism at the time of sampling.
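As a minimal illustration of how feq follows from genotype fitnesses under heterozygote advantage, the sketch below uses the standard one-locus, two-allele result. The parameterization (heterozygote fitness 1, homozygote fitnesses 1 − s_aa and 1 − s_bb) and the function name are assumptions made for illustration; S1 Note may use a different notation.

```python
def overdominance_equilibrium(s_aa: float, s_bb: float) -> float:
    """Deterministic equilibrium frequency of allele A under overdominance.

    Assumed fitnesses (illustrative parameterization, not taken from S1 Note):
        AA = 1 - s_aa, AB = 1, BB = 1 - s_bb
    The standard result is p* = s_bb / (s_aa + s_bb).
    """
    if s_aa <= 0 or s_bb <= 0:
        raise ValueError("both homozygotes must be less fit than the heterozygote")
    return s_bb / (s_aa + s_bb)


# Symmetric overdominance: both homozygotes pay the same cost, so feq = 0.5.
symmetric = overdominance_equilibrium(0.1, 0.1)
# Asymmetric overdominance: AA pays a larger cost, so A is held below 0.5.
asymmetric = overdominance_equilibrium(0.07, 0.03)
```

With these hypothetical coefficients the symmetric case gives feq = 0.5 and the asymmetric case gives feq of about 0.3, which matches the range of equilibrium frequencies explored in the power analyses.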
NCD statistic
In the tradition of neutrality tests analyzing the SFS directly (e.g. [37–39]), we propose and define the statistic “Non-Central Deviation” (NCD) which measures the degree to which the local SFS deviates from a pre-specified allele frequency (the target frequency, tf) in a genomic region. Under a model of balancing selection, tf can be thought of as the expected frequency of a balanced allele, with the NCD statistic quantifying how far the sampled SNP frequencies are from it. Because bi-allelic loci have complementary allele frequencies, and there is no prior expectation regarding whether ancestral or derived alleles should be maintained at higher frequency, we use the folded SFS (Fig 1B). NCD is defined as:
\[ \mathrm{NCD}(tf) = \sqrt{\frac{\sum_{i=1}^{n} (p_i - tf)^2}{n}} \qquad (1) \]
where i is the i-th informative site in a locus, p_i is the MAF for the i-th informative site, n is the number of informative sites, and tf is the target frequency with respect to which the deviations of the observed allele frequencies are computed. Thus, NCD is a type of standard deviation that quantifies the dispersion of allelic frequencies from tf, rather than from the mean of the distribution. Low NCD values reflect a low deviation of the SFS from a pre-defined tf, as expected under LTBS (Fig 1C and S1 Note).
We propose two NCD implementations. NCD1 uses only polymorphic sites as informative sites, and NCD2 also includes the number of fixed differences (FDs) relative to an outgroup species (i.e., all informative sites, ISs = SNPs + FDs, are used to compute the statistic). In NCD2, FDs are considered ISs with MAF = 0; thus, the greater the number of FDs, the larger the NCD2 and the weaker the support for LTBS. From equation 1 it follows that the maximum value for NCD2(tf) is the tf itself (for tf ≥ 0.25, see S1 Note), which occurs when there are no SNPs and the number of FDs ≥ 1. The maximum NCD1 value approaches – but never reaches – tf when all SNPs are singletons. The minimum value for both NCD1 and NCD2 is 0, when all SNPs segregate at tf and, in the case of NCD2, the number of FDs = 0 (S1 and S2 Figs).
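The two implementations can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' released code: the function names are hypothetical, and the input is assumed to be a list of minor allele frequencies (already folded) for the SNPs in one window, plus a count of fixed differences for NCD2.

```python
import math
from typing import Sequence


def ncd(mafs: Sequence[float], tf: float = 0.5, n_fixed_differences: int = 0) -> float:
    """Root-mean-square deviation of informative-site MAFs from the target frequency tf."""
    # Fixed differences enter as informative sites with MAF = 0.
    values = list(mafs) + [0.0] * n_fixed_differences
    if not values:
        raise ValueError("at least one informative site is required")
    return math.sqrt(sum((p - tf) ** 2 for p in values) / len(values))


def ncd1(snp_mafs: Sequence[float], tf: float = 0.5) -> float:
    """NCD1: polymorphic sites only."""
    return ncd(snp_mafs, tf, 0)


def ncd2(snp_mafs: Sequence[float], n_fixed_differences: int, tf: float = 0.5) -> float:
    """NCD2: polymorphic sites plus fixed differences relative to an outgroup."""
    return ncd(snp_mafs, tf, n_fixed_differences)


window_mafs = [0.48, 0.50, 0.45, 0.10]          # hypothetical SNP MAFs in one window
without_fds = ncd1(window_mafs, tf=0.5)
with_fds = ncd2(window_mafs, n_fixed_differences=3, tf=0.5)
```

In this toy window, adding three fixed differences raises the statistic (with_fds > without_fds), which mirrors the statement that more FDs weaken the support for LTBS. A window with no SNPs and at least one FD returns tf itself, and a window whose SNPs all sit at tf with no FDs returns 0.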
Power of NCD to detect LTBS
We evaluated the sensitivity and specificity of NCD1 and NCD2 by benchmarking their performance using simulations. Specifically, we considered demographic scenarios inferred for African, European, and Asian human populations, and simulated sequences evolving both under neutrality and LTBS using an overdominance model. We explored the influence of parameters that can affect the power of NCD statistics: time since onset of balancing selection (Tbs), frequency equilibrium defined by selection coefficients (feq), demographic history of the sampled population, tf used in NCD calculation, length of the genomic region analyzed (L) and implementation (NCD1 or NCD2). Box 1 summarizes nomenclature used throughout the text.
List of Abbreviations
LTBS, long-term balancing selection.
MAF, minor allele frequency.
SFS, site-frequency spectrum.
FD, fixed differences (between ingroup and outgroup species).
IS, informative sites (polymorphic sites in the ingroup species plus fixed differences between ingroup and outgroup species).
feq, deterministic equilibrium frequency expected under balancing selection as defined by the selection coefficients.
tf, target frequency: the frequency used in NCD as the value to which queried allele frequencies are compared.
NCD statistics, non-central deviation statistics, with two implementations, NCD1 and NCD2.
NCD1, measures the average departure between polymorphic allele frequencies and a pre-determined frequency (tf). NCD1(tf) is NCD1 for that given tf.
NCD2, measures the average departure between allele frequencies and a pre-determined frequency (tf) considering both polymorphisms and fixed differences with an outgroup. NCD2(tf) is NCD2 for that given tf.
NCD(tf), refers to the average of NCD1(tf) and NCD2(tf).
For simplicity, we averaged power estimates across NCD implementations (NCD being the average of NCD1 and NCD2), African and European demographic models (Asian populations were not considered, see below and S2 Note), L and Tbs (Methods). These averages are helpful in that they reflect the general changes in power driven by individual parameters. Nevertheless, because they often include conditions for which power is low, they underestimate the power the test can reach under each condition. The complete set of power results is presented in S1 Table, and some key points are discussed below.
Time since the onset of balancing selection (Tbs) and sequence length
Signatures of LTBS are expected to be stronger for longer Tbs, because time to the most recent common ancestor is older and there will have been more time for linked mutations to accumulate and reach intermediate frequencies. We simulated sequences with variable Tbs (1, 3, 5 million years, mya). For simplicity, here we only discuss cases where tf = feq, although this condition is relaxed in later sections. Power to detect LTBS with Tbs = 1 mya is low (NCD(0.5) = 0.32, averaged across populations and L values), and high for 3 (0.74) and 5 mya (0.83) (S3-S8 Figs, S1 Table), suggesting that NCD statistics are well powered to detect LTBS starting at least 3 mya. We thus focus subsequent power analyses exclusively on this timescale.
In the absence of epistasis, the long-term effects of recombination result in narrower signatures when Tbs is larger [23,24]. Accordingly, we find that, for example, power for NCD(0.5) (Tbs = 5 mya) is on average 10% higher for 3,000 bp loci than for 12,000 bp loci (S3-S8 Figs, S1 Table). In brief, our simulations show power is highest for windows of 3 kb centered on the selected site (S2 Note), and we report power results for this length henceforth.
Demography
Power is similar for samples simulated under African and European demographic histories (Table 1), but considerably lower under the Asian one (S1 Table, S3-S8 Figs), possibly due to lower Ne (S2 Note). While power estimates may be influenced by the particular demographic model used, we nevertheless focus on African and European populations, which by showing similar power allow fair comparisons between them.
Simulated and target frequencies
So far, we have only discussed cases where tf = feq, which is expected to favor the performance of NCD. Accordingly, under this condition NCD has high power: 0.91, 0.85, and 0.79 on average for feq = 0.5, 0.4, and 0.3, respectively (averaged across Tbs and populations, Table 1). However, since in practice there is no prior knowledge about the feq of balanced polymorphisms, we evaluate the power of NCD when feq ≠ tf. When feq = 0.5, average power is high for tf = 0.5 or 0.4 (above 0.85), but lower for tf = 0.3 (0.50, Table 1). Similar patterns are observed for other simulated feq (Table 1). Therefore, NCD statistics are overall well-powered both when feq is the same as tf and in some instances where feq ≠ tf. In any case, the closer tf is to feq, the higher the power, so when possible, it is desirable to perform tests across a range of tf.
Power at false positive rate (FPR) = 5%. Simulations with L = 3 kb. Tbs, time in mya since onset of balancing selection; feq, equilibrium frequency in the simulations. Power under additional conditions is presented in S1 Table.
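Power at a fixed false positive rate can be estimated from two sets of simulated NCD values, as in the sketch below. It assumes the usual construction: the critical value is the NCD value below which 5% of neutral simulations fall, and power is the fraction of simulations under selection at or below that value. The function name and the toy inputs are hypothetical, and the exact procedure used for Table 1 is described in Methods.

```python
from typing import Sequence


def power_at_fpr(neutral_values: Sequence[float],
                 selected_values: Sequence[float],
                 fpr: float = 0.05) -> float:
    """Fraction of selected simulations at or below the neutral critical value.

    Low NCD values support LTBS, so the rejection region is the lower tail.
    """
    ranked = sorted(neutral_values)
    k = int(len(ranked) * fpr)          # number of neutral simulations allowed to be rejected
    if k == 0:
        raise ValueError("too few neutral simulations for this false positive rate")
    threshold = ranked[k - 1]
    hits = sum(1 for v in selected_values if v <= threshold)
    return hits / len(selected_values)


# Toy example: 100 neutral values (0.00 ... 0.99) and four values simulated under selection.
neutral = [i / 100 for i in range(100)]
selected = [0.01, 0.02, 0.50, 0.03]
estimated_power = power_at_fpr(neutral, selected, fpr=0.05)
```

In the toy example three of the four selected values fall in the lowest 5% of the neutral distribution, so the estimated power is 0.75.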
NCD implementations and comparison to other methods
Power for NCD2 is greater than for NCD1 for all tf: feq = 0.5 (average power of 0.94 for NCD2(0.5) vs. 0.88 for NCD1(0.5), averaged across populations and Tbs; Table 1), feq = 0.4 (0.90 for NCD2(0.4) vs. 0.80 for NCD1(0.4)) and feq = 0.3 (0.86 for NCD2(0.3) vs. 0.73 for NCD1(0.3)) (Table 1, Fig 2). This illustrates the gain in power by incorporating FDs in the NCD statistic, which is also more powerful than combining NCD1 and HKA (S1 Table).
We compared the power of NCD to two statistics commonly used to detect balancing selection (TajD and HKA), a composite statistic of NCD1 and HKA (with the goal of quantifying the contribution of FD to NCD power), and a pair of composite likelihood-based measures (T1 and T2 [31]). The T2 statistic, similarly to NCD2, considers both the SFS and the ratio of polymorphisms to FD. Power results are summarized in Fig 2. When feq = 0.5, NCD2(0.5) has the highest power: for example, in Africa (Tbs = 5 mya, L = 3 kb) NCD2(0.5) power is 0.96 (the highest among the other tests is 0.94, for T2), but the difference in power is largest when feq departs from 0.5. For feq = 0.4, NCD2(0.4) power is 0.93 (compared to 0.90 for TajD and T2, and lower for the other tests). For feq = 0.3, NCD2(0.3) power is 0.93 (compared to 0.89 for T2 and lower for the other tests). These patterns are consistent in the African and European simulations (Fig 2, S10 Fig), where NCD2 has power greater than or comparable to that of other available methods to detect LTBS. When focusing on the tests that use only polymorphic sites, NCD1 has similar power to Tajima’s D when feq = 0.5, and it outperforms it when feq departs from 0.5 (S1 Table). Altogether, the advantage of NCD2 over classic neutrality tests is its high power, especially when feq departs from 0.5; the advantage over T2 is its simplicity of implementation and interpretation, and the fact that it can be run in the absence of a demographic model.
Recommendations based on power analyses
Overall, NCD performs very well in regions of 3 kb (Table 1, Fig 2) and similarly for African and European demographic scenarios. In fact, NCD2 outperforms all other methods tested (Fig 2, S10 Fig) and reaches very high power when tf = feq (always > 0.89 for 5 mya and always > 0.79 for 3 mya). While the feq of a putatively balanced allele is unknown, the simplicity of the NCD statistics makes it trivial to run for several tf values, allowing detection of balancing selection for a range of equilibrium frequencies. Also, the analysis can be run in sliding windows to ensure overlap with the narrow signatures of balancing selection. Alternatively, NCD could be computed for 3 kb windows centered on each SNP or IS. Because NCD2 outperforms NCD1, we used it for our scan of human populations; NCD1 is nevertheless a good choice when outgroup data is lacking.
Identifying signatures of LTBS
We aimed to identify regions of the human genome under LTBS. We chose NCD2(0.5), NCD2(0.4) and NCD2(0.3), which provide sets of candidate windows that are not fully overlapping (Table 1). We calculated the statistics for 3 kb windows (1.5 kb step size) and tested for significance using two complementary approaches: one testing all windows with respect to neutral expectations, and one identifying outlier windows in the empirical genomic distribution. We analyzed genome-wide data from two African (YRI: Yoruba in Ibadan, Nigeria; LWK: Luhya in Webuye, Kenya) and two European populations (GBR: British, England and Scotland; TSI: Toscani, Italy) [40]. We filtered the data for orthology with the chimpanzee genome (used as the outgroup) and implemented 5 additional filters to avoid technical artifacts (S13 Fig). Finally, we excluded windows with fewer than 10 IS in any of the populations since these showed a high variance in NCD2 due to noisy SFS (see empirical patterns in S18 Fig and neutral simulation patterns in S11 Fig).
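The window scan described above (3 kb windows, 1.5 kb step, at least 10 IS) can be sketched as follows for a single population. The function name, the input layout (a list of (position, MAF) pairs in which fixed differences carry MAF = 0) and the half-open window convention are assumptions for illustration; the sketch also omits the orthology and quality filters and the requirement that the IS threshold hold in every population.

```python
import math
from typing import List, Sequence, Tuple


def scan_ncd2(sites: Sequence[Tuple[int, float]], tf: float,
              start: int, end: int,
              window: int = 3000, step: int = 1500,
              min_is: int = 10) -> List[Tuple[int, int, int, float]]:
    """Compute NCD2 in sliding windows over [start, end).

    sites: (position, MAF) pairs for informative sites; fixed differences have MAF 0.0.
    Returns (window_start, window_end, n_informative_sites, ncd2) for retained windows.
    """
    results = []
    for w_start in range(start, end - window + 1, step):
        w_end = w_start + window
        mafs = [maf for pos, maf in sites if w_start <= pos < w_end]
        if len(mafs) < min_is:          # too few informative sites: noisy SFS, skip
            continue
        value = math.sqrt(sum((p - tf) ** 2 for p in mafs) / len(mafs))
        results.append((w_start, w_end, len(mafs), value))
    return results


# Toy region of 6 kb with one informative site every 100 bp, all at MAF 0.5.
toy_sites = [(i * 100, 0.5) for i in range(60)]
toy_scan = scan_ncd2(toy_sites, tf=0.5, start=0, end=6000)
```

The linear search over sites inside each window keeps the sketch short; for genome-scale data a sorted index or a running window would be the natural replacement.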
Simulation and empirical-based sets of windows
After all filters were implemented, we analyzed 1,657,989 windows (∼ 81% of the autosomal genome; S13 Fig), overlapping 18,633 protein-coding genes. We defined a p-value for each window as the quantile of its NCD2 value when compared to those from 10,000 neutral simulations under the inferred demographic history of each population and conditioned on the same number of IS. Depending on the population, between 6,226 and 6,854 (0.37-0.41%) of the scanned windows have a lower NCD2(0.5) value than any of the 10,000 neutral simulations (p < 0.0001). The proportions are similar for NCD2(0.4) (0.40-0.45%) and NCD2(0.3) (0.33-0.38%) (Table 2). We refer to these sets, whose patterns cannot be explained by the neutral model, as the significant windows. In each population, the union of significant windows considering all tf values spans, on average, 0.6% of the windows (Table 2) and 0.77% of the base pairs.
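The simulation-based p-value can be sketched as an empirical lower-tail quantile. The sketch assumes that neutral NCD2 values have already been simulated and grouped by number of informative sites (the hypothetical dictionary neutral_by_is); counting ties as "at least as extreme" is also an assumption, since the text does not specify how ties are handled.

```python
from typing import Dict, Sequence


def empirical_p_value(observed: float, neutral_values: Sequence[float]) -> float:
    """Fraction of neutral simulations with NCD2 at or below the observed value."""
    if not neutral_values:
        raise ValueError("no neutral simulations available")
    n_as_extreme = sum(1 for v in neutral_values if v <= observed)
    return n_as_extreme / len(neutral_values)


def window_p_value(observed: float, n_is: int,
                   neutral_by_is: Dict[int, Sequence[float]]) -> float:
    """p-value conditioned on the window's number of informative sites (IS)."""
    return empirical_p_value(observed, neutral_by_is[n_is])


# Toy neutral distribution for windows with 25 IS: 10,000 values from 0.40 upward.
neutral_by_is = {25: [0.40 + 0.00001 * i for i in range(10000)]}
p_extreme = window_p_value(0.30, 25, neutral_by_is)    # lower than every simulation
p_typical = window_p_value(0.45, 25, neutral_by_is)
```

A window whose NCD2 is lower than all 10,000 simulations gets an empirical p of 0 under this definition, which corresponds to the p < 0.0001 criterion used for significant windows.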
Due to our criterion, all significant windows had simulation-based p < 0.0001. In order to quantify how far the NCD2 value of each window is from neutral expectations, we defined Ztf-IS (Equation 2, see Methods) as the number of standard deviations a window’s NCD2 value lies from the neutral expectation. We defined as outlier windows those with the most extreme signatures of LTBS (in the 0.05% lower tail of the Ztf-IS distribution). This more conservative set contains 829 outlier windows for each population and tf value (Table 2), which cover only ∼ 0.09% of the base pairs analyzed and are largely included in the set of significant windows. Significant and outlier windows are collectively referred to as candidate windows.
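A standardized score of this kind, and the selection of the lower 0.05% tail, can be sketched as below. Equation 2 is given in Methods and is not reproduced here, so the form used (observed NCD2 minus the mean of neutral simulations with the same number of IS, divided by their standard deviation) is an assumption consistent with the verbal definition; the function names and the use of the sample standard deviation are likewise illustrative choices.

```python
import statistics
from typing import Dict, Hashable, List, Sequence


def z_score(observed: float, neutral_values: Sequence[float]) -> float:
    """Standard deviations between a window's NCD2 and the neutral mean (same tf and IS)."""
    mu = statistics.mean(neutral_values)
    sd = statistics.stdev(neutral_values)       # sample standard deviation (assumption)
    return (observed - mu) / sd


def lower_tail_outliers(z_by_window: Dict[Hashable, float],
                        fraction: float = 0.0005) -> List[Hashable]:
    """Windows in the lowest `fraction` of the Z distribution (most extreme LTBS signal)."""
    n_outliers = round(len(z_by_window) * fraction)
    ranked = sorted(z_by_window, key=z_by_window.get)
    return ranked[:n_outliers]


example_z = z_score(0.30, [0.40, 0.42, 0.44])
# With 1,657,989 scanned windows, a 0.05% tail corresponds to 829 windows.
n_expected_outliers = round(1657989 * 0.0005)
```

The last line reproduces the count of 829 outlier windows per population and tf value reported in the text.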
Significant and outlier genes and windows, see main text. U, union of windows considering three tf values. Total number of queried windows per population is 1,657,989. Union of all candidate genes is 2,348 (significant) and 402 (outlier).
Reliability of candidate windows
Significant windows are enriched both in polymorphic sites (Fig 3A-B) and intermediate-frequency alleles (Fig 3C-D), and the SFS shape reflects the tf for which they are significant (Fig 3C-D). Although expected, because these were the patterns used to identify these windows, this shows that significant windows are unusual in both signatures. These striking differences with respect to the background distribution, combined with the fact that neutral simulations do not have NCD2 values as low as those of the significant windows, preclude relaxation of selective constraint as an alternative explanation for their signatures [28].
To avoid technical artifacts among significant windows, we filtered out regions that are prone to mapping errors (S13 Fig). Also, we find that significant windows have similar coverage to the rest of the genome, i.e., they are not enriched in unannotated, polymorphic duplications (S14 Fig). We also examined whether these signatures could be driven by two biological mechanisms other than LTBS: archaic introgression into modern humans and ectopic gene conversion (among paralogs). These mechanisms can increase the number of polymorphic sites and (in some cases) shift the SFS towards intermediate frequency alleles (S5 Note). We find introgression is an unlikely confounding mechanism, since candidate windows are depleted in SNPs introgressed from Neanderthals (S17 Fig, S5 Note). Also, genes overlapped by significant windows are not predicted to be affected by ectopic gene conversion with neighboring paralogs to an unusually high degree, with the exception of olfactory receptor genes (S16 Fig, S5 Note). Thus, candidate windows represent a catalog of strong candidate targets of LTBS in human populations.
Assigned tf values
For both novel and previously known targets of LTBS, an advantage of NCD is that it provides an assigned tf for each window, which reflects the shape of its SFS. Our simulations suggest that the assigned tf is informative about the frequency of the site under balancing selection, so when a window was detected for more than one tf, we identified the tf value that minimizes Ztf-IS (S3 Note). On average ∼53% of the candidate windows are assigned to tf = 0.3, 27% to tf = 0.4 and 20% to tf = 0.5 (S5 Table).
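The assignment rule described above amounts to taking, for each window, the tf with the lowest Ztf-IS among the tf values for which the window was detected. A minimal sketch, with a hypothetical function name and made-up Z values:

```python
from typing import Dict


def assign_tf(z_by_tf: Dict[float, float]) -> float:
    """Return the target frequency whose Z score is lowest (strongest LTBS signal)."""
    if not z_by_tf:
        raise ValueError("window was not detected for any tf")
    return min(z_by_tf, key=z_by_tf.get)


# Hypothetical window detected for all three tf values.
assigned = assign_tf({0.5: -3.1, 0.4: -4.2, 0.3: -3.9})
```

Here the window would be assigned to tf = 0.4, because its NCD2 value is furthest below the neutral expectation for that target frequency.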
Non-random distribution across chromosomes
Candidate windows are not randomly distributed across the genome. Chromosome 6 is the most enriched for signatures of LTBS, contributing, for example, 10.2% of significant and 25% of outlier windows genome-wide for LWK while having only 6.4% of analyzed windows (S12 Fig, with qualitatively similar results for the other populations). This pattern can be explained by the MHC region (Fig 4A), rich in genes with well-supported evidence for LTBS. Specifically, 10 HLA genes are among the strongest candidates for balancing selection in all four populations, most of which have prior evidence of balancing selection (S6 Table, S4 and S6 Notes): HLA-B, HLA-C, HLA-DPA1, HLA-DPB1, HLA-DQA1, HLA-DQB1, HLA-DQB2, HLA-DRB1, HLA-DRB5, HLA-G [24,31,41–45].
Biological pathways influenced by LTBS
To gain insight into the biological pathways influenced by LTBS, we focused on protein-coding genes containing at least one candidate window (222-249 outlier and 1,404-1,616 significant genes per population), and investigated their annotations. They are disproportionally expressed in a number of tissues: lung, adipose tissue, adrenal tissue, kidney, and prostate (S4 Table).
Regarding functional categories, significant genes are overrepresented in 28 GO categories, 24 of which are shared by at least two populations and 18 by four populations. Thirteen categories are immune-related according to a list of 386 immune-related keywords from ImmPort (Methods). The more stringent sets of outlier genes are enriched for 28 GO categories (21 shared by all four populations), 18 of which are immune-related. Furthermore, in both sets several of the remaining enriched categories are directly related to antigen presentation although not classified as immune-related (e.g., “ER to golgi transport vesicle membrane”, “integral to membrane”). Among the non immune-related categories are “sarcolemma”, “epidermis development”, “keratin filament” and “negative regulation of blood coagulation” (S2 Table).
When classical HLA genes are removed from the analyses, only two categories remain enriched: “sarcolemma” (in YRI) and “epidermis development” (GBR), but the small set of genes per population hampers power. For the significant windows, “antigen processing and presentation of endogenous peptide antigen via MHC class I” remains significantly enriched (driven by TAP1, TAP2, ERAP1 and ERAP2; S2 Table). Also, significant windows are still enriched in categories related to the extracellular space – “extracellular regions”, “integral to membrane” (as in [15,28,31]) – and “keratin filament”. These categories are not immune-related per se, but they represent physical barriers to the invasion by pathogens. This indicates that LTBS maintains advantageous diversity in numerous defense-related genes other than classical HLA genes.
Overall, 33% of the outlier (and 31% of the significant) genes have at least one immune-related associated GO category, while only 24% of scanned genes do (see Methods). These results collectively suggest that immunity and defense are frequent targets of LTBS, although a large fraction of the candidates for LTBS have non-immune related functions or indirect connections with immunity hitherto unknown.
Functional annotation of SNPs in candidate windows
Because the identification of candidate windows is independent from functional annotation, we were able to test whether LTBS preferentially maintains SNPs at particular types of functional sites. To do so we investigated the overlap of candidate windows with different classes of functional annotations in the human genome, and tested the hypothesis of enrichment of certain classes of sites within our sets of candidate windows, when compared to sets of randomly sampled windows from the genome (S8 Table and Fig 5).
SNPs in outlier windows overlap disproportionally with protein-coding exons in all the populations (p ≤ 0.001, one-tail test; Fig 5, see Methods). The protein-coding enrichment is even stronger when considering only SNPs within genes, which both in outlier (p < 0.001) and significant windows (p ≤ 0.003) are strongly enriched in protein-coding exons (Fig 5). Within the protein-coding exons, outlier windows in Africa (p ≤ 0.022) and significant windows in all populations (p ≤ 0.037) are enriched in non-synonymous SNPs (Fig 5). These observations show that our candidate targets of LTBS tend to be enriched in exonic and potentially functional (amino-acid altering) SNPs.
Conversely, outlier and significant windows have no excess of SNPs annotated as regulatory (p ≥ 0.458 in all populations, Fig 5). When we explicitly compared protein-coding exons vs. regulatory sites by restricting our analysis to sites in these two categories, outlier windows have an excess of exonic SNPs (p ≤ 0.003). The same is true for significant windows (p ≤ 0.016; Fig 5). When only nonsynonymous and regulatory sites are considered, we see enrichment for LWK and YRI for the outlier windows (p ≤ 0.036, Fig 5) but not for the significant windows (p ≥ 0.458 for all populations, Fig 5), although the two analyses that consider nonsynonymous SNPs are likely underpowered due to low SNP counts (S8 Table). Finally, results using more detailed RegulomeDB annotations generally agree with the observation of lack of enrichment of regulatory sites in our candidate windows (p ≥ 0.121 for a one tail test for enrichment for RegulomeDB1+3 for SNPs with MAF ≥ 0.2) (S6 Note, S8 Table).
Although perhaps limited by the quality of the annotation of regulatory sites and the low power associated with small SNP counts for nonsynonymous variants, we do not have strong evidence that LTBS in human populations has preferentially shaped variation at sites with a role in gene expression regulation. These results suggest that LTBS preferentially affects exons and non-synonymous mutations.
Monoallelic expression
Genes with mono-allelic expression (MAE) – i.e., the random and mitotically stable choice of an active allele for a given locus – have been found to be enriched among those with signatures of balancing selection [46]. Our observations agree with this. For example, 64% and 62% of the outlier and significant genes shared by at least two populations have MAE status according to [46], compared to only 41% for genes without signatures of LTBS (p < 1.12 × 10^−6, Fisher's exact test, one-sided).
Overlap across populations
On average 86% of outlier windows in a given population are shared with another population (79% for significant windows), and 77% with another population within the same continent (66% for significant ones) (S19 Fig). The sharing is similar when tf are considered separately (S20 and S21 Figs). Consequently, there is also considerable overlap of candidate protein-coding genes across populations: e.g. in LWK (tf = 0.5), 76.6% of outlier genes are shared with any other population, and 66% are shared with YRI (89% and 77% for significant genes; Fig 4B). In fact, on average 44% of outlier genes for a given population are shared across all populations and 78.7% are shared by a same-continent population (50% and 77% for significant genes; S22 Fig).
Candidate genes in more than one population
Instances where signatures of LTBS are not shared between populations may result from changes in selective pressure, which may be important during fast, local adaptation [47]. Still, loci with signatures of LTBS across human populations are more likely to represent stable selection. We considered as “shared” those candidate protein-coding genes (from the union of candidate windows for all tf) that are shared by all populations (S6 Table). For the rest, we considered as “African” those shared between YRI and LWK (but neither or only one European population), and “European” those shared between GBR and TSI (but neither or only one African population). We note that these designations do not imply that genes referred to as “African” or “European” are putative targets of LTBS for only one continent (partially because there are some power differences between Africa and Europe, Table 1). The 79 African, 84 European and 102 shared outlier genes add up to 265 genes in total (∼1.5% of all queried genes) and the 458 African, 400 European, and 736 shared significant genes add up to 1,594 (∼8.5% of all queried genes; S6 Table).
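The three designations can be expressed as a small classification rule over the set of populations in which a gene is a candidate. The sketch below is illustrative: the function name and the "other" label for genes that fit none of the three designations are hypothetical, and the population codes are those used in the text.

```python
from typing import Iterable

AFRICAN = {"YRI", "LWK"}
EUROPEAN = {"GBR", "TSI"}


def classify_gene(candidate_populations: Iterable[str]) -> str:
    """Label a candidate gene by the populations in which it is a candidate."""
    pops = set(candidate_populations)
    if AFRICAN | EUROPEAN <= pops:
        return "shared"        # candidate in all four populations
    if AFRICAN <= pops:
        return "African"       # both African populations, at most one European
    if EUROPEAN <= pops:
        return "European"      # both European populations, at most one African
    return "other"             # hypothetical label for all remaining genes


labels = [
    classify_gene(["YRI", "LWK", "GBR", "TSI"]),
    classify_gene(["YRI", "LWK", "TSI"]),
    classify_gene(["GBR", "TSI"]),
    classify_gene(["YRI", "GBR"]),
]

# The reported counts are internally consistent:
outlier_total = 79 + 84 + 102          # African + European + shared outlier genes
significant_total = 458 + 400 + 736    # African + European + shared significant genes
```

Because the "shared" case is tested first, a gene reaching the "African" branch can be a candidate in at most one European population, and symmetrically for "European".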
Discussion
The targets of LTBS in the human genome
Using simulation-based and empirical outlier approaches, we uncovered windows with signatures of LTBS in humans. We showed that these windows are unlikely to be affected by technical artifacts or confounding biological processes other than LTBS, such as introgression from archaic hominins. On average, across populations, 0.6% of the windows in a population are significant: we never observe comparable or more extreme signatures of LTBS in 10,000 neutral simulations.
These windows contain on average 0.77% of the base-pairs and 1.6% of the SNPs in the genome per population, and although they amount to a low proportion of the genome, on average 7.9% of the protein-coding genes in a population contain at least one significant window (considering UTRs, introns and protein coding exons). For the more restrictive set of outlier windows (0.05% of windows), on average 2.1% of genes in each population show some evidence of selection.
In both sets, we identified many previously known targets of LTBS, but almost 70% of the outlier genes shared by same-continent populations (and 90% of the significant genes) are novel. Many of these candidate genes show strongest evidence for LTBS at tf values different from 0.5. This is expected, for instance, under asymmetric overdominance, and highlights the importance of considering selective regimes with different frequencies of the balanced polymorphism.
Functional properties of SNPs in candidate windows
In this study, we confirm cases where protein-coding regions are the likely target of selection, such as HLA-B and HLA-C [48], as well as cases where regulatory regions are probably targeted, such as HLA-G, UGT2B4, TRIM5 [45,49,50]. Overall, we found a strong enrichment of exonic SNPs, and a weaker enrichment of amino-acid-altering SNPs, in the candidate windows, suggesting an abundance of potentially functional SNPs within selected regions.
While LTBS has been proposed to play an important role in maintaining genetic diversity that affects gene expression [23,46], we find that regulatory SNPs are underrepresented within the candidate regions. This does not imply that there are no regulatory SNPs under balancing selection, but rather that with existing annotations (which are less precise for regulatory than protein-coding sites) they are not enriched within candidate targets. Overall, we show that LTBS plays an important role in maintaining diversity at the level of protein sequence. This is compatible with two scenarios: (a) direct selection on protein-coding sites or (b) accumulation of functional (including slightly deleterious) variants as a by-product of balancing selection. Importantly, we show that significant windows are also extreme in their high density of polymorphisms and have an SFS that is markedly different from neutral expectations, suggesting that relaxed purifying selection and background selection are unlikely to generate their signatures.
Overlap with previous studies
Whereas positive selection scans show a remarkably low overlap with respect to the genes they identify, with as few as 14% of protein-coding loci appearing in more than a single study [51], 34% of our outlier genes (11% of significant ones) had evidence of LTBS in at least one previous study [23,28,31]. Remarkably, 47% of the shared outliers across all four populations (17% of the shared significant ones) have been detected in at least one previous study, and the proportions are similar even when classical HLA genes are removed (39% and 16% overlap, respectively). This is a high degree of overlap, considering the differences in methods and datasets across studies. For example, we find 45% of the genes from [28] among the outliers (and 78% among the significant) and 10% and 38% of genes from [31] among outlier and significant genes, respectively. Still, the majority of our loci represent novel targets.
Properties of candidate genes
Below we briefly discuss the outlier genes (S6 Table), highlighting the variety of biological functions and known genetic associations (see Methods) potentially shaped by LTBS in humans.
Mono-allelic expression
In agreement with previous findings, we found a significant excess of MAE genes among our outlier candidates. This excess is not driven by HLA genes, which were filtered out in the study originally reporting MAE genes, and supports the claim for a biological link between MAE and balancing selection [46]. Heterozygosity in a MAE gene could lead to cell-to-cell heterogeneity within same-cell clusters, which could in turn be potentially advantageous [46,52], particularly in the case of cell-surface proteins. Some of these MAE genes found in our study, and not previously detected in scans for balancing selection, are involved in immunity/defense barriers (e.g. IL1RL1, IL18R1, FAM114A1, EDARADD, SIRPA, TAS2R14), oxygen transport and hypoxia (e.g. PRKCE, HBE1, HBG2, EGLN3), or reproduction (e.g. CLDN11).
Oxygen transport and response to hypoxia
Among the outlier genes with MAE we find members of the beta-globin cluster (HBE1 and HBG2, in the same window) that are involved in oxygen transport and have strong associations to hemoglobin levels and beta thalassemia [53], and EGLN3, a regulator of the NF-κB pathway that is significantly upregulated under hypoxia in anti-inflammatory macrophages [54] and also plays a role in skeletal muscle differentiation [55]. The encoded protein hydroxylates the product of EPAS1, a gene shown to harbor variants responsible for human adaptation to high altitude in Tibet [56]. Interestingly, in addition to having strong signatures of LTBS in all populations we analyzed, they also have evidence for recent positive selection in Andean (HBE1, HBG2) or Tibetan (HBG2) populations [57–59]. It is plausible that these genes have been under LTBS, and have undergone a shift in selective pressures in high-altitude populations (as in [47]), but further analyses are required to confirm this possibility. Another of our outlier genes, PRKCE, is also strongly associated to hemoglobin levels and red blood cell traits.
Immunological function and defense barriers
It has long been argued that genes of immune function are prime candidates for balancing selection. As expected, we detect several classical HLA genes with known signatures of LTBS. However, many non-HLA candidates from our set of outlier genes have immunological functions. For example, we confirm signatures of LTBS in the ABO locus, a well-known case of LTBS in humans [60] (S4 Note), and TRIM5, a gene with important antiviral function [49].
Among novel candidates of balancing selection, we find several genes involved in auto-immune disease. For example, IL1RL1-IL18R1 have strong associations to celiac disease and atopic dermatitis, an auto-immune disease [61]. HLA-DQB2 mediates superantigen activation of T cells [62] and is associated both to infectious (hepatitis B) and autoimmune diseases (e.g. lupus [63,64]). Two other significant genes for which there is prior evidence for LTBS [65,66], ERAP1 and ERAP2, are associated with ankylosing spondylitis and psoriasis (e.g. [67–69]). Finally, there are several associations to autoimmune disease and susceptibility to infections in the classical HLA genes that we identify. In brief, our results are consistent with the hypothesis that auto-immune disease is linked to natural selection favoring effective immune response against pathogens [9,70].
Another important aspect of defense is the avoidance of poisonous substances. As suggested previously by studies on polymorphism in PTC receptors [71,72], avoidance of bitterness might have been adaptive throughout human evolutionary history because several potentially harmful substances are bitter. The TAS2R14 gene encodes a bitter taste receptor, and in humans it has strong associations to taste perception of quinine and caffeine [73], is considered a promiscuous receptor [74–76], and is one of the few bitter taste receptors that binds a vast array of compounds, and for which no common structure has been found [75,77]. This entails diversity in the antigen binding portions of the receptors, which may be enhanced by balancing selection. Indeed, elevated dN/dS ratio was reported for a cluster of bitter taste receptors which includes TAS2R14 [78]. To our knowledge, our study is the first to detect signatures of LTBS in this gene.
Cognition
Interestingly, several candidate genes are involved in cognitive abilities, or their variation is associated with diversity in related phenotypes. KL (life extension factor klotho) is a gene that has been associated to human longevity [79] and for which signatures of LTBS have been previously reported [31]. In mice, decreased levels of klotho shorten lifespan (reviewed in [80]). In humans, heterozygotes for the KL-VS variant show higher levels of serum klotho and enhanced cognition, independent of sex and age, than wild-type homozygotes. On the other hand, KL-VS homozygotes show decreased lifespan and reduced cognition [81]. If higher cognition is advantageous, overdominance for this phenotype can explain the signatures of balancing selection we observe (although klotho’s effect on lifespan may also contribute).
PDGFD encodes a growth factor that plays an essential role in wound healing and angiogenesis. A comparison between humans and mice revealed that the PDGFD-induced signaling is crucial for human (but not mouse) proliferation of the neocortex due to neural stem-cell proliferation [82], a trait that underlies human cognitive capacities [83]. This gene has strong associations to coronary artery disease and myocardial infarction, which are related to aging.
Also, among our outliers, a gene with a cognitive-related genetic association is ROBO2, a transmembrane receptor involved in axon guidance. Associations with vocabulary growth have been reported for variants in its vicinity [84]. ROBO2 has signatures of ancient selective sweeps in modern humans after the split with Neanderthals and Denisova [85] on a portion of the gene (chr3:77027850-77034264) almost 40kb apart from the one for which we identified a signature of LTBS (chr3:76985072-76988072). The occurrence of both these signatures highlights the complex evolutionary relevance of this gene.
Associations of genetic diversity in candidate genes with cognition are also supported by case-control and cohort studies linking polymorphisms in the estrogen receptor alpha (ER-α) gene, ESR1, to dementia and cognitive decline. Links between ER-α variants and mood outcomes such as anxiety and depression in women have been proposed but lack confirmation (reviewed in [86]). Interestingly, three other of our candidate genes (PDLIM1, GRIP1, SMYD3) interact with ER-α at the protein level [87], and two have strong association with suicide risk (PDLIM1, GRIP1) [88,89].
In genes like KL, where heterozygotes show higher cognitive abilities than homozygotes, cognition may be a driving selective force. This is a possible scenario in other genes, too. Still, given the complexity of brain development and function, it is also possible that cognitive effects of this variation are a byproduct of diversity maintained for other phenotypes. For example, MHC proteins and other immune effectors are believed to affect connectivity and function of the brain (reviewed in [90,91]), with certain alleles being clearly associated with autism disorder ([91–93]).
Reproduction
We see an enrichment for genes preferentially expressed in the prostate, as well as a number of outlier genes involved in the formation of the sperm. For example, CLDN11 encodes a tight-junction protein expressed in several tissues and crucial for spermatogenesis. Knockout mice for the murine homologue show both neurological and reproductive impairment, i.e., mutations have pleiotropic effects [94,95]. In humans, variants in the gene are strongly associated to prostate cancer.
ESR1, which as mentioned above (in the Cognition section) encodes the ER-α transcription factor activated by estrogen, leads to abnormal secondary sexual characteristics in females when defective [96]. ER-α interacts directly with the product of BRCA1 and has strong associations to breast cancer [97] and breast size [98]. It also harbors strong associations to menarche (age at onset). In males, it is involved in gonadal development and differentiation, and lack of estrogen and/or this receptor in males can lead to poor sperm viability (reviewed in [99]). Strikingly, this gene also has SNPs strongly associated to a diverse array of phenotypes, including height, bone mineral density (spine and hip), and sudden cardiac arrest [100–102]. Two other genes among our candidates are also part of the estrogen signaling pathway: PLCB4 and ADCY5 (which is strongly associated to birth weight). Estrogens are not only involved in reproductive functions (both in males and females), but also in several other processes of neural (see above), muscular or immune nature, and the ER-α-estrogen complex can act directly on promoter regions of other genes, or interact with transcription factors of genes without estrogen-sensitive promoter regions [103]. In this case, balancing selection could be explained by the high level of pleiotropy (if different alleles are beneficial for different functions), including the function in male and female reproduction (if different alleles are beneficial in males and in females).
Conclusions
We present two new summary statistics, NCD1 and NCD2, which are both simple and fast to implement on large datasets to identify genomic regions with signatures of LTBS. They have a high degree of sensitivity for different equilibrium frequencies of the balanced polymorphism and, unlike classical statistics such as Tajima’s D or the Mann-Whitney U [28,37], allow an exploration of the most likely frequencies at which balancing selection maintains the polymorphisms. This property is shared with the likelihood-based T1 and T2 tests [31]. We show that the NCD statistics are well-powered to detect LTBS within a complex demographic scenario, such as that of human populations. They can be applied to either single loci or the whole genome, in species with or without detailed demographic information, and both in the presence and absence of an appropriate outgroup.
More than 85% of our outlier windows are shared across populations, raising the possibility that long-term selective pressures have been maintained after human populations colonized new areas of the globe. Still, about 15% of outlier windows show signatures exclusively in one sampled population and a few of these show opposing signatures of selective regimes between human groups; they are of particular relevance to understand how recent human demography might impact loci evolving under LTBS for millions of years or subsequent local adaptations through selective pressure shifts (e.g. [47]).
Our analyses indicate that, in humans, LTBS may be shaping variation in less than 2% of variable genomic positions, but that these on average overlap with 7.9% of the protein-coding genes. Although immune-related genes represent a substantial proportion of them, almost 70% of the candidate genes cannot be ascribed to immune-related functions, suggesting that diverse biological functions, and the corresponding phenotypes, contain advantageous genetic diversity.
Methods
Simulations and power analyses
NCD performance was evaluated by simulations with MSMS [104] following the demographic model and parameter values described in [105] for African, European, and East Asian human populations (Fig 2). To obtain the neutral distributions for the NCD statistics, we simulated sequence data under this model with a generation time of 25 years, a mutation rate of 2.5 × 10−8 per site, and a recombination rate of 1 × 10−8, and added a human-chimpanzee split at 6.5 mya. For the simulations with selection, a balanced polymorphism was added to the center of the simulated sequence and modeled to achieve a pre-specified frequency equilibrium (feq = 0.3, 0.4, 0.5) following an overdominant model (S2 Note). Simulations with and without selection were run for different sequence lengths (L = 3, 6, 12 kb) and times of onset of balancing selection (Tbs = 1, 3, 5 mya). For each combination of parameters, 1,000 simulations, with and without selection, were used to compare the relationship between true (TPR, the power of the statistic) and false (FPR) positive rates for the NCD statistics, represented by ROC curves. For performance comparisons, we used FPR = 0.05. When comparing performance under a given condition, power was averaged across NCD implementations, demographic scenarios, L, and Tbs. When comparing NCD performance to other methods (Tajima’s D [106], HKA [34], and a combined NCD1+HKA test), we simulated under NCD optimal conditions: L = 3 kb and Tbs = 5 mya (S1 Table). Since power for T1 and T2 is reported based on windows of 100 informative sites (IS; ∼14 kb for YRI and CEU) upstream and downstream of the target site [31], we divided simulations of 15 kb into windows of 100 IS, calculated T1 and T2 with BALLET [31] and selected the highest T1 or T2 value from each simulation to obtain their power for the same set of parameters used for the other simulations.
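As an illustration of how power at a fixed FPR can be read off two sets of simulated values, the following Python sketch takes NCD values from neutral simulations and from simulations with selection. It assumes that lower NCD values are more extreme (as in the definition of significant windows) and that the threshold is the empirical 5% quantile of the neutral values; the function name and the tie handling are illustrative choices, not a description of the authors' R code.

```python
def power_at_fpr(neutral_values, selected_values, fpr=0.05):
    """Fraction of selection simulations at or below the neutral threshold.

    The threshold is the largest value such that a fraction `fpr` of the
    neutral simulations falls at or below it (assuming no ties).
    """
    ranked = sorted(neutral_values)
    k = int(fpr * len(ranked))  # number of neutral false positives allowed
    if k == 0:
        return 0.0              # too few neutral simulations for this FPR
    threshold = ranked[k - 1]
    hits = sum(value <= threshold for value in selected_values)
    return hits / len(selected_values)


# Toy values, for illustration only: 100 evenly spaced "neutral" values.
toy_neutral = [i / 100 for i in range(1, 101)]
toy_selected = [0.01, 0.02, 0.50, 0.04]
toy_power = power_at_fpr(toy_neutral, toy_selected)
```

Repeating the calculation over a grid of FPR values gives the points of a ROC curve for one combination of parameters.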
Human population genetic data
We analyzed genome-wide data from the 1000 Genomes (1000G) Project phase I [40], excluding SNPs only detected in the high coverage exome sequencing in order to avoid SNP density differences between coding and non-coding regions. We queried genomes of individuals from two African (YRI, LWK) and two European populations (GBR, TSI). We did not consider Asian populations due to lower NCD performance for these populations according to our simulations (S1 Table, S7-8 Figs). To equalize sample size, we randomly sampled 50 unrelated individuals from each population (as in [107]). We applied extensive filtering to avoid the inclusion of errors that may bias results. We kept positions that passed mappability (50mer CRG, 2 mismatches [108]), segmental duplication [109,110] and tandem repeats filters [111], as well as the requirement of orthology to chimp (S13 Fig) because NCD2 requires divergence information (Equation 1). Further, we excluded 3 kb windows with fewer than 10 IS in any population (∼2% of scanned windows) or fewer than 500 bp of positions with orthology in chimp (1.6%); the two criteria combined resulted in the exclusion of 2.2% of scanned windows.
Identifying signatures of LTBS
After applying all filters and requiring the presence of at least one informative site, NCD2 was computed for 1,695,655 windows per population. Because in simulations 3 kb windows yielded the highest power for NCD2 (S3-S6 Figs, Table 1), we queried the 1000G data with sliding windows of 3 kb (1.5 kb step size). Windows were defined in physical distance since the presence of LTBS may affect the population-based estimates of recombination rate. For each window in each population we calculated NCD2 for three tf values (0.3, 0.4, 0.5).
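A minimal Python sketch of the window scan follows. It assumes the NCD definition referred to as Equation 1 (not reproduced in this section): the root mean squared deviation of minor allele frequencies from tf, with fixed differences entering NCD2 as sites with frequency 0. The function names and the 0-based window coordinates are illustrative and are not those of the authors' R scripts.

```python
import math


def ncd(minor_allele_freqs, n_fixed_differences=0, tf=0.5):
    """NCD for one window.

    With n_fixed_differences = 0 only SNPs are used (NCD1-like);
    otherwise each fixed difference counts as a site with frequency 0
    (NCD2-like). Both readings are assumptions based on Equation 1.
    """
    freqs = list(minor_allele_freqs) + [0.0] * n_fixed_differences
    if not freqs:
        raise ValueError("window has no informative sites")
    return math.sqrt(sum((p - tf) ** 2 for p in freqs) / len(freqs))


def window_starts(chrom_length, size=3000, step=1500):
    """Start positions of sliding windows that fit within the chromosome."""
    if chrom_length < size:
        return []
    return list(range(0, chrom_length - size + 1, step))


# One toy window scored at the three target frequencies used in the scan.
toy_freqs = [0.48, 0.35, 0.50, 0.12]
toy_scores = {tf: ncd(toy_freqs, n_fixed_differences=2, tf=tf)
              for tf in (0.3, 0.4, 0.5)}
```

Lower values indicate that site frequencies in the window lie closer to the target frequency.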
Filtering and correction for number of informative sites
Genome-wide studies of natural selection typically place a threshold on the minimum number of IS necessary (e.g., at least 10 IS in [28], or 100 IS in [31]). We observe considerable variance in the number of IS per 3 kb window in the 1000G data; also, NCD2 has high variance when the number of IS is low in neutral simulations (S11 and S18 Figs). We thus excluded windows with fewer than 10 IS in a given population because, for higher values of IS, NCD2 stabilizes. We then analyzed the 1,657,989 windows that remained in all populations, covering 2,145,937,383 base pairs (S13 Fig). Neutral simulations with different mutation rates were performed in order to retrieve 10,000 simulations for each value of IS (S18 Fig and Methods). NCD2 (tf = 0.3, 0.4, 0.5) was calculated for all simulations, allowing the assignment of significant windows and the calculation of Ztf-IS (Equation 2 below).
Significant and outlier windows
We defined two sets of windows with signatures of LTBS: the significant (based on neutral simulations) and outlier windows (based on the empirical distribution of Ztf-IS, see below). When referring to both sets, we use the term candidate windows. Significant windows were defined as those fulfilling the criterion whereby the observed NCD2 value is lower than all values obtained from 10,000 simulations with the same number of IS. Thus, all significant windows have the same p-value (p < 0.0001). In order to rank the windows and define outliers, we used a standardized distance measure between the observed NCD2 (for a queried window) and the mean of the NCD2 values for the 10,000 simulations with the matching number of IS:

$$Z_{tf\text{-}IS} = \frac{NCD2_{tf} - \overline{NCD2}_{tf\text{-}IS}}{sd_{tf\text{-}IS}} \qquad (2)$$

where Ztf-IS is the standardized NCD2, conditional on the value of IS, NCD2tf is the NCD2 value with a given tf for the n-th empirical window, $\overline{NCD2}_{tf\text{-}IS}$ is the mean NCD2 for 10,000 neutral simulations for the corresponding value of IS, and sdtf-IS is the standard deviation for 10,000 NCD2 values from simulations with matching IS. Ztf-IS allows the ranking of windows for a given tf, while taking into account the residual effect of IS number on NCD2tf, as well as a comparison between the rankings of a window considering different tf values. An empirical p-value was attributed to each window based on the Ztf-IS values for each tf. Windows with empirical p-value < 0.0005 (829 windows) were defined as the outlier windows. Outlier windows are essentially a subset of significant windows (except for 5 windows in LWK, 1 window in YRI, 3 windows in GBR, and 4 windows in TSI). Significant and outlier windows for multiple tf values had an assigned tf value, defined as the one that minimizes the empirical p-value for a given window (S3 Note).
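The two criteria can be illustrated with a short Python sketch. It assumes that the neutral NCD2 values have already been grouped by number of IS, that the standard deviation is the sample standard deviation, and that the empirical p-value of a window is the fraction of scanned windows with an equal or lower Ztf-IS; these details and all function names are assumptions made for illustration, not taken from the authors' scripts.

```python
import bisect
import statistics


def z_tf_is(observed_ncd2, neutral_ncd2_same_is):
    """Standardize an observed NCD2 against neutral simulations with matching IS."""
    mean = statistics.fmean(neutral_ncd2_same_is)
    sd = statistics.stdev(neutral_ncd2_same_is)  # sample SD (assumption)
    return (observed_ncd2 - mean) / sd


def is_significant(observed_ncd2, neutral_ncd2_same_is):
    """True if the observed value is lower than every matching neutral value."""
    return observed_ncd2 < min(neutral_ncd2_same_is)


def empirical_p_values(z_scores):
    """Fraction of windows with Z at or below each window's Z (lower = more extreme)."""
    ranked = sorted(z_scores)
    n = len(ranked)
    return [bisect.bisect_right(ranked, z) / n for z in z_scores]


# Toy values, for illustration only.
toy_z = z_tf_is(0.2, [0.4, 0.5, 0.6])
toy_p = empirical_p_values([-3.0, 0.0, 1.0, 2.0])
```

With 10,000 neutral simulations per IS value, a window passing `is_significant` has p < 0.0001, matching the threshold stated above.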
Coverage as a proxy for undetected short duplication
To test whether the signatures of LTBS are driven by undetected short duplications, which can produce mapping and SNP call errors, we analyzed an alternative modern human genome-wide dataset, sequenced to an average coverage of 20x-30x per individual [112,113]. We used an independent data set because read coverage data is low and cryptic in the 1000G, and putative duplications affecting the SFS must be at appreciable frequency and should be present in other data sets. We considered 2 genomes from each of the following populations: Yoruba, San, French, Sardinian, Dai, and Han Chinese. For each sample, we retrieved positions above the 97.5% quantile of the coverage distribution for that sample (“high coverage” positions). For each window with signatures of LTBS, we calculated the proportion of the 3kb window having high coverage in at least two samples and plotted the distributions for different NCD2 Ztf-IS p-values. Extreme NCD windows are not enriched in high-coverage regions; in fact, they are depleted of them in some cases (S14 Fig) (Mann-Whitney U two-tail test; p < 0.02 for tf = 0.5 and tf = 0.4 for GBR and TSI).
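The coverage check can be sketched as follows, assuming per-position read depths are available as plain Python lists for each sample. The 97.5% threshold is taken with `statistics.quantiles` from the standard library; the function names and the data layout are hypothetical.

```python
import statistics


def high_coverage_threshold(sample_depths):
    """97.5% quantile of one sample's depth distribution (last of 39 cut points)."""
    return statistics.quantiles(sample_depths, n=40)[-1]


def high_coverage_fraction(window_depths_by_sample, thresholds, min_samples=2):
    """Proportion of window positions with high coverage in >= min_samples samples.

    window_depths_by_sample: one list of depths per sample, all covering the
    same window positions in the same order.
    thresholds: the per-sample high-coverage thresholds, in the same order.
    """
    n_positions = len(window_depths_by_sample[0])
    flagged = 0
    for i in range(n_positions):
        n_high = sum(depths[i] > threshold
                     for depths, threshold in zip(window_depths_by_sample, thresholds))
        if n_high >= min_samples:
            flagged += 1
    return flagged / n_positions


# Toy window of four positions in three samples, with made-up depths.
toy_window = [[10, 50, 50, 10], [10, 50, 10, 10], [10, 10, 50, 10]]
toy_fraction = high_coverage_fraction(toy_window, thresholds=[30, 30, 30])
```

The resulting per-window proportions are the quantities whose distributions are compared across Ztf-IS p-value classes.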
Enrichment Analyses
Gene (GO) and Phenotype (PO) Ontology, and Tissue-specific expression
We analyzed protein-coding genes overlapped by one or more candidate windows. GO, PO and tissue of expression enrichment analyses were performed using GOWINDA [114], which corrects for gene length-related biases and/or gene clustering (S6 Note). GO/PO accession terms were downloaded from the GO Consortium (http://geneontology.org), and the Human PO (http://human-phenotype-ontology.github.io/). We ran analyses in mode:gene (which assumes that all SNPs in a gene are completely linked) and performed 100,000 simulations for FDR (false discovery rate) estimation. Significant GO, PO and tissue-specific categories were defined for a FDR<0.05. In both cases, a minimum of three genes in the enriched category was required.
For tissue-specific expression analysis we used Illumina BodyMap 2.0 [115] expression data for 16 tissues, and considered genes significantly highly expressed in a particular tissue when compared to the remaining 15 tissues using the DESeq package [116], as done in [117]. All three enrichment analyses (GO, PO, and tissue-specific expression) were performed for each population and set of genes: outliers or significant; different tf values (or union of all tf); with or without classical HLA genes (S6 Note).
Archaic introgression and ectopic gene conversion
We evaluated two potentially confounding biological factors: ectopic gene conversion and archaic introgression. We verified the proportion of European SNPs in candidate windows that are potentially of archaic origin, and whether candidate genes tend to have elevated number of paralogs in the same chromosome. Details in S5 Note.
SNP annotations and re-sampling procedure
Functional annotations for SNPs were obtained from ENSEMBL-based annotations on the 1000G data (http://www.ensembl.org/info/genome/variation/predicted_data.html). Specifically, we categorized SNPs as: intergenic, genic, exonic, regulatory, synonymous, and non-synonymous. Details on which annotations were allocated to each of these broad categories are presented in S6 Note. Within each category, each SNP was only considered when variable in the population under analysis (S6 Note). For each candidate window, we summed the number of SNPs with each score, and then summed across candidate windows. To compare with non-candidate windows, we performed 1,000 re-samplings of the number of candidate windows (which were merged in case of overlap) from the set of background windows (all windows scanned). For each re-sampled set, we summed the number of SNPs in a particular category and then computed the ratios in Table S8 and Fig 5. We therefore obtained ratios for each re-sampling set, to which we compared the values from candidate windows to obtain empirical p-values. Because we considered the sum of scores across windows, and counted each SNP only once, results should be insensitive to window length (as overlapping candidate windows were merged). As before, we performed these analyses for each population and sets of windows: outliers or significant, considering the union of all tf.
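The re-sampling procedure can be sketched in Python as below. The sketch assumes each background window is summarized by a pair of SNP counts (category of interest, reference category) and that enrichment is tested as the fraction of re-sampled ratios at or above the observed one. The exact ratios are those of Table S8, which is not reproduced here, so the pair layout is a placeholder; merging of overlapping windows is also left out.

```python
import random


def resampling_p_value(observed_ratio, background_windows, n_windows,
                       n_resamples=1000, seed=0):
    """Empirical p-value for enrichment of a SNP category in candidate windows.

    background_windows: list of (count_in_category, count_in_reference)
    pairs, one per scanned window (hypothetical layout).
    n_windows: number of candidate windows to match in each re-sample.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        sample = rng.sample(background_windows, n_windows)
        numerator = sum(window[0] for window in sample)
        denominator = sum(window[1] for window in sample)
        ratio = numerator / denominator if denominator else float("inf")
        if ratio >= observed_ratio:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_resamples


# Toy background in which every window has the same counts.
toy_background = [(1, 2)] * 50
p_equal = resampling_p_value(0.5, toy_background, n_windows=10)
p_higher = resampling_p_value(0.6, toy_background, n_windows=10)
```

Testing for depletion, as for regulatory SNPs, would count re-sampled ratios at or below the observed value instead.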
Genes with monoallelic expression (MAE) and immune-related genes
To test for enrichment for genes with MAE, we quantified the number of outlier and significant genes with MAE and the number that have bi-allelic expression as described in [46]. We compared these proportions to those observed for all scanned genes (one-tailed Fisher’s exact test). The same procedure was adopted to test for enrichment of immune-related genes among our sets: we used a list of 386 keywords from the Comprehensive List of Immune Related Genes from Immport (https://immport.niaid.nih.gov/immportWeb/queryref/immportgene/immportGeneList.do) and queried how many of the outlier protein-coding genes (402 genes in total across populations and tf, of which 378 had at least one associated GO term) had at least one immune-related associated GO category.
All statistical analyses were performed, and all figures produced, in R [118] (scripts available on https://github.com/bbitarello/NCV_dir_package). Gene Cards (www.genecards.org) and Enrichr [119] were used to obtain basic functional information about genes and STRING v10 [87] was used to obtain information for interactions between genes. The GWAS catalog [120] was used to search for associations included in the discussion (we only report “strong associations”, i.e., when there is at least one SNP with p < 10−8).
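For reference, a one-tailed Fisher's exact test on a 2×2 table can be written with the Python standard library alone, as the upper tail of a hypergeometric distribution. The table layout (candidate versus other genes, MAE versus bi-allelic expression) follows the description above, but the counts below are hypothetical and the function is a sketch rather than the R call used by the authors.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided p-value for enrichment in the table [[a, b], [c, d]].

    a: candidate genes with MAE      b: candidate genes, bi-allelic
    c: other genes with MAE          d: other genes, bi-allelic
    Returns P(X >= a) with row and column totals held fixed.
    """
    row1 = a + b             # number of candidate genes
    col1 = a + c             # number of MAE genes
    total = a + b + c + d
    upper = min(row1, col1)  # largest possible count in the top-left cell
    tail = sum(comb(col1, k) * comb(total - col1, row1 - k)
               for k in range(a, upper + 1))
    return tail / comb(total, row1)


# Hypothetical counts, for illustration only.
p_toy = fisher_exact_greater(3, 0, 0, 3)
```

Here `math.comb` returns 0 for impossible cell counts, so the sum needs no lower bound beyond the observed value.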
Author Contributions
AA, DM and BDB conceived and designed the study. BDB, CDF and PK performed data quality filters. AA, BDB, CDF and DM designed and explored the properties of the statistic. BDB and CDF performed power analyses and ran the genome-wide analysis. JCT and JS performed the enrichment analyses. All authors interpreted the data. AA and DM supervised the project. BDB, DM and AA wrote the manuscript, with contributions from all authors.
Acknowledgements
We would like to dedicate this manuscript to Scott Williamson, in memoriam, for playing a fundamental role in the conception of NCD. We also thank Warren Kretszchmar for analyses on the properties of related statistics not included here, and Eric Green for his support of that work. We thank Michael DeGiorgio for assistance with BALLET, Felix Key for help with 1000 Genomes data sets, Michael Dannemann for assistance in the implementation of expression analyses, Stéphane Peyrégne for comments on the manuscript, and David Reher, members of the Evolutionary Genetics Group (São Paulo), Alex Cagan and Svante Pääbo for helpful comments.
Footnotes
¶ AA and DM co-supervised the study