Abstract
Balancing selection maintains advantageous genetic and phenotypic diversity in populations. When selection acts for long evolutionary periods selected polymorphisms may survive species splits and segregate in present-day populations of different species. Here, we investigated the role of long-term balancing selection in the evolution of protein-coding sequences in the Pan-Homo clade. We sequenced the exome of 20 humans, 20 chimpanzees and 20 bonobos and detected eight coding trans-species polymorphisms (trSNPs) that are shared among the three species and have segregated for approximately 14 million years of independent evolution. While the majority of these trSNPs were found in three genes of the MHC cluster, we also uncovered one coding trSNP (rs12088790) in the gene LAD1. All these trSNPs show clustering of sequences by allele rather than by species and also exhibit other signatures of long-term balancing selection, such as segregating at intermediate frequency and lying in a locus with high genetic diversity. Here we focus on the trSNP in LAD1, a gene that encodes for Ladinin-1, a collagenous anchoring filament protein of basement membrane that is responsible for maintaining cohesion at the dermal-epidermal junction; the gene is also an autoantigen responsible for linear IgA disease. This trSNP results in a missense change (Leucine257Proline) and, besides altering the protein sequence, is associated with changes in gene expression of LAD1.
Introduction
Balancing selection maintains advantageous polymorphisms in populations, preventing fixation of alleles by drift and increasing genetic diversity (Charlesworth 2006; Andres 2011). There are a variety of mechanisms through which balancing selection can act, including overdominance or heterozygote advantage (Allison 1956; Pasvol, Weatherall, Wilson 1978), frequency-dependent selection and rare-allele advantage (Wright 1939; Gigord, Macnair, Smithson 2001), temporal and spatial variation in selective pressures (Gillespie 1978; Muehlenbachs et al. 2008), or pleiotropy (Gendzekhadze et al. 2009).
When balancing selection acts on a variant long enough it creates long local genealogies, with unusually old coalescence times. Selected alleles can segregate for millions of years, with neutral diversity accumulating near the selected variant(s) due to linkage (Charlesworth, Nordborg, Charlesworth 1997; Clark 1997; Charlesworth 2006). Selection maintains alleles close to the frequency equilibrium, the frequency that maximizes fitness in the population. This results in an enrichment of variants close to the frequency equilibrium in selected and linked variation (Hudson, Kaplan 1988; Takahata, Nei 1990; Charlesworth, Nordborg, Charlesworth 1997; Charlesworth 2006). Recombination restricts these signatures to short genomic segments (Wiuf et al. 2004; Charlesworth 2006; Segurel et al. 2012; Leffler et al. 2013). If selection is strong and constant enough, the polymorphism may survive the split of different species and persist in present-day populations of more than one species, resulting in a trans-species polymorphism (trSNP) (Muirhead, Glass, Slatkin 2002; Charlesworth 2006; Andres 2011) (Figure 1). In species with old enough divergence time trans-species polymorphisms are rare under neutrality and are hallmarks of balancing selection (Charlesworth, Nordborg, Charlesworth 1997; Clark 1997; Wiuf et al. 2004).
The assumption that trans-species polymorphisms are very rare in humans combined with the absence of unbiased genome-wide polymorphism datasets in other great ape species resulted in few trans-species polymorphisms being described in humans: several SNPs in the major histocompatibility locus (MHC) (Klein et al. 1993; Asthana, Schmidt, Sunyaev 2005), and a few non-MHC genes (e.g. TRIM5 (Cagliani et al. 2010), ZC3HAV1 (Cagliani et al. 2012), and ABO (Segurel et al. 2012)). Recently, six well-defined short trans-species haplotypes containing at least two trSNPs shared in humans and chimpanzees have been identified (Leffler et al. 2013). Interestingly, only two of them contained coding SNPs (in both cases synonymous), suggesting that they may be involved in regulation. Leffler et al. (2013) also identified a number of coding SNPs shared between humans and chimpanzees, but because filtering on allelic trees or CpG sites was not performed, it is unclear whether they represent trans-species polymorphisms or recurrent mutations (an important problem in the identification of trSNPs, see below).
Here we analyze the complete exomes of 20 humans, 20 chimpanzees and 20 bonobos to identify trans-species polymorphisms present since the Pan-Homo common ancestor until the present-day population of each of the three species. By including the three species we focus only on strong balancing selection that has been maintained in the three lineages. Besides identifying coding trSNPs in several MHC genes, we also identify a novel trans-species polymorphism maintained by long-term balancing selection in the gene LAD1 (ladinin-1).
Results
Absence of neutral trSNPs in humans, chimpanzees and bonobos
As mentioned above the presence of neutral trSNPs is rare when species diverged long ago. Bonobos and chimpanzees diverged about 2My ago (Prufer et al. 2012) and they share some polymorphisms: we expect, under neutrality, 0.85% of the SNPs in bonobo to be segregating in chimpanzee, and 4.6% vice-versa (see SOM section II). Conversely, a neutral trans-species polymorphisms between Homo and any of the two Pan species is unlikely to occur by genetic drift alone: based on coalescent theory, we estimate that a SNP found in a sample of humans has very low probability of being polymorphic in chimpanzees too, p = 1.6e-08 (see also SOM section II). If we include bonobos the presence of neutral trSNP is even less likely: the probability of a polymorphism observed in humans to be a trans-species polymorphism shared with both chimpanzees and bonobos is, under neutrality, 4.0e-10 (see also SOM section II). This is roughly 39 times lower than the probability that a SNP in humans is also polymorphic in chimpanzees, illustrating the advantage of including bonobo in the comparison. Given that we observe 121,904 human SNPs, we expect about 5.0e-05 neutral trSNPs in the three species. We note that these are actually overestimates, since coding variation is subject to purifying and background selection that produce shallower trees than in neutrally evolving loci. Therefore we cannot explain human-chimpanzee-bonobo trSNPs under neutrality.
Idenfication of trSNPs
We sequenced the exomes of 20 Yoruba humans, 20 central chimpanzees (Pan troglodytes troglodytes) and 20 bonobos (Pan paniscus) to an average coverage of ∼18X in each individual of the three species (data is very homogeneous across species in coverage and quality, see Materials and Methods). We uncovered a total of 121,904 high-quality SNPs in human, 262,960 in chimpanzee and 99,142 in bonobo. This represents a novel SNP discovery of ∼33.54% in bonobo, ∼49.29% in chimpanzee and ∼2.8% in human (compared with Prado-Martinez et al. (2013) and dbSNP build 138). We focused on the 202 coding SNPs with the same two segregating alleles in the three species.
Two important confounding factors in the identification of trSNPs are genotype errors and recurrent mutations. The latter is particularly likely in hypermutable sites where the probability of a parallel mutation in two lineages is high. Prime examples of this are CpG dinucleotides, where a methylated cytosine can deaminate to a thymine and result in a C->T transition (Bird 1980; Hodgkinson, Eyre-Walker 2011), but additional, cryptic heterogeneity in mutation rate exist (Hodgkinson, Ladoukakis, Eyre-Walker 2009; Hodgkinson, Eyre-Walker 2010; Johnson, Hellmann 2011). To limit the influence of recurrent mutation and mapping and sequencing artifacts, we conservatively removed SNPs that: a) occur at a CpG dinucleotide in at least one species; b) are out of Hardy-Weinberg equilibrium (p<0.05 in at least one species); c) are in the upper 5% tail of the empirical distribution of coverage in at least one species; and d) do not lie in regions with high mappability (see Materials & Methods and SOM section I). This resulted in a set of 13 coding shSNPs that passed all filters, distributed across 5 genes.
Although removing CpGs should drastically reduce the number of recurrent mutations, non-CpG SNPs could also occur via recurrent mutations. However, such SNPs are expected to fall in genomic regions that follow the species tree, while trans-species polymorphisms create local genealogies that cluster by allele (Figure 2) because the most recent common ancestor of the genomic segment containing the trSNP predates the split of the three species (Schierup, Mikkelsen, Hein 2001; Wiuf et al. 2004). Thus, analyses of the regions surrounding a shSNP may allow us to distinguish the two cases.
We therefore focus only on shSNPs that fall in genomic regions that exhibit trees that cluster by allelic type. We identify 10 shSNPs (in 5 genes) that have a probability of belonging to an allelic tree > 0.90 (see Materials and Methods) and consider these shSNPs as candidate trans-species polymorphisms (candidate trSNPs). They lie in the genes LAD1 (ladinin-1, 1 trSNP), TNFRSF10D (tumor necrosis factor receptor superfamily member 10d, 2 trSNPs), HLA-C (1 trSNP), HLA-DQA1 (3 trSNPs) and HLA-DPB1 (3 trSNPs).
Because trSNPs have been previously described in HLA genes (Lawlor et al. 1988; Mayer et al. 1988; Fan et al. 1989; Klein et al. 1993; Asthana, Schmidt, Sunyaev 2005; Leffler et al. 2013), we focus on the remaining two genes. To confirm that their SNPs are true polymorphisms rather than Illumina sequencing errors or the result of misaligned reads, we performed independent SNP validation with Sanger sequencing on the three candidate trSNPs outside the HLA region (for detailed information see SOM section I). We were able to validate two candidate trSNPs (one in LAD1 – chr1:201355761 and one in TNFRSF10D chr8:23003292) in all three species. When we BLAT (Kent 2002) the 25bp region surrounding these candidate trSNPs to the three reference genome sequences, they all map uniquely to the human reference genome (hg19), but only the SNP in LAD1 maps uniquely to the chimpanzee genome (PanTro4) (see SOM section I). This suggests that the candidate trSNP in TNFRSF10D could be the result of mapping errors to duplicated chimpanzee regions misassembled in the chimpanzee reference genome. Although this does not discard this position as a SNP in the other species (or in chimpanzee) we conservatively removed it from further analyses. We therefore focus (besides the MHC trSNPs) on the non-synonymous variant in exon 3 of the gene LAD1 (rs12088790), which was confirmed in all three species. Figure 3 shows the neighbor-joining tree for LAD1 where sequences cluster by allelic type.
Excess of polymorphism linked to the trSNPs
In order to investigate whether the four genes exhibit an excess of genetic diversity after taking heterogeneity in mutation rate into account, we calculated the ratio of polymorphism to divergence (PtoD) in LAD1, HLA-C, HLA-DQA1 and HLA-DPB1, and found that in all three species this set of genes is significantly more polymorphic than the empirical distribution of all genes with at least one variable site (polymorphism or substitution) in our dataset. This is indeed the case when we consider a) the entire genic region, b) only their coding sequence, and c) the 500 bp surrounding the trSNP (Table 1). LAD1 is the least polymorphic of the four genes, which is not surprising as the remaining trSNPs fall in HLA genes. Nevertheless, LAD1 lies in the upper tail of the empirical distribution, with a significant excess of polymorphism (p<0.05) in human and bonobo for at least one comparison (humans considering both the entire gene and coding regions, bonobo considering the entire gene). The signal is not as strong in chimpanzee, with the excess of polymorphism being marginally non-significant (p=0.06) if we consider the entire gene (see SOM section IV).
Intermediate-allele frequency of the trSNPs and linked variants
The allele frequency distribution of sites linked to a balanced polymorphism is expected to exhibit an excess of alleles at frequencies close to the frequency equilibrium. If the frequency equilibrium is high enough (e.g. 0.5) the local site frequency spectrum (SFS) will show an observable departure from the genome-wide empirical distribution. The SFS of the four genes together shows a significant shift towards intermediate-frequency alleles in all species (Mann-Whitney U test p<4e-10; Figure 4).
When we consider the genes individually, almost all exhibit a significant excess of intermediate-frequency alleles in all species except for LAD1 in bonobo and human (marginally non-significant), and for HLA-C in bonobo (Table 2). Therefore, we observe a global accumulation of intermediate-frequency alleles in the genes containing trSNPs, although in individual genes the low number of SNPs hampers the power of the test. As shown in Figure 4, the missense trSNP in LAD1 is at intermediate frequency in all three species: MAF=0.45 in human, 0.325 in chimpanzee, and 0.225 in bonobos. These frequencies are all (except in bonobo) in the upper tail of the empirical allele frequency distributions of non-synonymous variants: in the upper 1.9% quantile for human, in the 8.6% for chimpanzee, and in the 23.8% in bonobo.
When we investigate the 1000 Genomes dataset (Abecasis et al. 2012), which contains both coding and non-coding data for LAD1, we observe a significant excess of intermediate-frequency alleles in all African populations, although the signature varies across human groups with some non-African populations showing an excess of low-frequency alleles instead (see Table S4). The trSNP is itself at intermediate frequency in all the African populations analyzed (between 31% and 48%) and at low frequency (<8%) in the non-Africans from the 1000Genomes data. Actually, when we compute FST values for LAD1’s trSNP between the African Yoruba and two non-African populations (Toscani and Han Chinese) we observe high allele frequency differences (FST = 0.238 and 0.293, respectively), which are in the top 6.5% tail of the empirical FST distribution.
Balancing selection in LAD1
LAD1 (ladinin 1) spans 18,704 bp and is composed of 10 exons. We obtained a total of 1,213bp of the gene by sequencing the complete exons 4, 7 and 9, as well as parts of exons 2, 3 and 5. The trSNP found in LAD1 (chr1: 201355761, rs12088790), located in exon 3, is a missense mutation that results in a Leucine to Proline change. This change has a moderately conservative Grantham score (amino acid replacement score based on chemical similarity – Leucine -> Proline = 98) (Grantham 1974).
Besides altering the sequence of the protein, the trSNP is associated with expression changes in present-day humans. Specifically, when we analyzed expression data in lymphoblast cell lines from a subset of the 1000 Genomes project individuals (Lappalainen et al. 2013), we observed significantly lower expression of LAD1 in carriers of at least one ancestral G allele (GG and GA genotypes) than in AA homozygotes (p=0.02). Comparing carriers of at least one A allele with GG homozygotes did not show a significant difference in expression levels (p=0.21). This shows that the derived A allele is associated with increased expression of LAD1 in an at least partially recessive manner.
Discussion
By comparing the full exome of humans, chimpanzees and bonobos, we identify polymorphisms maintained by long-term balancing selection in the Homo-Pan clade. Undoubtedly, other cases of long-term balancing selection exist, including species-specific balancing selection (Pasvol, Weatherall, Wilson 1978; Bamshad et al. 2002; Wooding et al. 2004; Wooding et al. 2005; Muehlenbachs et al. 2008; Andres et al. 2009; Andres et al. 2010), but here we focus on selection that is old, strong, constant and shared across lineages, and that results in trans-species polymorphisms. Even among trSNPs, we focus only on coding variants shared among the three species, and likely underestimate the number of human trSNPs. First, by focusing on coding variation we are largely blind to balancing selection that maintains variants outside genes, which may not be rare (Leffler et al. 2013). Second, by restricting on a SNP being present in the three species we discard cases where the variant was lost in one of the lineages, which may again not be rare. For example, even one of the best-established cases of trSNPs, the one present in the ABO gene from humans to old word monkeys, is not shared among the three species because it was lost in chimpanzees (Segurel et al. 2012). This is not unexpected as it is likely that one of the species has undergone demographic or selective changes that have weakened or changed selection on an old balanced polymorphism. Conversely, considering three species (e.g. adding bonobo) reduces the probability of trSNPs under neutrality; in fact, after considering the number of SNPs discovered in humans (121,904), we expect to observe no neutral trSNP (specifically, we expect 5.0e-05 neutral trSNPs).
Of the eight trSNPs we identify shared among the three species, seven are located in HLA genes (HLA-DQA1, HLA-C and HLA-DPB1) and one is a non-synonymous SNP in exon 3 of the gene LAD1 (rs12088790). This variant, which has segregated for millions of years in these lineages, represents to our knowledge the only trans-species polymorphism known to segregate in present-day populations of these three species outside of the MHC.
LAD1 trSNP segregates at intermediate frequencies in Yoruba, bonobos and chimpanzees. In humans, the trSNP is present in several populations throughout the world (1000 Genomes data (Abecasis et al. 2012)), at intermediate frequency in African populations and at low frequency in non-African populations. The fact that only African populations show a significant excess of intermediate frequency alleles (Table S4) suggests that the selective pressure might have changed across human groups, as indicated by the FST analysis. When considering the entire gene, we observe a significant excess of intermediate-frequency alleles in chimpanzee, and a marginally significant excess of intermediate-frequency alleles in humans (Table 2). The gene also exhibits high levels of genetic diversity, particularly in bonobos and humans. The weaker signal in chimpanzee is likely due its higher effective population size (Prado-Martinez et al. 2013) that translates in higher genetic diversity across the genome and lower power to detect the localized increased in diversity around LAD1. Overall LAD1, and rs12088790 in particular, show consistent and strong signatures of balancing selection.
Although rs12088790 in LAD1 is a good candidate to have been the target of selection (being non-synonymous and present in the three species), it is possible that it is instead maintained by linkage to an undiscovered selected trSNP, as the maintenance of several linked trSNPs is possible under long-term balancing selection (Segurel et al. 2012). Although more detailed genomic and functional analysis on LAD1 are needed to completely clarify this question, we explored a recently published catalog of great-ape genetic polymorphism in search for additional shSNPs (Prado-Martinez et al. 2013). Besides our trSNP (which in that dataset also segregates in all three species), we were unable to identify additional shSNPs in the three species that are independent of CpG dinucleotides (see SOM section VI). We were, however, able to uncover two additional intronic shSNPs in LAD1, one shared between human and bonobo, and the other between human and orangutan. Nevertheless, these are distant to rs12088790 (at least 6kb away), not shared between the three species here analyzed, and thus highly unlikely to be related to rs12088790.
Interestingly, the two alleles of rs12088790 are associated with differences in expression levels of LAD1, with higher expression associated with the ancestral G allele in lymphoblastoid cell lines. This highlights the possibility that, in addition to causing an amino acid replacement, the trSNP might also have regulatory effects, although it is certainly possible that another, nearby variant, instead produces the observed differences in expression.
The biological factors leading to long-term balancing selection on LAD1 are not obvious. The gene encodes a collagenous anchoring filament protein of basement membrane at the dermal-epidermal junction. The mRNA and the protein are observed in a number of tissues including the gastrointestinal system (and its accessory organs), the kidney, prostate, placenta, and one type of hematopoietic cells (Kim et al. 2014). Genes involved in cell adhesion and extracellular matrix components are enriched among candidate targets of balancing selection and among genes with intermediate-frequency alleles in pathogen-rich environments (Andres et al. 2009; Fumagalli et al. 2009; Fumagalli et al. 2011), suggesting that certain components of the cellular junction may benefit from the presence of functional polymorphism, perhaps as a defense against pathogens. In this context, LAD1 may represent one of such examples.
Interestingly, genetic variation in LAD1 is associated with linear IgA disease, an autoimmune blistering disease. The disease, which affects mostly children and elderly adults (McKee, Calonje, Granter 2005), is caused by the presence of circulating IgA autoantibodies that target peptides in the Ladinin-1 protein, causing an immunological reaction. This results in the disruption of the dermal-epidermal cohesion, leading to skin blistering that predominantly affects the genitalia but also the face, trunk and limbs (Ishiko et al. 1996; Marinkovich et al. 1996; Motoki et al. 1997; McKee, Calonje, Granter 2005). Although our understanding of the effect of the disease in different populations is biased by the fact that the disease (which is rare) has mostly been studied in Western countries, some evidence suggests that it is more common in Africa (Aboobaker et al. 1991; Denguezli et al. 1994; Monia et al. 2011). Balancing selection has been proposed to play a role in the evolution of autoimmune genes, because the inflammatory response must be precisely balanced to be effective yet moderate (Ferrer-Admetlla et al. 2008). Whether balancing selection in LAD1 is responsible for its role in auto-immunity remains though unclear. It is possible, and perhaps more likely, that autoimmune diseases appear as consequences of diversity in proteins that is maintained by balancing selection and happen to be able to initiate pathogenic immunological reactions. Further work is necessary to discern the functional consequences and advantageous role of its balanced polymorphisms in humans and other primates.
Materials and Methods
DNA samples and sequencing
We performed whole-exome capture and high-coverage sequencing of 20 humans, 20 central chimpanzees (Pan troglodytes troglodytes) and 20 bonobos (Pan paniscus). Human samples belong to the well-studied Yoruba population from HapMap; bonobo and chimpanzee blood samples were collected in African sanctuaries (Lola ya bonobo sanctuary in Kinshasa, Democratic Republic Congo; and Tchimpounga sanctuary, Jane Goodall Institute, Republic of Congo, respectively) and immortalized as cell culture (Fischer et al. 2011). DNA was extracted using the Gentra Purgene Tissue Kit (Qiagen), sheared to a size range of 200 to 300 bp using the Bioruptor (Diagenode) and converted into DNA libraries for capture and sequencing (Meyer, Kircher 2010). All samples were double-indexed to prevent cross-sample contamination during the processing and sequencing of the samples (Kircher, Sawyer, Meyer 2012). Exome capture was performed using the SureSelect Human All Exon 50Mb Kit (Agilent Technologies). The kit design is based on the complete annotation of coding regions from the GENCODE project with a capture size of approximately 50 Mb. We selected all Ensembl genes (mapping uniquely to hg19) that are RefSeq genes (with good functional support) and targeted by our capture design, and selected their longest RefSeq transcript. Samples were then pooled by species and sequencing was performed on Illumina’s GAIIx platform, with paired-end reads of 76bp.
Base calling and read mapping
Base calling was performed with Ibis (Kircher, Stenzel, Kelso 2009), and reads with more than 5 bases with a base quality score lower than 15 were discarded. Reads were aligned to the human reference genome hg19 using BWA with default parameters. Mapping all individuals to the same reference genome prevented complications from mapping to genomes of different quality. Only reads with a mapping quality (MQ) ≥ 25 and mapping outside of known segmental duplications in the three species were considered for further analysis. Specifically, the average coverage for each individual is 18.9X in human, 17.9X in chimp and 17.9X in bonobo.
Genotype calling and filtering
Genotype calls were performed in the autosomes using the Genome Analysis Toolkit (GATK) UnifiedGenotyper (version 1.3-14) (McKenna et al. 2010). Aside from true variation, these preliminary SNP calls likely include false positives due to the presence of mismapped reads, misaligned indels and systematic errors. We used a combination of strict filters to remove such errors. SNPs were removed using the following criteria (acronyms correspond to the GATK package or fields in the VCF files):
The depth of coverage (DP) was <8 or >100 in at least 50% of the individuals of each species. This allowed us not only to exclude positions for which the coverage depth was low, but also positions that might fall in segmental duplications not annotated in the datasets above [28-30];
The quality score (QUAL) of the call was <50;
There was evidence of strand bias (SB>0);
The genotype quality (GQ) was <10 in all individuals carrying the alternative allele;
The SNP was located within 3bp of a homopolymer with a minimum length of 5bp;
The SNP was located within 5bp up- and down-stream of an insertion or deletion (indel) polymorphism or substitution with the human reference genome.
Shared SNPs as trans-species polymorphisms
In order to account for sites that are prone to recurrent mutation we did not consider CpG dinucleotides. Finally, wrongly mapped reads are difficult to account for and can result in an increased false discovery of shSNPs. In order to remove undetected duplications, we further filtered shSNPs to remove sites: a) with coverage higher than the 95% quartile tail of the distribution specific to each sample, b) that are in Hardy-Weinberg disequilibrium (p<0.05 in each species), and c) that do not lie in regions with 24mer mappability of 100% uniqueness.
Haplotype inference
We use the fastPHASE 1.4.0 software to infer the chromosomal phase for the alleles of each of the 37 genes containing at least one shSNP. The inferences were performed separately for each species and for each chromosome using the default parameters of fastPHASE.
Allelic trees
The region surrounding a trans-species polymorphism is expected to follow unusual genealogies where haplotypes cluster by allelic type rather than by species. This occurs because the age of the balanced polymorphism predates the speciation time and, unless recombination happens, there will be no fixation of new mutations. We call these two types of phylogenies “allelic tree” and “species tree” (Figure 2). The trees were inferred in windows of different lengths (100 bp to 2,000 bp) centered on the shared polymorphism, as the region expected to follow the allelic tree is very short due to the long-term effects of recombination. We considered as candidate trans-species polymorphisms only shSNPs that show an allelic tree in a window of at least 100 nucleotides.
We adopted a simple resampling approach to calculate the probability of allelic tree in the region surrounding a shared SNP. We randomly created 1,000 samples of six haplotypes (one haplotype per allele and per species). For each of the 1,000 resamples we built a neighbor-joining tree using as distance matrix the number of nucleotide differences among the six haplotypes. If the three closest tips are haplotypes from the three species containing the same allele of the shSNP, it was considered an allelic tree. If the two different human haplotypes are closer to each other than to any other haplotypes, the tree was considered a species tree (the relationship between chimpanzees and bonobos was not considered because incomplete lineage sorting occurs often given their short divergence time). The probability of a tree being allelic was estimated as the proportion of resampled trees that were allelic trees. Figure 2 shows an example of allelic and species tree built from six haplotypes.
Polymorphism-to-Divergence ratios (PtoD)
We defined the ratio of polymorphism to divergence PtoD = p/(d+1), where p is the number of polymorphisms observed in a species and d the number of fixed differences between species. For each species, we calculated PtoD in three different ways: a) for the entire gene, from the start to the end coordinates of the longest transcript in refGene; b) for the protein-coding part; c) for a 500bp region centered on one particular SNP, i.e. the shSNPs for candidate genes and a random SNP for the control genes.
In order to ascertain significance (p<0.05) when comparing the set of candidate genes to the set of control genes (empirical distribution), we performed 2-tails Mann-Whitney U (MW-U) tests. After comparing the PtoD values in the two groups, we sequentially removed the top candidate gene (i.e. one gene each time) from the candidate’s group and recalculated MW-U p-values maintaining the control group unaltered (see SOM section IV for details).
Measuring expression levels in LAD1 alleles
We analyzed lymphoblastoid cell line expression data obtained from a subset of 462 of the 1000 genomes project individuals provided by Lappalainen et al. (2013). To compute gene expression we used the aligned reads provided by Lappalainen et al. (2013) and assigned reads with a mapping quality (MQ) > 29 to protein coding genes by overlapping the read coordinates with gene coordinates (ENSEMBL version 69). Reads overlapping a gene are summed up and used as the estimate for gene expression.
We grouped the individuals by their genotype at position chr1:201355761 (rs12088790, the non-synonymous trSNP in LAD1). We sought to test for allele-specific expression for LAD1 between individuals carrying the two different trSNP alleles by testing for differential expression between (i) the groups of individuals with genotype AA vs. GG/GA and (ii) the groups of individuals with genotype GG vs. GA/AA. We computed differential expression for LAD1 for (i) and (ii) using the DESeq package (Anders, Huber 2010). Expression values in both groups are modeled by a fit of a negative binomial distribution. DESeq tests then for differences between the distributions of the two groups.
Acknowledgments
This work was supported by the Max Planck Society. JCT is supported by Fundação para a Ciência e a Tecnologia (FCT) within the Portuguese Ministry for Science and Education (SFRH/BD/77043/2011). We wish to thank Felix M. Key, Gabriel Renaud, Jing Li, Kay Prüfer and Joshua Schraiber for helpful discussions and suggestions. We are grateful to the Lola Ya Bonobo sanctuary in Kinshasa, Democratic Republic Congo, and the Tchimpounga sanctuary, Jane Goodall Institute, Republic of Congo, for allowing access to the primate samples.