Abstract
Changes in transcriptional regulation are thought to be a major contributor to the evolution of phenotypic traits. We developed a new method to identify DNase I Hypersensitive (DHS) sites with differential chromatin accessibility between species. Our method replaces threshold-based multiple pairwise comparisons with a single quantitative test, which detects subtle differences while scaling well to multiple taxa. We applied this method to DHS sites in fibroblast cells from five primates (human, chimpanzee, gorilla, orangutan, and rhesus macaque). We identified approximately 90,000 DHS sites, of which 59% are non differential between species, 27% are differential and likely due to a single evolutionary change, and 14% are differential due to multiple evolutionary changes. We found that using closely related species in our analyses allows us to distinguish between accessibility changes that are specific to a single species and those that have experienced multiple changes in chromatin accessibility during primate evolution. The non differential sites are enriched for nucleotide conservation, while the species-specific changes are enriched for positive selection. Differential DHS sites with decreased chromatin accessibility relative to rhesus macaque occur more commonly in proximal regulatory elements, while those with increased chromatin accessibility relative to rhesus macaque occur more commonly in distal regulatory elements. Differential DHS sites overlapping proximal regulatory elements display less cell-type specificity than those overlapping distal regulatory elements. Taken together, these results identify several classes of chromatin accessibility, each with distinct characteristics of selection, genomic location, and cell-type specificity.
Introduction
It has long been hypothesized that phenotypic differences between species are more often due to genetic variation in non-coding regulatory regions than in protein-coding regions (King and Wilson 1975; Wray 2007; Wittkopp and Kalay 2011). The development of diverse genome-wide assays, combined with the publication of primate reference genomes, has allowed identification of inter-species differences in gene expression (Cáceres et al. 2003; Gilad et al. 2006; Blekhman et al. 2008; Brawand et al. 2011), DNA methylation (Pai et al. 2011; Zeng et al. 2012; Hernando-Herraez et al. 2013), histone modifications (Zhou et al. 2014; Villar et al. 2015), transcription factor binding motifs (Dermitzakis and Clark 2002; Odom et al. 2007; Schmidt et al. 2010), chromatin accessibility (Shibata et al. 2012; Gallego Romero et al. 2018), and alternative splicing (Blekhman et al. 2010; Barbosa-Morais et al. 2012). These differences in molecular function among primate species can provide valuable insights into species-specific trait differences, including disease risk (Prabhakar et al. 2008; Boyd et al. 2015; Prescott et al. 2015).
Previous approaches to analyzing multi-species data have used multiple pairwise comparisons to detect differences between species (Robinson et al. 2010; Love et al. 2014). Applying these methods to comparisons between three or more groups becomes cumbersome because it requires multiple pairwise comparisons followed by an overlap step that intersects the results from those comparisons. The number of required pairwise comparisons increases substantially as the number of groups is increased (e.g., 10 comparisons for 5 species). Calculating p-values for multiple species comparisons is difficult because species are not completely independent and indeed differ in their degree of dependence due to phylogenetic relationships.
As an alternative to performing multiple pairwise comparisons, we developed a negative binomial generalized linear model that jointly models data across multiple species and replicates, and can be applied to any count-based data such as DNase-seq, RNA-seq, ChIP-seq, and ATAC-seq. We applied this method to chromatin accessibility DNase-seq data from cultured skin fibroblasts obtained from human, three recently diverged great apes (chimpanzee, gorilla, and orangutan), and rhesus macaque for use as an outgroup. We identified a total of 89,744 DNase I Hypersensitive (DHS) sites that are directly comparable between species, of which 36,666 (41%) displayed a statistically significant difference between species. Using effect estimates from the negative binomial generalized linear model, we inferred the direction of the change relative to rhesus macaque and the possible branch or branches on which the changes occurred.
Materials and Methods
DNase-seq Experiments and Sequencing
Fibroblast cell lines from 15 individuals comprising 3 biological replicates from each of five primate species (human, chimpanzee, gorilla, orangutan, and rhesus macaque) were obtained from Coriell (Supplementary Table 1). It is estimated that human and chimpanzee diverged 7 million years ago, gorilla diverged from the human-chimpanzee ancestor 10 million years ago, orangutan diverged from the human-chimpanzee-gorilla ancestor 18 million years ago, and rhesus macaque diverged from the human-chimpanzee-gorilla-orangutan ancestor 30 million years ago (Schrago and Voloch 2013). DNase-seq experiments were performed as previously described (Shibata et al. 2012). DNase-seq libraries were generated from 50 million cells and sequenced on Illumina instruments (Supplementary Table 1).
DNase-seq Read Mapping and Conversion to Human Genome
Due to the use of MmeI to generate DNase-seq libraries (Boyle et al. 2008), genomic DNA fragments are only 20 bases long. Therefore, sequencing reads were trimmed to 20 bases using a custom perl script. Reads from each species were mapped to that species' native genome: hg19 for human, panTro4 for chimpanzee, gorGor3 for gorilla, ponAbe2 for orangutan, and rheMac3 for rhesus macaque (Lander et al. 2001; Chimpanzee Sequencing and Analysis Consortium 2005; Locke et al. 2011; Yan et al. 2011; Scally et al. 2012). Reads were mapped using Bowtie version 0.12.9 (Langmead et al. 2009) (parameters: --trim5 0 --trim3 0 -m 1 -l 20) as part of a custom two-step pipeline. In the first step (“tier 1”), reads were required to match to a unique location with no mismatches (parameter: -n 0). In the second step (“tier 2”), unmapped reads from step one were re-mapped with a relaxed mismatch parameter of one mismatch (parameter: -n 1). Reads that mapped to multiple locations or had more than one mismatch were discarded. Samtools version 0.1.19-44428cd (Li et al. 2009) was used to convert the sam files from each step to bam files, merge them into one file, and remove duplicate reads (defined as having the same chromosomal coordinates). Bedtools v2.17.0 (Quinlan and Hall 2010) was used to convert the bam files to bed files. Details on the number of reads in the input files and at each step are included in Supplementary Table 2.
Reads from the non-human samples were converted from their native genomic coordinates to hg19 coordinates using a three-step process that removed reads without a one-to-one relationship between the genomes. In each step, read coordinates were converted from one genome to the other using the UCSC liftOver software (Hinrichs et al. 2006) with a minMatch parameter of 0.8, which requires that at least 80% of the bases in a read map to the new genome. Note that this parameter filters only on genomic coverage, not sequence identity. In the first step, read coordinates were converted from their native genome to hg19. Read coordinates that successfully lifted to hg19 were then lifted back to the native genome. Read coordinates that did not lift back to the same coordinates on the native genome were removed. Reads that did lift back to the same coordinates were lifted back to hg19 for further processing. An additional filtering step was added to ensure the reads did not map to a duplicated region. In that step, overlapping reads on the native genome were merged into a region, and that region was lifted to hg19. Regions that failed to lift uniquely to hg19 were flagged and reads that overlapped them were removed. Because some of the samples were from males and some were from females (Supplementary Table 1), we removed reads that mapped or lifted over to the human X or Y chromosomes to eliminate any sex-specific bias. Details on the number of reads lifted over and remaining after removal of sex chromosomes are included in Supplementary Table 2. Phylogenetic trees were drawn with ggtree (Yu et al. 2016).
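The reciprocal filter described above can be sketched as follows. This is only an illustration under stated assumptions: the `lift` helper and the dictionary-based coordinate maps are hypothetical stand-ins for calls to the UCSC liftOver tool, and the separate duplicated-region filter is omitted.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end)
SEX_CHROMOSOMES = {"chrX", "chrY"}


def lift(interval: Interval, mapping: Dict[Interval, Interval]) -> Optional[Interval]:
    """Hypothetical stand-in for liftOver: converted interval, or None on failure."""
    return mapping.get(interval)


def reciprocal_filter(
    reads: List[Interval],
    native_to_hg19: Dict[Interval, Interval],
    hg19_to_native: Dict[Interval, Interval],
) -> List[Interval]:
    """Return hg19 coordinates of reads that lift to hg19 and back unchanged."""
    kept = []
    for read in reads:
        on_hg19 = lift(read, native_to_hg19)
        if on_hg19 is None:
            continue  # failed to lift to hg19
        if lift(on_hg19, hg19_to_native) != read:
            continue  # did not return to the same native coordinates
        if on_hg19[0] in SEX_CHROMOSOMES:
            continue  # drop reads that land on human chrX or chrY
        kept.append(on_hg19)
    return kept
```

Only reads that survive the round trip are retained, which is what enforces the one-to-one relationship between the two genomes.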
DHS Site Identification and Filters
To avoid bias due to large differences in depth of library sequencing, 20 million reads were randomly selected from samples with library sizes greater than 20 million reads. To determine the appropriate subset size, we generated subsets of 10, 15, 20, and 25 million reads and compared the average number of per-species DHS sites in each subset (Supplementary Figure 1).
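A minimal sketch of the down-sampling step, assuming reads are held in memory as a list; the function name, the seed, and the in-memory representation are illustrative choices, not details from the pipeline.

```python
import random


def subsample_reads(reads, target=20_000_000, seed=0):
    """Randomly keep `target` reads; libraries at or below the target are returned whole."""
    if len(reads) <= target:
        return list(reads)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return rng.sample(reads, target)
```

Sampling without replacement keeps each read at most once, so the subset remains a true subset of the original library.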
Peak calling was performed on each sample separately using the MACS2 callpeak command with an FDR cutoff of 5% (Zhang et al. 2008) (version 2.1.0.20150420; parameters: --nomodel --extsize 20 --qvalue .05). For each species, DHS sites were identified by taking the union set of peaks that were found in at least 2 out of 3 biological replicates. We removed DHS sites that overlapped the ENCODE blacklist (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeDacMapabilityConsensusExcludable.bed.gz) (Rosenbloom et al. 2013). To generate a master set of DHS sites to use for cross-species comparisons, we first took the union set of DHS sites identified in each species, then applied two filters. In the first filter, we removed DHS sites without at least 95% genomic coverage between the start and stop coordinates for each of the species. Genomic coverage was determined using the Multiz Alignment MAF file (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz100way/maf/) from the UCSC Genome Browser Multiz Alignment of 100 Vertebrates (Blanchette et al. 2004), the Galaxy MAF Coverage Stats tool at usegalaxy.org (Afgan et al. 2018), and custom perl scripts. Next, we assigned read counts to each DHS site using bedtools v2.17.0 (Quinlan and Hall 2010). In the second filter, we removed DHS sites without DNase-seq sequence reads in at least two biological replicates from each species because they may be indicative of regions that cannot be uniquely aligned. In other words, we expect at least some level of background DNase cutting across the genome. See Supplementary Table 3 for DHS site counts before and after each filtering step.
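One possible reading of the replicate rule above is sketched here: overlapping peaks from all replicates are merged, and a merged region is kept when at least two replicates contribute to it. The exact merging rule used in the study is not specified in the text, so the overlap definition (half-open intervals, any overlap) and the function name are assumptions.

```python
def replicated_peaks(replicates, min_support=2):
    """Merge overlapping peaks across replicates; keep regions seen in >= min_support replicates.

    `replicates` is a list (one entry per replicate) of (chrom, start, end) peaks.
    """
    tagged = sorted(
        (chrom, start, end, rep)
        for rep, peaks in enumerate(replicates)
        for chrom, start, end in peaks
    )
    merged = []  # each entry: [chrom, start, end, set of supporting replicates]
    for chrom, start, end, rep in tagged:
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start < last[2]:
            last[2] = max(last[2], end)  # extend the current region
            last[3].add(rep)
        else:
            merged.append([chrom, start, end, {rep}])
    return [(c, s, e) for c, s, e, reps in merged if len(reps) >= min_support]
```

For example, a peak at chr1:100-200 in one replicate and chr1:150-250 in another would yield a single site chr1:100-250, while a peak seen in only one replicate would be dropped.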
Principal Components Analysis
We performed a principal components analysis on the read counts from the 15 samples (Supplementary Figure 4). We first normalized the counts by library size then ran the R prcomp function with the center and scale parameters set to true. We also performed a principal components analysis on the read counts from the 15 samples and 3 additional chimpanzee samples from Pizzollo et al. 2018 (Supplementary Figure 4). See Supplementary Table 1 for details on the samples.
Differential Site Identification and Classification
The read counts for each DHS site were used as input to a custom R script that identified and classified differential DHS sites. To address the over-dispersion problem inherent in count-based sequencing data, we used the R package DSS (Wu et al. 2013) to calculate a dispersion parameter for each DHS site, as well as a normalization offset (based on total library size) for each sample. For each DHS site, the read counts, dispersion parameter, and normalization offset were fit using a negative binomial generalized linear model with a log link function. Specifically, we fit two models: a species informed model and a null model in which species was not predictive of normalized counts. The species informed model modelled the expected counts by $\log(\lambda_j) = \alpha_j + x_j^T \beta$, where $\lambda_j$ represents expected counts and $j$ indexes the sample. The design vector $x_j$ indicates to which species the $j$th sample belongs and $x_j^T$ denotes its transpose. This vector has 4 elements consisting of indicator functions of whether the sample is human, chimpanzee, gorilla, or orangutan. Because accessibility changes are relative to rhesus macaque, it is used as the intercept in the models and does not have an indicator function. Specifically, $x_j = (1, 0, 0, 0)^T$ if the $j$th sample is human; $x_j = (0, 1, 0, 0)^T$ if the $j$th sample is chimpanzee; $x_j = (0, 0, 1, 0)^T$ if the $j$th sample is gorilla; and $x_j = (0, 0, 0, 1)^T$ if the $j$th sample is orangutan. The vector $\beta = (\beta_h, \beta_c, \beta_g, \beta_o)^T$ parameterizes the change in expected counts between species and $\alpha_j$ is a normalization offset for sample $j$. The null model assumes the vector $\beta$ is zero and models the expected counts by $\log(\lambda_j) = \alpha_j$. These models were fit using the R function glm using negative.binomial as the family. The DSS normalization offset value was used for the offset parameter. The inverse of the DSS dispersion parameter value was used for the theta parameter. The difference in deviances between these two models was used to form a likelihood ratio test of whether the sites were differential.
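The likelihood ratio test described above can be illustrated for a single DHS site with a small pure-Python sketch. It is not the R script used in the study. It assumes a fixed theta (the inverse of the dispersion), a model with a free intercept as in a default glm fit (so each β is the difference in log mean between a species and rhesus macaque), and species as the only covariate, which lets each group be fit independently. All function names are hypothetical.

```python
import math


def nb_loglik(counts, mus, theta):
    """Negative binomial log-likelihood, omitting terms that do not depend on the mean."""
    return sum(
        y * math.log(mu / (mu + theta)) + theta * math.log(theta / (mu + theta))
        for y, mu in zip(counts, mus)
    )


def fit_log_mean(counts, offsets, theta):
    """Maximum likelihood estimate of b in log(mu_j) = b + offset_j, found by bisection."""
    def score(b):
        total = 0.0
        for y, a in zip(counts, offsets):
            mu = math.exp(b + a)
            total += (y - mu) / (1.0 + mu / theta)
        return total

    lo, hi = -30.0, 30.0  # the score decreases in b, so bisection brackets the root
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def chi2_sf_even(stat, df):
    """Upper tail of a chi-squared distribution; closed form valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("this sketch only handles even degrees of freedom")
    half = stat / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def species_lrt(counts, offsets, species, theta, reference="rhesus"):
    """Return (beta relative to the reference species, LRT statistic, p-value)."""
    b_null = fit_log_mean(counts, offsets, theta)
    ll_null = nb_loglik(counts, [math.exp(b_null + a) for a in offsets], theta)

    log_means, ll_full = {}, 0.0
    for name in sorted(set(species)):
        idx = [j for j, s in enumerate(species) if s == name]
        ys = [counts[j] for j in idx]
        offs = [offsets[j] for j in idx]
        b = fit_log_mean(ys, offs, theta)
        log_means[name] = b
        ll_full += nb_loglik(ys, [math.exp(b + a) for a in offs], theta)

    beta = {s: b - log_means[reference] for s, b in log_means.items() if s != reference}
    stat = max(0.0, 2.0 * (ll_full - ll_null))  # equals the difference in deviances
    return beta, stat, chi2_sf_even(stat, len(beta))
```

With five species the test has four degrees of freedom, one for each non-reference species, which is why the even-df closed form is sufficient here.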
A Benjamini-Hochberg correction was performed using the R function p.adjust. DHS sites with a corrected p-value of less than .01 were classified as differential.
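For readers unfamiliar with the correction, the standard Benjamini-Hochberg step-up adjustment can be sketched as below. The study used the R function p.adjust; this Python version is an illustrative equivalent of the standard procedure, and the function name is our own.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def is_differential(pvalues, cutoff=0.01):
    """Flag sites whose adjusted p-value falls below the cutoff."""
    return [p < cutoff for p in benjamini_hochberg(pvalues)]
```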
To determine which species (or combination of species) were different from rhesus macaque, 15 contrasts were constructed using the β values estimated in the regression model detailed above. Five contrasts were used to identify changes in a single species, six for changes in two species, and four for changes in three species (Supplementary Table 4). Note that the contrast for changes in rhesus macaque will also identify changes that occurred in the human-chimpanzee-orangutan-gorilla internal branch. A value for each contrast $c$ using constraint matrix $C$ and variance-covariance matrix $\mathrm{Var}(C\beta)$ was calculated as $c = [C\beta]^T [\mathrm{Var}(C\beta)]^{-1} C\beta$, where $[\mathrm{Var}(C\beta)]^{-1}$ denotes the matrix inverse of $\mathrm{Var}(C\beta)$. A p-value was calculated for each contrast using a chi-squared test and the contrast with the lowest p-value was chosen. For 30 DHS sites, none of the p-values were less than .01, so the change was marked as “other”. The sign of the β was used to determine whether each change was an increase (referred to as “increased accessibility”) or decrease (referred to as “decreased accessibility”) relative to rhesus macaque.
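The contrast statistic can be illustrated for the simplest case, in which the contrast is a single row vector, so that the quadratic form reduces to a squared estimate divided by its variance and is compared to a chi-squared distribution with one degree of freedom. The contrasts actually used are listed in Supplementary Table 4 and may have more than one row; the contrast vectors, names, and numbers below are hypothetical.

```python
import math


def wald_contrast(contrast, beta, cov):
    """Wald statistic and p-value for a single-row contrast (1 degree of freedom)."""
    n = len(contrast)
    estimate = sum(c * b for c, b in zip(contrast, beta))
    variance = sum(contrast[i] * cov[i][j] * contrast[j] for i in range(n) for j in range(n))
    stat = estimate ** 2 / variance
    # Chi-squared upper tail with 1 df equals erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))


def best_contrast(contrasts, beta, cov):
    """Return the name and p-value of the contrast with the lowest p-value."""
    results = {name: wald_contrast(c, beta, cov) for name, c in contrasts.items()}
    name = min(results, key=lambda k: results[k][1])
    return name, results[name][1]
```

If the lowest p-value is not below the chosen cutoff, the site would be left unclassified, which corresponds to the “other” category described above.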
Note that since we classified only the sites that were shown to be differential and we controlled the type one error rate in the differential site identification step, there is no need to further adjust our analyses for the multiple contrasts being considered here. In the statistical literature this is referred to as a “gateway” procedure as the differential site identification is a gatekeeper for site classification.
We have included supplementary files that contain the input data for the R script (glm_analysis.input_file.txt), the results from all of the analyses discussed in this paper (glm_output_and_analsyis_results.txt), and the field information for the input and output files (input_and_output_file_field_information.xlsx). Raw fastq files and bed files containing hg19 coordinates for the reads are available under GEO accession GSE129034. The reads, DHS sites, and differential analysis classifications are available in a UCSC Genome Browser session at http://genome.ucsc.edu/s/ledsall/2019primate. All scripts for the data processing and analyses described above are available at http://github.com/ledsall/primate.
Testing for Positive Selection and Determining Vertebrate Conservation
We performed selection analysis on both the differential and non differential DHS sites. We tested for selection on the species branches for human, chimpanzee, gorilla, and orangutan, and on the internal branches for human-chimpanzee and human-chimpanzee-gorilla. As the stochasticity of the evolutionary process may be elevated in short alignments, we expanded each DHS site that was smaller than 300 bases up to 300 bases, while maintaining the size of any DHS site longer than 300 bases. We removed any sites that could not be expanded due to gaps in the non-human genomes.
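The expansion step could be implemented along the following lines. The text does not state how the added bases were distributed, so this sketch assumes symmetric padding around the site and clamps at the start of the chromosome; both choices, and the function name, are assumptions.

```python
def expand_site(start, end, min_len=300):
    """Pad a site shorter than min_len to exactly min_len bases; leave longer sites unchanged."""
    length = end - start
    if length >= min_len:
        return start, end
    left_pad = (min_len - length) // 2  # split the padding evenly (assumption)
    new_start = max(0, start - left_pad)
    return new_start, new_start + min_len
```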
To investigate the extent of positive selection among the DHS sites, we used a branch-specific method we first developed in 2007 (Haygood et al. 2007) and recently improved (Berrio et al. under review). Briefly, the method uses a likelihood ratio test based on the maximum likelihood estimates obtained from HyPhy (Pond et al. 2005). The branch of interest (e.g., the human species branch) is used as the foreground and the rest of the tree is used as the background. The assumption for the background is the same for both the null and alternative models; specifically, neutral evolution and negative (purifying) selection are permitted, but positive selection is not. In the null model, the assumption for the foreground is the same as the one for the background. In the alternative model, all three modes of evolution are permitted (neutral evolution, negative selection, and positive selection) in the foreground. This method is highly sensitive and specific in that it can differentiate between positive selection and relaxation of constraint.
The method requires a 3kb reference alignment for each species that is used as a putatively neutral proxy for computing substitution rates. To generate this alignment, we first identified a set of functional regions on the human genome using annotations from the ENCODE project at UCSC (http://genome.ucsc.edu/encode/downloads.html) (ENCODE Project Consortium 2012) and annotations from the HoneyBadger2-intersect dataset from the ENCODE and Roadmap Epigenomics projects (https://personal.broadinstitute.org/meuleman/reg2map/HoneyBadger2-intersect_release/) (Roadmap Epigenomics Consortium et al. 2015). We used the set of 56,893 putative promoter regions, 1,598,323 putative enhancer regions, and 31,255 putative dyadic regions. We then masked the genomes using those functional regions, along with 5’ and 3’ UTRs, coding and non-coding RNAs, CpG repeats, microsatellite repeats, and simple repeats. Next, we extracted windows of 300 bases and excluded those with substitution rates that were too high or too low relative to the entire tree. Finally, we concatenated these windows until we reached a length of 3kb (Berrio et al. under review).
We used msa_split from the PHAST package (Hubisz et al. 2011) to extract query regions from the UCSC Genome Browser Multiz Alignment of 100 Vertebrates (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz100way/maf/) (Blanchette et al. 2004) for the human, chimpanzee, gorilla, orangutan, and rhesus macaque genomes. For each DHS site (called a query site), we used HyPhy (Pond et al. 2005) to fit the null and alternative models and generate maximum likelihood values. We used a custom R script to compute the likelihood ratio, which was used as a test statistic for a chi-squared test with one degree of freedom to calculate a p-value. We classified a DHS site as under positive selection if the p-value was less than .05. We were unable to run HyPhy successfully on 12 sites for unknown reasons and removed these regions from the analysis.
We found that the substitution rates of the orangutan branch and the internal branches for human-chimpanzee and human-chimpanzee-gorilla were extremely variable and caused an overestimation of selection in these branches and in the subset of non functional data. Therefore, we excluded these branches from the subsequent analyses and focused on the human, chimpanzee, and gorilla branches.
We then tested for significant enrichments of positive selection in different classes of DHS sites. The classes we investigated are 1) human-specific accessibility increases and decreases; 2) chimpanzee-specific accessibility increases and decreases; 3) gorilla-specific accessibility increases and decreases; and 4) non differential sites. We performed Fisher’s exact tests using a test statistic of the number of DHS sites classified as under positive selection. For the human-specific accessibility changes, we performed two tests: 1) the human species branch as the foreground compared to the chimpanzee species branch as the foreground; and 2) the human species branch as the foreground compared to the gorilla species branch as the foreground. For the chimpanzee-specific accessibility changes, we performed two tests: 1) the chimpanzee species branch as the foreground compared to the human species branch as the foreground; and 2) the chimpanzee species branch as the foreground compared to the gorilla species branch as the foreground. For the gorilla-specific accessibility changes, we performed two tests: 1) the gorilla species branch as the foreground compared to the human species branch as the foreground; and 2) the gorilla species branch as the foreground compared to the chimpanzee species branch as the foreground. For the non differential sites, we performed three tests: 1) the human species branch as the foreground compared to the chimpanzee species branch as the foreground; 2) the human species branch as the foreground compared to the gorilla species branch as the foreground; and 3) the chimpanzee species branch as the foreground compared to the gorilla species branch as the foreground. We used a Bonferroni correction to adjust for the multiple tests performed.
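To make the testing scheme concrete, a two-sided Fisher’s exact test on a 2x2 table, followed by a Bonferroni adjustment, can be sketched as below. The table layout (rows for the two foreground branches, columns for sites under positive selection versus not) and the function names are assumptions for illustration; the counts in the usage note are invented.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(a + b + c + d, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def bonferroni(p, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_tests)
```

For instance, `bonferroni(fisher_exact_two_sided(30, 170, 12, 188), 2)` would adjust a single comparison for the two tests performed within one class of sites.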
To visualize the strength of selection, we computed the statistic 𝜁 (zeta), the ratio of the substitution rate in each query to the substitution rate in the reference alignment; we computed 𝜁 for the human, chimpanzee, and gorilla species branches. This parameter is analogous to ω (omega), the ratio of dN/dS, where a value of ω < 1 indicates constraint or negative selection; a value of ω = 1 indicates neutrality; and a value of ω > 1 indicates positive selection.
We then tested whether the distributions of 𝜁 differed between classes of differential DHS sites. The classes we investigated are 1) human-specific increased accessibility; 2) chimpanzee-specific increased accessibility; 3) gorilla-specific increased accessibility; 4) orangutan-specific increased accessibility; 5) human-chimpanzee increased accessibility; 6) human-chimpanzee-gorilla increased accessibility; 7) human-specific decreased accessibility; 8) chimpanzee-specific decreased accessibility; 9) gorilla-specific decreased accessibility; 10) orangutan-specific decreased accessibility; 11) human-chimpanzee decreased accessibility; and 12) human-chimpanzee-gorilla decreased accessibility. We performed Wilcoxon tests on human-specific increased accessibility against the other classes of DHS sites with increased accessibility (classes 2-6) and used a Bonferroni correction to adjust for the multiple tests. Similarly, we performed Wilcoxon tests on human-specific decreased accessibility against the other classes of DHS sites with decreased accessibility (classes 8-12) and used a Bonferroni correction to adjust for the multiple tests. Finally, we performed a Wilcoxon test on the non differential sites against the non functional sites defined above.
To determine the amount of vertebrate conservation, we computed the median value of PhastCons scores for each DHS site using bedops (Neph et al. 2012), the UCSC 100-way PhastCons table (http://hgdownload.cse.ucsc.edu/goldenpath/hg19/phastCons100way) (Siepel et al. 2005; Pollard et al. 2010), and custom scripts. The PhastCons score represents the probability of a particular base being conserved. The values range from 0 to 1, with higher values representing an increased probability of conservation. We classified a DHS site as constrained if the median PhastCons score was above 0.9.
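The constraint classification reduces to a median and a threshold, as in this small sketch; the per-base scores are assumed to have been extracted already (for example with bedops), and the function name is illustrative.

```python
from statistics import median


def is_constrained(phastcons_scores, threshold=0.9):
    """Classify a DHS site as constrained if its median per-base PhastCons score exceeds the threshold."""
    return median(phastcons_scores) > threshold
```

Using the median rather than the mean means that a few poorly conserved bases within an otherwise conserved site do not change its classification.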
We then investigated whether the amount of conservation differed between differential and non differential DHS sites. We used the percentage of constrained DHS sites as our test statistic and performed a Fisher’s exact test on non differential sites compared to three classes of differential DHS sites; specifically, 1) human-specific accessibility changes; 2) chimpanzee-specific accessibility changes; and 3) gorilla-specific accessibility changes. We used a Bonferroni correction to adjust for the multiple tests.
Intersection with Human Putative Regulatory Annotations
We characterized each DHS site as a proximal element, distal element, or unannotated region using the HoneyBadger2-intersect dataset from the ENCODE and Roadmap Epigenomics projects (https://personal.broadinstitute.org/meuleman/reg2map/HoneyBadger2-intersect_release/) (Roadmap Epigenomics Consortium et al. 2015). We used the putative promoter and enhancer regions as above, but did not use the putative dyadic regions. We used bedtools2 (Quinlan and Hall 2010) to identify DHS sites that overlapped the annotated promoters (which we characterized as “proximal elements”) and enhancers (which we characterized as “distal elements”). DHS sites that did not overlap promoters or enhancers were characterized as unannotated regions.
Determining Cell-type Specificity
We characterized the cell-type specificity of each DHS site by intersecting it with DHS sites from 125 human cell types and tissues (Thurman et al. 2012). We used bedtools2 (Quinlan and Hall 2010) to find the overlap with the wgEncodeAwgDnaseMasterSites dataset from the UCSC Genome Browser (http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeAwgDnaseMasterSites/wgEncodeAwgDnaseMasterSites.bed.gz). We assigned a score to each DHS site representing its cell-type specificity. The score was calculated as $1 - N/126$, where $N$ represents the number of cell types and tissues in which the DHS site is present (including the fibroblast cell line from this study). The score ranges from 0 for a DHS site present in all tissues and cell types to $125/126$ (which is approximately 0.99) for a DHS site present in only our dataset.
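As a worked illustration of the score, the sketch below assumes it takes the form 1 - N/126, where 126 is the 125 ENCODE cell types and tissues plus the fibroblast line from this study. That form is an assumption chosen because it reproduces the stated range (0 when a site is present everywhere, approximately 0.99 when it is present only in our dataset); the constant and function names are our own.

```python
TOTAL_CELL_TYPES = 126  # assumption: 125 ENCODE cell types and tissues + this study's fibroblasts


def specificity_score(n_present, total=TOTAL_CELL_TYPES):
    """Cell-type specificity: 0 if present in every cell type, near 1 if present in only one."""
    if not 1 <= n_present <= total:
        raise ValueError("n_present must be between 1 and total")
    return 1 - n_present / total
```

Under this assumption, a site found in 63 of the 126 cell types would score 0.5.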
We then asked whether the distribution of cell-type specificity scores varied between different classes of DHS sites. We subset the DHS sites into those overlapping proximal elements and those overlapping distal elements. For each subset, we investigated the following classes: 1) human-specific increased accessibility; 2) chimpanzee-specific increased accessibility; 3) gorilla-specific increased accessibility; 4) orangutan-specific increased accessibility; 5) human-chimpanzee increased accessibility; 6) human-chimpanzee-gorilla increased accessibility; 7) human-specific decreased accessibility; 8) chimpanzee-specific decreased accessibility; 9) gorilla-specific decreased accessibility; 10) orangutan-specific decreased accessibility; 11) human-chimpanzee decreased accessibility; 12) human-chimpanzee-gorilla decreased accessibility; and 13) non differential sites.
Results
Method Development to Identify and Classify Differential DNase I Hypersensitive Sites Across Multiple Primate Species
We developed a negative binomial generalized linear model based method to allow us to quantitatively compare DNase I hypersensitive (DHS) sites across five primate species with one statistical test, rather than performing multiple pairwise comparisons. This greatly simplifies the analysis, scales well for additional replicates and species, and has straightforward p-value calculations. The output of this test indicates for each DHS site whether or not at least one species displays a statistically significant difference in read counts. In addition, the output from the model includes the vector $\beta = (\beta_h, \beta_c, \beta_g, \beta_o)^T$, representing the changes in expected counts compared to rhesus macaque for human, chimpanzee, gorilla, and orangutan respectively. We used both the value and the sign (positive vs. negative) of the β values to classify the type of difference between a given species and the outgroup, rhesus macaque (see Materials and Methods). DHS sites that are not differential have values near zero for all of the β values (Figure 1A). DHS sites that have increased accessibility in a particular species have a positive β value for that species with the other species having β values near zero (Figure 1B for human-specific increased accessibility; Supplementary Figure 2 for other species). DHS sites that have decreased accessibility in a particular species have a negative β value for that species with the other species having β values near zero (Figure 1C for human-specific decreased accessibility; Supplementary Figure 2 for other species). We also detect regions that have altered β values in more than one species (Figure 1D, Supplementary Figures 2 and 3).
Using this approach, we identified 89,744 total DHS sites that can be compared across all 5 species at 1:1:1:1:1 orthologous genomic regions (Materials and Methods). As a first step in analyzing these data, we carried out a principal components analysis and found that the first principal component separated the single Old World monkey (rhesus macaque) from the four great apes, while the second principal component recapitulated the phylogeny of the great apes (Supplementary Figure 4). Because we are drawing on the original data from Shibata et al. 2012 for three species and have generated new data for two species (Supplementary Table 1), we also investigated whether batch effects would overwhelm the species signal by comparing principal components analyses with and without three additional chimpanzee samples generated more recently by Pizzollo et al. 2018 (Supplementary Table 1). As shown in Supplementary Figure 4, the Pizzollo et al. chimpanzee samples cluster with the original Shibata et al. chimpanzee samples across the first four principal components (cumulative proportion of variance of 0.53), suggesting that biological signal is retained even when samples are prepared and sequenced years apart.
Of the 89,744 total DHS sites, 53,078 (59%) are not statistically significantly different between species, 23,926 (27%) display a difference that likely resulted from a single evolutionary event, and 12,710 (14%) display a difference due to multiple evolutionary events (Figure 1E, Table 1). Of the 23,926 DHS sites with changes that are likely due to a single event, about half (11,879) appear to have changed more recently and are detected only in a single species (human, chimpanzee, gorilla, or orangutan) (Figure 1F, Table 1). About one quarter of single event changes (5,738) appear to have occurred in either the human-chimpanzee internal branch or the human-chimpanzee-gorilla internal branch (Figure 1F, Table 1). The remaining single events (6,309) occurred either recently in rhesus macaque or longer ago in the human-chimpanzee-gorilla-orangutan internal branch (Figure 1F, Table 1). Because we are using rhesus macaque as the outgroup, we are unable to differentiate between changes in the rhesus macaque species branch and changes in the human-chimpanzee-gorilla-orangutan internal branch (see Discussion). Consistent with our earlier study (Shibata et al. 2012), as well as studies by other groups (Reilly et al. 2015; Villar et al. 2015; Emera et al. 2016), the majority of the changes are increased accessibility rather than decreased accessibility (see Discussion). For changes on the species branches, there are approximately 10x as many DHS sites with increased accessibility as DHS sites with decreased accessibility: 12:1 for human, 8:1 for chimpanzee, 11:1 for gorilla, and 6:1 for orangutan. For changes on the internal branches, the ratio of DHS sites with increased accessibility to DHS sites with decreased accessibility is much lower: 2:1 for the human-chimpanzee branch, and nearly equal for the human-chimpanzee-gorilla branch (Figure 1G, Table 1).
Changes in Chromatin Accessibility Detected in a Single Species
Using the methods described above, we identified 10,574 DHS sites with increased accessibility specific to a single species: 2,539 in human, 1,539 in chimpanzee, 2,737 in gorilla, and 3,759 in orangutan (Figure 2A, Table 1). We identified 1,305 DHS sites with decreased accessibility specific to a single species: 209 in human, 196 in chimpanzee, 254 in gorilla, and 646 in orangutan (Figure 2B, Table 1). Heatmap overviews (Figure 2A, 2B) of each class of increased accessibility and decreased accessibility show that these differences are not binary, but instead span the continuum from extremely large differences to those that represent more modest changes. Representative screenshots of individual genomic loci are provided (Figure 2A, 2B). A complete list of coordinates for all DHS sites, together with the β values and p-values that allow regions to be filtered at different cutoff stringencies, is included in Supplementary File glm_output_and_analsyis_results.txt.
Even though we can’t classify rhesus macaque-specific changes, we can identify sites where rhesus macaque is different from the other four species. We identified 4,992 sites that have increased accessibility in rhesus macaque relative to human, chimpanzee, gorilla, and orangutan (Figure 2A, Table 1). We identified 1,317 sites that have decreased accessibility in rhesus macaque compared to human, chimpanzee, gorilla, and orangutan (Figure 2B, Table 1).
Changes in Chromatin Accessibility that Likely Occurred on Internal Branches
Our method allows us to identify ancient changes in chromatin accessibility that likely occurred as a single change on internal branches. Using rhesus macaque as an outgroup, we can classify whether these internal changes are accessibility increases or decreases. We identified 1,735 DHS sites with increased accessibility and 814 DHS sites with decreased accessibility that likely occurred during the common lineage of human and chimpanzee (Figure 3, Table 1). We identified 1,736 DHS sites with increased accessibility and 1,453 DHS sites with decreased accessibility that likely occurred before the split between human, chimpanzee, and gorilla (Figure 3, Table 1).
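The increase-to-decrease ratios quoted above can be recomputed directly from the per-branch counts reported for the species branches and the two internal branches. The following is a minimal arithmetic sketch; the variable and function names are ours and are not part of the analysis pipeline.

```python
# Counts of DHS sites (increased, decreased) per branch, as reported in the text.
counts = {
    "human": (2539, 209),
    "chimpanzee": (1539, 196),
    "gorilla": (2737, 254),
    "orangutan": (3759, 646),
    "human-chimpanzee": (1735, 814),
    "human-chimpanzee-gorilla": (1736, 1453),
}

def increase_to_decrease_ratio(increased, decreased):
    """Number of sites with increased accessibility per site with decreased accessibility."""
    return increased / decreased

ratios = {branch: increase_to_decrease_ratio(*pair) for branch, pair in counts.items()}

for branch, ratio in ratios.items():
    print(f"{branch}: {ratio:.1f}:1")
```

Rounded to whole numbers, these values reproduce the ratios given for the four species branches and the human-chimpanzee branch, and the human-chimpanzee-gorilla ratio is close to 1.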
Multiple Changes in Chromatin Accessibility
In addition to detecting likely single changes in chromatin accessibility on either species (Figure 2) or internal branches (Figure 3), we also identified changes in chromatin accessibility that appear to have occurred multiple times, resulting in different combinations of chromatin accessibility patterns between species. There are many possible ways these differences could have happened and our method cannot determine if these changes resulted from multiple increases in accessibility, multiple decreases in accessibility, or a combination of increases and decreases (see Discussion).
We identified 6,339 DHS sites where two species display increased accessibility relative to rhesus macaque; 997 in human and gorilla; 997 in human and orangutan; 1,568 in chimpanzee and gorilla; 657 in chimpanzee and orangutan; and 2,120 in gorilla and orangutan (Figure 4A, Table 1). We identified 1,801 DHS sites where two species display decreased accessibility relative to rhesus macaque; 221 in human and gorilla; 424 in human and orangutan; 387 in chimpanzee and gorilla; 252 in chimpanzee and orangutan; and 517 in gorilla and orangutan (Figure 4B, Table 1).
There were 2,424 DHS sites where three species displayed increased chromatin accessibility relative to rhesus macaque; 794 in human, chimpanzee, and orangutan; 691 in human, gorilla, and orangutan; 939 in chimpanzee, gorilla, and orangutan (Figure 4C, Table 1). There were 2,146 sites where three species displayed decreased accessibility relative to rhesus macaque; 906 in human, chimpanzee, and orangutan; 554 in human, gorilla, and orangutan; 686 in chimpanzee, gorilla, and orangutan (Figure 4D, Table 1).
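The distinction between single and multiple events used in these sections can be summarized as a lookup on the set of great ape species that differ from rhesus macaque. The sketch below is only an illustration of that bookkeeping, under the simplifying assumption that all listed species differ from rhesus macaque in the same direction; the calls themselves come from the generalized linear model, not from this function, and all names here are hypothetical.

```python
from itertools import combinations

GREAT_APES = ("human", "chimpanzee", "gorilla", "orangutan")

# Species sets that one change on a single branch of the primate tree can explain.
SINGLE_EVENT_PATTERNS = {
    frozenset(["human"]): "human branch",
    frozenset(["chimpanzee"]): "chimpanzee branch",
    frozenset(["gorilla"]): "gorilla branch",
    frozenset(["orangutan"]): "orangutan branch",
    frozenset(["human", "chimpanzee"]): "human-chimpanzee internal branch",
    frozenset(["human", "chimpanzee", "gorilla"]):
        "human-chimpanzee-gorilla internal branch",
    # With rhesus macaque as the outgroup these two cases cannot be separated.
    frozenset(GREAT_APES):
        "rhesus macaque branch or human-chimpanzee-gorilla-orangutan internal branch",
}

def classify_pattern(differing_species):
    """Label the set of great apes whose accessibility differs from rhesus macaque."""
    species = frozenset(differing_species)
    unknown = species - frozenset(GREAT_APES)
    if unknown:
        raise ValueError(f"unexpected species: {sorted(unknown)}")
    if not species:
        return "non differential"
    return SINGLE_EVENT_PATTERNS.get(species, "multiple events")

# Enumerate every non-empty combination of the four great apes.
all_patterns = [c for r in range(1, 5) for c in combinations(GREAT_APES, r)]
multiple = [p for p in all_patterns if classify_pattern(p) == "multiple events"]
print(len(all_patterns), len(multiple))
```

Of the 15 possible non-empty patterns, eight fall into the multiple-event class, matching the five two-species and three three-species categories listed above.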
Comparison to Previous Study with Fewer Species
To test our new method for identifying differences in chromatin accessibility across five species, we compared our results with those from our previous study that used individual pairwise edgeR (Robinson et al. 2010) comparisons for human, chimpanzee, and rhesus macaque (Shibata et al. 2012). We used the same raw fastq files for the human, chimpanzee, and rhesus macaque samples. Due to updates in the analysis pipeline (Supplementary Table 5), not all of the DHS sites that were previously characterized were also identified as DHS sites in this study (Supplementary Table 5).
Of the DHS sites that were previously characterized by Shibata et al. as being human-specific increased accessibility, human-specific decreased accessibility, chimpanzee-specific increased accessibility, or chimpanzee-specific decreased accessibility, we find that 88-98% of our more recent set of DHS sites that overlap them displayed a consistent call (Table 2). The additional gorilla and orangutan DNase-seq data in this study allows us to fill in missing branch data and gauge the accuracy of our previous classification of human-specific or chimpanzee-specific changes. For 342 DHS sites that overlap sites previously characterized by Shibata et al. as being human-specific increased accessibility, 245 (72%) are still characterized as human-specific increased accessibility after including gorilla and orangutan, while 91 (27%) are now characterized as increased accessibility in human and at least one other species (Table 2). For 234 DHS sites that overlap previously called chimpanzee-specific increased accessibility, 105 (45%) are still characterized as chimpanzee-specific increased accessibility after including gorilla and orangutan, while 114 (49%) are now characterized as increased accessibility in chimpanzee and at least one other species (Table 2).
A similar trend was detected for previously identified human-specific or chimpanzee-specific decreased accessibility. For 146 DHS sites that overlap sites previously characterized as human-specific decreased accessibility, 21 (14%) are still characterized as human-specific decreased accessibility even after including orangutan and gorilla, while 109 (75%) are now characterized as decreased accessibility in human and at least one other species (Table 2). For 98 DHS sites that overlap sites previously called chimpanzee-specific decreased accessibility, 17 (17%) are still characterized as chimpanzee-specific decreased accessibility even after including orangutan and gorilla, while 69 (70%) are now characterized as decreased accessibility in chimpanzee and at least one other species (Table 2).
For 1,154 DHS sites that overlap sites Shibata et al. (2012) characterized as common between human, chimpanzee, and rhesus macaque, 95% (1,100) displayed a non-differential call among these same species using our pipeline (Table 2). After adding gorilla and orangutan, 90% (1,043) of the DHS sites are still characterized as non-differential, while 5% (57) displayed changes in accessibility on branches not considered by Shibata et al., namely gorilla, orangutan, or both (Table 2).
Together, these comparisons indicate that adding chromatin accessibility data from additional primate species allows us to identify a substantial subset of DHS sites that have experienced changes in chromatin accessibility across multiple species during evolution.
DHS Sites with Decreased Accessibility are Enriched for Proximal Elements and DHS Sites with Increased Accessibility are Enriched for Distal Elements
After identifying and classifying DHS sites, we next determined their location in the human genome relative to previously annotated proximal and distal elements. We used the HoneyBadger2 annotations (see Materials and Methods), which are predicted promoters or enhancers based on histone marks identified in human cells and tissues as part of the Roadmap Epigenomics project (Roadmap Epigenomics Consortium et al. 2015). We overlapped these annotations to characterize each DHS site identified in this study as a proximal element, distal element, or unannotated region.
For DHS sites that are not differential between primate species, 22% (11,850) of these regions overlap with proximal elements, 57% (30,371) overlap with distal elements, and the remaining 20% (10,857) are unannotated (Figure 5A, Supplementary Table 6). All classes of DHS sites with increased accessibility relative to rhesus macaque overlap proximal elements substantially less often than the non differential DHS sites (human: 2%; chimpanzee: 4%; gorilla: 9%; orangutan: 10%; human-chimpanzee: 3%; human-chimpanzee-gorilla: 5%) (Figure 5A, Supplementary Table 6). Conversely, DHS sites with decreased accessibility relative to rhesus macaque overlap proximal elements to a similar degree as non differential DHS sites (human: 18%; chimpanzee: 21%; gorilla: 11%; orangutan: 20%; human-chimpanzee: 34%; human-chimpanzee-gorilla: 10%) (Figure 5A, Supplementary Table 6). These results indicate that decreased accessibility changes are more likely to be associated with proximal elements, while increased accessibility changes are more likely to be associated with distal elements. In every category of accessibility changes, there are substantially more distal than proximal elements, which is consistent with other studies (Schmidt et al. 2010; Villar et al. 2015).
We note that all of these proximal and distal annotations are from human tissues, which allows us to make specific inferences about comparisons only to human. There is not yet a similar Roadmap effort for non-human primate species. Categories of accessibility increases that include human (human-specific, human-chimpanzee, and human-chimpanzee-gorilla) have the lowest amount of overlap with unannotated regions of the genome. DHS sites with increased accessibility specific to chimpanzee, gorilla, or orangutan all have much higher overlaps with unannotated regions, with orangutan-specific increased accessibility showing the highest degree of overlap with unannotated regions (Figure 5A, Supplementary Table 6). This is expected since orangutan is the most distantly related great ape species in our study. Similarly, we find that DHS sites with decreased accessibility in human (human-specific, human-chimpanzee, and human-chimpanzee-gorilla) have a higher overlap with unannotated regions compared to DHS sites with decreased accessibility specific to chimpanzee, gorilla, and orangutan. This is also expected since DHS sites with decreased accessibility in non-human primates will by definition have higher chromatin accessibility signals in human fibroblasts.
Evolutionary Changes in Accessibility are Associated with Cell-type Specificity
We calculated cell-type specificity (see Materials and Methods) for the union set of DHS sites detected in primate fibroblasts by comparing them to a much larger set of DHS sites detected in 125 different human cell and tissue types (Thurman et al. 2012). A cell-type specificity score close to 1 indicates the DHS site is present in only a few of the 125 tissues and cell types, while a score near 0 indicates that the DHS site is present in almost all of the 125 tissues and cell types.
As with the proximal and distal annotations, we can make inferences about evolutionary changes in chromatin accessibility only for DHS sites that overlap the human annotations. The union set of the DHS sites we identified shows a continuum of cell-type specificity scores when compared with DHS sites from different human cell types (Figure 5B). Of our DHS sites, 1,914 (2%) overlapped DHS sites found in all 125 tissues and cell types, while 3,873 (4%) were not found in any of the previously tested tissues and cell types.
We then analyzed the distribution of cell type specificity scores in distal and proximal DHS sites that displayed changes in chromatin accessibility. In general, distal elements have higher specificity scores than proximal elements (Figure 5C,D), consistent with previous studies (Thurman et al. 2012).
For proximal elements showing increases in accessibility, tissue specificity is higher on all four species branches than on the two internal branches (one-sided Wilcoxon test comparing pooled distributions of external vs internal; P = 2.03×10^-28) (Figure 5C). The opposite pattern is evident for decreases in accessibility (one-sided Wilcoxon test comparing pooled distributions of external vs internal; P = 3.52×10^-32) (Figure 5C). Since all changes on the internal branches are more ancient than those on external branches, this result hints at the possibility that degree of chromatin accessibility is positively correlated with broader utilization across cell types. One possible explanation is that increases in chromatin accessibility raise the likelihood that a proximal regulatory element is co-opted for use by another tissue. The same trends are observed for distal elements, with the exception of distal sites having higher tissue specificity scores, which is expected since distal chromatin accessible sites are more likely to be cell type specific than proximal elements (Figure 5D).
For proximal elements showing changes in chromatin state, the human branch shows lower cell type specificity compared to the three other species for increases (one-sided Wilcoxon test with Bonferroni correction; P_H:C = 1.62×10^-5; P_H:G = 1.11×10^-6; P_H:O = 4.69×10^-11) and higher cell type specificity for decreases (one-sided Wilcoxon test with Bonferroni correction; P_H:C = 0.004; P_H:G = 0.036; P_H:O = 0.004) (Figure 5C). The same pattern is present for distal elements, both for increases and decreases (one-sided Wilcoxon test with Bonferroni correction; Increases: P_H:C = 4.55×10^-70; P_H:G = 1.18×10^-44; P_H:O = 3.47×10^-133; Decreases: P_H:C = 1.11×10^-6; P_H:G = 2.54×10^-3; P_H:O = 2.38×10^-8) (Figure 5D). This may reflect an ascertainment bias arising from relying on human tissue comparisons for the cell type specificity score.
Species-specific Chromatin Changes are Enriched in Positive Selection
To investigate the evolutionary significance of species-specific changes in chromatin accessibility, we tested each DHS site for signatures of positive selection on the human, chimpanzee, and gorilla branches separately (see Materials and Methods). Testing for positive selection required additional filtering of DHS sites (see Materials and Methods), resulting in a reduced set of 87,431 DHS sites used in this analysis. The figure of merit in these analyses is ζ (zeta), the ratio of substitution rates within a DHS site on a given branch relative to the rest of the tree in comparison to that ratio for a collection of proxy neutral sites (Wong and Nielsen 2004; Haygood et al. 2007; Haygood et al. 2010). Similar to the analogous and more familiar ω (omega), high values of ζ indicate positive selection, values near 1 indicate neutrality, and low values indicate negative selection.
Putative non functional elements display a relatively tight distribution of ζ on the human branch centered around 1 (Figure 6A), confirming they are a good proxy for neutral evolution in non-coding regions of the genome. Non differential DHS sites (those where the degree of chromatin accessibility does not change on any branch) have a distribution of ζ on the human branch that is centered significantly below 1 (one-sided Wilcoxon test; P = 1.57×10^-283) (Figure 6A), consistent with ongoing negative selection. Additionally, the distribution of ζ values is much broader for non differential DHS sites compared to putative non functional sites, with a small fraction showing elevated substitution rates on the human branch that are consistent with positive selection.
DHS sites that showed a change in chromatin accessibility on the human branch have positively-shifted distributions of ζ on the human branch relative to non differential DHS sites (Figure 6A). This suggests that both increases and decreases are accompanied by enrichment for a combination of relaxed selection and positive selection on the same branch. As expected, this enrichment is less pronounced when the accessibility change occurs on a different branch of the phylogeny: the distributions of ζ on the human branch are higher when the chromatin state change occurred on the human branch rather than the gorilla or orangutan branches, and this is true for both increases and decreases (one-sided Wilcoxon test with Bonferroni correction; Increases: P_H:C = 0.12, P_H:G = 2.17×10^-10, P_H:O = 6.39×10^-6, P_H:H-C = 9.36×10^-13, P_H:H-C-G = 1.40×10^-10; Decreases: P_H:C = 0.006, P_H:G = 4.13×10^-5, P_H:O = 2.26×10^-5, P_H:H-C = 0.03, P_H:H-C-G = 1.92×10^-4), although these differences are all modest in magnitude (Figure 6A).
For the human-specific accessibility changes, we tested for enrichment of positive selection of these regions on the human branch relative to either the chimpanzee or gorilla branches. We performed a similar comparison for the chimpanzee-specific accessibility changes by testing for enrichment of positive selection on the chimpanzee branch relative to either the human or gorilla branches. Finally, we tested the gorilla-specific accessibility changes for enrichment of positive selection on the gorilla branch relative to the human or chimpanzee branches (Figure 6B). None of the Fisher’s exact tests were significant after Bonferroni correction, even though the data trend in the expected direction (e.g., human-specific changes show more positive selection on the human branch).
As a control, non differential DHS sites show no significant differences in positive selection between branches (Figure 6B). Additionally, putatively non functional sites in the genome do not display an enrichment of positive selection (Figure 6B). These results suggest that evolutionary changes in chromatin accessibility between species are phylogenetically correlated with an enrichment of positive selection.
Next, we investigated the converse: whether signatures of positive selection on individual regulatory elements are generally limited to branches where the state change occurred. This is clearly not the case: DHS sites with increased accessibility on the human branch show positive selection on the chimpanzee branch almost as often as on the human branch (Figure 6C). The same pattern is evident for increased accessibility on the chimpanzee branch and for decreased accessibility on the human or chimpanzee branch (Figure 6C). These results suggest that positive selection often acts on the DNA sequence of DHS sites in ways that do not affect chromatin state. Interestingly, some DHS sites show evidence of positive selection on both branches (Figure 6C). Given that 1.96% of DHS sites contain signatures of positive selection on the human branch, this result suggests that positive selection is not distributed randomly across the regulatory genome, but occurs repeatedly in a very small percentage of sites (0.12%). On average, 5.4% of DHS sites that show species-specific accessibility changes for human, chimpanzee, and gorilla are highly constrained in sequence evolution, with nucleotide substitution rates that are significantly lower than the neutral expectation across vertebrates (Figure 6D). In contrast, DHS sites that do not change state are enriched for highly constrained sites in comparison to species-specific changes for human, chimpanzee, and gorilla (Fisher’s exact test, one-sided with Bonferroni adjustment for 3 comparisons; P_ND:H = 9.10×10^-6; P_ND:C = 2.42×10^-6; P_ND:G = 3.18×10^-4) (Figure 6D). Together, these results suggest that positive selection contributes to species-specific chromatin accessibility increases and decreases, while purifying selection contributes to the conservation of non differential DHS sites.
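For readers unfamiliar with the enrichment tests used in this section, a one-sided Fisher's exact test followed by a Bonferroni adjustment can be sketched in a few lines. This is an illustration only: the 2x2 table and the raw P values below are toy numbers, not data from this study, the function names are ours, and the analysis software actually used may differ.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns the probability, with all margins fixed, of observing a count
    of at least `a` in the top-left cell (enrichment in the first row).
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(total - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / comb(total, col1)

def bonferroni(p_values):
    """Multiply each P value by the number of tests, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Toy table, NOT data from this study.
# Rows: (class of interest, comparison class); columns: (constrained, not constrained).
p_toy = fisher_exact_greater(3, 1, 1, 3)

# Three hypothetical raw P values standing in for three comparisons.
adjusted = bonferroni([0.01, 0.2, 0.5])
print(p_toy, adjusted)
```

The Bonferroni step is deliberately conservative: with three comparisons, a raw P value must fall below roughly 0.017 to remain below 0.05 after adjustment.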
Discussion
We developed a new method to identify regions of differential chromatin accessibility among multiple species, based on a negative binomial generalized linear model. This method does not rely on thresholding and is therefore able to detect subtle differences in degree of chromatin accessibility that are obscured using conventional approaches. Furthermore, unlike methods based on pairwise comparison, our method is readily scalable to multiple taxa. Here we applied it to DNase-seq data from cultured skin fibroblasts obtained from five species. While the majority of DHS sites were not quantitatively distinct between species, we identified over 35,000 DHS sites with significant differences in chromatin accessibility between human, chimpanzee, gorilla, orangutan, and rhesus macaque. Of those, approximately 65% are likely the result of a single evolutionary change in chromatin state that occurred on either an internal or external branch, while the remainder imply multiple evolutionary changes in state.
Our results are largely congruent with our earlier study (Shibata et al. 2012) that used a threshold-based multiple pairwise comparison approach and considered three primate species (human, chimpanzee, and rhesus macaque). Here, the use of five species provides additional confidence in the identification of species-specific accessibility changes and also allows for the identification of changes that likely occurred multiple times throughout evolution. For these multiple events, the method we developed does not characterize how exactly they occurred. That characterization will be the subject of future work using a likelihood analysis that incorporates the phylogenetic information and models evolutionary processes (Felsenstein 1973; Hansen 1997; Felsenstein 2008; Paradis and Schliep 2019).
As mentioned in the results, we identified substantially more accessibility increases than decreases. It seems in principle unlikely that increases and decreases in accessibility actually occur at such different rates, since, if true, primate genomes would eventually become saturated with open chromatin regions. The same asymmetry was observed previously by us (Shibata et al. 2012) and other groups (Villar et al. 2015; Reilly et al. 2015; Emera et al. 2016) using conventional pairwise comparisons and thresholding, so the source is unlikely to lie in the method used to identify chromatin state changes. Instead, it seems likely that the asymmetry is an ascertainment bias that derives, in part, from unequal statistical power to call gains and losses, though the exact basis of the bias is currently unclear.
Finding so many DHS site differences in non-human primates is a fascinating result with implications for understanding the evolution of transcriptional regulation. Nevertheless, the results describing cell-type specificity should be interpreted carefully. One possible non-biological explanation for such enrichment is ascertainment bias in our analyses, because the cell-type specificity score is based entirely on data from human, a limitation imposed by the lack of relevant data from other primate species. Additionally, many of the chromatin accessibility datasets analyzed by Thurman et al. that we used to calculate the cell-type specificity scores are from fibroblast-like cells. Some of the DHS sites that we identified as being used in many cell types may in fact be used only in fibroblast-like tissues. Although the patterns of positive selection that we detected are consistent with expectations, none of the tests found statistically significant enrichment on the human, chimpanzee, or gorilla branches. This may be due to our method of positive selection detection relying on human functional annotations to identify proxy neutral regions, which may result in a loss of power with increasing phylogenetic distance.
Interestingly, our results suggest that DHS sites are not homogeneous from either a functional or an evolutionary perspective. Those near transcription start sites (including likely core promoter regions) differ from DHS sites that are distant (including classic enhancers and other kinds of distal elements) in several regards. DHS sites that show no state changes across the species surveyed are enriched in conserved nucleotides, consistent with greater functional significance. Compared with proximal DHS sites, gains in chromatin accessibility in distal sites are more likely to show same branch-specific signatures of positive selection, as might be expected if these DHS sites are contributing to changes in gene regulation. These and other trends we observed suggest that functional constraints and opportunities differ markedly among classes of DHS sites. Additional studies will be needed to delineate these distinct classes of likely regulatory elements and to understand how evolutionary mechanisms operate on their chromatin state and underlying DNA sequence.
Functional characterization studies will be necessary to understand these regions and their contribution to species-specific gene expression patterns and organismal traits. High-throughput reporter assays such as MPRA (Klein et al. 2018) and population STARR-seq (Vockley et al. 2015) can quantify the impact of these differentially utilized regulatory regions, as well as variants within these regions. In addition, methods such as CRISPR (Diao et al. 2017; Klann et al. 2017) can characterize the impact of these regions in their natural context, including identifying the correct target gene(s) for these regulatory elements. Finally, additional replicates from these species can provide characterization of variability within each species. While most additional tissues cannot be obtained from primate species, generation of induced pluripotent stem cells (iPSCs) followed by differentiation (Gallego Romero et al. 2018) will provide insights into how these differential chromatin signals translate into different cell types across many species.
While we used our negative binomial generalized linear model method to identify differences in chromatin accessibility between five species, we believe this strategy can be used for quantitative comparisons across many tissues, cell types, and time-series experiments. In addition to DNase-seq, we expect this method can be readily applied to any count-based data type such as RNA-seq, ATAC-seq, and ChIP-seq. We also note that this method is scalable, as increasing the number of groups or replicates requires simply changing the design matrix and contrasts.
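To illustrate why adding groups or replicates only requires changing the design matrix and contrasts, the sketch below builds both for a hypothetical layout. It assumes three replicates per species and one indicator column per group with no intercept; neither assumption is taken from this study, the one-versus-rest contrast is just one possible choice, and the model fitting itself (the negative binomial generalized linear model) is not shown.

```python
def design_matrix(sample_groups, groups):
    """One indicator column per group (cell-means coding, no intercept)."""
    return [[1 if g == col else 0 for col in groups] for g in sample_groups]

def one_vs_rest_contrast(target, groups):
    """Contrast testing the target group's coefficient against the mean of the others."""
    n_others = len(groups) - 1
    return [1.0 if g == target else -1.0 / n_others for g in groups]

groups = ["human", "chimpanzee", "gorilla", "orangutan", "macaque"]

# Hypothetical layout: three replicates per species (an assumption for illustration).
sample_groups = [g for g in groups for _ in range(3)]

X = design_matrix(sample_groups, groups)
human_vs_rest = one_vs_rest_contrast("human", groups)

# Adding a sixth group changes only the inputs, not the functions.
groups6 = groups + ["new_species"]
samples6 = sample_groups + ["new_species"] * 3
X6 = design_matrix(samples6, groups6)

print(len(X), len(X[0]), len(X6), len(X6[0]))
print(human_vs_rest)
```

In practice these objects would be passed to whatever count-based modeling package is used for fitting and testing.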
Acknowledgments
We would like to thank Terry Gaasterland for her help developing the tiered mapping approach. We thank the Duke Sequencing and Genomic Technologies Shared Resource for sequencing. This work was supported by the National Science Foundation [HOMINID 0827552 to G.A.W]; the National Institute of Mental Health at the National Institutes of Health [5R01MH105472 to G.E.C and G.A.W]; and a generous donation from Dr. Howard Clark.
Footnotes
Data deposition: The project has been deposited at NCBI GEO under the accession number GSE129034