Abstract
Summary De novo mutations in hundreds of different genes collectively cause 25-42% of severe developmental disorders (DD). The cause in the remaining cases is largely unknown. The role of de novo mutations in regulatory elements affecting known DD associated genes or other genes is essentially unexplored. We identified de novo mutations in three classes of putative regulatory elements in almost 8,000 DD patients. Here we show that de novo mutations in highly conserved fetal-brain active elements are significantly and specifically enriched in neurodevelopmental disorders. We identified a significant two-fold enrichment of recurrently mutated elements. We estimate that, genome-wide, de novo mutations in fetaLbrain active elements are likely to be causal for 1-3% of patients without a diagnostic coding variant and that only a small fraction (<2%) of de novo mutations in these elements are pathogenic. Our findings represent a robust estimate of the contribution of de novo mutations in regulatory elements to this genetically heterogeneous set of disorders, and emphasise the importance of combining functional and evolutionary evidence to delineate regulatory causes of genetic disorders.
Main Text
The importance of non-coding variation in complex disease is well established–the majority of disease-associated common SNPs lie in intergenic or intronic regions, albeit with low effect sizes1,2. Rare sequence and structural variants in relatively few regulatory elements have been causally linked to specific Mendelian disorders3,4 (reviewed in Mathelier et. al, 2015 and Spielmann and Mundlos, 2016), most of which act dominantly, with a few exceptions5. These pathogenic regulatory variants can act by loss-of-function 6–9 or gain-of-function 10,11. These regulatory elements can lie far from the gene they regulate. For example, a set of sequence variants in an evolutionarily conserved regulatory element located 1Mb from its target gene, SHH, can cause polydactyly10. As a consequence, in both complex and Mendelian disorders it can often be challenging to identify the gene whose regulation is being perturbed by an associated regulatory variant12–14. Moreover, the contribution of highly penetrant mutations in regulatory elements to highly genetically heterogeneous rare diseases, such as neurodevelopmental disorders, has not been firmly established.
We recruited 7,930 individuals with a severe, undiagnosed developmental disorder, and their parents to the Deciphering Developmental Disorders (DDD) study from clinical genetics centres in the UK and Ireland. Systematic clinical phenotyping15 identified 79% with intellectual disability (87% with more broadly defined neurodevelopmental disorders). Congenital heart defects were the most prevalent non-neurodevelopmental phenotype recorded, present in 10% of the cohort. Exome sequencing of these families revealed that damaging de novo mutations (DNMs) in known DD-associated genes are present in ~25% of probands, accounting for the majority of diagnostic variants16,17, and that an additional 17% of probands carry pathogenic DNMs in genes not yet robustly associated with DDs17. Thus the majority of the probands do not carry a diagnostic variant in a protein-coding gene, and are termed ‘exome-negative’. To explore the role of DNMs in non-coding, likely regulatory, elements we performed targeted sequencing on three classes of putative regulatory elements: the 4,307 most evolutionarily conserved non-coding elements (CNEs)18, 595 experimentally-validated enhancers19, and 1,237 putative heart enhancers20, together covering 4.2Mb of sequence. Furthermore, we define a set of ‘control’ non-coding elements covering 6.03Mb by analysing intronic sites at least 10bp away from the nearest exon that are targeted by exome baits and were sequenced to >30X depth (see Methods).
Selective constraint acting on non-coding elements
We first assessed how much purifying selection had skewed allele frequencies in non-coding elements. We used the mutability-adjusted proportion of singletons (MAPS) metric21 in 7,080 unrelated, unaffected DDD parents to test six different element classes: introns, heart enhancers, validated enhancers, CNEs, protein-coding genes, and known DD-associated genes. Both the introns and the heart enhancers, like synonymous variants in proteinucoding genes, show little evidence of purifying selection (Figure 1a). The experimentally-validated enhancers and CNEs are constrained to a similar degree to all protein-coding genes, but less than known DD-associated genes (Figure 1a). Statistical power to detect functionally relevant variants in protein-coding genes is strengthened considerably by stratification of variants on the basis of their likely impact on the encoded protein (e.g. missense and truncating variants) and using metrics that predict variant deleteriousness such as CADD22. We used the MAPS framework to assess whether deleteriousness metrics were predictive of selective constraint in the targeted non-coding elements. We computed the MAPS within bins of CADD scores encompassing 1,520,250 variants in unaffected DDD parents. In protein-coding genes, the strong correlation between CADD score and strength of purifying selection enabled differentiation between variants that are neutral, weakly constrained and very highly constrained. In CNEs and validated enhancers, CADD differentiates neutral variation from those under weak constraint (similar to missense level in the protein-coding regions), but failed to identify highly deleterious variants with selective constraint on a par with protein truncating variants (Figure 1c). Other deleteriousness metrics were assessed, but none were more informative than CADD (Figure S1). These findings replicate across the entire non-coding genome in an independent sample of 3,781 UK whole genomes sequenced to 7x coverage as part of the UK10K project23 (Figure S2).
Functional genomic annotation of putative regulatory elements provides additional information to distinguish elements and sites that are more likely to be functionally relevant for DD-causation. We used DNase I hypersensitivity sites (DHS) in 39 tissues and chromHMM genome segmentation predictions in 111 tissues24 to predict tissue activity for the targeted non-coding elements. Of the 4,307 CNEs sequenced in this analysis, 4,046 (93.9%) were active in at least one of the 111 surveyed tissues while 261 (6.1%) were inactive or repressed in all tissues. Most (n=212) of these 261 inactive CNEs exhibited polycom-mediated repression in several tissues suggesting they are active in tissues not yet surveyed. Nonetheless, active CNEs were under stronger selective constraint than inactive CNEs, despite no difference in evolutionary conservation (Figure S3). Likewise, variants within a DHS peak in at least one tissue were under stronger purifying selection than variants not overlapping a DHS peak (p = 0.019) (Figure 1d). Within the set of active elements, we did not find any significant difference in purifying selection in elements active in a single tissue compared to those active in multiple tissues (Figure S3), nor did we identify significant differences in selective constraint in the targeted elements between tissues
Enrichment of de novo mutations in non-coding elements
We identified candidate de novo single nucleotide mutations in 7,930 trios (Methods). We identified 1,691 ‘exome-positive’ individuals with a likely pathogenic protein-altering DNM or inherited variant in a known DD-associated gene, with the remaining 6,239 being ‘exome-negative’. Using a previously described model for germline mutation25, we compared the numbers of observed and expected DNMs in the different sets of non-coding elements in exome-positive and exome-negative trios. No significant DNM enrichment was observed in exome-positive probands for any of the sets of non-coding elements, demonstrating that the mutation model is well-calibrated for our data (Figure S4). By contrast, we found that the targeted CNEs (irrespective of their tissue activity) are nominally significantly enriched for DNMs in exome-negative individuals (422 observed, 388 expected, p = 0.04), whereas experimentally-validated enhancers (153 observed, 156 expected, p = 0.596), heart enhancers (86 observed, 86 expected, p = 0.511), and intronic controls (901 observed, 920 expected, p = 0.719) were not enriched (Figure 2). Even when focusing on the minority of probands with heart phenotypes, no enrichment for DNMs in heart enhancers was observed. Subsequent analyses focus exclusively on exome-negative trios.
Given the preponderance of individuals with neurodevelopmental disorders in our cohort but the broad range of the tissue activity of the targeted non-coding elements, we refined our testing for DNM enrichment by focusing on CNEs active in fetal brain. We observed a strong significant enrichment of DNMs within 2,077 fetal brain DHS peaks in CNEs (177 observed, 138 expected, p = 0.002) but no enrichment sites in CNEs falling outside of fetal brain DHSs (245 observed, 249 expected, p = 0.623) (Figure 2). We also used chromHMM26 predictions of fetal brain activity, which incorporate a broader range of functional assays, and again identified a significant enrichment of DNMs in the 2,613 fetal brain active CNEs (286 observed, 246 expected, p = 0.007) but not in the fetal brain inactive (quiescent, heterochromatin, or polycomb-repressed) elements (136 observed, 141 expected, p = 0.692) (Figure 2). Moreover, the DNMs observed in fetal brain active CNEs in exome-negative probands were at more highly conserved sites (Wilcoxon rank sum test on PhyloP 100-way score27) compared to DNMs observed in exome-positive probands (Figure S5). The absence of DNM enrichment in fetal brain inactive elements suggests that modulation of existing enhancer activity in the brain is more likely to cause developmental disorders than gain of ectopic brain activity for elements active in other tissues. The excess of DNMs observed in fetal brain active CNEs and within fetal brain DHS peaks is concentrated exclusively within the 79% of ‘exome-negative’ probands with one or more neurodevelopmental phenotypes (fetal brain DHS peaks: p = 2.2e-4, fetal brain active by chromHMM: 238 observed, 194 expected, p = 9.8e-4), with no significant enrichment observed in those without neurodevelopmental phenotypes (fetal brain DHS: p=0.426; fetal brain active by chromHMM: p=0.671) (Figure 2). In summary, we identified a highly significant and specific enrichment of DNMs in fetal brain active CNEs in exome-negative probands with neurodevelopmental disorders
Having demonstrated a specific enrichment of DNMs in fetal brain active CNEs in probands with neurodevelopmental disorders, we re-evaluated the subset of experimentally-validated enhancers with functional evidence for activity in fetal brain (N=383) in these probands and observed an enrichment for DNMs only within the top quartile of evolutionary conservation (18 observed, 9 expected, p = 0.01 within fetal brain DHS) (Figure S6), suggesting that even for experimentally validated fetal brain enhancers, the DNM enrichment is concentrated within elements with strong evolutionary conservation.
We assessed four different methods for gene target prediction: Genomicus28 (based on evolutionary synteny), DHS-expression correlation29, Hi-C in fetal brain30 and simply choosing the closest gene. The proportion of fetal brain active CNEs for which a prediction could be made varied from 26% (fetal brain Hi-C) to 100% (closest gene). The pairwise-concordance between the four approaches was low: at most ~ 35% between any two methods (Figure S7). We did not identify any enrichment for DNMs in elements predicted to target known DD genes, or likely dosage sensitive genes (defined by the 'probability of loss of function intolerance' (pLI) metric21), or genes that are more highly expressed in the brain compared to other tissues (Methods, Figure S7)
We identified 45 transcription factors (TFs) whose motifs were significantly enriched in fetal brain active CNEs (Methods), are expressed in the brain and for which position-weight matrices are available31. We assessed the impact of DNMs on predicted TF binding, and observed nominally significant increases in DNMs predicted to increase affinity and overlap of binding sites, but none of these results survived correction for multiple testing (Figure S8). Given the number of DNMs we have identified, and the relative immaturity of in silico predictions of the impact of non-coding variation on gene expression, it is not currently possible to determine the mechanisms by which these DNMs contribute to DDs.
To explore the likely penetrance associated with the observed DNM enrichment in the targeted non-coding elements, we investigated potential over-transmission of inherited rare variants in these elements to affected children. No class of non-coding element classes showed any evidence for overtransmission (Figure S9). Furthermore, we did not detect any enrichment for other variants in the same element as the DNM that would suggest the DNM is acting as a second hit to an already 'perturbed' haplotype. These observations are consistent with the enrichment of DNMs in fetal brain active CNEs comprising a mixture of 70-80% non-pathogenic DNMs and 20-30% of highly penetrant dominantly-acting DNMs.
Recurrently mutated regulatory elements
We observed a significant excess of recurrently mutated fetal brain active CNEs (two or more DNMs in unrelated individuals) compared to the expectation under the null mutation model (31 observed, 15 expected, p = 1.2e-4) (Figure 3a). However, no individual CNE exceeds a conservative genome-wide significance threshold of p<1.91e-5 (Bonferroni-corrected p-value based on independent tests for enrichment across 2,613 fetal brain active elements) (Figure 3b). We did not observe any significant clustering of DNMs within recurrently mutated elements, which might have highlighted disruption of specific binding sites or functional sub-regions within each element.
Increased power to detect locus-specific enrichments of DNMs could, in principle, be gained from aggregating DNMs across elements regulating the same target gene, analogous to the aggregation of DNMs in distinct exons of the same protein-coding gene. However, as described above, gene target prediction is currently lacking in both coverage and accuracy. CNEs have been shown to cluster together within the genome32, largely due to multiple regulatory elements acting on the same target gene. Moreover, CNEs are strongly enriched around developmentally important genes32. Therefore we clustered CNEs spatially to identify groups of CNEs likely to be regulating the same target gene, while remaining agnostic as to the identity of that gene. We applied hierarchical clustering on the 2,613 fetal brain active CNEs to identify 356 clusters of two or more fetal brain active CNEs (Methods). We found an excess of recurrently mutated clusters, defined as two or more elements with at least one DNM in each element, (11 observed, 6 expected, p = 0.016). As with individual elements, we did not find any element clusters with a significant excess of DNMs at a genome-wide significance threshold (Table S1). We compared phenotypic similarity between probands with DNMs in the same CNE or CNE-cluster and randomly sampled probands (Methods) and found probands with DNMs in the same element or cluster to be more phenotypically similar than expected by chance (p = 0.013, one-sided KS-test comparing observed p-values to expected uniform distribution)
We used chromHMM26 to assign the recurrently mutated CNEs to a predicted chromatin state. We observed the greatest excess of DNMs in CNEs predicted to be enhancers (N=9) or strongly/weakly transcribed (N=8). Five of the eight transcribed recurrently mutated elements fall in close proximity to exons, but are not in protein-coding transcripts and show evidence for involvement in alternative splicing or induction of nonsense-mediated decay (BCLAF1, SRRT, SLC10A7, and MKNK1) or as a 3’ UTR (CELF1) (Figure S10). The remaining three transcribed recurrently mutated elements overlap DHS peaks and may be weakly transcribed enhancers or long non-coding RNAs. The full set of recurrently mutated elements is described in Table 1 and the location of DNMs, population variation, and additional annotations are shown in Figure S11.
Estimating the genome-wide burden of DNMs in fetal brain active elements
The absence of individual non-coding elements with a genome-wide significant enrichment of DNMs allowed us to place an upper bound on the proportion of sites and elements at which DNMs are pathogenic. Approximately 8% of DNMs in protein-coding regions result in a protein-truncating mutation25,33. Unlike protein-coding regions, CNEs lack annotation to identify putative pathogenic mutations and are smaller (median 600bp). Downsampling gene length to 600bp and masking annotation of protein consequence, results in an 80% drop in empirical power for the 94 genes passing the genome-wide significance threshold in McRae et. al, 2017 (Figure S12a). This empirical estimate is close to the simulated estimate (Figure S12b). As we did not discover any genome-wide significant CNEs, the proportion of DNMs in CNEs that are pathogenic and highly penetrant must be substantially lower than 8%. Figure 4a depicts the likelihood of observing zero genome-wide significant non-coding elements given the number of fetal brain active CNEs (out of 2,613) in which DNMs can cause neurodevelopmental disorders and the proportion of mutations in those elements that are pathogenic. This analysis suggests that very few DNMs in these elements (likely <2%) are highly penetrant for dominant neurodevelopmental disorders.
Our survey of the non-coding genome is biased towards elements with a high degree of evolutionary conservation, but also includes fetal brain active elements with lower levels of evolutionary conservation. To extrapolate the excess of DNMs we observed in the targeted fetal-brain active elements genome-wide, we fitted the enrichment of DNMs as a function of evolutionary conservation using logistic regression (Methods). Factoring in the distribution of evolutionary conservation of all fetal brain DHS peaks genome-wide, we predicted a genome-wide excess of 88 DNMs (95% CI: 48-140) across the 4,915 exome-negative probands with neurodevelopmental disorders, corresponding to 1.0% – 2.8% of unsolved cases (Figure 4b).
Discussion
In summary, we have demonstrated that severe neurodevelopmental disorders can be caused by highly penetrant de novo mutations in regulatory elements. The majority of these regulatory elements likely act as enhancers. This significant excess of DNMs is only observed in the most highly evolutionary conserved elements that are active in the fetal brain. These elements also exhibit substantial selective constraint within human populations. We observed a 1.3-fold excess of DNMs within DHS peaks within these regulatory elements, suggesting that only a minority of such DNMs are pathogenic. Moreover, our modelling suggests that the proportion of all possible mutations within these elements that are likely to cause dominant neurodevelopmental disorders is very small (<2%). As a consequence, this class of pathogenic non-coding DNMs is only likely to account for a small proportion (<5%) of ‘exome-negative’ individuals.
The nature of our study design focuses on highly conserved elements and fetal-brain active elements and is relatively uninformative with respect to pathogenic ‘gain-of-function’ DNMs within elements that show no wildtype activity in fetal brain, and are not highly evolutionarily conserved. Recently, targeted sequencing in 218 consanguineous autism families of ‘human accelerated regions’ (HARs) - non-coding elements that exhibit evolutionary conservation but excessive human-specific divergence - identified a significant 1.43-fold enrichment of rare biallelic variants34. We note that the majority of pathogenic DNMs in protein-coding genes causing neurodevelopmental disorders occur in genes that are highly evolutionary conserved, are under strong selective constraint within human populations and exhibit wildtype expression in fetal brain17,35. Therefore, the relatively modest excess of DNMs that we observed in this study is highly likely to be an upper bound for other classes of non-coding elements. Consequently, a comprehensive analysis of the contribution of variation within all classes of non-coding elements to neurodevelopmental disorders is likely to require whole genome sequencing (WGS) of many tens of thousands, if not hundreds of thousands of parent-proband trios (Figure S13).
One challenge of interpreting WGS data is the vast universe of different hypotheses that could be tested, and thus how to account appropriately for multiple hypothesis testing. Turner et al. recently reported a nominally significant enrichment (p = 0.03) of de novo SNVs and private copy number variants in fetal brain DHS or at sites with PhyloP conservation score of >4, within 50kb of known autism-associated genes in whole genome sequences from 53 individuals with autism36. Caution should be exercised in interpreting findings based on: small sample sizes relative to that required for well-powered analyses (as discussed above), especially in cohorts with lower mutational burden (e.g. within protein-coding genes) – and therefore lower statistical power - compared to the cohort studied here, and analyses requiring multiple, arbitrary, levels of variant stratification (e.g. gene set, genomic proximity threshold, and conservation score). WGS-based analyses need to explicitly articulate and account for all tested hypotheses.
Our analyses were limited to SNVs as current models for indel mutation processes are too inaccurate to allow robust assessment of mutational excess. In addition, our analyses also highlight an urgent need for improved tools to stratify more and less damaging variants within non-coding elements, and more accurate and comprehensive annotation of gene targets for individual regulatory elements. These improved mutational models and functionally-relevant annotations will dramatically increase power to detect highly-penetrant disease-associated non-coding variation, for example, increasing power more than ten-fold from 5% to 83% in 40,000 trios (Figure S13). Functional characterisation of increasing numbers of robustly-associated, highly-penetrant, regulatory variants in cellular and animal models will be critical in moving us from a descriptive to a more predictive understanding of non-coding variation in the human genome, as well as elucidating their underlying pathophysiological mechanisms.
Acknowledgements
We thank the families for their participation and patience. We are grateful to the Exome Aggregation Consortium for making their data and code available. We would like to thank Sebastian Gerety, Greg Elgar, Stein Aerts, and Dmitry Svetlichnyy for helpful discussions. We thank Hugues Roest Crollius and Lambert Moyon for help on gene target prediction. We would like to thank Jonathan Mudge and Adam Frankish for their help in annotation of conserved non-coding elements. We thank the Sanger HGI and DNA pipelines teams for their support in generating and processing the data. The DDD study presents independent research commissioned by the Health Innovation Challenge Fund (grant HICF-1009-003), a parallel funding partnership between the Wellcome Trust and the UK Department of Health, and the Wellcome Trust Sanger Institute (grant WT098051). The views expressed in this publication are those of the author(s) and not necessarily those of the Wellcome Trust or the UK Department of Health. The study has UK Research Ethics Committee approval (10/H0305/83, granted by the Cambridge South Research Ethics Committee and GEN/284/12, granted by the Republic of Ireland Research Ethics Committee). D.R.F. is funded through an MRC Human Genetics Unit program grant to the University of Edinburgh.