Abstract
The study of rare Mendelian diseases through exome sequencing typically yields incomplete diagnostic rates, ~8-70% depending on the disease type. Whole genome sequencing of the unresolved cases makes it possible to test the hypothesis that causal variants could lie in non-coding regions with damaging regulatory consequences. The large number of rare and singleton variants found in each individual genome requires computational filtering and scoring strategies to gain power in downstream statistical genetics tests. However, state-of-the-art methods estimating the functional relevance of non-coding genomic regions have been mostly characterized on sets of variants largely composed of trait-associated polymorphisms and associated to common diseases, yet with modest accuracy and strong positional biases. In this work we first curated a collection of n=737 high-confidence pathogenic non-coding single-nucleotide variants in proximal cis-regulatory genomic regions associated to monogenic Mendelian diseases. We then systematically evaluated the ability to predict causal variants of a comprehensive set of natural selection features extracted at three genomic levels: the affected position, the flanking region and the associated gene. In addition to inter-species conservation, a comprehensive set of recent and ongoing purifying selection signals in human was explored, making it possible to capture potential constraints associated to recently acquired regulatory elements in the human lineage. A supervised learning approach using gradient tree boosting on such features reached a high predictive performance characterized by an area under the ROC curve = 0.84 and an area under the Precision-Recall curve = 0.47. The figures represent a relative improvement of >10% and >34% respectively upon the performance of current state-of-the-art methods for prioritizing non-coding variants. Performance was consistent under multiple configurations of the sets of variants used for learning and for independent testing. The supervised learning design allowed the assessment of newly seen non-coding variants overcoming gene and positional bias. The scores produced by the approach allow a more consistent weighting and aggregation of candidate pathogenic variants from diverse non-coding regions within and across genes in the context of statistical tests for rare variant association analysis.
Introduction
To date, more than 4,000 Mendelian diseases have been clinically recognized 1, collectively affecting more than 25 million people in the US alone 2. However, around 50% of all known Mendelian diseases still lack the identification of the causal gene or variant 3. Moreover, every year approximately 300 new Mendelian diseases are described, whereas the pace of discovery of the causal molecular mechanisms fluctuates at around 200 yearly 3. Despite the progress achieved through Whole Exome Sequencing (WES)-based studies, recent reviews show highly heterogeneous diagnostic rates across disease types 4,5, ranging from <15% (e.g. congenital diaphragmatic hernia or syndromic congenital heart disease) to >70% (e.g. ciliary dyskinesia). In those scenarios, a common working hypothesis is that non-coding variants could explain the etiology of many of the unresolved cases 5. Whole Genome Sequencing (WGS) makes it possible to expand the survey of pathogenic variants to non-coding genomic regions in an unbiased way. This possibility generates great expectations, as most trait/disease-associated Single Nucleotide Variants (SNVs) identified by Genome Wide Association Studies (GWAS) map to non-coding regions, suggesting a prominent role of regulatory elements in genetic diseases 6,7. Nevertheless, the large number of rare and singleton variants in non-exonic positions shown by large-scale WGS projects in human 8 makes computational predictions a fundamental step to prioritize candidate variants for further clinical and experimental follow up.
A number of machine-learning methods have been developed in recent years to predict the regulatory consequences of non-coding SNVs 9–15. Two complementary perspectives have been exploited. First, from an evolutionary standpoint, genomic positions under non-neutral evolution are expected to have a functional role. Consequently, position-based purifying selection scores determined at different time-scales (i.e. from vertebrates, mammals and primates sequence alignments) have been successfully used by reference methods. Second, from a mechanistic view, phenotypic consequences of genetic variants are thought to result from their impact on non-coding functional elements, defined as those having reproducible biochemical features associated to regulatory roles, such as promoters, enhancers, silencers, repressors, etc. Thus, computational methods have exploited diverse sets of chromatin and epigenetic characteristics (e.g. histone marks, chromatin states, DNase I-hypersensitivity sites and transcription factor binding sites) obtained from heterogeneous sets of cell lines, primary cell types and tissues by Consortia such as ENCODE, FANTOM5, the Roadmap Epigenomics and BluePrint projects 16–18. While the ability of state-of-the-art methods to discriminate functionally relevant non-coding variants is well established, the value of such scores as a proxy of pathogenic potential in the context of Mendelian diseases is still unclear. This stems from the fact that functional scores of non-coding SNVs were mostly evaluated by their ability to identify trait-associated (e.g. quantitative trait loci, QTLs) and disease-associated loci from GWAS studies of common diseases. Yet, even in those contexts, predictive accuracy is modest and mainly driven by position-based interspecies conservation signals, with chromatin and epigenetic features providing only a marginal contribution 10,13,15. More recently, Smedley et al. developed the so-called Regulatory Mendelian Mutation Score (ReMM) 19, which, to our knowledge, is the only method specifically developed to score pathogenic non-coding variants in the context of Mendelian disease studies. The approach trained a random forest classifier on a curated set of 406 SNVs (including long non-coding RNA SNVs). Twenty-six features were considered, including 8 interspecies conservation scores, 4 GC/CpG-based characteristics and 8 epigenetic features. Despite the simplicity of the model, ReMM scores proved valuable to prioritize Mendelian disease variants when integrated in a more comprehensive framework considering candidate regulatory regions and the phenotypic relevance of the associated genes 19.
In this work we hypothesized that the computational prediction of pathogenic non-coding variants in Mendelian diseases would benefit from a more comprehensive set of natural selection signals, notably regarding recent and ongoing selective constraints in human. In this regard, evolutionary and functional evidence supports a rapid turnover of functional non-coding elements across species that would limit the capacity of interspecies conservation to pinpoint recently acquired regulatory sequences in the human lineage 20,21. Moreover, it has been suggested that lineage-specific and ongoing natural selection in human could help to further understand the partial overlap observed among the fractions of the genome inferred to be functional from evolutionary, biochemical and genetic evidence 22,23. The use of recent and ongoing purifying selection signals to prioritize pathogenic variants has been historically challenged by a number of confounding factors shaping patterns of human genomic variation. Random genetic drift, population structure and demographic processes such as rapid expansions, migrations and population bottlenecks have played a major role in governing changes in allele frequency within and between populations 23–25. In addition, uneven recombination rates across the genome and heterogeneous neutral mutation rates 26 associated to sequence context 27,28 or to different types of non-coding elements 29 further complicate the distinction of neutral versus non-neutral evolution.
Notwithstanding, the increasing sample size of current large-scale whole genome sequencing projects of the general population is providing a better resolution of recent and ongoing purifying selection signals in human 8,30–32 that could improve their utility in scoring systems of pathogenic variants. In addition, machine learning methods have proven able to extract complex patterns associated to functional variants by combining different types of selective constraints that would be missed by classical approaches 10,15. Hence, a supervised learning approach could help to better exploit recent natural selection features in spite of confounding factors. To test both possibilities, in this study we first extracted a comprehensive set of recent and ongoing natural selection features determined from recent large-scale WGS projects in human, together with interspecies conservation scores assessed on different evolutionary timescales. We then trained NCBoost, a classifier of non-coding SNVs based on gradient tree boosting, on a curated set of high-confidence pathogenic non-coding SNVs associated to monogenic Mendelian disease genes and on common non-coding SNVs without clinical assertions. The approach outperformed existing state-of-the-art methods under multiple training and testing scenarios, while overcoming gene and positional bias.
Material and Methods
High-confidence pathogenic variants
Three sets of high-confidence pathogenic variants in non-coding regions were obtained: 1) Regulatory disease-causing mutations (so-called “DM” set) from the Human Gene Mutation Database (HGMD, professional version, accessed on 2018/01/03, 33), manually annotated as involved in conferring the associated clinical phenotype; 2) pathogenic single-nucleotide variants (SNVs) from Clinvar 34 manually annotated as “pathogenic” with no conflicting assertions (GRCh37 release from 2017/12/31, downloaded from ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/); and 3) a manually curated set compiled from the medical literature of non-coding single-nucleotide variants associated with Mendelian disease and validated by experimentation or co-segregation studies, or for which other convincing evidence of pathogenicity was available 19.
Variant mapping and annotation of non-coding SNVs
Only Single Nucleotide Variants (SNVs) were considered throughout the study. Variants were annotated using Annovar 35 (downloaded on 2016-02-01), using the gene-based annotation option based on RefSeq for Humans (assembly version hg19; http://annovar.openbioinformatics.org/en/latest/user-guide/gene/), in order to obtain i) the gene region affected by intragenic variants, or ii) the nearest flanking gene in the case of intergenic variants. Exonic variants and variants within 10 base pairs (bp) of a splicing junction of protein-coding genes were removed (Annovar splicing_threshold =10). At this stage, variants from HGMD-DM, Clinvar and Smedley’2016 overlapping non-coding RNAs within an exon (n= 143, 2 and 68, respectively), intron (n= 24, 3, 13, respectively) or 10bp from a splicing junction (n= 1, 0, 0, respectively) were filtered out. In the case of SNVs overlapping several types of regions associated to different genes or transcripts, the following three criteria were consecutively adopted: A) the default Annovar precedence rule for gene-based annotation was adopted, i.e.: exonic = splicing > ncRNA > UTR5 = UTR3 > intronic > upstream = downstream > intergenic. B) If, after applying the previous precedence rule, an SNV could still be associated to several neighbor/overlapping genes (e.g. in the intergenic region between two genes, or in the intronic region of two overlapping genes, etc), the SNV’s nearest protein-coding gene was kept as a reference for the annotation of the variant. The SNV’s nearest gene was determined by the shortest distance to either the TSS or TSE. C) In case of SNVs with two or more genes with identical shortest distance to TSS/TSE, the SNV was tagged as ‘conflicted’ and filtered out from the analysis. After all previous filtering steps, a total of 18 disease-causing SNVs overlapping upstream (n=9), 3’UTR (n=7) and downstream regions (n=2) of non-coding RNAs were filtered out. Thus, for the purpose of this study, the set of non-coding variants consisted of SNVs associated to protein-coding genes and overlapping intronic, 5’UTR or 3’UTR, upstream and downstream regions (i.e. closer than 1kb from the Transcription Start Site (TSS) and the Transcription End Site (TSE), respectively) and intergenic regions.
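Criteria A to C above can be illustrated with a minimal Python sketch. It is not the annotation code used in this study: the function name `resolve_annotation`, the candidate record layout (`gene`, `region`, `tss`, `tse`) and the assumption that candidates are already restricted to protein-coding genes on the same chromosome are all hypothetical choices made for illustration.

```python
# Annovar-style precedence for gene-based annotation (criterion A).
# Regions sharing a rank are ties, as in "exonic = splicing > ncRNA > ...".
PRECEDENCE = {
    "exonic": 0, "splicing": 0,
    "ncRNA": 1,
    "UTR5": 2, "UTR3": 2,
    "intronic": 3,
    "upstream": 4, "downstream": 4,
    "intergenic": 5,
}


def resolve_annotation(pos, candidates):
    """Pick one (gene, region) for an SNV at position `pos`.

    `candidates` is a hypothetical list of dicts with keys
    'gene', 'region', 'tss' and 'tse'. Returns None when the SNV
    would be tagged as 'conflicted' (criterion C).
    """
    # Criterion A: keep only candidates with the highest-priority region.
    best = min(PRECEDENCE[c["region"]] for c in candidates)
    top = [c for c in candidates if PRECEDENCE[c["region"]] == best]
    if len({c["gene"] for c in top}) == 1:
        return top[0]["gene"], top[0]["region"]

    # Criterion B: nearest gene by shortest distance to either TSS or TSE.
    def distance(c):
        return min(abs(pos - c["tss"]), abs(pos - c["tse"]))

    d_min = min(distance(c) for c in top)
    nearest = [c for c in top if distance(c) == d_min]

    # Criterion C: identical shortest distance for several genes.
    if len({c["gene"] for c in nearest}) > 1:
        return None
    return nearest[0]["gene"], nearest[0]["region"]
```

For instance, an SNV that is intronic in one gene and upstream of another is assigned to the intronic gene by criterion A alone, whereas an intergenic SNV equidistant from two genes is discarded.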
Curation of high-confidence pathogenic non-coding SNVs associated to monogenic Mendelian disease genes
For high-confidence pathogenic SNVs, we manually supervised a total of n=71 cases showing a disagreement between the gene associated to the variant in the original resource (i.e. HGMD-DM, Clinvar and Smedley’2016) and the gene associated by the previously described annotation procedure. The original gene assignment was kept for n=17 SNVs where the conflict originated from straightforward exceptions to Annovar’s precedence rule or to the assignment to the nearest upstream or downstream gene (Criteria A and B described in the previous section). The number of variants retained at this stage is represented in Figure 1A. Only high-confidence pathogenic non-coding variants associated to the same gene by both the original resource and the annotation process done in this work were retained for downstream analyses.
We then evaluated whether the genes associated to high-confidence pathogenic non-coding SNVs were reported as Mendelian disease genes in 1. A list of n=3695 Mendelian disease genes was obtained following 3: OMIM raw data files mim2gene.txt, genemap2.txt and morbidmap.txt were downloaded from www.omim.org on 2017/10/13. MIM phenotype number and supporting evidence annotations were extracted from morbidmap.txt. Phenotype descriptions containing the word ‘somatic’ were flagged as ‘somatic’, those containing [‘carcinoma’, ‘cancer’, ‘tumor’, ‘leukemia’, ‘lymphoma’, ‘sarcoma’, ‘blastoma’, ‘adenoma’, ‘cytoma’, ‘myelodysplastic’, ‘Myelofibrosis’ or ‘oma’] were flagged as ‘cancer’, and those containing [‘risk’, ‘quantitative trait locus’, ‘QTL’, ‘{’, ‘[’ or ‘susceptibility to’] were flagged as ‘complex’. Phenotypes flagged as ‘somatic’ and ‘cancer’ were classified as ‘somatic cancer’. Mendelian genes were then defined as the genes having a supporting evidence level of 3 (i.e. the molecular basis of the disease is known) and not having a ‘somatic’ flag. Two main categories of Mendelian disease genes were defined: monogenic Mendelian disease genes (n=3354) and complex Mendelian disease genes (n=596), i.e. those presenting mutation risk factors, quantitative-trait loci (QTL) or contributing to susceptibility to multifactorial disorders or to susceptibility to infection 3. Of note, 255 genes were associated to both monogenic and complex Mendelian disease genes.
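The keyword-based flagging of OMIM phenotype descriptions can be sketched as follows. This is an illustrative reading of the rules above, not the original script: the function names are hypothetical, and case-insensitive plain substring matching is an assumption (the text does not state whether whole words or substrings were matched).

```python
CANCER_TERMS = [
    "carcinoma", "cancer", "tumor", "leukemia", "lymphoma", "sarcoma",
    "blastoma", "adenoma", "cytoma", "myelodysplastic", "myelofibrosis", "oma",
]
COMPLEX_TERMS = [
    "risk", "quantitative trait locus", "qtl", "{", "[", "susceptibility to",
]


def flag_phenotype(description):
    """Return the set of flags raised by one phenotype description.

    Assumption: case-insensitive substring matching. Note that with plain
    substrings the short term 'oma' also matches inside unrelated words
    (for example 'somatic' itself), so every 'somatic' description is
    flagged as 'cancer' as well in this sketch.
    """
    text = description.lower()
    flags = set()
    if "somatic" in text:
        flags.add("somatic")
    if any(term in text for term in CANCER_TERMS):
        flags.add("cancer")
    if any(term in text for term in COMPLEX_TERMS):
        flags.add("complex")
    return flags


def is_mendelian(evidence_level, flags):
    """Evidence level 3 (molecular basis known) and no 'somatic' flag."""
    return evidence_level == 3 and "somatic" not in flags
```

A gene whose retained phenotypes carry no ‘complex’ flag would fall in the monogenic category, while a gene with both flagged and unflagged phenotypes would be counted in both categories, consistent with the 255 genes shared between the two lists.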
High-confidence pathogenic non-coding SNVs associated to monogenic Mendelian disease genes were further manually inspected to check consistency between the disease phenotype reported in the original source (i.e. HGMD-DM, Clinvar and Smedley’2016) and the ones described in the OMIM database for the same gene. A total number of n=138 variants for which the agreement was unclear or a disagreement was observed were filtered out for downstream analyses. In the remaining set of high-confidence pathogenic non-coding SNVs associated to monogenic Mendelian disease genes, we then inspected whether variants were detected as heterozygous or homozygous among the individuals included in the GnomAD database (31; http://gnomad.broadinstitute.org/downloads, version r2.0.2), using both whole genome sequencing data and whole exome sequencing data. Variants present as homozygous in at least one carrier were filtered out for downstream analysis. Thus only high-confidence pathogenic non-coding SNVs associated to monogenic Mendelian diseases, with no homozygous individuals in GnomAD and overlapping intronic, 5’UTR, 3’UTR, upstream, downstream and intergenic regions were finally retained for downstream analysis (Table S1).
Common and rare human variants without clinical assertions
Common and rare human variants without clinical assertions were obtained from dbSNP (downloaded on 2017/07/10 from ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/All_20170710.vcf.gz). For the purpose of this study, variants labeled as common (“COMMON=1”) and with Minor Allele Frequency (MAF)> 0.05 were considered as ‘common variants’, while those labeled as non-common (“COMMON=0”) and with MAF< 0.01 were annotated as ‘rare variants’. Variants with no MAF information (no “CAF” field reported in the “INFO” field of the variant) and multiallelic variants were filtered out. Common and rare human variants without clinical assertions were first annotated by Annovar and filtered as described above. For consistency in the comparison against the pathogenic set of SNVs, the set of common and rare human variants without clinical assertions was restricted to SNVs associated to protein-coding genes and overlapping intronic, 5’UTR, 3’UTR, upstream, downstream and intergenic regions. The list of protein-coding genes was extracted from Ensembl Biomart (36; human genome assembly version GRCh37.p13).
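The common/rare labelling rule can be sketched in a few lines of Python operating on the ALT and INFO columns of a dbSNP VCF record. The function name `classify_variant` is hypothetical, and taking the MAF as the smaller of the two allele frequencies listed in the “CAF” field is an assumption of this sketch rather than a documented detail of the original pipeline.

```python
def classify_variant(alt, info):
    """Label one dbSNP record as 'common', 'rare' or None (discarded).

    alt  -- the ALT column of the VCF line
    info -- the INFO column, e.g. "RS=1;COMMON=1;CAF=0.9,0.1"
    """
    if "," in alt:
        return None  # multiallelic variants are filtered out
    fields = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
    if "CAF" not in fields:
        return None  # no allele frequency information
    try:
        freqs = [float(x) for x in fields["CAF"].split(",")]
    except ValueError:
        return None  # placeholder entries such as '.'
    if len(freqs) != 2:
        return None
    maf = min(freqs)  # assumption: MAF = smaller of ref/alt frequencies
    common = fields.get("COMMON")
    if common == "1" and maf > 0.05:
        return "common"
    if common == "0" and maf < 0.01:
        return "rare"
    return None  # neither definition is met
```

Variants satisfying neither definition (for example COMMON=1 with a MAF of 0.03) are left unlabelled, so they enter neither the common nor the rare set.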
Pathogenicity scores of non-coding SNVs
Pre-computed pathogenicity scores of non-coding SNVs were extracted from the following state-of-the-art methods: CADD non-coding score (version v1.3, 9); DeepSEA functional significance score (version v0.94, 13); Eigen and Eigen-PC scores (version v1.1, 15); FunSeq2 score (version v1.2, 11) and ReMM scores (version v0.3.1, 19).
Feature extraction of non-coding SNVs
Features extracted are summarized in Table 1. They can be gathered in 5 main categories:
A. Inter-species sequence conservation features at position and window level
To evaluate evolutionary conservation at a given site, the following scores evaluating non-neutral rates of substitution from multiple species alignments (excluding human) were used: PhastCons 37,38,39 and PhyloP scores for three multi-species alignments (Vertebrates, Mammals and Primates, excluding human) and GerpN and GerpS single-nucleotide scores from mammalian alignments 40, all of them obtained from CADD (version v1.3, 9, file name: whole_genome_SNVs_inclAnno.tsv.gz, downloaded from http://cadd.gs.washington.edu/download). PhyloP scores test for neutral evolution at individual sites. The score corresponds to the -log p-value of the null hypothesis of neutral evolution. Positive values (up to 3) represent purifying selection, while negative values (down to -14) represent acceleration. PhastCons scores estimate the probability that the locus is contained in a conserved element. GerpN and GerpS single-nucleotide scores assess respectively the neutral substitution rate and the rejected substitution rate of the locus. A high GerpN value indicates high homology of the locus across species. Positive values of GerpS indicate a deficit in substitutions, while negative values convey a substitution surplus.
B. Recent and ongoing natural selection signals in Humans at position and window level
Three human population-specific natural selection scores based on the allele frequency spectrum on a 30 kb sequence window region centered around the SNV were obtained from The 1000 Genomes Selection Browser 1.0 (http://hsb.upf.edu/, 41): Tajima’s D 42, Fu & Li’s D* and Fu & Li’s F* 43. Tajima’s D is a neutrality test comparing estimates of the number of segregating sites and the mean pair-wise difference between sequences. Fu & Li’s D* is a neutrality test comparing the number of singletons with the total number of nucleotide variants within a region. Fu & Li’s F* is a neutrality test comparing the number of singletons with the average number of nucleotide differences between pairs of sequences. The three tests were performed within 3 populations of the 1000 Genomes Project phase 1 data, producing population-specific scores: Yoruba in Ibadan, Nigeria (YRI), Han Chinese in Beijing, China (CHB) and Utah Residents with Northern and Western European Ancestry (CEU). Rather than the raw scores, we used the negative logarithm of the ranked percentile of each score over the whole genome (http://hsb.upf.edu/?page_id=594), with values ranging from 0 (indicating positive selection) to 6 (indicating purifying selection).
The background selection score (B statistic, 44), indicating the expected fraction of neutral diversity that is present at a site, was obtained from CADD annotations (version v1.3). B statistic values close to 0 represent nearly complete removal of diversity as a result of selection and values near 1 indicate no conservation. B-statistic is based on human single nucleotide polymorphism (SNP) data from Perlegen Sciences, HapMap phase II, the SeattleSNPs NHLBI Program for Genomic Applications and the NIEHS Environmental Genome Project.
Context-dependent tolerance scores (CDTS) for 10bp bins of the human genome computed on 15496 unrelated individuals from the gnomAD consortium 32 were downloaded from http://www.hli-opendata.com/noncoding/ (file CDTS_diff_perc_coordsorted_gnomAD_N15496_hg19.bed.gz). The CDTS represents the difference between observed and expected variations in humans. The expected variation is computed genome-wide for each nucleotide as the probability of variation of each nucleotide depending on its heptanucleotide context. Low CDTS scores indicate loci intolerant to variation.
Mean heterozygosity and mean derived allele frequency of variants in a 1kb window region centered around the SNV and calculated from the 1000 Genomes Project (excluding the query variant) were obtained from 10 (ftp://ftp.sanger.ac.uk/pub/resources/software/gwava/v1.0/source_data/1kg). Mean minor allele frequency (MAF) of variants in a 1kb flanking region was calculated from GnomAD genome data 31, excluding the query variant from the calculation. Mean MAF was assessed for the global population and for population-specific frequencies: Africans and African Americans (AFR), Admixed Americans (AMR), East Asians (EAS), Finnish (FIN), Non-Finnish Europeans (NFE), Ashkenazi Jewish (ASJ) and Other populations (OTH). Additionally, we extracted mean MAF of variants in a 1kb window calculated from the 1000 Genomes Project (excluding the query variant). MAFs both from GnomAD and the 1000 Genomes Project were extracted from GnomAD release r2.0.2, using the Genome VCF files (available at http://gnomad.broadinstitute.org/downloads).
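As an illustration of the window-level frequency features, the following Python sketch computes the mean MAF of the variants lying in a 1kb window centered on a query SNV. The function name, the `(position, maf)` input layout and the choice to exclude the query variant by position are assumptions made for this example only.

```python
def mean_maf_in_window(query_pos, variants, window=1000):
    """Mean MAF of neighbouring variants around a query SNV.

    variants -- iterable of (position, maf) pairs on the same chromosome
    window   -- total window size in bp, centered on query_pos

    The query position itself is excluded (assumption: exclusion by
    position). Returns None when the window holds no other variant.
    """
    half = window // 2
    mafs = [
        maf
        for pos, maf in variants
        if pos != query_pos and abs(pos - query_pos) <= half
    ]
    if not mafs:
        return None
    return sum(mafs) / len(mafs)
```

The same function could be applied once per population (AFR, AMR, EAS, and so on) by passing the corresponding population-specific frequencies.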
C. Gene-based features
The following gene-level features associated to natural selection were obtained:
Primate dn/ds ratios (i.e. the ratio between the number of nonsynonymous substitutions and the number of synonymous substitutions) were taken from 45. Low dn/ds values reflect purifying selection, while high dn/ds values are indicative of positive selection.
The gene probability of loss-of-function intolerance (pLI), from 31, reflecting a depletion of rare and de novo protein-truncating variants as compared to the expectations drawn from a neutral model of de novo variation on ExAC exomes data. pLI values close to 1 reflect genes intolerant to heterozygous and homozygous loss-of-function mutations.
Gene Damage Index (GDI), a gene-level metric of the mutational damage that has accumulated in the general population, based on CADD scores 46. High GDI values reflect highly damaged genes.
The Residual Variation Intolerance Score (RVIS percentile; 47), which provides a gene measure of the departure from the average number of common functional mutations in genes with a similar amount of mutational burden in human. High RVIS percentiles reflect genes highly tolerant to variation.
The non-coding version of the RVIS score (ncRVIS, 48), measuring the departure from the genome-wide average of the number of common variants found in the noncoding sequence of genes with a similar amount of noncoding mutational burden in human. Negative values of ncRVIS indicate a conserved proximal non-coding region, while positive values indicate a higher burden of SNVs than expected under neutrality.
The average non-coding GERP (ncGERP) is the average GERP score 40 across a gene’s noncoding sequence (Petrovski et al., 2015). Both in the case of ncRVIS and ncGERP, the non-coding sequence was defined in the original publication as the collection of 5’-UTR, 3’-UTR and an additional non-exonic 250 bp upstream of transcription start site (TSS).
Gene age estimating the gene time of origin based on the presence/absence of orthologs in the vertebrate phylogeny was taken from 49. It varies from 0 (oldest) to 12 (youngest, corresponding to human specific genes). The number of human paralogs for each gene was obtained from the OGEE database 50.
For all scores, gene names were mapped to approved gene symbols from HGNC. Missing values were imputed through the median value computed over all protein coding genes.
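The median imputation of missing gene-level values can be sketched as below. The function name and the dictionary layout (gene symbol mapped to a score, with `None` for missing values) are hypothetical; the sketch assumes the dictionary already covers all protein-coding genes.

```python
from statistics import median


def impute_with_median(gene_scores):
    """Replace missing gene-level scores by the median of observed ones.

    gene_scores -- dict mapping gene symbol to a float, or None if missing.
    Assumes at least one gene has an observed value.
    """
    observed = [v for v in gene_scores.values() if v is not None]
    fill = median(observed)
    return {g: (fill if v is None else v) for g, v in gene_scores.items()}
```

Applied feature by feature (pLI, GDI, RVIS, etc.), this leaves observed values untouched and assigns each gene without a score a neutral, mid-distribution value.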
D. Sequence context
The percentage of GC and CpG in a window of 150bp around the variant of interest was taken from CADD v1.3 annotations. In addition, we one-hot encoded the non-coding genomic region overlapping the SNV, as annotated by Annovar, and used the result as binary features: intronic, 5’UTR, 3’UTR, upstream, downstream and intergenic regions.
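The one-hot encoding of the region type amounts to six binary indicators per variant, as in this minimal Python sketch. The region labels follow the Annovar names quoted earlier (UTR5, UTR3); the function name and the column order are arbitrary choices made for illustration.

```python
REGIONS = ["intronic", "UTR5", "UTR3", "upstream", "downstream", "intergenic"]


def one_hot_region(region):
    """Encode a region label as six 0/1 features, in REGIONS order."""
    if region not in REGIONS:
        raise ValueError(f"unexpected region label: {region}")
    return [1 if region == r else 0 for r in REGIONS]
```

Exactly one indicator is set per variant, since each retained SNV is assigned a single region by the annotation procedure.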
E. Epigenetic features
Epigenetic features such as histone modification marks, nucleosome position, open chromatin profiles and transcription factor binding sites (TFBS) profiles generated by the ENCODE project 51 were extracted from CADD v1.3 annotations. DNA accessibility was assessed using two sets of features: 1) the open chromatin evidence coming from the open chromatin super track, containing peak signal and Phred-scaled p-values of evidence for five open chromatin assays: DNase-seq (EncOCDNaseSig and EncOCDNasePVal), FAIRE-seq (EncOCFaireSig and EncOCFairePVal) and ChIP-seq using CTCF (EncOCctcfSig and EncOCctcfPVal), PolII (EncOCpolIISig and EncOCpolIIPVal) and Myc (EncOCmycSig and EncOCmycPVal), the Phred-scaled combined p-value of both DNase-seq and FAIRE-seq assays (EncOCCombPVal) and the Open Chromatin Code (EncOCC), a metric integrating DNaseI, FAIRE and ChIP-seq peak evidence of open chromatin (further details are provided at http://rohsdb.cmb.usc.edu/GBshape/cgi-bin/hgTrackUi?g=wgEncodeOpenChromSynth and http://rohsdb.cmb.usc.edu/GBshape/cgi-bin/hgTables?db=hg19&hgta_group=regulation&hgta_track=wgEncodeOpenChromSynth&hgta_table=wgEncodeOpenChromSynthGm18507Pk&hgta_doSchema=describe+table+schema); and 2) the maximum nucleosome position score obtained through MNase-seq (EncNucleo), indicating packed chromatin states (http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=wgEncodeSydhNsome). Potential transcription factor activity was assessed using i) the number of different overlapping TFBS (TFBS), ii) the number of overlapping TFBS peaks summed over cell types (TFBSPeaks) and iii) the highest value of overlapping TFBS peaks across cell types from ChIP-seq (TFBSPeaksMax), as well as using histone modification marks, such as the maximum methylation peak at H3K4 (EncH3K4Me1, enhancers-associated), maximum trimethylation peak at H3K4 (EncH3K4Me3, promoter-associated) and maximum acetylation peak at H3K27 (EncH3K27Ac, associated to active enhancers).
NCBoost training strategy
NCBoost training was performed with XGBoost, a machine learning technique based on gradient tree boosting (aka. gradient boosted regression tree; 52,53). The R implementation from https://github.com/dmlc/xgboost (version 0.71.1) was used with parameters: eta=0.01, max_depth=25 and gamma=10, selected to avoid overfitting and after parameter optimization through a prior tenfold cross-validation step.
To train NCBoost, we first randomly split the complete list of protein-coding genes in 10 genome partitions of equal size, with the same distribution across all chromosomes and keeping in each of them the same proportion of monogenic Mendelian disease genes presenting high-confidence pathogenic non-coding variants (see above). Throughout the work, each disease-causing variant (aka. ‘positive’ variants) was associated with a unique set of 10 ‘negative’ variants, randomly sampled from the set of common human variants without clinical assertion described above and associated to genes within the same genome partition. Random sampling of common variants was matched to the positive set to keep the same fraction of variants per type of region: intronic, 5’UTR, 3’UTR, upstream, downstream and intergenic regions. A maximum of one positive and one negative variant associated to the same gene was allowed, although no minimum per gene was required. For the training step, a maximum of one disease-causing non-coding variant was randomly sampled per gene (Table S2). We then trained NCBoost as a bundle of 10 independently trained models, consecutively excluding in each of them 1 of the 10-genome partitions described above.
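The partition-based design can be sketched in plain Python as follows. This is a simplified illustration, not the NCBoost training code: it omits the stratification by chromosome and by Mendelian disease genes, does not call XGBoost, and all function names and the variant record layout (`id`, `gene`, `region`) are hypothetical.

```python
import random


def partition_genes(genes, k=10, seed=0):
    """Randomly assign each gene to one of k equally sized partitions."""
    rng = random.Random(seed)
    shuffled = sorted(genes)
    rng.shuffle(shuffled)
    return {gene: i % k for i, gene in enumerate(shuffled)}


def sample_negatives(positives, commons, gene_partition, per_positive=10, seed=0):
    """Match each positive to `per_positive` common variants.

    Negatives share the positive's region type and genome partition,
    and at most one negative is drawn per gene overall.
    """
    rng = random.Random(seed)
    pool = list(commons)
    rng.shuffle(pool)
    used_genes = set()
    matched = {}
    for p in positives:
        part = gene_partition[p["gene"]]
        picks = []
        for c in pool:
            if len(picks) == per_positive:
                break
            if c["gene"] in used_genes or c["region"] != p["region"]:
                continue
            if gene_partition.get(c["gene"]) != part:
                continue
            picks.append(c)
            used_genes.add(c["gene"])
        matched[p["id"]] = picks
    return matched


def leave_one_partition_out(variants, gene_partition, k=10):
    """Yield (held_out_partition, training_variants) for each of k models."""
    for held in range(k):
        yield held, [v for v in variants if gene_partition[v["gene"]] != held]
```

Under this design, a variant associated to a gene in partition i can be scored by the model that never saw partition i during training, which is what allows newly seen variants to be assessed without gene overlap between training and testing.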
Correlation between independently trained 10-NCBoost models
To assess the correlation among the scores produced by the 10 independently trained NCBoost models, we created 11 genome partitions in order to obtain 11 independent sets of positive and negative variants, randomly sampled in an analogous way as described above. One partition was randomly selected and reserved for validation while the other 10 were used for training. Ten NCBoost models were then independently trained using the set of features A+B+C+D described above, by consecutively excluding in each of them 1 of the 10 genome partitions. Then, each model was used to score variants in the 11th partition. Correlation among the scores of the 10 models was assessed through Spearman rank correlation.
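The pairwise comparison of model scores can be illustrated with a dependency-free Python sketch of the Spearman rank correlation (Pearson correlation computed on average ranks, so that ties are handled). The function names and the input layout, one list of scores per model over the same ordered validation variants, are assumptions of this example.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values receiving their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation between two score vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def pairwise_spearman(model_scores):
    """Correlation for every pair of models scoring the same variants."""
    n = len(model_scores)
    return {
        (i, j): spearman(model_scores[i], model_scores[j])
        for i in range(n)
        for j in range(i + 1, n)
    }
```

With 10 models this yields 45 pairwise coefficients, which can then be summarized, for instance by their minimum and median.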
Random sampling of rare human variants without clinical assertion
Each disease-causing variant was associated with a unique set of 10 rare variants, randomly sampled from the set of rare human variants without clinical assertion described above. Random sampling of rare variants was matched to the positive set to keep the same fraction of variants per type of region: intronic, 5’UTR, 3’UTR, upstream, downstream and intergenic regions. A maximum of one positive and one rare variant associated to the same gene was allowed, although no minimum per gene was required.
Region-based random sampling of common variants
To constitute a “region-context” matched set of positive and negative variants, each disease-causing variant was associated, when available, with one common variant, randomly sampled from the set of common human variants without clinical assertion associated to the same gene and mapping to the same region (intronic, 5’UTR, 3’UTR, upstream, downstream and intergenic regions). Disease-causing variants with no matching common variants in the same region of the same gene were excluded from the region-context matched set of positive and negative variants. Multiple positive-negative variant pairs per gene were allowed in this setting. In the end, 149 region-matched pairs of pathogenic and random common variants were sampled, associated to 54 unique genes.
Annotation of dominant/recessive and haploinsufficient genes
A list of n=299 haploinsufficient genes was obtained from 54. Genes intolerant to heterozygous truncation (pLI>0.9; 31) were obtained from the ExAC browser (file fordist_cleaned_exac_nonTCGA_z_pli_rec_null_data.txt, downloaded from ftp://ftp.broadinstitute.org/pub/ExAC_release/release0.3.1/functional_gene_constraint/). Dominant and recessive disease gene predictions were obtained from DOMINO (55; file score_all_final_03.04.17.txt, downloaded from https://wwwfbm.unil.ch/domino/download.html). Following Quinodoz et al., 2017, the DOMINO score, reflecting the predicted probability of a gene to harbor dominant changes, was used to establish five gene categories (i.e. recessive, likely recessive, rest, likely dominant, dominant), corresponding to the probability intervals <0.2, 0.2-0.4, 0.4-0.6, 0.6-0.8, >=0.8, respectively.
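The mapping from DOMINO score to the five gene categories can be written as a short Python function. The function name is hypothetical, and treating each interval as closed on the left and open on the right (so that a score of exactly 0.2 falls in ‘likely recessive’) is an assumption; only the first and last boundaries are unambiguous in the text.

```python
def domino_category(score):
    """Map a DOMINO probability to one of five inheritance categories.

    Assumption: intervals are half-open, [lower, upper).
    """
    if score < 0.2:
        return "recessive"
    if score < 0.4:
        return "likely recessive"
    if score < 0.6:
        return "rest"
    if score < 0.8:
        return "likely dominant"
    return "dominant"
```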
Software availability
Scripts to annotate SNVs with all features used in this study, software to score pathogenicity with NCBoost (ABCD model) and genome-wide pre-computed scores will be available online at http:// [URL TBD] upon publication of the manuscript.
Results
Curation of a high-confidence set of pathogenic non-coding variants associated to monogenic Mendelian disease genes
The number of high-confidence pathogenic non-coding variants obtained from HGMD-DM, ClinVar and Smedley’2016 is represented in Figure 1A (Methods). The majority of causal variants were assigned to the closest protein-coding gene in the reference source (94%, 98% and 89%, respectively). Thus the available set is mostly constituted of proximal cis-regulatory variants (Figure 1A), with distal cis-regulatory and trans-acting variants scarcely represented. Our curation effort allowed further refining this set to retain the fraction of pathogenic variants confidently associated to monogenic Mendelian disease genes (84%, 87% and 98%, respectively; Figure 1B). In addition, a small though non-negligible fraction of variants for which homozygous individuals were detected in recent large-scale whole exome and genome sequencing (gnomAD; 17) was excluded from downstream analysis (5%, 7% and 4%, respectively; Figure 1B). After all filtering steps, a total of 737 pathogenic non-coding SNVs collectively associated to 283 genes were retained (Figure 1C). Variants were distributed across intronic (23%), 5’UTR (36%), 3’UTR (12%) and 1Kb-upstream TSS (26%) regions, with a minority of variants in 1Kb-downstream TSE (<1%) and intergenic regions (1%). The 3 resources mined in this work (HGMD-DM, ClinVar and Smedley’2016) showed varying degrees of overlap regarding causal SNVs (Figure 1D) and associated genes (Figure 1E). Notably, the set of 283 monogenic Mendelian disease genes collectively affected by pathogenic non-coding SNVs is enriched in haploinsufficient genes (odds ratio OR = 2.59, one-sided Fisher test p-value = 1.279e-9), in genes intolerant to heterozygous truncation (OR = 1.29; p-value = 1.279e-9) and in genes predicted to have a dominant inheritance mode (OR = 1.36; p-value = 1.72e-3) as compared to a background set of 3354 monogenic Mendelian disease genes (Figure S1; Methods).
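For reference, an odds ratio and a one-sided Fisher exact test of the kind reported above can be computed from a 2x2 contingency table as sketched below. The layout of the table (gene set versus background, in category versus not) and the toy counts are assumptions for illustration. They are not the counts behind the reported values.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Odds ratio for the table [[a, b], [c, d]], where
    a = set genes in category, b = set genes not in category,
    c = background genes in category, d = background genes not in category."""
    return (a * d) / (b * c)


def fisher_greater(a, b, c, d):
    """One-sided Fisher exact p-value (enrichment): probability of observing
    at least `a` in-category genes in the set under the hypergeometric null."""
    row1, col1, n = a + b, a + c, a + b + c + d
    k_max = min(row1, col1)
    tail = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, k_max + 1)
    )
    return tail / comb(n, col1)


# Toy counts only
toy_or = odds_ratio(3, 1, 1, 3)
toy_p = fisher_greater(3, 1, 1, 3)
```

For the toy table the odds ratio is 9 and the one-sided p-value is 17/70, which can be checked by hand from the hypergeometric probabilities.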
Distribution of state-of-the-art pathogenicity scores across pathogenic and non-pathogenic SNVs
We then checked the distribution of six state-of-the-art pathogenicity scores (CADD, DeepSEA, Eigen, Eigen-PC, FunSeq2 and ReMM; Methods) across the 737 high-confidence non-coding pathogenic variants and 4’960’178 common SNVs without clinical assertions. All evaluated scores showed marked differences depending on the type of gene region involved (i.e. intergenic, intronic, 3’UTR, 5’UTR, upstream and downstream regions of associated genes; Figure S2). Thus, the distributions of median scores per gene for pathogenic SNVs in 5’UTR and for SNVs within 1Kb-upstream TSS were shifted towards more severe values than those of pathogenic SNVs in 3’UTR, intronic and intergenic regions. Bias per gene region was also observed across common SNVs without clinical assertions, suggesting that the regulatory region where a variant maps systematically biases the scores (Figure S2). Surprisingly, the distribution of median scores per gene for common SNVs in 5’UTR was not significantly lower (i.e. less severe) than that of pathogenic SNVs in 3’UTR, intronic and intergenic regions for any of the 5 scores evaluated (two-sided Wilcoxon test p-values for all pair-wise comparisons evaluated are reported in Table S3). As a corollary, these observations warn about the necessity of matching the relative composition of pathogenic and non-pathogenic SNVs across different gene regions in predictive benchmarks, as well as of accounting for relative differences in region distribution across datasets (Figure 1C).
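The per-region comparison above can be illustrated with the following sketch, which aggregates scores to one median per gene within a region and compares two groups of medians. It assumes that the Wilcoxon test is the unpaired rank-sum (Mann-Whitney) form, and it uses a normal approximation with tie correction and no continuity correction, which is crude for small samples. Field and function names are hypothetical.

```python
from collections import defaultdict
from math import erfc, sqrt
from statistics import median


def median_score_per_gene(variants, region):
    """Median score per gene, restricted to variants in one region type."""
    by_gene = defaultdict(list)
    for v in variants:
        if v["region"] == region:
            by_gene[v["gene"]].append(v["score"])
    return {gene: median(scores) for gene, scores in by_gene.items()}


def rank_sum_test(x, y):
    """Two-sided rank-sum test: returns (U statistic of x, p-value)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted((value, idx) for idx, value in enumerate(list(x) + list(y)))
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        t = j - i + 1
        tie_term += t ** 3 - t
        for k in range(i, j + 1):
            ranks[pooled[k][1]] = (i + j) / 2 + 1  # average 1-based rank
        i = j + 1
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0
    z = (u - n1 * n2 / 2) / sqrt(var_u)
    return u, erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))


# Toy data (hypothetical identifiers and scores)
variants = [
    {"gene": "G1", "region": "5UTR", "score": 1.0},
    {"gene": "G1", "region": "5UTR", "score": 3.0},
    {"gene": "G2", "region": "5UTR", "score": 5.0},
    {"gene": "G2", "region": "3UTR", "score": 0.5},
]
medians_5utr = median_score_per_gene(variants, "5UTR")
u_stat, p_value = rank_sum_test([1, 2, 3], [4, 5, 6])
```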
Ability of natural selection signals to predict pathogenic non-coding SNVs when considered independently
Table 1 summarizes the set of natural selection features extracted for both pathogenic non-coding SNVs and common SNVs without clinical assertions. First, the features gathered cover different evolutionary scales and can be classified as interspecies natural selection (considering vertebrates, mammals and primates, excluding human) or recent and ongoing natural selection in human. Second, features were categorized as “position-based” when they refer to the specific genomic position where the variant occurred, “window-level” when they refer to a given sequence interval centered on the SNV, or “gene-level” when they refer to the global characteristics of the closest protein-coding gene.
For the purpose of this work, features were grouped into three main sets (Table 1): A) interspecies sequence conservation features at position and window level; B) recent and ongoing natural selection signals in human at position and window level; and C) gene-based features. Furthermore, we included 2 additional sets of features: D) the sequence context, i.e. GC and CpG content as well as information on the type of gene region (intronic, 5’UTR, 3’UTR, upstream, downstream and intergenic region); and E) epigenetic features such as histone modification marks, nucleosome position, open chromatin profiles and transcription factor binding site (TFBS) profiles generated by the ENCODE project 51.
We first checked the predictive ability of each individual feature to discriminate the n=737 high-confidence pathogenic non-coding SNVs associated to monogenic Mendelian disease genes from a ‘negative set’ of n=7370 randomly sampled common SNVs without clinical assertions and matched by region (Methods). Figure 2 shows the area under the receiver operating characteristic (AUROC) curve and the area under the Precision-Recall curve (AUPRC) obtained for each feature. The ranking of features according to both AUROC and AUPRC showed that predictive ability was dominated by interspecies sequence conservation features at position and window level (Category A), while only poor performance was observed for the rest of the features when considered independently.
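Scoring a single feature by AUROC and AUPRC amounts to treating the raw feature value as a classifier score. The sketch below shows one way to do this with the standard library only. It assumes that higher feature values indicate pathogenicity and uses average precision as the AUPRC estimator, which may differ from the estimator used for Figure 2. Feature names and values are toy placeholders.

```python
def auroc(scores, labels):
    """AUROC from the rank-sum of the positives (ties get average ranks)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def average_precision(scores, labels):
    """Sum of precision times recall increment over distinct thresholds."""
    n_pos = sum(labels)
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    tp, ap, i = 0, 0.0, 0
    while i < len(pairs):
        j, tp_before = i, tp
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            tp += pairs[j][1]
            j += 1
        ap += (tp - tp_before) / n_pos * (tp / j)
        i = j
    return ap


# Toy example: 1 = pathogenic, 0 = common variant
labels = [1, 0, 1, 0, 0]
features = {
    "feature_a": [0.9, 0.8, 0.7, 0.6, 0.5],
    "feature_b": [0.1, 0.9, 0.2, 0.8, 0.7],
}
ranking = sorted(
    ((name, auroc(vals, labels), average_precision(vals, labels))
     for name, vals in features.items()),
    key=lambda row: row[1],
    reverse=True,
)
```

With a 1:10 ratio of positives to negatives, as in the benchmark above, the AUPRC of a random ranking is close to 0.09, which is why AUPRC is reported alongside AUROC.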
Supervised learning of NCBoost based on a comprehensive set of ancient, recent and ongoing purifying selection signals in human
NCBoost, a machine learning approach based on gradient tree boosting (Methods), was trained on a ‘positive set’ of n=283 high-confidence pathogenic non-coding SNVs associated to monogenic Mendelian disease genes (randomly selecting one variant per gene out of the total n=737 initially obtained to avoid gene-level contamination of the training/testing sets; Figure 1C) and a ‘negative set’ of n=2830 randomly sampled common SNVs without clinical assertions, matched by region and allowing a maximum of one negative variant per gene (Methods). NCBoost is a bundle of 10 independently trained models, each of which excludes from training 1 out of 10 genome partitions across which ‘positive’ and ‘negative’ variants are evenly distributed. In this way, each non-coding variant in a putative cis-regulatory region of a protein-coding gene may be scored in NCBoost by the model that excluded from its training all variants, either pathogenic or non-pathogenic, associated to the same gene. This strategy reduces overfitting and avoids biasing the score of newly seen variants by the fact that they map in the vicinity of variants and genes initially presented to the classifier. Therefore, NCBoost may be applied to score any set of non-coding variants in cis-regulatory regions without contamination from the training set. Of note, the 10 models proved to be largely equivalent, as shown by the high correlation of their scores when applied to an independent set of variants excluded from their training (average Spearman correlation 0.96 ± 0.0111 across all pairwise comparisons among the 10 models; Methods).
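The bundle-of-models design can be sketched as follows. This is an illustration of the partitioning logic only, not the NCBoost implementation: the learner is replaced by a toy stand-in that merely records its training genes, all names are hypothetical, and genes are assigned to partitions at random without the stratification needed to spread positives and negatives evenly.

```python
import random


def assign_partitions(genes, n_parts=10, seed=0):
    """Assign every gene to exactly one of `n_parts` partitions."""
    unique = sorted(set(genes))
    random.Random(seed).shuffle(unique)
    return {gene: i % n_parts for i, gene in enumerate(unique)}


def train_bundle(variants, partition_of, train_fn, n_parts=10):
    """Train one model per partition, each time leaving that partition out."""
    return [
        train_fn([v for v in variants if partition_of[v["gene"]] != held_out])
        for held_out in range(n_parts)
    ]


def model_for(variant, models, partition_of):
    """Pick the model that never saw any variant of this variant's gene."""
    return models[partition_of[variant["gene"]]]


def toy_train(training_variants):
    """Stand-in learner (assumption): remembers the genes it was trained on."""
    return frozenset(v["gene"] for v in training_variants)


# Toy data: 110 variants, one per gene
variants = [{"id": f"v{i}", "gene": f"gene{i}"} for i in range(110)]
partition_of = assign_partitions(v["gene"] for v in variants)
models = train_bundle(variants, partition_of, toy_train)
leaks = [
    v["id"] for v in variants if v["gene"] in model_for(v, models, partition_of)
]
```

The empty `leaks` list shows the property that matters here: whichever variant is scored, the selected model was trained without any variant from the same gene.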
Six feature configurations were evaluated, including the following combinations of feature categories: A, B, A+B, A+B+C, A+B+C+D and A+B+C+D+E. The different NCBoost configurations were first tested mimicking a ten-fold cross-validation on the same n=283 high-confidence pathogenic non-coding SNVs and n=2830 common variants. Figure 3 shows the area under the receiver operating characteristic (AUROC) curve and the area under the Precision-Recall curve (AUPRC) obtained for each of the six feature configurations. Best performance was reached by the model including ABCD features: AUROC_ABCD = 0.84 and AUPRC_ABCD = 0.47. The figures represent a relative improvement of 9% (AUROC) and 42% (AUPRC) over a model based purely on interspecies sequence conservation features at position and window level. Results were consistent when NCBoost was trained and tested on positive variants from each of the 3 resources taken independently, i.e. HGMD-DM, ClinVar and Smedley’2016 (Figure S3A, S3B and S3C, respectively).
The feature importance analysis of NCBoost across the explored feature configurations revealed a balanced contribution of inter-species sequence conservation features at position and window level (Category A, cumulative importance in the ABCD configuration CI_ABCD = 42%) and recent and ongoing natural selection signals in human at position and window level (Category B, CI_ABCD = 33%) collectively considered (Figure S4A and S4B). Such balance is observed in spite of the sharp differences in predictive ability observed across features when considered independently (Figure 2). The collective feature importance of recent and ongoing natural selection signals in human is in turn much higher than what could be expected from the incremental performance obtained in the joint model AB (interspecies and intraspecies selection) as compared to A (interspecies selection; Figure 3). Neither observation is merely the straightforward consequence of the correlation structure across features (Figure S5). These results show the capacity of a supervised learning approach using gradient tree boosting to extract complex patterns of natural selection signals distinguishing pathogenic from non-pathogenic non-coding variants.
Notably, in contrast with the state-of-the-art methods evaluated (Figure S2), the per-region distribution of NCBoost scores across the 737 high-confidence non-coding pathogenic variants and 4’960’178 common SNVs showed a clearer separation between pathogenic and common variants for all types of regions evaluated (Figure S6). Thus, the distribution of median scores per gene for common SNVs in 5’UTR was significantly lower (i.e. less severe) than that of pathogenic SNVs in all evaluated regions (i.e. intronic, 3’UTR, 5’UTR and upstream regions; two-sided Wilcoxon test p-values <1e-10; Table S3), with the exception of intergenic regions due to the low sample size.
Comparative benchmark against state-of-the-art methods
NCBoost performance observed in Figure 3 (Configuration ABCD) was compared against the results of the 6 state-of-the-art methods considered in this work (i.e. CADD, DeepSEA, Eigen, Eigen-PC, FunSeq2 and ReMM) when applied on the same ‘positive’ and ‘negative’ sets of SNVs (Figure 4). NCBoost outperformed all evaluated methods regarding both AUROC and AUPRC, with a relative improvement of 10% and 34% respectively over the 2nd ranked method (ReMM), and of 13% and 104% over the 3rd ranked method (Eigen). We note here that ReMM is a supervised learning method whose training set partially overlapped with the ‘positive set’ of pathogenic non-coding SNVs used for testing here. Results were consistent when the benchmark was performed on positive variants from each of the 3 resources taken independently, i.e. HGMD-DM, ClinVar and Smedley’2016 (Figure S7A, S7B and S7C, respectively).
NCBoost also outperformed the reference methods when testing on the same ‘positive set’ of n=283 high-confidence pathogenic non-coding SNVs as in Figure 4 and on a negative set that, rather than common variants, is composed of 2830 randomly selected rare variants (allele frequency < 1%) matched by region (Figure S8; Methods). This test allows ruling out the possibility that the results shown in Figure 4 are merely explained by the capacity to discriminate rare from common variants, rather than pathogenic from non-pathogenic variants.
In a more stringent set-up, we further explored the capacity of the different methods to discriminate pathogenic and non-pathogenic variants within the same non-coding region of a given gene. For this purpose, we restricted the previous testing to a set of 149 region-matched pairs of pathogenic and random common variants associated to 54 unique genes (Figure S9). The results obtained were consistent with those previously observed in Figure 4 and Figure S8, further supporting the superior capacity of NCBoost to discriminate pathogenic variants as compared to reference methods. We note that in both Figure S8 and Figure S9 NCBoost was not retrained; the same NCBoost ABCD model, trained as described in the previous section, was used.
Fully independent training and testing across all possible configurations of the three sources of high-confidence non-coding pathogenic SNVs
To further characterize the performance of the NCBoost approach under different training and testing scenarios, we evaluated all possible configurations of the training and testing sets across the three sources of high-confidence non-coding pathogenic SNVs, i.e. HGMD-DM, ClinVar and Smedley’2016. Thus, the ‘positive set’ of n=283 high-confidence pathogenic non-coding SNVs and the associated ‘negative set’ of n=2830 common SNVs matched by region (Methods) were each split in two non-overlapping sets in three different ways according to the source of annotation (Table 2), that is: training on the pathogenic variants reported in one source and testing on those in the other two sources not overlapping with the first one. In addition, we explored two additional configurations: training on variants reported in at least two sources and testing on those reported in only one source, and the other way around. For each different training set, we retrained NCBoost as a bundle of 10 independently trained models, each excluding 1 out of 10 genome partitions, as previously done for the entire sets. We note again that a maximum of one positive and one negative variant per gene was allowed within the positive and negative sets, so that gene-level contamination across the training and test sets is avoided. Table 2 shows the AUROC and AUPRC values obtained on each of the independent test sets when applying the corresponding NCBoost model. Consistent with previous sections, NCBoost outperformed the reference state-of-the-art methods under all training and testing scenarios evaluated.
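The five training/testing configurations above follow from set membership of each variant in the three sources, as in this sketch. The `sources` field, the source labels used as strings and the function names are hypothetical, and only the split of the positive set is shown.

```python
SOURCES = ("HGMD-DM", "ClinVar", "Smedley2016")  # labels are illustrative


def split_one_source(variants, train_source):
    """Train on variants reported in `train_source`; test on variants
    reported only in the other sources (no overlap with the training source)."""
    train = [v for v in variants if train_source in v["sources"]]
    test = [v for v in variants if train_source not in v["sources"]]
    return train, test


def split_by_support(variants, train_on_multi=True):
    """Train on variants reported by at least two sources and test on those
    reported by a single source, or the other way around."""
    multi = [v for v in variants if len(v["sources"]) >= 2]
    single = [v for v in variants if len(v["sources"]) == 1]
    return (multi, single) if train_on_multi else (single, multi)


# Toy data (hypothetical identifiers)
variants = [
    {"id": "v1", "sources": {"HGMD-DM"}},
    {"id": "v2", "sources": {"HGMD-DM", "ClinVar"}},
    {"id": "v3", "sources": {"ClinVar"}},
    {"id": "v4", "sources": {"Smedley2016"}},
    {"id": "v5", "sources": {"HGMD-DM", "ClinVar", "Smedley2016"}},
]
train_hgmd, test_hgmd = split_one_source(variants, "HGMD-DM")
train_multi, test_single = split_by_support(variants)
```

Looping `split_one_source` over `SOURCES` gives the three source-based configurations, and the two calls to `split_by_support` give the remaining two.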
Discussion
In this work we implemented a supervised learning approach, called NCBoost, to classify pathogenic SNVs based on a comprehensive set of features at the position, flanking region and gene level, associated to interspecies, recent and ongoing selection in human. When trained and tested on multiple configurations of high-confidence sets of pathogenic non-coding SNVs associated to monogenic Mendelian disease genes, the approach outperformed reference methods. Notable improvements were observed on precision-recall rates. The context-specific assessment of natural selection signals made it possible to overcome the pervasive regional bias observed in all evaluated reference methods, which e.g. tend to assign to non-pathogenic SNVs in 5’UTR scores that are not significantly different from those assigned to high-confidence pathogenic SNVs in 3’UTR and intronic regions.
The curation process showed that current sets of high-confidence large-effect pathogenic non-coding SNVs associated to monogenic Mendelian diseases are mostly constituted of proximal cis-regulatory variants associated to the closest protein-coding gene, in line with previous reports 19. Such distribution most probably reflects a historical ascertainment bias towards such regions in previously described Mendelian genes, which is expected to be steadily overcome by unbiased WGS approaches 5. However, for the time being, the current status poses limits to supervised learning and benchmarking on distal cis- and trans-acting pathogenic regulatory variants with clinical implications in Mendelian diseases, and warns about the applicability of our approach and reference methods in such scope.
The approach implemented allowed us to evaluate the ability of recent and ongoing natural selection features in human to prioritize pathogenic non-coding SNVs when considered independently, collectively and in combination with interspecies conservation. While none of the features evaluated showed individual predictive strength (Figure 2), supervised learning performed through gradient tree boosting found complex patterns associated to pathogenic SNVs, reaching a significant performance by combining multiple features (Figure 3). Detailed feature importance analysis showed a prominent contribution of recent and ongoing natural selection signals under all feature configurations evaluated. However, their final impact on the global performance of the classifier, while significant, is attenuated by the fact that some signals may be redundant with selective constraints already accounted for by interspecies conservation. Best figures were, nevertheless, obtained when the collective assessment of interspecies and intraspecies natural selection features was performed taking into consideration the sequence context where SNVs occurred, as informed by the selective signals accumulated by the associated gene and by the type of non-coding element involved.
This work represents a proof-of-concept of the added value of incorporating a large and heterogeneous set of recent and ongoing natural selection features under a supervised machine learning approach for the detection of pathogenic non-coding SNVs associated to Mendelian diseases. The rapidly increasing sample size of current large-scale WGS projects in the general population is expected to have a major impact on the capacity to detect additional and more accurate recent and ongoing natural selection signals in human, with a consequent repercussion on their use to identify pathogenic non-coding variants, as recently illustrated 32,56,57.
In recent years, different large-scale projects have identified an important fraction of the regulatory elements of the human genome, and the epigenetic insights are proving valuable to understand the functional consequences of disease-associated variants in those regions 16–18. However, in the setting of this work, the small set of epigenetic features evaluated made only a minor contribution to the classification of pathogenic SNVs associated to Mendelian diseases, in line with the results of previous analyses 10,13,15. On the one hand, this may suggest that the epigenetic signals evaluated here are partially redundant with natural selection features; a more exhaustive extraction of epigenetic features is however beyond the scope of this work. On the other hand, it may reflect a lack of specificity with regard to the cell types and tissues relevant for the heterogeneous set of Mendelian diseases considered here. Along this line, recent studies are consolidating a view of regulatory mechanisms that is highly cell type-specific, where gene expression, DNA methylation, histone modifications, promoter interaction networks and transcription factor binding sites may substantially vary across tissues and developmental stages 58–61. Thus, the assessment of non-coding variants in the context of Mendelian diseases may largely benefit from the integration of purifying selection signals with the epigenetic information derived from the particular cell types, tissues and/or developmental times relevant for the onset and progression of a disease, as illustrated by recent successful examples 57,62. Nevertheless, the identification of the specific cell type and tissue to be considered may be a challenging task, especially in the case of largely uncharacterized rare Mendelian diseases and syndromes.
Recently, it was shown that the number of singleton variants found in each newly sequenced genome stabilizes on average at ~8’500, with regulatory elements highly enriched in the relative amount of SNVs found per kb of sequence 8. The large amount of rare variants in each individual genome, together with the typically low number of participants in the study of specific rare diseases, challenges the statistical power of downstream association and/or linkage studies to associate a genotype with a phenotype. The scoring approaches evaluated in this work may help filter variants to increase power, although they often need to be integrated within more comprehensive frameworks in order to reach the necessary sensitivity and specificity to identify causal variants in disease cohorts 19. In addition to the use of epigenetic information previously discussed, variant filtering strategies include focusing on SNVs associated to genes of phenotypic relevance for the disease under consideration 19,63. From a complementary perspective, gene-based or region-based aggregation tests of multiple variants (a class of rare variant association tests) have been developed to evaluate cumulative effects of multiple genetic variants in a gene or region, with the aim of increasing power when multiple variants are associated with a disease 64, e.g. burden tests and variance component tests implemented in popular software such as PLINK/SEQ and SKAT. In these approaches, a continuous weight function can be used in the aggregation of rare variants in order to up-weight those predicted to have more damaging consequences. A similar weighting strategy can be proposed for rare-variant extensions of the Transmission Disequilibrium Test in the analysis of parent-child trio data 65.
In both of these families of statistical tests for rare variant analysis of WGS data from Mendelian disease studies, the pathogenicity scores produced by the supervised learning approach implemented in this work, NCBoost, may be used to weight the aggregation of candidate pathogenic SNVs across heterogeneous cis-regulatory elements in a consistent way.
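As an illustration of such weighting, the sketch below computes a per-individual weighted burden for one gene or region as the weighted sum of minor-allele counts. Using the pathogenicity score directly as the weight is an assumption made here for simplicity. The function name, variant identifiers and numbers are hypothetical, and the sketch does not reproduce the interface of PLINK/SEQ or SKAT.

```python
def weighted_burden(genotype, weights):
    """Weighted burden for one individual over one gene or region.

    genotype: dict variant_id -> minor-allele count (0, 1 or 2)
    weights:  dict variant_id -> weight, e.g. a pathogenicity score in [0, 1]
    Variants without a weight contribute nothing (assumption).
    """
    return sum(weights.get(var, 0.0) * count for var, count in genotype.items())


# Toy example (hypothetical values)
weights = {"var1": 0.5, "var2": 0.25}
cases = [{"var1": 1, "var2": 2}, {"var1": 0, "var2": 1}]
burdens = [weighted_burden(g, weights) for g in cases]
```

The resulting per-individual burdens can then be compared between cases and controls, or between transmitted and untransmitted alleles in trio designs, by the association test of choice.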
Supplemental Table and Figure Legends
Table S1. High-confidence pathogenic non-coding variants associated to monogenic Mendelian disease genes.
Table S2. Variants randomly sampled from the set of common human SNVs without clinical assertion associated to protein-coding genes used as the non-pathogenic set for training and testing in this work. Sampling of variants was done to match the relative distribution across gene regions of the high-confidence pathogenic non-coding variants reported in Table S1.
Acknowledgements
This work was supported by the French National Research Agency (ANR) grant ANR-17-RHUS-0002 - C’IL-LICO project of the second “Investissements d’Avenir” program and by the MSDAvenir fund, Devo-Decode project.