1 Abstract
Genetic variants are associated with a number of human diseases, but they also occur in healthy individuals, contributing to inter-person and ethnic differences. A subset of DNA variants alter the sequences of the proteins encoded, but differences in the nature of protein variants in health and disease, and the cellular processes they may affect are not fully understood. Because of this, distinguishing missense variants associated with “healthy” and “diseased” states remains a challenge.
To understand the molecular principles which underlie these differences, we quantify variant enrichment at multiple levels, from 3D structure defined regions to full-length proteins, and integrate this with available transcriptomic (gene expression) and proteomic data (half-life, thermal stability, abundance). We show a clear separation between population and disease-associated variants.
In comparison to population variants, we find that disease-associated variants preferentially target proliferative and nucleotide processing functions, localise to protein cores and interaction interfaces, and are enriched in more abundant proteins. In terms of their molecular properties, we find that common population variants and disease-associated variants show the greatest contrast. Additionally, we find that rare population variants display features closer to common than disease-associated variants.
We highlight that a multidimensional, integrative approach is essential to obtain a better understanding of the molecular features which separate these studied datasets. Ultimately, this understanding will contribute to the prediction of variant deleteriousness, and will help in prioritising variant-enriched proteins and protein domains for therapeutic targeting and development.
The ZoomVar database, which underlies our analysis, is available at http://fraternalilab.kcl.ac.uk/ZoomVar, and allows users to structurally annotate SAVs and calculate variant enrichments in protein structural regions.
2 Introduction
The genomic revolution has brought about large advances in the identification of disease-associated variants. However, despite the recent explosion of genetic data, the problem of “missing heritability” still persists [1], where the genetic component of a phenotype remains unidentified. This phenomenon can most likely be attributed to variants where a causal link is difficult to establish. Prime examples are variants with low penetrance, and/or those with higher penetrance which are unique to single/few individuals, such as de novo variants implicated in developmental disorders [2]. Somatic cancer variants pose a similar problem, as driver mutations can be difficult to segregate from passenger mutations; moreover this classification may vary from case to case [3]. A common feature of all such variants where disease associations are difficult to establish is that they defy detection by the use of statistical methods alone [4]; therefore one must understand the impact of variants at the molecular level on protein structure and function, to correctly classify them. Moreover, this knowledge is essential for a number of strategies associated with the design of therapeutic interventions, e.g. structure-based drug design and/or drug repurposing [5].
A number of recent studies highlight that the characteristics of genetic variants, particularly those found in nominally healthy individuals, are still poorly understood. As a consequence, the boundary separating disease-causing from neutral variants may be more fluid than initially believed; an example being the fact that a number of missense variants thought to lead to severe Mendelian childhood disease were identified in nominally healthy individuals in the ExAC database [6]. Variant impact predictors play an important role in the identification/prioritisation of potential disease-associated variants. However these have, in the past, been trained predominantly to detect the difference between disease-associated and common variants, neglecting the difference between disease-associated and rare variants; thus it has been suggested that these do not perform so well when distinguishing rare neutral variants from those which are pathogenic [7]. Moreover, whether common variants have more functional impact than rare variants is hotly debated [8,9]. A handful of recent studies have attempted to verify whether predictions are functionally accurate, by performing in vivo saturation mutagenesis. Here it has been shown that current predictive methods are limited in accuracy [10,11]. As the majority of such methods rely heavily on evolutionary, sequence-based information, cases where pathogenic mutations localise to non-conserved positions are often incorrectly predicted. This has been proven to be problematic, particularly in the case of compensated pathogenic mutations; such mutations are present as the wild-type in other species, where their pathogenic effects are negated by another variant [12].
Due to these observations, it becomes increasingly pressing to understand the molecular characteristics of variants in health and disease; including differences in the characteristics of driver and passenger somatic cancer mutations, and in the impact of rare and common population variants. Analyses of the localisation of variants to protein structure, taking into account their proximity to functional sites (e.g. post-translational modifications, or PTMs) [13, 14, 15, 16, 17, 18], have been shown to be effective in uncovering the impact of variants at the molecular level [19]. In the field of cancer research, protein structure-based methods have been used to successfully predict cancer driver genes [20,21], as validated by a recent large-scale study by Bailey et al. [22]. Despite such success, these methods do not appear to have been applied to other classes of variants (i.e. population and Mendelian disease-associated variants).
Only a few studies [15,23] have taken advantage of the recently available large-scale data to compare, using structural bioinformatics methods, disease-associated variants with somatic cancer variants, and variants found in the general population. Furthermore, recent papers propose that previously observed trends, which suggest that somatic cancer single amino acid variants (SAVs) are enriched in protein-protein interaction sites, could be due to biases caused by the tendency of disease-associated variants to localise to those proteins which are most experimentally studied [15]. Therefore robust statistical methods to treat these comparisons are urgently needed. Additionally, large-scale proteomics and transcriptomics datasets have been generated in recent years, but, to the best of our knowledge, studies which incorporate this information in the analysis of the impact of genetic variants are yet to appear.
These factors have motivated us to undertake an integrative analysis which compares disease-associated variants, including somatic cancer variants and germline disease-associated variants, with variants found at different frequencies in the general population. A unique feature of our analysis is in addressing the interplay between macroscopic features, such as proteomics data and functional pathways, with microscopic features, such as protein structural localisation. In particular, we have made use of recently published protein half-life data [24], along with protein abundance [25], thermal stability [26] and transcriptomics data [27], to uncover underexplored biophysical and biochemical principles governing the impact of variants.
The ZoomVar database, which underlies our analysis, enables users to structurally annotate SAVs and to calculate the enrichment of SAVs in protein structural regions. It can be queried directly via the web interface (http://fraternalilab.kcl.ac.uk/ZoomVar) or programmatically using the REST script downloadable from the site.
3 Methods
3.1 Data sources
3.1.1 Variant data
ClinVar (dbSNP BUILD ID 149) variant data [28], COSMIC coding mutations (v80) [29] and gnomAD exome data [30], all mapped to the GRCh37 genome build, were obtained in variant call format (VCF). The ClinVar dataset contains variants submitted through clinical channels. Only variants with CLINSIG codes 4 and 5 (likely pathogenic and pathogenic) were selected for further analysis. To ensure the quality of our dataset, we selected only variants with a “variant suspect reason code” of 0 (unspecified). Additionally, all variants labelled as being somatic were filtered from this dataset. All variant datasets were mapped to Ensembl protein sequences [31] using the Variant Effect Predictor (VEP) [32], and further mapped to canonical UniProt sequences and the respective structures/homologs.
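The three ClinVar filters described above can be sketched as a simple predicate. This is a minimal illustration only: the record layout and the key names (`clinsig`, `suspect_reason_code`, `is_somatic`) are hypothetical stand-ins, not the actual VCF INFO tags or the code used in this study.

```python
from typing import Iterable, List

# Clinical significance codes retained in the text (4 and 5).
PATHOGENIC_CODES = {4, 5}


def keep_clinvar_variant(variant: dict) -> bool:
    """Return True if a record passes all three filters described in the text.

    The dictionary keys are hypothetical names chosen for this sketch.
    """
    return (
        variant["clinsig"] in PATHOGENIC_CODES      # codes 4 and 5 only
        and variant["suspect_reason_code"] == 0     # suspect reason code 0
        and not variant["is_somatic"]               # drop somatic records
    )


def filter_clinvar(variants: Iterable[dict]) -> List[dict]:
    """Keep only the records that satisfy keep_clinvar_variant."""
    return [v for v in variants if keep_clinvar_variant(v)]


# Toy records, invented for illustration.
example = [
    {"id": "v1", "clinsig": 5, "suspect_reason_code": 0, "is_somatic": False},
    {"id": "v2", "clinsig": 2, "suspect_reason_code": 0, "is_somatic": False},
    {"id": "v3", "clinsig": 4, "suspect_reason_code": 1, "is_somatic": False},
    {"id": "v4", "clinsig": 4, "suspect_reason_code": 0, "is_somatic": True},
]
kept_ids = [v["id"] for v in filter_clinvar(example)]
```

In this toy input only `v1` satisfies all three conditions.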
3.1.2 Protein-protein interaction networks
A large non-redundant protein-protein interaction network (UniPPIN) [33] was used. This incorporates non-redundant data amalgamated from IntAct [34], BioGRID [35], STRING [36], DIP [37] and HPRD [38], as well as recent large-scale experimental studies [39,40,41].
3.1.3 Protein sequences and structures
The biounit database of the Protein Data Bank (PDB) was downloaded on 28/04/2017. For mapping purposes, in this study, both the canonical UniProt human protein sequences [42] (for mapping to structures and protein-protein interaction networks) and Ensembl protein sequences [31] (for mapping variant datasets) were used.
3.1.4 Gene and protein annotations
Gene sets for KEGG pathways were obtained from the MSigDB database [43]. Oncogene and tumour suppressor gene annotations were taken from Supplementary Table S2A from Vogelstein et al. [44]. Cancer drivers were taken from the Cancer Gene Census (CGC) (COSMIC v84). Genes from both tiers 1 and 2 were included. Conversions between gene symbols, Entrez gene identifiers and UniProt accession numbers were performed using the biomartR package [45]. A list of DNA-binding domains was obtained from the review by Vaquerizas et al. [46]. These domains were mapped from InterPro [47] IDs to PFAM IDs using conversion tables in PFAM (v31).
3.1.5 Protein-drug interaction mapping
A mapping of protein-drug interactions was obtained from DrugBank (v5.0.11) [48] (under “Target Drug-UniProt Links”) and filtered for human proteins. Drugs were mapped to a PFAM domain-type if at least one domain of that type occurs in a protein a drug is known to interact with. It is, of course, possible that a drug may only interact directly with another domain-type within the protein. However, this approach was chosen due to the fact that if only domain-drug interactions with supporting structural information are accepted, the data becomes both sparse and biased towards structurally resolved domains.
3.1.6 Proteomics and transcriptomics data
Protein thermal stability and half-life data were obtained from separate large-scale studies by the Savitski lab [26,24]. Gene expression quantification (Reads Per Kilobase of transcript per Million mapped reads [RPKM]) counts per sample (v6p) was downloaded from the GTEx portal [27] and grouped by tissue, according to the sample metadata provided. For each tissue type, we quantified the gene-wise proportion of samples with an RPKM equal to zero. Only those genes with zero counts in < 10 % of samples were retained for our analysis. Protein abundance data (protein per million [ppm]), integrated for each tissue/sample type were obtained from PaxDb [25].
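The expression filter described above (retain genes with zero counts in fewer than 10 % of samples of a tissue) can be sketched as follows. The input layout, a mapping from gene to a list of per-sample RPKM values for one tissue, is an assumption made for illustration.

```python
def zero_fraction(rpkm_values):
    """Fraction of samples in which the gene has an RPKM of exactly zero."""
    return sum(1 for value in rpkm_values if value == 0) / len(rpkm_values)


def expressed_genes(tissue_rpkm, max_zero_fraction=0.10):
    """Return genes with zero RPKM in strictly fewer than max_zero_fraction of samples.

    tissue_rpkm: dict mapping gene name -> list of RPKM values (one per sample)
    for a single tissue. This layout is hypothetical.
    """
    return {
        gene
        for gene, values in tissue_rpkm.items()
        if values and zero_fraction(values) < max_zero_fraction
    }


# Toy data, invented for illustration.
toy_tissue = {
    "GENE_A": [0.0] + [5.0] * 19,   # 1 of 20 samples is zero (5 %)  -> kept
    "GENE_B": [0.0] + [5.0] * 9,    # 1 of 10 samples is zero (10 %) -> dropped
    "GENE_C": [2.5] * 10,           # never zero                     -> kept
}
retained = expressed_genes(toy_tissue)
```

Note that the comparison is strict, following the "< 10 %" wording, so a gene with zeros in exactly 10 % of samples is excluded.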
3.2 ZoomVar Database
3.2.1 Identification of resolved structures/homologs
Canonical UniProt human protein sequences were assigned resolved structures/homologs from the PDB biounit database [49] using BLAST [50]. BLAST searches were carried out using both the entire protein sequences and domain sequences, which were defined by scanning UniProt sequences against the PFAM seed library [51] using HMMER [52]. Hits were only accepted with sequence identity > 30 % and E-value < 0.001. T-COFFEE [53] was used to obtain a residue level mapping of queries to structure hits. The quotient solvent accessible surface area [Q(SASA)] of each structure residue was computed using POPS [54].
3.2.2 Mapping of Ensembl proteins
Ensembl protein sequences were mapped to UniProt protein sequences [42], using the UniProt ID mapping. Additionally, if UniProt and Ensembl sequences were not of the same length, the sequences were aligned using T-COFFEE [53] to obtain a per residue mapping. Stretcher [55] was used to align those sequences which were too long to align using T-COFFEE.
3.2.3 Identification of interaction complexes
For each interaction in our protein-protein interaction network, resolved binary interaction complexes and homologues were identified using the BLAST search results. As an example, if protein A and B are annotated as interacting in UniPPIN, and their structure homologues A’ and B’ are located in a resolved structural complex (and at least one residue from each protein is involved in a shared interface), residues from A and B are mapped onto A’ and B’ to infer their interaction interface.
The partner-specific regression formula from HomPPI [56] was used to assign a score and zone to each interaction interface inferred in this way. Residues involved in interfaces were assigned using POPSCOMP [57]. Only those residues with a change in SASA > 15 Å² were annotated as interface residues.
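The SASA-change criterion can be illustrated with a short sketch. It assumes that per-residue SASA values (in Å²) are already available for the isolated chain and for the same chain within the complex; the function and argument names are hypothetical and do not reflect the POPSCOMP interface.

```python
SASA_CHANGE_CUTOFF = 15.0  # Å², the threshold stated in the text


def interface_residues(sasa_free, sasa_complex, cutoff=SASA_CHANGE_CUTOFF):
    """Return residues whose SASA drops by more than `cutoff` on complex formation.

    sasa_free / sasa_complex: dicts mapping residue id -> SASA in Å² for the
    isolated chain and for the chain in the complex (hypothetical layout).
    Residues missing from sasa_complex are treated as unchanged.
    """
    return {
        residue
        for residue, free in sasa_free.items()
        if free - sasa_complex.get(residue, free) > cutoff
    }


# Toy values, invented for illustration.
free = {10: 80.0, 11: 40.0, 12: 5.0}
bound = {10: 20.0, 11: 30.0, 12: 5.0}
interface = interface_residues(free, bound)
```

Here residue 10 loses 60 Å² and is annotated as interface, whereas residue 11 loses only 10 Å² and is not.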
3.2.4 Determination of per-residue binding partners
A protein may interact with multiple other proteins. For each of these interactions, a maximum of 10 corresponding best hits (ordered by HomPPI defined score [56]), located in the best populated zone, were considered. If a residue was located at the interaction interface, in at least half of these structures, it was annotated as interacting with that specific protein, otherwise it was annotated as non-interacting.
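The consensus rule described above can be sketched as below. The sketch assumes that hits have already been restricted to the best populated zone and that a higher score indicates a better hit; both are assumptions of this illustration, as is the data layout.

```python
def consensus_interface(hits, max_hits=10):
    """Residues found at the interface in at least half of the top-scoring hits.

    hits: list of (score, interface_residue_set) tuples for ONE interaction
    partner, assumed to come from the best populated zone already.
    Higher score = better hit (an assumption of this sketch).
    """
    top = sorted(hits, key=lambda hit: hit[0], reverse=True)[:max_hits]
    if not top:
        return set()
    counts = {}
    for _, residues in top:
        for residue in residues:
            counts[residue] = counts.get(residue, 0) + 1
    # "At least half" is implemented as 2 * count >= number of hits considered.
    return {residue for residue, count in counts.items() if 2 * count >= len(top)}


# Toy hits, invented for illustration.
toy_hits = [
    (0.9, {1, 2, 3}),
    (0.8, {2, 3}),
    (0.7, {3, 4}),
    (0.6, {3}),
]
interacting = consensus_interface(toy_hits)
```

With four hits, a residue must appear at the interface in at least two of them; residues 2 and 3 qualify, residues 1 and 4 do not.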
3.2.5 Mapping of variant data
Variants in each dataset were annotated according to protein region localisation using the ZoomVar database. For certain analyses the COSMIC data was divided into “driver” and “non-driver” subsets, taking drivers as variants which map to all proteins from both tier 1 and tier 2 of the Cancer Gene Census (CGC) (COSMIC v84). The non-driver subset contains all other variants.
3.2.6 Definition of regions
We defined several types of protein and domain regions as described below.
Interface regions were considered to be composed of residues which bind to at least one protein interaction partner. Core residues were defined as those with a Q(SASA) < 0.15. Surface residues were defined as those with a Q(SASA) ≥ 0.15 which do not take part in protein-protein interaction interfaces.
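The three structural region definitions can be expressed as a small classifier. One point is not specified in the text: how to label a residue that is both buried (Q(SASA) < 0.15) and part of an interface. The sketch below gives the interface label precedence, which is purely an assumption of this illustration.

```python
CORE_QSASA_CUTOFF = 0.15  # threshold stated in the text


def classify_residue(q_sasa, n_partners):
    """Assign a residue to 'interface', 'core' or 'surface'.

    q_sasa:     quotient solvent accessible surface area of the residue
    n_partners: number of protein partners the residue is annotated as binding
    Giving 'interface' precedence over 'core' is an assumption of this sketch.
    """
    if n_partners > 0:
        return "interface"
    if q_sasa < CORE_QSASA_CUTOFF:
        return "core"
    return "surface"


labels = [
    classify_residue(0.05, 0),
    classify_residue(0.15, 0),
    classify_residue(0.40, 2),
]
```

A residue with Q(SASA) exactly equal to 0.15 falls on the surface side of the threshold, following the "≥ 0.15" wording.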
Disordered protein regions were predicted using DISOPRED3 [58]. Intra-domain ordered regions were defined as those regions predicted to be ordered which lie within PFAM defined domains. Intra-domain disordered regions were defined as regions predicted to be disordered which lie within PFAM domains. Inter-domain disordered regions were defined as those regions not located within PFAM defined domains which are predicted to be disordered. Any residues with structural coverage were filtered from the inter-domain disordered regions as it was thought that these could potentially belong to domains which have not been defined by PFAM.
Ubiquitination and phosphorylation sites were obtained from PhosphoSitePlus [59]. Each site was mapped to the structural template with the highest identity. Regions close to phosphorylation and ubiquitination sites were defined as those within 8 Å in 3D space.
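The 8 Å neighbourhood of a modification site can be sketched with a plain Euclidean distance check. The text does not state which atoms are used for the distance calculation, nor whether the modified residue itself is counted; this sketch assumes one representative coordinate per residue and excludes the site itself.

```python
import math

PTM_DISTANCE_CUTOFF = 8.0  # Å, the threshold stated in the text


def residues_near_site(coords, site, cutoff=PTM_DISTANCE_CUTOFF):
    """Return residues within `cutoff` Å of a modification site.

    coords: dict mapping residue id -> (x, y, z) of one representative atom
            per residue (the choice of atom is an assumption of this sketch).
    site:   residue id of the phosphorylation or ubiquitination site.
    The site residue itself is excluded from the result.
    """
    site_xyz = coords[site]
    return {
        residue
        for residue, xyz in coords.items()
        if residue != site and math.dist(xyz, site_xyz) <= cutoff
    }


# Toy coordinates, invented for illustration.
toy_coords = {
    1: (0.0, 0.0, 0.0),   # the modified residue
    2: (3.0, 4.0, 0.0),   # 5 Å away
    3: (0.0, 0.0, 8.0),   # exactly 8 Å away
    4: (10.0, 0.0, 0.0),  # 10 Å away
}
near = residues_near_site(toy_coords, site=1)
```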
3.2.7 Creation of ZoomVar database
All data, including per-residue mappings, were stored in the ZoomVar MySQL database [60]. A web interface and REST architecture were implemented using the Django framework [61] to allow users to query this database. It is available at http://fraternalilab.kcl.ac.uk/ZoomVar.
3.3 Calculation of SAV enrichment
The binomial cumulative distribution function (CDF; see Equation 1) was used to assess the SAV enrichments of individual proteins, domains or domain-types, and the 2-tailed binomial test was used to assess the significance of enrichment/depletion. In this formula k is the number of observed SAVs which localise to a region, n is the total number of SAVs which localise to all regions of interest (all regions) and p is the ratio of the size of the region (number of amino acids) to the size of all regions:
$$F(k; n, p) = \sum_{i=0}^{k} \binom{n}{i} p^{i} (1-p)^{n-i} \qquad (1)$$
Hereafter, we refer to the binomial CDF as the variant enrichment score (VES).
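As a worked illustration of the quantities k, n and p defined above, the binomial CDF and a two-tailed binomial test can be computed with the Python standard library alone. The text does not state which two-tailed convention was used; the sketch below sums the probabilities of all outcomes that are no more likely than the observed one, which is one common convention and an assumption here.

```python
from math import comb


def binom_pmf(i, n, p):
    """Probability of exactly i successes in n trials with success probability p."""
    return comb(n, i) * p ** i * (1 - p) ** (n - i)


def variant_enrichment_score(k, n, p):
    """Binomial CDF P(X <= k): the VES for k SAVs in a region, n SAVs overall."""
    return min(1.0, sum(binom_pmf(i, n, p) for i in range(k + 1)))


def two_tailed_binomial_p(k, n, p):
    """Two-tailed p-value: total probability of outcomes no more likely than k.

    This convention is an assumption of the sketch, not taken from the text.
    """
    observed = binom_pmf(k, n, p)
    tolerance = observed * (1 + 1e-7)  # guard against floating-point ties
    total = 0.0
    for i in range(n + 1):
        prob = binom_pmf(i, n, p)
        if prob <= tolerance:
            total += prob
    return min(1.0, total)


# Hypothetical example: a region of 30 residues in a background of 300 residues
# (p = 0.1) that contains 12 of the 40 SAVs observed overall.
ves = variant_enrichment_score(k=12, n=40, p=30 / 300)
p_value = two_tailed_binomial_p(k=12, n=40, p=30 / 300)
```

A VES close to 1 indicates more SAVs in the region than expected from its size, a VES close to 0 indicates fewer, and a value near 0.5 indicates roughly the expected number.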
The calculations were performed for the regions defined in the table below:
For each analysis at the whole protein or whole domain level, all UniProt proteins/domains (except for immunoglobulin and T cell receptors) containing SAVs in any of the datasets analysed, were considered to be the background proteome (all regions). Proteins belonging to immunoglobulin and T cell receptor gene family products were filtered from all analyses (HGNC definition [62]), to avoid the inclusion of variants which could have arisen from the process of affinity maturation.
For all calculations of enrichment and simulations involving protein or domain regions (e.g. core, surface and interface), cases where the region is of size 0, or that the protein/domain contains no SAVs, were omitted in this analysis.
The overall SAV enrichment of protein regions, for each data set, was also calculated using a density-based metric (see Equation 2).
Here 95 % confidence intervals were estimated via bootstrapping (10,000 iterations). The 2-tailed significance of enrichment/depletion was estimated by simulation. 10,000 simulations were carried out for each dataset, in which the number of variants which localise to a given protein was kept constant, but their location within the protein randomised. The regional density of variants was calculated for each simulation and compared to the actual value in order to derive a p-value. Simulations were performed in this way, keeping the number of SAVs which localise to each protein fixed, in order to overcome bias which stems from the assumption that variants are uniformly distributed throughout the proteome.
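The randomisation scheme described above can be sketched as follows. Because Equation 2 is not reproduced here, the density used in this sketch (SAVs falling in the region divided by the number of region residues, pooled over proteins) is only a stand-in for the actual metric. The two-tailed p-value convention, the uniform sampling of positions with replacement, and the data layout are likewise assumptions of this illustration.

```python
import random


def region_density(proteins, positions_by_protein):
    """Stand-in density: SAVs in the region per region residue, pooled over proteins."""
    hits = sum(
        sum(1 for pos in positions_by_protein.get(name, []) if pos in prot["region"])
        for name, prot in proteins.items()
    )
    size = sum(len(prot["region"]) for prot in proteins.values())
    return hits / size


def simulate_p_value(proteins, observed_positions, n_sim=10000, seed=0):
    """Compare the observed density with densities from randomised SAV positions.

    The number of SAVs per protein is kept fixed; only their positions within
    each protein are redrawn uniformly at random.
    """
    rng = random.Random(seed)
    observed = region_density(proteins, observed_positions)
    n_greater_equal = 0
    n_less_equal = 0
    for _ in range(n_sim):
        shuffled = {
            name: [rng.randrange(prot["length"]) for _ in observed_positions.get(name, [])]
            for name, prot in proteins.items()
        }
        simulated = region_density(proteins, shuffled)
        n_greater_equal += simulated >= observed
        n_less_equal += simulated <= observed
    # Two-tailed p-value: twice the smaller tail, capped at 1 (an assumed convention).
    p_value = min(1.0, 2 * min(n_greater_equal, n_less_equal) / n_sim)
    return observed, p_value


# Toy protein of 100 residues whose region covers positions 0-9 and holds all 10 SAVs.
toy_proteins = {"P1": {"length": 100, "region": set(range(10))}}
toy_positions = {"P1": list(range(10))}
toy_density, toy_p = simulate_p_value(toy_proteins, toy_positions, n_sim=500)
```

Fixing the per-protein SAV count means that a heavily mutated or heavily studied protein contributes equally to the observed and simulated densities, which is the bias correction the text describes.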
3.4 Enrichment analysis of gene sets
3.4.1 Gene set enrichment
Enrichment analyses were performed using Gene Set Enrichment Analysis, using the implementation provided by the R FGSEA package [63]. Given an enrichment statistic for each query gene, the GSEA algorithm outputs a score per gene set, which quantifies the enrichment of query genes in the sets examined. This is then normalised by the size of the gene set, to give a normalised enrichment score (NES).
We utilise the centred VES as the enrichment statistic which is input into the GSEA algorithm. Here, the centred VES is simply obtained by subtracting 0.5 from the VES; therefore proteins with the expected number of SAVs have a centred VES of 0. At the whole protein level only sets with n ≥ 25 were considered. At the protein subregion level, variant enrichment data exists for a smaller number of proteins, due to incomplete structural coverage of the proteome. In order to perform a complete comparison between pathway enrichment at different levels, all pathways analysed at the whole protein level were also analysed at the protein subregion level.
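The centring step and the resulting ranking can be sketched in a few lines. This only prepares a ranked list of the kind a GSEA implementation takes as input; it does not reproduce the FGSEA algorithm itself, and the protein names and values are invented.

```python
def centred_ves(ves_by_protein):
    """Subtract 0.5 so that proteins with the expected number of SAVs score 0."""
    return {protein: ves - 0.5 for protein, ves in ves_by_protein.items()}


def ranked_proteins(ves_by_protein):
    """Proteins ordered from most enriched to most depleted by centred VES."""
    centred = centred_ves(ves_by_protein)
    return sorted(centred.items(), key=lambda item: item[1], reverse=True)


# Toy VES values, invented for illustration.
toy_ves = {"PROT_A": 0.9, "PROT_B": 0.5, "PROT_C": 0.1}
ranking = ranked_proteins(toy_ves)
```

After centring, positive values indicate enrichment and negative values indicate depletion, so the statistic can drive both ends of the ranking.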
3.5 Analysis of expression, abundance, and stability data
Spearman correlations of protein-wise and region SAV enrichments with expression levels (RPKM), abundance (ppm), half-life (hours), thermal stability (Tm), and density (mean contacts of core α carbons) were calculated. Additionally, gene set enrichment analysis was performed as in Section 3.4.1, but using the metrics in the table below as enrichment statistics.
Here it can be seen that the mean value for each quantity of interest was subtracted to obtain values centred around 0, allowing both pathway enrichment and depletion to be assessed.
3.6 Statistics and data visualisation
The majority of data analyses were performed in the R statistical programming environment. All corrections for multiple testing have been done using the Benjamini-Hochberg method in R (p.adjust function). Bootstrapping was performed using the boot package (function boot) [68]. Spearman correlations were performed using the SpearmanRho function of the DescTools package [69]. Heatmaps were produced with either the heatmap.2 function in the gplots package [70] or the ComplexHeatmap package [71], in which clustering, wherever shown, was performed with hierarchical clustering (hclust function) using default parameters unless otherwise stated. Circos plots were generated with the Circos package [72]. Additionally, binomial CDFs were calculated and two-tailed binomial tests performed using the NumPy package in Python [73].
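For readers who do not use R, the Benjamini-Hochberg adjustment mentioned above can be sketched in plain Python. The function below is written to follow the standard step-up procedure, as in p.adjust with method "BH"; it is an illustration and not the code used in this study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Visit p-values from largest to smallest, tracking the running minimum.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for steps_from_top, index in enumerate(order):
        rank = m - steps_from_top  # 1-based rank in ascending order
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Toy p-values, invented for illustration.
toy_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

The running minimum enforces monotonicity, so a test with a smaller raw p-value never receives a larger adjusted value than one with a larger raw p-value.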
4 Results
We present a multidimensional analysis of single amino acid variants (SAVs) observed in the general population (gnomAD database) [30], in comparison to somatic cancer-associated SAVs from the COSMIC database [29] and disease-associated SAVs from the ClinVar database [28]. Throughout this analysis we further divide the gnomAD data into its constituent common and rare variants, to investigate whether there are differences between these two subsets.
We ask whether the enrichment of variants is associated with specific structural features and functional pathways, and whether results differ for population and disease-associated variants. In particular we investigate the interplay between variant enrichment and proteomics features; for example, we explore whether disease-associated variants preferentially target the core of less thermally stable proteins, as these might be more prone to destabilisation, leading to complete/partial unfolding. Finally, we use these features to understand whether rare population variants demonstrate characteristics which are more similar to common population variants or disease-associated variants. Such exploration of the interplay of the microscopic versus macroscopic features of proteins is novel in the field.
Our analysis explores the enrichment of SAVs at different levels, constituting what we define as a protein-centric anatomy of variants in health and disease, as illustrated in Fig. 2. We employ a similar approach to that used in the prediction of cancer driver genes [74]: the SAV enrichment of individual proteins/regions has been modelled using a binomial distribution (Methods Equation 1), whereas global trends in the distribution of SAVs have been investigated by calculating SAV density (Methods Equation 2). The binomial cumulative distribution function quantifies the enrichment of variants (Fig. 2g) and is referred to as the Variant Enrichment Score (VES). This is assessed statistically using a two-tailed test (see Section 3.3). Additionally, the significance of the enrichment/depletion of SAVs, in terms of their density, is assessed by comparison to simulated SAV distributions, in which the number of SAVs is kept identical to that observed in the data, but their positions within the protein are randomised. This goes beyond similar studies (e.g. [16,17,18]) and addresses biases which could result from the increased study and structural coverage of disease-related proteins.
A summary of the numbers of SAVs investigated in each dataset is given in Table 3, and a more detailed breakdown is given in the Supplementary Materials Section S4.1.
4.1 Disease-associated and population variants target different functional pathways
We first investigated whether variants from each dataset target proteins which are involved in distinct functional pathways. To do this we performed KEGG [43] functional pathway analysis, by ranking proteins using their whole-protein VESs (see Fig. 2d) calculated for each dataset, and using the Gene Set Enrichment Analysis (GSEA) algorithm [43] (see Section 3.4.1).
The pathway enrichment data, for each mutation dataset, were subjected to clustering and Principal Component Analysis (PCA) (see Section 3.4.2). In Fig. 3a it can be seen that variant enrichment segregates pathways into three clusters. Strikingly each pathway cluster appears to have distinct characteristics. The cluster visualised in orange is primarily composed of terms associated with cancer, growth and proliferation, whereas that coloured in pink contains pathways associated with splicing, transcription, translation and metabolic terms. Pathways associated with sensory perception and the immune response are found in the final “green” cluster. A handful of metabolic pathways also localise to this cluster, however, these appear to be more associated with environmental response and adaptation than those pathways found in the “pink” cluster; for example, pathways associated with the metabolism of drugs and xenobiotics are found here. For brevity, the “orange”, “pink” and “green” clusters will be termed the “proliferative”, “nucleotide processing” and “response” clusters respectively, for the remainder of this text. A list of pathways assigned to each cluster is given in the Supplementary Materials Section S4.2.
This visualisation (Fig. 3a) also reveals that both the common and rare subsets of the gnomAD database associate mostly with the “response” cluster, whereas COSMIC data localises between the clusters associated with response and proliferation. ClinVar data associates (as revealed by the localisation of factor loadings) with the “nucleotide processing” cluster, between both the “response” and “proliferation” clusters. Strikingly, the population variant datasets (gnomAD rare and common) are clearly separated from the disease-associated variant datasets by the first principal component (PC1), whereas COSMIC variants are separated from ClinVar variants along the third principal component (PC3) (see Fig. 4a).
These trends of functional distinction are further visualised in the Circos plot (Fig. 3b) [72]. Here it can be clearly seen that the gnomAD data only shows significant enrichment for pathways belonging to the “response” cluster, whereas the COSMIC data shows enrichment for pathways belonging to this cluster and those belonging to the “proliferative” cluster. The ClinVar dataset displays enrichment for pathways belonging to all three clusters; uniquely showing enrichment for pathways within the “nucleotide processing” cluster.
We went on to extend this analysis to the protein region level (Figures 4 and S4). Here we find that proteins enriched in gnomAD variants at the surface (Fig. 4b) are significantly enriched in pathways belonging to the “proliferative” cluster. Moreover, this enrichment is shared between common and rare variants (albeit not significant for common variants in individual pathways after FDR correction). Proteins with surfaces enriched in disease-associated variants (from COSMIC and ClinVar) are, contrastingly, not enriched in “proliferative” cluster pathways. However, no such pattern emerges for the protein core and interface (Fig. 4c and S4b), suggesting that population variants avoid disrupting the function of proliferation-related proteins by preferentially localising to the surface. Interestingly, the “nucleotide processing” cluster does not show such a marked enrichment of variants which localise to the surface in the gnomAD database, a possible indication that these pathways are more robust to disruption than those in the proliferative cluster. These data show that there is clearly an interplay between variant localisation at the macroscopic level (functional pathways) and the microscopic level (structural regions).
4.2 Population and disease-associated variants localise to different protein regions
We then zoomed in to view trends in the enrichment of variants at the microscopic level. Specifically, we catalogued the enrichment of variants in core, surface and interface regions; intra-domain ordered regions (intra-ord), intra-domain disordered regions (intra-dis), and inter-domain disordered regions (inter-dis); and regions close to (≤ 8 Å from) phosphorylation sites and ubiquitination sites.
In agreement with previous research, we find disease-associated (ClinVar) variants to be enriched in both protein cores and interfaces, but depleted on protein surfaces (see Fig. 5a and Supplementary Fig. S4) [15, 16, 17, 18]. This reflects the potential disruption, caused by such mutations, of structurally and functionally important protein regions. The enrichment of ClinVar variants in structurally important sites is further demonstrated by their preferential targeting of residues which are highly connected when considering network representations of protein structure, as shown in Section S2.1 of the Supplementary Materials. GnomAD variants (both common and rare) and somatic non-driver variants display the opposite trend, most likely as variants which localise to protein surfaces are less likely to impact on protein structure and function than either core or interface mutations. Somatic driver variants follow trends closer to ClinVar variants, with slight, but significant, depletion on the surface, but enrichment in the core. Protein interfaces are enriched in disease-associated variants but depleted of gnomAD rare variants. GnomAD common variants appear neither significantly enriched nor depleted; however, this may result from the relative sparsity of the data, as fewer variants are shared between many individuals (this is clearly evidenced by the numbers in Table 3). Interestingly, COSMIC non-driver variants appear depleted in interacting interfaces. However, it becomes clear that they are actually significantly enriched when compared to simulated null distributions (Fig. 5a inset), and that this enrichment is due to a small subset of proteins which harbour a large number of variants at interface regions.
Genes in which these variants reside may be putative driver genes (see Supplementary Materials Section S4.3), as a number of known driver genes are enriched in variants in protein interface regions [16,22,74], and this phenomenon has been exploited by Porta-Pardo et al. [74] to identify cancer driver genes.
A more detailed per-protein analysis can bring finer granularity into the comparison of variant enrichment. Therefore we look at a curated list of oncogenes and tumour suppressor genes (TSGs) (see Section 3.1.4) [44]. Several studies have suggested that proteins encoded by oncogenes (which are activated upon mutation) and TSGs (which are inactivated upon mutation) tend to be enriched in mutations in different protein regions [15,16,75]. We found that clustering based on VESs broadly classifies these proteins into two groups, one comprising proteins enriched in mutations mainly at protein-protein interaction interfaces and protein surfaces, and another group of proteins generally enriched in mutations in the core (some of these proteins also show enrichment in mutations in interacting interfaces, but a clear depletion at the surface is evident) (Fig. 5b). Interestingly, we observe a statistically significant (Fisher's exact test p-value = 0.004199) segregation of these two groups in terms of cancer driver status: the first group of proteins are mainly (17 out of 24) products of oncogenes, and the other mainly those of TSGs (17 out of 25). These results are consistent with the hypotheses that activating mutations in oncogenes are likely to affect particular functions by targeting specific interactions, whilst inactivating mutations in TSGs abrogate protein function [16,75]. Taking the oncogenes and TSGs as two separate groups, the GSEA result confirms a similar trend; moreover, it can also be seen that the disease-associated datasets (ClinVar and COSMIC) show opposite patterns of enrichment in comparison to the gnomAD data (Supplementary Fig. S6) [15,16,75]. These results confirm that our approach reproduces previous results and highlights clear, robust trends.
On analysis of variant enrichment in ordered and disordered regions, we again observe clear segregation between disease and population variants (see Fig. 5a). ClinVar and COSMIC variants are depleted in inter-domain disordered regions and enriched in intra-domain ordered regions. In contrast, gnomAD variants (both rare and common) appear enriched in inter-domain disordered regions and depleted in intra-domain ordered regions. GnomAD common and rare variants show similar trends to one another, which are distinct from those of disease-associated variants. Using quantitative statistical measures, these results suggest, as intuitively one would expect, that variants are more likely to be pathogenic if they fall within ordered domain regions.
The density of variants close to PTMs is also shown in Fig. 5a. Here, ClinVar variants appear enriched when considering the density of SAVs close to phosphorylation sites, but not significantly so in comparison to simulations. The large bootstrap confidence interval suggests this may be due to the sparsity of the data available. A similar observation is seen for COSMIC driver variants; however, COSMIC non-driver variants, which appear depleted according to variant density, are significantly enriched close to phosphorylation sites in comparison to simulated null distributions (Supplementary Fig. S5). This indicates that, in agreement with a number of other studies [76,77], the disruption of phosphorylation sites may play a particularly important role in cancer. In contrast to phosphorylation sites, all data sets appear depleted of variants close to ubiquitination sites (Supplementary Fig. S5).
Together, these analyses show that the enrichment of missense variants at various structural features consistently segregates population variants from disease-associated ones. For the majority of structural regions defined here, the greatest and most consistent distinction is seen between gnomAD common and ClinVar variants, provided the data are not too sparse.
4.3 Towards a domain-centric landscape of variant enrichment
We then proceeded from the protein level to examine variant enrichment at the domain level. First, we compare variant localisation across protein domain types, using PFAM domain definitions. As depicted in Fig. 2, we calculated the variant enrichment at the amalgamated whole-domain and structural region (core/surface/interaction sites) levels.
Strikingly, characteristic patterns of variant enrichment appear. Fig. 6 depicts the union of the top 20 most variant-enriched domains for each dataset. Here it can be seen that a small number of domains appear enriched in variants only in the COSMIC and ClinVar datasets. These include known drug targets such as kinase and ion channel domains. A handful of domains which are enriched only in COSMIC variants include the Cadherin tail and Laminin G 2 domains, both of which play an important role in cancer [78,79]. A larger number of domains are variant-enriched in both the COSMIC and gnomAD datasets (rare and common variants). Some domains (e.g. Serpin, UDPGT, Collagen and EGF CA) are enriched in variants from all datasets, or from all datasets with the exception of COSMIC. In such domains, it is likely that the precise structural localisation of a variant determines whether it plays a pathogenic role. Intriguingly, a few domain types, such as NPIP and NUT, appear enriched only in common variants. This could suggest that these domains take part in functions for which it is desirable to maintain diversity within a population; however, little is known about either domain type [80,81]. This further highlights the bias in study towards domains associated with disease, rather than those enriched in population variants.
It also becomes apparent that the global trends in variant localisation to the core, surface and interface regions, observed in Section 4.2, are recapitulated here. Again, the majority of domains are enriched in gnomAD (rare and common) variants at the surface but in ClinVar variants at the core. Although COSMIC variants show a trend broadly similar to gnomAD variants, it is clear that a larger proportion of domain types are enriched at the core or interface. These include domain types with known cancer driver associations, such as the P53 and VHL domains [82]. The observed patterns of variant enrichment are further highlighted by considering variant localisation to CATH architectures and by a case study on DNA-binding domains, both presented in the Supplementary Materials (see Sections S2.3 and S2.2).
We wished to understand how the targeting of domains by drugs and small molecules maps onto the landscape of variant enrichment described above. To investigate this, we used the protein-drug mapping provided in the DrugBank database, as detailed in Section 3.1.5. As already extensively pointed out [83], the targeting of domain types by existing drugs is highly biased towards a small number of domain types, such as GPCRs and kinase domains. Indeed, we observe a large number of drugs targeting proteins containing 7tm (GPCR) domains. These domains are enriched in variants from the gnomAD and COSMIC databases but are devoid of disease-associated ClinVar variants. Interestingly, it has recently been shown that genetic variants in such domains (GPCRs), identified in the general population, may be associated with differential drug response between individuals [84]. Our domain-centric landscape of variant localisation therefore highlights, for each domain type, implications useful both for understanding variant impact and for motivating therapeutic design (see discussion).
4.4 Proteomics and transcriptomics features associate with variant localisation
Proteins, of course, do not function in isolation but in the crowded environment of the cell. In our analysis so far we have viewed proteins through their three dimensional and functional properties; however, we have to consider that proteins may be present in the cell in different quantities, display different turnover rates and possess different melting temperatures. All of these factors can crucially affect the stability and the fitness of a protein to perform its function. Here we have made use of large-scale proteomics data, including protein abundance data from PaxDb [25] and data describing both protein half-lives and thermal stability from the Savitski lab [26,24], together with transcriptomics data (GTEx database [27]), to explore relationships between these features and variant localisation. Please note that the numbers of proteins and SAVs which underlie each comparison are described in the Supplementary Materials Section S4.5.
Our results show that the protein-wise enrichment of disease-associated variants displays positive correlations with protein abundance, expression, half-life and thermal stability, whereas population variants exhibit the opposite trend (see Fig. 7 and Supplementary Figures S7-S8). It is important to recall here that the protein-wise enrichment of variants is calculated in comparison to the entire proteome (all UniProt proteins which contain SAVs in any of the datasets; see Fig. 2d).
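The kind of correlation described above can be sketched as follows, assuming a rank-based (Spearman) coefficient; the ρ notation used later in this section suggests this, but the exact statistic is specified in the Methods rather than here. The function names and the toy values are hypothetical and are not data from this study.

```python
def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical toy data: per-protein enrichment scores and abundances
enrichment = [0.2, 0.5, 0.9, 1.4, 2.0]
abundance = [3.0, 10.0, 8.0, 50.0, 120.0]
rho = spearman_rho(enrichment, abundance)
```

Because only ranks enter the calculation, the result is insensitive to the heavy-tailed scale of abundance measurements, which is one reason a rank-based coefficient is a natural choice for this kind of data.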
However, zooming in on the enrichment of variants in the core of protein structures, we found that, in comparison to all regions of proteins with resolved structure, rare population variants demonstrate a positive correlation with abundance and thermal stability, whereas disease-associated variants correlate negatively with these properties (see Fig. 7). These results prove robust across multiple tissue types. Analogous correlations for variant enrichment at protein surfaces display trends opposite to those observed at protein cores. Due to the relative sparsity of variants which map to protein interfaces, we believe it is difficult to draw robust conclusions from any trends observed for correlations of proteomics data with variant enrichment at protein-protein interaction sites.
Our results, at the “core” region level, for gnomAD rare and ClinVar variants suggest that disease-associated variants might preferentially localise to the core of unstable proteins, as these might be more easily destabilised to a degree at which function is deleteriously impacted. This possibility is further explored in the discussion. Similarly to the ClinVar data, the gnomAD common data also show negative correlations for variants occurring at the protein core; this could potentially give weight to the argument presented by Mahlich et al. [8] that common variants could affect molecular function more than rare variants. However, we believe this is more likely to be due to the fact that very few common variants localise to protein cores, as shown by Fig. 5, resulting in sparse statistics (i.e. the correlation is calculated over Variant Enrichment Scores which are already very low). One might expect that mutations would be less easily accommodated in the cores of densely packed proteins, which would have higher thermal stability. To assess this, we calculate the mean number of Cα contacts within 8 Å of core residues, as a proxy for protein density. We find a significant correlation between this metric and protein thermal stability (vehicle 1: ρ = 0.168, q-value = 1.464e-12; vehicle 2: ρ = 0.185, q-value = 1.529e-13). If we correlate this metric of core density (see Supplementary Materials S1.2 for details) with the core Variant Enrichment Score, we find a significant negative correlation for the gnomAD common dataset. No other datasets show significant correlations with core density; however, a clear trend emerges in which correlations become progressively more positive in the order gnomAD common, gnomAD rare, somatic driver, somatic non-driver and ClinVar (see Supplementary Fig. S9). This suggests variants may be more deleterious if they localise to a packed core.
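The core density proxy described above can be illustrated with a short sketch. It assumes that, for each core residue, contacts are counted against the Cα atoms of all other residues in the structure, and that a contact means a Cα-Cα distance of at most 8 Å; the precise definition is given in Supplementary Materials S1.2 and may differ in detail. The function name and the toy coordinates are hypothetical.

```python
from math import dist

def mean_core_contacts(ca_coords, core_indices, cutoff=8.0):
    """Mean number of other C-alpha atoms within `cutoff` angstroms of each core residue."""
    counts = []
    for i in core_indices:
        n_contacts = sum(
            1
            for j, other in enumerate(ca_coords)
            if j != i and dist(ca_coords[i], other) <= cutoff
        )
        counts.append(n_contacts)
    return sum(counts) / len(counts)

# Hypothetical toy structure: seven residues spaced 3.8 angstroms apart on a line
ca = [(3.8 * k, 0.0, 0.0) for k in range(7)]
density = mean_core_contacts(ca, core_indices=[3])
```

In this toy chain the central residue has four neighbours within the cutoff (two on each side); a residue buried in a globular core would typically have many more, which is what makes the count usable as a packing proxy.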
Again the complexity of the interplay between features is highlighted, as the higher stability of proteins with more packed cores suggests that destabilisation, to a degree which is physiologically relevant, may be more difficult to achieve. Although core packing and thermal stability are correlated, the correlation value (ρ) is low. Therefore, this feature is clearly not the only determinant of protein stability.
The results we see at the whole-protein level, where the disease-associated ClinVar data clearly show a more positive correlation with Tm, are, at first sight, more difficult to explain. However, work by the Picotti lab [85] has demonstrated that more stable proteins are generally more abundant. In agreement with this, we find significant correlations between the protein abundance and thermal stability data (see Supplementary Materials Section S4.6). Moreover, we do see significant positive correlations of protein-wise variant enrichment with protein abundance in our analysis (see Fig. 7b). This suggests that the preferential localisation of ClinVar variants to more stable proteins could be attributed to the higher abundance of such proteins.
Interestingly, the trends observed at both the protein level and the core region level are less pronounced for cell line data and break down for extracellular fluids (saliva and urine). Moreover, the trend is most evident for tissues containing long-lived cell types, such as the brain, ovary and testis. Transcriptomics data (see Supplementary Fig. S7) again reinforce this picture, albeit with less contrast between datasets (particularly at the protein core).
Finally, we wanted to understand whether correlations with these proteomic and transcriptomic features could be associated with the specific functional roles of the involved proteins. This was achieved by investigating the association of these proteomic and transcriptomic features with biological pathways, using the GSEA algorithm. For the majority of proteomic and transcriptomic features, no clear associations with the functional clusters identified in Fig. 3 can be detected (see Supplementary Figures S11-S13). An exception to this is protein thermal stability: pathways which belong to the “proliferative” cluster are clearly enriched in proteins of lower stability than the other two clusters (see Fig. 7c). This suggests that proliferation-related proteins may be vulnerable to disruption by mutations which target their already unstable cores. Moreover, this agrees with the idea proposed in Section 4.1, that “proliferative” cluster proteins may be less robust to disruption.
4.5 Rare variants are similar to common variants
Throughout the majority of analyses, performed both at the macroscopic and microscopic levels, the greatest segregation of data can be seen between common and disease-associated variants (see Figures 3-5). Rare variants show characteristics more similar to common variants, both in terms of the functional pathways they target, and in terms of the protein regions they localise to (core, surface and interface, order and disorder). If more stringent minor allele frequency (MAF) thresholds are used to define rare variants, their properties move towards those of disease-associated variants, but still remain closest to those of common variants (see Fig. 8 and Supplementary Fig. S14). A visible separation between common and rare variants, especially in the pathway analysis, can only be seen if an extreme MAF cutoff (<0.00001) is used.
5 Discussion and conclusions
Throughout this work, we show that SAVs in the general population, considered ‘nominally healthy’, show properties distinct from those in disease cohorts, both at the macroscopic (omics features and functional pathways) and microscopic levels (protein structural localisation). Additionally, although we uncover a spectrum in these properties of variants, which ranges from common population variants to disease-associated ClinVar variants, we find that the properties of rare variants remain close to those of common variants. These findings contrast with other observations [8], which suggest that common variants have more impact on molecular function than rare variants. Common variants appear closer in character to disease-associated variants than to rare variants only for certain proteomics properties, such as the thermal stability and abundance of the targeted proteins, as discussed in Section 4.4. However, we consider these results inconclusive, due to the sparsity of the data. Alhuzimi et al. [9] suggest that the properties of genes enriched in rare population variants are similar to those enriched in disease-associated variants, and are thus good candidates for harbouring unknown disease associations. Instead, we show that such proteins are, in terms of their annotated functional pathways, most similar to those enriched in common variants (Fig. 8). Moreover, our results, which show that variants maintained within a population target functions which are mainly associated with response to the environment (Figs 3a and 4), agree with results from evolutionary studies reviewed in [86].
We have dissected the levels of variant enrichment in diverse datasets and across different protein levels (Fig. 2). Such a detailed anatomy of variant enrichment in health and disease provides a unique link between the cataloguing of mutations and the understanding of both their mechanistic and functional effects. This supplies invaluable information to researchers studying specific proteins or domains, or focusing on proteins involved in a particular function (e.g. DNA binding; Fig. S3). By analysing the enrichment of variants in protein regions (core, surface, interface, order and disorder, PTM vicinity), we recapitulate trends observed by previous studies (e.g. in the comparison of oncogenes and TSGs; Fig. 5b) [15, 16, 17, 18, 75], but also shed light on the debate as to whether somatic cancer variants are enriched in interface regions, by simulating null distributions of variants. The simulations we have performed show that it is essential to consider that variants from different datasets are not uniformly randomly distributed throughout the proteome. Through density-based metrics we find that somatic cancer variants are not enriched in protein interfaces; however, using a simulation-based approach we do find an enrichment (Fig. 5a). A similar simulation-based approach was taken by Gress et al. [15], but they found no significant enrichment for COSMIC variants in interface regions. Whilst they analysed a filtered set of mutations likely to play a driver role, we investigated all somatic variants and addressed separately mutations that localise to defined driver and non-driver genes. Our enrichment calculations were compared directly against simulated null distributions (n = 10,000 simulations) to assess statistical significance. Throughout this analysis, we have, of course, been limited by the number of proteins with available structural data, although this has been enriched by considering homologous structures.
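To make the contrast between density-based and simulation-based enrichment concrete, the following is a minimal sketch of one plausible null model: each protein keeps its observed number of variants, which are redistributed uniformly over that protein's own residues, so that differences in per-protein variant load are preserved. This is an illustrative assumption and not necessarily the procedure used here, which is described in the Methods. All names and the toy input are hypothetical.

```python
import random

def simulate_null(proteins, n_sim=2000, seed=0):
    """Null counts of variants falling in a region of interest.

    proteins: list of (length, region_positions, n_variants) tuples, where
    region_positions is a set of 0-based residue indices (e.g. interface sites).
    """
    rng = random.Random(seed)
    null_counts = []
    for _ in range(n_sim):
        hits = 0
        for length, region, n_variants in proteins:
            # Redistribute this protein's variants uniformly over its own residues.
            for _variant in range(n_variants):
                if rng.randrange(length) in region:
                    hits += 1
        null_counts.append(hits)
    return null_counts

def empirical_p(observed, null_counts):
    """One-sided empirical p-value for enrichment, with the usual +1 correction."""
    exceed = sum(count >= observed for count in null_counts)
    return (1 + exceed) / (len(null_counts) + 1)

# Hypothetical toy input: two proteins, each with 10% of residues in the region
toy = [(100, set(range(10)), 20), (200, set(range(50, 70)), 30)]
null = simulate_null(toy)
p_emp = empirical_p(observed=15, null_counts=null)
```

With this design the smallest attainable p-value is 1 / (n_sim + 1), so the number of simulations sets the resolution of the test; this is one reason a large number of simulations, such as the 10,000 used here, is desirable.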
We are also still limited by the structural coverage of protein interactions; although enough data exists to uncover broad trends, our analyses at a finer granularity, which probed protein-protein interaction sites, generally lacked statistical power. Moreover, it is likely that a more detailed picture will emerge if variant localisations to proteins involved in different classes of interactions are probed (e.g. transient vs permanent interactions). We envisage that the recent advances in cryo-EM [87], and the integration of structural data derived by a variety of techniques [88], will further increase the structural coverage of the protein-protein interaction network, enabling such finer-grained analyses in the future.
Our analysis at the macromolecular level, which probes associations between the enrichment of variants and proteomic features, is, to the best of our knowledge, unprecedented, and has only been made possible by the recent release of large-scale proteomics data [24, 25, 26, 85]. We observe correlations which suggest an interplay between variant enrichment, protein abundance and thermal stability. First, disease-associated variants localise preferentially to proteins which are highly expressed and abundant (Fig. 7). These results complement a body of research which concludes that the rate of protein evolution correlates negatively with protein expression and abundance [89]. The extent of this anti-correlation has been found to be tissue-specific, with tissues of high neuron density demonstrating the strongest anti-correlation [90]. Consistent with this, we found the largest negative correlation for the protein-wise enrichment of rare variants, from the gnomAD dataset, with protein abundance in the brain and, interestingly, also in the ovary and testis, which both harbour long-lived germline progenitor cells (Fig. 7b; Fig. S7); purportedly, the long lifespan of these cells renders them more sensitive to the toxicity of misfolded proteins. Second, we see a trend which suggests disease-associated variants preferentially localise to the core in less thermally stable proteins, most probably as these are more easily destabilised to an extent at which function is lost or impaired (Fig. 7a). Hence two competing trends emerge: variants which localise to less abundant proteins have greater disruptive potential; conversely, those which localise to thermally unstable proteins (which are normally less abundant [85]) may be able to deleteriously destabilise such proteins more easily. It is conceivable that the chemical nature of the particular missense variant plays an important role here: e.g. if a variant at the protein surface alters the “stickiness” of the protein and promotes non-specific interactions, this is likely to be most detrimental if the affected protein is present in great abundance. This highlights the importance of evaluating the interplay of macroscopic and microscopic features when estimating the potential impact of variants on protein function and stability.
The relationship between variant localisation and protein stability is of importance, as a number of algorithms have used the change in protein stability upon mutation (ΔΔG) as a proxy for variant impact. Our results indicate that the baseline stability of the wild-type protein may also be important when considering the phenotypic relevance of a change in stability upon mutation. From their analysis of the ProTherm database, Serohijos et al. [91] found that mutations in more stable proteins generally led to greater destabilisation (ΔΔG variation). They interpret this as suggesting that proteins which have evolved to become more stable are in a state closer to their peak stability, where any changes will result in drastic destabilisation. Similarly, Pucci and Rooman [92] used temperature-dependent statistical potentials to investigate the thermal stability of the structurome (all proteins with resolved structure), and concluded that mutations in proteins which are highly thermally stable lead to a larger decrease in thermal stability, compared with those in less thermally stable proteins. We believe that our results indicate that, even under a scenario in which mutations in proteins with higher stability result in a greater change in stability, a mutation in an already unstable protein is more likely to result in complete or partial unfolding under physiological conditions. These factors should be taken into consideration when interpreting the impact of missense variants.
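The argument about baseline stability can be illustrated with a simple two-state folding model, in which the folded fraction is 1 / (1 + exp(-ΔG / RT)) for an unfolding free energy ΔG. This is a textbook idealisation introduced here for illustration only; it is not part of the analysis in this work, and the stability and ΔΔG values below are hypothetical.

```python
from math import exp

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)

def fraction_folded(dg_unfold, temp_k=310.15):
    """Two-state model: folded fraction for an unfolding free energy in kcal/mol."""
    return 1.0 / (1.0 + exp(-dg_unfold / (R_KCAL * temp_k)))

# Hypothetical wild-type stabilities (kcal/mol) and mutational destabilisations
stable_wt, marginal_wt = 8.0, 1.5
ddg_in_stable, ddg_in_marginal = 3.0, 2.0  # larger ddG assumed for the stable protein

loss_stable = fraction_folded(stable_wt) - fraction_folded(stable_wt - ddg_in_stable)
loss_marginal = fraction_folded(marginal_wt) - fraction_folded(marginal_wt - ddg_in_marginal)

print(f"folded fraction lost, stable protein:   {loss_stable:.4f}")
print(f"folded fraction lost, marginal protein: {loss_marginal:.4f}")
```

Under these assumed values, the stable protein remains almost entirely folded despite the larger ΔΔG, whereas the marginally stable protein loses most of its folded population, which is the qualitative point made above.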
We show that greater insight into the properties of variants in health and disease can be obtained by combining protein structural and functional pathway information. For example, as discussed in Section 4.1, it can be clearly seen that population variants are most enriched on the surface of proteins which take part in pathways we have defined as belonging to the “proliferative” cluster (Fig. 3d). Moreover, pathways belonging to this cluster also appear to be enriched in proteins with less thermal stability (Fig. 7c), suggesting a possible mechanistic basis underlying the localisation of variants (variants tend to localise to the surface and avoid disrupting the core of these already unstable proteins). This indicates that the combinatorial use of such features may aid in both improving the prediction of a variant’s impact on phenotype, and in assessing the molecular mechanisms underlying this.
Ultimately, the goal should reach beyond the identification of variants which underlie a disease phenotype, to the use of this information in the development of therapeutic strategies. Here we envisage that our domain-centric landscape of variant enrichment (Fig. 6), which includes the mapping of targeted drugs, besides providing another feature for the characterisation of variants, will allow for more informed decisions in selecting new therapeutic targets. We show that many domains are enriched in COSMIC and/or ClinVar variants, but few or no drugs exist to target these proteins. This could offer a starting point to prioritise drug discovery efforts for these domain types. For domain types already targetable by drugs, our analysis highlights domains to which multiple disease-associated variants localise, which could give scope for drug repurposing or redesign. Additionally, targets with few population variants could be selected, to minimise differential drug response due to genetic differences between individuals.
In conclusion, our work highlights the complex interplay between different factors which may determine variant pathogenicity, at both the macroscopic and microscopic levels. We believe that these insights will prove important in the prediction of which variants drive disease phenotypes. Moreover, the ZoomVar database, which we have made available at http://fraternalilab.kcl.ac.uk/ZoomVar, will assist users in the structural analysis of variants, and provides precomputed data underlying all analyses presented here. Further advancement in the structural coverage of the proteome, and the exploitation of high-throughput proteomics technologies, such as those pioneered by the Savitski and Picotti labs [26,85], will ultimately offer a finer-grained picture of features which segregate variants in “health” and “disease”.
6 Acknowledgements
This research was supported by the British Heart Foundation (RE/13/2/30182 to FF and AL), Croucher Foundation Hong Kong (to JCN) and the Medical Research Council (MR/L01257X/1 to FF).
References
- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].
- [56].
- [57].
- [58].
- [59].
- [60].
- [61].
- [62].
- [63].
- [64].
- [65].
- [66].
- [67].
- [68].
- [69].
- [70].
- [71].
- [72].
- [73].
- [74].
- [75].
- [76].
- [77].
- [78].
- [79].
- [80].
- [81].
- [82].
- [83].
- [84].
- [85].
- [86].
- [87].
- [88].
- [89].
- [90].
- [91].
- [92].