SUMMARY
Post-translational lysine methylation plays a fundamental role in the regulation of protein function and the transmission of biological signals. We present the development of a machine learning model for predicting lysine methylation sites in human proteins. The model uses fully alignment-free features that encode sequence-based information. A total of 57 novel predicted histone methylation sites were selected for evaluation by targeted mass spectrometry; 51 were confirmed as true methylated sites, and one was also found to be dynamically responsive to DNA damage. To gain insight into the cellular function of the lysine methylation system, we reveal links between cellular metabolism and GTPase signal transduction, and demonstrate dynamic hypoxia-responsive methylation of the inducible nitric oxide synthase (NOS2). With the growing implication of lysine methylation in human health and disease, methods that help to target its discovery will become critically important to understanding its biological implications.
INTRODUCTION
Post-translational modifications (PTMs) are reversible chemical modifications that play a crucial role in the regulation of protein function and the transmission of biological signals (Mann and Jensen, 2003). The diversity of available chemical protein modifications greatly expands the information potential within the PTM code, allowing cells to exert much greater control over crucial cellular processes. For example, histone proteins and their diverse array of PTMs have been subject to exquisite evolutionary conservation in eukaryotes, and one of the main types of PTMs occurring on histones is the reversible methylation of lysine residues (Martin and Zhang, 2005). Although lysine methylation is commonly known as a PTM of histone proteins, the methylation of non-histone proteins has received considerable attention in recent years, and has been found to play crucial roles in a number of human diseases, including cancer (Zhang et al., 2012; Arrowsmith et al., 2012; Biggar and Li, 2015; Hamamoto et al., 2015). Given the importance of PTMs in protein regulation and cellular function, and the prevalence of their dysregulation in human health and disease, the development of identification technologies has received considerable attention. As a result, significant effort has been placed on the development of both in silico and mass spectrometry-based enrichment methods to aid in the discovery and exploration of the methyl-lysine proteome (Liu et al., 2013; Carlson et al., 2014; Shi et al., 2015; Wen et al., 2016; Audagnotto and Peraro, 2017).
The number of known methylated proteins and modification sites has grown tremendously in recent years. Indeed, recent advances in identification technologies (i.e., affinity enrichment methods and high-resolution mass spectrometry) have provided insight into a large number of non-histone proteins that undergo lysine methylation, with many of these methylation events shown to have important regulatory functions for the respective proteins (Liu et al., 2013; Carlson et al., 2014). Furthermore, it is now known that the methylation of proteins is extremely dynamic and is involved in a growing number of cellular processes (Wu et al., 2017). These studies suggest a broad role for lysine methylation in regulating protein function, well beyond controlling chromatin dynamics via histone methylation. For example, the tumor suppressor p53 is methylated on multiple lysine residues and individual modifications have the capacity to regulate p53 function through a surprisingly diverse array of mechanisms (West and Gozani, 2011). Further, the catalytic subunit of DNA-dependent protein kinase (DNA-PK), an important regulator of DNA damage repair, is methylated on multiple lysine residues and methylation status dictates its ability to effectively repair damaged DNA (Liu et al., 2013).
Given the extensive regulatory importance that is beginning to be realized for lysine methylation, the successful identification of modification sites has become increasingly important. One of the largest challenges in the discovery of lysine-methylated proteins has been the limitations of identification technology. It has proven difficult to develop specific affinity strategies that are able to enrich for the lysine methylation modification (Liu et al., 2013; Carlson et al., 2014). As a result, the identification of lysine methylation sites has not experienced the same growth in discovery as other PTMs, such as serine/threonine and tyrosine phosphorylation, lysine acetylation, or arginine methylation. However, the development of new in silico prediction resources, combined with targeted enrichment strategies, will aid in the initial annotation of the methyllysine proteome on a proteome scale. Although several affinity strategies that utilize natural methyl-binding domains have been remarkably successful in the identification of new lysine methylation events when coupled with mass spectrometry (Liu et al., 2013; Carlson et al., 2014), these approaches are inherently biased towards the biologically relevant binding specificity of the domain used for the initial enrichment. In silico prediction methods help to overcome this issue by predicting methylation events based on general underlying characteristics of all known modification sites. During the past decade, there have been several attempts to develop methyllysine and methylarginine computational predictors (Table S1) (Chen et al., 2006; Hu et al., 2011; Qiu et al., 2014; Shao et al., 2009; Shi et al., 2012; Shi et al., 2015; Shien et al., 2009). These studies built their models from the available information on methylated sites extracted from UniProtKB, PhosphoSitePlus, and PubMed, gathering only a few hundred methylation sites.
Therefore, these predictors are limited to approximately 200 nonredundant methyllysine sites for building and assessing their models. Critically, the expected diversity of methyllysine sites cannot be represented with so few examples, given the impressive growth of validated methylated sites in recent years (Cao and Garcia, 2016). Most notably, there has been a stark lack of experimental validation highlighting the prospective use of such in silico methods to aid in vivo discovery.
We address these limitations by conducting a model learning approach based, first, on alignment-free features that directly capture the physical and chemical properties of the peptides, rather than relying on domain-specific features that often fail due to the limited amount of available data. At the same time, we enlarged our training dataset to approximately two thousand sites that have been gathered from years of experimental studies on lysine methylation and deposited in the PhosphoSite database (www.phosphosite.com). Secondly, we treat class imbalance by using cost-sensitive learning; the datasets are thus kept in their intrinsic imbalanced ratio during cross-validation and hold-out tests, rather than introducing synthetic training data or losing valuable exemplars through undersampling. In summary, our method of methyllysine prediction has resulted in a number of promising methylation sites based on comparisons with other existing methods using common independent tests. Moreover, our proteome-wide predictions provide a valuable resource for gaining functional insight into the methyllysine proteome, for the experimental validation of new methylation sites, and for the generation of useful hypotheses. The MethylSight user interface, source code, datasets and support vector machine (SVM) models can be freely found at http://methylsight.com.
MATERIALS AND METHODS
Preparing the data sets
The generation of the training, calibration, and test datasets is described in the supplemental methods and summarized in Table S2. Feature extraction was accomplished using ProtDCal properties, groups, modifiers and aggregators, as follows: amino acid properties are first computed over different grouping subsets of amino acids within each input sequence window (Ruiz-Blanco et al., 2015). For example, the hydrophobicity of all charged amino acids within a sequence window could form the basis for a ProtDCal descriptor. In this case, 12 amino acid properties were used to numerically encode the physicochemical characteristics of the residues. These properties are found in the AAindex database (Kawashima and Kanehisa, 2000) and are also described in the ProtDCal documentation. Fourteen residue groups are used, based either on side chain structure or on specific residue positions within the input sequence window. The properties can then be modified by the computed properties of neighboring amino acids, before applying an aggregation operation to reduce the vector down to a scalar quantity, known as a descriptor or feature. Two modification operators for capturing vicinity information and twelve aggregation operators ultimately transform the property vector of each amino acid group into the final scalar features. The project files with the lists of indices, groups, modification and aggregation operators, as well as other parameters for the calculations, are provided on the http://methylsight.com website. The above configuration leads to an initial set of 3720 descriptors, which is subsequently filtered to identify those features most useful for methyllysine prediction using a pipeline of supervised and unsupervised feature selection processes.
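The property, group and aggregator scheme described above can be illustrated with a minimal sketch. This is not ProtDCal itself: the Kyte-Doolittle hydropathy scale, the definition of the charged group, the function name descriptor and the 13-residue window are all assumptions chosen for illustration, and the neighbor-based modification step is omitted.

```python
from statistics import mean

# Kyte-Doolittle hydropathy values, used here only as an illustrative property;
# the 12 AAindex properties actually used are listed in the ProtDCal project files.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
CHARGED = frozenset("DEKR")  # assumed definition of the "charged" residue group


def descriptor(window, prop=HYDROPATHY, group=CHARGED, aggregate=mean):
    """Aggregate one property over the residues of one group within a window."""
    values = [prop[aa] for aa in window if aa in group]
    # An empty group yields 0.0 here; this fallback is an assumption of the sketch.
    return aggregate(values) if values else 0.0


# Hypothetical window: six residues on each side of a central lysine.
window = "GSADEL" + "K" + "TAVRGS"
mean_charged_hydropathy = descriptor(window)                # mean aggregator
max_charged_hydropathy = descriptor(window, aggregate=max)  # max aggregator
```

Swapping the property table, the group, or the aggregator produces a different scalar feature, which is how a modest number of properties, groups and operators can expand into thousands of descriptors.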
Feature selection begins with information gain (IG) analysis, which retains only those features whose distribution across all sites in the training data correlates with class label. All the attributes with a nonzero IG value were extracted in this step. Subsequently, an unsupervised redundancy filter is applied, using a single-linkage clustering algorithm with the Spearman correlation coefficient as the similarity measure. Features exhibiting pair-wise correlation above 0.9 are clustered together and only one representative feature from each cluster is kept. Finally, the supervised WrapperSubsetEval method, implemented in Weka 3.7.11 (Hall et al., 2009), is used to extract an optimum subset of features for modelling. This method was configured to use a Genetic Search for exploring the feature space, and potential feature sets were evaluated using the classification F-measure of the positive class in 5-fold cross-validation tests using SVM classifiers with a linear kernel. The cost-sensitive sequential minimal optimization (SMO) algorithm (Cai and Cherkassky, 2012) was used to train all SVM classifiers in this work. The cost matrix reflected the relative class imbalance in the data, such that the false negative error cost is equal to the number of negative instances and the false positive error cost is equal to the number of positive instances in the data (Table S2).
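As a rough sketch of the redundancy filter and the cost matrix described above (not the Weka implementation used in this work), single-linkage clustering at a fixed similarity cut-off is equivalent to finding connected components among features whose pair-wise Spearman correlation exceeds the cut-off. The function names, the choice of the first feature in each cluster as its representative, and the example values and class counts are assumptions.

```python
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def redundancy_filter(features, cutoff=0.9):
    """Keep the first feature of every single-linkage cluster (union-find)."""
    names = list(features)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if spearman(features[a], features[b]) > cutoff:
                parent[find(b)] = find(a)
    kept, seen = [], set()
    for n in names:
        root = find(n)
        if root not in seen:
            seen.add(root)
            kept.append(n)
    return kept


def cost_matrix(n_pos, n_neg):
    """Misclassification costs mirroring the class imbalance, as in the text."""
    return {"false_negative": n_neg, "false_positive": n_pos}


# Hypothetical feature values over five sites: f2 is a monotone copy of f1.
example = {"f1": [1, 2, 3, 4, 5], "f2": [2, 4, 6, 8, 11], "f3": [5, 1, 4, 2, 3]}
kept = redundancy_filter(example)
costs = cost_matrix(n_pos=200, n_neg=1800)  # hypothetical class counts
```

With these costs, the total penalty a classifier can incur on each class is the same, which is what allows the data to stay at their intrinsic imbalance ratio without resampling.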
Training the support vector machine predictor
Following feature selection, a grid-based optimization of SVM hyperparameters was conducted using the training and calibration data. The final model was selected according to its prediction performance (in terms of F-measure, precision, and recall), requiring strong performance in both cross-validation testing and hold-out calibration testing.
Multiple reaction monitoring mass spectrometry (MRM-MS)
To validate the status of predicted methylation sites, isolated proteins were digested with trypsin and the digest was analyzed by positive ESI LC-MS/MS on a triple quadrupole mass spectrometer (4000 QTRAP, Applied Biosystems Inc.) utilizing Q3 as a linear ion trap. A nanoAcquity UPLC system (Waters) equipped with a C18 analytical column (1.7 μm, BEH130, 75 μm×250 mm) was used to separate the peptides at a flow rate of 300 nl/min and an operating pressure of 8000 psi. Peptides were eluted using a 62 min gradient from 95% solvent A (H2O, 0.1% formic acid) and 5% B (acetonitrile, 0.1% formic acid) to 50% B in 41 min, 6 min at 90% B, and back to 5% for 10 min. Eluted peptides were directly electrosprayed (Nanosource, ESI voltage +2000V) into the mass spectrometer. The instrument was set to monitor up to 200 transitions in each sample with a dwell time of at least 25 msec/transition.
The in silico protease digest patterns (i.e., to generate precursor ions) and the corresponding MRM transitions were compiled using the Skyline software made freely available by the MacCoss Lab, Department of Genome Sciences, University of Washington School of Medicine (MacLean et al., 2010). Transitions larger than the precursor ion were selected based on the Skyline predictions, and the specific b/y ions that allow unambiguous identification of the methylated lysine site were included. Positive identification of a new methylation site required the successful detection of at least three transitions. All transitions used to identify methylation sites are listed in Table S3. An internal NOS2-specific peptide (NH2-QQNESPQPLVETGK-COOH) was used as a standard to normalize relative NOS2 methylation data to protein abundance.
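The transition selection logic can be sketched with a simple y-ion calculation. This is an illustration under stated assumptions rather than the Skyline workflow: the peptide LAKGTER, the position of the methylated lysine, the doubly charged precursor and all function names are hypothetical, while the residue masses are standard monoisotopic values and each methyl group adds 14.01565 Da (CH2).

```python
# Standard monoisotopic residue masses (Da).
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON, METHYL = 18.01056, 1.00728, 14.01565


def precursor_mz(peptide, mods=None, charge=2):
    """m/z of the intact peptide; mods maps 0-based residue index to added mass."""
    mods = mods or {}
    mass = WATER + sum(RESIDUE[aa] + mods.get(i, 0.0) for i, aa in enumerate(peptide))
    return (mass + charge * PROTON) / charge


def y_ions(peptide, mods=None):
    """Singly charged y-ions as (label, m/z), from y1 up to y(n-1)."""
    mods = mods or {}
    ions, running = [], WATER
    for k, i in enumerate(range(len(peptide) - 1, 0, -1), start=1):
        running += RESIDUE[peptide[i]] + mods.get(i, 0.0)
        ions.append((f"y{k}", running + PROTON))
    return ions


def site_transitions(peptide, site, n_methyl, charge=2):
    """y-ions above the precursor m/z that also contain the modified lysine."""
    mods = {site: n_methyl * METHYL}
    limit = precursor_mz(peptide, mods, charge)
    n = len(peptide)
    return [
        (label, mz)
        for k, (label, mz) in enumerate(y_ions(peptide, mods), start=1)
        if mz > limit and k >= n - site  # y_k spans residues n-k .. n-1
    ]


# Hypothetical peptide with a dimethylated lysine at index 2.
selected = site_transitions("LAKGTER", site=2, n_methyl=2)
```

In this example y4 also lies above the precursor m/z but does not contain the lysine, so its m/z is identical for every methylation state and it cannot localize the modification on its own.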
Functional analysis of the predicted methyllysine proteome
To annotate the biological functions enriched in the dataset of known and predicted human lysine methylation sites, we initially used Gene Ontology enrichment analysis to identify biological processes enriched in lysine-methylated proteins. To functionally annotate clusters of interacting proteins within the predicted methyllysine interactome, we used the spatial analysis of functional enrichment (SAFE) component of Cytoscape (v.3.5.1) (Baryshnikova, 2016) using STRING interactions (v.10.5). Functional enrichments based on known protein interactions were carried out at recommended settings.
RESULTS
Demonstrating effectiveness of prediction framework
The achievable prediction recall, precision, and specificity are presented in Figure 2A as a function of decision threshold. As with previous studies, those lysine residues appearing on proteins that have been investigated for methylation, but which have not been reported to be methylated, are here assumed to be negative when training and evaluating predictors. Considering that the number of methylation sites continues to grow significantly, this assumption is known to be flawed (i.e., many of the assumed-negative instances are expected to actually be undocumented positive sites). This leads to a pessimistic estimation of the precision of the obtained model. Therefore, we also computed the precision using a high-confidence negative test subset (see Supplemental Methods). Shown in yellow in Figure 2A, this can be considered an optimistic estimator of prediction precision, with the true precision expected to lie between the yellow and grey curves.
The model was subsequently evaluated on the hold-out test set, and its performance was contrasted with other available methylation prediction servers (Figure 2B). In general, the performance of all the methods is very low, which could reflect the limited training data used to create most of the other servers and the erroneous information from assumed-negative instances that is supplied to the training algorithms. Our method achieved significantly better performance in identifying methylated sites, as shown by its much higher sensitivity. The precision is slightly higher than that of the other predictors, which means that overall we are able to predict more positive sites than the other methods without sacrificing precision.
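The effect of assumed-negative labels on precision can be shown with a small sketch. The scores, labels and the choice of which negatives count as high-confidence are invented for illustration, and the function name metrics_at is an assumption; the definitions of recall, precision and specificity are the standard ones.

```python
def metrics_at(scores, labels, threshold):
    """Recall, precision and specificity when score >= threshold is called positive."""
    pairs = list(zip(scores, labels))
    tp = sum(s >= threshold and y == 1 for s, y in pairs)
    fp = sum(s >= threshold and y == 0 for s, y in pairs)
    fn = sum(s < threshold and y == 1 for s, y in pairs)
    tn = sum(s < threshold and y == 0 for s, y in pairs)
    return {
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


# Hypothetical test set: label 1 = known methylated, 0 = assumed negative.
scores = [0.9, 0.8, 0.75, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
high_confidence_negative = {4, 5, 7}  # hypothetical subset of the negatives

# Pessimistic estimate: every assumed negative counts as a true negative.
pessimistic = metrics_at(scores, labels, 0.5)

# Optimistic estimate: keep positives plus high-confidence negatives only.
keep = [i for i, y in enumerate(labels) if y == 1 or i in high_confidence_negative]
optimistic = metrics_at([scores[i] for i in keep], [labels[i] for i in keep], 0.5)
```

Here the single high-scoring assumed negative lowers the pessimistic precision to 0.75, whereas excluding it raises the optimistic precision to 1.0; if that site were in fact an undocumented methylation site, the true precision would lie between the two estimates.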
Validation of histone lysine methylation sites
Given effective enrichment methods for the isolation and purification of histone proteins, we chose to validate the methylation status of positively predicted lysine methylation sites in histone proteins. MRM-MS was carried out on purified histone proteins using transitions that were designed for the detection of specific methylation sites. It should be noted that, given the high lysine content within histones, it was not possible to validate all predicted methylation sites from trypsin-digested peptides, as some sites exist on peptides that are simply too short for proper detection and site-specific identification. Within histone proteins, a total of 74 lysine methylation sites were predicted (Table S4). Given that histone proteins are rich in lysine residues susceptible to trypsin cleavage, only 57 of these sites were found on trypsin-digested peptides that we deemed suitable for detection on the QTRAP 4000 MS, as determined by the Skyline software. For these peptides, transitions were selected and optimized for the detection of either the unmethylated or the Kme1, Kme2, or Kme3 methyl-modified lysine residues. A total of 51 new histone methylation sites containing 81 different methyl modifications were successfully validated by MRM-MS and are listed in Table 1. Remarkably, 89% of the sites were found to be true cases of lysine methylation, which exceeds the expected precision and therefore corroborates the bias introduced in the model by mislabeled instances assumed to be negative.
DNA damage response of histone H2B(K43) methylation
Given the proximity of the histone H2B(K43) methylation site to bound DNA, and a known role of H2B during repair of DNA damage (Hung et al., 2017), we explored the dynamics of H2B(K43) methylation in response to doxorubicin-induced DNA damage (Figure 3). Histone methylation sites with a known response to periods of DNA damage, specifically histone H3(K4me3) and H3(K9me3), were also included to provide a broader scope of analysis (Sun et al., 2009; Faucher et al., 2010; Ayrapetov et al., 2014). The relative methylation status of histone H2B(K43me2 and K43me3) was found to decrease in response to increasing doxorubicin concentrations following 24 hr treatment (Figure 3C). In contrast, the methylation status of histone H3(K4me3) and H3(K9me3) both dynamically increased in response to increasing concentrations of doxorubicin, corroborating previous studies (Figure 3C). These findings suggest a dynamic response of a previously undocumented H2B methylation site to DNA damage.
Prediction of the human methyllysine proteome
To provide insight into the potential scale of the predicted methyllysine proteome, we used our framework to identify proteins harbouring high-confidence lysine methylation sites throughout the whole human proteome. A prediction score of 0.7 was chosen as the threshold for predicted methyllysine sites, as this score corresponds to a 95% specificity of the MethylSight algorithm (Figure 2A). A total of 35,973 lysine residues were predicted to be methylated at this threshold; all predicted lysine methylation sites identified within the human proteome are listed in Table S5.
To provide deeper information on the potential biological functions of lysine methylation, the STRING database was used to identify and build networked clusters of interacting proteins that contain predicted methylation sites (Figure 4C). The cellular function of the clusters was identified based on GO enrichment analysis. Results indicated that predicted lysine methylation events were significantly enriched in the regulation of complement activation, positive regulation of the immune response, endonucleolytic cleavage of tricistronic rRNA transcripts, amino acid metabolism, nuclear-transcribed mRNA catabolism, calcium-independent cell-cell adhesion, nuclear protein export, intracellular protein transport, and regulation of GTPase-mediated signal transduction, among other histone-related biological processes. VEGF signalling was used as a well-studied example of GTPase-mediated signal transduction to map both known (black) and predicted (red) lysine methylation events that may play a role in its regulation (Figure 5A). Predicted NOS2 methylation events were chosen for MRM-MS validation, given the role of NOS2 in nitric oxide production in angiogenesis and hypoxia adaptation.
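Choosing a score cut-off for a target specificity can be sketched as follows. The negative-set scores and the 80% target in the example are hypothetical (the study itself used a score of 0.7 for 95% specificity), the function names are assumptions, and candidate cut-offs are restricted to observed scores for simplicity.

```python
def specificity(negative_scores, threshold):
    """Fraction of negatives scored below the threshold (i.e. correctly rejected)."""
    return sum(s < threshold for s in negative_scores) / len(negative_scores)


def lowest_threshold_for(negative_scores, target):
    """Smallest observed score whose specificity meets the target, else None."""
    for candidate in sorted(set(negative_scores)):
        if specificity(negative_scores, candidate) >= target:
            return candidate
    return None


# Hypothetical prediction scores for ten negative test sites.
negative_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.65, 0.7, 0.8, 0.9]
threshold = lowest_threshold_for(negative_scores, target=0.8)
```

Taking the lowest cut-off that satisfies the specificity target keeps recall as high as possible for a given false positive rate, which matters when the scan covers every lysine in the proteome.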
Validation of NOS2 lysine methylation and hypoxia response
Next, we validated the predicted NOS2 lysine methylation events from NOS2 IP samples obtained from MCF7 cells. The site-specific methylation status of NOS2 at lysine residues K12, K520, and K531 was determined by MRM-MS in a manner similar to that described above. Although methylation at K12 and K531 could not be detected, the monomethylation of K520 was positively identified as a validated methylation site from MCF7 cells (Figure 5B). Neither the dimethylated nor the trimethylated state of NOS2(K520) was detected by MRM-MS. As the K520 methylation site is within the calmodulin binding region of NOS2, we then examined the effect of hypoxia on NOS2(K520me1) methylation status. In response to 24 hr of 1% oxygen, relative NOS2(K520me1) levels decreased to only 47% of normoxic (i.e., 20% oxygen) levels (Figure 5C).
DISCUSSION
Traditionally, the disease context of lysine methylation has mostly been viewed via its roles in epigenetics, where the aetiology invariably stems from dysregulated histone-dependent transcription programs. Apart from the contribution of histone methylation events, a growing number of lysine-methylated non-histone proteins are being found to directly contribute to cellular dysfunction. For example, the discovery of MAP3K2 methylation at K260 by SMYD3 was shown to be instrumental in the activation of oncogenic Ras/Raf/MEK/ERK signalling and the progression of Ras-driven cancers (Mazur et al., 2014). This example highlights the importance of developing tools that are able to successfully identify new lysine methylation sites for their functional annotation in human health and disease. Indeed, a remarkable amount of attention has been drawn to the analysis and discovery of non-histone lysine methylation events.
Though many efforts have been devoted to the investigation of protein methylation, the analysis of non-histone methylation at the proteome level is still a great challenge. The discovery and mechanistic insight into new lysine modifications will undoubtedly pave the way for the future development and therapeutic application of “epidrugs” in cancer. However, the alteration of protein/peptide physicochemical properties caused by methylation is very small and it is difficult to develop highly efficient enrichment approaches to separate the methylated peptides from the pool of diverse background peptides (Wu et al., 2017). The MethylSight program was developed to help in the efficient discovery of new lysine methylation sites that can then be validated through targeted mass spectrometry.
Currently, state-of-the-art methods for the prediction of post-translational lysine methylation do not provide adequate specificity for the efficient discovery of new in vivo methylation events. The AutoMotif server was the first prediction tool for methylation (Plewczynski et al., 2005). Methylated sites with 9 flanking residues were used as a positive dataset, while negative datasets were created using the corresponding unmodified sequences. These data were utilized to train an SVM classifier for the prediction of novel methylation sites. An improvement to this method was published later that year by Daily et al., who proposed that methylation events occur in disordered structures and incorporated this feature into their predictions, thereby increasing accuracy (Daily et al., 2005). In the years following, several other prediction algorithms have been developed using an increasing number of features characteristic of known methyllysine sites (such as solvent-accessible surface area and secondary structure). However, these in silico approaches require high-quality, large methylation site databases using experimentally validated modification sites as positive datasets, a resource which remains elusive. Given the exceptional growth and availability of newly validated lysine methylation sites, we used fully alignment-free features, which are able to encode structural information from the lysine sites, to train the MethylSight algorithm, a highly accurate SVM-based prediction tool.
A total of 51 new histone methylation sites containing 81 different methyl modifications were predicted by MethylSight and successfully validated by MRM-MS (Table 1). To demonstrate the applicability of MethylSight to uncover methylation sites with possible functional implications, we examined the H2B(K43) site in more detail. Interestingly, analysis of the histone H2B crystal structure (PDB 1AOI) identified the K43 methylation site within 5 angstroms of bound DNA (Figure 3A). Using antibodies designed specifically for the methylated form of histone H2B(K43me2/3), we monitored the response of this methylation to periods of doxorubicin-induced DNA damage (Hung et al., 2017). Indeed, the relative methylation of histone H2B(K43me3) was found to respond to DNA damage in a doxorubicin concentration-dependent manner (Figure 3C). Previous studies have shown that the H2B(K43) site is also ubiquitinated and has an acetylated variant (Vlaming et al., 2014). The contribution of H2B(K43) methylation to the DNA damage response is not directly known at this point; however, H2B is known to be globally ubiquitinated at multiple sites in response to DNA damage (Hung et al., 2017). Specifically, ubiquitination of H2B(K123) by Bre1/Rad6 helps to direct H3(K4) methylation and DOT1-mediated H3(K79) methylation. The crosstalk between H2B ubiquitination and H3(K4) and H3(K79) methylation is evolutionarily conserved from yeast to metazoans. Since many other chromatin proteins are also subject to ubiquitination, an important question is which molecular features of ubiquitinated H2B are important for this trans-histone crosstalk in vivo. It is possible that H2B(K43me3) could also represent a modification helping to direct site-specific PTM competition between lysine modifications such as Ub, acetylation, and methylation during periods of DNA damage.
To facilitate the high-throughput in silico prediction of methylation sites on a proteomic scale, we used MethylSight to screen the complete human proteome from the UniProtKB/SwissProt database (version 2017_07). Our analysis predicted 35,973 methyllysine sites (Table S5). To gain functional insight into the predicted human methyllysine protein network, we used a spatial analysis of functional enrichment (SAFE) (Baryshnikova, 2016) (Figure 4). SAFE was developed as a systematic method for annotating biological networks and examining their functional organization. Our analysis identified our methyllysine network to be enriched in the regulation of complement activation, positive regulation of the immune response, endonucleolytic cleavage of tricistronic rRNA transcripts, amino acid metabolism, nuclear-transcribed mRNA catabolism, calcium-independent cell-cell adhesion, nuclear protein export, intracellular protein transport, and regulation of GTPase-mediated signal transduction, among other well-studied histone-related biological processes. This analysis agrees with reports demonstrating a role for lysine methylation in the nuclear localization of heat shock proteins (Cho et al., 2012), calcium signalling events mediated by calmodulin methylation (Haziza et al., 2015), and the regulation of Ras/Raf/MEK/ERK signaling through the methylation of MAP3K2 (Mazur et al., 2014). Indeed, such proteome-wide analyses represent a valuable resource for the experimental validation of novel methylation substrates and the generation of useful hypotheses.
Given the recent implication of lysine methylation in several examples of GTPase-mediated signal transduction, including MAP3K2(K260) and VEGFR1(K831) methylation, we mapped the known and predicted lysine-methylated sites to the VEGFR signal transduction pathway to provide new insight into potential regulation by post-translational lysine methylation (Figure 5A). Indeed, MethylSight identified potential methylation sites on a number of proteins with direct regulatory influence on signalling, including several additional sites on proteins previously known to be lysine methylated, such as VEGFR1 and the guanine nucleotide exchange factor SOS1. To demonstrate the ability of MethylSight to identify methylated sites on non-histone proteins, several predicted sites on NOS2 were selected for MRM-MS-based validation. The NOS2 protein was selected for validation given its biologically relevant role in angiogenesis and hypoxia adaptation (Heinecke et al., 2014). Monomethylation at the MethylSight-predicted NOS2(K520) site was detected by MRM-MS in NOS2 IP samples obtained from MCF7 cells (Figure 5B). Given that this new methyllysine-modified residue is within the calmodulin binding region of NOS2, a region critical for NOS2 function and nitric oxide production, we explored a possible hypoxia-responsive regulation of this methylation site. Indeed, in response to 24 hr of hypoxia, relative monomethylation levels decreased to 47% of normoxic control levels (Figure 5C). These results indicate a possible role of NOS2 methylation in the regulation of its hypoxia-responsive activity, likely dictated by calmodulin binding.
Advances in the analysis of lysine methylation at the proteome level have been slow compared with other well-studied PTMs, such as serine and threonine phosphorylation. Fortunately, progress in this field has been achieved along with advances in its identification technology. Exploiting the recent expansion of publicly available methyllysine datasets, and our combination of in silico and wet-lab experiments, we were able to develop and use the MethylSight pipeline to evaluate several new methylation sites. With the further development of novel analytical methods, in-depth exploration of protein lysine methylation can be achieved more easily using in silico prediction tools (e.g., MethylSight) that contribute to a deeper understanding of how protein methylation regulates diverse cellular processes.
AUTHOR CONTRIBUTIONS
KKB, SSCL and JRG conceived the study. KKB, YRB, FC and JC carried out all validation experiments. KKB, YBRB, and JRG prepared the sequence data. KF and HA carried out all hypoxia experiments, while QF carried out all DNA damage experiments. KKB, YBRB, and JRG wrote the manuscript.
ACKNOWLEDGEMENTS
This work was supported by Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants to K.K. Biggar and J.R. Green, and a Canadian Institutes of Health Research (CIHR) grant to S.S.C. Li.