Abstract
While cancer is a heterogeneous complex of distinct diseases, the common underlying mechanism of uncontrolled tumor growth is mutation of proto-oncogenes and loss of the regulatory function of tumor suppressor genes. In this paper we propose a novel deep learning model for predicting tumor suppressor genes (TSGs) and proto-oncogenes (OGs) from their Protein Data Bank (PDB) three-dimensional structures. Specifically, we develop a convolutional neural network (CNN) to classify the feature map sets extracted from the tertiary protein structures. Each feature map set represents particular biochemical properties associated with the atomic coordinates appearing on the outer surface of a protein’s three-dimensional structure. The experimental results on the collected dataset for classifying TSGs and OGs demonstrate promising performance, with 82.57% accuracy and 0.89 area under the ROC curve. The initial success of the proposed model warrants further study to develop a comprehensive model that identifies cancer driver genes or events, using TSGs and OGs as the basis for tracking the causal chain.
1 Introduction
Common themes among many different types of cancer at the molecular level include (1) mutations in proto-oncogenes that turn the regular cell cycle into uncontrolled cell division, (2) mutations in cancer suppressor genes that alter their cell regulatory mechanism, and (3) mutations in DNA-repair genes that cause further mutations in cells instead of repairing them. Traditional machine learning algorithms such as decision trees, random forests (RF), artificial neural networks (ANN), and support vector machines (SVM) have been successfully applied to build predictive models for various aspects of cancer, including cancer prognosis and the classification of cancer types from data sources such as clinical data, SNPs, and gene expression [1, 2, 3, 4, 5]. Recently, deep learning [6, 7] has shown remarkable performance in predicting the specificity of DNA and mRNA binding sites [8], functional classification [9], protein folding patterns [10], and cancer categorization [11, 12]. Automatically detecting and predicting oncogenes or cancer suppressor genes from their three-dimensional features is a step toward discovering the structural characteristics of these genes, which could in turn inform cancer treatment. To our knowledge, little or no work has been done on applying machine learning to identify oncogenes or cancer suppressor genes from their three-dimensional structures.
Although there are many different cancer types, and finding a coherent pattern that represents their drivers is a challenging problem, cancer manifests as tissue that grows in an uncontrolled manner due to a malfunction in the regular cell cycle. Along with many other factors, it has been documented experimentally that mutations in proto-oncogenes and tumor suppressor genes, and in their regulatory mechanisms, play major roles in tumor growth. The roles played by genes in various types of cancer fall into one of the following categories: oncogene (OG), tumor suppressor gene (TSG), fusion, or a combination of them, namely (a) OG and fusion, (b) OG and TSG, (c) OG, TSG, and fusion, or (d) TSG and fusion. OGs are genes that promote cell proliferation, while TSGs are genes that keep the cell growth process in check. Osborne et al. [13] have reviewed common OG and TSG malfunctions in human breast cancer. OG/TSG detection improves cancer identification performance, as discussed in [14]. The authors of [14] used genomic data and their variance from The Cancer Genome Atlas (TCGA), ICGC, and COSMIC, and applied a random forest model integrating five statistical tests to detect cancer genes and label them as likely OGs or TSGs. This raises the question of how to classify OGs and TSGs from their three-dimensional protein structures alone, without extra statistical tests or other feature extraction modules. The prediction of functional annotations of proteins continues to be improved by various methods, such as prediction by sequence similarity [15, 16], evolutionary relations [17], genetic interactions [18], protein-protein interactions [19], and protein structures and the gene-ontology hierarchy [9, 20, 21].
In this paper, we propose a deep convolutional neural network (CNN) to classify TSGs and OGs based on their PDB structures. CNNs have shown high performance in visual feature extraction and classification [22, 23]. Additionally, CNNs provide hierarchical feature extraction modules that are robust against rotation, scale, and local translation. Thus, these visual feature extraction modules can be used to discover discriminative information in the PDB structures by mapping the biochemical features annotated with the 3-D atomic coordinates appearing on the outer surfaces to visual feature maps.
2 TSG and OG Dataset Preparation
2.1 Protein Structure
The tertiary structure of a protein is the three-dimensional geometric shape formed by a single polypeptide backbone. It contains a variety of bonding interactions between the side chains of the amino acids. Fig. 1 shows a protein structure in which the colors indicate its secondary structure. In this paper, we concentrate on the protein’s atomic coordinates appearing on the outer surfaces (< x, y, z >) and their associated biochemical properties.
2.2 Biochemical Features
Annotated cancer genes are downloaded from COSMIC v82 for human (GRCh38). For the purpose of this machine learning experiment, which identifies the role of genes in cancer from their 3D structure, we have focused on tumor suppressor genes and oncogenes. The recent version of the COSMIC annotated gene lists has 137 TSGs and 78 oncogenes. These gene sets are combined and clustered with DAVID Bioinformatics Resources 6.8 [24], using only direct annotation from the Gene Ontology and other functional categories provided by the tool. The Gene Ontology [25] provides a structured vocabulary for annotating genes and their products through three orthogonal ontologies: biological process (BP), cellular component (CC), and molecular function (MF), each of which is modeled as a directed acyclic graph. As expected, the TSGs and the OGs fall into separate, non-overlapping clusters. Therefore, they are appropriate candidates for separate functional predictive models.
The Ensembl IDs of OGs and TSGs are mapped to PDB IDs using the UniProt web tool [26]. The PDB files were downloaded from the Protein Data Bank website. The PDB format is a standard representation for macromolecular structure data obtained from X-ray diffraction and NMR studies [27].
To interpret and distinguish these genes, we provide a feature extraction module that represents the biochemical characteristics of their tertiary structure. The feature extraction module has two steps: 1) indexing the surface Cα atoms; 2) extracting the properties of the outer surface atoms. The feature extraction algorithm is shown in Fig. 2.
2.2.1. Surface Cα Indexing
For each PDB file, the surface Cα atoms are selected. To find the surface atoms, the Cartesian coordinates are converted to polar coordinates and then, at 1-degree resolution, the atom with the largest radius is selected as the surface atom (Fig. 2, lines 1 through 5). Finally, the surface indices < x, y, z > are converted to decimal numbers starting from < 0, 0, 0 > (Fig. 2, line 6).
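The surface selection described above can be sketched in a few lines of NumPy. This is only an illustration, not the algorithm of Fig. 2: it assumes that "polar coordinates" means spherical angles measured from the centroid of the Cα atoms, that bins are 1 degree wide in both angles, and that the final index conversion is a shift to the minimum corner followed by rounding. The function names `surface_atoms` and `to_grid_indices` are hypothetical.

```python
import numpy as np

def surface_atoms(coords, step_deg=1.0):
    """Return indices of the farthest atom in each (theta, phi) angular bin."""
    coords = np.asarray(coords, dtype=float)
    centered = coords - coords.mean(axis=0)   # assumption: centroid is the origin
    r = np.linalg.norm(centered, axis=1)
    safe_r = np.where(r > 0, r, 1.0)          # avoid division by zero at the origin
    theta = np.degrees(np.arccos(np.clip(centered[:, 2] / safe_r, -1.0, 1.0)))
    phi = np.degrees(np.arctan2(centered[:, 1], centered[:, 0])) % 360.0
    t_bin = np.floor(theta / step_deg).astype(int)
    p_bin = np.floor(phi / step_deg).astype(int)
    best = {}                                  # angular bin -> index of farthest atom
    for i, key in enumerate(zip(t_bin.tolist(), p_bin.tolist())):
        if key not in best or r[i] > r[best[key]]:
            best[key] = i
    return np.array(sorted(best.values()), dtype=int)

def to_grid_indices(coords):
    """Shift coordinates so the minimum corner is <0, 0, 0> and round to integers."""
    coords = np.asarray(coords, dtype=float)
    return np.rint(coords - coords.min(axis=0)).astype(int)

# Usage: two atoms along each of four directions; only the outer ones are kept.
pts = np.array([[1, 0, 0], [2, 0, 0], [-1, 0, 0], [-2, 0, 0],
                [0, 1, 0], [0, 2, 0], [0, -1, 0], [0, -2, 0]], dtype=float)
outer = surface_atoms(pts)
grid = to_grid_indices(pts[outer])
```

Keeping one atom per angular bin bounds the number of surface points by the number of bins, regardless of protein size.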
2.2.2. Atom Properties
The PDB files provide information on the amino acids located at the Cα coordinates. In our model, each PDB file is represented by sixteen features along with their 3-D coordinates. These 16 features indirectly characterize the biochemical properties of an amino acid (see footnote 6). Table 1 gives the mapping from each amino acid to its corresponding feature vector of length 16. These features are used in the “property” function shown in Fig. 2, line 7. Note: the feature values of pKa-NH2 and pKa-COOH are normalized to the range [0, 1].
3 Deep Learning Model
In this section, we first explain the data processing steps required to prepare the feature maps fed into the CNN, and then describe the network architecture.
3.1 Input Feature Maps
As mentioned earlier, each PDB file is represented by 16 features associated with the atomic coordinates < x, y, z >. To convert the three-dimensional feature space to 2-D feature maps, we generate three independent feature sets associated with the three atomic projections onto the < x, y >, < y, z >, and < x, z > planes. Therefore, each PDB file is converted to three perpendicular 2-D feature spaces. In the next step, each projection is converted to 16 feature maps corresponding to the sixteen feature values computed in the previous section.
This approach converts a 3-D structure to three feature map sets, each with dimensions of 200 × 200 × 16 (16 feature maps of 200 × 200 pixels). Processing the projections is much faster than processing the 3-D structures, without considerable loss of information, because the PDB structure is sparse. Furthermore, each feature map set of a projection captures specific features of the protein while preserving its spatial information. Fig. 3 shows an example of these feature maps (here, the < x, y > projection of a TSG). The three feature map sets are then used for TSG/OG classification as explained in Section 3.2.
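A minimal sketch of the projection step follows, assuming the surface atoms have already been mapped to non-negative integer indices and each atom carries a 16-value feature vector. How the original code handles coordinates outside the 200 × 200 grid or two atoms landing on the same pixel is not stated, so clipping and last-write-wins are assumptions made here, and `project_feature_maps` is a hypothetical name.

```python
import numpy as np

def project_feature_maps(indices, features, size=200):
    """Build three (size, size, n_features) maps, one per coordinate plane."""
    indices = np.asarray(indices, dtype=int)
    features = np.asarray(features, dtype=float)
    n_feat = features.shape[1]
    planes = {"xy": (0, 1), "yz": (1, 2), "xz": (0, 2)}
    maps = {}
    for name, (a, b) in planes.items():
        grid = np.zeros((size, size, n_feat))
        # Assumption: indices outside the grid are clipped to the border.
        u = np.clip(indices[:, a], 0, size - 1)
        v = np.clip(indices[:, b], 0, size - 1)
        # Assumption: if two atoms share a pixel, the later one overwrites.
        grid[u, v, :] = features
        maps[name] = grid
    return maps

# Usage: two atoms with constant 16-value feature vectors.
idx = np.array([[1, 2, 3], [4, 5, 6]])
feat = np.vstack([np.full(16, 1.0), np.full(16, 2.0)])
maps = project_feature_maps(idx, feat)
```

Each of the three arrays has shape (200, 200, 16), matching the input size stated for one CNN branch.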
3.2 CNN Architecture
The deep learning model in this study is a parallel CNN with three branches followed by a multi-layer fully connected neural network. Fig. 4 shows the architecture of this deep CNN. The model consists of four convolution and pooling layers and three fully connected layers, including the final classifier. The convolution kernel size (p), pooling strides (si), numbers of hidden neurons (h1, h2), convolution pad (γ), and numbers of generated feature maps (di) are listed in Table 2. These parameters were set after a number of control experiments and initial evaluations.
As shown in Fig. 4, the CNN receives three 200 × 200 × 16 feature maps in parallel and performs a binary (TSG/OG) classification. Each layer is equipped with the rectified linear unit (ReLU) activation function. We used 30% dropout in the fully connected layers to control possible overfitting. More details about the network’s training process are discussed in the next section (Experiments). The biochemical properties used for generating the feature maps are shown on the left side of the network (Fig. 4). The convolution/pooling layers extract 108 × 64 = 6912 visual features. The number of trainable parameters is shown on the right side of the network. For p = 5, the network has 888,250 trainable parameters.
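The figure of 108 × 64 = 6912 visual features can be checked with a short calculation. The pooling strides below are purely hypothetical and are not taken from Table 2; they are one configuration that, with size-preserving (padded) convolutions, reduces a 200 × 200 input to 6 × 6. Reading 108 as 3 branches × 6 × 6 positions and 64 as the number of feature maps in the last layer is likewise an assumption.

```python
def pooled_size(n, strides):
    """Spatial size after successive non-overlapping pooling steps (floor division)."""
    for s in strides:
        n //= s
    return n

# Hypothetical strides for the four pooling layers (not from Table 2).
strides = (2, 2, 2, 4)
side = pooled_size(200, strides)   # 200 -> 100 -> 50 -> 25 -> 6
branches = 3                       # one branch per projection
d_last = 64                        # assumed number of final feature maps
per_map = branches * side * side   # positions per feature map across branches
total = per_map * d_last           # size of the combined feature vector
print(side, per_map, total)
```

Under these assumptions the combined vector passed to the fully connected layers has 6912 entries, which agrees with the number quoted in the text.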
4 Experiments and Results
The proposed model is evaluated on the dataset collected as described in Section 2. The dataset consists of 2379 PDB files (1191 TSG and 1188 OG) that are converted to 7137 feature maps with 16 channels each (three projections per file). The 2379 feature map sets, each representing one particular protein structure, are randomly divided into separate training and testing sets with 2029 training and 350 testing samples. This division is repeated three times using different random seeds. Finally, the model is trained and evaluated over 100 iterations.
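The repeated random split can be sketched as follows. The sample counts come from the text; the seed values and the use of NumPy's `default_rng` are assumptions, since the original implementation is in Torch and its seeding is not described here. Splitting is done per protein structure, so the three projections of one PDB file stay on the same side of the split.

```python
import numpy as np

def split_indices(n_samples=2379, n_test=350, seed=0):
    """Return (train, test) index arrays from one random permutation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    return perm[n_test:], perm[:n_test]

# Three splits with different (hypothetical) seeds.
splits = [split_indices(seed=s) for s in (0, 1, 2)]
for train, test in splits:
    print(len(train), len(test))
```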
We implemented the model using the Torch library [28]. The implementation code for data preparation and CNN training is available on GitHub at https://github.com/tavanaei/Cancer-Suppressor-Gene-Deep-Learning.
4.1 Results
We ran four experiments on CNNs with different convolution kernel sizes (p = {3, 5, 7, 9}) to find a suitable patch size for extracting discriminative visual features. Fig. 5a illustrates the CNNs’ accuracy over 100 training iterations. It shows that the 3 × 3 patch does not capture the visual features well. The best accuracy is achieved by the networks with convolution kernels of size p = {7, 9}. Table 3 shows the detailed performance measures of the proposed model. The best-performing model reached an accuracy of 82.57% and an area under the ROC curve (AUROC) of 0.89. The test sets used to generate Fig. 5a and Table 3 were slightly different (fewer test samples were used for the plot).
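For readers who want to reproduce the two reported measures, the sketch below computes accuracy and AUROC from labels and classifier scores. It is not the authors' evaluation code. AUROC is computed from its rank interpretation: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as one half. Which class is treated as positive is an arbitrary choice here.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def auroc(y_true, scores):
    """AUROC via pairwise comparison of positive and negative scores."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true]
    neg = scores[~y_true]
    diff = pos[:, None] - neg[None, :]        # every positive against every negative
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))

# Usage on a toy example with four samples.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(accuracy(labels, [0, 1, 1, 1]), auroc(labels, scores))
```

As a point of reference, 289 correct predictions out of 350 test samples would round to 82.57%; the text does not state whether this is the exact count behind the reported accuracy.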
To assess the model’s convergence speed, Fig. 5b shows the model’s performance with respect to different learning rates. The models trained with learning rates μ > 0.03 reached accuracy above 75% after 20 training iterations. The best-performing models were trained with learning rates of 0.05 and 0.06. Table 4 also shows high performance for the CNNs trained with these learning rates (0.05 and 0.06), evaluated on a slightly different test set (the same as in Table 3).
4.2 Summary and Discussion
The raw input data for the positive and negative examples are obtained from the 3D structures of proto-oncogenes and cancer suppressor genes. The function and the biochemical processes of a protein depend on its outer surface configuration and the chemical properties of that surface. We have developed an algorithm to identify the atoms located on the outer surface and have used sixteen selected properties of amino acids, as shown in Table 1. The 3D configuration of the outer surface is mapped onto three orthogonal planes. Each property becomes a channel in the feature map of the CNN, as illustrated in Fig. 4. The proposed CNN model with the 16-channel feature map achieved an accuracy of 82.57% and an AUROC of 0.89. The model can therefore help annotate uncharacterized PDB structures as either TSG or OG structures. This model, and the approach of using the outer surface structure and the chemical properties of the amino acids, are novel for predicting protein function from PDB structures.
Furthermore, the performance of our model compares favorably with the statistical methods studied in [14] on pan-cancer genome sequencing data [29], which contain very rich genomic information. The dataset we used for evaluating our model is different from theirs. Nevertheless, Table 5 compares the AUROC value reported in our study with the AUROC values reported for the state-of-the-art statistical methods for OG versus TSG identification. Our model outperforms six of the eight methods and is close to the best AUROC, 0.924.
5 Conclusion
A deep learning approach was proposed in this paper to classify two kinds of cancer genes: proto-oncogenes and tumor suppressor genes. Cancer progresses when proto-oncogenes mutate and drive uncontrolled cell division, or when cancer suppressor genes mutate and lose their function. A model that confidently identifies proto-oncogenes or cancer suppressor genes from structure provides a new tool for discovering cancer suppressor genes or proto-oncogenes that have not yet been identified in the literature as having such functionality. Activating a dormant cancer suppressor gene through drugs could improve the chances of controlling tumor growth. Of course, any potential cancer suppressor genes identified this way have to be verified through testing in rat or mouse models, whose gene content resembles that of humans.
This investigation had two parts: 1) protein feature extraction from the PDB tertiary structure; 2) modeling the gene patterns using a parallel deep convolutional neural network (DCNN). As protein function is associated with the atoms located on the surface of the protein structure, we extracted sixteen amino-acid features corresponding to each 3-D coordinate. This new dataset is converted to three orthogonal projections of atomic features to generate three feature map sets, each consisting of 16 feature maps. These feature maps were fed to the DCNN to classify the OG and TSG proteins. The proposed DCNN preserves the spatial information of the tertiary structure while modeling the protein structure and features via three parallel, independent visual feature extraction modules. Finally, the fully connected neural network of the DCNN classifies the combined visual features.
The experimental results showed an accuracy of 82.57% and an area under the ROC curve of 0.887. The reported performance is comparable with state-of-the-art models that classify OG/TSG to identify cancer proteins. The initial success of our model motivates future work applying the same deep learning approach to new datasets for predicting different cancer types and identifying cancer drivers.
Footnotes
tavanaei{at}louisiana.edu, nishanth{at}louisiana.edu, raja{at}louisiana.edu
6 http://www.proteinstructures.com/Structure/Structure/amino-acids.html