Abstract
Many biological processes are governed by protein-ligand interactions. One such process is the recognition of self and nonself cells by the immune system. This immune response is regulated by the major histocompatibility complex (MHC) proteins, which are encoded by the human leukocyte antigen (HLA) complex. Understanding the binding potential between MHC and peptides is crucial to our understanding of the functioning of the immune system, which in turn will broaden our understanding of autoimmune diseases and vaccine design.
We introduce a new distributed representation of amino acids, named HLA-Vec, that can be used for a variety of downstream proteomic machine learning tasks. We then propose a deep convolutional neural network architecture, named HLA-CNN, for the task of HLA class I-peptide binding prediction. Experimental results show that combining the new distributed representation with our HLA-CNN architecture achieves state-of-the-art results on the vast majority of the latest two Immune Epitope Database (IEDB) weekly automated benchmark datasets. Code is available at https://github.com/uci-cbcl/HLA-bind.
1 Introduction
The major histocompatibility complex (MHC) proteins are cell surface proteins that bind intracellular peptide fragments and display them on the cell surface for recognition by T-cells [Janeway et al., 2001]. In humans, the human leukocyte antigen (HLA) gene complex encodes these MHC proteins. HLAs display a high degree of polymorphism, a variability maintained through the need to successfully process a wide range of foreign peptides [Jin et al., 2003, Williams, 2001].
The HLA gene complex lies on chromosome 6p21 and comprises 7.6 Mb [Simmonds et al., 2007]. There are different classes of HLAs, including class I, II, and III, corresponding to their location in the encoding region. HLA class I is one of the two primary classes of HLA, the other being class II. Its function is to present peptides from inside cells to be recognized as either self or nonself as part of the immune system. Foreign antigens presented by class I HLAs attract killer T-cells and provoke an immune response. In contrast, class II HLAs are found only on antigen-presenting cells, such as mononuclear phagocytes and B cells, and present antigens from extracellular proteins [Ulvestad et al., 1994]. Unlike class I and II, class III HLAs encode proteins important for inflammation.
The focus of this paper is on HLA class I proteins. As these molecules are highly specific, they are able to bind only a tiny fraction of the peptides available through the antigen presenting pathway [Nielsen et al., 2016, Yewdell, 1999]. This specificity makes binding to the HLA protein the most critical step in antigen presentation. Given the importance of binding, accurate prediction models can shed light on adverse drug reactions and autoimmune diseases [Gebe et al., 2002, Illing et al., 2012], and lead to the design of more effective protein therapies and vaccines [Chirino et al., 2004, van der Burg et al., 2006].
Given the importance of MHC to the immune response, many algorithms have been developed for the task of MHC-peptide binding prediction. The following list is by no means exhaustive but a small sample of previously proposed models. Wang et al. proposed quantitative structure-activity relationship (QSAR) modelling from various amino acid descriptors with linear regression models [Wang et al., 2015]. Kim et al. derived an amino acid similarity matrix [Kim et al., 2009]. Luo et al. proposed both colored and non-colored bipartite networks [Luo et al., 2016]. Shallow and high-order artificial neural networks were proposed by various labs [Hoof et al., 2009, Koch et al., 2013, Kuksa, 2015, Nielsen et al., 2003]. Of these approaches, NetMHC has been shown to achieve state-of-the-art performance for MHC-peptide binding prediction [Nielsen et al., 2016].
In this article, we apply machine learning techniques from the natural language processing (NLP) domain to tackle the task of MHC-peptide binding prediction. Specifically, we introduce a new distributed representation of amino acids, named HLA-Vec, that maps amino acids to a 15-dimensional vector space. We combine this vector space representation with a deep convolutional neural network (CNN) architecture, named HLA-CNN, for the task of HLA class I-peptide binding prediction. Finally, we provide evidence that shows HLA-CNN achieves state-of-the-art results for the majority of different allele subtypes from the IEDB weekly automated benchmark datasets.
2 Methods
2.1 Dataset
To control for data pre-processing variabilities, we used an existing post-processed training dataset so that prediction algorithms could be more directly compared. The dataset was filtered, processed, and prepared by Luo et al. [Luo et al., 2016]. It contains HLA class I binding data curated from four widely used, publicly available MHC datasets: IEDB [Vita et al., 2015], AntiJen [Toseland et al., 2005], MHCBN [Lata et al., 2009], and SYFPEITHI [Rammensee et al., 1999]. A target indicator denoting binding or nonbinding was given as one of the columns in the processed dataset. Peptides that contained unknown or indiscernible amino acids, denoted “X” or “B”, were removed from the dataset prior to training. The dataset was split into a 70% training set and a 30% validation set.
The test datasets were obtained from the IEDB automatic server benchmark page (http://tools.iedb.org/auto_bench/mhci/weekly/). Allele subtypes with fewer than 500 training examples were excluded from testing; the lack of training data is a well-known weakness of deep neural networks, as the model may not converge to a solution or, worse yet, may overfit the small training set. Indicators of binding were given as either binary values or ic50 (half maximal inhibitory concentration) measurements. Binary indicators were used directly. For test sets given in ic50 measurements, a standard threshold, ic50 < 500 nM, was used to denote binding.
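The two pre-processing steps above, dropping peptides with unknown residues and thresholding ic50 values, can be sketched as follows (a minimal sketch; the example peptides and values are illustrative, not taken from the dataset):

```python
# Convert an ic50 measurement (nM) to a binary binding label using the
# standard threshold: ic50 < 500 nM denotes a binder.
def ic50_to_label(ic50_nm):
    return 1 if ic50_nm < 500.0 else 0

# Peptides containing unknown or indiscernible amino acids ("X" or "B")
# are removed prior to training.
def is_valid_peptide(peptide):
    return not any(aa in peptide for aa in ("X", "B"))

peptides = ["LLFGYPVYV", "GILGFVFTX", "NLVPMVATV"]  # hypothetical examples
ic50s = [32.0, 12.0, 5000.0]
data = [(p, ic50_to_label(c)) for p, c in zip(peptides, ic50s)
        if is_valid_peptide(p)]
# data → [("LLFGYPVYV", 1), ("NLVPMVATV", 0)]
```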
2.2 Distributed Representation
Distributed representation has been used successfully in NLP to train word embeddings, the mapping of words to real-valued vector space representations. More generally, distributed representation is a means to represent an item by its relationship to other items. For word embeddings, this means semantically similar words are mapped near each other in the distributed representation vector space [Mikolov et al., 2013]. The resulting distributed representation can then be used much like how BLOSUM is used for sequence alignment of proteins [Henikoff et al., 1992] or for peptide binding prediction by NetMHCpan [Andreatta et al., 2015]. That is, we encode amino acids with their vector space distributed representation so they are usable by downstream machine learning algorithms.
Recently, distributed representation has been explored for bioinformatics applications. Specifically, a 100-dimensional distributed representation of 3-gram protein sequences was used to encode proteins for protein family classification and for identifying disordered sequences, resulting in state-of-the-art performance [Asgari et al., 2015]. The distributed representation was further shown to group 3-grams with similar physicochemical properties close to each other when the 100-dimensional space was mapped to 2-D.
The two main neural probabilistic language models commonly used to train a distributed representation are the continuous skip-gram model and the continuous bag-of-words (CBOW) model [Mikolov et al., 2013]. The two models are similar and are often thought of as inverses of one another. In the skip-gram model, the adjacent context words are predicted based on the center (target) word; conversely, in the CBOW model, the center word is predicted based on the adjacent context words. In this paper, the skip-gram model is used. The interested reader is encouraged to consult the relevant references for further details of the CBOW model.
A short overview of the skip-gram model is given here for completeness. As originally formulated by Mikolov et al. [Mikolov et al., 2013], given a sequence of words $w_1, w_2, \ldots, w_T$, the skip-gram objective is to maximize the average log probability

$$\frac{1}{T}\sum_{t=1}^{T}\;\sum_{-c \le j \le c,\, j \ne 0}\log p(w_{t+j} \mid w_t),$$

where $c$ is the sliding window size and $p(w_{t+j} \mid w_t)$ is defined by the softmax

$$p(w_O \mid w_I) = \frac{\exp\!\big({v'_{w_O}}^{\top} v_{w_I}\big)}{\sum_{w=1}^{W}\exp\!\big({v'_{w}}^{\top} v_{w_I}\big)}.$$

Here, $v_w$ and $v'_w$ are the two vector space representations of the word $w$; the subscripts $O$ and $I$ correspond to the output (context) word and input (target) word respectively, and $W$ is the total number of unique words in the vocabulary. In a typical NLP text corpus with a large vocabulary, calculating the gradient of this log probability becomes impractical. A tractable approximation is obtained by replacing every $\log p(w_O \mid w_I)$ term with the negative-sampling objective

$$\log\sigma\!\big({v'_{w_O}}^{\top} v_{w_I}\big) + \sum_{i=1}^{k}\mathbb{E}_{w_i \sim P_n(w)}\!\left[\log\sigma\!\big(-{v'_{w_i}}^{\top} v_{w_I}\big)\right],$$

where $\sigma(x) = 1/(1 + \exp(-x))$, $k$ is the number of negative samples, and $P_n(w)$ is the noise distribution. This is motivated by the idea that a good model should be able to differentiate real data from false (negative) samples.
By formulating protein data as standard sequence data, like sentences in a text corpus, standard NLP algorithms can be readily applied. More concretely, individual peptides are treated as individual sentences and amino acids are treated as words. In this paper, the skip-gram model is used with 5 negative samples, and a projected vector space of 15 dimensions is chosen. The resulting 15-dimensional vector space distributed representation, HLA-Vec, is summarized in Table 1.
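The peptides-as-sentences formulation can be made concrete with a minimal numpy implementation of skip-gram with negative sampling. This is a didactic sketch, not the exact training pipeline used for HLA-Vec; the hyperparameters (window size, learning rate, epochs) and the three example peptides are illustrative only, while the 15-dimensional output and 5 negative samples match the paper.

```python
import numpy as np

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # the 20-"word" vocabulary
IDX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def train_sgns(peptides, dim=15, window=2, k=5, lr=0.025, epochs=5, seed=0):
    """Skip-gram with negative sampling: peptides play the role of
    sentences, amino acids the role of words. Returns the (20, dim)
    input-embedding matrix (the distributed representation)."""
    rng = np.random.default_rng(seed)
    V = len(AMINO_ACIDS)
    W_in = (rng.random((V, dim)) - 0.5) / dim   # target (input) vectors v_w
    W_out = np.zeros((V, dim))                  # context (output) vectors v'_w
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    for _ in range(epochs):
        for pep in peptides:
            ids = [IDX[aa] for aa in pep]
            for t, center in enumerate(ids):
                lo, hi = max(0, t - window), min(len(ids), t + window + 1)
                for j in range(lo, hi):
                    if j == t:
                        continue
                    # one positive (true context) pair plus k negative samples
                    targets = [ids[j]] + list(rng.integers(0, V, size=k))
                    labels = [1.0] + [0.0] * k
                    h = W_in[center].copy()
                    for tgt, lab in zip(targets, labels):
                        g = lr * (lab - sigmoid(h @ W_out[tgt]))
                        W_in[center] += g * W_out[tgt]
                        W_out[tgt] += g * h
    return W_in

emb = train_sgns(["LLFGYPVYV", "GILGFVFTL", "NLVPMVATV"], dim=15)
print(emb.shape)  # (20, 15): one 15-dimensional vector per amino acid
```

In practice an off-the-shelf word2vec implementation would be used in place of this inner loop; the point is that no change to the algorithm is needed once peptides are treated as sentences.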
2.3 Convolutional neural network
Convolutional neural networks (CNNs) have been studied since the late 1980s and have made a comeback in recent years along with the renewed interest in artificial neural networks, in particular those of the deep architecture variety. Much of the recent fervor is spurred in part by both accessibility to large training datasets consisting of millions of training examples and advances in cheap computing power needed to train these deep network architectures in a reasonable amount of time. Although originally proposed for the task of image classification [LeCun et al., 1989, Krizhevsky, 2012, Simonyan et al., 2014], CNNs have been found to work well for general sequence data such as sentences [Kalchbrenner et al., 2014, Kim, 2014]. It is with this insight that we propose a convolutional neural network for the task of MHC-peptide binding prediction.
The CNN architecture we propose in this paper consists of both convolutional and fully connected (dense) layers. Convolutional layers preserve local spatial information [Taylor et al., 2010] and thus are well suited for studying peptides, where the spatial locations of the amino acids are critical for binding.
Our CNN model, dubbed HLA-CNN, is shown in Fig. 1. The input to the HLA-CNN network is the character string of the peptide, a 9-mer peptide in this example. The input feeds into the embedding layer, which substitutes each amino acid with its 15-dimensional vector space representation. The resulting encoding is a 2-dimensional matrix of size 9×15. The vector space matrix is then 1-dimensionally convolved with 32 filters of size 7, padded to return the same output length as the input, resulting in a matrix of size 9×32. The activation unit used is the leaky rectified linear unit (LeakyReLU) with the default negative-slope coefficient of 0.3. LeakyReLU is similar to the rectified linear unit except there is no zero region, which results in a nonzero gradient over the entire domain [Maas et al., 2013]. Dropout is used after each of the convolutional layers. Dropout acts as regularization to prevent overfitting by randomly dropping a percentage of the units from the CNN during training [Srivastava et al., 2014]. This has the effect of preventing co-adaptation between neurons, the state where two or more neurons detect the same feature. In our architecture, the dropout percentage is set to 25%. The output then feeds into a second convolutional layer with the same filter size, activation unit, and dropout as the first convolutional layer. The 9×32 matrix output by the second convolutional layer is reshaped into a single 1-D vector of size 288, which is fully connected to another layer of the same size with sigmoid activation units. This dense layer is then fully connected to a logistic regression output unit that makes the binding prediction.
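The shapes flowing through this architecture can be traced with a minimal numpy forward pass. This is a shape-checking sketch with random weights, not the trained model; it implements same-padded 1-D convolution, LeakyReLU, flattening, and the sigmoid dense and output layers for the 9-mer case (dropout, a train-time-only operation, is omitted).

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, alpha=0.3):
    # nonzero gradient everywhere: alpha*x for x <= 0
    return np.where(x > 0, x, alpha * x)

def conv1d_same(x, filters):
    """1-D convolution with 'same' zero padding.
    x: (length, channels); filters: (size, channels, n_filters)."""
    size = filters.shape[0]
    pad = size // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.empty((x.shape[0], filters.shape[2]))
    for i in range(x.shape[0]):
        window = xp[i:i + size]                  # (size, channels)
        out[i] = np.tensordot(window, filters, axes=([0, 1], [0, 1]))
    return out

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

x = rng.standard_normal((9, 15))                 # embedded 9-mer: 9 x 15
f1 = rng.standard_normal((7, 15, 32)) * 0.1      # conv 1: 32 filters of size 7
f2 = rng.standard_normal((7, 32, 32)) * 0.1      # conv 2: same configuration
h = leaky_relu(conv1d_same(x, f1))               # -> (9, 32)
h = leaky_relu(conv1d_same(h, f2))               # -> (9, 32)
v = h.reshape(-1)                                # flatten -> 288
Wd = rng.standard_normal((288, 288)) * 0.05      # dense layer, sigmoid units
d = sigmoid(Wd @ v)                              # -> 288
w_out = rng.standard_normal(288) * 0.05          # logistic regression output
p = sigmoid(w_out @ d)                           # scalar binding probability
```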
The loss function used is binary cross entropy and the optimizer is the Adam optimizer with learning rate 0.004. We used a variable batch size instead of a fixed one, forcing all allele subtypes to be trained with 100 batches regardless of the size of the training set. The convolutional layers' filters are initialized by scaling a random Gaussian distribution by the sum of edges coming into and going out of those layers [Glorot et al., 2010]. Finally, the embedding layer of HLA-CNN is initialized to the previously learned HLA-Vec distributed representation, with the caveat that the embedding layer is allowed to be updated during the supervised binding prediction training for each allele subtype. This allows the distributed representation to be fine-tuned for each allele subtype uniquely and for the task of peptide binding specifically. The number of epochs was less critical; we set the maximum to 100 but enforced early stopping if the loss function stopped improving for 2 epochs. Solutions were found to have converged under 40 epochs for all test sets.
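The two less common choices above, a variable batch size fixed at 100 batches per subtype and the binary cross entropy loss, can be written out explicitly (a small sketch; the helper names are ours, not from the paper's code):

```python
import math

def batch_size_for(n_train, n_batches=100):
    """Variable batch size: every allele subtype is trained with roughly
    100 batches, regardless of training-set size."""
    return max(1, math.ceil(n_train / n_batches))

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross entropy, the loss minimized during training."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

print(batch_size_for(10547))  # the HLA-A*02:01 9-mer set -> 106 per batch
```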
The dataset was most abundant in the 9-mer HLA-A*02:01 allele (10547 samples); therefore this specific 9-mer subtype was used for network architectural design and hyperparameter tuning, with a 70% training / 30% validation split. While the network architecture was designed using a single allele subtype of length 9, the HLA-CNN framework is robust enough to accept and make predictions for allele subtypes of any length.
Each test dataset of a given allele subtype and peptide length is treated as a completely separate test. For a specific test dataset, the training dataset is filtered on the allele subtype and peptide length, and the resulting smaller training subset is used to train the HLA-CNN model. Due to the random nature of initialization in the deep learning software framework used, five prediction scores are made for each test set, and the final prediction used for evaluation is the average of the five. Two commonly used evaluation metrics for the peptide binding prediction task are Spearman's rank correlation coefficient (SRCC) and the area under the receiver operating characteristic curve (AUC). The state-of-the-art NetMHCpan [Andreatta et al., 2015], a shallow feed-forward neural network, and a more recently developed bipartite network-based algorithm, sNebula [Luo et al., 2016], are used as comparisons for our proposed HLA-CNN prediction model.
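The evaluation protocol, averaging five runs and scoring with SRCC and AUC, can be sketched in numpy. These are compact rank-based implementations for illustration (ties ignored for brevity); in practice library routines such as scipy's `spearmanr` and scikit-learn's `roc_auc_score` would be used. The five score vectors below are hypothetical.

```python
import numpy as np

def spearman_rcc(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

def auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) formulation, assuming no ties."""
    labels = np.asarray(labels)
    ranks = np.argsort(np.argsort(scores)) + 1
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return float((ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2)
                 / (n_pos * n_neg))

# Final score for a test set: the average of five independent runs.
runs = np.array([[0.90, 0.20, 0.70],
                 [0.80, 0.30, 0.60],
                 [0.85, 0.10, 0.75],
                 [0.95, 0.25, 0.65],
                 [0.90, 0.15, 0.70]])
final = runs.mean(axis=0)        # one averaged score per test peptide
labels = [1, 0, 1]               # hypothetical ground-truth binders
```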
3 Results
We have introduced the HLA class I dataset and formulated HLA class I peptide data as the equivalent of text data used in NLP machine learning tasks. We have proposed a model to learn a vector space distributed representation of amino acids from this HLA class I dataset. We have also described our deep learning method and how it takes advantage of this new distributed representation of amino acids to solve the problem of HLA class I-peptide binding prediction. Next, we show the result of the learned distributed representation, followed by the performance of our model against the state-of-the-art prediction model and another recently developed model.
3.1 Distributed Representation
The 15-dimensional distributed representation of amino acids is shown in Table 1. The 15 dimensions have no corresponding physicochemical equivalence or interpretation; they are simply the result of the algorithm and our choice of size for the representation. Various other dimensionalities were explored; however, 15 dimensions gave the best results on 10-fold cross-validation of the HLA-A*02:01 subtype.
To understand the learned 15-dimensional HLA-Vec distributed representation of the twenty amino acids, we visualize this vector space in 2-D using a dimension reduction technique called t-distributed stochastic neighbor embedding (t-SNE) [Maaten et al., 2008]. t-SNE preserves the local structure of the data, i.e. points close to each other in the high-dimensional space are grouped closer together in the low-dimensional 2-D space.
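This visualization step can be reproduced with scikit-learn's t-SNE (assuming scikit-learn is available; the embedding matrix below is a random stand-in for the learned HLA-Vec matrix of Table 1, and the perplexity value is our choice, required to be smaller than the 20 points):

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the learned (20, 15) HLA-Vec matrix from Table 1.
rng = np.random.default_rng(0)
hla_vec = rng.standard_normal((20, 15))

# Perplexity must be below the number of samples (20 amino acids).
tsne = TSNE(n_components=2, perplexity=5, init="random", random_state=0)
coords = tsne.fit_transform(hla_vec)   # (20, 2): one 2-D point per amino acid
```

Each of the 20 rows of `coords` can then be plotted and colored by a physicochemical property, as in Fig 2.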
In Fig 2, we see the 2-D mapping of HLA-Vec colored by various physicochemical properties, including hydrophobicity, normalized van der Waals volume, polarity, mass, volume, and net charge [Asgari et al., 2015] from the Amino acid Physicochemical properties Database (APDbase) [Mathura et al., 2005]. As can be seen, hydrophobicity, polarity, and net charge, factors important for chemical bonding, can be visually distinguished into groups. This validates distributed representation as an effective method to encode amino acids that also preserves important physicochemical properties.
3.2 HLA-peptide binding prediction
The results of our HLA-CNN prediction model against NetMHCpan and sNebula on the two latest IEDB benchmarks are shown in Table 2. As AUC is a better measure of the quality of a binary predictor than SRCC, for evaluation purposes we say one algorithm is superior to another if it scores higher on the AUC metric.
On these latest IEDB benchmark datasets, our algorithm achieved state-of-the-art results in 10 of 15 (66.7%) test datasets. This is in contrast to NetMHCpan, which achieved state-of-the-art results in only 4 of 15 (26.7%), and sNebula, also 4 of 15 (26.7%). In many of the allele subtypes, our algorithm achieved significant AUC improvements (greater than 10%) over the two existing models. In fact, for the B*07:02 and B*27:03 subtypes, our model achieved perfect SRCC and AUC scores. In the 10 test sets where our model achieved state-of-the-art results, it averaged a 7.8% improvement over the previous state-of-the-art. SRCC performance also increased on the vast majority of test sets.
In Fig 3, the ROC curves are shown for all five predictions of the HLA-A*68:02 9-mer subtype as an example of the improvement our model gives over the previous state-of-the-art. As can be seen, all five curves outperform NetMHCpan's curve at almost all thresholds.
The results suggest that HLA-CNN can accurately predict HLA class I-peptide binding and outperforms the current state-of-the-art algorithms. The results also confirm that the hyperparameters of HLA-CNN learned on the HLA-A*02:01 9-mer subtype generalize well to a variety of other allele subtypes and peptide lengths. This validates the robustness of our algorithm, as different networks did not have to be specifically designed for each allele subtype.
4 Conclusion
In this work, we have described how machine learning techniques from the NLP domain can be applied to HLA class I-peptide binding prediction. We presented a method to extract a vector space distributed representation of amino acids from HLA peptide data that preserves properties critical for chemical bonding. Using this vector space representation, we proposed a deep CNN architecture for HLA class I-peptide binding prediction. This framework is capable of making predictions for allele subtypes of any peptide length.
While our network established new state-of-the-art results for many of these test sets, this by no means implies it is the best our model can do on these datasets. As with any artificial neural network-based machine learning algorithm, our model will improve as more training data becomes available. Therefore, as genomic researchers publish more HLA class I peptide data, our algorithm can be retrained at a future date and improved results can be expected.
In future work, allele-specific affinity thresholds, instead of the general binding affinity threshold of ic50 < 500 nM, could be used to identify peptide binders in different subtypes; this approach has shown superior predictive efficacy in previous work [Paul et al., 2013]. From an architecture design standpoint, one possibility for extending the network is to replace the dense layer with a convolutional layer, thereby creating a fully convolutional network (FCN). Since convolutional layers preserve spatial information in the peptide, an FCN might improve performance over the existing network if all layers in the network had this capability. Another option is to tackle the peptide binding problem with more advanced NLP architectures such as Long Short-Term Memory (LSTM) recurrent neural networks, which have the ability to remember values over long sequences. This would allow us to train a single model to classify binding for peptides of any subtype and length.
Footnotes
ysvang{at}ics.uci.edu, xhx{at}ics.uci.edu