Abstract
Many biological processes are governed by protein-ligand interactions. One such example is the recognition of self and non-self cells by the immune system. This immune response process is regulated by the major histocompatibility complex (MHC) protein which is encoded by the human leukocyte antigen (HLA) complex. Understanding the binding potential between MHC and peptides can lead to the design of more potent, peptide-based vaccines and immunotherapies for infectious autoimmune diseases.
We apply machine learning techniques from the natural language processing (NLP) domain to address the task of MHC-peptide binding prediction. More specifically, we introduce a new distributed representation of amino acids, name HLA-Vec, that can be used for a variety of downstream proteomic machine learning tasks. We then propose a deep convolutional neural network architecture, name HLA-CNN, for the task of HLA class I-peptide binding prediction. Experimental results show combining the new distributed representation with our HLA-CNN architecture acheives state-of-the-art results in the majority of the latest two Immune Epitope Database (IEDB) weekly automated benchmark datasets. We further apply our model to predict binding on the human genome and identify 15 genes with potential for self binding. Codes are available at https://github.com/uci-cbcl/HLA-bind.
1 Introduction
The major histocompatibility complex (MHC) are cell surface proteins used to bind intracellular peptide fragments and display them on cell surface for recognition by T-cells [Janeway et al., 2001]. In humans, the human leukocyte antigens (HLA) gene complex encodes these MHC proteins. HLAs displays a high degree of polymorphism, a variability maintained through the need to successfully process a wide range of foreign peptides (Jin et al., 2003, Williams, 2001].
The HLA gene lies in chromosome 6p21 and is comprised of 7.6Mb [Simmonds et al., 2007]. There are different classes of HLAs including class I, II, and III corresponding to their location in the encoding region. HLA class I is one of two, the other being class II, primary classes of HLA. Its function is to present peptides from inside cells to be recognized either as self or non-self as part of the immune system. Foreign antigens presented by class I HLAs attracts killer T-cells and provoke an immune response. Similarly, class II HLAs are only found on antigen-presenting cells, such as mononuclear phagocytes and B cells, and presents antigen from extracellular proteins [Ulvestad et al., 1994]. Unlike class I and II, class III HLAs encode proteins important for inflammation.
The focus of this paper is on HLA class I proteins. As these molecules are highly specific, they are able to bind with only a tiny fraction of the peptides available through the antigen presenting pathway [Nielsen et al., 2016, Yewdell, 1999]. This specificity makes binding to the HLA protein the most critical step in antigen presentation. Due to the importance of binding, accurate prediction models can shed understanding to adverse drug reactions and autoimmune diseases [Gebe et al., the 2002, Illing et al., 2012], and lead to the design of more effective protein therapy and vaccines [Chirino et al., 2004, van der Burg et al., 2006].
Given the importance of MHC to the immune response, many algorithms have been developed for the task of MHC-peptide binding prediction. The following list is by no means exhaustive but a small sample of previously proposed models. Wang et al. proposed using quantitative structure-activity relationship (QSAR) modeling from various amino acid descriptors with linear regression models [Wang et al., 2015]. Kim et al. derived an amino acid similarity matrix [Kim et al., 2009]. Luo et al. proposed both a colored and non-colored bipartite networks [Luo et al., 2016]. Shallow and high-order artificial neural networks were proposed from various labs [Hoof et al., 2009, Koch et al., 2013, Kuksa, 2015, Nielsen et al., 2003]. Of these approaches, NetMHC/NetMHCpan have been shown to achieve state-of-the-art for MHC-peptide binding prediction [Nielsen et al., 2016, Trolle et al., 2015].
In this article, we apply machine learning techniques from the natural language processing (NLP) domain to tackle the task of MHC-peptide binding prediction. Specifically, we introduce a new distributed representation of amino acids, named HLA-Vec, that maps amino acids to a 15- dimensional vector space. We combine this vector space representation with a deep convolutional neural network (CNN) architecture, named HLA-CNN, for the task of HLA class I-peptide binding prediction. Finally, we provide evidence that shows HLA-CNN achieves state-of-the-art results for the majority of different allele subtypes from the IEDB weekly automated benchmark datasets.
2 Methods
2.1 Dataset
To control for data pre-processing variabilities, we decided to use an existing post-processed training dataset so prediction algorithms could be more directly compared. The dataset used was filtered, processed, and prepared by Luo et al. [Luo et al., 2016]. This dataset contained HLA class I binding data curated from four widely used, publicly available MHC datasets: IEDB [Vita et al., 2015], AntiJen [Toseland et al., 2005], MHCBN [Lata et al., 2009], and SYFPEITHI [Rammensee et al., 1999]. Target indicator indicating binding or nonbinding was readily given as one of the column in the processed dataset. Peptides that contained unknown or indiscernible amino acids, denoted as “X” or “B”, were removed from the dataset prior to training. Dataset was split into 70% training set and 30% validation set.
The test datasets were obtained from IEDB automatic server benchmark page (http://tools.iedb.org/autobench/mhci/weekly/). Allele subtypes with less than 500 training examples were excluded from testing. The lack of training data is a well-known weakness of deep neural networks as the model may not converge to a solution or worst yet, may overfit to the small training set. Indicators of binding were given as either binary values or ic50 (half maximal inhibitory concentration) measurements. Binary indicators were used directly while values given in ic50 measurements were denoted as binding if ic50 < 500 nM.
2.2 Distributed Representation
Distributed representation has been successfully used in NLP to train word embeddings, the mapping of words to real-value vector space representations. More generally, distributed representation is a means to represent an item by its relationship to other items. In word embeddings, this means semantically similar words are mapped near each other in the distributed representation vector space. The resulting distributed representation can then be used much like how BLO-SUM is used for sequence alignment of proteins [Henikoff et al., 1992] or peptide binding prediction by NetMHCpan [Andreatta et al., 2015]. That is, we encode amino acids with their vector space distributed representation to be useable by down-stream machine learning algorithms. Other amino acid encoding includes Atchley factors [Atchley et al., 2005] and Kidera factors [Kidera et al., 1985], both of which were constructed explicitly to summarize amino acid physicochemical properties. In the end-to-end machine learning approach we propose, the encoding is learned directly from the raw amino acid sequences in an unsupervised manner. The vector representation is obtained without any manual input, and as a result, the vector space has no explicit interpretations unlike Atchley or Kidera factors.
Recently, distributed representation had been explored for bioinformatics applications. Specifically, trigram (sequence of 3 amino acids) protein distributed representation of size 100-dimensions was used to encode proteins for protein family classification and identifying disordered sequences, resulting in state-of-the-arts performance [Asgari et al., 2015]. The distributed representation was further shown to grouped trigram proteins with similar physicochemical property closer to each other by mapping the 100-dimensional space to 2-dimension.
Distributed representation approaches can be classified into two broad classes: prediction-based and count-based. Two of the most popular prediction-based, neural probabilistic language models commonly used to develop a distributed representation are the skip-gram model and continuous bag-of-words (CBOW) model [Mikolov et al., 2013]. Both models are similar and is often thought of as inverse of one another. In the skip-gram model, the adjacent context-words are predicted based on the center (target) word. Conversely in the CBOW model, the center word is predicted based on adjacent context-words.
A recently proposed distributed representation based on more traditional count-based method is GloVe [Pennington et al., 2014]. In this approach, the authors worked on co-occurrence statistics explicitly and cast the problem as a weighted least square problem with the aim to minimize the difference between the inner product of each pair of word vectors and the logrithm of their co-occurrences. With certain assumption, the authors showed that the skip-gram model’s cost function can be formulated equivalently to the GloVe model. However, it has been shown that prediction-based models are superior to count-based models [Baroni et al., 2014], and under equal conditions where both models’ hyperparameters were highly tuned, the skip-gram model consistently outperformed GloVe model on a number of NLP tasks [Levy et al., 2015].
In this paper, the skip-gram model is used. The interested reader is encourage to consult the relevant references for further details of the CBOW or GloVe models.
A short overview of the skip-gram model is given here for completeness. As originally for mulated by Mikolov [Mikolov et al., 2013], in the skip-gram model, given a sequence of words w1,w2,…,wT, the objective is to maximize the average log probability: where T is the total number of words (i.e. total number of amino acids) in the dataset, c is the context window size (i.e. number of words to the right or left of the target word, and p(wt+j|wt) is defined as
Here, vw and are two vector space representations of the word w. The subscripts O and I correspond to the output (context-words) word and input (target) word respectively. W is the total number of unique words in the vocabulary. In typical NLP text corpus with large vocabulary, calculating the gradient of the log probability becomes impractical. An approximation to the log probability is obtained by replacing every log p(wO|wI) with where σ(x) = 1/(1 + exp(–x)) and k are negative samples. This was motivated by the idea that a good model should be able to differentiate real data from false (negative) ones.
By formulating protein data as standard sequence data like sentences in a text corpus, standard NLP algorithms can be readily applied. More concretely, individual peptides are treated as individual sentences and amino acids are treated as words. In this paper, the skip-gram model is used with a context window of size 5, 5 negative samples, and 15-dimensional vector space embedding. Various other dimensional size were explored, however, 15-dimensions gave the best results on 10-fold cross-validation of HLA-A*02:01 subtype. The entire post-processed dataset by Luo et al. [Luo et al., 2016] was used to learn this new distributed representation. The 15-dimensional vector space distributed representation, HLA-Vec, is summarized in Table 1. Experimental results indicate using our proposed HLA-Vec encoding showed performance gains over Asgari’s representation, Atchley factors, or Kidera factors. Description of the experiements results can be found in the Supplementary Material.
2.3 Convolutional neural network
Convolutional neural networks (CNN) have been studied since the late 1980s and have made a comeback in recent years along with renewed interested in artificial neural networks, and in particular of the deep architecture varieties. Much of the recent fervor has been spurned in part by both accessibility to large training datasets consisting of over millions of training examples and advances in cheap computing power needed to train these deep network architectures in a reasonable amount of time. Although originally proposed for the task of image classification [LeCun et al., 1989, Krizhevsky, 2012, Simonyan et al., 2014], CNN have been found to work well for general sequence data such as natural language sentences [Kalchbrenner et al., 2014, Kim, 2014]. It is with this insight that we propose a convolutional neural network for the task of MHC-peptide binding prediction.
The CNN architecture we propose in this paper consists of both convolutional and fully connected (dense) layers. Convolutional layers preserve local spatial information [Taylor et al., 2010] and thus is well suited for studying peptides where spatial locations of the amino acids are critical for bonding.
Our CNN model, dubbed HLA-CNN, can be seen in Fig. 1. The input into HLA-CNN network is the character string of the peptide, a 9-mer peptide in this example. The input feeds into the embedding layer that substitutes each amino acid with their 15-dimensional vector space representation. The output encoding is a 2-dimensional matrix of size 9×15. The vector space matrix is then 1-dimensionally convolved with 32 filters of length (rows) 7 and returns the same output sequence length as input, resulting in a matrix of size 9×32. 1-dimensional convolution automatically constrains the current filter’s column size to be identical to the incoming input matrix’s column size. Therefore each of the 32 filters in the conv1 layer are of size 7×15, and in the conv2 layer are of size 7×32. With appropriate zero-padding of the input matrix, the same output sequence length, e.g. 9, is returned. More formally, the 1-d convolution formula is defined as: where Fk is the kth filter, H is the input matrix, G is the output matrix, M is the column size of H minus 1, and u ranges from to
The activation unit use is the leaky rectified linear units (LeakyReLU) with default learning rate of 0.3. LeakyReLU is similar to rectified linear units except there is no zero region which results in non-zero gradient over the entire domain [Maas et al., 2013]. Dropout is used after each of the convolutional layers. Dropout acts as regularization to prevent overfitting by randomly dropping a percentage of the units from the CNN during training [Srivastava et al., 2014]. This has the effect of preventing co-adaptation between neurons, the state where two or more neurons detect the same feature. In our architecture, the dropout percentage is set to 25%. The output then feeds into a second convolutional layer with the same filter length, activation unit, and dropout as the first convolutional layer. The 9×32 matrix outputted by the second convolutional layer is reshaped into a single 1-D vector of size 288 which is fully connected to another layer of the same size with sigmoid activation units. This dense layer is then fully connected to a logistic regression output unit to make a prediction.
The loss function used is the binary cross entropy function and the optimizer used is the Adam optimizer with learning rate 0.004. We use a variable batch size instead of a fixed one, choosing instead to force all allele subtypes to have 100 batches no matter the total number of training samples of each subtype. The convolutional layers’ filters are initialized by scaling a random Gaussian distribution by the sum of edges coming in and going out of those layers [Glorot et al., 2010]. Finally, the embedding layer of HLA-CNN is initialized to the previously learned HLA-Vec distributed representation with the caveat that the embedding layer is allowed to be updated during the supervised binding prediction training for each allele subtypes. This allows for the distributed representation to be fined-tuned for each allele sub-types uniquely and for the task of peptide binding specifically. The number of epoch was less important as we arbitrarily set max epoch to 100 but enforce early stoppage if the loss function stops improving for 2 consecutive epochs. Solutions were found to have converged under 40 epochs for all test sets.
The dataset was most abundant in 9-mer HLA-A*02:01 allele (10547 samples) therefore this specific 9-mer subtype was used for network architectural design and hyperparameter tuning. Dataset split of 70% training and 30% validation was used to determine the optimal architecture and hyper-parameters. While the network architecture was designed using a single allele subtype of length 9, HLA-CNN framework is robust enough to accept and make prediction for allele subtypes of any length.
Each test datasets of different allele subtypes and peptide lengths are treated as completely separate tests. For a specific test dataset, the training dataset is filtered on the allele subtype and peptide length. The resulting smaller training subset is then used to train the HLA-CNN model. Due to the random nature of initialization in the deep learning software framework used, five prediction scores are made for each test sets. The final prediction used for evaluation purposes is taken as the average predicted score of the five predictions. Two commonly used evaluation metric for peptide binding prediction task are the Spearman’s rank correlation coefficient (SRCC) and area under the receiver operating characteristic curve (AUC).The state-of-the-art NetMHC-pan [Andreatta et al., 2015,Trolle et al., 2015], a shallow feed forward neural network, and a more recently developed bipartite network-based algorithm, sNebula [Luo et al., 2016], will be used to compared the performance of our proposed HLA-CNN prediction model.
3 Results
We have introduced the HLA class I dataset. We formulated the HLA class I peptide data as an equivalence of text data used in NLP machine learning tasks. We have proposed a model to learn a vector space distributed representation of amino acids from this HLA class I dataset. We have also described our deep learning method and how it takes advantage of this new distributed representation of amino acids to solve the problem of HLA class I-peptide binding prediction. Next, we show the result of the learned distributed representation followed by the performance of our model against the state-of-the-art prediction model and another recently developed model.
3.1 Distributed Representation
The 15-dimensional distributed representation of amino acids is shown in Table 1. Each of the 15 dimensions on their own have no explicit physicochemical interpretation, unlike in Atchley factors [Atchley et al., 2005] or Kidera factors [Kidera et al., 1985]. They are simply the result of the algorithm and our choice of embedding size for the representation.
To see if the learned, 15-dimensional distributed representation of the twenty amino acids was able to capture any interesting pattern, we reduce the 15-dimensional vector space to a visualizable 2-dimensional representation using a dimension reduction technique called t-distributed stochastic neighboring embedding (t-SNE) [Maaten et al., 2008]. t-SNE is capable of preserving local structure of the data, e.g. points closer to each other in the original, high-dimensional space are grouped closer together in the low 2-dimensional space. We color this low dimensional representation with various physicochemical properties to see if any pattern can be discerned using this unsupervised machine learning technique.
In Fig 2, we see the 2-D visualization of HLA-Vec colored by various physicochemical properties, including hydrophobicity, normalized van der waals volume, polarity, and net charge [Asgari et al., 2015] from the Amino acid Physicochemical properties Database (APDbase) [Mathura et al., 2005]. As can be seen, there are some structure in the graphs for hydrophobicity, polarity, and net charge; factors important for covalent chemical bonding. The clusters of magenta-colored amino acids are almost separable from clusters of green-colored amino acids with the exception of a few outliers. This gives validation to distributed representation as an effective technique to automatically learn encoding that is able to preserve some important physicochemical properties without explicitly constructing such an encoding by hand.
3.2 HLA-peptide binding prediction
The results of our HLA-CNN prediction model against NetMHCpan and sNebula on the two latest IEDB benchmarks are shown in Table 2. As AUC is a better measure of the goodness of binary predictors compared to SRCC, for evaluation purposes between models, we say one algorithm is superior to another if it scores higher on the AUC metric.
On these latest IEDB benchmark datasets, our algorithm achieved state-of-the-art results in 10 of 15 (66.7%) test datasets. This is in contrast to NetMHCpan, which acheived state-of-the-art results in only 4 out of 15 (26.7%) and sNebula in 4 out of 15 (26.7%). In the 10 allele subtypes where our model achieved state-of-the-art results, our model averaged a 9.3% improvement over the previous state-of-the-art.
As the binary cross-entropy loss function for this binding prediction problem operates on binary-transformed indicator values, any sort of ranking information encoded in ic50 binding measurements are loss in the objective and is a secondary task. Indeed, we observed no strong correlation or monotonicity between SRCC and AUC. Our algorithm scored highest for the SRCC metric on 7 of 15 test sets. NetMHCpan scored highest on 7 test sets as well and sNebula highest on 3 test sets. However, on average performance over all subtypes, our model gained a modest 1% improvement over netMHCpan.
In Fig. 3, the ROC curves are shown for all five predictions of the HLA-A*68:02 9-mer subtype as an example of the improvement our model gives over the previous state-of-the-part. As can be seen, all five curves are outperforming NetMHCpan’s curve at almost all thresholds.
The results suggests that HLA-CNN can accurately predict HLA class I-peptide binding and outperforms the current state-of-the-art algorithms. The results also confirmed that the hyperparamters of HLA-CNN learned on the HLAA* 02:01 9-mer subtype generalizes well to cover a variety of other allele subtypes and peptide lengths, demonstrating the robustness of our algorithm.
3.3 Model Ablations
In order to understand whether the distributed representation or the CNN was responsible for the performance of HLA-CNN, we perform a model ablation analysis where we remove one component of our algorithm at a time. The result shown in Table 3 is of average SRCC and AUC scores over the different allele subtypes. The table indicates that the CNN is most important. In the -CNN architecture, we run a one hidden layer (66 units) neural network like NetMHCpan [Nielsen et al., 2016] using the HLA-Vec distributed representation encoding and allow for fine-tuning during training step. In the -Distributed Rep. architecture, we use a sparse, one-hot encoding with the CNN. Performance for each allele subtypes under these two models are available in the Supplementary Material.
3.4 Run-time Analysis
Though we did not do so, due to the relatively small peptide binding dataset compared to ones typically seen in NLP, both HLA-Vec and HLA-CNN can be parallelized on a GPU for faster computation. Using a single thread of a quad 3.5GHz Intel Core i7 machine, HLA-Vec learned on the entire dataset took 33 seconds. HLA-CNN trained on the largest allele subtype, A*02:01 9-mers, took less than 10 minutes to run.
3.5 UniProtKB Human Gene binding prediction
We perform binding prediction experiment on the entire 20,162 human protein-coding genome from UniProtKB [The UniProt Consortium., 2017] and randomly generated 9-mers. The 20,162 human genes are chopped into 9-mers, with duplicates and those containing amino acids X, B, and U filtered out, leaving 1,0873,314 unique self 9-mers. An equivalent number of non-self 9-mer proteins, exclusive the self 9-mers, were obtained by randomly permuting the 20 amino acids.
The HLA class I - A gene alone is reported to have almost 4000 different alleles [Marsh et al., 2010], each estimated to bind between 1,000 and 10,000 individual peptide sequences [Brusic et al., 2004]. As each allele subtype is highly specific and binds to only a small subset of peptides that exhibits a particular motif [Eisen et al., 2012], we were interested to see if any pattern could be discerned using our model to make binding predictions on the sets of self and nonself 9-mers.
Fig. 4 shows the distributions of binding prediction for self and nonself 9-mers using HLA-CNN trained on the A*02:01 allele subtype. The distribution of predicted binding probablities between the two sets of self and nonself 9-mers are nearly identical. This was not unexpected as the small number of training data points compared to the overall size of the test sets led us to believe the model would exhibit similar level of false positives between the two 9-mer sets.
What is interesting is the fact that our model predicts a high number of potential self binding 9-mers. Table 4 shows the top 15 human 9-mers with highest predicted binding probabilities. Shown next to each of these 9-mers are the name of the genes where these 9-mers originated from. A literature review shows that these 15 genes are novel and not involved in any pathway of known autoimmune diseases. Our model indicates that these genes have the potential for self binding and may be worth validating in future experiments that are beyond the scope of this work.
4 Conclusion
In this work, we have described how machine learning techniques from the NLP domain could be applied to bioinformatics setting, specifically HLA class I-peptide binding prediction. We presented a method to extract a vector space distributed representation of amino acids from available HLA classI-peptide data that preserved property critical for covalent bonding. Using this vector space representation, we proposed a deep CNN architecture for the purpose of HLA class I-peptide binding prediction. This framework is capable of making prediction for any length peptides or any allele subtype, provided sufficient training data is available. Experimental results on the IEDB benchmark datasets demonstrate our algorithm achieved state-of-the-art binding prediction performance on the majority of test sets over existing models.
On future work, allele-specific affinity thresholds instead of a general binding affinity ic50 threshold of 500 nM can be used to identify peptide binders in different subtypes. This approach had shown superior predictive efficacy in previous work [Paul et al., 2013]. From an architecture design standpoint, one possibility to extend the network is to replace the dense layer with a convolutional layer, thereby creating a fully convolutional network (FCN). The motivation being since convolutional layers preserve spatial information in the peptide, perhaps a FCN could improve performance over the existing network if all layers in the network had this capability. Another option is to generalize the single output architecture to multi-outputs. Specifically, a secondary output layer and loss function can be added to minimize the mean square error between gold standard ic50 values and predicted ic50 values alongside the existing binary cross-entropy output layer. The underlying convolutional and fully connected layers would be shared between these two output layers/loss functions as the motivation would be to learn a model that has both good AUC quality as well as SRCC quality.