Abstract
Human leukocyte antigen (HLA) complex molecules play an essential role in immune interactions by presenting peptides on the cell surface to T cells. With the significant progress in deep learning, a series of neural network based models have been proposed and shown to perform well for peptide-HLA class I binding prediction. However, effective prediction models for peptide binding to HLA class II proteins are still lacking due to the inherent challenges of this problem. In this work, we present a novel sequence-based pan-specific neural network structure, DeepSeqPanII, for peptide-HLA class II binding prediction. Compared with existing pan-specific models, our model is an end-to-end neural network model without the need for pre- or post-processing on input samples. Besides state-of-the-art performance in binding affinity prediction, DeepSeqPanII can also extract biological insight on the binding mechanism over the peptide and HLA sequences through its attention mechanism based binding core prediction capability. The leave-one-allele-out cross-validation and benchmark evaluation results show that our proposed network model achieved state-of-the-art performance in HLA-II peptide binding. The source code and trained models are freely available at https://github.com/pcpLiu/DeepSeqPanII.
1 Introduction
The human leukocyte antigen (HLA) complex is responsible for presenting peptides on cell surfaces for recognition by T cells. There are two major groups of HLAs: class I and class II. Class I HLAs present peptides that originate from the cytoplasm, while class II HLAs present those originating extracellularly from foreign bodies such as bacteria. Another difference between these two classes is their chain structure: a class I HLA protein has one chain (α), whereas a class II HLA protein has two chains (α and β). HLA genes are highly polymorphic, which allows them to fine-tune the adaptive immune system. Prediction of HLA-II binding peptides is important for vaccine design and targeted therapy development in immunology and cancer immunotherapy, but it is challenging because HLA-II molecules are highly polymorphic and the size of the presented peptides varies [1], [2]. As experimentally characterizing the binding specificity of all HLA molecules is costly in terms of time and labor, effective computational methods are needed for HLA-II peptide binding affinity prediction. It has been shown that computational tools for predicting neoantigens are playing an increasingly important role in interrogating cancer immunity [3].
In the past few years, deep neural networks have achieved great success in computer vision, pattern recognition, and natural language processing [4]. Inspired by that, a series of deep neural network models have been proposed to address the peptide-HLA class I binding problem [5], [6], [7]. In our previous work, we developed a novel deep convolutional neural network based pan-specific model for MHC-I peptide binding prediction [8]. However, for HLA class II, there are few deep neural network based prediction algorithms in the published literature. In 2018, Nielsen et al. established an automated benchmarking platform for MHC class II binding prediction methods [9], while a few benchmark studies have been conducted for MHC class I binding [10], [11], [12]. Currently, NN-align (2009) [13], NetMHCIIpan-3.1 (2015) [14], Comblib matrices (2008) [15], SMM-align (2007) [16], Tepitope (1999) [17] and Consensus IEDB (2008) [18] are included in this benchmark study, most of which were developed quite a while ago. More recently, there have been several major reports on HLA-II peptide binding prediction [1], [2], [19]. First, Garde et al. [2] proposed to take advantage of a large set of MHC class II eluted ligands generated by mass spectrometry to guide the prediction of MHC class II antigens. Next, in Nature Biotechnology, Racle et al. [19] proposed to combine unbiased mass spectrometry with a motif deconvolution algorithm to analyze peptides eluted from HLA-II molecules. They developed a probabilistic framework to learn multiple motifs on the peptides, as well as the weights and binding core offsets of these motifs. Their probabilistic predictor of HLA-II ligands (MixMHC2pred) was shown to outperform NetMHCIIpan. At the same time, also in Nature Biotechnology, a long short-term memory (LSTM) based recurrent neural network model, MARIA [1], was proposed to handle the high variability in the length of HLA-II peptide ligands (8–26 amino acids).
When trained with traditional HLA binding affinity data, MS-based antigen presentation profiling datasets, gene expression data, and the flanking residues of peptides, their deep learning model achieved significantly better prediction performance. While their focus is on demonstrating the importance of new multi-modal data sources such as peptide HLA ligand sequences identified by mass spectrometry, expression levels of antigen genes and protease cleavage signatures, their deep neural network model is a basic LSTM model with one-hot encoding and two dense layers.
In this paper, we are interested in developing more advanced deep neural network architecture for achieving interpretable Class-II HLA peptide binding affinity prediction and binding core prediction. Compared with HLA class I peptide binding prediction, binding prediction of peptide-HLA class II is much more challenging due to two main facts [20]:
Two amino acid chain structure and highly variable lengths. HLAs of class I have one protein sequence, and all such protein sequences have the same length. In class II, however, HLA proteins have two amino acid sequences whose lengths vary across alleles, which causes issues for pan-specific binding prediction methods [8].
Longer peptides. HLA class I molecules have a closed-end binding groove. Thus, MHC-I binding peptides are 8–11 consecutive residues, among which nonamers (9-mers) are most common. On the other hand, the groove of MHC-II molecules has open ends, so they generally bind longer peptides, normally 14–18 residues. In those long peptides, a small part (usually nine amino acid residues, called the binding core) is fitted into the groove, with the remaining peptide termini on both ends extending outside [21].
In previous work, several strategies have been proposed to address these two challenges in MHC-II binding prediction. SMM-align, an allele-specific method (each allele has a trained model), uses a Metropolis Monte Carlo procedure to search for an optimal weight matrix that can be used to calculate the binding affinity of any 9-mer peptide [16]. NN-align instead identifies the binding core of a given peptide using Gibbs sampling, and then uses this binding core and the binding affinity to update the network weights [13]. NetMHCIIpan-3.1 is a pan-specific method (one model for all alleles), in which protein sequences are represented as pseudo sequences extracted from known binding structures. A peptide is then processed through SMM-align to generate one binding core peptide and suboptimal peptides (as shown in Figure S1). All these methods use some kind of alignment or pre-processing on peptides to obtain the binding core and thereby overcome the highly variable peptide length issue. In NetMHCIIpan-3.1, the only pan-specific method, pseudo sequences are used to address the variable lengths of different HLA class II proteins. In the latest MARIA algorithm, the variable length of peptides is handled with the LSTM layer.
To address both issues seamlessly, here we propose a novel pan-specific deep neural network model with the attention mechanism, DeepSeqPanII (as shown in Figure 1), for MHC-II peptide binding prediction. In our recurrent neural network module, raw peptide and HLA sequences are directly encoded as three vectors of unified size. We then feed these three vectors into a convolutional network to extract binding context information, which is then used to predict binding affinity. A major advantage of our model compared to MARIA and other conventional machine learning models is that, by taking advantage of the attention mechanism [22], we can identify the binding core of the peptide based on the attention vector automatically learned by the model during end-to-end training, without any supervised or alignment information. Our contributions in this work are summarized as follows:
We proposed an end-to-end pan-specific deep neural network architecture with attention mechanism for MHC-II peptide binding prediction, in which only raw sequences are needed to train the prediction models.
We developed a way to interpret the learned weights of our model to identify the peptide binding core ab initio, which demonstrates that our model can learn insightful knowledge of the binding mechanism in an unsupervised way.
Based on extensive benchmark experiments, we showed that our model could achieve state-of-the-art prediction performance.
The DNN architecture, source code and pre-trained models are all available for downloading for others to reproduce our work.
2 Materials and methods
2.1 Dataset
In this work, we collected two training data sets, BD2013 and BD2016, which were generated in 2013 [23] and 2016 [24], respectively. Both data sets were sourced from the IEDB. In related work, NetMHCIIpan-3.0 was trained on BD2013 and NetMHCIIpan-3.2 was trained on BD2016. Recently, Andreatta et al. set up an automated platform to benchmark peptide-MHC class II binding prediction methods [9]. We downloaded all available benchmark datasets from 2016-12-31 to 2017-12-29 and evaluated our model on this dataset.
HLA class II protein sequences were downloaded from the IPD-IMGT/HLA database [25]. With the downloaded α sequences and β sequences, we performed two multiple-sequence alignments separately. The online alignment tool Clustal Omega, supported by EMBL-EBI [26], was used here, which is a fast multiple-sequence alignment tool that only requires a list of protein sequences as input. All datasets used in our experiments are included in our GitHub repository for this project.
2.2 Sequence encoding
We mix one-hot encoding and the BLOSUM62 matrix to represent a protein sequence. Given a sequence with L amino acids, it is encoded as a 2D tensor with dimension L (length) × 43 (channels). We tried several combinations of one-hot encoding, BLOSUM62 and physical properties, and found that this combination gave us the best performance. To reduce the training time, we padded all sequences to maximum lengths so that they can be processed in batches during the training stage. More specifically, we set the encoding lengths of the peptide sequences and the HLA α and β sequences to 25, 274 and 291, respectively. These values were obtained by identifying the maximum lengths of all the HLA α and β sequences and peptides in our training dataset. Though we padded our input sequences during training, in the evaluation stage the LSTM encoders can accept input of arbitrary length.
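As a rough illustration of this encoding scheme (not the repository implementation), the sketch below concatenates a one-hot vector with a substitution-matrix row for each residue and zero-pads to a fixed length. The 4-letter alphabet and matrix values are toy stand-ins for the 20 amino acids and BLOSUM62, so the channel count here is illustrative rather than the paper's 43.

```python
# Hypothetical sketch: per-residue encoding = one-hot + substitution-matrix row,
# zero-padded to max_len. Alphabet and matrix values are toy stand-ins.
ALPHABET = "ACDE"  # stand-in for the 20 amino acids
SUB_MATRIX = {     # stand-in rows for a substitution matrix such as BLOSUM62
    "A": [4, 0, -2, -1],
    "C": [0, 9, -3, -4],
    "D": [-2, -3, 6, 2],
    "E": [-1, -4, 2, 5],
}

def encode_sequence(seq, max_len):
    """Encode seq as a (max_len x channels) list-of-lists tensor."""
    channels = len(ALPHABET) * 2  # one-hot channels + matrix-row channels
    tensor = []
    for aa in seq:
        one_hot = [1.0 if aa == a else 0.0 for a in ALPHABET]
        tensor.append(one_hot + [float(v) for v in SUB_MATRIX[aa]])
    while len(tensor) < max_len:  # zero-pad to the batch length
        tensor.append([0.0] * channels)
    return tensor

def mask_vector(seq, max_len):
    """1.0 for real residues, 0.0 for padding positions."""
    return [1.0] * len(seq) + [0.0] * (max_len - len(seq))
```

With the real 20-letter alphabet and BLOSUM62 rows, this layout approaches the L × 43 tensor described above; the mask vector is the one consumed by the LSTM block described later.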
2.3 LSTM-CNN model with Attention mechanism
Figure 1 shows the neural network architecture of our DeepSeqPanII model. It is developed based on our previous work DeepSeqPan for MHC-I binding prediction, which includes a peptide encoder, an HLA encoder, a binding context extractor, and a binding affinity predictor. Our DeepSeqPanII network also has three parts, but with different configurations compared to DeepSeqPan: i) sequence encoders, ii) a binding context extractor and iii) an affinity predictor. Given a sample of HLA α chain, HLA β chain and peptide, the sequence encoders encode them as shape-unified output tensors. These three encoded tensors are expected to extract key features that contribute to the binding. The binding context extractor then takes the three encoded tensors and outputs a vector encoding the binding context between this allele and the peptide; it learns the coupling relationship between the peptide and the HLA sequence. Finally, based on this binding context vector, the predictor is trained to predict the binding affinity.
As we discussed above, unlike HLA class I, an HLA class II receptor consists of two amino acid sequences with highly variable lengths. Also, since HLA class II binding pockets are open, the peptide sequence lengths generally range from 8 to 26. To address these two issues, choosing a proper encoder architecture becomes critical. Compared to the encoder design, we found that the structures of the context extractor and the affinity predictor do not have a dramatic impact on DeepSeqPanII's prediction performance. At first, we tried to use convolutional neural networks as encoders, as we did in DeepSeqPan. However, we could not achieve satisfactory performance with this encoder architecture. A possible reason is that the lengths of class II protein and peptide sequences are highly variable. In DeepSeqPan for HLA class I, since all protein and peptide sequences have the same lengths, this was not a problem. To address this highly variable length issue, we propose to use LSTMs, which have achieved great success in natural language processing, for encoding the peptide and HLA sequences. Also, considering the existence of binding motifs in both the HLA α and β sequences and the peptides, we added an attention module to each of the three encoders [27] in our model to learn the importance of different positions to peptide binding.
Below, we introduce the details of each part in the DeepSeqPanII model as illustrated in Figure 1:
DeepSeqPanII network. The overall network is composed of three sequence encoders, a binding context extractor, and an affinity predictor. An input sample consists of three parts: the HLA α chain, the β chain and a peptide. The three sequences are encoded as discussed in the previous section. Each sequence is first fed into its LSTM block (LSTM Block in Figure 1(a)). The LSTM block outputs two tensors: the hidden state tensor and the sum of all hidden states. These two tensors then go into the attention block (Attn. Block in Figure 1(a)), which computes the weighted output and the associated attention vector. The attention vectors are not directly used to predict the final binding affinity values; however, they give us insight into the importance of each position for the final binding affinity prediction. With the three weighted outputs obtained from the attention blocks, we combine them along the channel axis as a new tensor (Encoded input in Figure 1(a)), which is fed into the convolutional network (Context Extractor in Figure 1(a)). This convolutional network then outputs a 1D vector (Context vector in Figure 1(a)), which encodes all the binding context information of this sample. It goes through a fully connected network (Affinity Predictor in Figure 1(a)) to calculate the final predicted binding affinity value.
LSTM block. The input of the LSTM block is the encoded sequence tensor with dimension L (length) × C (channels) and the sequence mask vector with dimension L (length). The encoded sequence tensor is fed into the LSTM layer with N hidden states. The LSTM layer outputs a tensor with dimension L × N. With this raw output tensor and the sequence mask vector, we manually mask the raw output by setting the values at padding positions to 0. The reason for masking the output is that, even though an LSTM can handle variable-length input, we padded all input sequences to the same length L for easier training of the deep neural network; this allows batch training and speeds up training, which is a common setup for LSTMs. Output masking lets the network learn faster, instead of having to figure out the existence of the padding on its own; in our experience, masking also improves network performance. After output masking, a masked tensor with dimension L × N is obtained, which is one of the two output tensors of this block. An additional summarization operation is further applied on this masked tensor to get another output tensor with dimension N. This tensor is used as one of the inputs of the attention block.
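The masking and summarization steps can be sketched in plain Python (the repository uses PyTorch tensors; this toy version only mirrors the logic):

```python
def mask_and_summarize(raw_output, mask):
    """Zero out LSTM outputs at padding positions, then sum over the length axis.

    raw_output: L x N list-of-lists (one hidden-state vector per position)
    mask:       length-L list, 1.0 for real positions, 0.0 for padding
    Returns (masked L x N tensor H_all, length-N summary vector H_sum).
    """
    # Masking: multiply each position's hidden state by its mask value.
    h_all = [[v * m for v in row] for row, m in zip(raw_output, mask)]
    # Summarization: sum the masked hidden states over all positions.
    n = len(raw_output[0])
    h_sum = [sum(row[j] for row in h_all) for j in range(n)]
    return h_all, h_sum
```

Because padded positions are zeroed before the sum, the summary vector depends only on the real residues, regardless of how much padding was added.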
Attention block. The attention block aims to extract weighted information based on all the hidden states output by the LSTM layer. With the summarization tensor of all hidden states from the LSTM as the input signal, it is first normalized and then fed into a fully connected (FC) layer with L hidden units. After the FC layer, a vector of size L is obtained, which is masked again and fed into the SoftMax layer to calculate an attention vector. After calculating the attention vector, a batch matrix-matrix product (BMM) operation is applied to the attention vector and the tensor of all hidden states. The result is a vector with dimension N, which contains the weighted information of all previous hidden states.
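A minimal sketch of this attention computation, with plain Python standing in for the PyTorch FC, SoftMax and BMM operations. The L2 normalization, the weight values, and the use of a large negative score for masked positions are our assumptions for illustration:

```python
import math

def attention_block(h_all, h_sum, fc_weights, mask):
    """h_all: L x N hidden states; h_sum: length-N summary vector;
    fc_weights: N x L matrix standing in for the FC layer;
    mask: length-L vector (1.0 real position, 0.0 padding)."""
    # Normalize the summary vector (simple L2 normalization as a stand-in).
    norm = math.sqrt(sum(v * v for v in h_sum)) or 1.0
    x = [v / norm for v in h_sum]
    # FC layer: project the normalized summary to L position scores.
    scores = [sum(x[i] * fc_weights[i][j] for i in range(len(x)))
              for j in range(len(mask))]
    # Mask padding positions before SoftMax (large negative -> ~0 weight).
    scores = [s if m > 0 else -1e9 for s, m in zip(scores, mask)]
    peak = max(scores)
    exp = [math.exp(s - peak) for s in scores]
    attn = [e / sum(exp) for e in exp]  # attention vector A, sums to 1
    # Weighted combination of hidden states (the BMM step): length-N output.
    n = len(h_all[0])
    weighted = [sum(attn[l] * h_all[l][j] for l in range(len(attn)))
                for j in range(n)]
    return attn, weighted
```

The returned attention vector is the quantity later inspected for binding core prediction, while the weighted vector is what feeds into the context extractor.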
The detailed parameters of our network are set as follows.
LSTM layer. The LSTM has 100 hidden units and 2 stacked layers.
Context Extractor. The context extractor consists of four 1-D convolutional layers and a max pooling layer.
Affinity Predictor. The affinity predictor is composed of three fully connected layers. The first two layers have 200 hidden units each, followed by a LeakyReLU activation layer and a dropout layer with a drop rate of 0.25. The last layer has 1 hidden unit, followed by a Sigmoid activation layer.
2.4 Network as math functions
To make our network easier to understand, we describe the whole network as a series of mathematical functions.
Encoding
We use Sα, Sβ and Sp to denote encoded sequence tensors of the HLA α, HLA β and the peptide sequences. Mα, Mβ and Mp are the corresponding mask tensors.
LSTM Block
The LSTM block function (Equation 2) takes S and M and outputs the sum of all hidden states, Hsum, and the tensor of all hidden states, Hall. For the peptide, we obtain Hpsum and Hpall. For the HLA α and β chains, we obtain Hαsum, Hαall, Hβsum and Hβall.
Attention Block
The attention block function (Equation 3) for HLA α and HLA β takes Hall and Hsum obtained from Equation 2. It outputs an attention vector A and a weighted output L. After the attention blocks, we obtain Aα, Lα, Aβ and Lβ.
For the peptide sequence, the attention block function (Equation 4) is slightly different, since it takes additional inputs: Hαsum and Hβsum from the two HLA chains. It outputs Ap and Lp.
Prediction network
After the encoding stage, we have six vectors as output: Lα, Lβ and Lp are the encoded input vectors for the prediction network, together with the three attention vectors Aα, Aβ and Ap. First, we concatenate the three encoded input vectors into a new tensor LI′. Then the prediction network function (Equation 5) takes this tensor as input and outputs the binding affinity value P.
2.5 Network training setup
Our batch size is set to 128 and the initial learning rate to 0.01. Learning rate decay with the reduce-on-plateau strategy is adopted: we reduce the learning rate if the evaluation loss does not decrease for 4 consecutive epochs, and we cool down for another 4 epochs before checking again. The minimum learning rate is 0.0001. We use vanilla SGD as the network optimizer with weight decay regularization (decay value 0.01). Also, since training LSTMs is sometimes difficult due to exploding gradients, we added gradient clipping to our optimization stage with a threshold of 0.8. Our implementation is based on PyTorch 0.4.1, and all source code and trained models are freely available at the GitHub repository https://github.com/pcpLiu/DeepSeqPanII.
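The scheduling behaviour described above can be sketched as a toy re-implementation of reduce-on-plateau (in practice PyTorch's built-in scheduler would be used). The decay factor of 0.1 is our assumption, since only the starting and minimum learning rates are stated:

```python
class ReduceOnPlateau:
    """Toy reduce-on-plateau scheduler: cut the learning rate when the
    evaluation loss has not improved for `patience` epochs, then wait
    `cooldown` epochs before checking again. factor=0.1 is an assumption."""

    def __init__(self, lr=0.01, factor=0.1, patience=4, cooldown=4, min_lr=0.0001):
        self.lr, self.factor = lr, factor
        self.patience, self.cooldown, self.min_lr = patience, cooldown, min_lr
        self.best = float("inf")
        self.bad_epochs = 0
        self.cooldown_left = 0

    def step(self, eval_loss):
        """Call once per epoch with the evaluation loss; returns the current LR."""
        if self.cooldown_left > 0:          # still cooling down: track best only
            self.cooldown_left -= 1
            self.best = min(self.best, eval_loss)
            return self.lr
        if eval_loss < self.best:           # loss improved: reset the counter
            self.best = eval_loss
            self.bad_epochs = 0
        else:                               # no improvement this epoch
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
                self.cooldown_left = self.cooldown
        return self.lr
```

With these defaults, four consecutive non-improving epochs cut the learning rate tenfold, bottoming out at the stated minimum of 0.0001.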
3 Results
3.1 Leave one allele out cross-validation
We performed a leave-one-allele-out (LOAO) cross-validation on the BD2016 dataset. We split BD2016 into 54 folds based on allele types. We then hold out one fold as the testing data and use the other folds as the training data, iterating the process until every allele fold has been tested. In this way, we mimic the situation where the trained model predicts the binding affinity of samples from unseen alleles. Indeed, one important advantage of pan-specific models over allele-specific models, and a reason they are attracting increasing attention, is that they can make predictions on HLA alleles that are not included in the training dataset. This is especially useful for researchers interested in alleles without any binding data.
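The LOAO protocol amounts to grouping samples by allele and holding out one group at a time; a minimal sketch (the record layout and allele names below are hypothetical):

```python
from collections import defaultdict

def loao_folds(samples):
    """Group samples by allele and yield (held_out_allele, test_set, train_set).

    samples: iterable of (allele, peptide, affinity) records.
    """
    by_allele = defaultdict(list)
    for record in samples:
        by_allele[record[0]].append(record)
    for allele in sorted(by_allele):
        test = by_allele[allele]                 # all samples of the held-out allele
        train = [r for a in sorted(by_allele)    # everything else is training data
                 if a != allele for r in by_allele[a]]
        yield allele, test, train
```

Each yielded fold keeps the test allele entirely unseen during training, which is exactly the condition the generalization claim rests on.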
The LOAO cross-validation results are listed in Table 1. We calculated AUC scores for each validated allele. From the table, we observe that out of 54 results, 44 of our algorithm's scores are over 0.7, 23 are over 0.8 and 3 are over 0.9. These results show that our model has good generalization capability.
In Figure 2, we compare the LOAO results of NetMHCIIpan, as included in the original dataset, with the results of DeepSeqPanII. We found that our method performed better than NetMHCIIpan on 26 alleles. If we ignore the records with a score difference of less than 0.01, our model still performs better than NetMHCIIpan on 19 alleles. Moreover, if we only consider particularly large score differences, there are 3 alleles on which NetMHCIIpan's scores are at least 0.1 higher than ours: DQA1*02:01-DQB1*02:02, DQA1*05:01-DQB1*04:02 and DRA*01:01-DRB1*04:02. Conversely, our model's scores are at least 0.1 higher than NetMHCIIpan's on 4 alleles: DQA1*01:02-DQB1*05:01, DQA1*04:01-DQB1*04:02, DQA1*01:01-DQB1*05:01 and DRA*01:01-DRB1*13:02. On three of those alleles, our model's scores are more than 0.15 higher than NetMHCIIpan's. In most cases (47 of 54), however, the two models delivered very similar performance, with a margin of less than 0.1 (as shown in Figure 2). This indicates that the two models perform similarly on most alleles, while each has several groups of alleles on which it achieves more accurate affinity prediction.
3.2 Prediction of peptide binding cores via attention mechanism
As we mentioned before, due to the open-ended binding pockets of HLA-II proteins, researchers are usually interested in finding the binding core of a peptide sequence. In NetMHCIIpan, a pre-processing procedure was used to generate binding core labels. Here, a major goal of our binding affinity prediction model is to extract insights into the binding mechanism by taking advantage of the attention mechanism, which can learn the importance or contribution of each peptide position to the final binding affinity prediction. Inspecting what the attention vectors have learned makes it possible to predict the binding core of a given peptide sequence. Our intuitive hypothesis is: if the learned neural network pays relatively more attention to some specific amino acid locations of the peptide, these locations may correspond to the binding core.
Here we used a simple approach to determine the binding core from the attention vector: find the length-9 subsequence with the maximum sum of attention weights (as shown in Figure 3). We conducted testing on the binding core dataset prepared in [14]. In that paper, the authors downloaded peptide/HLA-DR complexes from the PDB database; structure-based binding cores were then identified by inspecting the location of the bound peptide core within the MHC binding groove. We compared the predicted binding cores on 47 complexes, and the results are listed in Table 2. Out of the 47 complexes, 8 predicted binding cores match the experimental ones exactly. In addition, our predicted binding cores missed only one amino acid in 27 complexes. For the remaining 12 complexes, two or more amino acids are missed in our binding core predictions. Overall, for about 74% of the 47 complexes, our attention mechanism based binding core predictions either exactly match or miss only one amino acid compared to the experimentally determined binding cores. This suggests that our deep neural network with attention mechanisms can capture key information related to HLA-peptide binding by exploiting a large number of training samples.
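The binding core extraction described above reduces to a maximum-sum sliding window over the per-residue attention weights:

```python
def predict_binding_core(attention, core_len=9):
    """Return (start, end) of the length-`core_len` window whose attention
    weights sum to the maximum; `attention` holds one weight per residue."""
    if len(attention) < core_len:
        return 0, len(attention)  # peptide shorter than a full core
    best_start = max(range(len(attention) - core_len + 1),
                     key=lambda s: sum(attention[s:s + core_len]))
    return best_start, best_start + core_len
```

For example, an attention vector with a plateau of high weights at positions 3-11 yields the window (3, 12), i.e. residues 4-12 of the peptide in 1-based coordinates.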
3.3 Comparison with other HLA-II binding prediction models on weekly benchmark data
Andreatta et al. recently set up an automated benchmark for HLA class II alleles [9]. It evaluates participating methods in a similar way to the benchmark for HLA class I [28], which has been widely used to compare the performance of different models. We also evaluated our proposed DeepSeqPanII on the available dataset. Since the benchmark dataset includes data deposited in the IEDB since 2014, we trained DeepSeqPanII on BD2013 for a fair comparison. The AUC and SRCC scores used in this benchmark were calculated based on all methods' predictions. The binding prediction results of the other methods are included in the original dataset downloaded from the IEDB website. The methods NN-align [13], NetMHCIIpan-3.1 [14], Comblib matrices [15], SMM-align [16], Tepitope [17] and Consensus IEDB [18] are included in this benchmark. We grouped the benchmark data by target allele and measurement type, resulting in 44 testing groups.
The performance results are listed in Table 3. From the table, we can see that different methods outperform the others on different alleles, and that, in general, NetMHCIIpan-3.1 and DeepSeqPanII significantly outperform the remaining methods. NetMHCIIpan-3.1 outperforms the other methods on 25 test groups in terms of AUC scores and 26 test groups in terms of SRCC scores. DeepSeqPanII outperforms all others on 16 test groups in terms of AUC scores and 14 test groups in terms of SRCC scores. NN-align obtains the best AUC scores in 4 groups and the best SRCC scores in 3 groups. Surprisingly, Consensus IEDB, which combines top performing algorithms, does not perform well here. One possible reason is that the original Consensus IEDB only included the results of several older methods available in 2008, and because its code is not open-sourced, its predictions have not been updated since. Since our method and NetMHCIIpan performed overwhelmingly better than the other methods on different alleles, an ensemble method based on these two could be very promising for improving the overall performance of peptide-HLA class II binding prediction.
4 CONCLUSION
In this work, we proposed a novel deep neural network with an attention mechanism for peptide-HLA class II binding affinity prediction. Compared with existing pan-specific prediction algorithms, our pan-specific model successfully adopts the attention mechanism, which allows us to extract mechanistic insights into HLA-II peptide binding by interpreting the learned attention vectors. The leave-one-allele-out results showed that the AUC scores of 44 of the 54 (80%) alleles are over 0.7. This indicates that our model has good generalization capability, which is a major advantage of pan-specific models, since they can make predictions on unseen alleles. In the LOAO experiments, our model and NetMHCIIpan delivered similar performance on most of the tested alleles, while each method outperformed the other on some specific alleles; we found a similar trend in the weekly benchmark test. We argue that by combining DeepSeqPanII with existing models, researchers could achieve even more accurate prediction results. Examining the attention vectors learned by our model, we observed that DeepSeqPanII attends to the important parts of a peptide. By listing attention-based cores and structure-based cores side by side, we found, to our surprise, that our model can capture insightful structural information in an unsupervised way. The proposed sequence encoding approach and the attention mechanism can be applied to other sequence-related bioinformatics problems, since they only need sequence information as input. We look forward to expanding this approach to other sequence-related prediction problems.
Footnotes
He is now working at Facebook Inc.
E-mail: jianjunh{at}cse.sc.edu (JH)