Abstract
One common task in Computational Biology is the prediction of aspects of protein function and structure from their amino acid sequence. For 26 years, most state-of-the-art approaches toward this end have been marrying machine learning and evolutionary information, resulting in the need to retrieve related proteins at increasing cost from ever growing sequence databases. This search is so time-consuming to often render the analysis of entire proteomes infeasible. On top, evolutionary information is less powerful for small families, e.g. for proteins from the Dark Proteome. Here, we introduced a novel way to represent protein sequences as continuous vectors (embeddings) by using the deep bidirectional language model ELMo. The model effectively captured the biophysical properties of protein sequences from unlabeled big data (UniRef50). We showed how, after training, this knowledge was transferred to single protein sequences by predicting relevant sequence features. We referred to these new embeddings as SeqVec (Sequence-to-Vector) and demonstrated their effectiveness by training simple neural networks on existing data sets for two completely different prediction tasks. At the per-residue level, we improved secondary structure (for NetSurfP-2.0 data set: Q3=79%±1, Q8=68%±1) and disorder predictions (MCC=0.59±0.03) that use only single protein sequences by a large margin. At the per-protein level, we predicted subcellular localization in ten classes (for DeepLoc dataset: Q10=68%±1) and distinguished membrane-bound from water-soluble proteins (Q2= 87%±1). All results built upon embeddings gained from the new tool SeqVec. These are derived from the target protein’s sequence alone. Where the lightning-fast HHblits needed on average several minutes to generate the evolutionary information for a target protein, SeqVec created the vector representation on average in 0.027 seconds.
Availability SeqVec: https://github.com/Rostlab/SeqVec Prediction server: https://embed.protein.properties
Introduction
Over two decades ago, it has been first demonstrated how the combination of evolutionary information (from Multiple Sequence Alignments – MSA), and machine learning (standard feed-forward neural networks – ANN) completely changed the game of protein secondary structure prediction [1-3]. The concept was quickly taken up by many [4-8], and it was shown how much improvement was possible through using larger families, i.e. by including more diverse evolutionary information [9, 10]. Other objectives that used the same idea included the prediction of transmembrane regions [11-13], solvent accessibility [14], residue flexibility (B-values) [15, 16], inter-residue contacts [17] and protein disorder [15, 18-20]. Later, methods predicting aspects of protein function improved through this combination, including predictions of sub-cellular localization (aka cellular location [21, 22]), protein interaction sites [23-25], and the effects of sequence variation on function [26, 27].
Despite these successes, using evolutionary information is costly. Firstly, finding and aligning related proteins in large database is computationally expensive. In fact, with databases such as UniProt doubling in size every two years [28], even methods as fast as PSI-BLAST [29] have to be replaced by even more efficient solutions such as the “lighting-fast” HHblits [30]. Even its latest version HHblits3 [31] still needs several minutes to search UniRef50 (subset of UniProt; ∼20% of UniProt release 2019_02). Faster solutions such as MMSeqs2 [32] require several hundred Giga Bytes (GBs) of main memory. Secondly, evolutionary information is missing for some proteins, e.g. for proteins with substantial intrinsically disordered regions [15, 33, 34], the entire Dark Proteome [35], and many less-well studied proteins.
Here, we propose a novel encoding of protein sequences that replaces the explicit search for evolutionary related proteins by an implicit transfer of biophysical information derived from large, unlabeled sequence data (here UniRef50). Towards this end, we used a recent method that has been revolutionizing Natural Language Processing (NLP), namely the bidirectional language model ELMo (Embeddings from Language Models) [36]. In NLP, methods like ELMo are trained on unlabeled text-corpora such as Wikipedia to predict the most probable next word in a sentence, given all previous words in this sentence. By learning a probability distribution for sentences, these models develop autonomously a notion for syntax and semantics of language. The trained vector representations (embeddings) are contextualized, i.e. embeddings of a given word depend on its context. This has the advantage that two identical words can have different embeddings, depending on the words surrounding them. We hypothesized that the ELMo concept could be applied to learn aspects of what makes up the language of life distilled in protein sequences. Three main challenges arose. (1) Proteins range from about 30 to 33,000 residues, compared to the longest NLP model using 1,024 words. Longer proteins require more GPU memory and the underlying models (so called Long Short-Term Memory networks (LSTMs) [37]) have only a limited capability to remember long-range dependencies. (2) Proteins mostly use 20 standard amino acids, 100,000 times less than in the English language. Smaller vocabularies might be problematic if protein sequences encode a similar complexity as sentences. (3) We found UniRef50 to contain almost ten times more tokens (9.5 billion amino acids) than the largest existing NLP corpus (1 billion words). As a result, larger models might be required to absorb the information provided.
We trained the bi-directional language model ELMo on UniRef50. Then we assessed the predictive power of the embeddings by application to tasks on two levels: per-residue (word-level) and per-protein (sentence-level). For the per-residue prediction task, we predicted secondary structure in three (helix, strand, other) and eight states (all DSSP [38]), as well as long intrinsic disorder in two states. For the per-protein prediction task, we implemented the predictions of protein subcellular localization in ten classes and a binary classification into membrane-bound and water-soluble proteins. We used publicly available datasets from two recent methods that achieved break-through performance through Deep Learning, namely NetSurfP-2.0 (secondary structure [39]) and DeepLoc (localization [39]).
Materials & Methods
Data
UniRef50 training of SeqVec
We trained ELMo on UniRef50 [28], a sequence redundancy-reduced subset of the UniProt database clustered at 50% pairwise sequence identity (PIDE). It contained 25 different letters (20 standard and 2 rare amino acids (U and O) plus 3 special cases describing either ambiguous (B, Z) or unknown amino acids (X); Table SOM_1) from 33M proteins with 9,577,889,953 residues. Each protein was treated as a sentence and each amino acid was interpreted as a single word. We referred to the resulting embedding as to SeqVec (Sequence-to-Vector).
Per-residue level: secondary structure & intrinsic disorder (NetSurfP-2.0)
To simplify compatibility, we used the data set published with a recent method seemingly achieving the top performance of the day in secondary structure prediction, namely NetSurfP-2.0 [40]. Performance values for the same data set exist also for other recent methods such as Spider3 [30], RaptorX [31, 32] and JPred4 [33]. The set contains 10,837 sequence-unique (at 25% PIDE) proteins of experimentally know 3D structures from the PDB [34] with a resolution of 2.5 Å (0.25 nm) or better, collected by the PISCES server [35]. DSSP [38] assigned secondary structure and intrinsically disordered residues are flagged (residues without atomic coordinates, i.e. REMARK-465 in the PDB file). The original seven DSSP states (+ 1 for unknown) were mapped upon three states using the common convention: [G,H,I] ⍰ H (helix), [B,E] ⍰ E (strand), all others to O (other; often misleadingly referred to as coil or loop). Sequences were extracted from the NetSurfP-2.0 data set through “Structure Integration with Function, Taxonomy and Sequence” (SIFTS) mapping. Only proteins with identical length in SIFTS and NetSurfP-2.0 were used. This filtering step removed 56 sequences from the training set and three from the test sets (see below: two from CB513, one from CASP12 and none from TS115). We randomly selected 536 (∼5%) proteins for early stopping (cross-training), leaving 10,256 proteins for training. All published values referred to the following three test sets (also referred to as validation set): TS115 [37]: 115 proteins from high-quality structures (<3Å) released after 2015 (and at most 30% PIDE to any protein of known structure in the PDB at the time); CB513 [41]: 513 non-redundant sequences compiled 20 years ago (511 after SIFTS mapping); CASP12 [39]: 21 proteins taken from the CASP12 free-modelling targets (20 after SIFTS mapping; all 21 fulfilled a stricter criterion toward non-redundancy than the two other sets; non-redundant with respect to all 3D structures known until May 2018 and all their relatives). There are some issues with each of these data sets. Nevertheless, toward our objective of establishing a proof-of-principle, these sets sufficed. All test sets had fewer than 25% PIDE to any protein used for training and cross-training (ascertained by the NetSurfP-2.0 authors). To compare methods using evolutionary information and those using our new word embeddings, we took the HHblits profiles published along with the NetSurfP-2.0 data set.
Per-protein level: localization & membrane proteins (DeepLoc)
Localization prediction was trained and evaluated using the DeepLoc data set [40] for which performance was measured for several methods, namely: LocTree2 [42], MultiLoc2 [43], SherLoc2 [44], CELLO [45], iLoc-Euk [46], WoLF PSORT [47] and YLoc [48]. The data set contained proteins from UniProtKB/Swiss-Prot [49] (release: 2016_04) with experimental annotation (code: ECO:0000269). The DeepLoc authors mapped these to ten classes, removing all proteins with multiple annotations. All these proteins were also classified into water-soluble or membrane-bound (or as unknown if the annotation was ambiguous). The resulting 13,858 proteins were clustered through PSI-CD-HIT [50, 51] (version 4.0; at 30% PIDE or Eval<10-6). Adding the requirement that the alignment had to cover 80% of the shorter protein, yielded 8,464 clusters. This set was split into training and testing by using the same proteins for testing as the authors of DeepLoc. The training set was randomly sub-divided into 90% for training and 10% for determining early stopping (cross-training set).
ELMo terminology
One-hot encoding (also known as sparse encoding) assigns each word (referred to as token in NLP) in the vocabulary an integer N used as the Nth component of a vector with the dimension of the vocabulary size (number of different words). Each component is binary, i.e. either 0 if the word is not present in a sentence/text or 1 if it is. This encoding drove the first application of machine learning that clearly improved over all other methods in protein prediction [1-3]. TF-IDF represents tokens with the product of “frequency of token in data set” times “inverse frequency of token in document”. Thereby, rare tokens become more relevant than common words such as “the” (so called stop words). This concept resembles that of using k-mers for database searches [29], clustering [52], motifs [53, 54], and prediction methods [42, 47, 55-59]. Context-insensitive word embedding replaced expert features, such as TF-IDF, by algorithms that extracted such knowledge from unlabeled corpus such as Wikipedia, by either predicting the neighboring words, given the center word (skip-gram) or vice versa (CBOW). This became known in Word2Vec [60] and showcased for computational biology through ProtVec [60, 61]. More specialized implementations are mut2vec [18] learning mutations in cancer, and phoscontext2vec [19] identifying phosphorylation sites. The performance of context-insensitive approaches was pushed to its limits by adding sub-word information (FastText [62]) or global statistics on word co-occurance (GloVe [63]). Context-sensitive word embedding started a new wave of word embedding techniques for NLP in 2018: the particular embedding renders the meaning of the phrase “paper tiger” dependent upon the context. Popular examples like ELMo [36] and Bert [22] have achieved state-of-the-art results in several NLP tasks. Both require substantial GPU computing power and time to be trained from scratch. However, in this work we focused on ELMo as it allows processing of sequences of variable length. This ELMo model consists of a single CNN over the characters in a word and two layers of bidirectional LSTMs that introduce the context information of surrounding words. This model is trained to predict the next word given all previous words in a sentence. To the best of our knowledge, the context-sensitive ELMo has not been adapted to protein sequences, yet.
ELMo adaptation
In order to allow more flexible models and easily integrate into existing solutions, we have used and generated ELMo as word embedding layers. No no fine-tuning was performed on task-specific sequence sets. Thus, researchers could just replace their current embedding layer with our model to boost their task-specific performance. Furthermore, it will simplify the development of custom models that fit other use-cases. The embedding model takes a protein sequence of arbitrary length and returns 3076 features for each residue in the sequence. These 3076 features were derived by concatenating the outputs of the three internal layers of ELMo (1 CNN-layer, 2 LSTM-layers), each describing a token with a vector of length 1024. For simplicity, we summed the components of the three 1024-dimensional vectors to form a single 1024-dimensional feature vector describing each residue in a protein. In order to demonstrate the general applicability of SeqVec, we neither fine-tuned the model on a specific prediction task, nor optimized the combination of the three internal layers. Instead, we used the standard ELMo configuration [36] with the following changes: (i) reduction to 28 tokens (20 standard and 2 rare (U,O) amino acids + 3 special tokens describing ambiguous (B,Z) or unknown (X) amino acids + 3 special tokens for ELMo indicating the beginning and the end of a sequence), (ii) increase number of unroll steps to 100, (iii) decrease number of negative samples to 20, (iv) increase token number to 9,577,889,953. Our ELMo-like implementation, SeqVec, was trained for three weeks on 5 Nvidia Titan GPUs with 12 GB memory each. The model was trained until its perplexity (uncertainty when predicting the next token) converged at around 10.5 (Fig. SOM_1).
SeqVec 2 prediction
On the per-residue level, the predictive power of the new SeqVec embeddings was demonstrated by training a small two-layer Convolutional Neural Network (CNN) in PyTorch using a specific implementation [50] of the ADAM optimizer [51], cross-entropy loss, a learning rate of 0.001 and a batch size of 128 proteins. The first layer (in analogy to the sequence-to-structure network of earlier solutions [1, 2]) consisted of 32-filters each with a sliding window-size of w=7. The second layer (structure-to-structure [1, 2]) created the final predictions by applying again a CNN (w=7) over the output of the first layer. These two layers were connected through a rectified linear unit (ReLU) and a dropout layer [52] with a dropout-rate of 25% (Fig. 1, left panel). This simple architecture was trained independently on six different types of input, each with a different number of free parameters. (i) DeepProf (14,000=14k free parameters): Each residue was described by a vector of size 50 which included a one-hot encoding (20 features), the profiles of evolutionary information (20 features) from HHblits as published previously [40], the state transition probabilities of the Hidden-Markov-Model (7 features) and 3 features describing the local alignment diversity. (ii) DeepSeqVec (232k free parameters): Each protein sequence is represented by the output from SeqVec. The resulting embedding described each residue as a 1024-dimensional vector. (iii) DeepProf+SeqVec (244k free parameters): This model simply concatenated the input vectors used in (i) and (ii). (iv) DeepProtVec (25k free parameters): Each sequence was split into overlapping 3-mers each represented by a 100-dimensional ProtVec [64]. (v) DeepOneHot (7k free parameters): The 20 amino acids were encoded as one-hot vectors as described above. Rare amino acids were mapped to vectors with all components set to 0. Consequently, each protein residue was encoded as a 20-dimensional one-hot vector. (vi) DeepBLOSUM65 (8k free parameters): Each protein residue was encoded by its BLOSUM65 substitution matrix [65]. In addition to the 20 standard amino acids, BLOSUM65 also contains substitution scores for the special cases B, Z (ambiguous) and X (unknown), resulting in a feature vector of length 23 for each residue.
On the per-protein level, a simple feed-forward neural network was used to demonstrate the power of the new embeddings. In order to ensure equal-sized input vectors for all proteins, we averaged over the embeddings of all residues in a given protein resulting in a 1024-dimensional vector representing any protein in the data set. ProtVec representations were derived the same way, resulting in a 100-dimensional vector. These vectors (either 100-or 1024 dimensional) were first compressed to 32 features, then dropout with a dropout rate of 25%, batch normalization [53] and a rectified linear Unit (ReLU) were applied before the final prediction (Fig. 1, right panel). In the following, we refer to the models trained on the two different input types as (i) DeepSeqVec-Loc (33k free parameters): average over SeqVec encoding of a protein as described above and (ii) DeepProtVec-Loc (320 free parameters): average over ProtVec encoding of a protein. We used the following hyper-parameters: learning rate: 0.001, Adam optimizer with cross-entropy loss, batch size: 64. The losses of the individual tasks were summed before backpropagation. Due to the relatively small number of free parameters in our models, the training of all networks completed on a single Nvidia GeForce GTX1080 within a few minutes (11 seconds for DeepSeqVec-Loc, 15 minutes for DeepSeqVec).
Evaluation measures
To simplify comparisons, we ported the evaluation measures from the publications we derived our data sets from, i.e. those used to develop NetSurfP-2.0 [40] and DeepLoc [39]. All numbers reported constituted averages over all proteins in the final test sets. This work aimed at a proof-of-principle that the SeqVec embedding contain predictive information. In the absence of any claim for state-of-the-art performance, we did not calculate any significance values for the reported values.
Per-residue performance
Toward this end, we used the standard three-state per-residue accuracy (Q3=percentage correctly predicted in either helix, strand, other [1]) along with its eight-state analog (Q8). Predictions of intrinsic disorder were evaluated through the Matthew’s correlation coefficient (MCC [66]) and the False-Positive Rate (FPR) representative for tasks with high class imbalance. For completeness, we also provided the entire confusion matrices for both secondary structure prediction problems (Fig. SOM_2). Standard errors were calculated over the distribution of each performance measure for all proteins.
Per-protein performance
The predictions whether a protein was membrane-bound or water-soluble were evaluated by calculating the two-state per set accuracy (Q2: percentage of proteins correctly predicted), and the MCC. A generalized MCC using the Gorodkin measure [54] for K (=10) categories as well as accuracy (Q10), was used to evaluate localization predictions. Standard errors were calculated using 1000 bootstrap samples, each chosen randomly by selecting a sub-set of the predicted test set that had the same size (draw with replacement).
Results
Per-residue performance high but not top
NetSurfP-2.0 uses HHblits profiles along with advanced combinations of Deep Learning architectures. This seemingly has become one of the best method for protein secondary structure prediction [40], reaching a three-state per-residue accuracy Q3 of 82-85% (lower value: small very non-redundant CASP12 set, upper value: larger slightly more redundant TS115 and CB513 sets; Table 1, Fig. 2). All six applications compared here (DeepProf, DeepSeqVec, DeepProf+SeqVec, DeepProtVec, DeepOneHot, DeepBLOSUM65) remained below the previously discussed figures (Fig. 2A, Table 1). When comparing methods that use only single protein sequences as input (DeepSeqVec, DeepProtVec, DeepOneHot, DeepBLOSUM65), the proposed SeqVec outperformed others by 5-10 (Q3), 5-13 (Q8) and 0.07-0.12 (MCC) percentage points. The evolutionary information (DeepProf with HHblits profiles) remained about 4-6 percentage points below NetSurfP-2.0 (Q3=76-81%, Fig. 2, Table 1). Depending on the test set, using SeqVec embeddings instead of evolutionary information (DeepSeqVec: Fig. 2A, Table 1) remained 2-3 percentage points below that mark (Q3=73-79%, Fig. 2A, Table 1). Using both evolutionary information and SeqVec embeddings (DeepProf+SeqVec) improved over both, but still did not reach the top (Q3=77-82%). In fact, the embedding alone (DeepSeqVec) did not surpass any of the existing methods using evolutionary information (Fig. 2A).
For the prediction of intrinsic disorder, we observed the same: NetSurfP-2.0 performed best, while our implementation of evolutionary information (DeepProf) performed worse (Fig. 2B, Table 1). However, for this task the embeddings alone (DeepSeqVec) performed relatively better, exceeding the evolutionary information (DeepSeqVec MCC=0.575-0.591 vs. DeepProf MCC=0.506-0.516, Table 1). The combination of evolutionary information and embeddings (DeepProf+SeqVec) improved over using evolutionary information alone but did not improve over SeqVec. Compared to other methods, the embeddings alone reached similar values (Fig. 2B).
Per-protein performance closer to top
For the task of predicting subcellular localization in ten classes, DeepLoc [39] appears to be today’s best tool with Q10=78% (Fig. 2C, Table 2). Our simple embeddings model DeepSeqVec-Loc reached second best performance together with iLoc-Euk [46] with Q10=68% (Fig. 2C, Table 2), improving over several other methods that use evolutionary information by up to Q10=13%.
Performance for the classification into membrane-bound and water-soluble proteins followed a similar trend (Fig. 2D, Table 2): while DeepLoc still performed best (Q2=92.3, MCC=0.844), DeepSeqVec-Loc reached just a few percentage points lower (Q2=86.8±1.0, MCC=0.725±0.021; full confusion matrix Figure SOM_3). In contrast to this, ProtVec, another method using only single sequences, performed substantially worse (Q2=77.6±1.3, MCC=0.531±0.026).
Visualizing results
Lack of insight often triggers the misunderstanding that machine learning methods are black box solutions barring understanding. In order to interpret the SeqVec embeddings, we have projected the protein-embeddings of the per-protein prediction data upon two dimensions using t-SNE [55]. We performed this analysis once for the raw embeddings (SeqVec, Fig. 3A) and once for the hidden layer representation of the per-protein network (DeepSeqVec-Loc) after training (Fig. 3B). All t-SNE representations were created using 2,000 iterations and the cosine distance as metric. The two analyses differed only in that the perplexity was set to 25 for one (SeqVec) and 50 for the other (hidden layer representation of trained model). The t-SNE representations were colored either according to their localization within the cell (upper row of Fig. 3) or according to whether they are membrane-bound or water-soluble (lower row).
Despite never provided during training, the raw embeddings appeared to capture some signal for classifying proteins by localization (Fig. 3A, upper row). The most consistent signal was visible for extra-cellular proteins. Proteins attached to the cell membrane or located in the endoplasmic reticulum also formed well-defined clusters. In contrast, the raw embeddings neither captured an ambiguous signal for nuclear nor for mitochondrial proteins. Through training, the network improved the signal to reliably classify mitochondrial and plastid proteins. However, proteins in the nucleus and cell membrane continued to be poorly distinguished via t-SNE.
Coloring the t-SNE representations for membrane-bound or water-soluble proteins (Fig. 3, lower row), revealed that the raw embeddings already provided well-defined clusters although never trained on membrane prediction (Fig. 3A). After training, the classification was even better (Fig. 3B).
CPU/GPU Time used
Due to the sequential nature of LSTMs, the time required to embed a protein grew linearly with protein length. Depending on the available main memory or GPU memory, this process could be massively parallelized. To optimally use available memory, batches are typically based on tokens rather than on sentences. Here, we sorted proteins according to their length and created batches of ≤15K tokens that could still be handled by a single Nvidia GeForce GTX1080 with 8GB VRAM. On average the processing of a single protein took 0.027s when applying this batch-strategy to the NetSurfP-2.0 data set (average protein length: 256 residues, i.e. shorter than proteins for which 3D structure is not known). The batch with the shortest proteins (on average 38 residues, corresponding to 15% of the average protein length in the whole data set) required about one tenth (0.003s per protein, i.e. 11% of that for whole set). The batch containing the longest protein sequences in this data set (1578 residues on average, corresponding to 610% of average protein length in the whole data set), took about six times more (1.5s per protein, i.e. 556% of that for whole set). When creating SeqVec for the DeepLoc set (average length: 558 residues; as this set does not require a 3D structure, it provides a more realistic view on the distribution of protein lengths), the average processing time for a single protein was 0.08 with a minimum of 0.006 for the batch containing the shortest sequences (67 residues on average) and a maximum of 14.5s (9860 residues on average). Roughly, processing time was linear with respect to protein length. On a single Intel i7-6700 CPU with 64GB RAM, processing time increased by roughly 50% to 0.41s per protein, with a minimum and a maximum computation time of 0.06 and 15.3s, respectively. Compared to an average processing time of one hour for 1000 proteins when using evolutionary information directly [40], this implied an average speed up of 120-fold on a single GeForce GTX1080 and 9-fold on a single i7-6700 when predicting structural features; the inference time of DeepSeqVec for a single protein is on average 0.0028s.
Discussion
ELMo alone did not suffice for top performance
On the one hand, none of our implementations of ELMo reached anywhere near today’s best (NetSurfP-2.0 for secondary structure and protein disorder and DeepLoc for localization and membrane protein classification; Fig. 2, Table 1, Table 2). Clearly, “just” using ELMo did not suffice to crack the challenges. On the other hand, some of our solutions appeared surprisingly competitive given the simplicity of the architectures. In particular for the per-protein predictions, for which SeqVec clearly outperformed the previously popular ProtVec [64] approach and even commonly used expert solutions (Fig. 1, Table 2: no method tested other than the top-of-the-line DeepLoc reached higher numerical values). For that comparison, we used the same data sets but could not rigorously compare standard errors which were unavailable for other methods. Estimating standard errors for our methods suggested that the differences were statistically significant as the difference was more than 7 sigmas for all methods (except DeepLoc (Q10=78) and iLoc-Euk(Q10=68)). The results for localization prediction implied that frequently used methods using evolutionary information (all marked with stars in Table 2) did not clearly outperform our simple ELMo-based tool (DeepSeqVec-Loc in Table 2). This was very different for the per-residue prediction tasks: here almost all top methods using evolutionary information numerically outperformed the simple model built on the ELMo embeddings (DeepSeqVec in Fig. 2 and Table 1). However, all models introduced in this work were deliberately designed to be relatively simple to demonstrate the predictive power of SeqVec. More sophisticated architectures using SeqVec are likely to outperform the approaches introduced here.
When we combined ELMo with evolutionary information for the per-residue predictions, the resulting tool still did not quite achieve top performance (Q3(NetSurfP-2.0)=85.3% vs. Q3(DeepProf + SeqVec)=82.4%, Table 1). This might suggest some limit for the usefulness of ELMo/SeqVec. However, it might also point to the more advanced solutions realized by NetSurfP-2.0 which applies two LSTMs of similar complexity as our entire system (including ELMo) on top of their last step leading to 35M (35 million) free parameters compared to about 244K for DeepProf + SeqVec. Twenty times more free parameters might explain some fraction of the success, however due to limited GPU resources, we could not test how much.
Why did the ELMo-based approach improve more (relative to competition) for per-protein than for per-residue predictions? Per-protein data sets were over two orders of magnitude smaller than those for per-residue (simply because every protein constitutes one sample in the first and protein length samples for the second). It is likely that ELMo helped more for the smaller data sets because the unlabeled data is pre-processed so meaningful that less information needs to be learned by the ANN during per-protein prediction. This view was strongly supported by the t-SNE [55] results (Fig. 3): ELMo apparently had learned enough to realize a very rough clustering into localization and membrane/not.
We picked four particular tasks as proof-of-principle for our ELMo/SeqVec approach. These tasks were picked because recently developed methods implemented deep learning and the associated data sets for training and testing were made publicly available. We cannot imagine why our findings should not hold for other tasks of protein prediction and welcome the community to test our SeqVec for their particular tasks. We assume that our findings will be more relevant for small data sets than for large ones. For instance, we assume predictions of inter-residue contacts to improve less, and those for protein binding sites possibly more.
Good and fast predictions without using evolutionary information
SeqVec predicted secondary structure and protein disorder over 100-times faster on a single 8GB GPU than the top-of-the-line prediction method NetSurfP-2.0 which requires machines with hundreds of GB main memory to run MMSeqs2 [32]. For some applications, the speedup might outweigh the reduction in performance.
Modeling the language of life?
Our ELMo implementation learned to model a probability distribution over a protein sequence. The sum over this probability distribution constituted a very informative input vector for any machine learning task. It also picked up context-dependent protein motifs without explicitly explaining what these motifs are relevant for. In contrast, tools such as ProtVec will always create the same vectors for a k-mer, regardless of the residues surrounding this k-mer in a particular protein sequence.
Our hypothesis had been that the ELMo embeddings learned from large databases of protein sequences (without annotations) could extract a probabilistic model of the language of life in the sense that the resulting system will extract aspects relevant both for per-residue and per-protein prediction tasks. All the results presented here have added independent evidence in full support of this hypothesis. For instance, the three state per-residue accuracy for secondary structure prediction improved by over eight percentage points through ELMo (Table 1: e.g. for CB513: Q3(DeepSeqVec)=76.9% vs. Q3(DeepBLOSUM65)=67.4%, i.e. 9.4 percentage points corresponding to 14% of 67.4), the per-residue MCC for protein disorder prediction also rose (Table 1: e.g. TS115: MCC(DeepSeqVec)=0.591 vs. MCC(DeepBLOSUM65)=0.488, corresponding to 18% of 0.488). On the per-protein level, the improvement over the previously popular tool extracting “meaning” from proteins, ProtVec, was even more substantial (Table 1: localization: Q10(DeepSeqVec-Loc)=68% vs. Q10(DeepProtVec-Loc)=42%, i.e. 62% rise over 42; membrane: Q2(DeepSeqVec-Loc)=86.8% vs. Q2 (DeepProtVec-Loc)=77.6%, i.e. 19% rise over 77.6). We could demonstrate this reality even more directly using the t-SNE [55] results (Fig. 3): some localizations and the classification of membrane/non-membrane had been implicitly learned by SeqVec without any training. Clearly, our ELMo-driven implementation succeeded to model some aspects of the language of life as proxied for proteins.
Conclusion
We have shown that it is possible to capture and transfer knowledge, e.g. biochemical or biophysical properties, from a large unlabeled data set of protein sequences to smaller, labelled data sets. In this first proof-of-principle, our comparably simple models have already reached promising performance for a variety of per-residue and per-protein prediction tasks obtainable from only single protein sequences as input, that is: without any direct evolutionary information, i.e. without alignments. This reduces the dependence on the time-consuming and computationally intensive calculation of protein profiles, allowing the prediction of per-residue and per-protein features of a whole proteome within less than an hour. For instance, on a single GeForce GTX 1080, the creation of embeddings and predictions of secondary structure and subcellular localization for the whole human proteome took about 32 minutes. Building more sophisticated architectures on top of the proposed SeqVec will increase sequence-based performance further.
Our new SeqVec embeddings may constitute an ideal starting point for many different applications in particular when labelled data are limited. The embeddings combined with evolutionary information might even improve over the best available methods, i.e. enable high-quality predictions. Alternatively, they might ease high-throughput predictions of whole proteomes when used as the only input feature. Alignment-free predictions bring speed and improvements for proteins for which alignments are not readily available or limited, such as for intrinsically disordered proteins, for the Dark Proteome, or for particular unique inventions of evolution. The trick was to tap into the potential of Deep Learning through transfer learning from large repositories of unlabeled data by modeling the language of life.
Acknowledgements
The authors thank primarily Tim Karl for invaluable help with hardware and software and to Inga Weise for support with many other aspects of this work. We gratefully acknowledge the support of NVIDIA Corporation with the donation of two Titan GPU used for this research. We also want to thank the LRZ (Leibniz Rechenzentrum) for providing us access to DGX-V1. This work was supported by a grant from the Alexander von Humboldt foundation through the German Ministry for Research and Education (BMBF: Bundesministerium fuer Bildung und Forschung) as well as by a grant from Deutsche Forschungsgemeinschaft (DFG–GZ: RO1320/4–1). Last, not least, thanks to all those who deposit their experimental data in public databases, and to those who maintain these databases.
Footnotes
Figure 2 & 3 revised; SOM added; minor text changes
Abbreviations used
- 1D
- one-dimensional – information representable in a string such as secondary structure or solvent accessibility
- 3D
- three-dimensional
- 3D structure
- three-dimensional coordinates of protein structure
- MCC
- Matthews-Correlation-Coefficient
- RSA
- relative solvent accessibility