Abstract
A common task in computational biology is predicting aspects of protein function and structure from the amino acid sequence. For 26 years, most state-of-the-art approaches to this task have combined machine learning with evolutionary information derived from related proteins, which are retrieved at increasing cost from ever-growing sequence databases. This search is often so time-consuming that it prevents the analysis of entire proteomes. On top of that, evolutionary information is less powerful for small families, e.g. for proteins from the Dark Proteome. Here, we introduced a novel way to represent protein sequences as continuous vectors (embeddings) by using the deep bi-directional language model ELMo, which effectively captured the biophysical properties of protein sequences from unlabeled big data (UniRef50). After training, this knowledge was transferred for single protein sequences along with other relevant sequence features. We referred to these new embeddings as SeqVec and demonstrated their effectiveness by training comparatively simple neural networks on existing data sets for two completely different prediction tasks. At the per-residue level, we predicted secondary structure (NetSurfP-2.0 data set: Q3=79%±1, Q8=68%±1) and disorder (MCC=0.59±0.03). At the per-protein level, we predicted subcellular localization in ten classes (DeepLoc data set: Q10=68%±1) and distinguished membrane-bound from water-soluble proteins (Q2=87%±1). All results built upon the new tool SeqVec and were derived from single protein sequences. Whereas the lightning-fast HHblits needed on average 0.5 to 5 minutes to generate the evolutionary information for a single protein, SeqVec created the vector representation in 0.027 seconds on average.
Availability SeqVec: https://github.com/mheinzinger/SeqVec - Predictions: https://embed.protein.properties
Abbreviations used
- 1D: one-dimensional (information representable in a string, such as secondary structure or solvent accessibility)
- 3D: three-dimensional
- 3D structure: three-dimensional coordinates of protein structure
- MCC: Matthews correlation coefficient
- RSA: relative solvent accessibility
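
The performance measures quoted in the abstract can be illustrated with a minimal sketch. It assumes the usual definitions: Qn (Q3, Q8, Q10, Q2) is the fraction of items (residues or proteins) assigned to the correct one of n classes, and MCC is the Matthews correlation coefficient for a binary task such as disordered versus ordered residues. The function names `q_accuracy` and `mcc` and the toy inputs are illustrative assumptions, not part of SeqVec or of the evaluation code behind the reported numbers.

```python
import math


def q_accuracy(observed, predicted):
    """Fraction of positions where the predicted class equals the observed class.

    With three secondary-structure states this corresponds to Q3, with eight
    states to Q8, and so on. Inputs are equal-length sequences of class labels.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if len(observed) == 0:
        raise ValueError("at least one label is required")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return correct / len(observed)


def mcc(observed, predicted):
    """Matthews correlation coefficient for binary labels (truthy = positive)."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    tp = fp = tn = fn = 0
    for o, p in zip(observed, predicted):
        if o and p:
            tp += 1
        elif p:
            fp += 1
        elif o:
            fn += 1
        else:
            tn += 1
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: return 0.0 when the coefficient is undefined.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Toy data (hypothetical): three-state secondary structure for nine residues.
q3 = q_accuracy("HHHEEECCC", "HHHEECCCC")  # 8 of 9 residues correct

# Toy data (hypothetical): 1 = disordered residue, 0 = ordered residue.
disorder_mcc = mcc([1, 1, 0, 0, 1, 0, 0, 0], [1, 0, 0, 0, 1, 0, 1, 0])  # 7/15
```

Unlike Qn, the MCC accounts for all four cells of the confusion matrix, which makes it the more informative measure when one class (for example ordered residues) dominates the data.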