rawMSA: proper Deep Learning makes protein sequence profiles and feature extraction obsolete

Claudio Mirabello; Björn Wallner

doi:10.1101/394437

Abstract

In the last decades huge efforts have been made in the bioinformatics community to develop Machine Learning-based methods for the prediction of structural features of proteins with the hope of answering fundamental questions about the way proteins function and their involvement in several illnesses. The recent advent of Deep Learning has renewed the interest on neural networks, with dozens of methods being developed in the hope of taking advantage of these new architectures. On the other hand, most methods are still based on heavy preprocessing of the input data, as well as the extraction and integration of multiple hand-picked, manually designed features. Since Multiple Sequence Alignments (MSA) are almost always the main source of information in de novo prediction methods, it should be possible to develop Deep Networks to automatically refine the data and extract useful features from it. In this work we propose a new paradigm for the prediction of protein structural features called rawMSA. The core idea behind rawMSA is borrowed from the field of natural language processing to map amino acid sequences into an adaptively learned continuous space. This allows to input the whole MSA to a Deep Network, thus rendering sequence profiles and other pre-calculated features obsolete. We developed rawMSA in three different flavours to predict secondary structure, relative solvent accessibility and inter-residue contact maps. We have rigorously trained and benchmarked rawMSA on a large set of proteins and determined that it outperforms classical methods based on PSSM when predicting secondary structure and solvent accessibility, while performing on par with the top ranked CASP12 methods in the inter-residue contact map prediction category. We believe that rawMSA represents a promising, more powerful approach to protein structure prediction that could replace older methods based on protein profiles in the coming years.

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.