ABSTRACT
Protein function prediction is one of the major tasks of bioinformatics and can help with a wide range of biological problems, such as understanding disease mechanisms or finding drug targets. Many methods are available for predicting protein functions from sequence-based features, protein–protein interaction networks, protein structure or literature. However, other than sequence, most of these features are difficult to obtain or are not available for many proteins, which limits the scope of methods that rely on them. Furthermore, sequence-based function prediction methods often perform worse than methods that incorporate multiple features, and predicting protein functions can be time-consuming.
We developed a novel method, DeepGOPlus, for predicting protein functions from sequence alone. It combines a deep convolutional neural network (CNN) model with sequence similarity-based predictions. Our CNN model scans the sequence for motifs that are predictive of protein functions, and its output is combined with the functions of similar proteins. We evaluate the performance of DeepGOPlus on the CAFA3 dataset and significantly improve the prediction of biological processes and cellular components, with Fmax of 0.47 and 0.70, respectively, using only the amino acid sequence of proteins as input. DeepGOPlus can annotate around 40 protein sequences per second, thereby making fast and accurate function predictions available for a wide range of proteins.
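
The Fmax values reported above are protein-centric maximum F-measures, as used in the CAFA challenges. The following is a minimal Python sketch of how such a score can be computed. It is not the evaluation code used for DeepGOPlus. The function name `fmax`, the dictionary-based input format and the threshold grid are assumptions made for illustration, and the sketch omits the propagation of annotations through the Gene Ontology hierarchy that a full evaluation would perform.

```python
def fmax(predictions, annotations, thresholds=None):
    """Protein-centric maximum F-measure (illustrative sketch).

    predictions: dict mapping protein ID -> dict of {term: score in [0, 1]}
    annotations: dict mapping protein ID -> non-empty set of true terms
    (Both input formats are assumptions of this sketch.)
    """
    if thresholds is None:
        # Hypothetical threshold grid: 0.01, 0.02, ..., 1.00
        thresholds = [t / 100 for t in range(1, 101)]

    best = 0.0
    for t in thresholds:
        precisions = []   # one entry per protein with >= 1 prediction at t
        recall_sum = 0.0  # summed over all annotated proteins
        for protein, true_terms in annotations.items():
            scores = predictions.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= t}
            true_positives = len(predicted & true_terms)
            if predicted:
                precisions.append(true_positives / len(predicted))
            recall_sum += true_positives / len(true_terms)

        if not precisions:
            continue  # no protein has a prediction at this threshold

        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(annotations)
        if precision + recall > 0:
            f = 2 * precision * recall / (precision + recall)
            best = max(best, f)
    return best


# Toy usage with made-up term identifiers and scores.
toy_predictions = {
    "P1": {"GO:a": 0.9, "GO:b": 0.4},
    "P2": {"GO:a": 0.8},
}
toy_annotations = {"P1": {"GO:a"}, "P2": {"GO:b"}}
toy_score = fmax(toy_predictions, toy_annotations)
```

Precision is averaged only over proteins that receive at least one prediction at a given threshold, whereas recall is averaged over all annotated proteins, so a method that stays silent on a protein is penalized in recall rather than precision.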