Abstract
Accurate modelling of a single orphan protein sequence in the absence of homology information has remained a challenge for several decades. Although less accurate than their homology-based counterparts, single-sequence bioinformatic methods do not require evolutionary information and so have a wide range of applications. By taking a bioinformatics approach to semi-supervised machine learning, we develop Profile Augmentation of Single Sequences (PASS), a simple but powerful framework for developing accurate single-sequence methods. To demonstrate the effectiveness of PASS, we apply it to the mature field of secondary structure prediction. In doing so we develop S4PRED, the successor to the open-source PSIPRED-Single method; S4PRED achieves an unprecedented Q3 score of 75.3% on the standard CB513 test set. PASS provides a blueprint for the development of a new generation of predictive methods, advancing our ability to model individual protein sequences.
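For readers unfamiliar with the Q3 metric quoted above, the following is a minimal sketch of how it is conventionally computed: the percentage of residues whose predicted three-state label (helix H, strand E, coil C) matches the reference assignment. The function name `q3_score` and the toy label strings are illustrative assumptions and are not taken from S4PRED or from the CB513 data.

```python
def q3_score(predicted: str, observed: str) -> float:
    """Return the percentage of residues whose predicted 3-state label
    (H, E or C) matches the reference label at the same position."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Toy, made-up labels purely for illustration (not data from the paper).
observed = "CCHHHHHHCCEEEEECC"
predicted = "CCHHHHHCCCEEEECCC"
print(f"Q3 = {q3_score(predicted, observed):.1f}%")
```

This sketch scores a single sequence. Whether a benchmark Q3 is pooled over all residues or averaged per sequence is a convention the abstract does not specify, so no choice is assumed here.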
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
d.t.jones{at}ucl.ac.uk
Clarified and revised manuscript text.