Abstract
Proteins can be partitioned into eight mutually exclusive sets of peptides and recoded with a binary alphabet obtained by dividing the 20 amino acids into two ordered sets based on residue volume. Searching a protein sequence database for these binary-coded peptides identifies the proteins that contain them. More than 89.7% of the 20207 curated proteins in the human proteome (http://www.uniprot.org; database id UP000005640, H. sapiens) can be identified in this way. The procedure can be translated into practice: standard chemical procedures can be used for partitioning, and a nanopore can be used to obtain binary-coded sequences for the partitioned peptides. For the latter, recently published work has shown that a sub-nanometer-diameter pore can measure residue volume with a resolution of ~0.07 nm³. This can be used to distinguish between the two sets of residues defined above; a detector with two thresholds outputs a binary sequence for a partitioned peptide from the nanopore current signal. Using normal distributions of amino acid volume data from the literature, routine computations show that ~98% of the protein-identifying peptides in the curated human proteome have binary codes that are correct with a confidence level exceeding 85%. Similar results are presented for the proteomes of baker’s yeast (S. cerevisiae), the pathogen E. coli, and the gut bacterium H. pylori.
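The recoding and lookup steps described above can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the abstract: the volume table holds approximate residue volumes (in Å³) of the kind commonly quoted in the literature, the split of the ordered residues into a "small" set of eight and a "large" set of twelve is chosen only for illustration, and the names `VOLUME_A3`, `binary_alphabet`, `recode`, and `find_containers` are hypothetical. Peptide partitioning and the nanopore measurement are not modeled.

```python
# Approximate residue volumes in cubic angstroms.
# ASSUMPTION: illustrative literature-style values, not the paper's table.
VOLUME_A3 = {
    "G": 60.1, "A": 88.6, "S": 89.0, "C": 108.5, "D": 111.1,
    "P": 112.7, "N": 114.1, "T": 116.1, "E": 138.4, "V": 140.0,
    "Q": 143.8, "H": 153.2, "M": 162.9, "I": 166.7, "L": 166.7,
    "K": 168.6, "R": 173.4, "F": 189.9, "Y": 193.6, "W": 227.8,
}


def binary_alphabet(volumes, n_small):
    """Order residues by volume; the n_small smallest map to '0', the rest to '1'."""
    ordered = sorted(volumes, key=volumes.get)
    return {aa: ("0" if i < n_small else "1") for i, aa in enumerate(ordered)}


def recode(sequence, code):
    """Translate an amino acid sequence into its binary-coded form."""
    return "".join(code[aa] for aa in sequence)


def find_containers(binary_peptide, database, code):
    """Return names of database proteins whose binary form contains the peptide."""
    return [
        name
        for name, seq in database.items()
        if binary_peptide in recode(seq, code)
    ]


# ASSUMPTION: a split after the eight smallest residues, for illustration only.
code = binary_alphabet(VOLUME_A3, n_small=8)

# Toy two-entry "database"; real use would load a proteome such as UP000005640.
toy_db = {"P1": "GASWWK", "P2": "WWWGGG"}
hits = find_containers(recode("ASWW", code), toy_db, code)
```

In this sketch a binary-coded peptide identifies its container protein when `find_containers` returns exactly one name; a peptide that matches several entries is not protein-identifying on its own.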