Protein fingerprinting with a binary alphabet and a nanopore

G. Sampath

doi:10.1101/119313

Abstract

Proteins can be partitioned into eight mutually exclusive sets of peptides and recoded with a binary alphabet obtained by dividing the 20 amino acids into two ordered sets based on volume. By searching for these binary-coded peptides in a protein sequence database, their container proteins can be identified. Over 89.7% of 20207 curated proteins in the human proteome (http://www.uniprot.org; database id UP000005640, H. sapiens) can be so identified. This procedure can be translated into practice: standard chemical procedures can be used for partitioning, and a nanopore can be used to obtain binary-coded sequences for the partitioned peptides. For the nanopore step, recently published work has shown that a sub-nanometer-diameter pore can measure residue volume with a resolution of ~0.07 nm³. This resolution can be used to distinguish between the two sets of residues defined above; a detector with two thresholds outputs a binary sequence for a partitioned peptide from the nanopore current signal. Using normal distributions of amino acid volume data from the literature, routine computations show that ~98% of the protein-identifying peptides in the curated human proteome have binary codes that are correct with a confidence level exceeding 85%. Similar results are presented for the proteomes of baker’s yeast (S. cerevisiae), the gut bacterium E. coli, and the pathogen H. pylori.
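The recoding, lookup, and confidence steps summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the volume table is the commonly cited Zamyatnin (1972) set in cubic angstroms, the split point THRESHOLD (ten smallest versus ten largest residues) and the measurement spread SIGMA are hypothetical, and the function names are invented for this example. The chemical partitioning into eight peptide sets is not modeled; the lookup simply searches for a peptide's binary code inside each recoded protein.

```python
from math import erf, sqrt

# Mean residue volumes in cubic angstroms (Zamyatnin, 1972).
# Assumption: the paper's own volume data may differ from this table.
VOLUME = {
    "G": 60.1, "A": 88.6, "S": 89.0, "C": 108.5, "D": 111.1,
    "P": 112.7, "N": 114.1, "T": 116.1, "E": 138.4, "V": 140.0,
    "Q": 143.8, "H": 153.2, "M": 162.9, "I": 166.7, "L": 166.7,
    "K": 168.6, "R": 173.4, "F": 189.9, "Y": 193.6, "W": 227.8,
}

# Hypothetical split point: midway between V and Q, giving two sets of ten.
THRESHOLD = 141.9
# Hypothetical standard deviation of a measured residue volume.
SIGMA = 10.0


def binary_code(sequence):
    """Recode a residue string: '0' for small residues, '1' for large ones."""
    return "".join("1" if VOLUME[r] > THRESHOLD else "0" for r in sequence)


def code_confidence(peptide, sigma=SIGMA):
    """Probability that every residue is measured on its correct side of
    THRESHOLD, assuming independent normal measurement noise."""
    p = 1.0
    for r in peptide:
        z = abs(VOLUME[r] - THRESHOLD) / sigma
        p *= 0.5 * (1.0 + erf(z / sqrt(2.0)))  # standard normal CDF at z
    return p


def identify(code, database):
    """Return names of proteins whose recoded sequence contains the code."""
    return [name for name, seq in database.items() if code in binary_code(seq)]


if __name__ == "__main__":
    toy_db = {"P1": "GAWK", "P2": "GGGG"}  # toy sequences, not real proteins
    code = binary_code("GAW")
    print(code, identify(code, toy_db), round(code_confidence("GAW"), 3))
```

A peptide identifies its container protein when identify returns exactly one name. Residues whose volumes lie close to the split point (V and Q in this table) dominate the loss of confidence, which is why the choice of threshold matters.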

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.