Abstract
A single-molecule method of identifying proteins based on electrical measurements and database search without labels or immobilization is considered. It uses electrolytic cells with two or three nanopores in tandem and one or two peptidases covalently attached to the trans side of a pore. An unknown protein is digested into peptides ending in a known amino acid; the peptides enter the cell, pass through the first pore, and are fragmented by a high-specificity endopeptidase. The second enzyme, if present, is an exopeptidase that cleaves the fragments into residues after the second pore. Level transitions in a blockade pulse due to the pore ionic current or transverse current pulse caused by a fragment in the second pore or individual such pulses caused by single residues in the third pore are counted. N residue-specific cells produce N integer lists from which a partial sequence is assembled. Search through the Uniprot database shows that for small N (3 to 5) over 98% of proteins in the human proteome can be identified from such sequences. A Fokker-Planck model is used to derive minimum enzyme turnover intervals required for correct sequencing. With thick (80-100 nm) pores the pulse width is ∼1 μs/residue, which is within the capability of CMOS detector circuits. If digested peptides are assumed to enter a cell in random order then over a long run the quantity of a protein in a mixture of proteins can be estimated from the number of its identifying peptides.
1 Introduction
Unlike genome sequencing, which is largely aimed at extracting bio-markers such as gene mutations indicative of risk to diseases like cancer and diabetes, medical diagnostic procedures for patient treatment and care focus on the identification of cell constituents such as proteins. Often the identity and quantity of a protein in an assay are more useful than the sequence. While whole genome sequencing1 has advanced rapidly with the emergence of several new techniques, sequencing and identification of proteins are largely based on the established techniques of Edman degradation,2 gel electrophoresis,2 and mass spectrometry.3 Whether genome or protein, sequencing is based on bulky or expensive devices and/or time-consuming procedures; this has led to efforts aimed at developing portable/hand-held low-cost fast-turnaround devices.4,5 In particular, nanopores have been investigated for their potential use in the analysis and study of DNA6 and proteins/peptides.7-10 Recently a tandem electrolytic cell with cleaving enzymes was proposed for sequencing of DNA11 and peptides.12 It has two single cells in tandem, with the structure [cis1, upstream pore (UNP), trans1/cis2, downstream pore (DNP), trans2]. An enzyme covalently attached to the downstream side of UNP successively cleaves the leading monomer in a polymer threaded through UNP; the monomer translocates through DNP where the ionic current blockade it causes is used (along with other discriminators12) to identify it. With DNA the enzyme is an exonuclease,11 with peptides it is an amino or carboxy peptidase.12 The process is label-free and does not require immobilization of the analyte.
Here a low-cost alternative to conventional methods for protein identification is proposed in which a partial sequence is obtained for a peptide and used to identify the protein. Sequencing is based on a single tandem cell11,12 and an endopeptidase (Method 1) or a double tandem cell, endopeptidase, and exopeptidase (Method 2). The first enzyme breaks the peptide into fragments, the second breaks fragments into residues. The fragments/residues translocate through a pore and cause ionic current blockades or modulate a transverse current across the pore;6 the pore/transverse current pulses or level transitions within are counted. In both methods, N (∼3 to 5) tandem cells, each with an endopeptidase specific to a different amino acid, produce N lists of integers corresponding to the positions of the amino acid in the peptide sequence, from which a simple algorithm assembles a partial sequence. The protein is then identified by comparing the latter with sequences in a protein database. With a mixture of proteins the quantity of a protein in the mixture is estimated from the number of identifying peptides. The approach may be extended to cover modified amino acids.
This is a digital technique based on pulse counting. It differs from other nanopore sequencing and identification techniques, which rely on analog measurements of pulse magnitude or width (equivalently, analyte residence time in a pore) in a pore ionic or transverse current.6 The sequencing aspect is reminiscent of the Maxam-Gilbert13 and Sanger14 methods for DNA, wherein independent channels are used for A, T, C, and G, and subsequences are separated by length and terminal base type. The identification aspect resembles database-centered methods used in mass spectrometry such as Peptide Sequence Tags.3 The approach is similar in some ways to a recent theoretical proposal15 in which fluorescent labels specific to a set of amino acids are attached to residues in a peptide immobilized on a glass substrate. The labels are optically detected when the N-terminal residues are removed one after the other in a series of Edman degradation cycles. The optical output is used to partially sequence a peptide and then identify it in a proteome. In contrast, the proposal presented here does not require analyte immobilization, labeling, or repeated wash cycles.
2 Protein identification and quantification: method and materials
An unknown protein Px is identified in six stages:
1. Fragment copy of Px into peptides ending in amino acid X0.
2. Break peptide copy into fragments ending in amino acid X1 ≠ X0 (Methods 1 and 2). Break a fragment into individual residues (Method 2).
3. Find number of residues in fragment obtained in Stage 2.
4. Repeat Stages 1 through 3 for other amino acids X2, X3,…
5. Assemble partial sequence from length information obtained in Stage 3. Mark unknown residues with wild card *.
6. Match partial sequence with sequences in proteome of interest and identify Px (hopefully uniquely).
Stage 1. A highly specific chemical or peptidase is used. Examples include cyanogen bromide, which cleaves after methionine (M), and GluC protease, which cleaves after glutamic acid (E). Both cleave on the C-terminal side and result in fragments ending in M or E respectively. More such agents/peptidases are available and are listed in Table A-4 in the Appendix. (See the online review,16 from which the table has been adapted, for a comprehensive list of references.)
Stage 2. The positions of occurrence of a residue in a peptide are obtained by targeting it with a highly specific peptidase in a tandem cell. Peptidases with high specificity include GluC (E), ArgC proteinase (arginine R), AspN endopeptidase (aspartic acid D), and LysC lysyl endopeptidase (lysine K). Others with high specificity but some ambiguity include serine proteinase (E or D) and neutrophil elastase (valine V or alanine A).16 Similarly a peptide fragment can be cleaved into individual residues by an exopeptidase capable of cleaving a wide range of residue types at the carboxyl or amino end. Examples include Carboxypeptidase I (CPD-Y), Carboxypeptidase II (CPD-M-II), and Leucine Aminopeptidase (LAP).17-19
Stage 3. A tandem cell is used to count level transitions in a pore ionic current or transverse current pulse that is modulated by a fragment translocating through a nanopore or individual such pulses due to cleaved residues. Two methods are available.
In Method 1 the structure in Figure 1 is used. A peptide with a poly-X header (X = one of the charged amino acids: Arg, Lys, Glu, Asp; the charge on X depends on the pH value) is drawn into UNP by the electric field due to V05 (= ∼110 mV), most of which (∼98%) drops across the two pores.6 An endopeptidase specific to amino acid AA attached downstream of UNP cleaves the peptide after (or before) all n (≥ 0) points where AA occurs. The resulting n+1 fragments translocate to and through DNP, where level crossings in the resulting pore ionic current blockade or a transverse current across DNP may be used to count the residues in a fragment.
In Method 2 the double tandem cell in Figure 2 is used. A peptide is cleaved into fragments by an endopeptidase as in Method 1. An exopeptidase (amino or carboxy) covalently attached downstream of the middle nanopore (MNP) cleaves residues from a fragment; the residues translocate through DNP and blockade the pore ionic current or modulate the transverse current. The resulting single pulses are counted.
In both methods each tandem cell specific to an amino acid produces an ordered list of integers equal to the lengths of successive fragments in which the last residue is the target. If a cell generates a single integer, the target is not in the peptide.
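The integer list that a residue-specific cell would ideally report can be illustrated with a short Python sketch. It is not the authors' implementation: the function name `fragment_lengths` and the toy peptide are hypothetical, cleavage is assumed to occur after every occurrence of the target, and a trailing fragment that does not end in the target is assumed to be left out of the list.

```python
def fragment_lengths(peptide: str, target: str) -> list[int]:
    """Lengths of successive fragments whose last residue is `target`.

    Assumption: the cell cleaves after every occurrence of `target`, and a
    trailing fragment that does not end in `target` is not reported.
    """
    lengths = []
    start = 0  # index of the first residue of the current fragment
    for index, residue in enumerate(peptide):
        if residue == target:
            lengths.append(index + 1 - start)
            start = index + 1
    return lengths


if __name__ == "__main__":
    toy_peptide = "AKGGKTRM"  # hypothetical peptide ending in M
    for target in "RKM":
        print(target, fragment_lengths(toy_peptide, target))
```

In practice the counts come from pulse or level-transition counting in the cell; the sketch only shows what an error-free cell would output for a known sequence.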
Stage 5. The peptide is partially assembled using the following procedure:
Replace fragment lengths from cell with cumulative lengths (= target positions in peptide) and target identities.
Invert position-identity pairs.
Merge resulting sequences.
Insert wild card * in all other positions in sequence.
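The assembly procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; the function name `assemble_partial_sequence` and its arguments are hypothetical, and the peptide length is assumed to be the largest cumulative position unless it is supplied explicitly.

```python
from itertools import accumulate


def assemble_partial_sequence(length_lists, peptide_length=None):
    """Build a partial sequence from per-target fragment-length lists.

    length_lists maps a target residue (e.g. "R") to the ordered list of
    fragment lengths reported by the cell specific to that residue.
    """
    placed = {}  # 1-based position -> residue (the inverted position-identity pairs)
    for target, lengths in length_lists.items():
        # Cumulative sums of fragment lengths give the positions of the target.
        for position in accumulate(lengths):
            if placed.get(position, target) != target:
                raise ValueError(f"conflicting residues at position {position}")
            placed[position] = target
    if peptide_length is None:
        peptide_length = max(placed, default=0)
    # Every position not claimed by some target gets the wild card.
    return "".join(placed.get(pos, "*") for pos in range(1, peptide_length + 1))


if __name__ == "__main__":
    # Hypothetical length lists for a toy 8-residue peptide.
    print(assemble_partial_sequence({"K": [2, 3], "R": [7], "M": [8]}))
```

The conflict check is an addition for illustration: two cells claiming the same position would indicate a counting error in at least one of them.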
Stage 6. Standard string matching algorithms can be used to search for the partial sequence among the set of sequences in a protein database such as Uniprot20 or PDB. More general matching algorithms21 may be used if desired.
2.1 Database search and results
The number of identifying sub-sequences per protein was computed for the following set of cleavage choices: M (first stage); R, K, D, E (second stage). An exhaustive search of sequences in the human proteome (Uniprot database id UP000005640, manually reviewed subset with 20207 sequences) was done. Computation is in four steps:
All subsequences ending in M are extracted; they correspond to the peptides generated by the action of cyanogen bromide (see above). Every one of these peptides has exactly one M, which is also the last residue in the peptide.
In four individual cells specific to R, K, D, or E a copy of each peptide is cleaved after every occurrence of the target (R, K, D, or E). The resulting subsequences are the fragments generated by ArgC, LysC, AspN, or GluC respectively.
The partial peptide sequence is assembled from these fragments using the algorithm in Stage 5 above.
To find out if a peptide is a unique identifier the wild card * is entered into every position in the peptide sequence where R, K, D, E, and M do not occur. The resulting string is then matched with every other peptide (similarly filled with *).
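The four computation steps can be sketched on a toy proteome as follows. This is a simplified illustration under stated assumptions: the accessions and sequences are invented, the helper names are hypothetical, a trailing piece without M is discarded, and a masked peptide counts as identifying if it occurs in exactly one protein (even if it occurs more than once within that protein).

```python
from collections import defaultdict

TARGETS = set("RKDEM")  # residues located by Stage 1 (M) and Stage 2 (R, K, D, E)


def stage1_peptides(protein, cleave_after="M"):
    """Peptides ending in `cleave_after`; a trailing piece without it is dropped."""
    peptides, start = [], 0
    for index, residue in enumerate(protein):
        if residue == cleave_after:
            peptides.append(protein[start:index + 1])
            start = index + 1
    return peptides


def mask(peptide):
    """Replace every residue that is not a target with the wild card *."""
    return "".join(r if r in TARGETS else "*" for r in peptide)


def identifying_peptides(proteome):
    """Map each protein id to the masked peptides found in no other protein."""
    owners = defaultdict(set)  # masked peptide -> ids of proteins containing it
    for protein_id, sequence in proteome.items():
        for peptide in stage1_peptides(sequence):
            owners[mask(peptide)].add(protein_id)
    result = {protein_id: [] for protein_id in proteome}
    for masked, ids in owners.items():
        if len(ids) == 1:
            result[next(iter(ids))].append(masked)
    return result


if __name__ == "__main__":
    toy_proteome = {"P1": "AKRMGGDM", "P2": "CKRMAAEM", "P3": "TTKM"}  # invented
    print(identifying_peptides(toy_proteome))
```

Grouping masked peptides in a dictionary avoids the pairwise comparison of every peptide with every other one, which matters when the input is a full proteome.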
Example: Consider the protein P31946 in the human proteome (Uniprot id UP000005640).20 The following is one of three peptides that uniquely identify it in the proteome: KAVTEQGHELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTERNEKKQQM. With cells targeting R, K, D, and E, the corresponding length lists are R:{15, 14, 1, 4, 11}, K:{1, 22, 19, 6, 1}, D:{}, E:{5, 4, 4, 1, 26, 4, 3}, and M:{52}. The corresponding position lists are R:{15, 29, 30, 34, 45}, K:{1, 23, 42, 48, 49}, D:{}, E:{5, 9, 13, 14, 40, 44, 47}, and M:{52}, where the position value is obtained as the sum of all lengths in the list up to and including the current one. Inverting and merging leads to the partial sequence K***E***E***EER*******K*****RR***R*****E*K*ER*EKK**M.
The percentage of proteins with at least one identifying subsequence (created by cyanogen bromide) is found to be 97.8%. The number of identifiable proteins can be increased by sequencing a protein with other combinations of cleaving chemical/enzyme in Stage 1 and peptidases in Stage 2. For example, instead of cyanogen bromide targeting M, GluC can be used to generate the set of peptides in the protein that end in E. Another possibility is to use diazonium to cleave after Tyr (Y) in the first stage. Various combinations of peptidases can be considered for the second stage; see Section A-6 in the Appendix. Figure 3 shows the distribution of the number of proteins vs the number of identifying peptides in a protein for two sets of cleavage choices in Stages 1 and 2.
The total coverage is the union of the sets of proteins with at least one unique identifier obtained from all these combinations. Table 1 shows the increase when sequencing is done twice by targeting M or Y in the first stage; both times R, K, D, and E are targeted in the second stage. With enough combinations 100% coverage may be possible.
2.2 Quantifying a protein in a mixture
Consider an assay with a mixture of proteins. The output of Stage 1 is the set of all the peptides from all the proteins in the mixture; they are the result of the cleaving action of the chemical agent or peptidase used. On input to Stage 2, the peptides enter a cell (which is designed to cleave after a given amino acid such as R, K, D, or E) in some random order; the partial sequences obtained are used to identify the container protein as described earlier. Consider a mixture { (Ni, Pi, Ii): i = 1, 2,…} where Ni is the number of molecules of the i-th protein in the mixture, Pi the number of peptides per molecule of the protein (this is equal to the number of peptides created in Stage 1 from a single molecule), and Ii (0 ≤ Ii ≤ Pi) the number of identifying peptides per molecule. For a given chemical agent/peptidase in Stage 1 the Pi are known, and for the set of peptidases used in cells in Stage 2 the Ii are known by computation. Ni is the desired unknown. For example, with M targeted in Stage 1 and R, K, E, and D targeted in Stage 2, for the protein P31946 in the human proteome Pi = 9 and Ii = 3. Peptides generated in Stage 1 from the mixture enter a cell in succession (in some random order) and are partially sequenced and identified. Let the number of peptides in protein i that have been identified in the run so far be Ii-measured. If peptide entry into a cell is totally random, then after a sufficiently long run Ni can be estimated as

$$N_i \approx \frac{I_{i\text{-measured}}}{I_i} \tag{1}$$

If the total number of peptides processed in the run is Ntotal then the number of peptides that do not yield identifying information is

$$N_{\text{total}} - \sum_i I_{i\text{-measured}} \tag{2}$$

where the summation is over all the identified proteins. This number includes peptides that are found in more than one protein and may also include impurities in the assay sample. If the sample is not pure there seems to be no easy way to separate the two, so unidentified sample proteins that are not impurities remain unestimated (even though their percentage is likely to be small).
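A minimal Python sketch of this bookkeeping follows. It assumes that the estimate of Ni is the ratio of the number of identifying peptides observed for protein i to the number of identifying peptides per molecule, and that the non-identifying count is the total minus all identified peptides; the function names and the run counts are hypothetical.

```python
def estimate_molecule_counts(measured, identifying_per_molecule):
    """Estimate N_i as I_i-measured / I_i for every protein with I_i > 0.

    Assumption: peptides enter the cell in random order and the run is long
    enough for the observed counts to be representative.
    """
    return {
        protein_id: count / identifying_per_molecule[protein_id]
        for protein_id, count in measured.items()
        if identifying_per_molecule.get(protein_id, 0) > 0
    }


def non_identifying_count(total_processed, measured):
    """Peptides processed in the run that did not identify any protein."""
    return total_processed - sum(measured.values())


if __name__ == "__main__":
    # I_i = 3 for P31946 is taken from the text; the run counts are invented.
    measured = {"P31946": 300}
    print(estimate_molecule_counts(measured, {"P31946": 3}))
    print(non_identifying_count(1000, measured))
```

Proteins with no identifying peptide (Ii = 0) are skipped because they cannot be quantified this way.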
3 Necessary conditions for correct sequencing
Nanopore-based sequencing relies on the ability to measure changes in current flow when an analyte molecule is present. This current may be an ionic current from cis to trans, a transverse electronic current across the pore membrane, or a transverse tunneling current across a gap in the membrane.6 The measurement ability is closely related to the bandwidth of the detector; see the discussion in Section 4 below.
Since the charge carried by a peptide is highly variable and may be negative, 0, or positive, the two methods described above rely on diffusion as the primary mechanism for translocation of a fragment or residue, modified by the drift field. They are studied through the properties of the basic tandem cell, which has been modeled with a Fokker-Planck equation.11,12 Central to the model is the solution of a boundary value problem in which the trans side of a pore is viewed as a reflecting boundary for a cleaved fragment or residue, so the net diffusion tends to be in the cis-to-trans direction (with V05, V07 > 0). The main quantities of interest are the mean E(T) and variance σ²(T) of the time T taken by a particle to translocate through a trans compartment or pore of length L (in the latter case it is ≈ the width of the pore ionic blockade or transverse current pulse) and with applied potential difference of V. From the Appendix

$$E(T) = \frac{L^2}{D\alpha^2}\left[e^{-\alpha} - 1 + \alpha\right] \tag{3}$$

and

$$\sigma^2(T) = \left(\frac{L^2}{D\alpha^2}\right)^2\left[2\alpha + 4\alpha e^{-\alpha} - 5 + 4e^{-\alpha} + e^{-2\alpha}\right] \tag{4}$$

where

$$\alpha = \frac{v_z L}{D}, \qquad v_z = \frac{\mu V}{L} \tag{5}$$

Here vz is the drift velocity due to the electrophoretic force experienced by a charged particle in the z direction; it can be 0, negative, or positive. D is the diffusion constant of the particle and μ its mobility. For vz = 0, these two statistics are

$$E(T) = \frac{L^2}{2D}, \qquad \sigma^2(T) = \frac{L^4}{6D^2} \tag{6}$$

Details are given in Section A-1 of the Appendix; derivations may be found elsewhere.11,12
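The translocation statistics can be evaluated numerically with the sketch below. It assumes the standard first-passage result for one-dimensional drift-diffusion over a channel with a reflecting upstream end and an absorbing downstream end, expressed through the dimensionless parameter α = μV/D; the function name, argument names, and the small-drift cutoff are hypothetical choices, and SI units are assumed throughout.

```python
import math


def translocation_stats(length, diffusion, mobility, voltage):
    """Mean and variance of the translocation time (seconds, seconds^2).

    Assumed model: reflecting boundary at the entrance, absorbing boundary at
    the exit, drift velocity v_z = mobility * voltage / length.
    """
    alpha = mobility * voltage / diffusion  # equals v_z * length / diffusion
    if abs(alpha) < 1e-4:
        # Pure-diffusion limit; avoids catastrophic cancellation near alpha = 0.
        return length**2 / (2 * diffusion), length**4 / (6 * diffusion**2)
    scale = length**2 / (diffusion * alpha**2)
    m1 = math.expm1(-alpha)       # exp(-alpha) - 1, accurate for small alpha
    m2 = math.expm1(-2 * alpha)   # exp(-2*alpha) - 1
    mean = scale * (m1 + alpha)
    # Same bracket as 2a + 4a*exp(-a) - 5 + 4*exp(-a) + exp(-2a), rewritten with expm1.
    variance = scale**2 * (6 * alpha + 4 * alpha * m1 + 4 * m1 + m2)
    return mean, variance


if __name__ == "__main__":
    # Illustrative values only: 80 nm pore, D = 1e-10 m^2/s, no net charge.
    mean, variance = translocation_stats(80e-9, 1e-10, 0.0, 0.14)
    print(mean, math.sqrt(variance))
```

With zero drift the mean scales as the square of the pore length, which is the property the thick-pore argument relies on.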
Let Tdetector = time resolution of the detector circuit (= ∼1 μs with CMOS circuits22). The following are necessary conditions for correct sequencing:
C1:
a) At most one cleaved fragment may occupy DNP (Method 1) or MNP (Method 2) at any time;
b) At most one cleaved residue may occupy DNP (Method 2).
C2:
a) Cleaved fragments (Method 1) or residues (Method 2) must arrive at DNP in sequence order;
b) Cleaved fragments must arrive at MNP in sequence order (Method 2).
C3:
a) A residue translocating through DNP must have a pulse width > Tdetector (Method 2);
b) A fragment with Lf residues must have a pulse width in DNP > LfTdetector (Method 1).
These conditions are influenced by the following factors:
The pore ionic blockade or transverse current pulse width, which is effectively the fragment or residue’s residence time in DNP. It is approximated by the mean translocation time through DNP in both methods.
The charge carried by a peptide fragment (and hence its mobility μ). As it depends on the constituent amino acids it has a wide range of values, which directly affects the translocation time (see Section A-2 in the Appendix for the relevant equations). Thus fragments with high negative charge have very high speeds of translocation which may result in misses (‘deletes’), while those with high positive charge are ‘lost’ to diffusion because they are too slow. Figure A-1 in the Appendix shows the frequency distribution of all 20⁷ peptides of length LF = 7 as a function of μ or μ/D at pH = 7 (physiological pH), where D is the diffusion constant of the fragment. (Note the multimodal shape and slight negative skew.) These distributions are used in the Appendix to estimate the percentage of misses (deletes) and losses.
C1 and C2 can be satisfied by requiring the enzymes to cleave at a given minimum rate. Enzyme reactions being stochastic processes, reaction rates are random variables with a distribution of values. The minimum rates required are estimated using standard statistical methods. C3 can be satisfied through the use of a sufficiently thick pore. The pore ionic blockade or transverse modulated current pulse width is proportional to the square of pore length (Equations 3 through 6), so a thicker pore can significantly increase translocation times and thus lower the required bandwidth (or equivalently increase the resolution time needed to sense the pulse). This is contrary to the usual practice of using thinner pores to achieve better discrimination,6 but is appropriate here because residues do not have to be identified; they only have to be counted. (A side benefit of this is that thick synthetic pores are usually easier to fabricate than thin ones.23) See Discussion in Section 4.
With suitable values for the pore length, applied voltage, and peptide length all three conditions can be satisfied with a detector time resolution of ∼1 μs. This is shown below by computing pulse widths and required enzyme reaction times. Only the results are given here; details may be found in the Appendix.
3.1 Computational results
The following parameter values are assumed: V05 = ∼115 mV (Method 1); V07 = ∼180 mV (Method 2); detector resolution = 1 μs; pore (DNP, MNP) conductance = ∼1 nS; pH = 7.0; trans1/cis2 height = trans2/cis3 height = 0.5 μm, UNP length = MNP length = 10 nm. V05 divides as V01 = V23 = V45 ≈ 1.6 mV, V12 ≈ 10 mV, and V34 ≈ 100 mV. V07 divides as V01 = V23 = V45 = V67 ≈ 1.5 mV, V12 = V34 ≈ 15 mV, and V56 = 140 mV. Let Texo-min, Tendo-min-2, and Tendo-min-1 be the minimum reaction time intervals for the exopeptidase in Method 2, the endopeptidase in Method 2, and the endopeptidase in Method 1 respectively. Translocation time distributions are assumed to have 6σ support, where σ is the standard deviation.
Method 2: Using data from Table A-3 in the Appendix for DNP length = 80 nm, the minimum mean blockade pulse width in DNP is given by the fastest amino acid (Asp) and is ∼1.33 μs > Tdetector = 1 μs. Texo-min is largely determined by the slowest (Lys) and is ∼1 ms. More generally Figure 4 shows the mean blockade pulse widths due to single residues in DNP for all 20 residue types for three different lengths of DNP, while Figure 5 shows Texo-min vs residue type for DNP length = 80 nm. Figure 6 shows the frequency distribution of Tendo-min-2 for different fragment lengths (LF), each based on 10⁶ randomly generated peptide sequences. In each case ∼95% of the sequences have minimum enzyme cleavage intervals < LF ms.
Method 1: Figure 7 shows the distribution of fragment pulse widths for three different fragment lengths. Figure 8 shows Tendo-min-1 for 10⁶ random samples of length 12 in Method 1. For the vast majority of sequences Tendo-min-1 is < 3 ms. The curve is to the left of the corresponding curve in Method 2 (Figure 6, red) because the endopeptidase reaction times in the latter include the delay due to the cleaving of residues in a fragment by the exopeptidase (although this is not strictly necessary because the pulses are only counted so they can arrive in any order). The distribution of pulse widths > 12 μs due to fragments of length = 12 vs the endopeptidase reaction time is shown in Figure 9. For nearly 80% of the sequences (with pulses in which LF transitions can be counted) Tendo-min-1 < 1 ms. In comparison the percentage of pulses that may not be counted correctly is relatively small at ∼17%. The curves in Figures 6 and 8 are similar in shape and range to reaction rate graphs for the enzyme Exonuclease I.24
Comparing the two methods. Method 1 has a more compact physical structure and uses only one enzyme, but the need to recognize transitions in a blockade pulse due to a fragment reduces the maximum length that can be determined accurately. The ionic current is also lower. (This is not as serious a problem with transverse currents, which are on the order of nA,23 compared with at most 100s of pA with ionic currents.) Method 2 can use a shorter (that is, thinner) DNP and a higher potential difference (leading to a higher ionic current). However, as noted earlier, the reaction time required of the endopeptidase is significantly larger; also the endopeptidase and exopeptidase need to cleave at a sufficiently low rate and in synchrony. Notice that in Method 2 even if the exopeptidase is inefficient and does not cleave after every single residue, the number of residues would be counted correctly if DNP detects transitions between residues in a pulse due to a fragment with more than one residue.
4 Discussion
Some relevant implementation issues are considered next.
1) Counting pulses or transitions in a pulse
On the face of it counting transitions in a pulse due to residues in a fragment would appear to be easier with the following methods: a) using single-atom thick graphene25 or molybdenum disulphide (MoS2) sheets,26 both of which make counting of transitions easier; b) detecting level crossings in a transverse electronic26 or tunneling27,28 current pulse across graphene or silicon gaps; and c) using a narrow biological nanopore like MspA, which has a constriction in its short stem29 that may aid in recognizing the transitions. However all of these methods would require bandwidths in the tens of MHz if directly used in the approach described here. To bring the bandwidth down to 1-2 MHz (corresponding to a pulse width resolution of ∼1 μs), thick pores may be considered, as discussed in Section 3. With silicon compounds like Si3N4 thick pores of 50-100 nm are actually easier to manufacture than thin pores.23,30 With graphene, hourglass-shaped pores may be fabricated from graphite (which is a stack of graphene layers31) but stability may be an issue because of graphite’s flakiness. Biological pores like AHL or MspA can also be stacked, for example a stack of 10 AHL pores can provide a pore about 60-80 nm thick.
2) Location of peptidases
The cleaving action of an enzyme (endopeptidase or exopeptidase) requires it to be in the path of the peptide or fragment emerging from the respective pore (UNP or MNP) on the trans side. This can be ensured by covalently attaching the enzyme to the trans side of the pore membrane. Such covalent attachment has been discussed for DNA sequencing in two different approaches: exosequencing of mononucleotides32 and sequencing by synthesis using heavy tags attached to the bases.33 In both approaches an exonuclease or polymerase is attached to the cis side of the pore membrane. This could result in significant errors due to cleaved bases or tags being lost to diffusion in the cis chamber (deletions) or entering the pore out of order (delete-and-insert).34 In the present approach the peptidases are located on the trans side so deletions cannot occur. Out-of-order arrivals at the sensing pore (fragments at DNP in Method 1; fragments at MNP and residues at DNP in Method 2) are precluded as long as the necessary conditions given in Section 3 are satisfied.
3) Solution pH
Solution pH plays an important role for two reasons: a) the charge carried by a fragment, which is highly variable and not known in advance, is a function of pH; compare with DNA, where all nucleotide types have approximately the same electron charge of -q with a small variability due to pH; b) its effect on enzyme reaction rates. The choice of pH is a tradeoff between enzyme efficiency and being able to control translocation speeds; this may be determined by experiment.
4) Fabrication
A recent review of nanopore sequencing includes notes on fabrication techniques.30 Recently a tandem-pore-like structure was used to trap and analyze DNA,35 with the trans1/cis2 chamber functioning like a test-tube. In contrast with conventional nanopore sequencing methods, where the aim is to fabricate thin pores (∼1 nm) that are usually synthetic (such as, for example, Si3N4), as noted earlier the thick (80-100 nm) pores required in the proposed scheme may be more easily fabricated.
5) Other
See Appendix.
Appendix
A-1 Translocation statistics of tandem cell
A-2 Dependence of particle translocation on solution pH, charge, diffusion constant, and mobility
A-3 Calculating the percentage of misses (deletes) due to fast fragments and losses due to slow fragments
A-4 Table of translocation statistics for single residues
A-5 Derivation of necessary conditions for effective sequencing
A-6 Peptidases and chemicals for cleaving and their specificities
A-7 Additional notes and references
A-1 Translocation statistics of tandem cell
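Following [11,12], the mean E(T) and variance σ²(T) of the translocation time T over a channel of length L that is reflective at the top and absorptive at the bottom with applied potential difference of V are given by

$$E(T) = \frac{L^2}{D\alpha^2}\left[e^{-\alpha} - 1 + \alpha\right] \tag{A-1}$$

and

$$\sigma^2(T) = \left(\frac{L^2}{D\alpha^2}\right)^2\left[2\alpha + 4\alpha e^{-\alpha} - 5 + 4e^{-\alpha} + e^{-2\alpha}\right] \tag{A-2}$$

with

$$\alpha = \frac{v_z L}{D}, \qquad v_z = \frac{\mu V}{L} \tag{A-3}$$

Here vz is the drift velocity due to the electrophoretic force experienced by a charged particle in the z direction, which can be 0, negative, or positive; D is the diffusion constant of the particle and μ its mobility. For vz = 0, these two statistics are

$$E(T) = \frac{L^2}{2D}, \qquad \sigma^2(T) = \frac{L^4}{6D^2} \tag{A-4}$$

If each section in the double tandem cell is considered independently these formulas can be applied to all the relevant sections: trans1/cis2 (T = Ttrans1/cis2; L = L23), MNP (T = TMNP; L = L34), trans2/cis3 (T = Ttrans2/cis3; L = L45), DNP (T = TDNP; L = L56), and trans3 (T = Ttrans3; L = L67). For an analysis of behavior at the interface between two sections see [11,12].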
A-2 Dependence of particle translocation on solution pH, charge, diffusion constant, and mobility
Equations A-1 through A-4 involve a number of physical-chemical properties of amino acids: electrical charge (itself dependent on solution pH) [36], hydrodynamic radius, diffusion constant, and mobility. The following paragraphs provide a quantitative description of this dependence and allow calculation of fragment properties as they apply to peptide sequencing in a tandem cell with endopeptidase. In particular this information is used in the next section to derive a required condition for effective sequencing.
1) The electrical charge carried by a peptide (fragment) Px can be calculated with the Henderson-Hasselbalch equation. Let the set of amino acids be AA = [A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V] where AA[i] is the i-th amino acid, 1 ≤ i ≤ 20. Let the pH value of the solution (electrolyte) be p, kC = pKa value of the carboxy end = 2.34, kN = pKa value of the amino end = 9.69, NX the number of times residue X occurs in the peptide (X = R, H, K), NZ the number of times residue Z occurs (Z = D, C, E, Y), and kX and kZ the pKa values of X and Z respectively. pKa values are given in Table A-1. The charge multiplier CPx on the peptide is given by

$$C_{Px} = \frac{1}{1 + 10^{\,p - k_N}} + \sum_{X} \frac{1}{1 + 10^{\,p - k_X}} - \frac{1}{1 + 10^{\,k_C - p}} - \sum_{Z} \frac{1}{1 + 10^{\,k_Z - p}} \tag{A-5}$$

where the summations are over the NX and NZ occurrences of X and Z respectively in Px.
2) The hydrodynamic radius RPx of peptide Px = X1 X2… XN is obtained recursively as follows: where VXk and δv are the van der Waals volumes of Xk and a single molecule of water. Hydrodynamic radii of individual amino acids are given in [37] and van der Waals volumes in [38] (both sets of values are reproduced in the Supplement to [12]). This formula holds for small peptides (up to ∼20 residues).
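3) The diffusion constant and mobility of Px are given by

$$D_{Px} = \frac{k_B T_R}{6\pi\eta R_{Px}}, \qquad \mu_{Px} = \frac{C_{Px}\, q}{6\pi\eta R_{Px}} \tag{A-7}$$

Here kB is the Boltzmann constant (1.3806 × 10⁻²³ J/K), TR is the room temperature (298 K), η is the solvent viscosity (0.001 Pa·s), q is the electron charge (1.602 × 10⁻¹⁹ coulomb), and CPx is the charge multiplier defined in 1) above.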
Figure A-1 shows the distribution of the number of peptides of length 7 vs mobility μ and μ/D (= α with V set to 1) over all 20⁷ of them.
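The charge, diffusion constant, and mobility calculations can be sketched in Python as follows. This is an illustration under stated assumptions: the side-chain pKa values are common textbook figures and not necessarily those of Table A-1, the terminal pKa values are taken as 9.69 (amino end) and 2.34 (carboxy end), the Stokes-Einstein relation is assumed for D and μ, and the hydrodynamic radius is supplied by the caller rather than computed. All function and constant names are hypothetical.

```python
import math

K_B = 1.380649e-23      # Boltzmann constant, J/K
Q_E = 1.602176634e-19   # elementary charge, C

# Illustrative textbook pKa values (assumption; not the paper's Table A-1).
PKA_POSITIVE = {"R": 12.48, "H": 6.04, "K": 10.54}            # gain +1 when protonated
PKA_NEGATIVE = {"D": 3.90, "C": 8.18, "E": 4.07, "Y": 10.46}  # gain -1 when deprotonated
PKA_AMINO_END = 9.69
PKA_CARBOXY_END = 2.34


def charge_multiplier(peptide, ph=7.0):
    """Net charge of a peptide in units of q (Henderson-Hasselbalch)."""
    positive = 1.0 / (1.0 + 10 ** (ph - PKA_AMINO_END))
    negative = 1.0 / (1.0 + 10 ** (PKA_CARBOXY_END - ph))
    for residue in peptide:
        if residue in PKA_POSITIVE:
            positive += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[residue]))
        elif residue in PKA_NEGATIVE:
            negative += 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[residue] - ph))
    return positive - negative


def diffusion_constant(radius_m, temperature_k=298.0, viscosity_pa_s=1e-3):
    """Stokes-Einstein diffusion constant in m^2/s."""
    return K_B * temperature_k / (6 * math.pi * viscosity_pa_s * radius_m)


def mobility(charge, radius_m, viscosity_pa_s=1e-3):
    """Electrophoretic mobility in m^2/(V s) for a net charge of `charge` * q."""
    return charge * Q_E / (6 * math.pi * viscosity_pa_s * radius_m)


if __name__ == "__main__":
    radius = 3e-10  # hypothetical hydrodynamic radius of a single residue, m
    c = charge_multiplier("D")
    print(c, diffusion_constant(radius), mobility(c, radius))
```

Under these relations the ratio μ/D equals CPx·q/(kB·TR), so it depends on the charge but not on the radius.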
A-3 Calculating the percentage of misses (deletes) due to fast fragments and losses due to slow fragments
For fragments that carry a high negative or positive charge the mean translocation time in Equations A-1 and A-4 can be approximated by These formulas can be used to estimate the percentage of misses due to fast translocating fragments and slowly moving fragments through DNP in Method 1.
To estimate the former, for a given pore length L, voltage V across the pore, and blockade pulse width approximated by E(T), μ is written as The percentage of misses is given by The integral in Equation A-10 is the cumulative frequency for N1(μ) corresponding to the μ calculated from Equation A-9. The results are shown in Table A-2 for V = 100 mV, L = 100, 150, 200, and 250 nm, and two pulse widths: E(T) = 7 μs and 10 μs.
To estimate the percentage of losses rewrite Equation A-8b as This is an implicit function of two parameters, μ/D and D. To solve for μ/D for a given E(T), L, and V, it is approximated by where Davg is the average diffusion coefficient of all 20⁷ peptides of length 7. This is a nonlinear equation in μ/D; the desired root on the real line can be found using standard methods. For a given value of V, the percentage of losses is given by The results are shown in Table A-2 for V = 100 mV and L = 100, 150, 200, and 250 nm, and E(T) = 1 s.
Figures A-2 and A-3 show similar distributions for peptide lengths 12 and 16.
A-4 Translocation statistics of single residues
The mean and standard deviation of the time taken by a single residue through trans2/cis3 and DNP (Method 2) are shown in Table A-3 as a function of pH.
A-5 Necessary conditions for effective sequencing
The material between ≪ and ≫ is repeated from the main text.
≪
Let Tdetector be the time resolution of the detector circuit (= ∼1 μs with CMOS circuits22). The following are necessary conditions for effective sequencing:
C1:
a) At most one cleaved fragment may occupy DNP (Method 1) or MNP (Method 2) at any time;
b) At most one cleaved residue may occupy DNP (Method 2).
C2:
a) Cleaved fragments (Method 1) or residues (Method 2) must arrive at DNP in sequence order;
b) Cleaved fragments must arrive at MNP in sequence order (Method 2).
C3:
a) A residue translocating through DNP must have a pulse width > Tdetector (Method 2);
b) A fragment with Lf residues must have a pulse width in DNP > LfTdetector (Method 1).
≫
It is now shown that the conditions applicable to each of the two methods are satisfied by a large majority (∼80% in most cases) of peptide sequences of a given length for a set of typical parameter values. In the following translocation time distributions are assumed to have 6σ support (σ = standard deviation).
Method 2. Conditions 1a, 1b, 2a, 2b, and 3 have to be satisfied. From Table A-3, with pH = 7.0, DNP height = 80 nm, and V56 = 140 mV, the fastest amino acid is Asp (D) with a translocation time of ∼1.33 μs > 1 μs. This satisfies Condition 3.
Let X1 and X2 be two residues cleaved in succession by the exopeptidase. Conditions 1a, 1b, and 2a are satisfied if
From columns 6 and 7 in the same table the second term in the inequality on the right is 0, leading to over all X. The maximum occurs for X = K (Lys), with E(Ttrans2/cis3-X) = 0.21×10⁻³, σtrans2/cis3-X = 0.18×10⁻³, E(TDNP-X) = 82×10⁻⁶, and σDNP-X = 81×10⁻⁶, leading to More generally the rate can be calculated for each residue type in a similar way. Figure 5 in the main text shows the required minimum cleaving interval with DNP height = 80 nm, V56 = 140 mV, and V45 = 1.2 mV.
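The effect of the 6σ support assumption on the Lys values can be sketched as follows. This is an illustration only: the function name `support_bounds` is hypothetical, the times are assumed to be in seconds, and the code computes only the edges of each distribution (with the lower edge clamped at zero), not the minimum cleaving interval itself.

```python
# Sketch under stated assumptions: times in seconds, support = mean +/- 3 sigma,
# lower edge clamped at zero because a translocation time cannot be negative.
def support_bounds(mean_s, sigma_s):
    """Return (earliest, latest) times for a distribution with 6-sigma support."""
    lower = max(0.0, mean_s - 3.0 * sigma_s)
    upper = mean_s + 3.0 * sigma_s
    return lower, upper


# Lys (K) values quoted in the text
compartment = support_bounds(0.21e-3, 0.18e-3)  # trans2/cis3
dnp = support_bounds(82e-6, 81e-6)              # DNP

print(compartment)
print(dnp)
```

Because σ is close to the mean in both cases, the lower edge is 0 for both distributions, which is why the corresponding term in the inequality vanishes.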
A peptide that has threaded through UNP encounters the endopeptidase in trans1/cis2 and is cleaved into fragments. The latter translocate through trans1/cis2 and thread through MNP to be cleaved by the exopeptidase on the downstream side. Consider two successive fragments F1 and F2. Let LF1 be the length of fragment F1. The delay due to cleaving of F1 into single residues by the exopeptidase is LF1·Texo-min-2. Conditions 1a and 2b will be satisfied if where Texo-min-X is the cleavage time for residue X and the summation is over all LF1 residues in F1. In the second term on the right side of the inequality, σtrans1/cis2-F2 ≈ E(Ttrans1/cis2-F2), so that max (0, E(Ttrans1/cis2-F2) – 3σtrans1/cis2-F2) = 0; this leads to
Figure 6 in the main text shows the distribution of Tendo-min-2 with 10⁶ random peptide sequences with residues in a sequence drawn from a uniform distribution for three different fragment lengths.
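The random sampling of peptide sequences can be sketched as below. This is a minimal illustration, assuming the 20 standard amino acids in one-letter code and independent uniform draws per position; the function name `random_peptides` and the seed argument are hypothetical, and the translocation model applied to each sample is not reproduced here.

```python
import random

# 20 standard amino acids, one-letter codes (assumed alphabet)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_peptides(count, length, seed=None):
    """Draw `count` peptides of `length` residues, each residue uniform over 20 types."""
    rng = random.Random(seed)  # local generator so runs are reproducible
    return ["".join(rng.choices(AMINO_ACIDS, k=length)) for _ in range(count)]


# A small sample; the text uses 10**6 sequences per fragment length
sample = random_peptides(1000, 12, seed=0)
print(len(sample), len(sample[0]))
```

Each sampled sequence would then be passed to the translocation model to obtain the quantity whose distribution is plotted.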
Method 1. The development is similar to that for Method 2. Here Conditions 1a, 2a, and 3 have to be satisfied. With two successive fragments F1 and F2 cleaved by the endopeptidase, Conditions 1a and 2a require As before the second term on the right is 0 because σtrans1/cis2-F2 ≈ E(Ttrans1/cis2-F2) which leads to Figure 7 in the main text shows the pulse width distribution for Lf = 8, 12, and 16 based on 10⁶ random samples of peptide sequences with residues drawn from a uniform distribution. In each case the percentage of pulse widths < Lf μs (= Lf·Tdetector with Tdetector = 1 μs) is also indicated; these correspond to deletes. Figure 8 shows the distribution of Tendo-min-1 for a fragment length of 12 based on 10⁶ random samples of length 12. The distribution of pulse widths > 12 μs due to fragments of length 12 vs the endopeptidase reaction time is shown in Figure 9.
A-6 Peptidases and chemicals for cleaving and their specificities
Table A-4 is a summary of selected chemicals and peptidases for use in cleaving of the unknown protein or peptides generated from it at desired locations; it is adapted from [16]. The following notation is used for cleavage sites on a substrate [39]: where – represents a peptide bond, N is the N-terminal end, and C is the C-terminal end.
A-7 Additional notes
1) Order of fragment entry into DNP. A fragment can enter DNP amino-end first or carboxy-end first. However the order is not important as the information sought is the number of residues, not their identity or sequence.
2) Order of entry of peptide into UNP. The assembly algorithm described in Section 2 implicitly assumes that peptides enter UNP in the same orientation in every cell, either all N-terminal first or all C-terminal first. This is a reasonable assumption because of the charged X-header. However, there is a non-zero probability that a peptide may enter wrong end first, so some of the fragment length lists obtained will be in the reverse order. The assembly algorithm can be modified to take this into account.
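One way to handle a reversed list is to convert every fragment length list to cleavage positions measured from the same end of the peptide. The sketch below illustrates only that normalization step; the function name `cut_positions` and the example lists are hypothetical, and how a wrong-end-first entry is detected in the first place is not addressed here.

```python
from itertools import accumulate


def cut_positions(fragment_lengths, reversed_entry=False):
    """Convert a list of fragment lengths to cleavage positions from a fixed end.

    If the peptide entered wrong end first, the list is reversed before the
    running sum is taken, so both orientations yield the same positions.
    """
    lengths = list(fragment_lengths)
    if reversed_entry:
        lengths.reverse()
    # Running totals mark the cuts; the last total is the peptide length, not a cut
    return list(accumulate(lengths))[:-1]


# Hypothetical peptide of 12 residues cleaved after positions 3 and 8
print(cut_positions([3, 5, 4]))
print(cut_positions([4, 5, 3], reversed_entry=True))
```

Both calls give the same cleavage positions, so a list flagged as reversed can be merged with the others once it is normalized.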
3) Applied voltage and current levels. The blockades are reductions in the ionic current through the pore, which is carried by K⁺ and Cl⁻ ions in the electrolyte; with V ≈ 100 mV this current is ∼100 pA (≈ Gpore·V, where Gpore is the conductance of the pore, typically 1 nS for a pore ∼10 nm thick), usually adequate for measuring blockades [6]. With thicker pores blockade levels may be lower. In the presence of noise there is a tradeoff between detectable pulse amplitude changes and translocation speed. While a higher voltage results in a higher blockade current and higher signal-to-noise ratios (SNR), it also causes a fragment or residue with high negative charge to translocate through DNP at a rate that exceeds 1/Tdetector, and one with high positive charge to translocate too slowly, resulting in misses or ‘loss’ to diffusion respectively. These extremes have been estimated in Section A-3 above. (The upper limit to the applied voltage is set by the breakdown field for the electrolyte, typically ∼70 MV/m.)
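The two estimates in this note can be checked with simple arithmetic. The sketch below is illustrative: the function names are hypothetical, the pore is treated as an ohmic conductor, and the voltage limit assumes a uniform field across a pore of length 100 nm (one of the lengths used in Table A-2), which is a simplification.

```python
def open_pore_current(conductance_s, voltage_v):
    """Ohmic estimate of the pore current in amperes: I = G * V."""
    return conductance_s * voltage_v


def max_voltage(breakdown_field_v_per_m, pore_length_m):
    """Voltage at which a uniform field across the pore reaches breakdown."""
    return breakdown_field_v_per_m * pore_length_m


# G = 1 nS and V = 100 mV, as quoted in the text
i_open = open_pore_current(1e-9, 100e-3)
print(f"{i_open * 1e12:.0f} pA")

# Breakdown field ~70 MV/m; 100 nm pore length is an assumed example
v_max = max_voltage(70e6, 100e-9)
print(f"{v_max:.1f} V")
```

Under these assumptions the breakdown limit lies well above the ∼100 mV operating voltage, so the practical constraint on V comes from translocation speed rather than breakdown.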
4) Entropy barriers. It is assumed that the entropy barrier [6] faced by a fragment during its entry into DNP (Method 1) or MNP (Method 2) is negligible, in part because short peptides have been considered. Long peptides may form secondary structures and also ball up, impeding entry into a pore. In this case, the barrier may not be negligible; it can be taken into account by increasing the minimum cleaving intervals required of the enzymes. The taper in trans2/cis3 (Figures 1 and 2) also helps in lowering the entropy barrier. Based on the computational results discussed above, the two methods presented here appear well suited to sequencing of peptides with 12-16 residues. (Compare with the optimum peptide length in an efficient mass spectrometer, which is ∼20 [3].)
5) Independence of cells. Each cell targets a different amino acid and operates independently of the other cells. This means that each cell can be independently optimized for enzyme reaction rates, applied voltage, pH value, etc.
6) Sticky fragments/residues. The problem of fragments or residues sticking to pore or compartment walls may be resolved through the use of non-stick additives [40] or wall coatings [41].
7) Sequencing with the potential reversed. A peptide can be sequenced with the applied potential reversed, which speeds up fragments with positive charge and slows down those with negative charge; neutral fragments are not affected. (If the pore is ion-sensitive, one with the appropriate sense may be used.) Merging the two sets of data can lead to improvements in detection and correction of errors, but this is only for charged fragments. The error can be minimized over all fragments, charged or neutral, by experimentally varying the pH and finding the pH value that yields the best results.
8) Hafnium oxide pores. Recent studies using high bandwidth (∼4 MHz) detectors have shown that a HfO2 membrane < 10 nm thick can slow down translocating DNA molecules [42]. (The slowdown is believed to be due to interactions of the DNA with the walls of the pore.) At the present time, however, fabrication seems to require an inordinate amount of time.
9) Applicability to DNA sequencing. The counting-based sequencing approach described in the main text can be applied to DNA sequencing if four endonucleases that are distinct and specific to the four nucleotide types can be found or synthesized and can be covalently (or otherwise) attached to the trans side of a pore. This could simplify DNA sequencing considerably.
For other implementation-related issues affecting tandem cells see discussions in [11,12].