Abstract
Folded states of single domain globular proteins, the workhorses in cells, are compact with high packing density. It is known that the radius of gyration, Rg, of both the folded and unfolded (created by adding denaturants) states increases as N^ν, where N is the number of amino acids in the protein. The values of the celebrated Flory exponent ν are, respectively, ≈ 1/3 and ≈ 0.6 in the folded and unfolded states, which coincide with those found in homopolymers in poor and good solvents, respectively. However, the extent of compaction of the unfolded state of a protein under low denaturant concentration, conditions favoring the formation of the folded state, is unknown. This problem, which goes to the heart of how proteins fold, with implications for the evolution of foldable sequences, is unsolved. We develop a theory based on polymer physics concepts that uses the contact map of proteins as input to quantitatively assess collapsibility of proteins. The model, which includes only two-body excluded volume interactions and attractive interactions reflecting the contact map, has only expanded and compact states. Surprisingly, we find that although protein collapsibility is universal, the propensity to be compact depends on the protein architecture. Application of the theory to over two thousand proteins shows that the extent of collapsibility depends not only on N but also on the contact map reflecting the native fold structure. A major prediction of the theory is that β-sheet proteins are far more collapsible than structures dominated by α-helices. The theory and the accompanying simulations, validating the theoretical predictions, fully resolve the apparent controversy between conclusions reached using different experimental probes assessing the extent of compaction of a couple of proteins. As a by-product, we show that the theory correctly predicts the scaling of the collapse temperature of homopolymers as a function of the number of monomers.
By calculating the criterion for collapsibility as a function of protein length we provide quantitative insights into the reasons why single domain proteins are small and the physical reasons for the origin of multi-domain proteins. We also show that non-coding RNA molecules, whose collapsibility is similar to proteins with β-sheet structures, must undergo collapse prior to folding, adding support to the “Compactness Selection Hypothesis” proposed in the context of RNA compaction.
1. INTRODUCTION
Folded states of globular proteins, which are evolved (slightly) branched heteropolymers made from twenty amino acids, are roughly spherical and are nearly maximally compact with high packing densities [1–3]. Despite achieving high packing densities in the folded states, globular proteins tolerate large volume substitutions while retaining the native fold [4]. This is explained in a couple of interesting theoretical studies [5, 6], which demonstrated that there is sufficient free volume in the folded state to accommodate mutations. Collectively these and related studies show that folded proteins are compact. When they unfold, which can be achieved upon addition of high concentrations of denaturants (or applying a mechanical force), they swell, adopting expanded conformations. The radius of gyration (Rg) of a folded globular protein is well described by the Flory law, Rg ∝ N^ν with ν ≈ 1/3 [7], whereas in the swollen state Rg ≈ aD N^ν, where aD is an effective monomer size and the Flory exponent ν ≈ 0.6 [8]. Thus, viewed from this perspective we could surmise that proteins must undergo a coil-to-globule transition [9, 10], a process that is reminiscent of the well characterized equilibrium collapse transition in homopolymers [11, 12]. The latter is driven by the balance between conformational entropy and intra-polymer interaction energy resulting in the collapsed globular state. The swollen state is realized in good solvents (interaction between monomer and solvents is favorable) whereas in the collapsed state monomer-monomer interactions are preferred. The coil-to-globule transition in large homopolymers is akin to a phase transition. The temperature at which the interactions between the monomers roughly balance monomer-solvent energetics is the θ temperature. By analogy, we may identify high (low) denaturant concentrations with good (poor) solvent for proteins.
Despite the expected similarities between the equilibrium collapse transition in homopolymers and the compaction of proteins, it is still debated whether the unfolded states of proteins under folding conditions are more compact compared to the states created at high denaturant concentrations. If polypeptide chain compaction is universal, is collapse in proteins essentially the same phenomenon as in homopolymer collapse or is it driven by a different mechanism [13–17]? Surprisingly, this fundamental question in the protein folding field has not been answered satisfactorily [10, 18]. In order to explain the plausible difficulties in quantifying the extent of compaction, let us consider a protein, which undergoes an apparent two-state transition from an unfolded (swollen) to a folded (compact) state as the denaturant concentration (C) is decreased. At the concentration, Cm, the populations of the folded and unfolded states are equal. A vexing question, which has been difficult to unambiguously answer in experiments, is: what is the size, Rg, of the unfolded state under folding conditions (C < Cm)? Small Angle X-ray Scattering (SAXS) experiments on some proteins show practically no change in the unfolded Rg as C is changed [19]. On the other hand, from experiments based on single molecule Fluorescence Resonance Energy Transfer (smFRET) it has been concluded that the unfolded state is more compact below Cm than at high C [20, 21]. The so-called smFRET-SAXS controversy is unresolved. Resolving this apparent controversy is not only important in our understanding of the physics of protein folding but also has implications for the physical basis of the evolution of natural sequences.
The difficulties in describing the collapse of unfolded states as C is lowered could be attributed to the following reasons. (1) Following de Gennes [22], homopolymer collapse can be pictured as formation of a large number of blobs driven by local interactions between monomers on the scale of the blob size. Coarsening of blobs results in the formation of the equilibrium globule, with the number of maximally compact conformations scaling exponentially with the number of monomers. Other scenarios resulting in fractal globules, en route to the formation of equilibrium maximally collapsed structures, have also been proposed [23]. The globule formation is driven by non-specific interactions between the monomers or the blobs. Regardless of how the equilibrium globule is reached it is clear that it is largely stabilized by local interactions, because contacts between monomers that are distant along the sequence are entropically unfavorable. In contrast, even in high denaturant concentrations proteins could have residual structure, which likely becomes prominent at C < Cm. At low C there are specific favorable interactions between residues separated by a few or several residues along the sequence. As their strength grows, with respect to the entropic forces, the specific interactions may favor compaction in a manner different from the way non-specific local interactions induce homopolymer collapse. In other words, the dominant native-like contacts also drive compaction of unfolded states of proteins. (2) A consequence of the impact of the native-like contacts (local and non-local) on collapse of unfolded states is that specific energetic considerations dictate protein compaction resulting in the formation of minimum energy compact structures (MECS) [24]. The number of MECS, which are not fully native, is small, scaling as ln N with N being the number of amino acid residues.
Therefore, below Cm their contributions to Rg have to be carefully dissected, which is more easily done in single molecule experiments than in ensemble measurements such as SAXS. (3) Single domain proteins are finite-sized with N rarely exceeding ~ 200. Most of those studied experimentally have N < 100. Thus, the extent of change in Rg of the unfolded states is predicted to be small, requiring high precision experiments to quantify the changes in Rg as C is changed. For example, in a recent study [25], we showed that in PDZ2 domain the change in Rg of the unfolded states as the denaturant concentration changes from 6 M guanidine chloride to 0 M is only about 8%. Recent experiments have also established that changes in Rg in helical proteins are small [20].
In homopolymers there are only two possible states, coil and globule, with a transition between the two occurring at Tθ. On the other hand, even in proteins that fold in a two-state manner one can conceive of at least three states (we ignore intermediates here): (i) the unfolded state UD at high C; (ii) the compact but unfolded state UC, which could possibly exist below Cm; (iii) the native state. Do the sizes of UD and UC differ? This question requires a clear answer as it impacts our understanding of how proteins fold, because the characteristics of the unfolded states of proteins play a key role in determining protein foldability [26–28].
Given the flexibility of proteins (persistence length on the order of 0.5 – 0.6 nm), we expect that the size of the extended polypeptide chain must gradually decrease as the solvent quality is altered. Experiments on a number of proteins show that this is the case [29–31]. However, in some SAXS experiments this theoretical expectation was not borne out for one protein [10, 19], precipitating a more general question: are chemically denatured proteins compact at low C? The absence of collapse is not compatible with inferences based on smFRET [21] and theory [26]. Here, we create a theory to not only resolve the smFRET-SAXS controversy but also provide a quantitative description of how the propensity to be compact is encoded in the native topology. The theory, based on polymer physics concepts, includes specific attractive interactions (mimicking interactions accounting for native contacts in the Protein Data Bank (PDB)) and a two-body excluded volume repulsion. By construction the model does not have a native state. In order to validate the theoretical predictions, we performed simulations using a completely different model often used in protein folding simulations. In both models, there are only two states (analogues of UD and UC). The formation of UC is driven by the contact map of the folded state. Thus, chain compaction is driven in much the same way as in homopolymers, altered only by specific interactions that differentiate proteins from homopolymers.
Theory and simulations predict how the extent of compaction (collapsibility) is determined by the strength and the number of the native contacts and their locations along the chain. We use a large representative selection of proteins from the PDB to establish that collapsibility is an inherent characteristic of evolved protein sequences. A major outcome of this work is that β-sheet proteins are far more collapsible than structures dominated by α-helices. Our theory suggests that there is an evolutionary pressure on proteins for being compact as a pre-requisite for kinetic foldability, as we predicted over twenty years ago [26]. We come to the inevitable conclusion that the unfolded state of proteins must be compact under native conditions, and the mechanism of polypeptide chain compaction has similarities as well as differences to collapse in homopolymers. As a by-product of this work, we also establish that certain non-coding RNA molecules must undergo compaction prior to folding as their folded structures are stabilized predominantly by long-range tertiary contacts.
2. THEORY
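We start with an Edwards Hamiltonian for a polymer chain [32]:
$$\beta\mathcal{H} = \frac{3}{2a_0^2}\int_0^N \left(\frac{\partial \mathbf{r}(s)}{\partial s}\right)^2 ds + \beta\mathcal{V}(\mathbf{r}(s)), \tag{1}$$
where r(s) is the position of the monomer s, a0 the monomer size, and N is the number of monomers. The first term in Eq. (1) accounts for chain connectivity, and the second term represents volume interactions and favorable interactions between select monomers given by 𝒱(r(s)),
$$\beta\mathcal{V}(\mathbf{r}(s)) = \frac{\upsilon}{(2\pi a_0^2)^{3/2}}\sum_{s=0}^{N}\sum_{s'=0}^{N}\exp\left[-\frac{(\mathbf{r}(s)-\mathbf{r}(s'))^2}{2a_0^2}\right] - \frac{k}{(2\pi\sigma^2)^{3/2}}\sum_{\{s_i,s_j\}}\exp\left[-\frac{(\mathbf{r}(s_i)-\mathbf{r}(s_j))^2}{2\sigma^2}\right]. \tag{2}$$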
The first term in Eq. (2) accounts for the homopolymer (non-specific) two-body interactions. It is well established in the theory of homopolymers that in good solvents with υ > 0 the polymer swells with Rg ~ aN^ν (ν ≈ 0.6). In poor solvents (υ < 0) the polymer undergoes a coil-globule transition with Rg ~ aN^ν (ν ≈ 1/3). These are the celebrated Flory laws. Here, we consider only the excluded volume repulsion case (υ > 0).
The second term in Eq. (2) requires an explanation. The generic scenario for homopolymer collapse is based on an observation by de Gennes, who pictured the collapse process as being driven by the initial formation of blobs that arrange to form a sausage-like structure. At later stages the globule forms to maximize favorable intra-molecular contacts while simultaneously minimizing surface tension. Compaction in proteins, although it shares many features with homopolymer collapse, could be different. A key difference is that the folded states of almost all proteins are stabilized by a mixture of local contacts (interaction between residues separated by less than say ~ 8 but greater than 3 residues) as well as non-local (> 8 residues) contacts. Note that the demarcation using 8 between local and non-local contacts is arbitrary, and is not germane to the present argument. These specific interactions also dominate the enthalpy of formation of the compact, non-native state UC, playing an important role in its stability. Previous studies using lattice models of proteins in two [33] and three [34] dimensions showed that formation of compact but unfolded states is predominantly driven by native interactions with non-native interactions playing a sub-dominant role. A more recent study [35], analyzing atomically detailed folding trajectories, has arrived at the same conclusion. Therefore, our assumption is that the topology of the folded state could dictate collapsibility (the extent to which the UD state becomes compact as the denaturant concentration is lowered) of a given protein. In combination with the finite size of single domain proteins (N ~ 200), the extent of protein collapse could be small. In order to assess chain compaction under native conditions we should consider the second term in Eq. (2).
It is worth mentioning that several studies investigated the consequences of optimal packing of polymer-like representations of proteins [36–42]. These studies primarily explain the emergence of secondary structural elements by considering only hard core interactions, attractive interactions due to crowding effects [40, 43], or formation of compact states induced by anisotropic attractive patchy interactions [42]. However, the absence of tertiary interactions in these models, which give rise to compact states of varying topologies, prevents them from addressing the coil-to-globule transition. This requires creating a microscopic model along the lines described here.
We note in passing (with discussion to follow) that a number of studies have considered the effect of crosslinks on the shape of polymer chains [44–50]. Polymers with crosslinks have served as models for polymer gels and rubber elasticity [51–53]. In these studies the contacts were either random, leading to the random loop model [45], or explicit averages over the probability of realizing such contacts were made [44, 54], as may be appropriate in modeling gels. These studies inevitably predict a coil-to-globule phase transition as the number of crosslinks increases.
In contrast to models with random crosslinks, in our theory attraction exists only between specific residues, described by the second term in Eq. (2), where the sum is over the set of interactions (native contacts) involving pairs {si, sj}. We use the contact map of the protein (extracted from the PDB structure) in order to assign the specific interactions (their total number being Nnc). A contact is assigned to any two residues si and sj if the distance between their Cα atoms in the PDB entry is less than Rc = 0.8 nm and |si − sj| > 2. We use Gaussian potentials in order to have short (but finite) range attractive interactions. For the excluded volume repulsion, this range is on the order of the size of the monomer, a0 = 0.38 nm. For the specific attraction, the range is the average distance in the PDB entry between Cα atoms forming a contact (averaged across a selection of proteins from the PDB). We obtain σ = 0.63 nm.
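The contact criterion stated above (Cα–Cα distance below Rc = 0.8 nm and sequence separation |si − sj| > 2) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `native_contacts` and the input format (a list of Cα coordinates in nm, ordered along the chain) are assumptions made for this example.

```python
import math

def native_contacts(ca_coords, r_c=0.8, min_sep=3):
    """Return native-contact pairs (i, j), i < j, from C-alpha coordinates.

    ca_coords : sequence of (x, y, z) tuples in nm, ordered along the chain
                (hypothetical input format chosen for this sketch).
    r_c       : distance cutoff in nm (0.8 nm in the text).
    min_sep   : minimal sequence separation; |i - j| > 2 means j - i >= 3.
    """
    contacts = []
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + min_sep, n):
            if math.dist(ca_coords[i], ca_coords[j]) < r_c:
                contacts.append((i, j))
    return contacts

# Toy example: four residues on a square of side 0.38 nm.
square = [(0.0, 0.0, 0.0), (0.38, 0.0, 0.0), (0.38, 0.38, 0.0), (0.0, 0.38, 0.0)]
print(native_contacts(square))  # only residues 0 and 3 are both close and >2 apart
```

The number of pairs returned plays the role of Nnc, and the list itself is the contact map that enters the second term of Eq. (2).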
By changing the value of k, and hence the strength of attraction, there is a transition between the extended and compact states. Decreasing k is analogous to chemically denaturing proteins, although the connection is not precise. At high denaturant concentrations (k ≈ 0, good solvent) the excluded volume repulsion (first term in Eq. (2)) dominates the attraction, while at low C (high k, poor solvent) the attractive interactions are important. The point where attraction balances repulsion is the θ-point, and the corresponding value of k is kθ. Although the term is reserved for the coil-to-globule transition in the limit of N ≫ 1 in homopolymers, we will use the same notation (θ-point) here. In our model, at the θ-point, the chain behaves like an ideal chain. To describe the globular state, a three-body repulsion needs to be added to the Hamiltonian (Eq. (2)), but we focus on the region between the extended coil and the θ-point because our interest is to assess only the collapsibility of proteins. If kθ is very large then significant chain compaction would only occur at very low (C ≪ Cm) denaturant concentrations, implying low propensity to collapse. Conversely, small kθ implies ease of collapsibility. Note that the ground state (k ≫ 1) of the Hamiltonian in Eq. (2) is a collapsed chain whose Rg is on the order of the monomer size. In other words, a stable native state does not exist for the model described in Eq. (2). Thus, we define protein collapse as the propensity of the polypeptide chain to reach the θ-point as measured by the kθ value, and use the changes in the radius of gyration Rg as a measure of the extent of compaction.
Assessing collapsibility: For our model, which encodes protein topology without favoring the folded state, we calculate ⟨Rg²⟩ using the Edwards-Singh (ES) method [55]. Although from a technical viewpoint the ES method has pros as well as cons, numerous applications show that in practice it yields physically sensible results on a number of systems. First, ES showed that the method does give the correct dependence of ⟨Rg²⟩ on N for homopolymers. Second, even when attractive interactions are included, the ES method leads to predictions that have been subsequently verified by more sophisticated theories. An example of particular relevance here is the problem of the size of a polymer in the presence of obstacles (crowding particles). The results of the ES method [56] and those obtained using renormalization group calculations [57] are qualitatively similar. Here, we adopt the ES method, allowing us to deduce more far-reaching conclusions for protein collapsibility than is possible solely based on simulations. We use simulations on a limited set of proteins to further justify the conclusions reached using the analytic theory.
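The ES method is a variational type calculation that represents the exact Hamiltonian by a Gaussian chain, whose effective monomer size is determined as follows. Consider a virtual chain without excluded volume interactions, with the radius of gyration ⟨Rg²⟩ = Na²/6 [55], described by the Hamiltonian,
$$\beta\mathcal{H}_\upsilon = \frac{3}{2a^2}\int_0^N \left(\frac{\partial \mathbf{r}(s)}{\partial s}\right)^2 ds,$$
where the monomer size in the virtual Hamiltonian is a. We split the deviation 𝒲 between the virtual chain Hamiltonian and the real Hamiltonian as,
$$\mathcal{W} = \beta\mathcal{H} - \beta\mathcal{H}_\upsilon = \mathcal{W}_1 + \mathcal{W}_2,$$
where
$$\mathcal{W}_1 = \frac{3}{2}\left(\frac{1}{a_0^2} - \frac{1}{a^2}\right)\int_0^N \left(\frac{\partial \mathbf{r}(s)}{\partial s}\right)^2 ds, \qquad \mathcal{W}_2 = \beta\mathcal{V}(\mathbf{r}(s)).$$
The radius of gyration is
$$\langle R_g^2\rangle = \frac{1}{N}\int_0^N \langle \mathbf{r}^2(s)\rangle\, ds,$$
with the average being,
$$\langle \mathbf{r}^2(s)\rangle = \frac{\langle \mathbf{r}^2(s)\, e^{-\mathcal{W}}\rangle_\upsilon}{\langle e^{-\mathcal{W}}\rangle_\upsilon},$$
where ⟨· · ·⟩υ denotes the average over ℋυ.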
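Assuming that the deviation 𝒲 is small, we calculate the average to first order in 𝒲. The result is,
$$\langle \mathbf{r}^2(s)\rangle \approx \langle \mathbf{r}^2(s)\rangle_\upsilon + \langle \mathbf{r}^2(s)\rangle_\upsilon\langle\mathcal{W}\rangle_\upsilon - \langle \mathbf{r}^2(s)\mathcal{W}\rangle_\upsilon,$$
and the radius of gyration is
$$\langle R_g^2\rangle \approx \frac{1}{N}\int_0^N \left[\langle \mathbf{r}^2(s)\rangle_\upsilon + \langle \mathbf{r}^2(s)\rangle_\upsilon\langle\mathcal{W}\rangle_\upsilon - \langle \mathbf{r}^2(s)\mathcal{W}\rangle_\upsilon\right] ds.$$
If we choose the effective monomer size a in ℋυ such that the first order correction (second and third terms on the right hand side of Eq. (A5)) vanishes, then the size of the chain is ⟨Rg²⟩ = Na²/6. This is an estimate to the exact ⟨Rg²⟩, and is an approximation as we have neglected 𝒲² and higher powers of 𝒲. Thus, in the ES theory, the optimal value of a from Eq. (A5) satisfies,
$$\frac{1}{N}\int_0^N \left[\langle \mathbf{r}^2(s)\rangle_\upsilon\langle\mathcal{W}\rangle_\upsilon - \langle \mathbf{r}^2(s)\mathcal{W}\rangle_\upsilon\right] ds = 0.$$
Since 𝒲 = 𝒲1 + 𝒲2, the above equation can be written as
$$\int_0^N \left[\langle \mathbf{r}^2(s)\mathcal{W}_1\rangle_\upsilon - \langle \mathbf{r}^2(s)\rangle_\upsilon\langle\mathcal{W}_1\rangle_\upsilon\right] ds = -\int_0^N \left[\langle \mathbf{r}^2(s)\mathcal{W}_2\rangle_\upsilon - \langle \mathbf{r}^2(s)\rangle_\upsilon\langle\mathcal{W}_2\rangle_\upsilon\right] ds.$$
Evaluation of the ⟨r²(s)𝒲1⟩υ term yields,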
With the help of Eq. (11) and Eq. (9) we obtain a self-consistent expression for a (Eq. (12)). Calculating the averages in Fourier space, we obtain Eq. (13).
The best estimate of the effective monomer size a can be obtained by numerically solving Eq. (13) provided the contact map is known. A bound for the actual size of the chain is ⟨Rg²⟩ = Na²/6. Because we are interested only in the collapsibility of proteins we use the definition of the θ-point to assess the condition for protein compaction instead of solving the complicated Eq. (13) numerically. The volume interactions are on the right hand side of Eq. (13). At the θ-point, the υ-term should exactly balance the k-term. Since at the θ-point the chain is ideal with a = a0, we can substitute this value for a in the sums in the denominators of the υ- and k-terms. By equating the two, we obtain an expression for kθ. Thus, from Eq. (13), the specific interaction strength at which two-body repulsion (υ-term) equals two-body attraction (k-term) is given by Eq. (14). The numerator in Eq. (14) is a consequence of chain connectivity and the denominator encodes protein topology through the contact map, determining the extent to which the sizes in UD and UC states change as C becomes less than Cm. The numerical value of kθ is a measure of collapsibility.
A comment about the solution of Eq. (13) for a is worth making. For k = 0, corresponding to the good solvent condition, we expect that a ≫ a0. In this case, analysis of Eq. (13), in the manner described in Appendix A, shows that there is only one solution with a ~ N^0.1 (so that Rg ~ N^0.6). Similarly, at kθ Eq. (13) also admits only one solution. Thus, from the structure of Eq. (13) we surmise there are no multiple solutions, at least in the extreme limits υ = 0 and k = 0.
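As a consistency check on the variational condition of the ES method, the contribution of 𝒲1 to the first order correction can be evaluated by a scaling argument. The sketch below assumes the standard ES form 𝒲1 = (3/2)(1/a0² − 1/a²)A with A = ∫(∂r/∂s)² ds, a chain discretized so that all averages are finite, and uses only the fact that ⟨r²(s)⟩υ is proportional to a². The shorthand λ = 3/(2a²) is introduced here for convenience and is not part of the original notation.

```latex
% Sketch under the assumptions stated in the lead-in; \lambda = 3/(2a^2).
\begin{align}
\langle \mathbf{r}^2(s)\rangle_\upsilon
  &= \frac{\int \mathbf{r}^2(s)\, e^{-\lambda A}\,\delta\mathbf{r}}
          {\int e^{-\lambda A}\,\delta\mathbf{r}} \;\propto\; \frac{1}{\lambda}, \\
-\frac{\partial \langle \mathbf{r}^2(s)\rangle_\upsilon}{\partial \lambda}
  &= \langle \mathbf{r}^2(s)\, A\rangle_\upsilon
     - \langle \mathbf{r}^2(s)\rangle_\upsilon \langle A\rangle_\upsilon
   = \frac{\langle \mathbf{r}^2(s)\rangle_\upsilon}{\lambda}
   = \frac{2a^2}{3}\,\langle \mathbf{r}^2(s)\rangle_\upsilon, \\
\langle \mathbf{r}^2(s)\,\mathcal{W}_1\rangle_\upsilon
  - \langle \mathbf{r}^2(s)\rangle_\upsilon \langle \mathcal{W}_1\rangle_\upsilon
  &= \left(\frac{a^2}{a_0^2} - 1\right)\langle \mathbf{r}^2(s)\rangle_\upsilon .
\end{align}
```

In the absence of volume interactions (υ = 0 and k = 0) the first order correction therefore vanishes only for a = a0, so the variational chain reduces to the ideal chain with the bare monomer size, as expected.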
The expression for kθ (Eq. (14)) is equally applicable to homopolymers in which contacts between all monomers are allowed, provided the self-avoidance condition is not violated. In Appendix A, we derive an expression for kθ ∝ Tθ ~ υ(1 − (υN^−0.5)/2). Thus, our model correctly reproduces the known N dependence of Tθ obtained long ago by Flory [58] using insightful mean field arguments.
3. RESULTS
Native topology determines collapsibility: The central result in Eq. (14) can be used to quantitatively predict the extent to which a given protein has a propensity to collapse. We used a list of proteins with low mutual sequence identity selected from the Protein Data Bank, PDBselect [59], and calculated kθ using Eq. (14) for these proteins. In all we considered 2306 proteins. For each contact (i,j), the energetic contribution due to interaction between i and j is k̄ = (2πσ²)^−3/2 k according to Eq. (2). Thus, k̄θ = (2πσ²)^−3/2 kθ is the average strength (in units of kBT) of a contact at the θ-point. If k̄θ, calculated using Eq. (14), is too large then the extent of polypeptide chain collapse is expected to be small. It is worth reiterating that the theory cannot be used to determine the stability of the folded state, because in the Hamiltonian there are only two states, UD (k = 0 in Eq. (2)) and UC (k > kθ).
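The conversion between the coupling k in Eq. (2) and the per-contact strength (2πσ²)^−3/2 k can be made concrete with a short sketch. The function names are hypothetical, and the unit convention (k in kBT·nm³ when σ is in nm, so that the per-contact strength is in kBT) is an assumption that follows from treating the attraction as a normalized Gaussian.

```python
import math

SIGMA_NM = 0.63  # range of the specific attraction quoted in the text (nm)

def contact_strength(k, sigma=SIGMA_NM):
    """Per-contact strength (2*pi*sigma^2)^(-3/2) * k.

    Assumes k is given in kBT * nm^3, so the result is in kBT.
    """
    return k / (2.0 * math.pi * sigma ** 2) ** 1.5

def coupling_from_strength(k_bar, sigma=SIGMA_NM):
    """Inverse conversion: coupling k that yields a per-contact strength k_bar."""
    return k_bar * (2.0 * math.pi * sigma ** 2) ** 1.5

# Coupling that corresponds to a per-contact strength of 3 kBT
print(coupling_from_strength(3.0))
```

With σ = 0.63 nm the conversion factor is roughly 0.25 nm⁻³, so a per-contact strength of a few kBT corresponds to a coupling k roughly four times larger in these units.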
The strength of contacts in real proteins (excluding possibly salt bridges) is typically on the order of a few kBT in the absence of denaturants. This is the upper bound for the contact strength any theory should predict, as adding denaturant only decreases the strength. If k̄θ is unrealistically high (tens of kBT) then the attractive interactions of the protein would be too weak to counteract the excluded volume repulsion even at zero denaturant concentration, resulting in negligible difference in Rg between the UD and UC states.
Fig.(1a) shows a two-dimensional histogram of the PDBselect proteins in the (N,kθ) plane. For the majority of small proteins (less than 150 residues) the value of kθ is less than 3 kBT, indicating that the unfolded states of all of these proteins should become compact at C < Cm. That collapse must occur, as predicted by our theory and established previously in lattice [26], and off-lattice models of proteins [60], does not necessarily imply that it can be easily detected in standard scattering experiments, because the changes could be small requiring high precision experiments (see below).
Weight function of a contact: For a given N, the criterion for collapsibility in Eq. (14) depends on the architecture of the proteins explicitly represented in the denominator through the contact map. Analysis of the weight function of a contact, defined below, provides a quantitative measure of how a specific contact influences protein compaction. Some contacts may facilitate collapse to a greater extent than others, depending on the location of the pair of residues in the polypeptide chain. In this case, the same number of native contacts Nnc in the protein of the same length N might yield a lower (easier collapse) or higher (harder collapse) value of kθ. In order to determine the relative importance of the contacts with respect to collapse, we consider the contribution of the contact between residues i and j in the denominator of Eq. (14), which defines the weight function W(i − j) (Eq. (15)). A plot of W(i − j) in Fig.(1b) for different values of the chain length N shows that the weight depends on the distance between the residues along the chain. Contacts between neighboring residues have negligible weight, and there is a maximum in W(i − j) at i − j ≈ 30 (for a0/σ = 0.6), almost independent of the protein length. The maximum is at a higher value for proteins with N > 100 residues. The figure further shows that longer range contacts make a greater contribution to chain compaction than short range contacts. The results in Fig. (1b) imply that proteins with a large fraction of non-local contacts are more easily collapsible than those dominated by short range contacts, which we elaborate further below.
Maximum and minimum collapsibility boundaries: Using W(i − j) in Eq. (15), we can design protein sequences to optimize for “collapsibility”. To design a “maximally collapsible” protein, for fixed N and number of native contacts Nnc, we assign each of the Nnc contacts one by one to the pair i,j with a maximal W(i − j) among the available pairs with the criterion that |i − j| > 2. Such an assignment necessarily implies that the artificially designed contact map will not correspond to any known protein. Similarly, we can design an artificial contact map by selecting i,j pairs with minimal W(i − j) until all the Nnc contacts are assigned. Such a map, which will be dominated by local contacts, corresponds to a minimally collapsible structure.
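The greedy assignment described above can be sketched as follows. Because the weight function of Eq. (15) is not reproduced here, the sketch takes the weight as a callable argument; `toy_weight` is a hypothetical stand-in with a single maximum at a sequence separation of 30, chosen only to mimic the qualitative shape reported for Fig.(1b), and should not be read as the actual W(i − j).

```python
import math

def design_contact_map(n_residues, n_contacts, weight, maximize=True, min_sep=3):
    """Assign n_contacts contacts to the pairs (i, j) with extremal weight.

    weight   : callable taking the sequence separation j - i (stand-in for W).
    maximize : True for the maximally collapsible map, False for the minimal one.
    min_sep  : |i - j| > 2 is enforced as j - i >= 3.
    """
    pairs = [(i, j)
             for i in range(n_residues)
             for j in range(i + min_sep, n_residues)]
    if n_contacts > len(pairs):
        raise ValueError("more contacts requested than available pairs")
    # Sorting once by weight is equivalent to picking the best pair one by one.
    pairs.sort(key=lambda p: weight(p[1] - p[0]), reverse=maximize)
    return pairs[:n_contacts]

def toy_weight(sep, peak=30.0):
    """Hypothetical weight: vanishes at sep = 0, single maximum at sep = peak."""
    return sep * math.exp(-sep / peak)

max_map = design_contact_map(50, 5, toy_weight, maximize=True)
min_map = design_contact_map(50, 5, toy_weight, maximize=False)
print(max_map)  # contacts at separation 30
print(min_map)  # contacts at separation 3 (local contacts)
```

With a weight of this shape the minimally collapsible map is built entirely from local contacts, in line with the statement in the text.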
The white lines in Fig.(1a) show kθ of chains of length N with Nnc(N) contacts distributed in ways to maximize or minimize collapsibility. We estimated Nnc(N) ≈ 0.6N^γ, with γ ≈ 1.3, from the fit of the proteins selected from the PDBselect set (a fuller discussion is presented in Appendix A). Since the lines are calculated for Nnc from the fit over the entire set, and not from Nnc for every protein, there are proteins below the minimal and above the maximal curve in Fig.(1a). For a given protein, with N and Nnc defined by its PDB structure, kθ for all possible arrangements of native contacts is largely in between the maximally and minimally collapsible lines in Fig.(1a). The majority of proteins in our set are closer to the maximal collapsible curves, suggesting that the unfolded proteins have evolved to be compact under native folding conditions. This theoretical prediction is in accord with our earlier studies which suggested that foldability is determined by both collapse and folding transitions [26], and more recently supported by experiments [20].
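A power law of the form Nnc ≈ cN^γ can be estimated by linear least squares on log-transformed data. The sketch below is one common way to perform such a fit; whether the fit quoted above was done in log-log space is an assumption, and the function name `fit_power_law` is hypothetical.

```python
import math

def fit_power_law(lengths, contact_counts):
    """Least-squares fit of log(Nnc) = log(c) + gamma * log(N).

    Returns (c, gamma). Inputs are sequences of positive numbers.
    """
    xs = [math.log(n) for n in lengths]
    ys = [math.log(m) for m in contact_counts]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    gamma = sxy / sxx
    c = math.exp(y_mean - gamma * x_mean)
    return c, gamma

# Synthetic data generated from the fitted form quoted in the text
lengths = [50, 100, 150, 200, 300]
counts = [0.6 * n ** 1.3 for n in lengths]
print(fit_power_law(lengths, counts))
```

Applied to (N, Nnc) pairs obtained from real contact maps, the same function would return the prefactor and exponent that set the white lines in Fig.(1a).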
β-sheet rather than α-helical proteins undergo larger compaction: The weight function W (Eq. (15) and Fig.(1b)) suggests that contacts in α-helices (|i − j| = 4) only make a small contribution to collapse. Contacts corresponding to the maximum of W at i − j ≈ 30 are typically found in loops and long antiparallel β-sheets. Fig.(2) shows a set of proteins with high α-helix (> 90%) and a set with high content of β-sheets (> 70%) [61]. The values of kθ for the two sets are very distinct, so they barely overlap. We find that many of the α-helical proteins lie on or above the curve of minimal collapsibility while the rest are closer to the maximal collapsibility. The smaller β-rich proteins lie on the curve of maximal collapsibility slightly diverging from it as the chain length grows. These results show that the extent of collapse of proteins that are mostly α-helical is much less than those with predominantly β-sheet structures.
A note of caution is in order. The minimal collapsibility of most α-helical proteins in the set may be a consequence of some of them being transmembrane proteins, which do not fold in the same manner as globular proteins. Instead, the transmembrane α-helices are inserted into the membrane by the translocon, one by one, as they are synthesized. Such proteins would not have the evolutionary pressure to be compact.
Comparison between theory and simulations: The major conclusions, summarized in Figs.(1-2), are based on an approximate theory. In order to validate the theoretical predictions, we performed simulations for 21 proteins using realistic models (see Appendix B for details) that capture the known characteristics of the unfolded states of proteins and the coil to globule transition.
In accord with our theoretical predictions, Rg decreases as k increases. For k = 0, corresponding to the maximally expanded state (high denaturant concentration) we expect that Rg ≈ aD N^0.588. A plot of Rg versus N^0.588 is linear with a value of aD = 0.25 nm (Fig.3a). Remarkably, this finding is in accord with the experimental fit showing Rg ≈ aD N^0.588 with aD = 0.2 nm [8]. The modest increase in aD, compared to the experimental fit, predicted here can be explained by noting that in real proteins there is residual structure even at high denaturant concentrations whereas in our model this is less probable. The scaling shown in Fig. (3a) shows that the model used in the simulations provides a realistic picture of the unfolded states. We emphasize that the parameters in the simulations were not adjusted to obtain the correct Rg scaling or aD.
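The effective monomer size aD in Rg ≈ aD N^0.588 can be estimated from (N, Rg) pairs by a least-squares line through the origin in the variable N^0.588. This is a minimal sketch of one such estimate; the fitting procedure actually used for Fig.(3a) is not specified in the text, and `fit_monomer_size` is a hypothetical name.

```python
def fit_monomer_size(lengths, rg_values, nu=0.588):
    """Least-squares slope through the origin of Rg versus N**nu.

    Returns a_D in the same length units as rg_values.
    """
    xs = [n ** nu for n in lengths]
    return sum(x * y for x, y in zip(xs, rg_values)) / sum(x * x for x in xs)

# Synthetic check: data generated with a_D = 0.25 nm
lengths = [60, 100, 150, 250]
rg = [0.25 * n ** 0.588 for n in lengths]
print(fit_monomer_size(lengths, rg))
```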
In Fig. (4) we show the dependence of Rg as a function of k for three representative proteins along with their native and unfolded structures and contact maps. The α-helical protein myoglobin and β-lactoglobulin, which has a β-sheet architecture, have nearly the same number of amino acids, N ~ 150. The sizes of the two proteins are similar (Fig.4b) when k is small (k < 0.5), implying that the values of Rg in the unfolded states are determined solely by N (see Fig.3a). For each protein, we identified kθ from simulations with the k value at which is a minimum. Using this method, we find that the kθ value for β-lactoglobulin is less than for myoglobin. This result is consistent with the theoretical prediction, demonstrating that generically α proteins are less collapsible than β proteins. Interestingly, TIM barrel, an α/β protein with larger chain length (N = 246), collapses at kθ = 1.6, which is larger than the value for β-lactoglobulin but smaller than that for myoglobin (purple line in Fig.4b). These results are qualitatively consistent with theoretical predictions.
In Fig. (5), we compare the predicted kθ (Eq. (14)) and the values from simulations. The absolute values of kθ are different between simulations and theory because we used entirely different models to describe the coil to globule transition. The potential used in the theory, convenient for deriving an analytic expression for kθ, is far too soft to describe the structures of polypeptide chains. As a result the polypeptide chains explore small Rg values without significant energetic penalty. Such unphysical conformations are prohibited in the realistic model used in the simulations. Consequently, we expect that the theoretical values of kθ should differ from the values obtained in simulations. Despite the differences in the potentials used in theory and simulations, the trends in kθ predicted using theory are the same as in simulations. The Pearson correlation coefficient is ρ = 0.79. Since we examined only 21 proteins in simulations, far fewer than the 2306 proteins for which theoretical predictions were made, we analyzed the correlation data by the bootstrap method to ascertain the statistical significance of ρ. The estimated probability distribution of ρ is shown in Fig. (5b). The mean of the correlation coefficient is 0.78 and ρ90% > 0.61 with 90% confidence. The distribution is bimodal, indicating that there is at least one outlier in the data set, which is likely to be the three helix bundle B domain of Protein A (labeled 5 in Fig. (5)). For 20 proteins excluding Protein A, the distribution has a single peak (green broken line) with the mean 0.88 and ρ90% > 0.82 (green dotted line in Fig. (5)). From these results, we surmise that both theory and simulations qualitatively lead to the conclusion that proteins with β-sheet architecture are more collapsible than α-helical structures, which is one of the major predictions of this work.
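A bootstrap estimate of the distribution of the Pearson correlation coefficient can be sketched as below. This is an illustration of the general resampling technique, not the authors' analysis: the number of resamples, the seed, and the reading of ρ90% as the value exceeded by 90% of the resampled coefficients are all assumptions made for this example.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def bootstrap_pearson(xs, ys, n_resamples=2000, seed=0):
    """Resample (x, y) pairs with replacement and collect the correlation.

    Returns (mean, lower_90, samples), where lower_90 is the value that
    90% of the resampled coefficients exceed.
    """
    rng = random.Random(seed)
    n = len(xs)
    samples = []
    while len(samples) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2 or len(set(by)) < 2:
            continue  # correlation is undefined for a degenerate resample
        samples.append(pearson(bx, by))
    samples.sort()
    mean = sum(samples) / len(samples)
    lower_90 = samples[int(0.10 * len(samples))]
    return mean, lower_90, samples
```

Here `xs` and `ys` would hold the theoretical and simulated kθ values for the same set of proteins. A histogram of `samples` gives a distribution of the kind shown in Fig. (5b), and repeating the call with one protein removed shows how a single outlier affects its shape.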
Given that the simulations describe the characteristics of the unfolded states, we show in Fig.(3b) the variations in the probability distribution of Rg, P(Rg), for protein-L as a function of k. The broadest distribution, with k = 0, corresponds to the extended chain. We find that P(Rg) becomes narrower as the attractive strength (k) increases. The continuous shift to the compact state with a gradual increase in the attractive strength is consistent with experiments showing that unfolded proteins collapse as the denaturant concentration decreases. Thus, generally Rg of the UC state is less than that of the UD state. The end-to-end distance distribution, P(Ree), for different values of k in Fig.(3c) is broad at k = 0, corresponding to the unfolded protein. The average Ree decreases as the attractive strength increases, and the distribution becomes narrower. The results in Fig.(3) show that both Ree, which can be inferred using smFRET, and Rg (measurable using SAXS) are smaller in the UC state than in the UD state. However, the extent of decrease is greater in Ree than in Rg, an observation that has contributed to the smFRET-SAXS controversy.
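For concreteness, the two size measures compared above can be computed from a single Cα conformation as in the sketch below. It assumes equal-mass beads and coordinates given as (x, y, z) tuples in nm; the function names are illustrative and not taken from the simulation code.

```python
import math

def radius_of_gyration(coords):
    """Rg of a set of equal-mass beads: rms distance from the centroid."""
    n = len(coords)
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    rg2 = sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 + (p[2] - cz) ** 2
              for p in coords) / n
    return math.sqrt(rg2)

def end_to_end_distance(coords):
    """Ree: distance between the first and last beads of the chain."""
    return math.dist(coords[0], coords[-1])

# A fully stretched three-bead chain with bond length 0.38 nm.
rod = [(0.0, 0.0, 0.0), (0.38, 0.0, 0.0), (0.76, 0.0, 0.0)]
rg = radius_of_gyration(rod)
ree = end_to_end_distance(rod)
```

Accumulating `rg` and `ree` over sampled conformations at each k and histogramming them would give estimates of P(Rg) and P(Ree).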
RNAs are compact: There are major differences between how RNA and proteins fold [62]. In contrast to the apparent controversy in proteins, it is well established that RNA molecules are compact [63–65] at high ion concentrations or at low temperatures. Because our theory relies only on knowledge of the contact map, we used it to assess collapsibility of the Azoarcus ribozyme and the MMTV pseudoknot, merely to illustrate collapsibility of RNA (Fig. (6)). The kθ values (green stars in Fig. (2)) are close to the lower β-sheet line, indicating that these molecules must undergo compaction as they fold. This prediction from the theory is fully supported by both equilibrium and time-resolved SAXS experiments [66] on the Azoarcus ribozyme. In this case (N = 196) the changes are so large that collapse is readily observed even in low resolution experiments [67]. We should emphasize that the sizes of different RNAs (for example viral, coding, non-coding) vary greatly. For a fixed length, single-stranded viral RNAs have evolved to be maximally compact, which is rationalized in terms of the density of branching. Although the viral RNAs considered in [68] are much longer than the Azoarcus ribozyme, the notion that compaction is determined by the density of branching might be valid even when N ~ 200.
Dependence of kθ on the values of the cut-off: In order to ensure that the theoretical predictions do not change qualitatively if the cut-off values are changed, we varied them over a reasonable range. The reason for our choice of Rc is that in the majority of folding simulations using a Cα representation of proteins, Rc = 0.8 nm is typically used. Consider the variation of kθ with Rc, the cut-off used to define contacts, at a fixed σ = 0.63 nm. As Rc increases, the number of contacts also increases. From Eq. (14) it follows that kθ should decrease, which is borne out by the results in Fig.(8a). Reassuringly, the trends are preserved. In particular, the prediction that β-rich proteins are more collapsible than α-rich proteins holds irrespective of the value of Rc.
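The statement that the number of contacts grows with Rc can be illustrated with a small sketch that builds a contact list from Cα coordinates. The minimum sequence separation `min_sep` is a hypothetical parameter introduced here to exclude near-neighbor pairs; the text above specifies only the distance cut-off Rc.

```python
import math

def contact_pairs(coords, rc=0.8, min_sep=3):
    """Return residue pairs (i, j), j - i >= min_sep, closer than rc (nm)."""
    pairs = []
    n = len(coords)
    for i in range(n):
        for j in range(i + min_sep, n):
            if math.dist(coords[i], coords[j]) < rc:
                pairs.append((i, j))
    return pairs

# Toy five-bead conformation with 0.38 nm bonds (not a real protein).
toy = [(0.0, 0.0, 0.0), (0.38, 0.0, 0.0), (0.38, 0.38, 0.0),
       (0.0, 0.38, 0.0), (0.0, 0.38, 0.38)]

for rc in (0.5, 0.6, 0.8):
    print(rc, len(contact_pairs(toy, rc=rc)))
```

Because every pair counted at a smaller Rc is also counted at a larger one, the contact number is non-decreasing in Rc, which is the ingredient behind the decrease of kθ noted above.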
Fig.(8b) shows the changes in kθ for proteins as a function of σ (contact distance) for fixed Rc = 0.8 nm. The kθ values decrease with increasing σ. The predicted trend is independent of the precise value. It is worth emphasizing that the prediction based on simulations that the size of the proteins at kθ is about (5-8)% of the folded state was obtained using σ = 0.63 nm. This range is consistent with estimates based on experiments on a few proteins (see for example [69]). Higher values of σ would give sizes of the compact states of proteins that are less than the native state Rg.
4. DISCUSSION
We have shown that polymer chains with specific interactions, like proteins (but ones without a unique native state), become compact as the strength of the specific interaction changes. A clear implication is that the size of the UD state should decrease continuously as C decreases. In other words, the unfolded state under folding conditions is more compact than it is at high denaturant concentrations. Compaction is driven roughly by the same mechanism as the collapse transition in homopolymers in the sense that when the solvent quality is poor (below Cm) the size of the unfolded state decreases continuously. When the set of specific interactions is taken from protein native contacts in the PDB, our theory shows that the values of kθ are in the range expected for interaction between amino acids in proteins. This implies that collapsibility should be a universal feature of foldable proteins but the extent of compaction varies greatly depending on the architecture in the folded state. This is manifested in our finding that proteins dominated by β-sheets are more collapsible compared to those with α-helical structures.
Magnitude of kθ and plausible route to multi-domain formation: The scaling of kθ with N allows us to provide arguments for the emergence of multi-domain proteins. In Eqs. (13) or (14) the attractive (k-) and repulsive (v-) terms have the same structure. The only difference in their scaling with N is due to the difference in the sums (over all the monomers in the repulsive term and over native contacts in the attractive term). Double summation over all the monomers gives a factor of N^2 to the repulsive term. The summation over native contacts in the attractive term scales as Nnc. Therefore, to compensate for the repulsion, Nnc should scale as N^2. However, for a given protein with a certain length N and a certain number of contacts, it is not clear how the denominator in Eq. (14) scales with N. Empirically, we find that Nnc across a representative set of sequences scales as N^γ with γ at most ≈ 1.3 (Appendix A). Thus, it follows from Eq. (14) that kθ increases without bound as N continues to increase. Because this is unphysical, it would imply that proteins whose length exceeds a threshold value Nc cannot become maximally compact even at C = 0. An instability must ensue when N exceeds Nc. This argument in part explains why single domain proteins are relatively small [70].
Scaling of Nnc as a power law, N^γ, means that as the protein size grows, the value of kθ will deviate more and more from those found in globular proteins, implying that such proteins cannot be globally compact under physiologically relevant conditions. However, such an instability is not a problem because larger proteins typically consist of multiple domains. Thus, if the protein does not show collapse as a whole, the individual domains could fold independently, with lower values of kθ for each domain of the multi-domain protein. It would be interesting to know if the predicted onset of instability at Nc provides a quantitative way to assess the mechanism of formation of multi-domain proteins. Extension of the theory might yield interesting patterns in the assembly of multi-domain proteins. For instance, one can quantitatively ascertain if the N-terminal domains of large proteins, which emerge from the ribosome first, have higher collapsibility (lower kθ) than C-terminal domains.
SAXS-smFRET controversy resolved: Our theory resolves, at least theoretically, the contradictory results from SAXS and FRET experiments on compaction of small globular proteins. It has been argued, based predominantly on SAXS experiments on protein-L (N = 72), that Rg of the UD and UC states are virtually the same at denaturant concentrations that are less than Cm [19]. This conclusion is at variance not only with SAXS experiments on other proteins but also with the interpretation of smFRET data on a number of proteins. The present work, surveying over 2300 proteins, shows that the compact state has to exist, engendered by mechanisms that have much in common with homopolymer collapse. For protein-L, kθ = 1.7 kBT, a very typical value, is right on the peak of the heat map in Fig.(1). We have previously argued that because the change in Rg between the UD and UC states for small proteins is not large, high precision experiments are needed to measure the predicted changes in Rg between UC and UD. For protein-L the change is less than 10% [71], making its detection in ensemble experiments very difficult. Similar conclusions were reached in recent experiments [20]. A clear message from our theory is that, tempting as it may be, one cannot draw universal conclusions about polypeptide compaction by performing experiments on just a few proteins. One has to survey a large number of proteins with varying N and native topology to quantitatively assess the extent of compaction. Our theory provides a framework for interpreting the results of such experiments.
Random contact maps, local and non-local contacts: In order to differentiate collapsibility between evolved and random proteins, we created twelve random contact maps keeping the total number of contacts the same as in protein-L (see Fig.(7) for examples). For each of these pseudo-proteins we calculated kθ using Eq. (14). We find that for all the random contact maps the kθ values are less than for protein-L, implying that the propensity of the pseudo-proteins to become compact is greater than for the wild type. This finding is in accord with studies based on homopolymer and heteropolymer collapse with random crosslinks. These studies showed that the polymer undergoes a collapse transition as the density of crosslinks is increased [45, 47, 48]. Of particular note is the demonstration by Camacho and Schanker [50], who showed using exact enumeration of random heteropolymers and scaling arguments that the collapse can be either a first or second order transition depending on the fraction of hydrophobic residues.
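One simple way to generate a random contact map with a fixed number of contacts is sketched below. It is an assumption-laden illustration: pairs are drawn uniformly from all residue pairs separated by at least `min_sep` positions along the chain, and both `min_sep` and the uniform sampling are choices made here, since the construction used for Fig.(7) is not spelled out in the text.

```python
import random

def random_contact_map(n_res, n_contacts, min_sep=3, seed=0):
    """Draw n_contacts distinct pairs (i, j) with j - i >= min_sep."""
    rng = random.Random(seed)
    candidates = [(i, j)
                  for i in range(n_res)
                  for j in range(i + min_sep, n_res)]
    if n_contacts > len(candidates):
        raise ValueError("more contacts requested than available pairs")
    return sorted(rng.sample(candidates, n_contacts))

# Twelve pseudo-proteins with the chain length and contact number of protein-L.
pseudo_maps = [random_contact_map(72, 185, seed=s) for s in range(12)]
```

Each map in `pseudo_maps` could then be supplied to Eq. (14) in place of the native contact map to obtain kθ for the corresponding pseudo-protein.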
Some time ago Abkevich et al. [72] showed, using Monte Carlo simulations of protein-like lattice polymers, that the folding transition in proteins with predominantly non-local contacts is first order like, which is not the case for proteins in which local contacts dominate. In light of this finding, it is interesting to examine how compaction is affected by local and non-local contacts. We created for N = 72 (protein-L) contact maps with 185 (the same number as in WT protein-L) predominantly local contacts (Fig.(7b)). The values of kθ for these pseudo-proteins are considerably larger than for the WT, implying that proteins dominated by local contacts are minimally collapsible. We repeated the exercise by creating contact maps with predominantly non-local contacts (Fig.(7c)). Interestingly, the kθ values in this case are significantly less than for the WT. This finding explains why in proteins with varied α/β topology there is a balance between the number of local and non-local contacts. Such a balance is needed to achieve native state stability and speed of folding [72], with polypeptide compaction playing an integral part [26].
Based on these findings we conclude that the unfolded states of proteins dominated by non-local contacts must undergo greater compaction, as measured by Rg, than those that have mostly local contacts. The results in Fig. (2) also show that proteins rich in β-sheet are more collapsible than predominantly α-helical proteins. It follows that β-sheet proteins must have a larger fraction of non-local contacts than proteins rich in α-helices. In Fig. (7d) we plot the distribution of the fraction of non-local contacts for the 2306 proteins. Interestingly, there is a clear separation in the distribution of non-local contacts between α-helix rich and β-sheet rich proteins. The latter have a substantial fraction of non-local contacts, which readily explains the findings in Fig. (7c) and the predictions in Fig. (2).
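The fraction of non-local contacts plotted in Fig. (7d) can be computed from a contact list as in the following sketch. The sequence-separation threshold `local_cutoff` that divides local from non-local contacts is a hypothetical value chosen for illustration; the threshold used for the figure is not stated in this section.

```python
def nonlocal_fraction(pairs, local_cutoff=8):
    """Fraction of contacts (i, j) with |i - j| > local_cutoff."""
    if not pairs:
        raise ValueError("contact list is empty")
    n_nonlocal = sum(1 for i, j in pairs if abs(i - j) > local_cutoff)
    return n_nonlocal / len(pairs)

# Idealized patterns: helix-like (i, i+4) contacts versus a hairpin-like
# ladder pairing residues from opposite ends of a 24-residue chain.
helix_like = [(i, i + 4) for i in range(20)]
hairpin_like = [(i, 23 - i) for i in range(8)]

print(nonlocal_fraction(helix_like), nonlocal_fraction(hairpin_like))
```

In this idealized example the helix-like pattern contains only local contacts and the hairpin-like pattern only non-local ones, which mirrors the separation between α-rich and β-rich proteins described above.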
5. CONCLUSIONS
We have created a theory to assess collapsibility of proteins using a combination of analytical modeling and simulations. The major implications of the theory are the following. (i) Because single domain proteins are small, the changes in the radius of gyration of the unfolded states as the denaturant concentration is lowered are often small. Thus, it has been difficult to detect the Rg changes using SAXS experiments in a couple of proteins, raising the question of whether unfolded polypeptide chains become compact below Cm. Here, we have solved this long-standing problem by showing that the unfolded states of single-domain proteins do become compact as the denaturant concentration decreases, sharing much in common with the physical mechanisms governing homopolymer collapse. By adopting concepts from polymer physics, and using the contact maps that reflect the topology of the native states, we established that proteins are collapsible. Simulations using models that describe the unfolded states of proteins reasonably well further confirm the conclusions based on theory. (ii) Based on a survey of over two thousand proteins, we surmise that there is evolutionary pressure for collapsibility, which is universal although the extent of collapse can vary greatly, because this ensures that the propensity to aggregate is minimized even if environmental fluctuations under cellular conditions transiently populate unfolded states. Two factors contribute to aggregation. First, the rate of dimer formation by a diffusion controlled reaction would be enhanced if a pair of UD rather than UC molecules collided due to cellular stress, because the contact radius in the former would be greater than in the latter. Second, the fraction of exposed hydrophobic residues in UD is much greater than in UC, thus greatly increasing the probability of aggregation. The second factor is likely to be more important than the first.
Consequently, transient population of UC due to cellular stress minimizes the probability of aggregation. (iii) We have also shown that the positions of the residues forming the native contacts greatly influence collapsibility: β-sheet proteins (containing a number of non-local contacts) show greater compaction than α-helical proteins, which are typically stabilized by local contacts.
Our theory also shows that most RNAs may have evolved to be compact in their natural environments. Although the evolutionary pressure to be compact is likely to be substantial for viral RNAs [64, 65, 68, 73], it is apparent that even non-coding RNAs are likely to be almost maximally compact in their natural environments. Our theory suggests that, to a large extent, collapsibility of RNA is similar to that of proteins with β-sheet structures. Both classes of biological macromolecules are stabilized by non-local contacts. Interestingly, it has been argued that the need to be compact (“Compactness Selection Hypothesis” [73]) could be a major determinant for evolved biopolymers to have minimum energy compact structures as their ground states.
Acknowledgements:
This work was supported by a grant from the National Science Foundation (CHE 16-36424). We acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing resources for the simulations.
Appendix A: Collapse of homopolymers:
The theory described for protein collapse resulting in Eq. (14) is general and applicable to the collapse of homopolymers as well. We show in this Appendix that the ES formalism can be used to derive the scaling of kθ with N, the number of monomers.
Consider a homopolymer with the following Hamiltonian: where r(s) is the position of the monomer s, and a0 is the monomer size. The first term in Eq. (A1) accounts for chain connectivity, and the second term represents volume interactions and favorable interactions between monomers, given by VH(r(s)),
The form of VH(r(s)) is exactly the same as in Eq. (2), except that in the above equation all monomers interact favorably as long as self-avoidance is not violated, whereas in Eq. (2) attractive interactions depend on the topology of the protein. The first (second) term in Eq. (A2) describes non-specific excluded volume (attractive) interactions. Thus, the model in Eq. (A1) describes the behavior in good solvents (k = 0) as well as the point at which the chain undergoes a transition to the collapsed state. For the excluded volume repulsion, the range of interactions is on the order of the monomer size a0, and for attractive interactions the range is σ. In good solvents, with υ > 0, the polymer swells with Rg ~ aN^ν (ν ≈ 0.6). In poor solvents (υ < 0), the polymer undergoes a coil-globule transition with Rg ~ aN^(1/3). These are the well-known Flory laws.
Following the ES method described in the main text, we arrive at the self-consistent equation for a for the homopolymer chain,
To obtain an expression for the θ-point we derive the condition for homopolymer collapse instead of solving the complicated Eq. (A3) numerically. The volume interactions are on the right hand side of Eq. (A3). At the θ-point, the υ-term should exactly balance the k-term arising from attractive interaction between the monomers. Since at the θ-point the chain is ideal with a = a0, we can substitute this value for a in the sums in the denominators of the υ- and k-terms, to obtain an expression for kθ. Thus, from Eq. (A3), the specific interaction strength at which two-body repulsion (υ-term) equals two-body attraction (k-term) is: The expression for kθ in Eq. (A4) for homopolymers differs from kθ (Eq. (14)) for proteins only by the term in the denominator. The sum over specific interactions for proteins is replaced by the non-specific interaction in Eq. (A4). It can be shown that the N dependence is the same in both the numerator and denominator in Eq. (A4). Therefore, to leading order in 𝒲, kθ is independent of N for a homopolymer.
In order to derive the scaling of kθ with N, we need to analyze the corrections arising from second order in 𝒲. To second order in 𝒲, the radius of gyration is, In the expression only the contribute to kθ. Here, 𝒲1 is the same as Eq. (5), and 𝒲2 is given by Eq. A2. The terms associated with 𝒲1 are zero at the θ-transition point. By counting the powers of N it follows that scales as and scales as . Hence, at the θ-point, we find that kθ satisfies the following quadratic equation, in the large N limit. The scaling law for kθ (∝ Tθ) obtained first by Flory [58], was confirmed using simulations much later [74]. To our knowledge this is the first microscopic derivation of the result. Thus, our general formalism can be applied to describe collapse of homopolymers as well as proteins and RNA.
Proteins: The results for homopolymers given above may be extended to obtain the N dependence of kθ for proteins. By considering the second order correction to the radius of gyration, we obtain the following quadratic equation for kθ, In deriving the above equation we assume that the total number of contacts Nnc ~ N^γ. A plot of Nnc as a function of N (Fig. (8e)) for the PDBselect proteins confirms that this is indeed the case. For γ = 1.3, kθ ~ N^0.9, which shows that larger proteins are less collapsible than smaller ones, implying that when N exceeds a critical value they are likely to form multi-domain structures. Comparison of Eqs. (A6) and (A7) shows that collapsibility in proteins and homopolymers differs dramatically. For homopolymers the coil-to-globule transition occurs at a finite temperature. The sharpness of the transition increases as N increases. In sharp contrast, the growth of kθ with N for proteins (Eq. (A7)) implies that larger proteins must organize themselves into domains with individual domains forming compact structures.
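The exponent γ in Nnc ~ N^γ can be estimated from a set of (N, Nnc) pairs by a straight-line fit in log-log coordinates. The sketch below is one common way to do this and is not necessarily the fitting procedure behind Fig. (8e); the function name is illustrative, and the data in the example are synthetic, generated from an exact power law only to check that the fit recovers the exponent.

```python
import math

def fit_power_law(n_values, nnc_values):
    """Least-squares fit of log(Nnc) = log(c) + gamma * log(N).

    Returns (gamma, c).
    """
    xs = [math.log(n) for n in n_values]
    ys = [math.log(m) for m in nnc_values]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    gamma = sxy / sxx
    c = math.exp(my - gamma * mx)
    return gamma, c

# Synthetic check: points lying exactly on Nnc = 2 * N**1.3.
lengths = [50, 100, 200, 400]
contacts = [2.0 * n ** 1.3 for n in lengths]
gamma, c = fit_power_law(lengths, contacts)
```

Applied to contact counts extracted from a set of native structures, the fitted `gamma` would play the role of γ in the scaling argument above.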
Appendix B: Simulations
The theoretical results were obtained using a set of approximations, whose validity needs to be confirmed using simulations. The purpose of these simulations is to show that the predicted theoretical values of kθ correlate well with simulation results. We performed Langevin dynamics simulations for 21 globular proteins (Fig. (5)). The set includes both all-α and all-β proteins as well as α + β and α/β proteins according to the Structural Classification Of Proteins (SCOP).
The simple form (sum of Gaussians) of the interaction energy in Eq. (2) was devised in order to obtain an analytic expression for kθ so that collapsibility of two thousand or more proteins could be easily analyzed. The potential in Eq. (2) has no hard core, which is physically not realistic. Because of the soft interactions it is clear that the theoretical values of kθ have to be an upper bound. In order to firmly establish the qualitative predictions obtained using theory, we use a realistic interaction energy in the simulations. The potential function in the simulations is, where
The first term, describing chain connectivity, is the discrete version of the first term in Eq. (1) with a0 = 0.38 nm. The second term accounts for excluded volume interactions and is used for any pair of residues not included in the contact map. We chose ευ = 1.0 kcal/mol so that monomer particles do not overlap with each other. In this crucial respect, the potential function is drastically different from the interaction potential used in the theory, in which a Gaussian-type soft core potential was used in order to solve the problem analytically.
The summation in the last term in Eq. (B1) runs over all pairs in the contact map. The potential, ΦWCA, is the Weeks-Chandler-Andersen potential [75], a variant of the Lennard-Jones potential, consisting of well-separated repulsive and attractive terms (Fig. 8(c), (d)). This separation is necessary in order to vary the strength of the attractive potential without affecting the repulsive interactions. The coefficient of the attractive term is εk = k · kBT. We varied k between 0.0 and 5.0 to find the collapse-transition point, k = kθ. The contact distance is the same as in the theory, σ = 0.63 nm.
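The idea of separating the repulsive and attractive parts so that only the attraction is scaled by k can be illustrated with the standard Weeks-Chandler-Andersen decomposition of a 12-6 Lennard-Jones potential. This is a sketch under stated assumptions, not the exact form of ΦWCA in Eq. (B1): in particular, treating σ as the Lennard-Jones length parameter (so that the minimum sits at 2^(1/6)σ), and reusing ευ as the repulsive strength for contact pairs, are assumptions made here.

```python
K_B = 0.0019872  # Boltzmann constant in kcal/(mol K)

def lennard_jones(r, eps, sigma):
    """Standard 12-6 Lennard-Jones potential."""
    sr6 = (sigma / r) ** 6
    return 4.0 * eps * (sr6 * sr6 - sr6)

def wca_split(r, eps_rep, eps_att, sigma):
    """WCA decomposition with independently scaled parts.

    Returns (repulsive, attractive). The repulsive part is shifted to be
    zero at and beyond the minimum r_min; the attractive part is flat at
    -eps_att inside r_min and follows the LJ tail outside.
    """
    r_min = 2.0 ** (1.0 / 6.0) * sigma
    if r < r_min:
        return lennard_jones(r, eps_rep, sigma) + eps_rep, -eps_att
    return 0.0, lennard_jones(r, eps_att, sigma)

def contact_energy(r, k, temperature=300.0, eps_rep=1.0, sigma=0.63):
    """Pair energy (kcal/mol) with attraction strength eps_k = k * kB * T."""
    rep, att = wca_split(r, eps_rep, k * K_B * temperature, sigma)
    return rep + att
```

With this form, changing k rescales only the well depth; the steep wall at short distances is governed by `eps_rep` alone, so excluded volume is unaffected as the attraction is varied.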
For each protein and k value, we generated 100 independent simulation trajectories. Initial conformations were generated in a preliminary simulation at high temperature, T = 400 K, with k = 0. Each production run at T = 300 K lasts for 10^8 steps. We discarded the first 2 × 10^7 steps in analyzing the data. Conformations are sampled every 10^4 steps. In total, 8 × 10^5 conformations were sampled to calculate the average radius of gyration, 〈Rg〉, for each k.
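The quoted total number of conformations follows from the run parameters when the step counts are read as powers of ten (a run length of 10^8 steps, 2 × 10^7 discarded steps, and sampling every 10^4 steps). The short check below makes the arithmetic explicit; the function name is illustrative.

```python
def sampled_conformations(n_traj, total_steps, discard_steps, stride):
    """Number of saved conformations after discarding equilibration steps."""
    kept_steps = total_steps - discard_steps
    return n_traj * (kept_steps // stride)

total = sampled_conformations(n_traj=100,
                              total_steps=10**8,
                              discard_steps=2 * 10**7,
                              stride=10**4)
print(total)  # 800000, i.e. 8 x 10^5
```

Each trajectory contributes 8000 conformations, and 100 trajectories give the 8 × 10^5 conformations used for each value of k.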