Abstract
The solvent-excluded surface (SES) of a protein is determined by and in turn affects protein-solvent interaction and consequently plays important roles in its solvation, folding and function. However, accurate quantitative relationships between them remain largely unknown at present. To evaluate SES’s contribution to protein-solvent interaction we have applied our accurate and robust SES computation algorithm to various sets of proteins and ligand-protein interfaces. Our results show that each of the analyzed water-soluble proteins has a negative net charge on its SES. In addition we have identified a list of SES-defined physical and geometrical properties that likely pertain to protein solvation and folding based on their characteristic changes with protein size, their differences between folded and extended conformations, and their correlations with known hydrophobicity scales and with experimentally-determined protein solubility. The relevance of the list of SES-defined properties to protein structure and function is supported by their differences between water-soluble proteins and transmembrane proteins and between solvent-accessible regions and ligand-binding interfaces. Taken together our analyses reveal the importance of SES for protein solvation, folding and function. In particular the universal enrichment of negative charge and the larger than average SES area for a polar atom on the surface of a water-soluble protein suggest that from a protein-solvent interaction perspective to fold into a native state is to optimize the electrostatic and hydrogen-bonding interactions between solvent molecules and the surface polar atoms of a protein rather than to only minimize its apolar surface area.
1 Introduction
Protein-solvent interaction is believed to contribute largely to the solvation, folding and structure of a water-soluble protein [1, 2, 3, 4, 5] and plays an important role in its function such as ligand binding [6]. However it is challenging to quantify such contributions [7, 8] using experimental approaches [9, 10], theoretical models [11, 12, 13, 14, 15], molecular dynamics (MD) simulations [16, 17] or structural information [18, 19]. For example, because protein-solvent interaction is difficult to evaluate, it is not clear at present how evolution has optimized the surfaces of naturally-occurring water-soluble proteins to make them best adapted to aqueous solvent. Clues to possible adaptation may be found through a systematic and detailed analysis of the surfaces of different types of proteins with known structures. There exist three mathematical models for protein surface: the van der Waals (VDW) surface, the solvent-accessible surface (SAS) [20, 21] and the solvent-excluded surface (SES) [22, 23]. A SES is a two-dimensional (2D) manifold impenetrable to solvent molecules. In other words a SES defines a 2D boundary that seals off the interior of a protein from direct contact with solvent molecules [24]. The SES of any molecule consists of three different types of 2D patches: convex spherical polygons on a set of solvent-accessible atoms, saddle-shaped toroidal patches each defined by a pair of accessible atoms, and concave spherical patches each determined by a triple of accessible atoms. In the past predominantly SAS, and to a much lesser degree SES, have been extensively investigated, mainly at residue level, for their roles in protein solvation, folding, stability and function [25, 26, 27, 21, 28, 29, 30, 31, 32, 9, 14].
For example it has been well documented that polar (hydrophilic) residues especially the charged ones prefer to be on the surface of a water-soluble protein while apolar (hydrophobic) residues are generally buried inside [33]. Further efforts have been made to establish quantitative relationships between SAS area and solvation free energy. For example, the free energies (∆Gsolvs) of the transfer of either organic compounds or small peptides between aqueous solvent and nonpolar solvents have been fitted to a linear equation ∆Gsolv = Σi σi Ai where Ai is the SAS area of atom i of either a compound or a peptide. The fitted σis are called atomic solvation parameters [28]. Though such an empirical equation has found wide applications in various implicit solvent models for representing the contributions of solvent to protein folding, structure and ligand binding [14, 34], the physics behind the fitted σis is not well understood. Furthermore, to the best of our knowledge no efforts have been made in the past to establish a quantitative relationship between SES and protein-solvent interaction through a comprehensive analysis of the SESs for different types of proteins and ligand-protein interfaces at atomic level and on a large-scale.
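As a minimal sketch of the linear relation ∆Gsolv = Σi σi Ai quoted above, the snippet below sums the per-atom contributions. The function name, the atom-type keys and all numerical values of σ and A are hypothetical placeholders chosen for illustration; they are not the fitted atomic solvation parameters of [28].

```python
def solvation_free_energy(sigma, atoms):
    """Return sum_i sigma_i * A_i.

    sigma: dict mapping an atom type to its atomic solvation parameter
           (hypothetical values, energy per unit area)
    atoms: list of (atom_type, SAS area) pairs, one per atom
    """
    return sum(sigma[atom_type] * area for atom_type, area in atoms)


# Hypothetical illustrative inputs, not fitted parameters.
sigma = {"C": 0.02, "N": -0.01, "O": -0.01}
atoms = [("C", 10.0), ("N", 5.0), ("O", 5.0)]
dG = solvation_free_energy(sigma, atoms)
```

With these placeholder values the apolar carbon contributes a positive term and the polar atoms contribute negative terms, which is the sign pattern such fits are usually read for.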
To examine SES’s contribution to protein-solvent interaction at atomic level, to identify plausible physics behind atomic solvation parameters and to obtain clues to SES’s optimization via evolution, we have applied our accurate and robust SES computation algorithm to a set 𝕊 of 16,483 water-soluble proteins with high quality crystal structures, a set 𝕄e of 1,314 structural models of extended conformations and a set of proteins whose solubilities have been determined experimentally. The SESs of 𝕊 and 𝕄e are further compared with the SESs of the lipid-exposing regions of transmembrane proteins and the SESs of ligand-protein interaction interfaces where the ligand is either lipid or DNA or protein. Our analysis is inspired by the observations that water as a protic solvent prefers anions over cations as its solutes, and that both the intermolecular hydrogen bonding and the VDW attraction between the surface atoms of a solute and solvent molecules contribute to protein-solvent interaction. The analyses, especially the comparisons of the atomic SES areas and atomic properties among different types of proteins and between the surfaces of water-soluble proteins and ligand-protein interfaces, have identified a list of SES-defined physical and geometrical properties that are likely to be important for protein solvation, folding and function. This paper focuses on SES’s contribution to protein solvation and folding through the analyses of a list of SES-defined properties over 𝕊 and 𝕄e, while our sequels will demonstrate SES’s importance to protein structure and function using as examples the characteristic SES-defined properties for protein-protein [35], lipid-protein and DNA-protein interaction interfaces.
Our analyses show that every structure in 𝕊 has a negative net surface charge. For example, the charges per atom for all the accessible atoms in 𝕊 have an average of −2.90 × 10⁻² e (e being the elementary charge) while the charges per atom for all the buried atoms in 𝕊 have an average of +2.70 × 10⁻² e. This large difference in charge per atom confirms quantitatively and at atomic level the residue-level observation that polar residues, especially the charged ones, prefer to be on the surface of a water-soluble protein [33]. Interestingly we find that, compared with charge only or area only, SES-area weighted surface charge and charge density seem to be more relevant to protein-solvent interaction. This finding provides a plausible explanation for atomic solvation parameters.
Our analyses have identified several SES-defined geometrical properties pertinent to intermolecular hydrogen bonding interaction. Specifically we find that SES area per accessible polar atom is, on average, almost 2-fold larger than SES area per accessible apolar atom. In our definition (section S1 of the Supplementary Materials) a polar atom is capable of forming a hydrogen bond with other atoms while an apolar one may not. In addition, though the total SES area Ai of all the accessible polar atoms of a water-soluble protein is, on average, 1.2-fold smaller than the total SES area Ao of its accessible apolar atoms, Ai decreases but Ao increases upon unfolding. Thus Ao and Ai as well as the ratio of SES area per apolar atom over SES area per polar atom likely pertain to protein-solvent interaction. These findings confirm quantitatively and at atomic level the preference of polar residues for the surface of a water-soluble protein. They also support the importance of intermolecular hydrogen bonding to protein-solvent interaction [36] and may provide an alternative explanation [37] for some phenomena usually associated with the hydrophobic effect.
It is widely accepted that hydrophobic effect is the driving force for protein folding [2, 3, 7, 38, 39]. However, the quantitative contributions of hydrophobic effect to protein folding and PPI remain controversial [37]. One reason is that it has been difficult to evaluate the hydrophobic interaction between a folded water-soluble protein and solvent molecules since the protein surface is amphipathic. For an apolar solute it has been assumed that the intermolecular VDW attraction between the solute and aqueous solvent molecules is important for its solvation [39, 40]. Along this line of thinking we have identified a SES-defined geometrical property called concave-convex ratio rcc that likely pertains to protein-solvent interaction. Our analysis shows that for a water-soluble protein the rcc of an accessible apolar atom is, on average, 1.5-fold larger than the rcc of a polar one. Most interestingly at residue-level rcc correlates well with known hydrophobicity scales [41, 42, 43, 44]. These findings support the importance of intermolecular VDW attraction to the solvation of apolar atoms if we assume that the larger atomic rcc is the stronger the VDW attraction between a protein surface atom and solvent molecules. These findings could also mean that the larger rcc is, the less disruption to water’s hydrogen-bonded network [7].
The relevance to protein-solvent interaction and protein function of the list of SES-defined physical and geometrical properties is further supported by (a) their well-defined changes with protein size, (b) the differences between their values for folded proteins and for extended conformations, (c) the differences between their values for water-soluble proteins and for ligand-protein interfaces, and (d) the correlations between these properties and experimentally-determined solubility. From our large-scale analysis we hypothesize that the optimization of protein-solvent interaction through natural selection has been achieved via (1) the universal enrichment of negative surface charge, (2) the increased surface area for a surface polar atom for optimal hydrogen bonding with water molecules with minimal disruption to water’s hydrogen-bonded network, and (3) the increased concave-convex ratio for a surface apolar atom for either stronger VDW attraction with water molecules or less disruption to water’s hydrogen-bonded network or both. This hypothesis is consistent with the observation that some of these SES-defined properties for de novo designed water-soluble proteins differ largely from those for naturally-occurring ones. It seems to us that a paradigm shift may be needed in the study of protein folding by taking a more balanced view of surface charge and side chain hydrophobicity, since from a solvation perspective to fold into a native state is to optimize both the surface charges and the SES areas of the accessible polar atoms of a water-soluble protein rather than to only minimize the total SES area of its exposed apolar atoms.
2 Materials and Methods
In this section we first describe the data sets used in the analysis and then briefly present SES computation. Finally we define a list of SES-defined physical and geometric properties that likely pertain to protein-solvent interaction.
2.1 The data sets
We have downloaded from the PDB a non-redundant set of 25,729 crystal structures of water-soluble proteins, each of which has at most 70% sequence identity with any other, a resolution ≤ 3.5 Å and an R-factor ≤ 27.5%. In this set each monomeric protein has > 800 atoms (with protons) and each multimer > 1,000 atoms. This set excludes hyper-thermophilic, anti-freeze, membrane and nucleic acid binding proteins in order to minimize other structural features that may affect protein-solvent interaction. A preprocessing step that requires that no structures have > 5% missing atoms and no structures include bound compounds with > 20 heavy atoms reduces the number of structures to 16,483. This set of structures is denoted as 𝕊 and is used as the representative set of water-soluble proteins. Set 𝕊 has the number of atoms ranging from 833 to 171,552 and includes a set 𝕄 of 8,974 monomeric proteins with 833 to 44,200 atoms. Out of 𝕄 we select a subset 𝕄f of 1,314 structures (section S2 of the Supplementary Materials) with 1,004 to 10,297 atoms that have coordinates for every residue, no bound compounds with > 5 atoms and < 0.2% missing atoms. Set 𝕄f is used to represent water-soluble proteins in the native state for the quantification of the changes in SES-defined properties upon unfolding. The corresponding model structures in the unfolded state are a set of extended and energy-minimized conformations 𝕄e generated by CNS [45] using the amino acid sequences in 𝕄f.
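The selection thresholds described above can be sketched as two simple predicates. This is only an illustration of the stated cut-offs: the `Entry` record and its field names are hypothetical, the 70% sequence-identity clustering and the exclusion of special protein classes are not modelled, and the sample entries are made up.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical per-structure summary; field names are illustrative."""
    pdb_id: str
    resolution: float            # Angstrom
    r_factor: float              # percent
    n_atoms: int                 # atom count including protons
    is_monomer: bool
    missing_atom_pct: float      # percent of atoms without coordinates
    max_ligand_heavy_atoms: int  # largest bound compound


def passes_initial_filter(e: Entry) -> bool:
    # Resolution <= 3.5 A, R-factor <= 27.5%, and the size lower bounds.
    min_atoms = 800 if e.is_monomer else 1000
    return e.resolution <= 3.5 and e.r_factor <= 27.5 and e.n_atoms > min_atoms


def passes_preprocessing(e: Entry) -> bool:
    # No more than 5% missing atoms and no bound compound above 20 heavy atoms.
    return e.missing_atom_pct <= 5.0 and e.max_ligand_heavy_atoms <= 20


# Made-up entries for demonstration only.
entries = [
    Entry("demo1", 2.0, 20.0, 1500, True, 0.1, 0),
    Entry("demo2", 3.8, 20.0, 1500, True, 0.1, 0),    # resolution too low
    Entry("demo3", 2.0, 20.0, 900, False, 0.1, 0),    # multimer too small
    Entry("demo4", 2.0, 20.0, 5000, False, 7.5, 0),   # too many missing atoms
]
selected = [e.pdb_id for e in entries
            if passes_initial_filter(e) and passes_preprocessing(e)]
```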
2.2 The preprocessing of PDB files for SES computation
The PDB files are preprocessed as follows for SES computation. Protons are first added using the program REDUCE [46] to any PDB structure that lacks their coordinates and the protonated structures are then processed by our structural analysis and visualization program. A graph with atom as node and bond as edge is first constructed for each of the 20 naturally-occurring amino acid residues, HSD, HSP and protonated ASP and GLU residues using Charmm atom nomenclature [34]. A molecule graph is then built for a whole protein by adding an edge for each peptide bond. For atoms with more than one conformation, only their first forms are selected for SES computation. Next any gap (a residue with no experimental coordinates) in a protein chain is identified and the percentage of missing atoms in each structure is computed by a comparison of the number of the nodes in the protein molecule graph with the number of atoms that have coordinates in the PDB file. Charmm force field parameters (e.g. Charmm partial charges) [34] are assigned to individual or a subset of atoms using a protein molecule graph. Only protein atoms are included in SES computation.
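The graph construction and the missing-atom check described above might be sketched as follows. The residue templates here are hypothetical, truncated to heavy backbone atoms plus one side-chain atom, and stand in for the full Charmm-named templates; the function names are illustrative and not those of the authors' program.

```python
# Hypothetical, heavily truncated residue templates (heavy atoms only).
TEMPLATES = {
    "GLY": {"atoms": ["N", "CA", "C", "O"],
            "bonds": [("N", "CA"), ("CA", "C"), ("C", "O")]},
    "ALA": {"atoms": ["N", "CA", "C", "O", "CB"],
            "bonds": [("N", "CA"), ("CA", "C"), ("C", "O"), ("CA", "CB")]},
}


def build_molecule_graph(sequence):
    """Return (nodes, edges); a node is (residue index, atom name)."""
    nodes, edges = [], []
    for idx, res in enumerate(sequence):
        template = TEMPLATES[res]
        nodes += [(idx, name) for name in template["atoms"]]
        edges += [((idx, a), (idx, b)) for a, b in template["bonds"]]
        if idx > 0:
            # One extra edge per peptide bond between consecutive residues.
            edges.append(((idx - 1, "C"), (idx, "N")))
    return nodes, edges


def missing_atom_percentage(nodes, observed):
    """Percentage of graph nodes that have no coordinates in the PDB file."""
    observed = set(observed)
    missing = sum(1 for node in nodes if node not in observed)
    return 100.0 * missing / len(nodes)


nodes, edges = build_molecule_graph(["GLY", "ALA"])
observed = [n for n in nodes if n != (1, "CB")]   # pretend CB lacks coordinates
pct = missing_atom_percentage(nodes, observed)
```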
2.3 SES computation
A SES is composed of three types of areas: a spherical polygon area as(i) on the surface of a solvent-accessible atom i, a patch area at(i, j) on a toroid defined by two atoms i, j and a spherical polygon area ap(i, j, k) on the surface of a probe whose position is determined by three atoms i, j, k. The SESs and areas computed by our algorithm have higher accuracy than those by MSMS [47] due in part to the analytic computation of all the intersecting arcs among the probes, the accurate treatment of various cases of probe-probe intersections and the absence of modifications to atomic radii [24]. In this study we set the probe radius to 1.4 Å except for set 𝕄, over which SESs are computed twice using respectively 1.4 Å and 1.2 Å. The SESs with 1.2 Å radius are compared with those with 1.4 Å to see how probe radius affects area and SES-defined physical and geometrical properties.
2.4 SES-defined physical and geometrical properties
A list of physical and geometrical properties have been defined using atomic SES to evaluate their possible contributions to protein solvation and function. These SES-defined properties are inspired by the observations that water as a protic solvent prefers anions over cations, and that both the hydrogen bondings between solvent molecules and the polar atoms of a solute and the VDW attractions between solvent molecules and its apolar atoms contribute to its solvation. Their definitions rely on atomic SES area. However except for atomic concave-convex ratio each of the other properties is defined over a specific set of atoms.
To each accessible atom i we assign an atomic SES area a(i):

$$a(i) = a_s(i) + a_t(i) + a_p(i) \tag{1}$$

where as(i), at(i) and ap(i) are respectively the accessible, toroidal and probe areas for atom i. From as(i) and ap(i) we define a concave-convex ratio rcc(i) for atom i to estimate its local ruggedness,

$$r_{cc}(i) = \frac{a_p(i)}{a_s(i)} \tag{2}$$

and, analogously, a concave-convex ratio for a set of accessible atoms T to represent the average ruggedness of the surface formed by them. For example the rcc for the set of accessible atoms belonging to a single residue is called residue rcc.
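A minimal sketch of the per-atom quantities just described, under two stated assumptions: that the atomic SES area is the sum of the three patch contributions, and that the concave-convex ratio is the probe (concave) area divided by the accessible (convex) area. The set-level ratio is taken here as the ratio of summed areas, which is one plausible reading and not confirmed by the text; all function names are hypothetical.

```python
import math


def atomic_ses_area(a_s, a_t, a_p):
    """a(i): sum of accessible, toroidal and probe areas (assumed additive)."""
    return a_s + a_t + a_p


def concave_convex_ratio(a_s, a_p):
    """r_cc(i): concave (probe) area over convex (accessible) area (assumed)."""
    return a_p / a_s if a_s > 0 else math.inf


def set_concave_convex_ratio(patches):
    """r_cc over a set T, ASSUMED to be sum(a_p) / sum(a_s).

    patches: list of (a_s, a_t, a_p) triples, one per accessible atom.
    """
    total_s = sum(p[0] for p in patches)
    total_p = sum(p[2] for p in patches)
    return total_p / total_s if total_s > 0 else math.inf


# Made-up patch areas (in square Angstrom) for two atoms of one residue.
residue = [(6.0, 3.0, 1.5), (2.0, 1.0, 2.5)]
areas = [atomic_ses_area(*p) for p in residue]
residue_rcc = set_concave_convex_ratio(residue)
```

Under this reading a small ratio marks an atom with a large convex patch, that is, one well exposed to solvent.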
On the set 𝔸 of accessible atoms of a protein we define as follows its SES area A, net surface charge QA, surface charge density ΣA, average-partial charge (charge per atom) ρA, average-atomic area (area per atom) η, and surface atom density (number of atoms per area) ν:

$$A = \sum_{i \in \mathbb{A}} a(i), \quad Q_A = \sum_{i \in \mathbb{A}} e(i), \quad \Sigma_A = \frac{Q_A}{A}, \quad \rho_A = \frac{Q_A}{n_A}, \quad \eta = \frac{A}{n_A}, \quad \nu = \frac{n_A}{A} \tag{3}$$

where nA = |𝔸| is the number of accessible atoms and e(i) the Charmm partial charge for atom i [34]. By Eq. (3) we have $\Sigma_A = \rho_A \nu$. On the set of buried atoms 𝔹 in a protein we define its net charge and average-partial charge:

$$Q_B = \sum_{i \in \mathbb{B}} e(i), \quad \rho_B = \frac{Q_B}{n_B} \tag{4}$$

where nB = |𝔹| is the number of atoms in 𝔹. The net charge Q and average-partial charge ρ for a whole protein are defined as follows:

$$Q = \sum_{i \in \mathbb{N}} e(i), \quad \rho = \frac{Q}{n} \tag{5}$$

where n = |ℕ| is the total number of atoms in a protein and set ℕ = 𝔸 ∪ 𝔹 includes all of its atoms. Area-weighted surface charge qs and area-weighted surface charge density σs are additionally defined to represent the simultaneous contributions of surface charge and area to protein-solvent interaction.
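The surface quantities named above can be sketched from their verbal definitions (charge per atom, area per atom, atoms per area, charge per area). The area-weighted quantities are the least certain part: the forms used below, q_s = Σ e(i)·a(i) and σ_s = q_s / A, are assumptions suggested by the later remark that the expression resembles the atomic solvation parameter sum, and they may differ from the authors' definitions. The function and key names are hypothetical.

```python
def surface_properties(atoms):
    """Compute SES-defined properties over a set of accessible atoms.

    atoms: list of (a, e) pairs, where a is the atomic SES area and e the
    partial charge of one accessible atom.
    """
    n_a = len(atoms)
    area = sum(a for a, _ in atoms)      # total SES area A
    charge = sum(e for _, e in atoms)    # net surface charge Q_A
    q_s = sum(a * e for a, e in atoms)   # ASSUMED area-weighted surface charge
    return {
        "A": area,
        "Q_A": charge,
        "Sigma_A": charge / area,        # surface charge density
        "rho_A": charge / n_a,           # charge per atom
        "eta": area / n_a,               # area per atom
        "nu": n_a / area,                # atoms per area
        "q_s": q_s,
        "sigma_s": q_s / area,           # ASSUMED normalisation by total area
    }


# Made-up values: (area in square Angstrom, partial charge in units of e).
props = surface_properties([(10.0, -0.5), (30.0, 0.1)])
```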
To distinguish the different contributions to protein-solvent interaction between accessible polar atoms and accessible apolar atoms we divide 𝔸 into two different subsets, set 𝔸o of apolar atoms and set 𝔸i of polar atoms, that is, 𝔸 = 𝔸o ∪ 𝔸i. The accessible atoms in 𝔸i are either hydrogen bond donors or acceptors as specified in the Charmm force field [34] while those in 𝔸o include the rest. On both 𝔸i and 𝔸o we define as follows their respective SES areas Ao, Ai and their ratio Aoi, and average-atomic areas ηi and ηo and their ratio Rio:

$$A_o = \sum_{j \in \mathbb{A}_o} a(j), \quad A_i = \sum_{j \in \mathbb{A}_i} a(j), \quad A_{oi} = \frac{A_o}{A_i}, \quad \eta_o = \frac{A_o}{n_o}, \quad \eta_i = \frac{A_i}{n_i}, \quad R_{io} = \frac{\eta_i}{\eta_o}, \quad n_{oi} = \frac{n_o}{n_i} \tag{7}$$

where no = |𝔸o| and ni = |𝔸i| are respectively the numbers of atoms in 𝔸o and 𝔸i, and noi is their ratio. The concave-convex ratios over 𝔸o and 𝔸i, and their ratio, are defined in the same manner. The SES areas Ai and Ao are called respectively the polar surface area and the apolar surface area of a protein.
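The polar/apolar split can be illustrated with a short sketch. It assumes the ratios are oriented as the later text describes them: noi as the number of apolar atoms over polar atoms, Aoi as apolar area over polar area, and Rio as area per polar atom over area per apolar atom. The function name and the input values are hypothetical.

```python
def polar_apolar_properties(polar_areas, apolar_areas):
    """Ratios over the accessible polar and apolar atom sets.

    polar_areas, apolar_areas: lists of atomic SES areas for the two sets.
    """
    n_i, n_o = len(polar_areas), len(apolar_areas)
    a_i, a_o = sum(polar_areas), sum(apolar_areas)
    eta_i, eta_o = a_i / n_i, a_o / n_o   # area per polar / apolar atom
    return {
        "n_oi": n_o / n_i,      # apolar count over polar count
        "A_oi": a_o / a_i,      # apolar area over polar area
        "eta_i": eta_i,
        "eta_o": eta_o,
        "R_io": eta_i / eta_o,  # polar area per atom over apolar area per atom
    }


# Made-up atomic areas: two polar atoms, four apolar atoms.
ratios = polar_apolar_properties([20.0, 20.0], [10.0, 10.0, 10.0, 10.0])
```

In this toy input the apolar atoms are twice as numerous, the two total areas are equal, and each polar atom exposes twice the area of an apolar one.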
3 Results and Discussion
In this section we first briefly describe the processing of PDB structure files. We then present the analyses of the list of SES-defined properties on set 𝕊, 𝕄f and 𝕄e, and discuss their relevance to protein solvation and folding. The importance of this list of properties to protein function is discussed in terms of their differences between 𝕊 and ligand-protein interaction interfaces where ligand is either lipid or DNA or protein. Overall in terms of SES-defined properties the differences between 𝕊 and 𝕄f are statistically insignificant while the differences between 𝕄f and 𝕄e are relatively large and the differences between ligand-protein interfaces and 𝕊 are substantial.
3.1 The processing of PDB structure files
In order to eliminate as far as possible other factors that may interfere with our SES analysis, we have applied a list of strict criteria to ensure that the sets of analyzed structures have good structural quality and that their surfaces are representative of water-soluble proteins. Both the SES and the structure of any protein that has a SES-defined property in the upper or lower 1.0% of its distribution over 𝕊 are inspected visually using our structural analysis and molecular visualization program to make sure that the PDB file has been properly processed. Any PDB file that could not be correctly processed by our program is removed from further analysis. Each remaining outlier is further checked against the literature to ensure that it is not a hyperthermophilic, anti-freeze, membrane or DNA-binding protein.
3.2 The surface charges of water-soluble proteins
Previous studies on protein surfaces, mainly SAS and VDW surfaces and to a much lesser extent SESs, have shown that polar residues especially charged ones prefer to be on the surface of a water-soluble protein [33]. In principle protein-solvent interaction is electrostatic in nature [16, 48]. In theory surface charge and dipole moment are closely related to protein solvation [11, 12, 13, 14, 15]. Inspired by the importance of electrostatic interaction for solvation, especially by the observation that water as a protic solvent prefers anions over cations, we first analyze the differences in charge between accessible atoms and buried atoms. As shown in Fig. 1 we discover that each of the 16,483 proteins in 𝕊 has a negative net charge (negative QA and ρA) for its accessible atoms and a positive net charge (positive QB and ρB) for its buried atoms. Most strikingly the average ρA over all the sets of accessible atoms in 𝕊 and the average ρB over all the sets of buried atoms in 𝕊 differ markedly (−2.90 × 10⁻² e versus +2.70 × 10⁻² e), and the ratio of the average ρA to the average ρ, the average net charge per atom for all the atoms in a protein, is 19.33, equivalent to a 19-fold difference in negativity between the accessible atoms and all the atoms. In addition QA increases with protein size via a well-fitted power law and the enrichment in negativity is apparent for the folded (native) structures in 𝕄f when compared with the extended conformations in 𝕄e (Table 1). In stark contrast with the average , the average is negative while the average for 𝕄e is more than 100-fold less negative than that for 𝕄f (Table 1). The negativity of for 𝕄e is due mainly to the buried backbone nitrogen and oxygen atoms. Furthermore, the for the buried atoms in PPI interfaces [35] and DNA-protein interfaces are both positive, and the becomes less negative for the lipid-exposing regions of transmembrane proteins and for the surface atoms that become buried upon ligand binding.
Another SES-defined electrical property is surface charge density Σ. As shown in Fig. 2 the three surface charge densities, , for the extended conformations in 𝕄e differ largely from those for 𝕄f. For the native structures in 𝕄f, increases while both decrease with protein size. If we fit the ΣA values for 𝕄f to a power law, Σ = a n^b + c, where n is the number of atoms (protein size), then the fitted parameter c = −5.00 × 10⁻³ is much more negative than the ΣA average . The extended conformations in 𝕄e likely deviate from the real unfolded states existent in a typical experimental setting [49] and thus their SES-defined properties differ from those for a genuine unfolded state. However the large differences in Σ values between 𝕄f and 𝕄e support at least qualitatively the relevance of net surface charge density to solvation and folding. In addition, as shown in Figs. 11(d) and 12(d), there exists a good correlation between ΣA and experimentally-determined solubility. As expected, the more negative the ΣA value of a protein, the better its solubility in aqueous solution.
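A power law with an offset, Σ = a n^b + c, can be fitted in several ways; the text does not say which procedure was used. The sketch below is one simple option under that caveat: scan a grid of exponents b and, for each, solve for a and c by ordinary least squares in closed form. The function name and the synthetic data are illustrative only.

```python
def fit_power_law(ns, ys, b_grid):
    """Fit y = a * n**b + c by scanning b and regressing y on n**b.

    Returns (a, b, c) for the grid exponent with the smallest squared error.
    """
    m = len(ns)
    best = None
    for b in b_grid:
        xs = [n ** b for n in ns]
        mean_x = sum(xs) / m
        mean_y = sum(ys) / m
        sxx = sum((x - mean_x) ** 2 for x in xs)
        if sxx == 0.0:
            continue  # degenerate exponent, x is constant
        a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
        c = mean_y - a * mean_x
        sse = sum((a * x + c - y) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    return best[1], best[2], best[3]


# Synthetic, noise-free data generated from known parameters (not real data).
sizes = [1000, 2000, 5000, 10000, 50000]
values = [2.0 * n ** -0.5 - 0.005 for n in sizes]
grid = [i / 100 for i in range(-200, 201) if i != 0]
a, b, c = fit_power_law(sizes, values, grid)
```

The fitted c is the value the curve approaches for large n when b is negative, which is why it can be compared with the average over finite-sized proteins.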
However, as shown in Figs. 1, 2 and Fig. 5 of section 3.3, neither ρ nor Σ nor η (area per atom) changes linearly with protein size (n), and the distributions around their means are not symmetrical especially for small-sized proteins. The non-uniformity implies that none of them alone could provide a proper description of protein-solvent interaction because its strength is expected to be statistically independent of n. In contrast to ρ, Σ and η, area-weighted surface charge (qs) changes almost linearly with n and area-weighted surface charge density (σs) is almost independent of n (Fig. 3). In addition the distribution around the mean for σs is rather symmetrical, as indicated by a very small difference between its mean and median even for small-sized proteins. More interestingly each of the 16,483 proteins in 𝕊 has a negative σs (Fig. 3). In addition, as shown in Table 2, the ratio between the σs for a folded structure in 𝕄f and the σs for a corresponding extended conformation in 𝕄e has an average of 1.57. Furthermore, the σs values for the lipid-exposing atoms of transmembrane proteins and for the interface atoms that become buried upon ligand binding all become less negative. Thus the three SES-defined area-weighted properties, , will likely provide a more balanced description of protein solvation, folding and function. In particular the expression for area-weighted surface charge resembles the expression for atomic solvation parameters. Thus the atomic solvation parameter σi is possibly related to the partial charge e(i).
In summary our large-scale analysis shows that folding into a native state in aqueous solution turns a water-soluble protein into a capacitor with a positive net charge buried inside and a negative net charge on its SES (the outer surface of the capacitor) to maximize its electrostatic attraction to the solvent [50]. In other words, a water-soluble protein behaves, on average and as far as surface charge is concerned, as a micelle with an exterior formed predominantly by atoms with negative partial charges and an interior composed mainly of atoms with positive partial charges. By extension there must exist a 2D manifold (the inner surface of the capacitor) inside a water-soluble protein that encloses a set of atoms with zero net charge. A model of alternating layers of negative and positive charges has been alluded to before in MD simulations [51].
3.3 Accessible polar and apolar atoms and their SES areas
Previous structural analyses [26, 41, 30, 33] have found that polar residues prefer to be on the surface of a water-soluble protein while apolar ones are likely to be buried inside. Such preferences are often cited as one piece of evidence for the importance of hydrophobic effect to the folding of a water-soluble protein. With the assignment of a SES area to an individual atom and the division of the set of accessible atoms into polar and apolar ones it is possible to quantify such preferences at atomic level using SES-defined physical and geometrical properties. The ratio of the number of accessible apolar atoms over that of polar atoms, noi, is a property that could possibly quantify at atomic level the preference of polar atoms on the surface of a water-soluble protein. However average for 𝕊 is , and noi increases very slowly with protein size n when n < 10, 000 and remains essentially the same when n > 10, 000 (Fig. S1 of the Supplementary Materials). It means that for the water-soluble proteins in 𝕊 the numbers of apolar atoms are on average more than 2-fold larger than the numbers of polar atoms. As with noi the SES-defined property Aoi has an average and on average the Aois do not change with protein size (Fig. S2 of the Supplementary Materials). Thus the set of accessible apolar atoms in a typical water-soluble protein still has larger SES area than its set of accessible polar atoms. On the other hand the for the buried atoms in 𝕊 is 13.7% larger than the for 𝕊 (Fig. S1 of the Supplementary Materials). In addition the for 𝕄e increases to 2.457 and the for 𝕄e increases to 1.570. Furthermore both Ai and Ao decrease upon folding though Ao ≥ Ai remains to be true. Thus as been shown before at residue level [26, 41, 30, 33] folding into a native state indeed reduces both the number and the area of surface apolar atoms. A SES-defined property that could more directly quantify the previously-documented preferences for polar residues is area per atom η. 
As shown in Fig. 4 the ratio Rio = ηi/ηo for 𝕊 ranges from 1.451 to 2.555 with a mean of 1.875. In other words, a polar atom has, on average, 1.875-fold larger SES area than an apolar atom. More interestingly only three structures (2ouw, 3qva and 4z0m) in . In addition the average Rio for 𝕄e is 1.567, 17.8% smaller than . Furthermore, though both ηi and ηo decrease upon folding, the reduction in ηi is smaller than that in ηo (Fig. 5). One possible explanation for a large Rio value is the importance to protein-solvent interaction of the intermolecular hydrogen bonding between accessible polar atoms and solvent molecules [36]. A large SES area for an accessible polar atom is likely to be favorable for optimal hydrogen bonding. The inter-atomic distance between two hydrogen-bonded atoms is smaller than the sum of their respective VDW radii. The larger the SES area of a polar atom, the less likely a solvent molecule is to clash with its neighboring protein atoms and to perturb water’s hydrogen-bonded network when the two form an optimal intermolecular hydrogen bond.
The relevance to protein solvation and function of the four SES-defined properties, noi, η, Aoi and Rio, is supported by the following observations. The for the lipid-exposing regions of transmembrane proteins, PPI interfaces and lipid-protein interfaces are all larger than while the for DNA-protein interfaces is smaller than . As with for lipid-exposing regions, PPI interfaces and lipid-protein interfaces are all larger than . Significantly as shown in Figs. 11(b) and 12(b) Aoi correlates well with experimentally-determined protein solubility. However in contrast to and , the for the lipid-exposing regions of transmembrane proteins, lipid-protein interfaces, DNA-protein interfaces and PPI interfaces are all smaller than the respective . Furthermore, as shown in Fig. 4 and Table 3 the seven structures in 𝕊 with are either PSI targets with unknown functions or proteins that seem to interact with lipids in some fashions. Their Rio values are close to those for 𝕄e and to those for PPI interfaces [35]. On the other hand, four (three ferredoxins and one flavodoxin) of the nine structures in 𝕊 that have their (Fig. 4 and Table 4) are involved in electron-transfer, two are DNA mimics, the other two are putative hemolysins, and 5cwh is a de novo designed protein [52]. The contrast between the SES of a protein with a large Rio and the SES of a protein with a small Rio is visually detectable: as shown in Fig. 6 the former has more largely-exposed polar atoms per SES area while the latter has more largely-exposed apolar atoms per SES area.
The enrichment of polar atoms, the enlargement of their total areas especially the large increase in SES area per polar atom on the SES of a water-soluble protein are consistent with the previous view that the hydrogen bonding interactions between surface polar atoms and solvent molecules contribute largely to protein solvation, folding and function. In addition there exist no or only weak correlations between SES-defined electrical properties such as ρA and σs and geometrical properties such as SES area, Aoi and Rio (section S6 of the Supplementary Materials). Furthermore the differences in SES area between polar surface atoms and apolar ones are in line with the heterogeneity of water motion in the first hydration shell. Thus from an evolutionary perspective it seems that the surfaces of naturally-occurring water-soluble proteins have evolved for best interaction with aqueous solvent through optimal intermolecular hydrogen bondings between surface polar atoms and solvent molecules. The importance of intermolecular hydrogen bondings to protein-solvent interaction may provide an explanation to hydrophobic effect [37].
3.4 The SES geometry of polar and apolar atoms
One advantage of SES over SAS is that the former includes both convex and concave areas while the latter has only convex ones. With SES we could define a concave-convex ratio rcc either for a single atom or over a set of accessible atoms such as the set of all the accessible atoms of a surface residue and the set of all the accessible atoms of a protein (Eqs. 2 and 7). To see the possible relevance of rcc to protein-solvent interaction we have analyzed the rccs for 𝕊, 𝕄f and 𝕄e as well as the rccs for the lipid-exposing regions of transmembrane proteins and ligand-protein interaction interfaces. Both the and the for 𝕊 increase with protein size via well-defined power laws. More relevantly their ratio is independent of protein size and ranges from 0.951 to 2.833 with a mean of = 1.496 (Fig. 7). In fact except for four structures, 2qsk, 3vqj, 2ouw and 3qva, the for each water-soluble protein in 𝕊 is larger than its . The relevance of rcc and to protein solvation, folding and function is further supported by the following observations. Firstly, the for 𝕄e is 1.31-fold smaller than that for 𝕄f (Table 2). Interestingly, compared with the rccs for 𝕄f, the rccs for 𝕄e do not change with protein size and are several-fold smaller (Fig. 8). Secondly, the for the lipid-exposing regions of transmembrane proteins, lipid-protein interfaces, DNA-protein interfaces and PPI interfaces are all smaller than that for 𝕊. Particularly the for the lipid-exposing regions of transmembrane proteins and lipid-protein interfaces are close to 1.0. Accordingly we expect that a protein that has a value close to 1.0 (Fig. 7 and Table 5) is likely either a peripheral membrane protein or a lipid-binding protein. For example, a previous experiment has shown that the expression in E. coli of an antiviral lectin, scytovirin, led to the accumulation of the expressed proteins in the membrane [53]. Thirdly, the rccs for PPI interfaces are several-fold smaller than those for 𝕊 [35].
Finally, as shown in Table 7 and Fig. 9, the solvent-accessible residue rccs correlate well with known hydrophobicity scales. There exists a modest correlation between and Rio (section S6 and Fig. S6 of the Supplementary Materials), likely because both are defined in terms of Ai and A0 (Eq. 7).
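The power-law dependence on protein size mentioned above can be illustrated with a short sketch. It assumes the relation has the form y = a·n^b, where n is the number of atoms, and estimates a and b by ordinary least squares in log-log space. The function name `fit_power_law` and the fitting procedure are illustrative assumptions, not the procedure used in this paper.

```python
import math

def fit_power_law(sizes, values):
    """Fit values ~ a * sizes**b by least squares on log-transformed data.

    Hypothetical helper: `sizes` could be atom counts and `values` a
    SES-defined property; both must be strictly positive.
    """
    if len(sizes) != len(values) or len(sizes) < 2:
        raise ValueError("need at least two paired observations")
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("sizes must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # exponent = slope in log-log space
    a = math.exp(mean_y - b * mean_x)  # prefactor from the intercept
    return a, b
```

Fitting in log-log space weights relative rather than absolute deviations, which may or may not match the fitting choice behind the reported power laws.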
A small rcc for a single atom implies that it has a large αs area, that is, the atom is highly exposed to solvent and is thus a good candidate for hydrogen bonding if it is a polar atom. A small rcc over a set of neighboring atoms means that the region formed by those atoms is locally-rugged and likely tightly-packed. Typically such a region has more accessible carbon atoms than a region with a larger rcc. In contrast, a large rcc for a single atom implies that the atom is largely hidden from the solvent, while a large rcc over a set of neighboring atoms means that they together form a locally-smooth surface region. Typically such a region has more accessible protons, oxygen and nitrogen atoms than a region with a smaller rcc. Compared with a rugged surface a smooth one is less disruptive to water’s hydrogen-bonded networks [7], and the VDW attraction between its surface atoms and solvent molecules is likely to be stronger. The proteins with a small rcc have surface geometrical properties akin to those for the lipid-exposing regions of transmembrane proteins, lipid-protein, DNA-protein and PPI interfaces. VDW attraction has been shown to be important for the solvation of apolar molecules in aqueous solvent [17, 39]. As shown in Fig. 8, one salient feature of rcc is that it increases with protein size, but the rate of growth becomes smaller when the number of atoms in a structure exceeds 10,000. With more accessible atoms it becomes increasingly possible to form a locally-smooth surface and consequently to have stronger VDW attraction between accessible atoms and solvent molecules. However, it is clear that some of the naturally-occurring proteins can remain soluble with a value close to 1.0 (Table 5), and there also seems to be an upper limit for for all the naturally-occurring water-soluble proteins (Fig. 7 and Table 6).
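To make the single-atom and atom-set readings of rcc concrete, the following sketch computes a concave-convex ratio from per-atom patch areas. It assumes that rcc is the total concave area divided by the total convex area; the actual definition is given by Eqs. 2 and 7, which are not reproduced in this section and may differ. The function name and the input layout are hypothetical.

```python
def concave_convex_ratio(atom_areas):
    """Return (sum of concave areas) / (sum of convex areas).

    `atom_areas` is a sequence of (convex_area, concave_area) pairs, one
    per accessible atom, in any consistent area unit. This ratio-of-sums
    form is an assumption made for illustration only.
    """
    pairs = list(atom_areas)
    convex = sum(c for c, _ in pairs)
    concave = sum(k for _, k in pairs)
    if convex <= 0:
        raise ValueError("total convex area must be positive")
    return concave / convex
```

Under this assumed form, an atom with a large convex patch and little concave area gets a small ratio, in line with the reading that a small rcc marks a highly exposed atom.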
The limited range for naturally-occurring water-soluble proteins suggests that their surfaces may have been optimized to interact with aqueous solvent. In contrast, some de novo designed proteins have rather large values (Fig. 7 and Table 6) [52], possibly because of the desire to enhance their solubility via the so-called supercharging approach, which increases the percentage of surface polar atoms over apolar ones. As shown in Fig. 10, the SES of a protein with a large has largely-exposed polar atoms while a protein with a small has largely-exposed apolar atoms.
In summary, our large-scale analyses of the SES-defined properties rcc and for different sets of protein structures and interfaces show that they likely pertain to protein solvation, folding and function, possibly via the optimization of both the intermolecular VDW attractions between the accessible apolar atoms of a protein and solvent molecules and the intermolecular hydrogen bonds between its accessible polar atoms and solvent molecules. Since there exist no or only weak correlations between and SES-defined electrical properties ρ and σs (section S6 of the Supplementary Materials), and likely rcc are related more to intermolecular VDW attraction than to intermolecular hydrogen bonding. In addition, the difference in rcc between a surface apolar atom and a surface polar atom is in line with the heterogeneity of water motion in the first hydration shell. Taken together, our SES analyses support the importance of VDW attraction to the solvation of an apolar molecule in a polar solvent [39].
3.5 Protein solubility and SES-defined properties
Previous analyses of the relationship between protein surface and protein-solvent interaction have focused mainly on SAS area and surface charge at residue-level [26, 31, 54, 34]. However, quantitative relationships between SESs and protein solvation and folding remain largely unknown and controversial [55, 56, 17, 57]. For example, past efforts to correlate SAS area with experimentally-determined solubility have met with only limited success [9]. In the following we analyze two sets of experimental solubility data to illustrate the possible advantages of using atomic SES-defined properties to characterize protein-solvent interaction in general and protein solubility in particular.
3.5.1 Experimentally-measured solubility and SES-defined properties
Recently the Scholtz group investigated seven proteins with crystal structures in order to find any correlations between experimentally-determined solubility and either SAS area or SAS-defined properties [9]. With the same goal we have analyzed the SESs of the same seven crystal structures with protons added by REDUCE [46]. As shown in Table 8 and Figs. 11 and 12, out of the list of SES-defined properties we have identified four that correlate well with the measured solubility data reported in their paper [9]. In the following we compare our SES-based analysis with their SAS-based analysis, which uses only the SAS areas of heavy atoms since no protons were added to any of the seven crystal structures. Though both our SES-based analysis (Figs. 11c, 11d, 12c and 12d) and their SAS-based analysis (Figs. 7 and 8 of their paper) have found good correlations between solubility and surface charge, important differences exist between the correlations found. Their SAS-based analysis found only one good correlation, with R² = 0.82, between the solubility in ammonium sulfate and the absolute value of net charge (Fig. 6F of their paper). In contrast, our SES-based analysis has found a good correlation with R² = 0.86 between ΣA and solubility in ammonium sulfate (Fig. 11d) and a weak correlation with R² = 0.38 between ΣA and solubility in PEG-8000 (Fig. 12d). In addition, good correlations with respective R² = 0.70 and R² = 0.73 exist between and solubility in both ammonium sulfate (Fig. 11c) and PEG-8000 (Fig. 12c). In terms of SAS area, their SAS-based analysis found good correlations with respective R² = 0.81 and R² = 0.84 between the fraction of negatively-charged SAS area and solubility in both ammonium sulfate (Fig. 8E of their paper) and PEG-8000 (Fig. 8F of their paper).
As with their analysis, we have found strong correlations with respective R² = 0.84 and R² = 0.94 between Rio and the solubility in both ammonium sulfate (Fig. 11a) and PEG-8000 (Fig. 12a). Most interestingly, good correlations with respective R² = 0.67 and R² = 0.82 exist between Aoi and the solubility in both ammonium sulfate (Fig. 11b) and PEG-8000 (Fig. 12b). No similar correlations were reported in their paper [9]. The strong correlation between Rio and the solubility and the modest correlation between Aoi and the solubility suggest that the intermolecular hydrogen bonding interaction between accessible polar atoms and solvent molecules contributes largely to protein solubility. On the other hand, there exists no clear correlation between and solubility. Though the data set is rather small and thus the significance of these correlations is limited, the relevance to protein-solvent interaction of is consistent with the conclusions drawn from our large-scale SES analyses described earlier. Importantly, these correlations between SES-defined properties and protein solubility show that SES is better than or at least as good as SAS for the evaluation of surface area’s contribution to protein-solvent interaction in general and protein solubility in particular.
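For readers who wish to reproduce this kind of comparison, the sketch below computes the coefficient of determination for a straight-line least-squares fit of solubility against one surface property. It assumes that the R-square values quoted above are of this kind, which the text does not state explicitly; the function name `r_squared` is illustrative and no data from Table 8 are included.

```python
def r_squared(xs, ys):
    """Coefficient of determination of the least-squares line y = c0 + c1*x.

    `xs` would hold one surface property per protein and `ys` the measured
    solubility; both are plain sequences of floats (hypothetical inputs).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("xs and ys must each contain distinct values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    c1 = sxy / sxx
    c0 = mean_y - c1 * mean_x
    ss_res = sum((y - (c0 + c1 * x)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_res / ss_tot
```

With only seven proteins a single point can shift such a value substantially, which is consistent with the caution expressed above about the small data set.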
3.5.2 A water-soluble protein with a few titratable surface residues
In theory protein-solvent interaction is electrostatic in nature and thus the number of titratable surface residues in a protein is expected to be closely related to protein solubility. However, a recent protein redesign experiment by Winther’s group [10] shows that the number of titratable surface residues in a protein is not a critical factor for its solubility. Specifically, starting with a naturally-occurring protein (1exg) that has only four titratable surface residues (K28, D36, R68 and H90), Winther’s group has demonstrated that a soluble, functional protein with no titratable side chains could be engineered via protein redesign. It is interesting to see whether the SES-defined properties for this particular protein differ largely from their averages for 𝕊. Since no structure is available for the redesigned protein and since the differences between 1exg and the redesigned one are likely to be small as far as their surfaces are concerned, we compare the SES-defined properties for 1exg with those for 𝕊. As shown in Table 9 and Fig. S3 of the Supplementary Materials, except for the three properties ρA, ρB and ρ, which are somewhat more positive than their averages for 𝕊, the other six SES-defined properties, and Aoi, are all rather close to their averages for 𝕊. In other words, at the atomic level this particular protein is not an outlier in terms of the SES-defined properties that likely pertain to protein-solvent interaction. Thus 1exg and very likely the redesigned protein are expected to be as soluble as a typical protein in 𝕊 (section S5 of the Supplementary Materials). This example illustrates a possible advantage of SES-defined properties at the atomic level over SAS-defined properties at residue-level for the description of protein solubility.
3.6 The statistical distributions and power laws for SES-derived properties
At present the details of protein-solvent interaction can only be obtained through all-atom MD simulation with either explicit or implicit solvent models due to the amphipathic nature of the surface of a water-soluble protein. However, long all-atom MD simulations with explicit solvent suffer from convergence problems, especially for large proteins, while implicit models rely on a priori values for dielectric constants, especially those near the surface of or inside a protein [16]. For example, an accurate dielectric constant for the protein surface is key to the computation of solvation free energy via electrostatic interaction. However, the accurate determination of dielectric constants remains a challenging problem. As described above, we have identified a list of SES-defined physical and geometrical properties that are likely to be important to protein-solvent interaction. Their statistical distributions and the power laws governing their changes with protein size, obtained over large sets of high-quality structures, may help verify theories on anion solutes in protic solvent [12] or the PDLD solvent model [58, 16]. In addition, the statistical values and the power laws for SES-defined properties could be used to restrain the folding space of a protein and thus could serve as a term in an empirical scoring function for protein structure prediction [59], protein redesign [60, 61] or quality control in structure determination [62].
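One way such a scoring term could look is sketched below: a harmonic penalty on the deviation of an observed SES-defined property from the value predicted by a power law in protein size. The functional form, the parameter names `a`, `b` and `sigma`, and the function name are all assumptions made for illustration; no such term is specified in this paper.

```python
def power_law_restraint(observed, n_atoms, a, b, sigma):
    """Harmonic penalty 0.5 * z**2 with z = (observed - a * n_atoms**b) / sigma.

    `a` and `b` stand for a fitted power law and `sigma` for the spread of
    the property around that law in a reference set (all hypothetical).
    """
    if n_atoms <= 0:
        raise ValueError("n_atoms must be positive")
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    expected = a * n_atoms ** b
    z = (observed - expected) / sigma
    return 0.5 * z * z
```

A candidate model whose property matches the power-law prediction contributes zero to the score, and the penalty grows quadratically with the deviation measured in units of `sigma`.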
4 Conclusion
The solvent-accessible surface of a water-soluble protein is closely related to protein-solvent interaction and should have been adapted to the unique properties of aqueous solvent. To evaluate the surface’s contributions to protein-solvent interaction and to find clues to its adaptation to aqueous solvent we have analyzed the solvent-excluded surfaces (SESs) of four sets of water-soluble proteins and four sets of ligand-protein interaction interfaces. We find that all the analyzed water-soluble proteins have a negative net surface charge. We have also identified a list of SES-defined physical and geometrical properties that are likely relevant to protein-solvent interaction based on their changes with protein size, their variations upon either unfolding or ligand-binding, their correlations with five known hydrophobicity scales and their correlations with experimentally-measured protein solubility. In contrast to previous structural analyses that focus mainly on solvent-accessible surface area, we find that surface charge is at least as important as surface area to protein-solvent interaction. Furthermore, our analyses show that both the intermolecular hydrogen bonds between accessible polar atoms and solvent molecules and the intermolecular VDW attractions between accessible apolar atoms and solvent molecules contribute to protein-solvent interaction. These findings are consistent with water, a protic solvent, preferring anions over cations, and show that from a protein-solvent interaction perspective to fold into a native state is to simultaneously optimize net surface charge, intermolecular hydrogen bonding and VDW attraction rather than to only minimize apolar surface area.
Our results suggest that the optimization of protein-solvent interaction through natural selection is achieved via (1) universal enrichment of negative surface charges for stronger intermolecular electrostatic interaction, (2) increased SES area for a polar atom for stronger intermolecular hydrogen bonding, and (3) higher concave-convex ratio for an accessible apolar atom for either stronger intermolecular VDW attraction or less disruption to solvent’s internal structure.
Footnotes
↵1 Abbreviations: MD, molecular dynamics; SES, solvent-excluded surface; SAS, solvent-accessible surface; VDW, van der Waals; PPI, protein-protein interaction; DNA, deoxyribonucleic acid; 2D, two-dimensional; PDB, Protein Data Bank; PSI, protein structure initiative.
↵2 In the rest of the paper, solvent-accessible atoms, accessible atoms and surface atoms are used interchangeably.
↵3 In this paper intermolecular means between a solute and its solvent.
↵4 In this paper unfolding means the change from a folded structure to an extended conformation in 𝕄e while the reverse change is called folding.
↵5 In terms of the list of SES-defined physical and geometrical properties described in this paper, no large differences exist between the SESs computed using 1.4Å probe radius and those computed using 1.2Å probe radius.
↵6 R. P. Feynman tried to explain the protein salting-out effect by assuming the existence of negative charges on protein surfaces: “The molecule (protein) has various charges on it, and it sometimes happens that there is a net charge, say negative, which is distributed along the chain” (The Feynman Lectures on Physics, Vol. 2, page 7-10).
↵7 For a SES-defined property x, denotes its average over all the sets of accessible atoms in 𝕊 except for that denotes the average over all the sets of the buried atoms in 𝕊. For brevity such a is to be written as either x average for the accessible atoms in 𝕊 or x average for the buried atoms in 𝕊 or simply as x average for 𝕊. The averages over 𝕄e are to be written in the same manner.
↵8 In this paper protein size could mean either n or nA or A since they are proportional to each other.
References
- [1].↵
- [2].↵
- [3].↵
- [4].↵
- [5].↵
- [6].↵
- [7].↵
- [8].↵
- [9].↵
- [10].↵
- [11].↵
- [12].↵
- [13].↵
- [14].↵
- [15].↵
- [16].↵
- [17].↵
- [18].↵
- [19].↵
- [20].↵
- [21].↵
- [22].↵
- [23].↵
- [24].↵
- [25].↵
- [26].↵
- [27].↵
- [28].↵
- [29].↵
- [30].↵
- [31].↵
- [32].↵
- [33].↵
- [34].↵
- [35].↵
- [36].↵
- [37].↵
- [38].↵
- [39].↵
- [40].↵
- [41].↵
- [42].↵
- [43].↵
- [44].↵
- [45].↵
- [46].↵
- [47].↵
- [48].↵
- [49].↵
- [50].↵
- [51].↵
- [52].↵
- [53].↵
- [54].↵
- [55].↵
- [56].↵
- [57].↵
- [58].↵
- [59].↵
- [60].↵
- [61].↵
- [62].↵