Abstract
We have set up and manually curated a dataset containing experimental information on the impact of amino acid substitutions in a protein on its thermal stability. It consists of a repository of experimentally measured melting temperatures (Tm) and their changes upon point mutations (ΔTm) for proteins having a well-resolved X-ray structure. This high-quality dataset is designed for training or benchmarking in silico thermal stability prediction methods. It also reports other experimentally measured thermodynamic quantities when available, i.e. the folding enthalpy (ΔH) and heat capacity (ΔCP) of the wild type proteins and their changes upon mutation (ΔΔH and ΔΔCP), as well as the change in folding free energy (ΔΔG) at a reference temperature. These data are analyzed to improve our insight into the correlation between thermal and thermodynamic stabilities, the asymmetry between the number of stabilizing and destabilizing mutations, and the difference in stabilization potential of thermostable versus mesostable proteins.
I. Introduction
The availability of a complete and well-curated dataset for training and testing purposes is a basic prerequisite for the development of any knowledge-based bioinformatics prediction tool. Here we present a repository containing thermal and thermodynamic stability data on experimentally characterized single-site protein mutants for which an X-ray structure is available, which we have set up by screening the literature and freely accessible databases. This dataset is meant to be as complete as possible and as free as possible of noise and errors. Indeed, the amount and the quality of the data used to set up a predictor are two fundamental requirements for obtaining reliable predictions.
We have used this dataset to design and test our method for predicting the change in melting temperature of proteins upon point mutations [1], and to compare its performance with that of other existing tools. This dataset is intended as a common benchmark for training and validating different in silico tools for protein stability prediction and rational design.
There is an increasing interest in the development of reliable stability predictors that can be used for rationally designing protein mutants with improved properties, and there is thus a correspondingly increasing need for high-quality and easily accessible datasets. Indeed, the design of new enzymes and other proteins that remain stable and active in unusual environments or at temperatures that differ from their physiological temperatures would allow the optimization of a wide series of biotechnological processes in many sectors such as agro-food, biopharmaceuticals and environment [2, 3].
Another asset of this dataset is that it can serve as a basis for large-scale analyses aimed at improving our understanding of the factors that modulate the thermal resistance and other thermodynamic properties of proteins and their variation upon mutation. As a matter of fact, some of the analyses performed in this paper yield interesting insights.
II. Brief Theoretical Review of Protein Stability
The protein folding transition is thermodynamically characterized by a change in free energy, enthalpy, entropy and heat capacity. Assuming protein folding to be a two-state transition and the change in heat capacity ΔCp to be temperature independent, the folding free energy ΔG(T) = Gfolded(T) – Gunfolded(T) can be written as
$$\Delta G(T) = \Delta H_m\left(1-\frac{T}{T_m}\right) - \Delta C_p\left[(T_m - T) + T\ln\frac{T}{T_m}\right] \tag{1}$$
where Tm is the melting temperature and ΔHm the enthalpic change measured at Tm. Note that with these conventions, ΔHm and ΔCp are negative; the folding free energy ΔG(T) is negative under physiological conditions and is positive for temperatures above Tm. The protein stability curve thus has an inverse bell shape, as shown in Figure 1.
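As an illustration, the two-state stability curve can be evaluated numerically. The following is a minimal sketch that assumes the standard Gibbs-Helmholtz form with a temperature-independent ΔCp; the function name and the parameter values are hypothetical and are not taken from the dataset.

```python
import math


def folding_free_energy(T, Tm, dHm, dCp):
    """Folding free energy ΔG(T) = Gfolded - Gunfolded for a two-state protein.

    T, Tm in Kelvin; dHm in kcal/mol (negative with the folding convention);
    dCp in kcal/(mol K) (negative with the folding convention).
    Assumes a temperature-independent ΔCp.
    """
    return dHm * (1.0 - T / Tm) - dCp * ((Tm - T) + T * math.log(T / Tm))


# Hypothetical wild type parameters, chosen only to show the shape of the curve.
Tm_example = 330.0    # K
dHm_example = -100.0  # kcal/mol
dCp_example = -1.5    # kcal/(mol K)

for T in (280.0, 298.0, 330.0, 340.0):
    dG = folding_free_energy(T, Tm_example, dHm_example, dCp_example)
    print(f"T = {T:5.1f} K   ΔG = {dG:7.2f} kcal/mol")
```

With these sign conventions the curve vanishes at Tm, is negative below it (folded state favored) and positive above it.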
When one or several residues of the wild type protein are substituted, the change in protein stability due to the mutation can be characterized by a temperature descriptor, namely the change in melting temperature upon mutation:
$$\Delta T_m = T_m^{\mathrm{mut}} - T_m^{\mathrm{wild}} \tag{2}$$
which measures the change in thermal stability. It can also be characterized by a free energy descriptor, i.e. the change in folding free energy at the reference temperature Tr (usually chosen to be the room temperature Tr = 298 K):
$$\Delta\Delta G(T_r) = \Delta G^{\mathrm{mut}}(T_r) - \Delta G^{\mathrm{wild}}(T_r) \tag{3}$$
which measures the change in so-called thermodynamic stability. Note that with these conventions, thermally stabilizing mutations have positive ΔTm values, and thermodynamically stabilizing mutations have negative ΔΔG(298K) values.
The dataset presented in this paper has been created in view of developing a predictor of protein thermal stability changes upon point mutations [1]. It thus contains all the point mutations with experimentally measured ΔTm values that we have collected from the literature and satisfy certain criteria. The corresponding values of ΔΔG(Tr) and of the other thermodynamic quantities appearing in equation (1) are known only for a subset of the entries.
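In order to understand the precise relation between thermal and thermodynamic stability changes, one needs independent experimental determinations of their respective descriptors ΔTm and ΔΔG(Tr), or the values of all the thermodynamic quantities that appear in equation (1), i.e. Tm, ΔHm, and ΔCP, for both the wild type and mutant proteins. Unfortunately, this information is not always available or not measured with sufficient accuracy. Under the assumption that the mutated protein is a perturbation of the wild type, some approximations can be made; for example, it is quite reasonable to consider that the parameter
$$x = \frac{\Delta T_m}{T_m^{\mathrm{wild}}} \tag{4}$$
with the temperature expressed in Kelvin, is small and thus that an expansion of equation (3) in powers of x can be performed. To first order in x, this yields:
$$\Delta\Delta G(T_r) \simeq \Delta\Delta H_m\left(1-\frac{T_r}{T_m}\right) + \left(\Delta H_m + \Delta\Delta H_m\right)\frac{T_r}{T_m}\,x - \Delta\Delta C_p\left[(T_m - T_r) + T_r\ln\frac{T_r}{T_m}\right] \tag{5}$$
where Tm and ΔHm refer to the wild type protein, ΔΔHm is the difference between the mutant and wild type folding enthalpies, both taken at the wild type melting temperature, and ΔΔCp is the difference between the mutant and wild type folding heat capacities. As seen from this equation, the correlation between ΔΔG(Tr) and ΔTm generically depends on an intricate combination of variations of thermodynamic quantities. If we assume ΔΔHm ⋍ 0 and ΔΔCp ⋍ 0, equation (5) reduces to:
$$\Delta\Delta G(T_r) \simeq \Delta H_m\,\frac{T_r}{T_m^{2}}\,\Delta T_m \tag{6}$$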
Under this (strong) assumption, we thus find a linear relation between ΔΔG(Tr) and ΔTm; the proportionality coefficient is however protein-dependent. Note that ΔHm is negative with our conventions, and that ΔΔG(Tr) and ΔTm are thus anticorrelated.
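On the other hand, at the reference temperature Tr = Tm of the wild type protein, equation (5) simplifies to:
$$\Delta\Delta G(T_m) \simeq \left(\Delta H_m + \Delta\Delta H_m\right)\frac{\Delta T_m}{T_m} \tag{7}$$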
The proportionality assumption between ΔΔG(T) and ΔTm is thus valid at Tm. If moreover we assume ΔΔHm ⋍ 0, this equation becomes the Becktel-Schellman formula [4]:
$$\Delta\Delta G(T_m) \simeq \Delta S_m\,\Delta T_m \tag{8}$$
where ΔSm = ΔHm/Tm is the entropic contribution at Tm.
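The quality of the linear approximation can be checked numerically. The sketch below compares the exact two-state ΔΔG(Tr) with the first-order estimate ΔHm Tr ΔTm / Tm², assuming ΔΔHm ⋍ 0 and ΔΔCp ⋍ 0 by default; all function names and parameter values are hypothetical, and the enthalpy change `ddHm` is taken at the wild type melting temperature.

```python
import math


def delta_g(T, Tm, dH_at_Tm, dCp):
    """Two-state folding free energy with temperature-independent ΔCp."""
    return dH_at_Tm * (1.0 - T / Tm) - dCp * ((Tm - T) + T * math.log(T / Tm))


def ddg_exact(Tr, Tm_wt, dHm_wt, dCp_wt, dTm, ddHm=0.0, ddCp=0.0):
    """Exact ΔΔG(Tr) = ΔG_mut(Tr) - ΔG_wild(Tr) under the two-state model.

    ddHm is the mutant-minus-wild-type enthalpy difference evaluated at the
    wild type melting temperature (an assumption of this sketch).
    """
    Tm_mut = Tm_wt + dTm
    dCp_mut = dCp_wt + ddCp
    # Shift the mutant enthalpy from Tm_wt to the mutant's own Tm (Kirchhoff's law).
    dH_mut_at_Tm_mut = dHm_wt + ddHm + dCp_mut * dTm
    return (delta_g(Tr, Tm_mut, dH_mut_at_Tm_mut, dCp_mut)
            - delta_g(Tr, Tm_wt, dHm_wt, dCp_wt))


def ddg_linear(Tr, Tm_wt, dHm_wt, dTm):
    """First-order estimate of ΔΔG(Tr), valid when ΔΔHm and ΔΔCp are negligible."""
    return dHm_wt * Tr * dTm / Tm_wt ** 2


# Hypothetical protein: Tm = 330 K, ΔHm = -100 kcal/mol, ΔCp = -1.5 kcal/(mol K).
exact = ddg_exact(298.0, 330.0, -100.0, -1.5, dTm=2.0)
approx = ddg_linear(298.0, 330.0, -100.0, dTm=2.0)
print(f"exact ΔΔG(298 K)  = {exact:.3f} kcal/mol")
print(f"linear ΔΔG(298 K) = {approx:.3f} kcal/mol")
```

For a stabilizing shift in Tm both values are negative, and they differ only by terms of second order in ΔTm/Tm. Passing nonzero `ddHm` or `ddCp` shows how quickly the linear relation degrades when these variations are not negligible.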
III. Methods
A. Dataset design
We started by collecting the mutations with experimentally measured ΔTm values from the ProTherm database [5], and searched for additional entries by literature screening. Each entry (including those from ProTherm) was carefully checked by hand against the original literature to remove inaccuracies and errors. We selected the mutations that satisfy the following criteria:
Only single point mutations were included.
Only mutations in proteins whose three-dimensional (3D) structures were experimentally solved by X-ray crystallography with a resolution of at most 2.5 Å were considered.
Only mutations that were experimentally characterized in monomeric proteins were taken into account, irrespective of the oligomeric state of the biological unit; this ensures that the measured Tm corresponds to the (un)folding transition and not to a change in quaternary state.
Only wild-type and mutant proteins that are described in the reference articles as undergoing a two-state (un)folding transition were included.
Mutations that stabilize or destabilize the protein by more than 20 °C were excluded, as they are likely to induce important structural modifications.
When several experimental ΔTm values were found in the literature for the same mutation, we chose the one measured at pH closest to seven and with the lowest concentration of additives; if more than one measurement in the same conditions was available, the average ΔTm was taken.
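The numerical criteria above can be sketched in code. In this illustration the `Measurement` class, its field names, and the representation of additives by a single concentration are hypothetical; giving priority to the pH criterion over the additive criterion is also an assumption, since the text does not state how the two are ranked.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Measurement:
    d_tm: float           # ΔTm in °C
    ph: float             # pH of the experiment
    additive_conc: float  # total additive concentration (hypothetical single number, in M)


def passes_filters(resolution_angstrom, d_tm):
    """Resolution of at most 2.5 Å and |ΔTm| of at most 20 °C."""
    return resolution_angstrom <= 2.5 and abs(d_tm) <= 20.0


def representative_dtm(measurements):
    """Pick the ΔTm measured at the pH closest to 7, then at the lowest
    additive concentration; average the values that tie on both criteria."""
    def key(m):
        return (abs(m.ph - 7.0), m.additive_conc)

    best_key = min(key(m) for m in measurements)
    return mean(m.d_tm for m in measurements if key(m) == best_key)


# Made-up measurements of the same mutation under different conditions.
measurements = [
    Measurement(d_tm=-3.0, ph=7.0, additive_conc=0.0),
    Measurement(d_tm=-4.0, ph=7.0, additive_conc=0.0),
    Measurement(d_tm=-1.0, ph=5.0, additive_conc=0.0),
    Measurement(d_tm=-2.0, ph=7.0, additive_conc=0.1),
]
print(representative_dtm(measurements))
```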
In addition to the change in melting temperature upon mutation, other thermodynamic quantities associated with the mutation are reported in the dataset when available. These are the ΔCP of the wild type protein and its change upon mutation ΔΔCP, ΔΔG(Tr) and the reference temperature Tr at which the measurement was performed, the ΔHm of the wild type and its measured change upon mutation, i.e. the difference
$$\Delta H_m^{\mathrm{mut}}\!\left(T_m^{\mathrm{mut}}\right) - \Delta H_m^{\mathrm{wild}}\!\left(T_m^{\mathrm{wild}}\right) \tag{9}$$
This difference refers to a quantity that is slightly different from ΔΔHm appearing in equation (5): the former is computed at different Tm values whereas the latter is computed at the wild type Tm; the difference between the two is proportional to ΔTm. We report the difference of equation (9) in the dataset rather than ΔΔHm, as it is the measured quantity. Note that for a given mutation these different quantities are not always measured in exactly the same experimental conditions as the corresponding ΔTm values.
When available, the ΔΔG values indicated in the dataset are those measured by monitoring the (un)folding transition through chemical (de)naturation using urea or guanidinium chloride (GdmCl); the temperature at which the experiments were performed is also reported. If such data are not available but all the thermodynamic quantities in equation (1) are known for the wild type and the mutant proteins, they are used to evaluate ΔΔG at 25°C. Otherwise, approximations were made to evaluate ΔΔG, and the corresponding entries in the dataset are labeled by a subscript. The ΔΔG values obtained with the approximation ΔΔCP ⋍ 0 are indicated with a subscript (b); the temperature at which they were estimated is 25°C. When the stronger approximation of also supposing ΔΔHm ⋍ 0 (see equation (6)) is used to derive the value of ΔΔG from ΔTm, we mark it with the subscript (a); the temperature at which this quantity is given is the Tm of the wild type protein. For the few entries whose values are labeled with a subscript (c), the ΔΔG(25°C) values are computed from ΔTm using an empirical correlation between the two quantities computed on a subset of mutants of the same wild type protein [6]. Finally, the ΔΔG values at Tm that are derived from the approximate expression for ΔΔG(Tm) given in refs. [7, 8] are labeled with the subscript (d).
The experimental techniques used for measuring the protein melting temperatures and other thermodynamic quantities are indicated in the dataset. These are differential scanning calorimetry (DSC), circular dichroism (CD), absorbance (Abs), and fluorescence.
The Protein Data Bank (PDB) code [9] of the best resolved 3D X-ray structure of each wild type protein is specified in the dataset. For a few entries, the PDB code is labeled with a subscript. This means that the wild type structure of the protein whose ΔTm was measured was unavailable, and that the structure of an almost identical protein was used instead, under the assumption that the impact of the modification on the structure is negligible. In particular, the 1bnih102a code means that the structure is obtained from the PDB structure 1bni with the His residue at position 102 manually substituted by an Ala. The same procedure is used for the PDB structures 1yccc102a, 1urpl265c and 5ptim52l. The other PDB codes with subscripts, i.e. 1tpkr, 1yu5d1 and 1yu5d2, refer to experimentally characterized proteins whose sequences have been manually truncated by a few residues compared to the original PDB structure. Note that we checked that the mutated or truncated residues in these pseudo-wild type proteins are all distant from the mutations whose ΔTm was measured, so that they can be assumed not to interfere.
B. Data records
The dataset contains experimental information on 1,626 point mutations that have been introduced in about 93 proteins. These data were collected by screening the literature and databases, and carefully checked on the basis of the original articles. For each mutation, the following information is reported:
The PDB [9] code of the 3D structure of the wild type protein (Column II).
The chain name, residue number and residue name of the wild type and mutant amino acids (Columns III-VI).
The experimental value of the change in melting temperature upon mutation (ΔTm) using the convention of equation (2) (Column VII).
The experimentally measured melting temperature (Tm) and the number of residues (Nr) of the wild type protein (Columns VIII and XV, respectively).
The experimental values of the change in calorimetric enthalpy upon mutation, i.e. the mutant ΔHm measured at the mutant melting temperature minus the wild type ΔHm measured at the wild type melting temperature, as defined in equation (9) (Column IX), and of the ΔHm of the wild type protein at its melting temperature (Column X), when available. The subscript (e) means that the reported values correspond to the van’t Hoff enthalpy instead of the calorimetric enthalpy.
The experimentally measured values of ΔΔCP (Column XI) and the ΔCP of the wild type protein (Column XII), when available.
The values of ΔΔG(Tr), using the conventions of equation (3) (Column XIII), and the reference temperature Tr (in degrees Celsius) at which they were measured or derived (Column XIV). Entries without subscripts are experimental or calculated from other measured thermodynamic quantities, whereas entries with subscripts (a), (b), (c) or (d) were obtained using different levels of approximation, as explained in the previous section.
The resolution of the X-ray structure (in Å) (Column XVI).
The name of the protein and its host organism (Columns XVII-XVIII).
The bibliographic references (Column XIX).
The pH and the experimental technique used for measuring ΔTm (Columns XX-XXI).
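For readers who want to load the records programmatically, one entry can be represented as follows. This is only a sketch: the class, its field names and the example values are hypothetical and do not reflect the actual layout of the distributed file; only a subset of the columns listed above is modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MutationRecord:
    pdb_code: str                              # PDB code of the wild type structure
    chain: str
    position: int
    wild_type_residue: str                     # one-letter code
    mutant_residue: str                        # one-letter code
    d_tm: float                                # ΔTm in °C (positive = thermally stabilizing)
    tm_wild: float                             # wild type Tm in °C
    ddg: Optional[float] = None                # ΔΔG in kcal/mol (negative = stabilizing)
    ddg_temperature_c: Optional[float] = None  # Tr in °C
    ddg_label: Optional[str] = None            # None, or "a", "b", "c", "d" for approximated values

    def mutation_name(self):
        """Usual short name, e.g. wild type residue + position + mutant residue."""
        return f"{self.wild_type_residue}{self.position}{self.mutant_residue}"

    def is_thermally_stabilizing(self):
        return self.d_tm > 0

    def ddg_is_approximate(self):
        """True when ΔΔG was derived using one of the approximations (a)-(d)."""
        return self.ddg_label in {"a", "b", "c", "d"}


# Made-up entry used only to exercise the class.
record = MutationRecord(
    pdb_code="1bni", chain="A", position=32,
    wild_type_residue="A", mutant_residue="G",
    d_tm=-2.5, tm_wild=54.0,
    ddg=0.7, ddg_temperature_c=25.0, ddg_label="b",
)
print(record.mutation_name(), record.is_thermally_stabilizing(), record.ddg_is_approximate())
```

Keeping the approximation label next to the ΔΔG value makes it easy to restrict an analysis to directly measured entries.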
IV. Results
We investigated some biophysical properties of the data reported in our dataset. First of all, the ΔTm distribution obtained from all the entries is dominated by destabilizing mutations, as shown in Figure 2. The average ΔTm value, 〈ΔTm〉, is indeed equal to −2.7°C, while the standard deviation and the kurtosis of the distribution are equal to 5.3°C and 4.0, respectively. The large majority of point mutations (about 70%) are thus destabilizing. Although this ΔTm distribution is not built from the ensemble of possible mutations but rather from the subset of experimentally characterized mutations, we may nevertheless assume that it represents well the actual ΔTm distribution of all possible mutations. The relative abundance of destabilizing mutations with respect to stabilizing ones can be interpreted as being due to the evolutionary force that tends to optimize the proteins for stability and thus to minimize the deleterious impact of random mutations. It must nevertheless be emphasized that all proteins are left with stability weaknesses [10] or frustrations [11], in particular in functional regions.
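The summary statistics quoted above can be computed from the ΔTm column as sketched below. The function name and the toy values are hypothetical; the sketch uses population moments and the non-excess kurtosis (fourth standardized moment), which is an assumption since the text does not state which definition was used.

```python
import math


def dtm_summary(d_tm_values):
    """Mean, standard deviation, kurtosis and fraction of destabilizing
    mutations (ΔTm < 0) for a list of ΔTm values in °C."""
    n = len(d_tm_values)
    mean = sum(d_tm_values) / n
    m2 = sum((v - mean) ** 2 for v in d_tm_values) / n  # population variance
    m4 = sum((v - mean) ** 4 for v in d_tm_values) / n  # fourth central moment
    return {
        "mean": mean,
        "std": math.sqrt(m2),
        "kurtosis": m4 / m2 ** 2,  # dimensionless, equals 3 for a Gaussian
        "fraction_destabilizing": sum(v < 0 for v in d_tm_values) / n,
    }


# Toy ΔTm values in °C, not taken from the dataset.
toy_d_tm = [-8.0, -4.5, -3.0, -2.0, -1.0, -0.5, 0.5, 1.5, -6.0, 2.0]
print(dtm_summary(toy_d_tm))
```

Note that the kurtosis is a ratio of moments and therefore carries no unit, unlike the mean and the standard deviation.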
The solvent accessibility of the mutated residues is an important feature that modulates the average stability changes. It is defined as the ratio between the solvent accessible surface of a residue X in a given structure and in the extended tripeptide Gly-X-Gly conformation, and has been computed using an in-house program [12]. In Figures 3a-c, we show the experimental ΔTm distribution as a function of the solvent accessibility of the mutated residues; three solvent accessibility (Acc) ranges are considered: Acc < 15% (core), 15% < Acc < 50% (partially buried) and Acc > 50% (surface). The three distributions were found to be significantly different according to the 2-sample Kolmogorov-Smirnov (K-S) test (P-value < 10⁻³). The mean 〈ΔTm〉 values of the distributions are equal to −4.3°C, −1.6°C and −1.1°C for the core, partially buried and surface mutations, respectively. As expected, the mutations in the core are on average more destabilizing than those at the surface, since core residues play a stronger role in the structural stability than surface residues, which also contribute to stability but to a lesser extent.
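The binning by solvent accessibility and the comparison of two ΔTm distributions can be sketched as follows. The function names are hypothetical, the assignment of the boundary values (exactly 15% or 50%) is an assumption, and only the K-S statistic D is computed here; the P-value would in practice come from a statistics library.

```python
def accessibility_class(acc_percent):
    """Assign a residue to one of the three accessibility ranges.

    Values of exactly 15% or 50% are put in the partially buried class,
    which is a choice made for this sketch.
    """
    if acc_percent < 15.0:
        return "core"
    if acc_percent <= 50.0:
        return "partially buried"
    return "surface"


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute
    difference between the two empirical cumulative distributions."""
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance both samples past x so that ties are handled together.
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


# Toy (accessibility %, ΔTm °C) pairs, not taken from the dataset.
toy = [(5, -6.0), (10, -4.0), (12, -3.5), (30, -2.0), (40, -1.0), (70, -0.5), (90, 0.5)]
by_class = {}
for acc, d_tm in toy:
    by_class.setdefault(accessibility_class(acc), []).append(d_tm)
print(ks_statistic(by_class["core"], by_class["surface"]))
```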
It is also informative to analyze the relation between ΔTm and Tm. Indeed, it could be argued that it is "easier" to destabilize thermostable proteins or, equivalently, to stabilize mesostable proteins. To check this hypothesis, we computed 〈ΔTm〉 separately for the mutations introduced in thermostable proteins, defined here as having a melting temperature higher than 65°C, and those introduced in mesostable proteins, with Tm < 65°C. We obtain a value of 〈ΔTm〉 = –3.6°C for thermostable proteins and 〈ΔTm〉 = –2.1°C for mesostable proteins. On average, mutations in thermostable proteins are thus more destabilizing than in mesostable proteins. Moreover, the normalized ΔTm distributions for mesostable and thermostable proteins are shown in Figure 4. They are statistically different according to the K-S test with a P-value < 10⁻⁴, and the former appears to be shifted towards stabilizing mutations compared to the latter. This result supports the view that the fraction of stabilizing mutations is larger in mesostable proteins than in thermostable proteins, and thus that the former are easier to stabilize than the latter, in agreement with the starting hypothesis.
We also analyzed the other thermodynamic quantities reported in our dataset and first of all, the change in folding enthalpy upon point mutations. The normalized distribution of the measured enthalpy change (see equation (9)) is plotted in Figure 5a; its mean value is 7.3 kcal/mol for the set of 993 entries for which this quantity has been measured experimentally. Hence, the mutations are on average enthalpically destabilizing at Tm. Figure 5b shows the normalized distribution of the change in entropy upon mutation. The mean value of this distribution is found to be 0.04 kcal/(mol K).
Finally, we plotted the normalized distributions of ΔΔCP and ΔΔG in Figures 5c-d. The mean values of these two distributions are positive: 〈ΔΔCP〉 = 0.08 kcal/(mol K) for the set of 250 entries for which this value has been measured, and 〈ΔΔG〉 = 0.89 kcal/mol for 1,147 entries. Hence, on average, the mutant proteins have a less negative ΔCP and a less negative ΔG than the wild type proteins, and are thus thermodynamically less stable than the wild type. Note that the asymmetric nature of the ΔΔG distribution is likely to cause biases in the prediction methods that use these data as learning set [13].
In summary, the large majority of the mutations are thermodynamically destabilizing (as measured by positive ΔΔG) and thermally destabilizing (as measured by negative ΔTm).
The next point we investigated is the correlation between the thermodynamic stability descriptor ΔΔG(Tr) and the thermal stability descriptor ΔTm. Indeed, these two quantities are often taken as equivalent stability measures even though this assumption is based on an approximation, as shown in equation (6). Nevertheless, this hypothesis seems a priori not totally unjustified, as the linear anticorrelation between the two quantities is in general quite good. In our dataset, the Pearson correlation coefficient r, computed on the 1,147 mutations for which both ΔΔG(Tr) and ΔTm are available independently from the choice of Tr is equal to:
We must however note that the temperature at which the ΔΔG measurements were performed is not always the same (as described in the "Dataset design" subsection of Methods); it is usually either 25°C or the Tm of the wild type protein.
For the subset of 461 mutations for which ΔΔG has been measured or computed at the Tm of the wild type protein, the linear anticorrelation is close to perfect:
The anticorrelation does not reach −1 since the proportionality coefficient between ΔΔG(Tm) and ΔTm is protein-dependent (see equations (6)-(7)). This anticorrelation is illustrated in Figure 6b.
In contrast, for the 449 mutations for which ΔΔG(25°C) has been directly measured or for which all thermodynamic quantities that allow using the full equation (5) have been measured (entries without subscript in the dataset), the anticorrelation is much weaker, as shown in Figure 6a. We would like to stress that the value of this ΔΔG(25°C)-ΔTm anticorrelation coefficient can be expected to be close to the real one, as it has not been artificially improved by adding computed ΔΔG values that presuppose this anticorrelation.
For some entries, the two descriptors ΔΔG(25°C) and ΔTm are correlated rather than anticorrelated. These entries signal interesting mutations that stabilize the protein thermally while destabilizing it thermodynamically at room temperature or, conversely, destabilize it thermally while stabilizing it thermodynamically. As an example of such unusual behavior, we plotted in Figure 1 the full protein stability curves of the wild type human lysozyme and of its mutant R21A [14]; this mutant is one of the entries of our dataset.
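The Pearson coefficients discussed in this section can be computed per subset as sketched below. The function name, the tuple layout and the toy values are hypothetical; the approximation label is used only to separate directly determined ΔΔG values from approximated ones.

```python
import math


def pearson_r(x, y):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy entries: (ΔTm in °C, ΔΔG in kcal/mol, approximation label or None).
toy_entries = [
    (-5.0, 1.6, None),
    (-2.0, 0.4, None),
    (1.0, -0.1, None),
    (3.0, -0.9, None),
    (-4.0, 1.2, "a"),
    (2.0, -0.6, "a"),
]

# Keep only entries whose ΔΔG does not presuppose a ΔΔG-ΔTm relation.
direct = [(d_tm, ddg) for d_tm, ddg, label in toy_entries if label is None]
r_direct = pearson_r([d for d, _ in direct], [g for _, g in direct])
print(f"r (entries without label) = {r_direct:.2f}")
```

Computing the coefficient separately on labeled and unlabeled entries avoids inflating the anticorrelation with ΔΔG values that were themselves derived from ΔTm.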
Sources of experimental errors
The experimental errors on the measured thermodynamic quantities describing the folding transition have to be taken into consideration. The noisiest thermodynamic quantity is ΔCP, whose error is generally of the order of 10-20%. This error is sometimes of the same order as ΔΔCP itself, which makes the numerical evaluation of equation (5) imprecise, even though the ΔΔCP term is subleading compared to the others.
The errors on the two thermodynamic descriptors ΔHm and Tm are in general less severe, being of the order of a few percent. These should thus not significantly affect the results obtained in this analysis.
Another source of error comes from the fact that the experiments are often performed in different environmental conditions in terms of pH, buffer type, ionic concentration and additives. Such errors are non-negligible, even if their effect can be expected to be less important for the variation of the thermal characteristics upon mutation than for the thermal characteristics themselves. Moreover, to decrease this effect, we have collected data that are as uniform as possible in terms of environmental variables, as explained in the section "Dataset design". Note that the size of this type of error is difficult to quantify in general.
Usage notes
The dataset that we have constructed is available as a PDF file attached to this paper and can be downloaded as a text file at the address http://babylone.ulb.ac.be.
Data Citations
Bibliographic information for the data records described in the manuscript.
References
- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
- 9.
- 10.
- 11.
- 12.
- 13.
- 14.
- 15.
- 16.
- 17.
- 18.
- 19.
- 20.
- 21.
- 22.
- 23.
- 24.
- 25.
- 26.
- 27.
- 28.
- 29.
- 30.
- 31.
- 32.
- 33.
- 34.
- 35.
- 36.
- 37.
- 38.
- 39.
- 40.
- 41.
- 42.
- 43.
- 44.
- 45.
- 46.
- 47.
- 48.
- 49.
- 50.
- 51.
- 52.
- 53.
- 54.
- 55.
- 56.
- 57.
- 58.
- 59.
- 60.
- 61.
- 62.
- 63.
- 64.
- 65.
- 66.
- 67.
- 68.
- 69.
- 70.
- 71.
- 72.
- 73.
- 74.
- 75.
- 76.
- 77.
- 78.
- 79.
- 80.
- 81.
- 82.
- 83.
- 84.
- 85.
- 86.
- 87.
- 88.
- 89.
- 90.
- 91.
- 92.
- 93.
- 94.
- 95.
- 96.
- 97.
- 98.
- 99.
- 100.
- 101.
- 102.
- 103.
- 104.
- 105.
- 106.
- 107.
- 108.
- 109.
- 110.
- 111.
- 112.
- 113.
- 114.
- 115.
- 116.
- 117.
- 118.
- 119.
- 120.
- 121.
- 122.
- 123.
- 124.
- 125.
- 126.
- 127.
- 128.
- 129.
- 130.
- 131.
- 132.
- 133.
- 134.
- 135.
- 136.
- 137.
- 138.
- 139.
- 140.
- 141.
- 142.
- 143.
- 144.
- 145.
- 146.
- 147.
- 148.
- 149.
- 150.
- 151.
- 152.
- 153.
- 154.
- 155.
- 156.
- 157.
- 158.
- 159.
- 160.
- 161.
- 162.
- 163.
- 164.
- 165.
- 166.
- 167.
- 168.
- 169.
- 170.
- 171.
- 172.
- 173.
- 174.
- 175.
- 176.
- 177.
- 178.
- 179.
- 180.
- 181.
- 182.
- 183.
- 184.
- 185.
- 186.
- 187.
- 188.
- 189.
- 190.
- 191.
- 192.
- 193.
- 194.
- 195.
- 196.
- 197.
- 198.
- 199.
- 200.
- 201.
- 202.
- 203.
- 204.
- 205.
- 206.
- 207.
- 208.
- 209.
- 210.
- 211.
- 212.
- 213.
- 214.
- 215.
- 216.
- 217.
- 218.
- 219.
- 220.
- 221.
- 222.
- 223.
- 224.
- 225.
- 226.
- 227.
- 228.
- 229.
- 230.
- 231.
- 232.
- 233.
- 234.
- 235.
- 236.
- 237.
- 238.
- 239.
- 240.
- 241.
- 242.
- 243.
- 244.
- 245.
- 246.
- 247.
- 248.
- 249.
- 250.
- 251.
- 252.
- 253.
- 254.
- 255.
- 256.
- 257.
- 258.
- 259.
- 260.
- 261.
- 262.
Acknowledgments
We acknowledge support from an FRFC grant from the Belgian Fund for Scientific Research (FNRS). RB is a Postdoctoral Fellow, FP a Postdoctoral Researcher and MR a Research Director at the FNRS.