ABSTRACT
Proteins are highly dynamic macromolecules. A classical way to analyze their inner flexibility is to perform molecular dynamics simulations, which provide pertinent results for both basic and applied research. The different approaches to define and analyze protein rigidity or flexibility were established years ago. In this context, we present the advantages of using small structural prototypes, namely the Protein Blocks (PBs). PBs give a good approximation of the local structure of the protein backbone. More importantly, they allow analyses of local protein deformability that cannot be done with other methods, and they have been used efficiently in different applications. PBxplore is a suite of tools to analyze the dynamics of protein structures using PBs. It is able to process large amounts of data such as those produced by molecular dynamics simulations. It produces various outputs with text and graphics, such as frequencies, entropy and information logos. PBxplore is available at https://github.com/pierrepo/PBxplore and is released under the open-source MIT license.
Introduction
Proteins are highly dynamic macromolecules1, 2. To analyze their inner flexibility, computational biologists often use molecular dynamics (MD) simulations. The quantification of protein flexibility is based on various methods such as Root Mean Square Fluctuations (RMSF), which rely on multiple MD snapshots, or Normal Mode Analysis (NMA), which relies on a single structure and focuses on quantifying large movements.
Other interesting in silico approaches assess protein motions through the protein residue network3 or dynamical correlations from MD simulations4, 5. We can also note the development of the MOdular NETwork Analysis (MONETA), which localizes the propagation of perturbations throughout a protein structure6.
However, a classical limitation of all analyses of protein structures lies in their description. Protein structures are often considered as rigid bodies described by two regular states, namely the α-helices7, 8 and the β-sheets (composed of β-strands)9, and one non-repetitive state, the coil (or loops)10. The use of only three states oversimplifies the description of protein structures11; 50% of all residues are classified as coil, even when they encompass repeated local structures12, 13, which emphasizes the lack of a more detailed description. To address this, small prototypes or “structural alphabets” (SAs) have been developed. Protein Blocks (PBs)14 are the most widely used structural alphabet15–17. They approximate conformations of protein backbones and code the local structures of proteins as one-dimensional sequences18.
PBs are composed of 16 local prototypes designed through an unsupervised training performed on a representative nonredundant databank of protein structures14. PBs are labeled from a to p (see Figure 1a). PBs m and d can be described as prototypes for α-helix and central β-strand, respectively. PBs a to c primarily represent β-strand N-caps and PBs e and f, β-strand C-caps; PBs g to j are specific to coils, PBs k and l are specific to α-helix N-caps, and PBs n to p to α-helix C-caps15. Figure 1 illustrates how a PB sequence is assigned from a protein structure. Starting from the 3D coordinates of the barstar protein (Figure 1b), the local structure of each amino acid is compared to the 16 PB definitions (Figure 1a). The most similar protein block is assigned to the residue under consideration (the similarity metric is explained in a later section of the article). Eventually, assignment leads to the PB sequence (Figure 1c).
PBs efficiently describe long protein fragments19, 20 and short loops13, 21, 22. They have also been used to analyze protein contacts23, to propose a structural model of a transmembrane protein15, to reconstruct globular protein structures24, to design peptides25, to define binding site signatures26, to perform local protein conformation predictions27–31, to predict β-turns32 and, recently, to understand local conformational changes due to mutations of the αIIbβ3 human integrin33–35.
PBs are also useful to compare and superimpose protein structures with pairwise and multiple approaches36, 37, namely iPBA38 and mulPBA39, both of which currently show the best results compared to other superimposition methods. Finally, PBs are also the only SA that has been used to predict protein structures from their sequences40, 41 and to predict protein flexibility42, 43.
Our results on biological systems such as the DARC protein44, the human αIIbβ3 integrin33–35 and the KISSR1 protein45 highlighted the usefulness of PBs to understand local deformations of protein structures. Specifically, these analyses have shown that a region considered as highly flexible through RMSF quantification can be seen through PBs as locally highly rigid. This unexpected behavior is explained by a local rigidity surrounded by deformable regions46. The only other related approach based on an SA is GSATools47, which is specialized in the analysis of functional correlations between local and global motions and of the mechanisms of allosteric communication.
We thus propose PBxplore, a tool to analyze local protein structure and deformability using PBs. It is available at https://github.com/pierrepo/PBxplore. PBxplore can read PDB structure files48, PDBx/mmCIF structure files49, and MD trajectory formats from most MD engines, including Gromacs MD topology and trajectory files50, 51. Starting from 3D protein structures, PBxplore assigns PBs sequences and computes a local measurement of entropy, a density map of PBs along the protein sequence and a WebLogo-like representation of PBs.
In this paper, we first present the principle of PBxplore, then its different tools, and finally a simple use case with the β3 subunit of the human platelet integrin αIIbβ3.
Design and Implementation
PBxplore is written in Python52–54. It is compatible with Python 2.7, and with Python 3.4 or greater. It requires the NumPy Python library for array manipulation55, the matplotlib library for graphical representations, and the MDAnalysis library for reading molecular dynamics simulation files56. Optionally, PBxplore functionalities can be enhanced by installing WebLogo57 to create sequence logos.
PBxplore is available as a set of command-line tools and as a Python module. Users less familiar with the Python programming language can use the command-line programs. These programs can be linked up together to make a structure analysis pipeline of protein flexibility. For more advanced users, PBxplore provides an API to access its core functionalities and allow creation of custom workflows.
PBxplore is released under the open-source MIT license58. It is available on the software development platform GitHub59 at https://github.com/pierrepo/PBxplore. The package contains unit and regression tests and is continuously tested using Travis CI60. Extensive documentation is available on Read the Docs61 at https://pbxplore.readthedocs.io.
Installation
The easiest way to install PBxplore is through the Python Package Index (PyPI):
pip install --user pbxplore
It will ensure all required dependencies are installed correctly.
Command-line Tools
A schematic description of PBxplore command line interface is provided in Figure 2. The interface is composed of three different programs: PBassign to assign PBs, PBcount to compute PBs frequency on multiple conformations, and PBstat to perform statistical analyses and visualization. These programs can be linked up together to make a structure analysis pipeline to study protein flexibility.
PBassign
The very first task is to assign PBs from the protein structure(s). A PB is associated with each pentapeptide included in the protein sequence. To assign a PB to a residue n, 5 residues are required (residues n − 2, n − 1, n, n + 1 and n + 2). From the structure of these 5 residues, 8 dihedral angles (ψ and ϕ) are computed15, going from the ψ angle of residue n − 2 to the ϕ angle of residue n + 2. This set of 8 dihedral angles is then compared to the reference angle sets of the 16 PBs14 using the Root Mean Square Deviation on Angles (RMSDA) measure, i.e., a Euclidean distance on angles. The PB with the smallest RMSDA is assigned to residue n. A dummy PB “Z” is assigned to residues for which all 8 angles cannot be computed. Hence, the first two N-terminal and the last two C-terminal residues are always assigned to PB “Z”.
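The assignment rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the PBxplore implementation: the two prototypes in `TOY_PROTOTYPES` are hypothetical placeholders (the real alphabet has 16 prototypes of 8 angles each, see reference 14), and wrapping each angle difference into [-180, 180) degrees is an assumption of this sketch.

```python
import math

# Hypothetical two-prototype table, for illustration only. The real PB
# alphabet contains 16 prototypes of 8 dihedral angles each.
TOY_PROTOTYPES = {
    "x": [-60.0, -45.0, -60.0, -45.0, -60.0, -45.0, -60.0, -45.0],
    "y": [-120.0, 130.0, -120.0, 130.0, -120.0, 130.0, -120.0, 130.0],
}


def angle_diff(a, b):
    """Difference a - b in degrees, wrapped into [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def rmsda(angles, reference):
    """Root mean square deviation on angles between two equal-length sets."""
    squared = [angle_diff(a, r) ** 2 for a, r in zip(angles, reference)]
    return math.sqrt(sum(squared) / len(squared))


def assign_block(angles, prototypes=TOY_PROTOTYPES):
    """Return the prototype label with the smallest RMSDA.

    The dummy label 'Z' is returned when the 8 angles are not all defined.
    """
    if len(angles) != 8 or any(a is None for a in angles):
        return "Z"
    return min(prototypes, key=lambda label: rmsda(angles, prototypes[label]))
```

With this toy table, `assign_block([-115.0, 125.0] * 4)` selects the hypothetical prototype `"y"`, and any set containing an undefined angle yields `"Z"`.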
The program PBassign reads one or several protein 3D structures and performs PBs assignment as one PBs sequence per input structure. PBassign can process multiple structures at once, either provided as individual structure files, as a directory containing many structure files or as topology and trajectory files issued from MD simulations. Note that PBxplore should be able to read any trajectory file format handled by the MDAnalysis library, although we have only tested Gromacs and CHARMM trajectories. Output PBs sequences are bundled in a single file in fasta format.
PBcount
During the course of an MD simulation, the local protein conformations can change. It is then interesting to analyze them through PB description. For that, once PBs are assigned, PBs frequencies per residue can be computed.
The program PBcount reads PBs sequences for different conformations of the same protein from a file in the fasta format (as output by PBassign). Many input files can be provided at once. The output data is a 2D matrix of x rows by y columns, where x is the length of the protein sequence and y is the number of distinct PBs (16). A matrix element is the count of a given PB at a given position in the protein sequence.
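The counting step described above amounts to building a position-by-PB matrix. The following Python sketch shows the idea on PB sequences already held in memory; it is not the PBcount code. The function name `count_blocks` is hypothetical, and ignoring the dummy PB “Z” is an assumption of this sketch.

```python
PB_NAMES = "abcdefghijklmnop"  # the 16 PB labels, a to p


def count_blocks(sequences):
    """Count each PB at each position over equal-length PB sequences.

    Returns a list of rows, one per residue, each holding 16 counts.
    The dummy PB 'Z' is not counted (an assumption of this sketch).
    """
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all PB sequences must have the same length")
    counts = [[0] * len(PB_NAMES) for _ in range(length)]
    for seq in sequences:
        for position, block in enumerate(seq):
            column = PB_NAMES.find(block)
            if column >= 0:
                counts[position][column] += 1
    return counts


# Two toy conformations of a 5-residue fragment
toy_counts = count_blocks(["ZZmmd", "ZZmnd"])
```

Here the third residue is counted twice as PB m, whereas the fourth residue is counted once as m and once as n.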
PBstat
The number of possible conformational states covered by PBs is higher than in the classical secondary structure description (16 states instead of 3). As a consequence, the amount of information produced by PBcount can be complex to handle. Hence, we propose three simple ways to visualize the variation of PBs that occurs during an MD simulation.
The program PBstat reads PBs frequencies as computed by PBcount. It can produce three types of outputs based on the input argument(s). The first two use the matplotlib library and the last one requires the installation of the third-party tool WebLogo57. PBstat also offers two options (--residue-min and --residue-max) to define a residue range, allowing the user to quickly look at segments of interest. The three graphical representations proposed are:
Distribution of PBs. This feature plots the frequency of each PB along the protein sequence. The output file can be in .png, .jpg or .pdf format. A dedicated colorblind-safe color range62 allows visualizing the distribution of PBs. For a given position in the protein sequence, blue corresponds to a null frequency, when the particular PB is never met at this position, and red corresponds to a frequency of 1, when the particular PB is always found at this position. It is produced with the --map argument.
Equivalent number of PBs (Neq). The Neq is a statistical measurement similar to entropy18. It represents the average number of PBs taken by a given residue. Neq is calculated as follows:
$$N_{eq} = \exp\left(-\sum_{x=1}^{16} f_x \ln f_x\right)$$
where f_x is the probability (or frequency) of the PB x. A Neq value of 1 indicates that only a single type of PB is observed, while a value of 16 is equivalent to a random distribution, i.e. all PBs are observed with the same frequency 1/16. For example, a Neq value around 5 means that, across all the PBs observed at the position of interest, 5 different PBs are mainly observed. A Neq of exactly 5 is obtained, for instance, when 5 different PBs are observed in equal proportions (i.e. 1/5).
A high Neq value can be associated with a local deformability of the structure, whereas a Neq value close to 1 indicates a rigid structure. In the context of structures issued from MD simulations, the concept of deformability / rigidity is independent of that of mobility. The Neq profile is produced with the --neq argument.
Logo representation of PBs frequency. This is a WebLogo-like representation57 of PBs sequences. The size of each PB is proportional to its frequency at a given position in the sequence. This type of representation is useful to pinpoint PBs patterns. This WebLogo-like representation is produced with the --logo argument.
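As an illustration of the Neq measure listed above, here is a minimal Python sketch. It assumes that Neq is the exponential of the Shannon entropy of the PB frequencies at one position, which reproduces the limiting values quoted in the text (1 for a single PB, 16 for a uniform distribution). The function name `neq` is hypothetical and this is not the PBstat code.

```python
import math


def neq(counts):
    """Equivalent number of PBs for one residue, from its PB counts.

    Assumes Neq = exp(-sum(f * ln(f))) over the PBs with non-zero
    frequency f at this position.
    """
    total = float(sum(counts))
    if total == 0:
        raise ValueError("no PB observed at this position")
    entropy = 0.0
    for count in counts:
        if count > 0:
            freq = count / total
            entropy -= freq * math.log(freq)
    return math.exp(entropy)


rigid = neq([10] + [0] * 15)        # a single PB is always observed
uniform = neq([1] * 16)             # all 16 PBs are equally frequent
five_equal = neq([2] * 5 + [0] * 11)  # 5 PBs in equal proportions
```

Under this assumption, `rigid`, `uniform` and `five_equal` evaluate to 1, 16 and 5 respectively, up to floating-point rounding.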
Python Module
PBxplore is also a Python module that more advanced users can embed in their own Python script. Here is a Python 3 example that assigns PBs from the structure of the barstar ribonuclease inhibitor63:
import urllib.request
import pbxplore as pbx

# Download the PDB file
urllib.request.urlretrieve('https://files.rcsb.org/view/1BTA.pdb', '1BTA.pdb')

# The function pbx.chains_from_files() reads a list of files
# and for each one returns the chain and its name.
for chain_name, chain in pbx.chains_from_files(['1BTA.pdb']):
    # Compute phi and psi angles
    dihedrals = chain.get_phi_psi_angles()
    # Assign PBs
    pb_seq = pbx.assign(dihedrals)
    print('PBs sequence for chain {}:\n{}'.format(chain_name, pb_seq))
The documentation contains complete and executable Jupyter notebooks explaining how to properly use the module. They go from the PBs assignment to the visualization of protein deformability using the analysis functions. This allows the user to quickly understand the architecture of the module.
Results
This section aims at giving the reader a quick tour of PBxplore features on a real-life example. We will focus on the β3 subunit of the human platelet integrin αIIbβ3 that plays a central role in hemostasis and thrombosis. The β3 subunit has also been reported in cases of alloimmune thrombocytopenia64, 65. We recently studied this protein by MD simulations (for more details, see references33–35).
The β3 integrin subunit structure66 comes from the structure of the integrin complex (PDB 3FCS67). The final structure has 690 residues and was used for MD simulations. All files mentioned below are available in the demo_paper directory from the GitHub repository (https://github.com/pierrepo/PBxplore/tree/master/demo_paper).
Protein Blocks assignment
The initial file beta3.pdb contains 225 structures issued from a single 50 ns MD simulation of the β3 integrin.
PBassign -p beta3.pdb -o beta3
This instruction generates the file beta3.PB.fasta. It contains as many PB sequences as there are structures in the input beta3.pdb file.
Protein Blocks assignment is the slowest step. In this example, it took roughly 80 seconds on a laptop with a quad-core 1.6 GHz processor.
Protein Blocks frequency
PBcount -f beta3.PB.fasta -o beta3
The above command line produces the file beta3.PB.count that contains a 2D-matrix with 16 columns (as many as different PBs) and 690 rows (one per residue) plus one supplementary column for residue number and one supplementary row for PBs labels.
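For readers who want to post-process such a count table in their own scripts, the following Python sketch parses a table with the structure described above. The whitespace-delimited layout (one header row of PB labels, then one row per residue starting with the residue number) is an assumption of this sketch, as is the function name `parse_count_table`; the toy table uses only three labels to stay short.

```python
def parse_count_table(text):
    """Parse a count table given as text.

    Assumed layout: a header row of PB labels, then one row per residue
    made of the residue number followed by one count per label.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    labels = lines[0].split()
    table = {}
    for line in lines[1:]:
        fields = line.split()
        residue = int(fields[0])
        values = [int(value) for value in fields[1:]]
        if len(values) != len(labels):
            raise ValueError(
                "residue {}: expected {} counts, found {}".format(
                    residue, len(labels), len(values)
                )
            )
        table[residue] = dict(zip(labels, values))
    return labels, table


# Toy table with 3 labels and 2 residues (a real table has 16 labels)
example = """\
    a  b  c
1   5  0  0
2   1  3  1
"""
labels, table = parse_count_table(example)
most_frequent = max(table[2], key=table[2].get)
```

In this toy table, the most frequent PB at residue 2 is b.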
Statistical analysis
Distribution of PBs
PBstat -f beta3.PB.count -o beta3 --map
Figure 3 shows the distribution of PBs for the β3 integrin. The color scale ranges from blue (the PB is not found at this position) to red (the PB is always found at this position). The β3 protein has 690 residues. This leads to a cluttered figure and prevents getting any details on a specific residue (Figure 3a). However, it exhibits some interesting patterns colored in red that correspond to series of neighboring residues exhibiting a fixed PB during the entire MD simulation. See for instance patterns associated with PBs d and m that reveal β-sheet and α-helix secondary structures15.
With a large protein such as this one, it is better to look at limited segments. A focus on the PSI domain (residues 1 to 56)33, 67 of the β3 integrin was achieved with the command:
PBstat -f beta3.PB.count -o beta3 --map --residue-min 1 --residue-max 56
Figure 3b shows the PSI domain dynamics in terms of PBs. Interestingly, residue 33 is the site of the human platelet antigen (HPA)-1 alloimmune system. It is the first cause of alloimmune thrombocytopenia in Caucasian populations and a risk factor for thrombosis64, 65. In Figure 3b, this residue occupies a stable conformation with PB h. Residues 33 to 35 define a stable core composed of PBs h-i-a. This core is found in all of the 225 conformations extracted from the MD simulation and is therefore considered highly rigid. In contrast, residue 52 is flexible, as it is found associated with PBs i, j, k and l, corresponding to coil and α-helix conformations.
Equivalent number of PBs
The Neq is a statistical measurement similar to entropy and is related to the flexibility of a given residue. The higher the value, the more flexible the backbone. The Neq for the PSI domain (residues 1 to 56) was obtained from the command line:
PBstat -f beta3.PB.count -o beta3 --neq --residue-min 1 --residue-max 56
The output file beta3.PB.Neq.1-56 contains two columns, corresponding to the residue numbers and the Neq values. Figure 4a represents the Neq along with the PBs sequence of the PSI domain, as generated by PBstat. The rigid region 33-35 and the flexible residue 52 are easily spotted, with low Neq values for the former and a high Neq value for the latter.
An interesting point, seen in our previous studies, is that the region delimited by residues 33 to 35 was shown to be highly mobile by the RMSF analysis we performed in Jallu et al.33 (for more details, see the Materials and Methods section in Jallu et al.33). For comparison, RMSF and Neq are represented on the same graph in Figure 4b. This high mobility was correlated with the location of this region in a loop, which globally moved a lot in our MD simulations. Here, we observe that the region 33-35 is rigid. The high values of RMSF we observed in our previous work were due to flexible residues in the vicinity of the region 33-35, probably acting as hinges (residues 32 and 36–37). Understanding the flexibility of residues 33 to 35 is important since this region defines the HPA-1 alloantigenic system involved in severe cases of alloimmune thrombocytopenia. PBxplore allows discriminating between flexible and rigid residues; the Neq is a metric of deformability and flexibility whereas RMSF quantifies mobility.
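The distinction between mobility and deformability can be illustrated with a toy Python example, which is not the analysis performed in Jallu et al. A hypothetical 3-atom fragment is translated as a rigid body over four frames: its RMSF is clearly non-zero, although its internal geometry never changes. The coordinates are invented for illustration, and no structural superposition is applied before computing the RMSF, which is a simplification.

```python
import math


def rmsf(frames):
    """RMSF per atom; frames is a list of frames, each a list of (x, y, z).

    No fitting to a reference structure is done in this sketch.
    """
    n_frames = float(len(frames))
    n_atoms = len(frames[0])
    result = []
    for i in range(n_atoms):
        mean = [sum(frame[i][k] for frame in frames) / n_frames for k in range(3)]
        msd = sum(
            sum((frame[i][k] - mean[k]) ** 2 for k in range(3))
            for frame in frames
        ) / n_frames
        result.append(math.sqrt(msd))
    return result


def distance(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical fragment shifted along x as a rigid body over 4 frames
fragment = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
frames = [
    [(x + shift, y, z) for (x, y, z) in fragment]
    for shift in (0.0, 1.0, 2.0, 3.0)
]

mobility = rmsf(frames)  # non-zero for every atom
internal = [distance(frame[0], frame[2]) for frame in frames]  # constant
```

Every atom of this fragment has the same non-zero RMSF, while the distance between its first and last atoms is identical in all frames, which is the situation of a rigid segment carried by flexible hinges.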
Logo representation of PBs frequency
While the Neq analysis focuses on the flexibility of amino acids, the WebLogo-like representation57 aims at identifying the diversity of PBs and their frequencies at a given position in the protein sequence. With a focus on the PSI domain, the following command line was used:
PBstat -f beta3.PB.count -o beta3 --logo --residue-min 1 --residue-max 56
Figure 5 represents PBs found at a given position. The rigid region 33-35 is composed of a succession of PBs h-i-a, while the flexible residue 52 is associated with PBs i, j, k and l. This third representation summarizes pertinent information, as shown in ref.34.
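The commands used in this section can be chained from a Python script. The sketch below only assembles the command lines shown above and provides a helper to run them in order; the function names `pipeline_commands` and `run_pipeline` are hypothetical, and running the pipeline assumes that the PBxplore command-line tools are installed and available on the PATH.

```python
import subprocess


def pipeline_commands(structure, prefix, residue_min, residue_max):
    """Build the PBassign, PBcount and PBstat command lines as argument lists."""
    fasta = prefix + ".PB.fasta"  # file written by PBassign
    count = prefix + ".PB.count"  # file written by PBcount
    window = [
        "--residue-min", str(residue_min),
        "--residue-max", str(residue_max),
    ]
    commands = [
        ["PBassign", "-p", structure, "-o", prefix],
        ["PBcount", "-f", fasta, "-o", prefix],
    ]
    for option in ("--map", "--neq", "--logo"):
        commands.append(["PBstat", "-f", count, "-o", prefix, option] + window)
    return commands


def run_pipeline(commands):
    """Run each command in order and stop at the first failure."""
    for command in commands:
        subprocess.check_call(command)


commands = pipeline_commands("beta3.pdb", "beta3", 1, 56)
```

Calling `run_pipeline(commands)` would then execute the five steps for the PSI domain one after the other.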
Conclusion
From our previous work33–35, 45, we have seen the usefulness of a tool dedicated to the analysis of local protein structures and deformability with PBs. We also showed the relevance of studying molecular deformability in the scope of structures issued from molecular dynamics simulations. Thus, we propose to the community PBxplore, available at https://github.com/pierrepo/PBxplore. PBxplore is written in a modular fashion that allows embedding in any PB-related Python application.
Software Availability
PBxplore is released under the open-source MIT license58. Its source code can be freely downloaded from the GitHub repository of the project: https://github.com/pierrepo/PBxplore. In addition, the present version of PBxplore (1.3.6) is also archived in the digital repository Zenodo68.
Acknowledgements
This work was supported by grants from the National Institute for Blood Transfusion (INTS, France) and the Lab of Excellence GR-Ex to JB, PC, SL, APJ, VJ, AGdB and PP, and from the Ministry of Research (France), University Paris Diderot, Sorbonne Paris Cité (France) and the National Institute for Health and Medical Research (INSERM, France) to JB, PC, SL, APJ, AGdB and PP. The labex GR-Ex, reference ANR-11-LABX-0051, is funded by the program “Investissements d’avenir” of the French National Research Agency, reference ANR-11-IDEX-0005-02. AGdB acknowledges the Indo-French Centre for the Promotion of Advanced Research / CEFIPRA for a collaborative grant (number 5302-2).
Author contributions
PP and AGdB conceived the project. PP, JB and HS wrote the software. AGdB, PC, APJ and VJ improved and tested the software. All authors reviewed the manuscript.
Competing Interests
The authors declare that they have no competing interests.