ABSTRACT
Direct coupling analysis (DCA) is a powerful tool, based on protein evolution, that was introduced to predict protein folds and protein-protein interactions and has also been applied to the prediction of entire interactomes. We have used DCA to analyse three proteins of the iron-sulfur biogenesis machine, an essential metabolic pathway conserved in all organisms. We show that, although the analysis is based on a relatively small number of sequences owing to the distribution of the protein in genomes, we can correctly recapitulate all the features of the fold of the CyaY/frataxin family, a protein involved in the human disease Friedreich’s ataxia. This result gave us confidence in the use of this tool. Application of DCA to the iron-sulfur cluster scaffold protein IscU, which has been suggested to function both in an ordered and in a disordered form, allows us to clearly distinguish evolutionary traces of the structured species, suggesting that, if present in the cell, the disordered form has not left any evolutionary imprint. We observe instead, for the first time, direct indications of how the protein can dimerize head-to-head and bind 4Fe4S clusters. Analysis of the alternative scaffold protein IscA provides strong support for a coordination of the cluster mediated by a dimeric form rather than by the tetrameric form previously suggested. Our analysis also suggests the presence in solution of a mixture of monomeric and dimeric species and guides us to the prevalent one. Finally, we used DCA to analyse protein-protein interactions between some of these proteins and discuss the potential and the limitations of the method.
I. INTRODUCTION
The whole history of protein folding and interactions is encoded in the correlations between residues in the protein sequence. The logic connecting residue-residue contacts to evolutionary correlation is very simple: residues in contact cannot evolve independently. If one residue gets larger, the other needs to be smaller in a concerted and not necessarily pairwise way. Charges must be compensated in the same way. Stabilizing/destabilizing amino acid substitutions need to be compensated by substitution of other interacting positions over the evolutionary timescale to retain interaction. In principle, one could use a comparative analysis of the primary sequences of proteins as a powerful way to predict their structures and interactions. This idea has been the “elusive Holy Grail” for more than twenty years since the first establishment of bioinformatics [1]. More recently an effective method, named direct coupling analysis (DCA) [2,3], has been proposed as a powerful approach to determine which residues interact the most from an evolutionary perspective, exploiting the large, and growing, number of available protein sequences. The method has been successfully used to acquire constraints for structural, dynamical and functional analysis [4-7], multimerization [8,9], and to shed light on interaction specificity [10] and inter-pathway cross-talk in bacterial signal transduction [11].
Here, we have applied DCA to explore the nature of the interactions between proteins involved in the iron-sulfur (FeS) cluster biogenesis pathway. Iron-sulfur clusters are essential prosthetic groups that are bound to proteins to provide electrons in reduction/oxidation reactions and/or to stabilize protein folds. Their biosynthesis is a complex process involving specialized machines which mediate the recruitment of sulfur and free iron from the cellular environment, catalyse the synthesis and carry out the delivery of the newly formed clusters to acceptor proteins. In bacteria, the systems able to perform these tasks belong to the nif (nitrogen fixation, NifiscA-nifSU), isc (iron-sulfur cluster, iscRSUA-hscBA-fdx) and suf (mobilization of sulfur, sufABCDSE) operons. Amongst these, the most universal one is the isc operon, whose proteins have direct orthologues in eukaryotes. Because malfunction in FeS cluster assembly has direct effects on human health [12,13], elucidating the structures and interaction patterns of the various proteins involved in this process can provide valuable insights into the origin of several diseases.
The central players in the isc machinery are IscS (or Nfs1 in eukaryotes) and IscU (Isu). IscS is a desulfurase, which converts cysteine to alanine and forms the persulfide that participates in cluster formation, and IscU is a scaffold protein on which the cluster is assembled. Together, they form a complex in which two IscU monomers are bound to the IscS obligate dimer. IscU was suggested to exist in the cell in two conformational states, one folded and ordered (S state), the other partially unfolded (D state) [14]. However, all crystal structures of IscU in isolation and in complexes with zinc or IscS capture the protein in its ordered state. Two regulatory proteins are CyaY (frataxin), which is the protein involved in Friedreich’s ataxia in humans, and IscA, thought to be an alternative scaffold protein. CyaY/frataxin is a monomeric protein formed by a globular conserved domain, preceded in eukaryotes by an intrinsically unfolded mitochondrial import sequence. It is highly conserved from bacteria to primates [15] and has been reported to act as a regulator of the enzymatic activity of IscS and to bind it at a site close to the enzyme active site [16,17]. Puzzlingly, its presence seems to inhibit the activity in prokaryotes but to activate it in eukaryotes [18-21]. IscA is an ancient protein whose family is characterized by a conserved CXnCGCG pattern thought to be involved in iron and/or 2Fe-2S binding [22]. In all available structures, IscA is either dimeric or tetrameric, but different symmetries and modes of cluster coordination were suggested. We used DCA to address important outstanding questions, the answers to which would help us to understand the specific role of these proteins and their fold. We used frataxin, which is monomeric and globular, to calibrate the method. We then tested whether any trace of the D state of IscU is detectable as compared to the S state and whether evolution provides information on the quaternary arrangement of IscA. We found that this technique was able to describe the proteins considered in great detail. We were able to identify the correct biological location of the elusive N-terminus of the IscU protein and conserved contacts which hint at a head-to-head dimerization of the protein, in agreement with the cluster coordination. We also found that not all the IscA structures in the PDB match the conserved contacts, which suggests that the location of the FeS cluster was likely misattributed. Finally, we used DCA to predict protein interactions. We could successfully predict interactions between IscU and its functional partner IscS, whereas the contacts predicted for CyaY do not match our current knowledge. These observations are likely to reflect the possibilities but also the limitations of DCA.
II. RESULTS
A. Validating the method on the frataxin family
The major sequence divergence within the CyaY/frataxin family is in the non-conserved, mainly unstructured N-terminus [23,24]. The evolutionarily conserved C-terminal domain forms a compact globular structure in which two α-helices pack against a β-sheet composed of 5-7 strands arranged in an αβββββ(ββ)α motif. The available structures of this region are all similar (average RMSD 2.3 Å) with minor differences in details (Table 1). Orthologs differ in the length of the C-terminus, which is longer in human frataxin and shorter in yeast. This difference contributes to the thermodynamic stability of the protein [25]. Experimental evidence suggests that the region interacting with iron and with the desulfurase IscS/Nfs1 is located in α1 and β1 (Fig 1A) [18,26,27].
We retrieved from the Uniprot database all the sequences matching a Hidden Markov Model (HMM) constructed from a seed made of the 196 CyaY entries of the Swiss-Prot database. We then built a multiple sequence alignment (MSA) which contains 3459 sequences, including 1102 from eukaryotes and 2326 from bacteria, and defines 109 consensus residue positions. The number of retrieved sequences is relatively small for a successful application of DCA but reflects the absence of frataxin in several species [28]. We then performed DCA on this MSA using the pseudo-likelihood approximation as described in Balakrishnan et al. [29], a method that estimates the joint probability distribution of a collection of random variables. The predicted contacts are displayed in contact maps, which have the protein sequence numbering on both axes. Contacts are displayed as spots which indicate interactions between residues. Traces perpendicular to the diagonal indicate that the region forms antiparallel secondary structure, whereas traces parallel to the diagonal reflect interactions between parallel strands. Contacts which do not line up in either fashion but cluster in various regions of the plot correspond to contacts between distal elements.
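The way secondary structure appears in a contact map can be illustrated with a minimal sketch (not the code used in this work; the strand positions are invented for illustration). In an idealised antiparallel hairpin the paired residues keep i + j constant, so the trace runs perpendicular to the diagonal, whereas in a parallel strand pair they keep j - i constant, so the trace runs parallel to it.

```python
def antiparallel_pairs(start_a, end_b, length):
    """Residue pairs of two antiparallel strands: i increases while j decreases."""
    return [(start_a + k, end_b - k) for k in range(length)]


def parallel_pairs(start_a, start_b, length):
    """Residue pairs of two parallel strands: i and j increase together."""
    return [(start_a + k, start_b + k) for k in range(length)]


# Hypothetical strand positions, chosen only to show the geometry of the traces.
hairpin = antiparallel_pairs(10, 25, 5)
sheet = parallel_pairs(10, 40, 5)

# A single value of i + j means the trace is perpendicular to the diagonal.
print({i + j for i, j in hairpin})
# A single value of j - i means the trace is parallel to the diagonal.
print({j - i for i, j in sheet})
```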
The predicted contacts (Fig 1B) were ranked according to their DCA scores, which describe the coevolution strength between pairs of residues. Contacts between residues less than five positions apart in the sequence, which reflect local secondary structure, were discarded. Comparison of the remaining predicted contacts with the frataxin structures shows a perfect match for the first 25 predictions; the fraction of correct predictions drops to two thirds within the first hundred and stays above 50% at the 250th prediction (Fig 1C). We retained the 109 DCA contacts with the highest scores, which correspond to 2% of the 5460 possible contacts. The retained contacts correlate well with the secondary structure of the protein. Additionally, three clusters were observed, all involving the N-terminus of the domain (Fig 1D). Two reflect packing of α1 against β1-β2 and β3-β4; the third reflects the contacts between the two helices. This shows how important α1 is for the fold of this protein. The only other tertiary interactions between distant secondary structure elements involve β4-β5 and the C-terminal α2. This interaction is reflected in the DCA analysis by a small cluster visible at the very bottom of the DCA plot.
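The figures quoted above can be checked with a short sketch. It counts the residue pairs at least five positions apart in a 109-position alignment and shows how the fraction of correct predictions among the top-ranked contacts can be computed. The function names and the toy ranked list are illustrative assumptions, not part of the original analysis.

```python
def count_pairs(n_positions, min_separation):
    """Number of pairs (i, j) with j - i >= min_separation."""
    return sum(n_positions - d for d in range(min_separation, n_positions))


def precision_at(ranked_pairs, true_contacts, n):
    """Fraction of the n top-ranked pairs that are contacts in the structure."""
    top = ranked_pairs[:n]
    return sum(pair in true_contacts for pair in top) / len(top)


total = count_pairs(109, 5)
print(total)                        # 5460
print(round(100 * 109 / total, 1))  # 2.0 (percent)

# Toy data (hypothetical): pairs ranked by score and a set of native contacts.
ranked = [(3, 40), (10, 25), (7, 60), (12, 90)]
native = {(3, 40), (7, 60)}
print([precision_at(ranked, native, n) for n in (1, 2, 4)])  # [1.0, 0.5, 0.5]
```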
These results support the confident use of DCA for the analysis of FeS proteins: even though the number of retrieved sequences is suboptimal, we could recapture all the important features of the CyaY/frataxin fold.
All the crystal structures of IscU have a folded N-terminus while most of the NMR structures show an unstructured or partially structured N-terminus. Only a few structures are available for IscA, all from X-ray crystallography.
B. Structure of IscU proteins and N-terminal localization
IscU is a more complex case. Twelve structures are available from 8 different species (Table 2). They can be divided into three groups. All the X-ray structures, which are available for isolated cluster-loaded (holo), zinc-loaded and cluster-free (apo) IscU as well as for IscU in complex with IscS/Nfs1, have a compact ordered structure with a β-sheet packing against two α-helices (Fig 2A). The N-terminus (residues 1-21) does not contain regular secondary structure elements except for a two-turn helix (α1) between residues 5-12, which packs against the other helix, anchoring the N-terminus to the rest of the structure. In one of the structures (2Z7E), the N-terminus adopts different orientations in the different protomers of a homo-trimer. In the solution structures (1R9P, 1Q48, 2L4X, 2KQK and 1WFZ), the fold is similar but the N-terminus is disordered and completely solvent-exposed (Fig 2B). Some of these structures are thought to contain a zinc atom in the same position where the cluster is coordinated (i.e. on the tip of the approximately ellipsoidal molecule, where the three conserved cysteines are). However, zinc is NMR silent and could not be observed directly. Only two crystallographic structures (1SU0 and 2QQ4) explicitly contain zinc. Finally, one zinc-free NMR structure (2L4X) is supposed to be representative of the D state. It is distorted and contains only a β-hairpin and the C-terminal helix. It is probably more correct to describe this entry as a nascent chain or a molten globule rather than a structure in the usual sense. Its presence in the PDB is misleading.
DCA on 13148 IscU sequences resulted in a clear coevolutionary prediction of contacts (Fig 2C,D). Using the secondary structure and the nomenclature described in the IscU alignment [30], we can observe interactions between secondary structure elements: the contacts between β1-β2, β2-β3, β3-α2, α2-α3, and α3-α6 left traces perpendicular to the diagonal, while the β2-α2, β3-α6 and β2-α6 interactions are reflected by three parallel traces. All secondary structure elements between β1 and α6 form contacts with the preceding and the following secondary structure elements, forming hairpins. The parallel traces reflect interactions between parallel strands. The α1 helix is excluded from this pattern and forms interactions with several strands, suggesting a transversal orientation which crosses the sheet.
Most experimental structures agree with these predicted contacts (Fig 2C,D), with the exception of the N-terminal region (up to ca. residue 16), which is also where the structures differ most. Contacts between the N-terminus and the β2-β3-α2 region are conserved, in support of a structured state of the α1 region (Fig 2E,F). This does not, however, preclude the existence or the functional relevance of a disordered conformation of the N-terminus: disordered regions would likely not have a co-evolutionary signal and are thus out of reach of current DCA predictions.
The N-terminus also forms contacts with the β-sheets and the α1-β1 loop. Superposition of the predicted contacts onto the deposited structures leaves two predicted contact clusters unaccounted for, one between α2 and the β1-β2 loop, the other within the α5 region (Fig 3A). These contacts are incompatible with the inter-molecular interactions observed in the crystal structures of the cluster-loaded trimer (2Z7E) or of a decamer (2QQ4) (Fig S1) and include areas involved in or surrounding the FeS cluster binding site (Fig 3B). A different explanation could be that these contacts reflect formation of a head-to-head dimer with an interface located around the conserved cysteines. This hypothesis would be fully consistent with the necessity of at least a dimer to coordinate a 4Fe4S cluster [31] according to an oxidative mechanism previously proposed [32].
C. Multimerization and FeS cluster coordination of IscA
Seven structures of IscA-like proteins are available (Table 3). The first published structure (1R95) [33] has an internal two-fold symmetry with tandem pseudo-symmetric motifs (β1-α1-β2-β3/β5-α2-β6-β7) separated by a quasi-palindromic hinge (E43FVDEPTPEDIVFE56 in the β3-α4 region). The fold of each protomer consists of a β-sandwich of a mixed twisted four-stranded β-sheet, β4-β5-β2-β3, packed against a three-stranded β1-β6-β7 sheet. The protomers could form a dimer or two possible tetramers, i.e. dimers of dimers (tetramers A and B, Fig 4A). The electron density around the C-terminus (where two of the three cysteine residues are) is fuzzy, indicating disorder or conformational exchange. An alternative apo IscA crystal structure [34] has individual protomers nearly identical to those observed in 1R95, but the dimer interface, described as an α1α2 dimer with minor differences between protomers, is different. The overall tetrameric (α1α2)2 structure is similar to the 1R95 A tetramer. This structure also lacks a defined C-terminus, but the authors modelled it based on stereochemical parameters. They concluded that the cysteines of the dimer would be unable to coordinate the FeS cluster and that tetramer formation is necessary to stabilise coordination. They also suggested that, of the three cysteines of the CXnCGCG motif, only the last two (Cys99 and Cys101 in E. coli) are involved in cluster coordination, whereas Cys35 would remain idle. The only fully resolved holo IscA is from T. elongatus (1X0G). This structure has a structured C-terminus which allows coordination of the FeS cluster. It is a dimer of asymmetric dimers (αβ)2 and has domain swapping between two of the protomers (β and β’), which exchange their central domain, forming a long intertwined β-sheet (Fig 4B). The unusual asymmetry imposes asymmetric interfaces, one of which (the one between α and the domain-swapped β’) forms the pocket which accommodates the FeS cluster. The pocket itself is asymmetric, with the cysteine motif (Cys37, Cys101, Cys103) contributed both by the α protomer and by the swapped domain of the β protomer (Cys103 (βsw)) (Fig 4B and Fig S2).
Most of the sequences belong to either the IscA or the ErpA subfamily, but the set also comprises SufA and the eukaryotic paralogs IscA1/IscA2 (ca. 11,000 sequences). These proteins are all part of the A-type carrier (ATC) family and should have overlapping functions. Structurally, both SufA (2D2A) and IscA (1R95, 1S98) have similar contact maps except for two regions, which account for contacts within the C-terminus and between the C-terminus and residues 30-40. These regions contain the three conserved cysteines. Since cluster coordination is thought to occur inter-molecularly because no structure could allow intra-molecular coordination [35], we hypothesize that these contacts reflect inter-molecular interactions. None of the inter-chain contact maps matches the experimental structures (data not shown), strongly suggesting that in solution there might be different structures in mutual equilibrium or that none of the available structures represents the functional species. The first hypothesis is also in agreement with the diversity of packing observed in the crystal structures.
The contacts within the C-terminal residues show the characteristic pattern of β-sheets or loop conformations. These patterns could be in agreement with the anomalous swapped dimer of 1X0G, where the loop harbouring the first cysteine of the CXnCGCG motif (Cys37) is bent towards the C-terminus and stabilized by steric hindrance of the swapped central twisted β-sheets. In this structure, cluster coordination is asymmetric and achieved by Cys37 and Cys101 of the α protomer and Cys103 of the β protomer. The evolutionary trace of contacts between the C-terminus (residues 98-112) and the loop between residues 33-41 suggests the existence of a conformation which brings the first cysteine (Cys37) close to the terminal cysteine pair (Cys101 and Cys103) (Fig S2), supporting a contribution of Cys37 to cluster coordination. This conclusion is at strong variance with the previous belief that only the C-terminal cysteines participate in coordination and implies that cluster coordination can occur at the level of the dimer without invoking formation of a tetramer. The 1X0G structure is currently the only available structure able to describe cluster coordination, although domain swapping may not be required to explain the interactions: the domain-swapped protomer could easily be replaced by a non-swapped protomer in a symmetric dimer.
We can thus conclude that DCA of IscA suggests important new hypotheses which can drastically change our views on the cluster coordination properties of this protein.
D. Protein-protein interactions
DCA can in principle be extended to predict conserved contacts between interacting proteins on the basis of MSAs of sequence pairs of proteins that are known to interact. In the absence of such a curated set, several matching strategies have been developed [8–10,36]. Among these, two independent implementations have recently been suggested in back-to-back publications [10,36]. We adopted the Iterative Paralog Matching (IPA) [10] to investigate the interactions between frataxin, IscU and the desulfurase IscS. This self-consistent method simultaneously identifies the best matching sequences among paralogs and predicts pairs of interacting residues across two proteins. In this approach, multiple IPA runs are performed, and protein-protein contacts are scored based on the number of times they are accepted among all the runs (acceptance frequency). We first analyzed the interactions between IscU and IscS, because a high resolution crystal structure of this complex is available (3LVL). We observed that the four most often accepted contacts do indeed lie in the interface of the IscU-IscS dimer. These contacts have acceptance frequencies between 85% and 100% (Fig 5A and Fig S3). Contacts with lower acceptance frequencies are mainly incompatible with the structural model of the IscU-IscS dimer (i.e. they are false positives). We also observed at least one contact with a low acceptance frequency (V17-L383, accepted in 17% of IPA runs) that nonetheless lies in the IscU-IscS interface. In the absence of an absolute scale quantifying the reliability of predicted contacts, and of known structures for the IscU-frataxin and IscS-frataxin complexes, we used the IscU-IscS case as a reference. We assumed that contacts accepted in more than 85% of IPA runs are in excellent agreement with an experimental model, while contacts with lower acceptance frequency display high variability and high false positive rates.
We observed no contacts with high acceptance frequency for the IscS-frataxin pair, in contrast to the IscU-IscS case (Fig S4A). The acceptance frequency of the two most frequent contacts, 68% (Fig 5B), falls in the range where, in the case of IscU-IscS, most contacts are false positives. Therefore, even though the two contacts are geometrically compatible, i.e. they could in principle be satisfied by a docked pose, their high statistical uncertainty prevents us from drawing conclusions about their biological relevance. In the case of interactions between frataxin and IscU, IPA identified three contacts with very high acceptance frequencies (>94%) (Fig 5C and Fig S4B) that are potentially geometrically compatible with a docked complex. However, there is no overlap between these three coevolutionary predicted contacts and the interaction interface between frataxin and IscU in an available model of the IscU-IscS-frataxin trimer [37]. It must, however, be noted that the number of sequences in the IscU and IscS families is significantly higher than in the frataxin family. This probably contributes to a higher statistical robustness of the predictions for the IscU-IscS complex.
III. DISCUSSION
DCA is a powerful method, by now shown to be robust and reliable as long as a sufficiently high number of independent protein sequences is available [2,38,39]. In this work, we have interrogated evolution through DCA to gain new insights into the molecular machine involved in FeS cluster biogenesis. We selected three essential components: the scaffold protein IscU, the alternative scaffold IscA and the regulator of cluster formation, CyaY/frataxin. Besides the medical and biological interest of the latter, the choice of CyaY/frataxin proved appropriate to validate the method for our purposes, since this protein has a compact and stable fold with high structural conservation. The smaller number of CyaY/frataxin sequences is consistent with the origin of this protein dating back only to the root of the alpha-beta-gamma proteobacteria, whereas, for instance, IscU goes back at least to the origin of bacteria. IscU is thus older by at least a couple of hundred million years (M. Huynen, personal communication). We nevertheless observed that, despite the relatively lower number of sequences, we can reproduce most features of the CyaY/frataxin fold, which gave us confidence in applying the method to the other two, much better represented proteins. We then applied DCA to address questions whose answers could allow us to understand cluster coordination and protein assembly of the other two proteins.
Much has been said about the presence of partially unstructured forms of IscU which could be in equilibrium with the fully folded form in solution [40]. There is no doubt that IscU is a marginally stable protein which, in the absence of partners like zinc, the cluster or IscS, is able to unfold not only at high but also at low temperatures [41]. The N-terminus is flexible or in conformational exchange in solution, also in the presence of zinc. Nevertheless, we do not find traces of the unstructured conformation in our analysis, while the signal from the structured form is clear and unmistakable. Even more interestingly, we found for the first time an indication that directly supports the existence of a head-to-head IscU dimer whose interface would involve the conserved cysteines. This dimer was suggested to be the result of an oxidative event occurring at the later stages of FeS cluster formation, after the cluster-loaded IscU has detached from IscS [32]. This event would lead to the formation of a 4Fe4S cluster. IscU dimerization is the only way to provide sufficient coordinating groups and enable formation of the 4Fe4S cubane, which would instead be too unstable to be coordinated by the IscU monomer [41].
DCA of IscA suggests new hypotheses on the structure of this otherwise still obscure protein. Because IscA binds both iron and FeS clusters, the protein has alternatively been suggested to be a scaffold protein or the carrier protein that delivers iron to the desulfurase [35,42,43]. What remains certain is that IscA contains three conserved cysteines, which are excellent candidates for both ion and cluster coordination. The crystal structures of IscA have been relatively uninformative both on the type of molecular assembly and on cluster/metal coordination. Our DCA data rely on a large number of sequences, only slightly fewer than those retrieved for IscU. We observe a signal that is compatible with formation of the αβ fold observed in all available structures. However, we also observe contacts which cannot easily be explained by a single structure, suggesting the presence of several different species, at least in the absence of cluster or cations. This is consistent with our experimental evidence [44], which clearly supports the presence of an equilibrium between at least two species in a range of concentrations compatible with those expected in the cell. After analysing different structures, we conclude that the co-presence of structures such as 1X0G and 1R95 would match what we observe in the DCA analysis. These conclusions strongly suggest that cluster coordination can be mediated by the dimeric form of IscA rather than by the tetramer, without necessarily requiring domain swapping.
In conclusion, we found that DCA is a methodology which can enhance our knowledge of specific protein families and provide new information to address unresolved questions. We can thus confidently add DCA to the tools available to study the FeS cluster machine.
A. Materials and Methods
1. Multiple Sequence Alignments
Multiple sequence alignments (MSAs) for each of the studied protein families were constructed using the following protocol. We first gathered, as a seed for each family, all sequences from Uniprot with gene names corresponding to the canonical members of the families (CYAY or FXN for frataxin, ISCA for IscA, ISCS for IscS, ISCU for IscU). We then aligned the sequences in each seed using MAFFT (http://mafft.cbrc.jp/alignment/software/) [45]. The resulting MSA was then used to generate a Hidden Markov Model using the HMMER package (http://hmmer.org/) [46]. The Uniprot database was then searched using the HMMs to extract homologous sequences. The resulting MSAs were further filtered, removing all sequences containing more than 10% of gapped positions.
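The final filtering step can be sketched as follows. This is a minimal illustration, not the script used here: the function names, the gap characters and the toy alignment are assumptions, and the alignment is represented simply as a dictionary of equal-length aligned strings.

```python
def gap_fraction(sequence, gap_chars="-."):
    """Fraction of alignment columns occupied by a gap in one aligned sequence."""
    return sum(char in gap_chars for char in sequence) / len(sequence)


def filter_gapped(msa, max_gap_fraction=0.10):
    """Keep sequences whose gap fraction does not exceed max_gap_fraction."""
    return {
        name: seq
        for name, seq in msa.items()
        if gap_fraction(seq) <= max_gap_fraction
    }


# Toy alignment (hypothetical sequences, 10 columns each).
toy_msa = {
    "seq1": "ACDEFGHIKL",  # no gaps: kept
    "seq2": "ACDEFGHIK-",  # 10% gaps: kept (not more than 10%)
    "seq3": "AC--FGHIKL",  # 20% gaps: removed
}
print(list(filter_gapped(toy_msa)))  # ['seq1', 'seq2']
```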
2. Direct Coupling Analysis
DCA [2,3] was performed using an in-house implementation of the asymmetric version of the pseudo-likelihood method to infer the parameters of the Potts model [29,38]. Sequences were reweighted using a maximum 90% identity threshold. We used L2 regularization with the parameters given in [38]. The DCA scores were taken as the Frobenius norm of the 20×20 coupling matrices Jij of the Potts model (ignoring the couplings with the gap state) [47]. The average product correction term was subtracted [48]. The result was filtered to remove background and allow easier interpretation. The N highest-scoring predictions (where N equals the number of positions in the MSA) were compared with the contact map of reference structures, in which two residues were considered in contact if they have at least one pair of atoms within 8.5 Å of each other. Contacts between residues less than 5 amino acids apart in the sequence were skipped to favor visualization of long-range contacts.
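The scoring step can be illustrated with a minimal pure-Python sketch, which is not the in-house code. It assumes that the couplings are already inferred and stored as one nested list per position pair, that the gap state has index 0 (a convention chosen here for illustration), and that any gauge transformation has been applied beforehand. The average product correction subtracts, for each pair, the product of the two position-wise mean scores divided by the overall mean score.

```python
import math


def frobenius_scores(couplings, gap_index=0):
    """Frobenius norm of each coupling matrix, skipping the gap row and column.

    couplings maps a position pair (i, j), with i < j, to a q x q nested list.
    """
    scores = {}
    for pair, matrix in couplings.items():
        total = 0.0
        for a, row in enumerate(matrix):
            for b, value in enumerate(row):
                if a != gap_index and b != gap_index:
                    total += value * value
        scores[pair] = math.sqrt(total)
    return scores


def apply_apc(scores, n_positions):
    """Subtract the average product correction; expects a score for every pair."""
    position_sum = [0.0] * n_positions
    for (i, j), score in scores.items():
        position_sum[i] += score
        position_sum[j] += score
    position_mean = [s / (n_positions - 1) for s in position_sum]
    overall_mean = sum(scores.values()) / len(scores)
    return {
        (i, j): score - position_mean[i] * position_mean[j] / overall_mean
        for (i, j), score in scores.items()
    }


# Toy example (hypothetical values): 3 positions, 3 states, state 0 is the gap.
toy_couplings = {
    (0, 1): [[9, 9, 9], [9, 3, 0], [9, 0, 4]],
    (0, 2): [[0, 0, 0], [0, 1, 0], [0, 0, 0]],
    (1, 2): [[0, 0, 0], [0, 0, 2], [0, 0, 0]],
}
raw = frobenius_scores(toy_couplings)
corrected = apply_apc(raw, n_positions=3)
print(raw)
print(corrected)
```

In the toy example the large entries in the gap row and column of the first matrix do not contribute to its score, which is the purpose of ignoring the gap state.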
3. Iterative Paralog Matching and inter-protein predictions
To build matched MSAs of two interacting protein families (denoted A and B), we used the Iterative Paralog Matching (IPA) strategy [10]. The rationale of this procedure is to find the matching between paralogs of two protein families in an organism such that the inter-protein coevolutionary signal is self-consistently maximized. The protocol can be summarized as follows. A random seed is built such that, for each organism, sequences of protein A are randomly matched with sequences of protein B. Mean-field DCA (MF-DCA) [2] is used to infer the statistical model. The random seed is then discarded. The inferred couplings are then used to score all possible matchings of paralogs in all organisms. All potential matched sequence pairs are then ranked based on their inter-protein coevolution score, and a user-defined number Ninc of the top-ranking sequence pairs is added to the MSA, which is then fed as input to MF-DCA for the next iteration. This procedure is repeated, increasing Ninc at each iteration, until the maximal number of sequences is matched. Finally, the best-scoring MSA obtained by IPA is used as input to the pseudo-likelihood DCA method described above to perform contact prediction. The whole procedure is repeated NIPA times and, for each realization, we recorded the inter-protein contacts with a normalized DCA score above 0.8, an acceptance criterion introduced in [9]. We used NIPA=200 realizations for the IscU-IscS system, and NIPA=300 for the faster frataxin-IscU and frataxin-IscS systems. Finally, we ranked all inter-protein contacts by the normalized number of times they were accepted in the NIPA realizations (acceptance frequency). Contacts accepted more often across IPA runs should be more robust and have higher statistical significance.
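The last ranking step can be sketched as follows, assuming that each IPA realization has already been reduced to the set of inter-protein contacts that passed the acceptance criterion. The function names and the toy contacts are hypothetical; the 85% cutoff in the second function is the reference value derived from the IscU-IscS case in the Results, not a general rule.

```python
from collections import Counter


def acceptance_frequencies(runs):
    """Fraction of runs in which each contact was accepted, most frequent first.

    runs is a list with one collection of accepted contacts per realization.
    """
    counts = Counter(contact for accepted in runs for contact in set(accepted))
    return {contact: n / len(runs) for contact, n in counts.most_common()}


def reliable_contacts(frequencies, cutoff=0.85):
    """Contacts accepted in more than the given fraction of runs."""
    return [contact for contact, freq in frequencies.items() if freq > cutoff]


# Toy data (hypothetical): four realizations, contacts as (position in A, position in B).
toy_runs = [
    {(3, 40), (5, 12), (8, 77)},
    {(3, 40), (8, 77)},
    {(3, 40), (8, 77)},
    {(3, 40)},
]
freqs = acceptance_frequencies(toy_runs)
print(freqs)                     # {(3, 40): 1.0, (8, 77): 0.75, (5, 12): 0.25}
print(reliable_contacts(freqs))  # [(3, 40)]
```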