Abstract
Paralogous proteins often arise from the duplication of genes encoding homomeric proteins. Such events lead to the formation of homomers and heteromers, thus creating new complexes after a single duplication event. We exhaustively characterize this phenomenon using the budding yeast protein-protein interaction network. We observe that heteromerizing paralogs are very frequent and less functionally diverged than non-heteromerizing ones, raising the possibility that heteromerization prevents functional divergence. Using in silico evolution, we show that for homomers and heteromers that share binding interfaces, mutations in one paralog have pleiotropic effects on the homomer and the heteromer, affecting both paralogous proteins at the same time and resulting in highly correlated responses to selection. As a result, heteromerization could be preserved indirectly due to negative selection for the maintenance of homomers. By integrating data on gene expression and protein localization, we find that paralogs can overcome the obstacle of structural pleiotropy and develop functional divergence through regulatory evolution.
Introduction
Proteins can assemble into both stable and transient molecular complexes that perform and regulate structural, metabolic and signalling functions (Janin et al., 2008; Marsh and Teichmann, 2015; Pandey et al., 2017; Scott and Pawson, 2009; Vidal et al., 2011; Wan et al., 2015). The assembly of such complexes is necessary for protein function and thus constrains the sequence space available for protein evolution. One direct consequence of protein-protein interactions (PPIs) is that a mutation in a given gene can have pleiotropic effects on other gene functions through physical associations. Therefore, to understand how genes and cellular systems evolve, we need to consider physical interactions as part of the environmental factors shaping a gene’s evolutionary trajectory (Landry et al., 2013; Levy et al., 2012).
A context in which PPIs and pleiotropy may be particularly important is during the evolution of new genes after duplication events (Amoutzias et al., 2008; Baker et al., 2013; Diss et al., 2017; Kaltenegger and Ober, 2015). Indeed, the potential for a gene to evolve new functions depends on its molecular environment. The molecular environment includes a gene’s paralog if the duplicates are derived from an ancestral self-interacting protein (homomer) (Figure 1). In this case, mutations in one paralog could have functional consequences for the other copy because the duplication of a homomeric protein leads not only to the formation of new homomers (HM) but also to a new heteromer (HET) (Figure 1) (Pereira-Leal et al., 2007; Wagner, 2003).
Paralogs that derive from HMs are physically associated as HETs when they arise. Subsequent evolution can lead to the maintenance or the loss of these HETs. There are several examples of paralogs that maintained the ability to form HETs and, as a consequence, they have evolved new functional relationships (Amoutzias et al., 2008; Baker et al., 2013; Kaltenegger and Ober, 2015). Examples include paralogs degenerating and becoming a repressor for the other copy (Bridgham et al., 2008), that split the functions of the ancestral HM between one of the HMs and the HET (Baker et al., 2013), that cross-stabilize and thus need each other to perform their function (Diss et al., 2017) or that evolve a new function together as a HET (Boncoeur et al., 2012). However, there are other paralogs that have lost the ability to form HETs through the divergence of their structural properties. Examples include duplicated histidine kinases (Ashenberg et al., 2011) and many duplicated heat-shock proteins (Hochberg et al., 2018), which do form HMs but appear to have lost the ability to form HETs.
One important question to examine is therefore what evolutionary forces drive the maintenance or the disruption of HETs arising from HMs. Previous studies suggest that if a paralog pair maintains its ability to form HMs, it is very likely to maintain the HET complex as well (Pereira-Leal et al., 2007). For instance, Lukatsky et al. (Lukatsky et al., 2007) showed that proteins intrinsically tend to interact with themselves and that negative selection could be needed to prevent unwanted HMs. Given this, since nascent paralogs are identical just after duplication, they would have a high propensity to assemble with each other. Hence, the formation of both HMs and HETs could be the default state after duplication, persisting until specific destabilizing mutations accumulate (Ashenberg et al., 2011; Hochberg et al., 2018). Here, we hypothesize that the association of paralogs forming HETs acts as a constraint that may slow the functional divergence of paralogs by keeping gene products physically associated.
Previous studies have shown that HMs are enriched in eukaryotic PPI networks (Lynch, 2012; Pereira-Leal et al., 2007). However, the extent to which paralogs interact with each other has not been precisely quantified in any species. We therefore first examine paralog assembly exhaustively in a eukaryotic interactome by collecting data from the literature and extending these with a large-scale PPI screening experiment. Second, using functional data analysis, we examine the consequences of losing HET formation for HM-forming paralogs. We then perform in silico evolution experiments to examine whether the molecular pleiotropy caused by shared interfaces could contribute to maintaining interactions between paralogs that derive from ancestral HMs. We show that selection to maintain HMs alone, coupled with the pleiotropic effects of mutations, may be sufficient to prevent the loss of HETs. Finally, we find that regulatory evolution, either at the level of gene transcription or protein localization, may relieve the pleiotropic constraint maintaining the interaction of paralogous proteins.
Results
Homomers and heteromers in the yeast PPI network
We collected information on Saccharomyces cerevisiae HMs and HETs from publicly available data (see methods) for 255 pairs of small-scale duplicates (SSDs) and 240 pairs of whole-genome duplicates (WGDs). We combined this data with our own experimental data on 155 SSDs and 131 WGDs. We performed Protein-fragment Complementation Assay experiments (referred to as PCA) to test for binary interactions of paralogs with themselves (HM) and with their sister copy (HET). PCA detects direct and near direct interactions without disturbing endogenous regulation, giving insight into the role of transcriptional regulation in the evolution of PPIs (Barshir et al., 2018; Gagnon-Arsenault et al., 2013; Rochette et al., 2014; Tarassov et al., 2008). In general, the PCA signal in our study strongly correlates with results from previous PCA experiments (Stynen et al., 2018; Tarassov et al., 2008) and other publicly available data (Figure S1). Roughly 75% of the HMs and 74% of the HETs detected in our PCA were previously reported (Figure S2, Tables S1 and S2), suggesting that most of the HMs and HETs that can be detected with available tools and in standard conditions have been discovered. While 91 HMs and 49 HETs reported in other studies were not detected in our PCA, our experiments discovered 54 HMs and 22 HETs not previously reported (Tables S1 and S2). The data we assembled represents a total of 595 pairs of paralogs (315 SSDs and 280 WGDs) covering 62.1% of the SSDs and 51.6% of the WGDs (Tables S1 and S2).
Using this dataset, complemented with HMs previously reported among non-paralogs (singletons), we find that about 32.5% of yeast proteins form HMs, which agrees with previous estimates from crystal structures (Lynch, 2012). The proportion of HMs among singletons (n = 628, 25%) is lower than for all duplicates: SSDs (n = 965, 38%, p-value < 2.2e-16), WGDs (n = 281, 33%, p-value = 9.9e-06) and those duplicated by both SSD and WGD (henceforth referred to as 2D, n = 117, 48%, p-value = 2.4e-13) (Figure 2. A). Although a large number of PPIs have been previously reported in S. cerevisiae, it is possible that the frequency of HMs is slightly underestimated because they were not systematically tested (see methods). Another possibility is that some methods cannot detect them due to low expression levels (this is also true for HETs, see below). To test for the effect of expression levels on PPI detection in the PCA assays, we measured mRNA abundance in cells grown in conditions that reflect the ones used in PCA experiments and also used integrated protein abundance for the yeast proteome (Wang et al., 2012) (Tables S3 and S4). As previously observed (Celaj et al., 2017; Freschi et al., 2013), we found a correlation between PCA signal and expression level, both at the level of mRNA and protein abundance (Spearman r = 0.32, p-value < 2.2e-16 and r = 0.45, p-value < 2.2e-16 respectively). When focusing only on HMs previously reported, we also detected both correlations (Spearman r = 0.27, p-value = 2.2e-07 and r = 0.29, p-value < 2.2e-16 respectively). Associations between PCA signal and expression translate to a roughly two-fold increase in the probability of HM detection when mRNA levels change by two orders of magnitude (Figure S3. A). We also find that PCA signal for HMs is generally stronger for the most expressed paralog of a pair, confirming the effect of expression on our ability to detect PPIs (Figure S3. B). 
Finally, we find that HMs reported in the literature but not detected by PCA have on average lower expression levels (Figure S3. B-C). We therefore conclude that some HMs (and also HETs) remain unknown because of low expression levels.
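As an illustration of the rank-based association reported above between PCA signal and expression level, the following minimal sketch computes a Spearman correlation with the Python standard library only. The function names and any input values are hypothetical and are not taken from our analysis pipeline; ties are handled by average ranks, which is the usual convention but is stated here as an assumption.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

Because the statistic depends only on ranks, it is insensitive to the scale of the PCA signal or of the abundance measurements, which is why it suits comparisons between colony-size readouts and mRNA or protein abundance.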
The overrepresentation of HMs among duplicates was initially observed for human paralogs (Pérez-Bercoff et al., 2010). One possibility to explain this finding is that homomeric proteins are more likely to be maintained after duplication (Diss et al., 2017). Another explanation is that proteins forming HMs could be more expressed and therefore, easier to detect, as shown above. High expression could also increase the long term probability of genes to persist after duplication (Gout et al., 2010; Gout and Lynch, 2015). We observed that WGDs are more expressed than SSDs and that both are more expressed than singletons at mRNA and protein levels (Figure S4. A-B). However, expression does not explain completely the enrichment of HMs among duplicated proteins and the enrichment does not result entirely from enhanced detection sensitivity. Both factors, expression and duplication, have significant effects on the probability of proteins to form HMs (Table S5. A). It is therefore likely that the overrepresentation of HMs among paralogs is linked to their higher expression but other factors are also involved.
Paralogous heteromers frequently derive from ancestral homomers
The model presented in Figure 1 assumes that the ancestral protein leading to HET formed a HM before duplication. Under the principle of parsimony, we can assume that when at least one paralog forms a HM, the ancestral protein was also a HM. This was shown to be true in general by (Diss et al., 2017) who compared yeast WGDs to their orthologs from Schizosaccharomyces pombe. To further support this observation, we used PCA to test for HM formation for orthologs from species that diverged prior to the whole-genome duplication event (Lachancea kluyveri and Zygosaccharomyces rouxii). We looked at the mitochondrial translocon complex and at the transaldolase, which both contain HETs (see methods). We confirm that when one HM was observed in S. cerevisiae, at least one ortholog from pre-whole genome-duplication species also formed a HM (Figure 2. B-C). We also detected interactions between orthologs, suggesting that ability to interact has been preserved despite the millions of years of evolution separating these species. The absence of interactions for some pre-WGD proteins may be due to the incompatibility of their expression in S. cerevisiae.
We classified paralog pairs into four classes according to whether they show only the HET (HET, 8.6%), at least one HM but no HET (HM, 46%), at least one HM and the HET (HM&HET, 31.6%) or no interaction (NI, 13.8%) (Figures 2. D-E and S5). The number of interactions detected for all paralogous pairs is higher than expected among random pairs of proteins in large-scale screens (1% of PPIs tested are positive in (Tarassov et al., 2008)). This is in line with previous observations showing that paralogs frequently interact with each other (Ispolatov et al., 2005). Overall, most pairs forming HETs also form at least one HM (78.7%, Figure 2. E for the frequency for each type of duplication).
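The four interaction motifs can be expressed as a simple decision rule on three binary observations per pair. The sketch below is illustrative only: the function and argument names are hypothetical, and the inputs are assumed to be already-thresholded interaction calls (from PCA or from the literature).

```python
from collections import Counter


def classify_pair(hm_a, hm_b, het):
    """Assign a paralog pair to one of the four motifs.

    hm_a, hm_b: whether each paralog forms a homomer (HM).
    het: whether the two paralogs form a heteromer (HET).
    """
    any_hm = hm_a or hm_b
    if het and any_hm:
        return "HM&HET"
    if het:
        return "HET"
    if any_hm:
        return "HM"
    return "NI"


def motif_percentages(pairs):
    """Percentage of pairs in each motif, given (hm_a, hm_b, het) tuples."""
    counts = Counter(classify_pair(*pair) for pair in pairs)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no pairs provided")
    return {motif: 100 * n / total for motif, n in counts.items()}
```

Note that a pair is assigned to HM&HET as soon as one of the two HMs is detected together with the HET, so the class does not distinguish pairs in which one HM has been lost from pairs retaining both.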
Previous observations showed that paralogs are enriched in protein complexes comprising more than two subunits, partly because complexes evolved by the initial establishment of self-interactions followed by duplication of the homomeric proteins (Musso et al., 2007; Pereira-Leal et al., 2007). We examined if HM&HET were part of complexes and could have evolved this way. We found 10 HETs among the yeast paralogs present in the Protein Data Bank (PDB). Seven of them are in complexes with more than two distinct subunits and thus involve more proteins than just a pair of paralogs. The remaining three are in complexes with only two distinct subunits and are therefore exclusively constituted of a paralog pair. We examined further the presence of HM&HETs in complexes with more than two distinct subunits by combining data on 5,535 complexes (see methods). In 77 out of the 188 cases of HM&HET, we found both paralogs to be part of the same complex. Complexes containing more than two distinct proteins are thus unlikely to account for most of the HETs that derive from ancestral HMs. A large fraction of HM&HETs could therefore be simple oligomers of paralogs.
We observed that the correlation between HM and HET formation depends on whether paralogs derive from SSD or WGD (Figure 2. E). WGDs tend to form HETs more often when they form at least one HM, resulting in a larger proportion of the HM&HET motif. We hypothesize that this could be at least partly due to the fact that most SSDs are older than WGDs (Table S1 and S2) and increasing protein sequence divergence with time could lead to the loss of the HET. We indeed find that among SSDs, those showing higher sequence identity are more likely to form HM&HET (Figure 2. F). The effect of time of duplication was also detectable for paralogs deriving from the whole-genome duplication. Recently, Marcet-Houben and Gabaldón (Marcet-Houben and Gabaldón, 2015; Wolfe, 2015) showed that WGDs likely have two distinct origins: actual duplication (generating true ohnologs) and hybridization between species (generating homeologs). For pairs whose ancestral state was a HM, we observe that true ohnologs have a tendency to form HET more frequently than homeologs (Figure 2. E). Because homeologs had already diverged before the hybridization event, they are older than ohnologs, as shown by their lower pairwise sequence identity (Figure S5. A). This observation supports the idea that younger paralogs derived from HMs are more likely to form HETs than older ones. We further tested for an association between the age of SSDs established using gene phylogenies and their propensity to form HET, but we found no significant association, most likely due to the small number of pairs per age group (Figure S5. B).
Amino acid sequence conservation could also have a direct effect on the retention of HETs, independently of age. For instance, among WGDs (either within true ohnologs or within homeologs), which all have the same age in their own category, HM&HET pairs have higher sequence identity than HM pairs (Figure S5. C). This is also apparent for pairs of paralogs whose HM or HET structures have been solved by crystallography (Table S1). Indeed, we found that pairwise amino acid sequence identity was higher for HM&HET than for HM pairs for both entire proteins and their binding interfaces (Figure S6. A). Interestingly, the binding interface is more conserved than the rest of the protein for those forming HM&HET, suggesting a causal link between sequence identity at the interface and assembly of HM&HETs (Figure S6. B).
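To make the comparison between whole-protein and interface identity concrete, the following sketch computes percent identity from a pairwise alignment, either over all columns or restricted to interface columns. It is a simplified illustration: the function names are hypothetical, the alignment is assumed to be given as two equal-length gapped strings, and columns containing a gap are excluded from the denominator, a convention that may differ from the one described in the methods.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # gapped columns are ignored (assumed convention)
        compared += 1
        if res_a == res_b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100 * matches / compared


def interface_identity(aligned_a, aligned_b, interface_columns, gap="-"):
    """Percent identity restricted to the given 0-based alignment columns."""
    sub_a = "".join(aligned_a[i] for i in interface_columns)
    sub_b = "".join(aligned_b[i] for i in interface_columns)
    return percent_identity(sub_a, sub_b, gap)
```

Comparing the two values for the same pair indicates whether the interface is more or less conserved than the protein as a whole.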
Heteromer formation correlates with functional conservation
To test if the retention of HETs correlates with the functional similarity of HM and HM&HET paralogs, we used the similarity of Gene Ontology (GO) terms, known growth phenotypes of loss-of-function mutants and patterns of genome-wide genetic interactions. These features represent the relationship of genes with cell growth in specific conditions and the gene-gene relationships underlying cell growth. The use of GO terms could bias the analysis because they are often predicted based on sequence features. However, phenotypes and genetic interactions are derived from unbiased experiments because interactions are tested without a priori consideration of a gene’s function (Costanzo et al., 2016). We found that HM&HET pairs are more similar than HM for SSDs (Figure 3). We observe the same tendency for WGDs, although some of the comparisons are either marginally significant or non-significant (Figure 3, comparison between true ohnologs and homeologs in Figure S7). Overall, the retention of HETs after the duplication of HMs is correlated with a weaker divergence of function.
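One simple way to score the functional similarity of two paralogs from annotation sets (GO terms, phenotypes) is the Jaccard index, sketched below. This section does not specify the similarity measure, so the Jaccard index is used here purely as an assumption for illustration, and the function name is hypothetical.

```python
def jaccard_similarity(annotations_a, annotations_b):
    """Shared annotations divided by all annotations of the two paralogs.

    Returns None when neither paralog has any annotation, since the
    similarity is undefined in that case.
    """
    set_a, set_b = set(annotations_a), set(annotations_b)
    union = set_a | set_b
    if not union:
        return None
    return len(set_a & set_b) / len(union)
```

A value of 1 means that the two paralogs carry exactly the same annotations, and a value of 0 means that they share none.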
Pleiotropy contributes to the maintenance of heteromers
Since molecular interactions between paralogs predate their functional divergence, it is likely that physical association by itself affects the retention of functional similarity among paralogs. Because of this, any feature of the paralogs that contributes to the maintenance of the HET state could have a strong impact on the fate of new genes emerging from the duplication of a HM. A large fraction of HMs and HETs use the same binding interface (Bergendahl and Marsh, 2017). This shared interface could cause mutations to have pleiotropic effects on HMs and HETs (Figure 1). If we assume that HMs need to self-interact in order to perform their function, it is expected that natural selection would favor the maintenance of self-assembly. Negative selection on HM interfaces will thus also preserve HET interfaces, preventing the loss of the HET.
We tested this correlated selection model using in silico evolution of HM and HET protein complexes (Figure 4. A). We used a set of six representative high-quality structures of HMs (Dey et al., 2018). We evolved these HM complexes by duplicating them and by following the binding energies of the two HMs and of the HET. We let mutations occur at the binding interface 1) in the absence of selection (neutral model), 2) in the presence of negative selection maintaining only one HM or 3) in the presence of negative selection maintaining both HMs. In scenarios 2 and 3, we apply no selection on the binding energy of the HET. In the fourth scenario, we apply selection on the HET but not on the HMs to examine if selection maintaining the HET could also favor the maintenance of HMs. Mutations that have deleterious effects on the complex under selection were lost or allowed to fix with exponentially decaying probability depending on the fitness effect (see methods) (Figure 4. A).
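The logic of the four selection scenarios can be sketched with a deliberately simplified toy model. This is not the structure-based procedure used in our simulations: mutational effects are drawn at random instead of being computed from structures, a mutation with effect e in one paralog is assumed to change the HET by e and that paralog's HM by 2e (both chains mutated), and the fitness plateau, the parameter `beta` and all names are hypothetical choices made for illustration.

```python
import math
import random

# Complexes under negative selection in each scenario.
SCENARIOS = {
    "neutral": (),
    "one_hm": ("HM_A",),
    "both_hm": ("HM_A", "HM_B"),
    "het": ("HET",),
}


def fitness(energy_shift):
    """Assumed plateau: no benefit for binding tighter than the ancestor,
    linear cost for weaker binding (positive shift = destabilized)."""
    return -max(0.0, energy_shift)


def fixation_probability(delta_fitness, beta):
    """Neutral or beneficial changes fix; deleterious ones fix with a
    probability that decays exponentially with the fitness cost."""
    if delta_fitness >= 0:
        return 1.0
    return math.exp(beta * delta_fitness)


def simulate(scenario, steps=200, beta=10.0, seed=0):
    """Return the final binding-energy shift of each complex."""
    rng = random.Random(seed)
    selected = SCENARIOS[scenario]
    energy = {"HM_A": 0.0, "HM_B": 0.0, "HET": 0.0}
    for _ in range(steps):
        effect = rng.gauss(0.2, 0.5)  # destabilizing on average (assumed)
        paralog = rng.choice("AB")    # which duplicate is mutated
        proposal = dict(energy)
        proposal["HM_" + paralog] += 2 * effect  # mutation on both chains
        proposal["HET"] += effect                # mutation on one chain
        delta = sum(fitness(proposal[c]) - fitness(energy[c]) for c in selected)
        if rng.random() < fixation_probability(delta, beta):
            energy = proposal
    return energy
```

In this toy model, calling `simulate("one_hm")` and `simulate("neutral")` with the same seed proposes the same mutations, so any difference in the final HET energy reflects only the mutations removed by selection on the HM. The perfect coupling between HM and HET effects is built in by construction here, whereas in the structure-based simulations it is an outcome to be tested.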
We find that neutral evolution leads to the destabilization of all complexes (Figure 4. B), as is expected given that there are more destabilizing mutations than stabilizing ones, both in terms of binding energy and chain stability (Brender and Zhang, 2015; Guerois et al., 2002). However, selection to maintain one HM or both HMs significantly slows down the loss of the HET (Figure 4. C, E-F). Interestingly, the HET is destabilized more slowly than the second HM when only one HM is under negative selection, which could explain why for some paralog pairs, only one HM and the HET are conserved (Figure S8). The reciprocal situation is also true, i.e. negative selection on HET significantly decelerates the loss of stability of HMs (Figure 4. D). These observations hold when simulating the evolution of duplication of five other structures (Figure S9). These results reveal that pleiotropic effects are surprisingly strong in this context.
By examining the effects of single mutants (only one of the loci gets a non-synonymous mutation at the interface) on HMs and HETs, we see that the evolutionary paths followed by the simulations are caused by the correlated effects of mutations (Figure 5). The majority of mutations (77%) appear to have effects with the same directionality on both complexes, either being stabilizing or destabilizing. The remaining fraction of mutations (23%) specifically destabilizes either the HETs or HMs and these mutations tend to have small effects, further reducing the likelihood of mutations that would rapidly disrupt a specific complex. Again, the results hold for the six structures tested (Figure S10). Additionally, mutations tend to have greater effects on the binding energy of HMs compared to HETs, presumably because of the presence of a mutation on both chains simultaneously (Figure S11. A-B). Likewise, a higher percentage of double mutants (one at the interface of each duplicated locus) are fixed during the simulations when negative selection is applied to the HET rather than to the two HMs (Figure S12). This effect is more pronounced for mutants with opposite effects on the HMs, which could partially cancel out when applied to chains forming a HET. The particular capability of HETs to accommodate mutations with opposite effects in each of the monomers without disrupting interactions results in a higher robustness with respect to protein complex assembly.
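The distinction between mutations with the same directionality on both complexes and complex-specific mutations can be written as a small classification rule. The sketch below is illustrative: function names, the `tolerance` parameter and the input format (pairs of binding-energy changes for the HM and the HET, with positive values meaning destabilization) are assumptions, not a description of our analysis code.

```python
def direction(ddg, tolerance=0.0):
    """+1 if destabilizing, -1 if stabilizing, 0 if within tolerance of zero."""
    if ddg > tolerance:
        return 1
    if ddg < -tolerance:
        return -1
    return 0


def classify_mutation(ddg_hm, ddg_het, tolerance=0.0):
    """Label a mutation by the direction of its effects on the HM and the HET."""
    d_hm = direction(ddg_hm, tolerance)
    d_het = direction(ddg_het, tolerance)
    if d_hm == d_het and d_hm != 0:
        return "destabilizes both" if d_hm > 0 else "stabilizes both"
    return "complex-specific or neutral"


def fraction_same_direction(effects, tolerance=0.0):
    """Fraction of (ddg_hm, ddg_het) pairs with the same non-zero direction."""
    effects = list(effects)
    if not effects:
        raise ValueError("no mutations provided")
    same = sum(
        classify_mutation(hm, het, tolerance) != "complex-specific or neutral"
        for hm, het in effects
    )
    return same / len(effects)
```

A non-zero `tolerance` would treat very small energy changes as neutral, which may be preferable when predicted energies carry numerical noise.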
Regulatory evolution may break down molecular pleiotropy
The results from simulations show that the loss of HET after the duplication of a HM occurs at a slow rate and that specific rare mutations are required for HETs to be destabilized. However, the simulations only consider the evolution of binding interfaces, which limits the modification of interactions to a subset of all mutations that can ultimately affect PPIs (Hochberg et al., 2018). Other mechanisms responsible for the loss of paralog interactions could involve transcriptional regulation or cell compartment localization such that paralogs are not present at the same time or in the same cell compartment. To test this, we measured the correlation coefficient of expression profiles of paralogs across growth conditions using previously published data (Ihmels et al., 2004). These expression profiles are more correlated for SSDs forming HM&HET than for those forming only HM (p-value = 2.5e-04, Figure 6. A). A similar tendency is observed for WGDs, but the difference is not statistically significant (p-value = 0.18, Figure 6. A). To contribute to the isolation of paralogs, the differences of expression correlation between HM&HET and HM would need to be caused by cis regulatory divergence rather than by elements in trans, which would be shared by both members of a pair. The similarity of transcription factor binding sites between paralogs confirms this (Figure 6. B). When we split WGDs into true ohnologs and homeologs, even though the true ohnologs are more co-expressed (Figure S13. A), we do not observe differences of co-expression between interaction motifs (Figure S13. B). Because we found that sequence identity was correlated with the probability of observing HET and that sequence identity can be correlated with the co-expression of paralogs, we tested if these factors had an independent effect on HET formation. For SSDs, both sequence identity and co-expression show significant effects on HM&HET formation (Table S5. B), but for WGDs, only the percentage of identity seems to impact HM&HET formation (Figure 6. C, Table S5. B). Similar results were obtained for both true ohnologs and homeologs (Figure S13. C).
Finally, we find that HM&HET paralogs are more similar than HM for both SSDs and WGDs in terms of cellular compartments (GO) (Figure 6. D) and of cellular localization derived from experimental data (Figure 6. E). Taking into account the two origins of WGDs (true ohnologs and homeologs), we observe the same significant differences of localization and the same tendency (but not significant) for the similarity of cellular compartments between HM and HM&HET pairs (Figure S13. D-F). Together, the comparisons of expression profile correlations, of transcription factor binding sites, and of GO cellular component and localization similarities show that changes in gene and protein regulation could prevent the interaction between paralogs that derive from ancestral HMs, reducing the role of pleiotropy in maintaining their associations.
Discussion
Upon duplication, the properties of proteins are inherited from their ancestor, which may affect how paralogs subsequently evolve. Here, we examined the extent to which physical interactions between paralogs are preserved after the duplication of HMs and how these interactions affect functional divergence. Using reported PPI data, crystal structures and PCA experiments, we found that paralogs originating from ancestral HMs are more likely to functionally diverge if they do not interact with each other. We propose that non-adaptive mechanisms could play a role in the retention of physical interactions and in turn, impact functional divergence. By developing a model of in silico evolution of PPIs, we found that molecular pleiotropic effects of mutations on binding interfaces can constrain the maintenance of HET complexes even if they are not under selection. We hypothesize that this non-adaptive constraint could play a role in slowing down the divergence of paralogs but that it could be counteracted by regulatory evolution.
The proportions of HMs and HETs among yeast paralogs were first studied more than 15 years ago (Wagner, 2003). From this first study, it was suggested that most paralogs forming HETs do not have the ability to form HMs and thus, that evolution of new interactions was rapid. Since then, many PPI experiments have been performed (Chatr-Aryamontri et al., 2017; Kim et al., 2019; Stark et al., 2006; Stynen et al., 2018) and the resulting global picture is different. We found that most of the paralogs forming HETs also form HMs, suggesting that interactions between paralogs are inherited rather than gained de novo. This idea is supported by models predicting interaction losses to be much more likely than interaction gains (Gibson et al., 2009; Presser et al., 2008). Accordingly, the HM&HET state can be more readily achieved by the duplication of an ancestral HM and the subsequent loss of one of the two HMs than by the duplication of a monomeric protein followed by the gain of the HM and of the HET. Interacting paralogs are therefore more likely to derive from ancestral HMs, as shown by (Diss et al., 2017). For some pairs of S. cerevisiae paralogs presenting the HM&HET motif, we indeed detected HM formation of their orthologs from pre-whole-genome duplication species, supporting the model by which self-interactions and cross-interactions are inherited from the duplication. We did not detect HMs in both pre-whole-genome duplication species, which may reflect the incorrect expression of these proteins in S. cerevisiae rather than their lack of interactions. Overall, we conclude that to lose HET complexes, interacting paralogs would have to diverge in the sequence of their binding interfaces, or through other mechanisms. The role of the divergence of binding interfaces is supported by our results showing that interacting paralogs have a higher sequence identity both at the full sequence and at the binding interface level than paralogs that do not interact.
We observed an enrichment of HMs among duplicated proteins compared to singletons. This observation was reported for various interactomes (Ispolatov et al., 2005; Pereira-Leal et al., 2007; Pérez-Bercoff et al., 2010; Yang et al., 2003). Also, analyses of PPIs from large-scale experiments have shown that interactions between paralogous proteins are more common than expected by chance alone (Ispolatov et al., 2005; Musso et al., 2007; Pereira-Leal et al., 2007). Several adaptive hypotheses have been put forward to explain the over-representation of interacting paralogous proteins. Altogether, these hypotheses imply that the retention of paralogs after duplication is more likely for HM proteins due to the enhanced probability of gain of functions (although subfunctionalization is also possible). For example, symmetrical HM proteins could have key advantages over monomeric ones for protein stability and regulation (André et al., 2008; Bergendahl and Marsh, 2017). Levy and Teichmann (Levy and Teichmann, 2013) suggested that the duplication of HM proteins serves as a seed for the growth of protein complexes. These duplications would allow the diversification of complexes by creating asymmetry and the recruitment of other proteins, leading to their specialization. It is also possible that the presence of HETs itself offers a rapid way to evolve new functions. For instance, Bridgham et al. (Bridgham et al., 2008) showed that degenerative mutations in one copy of a heterodimeric transcription factor can switch its role to become a repressor. Regulatory mechanisms could therefore rapidly evolve this way (Bridgham et al., 2008; De Smet et al., 2013; Kaltenegger and Ober, 2015). Finally, Natan et al. (Natan et al., 2018) showed that cotranslational folding can be a problem for homomeric proteins because of premature assembly, particularly for proteins with interfaces closer to their N-terminus.
The replacement of these HMs by HETs could solve this issue by separating the translation of the proteins to be assembled on two distinct mRNAs.
Non-adaptive mechanisms could also be at play to maintain HETs. Diss et al. (Diss et al., 2017) recently showed that in the absence of their paralog, some proteins are unstable and lose their capacity to interact with other proteins. The fact that a paralog can be unstable in the absence of its sister copy appears to be enriched among paralogs forming HET, suggesting that the individual proteins depend on each other due to their physical interactions (Diss et al., 2017). Independent observations by (DeLuna et al., 2010) also showed that the deletion of paralogs was sometimes associated with the degradation of their sister copy, particularly among HET paralogs. The Diss et al. and DeLuna et al. observations led to the proposal that paralogs could accumulate complementary degenerative mutations at the structural level after the duplication of a HM (Diss et al., 2017; Kaltenegger and Ober, 2015). This scenario would lead to the maintenance of the HET because destabilizing mutations in one chain can be compensated by stabilizing mutations in the other, keeping binding energy near the optimum. Compensatory effects cannot happen for HMs because the two copies of the protein are identical. Our simulations are consistent with the compensatory model where some pairs of mutations in the two chains of the HET have opposite effects on binding energy. In the long term, the accumulation of opposite-effect mutations could maintain the HET and it could become the only functional unit capable of performing the ancestral function. However, our data suggest that most (89%) of dependent paralogs that form HET in (Diss et al., 2017) also form at least one HM, suggesting that the loss of both HMs is not required for dependency. Further experiments will be needed to fully determine the likelihood of the dependency model and in which conditions it could take place.
Our simulated evolution of the duplication of HMs leads to the proposal of a simple mechanism for the maintenance of HET that does not require adaptive mechanisms. A large fraction of HMs and HETs use the same binding interface (Bergendahl and Marsh, 2017) and as a consequence, negative selection on HM interfaces will also preserve HET interfaces. Our results show that mutations have correlated effects on HM and HET, which slows down the evolution of these independent complexes. Although mutations with specific destabilization effects are available, their small magnitude would slow down evolution. This limitation in terms of mutational effects that could separate complexes could therefore be another non-adaptive mechanism for the long-term maintenance of heteromers.
One of our observations is that WGDs present proportionally more HM&HET motifs than SSDs. We propose that this is at least partly due to the age of paralogs. This proposal was based on the fact that SSDs in yeast are in general older than WGDs and even among WGDs, homeologs show less frequent HM&HET than HMs compared to true ohnologs. However, the mode of duplication itself could also impact HET maintenance. For instance, upon a whole-genome duplication event, all subunits of complexes are duplicated at the same time, which can lead to differential rates of gene retention across functional categories. Consistent with this, previous studies have found that WGDs are enriched in proteins that are subunits of protein complexes when compared with SSDs (Hakes et al., 2007; Papp et al., 2003). A possible explanation for this trend is that while the duplication of a single subunit of a complex may be deleterious by perturbing stoichiometry, duplication of all subunits at once may not cause a disadvantage since balance is preserved (Birchler and Veitia, 2012; Rice and McLysaght, 2017). Another possibility is that some WGDs are maintained due to selection for gene dosage (Ascencio et al., 2017; Edger and Pires, 2009; Gout and Lynch, 2015; Sugino and Innan, 2006; Thompson et al., 2016). Therefore, the ancestral gene sequence, regulation and function are conserved, which ultimately favors the maintenance of HETs.
We noticed a significant fraction of paralogs forming only HMs but not HET, including some cases of recent duplicates, indicating that the forces that maintain HETs can be overcome. Duplicate genes in yeast and other model systems often diverge quickly in terms of transcriptional regulation (Li et al., 2005; Thompson et al., 2013) due to cis regulatory mutations (Dong et al., 2011). Because transcriptional divergence of paralogs can directly change PPI profiles, expression changes would be able to rapidly change a motif from HM&HET to HM. Indeed, Gagnon-Arsenault et al. (Gagnon-Arsenault et al., 2013) showed that switching the coding sequences between paralogous loci was sometimes sufficient to change PPI specificity in living cells. Protein localization can also be an important factor affecting the ability of proteins to interact (Rochette et al., 2014). We found that paralogs that derive from HMs and that have lost their ability to form HETs are less co-regulated and co-localized. This divergence suggests that regulatory evolution could play a role in relieving duplicated homomeric proteins from the correlated effects of mutations affecting shared protein interfaces. Although we did not explore this possibility, it is likely that pairs of paralogs that are co-regulated and co-localized but that do not form HET show other mechanisms that prevent their association. For instance, it is possible that paralogs could gain or lose interaction domains (Nasir et al., 2014), which could potentially bypass the constraints imposed by homologous interaction interfaces and drive the functional differentiation of paralogs. Further analyses of structural data as well as transcriptional and localization data will help to disentangle the causal or correlated roles of these individual mechanisms in the evolution of PPIs and in the functional evolution of paralogs.
Overall, our analyses show that the duplication of self-interacting proteins creates paralogs whose evolution is constrained by pleiotropy in ways that are not expected for monomeric paralogs. Pleiotropy has been known to influence the architecture of complex traits and thus to shape their evolution (Wagner and Zhang, 2011). However, how it takes place at the molecular level and how it can be overcome to allow molecular traits to evolve independently are still largely unknown. Here we provide a surprisingly simple system in which the role of pleiotropy can be examined at the molecular level. Because gene duplication is the major mechanism for the evolution of cellular networks and because a large fraction of proteins is oligomeric, these constraints could be an important force in shaping protein networks. Another surprising result is that negative selection for the maintenance of heteromers of paralogs is not needed for their preservation in the long term, further enhancing the role of non-adaptive evolution in shaping the complexity of cellular structures (Lynch et al., 2014).
Material and Methods
All scripts used to analyze the data are available at https://github.com/landrylaboratory/Gene_duplication_2019
1. Characterization of paralogs in S. cerevisiae genome
1.1 Classification of paralogs by mechanism of duplication
We classified duplicated genes in three categories according to their mechanism of duplication: small-scale duplicates, SSDs; whole-genome duplicates, WGDs (Byrne and Wolfe, 2005); and double duplicates, 2D (SSDs and WGDs). We removed WGDs from the paralogs defined in (Guan et al., 2007) to generate the list of SSDs. If one of the two paralogs of a SSDs pair is associated to another paralog in a WGDs pair, this paralog was considered a 2D (Tables S7 and S8). To decrease the potential bias from multiple duplication events, we removed the 2Ds and paralogs from successive small-scale genome duplications from the data on interaction motifs. We used data from (Marcet-Houben and Gabaldón, 2015) to identify WGDs that are likely true ohnologs or that originated from allopolyploidization (homeologs).
1.2 Sequence similarity
The amino-acid sequences of each pair of proteins were obtained from the Saccharomyces Genome Database (SGD) (S288C strain version 2007-03-01) and aligned using the pairwiseAlignment function (default parameters) from the R Biostrings package (Pagès et al., 2018). The percentage of identity was estimated using the pid function (option type = “PID1”) from the same package.
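As an illustration of the identity measure, the following minimal Python sketch computes a PID1-style value from two already-aligned, gapped sequences. It assumes the PID1 definition documented for Biostrings, that is, 100 × identical positions / (aligned positions + internal gap positions), and it treats leading and trailing gap columns as terminal gaps that are excluded. The function name `pid1` and the example sequences are hypothetical, and this is not the Biostrings implementation itself.

```python
def pid1(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned positions plus internal gap positions.

    Assumption: columns at either end of the alignment that contain a gap
    are terminal gaps and do not count in the denominator.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = list(zip(aligned_a, aligned_b))

    # Trim leading columns that contain a gap (terminal gaps).
    start = 0
    while start < len(columns) and gap in columns[start]:
        start += 1
    # Trim trailing columns that contain a gap (terminal gaps).
    end = len(columns)
    while end > start and gap in columns[end - 1]:
        end -= 1

    core = columns[start:end]
    if not core:
        return 0.0
    # Identical positions: same residue in both sequences, gaps never match.
    identical = sum(1 for a, b in core if a == b and a != gap)
    return 100.0 * identical / len(core)


# Hypothetical aligned paralog fragments with one internal gap.
print(pid1("ACDE-FG", "ACDEHFG"))
```

Internal gaps lower the score because they stay in the denominator, whereas terminal overhangs do not.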
1.3 Age of duplication
We estimated the age of SSDs using gene phylogenies. If the duplication of gene A led to the formation of two paralogs A1/A2 before the speciation event separating two species, orthologs A1 are expected to be more similar to each other than paralogs A1/A2 (Li et al., 2003). Based on this principle, we estimated the age of the SSDs using gene phylogenies extracted from PhylomeDB (Huerta-Cepas et al., 2008) and imported in the R Tidytree package (Yu, 2019; Yu et al., 2018, 2017). For a gene A1, the common node with its paralog A2 was identified and the next more recent node leading to A1 was retained. Among the descending clades of the retained node, the most distant species from S. cerevisiae was selected and an age group was defined following a species tree (see Figures S14 and S15, Tables S1 and S9). If the two paralogs were not found in the same tree, the node assigned was the oldest for each tree. When the age group determined from the gene phylogeny of paralog A1 did not match the age group of paralog A2, it was potentially due to the disappearance of one paralog in the more distant clade; therefore, the oldest age group was retained. The age group assignment was then validated by checking whether for at least one species of the corresponding phylogenetic group, the ortholog gene had at least two paralogs. If it was not the case, the age group assignment was decreased by one and retested until a species was found with at least two paralogs within the group. This custom method was tested for WGDs that occurred in the common ancestor of post-WGD species or of the pre-KLE (Kluyveromyces, Lachancea, and Eremothecium) branch, defined by the age groups number 1 and 2 respectively (Figure S14). The error rate was estimated to be less than 1%.
1.4 Function, transcription factor binding sites, localization and protein complexes
We obtained GO terms (GO slim) from SGD (Cherry et al., 2012) in September 2018. We removed terms corresponding to missing data and created a list of annotations for each SSD and WGD gene. The list was compared to measure the extent of similarity between two members of a pair of duplicates. We calculated the similarity of molecular function, cellular component and biological process taking the number of GO terms in common divided by the total number of unique GO terms of the two paralogs combined (Jaccard index). We compared the same way transcription factor binding sites using YEASTRACT data (Teixeira et al., 2018, 2006), cellular localizations extracted from YeastGFP database (Huh et al., 2003) and many phenotypes associated with the deletion of the paralogs (data from SGD in September 2018). For the deletion phenotypes, we kept only information with specific changes (a feature observed and a direction of change relative to wild type). We compared the pairwise correlation of genetic interaction profiles using the genetic interaction profile similarity (measured by Pearson’s correlation coefficient) of non-essential genes available in TheCellMap database (version of March 2016) (Usaj et al., 2017). We used the median of correlation coefficients if more than one value was available for a given pair. Non-redundant set of protein complexes was derived from the Complex Portal (Meldal et al., 2015), the CYC2008 catalogue (Pu et al., 2009, 2007) and Benschop et al., (Benschop et al., 2010).
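The similarity measure used for GO terms, transcription factor binding sites, localizations and phenotypes can be sketched as follows. This is a minimal Python illustration of the Jaccard index as described above. The function name and the annotation sets are hypothetical, and returning `None` when neither paralog has any annotation is an assumption made here rather than a rule stated in the text.

```python
def jaccard(terms_a, terms_b):
    """Shared annotations divided by all unique annotations of the pair."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    if not union:
        # Assumption: a pair without any annotation is treated as missing data.
        return None
    return len(a & b) / len(union)


# Hypothetical molecular-function annotations for two paralogs.
paralog_1 = {"GO:0005515", "GO:0005524", "GO:0016301"}
paralog_2 = {"GO:0005524", "GO:0016301", "GO:0003677"}
print(jaccard(paralog_1, paralog_2))  # 2 shared terms out of 4 unique terms
```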
2. HMs and HETs identified from databases
To complement our experimental data, we extracted HM and HET published in BioGRID version BIOGRID-3.5.166 (Chatr-Aryamontri et al., 2017, 2013). We used data derived from the following detection methods: Affinity Capture-MS, Affinity Capture-Western, Reconstituted Complex, Two-hybrid, Biochemical Activity, Co-crystal Structure, Far Western, FRET, Protein-peptide, PCA and Affinity Capture-Luminescence. It is possible that some HMs or HETs are absent from the database because they have been tested but not detected. This negative information is not reported. We therefore attempted to discriminate non-tested interactions from truly non interacting pairs the following way. We obtained the list of Pubmed IDs from studies in which at least one HM is reported to identify those using methods that can detect HMs. We then examined every study individually to examine if a given HM was reported and inferred that the HM existed if reported (coded 1). If it was absent but other HMs were reported and that we confirmed the presence of the protein as a bait and a prey, we considered the HM absent (coded 0). If the protein was not present in both baits and preys, we considered the HM as not tested (coded NA). Then, if the HM was detected in the Protein Data Bank (PDB) (Berman et al., 2000), we inferred that it was present (coded 1). If the HM was not detected but the monomer was reported, it is likely that there is no HM for this protein and it was thus considered non-HM (coded 0). If there was no monomer and no HM, the data was considered as missing (coded NA). We proceeded the same way for HETs. Data on genome-wide HM screens was obtained from (Kim et al., 2019; Stynen et al., 2018). The two methods relied on Protein-fragment complementation assays (PCA), the first one using the dihydrofolate reductase (DHFR) enzyme as a reporter and the second one, a fluorescent protein (also known as Bimolecular fluorescence complementation (BiFC)). 
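The coding rules described above for separating detected, absent and non-tested HMs can be summarized in a short Python sketch. The function and argument names are hypothetical, `None` stands in for NA, and how the study-based and PDB-based codes are merged into a final call is not specified in this section, so the two rules are kept separate.

```python
def code_from_interaction_study(hm_reported, other_hms_reported, is_bait, is_prey):
    """Code one HM within one interaction study (1, 0, or None for NA)."""
    if hm_reported:
        return 1  # the HM is reported in the study
    if other_hms_reported and is_bait and is_prey:
        return 0  # tested as bait and prey in a study able to detect HMs
    return None   # the protein was not tested in both orientations


def code_from_pdb(hm_structure_found, monomer_structure_found):
    """Code one HM from Protein Data Bank evidence (1, 0, or None for NA)."""
    if hm_structure_found:
        return 1  # a homomeric structure exists
    if monomer_structure_found:
        return 0  # only the monomer was crystallized
    return None   # no structure available


# Hypothetical protein tested as bait and prey without a reported HM.
print(code_from_interaction_study(False, True, True, True))
```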
We discarded proteins from (Stynen et al., 2018) flagged as problematic by (Rochette et al., 2014; Tarassov et al., 2008) (Stynen et al., 2018) and false positives identified by (Kim et al., 2019). All discarded data was considered as missing data. We examined all proteins tested and considered them as HM if they were reported as positive (coded 1) and considered non-HM if tested but not reported as positive (coded 0).
3. Experimental Protein-fragment complementation assay
We performed a screen using PCA based on DHFR (Tarassov et al., 2008) following standard procedures (Rochette et al., 2014; Tarassov et al., 2008).
3.1 DHFR strains
We identified 188 pairs of SSDs and 453 pairs of WGDs that were present in the Yeast Protein Interactome Collection (Tarassov et al., 2008) and another set of 143 strains constructed by (Diss et al., 2017). We retrieved strains from the collection (Tarassov et al., 2008) and we let them grow on YPD supplemented with nourseothricin (NAT) for DHFR F[1,2] strains and hygromycin B (HygB) for DHFR F[3] strains. We confirmed the insertion of the DHFR fragments at the correct location by colony PCR using a specific forward Oligo-C targeting a few hundred base pairs upstream of the fusion and a reverse complement oligonucleotide ADH-term located in the ADH terminator after the DHFR fragment sequence (Table S10). Cells from colonies were lysed in 40 µL of 20 mM NaOH for 20 min at 95°C. Tubes were centrifuged for 5 min at 1792 g and 2.5 µL of supernatant was added to a PCR mix composed of 16.85 µL of DNAse free water, 2.5 µL of 10X Taq buffer (BioShop Canada Inc., Canada), 1.5 µL of 25 mM MgCl2, 0.5 µL of 10 mM dNTP (Bio Basic Inc., Canada), 0.15 µL of 5 U/µL Taq DNA polymerase (BioShop Canada Inc., Canada), 0.5 µL of 10 µM Oligo-C and 0.5 µL of 10 µM ADH-term. The initial denaturation was performed for 5 min at 95°C and was followed by 35 cycles of 30 sec of denaturation at 94°C, 30 sec of annealing at 55°C, 1 min of extension at 72°C and 3 min of a final extension at 72°C. We confirmed 1769 out of the 2564 strains from the DHFR collection and 117 strains out of the 143 from (Diss et al., 2017) (Tables S7, S8, and S10).
The missing or non-validated strains were constructed de novo using the standard DHFR strain construction protocol (Michnick et al., 2016; Rochette et al., 2015). The DHFR fragments and associated resistance modules were amplified from plasmids pAG25-linker-F[1,2]-ADHterm (NAT resistance marker) and pAG32-linker-F[3]-ADHterm (HygB resistance marker) (Tarassov et al., 2008) using oligonucleotides defined in (Table S10). PCR mix was composed of 16.45 µL of DNAse free water, 1 µL of 10 ng/µL plasmid, 5 µL of 5X Kapa Buffer (Kapa Biosystems, Inc., A Roche Company, Canada), 0.75 µL of 10 mM dNTPs, 0.3 µL of 1 U/µL Kapa HiFi HotStart DNA polymerase (Kapa Biosystems, Inc., A Roche Company, Canada) and 0.75 µL of both forward and reverse 10 µM oligos. The initial denaturation was performed for 5 min at 95°C and was followed by 32 cycles of 20 sec of denaturation at 98°C, 15 sec of annealing at 64.4°C, 2.5 min of extension at 72°C and 5 min of a final extension at 72°C.
We performed strain construction in BY4741 (MATa his3Δ leu2Δ met15Δ ura3Δ) and BY4742 (MATα his3Δ leu2Δ lys2Δ ura3Δ) competent cells prepared as in (Gagnon-Arsenault et al., 2013) for the DHFR F[1,2] and DHFR F[3] fusions respectively. Competent cells (20 µL) were combined with 8 µL of PCR product (∼0.5-1 µg/µL) and 100 µL of Plate Mixture (PEG3350 40%, 100 mM of LiOAc, 10 mM of Tris-Cl pH 7.5 and 1 mM of EDTA). The mixture was vortexed and incubated at room temperature without agitation for 30 min. Heat shock was then performed after adding 15 µL of DMSO and mixing thoroughly by incubating in a water bath at 42°C for 15-20 min. Following the heat shock, cells were spun down at 400g for 3 min. Supernatant was removed by aspiration and cell pellets were resuspended in 100 µL of YPD (1% yeast extract, 2% tryptone, 2% glucose). Cells were allowed to recover from heat shock for 4 hours at 30°C before being plated on YPD plates (YPD with 2% agar) supplemented with 100 µg/mL of NAT for DHFR F[1,2] strains or with 250 µg/mL of HygB for DHFR F[3] strains. Cells were incubated at 30°C for 3 days (Table S11). The correct integration of DHFR fragments was confirmed by colony PCR as described above. At the end, we were able to reconstruct and validate 152 new strains (Tables S7 and S8). From all available strains, we selected pairs of paralogs for which we had both proteins tagged with both DHFR fragments (four different strains per pair). This resulted in 1268 strains corresponding to 317 pairs of paralogs (Tables S7 and S8). We finally discarded pairs considered as forming false positives by (Tarassov et al., 2008), which resulted in 286 pairs.
3.2 Construction of DHFR plasmids for orthologous genes
For the plasmid-based PCA, Gateway cloning-compatible destination plasmids pDEST-DHFR F[1,2] (TRP1 and LEU2) and pDEST-DHFR F[3] (TRP1 and LEU2) were constructed based on the CEN/ARS low-copy yeast two-hybrid (Y2H) destination plasmids pDEST-AD (TRP1) and pDEST-DB (LEU2) (Rual et al., 2005). A DNA fragment having I-CeuI restriction site was amplified using DEY001 and DEY002 primers (Table S10) without template and another fragment having PI-PspI/I-SceI restriction site was amplified using DEY003 and DEY004 primers (Table S10) without template. pDEST-AD and pDEST-DB plasmids were each digested by PacI and SacI and mixed with the I-CeuI fragment (destined to the PacI locus) and PI-PspI/I-SceI fragment (destined to the SacI locus) for Gibson DNA assembly (Gibson et al., 2009) to generate pDN0501 (TRP1) and pDN0502 (LEU2). Four DNA fragments were then prepared to construct the pDEST-DHFR F[1,2] vectors: (i) a fragment containing ADH1 promoter; (ii) a fragment containing Gateway destination site; (iii) a DHFR F[1,2] fragment; and (iv) a backbone plasmid fragment. The ADH1 promoter fragment was amplified from pDN0501 using DEY005 and DEY006 primers (Table S10) and the Gateway destination site fragment was amplified from pDN0501 using DEY007 and DEY008 primers (Table S10). The DHFR-F[1,2] fragment was amplified from pAG25-linker-F[1,2]-ADHterm (Tarassov et al., 2008) using DEY009 and DEY010 primers (Table S10). The backbone fragment was prepared by restriction digestion of pDN0501 or pDN0502 using I-CeuI and PI-PspI and purified by size-selection. The four fragments were assembled by Gibson DNA assembly where each fragment pair was overlapping with more than 30 bp, producing pHMA1001 (TRP1) or pHMA1003 (LEU2). 
The PstI–SacI region of the plasmids was finally replaced with a DNA fragment containing an amino acid flexible polypeptide linker (GGGGS) prepared by PstI/SacI double digestion of a synthetic DNA fragment DEY011 to produce pDEST-DHFR F[1,2] (TRP1) and pDEST-DHFR F[1,2] (LEU2). The DHFR F[3] fragment was then amplified from pAG32-linker-F[3]-ADHterm with DEY012 and DEY013 primers (Table S10), digested by SpeI and PI-PspI, and used to replace the SpeI–PI-PspI region of the pDEST-DHFR F[1,2] plasmids, producing pDEST-DHFR F[3] (TRP1) and pDEST-DHFR F[3] (LEU2) plasmids. In this study, we used pDEST-DHFR F[1,2] (TRP1) and pDEST-DHFR F[3] (LEU2) for the plasmid-based DHFR PCA. After Gateway LR cloning of Entry Clones to these destination plasmids, the expression plasmids encode protein fused to the DHFR fragments via an NPAFLYKVVGGGSTS linker.
We obtained the orthologous gene sequences for the mitochondrial translocon complex and the transaldolase proteins of Lachancea kluyveri and Zygosaccharomyces rouxii from the Yeast Gene Order Browser (YGOB) (Byrne and Wolfe, 2005). Each ORF was amplified using oligonucleotides listed in Table S10. We used 300 ng of purified PCR product to set a BPII recombination reaction (5 μL) into the Gateway Entry Vector pDONR201 (150 ng) according to the manufacturer’s instructions (Invitrogen, USA). The BPII reaction mix was incubated overnight at 25°C. The reaction was inactivated with proteinase K. The whole reaction was used to transform MC1061 competent E. coli cells (Invitrogen, USA), followed by selection on solid 2YT medium (1% Yeast extract, 1.6% Tryptone, 0.2% Glucose, 0.5% NaCl and 2% Agar) supplemented with 50 mg/L of kanamycin (BioShop Inc., Canada) at 37°C. Positive clones were detected by PCR using an ORF-specific oligonucleotide and a general pDONR201 primer (Table S10). We then extracted the positive Entry Clones using minipreps for downstream application.
LRII reactions were performed by mixing 150 ng of the Entry Clone and 150 ng of expression plasmids (pDEST-DHFR F[1,2]-TRP1 or pDEST-DHFR F[3]-LEU2) according to the manufacturer’s instructions (Invitrogen, USA). The reactions were incubated overnight at 25°C and inactivated with proteinase K. We used the whole reaction to transform MC1061 competent E. coli cells, followed by selection on solid 2YT medium supplemented with 100 mg/L ampicillin (BioShop Inc., Canada) at 37°C. Positive clones were confirmed by PCR using an ORF-specific primer and a plasmid universal primer. The sequence-verified expression plasmids bearing the orthologous fusions with DHFR F[1,2] and DHFR F[3] fragments were used to transform the yeast strains YY3094 (MATa leu2-3,112 trp1-901 his3-200 ura3-52 gal4Δ gal80Δ LYS2::PGAL1-HIS3 MET2::PGAL7-lacZ cyh2R can1Δ::PCMV-rtTA-KanMX4) and YY3095 (MATα leu2-3,112 trp1-901 his3-200 ura3-52 gal4Δ gal80Δ LYS2::PGAL1-HIS3 MET2::PGAL7-lacZ cyh2R can1Δ::TADH1-PtetO2-Cre-TCYC1-KanMX4), respectively. The strains YY3094 and YY3095 were generated from BFG-Y2H toolkit strains RY1010 and RY1030 (Yachie et al., 2016), respectively, by restoring their wild type ADE2 genes. The ADE2 gene was restored by homologous recombination of the wild type sequence cassette amplified from the laboratory strain BY4741 using primers DEY014 and DEY015 (Table S10). SC -ade plates (Table S11) were used to screen successful transformants.
3.3 DHFR PCA experiments
Three DHFR PCA experiments were performed, hereafter referred to as PCA1, PCA2 and PCA3. The configuration of strains on plates and the screenings were performed using robotically manipulated pin tools (BM5-SC1, S&P Robotics Inc., Toronto, Canada (Rochette et al., 2015)). We first organized haploid strains in 384 colony arrays containing a border of control strains using a cherry-picking 96-pin tool (Figure S16). We constructed four haploid arrays corresponding to paralog 1 and 2 (P1 and P2) and mating type: MATa P1-DHFR F[1,2]; MATa P2-DHFR F[1,2] (on NAT media, Table S11); MATα P1-DHFR F[3]; MATα P2-DHFR F[3] (on HygB media, Table S11). Border control strains known to show interaction by PCA (MATa LSM8-DHFR F[1,2] and MATα CDC39-DHFR F[3]) were incorporated respectively in all MATa DHFR F[1,2] and MATα DHFR F[3] plates in the first and last columns and rows. The strains were organized as described in Figure S16. The two haploid P1 and P2 384 plates of the same mating type were condensed into a 1536 colony array using a 384-pin tool. The two 1536 arrays (one MATa DHFR F[1,2], one MATα DHFR F[3]) were crossed on YPD to systematically test P1-DHFR F[1,2] / P1-DHFR F[3], P1-DHFR F[1,2]/P2-DHFR F[3], P2-DHFR F[1,2]/P1-DHFR F[3] and P2-DHFR F[1,2]/P2-DHFR F[3] interactions in adjacent positions. We performed two rounds of diploid selection (S1 to S2) by replicating the YPD plates onto YPD containing both NAT+HygB and growing for 48 hours. The resulting 1536 diploid plates were replicated twice for 96 hours on DMSO control plates (for PCA1 and PCA2) and twice for 96 hours on the selective MTX medium (for all runs) (Table S11). Five 1536 PCA plates (PCA1-plate1, PCA1-plate2, PCA2, PCA3-plate1 and PCA3-plate2) were generated this way. We tested the interactions between 286 pairs in five to twenty replicates each (Table S1).
We also used the robotic platform to generate three bait and three prey 1536 arrays for the DHFR plasmid-based PCA, testing each pairwise interaction at least four times. We mated all MATa DHFR F[1,2] and MATα DHFR F[3] strains on YPD medium at room temperature for 24 hours. We performed two successive steps of diploid selection (SC -leu -trp -ade, Table S11) followed by two steps on MTX medium (SC -leu -trp -ade MTX, Table S11). We incubated the plates of diploid selection and the first MTX step at 30°C for 48 hours. Finally, the second MTX step was incubated and monitored for 96 hours at the same temperature.
3.4. Analysis of DHFR PCA results
3.4.1 Image analysis and colony size quantification
All images were analysed the same way, including images from (Stynen et al., 2018). Images of plates were taken with an EOS Rebel T5i camera (Canon, Tokyo, Japan) every two hours during the entire course of the PCA experiments. Incubation and imaging were performed in a spImager custom platform (S&P Robotics Inc., Toronto, Canada). We considered images after 2 days of growth for diploid selection plates and after 4 days of growth for DMSO and MTX plates. Images were analysed using gitter (R package version 1.1.1 (Wagih and Parts, 2014)) to quantify colony sizes by defining a square around the colony center and measuring the foreground pixel intensity minus the background pixel intensity.
3.4.2 Data filtering
For the images from (Stynen et al., 2018), we filtered data based on the diploid selection plates. Colonies smaller than 200 pixels were considered as missing data rather than as non-interacting strains. For PCA1, PCA2 and PCA3, colonies flagged as irregular by gitter (S flag for colony spill or edge interference, or S, C flags for low colony circularity) or that did not grow on the last diploid selection step or on DMSO medium (smaller than quantile 25 minus the interquartile range) were considered as missing data. We considered only bait-prey pairs with at least four replicates and used the median of colony sizes as PCA signal. The data was finally filtered based on the completeness of paralogous pairs so we could test HMs and HETs systematically. Median colony sizes were log2 transformed after adding a value of 1 to all data to obtain PCA scores. The results of (Stynen et al., 2018) and PCA1, PCA2 and PCA3 were strongly correlated, with an overall Pearson correlation of 0.578 (p-value < 2.2e-16) (Figure S1B).
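The replicate filter and the log2 transformation can be sketched in a few lines of Python. This is a minimal illustration under the rules stated above (at least four replicates, median colony size, log2 after adding 1). The function name is hypothetical and `None` is used here to represent colonies already flagged as missing data.

```python
import math
from statistics import median


def pca_score(colony_sizes, min_replicates=4):
    """Return log2(median colony size + 1), or None with too few replicates."""
    # Replicates flagged as missing data are encoded as None and dropped.
    valid = [size for size in colony_sizes if size is not None]
    if len(valid) < min_replicates:
        return None
    return math.log2(median(valid) + 1)


# Hypothetical colony sizes for one bait-prey pair, one replicate missing.
print(pca_score([3, 7, 7, 15, None]))
```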
3.4.3 Detection of protein-protein interactions
The distribution of PCA scores was modeled per duplication type (SSD and WGD) and per interaction tested (HM or HET) as in (Diss et al., 2017) with the normalmixEM function (default parameters) available in the R mixtools package (Benaglia et al., 2009). The background signal on MTX was used as a null distribution to which interactions were compared. Colony sizes (PCA scores, PCAs) were converted to z-scores using the mean (μb) and standard deviation (sdb) of the background distribution (Zs = (PCAs - μb)/sdb). PPIs were considered as detected if the Zs of the bait-prey pair was greater than 2.5 (Figure S17) (Chrétien et al., 2018).
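The conversion to z-scores and the detection threshold can be illustrated with a minimal Python sketch. The mean and standard deviation of the background distribution are taken as given inputs here (in the analysis they come from the mixture model), and the function names and numerical values are hypothetical.

```python
def z_score(pca_score, mu_b, sd_b):
    """Zs = (PCAs - mu_b) / sd_b, relative to the background distribution."""
    return (pca_score - mu_b) / sd_b


def is_detected(pca_score, mu_b, sd_b, threshold=2.5):
    """A PPI is called detected when its z-score is strictly above the threshold."""
    return z_score(pca_score, mu_b, sd_b) > threshold


# Hypothetical background parameters and PCA scores.
print(is_detected(12.0, mu_b=8.0, sd_b=1.0))
print(is_detected(10.5, mu_b=8.0, sd_b=1.0))
```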
We observed 38 cases in which only one of the two possible HET interactions was detected (P1-DHFR F[1,2] x P2-DHFR F[3] or P2-DHFR F[1,2] x P1-DHFR F[3]). It is typical for PCA assays to detect interactions in one orientation or the other (see (Tarassov et al., 2008)). However, this could also be caused by one of the four strains having an abnormal fusion sequence. We verified by PCR and sequenced the fusion sequences to make sure this was not the case. The correct strains were kept and the other ones were re-constructed and retested. Only 14 cases of unidirectional HET were observed in our final results. For all other 69 cases, both reciprocal interactions were detected.
3.4.4 Dataset integration
The PCA data was integrated with other data obtained from databases. The overlaps among the different datasets and the results of our PCA experiments are shown in Figure S2.
4. Gene expression in MTX condition
4.1 Cell cultures for RNAseq
We used the border control diploid strain from the DHFR PCA (MATa/α LSM8-DHFR F[1,2]/LSM8 CDC39/CDC39-DHFR F[3]) to measure expression profile in MTX condition. Three overnight pre-cultures were grown separately in 5 ml of YPD+NAT+HygB (Table S11) at 30°C with shaking at 250 rpm. A second set of pre-cultures were grown starting from a dilution at OD600 = 0.01/ml in 50 ml in the same condition to an OD600 of 0.8 to 1/ml. Final cultures were started at OD600 = 0.03/ml in 250 ml of two different synthetic media supplemented with MTX or DMSO (Table S11) at 30°C with shaking at 250 rpm. These cultures were transferred to 5 × 50 ml tubes when they reached an OD600 of 0.6 to 0.7/ml and centrifuged at 1008 RCF at 4°C for 1 min. The supernatant was discarded and cell pellets were frozen in liquid nitrogen and stored at −80°C until processing. RNA extractions and library generation and amplification were performed as described in (Eberlein et al., 2019). Briefly, the Quantseq 3’ mRNA kit (Lexogen, Vienna, Austria) was used for library preparation (Moll et al., 2014) following the manufacturer’s protocol. The PCR cycles number during library amplification was adjusted to 16. The six libraries were pooled and sequenced on a single Ion Torrent chip (ThermoFisher Scientific, Waltham, United States) for a total of 7,784,644 reads on average per library. Barcodes associated to the samples in this study are listed in Table S3.
4.2 RNAseq analysis
Read quality statistics were retrieved from the program FastQC (Andrews, 2010). Reads were cleaned using cutadapt (Martin, 2011). We removed the first 12 bp, trimmed the poly-A tail from the 3’ end, trimmed low-quality ends using a cutoff of 15 (phred quality + 33) and discarded reads shorter than 30 bp. The number of reads before and after cleaning can be found in Table S3. Raw sequences can be downloaded under the NCBI BioProject ID PRJNA494421.
Cleaned reads were aligned to the reference genome of S288c from SGD (S288C_reference_genome_R64-2-1_20150113.fsa version) using bwa (Li and Durbin, 2009). Because we used a 3’mRNA-Seq Library, reads mapped largely to 3’UTRs. We increased the window of annotated genes in the SGD annotation (saccharomyces_cerevisiae_R64-2-1_20150113.gff version) using the UTR annotation from (Nagalakshmi et al., 2008). Based on this reference gene-UTR annotation, the number of mapped reads per gene was estimated using htseq-count of the Python package HTSeq (Anders et al., 2015) and reported in Table S3.
4.3 Correlation of gene expression profiles
The correlation of expression profiles for paralogs was calculated using Pearson’s correlation from the large-scale normalized expression data from S. cerevisiae (Ihmels et al., 2004) over 1000 mRNA expression profiles from different conditions and different cell cycle phases.
5. Structural analyses
5.1. Sequence conservation in interfaces of yeast complexes
5.1.1. Identification of crystal structures
The reference proteome of Saccharomyces cerevisiae assembly R64-1-1 was downloaded on April 16th, 2018 from the Ensembl database at (http://useast.ensembl.org/info/data/ftp/index.html) (Zerbino et al., 2018). The sequences of paralogs classified as SSDs or WGDs (Byrne and Wolfe, 2005; Guan et al., 2007) were searched using BLASTP (version 2.6.0+) (Camacho et al., 2009) to all the protein chains contained in the Protein Data Bank (PDB) downloaded on September 21st, 2017 (Berman et al., 2000). Due to the high sequence identity of some paralogs (up to 95%), their structures were assigned as protein chains from the PDB that had a 100% sequence identity and an E-value lower than 0.000 001.
5.1.2. Identification of interfaces
Residue positions involved in protein interaction interfaces were defined based on the distance of residues to the other chain (Tsai et al., 1996). Contacting residues are defined as those whose two closest non-hydrogen atoms are separated by a distance smaller than the sum of their van der Waals radii plus 0.5 Å. Reference van der Waals radii were obtained with FreeSASA version 2.0.1 (Mitternacht, 2016). Nearby residues are those whose alpha carbons are located at a distance smaller than 6 Å. All distances were measured using the Biopython library (version 1.70) (Cock et al., 2009).
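The two distance criteria can be written as simple predicates. The sketch below uses plain Python tuples instead of Biopython objects so that it stays self-contained. The radii are illustrative Bondi-style values and not the FreeSASA reference radii used in the analysis, and all function names are hypothetical. Because the text above does not spell out which residue the alpha-carbon distance is measured against, the second helper simply takes two alpha-carbon coordinates.

```python
import math

# Illustrative van der Waals radii in angstroms (assumption, not FreeSASA values).
VDW_RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}


def closest_heavy_atoms(atoms_a, atoms_b):
    """Return (distance, element_a, element_b) for the closest non-hydrogen pair.

    Each atom is an (element, (x, y, z)) tuple; returns None if no pair exists.
    """
    best = None
    for el_a, xyz_a in atoms_a:
        if el_a == "H":
            continue
        for el_b, xyz_b in atoms_b:
            if el_b == "H":
                continue
            d = math.dist(xyz_a, xyz_b)
            if best is None or d < best[0]:
                best = (d, el_a, el_b)
    return best


def is_contacting(atoms_a, atoms_b, tolerance=0.5):
    """Closest heavy atoms lie within the sum of their radii plus the tolerance."""
    best = closest_heavy_atoms(atoms_a, atoms_b)
    if best is None:
        return False
    d, el_a, el_b = best
    return d < VDW_RADII[el_a] + VDW_RADII[el_b] + tolerance


def alpha_carbons_close(ca_a, ca_b, cutoff=6.0):
    """Alpha carbons are closer than the cutoff (6 angstroms by default)."""
    return math.dist(ca_a, ca_b) < cutoff


# Hypothetical residues from two chains; the hydrogen is ignored.
res_a = [("C", (0.0, 0.0, 0.0)), ("H", (3.0, 0.0, 0.0))]
res_b = [("O", (3.5, 0.0, 0.0))]
print(is_contacting(res_a, res_b))
```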
5.1.3. Sequence conservation within interfaces
The dataset of PDB files was then filtered to include only the crystallographic structures with the highest resolution available for each complex involving direct contacts between subunits of the paralogs. Full-length protein sequences from the reference proteome were then aligned to their matching chains from the PDB with MUSCLE version 3.8.31 (Edgar, 2004) to assign the structural data to the residues in the full-length chain. These full-length chains were then aligned to their paralogs and sequences from PhylomeDB phylogenies (Huerta-Cepas et al., 2008) with MUSCLE version 3.8.31. Sequence identity was calculated within interface regions, which were considered the contacting and nearby residues. PDB identifiers for structures included in this analysis are shown in Table S12. Pairs of paralogs for which the crystallized domain was only present in one of the proteins were not considered for this analysis.
5.2. Simulations of coevolution of protein complexes
5.2.1. Mutation sampling during evolution of protein interfaces
Simulations were carried out with high quality crystal structures of homodimeric proteins from PDB (Berman et al., 2000). Four of them (PDB: 1M38, 2JKY, 3D8X, 4FGW) were taken from the above dataset of structures that matched yeast paralogs and two others from the same tier of high quality structures (PDB: 1A82, 2O1V). The simulations model the duplication of the gene encoding the homodimer, giving rise to separate copies that can accumulate different mutations, leading to the formation of HMs and HETs as in Figure 1.
Mutations were introduced using a transition matrix whose substitution probabilities consider the genetic code and allow only substitutions that would require a single base change in the underlying codons (Thorvaldsen, 2016). Due to the degenerate nature of the genetic code, the model also allows synonymous mutations. Thus, the model explores the effects of mutations in both chains, as well as mutations in only one chain. The framework assumes equal mutation rates at both loci, as it proposes a mutation at each locus after every step in the simulation, with 50 replicates of 200 steps of substitution in each simulation. Restricting the mutations to the interface maintains sequence identity above 40%, which has been described previously as the threshold at which protein fold remains similar (Addou et al., 2009; Todd et al., 2001; Wilson et al., 2000).
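The single-base-change constraint can be made concrete by enumerating, for a given residue, the amino acids reachable through one nucleotide substitution in any of its codons. This sketch only lists the allowed substitutions under the standard genetic code; it does not reproduce the substitution probabilities of the transition matrix (Thorvaldsen, 2016), and excluding stop codons is an assumption made here. The name `single_step_neighbors` is hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def single_step_neighbors(residue):
    """Amino acids reachable from `residue` by one base change in a codon.

    The residue itself is included when a synonymous change exists.
    Stop codons are excluded (an assumption of this sketch).
    """
    reachable = set()
    for codon, encoded in CODON_TABLE.items():
        if encoded != residue:
            continue
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                target = CODON_TABLE[mutant]
                if target != "*":
                    reachable.add(target)
    return reachable


print(sorted(single_step_neighbors("M")))
print(sorted(single_step_neighbors("W")))
```

Methionine and tryptophan each have a single codon, so neither can undergo a synonymous change, whereas a residue such as leucine appears in its own neighbor set.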
5.2.2. Implementation of selection
Simulations were carried out using the FoldX suite version 4 (Guerois et al., 2002; Schymkowitz et al., 2005). Starting structures were repaired with the RepairPDB function, mutations were simulated with BuildModel followed by the Optimize function, and estimations of protein stability and binding energy of the complex were done with the Stability and AnalyseComplex functions, respectively. Effects of mutations on complex fitness were calculated using methods previously described (Kachroo et al., 2015). The fitness of a complex was calculated from three components based on the stability of protein chains and the binding energy of the complex using equation 1:

$$x_i^k = -\log\left[e^{\beta\left(\Delta G_i^k - \Delta G_{thresh}^k\right)} + 1\right] \tag{1}$$

where $i$ is the index of the current substitution, $k$ is the index of one of the model's three energetic parameters (stability of chain A, stability of chain B, or binding energy of the complex), $x_i^k$ is the fitness component of the $k$th parameter for the $i$th substitution, $\beta$ is a parameter that determines smoothness of the fitness curve, $\Delta G_i^k$ is the free energy value of the $k$th free energy parameter (stability of chain A, stability of chain B, or binding energy of the complex) for the $i$th substitution, and $\Delta G_{thresh}^k$ is a threshold around which the fitness component starts to decrease. The total fitness of the complex after the $i$th mutation was calculated as the sum of the three computed values for $x_i^k$, as shown in equation 2:

$$x_i = \sum_{k=1}^{3} x_i^k \tag{2}$$

The fitness values of complexes were then used to calculate the probability of fixation ($p_{fixation}$) of the substitutions using the Metropolis criterion, as in equation 3:

$$p_{fixation} = \begin{cases} 1, & \text{if } x_j > x_i \\ e^{-2N\left(x_i - x_j\right)}, & \text{if } x_j \le x_i \end{cases} \tag{3}$$

where $p_{fixation}$ is the probability of fixation, $x_i$ is the total fitness value for the complex after $i$ substitutions; $x_j$ is the total fitness value for the complex after $j$ substitutions, with $j = i + 1$; and $N$ is the population size, which influences the efficiency of selection.
Different selection scenarios were examined depending on the complexes whose binding energy and chain stabilities were under selection: neutral evolution (no selection applied on chain stability or on the binding energy of the complex), selection on one homodimer, selection on the two homodimers, and selection on the heterodimer. β was set to 10, N was set to 1000 and the threshold values were set to 99.9% of the starting values for each complex, following the parameters described by Kachroo et al. (2015). For the simulations with neutral evolution, β was set to 0.
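The selection step can be sketched as follows. This is a minimal illustration, assuming a fitness component of the form x = -log(exp(β(ΔG - ΔG_thresh)) + 1) and a Metropolis acceptance probability of exp(-2N(x_i - x_j)) for steps that do not increase fitness; the function names and the free energy values in the example are hypothetical, and FoldX itself is not called.

```python
import math


def fitness_component(dg, dg_threshold, beta=10.0):
    """x = -log(exp(beta * (dG - dG_thresh)) + 1), evaluated stably."""
    z = beta * (dg - dg_threshold)
    # log(1 + e^z) rewritten to avoid overflow for large positive z.
    return -(max(z, 0.0) + math.log1p(math.exp(-abs(z))))


def total_fitness(dgs, thresholds, beta=10.0):
    """Sum of the components (chain A, chain B, binding energy)."""
    return sum(fitness_component(dg, thr, beta)
               for dg, thr in zip(dgs, thresholds))


def fixation_probability(x_current, x_proposed, population_size=1000):
    """Metropolis criterion: accept gains, penalize losses exponentially."""
    if x_proposed > x_current:
        return 1.0
    return math.exp(-2.0 * population_size * (x_current - x_proposed))


# Hypothetical starting energies (kcal/mol): stability A, stability B, binding.
start = (-50.0, -48.0, -20.0)
thresholds = tuple(0.999 * dg for dg in start)  # 99.9% of starting values

x_start = total_fitness(start, thresholds)
x_worse = total_fitness((-50.0, -48.0, -18.0), thresholds)  # weaker binding
p_fix = fixation_probability(x_start, x_worse)
print(x_start, x_worse, p_fix)
```

In this form, setting β to 0 makes every component equal to -log 2 regardless of the free energies, so all proposed substitutions are accepted with probability 1.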
5.2.3. Analyses of simulations
The results from the simulations were then analyzed by distinguishing mutational steps with only one non-synonymous mutation (single mutants, between 29% and 34% of the steps in the simulations) from steps with two non-synonymous mutations (double mutants, between 61% and 68% of the steps). The global data was used to follow the evolution of binding energies of the complexes over time, which are shown in Figure 4. The effects of mutations in HM and HET were compared using the single mutants (Figure 5). The double mutants were used to compare the rates of mutation fixation based on their effects on the HMs (Figure S10).
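The partition of simulation steps into single and double mutants can be sketched by comparing the protein sequence of each chain before and after a step. The representation of a step as a tuple of four sequences and the helper `classify_step` are assumptions made for this illustration.

```python
from collections import Counter

LABELS = ("synonymous only", "single mutant", "double mutant")


def classify_step(before_a, after_a, before_b, after_b):
    """Label a step by how many chains changed their protein sequence."""
    changed = int(before_a != after_a) + int(before_b != after_b)
    return LABELS[changed]


# Toy trajectory: (chain A before, chain A after, chain B before, chain B after).
steps = [
    ("MKTL", "MKTL", "MKTL", "MRTL"),  # non-synonymous change in B only
    ("MKTL", "MKSL", "MRTL", "MRTV"),  # non-synonymous change in both chains
    ("MKSL", "MKSL", "MRTV", "MRTV"),  # both proposals synonymous
]
counts = Counter(classify_step(*step) for step in steps)
print(counts)
```

Single mutants isolate the effect of one substitution on the HM and the HET, which is why they are the natural subset for the comparison shown in Figure 5.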
Author contributions
CRL, AM and AFC designed this study. AM, AKD, IGA, DA, SA, CE and DEY performed the experiments. AFC performed the in silico evolution experiments and the analysis of protein structures. AM, AFC, HAJ and CRL analysed the results. CRL and NY supervised the research. AM, AFC and CRL wrote the manuscript with input from all authors.
Competing interests
The authors have no competing interests to declare.
Acknowledgements
This work was supported by Canadian Institutes of Health Research grants 299432, 324265 and 387697 to CRL. AM was supported by a FRQS postdoctoral scholarship. AFC was supported by fellowships from PROTEO, MITACS, and Université Laval, as well as joint funding from MEES and AMEXCID. SA was supported by an NSERC undergraduate scholarship. CRL holds the Canada Research Chair in Evolutionary Cells and Systems Biology. We thank SW Michnick for sharing data before publication. The authors thank Philippe Després, Rohan Dandage, Johan Hallin and Anna Fijarczyk for comments on the paper, Rong Shi for useful discussions, and Stéphane Larose for assistance on data management.