Abstract
Whole-Genome Duplications (WGDs) have shaped the gene repertoire of many eukaryotic lineages. The redundancy created by WGDs typically results in a phase of massive gene loss. However, some WGD-derived paralogs are maintained over long evolutionary periods and the relative contributions of different selective pressures to their maintenance is still debated. Previous studies have revealed a history of three successive WGDs in the lineage of the ciliate Paramecium tetraurelia and two of its sister species from the P. aurelia complex. Here, we report the genome sequence and analysis of 10 additional P. aurelia species and one additional outgroup, allowing us to track post-WGD evolution in 13 species that share a common ancestral WGD. We found similar biases in gene retention compatible with dosage constraints playing a major role opposing post-WGD gene loss across all 13 species. Interestingly we found that post-WGD gene loss was slower in Paramecium than in other species having experienced genome duplication, suggesting that the selective pressures against post-WGD gene loss are especially strong in Paramecium. We also report a lack of recent segmental duplications in Paramecium, which we interpret as additional evidence for strong selective pressures against individual genes dosage changes. Finally, we hope that this exceptional dataset of 13 species sharing an ancestral WGD and two closely related outgroup species will be a useful resource for future studies and will help establish Paramecium as a major model organism in the study of post-WGD evolution.
Introduction
Gene duplication is a common type of mutation that can occur at frequencies rivaling that of point mutations (Lynch 2007; Lipinski, et al. 2011; Schrider, et al. 2013; Reams and Roth 2015). Because duplicated genes are often redundant, mutations crippling one copy are expected to frequently drift to fixation, unaffected by selection. As a consequence, the fate of most duplicated genes is to rapidly turn into pseudogenes and eventually evolve beyond recognition. However, ancient duplicated genes are ubiquitous in the genomes of all free-living organisms sequenced to date (Zhang 2003), to the point that almost all genes in the human genome might be the result of an ancient gene duplication (Britten 2006). Therefore, selective pressures opposing the loss of duplicated genes must be commonly operating despite the initial redundancy between the two copies.
Several models have been proposed to explain the long term retention of duplicated genes. Retention can happen through change in function when one copy acquires beneficial mutations conferring a new function (neofunctionalization (Ohno 1970)) or when each copy independently loses a subset of the functions performed by the ancestral (pre-duplication) gene (subfunctionalization (Force, et al. 1999; Lynch and Force 2000)). It should also be noted that new mutations may not always be necessary to promote gene retention through change of function. Indeed, segmental duplications can lead to partial gene duplications so that the new copy already diverges from the ancestral sequence immediately following the duplication event. Even when all of the exons of a gene are duplicated, regulatory regions and chromatin context might be different between the two copies, potentially resulting in the new copy being born with a different expression profile. Finally, duplicated genes can also be retained without a change in their function. Dosage constraints can drive selection to maintain duplicated genes performing the same function when multiple copies are needed to produce the required amount of transcripts (Edger and Pires 2009; Birchler and Veitia 2012). In this case, selection is expected to act on the total (i.e. combined between all copies) amount of transcripts produced.
In its most extreme form, duplication can encompass the entire genome, creating a new copy of each gene. Such Whole-Genome Duplication (WGD) events are not uncommon, with evidence of ancient WGDs in the lineages of many eukaryotes including the budding yeast (Kellis, et al. 2004), insects (Li, et al. 2018), the African clawed frog (Session, et al. 2016), the rainbow trout (Berthelot, et al. 2014) and Paramecium (Aury, et al. 2006). It is also now widely accepted that two successive rounds of WGDs occurred in the ancestor of vertebrates (Hokamp, et al. 2003; Dehal and Boore 2005; Holland and Ocampo Daza 2018) and an additional round of genome duplication in the lineage leading to all teleost fish (Meyer and Schartl 1999; Jaillon, et al. 2004; Howe, et al. 2013; Glasauer and Neuhauss 2014). Additionally, WGDs are extremely common in land plants, to the point that all angiosperms are believed to have experienced at least one round of genome duplication in their history (De Bodt, et al. 2005; Ren, et al. 2018). Because they create the opportunity for thousands of genes to evolve new functions, WGDs have been suggested to be responsible for the evolutionary success of several lineages (De Bodt, et al. 2005; Glasauer and Neuhauss 2014). However, the precise link between WGDs and evolutionary success remains unclear (Clarke, et al. 2016; Laurent, et al. 2017).
Here, we investigate the evolutionary trajectories of duplicated genes across multiple Paramecium species that share a common ancestral WGD. The initial sequencing of the Paramecium tetraurelia genome revealed a history of three (possibly four) successive WGDs (Aury, et al. 2006). Similarly to what was observed in other lineages having experienced WGDs, the Paramecium WGD was followed by a phase of massive gene loss and only a fraction of WGD-derived paralogs (ohnologs) have been retained in two copies. Still, about 50% of ohnologs from the most recent WGD are retained in two copies in P. tetraurelia (Aury, et al. 2006), a situation very different from that in the budding yeast (about 10% retention rate (Scannell, et al. 2007)), the other widely studied unicellular eukaryote with an ancestral WGD. This situation makes Paramecium an ideal model organism for studying the earlier stages of post-WGD genome evolution. Interestingly, the most recent Paramecium WGD shortly predates the first speciation events in the formation of the P. aurelia group of 15 cryptic Paramecium species (Aury, et al. 2006; McGrath, Gout, Doak, et al. 2014; McGrath, Gout, Johri, et al. 2014). It has been suggested that reciprocal gene losses following genome duplication have fueled the speciation of P. aurelias (Aury, et al. 2006; McGrath, Gout, Johri, et al. 2014). However, it seems unlikely that this WGD resulted in massive genetic and phenotypic innovations as suggested for yeast (Huminiecki and Conant 2012), plants (Van de Peer, et al. 2009; Edger, et al. 2015) and vertebrates (Voldoire, et al. 2017; Clark and Donoghue 2018). Indeed, the 15 species in the P. aurelia complex are morphologically so similar to each other that they were thought to be one single species (Paramecium aurelia) until Tracy Sonneborn discovered the existence of mating types and realized that the species he was studying, P. aurelia, was in fact a complex of many genetically isolated species (Sonneborn 1937). This apparent lack of post-WGD phenotypic innovation is also supported by a recent study focusing on the Rab GTPase gene family which found that the recent ohnologs in this gene family did not show any sign of functional diversification (Bright, et al. 2017). These observations suggest that neofunctionalization probably did not play a major role in the retention of ohnologs following the most recent genome duplication in Paramecium. Our previous studies based on three P. aurelia genomes and one pre-WGD outgroup also pointed to an important role of dosage constraint in the retention pattern of ohnologs in Paramecium (McGrath, Gout, Doak, et al. 2014; McGrath, Gout, Johri, et al. 2014; Gout and Lynch 2015).
Here, we sought to increase our understanding of post-WGD genome evolution by sequencing the somatic (macronucleus) genomes of the remaining P. aurelia species and mapping the trajectories of all genes created by the recent WGD and the subsequent speciation events in P. aurelia. We also generated transcriptomic data for each species in order to characterize gene expression levels and better understand the role of expression level and dosage constraints in ohnolog retention. With this dataset, we aim at providing the scientific community with resources comparable to those available in the budding yeast (Byrne and Wolfe 2005) and help establishing Paramecium as another model organism for the study of post-WGD genome evolution.
Results and Discussion
Genome and transcriptome sequence of 13 P. aurelia species that share a common ancestral Whole-Genome Duplication
Previous studies have revealed a history of Whole-Genome Duplication (WGDs) in the lineage of Paramecium species belonging to the P. aurelia complex (Aury, et al. 2006; McGrath, Gout, Doak, et al. 2014; McGrath, Gout, Johri, et al. 2014), a group of species thought to have speciated shortly following the most recent Paramecium WGD. To further understand the evolutionary trajectories of WGD-derived paralogs (ohnologs) we sought to complete our previous efforts (McGrath, Gout, Doak, et al. 2014; McGrath, Gout, Johri, et al. 2014) and sequenced the remaining species from the P. aurelia group as well as an additional closely related outgroup. The complete dataset contains 13 species from the P. aurelia group and the two outgroups that diverged before the most recent WGD: P. caudatum and P. multimicronucleatum. All genomes (including previously published ones) were annotated using the EuGene pipeline (Foissac 2008; Arnaiz, et al. 2017) and evidence for a recent WGD was observed in all species of the P. aurelia group but absent from both outgroup species (methods). The fraction of ohnologous pairs that maintained both genes intact varied from 0.39 (P. tredecaurelia) to 0.58 (P. jeningsi) with a median retention rate across species of 0.52 (Table S1).
Phylogeny of the P. aurelia complex
In order to investigate the fate of duplicated genes, we mapped all orthologous and paralogous relationships in the P. aurelia complex. Because the first speciation events occurred very shortly after the genome duplication, discriminating orthologs from paralogos when analyzing the most divergent P. aurelia species is challenging. We used PoFF (Lechner, et al. 2014) to infer orthology relationships and took advantage of the low rate of large-scale genomic rearrangements in Paramecium to assign orthology by blocs of conserved syntheny (Methods). Assigning orthology relationships in blocs of genes gives us more phylogenetic signal for each orthology assignment and increases our capacity to accurately discriminate orthologs from paralogs in deep species comparisons. The final orthology assignments were used to build a reliable phylogeny of the P. aurelia group (Figure 1). All positions are strongly supported (100%) by bootsraping, with the exception of P. biaurelia (60%).
Timing of post-WGD gene losses
Having established the phylogenetic relationships between all 13 P. aurelia species sequenced, we used the patterns of gene presence/absence in the extant species to infer the timing of gene loss. Using a parsimony-based method to map the location of gene losses onto the P. aurelia phylogeny (Methods), we found that, following a short period of rapid gene loss immediately after the WGD, gene loss rate decreased and entered a phase of linear decline (Figure 2). This pattern is similar to what has been observed in yeast, teleost fish and plants (Scannell, et al. 2006; Inoue, et al. 2015; Ren, et al. 2018) although the exact shape of the survival curve is disputed (Inoue, et al. 2015). Using all of the ancestrally reconstructed duplicated gene survival rates, we found that an exponential decay model provided a better fit (R2=0.49) than a linear decay model (R2=0.44). However, after excluding the deepest points in the ancestral reconstruction, the exponential decay model did not outperform the linear model, suggesting a two-phase model for gene loss rate, similar to what has been reported in teleost fish (Inoue, et al. 2015). The observation that post-WGD gene-loss rate evolution follows similar trends across organisms as diverse as Paramecium, yeast and teleost fish suggests the possibility that the selective pressures responsible for gene retention following genome duplications are similar across the tree of eukaryotes. Although the rate at which ohnologs are lost has considerably slowed down, gene losses remain frequent in Paramecium. For example, we found 146 genes that have been lost in P. primaurelia while still being retained in two copies in the closely related sister species P. pentaurelia, highlighting the fact that gene loss is still an active ongoing process in Paramecium.
Slow gene losses in Paramecium in comparison to other species
Having determined the general pattern of post-WGD gene loss with time, we aimed at directly comparing the strength of selective pressures responsible for ohnolog retention in different lineages. Although the amount of time elapsed since the genome duplication will be the same in all extant species, the mutation rate, generation time, and strength of selection might vary between P. aurelia species, resulting in different strength of selective pressures on post-WGD ohnologs evolution across these lineages. We used the average amount of synonymous substitutions between the retained ohnologs as a proxy for the amount of mutational pressure that has been faced by ohnologs since the genome duplication. Within extant P. aurelia species, we found a strong negative correlation between the probability of ohnolog retention and the level of sequence divergence between the remaining pairs of ohnologs (r = −0.96, p<0.01; Figure 3, red dots). This correlation remained significant when accounting for the phylogenetic non-independence of the data (r = −0.75, p = 0.003). We then performed the same analysis in a number of non-Paramecium lineages having experienced a genome duplication. We found that, despite a shared general trend of decreased retention with synonymous sequence divergence, the rate of post-WGD gene loss relative to synonymous divergence was lower in Paramecium than in other species (Figure 3). In other words, the rate of gene loss per synonymous substitution was lower in Paramecium than in other phylogenetic groups having experienced a genome duplication. We interpret this observation as evidence that the strength of selection opposing gene loss is stronger in Paramecium than in plants and vertebrates.
Selective pressures opposing gene loss
In order to understand why selection against gene loss is stronger in Paramecium than in other species we must first clarify the nature of the selective pressures promoting ohnolog retention. While it is difficult to pinpoint which scenario (neo/subfunctionalization, or dosage constraint) is responsible for the retention of each ohnolog pair, some general trends can be derived from genome-wide analyses. We initially reported that the probability of retention is positively correlated with the expression level of ohnologs in Paramecium and have interpreted this observation as evidence for stronger dosage constraints in highly expressed genes (Gout, et al. 2010; Gout and Lynch 2015). A similar trend had been reported for S. cerevisiae (Seoighe and Wolfe 1999), suggesting a universal role for expression level in post-WGD gene retention. We first confirmed that the increased retention rate for highly expressed genes was a universal pattern, present in all 13 P. aurelia species (Figure S1). With 13 P. aurelia species available, we were able to compute an across-species ohnolog retention rate defined as the fraction of species within the P. aurelia complex retaining both copies of an ohnolog pair. We used the expression level of the ortholohous gene in P. caudatum as a proxy for the pre-duplication expression level. As expected, we found a positive correlation between expression level and across-species retention rate (r = 0.24, p<0.001). Although this corresponds to only 6% of the variance in P. aurelia retention rate being explained by expression level in P. caudatum, the effect is consistent across all ranges of expression levels, as can be seen when genes are binned according to their expression level (Figure 4). As a result, the most highly expressed genes are twice as likely to be retained in a P. aurelia species than the genes with the lowest expression levels (0.81 vs 0.37 across-species retention rate for the highest and lowest bin respectively, p<0.001; Welch two sample t-test).
Previous studies reported a bias in the probability of post-WGD retention for different functional categories (Seoighe and Wolfe 1999; Maere, et al. 2005; McGrath, Gout, Doak, et al. 2014; McGrath, Gout, Johri, et al. 2014; Rody, et al. 2017). We assigned Gene Ontology (GO) terms to genes in the P. aurelia complex and the two outgroup species using the panther pipeline (Mi, et al. 2013). We found that the retention biases per GO category were highly conserved among species, even between those that diverged very shortly after the WGD. When comparing the average retention rate for each GO category between the two groups of species that diverged the earliest (P. sexaurelia, P. jeningsi, P. sonneborni vs. every other species), we found a striking positive correlation (r = 0.85, p<0.01) between the two groups, indicating that the different selective pressures associated to each functional category have been preserved throughout the evolution of the P. aurelia complex. While the different average expression levels for each functional category explain a part of this pattern (for example genes annotated as “structural constituent of the ribosome – GO:0003735” tend to be highly expressed, and therefore are preferentially retained in two copies) we still find a number of functional categories with either significant excess or scarcity of post-WGD retention when expression level is taken into account (Table S2). One possible explanation for this pattern is that functional categories that are enriched for protein-coding genes encoding subunits of multimeric protein complexes (such as the ribosome for example) are preferentially retained due to increase dosage-balance constraints on these genes.
Increased predetermination of paralogs’ fate
The previous observations suggest that the fate of ohnologs is at least partially predetermined at the time of duplication by their expression level and functional category. While this allows us to predict which pairs of ohnologs are most likely to rapidly lose a copy, it does not inform us about which copy, if any, is more likely to be lost. To investigate the extent of asymetrical gene loss and its evolution with time, we estimated the fraction of parallel and reciprocal gene loss at different points of the P. aurelia phylogeny. Parallel gene losses are cases where two species independently lose the same copy in a pair of ohnologs. Reciprocal losses are defined as two species losing a different copy in a pair of ohnologs. Gene losses that happened shortly after the genome duplication are equally distributed between reciprocal and parallel losses, as expected if both copies in a pair of ohnologous genes are equally likely to be eventually lost. However, the fraction of gene losses occupied by parallel losses increases with the distance between the genome duplication and the time of speciation between the two species considered (r = 0.4, p<0.001). In other words, one of the two genes in a pair of ohnologs becomes gradually more likely to be the one that will eventually be lost. This observation suggests that ohnologs gradually accumulate mutations that set the two copies on different trajectories, one with increased vulnerability to be eventually lost.
We previously reported that drift in expression level between ohnologs can result in a pattern such that the copy with the lowest expression is more likely to be rapidly lost (Gout and Lynch 2015). With 13 species available, we confirm that this pattern is universal across the P. aurelia species. Indeed, we found that, for all 13 P. aurelia species available, changes in expression level between two ohnologs retained in a given species increase the probability of gene loss for the orthologous pair in the closest sister species (Table S3). Additionally, gene loss is biased towards the ortholog of the copy with the lowest expression level in the sister species, a bias that becomes stronger when looking at closely related species (Table S3). For example, when looking specifically at ohnologs that have been retained in P. decaurelia, we find that only 3% of the orthologous pairs in P. dodecaurelia (the most closely related species in our dataset) have lost a copy. However, among the P. decaurelia ohnologs that have divergent expression level (top 5% most divergent pairs), 22% are orthologous to a P. dodecaurelia “pair” where one copy has been lost. This significant increase in probability of gene loss (p<0.001, χ2 test) is driven preferentially by the loss of the copy orthologous to the lowly expressed copy in the species that has retained both ohnologs (82% of the cases vs. 50% expected by chance, p<0.001; 1-sample proportions test with continuity correction). Therefore, it appears that divergence in gene expression between ohnologs sets the two copies on opposite trajectories for their long term survival. However, contrary to our previous prediction (Gout and Lynch 2015), we did not find any evidence for compensatory mutations increasing expression level of the remaining copy. Therefore it is possible that decreased expression level in one copy is allowed simply by reduced dosage requirements, rather than by compensatory increased expression level in the other copy.
Interplay with segmental duplications
While Paramecium seems highly permissive to genome duplications, it has been noted that additional segmental duplications are rare (Aury, et al. 2006; McGrath, Gout, Doak, et al. 2014). We searched for evidence of recent segmental duplication in all P. aurelia genomes reported here and confirmed their extreme paucity in P. aurelia with a median of 28 recent segmental duplications per genome (Table S4 and Methods). This number is in sharp contrast to the thousands of gene losses that have happened since the most recent Paramecium WGD in each of these lineages. Despite the small number of recent segmental duplications, we were able to detect a bias for these segmental duplications to encompass genes that have already lost their ohnolog from the most recent Paramecium WGD. Genes that had reverted to single copy status since the WGD are on average twice as likely to be part of a subsequent recent segmental duplication as those that had maintained both WGD-derived duplicates (Table S4).
Although this observation may seem paradoxical, it is actually compatible with dosage constraints playing a major role in the fate of duplicated genes in Paramecium. Indeed, whole-genome duplications preserve the relative dosage between genes and might even not result in any immediate change in the amount of DNA being transcribed in Paramecium because of the separation between somatic and germline nuclei (McGrath, Gout, Doak, et al. 2014), a singularity shared by all ciliates. While loss of a copy would result in a dosage change following WGD, the opposite is true for segmental duplications, where it is the fixation of a new copy that would disturb the initial dosage. Therefore, if selection against dosage changes is particularly strong in Paramecium, as suggested by its higher rate of post-WGD gene retention compared to other species, it follows that segmental duplications should be rare. We interpret these observations as additional evidence in support of dosage sensitivity playing a major role in gene retention and duplication in Paramecium. Indeed, the genes that have allowed a copy to be lost following the recent WGD are also more permissive to following segmental duplications, suggesting that they are simply more tolerant to dosage changes.
Conclusions
With this study, we now have a view of post-duplication genome evolution in 13 Paramecium species that share a common whole-genome duplication (WGD). All species have undergo massive gene loss since the WGD, to the point that 40 to 60% of paralogs created by the WGD (ohnologs) have lost one copy. Despite this significant variation in retention rate between species, we observed a number of strikingly similar trends in gene retention and loss across all 13 P. aurelia species. Most notably, highly expressed genes are systematically over-retained in two copies. Different functional categories of genes also showed consistent patterns of over- and under-retention across the entire phylogeny of P. aurelia. The observation that both expression level and functional category influence the probability of post-WGD retention in a way that is consistent across many species indicates that the fate of ohnologs is in part predetermined, as opposed to being dependent on the random chance of acquiring mutations that would change the function of one copy or the other. While we cannot exclude the possibility that the number of mutational targets for neo- and subfunctionalization depends on the expression level and functional category, the patterns observed here are at odds with random mutations creating new functions as the main force driving post-WGD gene retention. It should also be noted that, with the exception of genes lost very early following the genome duplication, purifying selection has been operating to maintain duplicated copies for some time before allowing gene loss. We found an average dN/dS between onhologs in P. aurelia species of 0.05, indicating strong purifying selection against pseudogenization operating since the WGD. Yet, many of these ohnologs have recently been lost as evidenced by the differential gene retention between closely related species, indicating that purifying selection opposed gene loss for significant amounts of time before eventually allowing gene loss. Assuming that neo- or sub-functionalization is responsible for the initial retention of a given gene, one would have to invoke a change in the strength of selection against loss of function to explain its subsequent loss. While this is not impossible (changes in the environment can alter the strength of selection on different functions for example), a more parsimonious explanation of our observation is that dosage constraints are the major driver of post-WGD gene retention pattern. Indeed, the early loss of paralogs with the lowest levels of selection for dosage constraints is expected to impact dosage requirements for their interacting partners, paving the way for future paralog gene losses. We also note that expression level is relatively fluid, with divergence in expression levels between paralogs resulting in decreased selective pressures to maintain the copy with the lowest expression level. With all these observations pointing at dosage constraints being a major contributor of post-genome duplication evolution, we interpret our finding that post-WGD gene loss relative to sequence divergence was lower in Paramecium than in other species with well-studied WGDs as evidence that Paramecium species are especially sensitive to small-scale gene dosage changes. This conclusion is corroborated by the scarcity of segmental duplications in the genome of all the Paramecium species studied here. In all 13 species, the few segmental duplications observed are biased towards encompassing genes that have already lost their paralog from the most recent WGD, which we interpret as additional evidence for these genes being simply more tolerant to dosage changes.
Finally, we hope that this dataset will be useful to other researchers studying whole-genome duplications and will help establish Paramecium as a model species for WGDs, alongside yeast.
Materials and Methods
Genome sequencing, assembly and annotation
Paramecium cells that had recently undergo autogamy (a self-fertilization process that create 100% homozygous individuals) were grown in up to 2 liters of Wheat Grass Powder medium (Pines International) medium before being starved and harvested. Paramecium cells were separated from the remaining food bacteria by filtration on a 10 µm Nitex membrane. Macronuclei were isolated away from other cellular debris by gentle lysis of the cell membrane and sucrose density separation. DNA was extracted and purified using a CTAB protocol (Doyle 1987). DNA libraries were constructed with the Illumina Nexttera DNA library preparation kit following manufacturer’s recommendations and sequencing was performed on a HiSeq 2500 machine producing 2 × 150 nt reads. Reads were trimmed for adapter sequences and quality (3′ end trimming down to Q=20) with cutadapt version 1.15 (Martin 2011). Genome assembly was performed with SPades version 3.11 (Nurk, et al. 2013) with default parameters. Final assembly was cleaned up by removing short scaffolds (less than 1 kb) and scaffolds with strong blast hits to bacterial genomes. Genome annotation was done with the EuGene pipeline (Foissac 2008) using the RNAseq data (see below) generated for each data as described in (Arnaiz, et al. 2017). The list of Paramecium strains used in this study is as follows. P. primaurelia Ir4-2, P. biaurelia V1-4, P. tetraurelia 51, P. pentaurelia 87, P. sexaurelia AZ8-4, P. octaurelia K8, P. novaurelia TE, P. decaurelia 223, P. dodecaurelia 274, P. tredecaurelia d13-2 (derived from 209), P. quadecaurelia N1A, P. jenningsi M, P. sonneborni ATCC30995, P. multimicronucleatum MO 3c4 and P. caudatum 43.
RNAseq and expression level quantification
Paramecium cells were grown in ∼1 liter of Wheat Grass Powder medium to mid-log phase before harvesting. Cells were purified away from bacteria by filtration on a 10 µm Nitex membrane. Whole-cell RNA was isolated using TRIzol (Ambion) and the manufacturer’s suggested protocol for tissue culture cells. cDNA libraries were prepared with the Illumina TruSeq library preparation kit following the manufacturer’s suggested protocol and then sequenced with Illumina single-end 150 nt reads. RNAseq reads were mapped to each corresponding genome with Bowtie/TopHat (Langmead, et al. 2009; Kim, et al. 2013) and transcript abundance (FPKM) was computed using cufflinks (Trapnell, et al. 2010) with –multi-read-correct and –frag-bias-correct options to obtain values of FPKM for each predicted protein-coding gene. Expression level was defined for each gene as the log(FPKM+0.1), the small offset (0.1) being added to include genes with FPKM values of zero even after log-transformation.
Orthology and paralogy relationships inference
Paralogs derived from the three successive Paramecium whole-genome duplications (WGDs) were annotated using the pipeline initially described in (Aury, et al. 2006). Briefly, this pipeline blast reciprocal best hits to identify large blocks of synteny derived from the most recent WGD and identify retained and lost duplicates within these blocks. Ancestral (pre-WGD) genome reconstruction is then performed by fusion of the paralogous blocks with the following criteria: if both paralogs are still retained, one copy is randomly chosen to be incorporated in the ancestral genome, if one copy has been lost, the remaining copy is included at the ancestral locus. The process is then repeated with the ancestrally reconstructed genome for more ancient genome duplications.
Orthologs were assigned using a combination of PoFF (Lechner, et al. 2014) and in-house scripts. PoFF was used to perform an initial round of orthologs prediction across all 13 P. aurelia species. Following this first round, an ‘orthology score’ was attributed to each pair of scaffolds linked by at least one orthologous gene pair. The score was defined as the number of genes being annotated as orthologous between the two scaffolds by PoFF. Orthology relationships were then updated with the following criteria: 1-to-2 orthology relationships where the “2” corresponds to two WGD-derived paralogs were converted to 1-to-1, selecting the gene on the scaffold with the highest orthologous score as being the ortholog. Orthology relationship with P. caudatum and P. multimicronucleatum were then inferred by selecting the genes in these two species with the highest blast hit scores to the entire P. aurelia orthologs family.
Building the phylogenetic tree
Protein sequences for orthologous genes that were present in a single copy in at least half of the P. aurelia species were aligned to their corresponding orthologous sequences from P. caudatum and P. multimicronucleatum, using MUSCLE version 3.8 (Edgar 2004). Alignments were cleaned using gblocks (Castresana 2000) and a phylogenetic tree was build using the distance method implemented in Seaview (Gouy, et al. 2010).
Inferring loss of gene duplicate
Branch-specific loss of gene duplicates were inferred by parsimony using ancestral reconstruction with in-house scripts. We assumed that probability of gain of duplicates is zero. Missing data was encoded as “NA” such that: ancestor(child1=“NA” and child2=“gene duplicate present”) = “gene duplicate present”; ancestor(child1=“NA” and child2=“only one duplicate present”) = “NA”; ancestor(child1=“NA” and child2 = “NA”) = “NA”. In total, 9983 gene duplicate pairs were present in the ancestor (or root) of all P. aurelia species. Probability of survival was obtained for every node in the phylogenetic tree (based on protein sequences) as: 1.0 – (Number of duplicates present in root – Number of duplicates present at the node) / Number of duplicates present in root.
Finding segmental duplications
We started the search for recent segmental duplication in each species with a BLAST (Altschul, et al. 1990) search of a database containing all protein-coding genes against itself. After removing self-hits, we selected pairs of reciprocal best blast hits and removed the pairs that were already annotated as being WGD-derived paralogs. We then removed hits that were not inside a paralogon (a bloc of WGD-related genes with preserved synteny) to avoid the possibility of “contamination” with WGD-related paralogs that would have been missed by the initial annotation because of subsequent gene relocation. Finally, we computed the rate of synonymous substitution for each remaining pair of genes and retained only those with a synonymous substitution below 1.0. P. sonneborni was excluded from this analysis because of the presence of micronucleus-derived sequences in the genome assembly mimicking recent paralogs.
Acknowledgments
This work was supported by National Science Foundation grant EF-0328516-A006 to M.L. Additional support for genome sequencing was provided to S.D. by France Genomique.