Phylogenomics reveals an extensive history of genome duplication in diatoms (Bacillariophyta)

Matthew Parks; Teofil Nakov; Elizabeth Ruck; Norman J. Wickett; Andrew J. Alverson

doi:10.1101/181115

ABSTRACT

Premise of the study Diatoms are one of the most species-rich lineages of microbial eukaryotes. Similarities in clade age, species richness, and contributions to primary production motivate comparisons to flowering plants, whose genomes have been inordinately shaped by whole genome duplication (WGD). These events have been linked to speciation and increased rates of lineage diversification, identifying WGDs as a principal driver of angiosperm evolution. We synthesized a relatively large but scattered body of evidence that, taken together, suggests that polyploidy may be common in diatoms.

Methods We used data from gene counts, gene trees, and patterns of synonymous divergence to carry out the first large-scale phylogenomic analysis of genome-scale duplication histories for a phylogenetically diverse set of 37 diatom taxa.

Key results Several methods identified WGD events of varying age across diatoms, though determining the exact number and placement of events and, more broadly, inferences of WGD at all, were greatly impacted by gene-tree uncertainty. Gene-tree reconciliations supported allopolyploidy as the predominant mode of polyploid formation, with particularly strong evidence for ancient allopolyploid events in the thalassiosiroid and pennate diatom clades.

Conclusions Whole genome duplication appears to have been an important driver of genome evolution in diatoms. Denser taxon sampling will better pinpoint the timing of WGDs and likely reveal many more of them. We outline potential challenges in reconstructing paleopolyploid events in diatoms that, together with these results, offer a framework for understanding the evolutionary roles of genome duplication in a group that likely harbors substantial genomic diversity.

INTRODUCTION

Duplicated genes are a hallmark of eukaryotic genomes. For example, some two-thirds of the genes in Arabidopsis are present in more than one copy (Ambrosino et al., 2016), a proportion that is typical of most plant genomes (Panchy, Lehti-Shiu, and Shiu, 2016). These duplicated genes can provide raw materials for evolutionary innovation and change, thereby representing an important source of novel traits in lineages spanning the eukaryotic tree of life (Ohno, 1970). In flowering plants, for example, gene duplications have been linked to changes in a diverse set of traits, including floral pigmentation and structure, flowering time, disease and herbivore resistance, fruit characteristics, and stress response (reviewed by Panchy, Lehti-Shiu, and Shiu, 2016). Gene duplication can occur across multiple scales: from small tandem duplications affecting one or a few genes, to transposon-mediated segmental duplications affecting large stretches of a chromosome, to, most dramatically, doubling of the entire genome [whole genome duplication (WGD) or polyploidy] (Flagel and Wendel, 2009; Panchy, Lehti-Shiu, and Shiu, 2016).

The evolutionary history of angiosperms is replete with ancient polyploidy events, such that a majority of the duplicated genes in Arabidopsis can be traced to a series of at least four separate WGDs dating back to the origin of flowering plants (Bowers et al., 2003; Jiao et al., 2011). In addition to providing a source of novel and potentially adaptive traits, gene and genome duplications can also serve as mechanisms of speciation (Winge, 1917; Lynch and Force, 2000). Whole-genome duplications, in particular, frequently coincide with speciation events in flowering plants (Otto and Whitton, 2000; Wood et al., 2009; Zhan et al., 2016). An association between WGD and subsequent increases in net diversification rate is also emerging (Otto and Whitton, 2000; Soltis et al., 2009; Tank et al., 2015; but see Kellogg, 2016), implicating WGD as a potentially important driver of species diversification in angiosperms. Polyploidy events have been an important source of novelty in other species-rich lineages as well, including vertebrates (Ohno, 1970; Dehal and Boore, 2005) and fungi (Wolfe and Shields, 1997; Albertin and Marullo, 2012). With longstanding genetic model systems and a wealth of genomic data, these groups are, however, some of the most intensively studied eukaryotes. Growing genomic resources for equally diverse but historically understudied groups have made it possible to explore whether WGD has played a similarly important role in non-model lineages.

With diversity estimates in the tens to hundreds of thousands of species (Guiry, 2012; Mann and Vanormelingen, 2013), a prominent role in the global cycling of carbon and oxygen (Field et al., 1998), a critical position at the base of their native food webs, and a crown age of roughly 200 My (Nakov, Beaulieu, and Alverson, 2017), diatoms are in many respects the angiosperms of the sea. They exhibit many layers of diversity beyond their species richness, including a broad range of ecological niches, life history strategies, and, most famously, the diverse patterns and ornamentations of their silicified cell walls (Round, Crawford, and Mann, 1990). Very little is known, however, about the primary sources of genetic change underlying the origins and evolutionary shifts in these traits. Many independent lines of direct and indirect evidence collected over decades suggest that WGD may be common in diatoms. For example, although karyotypes are available for very few species, chromosome counts range from 2n = 8–130 among raphid pennate species alone (Kociolek and Stoermer, 1989). Flow cytometric measurements have shown substantial variation in genome size, with estimates spanning more than three orders of magnitude among the few dozen species that have been surveyed (Connolly et al., 2008; von Dassow et al., 2008). Within species, a recent genome doubling distinguishes natural populations of the polar centric species, Ditylum brightwellii (Koester et al., 2010), and WGDs apparently occur in strains maintained in long-term cell culture as well (von Dassow et al., 2008). Finally, and perhaps most compellingly, simultaneous fusions of three or four gametes, leading to the formation of autopolyploid auxospores (i.e., zygotes), have been directly observed in several raphid pennate diatoms, including Cocconeis (Geitler, 1927), Craticula (Mann and Stickle, 1991), Dickieia (Mann, 1994), Achnanthes (Chepurnov and Roschin, 1995), and Seminavis (Chepurnov et al., 2002). The latter set of observations, in particular, led to the prediction that polyploidy might be an important driver of speciation in diatoms (Mann, 1994, 1999a). In addition, there is some evidence for polyploidy in non-diatom stramenopiles, the higher-order lineage to which diatoms belong (Coyer et al., 2006; Ioos et al., 2006). In light of this relatively large body of evidence, the most surprising discovery might be the lack of a genomic signature for paleopolyploidy in diatoms.

We compiled new and previously sequenced genomic and transcriptomic data for 37 phylogenetically diverse diatom species to estimate, for the first time, the extent to which diatom genomes have been shaped, if at all, by WGD events. Gene counts, gene trees, and patterns of synonymous sequence divergence (Ks) between gene duplicates identified numerous putatively allopolyploid-driven WGDs across the phylogeny, potentially dating as far back as 200 Mya. We discuss possible modes of polyploid formation in diatoms and identify research directions that will shed light on the mechanisms and evolutionary consequences of WGD in diatoms.

MATERIALS AND METHODS

Taxon sampling

We sampled 37 diatom species that spanned the known breadth of extant phylogenetic diversity, the bolidophyte Triparma pacifica, and two pelagophyte outgroups (Appendix 1).

Transcriptome sequencing

We extracted total RNA from exponentially growing cultures using the Qiagen RNeasy kit. We prepared indexed sequencing libraries using the Illumina TruSeq RNA Sample Preparation Kit v2 (Appendix 1). Multiplexed libraries were sequenced with the Illumina HiSeq 2000 or HiSeq 4000 platforms (Appendix 1). Newly generated data were deposited in the Sequence Read Archive database maintained by the National Center for Biotechnology Information (NCBI) under accessions XXXXXXX–XXXXXXX (Appendix 1 [note: GenBank submissions are pending]).

Transcriptome assembly and annotation

RNA-seq reads were filtered and assembled following the basic guidelines outlined in the Oyster River Protocol (MacManes, 2015). In short, raw sequence reads were corrected with Rcorrector (Song and Florea, 2015) and quality-trimmed with Trimmomatic (ver. 0.32) (Bolger, Lohse, and Usadel, 2014). Corrected and trimmed reads were filtered for common laboratory vectors and diatom rRNA genes using bowtie2 (ver. 2.2.3) (Langmead and Salzberg, 2012). Overlapping forward and reverse pairs of filtered reads were merged with BBMerge (ver. 8.8) (Bushnell, 2014), and both merged and unmerged reads were assembled with Trinity (ver. 2.2.0) (Grabherr et al., 2011b). Assembled transcripts were translated into amino acid sequences using TransDecoder (ver. 2.0.1) (https://transdecoder.github.io/), with translation predictions enabled by BLASTP searches of the longest identified open reading frames to the Swiss-Prot database and HMMER searches (Eddy, 2011) to the Pfam database (Finn et al., 2015). Assembly quality was measured by TransRate scoring (ver. 1.01) (Smith-Unna et al., 2016) and recovery of conserved eukaryotic orthologs present in the BUSCO database (Simão et al., 2015).

Orthology/Paralogy-based transcriptome clustering

We used CD-HIT (-c 0.99 -n 5) (Fu et al., 2012) to remove redundant isoform transcripts from the full set of amino acid sequences for each species. The non-redundant transcriptome of each species was then searched against a database of all 40 (non-redundant) transcriptomes with BLASTP (ver. 2.3.0+) (e-value ≤ 10^-5 and max-target sequences = 100) (Camacho et al., 2009), and this output was used to identify putative orthologous clusters with MCL (ver. 12-135) (Van Dongen, 2001; Enright, Van Dongen, and Ouzounis, 2002; Van Dongen and Abreu-Goodger, 2012) with an e-value cutoff of 10^-30 and an inflation value of 1.4. MCL clusters with fewer than four taxa were excluded from subsequent analyses.
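The taxon-count filter applied to the MCL clusters can be sketched as follows. This is a minimal illustration rather than the authors' script: it assumes MCL's label output (one cluster per line, members separated by tabs) and a hypothetical `taxon@sequence` naming scheme, and the taxon labels in the example are invented.

```python
def taxa_in_cluster(members):
    """Return the set of taxon labels in a cluster.

    Assumes each member is named 'taxon@sequence_id' (hypothetical convention).
    """
    return {m.split("@", 1)[0] for m in members}


def filter_clusters(lines, min_taxa=4):
    """Keep clusters that contain at least `min_taxa` distinct taxa."""
    kept = []
    for line in lines:
        members = line.strip().split("\t")
        if members == [""]:
            continue  # skip blank lines
        if len(taxa_in_cluster(members)) >= min_taxa:
            kept.append(members)
    return kept


# Invented example clusters, one per line as in MCL label output.
example = [
    "Tpseu@g1\tTpseu@g2\tPtric@g7",            # 2 taxa: dropped
    "Tpseu@g3\tPtric@g9\tFcyli@g4\tTpaci@g2",  # 4 taxa: kept
]
kept = filter_clusters(example)
```

Counting distinct taxa rather than sequences means that a cluster made up of many paralogs from only two or three species is still excluded.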

Homolog and species tree reconstructions

Initial orthologous clusters were pruned and resulting ortholog trees were constructed using the ‘phylogenomic_dataset_construction’ pipeline of Yang and Smith (2014). For this pipeline, we aligned sequences with MAFFT (ver. 7.309) (Katoh and Standley, 2013) and reconstructed gene and ortholog trees with RAxML (ver. 8.2.9) (Stamatakis, 2014) using the PROTCATWAG model and 100 rapid bootstrap pseudoreplicates per alignment. As part of the pruning pipeline to create single-copy orthologous clusters for phylogenetic analyses, alignments were trimmed to include only sites with a minimal column occupancy of 0.1, terminal branches with lengths greater than two branch-length units or with lengths greater than 10 times the length of a sister branch were removed, internal branches with lengths greater than two branch-length units were removed, and sister tips belonging to the same taxon were reduced to include only the tip with the largest number of unambiguous characters in the trimmed alignment. Yang and Smith’s (2014) ‘RT’ strategy was used to create final, single-copy ortholog alignments, with the two pelagophyte samples specified as outgroup taxa and all diatom samples and Triparma pacifica specified as ingroup taxa; this allowed the final set of gene trees to be rooted with a non-diatom outgroup. Finally, we used SumTrees (Sukumaran and Holder, 2010) to collapse nodes on the final ortholog trees with less than 33% bootstrap support.
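The branch-length and support thresholds described above reduce to a few simple decision rules. The sketch below only restates those rules with hypothetical function names; it is not code from the Yang and Smith (2014) pipeline or from SumTrees, which operate on full tree objects.

```python
def is_long_tip(tip_length, sister_length, abs_cutoff=2.0, rel_cutoff=10.0):
    """Flag a terminal branch for removal.

    A tip is flagged when it is longer than `abs_cutoff` branch-length units
    or longer than `rel_cutoff` times the length of its sister branch.
    """
    if tip_length > abs_cutoff:
        return True
    return tip_length > rel_cutoff * sister_length


def should_collapse(bootstrap, min_support=33):
    """Flag an internal node for collapsing when support is below the cutoff."""
    return bootstrap < min_support
```

For instance, a tip of length 0.5 next to a sister branch of length 0.04 is flagged by the relative rule even though it is far below the absolute cutoff.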

For species tree reconstructions, ortholog alignments and trees were filtered again to include only those alignments with 100% taxon occupancy and alignment columns with less than 20% missing data or gap characters. Species trees were then reconstructed using both summary-coalescent and concatenation-based approaches. We used ASTRAL (ver. 4.10.8) for summary-coalescent species tree reconstruction, with topology and support estimated with local posterior probabilities (Sayyari and Mirarab, 2016) and multilocus bootstrapping (Seo, 2008). We refer to these as ASTRAL and ASTRAL-mlbs, respectively. For the concatenation-based analysis, models of protein evolution were first determined for each ortholog alignment using ProtTest (ver. 3.4.2) based on the AICc selection criterion (Guindon et al., 2010; Darriba et al., 2011). Alignments were concatenated with AMAS (Borowiec, 2016), and the resulting species tree was inferred using IQ-TREE with ultrafast bootstrapping and SH-aLRT testing (1000 replicates each) (Guindon et al., 2010; Minh, Nguyen, and von Haeseler, 2013; Chernomor, von Haeseler, and Minh, 2016). This tree was used in subsequent analyses as a reference species tree, since we recovered both relatively high levels of gene tree discordance and low levels of gene tree support across input gene trees (see results), under which conditions concatenation-based methods may outperform summary coalescent methods (Mirarab and Warnow, 2015), and its topology was nearly identical to recovered ASTRAL species tree topologies (see results). Gene tree support for this recovered species tree was estimated with PhyParts (analysis=fullconcon) (Smith et al., 2015) and a companion script, phypartspiecharts.py (https://github.com/mossmatters/phyloscripts/tree/master/phypartspiecharts), with gene tree concordance estimated against the IQ-TREE species tree and using a 33% bootstrap support threshold.
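The two alignment filters used for the species-tree dataset (complete taxon occupancy, and removal of columns with 20% or more missing data or gaps) can be illustrated with a small sketch. The function names and the set of characters treated as missing are assumptions made for the example, not details taken from the authors' pipeline.

```python
MISSING = set("-?")  # characters treated as gaps or missing data (assumption)


def has_full_occupancy(alignment, all_taxa):
    """True if every taxon in `all_taxa` has a sequence in the alignment."""
    return set(all_taxa) <= set(alignment)


def filter_columns(alignment, max_missing=0.20):
    """Keep only columns whose fraction of missing characters is < max_missing.

    `alignment` maps taxon name -> aligned sequence (all of equal length).
    """
    seqs = list(alignment.values())
    n = len(seqs)
    keep = [
        i for i in range(len(seqs[0]))
        if sum(s[i] in MISSING for s in seqs) / n < max_missing
    ]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


# Toy alignment with five taxa; columns 2 and 3 (1-based) exceed the cutoff.
toy = {"A": "MK-LV", "B": "MKALV", "C": "MK-LV", "D": "MKALV", "E": "M-ALV"}
filtered = filter_columns(toy)
```

In the toy alignment, the second column has exactly 20% missing data and is therefore dropped, which matches the "less than 20%" wording in the text.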

The IQ-TREE species tree was time-calibrated using TreePL (Smith and O’Meara, 2012) with 10 fossil-derived calibration points. The minimum and maximum bounds were set following Nakov et al. (2017), except the calibration for the most recent common ancestor of diatoms and Parmales was constrained to a maximum age of 250 million years before present. The optimal rate-smoothing parameter for TreePL was estimated with random-subsample-and-replace cross-validation with a range of tested values on a log scale between 10^5 and 10^-5.
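As a small illustration of the search range, candidate smoothing values on a log scale can be generated as below. The spacing of one value per order of magnitude is an assumption; the text specifies only the endpoints and the log scale.

```python
def smoothing_grid(max_exp=5, min_exp=-5):
    """Candidate smoothing values from 10^max_exp down to 10^min_exp.

    One value per order of magnitude is an assumption; the text gives only
    the range and the log scale.
    """
    return [10.0 ** e for e in range(max_exp, min_exp - 1, -1)]


grid = smoothing_grid()
```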

Overall approach to identification of paleopolyploidy events

Identifying WGD events from transcriptome data necessarily relies on temporal or phylogenetic signal, rather than spatial syntenic signal, and so may be impacted by historical variation in molecular evolutionary rates and saturation artifacts (McKain et al., 2016). Nevertheless, several complementary methods are now available that together provide increased confidence in transcriptome-based WGD inferences in the absence of synteny information. These approaches are broadly divided into three categories: (1) paralog divergence (i.e., Ks-based) methods (Lynch and Conery, 2000; Blanc and Wolfe, 2004); (2) gene-tree/species-tree reconciliation methods (Durand, Halldorsson, and Vernot, 2006; Jiao et al., 2011; Thomas, Ather, and Hahn, 2017); and (3) gene count methods (Rabier, Ta, and Ane, 2014). Each of the three approaches provides incrementally more rigorous and specific tests for WGD: (1) the Ks analyses provide semi-quantitative evidence for the presence of synchronously duplicated genes; (2) gene-tree reconciliation pipelines identify specific branches on the species tree with elevated numbers of gene duplications and losses (Durand, Halldorsson, and Vernot, 2006; Yang et al., 2015), and a related reconciliation approach allows specific tests of the mechanism of WGD events (auto- vs. allopolyploidy) (Thomas, Ather, and Hahn, 2017); and (3) a gene-count method detects and locates WGD events independently of both Ks and gene-tree information. As described in the following sections, we applied each of these methods to one or more sets of orthologous clusters and their corresponding gene trees (Fig. 1).

Figure 1.

Dataset sizes at critical stages of analysis. The area of each circle is proportional to the total transcript count at that stage of analysis. Total transcript counts represent all assembled transcripts (transcriptomes) and predicted genes (genomes) available from all taxa at a given stage of analysis.

Synonymous divergence (Ks) of paralogs

We looked for evidence of historic genome duplication events based on pairwise divergence between paralogs at synonymous sites (Ks) in both diatom and outgroup genomes (Lynch and Conery, 2000; Blanc and Wolfe, 2004). Methods for identifying secondary Ks peaks can vary considerably in several parameters (e.g., clustering criteria for paralogs and codon substitution model), and the behaviors of different Ks pipelines have not been systematically evaluated, so we used several different available pipelines and settings. We restricted Ks analyses to a set of relatively conserved genes, based on a BLASTP search (e-value ≤ 10^-10) of each transcriptome against a database of complete proteomes from 17 protist species. The first approach followed Johnson et al. (2016), with initial filtering of each gene set to remove highly similar sequences (e.g., isoforms or very recent duplicates) using CD-HIT-EST (-c 0.98 -aS 0.90). Remaining proteins were then clustered for each species with CD-HIT (-c 0.40 -aL 0.75 -n 2), aligned with MAFFT, and back-translated by forcing nucleotide sequences to protein alignments with Pal2Nal (Suyama, Torrents, and Bork, 2006), with gap regions and internal stop codons removed. For each pair of paralogous nucleotide sequences in the CD-HIT clusters, Ks was calculated using the KaKs_Calculator (Zhang et al., 2006) under both the YN (Yang and Nielsen, 1998) and GY (Goldman and Yang, 1994) codon substitution models. We also estimated Ks distributions using the FASTKs pipeline (McKain et al., 2016) with default settings. The Trinity transcriptome assembler distinguishes closely related paralogous genes from isoforms of the same gene (Grabherr et al., 2011a). As a result, transcript assemblies are hierarchically organized according to assembly read clusters, which comprise ‘genes’ and gene ‘isoforms’. In some cases, isoforms of the same Trinity ‘gene’ might represent recently diverged paralogs, and some Ks pipelines are ‘Trinity-agnostic’, instead relying on alternative filtering strategies to distinguish paralogs and isoforms (Jiao et al., 2011; Johnson et al., 2016). Due to this ambiguity, Ks distributions were determined using the FASTKs pipeline both before and after removing BLASTN self-hits at the ‘gene’ level for the Trinity assemblies (i.e., BLAST hits between two Trinity isoforms of the same Trinity gene). For both pipelines, we tested for multiple normal distributions in the Ks distributions using the R package mclust (Fraley et al., 2012), with the best fit model chosen using the Bayesian Information Criterion (BIC).
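Removal of BLASTN self-hits at the Trinity ‘gene’ level can be sketched as follows. The sketch assumes transcript identifiers of the form `TRINITY_DN<n>_c<n>_g<n>_i<n>`, where the trailing `_i<n>` marks the isoform; the function names are hypothetical and this is not the authors' filtering code.

```python
import re

# Trinity transcript IDs are assumed to end in _i<n>; everything before
# that suffix identifies the Trinity 'gene'.
_ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def trinity_gene(transcript_id):
    """Strip the isoform suffix to obtain the Trinity 'gene' identifier."""
    return _ISOFORM_SUFFIX.sub("", transcript_id)


def drop_gene_level_self_hits(hits):
    """Remove hits whose query and subject belong to the same Trinity 'gene'.

    `hits` is an iterable of (query_id, subject_id) pairs, e.g. the first two
    columns of tabular BLAST output.
    """
    return [(q, s) for q, s in hits if trinity_gene(q) != trinity_gene(s)]


hits = [
    ("TRINITY_DN10_c0_g1_i1", "TRINITY_DN10_c0_g1_i2"),  # same gene: dropped
    ("TRINITY_DN10_c0_g1_i1", "TRINITY_DN10_c0_g2_i1"),  # different gene: kept
    ("TRINITY_DN10_c0_g1_i1", "TRINITY_DN55_c1_g1_i1"),  # different cluster: kept
]
paralog_hits = drop_gene_level_self_hits(hits)
```

Running a Ks pipeline on both `hits` and `paralog_hits` corresponds to the "before and after" comparison described in the text.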

Gene-tree reconciliation

Focal points of gene duplication and loss

We applied two gene-tree reconciliation strategies to two subsets of homolog trees to identify parts of the species tree with concentrations of gene duplication (and loss) events. First, we applied the approach used by Yang et al. (2015) to a set of 3163 homolog alignments (‘3.1K dataset’) filtered to include at least 30 diatoms and one outgroup taxon (Fig. 1). This pipeline maps rooted clades of homolog (orthologs and paralogs) trees to a species tree to determine the proportion of duplicated gene families, taking into account confidence in homolog tree topologies as measured by average bootstrap support across a sampled clade. For this analysis, we used RAxML, with the PROTCATWAG model and 100 bootstrap pseudoreplicates, to reconstruct homolog trees. Average bootstrap support values were relatively low across homolog trees, so the bootstrap cutoff was set at 40% (Yang et al., 2015).

Second, we used Notung (ver. 2.9) (Durand, Halldorsson, and Vernot, 2006; Darby et al., 2017) to reconcile and root two sets of gene trees: the set of 3163 homolog trees (‘3.1K dataset’, as described above) and a broader set of 9497 homolog trees with at least 8 diatoms (‘9.5K dataset’) (Fig. 1). We ran Notung’s phylogenomic pipeline to estimate the number of gains and losses in each gene tree and total counts of duplication and loss per node for the entire set of 9497 homolog trees. For the 3.1K dataset, we also performed bootstrap-based rearrangements, which minimize the reconciliation cost by making rearrangements around poorly supported nodes. We applied three bootstrap thresholds (40%, 50%, and 70%) and repeated the Notung phylogenomic pipeline on each of the resulting sets of rearranged trees. We applied relatively low bootstrap thresholds because the overall levels of bootstrap support in gene trees were low, i.e., only about 30% of nodes across all trees had bootstrap values >50%. In addition to bootstrap-based rearrangement, we also ran Notung with gene trees that had average bootstrap support >50% (total=374).
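The two support summaries mentioned here (the fraction of nodes above a bootstrap threshold, and the subset of trees with mean support above it) are simple to compute once node supports have been extracted from the gene trees. The sketch below uses invented support values and hypothetical function names, and leaves tree parsing aside.

```python
def mean_support(supports):
    """Average bootstrap support over the internal nodes of one gene tree."""
    return sum(supports) / len(supports)


def fraction_above(all_supports, threshold=50):
    """Fraction of nodes, pooled over all trees, with support > threshold."""
    pooled = [s for tree in all_supports.values() for s in tree]
    return sum(s > threshold for s in pooled) / len(pooled)


def well_supported_trees(all_supports, threshold=50):
    """Names of trees whose mean bootstrap support exceeds the threshold."""
    return [name for name, s in all_supports.items() if mean_support(s) > threshold]


# Invented node supports for three gene trees.
supports = {
    "tree1": [90, 80, 20, 60],  # mean 62.5
    "tree2": [10, 30, 55, 25],  # mean 30.0
    "tree3": [45, 50, 40, 35],  # mean 42.5
}
```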

WGD validation at duplication focal points

To specifically test for the mechanism of WGD formation at focal nodes highlighted by the Yang and Notung pipelines, we used the software package GRAMPA (Thomas, Ather, and Hahn, 2017) to compare the reconciliation scores of multiply-labeled (MUL) trees against the singly-labeled species tree using homolog trees from the 3.1K and 9.5K datasets. Cases when the MUL tree – a topology in which a taxon or clade appears twice as the result of a duplication – had a better reconciliation score than the species tree were considered supportive of a WGD event. By default, GRAMPA performs least-common ancestor (LCA) reconciliation of all gene trees against both the species tree and all possible MUL trees, and reports the number of duplications and losses and their sum (the reconciliation score). Overly complex gene trees, which might take a prohibitively long time to reconcile, are filtered out based on a maximum allowed number of polyploid groups, which we set to 12 (GRAMPA’s group cap setting, default=8).
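The idea behind LCA reconciliation can be shown with a minimal sketch that counts duplications only, against a singly-labeled species tree. Trees are written as nested tuples with species names as leaves; this representation and the function names are assumptions made for illustration. GRAMPA itself also counts losses and extends the mapping to MUL trees, neither of which is attempted here.

```python
def leaf_set(tree):
    """Set of leaf labels under a nested-tuple tree (a leaf is a string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clades(tree):
    """All clades (as leaf sets) of a species tree, including single leaves."""
    found = [leaf_set(tree)]
    if not isinstance(tree, str):
        for child in tree:
            found.extend(clades(child))
    return found


def lca(species_clades, species):
    """Smallest species-tree clade that contains every species in `species`."""
    return min((c for c in species_clades if species <= c), key=len)


def count_duplications(gene_tree, species_clades):
    """Count gene-tree nodes that map to the same species-tree node as a child."""
    if isinstance(gene_tree, str):
        return 0
    here = lca(species_clades, leaf_set(gene_tree))
    dup = any(lca(species_clades, leaf_set(c)) == here for c in gene_tree)
    return int(dup) + sum(count_duplications(c, species_clades) for c in gene_tree)


species_tree = (("A", "B"), "C")
sp_clades = clades(species_tree)
# Gene family with two copies in each of A and B: one duplication on the A+B branch.
gene_tree = ((("A", "B"), ("A", "B")), "C")
n_dups = count_duplications(gene_tree, sp_clades)
```

A gene tree that conflicts with the species tree, such as `(("A", "C"), ("B", "C"))`, also yields an inferred duplication at the root, which is one reason poorly supported gene-tree topologies can inflate duplication counts.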

We ran GRAMPA with two basic strategies. First, we made no assumptions about the placement of polyploid lineages by excluding GRAMPA’s H1 and H2 parameters. We refer to these analyses as “unconstrained”. This approach tested all possible arrangements for the two parents of a putative allopolyploid event, including the same parent for autopolyploid events. For these analyses, a substantial fraction of gene trees were also filtered out due to the group cap setting (36% for the 3.1K dataset without rearrangements and 20% for the 9.5K dataset). To minimize this filtering and to base our inferences on the largest possible sets of trees, we also ran GRAMPA for each tested internal and terminal node separately by setting the H1 node and letting GRAMPA find the best H2 node or nodes. This reduced the number of alternative MUL topologies to those relevant for the focal node and resulted in the filtering out of many fewer trees as overly complex given our group cap setting of 12 (maximum of 15% and 16% for the un-rearranged 3.1K and 9.5K datasets, respectively). These analyses are subsequently referred to as “constrained”. We ran these analyses for all datasets, including the unrearranged 3.1K and 9.5K sets of trees, the rearranged versions of 3.1K trees at three bootstrap thresholds (40%, 50%, and 70%), and the pre-filtered set of trees with mean bootstrap >50%. Although for each dataset the “constrained” and “unconstrained” tests started with the same set of gene trees, depending on the topology of the relevant MUL trees, GRAMPA filtered out different sets of gene trees as overly complex. The reconciliation scores between the two search strategies and the scores of runs with different focal (H1) nodes are therefore based on slightly differing sets of input trees and are not comparable.

Gene count analyses

We further tested 18 inferred duplication events identified from Ks distributions and gene-tree reconciliation methods (located on 11 terminal and 7 internal branches) with gene-count data derived from both the 3.1K and 9.5K datasets (Appendix 4) using the R package WGDgc (Rabier, Ta, and Ane, 2014). Initial tests used the entire species phylogeny, and required an orthologous cluster to include Triparma pacifica and at least one ingroup species, thereby removing orthologous clusters unique to diatoms. Using this strategy, most of the putative WGD events identified through Ks analyses were not detectable, likely due to excessively stringent filtering to meet the above criterion. Similar results have been observed in other studies that use gene count data, and one common solution is to focus analyses on subtrees that maximize the amount of data relevant to testing a particular WGD hypothesis (Tiley, Ane, and Burleigh, 2016). To increase the pool of orthologous clusters for detection of WGD events, while keeping computation memory and time reasonable, we created datasets and pruned the time-calibrated chronogram accordingly to include only those taxa relevant to a specific WGD hypothesis. For example, when testing the putative Ks-inferred WGD in Gyrosigma, we pruned the species tree down to include raphid pennates only (Fig. 2). The final datasets comprised orthologous clusters represented in the outgroup and in at least one species of the ingroup. WGDgc analyses were run with the root prior set to the mean number of copies per cluster in each of the datasets and with the option “oneInBothClades”, which reflected our filtering strategy. The putative WGD events were assumed to have occurred at the midpoint of branches leading to the focal node. Hypotheses were tested using likelihood ratio tests against a null model of no WGD events (Rabier, Ta, and Ane, 2014; Tiley, Ane, and Burleigh, 2016).
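Two pieces of this procedure lend themselves to a short sketch: the cluster filter (a copy in the outgroup and in at least one ingroup species) and the likelihood ratio statistic. The code below is not WGDgc and does not reproduce its likelihood model; the function names and the `other_raphid` label are hypothetical, the log-likelihoods in the usage note are invented, and the use of a chi-square distribution with one degree of freedom as the reference distribution is an assumption, since the text does not state which reference distribution was used.

```python
import math


def one_in_both_clades(counts, outgroup, ingroup):
    """True if a cluster has a copy in the outgroup and in >= 1 ingroup taxon.

    `counts` maps taxon name -> gene copy number for one orthologous cluster.
    """
    in_out = any(counts.get(t, 0) > 0 for t in outgroup)
    in_in = any(counts.get(t, 0) > 0 for t in ingroup)
    return in_out and in_in


def lrt_statistic(lnl_null, lnl_wgd):
    """Likelihood ratio statistic: twice the gain in log-likelihood."""
    return 2.0 * (lnl_wgd - lnl_null)


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    Using this as the reference distribution is an assumption of the sketch.
    """
    return math.erfc(math.sqrt(x / 2.0)) if x > 0 else 1.0
```

With invented log-likelihoods of -100 for the null model and -95 for a one-WGD model, `lrt_statistic(-100.0, -95.0)` is 10.0, and `chi2_sf_1df(10.0)` gives a tail probability of roughly 0.0016 under the stated assumption.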

Figure 2.

Time-calibrated species tree of 37 diatoms and the outgroup Triparma (Parmales) reconstructed from a concatenated alignment of 197 single-copy genes. Nodes relevant to downstream analyses are labeled (A–F). Ks-based age distributions were calculated with CD-HIT filtering and the GY model of codon substitution.

RESULTS

Assembly results

A total of 34 diatom and one outgroup (Triparma pacifica) taxa were assembled from paired-end RNA-seq read pools ranging in size from 21.3 to 424 million reads. Trinity assemblies ranged in size from 13 578 to 61 091 genes and 16 145 to 70 488 transcripts (including isoforms). BUSCO recovery averaged 70 ± 8% for combined complete and fragmented orthologs. Gene counts for protein sets from the five genome sequences ranged from 10 402 to 27 137 genes, with a corresponding average BUSCO recovery of 83 ± 6%. Sample information and assembly details are available in Appendix 1.

Homology and orthology inference

A total of 9463 orthologous clusters containing at least four taxa were circumscribed with MCL (Fig. 1). After branch-length-based pruning, 9497 alignments and corresponding phylogenetic trees with at least eight taxa were recovered. These alignments were then filtered based on various taxon-occupancy thresholds to create data subsets for further analyses (Fig. 1).

Species tree reconstruction

A total of 197 single-copy ortholog alignments with 100% taxon occupancy were recovered (Fig. 1), representing a combined alignment length of 58 294 amino acids. Coalescent-summary (ASTRAL, ASTRAL-mlbs) and concatenation-based (IQ-TREE) inference methods recovered generally well-supported species trees with identical branching orders, with the exception of the polar centric diatom Ditylum brightwellii (Fig. 2, Appendix 2), which was also difficult to place in another phylogenomic dataset (Parks, Wickett, and Alverson, 2017). Similar to previous findings (Parks, Wickett, and Alverson, 2017), relationships among the major multi-polar centric clades were the least supported in ASTRAL and ASTRAL-mlbs analyses. Gene-tree support varied across the species tree and, as in previous phylogenomic analyses of diatoms (Parks, Wickett, and Alverson, 2017), relationships among the major multi-polar clades were the most difficult to resolve, with deep splits supported by few or no gene trees (Appendix 2).

Synonymous divergence (Ks) between paralogs

Ks-based age distributions of gene duplicates revealed secondary Ks peaks consistent with historic duplication events in most diatom species (Fig. 2), though the strength of the signal and the locations (average synonymous divergence) of secondary peaks varied by method, codon substitution model, and whether BLASTN self-hits at the Trinity ‘gene’ level were included in the analysis. The Ks distributions inferred from CD-HIT clusters (Johnson et al., 2016) using two substitution models (GY and YN) were largely overlapping, although the model used caused the size or placement of secondary peaks to shift slightly (to higher Ks values for the GY model). Secondary Ks peaks inferred from BLAST-based clusters (McKain et al., 2016) were more distinct when BLASTN self-hits at the Trinity ‘gene’ level were removed and also tended to be both smaller and centered on lower Ks values than those called from CD-HIT clustering (Appendix 3). Two pairs of taxa with strong secondary Ks peaks, Actinocyclus subtilis + Rhizosolenia setigera and Asterionellopsis glacialis + Talaroneis poseidonae, were each recovered as sister taxa on the species tree, possibly indicative of shared duplication events in those clades (Fig. 2). Secondary peaks in two other pennate diatoms, Striatella and Diatoma, suggest either a deeper, shared WGD along the pennate backbone or independent WGDs in these taxa (Fig. 2); the secondary Ks peaks in these two taxa were not, however, recovered by all of the Ks-based methods. Although less striking than those highlighted here, mclust identified numerous secondary Ks peaks in several other taxa as well (Appendix 3).

Gene-tree reconciliation (Yang and Notung pipelines)

Although gene-tree reconciliation results largely agreed across analyses and sets of gene trees, bootstrap-based filtering and gene-tree rearrangement had a substantial impact on the number of families with inferred duplications and losses (Fig. 3). The Yang and Notung gene-duplication pipelines highlighted six branches along the backbone of the species tree with high concentrations of gene duplications (Fig. 2, nodes A–F). Four of these branches retained a high percentage of gene duplications irrespective of the set of gene trees used: the 9.5K and 3.1K sets for Notung and the 3.1K set for the Yang pipeline (Fig. 3). These nodes were: (1) the MRCA of all diatoms excluding Corethron hystrix and Leptocylindrus danicus (‘branch A’), (2) the MRCA of pennate and multipolar centric diatoms (‘branch C’), (3) the MRCA of Thalassiosirales excluding Porosira pseudodenticulata (‘branch D’), and (4) the MRCA of all pennate diatoms excluding Striatella unipunctata (‘branch E’) (Fig. 3). Aside from the three deepest nodes on our species tree (MRCA of Triparma + diatoms, the MRCA of diatoms, and the MRCA of all diatoms except C. hystrix), all other nodes across the species tree featured moderate to high proportions of gene loss, including the six nodes identified with high rates of duplication (Fig. 3).

Figure 3.

Trends in gene duplication and loss across select nodes of species phylogeny. Percent gene duplication and loss at six nodes discussed in the text were reconstructed with different sets of homolog trees, different reconciliation pipelines, and at different bootstrap thresholds for rearrangement. Refer to Figure 2 for a key to the branch names.

We also used Notung to perform bootstrap-based rearrangements of gene trees in order to conservatively estimate counts of gene duplications and losses. For these rearrangements, nodes in gene trees with bootstrap support lower than a set threshold that were inconsistent with the species tree were rearranged to minimize the number of inferred duplications and losses. Analyses of rearranged trees reduced the number of families with duplications and losses across all internal nodes, including the six focal nodes identified by the Yang and Notung pipelines and shifted a small portion of the events towards the tips of the tree. The number of gene families with inferred duplications and losses also decreased, as expected, as the bootstrap threshold was increased (Notung analyses with 3.1K dataset; Fig. 3). Among the six Yang/Notung focal nodes, reconciliation of the original gene trees found duplications in some 40–70% of gene families and losses in as many as 80% of gene families (Fig. 3). With a bootstrap threshold of 40%, the maximum percent of duplicated families reduced to ∼50% and further decreased to ∼26% at a bootstrap threshold of 70% (Fig. 3). The percentage of gene families with inferred losses dropped even more precipitously between the sets of original and rearranged trees, reducing from ∼80% to 20–30% for the most loss-rich nodes when rearranging at a 40% bootstrap threshold (Fig. 3). Duplication and loss counts continued to drop as the bootstrap threshold was increased further (Fig. 3). Overall, even at the most stringent bootstrap rearrangement threshold, the top three nodes had ≥19% of gene families with duplications, suggesting that gene tree reconciliation detected some signal for synchronous duplication events at or near these nodes (branches A, C, D; Fig. 3).

Gene-tree reconciliation (GRAMPA)

These analyses were designed to test more specifically whether the high concentrations of duplicated gene families at focal branches identified by the Yang/Notung pipelines were due to WGDs. Our GRAMPA searches detected pervasive WGD signal in the absence of bootstrap-based rearrangement of the gene trees. For example, the least conservative, un-rearranged set of trees (9.5K dataset) recovered 481 MUL trees in unconstrained analyses, involving 42 different polyploid clades (H1 nodes) that were better than the species tree by at least 1000 units. For the set of un-rearranged trees from the 3.1K dataset, there were 179 MUL trees in unconstrained analyses, involving 27 different polyploid clades that scored better than the species tree by at least 1000 units. Nearly all detected signal was, however, very sensitive to gene tree support, with just two WGD events supported with a bootstrap rearrangement threshold of 40% [constrained analyses of branches A and C (Fig. 2)]. Nonetheless, resampling based on bootstrap-filtered gene trees, and clade-based resampling of both gene tree subtrees and MUL trees, revealed considerable support for WGD at several focal branches, including within the Thalassiosirales clade (Fig. 2, branch D). These results are described in detail in the following sections.
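The score-based criterion used above can be illustrated with a minimal Python sketch that keeps only those MUL trees whose reconciliation score improves on the singly labeled species tree by a chosen margin. The sketch assumes parsimony-style scores in which a lower value (fewer inferred duplications and losses) is better. The function name, clade labels, and scores are hypothetical and are not taken from our GRAMPA output.

```python
def mul_trees_beating_species_tree(scores, species_tree_score, margin=1000):
    """Return (label, improvement) pairs for MUL trees that improve on the
    species tree by at least `margin` units, best improvement first.

    Assumes parsimony-style scores where a lower score is better.
    """
    improved = [
        (label, species_tree_score - score)
        for label, score in scores.items()
        if species_tree_score - score >= margin
    ]
    return sorted(improved, key=lambda pair: pair[1], reverse=True)


# Hypothetical scores, for illustration only.
example_scores = {"H1_cladeX": 48000, "H1_cladeY": 49500, "H1_cladeZ": 51000}
selected = mul_trees_beating_species_tree(example_scores, species_tree_score=50000)
```

With these illustrative values, only the first MUL tree clears the 1000-unit margin; the second improves on the species tree by a smaller amount and the third is worse than the species tree.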

Whole-genome duplication at deep internodes

The Yang/Notung reconciliations against the singly labeled species tree inferred high concentrations of duplicated families at the MRCA of all diatoms excluding Leptocylindrus and Corethron (Figs. 2 and 3, branch A), and at the MRCA of pennate+multi-polar diatoms (Figs. 2 and 3, branch C). The former clade, all diatoms except Leptocylindrus and Corethron (Fig. 4), was not among the highest supported in GRAMPA analyses, although this scenario was still better than assuming no polyploidy. With unrearranged gene trees, GRAMPA detected the strongest signal for a putative WGD involving the latter clade (branch C, Fig. 4), and a GRAMPA search constrained to this clade alone found similar results using gene trees rearranged at a 40% bootstrap threshold.

Figure 4.

Results of GRAMPA reconciliation against multiply labeled (MUL) trees. Trees on the left represent the best MUL trees for each focal clade, with the corresponding network representation of the inferred allopolyploid events shown on the right. For the best MUL trees (left), the two sets of homeologs are indicated in blue. In each case, the species tree placement of the allopolyploid clade is shown as ‘Homeologs A’; the placement of the second subgenome of the allopolyploid is marked as ‘Homeologs B’. For inferred allopolyploid events (right), parental and allopolyploid lineages are indicated in grey dashed and blue solid lines, respectively. In all cases, the inferred parental taxa of the allopolyploid clades are either extinct or have not been sampled. Branch names correspond to labeling in Figure 2.

In unconstrained GRAMPA searches, the best-ranked MUL tree involving all diatoms except Leptocylindrus and Corethron was 3919–14 893 units worse than the overall best MUL tree (MRCA of pennate+multi-polar diatoms), and MUL trees involving this clade were never better than the species tree when gene trees were rearranged. However, GRAMPA runs constrained to MUL trees specific to the MRCA of all diatoms excluding Leptocylindrus and Corethron found support for a WGD event up to 40% bootstrap rearrangement threshold, with reconciliation scores at least 407 units better than the species tree. This outcome was likely due to more gene trees passing GRAMPA’s tree-complexity filter when the search was constrained to fewer MUL trees, in this case only those relevant for the clade of all diatoms excluding Leptocylindrus and Corethron.

We repeated the Notung and GRAMPA analyses focusing on the above clades and using only trees with average bootstrap support ≥50% (total=374) filtered from the 3.1K dataset (Fig. 3). In this case, the best MUL tree was 415 units better than the singly labeled species tree, and again identified an allopolyploidy event involving the clade of pennate + multi-polar diatoms and an extinct or unsampled second parental lineage represented with the ancestor of all diatoms except Corethron (branch C, Fig. 4). MUL trees involving the MRCA of pennate + multi-polar + coscinodiscoid diatoms, multi-polar diatoms (minus Attheya), and the MRCA of all diatoms except Leptocylindrus and Corethron also scored high. Overall, results based on this set of well-supported gene trees agreed with the results from our analyses of the entire sets of un-rearranged gene trees, suggesting that rearrangements, even at low bootstrap thresholds, might be overly conservative. The Yang et al. (2015) pipeline, which extracts and calculates duplication counts on clades with average bootstrap beyond a certain threshold, gave similar results.

Allopolyploidy within Thalassiosirales

In unconstrained searches, GRAMPA found that MUL trees defined by the branch representing Thalassiosirales – Porosira pseudodenticulata (Figs. 2 and 4, branch D) were never better than the species tree. This applied irrespective of the set of trees used and whether or not the trees were rearranged. This result was also confirmed by an analysis of only the Thalassiosirales subtree with Ditylum brightwellii as the outgroup. Closer examination of other MUL trees relevant to Thalassiosirales revealed several smaller clades nested within the Thalassiosirales – Porosira clade (Fig. 5, branch D) with strong support for a WGD (branch D/D′, Fig. 4). The strongest support was observed for a clade comprised of Thalassiosira oceanica, Skeletonema marinoi and Discostella pseudostelligera (branch D′, Figs. 4 and 5), with their MRCA as one parent and the MRCA of all Thalassiosirales excluding Porosira as the second parent of an allopolyploid event. These results were robust to the different sets of trees used, analyses performed at different phylogenetic scales (all diatoms or Thalassiosirales only), and to bootstrap-based rearrangement up to a threshold of 50%. To further evaluate these results, we extracted the Thalassiosirales + Ditylum clade from all trees in the 3.1K dataset and filtered the resulting subtrees to: (1) include a minimum of three species, and (2) have an average bootstrap support ≥70. A total of 240 gene trees met these criteria. Repeating the reconciliation analyses with Notung and GRAMPA on this conservative set of well-supported trees returned identical results, supporting the clade comprised of T. oceanica, S. marinoi and D. pseudostelligera as allopolyploid (branch D/D′, Fig. 4). Such outcomes, i.e., a concentration of gene duplications in a branch older than the MRCA of the polyploid clade, have been interpreted as strong support for ancient hybridization and allopolyploidy in yeast (Marcet-Houben and Gabaldon, 2015; Thomas, Ather, and Hahn, 2017). 
The discrepancy between the relative age of the reconciliation-inferred peak of duplications (Fig. 5, branch D) and the polyploid clade (Fig. 5, branch D′) is therefore likely due to the earlier divergence of the hybridization-derived homeologs present in the genome of the polyploid lineage, which trace back to the earlier branch D, compared to the age of the polyploid species lineage itself.

Figure 5.

Summary of WGD across diatoms. Green: WGDs supported by Ks-based age distributions of duplicated genes (KS). Pink: WGDs detected with gene count data (GC). Blue: WGDs inferred by reconciliation of gene trees against the singly labeled species tree (RC). Red: WGDs inferred by reconciliation of gene trees against multiply labeled trees (GR). Branches discussed in the text are labeled (A–F). Within Thalassiosirales, D′ denotes a GRAMPA-inferred allopolyploid clade (GR) that did not coincide with the duplication peak inferred from the Notung and Yang analyses (D). Within pennate diatoms, GRAMPA-inferred events are added to both branches E and F to reflect uncertainty in the placement of the WGD.

Allopolyploidy within the pennate clade

The unconstrained GRAMPA searches found strong support for polyploidy of either the MRCA of pennate diatoms excluding Striatella (Fig. 2, branch E) or the clade of all pennate diatoms excluding Striatella, Asterionellopsis, and Talaroneis (Fig. 2, branch F). To investigate potential pennate-specific WGD events, we ran GRAMPA on the subtree containing only pennate diatoms, with Attheya as the outgroup. The best reconstructions identified the clade of pennates excluding Striatella, Asterionellopsis, and Talaroneis as the most likely polyploid clade (Figs. 2 and 4, branch F), with the second parental lineage being an extinct or unsampled member of the clade subtended by the MRCA of all pennate diatoms excluding Striatella. Events specific to: (1) all pennate diatoms except Striatella, (2) the raphid pennate clade, and (3) a smaller clade comprised of Diatoma, Fragilaria, and Thalassiothrix had lower, but comparable scores, all of which were better than the species tree.

These results were not, however, robust to bootstrap rearrangements, with support for allopolyploidy disappearing after gene-tree rearrangement with a 40% bootstrap cutoff. Searches constrained to MUL trees defined by the MRCA of all pennates except Striatella (branch E, Figs. 2 and 5), or the subsequent clade [all pennates excluding Striatella, Asterionellopsis, and Talaroneis (branch F, Figs. 2 and 5)] showed similar results, with only the former clade robust to bootstrap rearrangement. However, an analysis of pennate diatom subtrees with a mean bootstrap support ≥70% (84 in total) again showed support for a polyploid clade composed of all pennate diatoms except Striatella, Asterionellopsis, and Talaroneis. Further, this analysis also supported a smaller clade within raphid pennates composed of Sellaphora and Craticula as polyploid. The best MUL tree in this case placed the second parental lineage of this inferred allopolyploid event at the MRCA of Asterionellopsis and Talaroneis, suggesting that the actual second parent might be an extinct or unsampled lineage from the clade circumscribed by the MRCA of Asterionellopsis + Talaroneis and all of its descendants (branch F, Fig. 4). Overall, although bootstrap-based rearrangement erased most of the signal for polyploidy within the pennate clade, analyses of a small set of strongly supported trees largely agreed with the inferences made from the entire set of un-rearranged phylogenies.

Gene count analyses

Analyses based on gene counts were designed to test 18 putative WGD events whose placements were based either on analyses of paralog divergence at synonymous sites (10 terminal and 2 internal branches; Fig. 2) or gene-tree reconciliation (6 internal branches) (Figs. 4 and 5). Each putative event was tested independently through comparison to a null, non-WGD model. We performed the analyses using gene counts based on both the 3.1K and 9.5K datasets and, with three exceptions as described below, recovered the same set of results for both analyses. We detected WGDs in eight out of 18 tested branches, with relatively low rates of homolog retention following the duplication event, whereby the retention rate (q) is defined as the probability of retaining the WGD-derived copy of a gene (Rabier, Ta, and Ane, 2014). Retention of two WGD-derived homologs following a duplication event was generally <2%, though four WGDs had retention rates between 3% and nearly 15%. All eight tests that returned support for WGD via retention rates > 0 were significantly better than their corresponding no-WGD null models (likelihood ratio tests, df = 1, χ² P-value ≤ 0.001 for all tests; Appendix 4).
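For readers less familiar with the likelihood ratio test used here, the following minimal Python sketch shows how a P-value is obtained from the log-likelihoods of a no-WGD null model and a WGD model that differ by one parameter (df = 1). It uses only the standard library, relying on the identity that the upper tail of a χ² distribution with one degree of freedom equals erfc(sqrt(x/2)). The function name and the example log-likelihoods are hypothetical; they are not values from Appendix 4, and the sketch is not the WGDgc implementation.

```python
import math


def lrt_pvalue_df1(loglik_null, loglik_alt):
    """Likelihood ratio test for nested models differing by one parameter.

    Returns (statistic, p_value), where the statistic is
    2 * (loglik_alt - loglik_null) and the p-value is the upper tail of a
    chi-square distribution with one degree of freedom.
    """
    statistic = 2.0 * (loglik_alt - loglik_null)
    # Guard against tiny negative values caused by optimizer noise.
    statistic = max(statistic, 0.0)
    # For df = 1, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical log-likelihoods, for illustration only.
stat, p = lrt_pvalue_df1(loglik_null=-1010.0, loglik_alt=-1000.0)
```

In this illustrative case the statistic is 20, which corresponds to a P-value well below the 0.001 level reported for our tests.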

Within pennate diatoms, gene-count analyses detected the Ks-inferred WGDs in Gyrosigma (q = 14.8%), Asterionellopsis (q = 7.0%), and Talaroneis (q = 1.4%) (Fig. 5). There was also signal for WGD along the branch leading to the MRCA of Asterionellopsis and Talaroneis (q = 3.6%), suggesting that the Ks peaks observed in these two taxa might represent a shared WGD (Figs. 2 and 4). Finally, in agreement with gene-tree reconciliation results, we also detected signal for WGD along the branch leading to the MRCA of all pennate diatoms excluding Striatella (q = 1.1%) (Fig. 5). The latter event was not detected with gene counts derived using the 3.1K dataset, which instead detected signal for WGD in Attheya alone (q = 0.3%). WGD along the branch leading to Attheya was not observed in the analysis of counts calculated from the 9.5K dataset.

Several nodes across the centric and multi-polar centric diatoms were also tested using WGDgc. We tested two distinct hypotheses within Thalassiosirales: the Ks-inferred WGD in Thalassiosira oceanica and the GRAMPA-inferred WGD at the MRCA of Thalassiosira oceanica, Skeletonema marinoi, and Discostella pseudostelligera. We detected signal for an event within the T. oceanica lineage (q = 1.2%) but found no evidence for the older WGD (Fig. 5). Despite distinct Ks peaks, we did not detect WGD events in Actinocyclus, Rhizosolenia, or their MRCA, nor did we find support for the WGD events implied by secondary Ks peaks in Corethron and Leptocylindrus. Finally, we tested for two events supported by both reconciliation and GRAMPA results, at the MRCA of pennate + multi-polar diatoms (Figs. 2 and 4, branch C) and the MRCA of all diatoms excluding Corethron and Leptocylindrus (Figs. 2 and 4, branch A). The gene-count analysis detected WGDs on both branches, including a WGD with retention rate q = 1.7% along the branch leading to pennate + multi-polar diatoms using both datasets and an event on the branch leading to the MRCA of all diatoms excluding Corethron and Leptocylindrus using the 9.5K dataset (q = 6.4%) (Fig. 5).

DISCUSSION

Substantial variation in genome size and chromosome number, a high rate of genome size evolution, and direct observations of polyploidization in cell cultures together suggest that diatom genomes might have undergone past WGD events (Mann, 1994, 1999a; Oliver et al., 2007). Our survey of 37 diatom genomes and transcriptomes provided strong support for this hypothesis, identifying as many as 16 separate historic WGDs across diatoms, seven of which were supported by multiple lines of evidence. Our exemplar-based taxon sampling precluded precise pinpointing of the timing of these events, with four strongly supported events assigned to terminal branches that represent ca. 60–100 million years of evolutionary history. Nevertheless, despite our sampling and general challenges of working with a group of non-model organisms, our analyses point to a relatively extensive history of WGD in diatoms.

Mechanisms of polyploid formation in diatoms

Although auto- and allopolyploids are equally abundant in angiosperms (Barker et al., 2016), the mechanisms underlying polyploid formation are much more poorly known in diatoms. Our results suggest that allopolyploidy may be especially important in diatoms, though the patterns and rates of hybridization are very poorly known. High sequence divergence in homologous chromosome assemblies from a raphid pennate diatom, Fistulifera solaris, points to an allodiploid origin of that species (Tanaka et al., 2015). Given the time- and labor-intensive nature of experimental reproductive studies of diatoms (see Chepurnov et al., 2004; Mann et al., 2004; Chepurnov et al., 2008; Chepurnov et al., 2012), evidence supporting hybridization and introgression in diatoms is likely to come from genomic data (Mallet, 2005), emphasizing the need for more intensive studies focused on taxon-rich clades at lower phylogenetic scales. Candidates for such studies include Ditylum brightwellii (Koester et al., 2010), Sellaphora (Mann et al., 2004; Evans et al., 2008), Seminavis (Moeys et al., 2016), Pseudo-nitzschia (Casteleyn et al., 2009; Basu et al., 2017), and Cocconeis (Geitler, 1927; Geitler, 1973). Given the evidence for relatively frequent ancient hybridization uncovered by our analyses, including in the pennate diatoms, it will be important to determine the specificity of sex pheromone systems used by pennate diatoms for mate attraction (Sato et al., 2011; Gillard et al., 2013; Moeys et al., 2016). Finally, although our analyses highlighted allopolyploidy as a potentially important mode of WGD in diatoms, it is important to note that autopolyploidy may be shown to be equally, if not more, common with increased sampling. Indeed, autopolyploid formation has been directly observed in vitro for several different species of raphid pennate diatoms (Geitler, 1927; Mann and Stickle, 1991; Mann, 1994; Chepurnov and Roschin, 1995; Chepurnov et al., 2002).

A number of observed meiotic anomalies suggest that diatom polyploids could form in a variety of ways. First, meiotic nonreduction, which is thought to be the predominant mode of polyploid formation in plants (Thompson and Lumaret, 1992; Ramsey and Schemske, 1998), likely occurs in diatoms as well. Although the rate of meiotic nonreduction in diatoms is unknown, Mann (1994) observed that failed cleavage in gametangia of the raphid pennate diatom Dickieia ulvacea led to the formation of ‘double gametes’ that produced a dikaryotic, triploid-like zygote following fusion with a reduced gamete. Second, although polyspermy is thought to occur relatively rarely in plants (Ramsey and Schemske, 1998), the production of triploid and tetraploid zygotes from simultaneous gamete fusions has been observed in culture studies of several raphid pennate diatoms (Geitler, 1927; Mann and Stickle, 1991; Mann, 1994; Chepurnov and Roschin, 1995; Chepurnov et al., 2002), suggesting that this may be a principal pathway to polyploidization in diatoms. These studies have found mixed populations of co-occurring haploid, triploid, and tetraploid zygotes following one or two rounds of crossing in culture, suggesting that two-step, ‘triploid-bridge’ routes to stable polyploidy may be more common in diatoms than other groups (Ramsey and Schemske, 1998). These hypotheses further underscore the value of the experimental reproductive studies in diatoms that initially led to these discoveries. Extending these studies to include longer-term tracking of in vitro polyploids will help clarify the long-term viability and reproductive dynamics of vegetative haploids, triploids, and tetraploids, thereby distinguishing culturing anomalies from observations that hint at the natural frequencies and mechanisms of polyploid formation in diatoms.

Combined genomic evidence for whole-genome duplication in diatoms

We applied three different strategies to a large transcriptomic dataset to investigate support for historical duplication events across the diatom lineage: (1) traditional Ks-based age distributions of duplicated genes; (2) tree-based reconciliation methods to identify nodes on the species tree with concentrations of gene duplications or to construct specific tests for allopolyploidy; and (3) gene-count methods that provide conservative, sequence- and gene-tree-agnostic inferences of WGD. Although each of these approaches suffers some drawbacks, we considered a putative WGD as strongly supported when two or more analyses with disparate approaches were in agreement (Fig. 5).
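The agreement criterion can be expressed as a short Python sketch. It groups the method codes used in Figure 5 (KS, GC, RC, GR) into the three strategies listed above and labels a branch as strongly supported when at least two distinct strategies agree. Treating the two reconciliation analyses (RC and GR) as a single strategy is an assumption made for this illustration, as are the function name and the example branches.

```python
# Assumed grouping of Fig. 5 method codes into the three strategies.
STRATEGY_OF_METHOD = {
    "KS": "synonymous divergence",
    "GC": "gene count",
    "RC": "reconciliation",
    "GR": "reconciliation",
}


def classify_support(evidence_by_branch, min_strategies=2):
    """Label each branch 'strong' if at least `min_strategies` distinct
    strategies support a WGD on it, otherwise 'tentative'."""
    labels = {}
    for branch, methods in evidence_by_branch.items():
        strategies = {STRATEGY_OF_METHOD[m] for m in methods}
        labels[branch] = "strong" if len(strategies) >= min_strategies else "tentative"
    return labels


# Hypothetical branches, for illustration only.
support = classify_support({
    "branch_1": ["RC", "GR", "GC"],
    "branch_2": ["KS"],
    "branch_3": ["RC", "GR"],
})
```

Under this grouping, a branch supported only by the two reconciliation analyses counts as a single line of evidence.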

Although Ks-based age distributions are useful for initial exploration of duplication signal within a genome, several clear challenges with these types of approaches became evident in diatom analyses. First, there is no consensus strategy for discerning duplication peaks from Ks distributions. In some cases, peaks are identified essentially ‘by eye’ (Blanc and Wolfe, 2004; Fawcett, Maere, and Van de Peer, 2009; Tang et al., 2010; Cannon et al., 2015), which can easily turn into an exercise in ‘the reading of tea leaves’. Although several statistical approaches have been adopted to identify discrete shifts or peaks in Ks distributions (Schlueter et al., 2004; Cui et al., 2006; Vanneste et al., 2015), secondary peaks may not always correspond to large-scale duplication events (Johnson et al., 2016) and there are no concise methods to distinguish between peaks representing WGD, large-segmental duplications, and small-segmental duplications based solely on Ks distributions. The identification of ancient duplication events is also challenged by saturation at synonymous sites (Vanneste, Van de Peer, and Maere, 2013), and this problem should be more pronounced in lineages with higher substitution rates. As unicells with short generation times, diatoms have relatively high rates of nucleotide substitution compared to multicellular lineages (Bowler et al., 2008). As a result, Ks-based age distributions are more likely to saturate sooner, erasing the signature of ancient WGDs (Vanneste, Van de Peer, and Maere, 2013). On average, 45% of the paralog pairs in a given species had Ks values that were out of range (>2) for drawing Ks-based inferences of WGD. Further analyses may show that these trace back to some of the deeper duplication events identified by gene-tree and gene-count analyses (Fig. 5).
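The proportion of paralog pairs that fall outside the usable Ks range can be computed as in the following minimal Python sketch. The cutoff of Ks > 2 follows the text above; the function name and the example Ks values are hypothetical and do not come from our data.

```python
def fraction_saturated(ks_values, ks_max=2.0):
    """Return the fraction of paralog pairs whose Ks exceeds `ks_max`."""
    if not ks_values:
        raise ValueError("ks_values must not be empty")
    return sum(1 for ks in ks_values if ks > ks_max) / len(ks_values)


# Hypothetical Ks values for one species, for illustration only.
example_ks = [0.05, 0.2, 0.4, 0.9, 1.6, 2.3, 3.1, 4.8]
saturated = fraction_saturated(example_ks)
```

Pairs above the cutoff are excluded from peak detection, so a large saturated fraction means that a correspondingly large share of duplicates cannot inform Ks-based inference.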

In contrast to Ks plots, gene-tree reconciliation methods allow more rigorous statistical determination and increased confidence in phylogenetic mapping of large-scale duplication events (Durand, Halldorsson, and Vernot, 2006; Jiao et al., 2011), even allowing for specific tests of auto- vs. allopolyploidy (Thomas, Ather, and Hahn, 2017). The power of these approaches is limited, however, by the quality of the gene trees. Our gene trees were poorly supported in most cases. For the bootstrapped gene trees of our 3.1K dataset, bootstrap values across all nodes of these trees were relatively low, with a median bootstrap support of 29 for all gene tree nodes combined; 68% and 80% of the nodes across gene trees had bootstrap support lower than 50% and 70%, respectively. Bootstrap-based rearrangement of gene trees to minimize the numbers of inferred duplications and losses is a common strategy for guarding against false inferences of WGD from poorly supported gene trees (Durand, Halldorsson, and Vernot, 2006; Inoue et al., 2015; Thomas, Ather, and Hahn, 2017). All of our gene trees, and a majority of nodes within our trees, had the potential to be rearranged. Deciding on a bootstrap threshold on which to base our inferences, therefore, depended on our confidence in the correct reconstruction of the gene trees. The inference of WGD events should ideally be based on strongly supported nodes that, when reconciled against the species tree, identify duplication events (Hahn, 2007). However, the amount of data necessary to obtain strong support for nodes depends on tree shape and the distribution of internal branch lengths (Alfaro, Zoller, and Lutzoni, 2003; Hahn, 2007; Philippe et al., 2011). Short internal branches potentially require substantial amounts of data to obtain strong bootstrap support, but the amount of data (i.e., alignment length) available is clearly limited for individual gene trees.
Simulation studies have shown even correct nodes can receive low bootstrap values under a variety of conditions (Alfaro, Zoller, and Lutzoni, 2003). These considerations highlight the difficulties in determining an empirical cutoff for what should be considered an accurate bipartition and, by extension, a bootstrap threshold for gene-tree rearrangement for reconciliation analyses.
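Summary statistics of the kind reported above (median node support and the proportion of nodes below a given bootstrap threshold) can be computed as in this minimal Python sketch. The function name and the example support values are hypothetical; in practice the input would be the pooled bootstrap values from all nodes of all gene trees.

```python
from statistics import median


def summarize_bootstrap(supports, thresholds=(50, 70)):
    """Return the median support and, for each threshold, the fraction of
    nodes with support strictly below it (i.e., eligible for rearrangement)."""
    if not supports:
        raise ValueError("supports must not be empty")
    below = {
        t: sum(1 for s in supports if s < t) / len(supports)
        for t in thresholds
    }
    return median(supports), below


# Hypothetical pooled node supports, for illustration only.
example_supports = [10, 20, 29, 45, 60, 75, 90, 100]
med, frac_below = summarize_bootstrap(example_supports)
```

The fraction below each threshold indicates how much of each gene tree could be altered if rearrangement were applied at that cutoff.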

Finally, empirical and simulation studies have shown that gene-count matrices also provide robust identification of WGD events (Rabier, Ta, and Ane, 2014; Tiley, Ane, and Burleigh, 2016), although these methods are conservative and may fail to identify multiple events in the same lineage and duplication events followed by high rates of gene loss (Hahn, 2007; Tiley, Ane, and Burleigh, 2016). We did not recover strong signal in any of our analyses for multiple duplication events in a single lineage; however, our analyses suggest low rates of duplicate retention are common to diatoms, as the majority of our tested branches featured very low gene retention rates (q < 0.02). Similarly, our Notung analyses also identified relatively high rates of gene loss in nodes across the species tree (Fig. 3). This likely impedes the effectiveness of WGDgc in identifying WGD events in our dataset, and suggests that high rates of molecular and genome evolution in diatoms may rapidly mask signal from historic duplications and lead to underestimation of the number of duplication events. On the other hand, fast rates of gene loss following duplication, coupled with WGDgc’s ignorance of gene tree topology and potential asymmetric gene loss, increase the confidence in WGDs inferred by gene count data. Our exemplar approach may also bias WGD events estimated from gene count data toward older events, as duplicated genes are more likely to be lost along the longer terminal branches of our species phylogeny. In this regard, denser taxon sampling may reveal gene-count support for putative terminal duplication events supported by Ks analyses.

Ancient paleopolyploidy in diatoms

We identified two nested WGDs occurring roughly 200 Mya (Fig. 5, branch A) and 170 Mya (Fig. 5, branch C) that support polyploid ancestry for the vast majority of diatom diversity. Importantly, these were among the most robust WGDs in our analyses, being detected by gene tree reconciliation against the species tree, reconciliation against multiply-labeled trees, and by gene count analysis (Fig. 5). Although both events enjoy strong support from multiple analyses, limitations of our exemplar taxon sampling and the lack of genome size and karyotype data for most diatoms impede a complete understanding of the precise origins of these events. For example, the inferred deep WGDs at branches A and C were supported by both reconciliation (Fig. 3) and gene count data (Appendix 4), and reconciliation against multiply-labeled trees clearly supported an allopolyploid mode of origin for both events. In both cases, the second parental lineage of the allopolyploid event was an extinct or unsampled lineage vaguely identified as the MRCA of all diatoms or all diatoms excluding Corethron. Although it is possible (or perhaps likely) that these ancestors are extinct, the fact that only two branches separate the older of these events and the diatom stem lineage leaves open the possibility that our sampling is too coarse for a precise determination of the lineages involved in this allopolyploid event.

Gene-tree reconciliations also showed some evidence for a polyploidy event in the MRCA of all diatoms, but several limitations of our dataset caution against over-interpreting this finding. With just a few dozen species (Ichinomiya et al., 2016), the sister lineage to diatoms (Bolidophyceae: Parmales) is a phylogenetic depauperon (Donoghue and Sanderson, 2015), resulting in long stem branches separating them from diatoms. The large number of morphological and life history differences distinguishing these two clades makes it difficult to polarize the large number of changes that have accumulated since they split roughly 200–250 Mya (Nakov, Beaulieu, and Alverson, 2017) – a limitation that extends to the gene family data used in our analyses. This problem is further exacerbated by the available Parmales data, which are currently limited to a single transcriptome from the flagellated, unsilicified stage of the life cycle. We know, for example, that the dataset is missing genes involved in silicification (Kessenich et al., 2014). A more complete representation of the Parmales genome will help show whether, like angiosperms, all diatoms have shared polyploid ancestry.

Historic allopolyploidy in Thalassiosirales

Thalassiosirales is among the most common and abundant diatom lineages in the plankton of both marine and freshwaters. It is also a long-established, genome-enabled model system for studies of diatom physiology, morphology, and ecology (Guillard and Ryther, 1962; Armbrust et al., 2004; Poulsen and Kroger, 2004; Alverson, Jansen, and Theriot, 2007). The discovery of ancient hybridization and allopolyploidy in this group further establishes them as an excellent system for understanding these and other evolutionary processes in diatoms.

The signal for polyploidy within Thalassiosirales was among the strongest recovered in our analyses, being detectable even after applying a relatively stringent (given our gene trees) bootstrap rearrangement threshold of 50%. Gene tree reconciliation supported an allopolyploid event involving the clade comprised of T. oceanica, S. marinoi and D. pseudostelligera (Fig. 5, branch D′). Uncertainty in the species tree, however, makes it difficult to accurately circumscribe this event. Although most nodes within the Thalassiosirales species tree were well supported, gene tree discordance was especially high for splits within the putatively polyploid subtree [(S. marinoi, (T. oceanica, D. pseudostelligera))] (Appendix 2). Interestingly, the two nodes immediately predating this clade (the MRCA of Thalassiosirales minus Porosira and the MRCA of Thalassiosirales) had many more concordant gene trees, suggesting that the high levels of discordance in (S. marinoi, (T. oceanica, D. pseudostelligera)) may reflect, at least in part, conflict resulting from past hybridization (Appendix 2). Densely sampled phylogenies of Thalassiosirales inferred from ribosomal RNA and chloroplast genes produce an alternative topology [((T. oceanica, S. marinoi), D. pseudostelligera)], with T. oceanica sister to Skeletonema (Alverson, Jansen, and Theriot, 2007). If this relationship is correct, the strong secondary Ks peak in T. oceanica and the heavily tailed Ks distribution in S. marinoi (Fig. 2) raise the possibility the polyploid lineage is comprised of these two lineages alone, with the inclusion of Discostella representing an artifact of sparse taxon sampling and uncertainty in the species tree.

Finally, secondary peaks are evident in the age distributions of both T. oceanica and S. marinoi, and although less pronounced, age distributions of Cyclotella nana and Conticribra weissflogii also have heavy right tails (Fig. 2, Appendix 3). Although WGD inferences from Ks distributions alone, especially in lineages with high substitution rates, can be problematic (Vanneste, Van de Peer, and Maere, 2013 and see above), these results may nevertheless point to additional paleopolyploid events in the evolutionary history of Thalassiosirales, a hypothesis that has some support from T. oceanica data. Here, gene tree reconciliation supported a shared, deep WGD event, gene count analysis detected a T. oceanica-specific WGD, and its Ks distribution was one of two with readily observable secondary peaks (Fig. 2; Appendix 3). We cannot, therefore, rule out the possibility that the T. oceanica genome carries signal from two temporally distinct WGD events – its genome representing the product of as many as four paleopolyploidy events (Fig. 5).

Historic allopolyploidy in pennate diatoms

The transition from radial to axial cell wall symmetry and from oogamous to isogamous sexual reproduction are landmark events in diatom evolution (Round, Crawford, and Mann, 1990), circumscribing a clade whose species diversity vastly outnumbers the remaining diatoms (Guiry and Guiry, 2017) and, as a result, motivating great interest in identifying the underlying drivers of this disparity. The clade of actively motile, raphe-bearing species nested within pennate diatoms (Fig. 2) diversified at a faster rate compared to the grade of clades of non-motile diatoms with axial and radial symmetry (Nakov, Beaulieu, and Alverson, 2017), suggesting that the evolution of active motility allowed these species to better utilize complex benthic habitats, leading to finer niche partitioning and faster rates of evolution and diversification (Nakov, Beaulieu, and Alverson, 2017). However, other factors have almost certainly influenced the diversification of pennate diatoms as a whole, and the results presented here point to a potential role for whole genome duplication.

Taking our combined results from gene tree reconciliation, gene count, and Ks analyses, we found evidence for as many as six independent WGD events within pennate diatoms (Fig. 5). Three of these events were supported by at least two of the three strategies, whereas the others were more tenuous and observed only in the Ks-based age distributions (Fig. 5). The best-supported events included: (1) a deep split within the pennates circumscribing nearly the entirety of the clade (Fig. 5, branch E or F); (2) a deep split within the deepest clade of araphid pennates (Fig. 5, the MRCA of Asterionellopsis and Talaroneis); and (3) a terminal branch representing the highly diverse naviculoid diatoms with a stem age of > 100 Mya (Fig. 5, Gyrosigma). The placements of these events suggest that the majority of pennate diatoms share an ancient WGD, followed by multiple rounds of additional, nested polyploidizations that have affected several subclades of pennate diatoms (Fig. 5). Note also that these pennate-specific events might have occurred in addition to at least two earlier WGDs (Fig. 3, branches A and C), analogous to the complex polyploid ancestry of numerous angiosperm lineages (Bowers et al., 2003; Jiao et al., 2011).

As with WGDs in other parts of the phylogeny, a degree of uncertainty exists with regard to the deep, nearly pennate-wide WGD (Fig. 5, branch E or F). Specifically, reconciliation against the species tree and gene count analyses indicated that the most likely placement of this event was at the branch representing the MRCA of all pennate diatoms excluding Striatella (Fig. 5, branch E). Reconciliation against MUL trees was more equivocal, however, supporting either this branch or the next branch up the backbone as the likely ancestor that experienced the duplication (Fig. 5, branch F). As before, we were unable to determine whether this uncertainty is a byproduct of our exemplar sampling, i.e., the lineages relevant for pinpointing the placement of this event might be missing from our dataset. Alternatively, uncertainty in the species tree might be carried over into the placement of this WGD. More densely sampled phylogenies based on conventional phylogenetic markers place Striatella (along with Asterionellopsis and Talaroneis) within one of the clades in the araphid grade (Theriot et al., 2015; Nakov, Beaulieu, and Alverson, 2017). Phylogenomic analyses with fewer species but more markers place Striatella as sister to all other pennate diatoms (Fig. 2 and Appendix 2; see also Parks, Wickett, and Alverson, 2017). These competing hypotheses have clear implications for our ability to infer the location and timing of this WGD and further highlight this part of the tree as a primary target for additional genomic sampling.

Similar uncertainty, linked again to the granularity of our sampling, is evident in the (Asterionellopsis + Talaroneis) clade. Ks-based age distributions highlighted strong secondary peaks in these two taxa (Fig. 2, Appendix 3), possibly indicative of a shared WGD in the MRCA of this clade. We found evidence to support this hypothesis as well as signal for WGD on each of the terminal branches leading to Asterionellopsis and Talaroneis (Fig. 5, Appendix 4). Under a parsimonious interpretation, all of this signal traces back to the shared event. Given the stem ages of these lineages (∼80 Mya; Figs 2 and 4), each of which is currently represented by just a single taxon, we nevertheless cannot rule out the presence of multiple independent WGDs in this group.

Finally, with respect to hybridization and polyploidy, raphid pennate diatoms have received far more attention than any other group of diatoms (see references in the Introduction and ‘Mechanisms of polyploid formation in diatoms’ sections). There is direct evidence for autopolyploid formation in vitro (e.g., Mann, 1994; Chepurnov and Roschin, 1995) and strong genetic evidence for natural hybrids in the few species that have been examined (Casteleyn et al., 2009; Tanaka et al., 2015). Their unique suite of traits, species richness, history of accelerated diversification, the availability of established and emerging genetic model species, and extensive research on their reproductive biology establish raphid pennates as the premier lineage for uncovering the mechanisms and evolutionary consequences of polyploidy in diatoms.

Conclusions

The phylogenomic results presented here provide strong support for a history of paleopolyploidy in diatoms that, with increased taxonomic sampling, will likely prove to be more extensive than what was uncovered with our exemplar sampling. Although WGD may be common in diatoms, its importance – if any – as a driver of speciation, lineage diversification, trait and life history evolution, or habitat shifts remains unknown. Establishing these associations, and further establishing causal links between WGD and its possible evolutionary consequences, is notoriously challenging, even with the benefit of datasets much larger than those available for diatoms (Kellogg, 2016; Panchy, Lehti-Shiu, and Shiu, 2016). As with all species-rich and ecologically diverse groups, however, establishing these links represents perhaps the single greatest challenge in evolutionary research on diatoms.

Extending our sampling to more fully capture the broad ecological diversity of diatoms across environmental gradients in, for example, temperature, pH, and salinity, will help establish the evolutionary significance – if any – of WGD in diatoms. A larger comparative framework and a more precise reconstruction of the pattern and timing of paleopolyploid events – coupled with laboratory experiments – will show whether physiological shifts, either in short-term stress responses or in major habitat transitions, have been facilitated by genetic novelties introduced by gene or genome duplication. Compared to their diploid progenitors, for example, polyploid Arabidopsis have increased tolerance to salinity (Chao et al., 2013) – one of the principal ecological divides in diatoms and other microbial eukaryotes (Round and Sims, 1980; Mann, 1999b; Logares et al., 2009).

These results highlight numerous gaps in our understanding of diatom genomes. For a group of this size and diversity, for example, very few karyotypes and genome size estimates are available. The few data available, however, point to a level of genomic diversity and complexity that is proportional to their many other, much better known, layers of morphological and ecological diversity (Kociolek and Stoermer, 1989; Connolly et al., 2008). In light of this small but compelling collection of research, the discoveries presented herein were predictable. As genomic data for diatoms continue to accumulate, a coordinated effort to establish a reference genome dataset that captures their broad phylogenetic and ecological diversity, similar to the current call for angiosperms (Galbraith et al., 2011), is necessary to fully understand the evolution of genome size and ploidy in diatoms. Although genome size data are few, cell size data – which are available for every described diatom species – could help guide these efforts (Connolly et al., 2008). Finally, although diatoms are generally assumed to be diploid, very little is known about natural variation in ploidy levels. Few species have been surveyed for genome size data, but the intraspecific ploidy variation in two studied species, Cocconeis placentula (Geitler, 1973) and Ditylum brightwellii (Koester et al., 2010), suggests that polyploidy may play a consequential role in speciation and diversification of diatoms.

DATA ACCESSIBILITY

RNA-seq data for XX have been deposited in the National Center for Biotechnology Information’s XX database under accessions XXX–XXX (note: GenBank submissions are in progress).

ACKNOWLEDGEMENTS

The authors thank David Chafin, Jeff Pummill, and Pawel Wolinski for providing computational support through the Arkansas High Performance Computing Center (AHPCC), and the Chicago Botanic Garden for hosting and support of the Treubia and Fabronia computing clusters. This work was supported by the National Science Foundation (NSF) (Grant No. DEB-1353131 to AJA and DEB-1353152 to NJW), a Simons Foundation Early Career Investigator in Marine Microbial Ecology and Evolution award to AJA, and multiple awards from the Arkansas Biosciences Institute to AJA. This research used computational resources available through the AHPCC, which were funded through multiple NSF grants and/or the Arkansas Economic Development Commission, and resources available at the Chicago Botanic Garden, which were funded by NSF (DEB-1239992 and DEB-1342873 to NJW).

APPENDICES

Appendix 1

Sampling and assembly information for taxa used in this study.

Appendix 2

Diatom species tree reconstruction and gene tree con-/discordance analysis. The IQ-TREE species tree reconstruction (left - phylogram, right - cladogram) of 37 diatom taxa is shown, with pie charts on cladogram indicating proportions of gene trees in congruence (blue), most common incongruence (green), all other incongruence (red) and uninformative (grey; i.e., ≤33% bootstrap support) to the shown species tree topology amongst 197 gene trees with 100% taxon occupancy and no more than 20% gaps in any alignment column. The numbers on cladogram indicate gene tree counts supporting (above) or conflicting with (below) species tree topology at a node. Support values for nodes with less than full support in IQ-TREE SH-aLRT/IQ-TREE ultrafast bootstrapping/ASTRAL/ ASTRAL-MLBS are indicated on the phylogram; * indicates a node not recovered in ASTRAL or ASTRAL-MLBS analysis.

Appendix 3

Ks density distributions for sampled diatom taxa and Triparma pacifica for all Ks pipeline variations.

Appendix 4

Results from the gene count tests of WGD across the diatom phylogeny. We used WGDgc as described in the text to estimate the birth, death and retention rate of gene duplicates assuming WGD events occurred halfway along the branch leading to the taxa specified under Hypothesis. Shown are the log likelihoods of the WGD and no WGD (null) models, the likelihood ratio (LR), degrees of freedom (df), and the probability of observing the likelihood ratio assuming a Chi-square distribution (Pchi2) with 1 degree of freedom (presence vs. absence of WGD). The taxa forming the tree used to test each putative WGD event are also shown. Species abbreviations represent the first two letters of genera and specific epithet (see also Appendix 1). q - retention rate, ns - not significant.
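
As a worked illustration of the test summarized in this appendix, the following minimal sketch computes a likelihood ratio statistic and its Chi-square probability with 1 degree of freedom. It assumes the conventional definition LR = 2(lnL_WGD - lnL_null), which the caption does not state explicitly. The function name and the log likelihood values are hypothetical and are not taken from the WGDgc output.

```python
import math


def lrt_pvalue_1df(loglik_wgd: float, loglik_null: float) -> tuple[float, float]:
    """Return (LR, p) for a likelihood ratio test with 1 degree of freedom.

    Assumption: LR = 2 * (lnL_WGD - lnL_null), the conventional definition.
    """
    lr = 2.0 * (loglik_wgd - loglik_null)
    # Guard against tiny negative values caused by optimizer noise.
    lr = max(lr, 0.0)
    # For a Chi-square variable with 1 df, P(X > x) = erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(lr / 2.0))
    return lr, p


# Hypothetical log likelihoods for illustration only (not values from this study).
lr, p = lrt_pvalue_1df(loglik_wgd=-10234.1, loglik_null=-10237.9)
print(f"LR = {lr:.2f}, Pchi2 = {p:.4f}")
```

The closed-form expression with the complementary error function applies only to the 1-df case used here. A comparison with more degrees of freedom would need a general Chi-square survival function.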

LITERATURE CITED

Albertin, W., and P. Marullo. 2012. Polyploidy in fungi: Evolution after whole-genome duplication. Proceedings of the Royal Society B: Biological Sciences 279: 2497–2509.
Alfaro, M. E., S. Zoller, and F. Lutzoni. 2003. Bayes or bootstrap? A simulation study comparing the performance of Bayesian Markov chain Monte Carlo sampling and bootstrapping in assessing phylogenetic confidence. Molecular Biology and Evolution 20: 255–266.
Alverson, A. J., R. K. Jansen, and E. C. Theriot. 2007. Bridging the Rubicon: Phylogenetic analysis reveals repeated colonizations of marine and fresh waters by thalassiosiroid diatoms. Molecular Phylogenetics and Evolution 45: 193–210.
Ambrosino, L., H. Bostan, P. D. Salle, M. Sangiovanni, A. Vigilante, and M. L. Chiusano. 2016. pATsi: Paralogs and singleton genes from Arabidopsis thaliana. Evolutionary Bioinformatics 12: 1–7.
Armbrust, E. V., J. A. Berges, C. Bowler, B. R. Green, D. Martinez, N. H. Putnam, S. Zhou, et al. 2004. The genome of the diatom Thalassiosira pseudonana: Ecology, evolution, and metabolism. Science 306: 79–86.
Barker, M. S., N. Arrigo, A. E. Baniaga, Z. Li, and D. A. Levin. 2016. On the relative abundance of autopolyploids and allopolyploids. New Phytologist 210: 391–398.
Basu, S., S. Patil, D. Mapleson, M. T. Russo, L. Vitale, C. Fevola, F. Maumus, et al. 2017. Finding a partner in the ocean: Molecular and evolutionary bases of the response to sexual cues in a planktonic diatom. New Phytologist 215: 140–156.
Blanc, G., and K. H. Wolfe. 2004. Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. Plant Cell 16: 1667–1678.
Bolger, A. M., M. Lohse, and B. Usadel. 2014. Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics doi:10.1093/bioinformatics/btu170.
Borowiec, M. L. 2016. AMAS: A fast tool for alignment manipulation and computing of summary statistics. PeerJ 4: e1660.
Bowers, J. E., B. A. Chapman, J. K. Rong, and A. H. Paterson. 2003. Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422: 433–438.
Bowler, C., A. E. Allen, J. H. Badger, J. Grimwood, K. Jabbari, A. Kuo, U. Maheswari, et al. 2008. The Phaeodactylum genome reveals the evolutionary history of diatom genomes. Nature 456: 239–244.
Camacho, C., G. Coulouris, V. Avagyan, N. Ma, J. Papadopoulos, K. Bealer, and T. L. Madden. 2009. BLAST+: Architecture and applications. BMC Bioinformatics 10: 421.
Cannon, S. B., M. R. McKain, A. Harkess, M. N. Nelson, S. Dash, M. K. Deyholos, Y. Peng, et al. 2015. Multiple polyploidy events in the early radiation of nodulating and nonnodulating legumes. Molecular Biology and Evolution 32: 193–210.
Casteleyn, G., N. G. Adams, P. Vanormelingen, A. E. Debeer, K. Sabbe, and W. Vyverman. 2009. Natural hybrids in the marine diatom Pseudo-nitzschia pungens (Bacillariophyceae): Genetic and morphological evidence. Protist 160: 343–354.
Chao, D. Y., B. Dilkes, H. B. Luo, A. Douglas, E. Yakubova, B. Lahner, and D. E. Salt. 2013. Polyploids exhibit higher potassium uptake and salinity tolerance in Arabidopsis. Science 341: 658–659.
Chepurnov, V., P. Chaerle, K. Vanhoutte, and D. Mann. 2012. How to breed diatoms: Examination of two species with contrasting reproductive biology. In R. Gordon and J. Seckbach [eds.], The Science of Algal Fuels, vol. 25, Cellular Origin, Life in Extreme Habitats and Astrobiology, 323–340. Springer Netherlands.
Chepurnov, V. A., and A. M. Roschin. 1995. Inbreeding influence on sexual reproduction of Achnanthes longipes Ag. (Bacillariophyta). Diatom Research 10: 21–29.
Chepurnov, V. A., D. G. Mann, K. Sabbe, and W. Vyverman. 2004. Experimental studies on sexual reproduction in diatoms. International Review of Cytology 237: 91–154.
Chepurnov, V. A., D. G. Mann, W. Vyverman, K. Sabbe, and D. B. Danielidis. 2002. Sexual reproduction, mating system, and protoplast dynamics of Seminavis (Bacillariophyceae). Journal of Phycology 38: 1004–1019.
Chepurnov, V. A., D. G. Mann, P. Von Dassow, P. Vanormelingen, J. Gillard, D. Inze, K. Sabbe, and W. Vyverman. 2008. In search of new tractable diatoms for experimental biology. BioEssays 30: 692–702.
Chernomor, O., A. Von Haeseler, and B. Q. Minh. 2016. Terrace aware data structure for phylogenomic inference from supermatrices. Systematic Biology 65: 997–1008.
Connolly, J. A., M. J. Oliver, J. M. Beaulieu, C. A. Knight, L. Tomanek, and M. A. Moline. 2008. Correlated evolution of genome size and cell volume in diatoms (Bacillariophyceae). Journal of Phycology 44: 124–131.
Coyer, J. A., G. Hoarau, G. A. Pearson, E. A. Serrão, W. T. Stam, and J. L. Olsen. 2006. Convergent adaptation to a marginal habitat by homoploid hybrids and polyploid ecads in the seaweed genus Fucus. Biology Letters 2: 405.
Cui, L., P. K. Wall, J. H. Leebens-Mack, B. G. Lindsay, D. E. Soltis, J. J. Doyle, P. S. Soltis, et al. 2006. Widespread genome duplications throughout the history of flowering plants. Genome Research 16: 738–749.
Darby, C. A., M. Stolzer, P. J. Ropp, D. Barker, and D. Durand. 2017. Xenolog classification. Bioinformatics 33: 640–649.
Darriba, D., G. L. Taboada, R. Doallo, and D. Posada. 2011. ProtTest 3: Fast selection of best-fit models of protein evolution. Bioinformatics 27: 1164–1165.
Dehal, P., and J. L. Boore. 2005. Two rounds of whole genome duplication in the ancestral vertebrate. PLOS Biology 3: 1700–1708.
Donoghue, M. J., and M. J. Sanderson. 2015. Confluence, synnovation, and depauperons in plant diversification. New Phytologist 207: 260–274.
Durand, D., B. V. Halldorsson, and B. Vernot. 2006. A hybrid micro-macroevolutionary approach to gene tree reconstruction. Journal of Computational Biology 13: 320–335.
Eddy, S. R. 2011. Accelerated profile HMM searches. PLOS Computational Biology 7: e1002195.
Enright, A. J., S. Van Dongen, and C. A. Ouzounis. 2002. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Research 30: 1575–1584.
Evans, K. M., A. H. Wortley, G. E. Simpson, V. A. Chepurnov, and D. G. Mann. 2008. A molecular systematic approach to explore diversity within the Sellaphora pupula species complex (Bacillariophyta). Journal of Phycology 44: 215–231.
Fawcett, J. A., S. Maere, and Y. Van de Peer. 2009. Plants with double genomes might have had a better chance to survive the Cretaceous–Tertiary extinction event. Proceedings of the National Academy of Sciences 106: 5737–5742.
Field, C. B., M. J. Behrenfeld, J. T. Randerson, and P. Falkowski. 1998. Primary production of the biosphere: Integrating terrestrial and oceanic components. Science 281: 237–240.
Finn, R. D., P. Coggill, R. Y. Eberhardt, S. R. Eddy, J. Mistry, A. L. Mitchell, S. C. Potter, et al. 2015. The Pfam protein families database: Towards a more sustainable future. Nucleic Acids Research 44: D279–D285.
Flagel, L. E., and J. F. Wendel. 2009. Gene duplication and evolutionary novelty in plants. New Phytologist 183: 557–564.
Fraley, C., A. E. Raftery, T. B. Murphy, and L. Scrucca. 2012. mclust version 4 for R: Normal mixture modeling for model-based clustering, classification, and density estimation. Department of Statistics, University of Washington, Seattle.
Fu, L., B. Niu, Z. Zhu, S. Wu, and W. Li. 2012. CD-HIT: Accelerated for clustering the next-generation sequencing data. Bioinformatics 28: 3150–3152.
Galbraith, D. W., J. L. Bennetzen, E. A. Kellogg, J. C. Pires, and P. S. Soltis. 2011. The genomes of all angiosperms: A call for a coordinated global census. Journal of Botany 2011: 1–10.
Geitler, L. 1927. Somatische teilung, reduktionsteilung, copulation und parthenogenese bei Cocconeis placentula. Archiv Fur Protistenkunde 59: 506–549.
Geitler, L. 1973. Auxosporenbildung und systematik bei pennaten diatomeen und die cytologie von Cocconeis-Sippen. Österreichische botanische Zeitschrift 122: 299–321.
Gillard, J., J. Frenkel, V. Devos, K. Sabbe, C. Paul, M. Rempt, D. Inze, et al. 2013. Metabolomics enables the structure elucidation of a diatom sex pheromone. Angewandte Chemie-International Edition 52: 854–857.
Goldman, N., and Z. H. Yang. 1994. Codon-based model of nucleotide substitution for protein-coding DNA sequences. Molecular Biology and Evolution 11: 725–736.
Grabherr, M. G., B. J. Haas, M. Yassour, J. Z. Levin, D. A. Thompson, I. Amit, X. Adiconis, et al. 2011a. Trinity: Reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature Biotechnology 29: 644–652.
Grabherr, M. G., 2011b. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology 29: 644–652.
Guillard, R. R. L., and J. H. Ryther. 1962. Studies of marine planktonic diatoms. I. Cyclotella nana Hustedt and Detonula confervacea (Cleve) Gran. Canadian Journal of Microbiology 8: 229–239.
Guindon, S., J.-F. Dufayard, V. Lefort, M. Anisimova, W. Hordijk, and O. Gascuel. 2010. New algorithms and methods to estimate maximum-likelihood phylogenies: Assessing the performance of PhyML 3.0. Systematic Biology 59: 307–321.
Guiry, M. D. 2012. How many species of algae are there? Journal of Phycology 48: 1057–1063.
Guiry, M. D., and G. M. Guiry. 2017. AlgaeBase Website http://www.algaebase.org/ 2017].
Hahn, M. W. 2007. Bias in phylogenetic tree reconciliation methods: Implications for vertebrate genome evolution. Genome Biology
Ichinomiya, M., A. L. dos Santos, P. Gourvil, S. Yoshikawa, M. Kamiya, K. Ohki, S. Audic, et al. 2016. Diversity and oceanic distribution of the Parmales (Bolidophyceae), a picoplanktonic group closely related to diatoms. ISME J doi:10.1038/ismej.2016.38.
Inoue, J., Y. Sato, R. Sinclair, K. Tsukamoto, and M. Nishida. 2015. Rapid genome reshaping by multiple-gene loss after whole-genome duplication in teleost fish suggested by mathematical modeling. Proceedings of the National Academy of Sciences of the United States of America 112: 14918–14923.
Ioos, R., A. Andrieux, B. Marcais, and P. Frey. 2006. Genetic characterization of the natural hybrid species Phytophthora alni as inferred from nuclear and mitochondrial DNA analyses. Fungal Genetics and Biology 43: 511–529.
Jiao, Y., N. J. Wickett, S. Ayyampalayam, A. S. Chanderbali, L. Landherr, P. E. Ralph, L. P. Tomsho, et al. 2011. Ancestral polyploidy in seed plants and angiosperms. Nature 473: 97–100.
Johnson, M. G., C. Malley, B. Goffinet, A. J. Shaw, and N. J. Wickett. 2016. A phylotranscriptomic analysis of gene family expansion and evolution in the largest order of pleurocarpous mosses (Hypnales, Bryophyta). Molecular Phylogenetics and Evolution 98: 29–40.
Katoh, K., and D. M. Standley. 2013. MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Molecular Biology and Evolution 30: 772–780.
Kellogg, E. A. 2016. Has the connection between polyploidy and diversification actually been tested? Current Opinion in Plant Biology 30: 25–32.
Kessenich, C. R., E. C. Ruck, A. M. Schurko, N. J. Wickett, and A. J. Alverson. 2014. Transcriptomic insights into the life history of Bolidophytes, the sister lineage to Diatoms. Journal of Phycology 50: 977–983.
Kociolek, J. P., and E. F. Stoermer. 1989. Chromosome numbers in diatoms: A review. Diatom Research 4: 47–54.
Koester, J. A., J. E. Swalwell, P. Von Dassow, and E. V. Armbrust. 2010. Genome size differentiates co-occurring populations of the planktonic diatom Ditylum brightwellii (Bacillariophyta). BMC Evolutionary Biology 10.
Langmead, B., and S. L. Salzberg. 2012. Fast gapped-read alignment with Bowtie 2. Nature Methods 9: 357–359.
Logares, R., J. Brate, S. Bertilsson, J. L. Clasen, K. Shalchian-Tabrizi, and K. Rengefors. 2009. Infrequent marine-freshwater transitions in the microbial world. Trends in Microbiology 17: 414–422.
Lynch, M., and J. S. Conery. 2000. The evolutionary fate and consequences of duplicate genes. Science 290: 1151–1155.
Lynch, M., and A. G. Force. 2000. The origin of interspecific genomic incompatibility via gene duplication. American Naturalist 156: 590–605.
MacManes, M. D. 2015. An opinionated guide to the proper care and feeding of your transcriptome. bioRxiv doi:10.1101/035642.
Mallet, J. 2005. Hybridization as an invasion of the genome. Trends in Ecology & Evolution 20: 229–237.
Mann, D. G. 1994. Auxospore formation, reproductive plasticity and cell structure in Navicula ulvacea and the resurrection of the genus Dickieia (Bacillariophyta). European Journal of Phycology 29: 141–157.
Mann, D. G. 1999a. The species concept in diatoms. Phycologia 38: 437–495.
Mann, D. G. 1999b. Crossing the Rubicon: The effectiveness of the marine/freshwater interface as a barrier to the migration of diatom germplasm. In S. Mayama, M. Idei, and I. Koizumi [eds.], Proceedings of the 14th International Diatom Symposium, 1–21. Koeltz Scientific Books, Koenigstein.
Mann, D. G., and A. J. Stickle. 1991. The genus Craticula. Diatom Research 6: 79–107.
Mann, D. G., and P. Vanormelingen. 2013. An inordinate fondness? The number, distributions, and origins of diatom species. Journal of Eukaryotic Microbiology 60: 414–420.
Mann, D. G., S. M. McDonald, M. M. Bayer, S. J. M. Droop, V. A. Chepurnov, R. E. Loke, A. Ciobanu, and J. M. H. Du Buf. 2004. The Sellaphora pupula species complex (Bacillariophyceae): morphometric analysis, ultrastructure and mating data provide evidence for five new species. Phycologia 43: 459–482.
Marcet-Houben, M., and T. Gabaldon. 2015. Beyond the whole-genome duplication: Phylogenetic evidence for an ancient interspecies hybridization in the baker’s yeast lineage. PLOS Biology
McKain, M. R., H. Tang, J. R. McNeal, S. Ayyampalayam, J. I. Davis, C. W. dePamphilis, T. J. Givnish, et al. 2016. A phylogenomic assessment of ancient polyploidy and genome evolution across the Poales. Genome Biology and Evolution 8: 1150–1164.
Minh, B. Q., M. A. Nguyen, and A. Von Haeseler. 2013. Ultrafast approximation for phylogenetic bootstrap. Molecular Biology and Evolution 30: 1188–1195.
Mirarab, S., and T. Warnow. 2015. ASTRAL-II: Coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics 31: i44–i52.
Moeys, S., J. Frenkel, C. Lembke, J. T. F. Gillard, V. Devos, K. Van Den Berge, B. Bouillon, et al. 2016. A sex-inducing pheromone triggers cell cycle arrest and mate attraction in the diatom Seminavis robusta. Scientific Reports
Nakov, T., J. M. Beaulieu, and A. J. Alverson. 2017. A time-calibrated phylogeny of diatoms (Bacillariophyta). under review.
Ohno, S. 1970. Evolution by gene duplication. Springer-Verlag, Berlin, New York.
Oliver, M. J., D. Petrov, D. Ackerly, P. Falkowski, and O. M. Schofield. 2007. The mode and tempo of genome size evolution in eukaryotes. Genome Research 17: 594–601.
Otto, S. P., and J. Whitton. 2000. Polyploid incidence and evolution. Annual Review of Genetics 34: 401–437.
Panchy, N., M. Lehti-Shiu, and S. H. Shiu. 2016. Evolution of gene duplication in plants. Plant Physiology 171: 2294–2316.
Parks, M., N. J. Wickett, and A. J. Alverson. 2017. Signal, uncertainty, and conflict in phylogenomic data for a diverse lineage of microbial eukaryotes (diatoms, Bacillariophyta). under review.
Philippe, H., H. Brinkmann, D. V. Lavrov, D. T. J. Littlewood, M. Manuel, G. Wörheide, and D. Baurain. 2011. Resolving difficult phylogenetic questions: Why more sequences are not enough. PLOS Biology 9: e1000602.
Poulsen, N., and N. Kroger. 2004. Silica morphogenesis by alternative processing of silaffins in the diatom Thalassiosira pseudonana. Journal of Biological Chemistry 279: 42993–42999.
Rabier, C. E., T. Ta, and C. Ane. 2014. Detecting and locating whole genome duplications on a phylogeny: A probabilistic approach. Molecular Biology and Evolution 31: 750–762.
Ramsey, J., and D. W. Schemske. 1998. Pathways, mechanisms, and rates of polyploid formation in flowering plants. Annual Review of Ecology and Systematics 29: 467–501.
Round, F. E., and P. A. Sims. 1980. The distribution of diatom genera in marine and freshwater environments and some evolutionary considerations. In R. Ross [ed.], Proceedings of the Sixth Symposium on Recent and Fossil Diatoms, 301–320. Otto Koeltz Science Publishers, Hirschberg.
Round, F. E., R. M. Crawford, and D. G. Mann. 1990. The Diatoms: Biology & Morphology of the Genera. Cambridge University Press, Cambridge.
Sato, S., G. Beakes, M. Idei, T. Nagumo, and D. G. Mann. 2011. Novel sex cells and evidence for sex pheromones in diatoms. PLOS ONE 6: e26923.
Sayyari, E., and S. Mirarab. 2016. Fast coalescent-based computation of local branch support from quartet frequencies. Molecular Biology and Evolution 33: 1654–1668.
Schlueter, J. A., P. Dixon, C. Granger, D. Grant, L. Clark, J. J. Doyle, and R. C. Shoemaker. 2004. Mining EST databases to resolve evolutionary events in major crop species. Genome 47: 868–876.
Seo, T. K. 2008. Calculating bootstrap probabilities of phylogeny using multilocus sequence data. Molecular Biology and Evolution 25: 960–971.
Simão, F. A., R. M. Waterhouse, P. Ioannidis, E. V. Kriventseva, and E. M. Zdobnov. 2015. BUSCO: Assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31: 3210–3212.
Smith, S. A., and B. C. O’Meara. 2012. treePL: Divergence time estimation using penalized likelihood for large phylogenies. Bioinformatics 28: 2689–2690.
Smith, S. A., M. J. Moore, J. W. Brown, and Y. Yang. 2015. Analysis of phylogenomic datasets reveals conflict, concordance, and gene duplications with examples from animals and plants. BMC Evolutionary Biology 15: 150.
Smith-Unna, R., C. Boursnell, R. Patro, J. Hibberd, and S. Kelly. 2016. TransRate: Reference-free quality assessment of de novo transcriptome assemblies. Genome Research doi:10.1101/gr.196469.115.
Soltis, D. E., V. A. Albert, J. Leebens-Mack, C. D. Bell, A. H. Paterson, C. Zheng, D. Sankoff, et al. 2009. Polyploidy and angiosperm diversification. American Journal of Botany 96: 336–348.
Song, L., and L. Florea. 2015. Rcorrector: Efficient and accurate error correction for Illumina RNA-seq reads. GigaScience 4: 48.
↵
Stamatakis, A. 2014. RAxML version 8: A tool for phylogenetic analysis and post’analysis of largephylogenies. Bioinformatics 30: 1312–1313.
OpenUrl CrossRef PubMed Web of Science
↵
Sukumaran, J., and M. T. Holder. 2010. DendroPy: A Python library for phylogenetic computing. Bioinformatics 26: 1569–1571.
OpenUrl CrossRef PubMed Web of Science
↵
Suyama, M., D. Torrents, and P. Bork. 2006. PAL2NAL: Robust conversion of protein sequencealignments into the corresponding codon alignments. Nucleic Acids Research 34: W609–W612.
OpenUrl CrossRef PubMed Web of Science
↵
Tanaka, T., Y. Maeda, A. Veluchamy, M. Tanaka, H. Abida, E. MarÉChal, C. Bowler, et al.2015. Oil accumulation by the oleaginous diatom Fistulifera solaris as revealed by the genomeand transcriptome. The Plant Cell 27: 162.
OpenUrl Abstract/FREE Full Text
↵
Tang, H., J. E. Bowers, X. Wang, and A. H. Paterson. 2010. Angiosperm genome comparisons reveal early polyploidy in the monocot lineage. Proceedings of the National Academy of Sciences 107: 472–477.
Tank, D. C., J. M. Eastman, M. W. Pennell, P. S. Soltis, D. E. Soltis, C. E. Hinchliff, J. W. Brown, et al. 2015. Nested radiations and the pulse of angiosperm diversification: Increased diversification rates often follow whole genome duplications. New Phytologist 207: 454–467.
Theriot, E. C., M. P. Ashworth, T. Nakov, E. Ruck, and R. K. Jansen. 2015. Dissecting signal and noise in diatom chloroplast protein encoding genes with phylogenetic information profiling. Molecular Phylogenetics and Evolution 89: 28–36.
Thomas, G. W. C., S. H. Ather, and M. W. Hahn. 2017. Gene-tree reconciliation with MUL-trees to resolve polyploidy events. Systematic Biology: syx044.
Thompson, J. D., and R. Lumaret. 1992. The evolutionary dynamics of polyploid plants: Origins, establishment and persistence. Trends in Ecology & Evolution 7: 302–307.
Tiley, G. P., C. Ane, and J. G. Burleigh. 2016. Evaluating and characterizing ancient whole-genome duplications in plants with gene count data. Genome Biology and Evolution 8: 1023–1037.
Van Dongen, S., and C. Abreu-Goodger. 2012. Using MCL to extract clusters from networks. Bacterial Molecular Networks: Methods and Protocols: 281-295.
Van Dongen, S. M. 2001. Graph clustering by flow simulation. PhD, University of Utrecht, Utrecht, Netherlands.
Vanneste, K., Y. Van De Peer, and S. Maere. 2013. Inference of genome duplications from age distributions revisited. Molecular Biology and Evolution 30: 177–190.
Vanneste, K., L. Sterck, A. A. Myburg, Y. Van De Peer, and E. Mizrachi. 2015. Horsetails are ancient polyploids: Evidence from Equisetum giganteum. The Plant Cell 27: 1567–1578.
Von Dassow, P., T. W. Petersen, V. A. Chepurnov, and E. V. Armbrust. 2008. Inter- and intraspecific relationships between nuclear DNA content and cell size in selected members of the centric diatom genus Thalassiosira (Bacillariophyceae). Journal of Phycology 44: 335–349.
Winge, Ö. 1917. The chromosomes. Their numbers and general importance. Comptes Rendus des Travaux du Laboratoire Carlsberg 13: 131–175.
Wolfe, K. H., and D. C. Shields. 1997. Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387: 708–713.
Wood, T. E., N. Takebayashi, M. S. Barker, I. Mayrose, P. B. Greenspoon, and L. H. Rieseberg. 2009. The frequency of polyploid speciation in vascular plants. Proceedings of the National Academy of Sciences of the United States of America 106: 13875–13879.
Yang, Y., and S. A. Smith. 2014. Orthology inference in nonmodel organisms using transcriptomes and low-coverage genomes: Improving accuracy and matrix occupancy for phylogenomics. Molecular Biology and Evolution 31: 3081–3092.
Yang, Y., M. J. Moore, S. F. Brockington, D. E. Soltis, G. K. S. Wong, E. J. Carpenter, Y. Zhang, et al. 2015. Dissecting molecular evolution in the highly diverse plant clade Caryophyllales using transcriptome sequencing. Molecular Biology and Evolution 32: 2001–2014.
Yang, Z., and R. Nielsen. 1998. Synonymous and nonsynonymous rate variation in nuclear genes of mammals. Journal of Molecular Evolution 46: 409–418.
Zhan, S. H., M. Drori, E. E. Goldberg, S. P. Otto, and I. Mayrose. 2016. Phylogenetic evidence for cladogenetic polyploidization in land plants. American Journal of Botany 103: 1252–1258.

Posted August 26, 2017.

Subject Area

Evolutionary Biology

[1] ↵
Albertin, W., and P. Marullo. 2012. Polyploidy in fungi: Evolution after whole-genomeduplication. Proceedings of the Royal Society B: Biological Sciences 279: 2497–2509.
OpenUrl CrossRef PubMed

[2] ↵
Alfaro, M. E., S. Zoller, and F. Lutzoni. 2003. Bayes or bootstrap? A simulation study comparingthe performance of Bayesian Markov chain Monte Carlo sampling and bootstrapping in assessingphylogenetic confidence. Molecular Biology and Evolution 20: 255–266.
OpenUrl CrossRef PubMed Web of Science

[3] ↵
Alverson, A. J., R. K. Jansen, and E. C. Theriot. 2007. Bridging the Rubicon: Phylogenetic analysisreveals repeated colonizations of marine and fresh waters by thalassiosiroid diatoms. Molecular Phylogenetics and Evolution 45: 193–210.
OpenUrl CrossRef PubMed Web of Science

[4] ↵
Ambrosino, L., H. Bostan, P. D. Salle, M. Sangiovanni, A. Vigilante, and M. L. Chiusano.2016. pATsi: Paralogs and singleton genes from Arabidopsis thaliana. Evolutionary Bioinformatics 12: 1–7.
OpenUrl

[5] ↵
Armbrust, E. V., J. A. Berges, C. Bowler, B. R. Green, D. Martinez, N. H. Putnam, S. Zhou, et al.2004. The genome of the diatom Thalassiosira pseudonana: Ecology, evolution, and metabolism. Science 306: 79–86.
OpenUrl Abstract/FREE Full Text

[6] ↵
Barker, M. S., N. Arrigo, A. E. Baniaga, Z. Li, and D. A. Levin. 2016. On the relative abundance ofautopolyploids and allopolyploids. New Phytologist 210: 391–398.
OpenUrl CrossRef PubMed

[7] ↵
Basu, S., S. Patil, D. Mapleson, M. T. Russo, L. Vitale, C. Fevola, F. Maumus, et al. 2017. Findinga partner in the ocean: Molecular and evolutionary bases of the response to sexual cues in aplanktonic diatom. New Phytologist 215: 140–156.
OpenUrl CrossRef PubMed

[8] ↵
Blanc, G., and K. H. Wolfe. 2004. Widespread paleopolyploidy in model plant species inferred fromage distributions of duplicate genes. Plant Cell 16: 1667–1678.
OpenUrl Abstract/FREE Full Text

[9] ↵
Bolger, A. M., M. Lohse, and B. Usadel. 2014. Trimmomatic: A flexible trimmer for IlluminaSequence Data. Bioinformatics 10.1093/bioinformatics/btu170 DOI.

[10] ↵
Borowiec, M. L. 2016. AMAS: A fast tool for alignment manipulation and computing of summarystatistics. PeerJ 4: e1660.
OpenUrl CrossRef

[11] ↵
Bowers, J. E., B. A. Chapman, J. K. Rong, and A. H. Paterson. 2003. Unravelling angiospermgenome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422: 433–438.
OpenUrl CrossRef PubMed Web of Science

[12] ↵
Bowler, C., A. E. Allen, J. H. Badger, J. Grimwood, K. Jabbari, A. Kuo, U. Maheswari, et al.2008. The Phaeodactylum genome reveals the evolutionary history of diatom genomes. Nature 456: 239–244.
OpenUrl CrossRef PubMed Web of Science

[13] ↵
Camacho, C., G. Coulouris, V. Avagyan, N. Ma, J. Papadopoulos, K. Bealer, and T. L. Madden. 2009. BLAST+: Architecture and applications. BMC Bioinformatics 10: 421.
OpenUrl CrossRef PubMed

[14] ↵
Cannon, S. B., M. R. Mckain, A. Harkess, M. N. Nelson, S. Dash, M. K. Deyholos, Y. Peng, et al.2015. Multiple polyploidy events in the early radiation of nodulating and nonnodulating legumes. Molecular Biology and Evolution 32: 193–210.
OpenUrl CrossRef PubMed

[15] ↵
Casteleyn, G., N. G. Adams, P. Vanormelingen, A. E. Debeer, K. Sabbe, and W. Vyverman.2009. Natural hybrids in the marine diatom Pseudo-nitzschia pungens (Bacillariophyceae): Genetic and morphological evidence. Protist 160: 343–354.
OpenUrl CrossRef PubMed Web of Science

[16] ↵
Chao, D. Y., B. Dilkes, H. B. Luo, A. Douglas, E. Yakubova, B. Lahner, and D. E. Salt. 2013. Polyploids exhibit higher potassium uptake and salinity tolerance in Arabidopsis. Science 341:658–659.
OpenUrl Abstract/FREE Full Text

[17] ↵
R. Gordon and
J. Seckbach
Chepurnov, V., P. Chaerle, K. Vanhoutte, and D. Mann. 2012. How to breed diatoms: Examination of two species with contrasting reproductive biology. In R. Gordon and J. Seckbach [eds.], The Science of Algal Fuels, vol. 25, Cellular Origin, Life in Extreme Habitatsand Astrobiology, 323–340. Springer Netherlands.

[18] R. Gordon and

[19] J. Seckbach

[20] ↵
Chepurnov, V. A., and A. M. Roschin. 1995. Inbreeding influence on sexual reproduction ofAchnanthes longipes Ag. (Bacillariophyta). Diatom Research 10: 21–29.
OpenUrl

[21] ↵
Chepurnov, V. A., D. G. Mann, K. Sabbe, and W. Vyverman. 2004. Experimental studies on sexualreproduction in diatoms. International Review of Cytology 237: 91–154.
OpenUrl CrossRef PubMed Web of Science

[22] ↵
Chepurnov, V. A., D. G. Mann, W. Vyverman, K. Sabbe, and D. B. Danielidis. 2002. Sexualreproduction, mating system, and protoplast dynamics of Seminavis (Bacillariophyceae). Journal of Phycology 38: 1004–1019.
OpenUrl CrossRef Web of Science

[23] ↵
Chepurnov, V. A., D. G. Mann, P. Von Dassow, P. Vanormelingen, J. Gillard, D. Inze, K. Sabbe, and W. Vyverman. 2008. In search of new tractable diatoms for experimental biology. BioEssays 30: 692–702.
OpenUrl CrossRef PubMed Web of Science

[24] ↵
Chernomor, O., A. Von Haeseler, and B. Q. Minh. 2016. Terrace aware data structure forphylogenomic inference from supermatrices. Systematic Biology 65: 997–1008.
OpenUrl CrossRef PubMed

[25] ↵
Connolly, J. A., M. J. Oliver, J. M. Beaulieu, C. A. Knight, L. Tomanek, and M. A. Moline.2008. Correlated evolution of genome size and cell volume in diatoms (Bacillariophyceae). Journal of Phycology 44: 124–131.
OpenUrl CrossRef Web of Science

[26] ↵
Coyer, J. A., G. Hoarau, G. A. Pearson, E. A. SERRÃO, W. T. Stam, and J. L. Olsen. 2006. Convergent adaptation to a marginal habitat by homoploid hybrids and polyploid ecads in theseaweed genus Fucus. Biology Letters 2: 405.
OpenUrl

[27] ↵
Cui, L., P. K. Wall, J. H. Leebens-Mack, B. G. Lindsay, D. E. Soltis, J. J. Doyle, P. S. Soltis, et al. 2006. Widespread genome duplications throughout the history of flowering plants. Genome Research 16: 738–749.
OpenUrl Abstract/FREE Full Text

[28] ↵
Darby, C. A., M. Stolzer, P. J. Ropp, D. Barker, and D. Durand. 2017. Xenolog classification. Bioinformatics 33: 640–649.
OpenUrl

[29] ↵
Darriba, D., G. L. Taboada, R. Doallo, and D. Posada. 2011. ProtTest 3: Fast selection of best’fitmodels of protein evolution. Bioinformatics 27: 1164–1165.
OpenUrl CrossRef PubMed Web of Science

[30] ↵
Dehal, P., and J. L. Boore. 2005. Two rounds of whole genome duplication in the ancestral vertebrate. PLOS Biology 3: 1700–1708.
OpenUrl Web of Science

[31] ↵
Donoghue, M. J., and M. J. Sanderson. 2015. Confluence, synnovation, and depauperons in plantdiversification. New Phytologist 207: 260–274.
OpenUrl CrossRef PubMed

[32] ↵
Durand, D., B. V. Halldorsson, and B. Vernot. 2006. A hybrid micro-macroevolutionary approachto gene tree reconstruction. Journal of Computational Biology 13: 320–335.
OpenUrl CrossRef PubMed Web of Science

[33] ↵
Eddy, S. R. 2011. Accelerated profile HMM searches. PLOS Computational Biology 7: e1002195.
OpenUrl

[34] ↵
Enright, A. J., S. Van Dongen, and C. A. Ouzounis. 2002. An efficient algorithm for large-scaledetection of protein families. Nucleic Acids Research 30: 1575–1584.
OpenUrl CrossRef PubMed Web of Science

[35] ↵
Evans, K. M., A. H. Wortley, G. E. Simpson, V. A. Chepurnov, and D. G. Mann. 2008. Amolecular systematic approach to explore diversity within the Sellaphora pupula species complex(Bacillariophyta). Journal of Phycology 44: 215–231.
OpenUrl CrossRef Web of Science

[36] ↵
Fawcett, J. A., S. Maere, and Y. Van De Peer. 2009. Plants with double genomes might have had abetter chance to survive the Cretaceous–Tertiary extinction event. Proceedings of the National Academy of Sciences 106: 5737–5742.
OpenUrl Abstract/FREE Full Text

[37] ↵
Field, C. B., M. J. Behrenfeld, J. T. Randerson, and P. Falkowski. 1998. Primary production ofthe biosphere: Integrating terrestrial and oceanic components. Science 281: 237–240.
OpenUrl Abstract/FREE Full Text

[38] ↵
Finn, R. D., P. Coggill, R. Y. Eberhardt, S. R. Eddy, J. Mistry, A. L. Mitchell, S. C. Potter, et al.2015. The Pfam protein families database: Towards a more sustainable future. Nucleic Acids Research 44: D279–D285.
OpenUrl CrossRef PubMed

[39] ↵
Flagel, L. E., and J. F. Wendel. 2009. Gene duplication and evolutionary novelty in plants. New Phytologist 183: 557–564.
OpenUrl CrossRef PubMed Web of Science

[40] ↵
Fraley, C., A. E. Raftery, T. B. Murphy, and L. Scrucca. 2012. mclust version 4 for R: Normalmixture modeling for model-based clustering, classification, and density estimation. University of Washington, D. o. Statistics, Seattle.

[41] ↵
Fu, L., B. Niu, Z. Zhu, S. Wu, and W. LI. 2012. CD-HIT: Accelerated for clustering the next-generationsequencing data. Bioinformatics 28: 3150–3152.
OpenUrl CrossRef PubMed Web of Science

[42] ↵
Galbraith, D. W., J. L. Bennetzen, E. A. Kellogg, J. C. Pires, and P. S. Soltis. 2011. Thegenomes of all angiosperms: A call for a coordinated global census. Journal of Botany 2011: 1–10.
OpenUrl

[43] ↵
Geitler, L. 1927. Somatische teilung, reduktionsteilung, copulation und parthenogenese bei Cocconeis placentula. Archiv Fur Protistenkunde 59: 506–549.
OpenUrl

[44] ↵
Geitler, L. 1973. Auxosporenbildung und systematik bei pennaten diatomeen und die cytologie vonCocconeis-Sippen. Österreichische botanische Zeitschrift 122: 299–321.
OpenUrl

[45] ↵
Gillard, J., J. Frenkel, V. Devos, K. Sabbe, C. Paul, M. Rempt, D. Inze, et al. 2013. Metabolomicsenables the structure elucidation of a diatom sex pheromone. Angewandte Chemie-International Edition 52: 854–857.
OpenUrl

[46] ↵
Goldman, N., and Z. H. Yang. 1994. Codon-based model of nucleotide substitution for protein-codingDNA sequences. Molecular Biology and Evolution 11: 725–736.
OpenUrl CrossRef PubMed Web of Science

[47] ↵
Grabherr, M. G., B. J. Haas, M. Yassour, J. Z. Levin, D. A. Thompson, I. Amit, X. Adiconis, et al.2011a. Trinity: Reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature Biotechnology 29: 644–652.
OpenUrl CrossRef PubMed

[48] ↵
Grabherr, M. G., 2011b. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology 29: 644–652.
OpenUrl CrossRef PubMed

[49] ↵
Guillard, R. R. L., and J. H. Ryther. 1962. Studies of marine planktonic diatoms. I. Cyclotella nana Hustedt and Detonula confervacea (Cleve) Gran. Canadian Journal of Microbiology 8: 229–239.
OpenUrl CrossRef PubMed Web of Science

[50] ↵
Guindon, S., J.-F. Dufayard, V. Lefort, M. Anisimova, W. Hordijk, and O. Gascuel. 2010. Newalgorithms and methods to estimate maximum-likelihood phylogenies: Assessing theperformance of PhyML 3.0 Systematic Biology 59: 307–321.
OpenUrl CrossRef PubMed Web of Science

[51] ↵
Guiry, M. D. 2012. How many species of algae are there? Journal of Phycology 48: 1057–1063.
OpenUrl CrossRef PubMed Web of Science

[52] ↵
Guiry, M. D., and G. M. Guiry. 2017. AlgaeBase Website http://www.algaebase.org/ 2017].

[53] ↵
Hahn, M. W. 2007. Bias in phylogenetic tree reconciliation methods: Implications for vertebrate genomeevolution. Genome Biology

[54] ↵
Ichinomiya, M., A. L. DOS Santos, P. Gourvil, S. Yoshikawa, M. Kamiya, K. Ohki, S. Audic, et al.2016. Diversity and oceanic distribution of the Parmales (Bolidophyceae), a picoplanktonic groupclosely related to diatoms. ISME J 10.1038/ismej.2016.38

[55] ↵
Inoue, J., Y. Sato, R. Sinclair, K. Tsukamoto, and M. Nishida. 2015. Rapid genome reshaping bymultiple-gene loss after whole-genome duplication in teleost fish suggested by mathematicalmodeling. Proceedings of the National Academy of Sciences of the United States of America 112:14918–14923.
OpenUrl Abstract/FREE Full Text

[56] ↵
Ioos, R., A. Andrieux, B. Marcais, and P. Frey. 2006. Genetic characterization of the natural hybridspecies Phytophthora alni as inferred from nuclear and mitochondrial DNA analyses. Fungal Genetics and Biology 43: 511–529.
OpenUrl CrossRef PubMed

[57] ↵
Jiao, Y., N. J. Wickett, S. Ayyampalayam, A. S. Chanderbali, L. Landherr, P. E. Ralph, L. P. Tomsho, et al. 2011. Ancestral polyploidy in seed plants and angiosperms. Nature 473: 97–100.
OpenUrl CrossRef PubMed Web of Science

[58] ↵
Johnson, M. G., C. Malley, B. Goffinet, A. J. Shaw, and N. J. Wickett. 2016. Aphylotranscriptomic analysis of gene family expansion and evolution in the largest order ofpleurocarpous mosses (Hypnales, Bryophyta). Molecular Phylogenetics and Evolution 98: 29–40.
OpenUrl

[59] ↵
Katoh, K., and D. M. Standley. 2013. MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Molecular Biology and Evolution 30: 772–780.
OpenUrl CrossRef PubMed Web of Science

[60] ↵
Kellogg, E. A. 2016. Has the connection between polyploidy and diversification actually been tested? Current Opinion in Plant Biology 30: 25–32.
OpenUrl CrossRef PubMed

[61] ↵
Kessenich, C. R., E. C. Ruck, A. M. Schurko, N. J. Wickett, and A. J. Alverson. 2014. Transcriptomic insights into the life history of Bolidophytes, the sister lineage to Diatoms. Journal of Phycology 50: 977–983.
OpenUrl CrossRef

[62] ↵
Kociolek, J. P., and E. F. Stoermer. 1989. Chromosome numbers in diatoms: A review. Diatom Research 4: 47–54.
OpenUrl

[63] ↵
Koester, J. A., J. E. Swalwell, P. Von Dassow, and E. V. Armbrust. 2010. Genome sizedifferentiates co-occurring populations of the planktonic diatom Ditylum brightwellii(Bacillariophyta). BMC Evolutionary Biology 10.

[64] ↵
Langmead, B., and S. L. Salzberg. 2012. Fast gapped-read alignment with Bowtie 2. Nat Meth 9:357–359.
OpenUrl CrossRef

[65] ↵
Logares, R., J. Brate, S. Bertilsson, J. L. Clasen, K. Shalchian-Tabrizi, and K. Rengefors.2009. Infrequent marine-freshwater transitions in the microbial world. Trends in Microbiology 17: 414–422.
OpenUrl CrossRef PubMed Web of Science

[66] ↵
Lynch, M., and J. S. Conery. 2000. The evolutionary fate and consequences of duplicate genes. Science 290: 1151–1155.
OpenUrl Abstract/FREE Full Text

[67] ↵
Lynch, M., and A. G. Force. 2000. The origin of interspecific genomic incompatibility via geneduplication. American Naturalist 156: 590–605.
OpenUrl CrossRef Web of Science

[68] ↵
Macmanes, M. D. 2015. An opinionated guide to the proper care and feeding of your transcriptome. bioRxiv 10.1101/035642 doi:10.1101/035642.
OpenUrl Abstract/FREE Full Text

[69] ↵
Mallet, J. 2005. Hybridization as an invasion of the genome. Trends in Ecology & Evolution 20: 229–237.
OpenUrl CrossRef PubMed Web of Science

[70] ↵
Mann, D. G. 1994. Auxospore formation, reproductive plasticity and cell structure in Navicula ulvacea and the resurrection of the genus Dickieia (Bacillariophyta). European Journal of Phycology 29:141–157.
OpenUrl

[71] ↵
Mann, D. G. 1999a. The species concept in diatoms. Phycologia 38: 437–495.
OpenUrl CrossRef Web of Science

[72] ↵
S. Mayama,
M. Idei, and
I. Koizumi
Mann, D. G. 1999b. Crossing the Rubicon: The effectiveness of the marine/freshwater interface as abarrier to the migration of diatom germplasm. In S. Mayama, M. Idei, and I. Koizumi [eds.], Proceedings of the 14th International Diatom Symposium, 1–21. Koeltz Scientific Books, Koenigstein.

[73] S. Mayama,

[74] M. Idei, and

[75] I. Koizumi

[76] ↵
Mann, D. G., and A. J. Stickle. 1991. The genus Craticula. Diatom Research 6: 79–107.
OpenUrl CrossRef

[77] ↵
Mann, D. G., and P. Vanormelingen. 2013. An inordinate fondness? The number, distributions, andorigins of diatom species. Journal of Eukaryotic Microbiology 60: 414–420.
OpenUrl CrossRef PubMed

[78] ↵
Mann, D. G., S. M. Mcdonald, M. M. Bayer, S. J. M. Droop, V. A. Chepurnov, R. E. Loke, A. Ciobanu, and J. M. H. Du Buf. 2004. The Sellaphora pupula species complex(Bacillariophyceae): morphometric analysis, ultrastructure and mating data provide evidence forfive new species. Phycologia 43: 459–482.
OpenUrl CrossRef Web of Science

[79] ↵
Marcet-Houben, M., and T. Gabaldon. 2015. Beyond the whole-genome duplication: Phylogeneticevidence for an ancient interspecies hybridization in the baker’s yeast lineage. PLOS Biology

[80] ↵
Mckain, M. R., H. Tang, J. R. Mcneal, S. Ayyampalayam, J. I. Davis, C. W. Depamphilis, T. J. Givnish, et al. 2016. A phylogenomic assessment of ancient polyploidy and genome evolutionacross the Poales. Genome Biology and Evolution 8: 1150–1164.
OpenUrl CrossRef PubMed

[81] ↵
Minh, B. Q., M. A. Nguyen, and A. Von Haeseler. 2013. Ultrafast approximation for phylogeneticbootstrap. Molecular Biology and Evolution 30: 1188–1195.
OpenUrl CrossRef PubMed Web of Science

[82] ↵
Mirarab, S., and T. Warnow. 2015. ASTRAL-II: Coalescent-based species tree estimation withmany hundreds of taxa and thousands of genes. Bioinformatics 31: i44–i52.
OpenUrl CrossRef PubMed

[83] ↵
Moeys, S., J. Frenkel, C. Lembke, J. T. F. Gillard, V. Devos, K. Van Den Berge, B. Bouillon, et al. 2016. A sex-inducing pheromone triggers cell cycle arrest and mate attraction in the diatomSeminavis robusta. Scientific Reports

[84] ↵
Nakov, T., J. M. Beaulieu, and A. J. Alverson. 2017. A time-calibrated phylogeny of diatoms(Bacillariophyta). under review.

[85] ↵
Ohno, S. 1970. Evolution by gene duplication. Springer-Verlag, Berlin, New York.

[86] ↵
Oliver, M. J., D. Petrov, D. Ackerly, P. Falkowski, and O. M. Schofield. 2007. The mode andtempo of genome size evolution in eukaryotes. Genome Research 17: 594–601.
OpenUrl Abstract/FREE Full Text

[87] ↵
Otto, S. P., and J. Whitton. 2000. Polyploid incidence and evolution. Annual Review of Genetics 34:401–437.
OpenUrl CrossRef PubMed Web of Science

[88] ↵
Panchy, N., M. Lehti-Shiu, and S. H. Shiu. 2016. Evolution of gene duplication in plants. Plant Physiology 171: 2294–2316.
OpenUrl Abstract/FREE Full Text

[89] ↵
Parks, M., N. J. Wickett, and A. J. Alverson. 2017. Signal, uncertainty, and conflict inphylogenomic data for a diverse lineage of microbial eukaryotes (diatoms, Bacillariophyta). under review.

[90] ↵
Philippe, H., H. Brinkmann, D. V. Lavrov, D. T. J. Littlewood, M. Manuel, G. WÖRheide, and D. Baurain. 2011. Resolving difficult phylogenetic questions: Why more sequences are notenough. PLOS Biology 9: e1000602.
OpenUrl CrossRef PubMed

[91] ↵
Poulsen, N., and N. Kroger. 2004. Silica morphogenesis by alternative processing of silaffins in thediatom Thalassiosira pseudonana. Journal of Biological Chemistry 279: 42993–42999.
OpenUrl Abstract/FREE Full Text

[92] ↵
Rabier, C. E., T. Ta, and C. Ane. 2014. Detecting and locating whole genome duplications on aphylogeny: A probabilistic approach. Molecular Biology and Evolution 31: 750–762.
OpenUrl CrossRef PubMed

[93] ↵
Ramsey, J., and D. W. Schemske. 1998. Pathways, mechanisms, and rates of polyploid formation inflowering plants. Annual Review of Ecology and Systematics 29: 467–501.
OpenUrl CrossRef Web of Science

[94] ↵
R. Ross
Round, F. E., and P. A. Sims. 1980. The distribution of diatom genera in marine and freshwaterenvironments and some evolutionary considerations. In R. Ross [ed.], Proceedings of the SixthSymposium on Recent and Fossil Diatoms, 301-320. Otto Koeltz Science Publishers, Hirschberg.

[95] R. Ross

[96] ↵
Round, F. E., R. M. Crawford, and D. G. Mann. 1990. The Diatoms: Biology & Morphology of the Genera. Cambridge University Press, Cambridge.

[97] ↵
Sato, S., G. Beakes, M. Idei, T. Nagumo, and D. G. Mann. 2011. Novel sex cells and evidence forsex pheromones in diatoms. PLOS ONE 6: e26923.
OpenUrl CrossRef PubMed

[98] ↵
Sayyari, E., and S. Mirarab. 2016. Fast coalescent-based computation of local branch support fromquartet frequencies. Molecular Biology and Evolution 33: 1654–1668.
OpenUrl CrossRef PubMed

[99] ↵
Schlueter, J. A., P. Dixon, C. Granger, D. Grant, L. Clark, J. J. Doyle, and R. C. Shoemaker.2004. Mining EST databases to resolve evolutionary events in major crop species. Genome 47:868–876.
OpenUrl CrossRef PubMed

[100] ↵
Seo, T. K. 2008. Calculating bootstrap probabilities of phylogeny using multilocus sequence data. Molecular Biology and Evolution 25: 960–971.
OpenUrl CrossRef PubMed Web of Science

[101] ↵
Simão, F. A., R. M. Waterhouse, P. Ioannidis, E. V. Kriventseva, and E. M. Zdobnov. 2015. BUSCO: Assessing genome Assembly and annotation completeness with single-copy Orthologs. Bioinformatics 31: 3210–3212.
OpenUrl CrossRef PubMed

[102] ↵
Smith, S. A., and B. C. O’Meara. 2012. treePL: Divergence time estimation using penalized likelihoodfor large phylogenies. Bioinformatics 28: 2689–2690.
OpenUrl CrossRef PubMed Web of Science

[103] ↵
Smith, S. A., M. J. Moore, J. W. Brown, and Y. Yang. 2015. Analysis of phylogenomic datasetsreveals conflict, concordance, and gene duplications with examples from animals and plants. BMC Evolutionary Biology 15: 150.
OpenUrl

[104] ↵
Smith-Unna, R., C. Boursnell, R. Patro, J. Hibberd, and S. Kelly. 2016. TransRate: Referencefree quality assessment of de novo transcriptome assemblies. Genome Research10.1101/gr.196469.115 DOI.

[105] ↵
Soltis, D. E., V. A. Albert, J. Leebens-Mack, C. D. Bell, A. H. Paterson, C. Zheng, D. Sankoff,Et al. 2009. Polyploidy and angiosperm diversification. American Journal of Botany 96: 336–348.
OpenUrl Abstract/FREE Full Text

[106] ↵
Song, L., and L. Florea. 2015. Rcorrector: Efficient and accurate error correction for Illumina RNAseqreads. GigaScience 4: 48.
OpenUrl CrossRef

[107] ↵
Stamatakis, A. 2014. RAxML version 8: A tool for phylogenetic analysis and post’analysis of largephylogenies. Bioinformatics 30: 1312–1313.
OpenUrl CrossRef PubMed Web of Science

[108] ↵
Sukumaran, J., and M. T. Holder. 2010. DendroPy: A Python library for phylogenetic computing. Bioinformatics 26: 1569–1571.
OpenUrl CrossRef PubMed Web of Science

[109] ↵
Suyama, M., D. Torrents, and P. Bork. 2006. PAL2NAL: Robust conversion of protein sequencealignments into the corresponding codon alignments. Nucleic Acids Research 34: W609–W612.
OpenUrl CrossRef PubMed Web of Science

[110] ↵
Tanaka, T., Y. Maeda, A. Veluchamy, M. Tanaka, H. Abida, E. MarÉChal, C. Bowler, et al.2015. Oil accumulation by the oleaginous diatom Fistulifera solaris as revealed by the genomeand transcriptome. The Plant Cell 27: 162.
OpenUrl Abstract/FREE Full Text

[111] ↵
Tang, H., J. E. Bowers, X. Wang, and A. H. Paterson. 2010. Angiosperm genome comparisonsreveal early polyploidy in the monocot lineage. Proceedings of the National Academy of Sciences 107: 472–477.
OpenUrl Abstract/FREE Full Text

[112] ↵
Tank, D. C., J. M. Eastman, M. W. Pennell, P. S. Soltis, D. E. Soltis, C. E. Hinchliff, J. W. Brown, et al. 2015. Nested radiations and the pulse of angiosperm diversification: Increaseddiversification rates often follow whole genome duplications. New Phytologist 207: 454–467.
OpenUrl CrossRef PubMed

[113] ↵
Theriot, E. C., M. P. Ashworth, T. Nakov, E. Ruck, and R. K. Jansen. 2015. Dissecting signal andnoise in diatom chloroplast protein encoding genes with phylogenetic information profiling. Molecular Phylogenetics and Evolution 89: 28–36.
OpenUrl CrossRef PubMed

[114] ↵
Thomas, G. W. C., S. H. Ather, and M. W. Hahn. 2017. Gene-tree reconciliation with MUL-trees toresolve polyploidy events. Systematic Biology: syx044.

[115] ↵
Thompson, J. D., and R. Lumaret. 1992. The evolutionary dynamics of polyploid plants: Origins, establishment and persistence. Trends in Ecology & Evolution 7: 302–307.
OpenUrl CrossRef PubMed Web of Science

[116] ↵
Tiley, G. P., C. Ane, and J. G. Burleigh. 2016. Evaluating and characterizing ancient whole-genomeduplications in plants with gene count data. Genome Biology and Evolution 8: 1023–1037.
OpenUrl CrossRef PubMed

[117] ↵
Van Dongen, S., and C. Abreu-Goodger. 2012. Using MCL to extract clusters from networks. Bacterial Molecular Networks: Methods and Protocols: 281-295.

[118] ↵
Van Dongen, S. M. 2001. Graph clustering by flow simulation. PhD, University of Utrecht, Utrecht, Netherlands.

[119] ↵
Vanneste, K., Y. Van De Peer, and S. Maere. 2013. Inference of genome duplications from agedistributions revisited. Molecular Biology and Evolution 30: 177–190.
OpenUrl CrossRef PubMed Web of Science

[120] ↵
Vanneste, K., L. Sterck, A. A. Myburg, Y. Van De Peer, and E. Mizrachi. 2015. Horsetails areancient polyploids: Evidence from Equisetum giganteum. The Plant Cell 27: 1567–1578.
OpenUrl Abstract/FREE Full Text

[121] ↵
Von Dassow, P., T. W. Petersen, V. A. Chepurnov, and E. V. Armbrust. 2008. Inter-andintraspecific relationships between nuclear DNA content and cell size in selected members of thecentric diatom genus Thalassiosira (Bacillariophyceae). Journal of Phycology 44: 335–349.
OpenUrl CrossRef PubMed Web of Science

[122] ↵
Winge, Ö. 1917. The chromosomes. Their numbers and general importance. Comptes Rendus des Travaux du Laboratoire Carlsberg 13: 131–175.
OpenUrl

[123] ↵
Wolfe, K. H., and D. C. Shields. 1997. Molecular evidence for an ancient duplication of the entireyeast genome. Nature 387: 708–713.
OpenUrl CrossRef PubMed Web of Science

[124] ↵
Wood, T. E., N. Takebayashi, M. S. Barker, I. Mayrose, P. B. Greenspoon, and L. H. Rieseberg.2009. The frequency of polyploid speciation in vascular plants. Proceedings of the National Academy of Sciences of the United States of America 106: 13875–13879.
OpenUrl Abstract/FREE Full Text

[125] ↵
Yang, Y., and S. A. Smith. 2014. Orthology inference in nonmodel organisms using transcriptomesand low-coverage genomes: Improving accuracy and matrix occupancy for phylogenomics. Molecular Biology and Evolution 31: 3081–3092.
OpenUrl CrossRef PubMed

[126] ↵
Yang, Y., M. J. Moore, S. F. Brockington, D. E. Soltis, G. K. S. Wong, E. J. Carpenter, Y. Zhang, et al. 2015. Dissecting molecular evolution in the highly diverse plant clade Caryophyllales using transcriptome sequencing. Molecular Biology and Evolution 32: 2001–2014.
OpenUrl CrossRef PubMed

[127] ↵
Yang, Z., and R. Nielsen. 1998. Synonymous and nonsynonymous rate variation in nuclear genes ofmammals. Journal of Molecular Evolution 46: 409–418.
OpenUrl CrossRef PubMed Web of Science

[128] ↵
Zhan, S. H., M. Drori, E. E. Goldberg, S. P. Otto, and I. Mayrose. 2016. Phylogenetic evidencefor cladogenetic polyploidization in land plants. American Journal of Botany 103: 1252–1258.
OpenUrl Abstract/FREE Full Text
