Abstract
Most angiosperms bear hermaphroditic flowers, but a few species have evolved outcrossing strategies, such as dioecy, the presence of separate male and female individuals. We previously investigated the mechanisms underlying dioecy in diploid persimmon (D. lotus) and found that male flowers are specified by repression of the autosomal gene MeGI by its paralog, the Y-encoded pseudogene OGI. This mechanism is thought to be lineage-specific, but its evolutionary path remains unknown. Here, we developed a full draft of the diploid persimmon (D. lotus) genome, which revealed a lineage-specific genome-wide paleoduplication event. Together with a subsequent persimmon-specific duplication(s), these events resulted in the presence of three paralogs, MeGI, OGI and the newly identified Sister of MeGI (SiMeGI), derived from a single original gene. Evolutionary analysis suggested that MeGI underwent adaptive evolution after the paleoduplication event. Transformation of tobacco plants with MeGI and SiMeGI revealed that MeGI specifically acquired a new function as a repressor of male organ development, while SiMeGI presumably maintained the original function. Later, local duplication spawned MeGI’s regulator OGI, completing the path leading to dioecy. These findings exemplify how duplication events can provide flexible genetic material that helps plants respond to varying environments, and they provide interesting parallels for our understanding of the mechanisms underlying the transition into dioecy in plants.
Author summary
Plant sexuality has fascinated scientists for decades. Most plants can self-reproduce, but not all. For example, a small subset of species have evolved a system called dioecy, with separate male and female individuals. Dioecy has evolved multiple times independently and, while we do not yet understand the molecular mechanisms underlying dioecy in many of these species, a picture is starting to emerge with recent progress in several dioecious species. Here, we focused on the evolutionary events leading to dioecy in persimmon. Our previous work had identified a pair of genes regulating sex in this species, called OGI and MeGI. We drafted the whole genome sequence of diploid persimmon to investigate their evolutionary history. We discovered a lineage-specific genome duplication event, and observed that MeGI underwent adaptive evolution after this duplication. Transgenic analyses validated that MeGI newly acquired a male-suppressor function, while the other copy of this gene, SiMeGI, did not. The regulator of MeGI, OGI, resulted from a second smaller-scale duplication event, finalizing the system. This study sheds light on the role of duplication as a mechanism that promotes flexible gene functions, and on how it can affect important biological functions, such as the establishment of a new sexual system.
Introduction
Most species of flowering plants are hermaphroditic, but a small proportion have genetically determined separate sexes (Renner, 2014). The rarity of dioecy contrasts with its broad distribution across the flowering plant phylogenetic tree, suggesting multiple independent transitions into dioecy. Our study aimed to understand the molecular and evolutionary mechanisms underlying such changes. Advances in genomic analyses have allowed studies of plant sex chromosomes in a few dioecious plant species including papaya and Silene (Liu et al., 2004; Wang et al., 2012; Kazama et al., 2016), and a few genetic sex determining genes have recently been identified, including in persimmon, kiwifruit, and asparagus (Akagi et al., 2014, 2018; Harkess et al., 2017). Consistent with theoretical models (Charlesworth and Charlesworth, 1978a, b), the results indicate that at least one gain-of-function mutation occurred in the evolution of dioecy, creating a dominant gynoecium or androecium suppressor. Data from these species are also consistent with gene duplication as the first event leading to these gain-of-function mutations, because the redundancy provided by the presence of duplicate copies allows one copy to be neofunctionalized without loss of the original function (Flagel and Wendel, 2009). Unlike many animal taxa, flowering plants have experienced numerous whole-genome duplication (WGD) events (Van de Peer et al., 2017), which are thought to have provided opportunities for the appearance of new traits specific to each plant species. For example, functional differentiation between paralogs derived from WGD resulted in the establishment of ripening characteristics in tomato fruits (The Tomato Genome Consortium, 2012), and potentially enabled the adaptation to life underwater in seagrass (Zostera marina) (Olsen et al., 2016).
Within the large order Ericales, a heterogametic male (XY) sex determination system has evolved independently in at least two genera, Diospyros and Actinidia (Fraser et al., 2009; Akagi et al., 2014, 2018). Diospyros has evolved a Y-encoded pseudogene called OGI that produces small RNA, which in turn represses the autosomal feminization gene, MeGI (Akagi et al., 2014). MeGI belongs to the HD-Zip1 gene family conserved across angiosperms, but the specific function of MeGI in repressing male function, or feminization, has not been observed in MeGI orthologs from other plants so far (Komatsuda et al., 2007; Whipple et al., 2011; Sakuma et al., 2013). Indeed, although Actinidia and Diospyros are phylogenetically close to each other, the Y-encoded sex determination system in Actinidia does not involve the MeGI ortholog or another member of the HD-Zip1 family (Akagi et al., 2018). The existence of MeGI, OGI, and a third paralog called Sister-of-MeGI (SiMeGI), which was newly identified in this study, provides the opportunity to investigate both the scale and context of the gene duplication events that triggered the appearance of a lineage-specific sex determination system in this species. To address this question, we sequenced the genome of Caucasian diploid persimmon, focusing on the lineage-specific duplication events. Evolutionary analyses of the duplicated pairs found a limited number of genes that were potentially neofunctionalized via adaptive evolution after the duplication. Our results provide a potential path from the duplicated paralogs of an HD-Zip1 gene to dioecy, and shed light on how lineage-specific duplication events contribute to the evolution of a new sex determination system in a plant species.
Results and Discussion
Draft genome sequencing of Diospyros
Initially, we assembled a draft genome from ca 65X PacBio long-read coverage of the expected haploid genome size (907 Mb from nuclear weight (Tamura et al., 1998), 877.7 Mb from k-mer analysis) using Falcon (Supplemental Figure 1, Supplemental Table 1). After contaminant removal and polishing (see Materials and Methods), this resulted in 3,073 primary contigs and 5,901 “secondary” contigs, which are putative allelic contigs to the primary contigs. Next, we built three genetic maps, created from two segregating F1 populations (N = 314 and 119, see Materials and Methods and Supplemental Table 2). These maps were created from a total of 5,959 markers derived from GBS/ddRAD sequencing and allowed for the anchoring of the contigs into a genome draft comprising 15 pseudomolecules (Figure 1, Supplemental Figure 1, Supplemental Tables 2 and 3).
To start characterizing this newly assembled genome, we documented sequence variation between female and male individuals of D. lotus, as well as the content and type of repeat sequences in the draft sequence compared to other sequenced eudicots (Figure 1b and 1c, Supplemental Tables 3 and 4). Mapping of transcriptome data to this draft genome resulted in 40,532 predicted gene locations (Figure 1d, Supplemental Dataset 1). These numbers are similar to results from other asterid plant species, such as tomato (N = 34,879) (The Tomato Genome Consortium, 2012) or kiwifruit (N = 39,040) (Huang et al., 2013) (Supplemental Figures 2 and 3). Of these primary genes, we selected 12,058 that were determined to be either unique or low copy number within the genome (see Materials and Methods).
Identification of a whole-genome duplication event specific to the Diospyros genus
To investigate gene duplication patterns, we analyzed the distribution of silent divergence rate (dS) between homologous gene pairs. We compared the distribution of dS between homologous gene pairs within the persimmon genome with those within the kiwifruit (Actinidia), tomato (Solanum) and grape (Vitis) genomes. A subset of persimmon genes formed a clear peak of silent divergence rate (Figure 2a, dS = ca 0.5-0.9, mode dS = 0.69), suggesting that a whole-genome duplication (WGD) event, named Dd-α, occurred in this clade at approximately the same time as the tomato genome triplication (The Tomato Genome Consortium, 2012) (Figure 2a). The genomic regions including the gene pairs in this peak exhibited long regions of synteny (Figure 2b, Supplemental Figure 4). The distribution of four-fold synonymous (degenerative) third-codon transversion (4DTv) supported this lineage-specific WGD (Figure 2c). Comparison of intraspecific dS between homologous gene pairs in the Diospyros genome and interspecific dS between the orthologs from Diospyros and Actinidia, or from Diospyros and Vitis, indicated that the Dd-α event postdated the divergence of Diospyros and Actinidia, and might coincide with the divergence of the Ebenaceae family (Figure 2c-d, Supplemental Figure 4). Two other events, Ad-α and Ad-β, have been inferred by a similar analysis in the Actinidia genome (Huang et al., 2013) (Supplemental Figure 5) but are not detectable in the Diospyros genome. Thus, Actinidia and Diospyros differ by at least three lineage-specific ancestral WGD events. These occurred at a time similar to previously reported paleoduplication events in the asterids (Huang et al., 2013; Iorizzo et al., 2016; Reyes-Chin-Wo et al., 2017), as well as across the angiosperms (Vanneste et al., 2014; Van de Peer et al., 2017), concentrated around the K-Pg (Cretaceous-Paleogene) boundary (Figure 2e).
Only a few genes, including MeGI, exhibit signs of positive selection, but divergent expression patterns are common following the WGD event
To explore the evolutionary significance of lineage-specific duplications, and particularly of the Dd-α WGD event, we calculated dN/dS values between the duplicated gene pairs putatively derived from the Dd-α WGD event (N = 2,619). The dN/dS values averaged over the coding regions indicated that most of the duplicates experienced either purifying or neutral selection (dN/dS ≤ 1.0, Figure 3a). In contrast, site- and evolutionary branch-specific tests for positive selection (dN/dS ≫ 1.0), using PAML, suggested that at least 9 genes experienced strong positive selection (posterior probability > 0.99 in Bayes Empirical Bayes analysis) following the Dd-α WGD event (Figure 3b-c). Importantly, MeGI and its paralog, named Sister of MeGI (SiMeGI), constituted one of these 9 gene pairs.
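As an illustration only, the significance of a branch-site test of this kind is commonly assessed with a likelihood ratio test (LRT) that compares a null model (dN/dS fixed at 1 on the branch of interest) with an alternative model (dN/dS free to exceed 1). The Python sketch below shows that calculation. The log-likelihood values are hypothetical, and the use of a chi-square distribution with one degree of freedom is an assumption; neither is taken from the analysis reported here.

```python
import math


def lrt_pvalue(lnl_null: float, lnl_alt: float) -> float:
    """P-value of a likelihood ratio test with one degree of freedom.

    lnl_null and lnl_alt are the maximized log-likelihoods of the null
    and alternative models (e.g. as reported by a codon-model program).
    """
    stat = 2.0 * (lnl_alt - lnl_null)
    if stat <= 0.0:
        # The alternative model fits no better than the null model.
        return 1.0
    # Survival function of the chi-square distribution with 1 d.f.:
    # P(X > stat) = erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


if __name__ == "__main__":
    # Hypothetical log-likelihoods, for illustration only.
    lnl_null, lnl_alt = -2500.0, -2495.0
    p = lrt_pvalue(lnl_null, lnl_alt)
    print(f"2*dlnL = {2.0 * (lnl_alt - lnl_null):.2f}, p = {p:.4g}")
```

The Bayes Empirical Bayes posterior probabilities mentioned above are a separate, per-site output of the codon model and are not reproduced by this sketch.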
In contrast to the very small number of genes exhibiting positive selection, a larger proportion of the gene pairs derived from the Dd-α WGD event exhibited significant differences in expression patterns. We described expression patterns in male and female buds/flowers using transcriptome data from 8 time points throughout the annual cycle (see Materials and Methods for details). Our results suggest that 45.5% of the gene pairs (597/1,311 pairs) showed significant differentiation (Pearson product-moment correlation test r2 < 0.3, Supplemental Dataset 2). To investigate differences in expression pattern between male and female flowers throughout development, we conducted a 2×2 Fisher’s exact test on the Dd-α-derived gene pairs (see Materials and Methods) and identified 36 and 65 gene pairs (of 1,311 pairs) exhibiting significant differentiation (p < 0.01) in expression patterns between male and female flowers at developing and maturing stages, respectively (Figure 3d-e, Supplemental Dataset 3). These pairs might have contributed to the establishment of Diospyros-specific sex determining mechanisms. Such frequent variation in expression patterns is consistent with previous results in soybean (Roulin et al., 2013) and could have originated from rapid evolution in cis-motifs after WGD.
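The two statistics named in this paragraph can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this study: the expression values and read counts are hypothetical, and the layout of the 2×2 table (paralog by sex) is assumed because it is not specified in the text above.

```python
import math


def pearson_r2(x, y):
    """Squared Pearson product-moment correlation of two profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile carries no correlation signal
    return sxy * sxy / (sxx * syy)


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(k):
        # Hypergeometric probability of k counts in the top-left cell.
        return math.comb(row1, k) * math.comb(row2, col1 - k) / math.comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p = sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


if __name__ == "__main__":
    # Hypothetical expression of two paralogs at 8 time points.
    copy_a = [5.1, 7.9, 12.3, 9.8, 3.2, 2.5, 6.6, 8.4]
    copy_b = [6.0, 2.2, 1.9, 4.8, 9.7, 11.0, 3.3, 2.1]
    r2 = pearson_r2(copy_a, copy_b)
    print("r2 =", round(r2, 3), "diverged:", r2 < 0.3)
    # Hypothetical read counts: rows = paralogs, columns = male / female.
    print("Fisher p =", fisher_exact_2x2(120, 30, 45, 95))
```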
Adaptive evolution of MeGI to act specifically for repression of androecium development
Genome-wide survey of the HD-Zip1 family, to which MeGI belongs, found 34 genes in the D. lotus genome. Phylogenetic analysis of MeGI/Vrs1 orthologs from representative angiosperm species indicated that only MeGI and SiMeGI belong in the MeGI/Vrs1 clade (bootstrap = 100/100, Figure 4a, Supplemental Figure 6). Finer evolutionary analysis on the MeGI/SiMeGI orthologs, to detect site-branch specific evolutionary rates using PAML, indicated that specific regions of MeGI experienced strong positive selection soon after the Dd-α event (Figure 4b, p = 0.0027 for dN/dS > 1.0, post. prob. > 0.99 for P23-V40-S152, and Figure 4c-d, dN/dS > 2.0 for the region between 45 and 165 bp in the sliding window test).
On the other hand, MeGI experienced strong purifying selection overall (average dN/dS = 0.095) since the establishment of the Ebenaceae (Euclea and Diospyros) (Figure 4b). Furthermore, the regions that experienced positive selection early are currently under stronger purifying selection in MeGI than in SiMeGI (Figure 4e). This is also consistent with the idea that MeGI first underwent neofunctionalization following the paleoduplication event, and that these changes were later fixed by positive selection. The stronger purifying selection in MeGI than in SiMeGI could also reflect a lesser functional importance of SiMeGI (or possibly that it has been degenerating since the duplication occurred). Alternatively, it could result from the need to conserve high sequence homology between OGI and MeGI in order to maintain the regulatory role of OGI via smRNA targeting MeGI.
Consistent with the evolutionary analysis presented above, ectopic expression of MeGI or SiMeGI in Nicotiana tabacum indicates differentiation of their protein functions. Constitutive induction of MeGI under the control of the CaMV35S promoter resulted in severely dwarfed plants and repressed androecium development (Figure 5a-c and g-h, Supplemental Figure 7, Supplemental Table 5), consistent with previous results using the same construct in A. thaliana (Akagi et al., 2014). On the other hand, constitutive induction of SiMeGI under the control of the same promoter resulted in plants of only slightly reduced stature and normal androecium development in N. tabacum (Figure 5d-h, Supplemental Figure 7, Supplemental Table 6). The function of MeGI as a repressor of the androecium in persimmon is due to its ability to regulate PISTILLATA (PI) in the young developing androecium (Yang et al. 2019). The expression level of PI in N. tabacum was significantly down-regulated in the transgenic lines with MeGI, while the lines transformed with SiMeGI showed no changes in PI expression (Figure 5i-j). In Arabidopsis, which is only distantly related to Diospyros, high expression of SiMeGI typically did not result in altered flower morphology, although it occasionally resulted in inhibited androecium development (Supplemental Figure 8, Supplemental Tables 7 and 8).
Taken together, our results are consistent with the hypothesis that a role in androecium development is specific to MeGI. This is further supported by the fact that mutants of the MeGI/SiMeGI orthologs, which are normally expressed in flower primordia in other angiosperm species, do not affect androecium development (Komatsuda et al., 2007; Whipple et al., 2011; Sakuma et al., 2013). Our evolutionary analyses revealed that the positive selection that affected MeGI specifically did not occur on the region binding to the target cis-motifs, the homeobox domain (HB) (Figure 4d-e), but rather on the 5’ undefined region and on the leucine zipper region putatively forming heterodimers (Ariel et al., 2007; Sakuma et al., 2011). This was supported by the results of DNA affinity purification sequencing (DAP-Seq) (Bartlett et al. 2017) using MeGI (Yang et al. 2019) or SiMeGI fused to a Halo-tag, to identify which genes and/or motifs they target. The DAP-Seq reads were mapped to the D. lotus genome to characterize the accumulated recognition motifs (see Materials and Methods). We identified the motifs using the top 1,000 high-confidence peaks, and determined that the AATWATT sequence was enriched when using either MeGI (Yang et al. 2019) or SiMeGI as the probe (Figure 5k). This motif is commonly recognized by the Arabidopsis HD-ZIP1 genes as well (Khan et al. 2018, Yang et al. 2019). Thus, it is possible that the feminization role of MeGI resulted from either increased efficiency or novel affinity in its interactions with other factors. Finally, the native expression patterns of MeGI and SiMeGI in persimmon are also slightly different in developing buds and flower primordia (Figure 5l-n, Supplemental Figures 9 and 10). Specifically, MeGI exhibits higher expression than SiMeGI during the flower maturing stages (Figure 5l-n). This expression differentiation might also contribute to the MeGI-specific feminizing function.
Transitions towards dioecy are associated with duplication events
Our results suggest the following working hypothesis for the evolutionary path into dioecy in Diospyros. The Diospyros-specific WGD event, Dd-α, resulted in the appearance of MeGI and promoted the neofunctionalization of this gene into a dominant suppressor of the androecium, as a feminization factor. This was followed by a second, local duplication of MeGI to derive the Y-encoded OGI, which is a dominant repressor of MeGI (Figure 6). Interestingly, the information available so far from other dioecious species hints at the possibility that this type of pattern may have played a role in the evolution of dioecy in other species. For example, in the establishment of dioecy in garden asparagus, the Y-encoded putative sex determinant, SOFF, is thought to have originated from an Asparagus-specific gene duplication event, which was followed by the acquisition of its function as a dominant suppressor of feminization (SuF) (Harkess et al., 2017). Furthermore, the Y-encoded putative sex determinant in kiwifruit (Actinidia spp.), Shy Girl, which acts as a dominant suppressor of feminization, also arose via an Actinidia-specific duplication event (Akagi et al., 2018), probably one of the Actinidia-specific WGD events, Ad-α (Huang et al., 2013). These parallel paths towards the independent evolution of all three of these sex determinants are probably not coincidental, but consistent with the theoretical framework described above. In flowering plants, transition into separated sexuality requires the appearance and selection of a gain-of-function event in order to acquire a dominant suppressor(s), such as MeGI. Genome-wide duplication events provide good opportunities for such a scenario. The concentration of independent paleopolyploidization events around the K-Pg boundary is consistent with the adaptive evolution of plants in response to the substantial environmental changes, including mass extinction of their pollinators, that took place at the time (Wilf et al., 2006; Van de Peer et al., 2017).
A selfing habit engendered by polyploidy would be advantageous, but protracted evolutionary success would be favored by an eventual return to outcrossing. The neofunctionalization of MeGI resulting in the acquisition of a lineage-specific new sexual system could be one of these adaptive strategies. This hypothesis is also consistent with the observed wide diversity of sex determination systems within plants.
Materials and Methods
Initial genome sequence assembly
Dormant buds of D. lotus cv. Kunsenshi-male were burst in the dark for 2 weeks to harvest chlorophyll-starved young leaves. High molecular weight DNA was extracted using the Genome-tip 100/G kit (QIAGEN, Tokyo, Japan), followed by purification using phenol/chloroform extraction. Libraries were size-selected using the Blue Pippin and the following size minimums: 12 kb (14 SMRT cells), 15 kb (34 SMRT cells) and 16 kb (12 SMRT cells). A total of 60 SMRT cells and 54 Gb of PacBio raw data were obtained using the PacBio RSII. Filtered subreads were pooled and the longest were retained for assembly, by removing all filtered subreads shorter than 12 kb. This resulted in approximately 32x coverage of the estimated 1 Gb haploid genome size. PacBio reads were assembled using Falcon, producing 3,417 primary contigs and 6,318 alternate contigs. Next, all contigs were assessed for the presence of contaminating sequences by aligning each contig to a custom database using BLASTN+ version 2.2.31+. The custom database contained the kiwifruit pseudomolecules (ftp://bioinfo.bti.cornell.edu/pub/kiwifruit/Kiwifruit_pseudomolecule.fa.gz), the A. thaliana chromosomes (ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_chromosome_files/), as well as the human draft genome and representative bacterial / archaeal genome databases (pre-formatted blast+ database ftp://ftp.ncbi.nlm.nih.gov/blast/documents/blastdb.html). Hits to the two contaminant databases were identified and used to remove sequences that were largely contaminant, or to trim those with non-contaminant sequences at least 10 kb long. After this step, 3,252 primary and 5,939 alternative contigs were retained. This set of contaminant-free contigs was next polished using quiver (version 2.3.0-140936) and default parameters. After this last step, 3,073 primary and 5,901 alternative contigs remained.
Illumina library construction and sequencing
1. Genomic libraries
Approximately 1.5 μg of genomic DNA was used for the construction of Illumina genomic libraries; the DNA was fragmented using NEBNext dsDNA Fragmentase (New England BioLabs; NEB) for 40–60 min at 37°C and cleaned using Agencourt AMPure XP (Beckman Coulter Genomics, Tokyo, Japan) for size selection. To select fragments ranging between 300 and 600 bp, 27 μl of AMPure was added to the 63 μl reaction. After a brief incubation at room temperature (RT), 90 μl of the supernatant was transferred to a new tube and 20 μl water and 30 μl AMPure were added. After a second brief incubation at RT, the supernatant was discarded and the DNA was eluted from the beads in 20 μl of water, as recommended. Next, DNA fragments were subjected to end repair using NEB’s End Repair Module Enzyme Mix, and A-base overhangs were added with Klenow (NEB), as recommended by the manufacturer. A-base addition was followed by AMPure cleanup using a 1.8:1 (v/v) AMPure:reaction volume ratio. Barcoded NEXTflex adaptors (Bioo Scientific, Austin, USA) were ligated at RT using NEB Quick Ligase (NEB) following the manufacturer’s recommendations. To remove contaminating self-ligated adapter dimers, libraries were size-selected using AMPure at a 0.8:1 (v/v) AMPure:reaction volume ratio to select for adapter-ligated DNA fragments at least 400 bp long. Half of the eluted DNA was enriched by PCR using Prime STAR Max (Takara, Tokyo, Japan) under the following conditions: 30 s at 98°C, 10 cycles of 10 s at 98°C, 30 s at 65°C and 30 s at 72°C, and a final extension step of 5 min at 72°C. Enriched libraries were purified with AMPure (0.7:1 v/v AMPure to reaction volume), and quality and quantity were assessed using the Agilent BioAnalyzer (Agilent Technologies, Tokyo, Japan) and Qubit fluorometer (Invitrogen, Waltham, USA). Libraries were sequenced using Illumina’s HiSeq 2500 or HiSeq 4000 (150-bp paired-end reads).
2. GBS/ddRAD-Seq libraries
Two F1 mapping populations, derived from crosses between two D. lotus individuals, Kunsenshi-male and Kunsenshi-female, and between two D. lotus individuals, Kunsenshi-male and Budogaki-female, were employed for ddRAD-Seq (Peterson et al., 2012) and GBS (Elshire et al., 2011) analyses to construct genetic linkage maps. The former and latter mapping populations were named KK (n = 314) and VM (n = 119), respectively. Genomic DNA was extracted from the leaves of each line using the CTAB method. The ddRAD-Seq libraries for KK and VM were constructed using the restriction enzymes PstI and MspI (Shirasawa et al., 2016), while the GBS library for KK was prepared using PstI (Elshire et al., 2011).
3. mRNA libraries
Developing buds and flowers from two D. lotus individuals, Kunsenshi-male and Kunsenshi-female, were harvested from June to April to cover the annual cycle of leaves/flower development. Total RNA was extracted using the Plant RNA Reagent (Invitrogen) and purified by phenol/chloroform extraction. Five micrograms of total RNA was processed in preparation for Illumina Sequencing, according to a previous report (Akagi et al., 2014). In brief, mRNA was purified using the Dynabeads mRNA purification kit (Life Technologies, Tokyo, Japan). Next, cDNA was synthesized via random priming using Superscript III (Life Technologies) followed by heat inactivation for 5 min at 65°C. Second-strand cDNA was synthesized using the second-strand buffer (200 mM Tris–HCl, pH 7.0, 22 mM MgCl2 and 425 mM KCl), DNA polymerase I (NEB, Ipswich, USA) and RNaseH (NEB) with incubation at 16°C for 2.5 h. Double-stranded cDNA was purified using AMPure with a 0.7:1 (v/v) AMPure to reaction volume ratio. The resulting double-stranded cDNA was subjected to fragmentation and library construction, as described above, for genomic library preparation. Ten cycles of PCR enrichment were performed using the method described above. The constructed libraries were sequenced on Illumina’s HiSeq 4000 sequencer (50-bp single-end reads).
4. DAP-Seq libraries
The DAP genomic DNA libraries were prepared as previously described (O’Malley et al., 2016, Bartlett et al., 2017, Yang et al. 2019). Briefly, the Covaris M220 ultrasonicator (with the manufacturer-recommended setting) was used to fragment gDNA to an average size of 200 bp. The resulting fragmented gDNA was ligated to the NEXTflex adaptors (Bioo Scientific, Austin, USA) as described, to make genomic libraries. The full-length SiMeGI cDNA was cloned into the pDONR221 vector (Life Technologies) and then transferred to pIX-Halo using LR clonase II (Life Technologies) to generate pIX-Halo-SiMeGI. pIX-Halo-MeGI was constructed previously (Yang et al. 2019). The N-terminally Halo-tagged MeGI and SiMeGI were produced using the TNT SP6 Coupled Wheat Germ Extract System (Promega, Fitchburg, WI, USA) and purified with Magne HaloTag beads (Promega). A total of 50 ng of DAP gDNA library was incubated with Halo-tagged MeGI or SiMeGI at room temperature for 1 h.
5. Sequencing
The ddRAD-Seq sequences were obtained at the Kazusa DNA Research Institute. The GBS sequences were obtained from the Genomic Diversity Facility (Cornell University). All other Illumina sequencing was conducted at the Vincent J. Coates Genomics Sequencing Laboratory at UC Berkeley, and the raw sequencing reads were processed using custom Python scripts developed in the Comai laboratory and available online (http://comailab.genomecenter.ucdavis.edu/index.php/Barcoded_data_preparation_tools), as previously described. In brief, reads were split based on index information and trimmed for quality (average Phred sequence quality > 20 over a 5 bp sliding window) and adaptor sequence contamination. A read length cut-off of 35 bp was applied to mRNA reads. Sequencing analysis of ddRAD-Seq libraries was performed at the Kazusa DNA Research Institute, and data processing was conducted as described in Shirasawa et al. (2016). All samples used to generate Illumina sequences are listed in Supplemental Table 10.
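The quality-trimming rule described above can be sketched in Python as follows. This is not the Comai laboratory script itself. The function names are hypothetical, and the exact behavior (scanning from the 5' end and cutting at the first 5 bp window whose mean Phred score is not above 20) is an assumed reading of the rule stated in the text. Adaptor removal and index splitting are not shown.

```python
def phred33(qual_string):
    """Decode a FASTQ quality string (Phred+33 encoding) into integer scores."""
    return [ord(ch) - 33 for ch in qual_string]


def trim_by_quality(seq, quals, window=5, min_mean_q=20.0):
    """Cut the read at the first window whose mean quality is not > min_mean_q.

    Reads shorter than the window are returned unchanged (an assumption).
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    for start in range(len(seq) - window + 1):
        mean_q = sum(quals[start:start + window]) / window
        if mean_q <= min_mean_q:
            return seq[:start], quals[:start]
    return seq, quals


def passes_length(seq, min_len=35):
    """Apply the minimum read length cut-off (35 bp for mRNA reads)."""
    return len(seq) >= min_len


if __name__ == "__main__":
    # Hypothetical read: 40 high-quality bases followed by 10 low-quality bases.
    seq = "ACGT" * 12 + "AC"
    quals = phred33("?" * 40 + "+" * 10)
    trimmed_seq, trimmed_quals = trim_by_quality(seq, quals)
    print(len(trimmed_seq), passes_length(trimmed_seq))
```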
Gene prediction and genome/genes annotation
The RNA-Seq data for gene prediction were obtained from developing buds and flowers from D. lotus Kunsenshi-male at the following eight time points from 2013 to 2015 (June, July, August, October, January, March, early April, and late April) to cover the annual cycle of leaves/flower development. The RNA-Seq reads were trimmed according to previous reports (Akagi et al., 2014). The cleaned reads were mapped onto the scaffolds of DLO_r1.1 using TopHat 2.0.14 (Trapnell et al., 2009), and the BAM files obtained were used for the BRAKER1 1.9 pipeline (Hoff et al., 2016). In the pipeline, GeneMark-ET 4.32 (Lomsadze et al., 2018) and Augustus 3.1 (Stanke and Waack, 2003) were used to construct the training set, and Augustus 3.1 was used for the gene prediction, using the training set. Genes were compared to the UniProtKB (http://www.uniprot.org/uniprot/) and Araport11 (Krishnakumar et al., 2015) peptide sequences using BLASTP with an E-value cutoff of 1E-10. Genes that were similar to those in the databases were categorized as “highly confident” (HC). Analysis of the conservation of the single-copy genes was conducted using BUSCO v1 (Simão et al., 2015). Repeat sequences were detected using RepeatScout 1.0.5 (Price et al., 2005) and RepeatMasker 4.0.6 (http://www.repeatmasker.org) against the Repbase database (Bao et al., 2015), according to the method used previously (Hirakawa et al., 2014). The HC genes on the primary scaffolds (DLO_r1.1 primary) were compared to the genes of Actinidia chinensis (kiwifruit; 39,040 genes (Huang et al., 2013)), Vitis vinifera (grape; 29,927 genes (IGGP 12x.31) (Jaillon et al., 2009)), Solanum lycopersicum (tomato; 34,789 genes (ITAG 3.10) (The Tomato Genome Consortium, 2012)) and Arabidopsis thaliana (27,655 genes (Araport11)) using OrthoMCL 2.0.9. To estimate the divergence time between D. lotus, A. chinensis, V. vinifera, and A. thaliana, the single copy genes conserved amongst all four species were aligned by MUSCLE 3.8.31 (Edgar, 2004).
InDels in the alignment were eliminated using Gblocks 0.91b (Castresana, 2000), and the sequences were concatenated by species and used to construct the phylogenetic tree using the Maximum Likelihood method using MEGA 7.0.26 (Tamura et al., 2013) with the Jones-Taylor-Thornton (JTT) model as the substitution model. The divergence time was estimated based on that between A. chinensis and V. vinifera (117 MYA) published in TIMETREE (http://www.timetree.org).
Construction of the persimmon database
The sequence data obtained were released in the form of the PersimmonDB (http://persimmon.kazusa.or.jp). In the database, BLAST searches can be conducted against the scaffolds (DLO_r1.0), pseudomolecules (DLO_r1.0_pseudomolecules), cds (DLO_r1.1_cds), and pep (DLO_r1.1_pep). Keyword searches are available against the results of the similarity searches against TrEMBL and peptide sequences in Araport11. The genomic and genic sequences, GFF files of the scaffolds and pseudomolecules, and BED files can be downloaded from the database. The scaffolds are also available under accession numbers BEWH01000001-BEWH01008974 (8,974 entries) in DDBJ. The raw sequence data are also available in DDBJ under accession numbers DRA006168 (Illumina WGS for D. lotus Kunsenshi-male and female), DRA006169-DRA006176 (ddRAD-Seq/GBS for KK and VM populations), DRA006177 (RNA-Seq for D. lotus Kunsenshi-male), and DRA006182-DRA006184 (PacBio WGS for D. lotus Kunsenshi-male).
Genetic anchoring of the scaffold using two mapping populations
The sequence reads from the ddRAD-Seq and GBS libraries were mapped onto the primary contigs of the DLO_r1.0 reference sequence using Bowtie 2 (version 2.2.3) (Langmead and Salzberg, 2012). SNP calling was performed using the mpileup command of SAMtools (version 0.1.19) (Li et al., 2009) and the view command of BCFtools (Li et al., 2009). High-confidence SNPs were selected using VCFtools (version 0.1.12b) (Danecek et al., 2011) with the following parameters: ≥10× coverage of each sample (--minDP 10); >999 SNP quality value (--minQ 999); ≥0.2 minor allele frequency (--maf 0.2); and <0.5 missing data rate (--max-missing 0.5). Totals of 3,535 and 4,027 high-confidence SNPs were obtained in the KK and VM populations, respectively. Genotype information for all lines was prepared for the CP mode of JoinMap (version 4) and classified into groups using the Grouping Module of JoinMap with LOD scores of 4 to 7. Marker order and relative map distances were calculated using its regression-mapping algorithm with the following parameters: Haldane’s mapping function, ≤0.35 recombination frequency, and ≥2.0 LOD score. LPmerge (version 1.5) (Endelman et al., 2014) was used to integrate the linkage maps into a single consensus map. To construct pseudomolecule sequences, scaffolds assigned to the genetic map for Kunsenshi-male, the cultivar used for the genome sequencing analysis, were ordered and oriented in accordance with marker order if at least two marker loci were mapped on a single scaffold. Otherwise, in the case of a single marker on a scaffold, the orientation of the sequence was recorded as “unknown”.
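The orientation rule in the last two sentences can be illustrated with the short Python sketch below. It is a simplified reading of the rule, not the procedure actually used. The function name and data layout are hypothetical, orientation is inferred only from the markers at the two physical extremes of a scaffold, and scaffolds whose markers share the same genetic position are treated as “unknown” (an assumption, since the text does not cover that case).

```python
def orient_scaffold(markers):
    """Infer scaffold orientation from its mapped markers.

    markers: list of (physical_position_bp, genetic_position_cM) tuples
             for markers located on one scaffold.
    Returns "+" (same direction as the map), "-" (reversed) or "unknown".
    """
    if len(markers) < 2:
        # A single marker can place a scaffold but cannot orient it.
        return "unknown"
    ordered = sorted(markers)  # sort by physical position on the scaffold
    first_cm = ordered[0][1]
    last_cm = ordered[-1][1]
    if last_cm > first_cm:
        return "+"
    if last_cm < first_cm:
        return "-"
    # Markers at the same genetic position carry no orientation signal.
    return "unknown"


if __name__ == "__main__":
    # Hypothetical scaffolds with marker coordinates (bp, cM).
    scaffolds = {
        "scaffold_A": [(12_000, 10.5), (480_000, 14.2)],
        "scaffold_B": [(8_000, 33.1), (250_000, 30.4), (610_000, 29.8)],
        "scaffold_C": [(95_000, 51.0)],
    }
    for name, marks in scaffolds.items():
        print(name, orient_scaffold(marks))
```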
Comparative genomics
Whole-genome resequencing analysis of the Kunsenshi-male and female individuals was performed as described in Shirasawa et al. (2017). Paired-end sequence reads were obtained from the male and female lines with Illumina NextSeq, then trimmed and filtered based on quality score using Prinseq (Schmieder and Edwards, 2011) and on base similarity to the adapter sequence, AGATCGGAAGAGC, using fastx_clipper in the FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit). The resulting reads were mapped onto the primary contigs of the DLO_r1.0 reference sequence with Bowtie 2, and single nucleotide polymorphisms were detected with SAMtools mpileup (Li et al., 2009) and filtered using VCFtools (Danecek et al., 2011) under the conditions of sequence depth of ≥10 in each line (--minDP 10) and mapping quality of >200 at each SNP locus (--minQ 200). The effects of SNPs on gene function were predicted with SnpEff (Cingolani et al., 2012), which assigned the SNPs to its four predefined impact categories: high, moderate, modifier, and low. Synteny relationships of the genome structures were predicted with the PROmer program of the MUMmer package (Delcher et al., 2002) between Diospyros (this study) and Actinidia (Huang et al., 2013), as well as within the Diospyros genus. The results were filtered with delta-filter (parameters of AAA and BBB), and visualized using Circos (Krzywinski et al., 2009).
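As an illustration of the depth and quality thresholds above, the following Python sketch applies them to a single VCF data line. It is a simplification and not the VCFtools implementation: it assumes tab-separated VCF columns with a per-sample DP field in FORMAT, reads the quality value from the QUAL column, and discards the whole site when any sample falls below the depth threshold. The function name and the example line are hypothetical.

```python
def passes_filters(vcf_line, min_depth=10, min_qual=200.0):
    """Return True if a VCF data line meets the depth and quality thresholds."""
    fields = vcf_line.rstrip("\n").split("\t")
    # Need the 9 fixed columns plus at least one sample, and a QUAL value.
    if len(fields) < 10 or fields[5] == ".":
        return False
    if float(fields[5]) <= min_qual:
        return False
    format_keys = fields[8].split(":")
    if "DP" not in format_keys:
        return False
    dp_index = format_keys.index("DP")
    # Every sample must reach the minimum read depth.
    for sample in fields[9:]:
        values = sample.split(":")
        if dp_index >= len(values) or not values[dp_index].isdigit():
            return False
        if int(values[dp_index]) < min_depth:
            return False
    return True


# Hypothetical two-sample record (male and female lines).
example = "\t".join(
    ["contig1", "100", ".", "A", "G", "250", ".", "DP=37",
     "GT:DP", "0/1:15", "1/1:22"]
)
print(passes_filters(example))
```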
Detection of genetic diversity within paralogs
Genes annotated as potential transposable elements by blastn/blastp against the TAIR/nr databases, and potentially repetitive genes that produced >5 homologous genes in the D. lotus genome (<1e−20 in blastp), were discarded. Gene pairs showing significant sequence similarity (<1e−20 in blastp) and their orthologs from three species, Actinidia, Solanum and Vitis, were subjected to in-codon-frame alignment using their protein and nucleotide sequences with Pal2Nal and MAFFT ver. 7 under the L-INS-i model. The resulting alignments were analyzed with Mega v. 6 to estimate the Jukes-Cantor-corrected values of synonymous (dS) and non-synonymous (dN) substitutions and the index of evolutionary rate (dN/dS). The four-fold degenerate sites were extracted from the alignments with PAML (icode=11), and their pairwise transversion rates (4DTv) were calculated according to a previous report (Tuskan et al., 2006). To estimate the divergence time between the gene pairs, we adopted an estimated rate of 2.81 × 10⁻⁹ substitutions per synonymous site per year, as reported in Actinidia (Shi et al., 2010).
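The two calculations in this paragraph can be sketched in Python as follows. The Jukes-Cantor correction is the standard formula d = −(3/4) ln(1 − 4p/3). The conversion from dS to time assumes the common relation T = dS / (2r) for two lineages diverging from a shared ancestor; this factor of two is an assumption of the sketch and is not stated explicitly above. Function and constant names are hypothetical.

```python
import math

# Rate adopted in the text: substitutions per synonymous site per year.
SUBSTITUTION_RATE = 2.81e-9


def jukes_cantor(p):
    """Jukes-Cantor-corrected distance from an observed proportion of differences."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def divergence_time_years(ds, rate=SUBSTITUTION_RATE):
    """Divergence time of a gene pair, assuming T = dS / (2 * rate)."""
    return ds / (2.0 * rate)


ds = jukes_cantor(0.30)           # hypothetical observed synonymous difference
print(ds)                         # corrected dS is larger than the raw value
print(divergence_time_years(ds))  # in years
```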
Evolutionary analysis of the paralogs derived from the Dd-α WGD event
To search for signs of positive selection, aligned nucleotide sequences of each gene pair and an outgroup ortholog from either the Actinidia, Solanum or Vitis genome were subjected to a codon-based test for positive selection using PAML (Yang, 1997). The statistical significance of positive selection on branches was evaluated using the likelihood ratio test of the null hypothesis that dN/dS = 1. Site-specific positive selection was assessed by Bayes Empirical Bayes analysis. To examine the positively selected sites common across all three outgroups, in-frame alignments of the D. lotus gene pairs with the orthologs from all of the Actinidia, Solanum and Vitis genomes were used to construct evolutionary topologies by the ML method in Mega v. 6, using the general time reversible (+I+G) model. Based on these alignments and topologies, the branch- and site-specific positive selection test was also performed using PAML.
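The likelihood ratio test mentioned above compares the log-likelihoods of a null model (dN/dS fixed at 1) and an alternative model (dN/dS free). A minimal Python sketch is given below, assuming one degree of freedom and a plain chi-square reference distribution; the log-likelihood values are hypothetical inputs that would in practice be read from the PAML output, and the function name is illustrative. For one degree of freedom the chi-square tail probability equals erfc(sqrt(x/2)), which avoids any dependency outside the standard library.

```python
import math


def lrt_one_df(lnl_null, lnl_alt):
    """Likelihood ratio test with 1 degree of freedom.

    Returns (test statistic, p-value) for 2 * (lnL_alt - lnL_null).
    """
    # Guard against tiny negative values caused by optimisation noise.
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # Chi-square survival function for df = 1.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical log-likelihoods for one gene pair.
statistic, p_value = lrt_one_df(lnl_null=-1000.0, lnl_alt=-998.0)
print(statistic, p_value)
```

With these inputs the statistic is 4.0, which corresponds to a p-value just under 0.05.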
To define the phylogenetic relationship between the MeGI/SiMeGI-like orthologs/paralogs in angiosperms, genes showing significant homology (<1e−10 in blastp analysis) to an HD-ZIP1 gene, OsHOX4 from Oryza sativa, which was previously used as the outgroup gene for the MeGI clade (Akagi et al., 2014), were collected from the Diospyros lotus, Solanum lycopersicum, Arabidopsis thaliana, Oryza sativa, and Zea mays genomes. A total of 174 protein sequences from these genomes, and that of Vrs1 from barley (Komatsuda et al., 2007), were aligned using MAFFT ver. 7, followed by manual pruning with SeaView. The pruned alignment was subjected to the NJ approach using Mega v. 6, with the JTT model, to construct a phylogenetic tree (Supplemental Figure 6).
To assess selective pressure on MeGI and SiMeGI, their alleles from other members of the Ebenaceae family (Diospyros and Euclea genera) and their orthologs in the Actinidia, Solanum or Vitis genomes were subjected to in-codon-frame alignment by MAFFT ver. 7, followed by an ML approach using Mega v. 6, with the HKY+G model, to construct an evolutionary topology. The putative ancestral sequences of the MeGI and SiMeGI origins in the Ebenaceae family, and the sequences in the most recent common ancestor (MRCA) of the order Ericales and of the Asterids, were estimated using Mega. Informative SNPs in the aligned sequences were analyzed by DnaSP 5.1 (Librado and Rozas, 2009) and used to calculate a series of window-average dN/dS values, from the start codon (ATG) in a 150-bp window with a 30-bp step size, until the walking window reached the stop codon. To assess differentiation of expression patterns between the Dd-α-derived paralog pairs, we conducted Pearson’s product moment correlation analysis and Fisher’s exact test. Differentiation between the developmental stages of the buds/flowers throughout the annual cycle was examined with the “cor.test” function in R (with the “pearson” method), using mRNA-Seq transcriptome data from datapoints (Supplemental Dataset 3). Differentiation of expression pattern between male and female flowers was examined for each paralog pair using a 2×2 Fisher’s exact test (“fisher.test” function in R), using mRNA-Seq transcriptome data from the early developing stage and the maturing stage, respectively (Supplemental Dataset 3).
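The window scan used for the dN/dS analysis can be illustrated by generating the window coordinates in Python. This sketch only enumerates the 150-bp windows at 30-bp steps; it does not reproduce the dN/dS calculation performed in DnaSP. The function name is hypothetical, coordinates are 0-based and half-open, and dropping a final window that would extend past the end of the coding sequence is an assumption.

```python
def sliding_windows(cds_length, window=150, step=30):
    """Return (start, end) coordinates of windows along a coding sequence.

    Coordinates are 0-based, half-open, counted from the first base of the
    start codon. Windows that would run past cds_length are not emitted
    (assumption of this sketch).
    """
    windows = []
    start = 0
    while start + window <= cds_length:
        windows.append((start, start + window))
        start += step
    return windows


# Hypothetical 300-bp coding sequence.
for start, end in sliding_windows(300):
    print(start, end)
```

For a 300-bp sequence this yields six windows, from (0, 150) to (150, 300).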
Transformation of MeGI and SiMeGI
Full-length sequences of the MeGI and SiMeGI transcripts were amplified by PCR using PrimeSTAR Max (TaKaRa) from cDNA synthesized from RNA of developing flower buds of D. lotus cv. Kunsenshi-male (Supplemental Tables 10 and 11). The amplicons were cloned into the pGWB2 vector to place the genes under the control of the CaMV35S promoter. We constructed pGWB2-MeGI and pGWB2-SiMeGI using the Gateway system (Invitrogen) with the pENTR/D-TOPO cloning kit and LR clonase. Tobacco plants (N. tabacum) cv. Petit Havana SR1 were grown in vitro under white light with 16-h-light and 8-h-dark cycles at 22°C until transformation. Each binary construct was introduced into the A. tumefaciens strain EHA101. Young petioles and leaves of tobacco plants were transformed by the leaf disk method as previously described (Akagi et al., 2014). Transgenic plants were selected on Murashige and Skoog medium supplemented with 100 μg/mL kanamycin. Pollen tube germination was assessed 6 h after placing the pollen grains on 15% sucrose/0.005% boric acid/1.0% agarose media at 25°C. The pollen germination ratio was calculated as the average percentage across batches of 200 pollen grains from the first three flowers.
RNA in situ hybridization
RNA in situ hybridization was performed as previously described (Esumi et al., 2007), with minor modifications. Briefly, bud samples were fixed in FAA (1.8% formaldehyde, 5% acetic acid, 50% ethanol), dehydrated using an ethanol:t-butanol series, and then embedded in paraffin. The embedded tissues were sliced into ca. 10-μm sections, and the sections were mounted on FRONTIER coated glass slides (Matsunami Glass Ind., Japan). Paraffin was removed with xylene, and the tissue sections were rehydrated in an ethanol series. The tissue sections were then incubated in a Proteinase K solution (700 U/mL Proteinase K, 50 mM EDTA, 0.1 M Tris-HCl pH 7.5) for 30 min at 37°C, followed by acetylation with acetic anhydride (0.25% acetic anhydride in 0.1 M triethanolamine solution) for 10 min. Full-length MeGI and SiMeGI cDNA sequences were each cloned into the pGEM-T Easy vector (Promega, WI, USA) to synthesize the DIG-labelled probes. Antisense RNA probes were synthesized using the DIG-labeling RNA synthesis kit (Roche, Switzerland), according to the manufacturer’s instructions. The probe solution including RNaseOUT (Thermo Fisher Scientific, Waltham, USA) was applied to the slides and covered with parafilm. Hybridization was performed at 48°C for >16 h. For detection, 0.1% Anti-Digoxigenin-AP Fab fragments (Sigma-Aldrich, St. Louis, USA) were used as the secondary antibody, followed by staining with NBT/BCIP solutions.
Author contributions
T.A., I.M.H. and L.C. conceived the study. T.A. and I.M.H. primarily designed the experiments. T.A. and K.S. performed the experiments. T.A., K.S., H.N., H.H. and I.M.H. analyzed the data. T.A. and R.T. initiated and maintained the plant materials. T.A., I.M.H. and L.C. drafted the manuscript. All authors participated in data interpretation, edited the manuscript and approved the final manuscript.
Acknowledgements
We thank Dr Deborah Charlesworth for the extensive discussions on the interpretation of our results and the many thoughtful pieces of advice provided throughout this work. We thank Meric Lieberman (UC Davis Genome Center) for bioinformatics support, and Ayaka Sugimoto and Yang Ho-Wen (Graduate School of Agriculture, Kyoto University) for experimental support. Some of this work was performed at the Vincent J. Coates Genomics Sequencing Laboratory at UC Berkeley, supported by NIH S10 OD018174 Instrumentation Grant. This work was supported by PRESTO, Japan Science and Technology Agency (to TA), and Grant-in-Aid for Young Scientists (A) (no. 26712005 to TA), for Challenging Exploratory Research (no. 15K14654 to TA), Grant-in-Aid for Scientific Research on Innovative Areas No. J16H06471 to TA from JSPS, and by the National Science Foundation (NSF) IOS award under Grant No. 1457230 (to IMH and LC).