Abstract
The Cucurbita genus contains several economically important species in the Cucurbitaceae family. Interspecific hybrids between C. maxima and C. moschata are widely used as rootstocks for other cucurbit crops. We report high-quality genome sequences of C. maxima and C. moschata and provide evidence supporting an allotetraploidization event in Cucurbita. We are able to partition the genome into two homoeologous subgenomes based on different genetic distances to melon, cucumber and watermelon in the Benincaseae tribe. We estimate that the two diploid progenitors successively diverged from Benincaseae around 31 and 26 million years ago (Mya), and the allotetraploidization happened earlier than 3 Mya, when C. maxima and C. moschata diverged. The subgenomes have largely maintained the chromosome structures of their diploid progenitors. Such long-term karyotype stability after polyploidization is uncommon in plant polyploids. The two subgenomes have retained similar numbers of genes, and neither subgenome is globally dominant in gene expression. Allele-specific expression analysis in the C. maxima × C. moschata interspecific F1 hybrid and the two parents indicates the predominance of trans-regulatory effects underlying expression divergence of the parents, and detects transgressive gene expression changes in the hybrid correlated with heterosis in important agronomic traits. Our study provides insights into plant genome evolution and valuable resources for genetic improvement of cucurbit crops.
Background
Pumpkin, squash and gourd species (Cucurbita spp.,) belong to the Cucurbitaceae (cucurbit) family and consist of at least five domesticated and more than ten wild species1,2. C. maxima Duch. and C. moschata Duch. are two of the three most economically important cultivated Cucurbita species3. Archaeological data indicate that C. maxima and C. moschata originated in southern and northern South America, respectively, and were introduced to the Old World after European contact with the Americas2. C. maxima and C. moschata then underwent a great diversification in their China-Japan and India-Myanmar secondary domestication centers, respectively2,4. These Cucurbita crops are used as a staple food in many developing countries, and are consumed all over the world, mainly for their mature fruits and seeds, which are a rich source of nutritional compounds2. In addition to their culinary uses, the fruits are also used as ornaments and carved into decorative lanterns around Halloween. Nowadays, C. maxima and C. moschata are cultivated worldwide. The annual production of all Cucurbita species reached 25.2 million tons in 2014, with China and India leading in world production (http://faostat3.fao.org/).
The economic value of C. maxima and C. moschata is increasing, as they are also used as rootstocks for other cucurbit crops, including watermelon (Citrullus lanatus), cucumber (Cucumis sativus) and melon (Cucumis melo), to enhance tolerance to soilborne diseases and abiotic stresses5. The interspecific hybrid developed from a cross between C. maxima cv. Rimu and C. moschata cv. Rifu, ‘Shintosa’, is a popular rootstock for different cucurbits, and especially preferred in watermelon grafting for its Fusarium wilt resistance, cold-tolerance, and the ability to increase fruit weight, fruit quality and plant vigor6. C. maxima, in general, has better fruit flavor and texture than C. moschata, but is not as resistant as C. moschata to diseases and abiotic stresses. Interest in interspecific crossing between C. maxima and C. moschata has centered on transferring high-quality flesh traits from C. maxima to C. moschata and on introducing resistance traits from C. moschata7. However, the molecular mechanisms underlying these agronomical traits in these Cucurbita interspecific hybrids remain largely unexplored.
Compared with other cucurbit genera, Cucurbita has a higher chromosome number1 (2n = 40). Early cytogenetic and isozyme studies as well as a recent synteny analysis suggest a possible whole-genome duplication in this genus8-10. However, genomic resources are very limited in Cucurbita and little information is available regarding their genome and gene evolution. Here, we present the genome sequences of C. maxima and C. moschata, which provide evidence for an ancient allotetraploidization event in these two genomes, likely involving a hybridization between two highly diverged diploid progenitors. In the same nucleus, the two distinct subgenomes have largely maintained the chromosomal structures of the progenitors. Both subgenomes have experienced similar levels of gene loss, and neither is globally dominant in gene expression levels, with many homoeologous genes having divergent expression patterns. The draft genome sequences of C. maxima and C. moschata enabled us to analyze the allele-specific expression in the interspecific F1 hybrid, ‘Shintosa’, and determine the regulatory mechanisms underlying the expression divergence between C. maxima and C. moschata, as well as gene expression changes occurring upon interspecific hybridization that are correlated with heterosis in the Cucurbita F1 hybrid.
Results
Genome assembly, anchoring and quality evaluation
The genomes of C. maxima cv. Rimu and C. moschata cv. Rifu, the two parents of the ‘Shintosa’ rootstock, were sequenced and assembled. A total of 109.3 and 80.1 Gb of high-quality cleaned Illumina paired-end and mate-pair reads were generated for C. maxima and C. moschata, respectively, representing 283× and 215× coverage of their genomes (Table S1). Based on the 17-mer depth distribution analyses of the sequenced reads (Figure. S1), the genome sizes of C. maxima and C. moschata were estimated to be 386.8 and 372.0 Mb, respectively. De novo assemblies resulted in a draft genome of 271.4 Mb for C. maxima and 269.9 Mb for C. moschata (Fig. 1; Figure. S2), representing 70.2% and 72.6% of their estimated genome sizes, respectively. The assembled C. maxima genome had scaffold and contig N50 sizes of 3.7 Mb and 40.7 kb, respectively, and the N50 sizes for the scaffolds and contigs of the C. moschata assembly were 4.0 Mb and 40.5 kb, respectively (Table 1).
We constructed a C. maxima intraspecific genetic map containing 2,030 SNPs spanning 3,008.2 cM across 20 linkage groups (LGs) and a more saturated interspecific map (13,779 SNPs, 4,053.9 cM, 20 LGs) (Table S2). Both the intraspecific and interspecific maps and a previous map constructed using the same intraspecific population11 were largely collinear with all 20 linkage groups having shared scaffolds. Using the genetic maps of C. maxima (both the intra- and interspecific), 92 scaffolds (211.4 Mb, 77.9% of the assembly) were anchored to the 20 LGs, among which 64 scaffolds (201.2 Mb, 74.14 of the assembly) were oriented (Figure. S3a; Table S3). Likewise, we also constructed a C. moschata intraspecific map that contained 3,487 SNPs grouped into 20 LGs with a total genetic distance of 3,376.1 cM, and an interspecific map that had 16,214 SNPs distributed in 20 LGs and an overall genetic distance of 2,233.5 cM (Table S2). Based on these two genetic maps, 98 C. moschata scaffolds (238.5 Mb, 88.4% of the assembly) were anchored, and 71 scaffolds (221.3 Mb, 82.0%) were oriented (Figure. S3b; Table S3).
To evaluate the quality of our assemblies, we first aligned genome reads back to the assembled genomes. About 93% and 87% of the reads used in de novo assemblies were mapped back to the C. maxima and C. moschata genomes, respectively, in a proper paired-end relationship (Table S4). We then mapped RNA-Seq reads to the assembly, and of all RNA-Seq reads derived from different tissues of C. maxima (including fruit, leaf, stem and root), 93.8- 95.8% were mapped to the genome. Similarly, 86.2-95.0% of C. moschata RNA-Seq reads were mapped (Table S5). Further evaluation using BUSCO12 revealed that 97.0% and 97.3% of the core eukaryotic genes were covered by genomes of C. maxima and C. moschata, respectively, and 95.5% and 95.9% were completely covered (Table S6). In summary, the high coverage of the genome and RNA-Seq reads as well as the core eukaryotic genes indicated the high quality of the assembled C. maxima and C. moschata genomes.
Repeat sequence prediction and gene annotation
Repeat sequences accounted for more than 40% of both C. maxima and C. moschata genome assemblies (Table S7). Among these repeats, 69.9% in C. maxima and 62.9% in C. moschata were annotated as long terminal repeat (LTR) retrotransposons, and predominantly, the copia- and gypsy-type LTRs. Similar findings have been reported in the genomes of other species in the Cucurbitaceae, including cucumber13, melon14 and watermelon15.
We predicted 32,076 and 32,205 protein-coding genes in the C. maxima and C. moschata genomes, respectively, much higher than numbers of genes predicted in genomes of cucumber 16 (23,248), melon14 (27,427) and watermelon15 (23,440). Biological functions were assigned to 26,993 (84.2%) genes in C. maxima and 27,230 (84.6%) in C. moschata. A total of 2,414 and 2,336 transcription factors were identified in C. maxima and C. moschata, respectively, numbers that are close to double those in the other sequenced cucurbit genomes (Table S8). We identified 30 and 57 disease resistance genes encoding nucleotide-binding site leucine-rich repeat (NBS-LRR) proteins in C. maxima and C. moschata, respectively (Table S9). Low copy number of disease resistance genes is known for other cucurbit genomes17. Orthology and synteny analyses between C. maxima and C. moschata showed that most C. maxima NBS-LRR genes had at least one orthologue in C. moschata, while certain genes were unique to C. moschata (Table S9).
Two distinguishable paleo-subgenomes with nearly intact progenitor chromosome structures in C. maxima and C. moschata
We identified 195 paralogous syntenic blocks (7,773 homoeologous gene pairs) in C. maxima and 170 blocks (7,742 gene pairs) in C. moschata, with synonymous substitution rate (Ks) values peaking at 0.32 (Fig. 2a; Figure. S4 and S5), collectively covering 92.5% and 94.1% of the gene space, respectively. These homoeologous pairs clearly indicated that both C. maxima and C. moschata genomes underwent a whole-genome duplication event that was not observed in other sequenced cucurbits, including cucumber13, melon14, watermelon15 and bitter gourd (Momordica charantia) (Fig. 2a).
Whole-genome synteny analysis revealed pairs of C. maxima (or C. moschata) homoeologous regions shared between chromosomes corresponding to two subgenomes (Fig. 1; Figure. S4). Unlike the three subgenomes of Brassica rapa, which cannot be differentiated based on the Arabidopsis thaliana-B. rapa Ks values18, the two subgenomes of Cucurbita could be distinguished according to their different genetic distances to their orthologues in the Benincaseae species, including melon, cucumber and watermelon (Fig. 2b; Figure. S6 and S7). About 63.2% 58.7% and 68.3% of the C. maxima genomic regions (64.0%, 62.2% and 69.5% for C. moschata) syntenic to melon, cucumber and watermelon, respectively, were composed of homoeologous blocks with significantly different Cucurbita-Benincaseae Ks values. Except for chromosome 4, all other chromosomes contained only homoeologous blocks with either higher or lower Ks values (Fig. 2b; Table S10).
Based on these findings, we separated the entire chromosomes into two groups to represent the two paleo-subgenomes. Eight and 11 chromosomes were assigned to subgenomes A and B, respectively. Chromosome 4 was divided into three segments with two assigned to subgenome A and one to subgenome B (Figure. S8). The reconstructed subgenomes A and B of C. maxima contained 14,816 and 16,068 protein-coding genes, respectively (15,136 and 16,473 genes in C. moschata, respectively). Phylogenetic analysis clearly indicated an allotetraploid origin of Cucurbita from two progenitors that diverged from one another approximately 30.75 million years ago (Mya) (Fig. 2c). One ancestor gave rise to watermelon, cucumber and melon in the Benincaseae tribe, as well as to the B-genome progenitor of Cucurbita, which has not been sampled and may be extinct. No extant diploid species that could represent the A-genome progenitor have been sampled, so it is possible that this genome donor is also extinct. The oldest possible date for the polyploidy event is the date of divergence of Benincaseae from the B-genome of Cucurbita (approximately 26.28 Mya, Fig. 2c). There was extensive shared synteny between the two Cucurbita species with only a few inversions (Figure. S9), consistent with a shared polyploidy event. The divergence date of the two Cucurbita species (3.04-3.84 Mya, Fig. 2c) provides a minimum age for the allopolyploidy event. There was no significant difference of Ks values between C. maxima-C. moschata subgenome A and subgenome B orthologous gene pairs (Figure. S10), suggesting that the two subgenomes evolved with similar mutation rates after polyploidization. Furthermore, the distribution of Cucurbita-Benincaseae syntenic blocks with higher (or lower) Ks along an entire chromosome (except for chromosome 4) (Fig. 2b) indicated a low level of large-scale inter-chromosomal exchanges since the allotetraploidization event, which suggested karyotype stability in the polyploid Cucurbita genomes. This was further strongly supported by the fact that the same ancestral states of four chromosomes from progenitor B (Cucurbita chromosomes 3, 10, 13 and 15) were also found in the modern melon genome (Fig. 2b).
Genome and gene evolution after allotetraploidization
Following polyploidization, functionally redundant duplicated (homoeologous) genes often revert to single-copies through fixation of null mutations and fractionation mediated by intrachromosomal recombination, unless prevented by mechanisms such as gene dosage-balance constraints and functional diversification19. Recent studies have shown that gene loss after polyploidization can be biased towards one of the progenitor genomes, such as in Arabidopsis20, maize21, Brassiaca rapa18 and cotton22. We analyzed patterns of gene loss in C. maxima and C. moschata after the genome merger using high-confidence retained homoeologous genes and singletons (whose duplicated copies were lost) that had syntenic orthologues in both cucumber and watermelon. Unlike in the aforementioned plant species, in Cucurbita we observed random loss of genes (unbiased fractionation) from the two subgenomes. Seven pairs of putative homoeologous chromosomes retained similar numbers of genes, whereas slightly biased gene-loss patterns were detected for Chr08-Chr17 and Chr18-Chr04, 13, 16 pairs (Fig. 3a; Figure. S11; Table S11).
A total of 6,550 genes in C. maxima (6,639 in C. moschata) were assigned to 348 KEGG pathways, as expected, most of which contained similar numbers of genes from the two subgenomes. However, we did find that pathways of pentose and glucuronate interconversions, galactose metabolism, and starch and sucrose metabolism were composed of more subgenome B genes, and phenylalanine metabolism had more subgenome A genes (Table S12). The biased numbers were due to more genes from subgenome B encoding four enzymes, polygalacturonase (K01184), β-glucosidase (K01188), β-fructofuranosidase (K01193) and galacturan 1,4- α - galacturonidase (K01213) shared by the three carbohydrate metabolism pathways, and more genes from subgenome A encoding caffeoyl-CoA O-methyltransferase (K00588), tyrosine aminotransferase (K00815) and phenylalanine ammonia-lyase (K10775) in phenylalanine metabolism (Table S12; Figure S12).
We next tested the presence or absence of genome dominance by comparing expression levels between pairs of high-confidence duplicated genes. Using RNA-Seq data from fruit, leaf, stem and root, we found no evidence of overall biased expression between the two subgenomes (Figure. S13). This result should be expected given the lack of biased fractionation, and given that bias in gene expression is thought to be responsible for biased homoeologous gene loss23.
Among all the high-confidence homoeologous genes in C. maxima, 11,296 (5,648 pairs) were retained in two copies and 4,063 were singletons. Similarly, the C. moschata genome retained 11,572 duplicated genes (5,786 pairs) that resulted from the allotetraploidization, and 5,027 had become single-copy genes. Most of these retained duplicates (4,507 pairs) and singletons (3,633 genes) were shared between the two species, which is to be expected considering the relatively recent divergence of C. maxima and C. moschata. In addition to almost double the number of transcription factors in Cucurbita compared to watermelon, cucumber and melon (Table S8), enriched gene ontology (GO) categories for retained duplicates included response to unfolded protein, G-protein coupled receptor signaling pathway, and steroid hormone mediated signaling pathway (Table S13). The retention of these genes could be due to dosage-balance constraints since such genes often function in macromolecular complexes or regulatory cascades, and are prone to be co-retained to avoid detrimental consequences of disrupting stoichiometric relationships among interacting partners19,24,25. On the other hand, genes functioning in DNA and RNA metabolic processes were over-represented in the singletons (Table S13), consistent with previous findings on under-retention of genes with conserved biological functions24,25.
Evolutionary innovations in allopolyploids are provided by combining parental legacies and novel divergence in regulatory elements and/or coding regions and rewiring of duplicated regulatory networks upon, or subsequent to polyploidization26,27. For most duplicated genes, the specific expression pattern of a gene was not shown in its homoeologue (Fig. 3b; Figure. S14). About 49.0% of high-confidence duplicated gene pairs in C. maxima (48.6% in C. moschata) were differentially expressed by at least 2-fold. These results suggest extensive variation in gene expression patterns between the two subgenomes, which might well be caused by cis-regulatory divergence between the diploid progenitors, immediate alteration upon genome merger, as well as long-term evolutionary forces28,29.
Robust evidence of positive selection was detected between 49 pairs of duplicated genes in C. maxima and also 49 pairs in C. moschata (Table S14), which might have acquired new functions important for adaptation and/or increased Cucurbita fruit morphological complexity. Among these genes, 23 pairs were common between C. maxima and C. moschata, suggesting functional diversification of some homoeologous genes before the speciation of C. maxima and C. moschata. These genes included four in C. maxima and C. moschata encoding OVATE family proteins, whose expansion and functional divergence are present in many plant species, and members of this family are key regulators of fruit shape in tomato and pepper30, consistent with the highly diverse fruit shapes of Cucurbita species.
After polyploidization, retained duplicated genes may exhibit increased expression divergence between species compared to the single-copy genes31. We compared gene expression levels between C. maxima and C. moschata, and found that the proportions of differentially expressed genes were slightly higher for duplicated genes (Table S15). We detected positive selection between 1,784 pairs of orthologous genes in C. maxima and C. moschata (Table S16), among which 948 had high-confidence retained duplicates in both genomes (out of 9,109 pairs of such genes in the two genomes, 10.4%) and 301 pairs (out of 3,622, 7.8%) were singletons. These results suggested the duplicates might have contributed more to the functional divergence of orthologous genes between C. maxima and C. moschata.
Cucurbita lineage-specific gene expansion
Using protein sequences of C. maxima, C. moschata, three Benincaseae species (cucumber, melon and watermelon), grape and papaya, we detected 52 gene families that underwent significant expansion in the Cucurbita lineage (Table S17). These included genes encoding pectin lyases (OG0000048, OG0000060 and OG0000491), pathogenesis-related 1 proteins (OG0000041), terpene synthases (OG0000056), carbohydrate esterases (OG0001297), pectinesterase inhibitors (OG0000281), subtilisin-like serine protease inhibitors (OG0000148) and mitogen-activated protein kinases (OG0001303) that are predicted to function in plant cell wall modification during plant growth and fruit softening, and/or plant defense. The expanded defense-related genes could provide alternative defense mechanisms complementary to the low numbers of NBS-LRR genes in C. maxima and C. moschata. We also detected expansions of gene families that may control meristem activity and organ morphogenesis such as orthologues of Arabidopsis LONELY GUY genes (OG0000097) and CLAVATA3/ESR-RELATED 45 (OG0002365), RADIALIS-LIKE SANT/MYB genes (OG0000105), ROTUNDIFOLIA-like genes (OG0000137) and genes encoding NAC domain-containing proteins (OG0010138), CONSTANS-like proteins (OG0003099) and Gibberellin 3 beta-hydroxylase family proteins (OG0011381). The higher retention rate of these genes after small- and/or large-scale duplications suggested that they might underlie important adaptive traits during evolution to develop survival strategies. While most expanded families contained genes from both subgenomes, the expansions of some families were contributed by only one subgenome (Figure. S15; Table S17). For example, the pectin lyase (OG0000048) and terpenene synthase (OG0000056) genes were almost all from subgenome B, and the pectinesterase (OG0000281) and carbonhydrate esterase (OG0001297) genes were nearly exclusively from subgenome A. This might result from gene loss and/or tandem duplication specific to one subgenome before and/or after genome merger.
Expression divergence between C. maxima and C. moschata
Divergence in the regulatory mechanisms controlling gene expression is pivotal in evolution by providing the source of phenotypic diversity between divergent species. Successful interspecific hybridization between C. maxima and C. moschata offered an opportunity to evaluate the cis-and trans- contributions to their expression divergence through quantifying the total and allele-specific gene expression levels in the interspecific F1 hybrid ‘Shintosa’ and the two parents. About 22-32% of the genes (7,303 in fruit, 6,464 in leaf, 7,749 in stem and 9,371 in root) were differentially expressed between C. maxima and C. moschata (Table S18). Of these genes, 62% to 80% (4,528 in fruit, 4,691 in leaf, 6,223 in stem and 7,210 in root) were found to have only trans-regulatory effects. The remaining genes were classified as having only cis-effects (24.0%, 19.3%, 13.2% and 13.9% of the differentially expressed genes in fruit, leaf, stem and root, respectively) or both cis- and trans-effects (14.0%, 8.2%, 6.5% and 9.2%). This result suggested that trans-regulatory effect was the predominant cause of expression divergence between C. maxima and C. moschata. A general trend has been identified from previous studies that trans-effects account for more of the expression differences between less divergent species, whereas cis-effects primarily cause expression variation between more genetically distinct species32-34. The prevalence of trans-variation between C. maxima and C. moschata is consistent with genetic similarity due to their relatively recent divergence. Among the 1,784 pairs of C. maxima-C. moschata orthologous genes under significant positive selection (see above), 351 pairs were found to have significant cis-effects, indicating gene functional diversification accompanied by regulatory divergence.
Gene expression alteration in the interspecific F1 hybrid
Interspecific crosses play an important role in Cucurbita breeding for transferring favorable traits between species2. Hybridization between C. maxima and C. moschata created novel interactions between the parental alleles and a new regulatory environment in the F1 hybrid. Genes exhibiting dominant and transgressive expression patterns might underlie heterotic phenotypes and combined favorable characters from each parent, such as high carotenoid levels in fruit from C. maxima35, and better insect pest resistance and abiotic stress tolerance from C. moschata36,37. In the F1 hybrid, 4,002 (fruit), 6,718 (leaf), 6,732 (stem) and 7,067 (root) genes were expressed higher than or equal to the high parent (up-regulated), or lower than or equal to the low parent (down-regulated), representing 12.5-22.0% of the total genes. Among these genes, 49.6-75.0% (2,417 in fruit, 3,330 in leaf, 5,049 in stem and 4,590 in root) exhibited C. moschata-like expression, and 7.7-19.0% showed C. maxima-like expression (Fig. 4a, b). About 41.7-50.8% of the genes exhibiting dominant and transgressive expression patterns had significant trans-regulatory effects in different tissues, whereas 6.3-20.1% had significant cis-effects (Fig. 4c, d).
We identified GO terms enriched in the dominantly and transgressively expressed genes in the hybrid. In the hybrid root, enrichment of genes involved in defense response (GO:0006952) and response to heat (GO:0009408) was found among up-regulated genes (Fig. 4e), which could be related to the elevated stress tolerance of ‘Shintosa’. Interestingly, down-regulated genes in the hybrid root were significantly enriched with pathways related to active root growth, including auxin-signaling (GO:0009734), auxin-transport (GO:0060918), cell division (GO:0051301), cell growth (GO:0016049), cell wall organization/biogenesis (GO:0071554), and root hair cell development (GO:0080147) (Table S19), suggesting a tradeoff between defense response and growth processes, which has also been observed in Arabidopsis38,39. We also found that carotenoid biosynthesis (GO:0016117) was significantly over-represented in the up-regulated genes in the fruit flesh of the hybrid. A comparison of gene expression in the carotenoid biosynthetic pathway among the two parents and F1 showed that the expression levels of phytoene synthase (PSY) and lycopene β-cyclase (LYCB) genes were higher in the F1 fruit compared to the parents (Table S20; Figure. S16). The duplicated copies of PSY (and LYCB) were homoeologues resulting from the allotetraploidization. Only one of the homoeologues was highly expressed in either C. maxima or C. moschata fruit, whereas both copies were highly expressed in the F1 hybrid (Figure. S16). The β-carotene hydroxylase (CHYB) genes were expressed at a relatively lower level in C. moschata fruit compared with the F1 hybrid (higher) and C. maxima (highest), which could lead to different levels of hydroxylation of α- and β-carotene. Another transgressively expressed carotenoid biosynthesis gene in the F1 fruit encoded zeaxanthin epoxidase (ZEP), which epoxidizes zeaxanthin to violaxanthin. The photosynthesis-related pathways (GO:0015979 and GO:0009765) were significantly over-represented in all three aerial tissues in the up-regulated genes (Fig. 4e). These photosynthesis-related genes included those encoding light-harvesting chlorophyll a/b binding proteins and subunits of photosystem I and II, previously shown to be associated with heterosis in other plant species40.
Discussion
Most economically important genera in the Cucurbitaceae belong to the Cucurbiteae and Benincaseae tribes41. We present high-quality draft genome sequences of two Cucurbita species of the Cucurbiteae, C. maxima and C. moschata, as well as evidence for an ancient allotetraploidy event in Cucurbita. The divergence of the two diploid progenitors occurred around 31 Mya, soon after the common ancestor of Cucurbiteae and Benincaseae diverged from the Momodiceae that includes bitter gourd. Although the hybridization date of the two diploid progenitors is uncertain, the divergence between C. maxima and C. moschata is estimated to be at 3 Mya, constraining the tetraploidization to be earlier than this date.
Our results suggest only limited inter-chromosomal rearrangements since the hybridization of the two diploid progenitors. Chromosome structural variation between the two subgenomes is therefore likely due to different fission and fusion events in the two progenitor genomes before hybridization, with the current Cucurbita subgenomes largely maintaining the chromosome structures of the two progenitors despite sharing the same nucleus for at least 3 million years. Such karyotype stability is in contrast to the frequent homoeologous exchanges among subgenomes of allotetraploid cotton during 1-2 million years after polyploidization42 and considerable rearrangements experienced by many other plant polyploids43, but has been observed in amphibians such as allotetraploid frog (Xenopus laevis) that originated from an ancient hybridization event 17-18 Mya44, as well as in newly synthesized allotetraploid wheat with genome combinations analogous to natural tetraploids45. The karyotype stability in the Cucurbita genomes could be due to divergence between the two diploid genome donors, which prevented meiotic pairing of homoeologous chromosomes and subsequent exchanges. Karyotype stability could play a key role in the initial establishment of the Cucurbita species by avoiding fitness costs associated with genomic aberrations immediately following polyploidization.
The two subgenomes have retained similar numbers of genes, and neither subgenome is dominant in overall expression levels. Fractionation bias and genome dominance have been observed in several plant species after polyploidization, including Arabidopsis, maize, rice, Brassica rapa and diploid cotton46. On the other hand, lack of genome dominance and biased fractionation in the genomes of banana, poplar and soybean was reported by Garsmeur et al.47, who hypothesized that this different postpolyploidy subgenome evolution pattern was possibly associated with ancient autopolyploidy events, contradicting the hypothesis that soybean is an allotetraploid48. Here, in Cucurbita, we identified another case of random fractionation and lack of genome-wide expression bias after an ancient allopolyploidy event.
The genome sequences of C. maxima and C. moschata allowed us to perform an allele-specific gene expression analysis in the interspecific F1 hybrid, to parse the genome-wide expression differences between C. maxima and C. moschata into cis- and trans- regulatory effects, and detect transgressive gene expression changes in the hybrid correlated with increased disease resistance and growth vigor. The expression patterns of carotenoid biosynthesis genes (including PHY, LYCB, CHYB and ZEP) in the fruits of the F1 hybrid and the parents suggest that the hybrid harnesses the active pathway components used by both parents. Our result is consistent with previous findings that the predominant carotenoids in C. moschata fruits are α- and β-carotene, whereas C. maxima accumulates lutein and violaxanthin at high levels35,49, and also suggests that interspecific hybridization between C. maxima and C. moschata could lead to a different fruit carotenoid profile and potentially higher total carotenoid content.
Conclusions
High-quality genome sequences of two Cucurbita species, C. maxima and C. moschata, were assembled using high coverage of Illumina paired-end and mate-pair reads, and each constructed into 20 pseudochromosmes. In addition to providing evidence of a whole-genome duplication in the Cucurbita lineage, our comparative genomics analyses revealed the evolutionary history of this ancient allotetraploidy event, including successive divergence of the two diploid progenitors (probably now extinct) of Cucurbita from the common ancestor of Cucumis and Citrullus species that happened around 31 and 26 Mya, and hybridization of the diploids earlier than 3 Mya. The genomes of Cucurbita represent a type of paleo-allotetraploid genomes with stable karyotypes that maintained the chromosome structures of the diploid genome donors. We also showed that the Cucurbita genomes underwent unbiased fractionation after allotetraploidization associated with the lack of genome dominance. Allelic-specific expression analysis in the interspecific hybrid of Cucurbita and the two parents showed that the expression divergence between C. maxima and C. moschata was predominantly controlled by trans-regularoty effects. We also detected transgressive gene expression changes in the hybrid correlated with heterosis in important agronomic traits such as disease resistance and carotenoid biosynthesis. The genome sequences of C. maxima and C. moschata have improved our knowledge on the genome evolutionary of plant polyploids, and provide an invaluable resource for the study of agronomic traits in Cucurbita and genetic improvement of the species.
Materials and methods
Plant materials and DNA/RNA extraction
Two inbred Cucurbita lines, C. maxima cv. Rimu and C. moschata cv. Rifu, were selected for genome sequencing. Seedlings were grown in the greenhouse under natural light supplemented with artificial light (16/8 h light/dark) and transferred to a dark room for 24 hours to promote starch degradation. Young leaves were harvested and stored at −80°C. To construct genetic maps, two intraspecific F2 populations were generated from crosses between C. maxima cultivars, Rimu and SQ026, and between C. moschata cultivars, Rifu and Honey jujube. The interspecific population was derived from a C. maxima cv. Rimu × C. moschata cv. Rifu F1 individual backcrossing to C. maxima cv. Rimu. For transcriptome sequencing, fully developed leaves, stems, roots, and fruit flesh at 46 days after pollination were collected from C. maxima cv. Rimu, C. moschata cv. Rifu and their F1 hybrid ‘Shintosa’. Three independent biological replicates were prepared for each sample.
Genomic DNA and total RNA were extracted using the QIAGEN DNeasy Plant Mini Kit and the QIAGEN RNeasy Plant Mini Kit, respectively, following the manufacturer’s instructions (QIAGEN, Valencia, California, USA). DNA and RNA quality was evaluated via agarose gel electrophoresis and their quantity was determined on a NanoDrop (Thermo Fisher Scientific, Waltham, MA, USA).
Genomic and RNA-Seq library construction and sequencing
Paired-end libraries with insert sizes ranging from 200 bp to 1 kb and mate-pair libraries with insert size ranging from 3 to 15 kb were prepared using Illumina Genomic DNA Sample Preparation kit and the Nextera Mate Pair Sample Preparation kit (Illumina, San Diego, CA), respectively, following the manufacturer’s instructions. Strand-specific RNA-Seq libraries were prepared from total RNA following the protocol described by Zhong et al.50. Genomic and RNA-Seq libraries were sequenced on the Illumina HiSeq 2500 system with the paired-end mode.
De novo genome assembly
Raw Illlumina reads were processed to collapse duplicated read pairs into unique pairs. Duplicated read pairs were defined as those having identical bases at positions of 14 to 50 in both left and right reads. The non-duplicated reads were further processed with Trimmomatic51 to remove adaptors and low-quality sequences. Read pairs from mate-pair libraries were processed with the ShortRead package52 to remove junction adaptors, followed by removing reads shorter than 40 bp and error correction with QuorUM53.
The high-quality cleaned reads were assembled into scaffolds with SOAPdenovo254 and gaps in the resulting scaffolds were filled with the GapCloser program in the SOAPdenovo2 package. Pilon55 was used to improve the assembly by correcting bases, fixing mis-assemblies and further filling gaps. Potential contaminations from microorganisms were detected by aligning the assemblies to NCBI non-redundant nucleotide (nt) database using BLASTN with an e-value cutoff of 1e-5. Scaffolds with more than 90% of their length similar to bacterial sequences were considered contaminants and removed. Finally, scaffolds contained within other scaffolds with sequence identity >99% and coverage >99% were removed.
Genetic map construction and scaffold anchoring
To anchor the assembled C. maxima and C. moschata scaffolds, we constructed high-density genetic maps using two intraspecific F2 populations (186 C. maxima and 186 C. moschata individuals) and one interspecific F1BC1 population (186 individuals) derived from C. maxima Rimu × SQ026, C. moschata Rifu × Honey jujube, and C. maxima Rimu × C. moschata Rifu, respectively. For the interspecific F1BC1 population, C. maxima Rimu was used as the recurrent parent. Genomic DNA was extracted from young healthy leaf tissues of F2 and F1BC1 individuals and the parents using the cetyltrimethylammonium bromide (CTAB) method. DNA concentration was measured using an ND-2000 spectrophotometer (NanoDrop, Wilmington, DE, USA) and quality was assessed by electrophoresis using 1% agarose gels with a lambda DNA standard. Genotyping of these plants was performed following the GBS protocol56, using ApeKI as the restriction enzyme. The resulting libraries were multiplexed and sequenced on a HiSeq 2500 system (Illumina Inc. USA) with single-end mode and read length of 101 bp. The GBS sequencing reads were processed using the TASSEL-GBS (v4) pipeline57 for SNP calling. Briefly, for each population, the reads from all samples were combined and collapsed into a master tag list. Tags that occurred at least 10 times were retained and mapped to the C. maxima or C. moschata genome using BWA58 (v0.7.12) with default parameters. Alignments with mapping quality ≥ 1 were used for SNP calling. The minimum genotype quality was set to 10. SNPs with significant segregation distortion (Chi-square test, p <0.05), high data missing rate (>30%) or low minor allele frequency (MAF <0.03) were removed. The resulting SNPs were used to construct genetic maps using the minimum spanning tree (MST) algorithm59 implemented in the R package ASMap (https://cran.r-project.org/web/packages/ASMap/). During scaffold anchoring, one initial C. moschata scaffold was identified as chimeric, on the basis of the same scaffold mapping to different linkage groups and the lack of mate-pair read support at the point of potential misjoining. This scaffold was split into two new ones.
Annotation of transposable elements
A de novo long terminal repeat retrotransposon (LTR-RT) library and a miniature inverted repeat transposable elements (MITE) library were constructed by screening the assembled C. maxima and C. moschata genomes using LTRharvest60 and MITE-Hunter61, respectively. Both genomes were then masked using RepeatMasker (http://www.repeatmasker.org/) with the LTR-RT and MITE libraries. The unmasked sequences in the genomes were further searched for repeat elements using RepeatModeler (http://www.repeatmasker.org/RepeatModeler.html). All the repetitive sequences generated above were combined into a single repeat library for each genome, and then compared against the Swiss-Prot database62. Sequences that matched non-TE proteins in the database were removed from the repeat libraries. The remaining TEs were classified using REPCLASS63. The classified repeat libraries were then used to identify TEs in the assembled C. maxima and C. moschata genomes with RepeatMasker (http://www.repeatmasker.org/).
Protein coding gene prediction and functional annotation
The repeat-masked C. maxima and C. moschata genomes were used for gene prediction with MAKER64, which combines evidences from ab initio gene prediction, transcript mapping and protein homology to define the confident gene models. SNAP65 and AUGUSTUS66 were used for ab initio gene predictions. RNA-Seq data were assembled using Trinity 67 with the de novo mode and the genome-guided mode, respectively. The two assemblies were combined and aligned to the assembled genomes by the PASA2 pipeline68. The resulting alignments were used as the transcript evidence. Protein sequences from Arabidopsis69, watermelon15, cucumber16, and melon14 and the UniProt (Swiss-Prot plant division) database were aligned to the genomes using Spaln70 to provide the protein homology evidence. The same annotation pipeline was also applied to the bitter gourd genome71, which predicted 20,778 protein-coding genes that were used in the analyses in this study.
For gene annotation, protein sequences of the predicted genes were compared against the Arabidopsis protein and UniProt (Swiss-Prot and TrEMBL) databases using BLAST, as well as the InterPro database using InterProScan72. Gene ontology (GO) annotations were obtained using Blast2GO73. Transcription factors were identified using the iTAK program74.
Detection of lineage-specific expansion of protein families
Lineage-specific gene family expansion was inferred from orthologous groups using the software CAFE75 (v3.0). Orthologous groups were constructed using OrthoFinder76 with protein sequences from grape, papaya, C. maxima, C. moschata and three other cucurbit species (watermelon, cucumber and melon). The random gene birth and death rates were estimated across the species tree composed of the seven species using the maximum likelihood method. Orthologous groups with accelerated rate of expansion in the Cucurbita genus (represented by C. maxima and C. moschata) were determined with a threshold p-value of 0.05.
Detection of positive selection
Positive selection on pairs of paralogous (within C. maxima or C. moschata) or orthologous genes (between C. maxima and C. moschata) was detected using the CodeML application in the PAML package77. Positive selection was estimated by pairwise calculation of nonsynonymous/synonymous rate ratio (Ka/Ks or ω) between all members of an orthologous group using the codon-based ‘branch’ model78. Substitution rates were averaged across a gene (option ‘NSsites = 0’). The tested paralogous or orthologous branch was used as the ‘foreground’ and all other cucurbit branches (cucumber, watermelon and melon) as the ‘background’. We first tested whether the ω of the paralogous (or orthologous) genes was different from the rest of the tree by comparing a two-ratio model (option ‘model=2’ allowing ω to vary among branches) to the null model 0 (assuming an identical ω among all branches). To further test whether the difference was due to positive selection or just relaxation of selection pressure, the two-ratio model result was contrasted against a model of neutrality (option ‘fix_omega =1’). If the model allowing different ω (in the first comparison) or positive selection (in the second comparison) fit the data significantly better, as judged by the likelihood-ratio test (LRT), difference in ω or positive selection was inferred. Twice the difference in log-likelihood values between two models was compared to a χ2 distribution with one-degree of freedom to determine the p-value.
Analyses of genome evolution, phylogeny and divergence time
To identify syntenic regions within and between genomes, protein sequences of C. maxima, C. moschata, watermelon, cucumber, melon and bitter gourd were aligned against themselves or each other using BLASTP, and high-confidence collinear blocks were determined using MCScanX79 with an E-value cutoff of 1e-10. Ks values of homologous pairs were calculated using the Yang-Nielsen algorithm implemented in the PAML package80. To construct the species tree containing the diploid progenitors of C. maxima and C. moschata (represented by the two subgenomes), watermelon, cucumber and walnut, we used 187 orthologous groups containing only one syntenic gene from each species. The CDS sequences of these genes had at least 80% coverage to their walnut orthologues, and were longer than 300 bp. The alignments of the 187 orthologous groups were used to construct phylogenetic trees with PhyML81 (v3.0; default parameters). A consensus tree was produced using the TreeAnnotator program in BEAST282 with default parameters. The divergence time was estimated using the program BPP, under the multispecies coalescent model in a Bayesian framework with the fixed species phylogeny constructed above83. Based on the fossil information on the divergence time between Fagales and Cucurbitales, which is around 84 Mya84, the gamma prior G(2, 8.66) were assigned to the population size parameters (θs), with mean 2/8.66 = 0.2309, and the gamma prior G(282, 2445) was assigned to the divergence time at the root of the species tree (τ0). The analysis was run twice to confirm consistency between runs.
RNA-Seq analysis
Raw RNA-Seq reads were processed by Trimmomatic51 (version 0.32) to remove adapter, polyA/T tails and low quality (quality score < 20) sequences. Reads equal to or longer than 40 bp were kept and aligned to the SILVA rRNA database (release 111) (https://www.arb-silva.de/) to remove rRNA contaminations. Cleaned reads were aligned to the C. maxima and C. moschata genomes using HISAT285 allowing no mismatch. Alignment results were used to classify each read as C. maxima-specific, C. moschata-specific, or common to both species. Raw counts for each gene were then derived and normalized to fragments per kilobase of exon model per million mapped reads (FPKM). DESeq86 was used to identify differentially expressed genes between C. maxima and C. moschata orthologues with a cutoff of adjusted p-value < 0.005, fold-change >2 and FPKM ≥ 3 in at least one sample.
cis and trans effects and inheritance mode classification
The orthologous relationship between a pair of C. maxima and C. moschata genes was detected with LAST87 by looking for unique best corresponding regions between the genomes of C. maxima and C. moschata. Gene pairs with more than 90% coverage in both genomes were selected for the allele-specific expression analyses. Statistical significance of cis and trans effects were determined using binomial and Fisher’s exact tests as described previously88. Briefly, DESeq86 analyses were carried out to identify genes that were expressed differently between the parents and/or had expression ratios of C. maxima and C. moschata alleles significantly different (adjusted p-value < 0.005) from 1 in the F1 hybrid. The identified genes were then subjected to a Fisher’s exact test to determine whether the expression ratios of C. maxima and C. moschata in the F1 were significantly different (adjusted p-value < 0.005) from the ratios between the parents. Based on the results from the binomial tests and Fisher’s exact tests, each gene was classified into one of the seven different regulatory categories, namely, cis, trans, cis + trans, cis × trans, compensatory, conserved and ambiguous, as described by McManus et al.88. The mode of inheritance for each gene was determined based on the magnitude of the expression level difference between the hybrid and parents. Genes with expression level in the hybrid deviated more than 1.25-fold from that of either parent were further classified as having additive, dominant, under-dominant or over-dominant expression.
Data access
The Whole Genome Shotgun projects of C. maxima and C. moschata have been deposited at DDBJ/ENA/GenBank under the accessions NEWN00000000 and NEWM00000000, respectively. The versions described in this paper are NEWN01000000 and NEWM01000000. Raw genome and transcriptome sequence reads have been deposited into the NCBI sequence read archive (SRA) under accessions SRP106545, SRP106546 and SRP106360. The genome sequences and the annotations of C. maxima and C. moschata are also available at the Cucurbit Genomics Database (http://cucurbitgenomics.org).
AUTHOR CONTRIBUTIONS
Z.F., Y.X., S.W., J.J.D. and W.J.L. designed and managed the project. G.Z., H.L., S.G., Y.R., J.Z. H.Z., G.G., Z.J., F.Z., J.T and C.J. prepared samples and performed biological experiments for DNA sequencing and genetic map construction. H.S., S.W., C.J., S.G. and Y.R. performed data analyses. S.W. and H.S. wrote the manuscript. Z.F., J.J.D., W.J.L. and Y.X. revised the manuscript.
COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.
ACKNOWLEDGMENTS
This research was supported by grants from the Beijing Scholar Program (BSP026), the Beijing Excellent Talents Program (2014000021223TD03), the Ministry of Agriculture of China (CARS-26), the Beijing Natural Science Foundation (6141001 and 6144023), the US National Science Foundation (IOS-1339287, IOS-1339128 and IOS-1539831), and USDA National Institute of Food and Agriculture Specialty Crop Research Initiative (2015-51181-24285).
References
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.↵
- 6.↵
- 7.↵
- 8.↵
- 9.
- 10.↵
- 11.↵
- 12.↵
- 13.↵
- 14.↵
- 15.↵
- 16.↵
- 17.↵
- 18.↵
- 19.↵
- 20.↵
- 21.↵
- 22.↵
- 23.↵
- 24.↵
- 25.↵
- 26.↵
- 27.↵
- 28.↵
- 29.↵
- 30.↵
- 31.↵
- 32.↵
- 33.
- 34.↵
- 35.↵
- 36.↵
- 37.↵
- 38.↵
- 39.↵
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.↵
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.↵
- 52.↵
- 53.↵
- 54.↵
- 55.↵
- 56.↵
- 57.↵
- 58.↵
- 59.↵
- 60.↵
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.↵
- 67.↵
- 68.↵
- 69.↵
- 70.↵
- 71.↵
- 72.↵
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.↵
- 81.↵
- 82.↵
- 83.↵
- 84.↵
- 85.↵
- 86.↵
- 87.↵
- 88.↵