Abstract
The endangered whale shark (Rhincodon typus) is the largest fish on Earth and is a long-lived member of the ancient Elasmobranchii clade. To characterize the relationship between genome features and biological traits, we sequenced and assembled the genome of the whale shark and compared its genomic and physiological features to those of 81 animals and yeast. We examined scaling relationships between body size, temperature, metabolic rates, and genomic features and found both general correlations across the animal kingdom and features specific to the whale shark genome. Among animals, increased lifespan is positively correlated to body size and metabolic rate. Several genomic features also significantly correlated with body size, including intron and gene length. Our large-scale comparative genomic analysis uncovered general features of metazoan genome architecture: GC content and codon adaptation index are negatively correlated, and neural connectivity genes are longer than average genes in most genomes. Focusing on the whale shark genome, we identified multiple features that significantly correlate with lifespan. Among these were very long gene length, due to large introns highly enriched in repetitive elements such as CR1-like LINEs, and considerably longer neural genes of several types, including connectivity, activity, and neurodegeneration genes. The whale shark’s genome had an expansion of gene families related to fatty acid metabolism and neurogenesis, with the slowest evolutionary rate observed in vertebrates to date. Our comparative genomics approach uncovered multiple genetic features associated with body size, metabolic rate, and lifespan, and showed that the whale shark is a promising model for studies of neural architecture and lifespan.
The relationships between body mass, longevity, and basal metabolic rate (BMR) across diverse habitats and taxa have been researched extensively over the last century, and led to generalized rules and scaling relationships that explain many physiological and genetic trends observed across the tree of life. While studies of endothermic aquatic mammals have shown that selection for larger body sizes is driven by the minimization of heat loss1, metabolic rate in ectothermic aquatic vertebrates is directly dependent on temperature, and decreased temperatures are correlated with decreased BMRs, decreased growth rates, longer generational times, and increased body sizes2-4. The whale shark (Rhincodon typus) is the largest extant fish, reaching lengths of 20 meters (m)5 and 42 tonnes (t) in mass6 and has a maximum lifespan estimated at 80 years6. Unlike the two smaller filter-feeding shark species (Cetorhinus maximus, Megachasma pelagios) that inhabit colder temperate waters with increased prey availability, whale sharks have a cosmopolitan tropical and warm subtropical distribution and have rarely been sighted in areas with surface temperatures less than 21°C7-9. However, recent GPS tagging studies have revealed that they routinely dive to mesopelagic (200-1,000 m) and bathypelagic (1,000-4,000 m) zones to feed, facing water temperatures of <4°C10. Observations of increased surface occupation following deeper dives led to the suggestion that thermoregulation is a primary driver for their occupation of the warmer surface waters7,11. Since larger body masses retain heat for longer periods of time, the large body mass of whale sharks may slow their cooling upon diving and maximize their dive times to cold depths, where food is abundant. Larger body mass could thus play a role in metabolic regulation.
Body size, environmental temperature, metabolic rate, and generation time are all correlated with variations in evolutionary rates12,13. Since many of these factors are interconnected, modeling studies have shown that observed evolutionary rate heterogeneity can be predicted by accounting for the impact of body size and temperature on metabolic rate14, suggesting these factors together drive the rate of evolution through their effects on metabolism. Consistent with these results, the coelacanth and elephant fish have the slowest reported evolutionary rates15,16. Moreover, genome size and intron size have also been linked to metabolic rate in multiple clades. Intron length varies between species and plays an important role in gene regulation and splice site recognition. In an analysis of amniote genomes, intron size was reduced in species with metabolically demanding powered flight and was correlated with overall reductions in genome size17,18. However, since most previous studies were limited by poor taxonomic sampling and absence of genome data for the deepest branches of the vertebrate tree, comprehensive comparative genomic analyses across gnathostomes are necessary to gain a deeper understanding of the evolutionary significance of the correlations between genome size, intron size and metabolic demands.
Here we sequenced and analyzed the genome of the whale shark and compared its genome and biological traits to those of 81 eukaryotic species, with a focus on gnathostomes such as fishes, birds, and mammals. In particular, we identified scaling relationships between body size, temperature, metabolic rates, and genomic features, and found general genetic and physiological correlations that span the animal kingdom. We also examined characteristics unique to the whale shark and its slow-evolving, large genome.
The whale shark genome
The DNA of a Rhincodon typus individual was sequenced to a depth of 164× using a combination of Illumina short-insert, mate-pair, and TSLR libraries (Table S1 and S2), resulting in a 3.2 Gb genome with a scaffold N50 of 2.56 Mb (Tables S2, S5, and S6). A sliding window approach was used to calculate GC content and resulted in a genome-wide average of 42%, which is similar to the coelacanth and elephant fish (Fig. S2). Roughly, 50% of the whale shark genome is comprised of transposable elements (TEs), which were identified using both homology-based and ab initio approaches19,20. Of these, long interspersed nuclear elements (LINEs) made up 27% of the total TEs identified (Table S7). A combination of homology based and ab initio genome annotation methods19,20 resulted in a total of 28,483 predicted protein coding genes (Table S8).
Correlation of physiological characteristics with genome features across 82 taxa
Body mass is intrinsically linked to physiological traits such as lifespan and basal metabolic rate (BMR)21. To better understand how genomic features correlate with physiological and ecological parameters such as body weight, lifespan, temperature, and metabolic rate, we compared the whale shark to 80 animals and yeast (Table S15-16) using physiological and genomic data (Fig.1, Fig. S3-6 and Table S16). Across the 81 animals examined, we found a strong positive correlation with significant p-values between the log transformed values for body weight and maximum lifespan (ρ = 0.79, Fig. 2A and Table S17) and BMR (ρ = 0.958, Fig. S9A, exponent B = 0.68, Fig. S25, and Table S17), consistent with previous reports21. Comparisons of physiological traits and genome characteristics across the 81 animals revealed several genetic features that also scaled with body weight. Among these, total gene length, intron length, and genome size all show a moderate statistical correlation with body mass, lifespan, and BMR (ρ = ~0.5) (Fig. 2B-E and Table S17). These results are consistent with previous findings of decreased intron size with increased metabolic rates. Furthermore, genome size and relative intron size are strongly correlated (ρ = 0.707) (Fig. 2B and Table S17), with the whale shark being a notable outlier. Moreover, genome size, measured as golden path length, scales with gene size, measured as summed length of exons and introns per gene (B = 1.32, Fig. S26). Additionally, we found that, unlike in bacteria22 and crustaceans23, genome size in Chordates scales positively with temperature (B = 0.77, Fig. S27).
Exon length is remarkably constant across animals, regardless of genome size or intron length (Fig. 1C and Fig. S4C). Early observations of this phenomenon across small numbers of taxa led to the suggestion that the splicing machinery imposes a minimum exon size while exon skipping begins to predominate when exons exceed ~500 nt in length24. Also of note is the tight correlation (ρ = 0.975) between the overall GC content and GC3, the GC content of the third codon position (Fig. S9B and Table S17), while both features are negatively correlated with the codon adaptation index (CAI) (ρ=−0.799 and ρ=−0.841, respectively; Fig. 2G-H and Table S17) in Eukaryota, and negatively correlated with the genome size in Mammalia (ρ = −0.434 and ρ = −0.473, respectively) (Table S17). These results are partially supported by previous research, which showed that GC3 is negatively correlated with body mass, genome size, and species longevity within 1,138 placental mammal orthologs25. However, our results using whole genome data do not support the GC3 correlation with body mass and longevity (ρ = 0.067 and ρ = 0.059; Table S17). Thus, exon and intron length may affect body mass and longevity through a strong association between GC content and coding sequence length26. Additionally, CAI and intron size are moderately positively correlated (ρ = 0.463; Fig. 2I and Table S17). Since the CAI and codon usage bias have an inverse relationship, this is consistent with the negative correlation between intron length and codon usage bias in multicellular organisms27.
Whale shark longevity and genome characteristics
The allometric scaling relationships between longevity, mass, temperature, and metabolic rate are well established21, and the long lifespan of the whale shark can be explained by its large mass and the extremely low mass- and temperature-adjusted BMR (Fig. 1H and 1L). There has been considerable debate in the literature over the evolutionary causes and consequences of genome size, particularly as it relates to BMR. At 3.2 Gb, the whale shark has a genome that is significantly larger than those of other Chondrichthians (elephant fish), though both exon number and size are comparable. The whale shark is, however, a notable outlier, particularly among fish, for its long introns (Fig. 1E and S4E). Interestingly, the whale shark’s relative intron length (Fig. 1E and S4E) is significantly longer than any of the other 81 species (Fig. S5G and S6G). Analyses of single copy orthologous gene clusters did not reveal any large intron gains or losses in the whale shark (Fig. S10), though retrotransposon analyses revealed a significant expansion of CR1-like LINEs and Penelope-like elements in the introns (Fig. 3A and S11-15). The CR1-like LINEs are the dominant family of transposable elements (TEs) in non-avian reptiles and birds28. In the whale shark, the summed length of CR1-like LINE elements is 176 Mbp (Fig. S13C), which is eleven times longer than that of the anole lizard, a species known for expanded CR1-like LINEs. The total length of intronic repetitive elements is as great as in the opossum genome, known to be rich in repetitive elements29 (Fig. S12). In the whale shark genome, 38% of the CR1-like LINE, 39% of the CR1-Zenon like LINE, and 30% of the Penelope-like elements are located in intronic regions (Fig. S14). Strikingly, most genes (more than 88%) in the whale shark genome have the CR1-like LINE elements within their introns (Fig. S15) and 56% of genes also have LINE1 elements (Fig. S15). Thus, the whale shark’s relatively large genome and long introns are due to repetitive elements.
Previous research has shown that there is an association between codon usage and the evolutionary age of genes in metazoans30. Interestingly, two principal component analyses (PCA) of relative synonymous codon usage (RSCU) from 82 and 76 species (six species having distant codon usage patterns were excluded), respectively, revealed that the whale shark pattern of RSCU is most similar to that of the coelacanth; with well separated patterns of RSCU for each class (Fig. S16). While the whale shark genome has a relatively short exon length (smaller than that of 59 species), importantly, it has a smaller number of exons per gene than all but two species (the yeast and fruit fly have the smallest number of exons) (Fig. S3B and S4G). Thus, the whale shark CDS length is shorter than all but the yellow sea squirt genome (Fig. 1D and S4D).
Evolutionary rate and historical demography
Analyses of the whale shark genome showed it is the slowest evolving vertebrate yet characterized. A relative rate test and two cluster analyses revealed that the whale shark has a slower evolutionary rate than those of the elephant fish and all other bony vertebrates examined, including coelacanth16 (Fig. 3B, S17 and Table S18-20). These results support the previous work which predicted a slow evolutionary rate in ectothermic, large-bodied species with relatively low body temperature (compared to similarly sized warm-blooded vertebrates)14. They are also consistent with previous studies of nucleotide substitution rates in elasmobranchs, which are significantly lower than those of mammals31,32.
A phylogenetic analysis of the 255 single-copy orthologous gene clusters from the whale shark and 24 other animal genomes (Fig. 3D) showed a divergence of the Elasmobranchii (sharks) and Holocephali (chimaeras) roughly 268 MYA and of the Chondrichthyes from the bony vertebrates about 457 MYA (Fig. 3D), consistent with previous estimates. To understand how many genes appeared in each evolutionary era within the whale shark genome, we evaluated the evolutionary age of whale shark protein-coding genes based on protein sequence similarity33. Grouping the whale shark genes into three broad evolutionary eras, we observed that while the majority (58%) of genes are ancient (older than 684 MYA), a few (~5%) are middle age (684-199 MYA), and many (36%) are young (199 MYA to present) (Fig. 3C). Normalizing the number of genes by evolutionary time suggests that gene turnover is highest near the present time (Fig. S18). Examining the age of genes shows many genes are ancient (PS 1) and many genes appear very young (PS 20) (Fig. S19), though the large number of young PS 20 genes may in part reflect the paucity of closely related species with fully sequenced genomes. These results highlight both the conservation of a large part of the genome as well as the innovative potential of the whale shark genome, since many new genes appeared within the last 200 million years.
Gene family expansions and contractions in the whale shark
Gene family expansion and contraction analyses across 25 species identified 101 contracted gene families in the whale shark. Of these, nucleosome assembly (GO:0006334) and chromatin assembly (GO:0031497) were significantly decreased in the whale shark compared to the Chondrichthyes common ancestor (Table S21A). Interestingly, the whale shark genome has a smaller number of histone gene families (H1, H2A and H2Bs) than other bony fishes and mammals (Fig. S20). This small number of histone gene families, especially the H1 family which encodes the linkers important for higher order chromatin structures, may be related to the long length of whale shark introns34. We also identified 13 expanded gene families that are enriched for several metabolic pathways, including fatty acid metabolism, along with neurogenesis and nervous system development, and cardiac conduction system development (Table S21B).
Gene length of neural genes and correlation with physiological features
Gene length has recently emerged as an important feature of neural genes, as long genes are preferentially expressed in neural tissues and their expression is under tight transcriptional and epigenetic control35. Within the 81 animal species, we compared the dimensions of average genes with those of ten categories of neural genes (neuronal connectivity, cell adhesion, olfactory receptors, ion channels, unfolded protein response associated genes, neuronal activity and memory, neuropeptides, homeobox genes, synaptic genes, and neurodegeneration) (Fig. 4A and S21). Interestingly, we found that neuronal connectivity genes are longer than average genes in most vertebrates, with the length increase being significant in whale shark and most mammals, as well as in coelacanth and platypus (Fig. 4A and S22A). Surprisingly, we found that neural genes are scaled to average genes with an exponent greater than 1 (B = 1.038, Fig. S28), with the whale shark showing an extreme lengthening of neural genes. Moreover, we found that cell adhesion, ion channels, homeobox genes, and neurodegeneration genes are increased in length in the whale shark (Fig. 4B). Thus, the organization of whale shark neural genes may reflect the need to maintain the shape, activity, identity, and resistance to neurodegeneration in a body that is both very large and long-lived. Finally, neuronal functions are enriched in long genes in more than 60 species (Fig. 4C and Additional File 1).
To determine whether physiological traits and genomic features are linked, we examined the correlation of gene size and maximum lifespan, body weight, and BMR (Fig. 2A-F). In 155 gene families, we found that gene length was significantly correlated to maximum lifespan, body weight, and BMR. Gene ontology analyses of this gene group showed statistical enrichment of biological processes such as telomere maintenance (GO:0007004: XRCC5, MAPKAPK5, and NAT10) and RNA and protein export from nucleus (GO:0006405 and GO:0006611: SDAD1, SARNP, RAE1, NUP155, ABCE1, ENY2, XPO5, CSE1L and STYX; Fig. 4D, Tables S22 and S23), both of which are associated with longevity and cancer36,37. Of the genes in which gene length is associated with lifespan, NUP210 (nucleoporin 210) and VWF (von Willebrand factor) are both associated with longevity38 (Fig. S24A and Table S24). Moreover, the genes correlated to BMR include SNX14, which is linked to protein metabolism and whose deficiency causes ataxia and intellectual disability39 (Fig. S24B and Table S24). The only gene previously correlated with body mass (COX5B; the terminal enzyme of the mitochondrial respiratory chain) is a subunit of Complex IV and is essential to energy production in the cell and ultimately to aging40 (Fig. S24C and Table S24). Taken together, these results suggest that there is an evolutionary relationship between gene size and physiological traits size such as body size, metabolic rate, and lifespan. This holds particularly among genes whose functions are essential for living long lives, such as telomere maintenance and energy production.
Conclusions
We sequenced and assembled the genome of the whale shark (Rhincodon typus), an endangered species that is the largest extant fish on Earth. Its relatively large 3.2 Gb genome is the slowest evolving vertebrate genome found to date, and has a striking amount of CR1-like LINE transposable elements. In most genomes, we found that major genomic traits, including intron length and gene length, correlate with body size, temperature, and lifespan, and that GC content and codon adaptation index are negatively correlated. Unexpectedly, we found that neural connectivity genes are substantially longer than average genes. In the whale shark genome, specifically, we found that introns are longer than in most other species due to the presence of repetitive elements and that neural genes of several types, including neurodegeneration genes, are much longer than average genes of species with long lifespans. These results show the power of the comparative evolutionary approach to uncover both general and specific relationships that reveal how genome architecture is shaped by size and ecology.
Methods
Sample preparation and sequencing
Genomic DNA was isolated from heart tissue acquired from an approximately seven years old, 4.5 meter deceased male whale shark from the Hanwha Aquarium, Jeju, Korea. DNA libraries were constructed using a TruSeq DNA library kit for the short-read libraries and a Nextera Mate Pair sample prep kit for the mate pair libraries. Sequencing was performed using the Illumina HiSeq2500 platform. Libraries were sequenced to a combined depth of 164× (Tables S1 and S2).
Genome assembly and annotation
Reads were quality filtered (Table S3) and the error corrected reads from the short insert size libraries (<1 Kb) and mate pair libraries (>1 Kb) were used to assemble the whale shark genome using SOAPdenovo241. As the quality of assembled genome can be affected by the K-mer size, we used multi-K-mer value (minimum 45 to maximum 63) with the ‘all’ command in the SOAPdenovo2 package41. The gaps between the scaffolds were closed in two iterations with the short insert libraries (<1 Kb) using the GapCloser program in the SOAPdenovo2 package41. We then aligned the short insert size reads to the scaffolds using BWA-MEM42 with default options. Variants were identified using SAMtools43. Since at least one of alleles from the mapped reads of same individual as reference should be presented in the assembly, we corrected erroneous bases where both alleles were not present in the assembly by substituting the first variant alleles. Finally, we mapped the Illumina TruSeq synthetic long reads (TSLR) to the assembly and corrected the gaps covered by the synthetic long reads to reduce erroneous gap regions in the assembly (Table S5).
The GC distribution of the whale shark genome was calculated using a sliding window approach. We employed 10 Kb sliding windows to scan the genome to calculate the GC content. Tandem repeats were predicted using the Tandem Repeats Finder program (version 4.07)44. Transposable elements (TEs) were identified using both homology-based and ab initio approaches. The Repbase database (version 19.02)45 and RepeatMasker (version 4.0.5)19 were used for the homology-based approach, and RepeatModeler (version 1.0.7)20 was used for the ab initio approach. All predicted repetitive elements were merged using in-house Perl scripts. Two candidate gene sets were built to predict the protein coding genes in the whale shark genome; using AUGUSTUS46 and Evidence Modeler (EMV)47, respectively (Supplementary Text 1.7).
Genomic context calculations
From the 82 species (Table S15), we computed the following genomic factors: GC3 (GC content at third codon position), CAI (codon adaptation index), number and length of coding exon(s), and relative intron length between first and last exon (or coding exon). CDS sequences with premature stop codons and lengths that were multiples of three were excluded. The relative intron length was calculated by dividing the total intron length between first and last exon (or coding exon) by the CDS length (or mRNA length). GC3 was computed from concatenated third codon nucleotides using the canonical method48. We measured relative synonymous codon usage (RSCU) using the method from Sharp et al.49 and the codon adaptation index (CAI) in a CDS using Sharp and Li’s method50 for each of the 82 species. The principle component analysis (PCA) on RSCU was performed using the R packages (version 3.3.0)51 ggplot252 and ggfortify53.
Orthologous gene family clustering
To identify orthologous gene families among the whale shark and the other 82 species, we downloaded all pair-wise reciprocal BLASTP results using the ‘peptide align feature’ in the Ensembl comparative genomics resources54 (release 86). To generate pair-wise orthologous that were not available in the Ensembl resources, we performed reciprocal BLASTP55 with the ‘-evalue 1e-05-seg no-max_hsps_per_subject 1-use_sw_tback’ options. From the pair-wise reciprocal BLASTP results among the 82 species, we generated similarity matrixes by connecting possible orthologous pairs. To constrain the computational load, we did not join additional nodes when the number of node was bigger than 1500. The normalized weights for the similarity matrix were calculated using the OrthoMCL approach56. We identified orthologous gene families by using an in-house C++ script based on the MCL clustering algorithm57, with inflation index 1.3. A total of 1,461,312 genes were assigned to 225,530 clusters, including 192,174 of singletons.
Gene age estimation
Phylostratigraphy uses BLASTP-scored sequence similarity to estimate the minimal age of every protein-coding gene. The NCBI non-redundant database is queried with a protein sequence to detect the most distant species in which a sufficiently similar sequence is present and posit that the gene is at least as old as the age of the common ancestor33. Using NCBI for every species, the timing of lineage divergence events is estimated with TimeTree58. To facilitate detection of protein sequence similarity, we use the e-value threshold of 10−3. We evaluate the age of all proteins with length equal or greater than 40 amino acids. First, we count the number of genes in each phylostratum, from the most ancient (PS 1) to the most recent (PS 20). Most genes are ancient (PS 1-2) and a substantial number appear young (PS 20) (Fig. S19). Second, to understand broad evolutionary patterns, we aggregate the counts from several phylostrata into three broad evolutionary eras: ancient (PS 1-7, cellular organisms to Deuterostomia, 4,204 Mya - 684 Mya), middle (PS 8-14, Chordata to Selachii, 684 Mya - 199 Mya) and young (PS 15-20, Galeomorpha to Rhincodon typus, 199 Mya to present). To understand the gene flow per time unit, we normalized the number of genes by the age and the duration of the evolutionary era.
Correlation tests in orthologous gene families
From the 82 species, we selected 6,929 single-copy orthologous gene families which are found in at least 40 species to calculate the correlation between gene length, i.e., exon + intron length between first and last coding exon and three physiological properties (the maximum lifespan, body weight, and BMR). We identified gene families which had significant correlations between the gene length and the maximum lifespan (2,882 genes), body mass (2,193 genes), and the BMR (2,627 genes) by Spearman’s rho correlation coefficient and Benjamini & Hochberg adjustment (adjust p-value ≤ 0.01). All these gene families were subject to alignment filtering criterion of containing more than 50% of conserved exon-exon boundaries (intron position) in their CDS alignments. This step reduces the effect of gene length change due to intron gain or loss and increases the accuracy of multiple sequence alignments (Fig. S23). Finally, we acquired four sets of correlated gene families between the gene length and the three properties: 1) 25 gene families with the maximum lifespan only (Table S24), 2) one gene family with the body weight only (Table S24), 3) seven gene families with the BMR only, and 4) 155 gene families with all three physiological properties (Table S23).
Author contributions
J.B. conceived and planned the project. J.A.W., Y.S.C., S.G.P, and J.B. coordinated the project. J.A.W., V.L., S.G.P., S.J., J.S.E., and J.B. wrote the manuscript. W.H.H., S.W.K., S.K., Y.S.C., H.M.K., and S.K. prepared the samples and performed the experiments. S.G.P., J.A.W., V.L., S.J., H.M.K., Y.J., Y.B., A.K. and J.J. performed in-depth bioinformatics and evolutionary analyses. S.G.P., J.A.W., V.L., S.J., H.M.K., Y.J., Y.B., J.J., S.W.K., W.H.H., S.L., Y.S.C., A.K., A.M., S.K., J.S.E., J.B., and G.M.C. reviewed the manuscript and discussed the work.
Competing interests
Authors declare no competing interests.
Data and materials availability
The whale shark whole genome project has been deposited at DDBJ/ENA/GenBank under the accession QPMN00000000. The version described in this paper is version QPMN01000000. DNA sequencing reads have been uploaded to the NCBI Read Archive (SRP155581).
2. Comparative genomic studies
2.1 Data resources
Genome sequences and gene sets for 69 species were downloaded from Ensembl FTP (ftp://ftp.ensembl.org/pub/release-86/). An additional twelve species were added from NCBI FTP (ftp://ftp.ncbi.nlm.nih.gov/genomes/) (Table S15).
2.2 Comparison of genomic factors
Due to the absence of transcriptome data for fourteen of the comparison species (cod, sloth, hyrax, elephant, lesser tenrec, megabat, shrew, hedgehog, dolphin, tree shrew, pika, kangaroo rat, tarsier, whale shark), we focused our analyses on the genomic features in translated region (Figure 1 and Figure S3-S6).
3. Maximum lifespan, body weight, basal metabolic rates association studies with gene size
3.1 Maximum lifespan data and maximum adult weight
Maximum lifespan, maximum adult weight and basal metabolic rates were downloaded from AnAge (http://genomics.senescence.info/species/), ADW (http://animaldiversity.org/), EOL (http://eol.org/), and aqW (https://www.theaquariumwiki.com/). The weight record of ten fishes were calculated by Froese, R., et al.’s methods, ‘length-weight relationship’ (Table S16).
3.2 Basal metabolic rates calculation
We calculated basal metabolic rates (BMRs) of 82 species using Gillooly’s equation40 based on maximum adult weight and average body temperature (or growth optimum temperature for cold-blooded animal) (Table S16). The calculated BMRs were compared with published BMRs from the AnAge database (http://genomics.senescence.info/) using the Spearman’s rank correlation coefficient (Figure S7). The Calculated Gillooly’s BMRs40 were significantly correlated with BMRs downloaded from the AnAge database.
3.3 Intron gain or loss
From the single-copy orthologous gene sets, CDS of each orthologous family were aligned using MUSCLE (version 3.8.31)41. Exon-exon boundaries as intron positions were marked on the CDS alignments. The intron positions within a permissible length (six bp), which are considered as alternative splice sites, were aligned. The aligned intron position was converted to a binary character matrix as an input table of the Malin program42, which was used to calculate the “intron gain or loss” using Dollo parsimony (Figure S10).
3.4 Prediction of repetitive elements within introns
To compare intronic repetitive elements in each species, we constructed consensus models of putative interspersed repeats by RepeatModeler (version 1.0.10)7. Using RepeatMasker (version 4.0.5)6, we then predicted repeat elements in the introns of 81 species (yeast is excluded from our 82 species set) with the ‘-no_is-cutoff 255-frag 20000’ options. The predicted repetitive elements containing domains that overlapped with other repeats (higher-scoring match), which were denoted by asterisk in the RepeatMasker result file, were filtered out.
3.5 Synonymous codon usage comparison
We measured relative synonymous codon usage (RSCU) using Sharp et. al.’s method43 in each of the 82 species. A principal component analysis (PCA) on the RSCU was performed using the R packages (version 3.3.0)44 ggplot245 and ggfortify46 (Figure S16).
4. Evolutionary studies of whale shark
4.1 Phylogeny construction
We constructed a phylogenetic tree of 25 species including whale shark (Anolis carolinensis, Balaenoptera acutorostrata scammoni, Bos taurus, Callorhinchus milii, Canis familiaris, Danio rerio, Esox lucius, Gadus morhua, Gallus gallus, Heterocephalus glaber, Homo sapiens, Latimeria chalumnae, Lepisosteus oculatus, Loxodonta africana, Mus musculus, Myotis lucifugus, Nothobranchius furzeri, Orcinus orca, Oryzias latipes, Petromyzon marinus, Physeter catodon, Rhincodon typus, Takifugu rubripes, Tursiops truncatus, Xenopus tropicalis). We first extracted 275 single-copy gene families of 25 species from the orthologous gene family table of 82 species. We filtered out clusters with average GC3 below 0.45 to prevent bias47, leaving 255 clusters. We performed multiple sequence alignment (MSA) of each remaining single copy gene family using MUSCLE 3.8.3141 and concatenated the MSA results without gap regions. The phylogenetic tree was constructed using RAxML 8.248 with maximum likelihood (1,000 bootstrapping), using the PROTCATLG amino acid substitution model (Figure 3D).
4.2 Divergence time estimation
We estimated that the common ancestor of the whale shark and elephant fish diverged roughly 268 million years ago (MYA) (Figure 3D). Divergence times were estimated using the MCMCtree program in PAML package 4.849 with the independent rates model (clock=2). The date of the node between O. orca-L. chalumnae was constrained to 401-425 MYA and O. latipes-R. typus was constrained to 450-497 MYA based on the TimeTree database50.
4.3 Whale shark evolutionary rate
We compared the molecular evolutionary rate of the whale shark and other 23 species with sea lamprey as an outgroup. We found that the whale shark had the shortest distance to the outgroup (sea lamprey) indicating slowest evolutionary rate (Table S18). We also performed a relative rate test using MEGA751, and found that the whale shark protein coding genes are evolving more slowly than any 81 species (Table S19). We also performed the Two-Cluster test with LINTRE52. The distances between nodes in the phylogenetic tree (Figure 3D and Figure S17) used as the pairwise distances for the Two-Cluster test. The distances from the sea lamprey as an outgroup were calculated using the ‘ape’ R-package53. The two-cluster test also supported that the whale shark has a slower evolutionary rate than the elephant fish (Table S20).
The pairwise distances were calculated using the R-package ‘ape’53 with an outgroup (sea lamprey).
When accounting for the age and length (duration) of evolutionary eras (average era ages: ancient, ~1,570 Mya; mid, ~473 Mya; young, ~100 Mya), the number of genes in every era increases steadily as the genes are more recent, which suggests that gene turnover is highest in recent ages.
4.4 Gene family expansion and contraction analyses
Gene family expansion/contraction analyses were performed by using CAFÉ software54 (v3.1) with phylogenetic tree and p-value cut-off <0.05 demonstrating significantly changed the number of genes in the family (Figure 3D). 32 gene families were expanded and 233 gene families were contracted in the whale shark genome. We performed gene ontology (GO) enrichment test using ClueGO55. Expanded gene families were enriched in pattern specification involved in kidney development (GO:0061004) and nephron tubule formation (GO:0072079). Contracted gene families were enriched in nucleosome assembly (GO:0006334) and chromatin assembly (GO:0031497). We also found smaller number of histone 1 (H1), histone 2A (H2A) and histone 2Bs (H2Bs) in the whale shark than in other bony fishes and mammals (Figure S20).
4.5 Neural genes
We downloaded and corrected the neuronal genes with ten categories from GO and public databases as below.
1) Neuronal connectivity genes:
View this table:2) Cell adhesion:
View this table:3) Olfactory receptors:
View this table:4) Ion channel:
View this table:5) Unfolded protein response associated genes:
View this table:6) Neuronal activity and memory:
View this table:7) Neuropeptides:
- Neuropeptide database (http://www.neuropeptides.nl/tabel%20neuropeptides%20linked.htm) (96 genes)
- Two genes, CCAP and AstA (Allatostatin)
8) Homeobox genes:
- HGNC database (https://www.genenames.org/) (319 genes)
9) Synaptic genes:
- SynaptomeDB (http://metamoodics.org/SynaptomeDB/index.php) (1,886 genes)
10) Neurodegeneration:
- KEGG Human diseases (http://www.genome.jp/)
✓ Neurodegenerative diseases (236 genes)
4.6 Gene set enrichment analysis with gene size
Gene Set Enrichment Analysis (GSEA)56 was used to calculate statistically significant differences between short and long gene in 82 species using the clusterProfiler package57 with Gene Ontology. All genes were assigned to human gene symbols in order to use human-GO. Finally, we obtained the results of 77 species (Figure 4C and Additional File 1).
5. Scaling relationships
5.1 Scaling between genomic traits, physiological traits, and ecological parameters
We set out to determine whether the statistically significant correlations between genomic traits, physiological traits, and ecological parameters that we observed (Figure 2) in our array of species (centered on Chordates) reflect scaling relationships that may be formalized as power laws written as Y = A*XB 40. Consistent with previous work40, we found that the Basal Metabolic Rate (BMR) correlates with mass (B = 0.68, Figure S25). Furthermore, genome size, measured as golden path length, scales with gene size, measured as summed length of exons and intron per gene (B = 1.32, Figure S26), consistent with the observed lengthening of the whale shark genome by expanded CR1 repetitive elements (Figure 1A). Additionally, unlike in bacteria58 and crustaceans59, genome size in Chordates scales positively with temperature (B = 0.77, Figure S27).
5.2 Scaling of neural genes to average gene lengths
Since several categories of neural genes are longer than average genes (Figure 4A, Figure S21), we examined whether their neural and average lengths obey a scaling relationship. Surprisingly, we found that neural genes are scaled to average genes with an exponent greater than 1 (B = 1.038, Figure S28), with the whale shark showing an extreme lengthening of neural genes.
Acknowledgements
We thank Dr. Mark Erdmann for generously providing the whale shark photograph used in Figure
1. This work was supported by the Genome Korea Project in Ulsan (800 genome sequencing) Research Fund (1.180017.01) of UNIST (Ulsan National Institute of Science & Technology) and the Genome Korea Project in Ulsan (200 genome sequencing) Research Fund (1.180024.01) of UNIST (Ulsan National Institute of Science & Technology). V.L. thanks Marc W. Kirschner and gratefully acknowledges funding support from the National Institutes of Health of USA (R01 HD073104 and R01 HD091846 to M.W.K.).
Footnotes
↵* These authors jointly supervised this work.