ABSTRACT
Heterochromatic regions of the genome are repeat-rich and gene poor, and are therefore underrepresented in even in the best genome assemblies. One of the most difficult regions of the genome to assemble are sex-limited chromosomes. The Drosophila melanogaster Y chromosome is entirely heterochromatic, yet has wide-ranging effects on male fertility, fitness, and genome-wide gene expression. The genetic basis of this phenotypic variation is difficult to study, in part because we do not know the detailed organization of the Y chromosome. To study Y chromosome organization in D. melanogaster, we develop an assembly strategy involving the in silico enrichment of heterochromatic long single-molecule reads and use these reads to create targeted de novo assemblies of heterochromatic sequences. We assigned contigs to the Y chromosome using Illumina reads to identify male-specific sequences. Our pipeline extends the D. melanogaster reference genome by 11.9-Mb, closes 43.8% of the gaps, and improves overall contiguity. The addition of 10.6 MB of Y-linked sequence permitted us to study the organization of repeats and genes along the Y chromosome. We detected a high rate of duplication to the pericentric regions of the Y chromosome from other regions in the genome. Most of these duplicated genes exist in multiple copies. We detail the evolutionary history of one sex-linked gene family—crystal-Stellate. While the Y chromosome does not undergo crossing over, we observed high gene conversion rates within and between members of the crystal-Stellate gene family, Su(Ste), and PCKR, compared to genome-wide estimates. Our results suggest that gene conversion and gene duplication play an important role in the evolution of Y-linked genes.
BACKGROUND
Heterochromatic regions of the genome are dense in repetitive elements and rarely undergo recombination via crossing over (Charlesworthet al. 1986). While heterochromatin is generally gene-poor, this compartment of the genome harbors functional elements (Gatti and Pimpinelli 1992) that impact a diverse array of processes, including nuclear organization (Csink and Henikoff 1996), chromosome pairing and segregation (Dernburget al. 1996; McKeeet al. 2000; Rosicet al. 2014), and speciation (e.g. Bayes and Malik 2009; Ferree and Barbash 2009; Cattani and Presgraves 2012). In many cases, the functionally relevant sequences are unknown, in part because it is difficult to sequence and assemble repeat-rich heterochromatic sequences. These sequences can be unstable in cloning vectors and/or toxic to E. coli cells (Carlson and Brutlag 1977; Lohe and Brutlag 1987b; Lohe and Brutlag 1987a) and thus may be underrepresented in clone-based sequencing libraries. Repetitive reads also present a challenge to most genome assemblers (Treangen and Salzberg 2011). As a result, many heterochromatic regions of the genome are missing from even the best genome assemblies (Hoskinset al. 2002; Carvalhoet al. 2003). Drosophila melanogaster has arguably one of the most contiguous genome assemblies of any metazoan (Chakrabortyet al. 2016). However, only ~143 Mb of the estimated ~180-Mb-genome is assembled into contigs (Hoskinset al. 2015). Heterochromatin makes up ~20% of the female and ~30% of the male D. melanogaster genome (the entire 40-Mb Y chromosome is heterochromatic; (Hoskins et al. 2002). The latest iteration of the reference genome assembly used BAC-based methods to extend into pericentromeric and telomeric regions, and increased the representation of the Y chromosome over 10-fold—the most recent genome assembly (version 6, R6 hereafter) includes ~27 Mb of heterochromatin, ~4 Mb of which is from the Y chromosome (Hoskinset al. 2015).
The Drosophila Y chromosome has been particularly recalcitrant to assembly (Hoskinset al. 2015). In addition to problems with cloning and assembly, we expect Y-linked sequences to have 50% and 25% of the autosomal coverage in male and mixed-sex sequencing libraries, respectively. Approximately 80% of the D. melanogaster Y chromosome is likely tandem repeats (Bonaccorsi and Lohe 1991). There are only ~20 known Y-linked genes (Carvalhoet al. 2015), at least six of which are essential for male fertility (Kennison 1981). Despite being gene-poor, Y chromosomes can harbor functional variation. For example, structural variation on the Y chromosome in mammals affects male fertility (Reijoet al. 1995; Vogtet al. 1996; Sunet al. 2000; Reppinget al. 2003). Similarly, Y-linked genetic variation in D. melanogaster has significant effects on male fertility (Chippindale and Rice 2001), including heat-induced male sterility (Rohmeret al. 2004). Y-linked variation in Drosophila also affects global gene expression (Lemoset al. 2008) and chromatin states across the genome (Lemoset al. 2010; Brown and Bachtrog 2014; Brown and Bachtrog 2017). It is unlikely that this functional variation maps to the few known Y-linked genes because there is very little nucleotide variation in coding regions (Zurovcova and Eanes 1999; Larracuente and Clark 2013). Instead, the Y chromosome may act as a sink for chromatin factors. Variation in the amount of Y-linked heterochromatin may influence the distribution of chromatin modifiers elsewhere in the genome (Dimitri and Pisano 1989; Henikoff 1996; Francisco and Lemos 2014; Brown and Bachtrog 2017). Without knowing the structure and composition of Y chromosomes, it is difficult to study this phenomenon in detail. Targeted attempts to sequence and assemble the Y chromosome have only had limited success (Hoskinset al. 2002; Abadet al. 2004; Mendez-Lagoet al. 2009; Mendez-Lagoet al. 2011; Hoskinset al. 2015; Mahajanet al. 2018). Single-molecule long read sequencing approaches (Brantonet al. 2008; Eidet al. 2009) are improving our ability to assemble repetitive regions of complex genomes (Huddlestonet al. 2014; Chaissonet al. 2015; Chang and Larracuente 2017; Jainet al. 2017; Khostet al. 2017) However, so far these approaches have only resolved relatively small segments of the Drosophila Y chromosome (Carvalhoet al. 2015; Krsticevicet al. 2015).
Here we develop an approach using single-molecule long read sequencing from Pacific Biosciences (PacBio; Kim et al. 2014) to create heterochromatin-enriched genome assemblies. We use this approach to build a new assembly of the D. melanogaster genome that fixes gaps in euchromatin, adds a substantial amount of heterochromatin, and improves the overall contiguity of the genome assembly. Most of the additional sequence in our assembly is Y-linked, allowing us study Y chromosome composition in fine detail. We describe the landscape of transposable elements, the high rate of Y-linked gene duplication, and patterns of gene conversion among members of Y-linked multi-copy gene families.
METHODS
Heterochromatin sensitive assembly
We used BLASR (v5.1; Chaisson and Tesler 2012) to map PacBio reads (from Kim et al. 2014) to the release 6 (R6) D. melanogaster genome. Both the PacBio sequence reads and the reference genome are from the Iso1 strain. To curate a set of heterochromatin-enriched reads, we extracted any reads that: 1) map outside of the major chromosome arms (i.e. 2L, 2R, 3L, 3R, 4, X) and mitochondria; or 2) are unmapped. We took an iterative approach to genome assembly, generating two versions of both the heterochromatin and the whole genome assemblies, and then reconciling differences between them using quickmerge (Chakrabortyet al. 2016). For the heterochromatin, we generated de novo assemblies with the heterochromatin-enriched reads using Canu v 1.3 (r7740 72c709ef9603fd91273eded19078a51b8e991929; Koren et al. 2017; repeat sensitive settings) and Falcon (v0.5; Chin et al. 2016; see Supplementary methods; Table S1). To improve the assembly of the major chromosome arms, we generated de novo assemblies with all PacBio reads using Falcon and Canu (Supplementary methods). We used quickmerge to combine our de novo heterochromatin-enriched assemblies with our all-read de novo assemblies sequentially, and then with two reference assemblies (R6; Hoskins et al. 2015) and a de novo PacBio assembly from (Chakraborty et al. 2016; Table S1). The detailed Falcon and Canu parameters for each de novo assembly and outline of the assembly and reconciliation process are in the supplementary methods. We also manually inspected each assembly, paying particular attention to Y-linked genes, where gaps in the assembly can occur because of low read coverage. We extracted raw or corrected reads from 7 Y-linked regions with read coverage < 10 and reassembled these manually in Geneious v 8.1.6 (Kearseet al. 2012). Before attempting to merge any assemblies, we checked that the gene order on all major chromosome arms agreed with R6 and examined the completeness of genes in pericentromeric regions, telomeres, and the Y chromosome. In our final reconciled assembly, we manually adjusted any errors in the Rsp, Sdic, and Mst77Y regions based on their organization in previous studies (Krsticevicet al. 2015; Cliftonet al. 2017; Khostet al. 2017). We removed redundant contigs using MUMMER implemented in Masurca (v3.2.2; Zimin et al. 2017) and polished the resulting assembly using Quiver (SMRT Analysis v2.3.0; Chin et al. 2013). To correct any errors in regions with low PacBio coverage, we ran Pilon v1.22 (Walkeret al. 2014) 10 times with both raw Illumina reads and synthetic reads (Table S2; with parameters “--mindepth 3 --minmq 10 --fix bases”). We created two and five scaffolds for the third and Y chromosomes respectively, based on known gene structure. We used MUMMER v3.23 (Kurtzet al. 2004) to map our new assembly to the R6 assembly using “nucmer --mum -l 10000 -D 40”, and only reported the one-to-one alignments using “delta-filter -1”. We remapped PacBio reads to this assembly using minimap v2.5-r607 (Li 2016) with parameters “-t 24 - ax map-pb”. We called coverage of uniquely mapped reads using samtools (v1.3 -Q 10; Li et al. 2009). To report on the sequence added in our assembly, we define heterochromatic regions based on the coordinates in Hoskins et al. (2015) and assume all added sequence beyond these coordinates on major chromosome arms, assigned to the Y chromosome, or on unassigned contigs is enriched in heterochromatin.
Identifying Y-linked contigs
We used Illumina reads from male and female genomic libraries (Table S2) to identify Y-linked contigs. We mapped the male and female reads separately using BWA (v0.7.15; Li and Durbin 2010) with default settings and estimated the coverage of uniquely mapped reads per site with samtools (v1.3; -Q 10). We designated contigs with a median female-to-male read ratio of 0 as Y-linked (excluding sites with 1 or fewer Q>10 reads). To validate the sensitivity and specificity of our methods, we examined our X, Y, and autosome assignments for all 10-kb regions with a known location (only for regions with more than 1-kb of mappable sites).
Gene and repeat annotation
We transferred r6.17 Flybase annotations from the R6 assembly to our final assembly using pBlat (v0.35, https://github.com/icebert/pblat-cluster; Kent 2002) and CrossMap (v0.2.5; Zhao et al. 2014). We then used HISAT2 (2.0.5; Kim et al. 2015) to map the male RNAseq reads (Table S2) to the genome based on known splice sites from the new annotation file. We used Stringtie (1.3.3b; Pertea et al. 2015) with these mapped reads and the guided annotation file from CrossMap to improve annotations and estimate expression levels. For unknown genes, we searched for homology using NCBI-BLAST against known D. melanogaster transcripts sequences (r6.17). To verify misassemblies and duplications, we designed primers to amplify segments of putatively Y-linked contigs/scaffolds with PCR in males and virgin females (Table S3). We also extracted and reverse transcribed RNA from 3-5 days old testes with TRIzol (ThermoFisher) and M-MLV reverse transcriptase (ThermoFisher), and examine splice sites using RT-PCR (Table S3).
To annotate repetitive DNA, we used RepeatMasker 4.06 (Smitet al. 2013) with Repbase 20150807 and parameters “-species drosophila -s”. We modified scripts from (Bailly-Bechetet al. 2014) to summarize TEs and other repetitive sequences. We searched for satellites using TRF (v4.09; Benson 1999) with parameters “2 7 7 80 10 100 2000 -ngs -h”.
Sequence alignments and recombination analyses
We used BLAST v2.2.31+ (Altschulet al. 1990) and custom scripts to extract the transcript sequences from the genome. We aligned and manually inspected transcripts using Geneious v8.1.6 (Kearseet al. 2012). We constructed phylogenetic trees for regions conserved between members of the cry-Stellate family with MrBayes using the autosomal parent gene Ssl as an outgroup. (GTR+gamma HKY85 model; mcmc ngen=1,100,000 nchains=4 temp=0.2 samplefreq=200; seed=20,649). The consensus tree was generated with sumt burnin=500 with > 50% posterior probability. We used the APE phylogenetics package in R (Paradiset al. 2004) to plot the tree. We used compute 0.8.4 (Thornton 2003) to calculate Rmin and estimate population recombination rates based on linkage disequilibrium (Hudson 1987). In addition, we estimated gene conversion rates based on gene similarity (Supplementary methods; Ohta 1984; Rozenet al. 2003; Backstromet al. 2005).
Data availability
The genome assembly and annotations will be publicly available at Dryad. Supplemental materials (Figures S1-S5, Tables S1-S10, and File S1) are available at Figshare. The authors affirm that all data necessary for confirming the conclusions of the article are present within the article, figures, and tables.
RESULTS
Closing gaps in the Release 6 assembly
Major blocks of heterochromatin including the Y chromosome are missing from the latest version of the D. melanogaster genome (Hoskinset al. 2015). We built a new assembly of the D. melanogaster genome that closes gaps in the euchromatin and adds to the assembly in heterochromatin, most notably the Y chromosome. Even with long single molecule reads, unequal read coverage across heterochromatic regions may cause assembly problems (Carvalhoet al. 2016). Because assemblers typically use the top ~30X longest reads for genome assembly, sex-linked regions may be under sampled. For example, some Y-linked regions are extremely underrepresented (e.g. there are no reads from the 3rd exon in Ppr-Y and only 9 reads come from the 2nd and 3rd exons of kl-3). To reduce this potential bias, we assembled the heterochromatin and euchromatin separately and then combine these assemblies with each other and with published versions of the D. melanogaster genome (Fig. 1). We first isolate a set of heterochromatin-enriched reads by mapping all Pacbio reads to the R6 reference and discarding reads mapping uniquely to the euchromatic genome (Fig 1A). Using this approach, we extracted ~1.58 Gb of sequence across 204,065 reads (12% of total reads) for reassembly. With this small subset of reads, we are able to optimize parameters for repeat assembly, partially remedy assembly errors, and increase assembly contiguity. For Canu, we experimented with assembly conditions by varying bogart parameters (see Supplementary methods). For Falcon, we experimented with the minimal overlap length in the string graph. For both methods, we identified parameter combinations that maximized assembly N50, total assembly length, and longest contig length; and without detectable misassemblies in Y-linked coding regions. We note that while assembly length and contiguity are often used to assess assembly quality, the most contiguous assemblies are not always correct (Khostet al. 2017). We therefore reconciled the assembled contigs from the two “best” versions of our heterochromatin-enriched and whole genome assemblies sequentially, and finally with the R6 assembly and another PacBio reference assembly (Fig 1B-C; Chakrabortyet al. 2016). We manually adjusted misassembled contigs and polished the final assembly for use in downstream analyses (Fig 1D). Our final reconciled genome has 200 contigs and is 155.6 Mb in total—this is a great improvement in assembly contiguity over R6 (143 Mb in 2,442 contigs; Table 1). The improvement is in both euchromatic and heterochromatic regions (Figs S2 and S3).
Our new assembly fills all three unassembled gaps in the gene-rich regions of the R6 major chromosome arms (one each on 2R, 3L and 4; Fig S2 and Table S5). Chromosome 4 had a predicted 17-kb gap in R6. In agreement with this predicted gap size, our new assembly inserts 17.996 kb in this gap with (AAATTAT)n repeats and other AT-rich sequences. The gap on chromosome 2R was unsized; our assembly fills this gap with 4,664 bp consisting of 123-bp complex repeats. Interestingly, an annotated non-coding gene, CR44666, is located near the 2R gap in R6 and consists entirely of this 123-bp unit. In agreement with the predicted gap size of ~7kb on 3L, our new assembly inserts 6,157 bp containing one of four tandem copies of the 3S18/BEL transposons. Our assembly therefore places all euchromatic regions of the major chromosome arms on single contigs.
We also made a marked improvement to heterochromatic regions (as defined by Hoskinset al. 2015). In total, we filled 25 of 57 gaps in the R6 major chromosome scaffolds (Table S5). Of these gaps, 14 were located in transposon dense regions; four were associated with complex repeats (two with Responder, one with 1.688 family repeats and one with a newly-identified 123-bp unit), three were associated with 7-bp tandem repeats, and one is associated with rDNA repeats. One is a 17-kb deletion and the other two gaps involve complex rearrangements between R6 and our assembly that may represent scaffolding errors in R6. Our new assembly has ~38.6 Mb of heterochromatin-enriched DNA across 193 contigs, whereas the R6 assembly has ~26.7 Mb of heterochromatin-enriched DNA in 2,432 contigs. Approximately 89% of the additional heterochromatic sequence in this assembly is from the Y chromosome (see below). We assigned some contigs based on their repeat content, e.g. a 180-kb contig from chromosome 2 (Contig 142). This contig terminates in (AATAACATAG)n and (AAGAG)n repeats mapping to cytological bands h37 and h38 (Garaviset al. 2015). Contig 142 extended an existing unmapped R6 scaffold (Unmapped_Scaffold_8_D1580_D1567), which contains a gene (klhl10) that maps to chromosome 2 (http://flybase.org/reports/FBgn0040038).
Identifying Y-linked contigs
The estimated size of the Y chromosome is 40 Mb, however only ~4 Mb is assembled and assigned to the Y chromosome in R6 (Hoskinset al. 2015). Our assembly pipeline based on PacBio reads circumvents the cloning steps associated with BAC-based sequencing, and results in a better representation of heterochromatin, including the Y chromosome. We developed an approach to identify and assign Y-linked contigs based on detecting male-specific sites using Illumina reads (Fig 1E). To validate our method to assign Y-linkage, we used contigs with a known location in R6 as benchmarks. Previous studies in mosquitos and D. melanogaster identified Y-linked contigs using the chromosome quotient (CQ): the female-to-male ratio of the number of alignments to a reference sequence (Hallet al. 2013). In D. melanogaster, this method has 76.3% sensitivity and 98.2% specificity (Hallet al. 2013). Our approach instead considers the number of male-specific regions (where the median per-site female-to-male ratio = 0) and is a better indicator of Y-linkage than CQ—among 14,116 10-kb regions in our assembly with known chromosomal location based on previous data (R6 assembly), we appropriately assigned 99.0% of Y-linked regions (714/721 regions; Fig S4). Only 1.5% of all regions that we assigned to the Y chromosome are not Y-linked in the R6 assembly (11/725 regions; Fig S4). Therefore, our method has both a higher sensitivity and specificity than previous methods. For the 11, 10-kb regions that may be false positives in our method, 9 are from a centromeric scaffold (3Cen_31_D1643_D1653_D1791), and 2 are from the 2nd chromosome telomeres. These regions may be misassigned in the R6 assembly because the centromeric scaffold has a Y-specific repeat, AAAT, (Weiet al. 2018) and telomeric transposons are found on all chromosomes and may vary within strains. We used our method to assign 14.6 Mb to the Y chromosome across 106 contigs (N50 = 415 kb; Table 1). Because ~80% of the 40-Mb Y chromosome consists of tandem repeats (Loheet al. 1993), this is likely near the maximum amount of Y-linked sequence we can expect to identify with current sequencing technology.
Improving known Y-linked gene annotations
The gene order and orientation of Y-linked genes in our assembly is consistent with previous mapping data (Fig. 2; Carvalho et al. 2000; Carvalho et al. 2001; Vibranovski et al. 2008) using Y chromosome deletions, except for Pp1-Y1. Unfortunately, we cannot distinguish whether this difference is due to a misassembly or strain variation. We found splice site errors in two previous Y-linked gene models: kl-5 has three additional introns (in the 1st, 5th and 17th exons of the R6 annotation; Table S6), and CCY has one additional intron (in the 6th exon of the R6 annotation; Table S6). We also found partial duplications of exons in kl-3, ORY, PPr-Y, and WDY (Table S7). Each of these duplications, except ORY, exists on unannotated regions of the R6 assembly. In the R6 assembly, CCY and kl-3 contain misassembled sequences in 6th and 3rd exons coding region, respectively. We therefore corrected the misassemblies in the R6 Y-linked coding regions based on our assembly and PCR validation (Table S3). We used gene annotation data to scaffold the assembly (Fig 1F).
Y-linked gene duplications
We identified 13 independent duplications to the Y chromosome from other chromosomes, seven of which we identify as Y-linked for the first time. 11 of these duplications exist in multiple copies on the Y chromosome (Table 2). We also identified a new Y-linked gene, CG41561 located on an unmapped contig (211000022280328) in the R6 assembly (Mahajan and Bachtrog 2017). Among the 13 duplications, we found that the Y-linked copies of Hsp83, Mst77F (Mst77Y), and vig2 (FDY) are still expressed in testes (FPKM > 5 in at least one dataset; Table S8); however, Hsp83 contains a premature stop codon and a TE insertion. Therefore, outside of Mst77Y and FDY, we do not have evidence for their function (Krsticevicet al. 2010; Krsticevicet al. 2015). Interestingly, these duplications seem to be clustered on the Y chromosome: six of duplications are on Y_scaffold4 and five of the duplications are on Y_Contig2 (Table 2). Y_scaffold4 and Y_Contig2 are from the cytological divisions h10-15 and h17-18, respectively (Fig. 2). Additionally, FDY (Y_Contig10) maps to h15-h20 (Krsticevicet al. 2015). Therefore, 12 of the duplications are located between h10-h20 (11 of 25 Y-linked cytological bands), suggesting that the pericentromere of the Y chromosome (defined here as h10-h20) is enriched for duplicated genes in D. melanogaster (Fisher’s exact test; P = 0.005).
Repeat content in Y-linked contigs
Cytological observations indicate that the Y chromosome is highly enriched for repetitive sequences (Loheet al. 1993; Carmena and Gonzalez 1995; Pimpinelliet al. 1995), however there have not been attempts to document this at the sequence level. We used our assembly to identify repetitive elements across the Y chromosome. Consistent with previous studies, we find that the Y chromosome is enriched for rDNA and IGS repeats (Ritossa and Spiegelman 1965; Fig 3A and Table S9). The rDNA are located across 54 scaffolds/contigs, including 1 Y-linked scaffold, 12 Y-linked contigs, 2 X-linked contigs, and 39 unknown contigs (Table S9). We identified 56 copies of 18s rDNA, 238 copies of 28s rDNA, and 721 copies of IGS repeats on the Y chromosome. LTR and LINE transposons contribute 53% and 19% of the total sequence, respectively, in our Y-linked contigs (Fig 3A). We assume that most of the unassembled parts of the Y chromosome are simple tandem repeats (Loheet al. 1993). Based on this assumption, we estimate that 65% of the 40-Mb Y chromosome is simple tandem repeats, and LTR and LINE elements comprise 18% and 7% of the total 40-Mb Y chromosome, respectively. Compared to the rest of the genome, the Y chromosome has a 1.4 - 1.8 fold enrichment of retrotransposons (10.2% of LTR and 5.0% of LINE for the rest of the genome), while DNA transposon content is similar among chromosomes (2.3% on Y and 2.2% for the rest of the genome, Fig. 3A). The Y chromosome is enriched for retrotransposons over DNA transposons even when compared to other heterochromatic genomic regions (Fig. S5).
Previous studies predicted the repeat composition of the Y chromosome based on the presence/absence of in situ hybridization (ISH) signals on mitotic chromosomes. (Carmena and Gonzalez 1995; Pimpinelli et al. 1995). Our assemblies recapitulate these ISH results. For example: P, hobo, FB4, and Bari-1 are nearly absent from the Y chromosome (< 3.5 kb of total sequence), while Dm412, Gypsy, HetA, Doc, TART, Mdg1, Mdg3, blood, and FW have at least 14 kb of sequence on the Y chromosome (Fig 3B and Table S9; Carmena and Gonzalez 1995; Pimpinelli et al. 1995; Junakovic et al. 1998; Agudo et al. 1999). Previous studies are conflicted about the presence/absence of I elements (Carmena and Gonzalez 1995; Pimpinelli et al. 1995), however we do not see evidence of Y linkage in our assembly. Other transposons also appear to be absent from the Y chromosome, e.g. gypsy4 (Table S9; Fig 3B). Since I element-mediated dysgenesis only occurs in females (Buchetonet al. 1976), it is possible that this element is inactive in the male germline and therefore rarely has the opportunity to invade Y chromosomes. We suggest that the sex-specific activity of TEs may contribute to their genomic distribution.
Tandem repeats are also enriched on Y chromosomes (~65% on the Y chromosome compared to 2.8% on the other chromosomes; (Lohe et al. 1993). Approximately 5% (742,964 bp) of our Y-linked sequences are tandem repeats. We assume that this is a gross underestimate of tandem repeat abundance, but nevertheless helps lend insight into the repeat content and organization of the Y chromosome. Our assembly agrees with most previous cytological and molecular evidence of Y chromosome simple tandem repeat content (Fig. 2; Bonaccorsi and Lohe 1991). Among 32 known Y-linked simple repeats, 20 appear in our Y-linked contigs (Table S10; Bonaccorsi and Lohe 1991; Jagannathanet al. 2017; Weiet al. 2018). The repeats that we do not find may be sequence variants of abundant repeats (e.g. we detect AAAAC and AAAGAC but not AAAAAC or AAAAGAC), not perfectly in tandem, or part of a more complex repeat (e.g. AAGACAAGGAC is part of AAGACAAGGAAGACAAGGACAAGACAAGGAC; Table S10). Although we recover only ~60% of known Y-linked repeats (based on Illumina data, Wei et al. 2018; or ISH, Bonaccorsi and Lohe 1991; Jagannathanet al. 2017), our new assembly including genes and transposable elements provides the most detailed view of Y chromosome organization.
Evolution of the cry-Stellate gene family
The multicopy crystal-Stellate (cry-Ste) gene family is thought of as a relic of intragenomic conflict between X and Y chromosomes (reviewed in Bozzetti et al. 1995; Hurst 1996; Malone et al. 2015). Stellate (Ste) is an X-linked multicopy gene family whose expression is controlled by the Y-linked Suppressor of Stellate (Su(Ste)) locus through an RNA interference mechanism (Nishidaet al. 2007). If left unsuppressed, Ste expression leads to the accumulation of crystals in primary spermatocytes of the testes, resulting in male sterility (Bozzettiet al. 1995). This multicopy gene family has a complicated evolutionary history (Koganet al. 2000). Ste and Su(Ste) are recent duplications of the autosomal gene Su(Ste)-like (Ssl or CK2β) with a testis-specific promoter from casein kinase subunit 2 (Koganet al. 2000). Following the initial duplication of Ssl to the Y chromosome, members of this gene family expanded and duplicated to the X chromosome (Fig 4A). All sex-linked members of this gene family exist in multiple copies. The X-linked copies and Y-linked copies amplified independently, perhaps driven by sex chromosome conflict (Koganet al. 2012). We used our assembly to study the evolution of this interesting gene family and patterns of gene conversion on the Y chromosome. We found 666 copies of genes in the cry-Ste family: 37 on the X chromosome, 627 on the Y chromosome, and 2 from an unknown region. We detect more Y-linked copies than were previously estimated (200-250 complete copies) using southern blotting (McKee and Satter 1996). We found a clade of 122 Y-linked genes that are from an ancestral duplication of Ssl and fall as an outgroup to Ste and Su(Ste) (Fig. 4B). These copies, originally identified in a Y-derived BAC, are designated as pseudo-CK2β repeats on the Y chromosome (PCKRs) and have the ancestral promoters (Danilevskayaet al. 1991; Usakinet al. 2005). However, there is very little expression among the 107 copies of PCKR (total FPKM <3 from FBtr0302352 and MSTRG.17120.1; Table S8). Ste copies appear in both the X heterochromatin and euchromatin (hereafter referred to as hetSte and euSte, respectively; (Livak 1984; Shevelyov 1992). In addition to the 13 previously-assembled copies of euSte (cytological divisions 12E1 to 12E2), we found an additional 20 copies of Ste located on two X-linked contigs (17 on Contig5 and 3 on X_9), corresponding to functional hetSte copies, and pseudogenized tandemly-repeated heterochromatic elements related to Stellate (SCLRs; Nurminsky et al. 1994; Tulin et al. 1997). The 3 SCLRs on the contig X_9 were present but not annotated in the R6 assembly and were located proximal to hetSte. We assembled 17 hetSte in a single 500-kb contig, where two hetSte loci (5 and 12) are separated by BATUMI and rDNA sequences. However, previously published data using restriction maps and southern blotting suggests that hetSte are organized into three loci (with ~14, 3, and 4 copies) separated by BATUMI and rDNA (Tulinet al. 1997). Our phylogenetic analysis reveals that SCLRs and hetSte are clustered, suggesting that hetSte and euSte amplified independently or experience concerted evolution (Fig. 4B).
The 627 Su(Ste) and PCKR copies are spread across 10 and 3 Y-linked contigs, respectively. These repeats primarily occur in tandem and are flanked by different transposon sequences, including 1360, Gypsy12, and the telomere-associated transposons, HeT-A, TART, and TAHRE. Previous studies suggested that the acquisition of 1360 in Su(Ste) may have been an important step in Su(Ste) evolving a piRNA function to suppress Ste (Usakinet al. 2005). HeT-A colocalizes with Ste-like sequences in the BAC Dm665 (Danilevskayaet al. 1991). We found that the Ste-like sequences in Dm665 are PCKRs and are located proximal to Su(Ste), between WDY and Pp1-Y1. Consistent with BAC data and our assembly, this region is also enriched for telomeric sequences (based on ISH, Traverse and Pardue 1989; Abad et al. 2004). Interestingly, we found 2 chimeric copies of PCKR and Su(Ste) (Fig. 4C), suggesting inter-genic gene conversion occurred between these genes. Previous studies hypothesized that gene conversion homogenizes Su(Ste) clusters, but these studies were only based on restriction maps or a few variants (Balakirevaet al. 1992; McKee and Satter 1996). We investigated the rate of gene conversion on the Y chromosome using 107 copies of PCKR and 406 copies of Su(Ste) after removing fragments smaller than 280 bp. We detected evidence of recombination at both PCKR (per 857-bp locus: Rmin=2 and ρ = 2.67; cg = 2.9 x 10-5 events per site, per generation) and Su(Ste) (per 1203-bp locus: Rmin=1 and ρ = 4.04; cg = 8.2 x 10-6 events per site, per generation). Since there is no recombination via crossing over, we estimate the Y-linked gene conversion rate to be 0.8-5 x 10-5 events per site, per generation. We also used estimates of similarity among repeats within each gene family to estimate gene conversion rates (Supplementary methods; cg). Assuming a mutation rate of 2.8 x 10-9 per site per generation (Keightleyet al. 2014) we estimate the rate of gene conversions per site per generation to be 2.1 x 10-5 and 1.5 x 10-4 for PCKR and Ste, respectively. These rates are ~103 – 104 times higher than gene conversion rates on the autosomes and X chromosome (Comeronet al. 2012; Milleret al. 2012; Milleret al. 2016), and surprisingly similar to the rate observed in mammalian Y and bird W chromosomes (Reppinget al. 2003; Backstromet al. 2005); both based on cg). Rmin and LD-based estimators may underestimate the true gene conversion rate because both recent amplification and selection could decrease variation among copies and cause us to miss recombination events. On the other hand, we likely overestimate the gene conversion rate based on similarity among copies for the same reasons. With both approaches, our data suggest high rates of intrachromosomal gene conversion on Y chromosomes. Recombination may also occur between the X and Y chromosomes—of the 116 variant sites in Ste, 62 of the same variants are found at the homologous positions in PCKR and/or Su(Ste). It will be important to further explore rates of Y-linked gene conversion using multiple strains of D. melanogaster. Higher gene conversion rates in Y-linked multicopy gene families may be important for the evolution of Y-linked genes.
Discussion
Heterochromatic sequences can contain important genetic elements (e.g. Gatti and Pimpinelli 1992) but tend to be underrepresented in genome assemblies. Single-molecule real time sequencing is making strides towards achieving complete assemblies of complex genomes (Huddlestonet al. 2014; Chaissonet al. 2015), however densely-repetitive regions still present a significant assembly challenge that often requires manual curation (Krsticevicet al. 2015; Cliftonet al. 2017; Khostet al. 2017). Uneven read coverage across the genome, and lower read coverage in heterochromatic regions, likely cause problems with genome assembly (Krsticevicet al. 2015; Chang and Larracuente 2017; Khostet al. 2017). Our assembly approach is based on the in silico enrichment of heterochromatic reads, followed by the targeted reassembly of heterochromatic regions, and finally, a reconciliation between whole genome and heterochromatin-enriched assemblies. This approach helped fill gaps, fix errors, and expand the D. melanogaster reference assembly by 11.9 Mb (8% more sequence than the latest release, R6).
Approximately 89% of the additional sequence in our assembly is from the Y chromosome, allowing us to get a detailed view of Y chromosome organization. Despite these improvements, we are still missing some Y-linked regions and some required manual correction. Assemblers filter reads when they appear chimeric or where pairs of reads disagree about overlaps. Canu and Falcon tend to disagree about the organization of some highly repetitive sequences (e.g. Rsp, Khostet al. 2017; Sdic, Cliftonet al. 2017; and Mst77Y, Krsticevic et al. 2015). Our approach does not completely remedy this problem—we also identified errors in our preliminary assemblies that required manual correction. For these misassembled regions, Falcon and Canu arrive at different sequence configurations (e.g. we found 20 copies of Mst77Y in the Canu assembly and 14 copies in the Falcon assembly). To resolve these differences, we leveraged evidence from ISH studies and known gene structures to identify and reconcile differences between the assemblies. Our results suggest that merging multiple assemblies and examining discordant regions using independent evidence is instrumental in assembling complex genomes.
Our biggest improvement to the assembly was on the Y chromosome, which has an unusual composition—its ~20 genes are interspersed among ~40 Mb of repetitive elements (Ritossa and Spiegelman 1965; Loheet al. 1993; Carmena and Gonzalez 1995; Pimpinelliet al. 1995; Abadet al. 2004). Natural variation among D. melanogaster Y chromosomes can have broad effects on genome function and organismal fitness (e.g. Carvalho et al. 2000; Vibranovski et al. 2008; Paredes et al. 2011; Francisco and Lemos 2014; Kutch and Fedorka 2017; Wang et al. 2017). The extremely low nucleotide diversity of Y-linked genes (e.g. Zurovcova and Eanes 1999; Larracuente and Clark 2013; Morgan and Pardo-Manuel de Villena 2017) suggests that the Y-linked functional variation likely maps to the non-genic regions. The Y chromosome is a strong modifier of position effect variegation (PEV), a phenomenon that results in the stochastic silencing of euchromatic reporters caused by the spreading of heterochromatin (Karpen 1994; Elgin 1996; Wakimoto 1998). Y chromosomes may act as heterochromatin sinks, where extra Y-linked heterochromatin can titrate available heterochromatin-binding proteins away from other genomic locations. This may explain how genetic variation in Y-linked heterochromatin affects global gene expression (Henikoff 1996; Francisco and Lemos 2014; Brown and Bachtrog 2017). Alternatively, variation in Y-linked loci that generate small RNAs may have wide-scale impacts on chromatin organization (Zhouet al. 2012). These effects are difficult to tease apart without having a detailed view of Y chromosome sequence and organization. Our study discovered features of the Y chromosome that may relate to its interesting biology. Variation in Y-linked heterochromatin may affect the amount of silent chromatin marks in transposons (Brown and Bachtrog 2017), perhaps contributing to the higher rate of TE activity in males. We show that RNA transposons are generally overrepresented on the Y chromosome. It is possible that the overrepresentation of Y-linked retrotransposons is due to their increased activity in males: the Y chromosome heterochromatin sink effect may lead to reduced transcriptional silencing of TEs. In contrast to DNA transposons, the movement of retrotransposons is transcription dependent and therefore may result in differences in activity between the sexes. If the Y chromosome behaves as a sink for heterochromatin proteins, then we may expect the overrepresentation of RNA transposons to be a universal feature of Y chromosomes. Alternatively, differences in DNA repair or non-homologous recombination might lead to the differential accumulation of DNA and retrotransposons on the Y chromosome compared to the rest of the genome.
Y-linked structural variations can impact genome-wide gene regulatory variation in flies (e.g. Su(Ste) and rDNA; Lyckegaard and Clark 1989; Zhou et al. 2012) and male fertility in mammals (Reijoet al. 1995; Vogtet al. 1996; Sunet al. 2000; Reppinget al. 2003; Morgan and Pardo-Manuel de Villena 2017). We find a large amount of gene traffic to the D. melanogaster Y chromosome from elsewhere in the genome. While estimates of interchromosomal duplications between the X and major autosomal arms range from ~3 (Bhutkaret al. 2007) to 7 (Han and Hahn 2012) on the D. melanogaster branch, we find at least 10 interchromosomal duplications to the Y chromosome. This observation is similar to other studies across taxa (Koerichet al. 2008; Hallet al. 2013; Hughes and Page 2015; Mahajan and Bachtrog 2017; Tobleret al. 2017). Our Y chromosome assembly provides new insights into the organization and mechanisms behind these duplications. For example, we found that most new translocations are DNA based and clustered in the Y pericentromic heterochromatin. The Y chromosome heterochromatin appears to be distinct from other heterochromatic regions of the genome, with properties that vary along the length of the chromosome (Wanget al. 2014). We hypothesize that the Y chromosome pericentromeric heterochromatin may be more accessible than other regions of the chromosome. If so, the increased accessibility may affect transcriptional activity and make these regions more prone to double stranded breaks (DSBs) that would facilitate structural rearrangements. Therefore, Y-linked pericentromeric chromatin may be more permissive to transcription compared to the rest of the chromosome allowing for natural selection to retain insertions that result in functional products. This may provide insights into how new Y-linked genes gain testis-specific functions. Notably, most Y-linked translocations are DNA-based and therefore involve DSB repair. Without a homolog to provide a template for DSB repair, microhomology-mediated end joining of non-homologous sequences may lead to insertions in the Y chromosome. DSB repair may also result in tandem duplications that contribute to the observed copy number variation in Y-linked genes. We discovered that most of the recent translocations to the Y chromosome exist in multiple copies (Table 2), suggesting that the tandem duplication rate may also be higher in the pericentric regions. However, most of these newly acquired genes are pseudogenized and are likely not constrained by natural selection. Many of the functional Y-linked genes are at least partially duplicated. Most essential Y-linked genes (kl-2, kl-3, kl-5 and ORY) have larger introns (> 100Kb), with some introns reaching megabases in size (Kureket al. 2000; Reugelset al. 2000). For genes with large overall sizes, complete gene duplications are less likely. In contrast, some functional genes, e.g. rDNA, Mst77-Y and Su(Ste), exist in multiple copies and are sensitive to gene dosage (Lyckegaard and Clark 1989; Zhouet al. 2012; Kostet al. 2015). A high duplication rate on the Y chromosome may therefore facilitate the evolution of Y-linked gene expression.
In mammals, some Y-linked genes have amplified into tandem arrays and exist in large palindromes (e.g. Rozen et al. 2003; Hughes et al. 2012; Soh et al. 2014). Gene conversion within these palindromes may be important for increasing the efficacy of selection on an otherwise non-recombining chromosome (Charlesworth 2003; Rozenet al. 2003; Connallon and Clark 2010). Interestingly, the largest gene families in the D. melanogaster genome, outside of the rDNA and histone clusters, are the Y-linked genes Su(Ste) and PCKR. We inferred a higher rate of gene conversion in both PCKR and Su(Ste) than the rest of the genome, and similar to the rate observed in mammalian Y chromosome (Rozenet al. 2003). However, our estimates do not consider recent selection or amplification of PCKR and Su(Ste). The elevated Y-linked gene conversion rates may be a consequence of having more highly amplified gene families than other genomic locations. Alternatively, the Y chromosome may have evolved distinct patterns of mutation because it lacks a homolog —low copy number Y-linked genes also have relatively high rates of gene conversion in Drosophila (Koppet al. 2006) and humans (Rozenet al. 2003). Gene conversion between members of Y-linked multi-copy gene families may counteract the accumulation of deleterious mutations through evolutionary processes such as Muller’s ratchet (reviewed in Charlesworth and Charlesworth 2000; Charlesworth 2003; Rozenet al. 2003; Connallon and Clark 2010). If so, then we might expect high gene conversion rates to be a feature common among Y chromosomes.
Acknowledgements
We thank Dr. Kevin Wei for feedback on the manuscript, and the University of Rochester Center for Integrated Research Computing for access to computing cluster resources. We also thank Drs. Casey Bergman, Matt Hahn, and Tom Eickbush for helpful discussion. This work was supported by National Institutes of Health National Institute of General Medical Sciences grant R35 GM119515-01 to A.M.L.
Footnotes
Author contact information: 343 Hutchison Hall Department of Biology University of Rochester Rochester, NY 14627 585-273-1693