Abstract
A long-standing question in molecular biology relates to why the testes express the largest number of genes relative to all other organs. Here, we report a detailed gene expression map of human spermatogenesis using single-cell RNA-Seq. Surprisingly, we found that spermatogenesis-expressed genes contain significantly fewer germline mutations than unexpressed genes, with the lowest mutation rates on the transcribed DNA strands. These results suggest a model of ‘transcriptional scanning’ to reduce germline mutations by correcting DNA damage. This model also explains the rapid evolution in sensory- and immune-defense related genes, as well as in male reproduction genes. Collectively, our results indicate that widespread expression in the testes achieves a dual mechanism for maintaining the DNA integrity of most genes, while selectively promoting variation of other genes.
Main Text
Human tissues and organs are distinguished by the genes that they express and those that they do not 1,2. Tissues have transcriptomes of different complexities in terms of uniquely-expressed genes, as well as those genes expressed at differential levels 3–6. One overarching goal in the life sciences is to characterize the specific transcriptomic signatures of all human tissues, and ultimately each different cell type at the single-cell level 7.
In males, the testis is unique in comparison with somatic tissues in that it contains germ cells which pass the genetic information on to the next generation 8. Interestingly, it has been known for many years that the testis stands out as having the most complex transcriptome with the highest number of expressed genes 9–12. Widespread transcription in the testes has been reported to account for an amazing expression of over 80% of all our protein-coding genes 10,11,13, as well as across many other mammals 3,10.
Several hypotheses have been proposed to explain this observation. Widespread expression may represent a functional requirement for the gene-products in question 12. However, other more complex organs such as the brain do not exhibit a corresponding number of expressed genes despite the fact that they consist of a substantially greater number of distinct cell types 3,10,14–16. Moreover, recent animal studies have shown that many testis-enriched and evolutionarily-conserved genes are not required for male fertility in mice 17. A second hypothesis implicates leaky transcription during the massive chromatin remodeling that occurs throughout spermatogenesis 12,18,19. However, this model predicts more expression during later stages of spermatogenesis – when the genome is undergoing the most chromatin changes – contradicting the observation 13,18. Additionally, the energetic requirements for the observed widespread expression are sufficiently large that such leaky expression would be expected to be under tighter control 20. Given this lack of a compelling explanation for widespread testes transcription, the topic remains an interesting and yet unanswered question.
Here we propose a model in which widespread testis transcription modulates gene evolution rates. Beyond functional requirements for reproduction, widespread transcription acts as a scanning mechanism through the majority of human genes, detecting and repairing bulky DNA damage events through transcription-coupled repair (TCR) 21,22, which ultimately reduces germline mutation rates and gene evolution rates. Genes that are not expressed in the male germline do not benefit from the reduced mutation rates. These genes do not constitute a random set but rather are enriched in sensory and defense-immune system genes, accounting for previous observations that these genes evolve faster 23,24. We also found that transcription-coupled damage (TCD) overwhelms this pattern in the very highly expressed genes, which are enriched in spermatogenesis-related functions, implicating TCD-modulated gene evolution. By understanding the uneven germline mutation patterns and the intrinsic mechanism of germline DNA damage removal, we will be in a better position to understand human genome evolution and genetic diseases 25.
Single-cell RNA-Seq reveals the developmental trajectory of spermatogenesis
The developmental process of spermatogenesis includes mitotic amplification, meiotic specification to generate haploid germ cells, and finally differentiation and morphological transition to mature sperm cells (Fig. 1A) 26. Technical limitations confined previous gene expression analyses of spermatogenesis to its broad stages: spermatogonia, spermatocytes, round spermatids and spermatozoa 10,13. To systematically characterize the detailed transcriptomic signatures throughout the entirety of spermatogenesis, we applied high-throughput single-cell RNA-Seq to the human testes (Fig. S1A) 27.
A principal component analysis (PCA) revealed clusters of cells including a large continuous cluster (Fig. 1B). Using previously determined stage markers to infer the identity of the cells, we annotated the main spermatogenic stages, as well as the somatic Leydig and Sertoli cells (Fig. 1B, right). Excluding the somatic cells, PCA on the germ cells revealed a horseshoe-shaped cluster suggesting that the order of the cells corresponds to developmental time (Fig. 1C, S1C, SI methods). Three independent lines of evidence support this projection. First, the order of expression of known marker genes across the horseshoe-shaped cluster matches their developmental order (Fig. S1C). Second, the Monocle2 algorithm, which identifies developmental trajectories, revealed the same order of cells (Fig. S1D-E) 29. Finally, using the pattern of unspliced versus spliced transcripts across the cluster as a means to predict the developmental trajectory 28 also reinforced this interpretation (Fig. 1C and SI methods). The arrows in Figure 1C relate the unspliced transcriptome of cells with the spliced transcriptome of other cells, allowing the inference of developmental time. From these lines of evidence, we concluded that the germ cell transcriptomes could be ordered as successive stages throughout spermatogenesis. This detailed delineation of spermatogenic stages provides stage-specific marker-gene expression with unprecedented resolution of molecular signatures of spermatogenesis (Figs. 1D and S2).
TCR-induced reduction of germline mutation rates
We hypothesized that the widespread transcription in spermatogenesis may lead to two scenarios (Fig. 2A): 1) open chromatin in transcribed regions leads to a higher mutagenic likelihood by transcription-coupled damage (TCD) 30, and consequently to higher germline mutation rates and divergence across species; and/or 2) the transcribed regions are subject to transcription-coupled repair (TCR) of the DNA 21, thus reducing germline mutation rates and safeguarding the germline genome, leading to lower divergence across species. To study these hypotheses, we first utilized our single-cell RNA-Seq data and assigned a spermatogenic stage to each gene according to its period of maximal expression (Fig. 2B, SI methods). Overall, we detected the expression of 87% of all protein-coding genes in one or more stages throughout spermatogenesis (Fig. 2B), consistent with previous observations 10,13.
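The stage assignment described above can be illustrated with a minimal sketch. It assumes that each gene is summarized by its mean expression per stage and that a gene whose peak falls below a cutoff is called unexpressed; the cutoff value, the function names and the toy gene names are hypothetical and are not taken from the SI methods.

```python
def assign_stage(stage_means, min_expression=1.0):
    """Return the stage with maximal expression, or 'unexpressed'.

    stage_means: dict mapping stage name -> mean expression of one gene.
    min_expression: hypothetical detection cutoff (an assumption).
    """
    stage, peak = max(stage_means.items(), key=lambda kv: kv[1])
    return stage if peak >= min_expression else "unexpressed"


def summarize(genes, min_expression=1.0):
    """Label every gene and report the fraction detected in any stage."""
    labels = {g: assign_stage(m, min_expression) for g, m in genes.items()}
    expressed = sum(label != "unexpressed" for label in labels.values())
    return labels, expressed / len(labels)


# Toy values for illustration only
toy_genes = {
    "geneA": {"spermatogonia": 0.2, "spermatocytes": 8.0, "round spermatids": 3.0},
    "geneB": {"spermatogonia": 0.0, "spermatocytes": 0.1, "round spermatids": 0.0},
}
toy_labels, toy_fraction = summarize(toy_genes)
```

With real data, the fraction returned by `summarize` would correspond to the proportion of protein-coding genes detected in one or more stages.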
Public databases have amassed over 200 million germline variants detected in the human population, providing a rich resource for studying germline mutation rates 31. Since ∼80% of these germline variants are thought to have originated in males 33,34, we used this dataset to query for widespread transcription-induced effects on the pattern of germline mutations. We thus sought to compare the number of DNA variants between genes expressed and unexpressed in spermatogenesis as a proxy for a difference in the level of DNA damage 35,36. Interestingly, we found that spermatogenesis-expressed genes, regardless of spermatogenic stage of expression, generally have a lower level of germline mutations relative to the unexpressed genes (Fig. 2C), consistent with the previous notion of transcription-coupled repair in spermatogenic cells 37,38. This difference is not observed in the gene flanking sequences (5kb upstream and downstream), indicating a stronger effect in the genic region (Fig. S3) and supporting the notion that the widespread spermatogenesis transcription reduces the level of germline mutations.
If the reduction of mutations follows from a TCR-induced process, we would expect an asymmetry between the mutation levels of the coding and the template strands in the spermatogenesis-expressed genes, but not in the unexpressed genes 32,38–41. The asymmetry would be such that the template strand accumulates fewer mutations since, in TCR, the RNA polymerase on the template strand detects DNA damage 21. To distinguish between mutations occurring on the coding and template strands, we adapted previous approaches to identify strand-asymmetries in the mutation rate (Fig. 2D) 32,38. By studying mutation categories with reference to the coding and template strand, Haradhvala et al. inferred a bias in mutation rates (Fig. 2D, schematic) 32, and a similar strategy was also utilized by Chen et al. 38. We applied this approach to germline mutations and found that a lower mutation rate was inferred on the template strands of expressed genes during spermatogenesis, while no such effect is apparent in the unexpressed genes, as represented by A>T transversion mutations in Figure 2E and in the other mutation types (Fig. S4A). In addition, for the coding strand, we observed an inferred rate of mutations that is lower in the expressed relative to that in the unexpressed genes, suggesting that antisense transcription in spermatogenesis may be used to further reduce mutation levels 42.
We next computed an ‘asymmetry score’ to study the ratio between mutation levels inferred to occur in the coding and template strands (Fig. 2E-F) 32. As expected, the unexpressed group of genes has minimal asymmetry scores (Fig. 2F and Fig. S4E), indicating no transcription-induced removal of DNA damage. Examining this measure across the spermatogenic stages, we observed that the asymmetry scores are highest in the early stages of spermatogenesis (spermatogonia and spermatocytes) and gradually decrease along the spermatogenesis lineage (Figs. 2F, S4D), consistent with a stronger transcription-induced removal of DNA damage earlier in spermatogenesis. Such a pattern is also reflected in the expression levels of TCR genes, which show higher expression levels in early spermatogenesis (Fig. S7). As negative controls, we found that mutational asymmetry was not observed when comparing Watson and Crick strands (instead of gene-specific coding and template strands, Fig. S5), nor did we detect a difference between the gene groups when shuffling the spermatogenic gene group assignments (while maintaining the group sizes, Fig. S6).
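As an illustration of the kind of ratio described above, the sketch below computes a strand asymmetry score for one mutation type. It assumes that the score is the log2 ratio of the coding-strand rate to the template-strand rate, and that each rate is a variant count divided by the number of sites at which that mutation type could occur; the exact normalization used for the figures is given in the SI methods and ref. 32, so this form should be read as an assumption.

```python
import math


def strand_rate(n_mutations, n_sites):
    """Mutations per available site on one strand."""
    return n_mutations / n_sites


def asymmetry_score(coding_mut, coding_sites, template_mut, template_sites):
    """log2 ratio of coding-strand to template-strand mutation rates.

    A positive score means fewer mutations are inferred on the template
    strand, as expected if damage is removed from the transcribed strand.
    The log2 form is an assumption made for this sketch.
    """
    coding = strand_rate(coding_mut, coding_sites)
    template = strand_rate(template_mut, template_sites)
    return math.log2(coding / template)


# Toy counts for illustration only
toy_score = asymmetry_score(30, 1000, 15, 1000)
```

Under this convention a group of genes with equal rates on both strands scores zero, which is the behaviour expected for the unexpressed genes.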
Bidirectional transcription signatures of mutation asymmetries
While the Figure 2 analysis examined transcription in the gene body (start to end of mRNA transcription), transcription in the human genome contains additional levels of complexity. For example, while expression is usually considered as transcribing the gene body, transcription in the opposite direction is common 43,44, leading to bidirectional transcription initiation on opposite strands (Fig. 3A). If lower mutation rates are indeed transcription-induced, we would predict that mutation asymmetry scores would display an inverse pattern between the opposite strands of the initiation of bidirectional transcription. Consistently, we detected an inverse pattern of asymmetry scores between the gene body and the upstream sequences (Figs. 3B,C, S4). Since transcription may extend beyond the annotated end or alternative polyadenylation sites (Fig. 3A) 45, we would also predict that the asymmetry scores between the gene body and the downstream sequences would display a coherent pattern. Again, we find the expected pattern whereby the gene body and the downstream sequences have the same pattern of asymmetry scores (Figs. 3B,D, S4). Together, these analyses provide striking support for transcription-induced germline mutation reduction.
‘Transcriptional scanning’ is tuned by gene-expression level
Our results led us to propose a model whereby widespread spermatogenesis transcription functions for ‘transcriptional scanning’ to reduce DNA damage-induced mutagenesis and thus safeguard the germline genome (Fig. 4A). Such a model suggests that mutation rates of scanned genes might be tuned by their expression levels in the testis. First, we expect that even minimally expressed genes should show fewer mutations than unexpressed genes, since a single round of transcription would pick up any damage. To test this, we binned all genes into seven groups according to their peak level of expression (Fig. 4B, SI methods). Consistently, we found that even the most lowly-expressed genes have lower levels of germline mutations than the unexpressed genes (Figs. 4C, S8A-B).
The ‘transcriptional scanning’ model predicts that higher expression levels would lead to additional scanning, and consequently further reduced mutation rates on the template strand. Indeed, examining our asymmetry score according to different expression levels, we observed that as expression level increases, the overall mutation level drops (Fig. 4C). Surprisingly, however, the very highly expressed genes showed the opposite effect: asymmetry between the strands is reduced and a paradoxically higher level of germline mutations relative to the unexpressed genes is observed (Figs. 4C,D, S8A,B). This pattern is consistent with observations that very high expression levels can lead to transcription-coupled DNA damage (Fig. 2A), as previously reported for transcription-associated mutagenesis in highly expressed genes in other systems 46. The mutation type in which TCD is most evident is A>G (Fig. 4C), and similarly, such TCD was readily observed in somatic A>G mutation in liver cancer samples 32. Our findings therefore extend support for TCD occurring for all mutation types in highly expressed genes (Figs. 4C-D, S8).
Our analyses suggest that spermatogenesis gene-expression levels tune germline mutation levels and we interpret our results as follows (Fig. 4E). ‘Transcriptional scanning’ reduces mutation rates even in genes with low-expression. Increasing expression levels are correlated with further reductions in mutation rates, but only to a point. In the very highly expressed genes, TCD overwhelms the TCR-induced reductions, and produces an overall higher mutation rate than genes expressed at low and moderate levels (Fig. S8A).
Transcriptional scanning and differential rates of genome evolution
We hypothesized that the reduction in mutation rates by transcriptional scanning would have cumulative effects over evolutionary time-scales. Specifically, since we observed lower mutation rates for spermatogenesis-expressed genes at the level of the human population, we expected that these genes would be more conserved at the sequence level across orthologues in other apes (Fig. S9A) than the unexpressed genes. Consistently, examining across our stage-specific gene groups, we found that unexpressed genes show the highest level of divergence when comparing across the apes (Fig. 5A). Examining divergence across expression levels, we found a negative correlation between increased expression and divergence (Fig. 5D). However, the most highly expressed genes showed higher divergence. These observations are fully consistent with our analyses implicating higher mutation rates by TCD (Fig. 4). Collectively, as expected, the same mutation-level pattern is detected both in the population (Figs. 2-4) and across species (Fig. 5).
The observation of different evolutionary rates between spermatogenesis-expressed and unexpressed genes suggests a distinct selective regime acting upon the unexpressed genes. To test this, we studied the ratio of nonsynonymous to synonymous substitution rates (dN/dS) for stage-specific and expression-level-specific gene groupings. We found that the unexpressed genes have a higher dN/dS ratio than the expressed genes, indicating that they are subject to weaker levels of purifying selection (Figs. 5B, S9B,C). Thus, the higher divergence levels of the unexpressed genes follow from both their higher mutation rates (Fig. 2C) and their weaker levels of purifying selection. Studying the set of 2,623 unexpressed genes at the functional level, we found that this set is enriched for environmental sensing, immune and defense systems, and signaling genes (Fig. 5C and Table S1). These functions strikingly coincide with those known to be fast-evolving in the human genome 23,24. Our results suggest that, beyond differential levels of purifying selection, the underlying levels of mutations are increased in this important set of genes by virtue of their being unexpressed during spermatogenesis. Our analysis of expression levels further revealed that the very highly expressed genes also have high mutation levels (Fig. 4). We found that the very highly expressed genes also exhibit low levels of purifying selection (high dN/dS, Fig. 5E). Functionally, this set of genes is enriched for roles in male reproduction and mitochondrial function (Fig. 5F and Table S2).
Discussion
Our findings led us to propose a model whereby widespread transcription at fine-tuned levels of expression leads to a rugged landscape of germline mutations by transcriptional scanning (Fig. 6). Given that this process is carried out in the germline, the variable mutation rates have important implications for genome evolution. In this model, the widely transcribed genes in male germ cells benefit from transcription-coupled repair (TCR), which scans through the expressed genes, thereby reducing germline mutations and safeguarding the germ cell genome. Over long time-scales these genes evolve more slowly (Fig. 6 middle). The small group of genes that are unexpressed throughout spermatogenesis are enriched for sensory and defense-immune system genes (Fig. 5C) and exhibit higher mutation rates, which in our model is explained by the lack of a TCR-induced germline mutation reduction (Fig. 6 left). Defense and immune system genes are known to evolve faster 23,24 and our selective transcriptional scanning model provides insight into how variation is preferentially provided to this class of genes. Such rapid evolution may be under strong selective biases for adaptation at the population level in rapidly changing environments. A third class of genes is characterized by very high germline expression. These genes have higher germline mutation rates since their transcription-coupled DNA damage obscures the effect of transcription-coupled repair (Fig. 6 right). This model provides a more comprehensive view of TCR-TCD crosstalk in spermatogenic cells, with expression-level-tuned fluctuation in mutation rates (Fig. 4E), and corrects the previous observation that the germline mutation rates increase with expression levels 38. In this Discussion, we address the full spectrum of mutagenesis patterns in the male germline, a proxy for detecting important genomic regions, and testable predictions of our model.
The transcriptional scanning model can account for a reduction of ∼15-20% of mutagenic DNA damage by detecting and removing bulky germline DNA damage (as estimated from the Fig. 2C analysis). Such a mechanism is critical for germ cell viability as retained bulky DNA damage may lead to cell death 47. On the other hand, the expressed genes of male germ cells still retain mutations that cannot be repaired by the TCR machinery 22,48. These male germline mutations likely originate from DNA replication errors, accumulating with paternal age 49. Thus, it would be of great interest to further analyze the observed germline mutation pattern, in particular relative to replication fork directionality 50.
Beyond the protein-coding genes examined here, it would be interesting to study non-coding genomic regions that are also expressed in the testes. Previous studies have reported that the testis also expresses large numbers of non-coding genes 10. These genomic regions may be inferred to be biologically important given that they are subjected to TCR-induced mutation reduction. According to this logic, it might follow that sensory and defense-immune system genes are unimportant since they are not generally expressed in the testes. Instead, we argue that this gene set is the exception that highlights the rule. In other words, most genes benefit from TCR mutation reduction, except those under selection for faster evolution. Similarly to phylogenetic profiling for identifying functionally important regions of the genome 51, identification of testis-expressed regions (for example, non-coding genes and retrotransposons) may be an efficient method for identifying these important regions.
Our model leads to important testable predictions and may provide deeper insights into human genetics and diseases originating from de novo germline mutations. First, we predict that de novo male-derived mutations would be enriched in genes unexpressed in spermatogenesis. Second, the same process should also hold in other mammals. Finally, we would expect that TCR-deficient animals should produce offspring with an increase in the number of de novo mutations. For patients with TCR gene-associated mutations, such as those underlying Cockayne syndrome and xeroderma pigmentosum 52, our model predicts higher germline mutation rates. It would also be of interest to study TCR/TCD processes in the female germline, though widespread gene expression has not been reported in the ovaries 11. The brain is another organ with a highly complex transcriptome 3,10, and it would be interesting to explore whether transcriptional scanning might have a function in certain somatic tissues. For example, such a function might help prevent somatic mutation-induced neurodegenerative diseases in the aging brain 53.
Materials and Methods
Human testes sample
Human testis tissue was obtained from New York University Langone Health (NYULH) Fertility Center; this was approved by the NYULH Institutional Review Board (IRB). Fresh seminiferous tubules were collected from testicular sperm extraction (TESE) surgery of a healthy patient with an obstructive etiology for infertility; there were no drug or hormonal treatments prior to TESE surgery. The research donor was fully informed before signing consent to donate excess tissue for research use; this was again done in a fashion consistent with the IRB (including tissue sample de-identification).
Single cell suspension preparation
After TESE surgery, samples were kept in cell culture PBS and transported to the research lab on ice within 1h of surgery for single-cell preparation. Testicular single-cell suspension was prepared by adapting existing protocols 54. Specifically, samples from TESE surgery were washed once with PBS and resuspended in 5mL PBS. Seminiferous tubules were minced quickly in a cell culture dish and spun down at 100g for 0.5min to remove supernatants. The minced tissue was resuspended in 8mL of 37°C pre-warmed tissue dissociation enzyme mix (see below). Tissue dissociation was done by incubating at 37°C for 20min with mechanical dissociation with a pipette every 5min. After digestion, the reaction was quenched by adding 2mL of 100% FBS (Gibco, Cat. 16000044) to a final concentration of 10%. Dissociation mix was filtered through a 100µm strainer to remove remaining seminiferous tubule chunks. Cells were washed once with DMEM medium (Gibco, Cat. 11965092) with 10% of FBS and twice with PBS. Cell viability was checked with Trypan-blue staining (with expectation of over 85% viable cells) before moving to the inDrop microfluidics platform. The tissue dissociation enzyme mix (8mL) was composed of 7.56mL of 0.25% Trypsin-EDTA (Gibco, Cat. 25200056), 400µL of 20mg/mL type IV Collagenase (Gibco, Cat. 17104019) and 40µL of 2U/µL TURBO DNase (Invitrogen, Cat. AM2238).
Single-cell RNA-Seq
Single-cell barcoding was carried out with the inDrop microfluidics platform 27 as instructed by the manufacturer (1CellBio). Briefly, the microfluidic chip and barcoded hydrogel beads were primed ahead of single-cell preparation. The ready-to-use single-cell suspension in PBS (after two washes with PBS buffer) was adjusted to 0.1 million/mL by counting with a hemocytometer. Next, the prepared cells, reverse transcription reagents (SuperScript III Reverse Transcriptase, Invitrogen, Cat. 18080085), barcoded hydrogel beads and droplet-making oil were loaded onto the microfluidic chip sequentially. Encapsulation was done by adjusting microfluidic flow rates as instructed. Single-cell barcoding and reverse transcription in the droplets were done by incubating at 50°C for 2h followed by heat inactivation at 70°C for 15min. Barcoded single cells in droplets were aliquoted as desired and then decapsulated by adding demulsifying agent.
Sequencing library preparation
Single-cell RNA-Seq library preparation after inDrop was carried out as instructed by the manufacturer (1CellBio) and was similar to the CEL-Seq2 method 55. Briefly, barcoded single-cell cDNA was purified with Agencourt RNAClean XP magnetic beads (Beckman Coulter, Cat. A63987) followed by second-strand synthesis reaction with NEBNext mRNA Second Strand Synthesis Kit (New England Biolabs, Cat. E6111S). Then linear amplification of cDNA was carried out through in vitro transcription (IVT) using HiScribe T7 High Yield RNA Synthesis kit (New England Biolabs, Cat. E2040S). IVT-amplified RNA was fragmented and purified again with Agencourt RNAClean XP magnetic beads. The second reverse transcription was done with PrimeScript Reverse Transcriptase (Takara Clontech, Cat. 2680A) followed by cDNA purification with Agencourt AMPure XP magnetic beads (Beckman Coulter, Cat. A63881). cDNA quantity was determined by qPCR on a fraction (5%) of purified cDNA. Final PCR amplification was done according to qPCR results and the product was purified with Agencourt AMPure XP magnetic beads. Library concentration was determined by Qubit dsDNA HS Assay Kit (Invitrogen, Cat. Q32851). Library size was determined by Bioanalyzer High Sensitivity DNA Kit (Agilent, Cat. 5067-4626).
Sequencing
Single-cell RNA-Seq library sequencing was carried out with Illumina NextSeq 500/550 75 cycles High Output v2 kit (Cat. FC-404-2005). Custom sequencing primers were used as instructed by the manufacturer 27. In addition, 5% of PhiX Control v3 (Illumina, Cat. FC-110-3001) library was added to give more complexity to scRNA-Seq libraries. Paired-end sequencing was carried out with read1 (barcodes) for 34bp, index read for 6bp and read2 (transcripts) for 50bp.
Sequencing data processing
Raw sequencing data obtained from the inDrop method were processed using a custom-built pipeline, available at (https://github.com/flo-compbio/singlecell). Briefly, the “W1” adapter sequence of the inDrop RT primer was located in the barcode read (the second read of each fragment), by comparing the 22-mer sequences starting at positions 9-12 of the read with the known W1 sequence (“GAGTGATTGCTTGTGACGCCTT”), allowing at most two mismatches. Reads for which the W1 sequence could not be located in this way were discarded. The start position of the W1 sequence was then used to infer the length of the first part of the inDrop cell barcode in each read, which can range from 8-11 bp, as well as the start position of the second part of the inDrop cell barcode, which always consists of 8 bp. Cell barcode sequences were mapped to the known list of 384 barcode sequences for each read, allowing at most one mismatch. The resulting barcode combination was used to identify the cell from which the fragment originated. Finally, the UMI sequence was extracted, and reads with low-confidence base calls for the six bases comprising the UMI sequence (minimum PHRED score less than 20) were discarded. The reads containing the mRNA sequence (the first read of each fragment) were mapped by STAR 2.5.1 with parameter “--outSAMmultNmax 1” and default settings otherwise 56. Mapped reads were split according to their cell barcode and assigned to genes by testing for overlap with exons of protein-coding genes and long non-coding RNA genes, based on genome annotations from Ensembl release 90. For each gene, the number of unique UMIs across all reads assigned to that gene was determined (UMI filtering), corresponding to the number of transcripts expressed and captured.
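The barcode-read parsing described above can be sketched as follows. This is an illustrative reimplementation and not the code of the published pipeline; it assumes that the six UMI bases directly follow the second barcode part, that the first matching W1 offset is accepted, and that PHRED scores are already decoded to integers. All function names are hypothetical.

```python
W1 = "GAGTGATTGCTTGTGACGCCTT"  # 22-mer adapter sequence quoted in the text


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def parse_barcode_read(read, max_mismatches=2):
    """Split a barcode read into (barcode part 1, barcode part 2, UMI).

    W1 is searched at 1-based start positions 9-12 (0-based offsets 8-11),
    so barcode part 1 is 8-11 bp long. Returns None if W1 is not found or
    the read is too short.
    """
    for start in range(8, 12):
        window = read[start:start + len(W1)]
        if len(window) == len(W1) and hamming(window, W1) <= max_mismatches:
            bc2_start = start + len(W1)
            bc2 = read[bc2_start:bc2_start + 8]
            umi = read[bc2_start + 8:bc2_start + 14]  # assumed UMI position
            if len(bc2) < 8 or len(umi) < 6:
                return None
            return read[:start], bc2, umi
    return None


def match_barcode(seq, known_barcodes, max_mismatches=1):
    """Map a barcode part to a known barcode, allowing one mismatch."""
    for known in known_barcodes:
        if len(known) == len(seq) and hamming(seq, known) <= max_mismatches:
            return known
    return None


def umi_quality_ok(phred_scores, min_phred=20):
    """Keep a read only if every UMI base call has PHRED >= min_phred."""
    return min(phred_scores) >= min_phred


# Toy read: 9 bp barcode part 1 + W1 + 8 bp barcode part 2 + 6 bp UMI
toy_read = "ACGTACGTA" + W1 + "TTTTCCCC" + "AGCTAG"
toy_parts = parse_barcode_read(toy_read)
```

A production parser would also need to handle barcodes that match more than one entry in the known list; this sketch simply returns the first match.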
Cells with a total transcript count of less than 1,000 or more than 20% of transcripts originating from mitochondrial genes (i.e., genes that are part of the mitochondrial genome) were removed from downstream analysis. The resulting gene expression matrix contained UMI counts for 27,378 genes across 783 cells.
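The cell-level quality filter above amounts to two thresholds, which the following minimal sketch applies to one cell at a time. The data layout (a dictionary of UMI counts per gene and a set of mitochondrial gene names) and the toy gene names are assumptions made for illustration.

```python
def keep_cell(gene_counts, mito_genes, min_total=1000, max_mito_fraction=0.20):
    """Return True if a cell passes both quality thresholds.

    gene_counts: dict mapping gene name -> UMI count for one cell.
    mito_genes: set of genes encoded in the mitochondrial genome.
    """
    total = sum(gene_counts.values())
    if total < min_total:
        return False  # fewer than 1,000 transcripts
    mito = sum(n for gene, n in gene_counts.items() if gene in mito_genes)
    return mito / total <= max_mito_fraction  # drop cells above 20%


# Toy cells for illustration only
toy_mito = {"mitoGene1"}
toy_cells = {
    "cell1": {"geneA": 900, "mitoGene1": 150},   # passes both thresholds
    "cell2": {"geneA": 500},                     # too few transcripts
    "cell3": {"geneA": 700, "mitoGene1": 300},   # 30% mitochondrial
}
toy_kept = [name for name, counts in toy_cells.items() if keep_cell(counts, toy_mito)]
```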
Inferring the transcriptomic trajectory of spermatogenesis
To obtain a temporal ordering of our cells that reflected the developmental process of spermatogenesis, we first filtered the expression matrix for protein-coding genes, retaining 19,788 genes. We then applied a variant of our recently proposed kNN-smoothing method 57, with k=3. This variant differed from the published version in that it relied on the Anscombe transform instead of the Freeman-Tukey-transform as a variance-stabilizing transformation, and in that it identified all neighbors in a single step, rather than adopting a step-wise approach. Briefly, all single-cell expression profiles were normalized to median number of total transcripts per cell 58, the Anscombe transform was applied to all expression values, and the k=3 closest neighbors of each cell were identified using Euclidean distance. The expression profile of each cell was then combined with those of its neighbors, thus obtaining its smoothed expression profile.
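The single-step smoothing variant described above can be sketched in a few lines of standard-library Python. This is an illustration and not the authors' implementation: it assumes that the k neighbors exclude the cell itself and that "combined" means summing raw UMI counts, as in the published kNN-smoothing method; the function names are hypothetical.

```python
import math
from statistics import median


def anscombe(x):
    """Anscombe variance-stabilizing transform for count data."""
    return 2.0 * math.sqrt(x + 3.0 / 8.0)


def knn_smooth(cells, k=3):
    """Single-step kNN smoothing of per-cell UMI count vectors.

    cells: list of count vectors, one per cell, sharing one gene order.
    Every cell is assumed to have a non-zero total count.
    """
    totals = [sum(c) for c in cells]
    med = median(totals)
    # Normalize to the median total, then stabilize the variance
    transformed = [
        [anscombe(v * med / t) for v in c] for c, t in zip(cells, totals)
    ]
    smoothed = []
    for i, ti in enumerate(transformed):
        # Rank all other cells by Euclidean distance in transformed space
        by_distance = sorted(
            (math.dist(ti, tj), j) for j, tj in enumerate(transformed) if j != i
        )
        neighbors = [j for _, j in by_distance[:k]]
        # Sum the raw counts of the cell and its neighbors (assumption)
        smoothed.append([
            cells[i][g] + sum(cells[j][g] for j in neighbors)
            for g in range(len(cells[i]))
        ])
    return smoothed


# Toy data: two pairs of similar cells, smoothed with k=1
toy_counts = [[10, 0], [11, 0], [0, 10], [0, 12]]
toy_smoothed = knn_smooth(toy_counts, k=1)
```

The pairwise distance loop is quadratic in the number of cells, which is acceptable for a few hundred cells but would need a vectorized or indexed search for larger datasets.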
We next transformed the smoothed data using principal component analysis, and applied multidimensional scaling (MDS) to the cell scores for the first four principal components. Based on the two-dimensional results, we constructed a nearest-neighbor graph in which we connected each cell to its closest 32 neighbors, with a maximum distance of 80. We calculated the minimum spanning tree of this nearest-neighbor graph, determined the longest path in the tree, and applied smoothing by averaging the x and y coordinates of four consecutive vertices. This created a continuous “backbone” representing the transcriptomic trajectory of spermatogenesis. To obtain the temporal ordering of all cells, we then projected all cells onto this path in the manner described by Qiu et al. 29 and excluded 42 cells (5.4%) with a distance of 25 or greater, which likely represented rare cell types or damaged cells. We used the expression of the PRM1 gene 59 to determine which “end” of the ordering corresponded to the last stage of spermatogenesis. Minimal manual adjustments to the cell ordering inferred through the process described above were made by comparison with unsupervised hierarchical clustering results. Finally, we obtained a temporal ordering (from early to late) for 741 cells that formed the basis for our downstream analyses.
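The graph steps of the backbone construction (nearest-neighbor graph, minimum spanning tree, longest path) can be sketched as below. This is a simplified illustration under stated assumptions: the longest path is measured by summed edge length, Kruskal's algorithm is used for the spanning tree, and the coordinate smoothing and cell projection steps are omitted. Function names are hypothetical.

```python
import math
from collections import defaultdict


def knn_edges(points, n_neighbors=32, max_dist=80.0):
    """Undirected edges (distance, i, j) to each point's nearest neighbors."""
    edges = set()
    for i, p in enumerate(points):
        ranked = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        for d, j in ranked[:n_neighbors]:
            if d <= max_dist:
                edges.add((d, min(i, j), max(i, j)))
    return sorted(edges)


def minimum_spanning_tree(n, edges):
    """Kruskal's algorithm with union-find; returns adjacency lists."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = defaultdict(list)
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree[i].append((j, d))
            tree[j].append((i, d))
    return tree


def farthest(tree, start):
    """Node farthest from start (by summed edge length) and the parent map."""
    dist, prev, stack = {start: 0.0}, {start: None}, [start]
    while stack:
        u = stack.pop()
        for v, d in tree[u]:
            if v not in dist:
                dist[v], prev[v] = dist[u] + d, u
                stack.append(v)
    return max(dist, key=dist.get), prev


def longest_path(tree, start=0):
    """Longest path in the tree component containing start (two sweeps)."""
    a, _ = farthest(tree, start)
    b, prev = farthest(tree, a)
    path = []
    while b is not None:
        path.append(b)
        b = prev[b]
    return path[::-1]


# Toy layout: four points on a line plus one short side branch
toy_points = [(0, 0), (10, 0), (20, 0), (30, 0), (20, 5)]
toy_tree = minimum_spanning_tree(len(toy_points), knn_edges(toy_points, n_neighbors=2))
toy_backbone = longest_path(toy_tree)
```

If the neighbor graph is disconnected, the spanning tree is a forest and the sketch only follows the component that contains the starting cell.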
Cell stage and cell type identification
Following MDS ordering of cells, several marker genes were used to determine cell types or spermatogenic stages. CSF1, CYP11A1 and IGF1 60–62 genes were used to distinguish Leydig cells. WT1 and SOX9 61,63 were used to distinguish Sertoli cells. Both Leydig cells and Sertoli cells were then excluded from the dataset to determine developmental stages of spermatogenesis. FGFR3 and DMRT1 26,64 were used to determine spermatogonia. SYCP3 and TEX101 61,65 were used to determine spermatocytes. ACRV1 and ACTL7B 61,65 were used to determine round spermatids. TNP1, PRM1, PRM2, YBX1 and YBX2 18,59,65,66 were used collectively to determine elongating spermatids, condensing spermatids and condensed spermatids. Based on the main spermatogenic stages, a more detailed spermatogenesis staging was defined by hierarchical clustering to increase resolution.
Principal component analysis (PCA)
The PCA plots in Figures 1 and S1 were performed on the UMI expression matrix of all testicular cells (741 cells, Fig. 1B) or spermatogenic cells (664 cells, Fig. 1C). In both cases, expression matrices were first normalized to 100,000 transcripts per cell. The Fano factor, or variance-to-mean ratio (VMR), was computed for each gene to identify dynamically expressed genes. PCA was then performed on the normalized and log2-transformed expression matrix using the dynamically expressed genes. For all testicular cells (Fig. 1B), 860 dynamically expressed genes were included. For spermatogenic cells (Fig. 1C), 1648 dynamically expressed genes were used.
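The preprocessing steps before PCA can be sketched as follows. This is an illustration only, under assumptions not stated in the text: the variance is taken as the population variance, and a pseudocount of 1 is added before the log2 transform. The function names are hypothetical, and the cutoff used to call a gene dynamically expressed is not specified here, so it is left out.

```python
import math

def normalize_cell(counts, target=100_000):
    """Scale one cell's UMI counts so that they sum to `target` transcripts."""
    total = sum(counts)
    return [c * target / total for c in counts]

def fano_factor(values):
    """Variance-to-mean ratio of one gene across cells (population variance assumed)."""
    n = len(values)
    mean = sum(values) / n
    if mean == 0:
        return 0.0  # gene not detected in any cell
    variance = sum((v - mean) ** 2 for v in values) / n
    return variance / mean

def log2_transform(values, pseudocount=1.0):
    """log2 transform; the pseudocount is an assumption to avoid log2(0)."""
    return [math.log2(v + pseudocount) for v in values]
```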
Spermatogenic cell ordering by Monocle2
Using the same smoothed spermatogenic cell expression matrix that was used to build the developmental trajectory, we ran Monocle2 (version 2.6.0) 29 to infer the pseudotime track. We performed the required processes with default parameters according to the user manual (http://cole-trapnell-lab.github.io/monocle-release/docs/): 1) Set “negbinomial.size()” for expression distribution, and estimated size factors and dispersions. 2) Selected genes detected in at least 5% of the 664 cells to project cells to 2D space using the “DDRTree” method. 3) Ordered cells and visualized the pseudotime track as shown in Fig. S1D. The increasing order of pseudotime values was consistent with the pattern of marker genes during spermatogenesis (data not shown). Because the pseudotime values were unique, each cell could be assigned an unambiguous index in the ordering. The Monocle2-determined and MDS-determined cell indices were plotted against each other and the Pearson correlation coefficient was calculated as shown in Fig. S1E.
Cell fate prediction with “RNA velocity”
We used the R package velocyto.R (version 0.5) to estimate RNA velocity 28. This required three separate count matrices (emat, nmat, and spmat), which were composed of the exonic UMIs, intronic UMIs and intron/exon spanning UMIs, respectively. They were generated by the dropEst pipeline (https://github.com/hms-dbmi/dropEst). 1) The raw sequencing reads were tagged by droptag with the default “inDrop v1&v2” config file, except that “r1_rc_length” was set to 3. 2) The tagged reads were mapped to the human reference genome GRCh38 using STAR (version 2.5.3a) 56 with default settings. 3) The alignments were processed by dropest with the gene annotation GTF file (Ensembl release 90) and the default settings, except that the “--merge-barcodes” option was additionally used as suggested. The result contained 655 of the 664 spermatogenic cells. The Pearson correlation coefficient between the UMI count profiles of each cell estimated by our custom-built single-cell RNA-Seq pipeline (https://gitlab.com/yanailab/singlecell) and by the dropEst pipeline was calculated, and the median across all 655 cells was 0.968.
We followed the velocyto.R manual (https://github.com/velocyto-team/velocyto.R) and used emat and nmat to estimate and visualize RNA velocity. With predefined cell stages, we performed gene filtering with the parameter “min.max.cluster.average” set to 0.1 and 0.03 for emat and nmat, respectively. RNA velocity using the selected 4266 genes was estimated with the default settings, except for the parameters “kCells” and “fit.quantile”, which were set to 3 and 0.05, respectively. The RNA velocity field was visualized on a separate PCA embedding as shown in Fig. 1C.
Stage-marker identification
To identify gene markers for stages throughout spermatogenesis, we searched for genes exclusively expressed in the corresponding stage. We constructed an idealized gene expression pattern exclusive to each stage (main or detailed), which was used as a reference to find genes with a matching expression pattern. A correlation coefficient higher than 0.5 and a P-value lower than 0.0001 were used as thresholds to detect stage-specific marker genes. The top 50 genes with the highest correlation coefficients for each stage are shown in Fig. S2.
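The following minimal sketch illustrates the correlation against an idealized stage pattern. It assumes that the idealized pattern is a binary vector over cells (1 within the stage, 0 elsewhere) and that the correlation is Pearson's r. Neither choice is specified above, and the function names are hypothetical. The P-value is not computed here; a library routine such as scipy.stats.pearsonr could supply it.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant vector has no defined correlation
    return cov / (sx * sy)

def idealized_pattern(stage_labels, stage):
    """Binary reference pattern: 1 for cells in `stage`, 0 for all other cells."""
    return [1.0 if s == stage else 0.0 for s in stage_labels]

def stage_correlation(expression, stage_labels, stage):
    """Correlation between a gene's expression and the idealized stage pattern."""
    return pearson(expression, idealized_pattern(stage_labels, stage))
```

A gene would then be retained as a candidate marker for a stage when this coefficient exceeds 0.5 and the associated P-value is below 0.0001.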
Delineating the stage and expression level groups
To assign genes to specific stages, we computed the average expression level of each gene in each of the six main stages (Sg, Sc, RS, ES, CS, CedS). Each gene was then assigned to the main stage in which it had the highest level of expression. Unexpressed genes formed a separate group.
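The stage assignment amounts to taking the stage with the highest average expression for each gene, as in the sketch below. The criterion for calling a gene unexpressed is not given above; zero average expression in all six stages is assumed here, and the function name is hypothetical.

```python
STAGES = ["Sg", "Sc", "RS", "ES", "CS", "CedS"]

def assign_stage(stage_means):
    """Assign a gene to the main stage with the highest average expression.

    `stage_means` maps each stage name to the gene's average expression there.
    A gene with zero expression in every stage is labeled "unexpressed"
    (an assumption made for this sketch).
    """
    if all(stage_means[s] == 0 for s in STAGES):
        return "unexpressed"
    return max(STAGES, key=lambda s: stage_means[s])
```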
To assign groups based on expression levels, we binned the peak expression level to 7 groups:
Human germline variations
Human germline variations were downloaded from the Ensembl FTP site (ftp://ftp.ensembl.org/pub/release-91/variation/vcf/homo_sapiens). From these, we selected the variations from dbSNP_150 and used BEDOPS together with custom Bash scripts to associate them with the gene body, upstream 5kb and downstream 5kb genomic regions. The gene body region was defined as the genomic interval between the gene start site and the gene end site annotated in the GTF file (Ensembl release 91). The upstream and downstream 5kb regions were defined relative to the gene body region, taking the gene strand into account. We classified the variants into six mutation classes (A>T/T>A; A>G/T>C; T>G/A>C; C>T/G>A; G>T/C>A; C>G/G>C). Each variant was then further distinguished in terms of the coding and the template strands, as previously introduced 32. The same procedures were also performed on the upstream and downstream genomic regions, with the strand specificity (coding strand versus template strand) assigned consistently with the associated genes.
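The classification into six mutation classes and the strand assignment can be sketched as follows. This illustration assumes that variants are reported as reference and alternate bases on the forward genomic strand, and that the coding strand of a gene annotated on the reverse strand is the reverse genomic strand. The function names are hypothetical and do not correspond to the custom scripts used here.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Each of the 12 single-base substitutions maps to one of six
# strand-symmetric classes (a substitution and its complement).
MUTATION_CLASS = {
    ("A", "T"): "A>T/T>A", ("T", "A"): "A>T/T>A",
    ("A", "G"): "A>G/T>C", ("T", "C"): "A>G/T>C",
    ("T", "G"): "T>G/A>C", ("A", "C"): "T>G/A>C",
    ("C", "T"): "C>T/G>A", ("G", "A"): "C>T/G>A",
    ("G", "T"): "G>T/C>A", ("C", "A"): "G>T/C>A",
    ("C", "G"): "C>G/G>C", ("G", "C"): "C>G/G>C",
}

def classify(ref, alt):
    """Return the strand-symmetric mutation class of a substitution."""
    return MUTATION_CLASS[(ref, alt)]

def on_coding_strand(ref, alt, gene_strand):
    """Express a forward-strand substitution as read on the gene's coding strand."""
    if gene_strand == "+":
        return ref, alt
    return COMPLEMENT[ref], COMPLEMENT[alt]

def on_template_strand(ref, alt, gene_strand):
    """The same substitution as read on the template (transcribed) strand."""
    c_ref, c_alt = on_coding_strand(ref, alt, gene_strand)
    return COMPLEMENT[c_ref], COMPLEMENT[c_alt]
```

For example, an A>G variant on the forward strand within a reverse-strand gene would be read as T>C on the coding strand and as A>G on the template strand.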
The germline mutation rates of the coding and the template strands were calculated by normalizing to a length of 1kb. Specifically, for total germline mutations, the mutation rate was calculated as the sum of all germline short variants normalized to a length of 1kb. For each specific base substitution type, the mutation rate was calculated as the number of mutations of that type normalized to 1kb of the reference base type.
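As a worked illustration of the two normalizations, the sketch below computes a total rate per kb of sequence and a substitution-specific rate per kb of the reference base. It assumes that "1kb of the reference base type" means 1,000 occurrences of the reference base on the strand in question. The function names are hypothetical.

```python
def rate_per_kb(n_variants, region_length_bp):
    """Total variants normalized to 1 kb of sequence."""
    return 1000.0 * n_variants / region_length_bp

def substitution_rate_per_kb(n_variants, sequence, ref_base):
    """Variants of one substitution type per 1,000 occurrences of its reference base.

    `sequence` is the strand-specific sequence of the region (coding or template).
    """
    n_ref = sequence.upper().count(ref_base)
    if n_ref == 0:
        return 0.0  # the reference base never occurs, so no rate is defined
    return 1000.0 * n_variants / n_ref
```

For instance, 5 variants in a 2,500 bp gene body give 2.0 variants per kb, and 3 A>G variants in a region containing 1,500 adenines give 2.0 A>G variants per kb of A.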
Gene divergence datasets
The datasets of sequence divergence between human and other apes were downloaded from Ensembl release 91 31. Percent divergences in Figure 5 were calculated as: Divergence = 100% - Identity (human to other apes). dN and dS values were also retrieved from Ensembl, and we excluded genes with zero dN or dS. The mean values shown in Figure 5 were computed on non-outlier values, where an outlier value is defined as more than three scaled median absolute deviations (MAD) away from the median. For a set of divergence or dN/dS values made up of N genes, MAD is defined as: MAD = median(|Ai - median(A)|), for i = 1,2,…,N.
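The outlier-robust mean can be sketched as follows. The scaling constant applied to the MAD is not stated above. The sketch assumes the conventional factor of about 1.4826, which makes the scaled MAD a consistent estimator of the standard deviation for normally distributed data. The function names are hypothetical.

```python
import statistics

MAD_SCALE = 1.4826  # assumed scaling constant, not stated in the text

def percent_divergence(identity_percent):
    """Divergence = 100% - Identity."""
    return 100.0 - identity_percent

def mean_without_outliers(values, n_mads=3.0, scale=MAD_SCALE):
    """Mean of the values lying within n_mads scaled MADs of the median."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    threshold = n_mads * scale * mad
    kept = [v for v in values if abs(v - med) <= threshold]
    return statistics.mean(kept)
```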
Statistical Analysis
Statistical significance was computed using the Mann-Whitney test (also known as the Mann-Whitney-Wilcoxon test or rank-sum test) to test whether two groups of genes have distinct value distributions. Error bars of bar plots represent 99% confidence intervals, calculated as 2.58×standard error, as the values are all normally or approximately normally distributed.
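For illustration, the error-bar half-width and the Mann-Whitney U statistic can be computed as in the sketch below. It assumes that the standard error is the sample standard deviation divided by the square root of the number of values. The function names are hypothetical, and the P-value is not derived here; a library routine such as scipy.stats.mannwhitneyu could be used for that.

```python
import math
import statistics

def ci99_half_width(values, z=2.58):
    """Half-width of a 99% confidence interval: z times the standard error."""
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return z * standard_error

def mann_whitney_u(a, b):
    """U statistic for group a: pairs with a > b count 1, ties count 0.5."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u
```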
Acknowledgments
We thank Yael Kramer for coordinating the human sample collection. We thank Molly Przeworski, Hannah Klein, Huiyuan Zhang and the members of the Yanai lab for constructive comments and suggestions on the manuscript. We thank Megan Hogan and Matthew Maurano for assistance with sequencing.
Funding: This work was supported by the NYU School of Medicine with funding to I.Y.
Author contributions: B.X. and I.Y. conceived the project, interpreted the results and drafted the manuscript. B.X. led the experimental and analysis components. M.B. contributed expertise in the inDrop analysis and sequencing. Y.Y. contributed to RNA velocity and Monocle2 analysis, and mutation data processing. F.W. contributed to raw data processing of scRNA-seq and cell ordering. J.A., S.Y.K., and D.K. contributed to the sample collection. All authors edited the manuscript.
Competing interests: Authors declare no competing interests.
Data and materials availability: Raw sequencing data will be deposited to GEO and will include gene expression matrices including both smoothed and unsmoothed UMI counts matrices.