Abstract
Briefly
Analysis of single-cell RNA-Seq data from mouse neocortex exposes evidence for local neuropeptidergic modulation networks that involve every cortical neuron directly.
Data Highlights
At least 98% of mouse neocortical neurons express one or more of 18 neuropeptide precursor protein (NPP) genes.
At least 98% of cortical neurons express one or more of 29 neuropeptide-selective G-protein-coupled receptor (NP-GPCR) genes.
Neocortical expression of these 18 NPP and 29 NP-GPCR genes is highly neuron-type-specific and permits exceptionally powerful differentiation of transcriptomic neuron types.
Neuron-type-specific expression of 37 cognate NPP / NP-GPCR gene pairs predicts modulatory connectivity within 37 or more neuron-type-specific intracortical networks.
Summary
Seeking insight into homeostasis, modulation and plasticity of cortical synaptic networks, we analyzed results from deep RNA-Seq analysis of 22,439 individual mouse neocortical neurons. This work exposes transcriptomic evidence that all cortical neurons participate directly in highly multiplexed networks of modulatory neuropeptide (NP) signaling. The evidence begins with a discovery that transcripts of one or more neuropeptide precursor (NPP) and one or more neuropeptide-selective G-protein-coupled receptor (NP-GPCR) genes are highly abundant in nearly all cortical neurons. Individual neurons express diverse subsets of NP signaling genes drawn from a palette encoding 18 NPPs and 29 NP-GPCRs. Remarkably, these 47 genes comprise 37 cognate NPP/NP-GPCR pairs, implying a strong likelihood of dense, cortically localized neuropeptide signaling. Here we use neuron-type-specific NP gene expression signatures to put forth specific, testable predictions regarding 37 peptidergic neuromodulatory networks that may play prominent roles in cortical homeostasis and plasticity.
Introduction
Neuromodulation - the adjustment of synapse and ion channel function via diffusible cell-cell signaling molecules - is a fundamental requirement for adaptive nervous system function (Abbott and Regehr, 2004; Bargmann, 2012; Bucher and Marder, 2013; Marder, 2012; Marder et al., 2015; McCormick and Nusbaum, 2014; Nadim and Bucher, 2014; Nusbaum et al., 2017). Neuromodulator molecules take many different chemical forms, including diatomic gases such as nitric oxide, lipid metabolites such as the endocannabinoids, and amino acids and their metabolites such as glutamate, GABA, acetylcholine, serotonin and dopamine. By far the largest family of neuromodulator molecules, however, comprises the evolutionarily ancient proteinaceous signaling molecules known as neuropeptides (Baraban and Tallent, 2004; Burbach, 2011; Gonzalez-Suarez and Nitabach, 2018; Hökfelt et al., 2013; van den Pol, 2012; Wang et al., 2015). The best-known and most widely studied neuropeptides are the endogenous “opioid” peptides - enkephalins, endorphins and dynorphins - but there are nearly one hundred other NPP genes in the human genome, and numerous homologs are present in all known animal genomes except for those of the sponges (Porifera) (Elphick et al., 2018; Jekely, 2013).
The broadest definition of “neuropeptide” would embrace any soluble peptide that serves as a messenger by diffusing from one neuron to another. A narrower but more common definition (Burbach, 2011) requires that (1) a neuropeptide precursor protein (NPP) transcript be translated as an NPP into the lumen of a source neuron’s rough endoplasmic reticulum (rER), (2) the NPP be packaged into dense-core vesicles (DCVs) and enzymatically cleaved into one or more neuropeptide (NP) products after passage through the rER–Golgi complex, (3) these products be transported and stored within the source neuron in DCVs, (4) they be released upon demand by activity- and calcium-dependent exocytosis, and only then (5) they diffuse interstitially to act upon a target neuron by binding to a specific receptor. This pathway enlarges the potential palette of distinct neuropeptides beyond that established simply by the large number of NPP genes, as a given NPP may be cleaved into alternative NP products during its intracellular and interstitial passage.
Most neuropeptide receptors are encoded by members of the very large superfamily of G-protein-coupled receptor (GPCR) genes (Hoyer and Bartfai, 2012; Krishnan and Schioth, 2015; Mains and Eipper, 2006; van den Pol, 2012). GPCRs are selective, high-affinity receptors distinguished by characteristic seven-transmembrane-segment atomic structures and signal transduction involving heterotrimeric G-proteins (hence the name). Phylogenomic evidence suggests that the earliest behaving animals relied exclusively upon early neuropeptide homologs and cognate neuropeptide-selective GPCRs (NP-GPCRs) for the slow intercellular communication sufficient to generate their slow and simple behaviors (Elphick et al., 2018; Grimmelikhuijzen and Hauser, 2012; Jekely, 2013; Krishnan and Schioth, 2015; Varoqueaux and Fasshauer, 2017). The later evolution of neurons, focal synaptic contacts, rapidly recycled small-molecule neurotransmitters, and numerous ionotropic receptors was likely driven by survival advantages of faster cell-cell signaling (Varoqueaux and Fasshauer, 2017). The fast synaptic transmission characteristic of contemporary higher animals is almost invariably based on recycling small molecule neurotransmitters and ionotropic receptors, but modulation of synaptic transmission and membrane excitability by NP-GPCRs remains very prominent in all extant behaving animals (Elphick et al., 2018; Grimmelikhuijzen and Hauser, 2012; Jekely, 2013; Krishnan and Schioth, 2015; Varoqueaux and Fasshauer, 2017).
Because modulatory neuropeptides are not subject to the rapid transmitter re-uptake and/or degradation processes necessary for fast synaptic transmission, secreted neuropeptides persist long enough (e.g., minutes) in brain interstitial spaces for diffusion to NP-GPCRs hundreds of micrometers distant from release sites (Ludwig and Leng, 2006; Nässel, 2009; Russo, 2017). Neuropeptide signaling in the CNS can thus be presumed “paracrine”, with secretion from one neuron acting upon many others by diffusion over distance and signals likewise converging by diffusion from many neurons onto one. The degradation of active neuropeptides by extracellular peptidases in cortex is nonetheless generally expected to restrict signal diffusion to sub-millimeter scale local circuit volumes, such as cortical “columns” or “barrels” or other commonly envisioned small anatomic/functional subunit tiles of the cortical sheet.
The many receptors encoded by different NP-GPCR genes are each highly selective for specific peptides but show considerable conservation at the level of downstream cellular signal transduction effects. Although GPCR signaling has long been recognized as complex and many-faceted (Hamm, 1998), most neuronal NP-GPCR actions reflect phosphorylation of ion channel or synaptic proteins, mediated by protein kinases dependent on the second messengers cyclic AMP and calcium (Mains and Eipper, 2006; Nadim and Bucher, 2014; van den Pol, 2012). Primary effects of NP-GPCRs, in turn, fall into just three major categories distinguished by G-protein alpha subunit class. The Gαi class (i) inhibits cAMP production, the Gαs class (s) stimulates cAMP production, and the Gαq class (q) amplifies calcium signaling dynamics (Syrovatkina et al., 2016). For most NP-GPCR genes, the primary G-protein α-subunit class (i.e., i, s or q) is now known (Alexander et al., 2017) and offers a good first-order prediction of the encoded GPCR’s signal transduction activity. The profound functional consequences of neuromodulation by GPCRs range from adjustment of neuronal firing properties and calcium signaling dynamics through regulation of synaptic weights and synaptic plasticity (Bargmann, 2012; Markram et al., 2013; McCormick and Nusbaum, 2014).
It is well established that particular neuropeptides, including vasoactive intestinal peptide (VIP), somatostatin (SST), neuropeptide Y (NPY), substance P, and cholecystokinin (CCK), are detectable at high levels in particular subsets of GABAergic cortical neurons (Tremblay et al., 2016). These neuropeptides, consequently, have come into broad use as markers for GABAergic interneuron classes, while the corresponding NPP and NP-GPCR genetics have provided molecular access to these and other broad neuron type classes (Daigle et al., 2018; Maximiliano José et al., 2018). In situ hybridization and microarray data (e.g., the Allen Brain Atlases (Hawrylycz et al., 2012; Lein et al., 2007)) have also established that mRNA transcripts encoding these five NPPs, along with those of many other NPP and cognate NP-GPCR genes, are expressed differentially in different brain regions. There has been a critical lack, however, of comprehensive expression data combining whole-genome depth with single-cell resolution. Absent such data, it has been difficult to generate specific and testable hypotheses regarding cortical neuropeptide function and to design repeatable experiments to test those hypotheses (Tremblay et al., 2016; van den Pol, 2012).
Here we describe new findings regarding NPP and NP-GPCR gene expression in single cortical neurons, based on analysis of deep mRNA-Seq data acquired from 22,439 isolated mouse cortical neurons as described fully in a recent publication (Tasic et al., 2018). We begin by leveraging only the genomic depth and single-cell resolution of this dataset. Then, we briefly introduce the transcriptomic neurotaxonomy (i.e., neuron-type taxonomy) also developed in the Tasic 2018 publication and explore the additional analytical power of a taxonomic framework. Finally, we distill these findings into specific and testable predictions concerning intracortical peptidergic modulation networks.
Results
The present study is based on analysis of a resource single-cell mRNA-Seq dataset acquired at the Allen Institute (Tasic et al., 2018) and available for download at http://celltypes.brain-map.org/rnaseq/. These RNA-Seq data were acquired from a total of 22,439 isolated neurons, with detection of transcripts from a median of 9,462 genes per cell (min = 1,445; max = 15,338) and an overall total of 21,931 protein-coding genes detected. Neurons were sampled from two distant and very different neocortical areas: 13,491 neurons from primary visual cortex (VISp), and 8,948 neurons from anterior lateral motor cortex (ALM). Tasic, et al., harvested tissue specimens from a variety of transgenic mice expressing fluorescent proteins to enable enrichment of samples for neurons and for relatively rare neuron types by FACS sorting after dissociation. This enrichment procedure resulted, by design, in a disproportionate representation of GABAergic neurons, canonically ∼20% of neurons (Sahara et al., 2012), such that the sampled neuron population is roughly half GABAergic (47%) and half glutamatergic (53%). The resource publication (Tasic et al., 2018) should be consulted for full details of neuronal sample and library preparation, sequencing and data processing.
The resource single-cell RNA-Seq data tables (Tasic et al., 2018) report the abundance of transcripts from individual neurons in both “counts per million reads” (CPM) and “fragments per kilobase of exon per million reads mapped” (FPKM) units. Our analysis of these data compares gene expression levels quantitatively, with two distinct use cases: (1) comparisons across large sets of different genes, and (2) comparisons of the same gene across different individual cells, cell types and brain areas. We have relied upon FPKM data (Mortazavi et al., 2008; Pimentel, 2014) for use case 1 (i.e., the Table 1 and 2 comparisons across genes). For use case 2 (as in all figures below), we have preferred the CPM units, because these units were used to generate the Tasic 2018 neurotaxonomy. While choice of units here seems unlikely to make any significant difference, it would seem inconsistent to use FPKM units to compare across cell types discerned from CPM data.
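The distinction between the two units can be made concrete with a minimal sketch based on their standard definitions. The gene names, read counts and exon lengths below are invented for illustration and are not taken from the resource dataset. The sketch shows why a length-normalized unit suits comparisons across genes, whereas CPM suffices when the same gene is compared across cells.

```python
def cpm(counts):
    """Counts per million: read count scaled by the cell's library size."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}


def fpkm(counts, exon_length_bp):
    """Fragments per kilobase of exon per million mapped fragments.

    Equivalent to (count / length in kb) / (library size in millions).
    """
    total = sum(counts.values())
    return {
        gene: 1e9 * n / (exon_length_bp[gene] * total)
        for gene, n in counts.items()
    }


# Hypothetical single cell: GeneA and GeneB have equal read counts,
# but GeneB's exons are twice as long.
counts = {"GeneA": 500, "GeneB": 500, "GeneC": 1000}
lengths = {"GeneA": 1000, "GeneB": 2000, "GeneC": 4000}

cell_cpm = cpm(counts)
cell_fpkm = fpkm(counts, lengths)
```

In this toy cell, GeneA and GeneB have identical CPM values, but GeneA has twice the FPKM of GeneB because its transcript is half as long.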
The NP signaling genes upon which the present analysis focuses are expressed very differentially across the sampled populations of individual mouse cortical neurons. That is, each gene is expressed at a high level in some subset of cells but at zero or very low levels in the remainder of the population. To compactly characterize such expression, we developed a “Peak FPKM” metric. This metric is generated by ranking single-cell FPKM values for a given gene across the entire population of 22,439 neurons sampled, then designating the FPKM value at the ascending 99.9th percentile point as “Peak FPKM”. This metric was designed to minimize effects of sporadic outliers while still closely approximating the actual peak expression value in even very small subsets of neurons expressing the gene in question.
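The Peak FPKM calculation can be sketched as follows. The text does not specify how the percentile is interpolated, so the nearest-rank rule used here is an assumption, as are the function name `peak_value` and the toy expression values.

```python
import math


def peak_value(values, percentile=99.9):
    """Nearest-rank percentile of one gene's single-cell expression values.

    The nearest-rank rule is an assumption; the source does not state
    which percentile interpolation was used.
    """
    if not values:
        raise ValueError("need at least one cell")
    ranked = sorted(values)  # ascending rank order
    k = math.ceil(percentile / 100.0 * len(ranked))
    return ranked[max(k, 1) - 1]


# Hypothetical gene across 10,000 cells: silent in most cells, expressed
# at 200 FPKM in a small subset, with five sporadic outliers at 5,000 FPKM.
fpkm_values = [0.0] * 9945 + [200.0] * 50 + [5000.0] * 5

peak = peak_value(fpkm_values)
```

In this toy case the metric returns the expression level of the small expressing subset (200) rather than the outlier maximum (5,000) or the population median (0), which is the behavior the metric was designed to achieve.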
18 Neuropeptide Precursor Protein (NPP) genes are extremely highly expressed in mouse neocortex
Table 1 lists 18 NPP genes highly expressed in varied subsets of the 22,439 individual neurons sampled from cortical areas VISp and ALM. This gene list was circumscribed by two requirements: (1) that the included NPP gene be highly expressed (top quartile Peak FPKM, across all protein-coding genes) in both VISp and ALM cortical areas, and (2) that at least one NP-GPCR gene cognate to a candidate NPP gene also be highly expressed in neurons within the same local cortical areas. Requirement (2) was imposed here to focus on prospects for intracortical paracrine neuropeptide signaling as noted in the Introduction above. Table 1 also lists Peak FPKM values for each NPP gene, percentile and absolute ranks of that Peak FPKM value across all protein-coding genes, the fraction of cells sampled in which expression of the listed gene is detectable, predicted neuropeptide product(s) encoded, and the NP-GPCR gene(s) fulfilling requirement (2) for that NPP gene. Transcripts of no other known NPP genes met the criteria specified above.
The Peak FPKM ranking columns in Table 1 show that expression levels of most of the 18 NPP genes are extremely high in the range of Peak FPKM values for all 21,931 protein-coding genes detected in all neurons sampled. Of these genes, Npy, Sst, Vip and Tac2 rank as the top four overall in Peak FPKM values, while three more, Cck, Penk and Crh, also rank in the top ten. Eleven of these NPP genes rank in the top percentile and all 18 rank above the 80th percentile in Peak FPKM. The extremely high peak abundance of these NPP transcripts suggests that NPP products are likely synthesized in the highly expressing cells at correspondingly high rates. To maintain a steady state, the cell must therefore eliminate those protein products at a very high rate, with processing and secretion of active neuropeptides being the most likely route of elimination. The high abundance of transcripts encoding these 18 NPPs can thus be construed as evidence for secretion of the respective active neuropeptide products.
Expression of NPP genes by neocortical neurons is highly differential
Figure 1A characterizes differential expression of the 18 NPP genes of Table 1. Each of 18 color-coded solid curves represents the distribution of single-neuron CPM values for one NPP gene. Curves were generated by plotting CPM for each individual neuron in descending rank order along a sampled cell population percentile axis. Each curve shows an abrupt transition from very high to very low (commonly zero) expression across the sampled neuron population, but these transitions occur at very different population percentile points, providing clear evidence for highly differential single-cell expression of each gene. Percentages of the sampled neuron population expressing a given NPP gene (at greater than 1 CPM) range from more than 65% for Cck down to 1% for Nts. Recall that the cell population sampled here has been enriched for GABAergic cell types as noted above and described at length in the resource publication (Tasic et al., 2018).
Almost all (and possibly all) neocortical neurons express at least one NPP gene
The dashed curve in Fig. 1Ai, labeled “Max NPP Gene”, was generated by plotting CPM values of the NPP gene with the highest CPM in each individual cell in descending order along a cell population percentile axis. This curve therefore shows that 97% of the sampled mouse cortical neurons express at least one NPP gene at >1 CPM and that 80% express at least one NPP gene at >1,000 CPM, a very high level. When one takes into account the pulsatile nature of transcription (Suter et al., 2011) and the stochastic nature of RNA-Seq transcript sampling (Fu and Pachter, 2016; Kim et al., 2015; Tasic et al., 2016), these numbers must be understood as lower limits to percentages of cortical neurons expressing at least one of the 18 NPP genes. The results summarized in Fig. 1A may therefore be consistent with the proposition that every cortical neuron is peptidergic.
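The construction of the rank-ordered curves and of the “Max NPP Gene” curve can be illustrated with a minimal sketch. The helper names and the four toy cells are hypothetical; only the thresholds (>1 CPM and >1,000 CPM) are taken from the text.

```python
def rank_curve(values):
    """(population percentile, CPM) points in descending rank order."""
    ranked = sorted(values, reverse=True)
    n = len(ranked)
    return [(100.0 * (i + 1) / n, v) for i, v in enumerate(ranked)]


def max_gene_per_cell(cells):
    """Highest CPM among the listed genes, taken separately in each cell."""
    return [max(cell.values()) for cell in cells]


def expressing_fraction(values, threshold):
    """Fraction of cells with expression strictly above the threshold."""
    return sum(v > threshold for v in values) / len(values)


# Hypothetical CPM values for three NPP genes in four cells.
cells = [
    {"Npy": 5000.0, "Sst": 0.0, "Cck": 2.0},
    {"Npy": 0.0, "Sst": 1200.0, "Cck": 0.0},
    {"Npy": 0.0, "Sst": 0.0, "Cck": 40.0},
    {"Npy": 0.0, "Sst": 0.5, "Cck": 0.0},
]

max_cpm = max_gene_per_cell(cells)
curve = rank_curve(max_cpm)
frac_any = expressing_fraction(max_cpm, 1.0)
frac_high = expressing_fraction(max_cpm, 1000.0)
```

Applying `rank_curve` to a single gene's CPM values gives the solid per-gene curves, while applying it to `max_cpm` gives the dashed curve. The two fractions correspond to the percentages read off that curve at the two thresholds.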
Statistics of differential NPP gene expression are highly conserved between different neocortical areas
Figures 1B and 1C illustrate strong conservation of differential NPP expression profiles between VISp and ALM, two distant and very different neocortical areas. The paired bars in Fig. 1B represent fractions of cells expressing a given gene in each of the two areas. It is obvious that the differential expression profiles in VISp and ALM are highly similar (ρ=0.972, p<1.72E-11), in spite of stark differences in function and cytoarchitecture between these two areas. Conservation of expression fractions across so many genes in such divergent cortical areas may suggest that these patterns have strong connections to conserved features of cortical function and argues against these patterns being secondary to more ephemeral variables such as neuronal activity patterns, which seem unlikely to be highly conserved between VISp and ALM areas.
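The text does not state which correlation coefficient ρ denotes. Assuming a Spearman rank correlation over per-gene expressing fractions, the comparison between areas could be sketched as below. The helper functions and the fractions are invented for illustration and do not reproduce the reported values.

```python
def average_ranks(xs):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    """Spearman rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical fractions of cells expressing each of five genes.
visp_fraction = [0.65, 0.30, 0.12, 0.05, 0.01]
alm_fraction = [0.60, 0.33, 0.10, 0.06, 0.02]

rho = spearman(visp_fraction, alm_fraction)
```

Because the two toy profiles rank the genes identically, ρ here equals 1. Real profiles with small rank swaps would give values slightly below 1.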
Multiple NPP genes are co-expressed in almost all cortical neurons
Figure 1C represents frequencies with which transcripts of various multiples drawn from the set of 18 NPP genes were detected in individual neurons. These data establish a remarkable degree of NPP gene co-expression in almost all individual cortical neurons. The modal number of co-expressed NPP genes detected is 2 in VISp and 5 in ALM, but both distributions are actually quite flat between 2 and 5, with broad plateaus out to 7 co-expressed NPP genes per cell and a substantial tail out to 10. Fig. 1C also profiles strong similarities of NPP co-expression distributions between VISp and ALM.
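A co-expression distribution of this kind can be tabulated by counting, for each cell, how many genes of the set are detected. The detection threshold used for Fig. 1C is not restated in the text, so the >1 CPM cutoff below is an assumption carried over from Fig. 1A. The function names and toy cells are likewise hypothetical.

```python
from collections import Counter


def coexpression_histogram(cells, genes, threshold=1.0):
    """Map 'number of genes detected in a cell' to 'number of such cells'.

    The 1 CPM detection threshold is an assumption, not a stated method.
    """
    counts = Counter(
        sum(cell.get(gene, 0.0) > threshold for gene in genes)
        for cell in cells
    )
    return dict(sorted(counts.items()))


def modal_count(histogram):
    """Most frequent co-expression count (smallest count wins ties)."""
    return max(histogram, key=histogram.get)


genes = ["Npy", "Sst", "Cck"]
cells = [
    {"Npy": 50.0, "Sst": 900.0, "Cck": 3.0},
    {"Npy": 0.0, "Sst": 0.0, "Cck": 120.0},
    {"Npy": 7.0, "Sst": 0.0, "Cck": 15.0},
    {"Npy": 0.0, "Sst": 20.0, "Cck": 60.0},
    {"Npy": 0.0, "Sst": 0.0, "Cck": 0.2},
]

histogram = coexpression_histogram(cells, genes)
mode = modal_count(histogram)
```

Running the same tabulation separately on VISp and ALM cells would give the two distributions compared in the figure.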
29 Neuropeptide-selective G-protein-coupled receptor (NP-GPCR) genes are highly expressed in mouse neocortex
Table 2 lists 29 NP-GPCR genes that are highly expressed in varied subsets of the 22,439 individual neurons sampled from cortical areas VISp and ALM. These 29 genes encode receptor proteins selective for neuropeptide products encoded by the 18 NPP genes of Table 1 (cross-referenced in that table as “Cognate NP-GPCR Genes”). Table 2 provides quantitative information on expression levels of these 29 NP-GPCR genes, names the receptor proteins they encode, indicates the A-F GPCR class and expected primary G-protein signal transduction type and cross-references the cognate cortically-expressed NPP genes. As noted above, the 18 NPP genes and 29 NP-GPCR genes listed in Tables 1 and 2 were selected for focused analysis here due to their cognate pairing relationships and the consequent prospect that they may transmit local intracortical signals.
The “pFPKM Percentile” column in Table 2 shows that most of these 29 NP-GPCR genes are expressed in cortex with Peak FPKM values well above median (50th percentile) for all protein-coding genes. The high end of the range of cortical neuron pFPKM values for NP-GPCR genes does not match the extreme values noted for NPP genes, but this is as expected given that NP-GPCR gene products are thought to be durable cellular components, unlikely to be rapidly disposed of by secretion as expected for NPP gene products. Peak FPKM values for NP-GPCR transcripts are nonetheless quite high in the range of transcripts of other likely durable cellular component genes, suggesting a strong likelihood that they are indeed translated into functionally important protein products.
Expression of NP-GPCR genes by cortical neurons is highly differential
Figure 2 represents expression patterns of the 29 NP-GPCR genes listed in Table 2 in a manner that closely parallels the presentation for 18 NPP genes in Fig. 1. Figure 2A establishes that each of the 29 NP-GPCR genes, like the 18 NPP genes, is expressed in highly differential fashion across the population of 22,439 mouse cortical neurons sampled. Each of 29 color-coded solid curves represents the distribution of single-neuron expression level values for one NP-GPCR gene. Curves were generated by plotting CPM for each individual neuron in descending order along a cell population percentile axis. As was noted for NPP genes in Fig. 1, each of the curves in Fig. 2A shows an abrupt transition from very high to very low (commonly zero) expression across the sampled neuron population. These transitions again occur at very different population percentile points, providing clear evidence for highly differential expression of each NP-GPCR gene. Percentages of the sampled neuron population expressing a given NP-GPCR gene (at greater than 1 CPM) range from more than 72% for Adcyap1r1 down to 0.7% for Vipr2.
Almost all (and possibly all) neocortical neurons express at least one NP-GPCR gene
The dashed curve in the left panel of Fig. 2A, labeled “Max NP-GPCR Gene”, was generated by plotting CPM values of the NP-GPCR gene with the highest CPM in each individual cell in descending order along a cell population percentile axis. This curve shows that 98% of the sampled mouse cortical neurons express at least one NP-GPCR gene at >1 CPM and that 78% express at least one NP-GPCR gene at >100 CPM, lower than the comparable point for NPP genes (see Fig. 1) but still a very high value. Again, these numbers must be understood as lower limits to percentages of cortical neurons actually expressing at least one of the 29 NP-GPCR genes, after taking into account the pulsatile transcription and stochastic sampling issues cited above. The results summarized in Fig. 2A may thus be consistent with a conclusion that every cortical neuron expresses at least one NP-GPCR gene cognate to a cortically expressed NPP gene.
Statistics of differential NP-GPCR gene expression are highly conserved between different neocortical areas
Figure 2B provides evidence for strong conservation of differential NP-GPCR expression profiles between distant cortical areas VISp and ALM. The paired bars represent fractions of cells expressing a given gene in each of the two areas, again revealing strong similarities of differential expression profiles in the two very different neocortical areas (ρ=0.959, p<2.2E-16).
Multiple NP-GPCR genes are co-expressed in almost all cortical neurons
Figure 2C represents frequencies of NP-GPCR gene co-expression multiples detected in individual neurons. These data establish that multiple NP-GPCR genes are co-expressed in almost all cortical neurons and that numbers of genes co-expressed are even higher than those noted above for co-expression of NPP genes. The modal number of co-expressed NP-GPCR genes detected is 6 in both VISp and ALM, with broad plateaus extending out to 12 co-expressed NP-GPCR genes per cell. The striking similarities of NP-GPCR co-expression distributions between the two otherwise divergent neocortical areas once again suggest that the patterning of NP-GPCR co-expression may have consequences for cortical function that are conserved because they are functionally important.
Transcriptomic neurotaxonomy enables the generation of testable predictions about neocortical neuropeptidergic signaling
Our analysis so far has relied solely upon the genomic depth and single-cell resolution characteristics of the 2018 Tasic transcriptomic data, without utilizing the transcriptomic neurotaxonomy derived as one major goal of that study (Tasic et al., 2018). This taxonomy was developed from a large body of single-cell mRNA-Seq data based on dimensionality reduction and iterative hierarchical clustering methods. Such a transcriptomic neurotaxonomy makes it possible to predict a protein “parts list” for any neuron that can be mapped to a given transcriptomic type. While additional work now under way (Cadwell et al., 2017; Daigle et al., 2018; Moffitt et al., 2016; Shah et al., 2016; Wang et al., 2018; Zeng and Sanes, 2017) will be needed to reconcile this transcriptomic neurotaxonomy to existing anatomical and physiological neurotaxonomies, this taxonomy already offers the prospect of genetic access to specific neuron classes and types for physiological and anatomical study and thereby the prospect of experimental test of transcriptomically generated hypotheses. The present analysis will make extensive use of a subset of the 2018 Tasic neurotaxonomy representing 115 types discriminated in VISp and ALM cortical areas, as summarized in Supplementary Fig. 1. This neurotaxonomy will be represented in the following figures by cladograms and/or color code strips that can be interpreted by reference to Supplementary Fig. 1 or (Tasic et al., 2018).
Expression of the 18 NPP genes is highly neuron-type-specific
Figure 3A represents expression levels of the 18 NPP genes across all 115 VISp+ALM neuron types as a “heat map” matrix color coding log10 CPM values for each NPP and each neuron type. The CPM values so rendered are calculated as “trimmed means” (mean value after discarding the top 1% of distributions to reject outliers) of single-cell CPM values aggregated by each neuron-type cluster (commonly on the order of 100 cells, see Supplementary Figs. 1A and 1C for actual cell counts). Figure 3A confirms and extends four reasonable expectations from the type-agnostic single-cell analyses of Figs. 1 and 2 above: (1) neurons of every type express one or more of the 18 NPP genes, (2) each of the 18 NPP genes is expressed in multiple neuron types, (3) neurons of every type express multiple NPP genes, and (4) expression of NPP genes is highly differential across neuron types. Remarkably, Fig. 3A shows that type-to-type variations in expression level for every one of the 18 NPP genes span the full >10,000-fold dynamic range characteristic of the Tasic 2018 RNA-Seq data. Quite intriguingly, Fig. 3A also suggests that each of the 115 VISp+ALM cell types might be distinguished by a unique combinatorial pattern of NPP gene expression. This possibility will be explored quantitatively in connection with Fig. 4 below.
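The trimmed mean described above can be sketched as follows. How the 1% cutoff is rounded for clusters whose size is not a multiple of 100 is not stated, so rounding down is an assumption here, and the toy cluster is invented.

```python
import math


def trimmed_mean(values, trim_top=0.01):
    """Mean after discarding the top `trim_top` fraction of values.

    Rounding the number of discarded cells down is an assumption; with
    this choice, clusters of fewer than 100 cells lose no cells at 1%.
    """
    ranked = sorted(values)
    n_drop = math.floor(trim_top * len(ranked))
    kept = ranked[: len(ranked) - n_drop]
    return sum(kept) / len(kept)


# Hypothetical cluster of 100 cells: 99 at 10 CPM plus one extreme outlier.
cluster_cpm = [10.0] * 99 + [10000.0]

plain_mean = sum(cluster_cpm) / len(cluster_cpm)
robust_mean = trimmed_mean(cluster_cpm)
```

Here the single outlier inflates the plain mean roughly elevenfold, whereas the trimmed mean recovers the typical expression level of the cluster.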
Figure 3A provides for ready comparison of NPP gene expression patterns between glutamatergic and GABAergic neuron types. Clearly, GABAergic types are more prolific in the variety and strength of their NPP gene expression. While glutamatergic types express fewer NPP genes and do not match the extremely high NPP expression levels observed in almost every GABAergic type, each type nonetheless expresses at least one NPP gene, and generally more, at a very substantial level. This differential is consistent with a long history of neuroscientific use of neuropeptide products as protein markers of GABAergic neuron subsets (e.g., VIP, SST, NPY, Substance P), which has no parallel in the marking of glutamatergic neuron subsets.
Expression of the 29 NP-GPCR genes is highly neuron-type-specific
Figure 3B illustrates neuron type specificity of NP-GPCR expression in a manner identical to the treatment of NPP gene expression in Fig. 3A and invites analogous conclusions: (1) neurons of every type express one or more of the 29 NP-GPCR genes at very high levels, (2) neurons of every type express multiple NP-GPCR genes, and (3) expression of NP-GPCR genes is highly differential across neuron types. Figure 3B also shows, however, that the stronger and more varied expression of NPP genes in GABAergic types that was evident in Fig. 3A is leveled or even reversed for NP-GPCR genes. That is, while GABAergic neurons clearly show the more prolific and varied expression of NPP genes, glutamatergic neurons may be somewhat more prolific expressors of NP-GPCR genes. Finally, it should be noted that there are cases where both an NPP gene and its cognate NP-GPCR gene are expressed in the same neuron type. The Cck / Cckbr and Adcyap1 / Adcyap1r1 pairs are prominent examples, both being highly expressed in majorities of glutamatergic neuron types.
A transcriptomic signature based upon just 47 NP-signaling genes (18 NPPs and 29 NP-GPCRs) permits exceptionally accurate classification of neocortical neurons
The strong marker patterning of the 47 NP gene expression profiles evident in Fig. 3 suggests the possibility that each of the 115 neuron types profiled in that figure might be distinguished by a unique combination of these 18 NPP and 29 NP-GPCR genes. To explore this possibility quantitatively, we developed the analysis presented in Figure 4.
We began by asking whether there exists a low-dimensional representation of gene expression that naturally separates neurons of different types into distinct parts of that low-dimensional space. The extent to which a neuron’s location in such a space can be inferred from the expression of a limited subset of genes (such as our 47 NP genes) would then provide a measure of the sufficiency of that subset to classify that neuron accurately. Hierarchical clustering methods to define neuron types based upon gene expression are well established (Hastie et al., 2001; Oyelade et al., 2016) but have difficulty when comparing and making inferences between datasets. We therefore devised a machine learning approach based on linked multi-layer autoencoders (see Supplementary Methods) to address this question explicitly and quantitatively.
A single autoencoder network (Hinton and Salakhutdinov, 2006) was developed and trained to encode CPM values of the 6,083 most highly expressed genes represented in the Tasic 2018 dataset (the “HE” gene set). Results are illustrated in Figure 4A, where encoding coordinates in a two-dimensional latent space of 22,439 individual neurons are displayed as discrete dots, each colored according to the neuron’s Tasic 2018 hierarchical classification (i.e., the neurotaxonomic color code introduced in Fig. 3 and Supplementary Fig. 2). The tight grouping of type-code colors evident in Fig. 4A indicates that position within this latent space corresponds well to the neuron types defined by hierarchical classification, in spite of the fact that the autoencoder was given no explicit prior information about how neurons were classified by Tasic, et al. We then trained a second, linked autoencoder with the architecture schematized in Fig. 4B to classify cells using only the 47-gene subset, under a cost function constraint that latent spaces of the two autoencoders be as similar as possible. This allowed us to test the extent to which any small gene subset by itself could match the encoding performance obtained using the much larger gene set. Fig. 4C displays a two-dimensional latent space resulting from encoding the same 22,439 neurons based only on the 47 NP genes tabulated above, again projecting one dot for each cell using the Tasic 2018 type-code colors. Once again, the tight color grouping evident in Fig. 4C suggests qualitatively that these 47 genes indeed enable excellent type encoding of individual neurons.
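The actual architecture and cost function are specified in the Supplementary Methods and are not reproduced here. As a purely illustrative sketch of what a latent-space similarity constraint can look like, one common form adds to the reconstruction error of the small-gene-set autoencoder a penalty on the distance between the two latent encodings of the same cell. The function names, the squared-error form and the `coupling_weight` parameter are all assumptions.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def linked_loss(x_small, x_small_hat, z_small, z_full, coupling_weight=1.0):
    """Hypothetical per-cell cost for the second, linked autoencoder.

    x_small      expression of the small gene subset (network input)
    x_small_hat  the network's reconstruction of that input
    z_small      latent coordinates from the small-gene-set encoder
    z_full       latent coordinates from the large-gene-set encoder
    """
    reconstruction = mse(x_small, x_small_hat)
    coupling = mse(z_small, z_full)  # penalizes latent-space disagreement
    return reconstruction + coupling_weight * coupling


# Toy cell: perfect reconstruction, but the two encoders place the cell
# at different points of a two-dimensional latent space.
loss = linked_loss(
    x_small=[1.0, 2.0],
    x_small_hat=[1.0, 2.0],
    z_small=[0.0, 0.0],
    z_full=[0.0, 2.0],
)
```

Under a cost of this form, the second network is rewarded both for reconstructing its 47-gene input and for placing each cell where the larger network placed it, so the two latent spaces can be compared directly.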
For quantitative comparison of classification performance based on varied neocortical gene sets, we partitioned the autoencoder encodings into classes using a supervised Gaussian Mixture Model (see Supplementary Methods) and designed the resolution index schematized in Fig. 4D to evaluate consensus between classifications driven by autoencoder encoding and the resource hierarchical neurotaxonomy (Tasic et al., 2018). This index yields a value of 0 when a neuron is mapped incorrectly from the root node and 1 when a neuron is mapped correctly all the way to a terminal leaf node. By averaging this metric over all 22,439 neurons, we generated an overall figure of merit, also called a resolution index. This figure for the large 6,083-gene HE set was 0.987, while the same index for classification based on the NP 47-gene subset was 0.928. To place these resolution index numbers in context and test the significance of this correspondence, we compared resolution indices resulting from linked autoencoder classification based on 100 subsets of 47 genes drawn randomly from the Tasic 2018 expression dataset. The sets of 47 random genes yielded an average resolution index of 0.645 ± 0.047 (Fig. 4E), establishing clearly that NP genes yield classification greatly superior to random subsets of 47 genes. Figure 4E also shows results from encoding runs using 100 sets of 47 random genes chosen to approximate the same high expression statistics of the NP genes. Again, resolution indices from the random sets fell well below that yielded by the 47 NP genes (average = 0.858 ± 0.0242, with none reaching the NP gene index of 0.928 and the difference being significant at p<0.01). This demonstration of the exceptional power of NP genes to mark transcriptomic neuron types reinforces earlier indications of an especially close and fundamental connection between neuropeptide gene expression and neuron type identity.
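The text fixes only the two endpoints of the per-neuron index (0 for a mapping that goes wrong at the root, 1 for a correct mapping to the leaf). One plausible way to score intermediate cases, assumed here for illustration only, is the depth of the deepest taxonomy node shared by the true and predicted types, divided by the depth of the true leaf. The toy taxonomy and its node names are hypothetical.

```python
def path_from_root(node, parent):
    """List of nodes from the root down to `node` (inclusive)."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path[::-1]


def resolution_index(true_leaf, predicted_leaf, parent):
    """Depth of the deepest shared ancestor over depth of the true leaf.

    The scoring of intermediate cases is an assumption, not the source's
    definition, which is given only schematically in Fig. 4D.
    """
    true_path = path_from_root(true_leaf, parent)
    pred_path = path_from_root(predicted_leaf, parent)
    shared = 0
    for a, b in zip(true_path, pred_path):
        if a != b:
            break
        shared += 1
    # `shared` counts the root itself, which sits at depth 0.
    return (shared - 1) / (len(true_path) - 1)


# Hypothetical three-level taxonomy, stored as child -> parent.
parent = {
    "GABAergic": "root", "Glutamatergic": "root",
    "Sst": "GABAergic", "Vip": "GABAergic", "L5": "Glutamatergic",
    "Sst-1": "Sst", "Sst-2": "Sst", "Vip-1": "Vip", "L5-1": "L5",
}

predictions = ["Sst-1", "Sst-2", "Vip-1", "L5-1"]  # true type is Sst-1
scores = [resolution_index("Sst-1", p, parent) for p in predictions]
mean_index = sum(scores) / len(scores)
```

In this toy example the four predictions score 1, 2/3, 1/3 and 0 respectively, and averaging over cells gives the overall figure of merit.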
Cell-type-specificity of differential NP gene expression is conserved between neocortical areas
Figure 5 juxtaposes separate VISp and ALM expression profiles for NPP and NP-GPCR genes across 93 VISp neuron types (Fig. 5A) and 84 ALM neuron types (Fig. 5B). Similarities of expression profiles for the two areas are obvious in Fig. 5, but there are also visible differences. The latter are rooted primarily in the substantial divergence of glutamatergic neuron taxonomies discussed at length in Tasic, et al. (Tasic et al., 2018) and summarized here in Supplementary Fig. 3. Very strong similarities of both NPP and NP-GPCR expression profiles are most obvious for the GABAergic types, where the taxonomies are identical except for the absence of two GABAergic types in ALM (indicated by dark gray vertical placeholder bars in Fig. 5B). The general conservation of neuron-type-specific expression patterns between the two distant neocortical areas (NPP correlation: ρ= 0.974, p<2.2e-16, NP-GPCR: 0.877, p<2.2e-16) thus provides another indication of robust connection between NP gene expression and cortical neuron differentiation.
Expression of 37 cognate NPP/NP-GPCR pairs in cortex predicts the potential existence of 37 intracortical peptidergic networks
Expression of an NPP gene in one neuron and a cognate NP-GPCR gene in another nearby neuron implies the prospect of local paracrine signaling, with secretion of a specific neuropeptide by the first neuron activating the cognate specific neuropeptide receptor on a second, nearby neuron. The present set of 47 cortical NP genes (18 NPP and 29 NP-GPCR) comprises the 37 distinct cognate NPP/NP-GPCR pairs enumerated in Table 3 and accordingly predicts 37 distinct peptidergic neuromodulation networks. As noted in the Introduction, expected neuropeptide diffusion distances suggest that any neuron within a local cortical area (e.g., VISp or ALM) might signal by diffusion to any other neuron within that same local area, but almost surely not to more distant areas (e.g., from VISp to ALM). In the following, we therefore make predictions of 74 (37 × 2) distinct peptidergic signaling networks, considering signaling within VISp and within ALM separately.
Type-specific NP gene expression profiles predict type-specific peptidergic coupling
Figure 6 displays heat maps representing predictions of neuron-type-specific peptidergic coupling from a selection of the 37 cognate NP gene pairs and expression profiles of the paired NPP and NP-GPCR genes. The predictions of Fig. 6 are based on cell-type-by-cell-type aggregation of binarized cell-pair-by-cell-pair products of the NPP and NP-GPCR gene CPM values. CPM values were first thresholded at the 50th percentile independently for each cell type. The coupling matrix is then defined as:

$$M(p,q) = \frac{1}{|C(p)|\,|C(q)|} \sum_{i \in C(p)} \sum_{j \in C(q)} I\left(x_i^{p} > \theta_p\right)\, I\left(x_j^{q} > \theta_q\right) \qquad (1)$$

where $x_j^{p}$ denotes the expression of individual cell $j$ in cell type $p$ (of the NPP gene for the transmitting type $p$ and of the NP-GPCR gene for the receiving type $q$), $\theta_p$ and $\theta_q$ denote the corresponding 50th percentile thresholds, $|C(p)|$ denotes the total number of expressing cells of type $p$, and $I$ is the indicator function. $M(p,q)$ is therefore the fraction of expressing pairs exceeding the 50th percentile threshold.
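The aggregation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the toy type labels and the toy CPM values are hypothetical, and the threshold is taken here as one percentile per gene over all cells, which is only one possible reading of the thresholding step. Because the pairwise product of indicators factorizes, each entry reduces to the product of two per-type fractions.

```python
import numpy as np

def coupling_matrix(npp_cpm, gpcr_cpm, cell_types, percentile=50.0):
    """Per type pair, fraction of (sender, receiver) cell pairs in which the
    sender's NPP CPM and the receiver's NP-GPCR CPM both exceed threshold."""
    npp_cpm = np.asarray(npp_cpm, dtype=float)
    gpcr_cpm = np.asarray(gpcr_cpm, dtype=float)
    cell_types = np.asarray(cell_types)
    types = np.unique(cell_types)  # sorted type labels

    # ASSUMPTION: one threshold per gene, the given percentile over all cells.
    npp_on = npp_cpm > np.percentile(npp_cpm, percentile)
    gpcr_on = gpcr_cpm > np.percentile(gpcr_cpm, percentile)

    # Fraction of supra-threshold cells in each type, for each gene.
    send = np.array([npp_on[cell_types == t].mean() for t in types])
    recv = np.array([gpcr_on[cell_types == t].mean() for t in types])

    # Sum over all cell pairs of I(.)*I(.) equals the outer product of fractions.
    return types, np.outer(send, recv)

# Toy data (hypothetical): six cells, three types, one NPP / NP-GPCR gene pair.
cell_types = np.array(["Sst", "Sst", "Pvalb", "Pvalb", "L5", "L5"])
npp = np.array([900.0, 700.0, 5.0, 0.0, 0.0, 1.0])
gpcr = np.array([0.0, 2.0, 300.0, 250.0, 40.0, 0.0])

types, M = coupling_matrix(npp, gpcr, cell_types)
# Rows are transmitting types, columns are receiving types, in the order of `types`.
```

In this toy case the strongest entry is Sst→Pvalb, since every Sst cell exceeds the NPP threshold and every Pvalb cell exceeds the NP-GPCR threshold.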
The exemplar matrix displayed in Fig. 6A predicts coupling in area VISp based on the expression profiles of Npy and Npy1r in VISp. Figure 6B represents a similar prediction for the same pair in area ALM. The dashed white crosses overlying both plots partition the matrices based on pairings of glutamatergic and GABAergic neuron types. Both matrices predict strong signaling from the canonical broad class of Npy-positive GABAergic neurons to a broad subset of GABAergic neurons that strongly express the Npy1r NP-GPCR: the strongest coupling thus falls in the GABA→GABA quadrant. Weaker coupling is observed in the GABA→Glut quadrant, where the Npy1r NP-GPCR gene is less strongly expressed in the glutamatergic cell types. Strong similarities between the VISp and ALM coupling matrices are most notable. Apparent differences between VISp and ALM coupling predictions are mainly due to the presence of different glutamatergic cell types in the two areas, and only in small part due to differences in same-type expression within the two cortical areas.
Figures 6C-E represent 12 more of the 37 cognate pair coupling matrices predicted for VISp using Eqn. 1. Along with Figs. 6A and 6B, these exemplify the wide variety of neuron-type-specific coupling motifs resulting from transcriptomic prediction. Most coupling matrices (i.e., pairs 2, 6, 9, 16, 19, 25, 29, 31) predict significant coupling over wide swaths of type-pairs, approaching 20% of the entire matrix. A few matrices at the other extreme, such as 27 and 33, predict very sparse coupling. Other predictions are intermediate in sparsity. The full sets of 37 predicted coupling matrices enumerated in Table 3 for both VISp and ALM are represented in Suppl. Figure 2, where strong similarities between the two cortical areas are again quite obvious.
Figure 6 and Suppl. Fig. 2 also illustrate the tendency of coupling predictions from most cognate NP pairs to fall in contiguous “patches” of the full coupling matrix. This is a natural reflection of the strong tendency of both NPP and NP-GPCR expression to align with early nodes in the 2018 Tasic hierarchical clustering, which was also evident in Figs. 3 and 5. The broadest example of coupling matrix patches reflecting hierarchical neurotaxonomy structure is provided by the observation that most sizable coupling patches fall strictly within single quadrants of glutamatergic-GABAergic neuron type pairing.
Discussion
Light from single-cell transcriptomics is now beginning to illuminate dark corners of cellular neuroscience that have long resisted mechanistic and functional analysis (Fan et al., 2018; Földy et al., 2016; Gokce et al., 2016; Okaty et al., 2011; Paul et al., 2017; Shekhar et al., 2016; Tasic et al., 2018, 2016; Telley et al., 2016; Zeng and Sanes, 2017). Cortical neuropeptide signaling may be one such corner. While profound impacts of neuropeptide signaling are well-established in a wide range of non-mammalian and sub-cortical neural structures (Borbély et al., 2013; Burbach, 2011; Elphick et al., 2018; Grimmelikhuijzen and Hauser, 2012; Katz and Lillvis, 2014; Kuffler et al., 1979) and there certainly is an excellent literature on cortical neuropeptide signaling (Crawley, 1985; Férézou et al., 2007; Gallopin et al., 2006; Gomtsian et al., 2018; Hamilton et al., 2013; Liu et al., 2018; Mena et al., 2013; Rossier and Chapouthier, 1982; Williams and Zieglgänsberger, 1981), published physiological results are surprisingly rare given the breadth of neuroscientific interest in cortex. The new transcriptomic data analyzed here suggest a possible explanation for this relative rarity. Though many NPP and cognate NP-GPCR genes are expressed abundantly in all or very nearly all neocortical neurons, such expression is highly differential, highly cell-type specific, and often redundant. These previously uncharted differential expression factors may have hindered repeatable experimentation. Our analysis supports this unwelcome proposition but may also point the way to more productive new perspectives on intracortical peptidergic neuromodulation.
Summary of findings
The present analysis establishes that mRNA transcripts from one or more of 18 NPP genes are detectible in over 97% of mouse neocortical neurons and that transcripts of one or more of 29 NP-GPCR genes are detectible in over 98%. Transcripts of at least one of the 18 NPP genes are present in the vast majority of cortical neurons at extremely high copy number, strongly suggesting brisk translation into neuropeptide precursor proteins. Brisk synthesis of precursor proteins further suggests brisk processing to active neuropeptide products and secretion of these products. Likewise, NP-GPCR transcripts rank high in abundance compared to transcripts of other cellular proteins, again strongly supporting product functionality. Our observations thus support the proposition that all, or very nearly all, neocortical neurons are both neuropeptidergic and modulated by neuropeptides. We are not aware of any previous empirical support for such a conclusion.
We have closely examined single-neuron expression patterns of the set of 47 NP genes (18 NPP and 29 cognate NP-GPCR) and find that these patterns are highly conserved between two distant and generally quite different areas of neocortex. Such conservation lends additional support to the proposition that NP gene products may have a very fundamental importance to cortical local circuit function and argues against these patterns reflecting more ephemeral variables such as recent activity patterns, which would seem unlikely to correlate so strongly between cortical areas with such different roles in brain function.
Following earlier indications that neurons may express multiple NPP genes, e.g., (Mezey et al., 1999), our analysis establishes that expression of multiple NPP genes in individual neurons may be the rule in cortex. Our analysis also establishes the generality of expression of multiple NP-GPCR genes in individual cortical neurons. The significance of these observations remains to be explored but should be viewed in light of recent discoveries of large numbers and great diversity of transcriptomic neuron types in neocortex and many other brain regions. Combinatorial expression of neuropeptide precursor and receptor genes obviously expands the prospects for molecular multiplexing that may allow selective communication amongst a multiplicity of distinct neuron types even though the signaling molecules propagate in diffuse paracrine fashion.
We also find that a modest set of 47 neuropeptide-signaling genes permits transcriptomic neuron type classification that is exceptionally precise in comparison to other similarly small gene sets. This tight alignment of neuron type classifications based solely on neuropeptide-signaling gene expression with classifications based on genome-wide expression patterns offers an intriguing suggestion of a very deep and fundamental connection between the expression of evolutionarily ancient neuropeptide-signaling genes and the differentiation of neuron type identities during metazoan speciation.
Prediction of cortical modulation networks
Our analysis delineates neuron-type-specific expression of 37 cognate pairs amongst the 18 NPP and 29 NP-GPCR genes expressed in mouse neocortex. Each of these pairs can be taken to predict a modulatory connection from cells expressing a particular NPP gene, via a secreted NP product, to cells expressing the particular NP-GPCR gene. Each pair thus establishes the prospect of a modulatory network with nodes defined by the neurotaxonomic identities of the transmitting NPP-expressing and the receiving NP-GPCR-expressing neurons. The analyses represented in Figs. 1, 2, 3 and 5 and Table 3 establish that at least one of the 37 pairs directly involves every neuron sampled, and that the vast majority of neurons are directly involved in more than one of the 37 predicted networks. Because of this saturated, multiplexed coverage of all neurons and neuron types, we refer to these predicted neuropeptidergic networks as “dense”.
The logic of our prediction of multiplexed NP networks is summarized in the form of a simplified schematic by Figure 7A-E. Figure 7 also suggests how multiple NP networks may align with neuron-type-based predictions of synaptic network architectures as the relevant empirical connectomic and neurotaxonomic information, as schematized by Fig. 7F, becomes available. Figure 7G integrates the fictitious network graphs of Figs. 7E and 7F, articulating a schematic view of cortical circuitry as the superimposition of many and diverse modulatory and synaptic networks, with neuron types as common nodes uniting a heterogeneous multiplicity of slow and fast signaling networks.
Transcriptomic prediction of paracrine local signaling from GABAergic neuron sources is particularly compelling. Because few cortical GABAergic neurons have axons that project beyond the confines of a single cortical area, considerations of diffusion physics and the limited lifetime of peptides after secretion strongly imply that secreted neuropeptides must act locally, if at all. The extremely high levels of NPP expression in GABAergic neurons argue, in turn, that they must act somewhere. Most cortical glutamatergic neurons do emit long axons, so it is possible that neuropeptides secreted from such neurons may act in remote and perhaps extracortical projection target areas. Even so, most cortical glutamatergic neurons do have locally ramifying axons and may also secrete neuropeptides from their local dendritic arbors (Vila-Porcile et al., 2009). The high cortical expression of NP-GPCRs cognate to NPP genes expressed by glutamatergic neurons in the same local area suggests a scenario supportive of local modulatory signaling from glutamatergic neuron sources, even though this case may not be quite as strong as that for GABAergic neurons. That said, the much more profuse expression of NPP genes in GABAergic neuron types along with the somewhat more profuse NP-GPCR expression in glutamatergic types still suggests a “prevailing wind” of peptidergic signaling, blowing predominantly from GABAergic to glutamatergic neurons, as presaged in an earlier microarray analysis of developing mouse cortex (Batista-Brito et al., 2008). Though our NP network predictions are entirely consistent with decades of pioneering work on peptidergic neuromodulation and cortical gene expression (Burbach, 2011; Hökfelt et al., 2013; van den Pol, 2012), it is perhaps only with the recent advent of data with single-cell resolution and genomic depth that it has become reasonable to propose the extreme neuron-type-specificity and density of network coverage suggested by our analysis. 
The cell-type-specific patterning of NP gene expression has allowed us to cast our predictions in testable form, and we believe emerging means to perturb and sense neuropeptide signaling, as discussed below, bring with them means to test these predictions critically.
Caveats to transcriptomic prediction
The present predictions of functional neuromodulatory coupling are based on analysis of cellular mRNA abundance, but prediction from such data depends upon (1) extrapolation from cellular mRNA census to inference about the synthesis, processing, localization and functional status of cellular NPP and NP-GPCR proteins, (2) assumptions about neuropeptide diffusion and lifetime in cortical interstitial spaces, and (3) assumptions about signal transduction and effector consequences of neuropeptide binding in cortex to target cell NP-GPCR receptors. Though we have already discussed several factors mitigating such concerns, we stipulate here that these uncertainties remain substantial, and note the need for much further investigation.
Testing peptidergic network predictions
Physiological and anatomical experimentation will be essential to testing transcriptomic predictions of intracortical neuropeptide signaling. We have suggested that such work may have been frustrated in the past by irreproducibility due to the uncharted multiplicity, neuron-type-specificity, and redundancy of NPP and NP-GPCR expression. This conundrum may now be resolved with the emergence of transcriptomic neurotaxonomies and new tools for experimental access to specific cortical neuron types. Such access may be either prospective, using Cre driver lines (Daigle et al., 2018; He et al., 2016; Madisen et al., 2015) or viral vectors (Dimidschstein et al., 2016) of substantial neuron-type-specificity, or retrospective using highly multiplexed FISH (Lein et al., 2017; Zeng and Sanes, 2017) or immunostaining methods (He et al., 2016; Xu et al., 2010), patch-seq (Cadwell et al., 2017; Lein et al., 2017) or morphological neuron type classification methods (DeFelipe et al., 2013; Zeng and Sanes, 2017). By allowing the generation of highly specific predictions of peptidergic signaling between specific neuron types, these new molecular tools should enormously advance the prospects for decisive and repeatable tests of neuron-type-specific intracortical neuropeptide signaling hypotheses.
A vast pharmacopeia of well-characterized specific ligands and antagonists for most NP-GPCRs (Alexander et al., 2017) will be bedrock for the functional analysis of neuron-type-specific peptide signaling. For analysis of type-specific neuropeptide signaling in network context (i.e., ex vivo slices and in vivo), newer optophysiological methods of calcium imaging and optogenetic stimulation/inhibition will certainly join electrophysiology as foundations for measurement of neuropeptide impacts. In addition, many new tools more specific to neuropeptide signaling are emerging. Super-resolution 3D immunohistologies like array tomography (Smith, 2018) and 3D single-molecule methods (Jia et al., 2014; von Diezmann et al., 2017) will enable imaging of DCV localization and neuropeptide contents in type-specific network anatomical context. Genetically encoded sensors of extracellular GPCR ligands (Patriarchi et al., 2018; Sun et al., 2018), GPCR activation (Haider et al., 2019; Hill and Watson, 2018; Livingston et al., 2018; Ratnayake et al., 2017; Stoeber et al., 2018), G-protein mobilization (Ratnayake et al., 2017), cAMP concentration (Hackley et al., 2018; Ma et al., 2018), protein kinase activation (Chen et al., 2014) and protein phosphorylation (Haider et al., 2019) will enable fine dissection of NP dynamics and NP-GPCR signal transduction events (Spangler and Bruchas, 2017). In addition, new caged NP-GPCR ligands (Banghart et al., 2018) and antagonists (Banghart et al., 2013) will provide for precise spatial and temporal control for NP receptor activation. All of these tools have been demonstrated already in physiological applications, and all should be readily applicable to testing specific hypotheses derived from the type-specific peptidergic signaling predictions we have set forth.
Prospects for elucidating network homeostasis, modulation and plasticity
The original motivation for the present analysis was to deepen our understanding of the homeostasis, modulation and plasticity of cortical synaptic networks. Our work has raised the prospect that dense and highly multiplexed peptidergic neuromodulation could play very significant roles in these processes. Due to the clearly formidable complexity of cortical networks, however, a real grasp of the myriad network interactions implicated is certain to require theoretical and computational approaches, in addition to the biophysical approaches outlined in the preceding section. Perhaps most intriguing in the more theoretical directions are concepts that have emerged from work at the fertile intersection of the neuroscience of learning and memory and the computer science of machine learning and artificial neural networks (Dayan and Abbott, 2001; Huh and Sejnowski, 2017; Koch and Segev, 1998; Lillicrap et al., 2016; Marblestone et al., 2016; Shai and Larkum, 2017; Song et al., 2000).
Neuroscience and computer science efforts to model or engineer adaptive neural networks share the hard problem of optimal adjustment of large numbers of what both fields call “synaptic weights”. At the heart of this challenge is “credit assignment”, that is, the assignment of “credit” for progressive improvement during network development and learning processes to the correct subsets of synapses as needed to guide individualized synaptic weight adjustment. Neuroscientists struggle with the credit assignment problem as they search for the relevant biological learning rules. Computer scientists struggle with the excessive computational requirements of currently standard backpropagation-of-error-based credit assignment. One concept that has come into prominence as a candidate biologically plausible solution to the credit assignment problem is that of modulated “Hebbian” or “spike-timing-dependent” plasticity (STDP) (Bengio et al., 2016; Dan and Poo, 2006; Farries and Fairhall, 2007; Florian, 2007; Frémaux and Gerstner, 2016; Izhikevich, 2007; Marblestone et al., 2016; Pawlak et al., 2010; Poo et al., 2016; Roelfsema and Holtmaat, 2018; Xie and Seung, 2003). While most biological studies of modulated STDP so far have focused on the monoamine neuromodulator dopamine (Izhikevich, 2007; Kuśmierz et al., 2017; Schultz, 2015), known commonalities of signal transduction downstream from widely varying GPCRs suggest strongly that NP-GPCRs could play roles closely analogous to those postulated for dopamine-selective GPCRs (Hamilton et al., 2013; Roelfsema and Holtmaat, 2018).
A neurotaxonomic framework for integrating multiple, superimposed modulatory and synaptic networks, analogous to that schematized in very simple form by Fig. 7, may prove critical to advancing theoretical analyses of synaptic network homeostasis and plasticity. At present, efforts in this direction are limited by scant empirical information on synaptic connectomes and their neurotaxonomic annotation. It is very encouraging, however, that vigorous ongoing efforts, e.g., see (Daigle et al., 2018; Swanson and Lichtman, 2016; Tasic, 2018; Zeng and Sanes, 2017), suggest that such information is likely to materialize soon.
Prospects for neuropsychiatric drug development
Molecular components of neuropeptide signaling have beguiled as drug targets since the first wave of discovery that crested in the late twentieth century (Hökfelt et al., 2003; Hoyer and Bartfai, 2012). Many billions of dollars have been invested accordingly, but the returns seem to have been less than originally hoped. The present study raises the possibility that both NP-targeted drug discovery and the reproducibility of physiological experimentation have been hindered in similar ways by the same uncharted multiplicity, cell-type-specificity and redundancy of NPP and NP-GPCR expression. By charting these waters, single-neuron transcriptomic analysis may improve the odds substantially for both reproducible research and drug development.
Today’s psychiatric pharmaceuticals almost all target signaling by the monoamine neuromodulators dopamine, serotonin, noradrenaline and/or histamine and their selective GPCR receptors (Data-Franco et al., 2017; Hamon and Blier, 2013; Millan et al., 2015; Urs et al., 2014). Because they are so numerous, neuropeptide signaling systems may be much more neuron-type specific than monoamines. Greater neuron-type-specificity may translate to NP-targeting drugs being less troubled by side-effects and compensation (Hoyer and Bartfai, 2012). Moreover, while GPCRs have long been known as among the most “druggable” of targets (Gurrath, 2001; Lundstrom, 2009), the “druggability” of GPCRs is currently advancing very rapidly due to advances in GPCR structural biology and molecular dynamic simulations (Hilger et al., 2018; Koehl et al., 2018; Weis and Kobilka, 2018). It seems likely that new knowledge of peptide component neuron-type-specificity may substantially advance the development of NP-targeting pharmaceuticals.
Conclusions
Analysis of single-cell RNA-Seq data from mouse cortex reveals a new panoramic view of NPP and NP-GPCR gene expression. This view exposes an unexpected density and multiplicity of neuropeptide gene expression, as we have summarized and discussed. We have articulated many of these findings into new and specific predictions regarding ways that cortical neurons may modulate one another’s function. These predictions are just now becoming subject to experimental test with the recent emergence of transcriptomic neurotaxonomies, means for genetic access to specific neuron types and powerful new tools for biophysical analysis of neuropeptide actions. Such tests are likely to greatly deepen our understanding of adaptive cortical function.
Supplementary Methods
Autoencoder-based classifier development and evaluation methods
We used two types of gene datasets: 1) The “HE” gene set, which contains the expression of 6083 highly expressed neuronal genes in 22,439 neurons and 2) 47-gene sets, which contain the expression of 47 specific genes in 22,439 neurons (chosen either as the set of peptidergic precursor genes or random sets of 47 genes as explained in main text). Both “HE” and 47-gene datasets are divided into training and validation sets using a 92%-8% split.
Autoencoders are deep neural network models that consist of encoder/decoder subnetworks. In its basic form (Hinton and Salakhutdinov, 2006), the encoder subnetwork compresses the high dimensional input into a low dimensional representation, and the decoder subnetwork estimates the original input from that low dimensional representation. We constructed a network with two autoencoders, with 8 hidden layers each. The architecture of the first autoencoder (“HE Genes autoencoder”) is Input(6083) → Dropout(0.8) → Dense(100) → Dense(100) → Dense(100) → Dense(100) → Dense(d) → Batch Normalization (latent representation z1) → Dense(100) → Dense(100) → Dense(100) → Dense(100) → Dense(6083), and the architecture of the second autoencoder (“NP Genes autoencoder”) is Input(47) → Dropout(0) → Dense(50) → Dense(50) → Dense(50) → Dense(50) → Dense(d) → Batch Normalization (latent representation z2) → Dense(50) → Dense(50) → Dense(50) → Dense(50) → Dense(47). Here, the numbers in parentheses denote the number of units in that layer, the numbers of input/output units in each network match the number of input genes, and the Dropout layer (Srivastava et al., 2014) is used to prevent overfitting in the first network. The 2-d representations shown in Fig. 4A (d=2) and the 5-d representations used in Fig. 4D,E (d=5) are the outputs of the Batch Normalization layer (Ioffe and Szegedy, 2015) for both networks. The Dense layers use the rectified linear (ReLU) function as the nonlinear transformation except for the Dense(d) layers, which do not use a nonlinear transformation. Both networks were iteratively trained using the backpropagation algorithm with the Adam optimizer (Kingma and Ba, 2014) and a batch size of 956. The “HE Genes” network was trained for 50,000 epochs using the mean squared error between the input and the output layers as the loss function.
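To make the layer sequence concrete, the following is a minimal NumPy sketch of a forward pass through the “NP Genes” architecture listed above. It is not the authors' TensorFlow/Keras code: the weights are random placeholders rather than trained parameters, Dropout is omitted (its rate is 0 for this network), and Batch Normalization is assumed to standardize each latent dimensionality within the batch without a learned scale or shift. All function names are hypothetical.

```python
import numpy as np

def init_dense(rng, n_in, n_out):
    """Random placeholder weights and zero bias for one Dense layer."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out)

def relu(x):
    return np.maximum(x, 0.0)

def batch_norm(z, eps=1e-3):
    # ASSUMPTION: batch statistics only, no learned scale/shift parameters.
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

def build_autoencoder(rng, n_genes, width, d):
    """Encoder: n_genes -> 4 x width -> d; decoder: d -> 4 x width -> n_genes."""
    enc_sizes = [n_genes] + [width] * 4 + [d]
    dec_sizes = [d] + [width] * 4 + [n_genes]
    enc = [init_dense(rng, a, b) for a, b in zip(enc_sizes[:-1], enc_sizes[1:])]
    dec = [init_dense(rng, a, b) for a, b in zip(dec_sizes[:-1], dec_sizes[1:])]
    return enc, dec

def forward(x, enc, dec):
    h = x
    for W, b in enc[:-1]:
        h = relu(h @ W + b)        # four ReLU hidden layers
    W, b = enc[-1]
    z = batch_norm(h @ W + b)      # Dense(d) is linear, then Batch Normalization
    h = z
    for W, b in dec:
        h = relu(h @ W + b)        # decoder Dense layers, all ReLU
    return z, h

rng = np.random.default_rng(0)
enc, dec = build_autoencoder(rng, n_genes=47, width=50, d=2)
x = rng.random((32, 47))           # placeholder batch of 32 cells x 47 genes
z2, x_hat = forward(x, enc, dec)   # latent codes and reconstruction
```

The “HE Genes” network has the same shape with 6,083 inputs and hidden width 100, that is, `build_autoencoder(rng, 6083, 100, d)` in this sketch.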
The “NP genes” network was trained for 10,000 epochs using L=R+λC as the loss function, where R denotes the mean squared reconstruction loss as in the HE Genes network, C denotes the coupling loss between the latent representations of the two networks, and λ=100 is the weighting scalar between the two terms. After training the HE genes network and obtaining the latent representation z1 for each cell, C calculates the mean squared error between the latent representation of the NP genes network z2 and z1, normalized by the minimum eigenvalue of the 2-d representations of each batch during each training iteration. The two additive terms, R and C, together minimize the reconstruction error while attempting to match the representations learned based on the HE gene set. The same procedure was used for all small gene subsets including NP and random gene sets. The Python implementations of the autoencoders using the Tensorflow [128] and Keras [129] libraries will be made publicly available upon acceptance.
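The loss L=R+λC can be sketched as follows. This is a minimal NumPy illustration under stated assumptions rather than the authors' implementation: the function name is hypothetical, and the normalizer is taken to be the smallest eigenvalue of the within-batch covariance matrix of z2, which is one possible reading of “the minimum eigenvalue of the 2-d representations of each batch”.

```python
import numpy as np

def coupled_loss(x, x_hat, z1, z2, lam=100.0):
    """Return (L, R, C) for one batch, with L = R + lam * C."""
    # R: mean squared reconstruction error of the small-gene-set autoencoder.
    R = np.mean((x - x_hat) ** 2)

    # ASSUMPTION: normalize by the smallest eigenvalue of the batch covariance
    # of z2; a floor avoids division by zero for degenerate batches.
    cov = np.atleast_2d(np.cov(z2, rowvar=False))
    min_eig = max(float(np.linalg.eigvalsh(cov).min()), 1e-8)

    # C: mean squared distance between the two latent representations.
    C = np.mean((z2 - z1) ** 2) / min_eig
    return R + lam * C, R, C

rng = np.random.default_rng(1)
x = rng.random((64, 47))                 # placeholder input batch
z1 = rng.normal(size=(64, 2))            # placeholder HE-genes latent codes
z2 = z1 + 0.1                            # placeholder NP-genes latent codes
loss, R, C = coupled_loss(x, x, z1, z2)  # perfect reconstruction, offset latents
```

A normalizer of this kind plausibly keeps the coupling term from being reduced simply by shrinking z2 toward a point, since a collapsing representation also shrinks the eigenvalue in the denominator.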
We determined the optimal latent dimensionality (d=5) by varying the latent space dimensionality of the HE Genes network between 2 and 20 dimensions. The optimal dimensionality was chosen by maximizing the classification accuracy of a Gaussian Mixture Model (GMM) on a test set, whose cluster memberships in the training set are those of the resource taxonomy (Tasic et al., 2018). We used the adjusted Rand index to quantify the similarity between two different partitionings of the same test set (e.g., the Tasic 2018 taxonomy as the ground truth and the predictions of the GMM), where a score of 1 corresponds to a perfect matching and a score of 0 corresponds to the chance level. At the optimal latent dimensionality of d=5, the GMM classifier achieved an adjusted Rand index of 0.8672 on the test set.
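For reference, the adjusted Rand index can be computed from the contingency table of two labelings. The sketch below is a small NumPy version of the standard formula, not the authors' code; the function names and the toy labelings are hypothetical.

```python
import numpy as np

def pairs(n):
    """Number of unordered pairs, n choose 2, elementwise."""
    n = np.asarray(n, dtype=float)
    return n * (n - 1.0) / 2.0

def adjusted_rand_index(labels_true, labels_pred):
    # Map arbitrary labels to integer codes.
    _, t = np.unique(np.asarray(labels_true).ravel(), return_inverse=True)
    _, p = np.unique(np.asarray(labels_pred).ravel(), return_inverse=True)

    # Contingency table: counts of cells per (true class, predicted class).
    table = np.zeros((t.max() + 1, p.max() + 1))
    np.add.at(table, (t, p), 1)

    same_both = pairs(table).sum()             # pairs co-clustered in both
    same_true = pairs(table.sum(axis=1)).sum() # pairs co-clustered in truth
    same_pred = pairs(table.sum(axis=0)).sum() # pairs co-clustered in prediction
    expected = same_true * same_pred / pairs(t.size)
    max_index = 0.5 * (same_true + same_pred)
    if max_index == expected:                  # degenerate case, e.g. one cluster each
        return 1.0
    return float((same_both - expected) / (max_index - expected))

perfect = adjusted_rand_index([0, 0, 1, 1], ["b", "b", "a", "a"])  # 1.0
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])          # -0.5
```

The index ignores label names, so a relabeled but identical partition scores 1, while partitions that agree less than expected by chance score below 0.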
We quantified the performance of the GMM classifiers due to the different gene sets using the hierarchical dendrogram of the Tasic 2018 taxonomy by calculating the Resolution Index (RI) [33] for each cell. RI measures the depth of the first common ancestor of the predicted node and the original node in the taxonomy, from the lowest resolution at the root (RI = 0) to the finest resolution at the leaves (RI = 1). To account for the exclusion of all non-neuronal cell-types, scores were normalized over the resolution index corresponding to the first neuronal node on the Tasic 2018 taxonomy [33]. The performance of each gene set was quantified by taking the average RI score (across the cells) due to the respective GMM classifiers acting on the respective latent space representations. The RI scores reported in the main text are averages over all cells (training and test). The corresponding scores due to the test set only are 0.920 for the NP 47-gene subset, 0.768±0.039 for the expression-matched random sets of 47 genes, and 0.464±0.069 for the random sets of 47 genes, demonstrating an even wider performance gap.
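The per-cell score can be illustrated on a toy taxonomy. This is a minimal sketch, assuming each dendrogram node carries a precomputed resolution value between 0 (root) and 1 (leaf); the tree, the node names and the values below are hypothetical and far smaller than the Tasic 2018 dendrogram, and the root here stands for the first neuronal node.

```python
def ancestors(node, parent):
    """Path from a node up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def resolution_index(true_leaf, pred_leaf, parent, res):
    """Resolution value of the first common ancestor of the two nodes."""
    true_path = set(ancestors(true_leaf, parent))
    for node in ancestors(pred_leaf, parent):
        if node in true_path:
            return res[node]
    raise ValueError("nodes do not share a root")

# Hypothetical three-level taxonomy: child -> parent.
parent = {
    "GABAergic": "Neuron", "Glutamatergic": "Neuron",
    "Sst": "GABAergic", "Pvalb": "GABAergic", "L5 IT": "Glutamatergic",
}
# Hypothetical resolution values: 0 at the root, 1 at the leaves.
res = {"Neuron": 0.0, "GABAergic": 0.5, "Glutamatergic": 0.5,
       "Sst": 1.0, "Pvalb": 1.0, "L5 IT": 1.0}

# (true type, predicted type) for three cells.
calls = [("Sst", "Sst"), ("Sst", "Pvalb"), ("Sst", "L5 IT")]
scores = [resolution_index(t, p, parent, res) for t, p in calls]
mean_ri = sum(scores) / len(scores)
```

A correct leaf assignment scores 1, a confusion between sibling GABAergic types scores the value of their shared parent, and a confusion across the top-level split scores 0.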
Acknowledgements
We wish to thank the Allen Institute for Brain Science founder, Paul G. Allen, for his vision, encouragement and support. This work was supported in part by award number R01NS092474 from the Office of the Director of National Institutes of Health and award number R01MH104227 from the National Institute of Mental Health. The content is solely the responsibility of the authors and does not necessarily represent official views of the National Institutes of Health.