SUMMARY PARAGRAPH
Extracting high-degree interactions and dependences between variables (pairs, triplets, … k-tuples) is a challenge posed by all omics approaches1, 2. Here we used multivariate mutual information (Ik) analysis3 on single-cell retro-transcription quantitative PCR (sc-RTqPCR) data obtained from midbrain neurons to estimate the k-dimensional topology of their gene expression profiles. Forty-one mRNAs were quantified and statistical dependences in gene expression levels could be fully described for 21 genes: Ik analysis revealed a complex combinatorial structure including modules of pairs, triplets and larger k-tuples (up to 6-tuples) sharing strong positive, negative or zero Ik, corresponding to co-varying, clustering and independent sets of genes, respectively. Therefore, Ik analysis simultaneously identified heterogeneity (negative Ik) of the cell population under study and regulatory principles conserved across the population (homogeneity, positive Ik). Moreover, maximum information paths made it possible to determine the size and stability of such transcriptional modules. Ik analysis represents a new topological and statistical method of data analysis.
MAIN TEXT
The recent evolution of single-cell transcriptomics has created much hope for our understanding of cell identity, cell development and gene regulation4. Using quantitative PCR or RNAseq, tens to thousands of mRNAs can be quantified from a single cell, generating particularly high-dimensional datasets (gene expression profiles). Combined with clustering and dimensionality-reduction techniques, these approaches have been successfully used to identify and separate cell types in various tissues, including the brain5. Single-cell transcriptomics has also been used to shed light on the gene regulatory principles underlying the specific phenotype of different cell types6, 7, frequently relying on pairwise analysis of gene expression levels to infer gene regulatory networks6, 8. However, the modular architecture of gene networks suggests that extracting higher-degree interactions between gene expression profiles may be necessary to understand gene regulation, and various approaches based on probability/information theory8–10 or homology11 have been proposed to tackle this issue.
Several transcriptomics studies have been performed on midbrain dopaminergic (DA) neurons5, 12: consistent with the heterogeneous vulnerability of this neuronal population in Parkinson’s disease13, qPCR and RNAseq performed at the single-cell level have revealed a significant diversity in gene expression profiles5, 12. In parallel, much work has also been performed to understand the gene regulatory networks and identify the regulatory factors underlying the emergence of the DA phenotype14, with the therapeutic intent of producing functional DA neurons from induced pluripotent stem cells15.
Here we implement multivariate mutual information (Ik) analysis on transcriptomics data from single midbrain DA neurons to simultaneously provide new insights about the molecular heterogeneity of this neuronal population and about the gene regulatory principles underlying its specific phenotype.
We performed sc-RTqPCR on acutely dissociated identified midbrain neurons using the microfluidic BioMark™ HD Fluidigm platform. TH-GFP mice were used to preferentially target putative DA neurons (identified by the presence of tyrosine hydroxylase, TH-positive neurons, Supplementary Figure 1a). Electrophysiological recordings confirmed that acutely dissociated GFP and non-GFP neurons displayed the electrical properties expected for DA and non-dopaminergic (nDA) midbrain neurons16, 17, respectively (Supplementary Figure 2). However, since TH presence alone has been shown to not be a reliable marker18, DA and nDA phenotypes were refined based on the combined expression of Th/TH and Slc6a3/DAT (DA transporter) or lack thereof, allowing neurons collected from wild-type animals to be included (Supplementary Figure 1b). Based on Th-Slc6a3 expression, 111 neurons were classified as DA and 37 as nDA.
We quantified the levels of expression of 41 genes (Figure 1a), including 19 related to ion channel function, 9 related to neurotransmitter definition, 5 related to neuronal activation and calcium binding, and 3 related to neuronal structure (Supplementary Figure 1c). As expected, DA metabolism and signaling-related genes such as Th/TH, Slc6a3/DAT, Slc18a2/VMAT2, Drd2/D2R were highly expressed in DA neurons only, while expression levels of Slc17a6/VGLUT2, Gad1/GAD67 and Gad2/GAD65 suggested that collected nDA neurons used mainly glutamate or GABA as neurotransmitters (Figure 1a-b, Supplementary Figure 3). While some ion channels showed similar expression profiles in DA and nDA neurons (Cacna1c/Cav1.2, Cacna1g/Cav3.1, Hcn2/HCN2, Hcn4/HCN4, Kcna2/Kv1.2, Scn8a/Nav1.6), others (Kcnb1/Kv2.1, Kcnd3_2/Kv4.3, Kcnj6/GIRK2, Kcnn3/SK3, Scn2a1/Nav1.2) displayed higher levels of expression in DA neurons (Figure 1b, Supplementary Figure 3). In addition, although a few genes displayed a fairly stable level of expression across DA neurons (Th/TH, Slc6a3/DAT, Kcnd3_2/Kv4.3, Scn2a1/Nav1.2), most genes displayed significant variability in their expression levels (including dropout events) across cells (Figure 1b, Supplementary Figure 3), consistent with the already documented heterogeneity of midbrain DA neurons5, 13, 14.
As a first step in deciphering higher-degree relationships, we performed Pearson correlation analysis on the 33 most relevant genes (Figure 1c-d, Supplementary Figure 4). The patterns of correlations were clearly different for DA and nDA neurons, with more widespread correlations in DA neurons, as can be seen in the correlation maps (Figure 1c). This is only partly surprising as most of the genes were chosen because of their known expression in DA neurons, but it nonetheless demonstrates that specific signatures of second-degree linear relationships participate in the identity of the two populations under study (Figure 1d). While most of the cell type-specific correlations involved differentially expressed mRNAs, some similarly expressed genes displayed a stronger correlation in a specific population: Kcnj6/GIRK2 vs Scn2a1/Nav1.2 for instance in DA neurons, Scn2a1/Nav1.2 vs Slc17a6/VGLUT2 or Hcn4/HCN4 vs Nefm/NEF3 in nDA neurons (Figure 1d, Supplementary Figure 4). Several correlations were also present in both cell types (Kcna2/Kv1.2 vs Nefm/NEF3, Hcn2/HCN2 vs Nefm/NEF3). Interestingly, some of the strongest correlations found in DA neurons linked the group of genes involved in DA metabolism and signaling (Th/TH, Slc6a3/DAT, Slc18a2/VMAT2, Drd2/D2R) to a group of ion channel genes (Kcnj6/GIRK2, Kcnd3_2/Kv4.3, Kcnn3/SK3, Scn2a1/Nav1.2) (Figure 1d, Supplementary Figure 4), suggesting the existence of a large module of co-regulated genes. However, the size of such modules might only be accurately defined by methods capturing high-dimensional (beyond pairs) statistical dependences.
Various information theoretical approaches have been proposed to define gene regulatory modules based on the exploration of higher-degree relationships, notably three-way interactions8-10 (see also Supplementary methods). Here we present a method that combines in a single framework statistical and topological analysis of gene expression for systematic identification and quantification of such regulatory modules, based on the information cohomology developed by Baudot and Bennequin3. In this framework, joint-entropy (Hk) and multivariate mutual information (Ik) quantify the variability/randomness and the statistical dependences of the variables, respectively, while simultaneously estimating the topology of the dataset. We restricted the general setting defining information structures from the whole lattice of partitions of joint random variables to the simplicial sublattice of “set of subsets”, thus computationally allowing an exhaustive estimation of Hk and Ik at all degrees k and for every k-tuple (for k ≤ n=21, k being the degree/number of genes analyzed as a k-tuple, n being the total number of genes analyzed; Figure 2a, Supplementary methods). Information values obtained with this analysis provide a ranking of the lattices at each degree k (Supplementary methods). The Hk and Ik analyses therefore estimate the variability and statistical dependences at all degrees k, from 1 to n. Ik is defined as follows3, 19, 20: $I_k(X_1;\dots;X_k) = \sum_{i=1}^{k} (-1)^{i-1} \sum_{I \subset [k];\, \mathrm{card}(I)=i} H_i(X_I)$, giving, for k=3, $I_3 = H_1(X_1) + H_1(X_2) + H_1(X_3) - H_2(X_1,X_2) - H_2(X_1,X_3) - H_2(X_2,X_3) + H_3(X_1,X_2,X_3)$, where XI denotes the joint-variable corresponding to the subset I. Ik is equivalent to entropy for k = 1, has upper and lower limit values of log2(N) and –log2(N) bits (N being the number of bins or graining used to discretize the data; N=8 in the present case, Supplementary Figure 5), is always non-negative for k < 3, and can take negative values for k ≥ 3 19-21 (Supplementary methods).
As an example, the maxima and minima of I3 for 3 binary variables are depicted in Supplementary Figure 6: while maxima (positive Ik) correspond to a fully redundant behavior (x1, x2 and x3 are informationally equivalent), the minima (negative Ik) correspond to cases where variables are pairwise independent (I2=0) but strictly tripletwise dependent (emergent behavior). In other terms, positive Ik captures co-variations and usual linear correlations as a subcase, zeros of Ik capture statistical independence, and negativity captures more complex relationships that cannot be detected on lower dimensional projections, such as degree-specific clustering patterns (also called synergy or frustration)9, 21, 22 (Supplementary methods).
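The two extreme cases described above can be checked numerically. The following is a minimal Python sketch, not the authors' implementation: it estimates Ik as the alternating sum of empirical joint entropies over all non-empty subsets of variables. The function names (`entropy`, `mutual_information`) and the two toy datasets are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations, product
from math import log2


def entropy(samples, idx):
    """Empirical joint entropy (bits) of the columns listed in idx."""
    counts = Counter(tuple(row[i] for i in idx) for row in samples)
    total = len(samples)
    return -sum(c / total * log2(c / total) for c in counts.values())


def mutual_information(samples, idx):
    """I_k as the alternating sum of joint entropies over subsets of idx."""
    return sum(
        (-1) ** (size - 1) * entropy(samples, subset)
        for size in range(1, len(idx) + 1)
        for subset in combinations(idx, size)
    )


# Fully redundant triplet: x1 = x2 = x3, both states equally likely.
redundant = [(0, 0, 0), (1, 1, 1)]

# Pairwise independent but triplet-dependent: x3 = x1 XOR x2.
xor = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2)]

i3_redundant = mutual_information(redundant, (0, 1, 2))
i3_xor = mutual_information(xor, (0, 1, 2))
i2_xor = mutual_information(xor, (0, 1))

print(i3_redundant, i3_xor, i2_xor)
```

With binary variables (N=2), the redundant triplet reaches the upper bound log2(2) = 1 bit, while the XOR triplet reaches the lower bound of -1 bit even though each of its pairs has I2 = 0.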
We applied Ik analysis to the gene expression levels measured in DA and nDA neurons for the 21 most relevant genes (Figure 2). The variability in expression of each gene Xi is quantified by the entropy H1(Xi)=I1(Xi) (Supplementary methods). Consistent with the expression profiles depicted in Figure 1b, the smallest and largest values of I1 were found for nDA neurons (Figure 2b,d). The genes sharing the strongest I2 values (Figure 2b) significantly overlapped with those sharing strong Pearson correlations (Figure 1d), in particular for DA neurons (Th/TH, Slc6a3/DAT, Slc18a2/VMAT2, Drd2/D2R, Kcnj6/GIRK2, Kcnd3_2/Kv4.3, Kcnn3/SK3, Scn2a1/Nav1.2). Nevertheless, the precise patterns of I2-sharing genes were different, because Ik also identifies non-linear dependences23. Interestingly, for k ≥ 3, the modules of genes sharing the strongest positive I3 and I4 displayed dense overlap with those sharing the strongest I2, while the groups of genes sharing the strongest negative I3 and I4 (Cacna1g/Cav3.1, Calb1/CB, Drd2/D2R, Kcna2/Kv1.2, Kcnb1/Kv2.1, Kcnj11/Kir6.2, Nefm/NEF3, Slc17a6/VGLUT2) had very little overlap with the strongly correlated (see Figure 1d) or strong I2-sharing genes (Figure 2b), especially for DA neurons. Ik was also calculated for higher degrees (5 to 21), and examples of the strongest positive and negative information modules are shown for I5 and I10 in Figure 2b. Consistent with the theoretical examples presented in Supplementary Figure 6, strong negative I4 was associated with clustering patterns of expression while strong positive I4 corresponded to co-varying patterns of expression (Figure 2c). In general, the distribution of Ik at each degree was found to be very different between DA and nDA neurons, with a predominance of independence (0 values) and strong negative values in nDA compared to DA neurons (Figure 2d).
In order to provide an exhaustive picture of the statistical dependences in both populations, we determined the information landscapes corresponding to the distribution of Ik values as a function of degree k (Figure 3a, Supplementary Figure 7). To help the reader understand this representation, two theoretical examples are given in Supplementary Figure 7b: for randomly equidistributed (independent) variables, I1 = log2(N), and I2,…,n = 0; while for strictly redundant variables (e.g. correlation of 1), I1,…,n = log2(N). The information landscapes of DA and nDA neurons were found to be very different from these two theoretical examples and from each other: in particular, the landscape of nDA neurons mainly comprised strong negative and 0 Ik values for k ≥ 3, suggesting that most k-tuples of genes are k-independent in these neurons. The prevalence of k-independence was found to be even stronger when the information landscape was computed for the 20 “less-relevant” genes in DA and nDA neurons (Supplementary Figure 7c). In contrast, the information landscape of DA neurons showed a predominance of negative Ik for k < 5 and predominance of positive Ik for k ≥ 5 (Figure 3a). Therefore, this analysis revealed a complex combinatorial structure of gene expression profiles in DA and nDA neurons, mixing independent, synergistic and redundant k-tuples of genes for k ≥ 3. In analogy with mean-field approximations, we also calculated the mean information for all degrees (Figure 3a, Supplementary methods). Due to the rather small number of cells analyzed and the inherent undersampling issue, the information landscapes computed here (especially the mean landscapes) should be interpreted with caution for k > 6 (DA) and k > 5 (nDA), even though maximal positive and negative Ik values are less sensitive to this limit (see Supplementary methods).
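The two theoretical landscapes mentioned above (independent equidistributed variables and strictly redundant variables) can be reproduced on a small scale. The sketch below is only an illustration under stated assumptions: it uses n=3 variables and a graining of N=4 rather than the values used in the study, and the helper names (`information_landscape`, `mean_landscape`) are hypothetical, not taken from the authors' program.

```python
from collections import Counter
from itertools import combinations, product
from math import log2


def entropy(samples, idx):
    """Empirical joint entropy (bits) of the columns listed in idx."""
    counts = Counter(tuple(row[i] for i in idx) for row in samples)
    total = len(samples)
    return -sum(c / total * log2(c / total) for c in counts.values())


def mutual_information(samples, idx):
    """I_k as the alternating sum of joint entropies over subsets of idx."""
    return sum(
        (-1) ** (size - 1) * entropy(samples, subset)
        for size in range(1, len(idx) + 1)
        for subset in combinations(idx, size)
    )


def information_landscape(samples):
    """Map each degree k to the list of I_k values over all k-tuples."""
    n_vars = len(samples[0])
    return {
        k: [mutual_information(samples, t) for t in combinations(range(n_vars), k)]
        for k in range(1, n_vars + 1)
    }


def mean_landscape(landscape):
    """Average I_k over the k-tuples of each degree."""
    return {k: sum(values) / len(values) for k, values in landscape.items()}


N, n = 4, 3  # illustrative graining and number of variables

# Independent, equidistributed variables: every joint state occurs once.
independent = list(product(range(N), repeat=n))
# Strictly redundant variables: all columns are copies of one variable.
redundant = [(state,) * n for state in range(N)]

landscape_independent = information_landscape(independent)
mean_redundant = mean_landscape(information_landscape(redundant))

print(landscape_independent)
print(mean_redundant)
```

In this toy setting the independent case gives I1 = log2(4) = 2 bits and Ik = 0 for k ≥ 2, whereas the redundant case stays at 2 bits for every degree, matching the two reference shapes described in the text.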
The Ik analysis presented in Figure 2 revealed that modules of strong positive or negative Ik could persist across degrees, but did not allow us to estimate the size of these gene modules. In order to quantify the stability of information modules and determine their size, we estimated the information flow over paths in the lattice of random variables in DA neurons (Figure 3b-c, Supplementary Figure 8). For a given information path, the first derivative with respect to the degree k is given by the conditional mutual information with a minus sign (Supplementary methods): $I_k(X_1;\dots;X_k) - I_{k-1}(X_1;\dots;\hat{X_i};\dots;X_k) = -X_i.I_{k-1}(X_1;\dots;\hat{X_i};\dots;X_k)$, where ^ denotes the omission of Xi (the conditioning variable). Xi.Ik-1 stays positive (negative slope) if adding a variable Xi to the module increases the information while a negative Xi.Ik-1 (positive slope) indicates that adding a variable increases the uncertainty about the module. Therefore, reaching the first minimum Xi.Ik-1 = 0 indicates that adding a variable stops being informationally relevant, and allows us to define the degree at which information modules become unstable. In other words, the degree of the first minimum gives a definitive assessment of the size of a gene module.
We characterized the paths that maximized mutual information (most informative modules) or that minimized mutual information (sequences of variables that best segregate the whole set of variables), and that remained stable (Supplementary methods). Figure 3b presents the 4 longest paths of maximal and minimal information, which correspond to stable modules of degree 6 and 4, respectively. We then built the scaffold composed of the 4 maximal and minimal information paths (Figure 3c). All the genes involved in defining DA metabolism and signaling were found in the scaffold of maximal paths (Th/TH, Slc6a3/DAT, Slc18a2/VMAT2, Drd2/D2R), together with three ion channel genes (Kcnj6/GIRK2, Kcnd3_2/Kv4.3, Kcnn3/SK3), in keeping with the pairs, triplets and quadruplets of positive Ik-sharing genes identified in Figure 2b. This finding brings new insights to our understanding of gene regulation in DA neurons. As shown in Figure 2c, the genes sharing strong positive Ik have co-varying profiles of expression, which is usually considered to indicate a co-regulation of expression4, 8. Therefore, the positive information module determined using conditional mutual information (Figure 3c) should correspond to a group of genes co-targeted by the same regulatory factors. Several studies have demonstrated that the expression levels of Th/TH, Slc6a3/DAT, Slc18a2/VMAT2 and Drd2/D2R are indeed under the control of the same pair of transcription factors Nurr1/Pitx3 24 (Supplementary Figure 9). Our results are consistent with these observations, but moreover suggest that these four genes might be part of a larger transcriptional module (≥ 7 genes) that also includes genes defining the electrical phenotype of DA neurons (Kcnj6/GIRK2, Kcnd3_2/Kv4.3, Kcnn3/SK3).
This also means that defining the neurotransmitter identity and the electrical phenotype of these neurons might be the product of a single transcriptional program, involving at least the Nurr1 and Pitx3 transcription factors (Supplementary Figure 9). Alternatively, this coupling between ion channel and DA metabolism genes might also reflect the documented activity-dependent regulation of DA-specific genes such as Th/TH, which has been shown to be sensitive to blockade of sodium (including Nav1.2) and potassium (including SK3) channel activity25.
On the other hand, the minimal information paths identified the 8 genes that best segregate midbrain DA neurons (Figure 3c), supporting the already documented diversity of this neuronal population12-14. The presence of Abcc8/SUR1, Cacna1g/Cav3.1, Calb1/CB, Gad2/GAD65, Kcnj11/Kir6.2 and Drd2/D2R is perfectly consistent with several studies linking the expression of these genes to specific subpopulations of SNc and VTA neurons 13, 26-29 (Supplementary Figure 9). Importantly, our analysis reveals that other genes, in particular the potassium channels Kcna2/Kv1.2 and Kcnb1/Kv2.1 might be used as markers of midbrain DA neuron subpopulations.
In summary, we showed that the topology of a high-dimensional dataset defined by the independence, and the simple (redundant) and complex (synergistic) statistical dependences at all degrees can be estimated using multivariate mutual information analysis (Ik). Applied to sc-RTqPCR data, Ik analysis allowed us to simultaneously determine the size and identity of gene regulatory modules conserved across a cell population and the size and identity of gene modules underlying cell diversity (Supplementary Figure 9). Therefore, the specific complex combinatorial structure of genetic interactions (positive, negative, null) underlying the stability and diversity of a given cell type is described at once by the presented method. While applied here to transcriptomics data, Ik analysis could be applied to any type of high-dimensional data, within the limit of computational tractability.
COMPETING INTERESTS
The authors declare no competing financial interests.
MATERIAL AND METHODS
Acute midbrain slice preparation
Acute slices were prepared from P14–P23 TH-GFP mice (transgenic mice expressing GFP under the control of the tyrosine hydroxylase promoter) 30 of either sex. All experiments were performed according to the European and institutional guidelines for the care and use of laboratory animals (Council Directive 86/609/EEC and French National Research Council). Mice were anesthetized with isoflurane (Piramidal Healthcare Uk) and decapitated. The brain was immersed briefly in oxygenated ice-cold low calcium artificial cerebrospinal fluid (aCSF) containing the following (in mM): 125 NaCl, 25 NaHCO3, 2.5 KCl, 1.25 NaH2PO4, 0.5 CaCl2, 4 MgCl2, 25 glucose, pH 7.4, oxygenated with 95% O2 / 5% CO2 gas. The cortices were removed and then coronal midbrain slices (250 μm) were cut on a vibratome (Leica VT 1200S) in oxygenated ice-cold low calcium aCSF. Following 30–45 min incubation in 32°C oxygenated low calcium aCSF, the slices were incubated for at least 30 min in oxygenated aCSF (125 NaCl, 25 NaHCO3, 2.5 KCl, 1.25 NaH2PO4, 2 CaCl2, 2 MgCl2 and 25 glucose, pH 7.4, oxygenated with 95% O2 / 5% CO2 gas) at room temperature prior to electrophysiological recordings. Picrotoxin (100 μM, Sigma Aldrich, St. Louis, MO) and Kynurenate (2 mM, Sigma Aldrich) were bath-applied via continuous perfusion in aCSF to block inhibitory and excitatory synaptic activity, respectively.
Cell dissociation and collection
Midbrain DA neurons were acutely dissociated following a modified version of the methods described in references 31 and 32. Regions containing the SNc, part of the VTA and SNr were excised from each coronal midbrain slice. The tissue was submitted to papain digestion (2.5 mg/ml and 5 mM L-cysteine) for 15-20 min in oxygenated low calcium HEPES aCSF (containing 10 mM HEPES, pH adjusted to 7.4 with NaOH) at 35-37°C and subsequently rinsed in low-calcium HEPES aCSF supplemented with trypsin inhibitor and bovine serum albumin (1 mg/ml). Single cells were isolated by gentle trituration with fire-polished Pasteur pipettes and plated on poly-L-Lysine-coated coverslips. Dissociated cells were maintained in culture in low calcium HEPES aCSF at 37°C in 5% CO2 for at least 45 minutes. Coverslips were then placed in a cell chamber of a fluorescence microscope and continuously perfused with HEPES-aCSF. Cells were collected by aspiration into borosilicate glass pipettes mounted on a micromanipulator under visual control. Cell dissociation and collection were performed using RNA-protective technique and all solutions were prepared with RNase-free reagents when possible and filtered before use.
Electrophysiology recordings, data acquisition and analysis
All recordings were performed as previously described 33. Picrotoxin and Kynurenate were present for all recordings to prevent contamination of the intrinsic activity by spontaneous glutamatergic and GABAergic synaptic activity. Statistical analysis (performed according to data distribution) included unpaired t test, Mann-Whitney test and paired t test, with a p value <0.05 being considered statistically significant. Statistics were performed using SigmaPlot 10.0 (Jandel Scientific, UK) and Prism 6 (GraphPad Software, Inc., La Jolla, CA).
qPCR assays, specific retro-transcription and targeted amplification (RT-STA)
Pre-designed TaqMan assays (TaqMan® Gene Expression Assays, Thermo Fisher Scientific) used in this study are listed in Supplementary Table 1. Assays were systematically selected to target the coding region and to cover all known splice variants. In the case of Kcnd3 and Kcnj6 genes, two different assays were used to detect all known splice variants. Excluding Fos (754 bp intron) and Bdnf, Kcna2 and Kcnj11 (both primers and probe within a single exon), assays spanning a large intron (>1000 bp) were chosen to avoid DNA amplification. Gad1 primers and probe were designed according to Applied Biosystems criteria and MIQE recommendations 34. TaqMan® assays were pooled (0.2x final concentration) and the preamplification step was validated using log serial dilutions of mouse brain total RNA (MBTR) 5, 6. The following thermal profile was applied: 50°C for 15 min, 95°C for 2 min and 22 cycles of amplification 35 (95°C for 15 s and 60°C for 4 min) following Fluidigm recommendations. For each assay, efficiency was estimated from the slope of the standard curve using the formula $E = (10^{-1/\mathrm{slope}} - 1) \times 100$. All assay efficiencies (89.4 ≤ E ≤ 100.4 %) are listed in Supplementary Table 1.
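As a small worked illustration of the efficiency formula above, the following Python sketch computes the percent efficiency from a standard-curve slope. The function name is an illustrative assumption, and the slope value used is the textbook case of perfect doubling per cycle, not a value from Supplementary Table 1.

```python
from math import log10


def amplification_efficiency(slope):
    """Percent PCR efficiency from the slope of Ct versus log10(input amount)."""
    return (10 ** (-1 / slope) - 1) * 100


# A perfect doubling at every cycle corresponds to a slope of -1/log10(2),
# i.e. about -3.32 cycles per ten-fold dilution.
ideal_slope = -1 / log10(2)
ideal_efficiency = amplification_efficiency(ideal_slope)

print(round(ideal_slope, 2), round(ideal_efficiency, 1))
```

Slopes steeper than the ideal value (more negative) give efficiencies below 100 %, and shallower slopes give efficiencies above 100 %.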
Single-cell RTqPCR, data processing and analysis
Individual GFP and non-GFP neurons were harvested directly into 5 μl of 2x Reaction buffer (CellsDirect™ One-Step qRTPCR, Lifetech) and kept at −80°C until further processing. A reverse transcription followed by a specific targeted pre-amplification (RT-STA) was performed in the same tube (2.5 μl 0.2x assay pool; 0.5 μl SuperScript III) applying the same thermal profile described above. The pre-amplified products were treated with ExoSAP-IT (Affymetrix) and diluted 5-fold prior to analysis by qPCR using 96.96 Dynamic Arrays on a BioMark System (BioMark™ HD Fluidigm). Data were analyzed using Fluidigm Real-Time PCR Analysis software (Linear Baseline Correction Method and User detector Ct, Threshold Method). Two genes, Kcnj6_c and Chat, were undetectable in all analyzed cells. Cells that had a Ct for Hprt above 21 were excluded from further analysis. After interplate calibration, all Ct values were converted into relative expression levels using the equation Log2Ex = CtLOD - Ct(Assay) 36. LOD (limit of detection) was set to Ct=25 by calculating the theoretical Ct value for 1 single molecule in the Biomark system from two custom-designed oligonucleotides: Slc17a6 and Penk. All data pre-processing was performed in Microsoft Excel (Microsoft, Redmond, USA). Heatmap and correlation maps (Pearson correlation coefficient values excluding zero values, p value <0.5, n >5) were generated in the R environment (R Core Team 2016) using gplots, heatmap3, Hmisc and corrplot packages. Gene expression scatter plots and frequency distribution plots were created in SigmaPlot 10.0 (Jandel Scientific) and Prism 6 (GraphPad Software, Inc, La Jolla, CA). Figures were prepared using Adobe Illustrator CS6.
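The Ct-to-expression conversion described above can be sketched as follows. This is a hypothetical helper, not the authors' Excel pre-processing: the text does not state how undetected assays or Ct values beyond the LOD were handled, so mapping them to an expression level of 0 is an assumption made here for illustration.

```python
CT_LOD = 25.0  # limit of detection stated in the text


def log2_expression(ct, ct_lod=CT_LOD):
    """Relative expression Log2Ex = Ct_LOD - Ct(Assay).

    Assumption: an undetected assay (None) or a Ct at or beyond the LOD
    is reported as 0, i.e. treated as a dropout.
    """
    if ct is None or ct >= ct_lod:
        return 0.0
    return ct_lod - ct


cts = {"gene_a": 20.0, "gene_b": 27.3, "gene_c": None}  # made-up Ct values
expression = {gene: log2_expression(ct) for gene, ct in cts.items()}

print(expression)
```

On this scale one unit corresponds to one amplification cycle, that is roughly a two-fold difference in template amount when the assay efficiency is close to 100 %.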
Topological information data analysis
The present analysis is based on the information cohomology framework developed by Baudot and Bennequin3 and relies on theorems establishing uniquely the usual entropy (Hk) and multivariate mutual information (Ik) as the first class of cohomology and coboundaries respectively with finite (non-asymptotic) methods (see Supplementary Methods for more detail).
Simplicial Information structures
The information functions are defined on the whole lattice of partitions of the probability simplex of atomic probabilities, providing the general random variable lattice of joint-variables. The application of this framework to data analysis is developed in the subcase of simplicial information homology, which consists in the exploration of the simplicial sublattice of “set of subsets” defined dually for joint and mutual (meet) monoid structures of random variables, and whose exploration follows binomial combinatorics with a complexity in $O(2^n)$. It allows an exhaustive estimation of the information structure, that is the joint-entropy Hk and the mutual information Ik, on all degrees k and for every k-tuple of variables (gene expression levels), defined respectively by the following equations: $H_k = H(X_1,\dots,X_k; P) = k \sum_{x_1,\dots,x_k \in [N_1 \times \dots \times N_k]} p(x_1,\dots,x_k) \ln p(x_1,\dots,x_k)$ and $I_k(X_1;\dots;X_k; P) = \sum_{i=1}^{k} (-1)^{i-1} \sum_{I \subset [k];\, \mathrm{card}(I)=i} H_i(X_I; P)$, for a probability joint-distribution PX1,…,Xk and joint-random variables (X1,…,Xk) with alphabet [N1…Nk] and k=-1/ln2 (in the entropy formula, the prefactor k denotes this constant, not the degree), where n variables are mutually independent if and only if ∀ k ≤ n, Ik=0. Due to the combinatorial complexity, in the current study Hk and Ik values were computed for n=21 (for n=21, the total number of information elements to estimate is 2 097 152).
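The element count quoted above follows directly from the binomial structure of the simplicial lattice, as this short Python check illustrates (a worked arithmetic example only; `math.comb` requires Python 3.8 or later):

```python
from math import comb

n = 21  # number of genes analyzed

# Number of k-tuples of genes at each degree k (k = 0 is the empty subset).
per_degree = {k: comb(n, k) for k in range(n + 1)}
total = sum(per_degree.values())

print(per_degree[2], max(per_degree.values()), total)
```

The total of 2 097 152 equals 2^21 and therefore includes the degree-0 element of the lattice; each additional variable doubles the number of elements, which is why the exhaustive approach is limited by computational tractability.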
The distributions of Ik and Hk for every degree k (corresponding to k-tuples of variables) were represented as Ik and Hk landscapes (Supplementary Figure 7). The landscapes are representations of the simplicial information structures where each element of the lattice is represented as a function of its corresponding value of entropy or mutual information, and quantify the variability-randomness and statistical dependencies at all degrees k, respectively, from 1 to n. Mean landscapes were calculated by averaging Ik and Hk for each degree k over the number of k-tuples. The mean information landscape quantifies the average behavior of the whole structure. The mean information landscape (or path) is given by: $\langle I_k \rangle = \frac{\sum_{T \subset [n];\, \mathrm{card}(T)=k} I_k(X_T; P)}{\binom{n}{k}}$
Probability estimation
The probability estimation procedure is explained in Supplementary Figure 5 for the simple case of two random variables (the expression levels of two genes). For each variable Xj, we consider the interval [min xj, max xj] and divide it into Nj boxes, Nj being the graining of the data. The empirical joint probability is estimated by box counting after a graining of the data space into N1…Nk boxes (for k-tuple probability estimation). In the current study, a graining of N1=…=Nk=8 was chosen as it provided a correct description of the distribution of the expression levels (see Supplementary Figure 8 for the influence of changing the graining on the identification of gene modules).
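The box-counting procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the toy expression values are assumptions, and so is the convention of assigning the maximum value to the last box.

```python
from collections import Counter


def discretize(values, n_bins=8):
    """Map each value to a box index in [0, n_bins - 1] over [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # Assumption: the maximum value falls into the last box.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def joint_probability(columns, n_bins=8):
    """Empirical joint probability by box counting over the binned columns."""
    binned = [discretize(column, n_bins) for column in columns]
    cells = Counter(zip(*binned))
    total = sum(cells.values())
    return {cell: count / total for cell, count in cells.items()}


# Toy expression levels of two genes across eight cells (made-up numbers).
gene_a = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0]
gene_b = [8.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0]

bins_a = discretize(gene_a)
probabilities = joint_probability([gene_a, gene_b])

print(bins_a)
print(probabilities)
```

The resulting dictionary only stores occupied boxes, which keeps the estimate tractable when the number of possible boxes (N to the power k) far exceeds the number of cells.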
Information paths
An information path IPk or HPk of degree k on an Ik or Hk landscape is defined as a sequence of elements of the lattice that begins at the least element of the lattice (the identity-constant “0”), travels along edges from element to element of increasing degree of the lattice and ends at the greatest element of the lattice of degree k. The first derivative of an IPk path is minus the conditional mutual information. The (“non-Shannonian”) information inequalities 19, e.g. the negativity of conditional mutual information that quantifies the instability of the mutual information along the path, are then equivalent to the existence of local minima on such paths (see Supplementary methods). The critical dimension of an IPk path is the degree of its first minimum. A positive information path is an information path from 0 to a given Ik corresponding to a given k-tuple of variables such that Ik<Ik-1<…<I1. We call the marginal component of a path I1 a self-information energy and the interacting component functions Ik, k > 1, a free information energy. A maximal positive information path is a positive information path of maximal length: it ends at a minimum of the free information energy function. In the current study, the length of maximal positive information paths was considered to indicate the size of a stable information module. The set of all these paths defines uniquely the minimum free information complex (see Supplementary methods). In simple terms, this complex is the homological formulation of the minimum energy principle with potentially many local and degenerate minima. The set of all paths of degree k is in one-to-one correspondence with the symmetric group Sk and hence computationally intractable (complexity in O(k!)).
In order to bypass this issue, we used a fast local algorithm that selects at each element of degree k of an IP path the positive information path with maximal or minimal Ik+1 value or stops whenever Xk.Ik+1 ≤ 0, and ranks those paths by their length.
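One possible reading of this local algorithm is sketched below in Python. It is an interpretation under stated assumptions, not the authors' program: at each step it keeps only the extensions whose conditional information (Ik minus Ik+1) is positive, picks the one with the largest (or smallest) Ik+1, and stops when no such extension exists. The function names, the tie-breaking rule, the numerical tolerance `eps` and the toy dataset are all hypothetical.

```python
from collections import Counter
from itertools import combinations, product
from math import log2


def entropy(samples, idx):
    """Empirical joint entropy (bits) of the columns listed in idx."""
    counts = Counter(tuple(row[i] for i in idx) for row in samples)
    total = len(samples)
    return -sum(c / total * log2(c / total) for c in counts.values())


def mutual_information(samples, idx):
    """I_k as the alternating sum of joint entropies over subsets of idx."""
    return sum(
        (-1) ** (size - 1) * entropy(samples, subset)
        for size in range(1, len(idx) + 1)
        for subset in combinations(idx, size)
    )


def greedy_path(samples, start, maximize=True, eps=1e-12):
    """Grow a positive information path from `start`, one variable at a time."""
    n_vars = len(samples[0])
    path = [start]
    values = [mutual_information(samples, path)]
    while len(path) < n_vars:
        # Keep only extensions with positive conditional information,
        # i.e. I_k - I_{k+1} > 0, so that I_{k+1} < I_k along the path.
        candidates = []
        for j in range(n_vars):
            if j in path:
                continue
            value = mutual_information(samples, path + [j])
            if values[-1] - value > eps:
                candidates.append((value, j))
        if not candidates:
            break  # first minimum reached: the module stops growing here
        if maximize:
            value, j = max(candidates, key=lambda c: (c[0], -c[1]))
        else:
            value, j = min(candidates, key=lambda c: (c[0], c[1]))
        path.append(j)
        values.append(value)
    return path, values


# Toy data: x0 carries two bits (a, b); x1 and x2 are both copies of bit a.
samples = [(2 * a + b, a, a) for a, b in product((0, 1), repeat=2)]

paths = [greedy_path(samples, start) for start in range(3)]
ranked = sorted(paths, key=lambda p: len(p[0]), reverse=True)
longest_path, longest_values = ranked[0]

print(longest_path, longest_values)
```

In this toy example the path starting at x0 accepts x1 (I2 < I1) and then stops, because adding the redundant copy x2 leaves the mutual information unchanged (zero conditional information). Each step costs a number of entropy evaluations that grows with the path length, but the search avoids enumerating all k! orderings.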
Robustness of the method
To estimate the degree after which the sample size m becomes limiting and biases our estimations, the undersampling regime was quantified by the degree ku beyond which a significant proportion (10%) of the Hk values get close to log2(m). Using these criteria, with log2(111)=6.79 and log2(37)=5.21, the ku obtained for DA neurons was 6 and 5 for nDA neurons, and Ik and Hk values beyond these degrees should be interpreted with caution (Supplementary methods). It must be noted however that this limit is calculated on the average Hk, whose value is mainly determined by non-relevant independent k-tuples. The biologically relevant statistical dependences correspond to extrema in the raw landscape (minimal Hk and maximal or minimal Ik) and therefore are less affected by this sampling problem. In order to evaluate the robustness of our results to sample size (m) and graining value (N), we calculated the maximal positive paths obtained for DA neurons for smaller samples (m = 28, 56, 84, taken fully arbitrarily among 111) and smaller (N=4, 6) or larger (N=10, 12) graining values (Supplementary Figure 8). The information paths of maximal length were found to be relatively robust to variations in N and m, even though, as expected, m=28 yielded significantly different paths. For most N and m combinations, the main genes identified in Figure 3c were also present in the maximal information paths, including in particular the DA metabolism/signaling genes and the two ion channel genes Kcnd3/Kv4.3 and Kcnn3/SK3. Concerning the statistical significance of the results, I2 functions are Kullback-Leibler divergences 37 and estimate the divergence from 2-independence. Their generalization to arbitrary degree k (Ik) can be interpreted as a statistical significance of a test, here against the null hypothesis of k-independence Ik=0. 
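The undersampling criterion described above can be illustrated with a short Python sketch. The ceiling values log2(m) are taken from the text; everything else is an assumption made for illustration: the function name, the reading of ku as the last degree before 10% of the Hk values saturate, the 5% tolerance used to decide that a value is "close to" log2(m), and the toy entropy values.

```python
from math import log2


def undersampling_degree(entropies_by_degree, m, proportion=0.10, tolerance=0.05):
    """Largest degree k whose H_k values are not yet saturated at log2(m).

    Assumption: a value counts as saturated when it lies within `tolerance`
    (relative) of the ceiling log2(m).
    """
    ceiling = log2(m)
    k_u = 0
    for k in sorted(entropies_by_degree):
        values = entropies_by_degree[k]
        saturated = sum(1 for h in values if h >= ceiling * (1 - tolerance))
        if saturated / len(values) >= proportion:
            break
        k_u = k
    return k_u


# Entropy ceilings for the two sample sizes reported in the text.
ceiling_da = round(log2(111), 2)
ceiling_nda = round(log2(37), 2)

# Toy joint-entropy values (bits) for m = 16 samples, ceiling log2(16) = 4.
toy_entropies = {1: [1.0, 1.2], 2: [2.1, 2.5], 3: [3.0, 3.9]}
toy_k_u = undersampling_degree(toy_entropies, m=16)

print(ceiling_da, ceiling_nda, toy_k_u)
```

In the toy example, half of the degree-3 entropies already lie within 5% of the 4-bit ceiling, so degrees above 2 would be flagged for cautious interpretation.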
Our analysis, based on the ranking of the Ik for every k, considered only the 5 maximal (positive) and 5 minimal (negative) values of Ik, which are the 5 most significantly dependent positive and negative Ik-sharing k-tuples (for k > 2).
Computation and algorithm
The Information Topology open source program, written in Python, is available in a GitHub repository. It computes the information landscapes, paths, and minimum free energy complex, which encode and represent directly all the usual equalities, inequalities, and functions of information theory (as justified at length in Supplementary methods), and all the structures of the statistical dependences within a given set of empirical measures (up to the approximations, computational tractability and finite size biases, see previous sections). It can be run on a regular personal computer up to k = n = 21 random variables in reasonable time (3 hours), and provides new tools for pattern detection, dimensionality reduction, ranking and clustering based on a unified homological and informational theory.
ACKNOWLEDGEMENTS
This work was funded by the French National Research Agency (ANR JCJC grant ROBUSTEX to J.M.G.; supporting S.T.), the European Research Council (ERC consolidator grant 616827 CanaloHmics to J.M.G.; supporting M.T.P., P.B. and M.L.), and the French Ministry of Research (doctoral fellowship to M.A.D.). We would like to thank Pr. E. Marder for helpful discussions on the manuscript.