Abstract
In largely non-mitotic tissues such as the brain, cells are prone to a gradual accumulation of stochastic genetic and epigenetic errors, which may lead to increased gene expression variation with time, both between cells and possibly also between individuals. Evolutionary theory also predicts increased genetic variation during aging, associated with the expression of slightly deleterious variants. Although increased inter-individual heterogeneity in gene expression during brain aging was previously reported, whether this process starts at the beginning of life or is mainly restricted to the aging period has not been studied. The regulation and functional significance of putative age-related heterogeneity are also unknown. Here we address these issues by a systematic analysis of 19 transcriptome datasets from diverse human brain regions covering the whole postnatal lifespan. Across all datasets, we observed a significantly higher increase in inter-individual gene expression heterogeneity during aging (20 to 98 years of age) than during postnatal development (0 to 20 years of age). Moreover, increased heterogeneity during aging was consistent among different brain regions and was associated with many biological processes and pathways that are important for aging and neural function, including longevity regulating pathway, autophagy, mTOR signaling pathway, axon guidance, and synapses. Overall, our results show that an increase in gene expression heterogeneity during aging is a general effect in human brain transcriptomes and may play a significant role in aging-related changes in brain function. We also provide the necessary functions to calculate heterogeneity change with age as an R package, ‘hetAge’.
Introduction
Aging is a complex process characterized by a gradual decline in maintenance and repair mechanisms, accompanied by an increase in genetic and epigenetic mutations, and oxidative damage to protein and lipids (Gorbunova, Seluanov, Mao, & Hine, 2007; Lu et al., 2004). The human brain experiences dramatic structural and functional changes in the course of aging. These include decline in gray matter and white matter volumes (Sowell, Thompson, & Toga, 2004), increase in axonal bouton dynamics (Grillo et al., 2013) and reduced synaptic plasticity, which may be associated with the decline in cognitive functions (Dorszewska, 2013). Changes during brain aging are suggested to be a result of stochastic processes, unlike changes associated with postnatal neural development which are known to be primarily controlled by regulatory processes (Polleux, Ince-Dunn, & Ghosh, 2007; Schratt, 2009; Stefani & Slack, 2008). The molecular mechanisms underlying age-related alteration of regulatory processes and eventually leading to aging-related phenotypes, however, are little understood.
Over the past decade, a number of transcriptome studies focusing on age-related changes in human brain gene expression profiles were published (Kang et al., 2011; Lu et al., 2004; Miller et al., 2014; Somel et al., 2010; Tebbenkamp, Willsey, State, & Šestan, 2014). These studies report aging-related differential expression patterns in many functions, including synaptic functions, energy metabolism, inflammation, stress response, and DNA repair. Analyzing age-related change in gene expression profiles in diverse brain regions, we previously showed that gene expression changes occur in the opposite direction during postnatal development (pre-20 years of age) and aging (post-20 years of age), which may be associated with aging-related phenotypes in healthy brain aging (Dönertaş et al., 2017). While different brain regions are associated with specific, and often independent, gene expression profiles (Kang et al., 2011; Miller et al., 2014; Tebbenkamp et al., 2014), these studies also show that age-related alteration of gene expression profiles during aging is a widespread effect across different brain regions.
One of the suggested effects of aging on gene expression is increased variability between individuals and somatic cells, which has been previously reported by several studies. Some of these studies find an increase in age-related heterogeneity in heart, lung and white blood cells of mice (Angelidis et al., 2019; Bahar et al., 2006; Martinez-Jimenez et al., 2017), C. elegans (Herndon et al., 2002), and human twins (Fraga et al., 2005). However, Viñuela et al. find more decreases than increases in heterogeneity in human twins (Viñuela et al., 2018) and Ximerakis et al. show that the direction of the heterogeneity change depends on cell type in the aging mouse brain (Ximerakis et al., 2018). Using GTEx data covering different brain regions (20 to 70 years of age), Brinkmeyer-Langford et al. identify a set of differentially variable genes between different age groups, but they do not observe increased heterogeneity in the old (Brinkmeyer-Langford, Guan, Ji, & Cai, 2016). A more recent study, performing single-cell RNA sequencing of human pancreatic cells, identifies an increase in transcriptional heterogeneity and somatic mutations with age (Enge et al., 2017). In an earlier study, we re-analysed microarray datasets from different tissues of humans and rats, and found that an increase in age-related heterogeneity of expression is a general effect in the transcriptome (Somel, Khaitovich, Bahn, Pääbo, & Lachmann, 2006). However, we found no significant consistency across datasets, nor any significant enrichment in functional gene groups. In another study, a meta-analysis suggested that differences across brain regions collected from the same individuals are higher in aging than in development, suggesting an increase in inter-individual variability (Dönertaş et al., 2017).
More recently we conducted a prefrontal cortex transcriptome analysis that revealed a weak increase in age-dependent heterogeneity at the gene, transcriptome and pathway level independent of the preprocessing methods (Kedlian, Donertas, & Thornton, 2019).
Although the age-related increase in heterogeneity has been suggested in previous studies, whether it is a time-dependent process that starts at the beginning of life, or whether it (and its functional consequences) is seen only after developmental processes are completed, has not been explored. In this study, we retrieved transcriptome data from independent microarray-based studies covering the whole lifespan from diverse brain regions and conducted a comprehensive analysis to identify the prevalence of age-related heterogeneity changes in human brain aging, compared with those observed during postnatal development. We confirmed that increased age-related heterogeneity is a consistent trend in the human brain transcriptome during aging but not during development, and is associated with aging-related biological functions.
Results
To investigate how heterogeneity in gene expression changes with age, we used 19 published microarray datasets from three independent studies. Datasets included 1,010 samples from 17 different brain regions of 298 individuals, ranging from 0 to 98 years in age (Table S1, Figure S1). In order to analyze the age-related change in gene expression heterogeneity during aging compared to the change in development, we divided datasets into two groups as development (0 to 20 years of age, n = 441) and aging (20 to 98 years of age, n=569). We used the age of 20 to separate pre-adulthood and adulthood based on commonly used age intervals in earlier studies (see Methods). For the analysis, we focused only on the genes for which we have a measurement across all datasets (n=11,137).
Age-related change in gene expression levels
Although the primary focus of this study is to explore how heterogeneity in gene expression changes with age, we first characterized the changes in gene expression level. In order to quantify age-related changes in gene expression, we used a linear model between gene expression levels and age (see Methods, Figure S2). We transformed the ages to the fourth root scale before fitting the model as it provides a relatively uniform distribution of ages across the lifespan, but we also confirmed that different age scales yield quantitatively similar results (see Figure S3). We measured the expression change of each gene in the aging and development periods separately and considered regression coefficients from the linear model (β values) as a measure of age-related expression change (Figure S4, Table S2).
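The regression step described above can be sketched as follows. This is a minimal illustration only, assuming an ordinary least-squares fit with the fourth root of age as the single predictor; the function name `fit_expression_change` and the toy values are hypothetical and are not taken from the hetAge package or from the Methods.

```python
def fit_expression_change(ages, expression):
    """Fit expression ~ age**(1/4) by ordinary least squares.

    Returns the slope (beta), the intercept, and the residuals.
    Assumption: age is the only predictor in the model.
    """
    x = [a ** 0.25 for a in ages]  # fourth-root age scale
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(expression) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, expression))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    residuals = [yi - (intercept + beta * xi) for xi, yi in zip(x, expression)]
    return beta, intercept, residuals


# Toy data (hypothetical): expression is exactly linear in age**(1/4).
ages = [20, 35, 50, 65, 80, 95]
expression = [2.0 + 1.5 * a ** 0.25 for a in ages]
beta, intercept, residuals = fit_expression_change(ages, expression)
```

In this toy case the recovered slope equals the simulated one up to floating-point error, and the residuals are what the later heterogeneity analysis would operate on.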
We first analyzed similarity in age-related expression changes across datasets by calculating pairwise Spearman’s correlation coefficients among the β values (Figure 1a). Both development (Median correlation coefficient = 0.56, permutation test p < 0.001, Figure S6a) and aging datasets (Median correlation coefficient = 0.43, permutation test p = 0.003, Figure S6b) showed moderate correlation with the datasets within the same period. Although the difference between the correlations within development and aging datasets was not significant (Permutation test p = 0.1, Figure S5a), weaker consistency during aging may reflect the stochastic nature of aging, causing increased heterogeneity between aging datasets. In addition, we observed a mostly negative correlation between aging and development (Median correlation coefficient = −0.04), consistent with our previous report of gene expression reversal, which involved the same datasets but a different measure of age-related expression change (Dönertaş et al., 2017).
The principal component analysis (PCA) of age-related expression changes (β) revealed distinct clusters of development and aging datasets (Figure 1b). Moreover, aging datasets were more dispersed than development datasets (median pairwise Euclidean distances between PC1 and PC2 were 77 for aging and 21 for development), which may again reflect stochasticity in gene expression change during aging and can indicate more heterogeneity among different brain regions or datasets during aging than in development.
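The dispersion measure quoted above (the median pairwise Euclidean distance between datasets in the PC1 and PC2 plane) can be computed as in the sketch below. The coordinates are made-up placeholders rather than the values behind Figure 1b, and the function name is hypothetical.

```python
from itertools import combinations
from math import dist
from statistics import median


def median_pairwise_distance(points):
    """Median Euclidean distance over all unordered pairs of points."""
    return median(dist(p, q) for p, q in combinations(points, 2))


# Hypothetical (PC1, PC2) coordinates for three datasets.
pc_coordinates = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
dispersion = median_pairwise_distance(pc_coordinates)
```

A larger value for one group of datasets than for another indicates that the first group is more spread out in the plane of the first two principal components.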
We next identified genes showing significant age-related expression change (FDR corrected p < 0.05), for development and aging datasets separately (Figure 1c). Development datasets showed more significant changes compared to aging (Permutation test p = 0.003, Figure S5c), which may again indicate higher expression variability among individuals during aging. Moreover, the direction of change in development was mostly positive (14 datasets with more positive and 5 with more negative), whereas in aging datasets, we observed more genes with a decrease in expression level (13 datasets with more genes decreasing expression and 5 with no significant change, and 1 with an equal number of positive and negative changes).
Age-related change in gene expression heterogeneity
In order to assess age-related change in heterogeneity, we used the unexplained variance (residuals) from the linear model we constructed to calculate the change in gene expression level. For each gene in each dataset separately, we calculated Spearman’s correlation coefficients (ρ) between the absolute value of residuals and age, irrespective of whether the gene shows a significant change in expression (see Methods, Figure S2). We considered ρ values as a measure of heterogeneity change, where positive values mean an increase in heterogeneity with age (Table S2). Moreover, we repeated this approach using loess regression instead of a linear model between expression level and age. We confirmed the correlations between the change in heterogeneity based on a linear model and loess regression were high (Figure S15) but preferred to continue with the results based on the linear model as loess regression was observed to be more sensitive to the changes in sample sizes and parameters.
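The heterogeneity measure described above can be illustrated with a short sketch: Spearman's ρ is computed as the Pearson correlation of tie-averaged ranks between age and the absolute residuals. The helper names and the toy residuals are hypothetical; this is not the hetAge implementation.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / (sxx * syy) ** 0.5


def heterogeneity_change(ages, residuals):
    """rho between age and |residual|; positive means more spread with age."""
    return spearman(ages, [abs(r) for r in residuals])


# Toy residuals (hypothetical): the spread grows, then shrinks, with age.
ages = [20, 30, 40, 50, 60, 70]
rho_up = heterogeneity_change(ages, [0.1, -0.2, 0.3, -0.4, 0.5, -0.6])
rho_down = heterogeneity_change(ages, [0.6, -0.5, 0.4, -0.3, 0.2, -0.1])
```

Because only the magnitude of each residual is used, the sign of the deviation from the fitted line does not matter; what is tested is whether deviations become larger at older ages.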
Then, we asked if datasets show similar changes in heterogeneity by calculating pairwise Spearman’s correlation across datasets (Figure 2a). Unlike the correlations among expression level changes, age-related change in expression heterogeneity did not show a higher consistency during development. In fact, although the difference is not significant (permutation test p = 0.2, Figure S5b), the median value of the correlation coefficients was higher in aging (Median correlation coefficient = 0.21, permutation test p = 0.24, Figure S6c), than in development (Median correlation coefficient = 0.11, permutation test p = 0.25, Figure S6d).
A principal component analysis (PCA) showed that heterogeneity change is also able to differentiate aging datasets from development (Figure 2b). Similar to the pairwise correlations (Figure 2a), aging datasets clustered more closely than development datasets (median pairwise Euclidean distances between PC1 and PC2 are 41 and 44 for aging and development, respectively). Both observations imply more similar changes in heterogeneity during aging.
Using the p-values from Spearman’s correlation between age and the absolute value of residuals for each gene, we then investigated the genes showing a significant change in heterogeneity during aging and development (FDR corrected p-value < 0.05). We found almost no significant change in heterogeneity during development, except for Colantuoni2011 dataset, for which we have high statistical power due to the large sample size. Aging datasets, on the other hand, showed more significant changes in heterogeneity (Permutation test p = 0.06, Figure S5d) and the majority of the genes with significant changes in heterogeneity tended to increase in heterogeneity (Figure 2c). However, the genes showing a significant change did not overlap across aging datasets (Figure S7). Since the significance of the changes is highly dependent on the sample size, instead of focusing on these changes, we utilized having multiple datasets and focused on shared trends across them, capturing weak but reproducible trends across multiple datasets.
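As an illustration of the multiple-testing step, the sketch below applies the Benjamini-Hochberg procedure to a list of p-values. That this particular FDR procedure was used is an assumption (the text only says "FDR corrected"), and the p-values are invented.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values from the heterogeneity test.
pvalues = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(pvalues)
significant = [q < 0.05 for q in adjusted]
```

With few samples per dataset, most adjusted p-values stay above the threshold, which is one reason a consistency-based approach across datasets can be more informative than per-dataset significance.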
Nevertheless, these analyses indicated relatively more consistent heterogeneity change among datasets in aging, compared to development, which may imply that heterogeneity change is one of the characteristics of aging (see Discussion).
Consistent increase in heterogeneity during aging
As our previous analyses suggested age-related changes in heterogeneity can differentiate development and aging and show more similarity in aging, we sought to characterize these changes. The method that we used in this study uses the consistent changes in heterogeneity across datasets, instead of considering significant ones within individual datasets.
We first examined profiles of age-related heterogeneity change in aging and development. 18/19 aging datasets showed more increase than decrease in heterogeneity with age (Median ρ > 0), while the median heterogeneity change in one dataset was zero (i.e. there is an equal number of genes with increase and decrease in heterogeneity). In development, on the other hand, only 5/19 datasets showed more increase in heterogeneity, while the remaining 14/19 datasets showed more decrease with age (Median ρ < 0) (Figure 3a). The age-related change in heterogeneity during aging was significantly higher than in development (permutation test p<0.001, Figure S5e). We also checked if there is a relationship between the changes in heterogeneity during development and aging (e.g. if those genes that decrease in heterogeneity during development tend to increase in heterogeneity during aging) but did not find any significant trend (Figure S16).
A potential explanation for why we see different patterns of heterogeneity change with age in development and aging could be the accompanying changes in the expression levels, as it is challenging to remove dependence between the mean and variance. To address this possibility, we first calculated Spearman’s correlation between the changes in heterogeneity (ρ values) and expression (β values), for each dataset. Overall, all datasets had values close to zero, suggesting the association is not strong. Surprisingly, we saw an opposing profile for development and aging; while the changes in heterogeneity and expression were positively correlated in development, they showed a negative correlation in aging (Figure 3b).
Having observed both a tendency to increase and a higher consistency in heterogeneity change during aging, we next asked if particular genes become more heterogeneous consistently across datasets and how the numbers compare with development. We first calculated, for each gene, the number of datasets with an increase in heterogeneity, for development and aging separately (Figure 3c). To calculate significance and expected consistency, while controlling for dataset dependence, we performed 1,000 random permutations of individuals’ ages and re-calculated the heterogeneity changes (see Methods). Importantly, we only created random permutations for the heterogeneity change but not the gene expression changes. In development, there was no significant consistency in heterogeneity change in either direction (increase or decrease). During aging, however, there was a significant shift toward heterogeneity increase, i.e. genes showed more than expected consistency toward heterogeneity increase across aging datasets (Figure 3c, lower panel). We identified 147 common genes with a significant increase in heterogeneity across all aging datasets (one-sided permutation test p < 0.001, Table S3). Based on our permutations, we estimated that 84/147 genes could be expected to have a consistent increase just by chance, suggesting almost 60% false positives. Nevertheless, comparing the consistency in aging and development, there was an apparent shift towards a consistent increase in aging, even if we cannot confidently report the genes that become significantly more heterogeneous with age across multiple datasets. Low statistical power due to the small number of independent datasets (i.e. three independent data sources) is likely to contribute to the high false positive rate.
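The logic of this consistency test can be sketched as below. It is a simplified illustration under stated assumptions: ages are shuffled independently within each dataset (the actual procedure in the Methods controls for dataset dependence, which this sketch does not), residuals are held fixed so that only the heterogeneity change is permuted, and all names and toy values are hypothetical.

```python
import random


def ranks(v):
    """Ranks starting at 1, ties averaged."""
    order = sorted(range(len(v)), key=v.__getitem__)
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r


def spearman(x, y):
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / (sxx * syy) ** 0.5 if sxx and syy else 0.0


def increase_counts(datasets):
    """Per gene, the number of datasets with rho(age, |residual|) > 0."""
    n_genes = len(datasets[0][1])
    return [sum(spearman(ages, res[g]) > 0 for ages, res in datasets)
            for g in range(n_genes)]


def consistency_test(datasets, n_perm=1000, seed=0):
    """Observed and expected number of genes increasing in every dataset."""
    rng = random.Random(seed)
    n_sets = len(datasets)
    observed = sum(c == n_sets for c in increase_counts(datasets))
    null = []
    for _ in range(n_perm):
        # Shuffle ages only; absolute residuals stay attached to samples.
        shuffled = [(rng.sample(ages, len(ages)), res) for ages, res in datasets]
        null.append(sum(c == n_sets for c in increase_counts(shuffled)))
    expected = sum(null) / n_perm
    p_value = (1 + sum(k >= observed for k in null)) / (1 + n_perm)
    return observed, expected, p_value


# Toy input (hypothetical): 3 datasets, 2 genes, |residuals| per sample.
ages = [20, 30, 40, 50, 60]
abs_res = [[0.1, 0.2, 0.3, 0.4, 0.5],   # gene 0: spread grows with age
           [0.5, 0.4, 0.3, 0.2, 0.1]]   # gene 1: spread shrinks with age
toy = [(ages, abs_res)] * 3
counts = increase_counts(toy)
observed, expected, p_value = consistency_test(toy, n_perm=200)
```

The ratio of `expected` to `observed` corresponds to the kind of false-positive estimate reported above (84 expected out of 147 observed).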
Heterogeneity Trajectories
We next asked if there are specific patterns of heterogeneity change, e.g. increase only after a certain age. We used the genes with a consistent increase in heterogeneity with age, during aging (n = 147) to explore the trajectories of heterogeneity change (Figure 4). Genes grouped with k-means clustering showed multiple patterns in heterogeneity increase (Table S3). Three patterns are observed: i) genes in clusters 3 and 7 show a noisy but steady increase throughout aging, ii) genes in clusters 4, 5 and 8 show an increase in early aging but a slight decrease after a certain age, revealing a reversal (up-down) pattern, and iii) the other genes increase in heterogeneity dramatically after the age of 60 (clusters 1, 2 and 6). Next, we asked if these genes have any consistent pattern in development (Figure S22). However, most of the clusters showed almost no age-related change. We also analyzed the accompanying changes in mean expression levels for these clusters. Except for cluster 1, which shows a decrease in expression level at around the age of 60 and then shows a dramatic increase, all clusters show a steady scaled mean expression level at around zero, i.e. different genes in a cluster show different patterns (Figure S17).
Because the incidence of Alzheimer’s Disease also increases after the age of 60, we further tested the genes showing a dramatic heterogeneity increase after 60 (clusters 1, 2 and 6) for association with the disease, but found no evidence for such an association (see Figure S8).
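To illustrate how trajectories can be grouped, the sketch below runs a plain k-means (Lloyd's algorithm) on a few toy trajectories. Everything here is an assumption for illustration: the deterministic initialisation from chosen rows, the absence of any scaling or smoothing, and the toy data, none of which come from the Methods.

```python
def kmeans(points, init_indices, max_iter=100):
    """Lloyd's algorithm with centroids initialised from the given rows."""
    centroids = [list(points[i]) for i in init_indices]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy heterogeneity trajectories (hypothetical), one row per gene,
# one column per age point: two steady increases and two up-down shapes.
trajectories = [
    [0.0, 1.0, 2.0, 3.0],
    [0.0, 1.1, 2.1, 3.2],
    [0.0, 2.0, 2.0, 1.0],
    [0.0, 2.1, 1.9, 0.9],
]
labels, centroids = kmeans(trajectories, init_indices=(0, 2))
```

The resulting centroids play the role of the cluster-level trajectories, one showing a steady increase and the other a reversal.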
Functional analysis
To examine the functional associations of heterogeneity changes with age, we performed gene set enrichment analysis using KEGG pathways (Kanehisa, Sato, Furumichi, Morishima, & Tanabe, 2019), Gene Ontology (GO) categories (Ashburner et al., 2000; The Gene Ontology Consortium, 2019), Disease Ontology (DO) (Kibbe et al., 2015), Reactome pathways (Fabregat et al., 2018), Transcription Factor (TF) Targets (TRANSFAC) (Matys et al., 2003), and miRNA targets (MiRTarBase) (Chou et al., 2016). In particular, we rank-ordered genes based on the number of datasets that show a consistent increase in heterogeneity and asked if the extremes of this distribution are associated with the gene sets that we analyzed. There was no significant enrichment for any of the functional categories and pathways for the consistent changes in development. The significantly enriched KEGG pathways for the genes that become consistently heterogeneous during aging included longevity regulating pathway, autophagy, mTOR signaling pathway and other pathways that have previously been suggested to be important for aging (Figure 5a). Among the pathways listed in Figure 5a, only protein digestion and absorption, primary immunodeficiency, linoleic acid metabolism, and fat digestion and absorption pathways had negative enrichment scores, meaning these pathways were significantly associated with the genes having the least number of datasets showing an increase. However, it is important to note that this does not mean these pathways have a decrease in heterogeneity, as the distribution of consistent heterogeneity is skewed (Figure 3c, lower panel). We also asked whether the KEGG pathways that we identified are particularly enriched in any of the heterogeneity trajectories we identified.
Although we lack the necessary power to test the associations statistically, we saw that i) group 1, which showed a stable increase in heterogeneity, is associated more with the metabolic pathways and mRNA surveillance pathway, ii) group 2, which showed first an increase and a slight decrease at later ages, is associated with axon guidance, mTOR signaling, and phospholipase D signaling pathways, and iii) group 3, which showed a dramatic increase after age of 60, is associated with autophagy, longevity regulating pathway and FoxO signaling pathways. The full list is available as Figure S9.
The distribution of consistent heterogeneity in development and aging also showed a clear difference. The pathway scheme for the longevity regulating pathway (Figure 5b), colored based on the number of datasets with a consistent increase, shows how particular genes compare between development and aging. The visualizations for all significant pathways, including the gene names, are given in the Supplementary Information. Although we focused on KEGG pathways here, other significantly enriched gene sets, including GO, Reactome, TF and miRNA sets, are included as Tables S4-11. In general, while the consistent changes in development did not show any enrichment (except for miRNAs, see Table S11), we detected a significant enrichment for the genes that become more heterogeneous with age during aging, with the exception that disease ontology terms were not significantly associated with the consistent changes in either development or aging. The gene sets included specific categories such as autophagy and synaptic functions as well as broad functional categories such as regulation of transcription and translation processes, cytoskeleton or histone modifications. We also performed GSEA for each dataset separately and confirmed that these pathways show consistent patterns in aging (Figure S24-S28). There were 30 significantly enriched Transcription Factors, including EGR and FOXO, and 99 miRNAs (see Table S9-10 for the full list). We also asked if the genes that become more heterogeneous consistently across datasets are known aging-related genes, using the GenAge Human gene set (Tacutu et al., 2018), but did not find a significant association (Figure S10).
Apart from having specific regulators that affect the heterogeneity, we also asked if the total number of transcription factors or miRNAs regulating a gene might be related to the heterogeneity (Figure 6). We calculated the correlations between the total number of regulators and the heterogeneity changes while controlling for the expression changes in development and aging. Genes that show a decrease in expression first and an increase during aging (down-up) did not show any significant association between the change in heterogeneity and the number of regulators. Genes that show a decrease in expression during aging, irrespective of their expression during development (down-down and up-down), showed a higher correlation between the change in heterogeneity and the number of regulators in aging, and this correlation was mostly positive in aging datasets, meaning that genes with a higher number of regulators become more heterogeneous with age. Genes that showed an increase in expression throughout the lifespan (up-up) also had a higher correlation between the heterogeneity and the number of miRNAs in aging, but this trend was not observed for transcription factors.
We further tested if genes with a consistent heterogeneity increase in aging are more central in the protein interaction network using STRING database (von Mering, 2004). Using multiple cutoffs and repeating the analysis, we observed a higher degree for the genes with increasing heterogeneity (Figure S20).
Johnson & Dong et al. previously compiled a list of traits that are age-related and have been sufficiently tested for genome-wide associations (Johnson, Dong, Vijg, & Suh, 2015). Using the genetic associations for those traits in GWAS Catalog, we tested if there are significantly enriched traits for the consistent changes in heterogeneity during aging (Table S12). Although there was no significant enrichment, all these age-related terms had positive enrichment scores, i.e. they all tended to include genes that consistently become more heterogeneous with age during aging.
Using cell-type specific transcriptome data generated from FACS-sorted cells in mouse brain (Cahoy et al., 2008), we also analyzed if there is an association between genes that become heterogeneous with age and cell-type specific genes. Although there was an overlap with oligodendrocytes and myelinated oligodendrocytes, there was no significant enrichment (which could be attributed to low power due to small overlap between aging and cell-type specific expression datasets) (Figure S23).
Discussion
Aging is characterized by a gradual decrease in the ability to maintain homeostatic processes, which leads to functional decline, age-related diseases, and eventually to death. This age-related deterioration, however, is thought to result not from expression changes in a few individual genes, but rather from an age-related alteration of the whole genome, which could be a result of an accumulation of both epigenetic and genetic errors in a stochastic manner (Enge et al., 2017; Vijg, 2004). This stochastic nature of aging hinders the identification of age-related change patterns in gene expression from a single dataset with a limited number of samples.
In this study, we examined 19 gene expression datasets compiled from three independent studies to identify the changes in gene expression heterogeneity with age. While all datasets have samples representing the whole lifespan, we used age of 20 years to separate postnatal development (0 to 20 years of age) and aging (20 to 98 years of age), as 20 years of age is considered to be a turning-point in gene expression trajectories. We implemented a regression-based method and identified genes showing a consistent change in heterogeneity with age, during development and aging separately. As we did not observe a substantial significant age-related heterogeneity change in most of the datasets, which could be due to lack of power due to the small sample sizes, we took advantage of a meta-analysis approach and focused on consistent signals among datasets, irrespective of their effect sizes and significance. Although this approach will fail to capture patterns that are specific to individual brain regions, it includes genes that fail to pass the significance threshold due to insufficient power. Furthermore, we expected our method to be robust to noise and confounding effects within individual datasets.
Increase in gene expression heterogeneity during aging
Analyzing age-related gene expression changes, we first observed that there are more significant and more similar changes during development than in aging. Additionally, genes showing significant change during aging tended to decrease in expression (Figure 1). These results can be explained by the accumulation of stochastic detrimental effects during aging, leading to a decrease in expression levels (Lu et al., 2004). Our initial analysis of gene expression changes suggested a higher heterogeneity between aging datasets.
We next focused on age-related heterogeneity change between individuals and found a significant increase in age-related heterogeneity during aging, compared to development. Notably, increased heterogeneity is not limited to individual brain regions, but is a consistent pattern across different regions during aging. We found that age-related heterogeneity change is more consistent among aging datasets, which may reflect an underlying systemic mechanism. Further, more genes showed significant heterogeneity changes during aging than in development, and the majority of these genes tended to have more heterogeneous expression.
It was previously proposed that somatic mutation accumulations (Lodato et al., 2018; Lombard et al., 2005; Lu et al., 2004; Vijg, 2004) and epigenetic regulations (Cheung et al., 2018) might be associated with transcriptome instability. While Enge et al. and Lodato et al. suggested that genome-wide substitutions in single cells are not so common as to influence genome stability and cause transcriptional heterogeneity at the cellular level (Enge et al., 2017; Lodato et al., 2015), epigenetic mechanisms may be relevant. Although we cannot test age-related somatic mutation accumulation and epigenetic regulation in this study, an alternative mechanism might be related to transcriptional regulation, which is considered to be inherently stochastic (Maheshri & O’Shea, 2007). Several studies demonstrated that variation in gene expression is positively correlated with the number of TFs controlling a gene’s regulation (Gustavo Valadares Barroso, Natasa Puzovic, 2018). We also found that genes with a higher number of regulators and a decrease in expression during aging become more heterogeneous, and the association is higher in aging. Further, significantly enriched TFs include early growth response (EGR), which is known to regulate the expression of many genes involved in synaptic homeostasis and plasticity; and FOXO TFs, which regulate stress resistance, metabolism, cell cycle arrest and apoptosis. Together with these studies, our results support that transcriptional regulation may be associated with age-related heterogeneity increase during aging and may have important functional consequences in brain aging.
Increased heterogeneity is not a result of technical or statistical artifacts
We next confirmed that the observed increase in heterogeneity was not a result of low statistical power (Figure S1) or a technical artifact (Figure 3b, S11, S18). Specifically, we tested whether increased heterogeneity during aging can be a result of the mean-variance relationship, but we found no significant effect that can confound our results. In fact, the mean-variance relationship in development and aging showed opposing profiles. We further analyzed this by grouping genes based on their expression in development and aging (Figure S11). The genes that decrease in expression both in development and aging showed the most opposing profiles in terms of the mean-variance relationship, which could suggest that the decrease in development is more coordinated and well-regulated whereas the decrease in aging occurs due to stochastic errors. Another potential confounder is the post-mortem interval (PMI), which is the time between death and sample collection. Since we do not have this data for all datasets we analyzed, we could not account for this in our model. However, using the list of genes previously suggested as associated with PMI (Zhu, Wang, Yin, & Yang, 2017), we checked if the consistency among aging datasets could be driven by PMI. Only 2 PMI-associated genes were among the 147 that become consistently heterogeneous, and the distribution also suggested there is no significant relationship (Figure S18). We also confirmed that the increase in heterogeneity is not caused by outliers in datasets (Figure S19).
Microarrays do not bias against identifying age-related heterogeneity change
One important limitation of our study is that we analyzed microarray-based data. Since gene expression levels measured by microarray do not reflect the absolute abundance of mRNAs, but rather relative expression levels, we were only able to examine relative changes in gene expression. A recent study analyzing single-cell RNA sequencing data from the aging Drosophila brain identified an age-related decline in total mRNA abundance (Davie et al., 2018). It has also been suggested that, in microarray studies, genes with lower expression levels tend to have higher variance (Aris et al., 2004). In this context, whether the change in heterogeneity is a result of total mRNA decay is an important question. To see whether the age-related increase in heterogeneity depends on the technology used to generate the data, we repeated the initial analysis using RNA-seq data for the human brain generated by the GTEx Consortium (“The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans,” 2015) (Figure S12-14). Nine out of thirteen datasets confirmed an overall increase in heterogeneity at the transcriptome level; the four datasets that did not were from the BA24, cerebellar hemisphere, cerebellum and substantia nigra regions. The changes in expression and heterogeneity, on the other hand, were positively correlated, and the correlation was much higher in magnitude. Unfortunately, expression level and variation are challenging to disentangle in RNA-seq data. Thus, the biological relevance of the relationship between the age-related changes in expression and heterogeneity remains to be understood through comprehensive experimental and computational approaches. Nevertheless, the RNA-seq analysis also suggests an overall age-related increase in heterogeneity.
Another limitation is the use of bulk RNA expression datasets, where each value is an average for the tissue. While it is important to note that our results indicate increased heterogeneity between individuals rather than between cells, the fact that the brain is composed of different cell types raises the question of whether increased heterogeneity may be a result of changes in brain cell-type proportions. To explore the association between heterogeneity and cell-type specific genes, we used a FACS-sorted cell-type specific transcriptome dataset from mouse brain (Cahoy et al., 2008). Only nine genes both showed a consistent heterogeneity increase and were specific to one cell type. Eight of these nine were highly expressed in oligodendrocytes, which is consistent with the results reported in Kedlian et al. 2019. However, we did not observe any significant association between cell-type specific genes and heterogeneity (Figure S23).
Biological processes are associated with increased heterogeneity
Gene set enrichment analysis of the genes with increased heterogeneity with age revealed a set of significantly enriched pathways that are known to modulate aging, including the longevity regulating pathway, autophagy, and the mTOR signaling pathway (Figure 5a). Furthermore, GO terms shared among these genes include some previously identified common pathways in aging and age-related diseases (Figure S25-27). We also tested whether these genes are associated with age-related diseases through GWAS and, although the result was not significant, we found a positive association with all age-related traits defined in Johnson & Dong et al. 2015. Overall, these results indicate an effect of heterogeneity on pathways that modulate aging and may reflect the significance of increased heterogeneity in aging. Importantly, we identified genes that are enriched in terms related to neural and synaptic functions, such as axon guidance, neuron to neuron synapse, and postsynaptic specialization. This may reflect a role of increased heterogeneity in the synaptic dysfunction observed in the mammalian brain, which is considered to be a major factor in age-related cognitive decline (Morrison & Baxter, 2012). We also observed that genes that become consistently more heterogeneous with age across datasets are more central (i.e. have a higher number of interactions) in a protein-protein interaction network (Figure S20). Although this could mean that the effect of heterogeneity is even more critical because it affects hub genes, another explanation is, again, research bias: these genes may simply be studied more than others.
In summary, by performing a meta-analysis of transcriptome data from diverse brain regions, we found a significant increase in gene expression heterogeneity during aging compared to development. Increased heterogeneity was a consistent pattern among diverse brain regions in aging, while no significant consistency was observed across development datasets. Our results support the view of aging as a result of stochastic dynamics, whereas development is regulated. We also reported that genes showing a consistent increase in heterogeneity during aging are involved in pathways that are important for aging and neural function. Therefore, our results suggest that the increase in heterogeneity is one of the characteristics of brain aging and is unlikely to be driven only by the passage of time, as we observe different trends during development.
Methods
Dataset collection
Microarray datasets
Raw data used in this study were retrieved from the NCBI Gene Expression Omnibus (GEO) from three different sources (Table S1). All three sources consist of human brain gene expression data generated through microarray experiments. In total, we obtained 1017 samples from 298 individuals, spanning the whole lifespan with ages ranging from 0 to 98 years (Figure S1).
RNA-seq dataset
We used the transcriptome data generated by the GTEx consortium (v6p) (“The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans,” 2015). We only used samples with a death circumstance of 1 (violent and fast deaths due to an accident) or 2 (fast death of natural causes) on the Hardy Scale, so that we did not include any samples with illnesses. As we focus only on the brain, we used all 13 brain tissues. As a result, we analyzed 623 samples from 99 individuals.
Separating datasets as aging and development datasets
To differentiate changes in gene expression heterogeneity during aging from those during development, we used the age of 20 to separate pre-adulthood from adulthood. It was shown that the age of 20 corresponds to the first age of reproduction in human societies (Walker et al., 2006). Structural changes in the human brain after the age of 20 were previously linked to age-related phenotypes, specifically neuronal shrinkage and a decline in the total length of myelinated fibers (Sowell et al., 2004). Earlier studies examining age-related gene expression changes in different brain regions also showed a global change in gene expression patterns after the age of 20 (Colantuoni et al., 2011; Dönertaş et al., 2017; Somel et al., 2010). Thus, consistent with these studies, we used the age of 20 to separate the datasets into development (0 to 20 years of age, n = 441) and aging (20 to 98 years of age, n = 569).
Preprocessing
Microarray datasets
RMA correction (using the ‘oligo’ library in R) and log2 transformation were applied to the Somel2011 and Kang2011 datasets. For the Colantuoni2011 dataset, we used the preprocessed data deposited in GEO, which had been loess normalized, as there was no public R package to analyze the raw data. We quantile normalized all datasets using the ‘preprocessCore’ library in R. Since our analysis focused on consistent patterns across datasets, we minimized the effects of confounding factors through quantile normalization, and we considered consistent results as a potential biological signal. We also applied an additional correction procedure to the Somel2011 datasets, in which a batch effect influenced the expression levels, as follows: for each probeset, (1) calculate the mean expression (M), (2) scale each batch separately, (3) add M to each value. We excluded the outliers given in Table S1, which were identified through visual inspection of the first two principal components of the probeset expression levels. We mapped probeset IDs to Ensembl gene IDs 1) using the Ensembl database, through the ‘biomaRt’ library in R, for the Somel2011 dataset; 2) using the GPL file deposited in GEO for Kang2011, as probeset IDs were not complete in Ensembl; and 3) using the Entrez gene IDs in the GPL file deposited in GEO for the Colantuoni2011 dataset and converting them to Ensembl gene IDs using the Ensembl database, through the “biomaRt” library in R. Lastly, we scaled the expression levels of genes with the ‘scale’ function in R. Age values in each dataset were converted to the fourth root of age (in days) to ensure that the relationship between age and expression is linear.
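The three-step batch correction and the age transformation described above can be sketched for a single probeset as follows. This is a minimal illustration in Python using only the standard library, not the authors' R code. The function names are hypothetical, and "scale" is assumed to mean centering each batch to zero mean and dividing by its sample standard deviation, which is the default behavior of R's `scale`.

```python
from statistics import mean, stdev


def correct_batch_effect(values, batches):
    """Batch-correct one probeset (hypothetical helper).

    values  : expression levels, one per sample
    batches : batch label of each sample, same order as values
    """
    overall_mean = mean(values)  # step 1: mean expression M over all samples
    corrected = [0.0] * len(values)
    for batch in set(batches):
        idx = [i for i, b in enumerate(batches) if b == batch]
        batch_values = [values[i] for i in idx]
        batch_mean = mean(batch_values)
        # Assumption: a batch with zero spread is only centered, not divided.
        batch_sd = stdev(batch_values) or 1.0
        for i in idx:
            # step 2: scale within the batch; step 3: add M back
            corrected[i] = (values[i] - batch_mean) / batch_sd + overall_mean
    return corrected


def fourth_root_age(age_in_days):
    """Transform age (in days) to the fourth-root scale."""
    return age_in_days ** 0.25


if __name__ == "__main__":
    expr = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
    batch = ["a", "a", "a", "b", "b", "b"]
    print(correct_batch_effect(expr, batch))
    print(fourth_root_age(365 * 20))
```

After correction, both batches share the same mean (M) and unit spread, so a shift between batches no longer contributes to the variation of the probeset.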
RNA-Seq dataset
Genes with a median RPKM value of 0 were excluded from the data. The RPKM values provided in the GTEx data were log2 transformed and quantile normalized. Similar to the microarray data, we excluded outliers based on visual inspection of the first and second principal components (Table S1). As ages are given as intervals in GTEx, we used the mean of each interval in our analysis.
Age-related expression change
We used linear regression to assess the relationship between age and gene expression. The model used in the analysis is:
$$Y_i = \beta_{i0} + \beta_{i1}\,\mathrm{Age}^{1/4} + \varepsilon_i \qquad (1)$$
where $Y_i$ is the scaled log2 expression level for the $i$th gene, $\mathrm{Age}^{1/4}$ is the fourth root of age (in days), $\beta_{i0}$ is the intercept, $\beta_{i1}$ is the slope, and $\varepsilon_i$ is the residual. We performed the analysis for each dataset and considered the slope $\beta_{i1}$ as a measure of change in expression. P-values obtained from the model were corrected for multiple testing according to the Benjamini & Hochberg procedure using the ‘p.adjust’ function in R.
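As an illustration of the regression step, the sketch below fits the linear model for one gene by ordinary least squares and applies a Benjamini-Hochberg adjustment to a list of p-values. It is written in Python with the standard library only and is not the authors' R implementation. The function names are hypothetical, and the p-values are taken as given, because computing them from the fit requires a t-distribution that the standard library does not provide.

```python
def fit_age_model(expression, ages_in_days):
    """Least-squares fit of expression on the fourth root of age.

    Returns (intercept, slope, residuals) for one gene.
    """
    x = [a ** 0.25 for a in ages_in_days]
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(expression) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, expression))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, expression)]
    return intercept, slope, residuals


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Walk from the largest p-value down, keeping a running minimum.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # rank of this p-value in ascending order
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    ages = [1, 16, 81, 256]        # fourth roots are 1, 2, 3, 4
    expr = [3.0, 5.0, 7.0, 9.0]    # exactly 2 * age^(1/4) + 1
    print(fit_age_model(expr, ages))
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

The residuals returned by the fit are the quantity that the heterogeneity analysis builds on.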
Age-related heterogeneity change
To quantify the age-related change in gene expression heterogeneity, we calculated Spearman’s correlation coefficient (ρ) between the absolute values of the residuals obtained from equation (1) and the fourth root of age. We regarded the absolute values of the residuals as a measure of heterogeneity. Therefore, a high positive correlation coefficient suggests that heterogeneity increases with age, whereas a strong negative correlation implies that heterogeneity decreases with age. P-values were calculated from the correlation analysis and corrected for multiple testing with the Benjamini & Hochberg procedure using the ‘p.adjust’ function in R. To compare heterogeneity changes in aging and development, we employed a paired Wilcoxon test in which we compared the median heterogeneity changes in aging and development dataset pairs.
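The heterogeneity measure can be sketched as follows: given the residuals of equation (1) for one gene, take their absolute values and correlate them with the fourth root of age using Spearman's ρ. This is a standard-library Python illustration rather than the authors' R code. The function names are hypothetical, and ties are assumed to receive average ranks, which is how Spearman's ρ is conventionally computed.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def heterogeneity_change(residuals, ages_in_days):
    """Rho between |residuals| and age^(1/4); positive means increasing heterogeneity."""
    abs_residuals = [abs(r) for r in residuals]
    age_scale = [a ** 0.25 for a in ages_in_days]
    return spearman(abs_residuals, age_scale)


if __name__ == "__main__":
    res = [0.1, -0.2, 0.3, -0.4, 0.5]
    ages = [100, 200, 300, 400, 500]
    print(heterogeneity_change(res, ages))
```

Because the correlation is rank-based, the choice of age scale does not change ρ itself; the fourth root matters for the regression that produces the residuals.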
Principal Component Analysis
We conducted principal component analysis on the age-related changes in both expression (β) and heterogeneity (ρ). We followed a similar procedure for both analyses, using the ‘prcomp’ function in R. Each analysis was performed on a matrix containing β values (for the change in expression level) or ρ values (for the change in heterogeneity) for the 11,137 genes commonly expressed in all 38 development and aging datasets. The change in expression (β) or heterogeneity (ρ) values were scaled for each dataset before calculating the principal components. The first two principal components, which explain the largest share of the variance, were used to examine the patterns of aging and development datasets.
Permutation test
We performed a permutation test, taking non-independence of Somel2011 and Kang2011 datasets into account. These datasets include multiple samples from the same individuals for different brain regions. We first randomly permuted ages among individuals, not samples, for 1,000 times in each data source, using ‘sample’ function in R. Next, we assigned ages of individuals to corresponding samples and calculated age-related expression and heterogeneity change for each dataset, corresponding to brain regions. For the tests related to the changes in gene expression with age, we used a linear model between gene expression levels and the randomized ages. However, for the tests related to the changes in heterogeneity with age, we measured the correlation between the randomized ages and the absolute value of residuals from the linear model that is between expression levels and non-randomized ages for each gene. In this way, we preserved the relationship between age and expression, and we were able to ensure that our regression model was viable for calculating age-related heterogeneity change. Using expression and heterogeneity change values calculated using permuted ages, we tested (a) if the correlation of expression (and heterogeneity) change in aging and development datasets differ significantly; (b) if the correlation of expression (and heterogeneity) change in development and aging datasets is significant; (c) if the number of genes showing significant change in expression (and heterogeneity) differ significantly between development and aging datasets; (d) if the overall increase in age-related heterogeneity during aging is significantly higher than development; (e) if the observed consistency in heterogeneity increase is significantly different from expected. We also demonstrate that our permutation strategy is more stringent in Figure S21, giving the distributions calculated using both dependent permutations and random permutations.
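The key point of this permutation scheme is that ages are shuffled among individuals and only then propagated to samples, so that all samples from one individual keep a common age. A minimal sketch of that step is given below in standard-library Python; it is not the authors' R code, and the function and argument names are hypothetical.

```python
import random


def permute_ages_by_individual(sample_individuals, individual_ages, rng):
    """Shuffle ages across individuals and map them back to samples.

    sample_individuals : individual ID of each sample
    individual_ages    : dict mapping individual ID to age
    rng                : a random.Random instance (seeded for reproducibility)
    """
    individuals = sorted(individual_ages)
    shuffled = [individual_ages[ind] for ind in individuals]
    rng.shuffle(shuffled)
    permuted_age = dict(zip(individuals, shuffled))
    # Every sample inherits the permuted age of its individual.
    return [permuted_age[ind] for ind in sample_individuals]


if __name__ == "__main__":
    samples = ["A", "A", "B", "C", "C", "C"]   # e.g. several brain regions per donor
    ages = {"A": 30, "B": 55, "C": 80}
    rng = random.Random(1)
    for _ in range(3):
        print(permute_ages_by_individual(samples, ages, rng))
```

For the heterogeneity tests described above, the permuted ages would be correlated with the absolute residuals of the model fitted on the real ages, so the residuals themselves are not recomputed per permutation.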
To test the overall correlation within development or aging datasets for the changes in expression (β) and heterogeneity (ρ), we calculated median correlations among three independent subsets of datasets (one Kang2011, one Somel2011 and the Colantuoni2011 dataset), taking the median value calculated for each possible combination of independent subsets (16 × 2 × 1 = 32 combinations). Using 1,000 permutations of individuals’ ages, we generated an expected distribution for the median correlation coefficient of the triplets and compared it with the observed value. We used this approach when testing the concordance in correlations because dependent pairwise comparisons outnumber independent pairwise comparisons, which causes low statistical power.
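The enumeration of independent triplets can be sketched as below. This is a Python illustration under one reading of the text, stated here as an assumption: for each triplet, the three pairwise correlations are summarized by their median, and the final statistic is the median over all 32 triplets. The function name and the dataset labels are hypothetical, and the correlation itself is supplied by the caller.

```python
from itertools import combinations, product
from statistics import median


def median_triplet_correlation(kang, somel, colantuoni, corr):
    """Median correlation over all independent dataset triplets.

    kang, somel, colantuoni : lists of dataset names from each source
    corr                    : function (name_a, name_b) -> correlation coefficient
    Returns (median over triplets, number of triplets).
    """
    triplet_medians = []
    for triplet in product(kang, somel, colantuoni):
        # Three pairwise comparisons within one triplet of independent datasets.
        pairwise = [corr(a, b) for a, b in combinations(triplet, 2)]
        triplet_medians.append(median(pairwise))
    return median(triplet_medians), len(triplet_medians)


if __name__ == "__main__":
    kang = [f"kang_{i}" for i in range(16)]
    somel = ["somel_1", "somel_2"]
    colantuoni = ["colantuoni"]
    # Placeholder correlation; in practice this would compare beta or rho values
    # of the shared genes between two datasets.
    print(median_triplet_correlation(kang, somel, colantuoni, lambda a, b: 0.5))
```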
To further test the significance of the difference between correlations among development and aging datasets, we calculated the median difference in correlations between aging and development datasets for each permutation. We next constructed the distribution for 1,000 median differences and calculated p-value using the observed difference. Next, to test the significance of the difference in the number of significantly changing genes between development and aging, we calculated the difference in the number of genes showing significant change between development and aging datasets for each permutation. Empirical p-values were computed according to observed differences. Likewise, to test if the overall increase in age-related heterogeneity during aging is significant compared to development, we computed median differences between median heterogeneity change values of each aging and development dataset, for each permutation, following an empirical p-value calculation.
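The comparison of an observed statistic with its permutation distribution can be sketched as follows, in standard-library Python. The function names are hypothetical. The empirical p-value is assumed here to be the plain fraction of permuted values at least as large as the observed one; the text does not specify whether a pseudocount was used.

```python
from statistics import median


def median_difference(aging_values, development_values):
    """Median of paired differences (aging minus development), one pair per dataset."""
    return median(a - d for a, d in zip(aging_values, development_values))


def empirical_p_value(observed, permuted):
    """One-sided empirical p-value: share of permuted statistics >= observed."""
    return sum(1 for value in permuted if value >= observed) / len(permuted)


if __name__ == "__main__":
    observed = median_difference([0.3, 0.5, 0.4], [0.1, 0.1, 0.1])
    # Toy permutation distribution; in the analysis each entry would come from
    # one permutation of individuals' ages.
    permuted = [-0.2, -0.1, 0.0, 0.1, 0.35]
    print(observed, empirical_p_value(observed, permuted))
```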
Expected heterogeneity consistency
Expected consistency in heterogeneity change was calculated from the heterogeneity change values (ρ) measured using permuted ages. For each permutation, we first calculated the total number of genes showing a consistent heterogeneity increase in N datasets (N = 0, …, 19). To test whether the observed consistency significantly differed from the expected, we compared the observed consistency values to the distribution of expected numbers by performing a one-sided test for the consistency in N datasets, N = 1, …, 19.
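The consistency count can be sketched as below, in standard-library Python. The function name is hypothetical. Two assumptions are made for illustration: an "increase" means ρ > 0 in a dataset, and the count for N refers to genes that increase in exactly N datasets rather than in at least N.

```python
from collections import Counter


def consistency_counts(rho_by_gene, n_datasets):
    """Number of genes with rho > 0 in exactly N datasets, for N = 0..n_datasets.

    rho_by_gene : dict mapping gene ID to a list of rho values, one per dataset
    """
    tally = Counter(
        sum(1 for rho in rhos if rho > 0) for rhos in rho_by_gene.values()
    )
    return [tally.get(n, 0) for n in range(n_datasets + 1)]


if __name__ == "__main__":
    rho = {
        "g1": [0.1, 0.2, 0.3],
        "g2": [-0.1, 0.2, 0.3],
        "g3": [-0.1, -0.2, -0.3],
        "g4": [0.5, 0.1, 0.2],
    }
    print(consistency_counts(rho, 3))
```

Running the same function on ρ values from each age permutation gives, for every N, a distribution of expected counts against which the observed count can be compared.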
Clustering
We used the k-means algorithm (‘kmeans’ function in R) to cluster genes according to their heterogeneity profiles. We first subset the heterogeneity levels (absolute values of the residuals from equation (1)) to include only the genes that show a consistent increase with age, and then scaled the heterogeneity levels so that each gene has a mean heterogeneity level of zero and a standard deviation of 1. Since the number of samples differs between datasets, simply running k-means on the combined dataset would not represent all datasets equally. Thus, we first calculated spline curves for the scaled heterogeneity levels of each gene in each dataset (using the ‘smooth.spline’ function in R, with three degrees of freedom). We interpolated at 11 (the smallest sample size) equally spaced age points within each dataset. We then used the combined interpolated values to run the k-means algorithm with k = 8. To test the association of the clusters with Alzheimer’s Disease, we retrieved overall AD association scores of the 147 consistent genes (n = 40) from the Open Targets Platform (Carvalho-Silva et al., 2019).
Functional Analysis
We used “clusterProfiler” package in R to run Gene Set Enrichment Analysis (GSEA), using Gene Ontology (GO) Biological Process (BP), GO Molecular Function (MF), GO Cellular Component (CC), Reactome, Disease Ontology (DO), and KEGG Pathways. We performed GSEA on all gene sets with a size between 5 and 500, and we corrected the resulting p-values with the Benjamini & Hochberg procedure. We used the number of datasets with a consistent increase to run GSEA so that we can test if the genes with a consistent increase or decrease in their expression are associated with specific functions. Since we ran GSEA using the number of datasets showing consistency, our data include many ties, potentially making the ranking process difficult and non-robust. To assess how robust our results are, we ran GSEA 1,000 times on the same data and counted how many times we observed the same set of KEGG pathways as significant (Table S4). The lowest number among the pathways with a significant positive enrichment score was 962 out of 1,000 (Phospholipase D signaling pathway). Moreover, we repeated the same analysis using the heterogeneity change levels (Spearman’s ρ between the absolute value of residuals and age) for each dataset to confirm that the gene sets are indeed associated with the increase or decrease in heterogeneity (Figure S24-S28). We visualized the KEGG pathways using the ‘KEGGgraph’ library in R and colored the genes by the number of datasets that show an increase.
We also performed an enrichment analysis of the transcription factors and miRNA to test if specific TFs or miRNAs regulate the genes that become more heterogeneous consistently. We collected gene-regulator association information using Harmonizome database (Rouillard et al., 2016), “MiRTarBase microRNA Targets” (12086 genes, 596 miRNAs) and “TRANSFAC Curated Transcription Factor Targets” (13216 genes, 201 TFs) sets. We used ‘fgsea’ package in R, which allows GSEA on a custom gene set. We tested the association for each regulator with at least 10 and at most 500 targets. Moreover, we tested if the number of regulators is associated with the change in heterogeneity. We first calculated the correlation between the heterogeneity change with age (or the number of datasets with an increase in heterogeneity) and the number of TFs or miRNAs regulating that gene, for aging and development separately and accounting for the direction of expression changes in these periods (i.e. separating genes into down-down, down-up, up-down, and up-up categories based on their expression in development and aging). To test the difference in the correlations between aging and development, we used 1,000 random permutations of the number of TFs. For each permutation, we randomized the number of TFs and calculated the correlation between heterogeneity change (or the number of datasets with an increase in heterogeneity) and the randomized numbers. We then calculate the percentage of datasets where aging has a higher correlation than development. Using the distribution of percentages, we test if the observed value is expected by chance.
Protein-protein interaction network analysis
We downloaded all human protein interaction data from the STRING database (v11) (von Mering, 2004). Ensembl Peptide IDs were mapped to Ensembl Gene IDs using the “biomaRt” package in R. We calculated the degree distributions for the genes that become consistently more heterogeneous with age and for all remaining genes, using different cutoffs for interaction confidence scores. To calculate the significance of the difference, we i) calculated the number of interactors (degree) for each gene, ii) randomly sampled k genes from all interactome data 10,000 times (k = number of genes that become heterogeneous with age across all datasets and have interaction information in the STRING database, after filtering for the cutoff), and iii) calculated the median degree for each sample. We then calculated an empirical p-value as the proportion of these 10,000 samples in which the median degree is equal to or higher than our original value. The number of genes and interactions after each cutoff are given in Figure S20.
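The resampling test for network centrality can be sketched as below, in standard-library Python rather than the authors' R code. The function name and the toy degree table are hypothetical; sampling is assumed to be without replacement, and the degree table is assumed to be already filtered for the chosen confidence cutoff.

```python
import random
from statistics import median


def degree_enrichment_p(degrees, gene_set, n_resamples, rng):
    """Empirical p-value for the median degree of gene_set.

    degrees     : dict mapping gene ID to its number of interactors
    gene_set    : genes of interest, all present in degrees
    n_resamples : number of random gene sets to draw
    rng         : a random.Random instance
    """
    genes = list(degrees)
    k = len(gene_set)
    observed = median(degrees[g] for g in gene_set)
    hits = 0
    for _ in range(n_resamples):
        sampled = rng.sample(genes, k)  # k random genes, without replacement
        if median(degrees[g] for g in sampled) >= observed:
            hits += 1
    return hits / n_resamples


if __name__ == "__main__":
    # Toy interactome: gene g_i has degree i.
    degrees = {f"g{i}": i for i in range(1, 101)}
    hubs = ["g98", "g99", "g100"]
    print(degree_enrichment_p(degrees, hubs, 10_000, random.Random(0)))
```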
Cell-type specificity analysis
Using FACS-sorted cell-type specific transcriptome data from mouse brain (Cahoy et al., 2008), we checked whether there is any overlap between genes that become heterogeneous with age and cell-type specific genes. We downloaded the data from the GEO database (GSE9566) and preprocessed them as follows: i) RMA correction using the ‘affy’ package in R (Gautier, Cope, Bolstad, & Irizarry, 2004), ii) log2 transformation, iii) quantile normalization using the ‘preprocessCore’ package in R (Bolstad, 2016), iv) mapping probeset IDs first to mouse genes and then to human genes. We only included genes that have one-to-one orthologs in humans, after filtering out probesets that map to multiple genes. We defined cell-type specific genes by calculating the effect size (Cohen’s d) for each gene and cell type, and identifying genes with an effect size higher than or equal to two as specific to that cell type. At this cutoff, there was no overlap between cell-type specific gene lists. To test for an association between heterogeneity and cell-type specificity, we used Fisher’s exact test.
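The effect-size criterion can be sketched as follows in standard-library Python. The text does not state which groups were compared or which standard deviation was used, so this sketch assumes, for illustration only, that expression in one cell type is compared against expression in all other cell types and that Cohen's d uses the pooled standard deviation. The function names are hypothetical.

```python
from statistics import mean, variance


def cohens_d(target, others):
    """Cohen's d between two groups, using the pooled standard deviation."""
    n1, n2 = len(target), len(others)
    pooled_variance = (
        (n1 - 1) * variance(target) + (n2 - 1) * variance(others)
    ) / (n1 + n2 - 2)
    return (mean(target) - mean(others)) / pooled_variance ** 0.5


def is_cell_type_specific(target, others, cutoff=2.0):
    """A gene is called specific when its effect size reaches the cutoff."""
    return cohens_d(target, others) >= cutoff


if __name__ == "__main__":
    in_cell_type = [3.0, 4.0, 5.0]      # expression of one gene in one cell type
    in_other_types = [1.0, 2.0, 3.0]    # expression of the same gene elsewhere
    print(cohens_d(in_cell_type, in_other_types))
    print(is_cell_type_specific(in_cell_type, in_other_types))
```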
Software
All analysis is done using R and the code to calculate heterogeneity changes with age is available as an R package ‘hetAge’, which is documented in https://mdonertas.github.io/hetAge/. “ggplot2” (Wickham, 2017) and “ggpubr” (Kassambara, 2018) R libraries are used for the visualization.
Data availability
Raw data used in this study were downloaded from the GEO database using the GSE numbers specified in Table S1. All data generated in this study, i.e. changes in expression and heterogeneity with age for each dataset and the functional enrichment results, are available as Supplementary Tables.
Author Contributions
H.M.D. conceived and designed the study with contributions from M.S. and J.M.T. U.I. and H.M.D. analyzed the data. U.I. and H.M.D. interpreted the results and wrote the manuscript with contributions from M.S. and J.M.T. All authors read, revised and approved the final version of this manuscript.
Funding Statement
This work is funded by EMBL (H.M.D., J.M.T.) and the Wellcome Trust (098565/Z/12/Z; J.M.T).
Acknowledgements
The authors thank Hamit Izgi, Matias Fuentealba Valenzuela, Dr. Daniel K. Fabian, and Prof Linda Partridge for helpful discussions. H.M.D. is a member of Darwin College, University of Cambridge.