Abstract
Cellular aging has been progressively elucidated by science. However, aging at the multicellular-individual level is still poorly understood. A recent theory of individuated multicellularity describes the emergence of crucial information content for cell differentiation. This information is mostly conveyed in the non-epigenetic constraints on histone modifications near transcription start sites. According to this theory, the non-epigenetic content emerges at the expense of the information capacity for epigenetic content. However, it is unclear whether this “reassignment” of capacity continues after adulthood. To answer this question, I analyzed publicly available high-throughput data of histone H3 modifications and mRNA abundance in human primary cells. The results show that the “reassignment” continues after adulthood in humans. Based on this evidence, I present a falsifiable theory describing how continued “reassignment” of information capacity creates a growing epigenetic/non-epigenetic information imbalance. According to my theoretical account, this imbalance is the fundamental reason why individuated multicellular organisms senesce.
Our intellectual endeavors have entertained the prospect of unlimited lifespan for centuries [1], and the scientific endeavor has been no exception [2]. In the 1950s, the immortality of cultured somatic cells was indeed a widely-held belief [3]. That changed only when Hayflick & Moorhead showed that cultured human somatic cells do stop dividing and become less viable once their divisions reach a certain number [4], a phenomenon known today as the Hayflick limit [3]. This loss of replicative capacity and, in general, the process of aging at the cellular level, have been found to correlate with telomere length [3,5,6]. Yet, the number of times human cells can divide in culture exceeds the number of times cells divide throughout our lifespan; there is no significant correlation between human cell replicative capacity and cell donor age [7]. That is, we—and individuated multicellular organisms in general—age before most of our cells do [8,9]. The outstanding question is why.
Theoretical descriptions of senescence or aging at the multicellular-individual level have been classified into two categories: programmed senescence and senescence caused by damage/error [10]. Recently it has been argued, however, that senescence is not programmed nor is it ultimately a consequence of damage or error in the organism’s structure/dynamics [11]. Instead, it may be a byproduct of maintenance and/or developmental dynamics [11,12], themselves underpinned in part by intracellular signaling pathways such as the cell-cycle-related PI3K/AKT/mTOR pathway [11]. These pathways have been shown to modulate aging at the cellular level in species such as the yeast Saccharomyces cerevisiae [13].
The analogous notion of aging at the multicellular-individual level as a byproduct of certain functional signaling pathways [11] is, in principle, supported by the fact that the deficiency of mTOR kinase—a key component of the PI3K/AKT/mTOR pathway—can double the lifespan of the roundworm Caenorhabditis elegans [14]. However, the fundamental dynamics that make individuated multicellular organisms senescent after adulthood remain unclear and largely lack falsifiable scientific theories. Falsifiability—the possibility of establishing a hypothesis or theory as false by observation and experiment [15]—allows the objective rejection of existing scientific theories, fosters the development of new ones, and constitutes the most widely accepted demarcation between science and non-science [16].
The issue of senescence can also be approached from the angle of theoretical biology, using publicly available high-throughput data of histone H3 modifications and mRNA abundance in human primary cells to look for proof of concept. In this study, I conducted a statistical analysis of such data, which provided proof of concept for the human species. These findings provide empirical grounds for my theoretical work, suggesting that senescence is a byproduct of functional developmental dynamics as first described by a recently proposed theory of individuated multicellularity [17]. Specifically, I show that the byproduct is a post-ontogenetic, growing imbalance between two different information contents conveyed respectively in two different types of constraints on histone post-translational modifications near transcription start sites (TSSs). Constraints are here understood as the local and level-of-scale specific thermodynamic boundary conditions required for energy to be released as work as described by Atkins [18]. The concept of constraint is crucial because, according to the theory of individuated multicellularity, a higher-order constraint (i.e., a constraint on constraints) on changes in histone modifications harnesses critical work that regulates transcriptional changes for cell differentiation at the multicellular-individual level.
Under the theory of individuated multicellularity, the intrinsic higher-order constraint is the simplest multicellular individual in fundamental terms. In addition, the dynamics of the lower-order constraints must be explicitly unrelated to each other (i.e., statistically independent) in order to elicit the emergence of the intrinsic higher-order constraint. Along with the emergence of this intrinsic higher-order constraint, the theory of individuated multicellularity describes the emergence of critical information content, named in the theory hologenic content, which is about the multicellular individual as a whole in terms of developmental self-regulation. Thus, for the sake of brevity, I here refer to the theory of individuated multicellularity as the hologenic theory.
The constraints on the combinatorial patterns of histone modifications are generally known as histone crosstalk [19,20]. Histone modifications are also known to be relevant for epigenetic changes [21], which are defined as changes in gene expression that cannot be explained by (i.e., that are explicitly unrelated to) changes in the DNA sequence [22]. This relevance is underpinned by the capacity of histone modifications to convey information content, which has allowed the prediction of mRNA levels from histone modification profiles near TSSs with high accuracy [23].
Based on these considerations and the properties of the nonnegative measure of multivariate statistical association known as total correlation [24] or multiinformation [25] (symbolized by C and typically measured in bits), the overall observable histone crosstalk can be decomposed. That is, histone crosstalk, if measured as a total correlation C, is the sum of two explicitly unrelated C components: one epigenetic (i.e., explicitly related to changes in gene expression) and the other non-epigenetic (i.e., explicitly unrelated to changes in gene expression). This sum can be expressed as follows:

$$C(X_1,\ldots,X_n) = C_Y(X_1,\ldots,X_n,Y) + C(X_1,\ldots,X_n \mid Y) \qquad (1)$$

where X1, …, Xn are random variables representing n histone modification levels in specific genomic positions with respect to the TSS and Y is a random variable representing either gene expression level, transcription rate, or mRNA abundance level associated with the TSS. These levels are equivalent for the decomposition because of the strong correlation that exists between them ([26] and references therein).
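As a minimal sketch of this decomposition, the Python below estimates total correlation from discretized samples using plug-in Shannon entropies, i.e., C(X1,…,Xn) = Σ H(Xi) − H(X1,…,Xn), together with its conditional counterpart given Y. The discretization, the plug-in estimator, and all function names are assumptions made for illustration only; they are not taken from the analysis described in this paper.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Plug-in Shannon entropy (bits) of a sequence of hashable outcomes."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())


def conditional_entropy(samples, y):
    """H(S|Y) = H(S,Y) - H(Y), estimated from paired observations."""
    return entropy(list(zip(samples, y))) - entropy(y)


def total_correlation(columns):
    """C(X1..Xn): sum of marginal entropies minus the joint entropy."""
    joint = list(zip(*columns))
    return sum(entropy(col) for col in columns) - entropy(joint)


def conditional_total_correlation(columns, y):
    """C(X1..Xn|Y): the same quantity, with every entropy conditioned on Y."""
    joint = list(zip(*columns))
    return (sum(conditional_entropy(col, y) for col in columns)
            - conditional_entropy(joint, y))


def decompose(columns, y):
    """Return (total, epigenetic, non_epigenetic) magnitudes in bits.

    The epigenetic term is obtained as the difference between the total
    and the non-epigenetic (conditional) term, following the sum in Eq. 1.
    """
    total = total_correlation(columns)
    non_epigenetic = conditional_total_correlation(columns, y)
    return total, total - non_epigenetic, non_epigenetic
```

In this toy setting, two variables that both track Y yield crosstalk that is entirely epigenetic, whereas two variables that track each other independently of Y yield crosstalk that is entirely non-epigenetic.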
The hologenic theory describes how the epigenetic component of histone crosstalk (represented by CY(X1,…, Xn, Y) in the sum decomposition of Eq. 1) conveys information about each cell’s transcriptional profile. This component is, in information content terms, the dominating component for any eukaryotic colonial species (such as the alga Volvox carteri [27]) and, importantly, also for undifferentiated stem cells.
The second, non-epigenetic component of histone crosstalk (represented by C(X1,…, Xn|Y) in Eq. 1) is known to grow in magnitude during development until the organism’s mature form is reached [17]. This component is described by the hologenic theory as conveying information about the multicellular individual as a whole—starting from the moment said individual emerges as an intrinsic higher-order constraint on the early embryo’s proliferating cells.
Importantly, the overall observable histone crosstalk magnitude (represented by C(X1,…,Xn) in Eq. 1) is not infinite. In other words, the overall histone crosstalk has a finite information capacity, which can be measured in bits. Moreover, the sum decomposition in Eq. 1 implies that the growth in magnitude (bits) of the hologenic (i.e., non-epigenetic) component must be accompanied by a decrease in magnitude of the epigenetic component. That is, the capacity (in bits) for hologenic information content in histone crosstalk is bound to grow at the expense of the capacity for epigenetic information content.
The hologenic theory also maintains that a necessary condition for the evolution of individuated multicellular lineages was the appearance of a class of molecules synthesized by the cells—called Nanney’s extracellular propagators (symbolized by ) in the theory [17]. These molecules are predicted to be, in a given tissue and time period, (i) secretable into the extracellular space, (ii) once secreted, capable of eliciting a significant incremental change (via signal transduction) in the magnitude of the non-epigenetic histone crosstalk (i.e., the C(X1,…,Xn|Y) summand in Eq. 1) within other cells’ nuclei, and (iii) affected in their extracellular diffusion dynamics by the geometrical complexity of the extracellular space (i.e., constraints on diffusion at the multicellular-individual level, which cannot be reduced to constraints at the cellular level). Also under the hologenic theory, for the multicellular individual to develop and survive, both hologenic (developmental self-regulation of the multicellular individual overall) and epigenetic (each cell’s transcriptional profile) contents must coexist.
One final but important consideration regarding histone crosstalk is that it is the result of constraints which, as mentioned previously, are level-of-scale specific. To exemplify this specificity, consider the example of an internal combustion engine: a single molecule in a cylinder wall does not embody a constraint on the expansion of the igniting gas, yet the cylinder-piston ensemble does. For this reason, histone crosstalk constraints were expected to have relevance for senescence but only at a specific level of scale. The specific level of scale in histone crosstalk that is relevant for human senescence has not been studied in detail before.
To investigate from a theoretical angle if the “reassignment” of information capacity for epigenetic and non-epigenetic (i.e., hologenic) content stops when development reaches the multicellular individual’s mature form or instead continues without interruption, one also needs to investigate the “reassignment” (if any) in cancer cells. One of the corollaries of the hologenic theory is a significant loss of hologenic content in cancer cells, because they are no longer constrained by the multicellular individual that normal (i.e., non-cancerous) cells serve and are constrained by. Thus, I developed a falsifiable theory of senescence based on the post-ontogenetic continuation of this “reassignment” process in human histone crosstalk as proof of concept.
To test this theory, I formalized the proof of concept into the following two hypotheses: (i) within genomic regions adjacent to TSSs in primary normal cells, the log-ratio between the non-epigenetic and epigenetic histone H3 crosstalk magnitudes is significantly and positively correlated with cell donor age (over a range of 0-90 years old) and (ii) no such statistically significant correlation exists for primary cancer cells (see Fig. 1).
RESULTS
To test the two proof-of-concept hypotheses, I used publicly available ChIP-seq (chromatin immunoprecipitation followed by high-throughput DNA sequencing) and RNA-seq (transcriptome high-throughput sequencing) data for human primary cell samples, which were obtained from different individuals with ages ranging from 0 to +90 years old. I computed log-transformed ChIP-seq signal magnitudes for each primary cell sample from ChIP-seq data of different position-specific (in bp relative to the TSS) histone H3 modifications. Similarly, I log-transformed mRNA abundance values from RNA-seq assay data associated with each ChIP-seq assay for each primary cell sample.
Using these tandem ChIP-seq and RNA-seq data, I quantified the non-epigenetic and epigenetic histone H3 crosstalk magnitudes (Eq. 1) for triads of variables {Xi, Xj, Xk}. These variables represented position-specific histone H3 modification levels, i.e., C(Xi, Xj, Xk|Y) and CY(Xi, Xj, Xk, Y) for the non-epigenetic and epigenetic histone crosstalk components, respectively, where Y represents mRNA abundance. Triads (as opposed to pairs or tetrads) were first analyzed because a triad constitutes the number of variables (i.e., position-specific histone modification levels) found to possess both significant predictive power and predictive synergy to resolve the statistical uncertainty about the mRNA abundance level associated with a given TSS (see details in Methods).
The log-ratio (base 2) between the non-epigenetic and epigenetic histone H3 crosstalk magnitudes was thus computed as the dimensionless quantity

$$\log_2 \frac{C(X_i, X_j, X_k \mid Y)}{C_Y(X_i, X_j, X_k, Y)}$$
Importantly, total correlation C captures all possible associations in the set of variables {Xi, Xj, Xk} that may exist starting from the pairwise level.
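The log-ratio described above can be sketched as a small helper. The function name and the handling of nonpositive magnitudes are assumptions for illustration; the text does not state how zero or negative estimates, if any, were treated.

```python
from math import log2


def crosstalk_log_ratio(non_epigenetic_bits, epigenetic_bits):
    """Base-2 log-ratio of non-epigenetic to epigenetic crosstalk magnitude.

    Both magnitudes are expected in bits; the ratio itself is dimensionless.
    Raising on nonpositive input is an assumption of this sketch.
    """
    if non_epigenetic_bits <= 0 or epigenetic_bits <= 0:
        raise ValueError("both magnitudes must be positive")
    return log2(non_epigenetic_bits / epigenetic_bits)


# Hypothetical magnitudes: twice as much non-epigenetic as epigenetic crosstalk.
print(crosstalk_log_ratio(0.5, 0.25))  # 1.0
```

A value of 0 corresponds to equal magnitudes, positive values to a predominance of non-epigenetic crosstalk, and negative values to a predominance of epigenetic crosstalk.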
The log-ratio of non-epigenetic to epigenetic histone H3 crosstalk magnitude is positively correlated with cell donor age in normal cells
ChIP-seq data for five histone H3 modifications were used in all analyses: H3K4me1 (histone H3 lysine 4 monomethylation), H3K9me3 (histone H3 lysine 9 trimethylation), H3K27ac (histone H3 lysine 27 acetylation), H3K27me3 (histone H3 lysine 27 trimethylation), and H3K36me3 (histone H3 lysine 36 trimethylation). The ChIP-seq signals for these modifications were computed for 30 genomic bins, each 200 bp long, across a 6,000bp-long TSS-adjacent region (see Fig. 1). Thus, a total of 150 variables Xi representing position-specific histone H3 modification levels (each variable with signals for 18,220 RefSeq TSSs) were used when analyzing each cell sample. A total of 18 normal cell samples and 17 cancer cell samples were included in the analysis. The Pearson correlation coefficient r between the log-ratio and the cell donor age was obtained for each of the 551,300 possible {Xi, Xj, Xk} triads. The associated p-values (one-sided Student’s t-test) were then corrected for multiple testing (Benjamini-Yekutieli correction, see Methods), yielding q-values.
To determine whether the hypothesized positive correlation between the non-epigenetic/epigenetic histone H3 crosstalk log-ratio and cell donor age exists, and also to illustrate the concept of positive correlation in normal cells vs. no correlation in cancer cells, I obtained all possible 551,300 correlation values for triads (one-sided Student’s t-test). To exemplify, the results for the triad {H3K27ac (at –1000bp), H3K36me3 (at +1000bp), H3K4me1 (at +3200bp)} are shown here, where the correlation was positive (r=0.83) and highly significant (q=1.86×10⁻²), as seen in Fig. 2, indicating that the hypothesized correlation holds for this triad.
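For readers unfamiliar with the multiple-testing step, the following is a minimal sketch of the Benjamini-Yekutieli adjustment applied to a list of p-values. It is a generic textbook implementation written for illustration (the function name is hypothetical) and is not the code used in this study.

```python
def benjamini_yekutieli(pvalues):
    """Return Benjamini-Yekutieli adjusted p-values (q-values).

    For m tests with sorted p-values p_(1) <= ... <= p_(m), the adjusted
    value at rank i is the minimum over ranks j >= i of
    p_(j) * m * c(m) / j, capped at 1, where c(m) = sum_{k=1..m} 1/k.
    """
    m = len(pvalues)
    c_m = sum(1.0 / k for k in range(1, m + 1))
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0  # caps every q-value at 1
    # Walk from the largest p-value down so the running minimum
    # enforces monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m * c_m / rank)
        qvalues[idx] = running_min
    return qvalues


# Three hypothetical p-values, for illustration only.
print(benjamini_yekutieli([0.01, 0.04, 0.03]))
```

Because of the harmonic factor c(m), this correction is more conservative than the Benjamini-Hochberg procedure and remains valid under arbitrary dependence among the tests, which matters when the tested triads share variables.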
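The per-triad test can be sketched as follows: compute the Pearson coefficient r between donor age and the log-ratio across samples, then convert it to a Student's t statistic with n − 2 degrees of freedom. The function names and the input values in the example are hypothetical; the one-sided p-value (the upper tail of the t distribution) is left out because the Python standard library has no t-distribution function.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two sequences of equal length >= 3")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def t_statistic(r, n):
    """Student's t statistic for testing r against zero, with n - 2 d.o.f."""
    return r * sqrt((n - 2) / (1.0 - r * r))


# Hypothetical donor ages (years) and log-ratios, for illustration only.
ages = [0, 20, 40, 60, 80]
log_ratios = [0.0, 0.3, 0.5, 0.9, 1.1]
r = pearson_r(ages, log_ratios)
print(r, t_statistic(r, len(ages)))
```

For a one-sided test of positive correlation, the p-value is the probability that a t-distributed variable with n − 2 degrees of freedom exceeds the computed statistic.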
Altogether, the 551,300 correlation values had a mean value , a median value , and a standard deviation value σr =0.24 (see statistical distribution of r in Fig. 3). From these correlation values, only 24,185 (i.e., ~4%) were nonpositive and none of them was statistically significant (i.e., where r≤0, q>0.05). In contrast, it was found that for 315,378 triads (i.e., ~57%) the correlation values were positive and statistically significant (i.e., r>0 and q≤0.05).
Importantly, I also found that the hypothesized positive correlation between the log-ratio of non-epigenetic to epigenetic histone H3 crosstalk and cell donor age verified for triads of position-specific histone H3 modifications in normal cells loses its strength for tetrads (; ; σr=0.27). It is also no longer greater than zero for pairs (; ; σr =0.40) (Fig. 4). These results for tetrads and pairs indicate that, in normal cells, the predicted positive correlation only holds for triads (in cancer cells, per the second proof-of-concept hypothesis, it was predicted not to hold at all). Such specificity was expected because if senescence can be explained in terms of an imbalance of information-conveying constraints that are level-of-scale specific like other thermodynamic constraints, the imbalance itself also must be level-of-scale specific.
The log-ratio of non-epigenetic to epigenetic histone H3 crosstalk magnitude does not correlate with cell donor age in cancer cells
When I analyzed the log-ratio of non-epigenetic to epigenetic histone H3 crosstalk magnitude and cell donor age for cancer cells using the same exemplary triad {H3K27ac (at −1000bp), H3K36me3 (at +1000bp), H3K4me1 (at +3200bp)}, I found that no significant correlation exists between those two variables (r=–0.2; q=1; see Fig. 5), as hypothesized.
For the 551,300 correlation values corresponding to all triads of position-specific histone H3 modifications in cancer cells, the mean and median were close to zero (; ), and the standard deviation was σr =0.15 (see statistical distribution of r in Fig. 3). All associated p-values (two-sided Student’s t-test) were corrected and the resulting q-values were all equal to 1 and hence non-significant. Similar results—i.e., all q-values equal to 1—were obtained for all 11,175 pairs of position-specific histone H3 modification levels (; ; σr =0.23) and for all 50,000 random tetrads (; ; σr =0.15). These results suggest that, as predicted, no significant correlation exists between the log-ratio of non-epigenetic to epigenetic histone H3 crosstalk magnitude and cell donor age in cancer cells.
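The counts of pairs and triads quoted above follow directly from the 150 position-specific variables (5 modifications × 30 bins), as the sketch below confirms. The exhaustive number of tetrads is far larger, which is consistent with analyzing a random subset of 50,000. How the tetrads were actually sampled is not stated here; drawing distinct tetrads uniformly with a fixed seed is an assumption of this sketch, as are all names.

```python
import random
from math import comb

MODIFICATIONS = ["H3K4me1", "H3K9me3", "H3K27ac", "H3K27me3", "H3K36me3"]
N_BINS = 30  # 200-bp bins across the 6,000-bp TSS-adjacent region

# One variable per (modification, bin) combination.
variables = [(mod, b) for mod in MODIFICATIONS for b in range(N_BINS)]


def random_tetrads(count, seed=0):
    """Draw `count` distinct tetrads of variable indices uniformly at random."""
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < count:
        chosen.add(tuple(sorted(rng.sample(range(len(variables)), 4))))
    return sorted(chosen)


n = len(variables)
print(n, comb(n, 2), comb(n, 3), comb(n, 4))
# 150 11175 551300 20260275
```

With more than 20 million possible tetrads, a sample of 50,000 covers roughly 0.25% of them.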
I also evaluated whether the stark difference of the correlation values between normal cells (i.e., r markedly positive) and cancer cells (i.e., r close to zero) was only attributable to the data point (for normal cell samples) that corresponds to a neonate, with coordinates (0, 0.01) in Fig. 2. In other words, I tested whether the neonate data point was simply a statistical outlier that created an otherwise nonexistent difference between normal and cancer cells in the analysis.
For this purpose, I recomputed all 551,300 correlation values corresponding to normal cells, excluding the neonate data point. The mean, median, and standard deviation values obtained were , , and σr =0.19, respectively (see distribution of r in comparison with that for cancer cells in Fig. 3). This difference between r values for normal cells (neonate data point excluded) and cancer cells was further tested and shown to be highly significant (Mann-Whitney U test: U=2.7×10¹¹, p<2.2×10⁻¹⁶). These findings suggest that the neonate data point is not a statistical outlier among normal cell samples, let alone the explanation for the difference between normal and cancer cells in terms of the correlation values obtained.
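To make the reported U statistic easier to interpret, here is a minimal rank-based sketch of the Mann-Whitney U computation for two samples, with ties given average ranks. It is a generic implementation for illustration (the function name is hypothetical) and it omits the p-value, which for samples of this size would normally come from a normal approximation in a statistics package.

```python
def mann_whitney_u(a, b):
    """U statistic for sample `a`: the number of (a_i, b_j) pairs with
    a_i > b_j, counting ties as one half."""
    # Label each value with its sample of origin, then sort by value.
    combined = sorted([(v, 0) for v in a] + [(v, 1) for v in b],
                      key=lambda item: item[0])
    rank_sum_a = 0.0
    i = 0
    while i < len(combined):
        # Find the run of tied values starting at position i.
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        members_of_a = sum(1 for k in range(i, j) if combined[k][1] == 0)
        rank_sum_a += average_rank * members_of_a
        i = j
    n_a = len(a)
    return rank_sum_a - n_a * (n_a + 1) / 2.0


# Hypothetical correlation values, for illustration only.
print(mann_whitney_u([0.3, 0.4, 0.5], [0.1, 0.2, 0.3]))  # 8.5
```

U ranges from 0 to the product of the two sample sizes. For two sets of 551,300 correlation values, that maximum is about 3.04×10¹¹, so a value of 2.7×10¹¹ indicates that the r value from normal cells exceeds the one from cancer cells in the large majority of pairwise comparisons.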
The total information capacity of triad-wise histone H3 crosstalk does not correlate with cell donor age
Finally, I assessed whether the total information capacity (represented by C(X1,…,Xn) in Eq. 1 and measured in bits) of overall histone H3 crosstalk (triad-wise) is significantly correlated with age, in particular, whether it is positively correlated. This potential correlation is important, because if total information capacity increases with cell donor age, an age-correlated decrease of the proportion available for epigenetic content would not necessarily be a problem. That is, a proportionally smaller and smaller information capacity for epigenetic content within histone crosstalk would not generate an information content imbalance (hypothesized in the Introduction) as long as a growing total capacity provides enough room for epigenetic content in absolute terms.
To test this possibility, I obtained the correlation value r between cell donor age and the total information capacity (in bits) of TSS-adjacent histone H3 crosstalk for all 551,300 triads of position-specific histone H3 modifications in normal cells. Following Eq. 1, this capacity was computed for each triad as

$$C(X_i, X_j, X_k) = C_Y(X_i, X_j, X_k, Y) + C(X_i, X_j, X_k \mid Y)$$
The analysis revealed that the correlation coefficients r have mean, median, and standard deviation values , , and σr =0.24, respectively, and that all associated q-values were equal to 1 and thus non-significant. For cancer cells, all correlation values were also non-significant (q=1). Their mean, median, and standard deviation values were , , and σr =0.18, respectively. These results suggest that senescence would indeed be an information capacity “reassignment” problem—creating in turn an information-content imbalance, as hypothesized—rather than a “total capacity contraction” problem.
Taken together, the statistical strength of all the results obtained—notwithstanding the heterogeneous origin of the primary cell samples analyzed given the different tissues from different individuals—provides proof of concept and underpins a strong falsifiable prediction for a theory of senescence presented in the Discussion.
DISCUSSION
The successful testing of the two proof-of-concept hypotheses in the present work provides empirical grounds for the following falsifiable theory of senescence as a byproduct of developmental dynamics: Given that the “reassignment” process for information capacity in histone crosstalk (i.e., a progressive gain of capacity for hologenic information content at the expense of that for epigenetic content) continues without interruption throughout the multicellular individual’s lifespan, a growing and ultimately lethal information content imbalance is created in the cells’ nuclei. Importantly, this “reassignment” process is underpinned by constraints on the extracellular diffusion of molecules, and the constraints are embodied only at the multicellular-individual level. That is, in histone crosstalk there is a time-correlated loss of capacity for epigenetic information (i.e., fewer and fewer epigenetic constraints on histone crosstalk), which causes a global and progressive impairment of biological functions at the multicellular-individual level, eventually causing the death of the individual.
The nature of the epigenetic constraints on histone crosstalk strongly implicates this time-correlated loss of capacity for epigenetic information content (and concurrent gain of that for hologenic content) as the fundamental cause of senescence. Epigenetic constraints are explicitly related to transcriptional/gene expression changes and represented by the CY(X1,…, Xn, Y) summand in Eq. 1. Because they depend on the interactions between the histone-modified nucleosomes and the DNA wrapped around them (allowing or preventing transcription), the epigenetic information content they embody allows histone modification patterns to specify precise mRNA (and, ultimately, gene expression) levels.
This age-correlated hologenic/epigenetic information imbalance in histone crosstalk can also be understood in terms of an imbalance between the accuracy and precision of transcription in the cells with respect to the needs of the multicellular individual. That is, more accuracy (i.e., closeness of the mean mRNA level to the mean level functional for the multicellular individual) is reached with age at the expense of precision (i.e., closeness of the resulting mRNA levels from the same pattern of histone modifications). This trade-off is unavoidable because (i) the relative growth of C(X1,…, Xn|Y) implies an increasing constraint on (i.e., regulation of) histone modification patterns with respect to the multicellular individual [17], thus making transcription more accurate and (ii) the concurrent relative decrease of CY(X1,…, Xn, Y) means histone modification patterns become worse and worse predictors of mRNA levels, in turn making transcription less and less precise to the point of dysfunctionality with respect to the multicellular individual (see schematic in Fig. 6a).
Thus, we can characterize senescence under this theory as a global transcriptional over-regulation with respect to the multicellular individual’s needs—as opposed to cancer, where the dysfunctional effect is typically characterized in terms of the dysregulation of transcription and gene expression [28,29].
The following general prediction applies to the falsifiability of the theory of senescence: Within genomic regions adjacent to TSSs in primary normal cells from any given tissue in any individuated multicellular species, a significant positive correlation will be observed between the log-ratio of non-epigenetic to epigenetic histone crosstalk magnitude and the age of the individual from whom the cells were obtained. The specific level of crosstalk—i.e., number of position-specific histone modifications involved—at which this correlation exists may vary among species. It is predicted to be the level that possesses both significant predictive power and predictive synergy (see Methods) on mRNA levels. Moreover, since hologenic information content is described as emerging locally and independently in each developmental process [17], the statistical strength of the predicted positive correlation will be further increased—and underpinned by a monotonically increasing function—if all primary cell samples are obtained from the same tissue of the same individual throughout its lifespan.
The notable exceptions to the prediction above are a few species able to undergo reverse developmental processes from adult to juvenile stages. One such species is the jellyfish Turritopsis nutricula [30], which is predicted to display an analogous negative correlation during such processes, i.e., “reassignment” in reverse. Further exceptions to the prediction are species displaying extremely slow or potentially negligible senescence processes [31]. Examples of these are the bristlecone pine Pinus longaeva [32], the freshwater polyp Hydra vulgaris [33], and the naked mole-rat Heterocephalus glaber [34], which, after adulthood, are predicted to display a significant but very weak positive correlation (in cases where senescence is extremely slow), or a hologenic/epigenetic log-ratio invariant with age (i.e., no correlation in cases where senescence is truly negligible; Fig. 6b).
Senescence is widely regarded as an evolutionary consequence of the relaxation of selection on traits that maintain/repair the multicellular individual’s functions in later life, because later life would have been rarely realized in the wild with the hazards it imposes [35]. However, under the falsifiable theory presented in this paper, this consensus is fundamentally incorrect. Indeed, senescence at the multicellular-individual level is, I suggest, not the result of relaxed selection but instead an intrinsic developmental byproduct that would have been already observable theoretically in the emergence of the very first individuated multicellular organisms as described by the hologenic theory [17]. In other words, had the first individuated multicellular organisms been free from any extrinsic hazard in the wild, they would have begun to senesce significantly after reaching a mature form in their development, as opposed to displaying extremely slow or negligible senescence as can be inferred from the relaxed-selection hypothesis.
If correct, the evolutionary account of senescence suggested here underscores the need for modern evolutionary theory to incorporate the effects of the few yet crucial events where unprecedented forms of biological individuality have emerged throughout the history of life on Earth. One of these events—as discussed here—is the emergence of the individuated multicellular organism [17] with senescence as its developmental byproduct, and its influence on the population renewal process. Other emergence events where new forms of individuality can arise with significant evolutionary consequences include the origin of life—explicitly excluded by Darwin from the scope of his original theory [36]—with its unprecedented self-regulating and self-reproducing dynamics that first enabled natural selection [44], and the emergence of the mind [37], which—through synthetic biology—could at some point elicit the appearance of new species in the evolutionary process without any involvement of natural selection. These latter two events, and potentially others, remain to be fully elucidated along with their evolutionary consequences.
Any theory of senescence is bound to address the question of whether aging at the multicellular-individual level can be dynamically stopped. The answer suggested here is that achieving a dynamical arrest of senescence is not a fundamental impossibility but it may well be a technical impossibility because of a therapeutic safety issue. From a fundamental point of view, methods could be developed to, for example, artificially increase the dynamical range of nucleosome-DNA interactions (thus increasing the capacity for epigenetic information content in histone crosstalk at the expense of that for hologenic content).
Yet, the hologenic theory also predicts that a significant loss of hologenic content is a necessary condition for the onset of cancer. If this is correct, a potentially insurmountable safety problem arises: also under the hologenic theory, the in vivo balance between hologenic and epigenetic information content is predictably “fine-tuned” as it is individual-specific, cell-type-specific, and also confined to small functional ranges. Thus, there could be an inherent high risk of greatly increasing cancer incidence with the slightest extrinsic attempt to correct for the hologenic/epigenetic content imbalance. The problem is that hologenic constraints, whose growth in magnitude has senescence as a byproduct, are the very constraints preventing an otherwise likely onset of cancer [17].
Based on a mathematical model of intercellular competition, Nelson and Masel have argued that stopping senescence, even if possible, will always elicit the onset of cancer and that senescence is ultimately inevitable [38]. Nevertheless, individuated multicellular species such as Turritopsis nutricula demonstrate that development can be reversed at least into juvenile developmental stages [30], and the naked mole-rat suggests that senescence can be negligible or close to negligible [34]. Together, these cases suggest that senescence is reversible in some species and negligible or close to negligible in others, however exceptional.
The delicate balance between hologenic and epigenetic information described here may shed light on the well-known positive correlation between cancer incidence and age [39]: if the senescent multicellular individual attempts to correct its growing hologenic/epigenetic content imbalance too strongly, it may elicit the onset of cancer. Thus, age-related cancer would be the result of a strong enough “pushback” from the multicellular individual against its own senescence. Although the specific dynamics that would underpin the “pushback” are beyond the scope of this paper, this hypothesis is indeed falsifiable by means of the following secondary prediction: the observed log-ratio of non-epigenetic to epigenetic histone crosstalk magnitude in the normal (i.e., non-cancerous) cells closest to an age-related stage I malignant tumor will be significantly lower than said log-ratio observed in the other (i.e., tumor-nonadjacent) normal cells of the same tissue. (Note: The falsification of this secondary prediction does not imply the falsification of the theory as a whole.)
In turn, the “pushback”-against-senescence hypothesis for age-related cancer has, if correct, an implication we should not overlook. Namely, stopping senescence and eliminating the incidence of age-related cancer should be one and the same technical challenge. In this respect, it is worth noting that in the naked mole-rat both senescence [34] and cancer incidence [40,41] have been described as negligible or close to negligible.
Rozhok and DeGregori have recently highlighted the explanatory limitations [42] of the Armitage-Doll multistage model of carcinogenesis, which regards the accumulation of genetic mutations as the cause of age-related cancer [43]. They further argued that age-related cancer should rather be understood as a function of senescence-related processes [42]. However, their description of age-related cancer is based on Darwinian processes and thus differs from the account suggested here, which can be understood within the concept of teleodynamics [37,44], a framework of biological individuality based on the emergence of intrinsic higher-order constraints, such as that described in the hologenic theory [17].
Apart from the proof of concept presented here, if the main prediction of this paper consistently resists falsification attempts, further research will be needed to fully elucidate the specific molecular dynamics embodying hologenic and epigenetic constraints within histone crosstalk. Such insights will be necessary to decide whether the hologenic/epigenetic information content imbalance can be corrected without compromising the multicellular individual’s health or survival.
METHODS
Data collection
The genomic coordinates and associated transcript lengths of all annotated RefSeq mRNA TSSs for the hg19 (Homo sapiens) assembly were downloaded from the UCSC (University of California, Santa Cruz) database [45]. All ChIP-seq and RNA-seq data downloaded, processed, and analyzed in this work were generated by the Canadian Epigenetics, Epigenomics, Environment and Health Research Consortium (CEEHRC) initiative funded by the Canadian Institutes of Health Research (CIHR), Genome BC, and Genome Quebec. CEEHRC protocols and standards can be found at http://www.epigenomes.ca/protocols-and-standards, and specific details on ChIP-seq antibody validation can be found on this link. Further information about the CEEHRC and the participating investigators and institutions can be found at http://www.cihr-irsc.gc.ca/e/43734.html. For a full list of source data files with their respective URLs for downloading, see Supplementary Information.
Cell sample data sets in the CEEHRC database were selected based on the following criteria: (i) only data sets with associated age were included and (ii) among these data sets, the group (for both normal and cancer cells) that maximized the number of specific histone H3 modifications present in all data sets was chosen.
ChIP-seq datafile processing
The original ChIP-seq binary datafile format was bigWig. For mapping its ChIP-seq signal into the hg19 assembly, each datafile was processed with standard bioinformatics tools [46–48] in the following pipeline: bigWigToWig → wig2bed --zero-indexed → sort -k1,1 -k2,2n → bedtools map -o median -null 0 -a hg19_all_tss.bed/hg19_all_tss_control.bed to generate an associated BED (Browser Extensible Data) file. (Note: The hg19_all_tss.bed file is a 200bp-per-bin BED reference file with no score values to perform the final ChIP-seq histone modification data mapping onto the 6,000bp-long TSS-adjacent genomic regions. The hg19_all_control.bed file is an analogous BED reference file for mapping the ChIP-seq input data onto 200-bp, 1-kbp, 5-kbp, and 10-kbp genomic windows, see ChIP-seq read profiles and normalization.)
ChIP-seq read profiles and normalization
To quantify and represent ChIP-seq read signal profiles for the histone H3 modifications, data were processed with the same method used in the EFilter multivariate algorithm [23] to predict mRNA levels with high accuracy (R~0.9). Steps in this method comprise (i) dividing the genomic region from 2 kbp upstream to 4 kbp downstream of each TSS into 30 200-bp-long bins, in each of which ChIP-seq reads were later counted; (ii) dividing the read count signal for each bin by its corresponding control (ChIP-seq input) read density to minimize artifactual peaks; and (iii) estimating the control read density within a 1-kbp window centered on each bin if that window contained at least 20 control reads; otherwise, within a 5-kbp window or, if the 5-kbp window also contained fewer than 20 control reads, within a 10-kbp window. When the 10-kbp window was also insufficient, a pseudo-count value of 20 reads per 10 kbp was set as the control read density. This implies that the denominator (i.e., control read density) is at least 0.4 reads per bin (20 reads per 10 kbp × 200 bp per bin).
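The window fallback in step (iii) can be sketched as follows. This is a minimal illustration, not the EFilter implementation: the function names are hypothetical, and it assumes that the same 20-read threshold applies to every window and that densities are rescaled linearly to reads per 200-bp bin.

```python
def control_density_per_bin(reads_1k, reads_5k, reads_10k,
                            bin_bp=200, min_reads=20):
    """Control (input) read density, expressed in reads per bin.

    The smallest window (1, 5, or 10 kbp) holding at least `min_reads`
    control reads is used; if none qualifies, a pseudo-count of
    `min_reads` reads per 10 kbp is returned instead.
    """
    windows = ((reads_1k, 1_000), (reads_5k, 5_000), (reads_10k, 10_000))
    for reads, window_bp in windows:
        if reads >= min_reads:
            return reads * bin_bp / window_bp
    # Pseudo-count: 20 reads per 10 kbp, i.e., 0.4 reads per 200-bp bin.
    return min_reads * bin_bp / 10_000


def normalized_signal(chip_reads_in_bin, reads_1k, reads_5k, reads_10k):
    """ChIP-seq read count of one bin divided by its control read density."""
    density = control_density_per_bin(reads_1k, reads_5k, reads_10k)
    return chip_reads_in_bin / density
```

Under these assumptions the denominator can never fall below 0.4 reads per bin, which matches the lower bound stated above.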
RNA-seq datafile processing
For each strand in the DNA, original datafiles contained mRNA abundances in RPKM (reads per kilobase of transcript per million mapped reads) in bigWig format. These datafiles were thus processed analogously to the ChIP-seq datafiles, i.e., using the pipeline bigWigToWig → wig2bed --zero-indexed → sort -k1,1 -k2,2n → bedtools map -o median -null 0 -a refseq_pos.bed/refseq_neg.bed to obtain associated BED files. (Note: The refseq_pos.bed and refseq_neg.bed files are BED reference files for each strand, with no score values, to perform the final RPKM calculation for each RefSeq mRNA in the hg19 assembly.)
When two or more mRNAs shared the same TSS (i.e., transcription start site with same genomic position and strand) the mean of the respective RPKM values was computed and associated with the corresponding TSS.
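As a minimal sketch of this averaging step, the snippet below groups RPKM values by TSS and takes their mean. The record layout (chromosome, position, strand, RPKM) and the function name are assumptions made for illustration only; they do not come from the original Perl pipeline.

```python
from collections import defaultdict
from statistics import mean


def mean_rpkm_per_tss(records):
    """Average the RPKM of all mRNAs that share one TSS.

    `records` is an iterable of (chrom, tss_position, strand, rpkm) tuples;
    a TSS is identified by its genomic position and strand.
    """
    grouped = defaultdict(list)
    for chrom, position, strand, rpkm in records:
        grouped[(chrom, position, strand)].append(rpkm)
    return {tss: mean(values) for tss, values in grouped.items()}
```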
ChIP-seq/RNA-seq signal data tables
Using the RPKM values processed in this work, a subset TSSdef of all RefSeq mRNA TSSs displaying measured abundance (i.e., RPKM > 0) in all normal and cancer samples was determined. The number of TSSs in this subset TSSdef was 18,220, indicating that ~70% of the 26,048 RefSeq mRNA TSSs annotated in the hg19 assembly had an associated mRNA abundance greater than zero in all (i.e., both normal and cancer) samples. The obtained TSSdef subset thus provided the data analysis with a common basis for all samples that comprises most protein-coding genes annotated in the human genome.
For each sample data entry, 30 genomic bins were defined and denoted by the distance (bp) between their 5′-end and their respective TSSdef genomic coordinate: “-2000”, “-1800”, “-1600”, “-1400”, “-1200”, “-1000”, “-800”, “-600”, “-400”, “-200”, “0” (TSSdef or ‘+1’), “200”, “400”, “600”, “800”, “1000”, “1200”, “1400”, “1600”, “1800”, “2000”, “2200”, “2400”, “2600”, “2800”, “3000”, “3200”, “3400”, “3600”, and “3800”. Then, for each sample data entry, the ChIP-seq read signal was computed for all bins and for all histone modifications (30 bins × 5 modifications=150 signal values) in all TSSdef genomic regions. Data input tables—comprising the histone H3 modifications H3K4me1, H3K9me3, H3K27ac, H3K27me3, and H3K36me3—were thus generated for each sample entry as exemplified next:
The tables were then written to tab-delimited datafiles, which were subsequently classified into two groups: normal and cancer cells (see Table 1).
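The bin layout behind these tables can be reproduced with a few lines. The sketch below only regenerates the 30 bin labels and the 150 column names; the `modification_label` naming scheme for columns is a hypothetical choice, not the format of the original datafiles.

```python
def tss_bin_labels(upstream_bp=2000, downstream_bp=4000, bin_bp=200):
    """Labels of the bins, i.e., the offset (bp) of each bin's 5'-end from the TSS."""
    return [str(offset) for offset in range(-upstream_bp, downstream_bp, bin_bp)]


MODIFICATIONS = ["H3K4me1", "H3K9me3", "H3K27ac", "H3K27me3", "H3K36me3"]

labels = tss_bin_labels()
# Hypothetical column naming: one column per modification and bin.
columns = [f"{mod}_{label}" for mod in MODIFICATIONS for label in labels]
```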
Shannon measures of statistical uncertainty and statistical association
Shannon measures of statistical uncertainty and statistical association were used in this work in order to quantify histone H3 crosstalk at TSSs and its relationship with mRNA levels.
Statistical uncertainty
C.E. Shannon’s seminal work, among other things, introduced the notion of, and a measure for, the uncertainty about discrete random variables [49]. For a discrete random variable X with probability mass function P(X) its uncertainty (also known as Shannon entropy) is defined as
$$H(X) = -\sum_{x} P(x) \log_b P(x)$$
where P(x) is the probability of X=x and b is the logarithm base. When b=2 (the base used in this work), the unit for this measure is the bit. H(X) can also be interpreted as the amount of information necessary to resolve the uncertainty about the outcome of X. Shannon uncertainty was the measure used to estimate the uncertainty about the mRNA abundance level to be resolved in normal cells.
H(X) is typically called marginal uncertainty because it involves only one random variable. In a multivariate scenario, the measure H(X1,…, Xn) is called the joint uncertainty of the set of discrete random variables {X1,…, Xn}, and it is analogously defined as
$$H(X_1,\ldots,X_n) = -\sum_{x_1}\cdots\sum_{x_n} P(x_1,\ldots,x_n) \log_b P(x_1,\ldots,x_n)$$
Another measure important to this work is the conditional uncertainty about a discrete random variable Y, with probability mass function P(Y), given that the value of another discrete random variable X is known. This conditional uncertainty H(Y|X) can be expressed as
$$H(Y|X) = -\sum_{x}\sum_{y} P(x,y) \log_b \frac{P(x,y)}{P(x)}$$
where P(x,y) is the joint probability of X=x and Y=y. Importantly, any measure of Shannon uncertainty (or any other derived Shannon measure) that is conditional on a random variable X can also be understood as said measure being explicitly unrelated to, or statistically independent from, the variable X.
Statistical association
A classic Shannon measure of statistical association of any two discrete random variables X and Y is that of mutual information I, defined as
$$I(X;Y) = H(X) + H(Y) - H(X,Y)$$
Note that if and only if X and Y are statistically independent then I(X; Y)=0, H(X, Y)=H(X)+H(Y), and H(Y|X)=H(Y). To analyze the magnitude of histone H3 crosstalk at TSSs, the two best known multivariate generalizations of mutual information were used in this work. The first is interaction information [50] or co-information [51], also symbolized by I, which is defined analogously to Eq. 8 for a set V of n discrete random variables as
$$I(V) = -\sum_{U \subseteq V} (-1)^{|U|} H(U) \tag{10}$$
where |U| is the cardinality (in this case, the number of random variables) of the subset U. In the case of interaction information I, Shannon uncertainty H is thus summed over all subsets of V (the uncertainty of the empty subset is H(Ø) = 0). Importantly, the interaction information of the random variables {X1,…, Xn} can be decomposed with respect to another random variable Y as follows:
$$I(X_1;\ldots;X_n) = I(X_1;\ldots;X_n;Y) + I(X_1;\ldots;X_n|Y) \tag{11}$$
Interaction information I(X1;…; Xn) captures the statistical association of all variables {X1,…, Xn} taken at once, i.e., excluding all lower-order associations, and it can also take negative values in some cases. Interaction information was used in this work as a means to compute total correlation values.
To specifically quantify the magnitude of histone H3 crosstalk, the second multivariate generalization of mutual information used in this work was total correlation [24] (symbolized by C) or multiinformation [25], which is defined as
$$C(X_1,\ldots,X_n) = \sum_{i=1}^{n} H(X_i) - H(X_1,\ldots,X_n)$$
i.e., as the sum of the marginal uncertainties of the random variables {X1,…, Xn} minus their joint uncertainty. Importantly, and unlike interaction information I, total correlation C captures all possible statistical associations including lower-order associations or, equivalently, all possible associations between any two or more random variables in the set {X1,…, Xn}. This is because the definition of interaction information I in Eq. 10 allows total correlation C to be rewritten as a sum of quantities I for all possible combinations of variables in {X1,…, Xn}, where each combination of k variables enters with sign (−1)^k:
$$C(X_1,\ldots,X_n) = \sum_{k=2}^{n} (-1)^{k} \sum_{1 \le i_1 < \cdots < i_k \le n} I(X_{i_1};\ldots;X_{i_k}) \tag{13}$$
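This expression for total correlation C as a sum of interaction information quantities I along with the sum decomposition of I in Eq. 11 allows C to be decomposed also as a sum:
$$C(X_1,\ldots,X_n) = C_Y(X_1,\ldots,X_n,Y) + C(X_1,\ldots,X_n|Y) \tag{14}$$
where CY(X1,…, Xn, Y) is the sum (analogous to that of Eq. 13) of all interaction information quantities I but now including the random variable Y in each combination of variables in {X1,…, Xn}, i.e.,
$$C_Y(X_1,\ldots,X_n,Y) = \sum_{k=2}^{n} (-1)^{k} \sum_{1 \le i_1 < \cdots < i_k \le n} I(X_{i_1};\ldots;X_{i_k};Y)$$
and where C(X1,…, Xn|Y) is the sum of all conditional interaction information quantities I given Y for each combination of variables in {X1,…, Xn}, i.e.,
$$C(X_1,\ldots,X_n|Y) = \sum_{k=2}^{n} (-1)^{k} \sum_{1 \le i_1 < \cdots < i_k \le n} I(X_{i_1};\ldots;X_{i_k}|Y)$$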
For this work’s purposes, total correlation C was chosen as the measure of statistical association to assess TSS-adjacent histone crosstalk because (i) C is non-negative and thus easier to interpret conceptually, (ii) C is equal to zero if and only if all random variables it comprises are statistically independent, (iii) C captures all possible associations up to a given number of variables (in this work, position-specific histone modification levels), and (iv) C can be decomposed, as shown in Eq. 14, as a sum of two C quantities: one explicitly related to a certain variable Y and the other explicitly unrelated to Y. Property (iv) was useful to decompose the overall histone crosstalk as a sum of an epigenetic and a non-epigenetic component (see Introduction).
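The decomposition in property (iv) can be illustrated numerically. The sketch below is not the R/infotheo code used in this work: it uses a plain plug-in (maximum likelihood) estimator without bias correction, hypothetical function names, and toy data. It computes C(X1,…, Xn|Y) from conditional uncertainties, using H(A|Y) = H(A,Y) − H(Y), and obtains the Y-related component by subtraction, as the sum decomposition of C implies.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Plug-in (maximum likelihood) joint Shannon entropy, in bits."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())


def total_correlation(xs):
    """C(X1,...,Xn): sum of marginal entropies minus the joint entropy."""
    return sum(entropy(x) for x in xs) - entropy(*xs)


def conditional_total_correlation(xs, y):
    """C(X1,...,Xn|Y), using H(A|Y) = H(A,Y) - H(Y)."""
    h_y = entropy(y)
    marginals = sum(entropy(x, y) - h_y for x in xs)
    joint = entropy(*xs, y) - h_y
    return marginals - joint


def decompose(xs, y):
    """Return (C_Y, C(.|Y)); their sum equals C(X1,...,Xn)."""
    c_given_y = conditional_total_correlation(xs, y)
    return total_correlation(xs) - c_given_y, c_given_y


# Toy data (hypothetical): two identical variables that are independent of y.
x1 = [0, 0, 1, 1]
x2 = [0, 0, 1, 1]
y = [0, 1, 0, 1]
c_y, c_given_y = decompose([x1, x2], y)
```

In this toy case the whole association between x1 and x2 (1 bit) is unrelated to y, so the Y-related component is 0 bits and the conditional component is 1 bit.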
An additional Shannon measure of statistical association was used to assess the predictive power of TSS-adjacent histone modification levels on mRNA abundance levels (such power has already been used to predict mRNA levels with high accuracy [23]). The uncertainty coefficient U [52] is defined as
$$U(Y|X_1,\ldots,X_n) = \frac{H(Y) - H(Y|X_1,\ldots,X_n)}{H(Y)}$$
i.e., U(Y|X1,…, Xn) is the relative decrease in uncertainty about Y when {X1,…, Xn} are known (or, equivalently, the fraction of bits in Y that can be predicted by {X1,…, Xn}) and it can take values from 0 to 1. U(Y|X1,…, Xn)=0 implies the set {X1,…, Xn} has no predictive power on Y, whereas U(Y|X1,…, Xn)=1 implies {X1,…, Xn} can predict Y completely.
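A minimal sketch of the uncertainty coefficient follows, again with a plug-in estimator and hypothetical names rather than the code used in this work. The toy data set Y to the exclusive-or of two binary predictors, an extreme case in which neither predictor alone carries any information about Y but the pair predicts it completely.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Plug-in (maximum likelihood) joint Shannon entropy, in bits."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())


def uncertainty_coefficient(y, xs):
    """U(Y|X1,...,Xn) = (H(Y) - H(Y|X1,...,Xn)) / H(Y); requires H(Y) > 0."""
    h_y = entropy(y)
    h_y_given_xs = entropy(y, *xs) - entropy(*xs)
    return (h_y - h_y_given_xs) / h_y


# Toy data (hypothetical): y is the exclusive-or of x1 and x2.
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y = [0, 1, 1, 0]
u_single = uncertainty_coefficient(y, [x1])
u_pair = uncertainty_coefficient(y, [x1, x2])
```

Here the single predictor yields U = 0 and the pair yields U = 1.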
Levels of possible statistical associations when assessing histone crosstalk magnitudes
An important aspect of quantifying the epigenetic and non-epigenetic histone crosstalk components is the specific range of possible statistical associations, in other words, the choice of the number n of TSS-adjacent, position-specific histone H3 modification levels when computing CY(X1,…, Xn, Y) and C(X1,…, Xn|Y). To this end, the minimal n able to predict mRNA levels significantly and non-redundantly (which corresponds to the level of histone crosstalk able to convey a non-negligible amount of epigenetic information content) was first determined. This value is straightforward to assess using the uncertainty coefficient U(Y|X1,…, Xn), where Y represents mRNA levels.
In effect, U(Y|Xi, Xj) (i.e., where n=2) quantifies the predictive power of pairs of position-specific histone modification levels, U(Y|Xi, Xj, Xk) quantifies the predictive power of triads, etc. U(Y|Xi,…, Xn) values were thus computed for singletons, pairs, triads, and tetrads. Singletons were calculated for descriptive purposes only, because histone crosstalk is not measurable for them. On average, a triad (i.e., when n=3) of position-specific histone H3 modification levels was found to have (i) significant predictive power on mRNA levels (U(Y|Xi, Xj, Xk)=0.63) and, importantly, (ii) at least 2.3 times more predictive power than all possible singletons (3) and pairs (3) that exist within a triad taken together, i.e., a phenomenon known as synergy of a set of predictor variables [53] (see Table 2).
Pairs (i.e., when n=2) were also found to possess predictive synergy, but this synergy is smaller than that found for triads. The average predictive power of pairs on mRNA levels is also substantially lower (U(Y|Xi, Xj)=0.07). On the other hand, tetrads (i.e., when n=4) were found to have high predictive power (U(Y|Xi, Xj, Xk, Xl)=0.95) but they possess no synergy whatsoever and display instead what is called redundancy [53]. Based on previous work [23], high predictive power on mRNA levels combined with no synergy is thus expected with a large enough n. From all possible singletons (4), pairs (6), and triads (4) that exist within a tetrad, the explanatory power on mRNA levels of a non-redundant set of only one triad and five pairs already exceeds the explanatory power of the tetrad.
(Note: In previous work it has been argued that RPKM may not always be a suitable unit of mRNA abundance when studying differential gene expression. Specifically, it was shown that, if transcript size distribution varies significantly among the samples, RPKM might introduce significant biases [55]. To overcome this problem, an alternative abundance unit TPM (transcripts per million)—which is an invertible linear transformation of the RPKM value for each sample—was introduced [55]. Nonetheless, this issue was not a problem for the present work because Shannon measures are invariant under any invertible transformation of the discrete random variables.)
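The invariance invoked in the note above can be checked with a toy example. In this sketch the abundance values and the scaling factor are hypothetical; the point is only that an invertible rescaling relabels the outcomes of a discrete variable without changing their frequencies, so the plug-in Shannon uncertainty is unchanged.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Plug-in Shannon entropy (bits) of a sequence of discrete outcomes."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


# Hypothetical discrete abundance levels for one sample.
rpkm_levels = [0.5, 0.5, 2.0, 8.0, 8.0, 8.0]
# Hypothetical per-sample scaling factor standing in for an RPKM-to-TPM conversion.
scale = 1.7
rescaled_levels = [scale * value for value in rpkm_levels]

h_before = entropy(rpkm_levels)
h_after = entropy(rescaled_levels)
```

Both entropies are computed from the same outcome counts (2, 1, and 3 out of 6), so `h_before` and `h_after` coincide.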
Theoretical methods
The elaboration of the main falsifiable prediction took into account two observations for human primary cells in this work. Namely, (i) the uniqueness of triads of position-specific histone modification levels in terms of significant predictive power and predictive synergy and (ii) the post hoc result that triads constitute precisely the level n at which the predicted correlation between the non-epigenetic/epigenetic histone H3 crosstalk log-ratio and cell donor age actually exists. In this way, the main prediction was formulated with explicit dependence on the level of scale: “For any given tissue in any individuated multicellular species a positive correlation between the non-epigenetic/epigenetic histone H3 crosstalk log-ratio and cell donor age will be observed at the level n of histone crosstalk that possesses both significant predictive power and predictive synergy on mRNA levels.”
Statistical tests
The statistical significance of each Pearson correlation coefficient r obtained was assessed using the statistic t defined as
$$t = r\sqrt{\frac{n-2}{1-r^{2}}}$$
which is known to follow a Student’s t-distribution with n–2 degrees of freedom, and where n is the number of data pairs [56]. For the hypothesized positive correlation between the non-epigenetic/epigenetic histone H3 crosstalk log-ratio and age, the statistical null hypothesis was tested against the alternative hypothesis that the correlation is greater than zero (i.e., one-sided Student’s t-test). For the hypothesized non-significant correlation between the overall histone H3 crosstalk magnitude and age, the statistical null hypothesis was tested against the alternative hypothesis that the correlation is greater or less than zero (i.e., two-sided Student’s t-test).
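The test statistic for a Pearson correlation can be sketched as below. The function names are hypothetical and the snippet stops at the t value and its degrees of freedom; obtaining the one- or two-sided p-value additionally requires the cumulative distribution function of Student's t-distribution from a statistics library.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom.

    Undefined for |r| = 1 (division by zero) and for n < 3.
    """
    return r * sqrt((n - 2) / (1 - r * r))
```

For instance, r = 0.6 with n = 18 data pairs gives t ≈ 3.0 on 16 degrees of freedom.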
On the other hand, the distribution of correlation coefficients (r) is known to be non-Gaussian [57], which can be easily appreciated in Fig. 3. For this reason, the statistical comparison of r for normal cells (neonate data point excluded) and cancer cells was performed using the non-parametric Mann-Whitney U test [58].
Correction for multiple testing
The analysis of histone crosstalk involved 5 histone H3 modifications × 30 genomic bins = 150 TSS-adjacent, position-specific histone H3 modification levels. Thus, assessing the statistical significance of the correlation values involved a large number of tests of the null hypothesis (for triads, tetrads, and pairs) under general dependence. This dependence derives from the fact that different histone modification levels are known to be highly correlated (this is the phenomenon of histone crosstalk itself).
The resampling-based procedure by Benjamini and Yekutieli [59] provides control of the false discovery rate (FDR) [60] under general dependence conditions. This method was therefore used in this work to correct for multiple testing.
Code availability
Standard bioinformatics tools [46–48] and the Perl language were used to process the ChIP-seq and RNA-seq source data and to generate the *.nm.dat files displayed in Table 1. The R software [61] and its infotheo package [62] were used for the computation of Shannon measures of statistical uncertainty and statistical association from the *.nm.dat files. Marginal and joint Shannon uncertainties and all the other derived Shannon measures were computed using maximum likelihood (ML) estimation [63] and bias-corrected with the Miller-Madow method [64]. All the R code and the *.nm.dat files necessary for a full reproduction of the results are available as Supplementary Information.
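For readers unfamiliar with the Miller-Madow correction, the sketch below shows the usual form of the estimator for a single variable: the ML (plug-in) entropy plus (m − 1)/(2N) nats, where m is the number of outcomes with non-zero counts and N is the sample size. It is an illustrative re-implementation with hypothetical function names, expressed in bits; it is not the infotheo code used in this work.

```python
from collections import Counter
from math import log, log2


def entropy_ml_bits(values):
    """Maximum likelihood (plug-in) Shannon entropy, in bits."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def entropy_miller_madow_bits(values):
    """ML entropy plus the Miller-Madow bias correction, in bits.

    The correction is (m - 1) / (2N) nats, converted to bits by dividing
    by ln 2, where m is the number of observed (non-empty) outcomes.
    """
    n = len(values)
    m = len(set(values))
    return entropy_ml_bits(values) + (m - 1) / (2 * n * log(2))
```

The correction is always non-negative and shrinks as the sample size N grows relative to the number of observed outcomes.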
ACKNOWLEDGMENTS
I wish to thank Angelika H. Hofmann at SciWri Services for editing this paper into an English I could only hope to write.
REFERENCES
- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].
- [56].
- [57].
- [58].
- [59].
- [60].
- [61].
- [62].
- [63].
- [64].