Abstract
Background I have recently shown that the number of rate-limiting driver events per tumor can be estimated from the age distribution of cancer incidence using the gamma/Erlang probability distribution. It is important to understand how these predictions relate to established risk factors.
Methods The number of rate-limiting driver events per tumor was estimated using the gamma/Erlang distribution and correlated to the percentage of cancer cases attributable to modifiable risk factors.
Results The predicted number of rate-limiting driver events per tumor strongly correlates with the proportion of cancer cases attributable to modifiable risk factors for all cancers except those induced by infection or ultraviolet radiation. The correlation was confirmed for three countries, three corresponding incidence databases and risk estimation studies, as well as for both sexes: USA, males [r=0.80, P=0.002], females [r=0.81, P=0.0003]; England, males [r=0.90, P<0.0001], females [r=0.67, P=0.002]; Australia, males [r=0.90, P=0.0004], females [r=0.68, P=0.01].
Conclusions It is thus confirmed that predictions based on interpreting the age distribution of cancer incidence as the gamma/Erlang probability distribution have biological meaning, validating the underlying Poisson process as the law governing the development of the majority of cancer types, especially those driven by chemical mutagens. Importantly, this study suggests that the majority of driver events (60-80% in males, 50-70% in females) are induced by anthropogenic carcinogens, and not by cell replication errors or other internal processes.
Introduction
There have been multiple attempts to deduce the number of rate-limiting steps in carcinogenesis from the age distribution of cancer incidence or mortality [1]. The proposed models for doing this, however, suffer from several serious drawbacks. For example, early models assumed that cancer mortality increases with age according to the power law [2-4], which is inconsistent with the observed deceleration of mortality growth at an advanced age. Moreover, as high-quality data accumulated, it became clear that, at least for some cancers, incidence even starts to decrease after peaking at some advanced age [5, 6]. More recent models of cancer progression are based on multiple biological assumptions, consist of complicated equations that incorporate many predetermined empirical parameters, and still have not been shown to describe the decrease in cancer incidence at an advanced age [7-12]. It is also clear that an infinite number of such mechanistic models can be created and custom tailored to fit any set of data, leading us to question their explanatory and predictive value.
I have recently proposed that the age distribution of cancer incidence can be interpreted as the statistical distribution of the probability of accumulating the required number of driver events by a given age [13]. I have shown that, of all standard probability distributions, the gamma distribution (and its special case with an integer shape parameter, the Erlang distribution) provides the best fit to the actual age distribution of incidence for the 20 most prevalent cancers [13]. I have then shown that the gamma/Erlang distribution is the only standard distribution that, in addition, approximates incidence for all studied childhood and young adulthood cancers, thus validating it as the universal equation describing cancer incidence [14]. Importantly, the Erlang distribution describes the waiting time for the occurrence of a given number of independent random events, as it was initially devised to calculate call queues at telephone exchanges. It is based on the Poisson process, which implies not only pure randomness of event timings but also their constant average rate. Thus, the excellent fit of the gamma/Erlang distribution to the actual incidence data implies that cancers develop according to the Poisson process, i.e. driver events occur randomly and at a constant average rate.
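The link between the Poisson process and the Erlang distribution can be illustrated with a small simulation. The sketch below is not part of the original analysis: the number of events (5) and the event rate (one per 10 years) are hypothetical values chosen only for illustration. It sums exponential inter-event gaps and compares the sample mean and variance of the resulting waiting times with the Erlang values k/rate and k/rate².

```python
import random
import statistics


def waiting_time(k, rate, rng):
    """Time until the k-th event of a Poisson process with the given rate."""
    # Gaps between consecutive events of a Poisson process are
    # independent exponential variables with mean 1/rate.
    return sum(rng.expovariate(rate) for _ in range(k))


def simulate(k, rate, n, seed=0):
    """Draw n independent waiting times for the k-th event."""
    rng = random.Random(seed)
    return [waiting_time(k, rate, rng) for _ in range(n)]


# Hypothetical values: 5 events, on average one event per 10 years.
K, RATE = 5, 0.1
samples = simulate(K, RATE, n=20000)

# An Erlang variable with shape k and scale 1/rate has
# mean k/rate and variance k/rate**2.
print("mean:    ", statistics.mean(samples), "expected", K / RATE)
print("variance:", statistics.variance(samples), "expected", K / RATE**2)
```

With these settings the simulated mean and variance should fall close to 50 and 500; the agreement is statistical, so the printed values vary slightly with the seed.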
Interestingly, the shape parameter of the gamma/Erlang distribution can be interpreted as the number of rate-limiting driver events that occur by the time of cancer diagnosis. It is thus possible to estimate this number for any cancer type, upon fitting the gamma/Erlang distribution to the actual age distribution of incidence. I have shown that these numbers vary considerably, from 1 in retinoblastoma [14] to 41 in prostate cancer [13]. Next, it is important to show that these predictions correspond to experimentally observed variables, such as the number of driver mutations per tumor predicted from sequencing data. However, the variability of DNA alterations that can contribute to cancer progression, some of which are not yet routinely assessed, and the imperfection of algorithms for separating driver and passenger mutations severely complicate this task, as discussed in [13]. Thus, a simpler correlate is required to prove the meaningfulness of the predictions, before engaging in a full-scale confirmation effort.
Here I identify such a correlate as the percentage of cancer cases due to modifiable risk factors. This is an often-used parameter in epidemiological studies, and is also called the population attributable fraction (PAF). It shows, for example, what percentage of lung cancer cases are caused by smoking tobacco. Combined PAF shows the overall contribution of all potentially modifiable risk factors, which usually include air pollution, occupational hazards, ionizing radiation, smoking, alcohol, poor diet, insufficient exercise, obesity, infection and ultraviolet radiation. Here I show that the numbers of driver events per tumor predicted by the gamma/Erlang distribution strongly correlate with combined PAFs for most cancers, with the exception of cancers with a large contribution from infection or ultraviolet radiation. This confirms that predictions obtained from the gamma/Erlang distribution are meaningful, validating the Poisson process as the law governing the development of most cancer types and fostering the search for correlations with tumor sequencing data. Importantly, the results suggest that up to 80% of driver events are caused by the environment and lifestyle, and not, for example, by stem cell divisions, as has been recently proposed [15, 16].
Methods
I. Data acquisition
a) Population attributable fractions data
Population attributable fractions (PAFs) combining all risk factors were obtained directly from published open-access articles separately for each cancer type and sex. PAFs for the USA were obtained from the publication by Islami et al., Table 2 (Ref[17]). PAFs for England were obtained from the publication by Brown et al., Table 2 (Ref[18]). PAFs for Australia were obtained from the publication by Whiteman et al., Table 2 (Ref[19]). No modification or processing of PAF data was performed.
b) USA incidence data
United States Cancer Statistics Public Information Data: Incidence 1999–2012 was downloaded from the Centers for Disease Control and Prevention Wide-ranging OnLine Data for Epidemiologic Research (CDC WONDER) online database (http://wonder.cdc.gov/cancer-v2012.HTML) in November 2018 (Ref[20]). The United States Cancer Statistics (USCS) are the official federal statistics on cancer incidence from registries having high-quality data for 50 states and the District of Columbia. Data are provided by The Centers for Disease Control and Prevention National Program of Cancer Registries (NPCR) and The National Cancer Institute Surveillance, Epidemiology and End Results (SEER) program. Results were grouped by 5-year Age Groups and Crude Rates were selected as output. Crude Rates are calculated as the number of new cancer cases reported each calendar year per 100,000 population in each 5-year age group. The data were downloaded separately for males and females for each cancer type listed in the publication by Islami et al., Table 2 (Ref[17]).
c) England incidence data
England cancer incidence data were downloaded from the European Cancer Information System (ECIS) Data explorer (https://ecis.jrc.ec.europa.eu/explorer.php?$0-1$1-UK$2-224$4-1,2$3-All$6-5,84$5-1999,2012$7-2$CRatesByCancer$X0_10-ASR_EU_NEW) in November 2018 (Ref[21]). The ECIS database contains the aggregated output and the results computed from data submitted by population-based European cancer registries participating in the European Network of Cancer Registries – Joint Research Centre (ENCR-JRC) project on “Cancer Incidence and Mortality in Europe”. Years of observation were limited to the 1999–2012 period, to match the USA data. Incidence is calculated as the number of new cancer cases reported each calendar year per 100,000 population in each 5-year age group. The data were downloaded separately for males and females for each cancer type listed in the publication by Brown et al., Table 2 (Ref[18]), except for vulva and vagina cancers, as their selection was not possible in ECIS Data explorer.
d) Australia incidence data
Australia cancer incidence data were downloaded from the Cancer Incidence in Five Continents (CI5) Volume XI Age-specific curves Online Analysis tool (http://ci5.iarc.fr/CI5-XI/Pages/age-specific-curves_sel.aspx) in November 2018 (Ref[22]). CI5 is published approximately every five years by the International Agency for Research on Cancer (IARC) and the International Association of Cancer Registries (IACR) and provides comparable high quality statistics on the incidence of cancer from cancer registries around the world. Volume XI contains information from 343 cancer registries in 65 countries for cancers diagnosed from 2008 to 2012. Incidence is calculated as the number of new cancer cases reported each calendar year per 100,000 population in each 5-year age group. The data were downloaded separately for males and females for each cancer type listed in the publication by Whiteman et al., Table 2 (Ref[19]).
II. Data selection and analysis
a) Estimation of the number of driver events per tumor
For analysis, the incidence data were imported into GraphPad Prism 6. The following age groups were selected: “5–9 years”, “10–14 years”, “15–19 years”, “20–24 years”, “25–29 years”, “30–34 years”, “35–39 years”, “40–44 years”, “45–49 years”, “50–54 years”, “55–59 years”, “60–64 years”, “65–69 years”, “70–74 years”, “75–79 years” and “80–84 years”. Earlier age groups were excluded due to possible contamination by childhood subtype incidence, and “85+ years” was excluded due to an undefined age interval. If in the first several age groups (“5–9 years”, “10–14 years”, “15–19 years”) incidence initially decreased with age, reflecting contamination by childhood subtype incidence, these values were removed until a steady increase in incidence was detected. The middle age of each age group was used for the x values, e.g. 17.5 for the “15–19 years” age group. Incidence (new cancer cases per calendar year per 100,000 population) for each age group and each cancer type was used for the y values. Data for different countries, as well as for males and females, were analyzed separately. Data were analyzed with Nonlinear regression using the following User-defined equation for the gamma distribution:
$$y = A \cdot \frac{x^{k-1}\, e^{-x/b}}{b^{k}\, \Gamma(k)}$$
The amplitude parameter A was constrained to “Must be between zero and 100000.0” and scale and shape parameters b and k to “Must be greater than 0.0”. “Initial values, to be fit” for all parameters were set to 1.0. All other settings were kept at default values, e.g. Least squares fit and No weighting.
The numerical value of the shape parameter k rounded to the nearest integer is interpreted as the number of driver events per tumor [13].
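The estimation procedure can be sketched in code as follows. This is only an illustration under stated assumptions, not the Prism workflow itself: the model is taken to be the gamma density with amplitude A, scale b and shape k; a brute-force grid search stands in for Prism's nonlinear regression; the rule for dropping early decreasing values is one possible reading of the text; and the incidence values are synthetic (generated from A = 5000, b = 10, k = 8), not real data. All function names are hypothetical.

```python
import math


def gamma_model(x, A, b, k):
    """Amplitude-scaled gamma density: A * x^(k-1) * exp(-x/b) / (b^k * Gamma(k))."""
    # Work in log space so that large k does not overflow.
    log_pdf = (k - 1) * math.log(x) - x / b - k * math.log(b) - math.lgamma(k)
    return A * math.exp(log_pdf)


def midpoints(first_start=5, n_groups=16, width=5):
    """Middle ages of the 5-year groups 5-9 ... 80-84 (e.g. 17.5 for 15-19)."""
    return [first_start + width / 2 + width * i for i in range(n_groups)]


def drop_initial_decrease(xs, ys):
    """Assumed rule: drop leading points while incidence still falls with age."""
    start = 0
    while start + 1 < len(ys) and ys[start + 1] < ys[start]:
        start += 1
    return xs[start:], ys[start:]


def fit_gamma(xs, ys, k_values, b_values):
    """Unweighted least-squares fit by grid search; returns (sse, A, b, k)."""
    best = None
    for k in k_values:
        for b in b_values:
            g = [gamma_model(x, 1.0, b, k) for x in xs]
            denom = sum(v * v for v in g)
            if denom == 0.0:
                continue
            # For fixed k and b the model is linear in A, so the
            # least-squares amplitude has a closed form.
            A = sum(y * v for y, v in zip(ys, g)) / denom
            A = min(max(A, 0.0), 100000.0)  # amplitude bounds from the text
            sse = sum((y - A * v) ** 2 for y, v in zip(ys, g))
            if best is None or sse < best[0]:
                best = (sse, A, b, k)
    return best


# Synthetic, noise-free incidence curve (NOT real data).
xs = midpoints()
ys = [gamma_model(x, 5000.0, 10.0, 8.0) for x in xs]
xs, ys = drop_initial_decrease(xs, ys)

# Hypothetical search grid: k from 1 to 50 and b from 1 to 30, step 0.5.
k_grid = [1 + 0.5 * i for i in range(99)]
b_grid = [1 + 0.5 * i for i in range(59)]
fit = fit_gamma(xs, ys, k_grid, b_grid)

sse, A_hat, b_hat, k_hat = fit
print("k =", k_hat, "-> driver events per tumor:", round(k_hat))
```

A grid search is used here only because it needs nothing beyond the standard library and its behaviour is easy to check; a proper nonlinear least-squares routine would fit continuous parameter values and would be the closer analogue of the procedure described above.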
b) Correlation of the predicted numbers of driver events per tumor with PAFs
The obtained k values were correlated to population attributable fractions (PAFs) in GraphPad Prism 6 using the inbuilt Correlation tool at default settings, e.g. Pearson correlation with two-tailed P value. Cancer types were sorted into two classes, and correlation was performed separately for each class. Cancer types in which infection (Helicobacter pylori, Hepatitis B virus, Hepatitis C virus, Human herpes virus type 8: Kaposi sarcoma herpes virus, Human immunodeficiency virus and Human papillomavirus) or ultraviolet radiation contributed to more than 30% of cases, for a given country according to the published PAF data [17-19], were assigned to Class 2 (non-anthropogenic). The rest were assigned to Class 1 (anthropogenic), which included cancers with substantial contribution from air pollution, occupational exposure, exposure to ionizing radiation, smoking and exposure to secondhand smoke, alcohol intake, poor diet (red and processed meat, insufficient fiber, vegetables, fruit and calcium), excess body weight, insufficient physical activity, insufficient breastfeeding, postmenopausal hormone therapy and oral contraceptives, according to the published PAF data [17-19].
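The class assignment and the correlation step can be sketched as below. This is an illustration only: the records are hypothetical placeholders rather than values from Refs [17-19], the function names are invented for this sketch, and treating infection and ultraviolet radiation as two separate percentages compared against the 30% threshold is an assumption about how the rule is applied. The two-tailed P value is not computed; the sketch stops at the t statistic, from which P follows via Student's t distribution with n − 2 degrees of freedom.

```python
import math


def assign_class(infection_paf, uv_paf, threshold=30.0):
    """Return 2 (non-anthropogenic) if infection or UV exceeds the threshold, else 1."""
    return 2 if max(infection_paf, uv_paf) > threshold else 1


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t value for the null hypothesis of zero correlation (requires |r| < 1)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical records (NOT real data):
# (label, k, combined PAF %, infection PAF %, UV PAF %)
records = [
    ("type A", 4, 20.0, 0.0, 0.0),
    ("type B", 9, 35.0, 2.0, 0.0),
    ("type C", 14, 55.0, 0.0, 0.0),
    ("type D", 22, 80.0, 1.0, 0.0),
    ("type E", 3, 90.0, 0.0, 88.0),
    ("type F", 5, 75.0, 70.0, 0.0),
]

# Sort cancer types into the two classes.
by_class = {1: [], 2: []}
for label, k, paf, infection, uv in records:
    by_class[assign_class(infection, uv)].append((k, paf))

# Correlate k with combined PAF within Class 1.
ks, pafs = zip(*by_class[1])
r1 = pearson_r(ks, pafs)
t1 = t_statistic(r1, len(ks))
print("Class 1: n =", len(ks), "r =", round(r1, 3), "t =", round(t1, 2))
print("Class 2: n =", len(by_class[2]))
```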
Results
To estimate the numbers of driver events per tumor, the gamma distribution was fitted to the actual age distributions of incidence separately for males and females in three countries – USA, England and Australia (Figure 1 and Table 1). The fits were generally excellent (R²=0.99), except for brain cancer (R²=0.98), thyroid cancer (R²=0.97), and several virus-induced cancers: pharyngeal (R²=0.98), nasopharyngeal (R²=0.93), vulvar (R²=0.98), cervical (R²=0.77), Kaposi sarcoma (R²=0.67) and Hodgkin lymphoma (R²=0.34). Due to the unsatisfactory fits, the last three cancer types were excluded from further analysis. Successful fitting of the remaining cancer types allowed the estimation of the numbers of driver events per tumor using the shape parameter of the gamma distribution.
Plotting the number of driver events per tumor predicted from the gamma distribution against the estimated percentage of cases due to modifiable risk factors obtained from the published studies revealed that cancers appear to cluster in two classes. Class 1, which included the majority of cancers, demonstrated a linear correlation, whereas Class 2 clustered in the upper left corner of the plot in a cloud-like fashion. Inspection of Class 2 revealed that it consists entirely of cancers with a substantial (>30%) contribution of infection to their pathogenesis, plus melanoma. Class 2 was therefore named “non-anthropogenic”, as infections and ultraviolet radiation existed long before human civilization. Interestingly, all cancers in Class 1 were induced by factors that arose with human civilization, such as air pollution, occupational hazards, ionizing radiation, smoking, alcohol, poor diet, insufficient exercise, obesity, insufficient breastfeeding, postmenopausal hormone therapy and oral contraceptives. Therefore, Class 1 was termed “anthropogenic”.
The correlation of the predicted number of driver events per tumor with the estimated percentage of cases due to modifiable risk factors for cancers in males is shown in Figure 2 and Table 2, and in females in Figure 3 and Table 3. It can be seen that anthropogenic cancers indeed exhibit a strong correlation for all studied countries and for both sexes, whereas non-anthropogenic cancers show no correlation in any of the cases. Amongst anthropogenic cancers, the correlation is stronger and more significant for males than for females. Interestingly, the correlation is stronger and more significant for American females [r=0.81, P=0.0003] than for English [r=0.67, P=0.002] and Australian [r=0.68, P=0.01] females, but weaker and less significant for USA males [r=0.80, P=0.002] than for English [r=0.90, P<0.0001] and Australian [r=0.90, P=0.0004] males. These differences are likely explained by differing exposures to risk factors between countries and between sexes, as well as by variations in the screening, diagnostics and reporting protocols of different countries, in the sets of cancers included in the studies from which risk factor data were obtained, and in the methodologies of those studies. The role of population genetics also cannot be ruled out.
Discussion
One of the most interesting findings of this study is the clustering of all cancers into two classes, termed here anthropogenic and non-anthropogenic. A possible explanation for this dichotomy is that the human body managed to evolve some protective countermeasures against cancer risk factors that were present for millions of years, whereas it appears unprepared for the novel risk factors brought by our civilization. For example, ultraviolet radiation has been present on Earth since the beginning, and although melanocytes cannot completely protect their DNA, and a lot of DNA damage occurs, it is likely that they developed a very slow division rate [23] to avoid conversion of this damage into mutations for as long as possible. This may explain why only a few rate-limiting driver events are predicted for melanoma despite the large amount of DNA damage that melanocytes receive; the rate-limiting step in this case is cell division and not DNA damage. Similarly, the human body had plenty of time to adapt to viruses and install some blocks which are difficult for viruses to overcome, which may explain why the incidence rates of virus-induced cancers are low, and fewer driver events are predicted than would be expected from the linear correlation. It is also clear that viruses induce cancer via different mechanisms than chemical carcinogens [24, 25], and thus the development of such cancers may not be described by the Poisson process. Indeed, many of the virus-induced cancers have rather poor fits of the Erlang distribution to their age distributions of incidence (Table 1).
The strong positive correlation of the predicted number of driver events per tumor with the contribution from anthropogenic risk factors suggests that the majority of driver events are caused by those factors. In other words, the higher the number of driver events required for a given cancer type to appear, the less likely they are to occur by chance (e.g. due to replication errors), and the more dependent they are on anthropogenic carcinogens to be induced. Indeed, as r² is called “the coefficient of determination” and describes the proportion of the variance in one variable that is explained by the other variable, we can calculate (by squaring Pearson r values from Figures 2 and 3) that anthropogenic risk factors explain 64%, 81% and 81% of the variance in the predicted number of driver events per tumor for males and 66%, 45% and 46% of the variance for females, living in USA, England and Australia, respectively. This is in accord with the mainstream view that the environment and lifestyle are the major contributors to carcinogenesis, but conflicts with the recently proposed view that the majority of cancers develop due to replicative mutations occurring during stem cell division [15, 16]. The latter view is based on predominantly mouse data handpicked from varied publications and processed through calculations with unobvious assumptions, and thus has been widely criticized [26-32].
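The arithmetic behind these percentages can be reproduced in a few lines. The sketch below simply squares the Pearson r values quoted in the text and rounds to whole percentages; the dictionary layout and variable names are only an illustrative choice.

```python
# Pearson r values as reported in the text for anthropogenic cancers.
r_values = {
    "males": {"USA": 0.80, "England": 0.90, "Australia": 0.90},
    "females": {"USA": 0.81, "England": 0.67, "Australia": 0.68},
}

# Coefficient of determination r^2, expressed as a whole percentage.
explained = {
    sex: {country: round(r * r * 100) for country, r in by_country.items()}
    for sex, by_country in r_values.items()
}

for sex, by_country in explained.items():
    print(sex, by_country)
```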
It is also interesting to speculate why the observed correlations are stronger for males than for females. One likely explanation is that males generally are more exposed to chemical mutagens, e.g. through smoking and work in hazardous industries [17-19], which directly induce mutations in the DNA, some of which happen to be drivers. On the other hand, females have a higher contribution to cancer risk from disturbances in physiology, usually related to hormone levels, such as being obese, using oral contraceptives, undergoing postmenopausal hormone therapy or abstaining from breastfeeding [17-19]. These risk factors may not lead to an increase in the number of overall and driver mutations (discrete events), but promote cancer via changes in intracellular signaling levels or the microenvironment (gradual change) [33-37]. The latter cannot be detected and counted using the gamma/Erlang distribution, which is capable of recognizing only discrete random events.
Overall, the correlations identified here serve as validation of the hypothesis that most cancers develop according to the Poisson process and that the gamma/Erlang distribution can be used to predict the number of driver events per tumor for most cancer types, especially those driven by chemical mutagens [13, 14]. This has numerous implications, from the fundamental understanding of the carcinogenesis process to the improvement of driver prediction algorithms.