Abstract
In recent years, many efforts in clinical and basic research have focused on finding molecular features of tumor samples with prognostic or classification potential. Among these, the association of the expression of gene signatures with survival probability is of special interest given its relatively direct applicability in the clinic and its potential to shed light on the molecular basis of cancer.
Although great efforts have been invested in data processing to control for unknown sources of variability in a gene-wise manner, little is known about the behaviour of gene signatures with respect to the effect of technical variables.
Here we show that the association of signatures with survival may be biased for technical reasons, and we propose a simple and computationally inexpensive methodology based on correction by expectation under gene randomization. The resulting estimates are centred around zero and ensure correct asymptotic inference. Moreover, our methodology is robust against spurious correlations between global dataset tendencies and clinical outcome.
All tools, together with processed datasets for colorectal and breast cancer, will soon be available in the "HRunbiased" R package.
In recent years, many efforts in clinical and basic research have focused on finding molecular features of tumor samples with prognostic or classification potential [1]. Large cohorts consisting of transcriptional and genetic data have been generated with the aim of characterizing tumor subtypes, understanding biological processes linked to tumorigenesis and identifying molecular profiles associated with clinical parameters [2, 3, 4]. Among these, the prognostic power associated with the expression of single genes or gene signatures has been of special interest, potentially leading to insight into the mechanisms involved in relapse and metastasis, and to the development of prognostic tests for clinical application [5, 6, 7, 8] (Fig 1a).
Summarization of expression data from gene sets to measure a feature of interest is justified by biological and statistical reasons. Studies on large cohorts have shown that pathways are altered through a wide variety of mechanisms, resulting in very low prevalence of single gene alterations [9] (Fig 1b above). As a consequence, transcriptomic analyses at the gene level may suffer from low statistical power and reproducibility [10, 11]. From the statistical perspective, the real status of a pathway is more accurately and sensitively captured by a combination of the expression values of its genes [12] (Fig 1b below). A variety of methods are available for transcriptomic data summarization: average of standardized expression values [12], reduction to a single component derived from Singular Value Decomposition (SVD) and related methods [13], or summaries based on Gene Set Enrichment Analysis (GSEA) [14, 15].
High-throughput data is susceptible to technical variability that can mask true biological information (decrease of statistical power) and/or lead to erroneous conclusions (bias) [16, 17] even if substantial effort has been invested in data processing [18, 19]. For this reason, a number of statistical approaches exist that aim to identify and control for unknown sources of variability while estimating gene expression in a gene-wise manner [18, 20]. Nevertheless, the impact of such effects on gene set summarization may be severe due to possible coordinated effects on all or a large fraction of genes [19, 21]. Therefore, an evaluation of this phenomenon is needed in the specific context of gene set summarization.
In this work, we evaluate two widely used methodologies for pathway summarization: Gene Set Variation Analysis (GSVA) [15] and a z-score based method (ZScore) [12]. Concerns regarding bias and statistical power are explored in a collection of public datasets [22, 23, 24, 25] focusing on cancer prognosis assessment. Drawbacks associated with these methodologies are identified and addressed using a simple strategy. Implications of these findings for conventional statistical inference are discussed, especially concerning the interpretation of asymptotic tests.
We downloaded and processed 3 independent cohorts of colorectal cancer (CRC) and 6 independent cohorts of breast cancer (see online methods 1), and generated random signatures ranging in size from 1 (single gene) to 500.
For each signature, we applied two strategies for pathway summarization: i) ZScore: gene expression values were converted to z-scores and averaged for each sample, using a similar approach to that in [12]; and ii) GSVA: Gene Set Variation Analysis [15] was applied to each sample and the resulting scores were used as summaries. In both cases scores were centred and scaled to make association measures comparable. For each dataset separately, a Cox model was fitted to each score to evaluate its relationship with disease-free survival (see online methods 2). The log hazard ratio (lHR) and its associated test statistic (lHR-stat) were used as measures of association. Methods based on SVD represent a variation of the ZScore approach and were not evaluated in this work, as they are prone to be dominated by a small number of genes capturing a specific signal that can differ from the biological target (Supp. Fig. 8.1).
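The ZScore summarization described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names, the toy expression values and the use of the population standard deviation are assumptions made only for illustration.

```python
from statistics import mean, pstdev

def zscore(values):
    """Standardize a list of values to mean 0 and SD 1.
    Assumes a non-zero SD (constant genes must be removed beforehand)."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]

def signature_score(expr, genes):
    """ZScore summary: z-score each signature gene across samples,
    average the z-scores per sample, then centre and scale the result."""
    z = [zscore(expr[g]) for g in genes]
    per_sample = [mean(col) for col in zip(*z)]
    return zscore(per_sample)

# Toy expression matrix (hypothetical values): gene -> expression in 4 samples
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [5.0, 3.0, 4.0, 6.0],
}
score = signature_score(expr, ["geneA", "geneB"])
```

The resulting per-sample score would then be used as the covariate of the Cox model fitted in each dataset.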
For the ZScore method, we found that the means of the random lHRs deviated substantially from zero, in a direction that depended on the dataset under study, suggesting a strong component of technical origin causing these deviations (Fig 1c, Fig. S1). Moreover, the proportion of random signatures that had a significant association with relapse was far from the 5% expected under the hypothesis of lHR = 0 at the 5% significance level, compromising the validity and interpretation of statistical inference based on asymptotic assumptions. As the number of genes increased, the correlation between random signatures and a Global Signature (GS) including all genes in the dataset quickly converged to 1 (Fig. S3), which also translated into convergence in terms of lHRs (Fig. 1e). These results were confirmed in different simulated scenarios and were evident even when only small deviations from zero were observed on average at the gene level (Supp. mat. 3 for the permutation-based study). In contrast, GSVA scores were approximately centred on the (a-priori) expected value of zero, with similar distributions for all considered sizes (Fig 1c).
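The convergence of random signatures towards the GS can be reproduced in a small simulation. The sketch below is illustrative only and does not use the datasets of this work: the number of genes and samples, the strength of the shared sample-level effect (0.3) and all function names are assumptions.

```python
import random
from statistics import mean, pstdev

def zscore(values):
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (pstdev(x) * pstdev(y))

def column_means(rows):
    """Average a list of gene rows sample by sample."""
    return [mean(col) for col in zip(*rows)]

rng = random.Random(0)
n_genes, n_samples = 400, 60

# Hypothetical sample-level effect shared by all genes (e.g. a technical factor)
shared = [rng.gauss(0, 1) for _ in range(n_samples)]
z = [zscore([0.3 * s + rng.gauss(0, 1) for s in shared]) for _ in range(n_genes)]

# Global Signature: ZScore summary over all genes in the dataset
gs = column_means(z)

def mean_corr_with_gs(size, n_rep=50):
    """Mean correlation between random signatures of a given size and the GS."""
    cors = []
    for _ in range(n_rep):
        idx = rng.sample(range(n_genes), size)
        cors.append(pearson(column_means([z[i] for i in idx]), gs))
    return mean(cors)

corr_by_size = {k: mean_corr_with_gs(k) for k in (1, 10, 100)}
```

Under these assumptions, even a weak effect shared across genes dominates the average as the signature grows, so the correlation with the GS increases with signature size.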
These observations motivated the use of the GS to correct the signatures by the value expected if their genes were selected at random, following a strategy analogous to that in [22]. Two approaches were used for GS correction: i) for each sample, we subtracted the Global Signature (GS) values from those obtained by the ZScore method before performing hypothesis testing (GS-cor); and ii) the GS was included as a covariate in the Cox models used for prognosis evaluation (GS-adj). The resulting lHRs of random signatures from GS-cor and GS-adj showed densities centred around zero and with similar variances across all signature sizes (Fig 1d). Moreover, asymptotic inference yielded a rejection rate of approximately 5% under the null hypothesis of zero lHR, similar to that obtained using GSVA scores (Supp. Tables 3.1-3.6).
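The two corrections can be written as Cox models in the following sketch. The notation is ours and is not taken from the online methods: S_i denotes the ZScore summary of the target signature in sample i, G_i the GS value in the same sample, and h_0(t) the baseline hazard. Taking lHR-stat to be the Wald statistic is an assumption.

```latex
% GS-cor: the corrected score S_i - G_i is the only covariate
h_i(t) = h_0(t)\,\exp\{\beta\,(S_i - G_i)\}

% GS-adj: the GS enters the model as a separate covariate
h_i(t) = h_0(t)\,\exp\{\beta_S S_i + \beta_G G_i\},
\qquad \mathrm{lHR} = \hat{\beta}_S,
\qquad \text{lHR-stat} = \hat{\beta}_S / \widehat{\mathrm{se}}(\hat{\beta}_S)
```

Leaving aside any rescaling of the corrected score, GS-cor corresponds to GS-adj under the constraint \beta_G = -\beta_S, whereas GS-adj estimates \beta_G from the data.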
Despite the conceptual similarities among GSVA, GS-cor and GS-adj (see online methods 3), one key difference is that GS-adj accounts both for the magnitude and the direction of the GS scores and for their correlation with the target signature in relation to the outcome, by means of the variance-covariance matrix of the model coefficients. To explore this aspect, we compared the summarization methods in different simulated scenarios (see online methods 4); for type I error, random signatures were categorized according to their correlations with the GS before assessment, while a set of positive control genes (F-TBRS) known to be associated with higher risk in colorectal cancer [23] was used for evaluation of type II error.
For methods based on a-priori correction (GSVA and GS-cor), random signatures produced lHR distributions shifted from zero by an amount that depended on the correlation between the target signature and the GS (Fig 2a). This dependency was also observed for type II error: while negative correlations increased the chance of detecting the association between F-TBRS and relapse, positive correlations considerably decreased statistical power. These results indicate that GSVA and GS-cor are not exempt from biases such as those in Fig. 1b and result in impaired asymptotic inference, possibly due to overcorrection of the global tendency of the dataset. In contrast, lHRs derived from GS-adj showed virtually identical and centred distributions for all correlation levels in the null hypothesis setups (Fig 2a), and only the expected decrease in statistical power due to collinearity was observed as the (anti)correlation of the GS and F-TBRS increased (Fig 2b). In the real CRC data, GS-adj performed as well as or better than GS-cor and GSVA in terms of statistical power, except for GSE14333 (Fig 2c, Supp. mat. 4). In agreement with the simulation results, this dataset was the only one showing a clear inverse association between the GS and relapse (Fig 1c, Fig 2a) and a negative correlation with the F-TBRS (corr = -0.2), which suggests overcorrection by GSVA and GS-cor as an explanation for their apparently higher performance (Supp. mat. 2 for extra simulations).
For settings where the GS is suspected to carry unknown or unidentifiable biological information, we modified the definition of the GS by including only low-variance genes in its computation (LV-adj), assuming that they are less likely to carry real biological information. For the CRC datasets, we found an increase in statistical power at the expense of efficiency in bias correction (Fig 2d). The main surrogate variables and main factors found by the SVA [18] and RUV [20] approaches, respectively, were also considered as covariates in the adjusted Cox models, without improving on the results of LV-adj (Supp. mat. 5).
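A GS restricted to low-variance genes could be computed along the following lines. This is a minimal sketch under stated assumptions: the criterion used to define low-variance genes is not given here, so the `fraction` cutoff, the function name and the toy data are hypothetical.

```python
from statistics import mean, pstdev, pvariance

def low_variance_gs(expr, fraction=0.5):
    """Global Signature computed only from the `fraction` of genes with the
    lowest variance. The cutoff is a hypothetical choice for illustration.
    Assumes no gene has zero variance."""
    ranked = sorted(expr, key=lambda g: pvariance(expr[g]))
    keep = ranked[: max(1, int(len(ranked) * fraction))]
    z = []
    for g in keep:
        mu, sd = mean(expr[g]), pstdev(expr[g])
        z.append([(v - mu) / sd for v in expr[g]])
    # Average the z-scores of the retained genes sample by sample
    return keep, [mean(col) for col in zip(*z)]

# Toy expression matrix (hypothetical values): gene -> expression in 4 samples
expr = {
    "g1": [5.0, 5.1, 4.9, 5.0],   # low variance
    "g2": [1.0, 9.0, 2.0, 8.0],   # high variance
    "g3": [3.0, 3.2, 2.9, 3.1],   # low variance
    "g4": [0.0, 6.0, 12.0, 3.0],  # high variance
}
kept, lv_gs = low_variance_gs(expr, fraction=0.5)
```

The resulting per-sample values would replace the full GS as the adjustment covariate in the Cox model.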
All the analyses in this work were also performed on a selection of BRCA datasets using the MammaPrint [7] signature for prediction of relapse (Supp. mat. 1), which led to analogous conclusions. Although to a lesser degree, issues related to biases and type II error persisted when datasets were combined by merging their expression matrices (Online methods 5 for details) (Supp. mat. 6).
In conclusion, we urge caution in the use of gene expression signatures to find associations with survival and prognosis in high-throughput data since, due to the existence of unknown sources of technical variation, the derived estimates may be biased and the corresponding inference inaccurate. Our results show that these biases are driven by overall trends that are present in the gene correlation structure of the expression matrix and by the association of these trends with the outcome, are possibly dataset specific, and quickly increase with the signature size. To address this problem, we propose a simple, flexible and computationally inexpensive method (GS-adj) that summarizes the overall data signal to be used as a confounding factor in the statistical analysis. In contrast to existing methodology, our suggested approach accounts not only for the magnitude but also for the correlation of the GS and the signature being evaluated and their association with the outcome. This strategy ensures the validity of standard asymptotic inference for signature testing, and confers an advantage in terms of accuracy and interpretability in comparison to other methods based on a-priori correction such as GSVA (Supplementary discussion).
The methods and interpretations derived in this work can be naturally extended to a vast range of high-throughput data domains such as proteomics, methylation or metabolomics, and to many other types of outcomes such as continuous measures (correlation) or binary status (logistic regression). Finally, we will provide preprocessed datasets for CRC and BRCA, together with a set of tools to diagnose custom datasets and to report adjusted lHR and p-values for different covariates (both to be available soon).