A Knowledge-guided Mechanistic Model of Synthetic Lethality in the HCT116 Vorinostat-resistant Colon Cancer Xenograft Model Cell-line

Paul Aiyetan

doi:10.1101/2021.06.22.449530

Abstract

With an overall lifetime risk of about 4.3% and 4.0% in men and women respectively, colorectal cancer remains the third leading cause of cancer-related deaths in the United States. In persons aged 55 and below, its rate increased at 1% per year in the years 2008 to 2017 despite the steady decline associated with improved screening, early diagnosis and treatment in the general population. Besides standardized therapeutic regimens, many trials continue to evaluate the potential benefits of vorinostat, mostly in combination with other anti-neoplastic agents, for its treatment. Vorinostat, an FDA-approved anti-cancer drug also known as suberoylanilide hydroxamic acid (SAHA), is a histone deacetylase (HDAC) inhibitor that, through many mechanisms, causes cancer cell arrest and death. However, as with many other anti-neoplastic agents, resistance and/or failures have been observed. In the HCT116 colon cancer cell line xenograft model, exploiting potential lethal molecular interactions by additional gene knockouts restored vorinostat sensitivity. This phenomenon, known as synthetic lethality, offers a promise to selectively target cancer cells. Although the underlying molecular processes are not clearly delineated, it has been demonstrated as an effective cancer-killing mechanism. In this study, we aimed to elucidate mechanistic interactions in multiple perturbations of identified synthetically lethal experiments, particularly in the vorinostat-resistant HCT116 (colon cancer xenograft model) cell line.
Given that previous studies showed that knocking down GLI1, a downstream transcription factor involved in the Sonic Hedgehog pathway (an embryonal gene regulatory process), restored vorinostat sensitivity in the HCT116 colorectal cancer cell line, we hypothesized that vorinostat resistance is a result of upregulation of embryonal cellular differentiation processes. We further hypothesized that the elucidated regulatory mechanisms would include crosstalk that regulates this biological process. We employed a knowledge-guided fuzzy logic regulatory inference method to elucidate mechanistic relationships. We validated inferred regulatory models in independent datasets. In addition, we evaluated the biomedical significance of key regulatory network genes in an independent clinically annotated dataset. We found no significant evidence that vorinostat resistance is due to an upregulation of embryonal gene regulatory pathways. Our observations rather support a topological rewiring of canonical oncogenic pathways around the PIK3CA, AKT1, RAS/BRAF etc. regulatory pathways. Reasoning that significant regulatory network genes are likely implicated in the clinical course of colorectal cancer, we show that the expression profiles of the identified key regulatory network genes are able to predict short- to medium-term survival in colorectal cancer patients, providing a rational basis for prognostication and for potentially effective combinations of therapeutics that target these genes along with vorinostat in the treatment of colorectal cancer.

Introduction

The quest for effective therapies for colorectal cancer, particularly in younger patients with advanced disease, has never been more imperative. With an overall lifetime risk of approximately 4.3% and 4.0% in men and women respectively[1, 2], colorectal cancer is the second leading cause of cancer-related deaths in the United States[3]. In persons aged 50 and below, its rate increased at 2% per year in the years 2012 to 2016 despite the steady decline associated with improved screening, early diagnosis and treatment in the general population[2, 3]. According to the Centers for Disease Control and Prevention (CDC), in 2017, 141,425 new cases of colorectal cancer were reported, and 52,547 people died of it[4]. The CDC estimates that for every 100,000 people, 37 new colorectal cancer cases are reported and 14 people die of this cancer[4].

Historically, risk factors have been classified as modifiable and non-modifiable[5]. Modifiable factors include being overweight, a sedentary lifestyle, a diet rich in red and processed meat and sugars, smoking and alcohol consumption, while non-modifiable factors include increasing age, history of inflammatory bowel disease, polyps, family history of colorectal cancer, ethnicity, type II diabetes mellitus, and familial or inherited syndromes [5]. Although familial or hereditary factors account for only a third of colorectal cancer diagnoses, their molecular basis has enabled a fundamental understanding of the etiopathogenesis of the disease. These include Lynch syndrome (hereditary non-polyposis colon cancer or HNPCC), which is primarily associated with defects in the MLH1, MSH2 or MSH6 genes and accounts for about 2% to 4% of all colorectal cancers; familial adenomatous polyposis coli (FAP), which accounts for 1% of colorectal cancers; Peutz-Jeghers syndrome (PJS); and MUTYH-associated polyposis (MAP). Associated with mutations in the APC gene, FAP-related colorectal cancer consists of three sub-types with almost specific clinical features. These include: attenuated FAP, associated with fewer polyps and development of colorectal cancer at a later age than is typical; Gardner syndrome, associated with tumors of the soft tissues, bones and skin; and Turcot syndrome, associated with a higher risk of colorectal cancer and a predisposition to developing medulloblastoma, a brain cancer. Usually diagnosed at a younger age, PJS is associated with mutations in the STK11 (LKB1) gene while, as its name implies, MAP is caused by mutations in the MUTYH gene[5]. These associated genetic defects are characteristically those of genes involved in tumor suppression and DNA repair mechanisms [6].

Besides standardized therapeutic regimens, many trials continue to evaluate the potential benefits of vorinostat, mostly in combination with other anti-neoplastic agents, for its treatment[7–17]. Vorinostat, an FDA-approved anti-cancer drug also known as suberoylanilide hydroxamic acid (SAHA), is a histone deacetylase (HDAC) inhibitor that, through many mechanisms, causes cancer cell arrest and death[18]. First discovered in attempts to make more efficient hybrid polar compounds that induce the differentiation of transformed cells[19, 20], and initially approved by the FDA for the cutaneous manifestations of cutaneous T-cell lymphoma, vorinostat has since become a therapeutic candidate for many tumors[21–29]. This is due in part to the evolving understanding of the role of epigenetic and posttranslational modifications in the etiopathogenesis of transformed cells[30–33]. Altering many pathways and processes, vorinostat has been found to alter the modification state not only of histone proteins but of many more essential proteins involved in oncogenic and tumor suppression processes. More specifically, and among many other modes of action, vorinostat inhibits the removal of acetyl groups from the ϵ-amino group of lysine residues of histone proteins by histone deacetylases (HDACs). Accumulation of acetyl groups maintains chromatin in an expanded state, facilitating transcriptional activity of major regulatory genes[18, 30, 34–36]. However, as with many other anti-neoplastic agents, toxicities, resistance and/or failures have been observed[13, 37, 38].

In the HCT116 colon cancer cell line xenograft model, by exploiting potential lethal molecular interactions through additional gene knockouts, Falkenberg and colleagues were able to restore vorinostat sensitivity[39, 40]. This phenomenon, known as synthetic lethality, offers a promise to selectively target cancer cells[41]. Although the underlying molecular processes are not clearly delineated, many studies demonstrate synthetic lethality as an effective cancer-killing mechanism.

In this study, we aimed to elucidate regulatory interactions in multiple perturbations of identified synthetically lethal experiments, particularly in the vorinostat-resistant HCT116 (colon cancer xenograft model) cell line. In addition to elucidating interactions, we aimed to identify key interactions that potentially determine observed phenotypes. Given that previous studies[39, 40] showed that knocking down GLI1, a downstream transcription factor involved in the Sonic hedgehog (SHH) pathway [42–44] (an embryonal gene regulatory process), restored vorinostat sensitivity in the HCT116 colorectal cancer cell line, we hypothesized that vorinostat resistance is a result of an uptick in embryonal gene regulatory programs. We also hypothesized that the elucidated regulatory mechanisms would include crosstalk that regulates these embryonal gene regulatory programs. We employed a knowledge-guided fuzzy logic regulatory inference method to elucidate mechanistic relationships from multiple synthetic lethal perturbation experiments in the vorinostat-resistant colon cancer cell lines. We validated inferred regulatory models in independent experiment datasets. Finally, we evaluated the biomedical significance of key regulatory network genes in an independent clinically annotated dataset.

Materials and Methods

Datasets

Synthetic Lethal Experiments Transcriptome, RNA Sequencing Assay Data

Two RNA-seq expression datasets (accessions GSE56788 and GSE57871) with available viability assay data were retrieved from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus [45, 46] public repository.

GSE56788

Detailed under the BioProject accession PRJNA244587, this consists of 45 assays from 15 biosamples, each run in 3 independent biological replicates. RNA-seq expression profiles were acquired by next-generation sequencing of vorinostat-resistant HCT116 cells (HCT116-VR) following knockdown of potential vorinostat-resistance candidate genes. Assays included mock transfections to serve as controls. The authors of the study sought to understand the mechanisms by which these knockdowns contributed to vorinostat response, that is, the re-establishment of sensitivity to vorinostat. siRNA-mediated knockdown of each previously identified resistance candidate gene in the HCT116-VR cell line was employed[39]. Raw RNA sequence expression data were downloaded from the NCBI Sequence Read Archive [47, 48], with accession number SRP041162. Table 1 shows the transcriptome expression profile data accessions and associated siRNA treatment experiments.

Table 1:

GSE56788 Gene Expression Omnibus, GEO dataset I

GSE57871

Similar to GSE56788, the GSE57871 study is a 42-sample dataset derived from expression profiling by high-throughput sequencing. It consists of independent biological experiments on 14 samples performed in triplicate. RNA-seq high-throughput expression profiling of vorinostat-resistant HCT116 cells was performed following gene knockdown of GLI1 or PSMD13, with or without vorinostat treatment. The study authors chose GLI1 and PSMD13 as potential vorinostat resistance genes because these had previously been identified through a genome-wide synthetic lethal RNA interference screen (the GSE56788 dataset study). One aim was to understand the transcriptional events underpinning the effect of GLI1 and PSMD13 knockdown (sensitisation to vorinostat-induced apoptosis). The authors first performed a knockdown on cells, and then treated these with vorinostat or the solvent control. Two timepoints for drug treatment were assessed: a timepoint before induction of apoptosis (4hrs for siGLI1 and 8hrs for siPSMD13) and a timepoint when apoptosis could be detected (8hrs for siGLI1 and 12hrs for siPSMD13)[40]. Raw sequence expression data were downloaded from the NCBI Sequence Read Archive with accession number SRP042158. Table 2 shows the transcriptome expression profile sample data accessions and associated siRNA treatment and treatment timepoint experiments.

Table 2:

GSE57871 Gene Expression Omnibus, GEO dataset

Colon Cancer-Associated Genes from OMIM

A curated list of colon cancer-associated genes (Table 3) was retrieved from the Online Mendelian Inheritance in Man (OMIM) database [49, 50].

Table 3:

Colon-cancer associated genes

Biomedical Significance Experiment Data

To evaluate the clinical and biomedical significance of inferred regulatory features and themes, gene expression profiles were retrieved from The Cancer Genome Atlas (TCGA) colorectal cancer mRNA data in the TCGAcrcmRNA R Bioconductor package[51, 52]. The package contains the TCGA consortium-provided level 3 data, generated by the HiSeq and GenomeAnalyzer platforms, from 450 primary colorectal cancer patient samples[53]. For more comprehensive and up-to-date phenotype information, the associated patients’ clinical data were retrieved from the Genomic Data Commons[54–57].

Methods

RNA Sequence Analyses

Quality assessment

For data quality assessment (QA), the fastqcr, ngsReports and Rqc R/Bioconductor tools [52, 58–60], modeled after the FASTQC [61] tool philosophy, were used. These provide add-on capabilities and an R programming interface to the standalone Java implementation of FASTQC. QA results were used to identify data with questionable quality metrics. In addition to data file statistics, reported quality metrics included: ‘adapter content’, ‘overrepresented sequences’, ‘per base N content’, ‘per base sequence content’, ‘per base sequence quality’, ‘per sequence GC content’, ‘per sequence quality score’, ‘sequence duplication levels’, and ‘sequence length distribution’.

Reads quantification

To quantify expression, we aligned reported reads from the sequencing experiment to the genome. Although non-alignment based quantification approaches such as those implemented in Salmon [62], Sailfish [63], and Kallisto [64] are becoming more popular, their performance in quantifying lowly expressed genes and small RNAs is still being debated [65]. Therefore, sequence reads were aligned to the genome (NCBI GRCh38 build) using the TopHat2 [66, 67] tool, which accounts for splice junctions in alignments. TopHat2 uses Bowtie2 [69], noted for its speed and memory efficiency, for primary alignment. Rather than build new index files, pre-built Bowtie2 index files were downloaded from Illumina’s iGenomes archive [70]. Accepted hits and annotation information in the BAM format [71] output files were assembled into an expression matrix of feature counts using the featureCounts routine in the Rsubread package [72].

Figure 1:

Methods Overview

Preprocessing and normalization

Feature counts were normalized using the regularized log transformation implemented in the DESeq2 package [73], to account for disparate total read counts in the different files and to allow for comparison across samples. The regularized log transformation moderates the high variance typically observed at low read counts. We specified the regularized log transformation intercept as the average expression profile across the normal (mock) samples.

Model Building and Independent Validation Datasets

Datasets were divided into training (regulatory-model-inferring) and test (regulatory-model-validation) datasets (Figure 2). Regulatory models were inferred using the training datasets. Inferred models were tested in the independent validation datasets. The independent validation dataset comprised two parts: one part was used to test the regulatory models, while the other was used to test and evaluate a simulation of the consolidated network.

Figure 2:

Datasets. For our fuzzy-logic inference and evaluation, two qualifying datasets, with accession numbers GSE56788 and GSE57871, were found and retrieved from the NCBI Gene Expression Omnibus (GEO) database. The studies’ samples were subjected to quality assessment and inclusion criteria. 32 qualifying samples from the GSE56788 dataset were used for training (model building) and 12 samples meeting our inclusion criteria from the GSE57871 dataset were used for testing. Of these 12 samples, 3 were derived from GLI siRNA knockdown experiments and 9 were from mock experiments.

Feature Selection

Although similar, feature selection for regulatory network reconstruction and inference differs from classical feature selection. Classical feature selection [74–78] approaches aim to identify the optimal set of features with which a trained model can best predict or correctly identify the class of a not-previously-seen object, given the object’s attributes (the class prediction problem). Associated with a class prediction problem is feature redundancy [79], which needs to be mitigated when choosing an optimal set. With respect to selecting features for regulatory networks, however, this may not necessarily be the case, since features that appear redundant may imply co-regulatory (direct or indirect regulatory) interactions in the network. In both situations, on the one hand is the cost of learning a model, while on the other hand is the curse of dimensionality that plagues the low sample-to-feature ratio characteristic of biological experiments. The very high dimension, coupled with low sample size and the potential noise in measured experiments, presents a limitation for regulatory network inference methods [80] in particular. Feature selection seeks a middle ground where cost is minimized with minimal loss in learned model benefits. Although optimized algorithms may mitigate cost, a poorly selected or less optimal set of features will undermine the efficiency of any learned model.

For a regulatory network model that would represent colon cancer, we reasoned that network features should very likely include known and previously identified products of genes associated with the disease process. Thus, we compiled a list of genes consisting of a curated set obtained from the OMIM database [49, 50] and those from literature evidence, i.e. genes in described pathways of colon cancer tumorigenesis. And, if we assume that the regulatory network is a function of changes in features’ expression across time, among different perturbations or across cellular states, it also stands to reason that features with significant variation or dispersion in expression across samples should be more informative, i.e. more relevant for deriving a regulatory network, than those with no or minimal variation. Mathematically, we may describe a cellular state s as a linear combination of weighted features’ expressions, given by the equation below:

$$ s = \alpha x_1 + \beta x_2 + \gamma x_3 + \cdots + \omega x_n + \epsilon $$

where α, β, γ, ··· ω are the rates of change in the respective feature’s expression, i.e. rate constants; ϵ is the random error estimate; {x₁, x₂, x₃ ··· x_n} is the set of expression values of features under consideration; and n is the total number of features. We reasoned that if we assume a regulatory network describes changes in cellular state across time, we might as well describe it as a first derivative of cellular state, f(s)′. Therefore, features without changes in expression across time, i.e. features whose rate constants tend to zero, would drop off in the estimate d(f(s))/dt. This is analogous to being of less significance in determining the dynamic nature of the regulatory network, i.e. changes in cellular state.

To determine maximally varying features from the normalized expression values of our RNA sequence analyses, we estimated a mean absolute deviation (MAD) from the mean for each feature, given by

$$ \mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} \left| x_i - \bar{x} \right| $$

where n in this case is the number of samples or perturbations, x̄ is the mean expression value of the specific feature across the samples, and x_i ∈ {x₁, x₂,… x_n}.
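As a small illustration of the MAD calculation, the following Python sketch computes the mean absolute deviation from the mean for each feature. The gene names and expression values are made up for the example, and the function name is hypothetical.

```python
from statistics import mean

def mean_absolute_deviation(values):
    """Mean absolute deviation from the mean for one feature across samples."""
    center = mean(values)
    return sum(abs(v - center) for v in values) / len(values)

# Hypothetical normalized expression values for two features across four samples
expression = {
    "geneA": [5.0, 5.1, 4.9, 5.0],  # nearly constant across samples
    "geneB": [2.0, 6.0, 3.0, 9.0],  # strongly varying across samples
}

mad = {gene: mean_absolute_deviation(vals) for gene, vals in expression.items()}
```

In this toy case geneB has a far larger MAD than geneA, so it would rank higher as a candidate network feature.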

To further assess variation in the expression of genes across samples, we also determined fold changes between the minimum and maximum expression values for the respective genes, and the strength of change between knockdown and control experiments. Because genes with the highest MADs were observed to be predominantly those with low average expression, and thus may be confounded by a Poisson noise distribution, we performed differential expression analyses between the respective groups of knockdown (siRNA) experiments and the controls to identify statistically significant differentially expressed genes (i.e. features with true changes)[81–83].

In summary, in addition to genes previously identified as related to colon cancer tumorigenesis and the specific genes targeted in the knockdown experiments, expression profile-informed genes were also considered for regulatory network inference based on their MAD, differential expression and the log fold difference between the minimum and maximum expression values across siRNA knockdown experiments. The expression profile-based selection criteria we specified were that, for a gene to be considered:

Its mean absolute deviation (MAD) must be greater than the median of MADs.
Its expression value in 80% of samples must be greater than its minimum value across all samples by at least two-fold; these samples must include ≥ 80% of siRNA-targeted experiments.
It must be statistically significant and differentially expressed in at least two siRNA-targeted sample groups versus the control group.

Knowledge-guided feature selection

Purely data-driven methods have drawbacks such as limited biological interpretability. Likewise, canonical signaling pathways from literature evidence, provided in curated knowledge databases, are not very specific and hardly predict cell type-specific responses to experimental situations [84]. Therefore, we employed a hybrid approach that addresses these limitations and can integrate prior knowledge with real data for network inference. We searched the derived features, together with the colon cancer-related gene features from the OMIM database, against the STRING database[85–87]. Our search parameters were as follows: a search against the full network type, where edges indicate both functional and physical protein interactions; reported network edges indicate the presence of evidence of interactions between nodes; active interaction sources included mining of literature texts (TextMining), known experiments, knowledge bases, documented co-expression information, and gene neighborhood, fusion and co-occurrence information. The minimum quantitative interaction score for retrieved edges was set to 0.150. We retrieved features reported to be part of a potential network. For each feature found as part of a potential network, all reported interacting features were retrieved and mapped. We elaborated regulatory relationships between and among features using the fuzzy logic approach.

Fuzzy Logic Regulatory Model and Network Inference

To tease out regulatory interactions among our initial selection of features, we employed the fuzzy logic approach. The fuzzy logic approach mitigates known challenges of modeling biological systems, such as inconsistencies and inaccuracies associated with high-throughput characterizations. These challenges also include data noise and dealing with semi-quantitative data [88]. Similar to Boolean networks, fuzzy logic methods are simple and well suited to modeling imprecise and/or highly complex networks. As opposed to differential equation-based models, they are less computationally expensive and less sensitive to imprecise measurements [89–91]. Fuzzy logic compensates for the inadequate dynamic resolution of a Boolean (or discrete) network, while simultaneously addressing the computational complexity of a continuous network [92, 93].

A significant advantage of the fuzzy logic approach is that, in contrast to many other automated decision-making algorithms or regulatory inference methods, such as neural networks or polynomial fits, algorithms in fuzzy logic are expressed in language close to day-to-day conversation. Therefore, a fuzzy logic model is more easily understood and can be extrapolated in predictable ways.

In general, the fuzzy logic modeling approach entails three major steps [94]:

Fuzzification
Rule evaluation, and
Defuzzification

Fuzzification

Considering expression as a linguistic variable and applying defined membership functions on observed continuous numerical expression data, the fuzzification step derives qualitative values. It is a mapping of non-fuzzy inputs to fuzzy linguistic terms [94]. To make data fuzzification easier, a normalization technique may be applied to scale values to within a preferred range [92, 94, 95].

The fuzzification step derives qualitative values from the expression profile’s crisp values. By applying defined membership functions to crisp, numerical expression data, we derived qualitative values, described as a mapping of non-fuzzy inputs to fuzzy linguistic terms [96]. Given qualitative values of HIGH, MEDIUM, or LOW, the fuzzification step takes a feature’s expression value and assigns it the degrees to which it belongs to the respective classes of HIGH, MEDIUM or LOW expression values [97–100]. After an initial data transformation of log2 expression ratios by the arctan function and dividing values by π/2, to project the ratios onto [-1,1], the fuzzification step utilizes three membership functions consisting of the ‘low’, ‘medium’, and ‘high’ functions. Given the three fuzzification functions (y₁ = low, y₂ = medium, y₃ = high), fuzzification of a gene expression value x results in the generation of a fuzzy set y = [y₁, y₂, y₃] as follows:
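The membership functions themselves are not reproduced in the text above. As an illustration only, the Python sketch below assumes the triangular memberships often used in fuzzy logic gene network models (low = −x for x < 0, medium = 1 − |x|, high = x for x > 0); this choice and the function names are assumptions, not taken from the source.

```python
import math

def to_unit_interval(log2_ratio):
    """Project a log2 expression ratio onto (-1, 1) using arctan(x) / (pi / 2)."""
    return math.atan(log2_ratio) / (math.pi / 2)

def fuzzify(x):
    """Return [low, medium, high] membership degrees for x in [-1, 1].

    Assumed triangular memberships; the degrees always sum to 1.
    """
    low = -x if x < 0 else 0.0
    medium = 1.0 - abs(x)
    high = x if x > 0 else 0.0
    return [low, medium, high]

# A log2 ratio of 1 (two-fold up) projects to 0.5: half 'medium', half 'high'
y = fuzzify(to_unit_interval(1.0))
```

Under these assumed memberships an unchanged gene (log2 ratio of 0) is fully 'medium', and increasingly extreme ratios shift membership towards 'low' or 'high'.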

Rule evaluation

The rule evaluation step considers combinations of features and utilizes an inference engine of rules, of the form IF-THEN, including fuzzy set operations such as AND, OR, or NOT, to evaluate input features’ expression (in fuzzy set definition) in relation to output features. This has been described as attempting to make an expert judgment of collective linguistic terms; attempts to find a solution to an evaluation of the concurrent state of existence of linguistic description of states.

We specified our rule configuration (the specification of if-then relationships between variables in fuzzy space) in the form of a vector r = [r₁, r₂, r₃]. We specified the state of an output node z = [z₁, z₂, z₃] to be determined by the fuzzy state of an input feature y = [y₁, y₂, y₃] and the rule describing the relationship between the input and the output, r = [r₁, r₂, r₃] as follows:

An inhibitory relationship, for example, specified as [3, 2, 1] implies that if the input is low (r₁), then the output is high (3); if the input is medium (r₂), then the output is medium (2); and if the input is high (r₃), then the output is low (1). Classic fuzzy logic rule evaluation using the logical AND connective results in a combinatorial rule explosion, i.e. an exponential increase in the number of rules to be evaluated, and in computational time, as additional inputs are considered [101]. Therefore, to address this combinatorial rule explosion, we employed the logical OR (union) rule configuration, an algebraic sum in fuzzy logic [102, 103], as described in [104].
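The rule-vector convention in the inhibitory example can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each input state's membership degree is passed to the output state named by the rule, and that contributions from several inputs are combined with the algebraic sum a + b − ab. The names apply_rule, fuzzy_or and evaluate are hypothetical.

```python
def apply_rule(y, rule):
    """Map an input fuzzy set y = [low, medium, high] through a rule vector.

    rule[i] names the output state (1 = low, 2 = medium, 3 = high) that
    receives the membership degree of input state i.
    """
    z = [0.0, 0.0, 0.0]
    for degree, target in zip(y, rule):
        z[target - 1] += degree
    return z

def fuzzy_or(a, b):
    """Element-wise algebraic sum, a + b - a*b (assumed form of the union)."""
    return [p + q - p * q for p, q in zip(a, b)]

def evaluate(inputs, rules):
    """Combine the rule outputs of all inputs of one output node with fuzzy OR."""
    z = [0.0, 0.0, 0.0]
    for y, rule in zip(inputs, rules):
        z = fuzzy_or(z, apply_rule(y, rule))
    return z

# An input that is half 'medium', half 'high', acting through an inhibitory rule
inhibited = apply_rule([0.0, 0.5, 0.5], [3, 2, 1])
```

Because each input contributes one rule vector and the outputs are merged pairwise, the number of rules grows linearly with the number of inputs rather than exponentially.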

Defuzzification

The defuzzification step produces a quantifiable expression result or value given the input sets, the fuzzy rules, and the membership functions. Defuzzification interprets the membership degrees of the fuzzy sets as a specific decision or real value; that is, it reports a corresponding continuous numerical variable from a fuzzy-state linguistic variable. Several approaches to defuzzification exist. We employed the simplified centroid method [103]. Given the predicted fuzzy values of an output node y = [y₁, y₂, y₃], we defined defuzzified expression values as:

After defuzzification, we reverse-transformed back to log2 expression values by multiplying derived values by π/2 and applying the tangent function.
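The defuzzification formula is not shown in the text above. The Python sketch below assumes one common form of the simplified centroid, in which the 'low', 'medium' and 'high' states sit at −1, 0 and +1, and then inverts the arctan projection with the tangent function. Both function names are hypothetical.

```python
import math

def defuzzify(y):
    """Simplified centroid of y = [low, medium, high].

    Assumes state centers at -1, 0 and +1, giving (high - low) / (low + medium + high).
    """
    low, medium, high = y
    total = low + medium + high
    if total == 0:
        return 0.0  # no membership anywhere: report 'no change'
    return (high - low) / total

def to_log2_ratio(value):
    """Invert the arctan projection: multiply by pi/2, then take the tangent."""
    return math.tan(value * math.pi / 2)

crisp = defuzzify([0.0, 0.5, 0.5])   # value on the [-1, 1] scale
log2_ratio = to_log2_ratio(crisp)    # value back on the log2 ratio scale
```

With the triangular memberships of the earlier fuzzification sketch, fuzzifying and then defuzzifying a value returns the value unchanged, which is a useful sanity check.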

Inferred regulatory model fit

For each regulatory model, which consists of an output feature, its suggested regulatory input feature(s) and associated fuzzy logic rules (relating each input feature to the output respectively), we estimated the fitness of the model’s prediction of the output x across M experiment samples or perturbations x = {x₁, x₂,…, x_M} as:

$$ E = 1 - \frac{\sum_{i=1}^{M} (x_i - \hat{x}_i)^2}{\sum_{i=1}^{M} (x_i - \bar{x})^2} $$

where x̂ = {x̂₁, x̂₂,…, x̂_M} is the set of defuzzified numerical log expression ratios predicted for the output feature and x̄ is the mean of the experimental values of x across the samples or perturbations observed. A perfect fit would result in a maximum E of 1.0.
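A fit score with these properties can be sketched as below. The sketch assumes E takes the coefficient-of-determination form, one minus the ratio of the squared prediction error to the squared deviation from the mean, which is consistent with a maximum of 1.0 for a perfect fit; the function name and data are hypothetical.

```python
def model_fit(observed, predicted):
    """Assumed fit score: 1 - sum((x - x_hat)^2) / sum((x - x_bar)^2)."""
    x_bar = sum(observed) / len(observed)
    error = sum((x - p) ** 2 for x, p in zip(observed, predicted))
    spread = sum((x - x_bar) ** 2 for x in observed)
    return 1.0 - error / spread

observed = [1.0, 2.0, 3.0]                       # hypothetical log2 ratios
perfect = model_fit(observed, [1.0, 2.0, 3.0])   # predictions match exactly
baseline = model_fit(observed, [2.0, 2.0, 2.0])  # predictions equal the mean
```

Under this form, a model that predicts no better than the mean of the observations scores 0, and worse models score below 0.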

Model probability (p-value) estimates

To estimate model probabilities, we fitted a probability density distribution to 100,000 fit estimates of models derived by random permutations of rules and input features for each output feature. We allowed up to four regulatory interactors. We computed a model fit’s p-value as the probability of observing the estimated fit under the distribution of randomly estimated fits. A gamma distribution was fitted, and the ‘scale’ and ‘shape’ parameters were derived using the maximum likelihood estimate (MLE) approach [105–108] implemented in the egamma function in the EnvStats R package. With the ‘scale’ and ‘shape’ parameters, random deviates and cumulative probabilities were derived using the rgamma and pgamma implementations, respectively, in the stats package [109, 110].

Model validation

As described above, the fuzzy logic approach infers a regulatory model consisting of an output node, input nodes and the derived regulatory rules that relate each input node to the output node. We validated derived models for each output feature in the independent GLI1 siRNA knockdown experiment datasets generated by Falkenberg et al (2016). In this dataset, the authors focused on the genes GLI1 and PSMD13 as potential vorinostat-resistance candidate genes, identified from previous screens. Falkenberg and colleagues performed transcriptome analysis on vorinostat-resistant HCT116 cells (HCT116-VR) upon knockdown of these candidate genes in the presence and absence of vorinostat. According to the authors, treatment of vorinostat-resistant cells with the GLI1 small-molecule inhibitor, GANT61, phenocopied the effect of GLI1 knockdown. For independent validation of our inferred regulatory models, we reasoned that a model’s estimated fit in the test data should be as close as possible to (or better than) its estimated fit in the training dataset. The two timepoints for drug treatment assessed by Falkenberg and colleagues represent a timepoint before induction of apoptosis (4hrs for siGLI1) and a timepoint when apoptosis could be detected (8hrs for siGLI1). Therefore, for this validation, we used the sample expression data at 8hrs (see Table 4).

Table 4:

Independent Validation Dataset

Network construction and validation

For each output node, the best-fitted model, as determined by the difference in estimated fit between the training and validation data, was selected as the representative model. Representative models were consolidated into a single regulatory network (Figure 3). We reasoned that models with minimal estimated fit difference are more likely to be stable than those with large differences.
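The selection step can be illustrated with a short Python sketch. The dictionary layout, the key names and the fit values are hypothetical; the sketch only shows picking, for one output node, the candidate whose training and validation fits differ least.

```python
def select_representative(candidates):
    """Return the candidate model with the smallest train/validation fit gap."""
    return min(candidates, key=lambda m: abs(m["fit_train"] - m["fit_validation"]))

# Hypothetical candidate models for a single output node
candidates = [
    {"inputs": ("geneA",), "fit_train": 0.90, "fit_validation": 0.40},
    {"inputs": ("geneB", "geneC"), "fit_train": 0.75, "fit_validation": 0.70},
]

best = select_representative(candidates)
```

Here the second candidate is chosen even though its training fit is lower, because its fit is more stable across the two datasets.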

Figure 3:

Regulatory network construction – constructed from consolidation of representative best-fitted models for all output nodes

To validate the derived regulatory network, we compared the monotonic and adaptive changes[111] observed in a dynamic simulation of the network over 5,000 time-step iterations in the training data against those observed in the validation data. We reasoned that the distribution of observed changes would not differ significantly between the training data network simulation and the independent validation data simulation.

To simulate the network, we derived successive time-step expression values (I_n+1) for each node by a linear combination of the previous (I_n-1) and new values (I_n), to ensure the system converges smoothly towards equilibrium[99]. Following Gormley et al, new values (I_n) were computed as:

Where the α option specifies the ‘mixing parameter’, which governs how quickly the simulation reaches system equilibrium. New values for each node were based on the initial conditions and the fuzzy relations (regulatory rules) inferred from the training data. Zhang et al (2019) described the monotonic (S_M) and adaptive (S_A) changes, respectively, as:

Where R are the estimated values over the entire iteration, R₀ are the values observed at the start of the simulation and R_T are the values observed at the end of the simulation. We used Student's t-test to determine whether the monotonic and adaptive network simulation changes differed between the training data and the independent network validation data. To effectively simulate a knockdown and make validation dataset-2 more comparable, we kept the expression level of the knocked-down feature unchanged in silico throughout the simulation steps. Table 5 shows the dataset considered for independent validation of the regulatory network (validation dataset-2).
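The simulation loop with a clamped knockdown node can be sketched as follows. This is an illustration under stated assumptions rather than the method of Gormley et al: the update is assumed to be the convex combination (1 − α)·(current value) + α·(rule-based value), the fuzzy rules are replaced by hypothetical toy functions, and the node names `A`, `B`, `C` and their starting values are invented.

```python
def simulate(initial, rules, alpha=0.1, steps=5000, clamped=()):
    """Iterate a network with a mixing update.

    initial : dict mapping node -> starting expression value
    rules   : dict mapping node -> function(state) giving the rule-based value
    alpha   : mixing parameter (assumed here to weight the rule-based value)
    clamped : nodes held at their initial value (in-silico knockdown)
    """
    state = dict(initial)
    trajectory = [dict(state)]
    for _ in range(steps):
        proposed = {}
        for node, value in state.items():
            if node in clamped or node not in rules:
                # Knocked-down (or unregulated) nodes keep their value.
                proposed[node] = value
            else:
                target = rules[node](state)
                # Assumed convex-combination update; smaller alpha -> slower approach.
                proposed[node] = (1.0 - alpha) * value + alpha * target
        state = proposed
        trajectory.append(dict(state))
    return trajectory


# Hypothetical stand-ins for inferred fuzzy rules.
rules = {
    "B": lambda s: 0.5 * s["A"],
    "C": lambda s: 1.0 - s["B"],
}
initial = {"A": 0.2, "B": 0.9, "C": 0.1}
trajectory = simulate(initial, rules, alpha=0.1, steps=5000, clamped={"A"})
final = trajectory[-1]
```

The start and end values of each trajectory (`trajectory[0]` and `trajectory[-1]`) play the roles of R₀ and R_T in the text; the clamped node `A` stays at its initial level for all 5,000 steps.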

Table 5:

Independent Validation Dataset – in-silico knockout simulation

Clinical Significance Evaluation

To evaluate the biomedical significance of the inferred regulatory network, we first estimated the importance of all nodes contained therein. We defined a node importance score (I_i) similar to that of Zhang et al[111]. The node importance score integrates network topology, network edge interaction strengths and gene expression. To encapsulate these, Zhang and colleagues defined a hub score (H), a local network entropy (S) and an adaptation score (A), and integrated these into a comprehensive index for each node: a normalized rank sum of these values.

The hub score assesses a node’s connectivity to other nodes. Hub scores are given by the principal eigenvector of the adjacency matrix of the inferred regulatory network. If

Zhang et al described the hub score of node i as h_i.
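As an illustration of a principal-eigenvector hub score, the sketch below applies power iteration to a small undirected adjacency matrix. The example graph is invented, and the use of power iteration on A + I (which has the same eigenvectors as A but avoids oscillation on bipartite graphs) is an implementation choice assumed here, not something taken from Zhang et al.

```python
import math


def hub_scores(adjacency, iterations=1000, tol=1e-12):
    """Approximate the principal eigenvector of a non-negative, symmetric
    adjacency matrix (list of lists) by power iteration."""
    n = len(adjacency)
    scores = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        # Multiply by (A + I): same eigenvectors as A, more stable iteration.
        updated = [
            scores[i] + sum(adjacency[i][j] * scores[j] for j in range(n))
            for i in range(n)
        ]
        norm = math.sqrt(sum(v * v for v in updated))
        updated = [v / norm for v in updated]
        converged = max(abs(a - b) for a, b in zip(updated, scores)) < tol
        scores = updated
        if converged:
            break
    return scores


# Invented example: a triangle (nodes 0, 1, 2) with a pendant node 3 on node 0.
adjacency = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
scores = hub_scores(adjacency)
```

Node 0, which touches every other node, receives the largest score; the pendant node 3 receives the smallest.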

Extending the work of Teschendorff and Severini [112], Zhang et al described local entropies as the degree of randomness in the local pattern of information flux around each node[111]. This is analogous to the centrality entropy described by Ortiz-Arroyo and Hussein [113], a measure of the centrality of nodes that depends on their contribution to the entropy of the derived regulatory network. We computed each node’s local entropy using the entropy implementation in Jalili et al’s centiserve R package[114], which is derived from Shannon’s [115] definition of entropy. This states that the entropy of a random variable X that can take n values is:

$$H(X) = -\sum_{i=1}^{n} p(x_i)\,\log_2 p(x_i)$$

Jalili et al’s centrality entropy measure H_ce of a graph G is defined as:

$$H_{ce}(G) = -\sum_{i=1}^{n} \gamma(v_i)\,\log_2 \gamma(v_i), \qquad \gamma(v_i) = \frac{\mathrm{paths}(v_i)}{\mathrm{paths}(v_1, v_2, \ldots, v_M)}$$

where paths(v_i) is the number of geodesic paths from node v_i to all the other nodes in the graph and paths(v₁, v₂,…, v_M) is the total number of geodesic paths M that exist across all the nodes in the graph.
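A small sketch of a graph-level centrality entropy follows. It assumes the form H_ce = −Σ γ(v_i) log₂ γ(v_i) with γ(v_i) = paths(v_i) / total, and it takes the total as the sum of paths(v_i) over all nodes so that the γ values sum to one. That normalisation, the unweighted breadth-first path counting and the example graphs are assumptions of this sketch; it is not the centiserve implementation.

```python
import math
from collections import deque


def geodesic_path_counts(graph, source):
    """Count shortest (geodesic) paths from `source` to every reachable node
    of an unweighted graph given as {node: [neighbours]}."""
    dist = {source: 0}
    count = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                count[v] = count[u]
                queue.append(v)
            elif dist[v] == dist[u] + 1:
                # Another shortest route into v via u.
                count[v] += count[u]
    return count


def centrality_entropy(graph):
    """Entropy of the distribution of geodesic paths over nodes (assumed form)."""
    paths = {
        v: sum(c for t, c in geodesic_path_counts(graph, v).items() if t != v)
        for v in graph
    }
    total = sum(paths.values())
    if total == 0:
        return 0.0
    gammas = [p / total for p in paths.values() if p > 0]
    return -sum(g * math.log2(g) for g in gammas)


# Invented examples: a triangle and a 4-cycle.
triangle = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
square = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
h_triangle = centrality_entropy(triangle)
h_square = centrality_entropy(square)
```

In both examples every node carries the same share of geodesic paths, so the entropy equals log₂ of the number of nodes; graphs with a few dominant nodes give lower values.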

In place of an adaptation score rank, we modified the node importance score to include the fit rank (r^F), the mean edge confidence rank (r^E) and the delta rank (r^D). We defined the fit rank as the rank of the estimated fit associated with the respective node in the network. We defined the mean edge confidence rank as the rank of the average of the STRING database edge confidences associated with the node and contained in the node’s regulatory model inferred by the fuzzy logic approach. To moderate the estimated fits, we defined the delta rank as the rank of the difference between the model-associated estimated fits observed in the training and independent validation datasets.

We defined an importance score (I_i) for each node as the normalized rank sum of these values, similar to that of Zhang et al.
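One way such a normalized rank sum could be computed is sketched below. Several choices here are assumptions not stated in the text: ranks are ascending with ties averaged, larger hub, entropy, fit and edge-confidence values are treated as more favourable, a smaller fit difference is treated as more favourable, and the rank sum is normalized by its maximum possible value. The node names and values are invented.

```python
def rank(values, higher_is_better=True):
    """Ascending ranks (1 = least favourable); tied values share their mean rank."""
    keyed = [v if higher_is_better else -v for v in values]
    order = sorted(range(len(keyed)), key=lambda i: keyed[i])
    result = [0.0] * len(keyed)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and keyed[order[end + 1]] == keyed[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result


def importance_scores(nodes, hub, entropy, fit, edge_conf, delta):
    """Normalized rank sum over five components (assumed normalisation)."""
    components = [
        rank(hub),
        rank(entropy),
        rank(fit),
        rank(edge_conf),
        rank(delta, higher_is_better=False),  # small train/validation gap ranks high
    ]
    denom = len(components) * len(nodes)  # largest attainable rank sum
    return {
        node: sum(c[i] for c in components) / denom
        for i, node in enumerate(nodes)
    }


# Invented values for three hypothetical nodes.
nodes = ["n1", "n2", "n3"]
importance = importance_scores(
    nodes,
    hub=[0.9, 0.5, 0.1],
    entropy=[1.2, 1.0, 0.8],
    fit=[0.8, 0.7, 0.6],
    edge_conf=[0.9, 0.7, 0.4],
    delta=[0.01, 0.05, 0.20],
)
```

With this normalisation a node that ranks best on every component scores 1.0, which makes scores comparable across networks of different sizes.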

Similar to Zhang and colleagues[111], we evaluated the potential of highly ranked regulatory node features or themes to predict short-term (three years or less) and mid-term (greater than three years) survival. We reasoned that these features are potentially able to drive tumor cells to either circumvent or succumb to epistatic events. We fitted a logistic regression model using the expression profiles and clinical information we retrieved for The Cancer Genome Atlas (TCGA) primary colorectal cancer samples, incorporating our derived node importance measures as penalty weights and specifying the 3-year survival status (dead or alive) as the outcome. Given y_i = 0 or 1 as the binary response outcome associated with the i-th sample in n patients; p_i = Pr(y_i = 1); i = 1, ···, n; and x_i = (x_i1, x_i2,…, x_iL)^T as the expression profile of the genes in the i-th patient, we specified the logistic regression model as:

$$\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j=1}^{L} \beta_j x_{ij}$$

where β₀ and β_j are respectively the intercept and regression coefficients.
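The text does not specify how the node importance measures enter the penalty. The sketch below assumes one possible form, a per-coefficient weighted ridge penalty λ·Σ w_j·β_j² fitted by plain gradient descent, so that features with a small weight w_j are shrunk less. The penalty form, the optimiser, the hyperparameters and the toy data are all assumptions for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(beta0, beta, x):
    """Pr(y = 1) under the fitted logistic model."""
    return sigmoid(beta0 + sum(b * v for b, v in zip(beta, x)))


def fit_weighted_logistic(X, y, penalty_weights, lam=0.1, lr=0.1, epochs=2000):
    """Gradient descent on mean log-loss + lam * sum_j w_j * beta_j**2.

    The intercept is left unpenalized. The weighted-ridge form is an
    assumption of this sketch, not the paper's stated penalty.
    """
    n, n_features = len(X), len(X[0])
    beta0, beta = 0.0, [0.0] * n_features
    for _ in range(epochs):
        grad0, grad = 0.0, [0.0] * n_features
        for xi, yi in zip(X, y):
            err = predict(beta0, beta, xi) - yi
            grad0 += err / n
            for j in range(n_features):
                grad[j] += err * xi[j] / n
        for j in range(n_features):
            grad[j] += 2.0 * lam * penalty_weights[j] * beta[j]
        beta0 -= lr * grad0
        beta = [b - lr * g for b, g in zip(beta, grad)]
    return beta0, beta


# Invented toy data: feature 0 separates the classes, feature 1 does not.
X = [[0.1, 0.5], [0.2, 0.4], [0.3, 0.6], [0.7, 0.5], [0.8, 0.4], [0.9, 0.6]]
y = [0, 0, 0, 1, 1, 1]
# Hypothetical weights: an important node (feature 0) gets a light penalty.
penalty_weights = [0.1, 1.0]
beta0, beta = fit_weighted_logistic(X, y, penalty_weights)
p_low = predict(beta0, beta, [0.1, 0.5])
p_high = predict(beta0, beta, [0.9, 0.6])
```

A weighted lasso penalty would be an equally plausible reading of "penalty weights"; the ridge form is used here only because it keeps the gradient simple.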

We randomly divided the data into training and test subsets at varying sample ratios of 50%, 60%, 70%, and 80%. We ran 100 repeated estimates at each sample ratio. We calculated the areas under the ROC curves (AUCs) for the training and test datasets. We further evaluated the association of the top-ranked features with survival using Kaplan–Meier (K-M) survival analysis[116–118] and assessed the significance of differences between the K-M curves using the Cox proportional hazards model[119] and the two-sided log-rank test[120]. We classified patients into two groups (high-risk vs low-risk) based on the optimal cutoff determined using the ROC approach.
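The repeated random splitting and AUC calculation can be sketched as follows. The AUC is computed here from its pairwise (Mann-Whitney) definition. The split helper, the seed, the sample size of 20 and the reading of each ratio as the training fraction are assumptions made for this illustration.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve: the fraction of (positive, negative) pairs
    in which the positive sample scores higher; ties count as one half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


def repeated_splits(n_samples, train_fraction, repeats=100, seed=0):
    """Yield (train_indices, test_indices) for repeated random splits."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    n_train = int(round(train_fraction * n_samples))
    for _ in range(repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        yield shuffled[:n_train], shuffled[n_train:]


# One batch of 100 splits per ratio, on a hypothetical cohort of 20 samples.
splits_by_ratio = {
    ratio: list(repeated_splits(20, ratio, repeats=100))
    for ratio in (0.5, 0.6, 0.7, 0.8)
}
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In practice a model would be fitted on each training index set and `auc` evaluated on both the training and the held-out indices, giving 100 training and 100 test AUCs per ratio.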