Abstract
Cell identity is governed by gene expression, regulated by Transcription Factor (TF) binding at cis-regulatory modules. Decoding the relationship between patterns of TF binding and the regulation of cognate target genes is nontrivial, remaining a fundamental limitation in understanding cell decision-making mechanisms. Identification of TF physical binding that is biologically ‘neutral’ is a current challenge. We present the ‘NetNC’ software for discovery of functionally coherent TF targets, applied to study gene regulation in early embryogenesis. Predicted neutral binding accounted for 50% to ≥80% of candidate target genes assigned from significant binding peaks. Novel gene functions and network modules were identified, including regulation of chromatin organisation and crosstalk with Notch signalling. Orthologues of predicted TF targets discriminated breast cancer molecular subtypes and our analysis evidenced new tumour biology; for example, predicting networks that reshape Waddington’s landscape during EMT-like phenotype switching. Predicted invasion roles for SNX29, ATG3, UNK and IRX4 were validated using a tractable cell model. This work illuminates conserved molecular networks that regulate epithelial remodelling in development and disease, with potential implications for precision medicine.
1 Introduction
Transcriptional regulatory factors (TFs) govern gene expression, which is a crucial determinant of phenotype. Therefore, mapping transcriptional regulatory networks is an attractive approach to gain understanding of the molecular mechanisms underpinning both normal biology and disease (Shlyueva et al, 2014; Stampfel et al, 2015; Rhee et al, 2014). TF action is controlled in multiple ways, including protein-protein interactions, DNA sequence affinity, 3D chromatin conformation, post-translational modifications and the processes required for TF delivery to the nucleus (Zabidi & Stark, 2016; Rhee et al, 2014; Khoueiry et al, 2017). The interplay of mechanisms influencing TF specificity across different biological contexts encompasses considerable complexity, and genome-scale assignment of TFs to individual genes is challenging (Shlyueva et al, 2014; Wilczynski & Furlong, 2010; Khoueiry et al, 2017). Indeed, much remains to be learned about the regulation of gene expression. For example, the relationship between enhancer sequences and the transcriptional activity of cognate promoters is only beginning to be understood (Khoueiry et al, 2017; Zabidi & Stark, 2016). Prediction of TF occupancy from DNA sequence composition alone has had only limited success, likely because protein interactions influence TF binding specificity (Jolma et al, 2015; Khoueiry et al, 2017).
TF binding sites may be determined experimentally using chromatin immunoprecipitation followed by sequencing (ChIP-seq) or microarray (ChIP-chip). These and related methods (e.g. ChIP-exo, DamID) have revealed a substantial proportion of statistically significant ‘neutral’ TF binding, that has apparently no effect on transcription from the promoters of assigned target genes (Shlyueva et al, 2014; Li et al, 2008; Ozdemir et al, 2011; Biggin, 2011). Evidence suggests that neutral binding can arise from TF association with euchromatin; for example, the binding of randomly-selected TFs and genome-wide transcription levels are correlated (Cheng et al, 2012; Consortium, 2012; Brown & Celniker, 2015). Genomic regions that bind large numbers of TFs have been termed Highly Occupied Target (HOT) regions (Roy et al, 2010). HOT regions are enriched for disease SNPs and can function as developmental enhancers (Kvon et al, 2012; Li et al, 2015). However, a considerable proportion of individual TF binding events at HOT regions may have little effect on gene expression and association with chromatin accessibility suggests non-canonical regulatory function such as sequestration of TFs or in 3D genome organisation (Moorman et al, 2006; Montavon et al, 2011) as well as possible technical artefacts (Teytelman et al, 2013). A proportion of apparently neutral binding sites may also have more subtle functions; for example in combinatorial context-specific regulation and in buffering transcriptional noise (Cannavò et al, 2016; Stampfel et al, 2015). Furthermore, enhancers may control the expression of genes that are sequence-distant but spatially close due to the 3D chromatin conformation (Moorman et al, 2006; Montavon et al, 2011). Current approaches to match bound TFs to candidate target genes may miss these distant regulatory relationships. 
Identification of bona fide, functional TF target genes remains a major obstacle in understanding the regulatory networks that control cell behaviour (Biggin, 2011; Stampfel et al, 2015; Keung et al, 2014; Brown & Celniker, 2015; Khoueiry et al, 2017).
The set of genes regulated by an individual TF typically have overlapping expression patterns and coherent biological function (Igual et al, 1996; Karczewski et al, 2014; MacArthur et al, 2009). Indeed, gene regulatory networks are organised in a hierarchical, modular structure and TFs frequently act upon multiple nodes of a given module (Hartwell et al, 1999; Hooper et al, 2007). Therefore, we hypothesised that functional TF targets collectively share network properties that may differentiate them from neutrally bound sites. Graph theoretic analysis can reveal biologically meaningful gene modules, including cross-talk between canonical pathways (Ideker et al, 2002; Vidal et al, 2011; Jaeger et al, 2017) and conversely may enable elimination of neutrally bound candidate TF targets derived from statistically significant ChIP-seq or ChIP-chip peaks. For this purpose, we have developed a novel algorithm (NetNC) that may be applied to discover functional TF targets and so help to illuminate mechanisms controlling cell phenotype, for example to inform causality in regulatory network inference (Shlyueva et al, 2014; Wilczynski & Furlong, 2010). NetNC analyses the connectivity between candidate TF target genes in the context of a functional gene network (FGN), in order to discover biologically coherent TF targets. Network approaches afford significant advantages for handling biological complexity, enable genome-scale analysis of gene function (Hu et al, 2016; Greene et al, 2015), and are not restricted to predefined gene groupings used by standard functional annotation tools (e.g. GSEA, DAVID) (Ideker et al, 2002; Subramanian et al, 2005; Huang et al, 2009). FGNs seek to comprehensively represent gene function and provide a useful framework for analysis of noisy real-world data (Marcotte et al, 1999; Pe’er & Hacohen, 2011). 
Clustering is frequently applied to a FGN in order to define a fixed network decomposition, as basis for identification of biological modules (Enright et al, 2002; Wang et al, 2011). Modules with a high proportion of genes associated with a given experimental condition, such as drug treatment, may define the network response and so illuminate the underlying biology. However, using predefined, fixed network modules may miss important features of the condition-specific set of genes; for example, gene products with corresponding nodes in the FGN may be absent from the biological condition(s) analysed. Indeed, it is typical for any given cell type to express only a subset of the genes encoded in its genome, hence clusters derived from analysis of the whole genome network may not accurately capture the biological interactions that occur in the context of a particular cell type or environment. Additionally, context-specific interactions are a common feature of biological networks, for example the varied repertoire of biophysical interactions in different cell types or between cell states, such as in the stages of the cell cycle (Pawson & Nash, 2003). Therefore, modules are defined dynamically in vivo and there is benefit in analysis approaches that can discover condition-specific communities of interacting genes without relying on predefined, fixed groupings. The NetNC algorithm satisfies this remit, enabling identification of coherent genes and modules according to the context represented by the gene list and a FGN, or another reference network.
We applied NetNC and a novel FGN (DroFN) to predict functional targets for multiple datasets that measured the binding of the Snail and Twist TFs, as well as for modENCODE HOT regions (Roy et al, 2010). Snail and Twist have important roles in Epithelial to Mesenchymal Transition (EMT), a multi-staged morphogenetic programme fundamental for normal embryonic development that contributes to tumour progression and fibrosis (Nieto et al, 2016; Lim & Thiery, 2012; Giampieri et al, 2009; Yu et al, 2013). Integrative analysis of the predicted functional Snail and Twist targets, Notch screens and human breast cancer transcriptomes gave insights into both developmental and cancer biology. Predicted functional TF targets from NetNC analysis with no previously described role in invasion were validated in vitro.
2 Results
In the subsections below we first describe a D. melanogaster functional gene network (DroFN) and a clustering algorithm developed for functional transcription factor target prediction (NetNC). NetNC performed well against other approaches in discrimination of biologically related genes from synthetic neutrally bound targets. Using DroFN, NetNC and our synthetic benchmark, we estimated the proportion of neutral binding for nine Chromatin Immunoprecipitation (ChIP) microarray (ChIP-chip) or pyrosequencing (ChIP-seq) datasets, drawn from five different studies (MacArthur et al, 2009; Ozdemir et al, 2011; Zeitlinger et al, 2007; Sandmann et al, 2007; Roy et al, 2010). These nine datasets are referred to as ‘TF_ALL’; please see Methods section 4.3 for important details about the TF_ALL datasets. NetNC predicted Snail and Twist functional targets in early embryogenesis, revealing clusters of regulation for multiple genes in key developmental processes, including chromatin remodelling, transcriptional regulation and neural development. Predicted functional targets were enriched for Notch signalling modifiers and captured important aspects of human breast cancer biology. The DroFN network and NetNC software are made freely available as Additional Files associated with this manuscript.
2.1 A comprehensive D. melanogaster functional gene network (DroFN)
We developed a functional gene network (DroFN; 11,432 nodes, 787,825 edges) to provide a systems-wide map of D. melanogaster signalling and metabolism (Additional File 1). Evaluation of DroFN with time-separated blind test data derived from KEGG (TEST-NET) found good performance compared with the DroID (Yu et al, 2008) and GeneMania (Warde-Farley et al, 2010) networks (Table 1, Appendix Figure S1). The DroFN network was more highly connected than DroID, and had 2.6-fold higher average degree. GeneMania predicts shared Gene Ontology terms rather than KEGG pathway comembership, which may account for some of the performance gap found with GeneMania when compared to DroFN and DroID. However, GeneMania performance on TEST-NET is similar to published values for 'Biological Process' terms (Warde-Farley et al, 2010). The overlap between DroFN and the Drosophila proteome interaction map (DPiM (Guruharsha et al, 2011)) was highly significant (FET p<10⁻³⁰⁸). DroFN and DPiM had 999 genes in common and 37.8% (2175/5747) of DroFN edges for these genes were also found in DPiM. The False Positive Rate for DroFN (0.047) was close to the prior for functional interaction estimated from KEGG (0.044); a proportion of these estimated false positives may represent bona fide interactions that were not annotated in KEGG. Overall, DroFN provides a useful genome-scale map of pathway comembership in D. melanogaster.
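As a quick arithmetic check on the network summary statistics quoted above, the short sketch below derives the mean node degree from the reported node and edge counts and recomputes the DPiM edge overlap. It assumes that DroFN is treated as an undirected graph, so that each edge contributes to the degree of two nodes; the variable names are illustrative only.

```python
# Figures quoted in the text for DroFN.
n_nodes = 11_432
n_edges = 787_825

# Assumption: the network is undirected, so each edge adds one to the
# degree of two nodes and the mean degree is 2E / N.
mean_degree = 2 * n_edges / n_nodes

# Fraction of DroFN edges among the 999 shared genes that also occur in DPiM.
shared_edge_fraction = 2175 / 5747

print(f"mean degree: {mean_degree:.1f}")
print(f"DroFN edges also found in DPiM: {shared_edge_fraction:.1%}")
```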
2.2 A novel algorithm for discovery of functional transcription factor binding (NetNC)
Large numbers of statistically significant TF binding sites appear to be neutral (non-functional) (Li et al, 2008; MacArthur et al, 2009; Biggin, 2011). We developed the NetNC algorithm for genome-scale prediction of functional TF target genes (Figure 1). In broad terms, NetNC seeks to discover the biological functions common to a list of genes, therefore defining groups of genes with common function and revealing biologically defining characteristics. This general paradigm has been applied widely, for example in network-based approaches (Schramm et al, 2010; Overton et al, 2011; Ideker et al, 2002; Vidal et al, 2011) and in enrichment analysis (Subramanian et al, 2005; Huang et al, 2009; Geistlinger et al, 2011).
NetNC builds upon observations that TFs coordinately regulate multiple functionally related targets (Igual et al, 1996; MacArthur et al, 2009; Karczewski et al, 2014) and has been calibrated for discovery of biologically coherent genes in noisy data. The first stage in NetNC calculates hypergeometric mutual clustering (HMC) p-values (Goldberg & Roth, 2003) for each pair of candidate TF targets (H1) that are connected in the functional gene network (FGN). Empirical estimation of positive False Discovery Rate (pFDR) (Storey, 2002) across H1 is enabled by deriving HMC p-values from resampled genes (H0). Resampling to generate H0 controls for the number of candidate TF target genes analysed and the FGN structure. Iterative minimum cut is then computed on the pFDR thresholded network with a graph density stopping criterion (Ford & Fulkerson, 1956). Connected components of the resulting graph consisting of fewer than three nodes are discarded. The approach described above is edge-centric and is termed ‘Functional Target Identification’ (FTI), seeking to distinguish all biologically coherent gene pairs from functionally unrelated targets (e.g. arising from neutral TF binding). Additionally, NetNC has a node-centric ‘Functional Binding Target’ (FBT) mode that employs regularised Gaussian mixture modelling for unsupervised clustering with automatic cardinality selection (Lubbock et al, 2013). NetNC-FBT analyses degree-normalised Node Functional Coherence Scores (NFCS); examples of NFCS profiles and the fitted mixture models are visualised in Appendix Figure S2. The NetNC-FBT is parameter-free and so did not require calibration on training data.
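The hypergeometric mutual clustering step can be illustrated with a minimal sketch. The function below scores a connected gene pair by the probability of observing at least as many shared network neighbours as were seen, under a hypergeometric null. This is an illustration of the general HMC idea rather than the NetNC implementation: the toy graph, the function name hmc_pvalue, and the choice to exclude the two genes themselves from the neighbour sets and from the population are all assumptions made here.

```python
from math import comb

def hmc_pvalue(adj, u, v):
    """P(at least the observed number of shared neighbours) for genes u and v.

    adj maps each gene to the set of its neighbours in the network.
    Assumption: u and v are excluded from the neighbour sets and from the
    population from which neighbours are drawn.
    """
    others = set(adj) - {u, v}
    nu = adj[u] & others
    nv = adj[v] & others
    shared = len(nu & nv)
    a, b, n = len(nu), len(nv), len(others)
    # Upper tail of the hypergeometric distribution.
    tail = sum(comb(a, i) * comb(n - a, b - i)
               for i in range(shared, min(a, b) + 1))
    return tail / comb(n, b)

# Hypothetical toy network: A and B share both of their other neighbours.
toy = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"A", "B"},
    "E": {"F"},
    "F": {"E"},
}

for u, v in [("A", "B"), ("A", "C"), ("E", "F")]:
    print(u, v, round(hmc_pvalue(toy, u, v), 3))
```

Pairs with many shared neighbours relative to their degrees receive small p-values; in NetNC these values are then compared against values from resampled genes to estimate pFDR, a step that is not reproduced in this sketch.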
The gold-standard data for NetNC development and validation took KEGG pathways to represent biologically coherent relationships, combined with ‘Synthetic Neutral Target Genes’ (SNTGs) derived by resampling from the DroFN network. A total of 17,600 datasets (Additional File 2, Additional File 3) were developed to contain between 5% and 80% SNTGs; therefore, the gold-standard data covered a wide range of possible values for the proportion of neutrally bound candidate TF target genes. NetNC was robust to variation in the input dataset size and %SNTGs, outperforming HC-PIN (Wang et al, 2011) and MCL (Enright et al, 2002) on blind test data (Figure 2, Appendix Table S1). Previous work that evaluated nine clustering algorithms, including MCL, found that HC-PIN had strong performance in functional module identification (Wang et al, 2011); therefore we selected HC-PIN for extensive comparison against NetNC. In general, NetNC was more stringent, with lower False Positive Rate (FPR) and higher Matthews Correlation Coefficient (MCC) than HC-PIN. MCC provides a balanced measure of predictive power across the positive (KEGG pathway) and negative (SNTG) classes of genes in the gold standard; therefore MCC is an attractive approach for assessment of overall performance. NetNC-FBT typically had lowest FPR and performed well on larger datasets. We saw a spread of performance values across resamples with identical number of pathways and %SNTG (Figure 2), which arose from expected differences between resamples. For example, differences in the density of the resampled SNTG genes may impact upon the power of NetNC to discriminate between SNTGs and KEGG pathway nodes. NetNC’s performance advantages were most prominent on blind test data with ≥50% SNTGs (Figure 2) and all nine of the TF_ALL datasets were predicted to contain ≥50% neutrally bound targets (Figure 3, see subsection 2.3, below). 
Therefore, given the performance advantage on blind test data with ≥50% SNTGs (Figure 2), NetNC appears to be the method of choice for identification of functional TF targets from genome-scale binding data.
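For reference, the two benchmark statistics discussed above can be computed from a confusion matrix as sketched below, with KEGG pathway genes taken as the positive class and SNTGs as the negative class. The counts in the example are hypothetical and are not drawn from the benchmark results.

```python
from math import sqrt

def matthews_cc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention used here: return 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

def false_positive_rate(fp, tn):
    """Fraction of negative-class genes (SNTGs) predicted as functional."""
    return fp / (fp + tn) if (fp + tn) else 0.0

# Hypothetical counts: 100 pathway genes and 100 SNTGs.
tp, fn = 80, 20   # pathway genes retained / discarded
fp, tn = 10, 90   # SNTGs retained / discarded

mcc = matthews_cc(tp, fp, tn, fn)
fpr = false_positive_rate(fp, tn)
print(f"MCC={mcc:.3f}, FPR={fpr:.2f}")
```

Because MCC uses all four cells of the confusion matrix, it remains informative when the proportion of SNTGs varies widely between datasets, which is the situation covered by the gold-standard data.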
2.3 Estimating neutral binding for EMT transcription factors and Highly Occupied Target (HOT) regions
We predicted functional target genes for the Snail and Twist TFs for developmental stages around gastrulation in D. melanogaster. Fly embryos perform rapid nuclear divisions and transcription, leading to the formation of the syncytial blastoderm at about 2 hours. Nuclear divisions slow during cellularisation of the blastoderm after 2 hours and gastrulation occurs around 3 hours (Edgar & Schubiger, 1986; Leptin, 1995; Campos-Ortega & Hartenstein, 1997). Using NetNC and DroFN, we analysed Chromatin ImmunoPrecipitation (ChIP) microarray (ChIP-chip) or sequencing (ChIP-seq) data for overlapping time periods in early embryogenesis produced by four different laboratories and also the modENCODE Highly Occupied Target (HOT) regions (Ozdemir et al, 2011; MacArthur et al, 2009; Sandmann et al, 2007; Zeitlinger et al, 2007; Roy et al, 2010). Nine datasets in total were studied (TF_ALL, Table 2), enabling investigation of multiple factors that are commonly applied in discovery of candidate TF targets, including: peak intensity threshold, multiple developmental time periods, multiple antibodies, different analytical platforms, and the use of transcribed genes for peak assignment. Further details of the TF_ALL datasets are given in Methods subsection 4.3. The proportion of neutrally bound candidate target genes was estimated using a novel approach that calculated local FDR (lcFDR) from NetNC pFDR values, with calibration against the known SNTG fraction in gold standard data (NetNC-lcFDR). Local FDR estimates the false discovery rate at a specific score value (or range of values), in contrast to global FDR, which is calculated using all of the values above a score threshold. We note that global pFDR was unsuitable for estimating the total fraction of neutral binding. For example, every TF_ALL dataset had pFDR=1 at the NetNC score threshold that included all candidate target genes; hence, a naïve approach based on global pFDR would always give a global neutral binding estimate of 100%.
Furthermore, lcFDR may capture differences in score profiles that are missed by global pFDR, illustrated in Appendix Figure S3.
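The distinction between global and local FDR can be made concrete with a simplified sketch. The functions below estimate both quantities empirically from observed scores and scores obtained under a resampled null. This is not the NetNC-lcFDR procedure, which derives lcFDR from pFDR values and calibrates against gold-standard data; the toy scores, the function names, the convention that higher scores indicate stronger coherence, and the conservative null proportion pi0=1 are all assumptions made for illustration.

```python
def global_fdr(observed, null, threshold, pi0=1.0):
    """Estimated FDR among all observed scores at or above the threshold."""
    obs_frac = sum(s >= threshold for s in observed) / len(observed)
    null_frac = sum(s >= threshold for s in null) / len(null)
    if obs_frac == 0:
        return 1.0
    return min(1.0, pi0 * null_frac / obs_frac)

def local_fdr(observed, null, lo, hi, pi0=1.0):
    """Estimated FDR for observed scores falling in the interval [lo, hi)."""
    obs_frac = sum(lo <= s < hi for s in observed) / len(observed)
    null_frac = sum(lo <= s < hi for s in null) / len(null)
    if obs_frac == 0:
        return 1.0
    return min(1.0, pi0 * null_frac / obs_frac)

# Hypothetical scores: half of the observed genes score well above the null.
observed = [0.1, 0.2, 0.3, 0.4, 2.0, 2.5, 3.0, 3.5]
null = [0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.5, 2.1]

print("global FDR, all genes included:", global_fdr(observed, null, 0.0))
print("local FDR, scores in [2, 4):", local_fdr(observed, null, 2.0, 4.0))
print("local FDR, scores in [0, 1):", local_fdr(observed, null, 0.0, 1.0))
```

In this toy example the global estimate at a threshold that admits every gene is 1, mirroring the behaviour described for the TF_ALL datasets, whereas the local estimates separate the high-scoring interval from the low-scoring one.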
NetNC-lcFDR estimates of neutral binding across TF_ALL ranged from 50% to ≥80% (Figure 3A, Table 2). Reassuringly, the dataset with the most stringent peak calling (twi_1-3h_hiConf (Ozdemir et al, 2011)) had the highest (NetNC-lcFDR) or second highest (NetNC-FTI) predicted functional binding proportion. Target genes for regions bound during two consecutive developmental time periods (twi_2-6h_intersect (Sandmann et al, 2007)) also ranked highly, followed by HOT regions (Figure 3A, Table 2). Indeed, twi_2-6h_intersect had a significantly greater percentage of predicted functional targets (binomial p<4.0×10⁻¹⁵) with stronger gFDR and lcFDR profiles than either the twi_2-4h_intersect or twi_4-6h_intersect datasets from the same study, but where binding was during a single time period (Sandmann et al, 2007) (Figure 3). Therefore, predicted functional binding was enriched for regions occupied at >1 time period or by multiple TFs - including HOT regions, which had high functional coherence relative to the other datasets examined. Interestingly, a very similar proportion of functional targets was predicted by NetNC-lcFDR for binding sites derived from either the union or intersection of two Twist antibodies (NetNC-lcFDR=25-30%) from the same study (MacArthur et al, 2009), although the NetNC-FTI value was higher for input data representing the intersection of antibodies (30.5% (116/334) vs 23% (424/1848)). Substantial numbers of candidate target genes in all nine TF_ALL datasets passed a global FDR (gFDR) or lcFDR threshold value of 0.05 (Figure 3B, 3D). Even datasets with high predicted total neutral binding included candidate targets that met stringent NetNC FDR thresholds.
For example, despite having a relatively low proportion of predicted total functional binding (Figure 3A) the datasets sna_2-3h_union, twi_2-3h_union respectively had the highest and second-highest proportion of genes passing lcFDR<0.05 (Figure 3B); these datasets were also highly ranked at gFDR<0.05 (Figure 3D).
ChIP peak intensity putatively correlates with functional binding, although some weak binding sites have been shown to be functional (Biggin, 2011; Chen et al, 2013). We found a significant correlation between genes’ NetNC NFCS values and ChIP peak enrichment scores in 6/8 datasets (q<0.05, HOT regions not analysed). The two datasets where no significant correlation was found (twi_1-3h_hiConf, twi_2-6h_intersect) were derived from protocols that enrich for functional targets and had the lowest predicted neutral binding proportion (Figure 3A). Indeed, the median peak score for twi_2-6h_intersect was significantly higher than data from the same study that was restricted to a single time period (twi_2-4h_intersect, q<5.0×10⁻⁵⁶; twi_4-6h_intersect, q<4.8×10⁻⁵⁸). Therefore the relationship of peak intensity with functional binding in twi_1-3h_hiConf and twi_2-6h_intersect appears to have been eliminated by the application of protocols that enriched for functional targets. Functional TF targets identified by NetNC were also enriched for human orthologues, defined by InParanoid (Östlund et al, 2009). For example, 72% (453/628) of the NetNC-FBT predicted functional target genes for twi_2-3h_union had human orthologues, which was significantly higher than the value (50%, 616/1220) for the full dataset (p<3×10⁻²⁸ binomial test). Genome-wide expectation for human-fly orthology was 46%, calculated with reference to the fly genome, which was significantly lower than the value of 72% for the twi_2-3h_union predicted functional targets (p<5×10⁻⁴⁰). The enrichment for evolutionary conservation of NetNC results aligns with the fundamental developmental processes captured by the datasets analysed (i.e. gastrulation, mesoderm development) and is consistent with the predicted functional target genes playing roles in these processes.
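The orthologue enrichment comparison can be sketched as an exact binomial tail probability using the counts quoted above. The text does not state whether the reported test was one-sided or two-sided, so the one-sided upper tail used here is an assumption, as is the function name; the sketch is intended to show the form of the calculation rather than to reproduce the reported p-values exactly.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed by direct summation."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Counts quoted in the text for twi_2-3h_union.
orthologues_in_predicted = 453   # predicted functional targets with a human orthologue
predicted_total = 628            # all NetNC-FBT predicted functional targets
background_rate = 616 / 1220     # orthologue rate in the full dataset

p_value = binomial_upper_tail(orthologues_in_predicted, predicted_total,
                              background_rate)
print(f"one-sided binomial p-value: {p_value:.2e}")
```

Substituting the genome-wide orthology expectation of 0.46 for background_rate gives the analogous comparison against the fly genome.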
NetNC-lcFDR estimates of neutral binding agreed well with the Functional Target Identification results (NetNC-FTI, Table 2). Indeed, neutral binding estimates from these two methods had median difference of only 5.5% and were significantly correlated across TF_ALL, despite considerable methodological differences (r=0.85, p=0.008, Appendix Figure S4). This concordance supports the results from both NetNC-FTI and NetNC-lcFDR.
2.4 Genome-scale functional transcription factor target networks
NetNC results offer a global representation of the mechanisms by which Snail and Twist exert tissue-specific regulation in early D. melanogaster embryogenesis (Figure 4, Appendix Figure S5, Additional File 4). NetNC-FTI results for the nine TF_ALL datasets overlapped and clusters were manually annotated into biologically similar groups, with reference to Gene Ontology enrichment and FlyBase annotations (Ashburner et al, 2000; Maere et al, 2005; Huang et al, 2009; Gramates et al, 2017). Eleven biological groupings were identified in at least 4/9 TF_ALL datasets, including developmental regulation (9/9), chromatin organisation (6/9), ion transport (6/9), mushroom body development (6/9), phosphatases (6/9), splicing (5/9) and regulation of translation (5/9) (Appendix Table S2). Very few clusters were composed entirely from genes identified only in a single dataset; examples included snoRNAs/nucleolar proteins (twi_2-3h_union), transferases (HOT), defense response/immune response (twi_2-4h_Toll10b) and chitin metabolism (twi_2-4h_intersect) (Figure 4, Appendix Figure S5). We investigated the robustness of NetNC-FTI to subsampled input using TF_ALL (Appendix Tables S3, S4). Across TF_ALL, the median overlap between network edges output from analysis of the complete dataset and those from node subsampling rates of 95%, 80% and 50% was 91%, 84% and 77%, respectively (respective median 95% CI 83-96%, 74-94%, 37-92%). The median overlap of genes for the 95%, 80% and 50% subsamples, averaged across TF_ALL, was 89%, 81% and 75%, respectively (median 95% CI 72%-97.2%, 66%-92%, 58%-97%). Overall, subsampling had a moderate effect on NetNC predictions and greater sensitivity was observed at lower subsampling rates, as expected. Some subsamples taken as input to NetNC had low overlap with the NetNC-FTI reference output (reference_net) for any given complete input dataset.
Indeed, the reference_net represented between 14% and 39% of the total input gene list across the nine TF_ALL datasets. Subsamples that excluded a high proportion of the nodes in reference_net would be expected to result in weaker hypergeometric mutual clustering values for nodes that overlapped with reference_net, due to a reduction in common neighbours for the reference_net nodes included in the given subsample. Therefore, subsampling of the input gene list is expected to produce NetNC results that have reduced overlap with reference_net; this effect is also a source of variation in overlap across subsamples, reflected in the 95% CI values. Also, the probability of sampling nodes in reference_net is lower when a smaller fraction of the complete input TF_ALL gene list is covered by reference_net, leading to a greater subsampling-associated loss of nodes and edges. Consistent with this interpretation, TF_ALL datasets with the highest NetNC-FTI functional binding proportion (Table 2) (twi_1-3h_hiConf, twi_2-6h_intersect, HOT) were less sensitive to subsampling than datasets with relatively low predicted functional binding such as sna_2-4h_Toll10b and twi_4-6h_intersect (Appendix Tables S3, S4).
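The overlap statistics used in the subsampling analysis can be sketched as follows. The example does not rerun NetNC; as a stand-in, the result for each subsample is taken to be the reference edges whose endpoints both survive subsampling, which gives an upper bound on the overlap that a real rerun could achieve. The toy network, the function names, and the choice of reference_net as the denominator of the overlap percentage are assumptions made for illustration.

```python
import random

def subsample_genes(genes, rate, rng):
    """Draw a random subset containing the given fraction of the genes."""
    ordered = sorted(genes)
    return set(rng.sample(ordered, round(rate * len(ordered))))

def induced_edges(edges, kept_nodes):
    """Edges (as frozensets of two genes) with both endpoints retained."""
    return {e for e in edges if e <= kept_nodes}

def percent_overlap(reference, result):
    """Percentage of reference items that are recovered in the result."""
    if not reference:
        return 0.0
    return 100.0 * len(reference & result) / len(reference)

# Hypothetical reference network: a chain of five genes.
reference_edges = {frozenset(p) for p in
                   [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]}
reference_genes = set().union(*reference_edges)

rng = random.Random(0)
for rate in (0.95, 0.80, 0.50):
    kept = subsample_genes(reference_genes, rate, rng)
    edges = induced_edges(reference_edges, kept)
    print(rate,
          f"gene overlap {percent_overlap(reference_genes, kept):.0f}%",
          f"edge overlap {percent_overlap(reference_edges, edges):.0f}%")
```

Even in this idealised setting the edge overlap falls faster than the gene overlap, because removing one gene removes every edge that touches it.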
The developmental regulation cluster (DRC) encompassed key conserved morphogenetic pathways, for example: Notch, Wnt, Fibroblast Growth Factor (FGF). Notch signalling modifiers from public data (Guruharsha et al, 2012) overlapped significantly with NetNC-FTI results for each TF_ALL dataset (q <0.05), including the DRC, chromatin organisation and mediator complex clusters (Figure 4, Appendix Figure S5). Notch was identified as an important control node across TF_ALL where it had highest betweenness centrality in the DRC for three datasets and ranked (by betweenness) among the top ten DRC genes for 8/9 datasets. The activation of Notch can result in diverse, context-specific transcriptional outputs and the mechanisms regulating this pleiotropy are not well understood (Guruharsha et al, 2012; Ntziachristos et al, 2014; Bray, 2016; Nowell & Radtke, 2017). NetNC predicted functional Snail and Twist binding to many regulatory genes in the Notch neighbourhood, therefore providing evidence for novel factors controlling the transcriptional consequences of Notch activation in cell fate decisions controlled by these TFs. This is consistent with previous demonstration of signalling crosstalk for Notch with twist and snail in multiple systems; for example in adult myogenic progenitors (Bernard et al, 2010) and hypoxia-induced EMT (Sahlgren et al, 2008). Wingless also frequently had high betweenness, ranking within the top ten DRC genes in six datasets and was highest ranked in two instances. Thirteen genes were present in the DRC for at least seven of the TF_ALL datasets (DRC-13, Appendix Table S5), and these genes had established functions in the development of mesodermal derivatives such as muscle, the nervous system and heart (Baylies & Bate, 1996; Bernard et al, 2010; Xie et al, 2016; Bray, 2016; Chen et al, 1996; Lo et al, 2002; Trujillo et al, 2016). 
Public in situ hybridisation (ISH) data for the DRC-13 genes indicated their earliest expression in (presumptive) mesoderm at: stages 4-6 (wg, en, twi, N, htl, how), stages 7-8 (rib, pyd, mbc, abd-A) and stages 9-10 (pnt) (Hammonds et al, 2013; Tomancak et al, 2002; Hartley et al, 1987; BDGP). The remaining two DRC-13 genes had no evidence for mesodermal expression (fkh) or no data available (jar). However, other studies had shown that fkh is essential for caudal visceral mesoderm development (Kusch & Reuter, 1999) and had demonstrated jar expression in the midgut mesoderm (Millo & Bownes, 2007). The above data are consistent with direct regulation of DRC-13 by Twist and Snail in (presumptive) mesoderm, as predicted by NetNC-FTI.
Chromatin organisation clusters included polycomb-group (PcG) and trithorax-group (TrxG) genes; the most frequently identified were the Polycomb Repressive Complex 1 (PRC1) genes ph-d, psc (Shao et al, 1999) and su(var)3-9, a histone methyltransferase that functions in gene silencing (Czermin et al, 2001; Schotta et al, 2002) (Appendix Table S6). Other NetNC-FTI coherent genes with function related to PcG/TrxG included: the PRC1 subunit ph-p (Shao et al, 1999); corto which physically interacts with PcG and TrxG proteins (Salvaing et al, 2003; Lopez et al, 2001); the TrxG-related gene lolal that is required for silencing at polycomb response elements (Mishra et al, 2003; Quijano et al, 2016); taranis which has genetic interactions with TrxG and PcG (Schuster & Smith-Bolton, 2015; Calgaro et al, 2002; Fauvarque et al, 2001); TrxG genes trithorax, moira (Tie et al, 2014; Ingham & Whittle, 1980; Hong & Choi, 2016; Crosby et al, 1999). The gene silencing factor su(var)205 was also returned by NetNC-FTI in four TF_ALL datasets (Fanti et al, 1998; Fanti & Pimpinelli, 2008). Therefore, NetNC found direct regulation by Snail and Twist of a) PRC1 core components and other gene silencing factors, b) TrxG genes, c) modifiers of PcG, TrxG activity.
Brain development clusters were found for six TF_ALL datasets, as well as members of the proneural achaete-scute complex and Notch signalling components (Campos-Ortega, 1993). Snail regulation of neural clusters is consistent with its well characterised roles in repression of ectodermal (neural) genes in the prospective mesoderm (Leptin, 1991; Wieschaus & Nüsslein-Volhard, 2016; Gilmour et al, 2017). Additionally, Snail is important for neurogenesis in fly development and also in mammals (Ashraf & Ip, 2001; Zander et al, 2014). Therefore, binding to these neural functional modules could reflect potentiation of transcription to enable rapid activation in combination with other transcription factors as and when required within specific neural developmental trajectories (Sandmann et al, 2007; Nevil et al, 2017). The mushroom body is a prominent structure in the fly brain that is important for olfactory learning and memory (Caron et al, 2013). Twist is typically a transcriptional activator (Gilmour et al, 2017), although it appears to contribute to Snail’s repressive activity (Lin et al, 2015) and Twist-related protein 1 was shown to directly repress Cadherin-1 in breast cancers (Vesuna et al, 2008). Our NetNC results predict novel Twist functions, for example in regulation of mushroom body neuroblast proliferation factors such as retinal homeobox, slender lobes, and taranis (Kraft et al, 2016; Orihara-Ono et al, 2005; Manansala et al, 2013).
2.5 Breast cancer subtype is characterised by differential expression of orthologous Snail and Twist functional targets
Genes that participate in EMT have roles in metastasis and drug resistance across multiple cancers (Creighton et al, 2010; Wang et al, 2009; Nieto et al, 2016). Indeed, the NetNC-FTI Snail and Twist targets included known drivers of tumour biology and also predicted novel cancer driver genes (Figure 4, Appendix Figure S5, Appendix Tables S2, S5, S6). Breast cancer intrinsic molecular subtypes with distinct clinical trajectories have been extensively validated and complement clinico-pathological parameters (Sørlie et al, 2003; Cejalvo et al, 2017). These subtypes are known as luminal-A, luminal-B, HER2-overexpressing, normal-like and basal-like (Sørlie et al, 2003). All of the NetNC-FTI networks for the nine TF_ALL datasets overlapped with known cancer pathways, including significant enrichment for Notch modifiers (q<0.05). We hypothesised that orthologous genes from NetNC clusters for Snail and Twist would stratify breast cancers by intrinsic molecular subtype. Indeed, aberrant activation of Notch orthologues in breast cancers had been demonstrated and was linked with EMT-like signalling, particularly for the basal-like and claudin-low subtypes (Stylianou et al, 2006; Barnawi et al, 2016; Ingthorsson et al, 2016; Zhang et al, 2017; Chen et al, 2009).
2.5.1 Unsupervised clustering with predicted functional targets recovers breast cancer intrinsic subtypes
We identified 57 human orthologues (ORTHO-57) that were NetNC-FTI functional targets in ≥4 TF_ALL datasets and were also represented within integrated gene expression microarray data for 2999 breast tumours (BrC_2999) (Moleirinho et al, 2013). Unsupervised clustering with ORTHO-57 stratified BrC_2999 by intrinsic molecular subtype (Figure 5). Clustering with NetNC results for individual Twist and Snail datasets also recovered the intrinsic breast cancer subtypes (Appendix Figure S6). Features within the heatmap were marked according to the dendrogram structure and gene expression values (Figure 5). Basal-like tumours were characterised by EN1 and NOTCH1, aligning with previous work (feature_Bas; Figure 5) (Stylianou et al, 2006; Barnawi et al, 2016; Beltran et al, 2014). Interestingly, elevated ETV6 expression was also largely restricted to the basal-like subtype. Others had reported ETV6 copy number amplifications in 21% of basal-like tumours and identified recurrent gene fusions with ETV6 in several cancers (Adélaïde et al, 2007; Letessier et al, 2005; Golub et al, 1995; Buijs et al, 1995). The luminal A subtype (feature_LumA) shared gene expression characteristics with luminal B (feature_LumB2, ERBB3, MYO6) and normal-like (DOCK1, ERBB3, MYO6) tumours. High BMPR1B expression was a clear defining feature of the luminal A subtype, in agreement with previous results demonstrating oncogenic BMP signalling in luminal epithelia (Chapellier et al, 2015). Others had previously shown that the BMP2 ligand may be pleiotropic in breast cancers and development, promoting EMT characteristics in some contexts (Ma et al, 2005; Ren & Dijke, 2017; Katsuno et al, 2008). Tumours with high relative BMP2 expression were typically basal-like while luminal cancers had low BMP2; therefore, our data align with BMP2 upregulation as a feature of the EMT programme in basal-like cancers.
The luminal B subtype had been established to have worse prognosis than luminal A, but more favourable prognosis than ESR1 negative cancers (Sørlie et al, 2001, 2003). Several genes were highly expressed in both feature_LumB1 and in ESR1 negative subtypes (feature_ERneg), including ECT2, SNRPD1, SRSF2 and CBX3; our data suggest that these genes might contribute to worse survival outcomes for luminal B relative to luminal A cancers. Indeed, the luminal A as well as normal-like tumour subtypes had low expression of these genes and CBX3, ECT2 had previously been correlated with poor prognosis (Liang et al, 2017; Wang et al, 2018). Furthermore, SNRPD1 is a component of core spliceosomal small nuclear ribonucleoproteins (snRNPs) and SRSF2 is a splicing factor (Bermingham et al, 1995); RNA splicing was shown to be a survival factor in siRNA screening across multiple basal-like cancer cell lines and was suggested to have potential therapeutic value (Chan et al, 2017). Feature_LoExp broadly represents genes with low detection rates (indicated by the %P column in Figure 5) and the tumours populating feature_LoExp are a mixture of subtypes, but largely from a single study (Popovici et al, 2010). Notably, key EMT genes (SNAI2, TWIST1, QKI) had highest relative expression in normal-like tumours (feature_NL, Figure 5). Indeed, SNAI2 and TWIST1 were both assigned to the normal-like centroid. Feature_NL also included homeobox transcription factors (HOXA9, MEIS2) and a secreted cell migration guidance gene (SLIT2) (Schmid et al, 2007; Oulad-Abdelghani et al, 1997; Borrow et al, 1996). Some genes had high expression in both normal-like (feature_NL) and basal-like cancers, including the QKI RNA-binding protein that regulates circRNA formation in EMT (Conn et al, 2015) and the FZD1 wnt/β-catenin receptor.
Indeed, genes in feature_Bas and feature_NL clustered together in the gene dendrogram, reflecting greater gene expression similarity to each other than to genes within features for the other breast cancer subtypes (Figure 5). Therefore, these data revealed concordance in gene expression between the normal-like and basal-like subtypes, including known EMT-related genes.
2.5.2 Integrating NetNC functional target networks and breast cancer transcriptome profiling
We visualised basal-like and normal-like gene annotations for orthologues in the NetNC-FTI networks, offering a new perspective on the molecular circuits controlling these different subtypes (Figure 4, Appendix Figure S5). We focussed on basal-like and normal-like cancers because they accounted for the large majority of genes in the datasets examined and were prominent in results from the centroid and heatmap analysis (Figure 5, Appendix Figure S6). Additionally, EMT had been shown to be important for basal-like breast cancer biology (Sarrió et al, 2008; Guen et al, 2017) and key EMT genes were annotated to the normal-like subtype in our analysis. NetNC-FTI clusters that contained splicing factors and components of the ribosome were associated with the normal-like subtype in results for three datasets (twi_2-4h_intersect, twi_2-6h_intersect, twi_2-3h_union); twi_2-3h_union also had communities for the proteasome and proteasome regulatory subunits where a high proportion of genes were annotated to the normal-like subtype. Orthologues in the sna_2-4h_Toll10b ‘RNA degradation and transcriptional regulation’ cluster were annotated to the basal-like subtype and never to the normal-like subtype; this cluster included HECA, which had been reported to function as both a tumour suppressor (Makino et al, 2001; Lin et al, 2013) and an oncogene (Chien et al, 2006). HECA was also identified in NetNC-FTI analysis of twi_2-4h_intersect and twi_4-6h_intersect; these two datasets had Twist binding at different, noncontiguous sites that were both assigned to hdc, the D. melanogaster orthologue of HECA. Roles for hdc were identified in cell survival (Resende et al, 2013, 2017), differentiation of imaginal primordia (Weaver & White, 1995), RNA interference (Dorner et al, 2006), Notch signalling (Guruharsha et al, 2012) and tracheal branching morphogenesis, where it is upregulated by the snail gene family member escargot (Steneberg et al, 1998).
HECA was upregulated in basal-like relative to normal-like tumours (p<3.3×10⁻²³). Taken together, these data support participation of HECA in an EMT-like gene expression programme in basal-like breast cancers. An ‘ion antiporter and GPCR’ cluster for the sna_2-4h_Toll10b dataset (Figure 4) included the Na+/H+ antiporter SLC9A6 that also belonged to the twi_2-4h_Toll10b ‘transmembrane transport’ cluster (Appendix Figure S5). Alterations in pH by Na+/H+ exchangers, particularly SLC9A1, had been shown to drive basal-like breast cancer progression and chemoresistance (Cardone et al, 2005; Amith & Fliegel, 2017; Stock et al, 2008). SLC9A6 was 1.6-fold upregulated in basal-like relative to normal-like tumours (p<8.4×10⁻⁷¹) and may drive pH dysregulation as part of an EMT-like programme in basal-like breast cancers. A further cluster that was specific to basal-like cancers in the twi_2-3h_union dataset was annotated to ‘mitochondrial translation’, an emerging area of interest for cancer therapy (Škrtić et al, 2011; Weinberg & Chandel, 2015). Orthologues annotated to the basal-like subtype were frequently located in NetNC-FTI chromatin organisation clusters. For example, the twi_2-3h_union ‘chromatin organisation and transcriptional regulation’ cluster had six genes annotated to the basal-like subtype, including three Notch signalling modifiers (ash1, tara, Bap111) that were respectively orthologous to ASH1L, SERTAD2 and SMARCE1. The ASH1L histone methyltransferase was a candidate poor prognosis factor with copy number amplifications in basal-like tumours (Liu et al, 2014); SERTAD2 was a known bromodomain interacting oncogene and E2F1 activator (Hsu et al, 2001; Cheong et al, 2009); SMARCE1, a core subunit of the SWI/SNF chromatin remodelling complex, had been shown to regulate ESR1 function and to potentiate breast cancer metastasis (García-Pedrero et al, 2006; Sethuraman et al, 2016).
Therefore our integrative analysis predicted specific chromatin organisation factors downstream of Snail and Twist, identifying orthologous genes that may control Notch output and basal-like breast cancer progression.
2.6 Novel Twist and Snail functional targets influence invasion in a breast cancer model of EMT
Our analysis underlined the functional relevance of novel regulators of EMT and cell invasion, including SNX29 (also known as RUNDC2A), ATG3, IRX4 and UNK. Therefore, we investigated the functional and instructive role of these genes in an established cell model of invasion by overexpressing SNAI1 in MCF7 cells (Dhasarathy et al, 2007). MCF7 cells are weakly invasive (Lacroix & Leclercq, 2004), thus the SNAI1-inducible MCF7 cell line was well suited to study how altered expression of the selected genes influences invasion, either in conjunction with SNAI1 induction or knockdown, or independently. This was achieved by the co-transfection of cDNAs of these genes alongside a doxycycline-inducible vector (pGoldiLox, (Peluso et al, 2017)) that expressed either SNAI1 cDNA or validated shRNAs against SNAI1 (Liu et al, 2013). To test for the instructive role of these genes, we ectopically expressed the selected NetNC functional targets in a transwell invasion assay that contained MCF7 with or without SNAI1 cDNA, SNAI1 shRNAs, mCherry control or scrambled control shRNA (Figure 6).
Over-expression of IRX4 significantly increased invasion relative to controls in all conditions examined and IRX4 had high relative expression in a subset of basal-like breast cancers (Figures 5, 6). IRX4 is a homeobox transcription factor involved in cardiogenesis, marking a ventricular-specific progenitor cell (Nelson et al, 2016) and is also associated with prostate cancer risk (Xu et al, 2014). SNX29 belongs to the sorting nexin protein family, which functions in endosomal sorting and signalling (Cullen, 2008; Marat & Haucke, 2016). SNX29 is poorly characterised and ectopic expression significantly reduced invasion in a SNAI1-dependent manner (Figure 6). Since we obtained these results, SNX29 downregulation has been associated with metastasis and chemoresistance in ovarian carcinoma (Zhu et al, 2015), consistent with SNX29 inhibition of invasion driven by Snail. ATG3 is an E2-like enzyme required for autophagy and mitochondrial homeostasis (Oral et al, 2012; Radoshevich et al, 2010); we found that ATG3 overexpression significantly increased invasion. Consistent with our results, knockdown of ATG3 has been reported to reduce invasion in hepatocellular carcinoma (Li et al, 2013). UNK is a RING finger protein homologous to the fly unkempt protein which binds mRNA, functions in ubiquitination and was upregulated in cells undergoing gastrulation (Mohler et al, 1992). Others have reported that UNK mRNA binding controls neuronal morphology and can induce spindle-like cell shape in fibroblasts (Murn et al, 2015, 2016). We found that UNK significantly increased MCF7 cell invasion in a manner that was additive with and independent of Snail, supporting a potential role in breast cancer progression. Indeed, UNK was overexpressed in cancers relative to controls in the ArrayExpress GeneAtlas (Parkinson et al, 2009).
3 Discussion
Our novel Network Neighbourhood Clustering (NetNC) algorithm and D. melanogaster functional gene network (DroFN) were applied to predict functional transcription factor binding targets from statistically significant ChIP-seq and ChIP-chip peak assignments during early fly development (TF_ALL). Seven of the nine TF_ALL datasets included developmental time periods encompassing stage four (syncytial blastoderm, 80-130 minutes), cellularisation of the blastoderm (stage five, 130-170 minutes) and initiation of gastrulation (stage 6, 170-180 minutes) (MacArthur et al, 2009; Zeitlinger et al, 2007; Ozdemir et al, 2011; Sandmann et al, 2007; Campos-Ortega & Hartenstein, 1997). The datasets twi_2-4h_intersect, sna_2-4h_intersect, twi_2-4h_Toll10b and sna_2-4h_Toll10b additionally included initial germ band elongation (stage seven, 180-190 minutes) (Sandmann et al, 2007; Zeitlinger et al, 2007; Campos-Ortega & Hartenstein, 1997); twi_2-4h_Toll10b and sna_2-4h_Toll10b may have also included stages eight (190-220 minutes) and nine (220-260 minutes) (Zeitlinger et al, 2007; Campos-Ortega & Hartenstein, 1997). Twi_2-4h_intersect and sna_2-4h_intersect were tightly staged between stages 5-7 (Sandmann et al, 2007). Additional to stages four, five and six, twi_1-3h_hiConf may have included the latter part of stage two (preblastoderm, 25-65 minutes) and stage three (pole bud formation, 65-80 minutes) (Campos-Ortega & Hartenstein, 1997). The twi_4-6h_intersect dataset was restricted to stages eight to nine which included germ band elongation and segmentation of neuroblasts (Sandmann et al, 2007; Campos-Ortega & Hartenstein, 1997). The above differences in the biological material analysed could be an important factor underlying variation between datasets, although there was considerable overlap in the functional networks predicted for TF_ALL (Figure 4, Appendix Table S2, Appendix Figure S5).
We integrated Notch screens and the expression of orthologous human breast cancer genes with the functional Snail, Twist targets predicted by NetNC, in order to illuminate the conserved molecular networks that orchestrate epithelial remodelling in development and tumour progression. Our analysis substantiated Snail and Twist function in regulating components of multiple core cell processes that govern the global composition of the transcriptome and proteome (Figure 4, Appendix Figure S5). These processes included transcription, chromatin organisation, RNA splicing, translation and protein turnover (ubiquitination). We identified a ‘Developmental Regulation Cluster’ (DRC) which was the major transcriptional control module identified in all nine TF_ALL datasets. Notch and also wingless had consistently high betweenness centrality in the DRC, which is a measure of a node’s influence within a network (Freeman, 1977). In this context, high betweenness centrality may highlight genes with key roles in determining the global network state, and so are important for controlling phenotype. Therefore Notch, wingless were predicted to be key control points regulated by Snail, Twist in the mesoderm specification network. Notch signalling putatively integrates with multiple canonical pathways (Guruharsha et al, 2012) including interaction with the Wnt gene family which have many conserved roles across metazoan development, such as in axis specification and mesoderm patterning (reviewed in (Nusse & Clevers, 2017) and (Schubert & Holland, 2013)). Our results are complementary to qualitative dynamic modelling where key control nodes may not necessarily have high betweenness (Mbodj et al, 2016). 
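To make the betweenness centrality measure concrete, the following minimal Python sketch computes unnormalised shortest-path betweenness (Freeman, 1977) for an undirected, unweighted graph using the standard Brandes breadth-first formulation. It is an illustration only and is not the NetNC or DroFN implementation; the function name and the toy node labels (`hub`, `a1`, `a2`, `b1`, `b2`) are hypothetical and do not correspond to genes in the DRC.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Unnormalised betweenness for an undirected, unweighted graph.

    adjacency: dict mapping each node to an iterable of neighbouring nodes.
    """
    scores = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search from source: count shortest paths (sigma)
        # and record the predecessors of each node on those paths.
        order = []
        predecessors = {node: [] for node in adjacency}
        sigma = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        sigma[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    sigma[w] += sigma[v]
                    predecessors[w].append(v)
        # Accumulate pair dependencies, starting from the farthest nodes.
        delta = {node: 0.0 for node in adjacency}
        for w in reversed(order):
            for v in predecessors[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                scores[w] += delta[w]
    # Each undirected shortest path was counted once from each endpoint.
    return {node: value / 2.0 for node, value in scores.items()}


# Hypothetical toy network: two triangles that share a single hub node.
toy_network = {
    "a1": ["a2", "hub"],
    "a2": ["a1", "hub"],
    "hub": ["a1", "a2", "b1", "b2"],
    "b1": ["hub", "b2"],
    "b2": ["hub", "b1"],
}
toy_scores = betweenness_centrality(toy_network)
```

In this toy network every shortest path between the two triangles passes through `hub`, so it receives the highest score while the remaining nodes score zero. This is the sense in which high betweenness may flag nodes that connect otherwise separate parts of a network.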
Orthologues of both Notch and wingless were previously shown to be aberrantly regulated in breast cancers (for example, Stylianou et al, 2006; DiMeo et al, 2009), and we found that unsupervised clustering using predicted Snail and Twist functional targets stratified five intrinsic breast cancer subtypes (Sørlie et al, 2003) (Figure 5). While more recent studies have classified greater numbers of breast cancer subtypes, for example identifying ten groups (Curtis et al, 2012), the five subtypes employed in our analysis had been widely used, extensively validated, exhibited clear differences in prognosis, overlapped with subgroups defined using standard clinical markers (ESR1, HER2), and so were associated with distinct treatment pathways (Sørlie et al, 2003; Cejalvo et al, 2017). Analysis of the twi_2-3h_union dataset revealed a basal-like specific cluster for ‘mitochondrial translation’ (MT) (Figure 4). Inhibition of MT is a therapeutic strategy for AML and mitochondrial metabolism is currently being explored in the context of cancer therapy (Škrtić et al, 2011; Weinberg & Chandel, 2015). Our results highlight MT as a potentially attractive target in basal-like breast cancers, aligning with previous work linking MT upregulation with deletion of RB1 and p53, which occurs in approximately 20% of triple negative breast cancers (Jones et al; Nik-Zainal et al, 2016). NetNC analysis provided functional context for many Notch modifiers and proposed mechanisms of signalling crosstalk by predicting regulation of modifiers by Twist, Snail (Figure 4, Appendix Figure S5, Additional File 4). Clusters where multiple modifiers were identified may represent cell meso-scale units that are particularly important for Notch signalling in the context of mesoderm development and EMT (Additional File 4).
For example, the mediator complex and transcription initiation subcluster for twist_union (Figure 4) had 13 nodes, of which 5 were Notch modifiers including orthologues of MED7, MED8, MED31. Our results show regulation of Notch signalling by Snail and Twist targeting of Notch transcriptional regulators, trafficking proteins, post-translational modifiers (e.g. ubiquitinylation) and receptor recycling (non-canonical, ligand-independent signalling) as well as regulation of pathways that may attenuate or modify the Notch signal, consistent with previous studies (Guruharsha et al, 2012; Ntziachristos et al, 2014). Taranis, a Notch modifier in the chromatin organisation cluster, was orthologous to the SERTAD2 bromodomain interacting oncogene (Hsu et al, 2001) which had elevated expression in a basal-like breast cancer cluster that contained NOTCH1 (Figure 4, Figure 5). Our integrative analysis suggests that SERTAD2 could control the phenotypic consequences of NOTCH1 activation in basal-like breast cancers through a chromatin remodelling mechanism. Notch signalling modulation has been applied in a clinical setting, for example in treatment of Alzheimer’s disease, and is a promising area for cancer therapy (Shih & Wang, 2007; Ntziachristos et al, 2014; Messersmith et al, 2015; Takebe et al, 2015). Orthologues of Notch modifiers identified in our analysis provide a pool of candidates that could potentially inform development of companion diagnostics or combination therapies for agents targeting the notch pathway in basal-like breast cancers. In addition to Notch signalling, taranis also functions to stabilise the expression of engrailed in regenerating tissue (Schuster & Smith-Bolton, 2015). The engrailed orthologue EN1 is a survival factor in basal-like breast cancers (Beltran et al, 2014); SERTAD2 and EN1 were both located within the basal-like breast cancer cluster ‘Bas’ (Figure 5). Indeed, EN1 was the clearest single basal-like cancer biomarker in the data examined. 
Therefore, we speculate that SERTAD2 may cooperate with EN1 in basal-like breast cancers, reflecting conservation of function between fly and human; indeed, our results evidence coordinated expression of these two genes as part of a gene expression programme controlled by EMT TFs. Regulation of EN1, SERTAD2 within an EMT programme could harmonise previous reports of key roles for both neural-specific and EMT TFs in basal-like breast cancers (Beltran et al, 2014; Sarrió et al, 2008). The taranis chromatin organisation cluster also contained Notch modifiers ash1, Bap111, which were respectively orthologous to the ASH1L, SMARCE1 breast cancer poor prognosis factors (Liu et al, 2014; Sethuraman et al, 2016). The notch pathway had been shown to drive EMT-like characteristics as well as to mediate hypoxia-induced invasion in multiple cell lines (Sahlgren et al, 2008). Previous work had also shown that SMARCE1, a SWI/SNF complex member, interacted with Hypoxia Inducible Factor 1A (HIF1A) signalling and had significant effects on cell viability upon knockdown/ectopic expression alongside disruption of notch family signalling by gamma-secretase inhibition (Sethuraman et al, 2016). SMARCE1 was recently shown to be important in early-stage cancer invasion (Sokol et al, 2017). Aligning with these studies, our results evidence conserved function for SMARCE1 in (partial) EMT signalling in both mesoderm development and breast cancer progression, possibly in regulation of SWI/SNF targeting. SWI/SNF has been shown to regulate chromatin switching in oral cancer EMT (Mohd-Sarip et al, 2017). NetNC results showing predicted regulation of chromatin organisation genes by Snail, Twist also included core polycomb group (PcG) and trithorax components, suggesting novel crosstalk with epigenetic regulation mechanisms in specifying mesodermal cell fates. 
PcG genes have long been considered to be crucial oncofetal regulators and have become the focus of significant cancer drug development efforts (Sparmann & Lohuizen, 2006; Koppens & Lohuizen, 2016). Our findings align with previous reports that gene silencing in EMT involves PcG, for example at Cdh1, CDKN2A (Herranz et al, 2008; Yang et al, 2010; Lamouille et al, 2014; Koppens & Lohuizen, 2016) and support a model where EMT TFs control the expression of their own coregulators; for example, Snai1 was shown to recruit polycomb repressive complex 2 members (Herranz et al, 2008). Overall, these NetNC results predicted components of feedback loops where the Snail, Twist EMT transcription factors regulate chromatin organisation genes that, in turn, may both reinforce and coordinate downstream stages in gene expression programmes for mesoderm development and cancer progression. Stages of the EMT programme had been described elsewhere, reviewed in (Nieto et al, 2016); our results map networks that may control the remodelling of Waddington’s landscape - identifying crosstalk between Snail, Twist, epigenetic modifiers and regulation of key developmental pathways, including notch (Hemberger et al, 2009). We speculate that dynamic interplay between successive cohorts of TFs and chromatin organisation factors could be an attractive mechanism to determine progress through and the ordering of steps in (partial) EMTs, consistent with ‘metastable’ intermediate stages (Nieto et al, 2016).
Our work integrates datasets from D. melanogaster and human breast cancers, offering insight into the biology of epithelial remodelling in both systems. Indeed, the fly genome is relatively small and hence more tractable for network studies, while the availability of data for analysis (e.g. ChIP-chip, ChIP-seq, genetic screens) is enhanced by both considerable community resources and the relative ease of experimental manipulation (Wangler et al, 2017; Mohr et al, 2014). The datasets sna_2-4h_Toll10b, twi_2-4h_Toll10b represent embryos formed entirely from mesodermal lineages (Zeitlinger et al, 2007) and, together, had a significantly greater proportion of basal-like breast cancer genes than the combined sna_2-3h_union, twi_2-3h_union datasets (p<8.0×10⁻⁴). This enrichment aligned with work showing that basal-like breast cancers have EMT characteristics (Sarrió et al, 2008; Guen et al, 2017) and again highlighted commonalities between mesoderm development and breast cancers. We also presented evidence for molecular features of EMT in normal-like (NL) breast cancers. Multiple EMT factors, including SNAI2 and TWIST1, had highest expression values in NL cancers and were assigned to the NL centroid. Previous work had shown enrichment of non-epithelial genes in the normal-like subtype (Sørlie et al, 2001). EMT was known to confer stem-like cell properties (Mani et al, 2008; DiMeo et al, 2009; Schmidt et al, 2015) and our results were consistent with dedifferentiation or arrested differentiation due to activation of an EMT-like programme, forming a stem-like cell subpopulation in NL cancers. For example, SNAI2 had been linked with a stem-like signature in breast cancer metastasis and was critical for maintenance of mammary stem cells (Lawson et al, 2015; Guo et al, 2012).
NetNC predicted targets for Twist included the proteasome, splicing and ribosomal components; orthologous genes for these subnetworks were largely assigned to the NL subtype in multiple TF_ALL datasets, suggesting potential regulation of these cell systems by TWIST1 in NL cancers. Some EMT genes were highly expressed in both basal-like and NL cancers, for example QKI (Figure 5); EMT-like signalling may therefore be a common thread connecting these two subtypes despite other important differences, such as hormone receptor status (Dai et al, 2015). Indeed, the majority of predicted Snail and Twist functional targets had orthologues that were assigned to either basal-like or NL cancers, providing further evidence that EMT-like signalling is important in both subtypes. We note that cell-compositional effects, associated with a previously reported high proportion of stromal tissue in NL tumours (Prat & Perou, 2011), could explain the observed enrichment of EMT molecular characteristics in this subtype. In addition to stromal compositional differences in the NL subtype, as noted above, an EMT signature might reflect inhibition of differentiation. Indeed, NL cancers were previously shown to have high expression of stem cell markers (Sørlie et al, 2001; Marcato et al, 2011; Raha et al, 2014; Sieuwerts et al, 2009). Our results demonstrated that NetNC functional targets from fly mesoderm development capture clinically relevant molecular features of breast cancers and revealed novel candidate drivers of tumour progression. Roles in control of invasion were found for four predicted functional targets (UNK, SNX29, ATG3, IRX4) in ectopic expression and shRNA knockdown experiments with a Snail inducible breast cancer cell line. Potential artefacts associated with changes in cell growth or proliferation are controlled within the transwell assays used, because values reflect the ratio of signal from cells located at either side of the matrigel barrier.
These in vitro confirmatory results both support the novel analysis approach and evidence new function for the genes examined.
All nine of the TF_ALL datasets had high predicted NetNC-lcFDR neutral binding proportion (PNBP), ranging from 50% to ≥80%. These PNBP values may reflect an upper limit on neutral binding because some functional targets could be missed; for example due to errors in assigning enhancer binding to target genes and bona fide regulation of genes that have few DroFN edges with other candidate ChIP-seq or ChIP-chip targets. While neutral TF binding may arise partly from non-specific associations of TFs with euchromatin, alternative explanations include dormant binding, possibly reflecting developmental lineage (Junion et al, 2012) or enhancer priming (Factor et al, 2014). Additionally, calibration of lcFDR values against synthetic data based on KEGG might influence neutral binding estimates, due to potential differences in network properties between TF targets and KEGG pathways; such as clustering coefficient. Candidate target genes that were assigned to peaks according to RNA polymerase occupancy (MacArthur et al, 2009) had PNBP similar to or lower than datasets where RNA polymerase data was not used. Therefore, we found no evidence of benefit in using RNA polymerase binding data to guide peak matching. Candidate targets for the twi_2-4h_Toll10b, sna_2-4h_Toll10b datasets were defined using a relatively generous peak threshold (two-fold enrichment), which may explain the high PNBP found for sna_2-4h_Toll10b. Twi_2-4h_Toll10b had similar PNBP to the other Twist datasets analysed, although application of a higher peak enrichment threshold would likely lead to a lower PNBP value for this dataset. Indeed, twi_2-6h_intersect had the strongest peak intensity and lowest PNBP compared with other datasets from the same study (twi_2-4h_intersect, twi_4-6h_intersect). 
Candidate targets for twi_2-6h_intersect were continuously bound across two different time periods; the only other member of TF_ALL that represented binding at multiple time periods was the HOT dataset, which also had low PNBP. Indeed, the only dataset with lower PNBP than either HOT or twi_2-6h_intersect was the Twist ChIP-seq ‘high-confidence’ dataset (twi_1-3h_hiConf) where the most stringent peak filtering protocols had been applied (Ozdemir et al, 2011). Twi_1-3h_hiConf was the only ChIP-seq dataset analysed in this study, however this factor alone is unlikely to explain the high proportion of predicted functional binding. Indeed, overlap with ChIP-chip regions informed classification of the ‘high-confidence’ ChIP-seq peaks taken for twi_1-3h_hiConf (Ozdemir et al, 2011). Our results aligned with evidence that HOT regions function in gene regulation, despite their depletion for known TF motifs (Kvon et al, 2012; Chen et al, 2014; Boyle et al, 2014) and supported the emerging picture of widespread combinatorial control involving TF-TF interactions, cooperativity and TF redundancy (Stampfel et al, 2015; Long et al, 2016; Spitz & Furlong, 2012; Jolma et al, 2015; Khoueiry et al, 2017). We found similar NetNC PNBP values for datasets produced by taking either the intersection or the union of two independent Twist antibodies. Hits identified by multiple antibodies may be technically more robust due to reduced off-target binding (Sandmann et al, 2007). However, taking the union of candidate binding sites could eliminate false negatives arising from epitope steric occlusion, for example due to context-specific protein interactions. The similarity of PNBP values for either the intersection or the union of Twist antibodies suggests that, despite the higher expected technical specificity, the intersection of candidate targets may not enrich for functional binding sites at the 1% peak-calling FDR threshold applied (Sandmann et al, 2007; MacArthur et al, 2009). 
In general, fewer false negatives implies recovery of numerically more functional TF targets that therefore may produce denser clusters in DroFN which, in turn, could facilitate NetNC discovery of functional targets. Indeed, datasets representing the union of two antibodies ranked highly in terms of both the total number and proportion of genes recovered at lcFDR<0.05 or gFDR<0.05 (Figure 3).
NetNC may be widely useful for discovery of highly connected gene groups across multiple different data types. Further possible applications include: identification of differentially expressed pathways and macromolecular complexes from functional genomics data; illuminating common biology among CRISPR screen hits in order to inform prioritisation of candidates for follow-up work (Shalem et al, 2014); and discovery of functional coherence in chromosome conformation capture data (4-C, 5-C), for example in enhancer regulatory relationships (Simonis et al, 2006; Dostie et al, 2006). NetNC may be applied to any undirected network; including protein-protein or genetic interactions, telecommunications, climate and social networks. Indeed, context-specific effects are important for many disciplines; for example a given social event is unlikely to involve everyone in the social network, and regulatory changes may only apply to a subset of businesses in an economic model. The multiple complementary analysis modes in NetNC provide adaptability to extract value from real-world datasets. A parameter-free mode, NetNC-FBT, provides resilience to enable discovery of coherent genes with graph properties different to those of the KEGG pathways used in calibration of the ‘Functional Target Identification’ analysis mode (NetNC-FTI). NetNC-FBT employs unsupervised clustering, and analyses the shape of the NFCS score distribution rather than absolute score values. Therefore, NetNC-FBT can separate high-scoring arbitrary subgraphs from disconnected or sparsely connected nodes in the input data. We note that NetNC-FBT had a low false positive rate on blind test data (Figure 2). On the other hand, the NetNC-FTI approach does not assume that the input gene list contains a large proportion of low-scoring genes and therefore has clear advantages for analysis of datasets that primarily contain functionally coherent genes.
Also, NetNC-FTI gave the best overall performance for discrimination between biological pathways and Synthetic Neutral Target Genes (SNTGs). The NetNC software distribution includes a conservative, empirical method for estimation of local False Discovery Rate (lcFDR) from global FDR values, which could be useful in a wide range of applications. For example, FDR estimation is fundamental for mass spectrometry proteomics (Käll et al, 2008; Blakeley et al, 2012) where target-decoy searching approaches typically utilise a single ‘decoy’ search as the basis for fitting a null (H0) score distribution in order to estimate lcFDR (Blakeley et al, 2012; Käll et al, 2008; Elias & Gygi, 2007). However, NetNC generates H0 by resampling, which would be equivalent to having multiple decoy searches, which therefore enables estimation of local FDR by stepping through global FDR values. There might be merit in further investigation of the NetNC local FDR estimation strategy in the context of proteomics database searching. Evaluation on blind test data alongside leading clustering algorithms (MCL (Enright et al, 2002), HC-PIN (Wang et al, 2011)) showed that NetNC performed well overall, with particular advantages for analysis of datasets that had substantial synthetic neutral TF binding. Indeed, the nine TF_ALL datasets examined were predicted to have at least 50% neutral binding, aligning well with application of NetNC for discovery of functional targets in ChIP-chip and ChIP-seq data. TF binding focus networks derived from NetNC may also be useful in prioritising components for inclusion within regulatory network modelling. Software and datasets are made freely available as Additional Files associated with this publication.
NetNC does not require a priori definition of gene groupings, but instead dynamically defines clusters within the subnetwork induced in DroFN by the input gene list. Therefore, NetNC is complementary to techniques that employ static, predefined gene groupings such as GSEA (Subramanian et al, 2005), DAVID (Huang et al, 2009) and GGEA (Geistlinger et al, 2011). For example, NetNC discovered functional groups for poorly characterised genes (Figure 4A, bottom right). Additionally, NetNC may be used for dimensionality reduction in gene-wise multiple hypothesis testing. One example application could be analysis of a gene list defined using a differential expression fold-change threshold, providing a hypothesis-generating step prior to evaluation of statistical significance performed on individual coherent genes or on gene clusters. The NetNC output would therefore identify a subset of genes, based on network coherence, for input into significance testing. Benjamini-Yekutieli false discovery rate control (Benjamini, 2001) would be appropriate due to the expected dependency of expression values from genes within NetNC clusters. This approach appears attractive for analysis of high-dimensional data, such as transcriptome profiling, where statistical power is diluted by the large number of hypotheses (genes) tested relative to the small number of biological samples that are typically available for analysis.
Indeed, established functional genomics data processing workflows involve filtering to reduce dimensionality; for example to eliminate genes with expression values indistinguishable from the assay ‘background’ (Quackenbush, 2002; Trapnell et al, 2012). NetNC could be deployed as a filter to select coherent genes according to the prior knowledge encoded by a functional gene network (FGN); NetNC would therefore generate a hypothesis for candidate differentially expressed genes based on the biological context represented by the FGN and the assumption that gene expression changes occur coherently, forming network communities. Statistical evaluation of this network coherence property, including estimation of FDR, is available within NetNC for numerical thresholding. Therefore, NetNC has novel applications in distillation of knowledge from high-dimensional data, including single-subject datasets, which are an important emerging area for precision medicine (Vitali et al, 2017). Application of statistical and graph theoretic methods for quantitative evaluation of relationships between genes (nodes) in NetNC offers an alternative to the classical emphasis on individual genes in studying the relationship between genotype and phenotype (Baliga et al, 2017).
4 Materials and Methods
4.1 A high-confidence, comprehensive D. melanogaster functional gene network (DroFN)
A Drosophila melanogaster functional network (DroFN) was developed using previously described methodology (Overton et al, 2011). Functional interaction probabilities, corresponding to pathway co-membership, were estimated by logistic regression of Bayesian probabilities from STRING v8.0 scores (Jensen et al, 2009) and Gene Ontology (GO) coannotations (Ashburner et al, 2000), taking KEGG (Kanehisa et al, 2010) pathways as gold standard.
Gene pair co-annotations were derived from the GO database of March 25th 2010. The GO Biological Process (BP) and Cellular Component (CC) branches were read as a directed graph and genes added as leaf terms. The deepest term in the GO tree was selected for each gene pair, and BP was given precedence over CC. Training data were taken from KEGG v47, comprising 110 pathways (TRAIN-NET). Bayesian probabilities for STRING and GO coannotation frequencies were derived from TRAIN-NET (Overton et al, 2011). Negative pairs were selected at random from TRAIN-NET using the Perl rand() function to generate training data with equal numbers of positive and negative pairs (TRAIN-BAL), which was input for logistic regression, to derive a model of gene pair functional interaction probability:
Where:
pGO is the Bayesian probability derived from Gene Ontology coannotation frequency
pSTRING is the Bayesian probability derived from the STRING score frequency
The above model was applied to TRAIN-NET and the resulting score distribution thresholded by seeking a value that maximised the F-measure (van Rijsbergen, 1979) and True Positive Rate (TPR), while also minimising the False Positive Rate (FPR). The selected threshold value (p ≥0.779) was applied to functional interaction probabilities for all possible gene pairs to generate the high-confidence network, DroFN.
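As an illustration of the scoring and thresholding steps, the sketch below applies a logistic model to the two Bayesian probabilities and then picks the cutoff that maximises the F-measure on labelled pairs. The coefficients b0, b1 and b2, the function names and the toy data are hypothetical, since the fitted values are not given in this section. The published procedure also weighed TPR and FPR when choosing the threshold, which this simplified version does not attempt.

```python
import math

def interaction_probability(p_go, p_string, b0, b1, b2):
    """Logistic model of functional interaction probability.
    b0, b1, b2 are hypothetical coefficients (not stated in the text)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * p_go + b2 * p_string)))

def f_measure(scores, labels, threshold):
    """F-measure when pairs scoring >= threshold are called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores, labels):
    """Try every observed score as a cutoff and return the best by F-measure."""
    return max(sorted(set(scores)), key=lambda t: f_measure(scores, labels, t))

# Toy gene-pair scores with gold-standard labels (True = same pathway).
toy_scores = [0.95, 0.85, 0.80, 0.60, 0.40, 0.20]
toy_labels = [True, True, True, False, True, False]
toy_cutoff = best_threshold(toy_scores, toy_labels)
```

In the study itself the selected cutoff was p ≥0.779; the toy data here serve only to show the mechanics.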
For evaluation of the DroFN network, time-separated test data (TEST-TS) were taken from KEGG v62 on 13/6/12, consisting of 14 pathways that were not in TRAIN-NET. TEST-TS was screened against TRAIN-NET, eliminating 34 positive and 218 negative gene pairs to generate the blind test dataset TEST-NET (4599 pairs). GeneMania (version of 10th August 2011) (Warde-Farley et al, 2010) and DROID (v2011_08) (Yu et al, 2008) were assessed against TEST-NET.
4.2 Network neighbourhood clustering (NetNC) algorithm
NetNC identifies functionally coherent nodes in a subgraph S of functional gene network G (an undirected graph), induced by some set of nodes of interest D; for example, candidate transcription factor target genes assigned from analysis of ChIP-seq data. Intuitively, we consider the proportion of common neighbours for nodes in S to define coherence; for example, nodes that share neighbours have greater coherence than nodes that do not share neighbours. The NetNC workflow is summarised in Figure 1 and described in detail below. Two analysis modes are available: a) node-centric (parameter-free) and b) edge-centric, with two parameters. Both modes begin by assigning a p-value to each edge (Sij) from Hypergeometric Mutual Clustering (HMC) (Goldberg & Roth, 2003), described in points one and two, below.
1. A 2×2 contingency table is derived for each edge Sij by conditioning on the Boolean connectivity of nodes in S to Si and Sj. Nodes Si and Sj are not counted in the contingency table.
2. Exact hypergeometric p-values (Goldberg & Roth, 2003) for enrichment of the nodes in S that have edges to the nodes Si and Sj are calculated using Fisher's Exact Test from the contingency table. Therefore, a distribution of p-values (H1) is generated for all edges Sij.
3. The NetNC edge-centric mode employs positive false discovery rate (Storey, 2002) and an iterative minimum cut procedure (Ford & Fulkerson, 1956) to derive clusters as follows:
a) Subgraphs with the same number of nodes as S are resampled from G; application of steps 1 and 2 to these subgraphs generates an empirical null distribution of neighbourhood clustering p-values (H0). This H0 accounts for the effect of the sample size and the structure of G on the Sij hypergeometric p-values (pij). Each NetNC run on TF_ALL in this study resampled 1000 subgraphs to derive H0.
b) Each edge in S is associated with a positive false discovery rate (q) estimated over pij using H1 and H0. The neighbourhood clustering subgraph C is induced by edges where the associated q ≤ Q.
c) An iterative minimum cut procedure (Ford & Fulkerson, 1956) is applied to C until all components have density greater than or equal to a threshold Z. Edge weights in this procedure are taken as the negative log p-values from H1.
d) As described in section 4.2.3, thresholds Q and Z were chosen to optimise the performance of NetNC on the 'Functional Target Identification' task using training data taken from KEGG. Connected components with fewer than three nodes are discarded, in line with common definitions of a 'cluster'. Remaining nodes are classified as functionally coherent.
4. The node-centric, parameter-free mode proceeds by calculating degree-normalised node functional coherence scores (NFCS) from H1, then identifies modes of the NFCS distribution using Gaussian Mixture Modelling (Lubbock et al, 2013):
a) The node functional coherence score (NFCS) is calculated by summation of Sij p-values in H1 (pij) for fixed Si, normalised by the Si degree value in S (di):
b) Gaussian Mixture Modelling (GMM) is applied to identify structure in the NFCS distribution. Expectation-maximization fits a mixture of Gaussians to the distribution using independent mean and standard deviation parameters for each Gaussian (Dempster et al, 1977; Lubbock et al, 2013). Models with 1..9 Gaussians are fitted and the final model selected using the Bayesian Information Criterion (BIC).
c) Nodes in high-scoring mode(s) are predicted to be ‘Functionally Bound Targets’ (FBTs) and retained. Firstly, any mode at NFCS<0.05 is excluded because this typically represents nodes with no edges in S (where NFCS=0). A second step eliminates the lowest scoring mode if >1 mode remains. Very rarely a unimodal model is returned, which may be due to a large non-Gaussian peak at NFCS=0 confounding model fitting; if necessary this is addressed by introducing a tiny Gaussian noise component (SD=0.01) to the NFCS=0 nodes to produce NFCS_GN0. GMM is performed on NFCS_GN0 and nodes eliminated according to the above procedure on the resulting model. This procedure was developed following manual inspection of results on training data from KEGG pathways with 'synthetic neutral target genes' (SNTGs) as nodes resampled from G (TRAIN-CL, described in section 2.2.1).
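The node-centric steps can be sketched as below. This is an illustrative outline rather than the NetNC implementation: the per-edge score passed in is whatever quantity is derived from pij (any transformation of the p-values is not specified in this section, so the values are used as given), the mixture model fit is assumed to have been done elsewhere, and 'a mode at NFCS<0.05' is interpreted here as a fitted component whose mean is below 0.05. All function names are hypothetical.

```python
def node_coherence_scores(nodes, edge_scores):
    """Degree-normalised sum of per-edge scores for each node.

    edge_scores maps (node_a, node_b) -> score derived from p_ij.
    Nodes with no edges in S receive a score of 0."""
    totals = {n: 0.0 for n in nodes}
    degree = {n: 0 for n in nodes}
    for (u, v), score in edge_scores.items():
        for n in (u, v):
            totals[n] += score
            degree[n] += 1
    return {n: (totals[n] / degree[n] if degree[n] else 0.0) for n in nodes}

def retained_modes(mode_means, floor=0.05):
    """Apply the two exclusion steps to the means of fitted mixture components.

    Step 1 drops modes below `floor`; step 2 drops the lowest remaining
    mode when more than one is left."""
    kept = sorted(m for m in mode_means if m >= floor)
    if len(kept) > 1:
        kept = kept[1:]
    return kept
```

For example, fitted modes with means 0.0, 0.8, 2.5 and 4.0 would leave 2.5 and 4.0 as the high-scoring modes whose nodes are retained.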
Therefore, NetNC can be applied to predict functional coherence using either edge-centric or node-centric analysis modes. The edge-centric mode automatically produces a network, whereas the node-centric analysis does not output edges; therefore, to generate networks from predicted FBT nodes an edge pFDR threshold may be applied, and pFDR≤0.1 was selected as the default value. The statistical approaches to estimate pFDR and local FDR are described in the sections below.
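Both modes start from the edge p-values of points one and two above. A minimal sketch of that calculation follows, assuming a one-sided (enrichment) Fisher's exact test on the 2×2 table of shared and unshared neighbours; the function names and the dictionary-of-sets graph representation are illustrative choices, not taken from the NetNC distribution.

```python
from math import comb

def build_neighbours(nodes, edges):
    """Adjacency sets for the undirected subgraph S."""
    neighbours = {n: set() for n in nodes}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    return neighbours

def hmc_pvalue(neighbours, i, j):
    """P(shared neighbours >= observed) for edge (i, j) under the hypergeometric null."""
    others = set(neighbours) - {i, j}           # i and j are not counted
    n_i = neighbours[i] & others
    n_j = neighbours[j] & others
    # 2x2 contingency table over the remaining nodes of S
    both = len(n_i & n_j)
    only_i = len(n_i) - both
    only_j = len(n_j) - both
    neither = len(others) - both - only_i - only_j
    total = both + only_i + only_j + neither
    row_i = both + only_i                       # nodes connected to i
    col_j = both + only_j                       # nodes connected to j
    tail = sum(comb(row_i, k) * comb(total - row_i, col_j - k)
               for k in range(both, min(row_i, col_j) + 1))
    return tail / comb(total, col_j)
```

Applying `hmc_pvalue` to every edge of S gives the H1 distribution; applying it to subgraphs resampled from G gives H0.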
4.2.1 Estimating positive false discovery rate for hypergeometric mutual clustering p-values
The following procedure is employed to estimate positive False Discovery Rate (pFDR) (Storey, 2002) in the NetNC edge-centric mode. Subgraphs with number of nodes identical to S are resampled from G to derive a null distribution of HMC p-values (H0) (section 4.2, above). The resampling approach for pFDR calculation in NetNC-FTI controls for the structure of the network G, including degree distribution, but does not control for the degree distribution or other network properties of the subgraph S induced by the input nodelist (D). In scale free and hierarchical networks, degree correlates with clustering coefficient; indeed, this property is typical of biological networks (Yamada & Bork, 2009). Part of the rationale for NetNC assumes that differences between the properties of G and S (for example; degree, clustering coefficient distributions) may enable identification of clusters within S. Therefore, it would be undesirable to control for the degree distribution of S during the resampling procedure for pFDR calculation because this would also partially control for clustering coefficient. Indeed, clustering coefficient is a node-centric parameter that has similarity with the edge-centric Hypergeometric Mutual Clustering (HMC) calculation (Goldberg & Roth, 2003) used in the NetNC algorithm to analyse S. Hence, the resampling procedure does not model the degree distribution of S, although the degree distribution of G is controlled for. Positive false discovery rate is estimated over the p-values in H1 (pij) according to Storey (Storey, 2002):
Where:
R denotes hypotheses (edges) taken as significant
V is the number of false positive results (type I error)
NetNC steps through threshold values (pα) in pij estimating V using edges in H0 with p≤pα. H0 represents Y resamples, therefore V is calculated at each step:
The H1 p-value distribution is assumed to include both true positives (TP) and false positives (FP); H0 is taken to be representative of the FP present in H1. This approach has been successfully applied to peptide spectrum matching (Fitzgibbon et al, 2008; Sennels et al, 2009). The value of R is estimated by:
Additionally, there is a requirement for monotonicity:
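The quantities defined in this section, including the monotonicity requirement, can be written compactly as follows. This is a reconstruction from the surrounding definitions and from Storey's standard definition of pFDR, not the typeset equations of the original; the subscript x indexes ordered threshold values of pα, as in the discussion of equation (6).

```latex
\begin{align}
\mathrm{pFDR} &= E\!\left[\frac{V}{R}\,\middle|\,R>0\right] \\
V(p_\alpha) &= \frac{1}{Y}\,\bigl|\{\,p \in H_0 : p \le p_\alpha\,\}\bigr| \\
R(p_\alpha) &= \bigl|\{\,p_{ij} \in H_1 : p_{ij} \le p_\alpha\,\}\bigr| \\
\mathrm{pFDR}_{x+1} &\ge \mathrm{pFDR}_{x} \quad \text{for } p_{x+1} > p_{x}
\end{align}
```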
Equation (6) represents a conservative procedure to prevent inconsistent scaling of pFDR due to sampling effects. For example consider the scaling of pFDR for pFDRx+1 at a pij value with additional edges from H1 but where no more resampled edges (i.e. from H0) were observed in the interval between px and px+1; before application of equation (6), the value of pFDRx+1 would be lower than pFDRx. The approach also requires setting a maximum on estimated pFDR, considering that there may be values of pα where R is less than V. We set the maximum to 1, which would correspond to a prediction that all edges at pij are false positives. The assumption that H1 includes false positives is expected to hold in the context of candidate transcription factor target genes and also generally across biomedical data due to the stochastic nature of biological systems (Raj & van Oudenaarden, 2008; Raj et al, 2010; Marusyk et al, 2012). We note that an alternative method to calculate R using both H1 and H0 would be less conservative than the approach presented here.
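The estimation procedure can be sketched in a few lines. The version below is a minimal illustration under the assumptions stated in the text: V is the count of pooled H0 p-values at or below the threshold divided by the number of resamples Y, R is the count of H1 p-values at or below the threshold, estimates are capped at 1, and monotonicity is enforced by carrying forward the running maximum. The function name and data layout are hypothetical.

```python
from bisect import bisect_right

def estimate_pfdr(h1, h0, n_resamples):
    """Map each observed p-value threshold in H1 to an estimated pFDR.

    h1: edge p-values observed for the input subgraph S
    h0: pooled edge p-values from all resampled subgraphs
    n_resamples: number of resampled subgraphs (Y)"""
    h1_sorted = sorted(h1)
    h0_sorted = sorted(h0)
    estimates = {}
    previous = 0.0
    for p_alpha in sorted(set(h1)):
        v = bisect_right(h0_sorted, p_alpha) / n_resamples  # expected false positives
        r = bisect_right(h1_sorted, p_alpha)                # edges called significant
        pfdr = min(1.0, v / r)                              # maximum of 1
        pfdr = max(pfdr, previous)                          # monotonicity requirement
        estimates[p_alpha] = pfdr
        previous = pfdr
    return estimates
```

With H1 = [0.001, 0.01, 0.02, 0.5], pooled H0 = [0.005, 0.3, 0.4, 0.6] and Y = 2, the raw estimate at 0.02 (0.5/3) falls below the estimate at 0.01 (0.5/2), so the monotonicity step holds it at 0.25.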
4.2.2 Estimating local false discovery rate from global false discovery rate
We developed an approach to estimate local false discovery rate (lcFDR) (Efron et al, 2001), being the probability that an object at a threshold (pα) is a false positive. Our approach takes global pFDR values as basis for lcFDR estimation. In the context of NetNC analysis using the DroFN network, a false positive is defined as a gene (node) without a pathway comembership relationship to any other nodes in the nodelist D. The most significant pFDR value (pFDRmin) from NetNC was determined for each node Si across the edge set Sij. Therefore, pFDRmin is the pFDR value at which node Si would be included in a thresholded network. We formulated lcFDR for the nodes with pFDRmin meeting a given pα (k) as follows:
Where l denotes the pFDRmin closest to and smaller than k, and where at least one node has pFDRmin≡pFDRl. Therefore, our approach can be conceptualised as operating on ordered pFDRmin values. n indicates the nodes in D with pFDRmin values meeting threshold k. X represents the number of nodes at pα≡k. The number of false positives (FP) for nodes with pα≡k (FPk) is estimated by subtracting the FP for threshold l from the FP at threshold k. Thus, division of FPk by X gives local false discovery rate bounded by k and l (Appendix Figure S7). If we define the difference between pFDRk and pFDRl:
Substituting pFDRk for (pFDRl + pFDRΔ) into equation (7) and then simplifying gives:
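A form consistent with the verbal definitions above is sketched below, where n nodes meet threshold k and X of them sit exactly at k, so that n − X nodes meet threshold l. The three lines correspond to the local FDR definition, the difference term, and the result of the substitution. This is a reconstruction from the text, and the layout of the original typeset equations may differ.

```latex
\begin{align}
\mathrm{lcFDR}_k &= \frac{n\,\mathrm{pFDR}_k - (n - X)\,\mathrm{pFDR}_l}{X} \\
\mathrm{pFDR}_\Delta &= \mathrm{pFDR}_k - \mathrm{pFDR}_l \\
\mathrm{lcFDR}_k &= \frac{n\,\mathrm{pFDR}_\Delta}{X} + \mathrm{pFDR}_l
\end{align}
```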
Equations (7) and (9) do not apply to the node(s) in D at the smallest possible value of pFDRmin because pFDRl would be undefined; instead, the value of lcFDRk is calculated as the (global) pFDRmin value. Indeed, global FDR and local FDR are equivalent when H1 consists of objects at a single pFDRmin value. Taking the mean lcFDRk across D provided an estimate of neutral binding in the studied ChIP-chip, ChIP-seq datasets and was calibrated against mean lcFDR values from datasets that had a known proportion of Synthetic Neutral Target Genes (SNTGs). Estimation of the total proportion of neutral binding in ChIP-chip or ChIP-seq data required lcFDR rather than (global) pFDR and, for example, accounts for the shape of the H1 distribution. In the context of NetNC analysis of TF_ALL, mean lcFDR may be interpreted as the probability that any candidate target gene is neutrally bound in the dataset analysed; therefore providing estimation of the total neutral binding proportion. Computer code for calculation of lcFDR is provided within the NetNC distribution (Additional File 5). Estimates of SNTGs by the NetNC-FBT approach were not taken forward due to large 95% CI values (Appendix Figure S8).
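A short sketch of this calculation is given below. It assumes the simplified form lcFDR_k = n·pFDRΔ/X + pFDR_l for every level except the smallest, where the global value is used directly, as described above. It is an illustration of the stepping logic only and not the code shipped in Additional File 5; the function names are hypothetical, and no upper bound is imposed on the result because none is stated in the text.

```python
from collections import Counter

def local_fdr(pfdr_min):
    """Estimate local FDR per node from per-node pFDRmin values.

    pfdr_min: dict mapping node -> most significant pFDR across its edges."""
    counts = Counter(pfdr_min.values())
    lcfdr_at_level = {}
    n = 0            # nodes meeting the current threshold k
    lower = None     # pFDR_l, the next smaller observed level
    for k in sorted(counts):
        x = counts[k]            # nodes exactly at level k
        n += x
        if lower is None:
            lcfdr_at_level[k] = k                      # smallest level: global value
        else:
            lcfdr_at_level[k] = n * (k - lower) / x + lower
        lower = k
    return {node: lcfdr_at_level[value] for node, value in pfdr_min.items()}

def mean_local_fdr(pfdr_min):
    """Mean lcFDR across the node list, read as the neutral-binding estimate."""
    values = local_fdr(pfdr_min).values()
    return sum(values) / len(values)
```

For nine nodes with pFDRmin of 0.1 (two nodes), 0.2 (four nodes) and 0.3 (three nodes), the local estimates are 0.1, 0.25 and 0.5, giving a mean of 0.3.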
4.2.3 NetNC benchmarking and parameter optimisation
Gold standard data for NetNC benchmarking and parameterisation were taken as pathways from KEGG (v62, downloaded 13/6/12) (Kanehisa et al, 2010). Training data were selected as seven pathways (TRAIN-CL, 184 genes) and a further eight pathways were selected as a blind test dataset (TEST-CL, 186 genes) summarised in Appendix Table S7. For both TRAIN-CL and TEST-CL, pathways were selected to be disjoint and to cover a range of different biological functions. However, pathways with shared biology were present within each group; for example TRAIN-CL included the pathways dme04330 'Notch signaling' and dme04914 'Progesterone-mediated oocyte maturation', which are related by notch involvement in oogenesis (López-Schier & St Johnston, 2001; Schmitt & Nebreda, 2002). TEST-CL also included the related pathways dme04745 'Phototransduction' and dme00600 'Sphingolipid metabolism', for example where ceramide kinase regulates photoreceptor homeostasis (Acharya et al, 2003; Dasgupta et al, 2009; Yonamine et al, 2011).
Gold standard datasets were also developed in order to investigate the effect of dataset size and noise on NetNC performance. The inclusion of noise as resampled network nodes into the gold-standard data was taken to model neutral TF binding (Shlyueva et al, 2014; Li et al, 2008) and matches expectations on data taken from biological systems in general (Raj & van Oudenaarden, 2008; Marusyk et al, 2012). Therefore, gold standard datasets were generated by combining TRAIN-CL with nodes resampled from the network (G). The final proportion of resampled nodes (Synthetic Neutral Target Genes, SNTGs) ranged from 5% through to 80% in 5% increments. Since we expected variability in the network proximity of SNTGs to pathway nodes (S), 100 resampled datasets were generated per %SNTG increment. Further gold-standard datasets were generated by taking five subsets of TRAIN-CL, from three through seven pathways. Resampling was applied for these datasets as described above to generate node lists representing five pathway sets in TRAIN-CL by sixteen %SNTG levels by 100 repeats (TRAIN_CL_ALL, 8000 node lists; Additional File 2). A similar procedure was applied to TEST-CL, taking from three through eight pathways to generate data representing six pathway subsets by sixteen noise levels by 100 repeats (TEST-CL_ALL, 9600 node lists, Additional File 3). Data based on eight pathways (TEST-CL_8PW, 1600 node lists) were used for calibration of lcFDR estimates. Preliminary training and testing against the MCL algorithm (Enright et al, 2002) utilised a single subsample for 10%, 25%, 50% and 75% SNTGs (TRAIN-CL-SR, TEST-CL-SR; Additional File 6).
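The construction of these node lists can be illustrated with a short sketch. It assumes that SNTGs are drawn without replacement from network nodes that are not pathway members, and that the SNTG count is chosen so that SNTGs make up the stated fraction of the final list; neither detail is spelled out above, and the function names are hypothetical.

```python
import random

def sntg_count(n_pathway_genes, fraction):
    """Number of SNTGs needed so they form `fraction` of the final node list."""
    return round(fraction * n_pathway_genes / (1.0 - fraction))

def make_node_lists(pathway_genes, network_nodes, repeats=100, seed=0):
    """Return {fraction: [node list, ...]} for 5% to 80% SNTGs in 5% steps."""
    rng = random.Random(seed)
    # Assumption: SNTGs come from network nodes outside the pathway set.
    background = sorted(set(network_nodes) - set(pathway_genes))
    fractions = [step / 100 for step in range(5, 85, 5)]   # sixteen levels
    node_lists = {}
    for fraction in fractions:
        k = sntg_count(len(pathway_genes), fraction)
        node_lists[fraction] = [
            list(pathway_genes) + rng.sample(background, k)
            for _ in range(repeats)
        ]
    return node_lists
```

Sixteen SNTG levels by 100 repeats gives 1600 node lists per pathway set, which matches the totals quoted above (five sets give 8000, six sets give 9600).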
NetNC analysed the TRAIN-CL_ALL datasets in edge-centric mode, across a range of FDR (Q) and density (Z) threshold values. Performance was benchmarked on the Functional Target Identification (FTI) task which assessed the recovery of biological pathways and exclusion of SNTGs. Matthews correlation coefficient (MCC) was computed as a function of NetNC parameters (Q, Z). MCC is attractive because it captures predictive power in both the positive and negative classes. FTI was a binary classification task for discrimination of pathway nodes from noise, therefore all pathway nodes were taken as positives and SNTGs were negatives for the FTI MCC calculation. The FTI approach therefore tests discrimination of pathway nodes from SNTGs, which is particularly relevant to identification of functionally coherent candidate TF targets from ChIP-chip or ChIP-seq peaks.
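For reference, the MCC for the FTI task can be computed as in the sketch below, with pathway nodes as positives and SNTGs as negatives. The function names and the convention of returning 0 when the denominator is zero are illustrative choices.

```python
from math import sqrt

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention used here when a row or column is empty
    return (tp * tn - fp * fn) / denominator

def fti_mcc(predicted_coherent, pathway_nodes, sntg_nodes):
    """MCC for Functional Target Identification on one node list."""
    predicted = set(predicted_coherent)
    tp = len(predicted & set(pathway_nodes))
    fn = len(set(pathway_nodes) - predicted)
    fp = len(predicted & set(sntg_nodes))
    tn = len(set(sntg_nodes) - predicted)
    return mcc(tp, fp, tn, fn)
```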
Parameter selection for NetNC on the FTI task analysed MCC values for the 100 SNTG resamples across five pathway subsets by sixteen SNTG levels in TRAIN-CL_ALL over the Q, Z values examined, ranging respectively from 10⁻⁷ to 0.8 and from 0.05 to 0.9. Data used for optimisation of NetNC parameters (Q, Z) are given in Additional File 7 and contour plots showing mean MCC across Q, Z values per %SNTG are provided in Appendix Figure S9. A ‘SNTG specified’ parameter set was developed for situations where an estimate of the input data noise component is available, for example from the node-centric mode of NetNC. In this parameterisation, for each of the sixteen datasets with different proportions of SNTG (5% .. 80%), MCC values were normalized across the five pathway subsets of TRAIN-CL (from three through seven pathways), by setting the maximum MCC value to 1 and scaling all other MCC values accordingly. The normalised MCC values <0.75 were set to zero and then a mean value was calculated for each %SNTG value across five pathway subsets by 100 resamples in TRAIN-CL_ALL (500 datasets per noise proportion). This approach therefore only included parameter values corresponding to MCC performance ≥75% of the maximum across the five TRAIN-CL pathway subsets. The high performing regions of these ‘summary’ contour plots sometimes had narrow projections or small fragments, which could lead to parameter estimates that do not generalise well on unseen data. Therefore, parameter values were selected as the point at the centre of the largest circle (in (Q, Z) space) completely contained in a region where the normalised MCC value was ≥0.95. This procedure yielded a parameter map: (SNTG Estimate) → (Q, Z), given in Appendix Table S8. NetNC parameters were also determined for analysis without any prior belief about the %SNTG in the input data, and these therefore generalise across a wide range of %SNTG and dataset sizes.
For this purpose, a contour plot was produced to represent the proportion of datasets where NetNC performed better than 75% of the maximum performance across TRAIN-CL_ALL for the FTI task in the Q, Z parameter space. The maximum circle approach described above was applied to the contour plot in order to derive ‘robust’ parameter values (Q, Z), which were respectively 0.120, 0.306 (NetNC-FTI).
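The 'largest circle' selection can be illustrated on a discretised parameter grid. The sketch below is an assumption-laden simplification: it works in grid-index units rather than in the actual (Q, Z) coordinates, treats the grid edge as a boundary the circle may not cross, and uses a brute-force search. The function name and the boolean-mask input are hypothetical.

```python
from math import hypot

def centre_of_largest_circle(mask):
    """Find the grid cell that is farthest from any non-qualifying cell.

    mask: list of rows of booleans, True where the performance criterion
    is met. Returns ((row, col), radius) in grid-index units."""
    rows, cols = len(mask), len(mask[0])
    failing = [(r, c) for r in range(rows) for c in range(cols) if not mask[r][c]]
    best_cell, best_radius = None, -1.0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            # Distance to the nearest position outside the grid.
            radius = float(min(r + 1, c + 1, rows - r, cols - c))
            # Distance to the nearest cell that fails the criterion.
            for fr, fc in failing:
                radius = min(radius, hypot(r - fr, c - fc))
            if radius > best_radius:
                best_cell, best_radius = (r, c), radius
    return best_cell, best_radius
```

Choosing the centre of the largest contained circle favours parameter values that sit well inside a broad high-performing region, so narrow projections and small fragments of the region are not selected.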
4.2.4 Performance on blind test data
We compared NetNC against leading methods, HC-PIN (Wang et al, 2011) and MCL (Enright et al, 2002) on blind test data (Figure 2, Appendix Table S1). Input, output and performance summary files for HC-PIN on TEST-CL are given in Additional File 8. HC-PIN was run on the weighted graphs induced in DroFN by TEST-CL with default parameters (lambda = 1.0, threshold size = 3). MCL clusters in DroFN significantly enriched for query nodes from TEST-CL-SR were identified by resampling to generate a null distribution (Overton et al, 2011). Clusters with q<0.05 were taken as significant. MCL performance was optimised for the Functional Target Identification (FTI) task over the TRAIN-CL-SR datasets for MCL inflation values from 2 to 5 incrementing by 0.2. The best-performing MCL inflation value overall was 3.6 (Appendix Table S9).
4.2.5 Subsampling of transcription factor binding datasets and statistical testing
Robustness of NetNC performance was studied by taking 95%, 80% and 50% resamples from nine public transcription factor binding datasets, summarised in section 4.3 and described previously in detail (MacArthur et al, 2009; Zeitlinger et al, 2007; Sandmann et al, 2007; Ozdemir et al, 2011; Roy et al, 2010). A hundred subsamples of each of these datasets were taken at rates of 95%, 80% and 50%, thereby producing a total of 2700 datasets (TF_SAMPL). NetNC-FTI results across TF_SAMPL were used as input for calculation of median and 95% confidence intervals for the edge and gene overlap per subsampling rate for each transcription factor dataset analysed. The NetNC resampling parameter (Y) was set at 100, the default value. The edge overlap was calculated as the proportion of edges returned by NetNC-FTI for the subsampled dataset that were also present in NetNC-FTI results for the full dataset (i.e. at 100%). Therefore, nine values for median overlap and 95% CI were produced per subsampling rate for both edge and gene overlap, corresponding to the nine transcription factor binding datasets (Appendix Table S3). The average (median) value of these nine median overlap values, and of the 95% CI, was calculated per subsampling rate; these average values are quoted in Results section 2.4.
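The overlap summary can be sketched as follows. The edge overlap follows the definition given above; the 95% interval is computed here as a simple percentile interval over the subsample overlaps, which is an assumption because the interval method is not stated. Function names are hypothetical.

```python
from statistics import median

def edge_overlap(subsample_edges, full_edges):
    """Proportion of subsample-result edges also present in the full-dataset result."""
    sub = {frozenset(edge) for edge in subsample_edges}    # undirected edges
    full = {frozenset(edge) for edge in full_edges}
    return len(sub & full) / len(sub) if sub else 0.0

def percentile(sorted_values, q):
    """Linear-interpolation percentile for q in [0, 1]."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] + weight * (sorted_values[upper] - sorted_values[lower])

def summarise_overlaps(overlaps):
    """Median and 2.5th/97.5th percentiles across subsamples."""
    ordered = sorted(overlaps)
    return median(ordered), percentile(ordered, 0.025), percentile(ordered, 0.975)
```

Nine datasets by three subsampling rates by 100 subsamples gives the 2700 datasets in TF_SAMPL; `summarise_overlaps` would be applied to each group of 100.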
False discovery rate (FDR) correction of p-values was applied where appropriate and is indicated in this manuscript by the commonly used notation ‘q’. Benjamini-Hochberg correction was applied (Benjamini & Hochberg, 1995) unless otherwise specified in the text. The pFDR and local FDR values calculated by NetNC are described in Methods sections 4.2, 4.2.1 and 4.2.2 (above).
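For readers unfamiliar with the correction, a compact sketch of Benjamini-Hochberg adjusted p-values ('q' values) is shown below. It is a generic implementation of the standard step-up procedure and is not taken from the analysis scripts used in this study.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda index: pvalues[index])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q is the minimum of
    # p * m / rank over all ranks at or above the current one.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        qvalues[index] = running_min
    return qvalues
```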
4.3 Transcription factor binding and Notch modifier datasets
We analysed public Chromatin Immunoprecipitation (ChIP) data for the transcription factors twist and snail in early Drosophila melanogaster embryos. These datasets were derived using ChIP followed by microarray (ChIP-chip) (MacArthur et al, 2009; Zeitlinger et al, 2007; Sandmann et al, 2007) and ChIP followed by solexa pyrosequencing (ChIP-seq) (Ozdemir et al, 2011). Additionally 'highly occupied target' regions, reflecting multiple and complex transcription factor occupancy profiles, were obtained from ModEncode (Roy et al, 2010). Nine datasets were analysed in total (TF_ALL) and are summarised below.
The 'union' datasets (WT embryos 2-3h, mostly late stage four or early stage five) combined ChIP-chip peaks significant at 1% FDR for two different antibodies targeted at the same TF and these were assigned to the closest transcribed gene according to PolII binding data (MacArthur et al, 2009). Additionally, where the closest transcribed gene was absent from the DroFN network then the nearest gene was included if it was contained in DroFN. This approach generated the datasets sna_2-3h_union (1158 genes) and twi_2-3h_union (1848 genes). The union of peaks derived from two separate antibodies maximised sensitivity and may have reduced potential false negatives arising from epitope steric occlusion. For the 'Toll10b' datasets, significant peaks with at least twofold enrichment for Twist or Snail binding were taken from ChIP-chip data on Toll10b mutant embryos (2-4h), which had constitutively activated Toll receptor (Zeitlinger et al, 2007; Stathopoulos et al, 2002); mapping to DroFN generated the datasets twi_2-4h_Toll10b (1238 genes), sna_2-4h_Toll10b (1488 genes). Toll10b embryos had high expression of Snail and Twist, which drove all cells to mesodermal fate trajectories (Zeitlinger et al, 2007). The two-fold enrichment threshold selected for this study reflects ‘weak’ binding, although it was expected to include functional TF targets (Biggin, 2011). Therefore the candidate target genes for twi_2-4h_Toll10b and sna_2-4h_Toll10b were expected to contain a significant proportion of false positives. The Highly Occupied Target dataset included 38562 regions, of which 1855 had complexity score ≥8 and had been mapped to 1648 FlyBase genes according to the nearest transcription start site (Roy et al, 2010); 677 of these genes were matched to a DroFN node (HOT).
The ‘HighConf’ data took Twist ChIP-seq binding peaks in WT embryos (1-3h) that had been reported to be ‘high confidence’ assignments; high confidence filtering was based on overlap with ChIP-chip regions, identification by two peak-calling algorithms and calibration against peak intensities for known Twist targets, corresponding to 832 genes (Ozdemir et al, 2011). A total of 664 of these genes were found in DroFN (twi_1-3h_hiConf) and represented the most stringent approach to peak calling of all the nine TF_ALL datasets. The intersection of ChIP-chip binding for two different Twist antibodies in WT embryos spanning two time periods (2-4h and 4-6h) identified a total of 1842 target genes (Sandmann et al, 2007) of which 1444 mapped to DroFN (Intersect_ALL). Subsets of Intersect_ALL identified regions bound only at 2-4 hours (twi_2-4h_intersect, 801 genes), or only at 4-6 hours (twi_4-6h_intersect, 818 genes), or 'continuously bound' regions identified at both 2-4 and 4-6 hours (twi_2-6h_intersect, 615 genes). Assigned gene targets may belong to more than one subset of Intersect_ALL because time-restricted binding was assessed for putative enhancer regions prior to gene mapping; overlap of the Intersect_ALL subsets ranged between 30.2% and 55.4%. The Intersect_ALL datasets therefore enabled assessment of functional enhancer binding according to occupancy at differing time intervals and also to examine the effect of intersecting ChIPs for two different antibodies upon the proportion of predicted functional targets recovered.
The Notch signalling modifiers analysed in this study were selected based on identification in at least two of the screens reported in (Guruharsha et al, 2012).
4.4 Breast cancer transcriptome datasets and molecular subtypes
Primary breast tumour gene expression data were downloaded from NCBI GEO (GSE12276, GSE21653, GSE3744, GSE5460, GSE2109, GSE1561, GSE17907, GSE2990, GSE7390, GSE11121, GSE16716, GSE2034, GSE1456, GSE6532, GSE3494, GSE68892 (formerly geral-00143 from caBIG)). All datasets were Affymetrix U133A/plus 2 chips and were summarised with Ensembl alternative CDF (Dai et al, 2005). RMA normalisation (Irizarry et al, 2003) and ComBat batch correction (Johnson et al, 2007) were applied to remove dataset-specific bias as previously described (Sims et al, 2008; Moleirinho et al, 2013). Intrinsic molecular subtypes were assigned based upon the highest correlation to Sorlie centroids (Sørlie et al, 2003), applied to each dataset separately. Centred average linkage clustering was performed using the Cluster and TreeView programs (Eisen et al, 1998). Centroids were calculated for each gene based upon the mean expression across each of the Sorlie intrinsic subtypes (Sørlie et al, 2003). These expression values were squared to consider up and down regulated genes in a single analysis. Orthology to the DroFN network was defined using Inparanoid (Östlund et al, 2009). Differential expression was calculated by t-test comparing normalised (unsquared) expression values in normal-like and basal-like tumours with false discovery rate correction (Benjamini & Hochberg, 1995).
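The subtype assignment step (highest correlation to a centroid) can be sketched as below. Pearson correlation is assumed here because the correlation measure is not named above, and the centroid values in the usage note are invented for illustration; they are not the Sørlie centroids.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # undefined correlation treated as no similarity
    return covariance / (spread_x * spread_y)

def assign_subtype(sample_profile, centroids):
    """Return the subtype whose centroid correlates best with the sample.

    centroids: dict mapping subtype name -> expression values over the
    same ordered gene set as sample_profile."""
    return max(centroids, key=lambda name: pearson(sample_profile, centroids[name]))
```

For instance, with hypothetical four-gene centroids `{"basal-like": [2, -1, -2, 1], "luminal A": [-2, 1, 2, -1]}`, a sample profile of `[1.8, -0.9, -2.2, 1.1]` would be assigned to "basal-like".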
4.5 Invasion assays for validation of genes selected from NetNC results
MCF-7 Tet-On cells were purchased from Clontech and maintained as previously described (Liu et al, 2013). To analyse the ability of transfected MCF-7 breast cancer cells to degrade and invade surrounding extracellular matrix, we performed an invasion assay using the CytoSelect™ 24-Well Cell Adhesion Assay kit. This transwell invasion assay allows the cells to invade through a matrigel barrier utilising basement membrane-coated inserts according to the manufacturer's protocol. Briefly, MCF-7 cells transfected with the constructs (Doxycycline-inducible SNAI1 cDNA or SNAI1 shRNA with or without candidate gene cDNA) were suspended in serum-free medium. SNAI1 cDNA or SNAI1 shRNA were cloned in our doxycycline-inducible pGoldiLox plasmid (pGoldilox-Tet-ON for cDNA and pGoldilox-tTS for shRNA expression) using validated shRNAs against SNAI1 (NM_005985 at position 150 of the transcript (Liu et al, 2013)). pGoldilox has been used previously to induce and knock down the expression of Ets genes (Peluso et al, 2017). Following overnight incubation, the cells were seeded at 3.0×10⁵ cells/well in the upper chamber and incubated with medium containing serum with or without doxycycline in the lower chamber for 48 hours. Concurrently, 10⁶ cells were treated in the same manner and grown in a six well plate to confirm over-expression and knockdown. mRNA was extracted from these cells and quantitative real-time PCR (RT-qPCR) was performed as previously described (Essafi et al, 2011); please see Additional File 9 for gene primers. The transwell invasion assay evaluated the ratio of CyQuant dye signal at 480/520 nm in a plate reader of cells from the two wells and therefore controlled for potential proliferation effects associated with ectopic expression. We used empty vector (mCherry) and scrambled shRNA as controls and to control for the non-specific signal. At least three experimental replicates were performed for each reading.
5 Data and software availability
Software and key datasets are made freely available as Additional Files associated with this publication as follows:
Additional File 1: DroFN network and gold standard datasets for network inference.
Additional File 2: TRAIN_CL_ALL (NetNC training data).
Additional File 3: TEST_CL_ALL (NetNC test data).
Additional File 4: Cytoscape sessions with NetNC-FTI results for TF_ALL.
Additional File 5: NetNC software distribution.
Additional File 6: TRAIN-CL-SR and TEST-CL-SR (used for comparison with MCL algorithm).
Additional File 7: NetNC results on training data used for parameter optimisation (Q, Z).
Additional File 8: HCPIN input, output and performance summary files on TEST-CL.
Additional File 9: Primers for RT-qPCR.
7 Author Contributions
IMO conceived the overall project, obtained funding, designed the computational and statistical aspects, implemented and benchmarked the NetNC algorithm, performed analysis of all TF datasets and the validation data, interpreted results, produced Figures 1, 3, 4 and 6, produced all Tables except as noted below, performed orthology mapping, annotated the heatmap features in Figure 5 and supervised JO, BH, ALRL, MJF and EP-C. JO implemented the iterative minimum cut, co-designed and implemented the NetNC parameter optimisation, assisted with NetNC benchmarking and produced Figures 2 and S9 and Table S8. BH obtained funding, co-designed and implemented the DroFN network inference and benchmarking, and produced Figure S1. MJF co-designed and implemented the comparison of NetNC against the MCL algorithm and produced Table S9. ALRL co-designed and implemented the Gaussian Mixture Modelling aspects of NetNC and co-designed Equation 9. IMO, JO and ALRL wrote the NetNC software distribution. AHS obtained funding, co-designed and implemented the breast cancer transcriptome analysis, interpreted results and produced Figures 5 and S6. AE obtained funding, interpreted results, and designed and performed all bench laboratory experiments including tissue culture, transfection and transwell assays. EP-C assisted with annotation, visualisation and interpretation of the NetNC-FTI networks, including production of Figure S5. IMO led the writing of the manuscript and revised it for important intellectual content with input from JO, AHS, AE, BH, EP-C and ALRL. All authors read and approved the submitted manuscript.
8 Conflict of Interest
None declared.
6 Acknowledgements
IMO is grateful to Prof Jeremy Gunawardena and Prof Peter Sorger for hosting him at HMS and for helpful discussions. Thanks to Prof PS Thiagarajan, Prof Andrew Millar, Prof Wendy Bickmore, Prof Nick Hastie, Prof Mike Levine, Prof Ben Lehner and Prof Julian Dow for invaluable comments. Mr Nick Moir and Dr Seanna McTaggart assisted with testing the NetNC software distribution. We acknowledge financial support from: Medical Research Council (MC_UU_12018/25; IMO), Royal Society of Edinburgh Scottish Government Fellowship co-funded by Marie Curie Actions (IMO), Marie Curie Fellowship (BH), Breast Cancer Now (AHS). AE was supported by a Wellcome Trust Beit Memorial Fellowship and by funding from Prof Nick Hastie’s laboratory (MC_PC_U127527180).