Abstract
Over a decade ago, a new discipline called network medicine emerged as an approach to understand human diseases from a network theory point of view. Disease networks proved to be an intuitive and powerful way to reveal hidden connections among apparently unconnected biomedical entities such as diseases, physiological processes, signaling pathways, and genes. One of the fields that has benefited most from this improvement is the identification of new opportunities for the use of old drugs, known as drug repurposing. The importance of drug repurposing lies in the high costs and the prolonged time from target selection to regulatory approval of traditional drug development. In this document we analyze the evolution of the disease network concept during the last decade and apply a data science pipeline approach to evaluate their functional units. As a result of this analysis, we obtain a list of the most commonly used functional units and the challenges that remain to be solved. This information can be very valuable for the generation of new prediction models based on disease networks.
About the authors Eduardo P. García del Valle is a PhD student at the Faculty of Computer Science of the Universidad Politécnica de Madrid (UPM). His research areas are Knowledge Recovery, Artificial Intelligence and Bioinformatics.
Gerardo Lagunes García is a PhD student in the Medical Data Analytics Laboratory at the Center for Biomedical Technology (CTB) of the Universidad Politécnica de Madrid (UPM). His research interests are data mining, knowledge recovery, web development and bioinformatics.
Lucía Prieto Santamaría is a biotechnology graduate student at the Universidad Politécnica de Madrid (UPM).
Ernestina Menasalvas Ruiz is a Full Professor at Universidad Politécnica de Madrid. Her research activities cover various aspects of data mining project development and, in recent years, her research has focused on data mining in the medical field. She leads the Data Mining and Simulation research group at UPM.
Alejandro Rodríguez-González, PhD, is an Associate Professor at Universidad Politécnica de Madrid (UPM). His main research interests are the Semantic Web, Artificial Intelligence and biomedical informatics. He leads the Medical Data Analytics laboratory at the Center for Biomedical Technology (CTB).
Massimiliano Zanin is a Postdoctoral Researcher at the Center for Biomedical Technology (CTB) of Universidad Politécnica de Madrid (UPM). He is a member of the editorial team of Nature Scientific Reports, the European Journal of Social Behaviour, PeerJ and PeerJ Computer Science.
Introduction
The study of diseases as non-isolated elements and the understanding of how they resemble and relate to each other are crucial to provide novel insights into pathogenesis and etiology, as well as into the identification of new targets and applications for drugs [1]. The complete sequencing of the human genome at the beginning of the 21st century represented a revolution in the study of the relationships between diseases. In combination with the growing availability of transcriptomic, proteomic, and metabolomic data sources, it was expected to improve the classification of diseases [2]. However, the use of these sources raised new problems such as their fragmentation, heterogeneity, availability and the different conceptualization of their data [3, 4].
Recent developments in network theory provide a way to address this challenge by representing these complex relationships as a collection of linked nodes [5]. Complex networks theory is a statistical physics interpretation of the old graph theory, aimed at describing and understanding the structures created by the relationships between the elements of a complex system [6–9].
Those elements are represented by nodes, pairwise connected by links whenever a relationship is observed between the corresponding elements. The resulting structure can then be described by means of a plethora of topological metrics [10], or be used as a base for modelling the system.
The application of this field to biological problems has been named “network biology”, while its use in biomedical problems is known as “network medicine” [11]. Following this approach, disease networks express the relationships between diseases as a graph G = (D, W), where D represents the set of diseases (nodes) and W the set of their relationships (edges) based upon their similarity. The meaning of similarity varies depending on the data used to build the network, which may be biological (genes or common proteins) or phenotypic (comorbidity, similar symptoms) [12], among other approaches.
During the past decade, numerous studies have been proposed to improve our understanding of the functioning of diseases and their relationships by creating disease networks based on different disease-disease association models and large-scale data exploitation. Of these, a significant number were oriented to exploiting the newly discovered relationships between diseases for the reassignment of known compounds for their treatment, the so-called “drug repurposing”. In the first part of this document, we thoroughly review this previous work, analyzing the evolution of the methodologies used in the creation of disease networks from a timeline perspective up to the state of the art.
Despite their different approaches and methodologies, the studies dedicated to improving disease understanding, and particularly to drug repositioning, exhibit the typical phases of a data science pipeline, such as data extraction, data integration, modeling, validation and presentation. In the second part of the document, these common parts are analyzed and their existing implementations are compared, taking into account their use and performance. Finally, based on the previous analysis, new studies are proposed by improving or combining the phases of the pipeline.
Evolution of disease networks
Early studies proposing the use of disease networks for the analysis of their underlying relationships used data of biological origin. In 2007, Goh et al. constructed a disease-gene bipartite graph called the “Diseasome” using information from the OMIM database [1]. From the diseasome they derived the Human Disease Network (HDN), in which pairs of disorders are connected if they have common genes. The study revealed that diseases tend to cluster by disease classes and that their degree distribution follows a power law; that is, only a few diseases connect to a large number of diseases, whereas most diseases have few links to others. Aiming to reduce the bias of the HDN towards diseases transmitted in a Mendelian manner [13], subsequent studies used other sources of biological data. In 2008, Lee et al. constructed a metabolic disease network in which two disorders are connected if the enzymes associated with them catalyze adjacent reactions [14]. In 2009, Barrenas et al. [15] derived a complex disease-gene network (CDN) using GWAS (Genome Wide Association Studies). The complex disease network showed that diseases belonging to the same disease class do not always share common disease genes, and that complex disease genes are less central than the essential and monogenic disease genes in the human interactome.
The abundance of new biological data did not make researchers overlook the existence of another important resource: the highest-level clinical phenotypes, that is, symptoms. As one of the first and most obvious bases of diagnosis, the relationship between symptoms and diseases is widely documented in clinical records. In 2007, Rzhetsky et al. used the disease history of 1.5 million patients at the Columbia University Medical Center to infer the comorbidity links between disorders and prove that phenotypes form a highly connected network of strong pairwise correlations [16]. In 2009, Hidalgo et al. built a Phenotypic Disease Network (PDN) summarizing the connections of more than 10 thousand diseases obtained from pairwise comorbidity correlations reconstructed from over 30 million records from Medicare patients. The PDN is blind to the mechanism underlying the observed comorbidity, but it shows that patients tend to develop diseases in the network vicinity of diseases they have already had. Disease progression was also found to differ across genders and ethnicities [17]. More recently, Jiang et al. [18] used data from the Taiwan National Health Insurance Research Database to construct the epidemiological HDN (eHDN), where two diseases are considered connected if their probability of co-occurring in clinics deviates from what would be expected under independence. However, despite their demonstrated potential in pathological analysis, the access and use of clinical records in medical research is limited by several issues, including the heterogeneity of sources [19], ethical and legal restrictions and the disparity of regulations between countries [20].
The analysis of open text sources has been used as an alternative to medical records. One of the reasons is the improvement of Named Entity Recognition (NER) techniques for the extraction of medical terms. Okumura et al. [21] analyzed the mapping between clinical vocabularies and findings in the medical literature using OMIM as a knowledge source and MetaMap as the NLP tool. Following this idea, Rodríguez et al. [22] used web scraping and a combination of NLP techniques to extract diagnostic clinical findings from MedlinePlus articles about infectious diseases using the MetaMap tool. In a further study, the same team compared the performance of MetaMap and cTAKES in the same task [23]. The increasing availability of retrieval engines such as PubMed or UKPMC, maintained by the US National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI), respectively, has also boosted this approach [24]. In 2014, Zhou et al. extracted symptom information from PubMed to construct the Human Symptoms Disease Network (HSDN), in which the link weight between two diseases quantifies the similarity of their respective symptoms [25]. In 2015, Hoehndorf et al. created yet another human disease network using a proposed similarity measure for text-mined phenotypes [26]. Both studies compared their results with gene-based networks, finding that the symptom-based similarity of two diseases strongly correlates with the number of shared genetic associations. They also demonstrated that not only Mendelian but also common diseases tend to be grouped into classes.
Due to the intrinsic complexity of the relationships between diseases, the consideration of a single factor (shared genes or common symptoms) is a limiting point. In their 2012 review of the HDN, Goh et al. proposed that every disease-contributing factor, such as molecular links from the interactome, co-expression and metabolism, as well as genetic interactions and phenotypic comorbidity links, will have to be integrated in a context-dependent manner. Furthermore, drug chemical information and non-biological environmental factors such as toxicity information must also be incorporated [13]. The result will be a combination of general and bipartite network representations into a single, complex, k-partite heterogeneous network referred to as the complete Diseasome.
In line with this idea, Sun [27] and Albornoz [28] combined multiple data sources to create tripartite networks of gene-disease-PPI and gene-disease-pathways, respectively, to predict disease-disease associations. The latter study proved that, for two diseases sharing a certain number of genes, the level of inclusion can differ between the two diseases due to the different pool of genes and metabolic pathways involved in each disease. In 2012, Chen et al. created a heterogeneous network from 17 public data sources relating to drugs, chemical compounds, protein targets, diseases, side effects and pathways [29]. In 2013, Žitnik et al. integrated molecular interaction and ontology data of 11 different types to create another heterogeneous network. When evaluating the predictive capacity of the network, genetic interactions proved to be the most informative feature, as they tend to be causative as opposed to correlative and may therefore have less noise associated [4]. In both studies, the authors leveraged semantic ontology-level information to annotate the edges, as shown in Figure 1.
The evolution of these heterogeneous networks has resulted in the generation of complex tools for the study of disease associations based on multiple sources and types of relationships. A notable example is Hetionet [30], an integrative network encoding knowledge from millions of biomedical studies. Its data were integrated from 29 public resources to connect compounds, diseases, genes, anatomies, pathways, biological processes, molecular functions, cellular components, pharmacologic classes, side effects, and symptoms. The completeness of the network is depicted in Figure 2.
Application to drug repurposing
The constant improvement in disease association prediction through the use of network theory has fostered its application to drug repurposing. Drug repurposing is the utilization of known drugs and compounds to treat new indications [31]. Since a repositioned drug has already passed a significant number of toxicity and other tests, its safety is known and the risk of failure for reasons of adverse toxicology is reduced [32]. As a result, the cost and time needed to bring a drug to market are significantly reduced compared to traditional drug development. The commercial applications of drug repositioning and the interest shown by pharmaceutical companies have led to a growing academic activity in this field. This fact is reflected in the evolution of the results for the search “Drug Repurposing” in Google Scholar, as seen in Figure 3.
The first studies in drug repurposing were based on the “guilt-by-association” assumption, that is, similar drugs may share similar targets and vice versa [33]. In 2007, Yildirim et al. created a graph composed of US Food and Drug Administration–approved drugs and proteins linked by drug–target binary associations [34]. Similar studies were carried out by Ma’ayan [35] in 2007 and by Chiang [36] and Bleakley [37] in 2009. In 2008, Nacher and Schwartz compiled a drug-therapy network with all US-approved drugs and associated human therapies. From this bipartite network they constructed two other networks: a drug network and a therapy network. Therapies are closely linked to diseases, so the therapy network also gave insights into the relations between diseases, making this work comparable to previous studies on human disease networks [38].
The above-mentioned studies followed a drug-centric approach, that is, they discovered new indications for existing drugs based on drug-drug similarities. Other studies followed a disease-centric approach, in which effective drugs were identified based on disease-disease similarity. In 2008, Campillos et al. predicted new targets for drugs by calculating similarities between diseases based on shared drug side effects [39]. In 2009, Guanghui Hu et al. performed a systematic, large-scale analysis of genomic expression profiles of human diseases and drugs to create a disease-drug network [40]. Suthram in 2010 [41], Mathur in 2012 [42] and Zhou in 2014 [25] also predicted new uses of existing drugs based on disease-disease associations calculated from mRNA expression similarity, biological process semantic similarity or phenotypic similarity, respectively.
As was the case in disease classification, focusing purely on drug-disease relations with no consideration of other underlying genetic or pharmacological mechanisms at play is a limiting factor in the accuracy of drug repurposing prediction, due to the lack of completeness of individual information [31]. Therefore, incorporating heterogeneous data sources can potentially solve this issue. In 2011, Gottlieb et al. made use of a broader collection of data sources to create five drug-drug similarity measures and two disease-disease similarity measures. These similarity measures were then used by PREDICT, an algorithm to infer novel drug indications [43]. Daminelli in 2012 [44] and Wang in 2014 [45] built tripartite drug-target-disease networks to predict repurposing candidate drugs.
Ultimately, advances towards more comprehensive networks have resulted in tools for the prediction of new treatments given a certain disease. This is the case of Rephetio [46], a project based on Hetionet [30] that predicted repurposing candidates by applying an algorithm originally developed for social network analysis [47]. Similarly, in the context of drug discovery, one can leverage the identification of potential associations between compounds and protein targets. To cope with the noisy, incomplete and high-dimensional nature of large-scale biological data, Luo et al. proposed DTINet [48], a drug-target interaction (DTI) prediction system based on learning low-dimensional feature vectors that capture the context information of individual networks. DTINet showed better performance than other state-of-the-art DTI prediction methods and discovered the potential application of cyclooxygenase inhibitors in preventing inflammatory diseases.
A data science pipeline to build disease networks
Throughout the previous section, we have seen how the rise of network medicine studies has resulted in an expanding variety of innovative methods for the construction and exploitation of disease networks. However, despite using different strategies, these methods are generally based on determining the similarities and relationships between diseases and their treatments at the phenotypic level (comorbidity, side effects) or the biological level (common genes, proteins, compounds). Furthermore, they clearly share common phases such as data ingestion, data processing, analysis, modeling or visualization that can be represented as functional units of a data science pipeline, as shown in Figure 4.
The data science pipeline consists of a sequence of stages or functional units that sequentially process input data in order to solve a certain problem [49]. This concept applies to disease networks, where disease information is processed to discover how diseases relate to each other or how drugs can be repositioned. The pipeline representation also facilitates reproducibility and the comparison among studies, both as a whole and at the phase level. Most importantly, it also enhances the reusability and recombination of the functional units to build new drug-repurposing models. Throughout the following sections we describe the process of construction and exploitation of a disease network through the functional units of a data science pipeline.
Data acquisition and processing
The first step in the pipeline is to acquire data from a variety of sources, a process known as data acquisition or data ingestion. As seen in the section on the Evolution of Disease Networks, the growing availability of information sources has enabled the development of different approaches to improve our understanding of diseases and to predict new drug applications.
A significant number of studies use biological data, such as KEGG (genes and pathways) [14], BioGRID (protein interactions) [4] or OMIM (genes and phenotypes) [1, 42], among many others. Supplementary Table 1 contains some of the most important sources of biological information, including their type and description. Studies on disease networks focusing on drug repositioning exploit drug databases and their relation to genes, phenotypes and compounds, such as those offered by the FDA [34–37] or DrugBank [25, 39–42], for instance. Supplementary Table 2 collects the most common drug data sources. Finally, an increasingly significant number of studies use data obtained by mining medical literature sources (e.g. articles, clinical trials) such as PubMed [25, 26, 50] or the GWAS Catalog [27]. Supplementary Table 3 contains some of the most relevant sources of medical literature.
A second step in the pipeline consists of transforming and mapping data into a format that makes it more appropriate to work with (usually referred to as data processing, data wrangling or data munging). Recent studies combine multiple databases to provide more accurate prediction models [4, 29, 30]. However, this poses a challenge when relating identifiers or terms obtained from different sources. To address this problem, researchers use thesauri of terms such as MeSH, SNOMED CT or UMLS; code listings such as ICD or HGNC; and ontologies such as DO, PO, GO or Uberon [26, 51]. Being a valuable source of semantic and hierarchical information themselves, these resources allow mapping data such as disease codes or medical terms. In the case of medical literature sources, the use of metadata (such as MeSH headers in the case of PubMed, for example) is often combined with term extraction tools such as MetaMap or cTAKES [23]. Supplementary Table 4 lists some of the sources used for data mapping.
The way to exploit the information in these databases varies greatly from one source to another. The largest databases offer advanced online search and provide developers with application programming interfaces (APIs) to facilitate intensive access to data. For example, the NCBI provides the E-utilities, a public API to access all the Entrez databases including PubMed, PMC, Gene, Nuccore and Protein. The Japanese KEGG also provides REST APIs for data consumption. DisGeNET provides a SPARQL endpoint that allows exploration of the DisGeNET-RDF data set and query federation to expand gene-disease association information with data on gene expression, drug activity and biological pathways, among others. In some cases, data can also be downloaded for consumption through on-premise applications, as in the case of the Disease Ontology or the Gene Ontology, for example. This disparity complicates the use of different sources in research projects. To alleviate this problem, initiatives such as Biopython offer common libraries to access multiple sources, reducing code duplication in computational biology. Finally, it is very important to know the limitations imposed by each source regarding the volume and use of the data. Supplementary Tables 1-4 also include information in this regard.
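As a sketch of such programmatic access, the snippet below composes a request URL for the E-utilities esearch endpoint. It only builds the URL; performing the HTTP call, parsing the response and respecting NCBI's rate limits are left out:

```python
from urllib.parse import urlencode

# Base URL of the public NCBI E-utilities REST interface.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(db, term, retmax=20):
    """Compose an esearch URL for a given Entrez database and query term."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"

# Example: search PubMed for articles on disease networks and repurposing.
url = esearch_url("pubmed", "disease network drug repurposing")
print(url)
```

Libraries such as Biopython's Bio.Entrez wrap this same interface, adding response parsing and rate limiting.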
Data integration and modeling
In the next steps of the data science pipeline, the data previously acquired and processed are integrated and analyzed in order to address the question under study. In other words, a disease network is built by combining the output of the previous stage, and a model is constructed from it. Disease networks consist of a set of nodes (mainly, but not only, representing diseases) and a set of edges (connecting diseases directly or through other related node types). Depending on the type of node they connect, network edges can be directed or undirected, weighted or unweighted. As described in previous sections, over the past decade successive studies based on disease networks have proposed different models of data integration.
Homogeneous networks
Homogeneous disease networks (i.e. those where nodes represent diseases and edges represent direct connections among them) are the simplest type of disease networks. In many studies these networks are built as a projection of a heterogeneous disease network (i.e. a network in which diseases are connected to other types of nodes) [1, 28]. For example, in Figure 5, the gene-disease bipartite network is projected onto the disease similarity network (DSN) by relating two diseases that have a gene in common. The disease–disease network can then be analysed by using standard network based methods [1, 52]. In a simplistic approach, the link weights in the resulting disease–disease network represent the link multiplicity resulting from the projection. More complex methods, such as hyperbolic weighting or resource allocation weighting, have been proposed as an alternative [53, 54].
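The projection step can be sketched as follows. The gene-disease associations are illustrative, and the edge weights are plain link multiplicities (the simplistic approach mentioned above):

```python
from itertools import combinations
from collections import defaultdict

# Toy gene-disease bipartite network; associations are illustrative only.
gene_to_diseases = {
    "TNF":   {"rheumatoid arthritis", "Crohn disease", "psoriasis"},
    "IL23R": {"Crohn disease", "psoriasis"},
    "HBB":   {"sickle cell anemia"},
}

def project(bipartite):
    """Project onto a disease-disease network, weighting edges by the
    number of shared genes (link multiplicity)."""
    weights = defaultdict(int)
    for diseases in bipartite.values():
        for a, b in combinations(sorted(diseases), 2):
            weights[(a, b)] += 1  # each shared gene adds one to the weight
    return dict(weights)

dsn = project(gene_to_diseases)
print(dsn)  # Crohn disease and psoriasis share two genes (TNF, IL23R)
```

Hyperbolic or resource-allocation weighting would replace the `+= 1` increment with a degree-dependent contribution.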
In other studies, homogeneous disease networks are built as similarity networks. In these networks, if the similarity score between diseases i and j is greater than zero, the corresponding vertices are linked by an edge, and the weight of this edge is the corresponding disease similarity score. Several computation methods for the disease similarity score have been proposed, the Vector Space Model (VSM) [55] being among the most popular. For instance, in 2006, van Driel et al. represented diseases as vectors of features (viz. disease-associated MeSH terms extracted from OMIM records) weighted by their inverse document frequency [56]. The similarity between diseases was then computed as the cosine of the disease vector angles (i.e. cosine similarity). A similar approach was followed by Zhou to build the HSDN [25] and by Sun to build the Integrated Disease Network [57]. Hoehndorf et al. proposed Normalized Pointwise Mutual Information (NPMI) for disease phenotypic term weighting and later used the PhenomeNET system to compute similarity between diseases using a Jaccard index based measure [26]. Similarity measures based on the term hierarchy in the Disease Ontology and the Gene Ontology have been proposed by Resnik, Lin, Wang, Mathur and Cheng [42, 58–61], and have been integrated in online tools like DisSim or DisSetSim [62, 63]. Okumura et al. described alternative similarity measures based on standardized disease classification, probabilistic calculation, and machine learning [64].
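A minimal sketch of the vector-space approach: diseases represented as weighted term vectors and compared by cosine similarity. The terms and weights are toy values, not real MeSH annotations, and the inverse-document-frequency weighting step is omitted:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical phenotypic feature vectors for two diseases.
disease_a = {"cough": 1.2, "fever": 0.8, "dyspnea": 2.0}
disease_b = {"cough": 1.0, "dyspnea": 1.5, "wheezing": 0.7}

print(round(cosine(disease_a, disease_b), 3))  # high overlap of terms
```

An edge between the two diseases would then be created with this score as its weight.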
Heterogeneous networks
The projection of heterogeneous networks into homogeneous disease-disease networks allows applying simpler network analysis techniques on the resulting network. However, it often results in information loss. For instance, in Figure 5, by projecting the gene-gene network onto the disease-disease network, the information about gene interactions and their structure is lost. In contrast, heterogeneous networks make it easy to predict relationships between entities of different types, such as diseases, genes or drugs, following a guilt-by-association paradigm [33]. For example, a drug that regulates a gene associated with a disease could be repurposed for diseases associated with the same gene. Data fusion by matrix factorization and network-topology-based techniques, such as diffusion and meta-paths, are the most common methods for edge prediction in heterogeneous networks.
Matrix factorization methods are closely related to clustering (unsupervised) algorithms. Non-Negative Matrix Factorization (NNMF) decomposes matrices of heterogeneous data and data relationships to obtain low-dimensional matrix factors. These factors are then used to reconstruct the data matrices, adding new unobserved data obtained from the latent structure captured by the low-dimensional matrix factors. Hence, NNMF provides a mechanism to integrate heterogeneous data of any number, type and size. In 2013, Žitnik et al. applied a variant of NNMF called non-negative matrix tri-factorization to discover new disease-disease associations by fusing 11 data sources on four types of objects including drugs, genes, DO terms and GO terms [4]. In 2015, Dai et al. integrated drug-disease associations, drug-gene interactions, and disease-gene interactions with a matrix factorization model to predict novel drug indications [65]. More recently, Zhang et al. proposed a similarity constrained matrix factorization method for drug-disease association prediction using data of known drug-disease associations, drug features and disease semantic information [66].
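The idea can be sketched with classic multiplicative updates (Lee and Seung) on a toy association matrix. This is a generic NNMF illustration, not the tri-factorization variant used by Žitnik et al.:

```python
import numpy as np

def nnmf(X, rank, iters=500, eps=1e-9):
    """Factor X ≈ W @ H with W, H >= 0 via multiplicative updates."""
    rng = np.random.default_rng(0)
    n, m = X.shape
    W = rng.random((n, rank))
    H = rng.random((rank, m))
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy drug-disease association matrix (1 = known association).
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]])
W, H = nnmf(X, rank=2)
scores = W @ H  # reconstruction: high scores at zero entries of X
                # suggest unobserved candidate associations
print(np.round(scores, 2))
```

The latent structure captured by W and H is what "fills in" the zero entries with non-trivial scores.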
Methods based on diffusion (i.e. information spreading across network links) have also been extensively proposed to estimate the strength of the connection between nodes of heterogeneous networks. An advantage of such approaches, also called network propagation methods, over matrix factorization is that they preserve the network structure. Chen et al. developed the method of Network-based Random Walk with Restart on the Heterogeneous network (NRWRH), a variation of a ranking algorithm, to predict potential drug-target interactions on heterogeneous networks [67]. Further variations of random walk algorithms, such as Bi-Random Walk (BiRW), have been applied to predict novel disease-gene [68], disease-miRNA [69] or disease-lncRNA associations [70], among others.
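A minimal random-walk-with-restart sketch on a small homogeneous toy network (not the full NRWRH heterogeneous formulation): the walker restarts at a seed node with probability r, and the stationary distribution ranks every node by proximity to the seed.

```python
import numpy as np

def rwr(A, seed, r=0.3, tol=1e-10):
    """Random walk with restart: stationary probability of each node."""
    T = A / A.sum(axis=0, keepdims=True)  # column-stochastic transitions
    p0 = np.zeros(A.shape[0])
    p0[seed] = 1.0
    p = p0.copy()
    while True:
        p_next = (1 - r) * (T @ p) + r * p0  # diffuse, then restart
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next

# 4-node toy network with edges 0-1, 0-2, 1-2, 2-3 (illustrative).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
p = rwr(A, seed=0)
print(np.round(p, 3))  # node 3, reachable only through node 2, ranks lowest
```

In NRWRH the same iteration runs over a block matrix combining, e.g., the drug network, the target network and their bipartite links.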
Metapath-based approaches also preserve the network structure, and additionally provide an intuitive framework and interpretable models and results. A meta-path P is a path defined over the general schema of the heterogeneous network G = (A, R), where A represents the set of nodes and R the set of their relationships. A metapath is denoted P_l = A_1 -(R_1)-> A_2 -(R_2)-> ... -(R_k)-> A_(k+1), where l is an index indicating the corresponding metapath [47]. Figure 6 shows the metapaths extracted from an annotated heterogeneous network.
In their 2012 study, Chen et al. developed a meta-path based statistical model called Semantic Link Association Prediction (SLAP) to assess the association of drug-target pairs and to predict missing links [29]. In 2016, Fu et al. proposed an alternative DTI approach to the SLAP algorithm, taking advantage of machine learning methods such as Random Forest and Support Vector Machines [63]. To quantify the prevalence of meta-paths, Himmelstein adapted an existing method developed for social network analysis (PathPredict) and developed a new metric called the degree-weighted path count (DWPC). The DWPC downweights paths through high-degree nodes when computing meta-path prevalence [30].
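The degree-weighting idea can be sketched with degree-damped biadjacency matrices for a single hypothetical Compound-Gene-Disease metapath. This follows the spirit of the DWPC (paths through hub nodes contribute less), not Himmelstein's exact implementation:

```python
import numpy as np

def dwpc(CG, GD, w=0.4):
    """Degree-weighted path count for the metapath Compound-Gene-Disease.

    Each biadjacency matrix is damped by its row and column degrees raised
    to -w, so every path's contribution shrinks with the degrees of the
    nodes it traverses.
    """
    def damp(M):
        rd = M.sum(axis=1, keepdims=True) ** -w   # source-node degrees
        cd = M.sum(axis=0, keepdims=True) ** -w   # target-node degrees
        return M * rd * cd
    return damp(CG) @ damp(GD)

# Toy biadjacency matrices: 2 compounds x 3 genes, 3 genes x 2 diseases.
CG = np.array([[1, 1, 0],
               [0, 1, 1]], dtype=float)
GD = np.array([[1, 0],
               [1, 1],
               [0, 1]], dtype=float)
print(np.round(dwpc(CG, GD), 3))  # compound-disease metapath prevalence
```

With w = 0 this reduces to a plain path count; larger w penalizes hub genes more strongly.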
Despite maintaining and exploiting the structure of heterogeneous networks, methods based on diffusion or meta-paths present some limitations, such as the bias introduced by the noise and high dimensionality of biological data, or the effort required in feature engineering. Recently, Luo et al. designed DTINet, a novel network integration pipeline for DTI prediction. DTINet integrates information from heterogeneous sources (e.g., drugs, proteins, diseases and side effects) and copes with the noisy, incomplete and high-dimensional nature of large-scale biological data by learning low-dimensional but informative vector representations of features for both drugs and proteins [48].
Model validation
In this analysis of the construction of disease networks, we want to give special relevance to the validation process. Ensuring that the computational pipeline is producing correct and valid results is critical, particularly in a clinical setting [71]. As previously explained, disease networks are used in studies as diverse as the discovery of new disease-disease relationships, the prediction of gene-disease associations (GDA) or the repositioning of drugs. The validation of the network therefore depends on the type of study in question. In general, validation can be done experimentally or by computational techniques.
Approaches and sources
Experimental validation includes the verification of the predictions in a controlled environment outside of a living organism (in vitro) or using a living organism (in vivo). Animal studies and clinical trials are two forms of in vivo research. For example, in their drug repositioning study based on heterogeneous networks, Luo et al. validated the bioactivities of the COX inhibitors predicted by DTINet experimentally. They tested their inhibitory potencies on mouse kidney lysates using COX fluorescent activity assays [48]. Jodeleit et al. validated their disease network of inflammatory processes in humanized NOD/SCID/IL2Rγ (NSG) mice [72]. While experimental validation studies have the potential to offer more conclusive results about the performance of disease networks, they have several limitations. First, animal studies and clinical trials require expensive lab work and take a long time. In addition, their conclusions can be misleading. For example, a therapy can offer a short-term benefit but cause long-term harm. Also, it is debatable whether genomic responses in mouse models mimic human inflammatory disease [73].
In silico is an expression meaning “performed on computer or via computer simulation”. In silico tests have the potential to speed up the validation process while reducing the need for expensive lab work. In silico validation requires a point of reference for evaluating the model performance, also known as a Criterion Standard or Gold Standard. It is noteworthy that in the field of biomedicine the Criterion Standard is usually just the best performing test available under reasonable conditions [74]. In this sense, for example, an MRI is the gold standard for brain tumour diagnosis, though it is not as good as a biopsy [75]. Hence, the most recurrent benchmarks used in the in silico validation of disease networks include consolidated biomedical data sources and the medical literature.
Sources of biological, phenotypic or chemical data, as well as several available ontologies and code standards (see Data extraction section), are used for validation in many studies focusing on disease networks. For instance, their performance in discovering disease-disease relationships has been validated with the disease classifications in the Disease Ontology [4, 26] or in the ICD codes [28], as well as with comorbidity associations downloaded from the Human Disease Network (HuDiNe) [27]. DisGeNET has been used to validate de novo gene-disease associations [76], as it integrates data from expert-curated repositories with information gathered through text mining of the scientific literature, GWAS catalogues and animal models [77]. For the validation of drug repositioning predictions, sources such as PharmacotherapyDB and DrugCentral have been exploited [46].
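As a minimal illustration of this benchmark-based style of validation, predicted associations can be scored against a gold-standard set with precision and recall. The sketch below is generic and uses hypothetical disease pairs; it is not the procedure of any particular cited study:

```python
def precision_recall(predicted, reference):
    """Compare predicted associations against a gold-standard set.
    Each association is an unordered pair, stored as a frozenset so
    that ("a", "b") and ("b", "a") count as the same link."""
    pred = {frozenset(p) for p in predicted}
    ref = {frozenset(r) for r in reference}
    tp = len(pred & ref)  # true positives: predictions found in the benchmark
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    return precision, recall
```

In practice, the reference set would be loaded from a source such as DisGeNET or HuDiNe; the function itself only formalizes the overlap computation.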
The aforementioned sources are inevitably biased towards consolidated knowledge, and therefore they may have limitations in corroborating new discoveries. As an alternative (or, more usually, as a complement) to these sources, the medical literature (i.e. studies, medical trials, clinical histories) is used to validate disease network based studies. For instance, Mathur and Paik used previous studies to validate disease-disease and drug-target associations [42, 78]. In some cases, the validation process also involved medical experts to corroborate the discoveries [25].
Methods
Leaving aside the particularities of biomedical research and its sources, the validation of classification or prediction methods based on disease networks does not differ from other validation cases. Accordingly, the analyzed studies rely on widely used validation methods. For example, k-fold cross-validation is often used to check whether the model overfits [79, 80]. Overfitting is one of the typical problems in validation, especially when only limited data sets are available.
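The k-fold idea can be sketched in a few lines of plain Python as a generic index splitter, independent of any particular disease-network model: the data are partitioned into k folds, and each fold serves once as the held-out test set while the rest is used for training.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.
    The first n_samples % k folds get one extra sample each."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]          # held-out fold
        train_idx = indices[:start] + indices[start + size:]  # remainder
        yield train_idx, test_idx
        start += size
```

Averaging the model's score over the k held-out folds gives an estimate of its performance on unseen data; a large gap between training and held-out scores signals overfitting.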
To quantify the predictive power of their network-based model, many studies use the Area Under the Curve of the Receiver Operating Characteristic (AUC-ROC), another frequently used method in validation problems [26, 81, 82]. The ROC curve plots sensitivity, also known as the true positive rate, against (1-specificity), also known as the false positive rate; the AUC-ROC is the area under this curve. The p-value is the probability that the observed sample AUC-ROC could actually correspond to a model with no predictive power (null hypothesis), i.e. to a model whose population AUC-ROC is 0.5. If the p-value is small, it can be concluded that the AUC-ROC is significantly different from 0.5 and therefore there is evidence that the model actually discriminates between significant and non-significant results [83]. Typically, a threshold (called the significance level) of p-value < 0.05 is used. However, biomedicine studies often use more restrictive values such as 0.005 [42] or even 0.001 [4]. As an alternative to the AUC-ROC, the p-value can be obtained from other tests such as chi-squared or Fisher's exact test, depending on the case of study [84]. Finally, to control the error rates associated with multiple testing (the false discovery rate or the familywise error rate), a correction procedure such as Benjamini–Hochberg or Bonferroni is applied.
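Both quantities can be computed with a few lines of plain Python; this is a didactic sketch, not the implementation used by any of the cited studies. The AUC-ROC is computed via its rank-statistic interpretation (the probability that a random positive scores above a random negative), and the Benjamini–Hochberg procedure returns which hypotheses are rejected at a given false discovery rate.

```python
def auc_roc(scores, labels):
    """AUC-ROC via the Mann-Whitney interpretation: the probability that
    a randomly chosen positive scores higher than a randomly chosen
    negative (ties count as half a win)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected under Benjamini-Hochberg
    false discovery rate control at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p-values
    k = 0  # largest rank whose p-value passes the BH threshold
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])
```

Note the difference from Bonferroni, which would simply compare every p-value against alpha / m and is therefore more conservative.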
Presentation
Last but not least, at the end of the pipeline the results obtained should come out in a format that can be consumed by the audience (e.g. the scientific community, the media, or even ourselves to inform the next iteration). One of the major advantages of disease networks is the intuitive access they give to the underlying complex interactions between diseases and other diseases, genes or drugs. Thus, publishing not only the data but also the means to explore and exploit the network is key to ensuring the reproducibility and extensibility of the study [85]. Early studies lacked this option, although access to their data allowed the construction of visualization tools a posteriori. For example, Ramiro Gómez created an interactive view of the Human Disease Network proposed by Goh in 2007 using the graph visualization software Gephi and the original dataset from the study. The same software was used in 2014 and by Hoehndorf in 2015 to visualize the generated disease networks [26]. In both cases, a force-directed layout was used for the graph drawing [86].
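To make the force-directed idea concrete, the sketch below is a toy, self-contained variant of the Fruchterman-Reingold scheme that tools like Gephi implement in optimized form: all node pairs repel, edges attract their endpoints, and a shrinking step size ("cooling") lets positions settle so that linked nodes end up close together.

```python
import math
import random

def force_directed_layout(nodes, edges, iterations=200, width=1.0, seed=42):
    """Toy Fruchterman-Reingold layout: repulsion between all node pairs,
    attraction along edges, with a cooling step size. Returns {node: [x, y]}."""
    rng = random.Random(seed)
    pos = {v: [rng.random() * width, rng.random() * width] for v in nodes}
    k = width / math.sqrt(len(nodes))  # ideal pairwise distance
    for it in range(iterations):
        disp = {v: [0.0, 0.0] for v in nodes}
        # repulsive forces between every pair of nodes
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
                d = math.hypot(dx, dy) or 1e-9
                f = k * k / d  # repulsion magnitude
                disp[u][0] += dx / d * f; disp[u][1] += dy / d * f
                disp[v][0] -= dx / d * f; disp[v][1] -= dy / d * f
        # attractive forces along edges
        for u, v in edges:
            dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
            d = math.hypot(dx, dy) or 1e-9
            f = d * d / k  # attraction magnitude
            disp[u][0] -= dx / d * f; disp[u][1] -= dy / d * f
            disp[v][0] += dx / d * f; disp[v][1] += dy / d * f
        # cooling: the maximum step shrinks as iterations progress
        t = width * 0.1 * (1 - it / iterations)
        for v in nodes:
            dx, dy = disp[v]
            d = math.hypot(dx, dy) or 1e-9
            pos[v][0] += dx / d * min(d, t)
            pos[v][1] += dy / d * min(d, t)
    return pos
```

After enough iterations, connected nodes settle near the ideal distance k while unconnected nodes drift apart, which is what produces the familiar cluster structure in disease network figures.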
Advances in network visualization tools have prompted the publication of network exploration systems associated with studies, Cytoscape being a remarkable example. Cytoscape provides basic functionality to lay out and query the network; to visually integrate the network with expression profiles, phenotypes, and other molecular states; and to link the network to databases of functional annotations [87]. A number of studies have used Cytoscape as a basis to build and visualize their networks. For instance, Le et al. created HGPEC as an app for Cytoscape to predict novel disease-gene and disease-disease associations [88]. DisGeNET provided another app that allows users to visualize, query and analyse a network representation of the DisGeNET database (see Figure 7) [89]. Many other apps can be found in the Cytoscape app store.
For their part, Himmelstein et al. accompanied their study based on heterogeneous disease networks with a powerful visualization tool built with Neo4j [30] that provides browsing and querying on Hetionet (see Figure 8). In a remarkable example of data accessibility, not only the data but also the code of this tool is publicly available. Different studies of the University of Rome, such as SIGNOR and DISNOR, also provide a disease network visualization tool that includes intuitive representations of the interactions between biological entities at different complexity levels (see Figure 9). This visualization tool was developed ad hoc for these projects [90–92].
A recent study by Pavlopoulos et al. performs an empirical comparison of visualization tools for large-scale network analysis [93].
Discussion
The analysis of the evolution of disease networks carried out in the first part of the document shows how these models have become increasingly complex, allowing researchers to address arduous problems such as improving our understanding of disease or repositioning drugs, with promising results. However, as a side effect of this growing complexity, new challenges have emerged that need to be addressed.
The growing availability of biological sources, key to the improvement of disease networks, is hampered by their fragmentation, heterogeneity, limited availability and different conceptualizations of their data [3]. Furthermore, these sources are intrinsically biased towards consolidated knowledge, which complicates the discovery of novel findings. The exploitation of textual sources such as clinical histories or scientific articles - more abundant and faster growing - allows researchers to compensate for these limitations. As an example of the abundance and potential of these alternative sources, in a recent study Westergaard et al. extracted and analyzed 15 million English scientific full-text articles published during the period 1823–2016 [94].
Despite this demonstrated potential, the exploitation of the medical literature is hindered by factors such as its limited access and heterogeneity. In the aforementioned study by Westergaard, the team could only access a subset of the Medline articles in full-text mode, while for the rest only the abstracts were available. In addition, depending on the source, they had to process documents with different structures and formats. As an alternative, a recent study proposed the use of Wikipedia as a source of structured and free-access text data, evaluating its usefulness in the detection of relations between diseases based on their symptoms/diagnosis elements, and comparing its performance with that of PubMed. The obtained results showed that Wikipedia can be as relevant a source as PubMed for this type of analysis [95].
Another limiting factor when integrating new sources to enhance the predictive capacity of disease networks is noise [96]. Adding new sources does not necessarily imply an improvement, since some databases are more informative than others. For example, Žitnik et al. evaluated the impact of removing sources on the performance of their proposed model to validate the sources' informativeness. They observed that while the absence of some sources significantly affected performance, in other cases the impact was minimal [4]. It is therefore necessary to counteract this effect by choosing algorithms that eliminate irrelevant sources or features before constructing the model [48].
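A leave-one-source-out ablation of this kind can be sketched generically. In the sketch below, `build_and_score` is a hypothetical callable standing in for the whole train-and-evaluate pipeline (it takes a list of sources and returns a performance metric such as AUC-ROC); the helper merely measures how much the metric drops when each source is withheld.

```python
def ablation_study(sources, build_and_score):
    """Leave-one-source-out ablation: re-evaluate the model with each
    source removed and report the performance drop it causes.
    `build_and_score` is a hypothetical callable that trains the model
    on a set of sources and returns a performance metric."""
    baseline = build_and_score(sources)
    drops = {}
    for s in sources:
        remaining = [x for x in sources if x != s]
        drops[s] = baseline - build_and_score(remaining)
    # sources with the largest drop are the most informative
    return baseline, dict(sorted(drops.items(), key=lambda kv: -kv[1]))
```

Sources whose removal leaves the metric essentially unchanged are candidates for exclusion, since they mostly add noise to the model.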
Validation is yet another challenge in studies based on disease networks. In some cases, the absence of a Criterion Standard leads to the use of previous studies for the validation of new models [42, 78]. This might ultimately result in the propagation of errors from one study to another. The use of curated sources and of sufficiently contrasted studies, combined when possible with in vitro and in vivo validations, helps to alleviate this problem [48, 72]. Related to the challenge of validation, the difficulty in accessing data from some studies prevents their reproducibility and verification by other teams, which makes them less reliable as references or benchmarks for future studies. However, the effort of some researchers to make the results of their work available is worth mentioning. Cases such as Hetionet, Rephetio, SIGNOR and DisNOR [30, 46, 90, 91], which offer advanced search and visualization tools, undoubtedly represent the path to follow.
The review of the process of creating a disease network from the point of view of a data science pipeline, carried out in the second part of the document, allows us to compare how each study has faced these challenges. Supplementary Table 5 lists some of the most notable studies related to disease networks of the last decade, breaking down each of their phases. It also contains information on the type of problem addressed and the characteristics of the obtained network. This table could be considered an extension and update of the one compiled by Sun K. et al. [27].
Conclusion and future work
Research studies based on disease networks have advanced significantly over the last decade. From the initial simple undirected networks that associated diseases with symptoms or genes in a simple, unweighted way, we have moved to complex networks that relate diseases to dozens of features from different sources in a semantic, directional and weighted way. The growing availability of biological and textual sources, the improvement in techniques and processing capacity, and the use of new models have contributed fundamentally to this progress. As can be concluded from the analysis in the first part of the document, the contribution of disease networks to the fields of disease understanding and drug repositioning is increasingly notable.
Nevertheless, an exhaustive analysis of the phases in the process of creating disease networks, carried out in the second half of the document, reveals important challenges. First, biological sources suffer from fragmentation, heterogeneity, lack of availability and different conceptualizations, which can only be partly alleviated by the aggregation of textual sources. Second, the combination of sources introduces noise that can affect the performance of the model, making preventive measures necessary. Finally, the scarcity of reference data and verifiable studies hinders the validation of new models.
In addition to detecting these challenges, the analysis of disease networks from the point of view of their functional units allows for a more precise comparison of studies, highlighting their differences and common points. This study and the presented analyses, reflected in the summary tables, can serve to inspire future work. For example, a performance comparison of the prediction models in the different studies might help deduce which functional units offer better results. In a subsequent phase, based on the obtained results, alternative combinations of these functional units could be proposed to build new pipelines and obtain more precise models based on disease networks.
Funding
Horizon 2020 research and innovation programme under grant agreement No. 727658, project IASIS (Integration and analysis of heterogeneous big data for precision medicine and suggested treatments for different types of patients).
Conflicts of interest
Authors declare no conflict of interest.
Keypoints
Disease networks have proved to be an intuitive and powerful way to address arduous problems such as the improvement of our disease understanding or the repositioning of drugs.
Over the last decade, disease networks have evolved from initial simple and undirected homogeneous networks, to complex, semantic, directional and weighted heterogeneous networks.
Despite their increasing complexity, studies on disease networks share common phases that can be represented as functional units of a data science pipeline for better analysis and comparison.
The heterogeneity and fragmentation of biological and textual sources, the noise introduced by their combination and the scarcity of validation datasets are some of the challenges discovered through this analysis.
References