Abstract
The rapid accumulation of knowledge in the field of systems and networks biology during recent years requires complex, but user-friendly and accessible web services that support tasks ranging from visualization to complex algorithmic analysis. While several web applications exist with varying focus on creation, revision, curation, storage, aggregation, integration, collaboration, exploration, visualization, and analysis, many of these services remain disjoint and have yet to be packaged into a cohesive environment.
Here, we present BEL Commons, an integrative knowledge discovery environment for knowledge assembly networks encoded in the Biological Expression Language. Users can submit their BEL files to be parsed, validated, and stored with fine-grained permissions. Afterwards, users can summarize, explore, and optionally share their networks with the scientific community. We have implemented a query builder wizard to help users find the relevant portions of increasingly large and complex knowledge assembly networks and a visualization interface that allows them to explore their resulting networks. Finally, we have included a dedicated analytical service for performing data-driven analysis of knowledge assemblies to support hypothesis generation.
This web application can be freely accessed at https://bel-commons.scai.fraunhofer.de.
Introduction
While there exist a variety of modeling languages, data formats, and analytical packages for knowledge assemblies in networks and systems biology, many require deep knowledge of computer programming to use and are generally inaccessible to biologists and clinicians. With the explosion of data and knowledge in the biomedical domain, it is paramount to develop tools that foster collaboration between groups of scientists with different backgrounds and skill sets who are working towards similar goals. Already, there are multiple freely available, web-based tools for networks and systems biology with varying focus on creation, revision, curation, storage, aggregation, integration, collaboration, exploration, visualization, and analysis.
Knowledge assemblies in systems and networks biology comprise multiple relations between biological entities across multiple scales (e.g., from the genome via the cellular and tissue level to the entire organism or population). They provide an abstraction over the concepts of pathways and mechanisms in order to encode a variety of causal, correlative, and associative relations in network representations of complex biology. In this scope, knowledge assemblies directly correspond to networks and can be handled as such.
Because of the accelerating throughput of scientific publication in the biomedical domain, several workflows (e.g., BELIEF [1], REACH [2], TRIPS [3], etc.) have been developed to automate relation extraction. Limitations on the precision and recall of these techniques motivated the development of several semi-automated relation extraction pipelines (e.g., SBV Improver [4], BELIEF Dashboard [5]) embedded in web interfaces to allow users to manually revise and curate their results. When natural language processing is insufficient, manual curation interfaces (e.g., WikiPathways [6]) can drive crowdsourced creation of knowledge assemblies.
While some useful knowledge bases (e.g., miRTarBase [7], CTD [8], KEGG [9]) disseminate their data in non-standard formats, the use of standard formats (e.g., SBML [10], BioPAX [11], BEL [12]) has become much more common, as in the case of the integration effort of Pathway Commons [13]. Other web tools (e.g., NDEx [14], GraphSpace [15]) provide the ability to upload, store, share, and distribute networks while remaining schema-agnostic. Most of these resources also include simple network visualization, layout, and exploration with PathVisio [16], Cytoscape [17], Cytoscape.js [18], as well as in-browser navigators.
Numerous algorithms and analyses have been published for systems and networks biology, but most are bespoke due to the heterogeneous nature of their underlying knowledge assemblies, data sets, and the scientific questions motivating their development. An exception lies in enrichment analysis, which is applicable across a variety of knowledge assemblies and data sets and is embedded in several web applications for user accessibility. Even in its simplicity, enrichment analysis has the ability to produce significant biological insight [19] as exemplified by the Enrichment Map Cytoscape Plugin [20, 21] with Pathway Commons as well as GSEA [22] with MSigDB [23].
Recently, BEL has been successfully used as a semantic and modeling framework for multi-scale and multi-modal knowledge in order to investigate the aetiology of complex disease as shown by Domingo-Fernández et al. [24] with the release of the NeuroMMSig Mechanism Enrichment Server. While the list of published BEL-specific algorithms is currently short (e.g., Reverse Causal Reasoning [25], Network Perturbation Amplitude [26], etc.), the advent of the modern PyBEL framework [27] has improved the accessibility and utility of BEL and motivates its wider adoption. Unfortunately, unlike many of the other functions that have been implemented in web applications, algorithms generally remain confined to programmatic use and are not accessible to potential users. Last, but not least, the ecosystem of BEL-specific web applications is small, and does not include a service for parsing, validating, and converting BEL.
There are still several unmet needs for users that motivate the development of new web applications. Generally, there is still the need to enable complex exploration and visualization as well as to make algorithms and analyses generally accessible and reusable. Specifically for BEL, there is a need to make parsing, validating, and converting generally accessible. Finally, an integrative knowledge discovery environment that comprises many of the previously mentioned features would be greatly beneficial to the scientific community. Here, we present BEL Commons, a web application that addresses these unmet needs and is a first attempt at building such an environment.
Components
The user interface of BEL Commons contains several components that integrate the most useful features from the variety of the previously mentioned web applications for systems and networks biology (Figure 1). Below, we outline their functionalities and typical use cases. Implementation details can be found in the Supplementary Information.
Network Uploader
The first point of entry for many users of BEL Commons will be through its BEL uploader. Even with recent publication of a modern BEL parser (i.e., PyBEL), a web interface greatly improves accessibility for a more general audience. The uploader contains a short form allowing users to choose a file to upload and to toggle common parsing and compilation parameters. After submitting, users’ files are sent to an asynchronous task queue, implemented with RabbitMQ (https://www.rabbitmq.com) and Celery (http://www.celeryproject.org), that performs parsing, validation, and compilation in the background. The queue then performs several assessments that enumerate errors and warnings encountered during parsing, produce statistics over the resulting network, and identify biological network motifs. Finally, the queue notifies the submitter upon completion.
The parsing errors and warnings are categorized first as syntactic or semantic, and then in more detail, as described in the PyBEL documentation (https://pybel.rtfd.io). Each is presented with provenance information including the line, line number, and position so curators can quickly make changes. Recurring errors and warnings are identified and grouped separately to allow curators to quickly make impactful improvements. Finally, a faceted search is presented for situations where an overwhelming number of errors and warnings are present.
The statistical summary (Figure 2A) of the network presents information about the contents of the network and also network-theoretic measurements of the full network. Several charts are generated depicting the types and number of nodes, edges, modifications, namespaces, annotations, and citations existing in the network. Furthermore, scalar values describing network properties, such as network density and average node degree, are listed alongside the node overlap with other networks in the database, measured with the Szymkiewicz-Simpson coefficient.
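The scalar measures named in the summary have compact definitions. The following is a minimal sketch and not the BEL Commons implementation: it assumes a simple directed graph without self-loops or parallel edges (BEL networks are multigraphs, so their density can differ), and all function names and the toy node sets are hypothetical.

```python
from typing import Set


def density(n_nodes: int, n_edges: int) -> float:
    """Density of a simple directed graph: m / (n * (n - 1))."""
    if n_nodes < 2:
        return 0.0
    return n_edges / (n_nodes * (n_nodes - 1))


def average_degree(n_nodes: int, n_edges: int) -> float:
    """Average total (in + out) degree: every edge touches two node ends."""
    if n_nodes == 0:
        return 0.0
    return 2 * n_edges / n_nodes


def overlap_coefficient(a: Set[str], b: Set[str]) -> float:
    """Szymkiewicz-Simpson coefficient: |A & B| / min(|A|, |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Toy node sets standing in for two uploaded networks (hypothetical names)
nodes_a = {"A", "B", "C", "D"}
nodes_b = {"A", "B", "E"}

print(density(n_nodes=4, n_edges=3))
print(average_degree(n_nodes=4, n_edges=3))
print(overlap_coefficient(nodes_a, nodes_b))
```

Because the overlap coefficient normalizes by the smaller set, a small network that is fully contained in a large one scores 1.0, which suits comparisons between networks of very different sizes.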
The analysis of knowledge network motifs (Figure 2B) builds upon the ideas of transcriptional network motifs presented by Alon [28] and makes generalizations to apply them to knowledge networks. The simplest motifs that are informative to the robustness, correctness, and applicability of a knowledge network are based on the identification of biological contradictions. One type is the contradictory pair, where knowledge has been curated stating A increases B but also A decreases B. The ability to identify the contradiction that RB1 has been shown to increase the transcriptional activity of E2F4 by Li et al. [29] and that it decreases the transcriptional activity of E2F4 [30] is a significant step towards quality control during curation.
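The contradictory pair motif can be detected by grouping edges by their source and target and checking for opposing relations. The sketch below is an illustration under assumptions: edges are plain (source, relation, target) tuples with free-text node labels, and the function name is hypothetical. It does not reproduce the motif analysis shipped with BEL Commons.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

# BEL causal relations, grouped by direction of effect
INCREASES = {"increases", "directlyIncreases"}
DECREASES = {"decreases", "directlyDecreases"}

Edge = Tuple[str, str, str]  # (source, relation, target)


def find_contradictory_pairs(edges: Iterable[Edge]) -> List[Tuple[str, str]]:
    """Return (source, target) pairs curated with both an increasing
    and a decreasing relation."""
    relations = defaultdict(set)
    for source, relation, target in edges:
        relations[(source, target)].add(relation)
    return sorted(
        pair
        for pair, rels in relations.items()
        if rels & INCREASES and rels & DECREASES
    )


# Toy edges using simplified labels rather than full BEL terms
edges = [
    ("RB1", "increases", "E2F4 activity"),
    ("RB1", "decreases", "E2F4 activity"),
    ("RB1", "association", "E2F1"),
]

print(find_contradictory_pairs(edges))
```

A flagged pair is not necessarily a curation error, since the two statements may hold in different contexts, so the output is best treated as a review list for curators.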
User Rights Management and Collaboration
Upon BEL upload, the user is presented with the option to make the resulting network either public or private. Networks can be uploaded privately for use during research then later released publicly to accompany a publication and share their work with the scientific community. The network catalog (Figure 3A) dynamically shows users only networks for which they have the appropriate permissions.
Users can create projects that allow for multiple users to mutually share networks. For example, a curation project in a given disease area could contain networks generated from the efforts of multiple curators. Projects can generate a merged network that can be summarized, explored, analyzed, and exported with the same tools available for stand-alone networks. Users can access their private activity page (Figure 3B), which provides a global summary over their projects, networks, queries, data sets, and experiments.
We do not presume all users plan to produce their own BEL script, especially with the growing number of both general and context-specific publicly-available knowledge assemblies. In light of this, we have included several of these resources with BEL Commons for these users (Table 1). The catalog of networks for which a user has the appropriate permissions can be accessed directly from the home page of BEL Commons.
Query Builder
While knowledge assemblies and their corresponding networks encoded in BEL have the benefit of being readily mergeable, it becomes increasingly difficult to visualize, explore, and interpret larger networks. Furthermore, most query interfaces are limited to exploration of the Nth neighbors of a biological entity of interest, which is often insufficient to capture complex biology. The second main point of entry for many users of BEL Commons will be through the query builder, which enables the generation of powerful, precise, and expressive queries. Each query contains three parts: assembly, seeding, and application of mutations.
The first facet of the query builder allows users to select knowledge assemblies relevant to their scientific question. While it still remains computationally feasible to query a merged view over the entire catalog of knowledge assemblies, users have the opportunity to apply their intuition and select the most relevant on the basis of their specificity towards target disease areas, their curation methods, or any other appropriate criteria described in their metadata. Additionally, other structured knowledge resources such as protein families [32, 33], biochemical reactions [34, 35], gene orthologies (e.g., Entrez Gene [36], MGI [37], RGD [38], etc.) with high confidence can be included to enrich the curated knowledge assemblies. Later, we show how this novel feature can be used to enrich networks as a pre-processing step to connect disparate components before analysis.
The second facet of the query builder allows the user to choose from a combination of seeding methods to select nodes and edges (Table 2). Besides simpler questions (e.g., which nodes are neighbors of my protein of interest? which nodes are closely connected?), the query builder allows scientists to ask scientific questions like the one proposed in the following scenario: the leukemia drug, nilotinib, triggers cells to remove faulty components, including the ones associated with several brain diseases [39]. In 2015, the Georgetown University Medical Center published findings that the drug had a therapeutic effect on patients with Alzheimer’s and Parkinson’s diseases [40]. Though the drug’s mechanism of action is currently unknown, a search of the paths between this drug and these diseases could provide insight into it.
The third and final facet of the query builder is the application of mutations (i.e., enrichments, selections, filters, and transformations) to the seed network. These vary from the enrichment of the central dogma, which extends protein nodes with their corresponding RNA and gene nodes; to the induction of purely causal subgraphs, which remove any non-causal edges; to the removal of pathology nodes, which are often uninformative in analyses; to pre-processing steps for analysis that might involve collapsing gene, RNA, and protein nodes. BEL Commons dynamically loads mutation functions from the PyBEL Tools package (https://github.com/pybel/pybel-tools), such that new functions, pipelines, and workflows for processing networks can be written quickly and made available to users. A full list is available through the PyBEL Tools documentation (https://pybel-tools.rtfd.io), or through the query builder itself.
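A chain of mutations can be viewed as a list of functions that each take a network and return a new one. The sketch below illustrates two of the mutations mentioned above on a toy network. The `Network` container, function names, and node labels are hypothetical and are not the PyBEL Tools API, which operates on PyBEL graph objects.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class Network(NamedTuple):
    nodes: Dict[str, str]              # node name -> function, e.g. "Protein"
    edges: List[Tuple[str, str, str]]  # (source, relation, target)


CAUSAL = {"increases", "directlyIncreases", "decreases", "directlyDecreases"}


def induce_causal_subgraph(net: Network) -> Network:
    """Keep only causal edges and the nodes they touch."""
    edges = [e for e in net.edges if e[1] in CAUSAL]
    kept = {n for source, _, target in edges for n in (source, target)}
    return Network({n: f for n, f in net.nodes.items() if n in kept}, edges)


def remove_pathologies(net: Network) -> Network:
    """Drop pathology nodes and every edge incident to them."""
    nodes = {n: f for n, f in net.nodes.items() if f != "Pathology"}
    edges = [e for e in net.edges if e[0] in nodes and e[2] in nodes]
    return Network(nodes, edges)


def apply_pipeline(net: Network, mutations: List[Callable[[Network], Network]]) -> Network:
    """Apply each mutation in order, feeding the result into the next."""
    for mutate in mutations:
        net = mutate(net)
    return net


# Toy seed network with abstract labels
seed = Network(
    nodes={"A": "Protein", "B": "Protein", "C": "Protein", "Disease X": "Pathology"},
    edges=[
        ("A", "increases", "B"),
        ("B", "association", "C"),
        ("B", "increases", "Disease X"),
    ],
)

result = apply_pipeline(seed, [induce_causal_subgraph, remove_pathologies])
print(result.nodes, result.edges)
```

Keeping every mutation a pure function of the network makes the order of application explicit and lets new steps be added without changing existing ones.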
Each query is saved with a unique identifier such that queries can be rerun, shared, merged, or compared. Rather than storing the results of queries, the assembly, seeding, and mutations are stored as a "transaction" so that they can be applied to new assemblies, for example, when a network is updated. Effectively, queries correspond to an experimental protocol for processing raw knowledge assemblies before visualization, exploration, analysis, and interpretation. However, the construction of a query is not the end of its life. The next section summarizing the biological network explorer describes how queries can be extended and evolve during the process of scientific inquiry.
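Storing the recipe rather than the result can be sketched with a small serializable record. This is a minimal illustration, assuming a JSON representation; the `Query` class, its field names, and the seeding and mutation names are hypothetical and do not describe the BEL Commons database schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class Query:
    """A stored recipe: which networks, how to seed, which mutations."""
    network_ids: List[int]
    seeding: List[Dict[str, Any]] = field(default_factory=list)
    pipeline: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "Query":
        return cls(**json.loads(text))

    def append_mutation(self, name: str) -> "Query":
        """Return a new query with one more mutation; the original is untouched."""
        return Query(list(self.network_ids), list(self.seeding), self.pipeline + [name])

    def undo(self) -> "Query":
        """Return a new query without the most recent mutation."""
        return Query(list(self.network_ids), list(self.seeding), self.pipeline[:-1])


original = Query(
    network_ids=[1, 2],
    seeding=[{"type": "neighbors", "data": ["A"]}],  # hypothetical seeding method
    pipeline=["induce_causal_subgraph"],             # hypothetical mutation name
)
extended = original.append_mutation("remove_pathologies")

print(extended.to_json())
print(extended.undo() == original)
```

Since only identifiers and step names are stored, rerunning a query after a network is updated picks up the new content automatically.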
Biological Network Explorer
The third main point of entry for many users, the core feature of BEL Commons, will be its biological network explorer (Figure 4), which uses D3.js (https://d3js.org) to render networks with a force-directed layout algorithm that can be panned and zoomed. Because the complexity of biological networks often limits the utility of automated layouts [42], users can also manually drag and reposition nodes. Furthermore, users can adjust the edge length parameter of the algorithm to rarefy densely grouped nodes and improve readability. The networks are styled with minimum visual clutter and make use of easily-distinguishable colors rather than obtrusive shapes for nodes as well as patterns and colors for different types of edges.
The explorer has several contextual actions registered to the nodes and edges. Users can left-click a node to populate the information box below the explorer with information from external data sources (e.g., Entrez Gene, ChEBI [43], ExPASy [33], GO [44]) gathered from Bio2BEL services (https://github.com/bio2bel) and the EMBL Ontology Lookup Service (OLS) [45]. Alternatively, users can right-click a node to open a contextual menu that enables further modifications to the network (e.g., delete the node, add the neighbors of the node to the network) that are interactively appended to the original query used to render the visualization. The contents of the network can also be further modified by the inline query builder, which allows additional mutations (including enrichments) to be applied interactively. The query history is displayed at the bottom of the explorer, new changes are highlighted in red, and because queries are stored as transactions, changes can be reverted with an "undo" button.
When an edge is clicked, the information box is populated with relevant citations, evidences, and annotations. Each edge is linked to a voting and commenting system so domain-specific experts, curators, and bioinformaticians can engage in discussion on the correctness and robustness of the chosen representation of knowledge.
To the right of the explorer is the filter tool box, which incorporates a novel approach to filtering and exploring networks using a linked hierarchical explorer of the terminologies/ontologies annotated to the edges in the currently displayed networks. Users can search and select groups of annotations to filter the network. For example, this could be useful to exclude edges asserted from research on cell lines that are not relevant. The filter tool box has three additional tabs: nodes, edges, and highlight. Users can either search for specific nodes and edges in their corresponding tabs or use the highlight tab to select nodes and edges with specific properties to highlight in the network.
Above the explorer is the general tool box that includes several additional interactions for exploration, analysis, and export of the network using the protocols described by Hoyt et al. [27]. Notably, it contains a path mining tool that enables path searches between given nodes with fine-grained, configurable settings (e.g., directedness, path search algorithm, application of filters for pathologies, etc.). Hence, it can immediately be used to identify the causal root affecting two nodes, or generate hypothetical links across modes and scales.
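A path search with configurable directedness and node filters can be sketched with a breadth-first search. The example below is an illustration under assumptions and not the path mining tool itself: it returns a single shortest path, uses a plain edge list, and the function name, parameters, and toy node labels are hypothetical.

```python
from collections import deque
from typing import Dict, FrozenSet, Iterable, List, Optional, Tuple


def shortest_path(
    edges: Iterable[Tuple[str, str]],
    source: str,
    target: str,
    directed: bool = True,
    excluded: FrozenSet[str] = frozenset(),
) -> Optional[List[str]]:
    """Breadth-first search for one shortest path, skipping excluded nodes."""
    adjacency: Dict[str, List[str]] = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        if not directed:
            adjacency.setdefault(v, []).append(u)
    if source in excluded or target in excluded:
        return None
    previous: Dict[str, Optional[str]] = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through the predecessor links to rebuild the path
            path = []
            current: Optional[str] = node
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for neighbor in adjacency.get(node, []):
            if neighbor not in previous and neighbor not in excluded:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Toy network with abstract labels
edges = [
    ("drug", "kinase"),
    ("kinase", "process"),
    ("process", "disease"),
    ("marker", "process"),
]

print(shortest_path(edges, "drug", "disease"))
print(shortest_path(edges, "marker", "drug"))
print(shortest_path(edges, "marker", "drug", directed=False))
```

The undirected setting finds connections that the directed search misses, at the cost of returning paths that no longer read as a causal chain.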
The visualization can be further modified by resizing the nodes corresponding to the results of topological or data-driven analyses, such as their degree, betweenness centrality, or by the results of an experiment (e.g. Unbiased Candidate Mechanism Perturbation Amplitude; see below) in order to identify novel biological entities.
Finally, there are several alternatives for exploring networks that are too large to render in-browser due to the limits of JavaScript-based graphics. First, the network catalog opts to present users with a random subsampling of large networks. If a network in the explorer becomes too big, then the explorer prompts the usage of the filter tool box to identify a more relevant, smaller network. Otherwise, users can export the current network to multiple formats for use in desktop visualization applications.
Analytical Service
Algorithms for analyzing pathways and networks were grouped by Khatri et al. in 2012 into three main categories: over-representation analysis, functional class scoring, and pathway topology [19]. These algorithms have been developed for a wide variety of applications, data formats, and graph types, but few are specific to BEL.
BEL Commons includes the newly developed Unbiased Candidate Mechanism Perturbation Amplitude (UCMPA) workflow, which begins by generating unbiased candidate mechanisms (with respect to their boundaries) based on the upstream controllers of biological process nodes in BEL networks and including their induced causal edges. It addresses issues with graph topologies (i.e., cycles, contradictory edges) and other limits (e.g., need for pre-defined subgraphs, usage of correlative relationships) posed by previous BEL-specific algorithms [26, 46] with more complex randomization and sampling approaches. Users can map pre-processed high-throughput experimental data (e.g., differential gene expression) to the network then execute a classical schema-free analytical method inspired by other heat diffusion analyses in networks biology [47, 48]. The algorithm outputs scores for each biological process based on the propagated effect of its upstream controllers, dictated by the input data. A technical description of this method can be found in the PyBEL Tools documentation (https://pybel-tools.rtfd.io). Finally, users are presented with a statistics page and are able to overlay the results on the original networks.
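The idea of scoring a biological process from the propagated effect of its upstream controllers can be illustrated with a deliberately simplified example. The sketch below is not the UCMPA algorithm: it assumes an acyclic candidate mechanism with signed causal edges, whereas the workflow described above handles cycles and contradictory edges through randomization and sampling. The function name, edge encoding, and input values are hypothetical.

```python
from typing import Dict, List, Tuple

SignedEdge = Tuple[str, int, str]  # (source, +1 for increases / -1 for decreases, target)


def propagate_scores(edges: List[SignedEdge], data: Dict[str, float], target: str) -> float:
    """Score `target` as its own data value plus the signed sum of its
    upstream controllers' scores, computed in topological order."""
    predecessors: Dict[str, List[Tuple[str, int]]] = {}
    nodes = set(data) | {target}
    for source, sign, sink in edges:
        predecessors.setdefault(sink, []).append((source, sign))
        nodes.update((source, sink))

    scores: Dict[str, float] = {}
    remaining = set(nodes)
    while remaining:
        # A node is ready once all of its upstream controllers have scores
        ready = [
            n for n in remaining
            if all(u in scores for u, _ in predecessors.get(n, []))
        ]
        if not ready:
            raise ValueError("cycle detected; this sketch needs an acyclic mechanism")
        for n in ready:
            upstream = sum(sign * scores[u] for u, sign in predecessors.get(n, []))
            scores[n] = data.get(n, 0.0) + upstream
            remaining.discard(n)
    return scores[target]


# Toy mechanism: G1 increases G2, G2 decreases BP, G3 increases BP
edges = [("G1", 1, "G2"), ("G2", -1, "BP"), ("G3", 1, "BP")]
# Hypothetical log fold changes; the process node itself has no measurement
data = {"G1": 1.0, "G2": 0.5, "G3": -2.0}

print(propagate_scores(edges, data, "BP"))
```

Nodes without measurements contribute only what reaches them from upstream, which is how a process that is not itself measured can still receive a score.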
In the following section, the query builder, interactive explorer, and analytical service are used to assess several differential gene expression experiments representing Alzheimer’s disease patients at different disease states using NeuroMMSig networks.
Application Scenario
This section describes a typical use case, where multiple networks are queried, explored, then analyzed.
First, a sampling of networks from NeuroMMSig that represent well-known mechanisms with different progression patterns in the context of Alzheimer’s disease (i.e., Low Density Lipoprotein Subgraph, GABA Subgraph, Notch Signaling Subgraph, and Reactive Oxygen Species Subgraph) were assembled with the query builder. No seeding methods were used since the analysis is general across mechanisms annotated by NeuroMMSig.
Because underlying knowledge assemblies generated by multiple curators often manifest as disjoint networks that contain related nodes, automated methods can be used to connect them. First, the central dogma was inferred for all proteins and RNAs. For example, this connected the GABRA4 gene and GABRA4 mRNA from disjoint components. Further modifications (i.e., collapsing of variants, collapsing on the central dogma, enriching unqualified edges) were also applied.
As an example from the NeuroMMSig GABA Subgraph, knowledge-based approaches can be used to connect the disconnected GABBR2 and GABBR1 gene nodes to their common families using resources like PFAM [49] or InterPro [32]. Hierarchical knowledge sources like these can then be used to reason over the network, like using the knowledge that GABBR2 decreases the cAMP catabolic process [50] to assert that GABBR1, the other member of the GPCR family 3, GABA-B receptor (IPR001828), shares the same activity. While this knowledge does not exist in the assembly, literature search also notes several connections between GABBR1 and cAMP signalling [51, 52].
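Inference of the central dogma amounts to adding the missing RNA and gene nodes for each protein, and the missing gene node for each RNA, together with the connecting edges. The sketch below is a minimal illustration, assuming nodes are (function, name) tuples and that a gene, its RNA, and its protein share one name; the function name is hypothetical and this is not the PyBEL Tools implementation.

```python
from typing import Set, Tuple

Node = Tuple[str, str]        # (function, name), e.g. ("Gene", "GABRA4")
Edge = Tuple[Node, str, Node]  # (source, relation, target)


def enrich_central_dogma(nodes: Set[Node], edges: Set[Edge]) -> Tuple[Set[Node], Set[Edge]]:
    """Add RNA nodes for proteins and gene nodes for RNAs, linked with the
    BEL relations translatedTo and transcribedTo."""
    nodes, edges = set(nodes), set(edges)
    # Proteins first, so that newly added RNAs also receive a gene below
    for function, name in list(nodes):
        if function == "Protein":
            nodes.add(("RNA", name))
            edges.add((("RNA", name), "translatedTo", ("Protein", name)))
    for function, name in list(nodes):
        if function == "RNA":
            nodes.add(("Gene", name))
            edges.add((("Gene", name), "transcribedTo", ("RNA", name)))
    return nodes, edges


# The GABRA4 gene and its mRNA start out unconnected
nodes = {("Gene", "GABRA4"), ("RNA", "GABRA4")}
enriched_nodes, enriched_edges = enrich_central_dogma(nodes, set())

print(sorted(enriched_edges))
```

Working on sets keeps the operation idempotent, so applying it to a network that already contains these edges changes nothing.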
Unbiased candidate mechanisms were generated from the assembly from the upstream controllers of each biological process present and the UCMPA workflow was applied to each to assist in the interpretation of the differential gene expression experiments from GSE28146 [53] that were preprocessed with GEO2R [54]. This trial classified patients into three disease progression stages: incipient, moderate, and severe. BEL Commons includes a parallel coordinate display (Figure 5) to assist in comparison of multiple UCMPA experiments across different data sets and clustering their results with K-Means clustering. While BEL has inherent limits in its temporal expressivity, interpreting data that has an inherent temporal ordering helps overcome this limit.
While Alzheimer’s disease must be studied with respect to its progression over time, this analysis can provide insight directly to measurements performed on a single time series. Those results provide a ranking that prioritizes the most up- and down-regulated biological processes as a function of the observed data.
Discussion
No web application, however feature-rich, will ever satiate the desire and creativity of researchers to generate novel solutions to scientific problems. While BEL Commons has taken inspiration from many well-constructed services to advance towards a more feature-rich knowledge discovery environment that enables researchers to explore knowledge and data in new ways, it still has this limit. However, we are not discouraged, and we hope to make several improvements to BEL Commons in the future:
We would like to improve the interoperability of BEL Commons and the platform built on BEL itself by integrating open authentication systems like ORCID in order to harmonize identification of users across multiple web services and provide reliable provenance for networks, queries, and analyses. We would also like to integrate further tools for converting BEL to RDF in order to connect BEL with other linked data. Further, we would like to improve exporters to other services, notably, NDEx, which have brighter outlooks on sharing and feedback systems. Recent developments in integrating INDRA [55] with PyBEL enable conversion from BioPAX documents to BEL. A future update to BEL Commons will include an option to upload these documents as well.
We would like to integrate BEL Commons with other BEL-specific systems developed with different underlying technologies. First, integrating the BELIEF Dashboard to use the underlying network and edge store from PyBEL would enable a more thorough feedback and curation interface so users could not only vote on the correctness of edges, but fix them directly. Second, the NeuroMMSig Mechanism Enrichment Server will be re-implemented completely with reusable PyBEL code and BEL Commons components in order to advance its goals of achieving patient-stratification by using common algorithms and tools.
Conclusion
Along with recent improvements in generation of BEL content through text mining (INDRA) and serialization of resources (Bio2BEL project (https://github.com/bio2bel)), we believe that BEL Commons will make BEL more accessible to both academic and industrial users. We have made this application freely available at https://bel-commons.scai.fraunhofer.de.
Authors’ Contributions
C.T.H. and D.D.F. conceived the web application, implemented it, and wrote this manuscript. M.H.A. reviewed the content.
Funding
This work was supported by the EU/EFPIA Innovative Medicines Initiative Joint Undertaking under AETIONOMY [grant number 115568], resources of which are composed of financial contribution from the European Union’s Seventh Framework Programme (FP7/2007-2013) and EFPIA companies in kind contribution.
Conflict of Interest
None declared.
Acknowledgements
We would like to thank all colleagues who assisted in testing and provided feedback in order to improve this work, especially André Gemünd for his technical assistance. We would also like to thank Scott Colby for making our logos.