Abstract
Network inference algorithms have recently seen rapid growth in systems biology, because network identification is essential for understanding how genes are regulated, elucidating the functional mechanisms underlying cellular processes, and identifying molecular targets for drug discovery. This article gives a brief overview of the approaches used to identify biological networks and reviews recent advances in network identification.
I. INTRODUCTION
A network is a set of nodes together with a set of directed or undirected edges between them. Biological systems contain many types of networks, including transcriptional, signaling, and metabolic networks, and an edge that corresponds to a biochemical interaction can be validated experimentally. However, because the size of the search space grows exponentially with the number of nodes in the network, relatively little is known about the structure of such networks.
Many computational methods [1–14] for inferring gene regulatory networks (GRNs) or biochemical interactions predict the ‘wiring diagram’ of the network. Roughly, these methods identify the network structure from gene expression profiles either by searching for patterns of correlation or conditional probability that indicate causal influence, or by finding the parameters of a mathematical model that best fit the data. These approaches fall broadly into two categories: 1) statistical models and 2) mechanistic network models.
Network identification remains a difficult problem in general. Statistical dependencies reflect both direct and indirect paths of interaction between nodes, and nonlinearities in the system dynamics and measurement noise make the problem even more challenging. Moreover, to continue to have an impact in systems biology, identifying the graph topology from data should also reveal deficiencies in the model and suggest new experimental directions.
II. STATISTICAL APPROACH
Statistical approaches use the so-called ‘influence’ network model, which generally reflects global properties of a system's behavior, so the true molecular interactions are described only implicitly [15]. As a result, these models can be difficult to interpret and difficult to extend with additional information.
A. Correlation-Based Method
Predictions of physical and functional links between cellular components are often based on correlations between experimental measurements, such as gene expression [1]. The correlation coefficient ρij between genes i and j determines whether the two genes are connected. For a chosen threshold τ on the magnitude of ρij, the adjacency matrix A can be defined as follows:

\[ A_{ij} = \begin{cases} 1, & |\rho_{ij}| \ge \tau \\ 0, & \text{otherwise} \end{cases} \]
This gives an undirected, unweighted network. However, methods that rely on pairwise gene expression correlation measures suffer from very high false-positive rates, because correlation alone cannot distinguish a direct influence from an indirect one that acts through one or more intermediaries.
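The thresholding idea can be sketched in a few lines of Python. This is a minimal illustration, not a method from the cited references: the cutoff `tau`, the function names, and the toy expression values are all assumptions made for the example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile carries no correlation information
    return cov / (sx * sy)

def correlation_network(expr, tau):
    """Symmetric 0/1 adjacency matrix with A[i][j] = 1 when |rho_ij| >= tau."""
    n = len(expr)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if abs(pearson(expr[i], expr[j])) >= tau:
                adj[i][j] = adj[j][i] = 1
    return adj

# Toy expression profiles (rows are genes, columns are samples); values are made up.
expr = [
    [1.0, 2.0, 3.0, 4.0, 5.0],  # gene 0
    [2.1, 3.9, 6.2, 8.0, 9.8],  # gene 1, tracks gene 0 closely
    [5.0, 1.0, 4.0, 2.0, 3.0],  # gene 2, only weakly related to the others
]
adjacency = correlation_network(expr, tau=0.8)  # tau is an assumed cutoff
```

Only the gene 0 and gene 1 pair exceeds the cutoff here. Lowering `tau` adds edges quickly, which is one way to see how false positives arise.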
B. Mutual Information (MI)
Mutual information measures the mutual dependence between two random variables X and Y by comparing their joint distribution with the product of their marginal distributions, i.e., the joint distribution they would have if they were independent. It can be defined as I(X;Y) = H(X) - H(X|Y), where H(X) is the marginal entropy of X and H(X|Y) is the conditional entropy of X given Y. Because a statistical dependency may be indirect, the data processing inequality [16] can be applied to eliminate indirect interactions. For example, if X➔Y➔Z, then I(X;Y) ≥ I(X;Z), with equality if and only if I(X;Y|Z) = 0, so within such a chain the pair with the smallest mutual information, here (X, Z), is the candidate indirect interaction.
Many approaches apply additional filtering and post-processing steps; the final result is an adjacency matrix from which interactions can be inferred.
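As a rough sketch of how mutual information and the data processing inequality can be combined, the Python below estimates I(X;Y) from empirical frequencies of discretized profiles and then, for every fully connected triplet, drops the edge with the smallest mutual information. The function names, the `threshold` parameter, and the toy profiles are assumptions for illustration. Practical methods add the filtering steps mentioned above and use more careful estimators.

```python
from collections import Counter
from itertools import combinations
from math import log2

def mutual_information(x, y):
    """Empirical I(X;Y) in bits for two equal-length discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log2((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )

def dpi_prune(mi, threshold=0.0):
    """Keep edges with MI above threshold, then remove the weakest edge
    of every triplet in which all three edges are present."""
    n = len(mi)
    keep = {(i, j) for i, j in combinations(range(n), 2) if mi[i][j] > threshold}
    drop = set()
    for i, j, k in combinations(range(n), 3):
        edges = [(i, j), (i, k), (j, k)]
        if all(e in keep for e in edges):
            drop.add(min(edges, key=lambda e: mi[e[0]][e[1]]))
    return keep - drop

# Toy binary profiles: y is a noisy copy of x, and z is a noisy copy of y.
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [0, 0, 0, 1, 1, 1, 1, 1]
z = [0, 0, 1, 1, 1, 1, 1, 1]
profiles = [x, y, z]

n = len(profiles)
mi = [[0.0] * n for _ in range(n)]
for i, j in combinations(range(n), 2):
    mi[i][j] = mi[j][i] = mutual_information(profiles[i], profiles[j])

edges = dpi_prune(mi)  # the x-z edge is the weakest and is removed
```

On this toy data the x-z dependency has the lowest mutual information of the three pairs, so the pruning step keeps only the x-y and y-z edges.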
C. Bayesian Network
A graphical model represents the factorization of a joint probability distribution into conditional probabilities. Bayesian networks are a common type of graphical model and have several attractive properties for inferring signaling pathways from biological data sets: they can represent stochastic nonlinear relationships, and they can describe direct molecular interactions as well as indirect influences that proceed through additional unobserved components. Thus, very complex relationships in signaling pathways can be discovered [17–19].
In the formulation of Bayesian networks, the structure of a genetic regulatory network is modeled by a directed acyclic graph G = (V, E), where the vertices (V) represent genes or other elements and the edges (E) represent biochemical interactions in the network. Bayesian network modeling associates with each variable Xi a probability distribution conditioned on its parents in the graph (Pai). The graph structure encodes the assumption that each variable is independent of its non-descendants given its parents; thus the joint distribution can be decomposed into the following form:

\[ P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}_i) \]

The goal of Bayesian network inference is to search over possible graphs and select the one that best describes the dependency relationships observed in the experimental data. In a score-based approach, given a scoring function and a data set, network inference amounts to finding the structure that maximizes the score. The main challenges include exponential complexity in the local network connectivity, which necessitates heuristic search procedures, reliance on unrealistic network models, and the need to discretize expression data [15].
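To make the score-based idea concrete, the sketch below scores a few hand-picked candidate structures on toy binary data and selects the best one. It uses the Bayesian information criterion (BIC) as the scoring function, which is an assumption chosen for simplicity; the text above does not commit to a particular score. The candidate list, variable names, and data are likewise illustrative, and a real method would replace the explicit enumeration with a heuristic search.

```python
from collections import Counter
from math import log

def bic_score(data, parents):
    """BIC of a discrete Bayesian network.

    data    -- list of dicts mapping variable name to an observed value
    parents -- dict mapping each variable to a tuple of its parent names
    Assumes every variable is binary when counting free parameters.
    """
    n = len(data)
    score = 0.0
    for node, pa in parents.items():
        joint = Counter((tuple(row[p] for p in pa), row[node]) for row in data)
        marg = Counter(tuple(row[p] for p in pa) for row in data)
        # Maximum-likelihood log-likelihood of P(node | parents).
        score += sum(c * log(c / marg[cfg]) for (cfg, _), c in joint.items())
        # One free parameter per parent configuration for a binary node.
        score -= 0.5 * log(n) * (2 ** len(pa))
    return score

# Toy data: B always equals A, and C is independent of both.
data = [{"A": a, "B": a, "C": i % 2} for i, a in enumerate([0] * 8 + [1] * 8)]

# Hand-enumerated candidate structures (an exhaustive search is infeasible in general).
candidates = {
    "empty": {"A": (), "B": (), "C": ()},
    "A->B": {"A": (), "B": ("A",), "C": ()},
    "A->B, A->C": {"A": (), "B": ("A",), "C": ("A",)},
}
scores = {name: bic_score(data, graph) for name, graph in candidates.items()}
best = max(scores, key=scores.get)
```

The structure with the single edge A->B scores highest: it explains the dependence between A and B, while the extra edge A->C adds parameters without improving the fit.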
III. MECHANISTIC MODELING APPROACH
Mechanistic network models identify interactions by using prior knowledge as biologically motivated constraints, which reduces the search space. Such reverse engineering approaches therefore reveal the interaction maps that best fit the data given the prior models [20,21]. Many methods consider a dynamical system that depends on a reaction graph, which summarizes all biochemical reactions and their associated parameters. These methods assume that neither the graph nor the parameters are known. Inference about the graph structure is carried out by integrating experimental data with dynamic models and then reformulating the task as a parameter estimation problem. In this way, one can take into account model complexity as well as the fit to the data.
A. Boolean Networks
The state of a gene can be described by a Boolean variable: a gene is considered to be either ‘on’ or ‘off’, intermediate expression levels are neglected, and its products are therefore either present or absent. Using Boolean variables, interactions between genes can be represented by Boolean functions, which determine the state of a gene from the activation of other genes. Modeling regulatory networks as Boolean networks also allows large regulatory networks to be analyzed efficiently, at the cost of strong assumptions about the structure and simple dynamics of GRNs.
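A small Python sketch shows what such a model looks like in practice. The three-gene circuit and its rules are hypothetical, and the synchronous update scheme (all genes updated at once) is an assumption; asynchronous schemes are also possible.

```python
def step(state, rules):
    """Synchronously update every gene from the current network state."""
    return tuple(rule(state) for rule in rules)

def trajectory(state, rules, max_steps=20):
    """Iterate until a state repeats (or max_steps is reached).

    Returns the list of visited states and the first repeated state.
    """
    seen = []
    while state not in seen and len(seen) < max_steps:
        seen.append(state)
        state = step(state, rules)
    return seen, state

# Hypothetical three-gene circuit; the rules are illustrative only.
rules = (
    lambda s: not s[2],       # gene 0 is repressed by gene 2
    lambda s: s[0],           # gene 1 is activated by gene 0
    lambda s: s[0] and s[1],  # gene 2 requires both gene 0 and gene 1
)

visited, repeated = trajectory((True, False, False), rules)
```

Starting from the state in which only gene 0 is on, this circuit passes through five distinct states and then returns to its starting state, i.e., it settles into a cyclic attractor. Because each gene has only two states, such behavior can be enumerated exhaustively for small networks.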
B. Ordinary Differential Equations
Ordinary differential equations (ODEs), which model the dynamics of biological systems, have been widely used to analyze GRNs. The ODE formulation models the concentrations of RNAs, proteins, and other molecules by time-dependent variables that take nonnegative real values. Regulatory interactions take the form of functional and differential relations between the concentration variables. Specifically, gene regulation is modeled by rate equations that express the rate of production of a component of the system as a function of the concentrations of other components:

\[ \frac{dx_i}{dt} = f_i(x), \quad i = 1, \ldots, n \]

where x is the vector of concentrations of proteins, mRNAs, or small molecules, n is the number of components, and fi is a nonlinear function. With dynamic models, regression techniques fit the data to an a priori model, and interaction maps can be inferred from the biochemical reactions and their associated parameters [22].
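As an illustration of the rate-equation form, the sketch below simulates a hypothetical two-component system with forward Euler integration. The model (constant production of component 1, Hill-type activation of component 2 by component 1, first-order degradation of both), all parameter values, and the choice of integrator are assumptions made for the example.

```python
def f(x):
    """Right-hand side dx/dt = f(x) of a hypothetical two-component model."""
    x1, x2 = x
    hill = x1 ** 2 / (1.0 + x1 ** 2)  # Hill activation with K = 1 and n = 2
    dx1 = 1.0 - 1.0 * x1              # constant production, linear degradation
    dx2 = 2.0 * hill - 1.0 * x2       # production activated by x1, linear degradation
    return (dx1, dx2)

def euler(f, x0, dt, steps):
    """Forward Euler integration; returns the state after the given number of steps."""
    x = tuple(x0)
    for _ in range(steps):
        dx = f(x)
        x = tuple(xi + dt * dxi for xi, dxi in zip(x, dx))
    return x

# Integrate from zero concentrations for 20 time units.
x_final = euler(f, (0.0, 0.0), dt=0.01, steps=2000)
```

With these parameters both concentrations approach a steady state near 1. In an inference setting the direction is reversed: the functional form is fixed a priori and the parameters are fitted to measured time courses.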
IV. RECENT TRENDS IN NETWORK IDENTIFICATION
Although high-throughput measurement techniques have advanced tremendously, insufficient data still strongly impedes the identification of GRNs. Hence, to obtain reliable inference results, it is important to incorporate biologically motivated constraints (e.g., sparsity). Many researchers have also proposed methods that combine diverse types of data (e.g., multidimensional -omic data, ChIP-on-chip data, protein-protein interaction data, sequence information) or that integrate independent experimental evidence from the literature or from biological databases.
In general, regression techniques fit the data to prior models, and such methods are limited to relatively simple models, usually based on simple, often linear, approximations to the underlying dynamics. This is because, as network complexity increases, the number of parameters becomes much larger than the number of experimental constraints. Incorporating biologically motivated constraints is therefore very useful for reducing the search space. Biological regulatory networks are known to be sparse, meaning that most genes interact with only a small number of genes compared with the total number in the network, and many methods take advantage of this sparsity [22]. These methods typically use l1-norm optimization, which leads to a sparse representation of the network and improves the ability to find the actual network structure. Moreover, these methods can be extended to include a priori information on the network structure (e.g., known promotion and inhibition relations can be encoded as constraints).
Information from the scientific literature and from biological databases can be used in combination with experimental data. Recently, many researchers have proposed promising methods that integrate such diverse types of data in GRN identification [23]. Given the limited amounts of experimental data available, the integration of prior biological knowledge and multiple sources of heterogeneous data will be an important focus of future GRN identification research.
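The effect of l1-norm optimization can be sketched with a small coordinate-descent solver for the lasso objective (1/2n)·||y - Xw||² + λ·||w||₁, where y holds the response of one target gene and the columns of X hold the expression of candidate regulators. Everything in this example is an assumption for illustration: the linear model, the solver, the value of λ, and the data. The ±1 design is chosen with orthogonal columns so that the result is easy to check by hand, since the solution then reduces to soft-thresholding the per-regulator correlations.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_cd(X, y, lam, sweeps=100):
    """Coordinate descent for (1/2n)*||y - Xw||^2 + lam*||w||_1.

    Assumes no column of X is identically zero.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Correlation of regulator j with the residual that excludes its own term.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, lam) / z
    return w

# Hypothetical +/-1 perturbation design: 8 samples, 4 candidate regulators.
g0 = [1, -1, 1, -1, 1, -1, 1, -1]
g1 = [1, 1, -1, -1, 1, 1, -1, -1]
g2 = [1, -1, -1, 1, 1, -1, -1, 1]
g3 = [1, 1, 1, 1, -1, -1, -1, -1]
X = [list(row) for row in zip(g0, g1, g2, g3)]

# Target response driven by regulators 0 and 2 only, plus small made-up noise.
noise = [0.1, -0.05, 0.0, 0.05, -0.1, 0.0, 0.05, -0.05]
y = [2.0 * a - 1.5 * c + e for a, c, e in zip(g0, g2, noise)]

w = lasso_cd(X, y, lam=0.1)
support = [j for j, wj in enumerate(w) if wj != 0.0]  # indices of inferred regulators
```

The l1 penalty sets the coefficients of regulators 1 and 3 exactly to zero, so the inferred edge set contains only the two true regulators, with their weights shrunk slightly toward zero. Known activation or inhibition relations could be added as sign constraints on individual entries of `w`.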
Footnotes
This research was supported by the NIH NCI under the ICBP and PS-OC programs (5U54CA112970-08).