Disentangling Multidimensional Spatio-Temporal Data into their Common and Aberrant Responses

Young Hwan Chang; Jim Korkola; Dhara N. Amin; Mark Moasser; Jose M. Carmena; Joe W. Gray; Claire J. Tomlin

doi:10.1101/004259

Abstract

With the advent of high-throughput measurement techniques, scientists and engineers are starting to grapple with massive data sets and encountering challenges in organizing, processing and extracting information into meaningful structures. Multidimensional spatio-temporal biological data sets, such as time series gene expression under various perturbations in different cell lines or neural spike data across many experimental trials, have the potential to yield insight across multiple dimensions. For this potential to be realized, we need a suitable representation to turn data into insight. Since the wide range of experiments and the (unknown) complexity of the underlying system make biological data more heterogeneous than data in other fields, we propose a method based on Robust Principal Component Analysis (RPCA), which is well suited for extracting principal components from corrupted observations. The proposed method provides a new representation of these data sets that consists of their common and aberrant responses. This representation might help users acquire new insight from data.

Author Summary

One of the most exciting trends and important themes in science and engineering involves the use of high-throughput measurement data. With different dimensions, for example, various perturbations, different doses of drug or cell line characteristics, such multidimensional data sets enable us to understand commonalities and differences across multiple dimensions. A general question is how to organize the observed data into meaningful structures and how to find an appropriate similarity measure. A natural way of viewing these complex high dimensional data sets is to examine and analyze the large-scale features and then to focus on the interesting details. With this notion, we propose an RPCA-based method which models common variations as an approximately low-rank component and anomalies as a sparse component. We show that the proposed method is able to find distinct subtypes and classify data sets in a robust way by separating common responses and abnormal responses without any prior knowledge.

Introduction

Over the last few years, the use of high-throughput measurement data has become one of the most exciting trends and important themes in science and engineering, and it is becoming increasingly important in biology. However, handling and analyzing biological data pose challenges of their own because the data are heterogeneous. Biological data not only stem from a wide range of experiments but also reflect the (unknown) complexity of the underlying system [1]. In cancer cells, signaling networks frequently become compromised, leading to abnormal behaviors and responses to external stimuli. Also, we have to consider various experimental conditions with different dimensions such as inhibitions/stimulations, different doses of drugs, and various cell lines, as shown in Figure 1.

Figure 1.

Multi-dimensional Spatiotemporal data where we consider various experiments with different perturbations, doses, mechanism, tasks, etc.

With the explosion in the amount of biological data, a general question is how to organize the observed data into meaningful structures and how to find an appropriate similarity (or dissimilarity) measure, which is critical to the analysis. Since such multidimensional data have the potential to yield insight across multiple dimensions, they enable users to develop models and draw hypotheses that not only describe the spatial and temporal dynamics of the biological system but also inform them about commonalities and differences across dimensions. A significant challenge in creating suitable representations is to keep handling large data sets and to match the growing diversity and quantity of the data.

A natural way of viewing these complex high dimensional data sets is to examine and analyze the large-scale features and then to focus on the interesting details. The potential of clustering to reveal biologically meaningful patterns in microarray data was first realized and demonstrated in an early paper by Eisen et al. [2]. Thereafter, in many biological applications, different methods have been used to analyze gene expression data and characterize gene functional behavior. Among various data-driven modeling approaches, clustering methods are widely used on these data to categorize genes with similar expression profiles. However, until recently, most studies have focused on the spatial, rather than temporal, structure of data. For instance, neural models are usually concerned with processing static spatial patterns of intensities without regard to temporal information [3]. Since many existing data-driven modeling approaches for biological data, such as clustering, classification or Bayesian inference, focus on static data, they have limitations in analyzing multi-dimensional spatio-temporal data sets.

Recently, much research has focused on time series high-throughput data sets such as time series gene expression or time-binned neural activity. These data sets have the advantage of being able to identify dynamic relationships between genes or neurons, since the spatio-temporal pattern results from the integration of regulatory signals or electrochemical signals through the network over time. For example, time series gene-knockout experiment data sets provide the distinct possibility of observing the cellular mechanisms in action [4]. Also, these data sets help us to unravel the mechanistic drivers characterizing cellular response and to break down the genome into sets of genes involved in related processes [5]. Moreover, instead of concentrating on the steady state response, monitoring dynamic patterns provides a profoundly different type of information. For instance, several recent studies focus on the temporal complexity and heterogeneity of single-neuron activity in the premotor and motor cortices [3] [6] [7]. Finally, since many current and emerging cancer treatments are designed to inhibit or stimulate a specific node (or gene) in the networks and alter signaling cascades, advancing our understanding of how the system dynamics of these networks are deregulated across cancer cells and finding subgroups of genes and conditions will ultimately lead to more effective treatment strategies [8].

In this paper, we propose an RPCA-based method for analyzing spatio-temporal data sets which represent underlying biological systems. Since the proposed method provides suitable representations which turn data into insight, it helps us to understand information from a new point of view. To demonstrate that our method helps users acquire insight efficiently and to emphasize that it is applicable to various domains, we consider two different systems: 1) neural population dynamics and 2) gene regulatory networks. Since the proposed method uses the common dynamic features in the spatio-temporal data set, the key question is how to arrange individual data sets in order to make them amenable to this analysis. Thus, the proposed method enables scientists and engineers to analyze these data by retrieving common dynamical information and to focus on the interesting details with a new perspective on the problem.

Background

Motivation

1. Neural Population Dynamics

Neural activity is typically studied by averaging noisy spiking activity across multiple experimental trials to obtain an approximate neural firing rate that varies smoothly over time. However, if neural activity is more a reflection of internal neural dynamics than a response to external stimulus, the time series of neural activity may differ even when a monkey is performing nominally identical tasks [7]. In [6], Churchland et al. showed that neural activity patterns in the primary motor cortex and premotor cortex associated with nearly identical velocity profiles can be very different. This is particularly true of behavioral tasks involving perception, decision making, attention, or motor planning. In these settings, it is critical not to average the neural data across trials, but to analyze it on a trial-by-trial basis [3]. Moreover, stimulus representations in some sensory systems are characterized by the precise spike timing of a small number of neurons [10] [11] [12], suggesting that the details of operations in the brain are embedded not only in the overall neural spike rate, but also in the timing of spikes.

The motor and premotor cortex have been extensively studied, but their dynamic response properties are poorly understood [3]. Also, there is a debate about whether neural activity relates to muscles or to abstract movement features. We can define the motor cortical activity, which represents movement parameters, as per equation (1), and the dynamical system that generates movements as per equation (2) [3]:

$$x_i(t) = h_i\big(\mathrm{param}_1(t), \mathrm{param}_2(t), \ldots\big) \tag{1}$$

$$\dot{x}(t) = f\big(x(t)\big) + u(t) \tag{2}$$

where x_i(t) is the firing rate of neuron i at time t, h_i is its tuning function, and each param_j may represent a movement parameter such as hand velocity, target position or direction. In (2), x ∈ ℝⁿ is a vector describing the firing rate of all neurons where n is the number of neurons, ẋ is its derivative, f is an unknown function, and u is an external input. In (2), neural activity is governed by underlying dynamics f(·), so dynamical features should be present in the population activity.

2. Gene Regulatory Network

In microarray data, missing data and corrupted data are quite common, and not uniform across samples (here, we consider arbitrary corruptions caused by human error during biological experiments, for example, mislabeling or improper use of markers or antibodies). Two strategies for dealing with missing values are either to modify clustering methods so that they can deal with missing values, or to impute a “complete” data set before clustering [13].

Consider collections of time series gene expression of breast cancer cell lines or microarray data sets from pathway-targeted therapies that are gene knockout experiments. When a specific gene is perturbed as shown in Figure 2(c), the expression levels of many other genes might be perturbed over time. Thus, comparing gene expression levels in the perturbed system with those in the unperturbed system reveals extra information, namely the different cellular mechanisms in action. A dynamical system of the gene regulatory network can be modelled as follows:

$$\dot{x}(t) = f\big(x(t)\big) + g_{\{\cdot\}}\big(x(t)\big) \tag{3}$$

where ẋ(t) ∈ ℝⁿ represents the derivative of gene expression at time t, n is the number of genes, f(·) represents the vector field of the typical dynamical system (or wild-type) and g_{·}(·) represents an additional vector field which is added by perturbation or simply a mutant-specific part (blue and red edges in Figure 2(c)). In other words, we have a unified model for the wild-type cell line, and in the mutant or perturbation case, we invoke a single change to the network topology or add a single influence for the specific gene, where additional vector fields such as g_LAP(·), g_Akti(·) and g_M(·) are assumed to be sparse (i.e., affect only a single gene expression). Note that these additional vector fields affect only a single gene expression at time t, but this influence can be propagated through the network and integrated over time.

Figure 2.

Conceptual representation: (a) RPCA applied to computer vision. A typical example of video surveillance where the low-rank component represents the unchanging background and the sparse component represents the movements in the foreground. (b) RPCA applied to neural systems. The low-rank component putatively represents (submovement relevant) neural signatures and the sparse component represents neural activity unrelated to submovement onset. (c) Collections of gene-knockout experiments and mutant-specific part representations (breast cancer signaling pathway) with wild-type, Lapatinib treatment, Akt inhibitor and mutant cell lines where solid black edges represent common network topology, and blue and red edges represent a single change of the network topology for perturbations or mutant cell lines.

Robust Principal Component Analysis (RPCA)

In the computer vision literature [9], an interesting separation problem is introduced in which the observed data matrix is decomposed into an unseen low-rank component and an unseen sparse component. The method, called Robust Principal Component Analysis (RPCA), is a provably correct and efficient algorithm for the recovery of low-dimensional linear structure from non-ideal observations. For example, gross errors frequently occur in applications such as image processing and bioinformatics, where measurements may be corrupted by occlusions, malicious tampering or sensor failures.

In video surveillance, we need to identify activities that stand out from the background given a sequence of video frames [9]. Figure 2 (a) shows that if we stack the video frames as rows of a matrix M ∈ ℝ^{q×(P_x P_y)}, where q is the number of frames for a given time window and P_x and P_y represent the numbers of pixels along the two axes of each 2-D image, then each row of M consists of a common component, the stationary background, and the moving objects in the foreground of that frame. Here, the large data matrix M is the input for RPCA and the outputs are the stationary background (L) and the moving objects in the foreground (S). Suppose you have only one frame; you cannot separate the moving objects from the stationary background. However, by stacking all the vectorized frames in sequence, i.e., with all the frames aligned along the column direction as shown in Figure 2 (a), we can identify the stationary background, which is the common variation, and then capture the moving objects, which are the sparse components of each frame.

With this notion, suppose we are given a large data matrix M, which has principal components in the low-rank component and may contain some anomalies in the sparse component. Mathematically, it is natural to model the common variations as approximately the low-rank component L, and the anomalies as the sparse component S. In [9], Candès et al. formulate this as follows:

$$\min_{L,\,S} \; \|L\|_{*} + \lambda \|S\|_{1} \quad \text{subject to} \quad L + S = M \tag{4}$$

where ‖L‖_∗ denotes the so-called nuclear norm of the matrix L, i.e., the sum of the singular values of L, and ‖S‖₁ = ∑_ij |S_ij| represents the l₁-norm of S. Choosing the tuning parameter λ to be 1/√max(n₁, n₂) works well for incoherent matrices, where n₁ and n₂ represent the dimensions of the matrix M [9]. For practical problems, however, it is often possible to improve performance by choosing λ according to prior knowledge about the solution.
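The decomposition of M into L and S can be computed in a few lines. The following is a minimal sketch, assuming NumPy and the augmented Lagrangian (alternating directions) iteration described by Candès et al. in [9]; it is not the implementation used for the results in this paper. The function names `shrink`, `svd_threshold` and `rpca`, the step size `mu`, the stopping tolerance and the synthetic test matrix are illustrative assumptions.

```python
import numpy as np


def shrink(X, tau):
    """Entrywise soft-thresholding (proximal operator of the l1-norm)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)


def svd_threshold(X, tau):
    """Singular value thresholding (proximal operator of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt


def rpca(M, lam=None, mu=None, tol=1e-7, max_iter=1000):
    """Split M into a low-rank part L and a sparse part S with L + S = M."""
    M = np.asarray(M, dtype=float)
    n1, n2 = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(n1, n2))         # default tuning parameter
    if mu is None:
        mu = n1 * n2 / (4.0 * np.abs(M).sum())   # step size heuristic (assumption)
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                         # Lagrange multiplier
    norm_M = np.linalg.norm(M)
    for _ in range(max_iter):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        residual = M - L - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * norm_M:
            break
    return L, S


# Synthetic check: a rank-2 "common" part plus about 5% gross corruptions.
rng = np.random.default_rng(0)
L0 = rng.standard_normal((60, 2)) @ rng.standard_normal((2, 60))
S0 = np.zeros((60, 60))
mask = rng.random((60, 60)) < 0.05
S0[mask] = 5.0 * rng.choice([-1.0, 1.0], size=mask.sum())

L_hat, S_hat = rpca(L0 + S0)
rel_err = np.linalg.norm(L_hat - L0) / np.linalg.norm(L0)
```

With the default choice, a 60 × 60 input gives λ = 1/√60 ≈ 0.129. Passing a smaller `lam` discounts the penalty on the sparse component, so more of the data is assigned to S and the rank of L tends to drop.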

How to Construct the Data Matrix M

Recall the video surveillance example shown in Figure 2 (a), where each row of the matrix M represents the vectorized 2-D image at each time frame. Since each image consists of the stationary background (L_i,:) and the moving objects in the foreground (S_i,:) at each time i, we denote M as follows:

$$M_{i,:} = L_{i,:} + S_{i,:}, \qquad i = 1, \ldots, q$$

where M_i,: represents the i-th row of M. If there is no moving object in the foreground and no lighting variation for a given video sequence (i.e., ∀i, S_i,: = 0), obviously L_i,: (= L_j,: for i ≠ j) represents the common stationary background. Otherwise (i.e., S_i,: ≠ 0), M consists of the aligned corrupted measurements M_i,:. Although the measurements are corrupted by moving objects in the foreground, we are able to separate L and S under certain conditions [9]. Similarly, if we arrange multidimensional spatio-temporal data sets into M, we may be able to separate common dynamic features and analyze aberrant behavior.

1. Neural Population Dynamics

Recall equation (2) and consider Figure 2 (b). Suppose we align spatio-temporal neural activity governed by (2) with discrete events, such as movement onset (i.e., when a monkey triggers a submovement¹):

$$X^{i} = \big[\, x^{i}[1], \; x^{i}[2], \; \ldots, \; x^{i}[N_T] \,\big] \in \mathbb{R}^{n \times N_T}$$

where the superscript i represents the i-th trial and N_T represents the number of time points for the chosen time window. Then, we denote M as follows:

$$M = \begin{bmatrix} e_1^{\top} X^{1} & e_2^{\top} X^{1} & \cdots & e_n^{\top} X^{1} \\ \vdots & \vdots & & \vdots \\ e_1^{\top} X^{q} & e_2^{\top} X^{q} & \cdots & e_n^{\top} X^{q} \end{bmatrix} \in \mathbb{R}^{q \times nN_T} \tag{6}$$

where e_i^⊤ X^j represents the temporal neural activity of the i-th neuron in the j-th trial, e_i ∈ ℝⁿ is a unit vector, and q is the number of trials across all data. Each row of M represents the vectorized spatio-temporal neural response for one trial. Note that we align each spatio-temporal data set x^j[t] with the same temporal condition (submovement onset) as shown in Figure 2 (b), but we do not separate different types of submovement. For example, submovements with different reach directions, or with different ordinal positions in an overlapped series of submovements, are combined in our input matrix M. In analogy with the stationary background in video surveillance, some portion of the variability may reflect common dynamic features (L) corresponding to triggering a submovement, even though the responses of each neuron are corrupted by task-irrelevant neural responses (S) and may vary significantly across many trials.

2. Gene Regulatory Network

Recall equation (3) and consider Figure 2 (c). To handle various perturbation conditions (or different cell lines), we should consider those factors carefully. In (3), although the additional vector field (g_{·}) represents a single influence on a specific gene, this single influence can be propagated through the network and integrated over time. For example, when we perturb x_j by using an inhibitor, if x_j is connected with many other genes directly or indirectly, the expression levels of many other genes can be perturbed over time. On the other hand, if x_j is connected with only a few genes, this perturbation affects only a small fraction of the gene expressions since it cannot be propagated through the entire network.
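The contrast between a well-connected and a poorly connected gene can be illustrated with a toy simulation. The sketch below assumes a hypothetical linear vector field f(x) = Ax for a six-gene network and a constant additive perturbation acting on a single gene; the matrix `A`, the perturbation strength and the threshold are illustrative choices, not values from the data sets analyzed here.

```python
import numpy as np


def simulate(A, g, x0, dt=0.01, steps=500):
    """Forward-Euler integration of dx/dt = A x + g with a constant g."""
    x = x0.copy()
    trajectory = [x.copy()]
    for _ in range(steps):
        x = x + dt * (A @ x + g)
        trajectory.append(x.copy())
    return np.array(trajectory)


n = 6
A = -np.eye(n)       # every gene decays on its own
A[1:, 0] = 0.8       # gene 0 is a hub that regulates genes 1..5
x0 = np.ones(n)

wild_type = simulate(A, np.zeros(n), x0)


def affected_genes(j, strength=-1.0, thresh=1e-3):
    """Count genes whose time course deviates from wild-type when gene j is hit."""
    g = np.zeros(n)
    g[j] = strength  # sparse perturbation: acts on gene j only
    perturbed = simulate(A, g, x0)
    deviation = np.abs(perturbed - wild_type).max(axis=0)
    return int((deviation > thresh).sum())


hub_count = affected_genes(0)    # perturbation of the hub spreads downstream
leaf_count = affected_genes(5)   # perturbation of a leaf stays local
```

In this toy network, perturbing the hub changes the time courses of all six genes, whereas perturbing gene 5, which regulates nothing else, changes only its own time course.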

Similar to equation (6), we can construct M using gene expression time series data with q different perturbations including different cell lines. Here, each row of M ∈ ℝ^{q×nN_T} represents the vectorized time series gene expression (n: the number of genes, N_T: the number of time points and q: the number of different perturbation conditions including the number of different cell lines) and different rows represent spatio-temporal responses of different perturbations or different cell lines.
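As a concrete illustration of this arrangement, the sketch below stacks q responses of size n × N_T into a q × nN_T matrix with NumPy. The helper names `build_data_matrix` and `unstack_component` are hypothetical, the gene-by-gene ordering within a row is an assumption, and the random array only stands in for real measurements. The sizes match the gene knockout data set analyzed in the Results section (16 perturbations, 15 gene expressions, 4 time points).

```python
import numpy as np


def build_data_matrix(X):
    """Stack q responses of shape (n, N_T) into a q x (n * N_T) matrix.

    Row k holds the time course of gene 1, then gene 2, and so on,
    for perturbation (or trial) k.
    """
    X = np.asarray(X, dtype=float)
    q, n, n_t = X.shape
    return X.reshape(q, n * n_t)


def unstack_component(C, n, n_t):
    """Reshape a q x (n * N_T) component (e.g. L or S) back to (q, n, N_T)."""
    return np.asarray(C).reshape(-1, n, n_t)


rng = np.random.default_rng(0)
X = rng.random((16, 15, 4))       # placeholder data: q = 16, n = 15, N_T = 4
M = build_data_matrix(X)          # shape (16, 60)

# The time course of gene 3 (index 2) under perturbation 4 (index 3)
# occupies columns 8..11 of row 3.
gene_block = M[3, 2 * 4:3 * 4]
```

Reshaping the low-rank and sparse outputs back with `unstack_component` makes it straightforward to draw them as per-perturbation heat maps.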

Since time series gene expression results from the integration of regulatory signals constrained by the gene regulatory network, the input matrix M may reflect a common dynamic response corresponding to the characteristics of the network structure. Intuitively, in video surveillance, if someone stays motionless in all the frames, the RPCA algorithm assigns him to the low-rank component. Unless he moves, we cannot see the background because he always blocks it. Similarly, in order to extract the common response of the gene regulatory network exactly, we should perturb the entire network arbitrarily and uniformly.

Results

Disentangling the Low-rank and Sparse components

In [9], Candès et al. discuss the identifiability issue. To make the problem meaningful, the low-rank component L must not be sparse. Another identifiability issue arises if the sparse matrix has low rank. In many computer vision applications, practical low-rank and sparse separation gives visually appealing solutions.

However, for neural activity data, only a small subset of the whole ensemble of neurons is active at any moment, as shown in Figure 3 (left). Since M is sparse, the low-rank component might be sparse as well. Also, for the pathway-targeted therapies, since the gene regulatory network is known to be sparse, a large subset of the whole ensemble of genes might be deactivated at any moment. Moreover, the original distributions of the amplitudes of individual neuronal activities or gene expressions are highly skewed. For example, neural activities often form very eccentric clusters, as shown in Figure 3 (left); some neurons are highly activated (30-40 spikes/sec) but others typically have only a few spikes per second. Similarly, gene expressions form very eccentric clusters since each gene expression has a different scale in practice.

Figure 3.

The low-rank matrices from both RPCA and RP-RPCA, where the raw data and its random projection are the respective input matrices and we choose m = n = 64 for the comparison (contrast represents the activity of a neuron, i.e., high contrast represents highly modulated neural activity and white represents zero neural activity). (left) raw data, (center) low-rank component using RPCA and (right) low-rank component using RP-RPCA.

These observations imply that practical low-rank and sparse separation can be ambiguous and might not provide a biologically meaningful solution for either neural activity analysis or gene knockout experiment data sets. To remedy the identifiability issue, we propose an RPCA-based method in conjunction with Random Projection (RP); RP can de-sparsify the input data set and make a highly eccentric distribution more spherical, so that the singular vectors of the low-rank matrix are reasonably spread out (see Methods section: Random Projection (RP) and Identifiability for details).

Numerical Example

To illustrate the issue of identifiability and how RP can alleviate it, we consider a simple example: we generate a sparse low-rank input matrix (q = 50, n = 2, N_T = 10) of rank 6, as shown in Figure S1(a). Note that in this example we chose the same dimension for the input and its projection (refer to (7) and (8); no dimension reduction). This is done so that Ψ ∈ ℝ^{m×n} in equation (7) is invertible (we choose m = n and a nonsingular matrix Ψ), allowing us to compare the outputs of RPCA and RP-RPCA directly, as described below. Here, by using RP, we take advantage of de-sparsifying our input data and reducing the eccentricity of its distribution. In general, choosing m < n makes the projected input much denser because information is compressed by RP.

To evaluate the performance of separation into a low-rank and a sparse component, we add a sparse corruption to the input and then apply the projection, which gives the projected corrupted input. To compare the performance of RP-RPCA with RPCA, we first decompose the projected corrupted input into its low-rank and sparse components. Then, we invert the projection so that both components are expressed in the coordinates of the original input.
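To make the effect of the projection concrete, the following sketch applies a random square matrix Ψ to a sparse array and inverts it again. It assumes that the projection acts on the neuron (or gene) axis at every time point, which is one reading of the construction in the Methods section; the Gaussian choice of Ψ, the sizes and the helper names `project` and `unproject` are illustrative assumptions.

```python
import numpy as np


def project(X, Psi):
    """Apply Psi (m x n) along the unit axis of X (q x n x N_T)."""
    return np.einsum("mn,qnt->qmt", Psi, X)


def unproject(Y, Psi):
    """Undo the projection; requires a square, nonsingular Psi (m = n)."""
    return np.einsum("nm,qmt->qnt", np.linalg.inv(Psi), Y)


rng = np.random.default_rng(1)
q, n, n_t = 50, 8, 10                 # hypothetical sizes

# Sparse, non-negative "activity": roughly 20% of the entries are non-zero.
X = rng.random((q, n, n_t)) * (rng.random((q, n, n_t)) < 0.2)

Psi = rng.standard_normal((n, n))     # random projection with m = n
Y = project(X, Psi)

density_before = np.count_nonzero(X) / X.size
density_after = np.count_nonzero(Y) / Y.size
X_back = unproject(Y, Psi)            # recovers X up to round-off
```

Each projected entry mixes all n units at a given time point, so it is non-zero whenever any of them is active. This is why the projected array is much denser than the original, while the inverse of Ψ still maps it back to the original coordinates.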

Figure 4 shows statistics of both RPCA and RP-RPCA (in which RPCA is applied to the projected corrupted input) as a function of the tuning parameter λ in equation (4). Since our input is still sparse in this example, the rank of the low-rank component from both RPCA (L^rpca) and RP-RPCA is 15 for the default choice of λ. If we choose λ = 0.113 (discounting the penalty for the sparse component), the ranks of both low-rank components are approximately 6, which is the same as the rank of the original input. With this choice of λ, for RPCA we find that ‖S^rpca‖ is much bigger than the norm of the original corruption signal. On the other hand, for RP-RPCA, the norm of the sparse component is close to that of the original corruption signal. Therefore, for RP-RPCA, the separation of the low-rank component and sparse component is close to the true solution, but for the original RPCA, we have misidentification in both low-rank and sparse components (more detailed information in Figure S2).

Figure 4.

Statistics of a numerical example: we run RPCA for the corrupted input and for its random projection (sparse corruption was added to the input). The left y-axis represents the norm of the sparse component and the right y-axis shows the rank of L (more detailed information in Figure S1 and Figure S2).

Application to Neural Data

Figure 3 (left) shows the actual neural activities aligned with movement onset. The salient vertical striations in the plots show that the ratios between units’ mean firing rates are fairly constant and that temporal patterns exist across all the submovements. Also, as mentioned previously, the neural population activities are sparsely active (white represents 0 spikes/sec) and show eccentric behavior; for example, some neurons have a much higher spiking rate than others.

Figure 3 (center) and (right) show the low-rank matrices from RPCA and RP-RPCA respectively (for simple comparison, we choose m = n). Since M is sparse and has an eccentric distribution, the singular vectors may not be reasonably spread out. Applying RPCA directly to M results in a low-rank component composed of only highly modulated neural activity (center). On the other hand, RP-RPCA can extract the low-rank component from a more distributed set of neural dimensions than RPCA alone can. Also, the result of RP-RPCA gives a more visually appealing solution.

Application to gene knockout experiments

To test the proposed RP-RPCA algorithm, we consider gene knockout experiments using the SKBR3 cell line [4], which has been used in studies of Human Epidermal Growth Factor Receptor 2 (HER2) positive breast cancer. We chose this data set because it has various perturbations (16 perturbations) using a single cell line and contains 15 gene expressions with 4 time points, as shown in Figure 5 (top row). The middle row represents the low-rank component and the bottom row represents the highly aberrant sparse component (we use a threshold to emphasize highly corrupted components). In the raw data (top row), nearly all treatments show differential responses. However, the low-rank component (middle row) can be categorized into approximately 3-4 subtype responses and the sparse component (bottom row) shows genomic aberration-specific responses.

Figure 5.

Gene knockout experiments [4] (16 perturbations × 15 gene expressions × 4 time points [0, 1, 48, 72h]): (upper) raw data, (middle) low-rank component and (lower) highly corrupted sparse component using a threshold.

Also, the following observations suggest mechanisms of response and resistance which may provide unanticipated biological insight.

(observation 1) mTOR inhibition (the second column in the bottom row) shows aberrant responses in DEPTOR, pHER3, IRS-1 and pAkt(308, 473). In [15], DEPTOR is identified as an mTOR-interacting protein whose expression is negatively regulated by mTORC1 and mTORC2; also, Peterson et al. found that DEPTOR overexpression suppresses S6K1 but activates Akt by relieving feedback inhibition from mTORC1 to PI3K signaling. Therefore, high DEPTOR expression is necessary to maintain PI3K and Akt activation, which is consistent with the previous result [15].
(observation 2) HER2 inhibition (the sixth column in the bottom row) results in aberrant responses of HER3, pAkt(473) and DEPTOR. Figure S3 represents an abstract model of HER2 overexpressed breast cancer by M. Moasser. Since high DEPTOR expression indicates low mTORC1 and mTORC2 [15], activated HER3 and Akt increase through relief of inhibition according to this model. More interestingly, PHLPP is known to dephosphorylate Ser473 in Akt (i.e., partially inactivating the kinase), which is captured in the sparse component of pAkt(473).
(observation 3) S6K inhibition (the third column in the bottom row) results in aberrant responses of pAkt(473). Since S6K is located downstream of the Akt-TSC2-mTORC pathway, S6K inhibition captures only the activation of pAkt(473).
(observation 4) PI3K inhibition (the 7th-11th columns in the bottom row) leads to increased phosphorylation of MAPK.

In order to cluster a spatio-temporal data set, we separate the common response from the perturbed responses using the proposed method. Since abnormal behaviors, different responses to external stimuli and differences between cell lines can be extracted from the information available in the data set, we can cluster the data correctly and reveal biologically meaningful patterns (see Methods section: Cluster Analysis for details). Figure 6 shows the clustered results using existing hierarchical clustering (raw data M, d_xy in (9)) and the proposed method ([L S], d_ϕψ in (10)) respectively. We match the clustered results with the graphical representation generated by M. Moasser, and our clustered result is more consistent with the known network structure and responses.

Figure 6.

Clustered group: (left) hierarchical cluster and (right) the proposed method. Both clustered results are compared with the graphical representation generated by M. Moasser.

Application to RPPA (Reverse Phase Protein Arrays) data set

Breast cancers comprise distinct subtypes which may respond differently to pathway-targeted therapies; collections of breast cancer cell lines showed differential responses across cell lines and showed subtype-, pathway-, and genomic aberration-specific responses [8]. These observations suggest mechanisms of response and resistance. The Gray Lab and Dr. Mills' group have performed a time course analysis on 11 cell lines (all HER2 amplified: 6 PI3K mutant, 5 PI3K wild-type) in response to Lapatinib, Akt inhibitor and the combination of the two. They collected protein for Reverse Phase Protein Arrays (RPPA) [16] at 30min, 1h, 2h, 4h, 8h, 24h, 48h and 72h post-treatment.

As shown in Figure 7 (top row), Lapatinib treatment results in down-regulation of a variety of phosphoproteins in the signaling pathway. From the raw data (M) or the low-rank component (L), we can easily observe down-regulation and slow recovery of the levels of activation, but the levels of activation were higher in the PI3K mutant cell lines. Treatment with Akt inhibitor leads to down-regulation of proteins (downstream of Akt) in all HER2 amplified cell lines, although the amplitude of down-regulation is slightly less in cell lines with PI3K mutations. In the PI3K mutant cell lines, treatment with the combination of Lapatinib and Akt inhibitor leads to further down-regulation of the Akt signaling pathway, but Akt levels are intermediate in comparison to those observed with inhibitor alone. Although these observations are interesting, more interesting details might be found by examining the low-rank component L and the sparse component S separately:

(observation 1) In the PI3K mutant cell lines treated with both inhibitors, full inhibition of pS6RP is observed, and these results show the synergistic effect of Lapatinib and Akt inhibitor (in the bottom row, low-rank component).
(observation 2) The main difference between wild-type and PI3K mutant is the response of pS6RP and p70S6K. For the wild-type cell lines, all treatments result in down-regulated pS6RP and p70S6K. However, for PI3K mutant cells, all treatments result in up-regulation of pS6RP and p70S6K in the short term (in the sparse component, red) and down-regulation in the long term. Suppressing pS6RP relieves feedback inhibition and activates Akt. This difference makes PI3K mutant cells more resistant to HER2 inhibitors than their wild-type counterparts. This finding is not obvious from the raw data. Furthermore, our method makes the finding more convincing because it does not rely on visually searching and comparing M but separates the common response (L) and the aberrant behavior (S) by solving (4).
(observation 3) BT474 shows aberrant behavior as shown in Figure S4. This mutation has been reported to confer weak oncogenicity, unlike the other PI3K mutations.

Figure 7.

Heat maps showing average response based on both raw data and disentanglement result within subtype to targeted therapeutics: (1st column) HER2+/PI3K wild type, (2nd column) HER2+/PI3K mutant. Each column consists of average responses of raw RPPA, low-rank component and sparse component. Each row represents targeted therapeutics alone and in combination (LAP, AKTi, both). In the PI3K mutant, we can see up-regulation of S6 pS235, pS240 and p70S6K pS371 in the short term (in the sparse component, red) (more detailed information for each cell line in Figure S4).

Figure 8 shows the clustered results using existing hierarchical clustering and the proposed method respectively. Our clustered result is more robust and unaffected by different treatments, since our algorithm can separate common aspects of gene expressions from the raw data and identify aberrant responses. On the other hand, the clustered groups based on existing hierarchical clustering change across different treatments although the characteristics of the cell lines do not change.

Figure 8.

Clustered group using RPPA data set: (a,b,c) hierarchical cluster and (d,e,f) the proposed method.

Discussion

Clustering and network inference are usually developed independently. For instance, until recently, most studies of gene regulatory network inference focused on a particular data set to identify the underlying graph structure, and then applied the same method to another data set, and so on. Alternatively, clustering methods are applied to various data sets to find subgroups or to classify them. However, we would argue that there are deep relationships between the two and that they can potentially cover each other’s shortcomings, since a spatio-temporal gene expression pattern results from both the network structure and the integration of regulatory signals through the network [17]. Moreover, by using the available information and comparing gene expression levels under the various perturbation conditions, we might reveal the subtype graph structure and understand heterogeneity across perturbations without any prior information.

In this paper, we demonstrated that the proposed method helps to find distinct subtypes and to classify data sets in a robust way. To interpret a multi-dimensional spatio-temporal data set, we usually compare the responses across experiments and look for differences by inspecting the raw data. However, such comparisons are hard to trust, especially when no prior knowledge is available. Also, as the dimension of high-throughput data increases, analysis by visual inspection is not possible in practice. For instance, we might have to consider multiple dimensions such as positive perturbation, negative perturbation, temporal response, various read-outs, mechanisms and various doses. In contrast, the proposed method provides a more convincing way to interpret biological data while handling multi-dimensional data sets: the low-rank representation provides the large-scale features, and the sparse component shows the interesting details with respect to the common dynamic features. The intuition behind this is that one can recover the principal components of a data matrix even when a positive fraction of its entries are arbitrarily corrupted or missing [9].

Also, although there is a wealth of literature describing canonical cell signaling networks, little is known about exactly how these networks operate in different cancer cells. Therefore, a possible extension of the proposed method is to first extract the common responses and then apply an inference algorithm to these common responses to identify the unified structure. Alternatively, we can focus on the individual sparse components to identify heterogeneity in network structure. Advancing our understanding of how these networks are deregulated across cancer cells and different targeted therapies will ultimately lead to improved effectiveness of pathway-targeted therapies.

Conclusion

In this study, we develop a new method for clustering and analyzing multi-dimensional biological data that provides a new perspective on the problem. We illustrate how the proposed method can be used to extract common event-related neural features across many experimental trials. We also show that the proposed method helps to find distinct subtypes and to classify data sets in a robust way by separating common responses from abnormal responses without any prior knowledge. We are currently applying our method to analyze and cluster the RPPA data set of HER2-positive breast cancer and to identify the underlying graph structures.

Methods

Random Projection (RP) and Identifiability

Random Projection (RP)

Recent theoretical work has identified random projection as a promising dimensionality reduction technique [19]. Projecting the data onto a random lower-dimensional subspace preserves the similarity of different data vectors, for example, the distances between the points are approximately preserved. Also, RP can reduce the dimension of data while keeping clusters of data points well-separated [19]. Moreover, using RP is substantially less expensive to compute than using techniques such as PCA (Principal Component Analysis) because RP is data-independent.

The idea of RP is that a small number of random linear projections can preserve key information. Theoretical work [19] [20] [21] [22] guarantees that, with high probability, all pairwise Euclidean and geodesic distances between points on a low-dimensional manifold are well preserved under the mapping Ψ: ℝⁿ → ℝ^m, m < n. Consider a linear signal model y = Ψx, where x ∈ ℝⁿ, y ∈ ℝ^m and Ψ = [ψ₁ ψ₂ … ψ_n] is an m × n projection matrix whose elements are drawn independently from an identical distribution. First, note that the dimensionality of the data x is reduced since m < n. Also, if we define where ē_i is m-dimensional unit vector and , then or where ⊗ represents the Kronecker product and is an identity matrix.
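The distance-preservation property described above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a Gaussian projection matrix with i.i.d. N(0, 1/m) entries, and the helper names `random_projection` and `pairwise_distances`, the sizes and the seeds are all illustrative choices.

```python
import numpy as np

def random_projection(X, m, seed=0):
    """Project the rows of X (shape N x n) onto m dimensions with a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    # Assumption: i.i.d. N(0, 1/m) entries, so squared norms are preserved in expectation.
    Psi = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
    return X @ Psi.T

def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 1000))   # 20 points in R^1000
Y = random_projection(X, m=200)   # the same points in R^200

# Ratio of projected to original distance for every distinct pair of points.
iu = np.triu_indices(len(X), k=1)
ratios = pairwise_distances(Y)[iu] / pairwise_distances(X)[iu]
print("distance ratios: min %.3f, max %.3f" % (ratios.min(), ratios.max()))
```

The ratios concentrate around 1, and their spread shrinks as m grows. Note that Ψ is generated without looking at X, which is the sense in which RP is data-independent.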

In [19], Dasgupta showed that even if the original distribution of data samples is highly skewed (having an ellipsoidal contour of high eccentricity), its projected counterpart will be more spherical. Since it is conceptually much easier to design algorithms for spherical clusters than for ellipsoidal ones, this feature of random projection can simplify the separation into the low-rank and sparse components. Therefore, we can reduce the computational complexity of the non-smooth convex optimization, in particular the ℓ₁-norm and nuclear-norm minimization, used in RPCA².

Identifiability

Suppose our input in equation (6) can be decomposed as where σ_i are the positive singular values, are the left- and right-singular vectors of L, and d_L represents the rank of the matrix L. d_S is the number of sparse components in S, and are sparse with only one nonzero entry respectively. By using RP, we have for , where we denote by R. As we mentioned above, our input is sparse, so the singular vectors of the low-rank matrix L might not be reasonably spread out. However, by using RP (multiplying by R), the singular vectors of the resulting matrix become reasonably spread out.
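The effect of multiplying by R on the singular vectors can be illustrated with a small experiment. This is a sketch under assumptions that are not taken from the paper: R is applied on the left, its entries are Gaussian, the test matrix is rank one with a single-spike left singular vector, and `peak_energy` is a hypothetical measure of how concentrated a vector is.

```python
import numpy as np

def peak_energy(u):
    """Fraction of a vector's energy carried by its largest entry (1.0 = single spike)."""
    return float((u ** 2).max() / (u ** 2).sum())

rng = np.random.default_rng(0)
n, q, m = 200, 50, 40

# Rank-1 matrix whose left singular vector is a single spike (not spread out at all).
u = np.zeros(n)
u[3] = 1.0
v = rng.normal(size=q)
L = np.outer(u, v)

# Assumed form of the projection: Gaussian R applied on the left.
R = rng.normal(size=(m, n)) / np.sqrt(m)

u_before = np.linalg.svd(L, full_matrices=False)[0][:, 0]
u_after = np.linalg.svd(R @ L, full_matrices=False)[0][:, 0]
print(peak_energy(u_before), peak_energy(u_after))
```

Before projection all of the energy sits in one coordinate. After projection the leading left singular vector is proportional to a column of R, so its energy is distributed over all m coordinates.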

Cluster Analysis

Overview: Dissimilarity

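Common measures of dissimilarity for data include the Euclidean distance [13],

$$\sqrt{\sum_{i=1}^{p} (x_i - y_i)^2},$$

where x and y are p-vectors of measurements on the objects to be clustered. The Manhattan distance,

$$\sum_{i=1}^{p} |x_i - y_i|,$$

is also used, and the “1-correlation” distance is defined as follows:

$$d_{xy} = 1 - \rho_{xy} = 1 - \frac{\sum_{i=1}^{p} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{p} (x_i - \bar{x})^2 \, \sum_{i=1}^{p} (y_i - \bar{y})^2}}, \tag{9}$$

where $\bar{x}$ and $\bar{y}$ are the means of the entries of x and y.

The 1-correlation distance is bounded in [0, 2]. This dissimilarity is invariant to changes in location or scale of either x or y. The 1-correlation dissimilarity can be related to the more familiar Euclidean distance: if $\tilde{x} = (x - \bar{x})/s_x$ and $\tilde{y} = (y - \bar{y})/s_y$, with $s_x^2 = \frac{1}{p}\sum_{i=1}^{p} (x_i - \bar{x})^2$ and $s_y$ defined analogously, then $\|\tilde{x} - \tilde{y}\|^2 = 2p\,(1 - \rho_{xy})$. That is, the squared Euclidean distance between standardized objects is proportional to the 1-correlation distance between the original objects. Because changes in the average measurement level or range of measurement from one sample to the next are effectively removed by this dissimilarity, it is a popular choice for microarray data and other biological applications.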
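The properties of the 1-correlation distance stated above can be verified on a small example. The sketch below uses only the standard library; the function names `one_minus_corr` and `standardize` and the two test vectors are illustrative, and the standard deviation uses the 1/p convention (an assumption, chosen so that the proportionality constant is exactly 2p).

```python
import math

def one_minus_corr(x, y):
    """1-correlation distance: one minus the Pearson correlation of x and y."""
    p = len(x)
    mx, my = sum(x) / p, sum(y) / p
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)

def standardize(x):
    """Center x and scale it to unit variance (1/p convention)."""
    p = len(x)
    mx = sum(x) / p
    s = math.sqrt(sum((a - mx) ** 2 for a in x) / p)
    return [(a - mx) / s for a in x]

x = [1.0, 2.0, 4.0, 3.0, 5.0]
y = [2.0, 1.0, 3.0, 5.0, 4.0]

d = one_minus_corr(x, y)
# Invariance: shifting x and rescaling it by a positive factor leaves d unchanged.
d_shifted = one_minus_corr([10.0 + 3.0 * a for a in x], y)
# Squared Euclidean distance between the standardized vectors.
sq_euclid = sum((a - b) ** 2 for a, b in zip(standardize(x), standardize(y)))
print(d, d_shifted, sq_euclid / (2 * len(x)))
```

For these vectors the correlation is 0.6, so the three printed values all agree with 0.4 up to rounding. The invariance holds for positive rescaling only; a negative scale factor flips the sign of the correlation.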

Missing data and corruption

As mentioned above, missing and corrupted data are quite common in microarray data. To deal with missing values, one can modify the clustering method or impute a “complete” data set before clustering. For example, consider the highly correlated signals x_L = sin(t) + n₁ and y_L = sin(t) + n₂, where t is the time step and n₁, n₂ are Gaussian noise. Now, we add a sparse corruption (x_S) to the original signal (x_L) as shown in Figure 9 (a) and calculate the dissimilarity between x_corr (= x_L + x_S) and y_corr (= y_L + 0). Even though the corruption x_S is d-sparse, where d (≪ p) is the number of nonzero components in x_S, the correlation is degraded, as shown in Figure 9 (b) (left). Assuming that we know the corruption signals x_S and y_S, we can decompose x_corr and y_corr as ϕ = [x_L; x_S] ∈ ℝ^{2p} and ψ = [y_L; y_S] ∈ ℝ^{2p}, respectively. In Figure 9 (b) (middle), the red squares represent the corruption signal, where y_S = 0. Since the corruption signal changes the mean and the variance, the correlation is still degraded in (b) (middle). We therefore introduce γ to allow different weighting factors for (x_L, y_L) and (x_S, y_S). For example, we choose a small γ for the corruption signal (x_S, y_S).

Figure 9.

Simple example: (a) green solid line with circle (-○-) represents y_corr (= y_L + 0) and blue solid line with circle (-•-) represents x_corr (= x_L + x_S), where filled circle (•) represents corrupted data, unfilled circle (○) represents uncorrupted data (x_L) and unfilled square (□) represents corruption signal (x_S); (b) x_corr-y_corr plot with 1-correlation distance (d_xy) without modification (left), with disentanglement (middle), and with disentanglement/weighting factor γ (right).

Therefore, in order to cluster corrupted signals, we should first separate the original signal from the corruption signal and then calculate the dissimilarity with an adjusted weighting factor γ. For a gene expression time series data set, when a gene is knocked out, the system is subjected to a controlled perturbation and the expression levels of other genes are broadly perturbed. We can reveal extra information by comparing gene expression levels in the perturbed system with those in the original system. Since abnormal behaviors, or different responses to external stimuli or across cell lines, can be extracted from the original data using the information available in the data set, we can cluster the data and reveal biologically meaningful patterns.
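The degradation caused by a sparse corruption, and the benefit of separating it first, can be reproduced with the sin(t) example. This is a minimal sketch with assumed values: the noise standard deviation (0.05), the number of samples, and the positions and size of the spikes are illustrative, and the separation is taken as known (an oracle) rather than estimated.

```python
import math
import random

def one_minus_corr(x, y):
    """1-correlation distance: one minus the Pearson correlation of x and y."""
    p = len(x)
    mx, my = sum(x) / p, sum(y) / p
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)

random.seed(0)
p = 100
t = [2.0 * math.pi * k / p for k in range(p)]

# Highly correlated signals; the noise level is an assumed value.
x_L = [math.sin(tk) + random.gauss(0.0, 0.05) for tk in t]
y_L = [math.sin(tk) + random.gauss(0.0, 0.05) for tk in t]

# d-sparse corruption with d = 3 << p: a few large spikes added to x only.
x_S = [0.0] * p
for k in (10, 40, 70):
    x_S[k] = 5.0
x_corr = [a + b for a, b in zip(x_L, x_S)]

d_corrupted = one_minus_corr(x_corr, y_L)   # distance on the raw, corrupted signal
d_separated = one_minus_corr(x_L, y_L)      # distance once x_S has been removed
print(d_corrupted, d_separated)
```

Three corrupted samples out of 100 are enough to move the distance far from zero, while the distance computed on the separated components stays close to zero.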

Our approach: a new 1-correlation distance

We rewrite the “1-correlation” distance (9) as where x, y ∈ ℝ^p, and and consider the separation as follows: and where x = x_L + x_S, y = y_L + y_S and the subscripts L and S denote the low-rank component and the sparse component, respectively. We define the “1-correlation” distance for ϕ, ψ as follows: where and . The relation between d_xy (= 1 − ρ_xy) and d_ϕψ (= 1 − ρ_ϕψ) is as follows: where is p-dimensional identity matrix, and .

Therefore, d_xy uses the mixture of low-rank component and sparse component but d_ϕψ calculates the correlation based on the separation. Also, in order to adjust the weighting factor as shown in Figure 9 (b) (right), we simply denote where γ is a weighting factor.

Lemma 1. If the sparse component is zero, d_ϕψ = d_xy.

Proof. Since x_S = 0 and y_S = 0, we can simply consider ϕ, ψ as and respectively and γ = 1

For the disentanglement, we propose the RPCA-based (Robust Principal Component Analysis, [9]) method which uses the information available in the data set in order to identify similar expression patterns³.
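For readers unfamiliar with RPCA, the decomposition M = L + S can be sketched with the standard principal component pursuit formulation of [9], solved by an augmented Lagrangian iteration. This is not the modified formulation used in this paper: it omits the random projection step, and the function name `rpca_pcp`, the default parameters and the synthetic test matrix are assumptions made for illustration.

```python
import numpy as np

def shrink(X, tau):
    """Entrywise soft-thresholding (proximal operator of the l1 norm)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    """Singular value thresholding (proximal operator of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def rpca_pcp(M, lam=None, mu=None, tol=1e-7, max_iter=1000):
    """Split M into low-rank L and sparse S by minimizing ||L||_* + lam * ||S||_1."""
    n1, n2 = M.shape
    lam = 1.0 / np.sqrt(max(n1, n2)) if lam is None else lam
    mu = n1 * n2 / (4.0 * np.abs(M).sum()) if mu is None else mu
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)          # Lagrange multiplier for the constraint L + S = M
    norm_M = np.linalg.norm(M)
    for _ in range(max_iter):
        L = svt(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        residual = M - L - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * norm_M:
            break
    return L, S

# Synthetic check: rank-2 common response plus 5% sparse aberrant entries.
rng = np.random.default_rng(0)
n, r = 80, 2
L0 = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))
S0 = np.zeros((n, n))
mask = rng.random((n, n)) < 0.05
S0[mask] = rng.choice([-5.0, 5.0], size=mask.sum())
M = L0 + S0

L_hat, S_hat = rpca_pcp(M)
rel_err = np.linalg.norm(L_hat - L0) / np.linalg.norm(L0)
print("relative error of the low-rank estimate:", rel_err)
```

In this setting L plays the role of the common response and S the aberrant response. The default weight 1/sqrt(max(n1, n2)) follows the recommendation in [9]; other weights change how much of the signal is assigned to the sparse component.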

Acknowledgments

This research was supported by the NIH NCI under the ICBP and PS-OC programs (5U54CA112970-08), the NIGMS and by the NSF under grant EFRI 1137267.

Footnotes

* E-mail: tomlin{at}eecs.berkeley.edu
1 Submovement represents a type of motor primitive. For example, the hand speed profile as a function of time resulting from arm movements can be represented by a sum of bell-shaped functions, each of which is called a submovement [14].
2 Many speedup methods were developed in optimization by avoiding large-scale SVD. In [23], Mu et al. demonstrated the power of projected matrix nuclear norm by reformulating RPCA and in [24], Zhou et al. demonstrated the effectiveness and the efficiency of Bilateral Random Projections. However, both methods consider a dense matrix while in this paper we consider the case when the input matrix is sparse.
3 In [18], Liu et al. proposed an RPCA-based method of discovering differentially expressed genes using static data. They provided an efficient and effective approach for gene identification. However, we focus on the spatio-temporal gene expression data set and consider the disentanglement of low-rank and sparse component to extract common features and detect specific response or heterogeneity via modified RPCA. Here, we treat the spatio-temporal gene expression and focus on the relationship between gene regulatory network and dynamics of regulatory signal. We note this goes beyond the results in [14] due to the transformation involved.

References

1. Marx V (2013) Biology: The big challenges of big data. Nature 498: 255–260.
2. Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences of the United States of America 95: 14863–14868.
3. Churchland MM, Cunningham JP, Kaufman MT, Foster JD, Nuyujukian P, et al. (2012) Neural population dynamics during reaching. Nature 487: 51–56.
4. Amin DN, Sergina N, Ahuja D, McMahon M, Blair JA, et al. (2010) Resiliency and vulnerability in the HER2-HER3 tumorigenic driver. Science Translational Medicine 2: 16ra7.
5. Androulakis I, Yang E, Almon R (2007) Analysis of time-series gene expression data: Methods, challenges, and opportunities. Annual Review of Biomedical Engineering 9: 205–228.
6. Churchland MM, Shenoy KV (2007) Temporal complexity and heterogeneity of single-neuron activity in premotor and motor cortex. Journal of Neurophysiology 97: 4235–4257.
7. Yu BM, Cunningham JP, Santhanam G, Ryu SI, Shenoy KV, et al. (2009) Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity. Journal of Neurophysiology 102: 614–635.
8. Heiser LM, Sadanandam A, Kuo WL, Benz SC, Goldstein TC, et al. (2012) Subtype and pathway specific responses to anticancer compounds in breast cancer. Proceedings of the National Academy of Sciences of the United States of America 109: 2724–2729.
9. Candès EJ, Li X, Ma Y, Wright J (2011) Robust principal component analysis? Journal of the ACM 58: 1–37.
10. Gerstner W, Kempter R, Hemmen JLV, Wagner H (1996) A neuronal learning rule for sub-millisecond temporal coding. Nature 383: 76–78.
11. Song S, Miller KD, Abbott LF (2000) Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neuroscience 3: 919–926.
12. Long MA, Jin DZ, Fee MS (2010) Support for a synaptic chain model of neuronal sequence generation. Nature 468: 394–399.
13. Chipman H, Hastie TJ, Tibshirani R (2003) Clustering microarray data. In: Speed T, editor. Statistical Analysis of Gene Expression Microarray Data, chapter 4. Chapman and Hall/CRC Press.
14. Chang YH, Chen M, Overduin SA, Gowda S, Carmena JM, et al. (2013) Low-rank representation of neural activity and detection of submovements. Proceedings of the IEEE Conference on Decision and Control: 2544–2549.
15. Peterson TR, Laplante M, Thoreen CC, Sancak Y, Kang SA, et al. (2009) DEPTOR is an mTOR inhibitor frequently overexpressed in multiple myeloma cells and required for their survival. Cell 137: 873–886.
16. Hennessy BT, Lu Y, Gonzalez-Angulo AM, Carey MS, Myhre S, et al. (2010) A technical assessment of the utility of reverse phase protein arrays for the study of the functional proteome in non-microdissected human breast cancers. Clinical Proteomics 6: 129–151.
17. Shiraishi Y, Kimura S, Okada M (2010) Inferring cluster-based networks from differently stimulated multiple time-course gene expression data. Bioinformatics 26: 1073–1081.
18. Liu JX, Wang YT, Zheng CH, Sha W, Mi JX, et al. (2013) Robust PCA based method for discovering differentially expressed genes. BMC Bioinformatics 14.
19. Dasgupta S (2000) Experiments with random projection. Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence: 143–151.
20. Bingham E, Mannila H (2001) Random projection in dimensionality reduction: applications to image and text data. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’01): 245–250.
21. Deegalla S, Bostrom H (2006) Reducing high-dimensional data by principal component analysis vs. random projection for nearest neighbor classification. 5th International Conference on Machine Learning and Applications (ICMLA): 245–250.
22. Baraniuk RG, Wakin MB (2009) Random projections of smooth manifolds. Foundations of Computational Mathematics 9: 51–77.
23. Mu Y, Dong J, Yuan X, Yan S (2011) Accelerated low-rank visual recovery by random projection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 2609–2616.
24. Zhou T, Tao D (2011) Bilateral random projections. arXiv:1112.5215.

Posted April 23, 2014.

Subject Area

Systems Biology

Subject Areas

All Articles

Animal Behavior and Cognition (5210)
Biochemistry (11739)
Bioengineering (8750)
Bioinformatics (29189)
Biophysics (14967)
Cancer Biology (12093)
Cell Biology (17409)
Clinical Trials (138)
Developmental Biology (9419)
Ecology (14178)
Epidemiology (2067)
Evolutionary Biology (18301)
Genetics (12238)
Genomics (16797)
Immunology (11865)
Microbiology (28068)
Molecular Biology (11583)
Neuroscience (60953)
Paleontology (451)
Pathology (1870)
Pharmacology and Toxicology (3238)
Physiology (4957)
Plant Biology (10425)
Scientific Communication and Education (1683)
Synthetic Biology (2884)
Systems Biology (7338)
Zoology (1651)

[1] 1.↵
Marx V (2013) Biology: The big challenges of big data. Nature: 255–260.

[2] 2.↵
Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences of the United States of America 498: 14863–14868.
OpenUrl

[3] 3.↵
Churchland MM, Cunningham JP, Kaufman MT, Foster JD, Nuyujukian P, et al. (2012) Neural population dynamics during reaching. Nature 487: 51–56.
OpenUrl CrossRef PubMed Web of Science

[4] 4.↵
Amin DN, Sergina N, Ahuja D, McMahon M, Blair JA, et al. (2010) Resiliency and vulnerability in the her2-her3 tumorigenic driver. Science Translational Medicine 2: 16ra7.
OpenUrl Abstract/FREE Full Text

[5] 5.↵
Androulakis I, Yang E, Almon R (2007) Analysis of time-series gene expression data: Methods, challenges, and opportunities, annual review of biomedical engineering. Annual Review of Biomedical Engineering 9: 205–228.
OpenUrl CrossRef PubMed Web of Science

[6] 6.↵
Churchland MM, Shenoy KV (2007) Temporal complexity and heterogeneity of single-neuron activity in premotor and motor cortex. Journal of Neurophysiology 97: 4235–4257.
OpenUrl CrossRef PubMed Web of Science

[7] 7.↵
Yu BM, Cunningham JP, Santhanam G, Ryu SI, Shenoy KV, et al. (2009) Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity. Journal of Neurophysiology 102: 614–635.
OpenUrl CrossRef PubMed Web of Science

[8] 8.↵
Heiser LM, Sadanandam A, Kuo WL, Benz SC, Goldstein TC, et al. (2012) Subtype and pathway specific responses to anticancer compounds in breast cancer. Proceedings of the National Academy of Sciences of the United States of America 109: 2724–2729.
OpenUrl Abstract/FREE Full Text

[9] 9.↵
Candès EJ, Li X, Ma Y, Wright J (2011) Robust principal component analysis? Journal of the ACM 58: 1–37.
OpenUrl

[10] 10.↵
Gerstner W, Kempter R, Hemmen JLV, Wagner H (1996) A neuronal learning rule for sub-millisecond temporal coding. Nature 383: 76–78.
OpenUrl CrossRef PubMed Web of Science

[11] 11.↵
Song S, Miller KD, Abbott LF (2000) Competitive hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neuroscience 3: 919–926.
OpenUrl CrossRef PubMed Web of Science

[12] 12.↵
Long MA, Jin DZ, Fee MS (2010) Support for a synaptic chain model of neuronal sequence generation. Nature 468: 394–399.
OpenUrl CrossRef PubMed Web of Science

[13] 13.↵
Chipman H, Hastie TJ, Tibshirani R (2003) Chap4: Clustering microarray data. Statistical analysis of gene expression microarray data Terry Speed, Chapman and Hall, CRC press.

[14] 14.↵
Chang YH, Chen M, Overduin SA, Gowda S, Carmena JM, et al. (2013) Low-rank representation of neural activity and detection of submovements. the Proceedings of the IEEE Conference on Decision and Control: 2544–2549.

[15] 15.↵
Peterson TR, Laplante M, Thoreen CC, Sancak Y, Kang SA, et al. (2009) Deptor is an mtor inhibitor frequently overexpressed in multiplemyeloma cells and required for their survival. Cell 137: 873–886.
OpenUrl CrossRef PubMed Web of Science

[16] 16.↵
Hennessy BT, Lu Y, Gonzalez-Angulo AM, Carey MS, Myhre S, et al. (2010) A technical assessment of the utility of reverse phase protein arrays for the study of the functional proteome in non-microdissected human breast cancers. Clinical Proteomics 6: 129–151.
OpenUrl CrossRef PubMed

[17] 17.↵
Shiraishi Y, Kimura S, Okada M (2010) Inferring cluster-based networks from differently stimulated multiple time-course gene expression data. BMC Bioinformatics 26: 1073–1081.
OpenUrl

[18] 18.↵
Liu JX, Wang YT, Zheng CH, Sha W, Mi JX, et al. (2013) Robust pca based method for discovering differential expressed genes. BMC Bioinformatics 14.

[19] 19.↵
Dasgupta S (2000) Experiments with random projection. Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence: 143–151.

[20] 20.↵
Bingham E, Mannila H (2001) Random projection in dimensionality reduction: applications to image and text data. Proceeding KDD ’01 Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining: 245–250.

[21] 21.↵
Deegalla S, Bostrom H (2006) Reducing high-dimensional data by principal component analysis vs. random projection for nearest neighbor classification. 5th International Conference on Machine Learning and Applications (ICMLA): 245–250.

[22] 22.↵
Baraniuk RG, Wakin MB (2009) Random projections of smooth manifolds. Journal of Foundations of Computational Mathematics 9: 51–77.
OpenUrl

[23] 23.↵
Mu Y, Dong J, Yuan X, Yan S (2011) Accelerated low-rank visual recovery by random projection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 2609–2616.

[24] 24.↵
Zhou T, Tao D (2011) Bilateral random projections. arXiv:11125215.

Disentangling Multidimensional Spatio-Temporal Data into their Common and Aberrant Responses

Abstract

Introduction

Background

Motivation

1. Neural Population Dynamics

2. Gene Regulatory Network

Robust Principal Component Analysis (RPCA)

How to Construct the Data Matrix M

1. Neural Population Dynamics

2. Gene Regulatory Network

Results

Disentangling the Low-rank and Sparse components

Numerical Example

Application to Neural Data

Application to gene knockout experiments

Application to RPPA (Reverse Phase Protein Arrays) data set

Discussion

Conclusion

Methods

Random Projection (RP) and Identifiability

Random Projection(RP)

Identifiability

Cluster Analysis

Overview: Dissimilarity

Missing data and corruption

Our approach: a new 1-correlation distance

Acknowledgments

Footnotes

References

Citation Manager Formats

Subject Area