Abstract
With the advent of high-throughput measurement techniques, scientists and engineers are starting to grapple with massive data sets and are encountering challenges in organizing, processing and extracting information into meaningful structures. Multidimensional spatio-temporal biological data sets, such as time series gene expression under various perturbations across different cell lines, or neural spike data sets across many experimental trials, have the potential to yield insight across multiple dimensions. For this potential to be realized, we need a suitable representation to turn data into insight. Since the wide range of experiments and the (unknown) complexity of the underlying system make biological data more heterogeneous than data in other fields, we propose a method based on Robust Principal Component Analysis (RPCA), which is well suited for extracting principal components from corrupted observations. The proposed method provides a new representation of these data sets that consists of their common and aberrant responses. This representation might help users acquire new insight from data.
Author Summary
One of the most exciting trends and important themes in science and engineering involves the use of high-throughput measurement data. With different dimensions, for example, various perturbations, different doses of drugs or cell line characteristics, such multidimensional data sets enable us to understand commonalities and differences across multiple dimensions. A general question is how to organize the observed data into meaningful structures and how to find an appropriate similarity measure. A natural way of viewing these complex high dimensional data sets is to examine and analyze the large-scale features and then to focus on the interesting details. With this notion, we propose an RPCA-based method which models common variations as an approximately low-rank component and anomalies as a sparse component. We show that the proposed method is able to find distinct subtypes and classify data sets in a robust way by separating common responses and abnormal responses without any prior knowledge.
Introduction
Over the last several years, the use of high-throughput measurement data has become one of the most exciting trends and important themes in science and engineering. This is becoming increasingly important in biology. However, handling and analyzing biological data pose challenges of their own because the data are heterogeneous. Biological data not only stem from a wide range of experiments but also reflect the (unknown) complexity of the underlying system [1]. In cancer cells, signaling networks frequently become compromised, leading to abnormal behaviors and responses to external stimuli. Also, we have to consider various experimental conditions with different dimensions such as inhibitions/stimulations, different doses of drugs, and various cell lines, as shown in Figure 1.
With the explosion of the amount of various biological data, a general question is how to organize the observed data into meaningful structures and how to find an appropriate similarity (or dissimilarity) measure which is critical to the analysis. Since such multidimensional data have the potential to acquire insight across multiple dimensions, these data enable users to start to develop models and draw hypotheses that not only describe the spatial and temporal dynamics of the biological system but also inform them about commonalities and differences across dimensions. A significant challenge for creating suitable representations is to continue handling large data sets and to match the growing diversity and quantity of the data set.
A natural way of viewing these complex high dimensional data sets is to examine and analyze the large-scale features and then to focus on the interesting details. The potential of clustering to reveal biologically meaningful patterns in microarray data was first realized and demonstrated in an early paper by Eisen et al [2]. Thereafter, in many biological applications, different methods have been used to analyze gene expression data and characterize gene functional behavior. Among various data-driven modeling approaches, clustering methods are widely used on these data to categorize genes with similar expression profiles. However, until recently, most studies have focused on the spatial, rather than temporal, structure of data. For instance, neural models are usually concerned with processing static spatial patterns of intensities without regard to temporal information [3]. Since many existing data-driven modeling approaches such as clustering, classification or inference such as Bayesian inference using biological data focus on static data, they have limitations in analyzing multi-dimensional spatio-temporal data sets.
Recently, much research has focused on time series high-throughput data sets such as time series gene expression or time-binned neural activity. These data sets have the advantage of being able to identify dynamic relationships between genes or neurons since the spatio-temporal pattern results from integration of regulatory signals or electrochemical signals through the network over time. For example, time series gene-knockout experiment data sets provide the distinct possibility of observing the cellular mechanisms in action [4]. Also, these data sets help us to unravel the mechanistic drivers characterizing cellular response and to break down the genome into sets of genes involved in the related processes [5]. Moreover, instead of concentrating on steady state response, monitoring dynamic patterns provides a profoundly different type of information. For instance, several recent studies focus on the temporal complexity and heterogeneity of single-neuron activity in the premotor and motor cortices [3] [6] [7]. Moreover, since many current and emerging cancer treatments are designed to inhibit or stimulate a specific node (or gene) in the networks and alter signaling cascades, advancing our understanding of how the system dynamics of these networks is deregulated across cancer cells and finding subgroups of genes and conditions will ultimately lead to the more effective treatment strategies [8].
In this paper, we propose the RPCA-based method for analyzing spatio-temporal data sets which represent underlying biological systems. Since the proposed method provides us suitable representations which turn data into insight, it helps us to understand information from a new point of view. To demonstrate that our method helps users acquire insight efficiently and to emphasize that the proposed method can be applicable to various domains, we consider two different systems 1) neural population dynamics and 2) gene regulatory network. Since the proposed method uses the common dynamic features in the spatio-temporal data set, the key idea is how to arrange individual data sets in order to make them amenable to this analysis. Thus, the proposed method enables scientists and engineers to analyze these data by retrieving common dynamical information and focus on the interesting details with a new perspective on the problem.
Background
Motivation
1. Neural Population Dynamics
Neural activity is typically studied by averaging noisy spiking activity across multiple experimental trials to obtain an approximate neural firing rate that varies smoothly over time. However, if neural activity is more a reflection of internal neural dynamics than a response to external stimuli, the time series of neural activity may differ even when a monkey is performing nominally identical tasks [7]. In [6], Churchland et al. showed that neural activity patterns in the primary motor cortex and premotor cortex associated with nearly identical velocity profiles can be very different. This is particularly true of behavioral tasks involving perception, decision making, attention, or motor planning. In these settings, it is critical not to average the neural data across trials, but to analyze it on a trial-by-trial basis [3]. Moreover, stimulus representations in some sensory systems are characterized by the precise spike timing of a small number of neurons [10] [11] [12], suggesting that the details of operations in the brain are embedded not only in the overall neural spike rate, but also in the timing of spikes.
The motor and premotor cortex have been extensively studied but their dynamic response properties are poorly understood [3]. Also, there is a debate about whether neural activity relates to muscles or to abstract movement features. We can define the motor cortical activity, which represents movement parameters, as per equation (1), and the dynamical system that generates movements as per equation (2) [3]:

$$x_i(t) = h_i(\mathrm{param}_1(t), \mathrm{param}_2(t), \ldots) \tag{1}$$

$$\dot{x}(t) = f(x(t)) + u(t) \tag{2}$$

where $x_i(t)$ is the firing rate of neuron i at time t, $h_i$ is its tuning function, and each $\mathrm{param}_j$ may represent a movement parameter such as hand velocity, target position or direction. In (2), $x \in \mathbb{R}^n$ is a vector describing the firing rate of all neurons where n is the number of neurons, $\dot{x}$ is its derivative, f is an unknown function, and u is an external input. In (2), neural activity is governed by the underlying dynamics f(·), so dynamical features should be present in the population activity.
2. Gene Regulatory Network
In microarray data, missing data and corrupted data are quite common, and not uniform across samples (here, we consider arbitrary corruptions by human error during biological experiments, for example, mislabeling, improper use of markers or antibody). Two strategies for dealing with missing values are either to modify clustering methods so that they can deal with missing values, or impute a “complete” data set before clustering [13].
Consider collections of time series gene expression of breast cancer cell lines or microarray data sets from pathway-targeted therapies that are gene knockout experiments. When a specific gene is perturbed as shown in Figure 2(c), the expression levels of many other genes might be perturbed over time. Thus, comparing gene expression levels in the perturbed system with those in the unperturbed system reveals extra information, namely the different cellular mechanisms in action. A dynamical system of the gene regulatory network can be modelled as follows:

$$\dot{x}(t) = f(x(t)) + g_{\{\cdot\}}(x(t)), \quad x(t) \in \mathbb{R}^n \tag{3}$$

where $\dot{x}(t)$ represents the derivative of gene expression at time t, n is the number of genes, f(·) represents the vector field of the typical dynamical system (or wild-type) and g{·}(·) represents an additional vector field which is added by perturbation, or simply a mutant-specific part (blue and red edges in Figure 2(c)). In other words, we have a unified model for the wild-type cell line, and in the mutant or perturbation case, we invoke a single change to the network topology or add a single influence for the specific gene, where additional vector fields such as gLAP(·), gAkti(·) and gM(·) are assumed to be sparse (i.e., they affect only a single gene expression). Note that these additional vector fields affect only a single gene expression at time t, but this influence can be propagated through the network and integrated over time.
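The propagation of a sparse perturbation through the network can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the wild-type vector field f is replaced by a hypothetical linear chain network (matrix `A`), the perturbation g is a constant input acting on a single gene, and the system is integrated with forward Euler. None of these choices come from the data sets discussed here.

```python
import numpy as np

def simulate(A, x0, g, dt=0.1, steps=10):
    """Forward-Euler integration of dx/dt = A x + g (hypothetical linear f)."""
    x = x0.copy()
    traj = [x.copy()]
    for _ in range(steps):
        x = x + dt * (A @ x + g)
        traj.append(x.copy())
    return np.array(traj)  # shape (steps + 1, n)

n = 6
# Assumed chain network: each gene decays and is activated by its predecessor.
A = -np.eye(n) + 0.5 * np.eye(n, k=-1)
x0 = np.ones(n)

g_wild = np.zeros(n)   # unperturbed (wild-type) system
g_pert = np.zeros(n)
g_pert[0] = -1.0       # sparse perturbation: acts directly on gene 0 only

# Difference between perturbed and unperturbed trajectories.
diff = simulate(A, x0, g_pert) - simulate(A, x0, g_wild)

# Number of genes whose expression deviates from wild-type at each time step.
affected = (np.abs(diff) > 1e-12).sum(axis=1)
print(affected)
```

In this toy network the perturbation touches one gene directly, yet the number of deviating genes grows with time as the influence travels down the chain, which is the behavior the model in (3) is meant to capture.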
Robust Principal Component Analysis (RPCA)
In the computer vision literature [9], an interesting separation problem is introduced in which the observed data matrix can be decomposed into an unseen low-rank component and an unseen sparse component. The method, called Robust Principal Component Analysis (RPCA), is a provably correct and efficient algorithm for the recovery of low-dimensional linear structure from non-ideal observations. For example, gross errors frequently occur in many applications such as image processing and bioinformatics, where they arise from occlusions, malicious tampering or sensor failures.
In video surveillance, we need to identify activities that stand out from the background given a sequence of video frames [9]. Figure 2 (a) shows that if we stack the vectorized video frames as rows of a matrix $M \in \mathbb{R}^{q \times P_x P_y}$, where q is the number of frames for a given time window and Px and Py represent the number of pixels along each dimension of the 2-D images, then each row of M consists of a common component, the stationary background, and the moving objects in the foreground of that frame. Here, the large data matrix M is the input for RPCA and the output is both the stationary background (L) and the moving objects in the foreground (S). Suppose you have only one frame; then you cannot distinguish the moving objects from the stationary background. However, by stacking all the vectorized frames in sequence, i.e., aligning all the frames along the column direction as shown in Figure 2 (a), we can identify the stationary background, which is the common variation, and then capture the moving objects, which are the sparse components of each frame.
With this notion, suppose we are given a large data matrix M, which has principal components in the low-rank component and may contain some anomalies in the sparse component. Mathematically, it is natural to model the common variations as an approximately low-rank component L, and the anomalies as a sparse component S. In [9], Candès et al. formulate this as follows:

$$\min_{L,\,S} \; \|L\|_* + \lambda \|S\|_1 \quad \text{subject to} \quad L + S = M \tag{4}$$

where $\|L\|_*$ denotes the so-called nuclear norm of the matrix L, i.e., the sum of the singular values of L, and $\|S\|_1 = \sum_{ij} |S_{ij}|$ represents the l1-norm of S. Choosing the tuning parameter λ to be $1/\sqrt{\max(n_1, n_2)}$ works well for incoherent matrices, where $n_1$, $n_2$ represent the dimensions of the matrix M [9]. For practical problems, however, it is often possible to improve performance by choosing λ according to prior knowledge about the solution.
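One common way to solve this convex program numerically is an augmented Lagrangian (ADMM-style) iteration that alternates singular value thresholding for L with entrywise soft-thresholding for S. The sketch below is a minimal NumPy version under stated assumptions: the function names `shrink`, `svd_threshold` and `rpca`, the step-size heuristic for `mu`, the stopping tolerance and the synthetic test matrix are all illustrative choices, and this is not necessarily the solver used for the results reported here.

```python
import numpy as np

def shrink(X, tau):
    """Entrywise soft-thresholding (proximal operator of tau * l1-norm)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_threshold(X, tau):
    """Singular value thresholding (proximal operator of tau * nuclear norm)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def rpca(M, lam=None, tol=1e-7, max_iter=1000):
    """Approximately solve min ||L||_* + lam * ||S||_1 subject to L + S = M."""
    M = np.asarray(M, dtype=float)
    n1, n2 = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(n1, n2))      # default tuning parameter
    mu = n1 * n2 / (4.0 * np.abs(M).sum())    # assumed step-size heuristic (M must be nonzero)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                      # Lagrange multiplier for L + S = M
    for _ in range(max_iter):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        residual = M - L - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * np.linalg.norm(M):
            break
    return L, S

# Synthetic check: a rank-2 matrix plus roughly 5% large sparse corruptions.
rng = np.random.default_rng(0)
n, r = 80, 2
L0 = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
S0 = np.zeros((n, n))
mask = rng.random((n, n)) < 0.05
S0[mask] = rng.choice([-5.0, 5.0], size=mask.sum())

L_hat, S_hat = rpca(L0 + S0)
rel_err = np.linalg.norm(L_hat - L0) / np.linalg.norm(L0)
print("relative error of the low-rank estimate:", rel_err)
```

Passing a different `lam` to `rpca` is the hook for the prior-knowledge tuning mentioned above: a smaller value makes it cheaper to assign entries to S, while a larger value pushes more of the signal into L.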
How to Construct the Data Matrix M
Recall the video surveillance example as shown in Figure 2 (a), where each row of the matrix M represents the vectorized 2-D image at each time frame. Since each image consists of the stationary background ($L_{i,:}$) and the moving objects in the foreground ($S_{i,:}$) at each time i, we denote M as follows:

$$M = \begin{bmatrix} M_{1,:} \\ \vdots \\ M_{q,:} \end{bmatrix} = \begin{bmatrix} L_{1,:} + S_{1,:} \\ \vdots \\ L_{q,:} + S_{q,:} \end{bmatrix} = L + S \tag{5}$$

where $M_{i,:}$ represents the i-th row of M. If there is no moving object in the foreground and no lighting variation for a given video sequence (i.e., $S_{i,:} = 0$ for all i), then $L_{i,:} = L_{j,:}$ for $i \neq j$ and each row represents the common stationary background. Otherwise (i.e., $S_{i,:} \neq 0$), M consists of the aligned corrupted measurements $M_{i,:}$. Although the measurements are corrupted by moving objects in the foreground, we are able to separate L and S under certain conditions [9]. Similarly, if we arrange multidimensional spatio-temporal data sets into M, we may be able to separate common dynamic features and analyze aberrant behavior.
1. Neural Population Dynamics
Recall equation (2) and consider Figure 2 (b). Suppose we align the spatio-temporal neural activity governed by (2) with discrete events, such as movement onset (i.e., when a monkey triggers a submovement), to obtain $x^i[1], \ldots, x^i[N_T]$, where the superscript i represents the i-th trial and $N_T$ represents the number of time points for the chosen time window. Then, we denote by $M \in \mathbb{R}^{q \times nN_T}$ the matrix whose j-th row is the vectorized response $x^j[1], \ldots, x^j[N_T]$ of the j-th trial (6), where $e_i^\top x^j[t]$ represents the temporal neural activity of the i-th neuron, $e_i \in \mathbb{R}^n$ is a unit vector, and q is the number of trials across all data. Each row of M represents the vectorized spatio-temporal neural response for one trial. Note that we align each spatio-temporal data set $x^j[t]$ with the same temporal condition (submovement onset) as shown in Figure 2 (b), but we do not separate different types of submovement. For example, submovements with different reach directions, or with different ordinal positions in an overlapped series of submovements, are combined in our input matrix M. In analogy with the stationary background in video surveillance, some portion of the variability may reflect common dynamic features (L) corresponding to triggering a submovement, even though the responses of each neuron are corrupted by task-irrelevant neural responses (S) and may vary significantly across many trials.
2. Gene Regulatory Network
Recall equation (3) and consider Figure 2 (c). To handle various perturbation conditions (or different cell lines), we should consider those factors carefully. In (3), although the additional vector field (g{·}) represents a single influence on a specific gene, this single influence can be propagated through the network and integrated over time. For example, when we perturb xj by using an inhibitor, if xj is connected with many other genes directly or indirectly, the expression levels of many other genes can be perturbed over time. On the other hand, if xj is connected with only a few genes, this perturbation affects only a small fraction of gene expression since it cannot be propagated through the entire network.
Similar to equation (6), we can construct M using gene expression time series data with q different perturbations including different cell lines. Here, each row of $M \in \mathbb{R}^{q \times nN_T}$ represents the vectorized time series gene expression (n: the number of genes, NT: the number of time points and q: the number of different perturbation conditions, including the number of different cell lines), and different rows represent the spatio-temporal responses to different perturbations or of different cell lines.
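As a concrete illustration of this arrangement, the following sketch builds M from a three-way array of shape (q, n, NT). The array `X` is random placeholder data, the sizes are those of the SKBR3 example discussed in the Results (16 perturbations, 15 gene expressions, 4 time points), and the row-major vectorization order is an assumption; any fixed ordering would serve as long as every row uses the same one.

```python
import numpy as np

q, n, NT = 16, 15, 4                    # perturbations, genes, time points
rng = np.random.default_rng(0)
X = rng.standard_normal((q, n, NT))     # placeholder data: X[j, i, t]

# Row j of M is the vectorized n x NT response under condition j.
M = X.reshape(q, n * NT)

# With row-major ordering, column (gene * NT + t) always refers to the
# same (gene, time) pair, so columns are comparable across conditions.
gene, t = 3, 2
col = gene * NT + t

# A row can be folded back into its n x NT time series.
X5 = M[5].reshape(n, NT)
```

Keeping the column ordering fixed across rows is what allows a decomposition of M to be read back as a gene-by-time response for each condition.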
Since time series gene expression results from the integration of regulatory signals constrained by the gene regulatory network, the input matrix M may reflect a common dynamic response corresponding to the characteristics of the network structure. Intuitively, in video surveillance, if someone stays motionless in all the frames, the RPCA algorithm assigns him to the low-rank component. Unless he moves, we cannot see the background because he always blocks it. Similarly, in order to extract the common response of the gene regulatory network exactly, we should perturb the entire network arbitrarily and uniformly.
Results
Disentangling the Low-rank and Sparse components
In [9], Candès et al. discuss the identifiability issue. To make the problem meaningful, the low-rank component L must not be sparse. Another identifiability issue arises if the sparse matrix has low rank. In many computer vision applications, practical low-rank and sparse separation gives visually appealing solutions.
However, for neural activity data, only a small subset of the whole ensemble of neurons is active at any moment as shown in Figure 3 (left). Since M is sparse, the low-rank component might be sparse. Also, for the pathway targeted therapies, since gene regulatory network is known to be sparse, a large subset of the whole ensemble of genes might be deactivated at any moment. Moreover, the original distributions of the amplitude of individual neuronal activities or gene expressions are highly skewed. For example, neural activities often form very eccentric clusters shown in Figure 3 (left); some neurons are highly activated (30-40 spikes/sec) but others typically have only a few spikes per second. Similarly, gene expressions form very eccentric clusters since each gene expression shows different scales in practice.
These observations imply that practical low-rank and sparse separation can be ambiguous and might not provide biologically meaningful solutions for either the neural activity analysis or the gene knockout experiment data set. To remedy the identifiability issue, we propose an RPCA-based method in conjunction with Random Projection (RP); RP can de-sparsify the input data set and make a highly eccentric distribution more spherical, so that the singular vectors of the low-rank matrix are reasonably spread out (see Methods section: Random Projection (RP) and Identifiability for details).
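The de-sparsifying effect of RP can be checked numerically. The sketch below is an illustration under stated assumptions: the data `X` are synthetic with about 80% zero entries, the projection `Psi` is a dense Gaussian matrix with m = n so that it is almost surely invertible, and it is applied to the n-dimensional state at every trial and time point.

```python
import numpy as np

rng = np.random.default_rng(1)
q, n, NT = 50, 8, 10

# Sparse placeholder data: about 80% of the entries are exactly zero.
X = rng.random((q, n, NT)) * (rng.random((q, n, NT)) < 0.2)

# Dense Gaussian projection with m = n (square, almost surely invertible).
Psi = rng.standard_normal((n, n))
X_proj = Psi @ X          # applies Psi to every trial and every time point

zero_frac_before = np.mean(X == 0)
zero_frac_after = np.mean(X_proj == 0)
print(zero_frac_before, zero_frac_after)

# Because Psi is invertible, the projection can be undone exactly
# (up to floating-point error), so no information is lost when m = n.
X_back = np.linalg.inv(Psi) @ X_proj
```

After projection an entry is zero only when the whole n-dimensional state at that time point was zero, so the projected data are much denser than the input while remaining an invertible transformation of it.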
Numerical Example
To illustrate the issue of identifiability and how RP can alleviate it, we consider a simple example: we generate a sparse low-rank input matrix M (q = 50, n = 2, NT = 10), where the rank of M is 6, as shown in Figure S1(a). Note that in this example we chose the same dimension for the input and its projection (refer to (7) and (8); no dimension reduction). This is done so that Ψ ∈ ℝm×n in equation (7) is invertible (we choose m = n and a nonsingular matrix Ψ), allowing us to compare the outputs of RPCA and RP-RPCA directly, as described below. Here, by using RP, we take advantage of de-sparsifying our input data and reducing the eccentricity of its distribution. In general, choosing m < n makes the projected input much denser because information is compressed by RP.
To evaluate the performance of separation into a low-rank and a sparse component, we add a sparse corruption to the input and then apply the projection, which gives the projected corrupted input. To compare the performance of RP-RPCA with RPCA, we first decompose the projected corrupted input into its low-rank and sparse components. Then, we invert the projection to map both components back to the original coordinates, which defines the RP-RPCA estimates of the low-rank and sparse components.
Figure 4 shows statistics of both RPCA and RP-RPCA (in which RPCA is applied to the matrix ) as a function of the tuning parameter λ in equation (4). In this example, . Since our input is still sparse in this example, the rank of both Lrpca, is 15 for . If we choose λ = 0.113 (discounting the penalty for sparse component), the ranks of Lrpca, are approximately 6, which is the same as the rank of the original input . With this choice of λ, for RPCA we find that ‖Srpca‖ is much bigger than the original corruption signal . On the other hand, for RP-RPCA, we have . Therefore, for RP-RPCA, the separation of the low-rank component and sparse component is close to the true solution but for the original RPCA, we have misidentification in both low-rank and sparse components (more detailed information in Figure S2).
Application to Neural Data
Figure 3 (left) shows the actual neural activities aligned with movement onset. The salient vertical striations in the plots show that the ratios between units’ mean firing rates are fairly constant and that temporal patterns exist across all the submovements. Also, as mentioned previously, the neural population activities are sparsely active (white color represents 0 spikes/sec) and show eccentric behavior; for example, some neurons have a much higher spiking rate than others.
Figure 3 (middle) and (right) show the low-rank matrices obtained from RPCA and RP-RPCA, respectively (for simple comparison, we choose m = n). Since M is sparse and has an eccentric distribution, the singular vectors may not be reasonably spread out. Applying RPCA directly to M would result in the low-rank component being composed of only highly modulated neural activity (middle). On the other hand, RP-RPCA can extract the low-rank component from a more distributed set of neural dimensions than RPCA alone can. Also, the result of RP-RPCA gives a more visually appealing solution.
Application to gene knockout experiments
To test the proposed RP-RPCA algorithm, we consider gene knockout experiments using the SKBR3 cell line [4], which has been used in studies of Human Epidermal Growth Factor Receptor2 (HER2) positive breast cancer. We chose this data set because it has various perturbations (16 perturbations) using a single cell line and contains 15 gene expressions with 4 time points, as shown in Figure 5 (top row). The middle row represents the low-rank component and the bottom row represents the highly aberrant sparse component (we use a certain threshold to emphasize highly corrupted components). In the raw data (top row), nearly all treatments show differential responses. However, the low-rank component (middle row) can be categorized into approximately 3-4 subtype responses, and the sparse component (bottom row) shows genomic aberration-specific responses.
Also, the following observations suggest mechanisms of response and resistance and may offer unanticipated biological insight.
(observation 1) mTOR inhibition (the second column in the bottom row) shows aberrant responses in DEPTOR, pHER3, IRS-1 and pAkt(308, 473). In [15], DEPTOR is identified as an mTOR-interacting protein whose expression is negatively regulated by mTORC1 and mTORC2. Also, Peterson et al. found that DEPTOR overexpression suppresses S6K1 but activates Akt by relieving feedback inhibition from mTORC1 to PI3K signaling. Therefore, high DEPTOR expression is necessary to maintain PI3K and Akt activation, which is consistent with the previous result [15].
(observation 2) HER2 inhibition (the sixth column in the bottom row) results in aberrant responses of HER3, pAkt(473) and DEPTOR. Figure S3 represents an abstract model of HER2 overexpressed breast cancer by M. Moasser. Since high DEPTOR expression represents low mTORC1 and mTORC2 [15], relieving this inhibition increases activated HER3 and Akt according to his model. More interestingly, PHLPP is known to dephosphorylate SER473 in Akt (i.e., partially inactivating the kinase), which is captured in the sparse component pAkt(473).
(observation 3) S6K inhibition (the third column in the bottom row) results in aberrant responses of pAkt(473). Since S6K is located downstream of the Akt-TSC2-mTORC pathway, S6K inhibition captures only the activation of pAkt(473).
(observation 4) PI3K inhibition (the 7th-11th columns in the bottom row) leads to increased phosphorylation of MAPK.
In order to cluster the spatio-temporal data set, we separate the common response from the perturbed responses based on the proposed method. Since abnormal behaviors, different responses to external stimuli, or differences between cell lines can be extracted from the information available in the data set, we can cluster the data correctly and reveal biologically meaningful patterns (see Methods section: Cluster Analysis for details). Figure 6 shows the clustered results using existing hierarchical clustering (raw data M, dxy in (9)) and the proposed method ([L S], dϕψ in (10)), respectively. We match the clustered results with the graphical representation generated by M. Moasser, and our clustered result is more consistent with the known network structure and responses.
Application to RPPA (Reverse Phase Protein Arrays) data set
Breast cancers are comprised of distinct subtypes which may respond differently to pathway-targeted therapies; collections of breast cancer cell lines showed differential responses across cell lines and showed subtype-, pathway-, and genomic aberration-specific responses [8]. These observations suggest mechanism of response and resistance. The Gray Lab and Dr. Mills group have a time course analysis on 11 cell lines (all HER2 amplified: 6 PI3K mutant, 5 PI3K wild-type) in response to Lapatinib, Akt inhibitor and combination of the two. They collect protein for Reverse Phase Protein Arrays (RPPA) [16] at 30min, 1h, 2h, 4h, 8h, 24h, 48h and 72h post-treatment.
As shown in Figure 7 (top row), Lapatinib treatment results in down-regulation of a variety of phosphoproteins in the signaling pathway. From the raw data (M) or the low-rank component (L), we can easily observe down-regulation and slow recovery of the levels of activation, but the levels of activation were higher in the PI3K mutation cell lines. Treatment with Akt inhibitor leads to down-regulation of proteins (downstream of Akt) in all HER2 amplified cell lines, although the amplitude of down-regulation is slightly less in cell lines with PI3K mutations. In the PI3K mutation cell lines, treatment with the combination of Lapatinib and Akt inhibitor leads to further down-regulation of the Akt signaling pathway, but Akt levels are intermediate in comparison to those observed with inhibitor alone. Although these observations are interesting in themselves, more interesting details might be found in both the low-rank component L and the sparse component S:
(observation 1) In the PI3K mutation cell lines treated with both inhibitors, full inhibition of pS6RP is observed, and these results show the synergistic effect of Lapatinib and Akt inhibitor (in the bottom row, low-rank component).
(observation 2) The main difference between wild-type and PI3K mutant is the response of pS6RP and p70S6K. For the wild-type cell lines, all treatments result in down-regulated pS6RP and p70S6K. However, for PI3K mutant cells, all treatments result in up-regulation of pS6RP and p70S6K in the short term (in the sparse component, red) and down-regulation in the long term. Suppressing pS6RP relieves feedback inhibition and activates Akt. This difference makes PI3K mutation cells more resistant to HER2 inhibitors than their wild-type counterparts. This finding is not obvious when we look at the raw data. Furthermore, our method makes this finding more convincing because it relies not on visually searching and comparing M but on separating the common response (L) and aberrant behavior (S) by solving (4).
(observation 3) BT474 shows aberrant behavior as shown in Figure S4. This mutation has been reported to confer weak oncogenicity, unlike the other PI3K mutations.
Figure 8 shows the clustered results using existing hierarchical clustering and the proposed method, respectively. Our clustered result is more robust and unaffected by different treatments since our algorithm can separate the common aspects of gene expression from the raw data and identify aberrant responses. On the other hand, the groups obtained with existing hierarchical clustering change across treatments although the characteristics of the cell lines do not change.
Discussion
Clustering and network inference are usually developed independently. For instance, until recently, most studies of gene regulatory network inference focused on a particular data set to identify the underlying graph structure, and then applied the same method to another data set, and so on. Alternatively, clustering methods are used on various data sets to find subgroups or classify them. However, we would argue that there are deep relationships between the two and that they potentially cover each other’s shortcomings, since the spatio-temporal gene expression pattern results from both the network structure and the integration of regulatory signals through the network [17]. Moreover, by using the available information and comparing gene expression levels under the various perturbation conditions, we might reveal the subtype graph structure and understand heterogeneity across various perturbations without any prior information.
In this paper, we demonstrated that the proposed method helps to find distinct subtypes and classify data sets in a robust way. In order to interpret a multi-dimensional spatio-temporal data set, we usually compare the responses over experiments and find differences by looking at the raw data. However, such findings are hard to trust, especially with no prior knowledge. Also, as the dimension of high-throughput data increases, analysis based on visual inspection is not possible in practice. For instance, we might have to consider multiple dimensions such as positive perturbation, negative perturbation, temporal response, various read-outs, mechanisms and various doses. However, the proposed method provides a more convincing way to interpret biological data while at the same time handling multi-dimensional data sets: the low-rank representation provides the large-scale features and the sparse component shows the interesting details with respect to the common dynamic features. The intuition behind this is that one can recover the principal components of a data matrix even though a positive fraction of its entries are arbitrarily corrupted or missing [9].
Also, although there is a wealth of literature describing canonical cell signaling networks, little is known about exactly how these networks operate in different cancer cells. Therefore, a possible extension of the proposed method is, once we extract the common responses, to apply an inference algorithm to identify the unified structure from these common responses. Alternatively, we can focus on individual sparse components to identify heterogeneity of network structure. Advancing our understanding of how these networks are deregulated across cancer cells and different targeted therapies will ultimately lead to improved effectiveness of pathway-targeted therapies.
Conclusion
In this study, we develop a new method for clustering and analyzing multi-dimensional biological data that provides a new perspective on the problem. We illustrate how the proposed method can be useful for extracting common event-related neural features across many experimental trials. Also, we show that the proposed method helps to find distinct subtypes and classify data sets in a robust way by separating common responses and abnormal responses without any prior knowledge. We are currently applying our method to analyze and cluster the RPPA data set of HER2 positive breast cancer and trying to identify the underlying graph structures.
Methods
Random Projection (RP) and Identifiability
Random Projection (RP)
Recent theoretical work has identified random projection as a promising dimensionality reduction technique [19]. Projecting the data onto a random lower-dimensional subspace preserves the similarity of different data vectors; for example, the distances between points are approximately preserved. RP can also reduce the dimension of the data while keeping clusters of data points well separated [19]. Moreover, RP is substantially less expensive to compute than techniques such as Principal Component Analysis (PCA) because RP is data-independent.
The idea of RP is that a small number of random linear projections can preserve key information. Theoretical work [19] [20] [21] [22] guarantees that, with high probability, all pairwise Euclidean and geodesic distances between points on a low-dimensional manifold are well preserved under the mapping $\Psi: \mathbb{R}^n \to \mathbb{R}^m$, $m < n$. Consider a linear signal model where $\Psi = [\psi_1\ \psi_2\ \dots\ \psi_n]$ is an $m \times n$ projection matrix whose elements are independent and identically distributed random variables. First, note that the dimensionality of the data $x$ is reduced since $m < n$. Also, if we define where $\bar{e}_i$ is an $m$-dimensional unit vector and , then or where $\otimes$ represents the Kronecker product and is an identity matrix.
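The distance-preservation property described above can be checked numerically. The following is a minimal sketch, assuming a Gaussian projection matrix with i.i.d. N(0, 1/m) entries (one common choice; the text only requires i.i.d. entries). The function names, the seed and the sizes n, m and the number of points are illustrative assumptions, not values from this paper.

```python
import numpy as np

def random_projection_matrix(m, n, rng):
    """Return an m x n matrix with i.i.d. N(0, 1/m) entries (assumed distribution)."""
    return rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))

def pairwise_distances(X):
    """Euclidean distances between all distinct pairs of columns of X."""
    diffs = X[:, :, None] - X[:, None, :]
    D = np.sqrt((diffs ** 2).sum(axis=0))
    upper = np.triu_indices(X.shape[1], k=1)
    return D[upper]

rng = np.random.default_rng(0)
n, m, num_points = 1000, 300, 20       # illustrative sizes only
X = rng.normal(size=(n, num_points))   # columns are data vectors in R^n
Psi = random_projection_matrix(m, n, rng)
Y = Psi @ X                            # projected data in R^m, m < n

# Ratios close to 1 mean that pairwise distances survive the projection.
ratios = pairwise_distances(Y) / pairwise_distances(X)
print(f"distance ratio: min={ratios.min():.3f}, max={ratios.max():.3f}")
```

The 1/m variance is chosen so that squared distances are preserved in expectation; other i.i.d. distributions with the same scaling would serve the same purpose. Note that Psi is generated without looking at X, which is the data-independence that makes RP cheap.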
In [19], Dasgupta showed that even if the original distribution of the data samples is highly skewed (having an ellipsoidal contour of high eccentricity), its projected counterpart will be more spherical. Since it is conceptually much easier to design algorithms for spherical clusters than for ellipsoidal ones, this feature of random projection can simplify the separation into low-rank and sparse components. We can therefore reduce the computational complexity of the non-smooth convex optimization used in RPCA, in particular the $\ell_1$- and nuclear-norm minimization (see footnote 2).
Identifiability
Suppose our input in equation (6) can be decomposed as where σi are the positive singular values, are the left- and right-singular vectors of L, and dL represents the rank of the matrix L. dS is the number of sparse components in S, and are sparse with only one nonzero entry respectively. By using RP, we have for , where we denote by R. As we mentioned above, our input is sparse, so the singular vectors of the low-rank matrix L might not be reasonably spread out. However, by using RP (multiplying by R), the singular vectors of the resulting matrix become reasonably spread out.
Cluster Analysis
Overview: Dissimilarity
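Common measures of dissimilarity include the Euclidean distance [13],

$$d_E(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2},$$

where $x$ and $y$ are $p$-vectors of measurements on the objects to be clustered. The Manhattan distance,

$$d_M(x, y) = \sum_{i=1}^{p} |x_i - y_i|,$$

is also used, and the “1-correlation” distance is defined as follows:

$$d_{xy} = 1 - \rho_{xy}, \qquad \rho_{xy} = \frac{\sum_{i=1}^{p} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{p} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{p} (y_i - \bar{y})^2}}, \tag{9}$$

where $\bar{x}$ and $\bar{y}$ denote the means of the entries of $x$ and $y$.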
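The 1-correlation distance is bounded in [0, 2]. This dissimilarity is invariant to changes in location or (positive) scale of either $x$ or $y$. It can also be related to the more familiar Euclidean distance: if $\tilde{x} = (x - \bar{x})/s_x$ and $\tilde{y} = (y - \bar{y})/s_y$, where $\bar{x}$ and $s_x = \sqrt{\frac{1}{p}\sum_{i=1}^{p}(x_i - \bar{x})^2}$ are the mean and standard deviation of the entries of $x$ (and likewise for $y$), then $\|\tilde{x} - \tilde{y}\|^2 = 2p\,(1 - \rho_{xy})$. That is, the squared Euclidean distance between standardized objects is proportional to the 1-correlation distance between the original objects. These properties make the 1-correlation distance a popular dissimilarity measure for microarray data and other biological applications: changes in the average measurement level or in the range of measurement from one sample to the next are effectively removed by this dissimilarity.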
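The properties stated above (the [0, 2] bound, invariance to location and positive scale, and the link to the Euclidean distance between standardized vectors) can be verified with a short numerical sketch. The function names and the test vectors are illustrative assumptions. The standardization is assumed to use the standard deviation normalized by p, in which case the squared Euclidean distance equals 2p times the 1-correlation distance.

```python
import numpy as np

def one_minus_correlation(x, y):
    """1-correlation distance: one minus the Pearson correlation of x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    rho = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
    return 1.0 - rho

def standardize(x):
    """Center x and divide by its standard deviation (np.std normalizes by p)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

rng = np.random.default_rng(0)
p = 50
x = rng.normal(size=p)
y = 0.6 * x + rng.normal(size=p)

d_xy = one_minus_correlation(x, y)
# Shifting and positively rescaling either vector leaves the distance unchanged.
d_shifted = one_minus_correlation(3.0 * x + 7.0, 0.5 * y - 2.0)
# Squared Euclidean distance of the standardized vectors equals 2 * p * d_xy.
sq_euclid = np.sum((standardize(x) - standardize(y)) ** 2)

print(d_xy, d_shifted, sq_euclid / (2 * p))
```

All three printed values should agree up to floating-point error, which is the proportionality between the two dissimilarities in concrete form.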
Missing data and corruption
As we mentioned, missing and corrupted data are quite common in microarray data. To deal with missing values, one can either modify the clustering method or impute a “complete” data set before clustering. As an example of corruption, consider the highly correlated signals $x_L = \sin(t) + n_1$ and $y_L = \sin(t) + n_2$, where $t$ is the time step and $n_1$, $n_2$ are Gaussian noise. We add a sparse corruption $x_S$ to the original signal $x_L$, as shown in Figure 9 (a), and calculate the dissimilarity between $x_{corr} = x_L + x_S$ and $y_{corr} = y_L + 0$. Even though $x_S$ is a $d$-sparse corruption, where $d \ll p$ is the number of nonzero components of $x_S$, the correlation is degraded, as shown in Figure 9 (b) (left). Assuming that we know the corruption signals $x_S$ and $y_S$, we can decompose $x_{corr}$ and $y_{corr}$ as $\phi = [x_L; x_S] \in \mathbb{R}^{2p}$ and $\psi = [y_L; y_S] \in \mathbb{R}^{2p}$, respectively. In Figure 9 (b) (middle), the red square represents the corruption signal, where $y_S = 0$. Since the corruption signal changes the mean and the variance, the correlation is still degraded in Figure 9 (b) (middle). We therefore introduce $\gamma$ to allow different weighting factors for $(x_L, y_L)$ and $(x_S, y_S)$. For example, we choose a small $\gamma$ for the corruption signals $(x_S, y_S)$.
Therefore, to cluster corrupted signals, we should first separate the original signal from the corruption signal and then calculate the dissimilarity with an adjusted weighting factor $\gamma$. For a gene expression time series data set, when a gene is knocked out, the system is subjected to a controlled perturbation and the expression levels of many other genes are perturbed. We can reveal extra information by comparing gene expression levels in the perturbed system with those in the original system. Since abnormal behaviors, different responses to external stimuli, and differences between cell lines can be extracted from the original data using only the information available in the data set, we can cluster the data and reveal biologically meaningful patterns.
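The sine-wave example above can be reproduced in a few lines to see how strongly a sparse corruption affects the 1-correlation distance. This is a sketch only: the signal length, the noise level, and the positions and magnitude of the corrupted entries are assumptions chosen for illustration and are not the values behind Figure 9.

```python
import numpy as np

def one_minus_correlation(x, y):
    """1-correlation distance: one minus the Pearson correlation of x and y."""
    xc = x - x.mean()
    yc = y - y.mean()
    return 1.0 - (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

rng = np.random.default_rng(0)
p = 200                                   # assumed signal length
t = np.linspace(0.0, 4.0 * np.pi, p)
x_L = np.sin(t) + 0.05 * rng.normal(size=p)   # assumed noise level 0.05
y_L = np.sin(t) + 0.05 * rng.normal(size=p)

# d-sparse corruption with d = 5 << p (positions and magnitude are assumptions).
x_S = np.zeros(p)
x_S[[20, 65, 110, 150, 185]] = 5.0

x_corr = x_L + x_S
y_corr = y_L                              # y carries no corruption (y_S = 0)

d_clean = one_minus_correlation(x_L, y_L)
d_corr = one_minus_correlation(x_corr, y_corr)
print(f"clean: {d_clean:.3f}, corrupted: {d_corr:.3f}")
```

Only 5 of the 200 entries are changed, yet the distance between the two otherwise nearly identical signals grows markedly, which is why the corruption has to be separated out before clustering.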
Our approach: a new 1-correlation distance
We rewrite the “1-correlation” distance (9) as where $x, y \in \mathbb{R}^p$, and and consider the separation as follows: $\phi = [x_L; x_S]$ and $\psi = [y_L; y_S]$, where $x = x_L + x_S$, $y = y_L + y_S$ and the subscripts $L$ and $S$ denote the low-rank component and the sparse component. We define the “1-correlation” distance for $\phi$, $\psi$ as follows: where and . The relation between $d_{xy}$ ($= 1 - \rho_{xy}$) and $d_{\phi\psi}$ ($= 1 - \rho_{\phi\psi}$) is as follows: where is the $p$-dimensional identity matrix, and .
Therefore, dxy uses the mixture of low-rank component and sparse component but dϕψ calculates the correlation based on the separation. Also, in order to adjust the weighting factor as shown in Figure 9 (b) (right), we simply denote where γ is a weighting factor.
Lemma 1. If the sparse component is zero, dϕψ = dxy.
Proof. Since xS = 0 and yS = 0, we can simply consider ϕ, ψ as and respectively and γ = 1
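The role of the weighting factor γ can be illustrated with a small sketch. Because the exact weighted definition is not recoverable from the text here, the code makes an explicit assumption: the sparse blocks are scaled by γ before the stacked vectors are correlated, that is, it correlates [x_L; γ x_S] with [y_L; γ y_S]. The function names, the signal parameters and the corruption pattern are likewise illustrative assumptions.

```python
import numpy as np

def one_minus_correlation(x, y):
    """1-correlation distance: one minus the Pearson correlation of x and y."""
    xc = x - x.mean()
    yc = y - y.mean()
    return 1.0 - (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

def separated_distance(x_L, x_S, y_L, y_S, gamma=1.0):
    """1-correlation distance between the stacked vectors.

    ASSUMPTION: gamma multiplies the sparse blocks only; this is one plausible
    reading of the weighting described in the text, not a confirmed formula.
    """
    phi = np.concatenate([x_L, gamma * x_S])
    psi = np.concatenate([y_L, gamma * y_S])
    return one_minus_correlation(phi, psi)

rng = np.random.default_rng(0)
p = 200
t = np.linspace(0.0, 4.0 * np.pi, p)
x_L = np.sin(t) + 0.05 * rng.normal(size=p)
y_L = np.sin(t) + 0.05 * rng.normal(size=p)
x_S = np.zeros(p)
x_S[[20, 65, 110, 150, 185]] = 5.0        # assumed sparse corruption
y_S = np.zeros(p)                          # y is uncorrupted

d_unweighted = separated_distance(x_L, x_S, y_L, y_S, gamma=1.0)
d_weighted = separated_distance(x_L, x_S, y_L, y_S, gamma=0.01)
print(f"gamma=1: {d_unweighted:.3f}, gamma=0.01: {d_weighted:.3f}")
```

With γ = 1 the corruption still shifts the mean and variance of the stacked vector, so the distance stays large; with a small γ the distance falls back close to that of the uncorrupted signals, in line with the behavior described for Figure 9 (b) (right).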
To disentangle the two components, we propose a method based on Robust Principal Component Analysis (RPCA) [9], which uses the information available in the data set to identify similar expression patterns (see footnote 3).
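For readers unfamiliar with RPCA, the low-rank plus sparse separation can be sketched with the standard convex formulation, which minimizes the nuclear norm of L plus λ times the l1 norm of S subject to L + S = M. The code below is a generic illustration and not the modified, random-projection-based method of this paper. The solver (an alternating scheme with a fixed penalty parameter), the default values of lam and mu, the function names and the synthetic test matrix are all assumptions.

```python
import numpy as np

def soft_threshold(X, tau):
    """Entrywise shrinkage: proximal operator of tau times the l1 norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def singular_value_threshold(X, tau):
    """Shrink singular values by tau: proximal operator of tau times the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt

def pcp(M, lam=None, mu=None, tol=1e-7, max_iter=1000):
    """Split M into low-rank L and sparse S by alternating minimization.

    Defaults for lam and mu are commonly used heuristics (assumed here).
    """
    M = np.asarray(M, dtype=float)
    n1, n2 = M.shape
    lam = 1.0 / np.sqrt(max(n1, n2)) if lam is None else lam
    mu = n1 * n2 / (4.0 * np.abs(M).sum()) if mu is None else mu
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                  # Lagrange multiplier for L + S = M
    norm_M = np.linalg.norm(M)
    for _ in range(max_iter):
        L = singular_value_threshold(M - S + Y / mu, 1.0 / mu)
        S = soft_threshold(M - L + Y / mu, lam / mu)
        residual = M - L - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * norm_M:
            break
    return L, S

# Synthetic check: rank-2 matrix plus 5% large sparse corruptions (assumed setup).
rng = np.random.default_rng(0)
n, r = 60, 2
L0 = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))
S0 = np.zeros((n, n))
mask = rng.random((n, n)) < 0.05
S0[mask] = 10.0 * rng.choice([-1.0, 1.0], size=int(mask.sum()))

L_hat, S_hat = pcp(L0 + S0)
rel_err = np.linalg.norm(L_hat - L0) / np.linalg.norm(L0)
print(f"relative error of the recovered low-rank part: {rel_err:.2e}")
```

In the setting of this paper, the recovered low-rank part would play the role of the common response and the sparse part the role of the aberrant response; each iteration requires a full SVD, which is the cost that the random projection step is meant to reduce.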
Acknowledgments
This research was supported by the NIH NCI under the ICBP and PS-OC programs (5U54CA112970-08), the NIGMS and by the NSF under grant EFRI 1137267.
Footnotes
* E-mail: tomlin{at}eecs.berkeley.edu
1 Submovement represents a type of motor primitive. For example, the hand speed profile as a function of time resulting from arm movements can be represented by a sum of bell-shaped functions, each of which is called a submovement [14].
2 Many speedup methods have been developed in optimization that avoid large-scale SVD. In [23], Mu et al. demonstrated the power of the projected matrix nuclear norm by reformulating RPCA, and in [24], Zhou et al. demonstrated the effectiveness and efficiency of Bilateral Random Projections. However, both methods consider a dense matrix, whereas in this paper we consider the case where the input matrix is sparse.
3 In [18], Liu et al. proposed an RPCA-based method for discovering differentially expressed genes using static data. They provided an efficient and effective approach for gene identification. In contrast, we focus on spatio-temporal gene expression data sets and consider the disentanglement of low-rank and sparse components, via a modified RPCA, to extract common features and detect specific responses or heterogeneity. Here, we treat spatio-temporal gene expression and focus on the relationship between the gene regulatory network and the dynamics of the regulatory signal. We note this goes beyond the results in [14] due to the transformation involved.