Abstract
Many neurons in the brain, such as place cells in the rodent hippocampus, have localized receptive fields, i.e., they respond to a small neighborhood of stimulus space. What is the functional significance of such representations and how can they arise? Here, we propose that localized receptive fields emerge in similarity-preserving networks of rectifying neurons that learn low-dimensional manifolds populated by sensory inputs. Numerical simulations of such networks on standard datasets yield manifold-tiling localized receptive fields. More generally, we show analytically that, for data lying on symmetric manifolds, optimal solutions of objectives, from which similarity-preserving networks are derived, have localized receptive fields. Therefore, nonnegative similarity-preserving mapping (NSM) implemented by neural networks can model representations of continuous manifolds in the brain.
1 Introduction
A salient and unexplained feature of many neurons is that their receptive fields are localized in the parameter space they represent. For example, a hippocampus place cell is active in a particular spatial location [1], the response of a V1 neuron is localized in visual space and orientation [2], and the response of an auditory neuron is localized in the sound frequency space [3]. In all these examples, receptive fields of neurons from the same brain area tile (with overlap) low-dimensional manifolds.
Localized receptive fields are shaped by neural activity as evidenced by experimental manipulations in developing and adult animals [4, 5, 6, 7]. Activity influences receptive fields via modification, or learning, of synaptic weights which gate the activity of upstream neurons channeling sensory inputs. To be biologically plausible, synaptic learning rules must be physically local, i.e., the weight of a synapse depends on the activity of only the two neurons it connects, pre- and post-synaptic.
In this paper, we demonstrate that biologically plausible neural networks can learn manifold-tiling localized receptive fields from the upstream activity in an unsupervised fashion. Because analyzing the outcome of learning in arbitrary neural networks is often difficult, we take a normative approach, Fig. 1. First, we formulate an optimization problem by postulating an objective function and constraints, Fig. 1. Second, for inputs lying on a manifold, we derive an optimal offline solution and demonstrate analytically and numerically that the receptive fields are localized and tile the manifold, Fig. 1. Third, from the same objective, we derive an online optimization algorithm which can be implemented by a biologically plausible neural network, Fig. 1. We expect this network to learn localized receptive fields, the conjecture we confirm by simulating the network numerically, Fig. 1.
Optimization functions considered here belong to the family of similarity-preserving objectives which dictate that similar inputs to the network elicit similar outputs [8, 9, 10, 11, 12]. In the absence of sign constraints, such objectives are provably optimized by projecting inputs onto the principal subspace [13, 14, 15], which can be done online by networks of linear neurons [8, 9, 10]. Constraining the sign of the output leads to networks of rectifying neurons [11] which have been simulated numerically in the context of clustering and feature learning [11, 12, 16, 17], and analyzed in the context of blind source extraction [18]. In the context of manifold learning, optimal solutions of Nonnegative Similarity-preserving Mapping objectives have been missing because optimization of existing NSM objectives is challenging. Our main contributions are:
Analytical optimization of NSM objectives for input originating from symmetric manifolds.
Derivation of biologically plausible NSM neural networks.
Offline and online algorithms for manifold learning of arbitrary manifolds.
The paper is organized as follows. In Sec. 2, we derive a simplified version of an NSM objective.
Much of our following analysis can be carried over to other NSM objectives but with additional technical considerations. In Sec. 3, we derive a necessary condition for the optimal solution. In Sec. 4, we consider solutions for the case of symmetric manifolds. In Sec. 5, we derive online optimization algorithm and an NSM neural network. In Sec. 6, we present the results of numerical experiments, which can be reproduced with the code at https://github.com/flatironinstitute/mantis.
2 A Simplified Similarity-preserving Objective Function
To introduce similarity-preserving objectives, let us define our notation. The input to the network is a set of vectors, xt ∈ ℝn, t = 1, …, T, with components represented by the activity of n upstream neurons at time, t. In response, the network outputs an activity vector, yt ∈ ℝm, t = 1, …, T, m being the number of output neurons.
Similarity preservation postulates that similar input pairs, xt and xt′, evoke similar output pairs, yt and yt′. If similarity of a pair of vectors is quantified by their scalar product and the distance metric of similarity is Euclidean, we have where we introduced a matrix notation X ≡ [x1, …, xT] ∈ ℝn×T and Y ≡ [y1, …, yT] ∈ ℝm×T and m < n. Such optimization problem is solved offline by projecting the input data to the principal subspace [13, 14, 15]. The same problem can be solved online by a biologically plausible neural network performing global linear dimensionality reduction [8, 10].
We will see below that nonlinear manifolds can be learned by constraining the sign of the output and introducing a similarity threshold, α (here E is a matrix of all ones):
In the special case, α = 0, Eq. (2) reduces to the objective in [11, 19, 18].
Intuitively, Eq. (2) attempts to preserve similarity for similar pairs of input samples but orthogonalizes the outputs corresponding to dissimilar input pairs. Indeed, if the input similarity of a pair of samples t, t′ is above a specified threshold, xt · xt′ > α, output vectors yt and yt′ would prefer to have yt · yt′ ≈ xt · xt′ − α, i.e., it would be similar. If, however, xt · xt′ < α, the lowest value yt · yt′ for yt, yt′ ≥ 0 is zero meaning that that they would tend to be orthogonal, yt · yt′ = 0. As yt and yt′ are nonnegative, to achieve orthogonality, the output activity patterns for dissimilar patterns would have non-overlapping sets of active neurons. In the context of manifold representation, Eq. (2) strives to preserve in the y-representation local geometry of the input data cloud in x-space and let the global geometry emerge out of the nonlinear optimization process.
As the difficulty in analyzing Eq. (2) is due to the quartic in Y term, we go on to derive a simpler quadratic in Y objective function that produces very similar outcomes. To this end, we, first, introduce an additional power constraint: Tr Y⊤ Y ≤ k as in [9, 11]. We will call the input-output mapping obtained by this procedure NSM-0: where we expanded the square and kept only the Y-dependent terms.
We can redefine the variables and drop the last term in a certain limit (see the Supplementary Material, Sec. A.1, for details) leading to the optimization problem we call NSM-1:
Conceptually, this type of objective has proven successful for manifold learning [20]. Intuitively, just like Eq. (2), NSM-1 preserves similarity of nearby input data samples while orthogonalizing output vectors of dissimilar input pairs. Indeed, a pair of samples t, t′ with xt · xt′ > α, would tend to have yt · yt′ as large as possible, albeit with the norm of the vectors controlled by the constraint ||yt||2 ≤ β. Therefore, when the input similarity for the pair is above a specified threshold, the vectors yt and yt′ would prefer to be aligned in the same direction. For dissimilar inputs with xt · xt′ < α, the corresponding output vectors yt and yt′ would tend to be orthogonal, meaning that responses to these dissimilar inputs would activate mostly nonoverlapping sets of neurons.
3 A Necessary Optimality Condition for NSM-1
In this section, we derive the necessary optimality condition for Problem (NSM-1). For notational convenience, we introduce the Gramian D ≡ X⊤ X and use [z]+, where z ∈ ℝT, for the componentwise ReLU function, ([z]+)t ≡ max(zt, 0).
Proposition 1.
The optimal solution of Problem (NSM-1) satisfies where y(a) designates a column vector which is the transpose of the a-th row of Y and Λ = diag(λ1, …, λT) is a nonnegative diagonal matrix.
The proof of Proposition 1 (Supplementary Material, Sec. A.2) proceeds by introducing Lagrange multipliers Λ = diag(λ1, …, λT) ≥ 0 for the constraint diag(Y⊤ Y) ≤ βI, and writing down the KKT conditions. Then, by separately considering the cases λtyat = 0 and λtyat > 0 we get Eq. (3). To gain insight into the nature of the solutions of (3), let us assume λt > 0 for all t and rewrite it as
Eq. (4) suggests that the sign of the interaction within each pair of yt and yt′ depends on the similarity of the corresponding inputs. If xt and xt′ are similar, Dtt′ > α, then yat′ has excitatory influence on yat. Otherwise, if xt and xt′ are farther apart, the influence is inhibitory. Such models often give rise to localized solutions [21]. Since, in our case, the variable yat gives the activity of the a-th neuron as the t-th input vector is presented to the network, such a solution would define a receptive field of neuron, a, localized in the space of inputs. Below, we will derive such localized-receptive field solutions for inputs originating from symmetric manifolds.
4 Solution for Symmetric Manifolds via a Convex Formulation
So far, we set the dimensionality of y, i.e., the number of output neurons, m, a priori. However, as this number depends on the dataset, we would like to allow for flexibility of choosing the output dimensionality adaptively. To this end, we introduce the Gramian, Q ≡ Y⊤ Y, and do not constrain its rank. Minimization of our objective functions requires that the output similarity expressed by Gramian, Q, captures some of the input similarity structure encoded in the input Gramian, D.
Redefining the variables makes the domain of the optimization problem convex. Matrices like D and Q which could be expressed as Gramians are symmetric and positive semidefinite. In addition, any matrix, Q, such that Q ≡ Y⊤ Y with Y ≥ 0 is called completely positive. The set of completely positive T × T matrices is denoted by CPT and forms a closed convex cone [22].
Then, NSM-1, without the rank constraint, can be restated as a convex optimization problem with respect to Q belonging to the convex cone CPT :
Despite the convexity, for arbitrary datasets, optimization problems in CPT are often intractable for large T [22]. Yet, for D with a high degree of symmetry, below, we will find the optimal Q.
Imagine now that there is a group G ⊆ ST, ST being the permutation group of the set {1, 2, …, T}, so that Dg(t)g(t′) = Dtt′ for all g ∈ G. The matrix with elements Mg(t)g(t′) is denoted as gM, representing group action on M. We will represent action of g on a vector w ∈ RT as gw, with (gw)t = wg(t).
Theorem 1.
If the action of the group G is transitive, that is, for any pair t, t′ ∈ {1, 2, …, T} there is a g ∈ G so that t′ = g(t), then there is at least one optimal solution of Problem (NSM-1a) with Q = Y⊤ Y, Y ∈ ℝm×T and Y ≥ 0, such that
for each a, the transpose of the a-th row of Y, termed y(a), satisfies
Let H be the stabilizer subgroup of y(1), namely, H = Stab y(1) ≡ {h ∈ G|hy(1) = y(1)}. Then, m = |G/H| and Y can be written as where gi are members of the m distinct left cosets in G/H.
In other words, when the symmetry group action is transitive, all the Lagrange multipliers are the same. Also the different rows of the Y matrix could be generated from a single row by the action of the group. A sketch of the proof is as follows (see Supplementary Material, Sec. A.3, for further details). For part (i), we argue that a convex minimization problem with a symmetry always has a solution which respects the symmetry. Thus our search could be limited to the G-invariant elements of the convex cone, CPG = {Q ∈ CPT | Q = gQ, ∀g ∈ G}, which happens to be a convex cone itself. We then introduce the Lagrange multipliers and define the Lagrangian for the problem on the invariant convex cone and show that it is enough to search over Λ = λI. Part (ii) follows from optimality of Q = Y⊤ Y implying optimality of .
Eq. (5) is a non-linear eigenvalue equation that can have many solutions. Yet, if those solutions are related to each other by symmetry they can be found explicitly, as we show in the following subsections.
4.1 Solution for Inputs on the Ring with Cosine Similarity in the Continuum Limit
In this subsection, we consider the case where inputs, xt, lie on a one-dimensional manifold shaped as a ring centered on the origin: where t ∈ {1, 2, …, T}. Then, we have and Eq. (5) becomes
In the limit of large T, we can replace a discrete variable, t, by a continuous variable, , and λ → T μ, leading to with C adjusted so that for some measure m in the space of ϕ, which is a continuous variable labeling the output neurons. We will see that ϕ could naturally be chosen as an angle and the constraint becomes .
Eq. (8) has appeared previously in the context of the ring attractor [21]. While our variables have a completely different neuroscience interpretation, we can still use their solution: whose support is the interval [ϕ − ψ, ϕ + ψ].
Eq. (9) gives the receptive fields of a neuron, ϕ, in terms of the azimuthal coordinate, θ, shown in the bottom left panel of Fig. 1. The dependence of μ and ψ on α is given parametrically (see Supplementary Material, Sec. A.4). So far, we have only shown that Eq. (9) satisfies the necessary optimality condition in the continuous limit of Eq. (8). In Sec. 6, we confirm numerically that the optimal solution for a finite number of neurons approximates Eq. (9), Fig. 2.
While we do not have a closed-form solution for NSM-0 on a ring, we show that the optimal solution also has localized receptive fields (see Supplementary Material, Sec. A.5).
4.2 Solution for Inputs on Higher-dimensional Compact Homogeneous Manifolds
Here, we consider two special cases of higher dimensional manifolds. The first example is the 2-sphere, S2 = SO(3)/SO(1). The second example is the rotation group, SO(3), which is a three-dimensional manifold. It is possible to generalize this method to other compact homogeneous spaces for particular kernels.
We can think of a 2-sphere via its 3-dimensional embedding: S2 ≡ {x ∈ |||x|| = 1}. For two points Ω, Ω′ on the 2-sphere let D(Ω, Ω′) = x(Ω) · x(Ω′), where x(Ω), x(Ω′) are the corresponding unit vectors in the 3-dimensional embedding.
Remarkably, we can show that solutions satisfying the optimality conditions are of the form
This means that the center of a receptive field on the sphere is at Ω0. The neuron is active while the angle between x(Ω) and x(Ω0) is less than ψ. For the derivation of Eq. (10) and the self-consistency conditions, determining ψ, μ in terms of α, see Supplementary Material, Sec. A.6.
In the case of the rotation group, for g, g′ ∈ SO(3) we adopt the 3 × 3 matrix representations R(g), R(g′) and consider to be the similarity kernel. Once more, we index a receptive field solution by the rotation group element, g0, where the response is maximum: with ψ, μ being determined by α through self-consistency equations. This solution has support over g ∈ SO(3), such that the rotation has a rotation angle less than ψ.
To summarize this section, we demonstrated, in the continuum limit, that the solutions to NSM objectives for data on symmetric manifolds possess localized receptive fields that tile these manifolds.
What is the nature of solutions as the datasets depart from highly symmetric cases? To address this question, consider data on a smooth compact Riemannian manifold with a smooth metric resulting in a continuous curvature tensor. Then the curvature tensor sets a local length scale over which the effect of curvature is felt. If a symmetry group acts transitively on the manifold, this length scale is constant all over the manifold. Even if such symmetries are absent, on a compact manifold, the curvature tensor components are bounded and there is a length scale, L, below which the manifold locally appears as flat space. Suppose the manifold is sampled well enough with many data points within each geodesic ball of length, L, and the parameters are chosen so that the localized receptive fields are narrower than L. Then, we could construct an asymptotic solution satisfying the optimality condition. Such asymptotic solution in the continuum limit and the effect of uneven sampling along the manifold will be analyzed elsewhere.
5 Online Optimization and Neural Networks
Here, we derive a biologically plausible neural network that optimizes NSM-1. To this end, we transform NSM-1 by, first, rewriting it in the Lagrangian form:
Here, unconventionally, the nonnegative Lagrange multipliers that impose the inequality constraints are factorized into inner products of two nonnegative vectors (zt · zt). Second, we introduce auxiliary variables, W, b, Vt [10]:
The equivalence of (13) to (12) can be seen by performing the W, b, and Vt optimizations explicitly and plugging the optimal values back. (13) suggests a two-step online algorithm (see Appendix A.8 for full derivation). For each input xt, in the first step, one solves for yt, zt and Vt, by projected gradient descent-ascent-descent, where γy,z,V are step sizes. This iteration can be interpreted as the dynamics of a neural circuit (Fig. 1, Top right panel), where components of yt are activities of excitatory neurons, b is a bias term, zt – activities of inhibitory neurons, W is the feedforward connectivity matrix, and Vt is the synaptic weight matrix from excitatory to inhibitory neurons, which undergoes a fast time-scale anti-Hebbian plasticity. In the second step, W and b are updated by gradient descent-ascent: where W is going through a slow time-scale Hebbian plasticity and b through homeostatic plasticity. η is a learning rate. Application of this algorithm to symmetric datasets is shown in Fig. 2 and Fig. 3.
6 Experimental Results
In this section, we verify our theoretical results by solving both offline and online optimization problems numerically. We confirm our theoretical predictions in Sec. 4 for symmetric manifolds and demonstrate that they hold for deviations from symmetry. Moreover, our algorithms yield manifold-tiling localized receptive fields on real-world data.
Synthetic data. Recall that for the input data lying on a ring, optimization without a rank constraint yields truncated cosine solutions, see Eq. (9). Here, we show numerically that fixed-rank optimization yields the same solutions, Fig. 2: the computed matrix Y⊤ Y is indeed circulant, all receptive fields are equivalent to each other, are well approximated by truncated cosine and tile the manifold with overlap. Similarly, for the input lying on a 2-sphere, we find numerically that localized solutions tile the manifold, Fig. 3.
For the offline optimization we used a Burer-Monteiro augmented Lagrangian method [23, 24]. Whereas, conventionally, the number of rows m of Y is chosen to be βT (observe that diag(Y⊤ Y) ≤ βI implies that Tr(Y⊤ Y) ≤ βT, making βT an upper bound of the rank), we use the non-standard setting m βT, as a small m might create degeneracies (i.e., hard-clustering solutions).
Also, we empirically demonstrate that the nature of the solutions is robust to deviations from symmetry in manifold curvature and data point density. See Fig. 4 and its caption for details.
6 Real-world data
For normalized input data with every diagonal element above the threshold α, the term in NSM-1 behaves as described in Sec. 2. For unnormalized inputs, it is preferable to control the sum of each row of Q, i.e. , with an individual αt, instead of the total sum.
Additionally, enforcing is in many cases empirically equivalent to enforcing but makes the optimization easier. We thus obtain the objective function which, for some choice of αt, is equivalent to (here, 1 ∈ ℝT is a column vector of ones)
For highly symmetric datasets without constraints on rank, NSM-2 has the same solutions as NSM-1 (see Supplementary Material, Sec. A.7). Relaxations of this optimization problem have been the subject of extensive research to solve clustering and manifold learning problems [25, 26, 27, 28]. A biologically plausible neural network solving this problem was proposed in [12]. For the optimization of NSM-2 we use an augmented Lagrangian method [23, 24, 28, 29].
We have extensively applied NSM-2 to datasets previously analyzed in the context of manifold learning [28, 30] (see Supplementary Material, Sec. B). Here, we include just two representative examples, figs. 5 and 6, showing the emergence of localized receptive fields in a high-dimensional space. Despite the lack of symmetry and ensuing loss of regularity, we obtain neurons whose receptive fields, taken together, tile the entire data cloud. Such tiling solutions indicate robustness of the method to imperfections in the dataset and further corroborate the theoretical results derived in this paper.
7 Discussion
In this work, we show that objective functions approximately preserving similarity, along with nonnegativity constraint on the outputs, learn data manifolds. Neural networks implementing NSM algorithms use only biologically plausible local (Hebbian or anti-Hebbian) synaptic learning rules.
These results add to the versatility of NSM networks previously shown to cluster data, learn sparse dictionaries and blindly separate sources [11, 18, 16], depending on the nature of input data. This illustrates how a universal neural circuit in the brain can implement various learning tasks [11].
Our algorithms, starting from a linear kernel, D, generate an output kernel, Q, restricted to the sample space. Whereas the associations between kernels and neural networks was known [31], previously proposed networks used random synaptic weights with no learning. In our algorithms, the weights are learned from the input data to optimize the objective. Therefore, our algorithms learn data-dependent kernels adaptively.
In addition to modeling biological neural computation, our algorithms may also serve as general-purpose mechanisms for generating representations of manifolds adaptively. Unlike most existing manifold learning algorithms [32, 33, 34, 35, 36, 37], ours can operate naturally in the online setting. Also, unlike most existing algorithms, ours do not output low-dimensional vectors of embedding variables but rather high-dimensional vectors of assignment indices to centroids tiling the manifold, similar to radial basis function networks [38]. This tiling approach is also essentially different from setting up charts [39, 40], which essentially end up modeling local tangent spaces. The advantage of our high-dimensional representation becomes obvious if the output representation is used not for visualization but for further computation, e.g., linear classification [41].
Acknowledgments
We are grateful to Yanis Bahroun, Johannes Friedrich, Victor Minden, Eftychios Pnevmatikakis, and the other members of the Flatiron Neuroscience group for discussion and comments on an earlier version of this manuscript. We thank Sanjeev Arora, Afonso Bandeira, Moses Charikar, Jeff Cheeger, Surya Ganguli, Dustin Mixon, Marc’Aurelio Ranzato, and Soledad Villar for helpful discussions.
Footnotes
anirvans{at}physics.rutgers.edu, alexander.genkin{at}gmail.com mtepper{at}flatironinstitute.org, cpehlevan{at}flatironinstitute.org, dchklovskii{at}flatironinstitute.org
32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.