Abstract
Sensory data about most natural task-relevant variables are confounded by task-irrelevant sensory variations, called nuisance variables. To be useful, the sensory signals that encode the relevant variables must be untangled from the nuisance variables through nonlinear transformations before the brain can use or decode them to drive behaviors. The information to be untangled is represented in the cortex by the activity of large populations of neurons, constituting a nonlinear population code. Here we provide a new way of thinking about nonlinear population codes and nuisance variables, leading to a theory of nonlinear feedforward decoding of neural population activity. This theory obeys fundamental mathematical limitations on information content that are inherited from the sensory periphery, producing redundant codes when there are many more cortical neurons than primary sensory neurons. The theory predicts a simple, easily computed quantitative relationship between fluctuating neural activity and behavioral choices if the brain uses its nonlinear population codes optimally: more informative patterns should be more correlated with choices.
1 Introduction
How does an animal use, or ‘decode’, the information represented in its brain? When the average responses of some neurons are well-tuned to a stimulus of interest, this is straightforward. In binary discrimination tasks, for example, a choice can be reached simply by a linear weighted sum of these tuned neural responses. Yet real neurons are rarely tuned to precisely one variable: variations along multiple stimulus dimensions influence their responses. As we show below, this can dilute or even abolish the mean tuning to the relevant stimulus. The brain cannot simply use linear computation, nor can we understand neural processing using linear models.
To see this problem in a simple case, imagine a simplified model of a visual neuron consisting of an oriented edge-detecting linear filter followed by additive noise, with a Gabor receptive field like simple cells in primary visual cortex (Figure 1A). If an edge is presented to this model neuron, different orientations change the overlap between the edge and the receptive field, producing different mean responses. This neuron is therefore tuned to orientation.
However, when the edge has the opposite polarity, with black and white reversed, then the linear response is reversed also. If the two polarities occur with equal frequency, then the positive and negative responses cancel on average. The mean response of this linear neuron to any given orientation is therefore precisely constant, so the model neuron is untuned.
Notice that stimuli aligned with the neuron’s preferred orientation will generally elicit the largest response magnitude, with a sign that depends on polarity. Edges that elicit the smallest response to one polarity will also elicit the smallest response to its inverse. Thus, even though the mean response of this linear neuron is zero, independent of orientation, its variance is tuned.
To estimate the variance, and thereby the orientation itself, the brain can compute the square of the linear responses. This would allow the brain to estimate the orientation independently of polarity. This is consistent with the well-known energy model of complex cells in primary visual cortex, which uses squaring nonlinearities to achieve invariance to the polarity of an edge [1]. We will return to this paradigmatic example of simple nonlinear computation throughout this article.
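This example is easy to simulate. The sketch below (filter, image, and noise parameters are illustrative assumptions, not taken from the text) builds an odd-symmetric Gabor filter, presents edges of random polarity, and shows that the mean linear response is near zero at every orientation while the variance, and hence the squared ‘energy’ response, remains tuned:

```python
import numpy as np

rng = np.random.default_rng(0)
SIZE = 16  # images are SIZE x SIZE pixels

def edge_image(theta, polarity):
    """An oriented step edge; polarity is +1 or -1 (black/white swapped)."""
    y, x = np.mgrid[-1:1:SIZE*1j, -1:1:SIZE*1j]
    return polarity * np.sign(x*np.cos(theta) + y*np.sin(theta))

def gabor_rf(theta0):
    """An odd-symmetric Gabor receptive field preferring orientation theta0."""
    y, x = np.mgrid[-1:1:SIZE*1j, -1:1:SIZE*1j]
    u = x*np.cos(theta0) + y*np.sin(theta0)
    v = -x*np.sin(theta0) + y*np.cos(theta0)
    return np.exp(-(u**2 + v**2)/0.5) * np.sin(2*np.pi*u)

def linear_responses(theta, n_trials=2000, noise_sd=1.0):
    """Filter responses to edges of random polarity, plus internal noise."""
    rf = gabor_rf(0.0)                           # neuron prefers orientation 0
    drive = np.sum(rf * edge_image(theta, 1.0))  # deterministic filter output
    pol = rng.choice([-1.0, 1.0], n_trials)      # polarity nuisance per trial
    return pol * drive + rng.normal(0.0, noise_sd, n_trials)

r_pref = linear_responses(0.0)        # edge at the preferred orientation
r_orth = linear_responses(np.pi/2)    # orthogonal edge
```

Averaged over polarities, `r_pref.mean()` and `r_orth.mean()` are both near zero, but `r_pref.var()` greatly exceeds `r_orth.var()`: squaring the responses recovers the orientation tuning, as in the energy model.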
Generalizing from this example, we identify edge polarity as a ‘nuisance variable’ — a property in the world that alters how task-relevant stimuli appear but is, itself, irrelevant for the current task (here, perceiving orientation). Other examples of nuisance variables include the illuminant for guessing surface color, position for object recognition, expression for face identification, or pitch for speech recognition. Nuisance variables generally make it hard to extract the task-relevant variables from sense data, which is the central task of perception [2–5]. (Of course, what is a nuisance for one task might be a target variable in another task, and vice versa.)
The prevailing neuroscience view of this disentangling process is deterministic: the output of a complex (often multi-stage) nonlinear function identifies the variables of interest [2, 3, 6]. Here we take a statistical perspective: the brain learns from its history of sensory inputs which statistics of its many sense data can be used to extract the task-relevant variable. In the orientation estimation task above, the relevant statistic was not the mean but the variance.
Just because a neural population encodes information, it does not mean that the brain decodes it all. Here, encoding specifies how the neural responses relate to the stimulus input; similarly, decoding specifies how the neural responses relate to the behavioral output. To understand the brain’s computational strategy we must understand how encoding and decoding are related, i.e. how the brain uses the information it has. As we will see, our statistical perspective provides a simple way of testing whether the brain’s decoding strategy is efficient, based on whether neural response patterns that are informative about the task-relevant sensory input are also informative about the animal’s behavior in the task.
2 Results
2.1 Task, stimulus, neural responses, action
To specify our mathematical framework for nonlinear decoding, we model a task, a stimulus with both relevant and irrelevant variables, neural responses, and behavioral choices.
In our task, an agent observes a multidimensional stimulus (s, n) and must act upon one particular relevant aspect of that stimulus, s, while ignoring the rest, n. The irrelevant stimulus aspects serve as nuisance variables for the task (the letter n stands for nuisance).
Together, these stimulus properties determine a complete sensory input that drives some responses r in a population of N neurons according to the distribution p(r|s, n).
We consider a feedforward processing chain for the brain, in which the neural responses r are nonlinearly transformed downstream into other neural responses R(r), which in turn are used to create a perceptual estimate of the relevant stimulus ŝ:

(s, n) → r → R(r) → ŝ.
We model the brain’s estimate as a linear function of the downstream responses R. Ultimately these estimates are used to generate an action that the experimenter can observe. Here we assume that the task is local or fine-scale estimation: the subject must directly report its estimate for the relevant stimuli near a reference s0. We measure performance by the variance of this estimate, Var(ŝ|s).
We assume that we have recorded activity only from some of the upstream neurons, so we don’t have direct access to R, only r. Nonetheless we would like to learn something about the downstream computations used in decoding. In this paper we show how to use the statistics of cofluctuations in r and ŝ to estimate the quality of nonlinear decoding.
2.2 Signal and noise
The population response, which we take here to be the spike counts of each neuron in a specified time window, reflects both signal and noise, where signal is the repeatable stimulus-dependent aspect of the response, and noise reflects trial-to-trial variation. Conventionally in neuroscience, the signal is often taken to be the stimulus dependence of the average response, i.e. the tuning curve f(s) = Σ_r r p(r|s) = ⟨r|s⟩ (angle brackets denote an average over all responses given the condition after the vertical bar). Below we will broaden this conventional definition to allow the signal to include any stimulus-dependent statistical property of the population response.
Noise is the non-repeatable part of the response, characterized by the variation of responses to a fixed stimulus. It is convenient to distinguish internal noise from external noise. Internal noise is internal to the animal, and is described by the response distribution p(r|s, n) when everything about the stimulus is fixed. This could also include uncontrolled variation in internal states [7–10], like attention, motivation, or wandering thoughts. External noise is variability generated by the external world, i.e. by nuisance variables, such as the positions of all dots in a random-dot kinematogram [11] or the polarity of an edge (Figure 1). External noise leads to a neural response distribution p(r|s) in which only the relevant variables are held fixed. Both types of noise can lead to uncertainty about the true stimulus.
Trial-to-trial variability can of course be correlated across neurons. Neuroscientists often measure two types of second-order correlations: signal correlations and noise correlations [12–20]. Signal correlations measure shared variation in the responses r across the set of stimuli s: ρ_signal = Corr(r). (Internal) noise correlations measure shared variation that persists even when the stimulus is completely identical, nuisance variables and all: ρ_noise(s, n) = Corr(r|s, n).
For multidimensional stimuli, however, these are only two extremes on a spectrum, depending on how many stimulus aspects are fixed across the trials to be averaged. We propose an intermediate type of correlation: nuisance correlations. Here we fix the task-relevant stimulus variable(s) s and average over the nuisance variables n: ρ_nuisance = Corr(r|s). Just as signal correlations are not correlations between signals, nuisance correlations are not correlations between nuisance variables, but rather correlations between neural responses induced by the external noise or nuisance variation. Of course nuisance correlations will be task-dependent, since the task determines which variables are nuisance and which are relevant [21, 22].
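The three correlation types differ only in what is held fixed across the trials being correlated. A toy sketch with two neurons makes this concrete; the particular tuning (one neuron adds the nuisance, the other subtracts it) and all parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 20000  # trials per condition

def pair_responses(s, n, internal_sd=0.3):
    """Two neurons sharing the relevant signal s; the nuisance n drives
    them in opposite directions (an illustrative choice)."""
    eta = rng.normal(0.0, internal_sd, (2, T))  # internal noise
    return s + n + eta[0], s - n + eta[1]

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

# Signal correlations: both the relevant s and the nuisance n vary
rho_signal = corr(*pair_responses(rng.normal(0, 1, T), rng.normal(0, 0.5, T)))

# Nuisance correlations: fix s, let only the nuisance n vary
rho_nuisance = corr(*pair_responses(0.0, rng.normal(0, 0.5, T)))

# (Internal) noise correlations: fix the stimulus completely, s and n
rho_noise = corr(*pair_responses(0.0, 0.0))
```

Here ρ_signal is positive (shared s dominates), ρ_nuisance is negative (the nuisance drives the pair oppositely), and ρ_noise is near zero (the internal noise is independent): the three measures can disagree even for the same pair of neurons.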
Critically, but confusingly, some so-called ‘noise’ correlations and nuisance correlations actually serve as signals. This happens whenever the statistical pattern of trial-by-trial fluctuations depends on the stimulus, and thus carries information. For example, a stimulus-dependent noise covariance functions as a signal. There would still be true noise, i.e. irrelevant trial-to-trial variability that makes the signal uncertain, but it would be relegated to higher-order fluctuations [23] such as the variance of the response covariance (Figure 2D, Table 1). Stimulus-dependent correlations, principally due to nuisance variation, lead naturally to nonlinear population codes, as we will explain below.
2.3 Nonlinear encoding by neural populations
Most accounts of neural population codes actually address linear codes, in which the mean response is tuned to the variable of interest and completely captures all signal about it [24–28]. We call these codes linear because the neural response property needed to best estimate the stimulus near a reference (or even to infer the entire likelihood of the stimulus, Supplement S1.2) is a linear function of the response. Linear codes for different variables may arise early in sensory processing, or after many stages of computation [2, 5].
If any of the relevant signal can only be extracted using nonlinear functions of the neural responses, then we say that the population code is nonlinear.
It is illuminating to take a statistical view: unlike in a linear code, the information is not encoded in the mean neural responses but instead by higher-order statistics of the responses [16, 29]. These functional and statistical views are naturally linked because estimating higher-order statistics requires nonlinear operations. For instance, information from a stimulus-dependent covariance Q(s) = ⟨rr⊤|s⟩ can be decoded by quadratic operations R = rr⊤ [22, 30, 31]. Table 1 compares the relevant neural response properties for linear and nonlinear codes.
A simple example of a nonlinear code is the exclusive-or (XOR) problem. Given the responses of two binary neurons, r1 and r2, we would like to decode the value of a task-relevant signal s = XOR(r1, r2) (Figure 2A). We don’t care about the specific value of r1 by itself, and in fact r1 alone tells us nothing about s. The same is true for r2. The signal is actually reflected in the trial-by-trial correlation between r1 and r2: when they are the same then s = −1, and when they are opposite then s = +1. The correlation, and thus the relevant variable s, can be estimated nonlinearly from r1 and r2 as ŝ = −r1r2 (taking the binary responses to be ±1).
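A minimal simulation of the XOR code, assuming ±1 coding for both the responses and the signal:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 10000

s = rng.choice([-1, 1], T)        # task-relevant variable
r1 = rng.choice([-1, 1], T)       # one binary neuron: random on its own
r2 = np.where(s == 1, -r1, r1)    # responses disagree iff s = +1

# Each neuron alone carries no information about s, but the quadratic
# decoder -r1*r2 recovers s exactly on every trial:
s_hat = -r1 * r2
```

The single-neuron correlations with s are statistically indistinguishable from zero, yet the product decodes s perfectly: all of the signal lives in the second-order statistic.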
Some experiments have reported internal noise correlations that depend on the stimulus, even for a completely fixed stimulus without any nuisance variation [32–36]. Other experiments have turned up evidence for nonlinear population codes by characterizing the nonlinear selectivity directly [6, 37, 38].
More typically, however, stimulus-dependent correlations arise from external noise, leading to what we call nuisance correlations. In the introduction (Figure 1) we showed a simple orientation estimation example in which fluctuations of an unknown polarity eliminate the orientation tuning of mean responses, relegating the tuning to variances. Figure 2B–E shows a slightly more sophisticated version of this example, where instead of two image polarities, we introduce spatial phase as a continuous nuisance variable. This again eliminates mean tuning, but introduces nuisance covariances that are tuned to orientation.
One might object that although the nuisance covariance is tuned to orientation, a subject cannot compute the covariance on a single trial because it does not experience all possible nuisance variables to average over. This objection stems from a conceptual error that conflates the tuning (signal) with the raw sense data (signal+noise). In linear codes, the subject does not have access to the tuned mean response ⟨r|s⟩, just a noisy single-trial version of the mean, namely r. Analogously, the subject does not need access to the tuned covariance, just a noisy single-trial version of the covariance, rr⊤ (Table 1). In this simple example, the nuisance variable of spatial phase ensures that the quadratic statistics contain the relevant information.
2.4 Decoding and choice correlations
To study how neural information is used or decoded, past studies have examined whether neurons that are sensitive to sensory inputs also reflect an animal’s behavioral outputs or choices [39–47]. However, this choice-related activity is hard to interpret, because it may reflect decoding of the recorded neurons, or merely correlations between them and other neurons that are decoded instead [48].
In principle, we could discount such indirect relationships with complete recordings of all neural activity. This is currently impractical for most animals, and even if we could record from all neurons simultaneously, data limitations would prevent us from fully disambiguating how neural activities directly influence behavior.
To understand key principles of neural computation, however, we may not care about all detailed patterns of decoding weights and their underlying synaptic connectivity. Instead we may want to know only certain properties of the brain’s strategies. One important property is the efficiency with which the brain decodes available neural information as it generates an animal’s choices.
Conveniently, testable predictions about choice-related activity can reveal the brain’s decoding efficiency, in the case of linear codes [28]. Next we review these predictions, and then generalize them to nonlinear codes.
2.5 Choice correlations predicted for optimal linear decoding
We define the ‘choice correlation’ C_{r_k} as the correlation coefficient between the response r_k of neuron k and the stimulus estimate ŝ (which we view as a continuous ‘choice’), given a fixed stimulus s:

C_{r_k} = Corr(r_k, ŝ | s).
This choice correlation is a conceptually simpler and more convenient measure than the more conventional statistic, ‘choice probability’ [49], but it has almost identical properties (Methods 4.2) [28, 48].
Intuitively, if an animal is decoding its neural information efficiently, then those neurons encoding more information should be more correlated with the choice. Mathematically, one can show that choice correlations indeed have this property when decoding is optimal [28]:

C_{r_k} = √(J_{r_k} / J),   (3)

where J and J_{r_k} are, respectively, the linear Fisher information [23] based on the entire population r or on neuron k’s response r_k (Methods 4.2). This relationship holds for a locally optimal linear estimator, regardless of the structure of noise correlations.
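This prediction is simple to verify numerically. The sketch below assumes a linear Gaussian population with independent noise (an illustrative special case; the tuning slopes and noise levels are arbitrary choices), applies the locally optimal linear readout, and compares the measured choice correlations at a fixed stimulus to √(J_{r_k}/J):

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 12, 100000

fp = rng.uniform(0.5, 2.0, N)      # tuning slopes f'_k at the reference s0
sd = rng.uniform(0.5, 1.5, N)      # independent noise sd (diagonal covariance)

J_k = (fp / sd)**2                 # linear Fisher information per neuron
J = J_k.sum()                      # population Fisher information

# responses at fixed s = s0 (mean subtracted) and the optimal linear readout
r = rng.normal(0.0, sd, (T, N))
s_hat = r @ (fp / sd**2) / J       # fluctuations of the optimal estimate

C_measured = np.array([np.corrcoef(r[:, k], s_hat)[0, 1] for k in range(N)])
C_predicted = np.sqrt(J_k / J)     # the optimality prediction
```

With enough trials, `C_measured` matches `C_predicted` to within sampling error, for any choice of slopes and noise levels: neurons carrying a larger share of the Fisher information covary more strongly with the estimate.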
Another way to test for optimal linear decoding would be to measure whether the animal’s behavioral discriminability matches the discriminability for an ideal observer of the neural population response. Yet this approach is not feasible, as it requires one to measure simultaneous responses of many, or even all, relevant neurons. In contrast, the optimality test (Eq 3) requires measuring only single neuron responses, which is vastly easier. Neural recordings in the vestibular system are consistent with optimal decoding according to this prediction [28].
2.6 Nonlinear choice correlations for optimal decoding
However, when nuisance variables wash out the mean tuning of neuronal responses, we may well find that a single neuron has both zero choice correlation and zero information about the stimulus. The optimality test would thus be inconclusive.
This situation is exactly the same one that gives rise to nonlinear codes. A natural generalization of Equation 3 can reveal the quality of neural computation on nonlinear codes. We simply define a ‘nonlinear choice correlation’ between the stimulus estimate ŝ and nonlinear functions of the neural activity R(r):

C_{R_k} = Corr(R_k(r), ŝ | s)
(Methods 4.2), where R_k(r) is a nonlinear function of the neural responses. If the brain optimally decodes the information encoded in the nonlinear statistics of neural activity, according to the simple nonlinear extension of Eq 4, then the nonlinear choice correlation satisfies

C_{R_k} = √(J_{R_k} / J),   (7)

where J_{R_k} is the linear Fisher information in R_k(r) (Methods 4.2.2).
As an example of this relationship, we return to the orientation example. Here the response covariance Σ(s) = Cov(r|s) depends on the stimulus, but the mean f = ⟨r|s⟩ = ⟨r⟩ does not. In this model, optimally decoded neurons would have no linear correlation with behavioral choice. Instead, the choice should be driven by the products of the neural responses, R(r) = vec(rr⊤), where vec(·) is a vectorization that flattens an array into a one-dimensional list of numbers. Such quadratic computation is what the energy model for complex cells is thought to accomplish for phase-invariant orientation coding [1]. Figure 3 shows linear and nonlinear choice correlations for pairs of neurons, the latter defined as C_{jk} = Corr(r_j r_k, ŝ | s). When decoding is linear, linear choice correlations are strong while nonlinear choice correlations are near zero (Figure 3A,B). When the decoding is quadratic, here mediated by an intermediate layer that multiplies pairs of neural activity, the nonlinear choice correlations are strong while the linear ones are insignificant (Figure 3C,D).
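The quadratic regime can be sketched with a hypothetical population whose response amplitudes flip sign with a polarity nuisance, so the mean is untuned (all parameters below are illustrative assumptions). A linear readout of the quadratic features vec(rr⊤) is fit by regression; at a fixed stimulus the linear choice correlations then vanish while the quadratic ones do not:

```python
import numpy as np

rng = np.random.default_rng(4)
N, T = 4, 40000

b = np.array([1.0, 0.8, 1.2, 0.9])    # baseline amplitudes (assumed values)
c = np.array([0.6, -0.4, 0.5, -0.7])  # stimulus dependence of the amplitudes

def simulate(s):
    """Responses whose sign flips with a polarity nuisance: mean tuning is zero."""
    z = rng.choice([-1.0, 1.0], len(s))        # polarity nuisance per trial
    amp = b[None, :] + np.outer(s, c)          # amplitude f_k(s) = b_k + c_k s
    return z[:, None] * amp + rng.normal(0.0, 0.2, (len(s), N))

def quad_features(r):
    """All pairwise products r_j r_k, flattened (quadratic expansion of r)."""
    return np.einsum('ti,tj->tij', r, r).reshape(len(r), -1)

# Train a linear readout of the quadratic features on varying stimuli
s_train = rng.uniform(-1.0, 1.0, T)
X = np.c_[quad_features(simulate(s_train)), np.ones(T)]
w = np.linalg.lstsq(X, s_train, rcond=None)[0]

# Measure choice correlations at a fixed stimulus s = 0
r = simulate(np.zeros(T))
R = quad_features(r)
s_hat = np.c_[R, np.ones(T)] @ w

C_lin = np.array([np.corrcoef(r[:, k], s_hat)[0, 1] for k in range(N)])
C_quad = np.array([np.corrcoef(R[:, k*(N+1)], s_hat)[0, 1] for k in range(N)])
```

Because the estimate is an even function of r while the response distribution at fixed s is symmetric under r → −r, the linear choice correlations are exactly zero in expectation; the correlations with the squared responses r_k² remain substantial, mirroring Figure 3C,D.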
2.7 Which nonlinearity?
If the brain’s decoder optimally uses all available information, choice correlations will obey the prediction of Eq. 7 even if the specific nonlinearities used by the brain differ from those selected for evaluating choice correlations (Methods 4.2.3). The prediction will hold as long as the brain’s nonlinearity can be expressed as a linear combination of the tested nonlinearities (Methods 4.2.3). Figure 4 shows a situation where information is encoded by quadratic and cubic sufficient statistics of neural responses, while a simulated brain decodes them near-optimally using a generic neural network rather than a set of nonlinearities matched to those sufficient statistics. Despite this mismatch we can successfully identify that the brain is near-optimal by applying Eq 7, even without knowing the simulated brain’s true nonlinear transformations.
2.8 Redundant codes
It might seem unlikely that the brain uses optimal, or even near-optimal, nonlinear decoding. Even if it does, there are an enormous number of high-order statistics for neural responses, so the information content in any one statistic could be tiny compared to the total information in all of them. For example, with N neurons there are on the order of N² quadratic statistics, N³ cubic statistics, and so on. With so many statistics contributing information, the choice correlation for any single one would then be tiny according to the ratio in Eq 7, and would be indistinguishable from zero with reasonable amounts of data. Past theoretical studies have described nonlinear (specifically, quadratic) codes with extensive information that grows proportionally with the number of neurons [16, 30]. This would indeed imply immeasurably small choice correlations for large, optimally decoded populations.
A resolution to these concerns is information-limiting correlations [27]. The past studies that derive extensive nonlinear information treat large cortical populations in isolation from the smaller sensory population that would naturally provide their input [16, 30]. However, when a network inherits information from a much smaller input population, the expanded neural code becomes highly redundant: the brain cannot have more information than it receives. Noise in the input is processed by the same pathway as the signal, and this generates noise correlations that can never be averaged away [27].
Previous work [27] characterized linear information-limiting correlations for fine discrimination tasks by decomposing the noise covariance into

Σ(s) = Σ_0 + ε f′(s) f′(s)⊤,

where ε is the variance of the information-limiting component and Σ_0 is noise that can be averaged away with many neurons.
For nonlinear population codes, it is not just the mean that encodes the signal, f(s) = ⟨r|s⟩, but rather the nonlinear statistics F(s) = ⟨R(r)|s⟩. Likewise, the noise does not comprise only the second-order covariance of r, Cov(r|s), but rather the second-order covariance of the relevant nonlinear statistics, Γ = Cov(R|s) (Section 2.2). Analogous to the linear case, these correlations can be locally decomposed as

Γ = Γ_0 + ε F′F′⊤,

where ε is again the variance of the information-limiting component, and Γ_0 is any other covariance that can be averaged away in large populations. The information-limiting noise bounds the estimator variance to no smaller than ε even with optimal decoding.
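The saturation this decomposition implies can be sketched numerically for the linear case, under the illustrative assumptions of a diagonal Σ_0 and random tuning slopes (none of these parameter choices come from the text):

```python
import numpy as np

rng = np.random.default_rng(5)
eps = 0.01   # variance of the information-limiting component

def linear_fisher(N):
    """Linear Fisher information J = f'^T Sigma^-1 f' for N neurons."""
    fp = rng.uniform(0.5, 1.5, N)                 # tuning slopes f'
    Sigma0 = np.diag(rng.uniform(0.5, 1.5, N))    # noise that averages away
    Sigma = Sigma0 + eps * np.outer(fp, fp)       # add the f' f'^T component
    return fp @ np.linalg.solve(Sigma, fp)

J = {N: linear_fisher(N) for N in (10, 100, 1000)}
```

Without the ε term, J would grow linearly with N; with it, J = J_0/(1 + εJ_0) is always below 1/ε, so the information saturates no matter how many neurons are added, and any decoded statistic becomes redundant with the others.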
Neither additional neurons nor additional decoded statistics can improve performance beyond this bound. As a direct consequence, when there are many fewer sensory inputs than cortical neurons, many distinct statistics R_k(r) will carry redundant information. Under these conditions, many ratios J_{R_k}/J (Eq 7) can be measurably large even for optimal nonlinear decoding (Figure 5).
2.9 Decoding efficiency revealed by choice correlations
Even if decoding is not strictly optimal, Eq. 7 can be satisfied due to information-limiting correlations. Decoders that seem substantially suboptimal because they fail to avoid the largest noise components in Γ_0 can nonetheless be dominated by the bound from information-limiting correlations. This will occur whenever the variability from suboptimally decoding Γ_0 is smaller than ε. Just as we can decompose the nonlinear noise correlations into information-limiting and other parts, we can decompose nonlinear choice correlations into corresponding parts as well, with the result that

C_{R_k} = α √(J_{R_k} / J) + ζ_{R_k},

where ζ_{R_k} depends on the particular type of suboptimal decoding (Supporting Information S7). The slope α between the measured choice correlations and those predicted from optimality is given by the fraction of the estimator variance explained by information-limiting noise, α = ε / Var(ŝ|s). This slope therefore provides an estimate of the efficiency of the brain’s decoding.
Figure 5 shows an example of a decoder that would be highly suboptimal without considering redundancy, but is nonetheless close to optimal when information limits are inherited.
In realistically redundant models that have more cortical neurons than sensory neurons, many decoders could be near-optimal, as we recently discovered in experimental data for a linear population code [28]. However, even in redundant codes there may be substantial inefficiencies, especially for unnatural tasks [50].
3 Discussion
This study introduced a theory of nonlinear population codes, grounded in the natural computational task of separating relevant and irrelevant variables. The theory considers both encoding and decoding — how stimuli drive neurons, and how neurons drive behavioral choices. It showed how correlated fluctuations between neural activity and behavioral choices could reveal properties of the brain’s decoding. Unlike previous theories [16, 30], ours remains consistent with biological constraints due to the large cortical expansion of sensory representations by incorporating redundancy through information-limiting correlations. Crucially, this theory provides a remarkably simple test to determine whether downstream nonlinear computation decodes all that is encoded.
Alternative methods to estimate whether animals use their information efficiently rely upon comparing behavioral performance to performance of an ideal observer that can access the entire population. Even with impressive advances in neurotechnology, this challenge remains out of reach for large populations. In contrast, our proposed method to test for optimal decoding has a vastly lower experimental burden. It requires only that a few cells be recorded simultaneously while an animal performs a fine estimation or discrimination task.
On the other hand, this simple test does not offer a complete description of neural transformations.
It instead tests one important hypothesis about their functional role — that the brain performs optimal decoding. The theory also provides a practical way of estimating decoding efficiency. The brain may not be optimal, but may instead settle for a more modest decoding efficiency. In this case, more work is needed to understand which suboptimalities the brain tolerates for satisfactory performance.
3.1 Which nonlinearities should we test?
If all neural signals are decoded optimally, then all choice correlations for any function of those signals should also be consistent with optimal decoding, since they contain the same information. Yet for the wrong or incomplete nonlinearities that do not disentangle the task-relevant variables from the nuisance variables, the test may be inconclusive, just as it was for linear decoding of a nonlinear code (Figure 4): the chosen nonlinear functions may not extract linearly decodable information, nor have any choice correlation.
The optimal nonlinearities would be those that collectively extract the sufficient statistics about the relevant stimulus, which depend on both the task and the nuisance variables. In complex tasks, like recognizing objects in images with many nuisance variables, most of the relevant information lives in higher-order statistics, and therefore requires more complex nonlinearities to extract. In such high-dimensional cases, our proposed test is unlikely to be useful. This is because our method expresses stimulus estimates as sums of nonlinear functions, and while that is universal in principle [51], it is not a compact way to express the complex nonlinearities of deep networks. Alternatively, with good guidance from trained neural network models, our method could potentially judge whether those nonlinearities provide a good description of neural decoding. This decoding perspective would complement studies that argue for a good match between such networks and neural encoding [6].
The best condition to apply our optimality test is in tasks of modest complexity but still possessing fundamentally nonlinear structure. Some interesting examples where our test could have practical relevance include motion detection using photoreceptors [52], visual search with distractors (XOR-type tasks) [31, 53], sound localization in early auditory processing before the inferior colliculus [54], or context switching in higher-level cortex [55].
Our test for optimal nonlinear decoding really amounts to testing for optimal linear decoding of non-linear functions of recorded neural data. If we had access to some putative downstream neurons that computed these nonlinear functions, we could just test whether the brain linearly decoded those neurons optimally. Yet that would circumvent the most interesting and crucial nonlinear aspects of neural computation. Alternatively, if we could record from neurons at different levels of the processing chain, we could try to characterize that nonlinear recoding between them directly, without reference to a behavioral choice. But this would not easily relate these computations to their functional role. The method proposed here allows us to skip these intermediate steps and directly test the optimality of all accumulated downstream nonlinearities.
3.2 Nonlinear decoding or switched linear decoding?
Could the brain avoid nonlinear decoding just by switching between different linear decoders depending on the current nuisance variable n, so that ŝ = w(n)⊤r? The switching variable itself would have to be inferred from the sensory data, which requires marginalizing over the task variable; this takes us back to the original problem, but with task and nuisance variables reversed. Even so, switched linear decoding would actually be equivalent to nonlinear decoding whenever the nuisance estimate n̂(r) is derived from the neural responses: ŝ = w(n̂(r))⊤r is then a nonlinear function of r. A discrimination task with a changing class boundary [12, 56, 57] is, in principle, a nonlinear task. But if the class boundary changes too slowly, perhaps only across days, then the brain may well re-learn its weights rather than performing some nonlinear decoding of recent activity. A better experimental design for revealing nonlinear computation for task context would randomly change the task, either cued [58] or even uncued [55], on a time scale short enough that recent neural activity affects the class boundary.
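This equivalence can be sketched with a polarity-style nuisance (all tuning vectors and noise levels below are illustrative assumptions): a ‘switched’ decoder flips its weights according to a polarity estimate ẑ(r) computed from the responses themselves, making the overall readout a nonlinear function of r:

```python
import numpy as np

rng = np.random.default_rng(6)
T = 20000

b = np.array([1.0, 0.8, 1.2, 0.9])    # baseline response amplitudes
c = np.array([0.6, -0.4, 0.5, -0.7])  # stimulus dependence of the amplitudes

s = rng.uniform(-1.0, 1.0, T)                     # relevant stimulus
z = rng.choice([-1.0, 1.0], T)                    # nuisance: polarity
r = z[:, None] * (b + np.outer(s, c)) + rng.normal(0.0, 0.2, (T, 4))

# A fixed linear readout is useless: the polarity flips wash out the signal
lin_readout = r @ c

# Switched readout: infer the polarity from r itself, then undo the flip.
# Since the switch z_hat depends on r, the composite decoder is nonlinear in r.
z_hat = np.sign(r @ b)
switched_readout = z_hat * (r @ c)

C_fixed = np.corrcoef(s, lin_readout)[0, 1]
C_switched = np.corrcoef(s, switched_readout)[0, 1]
```

The fixed linear readout is uncorrelated with s, while the switched readout recovers it well; the switching step is exactly the marginalization over the nuisance that a nonlinear decoder performs implicitly.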
3.3 Limitations of the approach
For efficient decoding in a learned task, the optimality test (7) is necessary but not sufficient. If the brain neglects some informative sufficient statistics, and we don’t test those neglected statistics either, then the brain could pass our optimal-decoding test yet still be suboptimal. Only if the test is passed for all statistics will the test be conclusive. For an extreme example, a single neuron might pass the test, but if other neurons don’t, then the brain is not using its information well. On a broader scale, one might find that all individual responses r_k pass the optimality test, while products of responses r_j r_k fail. This would be consistent with linear information being used well while distinct quadratic information is present but unused; on the other hand, this outcome would not be consistent with quadratic statistics that are uninformative but decoded anyway, since that would increase the output variance beyond that expected from the linear information. In future work we will demonstrate how nonlinear choice correlations can be used to identify properties of suboptimal decoders [59].
Our approach is currently limited to feedforward processing, which unquestionably oversimplifies cortical processing. Nonetheless, feedforward models do a fair job of capturing the representational structure of the brain [6].
Feedback could also cause suboptimal networks to exhibit choice correlations that seem to resemble the optimal prediction. If the feedback is noisy and projects into the same direction that encodes the stimulus, such as from a dynamic bias [60], then this could appear as information-limiting correlations, enhancing the match with Eq 7. This situation could be disambiguated by measuring the internal noise source providing the feedback, and of course this would require more simultaneous measurements.
3.4 Comparing choice correlations from internal and external noise
Since many stimulus-dependent response correlations are induced by external nuisance variation, not internal noise, we might not find informative stimulus-dependent noise correlations upon repeated presentations of a fixed stimulus. Those correlations may only be informative about a stimulus in the presence of natural nuisance variation. For example, if a picture of a face is shown repeatedly without changing its pose, then small expression changes can readily be identified by linear operations; if the pose can vary then the stimulus is only reflected in higher-order correlations [5].
In contrast, we should see some nonlinear choice correlations even when nuisance variables are fixed. This is because neural circuitry must combine responses nonlinearly to eliminate natural nuisance variation, and any internal noise passing through those same channels will thereby influence the choice (although they may be smaller and more difficult to detect than the fluctuations caused by the nuisance variation). This influence will manifest as nonlinear choice correlations. In other words, stimulus-dependent noise correlations need not predict a fixed stimulus, but they may predict the choice (Supplementary Information S8).
For optimal decoding, the choice correlations measured using fixed nuisance variables will differ from Eq 7, which should strictly hold only when there is natural nuisance variation. This is implicit in Eq 7, since the relevant quantities are conditioned only on the relevant stimulus s while averaging over the nuisance variations n. However, under some conditions, a related prediction for nonlinear choice correlations holds even without averaging over nuisance variables (Supplementary Information S8).
3.5 Conclusion
Despite the clear importance of computation that is both nonlinear and distributed, and evidence for nonlinear coding in the cortex [31, 33–35], most neuroscience applications of population coding concepts have assumed linear codes and linear readouts [6, 28, 39, 61, 62]. The few that directly address nonlinear population codes either have an impossibly large amount of encoded information [16, 30], or investigate abstract properties unrelated to structured tasks [63]. Some experimental studies have been able to extract additional information from recorded populations using nonlinear decoders [31, 64], but the inferred properties of such decoders rest on the recordings being a representative sample that can be extrapolated to larger populations. Unknown correlations and redundancy prevent that from being a reliable method [23, 65].
Our method to understand nonlinear neural decoding requires neural recordings in a behaving animal. The task must be hard enough that the animal makes some errors, so that there are behavioral fluctuations to explain. Finally, there should be a modest number of nonlinearly entangled nuisance variables. Unfortunately, many neuroscience experiments are designed without explicit use of nuisance variables. Although this simplifies the analysis, the simplification comes at a great cost: the neural circuits are engaged far from their natural operating point, and far from their purpose. There is little hope of understanding neural computation without challenging neural systems with the nonlinear tasks for which they are required.
Our statistical perspective on feedforward nonlinear coding in the presence of nuisance variables provides a useful framework for thinking about neural computation. Furthermore, choice-related activity provides guidance for designing interesting experiments to measure not only how information is encoded in the brain, but how it is decoded to generate behavior. In future work we aim to apply this theory to experimental data to test whether real brains decode neural information optimally in any nonlinear tasks.
4 Methods
4.1 Encoding models
4.1.1 Orientation estimation with varying spatial phase
Figure 1 illustrates how nuisance variation can eliminate a neuron’s mean tuning to relevant stimulus variables, relegating the neural tuning to higher-order statistics like covariances. In this example, the subject estimates the orientation of a Gabor image, G(x|s, n), where x is spatial position in the image, and s and n are the orientation and spatial phase of the image, respectively (Supplemental Material S2). The model visual neurons are linear Gabor filters like idealized simple cells in primary visual cortex, corrupted by additive white Gaussian noise. Their responses are thus distributed as r ∼ P (r|s, n) = N (r|f (s, n), ϵI), where ϵ is the noise variance and the mean f (s, n) = ⟨r|s, n⟩ =Σr r p(r|s, n) is determined by the overlap between the image and the receptive field.
When the spatial phase n is known, the mean neural response contains all the information about orientation s. The brain can decode responses linearly to estimate orientation near a reference s0.
When the spatial phase varies, however, the mean response to a fixed orientation combines responses across different phases: f(s) = ⟨r|s⟩ = Σr r p(r|s) = ∫ dn Σr r p(r|s, n) p(n). Since each spatial phase can be paired with another phase π radians away that inverts the linear response, the phase-averaged mean is f(s) = 0. Thus the brain cannot estimate orientation by decoding these neurons linearly; nonlinear computation is necessary.
The covariance provides one such tuned statistic. We define Covij(r|s, n) as the neural covariance for a fixed input image (noise correlations), and Covij(r|s) as the neural covariance when the nuisance varies (nuisance correlations). According to the law of total covariance,

Covij(r|s) = ⟨Covij(r|s, n)⟩n + Covn(fi(s, n), fj(s, n)) = ϵδij + ⟨δfi(s, n) δfj(s, n)⟩n,

where δfi(s, n) = fi(s, n) − ⟨fi(s, n)⟩n. Supplementary Information S2 shows in detail how Covij(r|s) is tuned to s.
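This mechanism can be illustrated numerically. The sketch below is our own construction, not the paper's simulation: the pixel grid, envelope width, and spatial frequency are arbitrary illustrative choices. Noisy Gabor filters are applied to gratings with random spatial phase; the phase-averaged mean response is untuned to orientation while the response covariance remains clearly orientation-dependent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Gratings and Gabor receptive fields on a small pixel grid.
# Grid size, envelope width, and spatial frequency are illustrative choices.
grid = np.linspace(-3, 3, 12)
X, Y = np.meshgrid(grid, grid)
kappa, sig = 1.5, 1.2

def grating(s, n):                         # image with orientation s, phase n
    return np.cos(kappa * (np.cos(s) * X + np.sin(s) * Y) + n)

def gabor(sj, nj):                         # receptive field of model neuron j
    env = np.exp(-(X ** 2 + Y ** 2) / (2 * sig ** 2))
    return env * np.cos(kappa * (np.cos(sj) * X + np.sin(sj) * Y) + nj)

prefs = [(sj, nj) for sj in np.linspace(0, np.pi, 4, endpoint=False)
         for nj in (0.0, np.pi / 2)]
filters = np.array([gabor(sj, nj).ravel() for sj, nj in prefs])

def responses(s, trials=5000, eps=0.1):
    n = rng.uniform(0, 2 * np.pi, trials)  # nuisance: random spatial phase
    imgs = np.array([grating(s, ni).ravel() for ni in n])
    noise = np.sqrt(eps) * rng.standard_normal((trials, len(prefs)))
    return imgs @ filters.T + noise        # r = filter overlap + noise

r = responses(s=0.2)
print(np.abs(r.mean(axis=0)).max())        # phase-averaged mean tuning: ~0
cov_gap = np.abs(np.cov(r.T) - np.cov(responses(1.2).T)).max()
print(cov_gap)                             # covariance depends strongly on s
```

Linear readouts of r average to zero over the nuisance phase, while the second-order statistics remain tuned, matching the argument above.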
4.1.2 Exponential family distribution and sufficient statistics
We assume the response distribution conditioned on the relevant stimulus (but not on nuisance variables) is approximately a member of the exponential family with nonlinear sufficient statistics,

p(r|s) = exp[H(s)ᵀR(r) − A(s) + b(r)],

where R(r) is a vector of sufficient statistics for the natural parameter H(s), b(r) is the base measure, and A(s) is the log-partition function. The sufficient statistics contain all of the information in the population response, and all other tuned statistics may be derived from them.
Estimation and inference are closely connected in the exponential family. In Supplemental Material S1.2, we show that optimal local estimation can be achieved by linearly decoding the nonlinear sufficient statistics,

ŝ = s0 + wᵀ(R(r) − F(s0)).

The decoding weights w ∝ Γ⁻¹F′ minimize the variance of an unbiased decoder, where F′ = ∂⟨R(r)|s⟩/∂s is the sensitivity of the statistics to changing inputs, and Γ = Cov(R|s) is the stimulus-conditioned response covariance, which generally includes nuisance correlations (Section 2.2).
The variance of this unbiased local estimator of the neural responses is lower-bounded by the inverse Fisher information. For exponential family distributions with nonlinear sufficient statistics R(r), the Fisher information is [23] (Supplemental Material S1.1)

J = F′ᵀΓ⁻¹F′.   (13)
4.1.3 Quadratic encoding
In a quadratic coding model, the distribution of neural responses is described by the exponential family with up to quadratic sufficient statistics, R(r) = {ri, rirj} for i, j ∈ {1, …, N}. A familiar example is the Gaussian distribution with stimulus-dependent covariance Σ(s). To demonstrate the coding properties of a purely nonlinear neural code, here we assume the mean tuning curve f is constant, while the stimulus-conditional covariance Σij(s) depends smoothly on the stimulus. We can quantify the information content of the neural population using Equation 13.
4.1.4 Cubic encoding
In our cubic coding model, the distribution of neural responses is described by the exponential family with up to cubic sufficient statistics, R(r) = {ri, rirj, rirjrk} for i, j, k ∈ {1, …, N }.
We approximate a three-neuron cubic code first using purely cubic components, and we then apply a stimulus-dependent affine transformation to include linear and quadratic statistics. The pure cubic code is used for a vector z with sufficient statistics zizjzk (and a base measure to ensure the distribution is bounded and normalizable).
We approximate this distribution by a mixture of four Gaussians. The mixture is chosen to reproduce the tetrahedral symmetry of the cubic distribution (Supplementary Figure 6), which allows the cubic statistics of responses to be stimulus dependent, leaving stimulus-independent quadratic and linear statistics.
To generate larger multivariate cubic codes for Figure 6, for simplicity we assume the pure cubic terms only couple disjoint triplets of variables, and sample independently from an approximately cubic distribution for each triplet. To convert this purely cubic distribution to a distribution with linear and quadratic information, we shift and scale these cubic samples z in a manner dependent on s, r = f(s) + Σ(s)^{1/2} z, where f(s) and Σ(s) describe the desired signal-dependent mean and covariance (see Supplemental Material S4).
4.2 Nonlinear choice correlations
4.2.1 Estimating choice correlation
The nonlinear choice correlation between the stimulus estimate ŝ and one nonlinear function Rk (the kth element of the vector R) of recorded neural activity r is

Ck = Corr(Rk, ŝ|s) = Cov(Rk, ŝ|s)/(σRk σŝ),   (16)

where σŝ² is the estimator variance and σRk is the standard deviation of the statistic Rk.
To compute this quantity from neural responses to stimuli, we need to condition neural responses and behavioral data on the same signal s, or on the same total input (s, n) if we want to isolate the contribution of purely internal noise rather than nuisance variation (Supplemental Material S8). We combine choice correlations calculated under different stimulus conditions by balanced z-scoring [66].
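The condition-then-combine step can be sketched as follows. This is a simplified variant of the balanced z-scoring idea cited as [66], not that estimator itself: we merely z-score each statistic and each estimate within a stimulus condition before pooling, which removes the across-condition offsets that would otherwise inflate the correlation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two stimulus conditions with different response offsets; within each
# condition the choice covaries with the neuron's fluctuations. Pooling raw
# data confounds the offsets with choice covariation, so z-score first.
def zscore(x):
    return (x - x.mean()) / x.std()

raw_r, raw_e, z_r, z_e = [], [], [], []
for offset in (0.0, 5.0):                     # two stimulus conditions
    noise = rng.standard_normal(5000)
    r_k = offset + noise                      # one recorded statistic R_k
    est = offset + 0.8 * noise + 0.6 * rng.standard_normal(5000)  # estimate
    raw_r.append(r_k); raw_e.append(est)
    z_r.append(zscore(r_k)); z_e.append(zscore(est))

raw = np.corrcoef(np.concatenate(raw_r), np.concatenate(raw_e))[0, 1]
C_k = np.corrcoef(np.concatenate(z_r), np.concatenate(z_e))[0, 1]
print(raw, C_k)   # raw pooling is inflated by offsets; z-scored is ~0.8
```

The z-scored estimate recovers the within-condition correlation (0.8 by construction here), while naive pooling is dominated by the condition offsets.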
4.2.2 Optimality test
Locally optimal linear weights for decoding the statistics R are given by linear regression as w ∝ Γ⁻¹F′. Substituting these weights into (16), the optimal nonlinear choice correlation (Eq 7) becomes

Ck,opt = F′k/(σRk √J) = √(Jk/J),

where Jk = F′k²/Γkk is the linear Fisher information in Rk(r) and J = F′ᵀΓ⁻¹F′ is the Fisher information in the complete set of statistics.
For fine-scale discriminations, optimal choice correlations can be written in many equivalent ways, for example

Ck,opt = √(Jk/J) = d′Rk/d′ŝ,

where d′ is the discriminability. These forms reflect the simple relationships between four quantities often used to represent information: the discriminability d′ is proportional to the square root of the Fisher information [67]; the estimator standard deviation is bounded by the inverse square root of the Fisher information, σŝ ≥ 1/√J; and the discrimination threshold is proportional to the estimator standard deviation. In different experiments (binary discrimination, continuous estimation), it can be most natural to express this relationship in different measured quantities.
In our simulations with binary choices for fine discrimination, we calculate the optimal nonlinear choice correlation using d-prime [68]. The discriminability d′Rk of each statistic is estimated from neural responses generated by stimuli s± = s0 ± Δs/2 near a reference stimulus s0:

d′Rk = (⟨Rk|s+⟩ − ⟨Rk|s−⟩)/σRk,

where σRk is the standard deviation of Rk averaged across the two stimulus conditions.
The discriminability for the decoded neural population is estimated from the unbiased decoder output’s standard deviation, d′ŝ = Δs/σŝ.
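As a consistency check on these relations, the sketch below simulates a linear-Gaussian population read out by a locally optimal unbiased linear decoder, and compares simulated choice correlations with the prediction Ck,opt = F′k/(σRk √J). The population size, tuning slopes, and covariance are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 8
Fp = rng.standard_normal(N)                       # tuning slopes F'
A = rng.standard_normal((N, N))
Sigma = A @ A.T + N * np.eye(N)                   # response covariance
J = Fp @ np.linalg.solve(Sigma, Fp)               # linear Fisher information
w = np.linalg.solve(Sigma, Fp) / J                # locally optimal, unbiased

# Simulate response fluctuations at a fixed stimulus and decode them.
L = np.linalg.cholesky(Sigma)
dr = (L @ rng.standard_normal((N, 100000))).T     # zero-mean fluctuations
est = dr @ w                                      # estimator fluctuations

C_sim = np.array([np.corrcoef(dr[:, k], est)[0, 1] for k in range(N)])
C_opt = Fp / (np.sqrt(np.diag(Sigma)) * np.sqrt(J))   # optimality prediction
print(np.abs(C_sim - C_opt).max())                # small sampling error only
```

The simulated choice correlations match the predicted values up to Monte Carlo error, as expected when the decoder is optimal.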
4.2.3 Nonlinear choice correlation to analyze an unknown nonlinearity
In Figure 4, we generated neural responses given sufficient statistics that are polynomials up to third order, R(r) = {ri, rirj, rirjrk} (Methods 4.1.4). In contrast, our model brain decodes the stimulus using a cascade of linear-nonlinear transformations, with Rectified Linear Units (ReLU(x) = max(0, x)) as the nonlinear activation functions. We used a fully-connected ReLU network with two hidden layers and 30 units per hidden layer.
We trained the network weights and biases with backpropagation to estimate stimuli near a reference s0 based on 20000 training pairs (r, s) generated by the cubic encoding model. This trained neural network extracted 91% of the information available to an optimal decoder.
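A miniature version of this decoding setup can be sketched in plain numpy. This is not the paper's model: for brevity the encoder here is a toy variance-coded population (the stimulus enters only through the response variance) rather than the cubic code, and the training set and architecture are scaled down. It only illustrates training a small two-hidden-layer ReLU regressor by backpropagation on such data.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(T):
    # Toy nonlinear code: stimulus modulates response variance, not the mean.
    s = rng.uniform(-0.5, 0.5, T)
    z = rng.standard_normal((T, 3))
    return z * np.sqrt(1.0 + s)[:, None], s

relu = lambda x: np.maximum(0.0, x)
H = 30                                         # hidden units per layer
W1 = rng.standard_normal((3, H)) * 0.3; b1 = np.zeros(H)
W2 = rng.standard_normal((H, H)) * 0.3; b2 = np.zeros(H)
v = np.zeros(H); c = 0.0                       # linear readout

r, s = make_data(5000)
lr, mse_hist = 0.05, []
for _ in range(200):                           # full-batch gradient descent
    h1 = relu(r @ W1 + b1)
    h2 = relu(h1 @ W2 + b2)
    err = h2 @ v + c - s
    mse_hist.append(np.mean(err ** 2))
    d = 2.0 * err / len(s)                     # dLoss/d(output)
    gv, gc = h2.T @ d, d.sum()
    dh2 = np.outer(d, v) * (h2 > 0)            # backprop through ReLU
    gW2, gb2 = h1.T @ dh2, dh2.sum(0)
    dh1 = (dh2 @ W2.T) * (h1 > 0)
    gW1, gb1 = r.T @ dh1, dh1.sum(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2
    v -= lr * gv; c -= lr * gc

print(mse_hist[0], mse_hist[-1])               # squared error decreases
```

Because the stimulus is invisible to any linear readout of r, the network must exploit its ReLU nonlinearities to reduce the regression error below the prior variance.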
4.3 Information-limiting correlations
Only specific correlated fluctuations limit the information content of large neural populations [27]. These fluctuations can ultimately be referred back to the stimulus as r ∼ p(r|s + ds), where ds is zero-mean noise whose variance 1/J∞ determines the asymptotic variance of any stimulus estimator. These information-limiting correlations for nonlinear computation can be characterized by the covariance of the sufficient statistics, Γ = Cov(R|s), conditioned on s; the information-limiting component arises specifically from the signal covariance Cov(F(s)|s). Since the signal for local estimation of stimuli near a reference s0 is F′ = dF/ds, the information-limiting component of the covariance is proportional to F′F′ᵀ:

Γ = Γ0 + F′F′ᵀ/J∞.
Here Γ0 is any covariance of R that does not limit information in large populations. Substituting this expression into (13) for the nonlinear Fisher information, we obtain

J = J0/(1 + J0/J∞) = J0J∞/(J0 + J∞),

where J0 = F′ᵀΓ0⁻¹F′ is the nonlinear Fisher information allowed by Γ0. When the population size grows, the extensive information term J0 grows proportionally, so the output information asymptotes to J∞.
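The saturation implied by this expression, 1/J = 1/J0 + 1/J∞, is easy to verify numerically. In the minimal sketch below we take Γ0 = I and unit sensitivities so that J0 = K grows with the number of statistics K; these choices are ours, for illustration only.

```python
import numpy as np

J_inf = 100.0

# With information-limiting correlations, Gamma = Gamma0 + F'F'^T / J_inf,
# the output information J = F'^T Gamma^{-1} F' obeys 1/J = 1/J0 + 1/J_inf.
def info(K):
    Fp = np.ones(K)                               # unit sensitivity: J0 = K
    Gamma = np.eye(K) + np.outer(Fp, Fp) / J_inf  # limited covariance
    return Fp @ np.linalg.solve(Gamma, Fp)

for K in (10, 100, 1000):
    print(K, info(K))    # approaches, but never exceeds, J_inf = 100
```

Information grows almost linearly while K ≪ J∞, then saturates at J∞ no matter how large the population becomes.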
Author contributions
XP conceived the theoretical framework. XP and QY performed the theoretical analyses. QY performed the simulations. QY and XP wrote the manuscript.
Supplemental material
S1 Exponential family distributions
For a stimulus s and a response r, the conditional probability is a member of the exponential family when

p(r|s) = exp[H(s)ᵀR(r) − A(s) + b(r)],

where H(s) are the natural parameters, R(r) are the sufficient statistics, and A(s) and b(r) are the log-normalizer and base measure. The statistics R(r) are called sufficient because they contain all the information needed to estimate the stimulus s.
S1.1 Fisher information
One measure of the information content that a population response contains about a stimulus is the Fisher information J(s) [16, 24–27, 29], given by

J(s) = ⟨(∂s log p(r|s))²⟩ = −⟨∂²s log p(r|s)⟩.
For distributions p(r|s) in the exponential family with sufficient statistics R(r), we can compute these quantities analytically. We denote the mean of the sufficient statistics as F(s) = ⟨R(r)|s⟩. This mean can be obtained by differentiating A(s) by the natural parameters H(s),

F(s) = ∂A/∂H.   (29)
Differentiating with respect to s, Equation 29 gives the first and second derivatives of A(s): A′(s) = H′(s)ᵀF(s) and A″(s) = H″(s)ᵀF(s) + H′(s)ᵀF′(s).
Thus we can compute the two definitions of the Fisher information,

J = −⟨∂²s log p(r|s)⟩ = A″ − H″ᵀF = H′ᵀF′

and

J = ⟨(∂s log p(r|s))²⟩ = H′ᵀΓH′,   (37)

where Γ = Cov[R(r)|s].
Since the two definitions are equivalent, we have

ΓH′ = F′, i.e. H′ = Γ⁻¹F′.   (38)
Substituting Equation 38 into Equation 37, we find the Fisher information for the exponential family [23]:

J = F′ᵀΓ⁻¹F′.
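This identity can be checked on a one-dimensional example with a nonlinear sufficient statistic: r ∼ N(0, s), for which R(r) = r², F(s) = ⟨r²|s⟩ = s, and the Fisher information is known analytically to be 1/(2s²).

```python
import numpy as np

rng = np.random.default_rng(5)

# Exponential-family Fisher information J = F'^T Gamma^{-1} F' versus the
# score-variance definition, for r ~ N(0, s) with statistic R(r) = r^2.
s = 2.0
r = rng.standard_normal(1_000_000) * np.sqrt(s)
R = r ** 2

Fp = 1.0                       # dF/ds, since <r^2|s> = s
Gamma = R.var()                # Cov(R|s); analytically 2 s^2
J_formula = Fp ** 2 / Gamma    # should be close to 1/(2 s^2) = 0.125

score = (r ** 2 - s) / (2 * s ** 2)   # d/ds log p(r|s)
J_score = score.var()                 # Fisher information as score variance
print(J_formula, J_score)
```

Both estimates agree with the analytic value up to Monte Carlo error, illustrating that the linear information in the nonlinear statistic r² captures all the Fisher information of this family.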
S1.2 Estimation in the exponential family
Again assuming responses come from this distribution, we want to compute the maximum likelihood stimulus, ŝ = argmaxs log p(r|s), near a reference stimulus s0, where log p(r|s) = H(s)ᵀR(r) − A(s) + b(r).
A Taylor expansion around the reference yields

log p(r|s) ≈ log p(r|s0) + (H′ᵀR − A′)(s − s0) + ½(H″ᵀR − A″)(s − s0)²,

where all functions and derivatives are evaluated at s0. We find the maximum by differentiating with respect to s and setting the result equal to zero:

(H′ᵀR − A′) + (H″ᵀR − A″)(ŝ − s0) = 0.
The solution is

ŝ = s0 − (H′ᵀR − A′)/(H″ᵀR − A″).
Since r is a random quantity, we can express R as a mean and a deviation away from that mean: R = ⟨R|s0⟩ + δR = F + δR. In this case, H″ᵀR − A″ = H″ᵀF − A″ + H″ᵀδR, where the mean term H″ᵀF − A″ is precisely the negative Fisher information, −J(s0). If the trial-to-trial fluctuations in the uncertainty are small relative to the average uncertainty, then this Fisher term dominates the denominator; likewise, since A′ = H′ᵀF, the numerator is H′ᵀR − A′ = H′ᵀδR. Then we have

ŝ ≈ s0 + H′ᵀδR/J(s0) = s0 + F′ᵀΓ⁻¹δR/(F′ᵀΓ⁻¹F′),

where we used the results from Equations 13 and 38, with Γ = Cov(R|s0) and F = ⟨R|s0⟩. Thus, in this limit, the optimal estimator for s is a linear decoding of the sufficient statistics R(r).
S2 Orientation estimation task with varying spatial phase
In Figure 2B, the subject’s task is to estimate orientation s near a reference s0, based on images G(x|s, n) of Gabor patterns with wave vector k = κ(cos s, sin s). Here the target s is the orientation of the pattern, n is a nuisance variable reflecting the spatial phase, x is the pixel location in the image, and k is a spatial frequency vector with amplitude κ = ∥k∥. We assume the spatial receptive field of simple cell j in primary visual cortex is also described by a Gabor function, where each neuron has a preferred orientation sj, spatial phase nj, and spatial frequency kj. For simplicity we assume that all neurons’ preferred spatial frequencies have the same amplitude κ, matching the input image.
We model the mean neuronal responses by the overlap between the image and their linear receptive field. This overlap determines the tuning curve of each neuron.
This expression can be written in the form

fj(s, n) = aj(s) cos(n + ψj(s)),

using a stimulus-dependent response amplitude aj(s) and phase ψj(s), which are determined by the image and receptive-field parameters.
Equation 52 reveals that the mean response of each neuron traces out a sinusoidal oscillation in n, where the amplitude and phase depend on s and the specific neuron j. The mean tuning for each pair of neurons therefore traces out an ellipse as a function of the nuisance variable, the input’s spatial phase. When we average over the ellipse generated by the nuisance variable n, the mean tuning to s is abolished — but the response covariances (nuisance correlations) remain tuned to s.
Assuming each neuron’s response variability is drawn independently from a standard Gaussian N(0, 1), we can write the response distribution as p(r|s, n) = N(r|f(s, n), I).
If the spatial phase n were fixed and known, the brain could estimate the orientation just from the mean tuning of the neural responses. However, if the spatial phase is unknown and varies between stimulus presentations uniformly from 0 to 2π, the mean tuning f(s) can be expressed as

f(s) = (1/2π) ∫₀^{2π} f(s, n) dn = 0.
This shows that there is no signal in the mean responses.
However, the brain can perform quadratic computations to eliminate the nuisance variable. We can define Covij[r|s, n] as the neural covariance (noise correlations) when everything in the image is fixed, and Covij[r|s] as the neural covariance when the nuisance is unknown and free to vary (nuisance correlations). Then

Covij[r|s] = δij + Dij(s),

where Dij(s) is given by

Dij(s) = ⟨fi(s, n) fj(s, n)⟩n = ½ ai(s) aj(s) cos(ψi(s) − ψj(s)).
Here when we compute Equation 71, we used the trigonometric identity 2 cos(x) cos(y) = cos(x + y) + cos(x − y), together with ∫₀^{2π} cos(2n + ψi + ψj) dn = 0.
This demonstrates that the neural covariance Covij[r|s] depends on the orientation s. While linear computation is useless for estimating orientation since the mean responses are untuned (59), quadratic (or higher-order) nonlinear computations can be used to estimate the orientation.
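The phase-averaged covariance formula can be checked by direct simulation. The amplitudes aj and phases ψj below are arbitrary fixed values standing in for aj(s) and ψj(s) at one stimulus.

```python
import numpy as np

rng = np.random.default_rng(6)

# Check: with f_j(s,n) = a_j cos(n + psi_j) and unit Gaussian noise,
# averaging over uniform phase n gives
#   Cov_ij[r|s] = delta_ij + (1/2) a_i a_j cos(psi_i - psi_j).
a = np.array([1.0, 0.7, 0.4])        # assumed amplitudes a_j(s) at fixed s
psi = np.array([0.0, 1.1, 2.5])      # assumed phases psi_j(s)

T = 400_000
n = rng.uniform(0, 2 * np.pi, T)     # nuisance phase, uniform on [0, 2*pi)
f = a * np.cos(n[:, None] + psi)     # T x 3 mean responses
r = f + rng.standard_normal((T, 3))  # add unit-variance internal noise

C_emp = np.cov(r.T)
C_theory = np.eye(3) + 0.5 * np.outer(a, a) * np.cos(psi[:, None] - psi[None, :])
print(np.abs(C_emp - C_theory).max())   # small sampling error
```

The empirical covariance matches the analytic expression, confirming that the orientation signal survives phase averaging only in second-order statistics.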
S3 Quadratic coding model
In a purely quadratic coding model (no linear infor-mation), the distribution of neural responses is de-scribed by the exponential family with quadratic sufficient statistics, p(r|s) ∼ exp [H(s)TR(r)] where R(r) = (…, rirj, …). A familiar example is a Gaussian distribution with stimulus-dependent covariance: p(r|s) = N (f, Σ(s)).
As a concrete example, we construct a covariance that rotates with the stimulus s. Any covariance matrix must be positive semidefinite. We build Σ(s) by setting the eigenvalues to be positive and s-independent, with eigenvectors that form an orthogonal basis rotating with s:

Σ(s) = V(s) Λ V(s)ᵀ,

where V(s) = exp(As) is a rotation matrix, A = −Aᵀ is a real antisymmetric matrix with purely imaginary eigenvalues, and Λ is a diagonal matrix composed of all positive eigenvalues.
To calculate the Fisher Information (Equation 13), we need to first calculate the derivative of the mean and covariance Γ = Cov[R(r)|s] of the quadratic sufficient statistics.
Because the mean of r does not depend on the stimulus in this example, we can compute F′ij = Σ′ij(s), where Σ′(s) = AΣ(s) − Σ(s)A is the derivative of the covariance of r.
Here Ω is a diagonal matrix of eigenvalues for A, U is an orthogonal matrix of the eigenvectors of A, and X = U TΛU.
The elements of Γ can be expressed as Γij,kn = ⟨rirjrkrn|s⟩ − ⟨rirj|s⟩⟨rkrn|s⟩. We can use Isserlis’ identity for Gaussian fourth moments to compute this quantity; for centered responses it gives

Γij,kn = ΣikΣjn + ΣinΣjk.
Substitution of the response covariance (Equation 72) into Equation 74 allows us to calculate the covariance Γ of the quadratic sufficient statistics, and thereby to estimate the stimulus and Fisher information for this quadratic code.
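The rotating-covariance construction and the derivative used for F′ can be checked in the smallest case, a 2×2 covariance with A the antisymmetric generator of planar rotations (so V(s) = exp(As) is an explicit rotation matrix). We verify Σ′(s) = AΣ(s) − Σ(s)A against finite differences.

```python
import numpy as np

# Sigma(s) = V(s) Lambda V(s)^T with V(s) = exp(A s); for the 2x2 rotation
# generator A, V(s) is the usual rotation matrix, written out explicitly.
A = np.array([[0.0, -1.0], [1.0, 0.0]])   # antisymmetric generator
Lam = np.diag([3.0, 1.0])                 # fixed positive eigenvalues

def V(s):
    return np.array([[np.cos(s), -np.sin(s)], [np.sin(s), np.cos(s)]])

def Sigma(s):
    return V(s) @ Lam @ V(s).T

s0, h = 0.7, 1e-6
dSigma_fd = (Sigma(s0 + h) - Sigma(s0 - h)) / (2 * h)   # finite difference
dSigma_an = A @ Sigma(s0) - Sigma(s0) @ A               # commutator formula
print(np.abs(dSigma_fd - dSigma_an).max())              # numerically tiny
```

The eigenvalues of Σ(s) stay fixed and positive while the eigenvectors rotate, so all the stimulus information lives in the orientation of the covariance.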
S4 Cubic codes
In Figure 6 we assume the brain encodes the stimulus using a cubic code. A simple cubic code in z = (zi, zj, zk) ∈ ℝ³ can be written as

p(z|s) ∝ exp[H(s) zizjzk + b(z)],

where we include a base measure b(z) to ensure the distribution is bounded and normalizable (Figure 6A).
For mathematical convenience, we approximate this code by a mixture of four Gaussians,

p(z|s) ≈ ¼ Σa N(z|µa, Σa),

whose component means µa lie along the four corners va of a tetrahedron, va,i = ±1, to match the tetrahedral symmetry of the pure cubic code (Equation 76, Figure 6). To sample from this distribution, we randomly choose a component a and then sample from the Gaussian N(z|µa, Σa) conditioned on that component.
This distribution has zero mean and identity covariance but a nontrivial skewness tensor, and qualitatively matches the corresponding distribution for the true exponential family distribution with cubic sufficient statistics (Figure 6).
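A sketch of sampling from such a mixture: component means sit at the four sign patterns with product +1 (a tetrahedron's corners), and the scale m and isotropic component noise σ are our own choices with m² + σ² = 1, so that the mixture has zero mean and identity covariance while retaining a nonzero third moment ⟨z1z2z3⟩ = m³.

```python
import numpy as np

rng = np.random.default_rng(7)

# Four tetrahedral corners: sign patterns whose product v1*v2*v3 = +1.
corners = np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]], float)
m, sigma = 0.8, 0.6          # chosen so m**2 + sigma**2 = 1 (unit variance)

T = 200_000
a = rng.integers(0, 4, T)                             # pick a component
z = m * corners[a] + sigma * rng.standard_normal((T, 3))

print(z.mean(0))                               # ~ 0 (first moments vanish)
print(np.cov(z.T))                             # ~ identity (second moments)
print((z[:, 0] * z[:, 1] * z[:, 2]).mean())    # ~ m**3: the cubic statistic
```

Linear and quadratic statistics of these samples carry no signal; only the third-order statistic distinguishes this distribution from an isotropic Gaussian, mimicking a purely cubic code.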
For simplicity, we consider pure cubic codes with non-overlapping cliques of three variables.
To convert this purely cubic distribution into a distribution with linear and quadratic information as well, we simply shift and scale the distribution in a manner dependent on s:

r = f(s) + Σ(s)^{1/2} z.
These affine transformations can be incorporated directly into the means and covariances of each component of the mixture of Gaussians.
Note that the linear and quadratic information terms are independent of the component a.
S5 Using nonlinear choice correlation to analyze unknown nonlinearities
The true nonlinearity that the brain uses to estimate the stimulus is unknown. Thus a crucial question in our decoding analysis is, which nonlinearities to consider? One reasonable set is polynomials in r, i.e. a Taylor series expansion of the neural nonlinearities, Ψ(r) = (ri, rirj, rirjrk, …).
The locally optimal decoder is a weighted sum of the sufficient statistics R(r) (Equation 46): ŝ = s0 + wᵀ(R(r) − F(s0)), with weights w ∝ Γ⁻¹F′.
However, the brain might instead decode a weighted sum of a different nonlinear basis g(r).
As long as the brain’s nonlinear functions span the same function basis as the sufficient statistics, the decoder can still extract all of the information about the stimulus from the neural population. This allows us to use choice correlations between the brain’s estimate and our analysis nonlinearity Ψ(r) to check the optimality condition (Equation 7).
In Figure 4, we assumed that the optimal nonlinear basis R comprises polynomials up to third order, R(r) = (ri, rirj, rirjrk, …). We used the cubic codes described in Methods 4.1.4 to generate neural responses for which R(r) are sufficient statistics for the stimulus. In this simulation, 18 neuronal responses (six cliques of size 3) were generated using cubic codes.
Our model brain decodes the stimulus using a cascade of linear-nonlinear transformations, with Rectified Linear Units (ReLU(x) = max(0, x)) as the nonlinear activation functions. We used a fully-connected ReLU network with two hidden layers and 30 units per hidden layer, ŝ = vᵀh(2), where h(ℓ) = ReLU(W(ℓ)h(ℓ−1) + b(ℓ)) and h(0) = r.
We trained the neural network with 20000 response samples generated from a cubic code driven by stimuli near the reference s0. We optimized the estimation performance for the neural network using backpropagation to find weights {W(𝓁)}, biases {b(𝓁)}, and readout vector v that minimized the mean squared error. Our trained neural network performed near-optimally, extracting 91% of the Fisher information compared to optimal decoding based on the true sufficient statistics.
Feigning ignorance of our simulated brain’s true decoder, we used monomial nonlinearities Ψ(r) in our nonlinear choice correlation test (Equation 7). The simulated choice correlations were calculated from Equation 5, with R(r) = Ψ(r), based on neural responses driven by the reference stimulus s0, with the stimulus estimate taken from the model brain’s network output. The optimal choice correlation was computed using Equation 7. We computed ΔFΨ from neural population responses r± driven by stimuli s± = s0 ± Δs/2: the change in mean was ΔFΨ = ⟨Ψ(r+)⟩ − ⟨Ψ(r−)⟩, and the standard deviation σΨ was averaged across the two conditions; σŝ² is the variance of the trained network’s estimate of the reference stimulus s0. Based on these quantities, Figure 4 shows that we can successfully identify that the brain is near-optimal.
S6 Information-limiting correlations
Information-limiting correlations can ultimately be referred back to the stimulus, appearing as r ∼ p(r|s + ds), where ds is zero-mean noise with variance 1/J∞, which determines the asymptotic uncertainty about the stimulus. Applying the law of total covariance, we can decompose the covariance of the nonlinear statistics R(r) conditioned on the stimulus into two parts:

Γ = Cov(R(r)|s) = ⟨Γ(s + ds)⟩p(ds) + Cov(⟨R|s, ds⟩)p(ds),   (91)

where ⟨·⟩p indicates an expectation value over the distribution p. The first term can be computed as follows:

⟨Γ(s + ds)⟩p(ds) ≈ Γ(s) ≡ Γ0.
Here we denote the covariance of R(r) given s and ds as Γ(s + ds); a Taylor expansion of Γ(s + ds) around s, together with the fact that the mean of ds is zero, leaves Γ0, the covariance of R in the absence of information-limiting correlations, up to corrections of order Var(ds). The second term in Equation 91 can be expressed as

Cov(⟨R|s, ds⟩) ≈ F′F′ᵀ Var(ds) = F′F′ᵀ/J∞.
Here we have written the mean of R(r) given s and ds as F(s + ds); a first-order expansion of F(s + ds) around s, and the fact that the variance of ds is 1/J∞, give this result: a rank-one perturbation of the covariance Γ0, so that Γ = Γ0 + F′F′ᵀ/J∞.
To compute the nonlinear Fisher information, JR(r) = F′ᵀΓ⁻¹F′, we can use the Sherman–Morrison lemma to compute Γ⁻¹:

Γ⁻¹ = Γ0⁻¹ − (Γ0⁻¹F′F′ᵀΓ0⁻¹)/(J∞ + F′ᵀΓ0⁻¹F′).
Substituting these equations into the nonlinear Fisher information (Equation 13) and simplifying, we obtain

JR(r) = J0J∞/(J0 + J∞).
Here J0 = F′ᵀΓ0⁻¹F′ is the nonlinear Fisher information in the absence of information-limiting correlations. When the population size grows, the term J0 grows proportionally [16, 29], so for large populations the output information saturates at J∞.
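The Sherman–Morrison step used in this derivation can be verified numerically on a small random instance (the dimensions and values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(8)

# Sherman-Morrison for a rank-one update:
#   (Gamma0 + u u^T / J_inf)^{-1}
#     = Gamma0^{-1} - Gamma0^{-1} u u^T Gamma0^{-1} / (J_inf + u^T Gamma0^{-1} u)
K = 6
B = rng.standard_normal((K, K))
Gamma0 = B @ B.T + np.eye(K)           # positive-definite base covariance
u = rng.standard_normal(K)             # plays the role of F'
J_inf = 50.0

Gamma = Gamma0 + np.outer(u, u) / J_inf
G0inv = np.linalg.inv(Gamma0)
sm = G0inv - (G0inv @ np.outer(u, u) @ G0inv) / (J_inf + u @ G0inv @ u)
print(np.abs(sm - np.linalg.inv(Gamma)).max())   # round-off level only
```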
S7 Nonlinear choice correlation for suboptimal decoding
A decoder that would be suboptimal for one population code could be near-optimal in the presence of information-limiting noise. In this case, nonlinear choice correlations can be decomposed into a sum of two terms, one from the information-limiting component and the other from the rest of the noise [28]:

Ck = Cov(Rk, ŝ)/(σRk σŝ) = (Γ0w)k/(σRk σŝ) + F′k/(J∞ σRk σŝ).   (103)
For unbiased decoding, wᵀF′ = 1. Some manipulation then expresses these two terms using Γ0k = (Γ0)kk ≈ Γkk, valid for small information-limiting noise variance 1/J∞ ≪ Γ0k (which can nonetheless have a large effect on information despite its small variance), and σ0ŝ, the standard deviation of the estimate produced by the same suboptimal decoder w in the absence of information-limiting correlations, i.e. when the covariance of the sufficient statistics is Γ0. The variance of ŝ can itself be decomposed into two terms as well: σŝ² = σ0ŝ² + 1/J∞, where we again assume unbiased decoding, wᵀF′ = 1. This expression allows us to represent the ratio σ0ŝ/σŝ in terms of the fraction of estimator variance not explained by the information-limiting noise.
Substituting these into Equation 103, we find that the choice correlation for a suboptimal decoder in the presence of information-limiting correlations is a weighted sum of the choice correlations for optimal and suboptimal decoding, with weights determined by α = (1/J∞)/σŝ², the fraction of the estimator variance explained by the information-limiting noise.
Here Ck,sub and Ck,opt are, respectively, the choice correlations for suboptimal decoding without information-limiting noise (so Γ = Γ0), and the choice correlations for optimal decoding.
The slope α between choice correlations and those predicted from optimal decoding is equal to the fraction of estimator variance explained by information-limiting noise. This slope therefore provides an estimate of the efficiency of the brain’s decoding.
S8 Comparing choice correlations from internal or external noise
The response covariance that drives fluctuations in choices could arise from internal or external (nuisance) variability, or both. Choice correlations predicted for optimal decoding differ depending on whether we condition on the nuisance variables or not. In the main text, we described optimal choice correlations under the distribution p(r|s). This includes variations caused by external nuisance variables, which is sensible since this is what the brain’s decoder must handle. However, it is also potentially informative to examine how purely internal variability correlates with choice, as this is often how choice correlations are assessed. In this section, we derive the choice correlations driven by purely internal noise, for a decoder that has learned to remove external nuisance variation as well.
For simplicity we assume that the nonlinear sufficient statistics R(r) are linearly tuned to both the stimulus s and a scalar nuisance variable n,

R(r) = F′s + G′n + η,

where F′ and G′ characterize the sensitivity of R(r) to the stimulus s and nuisance n, and the internal noise source η has zero mean with covariance H. We assume the brain has a prior over the nuisance variation, p(n), with zero mean and variance ξ. The total covariance from internal and external fluctuations is then

Γ = H + ξG′G′ᵀ.
When we measure choice correlations while fixing the nuisance variables in the experiment, we assume the brain retains the decoding strategy that accounts for both internal noise and unknown nuisance variation, rather than the optimal strategy for a fixed and known nuisance. These decoding weights are

w = Γ⁻¹F′/J1,   (109)

where the denominator J1 = F′ᵀΓ⁻¹F′ is the Fisher information about s when there is natural nuisance variation following p(n). For distributions in the exponential family, this information saturates the Cramér–Rao bound on an estimator’s variance, so that Var(ŝ) = 1/J1 [69]. The normalization by J1 ensures the decoding is locally unbiased. These weights are used to estimate the stimulus according to

ŝ = s0 + wᵀ(R(r) − F(s0)).   (110)
Choice correlations in this fixed-nuisance experiment will be denoted by a lowercase c:

ck = Corr(Rk, ŝ|s, n).   (111)
We use the lowercase notation as a reminder that these choice correlations need not follow the optimal pattern, since the decoder here is not matched to purely internal variability.
We can express these choice correlations as

ck = Cov(Rk, ŝ|s, n)/(σRk|n σŝ|n),   (112)

where the variances are computed with the nuisance held fixed.
The covariance between ŝ and R under fixed nuisance is

Cov(R, ŝ|s, n) = Hw = HΓ⁻¹F′/J1.   (113)
For the scalar nuisance variable we assume here, we can use the Sherman–Morrison lemma to decompose the inverse of the total covariance Γ = H + ξG′G′ᵀ into a rank-one perturbation of the inverse internal-noise covariance:

Γ⁻¹ = H⁻¹ − ξH⁻¹G′G′ᵀH⁻¹/(1 + ξG′ᵀH⁻¹G′).
Substituting this inverse covariance into Equation 113, we obtain

Cov(R, ŝ|s, n) = (F′ − ξG′(G′ᵀH⁻¹F′)/(1 + ξG′ᵀH⁻¹G′))/J1.
This last expression can be rewritten using elements of the Fisher information matrix for joint estimation of the signal and nuisance variables, whose inverse bounds the covariance of any joint estimator:

J11 = F′ᵀH⁻¹F′, J12 = F′ᵀH⁻¹G′, J22 = G′ᵀH⁻¹G′.
With these substitutions, we have

Cov(R, ŝ|s, n) = (F′ − ξJ12G′/(1 + ξJ22))/J1.
The denominator of Equation 112 involves the standard deviation of the sufficient statistics under fixed nuisance, σRk|n = √Hkk, and the standard deviation of the brain’s estimate, σŝ|n = (wᵀHw)^{1/2}.
Combining the results from Equations 122, 126, and 123, we can compute Equation 112.
The optimal choice correlation when there is natural nuisance variation (Eq 7) is given by

Ck,opt = √(Jk/J1), with Jk = F′k²/Γkk,   (134)

where Jk is the Fisher information in Rk about s when there is natural nuisance variation, and √Γkk is the standard deviation of the statistic Rk, again when there is natural nuisance variation.
The choice correlations for the same decoder differ under experimental conditions with and without nuisance variation. We find that the nuisance-conditioned choice correlations ck relate to the optimal nuisance-averaged choice correlations Ck,opt according to

ck = βk Ck,opt + γk,   (135)

where the slope βk and offset γk are constants that depend on the internal noise, the nuisance variance ξ, and the signal-nuisance interaction J12.
The slope βk and offset γk of the relationship between these two types of choice correlations (Equation 135) depend on the amount of nuisance variation compared to internal noise and on the suboptimality of the brain’s decoding strategy. When the signal and nuisance can be disentangled, that is, estimated nearly independently using the statistics R(r), then J12 is small and the choice correlations driven purely by internal fluctuations closely match the optimal choice correlations in the presence of nuisance variation (Figure 7A). In contrast, when nuisance variations remain partially confounded with the signal, then J12 is large and the choice correlations for fixed nuisance variables may differ from the optimal pattern seen when allowing nuisance variables to change from trial to trial (Figure 7B).

For the simulations in Figure 7, we set the sufficient statistics to be linear, R(r) = r, for simplicity. Neural responses were generated from a Gaussian distribution with a stimulus-dependent mean and identity covariance H = I: p(r|s, n) = N(F′s + G′n, I). In Figure 7A, F′ and G′ are set to be orthogonal to ensure J12 = F′ᵀH⁻¹G′ = 0; they are chosen from the eigenvectors of a symmetric matrix AᵀA, where the elements of A are drawn from a uniform distribution on [0, 1]. In Figure 7B, each element of F′ and G′ is drawn from a uniform distribution on [0, 1]. We simulate 10000 responses of a population of N = 50 neurons. The stimulus is set to 0 and the nuisance is fixed at 1. The brain’s decoder assumes a Gaussian prior over the nuisance variation with zero mean and variance ξ = 2. The decoding weights follow Equation 109, and the stimulus is estimated using Equation 110. Choice correlations in this fixed-nuisance experiment are computed by Equation 111 (vertical axis in Figure 7), and the predicted optimal choice correlation is computed by Equation 134 (horizontal axis in Figure 7). In this setting, βk ≈ 1 when J12 = 0.
In this context, it is especially noteworthy that a mismatch between choice correlations and the optimal pattern might not indicate that the brain is suboptimal, but instead that the experimental task may not match the natural tasks for which the brain could have been optimized.
Acknowledgements
The authors thank Jeff Beck, Valentin Dragoi, Arun Parajuli, Andreas S. Tolias, Edgar Walker, R. James Cotton and Alex Pouget for helpful conversations. This work was supported by NSF CAREER grant 1552868 and NeuroNex grant 1707400 to XP.
References