Abstract
Sensory data about most natural task-relevant variables are confounded by task-irrelevant sensory variations, called nuisance variables. To be useful, the sensory signals that encode the relevant variables must be untangled from the nuisance variables through nonlinear transformations before the brain can use or decode them to drive behaviors. The information to be untangled is represented in the cortex by the activity of large populations of neurons, constituting a nonlinear population code. Here we provide a new way of thinking about nonlinear population codes and nuisance variables, leading to a theory of nonlinear feedforward decoding of neural population activity. This theory obeys fundamental mathematical limitations on information content that are inherited from the sensory periphery, producing redundant codes when there are many more cortical neurons than primary sensory neurons. The theory predicts a simple, easily computed quantitative relationship between fluctuating neural activity and behavioral choices if the brain uses its nonlinear population codes optimally: more informative patterns should be more correlated with choices.
1 Introduction
How does an animal use, or ‘decode’, the information represented in its brain? When the average responses of some neurons are well-tuned to a stimulus of interest, this is straightforward. In binary discrimination tasks, for example, a choice can be reached simply by a linear weighted sum of these tuned neural responses. Yet real neurons are rarely tuned to precisely one variable: variations along multiple stimulus dimensions influence their responses. As we show below, this can dilute or even abolish the mean tuning to the relevant stimulus. The brain cannot simply use linear computation, nor can we understand neural processing using linear models.
To see this problem in a simple case, imagine a simplified model of a visual neuron consisting of an oriented edge-detecting linear filter followed by additive noise, with a Gabor receptive field like simple cells in primary visual cortex (Figure 1A). If an edge is presented to this model neuron, different orientations change the overlap between the edge and the receptive field, producing different mean responses. This neuron is therefore tuned to orientation.
However, when the edge has the opposite polarity, with black and white reversed, then the linear response is reversed also. If the two polarities occur with equal frequency, then the positive and negative responses cancel on average. The mean response of this linear neuron to any given orientation is therefore precisely constant, so the model neuron is untuned.
Notice that stimuli aligned with the neuron’s preferred orientation will generally elicit the largest response magnitude, with a sign that depends on polarity. Edges that elicit the smallest response to one polarity will also elicit the smallest response to its inverse. Thus, even though the mean response of this linear neuron is zero, independent of orientation, its variance is tuned.
To estimate the variance, and thereby the orientation itself, the brain can compute the square of the linear responses. This would allow the brain to estimate the orientation independently of polarity. This is consistent with the well-known energy model of complex cells in primary visual cortex, which uses squaring nonlinearities to achieve invariance to the polarity of an edge [1]. We will return to this paradigmatic example of simple nonlinear computation throughout this article.
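This example is easy to simulate. The sketch below (filter, image, and noise parameters are illustrative assumptions, not taken from the text) builds an odd-symmetric Gabor filter, presents edges of random polarity, and shows that the mean linear response is near zero at every orientation while the variance, and hence the squared ‘energy’ response, remains tuned:

```python
import numpy as np

rng = np.random.default_rng(0)
SIZE = 16  # images are SIZE x SIZE pixels

def edge_image(theta, polarity):
    """An oriented step edge; polarity is +1 or -1 (black/white swapped)."""
    y, x = np.mgrid[-1:1:SIZE*1j, -1:1:SIZE*1j]
    return polarity * np.sign(x*np.cos(theta) + y*np.sin(theta))

def gabor_rf(theta0):
    """An odd-symmetric Gabor receptive field preferring orientation theta0."""
    y, x = np.mgrid[-1:1:SIZE*1j, -1:1:SIZE*1j]
    u = x*np.cos(theta0) + y*np.sin(theta0)
    v = -x*np.sin(theta0) + y*np.cos(theta0)
    return np.exp(-(u**2 + v**2)/0.5) * np.sin(2*np.pi*u)

def linear_responses(theta, n_trials=2000, noise_sd=1.0):
    """Filter responses to edges of random polarity, plus internal noise."""
    rf = gabor_rf(0.0)                           # neuron prefers orientation 0
    drive = np.sum(rf * edge_image(theta, 1.0))  # deterministic filter output
    pol = rng.choice([-1.0, 1.0], n_trials)      # polarity nuisance per trial
    return pol * drive + rng.normal(0.0, noise_sd, n_trials)

r_pref = linear_responses(0.0)        # edge at the preferred orientation
r_orth = linear_responses(np.pi/2)    # orthogonal edge
```

Averaged over polarities, `r_pref.mean()` and `r_orth.mean()` are both near zero, but `r_pref.var()` greatly exceeds `r_orth.var()`: squaring the responses recovers the orientation tuning, as in the energy model.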
Generalizing from this example, we identify edge polarity as a ‘nuisance variable’ — a property in the world that alters how task-relevant stimuli appear but is, itself, irrelevant for the current task (here, perceiving orientation). Other examples of nuisance variables include the illuminant for guessing surface color, position for object recognition, expression for face identification, or pitch for speech recognition. Nuisance variables generally make it hard to extract the task-relevant variables from sense data, which is the central task of perception [2–5]. (Of course, what is a nuisance for one task might be a target variable in another task, and vice versa.)
The prevailing neuroscience view of this disentangling process is deterministic: the output of a complex (often multi-stage) nonlinear function identifies the variables of interest [2, 3, 6]. Here we take a statistical perspective: the brain learns from its history of sensory inputs which statistics of its many sense data can be used to extract the task-relevant variable. In the orientation estimation task above, the relevant statistic was not the mean but the variance.
Just because a neural population encodes information, it does not mean that the brain decodes it all. Here, encoding specifies how the neural responses relate to the stimulus input; similarly, decoding specifies how the neural responses relate to the behavioral output. To understand the brain’s computational strategy we must understand how encoding and decoding are related, i.e. how the brain uses the information it has. As we will see, our statistical perspective provides a simple way of testing whether the brain’s decoding strategy is efficient, based on whether neural response patterns that are informative about the task-relevant sensory input are also informative about the animal’s behavior in the task.
2 Results
2.1 Task, stimulus, neural responses, action
To specify our mathematical framework for nonlinear decoding, we model a task, a stimulus with both relevant and irrelevant variables, neural responses, and behavioral choices.
In our task, an agent observes a multidimensional stimulus (s, n) and must act upon one particular relevant aspect of that stimulus, s, while ignoring the rest, n. The irrelevant stimulus aspects serve as nuisance variables for the task (the letter n stands for nuisance).
Together, these stimulus properties determine a complete sensory input that drives some responses r in a population of N neurons according to the distribution p(r|s, n).
We consider a feedforward processing chain for the brain, in which the neural responses r are nonlinearly transformed downstream into other neural responses R(r), which in turn are used to create a perceptual estimate of the relevant stimulus ŝ:

(s, n) → r → R(r) → ŝ.
We model the brain’s estimate as a linear function of the downstream responses R. Ultimately these estimates are used to generate an action that the experimenter can observe. Here we assume that the task is local or fine-scale estimation: the subject must directly report its estimate for the relevant stimuli near a reference s0. We measure performance by the variance of this estimate, Var(ŝ|s).
We assume that we have recorded activity only from some of the upstream neurons, so we don’t have direct access to R, only r. Nonetheless we would like to learn something about the downstream computations used in decoding. In this paper we show how to use the statistics of cofluctuations in r and ŝ to estimate the quality of nonlinear decoding.
2.2 Signal and noise
The population response, which we take here to be the spike counts of each neuron in a specified time window, reflects both signal and noise, where signal is the repeatable stimulus-dependent aspect of the response, and noise reflects trial-to-trial variation. Conventionally in neuroscience, the signal is often taken to be the stimulus dependence of the average response, i.e. the tuning curve f(s) = Σ_r r p(r|s) = ⟨r|s⟩ (angle brackets denote an average over all responses given the condition after the vertical bar). Below we will broaden this conventional definition to allow the signal to include any stimulus-dependent statistical property of the population response.
Noise is the non-repeatable part of the response, characterized by the variation of responses to a fixed stimulus. It is convenient to distinguish internal noise from external noise. Internal noise is internal to the animal, and is described by the response distribution p(r|s, n) when everything about the stimulus is fixed. This could also include uncontrolled variation in internal states [7–10], like attention, motivation, or wandering thoughts. External noise is variability generated by the external world, i.e. by nuisance variables, such as the positions of all dots in a random-dot kinematogram [11] or the polarity of an edge (Figure 1). External noise leads to a neural response distribution p(r|s) in which only the relevant variables are held fixed. Both types of noise can lead to uncertainty about the true stimulus.
Trial-to-trial variability can of course be correlated across neurons. Neuroscientists often measure two types of second-order correlations: signal correlations and noise correlations [12–20]. Signal correlations measure shared variation in the responses r across the set of stimuli s: ρ_signal = Corr(r). (Internal) noise correlations measure shared variation that persists even when the stimulus is completely identical, nuisance variables and all: ρ_noise(s, n) = Corr(r|s, n).
For multidimensional stimuli, however, these are only two extremes on a spectrum, depending on how many stimulus aspects are fixed across the trials to be averaged. We propose an intermediate type of correlation: nuisance correlations. Here we fix the task-relevant stimulus variable(s) s and average over the nuisance variables n: ρ_nuisance = Corr(r|s). Just as signal correlations are not correlations between signals, nuisance correlations are not correlations between nuisance variables, but rather correlations between neural responses induced by the external noise or nuisance variation. Of course nuisance correlations will be task-dependent, since the task determines which variables are nuisance and which are relevant [21, 22].
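The three correlation types differ only in what is held fixed across the trials being correlated. A toy sketch with two neurons makes this concrete; the particular tuning (one neuron adds the nuisance, the other subtracts it) and all parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 20000  # trials per condition

def pair_responses(s, n, internal_sd=0.3):
    """Two neurons sharing the relevant signal s; the nuisance n drives
    them in opposite directions (an illustrative choice)."""
    eta = rng.normal(0.0, internal_sd, (2, T))  # internal noise
    return s + n + eta[0], s - n + eta[1]

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

# Signal correlations: both the relevant s and the nuisance n vary
rho_signal = corr(*pair_responses(rng.normal(0, 1, T), rng.normal(0, 0.5, T)))

# Nuisance correlations: fix s, let only the nuisance n vary
rho_nuisance = corr(*pair_responses(0.0, rng.normal(0, 0.5, T)))

# (Internal) noise correlations: fix the stimulus completely, s and n
rho_noise = corr(*pair_responses(0.0, 0.0))
```

Here ρ_signal is positive (shared s dominates), ρ_nuisance is negative (the nuisance drives the pair oppositely), and ρ_noise is near zero (the internal noise is independent): the three measures can disagree even for the same pair of neurons.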
Critically, but confusingly, some so-called ‘noise’ correlations and nuisance correlations actually serve as signals. This happens whenever the statistical pattern of trial-by-trial fluctuations depends on the stimulus, and thus carries information. For example, a stimulus-dependent noise covariance functions as a signal. There would still be true noise, i.e. irrelevant trial-to-trial variability that makes the signal uncertain, but it would be relegated to higher-order fluctuations [23] such as the variance of the response covariance (Figure 2D, Table 1). Stimulus-dependent correlations, principally due to nuisance variation, lead naturally to nonlinear population codes, as we will explain below.
2.3 Nonlinear encoding by neural populations
Most accounts of neural population codes actually address linear codes, in which the mean response is tuned to the variable of interest and completely captures all signal about it [24–28]. We call these codes linear because the neural response property needed to best estimate the stimulus near a reference (or even to infer the entire likelihood of the stimulus, Supplement S1.2) is a linear function of the response. Linear codes for different variables may arise early in sensory processing, or after many stages of computation [2, 5].
If any of the relevant signal can only be extracted using nonlinear functions of the neural responses, then we say that the population code is nonlinear.
It is illuminating to take a statistical view: unlike in a linear code, the information is not encoded in the mean neural responses but instead by higher-order statistics of the responses [16, 29]. These functional and statistical views are naturally linked because estimating higher-order statistics requires nonlinear operations. For instance, information from a stimulus-dependent covariance Q(s) = ⟨rr⊤|s⟩ can be decoded by quadratic operations R = rr⊤ [22, 30, 31]. Table 1 compares the relevant neural response properties for linear and nonlinear codes.
A simple example of a nonlinear code is the exclusive-or (XOR) problem. Given the responses of two binary neurons, r1 and r2, we would like to decode the value of a task-relevant signal s = XOR(r1, r2) (Figure 2A). We don’t care about the specific value of r1 by itself, and in fact r1 alone tells us nothing about s. The same is true for r2. The signal is actually reflected in the trial-by-trial correlation between r1 and r2: when they are the same then s = −1, and when they are opposite then s = +1. The correlation, and thus the relevant variable s, can be estimated nonlinearly from r1 and r2 as ŝ = −r1r2 (taking the binary responses to be ±1).
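A minimal simulation of the XOR code, assuming ±1 coding for both the responses and the signal:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 10000

s = rng.choice([-1, 1], T)        # task-relevant variable
r1 = rng.choice([-1, 1], T)       # one binary neuron: random on its own
r2 = np.where(s == 1, -r1, r1)    # responses disagree iff s = +1

# Each neuron alone carries no information about s, but the quadratic
# decoder -r1*r2 recovers s exactly on every trial:
s_hat = -r1 * r2
```

The single-neuron correlations with s are statistically indistinguishable from zero, yet the product decodes s perfectly: all of the signal lives in the second-order statistic.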
Some experiments have reported internal noise correlations that depend on the stimulus, even for a completely fixed stimulus without any nuisance variation [32–36]. Other experiments have turned up evidence for nonlinear population codes by characterizing the nonlinear selectivity directly [6, 37, 38].
More typically, however, stimulus-dependent correlations arise from external noise, leading to what we call nuisance correlations. In the introduction (Figure 1) we showed a simple orientation estimation example in which fluctuations of an unknown polarity eliminate the orientation tuning of mean responses, relegating the tuning to variances. Figure 2B–E shows a slightly more sophisticated version of this example, where instead of two image polarities, we introduce spatial phase as a continuous nuisance variable. This again eliminates mean tuning, but introduces nuisance covariances that are tuned to orientation.
One might object that although the nuisance covariance is tuned to orientation, a subject cannot compute the covariance on a single trial because it does not experience all possible nuisance variables to average over. This objection stems from a conceptual error that conflates the tuning (signal) with the raw sense data (signal+noise). In linear codes, the subject does not have access to the tuned mean response ⟨r|s⟩, just a noisy single-trial version of the mean, namely r. Analogously, the subject does not need access to the tuned covariance, just a noisy single-trial version of the covariance, rr⊤ (Table 1). In this simple example, the nuisance variable of spatial phase ensures that the quadratic statistics contain the relevant information.
2.4 Decoding and choice correlations
To study how neural information is used or decoded, past studies have examined whether neurons that are sensitive to sensory inputs also reflect an animal’s behavioral outputs or choices [39–47]. However, this choice-related activity is hard to interpret, because it may reflect decoding of the recorded neurons, or merely correlations between them and other neurons that are decoded instead [48].
In principle, we could discount such indirect relationships with complete recordings of all neural activity. This is currently impractical for most animals, and even if we could record from all neurons simultaneously, data limitations would prevent us from fully disambiguating how neural activities directly influence behavior.
To understand key principles of neural computation, however, we may not care about all detailed patterns of decoding weights and their underlying synaptic connectivity. Instead we may want to know only certain properties of the brain’s strategies. One important property is the efficiency with which the brain decodes available neural information as it generates an animal’s choices.
Conveniently, testable predictions about choice-related activity can reveal the brain’s decoding efficiency, in the case of linear codes [28]. Next we review these predictions, and then generalize them to nonlinear codes.
2.5 Choice correlations predicted for optimal linear decoding
We define the ‘choice correlation’ C_{r_k} as the correlation coefficient between the response r_k of neuron k and the stimulus estimate ŝ (which we view as a continuous ‘choice’), given a fixed stimulus s:

C_{r_k} = Corr(r_k, ŝ | s).
This choice correlation is a conceptually simpler and more convenient measure than the more conventional statistic, ‘choice probability’ [49], but it has almost identical properties (Methods 4.2) [28, 48].
Intuitively, if an animal is decoding its neural information efficiently, then those neurons encoding more information should be more correlated with the choice. Mathematically, one can show that choice correlations indeed have this property when decoding is optimal [28]:

C_{r_k} = √(J_{r_k} / J),   (3)

where J and J_{r_k} are, respectively, the linear Fisher information [23] based on the entire population r or on neuron k’s response r_k (Methods 4.2). This relationship holds for a locally optimal linear estimator, regardless of the structure of noise correlations.
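This prediction is simple to verify numerically. The sketch below assumes a linear Gaussian population with independent noise (an illustrative special case; the tuning slopes and noise levels are arbitrary choices), applies the locally optimal linear readout, and compares the measured choice correlations at a fixed stimulus to √(J_{r_k}/J):

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 12, 100000

fp = rng.uniform(0.5, 2.0, N)      # tuning slopes f'_k at the reference s0
sd = rng.uniform(0.5, 1.5, N)      # independent noise sd (diagonal covariance)

J_k = (fp / sd)**2                 # linear Fisher information per neuron
J = J_k.sum()                      # population Fisher information

# responses at fixed s = s0 (mean subtracted) and the optimal linear readout
r = rng.normal(0.0, sd, (T, N))
s_hat = r @ (fp / sd**2) / J       # fluctuations of the optimal estimate

C_measured = np.array([np.corrcoef(r[:, k], s_hat)[0, 1] for k in range(N)])
C_predicted = np.sqrt(J_k / J)     # the optimality prediction
```

With enough trials, `C_measured` matches `C_predicted` to within sampling error, for any choice of slopes and noise levels: neurons carrying a larger share of the Fisher information covary more strongly with the estimate.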
Another way to test for optimal linear decoding would be to measure whether the animal’s behavioral discriminability matches the discriminability for an ideal observer of the neural population response. Yet this approach is not feasible, as it requires one to measure simultaneous responses of many, or even all, relevant neurons. In contrast, the optimality test (Eq 3) requires measuring only single neuron responses, which is vastly easier. Neural recordings in the vestibular system are consistent with optimal decoding according to this prediction [28].
2.6 Nonlinear choice correlations for optimal decoding
However, when nuisance variables wash out the mean tuning of neuronal responses, we may well find that a single neuron has both zero choice correlation and zero information about the stimulus. The optimality test would thus be inconclusive.
This situation is exactly the same one that gives rise to nonlinear codes. A natural generalization of Equation 3 can reveal the quality of neural computation on nonlinear codes. We simply define a ‘nonlinear choice correlation’ between the stimulus estimate ŝ and nonlinear functions of the neural activity R(r):

C_{R_k} = Corr(R_k(r), ŝ | s)
(Methods 4.2), where R_k(r) is a nonlinear function of the neural responses. If the brain optimally decodes the information encoded in the nonlinear statistics of neural activity, according to the simple nonlinear extension of Eq 4, then the nonlinear choice correlation satisfies

C_{R_k} = √(J_{R_k} / J),   (7)

where J_{R_k} is the linear Fisher information in R_k(r) (Methods 4.2.2).
As an example of this relationship, we return to the orientation example. Here the response covariance Σ(s) = Cov(r|s) depends on the stimulus, but the mean f = ⟨r|s⟩ = ⟨r⟩ does not. In this model, optimally decoded neurons would have no linear correlation with behavioral choice. Instead, the choice should be driven by the products of the neural responses, R(r) = vec(rr⊤), where vec(·) is a vectorization that flattens an array into a one-dimensional list of numbers. Such quadratic computation is what the energy model for complex cells is thought to accomplish for phase-invariant orientation coding [1]. Figure 3 shows linear and nonlinear choice correlations for pairs of neurons, the latter defined as C_{jk} = Corr(r_j r_k, ŝ | s). When decoding is linear, linear choice correlations are strong while nonlinear choice correlations are near zero (Figure 3A,B). When the decoding is quadratic, here mediated by an intermediate layer that multiplies pairs of neural activity, the nonlinear choice correlations are strong while the linear ones are insignificant (Figure 3C,D).
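The quadratic regime can be sketched with a hypothetical population whose response amplitudes flip sign with a polarity nuisance, so the mean is untuned (all parameters below are illustrative assumptions). A linear readout of the quadratic features vec(rr⊤) is fit by regression; at a fixed stimulus the linear choice correlations then vanish while the quadratic ones do not:

```python
import numpy as np

rng = np.random.default_rng(4)
N, T = 4, 40000

b = np.array([1.0, 0.8, 1.2, 0.9])    # baseline amplitudes (assumed values)
c = np.array([0.6, -0.4, 0.5, -0.7])  # stimulus dependence of the amplitudes

def simulate(s):
    """Responses whose sign flips with a polarity nuisance: mean tuning is zero."""
    z = rng.choice([-1.0, 1.0], len(s))        # polarity nuisance per trial
    amp = b[None, :] + np.outer(s, c)          # amplitude f_k(s) = b_k + c_k s
    return z[:, None] * amp + rng.normal(0.0, 0.2, (len(s), N))

def quad_features(r):
    """All pairwise products r_j r_k, flattened (quadratic expansion of r)."""
    return np.einsum('ti,tj->tij', r, r).reshape(len(r), -1)

# Train a linear readout of the quadratic features on varying stimuli
s_train = rng.uniform(-1.0, 1.0, T)
X = np.c_[quad_features(simulate(s_train)), np.ones(T)]
w = np.linalg.lstsq(X, s_train, rcond=None)[0]

# Measure choice correlations at a fixed stimulus s = 0
r = simulate(np.zeros(T))
R = quad_features(r)
s_hat = np.c_[R, np.ones(T)] @ w

C_lin = np.array([np.corrcoef(r[:, k], s_hat)[0, 1] for k in range(N)])
C_quad = np.array([np.corrcoef(R[:, k*(N+1)], s_hat)[0, 1] for k in range(N)])
```

Because the estimate is an even function of r while the response distribution at fixed s is symmetric under r → −r, the linear choice correlations are exactly zero in expectation; the correlations with the squared responses r_k² remain substantial, mirroring Figure 3C,D.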
2.7 Which nonlinearity?
If the brain’s decoder optimally uses all available information, choice correlations will obey the prediction of Eq. 7 even if the specific nonlinearities used by the brain differ from those selected for evaluating choice correlations (Methods 4.2.3). The prediction will hold as long as the brain’s nonlinearity can be expressed as a linear combination of the tested nonlinearities (Methods 4.2.3). Figure 4 shows a situation where information is encoded by quadratic and cubic sufficient statistics of neural responses, while a simulated brain decodes them near-optimally using a generic neural network rather than a set of nonlinearities matched to those sufficient statistics. Despite this mismatch we can successfully identify that the brain is near-optimal by applying Eq 7, even without knowing the simulated brain’s true nonlinear transformations.
2.8 Redundant codes
It might seem unlikely that the brain uses optimal, or even near-optimal, nonlinear decoding. Even if it does, there are an enormous number of high-order statistics for neural responses, so the information content in any one statistic could be tiny compared to the total information in all of them. For example, with N neurons there are on the order of N² quadratic statistics, N³ cubic statistics, and so on. With so many statistics contributing information, the choice correlation for any single one would then be tiny according to the ratio in Eq 7, and would be indistinguishable from zero with reasonable amounts of data. Past theoretical studies have described nonlinear (specifically, quadratic) codes with extensive information that grows proportionally with the number of neurons [16, 30]. This would indeed imply immeasurably small choice correlations for large, optimally decoded populations.
A resolution to these concerns is information-limiting correlations [27]. The past studies that derive extensive nonlinear information treat large cortical populations in isolation from the smaller sensory population that would naturally provide their input [16, 30]. However, when a network inherits information from a much smaller input population, the expanded neural code becomes highly redundant: the brain cannot have more information than it receives. Noise in the input is processed by the same pathway as the signal, and this generates noise correlations that can never be averaged away [27].
Previous work [27] characterized linear information-limiting correlations for fine discrimination tasks by decomposing the noise covariance into

Σ(s) = Σ_0 + ε f′(s) f′(s)⊤,

where ε is the variance of the information-limiting component and Σ_0 is noise that can be averaged away with many neurons.
For nonlinear population codes, it is not just the mean that encodes the signal, f(s) = ⟨r|s⟩, but rather the nonlinear statistics F(s) = ⟨R(r)|s⟩. Likewise, the noise does not comprise only the second-order covariance of r, Cov(r|s), but rather the second-order covariance of the relevant nonlinear statistics, Γ = Cov(R|s) (Section 2.2). Analogous to the linear case, these correlations can be locally decomposed as

Γ = Γ_0 + ε F′F′⊤,

where ε is again the variance of the information-limiting component, and Γ_0 is any other covariance that can be averaged away in large populations. The information-limiting noise bounds the estimator variance to no smaller than ε even with optimal decoding.
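The saturation this decomposition implies can be sketched numerically for the linear case, under the illustrative assumptions of a diagonal Σ_0 and random tuning slopes (none of these parameter choices come from the text):

```python
import numpy as np

rng = np.random.default_rng(5)
eps = 0.01   # variance of the information-limiting component

def linear_fisher(N):
    """Linear Fisher information J = f'^T Sigma^-1 f' for N neurons."""
    fp = rng.uniform(0.5, 1.5, N)                 # tuning slopes f'
    Sigma0 = np.diag(rng.uniform(0.5, 1.5, N))    # noise that averages away
    Sigma = Sigma0 + eps * np.outer(fp, fp)       # add the f' f'^T component
    return fp @ np.linalg.solve(Sigma, fp)

J = {N: linear_fisher(N) for N in (10, 100, 1000)}
```

Without the ε term, J would grow linearly with N; with it, J = J_0/(1 + εJ_0) is always below 1/ε, so the information saturates no matter how many neurons are added, and any decoded statistic becomes redundant with the others.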
Neither additional neurons nor additional decoded statistics can improve performance beyond this bound. As a direct consequence, when there are many fewer sensory inputs than cortical neurons, many distinct statistics R_k(r) will carry redundant information. Under these conditions, many ratios J_{R_k}/J (Eq 7) can be measurably large even for optimal nonlinear decoding (Figure 5).
2.9 Decoding efficiency revealed by choice correlations
Even if decoding is not strictly optimal, Eq. 7 can be satisfied due to information-limiting correlations. Decoders that seem substantially suboptimal because they fail to avoid the largest noise components in Γ_0 can nonetheless be dominated by the bound from information-limiting correlations. This will occur whenever the variability from suboptimally decoding Γ_0 is smaller than ε. Just as we can decompose the nonlinear noise correlations into information-limiting and other parts, we can decompose nonlinear choice correlations into corresponding parts as well, with the result that

C_{R_k} = α √(J_{R_k} / J) + ζ_{R_k},

where ζ_{R_k} depends on the particular type of suboptimal decoding (Supporting Information S7). The slope α between the measured choice correlations and those predicted from optimality is given by the fraction of the estimator variance explained by information-limiting noise, α = ε / Var(ŝ|s). This slope therefore provides an estimate of the efficiency of the brain’s decoding.
Figure 5 shows an example of a decoder that would be highly suboptimal without considering redundancy, but is nonetheless close to optimal when information limits are inherited.
In realistically redundant models that have more cortical neurons than sensory neurons, many decoders could be near-optimal, as we recently discovered in experimental data for a linear population code [28]. However, even in redundant codes there may be substantial inefficiencies, especially for unnatural tasks [50].
3 Discussion
This study introduced a theory of nonlinear population codes, grounded in the natural computational task of separating relevant and irrelevant variables. The theory considers both encoding and decoding — how stimuli drive neurons, and how neurons drive behavioral choices. It showed how correlated fluctuations between neural activity and behavioral choices could reveal properties of the brain’s decoding. Unlike previous theories [16, 30], ours remains consistent with biological constraints due to the large cortical expansion of sensory representations by incorporating redundancy through information-limiting correlations. Crucially, this theory provides a remarkably simple test to determine whether downstream nonlinear computation decodes all that is encoded.
Alternative methods to estimate whether animals use their information efficiently rely upon comparing behavioral performance to performance of an ideal observer that can access the entire population. Even with impressive advances in neurotechnology, this challenge remains out of reach for large populations. In contrast, our proposed method to test for optimal decoding has a vastly lower experimental burden. It requires only that a few cells be recorded simultaneously while an animal performs a fine estimation or discrimination task.
On the other hand, this simple test does not offer a complete description of neural transformations.
It instead tests one important hypothesis about their functional role — that the brain performs optimal decoding. The theory also provides a practical way of estimating decoding efficiency. The brain may not be optimal, but may instead settle for a more modest decoding efficiency. In this case, more work is needed to understand which suboptimalities the brain tolerates for satisfactory performance.
3.1 Which nonlinearities should we test?
If all neural signals are decoded optimally, then all choice correlations for any function of those signals should also be consistent with optimal decoding, since they contain the same information. Yet for the wrong or incomplete nonlinearities that do not disentangle the task-relevant variables from the nuisance variables, the test may be inconclusive, just as it was for linear decoding of a nonlinear code (Figure 4): the chosen nonlinear functions may not extract linearly decodable information, nor have any choice correlation.
The optimal nonlinearities would be those that collectively extract the sufficient statistics about the relevant stimulus, which depend on both the task and the nuisance variables. In complex tasks, like recognizing objects in images with many nuisance variables, most of the relevant information lives in higher-order statistics, and therefore requires more complex nonlinearities to extract. In such high-dimensional cases, our proposed test is unlikely to be useful. This is because our method expresses stimulus estimates as sums of nonlinear functions, and while that is universal in principle [51], it is not a compact way to express the complex nonlinearities of deep networks. Alternatively, with good guidance from trained neural network models, our method could potentially judge whether those nonlinearities provide a good description of neural decoding. This decoding perspective would complement studies that argue for a good match between such networks and neural encoding [6].
The best condition to apply our optimality test is in tasks of modest complexity but still possessing fundamentally nonlinear structure. Some interesting examples where our test could have practical relevance include motion detection using photoreceptors [52], visual search with distractors (XOR-type tasks) [31, 53], sound localization in early auditory processing before the inferior colliculus [54], or context switching in higher-level cortex [55].
Our test for optimal nonlinear decoding really amounts to testing for optimal linear decoding of non-linear functions of recorded neural data. If we had access to some putative downstream neurons that computed these nonlinear functions, we could just test whether the brain linearly decoded those neurons optimally. Yet that would circumvent the most interesting and crucial nonlinear aspects of neural computation. Alternatively, if we could record from neurons at different levels of the processing chain, we could try to characterize that nonlinear recoding between them directly, without reference to a behavioral choice. But this would not easily relate these computations to their functional role. The method proposed here allows us to skip these intermediate steps and directly test the optimality of all accumulated downstream nonlinearities.
3.2 Nonlinear decoding or switched linear decoding?
Could the brain avoid nonlinear decoding just by switching between different linear decoders depending on the current nuisance variable n, so that ŝ = w(n)⊤r? The switching variable itself would have to be inferred from the sensory data, which requires marginalizing over the task variable; this takes us back to the original problem, but with task and nuisance variables reversed. Even so, switched linear decoding would actually be equivalent to nonlinear decoding whenever the nuisance estimate n̂(r) is derived from the neural responses: ŝ = w(n̂(r))⊤r is then a nonlinear function of r. A discrimination task with a changing class boundary [12, 56, 57] is, in principle, a nonlinear task. But if the class boundary changes too slowly, perhaps only across days, then the brain may well re-learn its weights rather than performing some nonlinear decoding of recent activity. A better experimental design for revealing nonlinear computation for task context would randomly change the task, either cued [58] or even uncued [55], on a time scale short enough that recent neural activity affects the class boundary.
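This equivalence can be sketched with a polarity-style nuisance (all tuning vectors and noise levels below are illustrative assumptions): a ‘switched’ decoder flips its weights according to a polarity estimate ẑ(r) computed from the responses themselves, making the overall readout a nonlinear function of r:

```python
import numpy as np

rng = np.random.default_rng(6)
T = 20000

b = np.array([1.0, 0.8, 1.2, 0.9])    # baseline response amplitudes
c = np.array([0.6, -0.4, 0.5, -0.7])  # stimulus dependence of the amplitudes

s = rng.uniform(-1.0, 1.0, T)                     # relevant stimulus
z = rng.choice([-1.0, 1.0], T)                    # nuisance: polarity
r = z[:, None] * (b + np.outer(s, c)) + rng.normal(0.0, 0.2, (T, 4))

# A fixed linear readout is useless: the polarity flips wash out the signal
lin_readout = r @ c

# Switched readout: infer the polarity from r itself, then undo the flip.
# Since the switch z_hat depends on r, the composite decoder is nonlinear in r.
z_hat = np.sign(r @ b)
switched_readout = z_hat * (r @ c)

C_fixed = np.corrcoef(s, lin_readout)[0, 1]
C_switched = np.corrcoef(s, switched_readout)[0, 1]
```

The fixed linear readout is uncorrelated with s, while the switched readout recovers it well; the switching step is exactly the marginalization over the nuisance that a nonlinear decoder performs implicitly.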
3.3 Limitations of the approach
For efficient decoding in a learned task, the optimality test (7) is necessary but not sufficient. If the brain neglects some informative sufficient statistics, and we don’t test those neglected statistics either, then the brain could pass our optimal-decoding test yet still be suboptimal. Only if the test is passed for all statistics will the test be conclusive. For an extreme example, a single neuron might pass the test, but if other neurons don’t, then the brain is not using its information well. On a broader scale, one might find that all individual responses r_k pass the optimality test, while products of responses r_j r_k fail. This would be consistent with linear information being used well while distinct quadratic information is present but unused; on the other hand, this outcome would not be consistent with quadratic statistics that are uninformative but decoded anyway, since that would increase the output variance beyond that expected from the linear information. In future work we will demonstrate how nonlinear choice correlations can be used to identify properties of suboptimal decoders [59].
Our approach is currently limited to feedforward processing, which unquestionably oversimplifies cortical processing. Nonetheless, feedforward models do a fair job of capturing the representational structure of the brain [6].
Feedback could also cause suboptimal networks to exhibit choice correlations that seem to resemble the optimal prediction. If the feedback is noisy and projects into the same direction that encodes the stimulus, such as from a dynamic bias [60], then this could appear as information-limiting correlations, enhancing the match with Eq 7. This situation could be disambiguated by measuring the internal noise source providing the feedback, and of course this would require more simultaneous measurements.
3.4 Comparing choice correlations from internal and external noise
Since many stimulus-dependent response correlations are induced by external nuisance variation, not internal noise, we might not find informative stimulus-dependent noise correlations upon repeated presentations of a fixed stimulus. Those correlations may only be informative about a stimulus in the presence of natural nuisance variation. For example, if a picture of a face is shown repeatedly without changing its pose, then small expression changes can readily be identified by linear operations; if the pose can vary then the stimulus is only reflected in higher-order correlations [5].
In contrast, we should see some nonlinear choice correlations even when nuisance variables are fixed. This is because neural circuitry must combine responses nonlinearly to eliminate natural nuisance variation, and any internal noise passing through those same channels will thereby influence the choice (although they may be smaller and more difficult to detect than the fluctuations caused by the nuisance variation). This influence will manifest as nonlinear choice correlations. In other words, stimulus-dependent noise correlations need not predict a fixed stimulus, but they may predict the choice (Supplementary Information S8).
For optimal decoding, the choice correlations measured using fixed nuisance variables will differ from Eq 7, which should strictly hold only when there is natural nuisance variation. This is implicit in Eq 7, since the relevant quantities are conditioned only on the relevant stimulus s while averaging over the nuisance variations n. However, under some conditions, a related prediction for nonlinear choice correlations holds even without averaging over nuisance variables (Supplementary Information S8).
3.5 Conclusion
Despite the clear importance of computation that is both nonlinear and distributed, and evidence for nonlinear coding in the cortex [31, 33–35], most neuroscience applications of population coding concepts have assumed linear codes and linear readouts [6, 28, 39, 61, 62]. The few that directly address nonlinear population codes either have an impossibly large amount of encoded information [16, 30], or investigate abstract properties unrelated to structured tasks [63]. Some experimental studies have been able to extract additional information from recorded populations using nonlinear decoders [31, 64], but the inferred properties of such decoders rest on the recordings being a representative sample that can be extrapolated to larger populations. Unknown correlations and redundancy prevent that from being a reliable method [23, 65].
Our method to understand nonlinear neural decoding requires neural recordings in a behaving animal. The task must be hard enough that the animal makes some errors, so that there are behavioral fluctuations to explain. Finally, there should be a modest number of nonlinearly entangled nuisance variables. Unfortunately, many neuroscience experiments are designed without explicit use of nuisance variables. Although this simplifies the analysis, the simplification comes at a great cost: the neural circuits are engaged far from their natural operating point, and far from their purpose. There is little hope of understanding neural computation without challenging neural systems with the nonlinear tasks for which they are required.
Our statistical perspective on feedforward nonlinear coding in the presence of nuisance variables provides a useful framework for thinking about neural computation. Furthermore, choice-related activity provides guidance for designing interesting experiments to measure not only how information is encoded in the brain, but how it is decoded to generate behavior. In future work we aim to apply this theory to experimental data to test whether real brains decode neural information optimally in any nonlinear tasks.
4 Methods
4.1 Encoding models
4.1.1 Orientation estimation with varying spatial phase
Figure 1 illustrates how nuisance variation can eliminate a neuron’s mean tuning to relevant stimulus variables, relegating the neural tuning to higher-order statistics like covariances. In this example, the subject estimates the orientation of a Gabor image, G(x|s, n), where x is spatial position in the image, and s and n are the orientation and spatial phase of the image, respectively (Supplemental Material S2). The model visual neurons are linear Gabor filters like idealized simple cells in primary visual cortex, corrupted by additive white Gaussian noise. Their responses are thus distributed as r ∼ P (r|s, n) = N (r|f (s, n), ϵI), where ϵ is the noise variance and the mean f (s, n) = ⟨r|s, n⟩ =Σr r p(r|s, n) is determined by the overlap between the image and the receptive field.
When the spatial phase n is known, the mean neural response contains all the information about orientation s. The brain can decode responses linearly to estimate orientation near a reference s0.
When the spatial phase varies, however, the mean response to a fixed orientation combines responses across different phases: f(s) = ⟨r|s⟩ = Σr r p(r|s) = ∫ dn Σr r p(r|s, n) p(n). Since each spatial phase can be paired with another phase π radians away that inverts the linear response, the phase-averaged mean is f(s) = 0. Thus the brain cannot estimate orientation by decoding these neurons linearly; nonlinear computation is necessary.
The covariance provides one such tuned statistic. We define Covij(r|s, n) as the neural covariance for a fixed input image (noise correlations), and Covij(r|s) as the neural covariance when the nuisance varies (nuisance correlations). According to the law of total covariance,

Covij(r|s) = ⟨Covij(r|s, n)⟩n + Covn(fi(s, n), fj(s, n)) = ϵδij + ⟨δfi(s, n) δfj(s, n)⟩n,

where δfi(s, n) = fi(s, n) − ⟨fi(s, n)⟩n. Supplementary Information S2 shows in detail how Covij(r|s) is tuned to s.
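This mechanism can be illustrated numerically. The sketch below is our own construction, not the paper's simulation: the pixel grid, envelope width, and spatial frequency are arbitrary illustrative choices. Noisy Gabor filters are applied to gratings with random spatial phase; the phase-averaged mean response is untuned to orientation while the response covariance remains clearly orientation-dependent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Gratings and Gabor receptive fields on a small pixel grid.
# Grid size, envelope width, and spatial frequency are illustrative choices.
grid = np.linspace(-3, 3, 12)
X, Y = np.meshgrid(grid, grid)
kappa, sig = 1.5, 1.2

def grating(s, n):                         # image with orientation s, phase n
    return np.cos(kappa * (np.cos(s) * X + np.sin(s) * Y) + n)

def gabor(sj, nj):                         # receptive field of model neuron j
    env = np.exp(-(X ** 2 + Y ** 2) / (2 * sig ** 2))
    return env * np.cos(kappa * (np.cos(sj) * X + np.sin(sj) * Y) + nj)

prefs = [(sj, nj) for sj in np.linspace(0, np.pi, 4, endpoint=False)
         for nj in (0.0, np.pi / 2)]
filters = np.array([gabor(sj, nj).ravel() for sj, nj in prefs])

def responses(s, trials=5000, eps=0.1):
    n = rng.uniform(0, 2 * np.pi, trials)  # nuisance: random spatial phase
    imgs = np.array([grating(s, ni).ravel() for ni in n])
    noise = np.sqrt(eps) * rng.standard_normal((trials, len(prefs)))
    return imgs @ filters.T + noise        # r = filter overlap + noise

r = responses(s=0.2)
print(np.abs(r.mean(axis=0)).max())        # phase-averaged mean tuning: ~0
cov_gap = np.abs(np.cov(r.T) - np.cov(responses(1.2).T)).max()
print(cov_gap)                             # covariance depends strongly on s
```

Linear readouts of r average to zero over the nuisance phase, while the second-order statistics remain tuned, matching the argument above.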
4.1.2 Exponential family distribution and sufficient statistics
We assume the response distribution conditioned on the relevant stimulus (but not on nuisance variables) is approximately a member of the exponential family with nonlinear sufficient statistics,

p(r|s) = exp[H(s)ᵀR(r) − A(s) + b(r)],

where R(r) is a vector of sufficient statistics for the natural parameter H(s), b(r) is the base measure, and A(s) is the log-partition function. The sufficient statistics contain all of the information in the population response, and all other tuned statistics may be derived from them.
Estimation and inference are closely connected in the exponential family. In Supplemental Material S1.2, we show that optimal local estimation can be achieved by linearly decoding the nonlinear sufficient statistics,

ŝ = s0 + wᵀ(R(r) − F(s0)).

The decoding weights w ∝ Γ⁻¹F′ minimize the variance of an unbiased decoder, where F′ = ∂⟨R(r)|s⟩/∂s is the sensitivity of the statistics to changing inputs, and Γ = Cov(R|s) is the stimulus-conditioned response covariance, which generally includes nuisance correlations (Section 2.2).
The variance of this unbiased local estimator of the neural responses is lower-bounded by the inverse Fisher information. For exponential family distributions with nonlinear sufficient statistics R(r), the Fisher information is [23] (Supplemental Material S1.1)

J = F′ᵀΓ⁻¹F′.   (13)
4.1.3 Quadratic encoding
In a quadratic coding model, the distribution of neural responses is described by the exponential family with up to quadratic sufficient statistics, R(r) = {ri, rirj} for i, j ∈ {1, …, N}. A familiar example is the Gaussian distribution with stimulus-dependent covariance Σ(s). To demonstrate the coding properties of a purely nonlinear neural code, here we assume the mean tuning curve f is constant, while the stimulus-conditional covariance Σij(s) depends smoothly on the stimulus. We can quantify the information content of the neural population using Equation 13.
4.1.4 Cubic encoding
In our cubic coding model, the distribution of neural responses is described by the exponential family with up to cubic sufficient statistics, R(r) = {ri, rirj, rirjrk} for i, j, k ∈ {1, …, N }.
We approximate a three-neuron cubic code first using purely cubic components, and we then apply a stimulus-dependent affine transformation to include linear and quadratic statistics. The pure cubic code is used for a vector z with sufficient statistics zizjzk (and a base measure to ensure the distribution is bounded and normalizable).
We approximate this distribution by a mixture of four Gaussians. The mixture is chosen to reproduce the tetrahedral symmetry of the cubic distribution (Supplementary Figure 6), which allows the cubic statistics of responses to be stimulus dependent, leaving stimulus-independent quadratic and linear statistics.
To generate larger multivariate cubic codes for Figure 6, for simplicity we assume the pure cubic terms only couple disjoint triplets of variables, and sample independently from an approximately cubic distribution for each triplet. To convert this purely cubic distribution to a distribution with linear and quadratic information, we shift and scale these cubic samples z in a manner dependent on s, r = f(s) + Σ(s)^{1/2} z, where f(s) and Σ(s) describe the desired signal-dependent mean and covariance (see Supplemental Material S4).
4.2 Nonlinear choice correlations
4.2.1 Estimating choice correlation
The nonlinear choice correlation between the stimulus estimate ŝ and one nonlinear function Rk (the kth element of the vector R) of recorded neural activity r is

Ck = Corr(Rk, ŝ|s) = Cov(Rk, ŝ|s)/(σRk σŝ),   (16)

where σŝ² is the estimator variance and σRk is the standard deviation of the statistic Rk.
To compute this quantity from neural responses to stimuli, we need to condition neural responses and behavioral data on the same signal s, or on the same total input (s, n) if we want to isolate the contribution of purely internal noise rather than nuisance variation (Supplemental Material S8). We combine choice correlations calculated under different stimulus conditions by balanced z-scoring [66].
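The condition-then-combine step can be sketched as follows. This is a simplified variant of the balanced z-scoring idea cited as [66], not that estimator itself: we merely z-score each statistic and each estimate within a stimulus condition before pooling, which removes the across-condition offsets that would otherwise inflate the correlation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two stimulus conditions with different response offsets; within each
# condition the choice covaries with the neuron's fluctuations. Pooling raw
# data confounds the offsets with choice covariation, so z-score first.
def zscore(x):
    return (x - x.mean()) / x.std()

raw_r, raw_e, z_r, z_e = [], [], [], []
for offset in (0.0, 5.0):                     # two stimulus conditions
    noise = rng.standard_normal(5000)
    r_k = offset + noise                      # one recorded statistic R_k
    est = offset + 0.8 * noise + 0.6 * rng.standard_normal(5000)  # estimate
    raw_r.append(r_k); raw_e.append(est)
    z_r.append(zscore(r_k)); z_e.append(zscore(est))

raw = np.corrcoef(np.concatenate(raw_r), np.concatenate(raw_e))[0, 1]
C_k = np.corrcoef(np.concatenate(z_r), np.concatenate(z_e))[0, 1]
print(raw, C_k)   # raw pooling is inflated by offsets; z-scored is ~0.8
```

The z-scored estimate recovers the within-condition correlation (0.8 by construction here), while naive pooling is dominated by the condition offsets.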
4.2.2 Optimality test
Locally optimal linear weights for decoding the statistics R are given by linear regression as w ∝ Γ⁻¹F′. Substituting these weights into (16), the optimal nonlinear choice correlation (Eq 7) becomes

Ck,opt = F′k/(σRk √J) = √(Jk/J),

where Jk = F′k²/Γkk is the linear Fisher information in Rk(r) and J = F′ᵀΓ⁻¹F′ is the Fisher information in the complete set of statistics.
For fine-scale discriminations, optimal choice correlations can be written in many equivalent ways, for example

Ck,opt = √(Jk/J) = d′Rk/d′ŝ,

where d′ is the discriminability. These forms reflect the simple relationships between four quantities often used to represent information: the discriminability d′ is proportional to the square root of the Fisher information [67]; the estimator standard deviation is bounded by the inverse square root of the Fisher information, σŝ ≥ 1/√J; and the discrimination threshold is proportional to the estimator standard deviation. In different experiments (binary discrimination, continuous estimation), it can be most natural to express this relationship in different measured quantities.
In our simulations with binary choices for fine discrimination, we calculate the optimal nonlinear choice correlation using d-prime [68]. The discriminability d′Rk of each statistic is estimated from neural responses generated by stimuli s± = s0 ± Δs/2 near a reference stimulus s0:

d′Rk = (⟨Rk|s+⟩ − ⟨Rk|s−⟩)/σRk,

where σRk is the standard deviation of Rk averaged across the two stimulus conditions.
The discriminability for the decoded neural population is estimated from the unbiased decoder output’s standard deviation, d′ŝ = Δs/σŝ.
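As a consistency check on these relations, the sketch below simulates a linear-Gaussian population read out by a locally optimal unbiased linear decoder, and compares simulated choice correlations with the prediction Ck,opt = F′k/(σRk √J). The population size, tuning slopes, and covariance are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 8
Fp = rng.standard_normal(N)                       # tuning slopes F'
A = rng.standard_normal((N, N))
Sigma = A @ A.T + N * np.eye(N)                   # response covariance
J = Fp @ np.linalg.solve(Sigma, Fp)               # linear Fisher information
w = np.linalg.solve(Sigma, Fp) / J                # locally optimal, unbiased

# Simulate response fluctuations at a fixed stimulus and decode them.
L = np.linalg.cholesky(Sigma)
dr = (L @ rng.standard_normal((N, 100000))).T     # zero-mean fluctuations
est = dr @ w                                      # estimator fluctuations

C_sim = np.array([np.corrcoef(dr[:, k], est)[0, 1] for k in range(N)])
C_opt = Fp / (np.sqrt(np.diag(Sigma)) * np.sqrt(J))   # optimality prediction
print(np.abs(C_sim - C_opt).max())                # small sampling error only
```

The simulated choice correlations match the predicted values up to Monte Carlo error, as expected when the decoder is optimal.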
4.2.3 Nonlinear choice correlation to analyze an unknown nonlinearity
In Figure 4, we generated neural responses given sufficient statistics that are polynomials up to third order, R(r) = {ri, rirj, rirjrk} (Methods 4.1.4). In contrast, our model brain decodes the stimulus using a cascade of linear-nonlinear transformations, with Rectified Linear Units (ReLU(x) = max(0, x)) as the nonlinear activation functions. We used a fully-connected ReLU network with two hidden layers and 30 units per hidden layer.
We trained the network weights and biases with backpropagation to estimate stimuli near a reference s0 based on 20000 training pairs (r, s) generated by the cubic encoding model. This trained neural network extracted 91% of the information available to an optimal decoder.
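A miniature version of this decoding setup can be sketched in plain numpy. This is not the paper's model: for brevity the encoder here is a toy variance-coded population (the stimulus enters only through the response variance) rather than the cubic code, and the training set and architecture are scaled down. It only illustrates training a small two-hidden-layer ReLU regressor by backpropagation on such data.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(T):
    # Toy nonlinear code: stimulus modulates response variance, not the mean.
    s = rng.uniform(-0.5, 0.5, T)
    z = rng.standard_normal((T, 3))
    return z * np.sqrt(1.0 + s)[:, None], s

relu = lambda x: np.maximum(0.0, x)
H = 30                                         # hidden units per layer
W1 = rng.standard_normal((3, H)) * 0.3; b1 = np.zeros(H)
W2 = rng.standard_normal((H, H)) * 0.3; b2 = np.zeros(H)
v = np.zeros(H); c = 0.0                       # linear readout

r, s = make_data(5000)
lr, mse_hist = 0.05, []
for _ in range(200):                           # full-batch gradient descent
    h1 = relu(r @ W1 + b1)
    h2 = relu(h1 @ W2 + b2)
    err = h2 @ v + c - s
    mse_hist.append(np.mean(err ** 2))
    d = 2.0 * err / len(s)                     # dLoss/d(output)
    gv, gc = h2.T @ d, d.sum()
    dh2 = np.outer(d, v) * (h2 > 0)            # backprop through ReLU
    gW2, gb2 = h1.T @ dh2, dh2.sum(0)
    dh1 = (dh2 @ W2.T) * (h1 > 0)
    gW1, gb1 = r.T @ dh1, dh1.sum(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2
    v -= lr * gv; c -= lr * gc

print(mse_hist[0], mse_hist[-1])               # squared error decreases
```

Because the stimulus is invisible to any linear readout of r, the network must exploit its ReLU nonlinearities to reduce the regression error below the prior variance.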
4.3 Information-limiting correlations
Only specific correlated fluctuations limit the information content of large neural populations [27]. These fluctuations can ultimately be referred back to the stimulus as r ∼ p(r|s + ds), where ds is zero-mean noise whose variance 1/J∞ determines the asymptotic variance of any stimulus estimator. These information-limiting correlations for nonlinear computation can be characterized by the covariance of the sufficient statistics, Γ = Cov(R|s), conditioned on s; the information-limiting component arises specifically from the signal covariance Cov(F(s)|s). Since the signal for local estimation of stimuli near a reference s0 is F′ = dF/ds, the information-limiting component of the covariance is proportional to F′F′ᵀ:

Γ = Γ0 + F′F′ᵀ/J∞.
Here Γ0 is any covariance of R that does not limit information in large populations. Substituting this expression into (13) for the nonlinear Fisher information, we obtain

J = J0/(1 + J0/J∞) = J0J∞/(J0 + J∞),

where J0 = F′ᵀΓ0⁻¹F′ is the nonlinear Fisher information allowed by Γ0. When the population size grows, the extensive information term J0 grows proportionally, so the output information asymptotes to J∞.
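The saturation implied by this expression, 1/J = 1/J0 + 1/J∞, is easy to verify numerically. In the minimal sketch below we take Γ0 = I and unit sensitivities so that J0 = K grows with the number of statistics K; these choices are ours, for illustration only.

```python
import numpy as np

J_inf = 100.0

# With information-limiting correlations, Gamma = Gamma0 + F'F'^T / J_inf,
# the output information J = F'^T Gamma^{-1} F' obeys 1/J = 1/J0 + 1/J_inf.
def info(K):
    Fp = np.ones(K)                               # unit sensitivity: J0 = K
    Gamma = np.eye(K) + np.outer(Fp, Fp) / J_inf  # limited covariance
    return Fp @ np.linalg.solve(Gamma, Fp)

for K in (10, 100, 1000):
    print(K, info(K))    # approaches, but never exceeds, J_inf = 100
```

Information grows almost linearly while K ≪ J∞, then saturates at J∞ no matter how large the population becomes.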
Author contributions
XP conceived the theoretical framework. XP and QY performed the theoretical analyses. QY performed the simulations. QY and XP wrote the manuscript.
Supplemental material
S1 Exponential family distributions
For a stimulus s and a response r, the conditional probability is a member of the exponential family when

p(r|s) = exp[H(s)ᵀR(r) − A(s) + b(r)],

where H(s) are the natural parameters, R(r) are the sufficient statistics, and A(s) and b(r) are the log-normalizer and base measure. The statistics R(r) are called sufficient because they contain all the information needed to estimate the stimulus s.
S1.1 Fisher information
One measure of the information content that a population response contains about a stimulus is the Fisher information J(s) [16, 24–27, 29], given by

J(s) = ⟨(∂s log p(r|s))²⟩ = −⟨∂²s log p(r|s)⟩.
For distributions p(r|s) in the exponential family with sufficient statistics R(r), we can compute these quantities analytically. We denote the mean of the sufficient statistics as F(s) = ⟨R(r)|s⟩. This mean can be obtained by differentiating A(s) by the natural parameters H(s),

F(s) = ∂A/∂H.   (29)
Differentiating with respect to s, Equation 29 gives the first and second derivatives of A(s): A′(s) = H′(s)ᵀF(s) and A″(s) = H″(s)ᵀF(s) + H′(s)ᵀF′(s).
Thus we can compute the two definitions of the Fisher information,

J = −⟨∂²s log p(r|s)⟩ = A″ − H″ᵀF = H′ᵀF′

and

J = ⟨(∂s log p(r|s))²⟩ = H′ᵀΓH′,   (37)

where Γ = Cov[R(r)|s].
Since the two definitions are equivalent, we have

ΓH′ = F′, i.e. H′ = Γ⁻¹F′.   (38)
Substituting Equation 38 into Equation 37, we find the Fisher information for the exponential family [23]:

J = F′ᵀΓ⁻¹F′.
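This identity can be checked on a one-dimensional example with a nonlinear sufficient statistic: r ∼ N(0, s), for which R(r) = r², F(s) = ⟨r²|s⟩ = s, and the Fisher information is known analytically to be 1/(2s²).

```python
import numpy as np

rng = np.random.default_rng(5)

# Exponential-family Fisher information J = F'^T Gamma^{-1} F' versus the
# score-variance definition, for r ~ N(0, s) with statistic R(r) = r^2.
s = 2.0
r = rng.standard_normal(1_000_000) * np.sqrt(s)
R = r ** 2

Fp = 1.0                       # dF/ds, since <r^2|s> = s
Gamma = R.var()                # Cov(R|s); analytically 2 s^2
J_formula = Fp ** 2 / Gamma    # should be close to 1/(2 s^2) = 0.125

score = (r ** 2 - s) / (2 * s ** 2)   # d/ds log p(r|s)
J_score = score.var()                 # Fisher information as score variance
print(J_formula, J_score)
```

Both estimates agree with the analytic value up to Monte Carlo error, illustrating that the linear information in the nonlinear statistic r² captures all the Fisher information of this family.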
S1.2 Estimation in the exponential family
Again assuming responses come from this distribution, we want to compute the maximum likelihood stimulus, ŝ = argmaxs log p(r|s), near a reference stimulus s0, where log p(r|s) = H(s)ᵀR(r) − A(s) + b(r).
A Taylor expansion around the reference yields

log p(r|s) ≈ log p(r|s0) + (H′ᵀR − A′)(s − s0) + ½(H″ᵀR − A″)(s − s0)²,

where all functions and derivatives are evaluated at s0. We find the maximum by differentiating with respect to s and setting the result equal to zero:

(H′ᵀR − A′) + (H″ᵀR − A″)(ŝ − s0) = 0.
The solution is

ŝ = s0 − (H′ᵀR − A′)/(H″ᵀR − A″).
Since r is a random quantity, we can express R as a mean and a deviation away from that mean: R = ⟨R|s0⟩ + δR = F + δR. In this case, H″ᵀR − A″ = H″ᵀF − A″ + H″ᵀδR, where the mean term H″ᵀF − A″ is precisely the negative Fisher information, −J(s0). If the trial-to-trial fluctuations in the uncertainty are small relative to the average uncertainty, then this Fisher term dominates the denominator; likewise, since A′ = H′ᵀF, the numerator is H′ᵀR − A′ = H′ᵀδR. Then we have

ŝ ≈ s0 + H′ᵀδR/J(s0) = s0 + F′ᵀΓ⁻¹δR/(F′ᵀΓ⁻¹F′),

where we used the results from Equations 13 and 38, with Γ = Cov(R|s0) and F = ⟨R|s0⟩. Thus, in this limit, the optimal estimator for s is a linear decoding of the sufficient statistics R(r).
S2 Orientation estimation task with varying spatial phase
In Figure 2B, the subject’s task is to estimate orientation s near a reference s0, based on images G(x|s, n) of Gabor patterns with wave vector k = κ(cos s, sin s). Here the target s is the orientation of the pattern, n is a nuisance variable reflecting the spatial phase, x is the pixel location in the image, and k is a spatial frequency vector with amplitude κ = ∥k∥. We assume the spatial receptive field of simple cell j in primary visual cortex is also described by a Gabor function, where each neuron has a preferred orientation sj, spatial phase nj, and spatial frequency kj. For simplicity we assume that all neurons’ preferred spatial frequencies have the same amplitude κ, matching the input image.
We model the mean neuronal responses by the overlap between the image and their linear receptive field. This overlap determines the tuning curve of each neuron.
This expression can be written in the form

fj(s, n) = aj(s) cos(n + ψj(s)),

using a stimulus-dependent response amplitude aj(s) and phase ψj(s), which are determined by the image and receptive-field parameters.
Equation 52 reveals that the mean response of each neuron traces out a sinusoidal oscillation in n, where the amplitude and phase depend on s and the specific neuron j. The mean tuning for each pair of neurons therefore traces out an ellipse as a function of the nuisance variable, the input’s spatial phase. When we average over the ellipse generated by the nuisance variable n, the mean tuning to s is abolished — but the response covariances (nuisance correlations) remain tuned to s.
Assuming each neuron’s response variability is drawn independently from a standard Gaussian N(0, 1), we can write the response distribution as p(r|s, n) = N(r|f(s, n), I).
If the spatial phase n were fixed and known, the brain could estimate the orientation just from the mean tuning of the neural responses. However, if the spatial phase is unknown and varies between stimulus presentations uniformly from 0 to 2π, the mean tuning f(s) can be expressed as

f(s) = (1/2π) ∫₀^{2π} f(s, n) dn = 0.
This shows that there is no signal in the mean responses.
However, the brain can perform quadratic computations to eliminate the nuisance variable. We can define Covij[r|s, n] as the neural covariance (noise correlations) when everything in the image is fixed, and Covij[r|s] as the neural covariance when the nuisance is unknown and free to vary (nuisance correlations). Then

Covij[r|s] = δij + Dij(s),

where Dij(s) is given by

Dij(s) = ⟨fi(s, n) fj(s, n)⟩n = ½ ai(s) aj(s) cos(ψi(s) − ψj(s)).
Here when we compute Equation 71, we used the trigonometric identity 2 cos(x) cos(y) = cos(x + y) + cos(x − y), together with ∫₀^{2π} cos(2n + ψi + ψj) dn = 0.
This demonstrates that the neural covariance Covij[r|s] depends on the orientation s. While linear computation is useless for estimating orientation since the mean responses are untuned (59), quadratic (or higher-order) nonlinear computations can be used to estimate the orientation.
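The phase-averaged covariance formula can be checked by direct simulation. The amplitudes aj and phases ψj below are arbitrary fixed values standing in for aj(s) and ψj(s) at one stimulus.

```python
import numpy as np

rng = np.random.default_rng(6)

# Check: with f_j(s,n) = a_j cos(n + psi_j) and unit Gaussian noise,
# averaging over uniform phase n gives
#   Cov_ij[r|s] = delta_ij + (1/2) a_i a_j cos(psi_i - psi_j).
a = np.array([1.0, 0.7, 0.4])        # assumed amplitudes a_j(s) at fixed s
psi = np.array([0.0, 1.1, 2.5])      # assumed phases psi_j(s)

T = 400_000
n = rng.uniform(0, 2 * np.pi, T)     # nuisance phase, uniform on [0, 2*pi)
f = a * np.cos(n[:, None] + psi)     # T x 3 mean responses
r = f + rng.standard_normal((T, 3))  # add unit-variance internal noise

C_emp = np.cov(r.T)
C_theory = np.eye(3) + 0.5 * np.outer(a, a) * np.cos(psi[:, None] - psi[None, :])
print(np.abs(C_emp - C_theory).max())   # small sampling error
```

The empirical covariance matches the analytic expression, confirming that the orientation signal survives phase averaging only in second-order statistics.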
S3 Quadratic coding model
In a purely quadratic coding model (no linear infor-mation), the distribution of neural responses is de-scribed by the exponential family with quadratic sufficient statistics, p(r|s) ∼ exp [H(s)TR(r)] where R(r) = (…, rirj, …). A familiar example is a Gaussian distribution with stimulus-dependent covariance: p(r|s) = N (f, Σ(s)).
As a concrete example, we construct a covariance that rotates with the stimulus s. Any covariance matrix must be positive semidefinite. We build Σ(s) by setting the eigenvalues to be positive and s-independent, with eigenvectors that form an orthogonal basis rotating with s:

Σ(s) = V(s) Λ V(s)ᵀ,

where V(s) = exp(As) is a rotation matrix, A = −Aᵀ is a real antisymmetric matrix with purely imaginary eigenvalues, and Λ is a diagonal matrix composed of all positive eigenvalues.
To calculate the Fisher Information (Equation 13), we need to first calculate the derivative of the mean and covariance Γ = Cov[R(r)|s] of the quadratic sufficient statistics.
Because the mean of r does not depend on the stimulus in this example, we can compute F′ij = Σ′ij(s), where Σ′(s) = AΣ(s) − Σ(s)A is the derivative of the covariance of r.
Here Ω is a diagonal matrix of eigenvalues for A, U is an orthogonal matrix of the eigenvectors of A, and X = U TΛU.
The elements of Γ can be expressed as Γij,kn = ⟨rirjrkrn|s⟩ − ⟨rirj|s⟩⟨rkrn|s⟩. We can use Isserlis’ identity for Gaussian fourth moments to compute this quantity; for centered responses it gives

Γij,kn = ΣikΣjn + ΣinΣjk.
Substitution of the response covariance (Equation 72) into Equation 74 allows us to calculate the covariance Γ of the quadratic sufficient statistics, and thereby to estimate the stimulus and Fisher information for this quadratic code.
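The rotating-covariance construction and the derivative used for F′ can be checked in the smallest case, a 2×2 covariance with A the antisymmetric generator of planar rotations (so V(s) = exp(As) is an explicit rotation matrix). We verify Σ′(s) = AΣ(s) − Σ(s)A against finite differences.

```python
import numpy as np

# Sigma(s) = V(s) Lambda V(s)^T with V(s) = exp(A s); for the 2x2 rotation
# generator A, V(s) is the usual rotation matrix, written out explicitly.
A = np.array([[0.0, -1.0], [1.0, 0.0]])   # antisymmetric generator
Lam = np.diag([3.0, 1.0])                 # fixed positive eigenvalues

def V(s):
    return np.array([[np.cos(s), -np.sin(s)], [np.sin(s), np.cos(s)]])

def Sigma(s):
    return V(s) @ Lam @ V(s).T

s0, h = 0.7, 1e-6
dSigma_fd = (Sigma(s0 + h) - Sigma(s0 - h)) / (2 * h)   # finite difference
dSigma_an = A @ Sigma(s0) - Sigma(s0) @ A               # commutator formula
print(np.abs(dSigma_fd - dSigma_an).max())              # numerically tiny
```

The eigenvalues of Σ(s) stay fixed and positive while the eigenvectors rotate, so all the stimulus information lives in the orientation of the covariance.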
S4 Cubic codes
In Figure 6 we assume the brain encodes the stimulus using a cubic code. A simple cubic code in z = (zi, zj, zk) ∈ ℝ³ can be written as

p(z|s) ∝ exp[H(s) zizjzk + b(z)],

where we include a base measure b(z) to ensure the distribution is bounded and normalizable (Figure 6A).
For mathematical convenience, we approximate this code by a mixture of four Gaussians,

p(z|s) ≈ ¼ Σa N(z|µa, Σa),

whose component means µa lie along the four corners va of a tetrahedron, va,i = ±1, to match the tetrahedral symmetry of the pure cubic code (Equation 76, Figure 6). To sample from this distribution, we randomly choose a component a and then sample from the Gaussian N(z|µa, Σa) conditioned on that component.
This distribution has zero mean and identity covariance but a nontrivial skewness tensor, and qualitatively matches the corresponding distribution for the true exponential family distribution with cubic sufficient statistics (Figure 6).
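A sketch of sampling from such a mixture: component means sit at the four sign patterns with product +1 (a tetrahedron's corners), and the scale m and isotropic component noise σ are our own choices with m² + σ² = 1, so that the mixture has zero mean and identity covariance while retaining a nonzero third moment ⟨z1z2z3⟩ = m³.

```python
import numpy as np

rng = np.random.default_rng(7)

# Four tetrahedral corners: sign patterns whose product v1*v2*v3 = +1.
corners = np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]], float)
m, sigma = 0.8, 0.6          # chosen so m**2 + sigma**2 = 1 (unit variance)

T = 200_000
a = rng.integers(0, 4, T)                             # pick a component
z = m * corners[a] + sigma * rng.standard_normal((T, 3))

print(z.mean(0))                               # ~ 0 (first moments vanish)
print(np.cov(z.T))                             # ~ identity (second moments)
print((z[:, 0] * z[:, 1] * z[:, 2]).mean())    # ~ m**3: the cubic statistic
```

Linear and quadratic statistics of these samples carry no signal; only the third-order statistic distinguishes this distribution from an isotropic Gaussian, mimicking a purely cubic code.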
For simplicity, we consider pure cubic codes with non-overlapping cliques of three variables.
To convert this purely cubic distribution into a distribution with linear and quadratic information as well, we simply shift and scale the distribution in a manner dependent on s:

r = f(s) + Σ(s)^{1/2} z.
These affine transformations can be incorporated directly into the means and covariances of each component of the mixture of Gaussians.
Note that the linear and quadratic information terms are independent of the component a.
S5 Using nonlinear choice correlation to analyze unknown nonlinearities
The true nonlinearity that the brain uses to estimate the stimulus is unknown. Thus a crucial question in our decoding analysis is, which nonlinearities to consider? One reasonable set is polynomials in r, i.e. a Taylor series expansion of the neural nonlinearities, Ψ(r) = (ri, rirj, rirjrk, …).
The locally optimal decoder is a weighted sum of the sufficient statistics R(r) (Equation 46): ŝ = s0 + wᵀ(R(r) − F(s0)), with weights w ∝ Γ⁻¹F′.
However, the brain might instead decode a weighted sum of a different nonlinear basis g(r).
As long as the brain’s nonlinear functions span the same function basis as the sufficient statistics, the decoder can still extract all of the information about the stimulus from the neural population. This allows us to use choice correlations between the brain’s estimate and our analysis nonlinearity Ψ(r) to check the optimality condition (Equation 7).
In Figure 4, we assumed that the optimal nonlinear basis R comprises polynomials up to third order, R(r) = (ri, rirj, rirjrk, …). We used the cubic codes described in Methods 4.1.4 to generate neural responses for which R(r) are sufficient statistics for the stimulus. In this simulation, 18 neuronal responses (six cliques of size 3) were generated using cubic codes.
Our model brain decodes the stimulus using a cascade of linear-nonlinear transformations, with Rectified Linear Units (ReLU(x) = max(0, x)) as the nonlinear activation functions. We used a fully-connected ReLU network with two hidden layers and 30 units per hidden layer, ŝ = vᵀh(2), where h(ℓ) = ReLU(W(ℓ)h(ℓ−1) + b(ℓ)) and h(0) = r.
We trained the neural network with 20000 response samples generated from a cubic code driven by stimuli near the reference s0. We optimized the estimation performance for the neural network using backpropagation to find weights {W(𝓁)}, biases {b(𝓁)}, and readout vector v that minimized the mean squared error. Our trained neural network performed near-optimally, extracting 91% of the Fisher information compared to optimal decoding based on the true sufficient statistics.
Feigning ignorance of our simulated brain’s true decoder, we used monomial nonlinearities Ψ(r) in our nonlinear choice correlation test (Equation 7). The simulated choice correlations were calculated from Equation 5, with R(r) = Ψ(r), based on neural responses driven by the reference stimulus s0, with the stimulus estimate taken from the model brain’s network output. The optimal choice correlation was computed using Equation 7. We computed ΔFΨ from neural population responses r± driven by stimuli s± = s0 ± Δs/2: the change in mean was ΔFΨ = ⟨Ψ(r+)⟩ − ⟨Ψ(r−)⟩, and the standard deviation σΨ was averaged across the two conditions; σŝ² is the variance of the trained network’s estimate of the reference stimulus s0. Based on these quantities, Figure 4 shows that we can successfully identify that the brain is near-optimal.
S6 Information-limiting correlations
Information-limiting correlations can ultimately be referred back to the stimulus, appearing as r ∼ p(r|s + ds), where ds is zero-mean noise with variance 1/J∞, which determines the asymptotic uncertainty about the stimulus. Applying the law of total covariance, we can decompose the covariance of the nonlinear statistics R(r) conditioned on the stimulus into two parts:

Γ = Cov(R(r)|s) = ⟨Γ(s + ds)⟩p(ds) + Cov(⟨R|s, ds⟩)p(ds),   (91)

where ⟨·⟩p indicates an expectation value over the distribution p. The first term can be computed as follows:

⟨Γ(s + ds)⟩p(ds) ≈ Γ(s) ≡ Γ0.
Here we denote the covariance of R(r) given s and ds as Γ(s + ds); a Taylor expansion of Γ(s + ds) around s, together with the fact that the mean of ds is zero, leaves Γ0, the covariance of R in the absence of information-limiting correlations, up to corrections of order Var(ds). The second term in Equation 91 can be expressed as

Cov(⟨R|s, ds⟩) ≈ F′F′ᵀ Var(ds) = F′F′ᵀ/J∞.
Here we have written the mean of R(r) given s and ds as F(s + ds); a first-order expansion of F(s + ds) around s, and the fact that the variance of ds is 1/J∞, give this result: a rank-one perturbation of the covariance Γ0, so that Γ = Γ0 + F′F′ᵀ/J∞.
To compute the nonlinear Fisher information, JR(r) = F′ᵀΓ⁻¹F′, we can use the Sherman–Morrison lemma to compute Γ⁻¹:

Γ⁻¹ = Γ0⁻¹ − (Γ0⁻¹F′F′ᵀΓ0⁻¹)/(J∞ + F′ᵀΓ0⁻¹F′).
Substituting these equations into the nonlinear Fisher information (Equation 13) and simplifying, we obtain

JR(r) = J0J∞/(J0 + J∞).
Here J0 = F′ᵀΓ0⁻¹F′ is the nonlinear Fisher information in the absence of information-limiting correlations. When the population size grows, the term J0 grows proportionally [16, 29], so for large populations the output information saturates at J∞.
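The Sherman–Morrison step used in this derivation can be verified numerically on a small random instance (the dimensions and values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(8)

# Sherman-Morrison for a rank-one update:
#   (Gamma0 + u u^T / J_inf)^{-1}
#     = Gamma0^{-1} - Gamma0^{-1} u u^T Gamma0^{-1} / (J_inf + u^T Gamma0^{-1} u)
K = 6
B = rng.standard_normal((K, K))
Gamma0 = B @ B.T + np.eye(K)           # positive-definite base covariance
u = rng.standard_normal(K)             # plays the role of F'
J_inf = 50.0

Gamma = Gamma0 + np.outer(u, u) / J_inf
G0inv = np.linalg.inv(Gamma0)
sm = G0inv - (G0inv @ np.outer(u, u) @ G0inv) / (J_inf + u @ G0inv @ u)
print(np.abs(sm - np.linalg.inv(Gamma)).max())   # round-off level only
```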
S7 Nonlinear choice correlation for suboptimal decoding
A decoder that would be suboptimal for one population code could be near-optimal in the presence of information-limiting noise. In this case, nonlinear choice correlations can be decomposed into a sum of two terms, one from the information-limiting component and the other from the rest of the noise [28]:

Ck = Cov(Rk, ŝ)/(σRk σŝ) = (Γ0w)k/(σRk σŝ) + F′k/(J∞ σRk σŝ).   (103)
For unbiased decoding, wᵀF′ = 1. Some manipulation then expresses these two terms using Γ0k = (Γ0)kk ≈ Γkk, valid for small information-limiting noise variance 1/J∞ ≪ Γ0k (which can nonetheless have a large effect on information despite its small variance), and σ0ŝ, the standard deviation of the estimate produced by the same suboptimal decoder w in the absence of information-limiting correlations, i.e. when the covariance of the sufficient statistics is Γ0. The variance of ŝ can itself be decomposed into two terms as well: σŝ² = σ0ŝ² + 1/J∞, where we again assume unbiased decoding, wᵀF′ = 1. This expression allows us to represent the ratio σ0ŝ/σŝ in terms of the fraction of estimator variance not explained by the information-limiting noise.
Substituting these into Equation 103, we find that the choice correlation for a suboptimal decoder in the presence of information-limiting correlations is a weighted sum of the choice correlations for optimal and suboptimal decoding, with weights determined by α = (1/J∞)/σŝ², the fraction of the estimator variance explained by the information-limiting noise.
Here Ck,sub and Ck,opt are, respectively, the choice correlations for suboptimal decoding without information-limiting noise (so Γ = Γ0), and the choice correlations for optimal decoding.
The slope α between choice correlations and those predicted from optimal decoding is equal to the fraction of estimator variance explained by information-limiting noise. This slope therefore provides an estimate of the efficiency of the brain’s decoding.
S8 Comparing choice correlations from internal or external noise
The response covariance that drives fluctuations in choices could arise from internal or external (nuisance) variability, or both. Choice correlations predicted for optimal decoding differ depending on whether we condition on the nuisance variables or not. In the main text, we described optimal choice correlations under the distribution p(r|s). This includes variations caused by external nuisance variables, which is sensible since this is what the brain’s decoder must handle. However, it is also potentially informative to examine how purely internal variability correlates with choice, as this is often how choice correlations are assessed. In this section, we derive the choice correlations driven by purely internal noise, for a decoder that has learned to remove external nuisance variation as well.
For simplicity we assume that the nonlinear sufficient statistics R(r) are linearly tuned to both the stimulus s and a scalar nuisance variable n,

R(r) = F′s + G′n + η,

where F′ and G′ characterize the sensitivity of R(r) to the stimulus s and nuisance n, and the internal noise source η has zero mean with covariance H. We assume the brain has a prior over the nuisance variation, p(n), with zero mean and variance ξ. The total covariance from internal and external fluctuations is then

Γ = H + ξG′G′ᵀ.
When we measure choice correlations while fixing the nuisance variables in the experiment, we assume the brain retains the decoding strategy that accounts for both internal noise and unknown nuisance variation, rather than the optimal strategy for a fixed and known nuisance. These decoding weights are

w = Γ⁻¹F′/J1,   (109)

where the denominator J1 = F′ᵀΓ⁻¹F′ is the Fisher information about s when there is natural nuisance variation following p(n). For distributions in the exponential family, this information saturates the Cramér–Rao bound on an estimator’s variance, so that Var(ŝ) = 1/J1 [69]. The normalization by J1 ensures the decoding is locally unbiased. These weights are used to estimate the stimulus according to

ŝ = s0 + wᵀ(R(r) − F(s0)).   (110)
Choice correlations in this fixed-nuisance experiment will be denoted by a lowercase c:

ck = Corr(Rk, ŝ|s, n).   (111)
We use the lowercase notation as a reminder that these choice correlations need not follow the optimal pattern, since the decoder here is not matched to purely internal variability.
We can express these choice correlations as

ck = Cov(Rk, ŝ|s, n)/(σRk|n σŝ|n),   (112)

where the variances are computed with the nuisance held fixed.
The covariance between ŝ and R under fixed nuisance is

Cov(R, ŝ|s, n) = Hw = HΓ⁻¹F′/J1.   (113)
For the scalar nuisance variable we assume here, we can use the Sherman–Morrison lemma to decompose the inverse of the total covariance Γ = H + ξG′G′ᵀ into a rank-one perturbation of the inverse internal-noise covariance:

Γ⁻¹ = H⁻¹ − ξH⁻¹G′G′ᵀH⁻¹/(1 + ξG′ᵀH⁻¹G′).
Substituting this inverse covariance into Equation 113, we obtain

Cov(R, ŝ|s, n) = (F′ − ξG′(G′ᵀH⁻¹F′)/(1 + ξG′ᵀH⁻¹G′))/J1.
This last expression can be rewritten using elements of the Fisher information matrix for joint estimation of the signal and nuisance variables, whose inverse bounds the covariance of any joint estimator:

J11 = F′ᵀH⁻¹F′, J12 = F′ᵀH⁻¹G′, J22 = G′ᵀH⁻¹G′.
With these substitutions, we have

Cov(R, ŝ|s, n) = (F′ − ξJ12G′/(1 + ξJ22))/J1.
The denominator of Equation 112 involves the standard deviation of the sufficient statistics under fixed nuisance, σRk|n = √Hkk, and the standard deviation of the brain’s estimate, σŝ|n = (wᵀHw)^{1/2}.
Combining the results from Equations 122, 126, and 123, we can compute Equation 112.
The optimal choice correlation when there is natural nuisance variation (Eq 7) is given by

Ck,opt = √(Jk/J1), with Jk = F′k²/Γkk,   (134)

where Jk is the Fisher information in Rk about s when there is natural nuisance variation, and √Γkk is the standard deviation of the statistic Rk, again when there is natural nuisance variation.
The choice correlations for the same decoder differ under experimental conditions with and without nuisance variation. We find that the nuisance-conditioned choice correlations ck relate to the optimal nuisance-averaged choice correlations Ck,opt according to

ck = βk Ck,opt + γk,   (135)

where the slope βk and offset γk are constants that depend on the internal noise, the nuisance variance ξ, and the signal-nuisance interaction J12.
The slope βk and offset γk of the relationship between these two types of choice correlations (Equation 135) depend on the amount of nuisance variation compared to internal noise and on the suboptimality of the brain’s decoding strategy. When the signal and nuisance can be disentangled, that is, estimated nearly independently using the statistics R(r), then J12 is small and the choice correlations driven purely by internal fluctuations closely match the optimal choice correlations in the presence of nuisance variation (Figure 7A). In contrast, when nuisance variations remain partially confounded with the signal, then J12 is large and the choice correlations for fixed nuisance variables may differ from the optimal pattern seen when allowing nuisance variables to change from trial to trial (Figure 7B).

For the simulations in Figure 7, we set the sufficient statistics to be linear, R(r) = r, for simplicity. Neural responses were generated from a Gaussian distribution with a stimulus-dependent mean and identity covariance H = I: p(r|s, n) = N(F′s + G′n, I). In Figure 7A, F′ and G′ are set to be orthogonal to ensure J12 = F′ᵀH⁻¹G′ = 0; they are chosen from the eigenvectors of a symmetric matrix AᵀA, where the elements of A are drawn from a uniform distribution on [0, 1]. In Figure 7B, each element of F′ and G′ is drawn from a uniform distribution on [0, 1]. We simulate 10000 responses of a population of N = 50 neurons. The stimulus is set to 0 and the nuisance is fixed at 1. The brain’s decoder assumes a Gaussian prior over the nuisance variation with zero mean and variance ξ = 2. The decoding weights follow Equation 109, and the stimulus is estimated using Equation 110. Choice correlations in this fixed-nuisance experiment are computed by Equation 111 (vertical axis in Figure 7), and the predicted optimal choice correlation is computed by Equation 134 (horizontal axis in Figure 7). In this setting, βk ≈ 1 when J12 = 0.
In this context, it is especially noteworthy that a mismatch between choice correlations and the optimal pattern might not indicate that the brain is suboptimal, but instead that the experimental task may not match the natural tasks for which the brain could have been optimized.
Acknowledgements
The authors thank Jeff Beck, Valentin Dragoi, Arun Parajuli, Andreas S. Tolias, Edgar Walker, R. James Cotton and Alex Pouget for helpful conversations. This work was supported by NSF CAREER grant 1552868 and NeuroNex grant 1707400 to XP.
References