Abstract
The inherent uncertainty of the world suggests that optimally-performing brains should use probabilistic internal models to represent it. This idea has provided a powerful explanation for a range of behavioural phenomena. But describing behaviour in probabilistic terms is not strong evidence that the brain itself explicitly uses probabilistic models. We sought to test whether neurons represent such models in higher cortical regions, learn them, and use them in behaviour. Using a sampling framework, we predicted that trial-evoked and sleeping population activity represent the inferred and expected probabilities generated from an internal model of a behavioural task, and would become more similar as the task was learnt. To test these predictions, we analysed population activity from rodent prefrontal cortex before, during, and after sessions of learning rules on a Y-maze. We found that population activity patterns occurred far in excess of chance on millisecond time-scales. During successful learning, distributions of these activity patterns increased in similarity between trials and post-learning sleep as predicted. Learning-induced changes were greatest for patterns expressed at the maze’s choice point and predicting correct choice of maze arm to obtain reward, consistent with an updated internal model of the task. Our results suggest sample-based internal models are a general computational principle of cortex.
Author Summary The cerebral cortex contains billions of neurons. The activity of one neuron is lost in this morass, so it is thought that the co-ordinated activity of groups of neurons – “neural ensembles” – is the basic element of cortical computation, underpinning sensation, cognition, and action. But what do these ensembles represent? Here we show that ensemble activity in rodent prefrontal cortex represents samples from an internal model of the world – a probability distribution over its possible states. We find that this internal model is updated during learning about changes to the world, and is sampled during sleep. These results suggest that probability-based computation is a generic principle of cortex.
Introduction
How do we know what state the world is in? Behavioural evidence suggests brains solve this problem using probabilistic reasoning [1, 2]. Such reasoning implies that brains represent and learn internal models for the statistical structure of the external world [1, 3, 4]. With these models, neurons could represent uncertainty about the world with probability distributions, and update those distributions with new knowledge using the rules of probabilistic inference. Theoretical work has elucidated potential mechanisms for how cortical populations represent and compute with probabilities [5–9], and shown how computational models of inference predict aspects of cortical activity in sensory and decision-making tasks [2, 9, 10]. But experimental evidence that neurons represent probabilistic internal models is lacking. And it is unknown if internal models are specific to sensorimotor control [2, 3, 9, 11] or are a general computational principle of cortex.
A natural candidate to address these issues is the medial prefrontal cortex (mPFC). Medial PFC is necessary for learning new rules or strategies [12, 13], and changes in mPFC neuron firing times correlate with successful rule learning [14], suggesting that mPFC coding of task-related variables changes over learning. We thus hypothesised that mPFC encodes an internal model of a task, which is updated by task performance, and from which population activity is generated.
To test these hypotheses, we analysed population activity from the mPFC of rats learning rules in a Y-maze [15]. We show here that moment-to-moment patterns of spiking activity in mPFC populations are highly similar between sleeping and behavior, but the rate of occurrence of individual patterns changes during learning. Consequently, activity patterns in sleep and in trials with correct performance converge. Patterns that change their occurrence rate are predictive of task performance, consistent with their being samples from an internal model [11]. These findings suggest mPFC represents and updates a sample-based internal model of the maze rules.
Results
Rats with implanted tetrodes learnt one of three rules on a Y-maze: go left, go right, or go to the randomly-lit arm (Fig. 1A). Each recording session was a single day containing 3 epochs typically totalling 1.5 hours: pre-task sleep/rest, behavioural testing on the task, and post-task sleep/rest. We focussed on ten sessions where the animal reached the learning criteria for a rule mid-session (Materials and Methods; 15-55 neurons per session). In this way, we sought to isolate changes in population activity solely due to rule-learning.
Theoretical predictions
An experimentally-accessible proposal for testing the hypothesis of neural internal models is the recent inference-by-sampling hypothesis [7, 11, 16, 17]. This proposes that cortical population activity at some time t is a sample from an underlying probability distribution. Cortical activity evoked by external stimuli represents sampling from the model-generated “posterior” distribution that the world is in a particular state. Spontaneous cortical activity represents sampling of the model in the absence of external stimuli, forming a model-generated “prior” for the expected properties of the world. A key prediction is that the evoked and spontaneous population activity should converge over repeated experience, as the internal model adapts to match the relevant statistics of the external world. Just such a convergence has been observed in small populations from ferret V1 over development [16]. Unknown is whether this framework generalises to higher cortices and learning.
We derived theoretical predictions for changes in mPFC population activity from the inference-by-sampling hypothesis (schematically illustrated in Fig. 1B; see S1 File for an extended account). We sought to test the idea that the mPFC contains at least one internal model related to task performance, such as representing the relevant decision-variable (here, left or right) or the rule-dependent outcomes. Learning of the task should therefore update the internal model based on feedback from each trial’s outcome. We theorised that mPFC population activity on each trial was sampling from the posterior distribution generated from this model; and that “spontaneous” activity in slow-wave sleep (SWS), occurring in the absence of task-related stimuli and behaviour, samples the corresponding prior distribution (Fig. 1B). Consequently, updating the internal model from task feedback should be reflected in changes to the posterior and prior distributions generated from that model.
By restricting our analyses to sessions with successful learning, we expected the post-task SWS activity to be sampling from an internal model that has learnt the correct rule. To compare posterior distribution samples from the same internal model, we considered population activity during correct trials after the learning criteria were met – we call this distribution P(R). Our main prediction was thus that the distribution P(R) of activity during the correct trials would be more similar to the distribution in post-task SWS [P(Post)] than in pre-task SWS [P(Pre)]. Such a convergence of distributions would be evidence that a task-related internal model in mPFC was updated by feedback.
Millisecond precision spike correlation patterns in mPFC
Following previous work [7, 16–18], we defined the samples as population-wide activity patterns on millisecond time-scales. Activity patterns were characterised as a binary vector (or “word”) of active and inactive neurons within some small time window (Fig. 2A). Statistical structure at millisecond time-scales has been characterised for populations in the retina [18–21] and primary visual cortex [16, 22], but not for higher-order cortices. We thus first demonstrate that mPFC activity patterns on millisecond time-scales contain above-chance statistical structure.
We were primarily interested in co-activation patterns of more than one neuron firing together, as the relative occurrence of patterns with single neurons (a single “1”) represents firing rates. We thus first determined the time-scales at which co-activation patterns appear. Figure 2B shows that at low millisecond time-scales, the proportion of activity patterns containing co-active neurons increases by an order of magnitude when doubling the binsize. The smallest binsize with a non-negligible proportion of co-activation patterns was 2 ms, with ∼ 1% (89731/7452300) of all patterns. This was also true for each epoch considered separately (Fig. 2C-E). We thus used a 2 ms binsize throughout, as this was the smallest time-scale with consistent co-activation patterns.
Such co-activation patterns could be due to persistent, precise correlations between spike-times in different neurons, or just due to coincident firing of otherwise independent neurons. We found that the proportion of co-activation patterns in the data exceeded those predicted for independent neurons by a factor of 3 (Fig. 2B) at low millisecond time-scales. This was also true for each separate epoch (Fig. 2C-E), extending up to a factor of at least 6 for the task trials (Fig. 2D), ruling out the possibility that the excess of precise correlations was due to differences in brain state.
Our hypothesis that sleep and waking states respectively represent a prior and posterior distribution from the same model requires not just precise patterns, but largely the same patterns. If the set of patterns markedly differed between waking and sleep, then it would be implausible that they were drawn from the same underlying model. We found that each recorded population of N neurons had the same sub-set of all 2^N possible activity patterns in all epochs (Fig. 2B). Such a common set of patterns is consistent with their being samples generated from the same form of internal model across both behaviour and sleep.
Together, these results show that there is above-chance statistical structure in mPFC population activity, on time-scales of milliseconds, and with consistent patterns of activity between sleeping and waking epochs. Consequently, the population activity here is consistent with the requirements of samples from an underlying probability distribution.
Activity distributions converge between task and post-task sleep
To test our theoretical predictions, we compared the statistical distributions of activity patterns between task and sleep epochs. If activity patterns are samples from a probability distribution, then two similar probability distributions will be revealed by the similar frequencies of sampling each pattern [16]. For each pair of epochs, we thus computed the distances between the two corresponding distributions of activity patterns (Fig. 3). We first used the information-theoretic Kullback-Leibler divergence to measure the distance D(P|Q) between distributions P and Q in bits [16]. We found that in 9 of the 10 sessions the distribution P(R) of activity during the trials was closer to the distribution in post-task SWS [P(Post)] than in pre-task SWS [P(Pre)] (Fig. 4A).
On average the task-evoked distribution of patterns was 18.7±6.2% closer to the post-task SWS distribution than the pre-task SWS distribution (Fig. 4B), showing a convergence between task-evoked and post-task SWS distributions. Further, we found a robust convergence even at the level of individual sessions (Fig. 4C). The convergence was also robust to the choice of correct trials in the task distribution P(R) (S1 Fig).
Together, these results are consistent with the convergence over learning of the posterior and prior distributions represented by mPFC population activity. They imply that mPFC encodes a task-related internal model that is updated by task feedback.
Robustness of the convergence
It seems remarkable that the sampling of temporally precise population activity patterns in prefrontal cortex could systematically change during learning. To check the robustness of this result, we recomputed all analyses twice. We used a different distance measure to check that the results were stable to issues in estimating the probability distributions; and we checked whether our results were dependent on our choice of binsize.
While the Kullback-Leibler divergence provides the most complete characterisation of the distance between two probability distributions, estimating it accurately from limited sample data has known issues [23]. To check our results were robust, we re-computed all distances using the Hellinger distance, a non-parametric measure that provides a lower bound for the Kullback-Leibler divergence. Reassuringly, we found the same results: the distribution P(R) of activity during the trials was consistently closer to the distribution in post-task SWS [P(Post)] than in pre-task SWS [P(Pre)] (Fig. 4F-H; the mean convergence between task-evoked and post-task SWS distributions was 21 ± 2.8%).
We found that the convergence between the task P(R) and post-task SWS P(Post) distributions was robust to the choice of activity pattern binsize across an order of magnitude from 2 to 20 ms (Fig. 5). Our results thus do not depend on some arbitrary choice of binsize. Above a binsize of 50 ms, convergence was statistically indistinguishable from zero, meaning that the pre- and post-task SWS distributions are equidistant, on average, from the task distribution. This suggests that the behaviourally relevant time-scales for activity patterns are indeed on the order of a few milliseconds.
Wrong models lead to no convergence
Is this convergence of trial-evoked and post-task SWS distributions inevitable? To answer this, we made use of the 7 sessions in which the rats experienced a rule change. As rule changes occurred only after 10 consecutive correct trials [15], these sessions are uniquely divided into periods when the internal model of the task was right and when it was wrong. Once wrong, the rat needed to find the correct new model. Consequently, our theory predicts that the prior distribution in post-task SWS sleep is not derived from the same internal model as that used before the rule change. In other words, the posterior from the pre-change trials and the prior from the spontaneous activity of post-task sleep are derived from different models, and should not converge.
We tested this prediction by comparing the activity pattern distributions in pre-change correct trials [P(R*)] and in post-task SWS [P(Post)]. There was no convergence between the two distributions, when measured using either the Kullback-Leibler divergence (3.6 ± 11.7%; Fig. 4D-E) or the Hellinger distance (3.8 ± 9.3%; Fig. 4I-J). For the effect sizes observed for the learning sessions, there was sufficient power to recover the same effect size at α = 0.05 with N = 7 sessions (KLD: learning session effect size d = 0.96, rule-change session power = 0.7; Hellinger: d = 2.36, power ≈ 1), which argues against low power causing the lack of convergence.
Convergence is a consequence of changes to correlations, not just firing rates
Our convergence was measured across a change in brain state between waking and sleeping. While within each state the occurrence of co-activation patterns exceeds chance by an order of magnitude (Fig. 2C-E), this still leaves open the possibility that the change in population firing rates between states could artificially cause their activity pattern distributions to increase in similarity [24, 25]. To control for this, we used the “raster” model [24] to generate surrogate sets of spike-trains that matched both the mean firing rates of each neuron, and the distribution of total population activity in each time-bin (K = 0, 1, …, N spikes per bin). Consequently, the occurrence rates of particular activity patterns in the raster model are those predicted to arise from neuron and population firing rates alone.
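A raster-style surrogate of this kind can be sketched as follows. This is a minimal illustration, not the published fitting procedure: we assume the per-bin population count K is preserved exactly, while the identities of the K active neurons are re-drawn in proportion to each neuron's overall firing rate.

```python
import numpy as np

def raster_model(words, rng=None):
    """Surrogate patterns matching the per-bin population count exactly
    and each neuron's firing rate approximately (a sketch; the published
    raster model may be fitted differently).

    words: (n_bins, n_neurons) binary matrix of observed patterns.
    """
    rng = np.random.default_rng(rng)
    n_bins, n_neurons = words.shape
    rates = words.mean(axis=0)            # per-neuron firing probability
    p = rates / rates.sum()               # sampling weights for identities
    surrogate = np.zeros_like(words)
    for t in range(n_bins):
        k = int(words[t].sum())           # population count in this bin
        if k:
            # re-draw which k neurons are active, weighted by their rates
            active = rng.choice(n_neurons, size=k, replace=False, p=p)
            surrogate[t, active] = 1
    return surrogate
```

By construction, any excess similarity between data distributions beyond that of the surrogate distributions must come from correlations between specific neurons, not from rates alone.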
We fitted the raster model to the post-session SWS neuron and population firing rates, and compared the data-derived distance D(Post|R) with the model-derived distance D(Post−model|R). If the change in population firing rate during SWS caused the convergence, then the raster model should exactly capture the statistics of the SWS firing and we would obtain D(Post|R) ≈ D(Post−model|R).
We found that firing rates could not account for the convergence between task and post-task SWS distributions. The data-derived distance D(Post|R) was always smaller than the distance D(Post−model|R) predicted by the raster model (Fig. 6A). This was true whether we used the Kullback-Leibler divergence or the Hellinger distance (Fig. 6C) to measure distances between distributions. Consequently, the convergence between the task and post-task SWS distributions is due to the selective repeat of specific activity patterns.
Our activity patterns were built from single units, unlike previous work using multi-unit activity [16, 18, 20, 24, 26], so we expected our patterns to be sparse with rare synchronous activity. Indeed our data are dominated by activity patterns with no spikes or 1 spike (Fig. 2B-E; we break down the distributions at 2 ms in S2 Fig). If all patterns had only 0 or 1 spike, then the raster model spike trains would be exactly equivalent to the data. Given the relative sparsity (∼ 1%) of co-activation patterns in our data, it is all the more surprising then that we found such a consistently lower distance for our data-derived distributions.
It follows that the true difference between data and model is in the relative occurrence of co-activation patterns. To check this, we applied the same analysis to distributions built only from these co-activation patterns, drawn from data and from the raster model fitted to the complete data. We found that the data-derived distance D(Post|R) was always smaller than the distance D(Post−model|R) predicted by the raster model (Fig. 6B-D). Across all sessions, the model-predicted distance D(Post−model|R) was between 3% and 46% greater than the data-derived distance D(Post|R) using the Kullback-Leibler divergence, indicating that much of the convergence between task and SWS distributions could not be accounted for by firing rates alone. These results confirm that the convergence between the task and post-task SWS distributions is due to the selective repeat of specific co-activation patterns.
Reassuringly, for these distributions of co-activation patterns, all convergence results held (S3 Fig) despite the order-of-magnitude fewer sampled patterns compared to the full dataset.
Convergence is not a recency effect
We examined periods of SWS in order to most likely observe the sampling of a putative internal model in a static condition, with no external inputs and minimal learning. But as correct task trials more likely occur towards the end of a learning session, this raises the possibility that the closer match between task and post-task SWS distributions is a recency effect, due to some trace or reverberation in sleep of the most recent task activity.
The time-scales involved make this unlikely. Bouts of SWS did not start until typically 8 minutes after the end of the task (mean 397 s, S.D. 188 s; Fig. 7A). Any reverberation would thus have to last at least that long to appear in the majority of post-task SWS distributions.
The intervening period before the first bout of SWS contains quiet wakefulness and early sleep stages. If convergence was a recency effect, then we would expect that distributions [P(Rest)] of activity patterns in this more-immediate “rest” epoch would also converge with the task distributions. We did not find this: across sessions, there was no evidence that the distribution in post-task rest [P(Rest)] consistently converged with the distribution during task trials [P(R)] (Fig. 7B-C; mean convergence was −8.7 ± 18.7%). Thus the observed convergence is inconsistent with a recency effect.
Distributions are updated by task-relevant activity patterns
The above analysis rests on the idea that the distributions of activity patterns are derived from an internal model of the task. This predicts that individual patterns should correlate with some aspect of the task. We sought an unbiased way of testing this prediction, so considered the following. In our theory, the changes to the internal model over learning should be directly reflected in the differences between the prior distributions before and after learning. Consequently, if we compare the sampling of activity patterns in pre-task sleep to sampling in post-task sleep, then any patterns with changes in their sampling should be from the updated model. This means that these patterns should encode some aspect of the task.
Remarkably, this is exactly what we found. For each co-activation pattern, we found its ability to predict a trial’s outcome by its rate of occurrence on that trial (Fig 8A). When we compared this outcome prediction to the change in sampling between pre- and post-task sleep, we found a strong correlation between the two (Fig. 8B-D). This correlation was highly robust (Fig. 8E-G). The learnt internal model, as evidenced by the updated patterns sampled from it, was seemingly encoding the task.
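One way to quantify a single pattern's outcome prediction from its per-trial occurrence rates is a rank-based ROC area; this is a sketch under that assumption, as the paper's exact measure is not specified here.

```python
import numpy as np

def outcome_prediction(rates, outcomes):
    """Predictive power of one pattern's occurrence rate for trial outcome.

    A rank-based sketch (assumption, not the original measure): returns
    the Mann-Whitney ROC area, i.e. the probability that a randomly
    chosen correct trial has a higher occurrence rate than a randomly
    chosen error trial; 0.5 means no prediction, 1.0 perfect prediction.
    """
    rates = np.asarray(rates, dtype=float)
    outcomes = np.asarray(outcomes)
    pos, neg = rates[outcomes == 1], rates[outcomes == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

Correlating this per-pattern score with the change in each pattern's sampling probability between pre- and post-task sleep then gives the kind of comparison shown in Fig. 8B-D.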
Outcome-predictive patterns occur around the choice point
Consistent with the internal model being task-related, we further found that the outcome-predictive activity patterns preferentially occurred around the choice point of the maze (Fig. 9). Particularly striking was that patterns strongly predictive of outcome rarely occurred in the starting arm (Fig. 9A). Together, the selective changes over learning to outcome-specific (Fig. 8) and location-specific (Fig. 9) activity patterns show that the convergence of distributions (Fig. 4) is not a statistical curiosity, but is evidence for the updating of a behaviourally-relevant internal model.
Discussion
We find that moment-to-moment patterns of mPFC population activity change their sampling rates during learning of a spatial navigation task. Consequently, the statistical distributions of patterns in spontaneous and task-evoked activity converge. These changes match our theoretical predictions from a probabilistic sampling framework. Our analyses thus suggest mPFC encodes a probabilistic internal model of a task, which is updated by behavioural outcomes.
Prefrontal cortex has been implicated in both planning and working memory during spatial navigation [27–30], and executive control in general [31, 32]. Our results suggest a probabilistic basis for these functions. In particular, prefrontal cortex has been implicated in both the representation of current goals [33, 34] and strategies [35]. Both these functions are consistent with an internal model that relates sensory information to the statistical structure of the world, and the use of that model to plan behaviour.
How a cortical region encodes an internal model is an open question. A strong candidate is the relative strengths of the synaptic connections both into and within the encoding cortical circuit [7, 8, 11, 17]. The activity of a cortical circuit is strongly dependent on the pattern and strength of the connections between its neurons [36, 37]. Consequently, defining the underlying model as the circuit’s synaptic network allows both model-based inference through synaptically-driven activity and model learning through synaptic plasticity [11].
Sampling and replay
Our results are distinct from previous observations of task-specific replay during sleep in prefrontal cortex [38], including reports [15] using the same data analysed here. We observed a convergence of waking and sleep activity pattern distributions using precise activity patterns down to 2 ms resolution. Using surrogate models, we showed that the convergence is due to the persistence of specific correlations between neurons, rather than changes in firing rates. Finally, we showed that the updated activity patterns were just those that could predict task performance.
In contrast to the work here, replay accounts infer the recurrence of similar activity. They do not identify the changed patterns, nor relate them to task behaviour. Moreover, replay is described for coincident activity on coarse time-scales greater than those used here by a factor of 50 (ref. [15]) up to a factor of 10000 (ref. [38]). We found that there was no apparent convergence of activity pattern distributions when examining coincident activity at 50ms resolution and above (Fig. 5). Finally, our observation that patterns during correct performance on rule-change sessions are not specifically sampled in post-session SWS (Fig. 4D,E) is incompatible with the simple replay of experience-related activity in sleep. Current empirical observations of replay thus do not predict our observed changes in activity pattern distributions.
Activity distributions in sleeping and waking
Our theory proposes that spontaneous neural activity during sleep is sampling a prior distribution generated from an internal model. We found that the set of activity patterns was remarkably conserved between sleeping and behaviour (Fig. 2B), consistent with activity being generated from the same internal model in both states. This theory predicts that manipulating synaptic weights during sleep, changing the internal model, should change both the prior and the posterior distributions over task variables. Recent work has shown that inducing task-specific reward signals during sleep, likely altering synaptic weights, indeed immediately alters task behaviour on waking [39]. Our results thus suggest that casting sleeping and waking activity as prior and posterior distributions generated from the same internal model could be a fruitful computational framework for relating cortical dynamics to behaviour.
Population activity patterns constrain cortical dynamics
We used the inference-by-sampling framework to guide our analysis by deriving from it specific hypotheses to be tested. But even if the sampling interpretation of the activity patterns turns out to be wrong, our main observations provide constraints on theories of cortical coding and dynamics.
Studies of prefrontal cortex coding generally assume that information is conveyed by firing rates [29, 30, 40, 41] or rate correlations [27, 42, 43]. By contrast, here we show evidence of ensemble coding at highly precise time scales, of both outcome and position dependence. We found it remarkable that we could extract anything of interest at this resolution, and checked these results extensively, including the use of large-repeat permutation tests. Previously, such fine-scale structure of stimulus-evoked population activity patterns has only been observed in the retina during passive observation [18, 20, 21]. We extend these results to show that such fine time-scale correlation structure can be observed in cortical regions for executive control, and be evoked by tasks.
Previous studies have observed strong similarities between spontaneous and evoked firing rates [44–47] or firing sequences [48] in cortex. These findings imply that the underlying cortical circuit has similarly constrained dynamics in both spontaneous and evoked states [49]. Extending these results, we found a highly similar set of precisely-timed activity patterns across sleeping and task performance, which suggests that cortical population activity is underpinned by similar dynamics in both states, and those dynamics can reproduce patterns with high temporal precision. Maass and colleagues [7, 17] have shown that a range of cortical network models can produce specific distributions of such precise activity patterns, provided they have a source of noise (such as synaptic release failure) to produce stochastic wandering of the global activity level. Our data support these models, and suggest that global activity oscillations during slow-wave sleep [50, 51] do not prevent the stochastic sampling of activity patterns, providing a target for future modelling studies.
Materials and Methods
Task and electrophysiological recordings
The data analysed here were from ten learning sessions and seven rule change sessions in the study of [15]. For full details on training, spike-sorting, and histology see [15]. Four Long-Evans male rats with implanted tetrodes in prelimbic cortex were trained on the Y-maze task (Fig. 1A). Each recording session consisted of a 20-30 minute sleep or rest epoch (pre-task epoch), in which the rat remained undisturbed in a padded flowerpot placed on the central platform of the maze, followed by a task epoch, in which the rat performed for 20-40 minutes, and then by a second 20-30 minute sleep or rest epoch (post-task epoch). Every trial started when the rat reached the departure arm and finished when the rat reached the end of one of the choice arms. Correct choice was rewarded with drops of flavoured milk. Each rat had to learn the current rule by trial-and-error, either: go to the left arm; go to the right arm; go to the lit arm. To maintain consistent context across all sessions, the extra-maze light cues were lit in a pseudo-random sequence across trials, whether they were relevant to the rule or not.
We primarily analysed data from the ten sessions in which the previously-defined learning criteria were met: the first trial of a block of at least three consecutive rewarded trials after which the performance until the end of the session was above 80%. In later sessions the rats reached the criterion for changing the rule: ten consecutive correct trials or one error out of 12 trials. Thus each rat learnt at least two rules, with seven rule-change sessions in total.
Tetrode recordings were spike-sorted only within each recording session for conservative identification of stable single units. In the 17 sessions we analyse here, the populations ranged in size from 15-55 units. Spikes were recorded with a resolution of 0.1 ms.
Activity pattern distributions
For a population of size N, we characterised population activity from time t to t + δ as an N-length binary vector with each element being 1 if at least one spike was fired by that neuron in that time-bin, and 0 otherwise. In the main text we predominantly use a binsize of δ = 2 ms; Fig. 5 shows the robustness of the main results to the choice of binsize. We built patterns using the number of recorded neurons N, up to a maximum of 35 for computational tractability. The probability distribution for these activity patterns was compiled by counting the frequency of each pattern’s occurrence and normalising by the total number of pattern occurrences.
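This construction can be sketched in code; a minimal illustration, where function names and the per-neuron spike-time format are our own assumptions, not the original analysis code:

```python
import numpy as np

def bin_spikes(spike_times, n_neurons, t_start, t_stop, binsize=0.002):
    """Binarise spike trains into N-length activity patterns ("words").

    spike_times: list of per-neuron arrays of spike times (seconds).
    Returns an (n_bins, n_neurons) binary matrix: 1 where the neuron
    fired at least one spike in that bin, 0 otherwise.
    """
    n_bins = int(round((t_stop - t_start) / binsize))
    edges = t_start + binsize * np.arange(n_bins + 1)
    words = np.zeros((n_bins, n_neurons), dtype=np.uint8)
    for i, spikes in enumerate(spike_times):
        counts, _ = np.histogram(spikes, bins=edges)
        words[counts > 0, i] = 1
    return words

def pattern_distribution(words):
    """Empirical probability of each observed pattern."""
    patterns, counts = np.unique(words, axis=0, return_counts=True)
    return patterns, counts / counts.sum()
```

Counting occurrences of each unique word and normalising, as in `pattern_distribution`, yields the empirical distributions compared between epochs.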
To compute the proportion of co-activation patterns predicted for independently firing neurons, we shuffled inter-spike intervals for each neuron independently, then reconstructed the activity patterns at the chosen binsize. This procedure keeps the same inter-spike interval distribution for each neuron, but disrupts any correlation between neurons. As both the task and sleep epochs were broken up into chunks (of trials and SWS bouts, respectively), we only shuffled inter-spike intervals within each chunk. We repeated the shuffling 20 times, and in Fig. 2C-E we plot for the shuffled data the means and error bars of ± 2 s.e.m. (too small to see on the scales of the axes).
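The within-chunk shuffle can be sketched as follows (an illustrative implementation under the stated logic: the first spike time anchors the train, and the shuffled cumulative ISIs rebuild the remaining spikes):

```python
import numpy as np

def shuffle_isis(spike_times, rng=None):
    """Shuffle one neuron's inter-spike intervals within a chunk.

    Preserves the neuron's ISI distribution (and hence its firing rate)
    while destroying any precise correlation with other neurons.
    """
    rng = np.random.default_rng(rng)
    spike_times = np.asarray(spike_times, dtype=float)
    if spike_times.size < 3:
        return spike_times          # too few spikes to shuffle
    isis = rng.permutation(np.diff(spike_times))
    return spike_times[0] + np.concatenate(([0.0], np.cumsum(isis)))
```

Applying this independently per neuron and per chunk, then re-binarising, gives the independent-neuron control against which the excess of co-activation patterns was measured.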
Comparing distributions
We quantified the distance D(P|Q) between probability distributions P and Q using both the Kullback-Leibler divergence (KLD) and the Hellinger distance.
The KLD is an information-theoretic measure to compare the similarity between two probability distributions. Let P = (p1, p2, …, pn) and Q = (q1, q2, …, qn) be two discrete probability distributions, for n distinct possibilities – for us, these are all possible individual activity patterns. The KLD is then defined as d(P|Q) = Σi pi log2(pi/qi). This measure is not symmetric, so that in general d(P|Q) ≠ d(Q|P). Following prior work [16, 24], we thus compute and report the symmetrised KLD: D(P|Q) = (d(P|Q) + d(Q|P))/2.
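A plug-in sketch of the symmetrised KLD follows. As a simplification, it restricts to patterns with non-zero probability in both distributions and renormalises; the bias-corrected Bayesian estimator described in the text handles limited sampling more carefully.

```python
import numpy as np

def kld(p, q):
    """Plug-in estimate of d(P|Q) = sum_i p_i * log2(p_i / q_i), in bits.

    Simplification: restrict to patterns with non-zero probability in
    both distributions and renormalise (the full analysis uses a
    Bayesian, bias-corrected estimator instead).
    """
    p, q = np.asarray(p, float), np.asarray(q, float)
    keep = (p > 0) & (q > 0)
    p, q = p[keep] / p[keep].sum(), q[keep] / q[keep].sum()
    return float(np.sum(p * np.log2(p / q)))

def symmetrised_kld(p, q):
    """Symmetrised KLD: D(P|Q) = (d(P|Q) + d(Q|P)) / 2."""
    return 0.5 * (kld(p, q) + kld(q, p))
```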
There are 2N distinct possible activity patterns in a recording with N neurons. Most of these activity patterns are never observed, so we exclude the activity patterns that are not observed in either of the epochs we compare. The empirical frequency of the remaining activity patterns is biased due to the limited length of the recordings [23]. To counteract this bias, we use the Bayesian estimator and quadratic bias correction exactly as described in [16]. The Berkes estimator assumes a Dirichlet prior and multinomial likelihood to calculate the posterior estimate of the KLD; we use their code (github.com/pberkes/neuro-kl) to compute the estimator. We then compute a KLD estimate using all S activity patterns, and using S/2 and S/4 patterns randomly sampled without replacement. By fitting a quadratic polynomial to these three KLD estimates, we can then use the intercept term of the quadratic fit as an estimate of the KLD if we had access to recordings of infinite length [23, 52].
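The quadratic extrapolation step can be sketched as below: fit the KLD estimates at n = S, S/2, and S/4 sampled patterns as a quadratic in 1/n, and take the intercept at 1/n → 0 as the infinite-data estimate (a sketch of the correction described above; the function name is ours):

```python
import numpy as np

def extrapolate_kld(kld_full, kld_half, kld_quarter, S):
    """Quadratic extrapolation of the KLD to infinite data.

    Fits KLD as a quadratic in 1/n (n = number of sampled patterns) to
    the estimates at n = S, S/2, S/4; the intercept at 1/n = 0
    approximates the KLD for an infinitely long recording."""
    inv_n = np.array([1.0 / S, 2.0 / S, 4.0 / S])
    coeffs = np.polyfit(inv_n, [kld_full, kld_half, kld_quarter], 2)
    return coeffs[-1]  # intercept term of the quadratic fit
```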
The Hellinger distance for two discrete distributions P and Q is D(P|Q) = (1/√2)·√(Σi (√pi − √qi)²). To a first approximation, this measures for each pair of probabilities (pi, qi) the distance between their square-roots. In this form, D(P|Q) = 0 means the distributions are identical, and D(P|Q) = 1 means the distributions are mutually singular: all positive probabilities in P are zero in Q, and vice-versa. The squared Hellinger distance is a lower bound for the KLD: 2D(P|Q)² ≤ KLD.
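The Hellinger distance is straightforward to compute directly; a minimal sketch:

```python
import numpy as np

def hellinger(P, Q):
    """Hellinger distance between discrete distributions P and Q:
    (1/sqrt(2)) * ||sqrt(P) - sqrt(Q)||_2.
    0 for identical distributions, 1 for mutually singular ones."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return np.sqrt(0.5 * np.sum((np.sqrt(P) - np.sqrt(Q)) ** 2))
```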
To compare distances between sessions we computed a normalised measure of “convergence”. The divergence between a given pair of distributions could depend on many factors that differ between sessions, including that each recorded population was a different size, and how much of the relevant population for encoding the internal model we recorded. Consequently, the key comparison between the divergences D(Pre|R) − D(Post|R) also depends on these factors. To compare the difference in divergences across sessions, we computed a “convergence” score by normalising by the scale of the divergence in the pre-task SWS: (D(Pre|R) − D(Post|R))/D(Pre|R). We express this as a percentage. Convergence greater than 0% indicates that the distance between the task (R: correct trials) and post-task SWS (Post) distributions is smaller than that between the task and pre-task SWS (Pre) distributions.
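The convergence score itself is a one-line normalisation:

```python
def convergence(d_pre_R, d_post_R):
    """Convergence score (%): how much closer the post-task SWS
    distribution is to the task distribution (R) than the pre-task SWS
    distribution was, normalised by the pre-task distance."""
    return 100.0 * (d_pre_R - d_post_R) / d_pre_R
```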
Statistics
Quoted measurement values are means ± s.e.m. All hypothesis tests used the non-parametric Wilcoxon signed-rank test, as a one-sample test of whether the median over sessions is greater than zero. For learning sessions we have N = 10 sessions; for rule changes (Fig. 4) we have N = 7 sessions. Throughout we plot mean values and their approximate 95% confidence intervals given by ± 2 s.e.m.
Bootstrapped confidence intervals (in Fig. 4C,H) for each session were constructed using 1000 bootstraps of each epoch’s activity pattern distribution. Each bootstrap was a sample-with-replacement of activity patterns from the data distribution X to give a sample distribution X*. For a given pair of bootstrapped distributions X*, Y*, we then computed their distance D*(X*|Y*). Given both bootstrapped distances D*(Pre|R) and D*(Post|R), we then computed the bootstrapped convergence (D*(Pre|R) − D*(Post|R))/D*(Pre|R).
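The bootstrap procedure can be sketched as below; the function and argument names are ours, and `distance` stands in for either divergence measure above:

```python
import numpy as np

def bootstrap_convergence(pre_patterns, post_patterns, task_patterns,
                          distance, n_boot=1000, rng=None):
    """Bootstrap the convergence score. Each *_patterns argument is a
    list of per-bin activity patterns (tuples); `distance` maps two
    empirical distributions (dicts of pattern -> probability) to a
    scalar. Returns the n_boot bootstrapped convergence scores (%)."""
    rng = np.random.default_rng() if rng is None else rng

    def resample_dist(patterns):
        # sample-with-replacement, then recompile the distribution
        idx = rng.integers(0, len(patterns), len(patterns))
        sample = [patterns[i] for i in idx]
        return {p: sample.count(p) / len(sample) for p in set(sample)}

    scores = []
    for _ in range(n_boot):
        R = resample_dist(task_patterns)
        d_pre = distance(resample_dist(pre_patterns), R)
        d_post = distance(resample_dist(post_patterns), R)
        scores.append(100.0 * (d_pre - d_post) / d_pre)
    return np.array(scores)
```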
Raster model
To control for the possibility that changes in activity pattern occurrence were due solely to changes in the firing rates of individual neurons and the total population, we used the raster model exactly as described in [24]. For a given data-set of N spike-trains and binsize δ, the raster model constructs a synthetic set of spike-trains such that each synthetic spike-train has the same mean rate as its counterpart in the data, and the distribution of the total number of spikes per time-bin matches the data. In this way, it predicts the frequency of activity patterns that should occur given solely changes in individual and population rates.
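A surrogate in the spirit of this raster model can be sketched as follows. This is not the reference implementation of [24] (available at M. Okun's site, as noted in the Acknowledgments); it approximately preserves per-neuron rates by weighted sampling while exactly matching the per-bin population counts:

```python
import numpy as np

def raster_model(raster, rng=None):
    """Generate one surrogate raster (neurons x bins binary matrix)
    preserving the per-bin distribution of the total population count
    exactly, and each neuron's mean rate approximately."""
    rng = np.random.default_rng() if rng is None else rng
    n_neurons, n_bins = raster.shape
    # keep the distribution of population counts by permuting them across bins
    pop_counts = rng.permutation(raster.sum(axis=0))
    p = raster.mean(axis=1)  # per-neuron firing probability per bin
    p = p / p.sum() if p.sum() > 0 else np.full(n_neurons, 1.0 / n_neurons)
    surrogate = np.zeros_like(raster)
    for j, k in enumerate(pop_counts):
        if k > 0:
            # choose which k neurons are active, weighted by their rates
            active = rng.choice(n_neurons, size=min(k, n_neurons),
                                replace=False, p=p)
            surrogate[active, j] = 1
    return surrogate
```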
For Fig 6 we generated 1000 raster models per session using the spike-trains from the post-task SWS in that session. For each generated raster model, we computed the distance between its distribution of activity patterns and the data distribution for correct trials in the task D(Post − model|R). This comparison gives the expected distance between task and post-task SWS distributions due to firing rate changes alone. We plot the difference between the mean D(Post − model|R) and the data D(Post|R) in Fig. 6.
Outcome prediction
We examined the correlates of activity pattern occurrence with behaviour. To rule out pure firing rate effects, we excluded all patterns with K = 0 or K = 1 active neurons, considering only co-activation patterns with two or more active neurons.
To check whether individual activity patterns coded for the outcome on each trial, we used standard receiver-operating characteristic (ROC) analysis. For each pattern, we computed the distribution of its occurrence frequencies separately for correct and error trials (as in the example of Fig. 8A). We then used a threshold T to classify trials as error or correct based on whether the frequency on that trial exceeded the threshold or not. We found the fraction of correctly classified correct trials (true positive rate) and the fraction of error trials incorrectly classified as correct trials (false positive rate). Plotting the false positive rates against the true positive rates for all values of T gives the ROC curve. The area under the ROC curve gives the probability that a randomly chosen pattern frequency will be correctly classified as from a correct trial; we report this as P (predict outcome).
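The area under the ROC curve traced out by sweeping the threshold T is equivalent to the Mann-Whitney statistic: the probability that a randomly drawn correct-trial frequency exceeds a randomly drawn error-trial frequency, with ties counting half. A minimal sketch of P(predict outcome) using this equivalence (function name is ours):

```python
import numpy as np

def predict_outcome(correct_freqs, error_freqs):
    """P(predict outcome): area under the ROC curve for classifying
    trials as correct vs error from one pattern's per-trial occurrence
    frequency, via the Mann-Whitney formulation of the AUC."""
    c = np.asarray(correct_freqs, float)
    e = np.asarray(error_freqs, float)
    greater = (c[:, None] > e[None, :]).sum()   # correct > error pairs
    ties = (c[:, None] == e[None, :]).sum()     # tied pairs count half
    return (greater + 0.5 * ties) / (len(c) * len(e))
```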
Relationship of sampling change and outcome prediction
Within each session, we computed the change in each pattern’s occurrence between pre- and post-task SWS. These were normalised by the maximum change within each session. Maximally changing patterns were candidates for those updated by learning during the task. Correlation between change in pattern sampling and outcome prediction was done on normalised changes pooled over all sessions. Change scores were binned using variable-width bins of P (predict outcome), each containing the same number of data-points to rule out power issues affecting the correlation. We regress P (predict outcome) against the median change in each bin, using the mid-point of each bin as the value for P (predict outcome). Our main claim is that prediction and change are dependent variables (Fig. 8C-G). To test this claim, we compared the data correlation against the null model of independent variables, by permuting the assignment of change scores to the activity patterns. For each permutation, we repeat the binning and regression. We permuted 5000 times to get the sampling distribution of the correlation coefficient R* predicted by the null model of independent variables. To check robustness, all analyses were repeated for a range of fixed number of data-points per bin between 20 and 100.
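The permutation test above can be sketched as follows (a sketch under our own names and layout, not the paper's code): bin patterns by P(predict outcome) into equal-occupancy bins, and for each permutation reassign change scores to patterns at random before repeating the binned regression:

```python
import numpy as np

def permutation_null_R(pred, change, n_per_bin=50, n_perm=5000, rng=None):
    """Sampling distribution of the correlation coefficient R* under the
    null model that P(predict outcome) and sampling change are
    independent. Patterns are binned by `pred` into equal-occupancy
    bins; per permutation, bin mid-points are regressed against the
    per-bin median of the permuted change scores."""
    rng = np.random.default_rng() if rng is None else rng
    order = np.argsort(pred)
    pred, change = np.asarray(pred)[order], np.asarray(change)[order]
    n_bins = max(2, len(pred) // n_per_bin)
    bins = np.array_split(np.arange(len(pred)), n_bins)

    def binned_R(ch):
        mids = [0.5 * (pred[b[0]] + pred[b[-1]]) for b in bins]
        meds = [np.median(ch[b]) for b in bins]
        return np.corrcoef(mids, meds)[0, 1]

    return np.array([binned_R(rng.permutation(change))
                     for _ in range(n_perm)])
```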
Relationship of location and outcome prediction
The location of every occurrence of a co-activation pattern was expressed as a normalized position on the linearised maze (0: start of departure arm; 1: end of the chosen goal arm). Our main claim is that activity patterns strongly predictive of outcome occur predominantly around the choice point of the maze, and so prediction and overlap of the choice area are dependent variables (Fig. 9B). To test this claim, we compared this relationship against the null model of independent variables, by permuting the assignment of location centre-of-mass (median and interquartile range) to the activity patterns. For each permutation, we compute the proportion of patterns whose interquartile range overlaps the choice area, and bin as per the data. We permuted 5000 times to get the sampling distribution of the proportions predicted by the null model of independent variables: we plot the mean and 95% range of this sampling distribution as the grey region in Fig. 9B.
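The null model for location can be sketched similarly: permute the assignment of location summaries to patterns, then recompute the per-bin proportion whose interquartile range overlaps the choice area. The choice-area bounds below (0.4 to 0.6 on the linearised maze) are illustrative, not the paper's exact values:

```python
import numpy as np

def null_overlap_by_bin(iqr_lo, iqr_hi, bins, choice=(0.4, 0.6),
                        n_perm=5000, rng=None):
    """Null distribution of the per-bin proportion of patterns whose
    location interquartile range overlaps the choice area, under the
    model that outcome prediction and location are independent.
    `bins` is a list of index arrays binning patterns by
    P(predict outcome)."""
    rng = np.random.default_rng() if rng is None else rng
    lo, hi = np.asarray(iqr_lo, float), np.asarray(iqr_hi, float)
    overlaps = (lo <= choice[1]) & (hi >= choice[0])  # IQR intersects choice area
    null = np.empty((n_perm, len(bins)))
    for i in range(n_perm):
        perm = rng.permutation(overlaps)  # shuffle locations across patterns
        null[i] = [perm[b].mean() for b in bins]
    return null  # rows: permutations; columns: bins
```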
S1 File: Extended discussion of the theoretical predictions
Neural inference
How do we know the current state of the world given some input from it? Our input is both limited in time and noisy, so our estimates are inherently uncertain. Consequently, we have an inference problem: what is our best guess of the current state of the world given some finite, noisy input? We can state this problem as being equivalent to inferring the probability distribution P(state|input, model) at some given moment in time t; in words, this is the probability of currently being in a given state, out of all possible states, given both the available input and some internal model of the world. Using Bayes’ theorem, we can make this dependence on input and model explicit:
P(state|input, model) = P(input|state, model) P(state|model) / P(input|model). (2)
The prior is the internal estimate of the current state before the observation input, the posterior is the estimate of the current state after observing input, and the improvement in the estimate arises from the new information available in input that is processed through the likelihood. All these are dependent on the model of the world we are using. This internal model specifies how we interpret the inputs in the likelihood, and generate the prior probabilities. If we change the model, we change these two operations, and so change our estimates of the current state of the world. We can think of the model as specifying what we expect to be relevant in the input, and what states we expect to be in.
One goal of learning is thus to update the internal model to match the statistical properties of the world. The better the model, the better we will be able to predict the state of the external world. But as we can only access directly the inputs generated from those states, formally we say that learning seeks to maximise P(input|model) over all possible inputs at all times t by changing the parameters of the model. A model which always generates maximum values for P(input|model) is the best possible learnt internal model of the external world. Obtaining such a model necessarily means that we have experienced all possible states giving rise to those inputs, so that the prior P (state|model) is always accurate, and we obtain no new information from the likelihood. Consequently, the posterior probability becomes always proportional to the prior probability. A measure of learning is thus how close the prior and posterior distributions have become.
Inference-by-sampling
The inference-by-sampling theory (Fiser et al., 2010; Berkes et al., 2011) proposes that the model is encoded by the particular set and weight of connections in a neural circuit. In this view, the posterior distribution is encoded by the activity of the circuit evoked by some input. Crucially, it predicts that the prior distribution is encoded by spontaneous activity of the same circuit, as this is solely sampling the model.
If the circuit is the model, then the theory predicts that the circuit’s instantaneous population activity is a sample from a probability distribution - from the posterior when receiving external input, from the prior in spontaneous activity. Some downstream neurons, receiving these samples as a consecutive sequence of inputs, can reconstruct the probability distribution just by summing their inputs over time.
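This summing-over-samples idea is easy to demonstrate: draw instantaneous binary activity patterns from a toy distribution, and recover that distribution just by counting. The states and probabilities below are a hypothetical example, not data from the paper:

```python
import numpy as np
from collections import Counter

# Toy "internal model": a distribution over four two-neuron binary patterns.
states = [(0, 0), (0, 1), (1, 0), (1, 1)]
true_p = [0.4, 0.3, 0.2, 0.1]

def sample_patterns(n, rng):
    """Draw n instantaneous activity patterns, each one sample from
    the model's distribution."""
    idx = rng.choice(len(states), size=n, p=true_p)
    return [states[i] for i in idx]

def reconstruct(samples):
    """A downstream observer recovers the distribution simply by
    counting the samples it receives over time."""
    c = Counter(samples)
    return {s: c[s] / len(samples) for s in states}
```

With enough samples, the reconstructed probabilities converge on the model's distribution, which is why comparing empirical pattern distributions is a valid probe of the encoded model.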
For simplicity, Berkes et al (Berkes et al., 2011) considered the instantaneous population activity as some binary vector indicating whether each neuron was active or inactive in a very small time window. This representation makes the distributions easy to measure experimentally.
Learning updates synaptic weights, altering the encoded model. The prediction that posterior and prior distributions converge over learning is thus neurally equivalent to the convergence between the distributions of evoked and spontaneous population activity.
Evidence for inference-by-sampling in visual cortices
These ideas were developed in the context of visual processing, and particularly with reference to V1. In this context, the “state” of the world is the current view, and the input is the information received by the retina. The proposed purpose of inference in V1 is to infer the most likely low level visual features – edges, for example – present in the current view, given the input to the retina. V1’s internal model is then a statistical model of the low-level features, which can be built over a life-time’s experience of the world.
Consequently, Berkes and colleagues (Berkes et al., 2011) tested the construction of this internal model by recording from area V1 at different stages of development in the ferret. Natural images were used to probe the current posterior distribution supported by the model, and darkness was used to probe the current prior distribution. Over development, the activity distribution evoked by natural images increased its similarity to the distribution during darkness. This increase was robust to a series of controls for simultaneous changes in firing rate statistics (Berkes et al., 2011; Okun et al., 2012; Fiser et al., 2013). Their results are consistent with the inference-by-sampling interpretation in which the internal model is updated by experience with the world, so that the posterior and prior distributions converge.
Inference-by-sampling in higher cortices over learning during behaviour
These results could not address learning separately from development. Further, it is unknown whether inference-by-sampling can be observed in higher-order cortices, or during ongoing behaviour.
There is no a priori reason to expect that inference-by-sampling would be restricted to primary sensory cortices. Much has been written about the generic nature of the cortical microcircuit (Thomson & Lamy, 2007; Harris & Shepherd, 2015), so we might reasonably expect that, if an internal model is encoded by the neural circuit in V1, then similar cortical circuits in other regions encode other internal models.
Compelling support for this has come from modelling work by Maass and colleagues (Buesing et al., 2011; Habenschuss et al., 2013). Their models have shown how a wide range of plausible cortical circuit models all produce the necessary dynamics to sample from a statistical model encoded by the circuit’s connections (Buesing et al., 2011; Habenschuss et al., 2013). Moreover, the models also replicate key properties of the firing statistics in cortex, including the close-to-Poisson irregularity of firing patterns. These suggest that the inference-by-sampling hypothesis is indeed a plausible generic computation for cortex.
Inference of state is also a generic operation. Nothing in Equation 2 limits its application to sensory information. We might consider “state” in the sense used in the reinforcement learning literature (Sutton & Barto, 1998), as a generic description of the current values of variables of the external world. Indeed, in forms of reinforcement learning that depend on simulation of future actions, “state” in this context can even refer to the simulated values of variables in the external world - for which we would use the internal model to simulate possible outcomes. During behaviour, we might thus expect that an internal model is learnt about the statistical dependence of outcomes on decisions in particular contexts.
The power of the inference-by-sampling hypothesis is that we do not need to know the internal model to test for its existence. We need not specify an exact model to test the convergence of distributions in evoked and spontaneous activity, but such a convergence is evidence of an updated internal model.
Consequently, to test the generality of the inference-by-sampling hypothesis, we sought to test the convergence of distributions over learning using data from the medial prefrontal cortex (mPfC) of rats learning rules in a Y-maze task (Peyrache et al., 2009). By looking in these data for a change to some internal model in mPfC, we are assuming only that the model is related to the rule, not any specific form of model. It could encode the set of task states and their transitions; it could encode the current sequence of required actions; it could be a statistical model of outcomes. Supporting this assumption, we know mPfC is necessary for successful acquisition of new rules (Ragozzino et al., 1999; Rich & Shapiro, 2007), and that mPfC pyramidal neurons change their firing patterns during acquisition of the rules used here (Benchenane et al., 2010).
Even if the interpretation of the convergence of distributions in the inference-by-sampling framework turns out to be incorrect, the observation of such a convergence between waking and spontaneous activity over learning still offers compelling clues to the nature of cortical computation.
What distributions to compare?
Nonetheless, the inference-by-sampling theory places limits on exactly which activity distributions to compare. In the Berkes et al. (Berkes et al., 2011) study, this decision was made simple by the elegant experimental design. As they monitored V1 over development, it was reasonable to expect the internal model to adapt to the statistics of the world over a lifetime. Their tests at different developmental stages were samples of the current posterior and prior distributions supported by the model. We would not expect significant changes to the internal model during their testing, as it was short on the time-scale of the developmental changes, and so they could compare their entire recorded distributions of evoked and spontaneous activity. In other words, they were able to compare two distributions from the same, static model.
Our data on rats learning rules in a Y-maze allow us to address if learning of the internal model can be observed. But learning on short time-scales brings the confounding issue that learning the model is happening online, while we are monitoring activity. So what distributions should we compare?
We chose the 10 training sessions in which the rat clearly acquired the present rule, so we could be reasonably sure that we would observe changes that correlated with learning. We reasoned that neural activity in clearly identified sleep periods before and after the session was a clear candidate for spontaneous activity, as it occurred in the absence of external sensory input. We used slow-wave sleep periods to clearly delineate the presence of sleep. Because the rats acquired the rule within the session, if mPfC indeed encodes rule acquisition we expect that the spontaneous activity in sleep after the session is drawn from the internal model of the correct rule.
We can only be sure that during behaviour this correct-rule model would be sampled on correct trials. This does not imply that mPfC activity is causal for decisions on those trials - even in a monitoring or goal-encoding role, mPfC activity would reflect whether or not the correct decision was made. The mPfC activity on error trials is unconstrained by the theory. Consequently, we can only be sure that, if the inference-by-sampling hypothesis is true, then the distribution of samples on correct trials would converge, on average, to the distribution in sleep after learning.
The final, subtle constraint is that overt behavioural signs of learning likely indicate ongoing synaptic plasticity. For example, on the same Y-maze, some pyramidal neurons in mPfC change the timing of their spikes in relation to the hippocampal theta rhythm, indicating local circuit plasticity (Benchenane et al., 2010). If so, then the internal model is changing during behaviour. But the internal model putatively sampled in the post-session sleep will be stable. To minimise the confound of these changes during behaviour, and compare static posterior and prior distributions (as per Berkes et al., 2011), we sought to identify where the internal model updating may have finished. A useful proxy for this is the asymptotic behavioural performance. We thus used the trial at which the rat reached the learning criterion as the indicator of relative stability in the internal model. All correct trials from this trial onwards were then used to construct the activity distribution during the task - we call this distribution P(R) in the main text, and distances measured between it and some other distribution P(X) we call D(X|R).
Acknowledgments
We thank the Humphries lab (Javier Caballero, Mat Evans, Silvia Maggi) for discussions; Rasmus Petersen for comments on the manuscript; and P. Berkes and M. Okun for respectively making their KL divergence and raster model code publicly available.
Footnotes
* mark.humphries{at}manchester.ac.uk