Abstract
Model-based reinforcement learning (mbRL) has been widely used in explaining animal behavior. In mbRL, the model, or the structure of the task, is used to evaluate the associations between actions and outcomes. It has been proposed that the orbitofrontal cortex (OFC) encodes the model during mbRL. However, it is not well understood how the OFC acquires and stores model information. Here, we propose a neural network framework based on reservoir computing. Reservoir networks exhibit heterogeneous and dynamic activity patterns that are suitable for encoding task states. The information can be extracted by a linear readout trained with reinforcement learning. We demonstrate how our framework acquires and stores the task state space. The framework exhibits mbRL behavior, and several of its features resemble experimental findings in the OFC. Our study provides a theoretical explanation of how the OFC may contribute to mbRL and a new approach to understanding the neural mechanism underlying mbRL.
Introduction
Even the simplest reinforcement learning (RL) algorithm captures the essence of operant conditioning in psychology and animal learning (Rescorla & Wagner, 1972). That is, actions that are rewarded tend to be repeated more frequently; actions that are punished are more likely to be avoided. However, it fails to explain animals’ behavior in more complicated situations. One particular approach to extending the capabilities of RL algorithms, known as model-based reinforcement learning (mbRL, in contrast to model-free RL, or mfRL), uses the knowledge of the task structure, i.e., the model, to guide learning (Beck et al., 2008). mbRL is especially successful in explaining goal-directed learning behavior in complex environments (Dolan & Dayan, 2013; Doll, Simon, & Daw, 2012; Keramati, Smittenaar, Dolan, & Dayan, 2016).
Several studies have investigated the possible brain structures that may be involved in mbRL (Daw, Gershman, Seymour, Dayan, & Dolan, 2011; Glascher, Daw, Dayan, & O'Doherty, 2010; Haber, Kim, Mailly, & Calzavara, 2006; Kennerley, Behrens, & Wallis, 2011; Schultz, Dayan, & Montague, 1997). Notably, the orbitofrontal cortex (OFC) has been hypothesized to represent the task space and encode task states (Wilson, Takahashi, Schoenbaum, & Niv, 2014). Several lesion studies showed that animals with OFC lesions exhibited deficits in acquiring task information for building a task model (Hornak et al., 2004; Izquierdo, Suda, & Murray, 2004; Takahashi et al., 2011). Electrophysiology studies of the OFC have demonstrated that the OFC encodes many aspects of reward information, including reward value (J. L. Jones et al., 2012; Padoa-Schioppa, 2011; Padoa-Schioppa & Assad, 2006; Rudebeck, Mitz, Chacko, & Murray, 2013; Wallis & Miller, 2003), probability (Kennerley & Wallis, 2009), risk (O'Neill & Schultz, 2015), information value (Blanchard, Hayden, & Bromberg-Martin, 2015), abstract rules (Wallis, Anderson, & Miller, 2001), and strategies (Tsujimoto, Genovesio, & Wise, 2011). Yet, it is not well understood how models themselves may be encoded and represented by a neural network, and what sort of neuronal firing properties we expect to find in neurophysiological experiments. Furthermore, we do not know how to teach a model-agnostic neural network to acquire the structure of a task just based on trial and error.
The recent development of reservoir computing may provide a solution (Buonomano & Maass, 2009; Laje & Buonomano, 2013; Maass, Natschlager, & Markram, 2002). Reservoir networks are recurrent networks with fixed connections. Within a reservoir network, neurons are randomly and sparsely connected. Importantly, the internal states of a reservoir exhibit rich temporal dynamics, which represent a nonlinear transformation of the input history and can be very useful for encoding task state sequences. The information encoded by the network can be extracted with a linear readout, which can be trained during learning. Reservoir networks have been shown to exhibit dynamics similar to those observed in the prefrontal cortex (Barak, Sussillo, Romo, Tsodyks, & Abbott, 2013; Cheng, Deng, Hu, Zhang, & Yang, 2015; Enel, Procyk, Quilodran, & Dominey, 2016).
In the current study, we demonstrate with two common learning paradigms how a reservoir network may achieve mbRL by encoding task states without prior knowledge of task structures. Task event sequences, including reward events, are provided as inputs to the network. A simple yet biologically feasible reward-dependent Hebbian learning algorithm is used to adjust its output weights. We show that our framework can solve problems with different task structures and exhibits mbRL behavior previously reported in animals and humans. We further demonstrate the similarity between the reservoir network and the OFC. Manipulations of our network reproduce the behavior of animals with OFC lesions. The reservoir neurons’ response patterns resemble characteristics of OFC neurons reported in previous electrophysiological experiments.
Taken together, these results suggest a simple mechanism that naturally leads to the acquisition of task structure and therefore supports mbRL. Finally, we propose some future experiments that may be used to test our model.
Results
We describe our results in three parts. We start by using our network to model a classical reversal learning task. We take advantage of its simplicity to explain the principles behind the framework. Then we show how such a framework may be applied to more complex scenarios, using a two-stage decision task as an example. Finally, we demonstrate how the network framework may be used to describe experimental findings in the OFC during value-based decision making.
Reversal Learning
In a classical reversal learning task, the animals have to keep track of the reward contingency of two choice options that may be reversed during a test session (Izquierdo et al., 2004; B. Jones & Mishkin, 1972). Normal animals were found to learn reversals faster and faster, which has been used as an indication of mbRL (Wilson et al., 2014). The mbRL behavior was however found to be impaired in animals with OFC lesions or with lesions that contained fibers passing near the OFC (Izquierdo et al., 2004; Rudebeck, Saunders, Prescott, Chau, & Murray, 2013). These animals were not able to learn reversals faster and faster when they were repeatedly tested. The learning impairments could be explained by mfRL (Wilson et al., 2014).
Our neural network framework is built around a state encoding layer (SEL), which is a reservoir network. It receives three inputs and generates two outputs (Fig 1a). The three inputs to the SEL are the two choice options A and B, together with a reward input that indicates whether the choice yields a reward or not in the current trial. The outputs represent choice actions A and B for the next trial. We use the neural activity of the SEL at the end of the input presentation to determine the SEL’s output.
The framework is able to reproduce the animals’ behavior. The number of error trials it takes for the framework to reach the performance threshold, which is set at 93% correct for the initial learning and at 80% for the subsequent reversals, decreases as the model goes through more and more reversals (Fig 1b). Interestingly, a learning deficit similar to that found in OFC-lesioned animals is observed if we remove the reward input to the SEL (Fig 1b). As the OFC and its neighboring brain areas, such as the ventromedial prefrontal cortex (vmPFC), are known to receive both sensory and reward inputs from sensory and reward circuitry in the brain, removing the reward input from our model mimics the situation where the brain has to learn without functioning structures in or near the OFC.
Neurons in the SEL, as expected from a typical reservoir network, show highly heterogeneous response patterns. Some neurons are found to encode the stimulus identity, some neurons encode reward, and others show mixed tuning (Fig 2a). A principal component analysis (PCA) based on the population activity shows that the network can distinguish all four possible task states: choice A rewarded, choice A not rewarded, choice B rewarded, and choice B not rewarded (Fig 2b).
The ability to distinguish these states is essential for learning. To understand the mbRL behavior exhibited by our model, we study how neurons with different selectivity contribute to the learning (Fig 2c). We find that the readout weights of the neurons that are selective for the combination of stimulus and reward inputs (e.g. AR and BR) are the most affected by learning. The difference between the weights of their connections to the outputs A and B keeps growing despite repeated reversals. In contrast, the weights of the output connections of pure stimulus-selective neurons only wiggle around the baseline between reversals.
The difference between these two groups of neurons explains why our network achieves mbRL only when the reward input is available. Let us first consider the AR neurons, which are selective for the situation when choice A leads to reward. In the A-rewarded blocks, the connections between the AR neurons and the neuron of the decision-making output layer (DML) representing choice A are strengthened. When the reward contingency is reversed and choice A now leads to no reward, the connections between the AR neurons and choice A are largely unaffected. That is because it is the AN neurons, rather than the AR neurons, that are activated in the blocks when choice A is not rewarded. As a result, the connections between the AN neurons and the DML neuron of choice B are strengthened, and the connections between the AN neurons and the DML neuron of choice A are weakened. When the reward contingency is flipped again, the connections between the AR neurons and the DML neuron of choice A are strengthened further. This way, the learning is never erased by the reversals, and the network learns faster and faster. In comparison, let us now consider the A neurons, which encode only the sensory input and are activated whenever input A is present. In the A-rewarded blocks, the connections between the A neurons and the DML neuron of choice A are strengthened. In the B-rewarded blocks, however, these connections are weakened whenever the network chooses A and gets no reward, and the learning from the previous block is reversed. Thus, the output connections of the A neurons only fluctuate around the baseline across reversals. They do not contribute much to the learning, and the overall behavior of the network is mostly driven by neurons that are activated by the combination of the reward input and the sensory inputs. Removing the reward input deactivates these neurons and leads to model-free behavior.
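The argument above can be made concrete with a toy simulation. The following sketch is not the full SEL model: it assumes a handful of idealized units with fixed, binary selectivity for the stimulus (A, B) or for stimulus-reward conjunctions (AR, AN, BR, BN), a simplified reward-modulated Hebbian update restricted to the chosen output, and illustrative values for the learning rate, softmax gain, and the running estimate of the expected reward. Across repeated reversals, the readout weight difference of the conjunctive units keeps growing, while that of the pure stimulus units stays near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
units = ['A', 'B', 'AR', 'AN', 'BR', 'BN']          # idealized SEL units
W = {u: {'A': 0.0, 'B': 0.0} for u in units}        # readout weights to the two DML units
eta, beta, alpha_r = 0.1, 3.0, 0.2                  # illustrative parameters
r_avg = 0.5                                          # running estimate of E[r]

def active(choice, rewarded):
    """Units driven by a trial's choice and outcome."""
    return [choice, choice + ('R' if rewarded else 'N')]

state = ['A', 'AN']                                  # SEL state carried from the previous trial
for block in range(8):
    a_rewarded = (block % 2 == 0)                    # contingency reverses every block
    for trial in range(60):
        drive = {c: sum(W[u][c] for u in state) for c in ('A', 'B')}
        p_a = 1.0 / (1.0 + np.exp(-beta * (drive['A'] - drive['B'])))
        choice = 'A' if rng.random() < p_a else 'B'
        reward = 1.0 if (choice == 'A') == a_rewarded else 0.0
        rpe = reward - r_avg
        for u in state:                              # update weights onto the chosen output only
            W[u][choice] += eta * rpe
        r_avg += alpha_r * (reward - r_avg)
        state = active(choice, reward > 0)
    print(f"block {block}: pure A units dW = {W['A']['A'] - W['A']['B']:+.2f}, "
          f"AR units dW = {W['AR']['A'] - W['AR']['B']:+.2f}")
```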
Two-stage Markov decision task
We further test our network model with a two-stage decision making task. The task is similar to the Markov decision task used previously in several human fMRI studies (Glascher et al., 2010). In this task, the subjects have to choose between two options A1 and A2. Their choices then lead to two intermediate outcomes B1 and B2 with different but fixed probabilities. Choosing A1 is more likely to lead to B1, and choosing A2 is more likely to be followed by B2. Importantly, the final reward is contingent only on these intermediate outcomes, and the contingency is reversed across blocks (Fig 3a). Thus, the probability of getting a reward is higher for B1 in one block and becomes lower in the next block. The probabilistic association between the initial choices and the intermediate outcomes never changes. The subjects are not informed of the structure of the task, and they have to figure out the best option by tracking not only the reward outcomes but also the intermediate outcomes.
We keep our framework mostly the same as in the previous task. Here, we have two additional input units that reflect the intermediate outcomes (Fig 3b). To demonstrate our framework’s capability of encoding sequential events, the input units are activated sequentially in our simulations as they are in the real experiment (Fig 3c). We also add a non-reward input unit whose activity is set to 1 when a reward is not obtained at the end of a trial. The additional non-reward input facilitates learning but does not change the results qualitatively.
For an mfRL strategy, the probability of repeating the previous choice depends only on the reward outcome. The probability of repeating the previous choice is higher when a reward is obtained than when no reward is obtained. The intermediate outcome is ignored. However, for an mbRL strategy, this is no longer the case. For example, consider the situation when the subject initially chooses A1, the intermediate outcome happens to be B2, and a reward is obtained. If the subject understands that B2 is an unlikely outcome of choice A1 (rare), but a likely outcome of choice A2 (common), a reward obtained after the rare event B2 should actually motivate the subject to switch from the previous choice and choose A2 the next time. The subject should always choose the option that is more likely to lead to the intermediate outcome that is currently associated with a reward.
To quantify the model-based learning behavior, we first evaluate the impact of the previous trial’s outcome on the current trial. We classify all trial outcomes into four categories: common-rewarded (CR), common-unrewarded (CN), rare-rewarded (RR) and rare-unrewarded (RN). Here, common and rare indicate whether the intermediate outcome is the more likely outcome of the chosen option or not. Glascher et al. (Glascher et al., 2010) showed that mbRL led to a higher probability of repeating the previous choice in the CR and RN conditions. This is also what we observe in our network model’s behavior (Fig 4a).
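For readers who wish to reproduce this analysis, the sketch below shows one way to compute the stay probabilities for the four outcome categories from a sequence of trials. The trial records and their format are hypothetical placeholders, not the actual simulation output.

```python
import numpy as np

# Hypothetical trial records: (first-stage choice, intermediate outcome, rewarded).
# "Common" means the intermediate outcome is the likely (80%) consequence
# of the chosen option (A1 -> B1, A2 -> B2).
trials = [('A1', 'B1', True), ('A1', 'B2', False), ('A1', 'B1', True),
          ('A2', 'B2', False), ('A1', 'B1', True), ('A1', 'B1', False),
          ('A2', 'B2', True), ('A2', 'B1', True), ('A2', 'B2', True)]

def category(choice, outcome, rewarded):
    common = (choice == 'A1') == (outcome == 'B1')
    return ('C' if common else 'R') + ('R' if rewarded else 'N')

stay = {c: [] for c in ('CR', 'CN', 'RR', 'RN')}
for prev, cur in zip(trials[:-1], trials[1:]):
    stay[category(*prev)].append(cur[0] == prev[0])   # repeated the previous choice?

p_stay = {c: (np.mean(v) if v else np.nan) for c, v in stay.items()}
print(p_stay)
# A model-based learner shows higher stay probabilities after CR and RN trials;
# a model-free learner's stay probability depends only on whether a reward was obtained.
```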
To illustrate how the network acquires the model, we define the model-based index, which quantifies the tendency toward model-based behavior (see the Methods). The model-based index grows larger as the training goes on (Fig 4b). It indicates that the network learns the structure of the task gradually and transitions from initially model-free behavior to model-based behavior. Similar to our findings in the first task, the SEL without the reward input does not show this transition (Fig 4b). We further quantify the contributions of mbRL and mfRL to the network behavior using a model fitting procedure previously described by Glascher et al. (Glascher et al., 2010), and the network without the reward input shows a significantly smaller weight for mbRL, suggesting that it is worse at picking up the task structure (Fig 4c).
Again, a PCA on the SEL population activity shows that the SEL distinguishes different task states (Fig 4d). Because of the structure of the task, in which the contingency between the first-stage options and the intermediate outcomes is fixed, the network only needs to find out the current reward contingency of the intermediate outcomes. We find that the learning picks out the most relevant neurons, those that encode the contingency between the intermediate outcomes and the reward outcomes (B1R, B2R, etc.). Their connection weights to the DML neurons show better and better differentiation of the two choices throughout the training (Fig 4e). In contrast, the weights of neurons that encode the association between the first-stage options and the reward outcomes (A1R, A2R, etc.) are less differentiated. These results suggest that the network acquires the task structure as a result of training.
Value representation by the OFC
Previous electrophysiology studies have shown that OFC neurons encode value during economic choices (Padoa-Schioppa & Assad, 2006; Wallis & Miller, 2003). Among these value encoding neurons, studies have identified multiple classes of neurons encoding a variety of information, including the value of individual offers (offer value), the value of the chosen option (chosen value), and the identity of the chosen option (chosen identity) (Cai & Padoa-Schioppa, 2014; Padoa-Schioppa, 2013).
Here we show that our framework may explain this apparently heterogeneous value encoding in the OFC. We model a two-alternative economic choice task by providing two inputs to the SEL, representing the value of each option (Fig 5a). The framework can reproduce the choice behavior of monkeys (Fig 5b) (Padoa-Schioppa & Assad, 2006). We then study the selectivity of the SEL neurons. We find not only neurons that encode the value of each option (offer value neurons, middle panel in Fig 6a), but also neurons that encode the value of the chosen option (chosen value neurons, left panel in Fig 6a). Furthermore, a proportion of neurons show selectivity for the choice, as previously reported (chosen identity neurons, right panel in Fig 6a). We classify the neurons in the reservoir network into 10 categories as described in Padoa-Schioppa and Assad (Padoa-Schioppa & Assad, 2006). Interestingly, we are able to find neurons in 9 of the 10 categories (Fig 6b, c). The only missing category (neurons encoding other/chosen value) was also very rare in the experimental data. Although the proportions of neurons in each category are not an exact copy of the experimental data, the similarity is apparent. This is surprising given that we do not tune the internal connections of the SEL to the task. The heterogeneity is naturally expected from a reservoir network, but it takes much more effort to explain with recurrent network models that have a well-defined structure (Daie, Goldman, & Aksay, 2015; Rustichini & Padoa-Schioppa, 2015).
Discussion
So far, we have shown that a simple reservoir-based network model may exhibit model-based learning behavior. The more interesting questions are why the network is capable of doing so and how this network model may help us understand the functions of the OFC.
We place a reservoir network as the centerpiece of our model. Reservoir networks are large, distributed, nonlinear dynamical recurrent neural networks with fixed weights. Because of their complicated dynamics, recurrent networks are especially useful in modeling temporal sequences, including languages (Rodriguez, 2001; Suykens, Vandewalle, & Moor, 1996). They have been shown to be Turing equivalent (Kilian & Siegelmann, 1996) and capable of approximating arbitrary dynamical systems (Funahashi & Nakamura, 1993). In our model, the reservoir network encodes combinations of inputs that constitute the task state space. States are encoded by the activities of the reservoir neurons, and the learned action values are represented by the weights of the readout connections. We show that a reinforcement learning algorithm is capable of solving the relatively simple tasks in this study. However, it has been shown that reinforcement learning is in general not very efficient for extracting information from reservoir networks. A possible solution is to introduce additional layers to help with the readout (Cheng et al., 2015).
It is important to note that reward events must also be provided as an input to allow mbRL. Including reward events allows the network to establish associations between sensory stimuli and rewards, which facilitates model-based learning. Removing the reward input to the reservoir leads to mfRL behavior. Although reward modulates neural activity almost everywhere in the cortex, the OFC is unique in its role of encoding the association between sensory stimuli and rewards. Removing the reward input to the reservoir mimics the situation in which animals cannot rely on such an association to learn tasks. In this case, the reservoir is still perfectly functional in terms of encoding task events other than rewards. We hypothesize that this simulates the situation in which animals have to depend on other memory structures in the brain - such as the hippocampus or other medial temporal lobe structures - for learning. The importance of the reward input to the reservoir explains the key role that the OFC plays in mbRL.
Several recent studies reported that selective OFC lesions that spared the fibers passing through or near the OFC did not reproduce the reversal learning deficits seen in earlier studies (Rudebeck, Saunders, et al., 2013). Since these fibers probably carry reward information from the midbrain areas, these results do not undermine the importance of reward inputs. Presumably, when the lesion is limited to the OFC, the projections that carry the reward information are still available to, or might even be redirected to, other neighboring prefrontal structures, including the ventromedial prefrontal cortex, which might take over the role of the OFC and contribute to mbRL in animals with selective OFC lesions.
There are several reasons why we choose reservoir networks to construct our model. The first is that we would like to pair our network model with reinforcement learning. Reservoir networks have fixed internal connections; the training occurs only at the readout. The number of trainable parameters is thus much smaller, which could be important for efficient reinforcement learning. Generality is another benefit offered by reservoir networks. Because the internal connections are fixed, we can use the same network to solve a different problem by simply training a different readout. The reservoir can be seen as a general-purpose task state representation network. Lastly, our results as well as several other studies show that neurons in reservoir networks - although their connection weights are untrained - show properties similar to those observed in the real brain (Barak et al., 2013; Cheng et al., 2015; Sussillo & Abbott, 2009), suggesting local plasticity may not play a role as important as previously thought.
The fact that the internal connections are fixed in a reservoir network means that the selectivities of the reservoir neurons are also fixed. This may seem at odds with the experimental findings that many OFC neurons shift their encodings rapidly during reversals (Rolls, Critchley, Mason, & Wakeman, 1996). However, these observations may be interpreted differently. The neurons that were found to respond differently during reversals might in fact be encoding rewards. On the other hand, there is evidence that OFC neurons with inflexible encodings during reversals might be more important for mbRL behavior (Schoenbaum, Saddoris, & Stalnaker, 2007).
The performance of our network depends on several factors. First, it is important that the reservoir be able to distinguish between different task states. The number of possible task states may be only 4 or 8, as in our examples, or may become impossibly large even if the number of inputs increases only slightly. The latter is due to the infamous combinatorial explosion problem. One may alleviate the problem by introducing learning in the reservoir to weed out irrelevant combinations. Second, the dynamics of the reservoir should allow information to be maintained long enough until the decision is made. Recently developed gated recurrent neural networks may provide a solution, with units that can maintain information over long periods (Chung, Gulcehre, Cho, & Bengio, 2014). Third, the model exhibits substantial variability between runs, suggesting that the initialization may impact its performance. Further investigation is needed to make the model more robust.
Our model makes several testable predictions. First, because of the reservoir structure, the inputs from the same source should be represented evenly in the network. For example, in a visual task, different visual stimuli should be represented at roughly the same strength in the OFC, even if their visual salience is drastically different. Second, we should be able to find neurons encoding all relevant task parameters in the network. Third, reducing the number of inputs may make the network more efficient in certain tasks. This may seem counterintuitive, but removing inputs reduces the number of states that the network has to encode, and thus improves learning efficiency for tasks that do not require those additional states. For example, removing the reward input to the SEL, which is essential for model-based learning, should nevertheless make the network more efficient at model-free learning. Indeed, animals with OFC lesions were found to perform better than control animals when reward history was not important (Riceberg & Shapiro, 2012).
In summary, our framework is not intended to be a complete model of how the OFC works. Instead of creating a complete neural network solution for mbRL or the OFC, which is not feasible at the moment, we aim at the modest goal of providing a proof of concept that addresses the critical problem of how the model in mbRL is acquired. By demonstrating the network’s similarity to the experimental findings in the OFC, our study opens up new possibilities for future investigation.
Materials and Methods
Neural Network Model
The model is composed of three layers: an input layer (IL), a state encoding layer (SEL), and a decision-making output layer (DML) (Fig. 1a).
The units in the input layer represent the identities of sensory stimuli and the reward obtained. The input neurons are sparsely connected to the SEL units. The connection weight $w_{ik}^{(1)}$ from input unit $k$ to SEL neuron $i$ is set to 0 with a probability of $p_{IR} = 0.2$. Nonzero weights are drawn independently from the standard uniform distribution on $[0, 1]$.
In the SEL, there are $N = 500$ neurons. The neurons in the SEL are connected with a low probability $p = 0.1$, and the connection weights are drawn randomly and independently from a Gaussian distribution with zero mean and a variance of $g^2/(pN)$, where the gain $g$ acts as the control parameter of the SEL. Connections in the SEL can be either positive or negative.
Each neuron in the SEL is described by an activation variable $x_i$ for $i = 1, 2, \ldots, N$, which is initialized from a normal distribution $N(0, \sigma_{ini}^2)$ at the beginning of each trial. $x_i$ is updated at each time step ($dt = 1$ ms) as follows:

\[ \tau\, dx_i = \Big( -x_i + \sum_{j=1}^{N} w_{ij}\, y_j + \sum_{k} w_{ik}^{(1)} I_k \Big)\, dt + \sigma_{noise}\, dW_i , \]

where $\tau$ is the time constant, $w_{ij}$ is the synaptic weight between SEL neurons $i$ and $j$, $I_k$ is the activity of input unit $k$, $dW_i$ is a white noise term, and $\sigma_{noise}$ is its variance. The firing rate $y_i$ of neuron $i$ is a saturating function of the activation variable $x_i$, bounded between a minimal firing rate $y_{min} = 0$ and a maximal rate $y_{max} = 1$, with a baseline firing rate $y_0 = 0.1$ at $x_i = 0$.
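A minimal sketch of the SEL dynamics is given below. The parameter values (N, p, g, the time step, noise levels, and the firing-rate bounds) are taken from the description above; the Euler-Maruyama integration scheme and, in particular, the exact form of the saturating nonlinearity (here a logistic function shifted so that its value at x = 0 equals the baseline y0) are assumptions for illustration.

```python
import numpy as np

N, p, g = 500, 0.1, 2.0
tau, dt = 100.0, 1.0                       # ms
sigma_noise, sigma_ini = 0.01, 0.01
y_min, y_max, y0 = 0.0, 1.0, 0.1
n_in = 3                                   # choice A, choice B, reward

rng = np.random.default_rng(1)
J = rng.normal(0.0, g / np.sqrt(p * N), size=(N, N))          # recurrent weights
J *= rng.random((N, N)) < p                                     # sparse connectivity
W_in = rng.random((N, n_in)) * (rng.random((N, n_in)) > 0.2)    # zero with p_IR = 0.2

def rate(x):
    # assumed saturating nonlinearity: bounded in [y_min, y_max], rate(0) = y0
    x0 = np.log((y_max - y0) / (y0 - y_min))
    return y_min + (y_max - y_min) / (1.0 + np.exp(-(x - x0)))

def step(x, y, inputs):
    # Euler-Maruyama step of  tau dx = (-x + J y + W_in I) dt + sigma dW
    noise = sigma_noise * np.sqrt(dt) * rng.standard_normal(N)
    return x + (-x + J @ y + W_in @ inputs) * dt / tau + noise

x = rng.normal(0.0, sigma_ini, N)          # trial-initial state
y = rate(x)
inputs = np.array([1.0, 0.0, 1.0])         # e.g. choice A, rewarded
for _ in range(1700):                      # one trial, in 1 ms steps
    x = step(x, y, inputs)
    y = rate(x)
```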
The SEL neurons project to the DML. The two competing neurons in the DML represent the two choices, respectively. The total input of neuron $k$ in the DML is

\[ v_k = \sum_{i=1}^{N} w_{ik}^{(2)}\, y_i , \]

where $w_{ik}^{(2)}$ is the weight of the synapse between neuron $i$ in the SEL circuit and neuron $k$ in the DML. The synaptic weights between the SEL and the DML are randomly initialized from the uniform distribution on $[0, 1]$ and normalized so that the squared sum of the synaptic weights projecting to the same DML unit equals 1.
The synaptic weights between the SEL and the DML are updated based on the reward outcome during the training phase. The stochastic choice behavior of our model is described by a softmax function:

\[ p_k = \frac{\exp(\beta v_k)}{\sum_{k'} \exp(\beta v_{k'})} , \]

where $p_k$ is the probability of choosing choice $a_k$, and the other choice is chosen with probability $1 - p_k$. $\beta$ adjusts the strength of the competition between the two choices, and $v_k$ is the input to DML unit $k$. The firing rate of unit $k$, $y_k$, is set to 1 if choice $a_k$ is chosen and to 0 otherwise.
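As a sketch, the decision stage can be written in a few lines. The SEL rates and readout weights below are random placeholders standing in for a trained network.

```python
import numpy as np

rng = np.random.default_rng(2)
N, beta = 500, 4.0
y = rng.random(N)                                  # SEL rates at decision time (placeholder)
W_out = rng.random((N, 2))                         # SEL -> DML weights (placeholder)
W_out /= np.linalg.norm(W_out, axis=0)             # squared sum per DML unit equals 1

v = y @ W_out                                      # total input to each DML unit
p = np.exp(beta * v) / np.exp(beta * v).sum()      # softmax over the two choices
choice = rng.choice(2, p=p)
y_dml = np.eye(2)[choice]                          # chosen unit's rate set to 1, the other to 0
```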
Reinforcement Learning
At the end of each trial, the weights between the SEL and the DML neurons are updated. The plastic weights $w_{ik}^{(2)}$ in trial $n+1$ are updated as follows:

\[ w_{ik}^{(2)}(n+1) = w_{ik}^{(2)}(n) + \Delta w_{ik} . \]
The update term $\Delta w_{ik}$ depends on the reward prediction error and on the responses of the neurons in the SEL circuit and the DML:

\[ \Delta w_{ik} = \eta\, (r - E[r])\, (y_i - y_{th})\, y_k , \]

where $\eta$ is the learning rate, $r$ is the reward, and $E[r]$ denotes its expected value. When the reward $r$ is larger than $E[r]$, the connections between the SEL neurons whose firing rates are above the threshold $y_{th}$ and the neurons in the DML are strengthened, and the connections between the neurons whose firing rates are below $y_{th}$ and the neurons in the DML are weakened. After each update, the weights are normalized,

\[ w_{ik}^{(2)} \leftarrow \frac{w_{ik}^{(2)}}{\sqrt{\sum_{i'} \big(w_{i'k}^{(2)}\big)^2}} , \]

so that the length of the weight vector projecting to each DML unit remains constant. The normalization stops the weights from growing infinitely (Royer & Pare, 2003).
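The corresponding weight update can be sketched as follows, assuming the three-factor form given above. The SEL rates, readout weights, example trial outcome, and the estimate of E[r] are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
N, eta, y_th = 500, 0.001, 0.4
y = rng.random(N)                                   # SEL rates at the end of the trial (placeholder)
W_out = rng.random((N, 2))
W_out /= np.linalg.norm(W_out, axis=0)

choice, reward, r_expected = 0, 1.0, 0.6            # example trial outcome and E[r] estimate
y_dml = np.eye(2)[choice]                            # 1 for the chosen DML unit, 0 otherwise

# three-factor update: reward prediction error x presynaptic rate relative to
# threshold x postsynaptic DML activity
dW = eta * (reward - r_expected) * np.outer(y - y_th, y_dml)
W_out += dW
W_out /= np.linalg.norm(W_out, axis=0)               # renormalize each DML unit's weight vector
```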
Behavior Task
Reversal learning
The network has to choose between two options. One option leads to a reward, and the other does not. The stimulus-reward contingency is reversed every 100 trials. The criterion for learning is set to 28 correct trials in 30 successive trials for the initial learning and 24 correct trials in 30 successive trials for subsequent reversals.
The input layer units represent the identities of the two options and the reward. An option unit’s response is set to 1 if the corresponding option is chosen in the current trial and to 0 otherwise. The reward unit’s response is set to 1 if the choice is rewarded in the current trial. The output of the network indicates its choice for the next trial. The network parameters are set as follows: time constant τ = 100 ms, network gain g = 2, training threshold y_th = 0.4, temperature parameter β = 4, learning rate η = 0.001, noise gain σ_noise = 0.01, and initial noise gain σ_ini = 0.01.
The selectivity of neurons in the SEL is determined at the time point when the decision is made. A unit is defined as selective for a certain input, or a combination of inputs, if its response is significantly higher when that input (or every input in the combination) is set to 1 than when it is set to 0.
Two-stage Markov decision task
The network has to make a choice between options A1 and A2. A1 leads to intermediate outcome B1 with a probability of 80%, and to B2 with a probability of 20%. Conversely, option A2 leads to B2 with a probability of 80%, and to B1 with a probability of 20%. The contingency between the options (A1, A2) and the intermediate outcomes (B1, B2) is fixed. Initially, B1 leads to a reward with a probability of 80% and B2 leads to a reward with a probability of 20%. The reward contingency is reversed every 50 trials.
The input layer contains 6 units, representing the identities of the two first-stage options A1 and A2, the two intermediate outcomes B1 and B2, and the reward and non-reward conditions, respectively. The activity of option unit A1 or A2 is set to 1 when the respective option is chosen. The activity of intermediate outcome unit B1 or B2 is set to 1 when the respective intermediate outcome is presented. The reward unit’s activity is set to 1 when a reward is obtained, and the non-reward unit’s activity is set to 1 when no reward is obtained. The units are activated sequentially, reflecting the sequential nature of the task. The A units are activated between 200 and 700 ms after a trial starts, the B units between 700 and 1200 ms, and the reward and non-reward units between 1200 and 1700 ms.
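For concreteness, the sketch below builds the six-channel input time series for one hypothetical trial (A1 chosen, intermediate outcome B2, no reward) using the timing given above.

```python
import numpy as np

dt = 1.0                                   # ms
T = int(1700 / dt)
channels = ['A1', 'A2', 'B1', 'B2', 'R', 'N']
I = np.zeros((T, len(channels)))           # one row per time step, one column per input unit

def set_window(name, t_on, t_off, value=1.0):
    I[int(t_on / dt):int(t_off / dt), channels.index(name)] = value

set_window('A1', 200, 700)                 # first-stage choice
set_window('B2', 700, 1200)                # intermediate outcome
set_window('N', 1200, 1700)                # non-reward unit active (no reward obtained)
```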
The output of the network indicates its choice. The network parameters are set as follows: time constant τ = 500 ms, network gain g = 2.25, training threshold y_th = 0.2, temperature parameter β = 2, learning rate η = 0.001, noise gain σ_noise = 0.01, and initial noise gain σ_ini = 0.01.
The selectivity of neurons in the SEL is determined at the time point when the decision is made. There are 8 conditions in this task, namely A1B1R, A1B1N, A2B1R, A2B1N, A1B2R, A1B2N, A2B2R, and A2B2N. For example, A1B1R indicates the condition in which A1 is chosen, intermediate outcome B1 is presented, and a reward is obtained. A neuron’s preferred condition is the condition under which its activity is the largest and significantly higher than its activity under any other condition. The neurons are then grouped into different categories based on their preferred conditions. The neurons in category A1R are the neurons whose preferred condition is A1B1R, A1B2R, A2B1N, or A2B2N. All the preferred conditions of the neurons in category A1R provide evidence for associating A1 with the reward. Similarly, the preferred conditions of the neurons in category B1N are A1B1N, A1B2R, A2B1N, or A2B2R. They provide evidence that B1 is not associated with the reward.
Model-based model fitting
In order to test for model-based learning, we fit our data with the model introduced by Daw et al. (Daw et al., 2011). The model fits the behavioral results with a mixture of model-free and model-based learning algorithms. In our simplified task, the network makes only one choice in each trial. The inverse temperature parameter β1 is set to 2, which is also used to produce the simulated behavioral choices. The parameter p, which captures the tendency for perseveration and switching, is set to 0, although all conclusions still hold when p is allowed to vary. The free parameters relevant to our task are α1, α2, λ, and w. α1 and α2 are the learning rates in the model-free and model-based learning algorithms, respectively. The eligibility parameter λ determines how large a proportion of the credit from the reward is assigned to the first-stage states and actions in our task paradigm. w is the weight for model-based learning. When w equals 1, the behavior is purely model-based. When w equals 0, the behavior is purely model-free. The fitting is done with a maximum likelihood estimation procedure.
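The sketch below illustrates the structure of such a fit, simplified to the single choice per trial in our task. Only the parameter set (α1, α2, λ, w, β1) and the fixed 0.8/0.2 transition probabilities come from the description above; the specific update equations for the value estimates, the placeholder trial data, and the use of scipy for the maximum likelihood step are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

P_trans = np.array([[0.8, 0.2],    # P(B1, B2 | A1)
                    [0.2, 0.8]])   # P(B1, B2 | A2)
beta1 = 2.0

def neg_log_likelihood(params, trials):
    """trials: list of (choice, intermediate_state, reward), coded as 0/1 integers."""
    alpha1, alpha2, lam, w = params
    q_mf = np.zeros(2)             # model-free first-stage values
    v_s = np.zeros(2)              # values of the intermediate states B1, B2
    nll = 0.0
    for a, s, r in trials:
        q_mb = P_trans @ v_s                         # model-based values from the transition model
        q_net = w * q_mb + (1 - w) * q_mf            # weighted mixture of the two systems
        p = np.exp(beta1 * q_net)
        p /= p.sum()
        nll -= np.log(p[a])
        # assumed TD-style updates for the two value estimates
        q_mf[a] += alpha1 * (v_s[s] - q_mf[a]) + alpha1 * lam * (r - v_s[s])
        v_s[s] += alpha2 * (r - v_s[s])
    return nll

# example fit on hypothetical data
trials = [(0, 0, 1), (0, 1, 0), (1, 1, 1), (0, 0, 1), (1, 0, 0)]
res = minimize(neg_log_likelihood, x0=[0.5, 0.5, 0.5, 0.5],
               args=(trials,), bounds=[(0.0, 1.0)] * 4)
print(res.x)                        # fitted alpha1, alpha2, lambda, w
```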
Model-based index
Inspired by the factorial analysis of Daw et al. (Daw et al., 2011), we define a model-based (MB) index to quantify the tendency of repeating the previous trial's choice under different situations. The combination of the two reward outcomes and the two types of intermediate outcomes, common and rare, gives four possible trial categories: common-rewarded (CR), common-unrewarded (CN), rare-rewarded (RR) and rare-unrewarded (RN). In model-based learning, the agent is more likely to repeat the previous choice if the last trial was a CR or an RN trial. A higher MB index means that the behavioral pattern is more similar to mbRL behavior.
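A form consistent with this description, stated here as an assumption since the exact expression is not reproduced above, is the difference in stay probabilities between the model-based-consistent and model-based-inconsistent outcome categories:

\[ \text{MB index} = \big[ P(\text{stay} \mid CR) + P(\text{stay} \mid RN) \big] - \big[ P(\text{stay} \mid CN) + P(\text{stay} \mid RR) \big] . \]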
Value-based economic choice task
Unlike in the two previous paradigms, both options in this paradigm lead to a reward. Two input units represent the rewards associated with the two options, respectively. The input strength is proportional to the reward magnitude. In our simulations, reward A is valued twice as much as reward B for the same reward magnitude. The relative value preference between the two options is not provided as an input to the network directly, but is used in calculating the expected value. The value of a reward is defined as the product of the relative value and the reward magnitude.
The activity of the input unit, $f(t)$, follows the form used in Rustichini and Padoa-Schioppa (Rustichini & Padoa-Schioppa, 2015), in which $t$ is the time in ms within a trial, $mag_{r_i}$ is the magnitude of reward type $i$ in each trial, $\max(mag_{r_i})$ is the maximal reward magnitude of reward type $i$ within the block, and $\min(mag_{r_i})$ is the minimal reward magnitude of reward type $i$, which is always 0 in our simulations. The expected value is the sum over options of the product of the probability of choosing the option and the corresponding reward value:

\[ E[r] = \gamma\, p_A\, m_A + p_B\, m_B , \]

where $p_i$ and $m_i$ are the probability of choosing option $i$ and its reward magnitude, and $\gamma = 2$ is the relative value preference between the two reward options. Only data from trials after 6000 training trials are included in the analyses. The network parameters are set as follows: time constant τ = 100 ms, network gain g = 2.5, training threshold y_th = 0.2, temperature parameter β = 4, learning rate η = 0.005, noise gain σ_noise = 0.05, and initial noise gain σ_ini = 0.2.
As in Padoa-Schioppa and Assad (Padoa-Schioppa & Assad, 2006), the following variables are defined for further analysis: total value (the sum of the values of the two options), chosen value (the value of the chosen option), other value (the value of the unchosen option), value difference (chosen minus other value), value ratio (other/chosen value), offer value (the value of one option), chosen juice (the identity of the chosen option), and value A chosen (the value of option A when option A is chosen).
We use an analysis similar to that in Padoa-Schioppa and Assad (Padoa-Schioppa & Assad, 2006) to study the selectivity of SEL units during the post-offer period (0-500 ms after the stimulus onset). Linear regressions are applied to each variable to fit the neural responses in this time window for each SEL unit separately. A variable is considered to explain the response of a neuron if the slope of the fitted linear function is significantly different from zero.
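A sketch of this per-unit regression analysis follows. The response matrix and trial variables are random placeholders, and the use of scipy.stats.linregress and a 0.05 significance threshold are implementation choices for illustration, not part of the original analysis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_trials, n_units = 200, 500
responses = rng.random((n_trials, n_units))          # mean rate per unit, 0-500 ms post-offer
variables = {                                         # one value per trial (placeholders)
    'chosen value': rng.random(n_trials),
    'offer value A': rng.random(n_trials),
    'chosen juice': rng.integers(0, 2, n_trials).astype(float),
}

alpha = 0.05
selective = {name: [] for name in variables}
for i in range(n_units):
    for name, x in variables.items():
        slope, intercept, r_value, p_value, stderr = stats.linregress(x, responses[:, i])
        if p_value < alpha:                           # slope significantly different from zero
            selective[name].append(i)

print({name: len(units) for name, units in selective.items()})
```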
Acknowledgements
This work is supported by the CAS Hundreds of Talents Program and Science and Technology Commission of Shanghai Municipality (15JC1400104) to T. Y., and by Public Projects of Zhejiang Province (2016C31G2020069) and the 3rd Level in Zhejiang Province “151 talents project” to Z. C. We thank Yu Shan and Xiao-Jing Wang for discussions and comments during the study. The authors declare no competing financial or nonfinancial interests.