ABSTRACT
Dopamine (DA) neurons in the ventral tegmental area (VTA) and substantia nigra (SNc) encode reward prediction errors (RPEs) and are proposed to mediate error-driven learning. However, the learning strategy engaged by DA-RPEs remains controversial. Model-free associations imbue cues or actions with pure value, independently of representations of their associated outcome. In contrast, model-based associations support detailed representations of anticipated outcomes. Here we show that although both VTA and SNc DA neuron activation reinforces instrumental responding, only VTA DA neuron activation during consumption of expected sucrose reward restores error-driven learning and promotes formation of a new cue→sucrose association. Critically, expression of VTA DA-dependent Pavlovian associations is abolished following sucrose devaluation, a signature of model-based learning. These findings reveal that activation of VTA- or SNc-DA neurons engages largely dissociable learning processes, with VTA-DA neurons capable of participating in model-based predictive learning, while the role of SNc-DA neurons appears limited to reinforcement of instrumental responses.
INTRODUCTION
Midbrain dopamine (DA) neurons respond in a characteristic fashion to reward, with increased phasic firing in response to unexpected rewards or reward-predicting cues, little or no response to perfectly predicted rewards, and pauses in firing when predicted rewards fail to materialize1–3. This pattern of response largely complies with the concept of a signed reward prediction error (RPE), an error-correcting teaching signal featured in contemporary theories of associative learning4–8. Therefore, it has been suggested that the error signal carried by phasic DA responses and broadcast to forebrain regions constitutes the neural implementation of such theoretical teaching signals5,9. In support of this hypothesis, recent optogenetic studies showed that activation or inhibition of ventral tegmental area (VTA) DA neurons mimics positive or negative RPEs, respectively, and affects Pavlovian appetitive learning accordingly10,11. However, the specific learning strategy engaged by DA teaching signals remains controversial.
Computational models discriminate between two separate forms of error-driven learning7,12. In model-free learning, RPEs imbue predictive cues with a common-currency cached value determined by the value of the outcome during training. This form of learning does not allow for a representation of the specific identity of the outcome; therefore, expression of this learning is independent of the desire for that specific outcome at the time of test. Alternatively, in model-based learning, error signals contribute to the construction of internal models of the causal structure of the world, allowing predictive cues to signal the specific identity of their paired outcome. As a result, the expression of model-based learning is motivated by an internal representation of a specific outcome and anticipation of its inferred current value.
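The behavioral contrast between these two strategies, including their differing sensitivity to outcome devaluation, can be sketched in a few lines of code. This is purely illustrative: the agent classes, learning rate, and outcome values are hypothetical and are not taken from the present experiments.

```python
# Illustrative sketch: model-free vs. model-based Pavlovian learning.
# All names and parameter values are hypothetical.

ALPHA = 0.5  # learning rate (arbitrary)

class ModelFreeAgent:
    """Caches a scalar value for the cue; blind to outcome identity."""
    def __init__(self):
        self.cached_value = 0.0

    def learn(self, outcome_value):
        # RPE = experienced value minus cached prediction
        rpe = outcome_value - self.cached_value
        self.cached_value += ALPHA * rpe

    def respond(self, current_outcome_values):
        # Responding driven by the cached value, regardless of the
        # outcome's worth at test time
        return self.cached_value

class ModelBasedAgent:
    """Learns which outcome the cue predicts; evaluates it at test."""
    def __init__(self):
        self.predicted_outcome = None

    def learn(self, outcome_identity):
        self.predicted_outcome = outcome_identity

    def respond(self, current_outcome_values):
        # Responding reflects the *current* value of the expected outcome
        if self.predicted_outcome is None:
            return 0.0
        return current_outcome_values[self.predicted_outcome]

mf, mb = ModelFreeAgent(), ModelBasedAgent()
for _ in range(20):          # training: cue -> sucrose (value 1.0)
    mf.learn(1.0)
    mb.learn("sucrose")

values = {"sucrose": 0.0}    # outcome devaluation (e.g., taste aversion)
print(mf.respond(values))    # near 1.0: model-free responding persists
print(mb.respond(values))    # 0.0: model-based responding collapses
```

The devaluation test used later in this study exploits exactly this divergence: only responding driven by a representation of the specific outcome should collapse when that outcome loses its value.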
The role of DA teaching signals in model-free and model-based processes remains unclear13–15. Since the original discovery that DA neurons track changes in expected value, phasic dopamine signals have predominantly been interpreted as model-free RPEs, promoting pure value assignment. Consistent with this view, direct activation of DA neurons serves as a potent reinforcer of instrumental behavior in self-stimulation procedures11,16–21.
More recently, however, a contribution of phasic DA signals to model-based learning has also been suggested. This is based on growing evidence that DA neurons have access to higher-order knowledge, beyond observable stimuli, for the computation of RPEs22–25. Moreover, DA neurons were recently shown to respond to valueless changes in the sensory features of an expected reward26. While these studies reveal model-based influences in DA error signal computation, the exact associative content promoted by these DA signals is uncertain. A recent study intriguingly showed that, in the absence of a valuable outcome, phasic activation of DA neurons promotes a model-based association between two neutral cues. Since the cues were neutral, there was no opportunity for model-free, value-based conditioning. It remains to be determined how DA signals contribute to associative learning when subjects are actively learning about value-laden rewarding outcomes, the canonical situation in which DA signals are robustly observed, and in which both model-free and model-based strategies are possible.
Potentially relevant to questions about the model-free or model-based nature of DA-induced learning is the proposed functional heterogeneity of DA neurons based on anatomical location. Indeed, while RPEs are relatively uniformly encoded across midbrain DA neurons, different contributions to learning have been proposed for VTA and substantia nigra (SNc) DA neurons based on the distinct ventral and dorsal striatal targets of these neurons27–30. Note, however, that unlike VTA-DA neurons, for which a role in prediction learning is established, the contributions of phasic activity in SNc-DA neurons to error-driven prediction learning remain uncertain.
Therefore, the purpose of the present study was twofold: 1) to assess the contribution of VTA- and SNc-DA neuron activation to Pavlovian reward learning, and 2) when learning was observed as a result of our manipulations, to determine the model-free or model-based nature of this learning.
To accomplish these goals, rats were trained in a blocking paradigm in which the formation of an association between a target cue and a paired reward is prevented, or blocked, if this cue is presented simultaneously with another cue that already signals reward. In this situation, the absence of RPEs, presumably reflected in the absence of a DA response, is thought to prevent learning about the target cue. We sought to unblock learning by restoring RPEs, either endogenously by increasing the magnitude of reward, or by optogenetically activating VTA- or SNc-DA neurons during reward consumption. When successful unblocking was observed as a result of our manipulations, we assessed the model-free or model-based nature of the newly learned association by determining its sensitivity to post-conditioning outcome devaluation.
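The logic of blocking and unblocking follows from error-correcting learning rules such as Rescorla-Wagner, in which all cues present on a trial share a common prediction error. The following minimal sketch is illustrative only; the learning rate, reward magnitudes, and trial counts are arbitrary choices, not the parameters of the experiments reported here.

```python
# Rescorla-Wagner sketch of blocking and unblocking (illustrative).
# Each trial, every cue present is updated by a shared prediction error.

ALPHA = 0.2  # learning rate (arbitrary)

def train(weights, cues, reward, n_trials):
    for _ in range(n_trials):
        prediction = sum(weights[c] for c in cues)
        rpe = reward - prediction          # error shared by all present cues
        for c in cues:
            weights[c] += ALPHA * rpe

V = {"A": 0.0, "B": 0.0, "X": 0.0, "Y": 0.0}

# Stage 1: individual cue training
train(V, ["A"], reward=1.0, n_trials=50)   # A fully predicts a large reward
train(V, ["B"], reward=0.3, n_trials=50)   # B predicts a small reward

# Stage 2: compound training
train(V, ["A", "X"], reward=1.0, n_trials=50)  # no RPE -> X stays blocked
train(V, ["B", "Y"], reward=1.0, n_trials=50)  # upshift RPE -> Y is unblocked

print(round(V["X"], 3))  # near 0: learning about X was blocked
print(round(V["Y"], 3))  # well above 0: learning about Y was unblocked
```

In this scheme, suppressing the RPE (as when cue A already predicts the full reward) blocks learning about the added cue, while restoring the RPE, whether by reward upshift or, hypothetically, by exogenous DA stimulation standing in for it, unblocks learning.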
RESULTS
Phasic activation of VTA- but not SNc-DA neurons mimics reward prediction errors and promotes Pavlovian learning
Three groups of rats (Reward Upshift, n = 24; VTA-DA Stim, n = 20; SNc-DA Stim, n = 16) were trained in a Pavlovian blocking/unblocking task (Fig. 1). In the first stage of this task, two auditory cues, A and B, were presented individually and followed by the delivery of a sucrose reward. For the Reward Upshift group, the quantity of sucrose associated with these cues differed: cue A signaled a large sucrose reward (3 × 0.1 ml), while cue B signaled a small sucrose reward (0.1 ml). This was done so that the subsequent upshift of sucrose reward during the compound BY (from small to large reward) would cause an endogenous RPE and presumably unblock learning about the target cue Y. For the other two groups (VTA-DA Stim and SNc-DA Stim), cues A and B both signaled a large sucrose delivery, which, in the absence of further manipulation, should prevent endogenous RPEs during the subsequent compound phase. Subjects acquired conditioned responding rapidly, as indicated by the time spent in the reward port during cue presentation (Fig. 2). Note that during reinforced sessions, conditioned behavior was measured during the first 9 s of cue presentation (before reward was delivered), in order to capture reward anticipation and avoid contamination by consumption behavior. In the Reward Upshift group, responding to cue A was greater than to cue B (average for last 4 days of individual cue training, T = 9.703, P < 0.001), which is consistent with the different reward magnitudes associated with these two cues. This difference in responding was not observed in the VTA and SNc stim groups as, in these groups, both cues signaled a large reward (Ps = 1.000; average last 4 days of individual cue training).
In the second stage of the procedure, two distinct visual cues (X and Y) accompanied the auditory cues to form the compounds AX and BY. Both of these compound cues were paired with a large sucrose reward. For all subjects, the addition of cue X was redundant: a large reward was expected, and obtained, on the basis of cue A alone. Therefore, in the absence of a prediction error during AX trials, learning about the target cue X should be blocked. In contrast, the introduction of cue Y coincided with a prediction error. For the Reward Upshift group, the transition from small to large reward (from cue B to compound BY) is thought to create an endogenous prediction error that unblocks learning about the target cue Y. For the other two groups, we sought to artificially recreate a normally absent prediction error by optogenetically activating VTA- or SNc-DA neurons during reward consumption in BY trials. For all groups, the transition from individual-cue to compound trials was accompanied by a general increase in conditioned responding (A vs. AX, and B vs. BY, Ps < 0.001), possibly reflecting both changes in associative weight and the unconditioned arousing and/or disinhibiting properties of the novel visual cues31.
Finally, to assess the associative strength acquired by each individual cue following reward upshift or DA neuron optogenetic activation, all rats underwent a probe test in which all cues were presented separately and in the absence of sucrose reinforcement (Fig. 3). Conditioned behavior was measured during the entire 30-s cue. A two-way mixed ANOVA (Group × Cue) revealed a main effect of Group (F2,57 = 13.818, P < 0.01) and Cue (F3,171 = 17.997, P < 0.01) and a significant interaction between these factors (F6,171 = 11.050, P < 0.01). Follow-up one-way RM ANOVAs conducted separately on each experimental group revealed, for each group, a significant effect of cue type on responding (Reward Upshift: F3,69 = 22.078, P < 0.001; VTA-DA stimulation: F3,57 = 11.634, P < 0.001; SNc-DA stimulation: F3,45 = 7.836, P < 0.001). Post hoc comparisons confirmed that responding to the ancillary cues A and B was as expected: subjects in the Reward Upshift group responded more to A than to B (T = 5.373, P < 0.001), and subjects in the other two groups responded equally to these cues (VTA-DA stimulation: T = 0.904, P = 1.000; SNc-DA stimulation: T = 0.537, P = 1.000), which is consistent with the magnitude of reward paired with these cues during training. Of primary interest are the responses to the target cues X and Y. In the Reward Upshift group, the surprising increase in reward magnitude during the BY compound unblocked learning, resulting in greater conditioned responding to Y than to X (T = 5.841, P < 0.001). Note that both cues, X and Y, benefited from equal pairing with the sucrose reward during the compound phase; only the presence or absence of an RPE during these cues differed, promoting or blocking learning, respectively. Stimulation of VTA-DA neurons during sucrose consumption in the presence of the BY compound also resulted in greater responding to Y than to X (T = 5.334, P < 0.001).
These results indicate that phasic activation of VTA-DA neurons mimicked endogenous RPEs and unblocked learning. In contrast, activation of SNc-DA neurons did not unblock Pavlovian learning; subjects responded equally to X and Y (T = 0.344, P = 1) and responding to these cues was low (< 10% of cue time spent in port, on any trial). Analysis of the rate of port entries during the cues (a different metric for the assessment of Pavlovian conditioned approach) yielded essentially similar results (Fig. S1). Note, however, that unlike the time in port, the rate of port entries did not follow a monotonic increase during Pavlovian training, making this metric a somewhat ambiguous readout for changes in associative strength. Therefore, we chose to focus our primary analyses on time in port.
To directly compare the consequences of endogenous RPEs and DA neuron activation on Pavlovian learning, we calculated for all individuals an unblocking score, defined here as the difference in responding between Y and X (unblocked – blocked; using time in port as the measure of responding) (Fig. S2). We then compared this value between groups; while we found a general group effect (F2,57 = 8.247, P < 0.001), post hoc analysis found no difference between the Reward Upshift and VTA-DA stimulation groups (T = 0.817, P = 1), indicating equal unblocking as a result of these two manipulations. In contrast, the unblocking score of the SNc-DA stimulation group differed from that of the other two groups (all Ps ≤ 0.01), which confirms the functional dissociation between VTA- and SNc-DA neurons.
In certain conditions, cues paired with natural reward or with DA neuron stimulation can elicit behaviors that are not directed towards the reward port, such as orienting to the cue, rearing, and general locomotion/rotations32,33. To determine the role of endogenous as well as optically induced RPEs in the acquisition of these behaviors, we recorded and analyzed animals’ behavioral responses to cues X and Y during the probe test. While the target cues occasionally evoked orienting, rearing, or rotations, these behaviors were equally frequent in response to cues X and Y (Fig. S3), suggesting that, under these experimental parameters, these behaviors are not conditioned responses but rather reflect the intrinsic (unconditioned) salient properties of the cues.
After completion of the unblocking task, we assessed the reinforcing properties of VTA- and SNc-DA neuron activation in an intracranial self-stimulation (ICSS) task in which rats could respond on one of two nosepokes to obtain a 1-s optical stimulation of DA neurons (Fig. 4). In agreement with previous studies16,20,21,33, we found that activation of both VTA- and SNc-DA neurons serves as a potent reinforcer of ICSS behavior. A 3-way mixed ANOVA (Group × Day × Nosepoke) conducted on responding over the course of two daily ICSS sessions revealed a clear preference for the active nosepoke (F1,34 = 45.522, P < 0.001) and a Nosepoke × Day interaction (F1,34 = 54.789, P < 0.001), as responding at the active nosepoke increased over time (T = 10.712, P < 0.001, Bonferroni post hoc tests) while responding at the inactive nosepoke remained virtually absent (T = 0.0414, P = 0.967, Bonferroni post hoc tests). Critically, we found no main effect of Group (F1,34 = 0.876, P = 0.356) or interaction with Group (Group × Day: F1,34 = 0.244, P = 0.625; Group × Nosepoke: F1,34 = 0.777, P = 0.384; Group × Day × Nosepoke: F1,34 = 0.270, P = 0.607), indicating that stimulation of VTA- and SNc-DA neurons is equally reinforcing in the ICSS paradigm.
Together, these results show that while VTA- and SNc-DA neuron activations are equally potent reinforcers of instrumental behavior, only VTA-DA neuron activation mimics endogenous RPEs in promoting error-correcting Pavlovian learning (unblocking).
Activation of VTA-DA neurons engages model-based learning
Although we demonstrated above that endogenous RPEs induced by reward upshift and optogenetic activation of VTA-DA neurons result in numerically comparable unblocking effects, the learning strategy engaged by these two manipulations might be different. In model-free algorithms, RPEs imbue predictive cues with a scalar cached value, resulting in conditioned responses largely independent of the outcome value at the time of test. Alternatively, in model-based accounts, predictive cues come to signal the specific identity of their paired outcome, resulting in conditioned responses motivated by the sensorily rich representation of the outcome and its current value. To determine the learning strategy recruited by endogenous RPEs, or VTA-DA neuronal activation, we assessed the effect of devaluing the sucrose outcome on responding to Y, the unblocked cue. New groups of rats were trained in the blocking/unblocking task previously described: learning about cue Y was unblocked either by reward upshift (n = 24) or by VTA-DA neuron stimulation (n = 23) during the BY compound. At the end of the compound training phase, rats in each group were assigned to one of two conditions. Subjects in the “devalued” condition had sucrose devalued by pairing its consumption in the homecage with LiCl-induced nausea (conditioned taste aversion). For the subjects in the “valued” condition, sucrose consumption and LiCl-induced nausea occurred on alternate days, which preserved the value of the sucrose outcome (Fig. 5, Fig. S4). Two days after the final LiCl injection, all rats were tested for conditioned responding to Y (unblocked cue) and to A (ancillary cue paired with large reward) in separate probe sessions, with the order of testing counterbalanced.
A 3-way mixed ANOVA (Group × Devaluation × Cue) conducted on the time in port during the cues revealed a main effect of Cue (F1,43 = 6.119, P = 0.017) and of Devaluation (F1,43 = 10.707, P = 0.002) as well as an interaction between these factors (F1,43 = 4.750, P = 0.035). This interaction was due to a significant influence of the devaluation procedure on responding to the unblocked cue, Y (T = 3.563, P < 0.001), but not on the ancillary cue, A (T = 0.514, P = 0.609). Reduced responding to Y after sucrose devaluation indicates that this response is normally motivated by the representation of the sucrose outcome and anticipation of its current value (model-based process). Critically, we found no main effect of Group (F1,43 = 0.869, P = 0.356) or interaction with Group (Group × Devaluation: F1,43 = 0.005, P = 0.943; Group × Cue: F1,43 = 0.000, P = 0.993; Group × Devaluation × Cue: F1,43 = 0.339, P = 0.564). Planned contrast analyses independently confirmed that, for each group, sucrose devaluation reduced responding to the unblocked cue Y (Reward Upshift: T = 2.559, P = 0.018; VTA-DA Stim.: T = 2.116, P = 0.046), but not to A (Reward Upshift: T = 1.126, P = 0.272; VTA-DA Stim.: T = 0.018, P = 0.986). Analysis of the rate of port entries during the cues yielded essentially similar results (Fig. S5). Note that VTA-DA valued and devalued subjects later displayed similar ICSS behavior (Fig. S6), which indicates that the reduced responding to the unblocked cue Y in devalued subjects cannot be explained by reduced efficiency of the optical stimulation, and thus reduced unblocking, in those animals. These results indicate that both endogenous RPEs and VTA-DA neuronal activation during sucrose consumption promoted the formation of model-based associations and conferred on cue Y the ability to evoke a representation of the sucrose outcome.
DISCUSSION
We have shown that activation of VTA-, but not SNc-, DA neurons mimics RPEs and promotes the formation of model-based cue-reward associations. We used a Pavlovian blocking procedure, in which the formation of a cue-reward association is normally blocked by the absence of an RPE (the reward being signaled by other predictive stimuli in the environment). Confirming and extending a previous study from our lab11, we showed that restoring RPEs, either endogenously, by increasing the magnitude of the sucrose reward, or by optogenetic activation of VTA-DA neurons, unblocks learning and promotes the formation of a cue-reward association. In a separate experiment, we probed the content of this newly formed association by assessing its sensitivity to outcome devaluation. We found that following unblocking by reward upshift, or by VTA-DA stimulation, the expression of the unblocked learning was sensitive to the current value of the outcome; post-unblocking devaluation of the sucrose outcome almost entirely abolished responding to the unblocked cue. This indicates that both manipulations (reward upshift or VTA-DA stimulation) promote the formation of model-based associations that integrate a representation of the specific identity of the rewarding outcome. In stark contrast, we showed that optogenetic activation of SNc-DA neurons failed to promote Pavlovian learning, i.e., learning remained blocked, despite the fact that activation of both VTA- and SNc-DA neurons serves as a potent reinforcer in self-stimulation procedures.
These results are consistent with a recent study by Sharpe and colleagues showing that phasic VTA-DA responses mediate the formation of an association between two neutral stimuli (A→B), a form of learning that is necessarily model-based since it involves only identity and not value34. The status of this association was then assessed by pairing one of the stimuli with a food reward (B→food) and testing the conditioned responding to the other stimulus (A); food-seeking responses evoked by the target cue revealed a learned association between the two stimuli and inference of upcoming food reward (i.e., if A→B and B→food, then A→food). While Sharpe et al. demonstrated for the first time that VTA-DA signals can promote an association between neutral stimuli, their study did not address the nature of reward encoding in DA-dependent associations. Indeed, although their study involved a natural reward, it was used simply as a necessary means to reveal stimulus-stimulus associations, and was not the object of DA manipulations. This distinction is important because, unlike stimulus-stimulus associations that are by definition model-based, cue-reward associations can be encoded in a model-free or model-based manner. Therefore, the possibility remained that while capable of promoting model-based learning when only sensory information is available, VTA-DA signals nevertheless preferentially engage model-free learning when (model-free) value can be encoded. In the present study, optogenetic activation of DA neurons was used to promote direct cue-reward associations, a form of learning that presents the opportunity for model-free and model-based algorithms. In these conditions, in which both learning strategies are equally valid, we showed that VTA-DA signals preferentially engage model-based learning.
Note, however, that our results do not preclude the participation of VTA-DA signals in model-free value assignment. Indeed, as shown here (ICSS experiment) and elsewhere16,33, in the absence of an external reward, the activation of VTA-DA neurons can imbue cues and actions with incentive/action value. Ultimately, and consistent with DA’s neuromodulatory role, the content of DA-induced learning is likely dependent on the nature of the information being encoded and processed in terminal regions when coincident DA surges occur. What we show here is that in the presence of an external reward, the recruitment of a model-based learning strategy is not an exception but rather a central feature of VTA-DA teaching signals. This is consistent with recent studies showing that treatments (pharmacological or dietary restrictions) that globally increase or decrease DA function promote or impair, respectively, model-based processes in humans35–37. Note, however, that these treatments also affect tonic DA levels, which could affect learning independently of the phasic error signals.
An intriguing aspect of our results is the dissociation between the unblocked cue Y and the ancillary cue A in terms of response strategy. Indeed, unlike cue Y, cue A evoked conditioned responding that was driven by model-free associations (not affected by sucrose devaluation). The reason for this dissociation is unknown, but might involve training history differences between these cues. Indeed, compared to cue Y, cue A benefited from an extensive training history (224 conditioning trials vs. 32 for cue Y), which has been shown to promote model-free learning, although generally in the context of instrumental conditioning38–40. Perhaps more interesting are the implications for the role of VTA-DA signals in learning. In the VTA-DA group, the cues A and B are equivalent up to the compound conditioning phase and, based on the lack of effect of devaluation on A, we can assume that responding to both cues is governed by model-free associations. Therefore, it appears that the activation of VTA-DA neurons promoted the formation of model-based associations about Y in subjects that were (presumably) currently engaged in model-free behavior during BY trials. This surprising result suggests that model-based associations could be formed “in the background”, independently of the strategy that governs behavior at the time these associations are formed, or through post-training event replay41. Alternatively, it could be that activation of VTA-DA neurons is sufficient to shift response strategy and restore model-based processing42. Further studies are required to address these questions.
Our results provide strong evidence for a functional dissociation between VTA- and SNc-DA neurons in appetitive learning. While activation of VTA-DA neurons unblocked Pavlovian learning, we found no evidence of unblocking following SNc-DA neuron activation, despite careful analysis of several behavioral responses (time in port, port entries, orienting, rearing, and locomotion). This contrasts with recent results from our lab showing that, in the absence of a natural reward, activation of VTA- or SNc-DA neurons during cue presentation promotes the development of conditioned cue-evoked locomotion33. An important point to consider when comparing these results is the behavior of the animals at the time of the stimulation. Although free movement was possible, animals in the present study were relatively immobile during DA stimulation because it occurred as they were consuming the sucrose reward. This absence of ambulatory movement during DA stimulation could have prevented the emergence of conditioned locomotion.
In contrast with the selective role of VTA-DA neurons in Pavlovian unblocking, we show here, in agreement with previous studies20,33, that instrumental behavior for ICSS can be supported by either VTA- or SNc-DA neuron stimulation. This partial dissociation between VTA- and SNc-DA neurons in Pavlovian and instrumental learning is reminiscent of the actor-critic reinforcement learning algorithm. This model is based on the idea of a division of labor between a prediction module and an action module, with distributed RPEs promoting learning in both modules but with different consequences (updating predictions vs. reinforcing actions). A possible neural implementation of the actor-critic algorithm has been suggested, with ventromedial (VMS) and dorsolateral (DLS) striatum functioning as prediction and action modules, respectively29. Consistent with this, we showed that activation of SNc-DA neurons, projecting predominantly to DLS, reinforces prior actions but has no influence on Pavlovian prediction learning, in agreement with the role of RPEs in an action module, while activation of VTA-DA neurons, projecting predominantly to VMS, promotes Pavlovian learning, in agreement with the role of RPEs in a prediction module. Because predictions are updated by RPEs but also influence RPE computation in return, the actor-critic model predicts that RPEs in the prediction module reinforce Pavlovian cues/states, which can then subsequently evoke back-propagated RPEs, including in the action module. A neural equivalent of this process, in which Pavlovian predictions encoded in the VMS feed back onto midbrain DA neurons (including SNc-DA neurons) and contribute to the propagation of an RPE teaching signal to more dorsal and lateral portions of the striatum, could contribute to the instrumental reinforcement induced by VTA-DA stimulation.
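The division of labor described above can be illustrated with a toy actor-critic loop in which a single RPE both updates the critic’s prediction (cf. VMS) and reinforces the actor’s chosen action (cf. DLS). This is a deliberately simplified sketch; the state, actions, and all parameter values are hypothetical and carry no claim about the underlying neural circuitry.

```python
# Toy actor-critic sketch (illustrative): one shared RPE trains both a
# prediction module (critic) and an action module (actor).
import random
random.seed(0)

ALPHA_V, ALPHA_PI, GAMMA = 0.2, 0.2, 0.9

V = {"cue": 0.0}                     # critic: state-value prediction
pref = {"press": 0.0, "wait": 0.0}   # actor: action preferences

def choose():
    # Greedy choice with a small chance of random exploration
    if random.random() < 0.1:
        return random.choice(list(pref))
    return max(pref, key=pref.get)

for _ in range(200):
    action = choose()
    reward = 1.0 if action == "press" else 0.0
    # One RPE, broadcast to both modules (terminal step: next-state value 0)
    rpe = reward + GAMMA * 0.0 - V["cue"]
    V["cue"] += ALPHA_V * rpe        # critic: update the prediction
    pref[action] += ALPHA_PI * rpe   # actor: reinforce the chosen action

print(pref["press"] > pref["wait"])  # the actor comes to favor the rewarded action
```

The key property mirrored here is that the same error signal serves two distinct functions: sharpening predictions in one module and stamping in actions in the other, which is the sense in which VTA- and SNc-DA signals could share an RPE code yet support dissociable learning.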
However, a critical difference between our results and the predictions of the actor-critic algorithm is that this algorithm is strictly model-free, whereas we have shown here that VTA-DA signals contribute to model-based Pavlovian learning. Therefore, our results suggest a hybrid model that incorporates both model-free and model-based processes, in which VTA-DA-dependent model-based predictions shape SNc-DA signals and train model-free instrumental learning43.
Finally, these results have important implications for our understanding of DA-related pathologies. Noisy/deregulated DA signals originating from the VTA, as has been observed in patients with schizophrenia44,45, could promote model-based associations between external and/or internal events that are merely coincident but not causally related, leading to the construction of internal world models that are out of touch with physical reality and are sources of delusional beliefs46. In contrast, the emergence of cue- or reward-evoked DA signals in the DLS, as has been reported after repeated drug use47–50, could contribute to the reinforcement of maladaptive drug-seeking responses that persist despite knowledge of their adverse consequences51,52.
AUTHOR CONTRIBUTIONS
R.K. and P.H.J. conceived the study; R.K., H.J.P, and N.B.S. carried out the experiments; R.K. analyzed the data; R.K. and P.H.J. wrote the manuscript with input from all the authors.
Author Information
The authors declare no competing financial interests. Correspondence and requests for materials should be addressed to R.K. (ronald.keiflin{at}jhu.edu) or P.H.J. (patricia.janak{at}jhu.edu).
ACKNOWLEDGEMENTS
This work was supported by National Institutes of Health grant DA035943.