Abstract
Cortico-basal-ganglia-thalamic (CBGT) networks are critical for adaptive decision-making, yet how changes to circuit-level properties impact cognitive algorithms remains unclear. Here we explore how dopaminergic plasticity at corticostriatal synapses alters competition between striatal pathways, impacting the evidence accumulation process during decision-making. Spike-timing dependent plasticity simulations showed that dopaminergic feedback based on rewards modified the ratio of direct and indirect corticostriatal weights within opposing action channels. Using the learned weight ratios in a full spiking CBGT network model, we simulated neural dynamics and decision outcomes in a reward-driven decision task and fit them with a drift-diffusion model. Fits revealed that the rate of evidence accumulation varied with inter-channel differences in direct pathway activity while boundary height varied with overall indirect pathway activity. This multi-level modeling approach demonstrates how complementary learning and decision computations emerge from corticostriatal plasticity.
Author summary Cognitive process models like reinforcement learning (RL) and the drift-diffusion model (DDM) have helped to elucidate the basic information-processing algorithms underlying error-corrective learning and the evaluation of accumulating decision evidence leading up to a choice. While these relatively abstract models help to guide experimental and theoretical probes into associated phenomena, they remain uninformative about the actual physical mechanics by which learning and decision algorithms are carried out in a neurobiological substrate during adaptive choice behavior. Here, we present an “upwards mapping” approach to bridging neural and cognitive models of value-based decision making, showing how dopaminergic feedback alters the network-level dynamics of cortico-basal-ganglia-thalamic (CBGT) pathways during learning to bias behavioral choice towards more rewarding actions. By mapping “up” the levels of analysis, this approach yields specific predictions about aspects of neuronal activity that map to the quantities appearing in the cognitive decision-making framework.
1 Introduction
The flexibility of mammalian behavior showcases the dynamic range over which neural circuits can be modified by experience and the robustness of the emergent cognitive algorithms that guide goal-directed actions. Decades of research in cognitive science have independently detailed the algorithms of decision-making (e.g., accumulation-to-bound models, [1]) and reinforcement learning (RL; [2, 3]), providing foundational insights into the computational principles of adaptive decision-making. In parallel, research in neuroscience has shown how the selection of actions, and the use of feedback to modify selection processes, both rely on a common neural substrate: cortico-basal ganglia-thalamic (CBGT) circuits [4–8].
Understanding how the cognitive algorithms for adaptive decision-making emerge from the circuit-level dynamics of CBGT pathways requires a careful mapping across levels of analysis [9], from circuits to algorithm (see also [10, 11]). Previous simulation studies have demonstrated how the specific circuit-level computations of CBGT pathways map onto sub-components of the multiple sequential probability ratio test (MSPRT; [5, 12]), a simple algorithm of information integration that selects single actions from a competing set of alternatives based on differences in input evidence [13, 14]. Allowing a simplified form of RL to modify corticostriatal synaptic weights results in an adaptive variant of the MSPRT that approximates the optimal solution to the action selection process based on both sensory signals and feedback learning [15, 16]. Previous attempts at multi-level modeling have largely adopted a “downwards mapping” approach, whereby the stepwise operations prescribed by computational or algorithmic models are intuitively mapped onto plausible neural substrates. Recently, Frank [17] proposed an alternative “upwards mapping” approach for bridging levels of analysis, where biologically detailed models are used to simulate behavior that can be fit to a particular cognitive algorithm. Rather than ascribing different neural components with explicit computational roles, this variant of multi-level modeling examines how cognitive mechanisms are influenced by changes in the functional dynamics or connectivity of those components. A key assumption of the upwards mapping approach is that variability in the configuration of CBGT pathways should drive systematic changes in specific sub-components of the decision process, expressed by the parameters of the drift-diffusion model (DDM; [1]).
Indeed, by fitting the DDM to synthetic choice and response time data generated by a rate-based CBGT network, Ratcliff and Frank [18] showed how variation in the height of the decision threshold tracked with changes in the strength of subthalamic nucleus (STN) activity. Thus, this example shows how simulations that map up the levels of analysis can be used to investigate the emergent changes in information processing that result from targeted modulation of the underlying neural circuitry.
Motivated by the predictions of a recently proposed Believer-Skeptic hypothesis of CBGT pathway function [7], we utilize the upwards mapping approach to modeling adaptive choice behavior across neural and cognitive levels of analysis (Figure 1). The Believer-Skeptic hypothesis posits that competition between the direct (Believer) and indirect (Skeptic) pathways within an action channel encodes the degree of uncertainty for that action. This competition is reflected in the drift rate of an accumulation-to-bound process (see [19]). Over time, dopaminergic (DA) feedback signals can sculpt the Believer-Skeptic competition to bias decisions towards the behaviorally optimal target [15]. To explicitly test this prediction, we first modeled how phasic DA feedback signals [20] can modulate the relative balance of corticostriatal synapses via spike-timing dependent plasticity (STDP; [21, 22]), thereby promoting or deterring action selection. The effects of learning on the synaptic weights were subsequently implemented in a spiking model of the full CBGT network meant to accurately capture the known physiological properties and connectivity patterns of the constituent neurons in these circuits [23]. The performance (i.e., accuracy and response times) of the CBGT simulations was then fit using a hierarchical DDM [24]. This progression from synapses to networks to behavior allows us to explicitly test the mechanistic predictions of the Believer-Skeptic hypothesis by mapping how specific features of striatal activity that result from reward-driven changes in corticostriatal synaptic weights could underlie parameters of the fundamental cognitive algorithms of decision-making.
2 Results
2.1 STDP network results
To evaluate how dopaminergic plasticity impacts the efficacy of corticostriatal synapses, we modeled learning using a spike-timing dependent plasticity (STDP) paradigm in a simulation of corticostriatal networks implementing a simplified two-alternative forced-choice task. In this scenario, one of two available actions, which we call left (L) and right (R), was selected by the spiking of model striatal medium spiny neurons (MSNs; Subsection 4.1.3). These model MSNs were grouped into action channels receiving inputs from distinct cortical sources (Figure 1, left). Every time an action was selected, dopamine was released, after a short delay, at an intensity proportional to a reward prediction error (equations 9 and 10). All neurons in the network experienced this non-targeted increase in dopamine, emulating striatal release of dopamine by substantia nigra pars compacta neurons, leading to plasticity of corticostriatal synapses (equation 8; see Figure 10).
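The reward-prediction-error computation that drives dopamine release can be sketched as follows. This is a schematic stand-in for the model's equations 9 and 10; the learning rate `alpha` and the linear RPE-to-dopamine mapping are our simplifying assumptions.

```python
def update_value(q, reward, alpha=0.05):
    """One reward-prediction-error (RPE) update of an action's value.

    Illustrative sketch only: the learning rate `alpha` and the linear
    RPE-to-dopamine mapping are assumptions, not the paper's equations.
    """
    rpe = reward - q           # reward prediction error
    q_new = q + alpha * rpe    # value estimate moves toward the outcome
    dopamine = rpe             # phasic DA release proportional to the RPE
    return q_new, dopamine

q, da = update_value(0.0, reward=1.0)   # unexpected reward: da = 1.0
```

Because the dopamine signal is broadcast to all MSNs, a positive RPE strengthens recently active corticostriatal synapses network-wide, while a fully predicted reward (RPE near zero) leaves weights essentially unchanged.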
The model network was initialized so that it did not a priori distinguish between L and R actions. We first performed simulations in which a fixed reward level was associated with each action, to assist in parameter tuning and verify effective model operation. In this scenario, where the rewards for each action did not change over time (i.e., one action always elicited a larger reward than the other), a gradual change in corticostriatal synaptic weights occurred (Supp. Figure 1A) in parallel with the learning of the actions’ values (Supp. Figure 1B). These changes in synaptic weights altered MSN firing rates (Supp. Figure 1C,D), reflecting changes in the sensitivity of the MSNs to cortical inputs in a way that allowed the network to learn over time to select the more highly rewarded action (Supp. Figure 2A). That is, firing rates in the direct pathway MSNs (dMSNs; DL and DR) associated with the more highly rewarded action increased, leading to more frequent selection of that action. On the other hand, firing rates of the indirect pathway MSNs (iMSNs; IL and IR) remained quite similar (Supp. Figure 1C,D). This similarity is consistent with recent experimental results [25], while the finding that dMSNs and iMSNs associated with a selected action are both active has also been reported in several experimental works [26–28].
In this model, indirect pathway activity counters action selection by cancelling direct pathway spiking (Subsection 4.1.3). This serves as a proxy in this simplified framework for indirect pathway competition with the direct pathway in the full network simulations (see Subsection 2.2). Based on the cancellation framework, the ratio of direct pathway weights to indirect pathway weights provides a reasonable representation of the extent to which each action is favored or disfavored. In our simulations, after a long period of gradual evolution of weights and action values, the direct pathway versus indirect pathway weight ratio of the channel for the less favored action started to drop more rapidly, indicating the emergence of certainty about action values and a clearer separation between frequencies with which the two actions were selected (Figure 2).
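Under this cancellation framework, the per-channel direct-versus-indirect weight ratio can be summarized as below. The specific weight values are hypothetical; the paper works with the full matrices of corticostriatal weights.

```python
def weight_ratio(w_direct, w_indirect):
    """Summed corticostriatal weights onto dMSNs divided by those onto
    iMSNs within one action channel; larger ratios favor the action.
    Schematic only -- weight values below are hypothetical."""
    return sum(w_direct) / sum(w_indirect)

# Hypothetical post-learning weights for the L and R channels
ratio_L = weight_ratio([0.6, 0.7], [0.5, 0.5])   # ~1.3
ratio_R = weight_ratio([0.4, 0.4], [0.5, 0.5])   # ~0.8
favored = 'L' if ratio_L > ratio_R else 'R'
```

In the simulations described above, it is the ratio of the disfavored channel that eventually drops more rapidly, widening the gap between the two channels' selection frequencies.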
To show that the network remained flexible after learning a specific action value relation, we ran additional simulations using a variety of reward schedules in which the reward values associated with the two actions were swapped after the performance of a certain number of actions. Once values switched, the network was always able to learn the new values. Specifically, QL and QR began evolving toward the new reward levels, switching their relative magnitudes along the way; the weights of corticostriatal synapses to L-dMSN (R-dMSN) weakened (strengthened) (e.g., Supp. Figure 2), and the relative performance frequencies of the two actions also reversed. Thus the network was able to adaptively learn immediate reward contingencies, without being restricted by previously learned contingencies.
While these simulations show that applying a dopaminergic plasticity rule to corticostriatal synapses allows for a simple network to learn action values linked to reward magnitude, many reinforcement learning tasks rely on estimating reward probability (e.g., two-armed bandit tasks). To evaluate the network’s capacity to learn from probabilistic rewards, we simulated a variant of a probabilistic reward task and compared the network performance to previous experimental results on action selection with probabilistic rewards in human subjects [29]. For consistency with experiments, we always used pL + pR = 1, where pL and pR were the probabilities of delivery of a reward of size ri = 1 when actions L and R were performed, respectively. Moreover, as in the earlier work, we considered the three cases pL = 0.65 (high conflict), pL = 0.75 (medium conflict) and pL = 0.85 (low conflict).
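The probabilistic reward schedule amounts to Bernoulli draws with pR = 1 − pL. The sampling code below is an illustration of that design, not the paper's implementation.

```python
import random

def draw_reward(action, p_left, rng):
    """Bernoulli reward of size 1, with p_R = 1 - p_L as in the task design."""
    p = p_left if action == 'L' else 1.0 - p_left
    return 1.0 if rng.random() < p else 0.0

rng = random.Random(0)
# The three conflict levels considered in the simulations
for p_left in (0.65, 0.75, 0.85):
    mean = sum(draw_reward('L', p_left, rng) for _ in range(10000)) / 10000
    # the empirical reward rate for action L approaches p_left
```

Note that the conflict labels run opposite to pL: the closer pL is to 0.5, the more similar the two actions' payoff probabilities and the harder the discrimination.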
As in the constant reward case, the corticostriatal synaptic weights onto the two dMSN populations clearly separated out over time (Figure 3). The separation emerged earlier and became more drastic as the conflict between the rewards associated with the two actions diminished, i.e., as reward probabilities became less similar. Interestingly, for relatively high conflict, corresponding to relatively low pL, the weights to both dMSN populations rose initially before those onto the less rewarded population eventually diminished. This initial increase likely arises because both actions yielded a reward of 1 on at least some trials, leading to a significant dopamine increase. The weights onto the two iMSN populations remained much more similar. One general trend was that the weights onto the L-iMSN neurons decreased, contributing to the bias toward action L over action R.
In all three cases, the distinction in synaptic weights translated into differences across the dMSNs’ firing rates (Figure 4, first row), with L-dMSN firing rates (DL) increasing over time and R-dMSN firing rates (DR) decreasing, resulting in a greater difference that emerged earlier when pL was larger and hence the conflict between rewards was weaker. Notice that the DL firing rate reached almost the same value for all three probabilities. In contrast, the DR firing rate tended to smaller values as the conflict decreased. As expected based on the changes in corticostriatal synaptic weights, the iMSN population firing rates remained similar for both action channels, although the rates were slightly lower for the population corresponding to the action that was more likely to yield a reward (Figure 4F).
Similar trends across conflict levels arose in the respective frequencies of selection of action L. Over time, as weights to L-dMSN neurons grew and their firing rates increased, action L was selected more often, becoming gradually more frequent than action R. Not surprisingly, a significant difference between frequencies emerged earlier, and the magnitude of the difference became greater, for larger pL (Figure 5).
To show that this feedback learning captured experimental observations, we performed additional probabilistic reward simulations to compare with behavioral data in forced-choice experiments with human subjects [29]. Each of these simulations represented an experimental subject, and each action selection was considered as the outcome of one trial performed by that subject. After each trial, a time period of 50 ms was imposed during which no cortical inputs were sent to striatal neurons such that no actions would be selected, and then the full simulation resumed. For these simulations, we considered the evolution of the value estimates for the two actions either separately for each subject (Figure 6A) or averaged over all subjects experiencing the same reward probabilities (Figure 6B), as well as the probability of selection of action L averaged over subjects (Figure 6C). The mean in the difference between the action values gradually tended toward the difference between the reward probabilities for all conflict levels. Although convergence to these differences was generally incomplete over the number of trials we simulated (matched to the experiment duration), these differences were close to the actual values for many individual subjects as well as in mean (Figure 6A,B). These results agree quite well with the behavioral data in [29] obtained from 15 human subjects, as well as with observations from similar experiments with rats [30].
Also as in the experiments, the probability of selecting the more rewarded action grew across trials for all three reward probabilities, with less separation in action selection probability than in action values across the different reward probability regimes (Figure 6C). Our selection probabilities for higher-value actions did not reach the levels seen experimentally, which likely reflects the non-biological action selection rule in our STDP model (see Subsection 4.1.3). Nonetheless, the agreement of our model with the experimental time courses of value estimation (Figure 6A,B), together with its general success in learning to select more valuable actions (Supp. Figure 1C and Figure 5), justifies incorporating our results on corticostriatal synaptic weights into a spiking network with a more biologically based decision-making mechanism, which we discuss next.
2.2 CBGT Dynamics and Choice Behavior
A key observation from our STDP model is that differences in rewards associated with different actions lead to differences in the ratios of corticostriatal synaptic weights to dMSNs and iMSNs across action channels. Using weight ratios adapted from the STDP model, obtained by varying weights to dMSNs with fixed weights to iMSNs (Figure 3), we next performed simulations with a full spiking CBGT network to study the effects of this corticostriatal imbalance on the emergent neural dynamics and choice behavior following feedback-dependent learning in the context of low, medium, and high probability reward schedules (2500 trials/condition; see Subsection 4.2.1 for details). In each simulation, cortical inputs featuring gradually increasing firing rates, with identical statistical properties across channels, were supplied to both action channels. These inputs led to evolving firing rates in nuclei throughout the basal ganglia, also partitioned into action channels, with an eventual action selection triggered by the thalamic firing rate in one channel reaching 30 Hz (Figure 1, center and Figure 7). We found that both dMSN and iMSN firing rates gradually increased in response to cortical inputs. Consistent with our STDP simulations (Figure 4), dMSN firing rates became higher in the channel for the selected action. Interestingly, iMSN firing rates also became higher in the selected channel, consistent with recent experiments (see [31], among others). Similar to the activity patterns observed in the striatum, higher firing rates were also observed in the selected channel’s STN and thalamic populations, whereas GPe and GPi firing rates were higher in the unselected channel (Figure 7).
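The action selection criterion, the first thalamic firing rate to reach 30 Hz, can be sketched as a simple threshold detector. The linear rate ramps below are placeholders standing in for the spiking model's actual population output.

```python
import numpy as np

def detect_action(thal_L, thal_R, dt_ms=1.0, threshold=30.0):
    """Return (choice, RT in ms) when either channel's thalamic firing
    rate first reaches the decision threshold (30 Hz in the CBGT network)."""
    for i, (rl, rr) in enumerate(zip(thal_L, thal_R)):
        if rl >= threshold or rr >= threshold:
            return ('L' if rl >= rr else 'R'), i * dt_ms
    return None, None  # no decision within the trial window

t = np.arange(0.0, 600.0)       # 600 ms trial, 1 ms resolution
thal_L = 0.0625 * t             # ramps to 30 Hz at t = 480 ms
thal_R = 0.05 * t               # slower ramp; stays subthreshold longer
choice, rt = detect_action(thal_L, thal_R)   # -> ('L', 480.0)
```

Because both channels receive statistically identical cortical input, any systematic asymmetry in which channel crosses threshold first must come from the learned corticostriatal weights rather than from the stimulus itself.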
More generally across all weight ratio conditions, dMSNs and iMSNs exhibited a gradual ramping in population firing rates [32] that eventually saturated around the average RT in each condition (Figure 8A). To characterize the relevant dimensions of striatal activity that contributed to the network’s behavior, we extracted several summary measures of dMSN and iMSN activity, shown in Figure 8B-C. Summary measures of dMSN and iMSN activity in the L and R channels were calculated by estimating the area under the curve (AUC) of the population firing rate between the time of stimulus onset (200 ms) and the RT on each trial. Trialwise AUC estimates were then normalized between values of 0 and 1, including estimates from all trials in all conditions in the normalization. As expected, increasing the disparity of left and right Ctx-dMSN weights led to greater differences in direct pathway activation between the two channels (i.e., DL > DR; Figure 8B). The increase in DL - DR reflects a form of competition between action channels, where larger values indicate stronger dMSN activation in the optimal channel and/or a weakening of dMSN activity in the suboptimal channel. Similarly, increasing the weight of Ctx-dMSN connections caused a shift in the competition between dMSN and iMSN populations within the left action channel (i.e., DL > IL). Thus, manipulating the weight of Ctx-dMSN connections to match those predicted by the STDP model led to both between- and within-channel biases favoring firing of the direct pathway of the optimal action channel in proportion to its expected reward value.
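The AUC summary measure and its pooled min-max normalization can be sketched as trapezoidal integration over a firing-rate trace between stimulus onset and the RT. The array shapes and the flat demo trace are our assumptions.

```python
import numpy as np

def trial_auc(rate, t, t_on=0.2, rt=0.5):
    """Area under a population firing-rate trace between stimulus onset
    (t_on, in s) and the trial's RT, via the trapezoidal rule."""
    rate, t = np.asarray(rate, float), np.asarray(t, float)
    mask = (t >= t_on) & (t <= rt)
    r, tt = rate[mask], t[mask]
    return float(np.sum(0.5 * (r[1:] + r[:-1]) * np.diff(tt)))

def minmax_normalize(aucs):
    """Normalize trialwise AUC estimates to [0, 1], pooling estimates
    across all trials and conditions as in the text."""
    aucs = np.asarray(aucs, float)
    return (aucs - aucs.min()) / (aucs.max() - aucs.min())

t = np.linspace(0.0, 0.5, 501)      # 0-500 ms at 1 ms resolution
rate = np.full_like(t, 10.0)        # flat 10 Hz trace for illustration
auc = trial_auc(rate, t)            # ~3.0 (10 Hz x 0.3 s)
```

Pooling all trials from all conditions into one normalization, as described above, keeps the normalized AUCs comparable across conditions rather than rescaling each condition independently.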
Interestingly, although the weights of Ctx-iMSN connections were kept constant across conditions, iMSN populations showed reliable differences in activation between channels (Figure 8C). Similar to the observed effects on direct pathway activation, higher reward conditions were associated with progressively greater differences in the AUC of L and R indirect pathway firing rates (IL - IR). At first glance, greater indirect pathway activation in higher compared to lower valued action channels differs from the similarity of activation levels of both indirect pathway channels that we obtained in the STDP model and also appears to be at odds with canonical theories of the roles of the direct and indirect pathways in RL and decision-making. This finding can be explained, however, based on a certain feature represented in the connections within the CBGT network but not within the STDP network, namely thalamo-striatal feedback between channels. That is, the strengthening and weakening of Ctx-dMSN weights in the L and R channels, respectively, translated into relatively greater downstream disinhibition of the thalamus in the L channel, which increased excitatory feedback to L-dMSNs and L-iMSNs while reducing thalamo-striatal feedback to R-MSNs in both pathways.
Finally, we examined the effects of reward probability on the AUC of all iMSN firing rates (Iall; combining across action channels). Observed differences in Iall across reward conditions were notably more subtle than those observed for other summary measures of striatal activity, with greatest activity in the medium reward condition, followed by the high and low reward conditions, respectively.
In addition to analyzing the effects of altered Ctx-dMSN connectivity strength on the functional dynamics of the CBGT network, we also studied how the decision-making behavior of the CBGT network was influenced by this manipulation. Consistent with previous studies of value-based decision-making in humans [33–37], we observed a positive effect of reward probability on both the frequency and speed of correct (e.g., leftward, associated with higher reward probability) choices (Figure 8D). Bootstrap sampling (10,000 samples) was performed to estimate 95% confidence intervals (CI95) around RT and accuracy means (μ) in each condition, and to assess the statistical significance of pairwise comparisons between conditions. Choice accuracy increased across low (μ = 64%, CI95 = [62, 65]), medium (μ = 85%, CI95 = [84, 86]), and high (μ = 100%, CI95 = [100, 100]) reward probabilities. Pairwise comparisons revealed that the increase in accuracy observed between low and medium conditions, as well as that observed between medium and high conditions, reached statistical significance (both p < 0.0001). Along with the increase in accuracy across conditions, we observed a concurrent decrease in the RT of correct (L) choices in the low (μ = 477ms, CI95 = [472, 483]), medium (μ = 467ms, CI95 = [462, 471]), and high (μ = 460ms, CI95 = [456, 464]) reward probability conditions. Notably, our manipulation of Ctx-dMSN weights across conditions manifested in stronger effects on accuracy (i.e., probability of choosing the more valuable action), with subtler effects on RT.
Specifically, the decrease in RT observed between the low and medium conditions reached statistical significance (p < .0001); however, the RT decrease observed between the medium and high conditions did not (p = .13).
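The percentile-bootstrap procedure used for these confidence intervals can be sketched as follows; the resampling logic is standard, while the seed and the demo RT values are placeholders.

```python
import numpy as np

def bootstrap_ci(data, n_boot=10000, ci=95, seed=0):
    """Percentile-bootstrap confidence interval around a sample mean,
    mirroring the 10,000-sample procedure used for the RT/accuracy stats."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    boot_means = np.array([rng.choice(data, size=data.size).mean()
                           for _ in range(n_boot)])
    half = (100 - ci) / 2.0
    lo, hi = np.percentile(boot_means, [half, 100 - half])
    return data.mean(), lo, hi

# Hypothetical per-trial correct-choice RTs (s)
rts = np.random.default_rng(3).normal(0.477, 0.05, size=500)
mean, lo, hi = bootstrap_ci(rts, n_boot=2000)
```

Pairwise comparisons between conditions can be assessed the same way, by bootstrapping the difference of condition means and checking whether the resulting interval excludes zero.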
We also examined the distribution of RTs for L responses across reward conditions (Figure 8E). All conditions showed a rightward skew in the distribution of RTs, an empirical hallmark of simple choice behavior and a useful check of the suitability of accumulation-to-bound models like the DDM for modeling a particular behavioral data set. Moreover, the degree of skew in the RT distributions for L responses became more pronounced with increasing reward probability, suggesting that the observed decrease in the mean RT at higher levels of reward was driven by a change in the shape of the distribution, and not, for instance, a temporal shift in its location.
2.3 CBGT-DDM Mapping
We performed fits of a normative DDM to the CBGT network’s decision-making performance (i.e., accuracy and RT data) to understand the effects of corticostriatal plasticity on emergent changes in decision behavior. This process was implemented in three stages. First, we compared models in which only one free DDM parameter was allowed to vary across levels of reward probability (single parameter DDMs). Next, a second round of fits was performed in which a second free DDM parameter was included in the best-fitting single parameter model identified in the previous stage (dual parameter DDMs). Finally, the two best-fitting dual parameter models were submitted to a third and final round of fits with the inclusion of trialwise measures of striatal activity (see Figure 8B-C) as regressors on designated parameters of the DDM.
All models were evaluated according to their relative improvement in performance compared to a null model in which all parameters were fixed across conditions. To identify which single parameter of the DDM best captured the behavioral effects of alterations in reward probability as represented by Ctx-dMSN connectivity strength, we compared the deviance information criterion (DIC) of models in which either the boundary height (a), the onset delay (tr), the drift rate (v), or the starting-point bias (z) was allowed to vary across conditions. Figure 9A shows the difference between the DIC score of each model (DICM) and that of the null model (ΔDIC = DICM - DICnull), with lower values indicating a better fit to the data (see Table 1 for additional fit statistics). Conventionally, a DIC difference (ΔDIC) of magnitude 10 or more is regarded as strong evidence in favor of the model with the lower DIC value [38]. Compared to the null model as well as alternative single parameter models, allowing the drift rate v to vary across conditions afforded a significantly better fit to the data (ΔDIC = −960.79). Examination of posterior distributions of v in the best-fitting single parameter model revealed a significant increase in v with successively higher levels of reward probability (vLow = .35; vMed = 1.61; vHigh = 2.71), capturing the observed increase in speed and accuracy across conditions by increasing the rate of evidence accumulation toward the upper (L) decision threshold.
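For intuition about what the fitted parameters control, a minimal Euler-Maruyama simulator of a single DDM trial can be sketched as follows. The parameter values are illustrative (the drift value is borrowed from vHigh above), not the fitted posteriors.

```python
import numpy as np

def ddm_trial(v, a, z=0.5, tr=0.3, sigma=1.0, dt=0.001, max_t=5.0, rng=None):
    """One DDM trial via Euler-Maruyama. Evidence starts at z*a, drifts at
    rate v with noise scale sigma, and the trial ends at the upper (correct)
    or lower boundary; the RT adds the non-decision time tr."""
    if rng is None:
        rng = np.random.default_rng()
    x, t = z * a, 0.0
    while 0.0 < x < a and t < max_t:
        x += v * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        t += dt
    choice = 'upper' if x >= a else 'lower'   # timeouts counted as 'lower'
    return choice, tr + t

rng = np.random.default_rng(1)
trials = [ddm_trial(v=2.71, a=1.0, rng=rng) for _ in range(500)]
accuracy = np.mean([c == 'upper' for c, _ in trials])
# higher drift rates yield more frequent and faster upper-boundary choices
```

Raising v steepens the average approach to the upper boundary (faster, more accurate choices), whereas raising a lengthens the excursion required for any choice (slower, more cautious responding), which is the distinction the model comparison below exploits.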
To investigate potential interactions between the drift rate and other parameters of the DDM, we performed another round of fits in which a second free parameter (either a, tr, or z), in addition to v, was allowed to vary across conditions (Figure 9A). Compared to alternative dual-parameter models, the combined effect of allowing v and a to vary across conditions (Figure 8B,C) provided the greatest improvement in model fit over the null model (ΔDIC = −1174.07), as well as over the best-fitting single parameter model (DICv,a - DICv = −213.27). While the dual v and a model significantly outperformed both alternatives (DICv,a - DICv,t = −205.89; DICv,a - DICv,z = −184.05), the second best-fitting dual parameter model, in which v and z were left free across conditions, also afforded a significant improvement over the drift-only model (DICv,z - DICv = −29.23). Thus, both v, a and v, z dual parameter models were considered in a third and final round of fits. The third round was motivated by the fact that, while behavioral fits can yield reliable and informative insights about the cognitive mechanisms engaged by a given experimental manipulation, recent studies have effectively combined behavioral observations with coincident measures of neural activity to test more precise hypotheses about the neural dynamics involved in regulating different cognitive mechanisms [29, 39, 40]. To this end, we refit the v, a and v, z models to the same simulated behavioral dataset (i.e., accuracy and RTs produced by the CBGT network) as in the previous rounds, with the addition of different trialwise measures of striatal activity included as regressors on one of the two free parameters in the DDM.
For each regression DDM (N=24 models, corresponding to 24 ways to map 2 of 6 striatal activity measures to the v, a and v, z models), one of the summary measures shown in Figure 8B-C was regressed on v, and another regressed on either a or z, with separate regression weights estimated for each level of reward probability. Model fit statistics are shown for each of the 24 regression models in Table 2, along with information about the neural regressors included in each model and their respective parameter dependencies. The relative goodness-of-fit afforded by all 24 regression models is visualized in Figure 9A (lower panel), identifying what we have labelled as model III as the clear winner with an overall DIC = 18860.37 and with ΔDIC = −9716.17 compared to the null model. In model III, the drift rate v on each action selection trial depended on the relative strength of direct pathway activation in L and R action channels (e.g., DL - DR), whereas the boundary height a on that trial was computed as a function of the overall strength of indirect pathway activation across both channels (e.g., Iall). To determine how these parameter dependencies influenced levels of v and a across levels of reward probability, the following equations were used to transform intercept and regression coefficient posteriors into posterior estimates of v and a for each condition j: vj = β0v + βvΔDj and aj = β0a + βaIj, where ΔDj and Ij are the mean values of DL - DR and Iall in condition j (see Figure 8B-C), β0v and β0a are the posterior distributions for the v and a intercept terms, and βv and βa are the posterior distributions estimated for the linear weights relating DL - DR and Iall to v and a, respectively. The observed effects of reward probability on v and a, as mediated by trialwise changes in DL - DR and Iall, are schematized in Figure 9B, with conditional posteriors for each parameter plotted in Figure 9C.
Consistent with best-fitting single and dual parameter models (e.g., without striatal regressors included), the weighted effect of DL - DR on v in model III led to a significant increase in v across the low, medium, and high conditions. Thus, increasing the disparity of dMSN activation between L and R action channels led to faster and more frequent leftward actions by increasing the rate of evidence accumulation towards the correct decision boundary. Also consistent with parameter estimates from the best-fitting dual parameter model (i.e., v, a), inclusion of trialwise values of Iall led to an increase in the boundary height in the medium and high conditions compared to estimates in the low condition. However, in contrast with boundary height estimates derived from behavioral data alone (not shown), a estimates in model III showed no significant difference between medium and high levels of reward probability.
Next, we evaluated the extent to which the best-fitting regression model (i.e., model III) was able to account for the qualitative behavioral patterns exhibited by the CBGT network in each condition. To this end, we simulated 20,000 trials in each reward condition (each trial producing a response and RT given a parameter set sampled from the model posteriors) and compared the resulting RT distributions, along with mean speed and accuracy measures, with those produced by the CBGT model (Figure 9D,E). Parameter estimates from the best-fitting model captured both the increasing rightward skew of RT distributions, as well as the concurrent increase in mean decision speed and accuracy with increasing reward probability.
In summary, by leveraging trialwise measures of simulated striatal MSN subpopulation dynamics to supplement RT and choice data generated by the CBGT network, we were able to 1) substantially improve the quality of DDM fits to the network’s behavior across levels of reward probability compared to models without access to neural observations and 2) identify dissociable neural signals underlying observed changes in v and a across varying levels of reward probability associated with available choices.
3 Discussion
Reinforcement learning in mammals alters the mapping from sensory evidence to action decisions. Here we set out to understand how this adaptive decision-making process emerges from underlying neural circuits using a modeling approach that bridges across levels of analysis, from plasticity at corticostriatal synapses to CBGT network function to quantifiable behavioral parameters [11, 12, 15, 18]. We show how a simple, DA-mediated STDP rule can modulate the sensitivity of both dMSN and iMSN populations to cortical inputs. This learning allows for the network to discover which target in a two-alternative forced-choice task is more likely to deliver a reward by modifying the ratio of direct and indirect pathway corticostriatal weights within each action channel. With this result in hand, we simulated the network-level dynamics of CBGT circuits, as well as behavioral responses, under different levels of conflict in reward probabilities, by extrapolating from the learned corticostriatal weights from the STDP simulations. As reward probability for the optimal target increased, the asymmetry of dMSN firing rates between action channels grew, as did the overall activity of iMSNs across both action channels. By fitting the DDM to the simulated decision behavior of the CBGT network, we found that changes in the rate of evidence accumulation tracked with the difference in dMSN population firing rates across action channels, while the level of evidence required to trigger a decision tracked with the overall iMSN population activity. These findings show how, at least within this specific framework, plasticity at corticostriatal synapses induced by phasic changes in DA can have a multifaceted effect on cognitive decision processes.
A critical assumption of our theoretical experiments is that the CBGT pathways accumulate sensory evidence for competing actions in order to identify the most contextually appropriate response. This assumption is supported by a growing body of empirical and theoretical evidence. For example, Yartsev et al. [32] recently showed that, in rodents performing an auditory discrimination task, the anterior dorsolateral striatum satisfied three fundamental criteria for establishing causality in the evidence accumulation process: (1) inactivation of the striatum impaired the animal’s discrimination performance on the task, (2) perturbation of striatal neurons during the temporal window of evidence accumulation had predictable and reliable effects on trial-wise behavioral reports, and (3) gradual ramping, proportional to the strength of evidence, was observed in both single unit and population firing rates of the striatum (however, see also [41]). Consistent with these empirical findings, Caballero et al. [16] recently proposed a novel computational framework, capturing perceptual evidence accumulation as an emergent effect of recurrent activation of competing action channels. This modeling work builds on previous studies showing how the architecture of CBGT loops is ideal for implementing a variant of the sequential probability ratio test [5, 12]. Taken together, these converging lines of evidence point to CBGT pathways as being causally involved in the accumulation of evidence for decision-making.
The idea that an accumulation of evidence algorithm can be implemented via network-level dynamics within looped circuit architectures stands in sharp contrast to cortical models of decision-making that presume a more direct isomorphism between accumulators and neural activity (for review see [42]). Early experimental work showed how population-level firing rates in area LIP displayed the same ramp-to-threshold dynamics as predicted by an evidence accumulation process [43–45]. This simple relation between algorithm and implementation has now come into question. Follow-up electrophysiological experiments showed how this population-level accumulation may, in fact, reflect the aggregation of step-functions across neurons that resemble an accumulator when summed together yet lack accumulation properties at the level of individual units [46]. In addition, recent results from intervention studies are inconsistent with the causal role of cortical areas in the accumulation of evidence. For instance, Katz et al. [47] found that inactivation of area LIP in macaques had no effect on the ability of monkeys to discriminate the direction of motion stimuli in a standard random dot motion task. In contrast to the presumed centrality of LIP in sensory evidence accumulation, these findings and supporting reports from [48] and [49] suggest that cortical areas like LIP provide a useful proxy for the deliberation process but are unlikely to have a causal role in the decision itself.
The recent experimental [32] and theoretical [16] revelations of CBGT involvement in decision-making are particularly exciting, not only for the purposes of identifying a likely neural substrate of perceptual choice, but also for their implications for integrating accumulation-to-bound models (e.g., action selection mechanisms) with theories of RL (e.g., feedback-dependent learning of action values). We previously proposed a Believer-Skeptic framework [7] to capture the complementary roles played by the direct and indirect pathways in the feedback-dependent learning and the moment-to-moment evidence accumulation leading up to action selection. This competition between opposing control pathways can be characterized as a debate between a Believer (direct pathway) and a Skeptic (indirect pathway), reflecting the instantaneous probability ratio of evidence in favor of executing and suppressing a given action respectively. Because the default state of the basal ganglia pathways is motor-suppressing (e.g., [50, 51]), the burden of proof falls on the Believer to present sufficient evidence for selecting a particular action. In accumulation-to-bound models like the DDM, this sequential sampling of evidence is parameterized by the drift rate. Therefore, the Believer-Skeptic model specifically predicts that this competition should be reflected, at least in part, in the rate of evidence accumulation. As for the role of learning in the Believer-Skeptic competition, multiple lines of evidence suggest that dopaminergic feedback during learning systematically biases the direct-indirect competition in a manner consistent with increasing the drift rate for more rewarding actions [7, 29, 33, 35, 52, 53]. 
Indeed, the STDP simulations in the current study showed opposing effects of dopaminergic feedback on corticostriatal synapses in the direct pathway for both the optimal and suboptimal action channels, with the post-learning difference between the direct pathway synaptic weights in the two channels proportional to the difference in expected action values. This provides testable predictions at multiple levels for how feedback learning should influence the decision process over time.
In support of the biological assumptions underlying the CBGT network, several important empirical properties naturally emerged from our simulations. First, both dMSN and iMSN striatal populations were concurrently activated on each trial (see [25, 54, 55]) and exhibited gradually ramping firing rates that often saturated before the response on each trial [32, 41]. Second, in contrast with the relatively early onset of ramping activity in the striatum, recipient populations in the GPi sustained high tonic firing rates throughout most of the trial, with activity in the selected channel showing a precipitous decline near the recorded RT [23, 56, 57]. This delayed change in GPi activation is caused by the opposing influence of concurrently active dMSN and iMSN populations in each channel, such that the influence of the direct pathway on the GPi is temporarily balanced out by activation of the indirect pathway (see [23]). To represent low, medium, and high levels of reward probability conflict, we manipulated the weights of cortical input to dMSNs in each channel (see Table 4), increasing and decreasing the ratio of direct pathway weights to indirect pathway weights for L and R actions, respectively. As expected, increasing the difference in the associated reward for L and R actions led to stronger firing in L-dMSNs and weaker firing of R-dMSNs. Consistent with recently reported electrophysiological findings [25, 55], we also observed an increase in the firing of iMSNs in the L action channel, which in our simulations may arise from channel-specific feedback from the L component of the thalamus. Behaviorally, the choices of the CBGT network became both faster and more accurate (e.g., higher percentage of L responses) at higher levels of reward, suggesting that the observed increase in L-iMSN firing did not serve to delay or suppress L selections. 
These changes in neural dynamics also produced changes in value-based decision behavior that are consistent with previous studies linking parameters of the DDM with experiential feedback.
One of the critical outcomes of the current set of experiments is the mechanistic prediction of how variation in specific neural parameters relates to changes in parameters of the DDM. Consistent with past work (see [7, 29]), the DDM fits to the CBGT-simulated behavior showed an increase in drift rate toward the higher valued decision boundary with increasing expected reward. Additionally, we found that greater disparity in the expected values of alternative actions led to an increase in the boundary height. Indeed, the co-modulation of drift rate and boundary parameters observed here has also been found in human and animal experimental studies of value-based choice [29, 33, 35]. For example, experiments with human subjects in a value-based learning task showed that selection and response speed patterns were best described by an increase in the rate of evidence for more valued targets, coupled with an upwards shift in the boundary height for all targets [33]. Moreover, in healthy human subjects, but not Parkinson’s disease patients, reward feedback was found to drive increases in both rate and boundary height parameters, effectively breaking the speed-accuracy tradeoff [33]. To identify more precise links between the relevant neural dynamics underlying the observed drift rate and boundary height effects we performed another round of model fits with striatal summary measures included as regressors to describe trial-by-trial variability. Behavioral fits were substantially improved by estimating trialwise values of drift rate as a function of the difference between L- and R-dMSN activation and trialwise values of boundary height as a function of the iMSN activation across both channels. These relationships stand both as novel predictions arising from the current study and as refinements to the Believer-Skeptic framework, implying that the Believer component relies on a competition between action channels while the Skeptic involves a cooperative aspect.
While our present findings provide key insights into the links between implementation mechanisms and cognitive algorithms during adaptive decision-making, they are constrained by the nature of the multi-level modeling approach itself. Our goal was to evaluate a specific hypothesis under the Believer-Skeptic framework about the combined role of corticostriatal pathways in learning and decision making, and our simulations demonstrate that strengthening corticostriatal synapses is one way that the brain can adjust striatal firing to shape the drift rate and accumulation threshold, promoting faster and more frequent selection of actions with a higher expected value. We do not presume, however, that the impacts of dopaminergic plasticity at corticostriatal synapses on striatal activity are singularly responsible for setting the drift rate during value-based decision-making. Indeed, because the CBGT network has many more parameters than the DDM, many different properties of the CBGT network, aside from corticostriatal weights and measures of striatal activity, could potentially be manipulated to cause analogous behavioral patterns and inferred effects on the drift rate and boundary height parameters in the DDM. For instance, in contrast to the striatal iMSN modulation of boundary height observed in the current study, Ratcliff and Frank [18] found that simulated changes in STN firing were also capable of describing a change in the boundary height, raising the threshold in the context of high decision conflict. In fact, experimental evidence suggests the existence of both striatal [58–60] and subthalamic [39, 60, 61] mechanisms for adjusting the boundary height. It remains for future work to study how multiple mechanisms such as these work together to impact decision behavior as well as to consider more complex decision-making tasks that may help to expose distinct roles for these aspects of CBGT activity.
Another open direction is to generalize our approach to include more detailed representations of neurons in CBGT populations, such as Hodgkin-Huxley-type models, and additional detail about BG neuronal subpopulations and pathways, such as distinct representations of arkypallidal and prototypical GPe neurons and the GPe projection to the striatum.
Our simulations make several novel predictions for future experiments. The STDP simulations described in Section 2.1 suggest that feedback-dependent reward learning should drive more salient changes in cortical synaptic weights to dMSN populations than to iMSN populations. At the same time, while the learning-related changes in L and R direct pathway corticostriatal weights were mirrored by the relative firing rates of L- and R-dMSNs in the CBGT network, iMSN firing rates are also predicted to show channel-specific differences, despite constancy in their corticostriatal weights across conditions. The observed increase in iMSN firing disparity between the L and R channels in our simulations emerged due to the thalamostriatal feedback assumed in the CBGT network, where dMSN activation leads to disinhibition of the thalamus, thereby increasing excitatory feedback to both MSN subtypes within a given channel. This represents another novel model prediction that can be tested empirically. Since it is currently unclear whether these feedback connections actually adhere to a channel-specific (e.g., focal) topology, we hope that our work will motivate future experiments to explore the topology of thalamostriatal inputs. Finally, our study predicts that the difference in dMSN activity across action channels modulates the rate of value-based evidence accumulation. This could be directly tested by applying different magnitudes of optogenetic stimulation to dMSNs in L- and R-lateralized dorsolateral striatum to effectively manipulate the strength of evidence for L and R lever presses. According to our simulations, increasing the relative magnitude of dMSN stimulation in the R, compared to L, dorsolateral striatum should speed and facilitate the selection of contralateral lever presses. Choice and RT data could then be fit with the DDM to determine if the behavioral effects of laterally-biased dMSN stimulation were best described by a change in the drift rate.
Analogous experiments targeting iMSNs but without channel specificity could be used similarly to evaluate our prediction that overall iMSN activity level modulates DDM boundary height.
3.1 Conclusion
Here we characterize the effects of dopaminergic feedback on the competition between direct and indirect CBGT pathways and how this plasticity impacts the evaluation of evidence for alternative actions during value-based choice. Using simulated neural dynamics to generate behavioral data for fitting by the DDM and determining how measures of striatal activity influence this fit, we show how the rate of evidence accumulation and the decision boundary height are modulated by the direct and indirect pathways, respectively. This multi-level modeling approach affords a unique combination of biological plausibility and mechanistic interpretability, providing a rich set of testable predictions for guiding future experimental work at multiple levels of analysis.
4 Methods
Our work involves three distinct model systems: a spike-timing dependent plasticity (STDP) network consisting of striatal neurons and their cortical inputs, with corticostriatal synaptic plasticity driven by phasic reward signals resulting from simulated actions and their consequent dopamine release; a spiking cortico-basal ganglia-thalamic (CBGT) network, comprising neurons and synaptic connections from the key cortical and subcortical areas within the CBGT computational loops, which takes sensory evidence from cortex and makes a decision to select one of two available responses; and the drift diffusion model (DDM), a cognitive model of decision-making that describes the accumulation-to-bound dynamics underlying the speed and accuracy of simple choice behavior [1].
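To make the accumulation-to-bound dynamics concrete, the DDM can be simulated directly. The following Euler–Maruyama sketch is our own illustrative code (not the fitting machinery used in the study); parameter names v (drift rate), a (boundary height), z (starting-point fraction), and ter (non-decision time) follow standard DDM usage:

```python
import numpy as np

def simulate_ddm(v, a, z=0.5, ter=0.3, dt=0.001, sigma=1.0,
                 n_trials=2000, seed=1):
    """Euler-Maruyama simulation of the drift-diffusion model.

    The accumulator starts at z*a and drifts at rate v with noise sigma
    until it hits 0 or a. Returns choices (1 = upper boundary) and
    response times in seconds (decision time plus ter).
    """
    rng = np.random.default_rng(seed)
    choices = np.empty(n_trials, dtype=int)
    rts = np.empty(n_trials)
    for trial in range(n_trials):
        x, t = z * a, 0.0
        while 0.0 < x < a:
            x += v * dt + sigma * np.sqrt(dt) * rng.standard_normal()
            t += dt
        choices[trial] = int(x >= a)
        rts[trial] = ter + t
    return choices, rts
```

Under this convention, increasing v speeds and biases choices toward the upper boundary, while increasing a slows responses and raises accuracy, mirroring the drift-rate and boundary effects discussed above.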
In this section, we present the details of each of these models along with some computational approaches that we use in simulating and analyzing them. The three models are simulated separately, but outputs of specific models are critical for the tuning of other models, as we shall describe.
4.1 STDP network
4.1.1 Neural model
We consider a computational model of the striatum consisting of two different populations that receive different inputs from the cortex (see Figure 1, left). Although they do not interact directly, they compete with each other to be the first to select a corresponding action.
Each population contains two different types of units: (i) dMSNs, which facilitate action selection, and (ii) iMSNs, which suppress action selection. Each of these neurons is represented with the exponential integrate-and-fire model [62], such that each neural membrane potential obeys the differential equation

C dV/dt = −gL(V − VL) + gL ΔT exp((V − VT)/ΔT) − Isyn(t),

where gL is the leak conductance and VL the leak reversal potential. In terms of a neural I-V curve, VT denotes the voltage that corresponds to the largest input current at which the neuron does not spike in the absence of synaptic input, while ΔT stands for the spike slope factor, related to the sharpness of spike initiation. Isyn(t) is the synaptic current, given by Isyn(t) = gsyn(t)(V(t) − Vsyn), where the synaptic conductance gsyn(t) changes via a learning procedure (see Subsection 4.1.2). A reset mechanism is imposed that represents the repolarization of the membrane potential after each spike: when the neuron reaches a boundary value Vb, the membrane potential is reset to Vr.
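As an illustration, the exponential integrate-and-fire dynamics with the parameter values of this subsection can be integrated with a simple forward-Euler scheme (our own sketch; the constant excitatory conductance here stands in for the cortical spike input, and Euler integration is an illustrative simplification of the RK45 scheme used in the study):

```python
import numpy as np

# Parameter values from the text (uF/cm^2, uS/cm^2, mV)
C, gL, VL, VT, DT = 1.0, 0.1, -65.0, -59.9, 3.48
Vb, Vr, Vsyn = -40.0, -75.0, 0.0

def simulate_eif(g_syn, t_end=500.0, dt=0.01):
    """Return spike times (ms) of one EIF unit driven by a constant
    excitatory synaptic conductance g_syn."""
    V, spikes = VL, []
    for step in range(int(t_end / dt)):
        I_syn = g_syn * (V - Vsyn)            # excitatory for V < Vsyn = 0
        dV = (-gL * (V - VL)
              + gL * DT * np.exp((V - VT) / DT)
              - I_syn) / C
        V += dt * dV
        if V >= Vb:                           # spike boundary reached
            spikes.append(step * dt)
            V = Vr                            # reset (repolarization)
    return spikes
```

Consistent with the text, with these parameters the unit settles to rest without synaptic input and fires only when driven by a sufficient conductance.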
The inputs from the cortex to each MSN neuron within a population are generated using a collection of oscillatory Poisson processes with rate ν and pairwise correlation c. Each of these cortical spike trains, which we refer to as daughters, is generated from a baseline oscillatory Poisson process {X(tn)}n, the mother train, which has intensity function λ(1 + A sin(2πθt)), such that the spike probability at time point tn is

P(X(tn) = 1) = λ(1 + A sin(2πθtn)) δt,

where A and θ are the amplitude and the frequency of the underlying oscillation, respectively; tn+1 − tn =: δt is the time step; and λ is the mother train rate. After the mother train is computed, each mother spike is transferred to each daughter with probability p, checked independently for each daughter. To fix the daughters’ rates and the correlation between the daughter trains, the mother train’s rate is given by λ = ν/(p δt), where p = c, since for this thinning construction the pairwise correlation between daughter trains equals the transfer probability.
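A minimal sketch of this mother/daughter construction follows (the function name is ours, and we assume, as in the relation λ = ν/(p δt), that ν plays the role of the daughters' per-time-step spike probability and that the transfer probability p equals the target correlation c):

```python
import numpy as np

def mother_daughter_trains(nu, c, dt, t_end, n_daughters,
                           A=0.0, theta=0.0, seed=0):
    """Generate correlated cortical spike trains by thinning a common
    (possibly oscillatory) Poisson mother train.

    nu: per-time-step spike probability of each daughter
    c:  pairwise correlation (taken equal to the transfer probability p)
    dt: time step (ms); theta: oscillation frequency in cycles per ms
    """
    rng = np.random.default_rng(seed)
    p = c                              # transfer probability
    lam = nu / (p * dt)                # mother rate so daughters have rate nu
    t = np.arange(0.0, t_end, dt)
    # per-bin mother spike probability, modulated by the oscillation
    prob = lam * (1.0 + A * np.sin(2 * np.pi * theta * t)) * dt
    mother = rng.random(t.size) < prob
    # each mother spike is copied to each daughter independently with prob p
    daughters = mother[None, :] & (rng.random((n_daughters, t.size)) < p)
    return mother, daughters
```

By construction, every daughter spike coincides with a mother spike, which is what induces the pairwise correlation between daughters.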
In the STDP network (see Figure 1, left) we consider two different mother trains to generate the cortical daughter spike trains for the two different MSN populations. Each dMSN neuron or iMSN neuron receives input from a distinct daughter train, with the corresponding transfer probabilities pD and pI, respectively. As shown in [63], the cortex to iMSN release probability exceeds that of cortex to dMSN. Hence, we set pD < pI.
Striatal neuron parameters
We set the exponential integrate-and-fire model parameter values as C = 1 μF/cm2, gL = 0.1 μS/cm2, VL = −65 mV, VT = −59.9 mV, and ΔT = 3.48 mV (see [62]). The reset parameter values are Vb = −40 mV and Vr = −75 mV. The synaptic current derives entirely from excitatory inputs from the cortex, so Vsyn = 0 mV. For these specific parameters, synaptic inputs are required for MSN spiking to occur.
Cortical neuron parameters
To compute p, we set the daughter Poisson process parameter values as ν = 0.002 and c = 0.5 and apply equation 4. Once the mother trains are created using these values, we set the iMSN transfer probability to pI = p and the dMSN transfer probability to pD = 2/3 pI. In most simulations, we set A = 0 to consider non-oscillatory cortical activity. We have also tested the learning rule when A = 0.06 and θ = 25 Hz and obtained similar results.
The network was integrated computationally using the Runge-Kutta (4,5) method in Matlab (ode45) with time step δt = 0.01 ms. Different realizations lasting 15 s were computed to simulate variability across different subjects in a learning scenario.
Every time that an action is performed (see Subsections 4.1.3 and 4.1.4), all populations stop receiving inputs from the cortex until all neurons in the network are in the resting state for at least 50 ms. During these silent periods, no MSN spikes occur and hence no new actions are performed (i.e., they are action refractory periods). After these 50 ms, the network starts receiving synaptic inputs again and we consider a new trial to be underway.
4.1.2 Learning rule
During the learning process, the corticostriatal connections are strengthened or weakened according to previous experiences. In this subsection, we will present equations for a variety of quantities, many of which appear multiple times in the model. Specifically, there are variables gsyn, w for each corticostriatal synapse, APRE for each daughter train, APOST and E for each MSN. For all of these, to avoid clutter, we omit subscripts that would indicate explicitly that there are many instances of these variables in the model.
We suppose that the conductance for each corticostriatal synapse onto each MSN neuron, gsyn(t), obeys the differential equation

g′syn(t) = −gsyn(t)/τg + w(t) Σj δ(t − tj),

where tj denotes the time of the jth spike in the cortical daughter spike train pre-synaptic to the neuron, δ(t) is the Dirac delta function, τg stands for the decay time constant of the conductance, and w(t) is a weight associated with that train at time t. The weight is updated by dopamine release and by the neuron’s role in action selection based on a similar formulation to one proposed previously [22], which descends from earlier work [64]. The idea of this plasticity scheme is that an eligibility trace E (cf. [65]) represents a neuron’s recent spiking history and hence its eligibility to have its synapses modified, with changes in eligibility following a spike timing-dependent plasticity (STDP) rule that depends on both the pre- and the post-synaptic firing times. Plasticity of corticostriatal synaptic weights depends on this eligibility together with dopamine levels, which in turn depend on the reward consequences that follow neuronal spiking.
To describe the evolution of neuronal eligibility, we first define APRE and APOST to represent a record of pre- and post-synaptic spiking, respectively. Every time that a spike from the corresponding cell occurs, the associated variable increases by a fixed amount, and otherwise, it decays exponentially. That is,

A′PRE(t) = ΔPRE XPRE(t) − APRE(t)/τPRE,
A′POST(t) = ΔPOST XPOST(t) − APOST(t)/τPOST,

where XPRE(t) and XPOST(t) are functions set to 1 at times t when, respectively, a neuron that is pre-synaptic to the post-synaptic neuron, or the post-synaptic neuron itself, fires a spike, and are zero otherwise, while ΔPRE and ΔPOST are the fixed increments to APRE and APOST due to this firing. The additional parameters τPRE, τPOST denote the decay time constants for APRE, APOST, respectively.
The spike time indicators XPRE, XPOST and the variables APRE, APOST are used to implement an STDP-based evolution equation for the eligibility trace, which takes the form

E′(t) = APRE(t) XPOST(t) − APOST(t) XPRE(t) − E(t)/τE,

implying that if a pre-synaptic neuron spikes and then its post-synaptic target follows, such that APRE > 0 and XPOST becomes 1, the eligibility E increases, while if a post-synaptic spike occurs followed by a pre-synaptic spike, such that APOST > 0 and XPRE becomes 1, then E decreases; at times without spikes, the eligibility decays exponentially with time constant τE.
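The trace-and-eligibility bookkeeping can be sketched in a few lines (our own illustrative Euler implementation; the function name is ours, and default parameter values follow Subsection 4.1.5):

```python
import numpy as np

def stdp_eligibility(pre_spikes, post_spikes, dt=0.01, t_end=100.0,
                     d_pre=10.0, d_post=6.0,
                     tau_pre=9.0, tau_post=1.2, tau_e=3.0):
    """Evolve the pre/post spike traces A_PRE, A_POST and the eligibility
    trace E for one synapse; spike times are given in ms. Returns the
    time course of E."""
    n = int(t_end / dt)
    a_pre = a_post = e = 0.0
    e_hist = np.empty(n)
    pre = set(np.round(np.asarray(pre_spikes) / dt).astype(int))
    post = set(np.round(np.asarray(post_spikes) / dt).astype(int))
    for k in range(n):
        x_pre, x_post = float(k in pre), float(k in post)
        # pre-then-post (a_pre > 0 at a post spike) raises E;
        # post-then-pre (a_post > 0 at a pre spike) lowers E.
        # E is updated before the traces so the current spike's own
        # increment does not feed back into the same time step.
        e += dt * (-e / tau_e) + a_pre * x_post - a_post * x_pre
        a_pre += dt * (-a_pre / tau_pre) + d_pre * x_pre
        a_post += dt * (-a_post / tau_post) + d_post * x_post
        e_hist[k] = e
    return e_hist
```

Running it with a pre spike shortly before a post spike yields a positive eligibility transient, and the reverse ordering yields a negative one, as the rule prescribes.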
In contrast to previous work [22], we propose an update scheme for the synaptic weight w(t) that depends on the type of MSN neuron involved in the synapse. It has been observed [66–69] that dMSNs tend to have less activity than iMSNs at resting states, consistent with our assumption that pD < pI, and are more responsive to phasic changes in dopamine than iMSNs. In contrast, iMSNs are largely saturated by tonic dopamine. In both cases, we assume that the eligibility trace modulates the extent to which a synapse can be modified by the dopamine level relative to a tonic baseline (which we without loss of generality take to be 0), consistent with previous models. Hence, we take w(t) to change according to an update equation in which the eligibility E is scaled by a function representing sensitivity to phasic dopamine; in this equation, αw refers to the learning rate, KDA denotes the level of dopamine available at the synapses, wXmax is an upper bound for the weight w that depends on whether the postsynaptic neuron is a dMSN (X = D) or an iMSN (X = I), c controls the saturation of weights to iMSNs, and |·| denotes the absolute value function. The dopamine level KDA itself evolves as

K′DA(t) = −KDA(t)/τDOP + Σi DAinc(ti) δ(t − ti),

where the sum is taken over the times {ti} when actions are performed, leading to a change in KDA that we treat as instantaneous, and τDOP is the dopamine decay constant. The DA update value DAinc(ti) depends on the performed action, where ri(t) is the reward associated with action i at time t, Qi(t) is an estimate of the value of action i at time t such that ri(t) − Qi(t) is the subtractive reward prediction error [70], and α ∈ [0, 1] is the value learning rate. This rule for action value updates and dopamine release resembles past work [71] but uses a neurally tractable maximization operation (see [72, 73] and references therein) to take into account that reward expectations may be measured relative to optimal past rewards obtained in similar scenarios [74, 75].
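The value-learning and dopamine side of the rule can be sketched as follows. Note the hedge: the model's actual DAinc involves a maximization over past value estimates, which we omit here; this sketch uses the plain subtractive reward prediction error as a simplifying assumption, and both function names are ours.

```python
import numpy as np

def on_action(q, i, reward, alpha=0.05):
    """Update the value estimate of performed action i with the
    subtractive reward prediction error and return the dopamine
    increment (plain-RPE assumption, not the model's max-based rule)."""
    rpe = reward - q[i]          # subtractive reward prediction error
    q[i] += alpha * rpe          # value learning with rate alpha
    return rpe                   # treated as the instantaneous DA increment

def decay_kda(kda, dt=0.01, tau_dop=2.0):
    """Exponential decay of synaptic dopamine KDA between action events."""
    return kda * np.exp(-dt / tau_dop)
```

Between actions KDA simply relaxes toward the tonic baseline (taken as 0 in the text), while each performed action delivers an instantaneous increment scaled by the prediction error.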
The evolution of these variables is illustrated in Figure 10, which is discussed in more detail in Subsection 4.1.4.
4.1.3 Actions and rewards
Actions
Each dMSN facilitates performance of a specific action. We specify that an action occurs, and so a decision is made by the model, when at least three different dMSNs of the same population spike in a small time window of duration ΔDA. When this condition occurs, a reward is delivered and the dopamine level is updated correspondingly, impacting all neurons in the network, depending on eligibility. Then, the spike counting and the initial window time are reset, and cortical spikes to all neurons are turned off over the next 50 ms before resuming again as usual.
We assume that iMSN activity within a population counters the performance of the action associated with that population [76]. We implement this effect by specifying that when an iMSN in a population fires, the most recent spike fired by a dMSN in that population is suppressed. Note that this rule need not contradict observed activation of both dMSNs and iMSNs preceding a decision [26]; see Subsection 2.1. We also implemented a version of the network in which each iMSN spike cancels the previous spike from both MSN populations. Preliminary simulations of this variant gave similar results to our primary version but with slower convergence (data not shown).
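The spike-counting and cancellation rules can be sketched as a small state machine (class and method names are ours; n_act = 3 dMSN spikes within a ΔDA = 6 ms window triggers the channel's action, and each iMSN spike removes that channel's most recent surviving dMSN spike):

```python
from collections import deque

class ActionSelector:
    """Illustrative sketch of the action-selection rule described above."""

    def __init__(self, n_act=3, delta_da=6.0):
        self.n_act, self.delta_da = n_act, delta_da
        self.dmsn_spikes = {"L": deque(), "R": deque()}

    def on_imsn_spike(self, channel):
        # an iMSN spike suppresses the channel's most recent dMSN spike
        if self.dmsn_spikes[channel]:
            self.dmsn_spikes[channel].pop()

    def on_dmsn_spike(self, channel, t):
        window = self.dmsn_spikes[channel]
        window.append(t)
        # drop spikes that have fallen out of the Delta_DA window
        while window and t - window[0] > self.delta_da:
            window.popleft()
        if len(window) >= self.n_act:
            window.clear()        # reset counting after selection
            return channel        # action performed by this channel
        return None
```

In the full model an action return would also trigger the reward/dopamine update and the 50 ms action refractory period.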
For convenience, we refer to the action implemented by one population of neurons as “left” or L and the action selected by the other population as “right” or R.
Rewards
In our simulations, to test the learning rule, we present results from different reward scenarios. In one case, we use constant rewards, with rL = 0.7 and rR = 0.1. In another case, we implement probabilistic rewards: every time that an action occurs, the reward ri is set to be 1 with probability pi or 0 otherwise, i ∈ {L, R}. For this case, we consider three different probabilities such that pL + pR = 1 and pL > pR, keeping the action L as the preferred one. Specifically, we take pL = 0.85, pL = 0.75, and pL = 0.65 to allow comparison with previous results [29]. In tuning the model, we also considered a regime with reward switches: reward values were as in the constant reward case but after a certain number of actions occurred, the reward-action associations were exchanged. Although the model gave sensible results, we did not explore this case thoroughly, and we simply show one example in the Supplementary Information.
4.1.4 Example implementation
The algorithm for the learning rule simulations is as follows:
First, compute cortical mother spike trains and extract daughter trains to be used as inputs to each MSN from the mother trains.
Next, while t < tend,
use RK45, with step size dt = 0.01 ms, to compute the voltages of the MSNs in the network at the current time t from equations 3 and 5,
for each MSN, set the corresponding XPOST (t) to 1 if that neuron fires a spike and to 0 otherwise, and set the corresponding XPRE (t) to 1 if an input spike arrives and to 0 otherwise,
update the action condition by checking sequentially for the following two events:
if any iMSN neuron in population i ∈ {L, R} spikes, then the most recent spike performed by any of the dMSNs of population i is cancelled;
for each i ∈ {L, R}, count the number of spikes of the dMSNs in the ith population inside a time window consisting of the last ΔDA ms; if at least nact spikes have occurred in this window, then action i has occurred and we update DAinc and Qi according to equation 10,
use RK45, with step size dt = 0.01 ms, to solve equations 6–8 for each synapse, along with equation 9, yielding an update of DA and synaptic weight levels; for synapses with XPRE (t) = 1, update the synaptic conductance via g(t) = g(t) + w(t),
set t = t + dt.
Figure 10 illustrates the evolution of all of the learning rule variables over a brief time window. Cortical spikes (thin straight light lines, top panel) can drive voltage spikes of dMSNs (dark curves, top panel), which in turn may or may not contribute to action selection (green – for L – and orange – for R – dots, top panel). Each time a dMSN fires, its eligibility trace will deviate from baseline according to the STDP rule in equation 7. In this example, the rewards are rL = 0.7 and rR = 0.1, such that every performance of L leads to an appreciable surge in KDA, with an associated rise in QL, but performances of R do not cause such large increases in KDA and QR.
Various time points are labeled in the top panel of Figure 10. At time A, R is selected. The illustrated R-dMSN fires just before this time and hence its eligibility increases. There is a small increase in KDA leading to a small increase in the w for this dMSN. At time B, L is selected. Although it is difficult to detect at this resolution, the illustrated L-dMSN fires just after the action, such that its E becomes negative and the resulting large surge in KDA causes a sizeable drop in wL. At time C, R is selected again. This time, the R-dMSN fired well before time C, so its eligibility is small, and this combines with the small KDA increase to lead to a negligible increase in wR. At time D, action L is selected but the firing of the L-dMSN is sufficiently late after this that no change in wL results. At time E, L is selected again. This time, the L-dMSN fires just before the action leading to a large eligibility and corresponding increase in wL. Finally, at time F, L is selected. In this instance, the R-dMSN fired just before selection and hence is eligible, causing wR to increase when KDA goes up. Although this weight change does not reflect correct learning, it is completely reasonable, since the physiological synaptic machinery has no way to know that firing of the R-dMSN did not contribute to the selected action L.
4.1.5 Learning rule parameters
The learning rule parameters have been chosen to capture various experimental observations, including some differences between dMSNs and iMSNs. First, it has been shown that cortical inputs to dMSNs yield more prolonged responses with more action potentials than what results from cortical inputs to iMSNs [77]. Moreover, dMSNs spike more than iMSNs when both types receive similar cortical inputs [78]. Hence, the effective weights of cortical inputs to dMSNs should be able to become stronger than those to iMSNs, which we encode by selecting wDmax > wImax. This choice is also consistent with the observation that dMSNs are more sensitive to phasic dopamine than are iMSNs [66–69]. On the other hand, the baseline firing rates of iMSNs exceed the baseline of dMSNs [79], and hence we take the initial condition for w(t) for the iMSNs greater than that for the dMSNs.
The relative values of other parameters are largely based on past computational work [22], albeit with different magnitudes to allow shorter simulation times. The learning rate αw for the dMSNs is chosen to be positive and larger than the absolute value of the negative rate value for the iMSNs. The parameters ΔPRE, ΔPOST, τE, τPRE, and τPOST have been assigned the same values for both types of neurons, keeping the relations ΔPRE > ΔPOST and τPRE > τPOST. Finally, the rest of the parameters have been adjusted to give reasonable learning outcomes.
Parameter values
We use the following parameter values in all of our simulations: τDOP = 2 ms, ΔDA = 6 ms, τg = 3 ms, α = 0.05 and c = 2.5. For both dMSNs and iMSNs, we set ΔPRE = 10 (instead of ΔPRE = 0.1; [22]), ΔPOST = 6 (instead of ΔPOST = 0.006; [22]), τE = 3 (instead of τE = 150; [22]), τPRE = 9 (instead of τPRE = 3; [22]), and τPOST = 1.2 (instead of τPOST = 3; [22]). Finally, αw = {80, −55} (instead of αw = {12, −11}; [22]) and wmax = {0.1, 0.03} (instead of wmax = {0.00045, 0}; [22]), where the first value refers to dMSNs and the second to iMSNs. Note that different reward values, ri, were used in different types of simulations, as explained in the associated text.
Learning rule initial conditions
The initial conditions used to numerically integrate the system are w = 0.015 for weights of synapses to dMSNs and w = 0.018 for iMSNs, with the rest of the variables relating to value estimation and dopamine modulation initialized to 0.
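For concreteness, the parameter values and initial conditions above can be collected into a single configuration. The following Python snippet is only an illustrative convenience (the dictionary layout and key names are ours, not part of the model); the assertions check the relations stated in the text.

```python
# STDP learning-rule configuration; numeric values are those stated in the
# text, while the dictionary layout and key names are illustrative only.
STDP_PARAMS = {
    "tau_DOP": 2.0,    # ms
    "Delta_DA": 6.0,   # ms
    "tau_g": 3.0,      # ms
    "alpha": 0.05,
    "c": 2.5,
    # Shared between dMSNs and iMSNs:
    "Delta_PRE": 10.0,
    "Delta_POST": 6.0,
    "tau_E": 3.0,
    "tau_PRE": 9.0,
    "tau_POST": 1.2,
    # (dMSN, iMSN) pairs:
    "alpha_w": (80.0, -55.0),
    "w_max": (0.1, 0.03),
}
INITIAL_W = {"dMSN": 0.015, "iMSN": 0.018}

# Relations stated in the text:
assert STDP_PARAMS["Delta_PRE"] > STDP_PARAMS["Delta_POST"]
assert STDP_PARAMS["tau_PRE"] > STDP_PARAMS["tau_POST"]
assert STDP_PARAMS["alpha_w"][0] > abs(STDP_PARAMS["alpha_w"][1])
assert STDP_PARAMS["w_max"][0] > STDP_PARAMS["w_max"][1]
assert INITIAL_W["iMSN"] > INITIAL_W["dMSN"]
```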
4.2 CBGT network
The spiking CBGT network is adapted from previous work [23]. Like the STDP model described above, the CBGT network simulation is designed to decide between two actions, a left or right choice, based on incoming sensory signals (Figure 1). The full CBGT network comprised six interconnected brain regions (see Table 3), including populations of neurons in the cortex, striatum (STR), external segment of the globus pallidus (GPe), internal segment of the globus pallidus (GPi), subthalamic nucleus (STN), and thalamus. Because the goal of the full spiking network simulations was to probe the consequential effects of corticostriatal plasticity on the functional dynamics and emergent choice behavior of CBGT networks after learning has already occurred, CBGT simulations were conducted in the absence of any trial-to-trial plasticity, and did not include dopaminergic projections from the substantia nigra pars compacta. Rather, corticostriatal weights were manipulated to capture the outcomes of STDP learning as simulated with the learning network (Subsection 4.1) under three different probabilistic feedback schedules (see Table 4), each maintained across all trials for that condition (N=2500 trials each).
4.2.1 Neural dynamics
To build on previous work on a two-alternative decision-making task with a similar CBGT network, and to endow neurons in some BG populations with bursting capabilities, all neural units in the CBGT network were simulated using the integrate-and-fire-or-burst model [80]. Each neuron's membrane dynamics were determined by

C dV/dt = −gL(V − VL) − gT H(V − Vh) h (V − VT) − Isyn.   (11)
In equation 11, parameter values are C = 0.5 nF, gL = 25 nS, VL = −70 mV, Vh = −0.60 mV, and VT = 120 mV. When the membrane potential reaches a boundary Vb, it is reset to Vr. We take Vb = −50 mV and Vr = −55 mV.
The middle term on the right hand side of equation 11 represents a depolarizing, low-threshold T-type calcium current that becomes available when h grows and when V is depolarized above a level Vh, since H(·) is the Heaviside step function. For neurons in the cortex, striatum (both MSNs and FSIs), GPi, and thalamus, we set gT = 0, thus reducing the dynamics to the simple leaky integrate-and-fire model. For bursting units in the GPe and STN, rebound burst firing is possible, with gT set to 0.06 nS for both nuclei. The inactivation variable, h, adapts over time, decaying when V is depolarized and rising when V is hyperpolarized, according to

dh/dt = −h/τh−  for V ≥ Vh,   (12)
dh/dt = (1 − h)/τh+  for V < Vh,   (13)

with the same time constants τh− and τh+ used for both GPe and STN.
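The membrane and inactivation dynamics above can be sketched with a simple forward-Euler update. This is an illustration, not the production integrator: the membrane parameters follow the text, but the inactivation time constants TAU_H_MINUS and TAU_H_PLUS are placeholders, since their values are not restated here.

```python
# Forward-Euler sketch of the integrate-and-fire-or-burst unit described in
# the text (equation 11 plus the h-inactivation rule). TAU_H_MINUS and
# TAU_H_PLUS are illustrative placeholder values, not values from the paper.
C, G_L, V_L = 0.5, 25.0, -70.0          # nF, nS, mV
V_H, V_T = -0.60, 120.0                 # mV: T-current gate level and reversal
V_B, V_R = -50.0, -55.0                 # mV: spike boundary and reset
TAU_H_MINUS, TAU_H_PLUS = 20.0, 100.0   # ms (placeholders)

def ifb_step(V, h, g_T, I_syn, dt=0.01):
    """Advance (V, h) by one Euler step of dt; returns (V, h, spiked)."""
    gate = 1.0 if V >= V_H else 0.0     # Heaviside factor H(V - Vh)
    # Leak current, depolarizing T-type calcium current, synaptic current.
    dV = (-G_L * (V - V_L) - g_T * gate * h * (V - V_T) - I_syn) / C
    # h decays when depolarized (V >= Vh) and recovers when hyperpolarized.
    dh = -h / TAU_H_MINUS if V >= V_H else (1.0 - h) / TAU_H_PLUS
    V, h = V + dt * dV, h + dt * dh
    if V >= V_B:                        # boundary crossing: emit a spike
        return V_R, h, True
    return V, h, False
```

With g_T = 0 the update reduces to a plain leaky integrate-and-fire step, matching the simplification used for cortex, striatum, GPi, and thalamus.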
For all units in the model, the synaptic current Isyn reflects both the synaptic inputs from other explicitly modeled populations of neurons within the CBGT network and additional background inputs from sources that are not explicitly included in the model. This current is computed using the equation

Isyn = g1 s1 (V − VE) + g2 s2 (V − VE)/(1 + e^(−0.062V)/3.57) + g3 s3 (V − VI).   (14)

The reversal potentials are set to VE = 0 mV and VI = −70 mV. The synaptic current components correspond to AMPA (g1), NMDA (g2), and GABAA (g3) synapses. The gating variables si for AMPA and GABAA receptor-mediated currents satisfy

dsi/dt = −si/τ + Σj δ(t − tj),   (15)

while NMDA receptor-mediated current gating obeys

dsi/dt = −si/τ + α(1 − si) Σj δ(t − tj).   (16)

In equations 15 and 16, tj is the time of the jth spike and α = 0.63. The decay constant, τ, was 2 ms for AMPA, 5 ms for GABAA, and 100 ms for NMDA-mediated currents. A time delay of 0.2 ms was used for synaptic transmission.
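The gating dynamics in equations 15 and 16 admit a simple event-driven sketch: between presynaptic spikes each gate decays exponentially with its receptor's time constant, and at a spike the AMPA/GABAA gates jump by 1 while the NMDA gate increments by α(1 − s), which keeps it bounded below 1. Function names here are illustrative.

```python
import math

# Event-driven form of the synaptic gating dynamics (equations 15 and 16).
ALPHA = 0.63
TAU = {"AMPA": 2.0, "GABA_A": 5.0, "NMDA": 100.0}   # decay constants, ms

def decay(s, receptor, dt_ms):
    """Exponential decay of the gating variable s over dt_ms milliseconds."""
    return s * math.exp(-dt_ms / TAU[receptor])

def on_spike(s, receptor):
    """Instantaneous jump in s at a presynaptic spike time tj."""
    if receptor == "NMDA":
        return s + ALPHA * (1.0 - s)   # saturating increment, s stays < 1
    return s + 1.0                     # AMPA and GABA_A gates jump by 1
```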
4.2.2 Network architecture
The CBGT network includes six of the nodes shown in Figure 1, excluding the dopaminergic projections from the substantia nigra pars compacta that are simulated in the STDP model. The membrane dynamics, projection probabilities, and synaptic weights of the network (see Table 3) were adjusted to reflect empirical knowledge about local and distal connectivity associated with different populations, as well as resting and task-related firing patterns [23, 57].
The cortex included separate populations of neurons representing sensory information for L (N=270) and R (N=270) actions that approximate the processing in the intraparietal cortex or frontal eye fields. On each trial, L and R cortical populations received excitatory inputs from an external source, sampled from a truncated normal distribution with a mean and standard deviation of 2.5 Hz and 0.06, respectively, with lower and upper limits of 2.4 Hz and 2.6 Hz. Critically, L and R cortical populations received the same strength of external stimulation on each trial to ensure that any observed behavioral effects across conditions were not the result of biased cortical input. Excitatory cortical neurons also formed lateral connections with other cortical neurons with a diffuse topology, or a non-zero probability of projecting to recipient neurons within and between action channels (see Table 3 for details). The cortex also included a single population of inhibitory interneurons (CtxI; N=250 total) that formed reciprocal connections with left and right sensory populations. Along with external inputs, cortical populations received diffuse ascending excitatory inputs from the thalamus (Th; N=100 per input channel).
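The per-trial external drive described above can be sketched as rejection sampling from the stated truncated normal distribution, with the same draw delivered to both channels so that no trial carries biased cortical input. This is a minimal illustration; the variable names are ours.

```python
import random

# Per-trial external cortical drive: a normal distribution (mean 2.5 Hz,
# s.d. 0.06) truncated to [2.4, 2.6] Hz, sampled by rejection.
def sample_cortical_rate(mu=2.5, sigma=0.06, lo=2.4, hi=2.6):
    while True:
        x = random.gauss(mu, sigma)
        if lo <= x <= hi:
            return x

# Both L and R populations receive the identical draw on a given trial.
rate = sample_cortical_rate()
trial_input = {"L": rate, "R": rate}
```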
L and R cortical populations projected to dMSN (N=100/channel) and iMSN (N=100/channel) populations in the corresponding action channel; that is, cortical signals for a L action projected to dMSN and iMSN cells selective for L actions. Both cortical populations also targeted a generic population of FSI (N=100 total) providing widespread but asymmetric inhibition to MSNs, with stronger FSI-dMSN connections than FSI-iMSN connections [81]. Within each channel, dMSN and iMSN populations also formed recurrent and lateral inhibitory connections, with stronger inhibitory connections from iMSN to dMSN populations [81]. Striatal MSN populations also received channel-specific excitatory feedback from corresponding populations in the thalamus. Inhibitory efferent projections from the iMSNs terminated on populations of cells in the GPe, while the inhibitory efferent connections from the dMSNs projected directly to the GPi.
In addition to the descending inputs from the iMSNs, the GPe neurons (N=1000/channel) received excitatory inputs from the STN. GPe cells also formed recurrent, within-channel inhibitory connections that supported stability of activity. Inhibitory efferents from the GPe terminated on corresponding populations in the STN (i.e., long indirect pathway) and GPi (i.e., short indirect pathway). We did not include arkypallidal projections (i.e., feedback projections from the GPe to the striatum; [82]), as it is not currently well understood how this pathway contributes to basic choice behavior.
Similar to the GPe, STN populations were composed of bursting neurons (N=1000/channel) with channel-specific inhibitory inputs from the GPe as well as excitatory inputs from cortex (the hyperdirect pathway). Since no cancellation signals were modeled in the experiments (see Subsection 4.2.3), the hyperdirect pathway was simplified to background input to the STN. Unlike the striatal MSNs and the GPe, the STN did not feature recurrent connections. Excitatory feedback from the STN to the GPe was assumed to be sparse but channel-specific, whereas projections from the STN to the GPi were channel-generic and caused diffuse excitation in both L- and R-encoding populations.
Populations of cells in the GPi (N=100/channel) received inputs from three primary sources: channel-specific inhibitory afferents from dMSNs in the striatum (i.e., direct pathway) and the corresponding population in the GPe (i.e., short indirect pathway), as well as excitatory projections from the STN shared across channels (i.e., long indirect and hyperdirect pathways; see Table 3). The GPi did not include recurrent feedback connections. All efferents from the GPi consisted of inhibitory projections to the motor thalamus. The efferent projections were segregated strictly into pathways for L and R actions.
Finally, L- and R-encoding populations in the thalamus were driven by two primary sources of input, integrating channel-specific inhibitory inputs from the GPi and diffuse (i.e., channel-spanning) excitatory inputs from cortex. Outputs from the thalamus delivered channel-specific excitatory feedback to corresponding dMSN and iMSN populations in the striatum as well as diffuse excitatory feedback to cortex.
4.2.3 Simulations of experimental scenarios
Because the STDP simulations did not reveal strong differences in Ctx-iMSN weights across reward conditions, only Ctx-dMSN weights were manipulated across conditions in the full CBGT network simulations. In all conditions the Ctx-dMSN weights were higher in the left (higher/optimal reward) than in the right (lower/suboptimal reward) action channel (see Table 4). On each trial, external input was applied to L- and R-encoding cortical populations, each projecting to corresponding populations of dMSNs and iMSNs in the striatum, as well as to a generic population of FSIs. Critically, all MSNs also received input from the thalamus, which was reciprocally connected with cortex. Due to the suppressive effects of FSI activity on MSNs, sustained input from both cortex and thalamus was required to raise the firing rates of striatal projection neurons to levels sufficient to produce an action output. Due to the convergence of dMSN and iMSN inputs in the GPi, and their opposing influence over BG output, co-activation of these populations within a single action channel served to delay action output until activity within the direct pathway sufficiently exceeded the opposing effects of the indirect pathway [23]. Both the behavioral choice and the time of that decision (i.e., the RT) were determined by a winner-take-all rule: the first action channel to drive the average firing rate of its thalamic population above a threshold of 30 Hz was selected.
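The winner-take-all readout can be expressed compactly: given per-channel time series of average thalamic firing rates, the first channel to cross 30 Hz fixes both the choice and the RT. The function and argument names below are illustrative.

```python
# Winner-take-all readout used to score a simulated trial. `traces` maps
# channel labels (e.g., "L", "R") to average thalamic firing-rate time
# series sampled every dt_ms milliseconds.
THRESHOLD_HZ = 30.0

def read_out(traces, dt_ms=1.0):
    n = min(len(t) for t in traces.values())
    for i in range(n):
        for channel, trace in traces.items():
            if trace[i] >= THRESHOLD_HZ:
                return channel, i * dt_ms   # (choice, RT in ms)
    return None, None                       # no decision on this trial
```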
4.3 Drift Diffusion Model
To understand how altered corticostriatal weights influence decision-making behavior, we fit the simulated behavioral data from the CBGT network with a DDM [1, 83] and compared alternative models in which different parameters were allowed to vary across reward probability conditions. The DDM is an established model of simple two-alternative choice behavior, providing a parsimonious account of both the speed and accuracy of decision-making in humans and animal subjects across a wide variety of binary choice tasks [83]. It assumes that input is stochastically accumulated as the log-likelihood ratio of evidence for two alternative choices until reaching one of two decision thresholds, representing the criterion evidence for committing to a choice. Importantly, this accumulation-to-bound process affords predictions about the average accuracy, as well as the distribution of response times, under a given set of model parameters. The core parameters of the DDM include the rate of evidence accumulation, or drift rate (v), the distance between decision boundaries, also referred to as the threshold (a), the bias in the starting-point between boundaries for evidence accumulation (z), and a non-decision time parameter that determines when accumulation of evidence begins (tr), accounting for sensory and motor delays.
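The accumulation-to-bound process described above can be sketched with a simple Euler-Maruyama simulation: evidence starts at z·a, drifts at rate v with Gaussian noise, and a choice is committed when it hits 0 or a, with tr added as non-decision time. This is an illustration of the generative model only, not the fitting procedure used here.

```python
import random

# Minimal DDM trial simulator: parameters follow the naming in the text
# (v = drift rate, a = boundary separation, z = starting-point bias,
# tr = non-decision time); sigma and dt are illustrative choices.
def simulate_ddm_trial(v=0.3, a=1.0, z=0.5, tr=0.3, dt=0.001, sigma=1.0):
    x, t = z * a, 0.0                       # evidence starts at z * a
    while 0.0 < x < a:
        x += v * dt + sigma * (dt ** 0.5) * random.gauss(0.0, 1.0)
        t += dt
    choice = "upper" if x >= a else "lower"
    return choice, tr + t                   # (boundary reached, RT in s)
```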
To narrow the subset of possible DDM models considered, DDM fits to the CBGT model behavior were conducted in three stages using a forward stepwise selection process. First, we compared models in which a single parameter in the DDM was free to vary across reward conditions. For these simulations all the DDM parameters were tested. Next, additional model fits were performed with the best-fitting model from the previous stage, but with the addition of a second free parameter. Finally, the two best-fitting dual parameter models were submitted to a final round of fits in which trial-wise measures of striatal activity (see Figure 8B-C) were included as regressors on the two designated parameters of the DDM. All CBGT regressors were normalized between values of 0 and 1. Each regression model included one regression coefficient capturing the linear effect of a given measure of neural activity on one of the free parameters (e.g., a, v, or z), as well as an intercept term for that parameter, resulting in a total of four free parameters per selected DDM parameter, or 8 free parameters altogether. For example, in a model where drift rate is estimated as a function of the difference between dMSN firing rates in the left and right action channels, the drift rate on trial t is given by v(t) = β0 + βj Xj(t), where β0 is the drift rate intercept, βj is the beta coefficient for reward condition j, and Xj(t) is the observed difference in dMSN firing rates between action channels on trial t in condition j. A total of 24 separate regression models were fit, testing all possible combinations between the two best-fitting dual parameter models and the four measures of striatal activity summarized in Figure 8B-C.
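The regression step above can be sketched as follows: a neural regressor is first normalized to [0, 1] across trials, then mapped linearly onto the drift rate. The coefficient values below are arbitrary illustrations, not fitted estimates.

```python
# Trial-wise drift rate as a linear function of a striatal regressor,
# following the regression form in the text: v(t) = b0 + b1 * X(t).
def normalize(xs):
    """Rescale a list of regressor values to the [0, 1] interval."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def trialwise_drift(dmsn_rate_diff, b0=0.1, b1=0.8):
    """dmsn_rate_diff: per-trial L-R difference in dMSN firing rates.
    b0 and b1 play the roles of the intercept and beta coefficient."""
    return [b0 + b1 * x for x in normalize(dmsn_rate_diff)]
```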
Fits of the DDM were performed using HDDM (see [84] for details), an open source Python package for Bayesian estimation of DDM parameters. Each model was fit by drawing 2000 Markov chain Monte Carlo (MCMC) samples from the joint posterior probability distribution over all parameters, with acceptance based on the likelihood (see [85]) of the observed accuracy and RT data given each parameter set. A burn-in period of 1200 samples was implemented to ensure that model selection was not influenced by samples drawn prior to convergence. Sampling chains were also visually inspected for signs of convergence failure; however, parameters in all models showed normally distributed posteriors with little autocorrelation between samples, suggesting that the sampling parameters were sufficient for convergence. The prior distributions used to initialize all DDM parameters included in the fits can be found in [84].
Competing Interests
The authors declare no financial or non-financial competing interests.
Acknowledgments
C. Vich is supported by the Ministerio de Economía, Industria y Competitividad (MINECO), the Agencia Estatal de Investigación (AEI), and the European Regional Development Funds (ERDF) through projects MTM2014-54275-P, MTM2015-71509-C2-2-R and MTM2017-83568-P (AE/ERDF,EU). JR received support from NSF awards DMS 1516288, 1612913 (CRCNS), and 1724240 (CRCNS). TV received support from NSF CAREER award 1351748. The research was sponsored in part by the U.S. Army Research Laboratory, including work under Cooperative Agreement Number W911NF-10-2-0022, and the views espoused are not official policies of the U.S. Government.