Abstract
Model-free learning creates stimulus-response associations. But what constitutes a stimulus? Are there limits to the types of stimuli a model-free or habitual system can operate over? Most experiments on reward learning in humans and animals have used discrete sensory stimuli, but there is no algorithmic reason that model-free learning should be restricted to external stimuli, and recent theories have suggested that model-free processes may operate over highly abstract concepts and goals. Our study aimed to determine whether model-free learning processes can operate over environmental states defined by information held in working memory. Specifically, we tested whether humans can learn explicit temporal patterns of individually uninformative cues in a model-free manner. We compared data from human participants in a reward learning paradigm with either (1) a simultaneous symbol presentation condition, wherein the visual stimuli were presented together, or (2) a sequential symbol presentation condition, wherein the same visual stimuli were presented as a temporal sequence that required working memory. We found a significant effect of reward on human behavior in the sequential presentation condition, suggesting that model-free learning can operate on information stored in working memory. Further analyses, however, revealed that the participants' behavior contradicts the basic assumptions of our hypotheses, and it is possible that the observed effect of reward was generated by model-based rather than model-free learning. It is therefore not possible to draw any conclusions from our study regarding model-free learning of temporal sequences held in working memory. We conclude instead that careful thought should be given to how best to explain two-stage tasks to participants.
1 Introduction
Reinforcement learning theory and the computational algorithms associated with it have been extremely influential in the behavioral, biological, and computer sciences. Reinforcement learning theory describes how an agent learns by interacting with its environment [1]. In a typical reinforcement learning paradigm, the agent selects an action and the environment responds by presenting rewards and taking the agent to the next situation, or state. A reinforcement learning algorithm determines how the agent changes its action selection strategy as a result of experience, with the goal of maximizing future rewards. Depending on how algorithms accomplish this goal, they are classified as model-free or model-based [1]. Model-based algorithms acquire beliefs about how the environment generates outcomes in response to their actions and select actions according to their predicted consequences. By contrast, model-free algorithms generate a propensity to perform, in each state of the world, actions that were more rewarding in previous visits to that environmental state. Model-free reinforcement learning algorithms are of considerable interest to behavioral and biological scientists, in part because they offer a compelling account of the phasic activity of dopamine neurons, but also more generally can explain many observed patterns of behavior in human and non-human animals [2, 3, 4, 5, 6, 7].
A key concept in reinforcement learning theory is the environmental state. Typically, empirical tests of reinforcement learning algorithms use discrete sensory stimuli to define environmental states. However, there is no theoretical or algorithmic constraint to define the states of the environment exclusively by sensory stimuli. State definitions may also include the agent’s internal stimuli, such as its memory of past events, thirst or hunger level, or even subjective characteristics such as happiness or sadness [1]. Thus, model-free reinforcement learning might operate over a wide variety of both external and internal factors.
Indeed, recent work suggests that model-free learning algorithms can support a large set of cognitive processes and behaviors beyond the formation of habitual response associations with discrete sensory stimuli [8, 9, 10]. For instance, it has been proposed that the model-free system can perform the action of selecting a goal for goal-directed planning [11] or conversely that a model-based decision can trigger a habitual action sequence [12, 13, 14, 15]. Model-free algorithms have also been suggested to gate working memory [16]. However, many of these important theoretical proposals about model-free algorithms have not been directly tested empirically.
Here, we determine the ability of model-free reinforcement learning algorithms to operate over states defined by information held in working memory, an internal state. Specifically, we use an experimental paradigm and computational modeling framework designed to dissociate model-free from model-based influences on behavior [17] to test if temporally separated sequences of individually uninformative cues can drive model-free learning and behavior. If an agent can store the elements of a temporal sequence in its memory to form a unique and predictive cue and use the memorized information as the state definition, then, theoretically, it can use model-free algorithms to learn the associations between a specific sequence of individually uninformative cues and action outcomes [18].
Our approach has several important facets. First, we use an experimental paradigm that allows us to determine not only if our participants learn from information in working memory, but also whether that learning is supported by model-based or model-free algorithms. Second, the cues in our temporal sequences are individually uninformative; in other words, any single cue in isolation provides no information about which response is correct. It is well-known that model-free algorithms can shift response associations to the earliest occurring predictor of the correct response in a temporal sequence of informative cues and can integrate predictive information across individual cues. Neither of these mechanisms is possible in our paradigm because the individual cues themselves contain no information about the previous or subsequent cues or which response is best.
Temporal pattern learning is a fundamental and early developing human cognitive ability. It allows people to form predictions about what will happen from what has happened and select their actions accordingly. Humans can learn patterns both explicitly and implicitly in the absence of specific instructions or conscious awareness [19]. Moreover, they can do so as early as two months of age [20]. In fact, people identify patterns even when, in reality, no pattern exists [21]. These empirical results together with the theoretical potential for model-free learning to operate over internal stimuli suggest that temporal pattern learning could be supported by model-free processes. However, to date, studies of reinforcement learning and decision making have focused primarily on tasks in which the relevant stimuli are presented simultaneously just prior to or at the time of decision-making, or on implicit motor sequence learning, wherein participants learn a sequence of movements automatically, without full awareness (for instance, 22, 23, 24, 25, 26). Thus, the degree to which model-free processes do in fact operate over temporal sequences or any other information stored in working memory has not yet been directly tested and compared with model-free learning from traditionally employed external, static environmental cues.
Here, we directly test whether model-free processes can access and learn from information stored in working memory. We adapted a decision-making paradigm originally developed by Daw et al. [17] that can behaviorally dissociate the influence of model-free and model-based learning on choice. The task was performed by two groups of human participants either in a simultaneous condition (i.e. static and external), wherein visual stimuli were presented simultaneously, or in a sequential condition, wherein the same visual stimuli were presented as a temporal sequence that required working memory processing.
2 Results
2.1 Determining model-free and model-based influences on choice behavior
Forty-one young adult human participants completed a behavioral task adapted from Daw et al. [17]. In our task, participants began each trial in a randomly selected initial state represented by one of four possible sequences of two symbols: AA, AB, BA, or BB (Figure 1). At this initial state, participants chose one of two possible actions: going left or going right. They were then taken to one of two possible final states, the blue state or the pink state. If they had gone left, they were taken with 0.8 probability to the final state given by the rule AA → blue, AB → pink, BA → pink, BB → blue or with 0.2 probability to the other final state. If they had gone right, they were taken with 0.8 probability to the final state not given by the previous rule or with 0.2 probability to the other final state. The common (most probable) transitions between the initial and final states are shown in Figure 2. To predict the final state accurately, participants had to know both elements of the sequence. If they knew only one, the final state was equally likely to be blue or pink, and they could not perform above chance. This feature is key and separates our work from others in which each element of a sequence is predictive on its own.
One of the final states delivered a monetary reward with 0.7 probability and the other with 0.3 probability. The optimal strategy was to always select the action that led with 0.8 probability to the final state with 0.7 reward probability. Initially, participants were instructed to learn the common transitions between the initial and final states in the absence of rewards. They were told that each final state might be rewarded with different probabilities, but not what the probabilities were nor that they were fixed. The task comprised 250 trials and participants received the total reward they obtained at the end.
Twenty-one participants were randomly allocated to a simultaneous condition and twenty to a sequential condition (Figure 1). In the simultaneous condition, both symbols that represented the initial state were displayed simultaneously on the screen. In the sequential condition, each symbol was displayed consecutively by itself, as a temporal sequence. The specific objective of this study was to determine if participants in the sequential condition could use states represented in working memory to learn the task in a model-free way or if their learning was necessarily model-based. The simultaneous condition is already known to support model-free learning as well as model-based learning [17, 27, 28, 29, 30]. We thus sought to determine the difference between the standard simultaneous and working-memory dependent sequential conditions.
The two-stage task we used can differentiate between model-free and model-based learning because algorithms that implement them make different predictions about how a reward received in a trial impacts a participant’s choices in subsequent trials. The SARSA (λ = 1) model-free algorithm learns this task by strengthening or weakening associations between initial states and initial-state actions depending on whether the action is followed by a reward or not [1]. Therefore, it simply predicts that an initial-state action that resulted in a reward is more likely to be repeated in the next trial with the same initial state [17]. On the other hand, the model-based algorithm considered in this study uses an internal model of the task’s structure to determine the initial-state choice that will most likely result in a reward [17]. To this end, it considers which final state, pink or blue, was most frequently rewarded in recent trials and selects the initial-state action, left or right, that will most likely lead there. Therefore, the model-free algorithm predicts that the participant will choose the most frequently rewarded action in past trials with the same initial state, while the model-based algorithm predicts that the participant will choose the action with the highest probability of leading to the most frequently rewarded final state in past trials, regardless of their initial states.
The model-free and model-based algorithms thus generate different predictions about the stay probability, which is the probability that in two consecutive trials the participant will stay with their first choice and take the same initial-state action in the second trial. For instance, if the participant chose left in two consecutive trials, this was considered a stay. The model-free and model-based predictions differ depending on whether the letters presented in one trial are the same as or different from the letters presented in the other trial, so we ran four separate analyses on the data from each condition, dividing consecutive trial pairs into four subsets: “same letters” if both letters presented in the first trial are the same as the letters presented in the second trial (for example, AB for the first trial and AB again for the second trial), “same first letter” if the first letters presented in each trial are the same but the second letters are different (for example, AB and AA), “same second letter” if the second letters are the same but the first letters are different (for example, AB and BB), and “different letters” if both the first letters and the second letters are different (for example, AB and BA).
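The subset assignment above can be sketched as a small function. This is an illustrative reconstruction, not the analysis code used in the study; the helper name `classify_pair` and the two-character string encoding of initial states are our own conventions.

```python
def classify_pair(first, second):
    """Classify a pair of consecutive initial states (e.g. "AB", "AA")
    into one of the four trial-pair subsets used in the analysis."""
    same_first = first[0] == second[0]
    same_second = first[1] == second[1]
    if same_first and same_second:
        return "same letters"
    if same_first:
        return "same first letter"
    if same_second:
        return "same second letter"
    return "different letters"
```

For example, `classify_pair("AB", "BB")` returns `"same second letter"`, because only the second symbols match.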
In all cases, we analyzed the data using Bayesian hierarchical logistic regression analyses. In addition to examining the stay choice probabilities, we directly examined the logistic regression coefficients for each condition and trial pair subset. Because in our task the mean reward probability associated with one final state is higher than the mean reward probability associated with the other final state, we did not use the logistic regression model proposed by Daw et al. [17]—as several studies demonstrate, if the reward probabilities are not the same, a reward by transition interaction does not uniquely characterize model-based agents, but also appears in purely model-free results [31, 32, 33, 34]. We thus used instead an extended logistic regression model we had previously proposed that corrects for different reward probabilities by adding two control predictors: a binary variable that indicates whether or not the chosen initial-state action in the first trial leads commonly to the final state with the highest reward probability, and a binary variable that indicates whether or not the agent visited in the first trial the final state with the highest reward probability [34]. For comparison with the behavior of human participants, we fitted model-free and model-based algorithms to the experimental data, used the obtained parameter estimates to simulate purely model-free and purely model-based agents performing our task, and analyzed the resulting data using the same logistic regression procedure. The stay probabilities obtained for both simulated agents and human participants are shown in Figures 3, 4, 5, and 6, and the logistic regression coefficients obtained for both simulated agents and human participants are shown in Figures 7, 8, 9, and 10.
As can be seen in Figure 7A, if the letters are the same, for example AB for both trials, the model-free prediction is that the stay probability will increase if the first trial was rewarded and decrease if it was not; i.e., model-free learning creates a positive reward effect. The model-based prediction, on the other hand, is that the stay probability will increase if either the first trial was rewarded and the transition was common or the first trial was unrewarded and the transition was rare, and decrease otherwise; i.e., model-based learning creates a positive reward by transition interaction [17]. If the consecutive trials have different initial-state letters, the predictions will be different depending on the condition (simultaneous or sequential) and the assumed hypothesis regarding model-free learning of temporal sequences. In the simultaneous condition, the model-free prediction is that the stay probability will not change, because learning does not generalize among different initial states (Figures 8A, 9A, and 10A). In the sequential condition, if we assume that model-free learning can learn from temporal sequences, then the prediction is that the stay probability will also not change (Figures 8A, 9A, and 10A). If, however, we assume that model-free learning cannot learn from temporal sequences, then the model-free system may associate the first letter or the second letter with previously received rewards. Assuming, for example, that the second letter is associated with rewards, if the two consecutive trials have the same second letter, the stay probability should increase if the previous trial was rewarded and decrease if the previous trial was unrewarded; if the two consecutive trials have different second letters, however, the stay probability should not change. The simulated results for the latter hypothesis are not shown; in the Figures above, we assumed that model-free learning can learn from temporal sequences.
For model-based learning, the prediction is that the reward by transition interaction will be positive if both letters are the same or both letters are different for the two trials (for example, the first trial’s letters are AB and the second trial’s letters are either AB or BA—see Figures 7A and 10A), because in this case the common and rare transitions are the same for both trials. If one letter is the same but the other letter is different (for example, the first trial’s letters are AB and the second trial’s letters are AA or BB—see Figures 8A and 9A), the model-based prediction is that the reward by transition interaction will be negative, because in this case the common and rare transitions are switched between the trials, so if the left action commonly leads to the pink state in the first trial, for instance, it commonly leads to the blue state in the second trial.
For our sample of human participants and trial pairs in the “same letters” subset, behavior was influenced by both reward and the reward by transition interaction regardless of whether the states were defined by external sensory cues or internal working-memory representations (Figure 7C). We thus found no evidence that sequentially presented, working-memory-dependent state cues shift the balance of model-based and model-free effects on choice behavior compared to traditional, static, external cues. However, the results obtained for other trial pair subsets show unpredicted effects, namely: (1) there is a negative effect of reward for the sequential condition in the “same second letter” subset (Figure 9C); the estimated value of this coefficient is −0.20 (95% CI [−0.39, −0.01]); and (2) there is a positive effect of reward for the simultaneous condition in the “different letters” subset (Figure 10C); the estimated value of this coefficient is 0.23 (95% CI [0.02, 0.43]). Because of these unexpected results, we decided to replicate our experiment using a task that had geometric figures rather than letters to identify the different initial states (see Appendix on page 29). Thirty-two human participants performed that task in both the simultaneous and sequential conditions. We again observed in the replicated data a negative reward effect for the sequential condition in the “same first letter” and “same second letter” subsets, as well as a positive reward effect for both the sequential and the simultaneous condition in the “different letters” subset.
3 Discussion
In this study, we empirically tested the hypothesis that human participants can develop model-free associations between temporal sequences of stimuli stored in working memory and a motor response. To that end, we developed a behavioral task based on a previous decision-making paradigm that can determine the model-free and model-based influences on choice [17]. The participants in the simultaneous condition performed this task with the two visual symbols presented together simultaneously and those in the sequential condition performed it with the same two visual symbols presented as a temporal sequence that had to be held in working memory. A key element of our experimental paradigm is that the individual symbols within each temporal sequence convey no information about the best response in isolation. This fact rules out the possibility that the sequential condition’s model-free effect is due to an association between a single symbol in the sequence and a response rather than one between the entire sequence and a response. Each sequence element is completely uninformative by itself: it cannot predict reward delivery above chance. Therefore, the task cannot be learned by simple stimulus-response associations with individual symbols in the temporal sequence.
At first glance, our results support the hypothesis that model-free learning can operate on stimuli stored in working memory. Two findings, however, cannot be explained by the assumed model of hybrid reinforcement learning, adapted to the two-stage task by Daw et al. [17]. Since model-free learning is assumed to be unable to generalize between distinct states (see Doll et al. 35, Kool et al. 36 for example studies that critically depend on this assumption) and model-based learning is assumed to generate only a reward by transition interaction, there should not be a reward effect for consecutive trials with different initial-state symbols. Yet, we observed a positive reward effect for trial pairs in the “different letters” subset both in the data presented here and in the follow-up replication study using a different initial-state representation. A possible explanation for this finding is that, after all, model-free learning is able to generalize between different state representations. It is possible that participants reduced the two-letter sequence to an abstract representation such as “the two letters were the same” (either AA or BB) or “the two letters were different” (either AB or BA). This abstraction is sufficient to determine the common and rare transitions, and we know from direct reports that at least some participants used it to memorize the transition rules. If model-free learning can operate on stimuli stored in working memory, it is conceivable that it can also operate on abstract representations stored in working memory. However, the use of abstract state representations cannot explain our second unpredicted finding: a negative effect of reward observed for the sequential condition in the “same second letter” subset and, in the replication study, also in the “same first letter” subset. Under the assumed model of hybrid learning, the reward effect can never be negative.
The SARSA (λ = 1) algorithm used here to model model-free learning in the brain foresees no circumstances under which rewarding an action would decrease the probability of choosing that action again in the future.
The unpredicted reward effects we observed in some analyses raise a question about the predicted reward effect observed in other analyses: Does a reward effect truly indicate model-free learning in our data set? Is it not possible that at least some of these effects are generated by model-based learning instead? It is commonly assumed that model-based learning does not generate a reward effect, because it is assumed that participants make model-based decisions using a specific model of the task structure. It is possible, however, that the model they are using is different from the assumed one and can generate positive as well as negative reward effects. For example, a participant might think that their initial-state choices influence the reward probability, even if they are told this is not the case—they might have misunderstood or forgotten the instructions or thought the instructions were misleading.
Given that at least some of the observed reward effects may have been generated by model-based rather than model-free learning, we cannot conclude that our data provide evidence for or against the hypothesis that model-free learning can operate over information held in working memory. In order to study this or other hypotheses involving model-free learning, it is crucial that participants use a model of the task structure for model-based learning that does not generate reward effects. Future research may thus concentrate on developing more detailed and precise instructions, as well as tutorials and tests, to make sure that participants really understand the task and what they have to do. It is also essential that the data be checked for violations of the assumed model using multiple analyses.
4 Methods
4.1 Participants
Forty-one healthy young adults participated in the experiment, 21 (13 female) randomly assigned by a random number generator to the simultaneous condition and 20 (13 female) to the sequential condition. The inclusion criterion was speaking English and no participants were excluded from the analysis. The sample size was chosen by the precision for research planning method [37, 38], by comparing the estimated differences between participant groups in the logistic regression analysis with those between model-free and model-based simulated agents.
The experiment was conducted in accordance with the Zurich Cantonal Ethics Commission’s norms for conducting research with human participants, and all participants gave written informed consent.
4.2 Task
The task’s state transition model defines four possible initial states, which were randomly selected with uniform distribution in each trial and represented by four different stimuli, each composed of two symbols: AA, AB, BA, or BB. At the initial state, two actions were available to the participant: pressing the left or the right arrow keys. By pressing one of the keys, the participant was taken to a final state, which might be either the blue state or the pink state. If the left arrow key was pressed, the participant was taken to the final state given by the rule AA → blue, AB → pink, BA → pink, BB → blue with 0.8 probability or to the other state with 0.2 probability; if the right arrow key was pressed, the participant was taken to the final state not given by the previous rule with 0.8 probability or to the other state with 0.2 probability. There was no choice of action at the final state, but participants were required to make a button press to potentially earn the reward. Each final state was rewarded according to an associated probability, which was 0.7 for one state and 0.3 for the other. The highest reward probability was associated with the blue state for half of the participants and to the pink state for the other half. Participants were told that each final state might be rewarded with different probabilities, but not what the probabilities were nor that they were fixed.
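The transition and reward structure described above can be summarized in a short simulation sketch. This is an illustration of the task's probabilities only, not the experiment software; the function names and the string encodings of states and actions are our own.

```python
import random

# Rule mapping each initial state to the final state commonly
# reached (p = 0.8) by the "left" action; "right" commonly leads
# to the other final state.
LEFT_RULE = {"AA": "blue", "AB": "pink", "BA": "pink", "BB": "blue"}

def transition(initial_state, action, p_common=0.8):
    """Return the final state ("blue" or "pink") reached from an
    initial state after choosing "left" or "right"."""
    common = LEFT_RULE[initial_state]
    if action == "right":
        common = "pink" if common == "blue" else "blue"
    rare = "pink" if common == "blue" else "blue"
    return common if random.random() < p_common else rare

def reward(final_state, high_state="blue"):
    """Deliver a reward (1) with probability 0.7 at the final state
    with the highest reward probability and 0.3 at the other; the
    assignment of high_state was counterbalanced across participants."""
    p = 0.7 if final_state == high_state else 0.3
    return 1 if random.random() < p else 0
```

Note that `transition` makes the key property of the design explicit: knowing only one symbol of the initial state leaves the common final state undetermined.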
In contrast with our task design, in which the final states’ reward probabilities were fixed, in the original task design proposed by Daw et al. [17] the reward probabilities slowly drifted over time, because those authors were interested in the trade-off between model-based and model-free mechanisms, which is assumed to happen on the basis of their relative uncertainties. In this study we were interested instead in testing if model-free learning of temporal patterns is possible, and keeping the task environment stable helps make the model-free associations stronger and more likely to influence choice [39, 40].
Participants were initially instructed to learn the common transitions between the initial and the final states in the absence of reward. Participants then performed the task defined by the model above in the simultaneous or sequential condition. Half of the participants were randomly allocated to the simultaneous condition and the other half to the sequential condition (Figure 1). In the simultaneous condition, both symbols that define the initial state were displayed simultaneously on the screen for 3 seconds. In the sequential condition, the symbols were presented one at a time as a temporal sequence: each symbol was displayed by itself for 1 second, with a 1-second delay (blank screen) between them. Two triangles pointing left and right then appeared and the participant was given 2 seconds to decide whether to press the left or the right arrow key; if they did not press any key, the word SLOW was displayed for 1 second, and the trial was aborted and omitted from analysis. A blue or pink rectangle appeared immediately afterward, indicating the final state. The participant then pressed the up-arrow key and, if the final state was rewarded, a green dollar sign appeared on the screen for 2 seconds; otherwise, a black X appeared for 2 seconds. The task comprised 250 trials, with a break every 50 trials, and participants received the total reward they obtained by the end of the task (0.18 CHF per reward).
4.3 Model-free algorithm
The SARSA model-free algorithm with replacing eligibility traces [1, 17] was used to simulate model-free learning agents. For each action a and state s, it estimated the value Q(s, a) of performing that action in that state. The task’s initial states si were AA, AB, BA, and BB, and the actions ai available at the initial states were left and right. The final states were pink and blue, and the only action af available at those states was up. The initial value of Q(s, a) for every state and action was 0.5. In each trial t, the simulated agent at the initial state si chose left as its initial-state action with probability pleft and right with probability 1 − pleft, according to the following equation:

\[
p_{\text{left}} = \frac{\exp\left[\beta Q(s_i, \text{left})\right]}{\exp\left[\beta Q(s_i, \text{left})\right] + \exp\left[\beta Q(s_i, \text{right})\right]},
\]

where β > 0 is an inverse temperature parameter that determines the algorithm’s propensity to choose the option with the highest estimated value. After the final state sf was observed and a reward r ∈ {0, 1} was received, state-action values were updated according to the following equations:

\[
Q(s_i, a_i) \leftarrow Q(s_i, a_i) + \alpha_1\left[Q(s_f, a_f) - Q(s_i, a_i)\right] + \alpha_1 \lambda \left[r - Q(s_f, a_f)\right],
\]
\[
Q(s_f, a_f) \leftarrow Q(s_f, a_f) + \alpha_2\left[r - Q(s_f, a_f)\right],
\]

where 0 ≤ α1, α2, λ ≤ 1 are parameters: α1 is the initial-state learning rate, α2 is the final-state learning rate, and λ is the eligibility trace parameter [1, 17].
In the special case where λ = 1, the update of initial state-action values becomes

\[
Q(s_i, a_i) \leftarrow Q(s_i, a_i) + \alpha_1\left[r - Q(s_i, a_i)\right],
\]

that is, the estimated values of choosing left and right in each initial state are updated independently of the final state’s estimated value. Thus, SARSA (λ = 1) ignores the identity of the final state when making initial-state decisions, and an initial-state action that resulted in a reward will necessarily lead to a higher stay probability when the respective initial state recurs. This is true even if the action will probably lead to the final state with the lowest value.
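The SARSA update rules above can be implemented compactly as follows. This is a minimal sketch for intuition, not the simulation code used in the study; the class and method names are our own.

```python
import math
import random

def softmax_left(q_left, q_right, beta):
    """Probability of choosing "left" under a softmax with inverse
    temperature beta."""
    return 1.0 / (1.0 + math.exp(-beta * (q_left - q_right)))

class SarsaAgent:
    """SARSA(lambda) agent for the two-stage task. With lam = 1 the
    two initial-state update terms collapse into a single update
    toward the received reward."""
    def __init__(self, beta, alpha1, alpha2, lam):
        self.beta, self.a1, self.a2, self.lam = beta, alpha1, alpha2, lam
        self.q = {}  # (state, action) -> value, initialized to 0.5

    def value(self, state, action):
        return self.q.get((state, action), 0.5)

    def choose(self, initial_state):
        p_left = softmax_left(self.value(initial_state, "left"),
                              self.value(initial_state, "right"),
                              self.beta)
        return "left" if random.random() < p_left else "right"

    def update(self, s_i, a_i, s_f, r):
        q_i, q_f = self.value(s_i, a_i), self.value(s_f, "up")
        # Initial-state update toward the final state's value...
        q_i += self.a1 * (q_f - q_i)
        # ...plus the eligibility-trace update toward the reward.
        q_i += self.a1 * self.lam * (r - q_f)
        self.q[(s_i, a_i)] = q_i
        # Final-state update toward the reward.
        self.q[(s_f, "up")] = q_f + self.a2 * (r - q_f)
```

With `lam=1`, a rewarded trial always raises the chosen initial-state action's value, regardless of which final state was visited, reproducing the special case above.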
4.4 Model-based algorithm
In simulations of model-based agents [17], values were assigned to initial-state actions and to final states. The value V of a final state s ∈ {pink, blue} in the first trial t = 1 was V(s, 1) = 0.5. An initial-state choice c ∈ {left, right} in trial t had a value V given by

\[
V(c, t) = \sum_{s \in \{\text{pink}, \text{blue}\}} \Pr(c \to s)\, V(s, t),
\]

where Pr(c → s) is the probability that choosing c will lead to the final state s, which might be 0.8 or 0.2 according to the task’s transition model. The value of an initial-state choice can thus be understood as the expected value of the final state the agent will go to after making that choice. If V(left, t) > V(right, t), the agent was more likely to choose left and vice versa.
In each trial t, the agent’s initial-state action was left with probability pleft and right with probability 1 − pleft, given by

\[
p_{\text{left}} = \frac{\exp\left[\beta V(\text{left}, t)\right]}{\exp\left[\beta V(\text{left}, t)\right] + \exp\left[\beta V(\text{right}, t)\right]},
\]

where β is an inverse temperature parameter. After the agent made its initial-state choice and went to a final state s, that final state’s value was updated according to the following equation:

\[
V(s, t + 1) = V(s, t) + \alpha\left[r(t) - V(s, t)\right],
\]

where r(t) ∈ {0, 1} indicates if the agent received a reward and 0 ≤ α ≤ 1 is a learning-rate parameter of the model. The value of a final state is thus an exponentially weighted moving average of the rewards received in that state.
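The model-based valuation and update rules can be sketched as follows. This is a minimal illustration, not the study's simulation code; the class name and the `common_map` argument, which records the final state each choice commonly (p = 0.8) reaches for the current initial state, are our own constructions.

```python
import math

class ModelBasedAgent:
    """Model-based agent for the two-stage task: values each
    initial-state choice by the expected value of the final state it
    leads to under the known transition model, and tracks final-state
    values as an exponentially weighted moving average of rewards."""
    def __init__(self, beta, alpha, common_map):
        self.beta, self.alpha = beta, alpha
        self.common = common_map  # e.g. {"left": "blue", "right": "pink"}
        self.v = {"pink": 0.5, "blue": 0.5}  # initial final-state values

    def choice_value(self, choice):
        # V(c) = 0.8 * V(common final state) + 0.2 * V(rare final state)
        common = self.common[choice]
        rare = "pink" if common == "blue" else "blue"
        return 0.8 * self.v[common] + 0.2 * self.v[rare]

    def p_left(self):
        # Softmax over the two choice values.
        diff = self.choice_value("left") - self.choice_value("right")
        return 1.0 / (1.0 + math.exp(-self.beta * diff))

    def update(self, final_state, r):
        # Delta-rule update of the visited final state's value.
        self.v[final_state] += self.alpha * (r - self.v[final_state])
```

Because the agent evaluates choices through the final-state values, a reward obtained after any initial state immediately shifts its preferences in every other initial state as well.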
4.5 Data analysis by logistic regression
For each human participant or simulated agent, we calculated the stay probability in pairs of consecutive trials as a function of reward, transition, initial-state choice and visited final state in the first trial [34]. In the second trial of each pair, if the human participant or simulated agent chose an action (left or right) that was the same as that chosen in the previous trial, this was considered a stay. For each trial pair, the second trial’s choice was coded as the random variable y and classified as a stay (y = 1) or not a stay (y = 0). For each condition, trial pairs were divided into four subsets: “same letters” (if the letters presented in the first trial were the same as the letters presented in the second trial; for example, AB for the first trial, AB for the second), “same first letter” (if the first letter presented in the first trial was the same as the first letter presented in the second trial, but the second letter was different; for example, AB for the first trial, AA for the second), “same second letter” (if the second letter presented in the first trial was the same as the second letter presented in the second trial, but the first letter was different; for example, AB for the first trial, BB for the second), and “different letters” (if both letters presented in the first trial were different from the letters presented in the second trial; for example, AB for the first trial, BA for the second). For each trial pair subset, a separate analysis was performed.
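The trial-pair coding above can be translated directly into code. A minimal sketch, with pair_subset and is_stay as hypothetical helper names and trials represented as two-letter strings:

```python
def pair_subset(first, second):
    """Classify a pair of consecutive trials by the letters presented,
    e.g. pair_subset("AB", "AA") -> "same first letter"."""
    if first == second:
        return "same letters"
    if first[0] == second[0]:
        return "same first letter"
    if first[1] == second[1]:
        return "same second letter"
    return "different letters"

def is_stay(previous_choice, choice):
    """y = 1 if the second trial repeats the first trial's action."""
    return 1 if choice == previous_choice else 0
```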
We then analyzed the resulting data using a hierarchical logistic regression model whose parameters were estimated through Bayesian computational methods. The dependent variable was pstay, the stay probability for a given trial, and the independent variables were xr, which indicated whether a reward was received or not in the previous trial (+1 if the previous trial was rewarded, −1 otherwise), xt, which indicated whether the transition in the previous trial was common or rare (+1 if it was common, −1 if it was rare), the interaction between the two, xc, which indicated whether or not in the previous trial the participant made the initial-state choice with the highest reward probability (+1 if the choice had the highest reward probability, −1 otherwise), and xf, which indicated whether in the previous trial the participant visited the final state with the highest reward probability (+1 if the final state had the highest reward probability, −1 otherwise). Thus, for each condition, we determined an intercept for each participant and five fixed coefficients, as shown in the following equation:

logit(pstay) = β0,i + βr xr + βt xt + βrt xr xt + βc xc + βf xf,

where β0,i is the intercept for the ith participant.
The distribution of y was Bernoulli(pstay). The coefficient vectors were given one distribution if the participant was in the simultaneous condition and another if the participant was in the sequential condition; in other words, the subset means for each condition were allowed to vary independently. The parameters of these distributions were given vague prior distributions based on preliminary analyses: the components of the location vectors were given vague priors, and the components of the scale vector were given a Half-normal(0, 25) prior. Other vague prior distributions for the model parameters were tested and the results did not change significantly.
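Under the standard logit link, the model maps the coded predictors to a stay probability. A minimal sketch (the function and argument names are ours):

```python
import math

def stay_probability(intercept, coefs, x_r, x_t, x_c, x_f):
    """Stay probability from the +1/-1 coded predictors.
    coefs = (b_r, b_t, b_rt, b_c, b_f): reward, transition,
    reward-by-transition interaction, choice, and final state."""
    b_r, b_t, b_rt, b_c, b_f = coefs
    logit = (intercept + b_r * x_r + b_t * x_t + b_rt * x_r * x_t
             + b_c * x_c + b_f * x_f)
    return 1.0 / (1.0 + math.exp(-logit))
```

A positive reward coefficient b_r, for instance, raises the stay probability after rewarded trials regardless of the other predictors.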
To obtain parameter estimates from the model’s posterior distribution, we coded the model into the Stan modeling language [41, 42] and used the PyStan Python package [43] to obtain 80,000 samples of the joint posterior distribution from four chains of length 40,000 (warmup 20,000). Convergence of the chains was indicated by the potential scale reduction statistic R̂ being approximately 1 for all parameters.
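The convergence diagnostic Stan reports is the potential scale reduction statistic R̂. As a rough illustration, the classic (non-split) Gelman-Rubin version can be computed per parameter from several equal-length chains; Stan's actual split-R̂ differs in detail.

```python
import statistics

def gelman_rubin(chains):
    """Classic (non-split) R-hat for one parameter, given a list of
    equal-length sample chains. Values near 1 indicate convergence."""
    n = len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    w = statistics.fmean(statistics.variance(c) for c in chains)  # within-chain
    b = n * statistics.variance(means)                            # between-chain
    var_hat = (n - 1) / n * w + b / n  # pooled posterior variance estimate
    return (var_hat / w) ** 0.5
```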
4.6 Fitting of the algorithms to experimental data
For comparison with the participant data, we fitted the SARSA model-free algorithm and the model-based algorithm to the experimental data and generated replicated data using the fitted parameters. The parameters were obtained by fitting both algorithms to all participants. To that end, we used a Bayesian hierarchical model, which allowed us to pool data from all participants to improve individual parameter estimates.
The parameters of the model-based algorithm for the ith participant were αi and βi. αi was given a Beta(aα, bα) prior distribution, and βi a prior distribution with hyperparameters μβ and σβ. The hyperparameters aα and bα were themselves given a noninformative Half-normal(0, 10⁴) prior, and the hyperparameters μβ and σβ were given noninformative and Half-normal(0, 10⁴) priors respectively. The parameters of the model-free algorithm for the ith participant were α1,i, α2,i, λi, and βi. They were given Beta(aα1, bα1), Beta(aα2, bα2), and Beta(aλ, bλ) prior distributions respectively, with βi again given a prior with hyperparameters μβ and σβ. The hyperparameters aα1, aα2, aλ, bα1, bα2, and bλ were themselves given a noninformative Half-normal(0, 10⁴) prior, and the hyperparameters μβ and σβ were given noninformative and Half-normal(0, 10⁴) priors respectively. We then coded the models into the Stan modeling language [41, 42] and used the PyStan Python package [43] to obtain 40,000 samples of the joint posterior distribution from one chain of length 80,000 (warmup 40,000). Convergence was indicated by the potential scale reduction statistic R̂ being approximately 1 for all parameters.
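Leaving the hierarchical priors aside, the per-participant quantity being fitted is the likelihood of the observed choices under each algorithm. A sketch for the model-based algorithm, assuming an illustrative mapping of choices to common transitions and trials given as (choice, final state, reward) tuples:

```python
import math

def mb_log_likelihood(trials, alpha, beta):
    """Log-likelihood of one participant's choices under the model-based
    algorithm; trials is a list of (choice, final_state, reward) tuples."""
    # Illustrative transition model; only the 0.8/0.2 split is given.
    trans = {("left", "pink"): 0.8, ("left", "blue"): 0.2,
             ("right", "pink"): 0.2, ("right", "blue"): 0.8}
    v = {"pink": 0.5, "blue": 0.5}  # initial final-state values
    loglik = 0.0
    for choice, final, reward in trials:
        # Choice values are expectations over the transition model.
        vals = {c: sum(trans[(c, s)] * v[s] for s in v)
                for c in ("left", "right")}
        # Log of the softmax probability of the observed choice.
        den = sum(math.exp(beta * vals[c]) for c in ("left", "right"))
        loglik += beta * vals[choice] - math.log(den)
        v[final] += alpha * (reward - v[final])  # moving-average update
    return loglik
```

In the hierarchical model, Stan evaluates this likelihood jointly for all participants while the per-participant parameters are tied together by the group-level priors.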
4.7 Code and data availability
All the behavioral data used in this study are available at https://github.com/carolfs/mf_wm