Abstract
To make decisions, organisms often accumulate information across multiple timescales. However, most experimental and modeling studies of decision-making focus on sequences of independent trials. On the other hand, natural environments are characterized by long temporal correlations, and evidence used to make a present choice is often relevant to future decisions. To understand decision-making under these conditions, we analyze how a model ideal observer accumulates evidence to freely make choices across a sequence of correlated trials. We use principles of probabilistic inference to show that an ideal observer incorporates information obtained on one trial as an initial bias on the next. This bias decreases the time, but not the accuracy, of the next decision. Furthermore, in finite sequences of trials the rate of reward is maximized when the observer deliberates longer for early decisions, but responds more quickly towards the end of the sequence. Our model also explains experimentally observed patterns in decision times and choices, thus providing a mathematically principled foundation for evidence-accumulation models of sequential decisions.
1. Introduction
Organismal behavior is often driven by decisions that are the result of evidence accumulated to determine the best among available options (Gold and Shadlen, 2007; Brody and Hanks, 2016). For instance, honeybee swarms use a democratic process in which each bee’s opinion is communicated to the group to decide which nectar source to forage (Seeley et al., 1991). Competitive animals evaluate their opponents’ attributes to decide whether to fight or flee (Stevenson and Rillich, 2012), and humans decide which stocks to buy or sell, based on individual research and social information (Moat et al., 2013). Importantly, the observations of these agents are frequently uncertain (Hsu et al., 2005; Brunton et al., 2013) so accurate decision-making requires robust evidence integration that accounts for the reliability and variety of evidence sources (Raposo et al., 2012).
The two alternative forced choice (TAFC) task paradigm has been successful in probing the behavioral trends and neural mechanisms underlying decision-making (Ratcliff, 1978). In a TAFC task subjects decide which one of two hypotheses is more likely based on noisy evidence (Gold and Shadlen, 2002; Bogacz et al., 2006). For instance, in the random dot motion discrimination task, subjects decide whether a cloud of noisy dots predominantly moves in one of two directions. Such stimuli evoke strong responses in primate motion-detecting areas, motivating their use in the experimental study of neural mechanisms underlying decision-making (Shadlen and Newsome, 2001; Gold and Shadlen, 2007). The response trends and underlying neural activity are well described by the drift-diffusion model (DDM), which associates a subject’s belief with a particle drifting and diffusing between two boundaries, with decisions determined by the first boundary the particle encounters (Stone, 1960; Bogacz et al., 2006; Ratcliff and McKoon, 2008).
The DDM is popular because (a) it can be derived as the continuum limit of the statistically-optimal sequential probability ratio test (Wald and Wolfowitz, 1948; Bogacz et al., 2006); (b) it is an analytically tractable Wiener diffusion process whose summary statistics can be computed explicitly (Ratcliff and Smith, 2004; Bogacz et al., 2006); and (c) it can be fit remarkably well to behavioral responses and neural activity in TAFC tasks with independent trials (Gold and Shadlen, 2002, 2007) (although see Latimer et al. (2015)).
However, the classical DDM does not describe many aspects of decision–making in natural environments. For instance, the DDM is typically used to model a series of independent trials where evidence accumulated during one trial is not informative about the correct choice on other trials (Ratcliff and McKoon, 2008). Organisms in nature often make a sequence of related decisions based on overlapping evidence (Chittka et al., 2009). Consider an animal deciding which way to turn while fleeing a pursuing predator: To maximize its chances of escape its decisions should depend on both its own and the predator’s earlier movements (Corcoran and Conner, 2016). Animals foraging over multiple days are biased towards food sites with consistently high yields (Gordon, 1991). Thus even in a variable environment, organisms use previously gathered information to make future decisions (Dehaene and Sigman, 2012). We need to extend previous experimental designs and corresponding models to understand if and how they do so.
Even in a sequence of independent trials, previous choices influence subsequent decisions (Fernberger, 1920). Such serial response dependencies have been observed in TAFC tasks which do (Cho et al., 2002; Fründ et al., 2014) and do not (Bertelson, 1961) require accumulation of evidence across trials. For instance, a subject’s response time may decrease when the current state is the same as the previous state (Pashler and Baylis, 1991). Thus, trends in response time and accuracy suggest subjects use trial history to predict the current state of the environment, albeit suboptimally (Kirby, 1976).
Goldfarb et al. (2012) examined responses in a series of dependent trials with the correct choice across trials evolving according to a two-state Markov process. The transition probabilities affected the response time and accuracy of subjects in ways well described by a DDM with biased initial conditions and thresholds. For instance, when repeated states were more likely, response times decreased in the second of two repeated trials. History-dependent biases also increase the probability of repeat responses when subjects view sequences with repetition probabilities above chance (Abrahamyan et al., 2016; Braun et al., 2018). These results suggest that an adaptive DDM with an initial condition biased towards the previous decision is a good model of human decision-making across correlated trials (Goldfarb et al., 2012).
Most earlier models proposed to explain these observations are intuitive and recapitulate behavioral data, but are at least partly heuristic (Goldfarb et al., 2012; Fründ et al., 2014). Yu and Cohen (2008) have proposed a normative model that assumes the subjects are learning non-stationary transition rates in a dependent sequence of trials. However, they did not examine the normative model for known transition rates. Such a model provides a standard for experimental subject performance, and a basis for approximate models, allowing us to better understand the heuristics subjects use to make decisions (Ratcliff and McKoon, 2008; Brunton et al., 2013).
Here, we extend previous DDMs to provide a normative model of evidence accumulation in serial trials evolving according to a two-state Markov process whose transition rate is known to the observer. We use sequential analysis to derive the posterior for the environmental states H± given a stream of noisy observations (Bogacz et al., 2006; Veliz-Cuba et al., 2016). Under this model, ideal observers incorporate information from previous trials to bias their initial belief on subsequent trials. This decreases the average time to the next decision, but not necessarily its accuracy. Furthermore, in finite sequences of trials the rate of reward is maximized when the observer deliberates longer for early decisions, but responds more quickly towards the end of the sequence. We also show that the model agrees with experimentally observed trends in decisions.
2. A model of sequential decisions
We model a repeated TAFC task with the environmental state in each trial (H+ or H−) chosen according to a two-state Markov process: In a sequence of n trials, the correct choices (environmental states or hypotheses) H1:n = (H1, H2, …, Hn) are generated so that P(H1 = H±) = 1/2 and P(Hi = H∓|Hi−1 = H±) = ϵ for i = 2, …, n (See Fig. 1). When ϵ = 0.5 the states of subsequent trials are independent, and ideally the decision on one trial should not bias future decisions (Ratcliff, 1978; Shadlen and Newsome, 2001; Gold and Shadlen, 2007, 2002; Bogacz et al., 2006; Ratcliff and McKoon, 2008).
When 0 ≤ ϵ < 0.5, repetitions are more likely and trials are dependent, and evidence obtained during one trial can inform decisions on the next. We will show that ideal observers use their decision on the previous trial to adjust their prior over the states at the beginning of the following trial.
Importantly, we assume that all rewards are given at the end of the trial sequence (Braun et al., 2018). If instead the correct choice were rewarded after each trial, the reward would provide unambiguous information about the state Hi, superseding the noisy information gathered over that trial. Our results can be easily extended to this case, as well as to cases when rewards are only given with some probability.
Our main contribution is to derive and analyze the ideal observer model for this sequential version of the TAFC task. We use principles of probabilistic inference to derive the sequence of DDMs, and optimization to determine the decision thresholds that maximize the reward rate (RR). Due to the tractability of the DDM, the correct probability and decision times that constitute the RR can be computed analytically. We also demonstrate that response biases, such as repetitions, commonly observed in sequential decision tasks, follow naturally from this model.
3. Optimizing decisions in a sequence of uncorrelated trials
We first derive in Appendix A and summarize here the optimal evidence-accumulation model for a sequence of independent TAFC trials (P(Hi = Hi−1) = 0.5). Although these results can be found in previous work (Bogacz et al., 2006), they are crucial for the subsequent discussion, and we thus provide them for completeness. A reader familiar with these classical results can skip to Section 4, and refer back to results in this section as needed.
The drift-diffusion model (DDM) can be derived as the continuum limit of a recursive equation for the loglikelihood ratio (LLR) (Wald and Wolfowitz, 1948; Gold and Shadlen, 2002; Bogacz et al., 2006). When the environmental states are uncorrelated and unbiased, an ideal observer has no bias at the start of each trial.
3.1. Drift-diffusion model for a single trial
We assume that on a trial the observer integrates a stream of noisy measurements of the true state, H1. If these measurements are conditionally independent, the functional central limit theorem yields the DDM for the scaled LLR, y1(t) = D · LLR1(t), after observation time t,

dy1 = g1 dt + √(2D) dW.     (1)

Here W is a Wiener process, the drift g1 ∈ {g+, g−} depends on the environmental state, and D is the variance, which depends on the noisiness of each observation.
For simplicity and consistency with typical random dot kinetogram tasks (Schall, 2001; Gold and Shadlen, 2007), we assume each drift direction is of equal unit strength: g+ = −g− = 1, and task difficulty is controlled by scaling the variance through D. The initial condition is determined by the observer’s prior bias, y1(0) = D ln[P(H1 = H+)/P(H1 = H−)]. Thus for an unbiased observer, y1(0) = 0.
There are two primary ways of obtaining a response from the DDM given by Eq. (1) that mirror common experimental protocols (Shadlen and Newsome, 2001; Gold and Shadlen, 2002, 2007): An ideal observer interrogated at a set time, t = T, responds with sign(y1(T)) = ±1 indicating the more likely of the two states, H1 = H±, given the accumulated evidence. On the other hand, an observer free to choose their response time can trade speed for accuracy in making a decision. This is typically modeled in the DDM by defining a decision threshold, θ1, and assuming that at the first time, T1, at which |y1(T1)| ≥ θ1, the evidence accumulation process terminates, and the observer chooses H± if sign(y1(T1)) = ±1.
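The free-response rule above is straightforward to simulate. Below is a minimal Euler–Maruyama sketch in Python (our own illustration, not code from the paper; function and parameter names are ours), which draws increments of dy = g dt + √(2D) dW until a threshold is crossed:

```python
import numpy as np

def simulate_trial(g, theta, D=1.0, y0=0.0, dt=1e-3, t_max=100.0, rng=None):
    """Euler-Maruyama simulation of dy = g dt + sqrt(2D) dW until |y| >= theta.

    Returns (decision, response_time); decision = +1 or -1 is the boundary
    crossed first (t_max is only a safety cap, essentially never reached).
    """
    rng = np.random.default_rng() if rng is None else rng
    y, t = y0, 0.0
    step_sd = np.sqrt(2.0 * D * dt)  # standard deviation of one increment
    while abs(y) < theta and t < t_max:
        y += g * dt + step_sd * rng.standard_normal()
        t += dt
    return (1 if y > 0 else -1), t
```

Averaging the returned decisions over many runs should approach the accuracy 1/(1 + e^{−θ/D}) quoted later in this section.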
The probability, c1, of making a correct choice in the free response paradigm can be obtained using the Fokker-Planck (FP) equation corresponding to Eq. (1) (Gardiner, 2009). Given the initial condition y1(0), threshold θ1, and state H1 = H+, the probability of an exit through the boundary +θ1 is

πθ1(y1(0)) = (1 − e^{−(y1(0)+θ1)/D}) / (1 − e^{−2θ1/D}),     (2)

simplifying at y1(0) = 0 to

πθ1(0) = 1/(1 + e^{−θ1/D}).     (3)

An exit through the threshold θ1 results in a correct choice of H1 = H+, so c1 = πθ1(0) = 1/(1 + e^{−θ1/D}). The correct probability c1 increases with θ1, since more evidence is required to reach a larger threshold. Defining the decision in trial 1 as d1 = ±1 if y1(T1) = ±θ1, Bayes’ rule implies

P(H1 = H±|d1 = ±1) = c1,     (4)

since P(H1 = H±) = P(d1 = ±1) = 1/2. Rearranging the expressions in Eq. (4) and isolating θ1 relates the threshold θ1 to the LLR given a decision d1 = ±1:

θ1 = D ln[c1/(1 − c1)].     (5)
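The exit probability and the accuracy–threshold relation above reduce to two one-line functions. This is a sketch under the expressions of this section (drift +1, noise variance 2D); function names are ours:

```python
import numpy as np

def exit_prob_correct(y0, theta, D):
    """pi_theta(y0): probability of first exit through +theta (the correct
    boundary when H = H+) for dy = dt + sqrt(2D) dW started at y0."""
    return (1.0 - np.exp(-(y0 + theta) / D)) / (1.0 - np.exp(-2.0 * theta / D))

def threshold_from_accuracy(c1, D):
    """Invert c1 = 1/(1 + exp(-theta/D)): theta = D * ln(c1 / (1 - c1))."""
    return D * np.log(c1 / (1.0 - c1))
```

The two functions are mutual inverses at y0 = 0, and the exit probability grows both with the threshold and with a starting bias toward the correct boundary.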
3.2. Tuning performance via speed-accuracy tradeoff
Increasing θ1 increases the probability of a correct decision, and the average time to make a decision, DT1. Humans and other animals balance speed and accuracy to maximize the rate of correct decisions (Chittka et al., 2009; Bogacz et al., 2010). This is typically quantified using the reward rate (RR) (Gold and Shadlen, 2002):

RR1 = c1 / (DT1 + TD),     (6)

where c1 is the probability of a correct decision, DT1 is the mean time required for y1(t) to reach either threshold ±θ1, and TD is the prescribed time delay to the start of the next trial (Gold and Shadlen, 2002; Bogacz et al., 2006). Eq. (6) increases with c1 (accuracy) and with inverse time 1/(DT1 + TD) (speed). This usually leads to a nonmonotonic dependence of RR1 on the threshold, θ1, since increasing θ1 increases accuracy, but decreases speed (Gold and Shadlen, 2002; Bogacz et al., 2006).
The average response time, DT1, can be obtained as the solution of a mean exit time problem for the Fokker–Planck (FP) equation for p(y1, t) with absorbing boundaries, p(±θ1, t) = 0 (Bogacz et al., 2006; Gardiner, 2009). Since the RR is determined by the average decision time over all trials, we compute the unconditional mean exit time, which for y1(0) = 0 simplifies to

DT1 = θ1 tanh(θ1/(2D)).     (8)
Plugging this expression into the RR as defined in Eq. (6), assuming y1(0) = 0, we have

RR1(θ1) = [1/(1 + e^{−θ1/D})] / [θ1 tanh(θ1/(2D)) + TD].
We can identify the maximum of RR1(θ1) > 0 by finding the minimum of its reciprocal, 1/RR1(θ1) (Bogacz et al., 2006), which yields the optimal threshold

θ1^opt = D + TD − D · W(e^{(TD+D)/D}).     (9)

Here W(y) is the Lambert W function (the inverse of y ↦ ye^y). In the limit TD → 0, we have W(e) = 1 so θ1^opt → 0, and Eq. (9) defines nontrivial optimal thresholds at TD > 0.
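Since Eq. (9) involves the Lambert W function, it can be cross-checked numerically against a grid search over RR1(θ1). The sketch below assumes the expressions for c1 and DT1 above and implements W by Newton's method (all names are ours):

```python
import numpy as np

def rr1(theta, D=1.0, TD=2.0):
    """Single-trial reward rate RR1 = c1 / (DT1 + TD)."""
    c1 = 1.0 / (1.0 + np.exp(-theta / D))
    DT1 = theta * np.tanh(theta / (2.0 * D))
    return c1 / (DT1 + TD)

def lambert_w(a, iters=50):
    """Solve w * exp(w) = a for a > 0 by Newton's method."""
    w = np.log(1.0 + a)  # reasonable starting guess for a > 0
    for _ in range(iters):
        ew = np.exp(w)
        w -= (w * ew - a) / (ew * (1.0 + w))
    return w

def theta_opt(D=1.0, TD=2.0):
    """Closed-form optimum obtained by setting d(1/RR1)/dtheta = 0."""
    return D + TD - D * lambert_w(np.exp((TD + D) / D))
```

A fine grid over θ recovers the same maximizer, and the optimum shrinks toward zero as the inter-trial delay TD vanishes, as stated in the text.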
Having established a policy for optimizing performance in a sequence of independent TAFC trials, in the next section we consider dependent trials. We can again explicitly compute the RR function for sequential TAFC trials with states, H1:n = (H1, H2, H3, …, Hn), evolving according to a two-state Markov process with parameter ϵ := P(Hn+1 ≠ Hn). Unlike above, where ϵ = 0.5, we will see that when ϵ ∈ [0, 0.5), an ideal observer starts each trial Hn for n ≥ 2 by using information obtained over the previous trial to bias their initial belief, yn(0) ≠ 0.
4. Integrating information across two correlated trials
We first focus on the case of two sequential TAFC trials. Both states are equally likely at the first trial, P(H1 = H±) = 1/2, but the state at the second trial can depend on the state of the first, ϵ := P(H2 = H∓|H1 = H±). On each trial an observer makes a sequence of noisy observations, each with conditional density f±(ξ) if Hn = H±, to infer the state Hn. The observer also uses information from trial 1 to infer the state at trial 2. All information about H1 can be obtained from the decision variable, d1 = ±1, and ideal observers use the decision variable to set their initial belief, y2(0), at the start of trial 2. We later show that the results for two trials can be extended to trial sequences of arbitrary length with states generated according to the same two-state Markov process.
4.1. Optimal observer model for two correlated trials
The first trial is equivalent to the single trial case discussed in Section 3.1. Assuming each choice is equally likely, P(H1 = H±) = 1/2, no prior information exists. Therefore, the decision variable, y1(t), satisfies the DDM given by Eq. (1) with y1(0) = 0. This generates a decision d1 ∈ ±1 if y1(T1) = ±θ1, so that the probability of a correct decision c1 is related to the threshold θ1 by Eq. (4). Furthermore, since ϵ = P(H2 = H∓|H1 = H±), it follows that 1 − ϵ = P(H2 = H±|H1 = H±), which means

P(H2 = H+|d1 = +1) = (1 − ϵ)c1 + ϵ(1 − c1),     (10)

and, similarly,

P(H2 = H−|d1 = +1) = (1 − ϵ)(1 − c1) + ϵc1.     (11)
As the decision, d1 = ±1, determines the probability of each state at the end of trial 1, individual observations made during trial 1 are not needed to define the belief of the ideal observer at the outset of trial 2. The ideal observer uses the sequence of observations in trial 2, ξ1:t, and their decision on the previous trial, d1, to arrive at the probability ratio

P(H2 = H+|ξ1:t, d1) / P(H2 = H−|ξ1:t, d1).

Taking the logarithm, and applying conditional independence of the measurements ξ1:t, we have

LLR2(t) = ln[P(H2 = H+|d1)/P(H2 = H−|d1)] + Σ_{s=1}^{t} ln[f+(ξs)/f−(ξs)],

indicating that the belief carried over from trial 1 enters as an additive offset to the evidence accumulated during trial 2. Taking the temporal continuum limit as in Appendix A, we find

dy2 = g2 dt + √(2D) dW,     (13)

with the Wiener process, W, and variance defined as in Eq. (1). The drift g2 ∈ ±1 is determined by H2 = H±.
Furthermore, the initial belief is biased in the direction of the previous decision, as the observer knows that states are correlated across trials. The initial condition for Eq. (13) is therefore

y2(0) = d1 · D ln[((1 − ϵ)c1 + ϵ(1 − c1)) / ((1 − ϵ)(1 − c1) + ϵc1)],     (14)

where we have used Eqs. (4), (10), and (11). Note that y2(0) → 0 in the limit ϵ → 0.5, so no information is carried forward when states are uncorrelated across trials. In the limit ϵ → 0, we find y2(0) → d1θ1, so the ending value of the decision variable y1(T1) = ±θ1 in trial 1 is carried forward to trial 2, since there is no change in environmental state from trial 1 to 2. For ϵ ∈ (0, 1/2), the information gathered on the previous trial provides partial information about the next state, and we have 0 < |y2(0)| < θ1.
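The initial bias in Eq. (14) and its two limits can be checked directly; a short sketch (function name ours, d1 = +1 assumed):

```python
import numpy as np

def initial_bias(theta1, D, eps):
    """Magnitude of y2(0) from Eq. (14) for d1 = +1, with
    c1 = 1/(1 + exp(-theta1/D))."""
    c1 = 1.0 / (1.0 + np.exp(-theta1 / D))
    num = (1.0 - eps) * c1 + eps * (1.0 - c1)
    den = (1.0 - eps) * (1.0 - c1) + eps * c1
    return D * np.log(num / den)
```

The bias vanishes at ϵ = 0.5, approaches the full threshold θ1 as ϵ → 0, and lies strictly between these extremes for intermediate ϵ.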
We assume that the decision variable, y2(t), evolves according to Eq. (13) until it reaches a threshold ±θ2, at which point the decision d2 = ±1 is registered. The full model is thus specified by the DDM in Eq. (13), the initial condition in Eq. (14), and this thresholding rule.
The above analysis is easily extended to the case of arbitrary n ≥ 2, but before we do so, we analyze the impact of state correlations on the reward rate (RR) and identify the decision threshold values θ1:2 that maximize the RR.
As noted earlier, we assume the total reward is given at the end of the experiment, and not after each trial (Braun et al., 2018). A reward for a correct choice would provide the observer with complete information about the state Hn, so that the subsequent trial would begin with the fully biased initial belief yn+1(0) = ±D ln[(1 − ϵ)/ϵ], with the sign determined by Hn = H±.
A similar result holds in the case that the reward is only given with some probability.
4.2. Performance for constant decision thresholds
We next show how varying the decision thresholds impacts the performance of the observer. For simplicity, we first assume the same threshold is used in both trials, θ1:2 = θ. We quantify performance using a reward rate (RR) function across both trials:

RR1:2 = (c1 + c2) / (DT1 + DT2 + 2TD),     (16)

where cj is the probability of a correct response on trial j = 1, 2, DTj is the mean decision time in trial j, and TD is the time delay after each trial. The expected time until the reward is thus DT1 + DT2 + 2TD. The reward rate will increase as the correct probability in each trial increases, and as decision time decreases. However, the decision time in trial 2 will depend on the outcome of trial 1.
Similar to our finding for trial 1, the decision variable y2(t) in trial 2 determines LLR2(t), since y2 = D · LLR2. Moreover, the probabilities of a correct response on each trial, c1 and c2, are equal, and determined by the threshold, θ = θ1:2, and noise amplitude, D (See Appendix B).
The average time until a decision in trial 2, DT2, is given by Eq. (17), derived in Appendix C.
Notice, in the limit ϵ → 0.5, Eq. (17) reduces to Eq. (8) for θ1 = θ, as expected. Furthermore, in the limit ϵ → 0, DT2 → 0, since |y2(0)| → θ and decisions on trial 2 are made immediately in an unchanging environment.
We can now use the fact that c1 = c2 to combine Eqs. (4) and (17) with Eq. (16) to find the reward rate RR1:2(θ; ϵ), which can be maximized using numerical optimization.
Our analysis shows that the correct probability in both trials, c1:2, increases with θ and does not depend on the transition probability ϵ (Fig. 2A). In addition, the decision time in trial 2, DT2, increases with θ and ϵ (Fig. 2B). At smaller transition rates, ϵ, more information is carried forward to the second trial, so the decision variable is, on average, closer to the threshold θ. This shortens the average decision time, and increases the RR (Fig. 2C, D). Lastly, note that the threshold value, θ, that maximizes the RR is higher for lower values of ϵ, as the observer can afford to require more evidence for a decision when starting trial 2 with more information.
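These trends can be reproduced by direct simulation. The Monte Carlo sketch below (our own check, with illustrative parameters θ = 1, D = 1) pairs correlated trials, starts trial 2 at the biased initial condition of Eq. (14), and confirms that the bias shortens DT2 while leaving accuracy essentially unchanged:

```python
import numpy as np

def run_trial(g, theta, D, y0, dt, rng):
    """Simulate dy = g dt + sqrt(2D) dW from y0 until |y| >= theta."""
    y, t = y0, 0.0
    sd = np.sqrt(2.0 * D * dt)
    while abs(y) < theta and t < 50.0:  # 50.0 is a safety cap, never binding
        y += g * dt + sd * rng.standard_normal()
        t += dt
    return (1 if y > 0 else -1), t

def two_trial_stats(eps, theta=1.0, D=1.0, n_pairs=400, dt=2e-3, seed=0):
    """Accuracy and mean decision time on trials 1 and 2 of correlated pairs;
    the observer starts trial 2 at y2(0) = d1 * bias, per Eq. (14)."""
    rng = np.random.default_rng(seed)
    c1 = 1.0 / (1.0 + np.exp(-theta / D))
    bias = D * np.log(((1 - eps) * c1 + eps * (1 - c1)) /
                      ((1 - eps) * (1 - c1) + eps * c1))
    acc, dts = np.zeros(2), np.zeros(2)
    for _ in range(n_pairs):
        H1 = rng.choice([-1, 1])
        H2 = H1 if rng.random() > eps else -H1
        d1, t1 = run_trial(H1, theta, D, 0.0, dt, rng)
        d2, t2 = run_trial(H2, theta, D, d1 * bias, dt, rng)
        acc += [d1 == H1, d2 == H2]
        dts += [t1, t2]
    return acc / n_pairs, dts / n_pairs
```

At ϵ = 0.1 the simulated trial-2 decision time drops well below the trial-1 time, while the two accuracies agree to within sampling error.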
4.3. Performance for dynamic decision thresholds
We next ask whether the RR given by Eq. (16) can be increased by allowing unequal decision thresholds between the trials, θ1 ≠ θ2. As before, the probability of a correct response, c1, and mean decision time, DT1, in trial 1 are given by Eqs. (3) and (8). The correct probability c2 in trial 2 is again determined by the decision threshold θ2 and noise variance D, as long as |y2(0)| < θ2. However, when |y2(0)| ≥ θ2, the response in trial 2 is instantaneous (DT2 = 0) and c2 = (1 − ϵ)c1 + ϵ(1 − c1). Therefore, the probability of a correct response on trial 2 is defined piecewise:

c2 = 1/(1 + e^{−θ2/D}) if |y2(0)| < θ2,  and  c2 = (1 − ϵ)c1 + ϵ(1 − c1) if |y2(0)| ≥ θ2.     (18)
We will show in the next section that this result extends to an arbitrary number of trials, with an arbitrary sequence of thresholds, θ1:n.
In the case |y2(0)| < θ2, the average time until a decision in trial 2, DT2, is computed in Appendix C, which allows us to compute the reward rate function RR1:2(θ1:2; ϵ). This reward rate is convex as a function of the thresholds, θ1:2, for all examples we examined.
The pair of thresholds, (θ1^opt, θ2^opt), that maximizes the RR satisfies θ1^opt ≥ θ2^opt, so that the optimal observer typically decreases their decision threshold from trial 1 to trial 2 (Fig. 3A,B). This shortens the mean time to a decision in trial 2, since the initial belief is closer to the decision threshold more likely to be correct. As the change probability ϵ decreases from 0.5, initially θ2^opt decreases and θ1^opt increases (Fig. 3C), and eventually θ2^opt collides with the boundary θ2 = |y2(0)|. For smaller values of ϵ, RR is thus maximized when the observer accumulates information on trial 1, and then makes the same decision instantaneously on trial 2, so that c2 = (1 − ϵ)c1 + ϵ(1 − c1) and DT2 = 0. As in the case of a fixed threshold, the maximal reward rate increases as ϵ is decreased (Fig. 3D), since the observer starts trial 2 with more information. Surprisingly, despite the different strategies, the gain in the maximal RR over constant thresholds is modest, and is largest at low values of ϵ.
In the results displayed for our RR maximization analysis (Fig. 2 and 3), we have fixed TD = 2 and D = 1. We have also examined these trends in the RR at lower and higher values of TD and D (plots not shown), and the results are qualitatively similar. The impact of dynamic thresholds θ1:2 on the RR are slightly more pronounced when TD is small or when D is large.
5. Multiple correlated trials
To understand optimal decisions across multiple correlated trials, we again assume that the states evolve according to a two-state Markov chain with ϵ := P(Hj+1 = H∓|Hj = H±). Information gathered on one trial again determines the initial belief on the next. We first discuss optimizing the RR for a constant threshold for a fixed number n of trials known to the observer. When decision thresholds are allowed to vary between trials, optimal thresholds typically decrease across trials, θ1^opt ≥ θ2^opt ≥ ⋯ ≥ θn^opt.
5.1. Optimal observer model for multiple trials
As in the case of two trials, we must consider the caveats associated with instantaneous decisions in order to define initial conditions yj(0), probabilities of correct responses cj, and mean decision times DTj. The same argument leading to Eq. (13) shows that the decision variable evolves according to a linear drift diffusion process on each trial. The probability of a correct choice is again defined by the threshold when |yj(0)| < θj, and cj = (1 − ϵ)cj−1 + ϵ(1 − cj−1) when |yj(0)| ≥ θj, as in Eq. (18). Therefore, cj may be defined iteratively for j ≥ 2 as

cj = 1/(1 + e^{−θj/D}) if |yj(0)| < θj,  and  cj = (1 − ϵ)cj−1 + ϵ(1 − cj−1) otherwise,     (21)

with c1 = 1/(1 + e^{−θ1/D}) since y1(0) = 0.
The probability cj quantifies the belief at the end of trial j. Hence, as in Eq. (14), the initial belief of an ideal observer on trial j + 1 has the form

yj+1(0) = dj · D ln[((1 − ϵ)cj + ϵ(1 − cj)) / ((1 − ϵ)(1 − cj) + ϵcj)],     (22)

and the decision variable yj(t) within trials obeys a DDM with threshold θj as in Eq. (13). A decision dj = ±1 is then registered when yj(Tj) = ±θj, and if |yj(0)| ≥ θj, then dj = dj−1 and Tj = 0.
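The iteration for cj, including the instantaneous-decision branch, is compact in code. A sketch under the expressions above (names ours):

```python
import numpy as np

def correct_probs(thetas, D, eps):
    """Iterate Eq. (21): on each trial the accuracy is set by the threshold,
    unless the starting bias already exceeds it, in which case the previous
    decision is repeated instantly."""
    cs = []
    for j, th in enumerate(thetas):
        if j == 0:
            c = 1.0 / (1.0 + np.exp(-th / D))
        else:
            cp = cs[-1]
            y0 = D * np.log(((1 - eps) * cp + eps * (1 - cp)) /
                            ((1 - eps) * (1 - cp) + eps * cp))
            if abs(y0) < th:
                c = 1.0 / (1.0 + np.exp(-th / D))
            else:  # instantaneous repeat of the previous decision
                c = (1 - eps) * cp + eps * (1 - cp)
        cs.append(c)
    return cs
```

For a constant threshold the accuracy is constant across trials, while a sharply lowered later threshold triggers the instantaneous branch.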
5.2. Performance for constant decision thresholds
With n trials, the RR is the sum of correct probabilities divided by the total decision and delay time:

RR1:n = (Σ_{j=1}^{n} cj) / (Σ_{j=1}^{n} DTj + nTD).     (23)
The probability of a correct choice, cj = 1/(1 + e^{−θ/D}), is constant across trials, and determined by the fixed threshold θ. It follows that the initial bias is also constant, yj(0) = y2(0), for all j ≥ 2. Eq. (22) then implies that |yj(0)| < θ for ϵ > 0 and |yj(0)| = θ if ϵ = 0.
As yj(0) is constant on the 2nd trial and beyond, the analysis of the two trial case implies that the mean decision time is given by an expression equivalent to Eq. (17) for j ≥ 2. Using these expressions for cj and DTj in Eq. (23), we find

RR1:n(θ; ϵ) = n·c1 / (DT1 + (n − 1)·DT2 + n·TD).     (24)
As n is increased, the threshold value, θmax, that maximizes the RR increases across the full range of ϵ values (Fig. 4): As the number of trials increases, trial 1 has less impact on the RR. Longer decision times on trial 1 impact the RR less, allowing for longer accumulation times (higher thresholds) in later trials. In the limit of an infinite number of trials (n → ∞), Eq. (24) simplifies to

RR∞(θ; ϵ) = c1 / (DT2 + TD),     (25)

and we find that the maximizer θmax of RR∞(θ; ϵ) sets an upper limit on the optimal θ for n < ∞.
Interestingly, in a static environment (ϵ → 0), DT2 → 0, so RR∞ = c1/TD grows monotonically with θ and the decision threshold value that maximizes RR diverges, θmax → ∞ (Fig. 4C). Intuitively, when there are many trials (n ≫ 1), the price of a long wait for a high accuracy decision in the first trial (c1 ≈ 1 but DT1 ≫ 1 for θ ≫ 1) is offset by the reward from a large number (n − 1 ≫ 1) of subsequent, instantaneous, high accuracy decisions (cj ≈ 1 and DTj = 0 for θ ≫ 1).
5.3. Dynamic decision thresholds
In the case of dynamic thresholds, the RR function, Eq. (23), can be maximized for an arbitrary number of trials, n. The probability of a correct decision, cj, is given by the more general, iterative version of Eq. (21). Therefore, while analytical results can be obtained for constant thresholds, numerical methods are needed to find the sequence of optimal dynamic thresholds.
The expression for the mean decision time, DTj, on trial j is determined by marginalizing over whether or not the initial belief aligns with the true state Hj, yielding the expression given in Appendix C.
Thus, Eq. (21), the probability of a correct decision on trial j − 1, is needed to determine the mean decision time, DTj, on trial j. The resulting values for cj and DTj can be used in Eq. (23) to determine the RR which again achieves a maximum for some choice of thresholds, θ1:n := (θ1, θ2, …, θn).
In Fig. 5 we show that the sequence of decision thresholds that maximizes the RR, θ1:n^opt, is decreasing across trials, consistent with our observation for two trials. Again, the increased accuracy in earlier trials improves accuracy in later trials, so there is value in gathering more information early in the trial sequence. Alternatively, an observer has less to gain from waiting for more information in a late trial, as the additional evidence can only be used on a few more trials. For intermediate to large values of ϵ, the observer uses a near constant threshold on the first n − 1 trials, and then a lower threshold on the last trial. As ϵ is decreased, the last threshold, θn^opt, collides with the boundary θn = |yn(0)|, so the last decision is made instantaneously. The earlier thresholds collide with the corresponding boundaries in reverse order as ϵ is decreased further. Thus for lower values of ϵ, an increasing number of decisions toward the end of the sequence are made instantaneously.
6. Comparison to experimental results on repetition bias
We next discuss the experimental predictions of our model, motivated by previous observations of repetition bias. It has long been known that trends appear in the response sequences of subjects in series of TAFC trials (Fernberger, 1920), even when the trials are uncorrelated (Cho et al., 2002; Gao et al., 2009). For instance, when the state is the same on two adjacent trials, the probability of a correct response on the second trial increases (Fründ et al., 2014), and the response time decreases (Jones et al., 2013). In our model, if the assumed change rate of the environment does not match the true change rate, ϵassumed ≠ ϵtrue, performance is decreased compared to ϵassumed = ϵtrue. However, this decrease may not be severe. Thus the observed response trends may arise partly due to misestimation of the transition rate ϵ, and the assumption that trials are dependent, even if they are not (Cho et al., 2002; Yu and Cohen, 2008).
Motivated by our findings on optimal thresholds in the previous sections, we focus on the model of an observer who fixes their decision threshold across all trials, θj = θ, ∀j. Since allowing for dynamic thresholds does not considerably impact the maximal RR, we think the qualitative trends we observe here should generalize.
6.1. History-dependent psychometric functions
To provide a normative explanation of repetition bias, we compute the degree to which the optimal observer’s choice on trial j − 1 biases their choice on trial j. For comparison with experimental work, we present the probability that dj = +1, conditioned both on the choice on the previous trial and on coherence (See Fig. 6A). In this case, P(dj = +1|gj/D, dj−1 = ±1) = πθ(±y0), where the exit probability πθ(·) is defined by Eq. (2), and the initial decision variable, y0 = |yj(0)|, is given by Eq. (14) using the threshold θ and the assumed transition rate, ϵ = ϵassumed. The unconditioned psychometric function is obtained by marginalizing over the true environmental change rate, ϵtrue, P(dj = +1|gj/D) = (1 − ϵtrue)P(dj = +1|gj/D, dj−1 = +1) + ϵtrueP(dj = +1|gj/D, dj−1 = −1), which equals πθ(0) if ϵ = ϵtrue.
Fig. 6A shows that the probability of decision dj = +1 is increased (decreased) if the same (opposite) decision was made on the previous trial. When ϵassumed ≠ ϵtrue, the psychometric function is shallower and overall performance worsens (inset), although only slightly. Note that we have set θ = θmax, so the decision threshold maximizes RR∞ in Eq. (25). The increased probability of a repetition arises from the fact that the initial condition yj(0) is biased in the direction of the previous decision.
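A history-conditioned psychometric function follows from the exit probability with drift g. The sketch below uses the standard first-passage formula for dy = g dt + √(2D) dW (which reduces to Eq. (2) at g = 1) and shifts the starting point by the previous choice; names are ours:

```python
import numpy as np

def exit_prob_plus(y0, g, theta, D):
    """P(hit +theta before -theta) for dy = g dt + sqrt(2D) dW from y0;
    reduces to (y0 + theta)/(2 theta) as g -> 0."""
    if abs(g) < 1e-12:
        return (y0 + theta) / (2.0 * theta)
    a = g / D
    return (1.0 - np.exp(-a * (y0 + theta))) / (1.0 - np.exp(-2.0 * a * theta))

def psychometric(g, prev_choice, bias, theta=1.0, D=1.0):
    """P(d_j = +1 | drift g, d_{j-1} = prev_choice): the starting point is
    shifted toward the previous choice by the carried-over bias."""
    return exit_prob_plus(prev_choice * bias, g, theta, D)
```

For any drift, the probability of responding +1 is larger after a previous +1 choice, reproducing the repetition bias of Fig. 6A.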
6.2. Error rates and decision times
The adaptive DDM also exhibits repetition bias in decision times (Goldfarb et al., 2012). If the state at the current trial is the same as that on the previous trial (repetition), the decision time is shorter than average, while the decision time is increased if the state is different (alternation, see Appendix D). This occurs for optimal (ϵassumed = ϵtrue) and suboptimal (ϵassumed ≠ ϵtrue) models as long as ϵ ∈ [0, 0.5). Furthermore, the rate of correct responses increases for repetitions and decreases for alternations (See Appendix D). However, the impact of the states on two or more previous trials on the choice in the present trial is more complicated. As in previous studies (Cho et al., 2002; Goldfarb et al., 2012; Jones et al., 2013), we propose that such repetition biases in the case of uncorrelated trials could be the result of the subject assuming ϵ = ϵassumed ≠ 0.5, even when ϵtrue = 0.5.
Following Goldfarb et al. (2012), we denote the three adjacent trial states as composed of repetitions (R) and alternations (A) (Cho et al., 2002; Goldfarb et al., 2012). Repetition-repetition (RR) corresponds to three identical states in a sequence, Hj = Hj−1 = Hj−2; RA corresponds to Hj ≠ Hj−1 = Hj−2; AR to Hj = Hj−1 ≠ Hj−2; and AA to Hj ≠ Hj−1 ≠ Hj−2. The correct probabilities cXY associated with all four of these cases can be computed explicitly, as shown in Appendix D. Here we assume that ϵassumed = ϵtrue for simplicity, although subjects likely often use ϵassumed ≠ ϵtrue, as shown in Cho et al. (2002); Goldfarb et al. (2012).
The probability of a correct response as a function of ϵ is shown for all four conditions in Fig. 6B. Clearly, cRR is always largest, since the ideal observer correctly assumes repetitions are more likely than alternations (when ϵ ∈ [0,0.5)).
Note when ϵ = 0.05, cRR > cAA > cAR > cRA (colored dots), so a sequence of two alternations yields a higher correct probability on the current trial than a repetition preceded by an alternation. This may seem counterintuitive, since we might expect repetitions to always yield higher correct probabilities. However, two alternations yield the same state on the first and last trial in the three trial sequence (Hj = Hj−2). For small ϵ, initial conditions begin close to threshold, so it is likely that dj = dj−1 = dj−2; hence if dj−2 is correct, then dj will be too. At small ϵ, and at the parameters we chose here, the increased likelihood of the first and last responses being correct outweighs the cost of the middle response likely being incorrect. Note that AA occurs with probability ϵ^2, and is thus rarely observed when ϵ is small. As ϵ is increased, the ordering switches to cRR > cAR > cAA > cRA (e.g., at ϵ = 0.25), so a repetition in trial j leads to a higher probability of a correct response. In this case, initial conditions are less biased, so the effects of the state two trials back are weaker. As before, we have fixed θ = θmax.
These trends extend to decision times. Following a calculation similar to that used for the correct response probabilities, cXY, we can compute the mean decision times, TXY, conditioned on all possible three-state sequences (See Appendix D). Fig. 6C shows that, as a function of ϵ, decision times following two repetitions, TRR, are always the smallest. This is because the initial condition in trial j is most likely to start closest to the correct threshold: decisions do not take as long when the decision variable starts close to the boundary. Also, note that when ϵ = 0.05, TRR < TAA < TAR < TRA; the explanation is similar to that for the probabilities of correct responses. When the environment changes slowly, decisions after two alternations will be quicker because a bias in the direction of the correct decision is more likely. The ordering is different at intermediate ϵ, where TRR < TAR < TAA < TRA.
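These history effects can be reproduced in simulation. The sketch below is a minimal Monte Carlo version of the adaptive DDM described above, assuming unit drift, noise variance 2D (so that the decision variable equals D times the log-likelihood ratio), and a carried-over bias equal to D times the prior log odds implied by the previous choice; the function name and all parameter values are illustrative, not taken from the paper.

```python
import numpy as np

def simulate_sequence(n_trials=3000, eps=0.2, theta=1.0, D=0.5,
                      dt=0.01, seed=1):
    """Monte Carlo sketch of correlated TAFC trials with evidence carryover.

    Within-trial dynamics (assumed): dy = H dt + sqrt(2 D) dW; decide when
    |y| = theta. The next trial starts at y0 = d * b, where b is D times
    the prior log odds implied by deciding at threshold theta.
    """
    rng = np.random.default_rng(seed)
    e = np.exp(theta / D)
    b = D * np.log(((1 - eps) * e + eps) / (eps * e + (1 - eps)))
    H, y0, H_prev = 1, 0.0, None
    records = []  # (is_repetition, is_correct, decision_time)
    for _ in range(n_trials):
        y, t = y0, 0.0
        while abs(y) < theta:      # Euler-Maruyama integration to threshold
            y += H * dt + np.sqrt(2 * D * dt) * rng.standard_normal()
            t += dt
        d = 1 if y >= theta else -1
        if H_prev is not None:
            records.append((H == H_prev, d == H, t))
        y0, H_prev = d * b, H      # bias next trial toward the last choice
        if rng.random() < eps:     # state switches with probability eps
            H = -H
    rep = [(c, t) for r, c, t in records if r]
    alt = [(c, t) for r, c, t in records if not r]
    cR = np.mean([c for c, _ in rep]); TR = np.mean([t for _, t in rep])
    cA = np.mean([c for c, _ in alt]); TA = np.mean([t for _, t in alt])
    return cR, cA, TR, TA
```

With these assumed parameters, repetition trials come out both more accurate and faster than alternation trials (cR > cA and TR < TA), matching the orderings above; the four three-trial conditions can be tallied from the same records in the same way.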
Notably, both orderings of the correct probabilities and decision times, as they depend on the three-trial history, have been observed in previous studies of serial bias (Cho et al., 2002; Gao et al., 2009; Goldfarb et al., 2012): sometimes cAA > cAR and TAA < TAR, whereas sometimes cAA < cAR and TAA > TAR. However, it appears to be more common that AA sequences lead to lower decision times, TAA < TAR (e.g., see Fig. 3 in Goldfarb et al. (2012)). This suggests that subjects assume state switches are rare, corresponding to a smaller ϵassumed.
6.3. Initial condition dependence on noise
Lastly, we explored how the initial condition, y0, which captures the information gathered on the previous trial, changes with signal coherence, g/D. We only consider the case , and note that the threshold will vary with D. We find that the initial bias, y0, always tends to increase with D (Fig. 6D). The intuition for this result is that the optimal threshold will tend to increase with D: uncertainty increases with D, necessitating a higher threshold to maintain constant accuracy (See Eq. (4)). As the optimal threshold increases with D, the initial bias increases with it. This is consistent with experimental observations showing that bias tends to decrease with coherence, i.e., increase with noise amplitude (Fründ et al., 2014; Olianezhad et al., 2016; Braun et al., 2018).
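This intuition can be made concrete in a short computation. Below we assume the accuracy relation c = 1/(1 + e^(−θ/D)) (our reading of Eq. (4)) and a carryover bias equal to D times the prior log odds implied by the previous choice; both forms are assumptions for illustration. Holding accuracy fixed forces θ = D ln(c/(1 − c)), so θ/D is constant and the bias grows with D.

```python
import numpy as np

def bias(theta, D, eps):
    """Assumed carryover bias: D times the prior log odds implied by a
    decision at threshold theta, propagated through switch rate eps."""
    e = np.exp(theta / D)
    return D * np.log(((1 - eps) * e + eps) / (eps * e + (1 - eps)))

def bias_at_fixed_accuracy(D, c=0.85, eps=0.1):
    # Fixed accuracy c = 1/(1 + exp(-theta/D))  =>  theta = D ln(c/(1-c)),
    # so theta/D is constant and the bias scales with D.
    return bias(D * np.log(c / (1 - c)), D, eps)
```

Since θ/D is pinned by c, the bias here is exactly proportional to D; with an optimal rather than accuracy-matched threshold the growth need not be linear, but the increasing trend is the same.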
We thus found that experimentally observed history-dependent perceptual biases can be explained using a model of an ideal observer accumulating evidence across correlated TAFC trials. Biases are stronger when the observer assumes a more slowly changing environment (smaller ϵ), in which a single trial can influence the ideal observer's beliefs far into the future. Several previous studies have proposed this idea (Cho et al., 2002; Goldfarb et al., 2012; Braun et al., 2018), but to our knowledge, none have used a normative model to explain and quantify these effects.
7. Discussion
It has been known for nearly a century that observers take into account previous choices when making decisions (Fernberger, 1920). While a number of descriptive models have been used to explain experimental data (Fründ et al., 2014; Goldfarb et al., 2012), there have been no normative models that quantify how information accumulated on one trial impacts future decisions. Here we have shown that a straightforward, tractable extension of previous evidence-accumulation models can be used to do so.
To account for information obtained on one trial, observers could adjust accumulation speed (Ratcliff, 1985; Diederich and Busemeyer, 2006; Urai et al., 2018), threshold (Bogacz et al., 2006; Goldfarb et al., 2012; Diederich and Busemeyer, 2006), or their initial belief on the subsequent trial (Bogacz et al., 2006; Braun et al., 2018). We have shown that an ideal observer adjusts their initial belief and decision threshold, but not the accumulation speed, in a sequence of dependent, statistically identical trials. Adjusting the initial belief increases reward rate by decreasing response time, but not the probability of a correct response. Changes in initial bias have a larger effect than adjustments in threshold, and there is evidence that human subjects do adjust their initial beliefs across trials (Fründ et al., 2014; Olianezhad et al., 2016; Braun et al., 2018). On the other hand, the effect of a dynamic threshold across trials diminishes as the number of trials increases: the reward rate is nearly the same for a constant threshold θ as for dynamic thresholds θ1:n. Given the cognitive cost of trial-to-trial threshold adjustments, it is thus unclear whether human subjects would implement such a strategy (Balci et al., 2011).
To isolate the effect of previously gathered information, we have assumed that the observers do not receive feedback or reward until the end of the trial sequence. Observers can thus use only information gathered on previous trials to adjust their bias on the next. However, our results can be easily extended to the case when feedback is provided on each trial. The expression for the initial condition y0 is simpler in this case, since feedback provides certainty concerning the state on the previous trial.
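As a sketch of this extension (with hypothetical forms: the with-feedback expression follows from certainty about the previous state, the without-feedback expression from a posterior thresholded at θ; neither is quoted from the paper):

```python
import numpy as np

def bias_without_feedback(theta, D, eps):
    """Assumed carryover bias: D times the prior log odds implied by the
    previous decision alone, without feedback."""
    e = np.exp(theta / D)
    return D * np.log(((1 - eps) * e + eps) / (eps * e + (1 - eps)))

def bias_with_feedback(D, eps):
    """With feedback the previous state is known exactly, so only the
    switch rate matters: y0 = D ln((1 - eps)/eps) (assumed form)."""
    return D * np.log((1 - eps) / eps)
```

Because feedback replaces a finite-confidence posterior with certainty, the with-feedback bias is always at least as large as the without-feedback bias when ϵ < 0.5.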
We also assumed that an observer knows the length of the trial sequence, and uses this number to optimize the reward rate. It is rare that an organism would have such information in a natural situation. However, our model can be extended to random sequence durations if we allow observers to marginalize over possible sequence lengths. If decision thresholds are fixed to be constant across all trials, θj = θ, the expected RR is the average number of correct decisions, c〈n〉, divided by the average duration of the sequence. For instance, if the number of trials follows a geometric distribution, n ~ Geom(p), then
This equation has the same form as Eq. (24), the RR in the case of a fixed trial number n, with n replaced by the average 〈n〉 = 1/p. Thus, as p decreases the average trial number, 1/p, increases, increasing the RR and the optimal decision thresholds θmax (as in Fig. 4). The case of dynamic thresholds is somewhat more involved, but can be treated similarly. However, since the number of trials is generated by a memoryless process, observers gain no further information about how many trials remain after the present one, and will thus not adjust their thresholds as in Section 5.3.
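The marginalization over random sequence lengths can be sanity-checked numerically; the sample size and switch probability p below are arbitrary illustrative choices.

```python
import numpy as np

# For n ~ Geom(p) (support 1, 2, ...), the average sequence length is
# <n> = 1/p, so the fixed-n reward-rate expression applies with n -> 1/p.
rng = np.random.default_rng(0)
p = 0.1
n = rng.geometric(p, size=200_000)
mean_n = n.mean()   # close to 1/p = 10
```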
Our conclusions also depend on the assumptions the observer makes about the environment. For instance, an observer may not know the exact probability of change between trials, so that ϵassumed ≠ ϵtrue. Indeed, there is evidence that human subjects adjust their estimate of ϵ across a sequence of trials, and assume trials are correlated even when they are not (Cho et al., 2002; Gao et al., 2009; Yu and Cohen, 2008). Our model can be expanded to a hierarchical model in which both the states H1:n and the change rate, ϵ, are inferred across a sequence of trials (Radillo et al., 2017). This would result in model ideal observers that use the current estimate of ϵ to set the initial bias on a trial (Yu and Cohen, 2008). A similar approach could also allow us to model observers that incorrectly learn, or make wrong assumptions about, the rate of change.
Modeling approaches can also point to the neural computations that underlie the decision-making process. Cortical representations of previous trials have been identified (Nogueira et al., 2017; Akrami et al., 2018), but it is unclear how the brain learns and combines latent properties of the environment (the change rate ϵ) with sensory information to make inferences. Recent work on repetition bias in working memory suggests short-term plasticity serves to encode priors in neural circuits, which bias neural activity patterns during memory retention periods (Papadimitriou et al., 2015; Kilpatrick, 2018). Such synaptic mechanisms may also allow for the encoding of previously accumulated information in sequential decision-making tasks, even in cases where state sequences are not correlated. Indeed, evolution may imprint characteristics of natural environments onto neural circuits, shaping whether and how ϵ is learned in experiments and influencing trends in human response statistics (Fawcett et al., 2014; Glaze et al., 2018). Ecologically adapted decision rules may be difficult to train away, especially if they do not impact performance too adversely (Todd and Gigerenzer, 2007).
The normative model we have derived can be used to identify task parameter ranges that will help tease apart the assumptions made by subjects (Cho et al., 2002; Gao et al., 2009; Goldfarb et al., 2012). Since our model describes the behavior of an ideal observer, it can be used to determine whether experimental subjects use near-optimal or suboptimal evidence-accumulation strategies. Furthermore, our adaptive DDM opens a window to further instantiations of the DDM that consider the influence of more complex state-histories on the evidence-accumulation strategies adopted within a trial. Uncovering common assumptions about state-histories will help guide future experiments, and help us better quantify the biases and core mechanisms of human decision-making.
Acknowledgements
This work was supported by an NSF/NIH CRCNS grant (R01MH115557). KN was supported by an NSF Graduate Research Fellowship. ZPK was also supported by an NSF grant (DMS-1615737). KJ was also supported by NSF grant DBI-1707400 and DMS-1517629.
Appendix A Derivation of the drift-diffusion model for a single trial
We assume that an optimal observer integrates a stream of noisy measurements at equally spaced times t1:s = (t1, t2, …, ts) (Wald and Wolfowitz, 1948; Bogacz et al., 2006). The likelihood functions, f±(ξ) := P(ξ|H±), define the probability of each measurement, ξ, conditioned on the environmental state, H±. Observations in each trial are combined with the prior probabilities P(H±). We assume a symmetric prior, P(H±) = 1/2. Due to the independence of the measurements, the probability ratio after s observations is then Rs = P(H+|ξ1:s)/P(H−|ξ1:s) = ∏i=1..s [f+(ξi)/f−(ξi)] (Eq. (A.1)). Thus, if Rs > 1 (Rs < 1) then H1 = H+ (H1 = H−) is the more likely state. Eq. (A.1) can be written recursively as Rs = [f+(ξs)/f−(ξs)] Rs−1 (Eq. (A.2)), where R0 = 1, due to the assumed flat prior at the beginning of trial 1.
Taking the logarithm of Eq. (A.2) allows us to express the recursive relation for the log-likelihood ratio (LLR) as an iterative sum, LLRs = LLRs−1 + ln[f+(ξs)/f−(ξs)], where LLR0 = 0, so if LLRs > 0 (LLRs < 0) then H1 = H+ (H1 = H−) is the more likely state. Taking the continuum limit Δt → 0 of the timestep Δt := ts − ts−1, we can use the functional central limit theorem to obtain the DDM (See Bogacz et al. (2006); Veliz-Cuba et al. (2016) for details), in which W is a Wiener process, the drift depends on the state, and the variance depends only on the noisiness of each observation, not on the state.
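The drift and variance of the continuum limit can be illustrated with Gaussian likelihoods, f± = N(±μ, σ²), a standard special case (assumed here for illustration, not necessarily the paper's choice). Each LLR increment is then ln[f+(ξ)/f−(ξ)] = 2μξ/σ², which under H+ has mean 2μ²/σ² and variance 4μ²/σ², so the running LLR behaves like a drifting random walk:

```python
import numpy as np

def llr_increments(mu=1.0, sigma=2.0, n=200_000, seed=7):
    """Sample LLR increments ln[f+(xi)/f-(xi)] = 2*mu*xi/sigma^2 under H+,
    where f+/- are Gaussian densities N(+/-mu, sigma^2)."""
    rng = np.random.default_rng(seed)
    xi = rng.normal(mu, sigma, size=n)    # observations drawn under H+
    return 2 * mu * xi / sigma**2

inc = llr_increments()
drift_hat = inc.mean()   # theory: 2 mu^2 / sigma^2 = 0.5
var_hat = inc.var()      # theory: 4 mu^2 / sigma^2 = 1.0
```

Note that the variance rate is twice the drift rate (4μ²/σ² = 2 · 2μ²/σ²) for this likelihood family; both scale together with the informativeness of each observation, which is the structure that survives the Δt → 0 limit.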
With Eq. (1) in hand, we can relate y1(t) to the probability of either state H± by noting its Gaussian statistics in the case of free boundaries, so that P(H± | y1(t)) = 1/(1 + e∓y1(t)/D).
Note that an identical relation is obtained in Bogacz et al. (2006) in the case of absorbing boundaries. Either way, it is clear that y1 = D · LLR1. This means that, before any observations have been made, y1(0) = D ln[P(H+)/P(H−)], so we scale the log ratio of the prior by D to yield the initial condition in Eq. (1).
Appendix B Threshold determines probability of being correct
We note that Eq. (A.5) extends to any trial j in the case of an ideal observer. In this case, when the decision variable reaches threshold, y2(T2) = θ, we can write c2 = P(H2 = H+ | y2(T2) = θ) = 1/(1 + e−θ/D) (Eq. (B.1)), so that a rearrangement of Eq. (B.1) yields the threshold in terms of the correct probability, θ = D ln(c2/(1 − c2)).
Therefore the probabilities of correct decisions on trials 1 and 2 are equal, and determined by θ = θ1:2 and D. A similar computation shows that the threshold, θj = θ, and diffusion rate, D, but not the initial belief, determine the probability of a correct response on any trial j.
To demonstrate self-consistency in the general case of decision thresholds that may differ between trials, θ1:2, we can also derive the formula for c2 directly. Our alternative derivation of Eq. (B.2) begins by computing the total probability of a correct choice on trial 2, combining the conditional probabilities of a correct choice given decision d1 = +1, which leads to a positive initial belief, and decision d1 = −1, which leads to a negative initial belief. This again shows that the correct probability c2 on trial 2 is solely determined by θ2 and D.
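This appendix's claim can be checked by Monte Carlo under an assumed parameterization (unit drift, noise variance 2D, carryover bias equal to D times the implied prior log odds; all values illustrative): accuracy on a biased second trial matches accuracy on an unbiased first trial.

```python
import numpy as np

def two_trial_accuracy(n=5000, eps=0.2, theta=1.0, D=0.5, dt=0.01, seed=3):
    """Compare accuracy on an unbiased trial 1 and a biased trial 2."""
    rng = np.random.default_rng(seed)
    e = np.exp(theta / D)
    b = D * np.log(((1 - eps) * e + eps) / (eps * e + (1 - eps)))

    def run_trial(H, y):
        while abs(y) < theta:  # Euler-Maruyama integration to threshold
            y += H * dt + np.sqrt(2 * D * dt) * rng.standard_normal()
        return 1 if y >= theta else -1

    c1 = c2 = 0
    for _ in range(n):
        H1 = rng.choice((-1, 1))
        d1 = run_trial(H1, 0.0)              # trial 1: unbiased start
        c1 += (d1 == H1)
        H2 = -H1 if rng.random() < eps else H1
        c2 += (run_trial(H2, d1 * b) == H2)  # trial 2: biased start
    return c1 / n, c2 / n
```

The ideal-observer bias shifts when decisions happen, not how often they are right: the head start toward the more probable state is exactly offset by the trials on which it points the wrong way.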
Appendix C Mean response time on the second trial
We can compute the average time until a decision on trial 2 by marginalizing over the first decision, d1 = ±1, since d1 determines the initial condition in trial 2. Using Eq. (7), we can compute DT2 as given in Eq. (C.1).
The second terms in the summands of Eq. (C.1) can be computed explicitly (Eq. (C.2)); note that DT2 = [DT2|H2 = H+] P(H2 = H+) + [DT2|H2 = H−] P(H2 = H−) = DT2|(H2 = H±).
Using Eqs. (C.1-C.2) and simplifying assuming θ1:2 = θ, we obtain Eq. (17). For unequal thresholds, θ1 ≠ θ2, a similar computation yields Eq. (19).
Appendix D Repetition-dependence of correct probabilities and decision times
Here we derive formulas for the correct probabilities cj and average decision times DTj on trial j, conditioned on whether the current trial is a repetition (R: Hj = Hj−1) or an alternation (A: Hj ≠ Hj−1) of the previous trial. To begin, we first determine the correct probability conditioned on repetitions (R), marginalizing over whether the previous response was correct: cj|R := cj|(Hj = Hj−1 = H±) = cj−1 πθ(y0) + (1 − cj−1) πθ(−y0), where y0 > 0 is the magnitude of the initial bias.
In the case of alternations, the bias from the previous trial points away from the correct threshold whenever the previous response was correct, giving cj|A := cj|(Hj ≠ Hj−1 = H±) = cj−1 πθ(−y0) + (1 − cj−1) πθ(y0).
Thus, we can show that the correct probability in the case of repetitions is higher than that for alternations, cj|R > cj|A, for any θ > 0 and ϵ ∈ [0, 0.5), since cj|R − cj|A = (2cj−1 − 1)[πθ(y0) − πθ(−y0)] > 0: both factors are positive, as cj−1 > 1/2 and πθ is increasing in its argument.
Now, we also demonstrate that decision times are shorter for repetitions than for alternations. When the current trial is a repetition, we can again marginalize over the accuracy of the previous response to compute DTj|R := DTj|(Hj = Hj−1 = H±); the case of alternations, DTj|A, is computed analogously.
By subtracting DTj|A from DTj|R, we can show that DTj|R < DTj|A for all θ > 0 and ϵ ∈ [0, 0.5). To begin, note that
Since the first term in the product above is clearly positive for θ, D > 0, we can show DTj|R − DTj|A < 0, and thus DTj|R < DTj|A, if F(y0) < F(θ). We verify this by first noting that the function F(y) = (e^(y/D) − e^(−y/D))/y is nondecreasing, since its derivative is nonnegative exactly when y/D ≥ tanh(y/D), which always holds. In fact, F(y) is strictly increasing everywhere except at y = 0. This means that as long as y0 < θ, we have F(y0) < F(θ), as we wished to show, so DTj|R < DTj|A.
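The inequality cj|R > cj|A can also be checked numerically using the standard threshold-crossing probability for a drift-diffusion process. The formula below assumes unit drift and noise variance 2D, one parameterization consistent with y = D · LLR; it is an illustrative sketch, not the paper's Eq. (2) verbatim.

```python
import numpy as np

def pi_theta(y0, theta, D):
    """P(hit +theta before -theta | start at y0), for dy = dt + sqrt(2D) dW."""
    k = 1.0 / D
    return (1 - np.exp(-k * (y0 + theta))) / (1 - np.exp(-2 * k * theta))

def conditioned_accuracy(theta, D, eps):
    """cj|R and cj|A, marginalizing over the previous trial's accuracy."""
    c_prev = pi_theta(0.0, theta, D)   # accuracy of any trial (Appendix B)
    e = np.exp(theta / D)
    b = D * np.log(((1 - eps) * e + eps) / (eps * e + (1 - eps)))
    cR = c_prev * pi_theta(b, theta, D) + (1 - c_prev) * pi_theta(-b, theta, D)
    cA = c_prev * pi_theta(-b, theta, D) + (1 - c_prev) * pi_theta(b, theta, D)
    return cR, cA
```

Here cR − cA = (2c − 1)[πθ(y0) − πθ(−y0)], which is positive whenever θ > 0 and ϵ ∈ [0, 0.5), mirroring the proof above.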
Moving to three-state sequences, we note that we can use the formulas for cj|R and cj|A derived above, as well as πθ(y) described by Eq. (2), to compute the correct probabilities cXY conditioned on each three-state sequence.
Applying similar logic to the calculation of the decision times as they depend on the three-state sequences, we obtain the conditioned mean decision times TXY.
Appendix E Numerical simulations
All numerical simulations of the DDM were performed using the Euler-Maruyama method with a timestep dt = 0.005, computing each point from 10⁵ realizations. To optimize reward rates, we used the Nelder-Mead simplex method (fminsearch in MATLAB). All code used to generate figures will be available on GitHub.
Footnotes
↵1 If 0.5 < ϵ ≤ 1 states are more likely to alternate. The analysis is similar to the one we present here, so we do not discuss this case separately.
↵2 An arbitrary drift amplitude g can also be scaled out via a change of variables, so that the resulting DDM has unit drift and a correspondingly rescaled noise coefficient.
↵3 We define the coherence as the drift, gj ∈ {±1}, in trial j divided by the noise diffusion coefficient D (Gold and Shadlen, 2002; Bogacz et al., 2006).
↵4 Note that RR here refers to repetition-repetition, as opposed to earlier when we used RR in Roman font to denote reward rate.