Remembrance of Inferences Past

Ishita Dasgupta; Eric Schulz; Noah D. Goodman; Samuel J. Gershman

doi:10.1101/231837

Abstract

Bayesian models of cognition assume that people compute probability distributions over hypotheses. However, the required computations are frequently intractable or prohibitively expensive. Since people often encounter many closely related distributions, selective reuse of computations (amortized inference) is a computationally efficient use of the brain’s limited resources. We present three experiments that provide evidence for amortization in human probabilistic reasoning. When sequentially answering two related queries about natural scenes, participants’ responses to the second query systematically depend on the structure of the first query. This influence is sensitive to the content of the queries, only appearing when the queries are related. Using a cognitive load manipulation, we find evidence that people amortize summary statistics of previous inferences, rather than storing the entire distribution. These findings support the view that the brain trades off accuracy and computational cost, to make efficient use of its limited cognitive resources to approximate probabilistic inference.

“Cognition is recognition.”

Hofstadter (1995)

Introduction

Many theories of probabilistic reasoning assume that human brains are equipped with a general-purpose inference engine that can be used to answer arbitrary queries for a wide variety of probabilistic models (Griffiths, Vul, & Sanborn, 2012; Oaksford & Chater, 2007). For example, given a joint distribution over objects in a scene, the inference engine can be queried with arbitrary conditional distributions, such as:

What is the probability of a microwave given that I’ve observed a sink?
What is the probability of a toaster given that I’ve observed a sink and a microwave?
What is the probability of a toaster and a microwave given that I’ve observed a sink?

The nature of the inference engine that answers such queries is still an open research question, though many theories posit some form of approximate inference using Monte Carlo sampling (e.g., Dasgupta, Schulz, & Gershman, 2017; Denison, Bonawitz, Gopnik, & Griffiths, 2013; Gershman, Vul, & Tenenbaum, 2012; Sanborn & Chater, 2016; Thaker, Tenenbaum, & Gershman, 2017; Ullman, Goodman, & Tenenbaum, 2012; Vul, Goodman, Griffiths, & Tenenbaum, 2014).

The flexibility and power of such a general-purpose inference engine trades off against its computational efficiency: by treating each query distribution independently, an inference engine forgoes the opportunity to reuse computations across queries. Every time a distribution is queried, past computations are ignored and answers are produced anew. The inference engine is memoryless, a property that makes it statistically accurate but inefficient in environments with overlapping queries. Continuing the scene inference example, the answer to the third query is easily computed once the first two queries have been answered. Mathematically, the answer can be expressed as:

$$P(\text{toaster}, \text{microwave} \mid \text{sink}) = P(\text{toaster} \mid \text{sink}, \text{microwave}) \, P(\text{microwave} \mid \text{sink})$$

Even though this is a trivial example, standard inference engines do not exploit these kinds of regularities because they are memoryless: they have no access to traces of past computations.
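As a minimal sketch of this kind of reuse, the following Python snippet computes the answer to the third query from stored answers to the first two. The joint probabilities are invented purely for illustration, and the function names are hypothetical rather than part of any model in this paper.

```python
# Toy joint distribution over (toaster present, microwave present),
# conditional on having observed a sink. All numbers are made up.
joint_given_sink = {
    (True, True): 0.30,
    (True, False): 0.10,
    (False, True): 0.40,
    (False, False): 0.20,
}

def p_microwave_given_sink(joint):
    """Query 1: P(microwave | sink), marginalizing over the toaster."""
    return sum(p for (toaster, microwave), p in joint.items() if microwave)

def p_toaster_given_sink_and_microwave(joint):
    """Query 2: P(toaster | sink, microwave)."""
    return joint[(True, True)] / p_microwave_given_sink(joint)

# Answers to the first two queries, stored for later reuse.
q1 = p_microwave_given_sink(joint_given_sink)
q2 = p_toaster_given_sink_and_microwave(joint_given_sink)

# Query 3: P(toaster, microwave | sink), obtained from the stored answers
# via the chain rule instead of a fresh inference.
q3_reused = q1 * q2
q3_direct = joint_given_sink[(True, True)]
```

Here `q3_reused` matches `q3_direct` up to floating-point error, which is the regularity a memoryless engine leaves unexploited.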

An inference engine may gain efficiency by incurring some amount of bias due to reuse of past computations, a strategy we will refer to as amortized inference (Gershman & Goodman, 2014; Stuhlmüller, Taylor, & Goodman, 2013). For example, if the inference engine stores its answers to the “toaster” and “microwave” queries, then it can efficiently compute the answer to the “toaster or microwave” query without rerunning inference from scratch. More generally, the posterior can be approximated as a parametrized function, or recognition model, that maps data in a bottom-up fashion to a distribution over hypotheses, with the parameters trained to minimize the divergence between the approximate and true posterior.¹ Because the same recognition model is shared across multiple queries, it can support rapid inference, but it is susceptible to “interference” across different queries.

Amortization has a long history in machine learning; the locus classicus is the Helmholtz machine (Dayan, Hinton, Neal, & Zemel, 1995; Hinton, Dayan, Frey, & Neal, 1995), which uses samples from the generative model to train a recognition model. More recent extensions and applications of this approach (e.g., Kingma & Welling, 2013; Paige & Wood, 2016; Rezende, Mohamed, & Wierstra, 2014; Ritchie, Thomas, Hanrahan, & Goodman, 2016) have ushered in a new era of scalable Bayesian computation in machine learning. We propose that amortization is also employed by the brain (see Yildirim, Kulkarni, Freiwald, & Tenenbaum, 2015, for a related proposal), flexibly reusing past inferences in order to efficiently answer new but related queries. The key behavioral prediction of amortized inference is that people will show correlations in their judgments across related queries.

We report 3 experiments that test this prediction using a variant of the probabilistic reasoning task previously studied by Dasgupta, Schulz, and Gershman (2017). In this task, participants answer queries about objects in scenes, much like in the examples given above. Crucially, the hypothesis space is combinatorial because participants have to answer questions about sets of objects (e.g., “All objects starting with the letter S”). This renders exact inference intractable: the hypothesis space cannot be efficiently enumerated. In our previous work (Dasgupta, Schulz, & Gershman, 2017), we argued that people approximate inference in this domain using a form of Monte Carlo sampling. Although this algorithm is asymptotically exact, only a small number of samples can be generated due to cognitive limitations, thereby revealing systematic cognitive biases such as anchoring and adjustment, subadditivity, and superadditivity (see also Lieder, Griffiths, Huys, & Goodman, 2017a, 2017b; Vul et al., 2014).

We show that the same algorithm can be generalized to reuse inferential computations in a manner consistent with human behavior. First we describe how amortization might be used by the mind. We consider two crucial questions about how this might be implemented: what parts of previous calculations people reuse (all previous memories, or summaries of the calculations), and when they choose to reuse their amortized calculations. Next we test these questions empirically. In Experiment 1, we demonstrate that people do use amortization by showing that there is a lingering influence of one query on participants’ answers to a second, related query. In Experiment 2, we explore what is reused, and find that people use summary statistics of their previously generated hypotheses, rather than the hypotheses themselves. Finally, in Experiment 3, we show that people are more likely to reuse previous computations when those computations are most likely to be relevant: when a second cue is similar to a previously evaluated one.

Hypothesis generation and amortization

Before describing the experiments, we provide an overview of our theoretical framework. First, we describe how Monte Carlo sampling can be used to approximate Bayesian inference, and summarize the psychological evidence for such an approximation. We then introduce amortized inference as a generalization of this framework.

Monte Carlo sampling

Bayes’ rule stipulates that the posterior distribution is obtained as a normalized product of the likelihood P(d|h) and the prior P(h):

$$P(h \mid d) = \frac{P(d \mid h)\, P(h)}{\sum_{h' \in \mathcal{H}} P(d \mid h')\, P(h')}$$

where 𝓗 is the hypothesis space. Unfortunately, Bayes’ rule is computationally intractable for all but the smallest hypothesis spaces, because the denominator requires summing over all possible hypotheses. This intractability is especially prevalent in combinatorial domains, where hypothesis spaces are exponentially large. In the scene inference example, 𝓗 = h₁ × h₂ × … × h_K is the product space of latent objects, so if there are K latent objects and M possible objects, |𝓗| = M^K. If we imagine there are M = 1000 kinds of objects, then it only takes K = 26 latent objects for the number of hypotheses to exceed the number of atoms in the universe.
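To make the cost of the normalizing sum concrete, here is a small Python sketch of exact inference by enumeration over a product hypothesis space. The object list, the uniform prior, and the likelihood values are all hypothetical toy choices, not quantities from the experiments.

```python
from itertools import product

def posterior_by_enumeration(objects, k, prior, likelihood):
    """Exact Bayes' rule over the product space of k latent objects."""
    hypotheses = list(product(objects, repeat=k))   # M**k tuples
    unnorm = {h: likelihood(h) * prior(h) for h in hypotheses}
    z = sum(unnorm.values())                        # sum over every hypothesis
    return {h: u / z for h, u in unnorm.items()}

objects = ["toaster", "microwave", "canoe"]         # M = 3 (toy value)
k = 2                                               # K = 2 latent objects

def uniform_prior(h):
    return 1.0 / len(objects) ** k

def toy_likelihood(h):
    # Hypothetical P(d | h): the cue is likelier if the scene holds a toaster.
    return 0.9 if "toaster" in h else 0.1

posterior = posterior_by_enumeration(objects, k, uniform_prior, toy_likelihood)
n_hypotheses = len(posterior)                       # equals len(objects) ** k
```

The enumeration is trivial for M = 3 and K = 2, but the list built by `product` grows as M^K, which is why this approach does not scale to realistic scenes.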

Monte Carlo methods approximate probability distributions with samples θ = {h₁,…,h_N} from the posterior distribution over the hypothesis space. We can understand Monte Carlo methods as producing a recognition model Q_θ(h|d) parametrized by θ (see Saeedi, Kulkarni, Mansinghka, & Gershman, 2017, for a systematic treatment). In the idealized case, each hypothesis is sampled from P(h|d). The approximation is then given by:

$$Q_\theta(h \mid d) = \frac{1}{N} \sum_{n=1}^{N} \mathbb{I}[h_n = h]$$

where 𝕀[·] = 1 if its argument is true (and 0 otherwise). The accuracy of this approximation improves with N, but from a decision-theoretic perspective even small N may be serviceable (Gershman, Horvitz, & Tenenbaum, 2015; Lieder et al., 2017a; Vul et al., 2014).
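The idealized sample-based approximation can be sketched in a few lines of Python. The three-hypothesis posterior below is a made-up stand-in, and sampling directly from it is exactly the idealization described above, which is generally unavailable in practice.

```python
import random
from collections import Counter

def monte_carlo_posterior(posterior, n_samples, seed=0):
    """Approximate a posterior by the relative frequency of its samples."""
    rng = random.Random(seed)
    hypotheses = list(posterior)
    weights = [posterior[h] for h in hypotheses]
    # Idealized case: draw each sample directly from the true posterior.
    samples = rng.choices(hypotheses, weights=weights, k=n_samples)
    counts = Counter(samples)
    # Average of indicator functions: the share of samples equal to h.
    return {h: counts[h] / n_samples for h in hypotheses}

true_posterior = {"chair": 0.5, "computer": 0.3, "canoe": 0.2}  # toy values
q_small = monte_carlo_posterior(true_posterior, n_samples=10)
q_large = monte_carlo_posterior(true_posterior, n_samples=100_000)
max_error_large = max(abs(q_large[h] - true_posterior[h]) for h in true_posterior)
```

With only ten samples the estimate is coarse (every value is a multiple of 0.1), whereas the large-sample estimate should sit close to the true probabilities.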

The key challenge in applying Monte Carlo methods is that generally we do not have access to samples from the posterior. Most practical methods are based on sampling from a more convenient distribution, weighting or selecting the samples in a way that preserves the asymptotic correctness of the approximation (MacKay, 2003). We focus on Markov chain Monte Carlo (MCMC) methods, the most widely used class of approximations, which are based on simulating a Markov chain whose stationary distribution is the posterior. In other words, if one samples from the Markov chain for long enough, eventually h will be sampled with frequency proportional to its posterior probability.

A number of findings suggest that MCMC is a psychologically plausible inference algorithm. Many implementations use a form of local stochastic search, proposing and then accepting or rejecting hypotheses. For example, the classic Metropolis-Hastings algorithm first samples a new hypothesis h′ from a proposal distribution φ(h′|h_n) and then accepts this proposal with probability

$$\min\left[1, \frac{P(d \mid h')\, P(h')\, \phi(h_n \mid h')}{P(d \mid h_n)\, P(h_n)\, \phi(h' \mid h_n)}\right]$$

Intuitively, this Markov chain will tend to move from lower to higher probability hypotheses, but will also sometimes “explore” low probability hypotheses. In order to ensure that a relatively high proportion of proposals are accepted, φ(h′|h_n) is usually constructed to sample proposals from a local region around h_n. This combination of locality and stochasticity leads to a characteristic pattern of small inferential steps punctuated by occasional leaps, much like the processes of conceptual discovery in childhood (Ullman et al., 2012) and creative insight in adulthood (Suchow, Bourgin, & Griffiths, 2017). Even low-level visual phenomena like perceptual multistability can be described in these terms (Gershman et al., 2012; Moreno-Bote, Knill, & Pouget, 2011).
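A minimal Metropolis-Hastings sketch in Python may help fix ideas. It assumes a toy hypothesis space of eight hypotheses arranged on a ring, with invented unnormalized posterior values, and a symmetric local proposal (one step left or right), so the proposal terms cancel in the acceptance ratio. None of these choices are taken from the model used in the paper.

```python
import random

def metropolis_hastings(unnorm_posterior, start, n_steps, seed=0):
    """Sample hypothesis indices with a symmetric +/-1 proposal on a ring."""
    rng = random.Random(seed)
    k = len(unnorm_posterior)
    h = start
    chain = []
    for _ in range(n_steps):
        proposal = (h + rng.choice((-1, 1))) % k    # local proposal around h
        # Symmetric proposal, so the acceptance probability reduces to the
        # ratio of unnormalized posterior values, capped at 1.
        accept_prob = min(1.0, unnorm_posterior[proposal] / unnorm_posterior[h])
        if rng.random() < accept_prob:
            h = proposal
        chain.append(h)                             # rejected moves repeat h
    return chain

weights = [1, 2, 4, 8, 4, 2, 1, 1]                  # toy values of P(d|h)P(h)
chain = metropolis_hastings(weights, start=0, n_steps=200_000)
freq_peak = chain.count(3) / len(chain)             # visits to hypothesis 3
true_peak = weights[3] / sum(weights)               # its exact posterior, 8/23
```

Over a long run, `freq_peak` should approach `true_peak`, illustrating that hypotheses are visited with frequency proportional to their posterior probability even though the normalizing constant is never computed.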

Another implication of MCMC, under the assumption that a small number of hypotheses are sampled, is that inferences will tend to show anchoring effects (i.e., a systematic bias towards the initial hypotheses in the Markov chain). Lieder and colleagues have shown how this idea can account for a wide variety of anchoring effects observed in human cognition (Lieder, Griffiths, & Goodman, 2012; Lieder et al., 2017b). For example, priming someone with an arbitrary number (e.g., the last 4 digits of their social security number) will bias a subsequent judgment (e.g., about the birth date of Gandhi), because the arbitrary number influences the initialization of the Markov chain.

In previous research (Dasgupta, Schulz, & Gershman, 2017), we have shown that MCMC can account for many other probabilistic reasoning “fallacies,” suggesting that they arise not from a fundamental misunderstanding of probability, but rather from the inevitable need to approximate inference with limited cognitive resources. We explored this idea using the scene inference task introduced in the previous section. The task facing subjects in our experiments was to judge the probability of a particular set of latent objects (the hypothesis, h) in a scene conditional on observing one object (the cue, d). By manipulating the framing of the query, we showed that subjects gave different answers to formally equivalent queries. In particular, by partially unpacking the queried object set (where fully unpacking an object set means to present it explicitly as a union of each of its member objects) into a small set of exemplars and a “catch-all” hypothesis (e.g., “what is the probability that there is a chair, a computer, or any other object beginning with C?”), we found that subjects judged the probability to be higher when the unpacked exemplars were typical (a “subadditivity” effect; cf. Tversky & Koehler, 1994) and lower when the unpacked exemplars were atypical (a “superadditivity” effect; cf. Sloman, Rottenstreich, Wisniewski, Hadjichristidis, & Fox, 2004) compared to when the query is presented without any unpacking.

To provide a concrete example, in the presence of the cue “table,” the typically unpacked query “what is the probability that there is also a chair, a computer, or any other object beginning with C?” generates higher probability estimates relative to the packed query “what is the probability that there is another object beginning with C?”, whereas the atypically unpacked query “what is the probability that there is also a cow, a canoe, or any other object beginning with C?” generates lower probability estimates compared to the packed query.

We were able to account for these effects using MCMC under the assumption that the unpacked exemplars initialize the Markov chain that generates the sample set. Because the initialization of the Markov chain transiently determines its future trajectory, initializing with typical examples causes the chain to tarry in the high probability region of the queried object set, thereby increasing its judged probability (subadditivity). In contrast, initializing with atypical examples causes the chain to get more easily derailed into regions outside the queried object set. This decreases the judged probability of the queried object set (superadditivity). The strength of these effects theoretically diminishes with the number of samples, as the chain approaches its stationary distribution. Accordingly, experimental manipulations that putatively reduce the number of samples, such as response deadlines and cognitive load, moderate this effect (Dasgupta, Schulz, & Gershman, 2017). The experiments reported in this paper build on these findings, using subadditivity and superadditivity in the scene inference paradigm to detect behavioral signatures of amortized inference.
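The initialization account can be illustrated with a small Python sketch that computes, exactly rather than by simulation, the expected probability estimate of a short Metropolis-Hastings chain. The eight-hypothesis ring, its weights, and the membership of the queried set are hypothetical: hypotheses 1 and 2 play the role of typical members, and hypothesis 5 is an atypical member flanked by probable hypotheses outside the set.

```python
def mh_transition_matrix(weights):
    """Transition matrix of MH with a symmetric +/-1 proposal on a ring."""
    k = len(weights)
    T = [[0.0] * k for _ in range(k)]
    for h in range(k):
        for step in (-1, 1):
            prop = (h + step) % k
            T[h][prop] += 0.5 * min(1.0, weights[prop] / weights[h])
        T[h][h] = 1.0 - sum(T[h])                   # rejected proposals stay put
    return T

def expected_estimate(weights, query_set, start, n_steps):
    """Expected share of the first n_steps chain states inside query_set."""
    T = mh_transition_matrix(weights)
    k = len(weights)
    dist = [0.0] * k
    dist[start] = 1.0                               # chain initialized at start
    total = 0.0
    for _ in range(n_steps):
        dist = [sum(dist[i] * T[i][j] for i in range(k)) for j in range(k)]
        total += sum(dist[j] for j in query_set)
    return total / n_steps

weights = [1, 6, 6, 1, 5, 0.5, 5, 1]                # toy unnormalized posterior
query_set = {1, 2, 5}                               # queried object set
true_prob = sum(weights[j] for j in query_set) / sum(weights)

short_typical = expected_estimate(weights, query_set, start=1, n_steps=5)
short_atypical = expected_estimate(weights, query_set, start=5, n_steps=5)
long_typical = expected_estimate(weights, query_set, start=1, n_steps=5000)
long_atypical = expected_estimate(weights, query_set, start=5, n_steps=5000)
```

In this toy setting the short chain started at a typical member overestimates the set's probability and the one started at the atypical member underestimates it, while both long-chain estimates move toward `true_prob`, mirroring the qualitative pattern described above.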

Amortized inference

As defined in the previous section, Monte Carlo sampling is memoryless, approximating P(h|d) without reference to other conditional distributions that have been computed in the past; all the hypothesis samples are specific to a particular query, and thus there can be no cumulative improvement in approximation accuracy across multiple queries. However, a moment’s reflection suggests that people are capable of such improvement. Every time you look out your window, you see a slightly different scene, but it would be wasteful to recompute a posterior over objects from scratch each time; if you did, you would be no faster at recognizing and locating objects the millionth time compared to the first time. Indeed, experimental research has found considerable speed-ups in object recognition and visual search when statistical regularities can be exploited (Oliva & Torralba, 2007).

Amortized inference is a generalization of the standard memoryless framework. We will formulate it in the most general possible terms, and later explore more specific variants.

Figure 1 illustrates the basic idea. In the standard, memoryless framework, an inference engine inverts a generative model P(d,h) over hypothesis h and data d to compute a recognition model Q_θ(h|d) parametrized by θ. For example, Monte Carlo methods use a set of samples to parametrize the recognition model. Importantly, the answer to each query is approximated using a different set of parameters (e.g., independent samples): Q_θ₁(h|d₁), Q_θ₂(h|d₂), etc. In the amortized framework, parameters are shared across queries. The parameters are selected to accurately approximate not just a single query, but a distribution of queries. If cognitive resources are unbounded, then the optimal solution is to parametrize each query separately, thereby recovering the memoryless framework. Under bounded resources, a finite number of parameters must be shared between multiple queries, leading to memory effects: the answer to one query affects the answer to other, similar queries.

Figure 1. Theory schematic. (Left) Standard, memoryless framework in which a recognition model Q_θ(h|d) approximates the posterior over hypothesis h given data d. The recognition model is parametrized by θ (e.g., a set of samples in the case of Monte Carlo methods). Memoryless inference builds a separate recognition model for each query. (Right) Amortized framework, in which the recognition model shares parameters across queries. After each new query, the recognition model updates the shared parameters. In this way, the model “learns to infer.”

While reuse increases computational efficiency, it can cause errors in two ways. First, if amortization is deployed not only when two queries are identical but also when they are similar, then answers will be biased due to blurring together of the distributions. This is analogous to interference effects in memory. Second, the answer to the first query might itself have been inaccurate or biased, so its reuse will propagate that inaccuracy to the second query’s answer. Our experiments focus on the second type of error. Specifically, we will investigate how the over- or underestimation of unpacked probabilities resulting from approximate inference for one query will continue to influence responses to a second query.

Two amortization strategies

In our experiments, we ask participants to sequentially answer pairs of queries (denoted Q1 and Q2). In Experiment 2, both queries are conditioned on the same cue object (d), but with varying query object sets (h). That is, both questions query the same probability distribution over objects, but elicit the probabilities of different objects in each case. So in theory, all samples taken to answer Q1 can be reused to answer Q2 (they are both samples from the same distribution). This sample reuse strategy allows all computations carried out for Q1 to be reused to answer Q2. However, it is expensive, because each sample must be stored in memory. A less memory-intensive solution is to store and reuse summary statistics of the generated samples, rather than the samples themselves. This summary reuse strategy offers greater efficiency but less flexibility. Several more sophisticated amortization schemes have been developed in the machine learning literature (e.g., Paige & Wood, 2016; Rezende et al., 2014; Stuhlmüller et al., 2013), but we focus on sample and summary reuse because they make clear experimental predictions, which we elaborate below.

In the context of our experiments, summary reuse is only applicable to problems where the answer to Q2 can be expressed as the composition of the answer to Q1 and another (putatively simpler) computation. In Experiment 2, Q2 queries a hypothesis space that is the union of the hypothesis space queried in Q1 and a disjoint hypothesis space. For example if Q1 is “What is the probability that there is an object starting with a C in the scene?”, Q2 could be “What is the probability that there is an object starting with a C or an R in the scene?”. In this case, samples generated in response to Q1 are summarized by a single number (“the probability of an object starting with C”), new samples are generated in response to a simpler query (“the probability of an object starting with R”), and these two numbers are then composed (in this case added) to give the final estimate for Q2 (“the probability of an object starting with C or R”). This is possible because both queries are functions of the same probability distribution over latent objects.
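The difference between the two strategies can be sketched in Python as follows. This is a deliberately simplified illustration: each sample is a single object rather than a full scene hypothesis, the sample lists are invented, and the code is not the implementation described in the Appendix.

```python
def fraction_in(samples, object_set):
    """Monte Carlo estimate: share of samples that fall in object_set."""
    return sum(s in object_set for s in samples) / len(samples)

C_OBJECTS = {"chair", "computer", "cup"}
R_OBJECTS = {"rug", "radio"}

# Hypothetical samples; q1_samples over-represents C objects, as a chain
# initialized with typical C exemplars might.
q1_samples = ["chair", "computer", "chair", "lamp", "cup",
              "sofa", "chair", "rug", "lamp", "computer"]
q2_samples = ["lamp", "rug", "sofa", "chair", "lamp",
              "radio", "sofa", "lamp", "cup", "sofa"]

q1_answer = fraction_in(q1_samples, C_OBJECTS)      # answer to Q1

# Memoryless: answer Q2 ("C or R") from the new samples only.
memoryless_q2 = fraction_in(q2_samples, C_OBJECTS | R_OBJECTS)

# Sample reuse: pool the stored Q1 samples with the new ones.
sample_reuse_q2 = fraction_in(q1_samples + q2_samples, C_OBJECTS | R_OBJECTS)

# Summary reuse: keep only the Q1 number, estimate P(R) afresh, then add.
summary_reuse_q2 = q1_answer + fraction_in(q2_samples, R_OBJECTS)
```

In this toy run both reuse strategies inherit the inflated Q1 estimate, so their Q2 answers exceed the memoryless one. Sample reuse has to store all ten Q1 samples, whereas summary reuse stores a single number.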

These strategies are simplifications of what the brain is likely doing. Re-using all the samples exactly is unreasonably resource intensive, and re-using only the exact statistic in the few places that the second query can be expressed as a composition of the first query and a simpler computation is unreasonably inflexible. We do not claim that either extreme is plausible, but, to a first approximation, they capture the key ideas motivating our theoretical framework, and more importantly, they make testable predictions which can be used to assess which extreme pulls more weight in controlled experiments.

In particular, sample-based and summary-based amortization strategies make different predictions about how subadditivity and superadditivity change as a function of the sample size (Figure 2, details of these implementations can be found in the Appendix). For sample-based amortization, as the sample size for Q1 grows, the effect for Q2 asymptotically diminishes and eventually vanishes as the effect of biased initialization in Q1 washes out. However, initially increasing the sample size for Q1 also amplifies the effects for Q2 under a sample-based scheme, because this leads to more biased Q1 samples being available for reuse. The amplification effect dominates up to a sample size of around 230 (estimate for the number of samples taken for inference in this domain, reported in Dasgupta, Schulz, & Gershman, 2017). This effect can be counteracted by increasing the sample size for Q2. These are unbiased samples, since Q2 is always presented as a packed query. More such samples will push the effect down by drowning out the bias with additional unbiased samples.

Figure 2. Simulation of subadditivity and superadditivity effects under sample-based (top) and summary-based (bottom) amortization strategies. In all panels, the y-axis represents the unstandardized effect size for Q2. Left panels show the effects of changing the sample size for Q1; right panels show the effects of changing the sample size for Q2. When sample size for one query is changed, sample size for the other query is held fixed at 230 (the sample size estimated by Dasgupta, Schulz, & Gershman, 2017).

Under a summary-based strategy, increasing the sample size for Q1 will only diminish the effects for Q2, because the bias from Q1 is strongest when the chain is close to its starting point. The effect of early, biased samples on the summary statistic disappears with more samples. We see also that changing the number of samples for Q2 does not influence the effect size because the initialization of the chain for Q2 is not influenced by the samples or summary statistic from the answer to Q1. Under the summary-based strategy, the subadditivity and superadditivity effects for Q2 derive entirely from the same effects for Q1, which themselves are driven by the initialization (see Dasgupta, Schulz, & Gershman, 2017).

We test the different predictions of these strategies by placing people under cognitive load during either Q1 or Q2 in Experiment 2, a manipulation that is expected to reduce the number of produced samples (Dasgupta, Schulz, & Gershman, 2017; Thaker et al., 2017). In this way, we can sample different parts of the curves shown in Figure 2.

Adaptive amortization

Amortization is not always useful. As we have already mentioned, it can introduce systematic bias into probabilistic judgments. This is especially true if samples or summary statistics are transferred between two dissimilar distributions. This raises the question: are human amortization algorithms adaptive? This question is taken up empirically in Experiment 3. Here we discuss some of the theoretical issues.

Truly adaptive amortization requires a method to assess similarities between queries. Imagine as an example the situation in which there is a “chair” in the scene and you have to evaluate the probability of any object starting with a “P”. If afterwards you are told that there is a “book” in another scene, and the task is again to evaluate the probability of any object starting with a “P”, it could be a viable strategy to reuse at least some of the previous computations. However, in order to do so efficiently, you would have to know how similar a chair is to a book, i.e. if they occur with a similar set of other objects on average. One way to quantify this similarity is by assessing the induced posterior over all objects conditioned on either “book” or “chair”, and then comparing the two resulting distributions directly. Cues that are more similar should co-occur with other objects in similar proportions.

To assess the similarity of two distributions over objects induced by two different cues, we will need a formal similarity measure. One frequently used measure of similarity between two probability distributions is the Kullback-Leibler (KL) divergence. For two discrete probability distributions Q and P, the KL divergence between P and Q is defined as

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}$$

The KL divergence is minimized to 0 when Q and P are identical. We will use this measure in Experiment 3 to select queries that are either similar or dissimilar, in order to examine whether our participants only exhibit signatures of amortization when the queries are similar.² Note, however, that the exact calculation of these divergences cannot be part of the algorithmic machinery used by humans to assess similarity, since it presupposes access to the posterior being approximated. Our experiments do not yet provide insight into how humans might achieve tractable adaptive amortization, a problem we leave to future research.
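As a brief illustration of how such a measure separates similar from dissimilar cues, the Python sketch below computes the KL divergence between toy posteriors over three objects. The cue names and all probabilities are invented for the example and are not the stimuli of Experiment 3.

```python
import math

def kl_divergence(p, q):
    """KL divergence of q from p for discrete distributions given as dicts.

    Assumes q[x] > 0 wherever p[x] > 0; terms with p[x] == 0 contribute 0.
    """
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

# Hypothetical posteriors over co-occurring objects for three cues.
given_chair = {"table": 0.60, "lamp": 0.35, "grass": 0.05}
given_book = {"table": 0.50, "lamp": 0.45, "grass": 0.05}
given_cow = {"table": 0.05, "lamp": 0.05, "grass": 0.90}

kl_same = kl_divergence(given_chair, given_chair)
kl_similar = kl_divergence(given_chair, given_book)
kl_dissimilar = kl_divergence(given_chair, given_cow)
```

Under these toy numbers the divergence is zero for identical distributions, small for the chair and book cues, and much larger for the chair and cow cues. Note that the measure is not symmetric, so the order of the arguments matters.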

Experiment 1

In Experiment 1, we seek initial confirmation of our central hypothesis: human inference is not memoryless. To detect these “remembrances of inferences past”, we ask participants to answer pairs of queries sequentially. The first query is manipulated (by packing or unpacking the queried hypothesis) in such a way that subadditive or superadditive probability judgments can be elicited (Dasgupta, Schulz, & Gershman, 2017). Crucially, the second query is always presented in packed form, so any differences across the experimental conditions in answers to the second query can only be attributed to the lingering effects of the first query.

Participants

84 participants (53 males, mean age=32.61, SD=8.79) were recruited via Amazon’s Mechanical Turk and received $0.50 for their participation plus an additional bonus of $0.10 for every on-time response.

Procedure

Participants were asked to imagine playing a game in which their friend sees a photo and then mentions one particular object present in the photo (the cued object). The participant is then queried about the probability that another class of objects (e.g., “objects beginning with the letter B”) is also present in the photo.

Each participant completed 6 trials, where the stimuli on every trial corresponded to the rows in Table 1. On each trial, participants first answered Q1 given the cued object (for example, “I see a lamb in this photo. What is the probability that I also see a window, a wardrobe, a wine rack, or any other object starting with a W?”), using a slider bar to report the conditional probability on a scale from 0 (not at all likely) to 100 (very likely; see also Figure 3). The Q1 framing (typical-unpacked, atypical-unpacked or packed) was chosen randomly. Participants then completed the same procedure for Q2 (immediately after Q1), conditional on the same cued object. The framing for Q2 was always packed and Q2 was always presented as a disjunction (for example, “What is the probability I see an object starting with a W or F?”), where the order of the letters was determined at random.

Figure 3. Experimental setup. Participants were asked to estimate the conditional probability using a slider bar within a 20-second time limit.

Table 1 Experimental stimuli and queries for Experiment 1.

Results

Six participants were excluded from the following analysis: four failed to respond on time on more than half of the questions, and two entered the same response throughout.

Consistent with our previous studies (Dasgupta, Schulz, & Gershman, 2017), we found both subadditivity and superadditivity effects for Q1, depending on the unpacking: probability judgments were higher for unpacked-typical queries than for packed queries (a subadditivity effect; 59.35 vs. 49.67; t(77) = 4.03, p < 0.001) and lower for unpacked-atypical than for packed queries (a superadditivity effect; 31.42 vs. 49.67; t(77) = -6.44, p < 0.001).

Next we calculated the difference between each participant’s response to every query and the mean packed response to the same queried object. This difference was then entered as a dependent variable into a linear mixed effects regression with random effects for both participants and queried objects as well as a fixed effect for the condition. The resulting estimates revealed both a significant subadditivity (difference = 12.60 ± 1.25, t(610.49) = 10.083, p < 0.0001) and superadditivity (difference = -15.69 ± 1.32, t(615.46) = -11.89, p < 0.0001) effect.

Additionally, we found evidence that participants reused calculations from Q1 for Q2: even though all Q2 queries were presented in the same format (as packed), the estimates for that query differed depending on how Q1 was presented. In particular, estimates for Q2 were lower when Q1 was unpacked to atypical exemplars (46.38 vs. 56.83; t(77) = 5.08, p < 0.001), demonstrating a superadditivity effect that carried over from one query to the next. We did not find an analogous carry-over effect for subadditivity (58.47 vs. 56.83; t(77) = 0.72, p = 0.4), possibly due to the subadditivity effect “washing out” more quickly (i.e., with fewer samples) than superadditivity, as has been observed in this domain before (see Dasgupta, Schulz, & Gershman, 2017).

We calculated the difference between each participant’s response for every Q2 and the mean response for the same object averaged over all responses to Q2 conditional on Q1 being packed. The resulting difference was again entered as the dependent variable into a linear mixed effects regression with both participants and cued object as random effects as well as condition as a fixed effect. The resulting estimates showed both a significant subadditivity (difference = 4.39 ± 1.14, t(606.40) = 3.83, p < 0.001) and superadditivity (difference = -7.86 ± 1.21, t(610.41) = -6.50, p < 0.0001) effect.

We calculated each participant’s mean response to all packed hypotheses for Q2 over all trials as a baseline measure and then assessed the difference between each condition’s mean response and this mean packed response. This resulted in a measure of an average effect size for the Q2 responses (how much each participant under- or overestimates different hypotheses as compared to an average packed hypothesis). Results of this calculation are shown in Figure 4.

Figure 4. Experiment 1: Differences between Q2 responses for each condition and an average packed baseline. A negative relative mean estimate indicates a superadditivity and a positive relative mean estimate a subadditivity effect. Error bars represent the standard error of the mean.

The superadditivity effect was significantly greater than 0 (t(77) = 5.07, p < 0.001). However, the subadditivity effect did not differ significantly from 0 (t(77) = -0.42, p > 0.6; see also Dasgupta, Schulz, & Gershman, 2017).

Next, we explored whether responses to Q1 predicted trial-by-trial variation in responses to Q2. Figure 5 shows the difference between participants’ estimates for Q1 and the true underlying probability of the query (as derived by letting our MCMC model run until convergence) plotted against the same difference for Q2. If participants do indeed reuse computations, then how much their estimates deviate from the underlying truth for Q1 should be predictive for the deviance of their estimates for Q2.

Figure 5. Trial-by-trial analyses of Experiment 1. Difference between Q1 responses and true probability (as assessed by our MCMC model) plotted against the same quantity for Q2. Lines show the least-squares fit with standard error bands.

We found significant positive correlations between the two queries in all conditions when aggregating across participants (average correlation: r = 0.67, p < 0.01). The same conclusion can be drawn from analyzing correlations within participants and then testing the average correlation against 0 (r = 0.55, p < 0.01). Moreover, the within-participant effect size (the response difference between the unpacked conditions and the packed condition) for Q1 was correlated with responses to Q2 for both atypical (r = 0.35, p < 0.01) and typical (r = 0.21, p < 0.05) unpacking conditions. This means that participants who showed greater subadditivity or superadditivity for Q1 also showed correspondingly greater effects for Q2.

Discussion

Experiment 1 established a memory effect in probabilistic inference: answers to a query are influenced by answers to a previous query, thereby providing evidence for amortization. In particular, both a sub- and a superadditivity effect induced at Q1 carried over to Q2, and participants showing stronger effect sizes for both sub- and superadditivity for Q1 also showed greater effects for Q2.

Experiment 2

Our next experiment sought to discriminate between sample-based and summary-based amortization strategies. We follow the logic of the simulations shown in Figure 2, manipulating cognitive load at Q1 and Q2 in order to exogenously control the number of samples (see Dasgupta, Schulz, & Gershman, 2017; Thaker et al., 2017, for a similar approach).

In addition to cognitive load, we manipulate the “overlap” of Q1 with Q2, by creating a new set of “no overlap” queries with no overlap between the hypothesis spaces of the query pairs. We predicted that we would only see a memory effect for queries with overlapping pairs. This manipulation allows us to rule out an alternative trivial explanation of our results: numerical anchoring (high answers to the first query lead to high answers to the second query). If the apparent memory effect was just due to anchoring, we would expect to see the effect regardless of query overlap, contrary to our predictions.