Abstract
We subjectively perceive our visual field with high fidelity, yet large peripheral distortions can go unnoticed and peripheral objects can be difficult to identify (crowding). A recent paper proposed a model of the mid-level ventral visual stream in which neural responses were averaged over an area of space that increased as a function of eccentricity (scaling). Human participants could not discriminate synthesised model images from each other (they were metamers) when scaling was about half the retinal eccentricity. This result implicated ventral visual area V2 and approximated “Bouma’s Law” of crowding. It has subsequently been interpreted as a link between crowding zones, receptive field scaling, and our rich perceptual experience. However, participants in this experiment never saw the original images. We find that participants can easily discriminate real and model-generated images at V2 scaling. Lower scale factors than even V1 receptive fields may be required to generate metamers. Efficiently explaining why scenes look as they do may require incorporating segmentation processes and global organisational constraints in addition to local pooling.
Introduction
Vision science seeks to understand why things look as they do (Koffka 1935). Typically, our entire visual field looks subjectively crisp and clear. Yet our perception of the scene falling onto the peripheral retina is actually limited by at least three distinct sources: the optics of the eye, retinal sampling, and the mechanism(s) giving rise to crowding, in which our ability to identify and discriminate objects in the periphery is limited by the presence of nearby items (Bouma 1970; Pelli and Tillman 2008).1 Thus we can be insensitive to significant changes in the world despite our rich subjective experience.
Visual crowding has been characterised as compulsory texture perception (Parkes et al. 2001; Lettvin 1976) and compression (Balas, Nakano, and Rosenholtz 2009; Rosenholtz, Huang, and Ehinger 2012). This idea entails that we cannot perceive the precise structure of the visual world in the periphery. Rather, we are aware only of the summary statistics or ensemble properties of visual displays, such as the average size or orientation of a group of elements (Ariely 2001; Dakin and Watt 1997). One of the appeals of the summary statistic idea is that it can be directly motivated from the perspective of efficient coding: it is a form of compression. Image-computable texture summary statistics have been shown to be correlated with human performance in various tasks requiring the judgment of peripheral information, such as crowding and visual search (Rosenholtz et al. 2012; Balas, Nakano, and Rosenholtz 2009; Freeman and Simoncelli 2011; Rosenholtz 2016; Ehinger and Rosenholtz 2016). Recently, it has even been suggested that summary statistics underlie our rich phenomenal experience itself—in the absence of focussed attention, we perceive only a texture-like visual world (Cohen, Dennett, and Kanwisher 2016).
Across many tasks, summary statistic representations seem to capture aspects of peripheral vision when their pooling corresponds to “Bouma’s Law” (Rosenholtz et al. 2012; Balas, Nakano, and Rosenholtz 2009; Freeman and Simoncelli 2011; Wallis and Bex 2012; Ehinger and Rosenholtz 2016). Bouma’s Law states that objects will crowd (correspondingly, statistics will be pooled) over spatial regions corresponding to about half the retinal eccentricity (Bouma 1970; Pelli and Tillman 2008; though see Rosen, Chakravarthi, and Pelli 2014). If the visual system does indeed represent the periphery using summary statistics, then Bouma’s scaling implies that as retinal eccentricity increases, increasingly large regions of space are texturised by the visual system. If a model captured these statistics and their pooling, images could be created that are indistinguishable from the original despite being physically different (metamers). These images would be equivalent to the model and to the human visual system (Freeman and Simoncelli 2011; Wallis, Bethge, and Wichmann 2016; Portilla and Simoncelli 2000; Koenderink et al. 2017).
Freeman and Simoncelli (2011) developed a model (hereafter, FS-model) in which texture-like summary statistics were pooled over spatial regions inspired by the receptive fields in primate visual cortex. The size of neural receptive fields in ventral visual stream areas increases as a function of retinal eccentricity, and as one moves downstream from V1 to V2 and V4 at a given eccentricity. Each visual area therefore has a signature scale factor, defined as the ratio of the receptive field diameter to retinal eccentricity (Freeman and Simoncelli 2011). Similarly, the pooling regions of the FS-model also increase with retinal eccentricity with a definable scale factor. New images could be synthesised that matched the summary statistics of original images at this scale factor. As scale factor increases, texture statistics are pooled over increasingly large regions of space, resulting in more distorted synthesised images relative to the original (that is, more information is discarded).
The maximum scale factor for which the images remain indistinguishable (the critical scale) characterises perceptually-relevant compression in the visual system’s representation. If the scale factor of the model corresponded to the scaling of the visual system in the responsible visual area, and information in upstream areas was irretrievably lost, then the images synthesised by the model should be indistinguishable while discarding as much information as possible. That is, we seek the maximum compression that is perceptually lossless. Larger scale factors would discard more information than the relevant visual area and therefore the images should look different. Smaller scale factors preserve information that could be discarded without any perceptual effect.
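The scale-factor definition above (pooling diameter as a fixed fraction of retinal eccentricity) is a one-line linear relationship. The following minimal sketch (our own illustration, not code from the paper; the function name is ours) makes the compression trade-off concrete:

```python
def pooling_diameter(eccentricity_deg: float, scale: float) -> float:
    """Diameter (deg) of a pooling region whose size is a fixed
    fraction (the scale factor) of retinal eccentricity."""
    return scale * eccentricity_deg

# At 6 deg eccentricity (the average in this study), V2-like scaling
# (0.5) pools over a 3-deg region; halving the scale factor halves the
# pooled region, preserving more structure at the cost of compression.
print(pooling_diameter(6.0, 0.5))   # 3.0
print(pooling_diameter(6.0, 0.25))  # 1.5
```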
Crucially, it is the minimum critical scale over images that is important for the scaling theory. If the visual system computes summary statistics over fixed (image-independent) pooling regions in the same way as the model, then the model must be able to produce metamers for all images. While images may vary in their individual critical scales, the image with the smallest critical scale determines the maximum compression for appearance to be matched in general.
Freeman and Simoncelli showed that the largest scale factor for which two synthesised images could not be told apart was 0.5, or pooling regions of about half the eccentricity. This scaling matched the signature of area V2, and also matched the approximate value of Bouma’s Law. Subsequently, this result has been interpreted as a link between receptive field scaling, crowding, and our rich phenomenal experience (e.g. Block 2013, Cohen, Dennett, and Kanwisher (2016), Landy (2013), Movshon and Simoncelli (2014), Seth (2014)). These interpretations imply that the FS-model creates metamers for natural scenes. However, observers in Freeman and Simoncelli’s experiment never saw the original scenes, but only compared synthesised images to each other. Showing that two model samples are indiscriminable from each other could yield trivial results. For example, two white noise samples matched to the mean and contrast of a natural scene would be easy to discriminate from the scene but hard to discriminate from each other. Wallis, Bethge and Wichmann (2016) showed that observers could easily discriminate Portilla and Simoncelli (2000) textures from original images in the periphery, but did not test the FS-model. The Portilla and Simoncelli model makes no explicit connection to neural receptive field scaling. In addition, relative to the textures tested by Wallis et al., the pooling region overlap used in the FS-model provides a strong constraint on the resulting syntheses, making the images much more similar to the originals. It is therefore entirely possible that the FS-model produces metamers for natural scenes for scale factors of 0.5. Here we test this, and compare the results to our own model using CNN texture features.
Results
We tested whether the FS-model can produce metamers using an oddity design in which the observer had to pick the odd image out of three successively shown images (Fig 1E). In a 3-alternative oddity paradigm, performance for metamerism would lie at 0.33 (dashed horizontal line, Figure 1F). We used two comparison conditions: either observers compared two model syntheses to each other (synth vs synth; as in Freeman and Simoncelli 2011) or the original image to a model synthesis (orig vs synth). As in the original paper (Freeman and Simoncelli 2011) we measured the performance of human observers for images synthesised with different scale factors (using Freeman and Simoncelli’s code, see Methods). To quantify the critical scale factor we fit the same nonlinear model as Freeman and Simoncelli, which parameterises sensitivity as a function of critical scale and gain, as a mixed-effects model with random effects of participant and image (see Methods).
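The chance level in the three-alternative oddity task (the 0.33 dashed line) follows from guessing the oddball's position at random; a quick simulation confirms it (a hypothetical sketch, not part of the experiment code):

```python
import random

def oddity_chance(n_trials: int = 100_000, m: int = 3, seed: int = 1) -> float:
    """Proportion correct when the oddball position and the observer's
    guess are both uniformly random over the m intervals."""
    rng = random.Random(seed)
    hits = sum(rng.randrange(m) == rng.randrange(m) for _ in range(n_trials))
    return hits / n_trials

# Random guessing converges on 1/m = 1/3 correct.
```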
In previous experiments (see Supplementary Figures S8 and S9), we observed that texture-like distortions are more visible when they fall over image regions containing inhomogeneous structure, long edges, or borders between different surfaces than when they fall into texture-like regions. We therefore explicitly compared “scene-like” images containing inhomogeneous structures to “texture-like” images containing more homogeneous or periodically-patterned content in the periphery. We hand-selected ten images from each class (Figure 1A and Figure 1B)2.
When participants compared synthesised images to each other as in Freeman and Simoncelli (Figure 1F, synth vs synth), there was little evidence that the critical scale depended on the image content (scene- vs texture-like). The difference in critical scale between texture-like and scene-like images was 0.09, 95% CI [-0.03, 0.24], p(β < 0) = 0.078. This result is consistent with Freeman and Simoncelli, who reported no dependency on image. It seems likely that this is because comparing synthesised images to each other means that the model has removed higher-order structure that might allow discrimination. All images appear distorted, and the task becomes one of identifying a specific distortion pattern. While the critical scales we find are somewhat lower than those reported by Freeman and Simoncelli (2011; Figure 1G), they are within the range of other reported critical scale factors (Freeman and Simoncelli 2013). One striking difference between our results for synth vs synth and those of Freeman and Simoncelli is that the performance of our participants was quite poor even for large scale factors. This may be because we used more images in our experiment than Freeman and Simoncelli, so participants were less familiar with the distortions that could appear. We have replicated these results in an experiment using the same ABX task as in Freeman and Simoncelli (Figure S5).
Comparing the original image to model syntheses yielded a different pattern of results. First, participants are able to discriminate the original images from their FS-model syntheses at scale factors of 0.5 (Figure 1F). Performance lies well above chance for all participants: these images are not metamers. This holds for both scene-like and texture-like images. Furthermore, critical scale depends on the image type. Model syntheses match the texture-like images on average with scale factors of approximately 0.25 (the smallest value we could generate using the FS-model). In contrast, the scene-like images are quite discriminable from their model syntheses at this scale. Correspondingly, texture-like images had higher critical scales than scene-like images on average (0.13, 95% CI [0.06, 0.22], p(β < 0) = 0.001). Thus, smaller pooling regions are required to make metamers for scene-like images than for texture-like images.
As noted above, the image with the minimum critical scale determines the largest compression that can be applied for the scaling model to hold. For two images (Figure 2A and E) the nonlinear mixed-effects model estimated critical scales of approximately 0.14 (see Figure 1G, diamonds). However, examining the individual data for these images (Figure 2D and H) reveals that these critical scale estimates are largely determined by the hierarchical nature of the mixed-effects model, not the data itself. Both images remain highly discriminable for the lowest scale factor we could generate. This suggests that the mixed-effects model critical scale may be an overestimate of the true scale factor required to generate metamers. Thus, human observers are highly sensitive to the relatively small distortions produced by the FS-model at scale factors of 0.25 (compare Figure 2B and F at scale 0.25 and C and G at scale 0.46 to images A and B).
Discussion
It is a popular idea that the appearance of scenes in the periphery is described by summary statistic textures captured at the scaling of V2 neural populations. In contrast, here we show that humans are very sensitive to the difference between original and model-matched images at this scale (Figure 1). A recent preprint (Deza, Jonnalagadda, and Eckstein 2017) finds a similar result in a set of 50 images, and our results are also consistent with the speculations made by Wallis et al. based on their experiments with Portilla and Simoncelli textures (Wallis, Bethge, and Wichmann 2016). Together, these results show that the pooling of texture-like features in the FS-model at the scaling of V2 receptive fields does not explain the appearance of natural images.
If the peripheral appearance of visual scenes is explained by image-independent pooling of texture-like features, then the pooling regions must be small. Consider that participants in our experiment could easily discriminate the images in Figure 2B and F from those in Figure 2A and E respectively. Therefore, to be truly metameric, synthesised images must remain extremely close to the original: the pooling regions must be at least as small as V1 receptive fields, and likely even smaller (Figure 2). This may even be consistent with scaling in precortical visual areas. For example, the scaling of retinal ganglion cell receptive fields at the average eccentricity of our stimuli (6 degrees) is approximately 0.08 for the surround (Croner and Kaplan 1995) and 0.009 for the centre (Dacey and Petersen 1992). It becomes questionable how much is learned about compression in the visual system using such an approach, beyond the aforementioned, relatively well-studied limits of optics and retinal sampling (e.g. Wandell 1995; Watson 2014).
Furthermore, it can be seen via simple demonstration that we can be quite insensitive to even large texture-like distortions, so long as these fall on texture-like regions of the input image. The “China Lane” sign in Figure 3A has been distorted in B (using texture features from Gatys et al (2015)). The same type of distortion in a texture-like region of the image is far less visible (the brickwork in the image centre; FS-model result Figure 3C; see also Figure S13). It is the image content, not retinal eccentricity, that is the primary determinant of the visibility of at least some summary statistic distortions. Requiring information to be preserved at V1 or smaller scaling would therefore be rather inefficient from the standpoint of compression: small scale factors will preserve texture-like structure that could be compressed without affecting appearance.
It may seem trivial that a texture statistic model better captures the appearance of textures than non-textures. However, if the human visual system represents the periphery as a set of textures, and these models are sufficient approximations of this representation, then image content should not matter, because scene-like retinal inputs in the periphery are transformed into textures by the visual system.
Perhaps the texture scaling theory might hold at larger scales but the FS-model texture features themselves are insufficient to capture natural scene appearance. To test whether improved texture features (Gatys, Ecker, and Bethge 2015; Wallis et al. 2017) could help in matching appearance for scenes, we developed a new model (CNN-model) that was inspired by the FS-model but uses the texture features of a convolutional neural network (see Methods and Figures S6–9). The CNN-model shows very similar behaviour to the FS-model such that human performance for scene-like images is higher than for texture-like images (triangles in Figure 1D and Figure 2), and the CNN-model also fails to create metamers for all images (see also Figures S9, S12–13). Furthermore, the NeuroFovea model of Deza et al (2017), which like our CNN-model uses deep neural network texture features, also fails to capture scene appearance. Together, these results show that no known summary statistic pooling model is sufficient to match the appearance of arbitrary natural scenes at computationally feasible scale factors.
What, then, is the missing ingredient that could capture appearance while compressing as much information as possible? Through the Gestalt tradition, it has long been known that the appearance of local image elements can crucially depend on the context in which they are placed. For example, Saarela et al (2009) and Manassi et al (2013) found that global stimulus configuration modulates crowding (see also Vickery et al. 2009), and the results of Neri (2017) suggest that early global segmentation processes influence local perceptual sensitivity. Potentially, global scene organisation needs to be considered if one wants to capture appearance—yet current models that texturise local regions do not explicitly include perceptual organisation (Herzog et al. 2015). We speculate that segmentation and grouping processes are critical for efficiently matching scene appearance, and therefore the approach of uniformly computing summary statistics without including these processes will require preserving much of the original image structure by making pooling regions very small. A parsimonious model capable of compressing as much information as possible might need to adapt either the size and arrangement of pooling regions or the feature representations to the image content.
Our results do not undermine the considerable empirical support for the periphery-as-summary-statistic theory as a description of visual performance. Humans can judge summary statistics of visual displays (Ariely 2001; Dakin and Watt 1997), summary statistics can influence judgments where other information is lost (Fischer and Whitney 2011; Faivre, Berthet, and Kouider 2012), and the information preserved by summary statistic stimuli may offer an explanation for performance in various visual tasks (Rosenholtz et al. 2012; Balas, Nakano, and Rosenholtz 2009; Rosenholtz, Huang, and Ehinger 2012; Keshvari and Rosenholtz 2016; Chang and Rosenholtz 2016; Zhang et al. 2015; Whitney, Haberman, and Sweeny 2014; Long et al. 2016; though see Agaoglu and Chung 2016; Herzog et al. 2015; Francis, Manassi, and Herzog 2017). Texture-like statistics may even provide the primitives—appropriately organised on a global scale—from which form is constructed (Lettvin 1976). However, one additional point merits further discussion. The studies by Rosenholtz and colleagues primarily test summary statistic representations by showing that performance with summary statistic stimuli viewed foveally is correlated with peripheral performance with real stimuli. This means that the summary statistic preserves sufficient information to explain the performance of tasks in the periphery. Our results show that these summary statistics are insufficient to match scene appearance, at least under the pooling scheme used in the Freeman and Simoncelli model at computationally feasible scales. This shows the usefulness of scene appearance matching as a test: a parsimonious model that matches scene appearance would be expected to also preserve enough information to show correlations with peripheral task performance; the converse does not hold.
While it may be useful to consider summary statistic pooling in accounts of visual performance, to say that summary statistics can account for phenomenological experience of the visual periphery (Cohen, Dennett, and Kanwisher 2016; see also Block 2013; Seth 2014) seems premature in light of our results (see also Haun et al. 2017). Cohen et al (2016) additionally posit that focussed spatial attention can in some cases overcome the limitations imposed by a summary statistic representation. We instead find little evidence that participants’ ability to discriminate real from synthesised images is improved by cueing spatial attention, at least in our experimental paradigm and for our CNN-model (Figure S11).
One exciting aspect of Freeman and Simoncelli (2011) was the promise of inferring a critical brain region via a receptive field size prediction derived from psychophysics. Indeed, aspects of this promise have since received empirical support: the presence of texture-like features can discriminate V2 neurons from V1 neurons (Freeman et al. 2013; Ziemba et al. 2016; see also Okazawa, Tajima, and Komatsu 2015). Discarding all higher-order structure not captured by the candidate model by comparing syntheses to each other, thereby isolating only features that change, may therefore be a useful way to distinguish sequential feedforward processing stages in neurons. On the other hand, explaining appearance is—to return to Koffka—a grand goal of vision science. For this the original vs synthesised comparison is key. Our results suggest that it is wrong to believe that “appearance”, even only peripheral appearance, can be tied solely to the scaling of receptive fields in any single brain region or to Bouma’s Law.
Methods
All stimuli, data and code to reproduce the figures and statistics reported in this paper are available at http://dx.doi.org/10.5281/zenodo.1475112. This document was prepared using the knitr package (Xie 2013, 2015) in the R statistical environment (R Core Team 2017; Wickham and Francois 2016; Wickham 2009, 2011; Auguie 2016; Arnold 2016) to improve its reproducibility.
Participants
Eight observers participated in the experiment: authors CF and TW, a research assistant unfamiliar with the experimental hypotheses, and five naïve participants recruited from an online advertisement pool who were paid 10 Euro / hr for two one-hour sessions. An additional naïve participant was recruited but showed insufficient eyetracking accuracy (see below). All participants signed a consent form prior to participating. Participants reported normal or corrected-to-normal visual acuity. All procedures conformed to Standard 8 of the American Psychological Association’s “Ethical Principles of Psychologists and Code of Conduct” (2010).
Stimuli
We selected 10 “scene-like” and 10 “texture-like” source images from the MIT 1003 scene dataset (Judd, Durand, and Torralba 2012; Judd et al. 2009). We used images from this dataset to allow better comparison to related experiments (see Supplementary Material). A square was cropped from the center of the original image and downsampled to 512 x 512 px. The images were converted to grayscale and standardized to have a mean gray value of 0.5 (scaled [0,1]) and an RMS contrast (σ/µ) of 0.3.
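The standardisation described above could be implemented as follows (a sketch under the stated mean and RMS-contrast targets; it omits clipping of any values pushed outside [0, 1], which a real pipeline would need to handle):

```python
import numpy as np

def standardise(img: np.ndarray, mean: float = 0.5, rms: float = 0.3) -> np.ndarray:
    """Rescale a grayscale image (values in [0, 1]) to a fixed mean
    gray value and RMS contrast, defined here as sigma / mu."""
    z = (img.astype(float) - img.mean()) / img.std()  # zero mean, unit sd
    return mean + z * (rms * mean)  # sd becomes rms * mean, so sigma/mu = rms

rng = np.random.default_rng(0)
out = standardise(rng.random((512, 512)))
# out.mean() is ~0.5 and out.std() / out.mean() is ~0.3 by construction
```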
Freeman and Simoncelli syntheses
We synthesised images using the FS-model (Freeman and Simoncelli 2011, code available from https://github.com/freeman-lab/metamers). Four unique syntheses were created for each source image at each of eight scale factors (0.25, 0.36, 0.46, 0.59, 0.7, 0.86, 1.09, 1.45), using 50 gradient steps as in Freeman and Simoncelli. Pilot experiments with stimuli generated with 100 gradient steps produced similar results. To successfully synthesise images at scale factors of 0.25 and 0.36 it was necessary to increase the central region of the image in which the original pixels were perfectly preserved (pooling regions near the fovea become too small to compute correlation matrices). Scales of 0.25 used a central radius of 32 px (0.8 dva in our viewing conditions) and scales 0.36 used 16 px (0.4 dva). This change should, if anything, make syntheses even harder to discriminate from the original image. All other parameters of the model were as in Freeman and Simoncelli. Synthesising an image with scale factor 0.25 took approximately 35 hours, making a larger set of syntheses or source images infeasible. It was not possible to reliably generate images with scale factors lower than 0.25 using the code above.
CNN model syntheses
The CNN pooling model (triangles in Figure 1) was inspired by the model of Freeman and Simoncelli, with two primary differences: first, we replaced the Portilla and Simoncelli (2000) texture features with texture features derived from a convolutional neural network (Gatys, Ecker, and Bethge 2015), and second, we simplified the “foveated” pooling scheme for computational reasons. Specifically, for the CNN 32 model presented above, the image was divided into 32 angular regions and 28 radial regions, spanning the outer border of the image and an inner radius of 64 px. Within each of these regions we computed the mean activation of the feature maps from a subset of the VGG-19 network layers (conv1_1, conv2_1, conv3_1, conv4_1, conv5_1). To better capture long-range correlations in image structure, we computed these radial and angular regions over three spatial scales, by computing three networks over input sizes 128, 256 and 512 px. Using this multiscale radial and angular pooling representation of an image, we synthesised new images to match the representation of the original image via iterative gradient descent (Gatys, Ecker, and Bethge 2015). Specifically, we minimised the mean-squared distance between the pooled feature representations of the original and the synthesised image, starting from Gaussian noise outside the central 64 px region, using the L-BFGS optimiser as implemented in scipy (Jones, Oliphant, and Peterson 2001) for 1000 gradient steps, which we found in pilot experiments was sufficient to produce small (but not zero) loss. Further details, including tests of other variants of this model, are provided in the Supplement.
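The angular/radial pooling layout can be sketched as follows. This is our illustrative reconstruction, not the authors' code: the radial spacing is assumed to be log-scaled so that regions grow with eccentricity, and the actual implementation may differ in detail.

```python
import numpy as np

def pooling_bins(size: int = 512, n_angular: int = 32, n_radial: int = 28,
                 r_inner: float = 64.0):
    """Assign each pixel an (angular, radial) pooling-bin index between
    an inner radius and the image border. Radial bins are log-spaced
    here (an assumption) so that regions grow with eccentricity."""
    y, x = np.mgrid[:size, :size] - (size - 1) / 2.0
    r = np.hypot(x, y)
    theta = np.arctan2(y, x)  # in [-pi, pi]
    ang = np.minimum((theta + np.pi) / (2 * np.pi) * n_angular,
                     n_angular - 1).astype(int)
    r_outer = size / 2.0
    frac = np.log(np.clip(r, r_inner, r_outer) / r_inner) / np.log(r_outer / r_inner)
    rad = np.minimum((frac * n_radial).astype(int), n_radial - 1)
    return ang, rad

def pooled_means(feature_map, ang, rad, n_angular=32, n_radial=28):
    """Mean activation of one feature map within each pooling region;
    one number per region per map is the statistic to be matched."""
    out = np.zeros((n_angular, n_radial))
    for a in range(n_angular):
        for b in range(n_radial):
            mask = (ang == a) & (rad == b)
            if mask.any():
                out[a, b] = feature_map[mask].mean()
    return out
```

Synthesis then amounts to gradient descent on a new image until its pooled means match those of the original across all layers and scales.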
Equipment
Stimuli were displayed on a VIEWPixx 3D LCD (VPIXX Technologies Inc., Saint-Bruno-de-Montarville, Canada; spatial resolution 1920 x 1080 pixels, temporal resolution 120 Hz, operating with the scanning backlight turned off in normal colour mode). Outside the stimulus image the monitor was set to mean grey. Participants viewed the display from 57 cm (maintained via a chinrest) in a darkened chamber. At this distance, pixels subtended approximately 0.025 degrees on average (approximately 40 pixels per degree of visual angle). The monitor was linearised (maximum luminance 260 cd/m²) using a Konica-Minolta LS-100 (Konica-Minolta Inc., Tokyo, Japan). Stimulus presentation and data collection were controlled via a desktop computer (Intel Core i5-4460 CPU, AMD Radeon R9 380 GPU) running Ubuntu Linux (16.04 LTS), using the Psychtoolbox Library (version 3.0.12, Brainard 1997; Kleiner, Brainard, and Pelli 2007; Pelli 1997), the Eyelink toolbox (Cornelissen, Peters, and Palmer 2002) and our internal iShow library (http://dx.doi.org/10.5281/zenodo.34217) under MATLAB (The Mathworks Inc., Natick MA, USA; R2015b). Participants’ gaze position was monitored by an Eyelink 1000 (SR Research) video-based eyetracker.
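The quoted pixels-per-degree figure can be checked with the standard visual-angle formula (a worked sketch; the 0.025 cm pixel pitch is an assumed value consistent with the figures quoted above, not an independently measured one):

```python
import math

def degrees_per_pixel(pixel_cm: float, distance_cm: float = 57.0) -> float:
    """Visual angle (deg) subtended by one pixel at the chinrest distance."""
    return math.degrees(2 * math.atan(pixel_cm / (2 * distance_cm)))

# At 57 cm, 1 cm on screen subtends roughly 1 deg of visual angle (the
# classic reason for this viewing distance), so a ~0.025 cm pixel
# subtends ~0.025 deg, i.e. ~40 px per degree of visual angle.
print(round(1 / degrees_per_pixel(0.025)))  # 40
```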
Procedure
On each trial participants were shown three images in succession; two images were identical, one image was different (the “oddball”, which could occur first, second or third with equal probability). The oddball could be either a synthesised or a natural image (in the orig vs synth condition; counterbalanced), whereas the other two images were physically the same as each other and from the opposite class as the oddball. In the synth vs synth condition (as used in Freeman and Simoncelli), both oddball and foil images were (physically different) model synths. The participant identified the temporal position of the oddball image via button press. Participants were told to fixate on a central point (Thaler et al. 2013) presented in the center of the screen. The images were centred around this spot and displayed with a radius of 512 pixels (i.e. images were upsampled by a factor of two for display), subtending ≈ 12.8° at the eye. Images were windowed by a circular cosine, ramping the contrast to zero in the space of 52 pixels. The stimuli were presented for 200 ms, with an inter-stimulus interval of 1000 ms (making it unlikely participants could use motion cues to detect changes), followed by a 1200 ms response window. Feedback was provided by a 100 ms change in fixation cross brightness. Gaze position was recorded during the trial. If the participant moved the eye more than 1.5 degrees away from the fixation spot, the trial immediately ended and no response was recorded; participants saw a feedback signal (sad face image) indicating a fixation break. Prior to the next trial, the state of the participant’s eye position was monitored for 50 ms; if the eye position was reported as more than 1.5 degrees away from the fixation spot a recalibration was triggered. The inter-trial interval was 400 ms.
Scene-like and texture-like images were compared under two comparison conditions (orig vs synth and synth vs synth; see main text). Image types and scale factors were randomly interleaved within a block of trials (with a minimum of one trial from another image in between) whereas comparison condition was blocked. Participants first practiced the task and fixation control in the orig vs synth comparison condition (scales 0.7, 0.86 and 1.45); the same images used in the experiment were also used in practice to familiarise participants with the images. Participants performed at least 60 practice trials, and were required to achieve at least 50% correct responses and fewer than 20% fixation breaks before proceeding (as noted above, one participant failed). Following successful practice, participants performed one block of orig vs synth trials, which consisted of five FS-model scale factors (0.25, 0.36, 0.46, 0.59, 0.86) plus the CNN 32 model, repeated once for each image to give a total of 120 trials. The participant then practiced the synth vs synth condition for at least one block (30 trials), before continuing to a normal synth vs synth block (120 trials; scale factors of 0.36, 0.46, 0.7, 0.86, 1.45). Over two one-hour sessions, naïve participants completed a total of four blocks of each comparison condition in alternating order (except for one participant who ran out of time to complete the final block). Authors performed more blocks (total 11).
Data analysis
We discarded trials in which participants made no response (N = 66) or broke fixation (N = 239), leaving a total of 7555 trials for further analysis. To quantify the critical scale as a function of the scale factor s, we used the same 2-parameter function for discriminability d′ fitted by Freeman and Simoncelli:

d′(s) = α √(1 − s_c² / s²) for s > s_c, and d′(s) = 0 otherwise,

consisting of the critical scale s_c (below which the participant cannot discriminate the stimuli) and a gain parameter α (asymptotic performance level in units of d′). This d′ value was transformed to proportion correct using a Weibull function as in Wallis et al (2016):

P(correct) = 1/m + (1 − 1/m) · (1 − exp(−(d′/λ)^k)),
with m set to three (the number of alternatives), and scale λ and shape k parameters chosen by minimising the squared difference between the Weibull and simulated results for oddity as in Craven (1992). The posterior distribution over model parameters (sc and α) was estimated in a nonlinear mixed-effects model with fixed effects for the experimental conditions (comparison and image type) and random effects for participant (crossed with comparison and image type) and image (crossed with comparison, nested within image type), assuming binomial variability. Estimates were obtained by a Markov Chain Monte Carlo (MCMC) procedure implemented in the Stan language (version 2.16.2, Stan Development Team 2017; Hoffman and Gelman 2014), with the model wrapper package brms (version 1.10.2, Bürkner 2017) in the R statistical environment. The model parameters were given weakly-informative prior distributions, which provide information about the plausible scale of parameters but do not bias the direction of inference. Specifically, both critical scale and gain were estimated on the natural logarithmic scale; the mean log critical scale (intercept) was given a Gaussian distribution prior with mean -0.69 (corresponding to a critical scale of approximately 0.5—i.e. centred on the result from Freeman and Simoncelli) and sd 1, other fixed-effect coefficients were given Gaussian priors with mean 0 and sd 0.5, and the group-level standard deviation parameters were given positive-truncated Cauchy priors with mean 0 and sd 0.1. Priors for the log gain parameter were the same, except the intercept prior had mean 1 (linear gain estimate of 2.72 in d′ units) and sd 1. The posterior distribution represents the model’s beliefs about the parameters given the priors and data. 
This distribution is summarised above as posterior mean, 95% credible intervals and posterior probabilities for the fixed-effects parameters to be negative (the latter computed via the empirical cumulative distribution of the relevant MCMC samples).
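The two-stage psychometric model described above (the two-parameter d′ function and the Weibull link to proportion correct) can be written compactly. This is an illustrative sketch, not the fitting code; λ and k are left as free arguments since their fitted values are not restated here.

```python
import math

def d_prime(s: float, s_c: float, alpha: float) -> float:
    """Discriminability: zero below the critical scale s_c, rising
    toward the asymptotic gain alpha as the scale factor s grows."""
    return 0.0 if s <= s_c else alpha * math.sqrt(1.0 - (s_c / s) ** 2)

def p_correct(d: float, lam: float, k: float, m: int = 3) -> float:
    """Weibull mapping from d' to proportion correct in an
    m-alternative oddity task (chance performance is 1/m)."""
    return 1.0 / m + (1.0 - 1.0 / m) * (1.0 - math.exp(-((d / lam) ** k)))

# Below the critical scale the predicted proportion correct sits at
# chance (1/3 for oddity); above it, performance climbs toward an
# asymptote determined by the gain.
```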
Acknowledgments
Designed the experiments: TSAW, ASE, CMF, LAG, FAW, MB. Programmed the CNN-model: CMF, LAG. Programmed the experiments: TSAW. Collected the data: CMF, TSAW. Analysed the data: TSAW. Wrote the paper: TSAW, CMF. Revised the paper: ASE, LAG, FAW, MB. Funded by the German Federal Ministry of Education and Research (BMBF) through the Bernstein Computational Neuroscience Program Tübingen (FKZ: 01GQ1002), the German Excellency Initiative through the Centre for Integrative Neuroscience Tübingen (EXC307), and the German Science Foundation (DFG; priority program 1527, BE 3848/2-1 and SFB 1233, Robust Vision: Inference Principles and Neural Mechanisms, TP03). We thank Wiebke Ringels for assistance with data collection, and Heiko Schütt and Corey Ziemba for helpful comments on an earlier draft. TSAW was supported in part by an Alexander von Humboldt Postdoctoral Fellowship. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Christina Funke.
Footnotes
↵1Many other phenomena also demonstrate striking “failures” of peripheral vision, for example change blindness (Rensink, O’Regan, and Clark 1997; O’Regan, Rensink, and Clark 1999) and inattentional blindness (Mack and Rock 1998), though there is some discussion as to what extent these are distinct from crowding (Rosenholtz 2016).
↵2This selection of images is debatable. In particular some “texture-like” images contain scene-like content. Interestingly, “scene-like” images tend to contain human-made content whereas “texture-like” images tend to contain more natural content, though this was not consciously part of our selection criteria (thanks to Corey Ziemba for pointing this out). We selected these images from a scene database to remain consistent with our other experiments (see Supplementary Material). The fact that we do find differences in critical scaling supports our general argument. See Figure S3 and Supplement for further discussion.