Abstract
We subjectively perceive our visual field with high fidelity, yet large peripheral distortions can go unnoticed and peripheral objects can be difficult to identify (crowding). A recent paper proposed a model of the mid-level ventral visual stream in which neural responses were averaged over an area of space that increased as a function of eccentricity (scaling). Human participants could not discriminate synthesised model images from each other (they were metamers) when scaling was about half the retinal eccentricity. This result implicated ventral visual area V2 and approximated “Bouma’s Law” of crowding. It has subsequently been interpreted as a link between crowding zones, receptive field scaling, and our rich perceptual experience. However, participants in this experiment never saw the original images. We find that participants can easily discriminate real and model-generated images at V2 scaling. Lower scale factors than even V1 receptive fields may be required to generate metamers. Efficiently explaining why scenes look as they do may require incorporating segmentation processes and global organisational constraints in addition to local pooling.
Introduction
Vision science seeks to understand why things look as they do (Koffka 1935). Typically, our entire visual field looks subjectively crisp and clear. Yet our perception of the scene falling onto the peripheral retina is actually limited by at least three distinct sources: the optics of the eye, retinal sampling, and the mechanism(s) giving rise to crowding, in which our ability to identify and discriminate objects in the periphery is limited by the presence of nearby items (Bouma 1970; Pelli and Tillman 2008).1 Thus we can be insensitive to significant changes in the world despite our rich subjective experience.
Visual crowding has been characterised as compulsory texture perception (Parkes et al. 2001; Lettvin 1976) and compression (Balas, Nakano, and Rosenholtz 2009; Rosenholtz, Huang, and Ehinger 2012). This idea entails that we cannot perceive the precise structure of the visual world in the periphery. Rather, we are aware only of the summary statistics or ensemble properties of visual displays, such as the average size or orientation of a group of elements (Ariely 2001; Dakin and Watt 1997). One of the appeals of the summary statistic idea is that it can be directly motivated from the perspective of efficient coding: it is a form of compression. Image-computable texture summary statistics have been shown to be correlated with human performance in various tasks requiring the judgment of peripheral information, such as crowding and visual search (Rosenholtz et al. 2012; Balas, Nakano, and Rosenholtz 2009; Freeman and Simoncelli 2011; Rosenholtz 2016; Ehinger and Rosenholtz 2016). Recently, it has even been suggested that summary statistics underlie our rich phenomenal experience itself—in the absence of focussed attention, we perceive only a texture-like visual world (Cohen, Dennett, and Kanwisher 2016).
Across many tasks, summary statistic representations seem to capture aspects of peripheral vision when their pooling corresponds to “Bouma’s Law” (Rosenholtz et al. 2012; Balas, Nakano, and Rosenholtz 2009; Freeman and Simoncelli 2011; Wallis and Bex 2012; Ehinger and Rosenholtz 2016). Bouma’s Law states that objects will crowd (correspondingly, statistics will be pooled) over spatial regions corresponding to about half the retinal eccentricity (Bouma 1970; Pelli and Tillman 2008; though see Rosen, Chakravarthi, and Pelli 2014). If the visual system does indeed represent the periphery using summary statistics, then Bouma’s scaling implies that as retinal eccentricity increases, increasingly large regions of space are texturised by the visual system. If a model captured these statistics and their pooling, images could be created that are indistinguishable from the original despite being physically different (metamers). These images would be equivalent to the model and to the human visual system (Freeman and Simoncelli 2011; Wallis, Bethge, and Wichmann 2016; Portilla and Simoncelli 2000; Koenderink et al. 2017).
Freeman and Simoncelli (2011) developed a model (hereafter, FS-model) in which texture-like summary statistics were pooled over spatial regions inspired by the receptive fields in primate visual cortex. The size of neural receptive fields in ventral visual stream areas increases as a function of retinal eccentricity, and as one moves downstream from V1 to V2 and V4 at a given eccentricity. Each visual area therefore has a signature scale factor, defined as the ratio of the receptive field diameter to retinal eccentricity (Freeman and Simoncelli 2011). Similarly, the pooling regions of the FS-model also increase with retinal eccentricity with a definable scale factor. New images could be synthesised that matched the summary statistics of original images at this scale factor. As scale factor increases, texture statistics are pooled over increasingly large regions of space, resulting in more distorted synthesised images relative to the original (that is, more information is discarded).
The maximum scale factor for which the images remain indistinguishable (the critical scale) characterises perceptually-relevant compression in the visual system’s representation. If the scale factor of the model corresponded to the scaling of the visual system in the responsible visual area, and information in upstream areas was irretrievably lost, then the images synthesised by the model should be indistinguishable while discarding as much information as possible. That is, we seek the maximum compression that is perceptually lossless. Larger scale factors would discard more information than the relevant visual area and therefore the images should look different. Smaller scale factors preserve information that could be discarded without any perceptual effect.
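The scale-factor definition above (pooling diameter as a fixed fraction of retinal eccentricity) is a one-line linear relationship. The following minimal sketch (our own illustration, not code from the paper; the function name is ours) makes the compression trade-off concrete:

```python
def pooling_diameter(eccentricity_deg: float, scale: float) -> float:
    """Diameter (deg) of a pooling region whose size is a fixed
    fraction (the scale factor) of retinal eccentricity."""
    return scale * eccentricity_deg

# At 6 deg eccentricity (the average in this study), V2-like scaling
# (0.5) pools over a 3-deg region; halving the scale factor halves the
# pooled region, preserving more structure at the cost of compression.
print(pooling_diameter(6.0, 0.5))   # 3.0
print(pooling_diameter(6.0, 0.25))  # 1.5
```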
Crucially, it is the minimum critical scale over images that is important for the scaling theory. If the visual system computes summary statistics over fixed (image-independent) pooling regions in the same way as the model, then the model must be able to produce metamers for all images. While images may vary in their individual critical scales, the image with the smallest critical scale determines the maximum compression for appearance to be matched in general.
Freeman and Simoncelli showed that the largest scale factor for which two synthesised images could not be told apart was 0.5, or pooling regions of about half the eccentricity. This scaling matched the signature of area V2, and also matched the approximate value of Bouma’s Law. Subsequently, this result has been interpreted as a link between receptive field scaling, crowding, and our rich phenomenal experience (e.g. Block 2013, Cohen, Dennett, and Kanwisher (2016), Landy (2013), Movshon and Simoncelli (2014), Seth (2014)). These interpretations imply that the FS-model creates metamers for natural scenes. However, observers in Freeman and Simoncelli’s experiment never saw the original scenes, but only compared synthesised images to each other. Showing that two model samples are indiscriminable from each other could yield trivial results. For example, two white noise samples matched to the mean and contrast of a natural scene would be easy to discriminate from the scene but hard to discriminate from each other. Wallis, Bethge and Wichmann (2016) showed that observers could easily discriminate Portilla and Simoncelli (2000) textures from original images in the periphery, but did not test the FS-model. The Portilla and Simoncelli model makes no explicit connection to neural receptive field scaling. In addition, relative to the textures tested by Wallis et al., the pooling region overlap used in the FS-model provides a strong constraint on the resulting syntheses, making the images much more similar to the originals. It is therefore entirely possible that the FS-model produces metamers for natural scenes for scale factors of 0.5. Here we test this, and compare the results to our own model using CNN texture features.
Results
We tested whether the FS-model can produce metamers using an oddity design in which the observer had to pick the odd image out of three successively shown images (Fig 1E). In a 3-alternative oddity paradigm, performance for metamerism would lie at 0.33 (dashed horizontal line, Figure 1F). We used two comparison conditions: either observers compared two model syntheses to each other (synth vs synth; as in Freeman and Simoncelli 2011) or the original image to a model synthesis (orig vs synth). As in the original paper (Freeman and Simoncelli 2011) we measured the performance of human observers for images synthesised with different scale factors (using Freeman and Simoncelli’s code, see Methods). To quantify the critical scale factor we fit the same nonlinear model as Freeman and Simoncelli, which parameterises sensitivity as a function of critical scale and gain, as a mixed-effects model with random effects of participant and image (see Methods).
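The chance level in the three-alternative oddity task (the 0.33 dashed line) follows from guessing the oddball's position at random; a quick simulation confirms it (a hypothetical sketch, not part of the experiment code):

```python
import random

def oddity_chance(n_trials: int = 100_000, m: int = 3, seed: int = 1) -> float:
    """Proportion correct when the oddball position and the observer's
    guess are both uniformly random over the m intervals."""
    rng = random.Random(seed)
    hits = sum(rng.randrange(m) == rng.randrange(m) for _ in range(n_trials))
    return hits / n_trials

# Random guessing converges on 1/m = 1/3 correct.
```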
In previous experiments (see Supplementary Figures S8 and S9), we observed that texture-like distortions are more visible when they fall over image regions containing inhomogeneous structure, long edges, or borders between different surfaces than when they fall into texture-like regions. We therefore explicitly compared “scene-like” images containing inhomogeneous structures to “texture-like” images containing more homogeneous or periodically-patterned content in the periphery. We hand-selected ten images from each class (Figure 1A and Figure 1B)2.
When participants compared synthesised images to each other as in Freeman and Simoncelli (Figure 1F, synth vs synth), there was little evidence that the critical scale depended on the image content (scene- vs texture-like). The difference in critical scale between texture-like and scene-like images was 0.09, 95% CI [-0.03, 0.24], p(β < 0) = 0.078. This result is consistent with Freeman and Simoncelli, who reported no dependency on image. It seems likely that this is because comparing synthesised images to each other means that the model has removed higher-order structure that might allow discrimination. All images appear distorted, and the task becomes one of identifying a specific distortion pattern. While the critical scales we find are somewhat lower than those reported by Freeman and Simoncelli (2011; Figure 1G), they are within the range of other reported critical scale factors (Freeman and Simoncelli 2013). One striking difference between our results for synth vs synth and those of Freeman and Simoncelli is that the performance of our participants was quite poor even for large scale factors. This may be because we used more images in our experiment than Freeman and Simoncelli, so participants were less familiar with the distortions that could appear. We have replicated these results in an experiment using the same ABX task as in Freeman and Simoncelli (Figure S5).
Comparing the original image to model syntheses yielded a different pattern of results. First, participants are able to discriminate the original images from their FS-model syntheses at scale factors of 0.5 (Figure 1F). Performance lies well above chance for all participants: these images are not metamers. This holds for both scene-like and texture-like images. Furthermore, critical scale depends on the image type. Model syntheses match the texture-like images on average with scale factors of approximately 0.25 (the smallest value we could generate using the FS-model). In contrast, the scene-like images are quite discriminable from their model syntheses at this scale. Correspondingly, texture-like images had higher critical scales than scene-like images on average (0.13, 95% CI [0.06, 0.22], p(β < 0) = 0.001). Thus, smaller pooling regions are required to make metamers for scene-like images than for texture-like images.
As noted above, the image with the minimum critical scale determines the largest compression that can be applied for the scaling model to hold. For two images (Figure 2A and E) the nonlinear mixed-effects model estimated critical scales of approximately 0.14 (see Figure 1G, diamonds). However, examining the individual data for these images (Figure 2D and H) reveals that these critical scale estimates are largely determined by the hierarchical nature of the mixed-effects model, not the data itself. Both images remain highly discriminable for the lowest scale factor we could generate. This suggests that the mixed-effects model critical scale may be an overestimate of the true scale factor required to generate metamers. Thus, human observers are highly sensitive to the relatively small distortions produced by the FS-model at scale factors of 0.25 (compare Figure 2B and F at scale 0.25 and C and G at scale 0.46 to images A and B).
Discussion
It is a popular idea that the appearance of scenes in the periphery is described by summary statistic textures captured at the scaling of V2 neural populations. In contrast, here we show that humans are very sensitive to the difference between original and model-matched images at this scale (Figure 1). A recent preprint (Deza, Jonnalagadda, and Eckstein 2017) finds a similar result in a set of 50 images, and our results are also consistent with the speculations made by Wallis et al. based on their experiments with Portilla and Simoncelli textures (Wallis, Bethge, and Wichmann 2016). Together, these results show that the pooling of texture-like features in the FS-model at the scaling of V2 receptive fields does not explain the appearance of natural images.
If the peripheral appearance of visual scenes is explained by image-independent pooling of texture-like features, then the pooling regions must be small. Consider that participants in our experiment could easily discriminate the images in Figure 2B and F from those in Figure 2A and E respectively. Therefore, to be truly metameric, synthesised images must remain extremely close to the original: the pooling regions must be at least as small as V1 receptive fields, and likely even smaller (Figure 2). This may even be consistent with scaling in precortical visual areas. For example, the scaling of retinal ganglion cell receptive fields at the average eccentricity of our stimuli (6 degrees) is approximately 0.08 for the surround (Croner and Kaplan 1995) and 0.009 for the centre (Dacey and Petersen 1992). It becomes questionable how much is learned about compression in the visual system using such an approach, beyond the aforementioned, relatively well-studied limits of optics and retinal sampling (e.g. Wandell 1995; Watson 2014).
Furthermore, it can be seen via simple demonstration that we can be quite insensitive to even large texture-like distortions, so long as these fall on texture-like regions of the input image. The “China Lane” sign in Figure 3A has been distorted in B (using texture features from Gatys et al (2015)). The same type of distortion in a texture-like region of the image is far less visible (the brickwork in the image centre; FS-model result Figure 3C; see also Figure S13). It is the image content, not retinal eccentricity, that is the primary determinant of the visibility of at least some summary statistic distortions. Requiring information to be preserved at V1 or smaller scaling would therefore be rather inefficient from the standpoint of compression: small scale factors will preserve texture-like structure that could be compressed without affecting appearance.
It may seem trivial that a texture statistic model better captures the appearance of textures than non-textures. However, if the human visual system represents the periphery as a set of textures, and these models are sufficient approximations of this representation, then image content should not matter, because scene-like retinal inputs in the periphery are transformed into textures by the visual system.
Perhaps the texture scaling theory might hold at larger scales but the FS-model texture features themselves are insufficient to capture natural scene appearance. To test whether improved texture features (Gatys, Ecker, and Bethge 2015; Wallis et al. 2017) could help in matching appearance for scenes, we developed a new model (CNN-model) that was inspired by the FS-model but uses the texture features of a convolutional neural network (see Methods and Figures S6–9). The CNN-model shows very similar behaviour to the FS-model such that human performance for scene-like images is higher than for texture-like images (triangles in Figure 1D and Figure 2), and the CNN-model also fails to create metamers for all images (see also Figures S9, S12–13). Furthermore, the NeuroFovea model of Deza et al (2017), which like our CNN-model uses deep neural network texture features, also fails to capture scene appearance. Together, these results show that no known summary statistic pooling model is sufficient to match the appearance of arbitrary natural scenes at computationally feasible scale factors.
What, then, is the missing ingredient that could capture appearance while compressing as much information as possible? Through the Gestalt tradition, it has long been known that the appearance of local image elements can crucially depend on the context in which they are placed. For example, Saarela et al (2009) and Manassi et al (2013) found that global stimulus configuration modulates crowding (see also Vickery et al. 2009), and the results of Neri (2017) suggest that early global segmentation processes influence local perceptual sensitivity. Potentially, global scene organisation needs to be considered if one wants to capture appearance—yet current models that texturise local regions do not explicitly include perceptual organisation (Herzog et al. 2015). We speculate that segmentation and grouping processes are critical for efficiently matching scene appearance, and therefore the approach of uniformly computing summary statistics without including these processes will require preserving much of the original image structure by making pooling regions very small. A parsimonious model capable of compressing as much information as possible might need to adapt either the size and arrangement of pooling regions or the feature representations to the image content.
Our results do not undermine the considerable empirical support for the periphery-as-summary-statistic theory as a description of visual performance. Humans can judge summary statistics of visual displays (Ariely 2001; Dakin and Watt 1997), summary statistics can influence judgments where other information is lost (Fischer and Whitney 2011; Faivre, Berthet, and Kouider 2012), and the information preserved by summary statistic stimuli may offer an explanation for performance in various visual tasks (Rosenholtz et al. 2012; Balas, Nakano, and Rosenholtz 2009; Rosenholtz, Huang, and Ehinger 2012; Keshvari and Rosenholtz 2016; Chang and Rosenholtz 2016; Zhang et al. 2015; Whitney, Haberman, and Sweeny 2014; Long et al. 2016; though see Agaoglu and Chung 2016; Herzog et al. 2015; Francis, Manassi, and Herzog 2017). Texture-like statistics may even provide the primitives—appropriately organised on a global scale—from which form is constructed (Lettvin 1976). However, one additional point merits further discussion. The studies by Rosenholtz and colleagues primarily test summary statistic representations by showing that performance with summary statistic stimuli viewed foveally is correlated with peripheral performance with real stimuli. This means that the summary statistic preserves sufficient information to explain the performance of tasks in the periphery. Our results show that these summary statistics are insufficient to match scene appearance, at least under the pooling scheme used in the Freeman and Simoncelli model at computationally feasible scales. This shows the usefulness of scene appearance matching as a test: a parsimonious model that matches scene appearance would be expected to also preserve enough information to show correlations with peripheral task performance; the converse does not hold.
While it may be useful to consider summary statistic pooling in accounts of visual performance, to say that summary statistics can account for phenomenological experience of the visual periphery (Cohen, Dennett, and Kanwisher 2016; see also Block 2013; Seth 2014) seems premature in light of our results (see also Haun et al. 2017). Cohen et al (2016) additionally posit that focussed spatial attention can in some cases overcome the limitations imposed by a summary statistic representation. We instead find little evidence that participants’ ability to discriminate real from synthesised images is improved by cueing spatial attention, at least in our experimental paradigm and for our CNN-model (Figure S11).
One exciting aspect of Freeman and Simoncelli (2011) was the promise of inferring a critical brain region via a receptive field size prediction derived from psychophysics. Indeed, aspects of this promise have since received empirical support: the presence of texture-like features can discriminate V2 neurons from V1 neurons (Freeman et al. 2013; Ziemba et al. 2016; see also Okazawa, Tajima, and Komatsu 2015). Discarding all higher-order structure not captured by the candidate model by comparing syntheses to each other, thereby isolating only features that change, may therefore be a useful way to distinguish sequential feedforward processing stages in neurons. On the other hand, explaining appearance is—to return to Koffka—a grand goal of vision science. For this the original vs synthesised comparison is key. Our results suggest that it is wrong to believe that “appearance”, even only peripheral appearance, can be tied solely to the scaling of receptive fields in any single brain region or to Bouma’s Law.
Methods
All stimuli, data and code to reproduce the figures and statistics reported in this paper are available at http://dx.doi.org/10.5281/zenodo.1475112. This document was prepared using the knitr package (Xie 2013, 2015) in the R statistical environment (R Core Team 2017; Wickham and Francois 2016; Wickham 2009, 2011; Auguie 2016; Arnold 2016) to improve its reproducibility.
Participants
Eight observers participated in the experiment: authors CF and TW, a research assistant unfamiliar with the experimental hypotheses, and five naïve participants recruited from an online advertisement pool who were paid 10 Euro / hr for two one-hour sessions. An additional naïve participant was recruited but showed insufficient eyetracking accuracy (see below). All participants signed a consent form prior to participating. Participants reported normal or corrected-to-normal visual acuity. All procedures conformed to Standard 8 of the American Psychological Association’s “Ethical Principles of Psychologists and Code of Conduct” (2010).
Stimuli
We selected 10 “scene-like” and 10 “texture-like” source images from the MIT 1003 scene dataset (Judd, Durand, and Torralba 2012; Judd et al. 2009). We used images from this dataset to allow better comparison to related experiments (see Supplementary Material). A square was cropped from the center of the original image and downsampled to 512 x 512 px. The images were converted to grayscale and standardized to have a mean gray value of 0.5 (scaled [0,1]) and an RMS contrast (σ/µ) of 0.3.
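The standardisation described above could be implemented as follows (a sketch under the stated mean and RMS-contrast targets; it omits clipping of any values pushed outside [0, 1], which a real pipeline would need to handle):

```python
import numpy as np

def standardise(img: np.ndarray, mean: float = 0.5, rms: float = 0.3) -> np.ndarray:
    """Rescale a grayscale image (values in [0, 1]) to a fixed mean
    gray value and RMS contrast, defined here as sigma / mu."""
    z = (img.astype(float) - img.mean()) / img.std()  # zero mean, unit sd
    return mean + z * (rms * mean)  # sd becomes rms * mean, so sigma/mu = rms

rng = np.random.default_rng(0)
out = standardise(rng.random((512, 512)))
# out.mean() is ~0.5 and out.std() / out.mean() is ~0.3 by construction
```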
Freeman and Simoncelli syntheses
We synthesised images using the FS-model (Freeman and Simoncelli 2011, code available from https://github.com/freeman-lab/metamers). Four unique syntheses were created for each source image at each of eight scale factors (0.25, 0.36, 0.46, 0.59, 0.7, 0.86, 1.09, 1.45), using 50 gradient steps as in Freeman and Simoncelli. Pilot experiments with stimuli generated with 100 gradient steps produced similar results. To successfully synthesise images at scale factors of 0.25 and 0.36 it was necessary to increase the central region of the image in which the original pixels were perfectly preserved (pooling regions near the fovea become too small to compute correlation matrices). Scales of 0.25 used a central radius of 32 px (0.8 dva in our viewing conditions) and scales 0.36 used 16 px (0.4 dva). This change should, if anything, make syntheses even harder to discriminate from the original image. All other parameters of the model were as in Freeman and Simoncelli. Synthesising an image with scale factor 0.25 took approximately 35 hours, making a larger set of syntheses or source images infeasible. It was not possible to reliably generate images with scale factors lower than 0.25 using the code above.
CNN model syntheses
The CNN pooling model (triangles in Figure 1) was inspired by the model of Freeman and Simoncelli, with two primary differences: first, we replaced the Portilla and Simoncelli (2000) texture features with texture features derived from a convolutional neural network (Gatys, Ecker, and Bethge 2015), and second, we simplified the “foveated” pooling scheme for computational reasons. Specifically, for the CNN 32 model presented above, the image was divided into 32 angular regions and 28 radial regions, spanning the outer border of the image and an inner radius of 64 px. Within each of these regions we computed the mean activation of the feature maps from a subset of the VGG-19 network layers (conv1_1, conv2_1, conv3_1, conv4_1, conv5_1). To better capture long-range correlations in image structure, we computed these radial and angular regions over three spatial scales, by computing three networks over input sizes 128, 256 and 512 px. Using this multiscale radial and angular pooling representation of an image, we synthesised new images to match the representation of the original image via iterative gradient descent (Gatys, Ecker, and Bethge 2015). Specifically, we minimised the mean-squared distance between the pooled feature representations of the original and the synthesised image, starting from Gaussian noise outside the central 64 px region, using the L-BFGS optimiser as implemented in scipy (Jones, Oliphant, and Peterson 2001) for 1000 gradient steps, which we found in pilot experiments was sufficient to produce small (but not zero) loss. Further details, including tests of other variants of this model, are provided in the Supplement.
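The angular/radial pooling layout can be sketched as follows. This is our illustrative reconstruction, not the authors' code: the radial spacing is assumed to be log-scaled so that regions grow with eccentricity, and the actual implementation may differ in detail.

```python
import numpy as np

def pooling_bins(size: int = 512, n_angular: int = 32, n_radial: int = 28,
                 r_inner: float = 64.0):
    """Assign each pixel an (angular, radial) pooling-bin index between
    an inner radius and the image border. Radial bins are log-spaced
    here (an assumption) so that regions grow with eccentricity."""
    y, x = np.mgrid[:size, :size] - (size - 1) / 2.0
    r = np.hypot(x, y)
    theta = np.arctan2(y, x)  # in [-pi, pi]
    ang = np.minimum((theta + np.pi) / (2 * np.pi) * n_angular,
                     n_angular - 1).astype(int)
    r_outer = size / 2.0
    frac = np.log(np.clip(r, r_inner, r_outer) / r_inner) / np.log(r_outer / r_inner)
    rad = np.minimum((frac * n_radial).astype(int), n_radial - 1)
    return ang, rad

def pooled_means(feature_map, ang, rad, n_angular=32, n_radial=28):
    """Mean activation of one feature map within each pooling region;
    one number per region per map is the statistic to be matched."""
    out = np.zeros((n_angular, n_radial))
    for a in range(n_angular):
        for b in range(n_radial):
            mask = (ang == a) & (rad == b)
            if mask.any():
                out[a, b] = feature_map[mask].mean()
    return out
```

Synthesis then amounts to gradient descent on a new image until its pooled means match those of the original across all layers and scales.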
Equipment
Stimuli were displayed on a VIEWPixx 3D LCD (VPIXX Technologies Inc., Saint-Bruno-de-Montarville, Canada; spatial resolution 1920 x 1080 pixels, temporal resolution 120 Hz, operating with the scanning backlight turned off in normal colour mode). Outside the stimulus image the monitor was set to mean grey. Participants viewed the display from 57 cm (maintained via a chinrest) in a darkened chamber. At this distance, pixels subtended approximately 0.025 degrees on average (approximately 40 pixels per degree of visual angle). The monitor was linearised (maximum luminance 260 cd/m²) using a Konica-Minolta LS-100 (Konica-Minolta Inc., Tokyo, Japan). Stimulus presentation and data collection were controlled via a desktop computer (Intel Core i5-4460 CPU, AMD Radeon R9 380 GPU) running Ubuntu Linux (16.04 LTS), using the Psychtoolbox Library (version 3.0.12, Brainard 1997; Kleiner, Brainard, and Pelli 2007; Pelli 1997), the Eyelink toolbox (Cornelissen, Peters, and Palmer 2002) and our internal iShow library (http://dx.doi.org/10.5281/zenodo.34217) under MATLAB (The Mathworks Inc., Natick MA, USA; R2015b). Participants’ gaze position was monitored by an Eyelink 1000 (SR Research) video-based eyetracker.
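The quoted pixels-per-degree figure can be checked with the standard visual-angle formula (a worked sketch; the 0.025 cm pixel pitch is an assumed value consistent with the figures quoted above, not an independently measured one):

```python
import math

def degrees_per_pixel(pixel_cm: float, distance_cm: float = 57.0) -> float:
    """Visual angle (deg) subtended by one pixel at the chinrest distance."""
    return math.degrees(2 * math.atan(pixel_cm / (2 * distance_cm)))

# At 57 cm, 1 cm on screen subtends roughly 1 deg of visual angle (the
# classic reason for this viewing distance), so a ~0.025 cm pixel
# subtends ~0.025 deg, i.e. ~40 px per degree of visual angle.
print(round(1 / degrees_per_pixel(0.025)))  # 40
```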
Procedure
On each trial participants were shown three images in succession; two images were identical, one image was different (the “oddball”, which could occur first, second or third with equal probability). The oddball could be either a synthesised or a natural image (in the orig vs synth condition; counterbalanced), whereas the other two images were physically the same as each other and from the opposite class as the oddball. In the synth vs synth condition (as used in Freeman and Simoncelli), both oddball and foil images were (physically different) model synths. The participant identified the temporal position of the oddball image via button press. Participants were told to fixate on a central point (Thaler et al. 2013) presented in the center of the screen. The images were centred around this spot and displayed with a radius of 512 pixels (i.e. images were upsampled by a factor of two for display), subtending ≈ 12.8° at the eye. Images were windowed by a circular cosine, ramping the contrast to zero in the space of 52 pixels. The stimuli were presented for 200 ms, with an inter-stimulus interval of 1000 ms (making it unlikely participants could use motion cues to detect changes), followed by a 1200 ms response window. Feedback was provided by a 100 ms change in fixation cross brightness. Gaze position was recorded during the trial. If the participant moved the eye more than 1.5 degrees away from the fixation spot, the trial immediately ended and no response was recorded; participants saw a feedback signal (sad face image) indicating a fixation break. Prior to the next trial, the state of the participant’s eye position was monitored for 50 ms; if the eye position was reported as more than 1.5 degrees away from the fixation spot a recalibration was triggered. The inter-trial interval was 400 ms.
Scene-like and texture-like images were compared under two comparison conditions (orig vs synth and synth vs synth; see main text). Image types and scale factors were randomly interleaved within a block of trials (with a minimum of one trial from another image in between) whereas comparison condition was blocked. Participants first practiced the task and fixation control in the orig vs synth comparison condition (scales 0.7, 0.86 and 1.45); the same images used in the experiment were also used in practice to familiarise participants with the images. Participants performed at least 60 practice trials, and were required to achieve at least 50% correct responses and fewer than 20% fixation breaks before proceeding (as noted above, one participant failed). Following successful practice, participants performed one block of orig vs synth trials, which consisted of five FS-model scale factors (0.25, 0.36, 0.46, 0.59, 0.86) plus the CNN 32 model, repeated once for each image to give a total of 120 trials. The participant then practiced the synth vs synth condition for at least one block (30 trials), before continuing to a normal synth vs synth block (120 trials; scale factors of 0.36, 0.46, 0.7, 0.86, 1.45). Over two one-hour sessions, naïve participants completed a total of four blocks of each comparison condition in alternating order (except for one participant who ran out of time to complete the final block). Authors performed more blocks (total 11).
Data analysis
We discarded trials in which participants made no response (N = 66) or broke fixation (N = 239), leaving a total of 7555 trials for further analysis. To quantify the critical scale as a function of the scale factor s, we used the same 2-parameter function for discriminability d′ fitted by Freeman and Simoncelli:

d′(s) = α √(1 − s_c² / s²) for s > s_c, and d′(s) = 0 otherwise,

consisting of the critical scale s_c (below which the participant cannot discriminate the stimuli) and a gain parameter α (asymptotic performance level in units of d′). This d′ value was transformed to proportion correct using a Weibull function as in Wallis et al (2016):

P(correct) = 1/m + (1 − 1/m) · (1 − exp(−(d′/λ)^k)),
with m set to three (the number of alternatives), and scale λ and shape k parameters chosen by minimising the squared difference between the Weibull and simulated results for oddity as in Craven (1992). The posterior distribution over model parameters (sc and α) was estimated in a nonlinear mixed-effects model with fixed effects for the experimental conditions (comparison and image type) and random effects for participant (crossed with comparison and image type) and image (crossed with comparison, nested within image type), assuming binomial variability. Estimates were obtained by a Markov Chain Monte Carlo (MCMC) procedure implemented in the Stan language (version 2.16.2, Stan Development Team 2017; Hoffman and Gelman 2014), with the model wrapper package brms (version 1.10.2, Bürkner 2017) in the R statistical environment. The model parameters were given weakly-informative prior distributions, which provide information about the plausible scale of parameters but do not bias the direction of inference. Specifically, both critical scale and gain were estimated on the natural logarithmic scale; the mean log critical scale (intercept) was given a Gaussian distribution prior with mean -0.69 (corresponding to a critical scale of approximately 0.5—i.e. centred on the result from Freeman and Simoncelli) and sd 1, other fixed-effect coefficients were given Gaussian priors with mean 0 and sd 0.5, and the group-level standard deviation parameters were given positive-truncated Cauchy priors with mean 0 and sd 0.1. Priors for the log gain parameter were the same, except the intercept prior had mean 1 (linear gain estimate of 2.72 in d′ units) and sd 1. The posterior distribution represents the model’s beliefs about the parameters given the priors and data. 
This distribution is summarised above as posterior mean, 95% credible intervals and posterior probabilities for the fixed-effects parameters to be negative (the latter computed via the empirical cumulative distribution of the relevant MCMC samples).
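The two-stage psychometric model described above (the two-parameter d′ function and the Weibull link to proportion correct) can be written compactly. This is an illustrative sketch, not the fitting code; λ and k are left as free arguments since their fitted values are not restated here.

```python
import math

def d_prime(s: float, s_c: float, alpha: float) -> float:
    """Discriminability: zero below the critical scale s_c, rising
    toward the asymptotic gain alpha as the scale factor s grows."""
    return 0.0 if s <= s_c else alpha * math.sqrt(1.0 - (s_c / s) ** 2)

def p_correct(d: float, lam: float, k: float, m: int = 3) -> float:
    """Weibull mapping from d' to proportion correct in an
    m-alternative oddity task (chance performance is 1/m)."""
    return 1.0 / m + (1.0 - 1.0 / m) * (1.0 - math.exp(-((d / lam) ** k)))

# Below the critical scale the predicted proportion correct sits at
# chance (1/3 for oddity); above it, performance climbs toward an
# asymptote determined by the gain.
```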
Acknowledgments
Designed the experiments: TSAW, ASE, CMF, LAG, FAW, MB. Programmed the CNN-model: CMF, LAG. Programmed the experiments: TSAW. Collected the data: CMF, TSAW. Analysed the data: TSAW. Wrote the paper: TSAW, CMF. Revised the paper: ASE, LAG, FAW, MB. Funded by the German Federal Ministry of Education and Research (BMBF) through the Bernstein Computational Neuroscience Program Tübingen (FKZ: 01GQ1002), the German Excellency Initiative through the Centre for Integrative Neuroscience Tübingen (EXC307), and the German Science Foundation (DFG; priority program 1527, BE 3848/2-1 and SFB 1233, Robust Vision: Inference Principles and Neural Mechanisms, TP03). We thank Wiebke Ringels for assistance with data collection, and Heiko Schütt and Corey Ziemba for helpful comments on an earlier draft. TSAW was supported in part by an Alexander von Humboldt Postdoctoral Fellowship. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Christina Funke.
Footnotes
↵1Many other phenomena also demonstrate striking “failures” of peripheral vision, for example change blindness (Rensink, O’Regan, and Clark 1997; O’Regan, Rensink, and Clark 1999) and inattentional blindness (Mack and Rock 1998), though there is some discussion as to what extent these are distinct from crowding (Rosenholtz 2016).
↵2This selection of images is debatable. In particular some “texture-like” images contain scene-like content. Interestingly, “scene-like” images tend to contain human-made content whereas “texture-like” images tend to contain more natural content, though this was not consciously part of our selection criteria (thanks to Corey Ziemba for pointing this out). We selected these images from a scene database to remain consistent with our other experiments (see Supplementary Material). The fact that we do find differences in critical scaling supports our general argument. See Figure S3 and Supplement for further discussion.