Abstract
Core object recognition, the ability to rapidly recognize objects despite variations in their appearance, is largely solved through the feedforward processing of visual information. Deep neural networks have been shown to achieve human-level performance in these tasks and to explain primate brain representations. On the other hand, object recognition under more challenging conditions (i.e. beyond the core recognition problem) is less well characterized. One such example is object recognition under occlusion. It is unclear to what extent feedforward and recurrent processes contribute to object recognition under occlusion. Furthermore, we do not know whether the conventional deep neural networks that have been successful in solving core object recognition can perform similarly well in problems that go beyond core recognition. Here, we characterize the neural dynamics of object recognition under occlusion using magnetoencephalography (MEG), while human subjects were presented with images of objects with various levels of occlusion. We provide evidence from multivariate analysis of MEG data, behavioral data, and computational modelling, demonstrating an essential role for recurrent processes in object recognition under occlusion. Furthermore, the computational model with local recurrent connections used here suggests a mechanistic explanation of how the human brain might be solving this problem.
1. Introduction
There is an abundance of feedforward and recurrent connections in the primate visual cortex (Lamme et al., 1998, Sporns and Zwi, 2004). The feedforward connections form a hierarchy of cortical areas along the visual pathway, playing a significant role in various aspects of visual object processing (Felleman and Van Essen, 1991). However, the role of recurrent connections in visual processing has remained poorly understood (Lamme et al., 1998, Lamme and Roelfsema, 2000, Gilbert and Li, 2013, Kafaligonul et al., 2015, Klink et al., 2017).
Several complementary behavioral, neuronal, and computational modeling studies have confirmed that a large class of object recognition tasks called “core recognition” is largely solved through a single sweep of feedforward visual information processing (DiCarlo and Cox, 2007, DiCarlo et al., 2012, Khaligh-Razavi and Kriegeskorte, 2014, Yamins et al., 2014, Cadieu et al., 2014, Wen et al., 2018). Object recognition is defined as the ability to differentiate an object’s identity or category from many other objects having a range of identity-preserving changes (DiCarlo and Cox, 2007). Core recognition refers to the ability of the visual system to rapidly recognize objects despite variations in their appearance, e.g. position, scale, and rotation (DiCarlo and Cox, 2007).
Object recognition under challenging conditions, such as high variations (Ghodrati et al., 2014, Karimi-Rouzbahani et al., 2017), degradation and occlusion (Rensink and Enns, 1998, Oram, 2010, Fabre-Thorpe, 2011, Wyatte et al., 2014, Kosai et al., 2014, Choi et al., 2016, Spoerer et al., 2017, Tang et al., 2017), and crowding (Livne and Sagi, 2011, Manassi and Herzog, 2013, Clarke et al., 2014), goes beyond the core recognition problem and is thought to require more than feedforward processing. Object recognition under occlusion is one of the key challenging conditions, and it occurs in many of the natural scenes we interact with every day. How our brain solves object recognition under such a challenging condition is still an open question. We do not know the dynamics of object processing under this condition, or to what extent object recognition under occlusion relies on recurrent processes. Furthermore, as opposed to the core object recognition problem, where conventional feedforward CNNs have been shown to explain brain representations (Khaligh-Razavi and Kriegeskorte, 2014, Yamins et al., 2014, Cadieu et al., 2014, Wen et al., 2018), we do not yet have computational models that successfully explain human brain representations and behavior under this challenging condition.
Few fMRI studies have investigated how and where occluded objects are represented in the human brain (Rauschenberger et al., 2006, Hulme and Zeki, 2007, Hegdé et al., 2008, Ban et al., 2013, Erlikhman and Caplovitz, 2017). Hulme and Zeki (2007) found that faces and houses in fusiform face area (FFA) and lateral occipital cortex (LOC) are represented similarly with and without occlusion. Ban et al. (2013) used topographic mapping with simple geometric shapes (e.g. triangles), finding that the occluded portion of the shape is represented topographically in human V1 and V2, suggesting the involvement of early visual areas in object completion. A more recent study showed that the early visual areas may only code spatial information about occluded objects, but not their identity, and that higher-order visual areas instead represent object-specific information, such as the category or identity of occluded objects (Erlikhman and Caplovitz, 2017). While these studies provide insights into object processing under occlusion, they do not provide any information about the temporal dynamics of these processes, or about whether object recognition under occlusion requires recurrent processing.
Our focus in this study is understanding the temporal dynamics of object recognition under occlusion, and whether recurrent connections are involved in processing occluded objects. If so, in what form are they engaged (e.g. long-range feedback or local recurrence), and how large is their contribution compared to that of the feedforward visual information? We constructed a controlled image set of occluded objects, and used a combination of multivariate pattern analyses (MVPA) of MEG signals, computational modeling, backward masking, and behavioral experiments to characterize the representational dynamics of object processing under occlusion, and the role of recurrence.
Here, we provide four complementary lines of evidence for the contribution of recurrent processes to recognizing occluded objects. First, MEG decoding time courses show that the onset and peak for occluded objects (without backward masking) are significantly delayed compared to when the whole object is presented without occlusion. Second, time-time decoding analysis (i.e. temporal generalization) suggests that occluded object processing goes through a relatively long sequence of stages that involve recurrent interactions, likely local recurrence. Third, the results of backward masking demonstrate that while masking significantly impairs both human categorization performance and MEG decoding performance under occlusion, it has no significant effect on object recognition when objects are not occluded. Fourth, results from two computational models showed that a conventional feedforward CNN (AlexNet) that could achieve human-level performance in the no-occlusion condition performed significantly worse than humans when objects were occluded. Additionally, the feedforward CNN could explain the human MEG data only when objects were presented without occlusion; it failed to explain the MEG data under occlusion. In contrast, a hierarchical CNN with local recurrent connections (recurrent ResNet) achieved human-level performance and could explain the MEG neural data when objects were occluded. These findings demonstrate significant involvement of recurrent processes in occluded object recognition, and improve our understanding of object recognition beyond the core problem.
2. Results
We used multivariate pattern analysis (MVPA) of MEG data to characterize representational dynamics of object recognition under occlusion (Carlson et al., 2013, Cichy et al., 2014, Isik et al., 2014, Grootswagers et al., 2017). MEG along with MVPA allows for a fine-grained investigation of the underlying object recognition processes across time (Grootswagers et al., 2017, Contini et al., 2017). Subjects (N=15) were presented with images of objects with varying levels of occlusion (i.e., 0% = no-occlusion, 60% and 80% occlusion; Figure 1b). We also took advantage of the visual backward masking (Breitmeyer and Öğmen, 2006) as a tool to further control the feedforward and feedback flow of visual information processing. In the MEG experiment, each stimulus was presented for 34 ms, followed by a blank-screen ISI, and then in half of the trials followed by a dynamic mask (Figure S1). We extracted and pre-processed MEG signals from −100 ms to 700 ms with regard to the stimulus onset. To calculate pairwise discriminability between objects, a support vector machine (SVM) classifier was trained and tested at each time point (Figure 1a). MEG decoding time-courses show the pairwise discriminability of object images averaged across individuals. We first present the MEG results of the no-mask trials. After that in section 2.3 we discuss the effect of backward masking.
2.1. Object recognition is significantly delayed under occlusion
We used pairwise decoding analysis of MEG signals to measure how object information evolves over time (Figure 1a). Significantly above-chance decoding accuracy means that objects can be discriminated using the information available to the brain at that time-point. The decoding onset latency indicates the earliest time at which object-specific information becomes available, and the peak decoding latency is the time-point at which object-discrimination performance is highest.
We found that object information emerges significantly later under occlusion compared to the no-occlusion condition. Object decoding under no-occlusion had an early onset latency at 79 ms [±3 ms standard deviation (SD)] and was followed by a sharp increase reaching its maximum accuracy (i.e. peak latency) at 139±1 ms (Figure 1c). This early and rapidly evolving dynamic is consistent with the known time-course of feedforward visual object processing (Liu et al., 2009, Carlson et al., 2013, Cichy et al., 2014).
However, when objects were partially occluded (i.e. 60% occlusion), decoding time-courses were significantly slower than in the no-occlusion condition: the onset of decoding accuracy was at 123±15 ms, followed by a gradual increase in decoding accuracy until it reached its peak at 199±3 ms (Figure 1c). The differences in onset latency and in peak latency were both statistically significant with p < 10^-4 (two-sided sign-rank test). The slow temporal dynamics of object recognition under occlusion, and the significant temporal delay in processing occluded objects compared to un-occluded objects, do not match a fully feedforward account of visual information processing. This may be best explained by the engagement of recurrent processes.
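As an illustration of the quantities defined above, the following is a minimal sketch of time-resolved pairwise decoding on synthetic data. It is not the analysis pipeline of this study: a nearest-class-mean classifier stands in for the SVM, the function names (`pairwise_decoding`, `onset_and_peak`), the array shapes, and the 0.8 accuracy threshold are all hypothetical, and the threshold is only a placeholder for the statistical test that defines significance.

```python
import numpy as np

def pairwise_decoding(X_a, X_b, n_folds=4, seed=0):
    """Cross-validated accuracy for discriminating two objects at each time point.

    X_a, X_b: arrays of shape (n_trials, n_sensors, n_times), same n_trials.
    A nearest-class-mean classifier is used as a simple stand-in for an SVM.
    """
    rng = np.random.default_rng(seed)
    n_trials, _, n_times = X_a.shape
    folds = np.array_split(rng.permutation(n_trials), n_folds)
    n_correct = np.zeros(n_times)
    for test_idx in folds:
        train = np.ones(n_trials, dtype=bool)
        train[test_idx] = False
        mean_a = X_a[train].mean(axis=0)  # (n_sensors, n_times)
        mean_b = X_b[train].mean(axis=0)
        for X_test, label_is_a in ((X_a[test_idx], True), (X_b[test_idx], False)):
            # Squared distance to each class mean, separately per time point
            d_a = ((X_test - mean_a) ** 2).sum(axis=1)
            d_b = ((X_test - mean_b) ** 2).sum(axis=1)
            pred_is_a = d_a < d_b
            n_correct += (pred_is_a == label_is_a).sum(axis=0)
    return n_correct / (2 * n_trials)

def onset_and_peak(accuracy, times, significant):
    """Onset = first significant time point; peak = time of maximum accuracy."""
    onset = times[np.argmax(significant)] if significant.any() else None
    peak = times[np.argmax(accuracy)]
    return onset, peak

# Synthetic demo: the two objects differ only from 0 ms onward.
rng = np.random.default_rng(1)
n_trials, n_sensors, n_times = 40, 10, 20
times = np.arange(n_times) * 10 - 100  # -100 ms .. 90 ms in 10 ms steps
X_a = rng.normal(size=(n_trials, n_sensors, n_times))
X_b = rng.normal(size=(n_trials, n_sensors, n_times))
X_a[:, :, times >= 0] += 2.0
accuracy = pairwise_decoding(X_a, X_b)
# Placeholder for a proper significance test across subjects
onset, peak = onset_and_peak(accuracy, times, accuracy > 0.8)
```

In this toy setting the accuracy stays near chance (0.5) before 0 ms and rises sharply afterwards, so the onset falls at the first time point that carries object information.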
Under 80% occlusion, the MEG decoding results do not reach significance (Figure 1c). However, behaviorally, human subjects still perform above-chance in object categorization even under 80% occlusion (Figure 4b). This discrepancy might be due to MEG acquisition noise, whereas the behavioral categorization task is by definition free from that type of noise.
While the MEG and behavioral data have different levels of noise, we show that within the MEG data itself, object images with different levels of occlusion (0%, 60%, 80%) do not differ in terms of their level of noise (Figure S2). Thus, the difference in decoding performance between different levels of occlusion cannot be simply explained by difference in noise.
2.2. Time-time decoding analysis for occluded objects suggests a neural architecture with recurrent interactions
We performed time-time decoding analysis measuring how information about object discrimination generalizes across time (Figure 2a). Time-time decoding matrices are constructed by training an SVM classifier at a given time point and testing its generalization performance at all other time-points (see Methods). The pattern of temporal generalization provides useful information about the underlying processing architecture (King and Dehaene, 2014).
We were interested to see if there are differences between temporal generalization patterns of occluded and un-occluded objects. Different processing dynamics may lead to distinct patterns of generalization in the time-time decoding matrix [see (King and Dehaene, 2014) for a review]. For example, a narrow diagonal pattern suggests a hierarchical sequence of processing stages wherein information is sequentially transferred between neural stages. This hierarchical architecture is well consistent with the feedforward account of neural information processing across the ventral visual pathway. On the other hand, a time-time decoding pattern with off-diagonal generalization suggests a neural architecture with recurrent interactions between processing stages [see Figure 5 in (King et al., 2016)].
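The contrast between a narrow diagonal and off-diagonal generalization can be made concrete with a small simulation. The sketch below is an assumption-laden toy, not the analysis used here: it again uses a nearest-class-mean classifier in place of the SVM, and the two synthetic data sets (`simulate`, a hypothetical helper) are constructed by hand so that the discriminating sensor either changes at every time point (a sequential chain) or stays fixed (a sustained code).

```python
import numpy as np

def temporal_generalization(X_a, X_b, train_idx, test_idx):
    """Return G[t_train, t_test]: accuracy of a classifier trained at t_train
    and tested at t_test. Inputs have shape (n_trials, n_sensors, n_times).
    Nearest-class-mean stand-in for the SVM."""
    mean_a = X_a[train_idx].mean(axis=0)  # (n_sensors, n_times)
    mean_b = X_b[train_idx].mean(axis=0)
    w = mean_a - mean_b                   # discriminant direction per training time
    c = 0.5 * (mean_a + mean_b)           # class midpoint per training time
    bias = np.einsum('st,st->t', w, c)

    def scores(X):
        # scores[n, t_train, t_test] = w[:, t_train] . (X[n, :, t_test] - c[:, t_train])
        return np.einsum('st,nsu->ntu', w, X) - bias[None, :, None]

    acc_a = (scores(X_a[test_idx]) > 0).mean(axis=0)
    acc_b = (scores(X_b[test_idx]) < 0).mean(axis=0)
    return 0.5 * (acc_a + acc_b)

rng = np.random.default_rng(0)
n_trials, n_sensors, n_times = 60, 12, 8
train_idx, test_idx = np.arange(0, 30), np.arange(30, 60)

def simulate(sustained):
    """Object A differs from object B on one sensor per time point."""
    X_a = rng.normal(size=(n_trials, n_sensors, n_times))
    X_b = rng.normal(size=(n_trials, n_sensors, n_times))
    for t in range(n_times):
        sensor = 0 if sustained else t  # fixed code vs. a code that moves on
        X_a[:, sensor, t] += 4.0
    return X_a, X_b

G_sequential = temporal_generalization(*simulate(False), train_idx, test_idx)
G_sustained = temporal_generalization(*simulate(True), train_idx, test_idx)
off_diagonal = ~np.eye(n_times, dtype=bool)
```

In the sequential case only the diagonal of the matrix is above chance, whereas the sustained case generalizes across all pairs of time points; real data, and the recurrent architectures simulated by King et al. (2016), fall between these two extremes.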
The temporal generalization pattern under no-occlusion (Figure 2b) indicates a sequential architecture, without off-diagonal generalization until its early peak latency at 140 ms. This is consistent with a dominantly feedforward account of visual information processing. There is some off-diagonal generalization after 140 ms; however, that is not of interest here, because the ongoing recurrent activity after the peak latency (as shown in Figure 1c, dark green curve) does not carry any information that further improves pairwise decoding performance of un-occluded objects. On the other hand, when objects are occluded, the temporal generalization matrix (Figure 2c) indicates a significantly delayed peak latency at 199 ms with extensive off-diagonal generalization before reaching its peak. In other words, for occluded objects, we see a discernible pattern of temporal generalization, which is characterized by 1) a relatively weak early diagonal pattern of decoding accuracy during [100 150] ms with limited temporal generalization, in contrast with the high-accuracy decoding of un-occluded objects in the same time period; and 2) a relatively late peak decoding accuracy with a wide generalization pattern around 200 ms. This pattern of temporal generalization can be simulated by a hierarchical neural architecture with local recurrent interactions within the network [Figure 5 of (King et al., 2016)].
We also performed sensorwise decoding analysis to explore the spatio-temporal dynamics of object information. To calculate sensorwise decoding, pairwise decoding analysis was conducted on 102 neighboring triplets of MEG sensors (2 gradiometers and 1 magnetometer in each location), yielding a decoding map of brain activity at each time-point. The sensorwise decoding patterns indicated the approximate locus of neural activity: in particular, for both un-occluded (supp. movie 1) and occluded (supp. movie 2) conditions, the main source of object decoding at both the decoding onset and the decoding peak is in the left posterior-temporal sensors. From 110 ms to 200 ms, the peak of decoding accuracy remains around the same sensors, suggesting sustained local recurrent activity.
2.3. Backward masking significantly impaired object recognition only under occlusion
Visual backward masking has been used as a tool to disrupt the flow of recurrent information processing, while feedforward processes are left relatively intact (Lamme and Roelfsema, 2000, Lamme et al., 2002, Bacon-Macé et al., 2005, Breitmeyer and Öğmen, 2006, Fahrenfort et al., 2007, Serre et al., 2007, Ghodrati et al., 2014). Our time-time decoding results (Figure 3d unoccluded) additionally support the recurrent explanation of backward masking: off-diagonal generalization in time-time decoding matrices is representative of recurrent interactions, and these off-diagonal components disappear when backward masking is present.
Considering the recurrent explanation of the masking effect, we further examined how recurrent processes contribute to object processing under occlusion. We found that backward masking significantly reduced both the MEG decoding accuracy time-course (Figure 3b) and subjects’ behavioral performance (Figure 4b), but only when objects were occluded. When occluded objects are masked, the MEG decoding time-course from 185ms to 237ms is significantly lower than the decoding time-course in the no-mask condition (Figure 3b, black horizontal lines; two-sided signrank test, FDR-corrected across time, p < 0.05). On the other hand, for un-occluded objects, there is no significant difference between the decoding time-courses of the mask and no-mask conditions (Figure 3a).
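To illustrate what FDR correction across time involves, here is a minimal sketch of the Benjamini-Hochberg step-up procedure applied to a vector of per-time-point p-values. This is an assumption: the text does not state which FDR procedure was used, and the function name `fdr_bh` and the example p-values are hypothetical.

```python
import numpy as np

def fdr_bh(p_values, q=0.05):
    """Benjamini-Hochberg procedure: boolean mask of rejected hypotheses
    (e.g. time points where mask and no-mask decoding differ)."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, n + 1) / n  # rank-dependent cutoffs
    below = p[order] <= thresholds
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])      # largest rank passing its cutoff
        reject[order[:k + 1]] = True          # reject everything up to that rank
    return reject

# Hypothetical p-values, one per time point
p_per_time = [0.20, 0.001, 0.04, 0.008, 0.60]
significant = fdr_bh(p_per_time, q=0.05)
```

With these example values only the second and fourth time points survive correction, even though the third has an uncorrected p below 0.05.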
Consistent with the MEG decoding results, while the masking significantly reduced behavioral categorization performances when objects were occluded, it had no significant effect on the categorization performance for the un-occluded objects (Figure 4b) [two-sided signrank test]. Particularly, the backward masking removed the late MEG decoding peak (around 200ms) under occlusion (Figure 3f) likely due to disruption of later recurrent interactions.
Taken together, we demonstrated that visual backward masking, which is known to disrupt recurrent processes (Lamme and Roelfsema, 2000, Lamme et al., 2002, Breitmeyer and Öğmen, 2006, Fahrenfort et al., 2007, Macknik and Martinez-Conde, 2007), significantly impairs object recognition only under occlusion. On the other hand, masking did not affect object processing under no occlusion, when information from the feedforward sweep is shown to be sufficient for object recognition. This provides further evidence for the essential role of recurrent processes in object recognition under occlusion.
2.4. A computational model with local recurrent interactions explains both neural and behavioral data under occlusion
Recent studies have shown that convolutional neural networks (CNNs) achieve human-level performance and explain neural data under non-challenging conditions, also referred to as core object recognition [(DiCarlo and Cox, 2007, Khaligh-Razavi and Kriegeskorte, 2014, Yamins et al., 2014); however, also see (Rajalingham et al., 2018)]. Here, we first examined whether such a feedforward CNN (i.e. AlexNet) can explain the observed human neuronal and behavioral data in a challenging object recognition task in which objects are occluded. The model accuracy was evaluated with the same object recognition task used to measure human behavioral performance (see Methods and Figure S4). To assess the model’s performance in explaining the human MEG data, we used representational similarity analysis (RSA), correlating time-resolved human MEG representations with those of the model, on the same set of stimuli (Figure 4a; also see Methods).
We found that in the no-occlusion condition the feedforward CNN could explain both the human behavioral performance and the MEG data. Significant correlation between the model and MEG representational dissimilarity matrices (RDMs) started at ~90ms after stimulus onset and remained significant for several hundred milliseconds, with two peaks at 150ms and 220ms (Figure 4c). However, the feedforward CNN failed to explain the human MEG data when objects were occluded, and the model's performance was significantly lower than that of humans in the occluded object recognition task.
We next asked whether a model with local recurrent connections can account for object recognition under occlusion. Inspired by recent advancements in deep convolutional neural networks (He et al., 2016a, He et al., 2016b, Liao and Poggio, 2016, Veit et al., 2016), we built a hierarchical recurrent ResNet (HRRN) that follows the hierarchy of the ventral visual pathway (Figure 5, also see Methods for more details about the model). The recurrent model (HRRN) could rival human performance in the occluded object recognition task (Figure 4b), performing strikingly better than AlexNet under 60% and 80% occlusion. The HRRN could additionally explain the human MEG data under occlusion [Figure 4c] (onset = 138±2ms; peak = 182±19ms).
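The RSA comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the dissimilarity measure (1 minus Pearson correlation of model features), the use of Spearman rank correlation between RDMs, and all names and array sizes are hypothetical choices, and the MEG RDMs are taken as given rather than derived from sensor data.

```python
import numpy as np

def rankdata(x):
    """Ranks starting at 1; tied values share their mean rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(x.size)
    ranks[order] = np.arange(1, x.size + 1)
    for value in np.unique(x):
        tie = x == value
        ranks[tie] = ranks[tie].mean()
    return ranks

def spearman(a, b):
    """Spearman correlation as the Pearson correlation of ranks."""
    return np.corrcoef(rankdata(a), rankdata(b))[0, 1]

def upper_triangle(rdm):
    """Vector of the unique off-diagonal entries of a symmetric RDM."""
    i, j = np.triu_indices(rdm.shape[0], k=1)
    return rdm[i, j]

def model_rdm(features):
    """features: (n_conditions, n_units) -> RDM of 1 - Pearson correlation."""
    return 1.0 - np.corrcoef(features)

def rsa_time_course(meg_rdms, model):
    """meg_rdms: (n_times, n_cond, n_cond). One model-MEG correlation per time."""
    m = upper_triangle(model)
    return np.array([spearman(upper_triangle(rdm), m) for rdm in meg_rdms])

# Synthetic demo: MEG RDMs are unrelated to the model early on,
# and resemble the model RDM from time index 5 onward.
rng = np.random.default_rng(0)
n_cond, n_units, n_times = 6, 50, 10
model = model_rdm(rng.normal(size=(n_cond, n_units)))
meg_rdms = rng.normal(size=(n_times, n_cond, n_cond))
meg_rdms[5:] = model + 0.005 * rng.normal(size=(5, n_cond, n_cond))
rsa = rsa_time_course(meg_rdms, model)
```

Plotting `rsa` against time would give a curve analogous in spirit to a model-to-MEG correlation time-course, with the correlation rising once the simulated MEG geometry matches the model.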
Overall, we demonstrated that a CNN with local recurrent connections could successfully explain the human MEG data and the behavioral categorization performances under both occlusion and no-occlusion conditions, whereas the feedforward CNN failed to achieve human-level performance under occlusion—in both MEG and behavior.
3. Discussion
We investigated how the human brain processes visual objects under occlusion. Using multivariate MEG analysis, behavioral data, backward masking and computational modeling, we demonstrated that recurrent processing plays a major role in object recognition under occlusion.
3.1. Beyond core object recognition
Several recent findings have indicated that a large class of object recognition tasks referred to as ‘core object recognition’ is mainly solved in humans within the first ~100 ms after stimulus onset (Thorpe, 2009, Liu et al., 2009, Carlson et al., 2013, Isik et al., 2014, Cichy et al., 2016), largely associated with the feedforward path of visual information processing (Lamme and Roelfsema, 2000, Khaligh-Razavi and Kriegeskorte, 2014, Yamins et al., 2014, Cadieu et al., 2014). More challenging tasks, such as object recognition under occlusion, go beyond the core recognition problem. So far it has not been clear whether the visual information from the feedforward sweep can fully account for this, or whether recurrent processing is essential for solving object recognition under occlusion.
3.1.1. Temporal dynamics
We found that under the no-occlusion condition, the MEG object-decoding time-course peaked at 140ms with an early onset at 79ms, consistent with findings from previous studies (DiCarlo and Cox, 2007, Carlson et al., 2013, Cichy et al., 2014, Isik et al., 2014). Furthermore, studies have reported that after approximately 150ms, a very complex and dynamic phase of visual processing may begin, deriving a category-specific semantic representation (Clarke et al., 2011, Clarke, 2015), which is likely to be driven by recurrent processes (Thorpe, 2009). In our study, when objects were occluded, object decoding accuracy peaked at 200ms, significantly later than the peak under no-occlusion (i.e. 140ms), suggesting the involvement of recurrent processes. Given the results from the temporal generalization analysis (Figure 2c) and the computational modelling, we argue for the engagement of mostly local recurrent connections, as opposed to long-range top-down feedback, in solving object recognition under occlusion for this image set. Previous studies also suggest that long-range top-down recurrence is prominently engaged after 200ms from stimulus onset (Tomita et al., 1999, Garrido et al., 2007, Liu et al., 2009, Goddard et al., 2016).
The additional time needed for processing occluded objects may specifically facilitate object recognition by providing integrated semantic information from the visible parts of the target objects. In other words, partial semantic information (e.g. having wheels, having legs, etc.) may activate prior information associated with the category of the target object (Clarke and Tyler, 2014, Clarke, 2015). Overall, these observations suggest that the temporal delay observed under 60% occlusion is best explained by the engagement of recurrent processes, mostly local recurrent connections.
3.1.2. Computational modeling
Feedforward CNNs have been shown to be able to account for core object recognition (Cadieu et al., 2014, Khaligh-Razavi and Kriegeskorte, 2014, Yamins et al., 2014, Güçlü and van Gerven, 2015, Kubilius et al., 2016, Kheradpisheh et al., 2016a, Kheradpisheh et al., 2016b, Khaligh-Razavi et al., 2017). The natural question to ask next is whether these models perform similarly well under more challenging conditions, beyond core object recognition. To address this, we compared a conventional feedforward CNN with a recurrent convolutional network in terms of their object recognition performance, and their representational similarity with the human MEG data, under the challenging condition of occlusion. The feedforward CNN achieved human-level performance only when objects were not occluded; it performed significantly worse than both the humans and the recurrent network when objects were occluded. The feedforward CNN also failed to explain human neural data when objects were occluded. On the other hand, the convolutional network with local recurrent connections could achieve human-level performance in occluded object recognition and explained a significant amount of variance in the human neural data. This demonstrates that conventional feedforward CNNs (such as AlexNet) do not account for object recognition under such challenging conditions, where recurrent computations make a prominent contribution.
3.2. Object occlusion vs. object deletion
Object recognition when part of an object is removed without an occluder is one of the challenging conditions that has been previously studied (Wyatte et al., 2014, Tang et al., 2014, Tang et al., 2017) and may partly look similar to occlusion. However, as shown by Johnson and Olshausen (2005) deleting part of an object is different from occluding it with another object. Occlusion occurs when an object or shape appears in front of another one (Johnson and Olshausen, 2005), in which case the occluding object might serve as an additional cue for object completion. On the other hand, deletion occurs when part of an object is removed without additional cues about the shape or the place of the missing part. Given the difference between these two phenomena at the level of stimulus set, dynamics of object processing (and the underlying computational mechanisms) will likely be different when part of an object is occluded compared to when it is deleted. Future studies need to further characterize how these two may differ.
3.3. Does a feedforward system with arbitrarily long depth work the same as a recurrent system with limited depth?
While conventional CNNs could not account for object recognition beyond the core recognition problem, we do not rule out the possibility that much deeper CNNs could perform better under such challenging conditions.
Computational studies have shown that very deep CNNs outperform shallow ones on a variety of object recognition tasks (Simonyan and Zisserman, 2014, Szegedy et al., 2015, Taigman et al., 2014). Specifically, residual learning allows for much deeper neural networks, with hundreds (He et al., 2016a) and even thousands (He et al., 2016b) of layers, providing better performance. This is due to the fact that the complex functions that can be represented by deeper architectures cannot be represented by shallow architectures (Bengio and LeCun, 2007). Recent computational modeling studies have tried to clarify why increasing the depth of a network can improve its performance (Liang and Hu, 2015, Liao and Poggio, 2016). These efforts have demonstrated that unfolding a recurrent architecture across time leads to a feedforward network with arbitrary depth, in which the weights are shared among the layers. Although such a recurrent network has far fewer parameters, Liao and Poggio (2016) have empirically shown that it performs as well as a very deep feedforward network without shared weights. We also showed that a very deep ResNet (e.g. with 150 layers) can be reformulated as a recurrent CNN with far fewer layers (e.g. five layers) (Figure 5). Thus, a recurrent hierarchical network with far fewer layers is a compact architecture that resembles these very deep networks in terms of performance. This compact architecture is probably the one the human visual system has adopted (Lamme et al., 1998, Sporns and Zwi, 2004), given the biological constraints on having a very deep neural network inside the brain (Dunbar, 1992, Kaas, 2000, Weaver, 2005, Isler and van Schaik, 2009, Bosman and Aboitiz, 2015).
From a computational viewpoint, recognition of complex images might require more processing effort; in other words, such images might need to go through more layers of processing to be prepared for the final readout. Similarly, in a recurrent architecture, more processing means more iterations. Our modeling results support this assumption, showing that under more challenging recognition tasks, more iterations are required to reach human-level performance.
While the feedforward path of the HRRN (i.e. no local recurrent engaged) was sufficient for achieving human-level performance under no-occlusion, under occlusion the model reached human-level performance only when the local recurrent connections were enabled. Under 60% and 80% occlusion, the model reached human-level performance after going through 13 and 43 local recurrent stages, respectively (Figure S5).
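The equivalence between a recurrent block and a deep weight-shared feedforward network can be checked numerically. The sketch below is a generic toy residual update, h ← h + ReLU(W h), and is not the HRRN; the layer size, number of iterations, and function names are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def recurrent_residual(x, W, n_iter):
    """One residual block applied n_iter times with the same weights W."""
    h = x
    for _ in range(n_iter):
        h = h + relu(W @ h)
    return h

def feedforward_residual(x, weights):
    """A stack of residual layers, one per weight matrix in the list."""
    h = x
    for W in weights:
        h = h + relu(W @ h)
    return h

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(8, 8))
x = rng.normal(size=8)

out_recurrent = recurrent_residual(x, W, n_iter=5)
out_unrolled = feedforward_residual(x, [W] * 5)  # 5 layers sharing W

params_recurrent = W.size      # one weight matrix, reused at every iteration
params_untied = 5 * W.size     # a 5-layer network without weight sharing
```

The two outputs are identical, while the recurrent form stores a fifth of the parameters that an untied five-layer stack would need; whether tied and untied networks perform comparably is an empirical question addressed by the studies cited above, not by this sketch.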
3.4. The neural basis of masking effect
Backward masking is a useful tool for studying temporal dynamics of visual object processing (Lamme et al., 2002, Breitmeyer and Öğmen, 2006). It can impair recognition of the target object and reduce or eliminate perceptual visibility through the presentation of a second stimulus (mask) immediately or with an interval after the target stimulus, e.g. 50 ms after the target’s onset. While the origin of masking effect was not the focus of the current study, our MEG results could provide some insights about the underlying mechanisms of backward masking.
There are several accounts of backward masking in the literature: Breitmeyer and Ganz (1976) provided a purely feedforward explanation (two-channel model), arguing that the mask travels rapidly through the fast channel disrupting recognition of the target object traveling through the slow channel. A number of other studies, however, suggest that the masking mainly interferes with the top-down feedback processes (Lamme and Roelfsema, 2000, Lamme et al., 2002, Breitmeyer and Öğmen, 2006, Fahrenfort et al., 2007). And finally, Macknik and Martinez-Conde (2007) explain the masking effect by the lateral inhibition mechanism of neural circuits within different levels of the visual hierarchy; arguing that the mask interferes with the recognition of the target object through lateral inhibition (i.e. inhibitory interactions between target and mask).
The last two accounts of masking, while being different, both argue for the disruption of recurrent processes by the mask: either the top-down recurrent processes, or the local recurrent processes (e.g. lateral interactions). With a short interval between the target and mask, the mask may interfere with the fast recurrent processes (i.e. local recurrent) while with a relatively long interval it may interfere with the slow recurrent processes (i.e. top-down feedback).
Our results from MEG decoding time-courses, time-time decoding and behavioral performance under the no-occlusion condition do not support the purely feedforward account of visual backward masking. We showed that backward masking did not significantly disrupt the fast feedforward processes of object recognition under no occlusion (MEG: Figure 3a; behaviorally: Figure 4b). On the other hand, when objects were occluded, backward masking significantly impaired object recognition both behaviorally (Figure 4b) and neurally (Figure 3b). Additionally, the time-time decoding results (Figure 3c,3d,3f) showed that backward masking, under no occlusion, did not significantly disrupt the diagonal component of the temporal generalization matrix that is mainly associated with the feedforward path of visual processing. On the other hand, the masking removed the off-diagonal components and the late peak (>200ms) observed in the temporal generalization matrix of the occluded objects.
Taken together, our MEG and behavioral results are in favor of a recurrent account for backward masking. Particularly in our experiment with a short stimulus onset asynchrony (SOA = time from stimulus onset to the mask onset), the mask seems to have affected mostly the local recurrent connections.
4. Methods
4.1. Occluded objects image set
Images of four different object categories (i.e. camel, deer, car, and motorcycle) were used as the stimulus set (Figure 1b). Object images were transformed to be similar in terms of size and contrast level. To generate an occluded image, we iteratively added several black circles (as artificial occluders) of different sizes at random positions on the image. To simulate the type of occlusion that occurs in natural scenes, the black circles were positioned both in front of and behind the target object. The percent of object occlusion is defined as the percent of the target object covered by the black occluders. We defined three levels of occlusion: 0% (no occlusion), 60% occlusion and 80% occlusion. Black circles were also present in the 0% occlusion condition, but did not cover the target object; this was to make sure that the difference observed between occluded and un-occluded objects cannot be solely explained by the presence of these circles. The generated image set comprises 12 conditions: four objects × three occlusion levels. For each condition, we generated M = 64 sample images varying in occlusion pattern and target object position. To remove the potential effect of low-level visual features on object discrimination, object positions were slightly shifted around the center of the images (by Δx ≤ 15, Δy ≤ 15 pixels). Overall, we controlled for low-level image statistics such that images of different levels of occlusion could not be discriminated simply by using low-level visual features (i.e. Gist and V1 model).
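The iterative occlusion procedure and the definition of percent occlusion can be sketched as below. This is a simplified illustration, not the stimulus-generation code: it places circles only in front of the object, uses a square as a placeholder object silhouette, and the function name `occlude`, the image size, and the radius range are all hypothetical.

```python
import numpy as np

def occlude(object_mask, target_fraction, rng, radius_range=(2, 4), max_circles=1000):
    """Add random black circles until at least target_fraction of the object
    pixels are covered. Returns a boolean occluder map of the image size."""
    h, w = object_mask.shape
    yy, xx = np.mgrid[:h, :w]
    occluder = np.zeros((h, w), dtype=bool)
    n_object = object_mask.sum()
    for _ in range(max_circles):
        # Percent occlusion = covered object pixels / all object pixels
        covered = (occluder & object_mask).sum() / n_object
        if covered >= target_fraction:
            break
        r = rng.integers(radius_range[0], radius_range[1] + 1)
        cy, cx = rng.integers(0, h), rng.integers(0, w)
        occluder |= (yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2
    return occluder

rng = np.random.default_rng(0)
object_mask = np.zeros((64, 64), dtype=bool)
object_mask[16:48, 16:48] = True  # placeholder silhouette: a 32 x 32 square

occluder = occlude(object_mask, 0.6, rng)
covered = (occluder & object_mask).sum() / object_mask.sum()
```

Because circles are added one at a time, the final coverage slightly overshoots the target; with small circles relative to the object the overshoot stays within a few percent.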
4.2. Participants and MEG experimental design
Fifteen young volunteers (22-38 years old, all right-handed; 7 female) participated in the experiment. The study was conducted according to the Declaration of Helsinki. The experiment protocol was approved by the local committee on the use of humans as experimental subjects. Volunteers completed a consent form before participating in the experiment and were financially compensated after finishing the experiment.
During the experiment, participants completed eight runs; each run consisted of 192 trials and lasted for approximately eight minutes (total experiment time for each participant = ~70min). Each trial started with a 1sec fixation followed by a 34ms presentation of an object image (6° visual angle). In half of the trials, we employed backward masking, in which a dynamic mask was presented for 102ms shortly after the stimulus offset, with an inter-stimulus interval (ISI) of 17ms (Figure S1). In each run, each object image (i.e. camel, deer, car, motorcycle) was repeated 8 times at each occlusion level without backward masking, and another 8 times with backward masking. In other words, each condition (i.e. combination of object image, occlusion level, and mask or no-mask) was repeated 64 times over the duration of the whole experiment.
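The trial counts above follow from the factorial design, as the short check below illustrates. The variable names are ours; the numbers are taken from the text, and the stimulus onset asynchrony is derived as stimulus duration plus ISI.

```python
from itertools import product

objects = ["camel", "deer", "car", "motorcycle"]
occlusion_levels = [0, 60, 80]   # percent of the object covered
masking = [False, True]          # without / with backward mask
reps_per_run, n_runs = 8, 8

conditions = list(product(objects, occlusion_levels, masking))
trials_per_run = len(conditions) * reps_per_run   # 24 * 8
reps_per_condition = reps_per_run * n_runs        # 8 * 8

stimulus_ms, isi_ms = 34, 17
soa_ms = stimulus_ms + isi_ms    # stimulus onset to mask onset

print(len(conditions), trials_per_run, reps_per_condition, soa_ms)
```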
Every 1-3 trials, a question mark appeared on the screen (for 1.5 sec), prompting participants to choose animate if the last stimulus was deer/camel and inanimate if the last stimulus was car/motorcycle (Figure S1; see Figure S6 for behavioral performance on the animate/inanimate task).
Participants were instructed to only respond and blink during the question trials to prevent contamination of MEG signals with motor activity and the eye-blink artifact. The question trials were excluded from further MEG analyses.
The dynamic mask was a sequence of random images (n = 6 images; each presented for 17ms) selected from a pool of the synthesized mask images. They were generated by using a texture synthesis toolbox that is available at: http://www.cns.nyu.edu/~lcv/texture/ (Portilla and Simoncelli, 2000). The synthesized images have low-level feature statistics similar to the original stimuli.
4.3. MEG acquisition
To acquire brain signals with millisecond temporal resolution, we used a 306-sensor MEG system (Elekta Neuromag, Stockholm). Data were sampled at 1000Hz and band-pass filtered online between 0.03 and 330 Hz. To reduce noise and correct for head movements, raw data were cleaned by spatiotemporal filters [Maxfilter software, Elekta, Stockholm; (Taulu and Simola, 2006)]. Further pre-processing was conducted using the Brainstorm toolbox (Tadel et al., 2011). Trials were extracted from −200ms to 700ms relative to the stimulus onset. The signals were then normalized by their baseline (−200ms to 0ms) and temporally smoothed by low-pass filtering at 20Hz.
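As an illustration of the baseline step, the sketch below z-scores each trial and sensor by the mean and standard deviation of its pre-stimulus window. The text only states that signals were "normalized by their baseline", so z-scoring is an assumption here, as are the function name and array layout; the 20Hz low-pass filtering step is not shown.

```python
import numpy as np

def baseline_normalize(epochs, times_ms, baseline=(-200, 0)):
    """Z-score each trial/sensor by its baseline window (assumed normalization).

    epochs: array of shape (n_trials, n_sensors, n_times)
    times_ms: array of shape (n_times,), time relative to stimulus onset
    """
    in_base = (times_ms >= baseline[0]) & (times_ms < baseline[1])
    mean = epochs[:, :, in_base].mean(axis=2, keepdims=True)
    std = epochs[:, :, in_base].std(axis=2, keepdims=True)
    return (epochs - mean) / std

# Toy data: 4 trials, 3 sensors, 1 ms resolution from -200 to 700 ms.
rng = np.random.default_rng(0)
times = np.arange(-200, 701)
epochs = rng.normal(5.0, 2.0, size=(4, 3, times.size))
z = baseline_normalize(epochs, times)
```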
4.4. Behavioral task of multiclass object recognition
We ran a psychophysical experiment, outside the MEG, to evaluate human performance on a multi-class occluded object recognition task. Sixteen subjects participated in a session lasting about 40 minutes. The experiment was a combination of mask and no-mask trials that were randomly distributed across the experiment. Each trial started with a fixation point presented for 0.5s, followed by a stimulus presentation of 34ms. In the masked trials, a dynamic mask of 102ms was presented after a short ISI of 17ms (Figure S4). Subjects were instructed to respond accurately and as soon as possible after detecting the target stimulus. They were asked to categorize the object images by pressing one of four pre-assigned keys on a keyboard corresponding to the four object categories: camel, deer, car, and motorcycle.
Overall, 16 human subjects (25 to 40 years old) participated in this experiment. Before the experiment, participants performed a short training phase on a completely different image set to learn the task and reach a predefined performance level in the multi-class object recognition task. The main experiment consisted of 768 trials that were randomly distributed into four blocks of 192 trials (32 repetitions of object images with small variations in position and the pattern of occlusion × three occlusion levels × two masking conditions × four object categories = 768). Images of 256×256 pixels were presented at a distance of 70 cm at the center of a CRT monitor with a frame rate of 60 Hz and a resolution of 1024×768. We used the MATLAB-based Psychophysics Toolbox (Pelli, 1997).
4.5. Multivariate pattern analyses (MVPA)
4.5.1. Pairwise decoding analysis
To measure the temporal dynamics of object information processing, we used pairwise decoding analysis on the MEG data (Isik et al., 2014, Cichy et al., 2014, Kietzmann et al., 2017). For each subject, at each time-point, we created a data matrix of 64 trials × 306 sensors per condition. We used a support vector machine (SVM) to pairwise decode any two conditions, with a leave-one-out cross-validation approach. At each time-point, for each condition, N-1 pattern vectors were used to train the linear classifier [SVM; LIBSVM, (Chang and Lin, 2011), software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/], and the remaining Nth vector was used for evaluation. The above procedure was repeated 100 times with random reassignment of the data to training and testing sets. The decoding accuracies were then averaged over the six pairwise comparisons. The SVM decoding analysis was performed independently for each subject, and we report the average decoding performance across subjects (Figure 1a).
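A minimal sketch of this procedure at a single time-point is given below. It uses scikit-learn's `SVC` with a linear kernel as a stand-in for the LIBSVM interface used in the study; the function name, the data layout (a dictionary of condition name to trials × sensors array), and the way the held-out trial is drawn at random on each repetition are assumptions.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def pairwise_decoding(data, n_repeats=100, seed=0):
    """Average leave-one-out pairwise decoding accuracy at one time-point.

    data: dict mapping condition name -> array (n_trials, n_sensors).
    """
    rng = np.random.default_rng(seed)
    accuracies = []
    for a, b in combinations(sorted(data), 2):
        xa, xb = data[a], data[b]
        correct = 0
        for _ in range(n_repeats):
            # Hold out one randomly chosen trial per condition.
            ia, ib = rng.integers(len(xa)), rng.integers(len(xb))
            x_train = np.vstack([np.delete(xa, ia, axis=0),
                                 np.delete(xb, ib, axis=0)])
            y_train = np.r_[np.zeros(len(xa) - 1), np.ones(len(xb) - 1)]
            clf = SVC(kernel="linear").fit(x_train, y_train)
            pred = clf.predict(np.vstack([xa[ia], xb[ib]]))
            correct += int((pred == np.array([0, 1])).sum())
        accuracies.append(correct / (2 * n_repeats))
    # Average over all condition pairs (six pairs for four objects).
    return float(np.mean(accuracies))

# Toy data: four well-separated conditions, 16 trials x 10 sensors each.
rng = np.random.default_rng(1)
names = ["camel", "deer", "car", "motorcycle"]
data = {n: rng.normal(loc=3.0 * k, size=(16, 10)) for k, n in enumerate(names)}
acc = pairwise_decoding(data, n_repeats=10)
```

In a full analysis this function would be called once per time-point and per subject, and the resulting time-courses averaged across subjects.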
4.5.2. Time-time decoding analysis
We also report time-time decoding accuracies, obtained by cross-decoding across time. For each pair of objects, an SVM classifier was trained at a given time and tested at all other time-points, thus showing the generalization of the classifier across time. The results were then averaged across all the pairwise classifications. This yields an 801×801 (t=-100 to 700 ms) matrix of average pairwise decoding accuracies for each subject. Figure 2 shows the time-time decoding matrices averaged across 15 participants. To test for statistical significance, we used a one-sided signrank test against chance level and then corrected for multiple comparisons using FDR.
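The cross-decoding across time could be sketched as follows for one pair of conditions. Again scikit-learn's linear `SVC` stands in for LIBSVM, and the fixed train/test split is a simplifying assumption; the cross-validation scheme used for the time-time analysis is not spelled out in the text.

```python
import numpy as np
from sklearn.svm import SVC

def temporal_generalization(xa, xb, train_idx, test_idx):
    """Train at each time-point, test at every time-point.

    xa, xb: arrays (n_trials, n_sensors, n_times) for two conditions.
    Returns an (n_times, n_times) accuracy matrix:
    rows = training time, columns = testing time.
    """
    n_times = xa.shape[2]
    y_train = np.r_[np.zeros(len(train_idx)), np.ones(len(train_idx))]
    y_test = np.r_[np.zeros(len(test_idx)), np.ones(len(test_idx))]
    acc = np.zeros((n_times, n_times))
    for t_train in range(n_times):
        x_train = np.vstack([xa[train_idx, :, t_train],
                             xb[train_idx, :, t_train]])
        clf = SVC(kernel="linear").fit(x_train, y_train)
        for t_test in range(n_times):
            x_test = np.vstack([xa[test_idx, :, t_test],
                                xb[test_idx, :, t_test]])
            acc[t_train, t_test] = (clf.predict(x_test) == y_test).mean()
    return acc

# Toy data: a class difference that is present only at time indices 2..4.
rng = np.random.default_rng(2)
xa = rng.normal(size=(20, 10, 6))
xb = rng.normal(size=(20, 10, 6))
xa[:, :, 2:5] += 2.0
xb[:, :, 2:5] -= 2.0
tg = temporal_generalization(xa, xb, np.arange(10), np.arange(10, 20))
```

In this toy example the sustained pattern produces high off-diagonal accuracy between time indices 2 and 4, which is the kind of generalization the temporal generalization matrix is meant to reveal.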
4.5.3. Sensorwise decoding analysis
We also examined a sensorwise visualization of pairwise object decoding across time (Supp Movies 1 & 2). To this end, we trained and tested the SVM classifier at the level of sensors (i.e. combination of three neighboring sensors) across all 306 sensors. First, the 306 MEG sensors were grouped into 102 triplets (Elekta Triux system; 2 gradiometers and 1 magnetometer in each location). At each time-point, we applied the same pairwise decoding procedure as explained in 4.5.1, this time at the level of groups of 3 adjacent sensors (instead of taking all 306 MEG sensors together). Average pairwise decoding accuracies across subjects, at each time-point, are color-coded across the head surface. We used black dots to indicate channels with significantly above-chance accuracy (FDR-corrected across both time and sensors), and gray dots to show accuracies with p<0.05 before correcting for multiple comparisons. At each time-point, we also mark the channel with peak decoding accuracy with a red dot.
4.5.4. Representational similarity analysis (RSA) over time
We used representational similarity analysis (RSA) (Kriegeskorte, 2009, Kriegeskorte and Kievit, 2013, Cichy et al., 2014, Carlson et al., 2013, Khaligh-Razavi et al., 2016) to compare representations of computational models with time-resolved representations derived from MEG data.
For the MEG data, representational dissimilarity matrices (RDM) were calculated at each time-point by computing the dissimilarity (1 - Spearman’s R) between all pairs of the MEG patterns elicited by object images. Time-resolved MEG RDMs were then correlated (Spearman’s R) with the computational model RDMs, yielding a correlation vector over time (Figure 4a).
To construct CNN model RDMs, we used the features extracted from the penultimate layer of the networks (i.e. the layer before the softmax operation). Significant correlations were determined by a one-sided signrank test (p < 0.05, FDR-corrected across time).
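The RDM construction and the model comparison over time could be sketched as below. The helper names and array layout are assumptions, and the simple rank function does not handle ties (it assumes continuous-valued data); the sketch compares only the upper triangle of the two RDMs.

```python
import numpy as np

def rank(x):
    """Ranks of a 1-D array; ties are not handled (assumed absent)."""
    r = np.empty(len(x))
    r[np.argsort(x)] = np.arange(len(x))
    return r

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of ranks."""
    return np.corrcoef(rank(x), rank(y))[0, 1]

def rdm(patterns):
    """patterns: (n_conditions, n_features) -> RDM of 1 - Spearman's R."""
    ranked = np.apply_along_axis(rank, 1, patterns)
    return 1.0 - np.corrcoef(ranked)

def rdm_correlation(rdm_a, rdm_b):
    """Spearman correlation between the upper triangles of two RDMs."""
    iu = np.triu_indices_from(rdm_a, k=1)
    return spearman(rdm_a[iu], rdm_b[iu])

def rsa_time_course(meg, model_rdm):
    """meg: (n_conditions, n_sensors, n_times) -> correlation per time-point."""
    return np.array([rdm_correlation(rdm(meg[:, :, t]), model_rdm)
                     for t in range(meg.shape[2])])

# Toy check: at time index 1 the "MEG" patterns equal the model features.
rng = np.random.default_rng(0)
features = rng.normal(size=(12, 50))
model_rdm = rdm(features)
meg = rng.normal(size=(12, 50, 2))
meg[:, :, 1] = features
corr = rsa_time_course(meg, model_rdm)
```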
4.6. Significance Testing
We used the non-parametric Wilcoxon signrank test (Gibbons and Chakraborti, 2011) for random effect analysis. To determine time-points with significantly above-chance decoding accuracy (or significant RDM correlations), we used a right-sided signrank test across n = 15 participants. To adjust p-values for multiple comparisons (e.g. across time), we further applied the false discovery rate (FDR) correction (Benjamini and Hochberg, 1995), as implemented in the RSA-Toolbox (Nili et al., 2014), available from https://github.com/rsagroup/rsatoolbox.
To determine whether two time-courses (e.g. correlation or decoding) are significantly different at any time interval, we used a two-sided signrank test, FDR corrected across time.
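For reference, the Benjamini-Hochberg step-up procedure can be written in a few lines. The sketch below is a generic implementation, not the RSA-Toolbox code; it takes per-time-point p-values as given (for instance from a signed-rank test across participants) and returns a boolean mask of significant time-points at FDR level `q`.

```python
import numpy as np

def fdr_bh(p_values, q=0.05):
    """Benjamini-Hochberg FDR control; returns a boolean significance mask."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Compare the k-th smallest p-value against q * k / n.
    thresholds = q * np.arange(1, n + 1) / n
    below = p[order] <= thresholds
    significant = np.zeros(n, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()
        # Step-up rule: everything up to the largest passing rank is significant.
        significant[order[:k + 1]] = True
    return significant

mask = fdr_bh([0.5, 0.001, 0.9, 0.025, 0.015])
step_up = fdr_bh([0.02, 0.021, 0.022])
```

The second call illustrates the step-up behavior: the smallest p-value alone would not pass its own threshold, but all three are declared significant because the largest one passes.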
Onset latency
We defined onset latency as the earliest time at which performance became significantly above chance for at least ten consecutive milliseconds. The mean and standard deviation (SD) of onset latencies were calculated using a leave-one-subject-out procedure, repeated N = 15 times.
Peak latency
The peak latency was defined as the time at which the decoding accuracy reached its maximum value. The mean and SD of peak latencies were calculated in the same way as for the onset latencies.
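The two latency definitions translate directly into code. In the sketch below, function names are ours, and one sample is assumed to correspond to one millisecond (consistent with the 1000Hz sampling rate); the leave-one-subject-out estimation of mean and SD is not shown.

```python
import numpy as np

def onset_latency(significant, times_ms, min_run=10):
    """Earliest time that starts a run of >= min_run significant samples."""
    run = 0
    for i, is_sig in enumerate(significant):
        run = run + 1 if is_sig else 0
        if run == min_run:
            return int(times_ms[i - min_run + 1])
    return None  # no sufficiently long significant run

def peak_latency(accuracy, times_ms):
    """Time at which the decoding accuracy is maximal."""
    return int(times_ms[int(np.argmax(accuracy))])

times = np.arange(-100, 701)
sig = np.zeros(times.size, dtype=bool)
sig[150:155] = True   # 5 ms run at 50-54 ms: too short to count
sig[180:200] = True   # 20 ms run starting at 80 ms
accuracy = np.full(times.size, 50.0)
accuracy[250] = 80.0  # maximum at 150 ms

onset = onset_latency(sig, times)
peak = peak_latency(accuracy, times)
```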
4.7. Computational modeling
4.7.1. Feedforward computational model (AlexNet)
We used a well-known CNN (AlexNet) (Krizhevsky et al., 2012) that has been shown to account for core object recognition (Khaligh-Razavi and Kriegeskorte, 2014, Cadieu et al., 2014, Kheradpisheh et al., 2016a, Kheradpisheh et al., 2016b). CNNs are cascades of hierarchically organized feature extraction layers. Each layer has several hundred convolutional filters, and each convolutional filter scans various places on the input, generating a feature map at its output. A convolutional layer may be followed by a local or global pooling layer that merges the outputs of a group of units. The pooling layers make the feature maps invariant to small variations (Bengio and LeCun, 2007). AlexNet has eight cascading layers: five convolutional layers and three fully-connected layers (Krizhevsky et al., 2012). A pre-trained version of the model, trained on 1.2 million images from the ImageNet dataset (Russakovsky et al., 2015), is used for the experiments here. We used the features extracted by the fc7 layer (before the softmax operation) as the model output.
4.7.2. Hierarchical Recurrent ResNet (HRRN)
In convolutional neural networks, performance in visual recognition tasks can be substantially improved by increasing the depth of the network (Simonyan and Zisserman, 2014, Szegedy et al., 2015, He et al., 2015). However, this comes at a cost: deeper networks built by simply stacking layers (plain nets) have higher training errors due to the vanishing gradient (degradation) problem (Glorot and Bengio, 2010), which prevents convergence in the training phase. To address this problem, He et al. (2016a) introduced a deep residual learning framework. Residual networks can overcome the vanishing gradient problem during learning by employing identity shortcut connections that allow bypassing residual layers. This framework enables training ultra-deep networks, e.g. with 1202 layers, leading to much better performance compared to shallower networks (He et al., 2016a, He et al., 2016b).
Residual connections give ResNet the interesting characteristic of having several possible pathways of different lengths from the network’s input to the output, instead of a single deep pathway (Veit et al., 2016). For example, the ultra-deep 152-layer ResNet in its simplest form (i.e. skipping all the residual layers) is a hierarchy of five convolutional layers. By including additional residual layers, more complex networks with various depths are constructed [see table 1 in (He et al., 2016a)]. In this study, we proposed a generalization of this convolutional neural network by redefining residual layers as local recurrent connections. As shown in Figure 5, we reformulated the 152-layer ResNet of He et al. (2016a) into the form of a five-layer convolutional network with folded residual layers as its local recurrent connections. The model is pre-trained on the ImageNet 2012 dataset with a training set similar to that of AlexNet (1.2 million training images). It has been shown experimentally that an unfolded recurrent CNN (with shared weights) is similar to a very deep feedforward network with non-shared weights (Liao and Poggio, 2016). In our analyses, we used the extracted features of the penultimate layer (i.e. layer pool5, which is before the softmax layer) as the model output.
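The correspondence between a folded residual layer and a local recurrent connection can be illustrated with a toy example. The sketch below is not the HRRN implementation: it uses small dense matrices instead of convolutions, a hypothetical residual branch `residual_function`, and shared weights across iterations, under which the recurrent form and the unfolded stack of residual blocks give identical outputs.

```python
import numpy as np

def residual_function(h, w):
    """Toy residual branch F(h): a linear map followed by ReLU."""
    return np.maximum(h @ w, 0.0)

def recurrent_layer(h, w, n_steps):
    """Folded form: one layer whose output feeds back into itself."""
    for _ in range(n_steps):
        h = h + residual_function(h, w)   # identity shortcut + residual
    return h

def unfolded_resnet(h, weights):
    """Unfolded form: a feedforward stack of residual blocks."""
    for w in weights:
        h = h + residual_function(h, w)
    return h

rng = np.random.default_rng(3)
h0 = rng.normal(size=(2, 8))
w = 0.1 * rng.normal(size=(8, 8))

folded = recurrent_layer(h0, w, n_steps=3)
unfolded = unfolded_resnet(h0, [w, w, w])   # same weights in every block
```

With tied weights the two computations coincide exactly; with a different weight matrix per block, the unfolded network is the more general feedforward ResNet that the recurrent form approximates.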
Acknowledgement
The study was conducted at the Athinoula A. Martinos Imaging Center at the McGovern Institute for Brain Research, Massachusetts Institute of Technology. We would like to thank Aude Oliva and Dimitrios Pantazis for their help and support in conducting this study. We would also like to thank Radoslaw Martin Cichy for helpful comments.