Abstract
Images of iconic buildings, for example, the Empire State Building, instantly transport us to New York City. Despite the substantial impact of architectural design on people’s visual experience of built environments, we know little about its neural representation in the human brain. We have found patterns of neural activity associated with specific architectural styles in a network of several high-level visual brain regions including the scene-selective parahippocampal place area (PPA). Surprisingly, this network, which is characterized by correlated error patterns, includes the fusiform face area. Accuracy of decoding architectural styles from the PPA was negatively correlated with expertise in architecture, indicating a shift from purely visual cues to the use of domain knowledge with increasing expertise. Our study showcases that neural representations of architectural styles in the human brain are driven not only by perceptual features but also by semantic and cultural facets, such as expertise for architectural styles.
As of 2014, more than half of the world’s population resided in urban environments [1]. Architectural design has a profound impact on people’s preferences and productivity in such built environments [2]. Despite the ubiquity and importance of architecture for people’s lives, it is so far unknown where and how architectural styles are represented in people’s brains. Here we show that architectural styles are represented in distributed patterns of neural activity in several visually active brain regions in ventral temporal cortex but not in primary visual cortex.
In a functional magnetic resonance imaging (fMRI) scanner, 23 students in their final year at The Ohio State University (11 majoring in architecture, 12 majoring in psychology or neuroscience, one psychology major excluded due to excessive head motion) passively viewed blocks of images. Each block comprised four images from one of the following sixteen categories: (1) representative buildings of four architectural styles (Byzantine, Renaissance, Modern, and Deconstructive); (2) representative buildings designed by four famous architects (Le Corbusier, Antoni Gaudi, Frank Gehry, and Frank Lloyd Wright); (3) four entry-level scene categories (mountains, pastures, highways, and playgrounds); and (4) photographs of faces of four different non-famous men (Fig. 1). Brain activity was recorded in 35 coronal slices, which covered approximately the posterior 70% of the brain. For each participant, several visually active regions of interest (ROIs) were functionally localized: the parahippocampal place area (PPA), the occipital place area (OPA), the retrosplenial cortex (RSC), the lateral occipital complex (LOC), and the fusiform face area (FFA). Primary visual cortex (V1) was delineated based on anatomical atlases.
Following standard pre-processing, data from the image blocks were subjected to a multi-voxel pattern analysis (MVPA). For each of the four groups of stimuli, a linear support vector machine decoder was trained to discriminate between the activity patterns associated with each of the four sub-categories. The decoder was tested on independent data in leave-one-run-out (LORO) cross-validation. Separate decoders were trained and tested for each participant and each ROI. Accuracy was compared to chance (25%) at the group level using one-tailed t-tests.
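As a concrete illustration, the per-ROI decoding scheme described above (linear SVM, leave-one-run-out cross-validation) can be sketched with scikit-learn. The arrays below are hypothetical stand-ins for one participant's block-averaged ROI patterns; this is a minimal sketch, not the study's actual pipeline.

```python
# Sketch of the per-ROI MVPA: linear SVM with leave-one-run-out cross-validation.
# `patterns`, `labels`, and `runs` are hypothetical placeholder data for one
# participant (9 runs x 4 blocks per condition, 150 voxels in the ROI).
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
n_runs, blocks_per_run, n_voxels = 9, 4, 150
patterns = rng.standard_normal((n_runs * blocks_per_run, n_voxels))
labels = np.tile(np.arange(4), n_runs)             # four sub-categories
runs = np.repeat(np.arange(n_runs), blocks_per_run)

# Leave-one-run-out: train on 8 runs, test on the held-out run, repeat 9 times.
scores = cross_val_score(LinearSVC(), patterns, labels,
                         groups=runs, cv=LeaveOneGroupOut())
mean_accuracy = scores.mean()   # compared to chance (0.25) at the group level
```

With pure noise input, as here, accuracy should hover around chance; the group-level one-tailed t-test against 25% then decides whether real ROI patterns carry category information.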
Results
Successful decoding of architectural categories from human visual cortex
Consistent with previous results [3, 4, 5], we could decode entry-level scene categories from all visually active ROIs (Fig. 1A). Furthermore, we could decode architectural styles from all five high-level visual brain regions, but not from V1 (Fig. 1B). In addition, it was possible to decode buildings by famous architects from brain activity in the PPA, the OPA, and the LOC, but not from V1, the RSC, or the FFA (Fig. 1C). Decoding of facial identity succeeded only in V1 and was not possible in any of the high-level ROIs, including the FFA. We also found statistically significant differences in average activity levels between sub-categories for the categorization conditions in a subset of the ROIs. However, such differences were not sufficient to allow for 4-way decoding. The discrimination between sub-categories was only possible by considering the spatial pattern of brain activity within ROIs.
Searchlight analysis of the scanned parts of the brain confirmed the ROI-based results. The searchlight map of decoding entry-level scene categories showed significant clusters at both occipital poles and calcarine gyri as well as in bilateral lingual, fusiform, and parahippocampal gyri and bilateral transverse occipital sulci. On the other hand, the searchlight map of decoding architectural styles showed clusters encompassing bilateral fusiform gyri and transverse occipital sulci, but not the occipital poles and nearby areas. The searchlight map for decoding buildings by famous architects was similar to that of decoding architectural styles, with an additional small cluster on the right occipital pole. Table 1 provides a full list of significant clusters from each searchlight map. Analysis of the overlap of individuals’ searchlight maps with their ROIs is shown in Table S1.
Analysis of error patterns
To explore the nature of the representation of architectural styles in visual cortex in more detail, we analyzed patterns of decoding errors. Decoding errors were recorded in confusion matrices, whose rows (r) indicate the ground truth of the presented category, and whose columns (c) represent predictions by the decoder. Individual cells (r,c) contain the proportion of blocks with category r, which were decoded as category c. Diagonal elements contain correct predictions, summarized as decoding accuracy in Fig. 1. Off-diagonal elements represent decoding errors. The patterns of decoding errors serve as a proxy for the nature of the neural representation of categories in a particular brain region. We computed the correlations of error patterns as a measure of the similarity between these neural representations across ROIs. Significance of error correlations was established non-parametrically against the null distribution of correlations obtained by jointly permuting the rows and columns of one of the confusion matrices. Only error correlations for which none of the 24 permutations resulted in a higher correlation than the correct ordering (p < 0.0417) were deemed significant.
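The permutation test described above can be sketched as follows. Jointly permuting rows and columns of a 4 × 4 confusion matrix yields 4! = 24 orderings, so the smallest attainable p value is 1/24 ≈ 0.0417. The confusion matrix `cm` below is a hypothetical example, not data from the study.

```python
# Sketch of the non-parametric error-pattern correlation test: correlate the
# off-diagonal (error) cells of two confusion matrices, then compare against
# all 24 joint row/column permutations of one matrix.
from itertools import permutations
import numpy as np

def offdiag(cm):
    """Return the off-diagonal (error) entries as a flat vector."""
    mask = ~np.eye(cm.shape[0], dtype=bool)
    return cm[mask]

def error_correlation_p(cm_a, cm_b):
    """p value = fraction of joint permutations of cm_b whose error-pattern
    correlation with cm_a is at least as high as the correct ordering."""
    observed = np.corrcoef(offdiag(cm_a), offdiag(cm_b))[0, 1]
    count, total = 0, 0
    for perm in permutations(range(cm_b.shape[0])):   # 4! = 24 orderings
        idx = np.array(perm)
        shuffled = cm_b[np.ix_(idx, idx)]             # permute rows and columns jointly
        r = np.corrcoef(offdiag(cm_a), offdiag(shuffled))[0, 1]
        count += r >= observed
        total += 1
    return count / total   # identity is included, so the minimum p is 1/24

# Hypothetical 4x4 confusion matrix (rows: truth, columns: prediction).
cm = np.array([[0.70, 0.10, 0.10, 0.10],
               [0.20, 0.60, 0.10, 0.10],
               [0.05, 0.05, 0.80, 0.10],
               [0.10, 0.20, 0.20, 0.50]])
p_same = error_correlation_p(cm, cm)   # identical matrices: only the identity ordering matches
```

For two identical, asymmetric matrices only the identity permutation reproduces the observed correlation, giving the floor p value of 1/24.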
In the case of entry-level scene categorization, we found significant correlations of error patterns between the three ROIs known to specialize in scene perception: the PPA, the RSC, and the OPA (Fig. 2A). We also found significant error correlation between the PPA and the LOC, which is likely due to the recruitment of LOC for the detection of diagnostic objects in scenes [6, 7]. Note that error patterns from the FFA did not correlate significantly with any of the other ROIs, even though we could decode entry-level scene categories from the FFA.
For architectural styles, we found a different error correlation structure (Fig. 2B). For this more specialized, subordinate-level categorization, we found statistically significant error correlations of the FFA with the LOC, the PPA, and the OPA, as well as between the PPA and the LOC. Note that the RSC no longer shows significant error correlations with any of the other ROIs. We here show for the first time that the FFA is recruited into the scene processing network for demanding subordinate-level scene categorization but not for simple entry-level categorization. This is consistent with the FFA’s role in visual expertise, as shown for object categories as varied as birds, cars, motorcycles, or artificial “Greeble” objects [8], even though those results were based on mean activity levels, whereas ours appear in spatial patterns of brain activity.
We did not find any statistically significant error correlations between ROIs for decoding architects, possibly due to the difficulty of decoding architects from brain activity in the first place. Given that facial identity could not be decoded from any of the high-level visual ROIs, we did not further pursue error correlations for the face identification condition.
The effect of expertise
The involvement of the FFA in the representation of categories for architectural styles suggests a role of expertise in the subordinate-level categorization of architectural styles, but not in entry-level scene categorization. However, unlike the typical scenario of subordinate-level visual categorization (e.g., Golden Retrievers vs. Chihuahuas), accurate recognition of architectural styles or architects of the buildings is highly affected by non-visual factors. The distinction between architectural styles relies not only on visual consistency within a style but also on the historical, regional, and cultural context of buildings. Prior knowledge of a building’s style may be an important factor in accurate classification, without requiring reference to the visual aspects of the building. How, then, does expertise for architecture affect the neural representation of architectural categories in visual cortex?
We measured expertise for architectural styles in a post-scan behavioral experiment employing the Vanderbilt Expertise Test [9]. During the behavioral experiment, participants were asked to identify which of three displayed images belonged to a given target category, drawn from a studied set of six categories. Behavioral accuracy ranged from 20.0% to 100.0% with a mean of 72.5% (chance: 33.3%). Architecture students were significantly more accurate than non-architecture students for architectural styles, t(20) = 3.963, p < .001 (architecture students: 77.1%, SD = 8.9%; psychology and neuroscience students: 59.5%, SD = 11.8%), and architects, t(20) = 3.960, p < .001 (architecture students: 72.5%, SD = 12.0%; psychology and neuroscience students: 44.2%, SD = 9.2%), but not for entry-level scene categories, t(20) = .869, p = .395 (architecture students: 98.2%, SD = 2.9%; psychology and neuroscience students: 96.1%, SD = 2.3%).
Comparison of decoding accuracy from neural data showed no between-group differences in any of the ROIs, except for a marginal effect for decoding architectural styles from the PPA, where architecture students showed lower decoding accuracy (30.1%) than psychology and neuroscience students (34.3%), t(10) = 2.038, p = .069. To account for the full range of individual differences, we correlated each individual’s behavioral accuracy with their MVPA decoding accuracy for architectural styles in the PPA. We found a significant negative correlation between behavioral and decoding accuracy (r = -.56, p = .007; Fig. 3).
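The brain–behavior relationship reported above is a simple Pearson correlation across participants. The arrays below are hypothetical stand-ins for the 22 participants' behavioral accuracies and PPA decoding accuracies, generated with a toy negative trend; they are not the study's data.

```python
# Sketch of the individual-differences analysis: one behavioral accuracy and
# one PPA decoding accuracy per participant, correlated across participants.
# All values are synthetic placeholders with a built-in negative trend.
import numpy as np

rng = np.random.default_rng(1)
n_participants = 22
expertise = rng.uniform(0.2, 1.0, size=n_participants)          # behavioral accuracy
decoding = 0.45 - 0.20 * expertise + rng.normal(0, 0.03, n_participants)  # toy trend

# Pearson correlation between behavioral and decoding accuracy.
r = np.corrcoef(expertise, decoding)[0, 1]   # negative for this toy trend
```

With real data, the sign and significance of `r` would be assessed against the null hypothesis of no correlation, as in the reported r = -.56, p = .007.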
Discussion
We have shown for the first time that subordinate categories of buildings, architectural styles, are represented in the neural activity patterns of several high-level visual areas in human temporal cortex. It was even possible to decode the architects of buildings from neural activity elicited by images of the buildings in the PPA, the OPA, and the LOC. Unlike entry-level scene categories, architectural styles and architects could not be decoded from activity in V1, indicating that the simple visual properties encoded in V1 are insufficient to discriminate between architectural styles. These findings suggest that the neural representations of architectural features rely on complex visual structure beyond simple feature statistics. For instance, Byzantine architecture is characterized by symmetry in the global shape of buildings and a dome roof, whereas Deconstructive architecture is well known for its non-collinearity and fragmented global shape. Complex visual properties associated with architectural design elements have previously been suggested to contribute to successful cross-decoding between interior and exterior views of landmark buildings [10].
When discriminating between architectural styles, the fusiform face area, previously implicated in the preferential processing of faces [11] as well as visual expertise [8], is recruited as part of a network of regions that share similar error patterns. The FFA could be involved in the encoding of configural characteristics of buildings, such as Lloyd Wright’s signature horizontally elongated proportions. By contrast, entry-level categorization of scenes does not include the FFA in the same way, instead relying on a tight network of three scene-selective areas, the PPA, the RSC, and the OPA, as well as the LOC.
Categorizing a building by its architectural style or its designer involves not only detecting characteristic visual features, but also recruitment of semantic knowledge. Indeed, domain knowledge of architecture is likely to contribute to the neural representations of architectural styles. This was shown clearly by the negative correlation between behavioral expertise scores and individual decoding accuracy for architectural styles in the PPA. We presume that participants with more expertise in architecture relied more on their domain knowledge and less on the high-level visual features represented in the PPA when making judgments about architectural styles.
In summary, several high-level visual regions, but not V1, contain decodable neural representations of architectural styles and architects of buildings. The FFA participates in a network of high-level visual areas characterized by similar error patterns, but only in the subordinate categorization of architectural styles and not in entry-level categorization of scenes. Furthermore, accuracy of decoding architectural styles from the PPA is negatively correlated with expertise in architecture, indicating a shift from purely visual cues to domain knowledge with increasing expertise. Our study showcases that neural correlates of human classification of visual categories are driven not only by perceptual features but also by semantic and cultural facets, such as expertise for architectural styles and architects of buildings. Most importantly, we have identified in the human visual system a neural representation of architecture, one of the predominant and longest-lasting artefacts of human culture.
Methods
Participants
Twenty-three healthy undergraduate students in their final year at The Ohio State University participated in the study for monetary compensation. We recruited eleven students from the Department of Architecture (2 females; 1 left-handed; age range = 21–27, M = 22.4, SD = 3.0), and twelve senior students majoring in psychology or neuroscience (3 females; 2 left-handed; age range = 21–24, M = 21.8, SD = 0.9). Data from one psychology student were not included in the analysis due to excessive head motion during the scan.
fMRI Experiment
MRI images were recorded on a 3T Siemens MAGNETOM Trio with a 12-channel head coil at the Center for Cognitive and Behavioral Brain Imaging at The Ohio State University. High-resolution anatomical images were obtained with a 3D-MPRAGE sequence with coronal slices covering the whole brain: inversion time = 930 ms, repetition time (TR) = 1900 ms, echo time (TE) = 4.44 ms, flip angle = 9°, voxel size = 1 × 1 × 1 mm, matrix size = 224 × 256 × 160. Functional images were obtained with T2*-weighted echo-planar sequences with coronal slices covering approximately the posterior 70% of the brain: TR = 2000 ms, TE = 28 ms, flip angle = 72°, voxel size = 2.5 × 2.5 × 2.5 mm, matrix size = 90 × 100 × 35.
Participants viewed 512 grayscale photographs of four image types: (1) 32 images of representative buildings of each of four architectural styles: Byzantine, Renaissance, Modern, and Deconstructive; (2) 32 images of buildings designed by each of four well-known architects: Le Corbusier, Antoni Gaudi, Frank Gehry, and Frank Lloyd Wright; (3) 32 scene images for each of four entry-level scene categories: mountains, pastures, highways, and playgrounds; (4) 32 face images for each of four different individuals [12]. Brightness and contrast were equalized across all images. Images were back-projected with a DLP projector (Christie DS+6K-M 3-chip SXGA+) onto a screen mounted in the back of the scanner bore and viewed through a mirror attached to the head coil. Images subtended approximately 12° × 12° of visual angle. A fixation cross measuring 0.5° × 0.5° of visual angle was displayed at the center of the screen.
During each of nine runs, participants saw sixteen 8-second blocks of images. In each block, four photographs from a single category were each shown for 1800 ms, followed by a 200 ms gap. The order of images within a block and the order of blocks within a run were randomized in such a way that the four blocks belonging to the same stimulus type (entry-level scenes, styles, architects, faces) were shown back to back. A 12-second fixation period was placed between blocks as well as at the beginning and the end of each run, resulting in a duration of 5 min 32 sec per run. Occasionally (in approximately one out of eight blocks), an image was repeated back-to-back within a block. Participants were asked to press a button when they detected image repetitions.
fMRI data were motion corrected, spatially smoothed (2 mm full width at half maximum), and converted to percent signal change. We used a general linear model with only nuisance regressors to regress out effects of motion and scanner drift. Residuals corresponding to image blocks were extracted with a 4 s hemodynamic lag and averaged over the duration of each block. Block-averaged activity patterns within pre-defined ROIs were used for MVPA.
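The block-averaging step can be sketched as follows, assuming a nuisance-regressed residual time series for one run (166 TRs at TR = 2 s matches the 5 min 32 sec run length; the voxel count and data are hypothetical placeholders).

```python
# Sketch of extracting one block's pattern from GLM residuals: shift the block
# window by the 4 s hemodynamic lag, then average the volumes within the block.
# `resid` is a hypothetical residual time series (n_TRs x n_voxels).
import numpy as np

TR, block_dur, lag = 2.0, 8.0, 4.0   # seconds, as described in the text

def block_pattern(resid, block_onset_s):
    """Average residual volumes over one 8-s block, shifted by the 4-s lag."""
    start = int((block_onset_s + lag) / TR)
    stop = start + int(block_dur / TR)
    return resid[start:stop].mean(axis=0)

resid = np.random.default_rng(2).standard_normal((166, 500))  # toy run, 500 voxels
pattern = block_pattern(resid, block_onset_s=12.0)  # first block after initial fixation
```

Each such block-averaged pattern, restricted to the voxels of an ROI, forms one sample for the MVPA decoder.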
Author Contributions
Conceptualization, D. B. W, J. N., and C. H.; Methodology, H. C., D. B. W., B. N.; Investigation, H. C., and D. B. W.; Writing – Original Draft, H. C.; Review & Editing, D. B. W., and J. N.; Funding Acquisition, D. B. W.; Resources, D. B. W.; Supervision, D. B. W. and J. N.
Supplemental Experimental Procedures
Regions of interest
High-level visual regions of interest (ROIs) were defined functionally using separate localizer scans. Participants saw one to three runs (7 minutes and 12 seconds each) of blocks of images of faces, scenes, objects, and grid-scrambled objects while responding to image repetitions with a button press. Following motion correction, spatial smoothing (4 mm full width at half maximum Gaussian kernel) and normalization to percent signal change, localizer data were analyzed using a general linear model (3dDeconvolve in AFNI). ROIs were defined as contiguous clusters of voxels with significant contrasts (q < 0.05; corrected for multiple comparisons using false discovery rate) in the following comparisons: scenes > (faces, objects) for the parahippocampal place area (PPA), retrosplenial cortex (RSC), and the occipital place area (OPA) [1, 2]; faces > (scenes, objects) for the fusiform face area (FFA) [3]; and objects > scrambled objects for the lateral occipital complex (LOC) [4]. The PPA and RSC were successfully identified in all twenty-two participants. We could not find significant clusters corresponding to the OPA in five participants, the FFA in one participant, and the LOC in two participants. Group statistics of ROI-based results were computed only for the participants for whom we could identify the ROIs.
Primary visual cortex (V1) was defined on each participant’s original cortical surface map using the automatic cortical parcellation provided by Freesurfer [5]. Surface-defined V1 was registered back to the volumetric brain separately for each hemisphere using AFNI.
Univariate analysis
We tested whether the four types of visual categories elicited different levels of mean activity in each of the ROIs. We conducted a mixed-effects analysis of variance (ANOVA) for each ROI separately, using participant group (Architecture vs. Psychology and Neuroscience students) as a between-subjects factor, and visual category (entry-level scene categories vs. architectural styles vs. architects vs. faces) as a within-subjects factor. Since there was neither a main effect for group nor an interaction between group and visual category, we collapsed the data for the two groups. Results are shown in Fig. S1A. Differences in mean activity between the three scene-type categories and faces were assessed using planned paired t-tests, separately for each ROI. Differences in mean activity among the subordinate categories for each of the four main categories were evaluated with one-way ANOVAs. Results are shown in Fig. S1B.
Searchlight analysis
We explored representations of image categories outside of the pre-defined ROIs with a searchlight analysis using the Searchmight Toolbox [6]. The size of the searchlight region was chosen as a 5 × 5 × 5 = 125-voxel cube to approximate the average size of a unilateral PPA of the participants (159 voxels). The searchlight was centered on each voxel in turn [7], and decoding analysis with leave-one-run-out cross-validation was performed using the voxels within the searchlight regions. Decoding accuracies for the searchlight locations were recorded in a brain map, thresholded at p < 0.01 (one-tailed analytical p value), and corrected for multiple comparisons at the cluster level with a minimum cluster size determined separately for each participant, ranging from 4 to 8 voxels (M = 4.8, SD = 0.9). We evaluated the agreement between the searchlight analysis and the pre-defined ROIs as the fraction of voxels within each ROI that was found to be significantly above chance in the searchlight analysis (Table S1).
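The searchlight loop can be sketched as below: a 5 × 5 × 5 cube is centered on a voxel, and the same LORO decoder is run on the voxels inside it. The array names, sizes, and data are hypothetical; a real analysis would apply a brain mask and use a dedicated toolbox such as the Searchmight Toolbox cited above.

```python
# Sketch of one searchlight step: extract the 5x5x5 cube of voxels around a
# center voxel and run leave-one-run-out SVM decoding on it. Toy data only.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

def searchlight_accuracy(data, labels, runs, center, radius=2):
    """data: 4-D array (x, y, z, n_blocks); returns mean LORO accuracy for the
    cube of side 2*radius + 1 (here 5x5x5 = 125 voxels) centered on `center`."""
    x, y, z = center
    cube = data[x - radius:x + radius + 1,
                y - radius:y + radius + 1,
                z - radius:z + radius + 1]
    X = cube.reshape(-1, cube.shape[-1]).T      # blocks x voxels
    scores = cross_val_score(LinearSVC(), X, labels,
                             groups=runs, cv=LeaveOneGroupOut())
    return scores.mean()

rng = np.random.default_rng(3)
data = rng.standard_normal((10, 10, 10, 36))    # toy volume: 36 blocks
labels = np.tile(np.arange(4), 9)               # four categories x nine runs
runs = np.repeat(np.arange(9), 4)
acc = searchlight_accuracy(data, labels, runs, center=(5, 5, 5))
```

In the full analysis this accuracy is written into a brain map at the center voxel, and the loop repeats over every voxel in the scanned volume.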
For group analysis, anatomical brain volumes of each of the participants were registered to the Montreal Neurological Institute (MNI) 152 template [8]. Searchlight accuracy maps were registered to MNI space using the parameters from the anatomical registration, followed by smoothing with a 2 mm full width at half maximum Gaussian filter. Significance of group-average decoding accuracy versus chance (25%) was assessed with a one-sample one-tailed t-test (p < 0.01), followed by cluster-level correction for multiple comparisons (minimum cluster size of 13 voxels, determined by α probability simulation).
Post-scan behavioral experiment
We measured participants’ visual domain knowledge in a post-scan behavioral experiment similar to the Vanderbilt Expertise Test [9]. Domain knowledge for each image type was tested in four separate blocks. Each block consisted of three components: study, practice, and testing. During study, participants were introduced to six target categories. Example images for each of the six target categories were displayed on the screen with correct category labels: (1) entry-level scene categories: fountains, highways, mountains, pastures, skylines, and waterfalls; (2) architectural styles: Byzantine, Gothic, Renaissance, Modern, Postmodern, and Deconstructive; (3) buildings by famous architects: Peter Eisenman, Antoni Gaudi, Frank Gehry, Michael Graves, Le Corbusier, and Frank Lloyd Wright; (4) faces: six non-famous individuals varied in gender and race. Following the study phase, participants completed twelve practice trials. In these trials, three images (12° × 12° of visual angle each) were presented side by side. Participants were asked to indicate which of the three images belonged to a given target category by pressing one of the keys, “1,” “2,” or “3.” During practice, one of the three images was always drawn from the set of studied examples. The images were presented until the participant made a response, and feedback was provided by displaying the word “CORRECT” or “INCORRECT.” Study exemplars were shown again halfway through practice and at the beginning of the subsequent test phase. For the 35 test trials, 24 new grayscale images from the target categories and 48 new grayscale foil images from different categories were used. The structure of the test trials was the same as in practice, except that participants no longer received feedback. The entire experiment lasted approximately 30 min.
We confirmed that architecture students had higher expertise for architectural styles and buildings by famous architects in an analysis of variance (ANOVA) of average accuracy rates, using participant group as a between-subjects factor, and visual category (entry-level scene categories vs. architectural styles vs. architects vs. faces) as a within-subjects factor. Furthermore, planned comparisons between the two groups were conducted for each of the four visual categories. We also conducted the same analyses on average reaction times (RT). Results are shown in Fig. S4.
Supplemental Data
References
- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
- 9.