ABSTRACT
Artificial neural network models have long proven useful for understanding healthy, disordered, and developing cognition, but this work has often proceeded with little connection to functional brain imaging. We consider how analysis of functional brain imaging data is best approached if the representational assumptions embodied by neural networks are valid. Using a simple model to generate synthetic data, we show that four contemporary methods each have critical and complementary blind-spots for detecting distributed signal. The pattern suggests a new approach based on structured sparsity that, in simulation, retains the strengths of each method while avoiding its weaknesses. When applied to functional magnetic resonance imaging data the new approach reveals extensive distributed signal missed by the other methods, suggesting radically different conclusions about how brains encode cognitive information in the well-studied domain of visual face perception.
Scientists interested in understanding how minds emerge from brains have increasingly embraced a network-based view: cognition arises from the propagation of activity amongst distributed neural populations interacting via weighted connections in complex brain networks1,2. Information is encoded, not in the mean activation of local neural populations considered independently, but in the similarity structure of activity patterns distributed over many potentially distal populations3,4. This raises a statistical challenge for functional brain imaging, where various technologies yield thousands of neurophysiological measurements per second: cognitive structure must be encoded jointly over some subset of these measurements, but the number of possible subsets is prohibitively large. How can the theorist find those that encode structure of interest? We propose an answer motivated by neural network models of cognition5 and show that, when applied to data from functional magnetic resonance imaging (fMRI), the new approach can lead to dramatically different conclusions about the neural bases of cognition even in well-studied domains.
Neural network models (see SI-NN) propose that cognitive representations are patterns of activity distributed over neural populations or units, with each unit potentially contributing to many representations and each representation encoded over many units (Fig. 1A)5,6. The patterns arise as units communicate their activity through weighted connections that determine the effect of a sending unit on a receiving unit. Pools of similarly-connected units together function as a representational ensemble that encodes a particular kind of cognitive structure (phonological, semantic, visual, etc). Network topography is initially specified but learning shapes the connection weight values that generate patterns over ensembles. Cognitive processing arises from the propagation of activity through the network, typically from perceptual stimulation in a particular task and environmental context. Behaviors are the motor outputs eventually generated by such updating.
Such models have long proven useful for understanding the neural bases of healthy7, disordered8–11, and developing12,13 cognition, but until recently10,14–16 this research advanced with little connection to brain imaging. The reason is simple—the models suggest that information can be encoded and distributed in ways that pose significant challenges to statistical analysis of brain imaging data (Fig. 1). Specifically:
Unit activations may not be independently interpretable. Because information resides in activation patterns computed across units in a representational ensemble, a single unit’s activation profile may mislead when considered in isolation (Fig 1B). Independent analysis of each population (e.g. single voxels) may therefore obscure the information encoded in distributed patterns.
Neighboring units need not encode the same information in the same way. Because representational structure inheres in pattern similarities over an ensemble, adjacent units in the ensemble need not respond to stimuli in similar ways—they may contribute to different components of the structure, or may encode a given component through either increased or decreased activation. Thus approaches that spatially average data within subject may destroy signal (Fig 1B bottom left).
Single units can vary arbitrarily in their responses even when ensembles encode the same structure. Many different weight configurations can compute the same input/output mapping for a given network. The configuration acquired through learning depends on factors that vary across individuals, such as the initial weight configuration or the idiosyncrasies of experience. Thus a particular unit in a given model can, across different training runs in the exact same environment, exhibit essentially arbitrary responses to inputs, even if the patterns of activation generated across unit subsets always express the same structure. Approaches that average voxel/region responses across subjects at a given anatomical location may therefore destroy signal (Fig 1B bottom middle and right).
Representational ensembles need not be anatomically contiguous. Depictions of network models portray representational ensembles as occupying a single contiguous “layer,” inviting a correspondence with cortical regions (Fig 1A). Yet units with similar connectivity will function as a representational ensemble—responding to the same inputs and contributing to the same outputs—even if situated in different cortical regions or intermingled with differently-connected units (Fig 1C). Thus approaches that analyze different anatomical regions independently can fail to uncover important signal.
Fine connectivity varies across individuals. While neural network models initially specify coarse connectivity, the values ultimately taken by individual weights are seeded randomly then tuned by learning, similar to real brains where region-to-region connectivity is initially specified but finer patterns are more variable and plastic. Consequently precise unit-to-unit alignment may be impossible even if the neural populations contributing to some representation reside in roughly similar locations across individuals (Fig 1C)—in which case cross-subject averaging of neural responses can destroy signal even with sophisticated alignment techniques.
To understand whether and how contemporary methods overcome these challenges, we first describe a simulated functional imaging study using a simple neural network to generate model data, which we then analyze with four different statistical methods. Each discerns some model signal but also has critical blind spots. The contrasting patterns suggest a new approach based on structured sparsity17 that preserves the strengths of other methods while avoiding their weaknesses. We develop this approach, then compare it to the others when applied to real fMRI data to discover neural signal that discriminates visually-presented faces from other visual stimuli. Each method again yields different results, but the new approach resolves seeming discrepancies amongst the others while simultaneously suggesting radically different conclusions about how neural systems encode representations in this well-studied domain.
Results
Simulation study
The model (Fig 2A) is an auto-encoder: it learns to reproduce input patterns across output units. The patterns come from two domains, A and B, corresponding to some cognitive distinction of interest (e.g. faces vs places, animals vs artifacts, etc). Each domain exemplar generates a unique pattern of input/output activations. Systematic input/output (SIO) units each independently adopt a noisy but consistent code, with some slightly more active on average for A items and others for B items. Arbitrary input-output (AIO) units have equal probability of activation regardless of domain. Activation propagates from input to output via two hidden layers. The systematic hidden (SH) units connect only systematic I/O units, while the arbitrary hidden (AH) units receive connections from arbitrary inputs and send connections to both systematic and arbitrary outputs. This architecture promotes a division of labor across the hidden units: SH units encode distributed representations that strongly express domain structure, while AH units encode the idiosyncratic differences existing amongst domain exemplars. The model also contains irrelevant units unconnected to the network and taking on low random values.
Blue shading in Figure 2A shows units treated as anatomically contiguous. I/O units of a given type (A, B, arbitrary) were always anatomically grouped in the same way across model individuals within a single contiguous region. For the hidden and irrelevant units we considered two spatial layouts. When localized, units of a given type (SH, AH, irrelevant) were anatomically grouped in the same way across model individuals within a single contiguous region, just like I/O units. When dispersed, the model retained the same connectivity but the hidden units were spatially arranged in four anatomically distal “regions”—three containing a mix of SH, AH, and irrelevant units and the fourth containing only irrelevant units—to capture the possibility that distal units can function together as a representational ensemble. Within each region, unit locations were shuffled for each model subject, capturing the possibility that fine-grained localization can vary across individuals. These different configurations did not affect model training or behavior, but represent different assumptions about the spatial locations of the signals measured in a simulated MRI study.
Simulated data were generated by training a model on 72 items, then computing the response of each unit to each input pattern and perturbing this with noise sampled independently for each unit. Each unit activation was taken as a model analog of the peak BOLD signal generated by a single stimulus at a single voxel in a single individual. To simulate 10 individuals in a brain imaging study we trained and tested the model 10 times with different initial weights. We then applied four different statistical methods to find units that differentiate A from B items: univariate contrast (UC), searchlight multivariate pattern classification (sMVPC), and whole-brain logistic regression regularized with the L2 (ridge) or L1 (LASSO) norm (see SI-Approaches). We measured how well each method discriminates informative from non-informative units amongst all I/O units and amongst all hidden/irrelevant units. We also considered how well each method uncovers the domain code: which units are more active for A, which for B, and which adopt a heterogeneous code. Figure 2 shows the results.
Univariate contrast spatially smooths data, then identifies voxels whose mean activation across individuals at a given spatial location reliably differs for A vs B items.18,19 This approach identified the SIO units and accurately specified their coding direction, but failed to identify the SH units in either spatial layout.
Searchlight MVPC generates information maps by evaluating pattern classifiers in their ability to discriminate A from B items from the activation patterns generated by each item across voxels20. The approach avoids model over-fitting by applying the classifier only to voxels within a searchlight of fixed anatomical radius centered on a particular voxel. A separate classifier is evaluated for each voxel in each subject without spatial smoothing or cross-subject averaging. Cross-validation accuracy is stored at the voxel location in each subject; cross-subject univariate tests then identify anatomical clusters where accuracy is reliably higher than chance. The resulting maps indicate regions where the categorical distinction of interest is locally decodable on average across subjects within the radius of the searchlight. Figure 2C shows model results for the best-performing searchlight size (radius 7) which excelled at discovering localized signal in both I/O and hidden units. Smaller searchlights successfully identified SH but not SIO units while larger searchlights showed the reverse pattern (see SI-Approaches). Regardless of size, the approach failed to discover signal-carrying hidden units when these were spatially dispersed. Also, because information maps only indicate classifier accuracy, the approach did not reveal the different category codes employed by I/O and hidden units.
Regularized whole-brain MVPC fits and evaluates a single classifier per subject using all voxels and avoids over-fitting through model regularization21, that is, by finding classifier coefficients that jointly minimize both prediction error and some additional penalty. We used logistic regression as the classifier and considered two common regularizers: the ridge penalty, which increases with the sum of squared classifier coefficients, and the LASSO penalty, which increases with the sum of the absolute value of classifier coefficients. In both cases a classifier was fit to minimize: …where β is the vector of classifier coefficients (one element for each unit), f(β) is the classifier logistic loss, h(β) is the regularization penalty, and λ ∈ [0,1] is a free parameter tuned via cross-validation that determines the relative weighting of the loss vs the regularization penalty.
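The two ingredients of this objective can be illustrated with a minimal sketch. It computes a logistic loss f(β) and the ridge and LASSO penalties h(β) exactly as described in words above. The function names, the omission of an intercept, and the choice to average the loss over items are assumptions made for illustration. How λ combines the two terms follows Equation 1 and is deliberately not reproduced here.

```python
import math

def logistic_loss(beta, X, y):
    """Mean logistic loss f(beta) for labels y in {0, 1}; no intercept (assumption)."""
    total = 0.0
    for row, label in zip(X, y):
        z = sum(b * x for b, x in zip(beta, row))
        # log(1 + exp(z)) - label * z, written in a numerically stable form
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z))) - label * z
    return total / len(y)

def ridge_penalty(beta):
    """Sum of squared coefficients."""
    return sum(b * b for b in beta)

def lasso_penalty(beta):
    """Sum of absolute coefficient values."""
    return sum(abs(b) for b in beta)

# Toy data: two items (rows) measured over three units (columns).
X = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
y = [1, 0]
beta = [0.5, -2.0, 0.0]

print(logistic_loss(beta, X, y), ridge_penalty(beta), lasso_penalty(beta))
```

Note that the ridge penalty grows quadratically with the large coefficient (−2.0) while the LASSO penalty grows linearly, which is one way to see why LASSO tolerates a few large coefficients and drives the rest to zero.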
The ridge classifier showed a mean hold-out accuracy of 60% correct, reliably above chance (p < 0.0001), but placed non-zero weights on all units making it difficult to discriminate signal-carrying from arbitrary/irrelevant units (see SI-Approaches). The LASSO classifier showed equivalently good hold-out accuracy (60%) with a much sparser solution (Fig. 2D). Only 2 of the 7 SH units and only one of the I/O units were selected more often than expected by chance given the base probability of selection in a permutation test, and no false alarms were observed. Thus LASSO showed high precision—all selected units carried signal—with a low hit rate.
Regularization with structured sparsity. These results highlight complementary strengths and weaknesses across methods. Spatial averaging allows UC to detect the noisy signal in SIO units and also reveals the code direction, but only succeeds when units employing the same code are consistently localized within a region across subjects—because SH units employ a heterogeneous code, they are always missed. When sized appropriately, searchlight can discover both kinds of localized signal but misses anatomically dispersed signal and obscures the code direction. Regularized whole-brain MVPC can identify spatially dispersed and heterogeneous signal, but ridge regularization selects everything without discrimination while LASSO forces a very sparse solution that identifies only a small proportion of signal-carrying units.
Structured sparsity22–24 provides an avenue for preserving the strengths of each method while avoiding its weaknesses. A single classifier is fit using all data from all subjects, but the fit is regularized with a penalty that encourages desired sparsity patterns amongst the classifier coefficients. Specifically, the solution should reflect the characteristics of neural signal that neural network models suggest and that other methods individually exploit. It should (1) clearly delineate selected and unselected voxels, (2) allow heterogeneous codes among neighboring units within and across individuals, (3) reveal code direction where this is consistent, (4) identify distal units that jointly express representational structure, (5) capitalize on shared location across subjects where this exists but also (6) tolerate individual variation in signal location.
The sparse-overlapping-sets (SOS) LASSO is one such function25,26 (see SI-SOS). Neighboring voxels within a specified radius are grouped into sets, similar to searchlights. Each voxel belongs to several sets and sets overlap in the voxels they contain. Sets then contribute to the regularization cost of a single classifier as follows: …where S defines the grouping of voxels into sets, i indexes model coefficients within a set, and α ∈ [0,1] is a free parameter. The total cost is a sum over sets. The cost for each set is the proportional weighted sum of two terms: the standard L1 sparsity penalty, and a grouping penalty formulated as the root of the sum of squared coefficients. Because this root is taken over units within a set, the grouping penalty is smaller when non-zero coefficients occupy the same set than when they occupy different sets (see SI-SOS). Thus SOS LASSO encourages sparse solutions where selected voxels occupy a small number of sets. The free parameter α controls the relative weighting of the grouping vs sparsity penalties within set—when α=0 the penalty reduces to LASSO. This function replaces h(β) in Equation 1, so the full optimization jointly minimizes model error and regularization cost with a second free parameter λ controlling their respective weighting.
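The verbal description of the set-wise cost can be made concrete with a small sketch. It assumes the per-set cost takes the form α·(root of the sum of squared coefficients) + (1 − α)·(sum of absolute coefficients), which is one reading of the "proportional weighted sum" described above and is consistent with the statement that α = 0 recovers LASSO. The function name and toy sets are hypothetical, and the exact formulation is the one given in SI-SOS.

```python
import math

def sos_lasso_penalty(beta, sets, alpha):
    """Sum over sets of alpha * grouping term + (1 - alpha) * L1 term (assumed form)."""
    total = 0.0
    for members in sets:
        coefs = [beta[i] for i in members]
        grouping = math.sqrt(sum(c * c for c in coefs))  # root of sum of squares
        sparsity = sum(abs(c) for c in coefs)            # standard L1 term
        total += alpha * grouping + (1.0 - alpha) * sparsity
    return total

# Two non-overlapping toy sets over four coefficients.
sets = [[0, 1], [2, 3]]
same_set = [1.0, 1.0, 0.0, 0.0]    # both non-zero coefficients in one set
split_sets = [1.0, 0.0, 1.0, 0.0]  # non-zero coefficients in different sets

print(sos_lasso_penalty(same_set, sets, 0.5))
print(sos_lasso_penalty(split_sets, sets, 0.5))
```

Both coefficient vectors have the same L1 norm, so any difference in cost comes from the grouping term alone: the vector whose non-zero entries share a set is cheaper.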
SOS LASSO fits a single classifier to all data from all subjects simultaneously. Voxels from different subjects are projected into a common reference space without interpolation and with no cross-subject averaging. Sets are defined for grid points in the space and each encompasses all voxels within and across subjects that fall within a specified radius of a grid point. The optimization is convex and returns a unique solution for a given pair of hyperparameters (α and λ) that are tuned via cross-validation on model hold-out error. The result is a classifier coefficient for each voxel in each subject, with the coefficients tending to occupy a small number of sets, and thus to lie in roughly similar anatomical locations across subjects.
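The set construction described above can be sketched as follows. This is an illustration only: the function name, the dictionary layout of voxel coordinates, and the decision to drop empty sets are assumptions, not the authors' implementation. Each set gathers every voxel, from every subject, that lies within the radius of a grid point in the common space.

```python
import math

def build_sets(voxel_coords, grid_points, radius):
    """Group voxels across subjects into sets centred on grid points.

    voxel_coords: {subject_id: [(x, y, z), ...]} in the common reference space.
    Returns a list of sets; each set is a list of (subject_id, voxel_index) pairs.
    """
    sets = []
    for g in grid_points:
        members = [
            (subject, v)
            for subject, coords in voxel_coords.items()
            for v, xyz in enumerate(coords)
            if math.dist(xyz, g) <= radius
        ]
        if members:  # skip grid points with no nearby voxels (assumption)
            sets.append(members)
    return sets

# Toy example: two subjects with two voxels each, three grid points.
voxel_coords = {"s1": [(0, 0, 0), (4, 0, 0)], "s2": [(1, 0, 0), (9, 0, 0)]}
grid_points = [(0, 0, 0), (3, 0, 0), (20, 0, 0)]
sets = build_sets(voxel_coords, grid_points, radius=2.0)
print(sets)
```

In this toy example the first voxel of subject s2 falls in two sets, illustrating the overlap, and both sets mix voxels from different subjects without any voxel-to-voxel alignment.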
Like LASSO, SOS LASSO “sees” all voxels at once and so can exploit structure coded across distal regions, while also clearly delineating selected voxels by forcing many coefficients to zero. Like searchlight, the code direction can be heterogeneous within region and across subjects, but like LASSO each voxel in each subject gets a unique coefficient whose sign indicates the code direction. SOS LASSO promotes a similar anatomical distribution of voxels across subjects, but the pressure to be similar is balanced with pressure to reduce model prediction error and achieve structured sparsity within each subject. Since the strength of the grouping penalty is tuned by data, the approach can exploit localization across subjects where this exists but can also accommodate cross-subject variability in location.
The last row of Fig 2D shows the SOS-LASSO solution applied to model data. The classifier achieved better hold-out accuracy than LASSO or ridge regression (70% correct) while discovering more than half the SIO units and almost all the SH units in both localized and dispersed layouts. The only false alarms were observed amongst arbitrary or irrelevant units that were anatomically interspersed with or neighboring the signal-carrying units. The signs of the coefficients revealed the consistency and direction of the domain code among identified SIO units as well as the heterogeneity of the code in SH units. Thus in simulation, SOS-LASSO captures the strengths of each method while largely avoiding their limitations.
Functional imaging study
The simulations suggest that several contemporary statistical methods for functional imaging have blind spots, raising the possibility that prior work may have missed important signal even in well-studied domains where multiple different approaches have been applied. Study 2 assesses this possibility in the domain of face representation. We applied univariate contrast, searchlight MVPC, LASSO, and SOS-LASSO to fMRI data collected in a previous unrelated study27 in which 10 participants judged the pleasantness of images depicting people (30 items), scenes (30 items), or objects (30 items) while their brains were scanned with fMRI. The study employed a slow event-related design that allowed estimation of the peak BOLD response to each item at each voxel without time-series deconvolution. We applied each method to these data to find voxels that reliably discriminate face (people) from non-face (place and object) stimuli.
Figure 3 shows the neural systems typically thought to support face and place perception (downloaded from http://web.mit.edu/bcs/nklab/GSS.shtml) along with the results for the current data from each method. Whole-brain univariate contrast revealed significantly reduced mean activation for faces in bilateral parahippocampal regions (blue areas p < 0.05 with cluster correction), while an ROI analysis found reliably elevated activation for faces in the right “fusiform face area” (FFA) only (warm colors, p < 0.05 for mean response across ROI voxels).
Searchlight MVPC with a 9mm radius identified a localized signal in regions near the FFA bilaterally. LASSO selected a handful of voxels in each subject, concentrated near FFA bilaterally and in right lateral occipito-temporal cortex. The classifier coefficients indicated that reduced activity predicted faces in medial ventro-temporal cortex while elevated activity predicted faces for lateral ventro-temporal and occipito-temporal regions. Together the three approaches suggest somewhat different conclusions about face representation in cortex, each having precedent in prior work: the UC result suggests that faces selectively activate the right FFA;28 searchlight identifies localized but bilateral signal around the FFA;29 and LASSO reveals a nonface-to-face gradient bilaterally in these regions plus a right-lateralized occipital “face” region.30
Results from SOS LASSO differed strikingly. Colored regions in Fig. 3C (bottom) show voxels that received non-zero weights on significantly more subjects than expected from a null distribution estimated by permutation testing (p < 0.001 corrected; see Methods). Hue again indicates the proportion of selected coefficients that were positive across subjects. In addition to the canonical face and place systems the results implicate a host of regions spanning anterior temporal, frontal, and parietal cortex.
We assessed the quality of the representations discovered by each method as follows. For each subject and method, we trained a (ridge-regularized) logistic classifier to discriminate face from non-face stimuli using only the voxels that the method identified as important, then compared the classifier cross-validation accuracy (mean of hit and correct rejection rates) across methods. All classifiers performed reliably better than chance (50%) but with substantial differences among them. Searchlight solutions performed worst (67%); the canonical univariate ROIs yielded significantly higher accuracy (76% correct, p < 0.001 vs searchlight); but performance for LASSO and SOS LASSO was even better (88% and 87% respectively, p < 0.001 vs univariate for both) and did not differ significantly from each other.
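The performance measure used here, the mean of the hit rate and the correct rejection rate, matters because the classes are unbalanced (30 face items versus 60 non-face items). A minimal sketch, with a hypothetical function name and labels coded as 1 for face and 0 for non-face, shows why chance corresponds to 50% under this measure:

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of hit rate (faces called faces) and correct rejection rate."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    correct_rejections = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return 0.5 * (hits / n_pos + correct_rejections / n_neg)

# 30 face items and 60 non-face items, as in the study design.
y_true = [1] * 30 + [0] * 60
always_nonface = [0] * 90  # a degenerate classifier that never predicts "face"

print(balanced_accuracy(y_true, always_nonface))
```

A classifier that always answers "non-face" would be right on two thirds of items, yet scores exactly at chance on this measure.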
Why do neural representations appear so widely distributed under SOS LASSO? Perhaps the neural signal outside canonical areas merely serves to de-noise, interact with, or otherwise amplify the signal encoded within the canonical face and place systems (see SI-Denoising).31 We assessed this possibility with two additional analyses. First we fit SOS LASSO to two subsets of data: voxels lying within the canonical face and place systems (padded by an additional 7mm) or voxels lying outside these systems and also not selected by any other method. If the true signal lies mainly in the face and place systems, the within-system classifiers should yield higher accuracy on held-out items than outside-system classifiers. Instead the reverse result obtained: outside-system classifiers showed significantly higher hold-out accuracy (87%) than within-system classifiers (83%; t(11) = 2.4, p < 0.05), and performed as well as whole-brain classifiers (87%).
Second, we used SOS LASSO to find voxels that reliably discriminate place from non-place stimuli. If broadly-distributed regions are selected because they de-noise system-specific signal or for some other spurious reason, they should be selected regardless of the category decoded. Figure 4 shows this is not the case: the voxels that discriminate place from non-place stimuli are anatomically localized and consistent with the canonical view of place representation in the brain.32 Together the results suggest that widely distributed regions outside canonical systems contain information that helps to discriminate visually-presented faces from non-face stimuli.
General Discussion
Using computer simulations we have shown that four contemporary statistical methods for brain imaging have significant blind spots remedied by a new approach based on structured sparsity, the SOS LASSO. All methods yielded qualitatively different results when applied to fMRI data collected while participants judged visual images of faces, places, or objects, but solutions from prior approaches were generally consistent with the standard view that face representations are encoded within posterior temporal and occipital cortices. The SOS LASSO yielded a much more widely distributed solution encompassing anterior temporal, frontal, and parietal regions.
Can we be confident that SOS LASSO has revealed real signal? The result does not solely reflect selection of spurious voxels outside canonical systems, since (1) extra-system regions were sufficient to discriminate face from non-face stimuli with high accuracy and (2) the same approach yielded a more localized solution when seeking signal that discriminates place from non-place stimuli. Might it have arisen from idiosyncrasies of the particular stimuli or task used in the current study? We cannot rule the possibility out, but the current data were sufficient to replicate canonical results using prior methods—only with SOS LASSO was further signal revealed. Thus it remains possible that prior work has likewise missed extensive distributed signal.
Many aspects of the SOS LASSO solution accord well both with standard views of face and place perception and with the broader literature. Regions where classifier coefficients are consistently positive (warm colors in Figure 3) pick out much of the canonical face system and other cortical areas known to support social cognition, including the bilateral temporal poles,33,34 right orbito-frontal cortex,35,36 and superior medial-frontal cortex.37 Regions where coefficients are reliably negative (cool colors in Figure 3) pick out areas known to encode less socially-critical information, including scenes (parahippocampal place area)32 and object-directed action (dorsal visual stream, left dorsal premotor motor area)38–40. The mixed-direction regions (green in Figure 3) suggest that, in addition to these directionally-consistent codes, information about the stimulus category can be encoded with distributed patterns in which the direction of the code varies within and across individuals. The current results cohere with other recent work suggesting that neural representations may be more broadly distributed than heretofore suspected, for face perception specifically41,42 and for conceptual structure more generally.43,44
The pattern of results across methods has two important implications for hypotheses about neural representation. First, the out-of-system information detected by SOS LASSO was not discovered by even a large searchlight, indicating that it does not reside within local cortical regions considered independently. Thus anatomically distal regions can jointly encode multivariate representational structure. Second, SOS LASSO assigned consistently positive or consistently negative classifier weights in regions where univariate contrast yielded a null result. Whereas univariate contrast considers mean activations of each voxel independently, a classifier coefficient indicates the voxel’s contribution to the classification decision while taking the contributions of other voxels into account. Similar to partial correlation analyses, a voxel’s activity may not correlate reliably with stimulus category when considered alone, but may correlate with residual variation once the effects of other voxels have been factored in. In that case the voxel will receive a weight in the classifier but will not show a significant effect in univariate analysis. The contrasting results thus indicate that voxels can contribute to representational structure in a directionally and locationally consistent manner even when they do not independently correlate with that structure.
These observations suggest a new perspective on face representation in the brain. The standard view posits “core” and “extended” systems residing within posterior temporal and occipital cortices.45,46 SOS LASSO suggests that, in addition to these regions, face perception generates widely-distributed activation patterns across much of the cortex, with anatomically distal regions jointly helping to differentiate the representations of visually presented faces from other stimuli. Within the full pattern, faces are signaled by elevated activity in social-cognitive brain networks, by suppressed activity in networks relevant to navigation and object-directed action, and by heterogeneous patterns in parietal and frontal cortices. But because elements of the pattern vary in the direction, independence, and localization of their code, previous methods each provide only a partial view of the full representation. Their agreement with the standard view arises because classic “core” and “extended” face systems comprise parts of the pattern where information is encoded efficiently and independently within circumscribed cortical regions, often in a directionally-consistent manner localized similarly across subjects.
Relation to other multivariate brain imaging approaches. We have focused on the common experimental design wherein the theorist seeks neural signal that differentiates one discrete experimental condition from another. Other designs have other goals and so employ other methods—for instance, representational similarity analysis (RSA) seeks voxel sets that jointly express a target similarity structure,47 while “generative” approaches seek to predict whole-brain images from externally-derived features of the stimulus or experimental condition.48 Like MVPC, such approaches often adopt techniques that can limit their ability to detect network-distributed signal, such as independent consideration of each voxel time-series or of different anatomical regions. Structured sparsity may likewise provide new insights for these kinds of problems.49 We also note that SOS-LASSO constitutes just one way of leveraging structured sparsity for pattern classification—other groups are pursuing similar ideas,24,50 and understanding the relations among these approaches is a central goal for future research.
Broader implications for cognitive neuroscience
Our aim has been to consider how functional imaging results might change when one employs statistical methods capable of discovering distributed signal of the kind suggested by neural network models of cognition. We chose visual face perception as one extensively-studied domain where both univariate and multivariate functional imaging have highlighted fundamental questions about neuro-cognitive representation. The contrasting results from SOS-LASSO compared to other methods illustrate that such an approach can lead to quite different conclusions even in well-studied domains. Yet such a discrepancy is not guaranteed: the decoding of place information with this method yielded results highly consistent with canonical views. The current study thus raises the possibility that functional imaging has provided an incomplete picture of neuro-cognitive representation across domains, but assessing that possibility in any given domain will require further work of this kind.
Online Methods
Simulation study
Model implementation
Code and documentation for running the simulations and subsequent analyses appear at https://github.com/crcox/SOSLassoSimulations. The model shown in Figure 2A of the main paper was trained on 72 items sampled from two domains, A and B. Each item activated exactly 2 systematic and 2 arbitrary input units, and across items each unit was active in exactly 8 items. Half of the systematic units were activated only by items from domain A, while the remaining half were activated only by items from domain B. Thus any pair of items in the same domain had a small probability of overlapping in some of their systematic properties, while items from different domains never overlapped in their systematic properties. Arbitrary units were equally likely to be active for items from domain A versus B.
The model was fit using the Light Efficient Network Simulator (https://github.com/crcox/lens) using back-propagation to minimize cross-entropy error. The weights were adjusted with a learning rate of 0.1, using momentum (“Doug’s” momentum = 0.9) and subject to weight decay (decay constant = 0.001). The model was trained 10 times to asymptotic performance with very low error over 1000 epochs. Prior to each training run, the model was initialized with random weights sampled from a uniform distribution in the range [−1, 1]. These 10 models were used to generate data for 10 model “subjects,” based on the patterns of activity elicited by each input over the whole network. Each model was presented with the 72 input patterns in sequence, and the pattern of activation elicited over the 114 units in the network (including the 28 irrelevant units, which always had an activation of zero) was recorded. The dataset for each model subject thus consisted of a matrix with 72 rows corresponding to stimulus items and 114 columns corresponding to model voxels. Each matrix contained the “true” response pattern for each subject to each item. To simulate noise in the measurement of this activity, a random value sampled independently from a Gaussian distribution with a mean of zero and standard deviation of 1 was added to each cell of the matrix. We take the resulting values in each cell of a matrix to be a model analog of the estimated BOLD response to a single stimulus at a single voxel in a single subject in an fMRI study.
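The noise step described above can be illustrated with a minimal sketch. It assumes the "true" response patterns are held as a list of rows (items) by columns (model voxels); the function name, the all-zero placeholder matrix, and the seed are hypothetical and are not taken from the published simulation code.

```python
import random

def add_measurement_noise(true_patterns, sd=1.0, seed=None):
    """Return a noisy copy of an items-by-units matrix.

    Each cell receives an independent draw from a Gaussian with
    mean 0 and standard deviation `sd`.
    """
    rng = random.Random(seed)
    return [[value + rng.gauss(0.0, sd) for value in row]
            for row in true_patterns]

# Hypothetical stand-in for one model subject: 72 items x 114 units.
# A real run would use the activations recorded from the trained network.
n_items, n_units = 72, 114
true_patterns = [[0.0] * n_units for _ in range(n_items)]
noisy_patterns = add_measurement_noise(true_patterns, sd=1.0, seed=0)
```

Each row of `noisy_patterns` then plays the role of the estimated response to one stimulus across all model voxels.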
To apply different brain-imaging methods to the discovery of structure, it is necessary to further stipulate the anatomical locations of the different units in the model. In all simulations, input units were situated all together, with domain-A units neighboring one another, domain-B units neighboring one another, arbitrary units neighboring one another, and the three unit types concatenated end-to-end (AH units immediately adjacent to SH units; irrelevant units immediately adjacent to AH units). Output units were organized the same way, though outputs were assumed to be anatomically distal to inputs. The anatomical arrangement of input and output units was identical across model individuals.
For hidden units, we considered two different anatomical organizations. For anatomically localized models, units within a hidden layer (SH, AH, or irrelevant) were treated as anatomical neighbors with the unit types again concatenated end to end and localized in the same way across model individuals. Consequently, large searchlights or over-zealous smoothing could integrate activation from more than one hidden unit type. In the anatomically dispersed condition, hidden and irrelevant units were organized into three “regions” each containing a mix of SH, AH and I units, plus a fourth region containing only I units. The regions were each treated as anatomically distal to one another and to the input and output units. The three regions with mixed unit types all contained 7 units total, with 2 or 3 SH units, 2 or 3 AH units, and 2 or 3 I units. All model subjects had the same 7 units assigned to a given region, but the spatial ordering of these units within region was shuffled at random independently for each. The aim was to generate a scenario in which spatially distal units can jointly encode representational structure, with the coarse anatomical layout consistent but fine-grained layout variable across individuals. Note that model connectivity, the activation patterns evoked by different inputs, and the ways these patterns were distorted by measurement noise were identical for both model layouts—all that differed was the spatial locations of the units in each layer.
Statistical analysis
Each analysis ultimately involves computing, for each unit, the probability of the observed result under the null hypothesis that the unit does not contribute to differentiating A from B items. For fair comparison across methods, we therefore used the same statistical threshold for “counting” a unit as significant (p < 0.002 uncorrected). The qualitative pattern of results across methods does not vary with this criterion.
Univariate contrast
The activity at each unit was analyzed for all subjects using a mixed-effects model that treated subject as a random factor,1,2 stimulus category (A or B) as the sole fixed effect, and unit activation as the dependent measure. The model was fit using the fitlme function in MATLAB, and the fixed-effect coefficient was tested for significance using the Satterthwaite approximation to the degrees of freedom and a standard F-test, numerator degrees of freedom = 1, denominator degrees of freedom = 9. The results are directly analogous to a repeated-measures ANOVA. Results were thresholded at uncorrected p < 0.002. The analysis was conducted for both the anatomically localized and the dispersed model. In both cases, the data were spatially smoothed, taking a weighted average over a three-unit window, where the center unit was weighted about twice as much as the two flanking units.
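The smoothing step can be sketched as follows. The weights (0.25, 0.5, 0.25) are an assumption chosen to match "about twice as much" for the center unit, and the renormalization at the two ends of the array is likewise an assumption, since the text does not say how edges were handled.

```python
def smooth_three_unit(values, weights=(0.25, 0.5, 0.25)):
    """Weighted average over a three-unit window centered on each unit.

    At the ends of the array only the available neighbors are used and
    the weights are renormalized (an assumed edge rule).
    """
    n = len(values)
    smoothed = []
    for i in range(n):
        total, weight_sum = 0.0, 0.0
        for offset, w in zip((-1, 0, 1), weights):
            j = i + offset
            if 0 <= j < n:
                total += w * values[j]
                weight_sum += w
        smoothed.append(total / weight_sum)
    return smoothed

print(smooth_three_unit([0, 0, 4, 0, 0]))  # [0.0, 1.0, 2.0, 1.0, 0.0]
```

A single active unit is spread onto its two neighbors, which is how smoothing can blend signal across adjacent unit types in the localized layout.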
Searchlight MVPC
The searchlight analysis was conducted using the SearchMight toolbox3 for MATLAB. Non-contiguous unit groups from Figure 2A in the main paper were treated as anatomically separated regions so that a searchlight never encompassed units in different regions. This was accomplished by inserting empty units between the layers, and providing a mask to SearchMight to omit those units during analysis. Within each searchlight, a Gaussian Naive Bayes (GNB) classifier was fit to distinguish between category A and B items. Although GNB classifiers are limited in some ways,3 the concerns do not apply to this simple and idealized case where noise is truly identically and independently distributed (iid) with uniform variance. Classifier performance at each searchlight was estimated through 6-fold cross validation. The mean cross-validation accuracy was stored at each searchlight center, and the mean accuracy over model subjects was tested for each unit to see if it differed significantly from chance. The resulting map of p-values was thresholded at p < 0.002 uncorrected. The analysis was performed on both the anatomically localized and the anatomically dispersed arrangements of units. The data were not smoothed prior to the searchlight analysis. The analysis was performed with small (3 unit), medium (7 unit), and large (14 unit) searchlights; results for smaller and larger searchlights are reported in SI-Model.
LASSO
Logistic LASSO (and ridge regression; see SI-Model) were conducted as special cases of the SOS-LASSO optimization code posted at https://github.com/crcox/WholeBrain_MVPA/blob/master/src/%4QSOSLasso/SOSLasso.m, which is part of the broader Wholebrain Imaging with Sparse Correlations (WISC) workflow maintained by CRC. Both methods have a free parameter λ that controls the importance of the regularization penalty relative to the prediction error, leading to greater sparsity in LASSO and more severe weight shrinkage in ridge regression. The analysis thus proceeded in two steps: one to estimate a useful λ for each simulated subject, and a second to fit a model at the estimated λ and evaluate it on a hold-out set. The data for each simulated subject were first divided into 6 equal parts, each containing the same number of category A and B items. One part was set aside and the remaining 5 were passed to a function that conducted a 5-fold cross-validation accuracy search across many values of λ. The parameter search was implemented using hyperband, a state-of-the-art search procedure that optimizes use of parallel computing infrastructure when fitting complex optimizations.4 The function returns the λ producing the highest cross-validation accuracy, which is subsequently used to fit a model to all 5 parts of the data. The resulting model was then assessed on the original hold-out set (the 6th part). This procedure was carried out separately for all 10 model subjects, in both localized and anatomically dispersed model variants.
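The data split described above can be sketched as a stratified fold assignment. This is an illustration only: it assumes 36 items per category, and it replaces the hyperband search with a plain maximum over a candidate list. The names `stratified_folds`, `select_lambda`, and `score_fn` are hypothetical and do not come from the WISC code.

```python
import random

def stratified_folds(labels, n_folds, seed=0):
    """Split item indices into folds with equal counts of each category."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for category in sorted(set(labels)):
        indices = [i for i, lab in enumerate(labels) if lab == category]
        rng.shuffle(indices)
        for position, item in enumerate(indices):
            folds[position % n_folds].append(item)
    return folds

def select_lambda(candidates, score_fn):
    """Return the candidate with the highest cross-validation score.

    `score_fn` is a hypothetical callable mapping a lambda value to mean
    accuracy over the 5 tuning folds; this exhaustive search stands in
    for hyperband.
    """
    return max(candidates, key=score_fn)

labels = ["A"] * 36 + ["B"] * 36        # assumed 36 items per domain
folds = stratified_folds(labels, n_folds=6)
holdout_fold, tuning_folds = folds[5], folds[:5]
```

With 72 items this yields six parts of 12 items, each holding 6 A and 6 B items; the sixth part is reserved for the final evaluation.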
LASSO provides a straightforward selection criterion: any unit receiving a non-zero weight has been “selected” as important for predicting the stimulus class. We conducted a permutation test to assess, for each unit, the probability of selection under the null hypothesis of no relationships between unit activation and category label. We first estimated the best sparsity level λ from the unpermuted model data using the procedure just described. For each permutation, we randomly shuffled the category labels, then fit a LASSO model at the pre-specified λ and recorded the resulting coefficient on each unit. 1000 permutations were conducted. We then estimated the probability of selection for each unit as the number of times the unit received a non-zero weight divided by 1000. These probabilities were estimated separately for each model subject, then averaged across model subjects at each unit, yielding a base probability of selection for each unit. In the non-permuted data, we counted how often a given unit received a non-zero weight across the 10 simulated subjects, then used the binomial distribution to compute the probability of this outcome given the base probability of selection from the permutation testing. For instance, if a unit was selected with probability 0.2 in the permutation testing, and was selected in 5 out of 10 simulated subjects in the real (non-permuted) data, we computed the likelihood of this outcome under the null hypothesis as the binomial probability of achieving 5 or more successes in 10 attempts given a base probability of 0.2 for success (p ≈ 0.033). This probability was computed for every unit and the results were thresholded at p < 0.002 without correction for multiple comparisons.
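The worked example in the preceding paragraph can be reproduced with a short upper-tail binomial calculation. The function name is hypothetical; the numbers (5 of 10 subjects, base probability 0.2) come from the text.

```python
from math import comb

def binomial_tail(k, n, p):
    """Probability of k or more successes in n independent trials."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# Unit selected in 5 of 10 model subjects, null selection probability 0.2.
p_value = binomial_tail(5, 10, 0.2)
print(round(p_value, 3))  # 0.033

# The unit counts as significant only if the tail probability is below 0.002.
is_significant = p_value < 0.002
```

In this example the unit would not pass the p < 0.002 threshold, since 0.033 is well above it.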
SOS LASSO
The SOS LASSO analysis was implemented using custom code built on top of MALSAR5 for MATLAB. The data were divided into groups based on “anatomical” proximity. Each group included 7 units (except for the last group in the input and output layers, which had 8 so that no unit was left on its own), and the groups did not overlap. The group size was selected to correspond with the number of systematic hidden units, to emphasize the difference between the localized and dispersed cases (i.e., group size was optimal for the localized condition, and ensured that systematic hidden units were maximally dispersed over groups in the dispersed condition). SOS LASSO has two free parameters, one controlling sparsity at the group level and one controlling overall sparsity. As in the previous analysis, these parameter values were selected through an internal 5-fold cross-validation process, then a final model was trained with the best parameters and tested on a sixth hold-out set.
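The grouping of units can be sketched as below. The layer size of 36 is a hypothetical value chosen only because it leaves a single leftover unit, and folding any remainder into the last group generalizes the one case the text describes.

```python
def contiguous_groups(n_units, group_size=7):
    """Partition unit indices 0..n_units-1 into adjacent, non-overlapping groups.

    Leftover units that do not fill a whole group are appended to the
    last group instead of forming a small group of their own.
    """
    if n_units <= group_size:
        return [list(range(n_units))]
    n_groups = n_units // group_size
    groups = [list(range(g * group_size, (g + 1) * group_size))
              for g in range(n_groups)]
    groups[-1].extend(range(n_groups * group_size, n_units))
    return groups

# Hypothetical layer of 36 units: four groups of 7 and a final group of 8.
groups = contiguous_groups(36)
print([len(g) for g in groups])  # [7, 7, 7, 7, 8]
```

Because group membership follows spatial adjacency, the same function gives different groupings of the hidden units under the localized and dispersed layouts.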
To statistically determine which model units were reliably selected we again conducted a permutation test. The true category labels were permuted 1000 times and for each permutation we fit a SOS LASSO model to all 10 model subjects simultaneously, using hyper-parameters chosen from the true data. We then computed, for each unit, the proportion of times it was selected across all permutations and all model subjects, taking this as the probability of selection under the null hypothesis. The true data were thresholded using the binomial distribution to compute, for each unit, the probability of the observed number of selections out of 10 model subjects given the null-hypothesis base probability for that unit. We again thresholded this probability map at p < 0.002 uncorrected.
Imaging study
Data collection methods
The fMRI dataset was collected and contributed by Lewis-Peacock and Postle.6 Subjects viewed 90 stimuli from three categories: 30 famous people, 30 famous locations, and 30 common objects. They then indicated how much they liked the celebrity, how much they would like to visit the location, or how often they encountered the object in everyday life using a stimulus-response box and a four-point Likert scale. Each stimulus was presented one time only, for a total of 90 randomly ordered stimulus presentations. Each trial consisted of a cue period (2 s), a stimulus period (5 s), and a judgment period (3 s). Each trial was followed by an arithmetic task (16 s) to reduce interference between trials.
Whole-brain images were acquired with a 3T scanner (Signa VH/I; GE Healthcare). T1-weighted images (30 axial slices, 0.9375 x 0.9375 x 4 mm) were acquired for all 10 subjects. Functional images were acquired using a gradient-echo, echo-planar sequence [repetition time (TR), 2000 ms; echo time, 50 ms] within a 64 x 64 matrix (30 axial slices coplanar with the T1 acquisition, 3.75 x 3.75 x 4 mm). Six scans were obtained for each subject, each scan lasting 6 min, 50 s. Each scan was preceded by 20 s of dummy pulses to achieve a steady state of tissue magnetization. Preprocessing of the functional data was done with the Analysis of Functional NeuroImages (AFNI) software package7 using the following steps, in order: correction for slice time acquisition and rigid-body realignment to the first volume from the experimental task; removal of signal spikes; removal of the mean from each voxel and linear and quadratic trends from within each run; and correction for magnetic field inhomogeneities (using in-house software). Note that spatial smoothing was not imposed, and the data were not spatially transformed into a common atlas space before hypothesis testing. Rather, the data from each subject were analyzed in that subject’s un-smoothed, native space. No GLM was fit to the hemodynamic response; instead, the BOLD response at the 5th TR after stimulus onset, chosen to be near the anticipated peak of the hemodynamic response, was used for analysis.
Statistical analysis
In all analyses the functional data for each subject were first masked to exclude voxels outside of cortex (white matter, CSF, cerebellum, thalamus and sub-cortical structures, bone, etc).
Univariate analysis
The functional data for each subject were projected to Talairach space using the T1 data with a combination of manual landmark identification (anterior and posterior commissures) and automated affine transformation obtained using 3dvolreg in AFNI. The spatially normalized response to each stimulus, sampled at the 5th TR after stimulus onset, was smoothed with a 4 mm FWHM Gaussian kernel. At each voxel we computed the mean response for face stimuli and for non-face stimuli and subtracted these (face – nonface). In a whole-brain analysis, we computed a t-test on this difference against the null hypothesis of 0, and thresholded the resulting map using a cluster-corrected whole-brain threshold of p < 0.05. The resulting whole-brain map revealed bilateral parahippocampal regions that were less active for faces than for non-face stimuli, but no regions that were more active for faces. We then conducted a region-of-interest analysis directly assessing whether the right fusiform face area (rFFA), the earliest and most consistently reported face-selective area in cortex, was reliably more active for face than non-face stimuli. Using the rFFA mask published by Kanwisher’s group, we computed the mean response for faces and for non-faces across all rFFA voxels in each subject, took the difference, and computed a t-test against the null hypothesis of 0. The rFFA was reliably more active for faces with p < 0.03, one-tailed. The same ROI analysis was then conducted for each other face-specific region in the canonical maps, but no other region showed a reliable difference in activation for faces versus non-faces.
Searchlight MVPC
We used SearchMight with a GNB classifier to generate native-space information maps for each subject using both a 9 mm and a 15 mm searchlight around each voxel. The maps were first normalized to Talairach space using the same procedure described for the univariate analysis, and then smoothed with the same 4 mm Gaussian kernel. The performance metric for each searchlight (i.e., the value stored at each point in the information map) was the difference between the hit rate and the false positive rate, which has an expected value of 0 under the null hypothesis. A one-sample, one-tailed t-test against 0 was conducted on this metric. P-values were thresholded at p < 0.05 using the same whole-brain cluster-correction approach employed with univariate analysis.
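The searchlight performance metric can be written out explicitly. This sketch assumes a two-way face versus non-face labeling and uses a hypothetical function name; it illustrates the metric only, not the SearchMight implementation.

```python
def hit_minus_false_alarm(true_labels, predicted_labels, target="face"):
    """Hit rate minus false positive rate for one searchlight.

    Hit rate: proportion of target items classified as target.
    False positive rate: proportion of non-target items classified as target.
    """
    pairs = list(zip(true_labels, predicted_labels))
    n_target = sum(1 for truth, _ in pairs if truth == target)
    n_other = len(pairs) - n_target
    hits = sum(1 for truth, pred in pairs if truth == target and pred == target)
    false_alarms = sum(1 for truth, pred in pairs
                       if truth != target and pred == target)
    return hits / n_target - false_alarms / n_other

truth = ["face", "face", "other", "other", "other", "other"]
guess = ["face", "other", "face", "other", "other", "other"]
print(hit_minus_false_alarm(truth, guess))  # 0.25
```

A classifier that always answers "face" scores 0, which is why the metric has an expected value of 0 under the null hypothesis even when the categories are unbalanced.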
LASSO
The analyses were completed in two rounds of modeling, to support two different statistical analyses: one to test the generalization accuracy of the fitted models (the performance round), and the other to assess which voxels reliably contribute to that performance (the importance mapping round).
The performance round
Model performance was evaluated independently for each subject using cross-validated generalization error. The workflow for evaluating performance (Figure SI-1) proceeds as follows. The available examples (i.e., whole-brain maps showing the BOLD response generated by each item in the experiment) are split into 10 blocks; these block assignments are the same for all analyses reported in this document. Before model fitting, one block is set aside as the final holdout set, and model fitting then loops over the remaining nine blocks, holding out each one in turn as a test block and using the remaining 8 to “tune” the free hyperparameter. Many models are fit using the tuning data and different hyperparameter values and each such model is evaluated against the same test block in that loop. This yields a set of error values, denoted as Eλ1…Eλn in the diagram. The same hyperparameter values are used in each iteration of the loop, and the model prediction error associated with a given hyperparameter value is estimated as the mean error across all 9 loops. The hyper-parameter value that produces the lowest mean error is then used to fit a model to data from all 9 tuning blocks and this model is then evaluated on the final holdout set. The whole procedure is repeated 10 times, with each block in turn treated as the final holdout set. This yields 10 “final error” values, denoted as Ef1…Efn in the diagram, which are averaged to produce an estimate of the classifier performance. This nested cross validation procedure ensures that the final holdout set is completely isolated from the hyper-parameter selection and model fitting stages, and also helps avoid idiosyncrasies of particular combinations of training and test sets.
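The loop structure of the performance round can be sketched as follows. `fit_and_score` is a hypothetical callable standing in for model fitting and evaluation; it takes a hyperparameter value, a list of training blocks, and a test block, and returns an error. The toy error function exists only to make the sketch runnable.

```python
def nested_cv(blocks, lambdas, fit_and_score):
    """Nested cross-validation over block identifiers.

    Returns the mean final holdout error, the per-holdout errors, and
    the hyperparameter chosen for each holdout block.
    """
    final_errors, chosen = [], []
    for holdout in blocks:                      # outer loop: final holdout set
        tuning = [b for b in blocks if b != holdout]
        mean_error = {}
        for lam in lambdas:                     # same values in every loop
            errors = []
            for test in tuning:                 # inner loop over 9 blocks
                train = [b for b in tuning if b != test]
                errors.append(fit_and_score(lam, train, test))
            mean_error[lam] = sum(errors) / len(errors)
        best = min(mean_error, key=mean_error.get)
        chosen.append(best)
        # Refit on all 9 tuning blocks, evaluate once on the final holdout.
        final_errors.append(fit_and_score(best, tuning, holdout))
    return sum(final_errors) / len(final_errors), final_errors, chosen

# Toy error function (an assumption for illustration): error depends
# only on the hyperparameter and is lowest at 0.5.
def toy_error(lam, train_blocks, test_block):
    return abs(lam - 0.5)

mean_final, finals, chosen = nested_cv(list(range(10)), [0.1, 0.5, 1.0], toy_error)
```

The final holdout block never appears in `tuning`, so it cannot influence hyperparameter selection or model fitting.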
The importance mapping round fits a single model for each subject using all data, with the aim of then analyzing the model coefficients to determine which voxels are contributing in what way to classifier performance. The workflow for importance mapping (Figure SI-2) is similar to the prior round except that the outer loop over final holdout sets is eliminated so that a single model is fit to each subject using the entire dataset. As for the performance round, hyperparameters are tuned with 10-fold cross validation on model hold-out error. The best hyperparameters are then used to fit a final model to all data from all subjects. The coefficients from this model are then analyzed.
The motivation for the two steps is as follows. To get an accurate estimate of classifier performance, it is important that the final test items are excluded from all model-fitting steps, including hyperparameter selection, as done in the performance round. Several folds of validation must be conducted to ensure that the performance estimate is not unduly influenced by the particular items selected for the hold-out set. This means, however, that a different model is fit for each hold-out fold, and that the best set of hyperparameters may differ across folds. It is then unclear which model or hyper-parameter set should be used to interpret coefficients. Thus the performance round is conducted only to assess the expected hold-out error for the classifier. The importance mapping round does not provide useful information about expected hold-out error but yields a single model with one weight per voxel per subject.
To interpret these weights, the coordinates associated with each non-zero value were warped into Talairach space and the weight values at each point were linearly interpolated onto a common 3x3x3 mm grid. These interpolated weights were then smoothed with a 4 mm FWHM Gaussian kernel to compensate for imperfect warp alignment in each subject. To determine which voxels were selected across participants more often than expected under the null hypothesis, we fit 1000 permuted models for each subject, using the same functional data with the category labels shuffled randomly, and using the same hyperparameters selected from the properly-labeled data. The permutation solutions were projected into Talairach space and smoothed in the same way as the real data. For each voxel in each subject we computed the probability of selection (non-zero value) from these permutations. This allowed us to assess, for each voxel in each subject, its probability of selection using the data-tuned hyperparameters but with no systematic relationship between the fMRI data and the target labels. This information was used to construct a group-level binomial test at each voxel, where the probability of a voxel being selected N times out of 10 was assessed relative to the base rate of selection over the 1,000 permuted models. Voxels selected 3 or more times were associated with an uncorrected p < 0.002, and this is the threshold applied to the LASSO solutions.
SOS LASSO
As for LASSO, only cortical voxels were modeled, and analyses were completed in two phases. The performance phase proceeded with a nested cross-validation procedure exactly as described for LASSO. For the importance-mapping phase, appropriate hyperparameters were determined using 10-fold cross validation as with LASSO, and SOS LASSO solutions were then fit for 1000 permutations using these hyperparameters, shuffling the category labels in the same way for all participants on each permutation. Because the solutions are not independent across participants in SOS LASSO, we could not use the binomial approach developed for LASSO. Instead we counted, for each permutation, how often each voxel was selected across the 10 subjects. We then took the maximum of this value across all voxels and permutations, that is, the maximum across 1000 permutations × approximately 10,000 voxels. The largest amount of overlap observed across subjects in the permuted data was 7; we therefore thresholded the data map to show voxels selected in 8 or more of 10 participants. Because this level of overlap was never observed in the permuted analyses, we can be confident that the thresholded maps contain very few false alarms. To estimate the p-value, consider the number of observations that went into determining the maximum amount of overlap in the permutation distribution, which is the number of voxels in the group map × the number of permutations. For 10,000 voxels and 1000 permutations, this would yield an uncorrected p < 0.0000001, which is lower than Bonferroni corrected p < 0.05 for 10,000 tests.
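The maximum-overlap threshold and the accompanying p-value arithmetic can be sketched as follows. The data layout (one set of selected voxel indices per subject per permutation) and the function name are assumptions for illustration, and the toy input is far smaller than the real analysis.

```python
def max_overlap_threshold(permuted_selections):
    """Smallest overlap count never reached in the permuted solutions.

    permuted_selections[p][s] is the set of voxel indices selected for
    subject s on permutation p. Returns (largest overlap seen) + 1.
    """
    max_overlap = 0
    for per_subject in permuted_selections:
        counts = {}
        for selected in per_subject:
            for voxel in selected:
                counts[voxel] = counts.get(voxel, 0) + 1
        if counts:
            max_overlap = max(max_overlap, max(counts.values()))
    return max_overlap + 1

# Toy input: 2 permutations x 3 subjects.
toy = [[{1, 2}, {2, 3}, {2}],
       [{1}, {1}, {4}]]
threshold = max_overlap_threshold(toy)   # voxel 2 overlaps 3 times -> 4

# Arithmetic from the text: about 10,000 voxels and 1000 permutations.
n_observations = 10_000 * 1_000
p_bound = 1 / n_observations             # 1e-07
bonferroni_alpha = 0.05 / 10_000         # 5e-06
```

With a maximum permuted overlap of 7, the same rule gives the threshold of 8 used in the analysis, and `p_bound` falls below `bonferroni_alpha`.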
Because the SOS LASSO solutions implicated areas of the brain not often seen in analysis of face functional specificity, we performed follow-up analyses in which all voxels within the face and place systems published in 8 and padded by 7mm, or identified in any of the univariate, searchlight, or LASSO analyses, were excluded from the outset, leaving only the unexpected areas. We also ran the analysis only on voxels included within the padded face and place systems. Both analyses were conducted exactly as reported for the whole-brain analysis; results appear in the main paper text.
Footnotes
Author’s Note: We would like to acknowledge Nikhil Rao and Robert Nowak for developing the SOS Lasso loss function and its MATLAB implementation, Mark Seidenberg and Matthew A. Lambon Ralph for valuable discussions of this work in its development, and Brad Postle and Jarrod Lewis-Peacock for sharing their functional imaging data with us. This work was partially supported by Medical Research Council Programme grant MR/J004146/1 and by European Research Council grant GAP: 670428 - BRAIN2MIND_NEUROCOMP.
We have no conflicts of interest.