Abstract
Whether the human brain represents emotional stimuli as discrete categories or continuous dimensions is still widely debated. Here we directly contrasted the power of categorical and dimensional models to explain behavior and cerebral activity in the context of perceived emotion in the voice. We combined functional magnetic resonance imaging (fMRI) and magneto-encephalography (MEG) to measure with high spatiotemporal precision the dynamics of cerebral activity in participants who listened to voice stimuli expressing a range of emotions. The participants also provided a detailed perceptual assessment of the stimuli. Using representational similarity analysis (RSA), we show that the participants’ perceptual representation of the stimuli was dominated by discrete categories, and that early (<200ms) cerebral responses were significantly associated with the categorical model in the auditory cortex, starting as early as 77ms. Furthermore, we observed strong associations between the arousal and valence dimensions and activity in several cortical and subcortical areas at later latencies (>500ms). Our results thus show that both categorical and dimensional models account for patterns of cerebral responses to emotions in voices, but with different timelines, and detail how these patterns evolve from discrete categories to progressively refined continuous dimensions.
One Sentence Summary: Emotions expressed in the voice are instantly categorized in cortical processing and their distinct qualities are refined dimensionally only later on.
Main text
A persistent and controversial debate in affective sciences is whether emotions are better conceptualized as discrete categories or continuous dimensions (1, 2). Discrete emotion theories postulate a small number of modules, each specific to a basic emotional category such as fear or anger (3, 4). Dimensional theories instead argue that emotions are best described along a number of continuous dimensions such as valence (reflecting the degree of pleasantness ranging from negative to positive) or arousal (reflecting the degree of intensity ranging from calm to excited) (5, 6).
Despite decades of continuous effort, this fundamental question is still unresolved (7), and conflicting behavioral evidence continues to emerge in both intracultural and cross-cultural studies (8-10). Neuroimaging research on the cerebral bases of emotion, either felt or perceived, has not unequivocally settled this debate either (11, 12), and meta-analyses of large bodies of evidence can support either the notion of category-specific modules (13) or that of large-scale networks representing dimensional attributes (1, 14). Multi-voxel pattern analyses (MVPA) (15) have identified distributed patterns of cerebral activity allowing classification of felt or perceived emotions in others into discrete categories as well as estimation of valence and arousal dimensions (16-24). Yet, whether the brain represents emotional events more as discrete categories or as continuous dimensions remains unclear, in large part because the predictions of the two major theoretical positions have so far not been directly compared in a dedicated study integrating behavior with neuroimaging to detail how cerebral responses evolve in both space and time (8, 17, 18).
Here we address this question in humans by combining comprehensive behavioral assessments with multimodal brain-activity measurements from the same individuals at high spatial and temporal resolution. We measured cerebral activity using functional magnetic resonance imaging (fMRI) and magneto-encephalography (MEG) while participants listened to voices that densely sampled a range of perceived emotion categories and dimensional attributes (Fig. 1 and 2). This approach allowed measuring the spatiotemporal dynamics of cerebral activity during the passive perception of emotional stimuli, linking these patterns to overt behavioral responses collected after scanning and directly comparing the predictions of discrete and continuous models. We applied a multivariate analysis technique called representational similarity analysis (RSA) (25) to relate the perceived categorical and dimensional attributes of the stimuli (categorical and dimensional models derived from behavioral measures) to the multivariate cerebral responses (Fig. S1). With this approach, we combined multiple behavioral measures with an integrated analysis of spatial (fMRI) and spatiotemporal (MEG) cerebral activity patterns from the same participants and obtained robust converging evidence for categorical and dimensional representations of perceived vocal emotions.
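The core logic of RSA can be illustrated with a minimal sketch: a brain representational dissimilarity matrix (RDM) is built from pairwise distances between stimulus-evoked response patterns and then rank-correlated with a model RDM. The data, names, and shapes below are synthetic stand-ins, not the authors' pipeline.

```python
# Minimal RSA sketch on synthetic data. `patterns` stands in for
# stimulus-evoked responses (one row per stimulus); `model_rdm` stands in
# for a behaviour-derived dissimilarity matrix. Both are hypothetical.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_stimuli, n_features = 39, 50          # e.g. 39 morphs, 50 voxels
patterns = rng.standard_normal((n_stimuli, n_features))

# Brain RDM: pairwise dissimilarity between stimulus-evoked patterns,
# kept in condensed (upper-triangle) form as returned by pdist.
brain_rdm = pdist(patterns, metric="correlation")

# Model RDM: here a random stand-in for a behaviour-derived matrix.
model_rdm = pdist(rng.standard_normal((n_stimuli, 1)), metric="euclidean")

# RSA statistic: rank correlation between the two condensed RDMs.
rho, p = spearmanr(brain_rdm, model_rdm)
print(f"RSA correlation: rho={rho:.3f}")
```

The same comparison can be repeated for each candidate model RDM (categories, valence, arousal) against the same brain RDM.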
Auditory stimuli consisted of a homogeneous set of emotionally expressive nonverbal vocalizations obtained by morphing between the recordings of each of two actors (one female, one male) portraying four different emotions — anger, fear, disgust, and pleasure — as well as a neutral expression (26) while briefly uttering the vowel /a/. Morphing combined pairs of emotional vocalizations from the same actor with weights varying in 25% steps from 0% (neutral) to 125% (emotional caricature) for neutral-emotion morphs and from 0% to 100% between the four expressed emotions (Fig. 1A), resulting in 39 stimuli per actor (Audio S1-S78). Healthy participants (n=10) were each scanned in alternating sessions of fMRI and MEG (four sessions each) while performing a simple repetition detection task that ensured appropriate attention to the stimuli while avoiding an explicit focus on emotional attributes. The large amount of multimodal imaging data for each individual (8 sessions each for a total of 80 sessions) was key to adjudicating between overlapping emotion models with robust analyses. Once scanning was complete, participants rated the perceived dissimilarity of all (within-actor) pairs of stimuli in the absence of instructions that would bias the judgment toward a specific stimulus feature. During the last session, they evaluated the perceived emotional stimulus attributes by categorizing emotions and rating their valence and arousal (cf. Supplementary Materials and Methods).
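The count of 39 stimuli per actor follows arithmetically from one consistent reading of the morphing scheme: one shared neutral endpoint, five non-zero morph levels per neutral-emotion continuum, and three intermediate levels per emotion-emotion pair (the 0% and 100% endpoints being the pure emotions already counted). A short enumeration makes the bookkeeping explicit; the tuple encoding is purely illustrative.

```python
# Worked check of the per-actor stimulus count implied by the morphing
# scheme: neutral-emotion morphs in 25% steps up to 125% (caricature),
# plus emotion-emotion morphs from 0% to 100%. Encoding is hypothetical.
from itertools import combinations

emotions = ["anger", "fear", "disgust", "pleasure"]

stimuli = {("neutral",)}                       # shared 0% endpoint
for emo in emotions:                           # neutral-emotion continua
    for w in (25, 50, 75, 100, 125):
        stimuli.add((emo, w))

for a, b in combinations(emotions, 2):         # emotion-emotion continua
    for w in (25, 50, 75):                     # endpoints already counted
        stimuli.add((a, b, w))

print(len(stimuli))                            # 1 + 4*5 + 6*3 = 39
```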
Analysis of the behavioral results confirmed that the morphing method reliably modulated perceived emotion categories and dimensions (see Fig. 1B, C for perceptual effects of morphing and Fig. 2A, B for visualization of emotion attributes in all stimuli). We quantified the relevance to perceived stimulus dissimilarity (Fig. 2C) of each of three emotion-attribute distances derived from the categorization and the valence and arousal ratings (emotion representational dissimilarity matrices – RDMs; Fig. 2A, B; correlations between emotion RDMs ≥ 0.19 and ≤ 0.36, standard error of the mean – s.e.m. ≤ 0.08, T(9) ≥ 4.07, p < 0.05 family-wise error rate – FWE corrected across correlations). Although larger differences in each of the emotion attributes were associated with an increase in perceived dissimilarity (r ≥ 0.27, s.e.m. ≤ 0.03, T(9) ≥ 8.01, p < 0.05 FWE corrected across emotion RDMs), only for categories and arousal was this modulation selective, i.e., independent of the variance shared between all emotion attributes (semi-partial correlation – s.p.r ≥ 0.15, s.e.m. ≤ 0.03, T(9) ≥ 8.98, p < 0.05 FWE corrected across emotion RDMs). Importantly, categories selectively modulated perceived dissimilarity more strongly than arousal or valence (unique explained variance contrast for categories vs. arousal or valence ≥ 29.07%, s.e.m. ≤ 2.42%, T(9) ≥ 12.52; arousal vs. valence unique explained variance contrast = 7.18%, s.e.m. = 1.58%, T(9) = 4.66; all p < 0.05 FWE corrected across contrasts) and accounted for perceptual dissimilarity better than both dimensional attributes together (percent explained variance contrast = 12.89%, s.e.m. = 1.65%, T(9) = 10.28, p = 0.00002). Thus, the behavioral data indicate that both categories and dimensions influence the perception of the emotional voice stimuli but that categories have the stronger influence.
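The semi-partial correlation used here to isolate each attribute's unique contribution can be sketched as follows: the other predictors are regressed out of the predictor of interest (not out of the outcome) before correlating the residual with perceived dissimilarity. All data below are synthetic, with a built-in correlation between valence and arousal to mimic their shared variance.

```python
# Sketch of a semi-partial correlation on synthetic stimulus-pair data.
# Variable names mirror the paper's attributes but the values are random.
import numpy as np

def semipartial_r(y, x, controls):
    """Correlate y with the part of x orthogonal to the controls."""
    C = np.column_stack([np.ones_like(y)] + controls)
    beta, *_ = np.linalg.lstsq(C, x, rcond=None)
    x_resid = x - C @ beta                      # x with controls removed
    return np.corrcoef(y, x_resid)[0, 1]

rng = np.random.default_rng(1)
n_pairs = 741                                   # 39*38/2 stimulus pairs
arousal = rng.standard_normal(n_pairs)
valence = 0.5 * arousal + rng.standard_normal(n_pairs)  # shared variance
categories = rng.standard_normal(n_pairs)
dissim = categories + 0.3 * arousal + rng.standard_normal(n_pairs)

r_sp = semipartial_r(dissim, categories, [arousal, valence])
print(f"unique contribution of categories: s.p.r = {r_sp:.2f}")
```

Squaring the semi-partial correlation gives the variance uniquely explained by that attribute, which is what the explained-variance contrasts compare.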
Next, we asked where (cerebral location) and when (peri-stimulus latency) stimulus-evoked cerebral activity was significantly associated with either the categorical or the continuous emotion models. We first built fMRI RDMs reflecting, at each cerebral location (voxel), the pairwise stimulus-evoked blood-oxygenation-level-dependent signal difference measured within a local sphere centered on that voxel (spatial fMRI searchlight = 6 mm radius). Each fMRI RDM was tested for a significant correlation with each of the three emotion-attribute RDMs (see Fig. 2A-B correlation maps and Fig. S3 and Table S3 for additional fMRI tests). We used these results to spatially constrain the subsequent MEG analysis and built MEG RDMs only at those locations that yielded significant fMRI-emotion RDM correlations (see Fig. 3A for fMRI correlation maps). The MEG RDMs were derived from pairwise stimulus-evoked magnetic signal differences at the corresponding source-space location and each peri-stimulus time point between −147 ms and 1060 ms relative to stimulus onset (spatiotemporal MEG searchlight = 10 mm radius, 53 ms duration and 40 ms overlap between successive windows; cf. Supplementary Methods and Fig. S1). The encoding of variance shared between different emotion attributes, such as the strong valence/arousal correlation apparent in Fig. 2, was teased apart from the encoding of variance unique to each of them via semi-partial correlation tests. We then contrasted the unique RDM variance explained by each of the three emotion attributes and, more importantly, by categories and both dimensional attributes together. All encoding measures generalized across the acoustical fingerprints of the male and female speakers (speaker-averaged perceived emotion attributes correlated with RDMs cross-validated across speakers).
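The searchlight logic can be illustrated in miniature: at each location, an RDM is built from the response patterns inside a local neighbourhood and correlated with a model RDM, yielding one RSA value per location. The toy below uses a 1-D "brain" and synthetic data; the real analysis used 6 mm fMRI spheres and, with a time axis added, spatiotemporal MEG windows.

```python
# Toy spatial-searchlight RSA on a 1-D grid of locations. All data,
# sizes, and the radius are illustrative, not the study's parameters.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(2)
n_stim, n_loc, radius = 39, 60, 3
data = rng.standard_normal((n_stim, n_loc))     # stimulus x location
model_rdm = pdist(rng.standard_normal((n_stim, 1)))

corr_map = np.zeros(n_loc)
for loc in range(n_loc):
    lo, hi = max(0, loc - radius), min(n_loc, loc + radius + 1)
    local_rdm = pdist(data[:, lo:hi], metric="correlation")
    corr_map[loc], _ = spearmanr(local_rdm, model_rdm)

print(corr_map.shape)                           # one RSA value per location
```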
Significance testing relied on a group-level permutation-based approach with cluster mass enhancement and multiple comparisons corrections across the entire analysis mask (FWE = 0.05) (cf. Supplementary Methods).
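A cluster-mass permutation test of this kind can be sketched as follows: subject-wise effect maps are randomly sign-flipped to build a null distribution of the maximum cluster mass, which controls the family-wise error rate across the analysis mask. The data are synthetic 1-D maps, and the thresholds and counts are illustrative choices, not the study's settings.

```python
# Sketch of a group-level sign-permutation test with cluster-mass
# enhancement on synthetic 1-D effect maps (one row per subject).
import numpy as np
from scipy.ndimage import label
from scipy.stats import t as t_dist

rng = np.random.default_rng(3)
n_sub, n_pts = 10, 100
effects = rng.standard_normal((n_sub, n_pts))
effects[:, 40:55] += 1.0                        # inject a genuine effect

def tmap(x):
    """One-sample t statistic at each point across subjects."""
    return x.mean(0) / (x.std(0, ddof=1) / np.sqrt(len(x)))

def cluster_masses(tvals, thresh):
    """Sum of supra-threshold t values within each contiguous cluster."""
    labels, n = label(tvals > thresh)
    return [tvals[labels == i].sum() for i in range(1, n + 1)]

thresh = t_dist.ppf(0.95, df=n_sub - 1)         # cluster-forming threshold
observed = cluster_masses(tmap(effects), thresh)

null_max = []
for _ in range(1000):                           # sign-flip permutations
    signs = rng.choice([-1, 1], size=(n_sub, 1))
    null_max.append(max(cluster_masses(tmap(signs * effects), thresh),
                        default=0.0))

null_max = np.array(null_max)
p_vals = [(np.sum(null_max >= m) + 1) / 1001 for m in observed]
print(f"largest observed cluster mass: {max(observed):.1f}")
```

Comparing each observed cluster mass against the permutation distribution of the maximum mass is what makes the resulting p values FWE-corrected across the whole mask.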
Auditory cortices bilaterally showed strong selective encoding of emotion categories from early latencies (see Fig. 3B for statistical maps and Table S1 for statistical peaks) in both primary (local MEG encoding peak at 77ms) and secondary (global MEG encoding peak at 117ms) areas of the superior temporal gyrus (STG; selective encoding extending to 517ms). At later latencies, activity patterns in these areas were characterized by selective encoding of arousal (237-837ms; global MEG encoding peak at 717ms, Fig. 3D) and, to a lesser extent, of valence in the right insula (717-877ms; global peak at 757ms, Fig. 3C). Activity patterns selectively encoded categorical or dimensional attributes in several additional cortical and subcortical areas. The left inferior frontal gyrus (pars triangularis; IFGt) preferentially represented the set of stimuli in terms of discrete emotions from as early as 117ms after sound onset (IFGt peak at 157ms, Fig. 3B), potentially reflecting implicit categorization processes based on feed-forward projections from the temporal cortex (27). Stimuli were represented in terms of their perceived arousal in the right amygdala (a subcortical structure involved in the fast detection and afferent processing of emotional signals (28-32)) only at relatively late latencies: starting from 237ms (Fig. 3D), then again between 557-597ms (arousal encoding peak) and around 757ms, after a brief shift of the arousal-encoding area towards the orbitofrontal cortex (677ms). This differential temporal evolution of the amygdala’s response to arousal aligns with the structure’s afferent and efferent projections to subcortical and cortical brain regions (27, 33).
Thus, converging evidence from three modalities—behavior, fMRI, and MEG—demonstrates that both the categorical and dimensional models explain patterns of behavioral and cerebral response to emotions in the voice, but with markedly different spatiotemporal dynamics. This may explain why previous studies have found evidence in support of either one or the other model (13, 14, 16-24). Our results shed significant light onto the debate by showing that categorical and dimensional representations unfold along different timelines in different cerebral regions, adding a much-needed temporal dimension to the picture of cerebral processing of perceived emotion, which so far has remained rather static. We find that the amygdala showed strong associations with the arousal dimension at latencies within 237-757ms. This is consistent with previous findings of selective impairments of arousal, but not valence, recognition in amygdala lesions (34, 35) and with neuroimaging of healthy individuals showing representation of arousal but not valence in the amygdala (36). In contrast, the valence dimension was weakly associated with perceptual representations and was represented in the brain only at later latencies (>700ms in the insula). Overall, the selective encoding of dimensional attributes in the amygdala and insula is in agreement with the involvement of a “salience” network (11) linking the processing of emotional states and events across species (37, 38) and thought to represent a phylogenetic precursor for communicative behavior in primates and humans (20, 39, 40). In other cerebral areas, however, stimulus representations appeared to evolve in time from one model to the other: the right auditory cortex, for example, represented stimuli first in terms of their categorical structure at early latencies and then in terms of their perceived arousal at later latencies, subsequent to their initial encoding in subcortical structures (Fig. 3E).
The representational dynamics observed for the right auditory cortex thus suggests a transition from an early dominance of feed-forward sensory processing to late attentional modulations resulting from feedback signals transmitted through lateral and medial cortical connections from the amygdala (39, 41).
Finally, we performed a direct comparison of the categorical and continuous models by asking when and where patterns of neural activity reflected one theoretical account more than the other. For this, we initially calculated the contrast of RDM variance explained uniquely by each of the emotion attributes and then contrasted the explanatory power of the categorical and dimensional models (see Fig. 3E and S3 for contrast maps and Tables S1 and S2 for statistical peaks). The categorical model uniquely explained significantly more MEG RDM variance than either valence or arousal or both combined at early latencies (157ms) in the right auditory cortex centered on mid-STG (mSTG; categories vs. valence contrast significant also at 197 and 357ms). Conversely, the dimensional model uniquely explained significantly more MEG RDM variance at later latencies (717-757ms) in a similar area of the right auditory cortex (arousal vs. categories contrast significant at 717ms; arousal vs. valence contrast significant at 637-677ms).
In summary, by enabling a direct contrast of the predictions of the two models, our results provide crucial insight into the category vs. dimension debate. Statistical comparison of the predictions of the two models yielded unequivocal evidence for an early prevalence of the categorical model: the perceptual structure of the stimuli was more related to categories than dimensions, and spatiotemporal activity patterns in widespread areas of the auditory cortices were associated with the categorical stimulus structure from early latencies on (as early as 77ms post-onset). The contrast of variance uniquely explained by categories and dimensions was significant in the right auditory cortex around 157ms after stimulus onset. However, dimensional representations became more prevalent at later latencies in the auditory cortex, subcortical areas, and orbitofrontal cortex, suggesting a progressive refinement of emotional stimulus representations: from the formation of main emotional categories well suited to trigger fast adaptive reactions, to increasingly fine-grained representations modulated by valence and arousal. Overall, our results provide a comprehensive characterization of the spatiotemporal dynamics of perceived emotion processing by the brain and demonstrate how both categories and dimensions are interwoven into rich and complex representations, initially dominated by categories and then progressively refined into dimensions.
Acknowledgments
Supported by the UK’s Biotechnology and Biological Sciences Research Council (grants BB/M009742/1 to JG, BLG, SAK, and PB, and BB/L023288/1 to PB and JG), by the French Fondation pour la Recherche Médicale (grant AJE201214 to PB), by grants ANR-16-CONV-0002 (ILCB) and ANR-11-LABX-0036 (BLRI), and by the Excellence Initiative of Aix-Marseille University (A*MIDEX). Conceptualization: BLG, PB; Methodology: BLG, CW, NK, SAK, PB, JG; Software: BLG; Validation: BLG; Formal Analysis: BLG, CW, JG; Investigation: BLG, CW; Resources: BLG, PB; Data Curation: BLG, CW; Writing - Original Draft: BLG, CW, SAK, PB, JG; Writing – Review & Editing: BLG, CW, NK, SAK, PB, JG; Visualization: BLG; Supervision: BLG, PB, JG; Project Administration: JG; Funding Acquisition: BLG, SAK, PB, JG. We thank Dr. Olivier Coulon and Dr. Oliver Garrod for help with the development of the 3D glass brain.
Footnotes
↵† Joint senior authors