Abstract
Traditionally, syntactic operations are thought of as the core computational machinery that sets human language apart from other animal communication systems. Here, we tested an alternative hypothesis: the primary driver of the response in the language-selective regions of the brain is semantic composition. Using formal machinery from information theory, we estimated the likelihood of semantic composition via mutual information among words in a local linguistic context. Across two fMRI experiments, we examined the processing of veridical sentences as well as syntactically degraded sentences, including sentences where the local context does not support semantic composition. Consistent with behavioral/computational modeling work, syntactic degradedness did not lead to lower responses in the fronto-temporal language-selective network, except when mutual information among words was low. These results challenge the primacy of syntax in the human language architecture, instead supporting the idea that successful semantic composition is what drives the language network in the brain.
A left-lateralized network of frontal and temporal brain regions selectively supports language processing [e.g., 1]. These regions respond to both i) individual word meanings, and ii) combinatorial semantic/syntactic processing [e.g., 2, 3, 4]. The magnitude of neural response appears to scale with how language-like the input is, with the strongest response elicited by sentences, and progressively lower responses elicited by phrases, lists of unconnected words, pseudowords, and foreign or indecipherable speech [5, 6, 7]. But what exact features of the linguistic stimulus drive the language network’s response? In particular, sentences—its preferred stimulus—both a) express complex meanings, and b) obey syntactic rules. Is one of these properties dominant?
Many prior proposals have focused on some aspects of syntax [e.g., 6, 8, 9], sometimes with respect to a particular region within the language network, like “Broca’s area” [10, 11, 12]. Yet language is fundamentally a vehicle for conveying complex meanings [e.g., 13, 14], and recent work suggests that our sentence interpretation mechanisms are well-designed for coping with syntactic errors (“noise”) as long as a plausible meaning is recoverable [e.g., 15, 16, 17, 18]. To evaluate the robustness of the language system to syntactic noise, and inspired by work on lexical processing where word recognition has been shown to be quite resilient to changes in the letter order [19, 20], we used a novel manipulation: naturalistic 12-word-long sentences were gradually degraded by increasing the number of local word swaps (Figure 1), leading to progressively less syntactically well-formed strings (see SI for behavioral evidence of sensitivity to the scrambling manipulation). In an fMRI experiment, 16 participants read these materials (presented using rapid serial visual presentation, one word at a time), along with two control conditions consisting of lists of unconnected words and lists of nonwords. Neural responses were examined in language-responsive regions defined functionally in each individual using a separate localizer task [1], illustrated in Figure 2a-c.
We further hypothesized that the maximal response in the language network would be elicited as long as there is sufficiently high mutual information [21] among the words within a span of a few words, which allows for the construction of complex meanings. At the core of this hypothesis is the idea that the likelihood of an attempt to construct a complex meaning, and probably the ease/success of doing so, is determined by our prior linguistic experience, which, in turn, reflects the distributional properties of objects and events in the world [e.g., 22, 23, 24]. In particular, if we have previously frequently encountered words w1 and w2 (or words semantically similar to them) in a semantic dependency, we would be likely to try to combine them again if we encounter them together. For example, if we have frequently encountered words eat and apple, or eat and restaurant in a semantic dependency, we will try to combine them if we encounter them again in proximity to one another, even if the syntactic structure within which they occur does not support their semantic composition [15, 17]. If, on the other hand, we have not previously encountered some words together (say eat and accordion), we may treat them as standalone entities and not attempt to build a complex meaning out of them. The locality constraint is intuitive and presumably arises from our limited working memory capacity [25] and/or our implicit knowledge that words that form semantic dependencies tend to be near each other linearly [26].
To quantify prior linguistic experience, we use pointwise mutual information (PMI), a metric from information theory [27, 28], which measures the mutual dependence between variables (in this case, words). Positive PMI values suggest a semantic dependence between words based on their overlap in contexts of use. Negative and near-zero PMI values suggest the absence of a semantic dependence. Importantly, PMI reflects semantic association, while down-weighting the contribution of high-frequency closed-class words (e.g., the, it, and up). We estimate PMI using a large corpus of English [the Google N-gram corpus; 29]. For each 12-word string, we averaged the positive PMI for all word pairs occurring within a four-word sliding window (see Materials and Methods). We focus on the positive PMI values, because negative values suggest there is no semantic dependency worth building.
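As a concrete sketch, the windowed pair extraction and positive-PMI averaging described above can be written as follows. The `pmi` argument is a toy stand-in for corpus-estimated values (the actual estimates came from the Google N-gram corpus), and the function names are illustrative rather than taken from the published code:

```python
def window_pairs(words, window=4):
    """Index pairs of words that co-occur within any sliding window of
    size `window`; for a 12-word string and window=4, this is equivalent
    to collecting the bigrams, 1-skip-grams, and 2-skip-grams."""
    n = len(words)
    pairs = set()
    for s in range(max(n - window + 1, 1)):
        stop = min(s + window, n)
        for i in range(s, stop):
            for j in range(i + 1, stop):
                pairs.add((i, j))
    return sorted(pairs)

def avg_positive_pmi(words, pmi, window=4):
    """Average the positive PMI values over all local word pairs;
    negative values are discarded, as in the text."""
    vals = [pmi(words[i], words[j]) for i, j in window_pairs(words, window)]
    positive = [v for v in vals if v > 0]
    return sum(positive) / len(positive) if positive else 0.0
```

For a 12-word string, this yields 30 local pairs (11 bigrams, 10 one-skip pairs, and 9 two-skip pairs), whose positive PMI values are averaged into a single score per string.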
To directly evaluate the local mutual information hypothesis, in Experiment 2 (n=32 participants), in addition to some of the same degraded conditions from Experiment 1, we critically included a manipulation where words are scrambled within a sentence in a way that minimizes local PMI and thus, plausibly, the likelihood of complex meaning construction. In particular, we maximized the distance between any two originally proximal content words (Figure 1; see Materials and Methods). The manipulation was effective, leading to a significant drop in local mutual information (Figure 2e). According to the local mutual information hypothesis, the neural response should be substantially lower for this condition relative to the other degraded conditions.
Results
In Experiment 1, replicating much prior work [30, 1, 6], well-formed sentences elicited significantly stronger responses than the word-list and nonword-list conditions (Figure 2c, Table 1). Strikingly, however, degrading the sentences by introducing word swaps did not decrease the magnitude of the language network’s response: even stimuli with seven word swaps (e.g., their last on they overwhelmed were day farewell by messages and gifts; Figure 1) elicited as strong a response as fully grammatical sentences (e.g., on their last day they were overwhelmed by farewell messages and gifts; Figure 2c, Table 1). This pattern of similarly strong neural responses for the well-formed and degraded sentences is in line with the local mutual information hypothesis: mutual information remains high across the sentence conditions (Figure 2e), which plausibly leads to complex meaning construction in spite of syntactic noise.
In Experiment 2, we replicated the pattern observed in Experiment 1 for the intact sentences and sentences with one, three, or five local word swaps, all of which elicited similarly strong responses, all reliably higher than the control word-list condition (Table 1). Critically, in line with the local mutual information hypothesis, the LowPMI condition elicited a neural response that was as low as that elicited by the list of unconnected words (Figure 2b, Table 1).
In spite of eliciting as strong a neural response as veridical sentences, the scrambled sentences are rated as less acceptable behaviorally (SI), suggesting there has to be a cost to the processing of this kind of degraded linguistic input. Indeed, some brain regions outside the fronto-temporal language network—specifically, within the domain-general fronto-parietal multiple demand network [31, 32]—were sensitive to the scrambling manipulation, with stronger responses to more degraded stimuli (SI).
Discussion
Across two fMRI experiments, we found that sufficiently high local mutual information appears to be necessary and sufficient for eliciting the maximal response in the language system, where maximal is defined as the response to the preferred stimulus—well-formed and meaningful sentences. In particular, syntactic degradation—achieved via varying numbers of local word swaps—did not lead to lower neural responses in the frontal or temporal language brain regions, in spite of the fact that such degraded sentences were judged to be less acceptable (SI). Only when there was no evidence in the input for building semantic connections among nearby words did the neural response in the language regions drop to the level of unconnected word lists.
These results suggest that the language system works hard whenever incoming words can be combined into complex meanings [see also 33], in line with current theorizing and evidence from the sentence processing literature whereby our comprehension mechanisms appear to be well-designed for coping with signal corruption [15, 16, 17] and where meaning can drive interpretation even when the syntactic form does not support it [e.g., 34, 35, 36]. However, this result is surprising in light of the emphasis that has traditionally been placed on syntax as the core computational capacity of language [e.g., 37, 38, 39, 10, 12].
Many important questions about semantic composition remain: Is our likelihood of combining any two words affected by our experience with specific lexical items, or would we be as likely to combine eat and sapodilla even if we had only just learned that sapodilla is a type of fruit? What is the span over which high mutual information is detected and affects composition? How is composition influenced by the preceding linguistic and non-linguistic context? For example, if you encounter a string of unrelated words followed by a pair of related words, is the latter sufficient to trigger your semantic composition mechanisms? And is your language system willing to work harder in trying to semantically compose in the presence of top-down knowledge about communicative intent? In spite of all these open questions, a formally specified hypothesis laid out here brings us one step closer to a mechanistic-level account of the computations that the language network plausibly supports.
Methods
Participants
Forty-seven individuals (age 18-48, average age 22.8; 31 females) participated for payment (Experiment 1: n = 16; Experiment 2: n = 32; one individual overlapped between the two experiments). (Note that we included twice as many participants in Experiment 2 to ensure that the lack of between-condition differences was not due to insufficient power.) Forty-one were right-handed, as determined by the Edinburgh handedness inventory [40], or by self-report; the remaining six left-handed/ambidextrous individuals showed typical left-lateralized language activations in the language localizer task (see 41 for arguments to include left-handers in cognitive neuroscience experiments). All participants were native speakers of English from the Boston community. Four additional participants were scanned (for Experiment 2) but excluded from the analyses due to excessive head motion, sleepiness, or failure to perform the behavioral task. All participants gave written informed consent in accordance with the requirements of MIT’s Committee on the Use of Humans as Experimental Subjects (COUHES).
Experimental Design and Materials
Each participant completed a) a version of the language localizer task [1, Figure 2a], which was used to identify language-responsive areas at the individual subject level, and b) the critical sentence comprehension task (28 participants completed the localizer task in the same session as the critical task, the remaining 19 performed the localizer in an earlier session; see 42 for evidence of the stability of the localizer responses across sessions). Some participants further completed one or two additional tasks for unrelated studies. The entire scanning session lasted approximately 2 hours.
Language localizer
Participants passively read sentences and lists of pronounceable nonwords in a blocked design. The Sentences > Nonwords contrast targets brain regions sensitive to high-level linguistic processing [1]. We have previously established the robustness of this contrast to materials, modality of presentation, language, and task [1, 43, 7]. Each trial started with 100 ms pre-trial fixation, followed by a 12-word-long sentence or a list of 12 nonwords presented on the screen one word/nonword at a time at the rate of 450 ms per word/nonword. Then, a line drawing of a hand pressing a button appeared for 400 ms, and participants were instructed to press a button whenever they saw this icon, and finally a blank screen was shown for 100 ms, for a total trial duration of 6 s. The simple button-pressing task was included to help participants stay awake and focused. Each block consisted of 3 trials and lasted 18 s. Each run consisted of 16 experimental blocks (8 per condition), and five fixation blocks (14 s each), for a total duration of 358 s (5 min 58 s). Each participant performed two runs. Condition order was counterbalanced across runs.
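The trial, block, and run durations reported above can be verified with simple arithmetic:

```python
# Verify the localizer timing reported in the text.
trial_ms = 100 + 12 * 450 + 400 + 100  # fixation + 12 words + button icon + blank
block_s = 3 * trial_ms / 1000          # 3 trials per block
run_s = 16 * block_s + 5 * 14          # 16 experimental blocks + 5 fixation blocks
```

This recovers the stated values: 6 s per trial, 18 s per block, and 358 s (5 min 58 s) per run.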
Critical task in Experiment 1
Design and materials
Participants read sentences with correct word order (Intact (Int)) and sentences with progressively more scrambled word orders created by an increasing number of local word swaps (Scrambled (Scr) 1, 3, 5, and 7; Figure 1), as well as two control conditions: lists of unconnected words and nonword lists. At the end of each trial, participants were presented with a word (in the sentence and word-list conditions) or a nonword (in the nonword-list condition) and asked to decide whether this word/nonword appeared in the preceding trial.
To create the sentence materials, we extracted 150 12-word-long sentences from the British National corpus [BNC; 44]. We then permuted the word order in each sentence via local swaps, to create the scrambled conditions. In particular, a word was chosen at random and switched with one of its immediate neighbors. This process was repeated a specified number of times. Because one random swap can directly undo a previous swap, we ensured that the manipulation was successful by calculating the edit distance. (The code used to create the scrambled conditions is available at OSF: https://osf.io/y28fz/) We chose versions with 1, 3, 5, and 7 swaps in order to i) limit the number of sentence conditions to five, while, at the same time, ii) covering a range of degradedness levels. The materials thus consisted of 150 sentences with five versions each (Int, Scr1, Scr3, Scr5, and Scr7), for a total of 750 strings. These were distributed across five experimental lists following a Latin Square design, so that each list contained only one version of a sentence and 30 trials of each of the five conditions. Any given participant saw the materials from just one experimental list, and each list was seen by 2-4 participants. The word-list condition consisted of sequences of 12 real words (173 unique words: 55.5% nouns, 15.6% verbs, 22.5% adjectives, and 6.4% adverbs; average length 7.19 phonemes, standard deviation 1.43; [45]; average log frequency 1.73, standard deviation 0.80 [46]), and the nonword-list condition consisted of sequences of 12 nonwords (there were actually four different nonword-list conditions—a manipulation not of interest to the current study; we averaged the responses across the four nonword-list conditions in the analyses). The word-list and nonword-list materials were the same across participants. (All the materials are available at OSF: https://osf.io/y28fz/)
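A minimal sketch of the swap-and-check procedure follows, assuming that one local swap exchanges a randomly chosen pair of adjacent words; the published scripts on OSF are the authoritative implementation, and the function names here are illustrative:

```python
import random

def local_swap_scramble(words, n_swaps, rng=None):
    """Scramble a token sequence via repeated local swaps: each swap
    exchanges a randomly chosen pair of adjacent words. As noted in the
    text, a later swap can undo an earlier one, so the resulting edit
    distance from the original should be checked."""
    rng = rng or random.Random()
    out = list(words)
    for _ in range(n_swaps):
        i = rng.randrange(len(out) - 1)
        out[i], out[i + 1] = out[i + 1], out[i]
    return out

def word_edit_distance(a, b):
    """Word-level Levenshtein distance between two token sequences,
    used to confirm that the scrambling manipulation was effective."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[m][n]
```

A scrambled version would be accepted only if its word-level edit distance from the intact sentence reflects the intended number of effective swaps.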
Procedure
Participants read sentences, scrambled sentences, word lists, and nonword lists in an event-related fMRI design. Each trial lasted 8 s and consisted of the presentation of the stimulus (a sequence of 12 words/nonwords presented one at a time with no punctuation in the center of the screen, for 500 ms each, in black font using capital letters on a white background), followed by a blank screen for 300 ms, followed by a memory probe presented in blue font for 1,200 ms, followed again by a blank screen for 500 ms. The memory probe came from the preceding stimulus on half of the trials. For the sentences, the probes were uniformly distributed across the beginning (first four words), middle (middle four words), or end (last four words) of the sentence; for the word and nonword lists, the probes were uniformly distributed across the 12 positions. Incorrect probes were the shuffled correct probes from other sequences in the same condition (see SI for behavioral performance).
The trials in each experimental list (300 total; 30 trials per condition, where the conditions included the intact sentence condition, four scrambled sentence conditions, the word-list condition, and four nonword-list conditions) were divided into six subsets corresponding to six runs. Each run lasted 480 s (8 min) and consisted of 8 s * 50 trials (5 per condition) and 80 s of fixation. The optseq2 algorithm [47] was used to create condition orderings and to distribute fixation among the trials so as to optimize our ability to deconvolve neural responses to the different conditions. Condition order varied across runs and participants. Most participants (n = 13) performed 5 runs; the remaining participants performed 4 or 3 runs.
Critical task in Experiment 2
Design and materials
As in Experiment 1, participants read sentences with correct word order (Int) and sentences with progressively more scrambled word orders (Scr 1, 3, and 5). The latter three conditions were included in order to assess the robustness and replicability of the results in Experiment 1, in line with recent efforts to increase the reproducibility of findings in cognitive neuroscience [e.g., 48]. (The materials for these scrambled conditions were identical to those in Experiment 1 for half of the participants, and different permutations of the same intact stimuli for the other half.)
The condition with 7 word swaps (Scr7) was replaced by a condition where each pair of nearby content words was separated as much as possible within the 12-word string, so as to minimize local mutual information (Figure 1). We focused on separating nearby content words because those carry the most information in the signal [21]. Take, for example, one of our intact sentences: Larger firms and international companies tended to offer the biggest pay rises. First, the content words were given a fixed order that maximized the sum of the distances between adjacent content words (two content words are considered adjacent in the original string if they have no content words between them): e.g., larger international tended biggest rises firms companies offer pay. This process was repeated for the function words (e.g., and the to). Then, the ordered function words were embedded in the center of the ordered content words (i.e., larger international tended biggest rises and the to firms companies offer pay), which maximizes the distances between adjacent content words in the original sentence. (The code is available at OSF: https://osf.io/y28fz/)
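This construction can be sketched by brute force, which is feasible here because each string contains at most a handful of content words. The sketch assumes unique tokens within a string, and all names are illustrative; the published OSF code may use a more efficient method:

```python
from itertools import permutations

def lowpmi_order(words, is_content):
    """Reorder a string to minimize local PMI: find the ordering of
    content words that maximizes the summed distances between content
    words that were adjacent in the original string, then embed the
    function words in the middle of the ordered content words."""
    content = [w for w in words if is_content(w)]
    function = [w for w in words if not is_content(w)]
    mid = len(content) // 2 + len(content) % 2  # function block goes mid-string

    def arrangement(perm):
        return list(perm[:mid]) + function + list(perm[mid:])

    def score(perm):
        # Assumes unique tokens (as in this illustration).
        pos = {w: k for k, w in enumerate(arrangement(perm))}
        # Summed distances between content words adjacent in the original
        # (adjacent = no content words between them).
        return sum(abs(pos[a] - pos[b]) for a, b in zip(content, content[1:]))

    best = max(permutations(content), key=score)
    return arrangement(best)
```

Applied to the example sentence above, this splits the optimally ordered content words around the centrally embedded function words, as in larger international tended biggest rises and the to firms companies offer pay.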
In addition to the five sentence conditions, we included five word-list conditions that were matched word-for-word to the sentence conditions. In particular, each of the 876 unique words in the sentence conditions was replaced by a different word of the same syntactic category (using the following set: nouns, verbs, adjectives, adverbs, and function words), similar in length (±0.06 phonemes, on average; [45]) and frequency (±0.31 log frequency, on average; [46]). We included the same number of word-list conditions as sentence conditions to match the distribution of sentence and word-/nonword-list conditions in Experiment 1. However, in the analyses, we averaged the responses across the five word-list conditions given that there is no reason to expect a difference among them.
The materials were distributed across five experimental lists; any given participant saw the materials from just one list, and each list was seen by 5-7 participants.
As in Experiment 1, at the end of each trial, participants were presented with a word and asked to decide whether this word appeared in the preceding trial (see SI for behavioral performance).
Procedure
The procedure was identical to that in Experiment 1 except that the memory probe was uniformly distributed across the 12 positions in every condition. Most participants performed 5 or 6 runs (n=30); the remaining participants performed 4 or 3 runs.
fMRI data acquisition
Structural and functional data were collected on a whole-body 3 Tesla Siemens Trio scanner with a 32-channel head coil at the Athinoula A. Martinos Imaging Center at the McGovern Institute for Brain Research at MIT. T1-weighted structural images were collected in 179 sagittal slices with 1 mm isotropic voxels (TR = 2,530 ms, TE = 3.48 ms). Functional, blood oxygenation level dependent (BOLD) data were acquired using an EPI sequence (with a 90° flip angle and using GRAPPA with an acceleration factor of 2), with the following acquisition parameters: thirty-one 4 mm thick near-axial slices, acquired in an interleaved order with a 10% distance factor; 2.1 mm × 2.1 mm in-plane resolution; field of view of 200 mm in the phase encoding anterior to posterior (A > P) direction; matrix size of 96 × 96; TR of 2,000 ms; and TE of 30 ms. Prospective acquisition correction [49] was used to adjust the positions of the gradients based on the participant’s motion one TR back. The first 10 s of each run were excluded to allow for steady-state magnetization.
fMRI data preprocessing and first-level analysis
First-level analyses were conducted in SPM5 (note that first-level analyses have not changed much in later versions of SPM; we use an older version of the software here due to the use of these data in other projects spanning many years and hundreds of subjects); critical second-level analyses were performed using custom MATLAB and R scripts. Each subject’s data were motion corrected and then normalized into a common brain space (the Montreal Neurological Institute (MNI) template) and resampled into 2 mm isotropic voxels. The data were then smoothed with a 4 mm Gaussian filter and high-pass filtered (at 200 s). The task effects in both the language localizer task and the critical experiment were estimated using a General Linear Model (GLM) in which each experimental condition was modeled with a boxcar function (corresponding to a block or event) convolved with the canonical hemodynamic response function (HRF).
Language fROI definition and response estimation
For each participant, functional regions of interest (fROIs) were defined using the Group-constrained Subject-Specific (GSS) approach [1], whereby a set of parcels or “search spaces” (i.e., brain areas within which most individuals in prior studies showed activity for the localizer contrast) is combined with each individual participant’s activation map for the same contrast. To define the language fROIs, we used six parcels (Figure 2b) derived from a group-level representation of data for the Sentences>Nonwords contrast in 220 participants. These parcels included three regions in the left frontal cortex: two located in the inferior frontal gyrus (LIFG and LIFGorb), and one located in the middle frontal gyrus (LMFG); and three regions in the left temporal and parietal cortices spanning the entire extent of the lateral temporal lobe and extending into the angular gyrus (LAntTemp, LPostTemp, and LAngG). (These parcels were similar to the parcels reported originally in [1], except that the two anterior temporal regions were grouped together, and the two posterior temporal regions were grouped together.) Following much prior work in our group, individual fROIs were defined by selecting—within each parcel—the top 10% of most localizer-responsive voxels based on the t values for the Sentences>Nonwords contrast. Responses (in percent BOLD signal change units) to the relevant critical experiment’s conditions, relative to the fixation baseline, were then extracted from these fROIs. So, the input to the critical statistical analyses consisted of—for each participant—a value (percent BOLD signal change) for each of 10 conditions in each of the six language fROIs. Further, for Experiment 1, responses were averaged across the four nonword-list conditions, leaving a total of seven conditions; and for Experiment 2, responses were averaged across the five word-list conditions, leaving a total of six conditions.
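The voxel-selection and response-extraction steps can be sketched as follows. This is a simplified illustration on hypothetical array inputs, not the actual SPM/MATLAB pipeline:

```python
import numpy as np

def define_froi(t_map, parcel_mask, top_fraction=0.10):
    """Within a parcel, select the top `top_fraction` of voxels by
    localizer t-value (Sentences > Nonwords), returning a boolean
    fROI mask of the same shape."""
    froi = np.zeros(parcel_mask.shape, dtype=bool)
    idx = np.flatnonzero(parcel_mask)
    n_keep = max(1, int(round(len(idx) * top_fraction)))
    top = idx[np.argsort(t_map.flat[idx])[-n_keep:]]  # highest-t voxels
    froi.flat[top] = True
    return froi

def froi_response(psc_map, froi):
    """Mean percent-signal-change within the fROI for one condition,
    relative to whatever baseline psc_map was computed against."""
    return float(psc_map[froi].mean())
```

One such mean response per condition, per fROI, per participant is then the input to the mixed-effects analyses.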
In all the critical analyses, we consider the language network as a whole (treating regions as random effects; see below), given the abundant evidence that the regions of this network form an anatomically [e.g., 50] and functionally integrated system, as evidenced by strong inter-regional correlations during rest and language comprehension [e.g., 51]; but see Figure S1 and Table S3 for the six language fROIs’ profiles and associated statistics.
Computing mutual information values
We used a sliding four-word window to extract local word pairs from each 12-word string. This is equivalent to collecting the bigrams, 1-skip-grams, and 2-skip-grams from each string. For each word pair (w1, w2), we calculated PMI as follows:

PMI(w1, w2) = log [ p(w1, w2) / (p(w1) p(w2)) ]
Probabilities were estimated using the Google N-gram corpus [29] and ZS [52] with Laplace smoothing (α = 0.1). For each 12-word string, we averaged across the positive PMI values. (The code for computing PMI is available at OSF: https://osf.io/y28fz/)
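A minimal sketch of the smoothed probability estimation follows, with toy counts standing in for the Google N-gram corpus and the ZS-backed lookups; the function name and count representation are illustrative:

```python
import math

def smoothed_pmi(w1, w2, unigrams, pair_counts, vocab_size,
                 n_pairs_total, alpha=0.1):
    """PMI(w1, w2) = log p(w1, w2) / (p(w1) p(w2)), with Laplace
    (add-alpha) smoothing so that unseen pairs get a finite (negative)
    estimate rather than log(0)."""
    n_uni = sum(unigrams.values())
    p1 = (unigrams.get(w1, 0) + alpha) / (n_uni + alpha * vocab_size)
    p2 = (unigrams.get(w2, 0) + alpha) / (n_uni + alpha * vocab_size)
    p12 = (pair_counts.get((w1, w2), 0) + alpha) / (
        n_pairs_total + alpha * vocab_size ** 2)
    return math.log(p12 / (p1 * p2))
```

With such smoothing, frequently co-occurring pairs (eat, apple) yield positive PMI, while unattested pairs (eat, accordion) yield negative PMI and are excluded from the positive-PMI average.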
Statistical tests
To compare the average change in BOLD response across conditions, we conducted a mixed-effects linear regression model with maximal random effect structure, predicting the level of BOLD response with a fixed effect and random slopes for Condition, and random effects for Region of Interest and Participant. Condition was dummy-coded with Intact sentences as the reference level. Models were fit separately for Experiment 1 and Experiment 2 using the brms package [53] in R [54] to interface with Stan [55]. Results are presented in Table 1. (Data and analysis code are available at OSF: https://osf.io/y28fz/)
Competing Interests
We declare no competing interests.
Acknowledgements
We thank i) Zuzanna Balewski for help with creating the experimental script for Experiment 1, ii) EvLab members for help with scanning and helpful discussions, and iii) Nancy Kanwisher, Ted Gibson, Adele Goldberg, Leon Bergen, Josh Tenenbaum, and the audience at the CUNY2017 Sentence Processing conference for comments on this line of work. The authors would also like to acknowledge the Athinoula A. Martinos Imaging Center at the McGovern Institute for Brain Research at MIT, and the support team (Steven Shannon, Atsushi Takahashi, and Sheeba Arnold). E.F. was supported by NIH awards R00-HD-057522 and R01-DC-016607.
Footnotes
1. Switching from a categorical (Condition) to continuous coding (Edit Distance) permits testing non-monotonicity and increases the sensitivity to detect an effect. To ensure that our analysis of the language network is robust to this switch, we conducted the same analyses looking for either a linear (i.e., first-order) or non-linear (i.e., second-order) effect of Edit Distance. Consistent with our categorical analysis, there was no effect of Edit Distance on the change in response in the language regions.
References