Abstract
Spatial selective attention enables listeners to process a signal of interest in natural settings. However, most past studies on auditory spatial attention used impoverished spatial cues: presenting competing sounds to different ears, using only interaural differences in time (ITDs) and/or intensity (IIDs), or using non-individualized head-related transfer functions (HRTFs). Here we tested the hypothesis that impoverished spatial cues impair spatial auditory attention by only weakly engaging relevant cortical networks. Eighteen normal-hearing listeners reported the content of one of two competing syllable streams simulated at roughly +30° and −30° azimuth. The competing streams consisted of syllables from two different-sex talkers. Spatialization was based on natural spatial cues (individualized HRTFs), individualized IIDs, or generic ITDs. We measured behavioral performance as well as electroencephalographic markers of selective attention. Behaviorally, subjects recalled target streams most accurately with natural cues. Neurally, spatial attention significantly modulated early evoked sensory response magnitudes only for natural cues, not in conditions using only ITDs or IIDs. Consistent with this, parietal oscillatory power in the alpha band (8 to 14 Hz; associated with filtering out distracting events from unattended directions) showed significantly less attentional modulation with isolated spatial cues than with natural cues. Our findings support the hypothesis that spatial selective attention networks are only partially engaged by impoverished spatial auditory cues. These results not only suggest that studies using unnatural spatial cues underestimate the neural effects of spatial auditory attention, but also illustrate the importance of preserving natural spatial cues in assistive listening devices to support robust attentional control.
1 Introduction
Spatial hearing is crucial to selectively attend to sounds of interest in everyday social settings. The remarkable ability of normal-hearing listeners to focus on a sound source within a complex acoustic scene is often referred to as “the cocktail party phenomenon,” and has a rich history (Cherry, 1953). Nevertheless, the mechanisms controlling spatial selective attention are still poorly understood. Acoustically, in everyday situations, the two ears provide the listener with a listener-specific combination of spatial cues that include interaural time and intensity differences (ITDs and IIDs, respectively), as well as spectral cues caused by acoustical filtering of the pinnae (Blauert, 1997a). Together, these cues, captured by individualized head-related transfer functions (HRTFs), allow the brain to create a clear, punctate internal representation of the location of sound sources in the environment (Majdak et al., 2018; Middlebrooks, 2015).
When only isolated or impoverished spatial cues are present, auditory localization performance degrades and the natural perception of external auditory objects may even collapse into the listener’s head (Baumgartner et al., 2017; Callan et al., 2013; Cubick et al., 2018; Hartmann and Wittenberg, 1996). Nevertheless, degraded or isolated ITDs and IIDs still create a strong sense of lateralization within the head; moreover, even highly impoverished spatial cues can be used to achieve spatial release from speech-on-speech masking, behaviorally (Cubick et al., 2018; Culling et al., 2004; Ellinger et al., 2017; Glyde et al., 2013; Kidd et al., 2010; Loiselle et al., 2016). The relative importance of ITDs and IIDs in spatial release from masking remains unclear, with past studies reporting conflicting results when directly comparing different binaural conditions (Ellinger et al., 2017; Glyde et al., 2013; Higgins et al., 2017; Shinn-Cunningham et al., 2005). More importantly, it is a puzzle as to why realistic and degraded spatial cues yield at best small behavioral differences in masking release even though spatial perception is clearly degraded when cues are impoverished (e.g., Cubick et al., 2018).
Previous electroencephalography (EEG) and magnetoencephalography (MEG) studies have demonstrated that rich spatial cues in sound stimuli lead to different cortical activity compared to using isolated cues during sound localization (Callan et al., 2013; Leino et al., 2007; Palomäki et al., 2005) and auditory motion processing (Getzmann and Lewald, 2010). However, the apparently minor behavioral consequences of using unnatural, non-individualized spatial cues on spatial release from masking, combined with the ease of implementing studies with simple, non-individualized spatial cues, led to their wide usage in auditory neuroscience studies (Cusack et al., 2001; Dahmen et al., 2010; Dai et al., 2018; Itoh et al., 2000; Kong et al., 2014; Sach et al., 2000). Indeed, in the auditory neuroscience literature, many studies did not even present true binaural signals, but instead studied “spatial” attention by using dichotic signals, with one sound presented monaurally to one ear and a competing sound presented monaurally to the other ear (Ahveninen et al., 2011; Alho et al., 1999b; Das et al., 2016; Wöstmann et al., 2016). These studies implicitly assumed that because listeners were able to use impoverished spatial cues to listen to one sound from a particular (relative) direction, the cognitive networks responsible for controlling spatial attention must be engaged just as they are when listening to rich, natural spatial cues. Nonetheless, it is unclear whether and how engagement of higher-order cognitive processes such as deployment of selective attention is affected by the use of unnatural or impoverished spatial cues.
Modulation of neural signatures, such as event-related potentials (ERPs) and induced oscillatory activity, is often taken as evidence of effective attentional control (Herrmann and Knight, 2001; Siegel et al., 2012). In particular, auditory spatial attention is known to modulate early sensory ERPs in the N1 time range (processing latencies of 100 to 150 ms; see Choi et al., 2013; Röder et al., 1999), whereas modulation of P1 ERPs (50 to 100 ms) has only recently been demonstrated in a free field experiment (Giuliano et al., 2014). Induced alpha oscillation (8 to 14 Hz) has been hypothesized to function as an information gating mechanism (Klimesch et al., 2007). During auditory spatial attention, parietal alpha power often decreases in the contralateral hemisphere of attended stimuli and/or increases in the ipsilateral hemisphere (Banerjee et al., 2011; Lim et al., 2015; Wöstmann et al., 2016). These neural modulations constitute objective metrics of the efficacy of attentional control.
Here, we test listeners in a selective attention paradigm with simultaneous, spatially separated talkers. We use the aforementioned EEG measures to compare both perceptual ability and the neural signatures of attentional control for simulations with impoverished vs. natural spatial cues. Eighteen subjects performed an auditory spatial attention task with two competing streams located at roughly +30° and −30° azimuth (Figure 1). On every trial, an auditory cue directed listeners to attend to either the left or right stream and report the content of the cued stream. The competing streams consisted of syllables (/ba/, /da/ or /ga/) from two different-sex talkers. Sound stimuli (including the cuing sound) were spatialized using three different levels of naturalness and richness: 1) generic ITDs only, 2) individualized IIDs, or 3) individualized HRTFs containing all of the naturally occurring spatial cues a listener experiences in the everyday world. We show that behavioral performance is better when listeners hear natural, individualized spatial cues than when they hear impoverished cues. Importantly, only natural spatial cues yield significant attentional modulation of P1 amplitudes. Moreover, induced alpha activity is less robust and poorly lateralized with isolated spatial cues compared to rich, natural spatial cues.
2 Materials and Methods
2.1 Subjects
Twenty-one paid volunteers and one author, aged 18 to 42 years (M = 22.9, SD = 5.5; 12 females, 10 males), participated in this study. None of the subjects had audiometric thresholds greater than 20 dB for frequencies from 250 Hz to 8 kHz. All participants gave informed consent as approved by the Boston University Institutional Review Board. Two subjects were withdrawn from the study because they were unable to perform the task (fewer than 30% correct responses after training), and two subjects were removed during EEG data preprocessing due to excessive artifacts. Therefore, 18 subjects remained for further analysis (N = 18).
2.2 Stimuli and Procedure
The sound stimuli consisted of consonant-vowel syllables (/ba/, /da/, and /ga/), each 0.4 s in duration. These syllables were recorded from three talkers who naturally differed in fundamental frequency (F0). Details on the stimuli are provided in Stimulus Presentation. The cue and stimuli were presented via earphones (ER-2, Etymotic Research, Inc.) and spatialized to approximately ±30° azimuth (0° elevation). Three different spatialization conditions were used: HRTF, IID, and ITD. In the HRTF condition, individualized HRTFs, providing natural combinations of ITDs, IIDs, and spectral cues, were used (see Individual HRTF Measurement for measurement methods). In the IID condition, ITDs were removed from the individualized HRTFs by computing minimum-phase representations of the filters, obtained by removing the non-causal part of the cepstrum. Hence, the IID and HRTF conditions provided the same monaural magnitude spectra and thus the same energetic advantage of the ear ipsilateral to the target while differing in spatial perception. In the ITD condition, spatialization was based on simply delaying the signal presented to the contralateral ear by 300 μs, thus providing no energetic advantage to the ipsilateral ear. This spatialization method was tested due to its popularity in auditory neuroscience.
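The two impoverished spatialization conditions can be illustrated with a minimal sketch. The sampling rate, the integer-sample delay, the FFT length, and the function names `apply_itd` and `minimum_phase` are assumptions made for illustration; the text specifies only the 300 μs delay and the removal of the non-causal part of the cepstrum.

```python
import numpy as np

FS = 48_000  # assumed sampling rate in Hz; the text does not state one


def apply_itd(mono, itd_s=300e-6, fs=FS, target_side="left"):
    """Return (left, right) signals with the contralateral ear delayed.

    The delay is rounded to whole samples (an assumption for brevity).
    """
    delay = int(round(itd_s * fs))
    lead = np.concatenate([mono, np.zeros(delay)])  # ipsilateral ear
    lag = np.concatenate([np.zeros(delay), mono])   # contralateral ear
    return (lead, lag) if target_side == "left" else (lag, lead)


def minimum_phase(h, n_fft=None):
    """Minimum-phase version of impulse response h.

    Keeps the magnitude spectrum and discards the non-causal part of
    the real cepstrum by folding it onto the causal part.
    """
    n_fft = n_fft or 8 * len(h)  # zero-padding limits cepstral aliasing
    mag = np.abs(np.fft.fft(h, n_fft))
    cep = np.fft.ifft(np.log(np.maximum(mag, 1e-12))).real
    fold = np.zeros(n_fft)
    fold[0] = 1.0
    fold[1:n_fft // 2] = 2.0
    fold[n_fft // 2] = 1.0
    h_min = np.fft.ifft(np.exp(np.fft.fft(cep * fold))).real
    return h_min[:len(h)]
```

Because the minimum-phase filter preserves the magnitude spectrum at each ear, filtering with it leaves the level cues intact while discarding the interaural delay carried by the original phase.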
The auditory cue was a single syllable /ba/ spoken by a low-pitch male voice (F0 = 91 Hz, estimated with Praat; Boersma, 2001). The subsequent target and distractor streams each consisted of three syllables randomly chosen, with replacement, from the set of three syllables. The target stream was spoken by either a female (F0 = 189 Hz) or a high-pitch male talker (F0 = 125 Hz), and the distractor stream was spoken by the other of these two talkers. The first syllables of the target and distractor streams overlapped in time, while the latter two syllables were separated by 200 ms, onset to onset (Figure 1). To avoid engagement of temporal attention rather than spatial attention, the target stream was equally often the leading and the lagging stream across trials. In the leading stream, the onsets of all three syllables were separated by 400 ms; in the lagging stream, the onsets of the first and the second syllable were separated by 600 ms, whereas those of the second and the third syllable were separated by 400 ms. All sound stimuli were presented at a sound pressure level of approximately 75 dB.
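The onset structure described above can be checked with a few lines of arithmetic. The constant and function names are hypothetical; the inter-onset intervals are taken from the text.

```python
# Inter-onset intervals in ms between syllables 1-2 and 2-3 (from the text).
LEADING_GAPS_MS = (400, 400)
LAGGING_GAPS_MS = (600, 400)


def onsets_ms(gaps):
    """Cumulative syllable onsets in ms, with the first syllable at 0."""
    times = [0]
    for gap in gaps:
        times.append(times[-1] + gap)
    return times


leading = onsets_ms(LEADING_GAPS_MS)
lagging = onsets_ms(LAGGING_GAPS_MS)
# Onset asynchrony between the two streams for each syllable position.
asynchrony = [b - a for a, b in zip(leading, lagging)]
```

Under these intervals the first syllables start together and the second and third syllables of the lagging stream each start 200 ms after their counterparts in the leading stream.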
2.3 Task
Subjects performed a spatial attention task in a Posner paradigm (Figure 1) (Posner et al., 1980). Sounds were spatialized using one of the three spatialization conditions, which was fixed within a trial but pseudo-randomized across trials. Subjects were instructed to fixate their gaze on a dot at the center of the screen at the beginning of each trial. The fixation dot lasted for 1.2 s before an auditory cue was presented. The auditory cue came from either the left or the right, indicating where the target sound would come from. A target sound started 0.8 s later from the cued location. At the same time, a distractor sound started from the location opposite the target sound. Subjects were asked to report the syllable sequence of the target sound using a keyboard after the sounds finished and a response cue was shown. Feedback about whether or not they correctly reported the syllables was given at the end of every trial.
Each subject performed 450 randomized trials of this task, divided into 9 blocks of 50 trials each. In total, every subject performed 150 trials for each of the three sound spatialization conditions (75 trials attending left and 75 trials attending right). Prior to the test sessions, all participants completed a practice session to become familiar with the task. Participants with fewer than 30% correct responses after 3 blocks of training (50 trials per block) were excluded from the study.
2.4 EEG Acquisition and Preprocessing
Thirty-two-channel scalp EEG data were recorded (ActiveTwo system with ActiView acquisition software, BioSemi B.V.) in a soundproof booth (Eckel Industries, Inc.) while subjects performed the task. Two additional reference electrodes were placed on the earlobes. Horizontal eye movements were recorded by two electrooculography (EOG) electrodes placed on the outer canthi of each eye. Vertical eye movement was recorded by one EOG electrode placed below the right eye. Stimulus timing was controlled by MATLAB (MathWorks) with Psychtoolbox-3 (Brainard, 1997).
EEG preprocessing was conducted in MATLAB with the EEGLAB toolbox (Delorme and Makeig, 2004). EEG data were re-referenced to the average of the two reference channels. Bad channels were marked manually during recording and detected automatically based on the joint probability measures of EEGLAB. EEG signals were then down-sampled to 256 Hz and epochs containing responses to individual trials were extracted. Each epoch was baseline corrected against the 100 ms prior to the cue onset by removing the mean of the baseline period from the whole trial. ICA artifact rejection was performed with EEGLAB to remove components of eye movements, blinks, and muscle artifacts. The maximum number of independent components rejected for each subject was five. After ICA rejection, bad channels were removed and interpolated. Trials with a maximum absolute value over 80 μV were rejected (Delorme et al., 2007). Two subjects with excessive artifacts were removed from further EEG analysis because fewer than 50% of trials remained after thresholding. For the remaining 18 subjects, at least about two thirds of the trials (minimum was 48 out of 75 trials) remained for each condition after artifact rejection. Trial numbers were equalized within and across subjects by randomly selecting the minimum number of available trials (N = 48) for each condition across the whole recording session.
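The epoch-level steps (baseline correction, amplitude thresholding, and trial-count equalization) might look as follows. This is a sketch on a hypothetical array layout of trials by channels by samples, not the EEGLAB pipeline used in the study; the function names and the `cue_idx` argument are assumptions.

```python
import numpy as np

FS = 256  # Hz, sampling rate after down-sampling (from the text)


def baseline_correct(epochs, cue_idx, fs=FS, baseline_s=0.1):
    """Subtract the mean of the 100 ms before cue onset from each epoch."""
    n_base = int(round(baseline_s * fs))
    base = epochs[..., cue_idx - n_base:cue_idx].mean(axis=-1, keepdims=True)
    return epochs - base


def reject_by_amplitude(epochs, limit_uv=80.0):
    """Drop trials whose maximum absolute value exceeds limit_uv."""
    keep = np.abs(epochs).max(axis=(1, 2)) <= limit_uv
    return epochs[keep], keep


def equalize(epochs, n_keep, rng):
    """Randomly keep n_keep trials, preserving their original order."""
    idx = np.sort(rng.choice(len(epochs), size=n_keep, replace=False))
    return epochs[idx]
```

With 75 trials per condition and direction, `equalize(clean, 48, rng)` would reduce every cell of the design to the 48 trials reported as the minimum.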
2.5 Data Analysis
Behavioral performance was quantified by the percentage of correct responses for each of the three syllables in the target stream and each spatialization condition. Behavioral results were collapsed across the attend-left and attend-right trials. The percentages of correct responses were then normalized by logit transformation before parametric statistical testing was performed on the resulting data. ERP responses were evaluated for the second syllable of the target sound and distractor sound, respectively. We analyzed the second syllable only because 1) the first syllables of the target and distractor were aligned in time, so their ERPs were inseparable, and 2) the ERP amplitude in response to the third syllable was small and therefore more contaminated by noise. ERP components were then extracted from the time series data. The preprocessed data (for details, see EEG Preprocessing Procedures) were bandpass filtered from 0.5 to 20 Hz by a finite impulse response filter with Kaiser window design (β = 7.2, n = 1178). Data from four fronto-central channels (Cz, Fz, FC1, and FC2) were averaged to get the auditory ERP response. We picked these four channels a priori because auditory ERP responses in sensor space are largest in the fronto-central area of the scalp. To quantify the amplitudes of ERP components, the maximum value within the window of 50 to 100 ms after the second syllable onset was taken to be the P1 amplitude; the minimum value within the window of 100 to 180 ms after the second syllable onset was taken to be the N1 amplitude. The values extracted from the selected windows were calculated for each channel and plotted onto a 2D scalp map to generate topography plots. The values of the ERP components from the four selected channels were then averaged and compared across different spatialization conditions.
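The window-based peak picking can be sketched as below. The function names and the rounding of window edges to the nearest sample are assumptions; the windows themselves (50 to 100 ms for P1, 100 to 180 ms for N1) are those given in the text.

```python
import numpy as np

FS = 256  # Hz, sampling rate of the preprocessed EEG (from the text)


def peak_amplitudes(erp, onset_idx, fs=FS):
    """Return (P1, N1) from a 1-D ERP averaged over fronto-central channels.

    P1 is the maximum in 50-100 ms and N1 the minimum in 100-180 ms
    after syllable onset; window edges are rounded to whole samples.
    """
    def window(t0, t1):
        start = onset_idx + int(round(t0 * fs))
        stop = onset_idx + int(round(t1 * fs))
        return erp[start:stop + 1]

    return window(0.050, 0.100).max(), window(0.100, 0.180).min()


def attentional_modulation(attended, unattended):
    """Difference between attended and unattended component amplitudes."""
    return attended - unattended
```

Applied to the target-locked and distractor-locked ERPs of one subject, the two `peak_amplitudes` results would feed `attentional_modulation` once per component.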
To get the amplitude of alpha oscillation, the preprocessed EEG data were bandpass filtered to the alpha range (8 to 14 Hz) before a Hilbert transform was applied. The magnitude of the resulting data was taken as the extracted alpha power envelope. To get induced alpha power, the alpha power was calculated for single trials first and then averaged across trials (Snyder and Large, 2005). The time course of alpha power was baseline corrected against the 700 ms before the auditory cue onset. Global field power (GFP; Murray et al., 2008; Skrandies, 1990) is the spatial standard deviation across all scalp electrodes; it has been used to quantify the amount of alpha variation across the scalp (Lim et al., 2015). We calculated the time courses of alpha GFP by taking the standard deviation across all electrodes. To quantify the degree of alpha modulation based on direction of attention, we calculated the Attentional Modulation Index (AMI) of alpha power, defined as the alpha power difference between attend-left and attend-right trials divided by the overall alpha power (Wöstmann et al., 2016). The AMI of alpha was calculated for each time point, yielding the time course of AMI for each spatialization condition. We then averaged the alpha AMI of each spatialization condition over the 800 ms immediately before stimulus onset (−800 ms to 0 ms, re: onset). This is the period in which subjects have been cued to orient their spatial attention in preparation for the target sound, but before the speech streams begin. Scalp topographies of the preparatory alpha AMI were plotted for each condition. Hemispheric lateralization of alpha AMI was further compared across spatialization conditions and evaluated as the difference between the left hemisphere and the right hemisphere. Calculated in this way, the AMI is expected to be positive in left and negative in right parietal channels.
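The alpha measures defined above can be illustrated with a compact sketch. Two choices here are assumptions rather than the authors' method: the band-pass and Hilbert steps are combined into a single brick-wall FFT operation for brevity, and "overall alpha power" in the AMI is taken to be the sum of the attend-left and attend-right power. All function names are hypothetical.

```python
import numpy as np

FS = 256  # Hz, sampling rate of the preprocessed EEG (from the text)


def alpha_envelope(x, fs=FS, band=(8.0, 14.0)):
    """Envelope of x in `band` along the last axis.

    Keeps only positive-frequency bins inside the band and doubles them,
    which yields the analytic signal of the band-limited data.
    """
    freqs = np.fft.fftfreq(x.shape[-1], d=1.0 / fs)
    spec = np.fft.fft(x, axis=-1)
    keep = (freqs >= band[0]) & (freqs <= band[1])
    return np.abs(np.fft.ifft(np.where(keep, 2.0 * spec, 0.0), axis=-1))


def induced_alpha(trials):
    """Single-trial envelopes averaged over trials -> (channels, time)."""
    return alpha_envelope(trials).mean(axis=0)


def gfp(alpha):
    """Spatial standard deviation across channels at each time point."""
    return alpha.std(axis=0)


def ami(alpha_left, alpha_right):
    """(attend-left - attend-right) / (attend-left + attend-right)."""
    return (alpha_left - alpha_right) / (alpha_left + alpha_right)


def lateralization(ami_per_channel, left_idx, right_idx):
    """Mean AMI over left-hemisphere minus right-hemisphere channels."""
    return ami_per_channel[left_idx].mean() - ami_per_channel[right_idx].mean()
```

Averaging `ami(...)` over the samples of the preparatory window and passing the per-channel result to `lateralization` gives one hemispheric difference per subject and condition.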
To test whether means differed across conditions, we conducted repeated measures ANOVAs followed by post-hoc analyses for all significant main effects and interactions using Fisher’s least significant difference procedure. We separately tested whether condition means differed significantly from zero using Bonferroni-corrected t-tests (Padj). The Lilliefors test was performed prior to statistical testing to check normality of the data. Data were considered normally distributed at P > 0.05. Prior to statistical analysis of behavioral performance, the percentages of correctly reported syllables were logit transformed in order to obtain normally distributed data.
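Two of these steps reduce to short formulas: the logit transform of a proportion correct, logit(p) = ln(p / (1 − p)), and the Bonferroni adjustment, which multiplies each p-value by the number of tests and caps the result at 1. The sketch below uses hypothetical function names, and the clipping of proportions away from 0 and 1 is an assumption, since the text does not say how perfect or zero scores were handled.

```python
import math


def logit(p, eps=1e-6):
    """Log-odds of a proportion; p is clipped to (eps, 1 - eps)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

The cap explains why adjusted values are reported as exactly 1 for clearly non-significant tests across three conditions.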
3 Results
3.1 Natural spatial cues facilitate behavioral performance
Percentages of correctly recalling each syllable of the target stream differed across the three spatialization conditions (Figure 2; 1st syllable: F(2,34) = 25.25, P < 0.001; 2nd syllable: F(2,34) = 6.27, P = 0.005; 3rd syllable: F(2,34) = 5.60, P = 0.008). For the first syllable, where the target and distractor sounds overlapped in time, subjects were least accurate in the ITD condition compared to the IID condition (t(34) = 5.31, P < 0.001) and HRTF condition (t(34) = 6.74, P < 0.001). However, no statistically significant difference was observed between IID and HRTF conditions for that syllable (t(34) = 1.43, P = 0.16). For the second and the third syllable, where target and distractor streams occurred staggered in time, subjects performed better in the HRTF condition than in both the ITD condition (2nd syllable: t(34) = 3.27, P = 0.002; 3rd syllable: t(34) = 3.33, P = 0.002) and the IID condition (2nd syllable: t(34) = 2.81, P = 0.008; 3rd syllable: t(34) = 1.94, P = 0.06). There was no significant difference between the ITD and IID conditions for the two staggered syllables (2nd syllable: t(34) = 1.41, P = 0.17; 3rd syllable: t(34) = 1.39, P = 0.17).
3.2 Impoverished spatial cues affect attentional modulation of ERPs
Figure 3A shows the ERPs evoked by the onset of the second syllable of the attended target sound and the unattended distractor sound, aligning the onsets of the target and distractor syllables to 0 s to allow direct comparison. Stimulus onsets elicited a fronto-central positivity (P1) between 50 and 100 ms followed by a negativity (N1) between 100 and 180 ms (Figure 3A-B). The amplitudes of these two components were extracted and the difference between attended stimuli (target sound) and unattended stimuli (distractor sound) was calculated in order to quantify attentional modulation for both the P1 and N1 components (Figure 3C).
We tested whether P1 responses were significantly larger to attended stimuli than to unattended stimuli in each of the three conditions. Only the HRTF condition showed a significant P1 modulation (t(17) = 3.12, Padj = 0.017); no significant attentional modulation was found in either the ITD (t(17) = 0.50, Padj = 1) or IID conditions (t(17) = 0.06, Padj = 1). Across conditions we found a statistically significant main effect of spatial cue on P1 amplitude modulation (F(2,34) = 3.34, P = 0.047). Attentional modulation was significantly larger in the HRTF condition than in the ITD (t(34) = 2.38, P = 0.023) and IID conditions (t(34) = 2.07, P = 0.046); however, modulation did not differ significantly between the ITD and IID conditions (t(34) = 0.31, P = 0.76) (Figure 3C).
In all three spatialization conditions, the N1 amplitude was modulated significantly by spatial attention, that is, attended sounds evoked larger N1 amplitudes than unattended sounds (ITD: t(17) = 3.01, Padj = 0.024; IID: t(17) = 4.12, Padj = 0.002; HRTF: t(17) = 3.56, Padj = 0.007). Across the three spatialization conditions the magnitude of N1 modulation did not differ significantly (F(2,34) = 0.060, P = 0.94; Figure 3C).
3.3 Alpha oscillation power shows less attentional modulation with impoverished spatial cues
To investigate the effect of spatialization on attentional control, we analyzed the power in alpha oscillations during the attentional preparation period (−800 ms to 0 ms), a time period in which listeners knew where to orient spatial attention based on the preceding acoustic cue, but before the sound mixture of competing streams began. We averaged the power in alpha across all trials for each spatialization condition, regardless of where spatial attention was focused, to get a measure of the total engagement of alpha activity. We then compared relative power for different attentional directions. On average across directions of attentional focus, we calculated the time courses of alpha global field power (GFP, Figure 4A) and compared within-subject differences of the temporal average within the preparatory time period across spatialization conditions (Figure 4B). Alpha GFP was not significantly modulated in either the ITD or IID conditions (ITD: t(17) = 0.44, Padj = 1; IID: t(17) = 0.43, Padj = 1), while in the HRTF condition, the GFP tended to be greater than zero (HRTF: t(17) = 2.56, Padj = 0.061). In a direct comparison, spatialization conditions differed significantly in alpha GFP (F(2,34) = 5.26, P = 0.010). In particular, alpha GFP in the HRTF condition was significantly larger than in the other two conditions (HRTF vs ITD: t(34) = 2.80, P = 0.008; HRTF vs IID: t(34) = 2.82, P = 0.008). No significant difference was found between the ITD and IID conditions (t(34) = 0.019, P = 0.99).
We next assessed the lateralization of alpha power with the spatial focus of attention by comparing AMI differences across hemispheres (Figure 5). In general, the scalp topographies of AMIs show the expected hemispheric differences. However, statistically significant hemispheric differences were found only in the HRTF condition (t(17) = 3.09, Padj = 0.020), not in the ITD (t(17) = 1.29, Padj = 0.64) and the IID condition (t(17) = 0.15, Padj = 1). A direct comparison of these hemisphere differences across conditions revealed a trend in which the HRTF condition had larger differences in AMI across hemispheres (F(2,34) = 2.98, P = 0.064).
In summary, impoverished spatial cues lead to worse behavioral performance, smaller P1 modulation, reduced modulation of preparatory alpha GFP, and reduced lateralization of alpha power with attentional focus, confirming our hypothesis that impoverished spatial cues impair engagement of spatial attention.
3.4 Relationships between Attentional Modulation Metrics
Given all these consistent effects of modulation metrics, we explored, post hoc, whether there were ordered relationships in the individual measures of performance and neural signatures of attentional control, including P1 modulation, preparatory alpha GFP, and alpha power lateralization. To investigate the relationship between evoked response modulation and alpha oscillatory activities, we first calculated the regression slope relating P1 amplitude to preparatory alpha GFP for each subject, and then performed a paired t-test on the coefficients obtained. No consistent relationship between alpha GFP and P1 amplitudes was observed (t(17) = 0.90, P = 0.38). Correlation analysis was also conducted comparing behavioral accuracy to P1 modulation, defined as the attended P1 amplitude minus unattended P1 amplitude. No consistent relationships between P1 modulation and behavioral performance were observed for any syllable (1st syllable: t(17) = 0.54, P = 0.59; 2nd syllable: t(17) = 0.31, P = 0.76; 3rd syllable: t(17) = 0.69, P = 0.50). Similarly, we did not observe consistent relationships between alpha AMI lateralization and response accuracy for any syllable (1st syllable: t(17) = 0.19, P = 0.85; 2nd syllable: t(17) = 1.39, P = 0.18; 3rd syllable: t(17) = 0.11, P = 0.91). Thus, although there were significant differences in engagement of attention across spatial conditions as measured both behaviorally and neurally, the individual subject differences in these metrics were not closely related.
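The two-stage analysis described above (a regression slope per subject, then a t-test on the slopes) can be sketched as follows. It assumes that each subject contributes one value per spatialization condition and that the slopes are tested against zero; the text does not spell out either detail, and the function names are hypothetical.

```python
import numpy as np


def per_subject_slopes(x, y):
    """Least-squares slope of y on x for each subject (rows = subjects)."""
    return np.array([np.polyfit(xi, yi, 1)[0] for xi, yi in zip(x, y)])


def t_statistic(values):
    """One-sample t statistic against zero, with n - 1 degrees of freedom."""
    values = np.asarray(values, dtype=float)
    sem = values.std(ddof=1) / np.sqrt(len(values))
    return values.mean() / sem
```

A p-value would follow from the t distribution with n − 1 degrees of freedom (17 for the 18 subjects here), for example via a statistics library.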
4 Discussion
Behaviorally, we found that impoverished spatial cues impair performance on an auditory spatial attention task in a multi-talker scene. We used objective electrophysiological measures to assess whether the naturalness and richness of spatial cues also impact how strongly auditory spatial attention modulates brain responses. We found that impoverished spatial cues reduce the strength of the evoked and induced neural signatures of attentional control. Specifically, evoked P1 amplitudes and induced alpha oscillatory power showed less attentional modulation for sound stimuli with impoverished spatial cues compared to when spatial cues were tailored to recreate the natural, rich experience of individual listeners.
4.1 Impoverished spatial cues result in less neural modulation during selective attention
We investigated attentional modulation of four established neural signatures of selective attention: evoked P1 and N1 amplitudes and induced power and lateralization of alpha oscillation. While attentional modulation of N1 amplitude was observed in all conditions, attentional modulation of the earlier P1 amplitude was not observed or was significantly weaker in the impoverished cue conditions compared to the natural cue condition. Similarly, we found less preparatory alpha power activity in the impoverished spatial cue conditions than in the natural cue condition, reflected by two indexes quantifying the amount of spatial variability of alpha power: alpha GFP (Figure 4) and AMI (Figure 5). In the ITD and IID conditions, although there was a hint of preparatory alpha lateralization over parietal sensors, the amount of lateralization was significantly smaller than in the HRTF condition and did not reach statistical significance. Preparatory alpha activity during spatial attention tasks has been well documented to form a specific lateralization pattern in both vision and audition (Banerjee et al., 2011; Kelly, 2006; Sauseng et al., 2005; Worden et al., 2018), which is thought to be evidence of a preparatory information-gating mechanism (Foxe and Snyder, 2011; Jensen and Mazaheri, 2010; Klimesch, 2012; Klimesch et al., 2007). In vision, alpha lateralization has been observed to increase with the laterality of attention focus (Rihs et al., 2007; Samaha et al., 2015), reflecting an inhibition pattern topographically specific to attention focus. Moreover, evidence for active top-down control of the phase of alpha oscillation during visual spatial attention suggests that alpha oscillatory activity represents active engagement and disengagement of the attentional network (Samaha et al., 2016). 
In addition, a previous somatosensory study has revealed that the alpha lateralization is positively correlated to pre-stimulus cue reliability, further suggesting that alpha lateralization reflects the top-down control in order to optimize the processing of upcoming stimuli (Haegens et al., 2011). Although relatively few studies have investigated alpha activity in audition, studies suggest that alpha control mechanisms are supra-modal rather than sensory specific (Banerjee et al., 2011). In the current experiment, a pre-stimulus auditory cue directed listeners where to focus attention in an upcoming sound mixture. The cue was spatialized using the same auditory features used to spatialize the stream mixture. Our results thus suggest that compared to stimuli with natural spatial cues, stimuli featuring only ITDs or only IIDs are less reliable in directing attentional focus, producing weaker engagement of spatial attention and reduced attentional modulation of neural responses.
Consistent with the idea that impoverished spatial cues lead to weaker engagement of spatial attention, we found that the P1 ERP component was modulated by attention only with natural spatial cues, not with impoverished cues; this result is consistent with a weak spatial representation failing to engage attentional modulation of early sensory responses (Figure 3). Our finding that attentional focus leads to a modulation of P1 amplitude for natural spatial cues is consistent with reports of attentional effects on P1 amplitude in previous spatial attention studies across sensory modalities [auditory: (Giuliano et al., 2014); visual: (Hillyard and Anllo-Vento, 1998; Hopfinger et al., 2004)]. Past studies agree that P1 modulation reflects an early sensory inhibition mechanism related to suppression of task-irrelevant stimuli. Although debates remain as to whether P1 modulation results from bottom-up sensory gain control (Hillyard and Anllo-Vento, 1998; Luck, 1995; Slagter et al., 2016) or from some top-down inhibitory process (Freunberger et al., 2008; Klimesch, 2011), it is generally accepted in visual spatial studies that greater P1 amplitude modulation is associated with greater inhibition of to-be-ignored stimuli (Couperus and Mangun, 2010; Hillyard and Anllo-Vento, 1998; Klimesch, 2012). Interestingly, attentional modulation of auditory P1 has been found to be positively correlated with visual working memory capacity, a result that was used to suggest that stronger P1 modulation reflects better attentional control of the flow of sensory information into working memory (Fukuda and Vogel, 2009; Giuliano et al., 2014). Our result is consistent with the hypothesis that P1 modulation directly reflects attentional control. Specifically, impoverished spatial cues likely produce a “muddy” representation of auditory space that supports only imprecise, poorly focused top-down spatial attention. The resulting lack of control and specificity of spatial auditory attention results in early P1 responses that are unmodulated by attentional focus.
N1 modulation is well documented as a neural index of attentional control (Choi et al., 2013; Hillyard et al., 1998; Stevens et al., 2008; Wyart et al., 2012). The attentional modulation of N1 is thought to reflect attentional facilitation rather than inhibition (Couperus and Mangun, 2010; Marzecová et al., 2018; Slagter et al., 2016). In contrast to preparatory alpha and P1, we found that the later N1 evoked response was modulated similarly, regardless of the richness and naturalness of spatial cues.
Because this modulation is robust and relatively large, changes in auditory N1 amplitude have been used as a biomarker and a primary feature for classification of attentional focus (Blankertz et al., 2011; Schreuder et al., 2011); see also recent work on decoding attentional focus for running speech using the correlation between neural responses and the power envelope of the speech streams (Chait et al., 2010; Mesgarani and Chang, 2012; Rimmele et al., 2015). However, little is known about how N1 amplitudes reflect the processing of different spatial cues during auditory spatial attention. Previous studies have revealed different N1 topographies during ITD and IID processing, leading to the conclusion that ITD and IID are processed by different neural populations in the auditory cortex (Johnson and Hautus, 2010; Tardif et al., 2006; Ungan et al., 2001). However, debate remains about whether this difference in topography reflects differences in perceived laterality rather than distinct neural populations specialized for processing different spatial cues. Results from a more recent study show that auditory N1 modulation does not differ across spatial cue conditions, indicating integrated processing of sound locations in auditory cortex regardless of cues (Salminen et al., 2015). In the current study, N1 modulation did not differ across the three spatialization conditions. Thus, our results support the idea that the same cortical neural population is responsible for processing different binaural spatial cues.
4.2 Behavioral disadvantages associated with impoverished spatial cues are modest and depend on sound stimulus characteristics
Despite the influence of spatial cue richness on neural metrics, our behavioral results showed only small (albeit significant) differences between impoverished spatial cues and natural, individualized spatial cues (Figure 2). In line with previous studies that observed greater spatial release from masking with combined spatial cues than with isolated cues (Culling et al., 2004; Ellinger et al., 2017), accuracy was best in the HRTF condition. This small accuracy improvement over impoverished cues was seen consistently across subjects. For the first syllable, where the target and distractor streams overlapped in time, the HRTF condition yielded a 13% increase in accuracy over the ITD condition, but performance was comparable to that in the IID condition. For the two staggered syllables, accuracy in the HRTF condition was greater than in the ITD and IID conditions by only about 6% and 1%, respectively. These differences in behavioral performance across syllables suggest that the characteristics of sound stimuli influence the difficulty of the task and may influence the behavioral advantages of having richer, more robust spatial cues (Kidd et al., 2010). Concordantly, a previous study with complex tone stimuli showed much larger differences in behavioral performance of up to 20% (Schröger, 1996), whereas studies presenting speech stimuli in a multi-talker environment found no behavioral advantage of combined cues over impoverished cues (Glyde et al., 2013). These behavioral discrepancies, in combination with our neural findings, indicate that behavioral performance alone is not a sensitive metric for determining whether cortical networks controlling spatial selective attention are fully engaged.
Conclusions
Our results indicate that although impoverished spatial cues can support spatial segregation of speech in a multi-talker environment, they do not fully engage the brain networks controlling spatial attention and thus lead to weak attentional control. Previous auditory studies have provided evidence that impoverished spatial cues do not evoke the same neural processing mechanisms as natural cue combinations during localization tasks with single sounds (Callan et al., 2013; Getzmann and Lewald, 2010; Leino et al., 2007; Palomäki et al., 2005). The current study extends these findings, demonstrating that the efficacy of higher-level cognitive processing, such as deployment of auditory selective attention, also depends on the naturalness of spatial cues. Poor attentional control was reflected in limited modulation of neural biomarkers of attentional processes. These findings suggest that the many past auditory attention studies using impoverished spatial cues may have underestimated the robust changes in cortical activity associated with deployment of spatial auditory attention in natural settings. Although impoverished auditory spatial cues can allow listeners to deploy spatial attention effectively enough to perform well in simple acoustic scenes, the noisy, complex listening environments encountered in everyday life pose greater challenges to attentional processing. In natural settings, spatial attention may fail unless attentional control networks are fully engaged. Thus, these results demonstrate the importance of preserving rich, natural spatial cues in hearing aids and other assistive listening devices.
Author contributions
R.B., B.S.-C., Y.D., and I.C. designed research; Y.D. and R.B. performed research; Y.D. and R.B. analyzed data; and Y.D., R.B., B.S.-C., and I.C. wrote the paper.
The authors have no competing interests to declare.
Acknowledgements
We thank Ashvini Meltoke for collecting part of the data, Gerald Kidd Jr. for providing the facilities for individualized HRTF measurements, and Virginia Best for fruitful discussions. This work was supported by the Austrian Science Fund (grant J3803-N30 to R.B.) and NIDCD (grant DC013825 to B.S.-C.).