Abstract
Neural information flow (NIF) is a new framework for system identification in neuroscience. NIF subsumes population receptive field estimation, neural encoding, effective connectivity analysis and hemodynamic response estimation in a single differentiable model that can be trained end-to-end via stochastic gradient descent. NIF models represent neural information processing systems as a network of coupled tensors, each encoding the representation of the sensory input contained in a brain region. The elements of these tensors can be interpreted as cortical columns whose activity encodes the presence of a specific feature in a spatio-temporal location. Each tensor is coupled to the measured data specific to a brain region via low-rank observation models that can be decomposed into the spatial, temporal and feature receptive fields of a localized neuronal population. Both these observation models and the convolutional weights defining the information processing within regions and effective connectivity between regions are learned end-to-end by predicting the neural signal during sensory stimulation. We trained a NIF model on the activity of early visual areas using a large-scale fMRI dataset. We show that we can recover plausible visual representations and population receptive fields that are consistent with the existing literature. Trained NIF models are accessible for in silico analyses.
1 Introduction
Uncovering the neural computations that subserve cognition and behaviour is a major goal in neuroscience (Churchland and Sejnowski, 1992). Arguably, a true understanding of the brain requires replicating biological neural information processing in silico. However, a general approach for estimating large-scale brain models from observed data has not been proposed so far. This paper introduces a new framework, referred to as neural information flow (NIF), which allows us to achieve this goal.
A popular approach for modeling neural information processing is a goal-driven approach, where a basis set of stimulus features optimized to solve a specific task is used to model neural responses to complex naturalistic input (Naselaris et al., 2011; van Gerven, 2017; Yamins and DiCarlo, 2016). Using this approach, the best results so far have been obtained using deep neural networks (DNNs) (Kriegeskorte, 2015; Güçlü and van Gerven, 2015; Yamins and DiCarlo, 2016; Agrawal et al., 2014; Cichy et al., 2016; Horikawa and Kamitani, 2017; Cadena et al., 2019). However, these models have been optimized on tasks such as object classification on image databases, and not for explaining brain responses. While there exists a correspondence between DNNs and brains at a functional level, they do not provide realistic models of neural information processing.
An alternative data-driven approach is to directly estimate neural models from measurements of neural activity. We refer to this uncovering of neural information processing systems from observed data as neural system identification (Stanley, 2005; Wu et al., 2006). This approach has been used to reveal various mechanisms of neural information processing in biological systems (Joukes et al., 2014; Klindt et al., 2017; Antolík et al., 2016; McIntosh et al., 2016; Brackbill et al., 2017). However, so far, neural system identification has been used to explain neural responses in individual brain regions. Instead, we aim to estimate whole-brain models that model neural computations in individual neural populations as well as causal interactions between neural populations.
Both goal-driven and data-driven models tend to ignore the causal interactions between brain regions. That is, they are lacking the coupling between distinct neural regions commonly studied with effective connectivity methods (Friston et al., 2003; Friston, 2011; Liao et al., 2010; Ambrogioni et al., 2017). Established techniques for uncovering effective connectivity are able to uncover this causal coupling (Friston et al., 2003), but do not capture the nature of information processing that drives the interaction between brain regions.
NIF models can be interpreted as “synthetic (in silico) brain models” that learn to capture the nonlinear computations that take place in real brains. Instead of modeling brain regions in an isolated fashion, they include connectivity between brain regions by taking afferent input into account (Haak et al., 2013). Moreover, by making use of convolution and factorization, we are able to estimate whole-brain models in an efficient manner. In the following, we outline the basic methodology of NIF modeling. Using a large functional magnetic resonance imaging (fMRI) dataset acquired under naturalistic stimulation we demonstrate that the model is capable of generating realistic brain measurements and that the computations captured in the model are biologically meaningful.
This paper outlines the basic principles of NIF models. However, the framework is general in the sense that the experimenter is free to choose the neural architecture of individual brain regions and how these regions map onto observed measurements, which can be either neural or behavioural in nature. The philosophy of neural information flow is outlined in Figure 1. Given the generality of our framework, we expect that it will guide the development of a new family of generative models that allow us to uncover the principles of neural computations in biological systems.
2 Neural information flow
The purpose of a NIF model is to capture the neuronal computations that take place within and between neuronal populations in response to a naturalistic sensory input. The core of a NIF model is a deep convolutional architecture. Each layer of this architecture stores the representation of the sensory input encoded in a specific brain region. Information is processed through convolutions, which model the topographically organized connectivity between brain regions. These representations are used to predict measurements through low-rank observation models that couple each layer of the network to the observable responses of a specific brain region. These low-rank observation models can be factorized into spatial, temporal and feature components.
The model parameters are estimated by fitting the measured neural signals during sensory stimulation. Specifically, the model receives the same sensory input that is presented to the participant and predicts the measurements of all the brain regions of interest. Both the internal convolutions and the observation models are trained end-to-end using SGD. Note that the only error signal comes from the neural responses without any further pre-training and regularization. This differs from most existing encoding approaches, where neural responses are predicted from the activations of a network trained on an unrelated classification task (Kriegeskorte, 2015; Güçlü and van Gerven, 2015; Agrawal et al., 2014; Horikawa and Kamitani, 2017). In the following we describe the components in more detail.
2.1 Modeling neural representations
We model the neural representations encoded in individual brain regions using tensors. The activity of a neural area is encoded in a four-dimensional tensor N[c, x, y, t], whose array dimensions represent channels c, two spatial coordinates (x, y) and time t, respectively. During training on neural measurements, the feature maps N[i,:,:,:] learn to encode neural processing of specific stimulus (input) characteristics such as oriented edges or coherent motion. Consequently, a tensor element N[c, x, y, t] can be interpreted as the response of one cortical column. Under the same interpretation, cortical hyper-columns are represented by a sub-tensor N[:, x, y, t] storing the activations of all the columns that respond to the same spatial location. Sensory inputs are represented in the same manner. The input tensor represents the responses of sensory receptors such as the photoreceptors in the retina.
2.2 Modeling information flow
We model effective connectivity from a source region a to a target region b as a convolution between the neural tensor $N_a$ and a tensor of synaptic weights $W^{a \to b}$:
$$(W^{a \to b} * N_a)[c', x, y, t] = \sum_{c} \sum_{x', y', t'} W^{a \to b}[c, x', y', t', c'] \, N_a[c, x - x', y - y', t - t']. \tag{1}$$
In other words, here we model neural processing using 3D convolutions. To enforce causality of the neural responses, the temporal filters should be causal, meaning that the only non-zero weights correspond to past time points. However, this assumption can be dropped when the time scale of our observations is much slower than that of the underlying temporal dynamics.
As shown in Equation (1), the tensor W has five array dimensions: input channels, two spatial coordinates, one temporal coordinate and one output channel. The convolution is performed along the two spatial dimensions and the temporal dimension. The spatial weights model the topographically organized synaptic connections while the temporal component models synaptic delays. Using this notation, we can define the activation of the j-th brain area as a function of its afferent input:
$$N_j = f\Big(\sum_{i} W^{i \to j} * N_i + B_j\Big), \tag{2}$$
where the sum runs over the regions i that project to region j, f is a (nonlinear) activation function (applied element-wise) and $B_j$ determines the bias. Using this setup, we can model how neural populations respond to sensory input, as well as to each other. Note further that bottom-up and top-down interactions between brain regions can be integrated in the same model.
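The coupling between regions described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the paper: the function names `causal_conv3d` and `region_activation`, the zero padding, the odd spatial kernel sizes, the unflipped spatial kernel (the cross-correlation convention common in deep learning libraries) and the ReLU nonlinearity are all assumptions made for the example.

```python
import numpy as np

def causal_conv3d(n_a, w):
    """Apply weights w[c, dx, dy, dt, c_out] to a neural tensor n_a[c, x, y, t].

    Space is zero-padded symmetrically (odd kernel sizes assumed), time is
    padded on the past side only, so the output at time t depends only on
    the input at times t, t-1, ..., t-(kt-1).
    """
    c_in, kx, ky, kt, c_out = w.shape
    _, X, Y, T = n_a.shape
    px, py = kx // 2, ky // 2
    padded = np.pad(n_a, ((0, 0), (px, px), (py, py), (kt - 1, 0)))
    out = np.zeros((c_out, X, Y, T))
    for dx in range(kx):
        for dy in range(ky):
            for dt in range(kt):
                # dt = 0 is the current time point, dt = 1 one step in the past, ...
                start = kt - 1 - dt
                patch = padded[:, dx:dx + X, dy:dy + Y, start:start + T]
                out += np.einsum("cxyt,co->oxyt", patch, w[:, dx, dy, dt, :])
    return out

def region_activation(inputs, weights, bias, f=lambda z: np.maximum(z, 0.0)):
    """Activation of a target region given all of its afferent regions.

    inputs:  list of source tensors n_i[c, x, y, t]
    weights: list of matching weight tensors w_i[c, dx, dy, dt, c_out]
    bias:    one bias value per output channel
    f:       element-wise nonlinearity (ReLU here, as an assumption)
    """
    total = sum(causal_conv3d(n, w) for n, w in zip(inputs, weights))
    return f(total + bias[:, None, None, None])

rng = np.random.default_rng(0)
n_src = rng.standard_normal((2, 8, 8, 6))        # 2 channels, 8 x 8 space, 6 time points
w = 0.1 * rng.standard_normal((2, 3, 3, 2, 4))   # 3 x 3 x 2 kernel, 4 output channels
n_tgt = region_activation([n_src], [w], bias=np.zeros(4))

# Changing only the last input time point must leave earlier outputs untouched.
n_mod = n_src.copy()
n_mod[..., -1] += 1.0
n_tgt_mod = region_activation([n_mod], [w], bias=np.zeros(4))
```

Because the temporal padding is one-sided, the check at the end confirms the causality constraint: a perturbation of the most recent input only affects the most recent output.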
2.3 Modeling observable signals
NIF models are estimated by linking neural tensors to observation models that capture (indirect) measurements of brain activity. Observations are represented using tensors Y that reflect multiple responses in space and time. They are modelled as
$$Y = g(N_1, \ldots, N_n) + \epsilon, \tag{3}$$
where $N_1, \ldots, N_n$ are the neural tensors of the modelled regions, g is the observation model and ϵ is measurement noise. The exact form of g depends on the kinds of measurements that are being made. Neuroimaging methods such as fMRI, single- and multi-unit recordings, local field potentials, calcium imaging, EEG and MEG, but also motor responses and eye movements, provide observable responses to afferent input and can thus be used as a training signal. Note that the same brain regions can be observed using multiple observation models, conditioning them on multiple heterogeneous datasets at the same time. This provides a solution for multimodal data fusion in neuroscience (Uludağ and Roebroeck, 2014).
In this paper, we focus on blood-oxygenation-level dependent (BOLD) responses obtained for individual voxels using fMRI. In this case, we can consider the voxel responses separately for each region, such that we have $Y_i = g_i(N_i) + \epsilon$ for each region i. Let $Y_i \in \mathbb{R}^{K \times T}$ denote BOLD responses of K voxels acquired over T time points inside the i-th region. Our observation model for the k-th voxel in that region is defined as
$$g_i(N_i)[k, t] = \sum_{c, x, y, \tau} U[c, x, y, \tau, k] \, N_i[c, x, y, t - \Delta t - \tau], \tag{4}$$
where U is the tensor of observation weights and Δt is a temporal shift of the BOLD response that is used to take a basic offset in the hemodynamic delay into account (4.9 s in our example). Every brain region can be observed using a function of the form shown in Equation (4).
To simplify parameter estimation and facilitate model interpretability we use a factorized representation of U. That is,
$$U[c, x, y, \tau, k] = U_c[c, k] \, U_s[x, y, k] \, U_t[\tau, k], \tag{5}$$
where Uc[·, k] are the channel loadings which capture the sensitivity of a voxel to specific input features, Us[·, ·, k] is the spatial receptive field of a voxel and Ut[·, k] is the temporal profile of the observed BOLD response of the k-th voxel. Hence, the estimated voxel-specific observation models have a direct biophysical interpretation.
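To make the benefit of the factorization concrete, the following NumPy sketch predicts voxel responses from channel, spatial and temporal factors and compares the result with an explicit full weight tensor. The function name `predict_voxels`, the toy tensor sizes and the random values are assumptions for illustration; the neural tensor is assumed to be already aligned to the measurement (the temporal shift has been applied).

```python
import numpy as np

def predict_voxels(n, u_c, u_s, u_t):
    """Factorized linear observation model for one time window.

    n[c, x, y, tau]: neural tensor of the observed region
    u_c[c, k]:       channel loadings of voxel k
    u_s[x, y, k]:    spatial receptive field of voxel k
    u_t[tau, k]:     temporal profile of voxel k
    Returns one predicted response per voxel.
    """
    return np.einsum("cxyt,ck,xyk,tk->k", n, u_c, u_s, u_t)

rng = np.random.default_rng(0)
C, X, Y, T, K = 4, 6, 6, 3, 5          # toy sizes, not those of the paper
n = rng.standard_normal((C, X, Y, T))
u_c = rng.standard_normal((C, K))
u_s = rng.random((X, Y, K))
u_t = rng.random((T, K))

y_factorized = predict_voxels(n, u_c, u_s, u_t)

# The same prediction with the full weight tensor built from the three factors.
u_full = np.einsum("ck,xyk,tk->cxytk", u_c, u_s, u_t)
y_full = np.einsum("cxyt,cxytk->k", n, u_full)

params_full = u_full.size                               # C * X * Y * T * K
params_factorized = u_c.size + u_s.size + u_t.size      # (C + X * Y + T) * K
```

Both routes give the same predictions, but the factorized model stores (C + X·Y + T)·K weights instead of C·X·Y·T·K, which is what keeps voxel-wise estimation tractable.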
We further facilitate parameter estimation by using a spatial weighted low-rank decomposition of the spatial receptive field:
$$U_s[x, y, k] = b_k + \sum_{r=1}^{R} a_{k,r} \, U_x^{(r)}[x, k] \, U_y^{(r)}[y, k]. \tag{6}$$
Here, $b_k$ is a voxel-specific bias and $a_{k,r}$ are positive rank amplitudes. In our experiments, we used R = 4. To further stabilize the model and obtain localized population receptive fields, we apply a softmax nonlinearity to the columns (voxel-specific weights) of Ut, Ux and Uy. That is, the elements $u_i$ of each column vector u of these matrices are given by
$$u_i = \frac{\exp(v_i)}{\sum_{j} \exp(v_j)}, \tag{7}$$
where the $v_i$ are learnable parameters. This enforces positively weighted spatio-temporal receptive fields and reduces noise in the individual estimation of the voxel-specific weight values. The rank limits the complexity of the spatial observation model. Rank one models can estimate unimodal receptive fields. However, a small number of voxels have nonclassical receptive fields that respond to multiple parts of the input space (see Figure 6), for which more degrees of freedom are needed.
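A low-rank spatial receptive field with softmax-normalized factors can be sketched as follows. This is an illustrative reading of the description above, not the paper's code: the names `softmax_columns` and `spatial_rf`, the array layout `[x, k, r]`, and the assumption that the voxel-specific bias enters additively are all choices made for the example.

```python
import numpy as np

def softmax_columns(v):
    """Softmax along axis 0, so each column is positive and sums to one."""
    e = np.exp(v - v.max(axis=0, keepdims=True))   # subtract max for stability
    return e / e.sum(axis=0, keepdims=True)

def spatial_rf(v_x, v_y, a, b):
    """Rank-R spatial receptive fields for K voxels.

    v_x[x, k, r], v_y[y, k, r]: unconstrained learnable parameters
    a[k, r]:                    positive rank amplitudes
    b[k]:                       voxel-specific bias (assumed additive)
    Returns u_s[x, y, k].
    """
    u_x = softmax_columns(v_x)
    u_y = softmax_columns(v_y)
    return b[None, None, :] + np.einsum("kr,xkr,ykr->xyk", a, u_x, u_y)

rng = np.random.default_rng(0)
X, Y, K, R = 10, 12, 3, 4                      # R = 4 as in the text, other sizes are toy values
v_x = rng.standard_normal((X, K, R))
v_y = rng.standard_normal((Y, K, R))
a = rng.uniform(0.1, 1.0, size=(K, R))
b = np.zeros(K)

rf = spatial_rf(v_x, v_y, a, b)
```

With a zero bias, every receptive field is strictly positive and its total mass equals the sum of the rank amplitudes of that voxel, because each softmax factor sums to one. Each rank contributes one separable bump, so R > 1 allows multimodal fields.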
2.4 Model estimation
Once the architecture of the NIF model is defined, its parameters (synaptic weights and observation model parameters) can be estimated using stochastic gradient descent (SGD). NIF also allows modeling recurrent neural computations (loops) that arise from combining bottom-up and top-down connections between brain regions, by unrolling the graph and performing backpropagation through time (Werbos, 1990). However, in the present paper we only model feed-forward processing. Losses are estimated within brain regions as mean squared error across voxels. The individual loss terms for every brain region are summed to obtain a final loss that is minimized using SGD. Note that since the model couples neuronal populations, region-specific estimates are constrained by one another and consequently make use of all observed data. The NIF example presented here was implemented in the Chainer framework for automatic differentiation (Tokui et al., 2015).
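The training objective described above, one mean squared error per region summed over regions, can be written compactly. The sketch below uses plain NumPy and a hypothetical helper `nif_loss`; in an actual model the same expression would be built in an automatic differentiation framework so that gradients flow back through the shared layers.

```python
import numpy as np

def nif_loss(predictions, targets):
    """Sum over regions of the mean squared error across voxels and time.

    predictions, targets: dicts mapping a region name to an array [k, t].
    """
    return sum(np.mean((predictions[r] - targets[r]) ** 2) for r in targets)

# Toy example with two regions of different size.
predictions = {"V1": np.zeros((3, 4)), "V2": np.full((2, 4), 2.0)}
targets = {"V1": np.ones((3, 4)), "V2": np.zeros((2, 4))}
loss = nif_loss(predictions, targets)   # 1.0 (V1) + 4.0 (V2)
```

Taking the mean within each region before summing gives every region the same weight in the total loss regardless of its number of voxels.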
3 Experimental validation
To test our example visual system model, we made use of a unique large-scale functional MRI dataset in which one subject was exposed to almost 23 hours of complex naturalistic spatiotemporal stimuli. Specifically, we presented episodes from the BBC series Doctor Who (Davies et al., 2005).
3.1 Stimulus material
A single human participant (male, age 27.5) watched 30 episodes from seasons 2 to 4 of the 2005 relaunch of Doctor Who. This comprised the training set which was used for model estimation. Episodes were split into 12 min chunks (with each last one having varying length) and presented with a short break after every two runs. The participant additionally watched repeated presentations of short movies (Pond Life (five movies of 1 min, 26 repetitions) and Space / Time (two movies of 3 min, 22 repetitions)) in random permutations after nearly every episode. They were taken from the series’ sequel to avoid overlap with the training data. This comprised the test set which was used for model validation.
3.2 Data acquisition
We collected 3T whole-brain fMRI data. We ensured that the training stimulus material was novel to the participant. Data were collected inside a Siemens 3T MAGNETOM Prisma system using a 32-channel head coil (Siemens, Erlangen, Germany). A T2*-weighted echo planar imaging pulse sequence was used for rapid data acquisition of whole-brain volumes (64 transversal slices with a voxel size of 2.4 × 2.4 × 2.4 mm³ collected using a TR of 700 ms). We used a multiband-multi-echo protocol with multiband acceleration factor of 8, TE of 39 ms and a flip angle of 75 degrees. The video episodes were presented on a rear-projection screen with the Presentation software package, cropped to 698 × 732 pixels squares so that they covered 20° of the vertical and horizontal visual field. The participant’s head position was stabilized within and across sessions by using a custom-made MRI-compatible headcast, along with further measures such as extensive scanner training. The participant had to fixate on a fixation cross in the center of the video. At the beginning of every break and after every test set video a black screen was shown for 16 s to record BOLD fadeout; we omit this part of the data here. In total this leaves us with 118,197 whole-brain volumes of single-presentation data, forming our training set (used for model estimation), and 1,031 volumes of resampled data, forming our test set (used for model evaluation).
Data collection was approved by the local ethical review board. All specifics of the data set are described in a separate manuscript accompanying the data that will be made publicly available.
3.3 Data preprocessing
Minimal BOLD data preprocessing was performed using FSL v5.0. Volumes were first aligned within each 12 min run to their center volume (run-specific reference volume). Next, all reference volumes were aligned to the center volume of the first run (global reference volume). The run-specific transformations were applied to all volumes to align them with the global reference volume.
The signal of every voxel used in the model was linearly detrended, then standardized (demeaning, unit variance) per run. Test set BOLD data were averaged over repetitions to increase the signal-to-noise ratio, and as a final step the result was standardized again. A fixed delay of 7 TRs (4.9 s) was used to associate stimulus video chunks with responses and allow the model to learn voxel-specific HRF delays within Ut. With the video segments covering 3 TRs starting from the fixed delay, the BOLD signal corresponding to a stimulus is thus expected to occur within a time window of 4.9 s to 6.3 s after the onset of the segment. As there were small differences between frame rates in the train and test sets we converted the stimulus videos to a uniform frame rate of 22.86 Hz (16 frames per TR) for training the example model. To reduce model complexity we downsampled the videos to 112 × 112. As the model operates on three consecutive TRs, the training input size was 112 × 112 × 48. The stimuli were converted to greyscale prior to presenting them to the model. Otherwise stimuli were left just as they were presented in the experiment.
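The per-run signal preparation and the stimulus timing described above can be checked with a short sketch. The constants come from the text; the helper name `detrend_and_standardize`, the least-squares detrending and the synthetic run are assumptions for illustration and do not reproduce the FSL-based pipeline.

```python
import numpy as np

TR = 0.7                 # seconds per volume
FRAMES_PER_TR = 16       # video frames per TR after frame-rate conversion
DELAY_TRS = 7            # fixed stimulus-to-response delay in TRs
SEGMENT_TRS = 3          # TRs of video seen by the model per sample

delay_s = DELAY_TRS * TR                              # 4.9 s
frames_per_segment = SEGMENT_TRS * FRAMES_PER_TR      # 48 frames
input_shape = (112, 112, frames_per_segment)          # model input size

def detrend_and_standardize(bold):
    """Remove a linear trend per voxel, then z-score per voxel.

    bold[t, k]: signal of K voxels over the T volumes of a single run.
    """
    t = np.arange(bold.shape[0], dtype=float)
    design = np.column_stack([t, np.ones_like(t)])    # slope and intercept
    coef, *_ = np.linalg.lstsq(design, bold, rcond=None)
    resid = bold - design @ coef
    return (resid - resid.mean(axis=0)) / resid.std(axis=0)

# Synthetic run: noise plus a voxel-specific drift.
rng = np.random.default_rng(0)
run = rng.standard_normal((200, 5)) + np.linspace(0.0, 3.0, 200)[:, None] * rng.random(5)
clean = detrend_and_standardize(run)
```

After this step every voxel has zero mean, unit variance and no residual linear drift within the run, which is what allows runs from different sessions to be concatenated for training.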
3.4 Model architecture
We implemented a purely feed-forward architecture for modeling parts of the visual system (LGN, V1, V2, V3, FFA and MT). The architecture used is illustrated in detail in Figure 2. FFA and MT have their own tensors originating from V3 to allow for a simplified model of the interactions between upstream and downstream areas. We intentionally used a simplified model to focus on demonstrating the capabilities of the NIF framework. To model LGN output, we used a linear layer consisting of a single 3 × 3 × 1 spatial convolutional kernel. The model was trained for 11 epochs with a batch size of 3, using the Adam optimizer (Kingma and Ba, 2014) with learning rate α = 5 × 10⁻⁴. Weights were initialized with Gaussian distributions scaled by the number of feature maps in every layer (He et al., 2015).
4 Results
In this paper, we focus on the processing of visual information. In the following, we show that a NIF model uncovers meaningful characteristics of the visual system.
4.1 Response prediction
After training the NIF model, we observed that BOLD responses in a majority of voxels in each brain region could be significantly predicted by the model (p < 0.01, Bonferroni-corrected for the total number of gray matter voxels). This is illustrated in Figure 3, showing voxel-wise correlations between predicted and observed test data per region. The results show that the NIF model indeed generates realistic brain activity in response to unseen input stimuli (out-of-sample prediction).
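The evaluation above rests on one correlation per voxel between predicted and observed test responses, judged against a Bonferroni-corrected threshold. The sketch below shows both ingredients in NumPy; the function name `voxelwise_correlation` and the synthetic data are assumptions, and the conversion of correlations to p-values is left out.

```python
import numpy as np

def voxelwise_correlation(pred, obs):
    """Pearson correlation between prediction and observation per voxel.

    pred[t, k], obs[t, k]: time points by voxels. Returns r[k].
    """
    p = pred - pred.mean(axis=0)
    o = obs - obs.mean(axis=0)
    return (p * o).sum(axis=0) / np.sqrt((p ** 2).sum(axis=0) * (o ** 2).sum(axis=0))

rng = np.random.default_rng(0)
obs = rng.standard_normal((300, 5))
pred = 0.5 * obs + rng.standard_normal((300, 5))   # noisy but informative predictions
r = voxelwise_correlation(pred, obs)

# Bonferroni correction: divide the significance level by the number of tests.
n_tests = obs.shape[1]          # in the paper: the total number of gray matter voxels
alpha_corrected = 0.01 / n_tests
```

A voxel counts as significantly predicted when the p-value of its correlation falls below `alpha_corrected`.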
4.1.1 Linear preprocessing
Both the retina and the LGN process visual information before it enters the visual cortex (Graham et al., 2006; Dan et al., 1996). In our NIF model, we account for this biological preprocessing with a learnable linear purely spatial single-channel convolutional layer (3 × 3 × 1) (without nonlinearities) that connects the retinal input to the V1 input. In the main experiment presented in this paper, we found that this layer behaves as an edge extractor (Figure 4B).
In alternative training runs, we included a second preprocessing channel in order to investigate its effect. In this case, one channel became an edge extractor while the other channel extracted luminance (Figure 4C). This is likely to be a reflection of the independence of luminance and contrast information in natural images and in LGN responses (Mante et al., 2005). These results indicate that the model is capable of learning a biologically plausible linear transformation of the visual input.
To simplify the model and reduce computational burden we used a single preprocessing channel in the following analyses.
4.2 Neural representations
Before nonlinearities are applied, neural network features can be inspected by visualizing the learned weights. Figure 5A shows the 64 channels (feature detectors) learned by the neural tensor connected to V1 voxels. We see that several well-known feature detection mechanisms of V1 arise such as Gabor-like response profiles (Jones and Palmer, 1987). Several of these feature detectors show distinct dynamic temporal profiles (see Figure 5A, right panel), reflecting the processing of visual motion (Joukes et al., 2014).
For higher-order regions we need to resort to other visualization techniques. One common approach is computing the gradient that leads to an increase in the activity in the whole feature map of the target channel, and using this gradient to modify the input, starting from a white noise pattern. The resulting visualizations give an impression of the inputs which specific neural network channels prefer or respond strongly to. This procedure leads to noisy visualizations, but can be stabilized by iteratively repeating it on multiple increasing scales of the input image (Mordvintsev et al., 2015). We have used this procedure to synthesize preferred inputs of our network. We have used four scales (starting at 37 × 37 × 8 increasing with a scaling factor of 1.3 up to 81 × 81 × 18), using the mean absolute error on the feature map activations as loss, and an SGD optimizer with a learning rate of 2000 and L2 regularization. The results for different areas can be seen in Figure 5B. The V1 preferred input synthesis mainly shows similarly oriented bars across the feature maps. Higher-order channels of the model show complex combinations consisting of multiple spatial frequencies.
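The multi-scale schedule used for input synthesis follows directly from the starting size and the scaling factor. The sketch below reproduces it; rounding to the nearest integer is an assumption, since the text only states the first and last scale, and `synthesis_scales` is a hypothetical helper name.

```python
def synthesis_scales(start=(37, 37, 8), factor=1.3, n_scales=4):
    """Input sizes (x, y, t) for each scale, growing geometrically by `factor`."""
    return [tuple(round(s * factor ** i) for s in start) for i in range(n_scales)]

scales = synthesis_scales()
# First and last entries match the sizes given in the text:
# (37, 37, 8) and (81, 81, 18).
```

At each scale the current estimate is resized to the next entry of the schedule and gradient ascent on the target feature map is continued from there.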
4.3 Receptive field mapping
We examined whether the retinotopical organization of the visual cortex can be recovered from the spatial observation models.
Us represents spatial receptive field estimates for every voxel. Some of these voxel-specific receptive fields are shown in Figure 6A. The model has primarily learned classical local unimodal population receptive fields, but has also learned more complex non-classical response profiles (Olshausen and Field, 2005). This matches the expectation that a population (voxel) response is not necessarily restricted to unimodal receptive fields.
To check that the NIF model has indeed captured sensible retinotopic properties we determined the center of mass of the spatial receptive fields, and transformed these centers to polar coordinates using the central fixation point as origin. Sizes of the receptive fields were estimated as the standard deviation across Us, using the centers of mass as mean. Voxels whose responses could not be significantly predicted were excluded from this analysis.
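The retinotopic summary statistics described above can be computed from a single spatial receptive field as follows. This is a sketch under stated assumptions: the helper `rf_summary`, the mapping of array axes to horizontal and vertical visual field position, the 20° extent of the grid and the use of a radial standard deviation as the size measure are illustrative choices.

```python
import numpy as np

def rf_summary(u_s, extent_deg=20.0):
    """Eccentricity, polar angle and size of one spatial receptive field.

    u_s[x, y]: non-negative weights on a grid that covers `extent_deg`
    degrees of visual angle in each direction, with fixation at the center.
    """
    nx, ny = u_s.shape
    w = u_s / u_s.sum()                                   # normalize to unit mass
    xs = (np.arange(nx) + 0.5) / nx * extent_deg - extent_deg / 2
    ys = (np.arange(ny) + 0.5) / ny * extent_deg - extent_deg / 2
    gx, gy = np.meshgrid(xs, ys, indexing="ij")
    cx, cy = (w * gx).sum(), (w * gy).sum()               # center of mass
    eccentricity = np.hypot(cx, cy)                       # distance from fixation
    polar_angle = np.arctan2(cy, cx)                      # radians
    size = np.sqrt((w * ((gx - cx) ** 2 + (gy - cy) ** 2)).sum())
    return eccentricity, polar_angle, size

# A receptive field concentrated in a single pixel has zero size.
point_rf = np.zeros((20, 20))
point_rf[13, 10] = 1.0
ecc, angle, size = rf_summary(point_rf)

# A uniform field is centered on fixation.
ecc_uniform, _, size_uniform = rf_summary(np.ones((20, 20)))
```

Applying the function to every significantly predicted voxel yields the values that are plotted as polar angle, eccentricity and receptive field size maps.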
Figure 6 shows polar angle (B), eccentricity (C) and receptive field size (D) for early visual system areas observed by our model. Maps were generated with pycortex (Gao et al., 2015). Note that the boundaries between visual areas V1, V2 and V3 have been estimated with data from a classical wedge and ring retinotopy session. The polar angle reversal boundaries align well with the traditionally estimated ROI boundaries. The increase in receptive field size (D) with larger eccentricity (C) matches the expected fovea-periphery organization as well.
The NIF model thus learns sensible retinotopic characteristics of the visual system directly from naturalistic data. Our results also indicate that the NIF framework allows the estimation of accurate retinotopic maps from naturalistic videos.
5 Discussion
This paper introduced neural information flow as a new approach for neural system identification. The approach relies on a neural architecture specified in terms of interacting brain regions that each embody nonlinear computations. By conditioning each brain region on associated measurements of neural activity, we can estimate neural information processing systems end-to-end. By allowing interactions between brain regions, each brain region will act as a regularizer for the neural computations that emerge in other brain regions. After all, the estimated neural computations must jointly explain observed responses across all brain regions.
We showed using fMRI data collected during prolonged naturalistic stimulation that we can successfully predict BOLD responses across different brain regions. Furthermore, meaningful receptive fields emerged after model estimation. Importantly, the learnt receptive fields are specific to each brain region but collectively explain all of the observed measurements. To the best of our knowledge, these results demonstrate for the first time that biologically meaningful information processing systems of multiple interconnected regions can be directly estimated from neural data. NIF allows neuroscientists to specify hypotheses about neuronal interactions and test these by quantifying how well the resulting models explain observed measurements.
NIF generalizes current encoding models. For example, basic population receptive field models (Dumoulin and Wandell, 2008) and more advanced neural network models (van Gerven, 2017) are special cases of NIF that assume no interactions between brain regions and make specific choices for the nonlinear transformations that capture neuronal processing. Furthermore, current approaches mainly rely on using neural networks that are trained to solve another task such as object recognition (Güçlü and van Gerven, 2015; Yamins et al., 2014). An exception is, e.g. (Güçlü and van Gerven, 2016), but the employed models required the transformation of sensory input to lower-dimensional stimulus features and did not allow the explicit recovery of neural computations in individual regions of interest. Here we show that biologically-interpretable models can be estimated directly from neural data using neural information flow.
The present work also provides a new approach to effective connectivity analysis. The researcher can specify alternative NIF models and then use explained variance as a model selection criterion. This is similar in spirit to dynamic causal modelling. However, instead of using changes in neural dynamics to estimate effective connectivity, we can embrace changes in neural computation to estimate causal interactions.
NIF can be naturally extended in several ways. The employed convolutional layer to model neural computation can be replaced by neural networks that have a more complex architecture. For example, recurrent neural networks can be used to explicitly model the changes in neural dynamics that are now captured by 3D convolutions. Furthermore, lateral and feedback processes are easily added by adding additional links between brain regions.
NIF models can also be extended to handle other data modalities. Alternative observation models can be formulated that allow us to infer neural computations from other measures of neural activity (e.g., single- and multi-unit recordings, local field potentials, calcium imaging, EEG, MEG). Moreover, NIF models can be conditioned on multiple heterogeneous datasets at the same time, providing an elegant solution for multimodal data fusion. NIF can also be easily applied to other sensory inputs. For example, auditory areas can be conditioned on auditory input (see e.g. Güçlü et al., 2016). If this is combined with visual input then we may be able to uncover new properties of multimodal integration (Simanova et al., 2014).
Note that we are not restricted to conditioning NIF models on neural data. We may instead (or also) condition these models on behavioural data, such as motor responses or eye movements. The resulting models should then show the same behavioural responses as the system under study. We can even teach NIF models to solve the task at hand directly using reinforcement learning (Sutton and Barto, 2017). In this sense, NIF models provide a starting point for creating brain-inspired AI systems that more closely model how real brains solve cognitive tasks.
Estimated NIF models can be interpreted as synthetic brains that model their biological counterparts. This implies that we can subject them to any approach which can also be used to probe neural information processing in real brains. For instance, we can apply any method for neural data analysis to the neural time series that result from driving the model with external input. Recently developed nonlinear decoding techniques can shed further light on the neural representations that are encoded by different brain regions (Güçlütürk et al., 2017), providing insight into the phenomenological experience of synthetic brains. Here we restricted ourselves to demonstrating the virtues of our approach using basic receptive field analyses.
Finally, we can use NIF models as in silico models to examine changes in neural computation. For example, we can examine how neural representations change during learning or by virtual lesioning of the network (Graziano and Aflalo, 2007). This can provide insights into cognitive development and decline. We can also test what happens to neural computations when we directly drive individual brain regions with external input. This provides new approaches for understanding how brain stimulation modulates neural information processing, guiding the development of future neurotechnology (Roelfsema et al., 2018).
Summarizing, we view neural information flow as a starting point for building a new family of rich, general, biologically-inspired computational models that capture neural information processing in biological systems. As such it provides a perfect blend of computational and experimental neuroscience (Churchland and Sejnowski, 2016). NIF models are also scalable since they make use of efficient stochastic gradient methods, as developed by the artificial intelligence community. This provides us with a principled approach to make sense of the high-resolution datasets produced by continuing advances in neurotechnology (Stevenson and Kording, 2011). We expect that (variants of) NIF models will provide exciting new insights into the principles and mechanisms that dictate neural information processing in biological systems.
Acknowledgements
This research was supported by VIDI grant number 639.072.513 of The Netherlands Organization for Scientific Research (NWO).