Abstract
Visual neurons respond selectively to features that become increasingly complex in their form and dynamics from the eyes to the cortex. These features take specific forms: retinal neurons prefer localized flashing dots1, primary visual cortical (V1) neurons moving bars2–4, and those in higher cortical areas, such as middle temporal (MT) cortex, favor complex features like moving textures5–8. Whether there are general principles behind this diverse complexity of response properties in the visual system has been an area of intense investigation. To date, no single normative model has been able to account for the hierarchy of tuning to dynamic inputs along the visual pathway. Here we show that hierarchical temporal prediction - representing features that efficiently predict future sensory input from past sensory input9–12 - can explain how neuronal tuning properties, particularly those relating to motion, change from retina to higher visual cortex. In contrast to some other approaches13–17, the temporal prediction framework learns to represent features of unlabeled and dynamic stimuli, an essential requirement of the real brain. This suggests that the brain may not have evolved to efficiently represent all incoming stimuli, as implied by some leading theories. Instead, the selective representation of sensory features that help in predicting the future may be a general coding principle for extracting temporally-structured features that depend on increasingly high-level statistics of the visual input.
Introduction
Temporal prediction11 relates to a class of principles, such as information bottleneck9,10,12 and slow feature analysis18, that involve selectively encoding only those features that are efficiently predictive of the future. Such principles have value in finding features that can guide future action, uncovering underlying variables, and discarding irrelevant information9,11. This class of principles differs from others that are more typically used to explain sensory coding – efficient coding16,19, sparse coding14 and predictive coding17,20 – that aim instead to efficiently represent all current and perhaps past input. Although these principles have been successful in accounting for visual receptive field properties in areas such as V111,12,14,16–20, it remains to be demonstrated whether a single principle can explain the diverse spatiotemporal tuning that emerges along the dorsal visual stream, which is responsible for the processing of complex object motion.
Any general principle of visual encoding needs to explain temporal aspects of neural tuning – the encoding of movies rather than static images. It is also important that any general principle is largely unsupervised. Some features of the visual system have been reproduced by deep supervised network models optimized for image classification on large labeled datasets13. While these models can help explain the likely hard-wired retina21, they are less informative if neuronal tuning is influenced by experience, as in cortex, since most sensory input is unlabeled except for sporadic reinforcement signals. The temporal prediction approach is unsupervised (requires no labeled data), inherently applies over the temporal domain, and can account for temporal aspects of V1 simple cell receptive fields11. We therefore investigated whether this principle can furthermore account for the emergence of motion processing along the visual pathway, from retina to higher cortical areas.
We instantiated temporal prediction as a hierarchical model consisting of stacked feedforward single-hidden-layer convolutional neural networks (Fig. 1). The first stack was trained to predict the immediate future frame (40 ms) of unfiltered natural video inputs from the previous 5 frames (200 ms). Each subsequent stack was then trained to predict the future hidden-unit activity of the stack below it from the past activity in response to the natural video inputs. The four stacks contained 50, 100, 200 and 400 hidden units, respectively. L1 regularization was applied to the weights of each stack, akin to a constraint on neural wiring.
Results and Discussion
After training, we examined the input weights of the units in the first stack. Each hidden unit can be viewed as a linear non-linear (LN) model22,23, as commonly used to describe neuronal receptive fields. With L1 regularization slightly above the optimum for prediction, the receptive fields of the units showed spatially-localized center-surround tuning with a decaying temporal envelope, characteristic of retinal and lateral geniculate nucleus (LGN) neurons1,24. The model units have either an excitatory center with a suppressive surround (ON units) or vice versa (OFF units). Both ON and OFF units can have either small RFs that do not change polarity (Fig. 2a, units 3-6; Fig. 2b, bottom left) over time, or large RFs that switch polarity over time (Fig. 2a, units 1-2; Fig. 2b, top right). This is reminiscent of the four main cell types in the primate retina and LGN: the parvocellular-pathway ON/OFF neurons and the more change-sensitive magnocellular-pathway ON/OFF neurons, respectively25.
Interestingly, simply decreasing L1-regularization strength causes the model RFs to change from center-surround tuning to Gabor-like tuning, resembling localized oriented bars that shift over time (Fig. 2c). It is possible that this balance, between a code that is optimal for prediction and one that prioritizes efficient wiring, might underlie differences in the retina and LGN of different species. The retina of mice and rabbits contains many neurons with oriented and direction-tuned RFs, whereas cats and macaques mostly have center-surround RFs26. Efficient retinal wiring may be more important in some species, due, for example, to different constraints on the width of the optic nerve or different impacts of light scattering by superficial retinal cell layers.
Using the trained center-surround-tuned network as the first stack, a second stack was added to the model and trained. The output of each second stack unit results from a linear-nonlinear-linear-nonlinear transformation of the visual input, and hence we estimated their RFs by reverse correlation with Gaussian noise input. The resulting RFs were Gabor-like over space, resembling those of V1 simple cells2–4. The RF envelopes decayed into the past, and often showed spatial shifts or polarity changes over time, indicating direction or flicker sensitivity, as is also seen in V127 (Fig. 3a,b, I; Supplementary Fig. 1). Using full-field drifting sinusoidal gratings (Fig. 3a,b II), we found that most units were selective for stimulus orientation, spatial and temporal frequency (Fig. 3a,b, IV-VI), and some were also direction selective (Fig. 3b). Responses to the optimum grating typically oscillate over time between a maximum when the grating is in phase with the RF and 0 when the grating is out of phase (Fig. 3a,b, III). These response characteristics are typical of V1 simple cells28.
In the third and fourth stack, we followed the same procedures as in the second stack. Most of these units are also tuned for orientation, spatial frequency, temporal frequency and in some cases for direction (Fig. 3c,d, IV-VI; Supplementary Figs. 2–3). However, while some units resembled simple cells, most resembled the complex cells of V1 and secondary visual areas (V2/V3)3. Complex cells are tuned for orientation and spatial and temporal frequency, but are relatively invariant to the phase of the optimum grating29; each cell’s response to its optimum grating has a high mean value and changes little with the grating’s phase (Fig. 3c,d, II,III). Whether a neuron is assigned as simple or complex is typically based on the modulation ratio in such plots (<1 indicates complex)30. Model units with low modulation ratios had little discernible structure in their RFs (Fig. 3c,d, I), another characteristic feature of V1 complex cells31,32.
We quantified the tuning characteristics of units in stacks 2-4 and compared them to published V1 data33 (Fig. 4a-j). The distribution of modulation ratios is bimodal in both V130,33 and our model (Fig. 4a). Both model and real neurons were typically orientation selective with the model units more biased towards weaker tuning, as measured by orientation bandwidth (median data33: 23.5°, model: 37.5°; Fig 4b) and circular variance (median data33: simple cells 0.44, complex cells 0.69; median model: simple cells 0.46, complex cells 0.87; Fig 4c,d). Orientation-tuned units (circular variance < 0.9) in the second stack were exclusively simple (modulation ratios > 1), whereas those in subsequent stacks become increasingly complex (Fig. 4a,c-f). In both model and data, circular variance is inversely correlated with the modulation ratio (Fig. 4c-e,h). As in V1, model units showed a range of direction selectivity preferences, with simple cells biased towards being direction tuned (direction selectivity index, DSI ≥ 0.5; layer 4 cat V134 64%, n=220, model 68%, n=156), while the DSI values of complex cells tend to be more evenly distributed in V134 and lower in the model (22% with DSI ≥ 0.5, n=205; Fig. 4g,j).
Simple and complex cells extract many dynamic features from natural scenes. However, their small RFs prevent individual neurons from tracking the motion of complex objects because of the aperture problem; the direction of motion of an edge is ambiguous, with only the component of motion perpendicular to the cell’s preferred orientation being represented. Two classes of neurons exist that can recover 2-dimensional motion information and overcome the aperture problem. End-stopped neurons, found in primary and secondary visual areas, respond unambiguously to the direction of motion of endpoints of restricted moving contours35. Pattern-selective neurons, in MT of primates, solve the problem for the more general case, likely by integrating over input from many direction-selective V1 complex cells5,6,8,36,37, and hence respond selectively to the motion of patterns as a whole.
To investigate end-stopping in our model units, we applied circular masks of different sizes to the grating stimuli. Some units displayed end-stopping, responding most strongly to gratings with an intermediate mask radius, with the response decreasing as the radius increased beyond this (Fig. 5a,b). To determine whether these end-stopped units unambiguously represent the direction of motion of end-points, we measured two-bar response maps35, which determine response dependence on the horizontal and vertical components of motion (see Methods). Recordings from V1 indicate that more strongly end-stopped neurons have a weak tendency for less ambiguous motion tuning in these maps35. Consistent with this, our model produces examples of end-stopped units with less ambiguous motion tuning (Fig. 5b,c, first two panels) and non-end-stopped units with more ambiguous motion tuning (Fig. 5b,c, last panel).
To investigate pattern selectivity in our model units, we measured their responses to drifting plaids, comprising two superimposed drifting sinusoidal gratings with different orientations. The net direction of the plaid movement lies midway between these two orientations (Fig. 5d). In V1 and MT, component-selective cells respond maximally when the plaid is oriented such that either one of its component gratings moves in the preferred direction of the cell (as measured by a drifting grating). This results in two peaks in plaid-direction tuning curves5,7,8. Conversely, pattern-selective cells have a single peak in their direction tuning curves, when the plaid’s direction of movement aligns with the preferred direction of the cell5,7,8,38. We see examples of both component-selective units (in stacks 2-4) and pattern-selective units (only in stack 4) in our model, as indicated by plaid-direction tuning curves (Fig. 5e) and plots of response as a function of the directions of each component (Fig. 5f).
Previous non-hierarchical models of retina or primary cortex alone have demonstrated retina-like61, simple-cell like11, or primary auditory cortical features11 using principles related to temporal prediction. However, any general principle of neural representation should be extendable to a hierarchical form and not tailored to the region it is attempting to explain. Here we show that temporal prediction can indeed be made hierarchical and so reproduce the major motion-tuning properties that emerge along the dorsal visual pathway from retina to MT. This includes center-surround retinal features, direction-selective simple and complex cells, and complex-motion sensitive units resembling end-stopped and pattern-sensitive neurons. This suggests that, by learning behaviorally-useful features from dynamic unlabeled data, temporal prediction may represent a fundamental coding principle in the brain.
Methods
Data used for model training and testing
Visual inputs
Videos (grayscale, without sound, sampled at 25 fps) of wildlife in natural settings were used to create visual stimuli for training the artificial neural network. The videos were obtained from http://www.arkive.org/species, contributed by: BBC Natural History Unit, http://www.gettyimages.co.uk/footage/bbcmotiongallery; BBC Natural History Unit & Discovery Communications Inc., http://www.bbcmotiongallery.com; Granada Wild, http://www.itnsource.com; Mark Deeble & Victoria Stone Flat Dog Productions Ltd., http://www.deeblestone.com; Getty Images, http://www.gettyimages.com; National Geographic Digital Motion, http://www.ngdigitalmotion.com. The longest dimension of each video frame was clipped to form a square image. Each frame was then down-sampled (using bilinear interpolation) over space, to provide 180 x 180 pixel frames. The videos were cut into non-overlapping clips, each of 20 frames duration (800 ms). We used a training set of N = ∼1305 clips from around 17 min of video, and a validation set of N = ∼145 clips. Finally, each clip was normalized by subtracting the mean and dividing by the standard deviation of that clip.
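The per-clip normalization step can be sketched as follows (an illustrative numpy sketch, not the original training code; the array shapes are the ones described above):

```python
import numpy as np

def normalize_clip(clip):
    """Normalize one clip by subtracting its mean and dividing by its
    standard deviation, as applied to each 20-frame training clip."""
    return (clip - clip.mean()) / clip.std()

# Two toy 20-frame, 180 x 180 clips standing in for the video data.
rng = np.random.default_rng(0)
clips = rng.uniform(0.0, 255.0, size=(2, 20, 180, 180))
normalized = np.stack([normalize_clip(c) for c in clips])
```

Each normalized clip then has zero mean and unit standard deviation.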
Hierarchical temporal prediction model
The model and cost function
The hierarchical temporal prediction model consisted of stacked feedforward single-hidden-layer 3D convolutional neural networks. Each stack consisted of an input layer, a convolutional hidden layer and a ‘transposed convolutional’ output layer. Each unit (convolutional kernel) in the hidden layer performed 3D convolution over its inputs (over time and 2D space; Figure 1) and its output was determined by passing the result of this operation through a rectified linear function. Following the hidden layer there was a ‘transposed convolutional’ output layer, which again performed convolution (and dilation for stride >1). Each stack was trained to minimize the difference between its output and its target. The target was the input at the immediate future time-step.
The first stack of the model was trained to predict the immediate future frame (40 ms) of unfiltered natural video inputs from the previous 5 frames (200 ms). Each subsequent stack was then trained to predict the immediate future hidden-unit activity of the stack below it from the past hidden-unit activity in response to the natural video inputs. This process was repeated until 4 stacks had been trained. The first stack used 50 hidden units and this number was doubled with each added stack, until we had 400 units in the 4th stack.
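The forward pass of a single stack can be illustrated with a deliberately simplified numpy sketch: a single input channel, stride 1, ReLU hidden units, and scalar per-unit readout weights standing in for the full transposed-convolutional output layer (these simplifications are ours, for illustration only):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv3d_valid(inp, kernel):
    """Naive 'valid' 3D cross-correlation of inp (T, X, Y) with
    kernel (t, x, y)."""
    T, X, Y = inp.shape
    t, x, y = kernel.shape
    out = np.zeros((T - t + 1, X - x + 1, Y - y + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(inp[i:i+t, j:j+x, k:k+y] * kernel)
    return out

def stack_forward(frames, W_hid, b_hid, W_out, b_out):
    """One temporal-prediction stack: ReLU convolutional hidden responses,
    followed by a linear readout that predicts activity one step ahead.

    frames: (T, X, Y) clip (single input channel, stride 1).
    W_hid:  (J, t, x, y) hidden kernels; b_hid: (J,) hidden biases.
    W_out:  (J,) per-unit readout weights (a toy stand-in for the
            (X', Y', 1) output kernels); b_out: scalar output bias.
    """
    hidden = np.stack([relu(conv3d_valid(frames, W_hid[j]) + b_hid[j])
                       for j in range(W_hid.shape[0])])
    prediction = np.tensordot(W_out, hidden, axes=1) + b_out
    return hidden, prediction

rng = np.random.default_rng(1)
frames = rng.standard_normal((6, 8, 8))          # 6 frames of 8 x 8 "video"
W_hid = rng.standard_normal((3, 2, 3, 3)) * 0.1  # 3 hidden kernels
b_hid = np.zeros(3)
W_out = rng.standard_normal(3) * 0.1
hidden, prediction = stack_forward(frames, W_hid, b_hid, W_out, 0.0)
```

For the next stack, `hidden` would become the multi-channel input, and the prediction target would be `hidden` shifted one time step into the future.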
More formally, each stack of the model can be described by a network of the same form. The input to the network has i = 1 to I input channels. For channel i, for clip n, the input Uin is a rank-3 tensor spanning time and 2D-space with x = 1 to X and y = 1 to Y spatial positions, and t = 1 to T time steps. Throughout the Methods, capital, bold and underlined variables are rank-3 tensors over time and 2D-space, otherwise variables are scalars. The first stack has only a single input channel (the grayscale input frames, I = 1). Each subsequent stack had as many input channels (I) as the number of hidden units (feature maps) in the previous stack.
The network has a single hidden layer of j = 1 to J convolutional kernels. For clip n and kernel j, the output of each kernel is a rank-3 tensor over time and 2D-space, Hjn:

Hjn = f(Σi Wji ∗ Uin + bj)    (Equation 1)
The parameters in Equation 1 are the connective input weights of kernels Wji (between each input channel i and hidden unit j) and the bias bj (of hidden unit j). f( ) is the rectified linear function and * is the 3D convolutional operator over the 2 spatial and 1 temporal dimensions of the input, with stride (s1, s2, s3). Each hidden layer kernel Wji is 3D with size (X’,Y’,T’). No zero padding is applied to the input.
The output of the network predicts the future activity of the input. Hence, the number of input channels (I) always equals the number of output channels (K) for each stack. To ensure that the predicted output has the same size as the input when a stride >1 is used, the hidden layer representation is dilated by adding s−1 zeros between adjacent elements, where s = (s1, s2, s3) is the stride of the convolutional operator in the hidden layer. The dilated hidden-unit output is denoted H̃jn. When the stride is 1, H̃jn = Hjn.
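The zero-insertion dilation applied when stride > 1 can be sketched as follows (an illustrative numpy sketch; axis ordering is assumed to be time, then the two spatial dimensions):

```python
import numpy as np

def dilate(h, stride):
    """Insert (s - 1) zeros between adjacent elements along each axis,
    upsampling the hidden representation back to the input's size.

    h: hidden activity, shape (T, X, Y); stride: (s1, s2, s3), matching
    the axes of h.
    """
    out_shape = tuple((n - 1) * s + 1 for n, s in zip(h.shape, stride))
    out = np.zeros(out_shape, dtype=h.dtype)
    out[::stride[0], ::stride[1], ::stride[2]] = h
    return out

h = np.arange(1.0, 5.0).reshape(1, 2, 2)
d = dilate(h, (1, 2, 2))   # temporal stride 1, spatial stride 2
```

Here a 1 x 2 x 2 hidden activity becomes 1 x 3 x 3, with one zero inserted between adjacent elements along each spatial axis.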
The activity of each output channel k is the estimate V̂kn of the true future Vkn given the past Uin. Vkn is simply Uin shifted 1 time step into the future, and k = i. This prediction is estimated from the hidden unit output by:

V̂kn = Σj Wkj ∗ H̃jn + bk    (Equation 2)

where H̃jn is the dilated hidden-unit output.
The parameters in Equation 2 are the connective output kernels Wkj (the weights between each hidden unit j and output channel k) and the bias bk. Each output kernel Wkj is 3D with size (X’,Y’,1), predicting a single time-step into the future based on hidden layer activity in that portion of space.
The parameters Wji, Wkj, bj, and bk were optimized for the training set by minimizing the cost function given by:

E = Σn Σk ‖V̂kn − Vkn‖2² + λ(Σj Σi ‖Wji‖1 + Σk Σj ‖Wkj‖1)    (Equation 3)

where ‖ ‖p is the entrywise p-norm of the tensor over time and 2D-space: the p = 2 norm is the square root of the sum of squares of all values in the tensor, and the p = 1 norm is the sum of their absolute values. Thus, E is the sum of the squared errors between the predictions V̂kn and the targets Vkn, plus an L1-norm regularization term, which is proportional to the sum of the absolute values of all weights in the network and whose strength is determined by the hyperparameter λ. This regularization tends to drive redundant weights to near zero and provides a parsimonious network.
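The cost function can be sketched numerically as follows (illustrative only; `pred`, `target` and `weights` here are toy stand-ins for the network's predictions, targets and kernels):

```python
import numpy as np

def temporal_prediction_cost(pred, target, weights, lam):
    """Squared prediction error plus an L1 penalty on all weights."""
    error = np.sum((pred - target) ** 2)
    l1 = sum(np.sum(np.abs(w)) for w in weights)
    return error + lam * l1

pred = np.array([[1.0, 2.0], [3.0, 4.0]])
target = np.array([[1.0, 1.0], [3.0, 2.0]])
weights = [np.array([0.5, -0.5]), np.array([[1.0, -2.0]])]
cost = temporal_prediction_cost(pred, target, weights, lam=0.1)
```

With these toy values, the squared error is 5.0 and the L1 term is 4.0, giving a cost of 5.0 + 0.1 × 4.0 = 5.4.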
Implementation details
The networks were implemented in Python (https://lasagne.readthedocs.io/en/latest/; http://deeplearning.net/software/theano/). The objective function was minimized using backpropagation as performed by the Adam optimization method62 with hyperparameters β1 and β2 kept at their default settings of 0.9 and 0.999, respectively, and the learning rate (α) varied as detailed below. Training examples were split into minibatches of 32 training examples each.
During model network training, several hyperparameters were varied, including the regularization strength (λ) and the learning rate (α). For each hyperparameter setting, the training algorithm was run for 1000 iterations. The effect of varying λ on the prediction error (the first term of Equation 3) and receptive field structure of the first stack is shown in Fig. 2. For all subsequent stacks, the results presented are the networks that predicted best on the validation set after 1000 iterations through the training data. The settings for each stack are presented in Table 1:
Model unit spatiotemporal extent and receptive fields
Due to the convolutional form of the hidden layer, each hidden unit can potentially receive input from a certain span over space and time. We call this the unit’s spatial and temporal extent. For stack 1, this extent is given by the kernel size (21 x 21 x 5, space x space x time). For stack 2, the extent of each hidden unit is a function of its kernel size and the kernel size and stride of the hidden units in the previous stack, resulting in an extent of 41 x 41 x 9. Similarly, the extent of each hidden unit in stack 3 is 61 x 61 x 13 and in stack 4 is 81 x 81 x 17. The receptive field size of a unit can be considerably smaller than the unit’s extent.
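The extents quoted above follow from the standard receptive-field recursion for stacked convolutions; the sketch below reproduces them under the assumption (consistent with the numbers in the text) of 21 x 21 x 5 kernels and stride 1 in every stack:

```python
def receptive_extent(kernel_sizes, strides):
    """Extent of units in each stack along one dimension, via the standard
    recursion extent_n = extent_(n-1) + (k_n - 1) * jump_(n-1), with
    jump_n = jump_(n-1) * s_n."""
    extent, jump, extents = 1, 1, []
    for k, s in zip(kernel_sizes, strides):
        extent = extent + (k - 1) * jump
        jump *= s
        extents.append(extent)
    return extents

# 21-pixel spatial kernels and 5-frame temporal kernels, stride 1 throughout.
spatial = receptive_extent([21, 21, 21, 21], [1, 1, 1, 1])
temporal = receptive_extent([5, 5, 5, 5], [1, 1, 1, 1])
```

This recovers the spatial extents 21, 41, 61, 81 and temporal extents 5, 9, 13, 17 stated in the text.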
In the first stack of the model, the combination of linear weights and nonlinear activation function are similar to the basic linear non-linear (LN) model22,23 commonly used to describe neuronal RFs. Hence, the input weights between the input layer and a hidden unit of the model network are taken directly to represent the unit’s RF, indicating the features of the input that are important to that unit. The output activities of hidden units in stacks 2-4 are transformations with multiple linear and nonlinear stages, and hence we estimated their RFs by applying reverse correlation to 100,000 responses to Gaussian noise input with mean 0 and standard deviation 1.5 to stack 1.
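Reverse correlation of this kind can be sketched as a response-weighted average of Gaussian noise stimuli (an illustrative sketch using a toy LN unit, not the exact analysis code; subtracting the mean response is our choice of estimator):

```python
import numpy as np

def reverse_correlate(stimuli, responses):
    """Estimate a unit's RF as the response-weighted average of the
    stimuli.  stimuli: (n, T, X, Y); responses: (n,)."""
    weights = responses - responses.mean()
    return np.tensordot(weights, stimuli, axes=1) / len(responses)

# Toy LN unit with a known 1-frame, 5 x 5 filter.
rng = np.random.default_rng(2)
true_rf = np.zeros((1, 5, 5))
true_rf[0, 2, 2], true_rf[0, 2, 1] = 1.0, -0.5
stimuli = rng.normal(0.0, 1.5, size=(20000, 1, 5, 5))  # Gaussian noise
responses = np.maximum(np.tensordot(stimuli, true_rf, axes=3), 0.0)
sta = reverse_correlate(stimuli, responses)
```

With enough noise samples, the estimate `sta` is proportional to the unit's underlying linear filter.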
In vivo V1 RF data
Responses to drifting gratings measured using recordings from V1 simple and complex cells were compared against the model (Fig. 4). The in vivo data were taken from http://www.ringachlab.net/lab/Data.html33.
RF size vs proportion of RF switching polarity
We measured the size of the receptive fields of the units in the first stack and examined the relationship between the RF size and the proportion of the RF switching polarity. For each unit, all pixels in the most recent time-step of the RF with intensities ≥50% of the maximum pixel intensity in that time-step are included in the RF. The RF size was determined by counting the number of pixels fitting this criterion. We then counted the proportion of pixels included in the RF that changed sign (either positive to negative or vice versa) between the two most recent timesteps. The relationship between these two properties for the units in the first stack is shown in Fig. 2b.
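A minimal sketch of this measurement (assuming, as one plausible reading, that pixel intensities are compared by absolute value):

```python
import numpy as np

def rf_size_and_polarity_switch(rf):
    """rf: (T, X, Y) spatiotemporal RF, with rf[-1] the most recent frame.

    RF pixels: those in the most recent frame whose |intensity| is >= 50%
    of that frame's maximum |intensity|.  Returns the pixel count and the
    proportion of those pixels whose sign flips between the two most
    recent frames.
    """
    last, prev = rf[-1], rf[-2]
    mask = np.abs(last) >= 0.5 * np.abs(last).max()
    size = int(mask.sum())
    switched = np.sign(last[mask]) != np.sign(prev[mask])
    return size, float(switched.mean())

# Toy ON-center RF whose center (but not flanks) flips polarity over time.
rf = np.zeros((2, 3, 3))
rf[-1] = [[0.0, 0.6, 0.0], [0.6, 1.0, 0.6], [0.0, 0.6, 0.0]]
rf[-2] = [[0.0, -0.6, 0.0], [0.6, -1.0, 0.6], [0.0, -0.6, 0.0]]
size, prop = rf_size_and_polarity_switch(rf)
```

In this toy example the RF comprises 5 pixels, 3 of which switch sign, giving a switching proportion of 0.6.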
Drifting sinusoidal gratings
In order to characterize the tuning properties of the model’s visual RFs, we measured the responses of each unit to full-field drifting sinusoidal gratings. For each unit, we measured the response to gratings with a wide range of orientations, spatial frequencies and temporal frequencies until we found the parameters that maximally stimulated that unit (giving rise to the highest mean response over time). We define this as the optimal grating for that unit. In cases where orientation or tuning curves were measured, the gratings with optimal spatial and temporal frequency for that unit were used and were varied over orientation. Each grating oscillated between amplitudes of +3 and −3 around a gray (0) background. Some units, especially in higher stacks, had weak or no responses to drifting sinusoidal gratings. To account for this, we excluded any units with a mean response (over time) <1% of the maximum mean response of all the units in that stack. As a result of this, 0/100, 88/200 and 261/400 units were excluded from the 2nd, 3rd and 4th stacks, respectively.
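Generating a full-field drifting grating of this kind can be sketched as follows (illustrative; the units of cycles/pixel and cycles/frame are our parameterization):

```python
import numpy as np

def drifting_grating(shape, orientation, sf, tf, amplitude=3.0):
    """Full-field drifting sinusoidal grating.

    shape: (T, X, Y); orientation in radians; sf in cycles/pixel;
    tf in cycles/frame.  Values oscillate between +/- amplitude
    around gray (0).
    """
    T, X, Y = shape
    t, x, y = np.meshgrid(np.arange(T), np.arange(X), np.arange(Y),
                          indexing='ij')
    phase = 2 * np.pi * (sf * (x * np.cos(orientation) +
                               y * np.sin(orientation)) - tf * t)
    return amplitude * np.sin(phase)

g = drifting_grating((10, 16, 16), orientation=np.pi / 4,
                     sf=0.125, tf=0.1)
```

Sweeping `orientation`, `sf` and `tf` while recording a unit's mean response over time then locates its optimal grating.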
We measured several aspects of the V1 neuron and model unit responses to the drifting gratings. For each unit we measured the circular variance, orientation bandwidth, modulation ratio and direction selectivity.
As a control, we examined the receptive fields and responses to drifting gratings of each unit with its immediate input weights shuffled. In this case, the receptive fields lacked discernible structure, with only patchy spatial frequency and orientation tuning in response to gratings. There were very few orientation-tuned (circular variance < 0.9) units with modulation ratios <1.
Circular variance
Circular variance (CV) is a global measure of orientation selectivity. For a unit with mean response over time rq to a grating with angle θq, where the angles θq span the range of 0 to 360° in equally spaced intervals of 5° (and are expressed in radians in the formula), the circular variance is defined as33:

CV = 1 − |Σq rq e^(i2θq)| / Σq rq
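This definition can be checked numerically (a sketch; an untuned unit gives CV = 1, while a cos(2θ)-tuned unit gives CV = 0.5):

```python
import numpy as np

def circular_variance(responses, angles_deg):
    """CV = 1 - |sum_q r_q exp(i*2*theta_q)| / sum_q r_q,
    with theta_q in radians."""
    theta = np.deg2rad(angles_deg)
    r = np.asarray(responses, dtype=float)
    return 1.0 - np.abs(np.sum(r * np.exp(2j * theta))) / np.sum(r)

angles = np.arange(0, 360, 5)                    # 5-degree steps
flat = np.ones(len(angles))                      # untuned unit
tuned = np.cos(np.deg2rad(2 * angles)) + 1.0     # orientation-tuned unit
cv_flat = circular_variance(flat, angles)
cv_tuned = circular_variance(tuned, angles)
```

CV near 1 indicates weak orientation selectivity; lower values indicate sharper tuning.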
Orientation bandwidth
We measured the orientation bandwidth33, which is a more local measure of orientation selectivity. First, we smoothed the direction tuning curve with a Hanning window filter with a half width at half height of 13.5°. We then determined the peak of the orientation tuning curve. The orientation angles closest to the peak for which the response was 1/√2 (or 70.7%) of the peak response were measured. The orientation bandwidth was defined as half of the difference between these two angles. We limited the maximum orientation bandwidth to 180°.
Modulation ratio
We measured the modulation ratio of each unit’s response to its optimal sinusoidal grating. The modulation ratio is defined as F1/F0, where F1 is the amplitude of the best-fitting sinusoid to the unit’s response over time to the drifting grating and F0 is the mean response to the grating over time.
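F1 can be obtained as the Fourier amplitude at the grating's drift frequency; the sketch below illustrates this (our implementation choice, not necessarily the paper's exact fitting procedure):

```python
import numpy as np

def modulation_ratio(response, tf_cycles_per_frame):
    """F1/F0: amplitude of the sinusoid at the grating's temporal
    frequency, divided by the mean response over time."""
    r = np.asarray(response, dtype=float)
    t = np.arange(len(r))
    # Fourier amplitude at the drift frequency.
    f1 = 2.0 * np.abs(np.mean(r * np.exp(-2j * np.pi *
                                         tf_cycles_per_frame * t)))
    f0 = r.mean()
    return f1 / f0

t = np.arange(40)
# Half-wave-rectified sinusoid: a simple-cell-like, phase-modulated response.
simple_like = np.maximum(np.sin(2 * np.pi * 0.1 * t), 0.0)
# Constant elevated response: a complex-cell-like, phase-invariant response.
complex_like = np.ones(40)
mr_simple = modulation_ratio(simple_like, 0.1)
mr_complex = modulation_ratio(complex_like, 0.1)
```

The phase-modulated response yields a modulation ratio above 1 (simple-cell-like), while the phase-invariant response yields a ratio near 0 (complex-cell-like).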
Direction selectivity index
To measure the direction selectivity index, we obtained each unit’s direction tuning curve at its optimal spatial and temporal frequency. We measured the peak of the direction tuning curve, indicating the unit’s response to gratings presented in the preferred direction (rp), as well as the response to the grating presented in the opposite (non-preferred) direction (rnp). The direction selectivity index is then defined as:

DSI = (rp − rnp) / (rp + rnp)
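A sketch of this measurement, assuming the common (rp − rnp)/(rp + rnp) form of the index:

```python
import numpy as np

def direction_selectivity_index(tuning):
    """tuning: dict mapping direction (degrees) -> mean response.
    DSI = (r_p - r_np) / (r_p + r_np), where r_np is the response to
    the direction 180 degrees opposite the preferred one."""
    directions = np.array(sorted(tuning))
    responses = np.array([tuning[d] for d in directions])
    preferred = int(directions[np.argmax(responses)])
    r_p = float(responses.max())
    r_np = tuning[(preferred + 180) % 360]
    return (r_p - r_np) / (r_p + r_np)

# Toy direction tuning curve, preferred direction at 0 degrees.
tuning = {0: 10.0, 90: 4.0, 180: 2.0, 270: 4.0}
dsi = direction_selectivity_index(tuning)   # (10 - 2) / (10 + 2)
```

A DSI of 0 indicates no direction preference, while values approaching 1 indicate strong direction selectivity.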
Measuring end-stopping
In order to measure the effects of end-stopping, we measured the responses of the hidden units to the same set of drifting gratings but with a circular mask applied to the inputs (e.g. Fig 5a, ii). Masks with a range of spatial extents were tested and the response of the units as a function of this spatial extent was measured (Fig. 5b).
Sparse noise stimuli and two-bar maps
We measured the responses of the hidden units to ‘sparse noise’ (moving two-bar) stimuli35. Each stimulus contained a single oriented bar over the two most recent time-steps and a blank stimulus in the preceding time-steps. For each unit, the bar was oriented in the preferred orientation (as measured using drifting sinusoidal gratings) of the unit being probed. The length and width of the bar were limited to 50% and 10% of the unit’s spatial extent, respectively. This was typically enough for the bar to be longer than the unit’s spatial receptive field. In the first time-step with a bar, the center position of the bar (its x and y coordinate) was selected from a dense grid of spatial positions starting from the center position and covering 1/3 of the unit’s spatial extent in each direction. In the second time-step, another center position was selected from the same grid. This way, displacement of the bar from each grid position to each other grid position was used to stimulate the unit. To generate two-bar response maps, we measured the response of the unit as a function of the vertical and horizontal displacement (the starting position minus the end position) and then averaged over starting position. This gives a map of the unit’s response as a function of the displacement of the stimulus regardless of the starting position. We performed this procedure using all combinations of pairs of white (amplitude +3) and black (amplitude −3) bars on a gray (amplitude 0) background. This yielded four maps (white-to-white, black-to-black, white-to-black and black-to-white). We then summed the same contrast maps (white-to-white and black-to-black) and subtracted the opposite contrast maps (white-to-black and black-to-white) to yield the final two-bar map for each unit. This preserves directional responses while eliminating the responses that depend only on the spatial position of the bars in each frame35.
Examining the two-bar maps, the position (0,0) indicates that the bar was in the same position in two successive frames, while the vertical and horizontal axes indicate movement in these directions. Positive activity indicates that the unit was excited by movement in that direction, while negative activity indicates inhibition of the unit to movement in the given direction. A non-end-stopped unit will respond to any movement with a component in the preferred direction of the cell. This results in an elongated response profile on the two-bar map (Fig. 5c, right). An end-stopped unit will only respond to movement in the cell’s preferred direction, resulting in a two-bar map whose excitatory activity is limited to a more circumscribed region35 (Fig. 5c, left).
Drifting plaid stimuli
In order to test whether units were pattern selective, we measured their responses to drifting plaid stimuli. Each plaid stimulus was composed of two superimposed half-intensity (amplitude 1.5) sinusoidal gratings with different orientations. The net direction of the plaid movement lies midway between these two orientations (Fig. 5d). As with the sinusoidal inputs, plaids with a variety of orientations, spatial frequencies, temporal frequencies and spatial extents (as defined by the extent of a circular mask) were tested. For each unit, the direction tuning curves of the optimal plaid stimulus (that giving rise to the largest mean response over time) was measured (Fig. 5d-f).
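A plaid of this kind is simply the sum of two half-amplitude gratings whose drift directions straddle the net direction; a sketch (the parameterization is ours):

```python
import numpy as np

def drifting_plaid(shape, direction, sep, sf, tf, amplitude=1.5):
    """Plaid = sum of two half-intensity gratings whose drift directions
    sit +/- sep/2 either side of the plaid's net direction (radians).

    shape: (T, X, Y); sf in cycles/pixel; tf in cycles/frame.
    """
    T, X, Y = shape
    t, x, y = np.meshgrid(np.arange(T), np.arange(X), np.arange(Y),
                          indexing='ij')
    plaid = np.zeros(shape)
    for d in (direction - sep / 2, direction + sep / 2):
        phase = 2 * np.pi * (sf * (x * np.cos(d) + y * np.sin(d)) - tf * t)
        plaid += amplitude * np.sin(phase)
    return plaid

p = drifting_plaid((8, 16, 16), direction=0.0, sep=np.pi / 2,
                   sf=0.125, tf=0.1)
```

Sweeping `direction` while recording a unit's mean response yields the plaid-direction tuning curves used to distinguish component- from pattern-selective units.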
Code and data availability
All custom code used in this study was implemented in Python. We will upload the code to a public GitHub repository upon acceptance. The movies used for training the models are all publicly available at the websites detailed in the Methods. The V1 data used for comparison are available at http://www.ringachlab.net/lab/Data.html33.