Abstract
Natural scenes sparsely activate neurons in the primary visual cortex (V1). However, whether and how sparsely active neurons represent natural image contents sufficiently and robustly has not been revealed. We reconstructed natural images from the neuronal activities of mouse V1. Single natural images were linearly decodable from a surprisingly small number (~20) of highly responsive neurons, which was made possible by the diverse receptive fields (RFs) of these neurons. Furthermore, these neurons represented the image robustly against trial-to-trial response variability. Synchronously active neurons with partially overlapping RFs formed functional clusters and were active in the same trials. Importantly, multiple clusters represented similar local image patterns but were active in different trials. Thus, integrating activities across the clusters led to a representation robust against the variability. Our results suggest that the diverse, partially overlapping RFs ensure the sparse and robust representation, and propose a new representation scheme in which information is reliably represented while the representing neuronal patterns change across trials.
Introduction
Sensory information is thought to be represented by a relatively small number of active neurons in the sensory cortex. This sparse representation has been observed in several cortical areas1–9 and is postulated to reflect efficient coding of the statistical features of sensory inputs4, 10. However, it has not been determined whether and how small numbers of active neurons represent sufficient information about sensory inputs.
In the primary visual cortex (V1), a type of neuron termed a simple cell has a receptive field (RF) structure that is spatially localized, oriented, and bandpass for a specific spatial frequency. This RF structure is modelled by a two-dimensional (2D) Gabor function11. According to theoretical studies, a single natural image is represented by a relatively small number of neurons with Gabor-like RFs, whereas information about multiple natural scenes is distributed across the neuronal population10,12,13. Indeed, V1 neurons respond sparsely to natural scenes at the single-cell level2, 3, 5–9 and the population level3,5,14. Population activity with higher sparseness exhibits greater discriminability between natural scenes5.
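For reference, one conventional parameterization of the 2D Gabor function (an illustrative form; the exact convention varies across studies) is

$$ g(x, y) = \exp\!\left(-\frac{x'^{2} + \gamma^{2} y'^{2}}{2\sigma^{2}}\right)\cos\!\left(\frac{2\pi x'}{\lambda} + \phi\right), \quad x' = x\cos\theta + y\sin\theta, \; y' = -x\sin\theta + y\cos\theta, $$

where θ is the preferred orientation, λ the spatial wavelength (the inverse of the preferred spatial frequency), φ the phase, σ the width of the Gaussian envelope, and γ its aspect ratio.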
What types of information from natural scenes are represented in sparsely active neuronal populations in the brain? The visual contents of natural scenes or movies have been reconstructed from populations of single-unit activities in the lateral geniculate nucleus (LGN) collected from several experiments15 and from functional magnetic resonance imaging (fMRI) data of the visual cortices16–19. However, it has not been addressed experimentally whether the visual contents of natural images are represented by small numbers of sparsely active neurons and whether the RFs of V1 neurons in the brain are suitable for representing natural images. Furthermore, do the sparsely active neurons reliably represent the natural image contents against trial-to-trial response variability? Although a computational model20 has suggested that a sparse and overcomplete representation is the optimal representation of natural images with unreliable neurons, this has not been examined experimentally.
We also addressed how visual information is distributed among neurons in a local population. It has been reported that subsets of neurons are ‘unresponsive’ to visual stimuli (e.g., responsive rates for visual stimuli in mouse V1 of 26–68%)21–27, indicating that only subsets of neurons represent sensory information. However, this may be partly because the stimulus properties do not completely cover the RF properties of all neurons. Thus, there are two extreme possibilities: sparsely active neurons are distributed among all neurons in a local population, or only a specific subset of cells processes natural images. What proportion of neurons is actually involved in information processing has been debated28, 29.
Here, we examined whether and how a small number of highly responsive V1 neurons is sufficient for the representation of natural image contents. Using two-photon Ca2+ imaging, we recorded visual responses to natural images from local populations of single neurons in V1 of anaesthetized mice. A small number of neurons (<3%) responded strongly to each natural image, a response that was sparser than that predicted by a linear encoding model. On the other hand, approximately 90% of neurons were activated by at least one of the natural images, revealing that most neurons in a local population are involved in natural image processing. We reconstructed the natural images from the activities to estimate the information about the visual contents. The visual contents of single natural images were linearly decodable from a small number (~20) of highly responsive neurons. The highly responsive neurons showed diverse RFs, which helped the small number of neurons represent complex natural images. Furthermore, the highly responsive neurons represented the image robustly against trial-to-trial response variability. We found that subsets of neurons whose RFs partially overlapped formed functional clusters based on correlated activities. Importantly, the clusters represented local images that were similar to each other, while their across-trial response variabilities were almost independent. Thus, integrating activities across the clusters led to a robust representation. We also found that the responsive neurons were only slightly shared between images, and many natural images were represented by combinations of responsive neurons in a population. Finally, the visual features represented by a local population were sufficient to represent the features in all the natural images we used. These results reveal a new, robust representation of natural images by a small number of neurons, in which information is reliably represented while the representing neuronal patterns change across trials. Preliminary results of this study have been published in abstract form30 and on a preprint server31.
Results
The main purpose of this study is to examine whether and how natural images are represented in the sparse representation scheme. We first confirm the sparse responses to natural images in our dataset. Next, we demonstrate that natural images can be reconstructed from a relatively small number of responsive neurons. Finally, we address how this small number of neurons robustly represents natural images against trial-to-trial response variability.
Sparse visual responses to natural images in mouse V1
We presented flashes of natural images as visual stimuli (Fig. 1a, see Methods) and simultaneously recorded the activities of several hundred single neurons from layers 2/3 and 4 of mouse V1 using two-photon calcium (Ca2+) imaging (560 [284–712] cells/plane, median [25–75th percentiles], n = 24 planes from 14 mice, 260–450 microns in depth; see Fig. 1b for representative response traces). Fig. 1c plots significant visual response events for all images (x-axis) across all neurons (y-axis) in one plane (n = 726 cells, depth: 360 microns from the brain surface). A significant response to an image was defined as an evoked response that was significantly different from 0 (p < 0.01, signed-rank test) and whose normalized response amplitude (z-score) was greater than 1 (see Methods). Hereafter, we call neurons showing these significant visual responses highly responsive. A few percent to 10% of neurons were highly responsive to a single image (5.1% [3.9–6.7%] cells/image, Fig. 1c bottom panel), indicating sparse visual responses to natural images. In contrast, nearly all neurons (98%, 711/726 cells) responded to at least one image (each cell responded to 4.5% [2.5–7.5%] of images, Fig. 1c right panel). Across planes, 2.7% of cells were activated by a single image ([1.8–3.2%], Fig. 1f), whereas almost all cells responded to at least one image (90% [86–93%], Fig. 1g). This low responsive rate to each image was not due to poor recording conditions: the same neurons responded well to moving gratings (27% [22–34%] for one direction, and 75% [66–79%] for at least one of 8 directions, Fig. 1h and i).
The highly responsive neurons only slightly overlapped between images. Fig. 1d presents representative activity patterns for three natural images (Fig. 1d, left column). Each image activated a different subset of neurons, with only small overlaps between images (Fig. 1d, right column). Of the responsive cells, 4.8% overlapped between two images (25–75th percentiles for 24 planes: 4.0–5.5%, Fig. 1j). We further computed the distributions of response amplitudes to single images (Fig. 1e). Only a small number of neurons exhibited visual responses with large amplitudes, a characteristic property of a sparse representation (Fig. 1e). Population sparseness2, 3, a measure of sparse representation, was comparable to a previous report for mouse V15 (0.36 [0.30–0.42], Fig. 1k, see Methods). Thus, each natural image activated a relatively small number of neurons, whereas most neurons in a local population were activated by at least one of the images, suggesting the sparsely distributed representation of natural images in V1 originally proposed in a previous study10. The latter result also provides the first report that most neurons in mouse V1 are visually responsive to natural image stimuli28, 29.
Partially overlapping representations of visual features among local V1 populations
We created encoding models for the visual responses of individual neurons to examine the visual features represented by each neuron. We used a set of Gabor wavelet filters (1248 filters, Supplementary Fig. 1a and b, see Methods) to extract the visual features from the natural images. Each natural image was passed through the Gabor filters and transformed into a set of feature values (Gabor feature values). For each neuron, we first selected the Gabor features that exhibited strong correlations with the visual response. The correlation threshold for feature selection was adjusted to maximize the visual response prediction (Supplementary Fig. 1c–e, see Methods). The visual response was then modelled by a linear regression of the selected feature values followed by non-linear scaling (Fig. 2a, see Methods). The visual response prediction of the model was estimated with a dataset different from that used in the regression (10-fold cross-validation, see Methods).
The visual response of an individual neuron was represented by a small number of Gabor features. In the example cells (Fig. 2b and c), the correlation coefficients between the observed responses and the responses predicted by the model were 0.76 and 0.89. These neurons were represented by 19 and 13 Gabor features, respectively (Fig. 2b and c, right panels), and their encoding filters (weighted sums of the Gabor filters) were spatially localized (Fig. 2b and c, insets in the right panels). In the representative plane presented in Fig. 1, the median prediction performance of the encoding model (i.e., the correlation coefficient between the observed and predicted responses) was 0.34 (25–75th percentiles: 0.16–0.52, n = 726 cells, Supplementary Fig. 1f), and the median performance of all cells across planes was 0.24 (25–75th percentiles: 0.07–0.45, n = 12755 cells across 24 planes, Supplementary Fig. 1i). An examination of the non-linear scaling function revealed that this step suppressed weak predicted responses and enhanced strong predicted responses (Fig. 2d and e for a representative cell and the average across planes, respectively), suggesting that this non-linear step enhanced the sparseness of the predicted response obtained from the linear step (i.e., the linear regression of feature values). On average, 2.0% of the features (25/1248 features, 25–75th percentiles: 2.0–2.1%) were represented in each cell of the example plane (upper panels in Fig. 2f and Supplementary Fig. 1g), and 2.1% were represented in each cell across all planes (~26/1248 features, 25–75th percentiles: 0.9–4.9%, n = 12755 cells, Fig. 2h and Supplementary Fig. 1k). These features were related to the RF structure of each cell (Supplementary Fig. 2). The RF structure of each cell was estimated using the regularized inverse method32–34 (see Methods). The regression weights of the Gabor features in the encoding model were positively correlated with the similarity between the corresponding Gabor filter and the RF structure (Supplementary Fig. 2a–d).
The Gabor features encoded by one cell partially overlapped with those of other cells in a local population (Fig. 2i). Among the 19 and 13 Gabor features represented by the two example cells (Fig. 2b and c), only two features overlapped. For all cell pairs across all planes, the median overlap was 3.4% (25–75th percentiles: 0.0–9.6% relative to the features represented by each cell, Fig. 2i and Supplementary Fig. 1h and 1l). The feature overlap between neurons was positively correlated with the similarity of their RF structures (Supplementary Fig. 2e–j). Thus, the Gabor features encoded by individual neurons in a local population were highly diverse and only partially overlapping.
The analysis of the encoding model also revealed how the individual Gabor features were encoded across neurons (upper left and bottom panels in Fig. 2f and g). As the spatial frequency (SF) of the Gabor filter increased (i.e., as the scale decreased), the corresponding feature contributed to the visual responses of fewer neurons (Fig. 2g). This pattern likely reflects the fact that Gabor filters with a low SF (i.e., a large scale) covered the RFs of more neurons, whereas Gabor filters with a high SF (i.e., a small scale) overlapped the RFs of fewer neurons. Furthermore, almost all features contributed to the responses of at least one cell (100% in the plane presented in Fig. 2f and 100% [99.4–100%] across all planes, median [25–75th percentiles], Fig. 2j).
Image reconstruction from the activities of the neuronal population
The encoding model revealed the Gabor features represented by each neuron. We next examined whether the features encoded in a local population of neurons were sufficient to represent the visual contents of the natural images. We reconstructed the stimulus images from the neuronal activities to evaluate the information about visual contents in the population activity15–19. Using the same Gabor features as in the encoding model, each Gabor feature value was independently reconstructed by a linear regression of the neuronal activities of multiple neurons (Fig. 3a and Supplementary Fig. 3a). The sets of reconstructed feature values were then transformed into images (Fig. 3a, see Methods). The reconstruction performance was estimated with a dataset different from that used in the regression (10-fold cross-validation, see Methods).
We first used all simultaneously recorded neurons to reconstruct the images. In the example plane (n = 726 neurons, presented in Figs. 1 and 2), the rough structures of the stimulus images were reconstructed from the population activities (“All-cells” in Fig. 3b). The reconstruction performances (pixel-to-pixel correlations between the stimulus and reconstructed images) were 0.45 [0.36–0.56] (median [25–75th percentiles] of 200 images) in the representative plane (n = 726 cells, Fig. 3c upper panel) and 0.36 [0.31–0.38] across all planes (n = 24 planes, “All cells” in Fig. 3d). Thus, the visual contents of natural images were linearly extractable from the neuronal activities of a local population in V1.
The encoding model used in the previous section revealed how each neuron encodes the Gabor features (Fig. 2f). We next examined whether these encoded features were sufficient for the representation of visual contents. In this analysis, each Gabor feature value was reconstructed with a subset of neurons selected using the encoding model (cell-selection model, Supplementary Fig. 3a, see Methods). In this model, different subsets of neurons were used to reconstruct different features (Fig. 2f). Across all features, almost all neurons were used to reconstruct at least one feature (Fig. 2j). Examples of images reconstructed with the cell-selection model are presented in Fig. 3b (Cell-selection). The reconstruction performance of the cell-selection model was comparable to, or even slightly higher than, that of the model using all cells (R = 0.49 [0.37–0.59] for the representative plane, Fig. 3c lower panel, and 0.36 [0.32–0.39] for all planes, median [25–75th percentiles], p = 4.0×10−4, signed-rank test, Fig. 3d). Thus, the Gabor features encoded by individual cells in a population captured sufficient information about the visual contents of the natural images. When the neurons were instead selected to maximize the reconstruction of each feature, the image reconstruction performance improved only slightly (Supplementary Fig. 3b–h). Thus, the main information about the visual contents was captured by the cell-selection model.
Visual contents of natural images are linearly decodable from small numbers of responsive neurons
Single natural images activated small numbers of neurons in a local population (Fig. 1). We next examined whether this small number of highly responsive neurons was sufficient to reconstruct a single image. For this purpose, we varied the number of neurons used in the reconstruction of each image and examined how many responsive neurons were sufficient for each image reconstruction. The parameters (weights and biases) of the cell-selection model were used in the reconstruction; only the number of neurons used was changed in this analysis.
Representative results are presented in Fig. 4a–c. For each image, neurons were sorted by visual response amplitude (descending order), first among the highly responsive neurons (red dots in Fig. 4a–c) and then among the remaining neurons (black dots in Fig. 4a–c). The image was reconstructed with the top N neurons (N = 1–726 cells), and the reconstruction performance was plotted against the number of neurons used (Fig. 4a–d). All of the highly responsive neurons, or even fewer neurons, were sufficient to reconstruct the image at a level fairly comparable to that obtained with all neurons (Fig. 4a–d). Overall, the performance of the highly responsive neurons was slightly better than that of all neurons (representative plane: R = 0.52 [0.40–0.64] for the responsive neurons and 0.49 [0.37–0.59] for all neurons, Fig. 4f; across planes: R = 0.38 [0.34–0.44] for the responsive neurons and 0.35 [0.31–0.40] for all neurons, median [25–75th percentiles], p = 3.2×10−4, signed-rank test, n = 24 planes, Fig. 4g). On average, only approximately 20 neurons were sufficient to achieve 95% of the peak performance (vertical line in Fig. 4d). Thus, the visual contents of single natural images were linearly decodable from small numbers of highly responsive neurons.
To represent the features in a natural image with a small number of neurons, the features represented by individual neurons should be diverse. Fig. 4e illustrates how individual responsive neurons contributed to the image reconstruction in the case presented in Fig. 4a. Each neuron had a specific pattern of contributions (reverse filter: the sum of Gabor filters × weights, see Methods), and these patterns varied across neurons (Fig. 4e top panels) while partially overlapping in the visual field. In neuron pairs that were highly responsive to the same image, the number of overlapping Gabor features was slightly increased compared with all pairs, but the percentage was still less than 10% (7.1% [1.0–16%] of features for all pairs and 8.1% [6.3–10%] of features across 24 planes, Fig. 4h–j, cf. Fig. 2g). These small overlaps and the diversity of the represented features among neurons should be useful for the representation of natural images by a relatively small number of highly responsive neurons.
Robust image representation by neurons with spatially overlapping representation
We next examined whether a single image was robustly represented by the small number of responsive neurons. We computed the reconstruction performance after dropping single cells (Fig. 5a and b; the cell # on the x-axis is the same as in Fig. 4d). Dropping a single cell had only a small effect on the reconstructed image (middle panels in Fig. 5a). On average, at most a 5% reduction in reconstruction performance was observed for the best-responding neurons, and dropping most other neurons had almost no effect (Fig. 5b). This indicates that an image was represented by the highly responsive neurons in a manner robust against single-cell drop.
We found that this robustness was due to the spatial overlap of representation patterns (i.e., reverse filters) among responsive neurons (Fig. 5c). We selected nine neurons that represented the upper part of the image and whose representation patterns spatially overlapped but were variable in structure (overlapping cells, top panels in Fig. 5c and Supplementary Fig. 4). Although dropping a single cell had almost no effect on the reconstructed local image (bottom panels in Fig. 5c), sequentially dropping these cells gradually degraded the upper part of the reconstructed image (Fig. 5d). Pixel values in the overlapping area of the reconstructed image gradually decreased as the number of dropped cells increased (Fig. 5e and f). These results indicate that the robust image representation was due to neurons with spatially overlapping representations.
Independent activities among subsets of neurons provide robust image representation against trial-to-trial variability
We further analyzed whether this overlapping representation helps reduce the trial-to-trial variability of image representation. Cortical neurons often show trial-to-trial variability in response to repetitions of the same stimulus. If neurons with spatially overlapping representations showed independent or negatively correlated activities, integrating the activities of these neurons should reduce the variability of the image representations35, 36.
The across-trial variability of the reconstructed images in the example case (shown in Fig. 5) is shown in Fig. 6. Single-trial images reconstructed from all responsive neurons (57 cells) were mostly stable across trials and were distorted in only a few trials (e.g., trial 10, Fig. 6a). By contrast, images reconstructed from individual neurons were variable across trials (Fig. 6c). Importantly, while some neuron pairs showed positively correlated representations across trials, other pairs showed almost independent representations. Thus, integrating the activities of the neurons with overlapping representations resulted in a reliable representation across trials, even though the activity patterns of individual neurons were variable across trials (Fig. 6d).
Based on this observation, we hypothesized that neurons showing positively correlated activities form a functional cluster and work together, whereas neurons in different clusters show independent or negatively correlated activities that reduce the variability of image representations. In the case shown in Fig. 6, the nine neurons formed three clusters based on their noise correlations (Fig. 7a and Supplementary Fig. 5a, see Methods). Neurons with overlapping representations usually formed two clusters (Fig. 7b). Importantly, neuron pairs in different clusters exhibited almost zero or slightly negative correlations (between-cluster pairs: −0.05 [−0.22–0.12]; within-cluster pairs: 0.26 [0.09–0.42], median [25–75th percentiles], Fig. 7c, blue). This tendency was independent of the number of clusters (Supplementary Fig. 5i). The similarity of reverse filters for within-cluster pairs was almost comparable to that for between-cluster pairs (Supplementary Fig. 5b), indicating that the reverse filter structures did not simply explain the structure of the noise correlations. Furthermore, the cortical positions of the neurons did not explain the structure of the noise correlations, because neurons in different clusters were spatially intermingled in the FOVs (Supplementary Fig. 6).
We next compared the reconstructed images obtained from different clusters (Supplementary Fig. 5c, d, h). Importantly, the images were similar between clusters (pixel-to-pixel correlation of reconstructed images: 0.33 [0.11–0.52], median [25–75th percentiles], Supplementary Fig. 5d), indicating that the clusters represented similar information. At the single-trial level, the images reconstructed from individual clusters were still variable across trials (Fig. 7d), due to the relatively high noise correlations within clusters. We further compared the trial-to-trial variability of the reconstructed images between clusters. As predicted from the almost zero noise correlations between clusters, the trial-to-trial variability of the reconstructed images was almost independent between clusters (Fig. 7d for the representative case and Supplementary Fig. 5e, i for summary data; across-trial correlation coefficients of the reconstructed images between clusters: −0.08 [−0.25–0.09], median [25–75th percentiles]). Integrating the activities of the multiple clusters resulted in a more reliable image representation than that of individual clusters (Supplementary Fig. 5f). These results indicate that integrating activities across the clusters provides a representation that is robust against trial-to-trial response variability.
Representation of multiple natural images in a local population
Finally, we examined how multiple natural images were represented in a population of responsive neurons (Fig. 8a–c). Figs. 8a and b provide an example from the representative plane shown in the previous figures (n = 726 cells). Natural images were sorted by reconstruction performance (y-axis in Fig. 8a), and the cells responding to each image are plotted in each row. At first, as the number of images increased, new responsive cells were added, and the total number of responsive cells used for the reconstructions quickly increased (right end of the plot in each row, Fig. 8a). At approximately 50 images, the number of newly added responsive cells dropped sharply, and the increase in the total number of responsive cells slowed, indicating that each newly added image was represented by a combination of already plotted responsive cells (i.e., neurons that responded to other images); this was due to the small overlap of responsive cells between images (Fig. 1j). These findings are summarized in Fig. 8b and c, in which the number of newly added cells quickly decreased to zero as the number of images increased (red lines in Fig. 8b and c for the representative case and all planes, respectively). Therefore, although only 4.8% of responsive neurons overlapped between images (Fig. 1j), this small overlap enabled the representation of many natural images by a limited number of responsive neurons.
We also analyzed whether the features represented by the local populations of responsive neurons were sufficient to represent all the features of the natural images. If the features in a local population are sufficient to represent all natural images, all features of the natural images should be accurately represented by combinations of the features of the individual cells in a population. We fitted the set of feature values of each image by a linear regression on the weights (i.e., features) of all responsive cells from the reconstruction model (cell-selection model) and computed the fitting errors (see Methods, Fig. 8d). The median error was less than 10% for all images and all planes (8.2% [4.5–15.2%] across all images and 5.7% [4.9–16%] across planes, n = 24 planes, Fig. 8e and f). Thus, features sufficient to represent the visual contents of the natural images are encoded by the neurons in a local population.
Discussion
In mouse V1, single natural images activated a small number of neurons (2.7%), a response that was sparser than that predicted by the linear model. The Gabor features represented by individual neurons only slightly overlapped between neurons, indicating diverse representations. The visual contents of natural images were linearly decodable from the small number of active neurons (approximately 20 neurons), which was made possible by the diverse representations. A local part of the image was robustly represented by neurons whose representation patterns partially overlapped. These neurons with overlapping representations formed a small number of functional clusters that represented similar local images but were active independently across trials. Thus, integrating activities across the clusters led to a representation robust against across-trial response variability. Furthermore, the small overlap of responsive neurons between images helped a limited number of responsive neurons represent multiple natural images. Finally, the visual features represented by all the responsive neurons provided a good representation of the original visual features in the natural images.
Visual responses to natural images or movies in V1 are sparse at the single-cell level (high lifetime sparseness)2, 3, 5–9 and at the population level (population sparseness)3, 5, 6, 14. Recently, recordings of local population activity using two-photon Ca2+ imaging have enabled precise evaluation of population sparseness5, 14, 37. We confirmed that a single natural image activated only a small number of neurons. Encoding model analysis indicated that the visual responses of individual neurons were sparser than predicted from a linear model (Fig. 2d, e). Here, this sparse activity was shown to contain sufficient, and even robust, information to represent the natural image contents. Image reconstruction is useful for evaluating the information contents represented by neuronal activity and has been widely used to analyze populations of single-unit activities in response to natural scenes or movies in the LGN15 and fMRI data from several visual cortical areas16–19. The former study15 used “pseudo-population” data collected from several experiments, and the latter studies16–19 used blood oxygen level-dependent (BOLD) signals that indirectly reflect the average of local neuronal activity. Thus, it had not been examined whether and how the visual contents of natural images are represented in simultaneously recorded populations of single cortical neurons. We revealed that the visual contents of single natural images were linearly decodable from a relatively small number of responsive neurons in a local population. It has been proposed that information is easily read out from a sparse representation4. Indeed, sparse population activity increases the discriminability of two natural scenes by rendering their representations separable5. Our results extend this by showing that the information about visual contents encoded in sparsely active neurons is linearly accessible, suggesting that downstream areas can easily read out images from the sparse representation in V1.
The visual features encoded by individual neurons should be diverse so that a small number of active neurons can represent the complex visual features of an image. Although the RF structures in local populations of mouse V1 have already been reported21, 22, 33, 34, their diversity has not been analyzed with respect to natural image representation. In the present study, the visual features represented by sparsely active neurons were sufficiently diverse to represent the visual contents of natural images. Computational models of natural image representation have suggested that the sparseness of activity and the number of available neurons affect the diversity of RF structures20, 38–40.
We also demonstrated that sparsely active neurons represented an image robustly against trial-to-trial response variability. Although a computational model proposed a sparse and overcomplete representation as the optimal representation of natural images with unreliable neurons20, this had never been addressed experimentally. We demonstrated that the robust representation was mainly achieved by the diverse, partially overlapping representations, consistent with an overcomplete representation. It has been reported that subregions of the receptive fields of some V1 neurons partially overlap21. Our results suggest that such overlap may be useful for robust image representation. We further revealed that neurons with overlapping reverse filters formed functional clusters and that integration across the clusters reduced the trial-to-trial variability, suggesting a new representation scheme in which information is reliably represented while the representing neuronal patterns change across trials. This scheme resembles “drop-out” in deep learning41 and may be useful for avoiding overfitting and local minima in learning.
Our analysis also revealed how multiple natural images were represented in a local population of responsive neurons. A single natural image activated specific subsets of neurons, whereas most neurons in a local population responded to at least one of the images, supporting the sparse, distributed code proposed in a previous study10. The overlap of responsive neurons between images involved only 4.8% of the responsive cells (Fig. 1j). However, owing to this small overlap, many natural images were represented by a limited number of responsive neurons (Fig. 8a–c). Furthermore, the features of all responsive neurons in a local population were sufficient to represent all the natural images used in the present study (Fig. 8d–f). Based on these findings, any natural image could be represented by a combination of responsive neurons in a local population.
In summary, this work highlighted how the visual contents of natural images are sufficiently, and even robustly, represented by sparsely active V1 neurons. The diverse but partially overlapping representations help a small number of neurons represent a complex image robustly against across-trial variability. We propose a new representation scheme in which information is reliably represented by neuronal patterns that vary across trials, and which may be effective in avoiding over-fitting during learning.
Author contributions
T.Y. and K.O. designed the research. T.Y. performed experiments. T.Y. and K.O. analyzed data and wrote the manuscript. K.O. supervised the research.
Competing financial interests
We declare no competing financial interests.
Methods
All experimental procedures were approved by the local Animal Use and Care Committee of Kyushu University.
Animal preparation for two-photon imaging
C57BL/6 mice (male and female) were used (Japan SLC Inc., Shizuoka, Japan). Mice were anaesthetized with isoflurane (5% for induction, 1.5% for maintenance during surgery, and ~0.5% during imaging with sedation by <0.5 mg/kg chlorprothixene, Sigma-Aldrich, St. Louis, MO, USA). The skin was removed from the head, and the skull over the cortex was exposed. A custom-made metal plate for head fixation was attached with dental cement (Super Bond, Sun Medical, Shiga, Japan), and a craniotomy (~3 mm in diameter) was performed over the primary visual cortex (centre position: 0–1 mm anterior to lambda, 2.5–3 mm lateral to the midline). A mixture of 0.8 mM Oregon Green BAPTA1-AM (OGB1, Life Technologies, Grand Island, NY, USA) dissolved with 10% Pluronic (Life Technologies) and 0.025 mM sulforhodamine 101 (SR101, Sigma-Aldrich)42 was pressure-injected with a Picospritzer III (Parker Hannifin, Cleveland, OH, USA) at a depth of 300–500 μm from the brain surface. The cranial window was sealed with a coverslip and dental cement. The imaging experiment began at least one hour after the OGB1 injection.
Two-photon Ca2+ imaging
Imaging was performed with a two-photon microscope (A1R MP, Nikon, Tokyo, Japan) equipped with a 25× objective (NA 1.10, PlanApo, Nikon) and a Ti:sapphire mode-locked laser (MaiTai Deep See, Spectra Physics, Santa Clara, CA, USA)43, 44. OGB1 and SR101 were excited at a wavelength of 920 nm. Emission filters of 525/50 nm and 629/56 nm were used for the OGB1 and SR101 signals, respectively. The fields of view (FOVs) were 338 × 338 μm (10 planes from 7 mice) and 507 × 507 μm (14 planes from 7 mice) at 512 × 512 pixels. The sampling frame rate was 30 Hz using a resonant scanner.
Visual stimulation
Before beginning the recording session, the retinotopic position of the recorded FOV was determined using moving grating patches (lateral or upward directions, 99.9% contrast, 0.04 cycles/degree, 2 Hz temporal frequency, 20 and 50 degrees in diameter) while monitoring the changes in signals over the entire FOV. The lateral or upward motion directions of the grating were used to activate many cells, because the preferred directions of mouse V1 neurons are slightly biased towards the cardinal directions44, 45. First, the grating patch of 50 degrees in diameter was presented at one of 15 (5 × 3) positions that covered the entire monitor to roughly determine the retinotopic position. Then, the patch of 20 degrees in diameter was presented at 16 (4 × 4) positions covering an 80 × 80-degree space to finely identify the retinotopic position. The stimulus position that induced the maximum visual response of the entire FOV was set as the centre of the retinotopic position of the FOV.
A set of circular patches of grey-scale, contrast-enhanced natural images (200 image types) was used as the visual stimuli for response prediction and natural image reconstruction (60 degrees in diameter, 512 × 512 pixels, with a circular edge (5 degrees) gradually blended into the grey background). Each natural image was adjusted to almost full contrast (99.9%), and the mean intensity across pixels in each image was adjusted to approximately 50% intensity. The original natural images were obtained from van Hateren’s Natural Image Dataset (http://pirsquared.org/research/#van-hateren-database)46 and the McGill Calibrated Color Image Database (http://tabby.vision.mcgill.ca/html/welcome.html)47. During image presentation, one image type was consecutively flashed three times (three 200-ms presentations interleaved with 200 ms of grey screen), and the presentation of the next image was initiated after presentation of the grey screen for 200 ms. Images were presented in a pseudo-random sequence in which each image was presented once every 200 image types. Each image was presented at least 12 times (i.e., 12 trials) in a total recording session. We did not set a long interval between image flashes, in order to reduce the total recording time and increase the number of repetitions. In this design, the tail of the Ca2+ response to one image invaded the time window of the next image presentation (Fig. 1b). Although this overlap may have affected the visual responses between two adjacent images, the many trial repetitions (>11 for each image) in pseudo-random order and the sparse responses to natural images (Fig. 1) minimized the effects of response contamination between two consecutive images.
Moving square gratings (8 directions, 0.04 cycles/degree, 2 Hz temporal frequency, 60-degree patch diameter) were presented at the same position as the natural images on the screen. Each direction was presented for 4 s, interleaved with 4 s of grey screen. The sequence of directions was pseudo-randomized, and each direction was presented 10 times in a recording session.
All stimuli were presented with PsychoPy48 on a 32-inch LCD monitor (Samsung, Hwaseong, South Korea) with a 60-Hz refresh rate, and the timing of the stimulus presentation was synchronized with the timing of image acquisition using a TTL pulse counter (USB-6501, National Instruments, Austin, TX, USA).
The entire recording for one plane was divided into several sessions (4–6 trials and 15–25 min per session). Sessions were interleaved with approximately 5–10 minutes of rest, during which slight drift of the FOV was manually corrected. Every two or three sessions, the retinotopic position of the FOV was checked with the grating patch stimuli during the resting period. If the retinotopic position had shifted (probably due to eye movement), the recording was terminated and the data were discarded. Recordings were performed in one to three planes at different depths and/or positions in each animal (1.7 ± 0.8 planes, mean ± standard deviation).
Data analysis
All data analysis procedures were performed using MATLAB (Mathworks, Natick, MA, USA). Recorded images were phase-corrected and aligned between frames. The image averaged across frames was used to determine the regions of interest (ROIs) of individual cells. After removing the slow spatial-frequency component (obtained with a Gaussian filter with a sigma of approximately five times the soma diameter), the frame-averaged image was subjected to a template-matching method in which a two-dimensional difference of Gaussians (sigma1: 0.26 × soma diameter, adjusted for zero-crossing at the soma radius; sigma2: soma diameter) was used as a template for the cell body. Areas highly correlated between the frame-averaged image and the template were detected as ROIs of individual cells. ROIs were manually corrected by visual inspection, and SR101-positive cells (putative astrocytes42) were removed. The time course of the calcium signal of each cell was computed as the average intensity of all pixels within its ROI. Signal contamination from outside the focal plane was removed by a previously reported method44, 49. Briefly, the signal from a ring-shaped area surrounding each ROI was multiplied by a factor (the contamination ratio) and subtracted from the signal of each cell. The contamination ratio was determined to minimize the difference between the signal from a blood vessel and the signal from its surrounding ring-shaped region multiplied by the contamination ratio. Contamination ratios were computed for several blood vessels in the FOV, and their mean was used for all cells in the FOV.
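A minimal sketch of this contamination-removal step (illustrative Python; the original analysis was performed in MATLAB, and the function and variable names here are ours):

```python
import numpy as np

def remove_contamination(roi_sig, ring_sig, vessel_sig, vessel_ring_sig):
    """Subtract out-of-focus contamination from a cell's signal.

    The contamination ratio c is the least-squares solution of
    vessel_sig ~ c * vessel_ring_sig, i.e., the value minimizing the
    difference between a blood-vessel signal and its surrounding ring
    signal scaled by c. In practice, c is averaged over several vessels.
    """
    c = np.dot(vessel_ring_sig, vessel_sig) / np.dot(vessel_ring_sig, vessel_ring_sig)
    return roi_sig - c * ring_sig
```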
Visually evoked responses were computed by subtracting the average response during the 200-ms grey-screen period just before each image from the average response during the last 200 ms of the stimulus period (during the 3rd flash of each image, approximately at the peak of the Ca2+ transient). The evoked response was normalized for each cell by dividing by the standard deviation across all visual responses (200 images × trials; z-scored response). If the z-scored response to an image was significantly different from 0 (p < 0.01, signed-rank test across trials) and the across-trial average of the z-scored response was greater than 1, the response was considered significant for that image. The population sparseness (s) was computed using the equation described in previous studies2, 3, 50: s = [1 − (Σ Ri)² / (N Σ Ri²)] / (1 − 1/N), where Ri is the evoked response of the ith cell and N is the number of cells (i = 1–N).
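For illustration, the population sparseness for one image can be computed as follows (a Python sketch of the equation above; the original analysis used MATLAB):

```python
import numpy as np

def population_sparseness(responses):
    """s = [1 - (sum R_i)^2 / (N * sum R_i^2)] / (1 - 1/N).

    responses: 1-D array of the evoked responses R_i of all N cells
    to a single image. Values near 1 indicate sparser population activity.
    """
    r = np.asarray(responses, dtype=float)
    n = r.size
    return (1.0 - r.sum() ** 2 / (n * np.sum(r ** 2))) / (1.0 - 1.0 / n)
```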
Natural images were scaled so that the maximum and minimum intensities were 1 and −1, respectively, and the grey intensity was 0. A square region (50 × 50 degrees) at the centre of each natural image patch was extracted and down-sampled to a 32 × 32-pixel image. The down-sampled images were used for the analyses of Gabor features, response prediction and image reconstruction.
Gabor features
A set of spatially overlapping Gabor wavelet filters was prepared to extract the visual features of the natural images10, 51, 52. The down-sampled images were first passed through the set of Gabor filters to obtain the Gabor feature values. Each feature value corresponds to a single wavelet filter.
The Gabor filters had four orientations (0, 45, 90, and 135 degrees), two phases, and four sizes (8 × 8, 16 × 16, 32 × 32, and 64 × 64 pixels) located on 11 × 11, 5 × 5, 3 × 3, and 1 × 1 grids, respectively (Supplementary Fig. 1a and b); thus, the three smaller-scale filter sets spatially overlapped with each other. The spatial frequencies of the four scales of the Gabor wavelets were 0.13, 0.067, 0.033, and 0.016 cycles/degree (cpd). This filter set was almost self-inverting; i.e., the feature values obtained by applying an image to the wavelet set were transformed back into the image by summing the filters after multiplying by the feature values51. The Gabor filters and the transformations were based on an open-source program (originally written by Drs. Daisuke Kato and Izumi Ohzawa, Osaka University, Japan, https://visiome.neuroinf.jp/modules/xoonips/detail.php?item_id=6894).
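The following Python sketch illustrates the structure of such a filter bank (4 orientations × 2 phases × 156 grid positions = 1248 filters). The envelope and wavelength settings are illustrative choices of ours, not the exact parameters of the published filter set:

```python
import numpy as np

def gabor_filter(img_size, cy, cx, theta, phase, wavelength, sigma):
    """A single Gabor wavelet rendered on an img_size x img_size canvas."""
    y, x = np.mgrid[0:img_size, 0:img_size].astype(float)
    y, x = y - cy, x - cx
    xr = x * np.cos(theta) + y * np.sin(theta)
    g = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2)) \
        * np.cos(2.0 * np.pi * xr / wavelength + phase)
    return g / np.linalg.norm(g)                      # unit L2 norm

def build_gabor_bank(img_size=32):
    """4 orientations x 2 phases x (11x11 + 5x5 + 3x3 + 1x1) positions
    = 1248 filters, returned as a (1248, img_size**2) matrix."""
    rows = []
    for grid, wavelength in [(11, 4.0), (5, 8.0), (3, 16.0), (1, 32.0)]:
        for gy in range(grid):
            for gx in range(grid):
                cy = (gy + 0.5) * img_size / grid
                cx = (gx + 0.5) * img_size / grid
                for theta in np.deg2rad([0, 45, 90, 135]):
                    for phase in (0.0, np.pi / 2):
                        rows.append(gabor_filter(
                            img_size, cy, cx, theta, phase,
                            wavelength, sigma=wavelength / 2).ravel())
    return np.vstack(rows)

bank = build_gabor_bank()                  # shape (1248, 1024)
image = np.random.randn(32, 32)            # stand-in for a down-sampled image
features = bank @ image.ravel()            # Gabor feature values
# For an (almost) self-inverting set, the weighted sum of the filters
# approximately recovers the image (here only up to a scale factor):
recon = (bank.T @ features).reshape(32, 32)
```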
Encoding model
In the encoding model for response prediction, single-cell responses were predicted using a linear regression of selected Gabor feature values (Fig. 2a and Supplementary Fig. 1a–e). The encoding model was created independently for each cell. First, Pearson’s correlation coefficients between the response and each feature value were computed. Then, using one of the preset values as a correlation threshold (12 points ranging from 0.05 to 0.35, Supplementary Fig. 1c–e), only the more strongly correlated features were selected (feature selection) and used in the regression analysis. The weight and bias parameters of the regression were estimated by Bayesian linear regression with an expectation-maximization algorithm, which is almost equivalent to linear regression with L2 regularization53. After the regression analysis, the non-linearity of the predicted response was adjusted via a scaling step using the following equation34: predicted response = A / [1 + exp(αx + β)], where x is the output of the regression and A, α, and β are parameters to be estimated. This step only scaled the regression output without changing the regression parameters (i.e., weights and biases). The response prediction of the model was estimated by 10-fold cross-validation (CV), in which the response data for 180 images were used to estimate the parameters and the remaining data for 20 images were used to evaluate the prediction. In the 10-fold CV, all images were used once as test data. The prediction performance was estimated using Pearson’s correlation coefficient between the observed (trial-averaged) and predicted responses. Encoding models were created for all preset threshold values for feature selection, and the model that exhibited the best prediction performance was selected as the final model. In the analysis of weight (i.e., feature) overlap between two cells, the percentage of overlapping weights relative to the number of non-zero weights was computed for each cell and averaged between the two cells in the pair.
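A minimal Python sketch of one cell’s encoding model (ridge regression stands in for the Bayesian linear regression, and the correlation threshold is fixed here rather than selected by cross-validation; function and variable names are ours):

```python
import numpy as np
from scipy.optimize import curve_fit
from sklearn.linear_model import Ridge

def fit_encoding_model(F, r, corr_thresh=0.15, alpha=1.0):
    """F: (n_images, n_features) Gabor feature values of the stimuli.
    r: (n_images,) trial-averaged responses of one cell."""
    # 1. Feature selection: keep features strongly correlated with the response.
    corrs = np.array([np.corrcoef(F[:, j], r)[0, 1] for j in range(F.shape[1])])
    sel = np.abs(corrs) >= corr_thresh
    # 2. L2-regularized linear regression on the selected features.
    lin = Ridge(alpha=alpha).fit(F[:, sel], r)
    x = lin.predict(F[:, sel])
    # 3. Non-linear scaling: r_hat = A / (1 + exp(a*x + b)).
    scaling = lambda x, A, a, b: A / (1.0 + np.exp(a * x + b))
    params, _ = curve_fit(scaling, x, r, p0=(r.max(), -1.0, 0.0), maxfev=10000)
    return sel, lin, params
```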
Using the same dataset as in the encoding model, the RF structure of each cell was estimated using a regularized inverse method32–34. The regularized inverse method uses one hyper-parameter (the regularization parameter). In the 10-fold CVs, the RF structure was estimated from the training dataset using one of the preset regularization parameters (13 logarithmically spaced points between 10−3 and 103). The visual response was then predicted using the estimated RF and the test dataset. The prediction performance was estimated by Pearson’s correlation coefficient between the observed and predicted responses. RFs were estimated for all preset regularization parameters, and the value that resulted in the best response prediction was selected for the final RF model.
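A sketch of a plain ridge-type regularized inverse (the published implementation may differ in its priors; this is the basic L2 form, with names of ours):

```python
import numpy as np

def estimate_rf(S, r, lam):
    """S: (n_images, n_pixels) stimulus matrix (one flattened image per row).
    r: (n_images,) trial-averaged responses of one cell.
    lam: regularization parameter, chosen by cross-validation from
    logarithmically spaced candidates as described above."""
    n_pix = S.shape[1]
    rf = np.linalg.solve(S.T @ S + lam * np.eye(n_pix), S.T @ r)
    return rf.reshape(32, 32)       # assumes 32 x 32 down-sampled images
```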
Image reconstruction
In the image reconstruction, each Gabor feature value was linearly regressed on the single-trial activities of multiple cells. In the 10-fold CVs, the weights and bias parameters were estimated using the same algorithm as in the encoding model with the training dataset (see above), and each Gabor feature value was reconstructed from the visual responses in the test dataset. After each Gabor feature was independently reconstructed, the sets of reconstructed feature values were transformed into images as described above (see the Gabor features section and Fig. 3a). Reconstruction performance was evaluated by the pixel-to-pixel Pearson’s correlation coefficient between the stimulus and reconstructed images. In the cell-selection model (Fig. 3), each feature value was reconstructed with the subset of cells selected using the encoding model (Fig. 2f and Supplementary Fig. 3a), and almost all cells were used across features (Fig. 2j). In the encoding model, each cell was represented by the subset of features that affected the cell’s response. Thus, in the cell-selection model, each feature was reconstructed only from the cells that encoded information about that feature (Supplementary Fig. 3a).
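A sketch of the reconstruction pipeline (ridge regression again stands in for the Bayesian linear regression; the mask-based cell selection mirrors the cell-selection model, and all names are ours):

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_reconstruction(R, F, masks=None, alpha=1.0):
    """R: (n_trials, n_cells) single-trial responses (training set).
    F: (n_trials, n_features) feature values of the presented images.
    masks: optional (n_features, n_cells) boolean array; in the
    cell-selection model, feature k is regressed only on the cells
    whose encoding models include feature k."""
    models = []
    for k in range(F.shape[1]):
        cells = masks[k] if masks is not None else np.ones(R.shape[1], bool)
        models.append((cells, Ridge(alpha=alpha).fit(R[:, cells], F[:, k])))
    return models

def reconstruct(models, r_trial, bank):
    """Decode one trial's population activity into a 32 x 32 image
    using the Gabor bank matrix from the earlier sketch."""
    f_hat = np.array([m.predict(r_trial[cells][None, :])[0]
                      for cells, m in models])
    return (bank.T @ f_hat).reshape(32, 32)
```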
The overlap of weights (i.e., features) between cells was computed as described above for the encoding model.
In the analyses shown in Fig. 4a–d and Fig. 5a and b, the cells were separated into responsive and non-responsive cells for each image and sorted by response amplitude in descending order. The cells were then added (Fig. 4) or dropped (Fig. 5) one by one, first from the responsive cells and then from the non-responsive cells.
In the analysis of robustness (Fig. 5c–f), a z-scored reverse filter was first computed for each neuron. A cluster of pixels whose absolute z-scores exceeded 1.5 was defined as a representation area after smoothing its contours (e.g., red contours in Fig. 5c and Supplementary Fig. 4); if multiple areas were obtained, the largest one was used. For each stimulus image, one responsive neuron was selected as the reference cell, and the correlation coefficients of the binarized representation areas were computed between the reference cell and the other neurons responsive to the image. Cells whose correlation coefficients exceeded 0.4 were selected, and the set of neurons comprising the reference and selected cells was called the “overlapping cells”. To evaluate the effects of cell drop, cells were randomly removed from the overlapping cells, and the reconstructed image was computed after each cell drop. The reference cell was removed first, and the remaining overlapping cells were then removed one by one in each cell-drop sequence. Changes in the reconstructed images were estimated by quantifying the pixel values of a local part of the image, defined as the part of the reference cell’s representation area overlapped by at least one remaining overlapping cell (overlapping area in Fig. 5d and Supplementary Fig. 4). Absolute pixel values were averaged inside this local part of the image (note that the stimulus images were scaled from −1 to 1; see the Data analysis section) and used to evaluate the local part of the reconstructed image. Random drops of the overlapping cells were repeated 120 times, and the results were averaged across the random orders for each reference cell. All responsive cells were used once as the reference cell for each stimulus image. Only data with at least 10 responsive cells and 5 overlapping cells were used in this analysis.
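A sketch of the sequential-drop loop; because the decoder is linear with fixed weights, removing a cell is approximated here by zeroing its activity (the structure and names are ours, building on the reconstruction sketch above):

```python
import numpy as np

def sequential_drop_curve(models, r_trial, bank, overlap_idx, area_mask,
                          n_repeats=120, seed=0):
    """overlap_idx: indices of the overlapping cells, reference cell first.
    area_mask: boolean (32, 32) mask of the overlapping area.
    Returns the mean absolute pixel value in the area after each drop,
    averaged over random drop orders of the non-reference cells."""
    rng = np.random.default_rng(seed)
    curves = []
    for _ in range(n_repeats):
        order = [overlap_idx[0]] + list(rng.permutation(overlap_idx[1:]))
        activity = r_trial.astype(float).copy()
        vals = []
        for cell in order:
            activity[cell] = 0.0                      # drop this cell
            img = reconstruct(models, activity, bank) # see earlier sketch
            vals.append(np.abs(img[area_mask]).mean())
        curves.append(vals)
    return np.mean(curves, axis=0)
```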
In the cluster analysis (Fig. 7), the overlapping cells selected as described above were clustered by k-means analysis using the noise correlations of the responses to an image as the distance measure (predetermined cluster numbers of k = 2, 3, 4, and 5 were used). We used the number of clusters (k) that showed the minimal between-cluster noise correlation for each set of overlapping neurons (Supplementary Fig. 5a). In the analysis of the correlation of the trial-to-trial variability of the reconstructed images between clusters (Fig. 7d and Supplementary Fig. 5e, i), the trial-to-trial variability of the reconstructed image was evaluated by the pixel values of the local part of the image, as in the analysis in Fig. 5e and f, and the correlation coefficient of the pixel-value changes was computed between clusters. The local part of the reconstructed image was determined as described above. In the analysis of the reliability of the reconstructed images across trials (Supplementary Fig. 5f), the correlations between the single-trial and trial-averaged reconstructed images were computed and averaged across trials. The main results were independent of the choice of cluster number (Supplementary Fig. 5g–i). Only data with at least 10 responsive neurons and 5 overlapping neurons were used in this analysis.
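A sketch of the clustering step (one possible implementation, not necessarily the authors’). Each cell’s z-scored trial-to-trial fluctuation vector is unit-normalized so that squared Euclidean distance equals 2 × (1 − noise correlation), letting standard k-means cluster by noise correlation; k is chosen to minimize the mean between-cluster correlation:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_overlapping_cells(resp, ks=(2, 3, 4, 5), seed=0):
    """resp: (n_trials, n_cells) single-trial responses of the
    overlapping cells to one image. Returns (best_k, labels)."""
    Z = (resp - resp.mean(axis=0)) / resp.std(axis=0)   # z-score fluctuations
    U = (Z / np.linalg.norm(Z, axis=0)).T               # unit rows, one per cell
    C = np.corrcoef(resp.T)                             # noise-correlation matrix
    best = None
    for k in ks:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(U)
        between = C[labels[:, None] != labels[None, :]]
        if between.size and (best is None or between.mean() < best[0]):
            best = (between.mean(), k, labels)
    return best[1], best[2]
```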
To examine whether all the features of the natural images were represented by the features of the responsive cells (Fig. 8d–f), the feature values of each image were linearly regressed on the weights of the image reconstruction model (cell-selection model) of all responsive cells in a local population. The fitting error rate (% error) was computed for each image using the following equation: % error = 100 × Σ(Ffitted − Fimage)² / Σ(Fimage − Fmean)², where Ffitted is the set of fitted feature values, Fimage is the set of feature values of the natural image, and Fmean is the mean of Fimage.
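Stated as code, the error measure is (an illustrative Python one-liner):

```python
import numpy as np

def percent_error(f_fitted, f_image):
    """% error = 100 * sum((F_fitted - F_image)^2) / sum((F_image - F_mean)^2)."""
    return 100.0 * np.sum((f_fitted - f_image) ** 2) \
           / np.sum((f_image - f_image.mean()) ** 2)
```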
Statistical analyses
All data are presented as the median and 25–75th percentiles unless indicated otherwise. The significance level was set to 0.05, except for the criterion for a significant visual response (0.01). When more than two groups were compared, the significance level was adjusted with the Bonferroni correction. Two-sided tests were used in all analyses. The experiments were not performed in a blind manner. The sample sizes were not predetermined by any statistical methods but are comparable to the sample sizes of other reports in the field.
Data availability
The datasets of the current study and the code used to analyze them are available from the corresponding authors on reasonable request.
Acknowledgements
We thank Ms. Y. Sono, A. Hayashi, T. Inoue, A. Ohmori, A. Honda, M. Nakamichi for animal care, and all members of Ohki laboratory for support and discussions. This work was supported by grants from Core Research for Evolutionary Science and Technology (CREST)—Japan Agency for Medical Research and Development (AMED) (to K.O.), Japan Society for the Promotion of Science (JSPS) KAKENHI (Grant number 25221001 and 25117004 to K.O. and 15K16573, 17K13276 to T.Y.), International Research Center for Neurointelligence (WPI-IRCN), JSPS (to K.O.), Brain Mapping by Integrated Neurotechnologies for Disease Studies (Brain/MINDS)—AMED (to K.O.), Strategic International Research Cooperative Program (SICP)—AMED (to K.O.), grants from the Ichiro Kanehara Foundation for the Promotion of Medical Sciences and Medical Care, and the Uehara Memorial Foundation (to T.Y.).