Abstract
Goal-driven convolutional neural networks (CNNs) have been shown to predict and decode cortical responses to natural images and videos. Here, we explored an alternative deep neural network, the variational auto-encoder (VAE), as a computational model of the visual cortex. We trained a VAE with a five-layer encoder and a five-layer decoder to learn visual representations from a diverse set of unlabeled images. Inspired by the “free-energy principle” in neuroscience, we modeled the brain’s bottom-up and top-down pathways using the VAE’s encoder and decoder, respectively. Following this conceptual correspondence, we found that the VAE could predict cortical activity observed with functional magnetic resonance imaging (fMRI) from three human subjects watching natural videos. Compared to the CNN, the VAE yielded lower prediction accuracy, especially for higher-order ventral visual areas. On the other hand, fMRI responses could be decoded to estimate the VAE’s latent variables, which in turn could reconstruct the visual input through the VAE’s decoder. This decoding strategy outperformed alternative decoding methods based on partial least squares regression. This study supports the notion that the brain, at least in part, embodies a generative model of the visual world.
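The encoder–decoder structure described above can be illustrated with a minimal sketch. This is not the paper’s model: the actual network used five-layer convolutional pathways, whereas here each pathway is collapsed into a single linear map with arbitrary toy dimensions and randomly initialized weights, purely to show how the bottom-up (encode), sampling (reparameterize), and top-down (decode) steps fit together.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions; the paper's VAE operated on natural images
# through five-layer convolutional encoder and decoder networks.
D_IN, D_LATENT = 64, 8

# Randomly initialized weights stand in for trained parameters.
W_enc_mu = rng.normal(scale=0.1, size=(D_LATENT, D_IN))
W_enc_logvar = rng.normal(scale=0.1, size=(D_LATENT, D_IN))
W_dec = rng.normal(scale=0.1, size=(D_IN, D_LATENT))

def encode(x):
    """Bottom-up pathway: input -> parameters of a latent Gaussian."""
    return W_enc_mu @ x, W_enc_logvar @ x

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps, keeping the sampling step differentiable."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z):
    """Top-down pathway: latent sample -> reconstructed input."""
    return W_dec @ z

x = rng.normal(size=D_IN)       # stand-in for a flattened input image
mu, logvar = encode(x)
z = reparameterize(mu, logvar)
x_hat = decode(z)
```

The decoding strategy in the abstract corresponds to replacing `encode` with a regression from fMRI responses to the latent variables `z`, then passing the estimated `z` through `decode` to reconstruct the stimulus.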