Although recent computational studies of feedforward neural network models have demonstrated remarkable performance in object recognition and neural response prediction, visual processing clearly has more complex aspects that cannot be understood without feedback processing. Here, we propose a novel framework called the mixture of sparse coding models, inspired by the formation of category-specific subregions, such as those for faces and objects, in the inferotemporal (IT) cortex. The model reconciles two opposing ideas, parts-based and holistic processing: the former is achieved by sparse coding and the latter by the top-down explaining-away effect of the mixture model. We developed a concrete hierarchical network that implemented a mixture of two sparse coding submodels on top of a simple Gabor analysis, where each submodel was trained with face or non-face object images and the latent variables were estimated by standard Bayesian inference to model evoked neural activities. As a result, units in the face submodel not only exhibited significant selectivity for face images over object images, but also explained, qualitatively and quantitatively, several tuning properties to facial features found in the middle face-processing patch of the macaque IT cortex, as documented by Freiwald, Tsao, and Livingstone (2009). Namely, we found tuning to only a small number of facial features, often related to geometrically large parts such as the face outline and hair; preference for extreme values of facial features (e.g., large inter-eye distance) with anti-preference for the opposite extremes (e.g., small inter-eye distance); and reduction of the gain of feature tuning for partial, as opposed to whole, face stimuli. Thus, we hypothesize that the coding principle of facial features in the middle face-processing patch of the macaque IT cortex may be closely related to a mixture of sparse coding models.
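The inference scheme described above (per-submodel sparse coding plus a posterior over submodels, which produces the top-down explaining-away effect) can be sketched minimally as follows. This is an illustrative toy, not the paper's implementation: the Gabor front end is omitted, the sparse codes are obtained by MAP estimation with ISTA rather than full Bayesian inference, and the submodel evidence is crudely approximated by the negative MAP energy. All function names, dimensions, and the `lam` sparsity parameter are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def map_sparse_code(x, A, lam=0.5, n_iter=200):
    """MAP sparse code s for x ~ A s + noise with a Laplace prior on s,
    computed by ISTA (gradient step + soft-thresholding). Illustrative only."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L, L = Lipschitz const. of gradient
    s = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = A.T @ (A @ s - x)                # gradient of 0.5 * ||x - A s||^2
        s = s - step * g
        s = np.sign(s) * np.maximum(np.abs(s) - step * lam, 0.0)
    return s

def mixture_inference(x, dicts, lam=0.5):
    """Approximate posterior over submodels and per-submodel sparse codes.
    Evidence for each submodel is approximated by its negative MAP energy,
    so the better-fitting submodel 'explains away' the stimulus."""
    energies, codes = [], []
    for A in dicts:
        s = map_sparse_code(x, A, lam)
        e = 0.5 * np.sum((x - A @ s) ** 2) + lam * np.sum(np.abs(s))
        energies.append(e)
        codes.append(s)
    energies = np.array(energies)
    post = np.exp(-(energies - energies.min()))  # softmax over negative energies
    post /= post.sum()
    return post, codes

# Toy demo: two submodels with different random dictionaries (stand-ins for
# the face-trained and object-trained submodels). A stimulus generated from
# submodel 0 should receive more posterior weight there.
d, m = 20, 40
dicts = [rng.normal(size=(d, m)) / np.sqrt(d) for _ in range(2)]
s_true = np.zeros(m)
s_true[:3] = 3.0 * rng.normal(size=3)        # sparse ground-truth code
x = dicts[0] @ s_true + 0.01 * rng.normal(size=d)
post, codes = mixture_inference(x, dicts)
print(post)  # posterior should favor submodel 0
```

In this sketch, the reduction of a unit's effective response when the other submodel wins the posterior is the toy analogue of the top-down explaining-away effect the abstract attributes to the mixture model.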