Recognizing the identity of a person is fundamental for guiding social interactions. We can recognize a person by looking at her face, but also by listening to her voice. An important question concerns how visual and auditory information come together, enabling us to recognize identity independently of the modality of the stimulus. This study reports converging evidence from univariate contrasts and multivariate classification showing that the posterior superior temporal sulcus (pSTS), previously known to encode polymodal visual and auditory representations, encodes information about person identity with invariance both within and across modalities. In particular, pSTS shows selectivity for faces, selectivity for voices, classification of face identity across image transformations within the visual modality, and classification of person identity across modalities.