Abstract
Identifying high-confidence cell-type-specific open chromatin regions with coherent regulatory function from single-cell open chromatin data (scATAC-seq) is difficult due to the complexity of resolving cell types given the low coverage of reads per cell. To address this problem, we present Semi-Supervised Identification of Populations of cells in scATAC-seq data (SSIPs), a semi-supervised approach that integrates bulk and single-cell data through a generalizable network model featuring two types of nodes. Nodes of the first type represent cells from scATAC-seq, with edges between them encoding information about cell similarity. Nodes of the second type represent “supervising” datasets and are connected to cell nodes by edges that encode the similarity between each supervising dataset and each cell. Via global calculations of network influence, this model allows us to quantify the influence of bulk data on scATAC-seq data and to estimate the contributions of scATAC-seq cell populations to signals in bulk data. Using simulated data, we show that SSIPs successfully separates distinct cell types even when they differ in very few mapped scATAC-seq reads, with a significant improvement over unsupervised cell type identification. We apply SSIPs to scATAC-seq data from the developing human brain and show that supervising with just 25 differentially expressed genes from scRNA-seq enables the identification of two subtypes of interneurons that are not identifiable from scATAC-seq data alone. SSIPs opens the door to identifying high-resolution cell types in single-cell open chromatin data, enabling the study of cell-type-specific regulatory elements.
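To make the two-node-type network concrete, the following is a minimal illustrative sketch, not the SSIPs implementation. It assumes that "network influence" can be approximated by a random walk with restart from each supervising node; the abstract does not specify the influence measure, so this choice, the function names `build_network` and `influence`, the `damping` parameter, and the toy similarity values are all hypothetical.

```python
import numpy as np

def build_network(cell_sim, sup_sim):
    """Assemble a symmetric adjacency matrix: cell nodes first, supervising nodes last.

    cell_sim: (n, n) non-negative cell-cell similarities.
    sup_sim:  (m, n) non-negative similarities between each supervising
              dataset and each cell.
    """
    n, m = cell_sim.shape[0], sup_sim.shape[0]
    adj = np.zeros((n + m, n + m))
    adj[:n, :n] = cell_sim
    adj[n:, :n] = sup_sim
    adj[:n, n:] = sup_sim.T
    np.fill_diagonal(adj, 0.0)  # no self-loops
    return adj

def influence(adj, source, damping=0.85, tol=1e-10, max_iter=1000):
    """Random walk with restart from `source` (assumed influence measure).

    Returns a probability vector over all nodes.
    """
    col_sums = adj.sum(axis=0)
    col_sums[col_sums == 0] = 1.0       # avoid division by zero for isolated nodes
    transition = adj / col_sums         # each column sums to 1 (or 0 if isolated)
    restart = np.zeros(adj.shape[0])
    restart[source] = 1.0
    p = restart.copy()
    for _ in range(max_iter):
        nxt = damping * (transition @ p) + (1 - damping) * restart
        if np.abs(nxt - p).sum() < tol:
            return nxt
        p = nxt
    return p

# Toy data (hypothetical): six cells whose cell-cell similarities are
# uniform, so the cells alone carry no grouping signal.
n_cells = 6
cell_sim = np.full((n_cells, n_cells), 0.1)
# Two supervising datasets, each resembling one half of the cells.
sup_sim = np.array([[1, 1, 1, 0, 0, 0],
                    [0, 0, 0, 1, 1, 1]], dtype=float)

adj = build_network(cell_sim, sup_sim)
# scores[j, i]: influence of supervising node j on node i.
scores = np.vstack([influence(adj, n_cells + j) for j in range(sup_sim.shape[0])])
# Assign each cell to the supervising dataset with the largest influence on it.
labels = scores[:, :n_cells].argmax(axis=0)
```

In this toy setting the uniform cell-cell similarities cannot separate the two groups, so the assignment in `labels` is driven entirely by the supervising nodes. Real data would need similarity measures and an influence calculation chosen to match the method described in the full text.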