ABSTRACT
Motivation Current technologies for single-cell transcriptomics allow thousands of cells to be analyzed in a single experiment. The increased scale of these methods has raised the risk of contamination by cell doublets. Available tools and algorithms for identifying doublets and estimating their occurrence in single-cell expression data focus on doublets formed by cells from different species, cell types, or individuals.
Results In this study, we analyze transcriptomic data from single cells that share an identical genetic background. We claim that the ratio of monoallelic to biallelic expression provides discriminating power for identifying doublets. We present a pipeline called BIRD (BIallelic Ratio for Doublets) that relies on heterologous genetic variations extracted from single-cell RNA-seq (scRNA-seq). For each dataset, doublets were artificially created from the actual data and used to train a predictive model. BIRD was applied to Smart-Seq data from 163 primary fibroblasts. The model achieved 100% accuracy in annotating the randomly simulated doublets. Bona fide doublets from female-origin fibroblasts were verified by unexpected biallelic expression from the X chromosome. On 10X Genomics microfluidics data from peripheral blood cells, BIRD achieved on average 83% (± 3.7%) accuracy with an area under the curve of 0.88 (± 0.04) for a collection of ∼13,300 single cells.
Conclusions BIRD addresses doublets formed from mixtures of cells that share an identical genetic background and cell identity. Maximal performance is achieved with high-coverage data. Success in identifying doublets is data specific and varies with the experimental methodology, the genomic diversity between haplotypes, and sequence coverage and depth.
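The idea described in the Results can be illustrated with a minimal sketch. It is not the BIRD implementation: the function names, the `min_reads` threshold, and the toy read counts are all assumptions made for illustration. Each cell is represented as a list of (reference reads, alternate reads) pairs at a shared set of variant sites, and an artificial doublet is created by summing the counts of two cells.

```python
def biallelic_ratio(sites, min_reads=1):
    """Fraction of expressed sites at which both alleles are detected.

    `sites` is a list of (ref_reads, alt_reads) pairs. `min_reads` is a
    hypothetical detection threshold, not a value taken from the paper.
    """
    expressed = [(r, a) for r, a in sites if r + a >= min_reads]
    if not expressed:
        return 0.0
    biallelic = sum(1 for r, a in expressed if r >= min_reads and a >= min_reads)
    return biallelic / len(expressed)


def simulate_doublet(cell_a, cell_b):
    """Create an artificial doublet by summing per-site allele counts."""
    return [(ra + rb, aa + ab) for (ra, aa), (rb, ab) in zip(cell_a, cell_b)]


# Toy data (assumed): two single cells with mostly monoallelic expression.
cell_a = [(5, 0), (0, 4), (3, 0), (0, 0), (2, 2)]
cell_b = [(0, 4), (6, 0), (0, 2), (3, 0), (1, 1)]
doublet = simulate_doublet(cell_a, cell_b)

ratio_a = biallelic_ratio(cell_a)
ratio_b = biallelic_ratio(cell_b)
ratio_doublet = biallelic_ratio(doublet)
print(ratio_a, ratio_b, ratio_doublet)
```

In this toy case the simulated doublet shows biallelic expression at more sites than either parent cell, which is the kind of signal a predictive model could be trained on using labeled singlets and simulated doublets.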
Footnotes
KWK: kerem.wainer{at}mail.huji.ac.il