Abstract
Background Recent advances in single-cell gene expression profiling technology have revolutionized the understanding of molecular processes underlying developmental cell and tissue differentiation, enabling the discovery of novel cell types and molecular markers that characterize developmental trajectories. Common approaches for identifying marker genes rely on pairwise statistical tests for differential gene expression between cell types in heterogeneous cell populations. This is challenging because unequal sample sizes and unequal variances between groups lead to low statistical power and inflated type I error rates.
Results We developed an alternative feature extraction method, Marker gene Identification for Cell-type Identity (MICTI), that encodes cell-type-specific expression information for each gene in every single cell. This approach identifies features (genes) that are specific to a given cell type in a heterogeneous cell population. To validate this approach, we used (i) simulated single-cell RNA-seq data, (ii) human pancreatic islet single-cell RNA-seq data, and (iii) a simulated mixture of human single-cell RNA-seq data related to immune cells, particularly B cells, CD4+ memory cells, CD8+ memory cells, dendritic cells, fibroblast cells, and lymphoblast cells. In all cases, we identified established cell-type-specific markers.
Conclusions Our approach offers a fast and efficient alternative to differential expression analysis for identifying molecular markers in heterogeneous single-cell RNA-seq data.
List of abbreviations
- MICTI: Marker gene Identification for Cell-type Identity
- DE: Differential Expression
- MAST: Model-based Analysis of Single-cell Transcriptomics
- ROTS: Reproducibility-Optimized Test Statistic
- BPSC: Beta-Poisson model for Single-Cell RNA-seq data analyses
- EC2: Elastic Compute Cloud
- DGE: Differential Gene Expression
- TPM: Transcripts Per Million
- RPKM: Reads Per Kilobase per Million mapped reads
- GEO: Gene Expression Omnibus
- UMI: Unique Molecular Identifier
- SC3: Single-Cell Consensus Clustering
- LDA: Latent Dirichlet Allocation
- PCA: Principal Component Analysis
- ICA: Independent Component Analysis
- TF-IDF: Term Frequency-Inverse Document Frequency
- NMF: Non-negative Matrix Factorization
- t-SNE: t-Distributed Stochastic Neighbor Embedding
- MST: Minimum Spanning Tree
- TSCAN: Tools for Single-Cell Analysis
- NB: Negative Binomial
- FACS: Fluorescence-Activated Cell Sorting
- PCR: Polymerase Chain Reaction
- cDNA: complementary DNA
- scRNA-seq: Single-Cell RNA Sequencing