The confluence of microfluidic and sequencing technologies has enabled profiling of the transcriptome1,2, epigenome3, and chromatin conformation of single cells4 at an unprecedented scale. Initial applications of single cell RNA-sequencing have characterized cellular heterogeneity in tumors5, 6, tissues7, 8, and response to stimulation9. More recently, droplet-based technologies have significantly increased the throughput of single cell capture and library preparation1, 10, enabling transcriptome sequencing of thousands of cells from one microfluidic reaction.
While improvements in biochemistry11, 12 and microfluidics13, 14 continue to increase the number of cells that can be sequenced per sample, for many applications (e.g. differential expression and genetic studies), sequencing thousands of cells each from many individuals would better capture interindividual variability than sequencing more cells from a few individuals. However, in standard workflows, running a separate microfluidic reaction for each sample remains cost prohibitive15. Multiplexing could significantly reduce the per sample cost by allowing cells from several individuals to be processed simultaneously, and reduce the per cell cost by allowing higher flow rates due to the ability to detect and exclude doublets that contain cells from two different individuals. Further, sample multiplexing limits the technical variability associated with sample and library preparation, improving statistical power to accurately estimate true biological effects16.
We present a simple experimental design and computational algorithm, demuxlet, to multiplex samples in dscRNA-seq without additional experimental modification (Fig. 1A). While strategies to demultiplex cells from different species1,10,17 or host and graft samples have been reported, no method is available for simultaneous demultiplexing and doublet detection of cells from > 2 individuals. Inspired by models and algorithms developed for contamination detection in DNA sequencing data18, demuxlet is fast, accurate, scalable and works with standard input formats17,19,20.
At the heart of our strategy is a statistical model for predicting the probability of observing a consistent 'genetic barcode', a set of single nucleotide polymorphisms (SNPs), in the RNA-seq reads of a single cell and the genotypes (from SNP genotyping, imputation or DNA sequencing) of donor samples. The model accounts for the base quality score of the RNA-sequencing reads as previously described18 and genotype uncertainties at unobserved SNPs from imputation to large reference panels21. It then uses maximum likelihood to determine the most likely sample identity for each cell using a mixture model. A small number of reads overlapping common SNPs is sufficient to accurately identify the sample of origin. For a pool of 8 samples, 4 SNPs can uniquely assign a cell to the donor of origin (Fig. 1B), and 20 SNPs each with minor allele frequency (MAF) of 50% can distinguish every sample with 98% probability.
The mixture model in demuxlet also uses genetic information to identify doublets containing two cells from different individuals, which comprise most droplets containing multiple cells. By multiplexing even a small number of samples, a doublet will have a high probability (1 − 1/N, e.g. 87.5% for N = 8 samples) of containing cells from two individuals which is detectable by the demuxlet model (Fig. 1C). The ability to recover the sample identity of each cell ("demuxing") and identify most doublets enables experimental designs that significantly increase the per sample throughput of current dscRNA-seq workflows.
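The detectable fraction quoted above follows from a short calculation, sketched here in Python. The function name and the generalization to unequal pooling proportions are illustrative assumptions; the text itself only states the equal-mixing case, 1 − 1/N.

```python
def detectable_doublet_fraction(proportions):
    """Chance that two independently drawn cells come from different individuals.

    `proportions` holds each individual's share of the pool and must sum to 1.
    For N equally mixed individuals this reduces to 1 - 1/N.
    """
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    # A doublet is undetectable by genotype when both cells share a donor.
    same_individual = sum(p * p for p in proportions)
    return 1.0 - same_individual


for n in (2, 4, 8, 16):
    equal_mix = [1.0 / n] * n
    print(n, round(detectable_doublet_fraction(equal_mix), 4))
```

Under equal mixing, each added sample gives a smaller gain: moving from 4 to 8 samples raises the detectable fraction from 75% to 87.5%.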
We first assess the feasibility of our strategy and the performance of demuxlet by analyzing multiplexed peripheral blood mononuclear cells (PBMCs) from 8 patients with systemic lupus erythematosus (SLE). Using a sequential pooling strategy, three pools of equimolar concentrations of cells were generated (W1: patients S1-S4, W2: patients S5-S8 and W3: patients S1-S8) and each loaded in a well on a 10X Chromium Single Cell instrument (Fig. 2A). 3,645, 4,254 and 6,205 single cells were obtained from each well and sequenced to an average depth of 23k, 17k and 13k reads per cell.
Demuxlet identified 91% (3332/3645), 91% (3864/4254), and 86% (5348/6205) of droplets as singlets from wells W1, W2 and W3, respectively. 25% (+/− 2.6%), 25% (+/− 4.6%) and 12.5% (+/− 1.4%) of singlets from wells W1, W2 and W3 mapped to each donor, consistent with equal mixing of 8 individuals. We estimate an error rate (number of cells assigned to individuals not in the mixture) of 2/3332 (W1) and 0/3864 (W2) singlets by analyzing wells W1 and W2, each containing cells from two disjoint sets of 4 individuals (Fig. 2B), suggesting > 99% of singlets were assigned to individuals correctly.
We next assess the ability of demuxlet to detect doublets in both simulated and real data. 466/3645 (13%) cells were simulated as synthetic doublets by setting the cellular barcodes of two sets of 466 cells from individuals S1 and S2 to be the same. Applied to the simulated data, demuxlet identified 91% (426/466) of synthetic doublets as doublets or ambiguous, correctly recovering the sample identity of both cells in 403/426 (95%) doublets (fig. S1). Applied to real data from W1, W2 and W3, demuxlet identified 138/3645, 165/4254, and 384/6205 doublets. After dividing these detected fractions by the expected detectable proportion 1 − 1/N (N = 4 for W1 and W2, N = 8 for W3), they correspond to estimated doublet rates of 5.0%, 5.2% and 7.1%, consistent with the linear relationship between the number of cells sequenced and doublet rates estimated using a mixed species experiment (Fig. 2C).
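The raw detected fractions (for example 138/3645 ≈ 3.8%) are smaller than the quoted doublet rates. One reading that reproduces the quoted figures is that detected doublets are scaled up to account for same-individual doublets, which genotypes cannot reveal. The sketch below makes that assumption explicit; the function name is hypothetical and the correction is our interpretation rather than a procedure stated in the text.

```python
def estimated_doublet_rate(detected, droplets, n_samples):
    """Scale the detected doublet fraction by the detectable proportion 1 - 1/N.

    Assumes equal mixing of n_samples individuals, so that a fraction 1/N of
    all doublets pairs two cells from the same donor and goes undetected.
    """
    detected_fraction = detected / droplets
    return detected_fraction / (1.0 - 1.0 / n_samples)


# (detected doublets, droplets, individuals in the well), taken from the text
wells = {"W1": (138, 3645, 4), "W2": (165, 4254, 4), "W3": (384, 6205, 8)}
for name, (detected, droplets, n) in wells.items():
    rate = estimated_doublet_rate(detected, droplets, n)
    print(f"{name}: {100 * rate:.1f}%")
```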
Sample demultiplexing enables individual-specific visualization of single cell data we call 'drop prints'. While both variability in cell type proportion and gene expression have been previously observed in PBMCs, it has not been possible to fully control for batch effects due to separate processing of samples22, 23. Singlets identified by demuxlet in all three wells cluster into known PBMC subpopulations (Fig. 2D) and are not confounded by well to well effects (fig. S2A). While we found 6 differentially expressed genes (FDR < 0.05) between wells W1 and W2, only 2 genes were differentially expressed in well W3 between W1 and W2 individuals (FDR < 0.05) (fig. S2B) suggesting sample multiplexing could reduce confounding such as library preparation batch effects. Furthermore, for the same individuals, drop prints from two different wells are qualitatively consistent, the estimates of cell type proportions for the same individuals in W1 or W2 and W3 are highly correlated (R = 0.99) (Fig. 2E and fig. S3), and the inferred cell type-specific expression profiles are correlated with bulk sequencing of sorted cell populations (R=0.76-0.92) (fig. S4). These results demonstrate that demuxlet recovers the sample identity of single cells with high accuracy, identifies doublets at the expected rate, and can allow for comparison of individuals within and across wells.
Demuxlet enables multiplexed experimental designs that increase the sample throughput for profiling of interindividual responses across a variety of conditions. We applied such a multiplexing strategy to characterize cell type-specific responses to IFN-β, a potent cytokine that induces genome-scale changes in the transcriptional profiles of immune cells24, 25. From 8 lupus patients, 1M PBMCs each were isolated, sequentially pooled, and divided in two aliquots. One sample was activated with recombinant IFN-β for 6 hours, a time point we previously found to maximize the expression of interferon-sensitive genes (ISGs) in dendritic cells (DCs) and T cells26, 27. A matched control sample was also cultured for 6 hours. From this experiment, we captured and sequenced 14,619 control and 14,446 stimulated cells.
In control and stimulated experiments, demuxlet identified 83% (12138/14619) and 84% (12167/14446) of droplets as singlets, and recovered the sample identity of 99% (12127/12138 and 12155/12167) of singlets. Detected doublets form distinct clusters in t-SNE space at the periphery of other cell types, indicative of the expected enrichment of doublets for mixed cell types in a heterogeneous population (fig. S5). The estimated doublet rate of 10.9% is consistent with predicted rates based on the number of cells recovered, and the observed proportion of doublets from each pair of individuals is highly correlated with the expected proportions (R=0.98) (Fig. 2C and fig. S6).
Demultiplexing individuals enables the use of the 8 samples within a pool as biological replicates to quantitatively assess cell type-specific responses to IFN-β stimulation. Consistent with previous reports from bulk RNA-sequencing data, IFN-β stimulation induces widespread transcriptomic changes observed as a shift in the t-SNE projections of singlets (Fig. 3A)24. After assigning each singlet to a reference cell population, we identified 2,686 differentially expressed genes (logFC > 2, FDR < 0.05) in at least one cell type in response to IFN-β stimulation (table S1). These genes cluster into modules of cell type-specific responses enriched for distinct gene regulatory processes (Fig. 3B, table S2). For example, the two clusters of upregulated genes, pan-leukocyte (Cluster III: 401 genes, logFC > 2, FDR < 0.05) and CD14+ specific (Cluster I: 767 genes, logFC > 2, FDR < 0.05), were enriched for general antiviral response (e.g. KEGG Influenza A: Cluster III P < 1.6×10^−5), chemokine signaling (Cluster I P < 7.6×10^−3) and genes implicated in SLE (Cluster I P < 4.4×10^−3). The five clusters of downregulated genes were enriched for antibacterial response (KEGG Legionellosis: Cluster II monocyte down P < 5.5×10^−3) and natural killer cell mediated cytotoxicity (Cluster IV NK/Th cell down: P < 3.6×10^−2). Differential expression analysis using cell type-specific estimates from single cell data thus recovers known gene regulatory programs affected by interferon stimulation.
We next characterize interindividual variability in PBMC expression at baseline and in response to IFN-β stimulation. In both control and stimulated cells, the variance of mean expression among individuals is substantially higher than expected from synthetic replicates (Fig. 3C). As previously reported22,28, cell type proportion varied significantly among individuals and contributes to variability in gene expression (fig. S7). The variance estimated from synthetic replicates with matched cell type proportions is more concordant with the observed variance (Lin’s concordance = 0.54 versus 0.022, Pearson correlation = 0.78 versus 0.69, Fig. 3C-D). However, comparing mean expression from synthetic replicates within cell types (Lin's concordance = 0.007 - 0.20, Pearson correlation = 0.27 − 0.68) shows that there is interindividual variability not explained simply by cell type proportion (fig. S8).
We then explored interindividual variability in expression within one cell type, CD14+CD16- monocytes. The correlation of mean expression between pairs of synthetic replicates from the same individual (>99%) was greater than between different individuals (∼97%), indicating variation beyond sampling (Fig. 3E). By correlating the synthetic replicates across individuals, we found 585 genes with significant interindividual variability in stimulated CD14+CD16- monocytes and 827 in control (Pearson correlation, FDR < 0.05). The variable genes in stimulated CD14+CD16- monocytes and, to a lesser extent, in CD4+ T cells (P < 9.3×10^−4 and 4.5×10^−2, hypergeometric test, Fig. 3F) are enriched for differentially expressed genes, consistent with our previous discovery of more IFN-β response-eQTLs in monocyte-derived dendritic cells than CD4+ T cells26,27. We hypothesize that natural genetic variation could explain interindividual variability in gene expression in our multiplexed data. For example, schlafen family member 5 (SLFN5) and guanylate binding protein 3 (GBP3) expression are highly correlated between replicates after IFN-β stimulation (R = 0.92, P < 0.0011 and R = 0.80, P < 0.017, respectively). The average expression of the two synthetic replicates is associated with known eQTLs in CD14+ monocytes and lymphoblastoid cell lines, respectively (SLFN5: rs11080327 P < 3.1×10^−4, GBP3: rs10493821 P < 2.1×10^−2, Fig. 3G)26,29. These results suggest that single cell sequencing recovers repeatable interindividual variation in gene expression and that, for two genes, this variation is associated with known genetic determinants.
We introduce demuxlet, a new computational method that enables simple and efficient sample multiplexing for dscRNA-seq, validate its performance in simulated and real data, and characterize single cell expression of PBMCs from SLE patients in several different conditions. Our results demonstrate demuxlet provides reliable estimation of cell type proportion across individuals, recovers cell type-specific transcriptional programs from mixed populations consistent with previous reports, and identifies genes with interindividual variability24. The capability to demultiplex and identify doublets using natural genetic variation significantly reduces the per-sample and per-cell cost of single-cell RNA-sequencing, does not require synthetic barcodes or split-pool strategies30-34, and captures biological variability among individual samples while limiting the effects of unwanted technical variability.
The application of single cell sequencing methods such as dscRNA-seq to larger numbers of individuals is a promising approach to characterizing cellular heterogeneity among individuals at baseline and in different environmental conditions, a crucial area for further understanding of health and disease35-37. Experimental and computational methods for reliable and efficient sample multiplexing could enable broad adoption of droplet-based RNA-seq for population-scale studies, facilitating genetic and longitudinal analyses in relevant cell types and conditions across a range of sampled individuals38.
Methods
Identifying the sample identity of each single cell
We first describe the method to infer the sample identity of each cell in the absence of doublets. Consider RNA-sequence reads from $C$ barcoded droplets multiplexed across $S$ different samples, whose genotypes are available across $V$ exonic variants. Let $d_{cv}$ be the number of unique reads overlapping the $v$-th variant from the $c$-th droplet. Let $b_{cvi} \in \{R, A, O\}$, $i \in \{1, \ldots, d_{cv}\}$, be the variant-overlapping base call from the $i$-th read, representing reference (R), alternate (A), and other (O) alleles respectively. Let $e_{cvi} \in \{0, 1\}$ be a latent variable indicating whether the base call is correct (0) or not (1). Given $e_{cvi} = 0$, $b_{cvi} \in \{R, A\}$ and the alternate allele is observed with probability $g/2$, i.e. $\mathbb{1}[b_{cvi} = A] \sim \text{Binomial}(1, g/2)$, where $g \in \{0, 1, 2\}$ is the true genotype of the sample corresponding to the $c$-th droplet at the $v$-th variant. When $e_{cvi} = 1$, we assume that $\Pr(b_{cvi} \mid g, e_{cvi})$ follows table S3. $e_{cvi}$ is assumed to follow $\text{Bernoulli}(10^{-q_{cvi}/10})$, where $q_{cvi}$ is the phred-scale quality score of the observed base call.
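We allow uncertainty of observed genotypes at the $v$-th variant for the $s$-th sample using $\Pr(g \mid \text{Data}_{sv})$, the posterior probability of a possible genotype $g$ given external DNA data $\text{Data}_{sv}$ (e.g. sequence reads, imputed genotypes, or array-based genotypes). If a genotype likelihood $\Pr(\text{Data}_{sv} \mid g)$ is provided instead (e.g. from unphased sequence reads), it can be converted to a posterior probability using $\Pr(g \mid \text{Data}_{sv}) \propto \Pr(\text{Data}_{sv} \mid g)\,\Pr(g)$, where $\Pr(g) \sim \text{Binomial}(2, p_v)$ and $p_v$ is the population allele frequency of the alternate allele. To allow errors $\varepsilon$ in the posterior probability, we replace it with $(1 - \varepsilon)\Pr(g \mid \text{Data}_{sv}) + \varepsilon \Pr(g)$. The overall likelihood that the $c$-th droplet originated from the $s$-th sample is

$$L_c(s) = \prod_{v=1}^{V} \left[ \sum_{g=0}^{2} \left\{ \prod_{i=1}^{d_{cv}} \left( \sum_{e=0}^{1} \Pr(b_{cvi} \mid g, e)\, \Pr(e \mid q_{cvi}) \right) \right\} \Pr(g \mid \text{Data}_{sv}) \right] \qquad (1)$$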
In the absence of doublets, we use maximum likelihood to determine the best-matching sample as $\arg\max_s L_c(s)$.
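The single-sample assignment described in this section can be sketched in Python as follows. This is a minimal illustration rather than the demuxlet implementation. Several pieces are assumptions: erroneous base calls are taken to be uniform over R, A and O as a stand-in for table S3, the error allowance is read as mixing the posterior with the prior using weight ε, and all function names and toy data are hypothetical.

```python
import math


def hwe_prior(p_alt):
    """Pr(g) for g = 0, 1, 2 alternate alleles under Binomial(2, p_alt)."""
    return [(1 - p_alt) ** 2, 2 * p_alt * (1 - p_alt), p_alt ** 2]


def genotype_posterior(genotype_likelihoods, p_alt, eps=0.01):
    """Posterior over g from Pr(Data | g), mixed with the prior to allow errors."""
    prior = hwe_prior(p_alt)
    joint = [lik * pr for lik, pr in zip(genotype_likelihoods, prior)]
    total = sum(joint)
    return [(1 - eps) * j / total + eps * pr for j, pr in zip(joint, prior)]


def read_prob(base, qual, g):
    """Pr(base | g), summed over the error indicator e with Pr(e=1) = 10^(-q/10)."""
    err = 10 ** (-qual / 10)
    correct = {"R": 1 - g / 2, "A": g / 2}.get(base, 0.0)
    return (1 - err) * correct + err / 3  # assumption: errors uniform over R/A/O


def droplet_loglik(reads, posteriors):
    """log L_c(s): reads = {variant: [(base, qual), ...]}, posteriors = {variant: [P(g)]}."""
    loglik = 0.0
    for variant, calls in reads.items():
        lik = sum(
            posteriors[variant][g] * math.prod(read_prob(b, q, g) for b, q in calls)
            for g in range(3)
        )
        loglik += math.log(lik)
    return loglik


def best_sample(reads, posteriors_by_sample):
    """Return the sample that maximizes the droplet likelihood."""
    return max(posteriors_by_sample,
               key=lambda s: droplet_loglik(reads, posteriors_by_sample[s]))


def called(g, p_alt=0.5):
    """Toy helper: a confidently called genotype turned into a smoothed posterior."""
    return genotype_posterior([1.0 if k == g else 0.0 for k in range(3)], p_alt)


# Made-up example: two donors, three variants, four reads from one droplet.
donors = {
    "donorA": {"v1": called(0), "v2": called(2), "v3": called(1)},
    "donorB": {"v1": called(2), "v2": called(0), "v3": called(1)},
}
reads = {"v1": [("R", 30), ("R", 30)], "v2": [("A", 30)], "v3": [("A", 20)]}
print(best_sample(reads, donors))
```

Working in log space keeps the product over variants from underflowing when a droplet has reads at many variants.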
Screening for droplets containing multiple samples
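To identify doublets, we implement a mixture model to calculate the likelihood that the sequence reads originated from two individuals, and the likelihoods are compared to determine whether a droplet contains cells from one or two samples. If sequence reads from the $c$-th droplet originate from two different samples, $s_1, s_2$, with mixing proportions $(1 - \alpha) : \alpha$, then the likelihood in (1) can be represented as the following mixture distribution18,

$$L_c(s_1, s_2, \alpha) = \prod_{v=1}^{V} \left[ \sum_{g_1, g_2} \left\{ \prod_{i=1}^{d_{cv}} \left( \sum_{e=0}^{1} \left[ (1 - \alpha) \Pr(b_{cvi} \mid g_1, e) + \alpha \Pr(b_{cvi} \mid g_2, e) \right] \Pr(e \mid q_{cvi}) \right) \right\} \Pr(g_1 \mid \text{Data}_{s_1 v})\, \Pr(g_2 \mid \text{Data}_{s_2 v}) \right]$$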
To reduce the computational cost, we consider discrete values of $\alpha \in \{\alpha_1, \cdots, \alpha_M\}$ (e.g. 5% to 50% in steps of 5%). We determine that a droplet is a doublet between samples $s_1, s_2$ if and only if $\frac{\max_{s_1, s_2, \alpha} L_c(s_1, s_2, \alpha)}{\max_s L_c(s)} \geq t$, and the most likely mixing proportion is estimated to be $\arg\max_\alpha L_c(s_1, s_2, \alpha)$. We determine that the droplet contains only a single individual $s$ if $\frac{\max_{s_1, s_2, \alpha} L_c(s_1, s_2, \alpha)}{\max_s L_c(s)} \leq \frac{1}{t}$. We classify the remaining, less confident droplets as ambiguous. While we consider only doublets for estimating doublet rates, we remove all doublets and ambiguous droplets to conservatively estimate singlets. Figure S1 illustrates the distribution of singlet and doublet likelihoods and the decision boundaries when $t = 2$ was used.
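The doublet screen can be sketched in Python under stated assumptions. The decision rule is assumed to compare the ratio of the best doublet likelihood to the best singlet likelihood against t and 1/t, erroneous base calls are assumed uniform over R, A and O in place of table S3, and the function names and toy genotypes are hypothetical. It is an illustration of the mixture model, not the demuxlet code.

```python
import itertools
import math

ALPHAS = [a / 100 for a in range(5, 55, 5)]  # 0.05, 0.10, ..., 0.50


def read_prob(base, qual, g):
    """Pr(base | g), summed over the error indicator with Pr(error) = 10^(-q/10)."""
    err = 10 ** (-qual / 10)
    correct = {"R": 1 - g / 2, "A": g / 2}.get(base, 0.0)
    return (1 - err) * correct + err / 3  # assumption: errors uniform over R/A/O


def singlet_loglik(reads, post):
    total = 0.0
    for v, calls in reads.items():
        lik = sum(post[v][g] * math.prod(read_prob(b, q, g) for b, q in calls)
                  for g in range(3))
        total += math.log(lik)
    return total


def doublet_loglik(reads, post1, post2, alpha):
    """Mixture likelihood with proportions (1 - alpha) : alpha for samples 1 and 2."""
    total = 0.0
    for v, calls in reads.items():
        lik = 0.0
        for g1, g2 in itertools.product(range(3), repeat=2):
            mix = math.prod(
                (1 - alpha) * read_prob(b, q, g1) + alpha * read_prob(b, q, g2)
                for b, q in calls
            )
            lik += mix * post1[v][g1] * post2[v][g2]
        total += math.log(lik)
    return total


def classify(reads, posteriors, t=2.0):
    """Return (label, samples, alpha) using likelihood-ratio thresholds t and 1/t."""
    best_s = max(posteriors, key=lambda s: singlet_loglik(reads, posteriors[s]))
    s_ll = singlet_loglik(reads, posteriors[best_s])
    d_ll, pair, alpha = max(
        (doublet_loglik(reads, posteriors[s1], posteriors[s2], a), (s1, s2), a)
        for s1, s2 in itertools.permutations(posteriors, 2)
        for a in ALPHAS
    )
    log_ratio = d_ll - s_ll
    if log_ratio >= math.log(t):
        return "doublet", pair, alpha
    if log_ratio <= -math.log(t):
        return "singlet", (best_s,), None
    return "ambiguous", None, None


def near_certain(g):
    """Toy posterior that puts 0.98 on genotype g and 0.01 on each alternative."""
    return [0.98 if k == g else 0.01 for k in range(3)]


# Made-up genotypes for two donors at four variants.
posteriors = {
    "A": {v: near_certain(g) for v, g in zip("wxyz", (0, 0, 2, 2))},
    "B": {v: near_certain(g) for v, g in zip("wxyz", (2, 2, 0, 0))},
}
singlet_reads = {"w": [("R", 30)] * 8, "x": [("R", 30)] * 8,
                 "y": [("A", 30)] * 8, "z": [("A", 30)] * 8}
doublet_reads = {v: [("R", 30)] * 4 + [("A", 30)] * 4 for v in "wxyz"}
print(classify(singlet_reads, posteriors))
print(classify(doublet_reads, posteriors)[0])
```

With only a few reads, a droplet that matches one donor can still land in the ambiguous class, because a 5% contribution from a second donor is hard to rule out. This is why the toy singlet carries several reads per variant.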
Isolation and preparation of PBMC samples
Peripheral blood mononuclear cells were isolated from patient donors, Ficoll separated, and cryopreserved by the UCSF Core Immunologic Laboratory (CIL). PBMCs were thawed in a 37°C water bath, and subsequently washed and resuspended in EasySep buffer. Cells were treated with DNase I and incubated for 15 min at room temperature before filtering through a 40 μm column. Finally, the cells were washed in EasySep buffer and resuspended in 1× PBS and 0.04% bovine serum albumin. Cells from 8 donors were then re-concentrated to 1M cells per mL and then serially pooled. At each pooling stage, 1M cells per mL were combined to result in a final sample pool with cells from all donors.
IFN-β stimulation and culture
Prior to pooling, samples from 8 individuals were separated into two aliquots each. One aliquot of PBMCs was activated by 100 U/mL of recombinant IFN-β (PBL Assay Science) for 6 hours according to the published protocol26. The second aliquot was left untreated. After 6 hours, the 8 samples for each condition were pooled together in two final pools (stimulated cells and control cells) as described above.
Droplet-based capture and sequencing
Cellular suspensions were loaded onto the 10X Chromium instrument (10X Genomics) and sequenced as described in Zheng et al.17 The cDNA libraries were sequenced using a custom program on 10 lanes of Illumina HiSeq 2500 Rapid Mode, yielding 1.8B total reads and 25K reads per cell. At these depths, we recovered > 90% of captured transcripts in each sequencing experiment.
Bulk isolation and sequencing
PBMCs from lupus patients were isolated and prepared as described above. Once resuspended in EasySep buffer, the EasyEights Magnet was used to sequentially isolate CD14+ (using the EasySep Human CD14 positive selection kit II, cat #17858), CD19+ (using the EasySep Human CD19 positive selection kit II, cat #17854), CD8+ (using the EasySep Human CD8 positive selection kit II, cat #17853), and CD4+ cells (using the EasySep Human CD4 T cell negative isolation kit, cat #17952) according to the kit protocol. RNA was extracted using the RNeasy Mini kit (#74104), and reverse transcription and tagmentation were conducted according to Picelli et al. using the SmartSeq2 protocol39, 40. After cDNA synthesis and tagmentation, the library was amplified with the Nextera XT DNA Sample Preparation Kit (#FC-131-1096) according to protocol, starting with 0.2 ng of cDNA. Samples were then sequenced on one lane of the Illumina HiSeq 4000 with paired end 100 bp read length, yielding 350M total reads.
Alignment and initial processing of single cell sequencing data
We used the CellRanger v1.1 and v1.2 software with the default settings to process the raw FASTQ files, align the sequencing reads to the hg19 transcriptome, and generate a filtered UMI expression profile for each cell17. The raw UMI counts from all cells and genes with nonzero counts across the population of cells were used to generate t-SNE profiles.
Cell type classification and clustering
To identify known immune cell populations in PBMCs, we used the Seurat package to perform unbiased clustering on the 2.7k PBMCs from Zheng et al., following the publicly available Guided Clustering Tutorial17,41. The FindAllMarkers function was then used to find the top 20 markers for each of the 8 identified cell types. Cluster averages were calculated by taking the average raw count across all cells of each cell type. For each cell, we calculated the Spearman correlation between the raw counts of the marker genes and each set of cluster averages, and assigned the cell to the cell type with which it had maximum correlation.
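The correlation-based assignment can be sketched in plain Python as below. The marker list, cluster averages and cell counts are made-up illustrative values, and the function names are hypothetical. The sketch assumes that neither the cell nor any cluster profile is constant across the marker genes, since the correlation is undefined in that case.

```python
import math


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of ranks."""
    return pearson(ranks(x), ranks(y))


def assign_cell_type(cell_counts, cluster_averages):
    """Pick the cell type whose marker profile correlates best with the cell."""
    return max(cluster_averages,
               key=lambda ct: spearman(cell_counts, cluster_averages[ct]))


# Marker order: CD14, LYZ, CD3E, IL7R, MS4A1, CD79A (toy values, not from the paper)
cluster_averages = {
    "monocyte": [8.0, 12.0, 0.1, 0.2, 0.1, 0.05],
    "T cell": [0.2, 0.5, 5.0, 4.0, 0.1, 0.1],
    "B cell": [0.1, 0.6, 0.1, 0.2, 6.0, 5.0],
}
cell = [0, 1, 3, 2, 0, 0]  # raw UMI counts of the marker genes in one cell
print(assign_cell_type(cell, cluster_averages))
```

A rank-based correlation suits sparse single cell counts because it depends only on the ordering of marker genes, not on sequencing depth.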
Differential expression analysis
Demultiplexed individuals were used as replicates for differential expression analysis. For each gene, raw counts were summed for each individual. We used the DESeq2 package to detect differentially expressed genes between control and stimulated conditions42. Only genes with baseMean > 1 were retained from the DESeq2 output, and the qvalue package was used to estimate the FDR, with genes at FDR < 0.05 called significant43.
Estimation of interindividual variability in PBMCs
For each individual, we found the mean expression of each gene with nonzero counts. The mean was calculated from the log2 single cell UMI counts normalized to the median count for each cell. To measure interindividual variability, we then calculated the variance of the mean expression across all individuals. Lin’s concordance correlation coefficient was used to compare the agreement of observed data and synthetic replicates. Synthetic replicates were generated by sampling without replacement either from all cells or cells matched for cell type proportion.
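The two summary statistics used here can be sketched in Python. The sketch assumes expression values are already normalized and log-transformed, uses the sample variance across individuals, and computes Lin's concordance with population (1/n) moments. The input layout and function names are hypothetical.

```python
import statistics


def variance_of_individual_means(expr):
    """expr = {individual: {gene: [per-cell normalized expression values]}}.

    Returns {gene: variance across individuals of the per-individual mean}.
    """
    shared_genes = set.intersection(*(set(genes) for genes in expr.values()))
    result = {}
    for gene in sorted(shared_genes):
        means = [statistics.fmean(expr[ind][gene]) for ind in expr]
        result[gene] = statistics.variance(means)  # sample variance (assumption)
    return result


def lins_concordance(x, y):
    """Lin's concordance: 2*cov / (var_x + var_y + (mean_x - mean_y)^2)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    vx, vy = statistics.pvariance(x), statistics.pvariance(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return 2 * cov / (vx + vy + (mx - my) ** 2)


observed = [1.0, 2.0, 3.0, 4.0]
shifted = [v + 1.0 for v in observed]
print(lins_concordance(observed, observed))
print(lins_concordance(observed, shifted))
```

Unlike the Pearson correlation, the concordance coefficient penalizes a systematic offset: the shifted series is perfectly correlated with the original but has a concordance below 1. This may help explain why the concordance and Pearson values reported in the main text differ so much.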
Estimation of interindividual variability within cell types
For each cell type, we generated two bulk equivalent replicates for each individual by summing raw counts of cells sampled without replacement. We used DESeq2 to generate variance-stabilized counts across all replicates. To filter for expressed genes, we performed all subsequent analyses on genes with > 0 counts in at least 5% of samples. The correlation of replicates and QTL detection were performed on the log2 normalized counts. Pearson correlation of the two replicates from each of the 8 individuals was used to find genes with significant interindividual variability.
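The replicate construction for a single gene can be sketched in Python. Two simplifications are assumptions of this sketch: the cells of each individual are split into two disjoint halves of equal size, and log2(count + 1) stands in for the variance-stabilizing transformation from DESeq2. Function names and toy counts are hypothetical.

```python
import math
import random


def synthetic_replicates(cell_counts, rng):
    """Split one individual's cells into two disjoint halves and sum raw counts."""
    shuffled = rng.sample(cell_counts, len(cell_counts))  # draw without replacement
    half = len(shuffled) // 2
    return sum(shuffled[:half]), sum(shuffled[half:2 * half])


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def replicate_correlation(counts_by_individual, seed=0):
    """Correlate replicate 1 with replicate 2 across individuals for one gene."""
    rng = random.Random(seed)
    rep1, rep2 = [], []
    for individual in sorted(counts_by_individual):
        a, b = synthetic_replicates(counts_by_individual[individual], rng)
        rep1.append(math.log2(a + 1))  # stand-in for variance stabilization
        rep2.append(math.log2(b + 1))
    return pearson(rep1, rep2)


# Toy gene whose per-cell count differs between individuals but not within them.
toy_gene = {"S1": [1] * 10, "S2": [2] * 10, "S3": [4] * 10, "S4": [8] * 10}
print(replicate_correlation(toy_gene))
```

A gene whose expression differs between individuals but is stable within them yields replicate pairs that track each other, so a high correlation across individuals signals repeatable interindividual variability rather than sampling noise.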
Single cell and bulk RNA-sequencing data have been deposited in the Gene Expression Omnibus under the accession number GSE96583. Demuxlet software is freely available at https://github.com/hyunminkang/apigenome.