Abstract
While a highly diverse T-cell receptor (TCR) repertoire is the hallmark of a healthy adaptive immune system, relatively little is understood about how the CD4+ and CD8+ TCR repertoires differ from one another. We here utilize high-throughput single T-cell sequencing to obtain approximately 100,000 TCR αβ chain pairs from human subjects, stratified into CD4+ and CD8+ lineages. We reveal that substantial information about T-cell lineage is encoded by Vαβ gene pairs and, to a lesser extent, by several other TCR features such as CDR3 length and charge. We further find that the strength of association between the β chain and T-cell lineage is surprisingly weak, similar in strength to that of the α chain. Using machine learning classifiers to predict T-cell lineage from TCR features, we demonstrate that αβ chain pairs are significantly more informative than individual chains alone. These findings provide unprecedented insight into the CD4+ and CD8+ TCR repertoires and highlight the importance of αβ chain pairing in TCR function and specificity.
1. Introduction
During thymic positive selection, bipotent T-cell precursors differentiate into either CD4+ helper T-cell or CD8+ cytotoxic T-cell lineage. While this process is contingent upon the interaction of the heterodimeric αβ T-cell receptor (TCR) with either MHC class II or I, respectively, relatively little is currently known about the TCR features mediating this interaction1–3. One possible explanation posits the existence of germline-encoded sequences that have been hard-wired into the Variable (V) region’s CDR1 and CDR2 loops4–13. Recent support for such germline-bias includes the finding that expression levels of specific TCR V-regions are correlated with MHC polymorphisms 14. However, the role of the entire αβ chain sequence in specifying CD4+ and CD8+ repertoires has remained unknown.
While previous methods for paired αβ TCR sequencing have been developed 15–21, only recently have technological advances enabled high-throughput capture of paired αβ TCR sequences22–25. As both the α and β chains have been implicated in TCR binding of the peptide-MHC (pMHC) complex, it follows that such single-cell sequencing methods may reveal differences in the paired TCR repertoires between the two T-cell lineages26–32. Thus, in order to better understand the factors that influence T-cell differentiation, we addressed how the paired αβ TCR repertoires differ between the CD4+ and CD8+ T-cell populations.
2. Results
Overlap between the CD4+ and CD8+ repertoires
We previously employed a novel high-throughput, single-cell sequencing method to capture TCR pairs obtained from the peripheral blood of 5 healthy individuals24,33. In this study, we utilized another single-cell microfluidic platform (10x Genomics) 25 to add to this database and create the largest database of paired CD4+ and CD8+ TCR sequences to date (Sup. Figs. 1 and 2). Using this dataset comprising nearly 100,000 paired αβ TCR sequences, we first assessed the CD4+ and CD8+ TCR repertoire overlap.
Considering the unique set of TCR clonotypes (Vαβ and amino acid CDR3αβ) across all individuals, we found that the paired CD4+ and CD8+ repertoires were largely disjoint from one another. Splitting the paired repertoire into the constituent α and β populations resulted in considerably higher overlap between the two lineages (Fig. 1A-C). Next quantifying the overlap between the CD4+ and CD8+ TCR repertoires within each individual, we observed greater similarity between the CD4+ and CD8+ single chain repertoires than between the paired αβ repertoires (Fig. 1D). Previous findings have suggested that TCRs shared between individuals may have shorter CDR3β sequences34 and may be closer to germline recombination sequences than clonotypes found only in a single individual 35–37. Accordingly, TCR sequences shared between the CD4+ and CD8+ lineages were, on average, shorter than those found only in one of the two lineages with respect to the α (p=1.4×10−5), β (p=6.3×10−8) and αβ (p=9.3×10−6 by Mann-Whitney U test) repertoires (Fig. 1E and Sup. Fig. 3).
The decreased CD4+ and CD8+ repertoire overlap for αβ pairs relative to either single chain repertoire may reflect an increased specificity of αβ pairs for a given MHC class. As this explanation would be biologically consistent with previous structural findings implicating both chains in determining TCR-pMHC binding26–32, we further explored the extent to which αβ pairs could be used to provide additional information on T-cell lineage as opposed to either chain alone.
Association of VJ germline segment usage with CD4+-CD8+ status
Significant biases in V and J germline segment use between the single-chain CD4+ and CD8+ repertoires have been identified previously38–40. To further explore this, we calculated the frequency with which all Vα and Vβ regions were used by each individual (Fig. 2A-B). While variations in the usage statistics exist between individuals, our results are in general agreement with previous estimates (Sup Figs. 4-7) 41,42. The association between each V region and T-cell lineage was quantified by calculating the odds ratio38, revealing only weak associations between the usage of a particular Vα or Vβ segment and T-cell lineage (Fig. 2C-D). Weaker associations between T-cell lineage and single chain Jα and Jβ usage were also present (Sup. Fig. 8A-D). Interestingly, these associations for both V- and J-regions are significantly weaker than previously reported38.
The role of paired germline segment usage in biasing T-cell differentiation was examined by comparing the Vαβ and Jαβ paired distributions for both T-cell populations (Fig. 2E-F and Sup Fig. 8E-F). The CD4+:CD8+ odds ratio was then calculated for each germline pair (Sup. Figs. 9-11). Our results reveal 352 Vαβ and 70 Jαβ pairs associated with a significant (q<0.05) lineage specification bias (Fig. 2G and Sup Fig. 8G). Interestingly, the strength of association with T-cell lineage was significantly stronger for Vαβ pairs than for Jαβ pairs, likely reflecting the contribution of the CDR1 and CDR2 loops present in each V region to MHC binding43.
We further note the association between paired Vαβ and cell lineage was significantly stronger (CD4+: p=2.1×10−6, CD8+: p=6.3×10−10 by Mann-Whitney U test) than those associations found with the single chains individually (Fig. 2H-I). Similarly, the association between Jαβ pairs was significantly stronger (CD4+: p=9.8×10−7, CD8+: p=2.1×10−4 by Mann-Whitney U test) than those of either the α or β chain alone (Sup. Fig. 8H-I).
Biologically, this finding is consistent with the notion that both the α and β chains contribute substantially to TCR-pMHC binding26–32. These findings additionally highlight the importance of new single-cell methods that allow for the capture of paired αβ chains over traditional bulk-sequencing methods that allow only for the capture of individual chains.
CDR3 features are weakly associated with T-cell lineage
The TCR-pMHC interaction is also dependent upon the contributions of the CDR3 regions of both the α and β chains26–32, leading us to investigate the relationship between CDR3 sequence and T-cell lineage. Examining the frequency with which each amino acid occurred across the single-chain CDR3 repertoires shows strong differences between the α and β chains (Fig. 3A-D). This is likely due to the differences in amino acid usage in the α and β chain V(D)J germline regions. However, we observed only small differences in amino acid use between the CD4+ and CD8+ repertoires (Fig. 3E-F). Previous studies have observed an association between CDR3 net charge and T-cell lineage38,39, consistent with our findings that net CDR3 charge, but not CDR3 length, is associated with T-cell lineage for both the α and β chains (Fig. 3G-I and Sup. Fig. 12A-C).
We further examined the relationship of paired CDR3αβ charge and length with T-cell lineage (Fig. 3J-K and Sup. Fig. 12D-E). Again calculating the odds ratio, we found 21 CDR3αβ charge pairs and 14 CDR3αβ length pairs associated with a significant CD4+:CD8+ bias (Fig. 3L and Sup. Fig. 12F). We additionally observed that paired αβ chain lengths tend to be associated with stronger biases towards CD4+ status than either of the single chains alone (Sup. Fig. 12G). Surprisingly, however, no significant differences were observed in the strength of association between paired and single-chain CDR3 length for CD8+ status or for CDR3 charge for either CD4+ or CD8+ status (Sup. Figs. 12H and 13).
Paired chain sequences are more informative of CD4+-CD8+ status than single chains
In order to better understand the amount of information about CD4+ and CD8+ status encoded in the α, β, and αβ TCR sequences, we next quantified the mutual information33,45, corrected for finite sample sizes, between several TCR features and T-cell lineage (Table 1). Examining V and J usage, as well as CDR3 length, we find that paired sequences carry more information about lineage than either of the single chains alone. Particularly for Vαβ, we observe synergistic information46 in which the paired chains carry more information than the individual chains summed together.
We next investigated whether the use of paired sequences would better allow us to predict T-cell lineage from TCR features using machine learning classifiers. Using a multi-layer perceptron neural network classifier, we demonstrate that the α and β chain are both weakly informative of lineage and that paired TCR sequences carry substantially more information than either the α chain (p=9.0×10−5) or β chain (p=9.1×10−5 by Mann-Whitney U test) alone (Fig 4). Similar results were obtained using both support vector machine (SVM) and logistic regression classifiers (Sup. Fig. 14) 38,39. From a biological perspective, this finding is consistent with a mechanistic model in which both chains contribute to the TCR-pMHC interaction.
Of note is a previous report using an SVM classifier and CDR3 length-dependent parametrization to predict T-cell lineage from TCR sequences with greater than 90% accuracy 39. This approach, however, failed to achieve the same degree of predictive accuracy when using our dataset (Sup. Fig. 15). To better understand this finding, we compared the TCR sequences from this study 39 with those reported here and an additional bulk-sequencing TCRβ dataset40. We find that the aforementioned increased predictive accuracy is driven by anomalous Vβ and Jβ gene frequencies in the Li et al. dataset, possibly due to a lack of rigorous PCR correction, as compared with the other two datasets (Sup. Figs 16-18).
3. Conclusions
In summary, we have created the largest database of paired αβ TCR sequences to date. Our analysis of the healthy CD4+ and CD8+ TCR repertoires revealed systematic differences between the two T-cell populations, particularly in the utilization of Vαβ pairings. Furthermore, we have presented one of the first comprehensive analyses of the α chain repertoire, showing both chains are similarly informative of T-cell lineage. Finally, utilizing approaches from information theory and machine learning, we have shown that features of the paired αβ TCR are substantially more informative of lineage than individual chains. Our results thus provide new evidence for the role of germline-encoded TCR-pMHC interactions and implicate both chains as playing important roles in determining TCR interactions. We believe that the rigorous examination of the normal TCR repertoires presented in this study both demonstrates the utility of capturing αβ pairs in profiling the TCR repertoire and will prove to be valuable in understanding the perturbations caused by infectious, oncological and auto-immune disease states47–53.
4. Materials and Methods
Single-cell barcoding and sequencing
TCR sequences for subjects 1-5 were obtained from Grigaityte et al.33 In brief, peripheral blood mononuclear cells (PBMCs) were obtained from five healthy donors after appropriate informed consent. Blood samples then underwent a pan T-cell enrichment, were tagged with unique barcodes via a newly developed single-cell barcoding in emulsion technology24, and sequenced using an Illumina MiSeq sequencer. Raw sequences were processed using a custom pipeline33 to identify αβ pairs utilizing MiXCR 2.2.154 to identify V(D)J segments and annotate the CDR3 region of each TCR.
TCR sequences for Subject 6 were similarly obtained from a commercially purchased PBMC sample (ATCC PCS-800-011™) drawn from a healthy individual. CD4+ and CD8+ T-cell populations were separated using magnetic bead enrichment according to the manufacturer's protocol (EasySep Human T Cell Enrichment Kit, StemCell Technologies). The PBMC samples used in Grigaityte et al.33 for S1 and S3 were additionally obtained and sorted into CD4+ and CD8+ populations using fluorescence-activated cell sorting (Becton Dickinson FACSARIA SORP). For these samples, cells were barcoded in emulsion25 on the Chromium Controller with the Single Cell V(D)J reagent kit (10X Genomics) and sequenced using an Illumina HiSeq 2500 sequencer. Raw sequencing reads were processed using the computational pipeline previously described33.
The Li et al. dataset39 was provided by N.P. Weng as a processed datafile containing VJ segments and CDR3 amino acid sequences. The Emerson et al. dataset40 was downloaded from the Adaptive Biotechnologies open-access immuneACCESS database (https://clients.adaptivebiotech.com/immuneaccess). Of note, though the original study included TCR sequences obtained from both healthy subjects and patients with disease, only the 17 healthy samples are used here.
Data analysis
Following the processing described above, we generated text files containing information about V(D)J segment use and CDR3 nucleotide and amino acid sequence for each of the identified paired αβ TCR sequences (Supplemental Figures 1 and 2). As we are interested in identifying features that differ between the CD4+ and CD8+ TCR repertoires, and as clonal expansion of random clones in the CD4+ and CD8+ populations would bias our analysis of the factors that affect differentiation, we include each unique TCR clonotype only once in our final dataset. Here, we define a clonotype by the Vαβ regions used and the amino acid CDR3αβ sequences. We then identified TCR clonotypes that were shared between the CD4+ and CD8+ compartments.
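The clonotype collapsing step described above can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the `PairedTCR` record, the function names, and the toy gene names and CDR3 sequences are all hypothetical and chosen only to show the logic.

```python
from typing import NamedTuple


class PairedTCR(NamedTuple):
    """Hypothetical record for one sequenced cell (field names are assumptions)."""
    v_alpha: str
    cdr3_alpha: str  # amino acid sequence
    v_beta: str
    cdr3_beta: str   # amino acid sequence


def unique_clonotypes(cells):
    """Collapse cells to unique clonotypes so expanded clones count only once."""
    return set(cells)


def shared_clonotypes(cd4_cells, cd8_cells):
    """Clonotypes observed in both the CD4+ and CD8+ compartments."""
    return unique_clonotypes(cd4_cells) & unique_clonotypes(cd8_cells)


# Toy data for illustration only; these are not sequences from the study.
cd4 = [
    PairedTCR("TRAV1-2", "CAVRDSNYQLIW", "TRBV6-1", "CASSEGTEAFF"),
    PairedTCR("TRAV1-2", "CAVRDSNYQLIW", "TRBV6-1", "CASSEGTEAFF"),  # expanded clone
    PairedTCR("TRAV12-1", "CVVNTGNQFYF", "TRBV20-1", "CSARDRGYEQYF"),
]
cd8 = [
    PairedTCR("TRAV12-1", "CVVNTGNQFYF", "TRBV20-1", "CSARDRGYEQYF"),
    PairedTCR("TRAV21", "CAVGGSQGNLIF", "TRBV7-9", "CASSLGQAYEQYF"),
]

cd4_unique = unique_clonotypes(cd4)
shared = shared_clonotypes(cd4, cd8)
```

Because the record is a tuple of the four clonotype-defining fields, set membership implements the clonotype definition directly; J segments and nucleotide sequences are deliberately left out of the key.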
The degree of overlap between the CD4+ and CD8+ TCR repertoires was quantified using the Jaccard Index (J):
$$J(\mathrm{CD4}, \mathrm{CD8}) = \frac{|\mathrm{CD4} \cap \mathrm{CD8}|}{|\mathrm{CD4} \cup \mathrm{CD8}|}$$
Here |CD4∩CD8| refers to the cardinality of the intersection between the CD4+ and CD8+ TCR repertoires (i.e. the number of TCRs found in both repertoires) and |CD4∪CD8| refers to the cardinality of the union of the two repertoires (i.e. the number of TCRs found in either of the two repertoires). The Jaccard Index was calculated independently for the α (J(CD4α, CD8α)), β (J(CD4β, CD8β)), and αβ (J(CD4αβ, CD8αβ)) TCR repertoires. TCR sequences shared between the CD4+ and CD8+ TCR repertoires were excluded from the machine learning classification analysis.
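The overlap calculation described above can be sketched in a few lines. The function and variable names are hypothetical, and the chain labels ("A1", "B1", ...) are placeholders standing in for single-chain clonotypes; the toy data is constructed only to show how single-chain overlap can exceed paired overlap.

```python
def jaccard_index(repertoire_a, repertoire_b):
    """Size of the intersection divided by size of the union of two repertoires."""
    a, b = set(repertoire_a), set(repertoire_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty repertoires
    return len(a & b) / len(union)


def single_chain(pairs, index):
    """Derive a single-chain repertoire (0 = alpha, 1 = beta) from paired clonotypes."""
    return [pair[index] for pair in pairs]


# Placeholder paired clonotypes as (alpha, beta) tuples; illustration only.
cd4_pairs = [("A1", "B1"), ("A2", "B2"), ("A3", "B3")]
cd8_pairs = [("A1", "B4"), ("A2", "B2"), ("A5", "B3")]

j_alpha = jaccard_index(single_chain(cd4_pairs, 0), single_chain(cd8_pairs, 0))
j_beta = jaccard_index(single_chain(cd4_pairs, 1), single_chain(cd8_pairs, 1))
j_paired = jaccard_index(cd4_pairs, cd8_pairs)
```

In this toy case the α and β repertoires each share two of four distinct chains, while only one of five distinct pairs is shared, so the paired index is lower than either single-chain index.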
Furthermore, as done previously33, the paired αβ repertoire consists of all unique, paired TCR sequences and the α and β individual chain repertoires were derived directly from the paired repertoire. That is, the individual α repertoire consists of all the α chains present in the paired dataset. Thus, the α, β, and αβ datasets are all of the same size and differences in sample size do not drive the observed differences. Furthermore, all boxplots represent median and inter-quartile range.
All analysis steps, unless otherwise noted, were performed using custom Python scripts available at our GitHub repository (https://github.com/JasonACarter/CD4CD8-Mansucript).
VJ segment usage
V(D)J segments were identified from raw sequences by MiXCR and annotated according to the International ImMunoGeneTics (IMGT) V(D)J gene definitions55. The odds ratio (OR) for a given TCR characteristic and T-cell lineage was calculated by counting the number of TCRs with (C+) and without (C−) that characteristic within the CD4+ (T4) and CD8+ (T8) repertoires. The OR is then given as:
$$\mathrm{OR} = \frac{C^{+}_{T_4} \, C^{-}_{T_8}}{C^{-}_{T_4} \, C^{+}_{T_8}}$$
That is, the numerator is the number of CD4+ TCRs with a given feature multiplied by the number of CD8+ TCRs without that feature. The denominator is the number of CD4+ TCRs without that feature multiplied by the number of CD8+ TCRs with that feature. Thus, an OR greater than 1 corresponds to a bias towards CD4+ and an OR less than 1 corresponds to a CD8+ bias. 95% confidence intervals and a p-value were then calculated for each OR using Fisher’s exact test implemented using the SciPy library (www.scipy.org). Multiple hypothesis testing correction was applied to single chain p-values using a Bonferroni correction and paired chain p-values were converted to q-values56. Significance was assessed at the p<0.05 or q<0.05 level.
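As a rough sketch of the odds ratio calculation described above, the following uses only the Python standard library. Note the substitution: the study reports using SciPy for Fisher's exact test, whereas this example swaps in a hand-written two-sided test based on the hypergeometric distribution so that it is self-contained. All function names are hypothetical, and confidence intervals and q-value conversion are not shown.

```python
from math import comb


def odds_ratio(cd4_with, cd4_without, cd8_with, cd8_without):
    """CD4+:CD8+ odds ratio; >1 indicates a CD4+ bias, <1 a CD8+ bias."""
    denominator = cd4_without * cd8_with
    if denominator == 0:
        return float("inf")
    return (cd4_with * cd8_without) / denominator


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Stand-in for a library routine: sums the hypergeometric probabilities of
    all tables (with the same margins) that are no more likely than the
    observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= observed * (1 + 1e-7)  # small tolerance for float ties
    )
    return min(1.0, p_value)


def bonferroni(p_values):
    """Bonferroni-adjusted p-values, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


# Toy counts: rows are CD4+ and CD8+, columns are with/without the feature.
example_or = odds_ratio(3, 1, 1, 3)
example_p = fisher_exact_two_sided(3, 1, 1, 3)
adjusted = bonferroni([0.01, 0.5])
```

With real repertoire counts, a library implementation would usually be preferable to the hand-written test, which is included here only to make the calculation explicit.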
CDR3 features
Sequence logos showing the amino acid frequency for a given position in the sequence were generated using all α and β CDR3 sequences of length 14 using WebLogo44. Of note, we defined the CDR3 length to be inclusive of the proximal cysteine and terminal phenylalanine that define the CDR3 region. The ratio of each amino acid in CDR3 between the CD4+ and CD8+ populations was calculated by dividing the frequency of a given amino acid across all CD4+ CDR3 sequences for a given chain by the frequency with which that amino acid occurred across all CD8+ CDR3 sequences. Net CDR3 charge was calculated by summing the charges of the negatively charged amino acids (D and E) and positively charged amino acids (R and K) present in the CDR3 region.
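The CDR3 features described above might be computed along the following lines. This is a sketch under a stated assumption: net charge is taken to be +1 for each R or K and −1 for each D or E, which is one natural reading of the text rather than a detail confirmed by the source. Function names and the toy sequences are hypothetical.

```python
from collections import Counter

POSITIVE = set("RK")  # assumed to contribute +1 each
NEGATIVE = set("DE")  # assumed to contribute -1 each


def cdr3_length(cdr3):
    """Length including the leading cysteine and terminal phenylalanine."""
    return len(cdr3)


def cdr3_net_charge(cdr3):
    """Count of positively charged residues minus count of negatively charged ones."""
    positives = sum(aa in POSITIVE for aa in cdr3)
    negatives = sum(aa in NEGATIVE for aa in cdr3)
    return positives - negatives


def amino_acid_frequencies(cdr3_sequences):
    """Frequency of each amino acid pooled across all given CDR3 sequences."""
    counts = Counter("".join(cdr3_sequences))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def cd4_cd8_frequency_ratio(cd4_cdr3s, cd8_cdr3s):
    """CD4+ frequency divided by CD8+ frequency for residues seen in both."""
    cd4_freq = amino_acid_frequencies(cd4_cdr3s)
    cd8_freq = amino_acid_frequencies(cd8_cdr3s)
    return {aa: cd4_freq[aa] / cd8_freq[aa] for aa in cd4_freq if aa in cd8_freq}


# Toy sequences for illustration only.
charge = cdr3_net_charge("CASSRDEEF")
length = cdr3_length("CASSLGQAYEQYF")
ratios = cd4_cd8_frequency_ratio(["CAKF"], ["CAAF"])
```

Residues absent from either population are skipped in the ratio to avoid division by zero; how such cases were handled in the original analysis is not stated.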
Mutual information
The mutual information45 (I), in bits, between a given feature, X, and T-cell lineage (L) was calculated as:
$$I(X; L) = \sum_{x \in X} \sum_{l \in L} p(x, l) \log_2 \frac{p(x, l)}{p(x)\, p(l)}$$
In order to correct for biases in our MI estimate arising from our limited sample sizes, we then applied a bootstrapping-based finite-sampling correction previously described33,57. We additionally calculated the synergistic information 46 (S) according to:
$$S = I(X_\alpha, X_\beta; L) - I(X_\alpha; L) - I(X_\beta; L)$$
where Xα and Xβ refer to TCRα and TCRβ features, respectively.
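A minimal sketch of the mutual information and synergy calculations follows. It uses the uncorrected plug-in estimator, so the bootstrapping-based finite-sampling correction mentioned above is not included, and synergy is taken as the paired information minus the sum of the two single-chain terms. Function names are hypothetical, and the toy data is an XOR-style construction in which neither chain alone is informative but the pair is.

```python
from collections import Counter
from math import log2


def mutual_information(features, lineages):
    """Plug-in estimate of I(X; L) in bits (no finite-sample correction)."""
    n = len(features)
    joint = Counter(zip(features, lineages))
    feature_counts = Counter(features)
    lineage_counts = Counter(lineages)
    mi = 0.0
    for (x, lineage), count in joint.items():
        p_joint = count / n
        p_x = feature_counts[x] / n
        p_l = lineage_counts[lineage] / n
        mi += p_joint * log2(p_joint / (p_x * p_l))
    return mi


def synergy(alpha_features, beta_features, lineages):
    """Paired information minus the sum of the single-chain information terms."""
    paired = list(zip(alpha_features, beta_features))
    return (
        mutual_information(paired, lineages)
        - mutual_information(alpha_features, lineages)
        - mutual_information(beta_features, lineages)
    )


# Toy XOR-style data for illustration only.
alpha = ["Va1", "Va1", "Va2", "Va2"]
beta = ["Vb1", "Vb2", "Vb1", "Vb2"]
lineage = ["CD4", "CD8", "CD8", "CD4"]

mi_alpha = mutual_information(alpha, lineage)
mi_paired = mutual_information(list(zip(alpha, beta)), lineage)
s = synergy(alpha, beta, lineage)
```

On small samples the plug-in estimator is biased upward, particularly for the paired features with their many possible values, which is why a finite-sampling correction matters when comparing paired and single-chain information.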
Machine learning
Multi-layer perceptron (MLP) neural network, logistic regression, and support vector machine (SVM) classifiers were implemented using custom Python scripts employing sklearn’s SVM library58. For SVMs trained on the Li et al. and Emerson et al. datasets, CDR3β amino acid sequences were first converted into numeric vectors using Atchley factors39,59. As the length of these numeric vectors depended on the length of the CDR3 region, a separate SVM was trained for each CDR3 length between 10 and 15. For all machine learning classifiers, each dataset was divided into a training set (75%) and a testing set (25%) and the accuracy on the testing set was reported for both the CD4+ and CD8+ populations. Standard deviations were calculated via 10 rounds of bootstrapping.
For our dataset, we wished to understand if the paired αβ repertoire was more informative than either of the single chain repertoires. As converting each CDR3αβ pair into a numeric vector would drastically lower our sample size, we developed a new methodology for preparing input vectors for TCRs that are independent of the CDR3 length. Specifically, we designated a TCR’s V and J segment as categorical variables. Additionally, we included the length of each CDR3 region and the frequency of each of the twenty amino acids used in the CDR3 region. Although this methodology loses information encoded in the amino acid sequence of the CDR3 region, it still captures many of the salient features we find to carry information about T-cell lineage and has the advantage of not quickly diminishing our sample size as a length-dependent method would.
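The length-independent input vectors described above could be built roughly as follows. This is a sketch under assumptions not stated in the source: categorical V and J segments are one-hot encoded, and the paired representation is the concatenation of the α and β vectors. Function names and the toy gene vocabularies are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the twenty standard amino acids


def one_hot(value, vocabulary):
    """One-hot encode a categorical value (an assumed encoding choice)."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def chain_features(v_gene, j_gene, cdr3, v_vocabulary, j_vocabulary):
    """V segment, J segment, CDR3 length, and amino acid frequencies for one chain."""
    if not cdr3:
        raise ValueError("CDR3 sequence must be non-empty")
    length = len(cdr3)
    frequencies = [cdr3.count(aa) / length for aa in AMINO_ACIDS]
    return (
        one_hot(v_gene, v_vocabulary)
        + one_hot(j_gene, j_vocabulary)
        + [float(length)]
        + frequencies
    )


def paired_features(alpha_vector, beta_vector):
    """Concatenate alpha and beta chain vectors (an assumed combination rule)."""
    return alpha_vector + beta_vector


# Toy vocabularies and sequences for illustration only.
v_alpha_vocab, j_alpha_vocab = ["TRAV1-2", "TRAV21"], ["TRAJ33", "TRAJ42"]
v_beta_vocab, j_beta_vocab = ["TRBV6-1", "TRBV20-1"], ["TRBJ1-1", "TRBJ2-7"]

short_alpha = chain_features("TRAV21", "TRAJ33", "CAVF", v_alpha_vocab, j_alpha_vocab)
long_alpha = chain_features("TRAV1-2", "TRAJ42", "CAVRDSNYQLIW", v_alpha_vocab, j_alpha_vocab)
beta_vec = chain_features("TRBV6-1", "TRBJ1-1", "CASSEGTEAFF", v_beta_vocab, j_beta_vocab)
pair_vec = paired_features(short_alpha, beta_vec)
```

Because the vector length depends only on the gene vocabularies and the fixed amino acid alphabet, CDR3 sequences of any length map to vectors of the same size, so a single classifier can be trained on the whole dataset.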
5. Acknowledgments
The authors thank Doug Fearon for comments on the manuscript, Pamela Moody and the CSHL Flow Cytometry Shared Resource for help with FACS experiments and the CSHL DNA Sequencing Core for next-generation sequencing. We additionally thank N.P. Weng for providing the β chain bulk sequencing dataset from Li et al. JAC was partially supported by NIHGM MSTP Training award T32-GM008444 and a LIBH grant. KG was funded by the Ferish-Gerry fellowship from the Watson School of Biological Sciences. GA was funded by the Simons Foundation and the Stand Up To Cancer-Breast Cancer Research Foundation Convergence Team Translational Cancer Research Grant, Grant Number SU2C-BCRF 2015-001.