Abstract
The T cell receptor (TCR) repertoire encodes immune exposure history through the dynamic formation of immunological memory. Statistical analysis of repertoire sequencing data has the potential to decode disease associations from large cohorts with measured phenotypes. However, the repertoire perturbation induced by a given immunological challenge is conditioned on genetic background via major histocompatibility complex (MHC) polymorphism. We explore associations between MHC alleles, immune exposures, and shared TCRs in a large human cohort. Using a previously published repertoire sequencing dataset augmented with high-resolution MHC genotyping, our analysis reveals rich structure: striking imprints of common pathogens, clusters of co-occurring TCRs that may represent markers of shared immune exposures, and substantial variations in TCR-MHC association strength across MHC loci. Guided by atomic contacts in solved TCR:peptide-MHC structures, we identify sequence covariation between TCR and MHC. These insights and our analysis framework lay the groundwork for further explorations into TCR diversity.
1 Introduction
T cells are the effectors of cell-mediated adaptive immunity in jawed vertebrates. To control a broad array of pathogens, massive genetic diversity in loci encoding the T cell receptor (TCR) is generated somatically throughout an individual’s life via a process called V(D)J recombination. All nucleated cells regularly process and present internal peptide antigens on cell surface molecules called major histocompatibility complex (MHC). Through the interface of TCR and MHC, a rare T cell with a TCR having affinity for a peptide antigen complexed with MHC (pMHC) is stimulated to initiate an immune response to an infected (or cancerous) cell. The responding T cell proliferates clonally, and its progeny inherit the same antigen-specific TCR, constituting long-term immunological memory of the antigen. The diverse population of TCR clones in an individual (the TCR repertoire) thus dynamically encodes a history of immunological challenges.
Advances in high-throughput TCR sequencing have shown the potential of the TCR repertoire as a personalized diagnostic of pathogen exposure history, cancer, and autoimmunity (Kirsch et al., 2015; Friedensohn et al., 2017). Public TCRs—defined as TCR sequences seen in multiple individuals and perhaps associated with a shared disease phenotype—have been found in a range of infectious and autoimmune diseases and cancers including influenza, Epstein-Barr virus, and cytomegalovirus infections, type I diabetes, rheumatoid arthritis, and melanoma (Venturi et al., 2008; Li et al., 2012; Madi et al., 2017; Pogorelyy et al., 2017; Dash et al., 2017; Glanville et al., 2017; Chu et al., 2018; Pogorelyy et al., 2018). By correlating occurrence patterns of public TCRβ chains with cytomegalovirus (CMV) serostatus across a large cohort of healthy individuals, Emerson et al. identified a set of CMV-associated TCR chains whose aggregate occurrence was highly predictive of CMV seropositivity (Emerson et al., 2017). Staining with multimerized pMHC followed by flow cytometry has been used to isolate and characterize large populations of T cells that bind to defined pMHC epitopes (Dash et al., 2017; Glanville et al., 2017), providing valuable data on the mapping between TCR sequence and epitope specificity. We and others have leveraged these data to develop learning-based models of TCR:pMHC interactions, using TCR distance measures (Dash et al., 2017), CDR3 sequence motifs (Glanville et al., 2017) and k-mer frequencies (Cinelli et al., 2017), and other techniques.
MHC proteins in humans are encoded by the human leukocyte antigen (HLA) loci, among the most polymorphic in the human genome (Robinson et al., 2014). Within an individual, six major antigen-presenting proteins are each encoded by polymorphic alleles. Together, these alleles constitute the individual’s HLA type, which is unlikely to be shared with an unrelated individual and which determines the subset of peptide epitopes presented to T cells for immune surveillance. Specificity of a given TCR for a given antigen is biophysically modulated by MHC structure: MHC binding specificity determines the specific antigenic peptide that is presented, and the TCR binds to a hybrid molecular surface composed of peptide- and MHC-derived residues. Thus, population-level studies of TCR-disease association are severely complicated by a dependence on individual HLA type.
Here we report an analysis of the occurrence patterns of public TCRs in a cohort of 666 healthy volunteer donors, in which information on only TCR sequence and HLA association guide us to inferences concerning disease history. To complement deep TCRβ repertoire sequencing available from a previous study (Emerson et al., 2017), we have assembled high-resolution HLA typing data at the major class I and class II HLA loci on the same cohort, as well as information on age, sex, ethnicity, and CMV serostatus. We focus on statistical association of TCR occurrence with HLA type, and show that many of the most highly HLA-associated TCRs are likely responsive to common pathogens: for example, eight of the ten TCRβ chains most highly associated with the HLA-A*02:01 allele are likely responsive to one of two viral epitopes (influenza M158 and Epstein-Barr virus BMLF1280). We introduce new approaches to cluster TCRs by primary sequence and by the pattern of occurrences among individuals in the cohort, and we identify highly significant TCR clusters that may indicate markers of immunological memory. Four of the top five most significant clusters appear linked with common pathogens (parvovirus B19, influenza virus, CMV, and Epstein-Barr virus), again highlighting the impact of viral pathogens on the public repertoire. We also find HLA-unrestricted TCR clusters, some likely to be mucosal-associated invariant T (MAIT) cells, which recognize bacterial metabolites presented by non-polymorphic MR1 proteins, rather than pMHC (Kjer-Nielsen et al., 2012). Our global, unbiased analysis of TCR-HLA association identifies striking variation in association strength across HLA loci and highlights trends in V(D)J generation probability and degree of clonal expansion that illuminate selection processes in cellular immunity. 
Guided by structural analysis, we used our large dataset of HLA-associated TCRβ chains to identify statistically significant sequence covariation between the TCR CDR3 loop and the DRB1 allele sequence that preserves charge complementarity at the TCR:pMHC interface. These analyses help elucidate the complex dependence of TCR sharing on HLA type and immune exposure, and will inform the growing number of studies seeking to identify TCR-based disease diagnostics.
2 Results
2.1 The matrix of public TCRs
Of the approximately 80 million unique TCRβ chains (defined by V-gene family and CDR3 sequence) in the 666 cohort repertoires, about 11 million chains are found in at least two individuals and are referred to here as public chains (for a more nuanced examination of TCR chain sharing see Elhanati et al., 2018). The occurrence patterns of these public TCRβs (the subset of subjects in which each distinct chain occurs) can be thought of as forming a very large binary matrix M with about 11 million rows and 666 columns. Entry M_{i,j} contains a one or a zero indicating presence or absence, respectively, of TCR i in the repertoire of subject j (ignoring for the moment the abundance of TCR i in repertoire j). Emerson et al. (2017) demonstrated that this binary occurrence matrix M encodes information on subject genotype and immune history: they were able to successfully predict HLA-A and HLA-B allele type and CMV serostatus by learning sets of public TCRβ chains with occurrence patterns that were predictive of these features. Specifically, each feature (such as the presence of a given HLA allele, e.g. HLA-A*02, or CMV seropositivity) defines a subset of the cohort members positive for that feature, and can be encoded as a vector of 666 binary digits. This phenotype occurrence pattern of zeros and ones can be compared to the occurrence patterns of all the public TCRβ chains to identify similar patterns, as quantified by a p-value for significance of co-occurrence across the 666 subjects; thresholding on this p-value produces a subset of significantly associated TCRβ chains whose collective occurrence in a repertoire was found by Emerson et al. to be predictive of the feature of interest (in cross-validation and, for CMV, on an independent cohort).
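The construction of the occurrence matrix can be illustrated with a minimal sketch. It assumes each repertoire is available as a set of (V-gene family, CDR3) pairs; the function name `public_occurrence`, the subject labels, and the toy chains are hypothetical and chosen only for illustration, not taken from the study's pipeline.

```python
def public_occurrence(repertoires):
    """Build binary occurrence rows for public chains.

    repertoires: dict mapping subject id -> set of (v_family, cdr3) tuples.
    Returns (subjects, rows), where rows maps each public chain (seen in at
    least two subjects) to a list of 0/1 values, one per subject.
    """
    subjects = sorted(repertoires)
    seen_in = {}  # chain -> set of column indices where the chain occurs
    for j, subject in enumerate(subjects):
        for chain in repertoires[subject]:
            seen_in.setdefault(chain, set()).add(j)
    rows = {
        chain: [1 if j in cols else 0 for j in range(len(subjects))]
        for chain, cols in seen_in.items()
        if len(cols) >= 2  # public: found in at least two individuals
    }
    return subjects, rows


# Toy data (hypothetical): three subjects, one chain shared by two of them.
toy = {
    "s1": {("TRBV19", "CASSAAAF"), ("TRBV07", "CASSBBBF")},
    "s2": {("TRBV19", "CASSAAAF")},
    "s3": {("TRBV20", "CASSCCCF")},
}
subjects, rows = public_occurrence(toy)
print(subjects, rows)
```

A phenotype such as CMV seropositivity can be encoded in the same way, as one 0/1 list over the same subject order, so that it is directly comparable to each row. For a matrix of the size described here a sparse representation would be needed in practice.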
Generalizing from these results, it is reasonable to expect that other common immune exposures may be encoded in the occurrence matrix M, and that these encodings could be discovered if we had additional phenotypic data to correlate with TCR occurrence patterns. In this study, we set out to discover these encoded exposures de novo, without additional phenotypic correlates, by learning directly from the structure of the occurrence matrix M and using as well the sequences of the TCRβ chains (both their similarities to one another and to TCR sequences characterized in the literature). To support this effort we assembled additional HLA typing data for the subjects, now at 4-digit resolution and including MHC class II alleles, and we compiled a dataset of annotated TCRβ chains by combining online TCR sequence databases, structurally characterized TCRs, and published studies (see Methods; Shugay et al., 2017; Tickotsky et al., 2017; Berman et al., 2000; Dash et al., 2017; Glanville et al., 2017; Song et al., 2017; Kasprowicz et al., 2006). Here we describe the outcome of this discovery process, and we report a number of intriguing general observations about the role of HLA in shaping the T cell repertoire.
2.2 Globally co-occurring TCR pairs form clusters defined by shared associations
We hypothesized that we could identify unknown immune exposures encoded in the public repertoire by comparing the occurrence patterns of individual TCRβ chains to one another. A subset of TCRβ chains that strongly co-occur among the 666 subjects might correspond to an unmeasured immune exposure that is common to a subset of subjects. Since shared HLA restriction could represent an alternative explanation for significant TCR co-occurrence, we also compared the TCR occurrence patterns to the occurrence patterns for class I and class II HLA alleles. We began by analyzing TCR occurrence patterns over the full set of cohort members. For each pair of public TCRβ chains t1 and t2 we computed a co-occurrence p-value P_CO(t1, t2) that reflects the probability of seeing an equal or greater overlap of shared subjects (i.e., subjects in whose repertoires both t1 and t2 are found) if the occurrence patterns of the two TCRs had been chosen randomly (for details, see Methods). In a similar manner we computed, for each HLA allele a and TCR t, an association p-value P_HLA(a, t) that measures the degree to which TCR t tends to occur in subjects positive for allele a. Finally, for each pair of strongly co-occurring (P_CO < 1×10−8) TCRβ chains t1 and t2, we looked for a mutual HLA association that might explain their co-occurrence, by finding the allele having the strongest association with both t1 and t2, and noting its association p-value:
$$P_{\mathrm{HLA}}(t_1, t_2) = \min_{a \in \mathcal{A}} \, \max\bigl(P_{\mathrm{HLA}}(a, t_1),\ P_{\mathrm{HLA}}(a, t_2)\bigr)$$
where 𝒜 denotes the set of all HLA alleles. In words, we take the p-value of the strongest HLA allele association with the TCR pair, where the association of an HLA allele with a TCR pair is defined by the weakest association of the allele among the individual TCRs.
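The two quantities described in this paragraph can be sketched as follows. The sketch assumes that the co-occurrence p-value is the upper tail of a hypergeometric distribution (a common choice for "equal or greater overlap under random occurrence"; the exact null model is given in Methods and may differ), and it applies the min-over-alleles, max-over-pair rule stated in words above. The function names and the toy p-values are hypothetical.

```python
from fractions import Fraction
from math import comb


def overlap_pvalue(n_subjects, n1, n2, overlap):
    """P(X >= overlap) when n2 subjects are drawn at random from n_subjects,
    of which n1 carry the first TCR (hypergeometric upper tail)."""
    total = comb(n_subjects, n2)
    tail = sum(
        comb(n1, k) * comb(n_subjects - n1, n2 - k)
        for k in range(overlap, min(n1, n2) + 1)
    )
    return float(Fraction(tail, total))  # exact ratio, rounded once


def shared_hla_pvalue(p_hla, t1, t2):
    """Strongest shared HLA association of a TCR pair.

    p_hla: dict mapping allele -> dict mapping TCR -> association p-value.
    For each allele the pair is scored by its weaker (larger) p-value;
    the best (smallest) such score over all alleles is returned.
    """
    return min(max(by_tcr[t1], by_tcr[t2]) for by_tcr in p_hla.values())


# Toy p-values (hypothetical): allele "X" is strong for t1 only,
# allele "Y" is moderately strong for both, so "Y" explains the pair better.
toy_p = {
    "X": {"t1": 1e-12, "t2": 1e-3},
    "Y": {"t1": 1e-6, "t2": 1e-7},
}
print(overlap_pvalue(666, 40, 35, 20))
print(shared_hla_pvalue(toy_p, "t1", "t2"))
```

Taking the maximum within an allele ensures that an allele counts as a shared explanation only if both TCRs are associated with it.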
Based on this analysis, we identified two broad classes of strongly co-occurring TCR pairs (Figure 1): those with a highly significant shared HLA association, where the co-occurrence of the two TCRs can be explained by a shared HLA allele association (i.e. a common HLA restriction), and those with only modest shared HLA-association p-value, for which another explanation of co-occurrence must be sought. Points above the dashed y = x line correspond to pairs of TCRs for which there exists an HLA allele whose co-occurrence with each of the TCRs is stronger than their mutual co-occurrence, while for points below the line no such HLA allele was present in the dataset.
We used a neighbor-based clustering algorithm, DBSCAN (Ester et al., 1996), to link strongly co-occurring TCR pairs together to form larger correlated clusters (see Methods), and then investigated phenotype associations with these clusters. At an approximate family-wise error rate of 0.05 (see Methods), we identified 28 clusters of co-occurring TCRs, with sizes ranging from 7 to 386 TCRs (Figure 2). Given one of these clusters of co-occurring TCRs, we can count the number of cluster member TCRs found in each subject’s repertoire. The aggregate occurrence pattern of the cluster can be visualized as a rank plot of this cluster TCR count over the subjects (the black curves in Figure 2B-C). This ranking can also be compared with other phenotypic or genotypic features of the same subjects. In particular, by comparing this aggregate occurrence pattern to a control pattern generated by repeatedly choosing equal numbers of subjects independently at random (dotted green lines in Figure 2B-C), we can identify a subset of the cohort with an apparent enrichment of cluster member TCRs and look for overlap between this subset and other defined cohort features. Performing this comparison against the occurrence patterns of class I and class II HLA alleles revealed that the majority of the TCR clusters were strongly associated with at least one HLA allele (as depicted for a DRB1*15:01-associated cluster in Figure 2B and summarized in Figure 2A).
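To make the linking step concrete, the following is a minimal, generic DBSCAN sketch that operates on a precomputed neighbor relation (here, pairs of TCRs already judged to co-occur strongly). It is not the study's implementation: the neighbor definition, the `min_pts` value, and the TCR labels are assumptions for illustration, and the actual parameters are described in Methods.

```python
from collections import deque


def neighbors_from_pairs(pairs):
    """Turn a list of linked (a, b) pairs into a symmetric neighbor map."""
    neighbors = {}
    for a, b in pairs:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return neighbors


def dbscan_from_neighbors(neighbors, min_pts):
    """Label items by cluster; items left unlabeled are treated as noise.

    A core item has at least min_pts items in its neighborhood, counting
    itself. Clusters grow outward from core items; non-core items reached
    from a core item join the cluster as border items but do not extend it.
    """
    labels = {}
    cluster_id = 0
    for item in neighbors:
        if item in labels or len(neighbors[item]) + 1 < min_pts:
            continue  # already assigned, or not a core item
        labels[item] = cluster_id
        queue = deque(neighbors[item])
        while queue:
            other = queue.popleft()
            if other in labels:
                continue
            labels[other] = cluster_id
            if len(neighbors[other]) + 1 >= min_pts:  # core: keep expanding
                queue.extend(neighbors[other])
        cluster_id += 1
    return labels


# Toy links (hypothetical): a, b, c are mutually linked, d hangs off c,
# and e-f is an isolated pair that is too small to form a cluster.
links = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("e", "f")]
print(dbscan_from_neighbors(neighbors_from_pairs(links), min_pts=3))
```

Requiring a minimum neighborhood size is what distinguishes this from simple single-linkage grouping: isolated pairs such as e-f remain unclustered.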
In addition, there were two large clusters of TCRs which were not strongly associated with any of the typed HLA alleles. Visual inspection of the CDR3 regions of TCRs in one of these clusters revealed a distinctive ‘YV’ C-terminal motif that is characteristic of the TRBJ2-7*02 allele (Figure 2–Figure Supplement 1), and indeed the 41 subjects whose repertoires indicated the presence of this genetic variant were exactly the 41 subjects enriched for members of this TCR cluster (Figure 2C). This demonstrated that population diversity in germline allele sets manifests as occurrence pattern clustering. The other large, non-HLA associated TCR cluster had a number of distinctive features as well: strong preference for the TRBV06 family, followed by TRBV20 and TRBV04 (Figure 2–Figure Supplement 2); low numbers of inserted ‘N’ nucleotides; and a skewed age distribution biased toward younger subjects (Figure 2–Figure Supplement 3). These features, together with the lack of apparent HLA restriction, suggested that this cluster represented an invariant T cell subset, specifically MAIT (mucosal-associated invariant T) cells (Kjer-Nielsen et al., 2012; Venturi et al., 2013; Pogorelyy et al., 2017). Since MAIT cells are defined primarily by their alpha chain sequences, we searched in a recently published paired dataset (Howie et al., 2015) for partner chains of the clustered TCRβ chain sequences, and found a striking number that matched the MAIT consensus (TRAV1-2 paired with TRAJ20/TRAJ33 and a 12 residue CDR3, Figure 2–Figure Supplement 3D). 
We also looked for these clustered TCRs in a recently published MAIT cell sequence dataset (Howson et al., 2018) and found that 93 of the 138 cluster member TCRs occurred among the 31,654 unique TCRs from this dataset; of these 93 TCRβ chains, 27 were found among the 78 most commonly occurring TCRs in the dataset (the TCRs occurring in at least 7 of the 24 sequenced repertoires), a highly significant overlap (P < 2 ×10−52 in a one-sided hypergeometric test). These concordances indicate that our untargeted approach has detected a well-studied T cell subset de novo through analysis of occurrence patterns.
2.3 HLA-associated TCRs
These analyses suggested to us that TCR co-occurrence patterns across the full cohort of subjects are strongly influenced by the distribution of the HLA alleles, in accordance with the expectation that the majority of αβ TCRs are HLA-restricted. Covariation between TCRs responding to the same HLA-restricted epitopes would only be expected in subjects positive for the restricting alleles, with TCR presence and absence outside these subjects likely introducing noise into the co-occurrence analysis. Thus we next analyzed patterns of TCR co-occurrence within subsets of the cohort positive for specific HLA alleles, and we restricted our co-occurrence analysis to TCRs having a statistically significant association with the specific allele defining the cohort subset. At a false discovery rate of 0.05 (estimated from shuffling experiments; see Methods), we were able to assign 16,951 TCRβ sequences to an HLA allele (or alleles: DQ and DP alleles were analyzed as αβ pairs, and there were 5 DR/DQ haplotypes whose component alleles were so highly correlated across our cohort that we could not assign TCR associations to individual DR or DQ components; see Methods). Table 1 lists the top 50 HLA-associated TCR sequences by association p-value and top 10 associated TCRs for the well-studied A*02:01 allele.
We find that 8 of the top 10 A*02:01-associated TCRs have been previously reported and annotated as being responsive to viral epitopes, specifically influenza M158 and Epstein-Barr virus (EBV) BMLF1280 (Shugay et al., 2017; Tickotsky et al., 2017). Moreover, each of these 8 TCRβ chains is present in a recent experimental dataset (Dash et al., 2017) that included tetramer-sorted TCRs positive for these two epitopes; each TCR has a clear similarity to one of the consensus epitope-specific repertoire clusters identified in that work, with the EBV TRBV20, TRBV29, and TRBV14 TCRs, respectively, matching the three largest branches of the BMLF1280 TCR tree, and the three influenza M158 TCRs all matching the dominant TRBV19 ‘RS’ motif consensus (Figure 10). TCRs with annotation matches are sparser in the top 50 across all alleles, which is likely due in part to a paucity of experimentally characterized non-A*02 TCRs; however, we again see EBV-epitope responsive TCRs (with B*08:01 and B*35:01 restriction).
A global comparison of TCR feature distributions for HLA-associated versus non-HLA-associated TCRs provides further evidence of functional selection. As shown in Figure 3A, HLA-associated TCRs are on average more clonally expanded than a set of background, non-HLA associated TCRs with matching frequencies in the cohort. They also have lower generation probabilities—are harder to make under a simple random model of the VDJ rearrangement process—which suggests that their observed cohort frequencies may be elevated by selection (Figure 3B, see Methods for further details on the calculation of clonal expansion indices and generation probabilities; also see Pogorelyy et al., 2018). Examination of two-dimensional feature distributions suggests that these shifts are correlated, with HLA-associated TCRs showing an excess of lower-probability, clonally expanded TCRs (Figure 3C); this trend appears stronger for class-I associated TCRs than for class II-associated TCRs (Figure 3–Figure Supplement 1).
To give a global picture of TCR-HLA association, we counted the number of significant TCR associations found for each HLA allele in the dataset, and plotted this number against the number of subjects in the cohort with that allele (Figure 4). As expected, the more common HLA alleles have on average greater numbers of associated TCRs (since greater numbers of subjects permit the identification of more public TCRs, and the statistical significance assigned to an observed association of fixed strength grows as the number of subjects increases). What was somewhat more surprising is that the slope of the correlation between cohort frequency and number of associated TCRs varied dramatically among the HLA loci, with HLA-DRB1 alleles having the largest number of associated TCRs for a given allele frequency and HLA-C alleles having the smallest. The best-fit slope for the five DR/DQ haplotypes (12.2) was roughly the sum of the DR (7.99) and DQ (3.39) slopes, suggesting as expected that these haplotypes were capturing TCRs associated with both the DR and DQ component alleles.
2.4 HLA-restricted TCR clusters
We next sought to identify TCR clusters that might represent HLA-restricted responses to shared immune exposures. We performed this analysis for each HLA allele individually, restricting our clustering to the set of TCR chains significantly associated with that allele and comparing occurrence patterns only over the subset of subjects positive for that allele. The smaller size of many of these allele-positive cohort subsets reduces our statistical power to detect significant clusters using co-occurrence information. To counter this effect, we used TCRdist (Dash et al., 2017) to leverage the TCR sequence similarity which is often present within epitope-specific responses (Dash et al., 2017; Glanville et al., 2017) (e.g., A*02:01 TCRs in Table 1 and Figure 10). We augmented the probabilistic similarity measure used to define neighbors for DBSCAN clustering to incorporate information about TCR sequence similarity, in addition to cohort co-occurrence (see Methods). We independently clustered each allele’s associated TCRs and merged the clustering results from all alleles; using the Holm multiple testing criterion (Holm, 1979) to limit the approximate family-wise error rate to 0.05, we found a total of 78 significant TCR clusters.
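For readers unfamiliar with the Holm criterion, a minimal sketch of the standard step-down procedure is given below. How per-cluster p-values are derived is described in Methods; the function name and the example p-values here are hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure.

    Sort the p-values in ascending order and compare the one at 0-based
    rank r against alpha / (m - r). Stop at the first failure; everything
    before it is rejected. Returns one boolean per input p-value, in the
    original order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all larger p-values are retained as well
    return reject


# Hypothetical example with four candidate clusters.
print(holm_reject([0.01, 0.04, 0.03, 0.005]))
```

In this example the two smallest p-values pass their thresholds (0.05/4 and 0.05/3), while 0.03 fails against 0.05/2, so it and the remaining value are retained.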
We analyzed the sequences and occurrence patterns of the TCRs belonging to these 78 clusters in order to assess their potential biological significance and prioritize them for further study (Table 3). Each cluster was assigned two scores (Figure 5): a size score (Ssize, x-axis), reflecting the significance of seeing a cluster of that size given the total number of TCRs clustered for its associated allele, and a co-occurrence score (ZCO, y-axis), reflecting the degree to which the TCRs in that cluster co-occur within its allele-positive cohort subset (see Methods). In computing the co-occurrence score, we defined a subset of individuals with an apparent enrichment for the member TCRs in each cluster; the size of this enriched subset of subjects is given in the ‘Subjects’ column in Table 3. We rank ordered the 78 clusters based on the sum of their size and co-occurrence scores (weighted to equalize dynamic range); the top 5 clusters are presented in greater detail in Figure 6. HLA associations, member TCR and enriched subject counts, cluster center TCR sequences, scores, and annotations for all 78 clusters are given in Table 3.
We found that a surprising number of the most significant HLA-restricted clusters had links to common viral pathogens. For example, the top cluster by both size and co-occurrence (Figure 6, upper panels) is an A*24:02-associated group of highly similar TCRβ chains, five of which can be found in a set of 12 TCRβ sequences reported to respond to the parvovirus B19 epitope FYTPLADQF as part of a highly focused CD8+ response to acute B19 infection (Kasprowicz et al., 2006). The subject TCR-counts curve for this cluster (Figure 6, top right panel) shows a strong enrichment of member TCRs in roughly 30% of the A*24:02 repertoires, which is on the low end of prevalence estimates for this pathogen (Heegaard and Brown, 2002) and may suggest that, if cluster enrichment does correlate with B19 exposure, there are likely to be other genetic or epidemiologic factors that determine which B19-exposed individuals show enrichment. The second most significant cluster by both measures is an A*02:01-associated group of TRBV19 TCRs with a high frequency of matches to the influenza M158 response (41/43 TCRs, labeled ‘INF-pGIL’ for the first three letters of the GILGFVFTL epitope). Notably, the cluster member sequences recapitulate many of the core features of the tree of experimentally identified M158 TCRs (Figure 10): a dominant group of length 13 CDR3 sequences with an ‘RS’ sequence motif together with a smaller group of length 12 CDR3s with the consensus CASSIG.YGYTF.
Rounding out the top five, the third and fifth most significant clusters also appear to be pathogen-associated. Cluster #3 brings together a diverse set of DRB1*07:01-associated TCRβ chains (Figure 6, second page, middle dendrogram), none of which matched our annotation database. However, it was strongly associated with CMV serostatus: as is evident in the subject TCR-counts panel for this cluster (Figure 6, second page, middle right), there is a highly significant (P < 3 × 10−19) association between CMV seropositivity (blue dots at the bottom of the panel) and cluster enrichment (here defined as a subject TCR count ≥ 3). Finally, the B*08:01-associated cluster #5 (Figure 6, second page, bottom panels) appears to be EBV-associated: four of the TCRβ chains in this cluster match TCRs annotated as binding to EBV epitopes (two matches for the B*08:01-restricted FLRGRAYGL epitope and two for the B*08:01-restricted RAKFKQLL epitope). The fact that this cluster brings together sequence-dissimilar TCRs that recognize different epitopes from the same pathogen supports the hypothesis that at least some of the observed co-occurrence may be driven by a shared exposure.
As a preliminary validation of the clusters identified here, we examined the occurrence patterns of cluster member TCRs in two independent cohorts: a set of 120 individuals (“Keck120”) that formed the validation cohort for the original Emerson et al. study, and a set of 86 individuals (“Brit86”) taken from the aging study of Britanova et al. (2016). Whereas the Keck120 repertoires were generated using the same platform as our 666-member discovery cohort, the Brit86 repertoires were sequenced from cDNA libraries using 5’-template switching and unique molecular identifiers. In the absence of HLA typing information for these subjects, we simply evaluated the degree to which each cluster’s member TCRs co-occurred over the entirety of each of these validation cohorts, using the co-occurrence score described above (see the corresponding validation-cohort score columns in Table 3). Although rare alleles and cluster-associated exposures may not occur with sufficient frequency in these smaller cohorts to generate co-occurrence signal, co-occurrence scores support the validity of the clusterings identified on the discovery cohort: 94% of the Keck120 scores and 92% of the Brit86 scores are greater than 0, indicating a tendency of the clustered TCRs to co-occur (smoothed score distributions are shown in Figure 5–Figure Supplement 1).
2.5 Covariation between CDR3 sequence and HLA allele
Given our large dataset of HLA-associated TCRβ sequences, we set out to look for correlations between CDR3 sequence and HLA allele sequence. Previous studies have identified correlations between TCR V-gene usage and HLA alleles (Sharon et al., 2016; Blevins et al., 2016). In our previous work on epitope-specific TCRs (Dash et al., 2017), we identified a significant negative correlation between CDR3 charge and peptide charge, suggesting a tendency toward preserving charge complementarity across the TCR:pMHC interface. Although the CDR3 loop primarily contacts the MHC-bound peptide, computational analysis of solved TCR:peptide:MHC structures in the Protein Data Bank (Berman et al., 2000) (see Methods) identified a number of HLA sequence positions that are frequently contacted by CDR3 amino acids (Table 2). For each frequently-contacted HLA position with charge variability among alleles we computed the covariation between HLA allele charge at that position and average CDR3 charge for allele-associated TCRs. Since portions of the CDR3 sequence are contributed by the V- and J-gene germline sequences, and covariations are known to exist between HLA and V-gene usage, we also performed a covariation analysis restricting to ‘non-germline’ CDR3 sequence positions whose coding sequence is determined by at least one non-templated insertion base (based on the most parsimonious VDJ reconstruction; see Methods). We found a significant negative correlation (R = –0.47, P < 4×10−4 for the full CDR3 sequence; R = –0.52, P < 7×10−5 for the non-germline CDR3 sequence) between CDR3 charge and the charge at position 70 of the class II beta chain. We did not see a significant correlation for the frequently contacted position on the class II alpha chain, perhaps due to the lack of sequence variation at the DRα locus and/or the more limited number of DQα and DPα alleles. 
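The covariation calculation can be sketched as a Pearson correlation between a per-allele charge at the contacted HLA position and the mean net charge of the CDR3s associated with that allele. The charge convention below (+1 for K and R, −1 for D and E, histidine treated as neutral) is an assumption made for illustration; the allele labels, charges, and CDR3 strings are invented toy values, and the exact definitions used in the analysis are those given in Methods.

```python
from math import sqrt

POSITIVE = set("KR")  # assumed +1 residues
NEGATIVE = set("DE")  # assumed -1 residues


def net_charge(seq):
    """Net charge of an amino acid sequence under the assumed convention."""
    return sum(aa in POSITIVE for aa in seq) - sum(aa in NEGATIVE for aa in seq)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy input (hypothetical): charge at the contacted HLA position per allele,
# and the CDR3 sequences of the TCRs associated with that allele.
allele_charge = {"allele_A": -1, "allele_B": 0, "allele_C": 1}
allele_cdr3s = {
    "allele_A": ["CASRKGF", "CASSRGF"],
    "allele_B": ["CASSLGF", "CASSKEF"],
    "allele_C": ["CASSDGF", "CASEDGF"],
}

alleles = sorted(allele_charge)
hla = [allele_charge[a] for a in alleles]
cdr3 = [sum(map(net_charge, allele_cdr3s[a])) / len(allele_cdr3s[a]) for a in alleles]
print(pearson(hla, cdr3))  # negative when opposite charges pair up
```

The non-germline variant of the analysis would apply `net_charge` only to the CDR3 positions determined by non-templated insertions, which requires the VDJ reconstruction described in Methods.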
None of the five class I positions showed significant correlations, which could be due to their lower contact frequencies, a smaller average number of associated TCRs (51 for class I versus 309 for class II), bias toward A*02 in the structural database, or noise introduced from multiple contacted positions varying simultaneously. Further analysis of the class II correlation suggested that it was driven largely by HLA-DRB1 alleles: position 70 correlations were –0.56 versus –0.10 for DR and DQ, respectively, over the full CDR3 and –0.64 vs –0.38 for the non-germline CDR3. Figure 7 provides further detail on this DRB1-TCR charge anti-correlation, including a structural superposition showing the proximity of position 70 to the TCRβ CDR3 loop.
2.6 CMV-associated TCRβ chains are largely HLA-restricted
We analyzed the HLA associations of strongly CMV-associated TCRβ chains to gain insight into their predictive power across genetically diverse individuals. Here we change perspective somewhat from earlier sections, in that we select TCRs based on their CMV association and then evaluate HLA association, rather than the other way around. In their original study, Emerson et al. identified a set of TCRβ chains that were enriched in CMV seropositive individuals and showed that by counting these CMV-associated TCRβ chains in a query repertoire they could successfully predict CMV serostatus both in cross-validation and on an independent test cohort. The success of this prediction strategy across a diverse cohort of individuals raises the intriguing question of whether these TCRβs are primarily HLA-restricted in their occurrence and in their association with CMV, or whether they span multiple HLA types. To shed light on this question we focused on a set of 68 CMV-associated TCRβ chains whose co-occurrence with CMV seropositivity was significant at a p-value threshold of 1.5×10−5 (corresponding to an FDR of 0.05; see Methods). For each CMV-associated TCRβ chain, we identified its most strongly associated HLA allele and compared the p-value of this association to the p-value of its association with CMV (Figure 8A). From this plot we can see that the majority of the CMV-associated chains do appear to be HLA-associated, having p-values that exceed the FDR 0.05 threshold for HLA association. The excess of highly significant HLA-association p-values for these CMV-associated TCRβs can be seen in Figure 8B, which compares the observed p-value distribution to a background distribution of HLA association p-values for randomly selected frequency-matched public TCRβs.
As a next step we looked to see whether these HLA associations fully explained the CMV association, in the sense that the CMV association was only present in subjects positive for the associated allele. For each of the 68 CMV-associated TCRs, we divided the cohort into subjects positive for its most strongly associated HLA allele and subjects negative for that allele. Here we considered both 2- and 4-digit resolution alleles when defining the most strongly associated allele, to allow for TCRs whose association extends beyond a single 4-digit allele. We computed association p-values between TCR occurrence and CMV seropositivity over these two cohort subsets independently and compared them (Figure 8C). We see that the majority of the points lie below the y = x line—indicating a stronger CMV-association on the subset of the cohort positive for the associated allele—and also below the line corresponding to the expected minimum of 68 uniform random variables (i.e. the expected upper significance limit in the absence of CMV association on the allele-negative cohort subsets). There are however a few TCRβs which do not appear strongly HLA-associated and for which the CMV-association remains strong in the absence of their associated allele (the points above the line y = x in Figure 8C). For example, the public TCRβ chain defined by TRBV07 and the CDR3 sequence CASSSDSGGTDTQYF (which corresponds to the highest point in Figure 8C) is strongly CMV-associated (22/23 subjects with this chain are CMV positive; P < 3×10−7) but does not show evidence of HLA association in our dataset. TCRs with HLA promiscuity may be especially interesting from a diagnostic perspective, since their phenotype associations may be more robust to differences in genetic background.
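The reference line based on the expected minimum of 68 uniform random variables follows from a short calculation, shown here as a worked sketch under the assumption that, in the absence of any true association, the 68 p-values behave as independent draws from the uniform distribution on [0, 1]:

```latex
% U_1, ..., U_n independent Uniform(0,1); n = 68 CMV-associated TCRs.
\begin{align*}
\Pr\bigl(\min_i U_i > x\bigr) &= (1 - x)^n, \qquad 0 \le x \le 1, \\
\mathbb{E}\bigl[\min_i U_i\bigr] &= \int_0^1 (1 - x)^n \, dx = \frac{1}{n + 1}, \\
n = 68: \quad \mathbb{E}\bigl[\min_i U_i\bigr] &= \frac{1}{69} \approx 0.0145 .
\end{align*}
```

Under this assumption, p-values on the allele-negative subsets that are no smaller than roughly 0.0145 are consistent with what the most extreme of 68 null tests would be expected to produce.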
Finally, we looked to see whether CMV association completely explained the observed HLA associations, in the sense that a response to one or more CMV epitopes was likely the only driver of HLA association, or whether there might be evidence for other epitope-specific responses by these TCRβ chains or a more general affinity for the associated allele, perhaps driven by common self antigens. Put another way, do we see evidence for pre-existing enrichment of any of these TCRβ chains when their associated allele is present, even in the absence of CMV, which might suggest that the CMV response recruits from a pre-selected pool enriched for TCRs with intrinsic affinity for the restricting allele? To approach this question we split the cohort into CMV seropositive and seronegative subjects and computed, for each of the 68 CMV-associated TCRs, the strength of its association with its preferred allele over these two subsets separately. Figure 8D compares these HLA-association p-values computed over the subsets of the cohort positive (289 individuals, x-axis) and negative (352 individuals, y-axis) for CMV. We can see in this case that all of the associations on the CMV-positive subset are stronger than those on the CMV-negative subset, and indeed the CMV-negative p-values do not appear to exceed random expectation given the number of comparisons performed. Thus, the apparent lack of any significant HLA association on the CMV-negative cohort subset suggests that the HLA associations of these CMV-predictive chains are largely driven by CMV exposure. A limitation of this analysis is that, although the CMV-negative subset of the cohort is larger than the CMV-positive subset, the number of TCR occurrences in the CMV-negative subset is likely lower than in the CMV-positive subset for these CMV-associated chains, which will limit the strength of the HLA associations that can be detected.
3 Discussion
Each individual’s repertoire of circulating immune receptors encodes information on their past and present exposures to infectious and autoimmune diseases, to antigenic stimuli in the environment, and to tumor-derived epitopes. Decoding this exposure information requires an ability to map from amino acid sequences of rearranged receptors to their eliciting antigens, either individually or collectively. One approach to developing such an antigen-mapping capability would involve collecting deep repertoire datasets and detailed phenotypic information on immune exposures for large cohorts of genetically diverse individuals. Correlation between immune exposure and receptor occurrence across such datasets could then be used to train statistical predictors of exposure, as demonstrated by Emerson et al. for CMV serostatus. The main difficulty with such an approach, beyond the cost of repertoire sequencing, is likely to be the challenge of assembling accurate and complete immune exposure information.
For this reason, we set out to discover potential signatures of immune exposures de novo, in the absence of phenotypic information, using only the structure of the public repertoire—its receptor sequences and their occurrence patterns. By analyzing co-occurrence between pairs of public TCRβ chains and between individual TCRβ chains and HLA alleles, we were able to identify statistically significant clusters of co-occurring TCRs across a large cohort of individuals and in a variety of HLA backgrounds. Indirect evidence from sequence matches to experimentally-characterized receptors suggests that some of these TCR clusters may reflect hidden immune exposures shared among subsets of the cohort members; indeed, several of the most significant clusters appear linked to common viral pathogens (parvovirus B19, influenza, CMV, and EBV).
The results of this paper demonstrate the potential for a productive dialog between statistical analysis of TCR repertoires and immune exposure analysis. Specifically, sequences from the statistically-inferred clusters defined here could be tested for antigen reactivity or combined with immune exposure data to infer the driver of TCR expansion, as was done here for the handful of CMV-associated clusters based on CMV serostatus information. In either case our clustering approach will reduce the amount of independent data required, since the immune phenotype data is used for annotation of a modest number of defined TCR groupings rather than direct discovery of predictive TCRs from the entire public repertoire. We can also look for the presence of specific TCRs and TCR clusters identified here in other repertoire datasets, for example from studies of specific autoimmune diseases or pathogens, as a means of assigning putative functions. However the answer may not be entirely straightforward: it remains possible that enrichment for other cluster TCRs, rather than being associated with an exposure per se, is instead associated with some subject-specific genetic or epigenetic factor that determines whether a specific TCR response will be elicited by a given exposure.
The finding by Emerson et al. (now replicated and extended in this work) that there are large numbers of TCRβ chains whose occurrence patterns (independent of potential TCRα partners) are strongly associated with specific HLA alleles raises the question of what selective forces drive these biased occurrence patterns. Our observations point to a potential role for responses to common pathogens in selecting some of these chains in an HLA-restricted manner. Self-antigens (presented in the thymus and/or the periphery) may also play a role in enriching for specific chains, as suggested by Madi et al. (2017) in their work on TCR similarity networks formed by the most frequent CDR3 sequences. Our conclusions diverge somewhat from this previous work, which may be explained by the following factors: our use of HLA association rather than intra-individual frequency as a filter for selecting TCRs, our inclusion of information on the V-gene family in addition to the CDR3 sequence when defining TCR sharing and computing TCR similarity, and our use of TCR occurrence patterns, rather than CDR3 edit distance, to discover TCR clusters. We also find it interesting that class II loci appear on average to have greater numbers of associated TCRβ chains than class I loci (Figure 4): presumably this reflects differences in selection and/or abundance between the CD4+ and CD8+ T cell compartments, but the underlying explanation for this trend is unclear. It is also worth pointing out that our primary focus on presence/absence of TCRβ chains (rather than abundance) assumes relatively uniform sampling depths across the cohort; in the limit of very deep repertoire sequencing, pathogen-associated chains may be found (presumably in the naive pool) even in the absence of the associated immune challenge, while shallow sampling reliably picks out only the most expanded T cell clones. Here the use of clusters of responsive TCRs rather than individual chains lessens stochastic fluctuations in TCR occurrence patterns, providing some measure of robustness.
We look forward to the accumulation of new data sets, which will enable future researchers to move beyond the limitations of the study presented here. An ideal study would perform discovery on repertoire data from multiple large cohorts, rather than the single large cohort generated with a single sequencing platform. Although we do validate TCR clusters on two independent datasets, with one from a different immune profiling technology, performing discovery on multiple large cohorts would presumably give more robust results. Future analyses of independent, HLA-typed cohorts will provide additional validation of trends seen here. We also hope that future studies will have rich immune exposure data beyond CMV serostatus: although the cohort members were all nominally healthy at the time of sampling, it is likely that there are a variety of immune exposures, some presaging future pathologies, that can be observed in a diverse collection of 650+ individuals. As an example, two of our EBV-annotated clusters contain TCRβ chains also seen in the context of rheumatoid arthritis: cross-reactivity between pathogen and autoimmune epitopes may mean that TCR clusters discovered on the basis of common infections also provide information relevant in the context of autoimmunity.
4 Materials and Methods
4.1 Datasets
TCRβ repertoire sequence data for the 666 members of the discovery cohort were downloaded from the Adaptive Biotechnologies website using the link provided in the original Emerson et al. (2017) publication (https://clients.adaptivebiotech.com/pub/Emerson-2017-NatGen). The repertoire sequence data for the 120 individuals in the “Keck120” validation set were included in the same download. Repertoire sequence data for the 86 individuals in the “Brit86” validation set were downloaded from the NCBI SRA archive using the Bioproject accession PRJNA316572 (Britanova et al., 2016) and processed using scripts and data supplied by the authors (https://github.com/mikessh/aging-study) in order to demultiplex the samples and remove technical replicates. Repertoire sequence data for TCRβ chains from MAIT cells were downloaded from the NCBI SRA archive using the Bioproject accession PRJNA412739 (Howson et al., 2018).
V and J genes were assigned by comparing the TCR nucleotide sequences to the IMGT/GENE-DB (Giudicelli et al., 2005) nucleotide sequences of the human TR genes (sequence data downloaded on 9/6/2017 from http://www.imgt.org/genedb/). CDR3 nucleotide and amino acid sequences and most-parsimonious VDJ recombination scenarios were assigned by the TCRdist pipeline (Dash et al., 2017) (the most parsimonious recombination scenario, used for identifying non-germline CDR3 amino acids, is the one requiring the fewest non-templated nucleotide insertions). To define the occurrence matrix of public TCRs and assess TCR-TCR, TCR-HLA and TCR-CMV association, a TCRβ chain was identified by its CDR3 amino acid sequence and its V-gene family (e.g., TRBV6-4*01 was reduced to TRBV06). TCR sequence reads for which a unique V-gene family could not be determined (due to equally well-matched V genes from different families, a rare occurrence in this dataset) were excluded from the analysis.
4.2 Eliminating potential cross-contamination
A preliminary analysis of TCR sharing at the nucleotide level was conducted to identify potential cross-contamination in the discovery cohort repertoires. Each TCRβ nucleotide sequence that was found in multiple repertoires was assigned a generation probability (Pgen, see below) in order to identify nucleotide sequences with suspiciously high sharing rates among repertoires. Visual comparison of the sharing rate (the number of repertoires in which each TCRβ nucleotide sequence was found) to the generation probability (Figure 9) showed that the majority of highly-shared TCRs had correspondingly high generation probabilities; it also revealed a cluster of TCR chains with unexpectedly high sharing rates. Examination of the sequences of these highly-shared TCRs revealed them to be variants of the consensus sequence CFFKQKTAYEQYF (coding sequence: tgttttttcaagcagaagacggcatacgagcagtacttc). Consultation with scientists at Adaptive Biotechnologies confirmed that these sequences were likely to represent a technical artifact. We elected to remove all TCRβ nucleotide sequences whose sharing rates put them outside the decision boundary indicated by the black line in Figure 9, which eliminated the vast majority of the artifactual variants as well as a handful of other highly shared, low-probability sequences.
4.3 Measuring clonal expansion
Each public TCRβ chain was assigned a clonal expansion index (Iexp) determined by its frequencies in the repertoires in which it was found. First, the unique TCRβ chains present in each repertoire were ordered based on their inferred nucleic acid template count (Carlson et al., 2013), and assigned a rank ranging from 0 (lowest template count) to S – 1 (highest template count), where S is the total number of chains present in the repertoire. TCRs with the same template count were assigned the same tied rank equal to the midpoint of the tied group. In order to compare across repertoires, the ranks for each repertoire were then normalized by dividing by the number of unique sequences in the repertoire. The clonal expansion index for a given public TCR t was taken to be its average normalized rank for the repertoires in which it occurred:
$$I_{\mathrm{exp}}(t) = \frac{1}{m}\sum_{i=1}^{m}\frac{r_i}{S_i},$$
where the sum is taken over the m repertoires in which t is found, $r_i$ is the template-count rank of TCR t in repertoire i, and $S_i$ is the total size of repertoire i.
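The ranking procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the original implementation: each repertoire is represented as a hypothetical dictionary mapping a TCR identifier to its template count, and the function names are invented for this example.

```python
from collections import defaultdict


def normalized_ranks(template_counts):
    """Return {tcr: rank / S} for one repertoire.

    Ranks run from 0 (lowest template count) to S - 1 (highest);
    TCRs with equal counts share the midpoint rank of their tied group.
    """
    S = len(template_counts)
    ordered = sorted(template_counts.items(), key=lambda kv: kv[1])
    ranks = {}
    i = 0
    while i < S:
        j = i
        while j + 1 < S and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        midpoint = (i + j) / 2.0  # shared rank for positions i..j
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = midpoint / S
        i = j + 1
    return ranks


def expansion_index(repertoires):
    """Average normalized rank of each TCR over the repertoires containing it."""
    totals = defaultdict(float)
    occurrences = defaultdict(int)
    for repertoire in repertoires:
        for tcr, rank in normalized_ranks(repertoire).items():
            totals[tcr] += rank
            occurrences[tcr] += 1
    return {tcr: totals[tcr] / occurrences[tcr] for tcr in totals}


# Toy example with two small repertoires.
toy_index = expansion_index([{"a": 1, "b": 1, "c": 5}, {"c": 10, "d": 1}])
```

In the toy example, chain "c" has the highest template count in both repertoires and therefore receives the largest index.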
4.4 HLA typing
HLA genotyping was performed and confirmed by molecular means (either Sanger sequencing or next-generation sequencing) and independently by imputation of HLA alleles using data generated by high density single-nucleotide polymorphism arrays. HLA typing data availability varied across loci as follows: HLA-A (629 subjects), HLA-B (630 subjects), HLA-C (629 subjects), HLA-DRB1 (630 subjects), HLA-DQA1 (522 subjects), HLA-DQB1 (630 subjects), HLA-DPA1 (606 subjects), and HLA-DPB1 (472 subjects). When calculating the association p-values between TCRβ chains and HLA alleles reported in Table 1, the cohort was restricted to the subset of subjects with available HLA typing at the relevant locus. For comparing TCR association rates across loci in Figure 4, associations were calculated over the cohort subset (522 subjects) with typing data at all compared loci (A, B, C, DRB1, DQA1, and DQB1) in order to avoid spurious differences in association strengths arising from differential data availability among the loci. Due to their very strong linkage on our cohort, five DR-DQ haplotypes were treated as single allele units for association calculations and clustering: DRB1*03:01-DQA1*05:01-DQB1*02:01, DRB1*15:01-DQA1*01:02-DQB1*06:02, DRB1*13:01-DQA1*01:03-DQB1*06:03, DRB1*10:01-DQA1*01:05-DQB1*05:01, and DRB1*09:01-DQA1*03:02-DQB1*03:03.
4.5 TCR generation probability
We implemented a version of the probabilistic model proposed by Walczak and co-workers (Murugan et al., 2012) in order to assign to each public TCRβ chain (defined by a V-gene family and a CDR3 amino acid sequence) a generation probability, Pgen, which captures the probability of seeing that TCRβ in the preselection repertoire. Pgen is calculated by summing the probabilities of the possible VDJ rearrangements that could have produced the observed TCR:
$$P_{\mathrm{gen}} = \sum_{s \in S} P(s),$$
where $S$ represents the set of possible VDJ recombination scenarios capable of producing the observed TCR V family and CDR3 amino acid sequence. To compute the probability of a given recombination scenario s, we use the factorization proposed by Marcou et al. (2018), which captures observed dependencies of V-, D-, and J-gene trimming on the identity of the trimmed gene and of inserted nucleotide identity on the identity of the preceding nucleotide:
$$P(s) = P(V_s)\,P(D_s, J_s)\,P(\mathrm{del}_s^{V}\mid V_s)\,P(\mathrm{del}_s^{J}\mid J_s)\,P(\mathrm{del}_s^{D5'}, \mathrm{del}_s^{D3'}\mid D_s)\,P(\mathrm{Ins}_s^{VD})\prod_{i=1}^{\mathrm{Ins}_s^{VD}} P(n_i\mid n_{i-1})\,P(\mathrm{Ins}_s^{DJ})\prod_{i=1}^{\mathrm{Ins}_s^{DJ}} P(m_i\mid m_{i-1}).$$
Here the recombination scenario s consists of a choice of V gene ($V_s$), D gene ($D_s$), J gene ($J_s$), number of nucleotides trimmed back from the end of the V gene ($\mathrm{del}_s^{V}$) or J gene ($\mathrm{del}_s^{J}$) or D gene ($\mathrm{del}_s^{D5'}$ and $\mathrm{del}_s^{D3'}$), number of nucleotides inserted between the V and D genes ($\mathrm{Ins}_s^{VD}$) and between the D and J genes ($\mathrm{Ins}_s^{DJ}$), and the identities of the inserted nucleotides ($\{n_i\}$ and $\{m_i\}$, respectively). At the start of the calculation, the CDR3 amino acid sequence is converted to a list of potential degenerate coding nucleotide sequences. Since each amino acid other than leucine, serine, and arginine has a single degenerate codon (and these three amino acids have two such codons), this list of nucleotide sequences is generally not too long. The generation probability is then taken to be the sum of the probabilities of these degenerate nucleotide sequences. Since the total number of possible recombination scenarios is in principle quite large, we make a number of approximations to speed the calculation: we limit excess trimming of genes to at most three nucleotides, where excess trimming is defined to be trimming back a nucleotide which matches the target CDR3 nucleotide (therefore requiring non-templated reinsertion of the same nucleotide); at most two palindromic nucleotides are allowed; and sub-optimal D gene alignments are only considered up to a score gap of two matched nucleotides relative to the best match. The parameters of the probability model are fit by a simple iterative procedure in which we generate rearrangements using an initial model, compare the statistics of those rearrangements to statistics derived from observed out-of-frame rearrangements in the dataset, and adjust the probability model parameters to iteratively improve agreement.
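To illustrate the conversion of a CDR3 amino acid sequence into degenerate coding sequences, the sketch below enumerates IUPAC-degenerate codons under the standard genetic code. The codon table is standard; the function names and data layout are assumptions made for this example and do not reproduce the original pipeline.

```python
from itertools import product

# IUPAC-degenerate codons for each amino acid (standard genetic code).
# Leucine, arginine, and serine each need two degenerate codons.
DEGENERATE_CODONS = {
    "A": ["GCN"], "C": ["TGY"], "D": ["GAY"], "E": ["GAR"], "F": ["TTY"],
    "G": ["GGN"], "H": ["CAY"], "I": ["ATH"], "K": ["AAR"], "M": ["ATG"],
    "N": ["AAY"], "P": ["CCN"], "Q": ["CAR"], "T": ["ACN"], "V": ["GTN"],
    "W": ["TGG"], "Y": ["TAY"],
    "L": ["CTN", "TTR"], "R": ["CGN", "AGR"], "S": ["TCN", "AGY"],
}

# Nucleotides allowed by each IUPAC code used above.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "H": "ACT", "N": "ACGT"}


def degenerate_coding_sequences(cdr3):
    """List every degenerate nucleotide sequence that encodes the CDR3."""
    choices = [DEGENERATE_CODONS[aa] for aa in cdr3]
    return ["".join(codons) for codons in product(*choices)]


def matches(degenerate, nucleotides):
    """True if a concrete nucleotide sequence is compatible with a degenerate one."""
    nucleotides = nucleotides.upper()
    return len(degenerate) == len(nucleotides) and all(
        nt in IUPAC[code] for code, nt in zip(degenerate, nucleotides)
    )


# A CDR3 with four serines yields 2**4 degenerate sequences.
n_sequences = len(degenerate_coding_sequences("CASSSDSGGTDTQYF"))
```

The list grows as a power of two in the number of leucine, serine, and arginine residues, which is why it stays short for typical CDR3 lengths.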
4.6 Co-occurrence calculations
We used the hypergeometric distribution to assess the significance of an observed overlap between two subsets of the cohort, taking our significance p-value to be the probability of seeing an equal or greater overlap if the two subsets had been chosen at random:
$$P_{\mathrm{overlap}}(k, N_1, N_2, N) = \sum_{j \ge k} \frac{\binom{N_1}{j}\binom{N - N_1}{N_2 - j}}{\binom{N}{N_2}},$$
where k is the size of the overlap, $N_1$ and $N_2$ are the sizes of the two subsets, and N is the total cohort size. A complication arises when assessing TCR-TCR co-occurrence in the presence of variable-sized repertoires: TCRs are more likely to come from the larger repertoires than the smaller ones, which violates the assumptions of the hypergeometric distribution and leads to inflated significance scores. In particular, when we use the hypergeometric distribution to model the overlap between the sets of subjects in which two TCR chains are found, we implicitly assume that all subjects are equally likely to belong to a TCR chain’s subject set. If the subject repertoires vary in size, this assumption will not hold. For example, in the limit of a subject with an empty repertoire, no TCR subject sets will contain that subject, which will inflate the significance of all the overlaps since we are effectively overstating the size N of the cohort by 1. On the other hand, if one of the subject repertoires contains all the public TCR chains, then each TCR-TCR overlap will automatically contain that subject, again inflating the significance since we are artificially adding 1 to each of k, $N_1$, $N_2$, and N. We developed a simple heuristic to correct for this effect using a per-subject bias factor by defining
$$b_i = \frac{S_i}{\frac{1}{N}\sum_{j=1}^{N} S_j},$$
where $S_i$ is the size of repertoire i and N is the cohort size. To score an overlap $\mathcal{O}$ of size k involving subjects $s_1, \ldots, s_k$, we adjust the overlap p-value by the product of the bias factors of the subjects in the overlap:
$$P_{\mathrm{CO}}(\mathcal{O}) = P_{\mathrm{overlap}}(k, N_1, N_2, N)\prod_{j=1}^{k} b_{s_j}.$$
This has the effect of decreasing the significance assigned to overlaps involving larger repertoires, yet remains fast to evaluate, an important consideration given that the all-vs-all TCR co-occurrence calculation involves about $10^{14}$ pairwise comparisons (and this calculation is repeated multiple times with shuffled occurrence patterns to estimate false-discovery rates). When clustering by co-occurrence, we augmented this heuristic p-value correction by also eliminating repertoires with very low (fewer than 30,000) or very high (more than 120,000) numbers of public TCRβ chains (nonzero entries in the occurrence matrix M), as well as five additional repertoires which showed anomalously high levels of TCR nucleotide sharing with another repertoire, all with the goal of reducing potential sources of spurious TCR-TCR co-occurrence signal.
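The co-occurrence score can be sketched as follows. This is an illustrative sketch rather than the original code: it assumes that each bias factor is the repertoire size divided by the mean repertoire size across the cohort, and all function and argument names are hypothetical.

```python
from math import comb, prod


def overlap_pvalue(k, n1, n2, n):
    """Upper-tail hypergeometric probability of an overlap of size >= k
    between two random subsets of sizes n1 and n2 drawn from n subjects."""
    upper = min(n1, n2)
    favorable = sum(comb(n1, j) * comb(n - n1, n2 - j) for j in range(k, upper + 1))
    return favorable / comb(n, n2)


def bias_factors(repertoire_sizes):
    """Per-subject bias: repertoire size relative to the cohort mean (assumed form)."""
    mean_size = sum(repertoire_sizes) / len(repertoire_sizes)
    return [size / mean_size for size in repertoire_sizes]


def cooccurrence_pvalue(subjects_1, subjects_2, repertoire_sizes):
    """Bias-adjusted overlap score for two TCRs given the subject index sets
    in which each TCR occurs. The result is a heuristic score and is not capped at 1."""
    overlap = subjects_1 & subjects_2
    n = len(repertoire_sizes)
    p = overlap_pvalue(len(overlap), len(subjects_1), len(subjects_2), n)
    b = bias_factors(repertoire_sizes)
    return p * prod(b[i] for i in overlap)
```

Because the adjustment is a simple product over the overlapping subjects, it adds little cost to each pairwise comparison, which matters when the number of comparisons is very large.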
4.7 Estimating false-discovery rates
We used the approach of Storey and Tibshirani (2003) to estimate false-discovery rates for detecting associations between TCRs and HLA alleles and between TCRs and CMV seropositivity. Briefly, for a fixed significance threshold P we estimate the false-discovery rate (FDR) by randomly permuting the HLA allele or CMV seropositivity assignments 20 times and computing the average number of significant associations discovered at the threshold P in these shuffled datasets. The estimated FDR is then the ratio of this average shuffled association number to the number of significant associations discovered in the true dataset at the same threshold. In order to estimate a false-discovery rate for TCR-TCR co-occurrence over the full cohort, we performed 20 co-occurrence calculations on shuffled occurrence matrices, preserving the per-subject bias factors during shuffling by resampling each TCR’s occurrence pattern with the bias distribution {bi} determined by the subject repertoire sizes.
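The permutation procedure for a single threshold can be sketched as below. The association test itself is abstracted into a hypothetical callback, `count_significant(labels, threshold)`, which is assumed to return the number of TCRs significantly associated with the given label vector; it stands in for the hypergeometric test and is not part of the original code.

```python
import random


def estimate_fdr(count_significant, labels, threshold, n_shuffles=20, seed=0):
    """Permutation-based FDR estimate at a fixed p-value threshold.

    count_significant: callable (labels, threshold) -> number of significant
        associations; a placeholder for the real association test.
    labels: per-subject phenotype (e.g. allele carrier status or CMV serostatus).
    """
    rng = random.Random(seed)
    observed = count_significant(labels, threshold)
    if observed == 0:
        return float("nan")  # FDR is undefined when nothing is discovered
    shuffled_total = 0
    for _ in range(n_shuffles):
        permuted = list(labels)
        rng.shuffle(permuted)  # break the link between subjects and phenotype
        shuffled_total += count_significant(permuted, threshold)
    return (shuffled_total / n_shuffles) / observed
```

Scanning the threshold and keeping the most permissive value whose estimated FDR stays below 0.05 would then give a cutoff of the kind quoted for the CMV associations.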
4.8 TCR clustering
We used the DBSCAN (Ester et al., 1996) algorithm to cluster public TCRβ chains by their occurrence patterns. DBSCAN is a simple and robust clustering procedure that requires two input parameters: a similarity/distance threshold (Tsim) at which two points in the dataset are considered to be neighbors, and a minimum number of neighbors (Ncore) for a point to be considered a core, as opposed to a border, point. DBSCAN clusters consist of the connected components of the neighbor-graph over the core points, together with any border point neighbors the core cluster members have. To prevent the discovery of fictitious clusters, Tsim and Ncore can be selected so that core points (points with at least Ncore neighbors) are unlikely to occur by chance. There is a trade-off between the two parameter settings: as Tsim is relaxed, points will tend to have more neighbors on average and thus Ncore should be increased, which biases toward discovery of larger clusters; conversely, more stringent settings of Tsim are compatible with smaller values for Ncore which permits the discovery of smaller, more tightly linked clusters.
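The cluster definition above (connected components over core points, plus neighboring border points) can be written directly in terms of a precomputed neighbor graph. The sketch below is a minimal pure-Python illustration, not the implementation used in the study; it follows the convention of this section, in which a core point has at least Ncore neighbors, and it assigns a border point shared by two clusters to whichever cluster reaches it first.

```python
def dbscan_from_neighbors(neighbors, n_core):
    """Cluster points given a symmetric neighbor graph.

    neighbors: dict mapping each point to the set of its neighbors
        (pairs whose similarity score passes the threshold).
    n_core: minimum number of neighbors for a point to be a core point.
    Returns {point: cluster_id}; points absent from the result are noise.
    """
    core = {p for p, nbrs in neighbors.items() if len(nbrs) >= n_core}
    labels = {}
    cluster_id = 0
    for seed in sorted(core):
        if seed in labels:
            continue
        labels[seed] = cluster_id
        stack = [seed]
        while stack:
            point = stack.pop()  # only core points are ever pushed
            for other in neighbors[point]:
                if other in labels:
                    continue
                labels[other] = cluster_id
                if other in core:
                    stack.append(other)  # expand through core points only
        cluster_id += 1
    return labels


# Toy graph: a, b, c are core points; d is a border point; e is noise.
toy_graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),
}
toy_labels = dbscan_from_neighbors(toy_graph, n_core=2)
```

Raising `n_core` in the toy example to 3 leaves only "c" as a core point, illustrating how the two parameters trade off against each other.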
For clustering TCRs by co-occurrence over the full cohort, we used a threshold of Tsim = $10^{-8}$ and chose a value for Ncore (6) such that no core points were found in any of the 20 shuffled datasets. In other words, two TCRs $t_1$ and $t_2$ were considered to be neighbors for DBSCAN clustering if $P_{\mathrm{CO}}(t_1, t_2) < 10^{-8}$; a TCR was considered a core point if it had at least 6 neighbors. Choosing parameters for HLA-restricted TCR clustering was slightly more involved due to the variable number of clustered TCRs for different alleles, and the more complex nature of the similarity metric, whose dependence on TCR sequence makes shuffling-based approaches more challenging. To begin, we transformed the TCRdist sequence-similarity measure into a significance score PTCRdist which captures the probability of seeing an observed or smaller TCRdist score for two randomly selected TCRβ chains. Since public TCRβ chains are on average shorter and closer to germline than private TCRs, we derived the PTCRdist CDF by performing TCRdist calculations on randomly selected public TCRs seen in at least 5 repertoires. We identified neighbors for DBSCAN clustering using a similarity score Psim that combines co-occurrence and TCR sequence similarity:
$$P_{\mathrm{sim}}(t_1, t_2) = f\big(P_{\mathrm{TCRdist}}(t_1, t_2) \cdot P_{\mathrm{CO}}(t_1, t_2)\big),$$
where the transformation by $f(x) = x - x\log(x)$ corrects for taking the product of two p-values because $f(x)$ is the cumulative distribution function of the product of two uniform random variables. Thus, if PTCRdist and PCO are independent and uniformly distributed, the same will be true of Psim.
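The correction for multiplying two p-values is short enough to write out. The sketch below implements f(x) = x − x log(x) with the natural logarithm, which is the CDF of the product of two independent Uniform(0, 1) variables; the function names are illustrative.

```python
import math


def product_cdf(x):
    """CDF of the product of two independent Uniform(0, 1) variables."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    return x - x * math.log(x)


def combined_similarity(p_co, p_tcrdist):
    """Combine a co-occurrence p-value and a sequence-similarity p-value."""
    return product_cdf(p_co * p_tcrdist)


# Two moderately small p-values combine to a value larger than their raw product.
example = combined_similarity(0.1, 0.1)
```

Without the transformation, the raw product of two null p-values would be biased toward small values and would overstate significance.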
For HLA-restricted clustering using this combined similarity measure we set a fixed value of Tsim = $10^{-4}$ and adjusted the Ncore parameter as a function of the total number of TCRs clustered for each allele. As in global clustering, our goal was to choose Ncore such that core points were unlikely to occur by chance (more precisely, had a per-allele probability less than 0.05). We estimated the probability of seeing core points by modeling neighbor number using the binomial distribution, assuming that the observed neighbor number of a given TCR during clustering is determined by M – 1 independent Bernoulli-distributed neighborness tests with rate r, where M is the number of clustered TCRs. Rather than assuming a fixed neighbor rate r across TCRs, we captured the observed variability in neighbor rate (due, for example, to unequal V-gene frequencies and variable CDR3 lengths) by using a mixture of 20 rates estimated from similarity comparisons on randomly chosen public TCRs.
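A sketch of the neighbor-number model follows. Two points are assumptions of this example rather than statements from the text: the mixture components are weighted equally, and the per-allele probability of seeing any core point is bounded by M times the single-TCR tail probability (a union bound). The exact binomial sum is practical only for small M; a log-space or library implementation would be needed for large alleles.

```python
from math import comb


def binom_tail(k, n, r):
    """P(X >= k) for X ~ Binomial(n, r)."""
    return sum(comb(n, j) * r**j * (1.0 - r) ** (n - j) for j in range(k, n + 1))


def mixture_tail(k, n, rates):
    """Tail probability under an equally weighted mixture of binomial rates."""
    return sum(binom_tail(k, n, r) for r in rates) / len(rates)


def choose_n_core(m, rates, alpha=0.05):
    """Smallest n_core for which the union-bound probability that any of the
    m clustered TCRs is a chance core point falls below alpha."""
    for n_core in range(1, m):
        if m * mixture_tail(n_core, m - 1, rates) < alpha:
            return n_core
    return None  # no setting achieves the target for this m
```

For example, with 10 clustered TCRs and a single neighbor rate of 0.5, a TCR would need all 9 possible neighbors before being accepted as a core point.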
We also used this neighbor-number model to assign a p-value (Psize) to each cluster reflecting the likelihood of seeing the observed degree of clustering by chance. Since DBSCAN clusters are effectively single-linkage-style partitionings of the core points (together with any neighboring border points), they can have a variety of shapes, ranging from densely interconnected graphs, to extended clusters held together by local neighbor relationships (Ester et al., 1996). Modeling the total size of these arbitrary groupings is challenging, so we took the simpler and more conservative approach of assigning p-values based on the size of the largest TCR neighborhood (set of neighbors for a single TCR) contained within each cluster. We identified the member TCR with the greatest number of neighbors in each cluster (the cluster center) and computed the likelihood of seeing an equal or greater neighbor-number under the mixture model described above. This significance estimate is conservative in that it neglects clustering contributions from TCRs outside the neighborhood of the cluster center, however in practice we observed that the majority of TCR clusters were dominated by a single dense region of repertoire space and therefore reasonably well-captured by a single neighborhood. To control false discovery when combining DBSCAN clusters from independent clustering runs for different HLA alleles, we used the Holm method (Holm, 1979) applied to the sorted list of cluster Psize values, with a target family-wise error rate (FWER) of 0.05 (i.e., we attempted to limit the overall probability of seeing a false cluster to 0.05). In the Holm FWER calculation we set the total number of hypotheses equal to the total number of TCRs clustered across all alleles minus the cumulative neighbor-count of the cluster centers (we exclude cluster center neighbors since their neighbor counts are not independent of the neighbor count of the cluster center).
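The Holm step-down procedure used to control the family-wise error rate can be sketched as below. The number of hypotheses is passed in separately because, as described above, it is set from the number of clustered TCRs rather than from the number of clusters; the function name and signature are illustrative.

```python
def holm_reject(pvalues, n_hypotheses, fwer=0.05):
    """Indices of p-values rejected by Holm's step-down procedure.

    pvalues: cluster-size p-values, one per candidate cluster.
    n_hypotheses: total number of hypotheses (must be >= len(pvalues)).
    """
    if n_hypotheses < len(pvalues):
        raise ValueError("n_hypotheses must be at least len(pvalues)")
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    rejected = []
    for rank, index in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared
        # to fwer / (n_hypotheses - k + 1).
        if pvalues[index] <= fwer / (n_hypotheses - rank):
            rejected.append(index)
        else:
            break  # step-down: stop at the first failure
    return rejected


# Toy example: only the two smallest p-values survive.
accepted_clusters = holm_reject([0.01, 0.04, 0.03, 0.005], n_hypotheses=4)
```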
4.9 Analyzing TCR clusters
For each (global or HLA-restricted) TCR cluster, we analyzed the occurrence patterns of the member TCRs in order to identify a subset of the (full or allele-positive) cohort enriched for those TCRs. We counted the number of cluster member TCRs found in each subject’s repertoire and sorted the subjects by this TCR count (rank plots in Figure 2B-C and in the right panels of Figure 6). For comparison, we generated control TCR count plots by independently resampling the subjects for each member TCR, preserving the frequency of each TCR and biasing by subject repertoire size. Each complete resampling of the cluster member TCR occurrence patterns produced a subject TCR rank plot; we repeated this resampling process 1000 times and averaged the rank plots to yield the green (‘randomized’) curves in Figure 2B-C and Figure 6. To compare the observed and randomized curves, we took a signed difference between the observed counts $C_j$ and the randomized counts $R_j$:
$$D_{\mathrm{CO}} = \max_{1 \le i \le N}\left(\sum_{j \le i}(C_j - R_j) + \sum_{j > i}(R_j - C_j)\right),$$
where N is the number of subjects and the value of the subject index $i = i_{\max}$ that maximizes the right-hand side in the equation above represents a switchpoint below which the observed counts generally exceed the randomized counts and above which the reverse is true (both sets of counts are sorted in decreasing order). We take this switchpoint $i_{\max}$ as an estimate of the number of enriched subjects for the given cluster (this is the value given in the ‘Subjects’ column in Table 3).
Since the raw DCO values are not comparable between clusters of different sizes and for different alleles, we transformed these values to a Z-score (ZCO) by generating, for each cluster, 1000 additional random TCR count curves and computing the mean ($\mu_D$) and standard deviation ($\sigma_D$) of their score distribution:
$$Z_{\mathrm{CO}} = \frac{D_{\mathrm{CO}} - \mu_D}{\sigma_D}.$$
We used this co-occurrence score ZCO together with a log-transformed version of the cluster size p-value, for visualizing clustering results in Figure 5 (Ssize on the x-axis and ZCO on the y-axis) and prioritizing individual clusters for detailed follow-up.
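The switchpoint and Z-score calculations can be sketched as follows. The form of the signed difference used here (excess of observed over randomized counts before the switchpoint plus excess of randomized over observed counts after it) is an assumption of this example, as is the use of the sample standard deviation; the function names are hypothetical.

```python
from statistics import mean, stdev


def cooccurrence_score(observed, randomized):
    """Return (d_co, i_max) for two count curves sorted in decreasing order.

    d_co is the maximum over switchpoints i of
    sum_{j <= i} (C_j - R_j) + sum_{j > i} (R_j - C_j).
    """
    diffs = [c - r for c, r in zip(observed, randomized)]
    total = sum(diffs)
    best_score, best_i, prefix = float("-inf"), 0, 0.0
    for i, d in enumerate(diffs, start=1):
        prefix += d
        score = prefix - (total - prefix)  # head excess minus tail excess
        if score > best_score:
            best_score, best_i = score, i
    return best_score, best_i


def z_score(d_observed, d_random):
    """Standardize an observed score against scores from random count curves."""
    return (d_observed - mean(d_random)) / stdev(d_random)


# Toy curves: the first two subjects carry more cluster TCRs than expected.
toy_d, toy_switch = cooccurrence_score([5, 4, 1, 0], [3, 3, 2, 2])
```

In the toy example the switchpoint falls after the second subject, so two subjects would be reported as enriched.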
4.10 TCR annotations
We annotated public TCRs in our dataset by matching their sequences against two publicly available datasets: VDJdb (Shugay et al., 2017), a curated database of TCR sequences with known antigen specificities (downloaded on 3/29/18; about 17,000 human TCRβ entries) and McPAS-TCR (Tickotsky et al., 2017), a curated database of pathogen-associated TCR sequences (downloaded on 3/29/18; about 9,000 human TCRβ entries). VDJdb entries are associated with a specific MHC-presented epitope, whereas McPAS-TCR also includes sequences of TCRs isolated from diseased tissues whose epitope specificity is not defined. We added to this merged annotation database the sequences of structurally characterized TCRs of known specificity (see below), as well as literature-derived TCRs from a handful of primary studies (Dash et al., 2017; Glanville et al., 2017; Song et al., 2017; Kasprowicz et al., 2006). For matches between HLA-associated TCRs and database TCRs of known specificity, we filtered for agreement (at 2-digit resolution) between the associated HLA allele in our dataset and the presenting allele from the database. In other words, TCRs belonging to B*08:01-restricted clusters were not annotated with matches to database TCRs that bind to A*02:01-presented peptides.
4.11 Structural analysis
We analyzed a set of experimentally determined TCR:peptide-MHC structures to find MHC positions frequently contacted by the CDR3β loop. Crystal structures of complexes involving human TCRs and human class I or class II HLA alleles were identified using BLAST (Altschul et al., 1997) searches against the RCSB PDB (Berman et al., 2000) sequence database (ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt). Structural coverage of HLA loci and alleles is sparse and highly biased toward well-studied alleles such as HLA-A*02. Given the high degree of structural similarity among class I and among class II MHC structures solved to date, we elected to share contact information across loci using trans-locus sequence alignments. For class I we used the merged alignment (ClassI_prot.txt) available from the IPD-IMGT/HLA (Robinson et al., 2014) database. Starting with multiple sequence alignments for individual class II loci from the IPD-IMGT/HLA database, we inserted gaps as needed in order to create merged alignments for the class II α and β chains. These alignments provided a common reference frame in which to combine residue-residue contacts from the TCR:peptide-MHC structures. We considered two amino acid residues to be in contact if they had a side-chain heavy-atom contact distance less than or equal to 4.5 Å. The CDR3β contact frequency for an alignment position (class I, class II-α, or class II-β) was defined to be the total number of contacted CDR3β amino acids observed for that position, divided by the total number of structures analyzed. Redundancy in the structural database was assessed at the level of TCR and HLA sequence, ignoring the sequence of the peptide. Contacts from a set of n structures all containing the same TCR and HLA were given a weight of 1/n when computing the residue contact frequencies.
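The redundancy weighting can be illustrated with a short sketch. The input layout is hypothetical (one record per structure, listing the TCR, the HLA allele, and the CDR3β contacts as pairs of alignment position and CDR3β residue index), and the denominator is taken to be the number of distinct TCR/HLA pairs, which equals the total weight of the structures; that choice is an assumption of this example.

```python
from collections import Counter, defaultdict


def contact_frequencies(structures):
    """Weighted CDR3-beta contact frequency per MHC alignment position.

    structures: list of dicts with keys 'tcr', 'hla', and 'contacts', where
        'contacts' is a list of (alignment_position, cdr3_residue_index) pairs.
    Structures sharing the same TCR and HLA each receive weight 1/n.
    """
    group_sizes = Counter((s["tcr"], s["hla"]) for s in structures)
    weighted = defaultdict(float)
    for s in structures:
        weight = 1.0 / group_sizes[(s["tcr"], s["hla"])]
        for position, _cdr3_index in s["contacts"]:
            weighted[position] += weight
    n_effective = len(group_sizes)  # sum of all structure weights
    return {pos: total / n_effective for pos, total in weighted.items()}


# Toy input: two redundant structures of one complex plus one distinct complex.
toy_structures = [
    {"tcr": "T1", "hla": "A*02:01", "contacts": [(65, 3)]},
    {"tcr": "T1", "hla": "A*02:01", "contacts": [(65, 3)]},
    {"tcr": "T2", "hla": "B*08:01", "contacts": [(65, 4), (69, 5)]},
]
toy_frequencies = contact_frequencies(toy_structures)
```

The two redundant structures together count as a single observation, so heavily studied complexes do not dominate the contact profile.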
5 Acknowledgments
This work was supported in part through the NIH/NCI Cancer Center Support Grant P30 CA015704 and by NIH NHLBI grant R01-HL105914 to JH, as well as R01 GM113246 and U19 AI117891. The research of Frederick Matsen was supported in part by a Faculty Scholar grant from the Howard Hughes Medical Institute and the Simons Foundation. We gratefully acknowledge superlative computing support from Fred Hutch scientific computing and thank Paul Thomas and Jeremy Crawford for helpful comments on a preliminary version of this manuscript.