Abstract
Despite the extreme diversity of T cell repertoires, many identical T cell receptor (TCR) sequences are found in a large number of individual mice and humans. These widely shared sequences, often referred to as ‘public’, have been suggested to be over-represented due to their potential immune functionality or their ease of generation by V(D)J recombination. Here we show that even for large cohorts the observed degree of sharing of TCR sequences between individuals is well predicted by a model accounting for the known quantitative statistical biases in the generation process, together with a simple model of thymic selection. Whether a sequence is shared by many individuals is predicted to depend on the number of queried individuals and the sampling depth, as well as on the sequence itself, in agreement with the data. We introduce the degree of publicness conditional on the queried cohort size and the size of the sampled repertoires. Based on these observations we propose a public/private sequence classifier, ‘PUBLIC’ (Public Universal Binary Likelihood Inference Classifier), based on the generation probability, which performs very well even for small cohort sizes.
I. INTRODUCTION
The adaptive immune system relies on a diverse set of T-cell receptors (TCR) to recognize pathogen-derived peptides presented by the major histocompatibility complex (MHC). Each T cell expresses a distinct TCR that is created stochastically by V(D)J recombination. This process is very diverse, with the potential to generate up to 10^61 different sequences in humans [1]. The resulting ‘repertoire’ of distinct TCRs expressed in an individual defines a unique footprint of immune protection. Despite this diversity, a significant overlap in the TCR response of different individuals to a variety of antigens and infections has been observed in humans [2–4], mice [5–7], and macaques [8] (reviewed in Refs. [9, 10]). This observation led to the notion of a ‘public’ response shared by all, and a complementary ‘private’ response specific to each individual [5]. Since antigen-specific TCRs have a restricted set of sequences [11, 12], and since there is no identified analog for T cells of B cell affinity maturation, a public response can only arise if the specific responding T cells are independently generated in each individual’s T cell repertoire. It was proposed [7–9] that these shared sequences can be explained by the biases inherent in the V(D)J recombination process, together with ‘convergent recombination’, the possibility to generate the same TCR sequence (especially the same CDR3 amino acid sequence) in independent recombination events. In this hypothesis, shared TCRs are simply those that have a higher-than-average generation probability and are thus more abundant in the unselected repertoire [13]. The advent of high-throughput sequencing of TCR repertoires [14–17] has largely confirmed this view through the analysis of shared TCR sequences between unrelated humans [18–20], monozygous human twins [21, 22], and mice [23].
However, despite recent efforts to characterize the landscape of public TCRs [24], the relative contributions of convergent recombination, V(D)J bias, thymic selection [25], and peripheral and antigen-specific selection remain to be elucidated and quantified.
In this review, we address the sharing phenomenon using quantitative models of the stochastic V(D)J recombination process that have been inferred from repertoire data [26–29]. These generative models, augmented by a simple one-parameter model of thymic selection, can be used to predict the number of sequences that will be shared between any number of individuals, each sampled to any sequencing depth. We make these predictions on the basis of stochastic simulations, but we also derive general mathematical formulas that allow us to calculate sharing from any recombination model. We show that these predictions are in excellent quantitative agreement with data from two recent T cell repertoire studies in humans [30] and mice [23]. Our results are consistent with arguments [9, 31] that the dichotomy between public and private is misleading. Instead, we find a wide range of possible degrees of sharing, depending on the sequencing depth of the individual repertoires, the number of individuals in the study, and the number of individuals between whom the sequence is shared. We propose ‘PUBLIC’ (Public Universal Binary Likelihood Inference Classifier), a ‘publicness score’ defined as the recombination probability predicted by our model. This score predicts the sharing status of any TCR with very high accuracy, irrespective of the definition for being public versus private.
II. PREDICTING SHARING BETWEEN REPERTOIRES
A. Spectrum of sharing numbers
We start with an operational definition of sharing in repertoire data obtained by high-throughput sequencing from several individuals or cell subsets, which closely follows that of Ref. [23]. For each individual, we compile a list of unique TCR sequences (Fig. 1A). Since the functional character of a T cell is thought to be largely determined by the amino acid sequence of the highly variable Complementarity-Determining Region 3, or CDR3 (to be more precisely defined later), of the beta chain protein, we record in our list just the unique CDR3 beta chain amino acid sequences found in a given biological sample of T cells. For each TCR amino acid sequence, we define the ‘sharing number’ as the number of different samples in which that sequence was found (Fig. 1B). The sharing number depends both on the number of samples and on the number of unique sequences in each sample. We note that more restricted definitions of sharing, based for example on the full nucleotide sequence, are possible, but the correspondingly reduced statistics make it harder to draw sharp conclusions. Counting the number of TCRs with each sharing number (Fig. 1C), we obtain a distribution of sharing, from purely private sequences (sharing number 1) to fully public sequences (sharing number equal to the number of individuals), and everything in between. We will compare the distribution of sharing numbers obtained from the data sequences with predictions of our models.
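The operational definition above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used for the analysis: the function names and the toy CDR3 strings are assumptions introduced here, and each sample is taken to be a plain list of CDR3 amino acid sequences.

```python
from collections import Counter

def sharing_numbers(samples):
    """Map each CDR3 amino acid sequence to the number of samples containing it."""
    counts = Counter()
    for sample in samples:
        # set() removes within-sample duplicates, so a sequence is counted
        # at most once per sample regardless of its abundance.
        counts.update(set(sample))
    return dict(counts)

def sharing_distribution(samples):
    """Map each sharing number to the number of distinct sequences that have it."""
    return dict(Counter(sharing_numbers(samples).values()))

# Toy input (hypothetical sequences): three samples of CDR3 amino acid strings.
toy_samples = [
    ["CASSLGQYF", "CASSPGTEAFF", "CASSLGQYF"],
    ["CASSLGQYF", "CASRDRGYEQYF"],
    ["CASSLGQYF", "CASSPGTEAFF"],
]
```

In this toy input, one sequence is fully public (sharing number 3), one is shared by two samples, and one is private.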
Early estimates of sharing of human TCRs [7] showed that assuming a uniform distribution of TCR generation underestimates observed sharing by several orders of magnitude [18]. Thus, having an accurate model for the non-uniform distribution of TCR generation probabilities is crucial for making quantitative predictions of the sharing distribution. A simple non-homogeneous model that assigns lower probability to TCR sequences with more N insertions in the V(D)J recombination process is able to predict sharing between pairs of individuals within the correct order of magnitude [18]. However, this estimate ignores the detailed structure of the biases inherent to the recombination process, which strongly skew the distribution of TCR sequences and, as we will show, influence the sharing spectrum.
B. TCR generation bias
T-cell receptors are composed of an α and a β chain encoded by separate genes stochastically generated by the V(D)J recombination process [32]. Each chain is assembled from the combinatorial concatenation of two or three segments (V as Variable, D as Diversity, and J as Joining for the β chain, and V and J for the α chain) picked at random from a list of germline template genes. Further diversity comes from random nontemplated N insertions between the joined segments, together with random deletions from their ends. The α chain is less diverse than the β chain and sharing analyses have mostly focussed on the latter. The germline gene usages are highly non-uniform [14, 15, 33], due to differences in gene copy numbers [34] as well as the conformation [35] and processive excision dynamics [36] of DNA during recombination. In addition, the distributions of the number of deleted and inserted base pairs, as well as the composition of N nucleotides, are also biased [37]. Taken together, the biases imply that some recombination events are more likely than others. In addition, distinct recombination events can lead to the same nucleotide sequence, and many nucleotide sequences can lead to the same amino-acid sequence. This convergent recombination further skews the distribution of TCRs, as some sequences can be produced in more ways than others [7, 9].
The effects of recombination biases and convergent recombination can be captured by stochastic models of recombination. Given the probability distributions for the choice of gene segments, deletion profiles and insertion patterns, one can generate in silico TCR repertoire samples that mimic the statistics of real repertoires, and allow us to predict sharing statistics and the effects of convergent recombination [11, 20, 22, 23, 26, 38]. To obtain accurate predictions, the distributions of recombination events used in the model must closely match repertoire data. This task is made difficult by the fact that, as a consequence of convergent recombination, the specific recombination event behind an observed sequence is not directly accessible. However, methods of statistical inference can be used to overcome this problem and learn accurate models of V(D)J recombination [26, 27, 29, 39], models which can in turn be used to predict sharing properties of sampled repertoires or of individual TCR sequences. These models have been shown to vary little between individuals, with small differences only in the germline gene usage and remarkable reproducibility in the insertion and deletion profiles [?]. In our analysis we will assume a universal model, independent of the individual.
C. Using TCR recombination models to predict sharing
We used the above-described models of recombination to predict the distribution of sharing among cohorts of humans and mice. Specifically, we re-analyzed published TCR β-chain nucleotide sequences of 14 Black-6 mice [23] and 658 human donors [30] (Methods). Individual samples comprised 20,000–50,000 unique sequences for mice, and up to 400,000 for humans. Sequences were translated into amino-acid sequences, and trimmed to keep only the CDR3 loop, defined as the sequence between the last cysteine in the V gene and the first phenylalanine in the J gene [40]. The sharing number of each observed CDR3 amino acid sequence, and the sharing number distribution, were then computed from the data. We chose to focus on the CDR3 amino-acid sequences to get higher sharing numbers than would have been obtained for untrimmed nucleotide sequences, limiting the effects of sequencing errors and allowing for a better comparison to the model.
To obtain model predictions for humans, we used a previously described model for TCRβ sequence generation inferred by the software package IGoR [29] from repertoire data of a single individual [30] (Methods). The mouse model was inferred using IGoR from the repertoire data of the 14 animals of Ref. [23]. In both cases, the model is learned from unproductive rearrangements (i.e. with a frameshift in the CDR3) since those sequences give us access to the raw result of recombination, without subsequent effects of selection [26]. These unproductive sequences are only used to infer a generative model and are not used in the sharing analysis. A productive (in-frame) sequence that is generated in a V(D)J recombination event will not necessarily survive thymic selection to become a functional T cell in the periphery. To model this effect, we assume that there is a probability q, independent of the actual sequence but dependent on the species under study, that any given generated sequence will survive thymic selection [41]. Model sharing predictions are then obtained in two ways: (i) by simulating sequences and selecting them at random with probability q to generate samples of the same size as in the data (an important point about simulation is that, once a particular CDR3 amino acid sequence has been chosen to not pass thymic selection, any future recurrence of that sequence in the simulation is also discarded); (ii) by deriving analytical mathematical expressions for the expected value (Methods). These predictions can then be directly compared to data.
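The simulation route (i) can be illustrated with a toy Python sketch. It rests on several assumptions that are not part of the text: the recombination model is replaced by a hypothetical dictionary `gen_probs` of sequence probabilities (not IGoR's interface), the selection fate of each sequence is stored in a dictionary `fate` shared across all simulated individuals, and a `max_draws` guard is added so that the loop always terminates.

```python
import random

def simulate_sample(gen_probs, q, sample_size, fate, rng, max_draws=100_000):
    """Collect up to `sample_size` unique sequences that pass selection.

    gen_probs: dict mapping sequence -> generation probability (toy stand-in
               for a recombination model; hypothetical, for illustration only).
    q:         probability that a generated sequence passes thymic selection.
    fate:      dict shared across calls, remembering the selection outcome
               of every sequence that has already been drawn.
    """
    seqs = list(gen_probs)
    weights = [gen_probs[s] for s in seqs]
    sample = set()
    for _ in range(max_draws):
        if len(sample) >= sample_size:
            break
        s = rng.choices(seqs, weights=weights, k=1)[0]
        # The fate is decided once per sequence: a sequence that failed
        # selection is discarded at every later recurrence as well.
        if s not in fate:
            fate[s] = rng.random() < q
        if fate[s]:
            sample.add(s)
    return sample
```

Passing the same `fate` dictionary to every call makes a rejected sequence absent from all simulated repertoires, which is one reading of the bookkeeping rule stated in the parenthesis above.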
D. Model predicts many degrees of publicness in the data
The comparison between data, model simulations and mathematical predictions shows excellent agreement in mice (Fig. 2A) and humans (Fig. 2B). The predictions depend on the only free parameter of the model, the selection factor q. This parameter was not set simply by fitting the sharing curves to the data. Instead, it was obtained independently as a proportionality factor required to explain the number of observed unique amino acid CDR3 sequences given the number of unique nucleotide sequences (insets of Figs. 2A and B). This convergent recombination curve depends on q in a predictable way (see Methods for mathematical expressions), making it possible to fit q to the data. This method yielded selection factors of q = 0.15 for mice, and q = 0.037 for humans, surprisingly close to the estimate of 3% for the fraction of human TCRs that pass thymic selection [42]. Comparison of the prediction with and without selection in mice (red and green lines and points in Fig. 2A) shows that adding selection greatly improves the agreement, despite a slight overestimation of high sharing numbers.
Humans have a much more diverse repertoire than mice [28], resulting in a lower number of shared amino acid TCR sequences. On the other hand, the very large cohort in the data set we analyze allows us to illustrate the very wide range of sharing behaviors. In particular, we find a long-tailed power-law distribution of sharing numbers (Fig. 2B), a feature that is reproduced by the model. A very small fraction of sequences is shared between all individuals in the 658 donor cohort, while a large (> 90%) fraction of TCRs is found in just one sample. This diversity of behaviors reflects the diversity of generation probabilities implied by the strong biases in the VDJ recombination process that are correctly captured by our model.
III. FROM SAMPLES TO FULL REPERTOIRES
A. Sampling depth affects sharing
While the sharing potential of a sequence depends just on its generation and selection probabilities, it is important to realize that actual sharing numbers will depend on the size of the cohort under study and the sampling depth of each individual T cell repertoire. To illustrate this effect, we downsampled both the cohort size and the number of sequences in the human dataset, and recalculated sharing. Fig. 3A compares the distribution of sharing numbers in the original dataset with the same distribution obtained from samples where a random half of the unique sequences were removed. The number of TCRs with each sharing number drops with downsampling, and this drop is more marked for high sharing numbers, as evidenced by the fraction of CDR3s with each sharing number (see inset of Fig. 3A). In short, the more TCRs are captured in the repertoire samples, the more likely sequences are to be shared. This effect is reproduced in detail by the model calculations. This result generalizes to arbitrary sharing numbers the previous observations that the number of shared TCRs between a pair of individuals should scale approximately with the product of the numbers of unique TCRs in each sample [20, 21, 26, 43].
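The pairwise scaling mentioned above can be checked numerically with a short Python sketch. It is an illustration under stated assumptions, not the expression derived in Methods: each sample is treated as n independent draws from a fixed list of generation probabilities, the presence probability of a sequence is approximated by 1 − exp(−n p), and selection is ignored. The function name and the toy probabilities are hypothetical.

```python
import math

def expected_pair_sharing(probs, n1, n2):
    """Expected number of sequences present in both of two independent samples.

    probs: iterable of generation probabilities, one per distinct sequence.
    n1, n2: number of independent draws in each sample.
    Presence in a sample of n draws is approximated by 1 - exp(-n * p).
    """
    total = 0.0
    for p in probs:
        present1 = -math.expm1(-n1 * p)  # 1 - exp(-n1 * p), accurate for tiny p
        present2 = -math.expm1(-n2 * p)
        total += present1 * present2
    return total

# Toy model: 1000 equally likely sequences, sampled far from saturation.
toy_probs = [1e-3] * 1000
ratio = expected_pair_sharing(toy_probs, 2, 2) / expected_pair_sharing(toy_probs, 1, 1)
```

When n p is small for every sequence, each factor is close to n p, so the sum is approximately n1 n2 Σ p², which is the product scaling; in the toy case, doubling both sample sizes multiplies the expected sharing by a factor close to 4.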
To demonstrate the effects of varying cohort and sample size more clearly, we plot in Fig. 3B the complementary quantity, namely the fraction of CDR3s which are purely ‘private’, i.e. present in only one repertoire. This fraction decreases for large cohorts and large sample sizes. We note that cohort size and sample depth vary greatly from study to study; the data analyzed in this review go from a small cohort of mice (14 repertoires with a few tens of thousands of TCRs each) to a very large cohort of humans (658 donors with 200,000 TCRs each). The strong dependence of the notion of privateness upon the parameters of the study cautions us against interpreting sharing numbers and public or private status of individual sequences too literally, and further emphasizes that publicness is not a binary but rather a continuous measure.
B. Cumulative diversity and extrapolation to full repertoires
As Fig. 3B shows, most (more than 90%) amino acid TCRs are found in only one repertoire. This means that, when pooling repertoires, each newly added repertoire will contribute a brand new set of TCRs to the pool. To explore this idea, we define the ‘cumulative repertoire’ obtained by pooling together the sampled repertoires of several individuals, and count the number of unique TCRβ amino acid sequences in it. This cumulative diversity grows almost linearly with the number of pooled samples (Fig. 4A), both in the data and according to the model (see Methods for calculation of the model prediction). The ratio of unique to total sequences starts at 1 for small numbers of pooled individuals, and decreases to around 0.9 for high numbers of pooled individuals, consistent with the fraction of private sequences. It is interesting to ask whether this trend would continue for larger populations all the way up to the entire world population. Although we cannot answer this question directly by experiments, we can use the model to make predictions. Generating in silico repertoires for billions of individuals is of course impractical, but we can use mathematical expressions (Methods) to calculate the expected diversity. Fig. 4B shows the theoretical cumulative diversity as a function of the number of individuals for up to 10^12 individuals. Even with numbers of individuals largely exceeding the number of humans having ever lived (10^11), we are very far from saturating the space of observed TCRs.
The previous estimates rely on partial repertoires comprising a few hundred thousand unique TCRs obtained from small blood samples. However, the human body hosts 5 × 10^11 T cells [44], and while the T cell population has a clonal structure, recent estimates of the number of clones, and thus of independent TCR recombination events, range from 10^8 (from indirect sampling using potentially inaccurate statistical estimators [45]) to 10^10 (based on theoretical arguments [46]). The theoretical cumulative diversity based on that latter estimate of 10^10 (Fig. 4B, black curve) still shows no sign of saturation. These results are a consequence of the enormous potential diversity of VDJ recombination, and indicate that the diversity of TCRβ is not exhausted even by the pooled repertoire of the entire world population.
Extrapolating these considerations to the full TCR repertoire of an individual allows us to estimate the fraction of truly ‘public’ TCRs, defined as the sequences that are present in almost all individuals. If we define a public TCR sequence as one that has a generation probability larger than 1/N, where N is the number of T-cell clones in the body, then at least 1 − e^(−1) ≈ 63% of all individuals would be expected to have that sequence in their repertoire. With this definition, we can predict the percentage of public sequences as a function of repertoire size (Fig. 4C). Interestingly, this fraction ranges from 10 to 20% for both humans and mice depending on estimates of the number of clones, despite their widely different TCRβ diversities and repertoire sizes. It is interesting to note that the lower diversity of the TCRβ repertoire in mice as compared to humans is matched in a proportional way to the ratio of the TCR repertoire sizes in the two species.
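The 63% figure follows from a one-line calculation, sketched here in Python. The sketch assumes that the N clones of a repertoire are N independent recombination events and ignores selection; the function name is introduced here for illustration only.

```python
import math

def presence_probability(p_gen, n_clones):
    """Probability that at least one of n_clones independent recombination
    events produces a sequence of generation probability p_gen."""
    # 1 - (1 - p_gen) ** n_clones, written with log1p/expm1 so that the
    # result stays accurate when p_gen is tiny and n_clones is huge.
    return -math.expm1(n_clones * math.log1p(-p_gen))

# Threshold case p_gen = 1/N, with N = 10^10 clones as in one of the quoted estimates.
at_threshold = presence_probability(1e-10, 10**10)
```

At the threshold p = 1/N the result is (to high accuracy) 1 − e^(−1), and any sequence with a larger generation probability is present with a higher probability.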
IV. PREDICTING PUBLICNESS
A. Sharing and TCR generation probability
As we have seen, the sequence generation model correctly predicts the amount of sharing across individuals, as well as the fraction of public sequences. Underlying this prediction method is the idea that the likelihood that a given sequence will be shared is largely determined by the probability of generation of the sequence. Early versions of this argument [9, 47] noted that sequences with a high number of N insertions have lower generation probability (because of the diversity of possible insertions, each reducing the generation probability by a factor ≈ 1/4), predicting that shared sequences would have fewer insertions than average. We have used recombination models inferred from data to refine this argument by accounting quantitatively for the effects of biases, convergent recombination, etc., on the probability of generation of particular TCR sequences. As a further test of the underlying ideas, we compute the generation probability of TCR sequences and ask how this quantity correlates with the sharing numbers.
To calculate the generation probability of TCRs, one needs to sum the occurrence probabilities of all the possible recombination events leading to a given nucleotide sequence [26, 29] and, since we choose to follow CDR3 amino acid sequences, sum the probabilities of all nucleotide sequences leading to the amino acid sequence of interest. This is a computationally hard task that can be rendered tractable using a dynamic programming approach (see Methods). We find that the distribution of generation probabilities of all TCRβ CDR3 amino acid sequences (Fig. 5, blue curves) is extremely broad, spanning many orders of magnitude. This observation is consistent with similar analyses at the level of nucleotide sequences in nonproductive [26] and productive [20] human TCRβ, in the α and β chains of monozygous twins [22], and mice [28]. If we plot instead the generative probability distribution of sequences that are shared among two or more individuals in our data set, we find that the distribution narrows and shifts towards higher generation probabilities [20, 22, 26] as expected. This effect is displayed in more detail in a plot of the generative probability distribution for sequences in our dataset with different sharing numbers (Fig. 5). On the same figure we plot the predictions of the recombination model, following the same protocol used for predicting sharing numbers (see Methods). There is a systematic shift between the predictions of the recombination model and the distribution of the data itself, for all sharing levels. This difference is due to the fact that the recombination model was inferred from non-productive sequences, and does not account for selection effects. The data sequences, however, have passed thymic and possibly other kinds of peripheral selection, affecting their statistics.
The sequence-dependent nature of this effect was characterized and quantified in [20], with the general finding that selection favors sequences with high generation probability. This is qualitatively consistent with the positive sign of the shift (solid lines versus dotted lines) we see in Fig. 5. Our sharing calculations ignore any possible sequence dependence of selection, and instead select TCRs at random (with probability q), regardless of their sequence identity. The model prediction could in principle be improved by adding sequence-dependent selection factors to match the distributions as in [20]. However, unlike the recombination model, such factors are expected to be specific to each individual, owing to their unique HLA type which is involved in thymic selection.
B. PUBLIC: Classifier of public vs. private TCRs based on generation probability
The distributions of generation probabilities for the different sharing numbers suggest that the generation probability is a good proxy for the property of being public, regardless of the exact definition of publicness. We built a classifier called PUBLIC (Public Universal Binary Likelihood Inference Classifier), which is entirely based on the probability of generation computed as explained above (detailed in Methods) for each amino acid sequence (Fig. 6A). Before discussing the performance of this classifier, it is important to note that it is based on a model of recombination trained in a completely unsupervised way, i.e. without using any information about the public status of the sequences. In fact, this training can be done with IGoR [29] from the repertoire of a single individual, without including any sharing information. Unlike previous approaches [23], we do not fit additional model features based on the catalogue of sequences with their public or private status.
We arbitrarily define as ‘public’ the TCRs that are found in at least m repertoire samples among a total pool of n individuals. The PUBLIC classifier calls a given TCR ‘public’ if its generation probability is larger than a threshold θ, calling it ‘private’ otherwise. Intuitively, the threshold should be set to separate reliably the peaks in the probability density function of Fig. 5 corresponding to different sharing numbers, as schematized in Fig. 6B. The general performance of the PUBLIC classifier can be estimated by plotting the Receiver Operating Characteristic (ROC) curve, which represents the rate of false positives versus that of true positives as θ is varied (Fig. 6C).
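The thresholding rule and the construction of the ROC curve can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are introduced here, and the generation probabilities are assumed to have been computed beforehand by some recombination model.

```python
def is_public(sharing_number, m):
    """Operational label: public if the sequence is found in at least m samples."""
    return sharing_number >= m

def classify(p_gen, theta):
    """Threshold rule: call a TCR public if its generation probability exceeds theta."""
    return "public" if p_gen > theta else "private"

def roc_points(p_gens, labels, thresholds):
    """Return one (false positive rate, true positive rate) pair per threshold.

    p_gens: generation probabilities, one per sequence.
    labels: booleans, True where the sequence is operationally public.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for theta in thresholds:
        # Count sequences called public that are truly public / truly private.
        tp = sum(1 for p, pub in zip(p_gens, labels) if pub and p > theta)
        fp = sum(1 for p, pub in zip(p_gens, labels) if not pub and p > theta)
        points.append((fp / n_neg, tp / n_pos))
    return points
```

Sweeping θ from very high to very low values moves the point from (0, 0), where nothing is called public, to (1, 1), where everything is; a threshold that separates the two classes perfectly gives the corner (0, 1).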
We plot ROC curves for a few different choices of m (the minimal number of individuals with the TCR in their sampled repertoire for the sequence to be called public), for mice (Fig. 7A) and humans (Fig. 7B). The classification accuracy improves as publicness is defined to be more restrictive (larger m), although it performs well even for small m. For mice, the dataset we used had few individuals, making the operational definition of publicness less reliable. However, for humans we find highly public TCRs are predicted almost perfectly by PUBLIC, despite the larger diversity of human TCRs. This suggests that the lesser performance of PUBLIC for mice may be attributed to the small size of the cohort, rather than to limitations of the classifier itself.
The performance of PUBLIC can be reduced to a single number by calculating the area under the ROC curve (AUROC). The AUROC corresponds to the probability that the classifier ranks a randomly chosen public sequence higher than a randomly chosen private one. The closer the AUROC score is to 1, the better the classifier. As was clear from the ROC curves themselves, the AUROC improves as the degree of publicness is higher (insets of Fig. 7A-B). As the minimal sharing number m increases, the classifying task becomes easier and the prediction better. In fact, having the minimal sharing number m close to the cohort size n will in general make publicness rarer, and the public sequences more extreme in their generation probabilities.
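The ranking interpretation of the AUROC can be made concrete with a direct pairwise count. The sketch below is illustrative only: it is quadratic in the number of sequences, so it suits small examples rather than full repertoires, and the convention of counting ties as one half is an assumption introduced here.

```python
def auroc(public_scores, private_scores):
    """Fraction of (public, private) pairs in which the public sequence has
    the higher score; ties contribute one half."""
    wins = 0.0
    for pub in public_scores:
        for priv in private_scores:
            if pub > priv:
                wins += 1.0
            elif pub == priv:
                wins += 0.5
    return wins / (len(public_scores) * len(private_scores))
```

Here the score would be the generation probability (or its logarithm, which gives the same ranking): a value of 1 means every public sequence outranks every private one, and a value of 0.5 corresponds to a random ranking.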
V. PUBLIC SPECIFIC RESPONSE
Sharing properties are interesting in their own right, but they also provide a basal expectation for the prevalence of certain TCRs. Using the sharing prediction, one can identify TCRs that are more shared in specific populations or subsets than expected according to the recombination model. When counting sharing in a population of individuals affected by a common condition, this ‘oversharing’ can be indicative of a specific T-cell response to the antigens associated with the condition. Such sharing of specific TCRs is expected from the relatively low diversity of antigen-specific sequences revealed by in vitro multimer-staining experiments [11, 12]. A very similar idea has been exploited by several groups to identify TCRs specific to the Cytomegalovirus [30], Type-1 diabetes [48, 49], arthritis [50] and other immune diseases [51]. In these studies, there is no theoretical expectation from the recombination model. Rather, the basal expectation for TCR sharing is given by a negative-control cohort. However, this control can be efficiently replaced by the recombination model presented here, as demonstrated in [41]. In this analysis, specific TCRs emerge as outliers that are shared much more frequently than predicted by the model.
We wondered whether such an approach could be useful for identifying tumor-specific TCRs as sharing outliers among cancer patients. The T-cell repertoire of tumor-infiltrating cells has been studied to look for signatures of immunogenicity [52–54], and the overlap between the tumor and blood repertoires was shown to predict survival in glioblastoma patients [55]. In addition, tumor-specific TCRs have been reported to be shared in the tumor-infiltrating and blood T-cell repertoires of breast cancer [56].
We thus asked whether the blood repertoires of patients with bladder cancer contained TCRs with more sharing than would be predicted by our recombination model. We performed the sharing analysis on 30 patients with bladder cancer, on TCR repertoires sequenced from blood samples [54]. We compared it with 30 healthy individuals, chosen at random among the individuals studied in Ref. [30] to have similar sample sizes. We then downsampled the reference repertoires of the healthy individuals to have the exact same sample sizes as the cancer patients to guarantee a fair comparison. We found that the numbers of shared sequences in the blood of bladder cancer patients are almost identical to those found in the healthy samples, and thus also in agreement with the recombination model (Fig. 8). This is consistent with previous reports that did not find any signatures of TCR repertoire anomalies in the blood of bladder cancer patients, although some small differences could be seen in the tumors. There are many possible explanations for this observation: the tumor-specific response may be statistically negligible amid the large number of other cells; or the response may not have propagated to the blood; or different patients generate responses against different neoantigens; or they generate very different responses against the same neoantigen; or the tumor does not generate any response at all. Tumor samples from larger cohorts would be needed to distinguish between these different hypotheses. Additionally, this result applies only to bladder cancer. Different tumor types that have a higher rate of infiltration to the blood may be more likely to result in detectable signatures in the blood.
VI. DISCUSSION
In this paper we extensively tested and quantified the previously proposed hypothesis [9, 31] that public TCRs owe their status to the ease of generating them through V(D)J recombination. Predicting and characterizing TCR sharing and publicness is important to identify universal features of the immune response across individuals. This knowledge can be useful when designing vaccines that have a high probability of eliciting an immune response, or for identifying candidate T-cell clones in immunotherapeutic strategies [57].
Our predictions, and their agreement with the data, support the notion that ‘publicness’, as it is usually defined, is context-dependent [9]. The public status of a TCR depends not only on its (intrinsic) generation probability, but also on (extrinsic) parameters including the number of individuals sampled, the sequencing depth of the samples, and the definition of publicness, i.e. the minimal number of individuals that need to share a TCR to call that TCR public. Instead, we have shown that we can define the potential for publicness, largely determined by the generation probability of the sequence, and use it to predict actual sharing numbers for any set of repertoire samples. At the same time, we proposed that an absolute notion of publicness can be defined based on the full repertoire of individuals. According to this definition, a TCR is public if its probability of occurrence is larger than the inverse of the number of unique TCRs hosted in the entire repertoire. While this definition is impossible to explore directly in humans, for whom only repertoire samples can be obtained, our data-driven recombination model can make predictions about the public status of particular sequences, and the fraction of the repertoire that is public, using this specific definition (Fig. 4).
We report a wide spectrum of publicness, which we show arises from the very wide distribution of TCR generation probabilities. The high end of the distribution holds sequences that will be included in any healthy repertoire, just by virtue of their high generation probability. Due to their publicness, it had been conjectured that some of these common TCRs might have a close to innate function [31]. In this context it should be noted that young, pre-birth repertoires are known to be much less diverse both in humans [22] and mice [28], due to the late appearance of TdT, the enzyme responsible for insertions in the recombination process. Consequently, the pre-birth repertoire is expected to be much more public than the adult one, and could be enriched in innate-like TCRs. However, since no conclusive evidence has been provided about the functional role of these high-probability sequences, we cannot rule out the possibility that they are just there by chance, without a specific function. The other end of the TCR distribution, the long tail of low generation probabilities, contributes to the private part of the repertoire, which makes up the majority of the repertoire according to our estimates. It would be interesting to explore whether these sequences have a functional role or are just by-products of the recombination process.
High-throughput TCR repertoire datasets contain abundance levels (number of reads) for each TCR. TCR abundances have been attributed to convergent recombination, implying a correlation between high abundance and publicness [9]. However, this connection may be confounded by other processes affecting the abundance levels reported by high-throughput sequencing. A major source of diversity in TCR abundances is the peripheral proliferation of some TCRs, regardless of their generation probability. In addition, experimental or phenotypic noise, including PCR amplification noise [58] and expression variability (for cDNA sequencing), also affects reported abundances. These various effects are expected to dilute the correlation between abundance and publicness. Note that our statistical models are constructed based only on unique sequences, circumventing clonal expansion dynamics, and ignoring abundance levels altogether.
The sharing analysis naturally leads to defining the PUBLIC score, which we show predicts sharing properties with high accuracy. The PUBLIC score is learned in an unsupervised manner, using a statistical model trained with no information about the sharing status of TCRs. Thus, sharing can be very well predicted with neither abundance nor sharing information. This success suggests that being public is a very basic property of the recombination process itself, and also provides a strong validation of the recombination model. It would be interesting to explore whether refining the classifier in a supervised manner, using TCR sharing status and abundance levels, could lead to better predictions.
Our prediction for sharing is mainly based on the generation model [29], which is sequence-specific, attributing to each sequence its own probability of generation. We have found that an overall selection factor is needed to predict sharing numbers correctly, but this simple and effective model is sequence-independent. Previous work [20] inferred a sequence-specific selection process by comparing generation model results to observed sequences. In principle such a model could be combined within our framework to yield refined sharing predictions. While the parameters of the generation process are largely invariant across individuals [29], selection is expected to be individual-dependent and heritable due to the diversity of HLA types in the population [59]. The large variability in the V and J gene selection pressures inferred in [20] is consistent with this notion, but in the same work some amino-acid features of selection were found to be universal. Quantifying these universal features and including them in the model could both improve the predictions for the sharing numbers, and enable a better assessment of the potential publicness of specific sequences through an improved classifier.
The discussion in this work was focused on TCRβ chains, but in general can be applied to any recombined chain, including α, γ and δ TCR chains, as well as B cell receptor (BCR) light and heavy chains, or to paired chain combinations. The α chain, as part of the αβ receptor, contributes to antigen recognition. It is less diverse than the β chain, implying higher sharing numbers [22]. Paired αβ data is becoming available as paired sequencing technologies improve [60, 61], but the resulting repertoires are currently too small or not yet available for analysis. As more paired sequencing data becomes available, it will be interesting to study the sharing properties of the αβ repertoire using recombination models for pairs.
A similar analysis could be performed on BCRs. There, the problem is further complicated by somatic hypermutations, which add further diversity and are expected to reduce sharing as well as the ability to predict it. Generation models have been trained for BCRs [29, 39, 62], but the role of the generation probability in sharing and publicness has not been explored. Machine learning approaches to predict publicness of BCRs [63] could be combined with estimates of the probability of generation and of hypermutation profiles [29, 64, 65] to provide accurate predictions for the public status of BCRs. Such an analysis, applied to the result of affinity maturation in different individuals infected with the same pathogens [66], could be used to assess the impact of convergent recombination in the public response, better understand the evolution of specific antibodies, and guide vaccination strategies to facilitate the emergence of broadly neutralizing antibodies [67].
VII. METHODS
A. The probability of generating a TCR sequence
To evaluate TCR generation probabilities, we first constructed a probabilistic generation model of the recombination process [26]. Such a model is parametrized by probabilities for each choice of V, D and J gene, for each deletion length of the different genes, and for each insertion pattern of random nucleotides between the genes. Then the probability of a recombination scenario is
The probability of a TCR sequence, whether it is a nucleotide or amino-acid sequence, for the full sequence or just the CDR3, is obtained by summing the above probability over all the possible scenarios leading to the sequence of interest.
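For orientation, a factorized form often used for TCRβ recombination models of this kind is sketched below. The particular dependency structure shown (for example, D and J chosen jointly, and deletions conditioned on the gene) is an assumption made for illustration and may differ from the inferred model. The second line restates the sum over scenarios described above.

```latex
% Schematic only: the dependency structure is assumed, not taken from the text.
P_{\rm recomb}(\text{scenario}) =
  P(V)\, P(D, J)\, P(\mathrm{del}V \mid V)\, P(\mathrm{ins}VD)\,
  P(\mathrm{del}D_{5'}, \mathrm{del}D_{3'} \mid D)\, P(\mathrm{ins}DJ)\,
  P(\mathrm{del}J \mid J)

% Probability of a sequence s: sum over every scenario that produces s.
p_{\rm gen}(s) = \sum_{\text{scenario} \,\to\, s} P_{\rm recomb}(\text{scenario})
```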
The generation model can be inferred efficiently using the IGoR software [29] from non-functional recombinations, which produce out-of-frame or stop-codon-containing sequences. Model training is done by finding the model parameters that maximize the likelihood of the data, equal to the product of the generation probabilities of the observed TCRs in the dataset. Here we used IGoR to infer a generation model from the non-functional reads in the same datasets that provided the productive reads used for the sharing analysis: human data in [30], and mouse data in [23].
To calculate the generation probabilities of CDR3 amino-acid sequences, we used an efficient algorithm that avoids brute-force summation of all possible scenarios using dynamic programming.
B. Evaluating the number of shared sequences using simulations
Once inferred, a generative model can be used to generate random in silico samples of any size. Recombination scenarios are generated using Monte Carlo sampling by drawing events such as gene choices, deletions and insertions according to the model parameters. Each recombination scenario constructs a nucleotide sequence, which is filtered for productivity (in-frame, no stop codons or pseudogenes, and the conserved residues C and F are present). A productive nucleotide sequence is then trimmed to the CDR3β region and translated into an amino acid sequence. To model thymic selection, only a random fraction q of the productive CDR3β sequences is retained. This is implemented using a hash function, keeping only sequences whose normalized hash values are less than q. This negative selection process is a random function of the sequence that is consistent across simulated individuals, so that a given CDR3β will either pass or fail selection in all individuals. A simulated sample can thus be generated to match the cohort size and sequencing depth of the real data, and then analysed with the same pipelines.
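The hash-based selection step can be illustrated with a minimal sketch. The choice of SHA-256, the function names, and the example CDR3 strings are assumptions made for illustration; the text does not specify which hash function was used.

```python
import hashlib

def normalized_hash(cdr3: str) -> float:
    """Map a CDR3 amino-acid string to a reproducible number in [0, 1)."""
    digest = hashlib.sha256(cdr3.encode("ascii")).digest()
    # Keep the top 53 bits of the first 8 bytes so the ratio is an exact float below 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53

def passes_selection(cdr3: str, q: float) -> bool:
    """Keep a sequence if and only if its normalized hash is below q."""
    return normalized_hash(cdr3) < q

def select_sample(cdr3_list, q):
    """Apply the same sequence-dependent filter to one simulated individual."""
    return [s for s in cdr3_list if passes_selection(s, q)]

# Hypothetical productive CDR3 sequences for two simulated individuals.
individual_a = ["CASSLGQGAYEQYF", "CASSPGTGGYEQYF", "CASSQDRGNTEAFF"]
individual_b = ["CASSLGQGAYEQYF", "CASSIRSSYEQYF"]

q = 0.5  # illustrative value, not an inferred selection factor
kept_a = select_sample(individual_a, q)
kept_b = select_sample(individual_b, q)
```

Because the decision depends only on the sequence, a CDR3 present in both individuals is kept in both or in neither. A cryptographic hash from `hashlib` is used here rather than Python's built-in `hash`, whose values for strings change between interpreter runs.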
C. Analytical calculation of the number of shared sequences
1. Predicting sharing numbers from the distribution of generation probabilities
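Given a collection of CDR3β sequences s ∈ S, a model that assigns a probability p(s) to each sequence, and N independent sequences drawn from the model, the expected number of observed unique sequences M0 is:
$$M_0 = \sum_{s \in S} \left[ 1 - \left(1 - p(s)\right)^{N} \right] \approx \sum_{s \in S} \left( 1 - e^{-N p(s)} \right), \tag{2}$$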
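where we have made the Poisson approximation for small p(s). If there are n individuals, with sample sizes {Ni}, then the expected number of sequences which will be found in exactly m individuals (sharing number m) is:
$$M_m = \sum_{s \in S} \; \sum_{J \in J_m} \; \prod_{i \in J} \left( 1 - e^{-N_i p(s)} \right) \prod_{j \notin J} e^{-N_j p(s)}, \tag{3}$$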
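where Jm is the collection of all possible combinations of m individuals. This can be computed more efficiently by use of the generating function G(x, {Ni}):
$$G(x, \{N_i\}) = \sum_{s \in S} \prod_{i=1}^{n} \left[ e^{-N_i p(s)} + \left( 1 - e^{-N_i p(s)} \right) x \right] = \sum_{m} M_m x^m, \tag{4}$$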
where the Mms are the coefficients of the polynomial G(x, {Ni}), and can be calculated just by expanding the polynomial in x and summing over s.
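The polynomial expansion can be sketched for a small, explicitly enumerated set of sequences. The function name and the toy probabilities are hypothetical. The sketch assumes the Poisson form in which a sequence with probability p(s) is missed by a sample of size N with probability exp(−N p(s)). Note that the entry at index 0 of the result counts sequences absent from every sample, which is a different quantity from the number of unique sequences M0 used elsewhere in the text.

```python
import numpy as np

def sharing_spectrum(p, sample_sizes):
    """Expected number of sequences found in exactly m of the samples, for m = 0..n."""
    p = np.asarray(p, dtype=float)
    # coeffs[s, m] is the coefficient of x^m for sequence s; start from the polynomial 1.
    coeffs = np.zeros((p.size, len(sample_sizes) + 1))
    coeffs[:, 0] = 1.0
    for N in sample_sizes:
        absent = np.exp(-N * p)      # probability that the sample misses sequence s
        present = 1.0 - absent       # probability that the sample contains sequence s
        # Multiply each per-sequence polynomial by (absent + present * x).
        updated = coeffs * absent[:, None]
        updated[:, 1:] += coeffs[:, :-1] * present[:, None]
        coeffs = updated
    # Summing over sequences gives the coefficients of the full generating function.
    return coeffs.sum(axis=0)

# Toy model with four sequences and three samples of different depth.
p_toy = np.array([0.5, 0.3, 0.15, 0.05])
spectrum = sharing_spectrum(p_toy, [2, 5, 10])
print(spectrum)
```

Building the product one individual at a time costs a number of operations proportional to n² per sequence, instead of enumerating all combinations of m individuals.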
2. Density of states approximation
While the above equations are exact, summing over each individual sequence is intractable given the huge number of sequences. Instead, an integral approximation based on the “density of states” is used. Let us call E(s) = −ln p(s) the Shannon surprise of generating sequence s at random, also formally equivalent to an energy in physics according to Boltzmann’s law. The density of states, g(E)dE, counts the number of sequences between E and E + dE. Summation of an arbitrary function F(p(s)) = F(E(s)) over the states (sequences) in S can then be turned into an integral:
$$\sum_{s \in S} F(E(s)) = \int dE \, g(E) \, F(E). \tag{5}$$
A numerical estimation of g(E) is required to compute this integral. Estimating g(E) is done by drawing large Monte Carlo samples of sequences (10^7 for humans and 10^6 for mice) from the generative model and calculating the generation probability of each sequence. Values of E(s) = −ln p(s) can then be histogrammed into bins of size dE and the resulting distribution normalized to integrate to 1. This yields a probability density, P(E) (shown in Fig. 5), which can be used to compute the density of states:
$$g(E) = P(E) \, e^{E}. \tag{6}$$
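The histogram procedure can be sketched on a toy model small enough for the exact sum over sequences to be checked. The sketch assumes the relation g(E) = P(E) e^E, which follows from P(E)dE = g(E) e^(−E) dE because each sequence is drawn with probability e^(−E). The function names, the lognormal toy distribution, and the bin width are illustrative choices.

```python
import numpy as np

def density_of_states(energies, bin_width=0.05):
    """Histogram Monte Carlo energies E = -ln p(s); return bin centers, g(E), bin widths."""
    energies = np.asarray(energies, dtype=float)
    n_bins = max(1, int(np.ceil((energies.max() - energies.min()) / bin_width)))
    # density=True normalizes the histogram to integrate to 1, which gives P(E).
    P, edges = np.histogram(energies, bins=n_bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    g = P * np.exp(centers)  # assumed relation g(E) = P(E) e^E
    return centers, g, np.diff(edges)

def integrate_over_states(F, centers, g, widths):
    """Approximate the sum of F over sequences by the integral of g(E) F(E)."""
    return float(np.sum(F(centers) * g * widths))

# Toy model: 1000 sequences with lognormally distributed probabilities.
rng = np.random.default_rng(0)
weights = rng.lognormal(mean=0.0, sigma=0.5, size=1000)
p = weights / weights.sum()

# Monte Carlo draws from the model and their energies.
draws = rng.choice(p.size, size=200_000, p=p)
E = -np.log(p[draws])
centers, g, widths = density_of_states(E)

# Expected number of unique sequences in a sample of size N, computed two ways.
N = 500
M0_integral = integrate_over_states(
    lambda e: 1.0 - np.exp(-N * np.exp(-e)), centers, g, widths
)
M0_sum = float(np.sum(1.0 - np.exp(-N * p)))
```

In this toy case the two estimates can be compared directly. For realistic repertoires only the integral form is tractable, and sequences too rare to appear in the Monte Carlo sample are not represented in the estimated g(E).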
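Equations 2 and 4 can now be rewritten in terms of integrals:
$$M_0 = \int dE \, g(E) \left( 1 - e^{-N e^{-E}} \right) \tag{7}$$
and
$$G(x, \{N_i\}) = \int dE \, g(E) \prod_{i=1}^{n} \left[ e^{-N_i e^{-E}} + \left( 1 - e^{-N_i e^{-E}} \right) x \right]. \tag{8}$$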
3. Sharing modified by selection
While the above analysis is general, it depends on the state or sequence space S (the collection of productive CDR3βs that pass selection) and on a model that assigns probabilities to each sequence. The preferred model is the probability of generating a sequence, pgen(s); however, this model is defined and normalized over a state space of all possible recombination events, many of which lead to non-functional or negatively selected sequences. As a result, the model p(s) that is used needs to be renormalized to reflect the reduced sequence space of productive, selected sequences. This introduces two factors. The first factor, f, is the fraction of sequences which are functional (in-frame, no stop codons or pseudogenes, conserved residues are present), and can be computed directly from the generative model (f = 0.236 for humans and f = 0.260 for mice). The second factor, q, is the fraction of productive sequences which pass selection and must be inferred (see below). These two factors provide the definition for the model that is used in the analysis:
$$p(s) = \frac{p_{\rm gen}(s)}{f \, q}, \qquad s \in S. \tag{9}$$
The effect of renormalizing pgen(s) to p(s) on the density of states is that the energies are shifted by a constant ln f + ln q and the density is everywhere reduced by a factor of f × q:
$$g(E) = f \, q \; g_{\rm gen}(E - \ln f - \ln q), \tag{10}$$
where ggen(E) is the density of states computed from pgen(s) and g(E) is derived from p(s).
4. Inferring the selection factor q
The selection factor q is inferred by running a least-squares regression on the model predictions for the M0(N) curve (Eq. 7). This curve relates the number M0 of unique amino acid CDR3 sequences observed to the number N of productive, selected recombinations generated. To determine the M0(N) curve from the data, the number of productive selected recombinations must be determined for each sample. Fortunately, due to the limited sequencing depth, the number of unique productive nucleotide reads in each individual sample is very close to the actual number of selected recombinations. In practice, N was taken to be the number of unique nucleotide sequences of each repertoire, summed over a subset of the individuals, and M0 was the number of unique amino-acid sequences resulting from the translation of the aggregated repertoire of the same subset of individuals. The curve was obtained by adding more and more random individuals to the subset, and averaged over 30 realizations of that random addition process (Fig. 4A). A least-squares fit of Eq. 7, combined with Eq. 10, to that empirical curve yielded a value for q of approximately 0.037 for humans and 0.155 for mice.
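The fitting step can be sketched on synthetic data. The sketch assumes a toy sequence space small enough to sum over directly, takes as input probabilities already normalized over productive sequences (the factor f is absorbed into them), and replaces the regression routine with a simple grid search over q. All names and numerical values are illustrative, and the value of q used to generate the synthetic curve is not one of the inferred values.

```python
import numpy as np

def expected_unique(N, q, p_prod):
    """Toy M0(N): expected unique selected sequences among N selected recombinations."""
    N = np.atleast_1d(np.asarray(N, dtype=float))
    # A fraction q of sequences survives selection, and each survivor's
    # probability is rescaled by 1/q so that the selected model stays normalized.
    found = 1.0 - np.exp(-N[:, None] * p_prod[None, :] / q)
    return q * found.sum(axis=1)

def fit_q(N_obs, M0_obs, p_prod, q_grid=None):
    """Grid-search least-squares estimate of q from an observed M0(N) curve."""
    if q_grid is None:
        q_grid = np.logspace(-3, 0, 301)
    sse = [np.sum((expected_unique(N_obs, q, p_prod) - M0_obs) ** 2) for q in q_grid]
    return float(q_grid[int(np.argmin(sse))])

# Synthetic generation probabilities with a broad (lognormal) distribution.
rng = np.random.default_rng(1)
weights = rng.lognormal(mean=0.0, sigma=2.0, size=5000)
p_prod = weights / weights.sum()

# Synthetic "observed" curve generated with a known selection factor.
q_true = 0.05
N_obs = np.logspace(2, 5, 10)
M0_obs = expected_unique(N_obs, q_true, p_prod)

q_hat = fit_q(N_obs, M0_obs, p_prod)
```

Each term q(1 − exp(−N p/q)) increases monotonically with q, so the squared error has a single minimum along the grid and the search recovers a value close to the one used to generate the curve.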
5. Analytic computation of public fraction of a repertoire
In Fig. 4C a sequence s in a repertoire of size N is defined as public if p(s) ≥ 1/N. The fraction of the repertoire comprised of these sequences is computed by evaluating:
where the term in parentheses corresponds to the probability that a given sequence with probability e^(−E) is found in a repertoire of size N.
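One expression consistent with this description is sketched below. It assumes that the fraction is taken over unique sequences, that the public condition p(s) ≥ 1/N translates to E ≤ ln N, and that the normalization is the expected number of unique sequences in the repertoire. These choices are assumptions made for illustration.

```latex
% Assumed form: unique sequences with E <= ln N, relative to all unique sequences.
F_{\rm public}(N) =
  \frac{\displaystyle \int_{0}^{\ln N} dE \; g(E) \left( 1 - e^{-N e^{-E}} \right)}
       {\displaystyle \int_{0}^{\infty} dE \; g(E) \left( 1 - e^{-N e^{-E}} \right)}
```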
Acknowledgements
The work of TM and AMW was supported in part by grant ERCCOG n. 724208. The work of ZS and CC was supported in part by NSF grant PHY-1607612. The work of YE was supported by a fellowship from the V Foundation. The work of CC was performed in part at the Aspen Center for Physics, which is supported by National Science Foundation grant PHY-1607611.
References
- [1]
- [2]
- [3]
- [4]
- [5]
- [6]
- [7]
- [8]
- [9]
- [10]
- [11]
- [12]
- [13]
- [14]
- [15]
- [16]
- [17]
- [18]
- [19]
- [20]
- [21]
- [22]
- [23]
- [24]
- [25]
- [26]
- [27]
- [28]
- [29]
- [30]
- [31]
- [32]
- [33]
- [34]
- [35]
- [36]
- [37]
- [38]
- [39]
- [40]
- [41]
- [42]
- [43]
- [44]
- [45]
- [46]
- [47]
- [48]
- [49]
- [50]
- [51]
- [52]
- [53]
- [54]
- [55]
- [56]
- [57]
- [58]
- [59]
- [60]
- [61]
- [62]
- [63]
- [64]
- [65]
- [66]
- [67]