Quantifying selection in immune receptor repertoires

Yuval Elhanati; Anand Murugan; Curtis G. Callan; Thierry Mora; Aleksandra M. Walczak

doi:10.1101/004341

Abstract

The efficient recognition of pathogens by the adaptive immune system relies on the diversity of receptors displayed at the surface of immune cells. T-cell receptor diversity results from an initial random DNA editing process, called VDJ recombination, followed by functional selection of cells according to the interaction of their surface receptors with self and foreign antigenic peptides. To quantify the effect of selection on the highly variable elements of the receptor, we apply a probabilistic maximum likelihood approach to the analysis of high-throughput sequence data from the β-chain of human T-cell receptors. We quantify selection factors for V and J gene choice, and for the length and amino-acid composition of the variable region. Our approach is necessary to disentangle the effects of selection from biases inherent in the recombination process. Inferred selection factors differ little between donors, or between naive and memory repertoires. The number of sequences shared between donors is well-predicted by the model, indicating a purely stochastic origin of such “public” sequences. We find a significant correlation between biases induced by VDJ recombination and our inferred selection factors, together with a reduction of diversity during selection. Both effects suggest that natural selection acting on the recombination process has anticipated the selection pressures experienced during somatic evolution.

Significance statement

The immune system defends against pathogens via a diverse population of T-cells that display different antigen recognition surface receptor proteins. Receptor diversity is produced by an initial random gene recombination process, followed by selection for a desirable range of peptide binding. Although recombination is well-understood, selection has not been quantitatively characterized. By combining high throughput sequencing data with modeling, we quantify the selection pressure that shapes functional repertoires. Selection is found to vary little between individuals or between the naive and memory repertoires. It reinforces the biases of the recombination process, meaning that sequences more likely to be produced are also more likely to pass selection. The model accounts for “public” sequences shared between individuals as resulting from pure chance.

The T-cell response of the adaptive immune system begins when receptor proteins on the surface of these cells recognize a pathogen peptide presented by an antigen presenting cell. The immune cell repertoire of a given individual is comprised of many clones, each with a distinct surface receptor. This diversity, which is central to the ability of the immune system to defeat pathogens, is initially created by a stochastic process of germline DNA editing (called VDJ recombination) that gives each new immune cell a unique surface receptor gene. This initial repertoire is subsequently modified by selective forces, including thymic selection against excessive (or insufficient) recognition of self proteins, that are also stochastic in nature. Due to this stochasticity and the large T-cell diversity, these repertoires are best described by probability distributions. In this paper we apply a probabilistic approach to sequence data to obtain quantitative measures of the selection pressures that shape T-cell receptor repertoires.

New receptor genes are formed by choosing at random from a set of genomic templates for several sub-regions (V, D and J) of the complete gene. Insertion and deletion of nucleotides in the junctional regions between the V and D and D and J genes greatly enhances diversity beyond pure VDJ combinatorics [1]. The variable region of the gene lies between the last amino acids of the V segment and the beginning of the J segment; it codes for the Complementarity Determining Region 3 (CDR3) loop of the receptor protein, a region known to be functionally important in recognition [2]. Previous studies have shown that immune cell receptors are not uniform in terms of VDJ gene segment usage [3–6], or probability of generation [1], and that certain receptors are more likely than others to be shared by different individuals [4, 7]. In other words, the statistical properties of the immune repertoire are rather complex, and their accurate determination requires sophisticated methods.

Over the past few years, advances in sequencing technology have made it possible to sample the T-cell receptor diversity of individual subjects in great depth [8], and this has in turn led to the development of sequence statistics-based approaches to the study of immune cell diversity [9, 10]. In particular, we recently quantitatively characterized the primary, pre-selection diversity of the human T-cell repertoire by learning the probabilistic rules of VDJ recombination from out-of-frame DNA sequences that cannot be subject to functional selection and whose statistics can reflect only the recombination process [1]. After generation, T-cells undergo a somatic selection process in the thymus [11] and later in the periphery [12]. Cells that pass thymic selection enter the peripheral repertoire as ‘naive’ T-cells, and the subset of naive cells that eventually engage in an immune response will survive as a long-lived ‘memory’ pool. Even though we now understand the statistical properties of the initial repertoire of immune receptors [1], and despite some theoretical studies of thymic selection at the molecular level [13, 14], a quantitative understanding of how selection modifies those statistics to produce the naive and memory repertoires is lacking.

In this paper, we build on our understanding of the primitive pre-selection distribution of T-cell receptors to derive a statistical method for identifying and quantifying selection pressures in the adaptive immune system. We apply this method to naive and memory DNA sequences of human T-cell β chains obtained from peripheral blood samples of nine healthy individuals (Fig. 1). Our goal is to characterize the likelihood that any given sequence, once generated, will survive selection. Our analysis reveals strong and reproducible signatures of selection on specific amino acids in the CDR3 sequence, and on the usage of V and J genes. Most strikingly, we find significant correlation between the primitive generation probability of a sequence and the probability it will pass selection. This suggests that natural selection, which acts on very long time scales to shape the generation mechanism itself, may have tuned it to anticipate somatic selection, which acts on single cells throughout the lifetime of an individual. The quantitative features of selection inferred from our model vary very little between donors, indicating that these features are universal. In addition, our measures of selection pressure on the memory and naive repertoires are statistically indistinguishable, consistent with the hypothesis that the memory pool is a random subsample of the naive pool.

Fig. 1.

A. T-cell receptor β chain sequences are formed during VDJ recombination. Sequences from this probability distribution, described by P_pre, are then selected with a factor Q defined for each sequence, resulting in the observed P_post distribution of receptor sequences. Selection is assumed to act independently on the V and J genes, the length of the CDR3 region and each of the amino acids, a_i, therein. B. A schematic of the fitting procedure: the parameters are set so that P_post fits the marginal frequencies of amino acids at each position, the distribution of CDR3 lengths and VJ gene choices. Since the latter is not known unambiguously from the observed sequences, it is estimated probabilistically using the model itself in an iterative procedure.

I. Methods

We analyzed human CD4+ T-cell β-chain DNA sequence reads (60 or 101 nucleotides long) centered around the CDR3 region. T-cells were obtained from nine individuals and sorted into naive (CD45RO−) and memory (CD45RO+) subsets, yielding datasets of ∼200,000 unique naive and ∼120,000 unique memory sequences per individual, on average. The datasets are the same as those used in [1] and were obtained by previously described methods [15, 16].

In [1] we used the “nonproductive” sequences (those where either the J sequences are out of frame, or the CDR3 sequences have a stop codon) to characterize the receptor generation process. The result of that analysis was an evaluation of the probability that a VDJ recombination event will produce a β-chain gene consistent with the specific DNA sequence read . In this study we focus instead on the in-frame, productive, sequences (from both the naive and the memory repertoires) with the goal of quantifying how the post-selection probability distribution on sequences is modified from the original distribution . In what follows we distinguish between the read and the CDR3 region , the latter defined to run from a conserved cysteine near the end of the V segment to the last amino acid of the read (leaving two amino acids to the conserved Phe). The CDR3 amino acid sequence can be uniquely read off from each in-frame sequence read; by contrast, the specific V- and J-genes responsible for the read may not be uniquely identifiable (because of the relatively short read length). An unambiguous selection effect can be seen by comparing the length distribution of CDR3 regions between the preselection ensemble and the naive, or memory, datasets (Fig. 2A): sequences that are longer or shorter than the mean are suppressed resulting in a more peaked distribution.

Fig. 2.

A. CDR3 length distributions, pre- and post-selection and the length selection factor q_L (green). Selection makes the length-distribution of CDR3 regions in the pre-selection repertoire more peaked for the naive and memory repertoires (overlapping). Error bars show standard variation over 9 individuals. B. Comparison between data and model of the connected pairwise correlation functions, which were not fitted by our model. The excellent agreement validates the inference procedure. As a control, the prediction from the pre-selection model (in gray) does not agree with the data as well. C. Values of the inferred amino-acid selection factors for each amino acid, ordered by length of the CDR3 region (ordinate) and position in the region (abscissa). D. Values of the V J gene selection factors.

For each receptor sequence, we define a selection factor that quantifies whether selection (thymic selection or later selection in the periphery) has enriched or impoverished the frequency of compared to the pre-selection ensemble. Since the generation probability of sequences varies over many orders of magnitude, such a comparison is the only way to define selection strength. Denoting by the distribution of sequences in the selected naive or memory pools, we will set . Due to the large number of possible sequences, we cannot sample the post-selection probability P_post for each sequence directly from the data; we need a reduced complexity model to estimate it. We propose a simple model, summarized in Fig. 1A, that we we will show captures the main features of selection: where V and J denote the choice of V and J segments in the sequence , L is the amino-acid length of the CDR3 specified by the read, (τ₁, …, τ_3L) is CDR3 nucleotide sequence, and (a₁,…,a_L) its amino-acid sequence. The factors q_L, q_i;L(a) and q_{V J} denote, respectively, selective pressures on the CDR3 length, its composition, and the associated VJ identities. Note that the D segment is entirely included in this junctional region, so selection acting on it is encoded in the q_i;L factors. Z enforces the model normalization condition .

It is important to understand why we do not write Q directly as a function of the read . While determines and determines , V and J cannot always be inferred deterministically from the read . The VJ assignment of any given read will have to be treated as probabilistically defined hidden variables. In addition, because of correlations in P_pre, the q factors cannot be identified with marginal enrichment factors (so that, for example, P_i;L,data(a_i)/P_i;L,pre(a_i), cannot be set equal to q_i;L(a_i)). For all these reasons, we must use a maximum likelihood procedure to learn the q_L, q_i;L and q_V,J factors of Eq. 1. We use an expectation maximization algorithm (EM) that iteratively modifies the q′s until the observed marginal frequencies (for CDR3 length, amino acid usage as a function of CDR3 position, and VJ usage) in the data match those implied by the model distribution Eq. 1 (the pre-selection distribution P_pre being taken as a fixed, known, input). The procedure is schematically depicted in Fig. 1B (see the appendices for full details).

One important assumption of the model is that selection factors act independently of each other on the sequence. Consequently, while the model is fit only to single point marginal frequencies, and not to pairwise frequencies. To check the validity of this assumption, we plot the correlation functions of amino acid pairs in the model post-selection repertoire vs the observed naive ones (Fig2B). These pairwise correlations are well predieted, even though they are not model inputs. It is also noteworthy that they are nonzero, even though the selection model does not take into account the possibility of interactions in the selection factors q_i;L. This is because the pre-selection distribution does not factorize over amino acids in the CDR3 region, and has correlations of its own, as shown by the green points of Fig 2B (note that these pre-selection correlations do not agree well with those observed in the post-selection data).

Another assumption of our model is that selection acts at the level of the amino acid sequence, regardless of the underlying codons. To test this, we learned more general models where a represented one of the possible 61 codons, instead of one of the 20 amino acids. We found that codons coding for the same residue had similar selection factors (see Fig. 8), except near the edges of the CDR3 where amino acids may actually come from genomic V and J segments and reflect their codon biases.

To compare the different donors, we learned a distinct model for each donor, as well as a “universal” model for all sequences of a given type from all donors taken together (see the appendices for details). We also learned models from random subsets of the sequence dataset to assess the effects of low-number statistical noise.

II. Results

A. Characteristics of selection and repertoire variability

The length, single-residue, and VJ selection factors, learned from the naive datasets of all donors taken together, are presented in Fig. 2A,C,D. The q_{V J} distribution shows that the different V and J genes are subject to a wide range of selection factors (note that these factors act in addition to the quite varied gene segment usage probabilities in . We looked for correlations between the selection factors q_i;L(a) on amino acids and a variety of amino-acid biochemical properties [17]: hydrophobicity, charge, pH, polarity, volume, and propensities to be found in α or β structures, in turns, at the surface of a binding interface, on the rim or in the core [18] (see the appendices for details and references). We found no significant correlations, save for a negative correlation with amino acid volume and α helix association, as well as a positive correlation with the propensities to be in turns or in the core of an interacting complex (Fig. 13).

To estimate differences between datasets, we calculated the correlation coefficients between the logs of the q_{V J} and q_i;L(a) selection factors (see Fig. 10). Comparing naive vs. naive, memory vs. memory or naive vs. memory between donors (see Fig. 3A-C for an example for q_i;L, and Fig. 9 for q_{V J}) gave correlation coefficients of ≈ 0.9 in log q_i;L, while the naive vs. memory repertoires of the same donor gave 0.95. To get a lower bound on small-number statistical noise, we also compared the factors inferred from artificial datasets obtained by randomly shuffling sequences between donors (see the appendices), yielding an average correlation coefficient of 0.98. Repeating the analysis for log q_{V J}, we found correlation coefficients of ≈ 0.8 between datasets of different donors, 0.84 for the naive and memory dataset of the same donor, all of which must be compared to 0.94 obtained between shuffled datasets. Thus, the observed variability between donors of q_i;L and q_{V J} are small, and consistent with their expected statistical variability.

Fig. 3.

A.-C. Variability between repertoires. The scatter between q_i;L selection factors between two sample individuals A and B for naive (A) and memory repertoires (B) compared to that of memory and naive repertoires for the same individual (C) shows great similarity between them. See also Fig. 11. D. The entropy of the pre-selection repertoire (top) is reduced in the post-selection repertoire (bottom). E.-F. Distribution of V J (E) and D J (F) insertions in the pre-selection and naive repertoires shows elimination of long insertions. Error bars show standard deviations over 9 donors. The insertion distributions for the memory repertoire are the same as for the naive repertoire (see scatter plots in insets).

We use Shannon entropy, S = , to quantify the diversity of the naive and memory distributions. Entropy is a diversity measure that accounts for non-uniformity of the distribution and is additive in independent components. Since S = log₂ Ω when there are Ω equally likely outcomes, the diversity index 2^S can be viewed as an effective number of states. The entropy of the naive repertoire according to the model is 38 bits (corresponding to a diversity of ∼ 3.0 · 10¹¹) down from 43.5 bits in the primitive, pre-selection repertoire (Fig. 3D). This is a reduction of ∼ 6 bits, or 50-fold in diversity. The majority of the reduction comes from insertions and deletions, which accounted for most of the diversity in the pre-selection repertoire. The entropies of the memory and naive repertoires are the same, indicating that selection in the periphery does not further reduce diversity.

Knowing the post-selection distribution of sequences, we can ask how different features of the recombination scenario fare in the face of selection. This does not imply that selection acts on the scenarios themselves —it acts on the final product— but it is an a posteriori assessment of the fitness of particular rearrangements. For example, the distributions of insertions at VD and DJ junctions in the post-selection ensemble have shorter tails (Fig. 3E-F), while the distribution of deletions at the junctions seems little affected by selection (Fig. 11), although large numbers of deletions are selected against.

B. Selection factor as a measure of fitness

The selection factor Q is a proxy for the probability of a particular sequence to be selected or amplified, and sequences with large Q values should thus be enriched in the observed dataset. To test this, we consider the distributions of Q both in the pre-selection model, P_pre(Q), and in the dataset from which Q was learned, P_data(Q) (insets of Fig. 4; see the appendices for details on how the distributions are calculated when V and J are hidden variables). This approach is very similar to the one used by Mustonen et al. [19, 20] to characterize the fitness landscape of transcription factor binding sites.

Fig. 4.

Probability of passing selection. A.-B. Ratio of the distributions of sequence-wide selection factors Q between the observed sequences and the pre-selection ensemble (red line), plotted as a function of Q for naive (A.) and memory (B.) repertoires. The model prediction P_post(Q)/P_pre(Q) = Q is shown in black, and the pre-selection and observed distributions of Q are shown in the insets. The selection ratio saturate around ≈ 7, which may be interpreted as the maximum probability of being selected. Naive and memory repertoires show similar behaviors. C. A cartoon of the effective selection landscape captured by our model (red line). Our method does not capture localized selection pressures (such as avoiding self) specific to each individual, but captures general global properties.

By construction, the distribution of Q in the post-selection model satisfies exactly P_post(Q) = QP_pre(Q). In Fig. 4A-B we plot the ratio P_data(Q)/P_pre(Q) as a function of Q, both for the naive and memory models learned from all donors. We observe that for Q ≤ 5, i.e. for > 90% of sequences, this ratio is exactly equal to Q — a validation of our model prediction at the sequence-wide level. For larger values of Q however, this ratio saturates to around Q_max ≈ 7.

This plateau may be viewed as a limiting value, above which selection is insensitive to Q. A similar plateau was observed in the fitness of transcription factor binding sites below a certain binding energy [20]. In the case considered here, the plateau can be rationalized if we assume that Q is proportional to the probability for a sequence to be selected, . Since P_sel cannot exceed one, Q cannot exceed α⁻¹. The average probability of selection is given by . The observed plateau gives a lower bound to the true maximum of Q: α⁻¹ ≥ Q_max, and thus the average fraction of cells to pass selection satisfies α ≤ 15%. This can be compared to estimates [2] for passing positive and negative thymic selection: 10–30% for positive selection only, and ≈5% for both. This analysis only includes the β chain, and including the α chain could further reduce our estimate.

The saturation also seems to indicate that our model may be too simple to describe the very fit (high Q) sequences. Because of its fairly simple factorized structure, our model can only account for the coarse features of selection, and may not capture very individual-specific traits such as avoidance of the self (corresponding to Q ≪ 1 in localized regions of the sequence space) or response to pathogens (Q ≫ 1 for particular sequences). This individual-dependent ruggedness of the fitness landscape Q, schematized in Fig.4C, is probably ignored by our description, and may be hard to model in general.

To check that the saturation does not affect our inference procedure, we relearned our model parameters from simulated data, where sequences were generated from P_pre and then selected with probability min(Q/Q_max, 1) (see the appendices for details), and we found that the model was correctly recovered (Fig. 12).

C. Natural selection anticipates somatic selection

Comparing the pre- and post-selection length distributions in Fig. 2A shows that the CDR3 lengths that were the most probable to be produced by recombination are also more likely to be selected. Formally, Spearman’s rank correlation coefficient between P_pre(L) and q_L is 0.76, showing good correlation between the probability of a CDR3 length and the corresponding selection factor. We asked whether this correlation was also present in the other sequence features. The histogram of Spearman’s correlation between the selection factors q_i;L(a) and the pre-selection amino-acid usage P_i;L,pre(a) for different lengths and positions (i, L) (Fig. 5A) shows a clear majority of positive correlations. Likewise, the selection factors q_{V J} are positively correlated with the pre-selection VJ usage P_{V J, Pre} (Spearman’s rank correlation 0.3, p < 2 · 10⁻²⁰).

Fig. 5.

Correlations between the pre- and post-selection repertoires. A. A histogram of Spearman’s correlation coefficient values between the q_i;L(a) selection factors in the CDR3 region and their generation probabilities P_i;L,pre(a) for all i, L shows an abundance of positive correlations. B. Heatmap of the joint distribution of the pre-selection probability distribution P_pre and selection factors Q for each sequence shows the two quantities are correlated. C. Sequences in the observed, selected repertoire (green line) had a higher probability to have been generated by recombination than unselected sequences (blue line). Agreement between the post-selection model (red line) and data distribution (green line) is a validation of the model.

The correlations observed for each particular feature of the sequence (CDR3 length, amino acid composition and VJ usage) combine to create a global correlation between the probability that a sequence was generated by recombination, and its propensity to be selected (Spearman’s rank correlation 0.4, p = 0, see Fig. 5B). Consistent with this observation, the post-selection repertoire is enriched in sequences that have a high probability to be produced by recombination (Fig. 5C). This enrichment is well predicted by the model, providing another validation of its predictions at the sequence-wide level.

Taken together, these results suggest that the mechanism of VDJ recombination (including insertions and deletions) has evolved to preferentially produce sequences that are more likely to be selected by thymic or peripheral selection.

D. Shared sequences between individuals

The observation of unique sequences that are shared between different donors has suggested that these sequences make up a “public” repertoire common to many individuals, formed through convergent evolution or a common source. However, it is also possible that these common sequences are just statistically more frequent, and are likely to be randomly recombined in two individuals independently, as previously discussed by Venturi et al. [6, 7, 21]. In other words, public sequences could just be chance events. Here we revisit this question by asking whether the number of observed shared sequences between individuals is consistent with random choice from our inferred sequence distribution P_post.

We estimated the expected number of shared sequences between groups of donors in two ways: (i) by assuming that each donor α had its own “private” model learned from his own sequences or (ii) by assuming that sequences are drawn from a “universal” model learned from all sequences together. While the latter ignores small yet perhaps significant differences between the donors, the former may exaggerate them where statistics are poor. For details on how these estimates are obtained from the models, we refer the reader to the appendices. In Fig. 6A we plot, for each pair of donors, the expected number of shared nucleotide sequences in their naive repertoires under assumptions (i) and (ii), versus the observed number. The number is well predicted under both assumptions, the universal model assumption giving a slight overestimate, and the private model giving a slight underestimate. We repeat the analysis for sequences that are observed to be common to at least three or at least four donors (Fig. 6B-C). The universal model predicts their number better than the private models, although it still slightly overestimates it.

Fig. 6.

Shared sequences between individuals. A. The mean number of shared sequences between any pair of individuals compared to the number expected by chance (model prediction) for one common model for all individuals (red crosses) and private models learned independently for each individual (blue crosses). Error bars are standard deviations from distributions over pairs. The distribution of shared sequences between triplets (B.) and quadruplets (C.) of individuals for the data (black histogram), from common (red line) and private models (blue line). D. The shared sequences are most likely to be generated and selected: comparison of the P_post post-selection distribution for sequences from the preselection (dotted line), and post-selection repertoires (according to the model in gray, and to the data in black), as well as the sequences shared by at least two donors (model prediction in magenta, data in red).

These results suggest that shared sequences are indeed the result of pure chance. If that is so, shared sequences should have a higher occurrence probability than average; specifically, the model predicts that the sequences that are shared between at least two donors are distributed according to (see the appendices). We test this by plotting the distribution of P_post for regular sequences, as well as for pairwise-shared sequences, according to the model and in the naive datasets (Fig. 6D), and find excellent agreement. In general, sequences that are shared between at least n individuals by chance should be distributed according to . For triplets and quadruplets, this model prediction is not as well verified (see Fig. 14). This discrepancy may be explained by the fact that such sequences are outliers with very high occurrence probabilities, and may not be well captured by the model, which was learned on typical sequences.

We repeated these analyses for sequences shared between the memory repertoires of different individuals, with very similar conclusions, except for donors 2 and 3, and donors 2 and 7, who shared many more sequences than expected by chance (see Fig. 15). We conclude that the vast majority of shared sequences occur by chance, and are well predicted by our model of random recombination and selection.

III. Discussion

We have introduced and calculated a selection factor that serves as a measure of selection acting on a given receptor sequence in the somatic evolution of the immune repertoire. Our approach accounts for the fact that the pre-selection probabilities of sequences vary over orders of magnitude.

Using this measure, we show that the observed repertoires have undergone significant selection starting from the initial repertoire produced by VDJ recombination. We find little difference between the naive and memory repertoires, in agreement with recent findings showing no correlation between receptor and T-cell fate [22], as well as between the repertoires of different donors. This is perhaps surprising, because the donors have distinct HLA types (which determine the interaction between T-cell receptors and peptide-MHC complexes), and we could expect their positive and negative selective pressures to be markedly different. Besides, memory sequences have undergone an additional layer of selection compared to the naive ones —recognizing a pathogen— and we could also expect to see different signatures of selection there. A possible interpretation is that our model only captures coarse and universal features of selection related to the general fitness of receptors, and not the fine-grained, individual-specific selective pressures such as avoidance of the self, as illustrated schematically in Fig. 4C. In other words, our selection factors may “smooth out” the complex landscapes of specific repertoires and fail to capture their rough local properties, such as would be expected from specific epitopes that would correspond to very tall peaks or deep valleys in the landscape of selection factors. To really probe these specific deep valleys, we need to develop methods based on accurate sequence counts. Another interesting future direction would be to see whether at this global level the signatures of selection are similar between (relatively) isolated populations. Lastly, comparing data from different species (mice, fish), in particular where inbred individuals with the same HLA type can be compared, would be an interesting avenue for addressing these issues.

Our results suggest that natural selection has refined the generation process over evolutionary time scales to produce a pre-selection repertoire that anticipates the actions of selection. Sequences that are likely to be eliminated and fail selection are not very likely to be produced in the first place. Because of this “rich become richer” effect, the diversity of the repertoire is significantly reduced by selection, by a 50-fold factor in terms of diversity index. This does not mean that only 2% of the sequences pass selection. In fact, our results are consistent with as much as 15% of sequences passing selection. This apparent paradox is resolved by noting that selection, by keeping clones that were likely to be generated, get rids of very rare clones that contributed to the large initial diversity.

Although we did observe sequences that were present in the repertoires of different donors, we showed using our model that their number was broadly compatible with that expected by pure chance. This suggests that the “public” part of the repertoire is made of sequences that are just more likely to be randomly produced and selected.

To summarize, our work clearly shows that thymic selection and later peripheral selection modify the form of the generated repertoire. Our work is a starting point for a description of a mechanism of the two processes.

Acknowledgements

The work of YE, TM and AW was supported in part by grant ERCStG n. 306312. The work of CC was supported in part by NSF grants PHY-0957573 and PHY-1305525 and by W.M. Keck Foundation Award dated 12/15/09.

Appendix A Data

The DNA nucleotide data used in our analysis consists of human CD4+ naive (CD45RO−) or memory (CD45RO+) β chain sequences from 9 healthy individuals, sequenced and made available to us by H. Robins and already used in [1]. Reads are 60 base pair long for 6 donors and 101 base pair long for 3 donors (individuals 2, 3 and 7) and contain the CDR3 region and neighboring V and J gene nucleotides. All end at the same position in the J gene, with four nucleotides between this position and the first nucleotide of the conserved phenylalanine. The data were divided into out-of-frame reads (non-coding), used to learn the pre-selection model as described in [1] and in-frame (coding) reads used in the analysis presented in this paper. The sequence data we used are available at http://princeton.edu/~ccallan/TCRPaper/data/.

In our study we limit ourselves to unique sequences. The experimental procedure and initial assessment of the quality of the reads were done in the Robins lab following the procedures described in [15, 23]. Each sequence was read multiple times, allowing for the correction of most sequencing errors. The numbers of unique sequences used in each dataset is shown in Table SI.

View this table:

Table I:

Number of unique coding sequences in each datasets.

The alignment to all possible V and J genes was done using the curated datasets in the IMGT database [24]. There are 48 V genes, 2 D genes and 13 J genes plus a number of pseudo V genes that cannot lead to a functioning receptor due to stop codons. We discarded sequences that were associated to a pseudo-gene as our model only accounts for coding genes. The germline sequences of the genes used in our analysis are the same as were used in [1] to analyze the generative V(D)J recombination process. The complete list of gene sequences can be found at http://princeton.edu/~ccallan/TCRPaper/genes/.

Appendix B Pre-Selection Model

The pre-selection, or generative model, assumes the following structure for the probability distribution of recombination scenarios S [1]: where a scenario is given by the VDJ choice, the number of insertions insVD, insDJ and the number of deletions (delV,dellD), (delrD,delJ) at each of the two junctions, together with the identities (s₁,…, s_insVD),(t₁,…, t_insDJ) of the inserted nucleotides. It is worth noting that the insertions are assumed to be independent of the identities of the genes between which insertions are made. By contrast, the deletion probabilities are allowed to depend on the identity of the gene being deleted. The validity of these assumptions is verified a posteriori.

Appendix C Model Fitting

1. Maximum likelihood formulation

The model probability to observe a given coding nucleotide sequence is: where is the nucleotide sequence of the CDR3 (defined as running from the conserved cysteine in the V segment up to the last amino acid in the read, leaving two amino acids between the last read amino acid and the conserved phenylalanine in the J segment), L is the length of the CDR3, and V and J index the choice of the germline V and J segments (which completely determine the sequence outside the CDR3 region). The D segment is entirely absorved into , and is not explicitly tracked in assessing selection. The selection factor Q is assumed to take the following factorized form: where is the amino-acid sequence of the CDR3, and Z is a normalization constant that enforces

The probability, , of generating a specific sequence in a V(D)J recombination event can be obtained from the noncoding sequence reads by the methods explained in [1]. Specifically, the pre-selection model gives the probability P_pre(S) of a recombination scenario S = (V, D, J, insVD, insDJ, delV,…) as given by Eq. B1. A scenario S completely determines the sequence , but the converse is not true. The pre-selection probability for a coding sequence is thus given by where we sum over scenarios resulting in a particular CDR3 sequence and a particular V, J pair. The normalization factor p_coding ≈ 0.26 corrects for the fact that a randomly generated sequence is not always productive (i.e. in-frame and with no stop codon). From this point on, we regard the initial generation probability of any specific read as known. When we make statements about the pre-selection distribution of CDR3 properties, such as length or amino acid utilization, they are derived from synthetic repertoires drawn from the above pre-selection distribution.

We want to infer the parameters q_L, q_V,J and q_i;L(·) of the model from the observed coding sequence repertoires. Formally we want to maximize the likelihood of the data given the model. Unfortunately the sequence reads from the data are not long enough to fully specify the V and J segments, so we cannot use P_post (, V, J) as our raw likelihood. Instead, we need to write the probability of observing a given (truncated) read , of length 60 or 101 nucleotides, depending on the donor: where we note again that fully specifies , while fully specifies , but not V and J. Given a dataset of N sequences, (see Fig. 7 for notations), the likelihood reads:

Our goal is maximize ℒ with respect to the parameters q_L, q_{V, J}, and q_i;L(·) (globally refered to as Q).

Fig. 7.

Summary of the notations used in this paper for the sequences. The CDR3 region is defined from the conserved cysteine around the end of the V segment to the last amino-acid in the read, leaving two amino acids to the conserved phenylalanine in the J segment. The nucleotides in the read are defined as σ_i, the nucleotides in the CDR3 region as τ_i and the amino acids in the CDR3 region as a_i. The data sequences therefore can be defined in terms of , or their V, J genes and . The generated sequences, with known V and J genes, are defined in terms of for the whole sequence or for only the CDR3.

2. Expectation maximization

Calculating is computationally intensive. Given the form of the model, it seems more natural to work with , but this likelihood involves the “hidden” variables V and J. To circumvent this problem, we use the expectation maximization algorithm [25, 26]. This algorithm uses an iterative two-step process, with two sets of model parameters Q and Q′. The log-likelihood of the data is calculated using the set of parameters Q′; in the “Expectation” step, this log-likelihood is averaged over the hidden variables with their posterior probabilities, which are calculated using the second set of parameters Q. In the “Maximization” step, this average log-likelihood is maximized over the first set Q′, while keeping the second set Q fixed. Then Q is updated to the optimal value of Q′, and the two steps are repeated iteratively until convergence.

In practice, starting with a test set of parameters Q, we calculate, for each sequence of the data, the posterior probability of a (V, J) pair:

The log-likelihood, expressed in terms of the hidden variables V and J, is maximized after averaging over V and J using that posterior. Specifically we will maximize:

Here we have added the Q dependencies explicitly because there are two different parameter sets Q and Q′. The maximization is performed over Q′, which parametrizes the log-likelihood itself, while keeping Q, which parametrizes how the average is done over the hidden variables, constant. After each maximization step we substitute: and iterate until convergence. This procedure is guaranteed to find a local maximum of the likelihood ℒ(Q).

3. Equivalence with fitting marginal probabilities

The expectation-maximization step can be simplified by noting that at the maximum, derivatives vanish:

Precisely, we take derivatives with each of the parameters, q_L, q_{V J} etc. and set them to zero. Since is naturally factorized in the Q parameters, we obtain simple expressions, e.g. gives: where δ_a,b is Kronecker’s delta function. The term in the sum gives the total number of sequences in the data with length L. Besides we have:

Hence the maximality condition simply becomes:

i.e. that the length distribution of the model must be equal to that of the data. Similarly, maximizing with respect to q_i;L(a_i) entails that single amino-acid frequencies at a given position are matched between data and model:

The condition for q_{V J} is slightly different, because we do not directly have the frequencies of V and J in the data. This is replaced by their expected frequency under the posterior taken with parameters Q: where again the left-hand side is the empirical distribution of V and J (indirectly estimated with the help of the model with parameters Q), and the right-hand side is the model distribution of the same quantities (estimated with parameters Q′, which are then varied to achieve equality with the data estimate). The approach of iteratively adjusting model parameters to match a corresponding set of data marginals is a conceptually clear and computationally effective implementation of the expectation maximization algorithm.

4. Gauge

As defined above, the model is degenerate: for each i, L, the factors q_i;L(a) and Z may be multiplied by a common constant without affecting the model. We need to fix a convention, or gauge, to lift this degeneracy. We impose that, for each i, L: where P_i;L,pre(a) is the probability of having amino-acid a at position i in CDR3s of length L.

5. Numerical implementation

To solve the fitting equations (C13)-(C15) in practice, we use a gradient descent algorithm: and similarly for q_i;L and q_{V J}. To do this, we must be able to calculate the marginals P_post(L; Q′), P_i,L,post (a_i; Q′) and P_post(V, J; Q′) from the model at each step.

This leaves us with the problem of estimating marginals in the model, which we do using importance sampling. Although it is easy to sample sequences from P_pre by picking a random recombination scenario, sampling from P_post = QP_pre is much harder, as the q_i;L, q_L and q_{V J} factors introduce complex dependencies between the different features of the recombination scenario. To overcome this issue, we sample a large number M of triplets from , and, when estimating P_post expectation values, weight the contribution of each sequence with its value (this is a particularly simple instance of importance sampling). The generated triplets are denoted by , and the corresponding reads by (see Fig. 7 for notations). The marginal probability distribution of lengths, for instance, is estimated by and similar expressions give estimates of P_i;L,post(a_i; Q′) and P_post(V, J; Q′). Since we are optimizing over Q′, the sequences can be generated once and for all at the beginning of the algorithm. Then the marginal probabilities are updated according to the modified Q′ using Eq. C18. Finally, the normalization constant is evaluated by calculating: so that

Fig. 8.

The q_i;L(a) selection factors learned for codons (red crosses) agree with those learned for amino acids (blue). The q_i;L(a) are plotted for each position in the CDR3 region (panels from 1 to 12) for naive CDR3 sequences of length 12, as a function of the amino acids at each position. A given amino acid at a given position can come from different codons, which are marked by multiple crosses at that position. Codons or amino acids for which there was not enough data to infer the selection factors are not represented.

Fig. 9.

The scatter of VJ gene selection factors q_{V J} between donors A and B for naive (A) and memory repertoires (B), as well as between the memory and naive repertoires of the same individual (C) shows that the memory and naive repertoires are statistically similar to each other and across individuals. See Fig. 10 for the correlation analysis of all individuals and cell types.

Fig. 10.

Correlation coefficients between selection factors obtained for models learned for different donors and cell type (naive and memory). The compared factors are the amino-acid selection factors q_i;L (A) and the VJ gene selection factors q_{V J} (B). Each position along the two axes in each plot corresponds to a different individual. The naive dataset of donor 8, and the memory dataset of donor 5 were removed because of too low statistics. In all heat maps, the x and y axes correspond to different donors (1-7;9 for naive, 1-4;6-9 for memory, and 1,2,3,4,6,7,9 for comparison between naive and memory).

6. Equivalence with minimum discriminatory information

The principle of minimum discriminatory information is to look for a distribution that reproduces exactly some mean observables of the data, such as position-dependent amino-acid frequencies, while being minimally biased with respect to some background distribution. When the background distribution is uniform, this principle is equivalent to the principle of maximum entropy.

Taking P_pre as our background distribution, assume we are looking for the distribution P_post that satisfies Eqs. (C13)-(C15) while minimizing the divergence or relative entropy with respect to P_pre, defined as:

Solving this problem is mathematically equivalent to solving the maximum likelihood problem described above.

Fig. 11.

The effects of selection on deletion profiles. Distribution of V (A), D left-hand side (B), D right-hand side (C), and J (D) deletions in the pre-selected (black lin e), naive (colored line) and memory (gray dashed line) repertoires. Error bars show standard deviation over 9 individuals. Results using 9 separate models learned for each of the individuals. The deletion distributions for the memory repertoire are the same as for the naive repertoire. Selection has a slight effect on favoring distributions with non-extreme deletion values of deletions for V and J deletions, and does not have a significant effect on D deletions.

Appendix D Individual, Universal And Shuffled Donors

We partition the data in three different ways to learn the model. First, we learn a distinct model for each donor, and for each of the naive and memory pools. For each donor, we have a distinct P_pre learned from the out-of-frame sequences of that donor (although in fact they differ little from donor to donor as discussed in [1]). Second, we pool all the sequences of a given type (naive or memory) from all nine donors together, and learn a “universal” or average model. For this we use a mean P_pre averaged over all nine donors, and then learn Q using all sequences. Third, to assess the effect of finite-size sampling in the universal model, we partition the data from all donors into nine random subsamples of equal sizes. This way we can estimate how much variability one should expect from just sampling noise.

Appendix E Entropy, Distributions of P_pre, P_post and Q

To estimate global statistics, such as entropy, from the model, we draw a large set of sequences from P_pre, and weight them according to the inferred (normalized) Q values. Specifically, for each generated sequence, we estimate its primitive generation probability by summing over all the possible scenarios that could have given rise to it: where is the full nucleotide sequence, including the CDR3 as well as the V_b and J_b segments. The entropy (in bits) of the selected sequence repertoire is defined as and, to include selection effects, we estimate it by

The distributions of P_pre, P_post and Q over the selected sequences are determined from the same draw of M sequences from P_pre, weighted by the normalized selection factors Q. For example the distribution of log P_pre is:

Marginal distributions over pairs of amino-acids (a_i, a_j) at two positions i and j can also be calculated using the sequences and weighting them with Q. This can be generalized to arbitrary marginals or statistics.

Appendix F Shared Sequences

The number of shared sequences in a subset of donors is counted based on the nucleotide sequences. This empirical number can then be compared to two kinds of theoretical predictions. Either by assuming that the sequences of each donor were generated and selected by a “private” model , where α denotes the donor, i.e. a model inferred from the sequences of donor α; or by assuming that sequences were generated and selected by a “common” or universal model inferred from all sequences together. The latter is justified by the fact that differences between private models are small, and could reflect spurious noise that would exaggerate differences between individuals.

Fig. 12.

The saturation of the P_data (Q)/P_pre(Q) ratio does not affect the inference of the model. We simulated a dataset from P_pre and selected sequences with probability . The plot compares the q_i;L(a) selection factors directly inferred from data (ordinate) to values inferred from such simulated data (blue dots: simulation). The scatter in these points is compared to the scatter obtained from learning the selection factors using a random subset of the data (red dots: sample). The size of the points denotes the probability P_i;l,data(a) in the data repertoire.

If we assume private models, the expected number of shared sequences between donors α and β is: where N_α and N_β are the numbers of sequences in each donor dataset. To estimate that number, we collect sequences that are shared between the generated datasets of two (or more) donors, and reweight them by Q: where M_α and M_β are the number of generated sequences for each donor model, and where the sum is over the sequences found in the dataset of both donors. Similar equations are used for comparing more than two donors.

Fig. 13.

Correlation of the q_i;L selection factors with several biochemical properties. Each panel shows the histogram, over all positions and lengths, of Spearman’s correlation coefficient between the q_i;L (a) values for a given amino acid and the biochemical properties of that amino acid. The following biochemical properties are considered (from left to right, top to bottom): preference to appear in alpha helices (A), beta sheets (B), turns (C) (source for (A-C): Table 3.3 [17]). Residues that are exposed to solvent in protein-protein complexes (following definitions and data from [18]) are divided intothree groups: surface (interface) residues that have unchanged accessibility area when the interaction partner is present (D), rim (interface) residues that have changed accessibility area, but no atoms with zero accessibility in the complex (E) and core (interface) residues that have changed accessibility area and at least one atom with zero accessibility in the complex (F). Rim residues roughly correspond to the periphery of the interface region, and core residues correspond to the center. Finally we plot the basic biochemical amino acid properties (source: http://en.wikipedia.org/wiki/Amino_acid and http://en.wikipedia.org/wiki/Proteinogenic_amino_acid): charge (G), pH (H), polarity (I), hydrophobicity (J) and volume (K). For all properties the actual numerical values used to calculate the correlations are listed in the inset tables. We see a positive correlation trend with turns and core residues and a negative correlation trend with the preference of amino acids to appear in alpha helices and volume.

If we assume a common model, the expected number of shared sequences reads:

This can be estimated by: where are sequences generated from the mean VDJ recombination model . Similarly, the number of shared sequences between a triplet of donors α, β, γ is: and likewise for quadruplets and more.

The expected numbers of shared sequences calculated above are averages. Their distribution is given by a Poisson distribution of the same mean. We use these Poisson distribution to estimate the error bars in Fig. 6A and 15A, as well as the distributions in Fig. 6B-C and S15B-C.

If we assume a common model, sequences that are shared between at least n individuals are distributed according to . To explore the statistics of these sequences, we take our sequences generated from and weigh them with . For example, to estimate the distribution of log P_post in shared sequences as in Fig. 6D (for pairs), and Fig. 14 (for triplets and quadruplets), we calculate:

Sampling from shared sequences is equivalent to sampling from the high-probability, large deviation regime of the distribution. This statement can be made more physically intuitive by rewriting P_post as a Boltzmann distribution e^−E/T with T = 1 and E = −log P_post. Considering sequences observed in at least n donors, is equivalent to sampling from (1/Z(n))e^−nE (where Z(n) is a normalisation constant), i.e. the Boltzmann distribution with T = 1/n. Sequences shared between more and more individuals correspond to lower and lower temperatures, and thus lower energies and higher probabilities. In the low temperature regime, the roughness of the landscape depicted in Fig. 4C is starting to become important, and may not be well captured by our model, as suggested by Fig. 14.

Fig. 14.

Model prediction (magenta) and observed (red) distributions of P_post in the naive sequences that are shared between at least three (left) or four (right) donors. The model discrepancy may be attributed to its failure to capture the very highly probable sequences.

Fig. 15.

Comparison between data and model for the number of shared sequences in the memory repertoires, in pairs (A), triplets (B) and quadruplets (C) of individuals.

Appendix G Codon Model

It is reasonable to assume that selection acts on the protein structure, at the amino acid level. But each amino acid can be obtained using a number of different codons, which could in principle each have a different selection factor. We checked the robustness of our selection coefficients by learning an alternative model in which selection acts on codons. We present the results of this alternative codon model in Fig. 8 on the example of CDR3 sequences of length 12. We show the q_i;L(a) selection factors at each position for each amino acid, and compare them to the selection factors obtained for the codons coding for that amino acid. We see that, especially in the bulk of the CDR3 sequence, selection at the level of codons or amino acids are equivalent, proving the generality of our approach.

Appendix H Additional Effects of Selection on Repertoire Properties

In the main text we present several repertoire properties, such as insertion profiles and comparisons of the q_i;L(a) selection factors between naive and memory repertoires. In Fig. 11 we plot the deletion profiles for V, J and D-lefthand side and D-righthand side deletions, comparing the distributions for the pre-selection, naive and memory repertoires. We note that the deletion profiles for the V and J distributions are more peaked, favoring intermediate deletion values. However the D distributions are little affected by selection. Similarly to the case of insertion distributions shown in in Fig. 3E-F, the naive and memory distributions appear indistinguishable within the error bars.

In Fig. 3A-C, the selection factors q_i;L(a) acting on amino acids are compared between individuals and cell type. Similarly, the selection factors acting on the genes q_{V J} are statistically indistinguishable between the memory and naive repertoires for one individual, compared to the variability between the naive (or memory) repertoires taken from two sample individuals (see Fig. 9).

To compare the repertoires of individuals as well as the naive and memory repertoires with each other, we consider the correlation coefficients between the selection factors log q_i;L, and between the VJ gene selection factor log q_{V J}, of different individuals (Fig. 10). Correlations between memory and naive repertoires are similar to those between naive-naive or memory-memory repertoires for different individuals; all are a bit smaller than the correlations between the artificial, shuffled sequence datasets, where the discrepancy is entirely attributable to statistical noise. These observations lead us to the conclusion that at this level of description, the selection processes that shape the memory and naive repertoires are very similar with each other and between different individuals.

Appendix I Effects Of Saturation Of Global Selection Factors On The Inference Procedure

We consider distributions of the selection factor Q in the pre-selection ensemble P_pre(Q), in the post-selection ensemble according to the model P_post(Q), and in the actual data sequences P_data(Q). These three distributions are formally defined as:

As can be seen in Fig. 4, the ratio of the distribution of global selection factors P_data(Q)/P_pre(Q) saturates for large values of Q. To make sure that this saturation does not impair our ability to correctly infer the selection factors, we simulated a dataset from P_pre and selected sequences with probability to mimic the effects of this plateau. We then inferred the selection coefficients for this artificial dataset. We see that the saturation does not affect our ability to correctly infer the selection coefficients (Fig. 12) and the variability in the inferred q_i;L(a) selection factors is of the same order as from using random subsamples of the original data.

Appendix J Biochemical Correlations

To check for correlations of our inferred q_i;L(a) selection factors with known biochemical properties, we calculated Spearman’s coefficient between the selection factors and a number of standard quantities (see Fig. 13 for the full list). We find that the selection factors do not correlate well with most standard properties, such as charge, hydrophobicity and polarity. However we do find a trend of positive correlation with amino acids that are likely to appear in turns (Fig. 13 C) and ones that have been identified as those that make the core of the interface in a protein-protein complexes (Fig. 13 F) [18]. We find a trend of negative correlations with amino acids that have large volume (Fig. 13 K) and are likely to appear in alpha helices (Fig. 13 A). These observations are consistent with the fact that structurally CDR3 regions form loops and bulky amino acids as well as stabilizing alpha helix-like interactions would interfere with this structure. Core amino acids are at the center of the interface and are known to be the main contributors to interface recognition and affinity. On the other hand interface rim and non-interface (surface) residues, which are both in touch to various degrees with the solvent and are not crucial interface forming elements, show similar non-distinctive correlation patterns.

References

1.↵
Murugan A, Mora T, Walczak AM, Callan CG (2012) Statistical inference of the generation probability of t-cell receptors from sequence repertoires. Proceedings of the National Academy of Sciences 109:16161–16166.
OpenUrl Abstract/FREE Full Text
2.↵
Janeway C (2005) Immunobiology, the immune system in health and disease (Garland Pub).
3.↵
Weinstein JA, Jiang N, White RA, Fisher DS, Quake SR (2009) High-throughput sequencing of the zebrafish antibody repertoire. Science 324:807–10.
OpenUrl Abstract/FREE Full Text
4.↵
Ndifon W, et al. (2012) Chromatin conformation governs t-cell receptor j gene segment usage. Proc Natl Acad Sci USA 109:15865–70.
OpenUrl Abstract/FREE Full Text
5.
Mora T, Walczak AM, Bialek W, Callan CG (2010) Maximum entropy models for antibody diversity. Proceedings of the National Academy of Sciences of the United States of America 107:5405–5410.
OpenUrl Abstract/FREE Full Text
6.↵
Quigley MF, et al. (2010) Convergent recombination shapes the clonotypic landscape of the naive t-cell repertoire. Proc Natl Acad Sci USA 107:19414–9.
OpenUrl Abstract/FREE Full Text
7.↵
Venturi V, Price DA, Douek DC, Davenport MP (2008) The molecular basis for public t-cell responses? Nat Rev Immunol 8:231–8.
OpenUrl CrossRef PubMed Web of Science
8.↵
Baum PD, Venturi V, Price DA (2012) Wrestling with the repertoire: The promise and perils of next generation sequencing for antigen receptors. Eur. J. Immunol. 42:2834–2839.
OpenUrl CrossRef PubMed
9.↵
Six A, et al. (2013) The past, present, and future of immune repertoire biology – the rise of next-generation repertoire analysis. Front. Immunol. 4:1–16.
OpenUrl CrossRef PubMed
10.↵
Robins H (2013) Immunosequencing: applications of immune repertoire deep sequencing. Current Opinion in Immunology 25:646–52.
OpenUrl CrossRef PubMed
11.↵
Yates AJ (2014) Theories and quantification of thymic selection. Front. Immunol. 5:13.
OpenUrl PubMed
12.↵
Jameson SC (2002) Maintaining the norm: T-cell homeostasis. Nature Reviews Immunology 2:547–56.
OpenUrl CrossRef PubMed Web of Science
13.↵
Detours V, Mehr R, Perelson AS (1999) A quantitative theory of affinity-driven t cell repertoire selection. J Theor Biol 200:389–403.
OpenUrl CrossRef PubMed Web of Science
14.↵
Kosmrlj A, Jha AK, Huseby ES, Kardar M, Chakraborty AK (2008) How the thymus designs antigen-specific and self-tolerant t cell receptor sequences. Proc Natl Acad Sci USA 105:16671–6.
OpenUrl Abstract/FREE Full Text
15.↵
Robins HS, et al. (2009) Comprehensive assessment of T-cell receptor beta-chain diversity in alphabeta T cells. Blood 114:4099–4107.
OpenUrl Abstract/FREE Full Text
16.↵
Robins H, et al. (2010) Overlap and effective size of the human CD8+ T cell receptor repertoire. Science translational medicine 2:47ra64.
OpenUrl Abstract/FREE Full Text
17.↵
Stryer L, Berg JM, Tymoczko JL (2002) Biochemistry, 5th edition (W.H. Freeman & Co Ltd) Vol. 5th edition.
18.↵
Martin J, Lavery R (2012) Arbitrary protein-protein docking targets biologically relevant interfaces. BMC Biophysics 1:7.
OpenUrl CrossRef
19.↵
Mustonen V, Lässig M (2005) Evolutionary population genetics of promoters: predicting binding sites and functional phylogenies. Proc Natl Acad Sci USA 102:15936–41.
OpenUrl Abstract/FREE Full Text
20.↵
Mustonen V, Kinney J, Callan CG, Lässig M (2008) Energy-dependent fitness: a quantitative model for the evolution of yeast transcription factor binding sites. Proc Natl Acad Sci USA 105:12376–81.
OpenUrl Abstract/FREE Full Text
21.↵
Venturi V, et al. (2006) Sharing of t cell receptors in antigen-specific responses is driven by convergent recombination. Proc Natl Acad Sci USA 103:18691–6.
OpenUrl Abstract/FREE Full Text
22.↵
Wang C, et al. (2010) High throughput sequencing reveals a complex pattern of dynamic interrelationships among human t cell subsets. Proc Natl Acad Sci USA 107:1518–23.
OpenUrl Abstract/FREE Full Text
23.↵
Robins HS, et al. (2010) Overlap and effective size of the human CD8+ T cell receptor repertoire. Science translational medicine 2:47ra64.
OpenUrl Abstract/FREE Full Text
24.↵
Monod MY, Giudicelli V, Chaume D, Lefranc MP (2004) IMGT/JunctionAnalysis: the first tool for the analysis of the immunoglobulin and T cell receptor complex V-J and V-D-J JUNCTIONs. Bioinformatics 20:i379–i385.
OpenUrl CrossRef PubMed
25.↵
Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39:1–38.
OpenUrl CrossRef Web of Science
26.↵
McLachlan GJ, Krishnan T (2008) The EM Algorithm and Extensions (Wiley Series in Probability and Statistics) (Wiley-Interscience), 2 edition.

View the discussion thread.

Posted April 21, 2014.

Download PDF

Citation Tools

Subject Area

Immunology

Subject Areas

All Articles

Animal Behavior and Cognition (5215)
Biochemistry (11752)
Bioengineering (8752)
Bioinformatics (29200)
Biophysics (14974)
Cancer Biology (12096)
Cell Biology (17411)
Clinical Trials (138)
Developmental Biology (9421)
Ecology (14182)
Epidemiology (2067)
Evolutionary Biology (18308)
Genetics (12245)
Genomics (16803)
Immunology (11869)
Microbiology (28097)
Molecular Biology (11594)
Neuroscience (60969)
Paleontology (451)
Pathology (1871)
Pharmacology and Toxicology (3238)
Physiology (4959)
Plant Biology (10427)
Scientific Communication and Education (1683)
Synthetic Biology (2886)
Systems Biology (7340)
Zoology (1651)

[1] 1.↵
Murugan A, Mora T, Walczak AM, Callan CG (2012) Statistical inference of the generation probability of t-cell receptors from sequence repertoires. Proceedings of the National Academy of Sciences 109:16161–16166.
OpenUrl Abstract/FREE Full Text

[2] 2.↵
Janeway C (2005) Immunobiology, the immune system in health and disease (Garland Pub).

[3] 3.↵
Weinstein JA, Jiang N, White RA, Fisher DS, Quake SR (2009) High-throughput sequencing of the zebrafish antibody repertoire. Science 324:807–10.
OpenUrl Abstract/FREE Full Text

[4] 4.↵
Ndifon W, et al. (2012) Chromatin conformation governs t-cell receptor j gene segment usage. Proc Natl Acad Sci USA 109:15865–70.
OpenUrl Abstract/FREE Full Text

[5] 5.
Mora T, Walczak AM, Bialek W, Callan CG (2010) Maximum entropy models for antibody diversity. Proceedings of the National Academy of Sciences of the United States of America 107:5405–5410.
OpenUrl Abstract/FREE Full Text

[6] 6.↵
Quigley MF, et al. (2010) Convergent recombination shapes the clonotypic landscape of the naive t-cell repertoire. Proc Natl Acad Sci USA 107:19414–9.
OpenUrl Abstract/FREE Full Text

[7] 7.↵
Venturi V, Price DA, Douek DC, Davenport MP (2008) The molecular basis for public t-cell responses? Nat Rev Immunol 8:231–8.
OpenUrl CrossRef PubMed Web of Science

[8] 8.↵
Baum PD, Venturi V, Price DA (2012) Wrestling with the repertoire: The promise and perils of next generation sequencing for antigen receptors. Eur. J. Immunol. 42:2834–2839.
OpenUrl CrossRef PubMed

[9] 9.↵
Six A, et al. (2013) The past, present, and future of immune repertoire biology – the rise of next-generation repertoire analysis. Front. Immunol. 4:1–16.
OpenUrl CrossRef PubMed

[10] 10.↵
Robins H (2013) Immunosequencing: applications of immune repertoire deep sequencing. Current Opinion in Immunology 25:646–52.
OpenUrl CrossRef PubMed

[11] 11.↵
Yates AJ (2014) Theories and quantification of thymic selection. Front. Immunol. 5:13.
OpenUrl PubMed

[12] 12.↵
Jameson SC (2002) Maintaining the norm: T-cell homeostasis. Nature Reviews Immunology 2:547–56.
OpenUrl CrossRef PubMed Web of Science

[13] 13.↵
Detours V, Mehr R, Perelson AS (1999) A quantitative theory of affinity-driven t cell repertoire selection. J Theor Biol 200:389–403.
OpenUrl CrossRef PubMed Web of Science

[14] 14.↵
Kosmrlj A, Jha AK, Huseby ES, Kardar M, Chakraborty AK (2008) How the thymus designs antigen-specific and self-tolerant t cell receptor sequences. Proc Natl Acad Sci USA 105:16671–6.
OpenUrl Abstract/FREE Full Text

[15] 15.↵
Robins HS, et al. (2009) Comprehensive assessment of T-cell receptor beta-chain diversity in alphabeta T cells. Blood 114:4099–4107.
OpenUrl Abstract/FREE Full Text

[16] 16.↵
Robins H, et al. (2010) Overlap and effective size of the human CD8+ T cell receptor repertoire. Science translational medicine 2:47ra64.
OpenUrl Abstract/FREE Full Text

[17] 17.↵
Stryer L, Berg JM, Tymoczko JL (2002) Biochemistry, 5th edition (W.H. Freeman & Co Ltd) Vol. 5th edition.

[18] 18.↵
Martin J, Lavery R (2012) Arbitrary protein-protein docking targets biologically relevant interfaces. BMC Biophysics 1:7.
OpenUrl CrossRef

[19] 19.↵
Mustonen V, Lässig M (2005) Evolutionary population genetics of promoters: predicting binding sites and functional phylogenies. Proc Natl Acad Sci USA 102:15936–41.
OpenUrl Abstract/FREE Full Text

[20] 20.↵
Mustonen V, Kinney J, Callan CG, Lässig M (2008) Energy-dependent fitness: a quantitative model for the evolution of yeast transcription factor binding sites. Proc Natl Acad Sci USA 105:12376–81.
OpenUrl Abstract/FREE Full Text

[21] 21.↵
Venturi V, et al. (2006) Sharing of t cell receptors in antigen-specific responses is driven by convergent recombination. Proc Natl Acad Sci USA 103:18691–6.
OpenUrl Abstract/FREE Full Text

[22] 22.↵
Wang C, et al. (2010) High throughput sequencing reveals a complex pattern of dynamic interrelationships among human t cell subsets. Proc Natl Acad Sci USA 107:1518–23.
OpenUrl Abstract/FREE Full Text

[23] 23.↵
Robins HS, et al. (2010) Overlap and effective size of the human CD8+ T cell receptor repertoire. Science translational medicine 2:47ra64.
OpenUrl Abstract/FREE Full Text

[24] 24.↵
Monod MY, Giudicelli V, Chaume D, Lefranc MP (2004) IMGT/JunctionAnalysis: the first tool for the analysis of the immunoglobulin and T cell receptor complex V-J and V-D-J JUNCTIONs. Bioinformatics 20:i379–i385.
OpenUrl CrossRef PubMed

[25] 25.↵
Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39:1–38.
OpenUrl CrossRef Web of Science

[26] 26.↵
McLachlan GJ, Krishnan T (2008) The EM Algorithm and Extensions (Wiley Series in Probability and Statistics) (Wiley-Interscience), 2 edition.