Abstract
High throughput immune repertoire sequencing is promising to lead to new statistical diagnostic tools for medicine and biology. Successful implementations of these methods require a correct characterization, analysis and interpretation of these datasets. We present IGoR - a new comprehensive tool that takes B or T-cell receptors sequence reads and quantitatively characterizes the statistics of receptor generation from both cDNA and gDNA. It probabilistically annotates sequences and its modular structure can investigate models of increasing biological complexity for different organisms. For B-cells IGoR returns the hypermutation statistics, which we use to reveal co-localization of hypermutations along the sequence. We demonstrate that IGoR outperforms existing tools in accuracy and estimate the sample sizes needed for reliable repertoire characterization.
The adaptive immune system recognizes pathogens by binding their antigens to specific surface receptors expressed on T and B cells. The recent advent of high throughput immune repertoire sequencing (RepSeq) [1–4] gives us direct insight into the diversity of B-cell and T-cell receptor (BCR and TCR) repertoires with great potential to change the way we diagnose, treat and prevent immune system related disorders. A growing number of algorithms and software tools have been designed to address the new challenges of RepSeq, in particular sequence analysis, germline assignment and clone construction [5–10]. However, each receptor sequence can be generated in a large number of ways, or “scenarios,” through recombination of genomic segments, insertions and deletions and hypermutations. Standard assignments introduce systematic errors when describing this inherently stochastic process. Quantitatively characterizing the diversity and the biases of these mechanisms remains a challenge for understanding adaptive immunity and applying RepSeq for diagnostics.
We present a flexible computational method and software tool, IGoR (Inference and Generation of Repertoires), that processes raw immune sequence reads from any source (cDNA or gDNA) and learns unbiased statistics of V(D)J recombination and somatic hypermutations. Using these statistics, for each sequence IGoR outputs a whole list of potential recombination and hypermutation scenarios, with their corresponding likelihoods. IGoR’s performance at identifying the correct scenario is 2.5 times better than current state-of-the-art methods. IGoR used as a sequence generator produces an arbitrary number of randomly rearranged sequences with the same statistics as in the dataset. Applied to BCRs, IGoR learns a context-dependent hypermutation model to identify hotspots, which allows for a comprehensive analysis of the mutational landscape of BCRs.
RESULTS
Probabilistic assignment of recombination scenarios
V(D)J recombination selects two or three segments (V and J for TCR α and BCR lights chains; V, D, and J for TCR β and BCR heavy chains) from a library of germline genes, and assembles them while deleting base pairs and inserting other non-templated ones at the junctions (Fig. 1a). B cell receptors can further diversify through somatic hypermutations during affinity maturation. The recombination process is degenerate, as the same sequence can be generated in many different ways [11]. IGoR starts by listing the possible recombination and hypermutation scenarios leading to an observed sequence in the dataset. It then assigns probability weights reflecting the likelihood of these scenarios. As the example in Fig. 1a shows, explored scenarios can be very different yet have comparable contributions to the sequence likelihood. Since exploring all possible scenarios would be computationally too costly, IGoR restricts its exploration to the reasonably likely ones. Scenario exploration takes from 1 ms up to less than a second per sequence on a single CPU core, depending on the chain (see full distributions of runtimes in Fig. S1). Different recombination architectures and dependencies can be configured within IGoR by specifying dependencies between elementary events (gene choices, deletions, insertions, hypermutations) through an acyclic directed graph, or Bayesian network, as illustrated in Fig. 1b for the case of TCR β chains (see Online Methods for the other used structures).
IGoR functions according to three modes: learning, analysis, and generation (Fig. 1c). In the learning mode, IGoR infers the recombination statistics of large datasets of sequences using a Sparse Expectation-Maximization algorithm (see Online Methods). In the analysis mode, IGoR assigns recombination events to sequences in a probabilistic way, by outputing the most likely scenarios ranked by their probabilities, as well as the overall generation probability of the sequence. In the generation mode, IGoR outputs random sequences with specified statistics, e.g. learned from real datasets.
Inference of V(D)J recombination
We used IGoR’s learning mode to infer the accurate statistics of V(D)J recombination from four datasets comprised of unique sequences of non-productive rearrangements of three different chains, sequenced either at the levels of mRNA (TCRα chain or TRA, and TCRβ chain or TRB [14]) or DNA (TRB [13], BCR heavy chain or IGH from naive cells [12]), generalizing earlier methods [15ȁ17]. Restricting to nonproductive unique sequences allowed us to avoid biases introduced by functional selection. The Expectation-Maximization algorithm converged within a few iterations (see Fig. S2 for convergence of parameters, and Fig. S3 for the case of IGH).
The same TRB insertion and deletion distributions were inferred regardless of the individual, laboratory of origin, or sequencing protocol, and of whether DNA [13] (light blue distributions in Fig. 2) or mRNA [14] (dark blue) was used. By contrast, V and J gene usage varied moderately but significantly across individuals, and even more across sequencing technologies, suggesting possible primer-dependent biases (Fig. S4, see also Fig. S17 for IGH D-J gene usage). Insertions at the TRA V-J junction, and at the TRB V-D and D-J junctions have similar distributions (Fig. 2a), as previously reported [17]. IGH have significantly more insertions at the junctions than TCRs, consistent with previous observations [16].
We then validated the learning algorithm on synthetic datasets. Sequences were generated in batches of 103 to 105 by IGoR with a variable error rate, using statistics inferred from 60bp DNA TRB data. IGoR’s learning algorithm was then run on these raw sequences, and the resulting statistics compared to the known ground truth. We found that the inference was highly accurate for datasets of 105 sequences and an error rate set to its typical experimental value, 10−3 (Fig. 3a and b), and was not affected by overfitting. However, not all high-throughput sequencing datasets reach this depth, especially when restricted to unique non-productive sequences. In addition, hypermutation rates in BCRs, which IGoR treats in the same way as errors, can reach 1–10%. To assess how these limitations affect accuracy, we calculated the Kullback-Leibler divergence (a non-parametric measure of difference between probability distributions, see Online Methods) between the true distributions and the inferred ones, for varying sizes of datasets and error rates. For an error rate of 10−3, ~ 5000 unique out-of-frame sequences (which can be obtained from less than 2ml of blood with current mRNA sequencing technologies [14]) were sufficient to learn an accurate model of TRB (Fig. 3c), with the majority of the estimation error due to deletion profiles (which account for the majority of parameters). Increasing the error rate has little effect up to rates of 10−2, but significantly degrades accuracy for typical hypermutation rates, 10−1 (Fig. 3d), with the gene usage distribution affected the most (Fig. S5). This suggests that the recombination statistics of BCRs should be inferred using sequences from naive, non hypermutated cells (as we did in Fig. 2).
Analysis of scenario degeneracy
By considering all possible recombination scenarios for each sequence, our approach departs significantly from most existing methods, whose goal is to find the most likely one. To assess how often the most plausible scenario is the correct one, we analyzed synthetic sequences for which the generation scenario is known. For each generated sequence, we used IGoR’s analysis mode to enumerate the set of scenarios that were consistent with the nucleotide sequence, and ranked them according to their likelihood. Fig. 4a shows the distribution of the rank of the true recombination scenario for TRB and IGH synthetic data. The maximum-likelihood scenario is not the correct one in 72% of IGH sequences and 85% of 60bp TRB sequences. The distributions have long tails, meaning that a substantial fraction of sequences have a very large recombination degeneracy.
We then estimated how many scenarios, ranked from most likely to least likely, were needed to explain a given fraction f of the total sequence likelihood. The distributions of this number across 100,000 generated sequences are shown in Fig. 4b for various values of f (see Fig. S6 for the equivalent plot for TRB data). To enumerate the correct scenario with f = 95% confidence requires to include at least 30 to 50 scenarios. This analysis indicates that many scenarios need to be considered to correctly characterize the generation process.
IGoR outputs the probability of generation of the processed sequences, by summing the probabilities of all their possible scenarios, which deterministic assignment methods cannot do. It was shown that this generation probability was predictive of sharing properties between healthy individuals [14, 15]. This functionality could be used as a useful indicator of convergent recombination in studies attempting to identify antigen-specific or autoimmune related sequences from large clinical datasets.
Comparison to other methods
We compared our method to two representative state-of-the-art algorithms: MiXCR [8], an efficient assignment tool that finds the best matching germline genes, and Partis [10], a BCR-specific tool that uses maximum likelihood to find the most plausible scenario. 130 base-pair IGH sequences were synthetized in silico from a data-inferred model using IGoR’s generation mode. We then assigned recombination scenarios using MiXCR, Partis and IGoR, and compared them to the true scenarios with which sequences were generated. In IGoR’s and Partis’ case, the model parameters were learned from the generated dataset to mimick the analysis of real data. Fig. 4c shows the performance of the three methods in assigning the correct scenario of recombination. IGoR performs about 2.5 times better than MiXCR and Partis in predicting the complete recombination scenario, as well as each of its individual components. Note that Partis does not include palindromic insertions, which both IGoR and MiXCR treat by appending a short palindromic sequence at the end of each germline segment; restricting the analysis to sequences generated without palindromic insertions makes Partis’ performance comparable to that of MiXCR (Fig. S7).
Next, we compared the recombination statistics learned by the three methods to the true statistics used to generate the data. For MiXCR and Partis, we built the distribution of recombination events assigned to each sequence, while for IGoR these distributions were inferred using Expectation-Maximization, as explained before. All three methods yield similar statistics for V and J gene usage and deletion profiles (see Fig. S8). However, the dependency between D an J usage in TRB is correctly captured by IGoR but not by the other methods (Fig. 4d). TRB D and J genes are organised in two clusters, one containing D1 followed by genes of the J1 family, the other containing D2 followed by genes of the J2 family. Because of this organisation, D2 cannot be recombined with genes from the J1 family [18]. MiXCR assigns 20% of impossible D2-J1 recombination events to sequences (note that Partis does not process TCRs). By constrast, IGoR correcly learns the rule by assigning zero frequency to these impossible D-J pairs. The same results are obtained directly on real data (see Fig. S9). Finally, IGoR accurately reconstructs the distribution of insertions, while the other methods systematically overestimate the probability of zero insertions (Fig. S8a and b).
Somatic hypermutations
To study patterns of SHMs in BCR expressed by memory B cells, we included into IGoR the possibility to infer a sequence-dependent hypermutation rate. The probability of error or mutation at a given position on the nucleotide sequence is assumed to depend on its immediate n-mer context (see Fig. 5a), through the logistic transformation of an additive score computed using a position weigth matrix (PWM), similar to binding energy motifs used to describe DNA binding sites [19]. We ran IGoR on memory out-of-frame IGH sequences from Ref. [12] to learn 7–mer PWMs, as well as overall mutation rates (the geometric mean of the mutation rate over all possible 7–mers), while fixing the recombination statistics to those previously learned from naive sequences, using Expectation Maximization (see Online Methods). IGoR’s probabilistic framework handles the degeneracy of sequence origin caused by convergent combinations of gene choices and hypermutations. The learning procedure differs crucially from Ref. [16], where the hypermutation rate was uniform. Three distinct PWMs were learned for V, D, and J templated regions (Fig. 5b). To validate our PWM and mutation rate learning algorithm, we generated synthetic data with hypermutations according to the model learned from the real dataset, and re-learned its parameters using IGoR, finding excellent agreement (Fig. S10).
The PWM prediction for the position-dependent probability of hypermutations correlated well with that actually observed in the sequences (r = 0.7 for V genes, see Fig. 5c and Fig. S11). PWMs were very reproducible across the two tested individuals (r = 0.98, Fig. S12), indicating that the inference procedure is robust to the individual history of infections, and pointing to the universal nature of the SHM mechanism. By constrast, the inferred overall mutation rate differred by a two-fold factor between the two individuals, probably owing to differences in age, past infections, or lifestyle (Fig. S12). The motifs we found recapitulate previously reported hotspot motifs (positive values of the PWM) for every gene, including WRCY (or WRCH [20]) and WA [21, 22] (W = A or T, Y = C or T, R = G or A; mutated position underlined), as well as cold-spot motifs albeit to a lesser extend (SYC, where S = C, G) [23]. In all three motifs, C and G are generally underrepresented, except for the mutated position in V and D genes where T is less mutated than others. We assessed the robustness of the model to n-mer length by learning PWMs of sizes ranging from 3 to 9 (Fig. S13). The contributions of each relative position did not change substantially as a function of context length. Positions at least up to 4 nucleotides away from the mutation locus contribute to the motif. This could mean that the context dependence is broad, or alternatively that the motif model is indirectly capturing non-contextual effects. Overall, the inferred PWMs give both a more detailed and more nuanced view of the rules that govern hotspot positions, and cannot be reduced to a few easily describable motifs.
Fig. 5b shows that the motifs differ substantially between V, D, and J genes. V-learned PWMs only moderately predict J-gene hypermutation rates (r = 0.5 versus r = 0.7 for V-gene rates), and J-learned PWMs predict V-gene rates even worse (r = 0.24, see Fig. S11). This disagreement indicates that predictions purely based on context-dependent motifs are insufficient to explain all of the variability in hypermutation probabilities, and that other mechanisms must be at play. The overall mutation rate was also different between germline genes, consistent with reports that the chromatin state affects hypermutation rates [24–26].
We then used the inferred PWM within IGoR to probabilistically call putative hypermutations in sequences. We first examined the distribution of the number of mutations in a sequence (Fig. 5d). The empirical distribution (red) is more skewed and has a longer tail than would be expected by assuming independent hypermutations in each sequence, as predicted by generating randomly hypermutated sequences with the inferred PWM (blue). This observation is consistent with the fact that different B cells have undergone a variable number of cycles of affinity maturation, resulting in differences in effective hypermutation rates. Second, we asked whether hypermutations co-localized within the same sequence, by calculating the enrichment of hypermutations at two positions as a function of their genomic distance (Fig. 5e). While this enrichment is 1 in synthetic sequences (since our model assumes that hypermutations are independent of each other), real data shows up to a 4-fold enrichment of hypermutations at nearby positions. This difference is consistent with the fact that AID can cause repairs of DNA over large regions [27]. The typical distance at which the co-localization enrichment index decays gives an estimate for the length of these correlated regions of hypermutations, about 15 base pairs.
IGoR can in principle calculate the generation probability of any sequence. However, highly hypermutated sequences pose an additional challenge because the ancestral (unmutated) recombined sequence itself is sometimes not known with certainty. To overcome this issue, IGoR explores for each sequence all possible recombination and hypermutation scenarios, and calculates the generation probability of each potential ancestral sequence. Using synthetic data, we checked that the generation probability of individual sequences is well predicted by this method (r = 0.97, see Fig. S14 and Online Methods), and its distribution accurately reproduced (see Fig. S15).
DISCUSSION
By treating alignments of immune receptors to the germline probabilistically [15], IGoR corrects for systematic biases in the estimate of V(D)J recombination statistics, and predicts recombination scenarios more accurately than previous methods. Its detailed analysis of recombination scenarios further reveals that, even with a perfect estimator, the scenario is incorrectly called in more than 70% of sequences, suggesting caution when interpreting results from deterministic assignments.
Although we demonstrated its functions on human TRB and IGH, IGoR’s flexible structure makes it applicable to any variable lymphocyte receptor (TCR or immunoglobulin) and species for which genomic data is available. Unlike Hidden Markov Model based methods (e.g. [10, 17]), it can include a wide array of possible dependencies between the recombination events. It can also be adapted to handle unusual or incomplete rearrangements (D-J rearrangments, DD2/DD3 rearrangements in TCR δ chains, hybrid TRA/TRD recombinations, etc.). IGoR can also help detect unusual rearrangement features by using its syntheticaly generated sequences as a control. For instance, rearrangements with tandem Ds have been reported [12], but distinguishing them from random insertions can be challenging. To test this, we counted sequences with two ≥10-nt D segments in the data, and compared it with predictions from IGoR’s synthetic sequences generated with a single D segment (see Online Methods). We found 5 times more double-D assignments in IGH data than in the control, validating the findings of [12]. In contrast, the same analysis performed on TRB showed no significant presence of tandem Ds. Future versions of IGoR should include the possibility of including multiple D rearrangements. Note that IGoR does not find reversed Ds in IGH (Fig. S18).
IGoR infers recombination statistics from nonproductive sequences only, but can do it with as few as 5000 sequences. Once a recombination model is learned for a given locus, IGoR can generate arbitrary numbers of synthetic sequences with the same statistics, which could be used as a control in disease-association studies, by helping to distinguish antigen-specific clonotypes from public sequences with high convergent recombination frequencies, and thus dispense with the need of a healthy control cohort.
Our analysis of hypermutations led us to infer distinct sequence motifs for mutation targets on the V, D and J segments of human IGH, in contrast with previous approaches that assume a universal context model [28]. Although our motifs were learned on short reads comprising only part of the V and J segments, analysis of synthetic sequences showed that motifs could be accurately learned from such short reads. Exploring their applicability to longer reads would be an interesting future direction. We further found that hypermutations tend to co-localize along the sequence. These results suggest that at least three effects determine hypermutation hotspots: the immediate DNA context of the hypermutation, as modeled by our sequence motifs, position-specific effects mediated by e.g. chromatin configuration and hi-stone modifications, and the co-occurence of nearby mutations. Future improvements of hypermutation target predictions will have to rely on a better understanding of the precise mechanisms of AID operation [26].
Software availability. IGoR along with example datasets and pre-learned human TRA, TRB, and IGH models is available at bitbucket.org/qmarcou/igor.
ONLINE METHODS
IGoR functions according to three modes: VDJ statistics learning, sequence analysis, and sequence generation. All modes rely on an explicit stochastic description of the recombination and hypermutation events. In the analysis and learning modes, each sequence is analysed by listing all possible recombination and hypermutation scenarios. The learning mode iterates the analysis mode by updating the model parameters according to an Expectation-Maximization algorithm.
Recombination model
In all three modes, IGoR assumes that receptor sequences result from a recombination scenario comprising several stochastic elements — choice of germline segments, deletions and insertions. These features are stochastic and share statistical dependencies with each other. For tractability, we assume that these dependencies can be represented by an acyclic graph, also called Bayesian network (see SI Text for details). This structure can be configured within IGoR’s setup files. For the purpose of this study, we used the following dependency structures for the α chain of T cells (TRA): and for the β chain of T cell receptors (TRB) and heavy chain of B cell receptors (IGH): where V, D, J denote the choice of germline genes, delV, del J the number of deleted base pairs at the ends of the V and J segments, delDl, delDr the number of deletions at the left and right ends of the D segments, insVJ, insVD, insDJ, the numbers of insertions at each of the insertion sites (between V-J, or V-D and D-J), and ni, mi the identities of the inserted base pairs. In the case of TRB, gene usage is further factorized as P(V, D, J) = P(V)P(D, J)
Context dependent hypermutation model
When processing TCRs or naive BCRs, a constant error probability is assumed throughout the sequence. When processing memory BCRs, a context-dependent hypermutation model is assumed: at each position along the V, D, and J genes, a hypermutation occurs with probability Pmut, with where (π−m,…, πm) is the (2m + 1)-mer sequence context centered around the location of the mutation. The entries of the position weight matrix (PWM), ei(π), contribute additively to the motif, and μ is the overall hypermutation rate.
Alignment to germline and scenario listing
In the analysis and learning modes, each sequence is first aligned to all possible germline genes retrieved from genomic databases (e.g. IMGT), using the Smith-Waterman algorithm [29]. Only germline genes with alignment scores higher than an adjustable threshold are considered for further analysis (see SI Text for details). Possible scenarios are then listed by picking germline genes with an above-threshold alignment score, and by choosing a number of base pairs to further delete from the ends of their aligned parts. The base pairs located between the germline segments trimmed in this manner are called insertions, and alignment mismatches to the germline are called errors or hypermutations. When the palindromic end of germline genes is not entirely deleted, the number of remaining palindromic base pairs are described as negative deletions. To allow for the possibility that the D segments be inserted in both directions in BCRs, we added the reverse complements of each D germline segment to the list of genomic templates.
Sequence analysis
For each sequence in the dataset, the probability of possible scenarios are computed using the recombination probability of Eqs. 1 or 2, multiplied by the probability of errors or hypermutations Perr: Pscenario = Precomb × Perr. Scenarios are then listed in order of decreasing probability. The sum of probabilities Precomb × Perr of possible recombination and hypermutation events gives the probability of observation of that particular sequence read, Pread. The probability that the pre-mutation sequence was generated by recombination, Pgen, is defined as the sum of the probabilities Precomb of scenarios leading to that sequence. Since the pre-mutation sequence is not known with certainty, we calculated an approximate generation probability Pgen as the geometric mean of Pgen of all possible unmutated sequences consistent with the read, weighted by their posterior probabilities, Pgen × Perr/Pread. Alternatively, we approximated Pgen by that of the most likely pre-mutation sequence.
To shorten computation times, only plausible scenarios are listed by IGoR. Scenarios are enumerated by exploring the nodes of a hierarchical decision tree, where each depth corresponds to the choice of a scenario feature. Branches of the tree are discarded if their total contribution to the sequence probability is upper-bounded to be below a certain threshold. Details of the procedure are given in the SI text.
Learning algorithm
The learning algorithm infers the parameters of Eqs. 1 or 2, as well as the error or hypermutation model parameters of Eq. 3, from a large datasets of unique sequences. It relies on the sequence analysis module, and follows an Expectation-Maximization procedure. Starting from an arbitrary (but reasonable) set of parameters, all sequences in the dataset are analysed as described above, producing a long list of scenarios associated with each sequence. We define the pseudo-log-likelihood as the weighted sum of the log-likelihoods of all scenarios of all sequences, where the weights are given by the conditional probabilities of scenarios given the sequence, Precomb/Pread (Expectation step). This pseudo-log-likelihood is then maximized with respect to the parameters of the log-likelihoods (Eqs. 1–3), while keeping the weights fixed. The parameters are updated, and the procedure repeated, until convergence. Mathematical derivations of the update rules and details about Expectation-Maximization are given in the SI Text.
Validation of model inference
To compare the model parameters θ1 inferred from synthetic data to the known model parameters θ2 from which these data were generated, we computed the Kullback-Leibler divergence between two probability distributions, , where the sum is over all scenarios E. P(E, θ) is computed using Eqs. 1 or 2. This Kullback-Leibler divergence can be decomposed into additive contributions from each of the scenario features, as detailed in the SI text.
Datasets
We applied the learning algorithm on the following publicly available datasets: TCR alpha and beta chains RNA datasets from [14] are available on Sequence Read Archive (SRP078490); TCR beta chains 60 bp DNA datasets from [15] are available at http://physics.princeton.edu/~ccallan/TCRPaper/data/; naive and Memory BCR heavy chains DNA datasets from [16, 30] are available at http://physics.princeton.edu/~ccallan/BCRPaper/data/.
Correlations between hypermutations
To evaluate correlations between the occurence of hypermutations at close-by positions along the BCR sequence, we computed the radial disbribution function defined as: where f(i,V) and f(i,j,V) are the frequencies of hypermutations at position i, and at both positions i and j, respectively, calculated from individual scenario statistics weighted by their posterior probabilities. CV(r) is the set of pairs of positions separated by r that were observed a large enough number of times in gene V, and .
Usage of tandem D segments
In order to assess the occurrence of double D insertions during the VDJ recombination event of IGH or TRB, we computed the frequency with which one could align (with the Smith-Waterman algorithm) two non-overlapping Ds over least 10 nucleotides, between the best V and best J alignments. We then compared the frequency obtained for synthetically generated sequences, to that obtained for real sequencing data.
Acknowledgements
The work was supported by grant ERCStG n. 306312.
APPENDICES
A Model definitions
We start by giving the particular model structures used in this study. We then give a more general definition applicable to other general types of recombination products.
1 Models for TRA, TRB and IGH
We define a probabilistic model for each type of chain (e.g. α, β, heavy, light) that describes the probability of each recombination event E by the probabilities of the known elements of the recombination subprocess (gene choice, insertions, deletions at each of the junctions etc) for each chain, and assumes only the minimum correlations between the subprocesses needed to explain the correlations observed in the data. We model insertions as a Markov chain (the identity of an inserted nucleotide only depends on the previously inserted one) with a nonparametric length distribution [14–16]. For each insertion site (X= VD and DJ for β and heavy chains and X=VJ for α and light chains) we infer the probability of observing a non-templated sequence of a given length, P(insX), and the transition matrices PVJ(ni|ni–1), PVD(ni|ni_1), PDJ(mi|mi_1) giving the probability of inserting a given nucleotide as a function of the identity of previous one. For each gene we infer the probability of the number of deletions conditioned on the gene identity, e.g. P(delV|V) for deletions from the V gene. We model templated palindromic insertions as negative deletions [15, 16]. The D gene is very short and may get fully deleted. This introduces correlations between the deletions on both sides of the original D gene template. We account for these correlations by inferring the joint probability P(delDl, delDr|D). We treat every allele as a different gene [16] and infer the joint gene usage P(V, D, J) for β and heavy chains, and P(V, J) for α and light chains, to be able to capture correlations between segment usage.
For TCR α chains or BCR light chains, the probability of a recombination event E = (V, J, delV, delJ, insVJ) is:
Similarly, the probability of a recombination event E = (V, D, J, delV, delDl, delDr, delJ, insVD, insDJ) for a TCRβ or BCR heavy chain is:
In the case of TRB, gene usage is further factorized as P(V, D, J) = P(V)P(D, J).
2 General model formulation
IGoR is designed in a modular way so the user can define arbitrary model forms. The models are Bayesian networks encoded as directed acyclic graphs, whose vertices i = 1,…, K label individual recombination subprocesses Ei (V, D, J choices, deletions, etc. in the examples above). Dependence of the recombination process j upon i is encoded by a directed edge between i and j, denoted vij = 1 (while vij = 0 means no direct dependence). The set of parents of i, i.e. processes on which i depends directly, is denoted by .
Using these definitions we can, generally and irrespectively of the assumed form of the underlying model of recombination, write the probability of a complete recombination scenario E = (E1,…, EK) as: where θ denotes the underlying model parameters (i.e. probability distributions of gene choice, insertions at a given junction, and deletions from a given gene in the studied examples).
Each recombination scenario E leads to a unique sequence (in the following we often write S for for brevity). However, in order to produce a given sequence S several scenarios might be equivalent, and we can write the probability of generating a given sequence as:
The above description only holds to assess the generation probability of a pure product of recombination and does not account for sequencing errors or hypermutations. Note that, since longer reads allow for more reliable determination of V and J gene segments, Pgen depends in general on read length: shorter reads can be created in more ways than longer reads, leading to larger Pgen.
3 Errors and hypermutations
Sequencing is inherently noisy and introduces nucleotide substitutions. In addition, BCRs can accumulate hypermutations, which can be mathematically treated in the same way as errors. For the sake of clarity, we distinguish between the sequencing read R and the original sequence S resulting from recombination, as defined above. For simplicity we ignore insertion and deletion errors, so that R and S are of the same length L.
We define our error model as deviations from the initial recombination event (through sequencing errors or somatic hypermutations) such that Perr(R|S, θ) is the probability of observing the sequencing read R given the recombination product S. Since the recombination scenario E completely determines S, Perr(R|S, θ) = Perr(R|E, θ), and we use these two notations interchangeably. The dependence on θ reflects the fact that θ also includes the parameters of the error or hypermutation model.
We write the joint probability of producing a given sequence S and observing a given read R as:
Summing over all possible recombination products, the likelihood of a sequencing read is: and the total likelihood of the model given a dataset of reads (R1,…, RN) is given by:
B Expectation-maximization
1 General scheme
The recombination machinery is degenerate, as several scenarios of recombination and hypermutations can lead to the same sequence, and the recombination scenario E from which the sequencing read R comes from is in general unknown. The Expectation-Maximization algorithm is a commonly used algorithm that maximizes the likelihood of models with hidden variables given the data. In this section we re-derive this algorithm for our class of models.
The procedure is iterative. Starting from an initial set of parameters θ, one wishes to update another set of parameters θ′. From Bayes formula, Pread(R|θ′) = P(E, R|θ′)/P(E|R, θ′), we rewrite the log-likelihood of a read as: where we have used , and where we have defined
The difference between the log-likelihood, ln , between the current set of parameters θ and the candidate new parameters θ′ reads: where , and where we have used Gibbs inequality:
This inequality ensures that maximizing the “pseudo-log-likelihood” Q(θ′|θ) over θ′ increases total likelihood by at least the same amount. The Expectation-Maximization scheme updates θ by doing such a maximization, and repeating the procedure iteratively. The algorithm converges to a maximum of the likelihood.
2 Optimizing the recombination model
The pseudo-log-likelihood can be broken up in two independent terms, Q(θ′|θ) = Qrecomb(θ′|θ) + Qerr(θ′|θ), respectively corresponding to the recombination model and the error or hypermutation model:
In order to maximize the pseudo-log-likelihood of the recombination model we need to maximize Qrecomb(θ′|θ) with respect to every model component contained in the parameter set . We impose normalization using Lagrange multipliers, λi, and define:
Taking the functional derivative of with respect to the model parameter we get:
Setting this derivative to zero gives: where the Lagrange parameter λi = N ensures normalization. In other words the modified log-likelihood is maximized by using an update rule that equates the probability of a realization of a recombination event to its posterior frequency.
3 Optimizing the independent single nucleotide error model
The independent single nucleotide error model is the simplest instance of an error model, where each nucleotide of the read has a probability r to be mis-sequenced as one of the three other nucleotides with equal probability. For this model we have where Nerr(R, S) the number of mismatches between R and S, and L the number of error-prone base pairs. We compute the derivative of the modified log-likelihood of the error model with respect to R* as:
Setting this derivative to zero yields: where L(Ra, E) is the number of potentially erroneous nucleotides in read a. For simplicity we ignore errors and hypermutations in the insertion part of the sequence, as they are almost indistinguishable from unmutated random insertions, and accounting for them would imply summing over an exponentially large number of scenarios. As a result, L in the above formula is not the read length, but rather the number of genomic nucleotides in each scenario, which depends on the scenario E as well as on the sequence read.
4 Optimizing the hypermutation model
The hypermutation model assumes the following form for the probability of hypermutations: with where (π−m,…, πm) = (Sx−m,…, Sx+m) is the sequence context of the original recombination product around a hypermutation at position x. The parameters ei(N), the position-weight matrix, and μ, the overall mutation rate, are part of the parameter set θ. In order to lift the degeneracy of the model we impose that at every position i.
The pseudo-log-likelihood of the hypermutation model reads: where . It can be rewritten as: where
During the Expectation step, we compute these two quantities for each (2m+1)-mer and then maximize Qerr at each step of the Expectation-Maximization scheme using Newton’s method with a backtracking line search. To impose we remove one parameter per position i by setting for one nucleotide, .
We can then compute the entries of the gradient vector J (of size 3(2m + 1) + 1): along with the Hessian matrix H entries:
For each step of Newton’s method we find the step direction by solving HΔθ′ = − J and we gradually refine the step size based on the Armijo-Goldstein condition. These operations are iteratively repeated until the pseudo-log-likelihood of the error model for a given Maximization step of the EM framework is maximized.
C Model entropy and DKL
Shannon’s entropy [31, 32], is a measure of the uncertainty about the outcome of a stochastic process described by a variable x, governed by the distribution p(x|θ) and parametrized by θ. As in [15–17] we compute this quantity based on our probabilistic framework and use it as an estimate for the diversity generated by the V(D)J recombination process. In the main text we also introduced the relative entropy or Kullback-Leibler divergence, as a measure of dissimilarity between two probability distributions parametrized by θ1 and θ2 respectively, and used it to quantify the error made by our probabilistic framework upon inferring the V(D)J recombination parameters.
Since both the entropy and the Kullback Leibler divergence between two recombination models can be computed once one knows how to compute the cross entropy between the distributions for the two sets of parameters θ1 and θ2, we focus here on the computation of H(θ1, θ2).
1 General form
For the considered class of models, the cross-entropy can be divided into subparts for each model component, with
To calculate this sum, one does not need to sum over all possible scenarios E, but only on combinations of processes that affect Ei directly or indirectly. Let us call the set of indices affecting process i. These are defined as the “ancestors” of i in the acyclic graph, i.e. indices j such that there exists a lineage from j to i, (i1 = i,i2,…, ik = j) with (note that Ai includes i itself as a 0th order ancestor). Then the previous sum can be reduced to a sum over the processes in A only: where EAi denotes the subvector of elements of E with indices in A. Estimating the cross entropy for an event Ei requires exponential time in the number of ancestors of that node. Fortunately, in the recombination models considered in this paper the set of ancestors are small and obtaining the cross entropy is easy for every event. The special case of insertions is discussed below. Note that this cross-entropy only takes into account the recombination model, and not the error model.
2 Inserted nucleotides
For a given insertion length insVJ (or insVD, or insDJ), the cross-entropy between two models of insertions is given by where n = (n1,…, ninsVJ) is the inserted sequence, and Ps(n1, θ) is the stationary distribution of the Markov chain of insertions, solution of the equation . The average cross-entropy over possible lengths is then given by: where is the subset of processes affecting either insVJ or n, exluding insVJ itself.
D Probability of generation
Although the probability of generation of a sequence without errors or hypermutations is well defined, computing the probability of generation of a mutated sequence, before mutations occurred, is strictly speaking not possible because that sequence is not know with certainty. However, we can compute a good approximation for it, and we can also calculate its distribution across sequences.
To approximate Pgen(S) from a noisy or hypermutated sequence R, we take its geometric average weighted by the probability of the recombination product S: with . Alternatively, one can take the generation probability of the most likely recombination product: where S* = argmaxS P(S|R, θ).
The distribution ρ(x) of the log-probabilities of generation, x = log Pgen, can be computed from data using:
Note that unlike estimates for single sequences, this expression should become exact in the limit of N → ∞.
E Data and software
1 Genomic templates
We used custom genomic templates derived from the IMGT database [33]. TCR alpha V and J genomic templates were taken from the IMGT human database. For TCR beta V, D and J genes we used curated genomic templates from [15]. BCR heavy chain V, D and J genes were taken from the customized genomic templates used in [16]. For software comparison we used default genomic templates provided with Partis and MiXCR.
2 Alignments
Initial alignments to germline genes were performed using the Smith-Waterman algorithm [29], with scores of 5 for matching base pairs, −14 for mismatches, and a 50 gap penalty. Alignments with a score below the following gene dependent threshold were discarded: 50 for TRBV, 0 for TRBD, 10 for TRBJ, 20 for TRAV, 10 for TRAJ, 50 for IGHV, 40 for IGHD, 10 for IGHJ. We also discarded alignments whose score fell below the maximum alignment score (found for this read and segment type), minus the following variable range: 55 for TRBV, 35 for TRBD, 10 for TRBJ, 55 for IGHV, 20 for IGHJ.
The alignment offset (the index of the nucleotide on the read to which the first letter of the undeleted genomic template is aligned) was constrained depending on known primer locations on the J gene.
3 Pruning the tree of scenarios
Since enumerating all possible scenarios for each sequence is not tractable, we used a heuristic method for reducing their numbers. Exploring all possible scenarios is equivalent to exploring all the terminal leafs of a tree. Our heuristic is to prune all branches that do not contribute substantially to the likelihood of the read. To do this we implement a Sparse Expectation Maximization algorithm as motivated in [34]. Due to the acyclicity of the directed graph underlining the Bayesian network, there exists a topological sorting of the events constituting a partially ordered set (we will assume in the following that the indices of the different events Ei respect this ordering). IGoR processes event realizations according to this order corresponding to different layers of depth in the tree. To discard irrelevant branches (containing negligible scenarios) IGoR computes at each depth k (with 0 ≤ k < K) an upper bound on the probability of the currently explored scenario: where ε is the set of already fully explored scenarios, and 0 ≤ ε ≤ 1 is a tunable parameter setting the precision of the sparsity approximation. While ε = 0 will explore every possible scenario and perform an exact Expectation step, ε = 1 will explore only scenarios more likely that any scenario already explored.
Although Eq. 47 captures the essence behind our tree pruning approach, in practice IGoR uses more information than a simple upper probability bound. By picking two gene choice realizations, imposing the identity and position of these specific V and J genes, we explicitly impose the total nucleotide length of event realizations between those V and J genes (number of insertions, deletions, D gene length, …). When computing the probability upper-bounds IGoR computes the upper probability bound for a given junction length between two event realizations, and uses this refined bound to efficiently prune the tree of scenarios.
4 Generating synthetic sequences
Synthetic sequences are generated by randomly drawing scenarios of recombination from the probability distribution in Eq. 4 or 5. In order to fit the data, the resulting sequences are then cut to mimic the sequencing process (e.g. fixed starting point and fixed read length).
5 Comparison to other software
We benchmarked our method against MiXCR 2.0.2 [8] - a commonly used deterministic alignment method. We used the MiXCR sequence assignment to compute the frequency of gene usage, insertion length, deletions and obtain the distributions shown in Fig. S8. We also compared to Partis [10] - a recent HMM based model of recombination. Since Partis uses a Viterbi learning algorithm, we used the most likely assignments it outputs to compute the corresponding probability distribution shown in Fig. S8. Since Partis is designed to handle BCRs we assessed its performance on the BCR dataset only.