Learning the high-dimensional immunogenomic features that predict public and private antibody repertoires

Victor Greiff; Cédric R. Weber; Johannes Palme; Ulrich Bodenhofer; Enkelejda Miho; Ulrike Menzel; Sai T. Reddy

doi:10.1101/127902

Abstract

Recent studies have revealed that immune repertoires contain a substantial fraction of public clones, which are defined as antibody or T-cell receptor (TCR) clonal sequences shared across individuals. As of yet, it has remained unclear whether public clones possess predictable sequence features that separate them from private clones, which are believed to be generated largely stochastically. This knowledge gap represents a lack of insight into the shaping of immune repertoire diversity. Leveraging a machine learning approach capable of capturing the high-dimensional compositional information of each clonal sequence (defined by the complementarity determining region 3, CDR3), we detected predictive public- and private-clone-specific immunogenomic differences concentrated in the CDR3’s N1-D-N2 region, which allowed the prediction of public and private status with 80% accuracy in both humans and mice. Our results unexpectedly demonstrate that not only public but also private clones possess predictable high-dimensional immunogenomic features. Our support vector machine model could be trained effectively on large published datasets (3 million clonal sequences) and was sufficiently robust for public clone prediction across studies prepared with different library preparation and high-throughput sequencing protocols. In summary, we have uncovered the existence of high-dimensional immunogenomic rules that shape immune repertoire diversity in a predictable fashion. Our approach may pave the way towards the construction of a comprehensive atlas of public clones in immune repertoires, which may have applications in rational vaccine design and immunotherapeutics.

Introduction

The clonal identity, specificity, and diversity of adaptive immune receptors are largely defined by the sequence of complementarity determining region 3 (CDR3) of variable heavy (V_H) and variable beta (V_β) chains of antibodies and TCRs, respectively [1–5]. The CDR3 encompasses the junction region of recombined V-, D-, J-gene segments as well as non-templated nucleotide (n, p) addition [6]. Due to the enormous theoretical diversity of antibody and TCR repertoires (>10¹³) [7–10] and technological limitations (Sanger sequencing), it was long believed that clonal repertoires were to an overwhelming extent private to each individual [11,12]. However, due to recent advances in high-throughput immune repertoire sequencing, it has been observed that a considerable fraction (>1%) of CDR3s are shared across individuals [1,5,13–26]. Thus, these shared clones (hereafter referred to as “public clones”) are refining our view of adaptive immune repertoire diversity. Therefore, a fundamental question emerges: are there immunogenomic differences that predetermine whether a clone becomes part of the public or private immune repertoire?

In the context of antibody and TCR repertoires, the large theoretical clonal (CDR3) diversity renders the investigation of public and private repertoires computationally challenging [27]. Previous studies using conventional low-dimensional analysis suggested that public clones are germline-like clones with few insertions, thereby having higher occurrence probabilities, whereas private clones contain more stochastic elements (i.e. N1, N2 insertions) [17,23]. In order to investigate the composition of large numbers of sequences with the appropriate dimensionality, sequence kernels are increasingly used [28,29]. Sequence kernels are high-dimensional functions which measure the similarity of pairs of sequences, for example, by comparing the occurrence of specific subsequences (k-mers) in a high-dimensional space [30,31]. Supervised machine learning (e.g., support vector machine analysis) is an approach that takes low- and high-dimensional feature functions as input to find a classification rule that discriminates between two (or more) given classes on a single-clone level (e.g., public vs. private clones) [32]. In contrast to using conventional low-dimensional features to analyze immune repertoires, the coupling of high-dimensional sequence kernels to support vector machine (SVM) analysis may lead to greater insight into the immunogenomic structure of repertoire diversity, specifically into the difference between public and private repertoires. As opposed to previous approaches [33], a key advantage of sequence-kernel-based SVM analysis is the prediction-profile-based identification of CDR3 subregions that are most predictive for a respective class (public or private class) [30,31]. This approach may lead to predictive immunological and mechanistic insight into the immunogenomic elements that shape repertoire diversity.

In order to identify the immunogenomic differences between public and private antibody repertoires (Figure 1), we applied support vector machine analysis (Figure 1B) to six large-scale immune repertoire (antibody and TCR) sequencing datasets from mice and humans (Figure 1A). When using low-dimensional features (germline gene and amino acid usage, CDR3 subregion length) as the input for SVM analysis, prediction accuracy of private and public status reached maximally 66%, which only slightly improves on a random classifier (50%). However, when implementing a high-dimensional sequence-kernel (sequence composition) based support vector machine analysis, we were able to detect strong immunogenomic differences concentrated in the N1-D-N2 region in public and private clones, with a high prediction accuracy (balanced accuracy≈79–83%, Figure 1C). Our results unexpectedly signify that both public and private antibody repertoires contain predictive high-dimensional features that enable their accurate classification. Our SVM approach was sufficiently robust to be applied across repertoire studies with different library preparations and high-throughput sequencing protocols, demonstrating its widespread applicability.

Figure 1 Immunogenomic analysis of public and private antibody repertoires.

(A) We asked whether there are immunogenomic differences that predetermine a clonal sequence’s (CDR3) public or private status within an immune repertoire. The public repertoire is composed of clones shared among at least two individuals (we also explored an alternative public clone definition, Figure 6F). Private clones are those distinct to each individual. We defined antibody and T cell clones based on 100% CDR3 (complementarity determining region 3) identity. For statistical power, we used six large-scale datasets (Supplementary Table 1) comprising different B-cell populations, species (humans, mice) and immune antigen receptors (B/T cell receptor).

(B) To answer our question, we decomposed public and private immune repertoires into conventional low-dimensional features (e.g., CDR3 amino acid usage, Figures 2 and 3) or novel high-dimensional features (CDR3 sequence decomposition into subsequences of length k (k-mers) separated by a gap of length m, Figures 4 and 5). Leveraging supervised machine learning (support vector machines), we tested whether low- and high-dimensional features can detect immunogenomic differences between public and private repertoires (see Methods) and consequently can be used for prediction of public and private status at single clone resolution.

(C) We found that low-dimensional features are poor predictors of public and private clone status. In contrast, we detected strong predictive immunogenomic differences, concentrated in the N1-D-N2 CDR3 subregion, between public and private clones using high-dimensional features. Thus, public as well as private clones each possess a class-specific high-dimensional immunofingerprint composed of class-specific subsequences that enables their classification with high accuracy. Our SVM approach was found to be generalizable across datasets produced in different laboratories with different library preparation and high-throughput sequencing (HTS) protocols.

Results

Public and private clone repertoires cannot be predicted by germline gene or amino acid usage

As the basis for elucidating the immunogenomic differences between public and private clones, we used a recently published high-throughput sequencing antibody repertoire dataset [16] (Dataset 1, Methods). This dataset contains ∼200 million full-length antibody V_H sequences derived from 19 different mice, stratified into key stages of B-cell differentiation: pre-B cells (preBC, IgM), naïve B cells (nBC, IgM), and plasma cells (PC, IgG). This dataset thus provided the important advantages of both high sequencing and biological depth (preBC and nBC represent antigen-inexperienced cells, while PC are post-clonal selection and expansion due to antigen exposure). Public clones, precisely defined here as CDR3 sequences (100% amino acid identity) occurring in at least two mice, were found to compose on average 15% (preBC), 23% (nBC), and 26% (PC) of antibody repertoires across B-cell stages (Figure 2A). As previously reported, we found that public clones are both biased to higher frequencies and are enriched in sequences from natural antibodies (Supplementary Figure 10) [17,24,34]. Throughout B-cell development, public and private clones used nearly identical V, D, J, VJ and VDJ germline genes (overlap >95%), which were at nearly identical frequencies in preBC and nBC (Spearman r≈1) and at varied frequencies in PC (Spearman r>0.5–0.8) (Figure 2B). Thus, neither public nor private clones showed any preferential germline gene usage. On average as well as at each CDR3 sequence position, higher frequency amino acids occurred more often in public clones (e.g.: A, C, D), whereas lower frequency amino acids could be found at higher frequency in private clones (e.g.: H, I, K) (Figure 2C, Supplementary Figure 2B). This observation held true across all B-cell stages (r = 0.5–0.76; p<0.05, Supplementary Figure 1A). Repertoire-wide absolute differences in amino acid usage between private and public clones were slight (0.2–1.4 percentage points, Figure 2C). 
To test whether these repertoire-level differences were sufficient to predictively discriminate between public and private clones on a single clone level, we employed supervised support vector machine learning (SVM) analysis (Methods, Figure 1B). For all SVM analyses in this study, in order to minimize classification bias, a dataset was constructed for each repertoire, which consisted of all public clones and an equal number of private clones from the repertoire (Supplementary Table 1) such that both public and private clones had identical CDR3 length distributions. Subsequently, the dataset constructed for SVM analyses was divided into 80% training sequences and 20% test sequences (Figure 1B, Methods). We found that amino acid usage was a suboptimal predictor of clonal status with a prediction accuracy ≤ 65% (Figure 2D) where prediction accuracy is defined as the mean (balanced accuracy) of specificity and sensitivity (see Methods), as described previously [30,35].
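The dataset construction described above (public clones defined by sharing, an equal number of length-matched private clones, and an 80/20 split) can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function names `label_clones`, `length_matched_dataset` and `split_train_test` are hypothetical, sampling is uniform at random, and the behavior when too few private clones of a given length exist (both classes are truncated) is an assumption.

```python
import random
from collections import defaultdict


def label_clones(repertoires):
    """Return (public, private) CDR3 sets; public = seen in >= 2 individuals.

    `repertoires` maps an individual's ID to the set of its CDR3 sequences.
    """
    owners = defaultdict(set)
    for individual, cdr3s in repertoires.items():
        for cdr3 in cdr3s:
            owners[cdr3].add(individual)
    public = {c for c, who in owners.items() if len(who) >= 2}
    return public, set(owners) - public


def length_matched_dataset(repertoire, public, rng):
    """Pair the public clones of one repertoire with equally many private
    clones of the same CDR3 length, drawn at random."""
    pub_by_len, priv_by_len = defaultdict(list), defaultdict(list)
    for cdr3 in sorted(repertoire):  # sorted for reproducible sampling
        target = pub_by_len if cdr3 in public else priv_by_len
        target[len(cdr3)].append(cdr3)
    pub_out, priv_out = [], []
    for length, pubs in pub_by_len.items():
        pool = priv_by_len.get(length, [])
        # Assumption: if too few private clones exist at this length,
        # both classes are truncated to the smaller count.
        n = min(len(pubs), len(pool))
        pub_out.extend(rng.sample(pubs, n))
        priv_out.extend(rng.sample(pool, n))
    return pub_out, priv_out


def split_train_test(sequences, rng, train_fraction=0.8):
    """Shuffle and split into training and test sequences."""
    shuffled = list(sequences)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]
```

Matching the CDR3 length distributions of both classes prevents a classifier from separating public and private clones by sequence length alone.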

Figure 2 Public and private clone repertoires differ predictively in neither germline gene usage nor amino acid composition

(A) Public clones represent 15–26% of murine antibody repertoires throughout B-cell ontogeny. Public clones were defined as being shared in at least two mice (see Methods).

(B) Overlap and Spearman correlation of V, D, J germline genes and their respective combinations (V-J, and V-D-J) between private and public clones by B-cell population.

(C) Relative amino acid composition of public (red) and private clones (black). Differences between public and private clones were not significant (Kolmogorov-Smirnov test, p>0.05).

(D) SVM-based discrimination of public and private clones based on CDR3 amino acid composition (see Methods). Balanced prediction accuracy was defined as the mean of specificity (detection rate of public clones) and sensitivity (detection rate of private clones). Barplots show mean±s.e.m.
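The balanced accuracy used throughout can be written out explicitly. The sketch below follows the convention stated in the captions (specificity as the detection rate of public clones, sensitivity as that of private clones); the function name `balanced_accuracy` and the string labels are illustrative assumptions.

```python
def balanced_accuracy(y_true, y_pred, public="public", private="private"):
    """Mean of the per-class detection rates of public and private clones."""
    pub_total = sum(1 for t in y_true if t == public)
    priv_total = sum(1 for t in y_true if t == private)
    if pub_total == 0 or priv_total == 0:
        raise ValueError("both classes must be present in y_true")
    pub_hits = sum(1 for t, p in zip(y_true, y_pred) if t == public and p == public)
    priv_hits = sum(1 for t, p in zip(y_true, y_pred) if t == private and p == private)
    specificity = pub_hits / pub_total    # detection rate of public clones
    sensitivity = priv_hits / priv_total  # detection rate of private clones
    return (specificity + sensitivity) / 2
```

Because each class contributes equally, a classifier that always predicts one class scores 0.5 regardless of class proportions, which is why 50% serves as the random baseline.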

Public and private clones do not differ predictively in CDR3 subregion length

Since public and private clones did not differ in germline gene usage, we asked whether they differed with respect to length and diversity of CDR3 subregions (V, N1, D, N2, J). The V, D and J subregions are derived from germline genes (IGHV, IGHD, IGHJ), while N1 and N2 represent the insertions introduced during the junctional recombination process (n- and p-nucleotides). Public clones in preBC and nBC repertoires possessed a relative V subregion length of 23–24% (Figure 3A), whereas private clones had slightly shorter V subregions (≈21%, p<0.05, Supplementary Figure 3A). The J subregion length behaved analogously (public: 40%, private: 36%) while the D subregion length did not differ between groups (public: 25%, private: 25%). We observed the largest difference between public and private clones in the relative length of N1 and N2 subregions with deviations of 36–46 percentage points from a 1:1 ratio (N1: public ≈6.5%, private ≈8.2%; N2: public ≈4.3%, private: ≈7.7%, p<0.05, Figure 3A, Supplementary Figures 3A, B). Conversely, PC CDR3 subregion lengths did not differ between public and private clones (with the exception of N1, which was slightly longer in public clones, Figure 3A, Supplementary Figure 3B).

Figure 3 CDR3 subregion length does not predict a clone’s public/private status.

(A) Normalized CDR3 subregion (V, N1, D, N2, J) lengths (median) of public and private clones by B-cell population.

(B) Frequency of clones (public, private) with at least one N1/N2 insertion or deletion occurrence by B-cell population.

(C) Overlap and Spearman correlation of CDR3 subregions and combinations thereof by B-cell population.

(D) Number of unique V, N1, D, N2, J subregions (species richness) of public and private clones by B-cell population. Species richness of private clones’ CDR3 subregions was obtained by accounting for size differences between private and public clone sets (bootstrapping, see Methods).

(E) SVM-based prediction of public and private clones based on V, N1, D, N2, J subregion composition (Figure 3A, see Methods). Balanced (prediction) accuracy was defined as the mean of specificity (detection rate of public clones) and sensitivity (detection rate of private clones). Barplots show mean±s.e.m.

Regardless of public or private designation, nearly all CDR3s (>94%) had at least one nucleotide insertion (N1 or N2) and at least one deletion (Figure 3B), thus only a very small portion of clones were “germline-like” having neither insertion nor deletion (≤4%, Supplementary Figure 4C,D). Furthermore, across B-cell populations both N1 and N2 insertions were present in >50% and >70% of public and private clones, respectively. Of note, N1 and N2 insertions showed no preferential selection of germline gene segments (IGHV, D, J) (Supplementary Figure 3D) and the mean length of the sum of insertions (N1+N2) did not correlate with V-D-J frequencies (Supplementary Figure 5A, Pearson r=0).

Deletion length was highest in D subregions (mean of 5’ and 3’ D-deletions: ≈7 nt, Supplementary Figure 3C) whereas it was lowest in V subregions (≈0.8 nt, Supplementary Figure 3C). Although private clones showed a higher number of deletions, differences between public and private clones were slight (max difference≈0.6 nt, Supplementary Figure 3C). Of interest, we were unable to detect an association between the lengths of insertions and deletions (Supplementary Figure 5B).

Although differences in CDR3 subregion length and occurrence of insertions and deletions were significant in preBC and nBC (Figure 3A, Supplementary Figures 3A–C, 4C, D, p<0.05), training an SVM based on CDR3 subregion length led to low prediction accuracy of public/private clone discrimination (balanced accuracy ≤ 68%, Figure 3E). This indicates that the slight differences observed in CDR3 subregion length on the repertoire level are not reliable for class prediction.

Public and private clones show differences in sequence composition

Since low-dimensional features (CDR3 a.a. and subregion properties) did not achieve high discrimination accuracy between public and private clones (Figures 2D, 3E), we investigated whether CDR3 sequence composition (potential dimensionality: >10¹³ different CDR3 sequences) differed between public and private clones. In preBC and nBC, V and J subregions neither differed in public and private clones with regard to unique sequences (>97%) nor frequency thereof (Spearman r>0.95, Figure 3C). Consequently, we observed no differences in V and J subregion diversity (number of unique V and J subregions) between public and private clones (Figure 3D). Although there was a major difference in diversity of N1, D, and N2 subregions between private and public repertoires, as the number of private preBC and nBC clones surpassed that of public clones by 1.6–4.5-fold (Figure 3D, p<0.05, size adjusted, see also Supplementary Figure 3A), N1, D, N2 subregion overlap between public and private clones was >66% (Figure 3C). In PC repertoires, diversity differences between public and private repertoires were minimal but overlap of subregions reached maximally 46% and Spearman correlation was consistently negative. In contrast to single subregions, combinations of subregions showed low overlap between public and private repertoires irrespective of B-cell population (e.g., N1-D-N2 overlap in nBC was ≈6%, Figure 3C), which is explained by a large combinatorial diversity (Supplementary Figure 4B, Supplementary Table 2) of CDR3 subregions. Thus, sequence composition differed substantially between public and private clones.

High-dimensional CDR3 sequence composition analysis predicts public and private clones with high accuracy

In order to test whether the detected differences in sequence composition were predictive, we utilized high-dimensional sequence kernels for SVM analysis [30]. We used the gappy-pair sequence kernel [30,36,37], which decomposes each CDR3 into subsequences of length k (k-mers) separated by a gap of length m (Figure 4A, see Methods). Applying this kernel function to all CDR3s of a given training dataset generates a feature matrix of dimension n × f, which serves as input for the SVM analysis: here, n is the number of CDR3s in the training dataset and f the number of features. By cross-validation, we selected the parameter combinations that resulted in the highest prediction accuracy: k=3, m=1 at the nucleotide level (potential feature diversity: 8192, Methods) and k=1, m=1 at the amino acid level (potential feature diversity: 800, Methods). On both the nucleotide and the amino acid level, public and private clones in preBC and nBC could be classified with ≈80% accuracy, with very low variation across mice (Figure 4A, Supplementary Figure 6A, E). In order to validate the robustness of the chosen public clone definition, we showed that the SVM was incapable of separating public from public and private from private clones across individuals (balanced accuracy < 50%, Supplementary Figure 6D). In addition, we validated that the high prediction accuracy was maintained for an alternative and more stringent definition for public clones (balanced accuracy = 83–84%, Supplementary Figure 6F). In order to quantify the statistical significance of our high-dimensional SVM approach, we confirmed that the balanced accuracy was close to random (50%) when shuffling CDR3 nucleotide and amino acid sequences (Supplementary Figure 6B) and when shuffling public and private labels across clones (Supplementary Figure 6C).
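To make the gappy-pair decomposition concrete, the following sketch counts, for one sequence, all pairs of k-mers separated by a gap of 0 to m positions, and computes the size of the resulting feature space. It is an illustration under stated assumptions rather than the implementation used in this study: the names `gappy_pair_features` and `feature_space_size` are hypothetical, features are represented as (left k-mer, gap length, right k-mer) tuples, and no kernel normalization is applied.

```python
from collections import Counter


def gappy_pair_features(seq, k, m):
    """Count pairs of k-mers separated by a gap of 0..m positions."""
    counts = Counter()
    for gap in range(m + 1):
        span = 2 * k + gap  # total positions covered by one feature
        for i in range(len(seq) - span + 1):
            left = seq[i:i + k]
            right = seq[i + k + gap:i + span]
            counts[(left, gap, right)] += 1
    return counts


def feature_space_size(alphabet_size, k, m):
    """Maximal number of distinct features: |alphabet|^(2k) * (m + 1)."""
    return alphabet_size ** (2 * k) * (m + 1)
```

One row of the n × f feature matrix corresponds to the counts returned for one CDR3, laid out over a fixed ordering of all possible features. With the parameters reported above, `feature_space_size(4, 3, 1)` gives 8192 for nucleotide sequences and `feature_space_size(20, 1, 1)` gives 800 for amino acid sequences.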

Figure 4 Public and private clones can be predicted with 80% accuracy using high-dimensional CDR3 sequence decomposition.

(A) Specificity (detection rate of public clones), sensitivity (detection rate of private clones) and balanced accuracy (mean of specificity and sensitivity) for public vs. private clones SVM discrimination by B-cell population. For each repertoire, a dataset composed of equal numbers of public and private clones (nucleotide sequences, length equilibrated) was assembled (Methods, Supplementary Table 1). Subsequently, as displayed in the insert, the gappy pair kernel function decomposes each CDR3 sequence into features made of two k-mers separated by a gap of maximal length m. The maximal number of features is 4^(2×k)×(m+1)=8192 for nucleotide sequences (k=3, m=1) and 20^(2×k)×(m+1)=800 for amino acid sequences (k=1, m=1). Based on this decomposition, a feature matrix of dimension #CDR3s times #Features is constructed. Each row of the feature matrix thus corresponds to a feature vector for a CDR3 and contains counts of each feature as it occurs in the CDR3 sequence. These feature vectors serve as the input to the linear SVM analysis. The optimal parameter combinations (k=3/m=1 for nucleotide, k=1/m=1 for amino acid sequences) were determined by cross-validation on the training dataset (Methods).

(B) Prediction accuracy of public vs. private clones of human naïve and memory B-cells, and murine T cells. SVM parameters were identical to those used in (A).

(C) Public clones were accumulated across mice by B/T-cell populations (nBC, CD4), strain (nBC: C57BL/6, BALB/c, pet) or across B-cell populations (human naïve and memory B cells) in order to subsequently perform SVM-based classification as described in (A). Sizes of aggregated SVM-datasets ranged between ≈5×10⁴ (CD4 T cell) and 3×10⁶ (nBC: C57BL/6, BALB/c, pet) clones. ROC curves show excellent classification results (AUC [area under the ROC curve] ≈ 0.90).

(D) SVM-based prediction of public vs. private clones across experimental studies. NBC repertoires of Dataset 1 (mean size: ≈180,000 clones) were used to predict public and private clones in the B2-B-cell repertoires of Dataset 4 (mean size: ≈2’400 clones, Supplementary Table 1). Barplots show mean±s.e.m.

Furthermore, we confirmed that the differences in immunogenomic composition between public and private clones were not exclusively mouse-strain-specific (C57BL/6); we replicated a balanced accuracy of ≈80% with repertoires from BALB/c and pet shop mice (Datasets 2, 3, Supplementary Figure 6A). Analogously, public and private clones could be discriminated with >80% accuracy in human B-cell repertoires (Figure 4B, Dataset 5). Finally, we showed that our approach also demonstrated high classification accuracy between public and private clones of mouse TCR Vβ repertoires (balanced accuracy = 74%, Figure 4B, Dataset 6).

Successful classification within each individual (mouse or human) proved that fundamental and stereotypical differences between public and private classes do indeed exist. However, theoretically, these differences could be specific to each individual and not generalizable. In order to exclude this possibility, we accumulated public and private clones across individuals into datasets of up to 3×10⁶ unique clonal sequences and showed that classification accuracy was maintained (Figure 4C), reaching a maximum in human naïve and memory B cells (balanced accuracy = 83%, AUC [area under the ROC curve] = 0.90). These results signified that the same set of features used to predict public and private clones within one individual is sufficient for prediction across individuals of the same species. Thus, the high-dimensional features provided by sequence kernels (800 for amino acid and 8192 for nucleotide) and learned on the repertoire level, were sufficient and generalizable to discriminate public from private clones in both humans and mice on a per clonal sequence basis (single clone resolution).

Prediction by CDR3 sequence composition is dependent on dataset size and applicable across studies

Our high-dimensional sequence-composition-based SVM approach was unable to predict public and private clones in PCs (balanced accuracy = 50%, Figure 4A, Dataset 1). With respect to unique CDR3s, the PC SVM-dataset was 3 to 4 orders of magnitude smaller than that of preBC and nBC (Supplementary Table 1, Dataset 1); therefore we tested whether the low accuracy was due to sample size. We performed SVM analysis on datasets ranging in size from 100 to 230’000 unique CDR3 sequences (Supplementary Figure 7B) and found that prediction accuracy was indeed a function of sample size, increasing from 56% (100 clonal sequences) to 80% (230’000 clonal sequences). Thus, small sample size may explain the lower prediction accuracies observed in the PC (IgG) dataset. In further support of this hypothesis, we found that in a dataset of human memory B-cells (mixed IgM, IgG) (Dataset 5) that was 3 orders of magnitude larger than the PC dataset, we were able to achieve >80% accuracy (Figure 4C), suggesting that prediction of public clones may also be possible for antigen-experienced B-cell populations and is thus not limited to antigen-inexperienced ones.

Since we observed that dataset size was important for reaching high prediction accuracy (Supplementary Figure 7B), we asked whether cross-dataset meta-analysis is feasible, that is, whether large datasets can be leveraged as training datasets for public and private clone prediction in other (smaller) datasets obtained from studies using slightly different library preparation and high-throughput sequencing protocols. To answer this question, we investigated the prediction accuracy of the sequence-composition-based SVM classifier trained on Dataset 1 (nBC B2-B-cell population), applied to a test dataset 100 times smaller (177’197 vs 1519 sequences), consisting of repertoires from various C57BL/6 B2-B-cell populations [20] (Dataset 4, Supplementary Table 1). By using the model based on the larger dataset (Dataset 1), prediction accuracy could be improved by up to 7 percentage points (76%–77% vs. 69%–73%, Figure 4D), which neared the prediction accuracy within Dataset 1 (Figure 4A). Thus, sequence-kernel-based SVM models can be effectively trained on large (openly accessible) datasets, enabling robust predictive performance for meta-analysis across studies.

Stereotypical immunogenomic differences between public and private clones are concentrated in the N1-D-N2 subregions

To identify the subregions that contributed most to classification accuracy, we performed sequence-kernel-based SVM on each CDR3 subregion separately as well as all ten relevant combinations thereof (Figure 5A). Classification based on each single or paired CDR3 subregions did not result in high prediction accuracy (balanced accuracy ≤ 67%, Figure 5A). Among the partial combinations, it was the N1-D-N2 subregion combination that achieved maximum prediction accuracy (74%, Figure 5A, Supplementary Figure 7A) approaching that of the full combination (V-N1-D-N2-J, ≈80%), indicating that the sequence composition between public and private clones differed most within N1-D-N2 subregions. J subregions contributed least to prediction accuracy as V-N1-D (balanced accuracy≈73%) and N1-D-N2 (balanced accuracy≈73%) surpassed D-N2-J (balanced accuracy≈70%, Figure 5A). In order to confirm that subregion differences between public and private clones were largely dictated by the N1, D and N2 subregions and not within the overhang regions linking N1, D, and N2, we showed that subregion shuffling impacted prediction accuracy only negligibly (Supplementary Figure 6E). Visually and numerically, we confirmed the N1, D, and N2 subregions to be the drivers of public and private clone discrimination by constructing prediction profiles, which quantify for each sequence the contribution of each position to the decision value (public, private). Differences in contribution to the decision value were highest in the sequence positions belonging to the N1, D, and N2 subregions (Figure 5B, Supplementary Figure 9). To summarize, our results indicate that the N1, D, N2 subregions of both public and private clone sequences contain class-specific predictive subsequences (k-mers) that enable the prediction of their status (public, private) with high accuracy.
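The idea behind a prediction profile can be sketched for a linear model over gappy-pair features. The sketch assumes that the weight of each feature occurrence is split evenly across the 2k positions of its two k-mers and that gap positions receive no contribution; the exact weighting used in this study may differ, and the name `prediction_profile` as well as the tuple-keyed `weights` dictionary are hypothetical.

```python
def prediction_profile(seq, weights, k, m):
    """Per-position contribution to a linear decision value.

    `weights` maps (left k-mer, gap length, right k-mer) to a model weight;
    features absent from the mapping contribute zero.
    """
    profile = [0.0] * len(seq)
    for gap in range(m + 1):
        span = 2 * k + gap
        for i in range(len(seq) - span + 1):
            feature = (seq[i:i + k], gap, seq[i + k + gap:i + span])
            # Assumption: each occurrence spreads its weight evenly over
            # the 2k positions covered by its two k-mers.
            share = weights.get(feature, 0.0) / (2 * k)
            for j in range(i, i + k):
                profile[j] += share
            for j in range(i + k + gap, i + span):
                profile[j] += share
    return profile
```

Under these assumptions, the profile values of a sequence sum to its decision value without the bias term, so positions with large absolute values are those that drive the classification. With the sign convention of Figure 5B, negative positions would count towards a public prediction and positive ones towards a private prediction.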

Figure 5 The N1-D-N2 subregions dominate the classification accuracy of public clones.

(A) Balanced accuracy of public and private clone discrimination using sequence-kernel-based SVM analysis. For each combination of CDR3 subregions, gappy pair kernel parameters (k, m, cost) were determined by cross-validation. Barplots show mean±s.e.m.

(B) Exemplary visualization of prediction profiles of one test dataset (nBC) of CDR3s (rows) of length 39 (nt). Prediction profiles were computed as means of feature weights at each CDR3 position (1–39, see Methods). Positions colored red (<0) count towards “public” prediction of the respective CDR3s, whereas black-colored ones (>0) bias prediction towards the “private” clone status. Barplots indicate the percentage of private (black) or public (red) predicting weights at each of the 39 positions. Color bars indicate median length of V (red), N1 (orange), D (grey), N2 (purple), J subregions (blue, Figure 3A). Prediction profiles across all CDR3 lengths as well as quantitative prediction profile analysis are given in Supplementary Figures 8 and 9, respectively.

Discussion

We have performed a comprehensive immunogenomic decomposition of immune repertoires, which led us to conclude that low-dimensional features (Figures 2, 3, S1, S3–5) – CDR3 subregion length, germline gene usage, amino acid usage (Figures 2D, 3E) – were insufficient in detecting the immunogenomic shift between public and private clonal repertoires. In contrast, a high-dimensional sequence composition (sequence-kernel) approach could predict the public and private status of antibody clones within any individual with 80% accuracy. This CDR3 sequence-composition-based approach was generalizable across individuals, B-cell populations, mouse strains, species (mouse, human), immune cell types (B-cell, T-cell), and datasets produced in different laboratories (Figures 4B–D). While the appropriate definition of “public” clones is subject to current debate [5], the public clone definition adopted in this study has been used previously [5,22,38], and is the most lenient one possible. In fact, we showed that prediction accuracy only increased when increasing the stringency of the public clone definition (Supplementary Figure 6F). The fact that our SVM approach is robust to several public clone definitions suggests that a consensus definition may not be needed.

Sequence-kernel-based machine learning analysis revealed stereotypical and predictive high-dimensional immunogenomic CDR3 subregion (N1, D, N2) composition biases (high-dimensional fingerprints) specific to both public and private clones, respectively (Figure 5). Those fingerprints achieved up to 74% prediction accuracy when isolated from V and J regions (Figure 5A, Supplementary Figure 7A). Shuffling CDR3 subregions (V, N1, D, N2, J) impacted prediction accuracy only negligibly (Figure 5B, Supplementary Figures 6E, 9), confirming that N1, D, and N2 held the highest amount of class-specific information [25,39]. Of note, although the relative size of the human CDR3 N1-D-N2 subregion is larger than that of mice (≈65% [40] vs. 42% in mice, Figure 3A) with the N1-D-N2 subregion being the main amplifier of sequence diversity (Supplementary Table 2) [8,25], identical feature space sizes led to identical prediction accuracies for both species (Figure 4B). Thus, potential species-specific differences in sequence length and diversity did not impact the prediction accuracy of our approach. More generally, it is remarkable that feature spaces of dimension <10⁴ do not only suffice for detecting sub-repertoire clonal expansion-driven changes in individuals of different immunological status [29,35] but also provide ample combinatorial flexibility in defining fingerprints that discriminate whole-repertoire properties (public, private) within a >10¹³-dimensional space (Supplementary Table 2, [8,10]). This may point towards evolutionarily conserved traces in the immunogenome; for example, we found that murine public clones were enriched in natural antibody specificities (Supplementary Figure 10B).

Our results indicate that statistical significance does not necessarily translate into predictive performance: although CDR3 subregion length differed significantly between public and private clones (Figure 3A, Supplementary Figures 3A–C, 4C, D), the prediction accuracy of the low-dimensional SVM model based on CDR3 subregion length (Figure 3D) remained inferior to that of the high-dimensional one based on the actual sequence composition (Figure 4A). Furthermore, previous probabilistic work on modeling repertoire diversity indicated a broad range in clonal sequence generation probabilities – with (T cell) public clones suggested to be biased towards higher generation probabilities [24]. Corroborating these observations, we found that B cell public clones are more likely to have higher clonal abundance (Supplementary Figure 10A) – in general, however, public clones were distributed throughout the entire frequency spectrum from high to very low clonal frequency (Supplementary Figure 10A) [34]. Instead of attributing a generation probability to each clonal sequence, our work complements previous probabilistic work by leveraging a high-dimensional, repertoire-level trained classifier for binary classification on a per-sequence basis. It is this sequence-composition-based machine learning approach that led to the unexpected finding that private clones – which were thought to be mostly stochastically generated – also possess a high-dimensional fingerprint (predictive immunogenomic features).

Our SVM-driven approach enables rapid and accurate separation of large repertoire datasets into public and private repertoires. We note that mouse and human trained SVM-classifiers may not only be applied to experimental but also to synthetic repertoire data [41], which could pave the way towards the construction of a comprehensive atlas of human and mouse public clones. The high computational scalability of our machine learning approach – tested with as many as 3×10⁶ public and private sequences (Figure 4C) – allowed us to establish that the dataset size is a deciding factor for high prediction accuracy [33]: (i) In simulations, prediction accuracy increased by ≈30 percentage points when increasing the dataset size by 4 orders of magnitude from ≈10¹–10² to ≈10⁵ clonal sequences (Supplementary Figure 7B). (ii) In experimental data, increasing training dataset size by 1–2 orders of magnitude (sequence data generated in a different lab using different experimental library preparation methods) increased prediction accuracy by up to 7 percentage points, suggesting large-scale cross-study detection of public clones is possible. (iii) The high prediction accuracy of human (antigen-selected) public and private memory B-cell clones (Figure 4B) suggested that the low accuracy of (antigen-selected) PC (IgG) repertoires (Figure 4A) may be due to small dataset size (Supplementary Table 1). More generally, we speculate that the prediction accuracies reported here merely represent lower bounds; future studies that combine (i) advanced experimental and computational error correction methodologies (e.g., unique molecular identifiers) [42–44], (ii) high sampling and sequencing depth [1] and (iii) novel sequence-based deep learning approaches [45–47] may lead to even higher prediction accuracies.

To conclude, the existence of high-dimensional immunogenomic rules shaping immune repertoire diversity in a predictable fashion, leading to clones with higher occurrence probability within a population, highlights the potential of public clones to be a promising target for rational vaccine design and targeted immunotherapies [23,34,48,49].

Methods

Immune repertoire high-throughput sequencing datasets

We conducted our analysis on six high-throughput immune repertoire sequencing datasets, all of which are characterized below. Quality and read statistics can be found in the respective publications.

Dataset 1

Murine B-cell origin (C57BL/6J): Sequencing data were generated by Greiff and colleagues [16]. B-cells were isolated from four C57BL/6 cohorts (n=4–5) including untreated mice and mice prime-boost immunized with protein antigens. Cells were sorted into the subsets pre-B cells (preBC), naïve B cells (nBC) and plasma cells (PC) by flow cytometry. Cell numbers per mouse were: 750,000 (preBC), 1,000,000 (nBC) and 90,000 (PC). RNA was isolated from cells, and antibody libraries were prepared by RT-PCR and sequenced using the Illumina MiSeq platform (2×300 bp paired-end). The sequencing data have been deposited online (http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-5349/) along with full experimental details and were preprocessed using MiXCR for VDJ-annotation, clonotype formation by CDR3 and error correction as described previously [16,50]. Briefly, for downstream analyses, functional clonotypes were only retained if they (i) were composed of at least 4 amino acids and (ii) had a minimal read count of 2 [51,52]. Public clones were defined as those clones that occurred in at least two different individuals within the same B-cell population and cohort.

Dataset 2

Murine B-cell origin (BALB/c): Sequencing data were generated by Greiff and colleagues [16] and have been deposited online (http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-5349/) with full experimental details. Briefly, naïve B-cells (1,000,000 cells per mouse) from 4 unimmunized BALB/c mice were isolated using the sorting panel from Dataset 1 and antibody libraries were prepared and sequenced analogously to Dataset 1. Data preprocessing was performed analogously to Dataset 1. Public clones were defined as those clones that occurred at least twice across mice.

Dataset 3

Murine B-cell origin (Pet Shop mice): Sequencing data were generated by Greiff and colleagues [16] and have been deposited online (http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-5349/) with full experimental details. Briefly, naïve B-cells (≈671,000 cells per mouse) from three pet shop mice were isolated, and library preparation, sequencing, and data preprocessing were performed analogously to Dataset 1. Public clones were defined as those clones that occurred at least twice across mice.

Dataset 4

Murine B-cell origin (C57BL/6J): Sequencing data were published by Yang and colleagues [20]. Mature B cells were extracted from C57BL/6J mice and sorted (1–2×10⁴ per cell population) into developmentally distinct subsets: splenic follicular B-cells (FOB, n=5), marginal zone B-cells (MZB, n=7), peritoneal B2-B-cells (n=5) and B-1a B-cells (n=43). Data preprocessing was performed analogously to Dataset 1. Public clones were defined as those clones that occurred at least twice across mice of a given B-cell population.

Dataset 5

Human B-cell origin: Sequencing data of naïve and memory B-cells from three healthy donors were published by DeWitt and colleagues [13] and downloaded already preprocessed from http://datadryad.org/resource/doi:10.5061/dryad.35ks2. Public clones were defined as those clones that occurred at least twice across individuals within a given B-cell population. Cell numbers of naïve and memory B cells were 2–4×10⁷ and 1.5–2×10⁷, respectively.

Dataset 6

Murine T-cell origin: Sequencing data were published by Madi and colleagues [17]. CD4 T cells were isolated from 28 mice in three cohorts: untreated (n=12), immunized with complete Freund’s adjuvant (CFA, n=7), or immunized with CFA and ovalbumin (n=9). Data preprocessing was performed using MiXCR for annotation and error correction as described previously [16,50]. Public clones were defined as those clones that occurred at least twice across mice of a given cohort.

Determination of statistical significance

Significance was tested using the Wilcoxon rank-sum test if not indicated otherwise. Where applicable, significance of correlation coefficients was tested using the R function cor.test() with default parameters.

Statistical analysis and plots

Statistical analysis was performed using R [53] and Python [54]. Graphics were generated using the R packages ggplot2 [55], RColorBrewer [56], and ComplexHeatmap [57]. Parallel computing of SVM analyses was performed using the R packages BatchJobs [58] and doParallel [59].

Definition of a clone

For all analyses, clones were defined by 100% amino acid sequence identity of CDR3 regions [1,16,51]. CDR3 regions were annotated and defined by MiXCR software [50] according to the nomenclature of the Immunogenetics database (IMGT) [60].
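The clone and public clone definitions used throughout can be illustrated with a minimal Python sketch. It assumes a hypothetical input layout (a dictionary mapping an individual to its CDR3 amino acid sequences); the function name and the toy sequences are illustrative only and are not taken from the datasets.

```python
from collections import Counter

def label_public_private(repertoires):
    """Map each CDR3 amino acid sequence to 'public' or 'private'.

    repertoires: dict mapping an individual's ID to an iterable of CDR3
    amino acid sequences (hypothetical input layout).
    """
    occurrences = Counter()
    for cdr3s in repertoires.values():
        # A clone is a unique CDR3 amino acid sequence (100% identity),
        # so each individual contributes each sequence at most once.
        occurrences.update(set(cdr3s))
    # Public: seen in at least two individuals; private: seen in one.
    return {cdr3: ("public" if n >= 2 else "private")
            for cdr3, n in occurrences.items()}

# Toy example with made-up sequences.
example = {
    "mouse_1": ["CARDYW", "CARGGYW", "CARDYW"],
    "mouse_2": ["CARDYW", "CTRSYW"],
    "mouse_3": ["CARGGYW", "CAKFDYW"],
}
labels = label_public_private(example)
```

In the datasets above, the comparison would additionally be restricted to individuals of the same B-cell population or cohort, which here corresponds to calling the function once per group.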

Quantification of overlap

As defined previously [16], the percentage of clones shared between two repertoires X and Y was calculated as $\mathrm{overlap}(X, Y) = \frac{|X \cap Y|}{\min(|X|, |Y|)} \times 100$, where |X| and |Y| are the clonal sizes (number of unique clones) of repertoires X and Y. A repertoire was mathematically defined as a set of unique clones.
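A percentage overlap between two repertoires, each treated as a set of unique clones, might be computed as in the following sketch. Normalizing by the smaller of the two clonal sizes is an assumption made here; the authoritative definition is the one given in [16]. The function name is hypothetical.

```python
def percent_overlap(x, y):
    """Percentage of clones shared between repertoires x and y.

    Assumption: the shared clone count is normalized by the size of the
    smaller repertoire.
    """
    x, y = set(x), set(y)  # a repertoire is a set of unique clones
    if not x or not y:
        raise ValueError("both repertoires must contain at least one clone")
    return 100.0 * len(x & y) / min(len(x), len(y))

# Toy repertoires sharing two clones.
shared = percent_overlap({"A", "B", "C"}, {"B", "C", "D", "E"})
```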

Junction Analysis

V, N1, D, N2 and J subregion annotation of sequences was performed using IMGT/HighV-Quest [61] (after initial preprocessing by MiXCR) [50]. Deletions were determined by finding the longest common substring between the germline genes and the V, D and J subregions identified in the CDR3 sequences.
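The longest-common-substring step can be sketched with standard dynamic programming. The germline and observed strings below are made up for illustration, and deriving the 5' and 3' deletion lengths from the position of the match is an assumption about how such a match could be used, not a description of the original pipeline.

```python
def longest_common_substring(a, b):
    """Return the longest contiguous substring shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]

# Hypothetical germline segment and the subregion observed in a CDR3.
germline = "ACGTACGGTAGC"
observed = "TTACGGTAGA"
match = longest_common_substring(germline, observed)

# Germline nucleotides flanking the match are counted as deleted.
start = germline.find(match)
deleted_5prime = start
deleted_3prime = len(germline) - start - len(match)
```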

Estimation of the technological coverage of V, N1, D, N2, J regions

To estimate the technological coverage of each region (V, N1, D, N2, J), bootstrapping was conducted (Supplementary Figure 2). Briefly, 5, 25, 50, 75 and 100% of the full diversity of each region were sampled. Subsequently, the number of unique sequences per region present in the sample was compared to the total number of unique sequences.
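One way to picture this subsampling procedure is the sketch below. It assumes that sampling is done without replacement from the list of observed region sequences and that a single draw is taken per fraction; neither detail is stated in the text, and the function name and toy data are hypothetical.

```python
import random

def coverage_curve(sequences, fractions=(0.05, 0.25, 0.5, 0.75, 1.0), seed=0):
    """Fraction of all unique sequences recovered at each sampling depth.

    sequences: list of observed region sequences (with repeats).
    """
    rng = random.Random(seed)
    total_unique = len(set(sequences))
    curve = {}
    for frac in fractions:
        n = max(1, round(frac * len(sequences)))
        sample = rng.sample(sequences, n)  # assumption: no replacement
        curve[frac] = len(set(sample)) / total_unique
    return curve

# Toy N1 region list: a few abundant sequences and several rare ones.
n1_regions = ["A"] * 50 + ["AC"] * 30 + ["ACG"] * 15 + ["G", "GT", "T", "TT", "C"]
curve = coverage_curve(n1_regions)
```

A curve that plateaus well before 100% sampling would indicate that the region's diversity has been captured at the given sequencing depth.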

Determination of Shannon Evenness

The Shannon Evenness was calculated as previously described [35]. Briefly, we calculated the Hill-diversity for $\alpha = 1$, ${}^{\alpha=1}D = \exp\left(-\sum_{i=1}^{n} f_i \ln f_i\right)$, for a given frequency distribution ($f = (f_1, \ldots, f_n)$, enumeration of the abundance of each subregion (combination)) of V, N1, D, N2, J subregions or combinations thereof. Subsequently, we obtained the Shannon Evenness ${}^{\alpha=1}E$ by normalizing ${}^{\alpha=1}D$ by the respective total number of V, N1, D, N2, J regions or combinations thereof ($n$) in the given repertoire: ${}^{\alpha=1}E = {}^{\alpha=1}D / n$.
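As a worked illustration, the Shannon Evenness can be computed from abundance counts as sketched below. The sketch relies on the standard result that the Hill diversity at alpha = 1 equals the exponential of the Shannon entropy; taking n as the number of distinct regions with non-zero abundance is an assumption, and the function name is hypothetical.

```python
import math

def shannon_evenness(counts):
    """Shannon Evenness of an abundance distribution.

    counts: abundance of each distinct subregion (or combination).
    """
    total = sum(counts)
    freqs = [c / total for c in counts if c > 0]
    entropy = -sum(f * math.log(f) for f in freqs)
    hill_d1 = math.exp(entropy)   # Hill diversity at alpha = 1
    return hill_d1 / len(freqs)   # normalize by number of distinct regions

even = shannon_evenness([5, 5, 5, 5])     # all regions equally abundant
skewed = shannon_evenness([97, 1, 1, 1])  # one dominant region
```

An evenness close to 1 corresponds to a near-uniform usage of regions, whereas values close to 1/n indicate dominance by a single region.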

Estimation of the theoretical nucleotide diversity of the murine naïve clonal repertoire

The extent to which the entirety of the subregions V, N1, D, N2, J discovered in preBC and nBC of Dataset 1 covered any preBC/nBC repertoire was quantified by species accumulation curves as previously described [16]. Briefly, we defined the repertoire coverage (C_i) of a given CDR3 subregion (R_i) as the percentage overlap of its set of unique regions {R}_i with the set of regions contained in all previously accumulated repertoires, where i ∈ {1, …, m} with m being the total number of preBC and nBC repertoires (m = 38). To infer the number of subregions necessary for any given coverage, we used non-linear regression analysis with an exponential fit [62] of the coverage against the number of unique subregions contained within the accumulated repertoires, with s and b being the parameters to be inferred. For ≥95% coverage, this is the estimated size of each murine naïve V, N1, D, N2, J subregion repertoire. We opted to report the coverage at 95% (Supplementary Table 2, column 2) to counter the effect of rare clones as described previously [16]. The product of the extrapolated coverage at 95% of each region (Supplementary Table 2) is the theoretical nucleotide diversity of the murine naïve clonal repertoire.
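The accumulation step that precedes the regression can be sketched as follows. Expressing the coverage as the share of a repertoire's unique regions already present in the previously accumulated repertoires is an assumption about the normalization, and the exponential fit itself is not reproduced here. Names and toy data are hypothetical.

```python
def accumulated_coverage(repertoires):
    """Coverage of each repertoire by all previously accumulated ones.

    repertoires: ordered list of non-empty collections of unique regions.
    Returns (coverage percentages for repertoires 2..m,
             number of unique regions after accumulating all repertoires).
    """
    seen = set()
    coverage = []
    for regions in repertoires:
        regions = set(regions)
        if seen and regions:
            # Assumption: normalize by the size of the current repertoire.
            coverage.append(100.0 * len(regions & seen) / len(regions))
        seen |= regions
    return coverage, len(seen)

# Toy example with four repertoires of made-up regions.
reps = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"c", "d"}]
coverage, n_unique = accumulated_coverage(reps)
```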

Determination of private clones with high similarity to public clones

For each public clone, the number of private clones within 1 amino acid edit distance was enumerated (Figure 6B). Edit distance was determined using the stringdist() function (distance metric: Levenshtein distance) from the stringdist R package [63] as well as igraph [64].
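The enumeration of near-identical private clones can be illustrated in Python with a plain Levenshtein implementation. This is a sketch of the same idea, not the stringdist-based R code used in the analysis; function names and sequences are hypothetical, and the all-against-all comparison is only practical for small inputs.

```python
def levenshtein(a, b):
    """Levenshtein (edit) distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def private_neighbours(public, private, max_dist=1):
    """For each public clone, count private clones within max_dist edits."""
    return {p: sum(1 for q in private
                   # A length difference above max_dist rules a pair out.
                   if abs(len(p) - len(q)) <= max_dist
                   and levenshtein(p, q) <= max_dist)
            for p in public}

# Toy example with made-up CDR3 sequences.
neighbours = private_neighbours(
    ["CARDYW"], ["CARDFW", "CARDYYW", "CTRSYW", "CARW"])
```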

Support Vector Machine (SVM) analysis

In order to classify clones into public and private classes, a supervised learning approach was chosen in the form of a support vector machine (SVM) model. As input for all SVM analyses, CDR3-length equilibrated datasets were built for each sample (Supplementary Table 1). Briefly, for each sample, all public clones were paired in equal numbers with private clones of the same sample such that both public and private clones followed identical CDR3 length distributions. SVM analysis was performed using kernel-based analysis of biological sequences (KeBABS) [30] and sklearn [65], both of which are described in more detail below. For all SVM analyses, each dataset was split into training (80%) and test subset (20%). Cross-validation and SVM training was performed on the training dataset and class prediction on the test dataset. Prediction accuracy of class discrimination was quantified by calculating the balanced accuracy $\mathrm{BACC} = \frac{1}{2}(\mathrm{specificity} + \mathrm{sensitivity})$, where specificity was defined as $\frac{TN}{TN + FP}$ and sensitivity as $\frac{TP}{TP + FN}$ (TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative). Additionally, AUC (area under the curve, ROC curve) was calculated using the KeBABS R package [30]. An AUC value of 1 means perfect prediction accuracy (BACC = 100%), while an AUC value of 0.5 (BACC = 50%) is equivalent to random guessing.
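Two steps of this setup, CDR3-length equilibration and the balanced accuracy, can be sketched in Python. The sketch assumes that, per CDR3 length, equal numbers of public and private clones are drawn at random and that the smaller class at a given length limits the draw, which is a simplification of the pairing described above. Function names and sequences are hypothetical.

```python
import random
from collections import defaultdict

def length_matched_pairs(public, private, seed=0):
    """Draw equal numbers of public and private clones per CDR3 length."""
    rng = random.Random(seed)
    by_len = defaultdict(lambda: ([], []))
    for s in public:
        by_len[len(s)][0].append(s)
    for s in private:
        by_len[len(s)][1].append(s)
    pub_out, priv_out = [], []
    for length in sorted(by_len):
        pubs, privs = by_len[length]
        n = min(len(pubs), len(privs))  # assumption: smaller class limits n
        pub_out += rng.sample(pubs, n)
        priv_out += rng.sample(privs, n)
    return pub_out, priv_out

def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity and specificity from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

# Toy data: made-up CDR3 sequences and confusion-matrix counts.
pub, priv = length_matched_pairs(
    ["CARW", "CAKW", "CARDYW"],
    ["CTRW", "CARGYW", "CASDYW", "CARDGYW"])
bacc = balanced_accuracy(tp=80, tn=70, fp=30, fn=20)
```

Because both classes then share one length distribution and have equal size, a classifier cannot reach a balanced accuracy above 50% from CDR3 length or class frequency alone.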

KeBABS support vector machine analysis

To discriminate public and private clones based on CDR3 sequence, we used the KeBABS R package [30], which implements kernel-based analysis of biological sequences. For all datasets, we used the position-independent gappy pair kernel [36,37], which represents each sequence by pairs of subsequences of length k separated by gaps of maximal length m (Figure 4A). For the analysis of nucleotide sequences the parameters were set to k=3, m=1, C = 10, whereas the analysis of amino acid sequences was performed using parameters k=1, m=1, C = 100 (as determined by cross-validation). The cost parameter C sets the cost for the misclassification of a sequence. The maximal number of possible features used in the gappy pair kernel is $4^{2k} \times (m + 1) = 8{,}192$ for nucleotide sequences and $20^{2k} \times (m + 1) = 800$ for amino acid sequences.
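The feature representation behind the gappy pair kernel, and the feature space sizes quoted above, can be reproduced with a short Python sketch. It is an illustration of the feature definition only and is not the KeBABS implementation; the tuple encoding of a feature and the function names are choices made for this example.

```python
from collections import Counter

def gappy_pair_features(seq, k, m):
    """Count position-independent gappy pair features of a sequence.

    A feature is (left k-mer, gap length g, right k-mer) with 0 <= g <= m;
    the g positions between the two k-mers are ignored.
    """
    counts = Counter()
    for g in range(m + 1):
        span = 2 * k + g
        for i in range(len(seq) - span + 1):
            left = seq[i:i + k]
            right = seq[i + k + g:i + span]
            counts[(left, g, right)] += 1
    return counts

def feature_space_size(alphabet_size, k, m):
    """Maximal number of distinct gappy pair features."""
    return alphabet_size ** (2 * k) * (m + 1)

# Amino acid setting (k=1, m=1) on a made-up CDR3 sequence.
features = gappy_pair_features("CARDYW", k=1, m=1)
n_nt = feature_space_size(4, k=3, m=1)    # nucleotide setting
n_aa = feature_space_size(20, k=1, m=1)   # amino acid setting
```

Each sequence is thereby mapped to a sparse count vector in a fixed-size feature space, irrespective of CDR3 length, which is what allows a linear SVM to be trained on it.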

Prediction Profiles

Prediction profiles were computed from feature weights as described by Palme and colleagues [30,31,37]. Prediction profiles quantify the contribution of each sequence position to the decision value (public, private). Thus, prediction profiles provide improved biological interpretability of the learning results compared to single feature weights because those individual positions or sequence stretches that drive classification accuracy most become visible [30].

Sklearn support vector machine analysis

For public vs. private discrimination based on amino acid and V, N1, D, N2, J composition, the sklearn implementation of SVM [65] for Python [54] was employed with the cost parameter set at C=10 as determined by cross-validation.

Acknowledgments

We thank Dr. Christian Beisel, Manuel Kohler, Ina Nissen and Elodie Burcklen from the Genomics Facility Basel of ETH Zürich for their expert technical assistance with Illumina high-throughput sequencing. We thank Sepp Hochreiter (JKU Linz, Austria) for helpful discussions. This work was funded by the Swiss National Science Foundation (Project #: 31003A_143869, to STR), SystemsX.ch – AntibodyX RTD project (to STR), Swiss Vaccine Research Institute (to STR). The professorship of STR is made possible by the generous endowment of the S. Leslie Misrock Foundation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Greiff V, Miho E, Menzel U, Reddy ST. Bioinformatic and Statistical Analysis of Adaptive Immune Repertoires. Trends Immunol. 2015;36: 738–749. doi:10.1016/j.it.2015.09.006
2. Hershberg U, Prak ETL. The analysis of clonal expansions in normal and autoimmune B cell repertoires. Phil Trans R Soc B. 2015;370: 20140239. doi:10.1098/rstb.2014.0239
3. Xu JL, Davis MM. Diversity in the CDR3 Region of VH Is Sufficient for Most Antibody Specificities. Immunity. 2000;13: 37–45. doi:10.1016/S1074-7613(00)00006-6
4. Kunik V, Peters B, Ofran Y. Structural Consensus among Antibodies Defines the Antigen Binding Site. PLoS Comput Biol. 2012;8. doi:10.1371/journal.pcbi.1002388
5. Castro R, Navelsaker S, Krasnov A, Du Pasquier L, Boudinot P. Describing the diversity of Ag specific receptors in vertebrates: Contribution of repertoire deep sequencing. Dev Comp Immunol. 2017; doi:10.1016/j.dci.2017.02.018
6. Tonegawa S. Somatic generation of antibody diversity. Nature. 1983;302: 575–581. doi:10.1038/302575a0
7. Glanville J, Zhai W, Berka J, Telman D, Huerta G, Mehta GR, et al. Precise determination of the diversity of a combinatorial antibody library gives insight into the human immunoglobulin repertoire. Proc Natl Acad Sci. 2009;106: 20216–20221. doi:10.1073/pnas.0909775106
8. Saada R, Weinberger M, Shahaf G, Mehr R. Models for antigen receptor gene rearrangement: CDR3 length. Immunol Cell Biol. 2007;85: 323–332. doi:10.1038/sj.icb.7100055
9. Warren RL, Freeman JD, Zeng T, Choe G, Munro S, Moore R, et al. Exhaustive T-cell repertoire sequencing of human peripheral blood samples reveals signatures of antigen selection and a directly measured repertoire size of at least 1 million clonotypes. Genome Res. 2011;21: 790–797. doi:10.1101/gr.115428.110
10. Murugan A, Mora T, Walczak AM, Callan CG. Statistical inference of the generation probability of T-cell receptors from sequence repertoires. Proc Natl Acad Sci. 2012;109: 16161–16166. doi:10.1073/pnas.1212755109
11. Arnaout R, Lee W, Cahill P, Honan T, Sparrow T, Weiand M, et al. High-Resolution Description of Antibody Heavy-Chain Repertoires in Humans. PLoS ONE. 2011;6: e22365. doi:10.1371/journal.pone.0022365
12. Jiang N, Weinstein JA, Penland L, White RA, Fisher DS, Quake SR. Determinism and stochasticity during maturation of the zebrafish antibody repertoire. Proc Natl Acad Sci. 2011;108: 5348–5353. doi:10.1073/pnas.1014277108
13. DeWitt WS, Lindau P, Snyder TM, Sherwood AM, Vignali M, Carlson CS, et al. A Public Database of Memory and Naive B-Cell Receptor Sequences. PLOS ONE. 2016;11: e0160853. doi:10.1371/journal.pone.0160853
14. Galson JD, Trück J, Fowler A, Münz M, Cerundolo V, Pollard AJ, et al. In-depth assessment of within-individual and inter-individual variation in the B cell receptor repertoire. Front Immunol. 2015; 531. doi:10.3389/fimmu.2015.00531
15. Georgiou G, Ippolito GC, Beausang J, Busse CE, Wardemann H, Quake SR. The promise and challenge of high-throughput sequencing of the antibody repertoire. Nat Biotechnol. 2014;32: 158–168. doi:10.1038/nbt.2782
16. Greiff V, Menzel U, Miho E, Weber C, Riedel R, Cook SC, et al. Systems analysis reveals high genetic and antigen-driven predetermination of antibody repertoires throughout B-cell development. Cell Rep. 2017 (accepted in principle).
17. Madi A, Shifrut E, Reich-Zeliger S, Gal H, Best K, Ndifon W, et al. T-cell receptor repertoires share a restricted set of public and abundant CDR3 sequences that are associated with self-related immunity. Genome Res. 2014;24: 1603–1612. doi:10.1101/gr.170753.113
18. Robinson WH. Sequencing the functional antibody repertoire—diagnostic and therapeutic discovery. Nat Rev Rheumatol. 2014;11: 171–182. doi:10.1038/nrrheum.2014.220
19. Yaari G, Kleinstein SH. Practical guidelines for B-cell receptor repertoire sequencing analysis. Genome Med. 2015;7: 121. doi:10.1186/s13073-015-0243-2
20. Yang Y, Wang C, Yang Q, Kantor AB, Chu H, Ghosn EE, et al. Distinct mechanisms define murine B cell lineage immunoglobulin heavy chain (IgH) repertoires. eLife. 2015; e09083. doi:10.7554/eLife.09083
21. Jackson KJL, Kidd MJ, Wang Y, Collins AM. The shape of the lymphocyte receptor repertoire: lessons from the B cell receptor. Front B Cell Biol. 2013;4: 263. doi:10.3389/fimmu.2013.00263
22. Covacu R, Philip H, Jaronen M, Almeida J, Kenison JE, Darko S, et al. System-wide Analysis of the T Cell Response. Cell Rep. 2016;14: 2733–2744. doi:10.1016/j.celrep.2016.02.056
23. Venturi V, Price DA, Douek DC, Davenport MP. The molecular basis for public T-cell responses? Nat Rev Immunol. 2008;8: 231–238. doi:10.1038/nri2260
24. Elhanati Y, Murugan A, Callan CG, Mora T, Walczak AM. Quantifying selection in immune receptor repertoires. Proc Natl Acad Sci. 2014;111: 9875–9880.
25. Elhanati Y, Sethna Z, Marcou Q, Callan CG, Mora T, Walczak AM. Inferring processes underlying B-cell repertoire diversity. Phil Trans R Soc B. 2015;370: 20140243. doi:10.1098/rstb.2014.0243
26. Mora T, Walczak AM, Bialek W, Callan CG. Maximum entropy models for antibody diversity. Proc Natl Acad Sci. 2010;107: 5405–5410. doi:10.1073/pnas.1001705107
27. Kidd BA, Peters LA, Schadt EE, Dudley JT. Unifying immunology with informatics and multiscale biology. Nat Immunol. 2014;15: 118–127. doi:10.1038/ni.2787
28. Lodhi H, Saunders C, Shawe-Taylor J, Cristianini N, Watkins C. Text classification using string kernels. J Mach Learn Res. 2002;2: 419–444.
29. Sun Y, Best K, Cinelli M, Heather JM, Reich-Zeliger S, Shifrut E, et al. Specificity, Privacy, and Degeneracy in the CD4 T Cell Receptor Repertoire Following Immunization. Front Immunol. 2017;8. doi:10.3389/fimmu.2017.00430
30. Palme J, Hochreiter S, Bodenhofer U. KeBABS: an R package for kernel-based analysis of biological sequences. Bioinformatics. 2015; btv176. doi:10.1093/bioinformatics/btv176
31. Schwarzbauer K, Bodenhofer U, Hochreiter S. Genome-Wide Chromatin Remodeling Identified at GC-Rich Long Nucleosome-Free Regions. PLOS ONE. 2012;7: e47924. doi:10.1371/journal.pone.0047924
32. Bishop CM. Pattern Recognition and Machine Learning. New edition. Springer, Berlin; 2007.
33. Thomas N, Best K, Cinelli M, Reich-Zeliger S, Gal H, Shifrut E, et al. Tracking global changes induced in the CD4 T cell receptor repertoire by immunization with a complex antigen using short stretches of CDR3 protein sequence. Bioinforma Oxf Engl. 2014; doi:10.1093/bioinformatics/btu523
34. Miho E, Greiff V, Roskar R, Reddy ST. The fundamental principles of antibody repertoire architecture revealed by large-scale network analysis. bioRxiv. 2017; 124578. doi:10.1101/124578
35. Greiff V, Bhat P, Cook SC, Menzel U, Kang W, Reddy ST. A bioinformatic framework for immune repertoire diversity profiling enables detection of immunological status. Genome Med. 2015;7: 49. doi:10.1186/s13073-015-0169-8
36. Leslie C, Kuang R. Fast String Kernels Using Inexact Matching for Protein Sequences. J Mach Learn Res. 2004;5: 1435–1455.
37. Mahrenholz CC, Abfalter IG, Bodenhofer U, Volkmer R, Hochreiter S. Complex networks govern coiled-coil oligomerization – predicting and profiling by means of a machine learning approach. Mol Cell Proteomics MCP. 2011;10: M110.004994. doi:10.1074/mcp.M110.004994
38. Li H, Ye C, Ji G, Wu X, Xiang Z, Li Y, et al. Recombinatorial Biases and Convergent Recombination Determine Interindividual TCRβ Sharing in Murine Thymocytes. J Immunol. 2012;189: 2404–2413. doi:10.4049/jimmunol.1102087
39. Janeway CA, Murphy K. Janeway’s Immunobiology. 8th Revised edition. Taylor & Francis; 2011.
40. Mroczek ES, Ippolito GC, Rogosch T, Hoi KH, Hwangpo TA, Brand MG, et al. Differences in the composition of the human antibody repertoire by B cell subsets in the blood. B Cell Biol. 2014;5: 96. doi:10.3389/fimmu.2014.00096
41. Safonova Y, Lapidus A, Lill J. IgSimulator: a versatile immunosequencing simulator. Bioinformatics. 2015; btv326.
42. Khan TA, Friedensohn S, Vries ARG de, Straszewski J, Ruscheweyh H-J, Reddy ST. Accurate and predictive antibody repertoire profiling by molecular amplification fingerprinting. Sci Adv. 2016;2: e1501371. doi:10.1126/sciadv.1501371
43. Vollmers C, Sit RV, Weinstein JA, Dekker CL, Quake SR. Genetic measurement of memory B-cell recall using antibody repertoire sequencing. Proc Natl Acad Sci. 2013;110: 13463–13468. doi:10.1073/pnas.1312146110
44. Shugay M, Britanova OV, Merzlyak EM, Turchaninova MA, Mamedov IZ, Tuganbaev TR, et al. Towards error-free profiling of immune repertoires. Nat Methods. 2014;11: 653–655. doi:10.1038/nmeth.2960
45. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9: 1735–1780.
46. Angermueller C, Pärnamaa T, Parts L, Stegle O. Deep learning for computational biology. Mol Syst Biol. 2016;12: 878. doi:10.15252/msb.20156651
47. Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol. 2015;33: 831–838. doi:10.1038/nbt.3300
48. Miles JJ, Silins SL, Burrows SR. Engineered T cell receptors and their potential in molecular medicine. Curr Med Chem. 2006;13: 2725–2736.
49. Jardine JG, Kulp DW, Havenar-Daughton C, Sarkar A, Briney B, Sok D, et al. HIV-1 broadly neutralizing antibody precursor B cells revealed by germline-targeting immunogen. Science. 2016;351: 1458–1463. doi:10.1126/science.aad9195
50. Bolotin DA, Poslavsky S, Mitrophanov I, Shugay M, Mamedov IZ, Putintseva EV, et al. MiXCR: software for comprehensive adaptive immunity profiling. Nat Methods. 2015;12: 380–381. doi:10.1038/nmeth.3364
51. Greiff V, Menzel U, Haessler U, Cook SC, Friedensohn S, Khan TA, et al. Quantitative assessment of the robustness of next-generation sequencing of antibody variable gene repertoires from immunized mice. BMC Immunol. 2014;15: 40. doi:10.1186/s12865-014-0040-5
52. Menzel U, Greiff V, Khan TA, Haessler U, Hellmann I, Friedensohn S, et al. Comprehensive Evaluation and Optimization of Amplicon Library Preparation Methods for High-Throughput Antibody Sequencing. PLoS ONE. 2014;9: e96727. doi:10.1371/journal.pone.0096727
53. R Development Core Team. R: A Language and Environment for Statistical Computing [Internet]. Vienna, Austria; 2009. Available: http://www.R-project.org
54. van Rossum G, Drake FL Jr. The Python Language Reference Manual. Network Theory Ltd; 2011.
55. Wickham H. ggplot2: Elegant Graphics for Data Analysis [Internet]. Springer-Verlag New York; 2009. Available: http://ggplot2.org
56. Neuwirth E. RColorBrewer: ColorBrewer Palettes [Internet]. 2014. Available: https://CRAN.R-project.org/package=RColorBrewer
57. Gu Z. ComplexHeatmap: Making Complex Heatmaps [Internet]. 2016. Available: https://github.com/jokergoo/ComplexHeatmap
58. Bischl B, Lang M, Mersmann O, Rahnenführer J, Weihs C. BatchJobs and BatchExperiments: Abstraction Mechanisms for Using R in Batch Environments. J Stat Softw. 2015;64: 1–25.
59. Revolution Analytics, Weston S. doParallel: Foreach Parallel Adaptor for the “parallel” Package [Internet]. 2015. Available: https://CRAN.R-project.org/package=doParallel
60. Lefranc M-P, Giudicelli V, Ginestoux C, Bodmer J, Müller W, Bontrop R, et al. IMGT, the international ImMunoGeneTics database. Nucleic Acids Res. 1999;27: 209–212. doi:10.1093/nar/27.1.209
61. Li S, Lefranc M-P, Miles JJ, Alamyar E, Giudicelli V, Duroux P, et al. IMGT/HighV-QUEST paradigm for T cell receptor IMGT clonotype diversity and next generation repertoire immunoprofiling. Nat Commun. 2013;4. doi:10.1038/ncomms3333
62. Soberón J, Llorente J. The Use of Species Accumulation Functions for the Prediction of Species Richness. Conserv Biol. 1993;7: 480–488. doi:10.1046/j.1523-1739.1993.07030480.x
63. Loo MPJ van der. The stringdist package for approximate string matching. R J. 2014;6: 111–122.
64. Csardi G, Nepusz T. The igraph software package for complex network research. InterJournal. 2006;Complex Systems: 1695.
65. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12: 2825–2830.

Posted April 18, 2017.

Subject Area

Systems Biology


[1] 1.↵
Greiff V, Miho E, Menzel U, Reddy ST. Bioinformatic and Statistical Analysis of Adaptive Immune Repertoires. Trends Immunol. 2015;36: 738–749. doi:10.1016/j.it.2015.09.006
OpenUrl CrossRef PubMed

[2] 2.
Hershberg U, Prak ETL. The analysis of clonal expansions in normal and autoimmune B cell repertoires. Phil Trans R Soc B. 2015;370: 20140239. doi:10.1098/rstb.2014.0239
OpenUrl CrossRef PubMed

[3] 3.
Xu JL, Davis MM. Diversity in the CDR3 Region of VH Is Sufficient for Most Antibody Specificities. Immunity. 2000;13: 37–45. doi:10.1016/S1074-7613(00)00006-6
OpenUrl CrossRef PubMed Web of Science

[4] 4.
Kunik V, Peters B, Ofran Y. Structural Consensus among Antibodies Defines the Antigen Binding Site. PLoS Comput Biol. 2012;8. doi:10.1371/journal.pcbi.1002388
OpenUrl CrossRef PubMed

[5] 5.↵
Castro R, Navelsaker S, Krasnov A, Du Pasquier L, Boudinot P. Describing the diversity of Ag specific receptors in vertebrates: Contribution of repertoire deep sequencing. Dev Comp Immunol. 2017; doi:10.1016/j.dci.2017.02.018
OpenUrl CrossRef

[6] 6.↵
Tonegawa S. Somatic generation of antibody diversity. Nature. 1983;302: 575–581. doi:10.1038/302575a0
OpenUrl CrossRef PubMed Web of Science

[7] 7.↵
Glanville J, Zhai W, Berka J, Telman D, Huerta G, Mehta GR, et al. Precise determination of the diversity of a combinatorial antibody library gives insight into the human immunoglobulin repertoire. Proc Natl Acad Sci. 2009;106: 20216–20221. doi:10.1073/pnas.0909775106
OpenUrl Abstract/FREE Full Text

[8] 8.↵
Saada R, Weinberger M, Shahaf G, Mehr R. Models for antigen receptor gene rearrangement: CDR3 length. Immunol Cell Biol. 2007;85: 323–332. doi:10.1038/sj.icb.7100055
OpenUrl CrossRef PubMed

[9] 9.
Warren RL, Freeman JD, Zeng T, Choe G, Munro S, Moore R, et al. Exhaustive T-cell repertoire sequencing of human peripheral blood samples reveals signatures of antigen selection and a directly measured repertoire size of at least 1 million clonotypes. Genome Res. 2011;21: 790–797. doi:10.1101/gr.115428.110
OpenUrl Abstract/FREE Full Text

[10] 10.↵
Murugan A, Mora T, Walczak AM, Callan CG. Statistical inference of the generation probability of T-cell receptors from sequence repertoires. Proc Natl Acad Sci. 2012;109: 16161–16166. doi:10.1073/pnas.1212755109
OpenUrl Abstract/FREE Full Text

[11] 11.↵
Arnaout R, Lee W, Cahill P, Honan T, Sparrow T, Weiand M, et al. High-Resolution Description of Antibody Heavy-Chain Repertoires in Humans. PLoS ONE. 2011;6: e22365. doi:10.1371/journal.pone.0022365

[12] 12.
Jiang N, Weinstein JA, Penland L, White RA, Fisher DS, Quake SR. Determinism and stochasticity during maturation of the zebrafish antibody repertoire. Proc Natl Acad Sci. 2011;108: 5348–5353. doi:10.1073/pnas.1014277108

[13] 13.
DeWitt WS, Lindau P, Snyder TM, Sherwood AM, Vignali M, Carlson CS, et al. A Public Database of Memory and Naive B-Cell Receptor Sequences. PLOS ONE. 2016;11: e0160853. doi:10.1371/journal.pone.0160853

[14] 14.
Galson JD, Trück J, Fowler A, Münz M, Cerundolo V, Pollard AJ, et al. In-depth assessment of within-individual and inter-individual variation in the B cell receptor repertoire. Front Immunol. 2015; 531. doi:10.3389/fimmu.2015.00531

[15] 15.
Georgiou G, Ippolito GC, Beausang J, Busse CE, Wardemann H, Quake SR. The promise and challenge of high-throughput sequencing of the antibody repertoire. Nat Biotechnol. 2014;32: 158–168. doi:10.1038/nbt.2782

[16] 16.
Greiff V, Menzel U, Miho E, Weber C, Riedel R, Cook SC, et al. Systems analysis reveals high genetic and antigen-driven predetermination of antibody repertoires throughout B-cell development. Cell Rep. 2017; accepted in principle.

[17] 17.
Madi A, Shifrut E, Reich-Zeliger S, Gal H, Best K, Ndifon W, et al. T-cell receptor repertoires share a restricted set of public and abundant CDR3 sequences that are associated with self-related immunity. Genome Res. 2014;24: 1603–1612. doi:10.1101/gr.170753.113

[18] 18.
Robinson WH. Sequencing the functional antibody repertoire—diagnostic and therapeutic discovery. Nat Rev Rheumatol. 2014;11: 171–182. doi:10.1038/nrrheum.2014.220

[19] 19.
Yaari G, Kleinstein SH. Practical guidelines for B-cell receptor repertoire sequencing analysis. Genome Med. 2015;7: 121. doi:10.1186/s13073-015-0243-2

[20] 20.
Yang Y, Wang C, Yang Q, Kantor AB, Chu H, Ghosn EE, et al. Distinct mechanisms define murine B cell lineage immunoglobulin heavy chain (IgH) repertoires. eLife. 2015; e09083. doi:10.7554/eLife.09083

[21] 21.
Jackson KJL, Kidd MJ, Wang Y, Collins AM. The shape of the lymphocyte receptor repertoire: lessons from the B cell receptor. Front Immunol. 2013;4: 263. doi:10.3389/fimmu.2013.00263

[22] 22.
Covacu R, Philip H, Jaronen M, Almeida J, Kenison JE, Darko S, et al. System-wide Analysis of the T Cell Response. Cell Rep. 2016;14: 2733–2744. doi:10.1016/j.celrep.2016.02.056

[23] 23.
Venturi V, Price DA, Douek DC, Davenport MP. The molecular basis for public T-cell responses? Nat Rev Immunol. 2008;8: 231–238. doi:10.1038/nri2260

[24] 24.
Elhanati Y, Murugan A, Callan CG, Mora T, Walczak AM. Quantifying selection in immune receptor repertoires. Proc Natl Acad Sci. 2014;111: 9875–9880.

[25] 25.
Elhanati Y, Sethna Z, Marcou Q, Callan CG, Mora T, Walczak AM. Inferring processes underlying B-cell repertoire diversity. Phil Trans R Soc B. 2015;370: 20140243. doi:10.1098/rstb.2014.0243

[26] 26.
Mora T, Walczak AM, Bialek W, Callan CG. Maximum entropy models for antibody diversity. Proc Natl Acad Sci. 2010;107: 5405–5410. doi:10.1073/pnas.1001705107

[27] 27.
Kidd BA, Peters LA, Schadt EE, Dudley JT. Unifying immunology with informatics and multiscale biology. Nat Immunol. 2014;15: 118–127. doi:10.1038/ni.2787

[28] 28.
Lodhi H, Saunders C, Shawe-Taylor J, Cristianini N, Watkins C. Text classification using string kernels. J Mach Learn Res. 2002;2: 419–444.

[29] 29.
Sun Y, Best K, Cinelli M, Heather JM, Reich-Zeliger S, Shifrut E, et al. Specificity, Privacy, and Degeneracy in the CD4 T Cell Receptor Repertoire Following Immunization. Front Immunol. 2017;8. doi:10.3389/fimmu.2017.00430

[30] 30.
Palme J, Hochreiter S, Bodenhofer U. KeBABS: an R package for kernel-based analysis of biological sequences. Bioinformatics. 2015; btv176. doi:10.1093/bioinformatics/btv176

[31] 31.
Schwarzbauer K, Bodenhofer U, Hochreiter S. Genome-Wide Chromatin Remodeling Identified at GC-Rich Long Nucleosome-Free Regions. PLOS ONE. 2012;7: e47924. doi:10.1371/journal.pone.0047924

[32] 32.
Bishop CM. Pattern Recognition and Machine Learning. New edition. Springer, Berlin; 2007.

[33] 33.
Thomas N, Best K, Cinelli M, Reich-Zeliger S, Gal H, Shifrut E, et al. Tracking global changes induced in the CD4 T cell receptor repertoire by immunization with a complex antigen using short stretches of CDR3 protein sequence. Bioinformatics. 2014; doi:10.1093/bioinformatics/btu523

[34] 34.
Miho E, Greiff V, Roskar R, Reddy ST. The fundamental principles of antibody repertoire architecture revealed by large-scale network analysis. bioRxiv. 2017; 124578. doi:10.1101/124578

[35] 35.
Greiff V, Bhat P, Cook SC, Menzel U, Kang W, Reddy ST. A bioinformatic framework for immune repertoire diversity profiling enables detection of immunological status. Genome Med. 2015;7: 49. doi:10.1186/s13073-015-0169-8

[36] 36.
Leslie C, Kuang R. Fast String Kernels Using Inexact Matching for Protein Sequences. J Mach Learn Res. 2004;5: 1435–1455.

[37] 37.
Mahrenholz CC, Abfalter IG, Bodenhofer U, Volkmer R, Hochreiter S. Complex networks govern coiled-coil oligomerization – predicting and profiling by means of a machine learning approach. Mol Cell Proteomics MCP. 2011;10: M110.004994. doi:10.1074/mcp.M110.004994

[38] 38.
Li H, Ye C, Ji G, Wu X, Xiang Z, Li Y, et al. Recombinatorial Biases and Convergent Recombination Determine Interindividual TCRβ Sharing in Murine Thymocytes. J Immunol. 2012;189: 2404–2413. doi:10.4049/jimmunol.1102087

[39] 39.
Janeway CA, Murphy K. Janeway’s Immunobiology. 8th Revised edition. Taylor & Francis; 2011.

[40] 40.
Mroczek ES, Ippolito GC, Rogosch T, Hoi KH, Hwangpo TA, Brand MG, et al. Differences in the composition of the human antibody repertoire by B cell subsets in the blood. Front Immunol. 2014;5: 96. doi:10.3389/fimmu.2014.00096

[41] 41.
Safonova Y, Lapidus A, Lill J. IgSimulator: a versatile immunosequencing simulator. Bioinformatics. 2015; btv326.

[42] 42.
Khan TA, Friedensohn S, de Vries ARG, Straszewski J, Ruscheweyh H-J, Reddy ST. Accurate and predictive antibody repertoire profiling by molecular amplification fingerprinting. Sci Adv. 2016;2: e1501371. doi:10.1126/sciadv.1501371

[43] 43.
Vollmers C, Sit RV, Weinstein JA, Dekker CL, Quake SR. Genetic measurement of memory B-cell recall using antibody repertoire sequencing. Proc Natl Acad Sci. 2013;110: 13463–13468. doi:10.1073/pnas.1312146110

[44] 44.
Shugay M, Britanova OV, Merzlyak EM, Turchaninova MA, Mamedov IZ, Tuganbaev TR, et al. Towards error-free profiling of immune repertoires. Nat Methods. 2014;11: 653–655. doi:10.1038/nmeth.2960

[45] 45.
Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9: 1735–1780.

[46] 46.
Angermueller C, Pärnamaa T, Parts L, Stegle O. Deep learning for computational biology. Mol Syst Biol. 2016;12: 878. doi:10.15252/msb.20156651

[47] 47.
Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol. 2015;33: 831–838. doi:10.1038/nbt.3300

[48] 48.
Miles JJ, Silins SL, Burrows SR. Engineered T cell receptors and their potential in molecular medicine. Curr Med Chem. 2006;13: 2725–2736.

[49] 49.
Jardine JG, Kulp DW, Havenar-Daughton C, Sarkar A, Briney B, Sok D, et al. HIV-1 broadly neutralizing antibody precursor B cells revealed by germline-targeting immunogen. Science. 2016;351: 1458–1463. doi:10.1126/science.aad9195

[50] 50.
Bolotin DA, Poslavsky S, Mitrophanov I, Shugay M, Mamedov IZ, Putintseva EV, et al. MiXCR: software for comprehensive adaptive immunity profiling. Nat Methods. 2015;12: 380–381. doi:10.1038/nmeth.3364

[51] 51.
Greiff V, Menzel U, Haessler U, Cook SC, Friedensohn S, Khan TA, et al. Quantitative assessment of the robustness of next-generation sequencing of antibody variable gene repertoires from immunized mice. BMC Immunol. 2014;15: 40. doi:10.1186/s12865-014-0040-5

[52] 52.
Menzel U, Greiff V, Khan TA, Haessler U, Hellmann I, Friedensohn S, et al. Comprehensive Evaluation and Optimization of Amplicon Library Preparation Methods for High-Throughput Antibody Sequencing. PLoS ONE. 2014;9: e96727. doi:10.1371/journal.pone.0096727

[53] 53.
R Development Core Team. R: A Language and Environment for Statistical Computing [Internet]. Vienna, Austria; 2009. Available: http://www.R-project.org

[54] 54.
van Rossum G, Drake FL Jr. The Python Language Reference Manual. Network Theory Ltd; 2011.

[55] 55.
Wickham H. ggplot2: Elegant Graphics for Data Analysis [Internet]. Springer-Verlag New York; 2009. Available: http://ggplot2.org

[56] 56.
Neuwirth E. RColorBrewer: ColorBrewer Palettes [Internet]. 2014. Available: https://CRAN.R-project.org/package=RColorBrewer

[57] 57.
Gu Z. ComplexHeatmap: Making Complex Heatmaps [Internet]. 2016. Available: https://github.com/jokergoo/ComplexHeatmap

[58] 58.
Bischl B, Lang M, Mersmann O, Rahnenführer J, Weihs C. BatchJobs and BatchExperiments: Abstraction Mechanisms for Using R in Batch Environments. J Stat Softw. 2015;64: 1–25.

[59] 59.
Revolution Analytics, Weston S. doParallel: Foreach Parallel Adaptor for the “parallel” Package [Internet]. 2015. Available: https://CRAN.R-project.org/package=doParallel

[60] 60.
Lefranc M-P, Giudicelli V, Ginestoux C, Bodmer J, Müller W, Bontrop R, et al. IMGT, the international ImMunoGeneTics database. Nucleic Acids Res. 1999;27: 209–212. doi:10.1093/nar/27.1.209

[61] 61.
Li S, Lefranc M-P, Miles JJ, Alamyar E, Giudicelli V, Duroux P, et al. IMGT/HighV QUEST paradigm for T cell receptor IMGT clonotype diversity and next generation repertoire immunoprofiling. Nat Commun. 2013;4. doi:10.1038/ncomms3333

[62] 62.
Soberón J, Llorente J. The Use of Species Accumulation Functions for the Prediction of Species Richness. Conserv Biol. 1993;7: 480–488. doi:10.1046/j.1523-1739.1993.07030480.x

[63] 63.
van der Loo MPJ. The stringdist package for approximate string matching. R J. 2014;6: 111–122.

[64] 64.
Csardi G, Nepusz T. The igraph software package for complex network research. InterJournal. 2006;Complex Systems: 1695.

[65] 65.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12: 2825–2830.