ABSTRACT
Therapeutic antibody optimization is time and resource intensive, largely because it requires low-throughput screening (10^3 variants) of full-length IgG in mammalian cells, typically resulting in only a few optimized leads. Here, we use deep learning to interrogate and predict antigen specificity from a massive diversity of antibody sequence space. Using a mammalian display platform and the therapeutic antibody trastuzumab, rationally designed site-directed mutagenesis libraries are introduced by CRISPR/Cas9-mediated homology-directed repair (HDR). Screening and deep sequencing of relatively small libraries (10^4) produced high-quality data capable of training deep neural networks that accurately predict antigen binding based on antibody sequence (~85% precision). Deep learning is then used to predict millions of antigen binders from an in silico library of ~10^8 variants. Finally, these variants are subjected to multiple developability filters, resulting in tens of thousands of optimized lead candidates; when a small subset of 30 was expressed, all 30 were antigen-specific. With its scalability and capacity to interrogate a vast protein sequence space, deep learning offers great potential for antibody engineering and optimization.
INTRODUCTION
In antibody drug discovery, the ‘target-to-hit’ stage is a well-established process, as screening hybridoma, phage or yeast display libraries typically results in a number of potential lead candidates. However, the time and costs associated with lead candidate optimization often take up the majority of the preclinical discovery and development cycle [1]. This is largely because lead optimization of antibody molecules consists of addressing multiple parameters in parallel, including expression level, viscosity, pharmacokinetics, solubility, and immunogenicity [2,3]. Once a lead candidate is discovered, additional engineering is often required; phage and yeast display offer a powerful method for high-throughput screening of large mutagenesis libraries (>10^9); however, they are primarily used only for increasing affinity or specificity to the target antigen [4]. The fact that nearly all therapeutic antibodies require expression in mammalian cells as full-length IgG means that the remaining development and optimization steps must occur in this context. Since mammalian cells lack the capability to stably replicate plasmids, this last stage of development is done at very low throughput, as elaborate cloning, transfection and purification strategies must be implemented to screen libraries in the maximum range of 10^3, meaning only minor changes (e.g., point mutations) are screened [5]. Interrogating such a small fraction of protein sequence space also implies that addressing one development issue will frequently give rise to another or even diminish antigen binding altogether, making multi-parameter optimization very challenging.
Machine learning applied to biological sequence data offers a powerful approach to construct models capable of making predictions of genotype-phenotype relationships [6,7]. This is due to the capability of models to extrapolate complex relationships between sequence and function. One of the principal challenges in constructing accurate machine learning models is the collection of appropriate high-quality training data. Directed evolution platforms are well suited for this as they rely on the linking of biological sequence data (DNA, RNA, protein) to a phenotypic output [8]. In fact, it has long been proposed to use machine learning models trained on data generated by mutagenesis libraries as a means to guide protein engineering [9,10]. Recently, Gaussian processes, a class of Bayesian learning models, were used to engineer cytochrome enzymes, enabling navigation through a vast protein sequence space to discover highly thermostable variants [11]. Similarly, the design and screening of a structure-guided library of channel rhodopsin membrane proteins was used to train Gaussian process and regression models, which were able to accurately predict variants that could express and localize on mammalian cell membranes [12].
In recent years, access to deep sequencing and parallel computing has enabled the construction of deep learning models capable of predicting molecular phenotype from sequence data [13,14]. For example, deep learning has been used to learn the sequence specificities of RNA- and DNA-binding proteins [15], regulatory grammar of protein expression in yeast [16], and HLA-neoantigen presentation on tumor cells [17]. In most cases deep (artificial) neural networks represent the class of algorithm utilized. While the complexity of neural networks has changed drastically since their conception, the fundamental concept remains the same: mimicking the connections of biological neurons to learn complex relationships between variables [18]. As an extension of a single-layer neural network, or perceptron [19], deep learning incorporates multiple hidden layers to deconvolute relationships buried in large, high-dimensional data sets, such as the millions of reads gathered from a single deep sequencing experiment. Well-trained models can then be used to make predictions on completely unseen and novel variants. This application of model extrapolation lends itself perfectly to protein engineering because it provides a way to interrogate a much larger sequence space than what is physically possible. For example, even for a short stretch of just 10 amino acids, the combinatorial sequence diversity explodes to 10^13, a size which is nearly impossible to interrogate experimentally.
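The size of this sequence space follows directly from the 20 canonical amino acids and the stretch length. The short calculation below is only an illustration of that arithmetic; the variable names are ours.

```python
# Combinatorial diversity of a 10-residue stretch built from 20 amino acids.
n_amino_acids = 20
stretch_length = 10

sequence_space = n_amino_acids ** stretch_length
print(f"{sequence_space:.2e}")  # 1.02e+13, i.e. on the order of 10^13
```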
Here, we leverage the power of deep learning to perform multi-parameter optimization of therapeutic antibodies (full-length IgG) directly in mammalian cells (Figure 1). Starting with a mammalian display cell line [20] expressing the therapeutic antibody trastuzumab (Herceptin), we use CRISPR-Cas9-mediated homology-directed repair (HDR) to introduce site-directed mutagenesis libraries in the variable heavy chain complementarity determining region 3 (CDRH3) [21]. In order to generate information-rich training data, single-site deep mutational scanning (DMS) is first performed [22], which is then used to guide the design of combinatorial mutagenesis libraries. An experimental (physical) library size of 5 × 10^4 variants was then screened for specificity to the antigen HER2. All binding and non-binding variant sequences were then used to train recurrent and convolutional deep neural networks, which, when fully trained and optimized, were able to predict antigen specificity from antibody sequence with high accuracy and precision. Neural networks are then used to predict antigen specificity on a subset of sequence variants from the DMS-based combinatorial mutagenesis library (~10^8 sequences), resulting in >3.0 × 10^6 variants predicted to have a high probability of being antigen-specific. These variants are then subjected to several sequence-based in silico filtering steps to optimize for developability parameters such as viscosity, solubility and immunogenicity, resulting in over 40,000 optimized antibody sequence variants. Finally, a random selection of variants was recombinantly expressed and tested, resulting in 30 out of 30 showing antigen-specific binding.
RESULTS
Deep mutational scanning determines antigen-specific sequence landscapes and guides rational antibody library design
As the amino acid sequence of an antibody’s CDRH3 is a key determinant of antigen specificity, we performed DMS on this region to resolve the specificity-determining residues. To start, a hybridoma cell line was used that expressed a trastuzumab variant that could not bind HER2 antigen (mutated CDRH3 sequence) (Supplementary Fig. 1). Libraries were generated by CRISPR-Cas9-mediated homology-directed mutagenesis (HDM) [21], which utilized guide RNA (gRNA) for Cas9 targeting of CDRH3 and a pool of homology templates in the form of single-stranded oligonucleotides (ssODNs) containing NNK degenerate codons at single sites tiled across CDRH3 (Figure 2a, Supplementary Fig. 2). Libraries were then screened by fluorescence-activated cell sorting (FACS), and populations expressing surface IgG that either bound or did not bind antigen were isolated and subjected to deep sequencing (Illumina MiSeq) (Supplementary Table 1). Deep sequencing data were then used to calculate enrichment scores of the 10 positions investigated, which revealed six positions that were sufficiently amenable to a wide range of mutations and an additional three positions that were marginally accepting of defined mutations (Figure 2b). Although residues 102D, 103G, 104F, and 105Y appear to be the amino acids of the CDRH3 loop that contact HER2 [23,24], 105Y is the only residue completely fixed (Figure 2c).
Heatmaps and their corresponding sequence logo plots generated by DMS were used to guide the rational design of combinatorial mutagenesis libraries, which consisted of degenerate codons across all positions (except 105Y) (Supplementary Fig. 3, Supplementary Table 6). Degenerate codons were selected per position based on their amino acid frequencies which most closely resembled the degree of enrichment found in the DMS data following 1, 2, and 3 rounds of antigen-specific enrichment (Supplementary Fig. 2, Equation 2). This combinatorial library possesses a theoretical protein sequence space of 7.17 × 10^8, far greater than the single-site DMS library diversity of 200. Libraries containing CDRH3 variants were again generated in hybridoma cells through CRISPR-Cas9-mediated HDM in the same non-binding trastuzumab clone described previously (Figure 3a). Antigen-binding cells were isolated by two rounds of enrichment by FACS (Figure 3b, Supplementary Fig. 3) and the binding/non-binding populations were subjected to deep sequencing. Sequencing data identified 11,300 and 27,539 unique binders and non-binders, respectively (Supplementary Table 2). These sequence variants represented only a minuscule 0.0054% of the theoretical protein sequence space of the combinatorial mutagenesis library. Amino acid usage per position was comparatively similar between antigen-binding and non-binding populations (Figure 3c), thus making it difficult to develop any sort of heuristic rules or observable patterns to identify binding sequences.
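The coverage figure quoted above can be checked from the reported counts. The snippet below is a simple worked calculation using only numbers stated in the text; the variable names are illustrative.

```python
# Fraction of the theoretical combinatorial library that was sampled.
unique_binders = 11_300
unique_non_binders = 27_539
theoretical_space = 7.17e8

observed = unique_binders + unique_non_binders           # 38,839 unique variants
coverage_percent = observed / theoretical_space * 100    # about 0.0054 %
binder_fraction = unique_binders / observed              # about 0.29

print(f"{observed} variants, {coverage_percent:.4f}% of sequence space")
print(f"binder fraction: {binder_fraction:.2f}")
```

The binder fraction of roughly 0.29 corresponds to the approximately 30/70 binder/non-binder split of the full data set.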
Training deep neural networks to classify antigen-specificity based on antibody sequence
After having compiled deep sequencing data on binding and non-binding CDRH3 variants, we set out to develop and train deep learning models capable of predicting specificity towards the target antigen HER2. Amino acid sequences were converted to an input matrix by one-hot encoding, an approach where each column of the matrix represents a specific residue and each row corresponds to a position in the sequence; a 10 amino acid CDRH3 sequence, as used here, therefore results in a 10 × 20 matrix. Each row contains a single ‘1’ in the column corresponding to the residue at that position, and all other entries are ‘0’. We utilized long short-term memory recurrent neural networks (LSTM-RNN) and convolutional neural networks (CNN), which represent two of the main classes of deep learning models used for biological sequence data [14]. LSTM-RNNs and CNNs both stem from standard neural networks, where information is passed along neurons that contain learnable weights and biases; however, there are fundamental differences in how the information is processed. LSTM-RNN layers contain loops, enabling information to be retained from one step to the next and allowing models to efficiently correlate a sequential order with a given output; CNNs, on the other hand, apply learnable filters to the input data, allowing them to efficiently recognize spatial dependencies associated with a given output. Model architecture and hyperparameters (Figures 4a, c) were selected by performing a grid search across various parameters (LSTM-RNN: nodes per layer, batch size, number of epochs and optimizing function; CNN: number of filters, kernel size, dropout rate and dense layer nodes) using a k-fold cross-validation of the data set. All models were built to assess their accuracy and precision in classifying binders and non-binders from the available sequencing data.
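The one-hot encoding described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used in the study: the alphabetical column order and the example sequence are assumptions made for the sketch.

```python
# Minimal one-hot encoder: rows are sequence positions, columns are residues.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # column order is an assumption
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return a len(sequence) x 20 matrix with a single 1 per row."""
    matrix = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[AA_INDEX[residue]] = 1
        matrix.append(row)
    return matrix


example = "ACDEFGHIKL"  # hypothetical 10-residue sequence, not from the data set
encoded = one_hot_encode(example)
print(len(encoded), "x", len(encoded[0]))  # 10 x 20
```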
70% of the original data set was used to train the models and the remaining 30% was split into two test data sets used for model evaluation: one test data set contained the same class split of sequences used to train the model and the other contained a class split of approximately 10/90 binders/non-binders to resemble physiological frequencies (Figure 3b). Performance of the LSTM-RNN and CNN was assessed by constructing receiver operating characteristic (ROC) curves and precision-recall (PR) curves derived from predictions on the unseen test data sets (Figure 4b, d). Based on conventional approaches to training classification models, the data set was adjusted to allow for a 50/50 split of binders and non-binders during training. Under these training conditions, the LSTM-RNN and CNN were both able to accurately classify unseen test data (ROC curve AUC: 0.9 ± 0.0, average precision: 0.9 ± 0.0, Supplementary Fig. 6).
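The partitioning described above can be sketched as follows. This is an illustrative sketch only: the function name, the fixed seed, the equal division of the held-out 30% and the downsampling rule used to reach roughly 10% binders are all assumptions, since the text does not specify how the two test sets were drawn.

```python
import random


def split_dataset(binders, non_binders, train_fraction=0.7, seed=0):
    """Split labelled sequences into a training set and two test sets.

    Returns (train, test_same_split, test_10_90) as lists of (sequence, label).
    """
    rng = random.Random(seed)
    labelled = [(s, 1) for s in binders] + [(s, 0) for s in non_binders]
    rng.shuffle(labelled)

    n_train = round(len(labelled) * train_fraction)
    train, held_out = labelled[:n_train], labelled[n_train:]

    # First half of the held-out data keeps the shuffled (training-like) class split.
    half = len(held_out) // 2
    test_same_split = held_out[:half]

    # Second half: keep all non-binders and at most 1 binder per 9 non-binders.
    pool = held_out[half:]
    positives = [r for r in pool if r[1] == 1]
    negatives = [r for r in pool if r[1] == 0]
    n_positives = min(len(positives), len(negatives) // 9)
    test_10_90 = positives[:n_positives] + negatives
    return train, test_same_split, test_10_90


# Hypothetical identifiers standing in for CDRH3 sequences.
toy_binders = [f"B{i}" for i in range(300)]
toy_non_binders = [f"N{i}" for i in range(700)]
train, test_same_split, test_10_90 = split_dataset(toy_binders, toy_non_binders)
```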
Next, we used the trained LSTM-RNN and CNN models to classify a random sample of 1 × 10^5 sequences from the potential sequence space. We observed, however, an unexpectedly high occurrence of positive classifications (25,318 ± 1,643 sequences or 25.3 ± 1.6%, Supplementary Table 3b). With the knowledge that the physiological frequency of binders should be approximately 10-15%, we sought to adjust the classification split of the training data with the hypothesis that models were subject to some unknown classification bias. Additional models were then trained on classification splits of both 20/80 and 10/90 binders/non-binders, as well as a classification split with all available data (approximately 30/70 binders/non-binders). Unbalancing the sequence classification led to a significant reduction in the percentage of sequences classified as binders, but also led to a reduction in model performance on the unseen test data (Supplementary Fig. 4-7, Supplementary Tables 3a, b). Through our analysis, we concluded that the optimal data set for training the models was the set inclusive of all known CDRH3 sequences for the following reasons: 1) the percentage of sequences predicted as binders reflects the physiological frequency, 2) this data set maximizes the information the model sees, and 3) the models performed well on both test data sets. Final model architecture, parameters, and evaluation are shown in Figure 4. As a final measure of model validation, neural networks were trained with a data set containing randomly shuffled binding and non-binding class labels. These networks classified sequences in the unseen test data indiscriminately (Supplementary Fig. 8), indicating that networks trained with properly classified data had identified learned patterns.
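The label-shuffling control can be illustrated with a short sketch. It assumes labels are stored as a list of 0/1 values aligned with the sequences; the function name and seed are hypothetical. Shuffling preserves the class balance while breaking the link between sequence and label, so a model trained on the result should perform no better than chance.

```python
import random


def shuffle_labels(labels, seed=0):
    """Return a randomly permuted copy of the class labels."""
    rng = random.Random(seed)
    shuffled = list(labels)   # copy so the original labels are left untouched
    rng.shuffle(shuffled)
    return shuffled


labels = [1] * 3 + [0] * 7          # toy labels: 3 binders, 7 non-binders
control_labels = shuffle_labels(labels, seed=1)
```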
Multi-parameter optimization for developability by in silico screening of antibody sequence space
Using our DMS-based combinatorial mutagenesis library as a guide (Figure 3), 7.2 × 10^7 possible sequence variants were generated in silico. The fully trained LSTM-RNN and CNN models were used to classify all 7.2 × 10^7 sequence variants as either antigen binders or non-binders based on a probability score (P), resulting in a prediction of 8.55 × 10^6 (LSTM-RNN) and 9.52 × 10^6 (CNN) potential binders (P > 0.50). This represented a reasonable fraction (11-13%) of antigen-specific variants based on experimental screening (Figure 3b). To increase confidence, we increased the prediction threshold for binder classification to P > 0.75 and took the consensus binders between the LSTM-RNN and CNN. This reduced the antigen-specific sequence space down to 3.0 × 10^6 variants.
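The consensus step can be sketched as a simple filter over the two sets of model outputs. The function name and the toy probabilities are hypothetical; only the P > 0.75 threshold and the requirement that both models agree come from the text.

```python
def consensus_binders(sequences, p_lstm, p_cnn, threshold=0.75):
    """Keep sequences that both models score above the threshold."""
    return [
        seq
        for seq, p_a, p_b in zip(sequences, p_lstm, p_cnn)
        if p_a > threshold and p_b > threshold
    ]


# Toy example with made-up probability scores.
toy_sequences = ["S1", "S2", "S3", "S4"]
toy_p_lstm = [0.90, 0.80, 0.60, 0.76]
toy_p_cnn = [0.95, 0.70, 0.90, 0.751]
selected = consensus_binders(toy_sequences, toy_p_lstm, toy_p_cnn)
print(selected)  # ['S1', 'S4']
```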
Next, we characterized the 3.0 × 10^6 predicted antigen-specific sequences on a number of parameters. As a first metric, we investigated their sequence similarity to the original trastuzumab sequence by calculating the Levenshtein distance (LD). The majority of sequences showed an edit distance of LD > 4 (Figure 5a). The first step in filtering was to calculate the net charge and hydrophobicity index in order to estimate the molecule’s viscosity and clearance [2]. According to Sharma et al., viscosity decreases with increasing variable fragment (Fv) net charge and increasing Fv charge symmetry parameter (FvCSP); however, the optimal Fv net charge in terms of drug clearance is between 0 and 6.2 with a CDRL1+CDRL3+CDRH3 hydrophobicity index sum < 4.0. Based on the wide range of values for these parameters in the 3.0 × 10^6 predicted variants (Figure 5b, c), we filtered any sequences out that had a Fv net charge > 4.2 and a CDRH3 hydrophobicity index > 4.0, which further reduced the sequence space down to 1.93 × 10^6 variants. We next padded the CDRH3 sequences with 10 amino acids on the 5’ and 3’ ends and then ran these sequences through CamSol, a protein solubility predictor developed by Sormanni et al. [25], which estimates and ranks sequence variants based on their theoretical solubility. The remaining variants produced a wide range of protein solubility scores (Figure 5d) and sequences with a score < 0.2 were filtered out, leaving 2.36 × 10^5 candidates for further analysis. As a last step in our in silico screening process, we aimed at reducing immunogenicity by predicting the peptide binding affinity of the variant sequences to MHC Class II molecules by utilizing NetMHCIIpan, a model previously developed by Jensen et al. [26]. All possible 15-mers from the padded CDRH3 sequences were run through NetMHCIIpan. One output from the model is a given peptide’s % Rank of predicted affinity compared to a set of 200,000 random natural peptides.
Typically, molecules with a % Rank < 2 are considered strong binders and those with a % Rank < 10 are considered weak binders to the MHC Class II molecules scanned. After predicting affinity for HLA alleles DRB1*0101, DRB3*0101, DRB4*0101, DRB5*0101, sequences were filtered out if any of the 15-mers contained a % Rank < 15 (Figure 5e). The average % Rank across all 15-mers for the remaining sequences was then calculated and those with an average % Rank < 70 were also filtered out (Figure 5f). Based on these criteria, there were 40,588 multi-parameter optimized variants (Figure 5g).
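The 15-mer enumeration and the two % Rank criteria can be sketched as below. This sketch does not call NetMHCIIpan; it assumes the % Rank values for each 15-mer (across the alleles scanned) have already been obtained and are passed in as a flat list. Function names are hypothetical, and the thresholds are those stated in the text.

```python
def kmers(sequence, k=15):
    """Return every contiguous k-mer of a sequence (sliding window, step 1)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def passes_rank_filters(percent_ranks, min_rank=15.0, min_mean_rank=70.0):
    """Apply both immunogenicity criteria to one variant's % Rank values."""
    if any(rank < min_rank for rank in percent_ranks):
        return False  # at least one 15-mer is predicted to bind too strongly
    mean_rank = sum(percent_ranks) / len(percent_ranks)
    return mean_rank >= min_mean_rank


# A 10-residue CDRH3 padded by 10 residues on each side is 30 residues long,
# which yields 16 overlapping 15-mers.
padded = "A" * 30  # placeholder sequence, not a real antibody fragment
print(len(kmers(padded)))  # 16
```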
Optimal antibody sequences are recombinantly expressed and antigen-specific
To validate the precision of our fully trained LSTM-RNN and CNN models, we randomly selected a subset of 30 CDRH3 sequences predicted to be antigen-specific and optimized across the multiple developability parameters. To further demonstrate the capacity of deep learning to identify novel sequence variants, we also added the criterion that the selected variants must have a minimum LD of 5 from the original CDRH3 sequence of trastuzumab, resulting in a library of 32,725 sequences to select from. CRISPR-Cas9-mediated HDR was used to generate mammalian display cell lines expressing the 30 different sequence variants. Flow cytometry was performed and revealed that 30 of the 30 variants (100%) were antigen-specific (Figure 6a). Further analysis was performed on 14 of the antigen-binding variants to more precisely quantify the binding kinetics via biolayer interferometry (BLI, FortéBio Octet RED96e) (Figure 6b). The original trastuzumab sequence was measured to have an affinity towards HER2 of 4.0 × 10^-10 M (equilibrium dissociation constant, KD); and although the majority of variants tested had a slight decrease in affinity, 71% (10/14) were still in the single-digit nanomolar range, 21% (3/14) remained sub-nanomolar, and one variant (7%) showed a near 3-fold increase in affinity compared to trastuzumab (KD = 1.4 × 10^-10 M). We also investigated any correlations between flow cytometry fluorescence intensity and BLI-measured affinity (Supplementary Fig. 9), as well as between model prediction values and measured affinities (Supplementary Fig. 10). While there appears to be an overall increasing trend between fluorescence intensity and binding affinity, there also exist outlying points with low fluorescence signals but high affinity values. Conversely, no observable trend is present when comparing model prediction values to binding affinities; however, the highest-affinity variants do tend to have higher prediction values.
Figure 6c displays the 30 tested sequence variants along with their associated developability and affinity metrics.
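The edit-distance criterion used to select novel variants relies on the Levenshtein distance. A standard dynamic-programming implementation is sketched below for reference; it is a generic textbook version and not necessarily the implementation used in the study.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

For CDRH3 variants of equal length that differ only by substitutions, the distance is at most the number of mismatched positions.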
DISCUSSION
Addressing the limitation of antibody optimization in mammalian cells, we have developed an approach based on deep learning that enables us to identify antigen-specific sequences with high precision. Using the clinically approved antibody trastuzumab, we performed single-site DMS followed by combinatorial mutagenesis to determine the antigen-binding landscape of CDRH3. This DMS-based mutagenesis strategy is crucial for attaining high-quality training data that is enriched with antigen-binding variants, in this case nearly 10% of our library (Figure 3b). In contrast, if a completely randomized combinatorial mutagenesis strategy were employed (i.e., NNK degenerate codons), it would be unlikely to produce any significant fraction of antigen-binding variants. In the future, other approaches to mutagenesis that generate enriched training data [27], such as shotgun scanning mutagenesis [28], binary substitution [29] and recombination [12,30], may also be explored for training deep neural networks.
A remarkable finding in this study was that experimental screening of a library of only 5 × 10^4 variants, which reflected a tiny fraction (0.0054%) of the total sequence diversity of the DMS-based combinatorial mutagenesis library (7.17 × 10^8), was capable of training accurate neural networks. This suggests that physical library size limitations of mammalian expression systems (or other expression platforms such as phage and yeast) and deep sequencing read depth will not serve as a limitation in deep learning-guided protein engineering. Another important result was that deep sequencing of antigen-binding and non-binding populations showed nearly no observable difference in their positional amino acid usage (Figure 3c), suggesting that neural networks are effectively capturing non-linear patterns/interactions.
In the current study, we selected LSTM-RNNs and CNNs as the basis of our classification models, as they represent two state-of-the-art approaches in deep learning. Other machine learning approaches such as k-nearest neighbors, random forests, and support vector machines are also well suited to identifying complex patterns from input data, but as data set sizes continue to grow, as is realizable with biological sequence data, deep neural networks tend to outperform these classical techniques [15]. Furthermore, deep generative modeling methods such as variational autoencoders may also be used to explore the mutagenesis sequence space from directed evolution [31].
We generated approximately 7.2 × 10^7 CDRH3 variants in silico from DMS-based combinatorial diversity and used fully trained LSTM-RNN and CNN models to classify each sequence as a binder or non-binder. The 7.2 × 10^7 sequence variants comprise only a subset of the potential sequence space and were chosen to minimize the computational effort; however, they still represent a library size several orders of magnitude greater than what is experimentally achievable in mammalian cells. We envision extending the screening capacity through script optimization and parallel computing on high-performance clusters. Out of all variants classified, the LSTM-RNN and CNN predicted approximately 11-13% to bind the target antigen, in close agreement with the frequencies observed experimentally by flow cytometry (Figure 3b). With the exception of critical residues determined by DMS, the majority of predicted binders were substantially distant from the original trastuzumab sequence, with 80% of sequences having an edit distance of at least 6 residues. This high degree of sequence variability indicated the potential for a wide range of biomolecular properties.
Once an antibody’s affinity for its target antigen is within a desirable range for efficacious biological modification, addressing other biomolecular properties becomes the focus of antibody development. With recent advances in computational predictions [32,33], a number of these properties, including viscosity, clearance, stability [2], specificity [34], solubility [25] and immunogenicity [26], can be approximated from sequence information alone. With the aim of selecting antibodies with improved characteristics, we subjected the library of predicted binders to a number of these in silico approaches in order to provide a ranking structure and filtering strategy for developability (Figure 5). After implementing these methods to remove variants with a high likelihood of having poor viscosity, clearance or solubility, as well as those with high immunogenic potential, over 40,000 multi-parameter optimized antibody variants remained. It is interesting to note that a considerable number of sequences scored even better than the original trastuzumab sequence. In future work, more stringent or additional filters that address other developability parameters (e.g. stability, specificity, humanization) could be applied to further reduce the sequence space down to highly developable therapeutic candidates. For instance, previous studies have investigated the likeness of therapeutic antibodies to the human antibody repertoire [35].
Lastly, to experimentally validate the precision of neural networks to predict antigen specificity, we randomly selected and expressed 30 variants from the library of optimized sequences with a minimum edit distance of 5 from trastuzumab. The precision of the LSTM-RNN and CNN models was estimated to be ~85% each (at P > 0.75) according to predictions made on the test data sets (Figure 4b, d). By taking the consensus between models, however, we experimentally validated that all 30 randomly selected sequences (30/30) that were predicted to bind antigen and passed the developability filters were indeed binders, several of which had high affinity. While we anticipate false positives would be discovered by increasing the sample size tested, validation of this subset strongly implies that potentially thousands of optimized lead candidates maintain a binding affinity in the range of therapeutic relevance, while also containing substantial sequence variability from the starting trastuzumab sequence. Future work to increase the stringency of selection during screening or a more detailed investigation of correlations between prediction probability and affinity could prove insightful towards retaining high target affinities. We also envision this approach enabling the optimization of other functional properties of therapeutic antibodies, such as pH-dependent antibody recycling [36] or affinity/avidity tuning [37,38]. Additionally, extending this approach to other regions across the variable light and heavy chain genes, namely other CDRs, may yield deep neural networks that are able to capture long-range, complex relationships between an antibody and its target antigen. To understand these patterns in greater depth, it may also prove useful to compare neural network predictions with protein structural modeling predictions [39].
METHODS
Mammalian cell culture and transfection
Hybridoma cells were cultured and maintained according to the protocols described by Mason et al. [21]. Hybridoma cells were electroporated with the 4D-Nucleofector™ System (Lonza) using the SF Cell Line 4D-Nucleofector® X Kit L or X Kit S (Lonza, V4XC-2024, V4XC-2032) with the program CQ-104. Cells were prepared as follows: cells were isolated and centrifuged at 125 × g for 10 minutes, washed with Opti-MEM® I Reduced Serum Medium (Thermo, 31985-062), and centrifuged again with the same parameters. The cells were resuspended in SF buffer (per kit manufacturer guidelines), after which Alt-R gRNA (IDT) and ssODN donor (IDT) were added. All experiments utilized constitutive expression of Cas9 from Streptococcus pyogenes (SpCas9). Transfections of 1×10^6 and 1×10^7 cells were performed in 100 μl single Nucleocuvettes™ with 0.575 or 2.88 nmol Alt-R gRNA and 0.5 or 2.5 nmol ssODN donor, respectively. Transfections of 2×10^5 cells were performed in 16-well, 20 μl Nucleocuvette™ strips with 115 pmol Alt-R gRNA and 100 pmol ssODN donor.
Flow cytometry analysis and sorting
Flow cytometry-based analysis and cell isolation were performed using the BD LSR Fortessa™ (BD Biosciences) and Sony SH800S (Sony), respectively. When labeling with fluorescently conjugated antigen or anti-IgG antibodies, cells were first washed with PBS, incubated with the labeling antibody and/or antigen for 30 minutes on ice, protected from light, washed again with PBS and then analyzed or sorted. The labeling reagents and working concentrations are described in Supplementary Table 4. For cell numbers different from 10^6, the antibody/antigen amount and incubation volume were adjusted proportionally.
Sample preparation for deep sequencing
Sample preparation for deep sequencing was performed similarly to the antibody library generation protocol of the primer extension method described previously [41]. Genomic DNA was extracted from 1-5×10^6 cells using the Purelink™ Genomic DNA Mini Kit (Thermo, K182001). Extracted genomic DNA was subjected to a first PCR step. Amplification was performed using a forward primer binding to the beginning of the VH framework region and a reverse primer specific to the intronic region immediately 3’ of the J segment. PCRs were performed with Q5® High-Fidelity DNA polymerase (NEB, M0491L) in parallel reaction volumes of 50 μl with the following cycle conditions: 98°C for 30 sec; 16 cycles of 98°C for 10 sec, 70°C for 20 sec, 72°C for 30 sec; final extension 72°C for 1 min; 4°C storage. PCR products were concentrated using DNA Clean and Concentrator (Zymo, D4013) followed by 0.8X SPRIselect (Beckman Coulter, B22318) left-sided size selection. Total PCR1 product was amplified in a PCR2 step, which added extension-specific full-length Illumina adapter sequences to the amplicon library. Individual samples were Illumina-indexed by choosing from 20 different index reverse primers. Cycle conditions were as follows: 98°C for 30 sec; 2 cycles of 98°C for 10 sec, 40°C for 20 sec, 72°C for 1 min; 6 cycles of 98°C for 10 sec, 65°C for 20 sec, 72°C for 1 min; 72°C for 5 min; 4°C storage. PCR2 products were concentrated again with DNA Clean and Concentrator and run on a 1% agarose gel. Bands of appropriate size (~550 bp) were gel-purified using the Zymoclean™ Gel DNA Recovery kit (Zymo, D4008). Concentrations of purified libraries were determined by a Nanodrop 2000c spectrophotometer and libraries were pooled at concentrations aimed at optimal read return. The quality of the final sequencing pool was verified on a fragment analyzer (Advanced Analytical Technologies) using the DNF-473 Standard Sensitivity NGS fragment analysis kit. All samples passing quality control were sequenced.
Antibody library pools were sequenced on the Illumina MiSeq platform using the reagent kit v3 (2×300 cycles, paired-end) with 10% PhiX control library. The mean base-call quality (Phred score) of all samples was approximately 34.
Bioinformatics analysis and graphics
The MiXCR v2.0.3 program was used to perform data pre-processing of raw FASTQ files [42]. Sequences were aligned to a custom germline gene reference database containing the known sequence information of the V- and J-gene regions for the variable heavy chain of the trastuzumab antibody gene. Clonotype formation by CDRH3 and error correction were performed as described by Bolotin et al. [42]. Functional clonotypes were discarded if they had 1) a duplicate CDRH3 amino acid sequence arising from PCR errors left uncorrected by MiXCR, or 2) a clone count equal to one. Downstream analysis was performed using R v3.2.2 [43] and Python v3.6.5 [44]. Graphics were generated using the R packages ggplot2 [45], RColorBrewer [46], and ggseqlogo [47].
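The two discard criteria can be illustrated with a small filter. This sketch makes assumptions that are not stated in the text: clonotypes are represented as (CDRH3 amino acid sequence, clone count) pairs, they are sorted by descending count, and for duplicated CDRH3 sequences the first (most abundant) record is retained. The function name is hypothetical.

```python
def filter_clonotypes(clonotypes):
    """Drop singletons and repeated CDRH3 amino acid sequences.

    clonotypes: iterable of (cdrh3_aa, clone_count), assumed sorted by
    descending clone count.
    """
    seen = set()
    kept = []
    for cdrh3, count in clonotypes:
        if count <= 1:        # criterion 2: clone count equal to one
            continue
        if cdrh3 in seen:     # criterion 1: duplicate CDRH3 amino acid sequence
            continue
        seen.add(cdrh3)
        kept.append((cdrh3, count))
    return kept


# Toy records with made-up sequences and counts.
toy_clonotypes = [("AAA", 10), ("CCC", 5), ("AAA", 2), ("DDD", 1)]
filtered = filter_clonotypes(toy_clonotypes)
print(filtered)  # [('AAA', 10), ('CCC', 5)]
```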
Calculation of enrichment ratios (ERs) in DMS
The ER of a given variant was calculated according to previous methods48. The clonal frequency of a variant enriched for antigen specificity by FACS, $f_{i,\mathrm{Ag+}}$, was divided by the clonal frequency of the same variant in the original library, $f_{i,\mathrm{Ab+}}$, according to Equation 1:
$$\mathrm{ER}_i = \frac{f_{i,\mathrm{Ag+}}}{f_{i,\mathrm{Ab+}}} \qquad (1)$$
A minimum value of −2 was assigned to variants with log[ER] values less than or equal to −2, and variants not present in the dataset were disregarded in the calculation. A clone was defined by the exact amino acid sequence of the CDRH3.
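The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, the logarithm is assumed to be base 10 (the text does not state the base), and "not present in the dataset" is interpreted as a variant missing from either sample.

```python
import math
from collections import Counter

LOG_ER_FLOOR = -2.0  # minimum log[ER] value stated in the text


def clonal_frequencies(cdrh3_reads):
    """Return {CDRH3 amino acid sequence: fraction of reads} for one sample."""
    counts = Counter(cdrh3_reads)
    total = sum(counts.values())
    return {seq: n / total for seq, n in counts.items()}


def log_enrichment_ratios(freq_ag_pos, freq_ab_pos, floor=LOG_ER_FLOOR):
    """log10(f_Ag+ / f_Ab+) per variant, floored at `floor`.

    Variants missing from either sample are skipped (an assumption).
    """
    log_er = {}
    for seq, f_ab in freq_ab_pos.items():
        f_ag = freq_ag_pos.get(seq)
        if f_ag is None:
            continue
        log_er[seq] = max(floor, math.log10(f_ag / f_ab))
    return log_er
```

A variant whose frequency is unchanged by sorting receives log[ER] = 0, while strongly depleted variants are all reported at the floor of −2.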
Codon selection for rational library design
Codon selection for rational library design was based on the equation provided by Mason et al.21 (Equation 2), where $Y_{n,\mathrm{deg}}$ represents the amino acid frequency for a given degenerate codon scheme, $Y_{n,\mathrm{target}}$ is the target amino acid frequency, and n is the number of amino acids, 20. Residues identified in the DMS analysis as having a positive enrichment (ER > 1, or log[ER] > 0) were normalized according to their enrichment ratios, converted to theoretical frequencies, and taken as the target amino acid frequencies. Degenerate codon schemes were then selected that most closely reflected these frequencies, as measured by the mean squared error between the degenerate codon frequencies and the target frequencies.
In certain instances, if the selected degenerate codon did not represent desirable amino acid frequencies or contained undesirable amino acids, a mixture of degenerate codons was selected and pooled to achieve better coverage of the functional sequence space.
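The selection procedure described above can be illustrated with a short sketch. It is written under several assumptions that are not taken from the source: target frequencies are obtained by dividing each ER > 1 by the sum of such ERs, stop codons are simply not counted as amino acids, the mean squared error is averaged over all 20 amino acids, and every IUPAC degenerate triplet is considered. All function names are hypothetical.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code, codons ordered T, C, A, G at each position
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def target_frequencies(er_by_aa):
    """Keep residues with ER > 1 and normalize their ERs to sum to 1 (assumed scheme)."""
    enriched = {aa: er for aa, er in er_by_aa.items() if er > 1}
    total = sum(enriched.values())
    return {aa: er / total for aa, er in enriched.items()}


def codon_frequencies(degenerate_codon):
    """Amino acid frequencies encoded by one degenerate codon (stops are not counted)."""
    codons = ["".join(c) for c in product(*(IUPAC[b] for b in degenerate_codon))]
    freqs = dict.fromkeys(AMINO_ACIDS, 0.0)
    for codon in codons:
        aa = CODON_TABLE[codon]
        if aa != "*":
            freqs[aa] += 1 / len(codons)
    return freqs


def mse(deg_freqs, target_freqs):
    """Mean squared error over the 20 amino acids."""
    return sum((deg_freqs[a] - target_freqs.get(a, 0.0)) ** 2
               for a in AMINO_ACIDS) / len(AMINO_ACIDS)


def best_codon(target_freqs):
    """Degenerate codon whose encoded frequencies are closest to the target."""
    schemes = ("".join(s) for s in product(IUPAC, repeat=3))
    return min(schemes, key=lambda s: mse(codon_frequencies(s), target_freqs))
```

Calling `best_codon(target_frequencies(er_by_aa))` for each CDRH3 position would give one candidate scheme per position; pooling several codons, as the text describes for difficult positions, is not covered by this sketch.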
Deep learning model construction
Deep learning models were built in Python v3.6.5. LSTM-RNNs and CNNs were built using the Keras49 v2.1.6 Sequential model as a wrapper for TensorFlow50 v1.8.0. Model architecture and hyperparameters were optimized by performing a grid search of the relevant variables for a given model. These variables included nodes per layer, activation function(s), optimizer, loss function, dropout rate, batch size, number of epochs, number of filters, kernel size, stride length, and pool size. Grid searches were performed with k-fold cross-validation of the data set.
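A grid search with k-fold cross-validation can be outlined in plain Python as follows. This is a generic sketch rather than the authors' implementation: `fit_and_score` is a hypothetical callback that would build and train a model (for example a Keras Sequential model) on the training indices and return a validation score, and the value of k and the scoring metric are not specified in the text.

```python
import random
from itertools import product


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k shuffled folds of near-equal size."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def grid_search(param_grid, n_samples, fit_and_score, k=5):
    """Return (best_params, best_mean_score) over every combination in param_grid."""
    best_params, best_score = None, float("-inf")
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        # Average the validation score over the k folds
        scores = [fit_and_score(params, train, val)
                  for train, val in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

The number of model fits grows as the product of all grid dimensions times k, which is why the searched variables are usually restricted to those relevant for a given architecture.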
Deep learning model training and testing
Data sets for antibody-expressing, non-binding, and binding sequences (sequencing statistics: Supplementary Tables 1, 2) were aggregated to form a single binding/non-binding data set in which antibody-expressing sequences were classified as non-binders unless they were also identified among the binding sequences. Sequences from one round of antigen enrichment were excluded from the training data set. The complete, aggregated data set was then randomly shuffled, and sequences of the appropriate class were removed to achieve the desired classification ratio of binders to non-binders (50/50, 20/80, 10/90, and non-adjusted). The class-adjusted data set was further split into a training set (70%) and two testing sets (15% each), where one test set reflected the classification ratio used for training and the other reflected a classification ratio of approximately 10/90 to resemble the physiologically expected frequency of binders.
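The labeling, class adjustment, and splitting steps can be sketched as below. The function names are hypothetical, downsampling is assumed to be uniform at random, and the second test set is assumed to be obtained by downsampling the final 15% to roughly 10/90; the text does not specify how either step was implemented.

```python
import random


def label_sequences(expressing, binding):
    """Expressing sequences not found among the binders are labeled non-binders."""
    binders = set(binding)
    non_binders = set(expressing) - binders
    return sorted(binders), sorted(non_binders)


def adjust_class_ratio(binders, non_binders, binder_fraction, rng):
    """Downsample so binders make up about binder_fraction (0 < fraction < 1)."""
    n_total = min(len(binders) / binder_fraction,
                  len(non_binders) / (1 - binder_fraction))
    n_bind = min(len(binders), round(n_total * binder_fraction))
    n_non = min(len(non_binders), round(n_total * (1 - binder_fraction)))
    data = [(seq, 1) for seq in rng.sample(binders, n_bind)]
    data += [(seq, 0) for seq in rng.sample(non_binders, n_non)]
    rng.shuffle(data)
    return data


def split_data(data, rng, train=0.70, test=0.15, second_test_fraction=0.10):
    """Return (training set, test set at training ratio, test set at ~10/90)."""
    n_train = round(len(data) * train)
    n_test = round(len(data) * test)
    train_set = data[:n_train]
    test_same_ratio = data[n_train:n_train + n_test]
    remainder = data[n_train + n_test:]
    test_rebalanced = adjust_class_ratio(
        [seq for seq, label in remainder if label == 1],
        [seq for seq, label in remainder if label == 0],
        second_test_fraction, rng)
    return train_set, test_same_ratio, test_rebalanced
```

For the non-adjusted condition, the labeled sequences would be shuffled and passed to `split_data` without calling `adjust_class_ratio` first.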
In silico sequence classification and sequence parameters
All possible combinations of amino acids present in the DMS-based combinatorial mutagenesis libraries were used to calculate the total theoretical sequence space of 7.17 × 10^8. A total of 7.2 × 10^7 sequence variants were generated in silico by taking all possible combinations of the amino acids used per position in the combinatorial mutagenesis library designed from the DMS data following three rounds of enrichment for antigen-binding variants (Supplementary Fig. 2c, 3c); alanine was also included at position 103. All in silico sequences were then classified as binders or non-binders by the trained LSTM-RNN and CNN models. Sequences were selected for further analysis if they were classified as binders by both models with a prediction probability (P) of more than 0.75.
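The enumeration and consensus filtering can be sketched as follows. The per-position amino acid sets in the usage lines are toy placeholders rather than the library design, and `p_rnn` and `p_cnn` are hypothetical callables standing in for the trained models' predicted binding probabilities.

```python
from functools import reduce
from itertools import product
from operator import mul


def library_size(allowed_per_position):
    """Theoretical sequence space: product of the number of residues per position."""
    return reduce(mul, (len(residues) for residues in allowed_per_position), 1)


def enumerate_library(allowed_per_position):
    """Lazily generate every combination of the allowed residues."""
    for combo in product(*allowed_per_position):
        yield "".join(combo)


def select_predicted_binders(sequences, p_rnn, p_cnn, threshold=0.75):
    """Keep sequences that both models score above the probability threshold."""
    return [seq for seq in sequences
            if p_rnn(seq) > threshold and p_cnn(seq) > threshold]


# Toy example with three positions (placeholder residue sets)
toy_positions = ["AS", "RK", "WYF"]
toy_library = list(enumerate_library(toy_positions))
```

Generating sequences lazily matters at the scale described in the text, since tens of millions of variants are better scored in batches than held in memory at once.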
The Fv net charge and Fv charge symmetry parameter (FvCSP) were calculated as described by Sharma et al. Briefly, the net charge was determined by first solving the Henderson-Hasselbalch equation for each residue at a specified pH (here 5.5) with known amino acid pKas51. The sum across all residues was then calculated as the Fv net charge. The FvCSP was calculated by taking the product of the VL and VH charges. The hydrophobicity index (HI) was also calculated as described by Sharma et al., according to the following equation: $\mathrm{HI} = -\left(\sum n_i E_i / \sum n_j E_j\right)$, where E represents the Eisenberg value of an amino acid, n is the number of occurrences of that amino acid, and i and j index hydrophobic and hydrophilic residues, respectively.
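A minimal sketch of these three parameters is given below. It rests on assumptions that should be checked against the cited references before use: the side-chain pKa values are illustrative and may differ from those of ref. 51, terminal groups are ignored, the hydrophobicity values are the Eisenberg consensus scale as commonly tabulated, and residues are assigned to the hydrophobic or hydrophilic group by the sign of their Eisenberg value, which may differ from the grouping used by Sharma et al.

```python
# Illustrative side-chain pKa values (assumed, not taken from ref. 51)
BASIC_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
ACIDIC_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}

# Eisenberg consensus hydrophobicity scale as commonly tabulated
EISENBERG = {
    "A": 0.62, "R": -2.53, "N": -0.78, "D": -0.90, "C": 0.29,
    "Q": -0.85, "E": -0.74, "G": 0.48, "H": -0.40, "I": 1.38,
    "L": 1.06, "K": -1.50, "M": 0.64, "F": 1.19, "P": 0.12,
    "S": -0.18, "T": -0.05, "W": 0.81, "Y": 0.26, "V": 1.08,
}


def net_charge(sequence, ph=5.5):
    """Sum of Henderson-Hasselbalch side-chain charges at the given pH."""
    charge = 0.0
    for aa in sequence:
        if aa in BASIC_PKA:
            charge += 1 / (1 + 10 ** (ph - BASIC_PKA[aa]))
        elif aa in ACIDIC_PKA:
            charge -= 1 / (1 + 10 ** (ACIDIC_PKA[aa] - ph))
    return charge


def fv_charge_parameters(vh, vl, ph=5.5):
    """Return (Fv net charge, FvCSP), where FvCSP is the product of VH and VL charges."""
    q_vh, q_vl = net_charge(vh, ph), net_charge(vl, ph)
    return q_vh + q_vl, q_vh * q_vl


def hydrophobicity_index(sequence, scale=EISENBERG):
    """HI = -(sum over hydrophobic residues / sum over hydrophilic residues)."""
    hydrophobic = sum(scale[aa] for aa in sequence if scale[aa] > 0)
    hydrophilic = sum(scale[aa] for aa in sequence if scale[aa] < 0)
    return -(hydrophobic / hydrophilic)  # undefined if no hydrophilic residue
```

A negative FvCSP indicates that VH and VL carry opposite net charges at the chosen pH, whereas a positive value indicates charges of the same sign.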
The protein solubility score was determined for each full-length CDRH3 sequence (15 a.a.) padded with 10 amino acids on both the 5’ and 3’ ends (35 a.a.) by the CamSol method25 at pH 7.0.
The binding affinities for HLA alleles DRB1*0101, DRB3*0101, DRB4*0101, and DRB5*0101 were determined for each 15-mer contained within the 10 amino acid padded CDRH3 sequence (35 a.a.) by NetMHCIIpan v3.2 (ref. 26). For each 15-mer, the output provides a predicted affinity in nM and the % Rank, which reflects the 15-mer’s affinity compared to a set of random natural peptides. The % Rank measure is unaffected by the inherent bias of certain MHC molecules towards stronger or weaker predicted affinities and is used to classify peptides as weak or strong binders towards the specified MHC Class II allele.
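The peptide windows submitted for prediction can be enumerated as in the sketch below: a 35 a.a. padded sequence contains 35 − 15 + 1 = 21 overlapping 15-mers. The classification helper is illustrative only; its function name is hypothetical and the % Rank cut-offs of 2 and 10 are assumed placeholder defaults that should be checked against the prediction tool's documentation.

```python
def sliding_peptides(sequence, k=15):
    """All overlapping k-mers of a sequence, in order from the first residue."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def classify_by_rank(percent_rank, strong=2.0, weak=10.0):
    """Label a peptide from its % Rank (cut-offs are assumed placeholders)."""
    if percent_rank <= strong:
        return "strong"
    if percent_rank <= weak:
        return "weak"
    return "non-binder"


def worst_case_label(percent_ranks):
    """Label a variant by its lowest % Rank across all 15-mers and alleles."""
    return classify_by_rank(min(percent_ranks))
```

Summarizing a variant by its lowest % Rank is one conservative choice for an immunogenicity filter; other summaries, such as the mean rank across alleles, are equally possible.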
Affinity measurements by biolayer interferometry
Monoclonal populations of the individual variants were isolated by performing a single-cell sort. Following expansion, supernatant for all variants was collected and filtered through a 0.20 μm filter (Sartorius, 16534-K). Affinity measurements were then performed on an Octet RED96e (FortéBio) with the following parameters: anti-human capture sensors (FortéBio, 18-5060) were hydrated in conditioned media diluted 1 in 2 with kinetics buffer (FortéBio, 18-1105) for at least 10 minutes before conditioning through 4 cycles of regeneration, each consisting of 10 seconds of incubation in 10 mM glycine, pH 1.52 and 10 seconds in kinetics buffer. Conditioned sensors were then loaded with 0 μg/mL (reference sensor), 10 μg/mL trastuzumab (reference sample), or hybridoma supernatant (approximately 20 μg/mL) diluted 1 in 2 with kinetics buffer, followed by blocking with mouse IgG (Rockland, 010-0102) at 50 μg/mL in kinetics buffer. After blocking, loaded sensors were equilibrated in kinetics buffer and incubated with either 5 nM or 25 nM HER2 protein (Sigma-Aldrich, SRP6405-50UG). Lastly, sensors were incubated in kinetics buffer to allow antigen dissociation. Kinetic analysis was performed with the Data Analysis HT software v11.0.0.50.
AUTHOR CONTRIBUTIONS
D.M.M., S.F., C.R.W. and S.T.R. developed the methodology; D.M.M. and S.T.R. designed the experiments and wrote the manuscript; D.M.M., C.R.W. and S.F. analyzed sequencing data and performed deep learning analysis; C.J. generated in silico libraries; D.M.M. performed experiments; B.W. and S.M.M. performed cell line development.
COMPETING INTERESTS
ETH Zurich has filed for patent protection on the technology described herein, and D.M.M., S.F., C.R.W., and S.T.R. are named as co-inventors on this patent (United States Patent and Trademark Office Provisional Application: 62/831,663).
ACKNOWLEDGEMENTS
We acknowledge the ETH Zurich D-BSSE Single Cell Unit and the Genomics Facility Basel for support, in particular M. Di Tacchio, A. Gumienny, E. Burcklen, and C. Beisel. We also thank the Vendruscolo Lab (Cambridge, UK), in particular P. Sormanni, for assistance with implementing the CamSol method on large libraries, as well as the group of Prof. Morten Nielsen (DTU, Denmark) for providing an easy-to-use package for MHC Class II affinity predictions. Funding was provided by the National Competence Center for Research on Molecular Systems Engineering.