Abstract
Next generation sequencing is widely used to characterize genetic diversity in a sample, yet is hindered by its relatively low resolution. Particularly, detecting rare genetic variants in clinical samples of viruses is still nearly impossible. Here we describe AccuNGS, an approach that combines error reduction in each sequencing stage with in silico error elimination, which enables detection of variants as rare as 1:10,000 or lower. We thoroughly explore AccuNGS background errors and reveal they are mostly generated in the sequencer itself. We demonstrate that as opposed to common assumptions, Illumina paired-end reads are not independent. After applying AccuNGS to an HIV sample taken during acute infection, we reveal that the vast majority of transition variants in the sample segregate at ultra-low frequencies, rendering them undetectable by standard sequencing. These results highlight the early rich accumulation of genetic diversity during viral infection at depths previously unseen.
INTRODUCTION
Recent advances in high-throughput nucleic acid sequencing have revolutionized our ability to identify the prevalence of minor traits in a heterogeneous sample. Identification of rare single nucleotide variants (SNVs) is important in diverse disciplines spanning post-transcriptional modifications, cancer genetics, non-invasive prenatal diagnoses and microbiology (1-4). SNV identification in virus populations is currently at the heart of many studies monitoring drug resistance, estimating mutation rates, quantifying standing genetic variation and predicting the fitness costs of single mutations (4-10). Reliable quantification of such variants in clinical specimens requires high template recovery, sufficient sequencing depth, and discrimination of real minor variants from the background errors of the sequencing process (11,12). However, standard next generation sequencing (NGS) protocols may result in significant background error rates. In fact, following typical post-processing of NGS data, mutations at frequencies lower than 1-5% are discarded, drastically limiting rare SNV identification (9,13,14). This limitation of NGS has recently been pointed out as one of the major gaps in using genotyping to survey resistance mutations and understand HIV treatment failure (15,16).
In the past few years, several innovative experimental approaches were suggested to reduce the background error rates of the NGS process: rolling-circle-based redundant coding of the amplified fragments (1,17-19); consensus sequencing of barcoded genomic fragments (19-24); error reduction by overlapping paired reads in paired-end sequencing (25-27); and usage of improved polymerases (28). Complementary to the library preparation methods, several computational methods were created to facilitate discrimination of true variants from process errors, based on systematic background error modeling (24,29-33). Apart from the usage of overlapping read pairs (ORP), most experimental methods described above are designed for samples with high biomass and are inapplicable for sequencing of viruses from clinical samples, where the biomass of the viruses may be extremely low. Furthermore, these experimental protocols may provide accuracy at the cost of increased technical complexity, and may introduce their own artifacts to the sequencing process (34). On the variant calling side, most variant calling programs model strand-specific sequencing bias but do not properly incorporate the information from the paired region that may have different error characteristics (35). Moreover, it has been suggested that these well-established variant callers do not perform well on clinical virus samples (36).
We therefore sought to create a simple and highly accurate sequencing protocol that will be suitable for sequencing of clinical samples, with a special focus on low-biomass samples of RNA viruses. The motivation was based on the fact that many of these viruses replicate at huge census population sizes with high mutation rates of ~10−4-10−5 mutations/base/replication, so viral populations from clinical samples are expected to contain very high levels of heterogeneity (5,37,38). We set out to perform a step-by-step optimization of NGS protocols with an emphasis in each step on high accuracy and high yield. The resulting optimized protocol termed “AccuNGS” is a simple and rapid sequencing method and includes an associated variant caller, which if combined can reliably detect ultra-rare variants at frequencies of 1:10,000 or lower. Using homogenous DNA and RNA samples, we characterized the typical error landscape of the AccuNGS protocol, pinpointed the potential sources that generate errors in our protocol and suggest potential solutions. We applied our method to an RNA sample taken from a patient recently infected with HIV-1 (acute stage, seronegative) to examine the breadth of accumulation of mutations in the viral population in this critical period of the infection.
RESULTS
Protocol overview
The AccuNGS protocol was designed to enable targeted sequencing of a desired genomic region from clinical samples with maximal yield, high accuracy and rapid implementation. It combines several concepts, some of which were previously reported separately. The principles of our protocol are: (i) use of high-fidelity polymerases during reverse transcription (RT) and amplification, specifically either the SuperScript III or SuperScript IV RT enzymes (reported mean error rate near 5×10−5 (29,39)) and the Platinum SuperFi DNA polymerase (error rate near 5×10−7, (40) and manufacturer at https://www.thermofisher.com/); (ii) significant reduction of sequencer errors by overlapping paired-end sequencing; and (iii) a specialized base calling bioinformatics pipeline that incorporates mutation-specific and locus-specific distributions of background errors (see Methods and Fig. 1). Based on the above information we calculated theoretical means of 1.78×10−5 and 6.78×10−5 errors per base introduced in our AccuNGS protocol for DNA and RNA samples, respectively (Table 1). An auxiliary part of our protocol allows for quantification of the actual number of RNA genomes sequenced using uniquely barcoded primers introduced early in the RT step (the “primer-ID” method) (2,20,21,41,42). This is a critical measure when analyzing the levels of diversity present in clinical samples, since low genetic diversity observed in a sample may stem from a small number of sequenced templates rather than from real reduced diversity in the sample. We note that this manuscript focuses on two measures: the mean error rate, reported when we characterize the method, and a cutoff error rate (based on the gamma distribution of errors), which we report as a measure to be used when performing base-calling. Naturally the cutoff value is higher than the mean error.
AccuNGS error sources analysis at the DNA level
We began by evaluating AccuNGS when a DNA plasmid was used as starting material. Our underlying assumption throughout our working process was that our DNA starting material is homogenous with respect to the theoretical error rate we calculated. This assumption is based on the fact that we used low-copy plasmids that were grown in E. coli, and only a single colony was subsequently sequenced. The mutation rate of E. coli is in the order of 1×10−10 errors/base/replication (23), and sequencing of a single colony ensures only a limited number of replication cycles. Accordingly, error rates in the purified plasmids are expected to be much lower than the expected protocol mean error of ~1×10−5. Thus, errors observed when comparing the results of the sequencing to our known reference sequence reflect errors created by the library preparation or by the sequencing process itself. All samples underwent basecalling using our specialized bioinformatics pipeline and positions were considered for analysis only if their coverage exceeded 100,000 bases per position (see Methods). Table S1 provides statistics about the number of reads and the distribution of miscalls in each sample.
We thus set out to sequence the HIV-1 pLAI.2 plasmid (43) using AccuNGS at its baseline conditions, including 40 cycles of PCR amplification of a target region with a high-fidelity polymerase, followed by Nextera XT library preparation with a high-fidelity polymerase and size selection of a 250bp insert (see Methods and Table 2). We then compared the baseline AccuNGS error rate to the results of a protocol typically used in clinical (and other) settings, where less focus is put on the fidelity of the process (44). Fig. 2 compares the proportion of errors observed on our plasmid sequence under AccuNGS and under standard sequencing, broken down according to type of mutation (base-to-base). Reassuringly, AccuNGS showed a significant improvement of one to two orders of magnitude over the standard sequencing protocol. For example, the mean A>G error rate dropped from 2.6×10−3 to around 9.2×10−5. While this improvement was large, perplexingly the error rate was still an order of magnitude higher than the theoretical error rate we had expected of ~1×10−5. We thus set out to optimize the AccuNGS protocol and to pinpoint the unexpected source introducing errors into the process.
Sources of process errors based on differential sequencing
We next performed a set of sequencing trials, whereby at each trial we tested if removing or changing a specific stage of the protocol alleviates some of the observed errors and improves the fidelity of AccuNGS. Our immediate suspect was the PCR amplification step. Due to the exponential nature of PCR, errors introduced at early stages of the amplification will be carried over and have been reported to create a high background error for ultra-deep sequencing (45,46). Accordingly, we were worried that any misspecification of the error rate of the SuperFi DNA Polymerase we use (Table 1) would lead to an inflation in PCR errors. To test this hypothesis we created a sample with no PCR amplification, by harvesting larger quantities of the plasmid from the bacteria. This led to only a slight decrease in the mean error rate of transition mutations (Fig. 2). We hence concluded that the forty PCR cycles that take place in the PCR amplification prior to library preparation do not explain most of the errors of the AccuNGS protocol.
We next focused on various so-called “chemical” processes that take place in AccuNGS library preparation. Mainly we were concerned that using gel extraction for DNA size selection, particularly UV light exposure, may introduce mutations. Indeed, when replacing gel extraction for the PCR products with magnetic beads extraction, we observed a slight reduction of the mean transition error rates (Fig. 2). On the other hand, this sample showed elevated levels of C:G>A:T errors. These errors are often signatures of oxidative stress and are discussed below. Alternative PCR purification using the Exosap cleanup reagent did not elevate C:G>A:T errors above their levels in the baseline protocol, but its overall error levels were comparable to those of the baseline protocol and thus offered no improvement.
We next tested if the source plasmid itself was the major source of observed errors, focusing on the conditions whereby we grew the plasmid. First we tested if the mutations were accumulated naturally due to lack of selection on the HIV genes by sequencing the antibiotic resistance marker AmpR on pLAI.2 that is presumably under strong selection against mutations. Next we tested if the plasmid was the source of errors by sequencing the highly conserved RpoB gene from E. coli itself. Finally we tested if the errors were introduced by the bacteria during plasmid replication by growing the plasmid in an alternative strain of E. coli (TG1) with a presumably lower mutation rate (47). However, we observed no change in the error rate distribution in any of these conditions, suggesting that the DNA input was not the major source of minor variants.
We next hypothesized that the tagmentation process in the NexteraXT DNA library preparation kit (Illumina) might be the cause of artifacts. We resorted to a home-made tagmentation protocol based on introduction of NexteraXT-compatible adapters and indices via PCR amplification of a 250bp fragment of pLAI.2. Yet again, we observed no significant change in the distribution of errors. Replacing the Illumina MiSeq sequencer with the Illumina NextSeq, which is based on a two-channel sequencing process rather than a four-channel process and a different flow cell, resulted in a small increase in error levels. This suggests that the Illumina MiSeq and the higher throughput HiSeq (which employs the same detection method as MiSeq) are slightly more suitable for AccuNGS sequencing.
After having ruled out all sources of error we could conceive of testing, we were still left with the enigma of what causes the errors observed consistently and reproducibly across all the samples we sequenced. We were left with one condition that we could not alter directly: the sequencing step itself, as discussed next.
Sequencing quality effect
Each base reported by Illumina sequencers is assigned a probability of that base being wrong, termed the Q-score. The range of Q-scores reported from the Illumina MiSeq is 0 to 40, which translates to a probability of an erroneous call between 1 and 0.0001, respectively. In the AccuNGS base calling scheme, we consider only sites where the two overlapping reads reported the same base with an average Q-score of 30 or higher, as in (25). Our original interpretation of overlapping reads, in line with previous works (48,49), was that the base called jointly on both reads has a corrected Q-score approximately equal to the sum of the Q-scores from both reads. Accordingly, if we filter for bases with a corrected Q-score of at least 60, this translates to an error probability <=1×10−6 per base called, far below our theoretical threshold of 1.78×10−5. We set out to test if this is indeed true. First, we determined whether the independent Q-scores indeed reflect what they are supposed to. When inspecting errors observed on one read only with a Q-score of 30 or higher, we found that their maximal frequency was indeed around 10−3 (Fig. S3). Thus, it seems that the individual Q-scores on each read are reflective of the error rates of the process. However, we suspected that the joint Q-scores that we calculate are incorrect, mainly because each reported Q-score is not independent of the Q-score on the mate read. To test this, we examined whether using a more stringent filter criterion improves AccuNGS results. We applied a very stringent quality filtering of Q38 on the AmpR sample that was sequenced to extreme depth of 1,500,000 bases per position (Table S1). The very high coverage and the good quality of the sequencing allowed us to still retain most sites at a coverage of above 100,000 reads per base (Table S1).
We expected that this filtering would improve the results by at most the difference between twice Q30 (Q60, error probability of 1×10−6) and twice Q38 (Q76, error probability of ~2.5×10−8). This difference is far below our observed error rate, and hence we did not expect to see any improvement. Surprisingly, we observed a dramatic reduction in the rates of errors for A:T>G:C miscalls, and a modest reduction for C:G>T:A miscalls (Fig. 3, Table 3 and Table S2). We hence concluded that the assumption that the Q-scores of overlapping reads are independent is incorrect, and that the sequencer itself is the major source of errors in AccuNGS.
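The Q-score arithmetic behind this expectation can be sketched as follows. This is a minimal illustration of the standard Phred relation and of the naive independence assumption that the results above argue against; the function names are hypothetical and are not taken from the AccuNGS pipeline, and the reverse-read base is assumed to be already oriented to the same strand as the forward read.

```python
def phred_to_error_prob(q):
    """Standard Phred relation: Q = -10 * log10(p)."""
    return 10 ** (-q / 10)


def naive_joint_q(q_fwd, q_rev):
    """Joint Q-score IF the two reads erred independently.

    Under independence the error probabilities multiply, so the
    Q-scores add. This is the assumption being tested, not a fact.
    """
    return q_fwd + q_rev


def orp_call(base_fwd, q_fwd, base_rev, q_rev, min_avg_q=30):
    """Keep a base only if both overlapping reads agree on it and
    their average Q-score reaches the threshold; otherwise None."""
    if base_fwd != base_rev:
        return None
    if (q_fwd + q_rev) / 2 < min_avg_q:
        return None
    return base_fwd


if __name__ == "__main__":
    for q in (30, 38):
        joint = naive_joint_q(q, q)
        print(f"Q{q} on both reads -> naive Q{joint}, "
              f"p = {phred_to_error_prob(joint):.2e}")
```

Under independence, both filters would already place the joint error probability far below the observed background, so moving from Q30 to Q38 should have made no visible difference. The observed reduction is what indicates that the mate reads do not err independently.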
Effects of surrounding nucleotides on error rates
Previous studies have indicated that the nucleotides surrounding a called base may influence its propensity to be miscalled (13,28). Indeed, we found that the identity of the surrounding bases sometimes affected the error rate observed with AccuNGS: when focusing on G>A mutations, we observed a higher error when the G was preceded by a C (CpG) and a lower error when the G was preceded by an A (ApG, Fig. S1). The exact same pattern was observed for the reverse complement C>T mutations (data not shown). These phenomena were found in all sequenced samples, suggesting that they are not affected by any of the differential conditions we tested. When analyzing transversion artifacts, we observed a higher error rate for G:C>T:A mutations compared to all other types of transversions. This enrichment was more prominent in some samples than in others and is typically associated with oxidative damage (13,50-53). When characterizing the nucleotide context of the C:G>A:T errors, we found that G>T errors occurred more frequently when the mutated G base was followed by A or another G (GpA\G). As in the G>A transitions, the reverse complement C>A mutations were more frequent when C was preceded by C or T (C\TpC; Fig. S2).
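A context breakdown of this kind can be sketched as below: per-position error frequencies for one mutation type are grouped by the base preceding the mutated base in the reference. The function name, the toy reference sequence and the frequencies are all hypothetical and serve only to illustrate the grouping; they are not data from this study.

```python
from collections import defaultdict
from statistics import mean


def mean_error_by_preceding_base(reference, freqs, ref_base="G"):
    """Group per-position error frequencies by dinucleotide context.

    reference: reference sequence (string, 0-based positions)
    freqs:     {position: observed error frequency} for one mutation
               type whose reference base is `ref_base` (e.g. G>A)
    Returns {dinucleotide: mean frequency}, e.g. {"CG": ..., "AG": ...}.
    """
    grouped = defaultdict(list)
    for pos, freq in freqs.items():
        # Skip the first base (no preceding base) and positions whose
        # reference base does not match the mutation type analyzed.
        if pos == 0 or reference[pos] != ref_base:
            continue
        grouped[reference[pos - 1] + ref_base].append(freq)
    return {context: mean(values) for context, values in grouped.items()}


if __name__ == "__main__":
    toy_reference = "ACGTCGAG"                       # made-up sequence
    toy_g_to_a = {2: 2e-4, 5: 4e-4, 7: 1e-4}          # made-up frequencies
    print(mean_error_by_preceding_base(toy_reference, toy_g_to_a))
```

The same grouping applies to the following base (for the GpA\G context of G>T errors) by indexing `reference[pos + 1]` instead.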
AccuNGS error analysis at the RNA level
Since one of the ultimate goals of the development of AccuNGS was the sequencing of RNA viruses, we set out to characterize how the protocol fares for RNA. In order to obtain a homogenous RNA sample we performed in-vitro transcription of a homogeneous plasmid using T7 polymerase, whose error rate has been approximated in the order of 10−6 (54), an order of magnitude lower than the error rate observed for DNA with AccuNGS. The RNA was then used as input for reverse transcription reaction with random hexamers using SuperScript III, whose mean error rate has been approximated to be between 3.1×10−5 and 6.5×10−5 (29,39). We then proceeded with the AccuNGS protocol for DNA as previously described. In this RNA control sample, we expected the observed errors to be the union of those introduced by the DNA part, those introduced during in-vitro transcription and those introduced during reverse transcription. With Q30 filtering, the RNA control sample indeed showed a higher mean transition error rate of 9.52×10−5 compared to the mean transition error rate of 8.49×10−5 in the DNA control sample (Table S1 and Fig. 4). The difference of 1.03×10−5 is indeed in line with most additional errors in the RNA control sample stemming from the RT step. By using the difference between the medians of these two control samples, we were able to calculate upper bounds on the base-by-base error rates of the RT used in the process, which were found to be lower for some mutation types than previously reported (Table S3).
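The subtraction used above can be written out explicitly. The sketch below uses only the mean rates quoted in the text; the per-mutation bounds in Table S3 are based on medians, which are not reproduced here, and the function name is hypothetical.

```python
def rt_error_upper_bound(rna_control_rate, dna_control_rate):
    """Upper bound on the error rate added by the RNA-specific steps.

    The RNA control carries every error source of the DNA control plus
    in-vitro transcription and reverse transcription, so the excess
    over the DNA control bounds the RT contribution from above.
    """
    return max(0.0, rna_control_rate - dna_control_rate)


if __name__ == "__main__":
    # Observed mean transition error rates at Q30 (values from the text).
    observed = rt_error_upper_bound(9.52e-5, 8.49e-5)
    print(f"observed excess in RNA control: {observed:.2e}")

    # Theoretical means: the RNA value equals the DNA value plus the
    # reported RT error rate of about 5e-5.
    theoretical_rna = 1.78e-5 + 5e-5
    print(f"theoretical RNA mean: {theoretical_rna:.2e}")
```

The observed excess is smaller than the reported RT error range of 3.1×10−5 to 6.5×10−5, which is consistent with the lower-than-reported bounds noted above for some mutation types.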
Clinical sample sequencing - acute HIV-1 infection
We next went on to test our method on direct sequencing of a clinical HIV-1 sample. HIV-1 infections typically begin with one to few viruses, indicating that the virus diversity at the population level at the time of infection is very limited (55). Large population sizes coupled with high mutation rates in the order of 10−5 mutations/base/replication cycle allow the virus to obtain mutations shortly after infection (38,56). However, HIV-1 populations sequenced shortly after infection, while the patient is at acute infection, have shown very limited diversity (57-59). At this time point, most variants in HIV are expected to be at frequencies below 10−3, thus mostly obscured by the common clinical sequencing protocols’ error rates. We obtained a plasma sample from a recently infected HIV-1 patient with laboratory confirmed seroconversion (a negative HIV-1 confirmatory assay followed by a positive test, 2 weeks apart), indicating this patient was likely 15-20 days after infection (so called acute HIV infection (60)). We chose to sequence the gag region of the virus (nearly 1800 bases) as it is mostly under purifying selection, but is also targeted by the HLA component of the adaptive immune system and possibly HLA escape mutations will be seen at this early stage of infection (61). We prepared each RT primer with a unique barcode (“primer ID”), 15 nucleotides long that allowed us to quantify the amount of viruses we have actually sequenced (see Methods and Supplementary Text). We aimed to sequence ~30,000 viruses, which is roughly the inverse of the error rate we obtain with AccuNGS (which is around 1 in 10,000). To this end, we started the protocol with roughly 300,000 viruses, as estimated from the viral load of the sample. Based on RT processivity, we estimated that 10% of the viruses would hence be sequenced (62).
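The template arithmetic of this design can be sketched as follows. The helper names are hypothetical, barcodes are assumed to be drawn uniformly at random, and counting templates as distinct barcodes is a simplification: real primer-ID analyses usually also filter barcodes by read support (see Supplementary Text for the procedure used here).

```python
import math


def barcode_space(length=15, alphabet_size=4):
    """Number of distinct barcodes of a given length."""
    return alphabet_size ** length


def expected_templates(input_genomes, rt_efficiency):
    """Genomes expected to be reverse transcribed and sequenced."""
    return input_genomes * rt_efficiency


def expected_distinct_barcodes(n_templates, space):
    """Expected number of distinct barcodes among n random draws:
    K * (1 - (1 - 1/K)^n), computed stably for large K."""
    return -space * math.expm1(n_templates * math.log1p(-1.0 / space))


def count_sequenced_templates(primer_ids):
    """Simplified template count: number of distinct barcodes seen."""
    return len(set(primer_ids))


if __name__ == "__main__":
    space = barcode_space()                       # 4**15 barcodes
    n = expected_templates(300_000, 0.10)         # values from the text
    print(f"barcode space: {space:,}")
    print(f"expected templates: {n:,.0f}")
    print(f"expected distinct barcodes: "
          f"{expected_distinct_barcodes(n, space):,.2f}")
```

With a 15-nucleotide barcode the space is large enough that, under the uniformity assumption, barcode collisions among tens of thousands of templates are expected to be negligible.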
For background errors control, we amplified and sequenced a clonal pLAI.2 DNA in parallel. The sequenced sample and the control had a median coverage of nearly 400,000 bases per position at Q30, allowing us to filter called bases for a minimum of Q38 for each base in the Forward and Reverse reads. As expected, background error rates in the control plasmid were similar to those obtained by our previous controls. Primer-ID analysis revealed that nearly 16,000 viruses were actually sequenced, so variants identified at frequencies of 10−4 are likely to represent true diversity (see Supplementary Text). We applied our variant caller on the sample using the pLAI control to serve as the background error distribution. Using a p-value cutoff of 0.01 for each variant called (i.e., this variant is at or above the 99th percentile of the gamma distribution for this mutation type), between 40% and 50% of all sequenced positions were identified as containing true transition mutations (Fig. 5A and Table 4). Using the standard sequencing, only four transition mutations would have been identified; AccuNGS revealed that several hundreds of transition variants exist at this time of infection at low frequencies of 10−4-10−3 (Table 4). When analyzing the type of mutations observed at low frequencies, synonymous mutations were the most prevalent, then non-synonymous mutations and then nonsense mutations (Fig. 5B). This suggests that signals of purifying selection can be already captured at this early time-point of infection and also strongly suggests that the variation observed is true. Interestingly, G>A minor variants were more prevalent than all other mutation classes (Fig. 5). Specifically, G>A variants preceded by A (GpA) were the most prevalent among all G>A variants, followed by G>A variants preceded by G (GpG, Fig. S4). 
This is possibly evidence for cytosine deamination activity by host APOBEC enzymes on the minus strand during reverse transcription, reflected as excess G>A mutations in the genome of virus (63). This is also in line with the favored editing context of APOBEC3F (63). However, G>A mutations are also the most common replication error of HIV-RT (56), and we cannot rule out that this drives higher frequencies of G>A as observed here.
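The variant-calling rule described above (a gamma distribution fitted to the background errors of each mutation type, with a p-value cutoff of 0.01) can be sketched as follows. This is an illustration under stated assumptions rather than the AccuNGS implementation: the gamma parameters are estimated here by the method of moments, the fitting procedure actually used is not specified in this section, the function names are hypothetical, and the background frequencies in the usage example are made up.

```python
import math
from statistics import mean, pvariance


def fit_gamma_moments(background_freqs):
    """Method-of-moments gamma fit (an assumed choice of estimator).
    Returns (shape, scale) with mean = shape * scale."""
    m = mean(background_freqs)
    v = pvariance(background_freqs)
    return m * m / v, v / m


def gamma_cdf(x, shape, scale):
    """Gamma CDF via the power series of the regularized lower
    incomplete gamma function. Adequate for moderate x / scale; a
    statistics library would be preferable in practice."""
    if x <= 0:
        return 0.0
    z = x / scale
    a = shape
    term = 1.0 / a
    total = term
    for _ in range(10000):
        a += 1.0
        term *= z / a
        total += term
        if abs(term) < abs(total) * 1e-15:
            break
    log_prefactor = -z + shape * math.log(z) - math.lgamma(shape)
    return min(1.0, total * math.exp(log_prefactor))


def is_variant(freq, shape, scale, alpha=0.01):
    """Call a variant when its frequency lies above the (1 - alpha)
    quantile of the fitted background distribution."""
    return 1.0 - gamma_cdf(freq, shape, scale) < alpha


if __name__ == "__main__":
    # Made-up background frequencies for one mutation type in a control.
    background = [4e-5, 6e-5, 5e-5, 8e-5, 7e-5, 5e-5, 6e-5, 9e-5]
    shape, scale = fit_gamma_moments(background)
    for freq in (6e-5, 5e-4):
        print(freq, is_variant(freq, shape, scale))
```

Fitting one distribution per mutation type, rather than one per position, pools many observations into each fit, so a single noisy position in the control has less influence on the cutoff.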
DISCUSSION
Application of next generation sequencing on clinical samples is still limited by the ability to reliably capture minor variants (15,16,36). Here we describe AccuNGS, a simple, rapid and optimized experimental protocol and associated computational pipeline for detecting ultra-rare variants from low-biomass clinical RNA and DNA samples. AccuNGS aims to accurately detect minor variants present in a population of genomes at frequencies of 1:10,000 or lower, close to the mutation rate of RNA viruses (5). By performing differential sequencing we demonstrate that as opposed to many sequencing protocols, PCR is not a major source of errors in AccuNGS (45,64), and conversely we suggest that the sequencer is a major source for errors, even when correcting the strand-bias using overlapping read pairs. We show that the mean transition error rate of the protocol is around 7.83×10−5 when filtering for Q30, and 5.80×10−5 when filtering for Q38. These error rates translate to a cutoff of 1.35×10−4 and 1.16×10−4 based on the 95 percentile of a fitted gamma distribution of all transition errors. Notably when focusing on specific types of transitions (mostly T>C and A>G) the cutoff drops to below 1×10−4.
AccuNGS excels especially when the input is low-biomass heterogeneous RNA. Comparable methods for accurate sequencing such as rolling-circle-based methods typically require extremely high-biomass input, making them irrelevant for clinical virus sequencing. Furthermore we and others have observed that such protocols exhibited a relatively high C>T rate [exceeding 10−4 (1,17,18,65), unpublished results]. Such error levels were not recapitulated using AccuNGS, suggesting that these may have been artifacts of the rolling circle approach. A possible alternative for clinical sequencing would be the use of barcoded primers (also known as primer-IDs) during RT reaction, to generate consensus sequences that will correct errors inserted during amplification and sequencing (2). The advantage of the barcoded approaches is that they can also correct for unequal PCR sampling. However, the downside of these approaches is that they require splitting the input sample into numerous reactions, since a barcode has to be attached to each sequencing read (typically spanning 500-600 bases). When the number of initial viruses in the sample is not huge (as typical for limited clinical samples), this is problematic since only a relatively small number of viruses will be sequenced in each reaction. AccuNGS is only limited by the capacity of the RT and PCR reactions (i.e., the length of the targeted sequence that undergoes one RT or PCR reaction), which may span several thousand bases. We also noted that error rates in barcoding-based protocols are comparable to AccuNGS (35,66). We do note that in the AccuNGS approach we recommend primer barcoding only on one end of the amplicon, but this is aimed to understand how many RNA templates were actually sequenced rather than for error correction.
The overlapping read pairs (ORP) concept to reduce sequencing errors was first reported by Chen-Harris et al. (25) and further used by PELE-Seq (26). We note that our approach is novel in that it hinges on the combined use of high fidelity enzymes, ORP and a bioinformatics pipeline that compares variant frequencies to the fitted gamma distribution of process errors (24,27,67,68). Indeed, AccuNGS improves over Chen-Harris et al., and we further show that the use of a high fidelity polymerase is key to bringing down the mean transition error rate by approximately 30% (Fig. S5). We further provide here a step-by-step dissection and optimization of the sequencing process. This has allowed us to refute the commonly assumed notion that the forward and reverse reads are independent and can be considered technical replicas [e.g. (48,49,69,70)]. The observed improvement in error rates of A:T>G:C transitions when using a more stringent Q-score filtering threshold strongly indicates that the Illumina chemistry, involving low-fidelity polymerases (71), is a major source of sequencer errors that are only partially reflected in the Q-scores, and are usually not modelled. This is supported by the observation that Taq DNA polymerase, which operates during cluster generation in the sequencer (71), tends to make more errors on A:T>G:C (46,64). The lack of improvement in the reciprocal G:C>A:T process errors in spite of more stringent filtering suggests these may stem from cytosine deamination during thermal cycling, rather than from polymerization during cluster generation (72). Our results also demonstrate that MiSeq suits AccuNGS somewhat better than the newer NextSeq platform, although significant improvement of error rates was also seen on the latter platform. We suggest that the observed error rate of the Q38 samples approaches the limit of detection that can be achieved using any traditional Illumina sequencing protocol.
Hopefully in the future sequencing vendors will create a “high-fidelity” sequencing program that incorporates high fidelity enzymes that could minimize process artifacts.
We further note that modeling process errors has several advantages over position-specific error models (25,30). First, position-specific error models do not perform well when the consensus base in the sample differs from the consensus base in the background homogeneous control. Second, the stochastic nature of process errors at a given position may result in significant differences between technical replicates, highlighting the fragility of an error estimation based on a single base observation. When possible, we strongly recommend using a homogenous control that is as similar as possible to the samples at hand, since this directly allows detecting loci that are highly error prone. In the absence of such a control, any homogenous control (e.g., a plasmid) is useful to control for process errors.
Performing our benchmarking on relatively long genomic regions allowed us to find that some errors tend to occur more in specific contexts than in others. We find that C:G>T:A mutations, that often arise from spontaneous cytosine deamination during thermal cycling (53,64,72), tend to occur when in CpG context (for both C and G), whereas they are less frequent when in CpT (for C>T) and ApG (for G>A) contexts. Similarly, we find that C:G>A:T mutations, which often stem from the formation of 8-oxo-Guanine under oxidative stress (53), are more prevalent in specific contexts. AccuNGS incorporates this observed bias into its variant calling method, based on more-specific background errors distributions.
We used AccuNGS to characterize HIV-1 diversity shortly after infection. To the best of our knowledge, the immediate evolution of HIV-1 following a new infection has been rarely observed (57-59), mainly due to the lack of resolution associated with the common sequencing protocols.
We demonstrate that AccuNGS captures minor transition mutations that mostly segregate at frequencies between 1:100 and 1:10,000, and match the expected properties of real biological variants such as different rates for silent, missense and nonsense mutations. We found that in this sample, G>A mutations were three times more prevalent than C>U mutations (Fig. 5B). Although G>A is the most abundant type of mutation induced by the viral reverse transcriptase (6), the high level of observed minor frequencies exceeded our predictions, given the early time point. This may be partially explained by activity of APOBEC3 editing enzymes acting on the antisense strand of the HIV reverse transcribed genome. The most common APOBEC3 signature observed in our clinical sample (characterized by GpA>ApA mutations) was that of APOBEC3A/D/F/H, unlike a previous study that highlighted the contribution of APOBEC3G to the observed G>A variants in HIV-1 proviral DNA (characterized by GpG>ApG, (73)). It is possible that in our patient there are variants of APOBEC3 that are more active (74), or that in this patient, the viral Vif protein (that counteracts host APOBEC3 proteins) lost the activity against some APOBEC3 enzymes (75). Future studies spanning more patients may elucidate this issue.
Summary
To summarize, we anticipate that using AccuNGS will be highly useful in detecting previously uncharacterized genetic diversity in biological samples. The ease of use of this approach should make it highly amenable for many different studies.
MATERIALS AND METHODS
Ethics declaration
The study was approved by the local institutional review board of the Sheba Medical Center (approval number SMC 1765-14) and of Tel-Aviv University. Written informed consent for retention and testing of residual plasma samples was provided by the patients.
Preparation of plasmids
In order to keep the plasmid stock as homogenous as possible, plasmids were transformed into chemically competent bacterial cells [DH5alpha (BioLab, Israel) or TG1 (a kind gift from Itai Benhar, Tel Aviv University, Tel Aviv, Israel)] by a standard heat-shock protocol. Based on the fact that the Escherichia coli doubling time is 20 min on average in rich growth medium (76), a single colony was selected and grown for a maximum of 100 generations. Plasmids were column purified (HiYield™ Plasmid Mini Kit, RBC Bioscience) and stored at -20°C until use.
Construction of baseline control amplicon
The baseline control amplicon was based on clonal amplification and sequencing of the pLAI.2 plasmid, which contains a full-length HIV-1LAI proviral clone (43) (obtained through the NIH AIDS Reagent Program, Division of AIDS, NIAID, NIH: pLAI.2 from Dr. Keith Peden, courtesy of the MRC AIDS Directed Program). The integrase region of pLAI.2 was amplified using primers KLV70 - 5’TTC RGG ATY AGA AGT AAA YAT AGT AAC AG and KLV84 - 5’TCC TGT ATG CAR ACC CCA ATA TG (77). Polymerase chain reaction (PCR) amplification was conducted using the high-fidelity Platinum™ SuperFi™ DNA Polymerase (Invitrogen) in a 50 μl reaction using 20-40 ng of the plasmid as input, according to the manufacturer's instructions. Amplification in a thermal cycler was performed as follows: initial denaturation for 3min at 98°C, followed by 40 cycles of denaturation for 20sec at 98°C, annealing for 30sec at 60°C and extension for 1min at 72°C, and final extension for 2min at 72°C. An alternative high-fidelity DNA polymerase used was Q5 High-Fidelity DNA Polymerase (New England Biolabs, NEB). PCR reactions were set up according to each manufacturer’s instructions and run using the PCR program described above. The integrase amplicon was gel purified (Wizard® SV Gel and PCR Clean-Up System, Promega) and its concentration was determined by Qubit fluorometer (Invitrogen), each according to the manufacturer's instructions. The purified product was further used for library construction.
For the AmpR sample, the conserved AmpR gene was amplified from the pLAI.2 plasmid using primers AmpR FW - 5’AAA GTT CTG CTA TGT GGC GC and AmpR RV - 5’GGT CTG ACA GTT ACC AAT GC. PCR amplification was carried out as described above, except that the extension duration was 30sec instead of 1min. Similarly, the conserved RpoB gene was amplified from the bacterial genome using primers RpoB FW - 5’ATG GTT TAC TCC TAT ACC GA and RpoB RV - 5’GTG ATC CAG ATC GTT GGT G and the following PCR program: initial denaturation for 3min at 98°C, followed by 40 cycles of denaturation for 10sec at 98°C, annealing for 10sec at 60°C and extension for 4sec at 72°C, and final extension for 2min at 72°C.
Construction of alternative purification amplicons
The agarose gel purification step of the amplified integrase gene was replaced with other purification methods: (1) for the gel-free sample, the amplified integrase gene was purified using 25 μl of AMPure XP beads (0.5X ratio, Beckman Coulter) according to the manufacturer's instructions; and (2) for the ExoSap sample, 10 μl of the amplified integrase gene were mixed with 4 μl of ExoSap (ExoSAP-IT™ PCR Product Cleanup Reagent, Applied Biosystems) and incubated according to the manufacturer's instructions. No other changes were made to the amplicon generation protocol.
Construction of a PCR free control amplicon
For the PCR-free sample, 10 μg of pLAI.2 plasmid was digested using the restriction enzymes NheI, StuI and XcmI (NEB) according to the manufacturer's instructions. A ~1500bp fragment containing the integrase gene was gel purified and its concentration was determined by Qubit. The purified product was further used for library construction.
Construction of an RNA control amplicon
A plasmid containing the full cDNA of Coxsackie virus B3 (CVB3) under a T7 promoter was a kind gift from Marco Vignuzzi (Institut Pasteur, Paris, France). The plasmid was used to generate an RNA control pool. Ten micrograms of this plasmid were linearized using SalI (NEB), purified by AMPure XP beads (0.5X ratio), and then in-vitro transcribed using T7 RNA polymerase (NEB) according to the manufacturer's instructions. The transcribed RNA was purified using AMPure XP beads (0.5X ratio) and reverse transcribed with random hexamers using SuperScript III Reverse Transcriptase (Invitrogen) according to the manufacturer's instructions. Four microliters of the reverse transcription reaction were used as template for a PCR reaction using primers CVB FW - 5’GGA GAG AAG GTC AAC TCT ATG GAA GC and CVB RV - 5’TAC CAC CCT GTA GTT CCC CA, which amplify a ~1500bp fragment within the CVB genome. The PCR reaction (50 μl total) was set up and amplified using Platinum™ SuperFi™ as follows: initial denaturation for 3min at 98°C, followed by 40 cycles of denaturation for 20sec at 98°C, annealing for 30sec at 60°C and extension for 15sec at 72°C, and final extension for 2min at 72°C. The CVB amplicon was gel purified and its concentration was measured by Qubit. The purified product was further used for library construction.
Construction of clinical HIV-1 amplicon with primer-ID
A plasma sample from a recently diagnosed HIV-1 patient (clinical sample, ID 83530) with an HIV-1 viral load of >1×10^7 copies/ml was provided by the National HIV Reference Laboratory, Chaim Sheba Medical Center, Ramat-Gan, Israel. The mode of HIV-1 transmission for this patient was MSM (men who have sex with men). HIV-1 viral load was determined with the Xpert HIV-1 viral load assay on GeneXpert (Cepheid Inc., Sunnyvale, CA), according to the manufacturer's instructions (78). RNA was extracted from 0.5 ml plasma by NucliSENS easyMAG (bioMérieux, Marcy l’Etoile, France) according to the manufacturer’s protocol, eluted in a final volume of 55 μl and stored at -80°C until use. A primer specific to the Gag gene of HIV-1 was designed with a unique barcode of 15 random (N) bases followed by a linker sequence for subsequent PCR: Gag ID RT - 5’TAC CCA TAC GAT GTT CCA GAT TAC GNN NNN NNN NNN NNN NAC TGT ATC ATC TGC TCC TG TRT CT. Based on the measured viral load and sample concentration, 4 μl (containing the genomes of roughly 300,000 viruses) were taken for the reverse transcription (RT) reaction. RT was performed using SuperScript IV Reverse Transcriptase (Invitrogen) according to the manufacturer's instructions with the following adjustments: (1) in order to maximize primer annealing to the viral RNA, the sample was allowed to cool down gradually from 65°C to room temperature for 10 minutes before it was transferred to ice for 2min; and (2) the reaction was incubated for 30min at 55°C to increase the overall reaction yield. To remove excess primers, the resulting cDNA was purified using AMPure XP beads (0.5X ratio) and eluted with 35 μl nuclease-free water.
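The input estimate above follows from the stated volumes. The sketch below reproduces the arithmetic under two loudly stated assumptions that are not in the protocol: the viral load is taken at its reported lower bound of 1×10^7 copies/ml, and RNA extraction is treated as lossless. Under these assumptions the 4 μl input holds about 360,000 genomes, the same order as the roughly 300,000 quoted above. The variable names are illustrative.

```python
# Values stated in the text.
viral_load_copies_per_ml = 1e7   # reported lower bound (">1x10^7")
plasma_volume_ml = 0.5
elution_volume_ul = 55.0
rt_input_volume_ul = 4.0
barcode_length = 15              # random N bases in the Gag ID RT primer

# Assumption: every RNA copy in the plasma is recovered in the eluate.
copies_extracted = viral_load_copies_per_ml * plasma_volume_ml
copies_per_ul = copies_extracted / elution_volume_ul
genomes_in_rt = copies_per_ul * rt_input_volume_ul

# Number of distinct barcodes a 15-base random stretch can encode.
barcode_space = 4 ** barcode_length

print(f"genomes in RT input: {genomes_in_rt:,.0f}")
print(f"possible barcodes:   {barcode_space:,}")
print(f"barcodes per genome: {barcode_space / genomes_in_rt:,.0f}")
```

The barcode space exceeds the estimated number of input genomes by about three orders of magnitude, which is why two templates are unlikely to receive the same primer ID by chance.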
To avoid loss of barcoded primers due to the coverage drop at the ends of a read that results from the tagmentation process, the PCR forward primer was designed with a 60bp overhang so that the barcode (“primer ID”) is far from the read end: Gag ID FW - 5’AAG CGA GGA GCT GTT CAC TGC CAT CCT GGT CGA GCT ACC CAT ACG ATG TTC CAG ATT ACG and Gag ID RV - 5’CTC AAT AAA GCT TGC CTT GAG TGC. PCR amplification was performed using Platinum™ SuperFi™ in a 50 μl reaction with 33 μl of the purified cDNA as input using the following conditions: initial denaturation for 3min at 98°C, followed by 40 cycles of denaturation for 20sec at 98°C, annealing for 30sec at 60°C and extension for 1min at 72°C, and final extension for 2min at 72°C. The Gag amplicon was gel purified and its concentration was determined by Qubit. The purified product was further used for library construction.
Libraries construction
PCR fragmentation and indexing of samples for sequencing was performed using the Nextera XT DNA Library Prep Kit (Illumina) with the following adjustments to the manufacturer's instructions: (1) In order to get a short insert size of ~250bp, 0.85 ng of input DNA was used for tagmentation; (2) No neutralization of the tagmentation buffer was done, as described previously (79); (3) For library amplification of the tagmented DNA, the Nextera XT DNA library prep PCR reagents were replaced with high-fidelity DNA polymerase reagents (the same DNA polymerase that was used for the amplicon generation). The PCR reaction (50 μl total) was set up by adding the following directly to the tagmented DNA: index 1 (i5, Illumina, 5 μl), index 2 (i7, Illumina, 5 μl), buffer (10 μl), high-fidelity DNA polymerase (0.5 μl), dNTPs (10mM, 1 μl) and nuclease-free water (8.5 μl); (4) Amplification was performed with the annealing temperature set to 63°C instead of 55°C, as introduced previously (79), and with a final extension of 2min; (5) Post-amplification clean-up was achieved using AMPure XP beads in a double size-selection manner (80) to remove larger as well as smaller fragments, in order to obtain a narrower size distribution that maximizes the fraction of fully overlapping read pairs. For the first size selection, 32.5 μl of beads (0.65X ratio) were added to bind the large fragments. These beads were separated and discarded. For the second size selection, 10 μl of beads (0.2X ratio) were added to the supernatant to allow binding of intermediate fragments, and the supernatant containing the small fragments was discarded. The intermediate fragments were eluted and their size was determined using a high-sensitivity DNA tape in a TapeStation 4200 (Agilent). A mean size of ~370bp, corresponding to the desired insert size of ~250bp, was achieved; and (6) Normalization and pooling were performed manually.
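The volumes in adjustments (3) and (5) can be checked with simple arithmetic. The sketch below is illustrative only. It assumes that both bead ratios are expressed relative to the original 50 μl PCR volume, which is what the stated volumes imply, and it derives the tagmented DNA volume as the remainder of the 50 μl reaction rather than taking it from the protocol.

```python
def bead_volume_ul(sample_volume_ul, ratio):
    """Bead volume for a given bead-to-sample ratio."""
    return sample_volume_ul * ratio

pcr_volume_ul = 50.0

# Reagents added directly to the tagmented DNA, in microliters (from the text).
reagents_ul = {
    "index 1": 5.0,
    "index 2": 5.0,
    "buffer": 10.0,
    "polymerase": 0.5,
    "dNTPs": 1.0,
    "water": 8.5,
}
reagent_total_ul = sum(reagents_ul.values())
# Derived, not stated in the text: volume left for the tagmented DNA.
implied_tagmented_dna_ul = pcr_volume_ul - reagent_total_ul

# Double size selection: first cut binds large fragments, second cut binds
# the intermediate fragments that are kept.
first_cut_ul = bead_volume_ul(pcr_volume_ul, 0.65)
second_cut_ul = bead_volume_ul(pcr_volume_ul, 0.20)
cumulative_ratio = (first_cut_ul + second_cut_ul) / pcr_volume_ul

print(f"reagents: {reagent_total_ul} ul, tagmented DNA: {implied_tagmented_dna_ul} ul")
print(f"first cut: {first_cut_ul} ul, second cut: {second_cut_ul} ul")
print(f"cumulative bead ratio: {cumulative_ratio:.2f}X")
```

The cumulative ratio after the second addition is 0.85X, so the retained fragments are those that bind at 0.85X but not at 0.65X.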
Alternative library purification methods
For the AMPure XP beads-free sample, post-amplification clean-up by double size-selection was replaced with an agarose gel purification of a ~370bp fragment, with no other changes in the library construction protocol.
Alternative tagmentation sample
For the alternative tagmentation sample, a 250bp amplicon within the integrase region was designed, using specific primers with an overhang corresponding to the sequence inserted during the tagmentation step of the NexteraXT DNA library prep kit: NexteraXT free FW - 5’TCG TCG GCA GCG TCA GAT GTG TAT AAG AGA CAG ACT TGT CCA TGC ATG GCT TCT C and NexteraXT free RV - 5’GTC TCG TGG GCT CGG AGA TGT GTA TAA GAG ACA GTC TAT CTG GCA TGG GTA CCA GCA. The PCR reaction was set up using Platinum™ SuperFi™ and carried out as follows: initial denaturation for 3min at 98°C, followed by 40 cycles of denaturation for 20sec at 98°C, annealing for 30sec at 62°C and extension for 15sec at 72°C, and final extension for 2min at 72°C. The PCR product was gel purified and its concentration was measured by Qubit. The purified product was indexed by a subsequent PCR amplification using primers corresponding to the i5 and i7 NexteraXT primers (IDT), as mentioned previously (79), at a final concentration of 1 μM. The PCR reaction was set up using Platinum™ SuperFi™ and amplified as follows: initial denaturation for 3min at 98°C, followed by 12 cycles of denaturation for 20sec at 98°C, annealing for 30sec at 63°C and extension for 30sec at 72°C, and final extension for 2min at 72°C. Size selection was achieved by gel purification of ~370bp fragments.
NextSeq libraries construction
Illumina NextSeq supports shorter reads than MiSeq. The longest NextSeq read length is 150bp, so we selected a shorter library fragment size of 270bp, compared to the 370bp fragment size used for the MiSeq platform. The first size selection of the post-NexteraXT amplification cleanup was performed using 42.5 μl of AMPure XP beads (0.85X ratio) (80).
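The relation between fragment size, read length and read-pair overlap can be illustrated as follows. This is a sketch under stated assumptions rather than part of the protocol: the 120bp adapter and index overhead is inferred from the ~370bp library size and ~250bp insert size reported for MiSeq, the 270bp and 370bp figures are read as library fragment sizes, and read lengths of 250bp and 150bp are taken from the 500-cycle and 300-cycle kits. The function names are hypothetical.

```python
def implied_insert_bp(fragment_bp, adapter_overhead_bp=120):
    """Insert length once the assumed adapter/index overhead is subtracted."""
    return fragment_bp - adapter_overhead_bp

def overlap_fraction(insert_bp, read_length_bp):
    """Fraction of insert bases that are read by both mates of a pair."""
    if insert_bp <= 0:
        raise ValueError("insert length must be positive")
    if insert_bp <= read_length_bp:
        overlap = insert_bp  # each read spans the whole insert
    else:
        overlap = max(0, 2 * read_length_bp - insert_bp)
    return overlap / insert_bp

miseq = overlap_fraction(implied_insert_bp(370), 250)
nextseq = overlap_fraction(implied_insert_bp(270), 150)
nextseq_with_miseq_library = overlap_fraction(implied_insert_bp(370), 150)

print(f"MiSeq, 370bp fragments:   {miseq:.2f}")
print(f"NextSeq, 270bp fragments: {nextseq:.2f}")
print(f"NextSeq, 370bp fragments: {nextseq_with_miseq_library:.2f}")
```

Under these assumptions, only a fifth of each insert would be covered by both reads if the MiSeq library size were kept on NextSeq, whereas the shorter library restores full overlap.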
Sequencing
Sequencing of all samples (except for the NextSeq sample) was performed on the Illumina MiSeq platform using MiSeq Reagent Kit v2 (500-cycles) [Illumina]. Sequencing of the NextSeq samples was performed on the Illumina NextSeq 500 platform using NextSeq 500/550 High Output Kit (300-cycles) [Illumina].
Reads processing and base calling
The paired-end reads from each control library were aligned against the reference sequence of that control using an in-house script that relies on the BLAST command-line tool (81-83). The paired-end reads from the clinical HIV-1 sample were aligned against the HIV-1 subtype B HXB2 reference sequence (GenBank accession number K03455.1) and then realigned against the consensus sequence obtained. Bases were called using an in-house script only if the forward and reverse reads agreed and their average Q-score was above an input threshold (30 or 38). At each position, for each alternative base, we calculated mutation frequencies by dividing the number of reads bearing the mutation by the sequence coverage. Positions were retained for analysis only if sequenced to a depth of at least 100,000 reads. In order to analyze the errors in the sequencing process we used Python (Anaconda distribution) with the following packages: pandas, matplotlib, seaborn, numpy and stats. Distributions of errors on control plasmids were compared using a two-tailed t-test or a two-tailed Mann-Whitney U test.
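The base-calling rule and the frequency calculation described above can be sketched in a few lines. This is not the in-house script: the function names and the toy observations are hypothetical, the Q-score threshold is treated as inclusive (an assumption, since the text says "above"), and the depth filter is lowered from 100,000 to 4 for the toy data.

```python
from collections import Counter

def call_base(fwd_base, fwd_q, rev_base, rev_q, min_avg_q=30):
    """Return the base if both mates agree and their mean Q-score passes."""
    if fwd_base != rev_base:
        return None
    if (fwd_q + rev_q) / 2 < min_avg_q:
        return None
    return fwd_base

def mutation_frequencies(called_bases, reference_base, min_depth=100_000):
    """Frequency of each alternative base, or None below the depth cutoff."""
    depth = len(called_bases)
    if depth < min_depth:
        return None
    counts = Counter(called_bases)
    return {b: counts[b] / depth for b in "ACGT" if b != reference_base}

# Toy observations at one position: (forward base, Q, reverse base, Q).
toy_pairs = [
    ("A", 38, "A", 38),  # agree, high quality: called A
    ("A", 38, "G", 38),  # mates disagree: discarded
    ("G", 36, "G", 30),  # agree, mean Q 33: called G
    ("A", 20, "A", 30),  # agree, mean Q 25: discarded
    ("A", 38, "A", 36),  # called A
    ("A", 38, "A", 38),  # called A
]
called = [b for b in (call_base(*pair) for pair in toy_pairs) if b is not None]
frequencies = mutation_frequencies(called, reference_base="A", min_depth=4)
print(called, frequencies)
```

Note that the denominator is the number of called bases, so read pairs that are discarded contribute neither to the mutation count nor to the coverage.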
Variant calling
In order to facilitate discrimination of true variants from AccuNGS process artifacts, we created a variant caller based on two principles: (i) positions that exhibit a relatively high level of error on a control sample are error-prone for the clinical sample as well; and (ii) process errors on a control sample follow a gamma distribution. A gamma distribution was fitted for each mutation type in the control sequence. In order to detect and remove outliers from the fitting process we used the “three-sigma rule”: positions that showed an error higher than three standard deviations from the mean of the fitted distribution were removed. For these rare loci a base was called only if the mutation was more prevalent in the sample by an order of magnitude. For G>A transition mutations, four distinct gamma distributions were fitted, corresponding to the four combinations of G>A with the preceding nucleotide. Accordingly, for C>T transition mutations four gamma distributions were fitted as well, on the four C>T mutations that are the reverse complements of the G>A mutations. For establishing Figure 5, variants were called on the input RNA sample only if a mutation was in the extreme 1% of the corresponding gamma distribution fitted using the DNA control.
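The logic of the variant caller can be illustrated for a single mutation type. The sketch below is not the AccuNGS implementation and rests on several assumptions: the gamma distribution is fitted by the method of moments (the fitting method is not specified above), outlier removal is done in a single pass, the per-position rule for error-prone loci is omitted, and the control error rates are invented toy values. Only the Python standard library is used, so the gamma CDF is computed from its power series; a statistics package would normally be used instead.

```python
import math
from statistics import fmean, pstdev

def fit_gamma_moments(values):
    """Method-of-moments gamma fit: returns (shape, scale)."""
    m = fmean(values)
    v = pstdev(values) ** 2
    return m * m / v, v / m

def drop_three_sigma_outliers(values):
    """Remove values more than 3 SD above the mean of the fitted gamma."""
    shape, scale = fit_gamma_moments(values)
    cutoff = shape * scale + 3 * math.sqrt(shape) * scale
    return [x for x in values if x <= cutoff]

def gamma_cdf(x, shape, scale):
    """Regularized lower incomplete gamma function via its power series."""
    if x <= 0:
        return 0.0
    z = x / scale
    term = 1.0 / shape
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total) and n < 100_000:
        n += 1
        term *= z / (shape + n)
        total += term
    log_prefactor = -z + shape * math.log(z) - math.lgamma(shape)
    return min(1.0, total * math.exp(log_prefactor))

def is_variant(frequency, shape, scale, tail=0.01):
    """Call a variant if its frequency lies in the upper `tail` of the fit."""
    return 1.0 - gamma_cdf(frequency, shape, scale) < tail

# Toy control error rates for one mutation type, with one error-prone position.
control_errors = [0.8e-5, 1.2e-5] * 10 + [1e-3]
kept = drop_three_sigma_outliers(control_errors)
shape, scale = fit_gamma_moments(kept)

print(len(control_errors) - len(kept), "outlier removed")
print(is_variant(1e-4, shape, scale), is_variant(1.1e-5, shape, scale))
```

In this toy example a sample frequency of 1e-4 falls far in the upper tail of the control fit and is called, whereas 1.1e-5 is indistinguishable from process error.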
Standard sequencing control sample
Standard control sample of pLAI.2 was taken from a previous study (77). For obtaining mutation frequencies we used the same pipeline as for the AccuNGS samples, but without correcting overlapping paired reads. Positions in this sample were analyzed only if sequenced to a depth of at least 2,000 bases.
CODE AVAILABILITY
We have developed the following computational resources that complement the AccuNGS sequencing protocol:
Base coverage calculator. AccuNGS relies on overlapping read pairs and high Q-scores for both reads of a pair. The calculator receives as input the length of the target regions and the desired coverage, and outputs the recommended number of reads required for sequencing each sample.
Computational pipeline for computing the number of unique RNA molecules sequenced, based on primer-ID barcodes (see Supplementary Text).
Computational pipeline for base-calling and inferring site by site base frequencies.
All resources are freely available at https://github.com/SternLabTAU/AccuNGS.
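To give a sense of what a coverage calculation of this kind involves, the following sketch estimates the number of read pairs needed for a target depth. It is not the calculator provided in the repository: the formula, the function name and the `usable_fraction` parameter (the share of read pairs that fully overlap and pass the Q-score filter) are assumptions made for illustration, and the value 0.5 is a placeholder rather than a measured yield.

```python
import math

def read_pairs_needed(target_length_bp, desired_coverage,
                      insert_bp=250, usable_fraction=0.5):
    """Rough number of read pairs for `desired_coverage` at every position.

    Assumes each usable, fully overlapping pair contributes one insert's
    worth of called bases, spread evenly along the target.
    """
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    bases_required = target_length_bp * desired_coverage
    bases_per_pair = insert_bp * usable_fraction
    return math.ceil(bases_required / bases_per_pair)

# Example: a 1500bp amplicon at the 100,000-read depth used for analysis.
pairs = read_pairs_needed(target_length_bp=1500, desired_coverage=100_000)
print(f"{pairs:,} read pairs")
```

Real coverage is not uniform along an amplicon, so a practical estimate would add a safety margin on top of this figure.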
ACCESSION NUMBERS
The datasets generated and reported in this study were deposited in the Sequence Read Archive (SRA, available at https://www.ncbi.nlm.nih.gov/sra), under BioProject PRJNA476431.
COMPETING INTERESTS
The authors have no competing interests to declare.
ACKNOWLEDGEMENTS
The authors would like to thank the members of the Stern lab for helpful comments and Roy Moscona for helpful discussions and support. This work was supported by the SAIA foundation; by the Israeli Science Foundation [1333/16 to AS]; by the German Israeli Foundation [I-1096-411.8-2015 to AS]; by the United-States-Israel Binational Science Foundation [2016555 to AS]; by the Edmond J. Safra center for bioinformatics in Tel Aviv University [to MG, TK and DM]; and by the Constantiner Institute for Molecular Genetics in Tel Aviv University [to MG].