Abstract
As next-generation sequencing (NGS) and liquid biopsy become more prevalent in clinical and research settings, especially for cancer diagnosis, targeted therapy guidance and disease surveillance, there is an increasing need for methods that reduce cost and improve sensitivity and specificity. Because the error rate of NGS is around 1%, the signal-to-noise ratio (SNR) is low, and it is difficult to identify mutations with frequency lower than 1% accurately and efficiently. Here we propose a likelihood-based approach, the low-frequency mutation detector (LFMD), which combines the advantages of duplex sequencing (DS) and the bottleneck sequencing system (BotSeqS) to maximize utilization of duplicate sequenced reads. Compared with DS, the new method achieves higher sensitivity (improved by ~16%), higher specificity (improved by ~1%) and lower cost (reduced by ~70%) without additional experimental steps, customized adapters or molecular tags. In addition, the method can improve the sensitivity and specificity of other variant calling algorithms by replacing one step of traditional NGS analysis: the removal of polymerase chain reaction (PCR) duplicates. Thus, LFMD is a promising method for genomic research and clinical applications.
Introduction
At the individual level, low-frequency mutations (LFMs) are defined as mutations with allele frequency lower than 5% or 1%. LFMs increase the power to predict early-stage cancer and Alzheimer’s disease (AD)1, distinguish samples of different ages2, identify disease-causing variants3, support diagnosis before tri-parental in vitro fertilization4, and track the mutational spectrum in viral genomes, malignant lesions, and somatic tissues5,6. To improve the signal-to-noise ratio (SNR) and detect LFMs, stringent thresholds, complex experimental techniques1,7, single cell sequencing8–11, circle sequencing12, and more precise models13,14 have been developed. The bottleneck sequencing system15 (BotSeqS) and duplex sequencing16 (DS) utilize duplicate reads generated by polymerase chain reaction (PCR), which are discarded by other methods, to achieve much higher accuracy. However, current methods still have limitations in detecting LFMs.
Disadvantages of single cell sequencing and circle sequencing
For single cell sequencing, DNA extraction is laborious and exacting, with point mutations and copy number biases introduced during amplification of small amounts of fragile DNA. To increase specificity, only variants shared by at least two cells are accepted as true variants11. This method is not cost efficient and cannot be used in large-scale clinical applications because a large number of single cells need to be sequenced to identify rare mutations.
Circle sequencing only utilizes a single strand of DNA, so its specificity is limited by the error rate of PCR. It achieves an error rate as low as 7.6 × 10⁻⁶ per base sequenced12, while DS can achieve 4 × 10⁻¹⁰ errors per base sequenced16.
Disadvantages of BotSeqS
In contrast, BotSeqS uses endogenous molecular tags, the positions of the aligned read pair, to group reads from the same DNA template and construct double strand consensus reads. As a result, it can detect very rare mutations (<10⁻⁶) while it is cheap enough to sequence the whole human genome15. But it requires highly diluted DNA templates before PCR amplification to reduce endogenous tag conflicts and ensure sufficient sequencing of each DNA template. Thus, it has high specificity but poor sensitivity. In addition, it discards clonal variants and small insertions/deletions (InDels) in order to limit false positives.
Disadvantages of DS
Duplex sequencing (DS) takes a different approach to eliminating tag conflicts. It ligates exogenous random molecular tags (also known as unique molecular identifiers, UIDs or UMIs) to both ends of each DNA template before PCR amplification. Although sensitive and accurate, it wastes much data on sequencing tags, fixed sequences and a large proportion of read families that contain only one read pair because of a sequencing error in a tag. Since random molecular tags are synthesized with customized adapters, batch effects might occur during DNA library construction. Additionally, DS only works on small targeted genome regions6,13,17 rather than on the whole genome.
A new approach
To avoid the aforementioned problems, we present here a new, efficient approach that combines the advantages of BotSeqS and DS. It uses a likelihood-based model13,14 to dramatically reduce endogenous tag conflicts. Then it groups reads into read families and constructs double strand consensus reads to detect ultra-rare mutations accurately while maximizing utilization of non-duplicate read pairs. Because it needs no exogenous molecular tags, our method also works with the 50 bp short reads of BGISEQ as well as the longer reads of HiSeq. In summary, it simplifies the DNA sequencing procedure, saves data and cost, achieves higher sensitivity and specificity, and can be used in whole genome sequencing. Using digital PCR to validate thousands of low-frequency sites is prohibitively expensive and laborious18. A method that works on an independent sequencing platform can instead be used to validate HiSeq results. Additionally, our new method is a statistical solution to the problem of PCR duplication in the basic analysis pipeline of next generation sequencing (NGS) data and can improve the sensitivity and specificity of other variant calling algorithms without requiring specific experimental designs. As the price of sequencing falls, the depth and the rate of PCR duplication are rising. The method we present here might help deal with such high-depth data more accurately and efficiently.
Methodology
Intuitively, to distinguish LFMs (signal) from background PCR and sequencing errors (noise), we need to increase the SNR. To increase SNR, we need to either increase the frequency of mutations or inhibit sequencing errors. Single cell sequencing increases the frequency of mutations by isolating single cells from the bulk population, while BotSeqS and DS inhibit sequencing errors by identifying the major allele at each site of multiple reads from the same DNA template. In this paper, we only focus on the latter strategy.
To group reads from the same DNA template, the simplest idea is to group properly mapped reads with the same coordinates (i.e., chromosome, start position, and end position), because random shearing of DNA molecules provides natural differences, called endogenous tags, between templates. A group of such reads is called a read family. However, because the length of the DNA templates is approximately fixed, random shearing cannot provide enough differences to distinguish every DNA template. Thus, it is common for two original DNA templates to share the same coordinates. If two or more DNA templates share the same coordinates and their reads are grouped into a single read family, it is difficult to determine, using only allele frequencies as a guide, whether an allele is a potential error or a mutation. To address this, BotSeqS introduced dilution before PCR amplification to dramatically reduce the number of DNA templates and hence the probability of endogenous tag conflicts, whereas DS introduced exogenous molecular tags before PCR amplification to dramatically increase the differences between templates. Thus, BotSeqS sacrifices sensitivity and DS sequences extra data: the tags.
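The coordinate-based grouping described above can be sketched in a few lines of Python. This is only an illustration, not the LFMD implementation: the dictionary fields (`name`, `chrom`, `start`, `end`) and the function name are hypothetical, and a real pipeline would read these values from aligned, properly paired reads in a BAM file.

```python
from collections import defaultdict

def group_read_families(read_pairs):
    """Group read pairs that share (chromosome, start, end) into read families.

    The coordinate triple plays the role of the endogenous tag.
    """
    families = defaultdict(list)
    for pair in read_pairs:
        tag = (pair["chrom"], pair["start"], pair["end"])
        families[tag].append(pair["name"])
    return dict(families)

# Hypothetical toy input: r1 and r2 share coordinates, r3 does not.
pairs = [
    {"name": "r1", "chrom": "chrM", "start": 100, "end": 350},
    {"name": "r2", "chrom": "chrM", "start": 100, "end": 350},
    {"name": "r3", "chrom": "chrM", "start": 120, "end": 350},
]
families = group_read_families(pairs)
```

Note that r1 and r2 land in the same family whether or not they come from the same original template, which is exactly the tag conflict discussed above.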
Here we introduce a third strategy to eliminate tag conflicts. It is a likelihood-based approach based on an intuitive hypothesis: that if reads of two or more DNA templates group together, a true allele’s frequency in this read family is high enough to distinguish the allele from background sequencing errors. The pipeline of LFMD is shown in Figure 1, and a comparison of DS and LFMD is shown in Figure 2.
Likelihood-based model
We aim to identify alleles at each potential heterozygous position in a read family (grouped according to endogenous tags). Then, based on those heterozygous sites, we split the mixed read family into smaller ones and compress each one into a consensus read. Finally, we detect mutations based on all consensus reads, whose error rates are much lower than 0.1%.
First, we define a Watson strand as a read pair for which read 1 is the plus strand while read 2 is the minus strand. A Crick strand is defined as a read pair for which read 1 is the minus strand while read 2 is the plus strand. Thus a read family which contains Watson and Crick strand reads simultaneously is an ideal read family because it is supported by both strands of the original DNA template before PCR amplification. Second, we select potential heterozygous sites which meet the following criteria: 1) the minor allele is supported by both Watson and Crick reads; 2) minor allele frequencies in both the Watson and the Crick read families are greater than approximately the average sequencing error rate, often 1% or 0.1%; 3) low quality bases (<Q20) and low quality alignments (<Q30) are excluded. Finally, we calculate the genotype likelihood in the Watson and Crick families independently in order to eliminate PCR errors introduced during the first PCR cycle.
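The site-selection criteria above can be illustrated with a small Python sketch. It is a simplified reading of the criteria, not the authors' code: the function name, the input layout (lists of `(base, base_quality)` per strand family at one position) and the default thresholds are assumptions, and the mapping-quality filter is assumed to have been applied upstream.

```python
from collections import Counter

def candidate_minor_alleles(watson, crick, min_freq=0.01, min_baseq=20):
    """Return minor alleles seen above min_freq in BOTH strand families.

    watson, crick: lists of (base, base_quality) observed at one position.
    Bases below min_baseq are ignored (criterion 3, base-quality part only).
    """
    w = Counter(b for b, q in watson if q >= min_baseq)
    c = Counter(b for b, q in crick if q >= min_baseq)
    if not w or not c:
        return set()
    major = (w + c).most_common(1)[0][0]
    w_total, c_total = sum(w.values()), sum(c.values())
    # An allele absent from one strand has frequency 0 there and is rejected,
    # so criterion 1 (support on both strands) is implied by criterion 2.
    return {
        a for a in "ACGT"
        if a != major and w[a] / w_total > min_freq and c[a] / c_total > min_freq
    }

watson = [("A", 30)] * 8 + [("G", 30)] * 2
crick = [("A", 30)] * 9 + [("G", 30)] + [("T", 10)]  # the T is low quality
candidates = candidate_minor_alleles(watson, crick)
```

Here G is retained because it appears on both strands, while the single low-quality T on the Crick strand is filtered out.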
At each position of a Watson or Crick read family, let P(X|θ) be the probability mass function of a random variable X, indexed by a parameter θ = (θA, θC, θG, θT)ᵀ, where θ belongs to a parameter space Ω. Let g ∈ {A, C, G, T}, and let θg represent the frequency of allele g at this position. We have the boundary constraints θg ∈ [0,1] and ∑ θg = 1.
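Assuming N sequence reads cover this site, xᵢ represents the base on read i ∈ {1, 2, …, N}, and eᵢ denotes the sequencing error probability of that base, the genotype likelihood can be calculated as

$$L(\theta) = \prod_{i=1}^{N} P(x_i \mid \theta) = \prod_{i=1}^{N} \sum_{g} \theta_g \, P(x_i \mid g, e_i),$$

in which

$$P(x_i \mid g, e_i) = \begin{cases} 1 - e_i, & x_i = g, \\ e_i / 3, & x_i \neq g. \end{cases}$$

So we have the log-likelihood function

$$\ell(\theta) = \sum_{i=1}^{N} \log \Big( \sum_{g} \theta_g \, P(x_i \mid g, e_i) \Big).$$

Thus, under the null hypothesis H0: θg = 0 and the alternative hypothesis H1: θg ≠ 0, the likelihood ratio test statistic for each allele g is

$$\Lambda_g = -2 \big[ \ell(\hat{\theta}_0) - \ell(\hat{\theta}) \big],$$

where $\hat{\theta}_0$ and $\hat{\theta}$ are the maximum likelihood estimates under H0 and without the restriction, respectively.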
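To make the likelihood ratio test concrete, the following Python sketch evaluates it for one position. It assumes the common mixture form in which a read from allele g shows g with probability 1 − e and each other base with probability e/3, and it uses a plain EM iteration to find the maximum likelihood allele frequencies. The EM optimizer, the function names and the iteration count are illustrative choices, not details taken from the LFMD source.

```python
import math

ALLELES = "ACGT"

def base_prob(x, g, e):
    """P(observed base x | true allele g, error probability e); assumes 0 < e < 1."""
    return 1.0 - e if x == g else e / 3.0

def log_likelihood(theta, obs):
    """obs: list of (base, error_probability); theta: dict allele -> frequency."""
    return sum(
        math.log(sum(theta[g] * base_prob(x, g, e) for g in ALLELES))
        for x, e in obs
    )

def mle(obs, excluded=None, iters=200):
    """EM estimate of allele frequencies; `excluded` is held at frequency 0."""
    free = [g for g in ALLELES if g != excluded]
    theta = {g: (1.0 / len(free) if g in free else 0.0) for g in ALLELES}
    for _ in range(iters):
        counts = dict.fromkeys(ALLELES, 0.0)
        for x, e in obs:
            weights = {g: theta[g] * base_prob(x, g, e) for g in free}
            total = sum(weights.values())
            for g in free:
                counts[g] += weights[g] / total
        theta = {g: counts[g] / len(obs) for g in ALLELES}
    return theta

def lrt_statistic(obs, g):
    """2 * (max log-likelihood without restriction - max log-likelihood with theta_g = 0)."""
    full = log_likelihood(mle(obs), obs)
    null = log_likelihood(mle(obs, excluded=g), obs)
    return max(0.0, 2.0 * (full - null))

# Toy family: 90 reads show A and 10 show G, all at Q30 (e = 0.001).
obs = [("A", 0.001)] * 90 + [("G", 0.001)] * 10
stat_g = lrt_statistic(obs, "G")
stat_t = lrt_statistic(obs, "T")
```

In this toy family the statistic for G is large, because ten Q30 reads are very unlikely to all be errors, while the statistic for the unobserved allele T is essentially zero.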
However, as θg = 0 lies on the boundary of the parameter space, the general likelihood ratio test needs an adjustment to fit the chi-square distribution. Because the adjustment requires calculating a tangent cone19 in a constrained 3-dimensional parameter space, and the computation is too complicated and time consuming for large scale NGS data, here we use a simplified, straightforward adjustment20 presented by Yong et al. in 2017.
Let {𝓐1,…, 𝓐K}, K = 4 denote the set of conditional events which are mapped to four alleles at the position. We have
The composite log likelihood can be constructed as in which we set
Let be the maximum composite likelihood estimator, and define the composite score function, sensitivity matrix and variability matrix respectively as
The corresponding estimators of H and V are denoted by and evaluated at . The modified composite likelihood under boundary constraints was given by Yong et al20 as where
Thus, we derive the adjusted likelihood ratio test where and θ0 is the parameter θ under null hypothesis H0.
Let pmf(ei) denote the probability mass function of ei. The expected number of base g with ei is
Thus, where C is a finite constant. Then we derive
As a result, is equal to 0 in the model, which means the adjustment is not necessary. Thus, we finally arrive at a general result that further adjustment of is not helpful in similar cases, although the asymptotic distribution we use is not perfect when N is small (e.g., N<5), and alternative approaches might be derived in the future.
Because the null and alternative hypotheses have two and three free variables respectively, the chi-square distribution has 1 degree of freedom. The type I error of allele g is then given by

$$P_g = 1 - \mathrm{cdf}(\Lambda_g),$$

where Λg is the likelihood ratio test statistic for allele g and cdf(x) is the cumulative distribution function of the chi-square distribution with 1 degree of freedom. If Pg is less than a given threshold α, the null hypothesis is rejected and the allele g is treated as a candidate allele of the read family.
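For a chi-square distribution with 1 degree of freedom, the upper-tail probability has a closed form in terms of the complementary error function, so the P-value can be computed without a statistics library. The sketch below shows this; the function names and the default threshold α = 0.001 are illustrative assumptions, not values from the paper.

```python
import math

def chi2_df1_pvalue(stat):
    """Upper-tail probability of a chi-square(1) variable: P(X > stat).

    Uses the identity P(X > s) = erfc(sqrt(s / 2)), valid for 1 degree of freedom.
    """
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))

def is_candidate_allele(stat, alpha=0.001):
    """Reject H0 (theta_g = 0) when the P-value falls below alpha."""
    return chi2_df1_pvalue(stat) < alpha

p_at_critical = chi2_df1_pvalue(3.841458820694124)  # 95% critical value of chi-square(1)
```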
Although Pg cannot be interpreted as the probability that H0,g is true and allele g is an error, it is a proper approximation of the error rate of allele g. We only reserve alleles with Pg ≤ α in both Watson and Crick families and substitute others with “N”. Then Watson and Crick families are compressed into several single strand consensus sequences (SSCSs). The SSCSs might contain haplotype information if more than one heterozygous site is detected. Finally, SSCSs which are consistent in both Watson and Crick families are claimed as double strand consensus sequences (DCSs).
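The masking step can be sketched as follows. This is a deliberately reduced illustration: it handles one allele per position and does not split a mixed family into several haplotype-specific consensus sequences as described above. The input layout, a list of `(allele, Pg in Watson family, Pg in Crick family)`, and the threshold value are assumptions.

```python
def consensus_base(allele, p_watson, p_crick, alpha=0.001):
    """Keep the allele only if it passes the test in both strand families."""
    return allele if p_watson <= alpha and p_crick <= alpha else "N"

def double_strand_consensus(calls, alpha=0.001):
    """calls: list of (allele, P_watson, P_crick), one entry per position."""
    return "".join(consensus_base(a, pw, pc, alpha) for a, pw, pc in calls)

calls = [
    ("A", 1e-9, 1e-8),  # supported on both strands
    ("C", 1e-9, 0.2),   # fails in the Crick family, masked
    ("G", 1e-5, 1e-4),  # supported on both strands
]
dcs = double_strand_consensus(calls)
```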
For each allele on a DCS, let Pw and Pc represent the relative error rates of the given allele in the Watson and Crick family respectively, and let Pwc denote the united error rate of the allele. Thus,
For a read family which proliferated from n original templates, a coalescent model can be used to model the PCR procedure21. According to the model, a PCR error proliferates and its fraction decreases exponentially with the number, m, of PCR cycles. For example, an error that occurs in the first PCR cycle would occupy half of the PCR products, an error that occurs in the second cycle occupies a quarter, the third only 1/8, and so on. As we only need to consider PCR errors which are detectable, the coalescent PCR error rate is defined as the probability to detect a PCR error whose frequency ≥ 2⁻ᵐ/n, and it is equal to
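The halving pattern in the example above can be written as a one-line function. This sketch only reproduces the idealized fractions from the text (perfect doubling in every cycle, equal amplification of all templates); it does not implement the coalescent PCR error rate itself, and the function name is hypothetical.

```python
def pcr_error_fraction(cycle, n_templates=1):
    """Fraction of a family's PCR products carrying an error that arose in `cycle`.

    Assumes ideal doubling and that the family derives from `n_templates`
    equally amplified original templates.
    """
    return 2.0 ** (-cycle) / n_templates

fractions = [pcr_error_fraction(k) for k in (1, 2, 3)]
```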
Let epcr denote the coalescent PCR error rate and Ppcr the united PCR error rate of the double strand consensus allele. Empirically we get
Because PwcPpcr ≈ 0, the combined base quality of the allele on the DCS is

$$Q = -10 \log_{10}(P_{wc} + P_{pcr}).$$
Then Q is converted to an ASCII character, and the series of characters forms the base quality sequence of the DCS. Finally, we generate a BAM file with the DCSs and their quality sequences.
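The conversion from an error probability to a quality character can be sketched as below, assuming the Phred+33 encoding used by the SAM/BAM and FASTQ formats. The cap at Q93, the largest value that still maps to a printable ASCII character, and the function name are choices made for this illustration.

```python
import math

def phred_char(p_error, max_q=93):
    """Convert an error probability to a Phred+33 quality character."""
    if p_error <= 0.0:
        q = max_q
    else:
        q = int(round(-10.0 * math.log10(p_error)))
    q = min(max(q, 0), max_q)  # clamp to the printable range
    return chr(q + 33)

quality_string = "".join(phred_char(p) for p in (0.001, 0.01, 1e-30))
```

For example, an error probability of 0.001 corresponds to Q30, which is encoded as the character `?`.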
Using the BAM file which contains all the high quality DCS reads, the same approach is applied to give each allele a P-value at each genomic position covered by DCS reads. Adjusted P-values (q-values) are obtained via the Benjamini-Hochberg procedure. The q-value threshold is selected according to the total number of tests conducted and the false discovery rate (FDR) that can be accepted.
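For readers unfamiliar with the adjustment, here is a minimal pure-Python sketch of the standard Benjamini-Hochberg step-up procedure. It is a generic textbook implementation, not code from LFMD, and the example P-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values (q-values) in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

qvalues = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

An allele is then reported when its q-value is at or below the accepted FDR.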
A similar mathematical model was described in detail in previous papers by Jun et al13 and Yan et al14. Jun et al. used this model to reliably call mutations with frequency > 4%. In contrast, we use this model to deal with read families rather than non-duplicate reads. In a mixed read family, most of the minor allele frequencies are larger than 4%, so the power of the model meets our expectation.
For reads containing InDels, the CIGAR strings in BAM files contain I or D. Reads with different CIGAR strings cannot belong to one read family, so CIGAR strings can also be used as part of the endogenous tags. In contrast, the soft-clipped part of a CIGAR string cannot be ignored when determining start and end positions, because low-quality parts of reads tend to be clipped and the coordinates after clipping are not a proper endogenous tag for the original DNA template.
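One way to apply both points is to extend the aligned coordinates by the soft-clipped lengths and to add the InDel operations to the tag. The sketch below parses a SAM CIGAR string with a regular expression; the function names are hypothetical, and `pos` is assumed to be the 1-based leftmost aligned position as in the SAM POS field.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]

def unclipped_span(pos, cigar):
    """Return (start, end) on the reference, extended over soft-clipped bases."""
    ops = parse_cigar(cigar)
    start = pos
    for n, op in ops:                      # leading clips
        if op == "S":
            start -= n
        elif op != "H":
            break
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    end = pos + ref_len - 1
    for n, op in reversed(ops):            # trailing clips
        if op == "S":
            end += n
        elif op != "H":
            break
    return start, end

def indel_signature(cigar):
    """Concatenate only the I/D operations, e.g. '20M2D28M' -> '2D'."""
    return "".join(f"{n}{op}" for n, op in parse_cigar(cigar) if op in "ID")

span = unclipped_span(100, "5S45M")
signature = indel_signature("20M2D28M2S")
```

A tag such as `(chrom, start, end, signature)` then keeps reads with different InDels in different families while remaining stable under quality clipping.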
Results
Comparison between DS and LFMD
Simulated data
We used Python scripts developed by the Du novo22 team to simulate mixed double-strand sequencing data and then compared the results of LFMD and DS. Although the simulation was not perfect, the analysis was still useful to demonstrate the power and the potential drawbacks of LFMD and DS because we knew the true mutations explicitly, and true positive (TP) and false positive (FP) could be defined and calculated clearly. The numbers of TP and FP are shown in Tables 1 and 2.
We found that DS induces several false positives due to mapping errors. LFMD eliminates mapping errors of DCSs by outputting DCSs directly into BAM files. LFMD is much more sensitive than DS according to Figures 3, 4, and 5.
Mouse mtDNA
To evaluate the performance of LFMD, we compared LFMD with DS on a DS dataset from mouse mtDNA: SRR1613972. The analysis pipeline is shown in Figure 4. We kept almost all parameters exactly the same in DS and LFMD and then compared the results. Because DS is the current gold standard, we treated the DS results as the true set and then calculated the true positive rate (TP), false positive rate (FP), and positive predictive value (PPV) of LFMD based on all properly mapped reads (Table 1) and uniquely and properly mapped reads (Table 2). We found that mapping quality influenced the performance of both methods.
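When one call set is treated as the truth, the comparison reduces to set operations. The sketch below shows how sensitivity and PPV could be computed; the variant representation `(chrom, pos, ref, alt)`, the function name and the toy calls are hypothetical and do not correspond to values in the tables.

```python
def compare_calls(truth, calls):
    """Compare a call set against a truth set of (chrom, pos, ref, alt) tuples."""
    tp = len(truth & calls)
    fp = len(calls - truth)
    fn = len(truth - calls)
    sensitivity = tp / (tp + fn) if truth else float("nan")
    ppv = tp / (tp + fp) if calls else float("nan")
    return {"TP": tp, "FP": fp, "FN": fn, "sensitivity": sensitivity, "PPV": ppv}

# Toy example: one shared call, one truth-only call, one test-only call.
truth_set = {("chrM", 100, "A", "G"), ("chrM", 200, "C", "T")}
test_set = {("chrM", 100, "A", "G"), ("chrM", 300, "G", "A")}
summary = compare_calls(truth_set, test_set)
```

As the discordant-call analysis in this section shows, calls counted as FP under this scheme may still be real if the reference method missed them.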
Although the majority of mutations are identified by both methods, some mutations are detected only by DS or only by LFMD. We investigated these discordant mutations one by one. Interestingly, most of them (42 out of 62 LFMD-only point mutations) can be identified by DS if we allow for 1-2 bp sequencing errors and PCR errors in the 24 bp tag sequences of DS. Two of them are potential true positive mutations because there is only one supporting read in one of the 2 families. The last 18 LFMD-only mutations did not have matched tags to make DCSs. They are potential FPs of LFMD or FNs of DS. But when we allow more than 2 bp mismatches in tags, most of the last 18 LFMD-only mutations have double strand support. This phenomenon suggests either contamination of DS tags or potential false positives of LFMD, which should be validated in future research.
Twenty-six samples from Prof. Kennedy’s laboratory1
We compared the performance of DS and LFMD on 26 samples from Prof. Scott R. Kennedy’s laboratory. Only uniquely mapped reads were used to detect LFMs. The majority of LFMs were detected by both tools. Almost all LFMs detected only by DS were false positives due to alignment errors of DCSs, while LFMD outputs BAM files directly and avoids these alignment errors. LFMs detected only by LFMD are supported by raw reads if PCR and sequencing errors on molecular tags are taken into account. As a result, LFMD is much more sensitive and accurate than DS. The improvement in sensitivity is about 16% according to Table 5.
YH cell line
We sequenced the YH cell line, passage 19, 8 times in order to validate the stability of the method. All results, shown in Table 6 and Figure 6, are highly consistent.
ABL1 data
Using the duplex sequencing method in 2015, Schmitt et al. analyzed an individual with chronic myeloid leukemia who relapsed after treatment with the targeted therapy imatinib (Sequence Read Archive accession SRR1799908). We reanalyzed these data and found 5 additional LFMs. Two of them were in the coding region of the ABL1 gene. It has been reported that E255G (E255VDK, Dasatinib, Imatinib, Nilotinib) and V256G (V256L, Imatinib) are associated with drug resistance. The annotation results of the 5 LFMs are shown in Table 7.
Materials
Subject recruitment and sampling
A lymphoblastoid cell line (YH cell line) established from the first Asian genome donor23 was used. Total DNA was extracted with the MagPure Buffy Coat DNA Midi KF Kit (MAGEN). The DNA concentration was quantified by Qubit (Invitrogen). The DNA integrity was examined by agarose gel electrophoresis. The extracted DNA was kept frozen at −80°C until further processing.
Mitochondrial whole genome DNA isolation
Mitochondrial DNA (mtDNA) was isolated and enriched by double/single primer set amplifying the complete mitochondrial genome. The samples were isolated using a single primer set (LR-PCR4) by ultra-high-fidelity Q5 DNA polymerase following the protocol of the manufacturer (NEB) (Table 8).
Library construction and mitochondrial whole genome DNA sequencing
For the BGISeq-500 sequencing platform, mtDNA PCR products were fragmented directly by Covaris E220 (Covaris, Brighton, UK) without purification. Sheared DNA ranging from 150 bp to 500 bp without size selection was purified with an Axygen™ AxyPrep™ Mag PCR Clean-Up Kit. 100 ng of sheared mtDNA was used for library construction. End-repairing and A-tailing was carried out in a reaction containing 0.5 U Klenow Fragment (ENZYMATICS™ P706-500), 6 U T4 DNA polymerase (ENZYMATICS™ P708-1500), 10 U T4 polynucleotide kinase (ENZYMATICS™ Y904-1500), 1 U rTaq DNA polymerase (TAKARA™ R500Z), 5 pmol dNTPs (ENZYMATICS™ N205L), 40 pmol dATPs (ENZYMATICS™ N2010-A-L), 1 X PNK buffer (ENZYMATICS™ B904) and water with a total reaction volume of 50 µl. The reaction mixture was placed in a thermocycler running at 37°C for 30 minutes and heat denatured at 65°C for 15 minutes with the heated lid at 5°C above the running temperature. Adaptors with 10 bp tags (Ad153-2B) were ligated to the DNA fragments by T4 DNA ligase (ENZYMATICS™ L603-HC-1500) at 25°C. The ligation products were PCR amplified. Twenty to twenty-four purified PCR products were pooled together in equal amounts and then denatured at 95°C and ligated by T4 DNA ligase (ENZYMATICS™ L603-HC-1500) at 37°C to generate a single-strand circular DNA library. Pooled libraries were made into DNA Nanoballs (DNB). Each DNB was loaded into one lane for sequencing.
Sequencing was performed according to the BGISeq-500 protocol (SOP AO) employing the PE50 mode. For reproducibility analyses, YH cell line mtDNA was processed four times following the same protocol as described above to serve as library replicates, and one of the DNBs from the same cell line was sequenced twice as sequencing replicates. A total of 8 datasets were generated using the BGISEQ-500 platform. For HiSeq-4000 sequencing platforms, 500 ng to 1 μg of input mtDNA were used for library construction according to the protocol of the manufacturer (Illumina).
MtDNA sequencing was performed on an Illumina HiSeq-4000 with 100 bp paired-end reads and on a BGISeq-500 with 50 bp paired-end reads. The libraries were processed for high-throughput sequencing with a mean depth of ~20000x.
The data that support the findings of this study have been deposited in the CNSA (https://db.cngb.org/cnsa/) of CNGBdb with accession code CNP0000297.
Discussion
LFMD is still expensive for target regions >2 Mbp in size because of the high depth required. As the cost of sequencing continues to fall, it will become increasingly practical. Known limitations of LFMD are that it only accepts randomly sheared DNA fragments, does not work on short amplicon sequencing data, and only works on paired-end sequencing data. Moreover, LFMD’s precision is limited by the accuracy of alignment software.
To estimate the theoretical limit of LFMD, let read length equal 100 bp and let the standard deviation (SD) of insert size equal 20 bp. Let N represent the number of position families across one point. Then, N = (2 * 100) * (20 * 6) = 24000 if only considering ±3 SD. As the shearing of DNA is not random in the real world, it is safe to set N as 20,000. Ideally, the likelihood ratio test can detect mutations whose frequency is greater than 0.2% in a read family with Q30 bases. Thus, the theoretical limit of minor allele frequency is around 1e-7 (= 0.002 / 20000).
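The back-of-the-envelope estimate above can be reproduced directly. The variable names are introduced here for illustration; the numbers are the ones stated in the text.

```python
read_length = 100          # bp
insert_sd = 20             # bp, standard deviation of insert size
sd_window = 6              # +/- 3 SD

# Number of distinct position families that can cover one site.
n_families = (2 * read_length) * (insert_sd * sd_window)

# Conservative value used in the text, since shearing is not truly random.
n_conservative = 20_000

# Lowest detectable frequency within one read family at Q30.
min_freq_in_family = 0.002

theoretical_limit = min_freq_in_family / n_conservative
```

With these inputs the limit evaluates to 1e-7, matching the figure quoted above.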
Conclusion
To eliminate endogenous tag conflicts, we use a likelihood-based model to separate the read family of the minor allele from that of the major allele. Without additional experimental steps and the customized adapters of DS, LFMD achieves higher sensitivity and almost the same specificity with lower cost. It is a general method which can be used in several cutting-edge areas.