Abstract
Motivation Sequencing of cell-free DNA (cf-DNA) has enabled Noninvasive Prenatal Testing (NIPT) and”liquid biopsy” of cancers. However, while the aneuploidy and point mutations were focused on by most of NITP and liquid biopsy studies, detecting sub-chromosome CNVs that affect a few to dozens of megabases was rarely reported, likely attributable to the difficulty in accurately identifying them, especially for those present in a small fraction of cf-DNA.
Results We developed a somatic CNV detection tool (SCDT), for detecting sub-chromosome CNVs in cf-DNA using whole genome sequencing (WGS) data or off-target reads in target sequencing data. Additional to using control samples for correcting genome position specific bias, two GC correction steps were performed, which regressed GC content of DNA fragments and that of genome bins, respectively. After GC correction, the coefficients of variation of copy ratios approximated the lower boundary of theoretical values, suggesting removing of almost all systematic errors. Finally, CNVs were detected by a piecewise least squares fitting based segmentation algorithm, which outperformed other segmentation methods. We applied SCDT on simulated and real maternal plasma samples, and target cf-DNA sequencing of 118 normal individuals and 240 cancer patients, and demonstrated high sensitivity and specificity.
Availability SCDT is available at https://github.com/Martiantian/Somatic_cnv_detect_tool.
Contact zhuhongmei{at}genomics.cn
1 Introduction
Copy number variation (CNV) is one type of structural variation with duplication or deletion event that affects a considerable number of base pairs (Sharp, et al., 2005), altering gene dosage and subsequently affecting functional and biological behavior of cells. CNVs have been known to greatly contribute to a wide repertoire of human diseases, including genetic disease, developmental and neuropsychiatric disorders (Kirov, et al., 2009; Sebat, et al., 2007; Walsh, et al., 2008) and almost all types of cancers (Pollack, et al., 2002; Shlien and Malkin, 2009; Taylor, et al., 2008). Detecting CNVs in human genomes has been a routine clinical test for disease screening, diagnosis and therapy guiding.
In the recent years, plasma cell free DNA (cf-DNA) sequencing has been broadly applied to non-invasive genetic diagnostics. One of the most important applications of cf-DNA sequencing is non-invasive prenatal testing (NIPT), which directly sequences cf-DNA extracted from maternal blood to identify likely aneuploidy of the fetus. However, NIPT generally focused on whole-chromosome aneuploidies (triploid 13/18/21, X, XXY and XYY) which account for only 30% of all live births with a chromosome abnormality. Recent progression has been made in genome-wide screening of sub-chromosomal CNVs with significantly smaller sizes, which have a considerably higher incidence (0.5-1.7%) than whole-chromosome aneuploidy in human pregnancy (Brady, et al., 2016), and could be associated with genetic disease including DiGeorge syndrome (22q11.2 deletion), Cri-du-chat syndrome (5p deletion), Angelman syndrome (15q11–q13 deletion) and 1p36 deletion syndrome. However, non-invasively detecting CNVs with small chimeric fraction without previously known positions is much more challenging than aneuploidy testing. After two proof-of-concept studies (Jensen, et al., 2012; Peters, et al., 2011) on a few cases, several subsequent studies focused on achieving statistical significance using relatively high depth whole genome sequencing (more than 100 million reads). These methods were based on statistical test on individual genomic bins, or required several consecutive bins to be significant (Srinivasan, et al., 2013; Yu, et al., 2013) However, in addition to the cost of relatively deep whole genome sequencing, high rates of false positives (FPs) and false negatives (FNs) in these individual-bin test methods would restrict their application in real clinical use. Some other methods reduced the requirement on sequencing depth by employing sliding window strategy (Straver, et al., 2014) or using binary segmentation with dynamic threshold (Chen, et al., 2013), and aimed at detecting large fragment aberrations (>10M) using low-coverage sequencing data. Lo et al, reported 60.7% (17/28) of accuracy for analyzing 3Mb to 42Mb de novo CNVs using 4-10 M reads, while the sensitivity increased to 92.9% using relative higher sequencing depth up to 120M reads (Lo, et al., 2016). Yin et al developed a method to identify 69 of 73 (94.5%) CNVs identified by array CGH using 10 million reads, with a specificity of 98.1% (Yin, et al., 2015). A method based on unified Hidden Markov model was developed for detecting fetal CNVs and achieved great resolution (400 kb) with fetal fraction of 13%, however, is only feasible using deep sequencing data and information of parental SNP genotypes, which is unavailable in routine NIPT (Rampasek, et al., 2014).
Another inspiring application of cf-DNA sequencing is to be used as a surrogate for tissue biopsy, named”liquid biopsy”, for screening and monitoring tumor-derived genomic aberrations. Circulating tumor DNA (ct-DNA) can be detected in the plasma of cancer patients, and has great potential in clinical management of cancers. A plenty of targeted therapies have been developed to target copy number change of some cancer driver genes, e.g. high level amplifications of ERBB2, MET, CCND1 and FGFR1 (Baselga and Swain, 2009; Christensen, et al., 2005; Musgrove, et al., 2011; Turner, et al., 2010), etc. However, current reported noninvasive assays seldom include detection of actionable CNVs, which may be due to the serious difficulty on accurately identifying CNVs in ct-DNA, as ct-DNA typically accounts for a little proportion (<10%) of cf-DNA, even in many advanced stage cancer patients(Adalsteinsson, et al., 2017). Most published studies employed reads counting strategy for genome bins and simple statistics such as Z-test to identify CNVs. Chan et al used individual bin based Z-test on 4 high depth whole genome sequencing (WGS) (17X) of HCC cases (Chan, et al., 2013). Heitzer et al calculated segment z-score after CNV segmentation using circular binary segmentation (CBS) algorithm in 13 plasma samples of 9 metastatic prostate patients, which usually have high tumor DNA concentrations (Heitzer, et al., 2013). Xu et al performed individual bin z-score based CNV analysis on 31 patients, and showed that recognizable CNVs were only detectable in most samples with large tumor size (tumor dimension > 50 mm) (Xu, et al., 2015). However, method accurately determining the CNV fragment using shallow depth WGS data with low FPs is scarce and tools for identifying tumor-derived CNV in cf-DNA of a wide range of patients would have significant clinical values.
In this study, we present a novel approach named Somatic CNV Detecting Tool (SCDT), which has ability to use shallow WGS data and target sequencing data to detect genome-wide microdeletions or microduplications (MDs) without a priori knowledge of an event’s location. To maximize the ability to remove “noise” introduced by library construction, PCR process and sequencing, and intrinsic difference between genome regions, we used control samples to correct genome position specific bias, and two GC correction steps to regress GC content of DNA fragments and that of genome bins, respectively. Following that we nearly achieve the theoretical minimum of random fluctuation in copy ratios. A segmentation algorithm based on piecewise least squares fitting and a rigorous statistical method is applied to finally determine the CNVs. We show that SCDT recovered all “spiking in” MDs with ≥3Mbp length and ≥5% chimeric fraction (≥10% mix ratio) using about 25 million sequencing reads (0.3× genome coverage). Finally, we applied our algorithm in three cf-DNA datasets of abnormal maternal plasma samples, normal samples and a large cohort of patients with various types of cancer, respectively, and demonstrated the feasibility of SCDT for precisely detecting clinically relevant CNVs in cf-DNA.
2 Methods
2.1 Data preparation and Overview of methods
We use bam format files as input of SCDT, including at least one sample as control. The control samples should be prepared using the same protocol with the test samples, including methods for library constructing and sequencing. Duplicate reads should be removed in the bam files using software such as Picard-tools and Samtools. The aligned genome positions of reads were extracted from the bam files, with a filtering step to discard reads with low mapping quality or high number of mismatches. Additionally, for using target sequencing data as input, reads located adjacent to (<500 bp) the target region were discarded from the bam files.
Firstly, the whole genome should be divided into non-overlapping bins (defined as level-1 bins) with fixed length assigned by users. Read depth count (RDC) of each level-1 bin is obtained by counting reads with start positions in it. However, each DNA fragment was not counted by 1, but instead by 1 divided by a correction factor corresponding to its CG content (section 2.2). Secondly, the RDCs of each test and control sample were centralized to 1 by dividing their medians, and then we used the mean of centralized RDCs at each bin across the control samples to generate a reference data, which is used to normalize the centralized RDCs of test samples to obtain the copy ratios (section 2.3). Thirdly, we merged a fixed number (defined by users) of level-1bins into the level-2 bin. The copy ratio of each level-2 bin was calculated as the mean of copy ratios of level-1 bins inside it. Subsequently, we performed a second step of GC correction based on regression for the copy ratios and GC content of level-2 bins using general liner model (GLM) (section 2.4). Finally, we performed CNV segmentation by a piecewise least squares fitting on the copy ratios of level-2 bins (section 2.5), and tested the significance of each CNV segment under the assumption of independent and identical distribution of copy ratios (section 2.6) (Supplementary Fig S1).
2.2 First step of GC correction based on single DNA fragment
Firstly, we divided the GC-content range (0-1) into 1000 intervals (0.001 per interval), followed by calculating GC-content distribution on the 1000 GC intervals for 170 base-pair (bp) sliding windows (sliding with 1bp each time) in the reference genome (except the sex chromosomes). Secondly, for single-end sequencing data, we extended the sequencing reads to 170bp based on its alignment position in the reference genome. Then we calculated GC-content distribution of extended reads on the 1000 GC intervals for each sample. GC-content intervals higher than 70% and lower than 20% were discarded for their intense fluctuation of read counts. For each DNA fragment, a correction factor cf was assigned by:
Where gci is GC interval of the DNA fragment i, is distribution density of gci in the reference genome, is distribution density of gci in sequenced DNA fragments of this sample (Fig 1A). So we can get the normalized read depth count (NRDC) of each level-1 bin by
in which i represents the ID of reads whose start position located in the jth bin.
2.3 Control RDC construction and copy ratio calculation
To reduce systematic bias caused by factors other than GC-content, we used the copy ratio of test sample NRDC to the control reference NRDC (CNRDC) for further analysis. CNRDC is calculated by the following procedures: Firstly, to eliminate the influence of variation in sequencing data volumes, all the samples including test samples and control samples should be performed with centralization of NRDC by dividing the median NRDC of all genome bins. Secondly, to reduce the random fluctuation in the CNRDC and thus reduce the fluctuation in the final copy ratio, we construct CNRDC by averaging the NRDCs at each bin across the control samples. After obtaining the CNRDC, we can get the corrected copy ratio (CCR) for the test samples by
in which ccrj is the corrected copy ratio of the jth bin, nrdcjis the NRDC of the jth bin in the test sample, and cnrdcj is the CNRDC of the jth bin. Additionally, to avoid the frequent germline CNVs, outlying bins with CNRDC < mCNRDC *0.7 or CNRDC > mCNRDC *1.4 were removed, in which mCNRDC is the median value of CNRDC of all autosomal bins.
For detecting CNVs at various level of length, we performed CNV segmentation on detection bins (defined as level-2 bins), whose length (should be an integer multiple of level-1 bin length) are assigned by users. The copy ratio of each level-2 bin was obtained by averaging CCR of all level-1 bins within it.
2.4 Second step of GC-correction
Even with the first step of GC correction described in 2.2, we observed that the CCRs were still generally correlated with GC-content (Supplementary Fig S2). The remaining bias, though slightly increasing the variance of CCR, may produce false positives in several GC abnormal regions in the genome, e.g. chr1p and chr19. Thus a second step of GC-correction was performed to further remove the remaining GC bias, by a generalized linear regression for CCR and GC-content of level-2 bins, which can be illustrated as:
in which a, b and c are coefficients of regression, gc j is GC-content of jth level-2 bin, is the GC-corrected copy ratio of jth level-2 bin, and e j is the residual of ccrj. Then, was used instead of ccrj as input for CNV segmentation algorithm.
2.5 CNV Segmentation
To locate the breakpoints of CNVs and determine the CNV status of segments, we employed a piecewise least squares estimation model based on stepwise regression for each chromosome, which could also be described as piecewise linear fitting of ladder type and identified the breakpoints of ladders one by one. We minimized sum of squares of residuals to get the least squares estimation, as described below.
If we have obtained the sorted breakpoint set E ={(0,1),(j2, j2 +1),(j3, j3 +1),,(jk, jk +1),(n,n +1)} (1 < j2 < j3 <… < jk < n, each breakpoint is indicated by two consecutive genomic bins and n is the total bin numbers of this chromosome) after fitting k ladders to a chromosome, we can obtain the estimation of ccrj for each bin:
in which , or the mean value of ccrj from (jk –1 +1) th bin to jk th bin. So the sum of squares of residuals is
Next, we introduce the detailed steps of the piecewise linear fitting model by taking the chromosome i for example. For the jth bin of chromosome i, let be the discriminant set for chromosome i, in which ni is the total bin number.
Step 0: Get the initial value of coefficient of variation (CV). We firstly calculated CV of each 20 consecutive level-2 bins in all autosomes by
where SCCR and are the standard variation and the average of CCRs, respectively.. Then the initial CV (cv0) could be estimated by averaging the smallest 30% CVs of all 20 consecutive level-2 bins in the whole genome, and subsequently be used in the loop termination conditions for the piecewise linear fitting. When k=1, the breakpoint set is , which has only two breakpoints and one CNV segment. The estimation of could be calculated by the mean of all in chromosome i.
Step 1: Judge whether to terminate the loop of piecewise linear fitting for chromosome i. For discriminant set Di, we can calculate the CV value of ladder k (cvk) by
in which is the mean value of Di set, is the standard variation of Di set. When k=1, if cvk – γ *cv0 ≤ 0 (γ is a coefficient assigned by users, and γ = 1 by default) we terminate the loop for piecewise linear fitting. And when k>1, if
we terminate the loop, where r is a coefficient assigned by users corresponding to the lowest chimeric fraction for CNV detecting (3% by default), B is the length of level-2 bin (1M bp by default) and L is the smallest size for CNV detecting (3×b, by default). Otherwise, k=k+1, and then go to step 2.
Step 2: Add one optimal point to breakpoint set by traverse the potential new breakpoint set . After adding one more breakpoint to , we can get the new breakpoint set and the new sum of squares of residuals . To get the optimal breakpoints for fitting the chromosome, we minimize the over all :
Step 3: Test whether the new breakpoint set is significant for fitting chromosome i. According to the stepwise regression model, if k>2, we should test whether including each one of the previous breakpoints in is more significant for fitting chromosome i than including the new breakpoint in step 2. If including a previous breakpoint is not more significant than including the new breakpoint, this previous breakpoint should be removed from the breakpoint set while k=k-1, until all the remained breakpoints are more significant than the new breakpoint. Then, go to step 1. The significance of including a breakpoint is assessed by calculating the decrease in ESS.
2.6 Significance test
For chromosome i of the test sample, we assume that and follow independent identical normal distribution . The unbiased estimation of can be calculated by
So the CCR in a normal bin should follow distribution of . Therefore, we can inferred that average CCR of the kth segment follows distribution of under the H0 hypothesis of , and the significance of the kth segment to reject H0 could be tested.
Because the additional deviation in the copy ratios induced by real CNVs could not be determined before going through the pipeline, the first run of step 2.5 and 2.6 was only used to define normal regions in the whole genome, and thus used to set parameters cv0 for the second run of these two steps to obtain the final segmentation results and the significance of each CNV. Before the second run of step 2.5 and 2.6, the copy ratios across the genome should be centralized again using the average copy ratio of normal bins defined by the first run. The initial coefficient of variation cv0 used in the step 0 of 2.5 in the second run is also reset by cv of normal bins at the first run.
3. Results
3.1 Theoretical limitation of CNV detectability in cf-DNA
The sequenced DNA fragments only took a very small part of the cf-DNA in the circulation, so the process of blood drawing, cf-DNA extracting and library construction could be regarded as random sampling of cf-DNA from an infinite population. Thus the read count examined in a certain genome bin should follow Poisson distribution. To calculate theoretical limitation of CNV detectability in cf-DNA, we firstly assessed the theoretical random fluctuation of copy ratios. The RDC in the ith bin could approximate to a random variable following Poisson distribution P(λi) with a mathematical expectation of λi. However, λi may be different for different i because some inherent characteristics in different genomic regions could affect the read count, such as the mappability (Ha, et al., 2012). We have:
in which N is the total effective reads of this sample, b is the total number of level-2 bins in the whole genome, and Pi is a position specific coefficient to adjust the reads count in i th bin.
Considering X and Y are independent, the random fluctuation of copy ratios could be evaluated by
in which λ1 is mathematical expectation for test sample X and λ2 is mathematical expectation for the control reference Y.
It is easy to deduce a lower bound of CV
For brevity, we assume that Pi = 1 for all i in each sample, and use the variance of all bins in a sample as the variance of each bin. Then we can get:
In order to evaluate the effectiveness of our GC-correction method, we applied this method to 67 normal maternal plasma samples, and compared the actual coefficient of variation (ACV) before and after GC-correction with theoretical lower limit of coefficient of variation (TCV) for the copy ratios (CR). Before GC-correction, we identified various degrees of correlation between CR and GC-content in the 67 samples, implying that different samples were affected by different levels of GC bias, even with identical number of PCR cycles in the library process. We also identified a broad range of correlation between CRs of different samples (Fig 1B), suggesting the non-independence of CR in different samples. After GC-correction, ACVs were greatly reduced, as well as the linear correlation between GC-content and CRs, and the correlation of CRs between different samples. Moreover, we approximately achieved the theoretical lower limit of TCV after GC-correction (Fig 1C). These results implied that systematic bias in CRs are mostly contributed by GC bias, which could be almost completely removed using our GC-correction method. After GC-correction, CR of different samples obeyed the assumption of independence in step 2.6. However, factors other than GC-bias that introduce systematic bias in the RDC should also affect RDC of the control samples, and thus should be normalized in the CR.
We next assessed theoretical limit of detecting power given a confidence coefficient (p-value). A significant CNV segmentation could be modeled as several continuous bins with deviant CRs. Approximately, we assumed that CR of single bin followed an i.i.d normal distribution, thus the detecting power could be assessed by
where r / 2 is the chimeric fraction of a heterozygous CNV, α is the confidence level, z1–α / 2 is the 1–α / 2 quantile of standard normal distribution, L is the length of CNV, and B is the level-2 bin size. Supplementary Figure S3 shows the smallest chimeric fraction of detectable CNV with various lengths and amount of effective reads. Given N=25M, b=3000, B=1M, L=3M, and p=10e-5, we could infer that r /2 =2.80%.
3.2 Simulation Study and Comparison between segmentation methods
To evaluate the performance of SCDT, we blended cf-DNA from a healthy female with DNA from tissues of aborted fetuses whose CNVs had been determined by G-banding karyotyping, to simulate cf-DNA from maternal plasma with abnormal fetus. Using this method and tissues of 11 aborted fetuses, we obtained 108 simulated samples with various mixture ratios (MR), including 3%, 5%, 8%, 10%, 15% and 20%. All the simulated samples were sequenced with 35bp single-end reads on BGIseq 500 platform. After filtering out reads with low mapping quality (<Q30) or high mismatch numbers, the average amount of effective reads for each sample is averagely 25M (8M-69M), which is applicable in current NITP test. Considering that the CNVs of aborted fetuses are different from each other, in this experiment we used the median NRDC of all the samples as the control reference for normalization.
We set the level-1 bin size to 100kb and the level-2 bin size to 1M, and then used SCDT to analyze these samples. To reduce false positives, we required a CNV segment with p-value smaller than 10e-5), and with copy ratio >1.01 or <0.99. Here, we defined a predicted CNV as a true positive if it overlapped with at least 50% of a spike-in CNV. Using SCDT we detected all the spike-in CNVs with chimeric fraction ≥5% (mixture ratios ≥10%) and length ≥3M and had no false positive with length ≥3M after filtering out putative germline CNVs (Fig 2). We then compared our results with the theoretically detection limitation and found that SCDT detected most of theoretically detectable CNVs, though missed some near the line of theoretically limitation (Supplementary Table S1). However, the spike-in fractions evaluated from CRs of CNVs were lower than expected (Supplementary Fig S4), probably because of failure to remove some large fragment DNA during preparation of fetal DNA.
We compared the performance of SCDT on the simulated samples with the state-of-the-art CNV detection methods, including BIC-seq (Xi, et al., 2011), DNAcopy (Venkatraman and Olshen, 2007), Control-FREEC (Boeva, et al., 2012) and CNV-seq (Xie and Tammi, 2009). Considering that CBS and BIC-seq didn’t have GC correction workflow, these methods were evaluated following preprocessing by our GC-correction method. For each detector, we adjusted the parameters and cutoffs until the results achieved the fewest false positives with sensitivity of 60% (Supplementary Method). We observed that our GC-normalization step greatly improved performance of CBS and BIC-seq. For the 56 samples with theoretically detectable CNVs of size ≥ 3M, the segmentation method of SCDT had the highest sensitivity (89.29%), followed by CBS (82.14%) and BIC-seq (76.80%). When considering all the samples with the CNV size more than 3M, the segmentation method of SCDT also had higher sensitivity (59.77%) than CBS (52.87%) and BIC-seq (49.43%), while the false discovery rates for SCDT, CBS and BIC-seq were around 1.89%, 8% and 0%. We then compared the average sum of length of false positives in each sample at different level of sensitivity, and demonstrated that the segmentation method of SCDT outperformed the other methods (Fig 3).
3.3 Real Data Analysis3.3.1 Abnormal maternal plasma samples
To further evaluate the performance of SCDT, we applied it to cf-DNA sequencing data of real clinical samples, including maternal plasma and plasma of cancer patients. 34 maternal plasma samples carring abnormal fetal CNVs previously determined by amniotic fluid puncture and G-banding karyotyping were sequenced with average effective reads of 18M (11M - 31M). We chose another 9 normal maternal plasma samples to construct the control reference. The parameter setting was the same with the simulated data described above. We detected all confirmed CNVs in the 34 cases with only one false positive with length ≥3M, indicating high sensitivity and specificity of SCDT (Fig 4; Supplementary Table S2). We applied SCDT on another 7 cases of maternal plasma reported with abnormal genotyping by BGI NIPT workflow, but reported negative by amniocentesis and G-banding karyotyping. Interestingly, we observed chimeric CNVs in all these 7 cases, with great significance (Supplementary Fig S5). None of these women had been identified with a cancer, and inconsistence in these cases might result from chimeric placenta, abnormal hematologic clones of maternal, or false negative reports of amniocentesis.
3.3.2 Target sequencing of cf-DNA for normal individuals
We then investigated CNVs in 118 cf-DNA target capture sequencing data from normal individuals. All these samples were target enriched using a panel covering 1.7 megabases before sequenced with paired-end 100bp reads on Hiseq2500 platform. To detect CNVs using off-target reads as analogue of low-depth WGS data, we filtered out reads on or near target regions to build whole genome sequencing depth for these plasma samples. The number of off-target reads is 58M averagely (27-88M). Using p<10e-5 for cut-off and filtering out a few frequent false positive regions (totally 27M) that are affected by polymorphic CNVs or harbor long centromere and telomere sequence, the false positive callings was evaluated to be 0.03% of the whole genome (Supplementary Fig S6). However, the false discovery rate should be overestimated, for some of false positives were induced by germline events.
3.3.3 Target sequencing of cf-DNA for cancer patients
Finally, the same method for analyzing target-off reads was applied on additional 240 cf-DNA samples of patients with various types of cancers (Supplementary Table S3), which were captured by the same target panel used for normal individuals, with off-target reads of 45M averagely (16-170M). 10 female samples in the normal dataset were used to build the control reference. Blood cell DNA of these patients was also target sequenced as control to determine somatic point mutations. The most frequent copy number changes in 240 samples included gains of chr1q, 3q, 8q and loss of chr1p, 4, 8p, 17p, consisting with the CNV profiles in cancers (Fig 5, Supplementary Table S4). We evaluated the concentration of ct-DNA in cf-DNA using both point mutations and CNVs, and identified a high correlation between them (r=0.72, Supplementary Method). Samples detected with CNVs have significantly more point mutations and higher mutational variant allele frequency (VAF, p<1e-10) (Fig 5). Of the 107 samples that have >10% of the genome detected with CNVs, 97 (90.7%) were with average mutational VAF >2%, while in other 86 cancer samples that have ≤1% of the genome detected with CNVs,, only 6 (7.0%) were with average mutational VAF >2% and another 54 (62.8%) sample were absence of point mutations. In addition, we used VAF of point mutations to assess the copy number of CNVs, and detected several targetable agents including high amplification (CN≥7) of EGFR (n=9), MET (n=3), ERBB2 (n=5), KRAS (n=5) FGFR1 (n=1), CCND1 (n=7), CDK4 (N=4) et al (Fig 6; Supplementary Fig S7). However, high amplification (CN≥7) of some other genes known to have a role in drug resistance could also be detected in several samples, such as MYC (n=8) and MCL1(n=4), etc.
Discussion
Deviation in the copy ratio of each genome bin is composed of at least two components, one is natural fluctuation caused by the random sampling process in blood drawing, DNA extracting, library constructing and sequencing, and the other is systematic fluctuation caused by factors such as GC non-uniform distribution or different success rate of sequencing or reads mapping in certain genome regions. According to the theoretical statistic rules, the natural fluctuation would result in poisson distribution of the final read count per genomic region, and thus can be exactly evaluated. We showed that the limit detection ability for a CNV with certain chimeric fraction and length only depends on the read count per bin, according to the theoretical statistic rules. In this study we introduced a GC correction approach to remove almost all the deviation in copy ratios contributed by non-random factors in cf-DNA sequencing data. However, as FP rate particularly concern clinical application in large scale population screening, and even a low FP rate could produce considerable FP cases, avoiding FPs of biological sources, including germline CNVs, placental mosaicism and maternal abnormality, are noteworthy in addition to reducing FPs of technical source. Several studies have identified CNVs of hemopoietic origin in blood cell-DNA in about 1-3% of normal individuals (Jacobs, et al., 2012; Laurie, et al., 2012), while they are reasonable to also present in cf-DNA, which is mainly derived from blood cell DNA (Sun, et al., 2015).
Cf-DNA sequencing brought immense opportunities for molecular diagnosis in clinical settings, especially in the field of cancer. However, while point mutations and methylation of cf-DNA have been widely used as biomarkers for cancer screening, early detection, treatment guiding, and disease monitoring, the cf-DNA copy-number signatures, another important kind of genomic aberrations and targets of a handful of drugs, is still seldom mentioned and evaluated in studies, partially because of the low ct-DNA fraction and technique difficulty in overcoming the low signal to noise ratio for CNV detection. Another concern is that quantifying low chimerical CNVs using WGS data requires big data size and increases the cost. However, we demonstrated that using target-off reads, the by-product of target sequencing, was a feasible way to profile somatic CNVs in cf-DNA. In addition to providing information in guiding therapeutic decisions, interrogating cf-DNA CNVs may also help to evaluate the therapy efficiency and monitor tumor recurrence. However, according to our analysis, the detection limit of chimeric ratio of CNV is inversely proportional to the square of read count, which indicates that 10 folds improvement of sensitivity requires 100 folds of sequencing data, and thus hampers the interrogating of ultra-low chimeric CNVs. Therefore point mutations may be better biomarkers for samples with low ct-DNA fraction. Moreover, as ct-DNA fraction in early stage cancers is extremely low (0.1% or less), CNVs in ct-DNA are hard to be profiled and thus not economically applicable to early cancer detection. Overall, integrating CNV analysis into liquid biopsy offers a global view of genomic aberrations, and provides more opportunity for clinical management of cancer.
To accurately find more common chromosomal abnormalities with smaller size at earlier gestational stage and cancer stage is an important goal of non-invasive clinical testing. As continuous reduction in sequencing cost, larger data size will become available soon with affordable cost. This allows for more precise diagnostics as our method is expected to perform better with increased sequencing depth. A higher coverage will allow for more stable calls and using smaller bin sizes while keeping the read depth per bin high enough to detect changes confidently.
Supplementary Tables
Acknowledgements and Funding
The research reported in this work was supported by BGI Academy of Life Sciences and BGI Institute of big data without any official funding.
Footnotes
↵† The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors.