Abstract
We currently lack an understanding of somatic mutation frequencies and patterns in benign tissues, as studies are often limited to the identification of mutations in clonal expansions (1). Using a novel method capable of accurately detecting mutations at single base pair resolution with allele frequencies as rare as 10^-4, we find a surprisingly high somatic mutation burden of 50-900 mutations/MB in peripheral blood cells from apparently healthy individuals. Nearly all analyzed sites carry at least one somatic mutation (including known oncogenic mutations) within approximately 20,000 cells. Unexpectedly, mutation patterns and corresponding allele frequencies are highly similar between individuals, age-independent, and lack signatures of selection. We also identified two individuals with patterns of somatic mutation that resemble mismatch repair deficiency, exhibiting mutations at uniformly elevated frequencies. These results demonstrate that somatic mutations, including oncogenic changes, are abundant in healthy human tissue and suggest an unappreciated degree of non-randomness within the processes underlying somatic mutation.
Introduction
The processes involved in somatic mutagenesis are typically regarded as largely stochastic, and have been incorporated into theories of oncogenesis, aging, and evolution accordingly (2, 3). Nevertheless, it is well known that mutation rates can be influenced by such factors as chromosomal location, nucleotide identity, and sequence context (4–9). A good example of mutation bias, and one that accounts for a substantial number of human point mutations, is cytosine deamination within CpG contexts, which strongly favors C>T point mutations (10–12). The effect of this bias is not limited to the CpG site itself, as the mutation rate increases within 10 nucleotides of a CpG dinucleotide (13). Moreover, neighboring base pairs can influence the somatic mutability of a nucleotide (14, 15). While many other notable examples of biased mutability have been identified, understanding somatic mutation rates and biases for each nucleotide position within the human genome has been significantly restricted by technological limitations (9, 16–18).
Somatic mutations are constantly occurring, yet without clonal expansion, each unique mutation will typically exist at a very low allele frequency (19–22). This scarcity of somatic mutations makes it challenging to understand how deterministic mutation rates and burdens are, and has provided motivation to improve the sensitivity of existing methods. Technologies such as high-throughput digital droplet PCR (23–25), COLD-PCR (26, 27), and BEAMing (28) have shown promise for rare mutation detection, but are often limited to variant allele frequencies (VAFs) greater than 1 percent or are restricted to assaying only a few mutations at a time. In comparison, sequencing-based approaches can theoretically detect many mutations below a 1 percent allele frequency, but distinguishing true signal from a relatively high false positive background has been a significant challenge. These signal-to-noise difficulties have been somewhat overcome by increasing the depth of sequencing, using clever methods of DNA barcoding (19, 29), or performing paired strand collapsing (30). Despite these advances, high false positive rates and low allele capture efficiencies have largely prevented sequencing-based approaches from yielding a comprehensive understanding of mutation rates and biases within the human genome (19, 22, 31).
A better understanding of somatic mutation rates could have a profound influence on our understanding of somatic evolution and its role in pathogenic processes like oncogenesis (32, 33). The aforementioned technological limitations have made it difficult to study somatic mutation burden and rates in healthy tissue, largely confining measurements of somatic mutation levels to retrospective reconstruction of in vitro (14, 34–36) or in vivo (29, 36–41) clonally expanded cells. By analyzing clonally expanded cells, these methods are typically confined to analysis of founder cells, and miss further downstream somatic changes. These limitations have left significant gaps in our knowledge of somatic mutation rates within healthy, properly functioning tissue.
Results
To overcome current sequencing limitations, we created FERMI (Fast Extremely Rare Mutation Identification), in which we adapted the amplicon sequencing method of Illumina’s TruSeq Custom Amplicon platform to efficiently capture regions of genomic DNA (gDNA) purified from peripheral blood cells. While targeted sequencing is typically performed on broad regions of DNA, we used DNA probes to target and capture a precise set of 32 genomic regions, each approximately 150bp in length, that span either AML-associated oncogenic mutations or Tier III (non-conserved, non-protein coding and non-repetitive sequence) regions of the human genome.
With a significantly improved probe capture efficiency that yields about 1.2 million unique captures from 1μg of gDNA (see Methods), this approach enabled ultra-deep sequencing of peripheral blood cells. To overcome the false positive signals that often limit the utility of ultra-deep sequencing, we included in our DNA capture probes a 16bp index, containing sequence unique to each probed individual, and a 12bp unique molecular identifier (UMI) of randomized DNA unique to each capture (Fig. 1a). Sequencing reads of these capture probes were then sorted by sample index and UMI to produce bins of single cell sequencing, which were collapsed to produce largely error-free consensus reads. Captures were only considered if supported by at least 5 sequencing reads, and variants were only included if identified in both paired-end sequences and detected in at least 55% of supporting reads for each capture (Fig. 1a and Methods; see also Supplementary Fig. 1).
All probed regions were successfully captured and amplified with some variability in efficiency depending on probe identity (Fig. 1b). To understand assay sensitivity, log-series ratios of one human’s gDNA diluted into another’s gDNA were analyzed by FERMI. We observed robust quantification of spiked-in single nucleotide polymorphisms (SNPs) with frequencies as rare as 10^-4 (Fig. 1c). Accurate quantification of SNP frequency can also be made when using strand information to follow dilutions of multiple SNPs located on the same allele (Fig. 1d). For more description of the methods used to maximize the accuracy of FERMI, see Elimination of false positive signal in Methods and Supplementary Fig. 1.
To assay somatic mutation burden in peripheral blood and understand how it changes with age, we used FERMI to capture and sequence gDNA from the peripheral blood of 22 apparently healthy donors ranging in age from 0 (cord blood) to 89 years old (Supplementary Table 1). Common and rare germline SNPs could be readily identified by their allele frequencies (Supplementary Fig. 2). In addition, FERMI detected many rare somatic mutations present below 0.3% allele frequencies in these samples. Interestingly, nearly all analyzed sites had at least one somatic mutation across ~20,000 cells in peripheral blood (reflecting ~40,000 captured alleles). These observed mutation rates predict a burden of 50 to 900 mutations per megabase (see Estimation of mutation burden in Methods), a burden that is much higher than estimates derived from hematopoietic tumor analysis, which typically range from 0.02 to 1 mutations/Mb (34). As leukemias are often of stem or progenitor cell origin (42), our elevated estimates suggest that mutation rates might be considerably elevated during production of terminally differentiated cells.
To understand variation within the human population, the variant allele frequency (VAF) for each rare variant was compared between each of the 22 blood donors. Unexpectedly, we found that these rare somatic variants existed at remarkably similar allele frequencies between individuals, across the full sampled age range. These rare VAFs are similar enough between most individuals that inter-individual comparisons for each unique substitution fall along a y=x line (Fig. 2a). We also created an average of the rare VAFs across the 22 donors, and used this for comparison to each individual, which also adhered to a y=x line (R^2 range = 0.426-0.631, mean = 0.558) (Fig. 2b-c). Indicative of minimal age-related change in the mechanisms governing leukocyte somatic mutation spectra, the degree of mutation pattern similarity between individuals compared to the population average does not correlate with age (Fig. 2c). These similarities also reproduce in an independent experiment with a separate cohort of blood donors (Supplementary Fig. 3). This lack of age-dependence suggests that most of these mutations were unlikely to have occurred in long-term stem or progenitor cell populations, and instead arose at later stages of hematopoiesis. Furthermore, most variants likely represent multiple independent events rather than clonal expansions, as they are found at similar frequencies on both alleles (Fig. 2d). Consistent with this interpretation, analyses of serial samples from the same donors (Supplementary Fig. 2c) are also highly concordant, suggesting that such mutations probably arise transiently and recurrently in blood cell populations. It thus appears that instead of being semi-random, the aggregate effect of all DNA damage and maintenance processes generates somatic mutations at predictable rates throughout the genome independently of age.
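The text does not state how the R^2 values against the y=x line were computed. One plausible reading, sketched below as an assumption rather than the authors' procedure, is a coefficient of determination in which the prediction for each individual's VAF is simply the population-average VAF for the same substitution. The function name and input layout are hypothetical.

```python
def r_squared_identity(x, y):
    """Coefficient of determination of y against the fixed line y = x.

    x: reference values (e.g. population-average VAFs), one per substitution
    y: values for one individual, in the same substitution order
    No slope or intercept is fitted; the prediction for y[i] is x[i].
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - xi) ** 2 for xi, yi in zip(x, y))  # residuals from y = x
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)          # total variance in y
    if ss_tot == 0:
        raise ValueError("y has no variance; R^2 is undefined")
    return 1.0 - ss_res / ss_tot
```

Because the line is fixed rather than fitted, this statistic can be negative when an individual's VAFs deviate from the average more than they vary among themselves, which makes it a stricter test of agreement than a fitted regression.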
While we observe variants at conserved frequencies across many individuals, previous studies have described age-related clonal expansions of cells containing AML-associated oncogenic changes (37–39, 41). Though we observe each queried oncogenic change in every biopsied individual, we do not observe significant age-related increases in the allele frequencies of oncogenic mutations (Fig. 2e and Supplementary Fig. 4). Thus, there was no clear evidence for positive or negative selection accompanying these oncogenic mutations. This inability to observe clonal expansions with age is most likely due to the fact that the average age of the adult individuals within our cohort is only 49 years, with only 5 donors older than 70 years. In addition, oncogenic mutations occurring at later stages of hematopoiesis may be unlikely to result in clonal expansions.
Previous observations suggest disparate mutation rates for each of the four DNA bases (14, 37, 40, 43, 44). Consistent with these observations, we observe nucleotide specific substitution probabilities, with C>T and T>C substitutions being the most common and T>G substitutions being the least common (Fig. 3a). When base change probabilities are analyzed within the context of their two flanking nucleotides (trinucleotide context), significant differences in substitution probability emerge, illustrating a significant impact of nucleotide context on overall mutability (Fig. 3b). While these immediately surrounding bases appear to significantly impact substitution probability, bases further away appear to have relatively little impact (Supplementary Fig. 5a). Also consistent with expectations, CpG positions were more likely to mutate than others in a manner not explained by oversampling of CpG sites (Fig. 3c and Supplementary Fig. 6).
Independent of functional or oncogenic potential, substitutions occur at rates that are uniquely determined by nucleotide position, such that each locus mutates in a highly reproducible manner (Fig. 4a-b). Strikingly, across all assayed individuals, a subset of sites shows a highly significant bias to mutate to only one of the three possible alternative nucleotides (Supplementary Fig. 7). Even for matching trinucleotide contexts, substitution frequencies can vary, and often fall within just one of two distinct upper and lower VAF clusters (Fig. 5a and Supplementary Table 2). Thus, the substantial variation in mutation probabilities for different positions cannot simply be explained by base bias or trinucleotide context, but likely involves context conferred by other factors like histone and DNA modifications or chromosomal organization. Nonetheless, analyzing the neighboring base contexts for each base change separated into lower-VAF and upper-VAF groups revealed an influence of the flanking bases on mutation frequency for some changes but not others. For example, the immediate flanking bases exerted a much greater influence on the VAFs for C>A changes than C>T changes, and thus explain some but not all variability in mutation frequency (Fig. 5b and Supplementary Fig. 5b).
Possibly reflecting differences in mutational (and selection) processes within cancers, the integrated exome sequencing pan-cancer somatic mutation data from The Cancer Genome Atlas (TCGA) exhibit different substitution patterns from those that we find in healthy donor blood (Supplementary Fig. 8). Using the trinucleotide contexts of the substitutions, 7 out of 30 previously identified mutational signatures were identified, and these signatures did not differ significantly across sampled genomic segments (Supplementary Fig. 8).
To explore the ability of FERMI to distinguish perturbations of somatic mutation patterns, gDNA from MMR deficient HCT116 cells (MMRMT) that express truncated and non-functional MLH1 protein was compared to MMR proficient HCT116 cell line gDNA. Providing further validation of our method, across multiple experiments, we observed a substantial increase in VAFs within the MMRMT gDNA when compared to MMR competent HCT116 control (Fig. 6a and Supplementary Fig. 10a). Interestingly, while the mutation spectra of most peripheral blood samples resemble those in other individuals, the spectra from two individuals (samples 2 and 19) possessed a subset of variants that deviated from the population averages, having allele frequencies about twofold higher than average (Fig. 6b-c, and Supplementary Fig. 9). While the magnitude of deviation from mean VAFs was different between the two samples, the identities of the deviating variants were very similar, such that a comparison of VAFs between these two individuals correlates more closely to a y=x line than to the overall population average (Fig. 6d). This consistent deviation in VAFs for these two individuals suggests that the mechanisms governing mutation prevalence can be systematically perturbed in a manner that uniformly alters certain substitution probabilities.
Surprisingly, the substitution VAFs observed within individuals 2 and 19 resembled those altered in the MMRMT HCT116 cells, though the magnitude of these changes was greater in the latter (Fig. 6e). Furthermore, the deviating variants found within individuals 2, 19 and MMRMT samples are not enriched for oncogenic variants (Fig. 6f; shown for individual 2), indicating that deviations are not likely the result of oncogenic selection.
As expected from past studies (45), the HCT116 MMRMT gDNA showed an increased prevalence of T>C and T>A substitutions when compared to parental gDNA (Supplementary Fig. 10). Peripheral blood gDNA from individuals 2 and 19 also exhibited similarly increased rates of T>C and T>A substitutions (Fig. 6g-h and Supplementary Fig. 9). Thus, these two individuals appear to exhibit a mild MMR deficiency. In support of the results, individuals 2 and 19 show the same increased rates of substitutions across two experiments, with strong reproducibility in mutation patterns (Supplementary Fig. 9j). The systematic and reproducible variance from the typical mutational pattern for these two individuals and the MMRMT HCT116 cells also serves as validation of the specificity of FERMI to accurately detect variants and their frequencies. More importantly, the identification of two individuals with altered somatic mutation patterns out of only 35 individuals may indicate that systematic somatic mutation deviations from typical mutational profiles may be relatively common in the human population.
Discussion
In this study, we created a unique method of measuring levels of ultra rare somatic mutations and mutational burden within human blood. Use of this method gave rise to several key findings. First, we found an unexpectedly high somatic mutation burden within putatively healthy individuals, where cells contained 100-1000 times more mutations than previous measurements in stem cells and even most cancers. As tumors may often originate from stem cells, retrospective analysis of high frequency variants may largely reflect stem cell mutation burden. Together with measurements derived from healthy stem cells, our higher observed mutation burdens in mature blood cells could reflect unique use of mutation reducing processes within stem cells such as lower cell division rates, reduced exposure to oxidative stress, higher efflux pump activity, and perhaps better DNA repair. The second important finding was that all probed oncogenic changes were observed in each evaluated individual without evidence of either positive or negative selection, suggesting that oncogenic mutations occurring in short-lived, lineage-committed cells pose minimal risk. Moreover, these results indicate that oncogenic mutations are not uniformly under positive selection in normal tissues.
The third important observation was the surprising degree of similarity between the somatic mutation patterns of different individuals. If somatic mutations occurred in a largely stochastic manner, it may be logical to expect striking differences between individual somatic mutation patterns. Yet within our cohort only two individuals exhibited noticeable differences from the others. Furthermore, while nucleotide context is known to influence the mutability of genomic loci, we find that each nucleotide locus carries with it a uniquely determined mutability rate for each possible substitution. These mutability rates are conserved across nearly all measured individuals, and appear responsible for the observed similarities in somatic mutation. This would suggest that while somatic mutagenesis is often seen as a largely random process, in reality, it appears to be governed by a number of complex and highly deterministic factors.
Human mutation rates have long been an area of study, but technological limitations have largely necessitated that they be indirectly measured through clonal expansions of isolated healthy cells, in tumor cells, or from germline mutation rates across generations (35–37). While DNA sequencing based methods allow for the observation of ultra rare mutations, if enough DNA is sequenced to reach the allele frequencies present in somatic cells, false positive rates tend to climb high enough to obscure true rare mutations. We solved these problems with a barcoding system that allowed each captured genomic allele to be distinguished from that of other captures, providing near single cell sequencing resolution to bulk sequencing experiments. Furthermore, we sufficiently improved DNA capture efficiencies to allow capture of millions of unique alleles from each blood biopsy. This high allelic capture rate enabled reliable detection of mutations at rare enough allele frequencies that spontaneous somatic mutations could be observed. This sequencing strategy revealed somatic mutation loads per cell that are orders of magnitude higher than those measured in hematopoietic stem or progenitor cells (following clonal expansion) (34). Given our estimates of 50 to 900 mutations/MB in the average mature leukocyte, this burden would suggest a mutation rate between roughly 10^-6 and 10^-5 mutations per nucleotide per division, if one assumes that a mature leukocyte is the product of approximately 100 cell divisions. This mutation rate is substantially higher than those measured in progenitor cells, and the corresponding burden also exceeds that of mature skin cells, which is only about 6 mutations/MB (40). The disparity between mutation rates for cell divisions in short-term progenitors (leading to terminal cells) and cell divisions in stem and germ cells may reflect the importance of investing more heavily in genome damage avoidance and repair within stem and germ cell populations.
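The conversion from mutation burden to per-division mutation rate quoted above can be checked with a few lines of arithmetic. The sketch below assumes, as the text does, about 100 divisions per mature leukocyte and a uniform rate across divisions; the function name is hypothetical.

```python
def per_division_rate(burden_per_mb, divisions=100):
    """Convert a burden in mutations per megabase into mutations per
    nucleotide per cell division, assuming a uniform rate per division."""
    per_nucleotide = burden_per_mb / 1_000_000  # 1 Mb = 10^6 nucleotides
    return per_nucleotide / divisions

low = per_division_rate(50)    # about 5e-7 per nucleotide per division
high = per_division_rate(900)  # about 9e-6 per nucleotide per division
print(f"{low:.1e} to {high:.1e} mutations per nucleotide per division")
```

Under these assumptions the range is about 5 x 10^-7 to 9 x 10^-6, which rounds to the order-of-magnitude range of 10^-6 to 10^-5 given in the text.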
Furthermore, the accuracy of FERMI may facilitate a better understanding of the extent of somatic evolution within tumors. As previously elaborated, typical studies are confined to retrospective study of early driver mutations within clonogenic expansions, but are technologically limited from understanding later mutation acquisition within tumors (subsequent to the most recent bottlenecks). FERMI could be used to better understand how cancers evolve, particularly if leveraged during periodic sampling of malignancies.
Our studies demonstrated that within about 20,000 blood cells (2-5 µl of blood) all queried oncogenic mutations were present in each biopsied individual. While previous studies have demonstrated clonal expansions of some oncogenically initiated cells in a fraction of elderly individuals (37–39), we observe no such effect. While this is likely due to insufficient old-age samples, we were surprised to find that oncogenic mutations are ubiquitous in even very young individuals and at conserved frequency regardless of age. This is consistent with the frequent detection of oncogenic mutations in individuals over 50 in a previous study (29). This finding is bolstered by, and may help explain, the previously reported commonality of oncogenically-initiated clonal expansions in sun exposed skin (46). As we are largely sampling terminally differentiated cells, we conclude that oncogenic mutations are being reliably generated during the production of mature cells throughout human life at consistent rates.
Given the common presence of oncogenic mutations in normal tissues, numerous hurdles clearly exist that prevent further cancer evolution, including intrinsic tumor suppressor pathways such as senescence and the hierarchical organization of tissues (47, 48). In cancers like acute myeloid, chronic lymphoid and chronic myeloid leukemias, which have been shown to initiate in hematopoietic stem cells (49–52), the small numbers and low division rates of these target stem cells should serve as a barrier to oncogenesis. Our results also highlight the importance of tissue maintenance mechanisms, which can maintain functionality despite mutation accumulation, in limiting and delaying both cancer and aging (47, 53). Finally, the prevalence of oncogenic mutations in benign tissues may introduce important challenges to early detection and monitoring of cancer progression.
Additionally, our results indicate that cells carrying oncogenic and other novel epitope-generating mutations are not readily eliminated by the immune system, as might be expected. This is perhaps due to insufficient accompaniment by damage signals like cytoplasmic DNA, or interferon and interleukin signaling (54). Furthermore, it is possible that this frequent generation of non-synonymous mutations during human life acts as a tolerizing mechanism that may limit the effectiveness of the immune system in attacking and eliminating tumors or oncogenic expansions.
From previous studies, we expected to observe some bias in mutational frequencies based on sequence context, but that the overall somatic mutation profile would be highly random and unique to an individual at a particular moment in time. Instead, what we found was an incredible degree of similarity between the somatic mutation profiles of each biopsied individual. We show that somatic mutation burden is so highly conserved that each observed substitution exists at very similar frequencies within most biopsied individuals. Furthermore, the manner in which a nucleotide mutates appears to be highly dependent on its particular base location. We expect this dependency reflects the impact of surrounding nucleotides, chromosome context, and epigenetic profile. From these observations (extrapolated genome-wide), we hypothesize that nearly all somatic mutation is predictable and deterministic.
Finally, we observe two individuals whose somatic mutation burden deviates from the others. Surprisingly, both appeared to closely resemble the patterns created by mismatch repair deficiency. With only two samples displaying such a phenotype, it is challenging to understand its prevalence in the population, but these results suggest that deviation from typical mutation frequencies may be relatively common. While we already know of some differences in human mutation patterns (55–57), a finding that mutation incidence rates can be significantly increased or decreased without affecting cancer or aging rates would indicate that the human body’s tolerance for mutations may be greater than previously appreciated.
Supplementary Materials
Materials and Methods
Amplicon Design
Amplicon probes for targeted annealing regions were created using the Illumina Custom Amplicon DesignStudio (https://designstudio.illumina.com/). UMIs were then added to the designed probe regions and generated by IDT using machine mixing for the randomized DNA. Probes were PAGE purified by IDT. All probes are listed below along with binding locations and expected lengths of captured sequence.
Genomic DNA Isolation
Human blood samples were purchased from the Bonfils Blood Center Headquarters of Denver, Colorado. Our use of these samples was determined to be “Not Human Subjects” by our Institutional Review Board. Biopsies were collected as unfractionated whole blood from apparently healthy donors, though samples were not tested for infection. Samples were approximately 10 mL in volume, and collected in BD Vacutainer spray-coated EDTA tubes. Following collection, samples were stored at 4°C until processing, which occurred within 5 hours of donation. To remove plasma from the blood, samples were put in 50 mL conical tubes (Corning #430828) and centrifuged for 10 minutes at 515 rcf. Following centrifugation, plasma was aspirated and 200 mL of 4°C hemolytic buffer (8.3g NH4Cl, 1.0g NaHCO3, 0.04 Na2 in 1L ddH2O) was added to the samples and incubated at 4°C for 10 minutes. Hemolyzed cells were centrifuged at 515 rcf for 10 minutes, supernatant was aspirated, and pellet was washed with 200 mL of 4°C PBS. Washed cells were centrifuged at 515 rcf for 10 minutes, from which gDNA was extracted using a DNeasy Blood & Tissue Kit (Qiagen REF 69504).
Amplicon Capture
For amplicon capture from gDNA, we modified the Illumina protocol called “Preparing Libraries for Sequencing on the MiSeq” (Illumina Part #15039740 Revision D). DNA was quantified with a NanoDrop 2000c (ThermoFisher Catalog #ND-2000C). 500ng of input DNA in 15μl was used for each reaction instead of the recommended quantities. In place of 5μl of Illumina ‘CAT’ amplicons, 5μl of 4500ng/μl of our amplicons were used. During the hybridization reaction, after gDNA and amplicon reaction mixture was prepared, sealed, and centrifuged as instructed, gDNA was melted for 10 minutes at 95°C in a heat block (SciGene Hybex Microsample Incubator Catalog #1057-30-O). Heat block temperature was then set to 60°C, allowed to passively cool from 95°C and incubated for 24hr. Following incubation, the heat block was set to 40°C and allowed to passively cool for 1hr. The extension-ligation reaction was prepared using 90 μl of ELM4 master mix per sample and incubated at 37°C for 24hr. PCR amplification was performed at recommended temperatures and times for 29 cycles. Successful amplification was confirmed immediately following PCR amplification using a Bioanalyzer (Agilent Genomics 2200 Tapestation Catalog #G2964-90002, High Sensitivity D1000 ScreenTape Catalog #5067-5584, High Sensitivity D1000 Reagents Catalog #5067-5585). PCR cleanup was then performed as described in Illumina’s protocol using 45 μl of AMPure XP beads. Libraries were then normalized for sequencing using the Illumina KapaBiosystems qPCR kit (KapaBiosystems Reference # 07960336001).
Sequencing
Prepared libraries were pooled at a concentration of 5 nM and mixed with PhiX sequencing control at 5%. Libraries were sequenced on the Illumina HiSeq 4000 at a density of 12 samples per lane.
Bioinformatics
The analysis pipeline used to process sequencing results can be found under FERMI here: http://software.laliggett.com/. For a detailed understanding of each function provided by the analysis pipeline, refer directly to the software. The overall goal of the software built for this project is to analyze amplicon captured DNA that is tagged with equal length UMIs on the 5’ and 3’ ends of captures, and has been paired-end sequenced using dual indexes. Input fastq files are either automatically or manually combined with their paired-end sequencing partners into a single fastq file. Paired reads are combined by eliminating any base that does not match between Read1 and Read2, and concatenating this consensus read with the 5’ and 3’ UMIs. A barcode is then created for each consensus read from the 5’ and 3’ UMIs and the first five bases at the 5’ end of the consensus. All consensus sequences are then binned together by their unique barcodes. The threshold for barcode mismatch can be specified when running the software, and for all data shown in this manuscript one mismatched base was allowed for a sequence to still count as the same barcode. Bins are then collapsed into a single consensus read by first removing the 5’ and 3’ UMIs. Following UMI removal, consensus sequences are derived by incorporating the most commonly observed nucleotide at each position, so long as the same nucleotide is observed in at least a specified percent of supporting reads (55% of reads was used for results in this manuscript) and there are at least some minimum number of reads supporting a capture (5 supporting reads was used for results in this manuscript). Any nucleotide that does not meet the minimum threshold for read support is not added to the consensus read, and alignment is attempted with an unknown base at that position.
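The binning and collapsing steps described above can be illustrated with a minimal sketch. It is not the FERMI source code: the function names, the in-memory tuple layout, the greedy assignment of a read to the first bin within the mismatch limit, and the truncation of a bin to its shortest read are all assumptions made for illustration. Only the thresholds (one barcode mismatch, 5 supporting reads, 55% agreement) come from the text.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatched positions; unequal lengths never match."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))

def bin_reads(reads, max_mismatch=1):
    """Group reads by barcode = 5' UMI + 3' UMI + first five read bases.

    reads: iterable of (umi5, umi3, sequence) tuples, UMIs already removed
    from the sequence. A read joins the first existing bin whose barcode
    is within max_mismatch mismatches (greedy; an assumption).
    """
    bins = {}
    for umi5, umi3, seq in reads:
        barcode = umi5 + umi3 + seq[:5]
        key = next((k for k in bins if hamming(k, barcode) <= max_mismatch), barcode)
        bins.setdefault(key, []).append(seq)
    return bins

def collapse(seqs, min_reads=5, min_fraction=0.55):
    """Collapse one bin into a consensus read, or None if under-supported.

    A position without a base reaching min_fraction support becomes 'N'.
    """
    if len(seqs) < min_reads:
        return None
    length = min(len(s) for s in seqs)  # assumption: truncate to shortest read
    consensus = []
    for i in range(length):
        base, count = Counter(s[i] for s in seqs).most_common(1)[0]
        consensus.append(base if count / len(seqs) >= min_fraction else "N")
    return "".join(consensus)
```

A bin of five reads in which one read carries a private error collapses to the majority sequence, while a bin of four reads is discarded. The linear scan over existing barcodes is quadratic in the number of bins, so a real pipeline handling millions of captures would need an indexed lookup.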
From this set of consensus reads, experimental quality measurements are made, such as total captures, total sequencing reads, average capture coverage, and estimated error rates.
Derived consensus reads are then aligned to the specified reference genome using the Burrows-Wheeler Aligner (58), and indexed using SAMtools (59). For this manuscript consensus reads were aligned to the human reference genome hg19 (60, 61) (though the software should be compatible with other reference genomes). Sequencing alignments are then used to call variants using the Bayesian haplotype-based variant detector, FreeBayes (62). Identified variants are then decomposed and block decomposed using the variant toolset vt (63). Variants are then filtered to eliminate any that have been identified outside of probed genomic regions. If necessary, variants can also be eliminated if below certain coverage or observation thresholds, such that variants must be independently observed multiple times in different captures to be included. For this manuscript, we included all variants that passed previous filters and did not eliminate those that were observed only within a single capture, unless otherwise indicated.
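The final filtering step can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's own code: the tuple layouts for variants and regions, the function name, and the example coordinates are hypothetical, and a real implementation would read these from VCF and probe-design files.

```python
def filter_variants(variants, regions, min_captures=1):
    """Keep variants inside probed regions with enough independent captures.

    variants: list of (chrom, pos, ref, alt, n_captures) tuples (hypothetical layout)
    regions:  list of (chrom, start, end) tuples, 1-based inclusive (assumption)
    min_captures: minimum number of distinct captures supporting a variant;
                  1 keeps variants seen in a single capture, as in the manuscript.
    """
    kept = []
    for chrom, pos, ref, alt, n_captures in variants:
        in_region = any(c == chrom and start <= pos <= end
                        for c, start, end in regions)
        if in_region and n_captures >= min_captures:
            kept.append((chrom, pos, ref, alt, n_captures))
    return kept

# Hypothetical example: one probed 150 bp region and three candidate variants.
regions = [("chr1", 1000, 1149)]
variants = [
    ("chr1", 1010, "C", "T", 4),   # inside the region, four captures
    ("chr1", 1100, "T", "C", 1),   # inside the region, single capture
    ("chr1", 2000, "G", "A", 7),   # outside the probed region
]
print(filter_variants(variants, regions))                  # keeps the first two
print(filter_variants(variants, regions, min_captures=3))  # keeps only the first
```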
Elimination of false positive signal
A number of steps have been included within sample preparation and bioinformatics analysis specifically to distinguish between true positive signal and false positive signal. Using the dilution series shown in Figs. 1c-d, we can show sufficient sensitivity to identify signal diluted to levels as rare as 10^-4. While these dilutions show significantly improved sensitivity over many current sequencing methods, they do not address our background error rate. Unfortunately, because both endogenous and exogenous DNA synthesis is error prone, it is challenging to find negative controls that can be used to estimate background error rates with a method of mutation detection as sensitive as FERMI. Nevertheless, we have a number of steps that should eliminate most sources of false signal. The two largest sources of erroneous mutation when sequencing DNA will typically be from PCR amplification mutations (caused both by polymerase errors and exogenous insults like oxidative damage), and sequencing errors.
The steps are the following:
Elimination of first round PCR amplification errors
Elimination of subsequent PCR amplification errors
Elimination of sequencing errors
Elimination of first round PCR amplification errors
The first round of PCR amplification performed during library preparation causes mutations that are challenging to distinguish from those that occurred endogenously. Since there is little difference between mutations that arise during the first round of PCR amplification and those that occurred endogenously, we rely on probability to eliminate these errors. Because we sequence individually captured alleles, we can ask whether requiring that a mutation be observed in multiple captured alleles before it is called as true positive signal alters the frequency of variants identified. We expect about 400 first round PCR amplification errors, and the probability that the identical error arises independently in multiple captures decreases exponentially with the number of captures required (Fig. S1). By requiring that a mutation be observed in just three captures before it is called as real signal, only about 1-2 first round PCR amplification errors should make it into the final data. In practice, when we process our data requiring from 1 to 5 independent observations of a mutation, the overall mutation spectrum does not change, apart from a loss of the most rarely observed variants. This observation led us to include all variants, even those observed only once.
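The qualitative argument, that coincident errors become rapidly less likely as more independent observations are required, can be sketched with a simple Poisson model. This is an illustrative assumption rather than the calculation behind Fig. S1: it assumes that the roughly 400 errors fall uniformly and independently over every possible substitution in the 4838 bp of probed sequence, so it is not expected to reproduce the estimate of 1-2 surviving errors quoted above.

```python
import math


def poisson_tail(lam: float, k: int) -> float:
    """P(X >= k) for X ~ Poisson(lam)."""
    return 1.0 - sum(
        math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k)
    )


def expected_surviving_errors(n_errors: float, n_possible_changes: int,
                              min_observations: int) -> float:
    """Expected number of distinct erroneous changes seen at least
    `min_observations` times, under the uniform-error assumption."""
    lam = n_errors / n_possible_changes  # mean errors per possible change
    return n_possible_changes * poisson_tail(lam, min_observations)


# Assumption: three alternative bases at each of the 4838 probed positions.
N_CHANGES = 4838 * 3
survivors = {
    k: expected_surviving_errors(400, N_CHANGES, k) for k in range(1, 6)
}
```

Under these assumptions each additional required observation lowers the expected number of surviving errors by roughly two orders of magnitude. The exact values depend strongly on how errors are distributed across sites.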
Elimination of subsequent PCR amplification errors
Elimination of PCR amplification errors after the first round of PCR is done using UMI collapsing (Fig. 1a). Each time a strand is amplified, the UMI will keep track of its identity. Any mutations that occur after the first round of PCR will be found on average in 25% of the reads (or fewer for subsequent rounds). This allows us to collapse each unique capture and eliminate any rarely observed variants (<55%) associated with a given UMI. Utilizing the UMI in this way allows us to essentially eliminate any PCR amplification errors that occurred after the first round of PCR. The method should also eliminate most errors resulting from DNA oxidation in vitro.
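The collapsing step can be sketched as follows. This is a minimal illustration rather than the FERMI implementation: it assumes that the reads sharing a UMI have already been trimmed to the same length and orientation, and it masks with `N` any position where no base reaches the 55% threshold, which is one possible way to handle unsupported positions. The function names and the example UMI are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def consensus_for_umi(reads: List[str], min_fraction: float = 0.55) -> str:
    """Per-position consensus of equal-length reads from one capture.

    A base is kept only if it is present in at least `min_fraction`
    of the reads; otherwise the position is masked with 'N'.
    """
    consensus = []
    for column in zip(*reads):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count / len(reads) >= min_fraction else "N")
    return "".join(consensus)


def collapse_by_umi(tagged_reads: Iterable[Tuple[str, str]],
                    min_fraction: float = 0.55) -> Dict[str, str]:
    """Group (umi, sequence) pairs by UMI and derive one consensus each."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for umi, seq in tagged_reads:
        groups[umi].append(seq)
    return {
        umi: consensus_for_umi(seqs, min_fraction)
        for umi, seqs in groups.items()
    }


# One capture with a late PCR error (G>T at the third base) in 1 of 4 reads.
example = [("AACG", "ACGT"), ("AACG", "ACGT"), ("AACG", "ACGT"), ("AACG", "ACTT")]
collapsed = collapse_by_umi(example)
```

In this example the erroneous T is present in 25% of the reads for the capture, so the consensus retains the G that is supported by the remaining 75%.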
Elimination of sequencing errors
Sequencing errors are eliminated in two ways. The first is the use of paired-end sequencing to read each strand of a DNA fragment (Fig. 1a). The sequences of these reads (Read1 and Read2) should match if no sequencing errors have been made. For an error to escape elimination, it would need to occur at the same position, and change to the same new base, within both Read1 and Read2. Therefore, when the base call differs at a position between Read1 and Read2, that change is eliminated from the final sequence. This comparison should eliminate most sequencing errors, although errors of the same identity occurring at the same position in both reads will escape. The second is the collapsing of reads into single capture bins (Fig. 1a), which should remove these remaining errors. As with the elimination of subsequent PCR amplification errors, most sequences associated with each UMI pair should be identical. Sequencing errors that pass the Read1 and Read2 comparison are therefore very unlikely to match other sequenced strands from the same capture event, and are eliminated during consensus sequence derivation.
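The paired-read comparison can be illustrated with a short sketch. It assumes that Read2 has already been reverse-complemented and trimmed so that both reads cover the same positions, and it masks disagreements with `N`. Both choices are assumptions made for illustration and may differ from how the software removes discordant calls.

```python
def merge_read_pair(read1: str, read2_rc: str, mask: str = "N") -> str:
    """Keep base calls on which Read1 and the reverse-complemented Read2 agree.

    Positions where the two reads disagree are replaced with `mask`, so a
    sequencing error present in only one read cannot reach the final sequence.
    """
    if len(read1) != len(read2_rc):
        raise ValueError("reads must cover the same positions")
    return "".join(a if a == b else mask for a, b in zip(read1, read2_rc))


# A sequencing error in Read2 only (fourth base) is masked.
merged = merge_read_pair("ACGTAC", "ACGAAC")
```

An error that changes the same position to the same base in both reads would pass this comparison unchanged, which is why the subsequent collapse into capture bins is still needed.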
Mutation signature analysis
Twenty somatic mutation signatures were previously identified (43) by analyzing the trinucleotide mutation context of cancer genomes using non-negative matrix factorization (NMF) and principal component analysis (PCA). Here, we used deconstructSigs (64), together with SomaticSignatures (65), to identify the relative presence of those mutation signatures within the somatic mutations detected in blood. Codon triplet biases were partially analyzed using the MutationalPatterns R package (66).
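Signature analysis of this kind starts by assigning every substitution to a trinucleotide context. The sketch below shows the commonly used convention in which each substitution is reported relative to the pyrimidine of the mutated base pair, giving 96 channels (6 substitution types × 16 flanking contexts). It is an illustration in Python, not code from the R packages named above, and the function names are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def trinucleotide_channel(five_prime: str, ref: str, alt: str,
                          three_prime: str) -> str:
    """Return a channel label such as 'A[C>T]G' with a pyrimidine reference.

    Substitutions at a purine (A or G) are converted to the equivalent
    change on the complementary strand.
    """
    if ref in "AG":
        five_prime, ref, three_prime = revcomp(five_prime + ref + three_prime)
        alt = revcomp(alt)
    return f"{five_prime}[{ref}>{alt}]{three_prime}"


# Enumerate every possible substitution in every context.
channels = {
    trinucleotide_channel(five, ref, alt, three)
    for five, ref, alt, three in product("ACGT", repeat=4)
    if ref != alt
}
```

Counting the observed variants per channel yields the 96-element spectrum that signature-fitting tools take as input.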
Estimation of mutation burden
It is difficult to understand the somatic lineage development that gave rise to the number of cells that are assayed from each blood biopsy. Therefore, estimating a somatic mutation rate (per cell division) is challenging. Nevertheless, we can derive estimates of somatic mutation burden.
An upper bound for the somatic mutation burden observed by FERMI analysis can be estimated from the number of captures and the total number of observed variants, assuming that all of the variants are de novo mutations. In our data from Cohort 1, we observe on average 1,232,458 unique captures per analyzed blood sample. These captures are relatively uniformly spread across each of our 32 different probes, which span a total of 4838 bp (about 151 bp per probe). From this, the total probed DNA, $D_T$, can be estimated as:

$$D_T \approx 1{,}232{,}458\ \text{captures} \times \frac{4838\ \text{bp}}{32\ \text{probes}} \approx 1.86 \times 10^{8}\ \text{bp}$$

The total number of observed variants within each blood sample is on average 168,940, from which the aggregate mutation burden, $M$, can be estimated as:

$$M \approx \frac{168{,}940\ \text{variants}}{D_T} \approx 9.1 \times 10^{-4}\ \text{mutations/bp} \approx 900\ \text{mutations/Mb}$$

A lower estimate can be made by assuming that the mutations are not all unique occurrences, but may instead result from clonal expansions that create multiple copies of each unique mutation. This mutation burden, $M$, can be estimated from the approximately 40,000 captures for each of the 32 probes, which captured roughly 6000 variants across a conservative 100 bp capture for each probe (the probe region is realistically smaller than 150 bp because of collapsing conditions), giving $M \approx 6000 / (40{,}000 \times 32 \times 100\ \text{bp}) \approx 4.7 \times 10^{-5}$ mutations/bp, or about 50 mutations/Mb. Given that all variants for which allelic information could be discerned were present on both alleles, we can realistically conclude that each of the ∼3000 base positions queried was mutated at least twice (hence the estimate of 6000 variants).
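The arithmetic behind both bounds can be checked with a few lines of Python. The input values are those quoted in the text, and the function and variable names are illustrative only.

```python
def burden_per_mb(n_variants: float, total_bp: float) -> float:
    """Mutation burden in mutations per megabase of probed DNA."""
    return n_variants / total_bp * 1e6


# Upper bound: every observed variant is treated as a de novo mutation.
captures_per_sample = 1_232_458
n_probes = 32
probed_bp = 4838
observed_variants = 168_940

mean_probe_bp = probed_bp / n_probes                  # about 151 bp per probe
total_probed_bp = captures_per_sample * mean_probe_bp
upper_bound = burden_per_mb(observed_variants, total_probed_bp)

# Lower bound: about 6000 unique variants, 40,000 captures per probe,
# and a conservative 100 bp capture per probe.
lower_total_bp = 40_000 * n_probes * 100
lower_bound = burden_per_mb(6_000, lower_total_bp)

print(f"upper bound: {upper_bound:.0f} mutations/Mb")
print(f"lower bound: {lower_bound:.0f} mutations/Mb")
```

The two values come out at roughly 907 and 47 mutations/Mb, in line with the 50-900 mutations/MB range reported in the abstract.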
Acknowledgments
We would like to thank Ruth Hershberg of Technion University and Jay Hesselberth and Robert Sclafani of the University of Colorado School of Medicine for useful suggestions and for review of the manuscript. These studies were supported by grants from the National Cancer Institute (R01CA180175 to J.D.), NIH/NCATS Colorado CTSI Grant Number UL1TR001082CU (seed grant to J.D.), F31CA196231 (to L.A.L.), the Linda Crnic Institute for Down Syndrome (to J.D. and L.A.L.), and P30-CA072720 (to A.S. and S.D.). The research utilized services of the Cancer Center Genomics Shared Resource, which is supported in part by NIH grant P30-CA46934. L.A.L. and J.D. developed the concept of this project, planned the experiments, analyzed results, and wrote the manuscript. L.A.L. processed and prepared samples from blood biopsy to sequencing, and wrote the bioinformatics software used for analysis. A.S. and S.D. analyzed results, and contributed to writing of the manuscript.