Abstract
Accurately identifying large repeat expansions, including those that cause amyotrophic lateral sclerosis (ALS) and Fragile X syndrome, is challenging with short-read (100–150 bp) whole genome sequencing (WGS) data. Solving this problem is an important step towards integrating WGS into precision medicine. We have developed a research tool called ExpansionHunter that, using PCR-free WGS data, can identify repeat expansions at a locus of interest even when the expansion is longer than the read length. We applied our algorithm to WGS data from 3,001 ALS patients who had been tested for the presence of the C9orf72 repeat expansion with repeat-primed PCR (RP-PCR). Southern blot and fragment length analysis were applied to a subset of samples to confirm the presence or absence of the repeat expansion. Compared to the RP-PCR results, our WGS-based method identified pathogenic repeat expansions (>30 GGCCCC repeats) with 98.1% sensitivity and 99.7% specificity. Further inspection showed that 11 of the 12 conflicting calls were errors in the original RP-PCR results. Compared against this updated result, ExpansionHunter correctly classified 99.5% (212/213) of the expanded samples and all (2,788/2,788) of the wild type samples. The targeted repeat expansion caller we describe here marks a significant step towards a single whole genome medical test that includes detection of other pathogenic repeat expansions in WGS.
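As an illustration of how the classification figures against the updated results follow from the reported counts, the arithmetic can be sketched as below. This is a minimal sketch and not part of ExpansionHunter: the function and variable names are hypothetical, and the counts (212 of 213 expanded samples, 2,788 of 2,788 wild type samples) are taken directly from the text above.

```python
# Illustrative only: these helpers are hypothetical, not part of ExpansionHunter.

def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of truly expanded samples that were called expanded."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of truly wild type samples that were called wild type."""
    return true_neg / (true_neg + false_pos)


# Counts reported against the updated results.
expanded_total = 213      # samples carrying the expansion
expanded_correct = 212    # of those, correctly classified
wild_type_total = 2788    # samples without the expansion
wild_type_correct = 2788  # of those, correctly classified

sens = sensitivity(expanded_correct, expanded_total - expanded_correct)
spec = specificity(wild_type_correct, wild_type_total - wild_type_correct)

print(f"sensitivity: {100 * sens:.1f}%")
print(f"specificity: {100 * spec:.1f}%")
print(f"total samples: {expanded_total + wild_type_total}")
```

The two groups sum to the 3,001 patients in the cohort, which serves as a quick consistency check on the reported counts.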