Abstract
The two strongest predictors of a human cancer’s clinical behaviour are the primary tumour’s anatomic organ of origin and its histopathology. However, in roughly 3% of cases a cancer presents with metastatic disease and no primary tumour can be identified, even after a thorough radiological survey. A related dilemma arises when cytological sampling of a radiologically defined mass yields cancerous cells, but the cytologist cannot distinguish a primary tumour from a metastasis originating elsewhere.
Here we use whole-genome sequencing (WGS) data from the ICGC/TCGA PanCancer Analysis of Whole Genomes (PCAWG) project to develop a machine learning classifier that accurately distinguishes among 23 major cancer types using information derived from somatic mutations alone. This result demonstrates the feasibility of automated cancer type discrimination based on next-generation sequencing of clinical samples. It also opens the possibility of determining the origin of tumours detected by the emerging technology of deep sequencing of circulating cell-free DNA in blood plasma.