Summary
Mutational signatures are patterns in the occurrence of somatic single nucleotide variants (SNVs) that can reflect underlying mutational processes. The SomaticSignatures package provides flexible, interoperable, and easy-to-use tools that identify such signatures in cancer sequencing studies. It facilitates large-scale, cross-dataset estimation of mutational signatures, implements existing methods for pattern decomposition, supports extension through user-defined methods and integrates with Bioconductor workflows.
The R package SomaticSignatures is available as part of the Bioconductor project (R Core Team, 2014; Gentleman et al., 2004). Its documentation provides additional details on the methodology and demonstrates applications to biological datasets.
1 Introduction
Mutational signatures link observed somatic single nucleotide variants to mutation generating processes (Alexandrov et al., 2013a). The identification of these signatures offers insights into the evolution, heterogeneity and developmental mechanisms of cancer (Alexandrov et al., 2013b; Nik-Zainal et al., 2012).
Existing implementations (Fischer et al., 2013; Nik-Zainal et al., 2012) are standalone packages with specialized functionality. Their reliance on non-standard data input and output formats limits integration into common workflows.
The SomaticSignatures package aims to encourage wider adoption of somatic signatures in tumor genome analysis by providing an accessible R implementation that supports multiple statistical approaches, scales to large datasets, and closely interacts with the data structures and tools of Bioconductor.
2 Approach
To detect the extent of sequence-specific effects contributing to the set of observed somatic variants, the SNVs are analyzed with regard to their immediate sequence context, the flanking 3′ and 5′ bases (Alexandrov et al., 2013a). This can capture characteristics of mutational mechanisms as well as technical biases (Nakamura et al., 2011). As an example, the mutation of A to G in the sequence TAC defines the mutational motif T[A>G]C. The frequencies of the 96 possible motifs (6 substitution types, counting reverse-complementary changes once, each with 16 combinations of flanking bases) across all samples define the mutational spectrum. It is represented by the matrix Mij, with i enumerating the motifs and j the samples.
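As a minimal sketch of how the 96 motifs arise, the following base R code enumerates them and maps a single SNV with its flanking bases onto a motif. It assumes the common convention of reporting each substitution relative to the pyrimidine (C or T) of the reference base pair, so that the A>G change in TAC is recorded on the opposite strand as G[T>C]A. The helper `make_motif` is hypothetical and not part of the SomaticSignatures interface.

```r
bases <- c("A", "C", "G", "T")

# Assumption: substitutions are folded onto a pyrimidine reference base,
# which leaves six substitution classes.
subs <- c("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")

# All combinations of 5' base, substitution class, and 3' base.
grid <- expand.grid(five = bases, sub = subs, three = bases,
                    stringsAsFactors = FALSE)
motifs <- sprintf("%s[%s]%s", grid$five, grid$sub, grid$three)
length(motifs)  # 96

complement <- c(A = "T", C = "G", G = "C", T = "A")

# Hypothetical helper: build the motif for one SNV from its 5' base,
# reference base, alternative base, and 3' base.
make_motif <- function(five, ref, alt, three) {
  if (ref %in% c("A", "G")) {
    # Purine reference: switch to the reverse-complementary strand.
    old_five <- five
    five  <- complement[[three]]
    three <- complement[[old_five]]
    ref   <- complement[[ref]]
    alt   <- complement[[alt]]
  }
  sprintf("%s[%s>%s]%s", five, ref, alt, three)
}

make_motif("T", "A", "G", "C")  # "G[T>C]A"
```

Under this convention every observed SNV maps onto exactly one of the 96 enumerated motifs, which is what allows the spectrum to be stored as a matrix with a fixed number of rows.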
The observed mutational spectrum can be interpreted by decomposing M into two matrices of smaller size, W and H,

$$M_{ij} = \sum_{k=1}^{R} W_{ik} H_{kj} + \varepsilon_{ij} \tag{1}$$

where the number of signatures R is typically small compared to the number of samples, and the elements of the residual matrix ε are minimized, such that W H is a useful approximation of the data. The columns of W describe the composition of a signature: Wik is the relative frequency of somatic motif i in the k-th signature. In addition, the rows of H indicate the contribution of each signature across the samples: Hkj is the contribution of the k-th signature to sample j.
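To make the shapes of the matrices concrete, the following base R sketch simulates a mutational spectrum from known signatures. All numbers are made up for illustration, and drawing the counts from a Poisson distribution is an assumption of this example, not something the model above prescribes.

```r
set.seed(1)
n_motifs  <- 96
n_samples <- 20
R         <- 3   # number of signatures

# W: one column per signature, holding relative motif frequencies (sum to 1).
W <- matrix(runif(n_motifs * R), nrow = n_motifs, ncol = R)
W <- sweep(W, 2, colSums(W), "/")

# H: entry [k, j] is the contribution of signature k to sample j.
H <- matrix(rexp(R * n_samples, rate = 1 / 100), nrow = R, ncol = n_samples)

# Observed counts scatter around the expected spectrum W %*% H
# (illustrative assumption: Poisson noise).
expected <- W %*% H
M <- matrix(rpois(length(expected), lambda = expected), nrow = n_motifs)

# Residual matrix of the decomposition.
eps <- M - expected

dim(W)  # motifs x signatures
dim(H)  # signatures x samples
dim(M)  # motifs x samples
```

In practice only M is observed, and W and H have to be estimated from it.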
3 Methods
Several approaches exist for the decomposition (Eq. 1) that differ in their constraints and computational complexity. In principal component analysis (PCA), for a given R, W and H are chosen such that the norm of the residual matrix ε is minimal and the columns of W are orthonormal. Non-negative matrix factorization (NMF) (Brunet et al., 2004) is motivated by the fact that the mutational spectrum fulfills Mij ≥ 0, and imposes the same requirement on the elements of W and H. Different NMF and PCA algorithms allow additional constraints on the results, such as sparsity. With unsupervised clustering, the elements of H are either 0 or 1, and each column of H contains exactly one entry of 1, so that every sample is assigned to a single cluster. In other words, the columns of W are the cluster representatives and H is the cluster membership matrix.
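The three families of constraints can be compared on a toy matrix with base R alone. The sketch below is not the package's implementation: it uses an uncentered truncated SVD as a stand-in for PCA, the multiplicative updates of Lee and Seung for NMF, and k-means for clustering. The function names `pca_decomp`, `nmf_decomp`, and `cluster_decomp` are hypothetical.

```r
set.seed(42)
M <- matrix(rpois(96 * 20, lambda = 5), nrow = 96)  # toy spectrum: 96 motifs x 20 samples
R <- 3

# Frobenius norm of the residual M - W H.
frob <- function(M, fit) sqrt(sum((M - fit$W %*% fit$H)^2))

# PCA-like: truncated SVD, columns of W are orthonormal.
pca_decomp <- function(M, R) {
  s <- svd(M, nu = R, nv = R)
  list(W = s$u, H = diag(s$d[seq_len(R)], R, R) %*% t(s$v))
}

# NMF: multiplicative updates keep W and H non-negative.
nmf_decomp <- function(M, R, n_iter = 200, tiny = 1e-9) {
  W <- matrix(runif(nrow(M) * R), nrow(M), R)
  H <- matrix(runif(R * ncol(M)), R, ncol(M))
  for (it in seq_len(n_iter)) {
    H <- H * (t(W) %*% M) / (t(W) %*% W %*% H + tiny)
    W <- W * (M %*% t(H)) / (W %*% H %*% t(H) + tiny)
  }
  list(W = W, H = H)
}

# Clustering: W holds the cluster centers, H is a 0/1 membership matrix.
cluster_decomp <- function(M, R) {
  km <- kmeans(t(M), centers = R)
  H <- matrix(0, nrow = R, ncol = ncol(M))
  H[cbind(km$cluster, seq_len(ncol(M)))] <- 1
  list(W = t(km$centers), H = H)
}

fits <- list(pca = pca_decomp(M, R),
             nmf = nmf_decomp(M, R),
             cluster = cluster_decomp(M, R))
sapply(fits, function(fit) frob(M, fit))
```

Because the truncated SVD gives the best rank-R approximation in the Frobenius norm, its residual is never larger than that of the two constrained decompositions. What NMF and clustering offer in return is a W and H that are easier to interpret as signatures and sample assignments.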
4 Results
SomaticSignatures is a flexible and efficient tool for inferring characteristics of mutational mechanisms. It integrates with the Bioconductor framework and its tools for importing, processing, and annotating genomic variants. An analysis starts with a set of SNV calls, typically imported from a VCF file and represented as a VRanges object (Obenchain et al., 2014). Since the original calls do not contain information about the sequence context, we construct the mutational motifs first, based on the reference genome.
ctx = mutationContext(VRanges, ReferenceGenome)
Subsequently, we construct the mutational spectrum M. By default, its columns are defined by the samples in the data. Alternatively, users can specify a grouping covariate, for example drug response or tumor type.
m = motifMatrix(ctx, group)
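The effect of the grouping covariate can be illustrated without the package. The base R sketch below tabulates motifs per group from a plain data frame. The data, the column names, and the helper `motif_matrix` are hypothetical and stand in for the VRanges-based workflow only conceptually.

```r
# Hypothetical SNV calls, already annotated with their motif.
calls <- data.frame(
  motif  = c("A[C>T]G", "A[C>T]G", "T[C>A]A", "G[T>C]A", "A[C>T]G", "T[C>A]A"),
  sample = c("s1", "s1", "s1", "s2", "s3", "s3"),
  tumor  = c("kidney", "kidney", "kidney", "lung", "lung", "lung"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count motifs per level of `group`, optionally
# converting each column to relative frequencies.
# Only motifs observed in the data appear as rows here; a full spectrum
# would carry all 96 motifs.
motif_matrix <- function(calls, group = "sample", normalize = TRUE) {
  counts <- table(calls$motif, calls[[group]])
  m <- matrix(as.vector(counts), nrow = nrow(counts),
              dimnames = dimnames(counts))
  if (normalize) m <- sweep(m, 2, colSums(m), "/")
  m
}

m_sample <- motif_matrix(calls)                              # one column per sample
m_tumor  <- motif_matrix(calls, "tumor", normalize = FALSE)  # one column per tumor type
```

Grouping by a covariate pools the calls of all samples that share a level, which reduces the number of columns and stabilizes the estimated frequencies when individual samples carry few mutations.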
Mutational signatures and their contribution to each sample’s mutational spectrum are estimated with a chosen decomposition method for a defined number of signatures. We provide implementations for NMF and PCA, and users can specify their own functions that implement alternative decomposition methods.
sigs = identifySignatures(m, nSig, method)
The user interface and library of plotting functions facilitate subsequent analysis and presentation of results (Fig. 1). Accounting for technical biases is often essential, particularly when analyzing across multiple datasets. For this purpose, we provide methods to normalize for the background distribution of sequence motifs, and demonstrate how to identify batch effects.
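One plausible way to account for the background distribution of sequence motifs is to divide each motif frequency by how often its sequence context occurs in the sequenced regions and then rescale each sample. The base R sketch below illustrates this idea with made-up background frequencies. The helper `normalize_background` is hypothetical, and the package's own normalization may differ in detail.

```r
# Hypothetical helper: reweight motif frequencies by a background
# frequency per motif, then rescale every column to sum to 1.
normalize_background <- function(m, background) {
  stopifnot(identical(rownames(m), names(background)))
  scaled <- m / background  # element [i, j] is divided by background[i]
  sweep(scaled, 2, colSums(scaled), "/")
}

# Toy spectrum with two motifs and two samples (relative frequencies).
m <- matrix(c(0.5, 0.5,
              0.2, 0.8),
            nrow = 2,
            dimnames = list(c("A[C>T]G", "T[C>T]A"), c("s1", "s2")))

# Made-up background: the first context is four times rarer than the second.
background <- c("A[C>T]G" = 0.2, "T[C>T]A" = 0.8)

m_norm <- normalize_background(m, background)
m_norm
```

In this toy example sample s2 merely mirrors the background and becomes flat after normalization, whereas sample s1 shows an enrichment of the rarer context. This is the kind of difference that matters when comparing datasets with different target regions, such as exomes and whole genomes.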
In the documentation of the software, we illustrate a use case by analyzing 653,304 somatic SNV calls from 2,437 TCGA whole-exome sequenced samples (Gehring, 2014). The analysis, including NMF, PCA and hierarchical clustering, completes within minutes on a standard desktop computer. The different approaches yield a consistent and reproducible grouping of the cancer types according to the estimated signatures (Fig. 1).
We applied this approach to the characterization of kidney cancer and showed that classification of subtypes according to mutational signatures is consistent with classification based on RNA expression profiles and mutation rates (Durinck et al., 2014).
Acknowledgment
We thank Leonard Goldstein and Oleg Mayba for their insights and suggestions.
Funding
This work was supported by the European Molecular Biology Laboratory, the NSF award “BIGDATA: Mid-Scale: DA: ESCE: Collaborative Research: Scalable Statistical Computing for Emerging Omics Data Streams” and Genentech Inc.
Footnotes
julian.gehring{at}embl.de, whuber{at}embl.de