mixOmics: an R package for ‘omics feature selection and multiple data integration

Florian Rohart; Benoît Gautier; Amrit Singh; Kim-Anh Lê Cao

doi:10.1101/108597

Abstract

The advent of high-throughput technologies has led to a wealth of publicly available biological data coming from different sources, the so-called ‘omics data (transcriptomics for the study of transcripts, proteomics for proteins, metabolomics for metabolites, etc.). Combining such large-scale biological data sets can lead to the discovery of important biological insights, provided that relevant information can be extracted in a holistic manner. Current statistical approaches have focused on identifying small subsets of molecules (a ‘molecular signature’) that explain or predict biological conditions, but mainly for the analysis of a single data set. In addition, commonly used methods are univariate and consider each biological feature independently. In contrast, linear multivariate methods adopt a systems biology approach by statistically integrating several data sets at once, and offer an unprecedented opportunity to probe relationships between heterogeneous data sets measured at multiple functional levels.

mixOmics is an R package that provides a wide range of linear multivariate methods for data exploration, integration, dimension reduction and visualisation of biological data sets. The methods we have developed extend Projection to Latent Structure (PLS) models for discriminant analysis and data integration, and include ℓ₁ penalisations to identify molecular signatures. Here we introduce the mixOmics methods specifically developed to integrate large data sets, either at the N-level, where the same individuals are profiled using different ‘omics platforms (same N), or at the P-level, where independent studies including different individuals are generated under similar biological conditions using the same ‘omics platform (same P). In both cases, the main challenge is data heterogeneity, due either to inherent platform-specific artefacts (N-integration) or to systematic differences arising from experiments assayed at different geographical sites or at different times (P-integration). We present and illustrate these novel multivariate methods on existing ‘omics data available from the package.
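The distinction between N-integration and P-integration comes down to which dimension the data sets share. The short sketch below illustrates that idea in plain Python. It is not the mixOmics API, and the function name `integration_type` and the toy sample and feature identifiers are hypothetical, introduced only for illustration.

```python
# Illustrative only: plain Python, not the mixOmics API.
# Each data set is described by its sample IDs (rows) and feature IDs (columns).

def integration_type(datasets):
    """Classify a collection of data sets as 'N', 'P', or None.

    datasets: list of (sample_ids, feature_ids) tuples.
    'N'  -> every data set profiles the same individuals (same N).
    'P'  -> every data set measures the same features (same P).
    None -> neither pattern (or both) applies.
    """
    sample_sets = [frozenset(samples) for samples, _ in datasets]
    feature_sets = [frozenset(features) for _, features in datasets]
    same_samples = len(set(sample_sets)) == 1
    same_features = len(set(feature_sets)) == 1
    if same_samples and not same_features:
        return "N"
    if same_features and not same_samples:
        return "P"
    return None


# Same three individuals profiled on two platforms (hypothetical IDs).
n_level = [
    (["s1", "s2", "s3"], ["geneA", "geneB"]),
    (["s1", "s2", "s3"], ["prot1", "prot2", "prot3"]),
]

# Two independent studies measuring the same features on different individuals.
p_level = [
    (["a1", "a2"], ["geneA", "geneB"]),
    (["b1", "b2", "b3"], ["geneA", "geneB"]),
]

print(integration_type(n_level))
print(integration_type(p_level))
```

In the first case the data sets can be aligned row by row on the shared individuals, while in the second they can only be stacked along the shared features, which is why the two settings call for different integrative methods.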

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.