Abstract
Marine microeukaryotes express large and complex transcriptomes that often respond dynamically to environmental and physiological conditions. In parallel with developments in human disease research, an opportunity exists to employ transcriptomic features as “biomarkers” to understand and predict cellular and environmental states. Here, the prediction and classification of basic physiological and environmental states, including light, growth phase and inorganic carbon status, were explored for the model diatom Thalassiosira pseudonana using publicly available data including 56 microarray and 316 mRNA-seq samples. Simple “machine learning” methods combined with integrative bootstrapped clustering were able to detect, recapitulate and expand biologically and environmentally relevant signals evident across hundreds of samples collected and processed independently by multiple laboratories. Agnostic, integrative and empirical “data-driven” approaches are likely applicable to modern questions in new environmental and experimental contexts.