Abstract
Identification of co-expressed genes within a given experimental or biological context can provide evidence for genetic or physical interactions between genes. Thus, detection of co-expression has become a routine step in large-scale analyses of gene expression data. In this work, we show that application of the most commonly used methods to identify co-expressed gene clusters produces results that do not match the biological expectations of co-expressed gene clusters. Specifically, clusters generated using these methods are not discrete and can contain up to 50% unreliably assigned genes. Consequently, downstream analyses on these clusters, such as functional term enrichment analysis, suffer from high error rates. We present clust, an automated method that solves this problem by extracting clusters from gene expression datasets that match the biological expectations of co-expressed genes. Using 100 gene expression datasets from five model organisms, we demonstrate that the statistical properties of clusters generated by clust are better than those produced by other methods. We further show that this improvement results in a concomitant improvement in the detection of enriched functional terms.