Detection of low-abundance bacterial strains in metagenomic datasets by eigengenome partitioning

Brian Cleary; Ilana Lauren Brito; Katherine Huang; Dirk Gevers; Terrance Shea; Sarah Young; Eric J Alm

doi:10.1038/nbt.3329

Nat Biotechnol. 2015 Oct;33(10):1053-60. doi: 10.1038/nbt.3329. Epub 2015 Sep 14.

Authors

Brian Cleary¹,², Ilana Lauren Brito²,³,⁴, Katherine Huang², Dirk Gevers², Terrance Shea², Sarah Young², Eric J Alm²,³,⁴

Affiliations

¹ Computational and Systems Biology Program, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.
² Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA.
³ Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.
⁴ Center for Microbiome Informatics and Therapeutics, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.

Abstract

Analyses of metagenomic datasets that are sequenced to a depth of billions or trillions of bases can uncover hundreds of microbial genomes, but naive assembly of these data is computationally intensive, requiring hundreds of gigabytes to terabytes of RAM. We present latent strain analysis (LSA), a scalable, de novo pre-assembly method that separates reads into biologically informed partitions and thereby enables assembly of individual genomes. LSA is implemented with a streaming calculation of unobserved variables that we call eigengenomes. Eigengenomes reflect covariance in the abundance of short, fixed-length sequences, or k-mers. As the abundance of each genome in a sample is reflected in the abundance of each k-mer in that genome, eigengenome analysis can be used to partition reads from different genomes. This partitioning can be done in fixed memory using tens of gigabytes of RAM, which makes assembly and downstream analyses of terabytes of data feasible on commodity hardware. Using LSA, we assemble partial and near-complete genomes of bacterial taxa present at relative abundances as low as 0.00001%. We also show that LSA is sensitive enough to separate reads from several strains of the same species.
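The covariance idea in the abstract can be illustrated with a toy sketch. This is not the LSA implementation: it replaces the streaming eigengenome calculation with a plain cosine-similarity grouping of k-mer abundance profiles, and every name in it (`kmer_counts`, `abundance_profiles`, `partition_kmers`, the `threshold` parameter, the two made-up "genomes") is a hypothetical chosen for illustration. It only shows why k-mers from the same genome tend to end up in the same partition: their counts rise and fall together across samples.

```python
from collections import Counter
from math import sqrt


def kmer_counts(reads, k):
    """Count every substring of length k across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_profiles(samples, k):
    """Map each k-mer to its vector of counts, one entry per sample."""
    per_sample = [kmer_counts(reads, k) for reads in samples]
    kmers = set().union(*per_sample)
    return {km: [c[km] for c in per_sample] for km in kmers}


def cosine(u, v):
    """Cosine similarity between two abundance vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def partition_kmers(profiles, threshold=0.95):
    """Greedily group k-mers whose abundance profiles are nearly parallel.

    Assumption: a simple stand-in for eigengenome-based partitioning,
    not the method described in the paper.
    """
    clusters = []  # each entry: (representative profile, set of k-mers)
    for km in sorted(profiles):
        for rep, members in clusters:
            if cosine(rep, profiles[km]) >= threshold:
                members.add(km)
                break
        else:
            clusters.append((profiles[km], {km}))
    return [members for _, members in clusters]


# Two made-up "genomes" with no shared 4-mers, mixed at different
# abundances in three samples (whole sequences stand in for reads).
GENOME_A = "ACGTACGGTCA"
GENOME_B = "TTGACCATGGA"
samples = [
    [GENOME_A] * 3 + [GENOME_B] * 1,
    [GENOME_A] * 1 + [GENOME_B] * 4,
    [GENOME_A] * 2,
]

profiles = abundance_profiles(samples, k=4)
partitions = partition_kmers(profiles)
for members in partitions:
    print(sorted(members))
```

In this toy input every k-mer of a given genome has an identical abundance profile, so the grouping is trivial. Real data are noisy and far larger, which is the situation the fixed-memory, streaming approach described in the abstract is meant to handle.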

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Bacteria / classification
Bacteria / genetics*
Chromosome Mapping / methods
Databases, Genetic
Datasets as Topic
Epigenesis, Genetic / genetics*
Genome, Bacterial / genetics*
Metagenomics / methods*
Microbiota / genetics*
Sequence Analysis, DNA / methods*
Species Specificity
