PT - JOURNAL ARTICLE
AU - Samuel M. Nicholls
AU - Wayne Aubrey
AU - Kurt de Grave
AU - Leander Schietgat
AU - Christopher J. Creevey
AU - Amanda Clare
TI - Probabilistic recovery of cryptic haplotypes from metagenomic data
AID - 10.1101/117838
DP - 2017 Jan 01
TA - bioRxiv
PG - 117838
4099 - http://biorxiv.org/content/early/2017/03/17/117838.short
4100 - http://biorxiv.org/content/early/2017/03/17/117838.full
AB - The cryptic diversity of microbial communities represents an untapped biotechnological resource for biomining, biorefining and synthetic biology. Revealing this information requires the recovery of the exact sequence of DNA bases (or “haplotype”) that constitutes the genes and genomes of every individual present. This is a computationally difficult problem, complicated by the requirement for environmental sequencing approaches (metagenomics) due to the resistance of the constituent organisms to culturing in vitro. Haplotypes are identified by their unique combination of DNA variants. However, standard approaches for working with metagenomic data require simplifications that violate assumptions in the process of identifying such variation. Furthermore, current haplotyping methods lack objective mechanisms for choosing between alternative haplotype reconstructions from microbial communities. To address this, we have developed a novel probabilistic approach for reconstructing haplotypes from complex microbial communities and propose the “metahaplome” as a definition for the set of haplotypes for any particular genomic region of interest within a metagenomic dataset. Implemented in the twin software tools Hansel and Gretel, the algorithm performs incremental probabilistic haplotype recovery using Naive Bayes, an efficient and effective technique. Our approach is capable of reconstructing the haplotypes with the highest likelihoods from metagenomic datasets without a priori knowledge of, or assumptions about, the distribution or number of variants.
Additionally, the algorithm is robust to sequencing and alignment error without altering or discarding observed variation and uses all available evidence from aligned reads. We validate our approach using synthetic metahaplomes constructed from sets of real genes, and demonstrate its capability using metagenomic data from a complex HIV-1 strain mix. The results show that the likelihood framework can allow the recovery of cryptic functional isoforms of genes from microbial communities with 100% accuracy.