PT  - JOURNAL ARTICLE
AU  - Chris Wymant
AU  - François Blanquart
AU  - Astrid Gall
AU  - Margreet Bakker
AU  - Daniela Bezemer
AU  - Nicholas J. Croucher
AU  - Tanya Golubchik
AU  - Matthew Hall
AU  - Mariska Hillebregt
AU  - Swee Hoe Ong
AU  - Jan Albert
AU  - Norbert Bannert
AU  - Jacques Fellay
AU  - Katrien Fransen
AU  - Annabelle Gourlay
AU  - M. Kate Grabowski
AU  - Barbara Gunsenheimer-Bartmeyer
AU  - Huldrych F. Günthard
AU  - Pia Kivelä
AU  - Roger Kouyos
AU  - Oliver Laeyendecker
AU  - Kirsi Liitsola
AU  - Laurence Meyer
AU  - Kholoud Porter
AU  - Matti Ristola
AU  - Ard van Sighem
AU  - Guido Vanham
AU  - Ben Berkhout
AU  - Marion Cornelissen
AU  - Paul Kellam
AU  - Peter Reiss
AU  - Christophe Fraser
AU  - The BEEHIVE Collaboration
TI  - Easy and Accurate Reconstruction of Whole HIV Genomes from Short-Read Sequence Data
AID  - 10.1101/092916
DP  - 2016 Jan 01
TA  - bioRxiv
PG  - 092916
4099  - http://biorxiv.org/content/early/2016/12/13/092916.short
4100  - http://biorxiv.org/content/early/2016/12/13/092916.full
AB  - Next-generation sequencing has yet to be widely adopted for HIV. The difficulty of accurately reconstructing the consensus sequence of a quasispecies from reads (short fragments of DNA) in the presence of rapid between- and within-host evolution may have presented a barrier. In particular, mapping (aligning) reads to a reference sequence leads to biased loss of information; this bias can distort epidemiological and evolutionary conclusions. De novo assembly avoids this bias by effectively aligning the reads to themselves, producing a set of sequences called contigs. However, contigs provide only a partial summary of the reads, misassembly may result in their having an incorrect structure, and no information is available at parts of the genome where contigs could not be assembled. To address these problems we developed the tool shiver to preprocess reads for quality and contamination, then map them to a reference tailored to the sample using corrected contigs supplemented with existing reference sequences. Run with two commands per sample, it can easily be used for large heterogeneous data sets. We use shiver to reconstruct the consensus sequence and minority variant information from paired-end short-read data produced with the Illumina platform, for 65 existing publicly available samples and 50 new samples. We show the systematic superiority of mapping to shiver’s constructed reference over mapping the same reads to the standard reference HXB2: an average of 29 bases per sample are called differently, of which 98.5% are supported by higher coverage. We also provide a practical guide to working with imperfect contigs.