RT Journal Article
SR Electronic
T1 Alignment by the numbers: sequence assembly using reduced dimensionality numerical representations
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 011940
DO 10.1101/011940
A1 Avraam Tapinos
A1 Bede Constantinides
A1 David L Robertson
YR 2014
UL http://biorxiv.org/content/early/2014/11/28/011940.abstract
AB DNA sequencing instruments are enabling genomic analyses of unprecedented scope and scale, widening the gap between our abilities to generate and interpret sequence data. Established methods for computational sequence analysis generally consider the nucleotide-level resolution of sequences, and while these approaches are sufficiently accurate, increasingly ambitious and data-intensive analyses are rendering them impractical for demanding applications such as genome and metagenome assembly. Comparable analytical challenges are encountered in other data-intensive fields involving sequential data such as signal processing and time series analysis. By representing nucleic acid composition numerically it is possible to apply dimensionality reduction methods from these fields to sequences of nucleotides, enabling their approximate representation. To explore the applicability of signal decomposition methods in sequence assembly, we implemented a short read aligner and evaluated its performance against simulated high diversity viral sequences alongside four existing aligners. Using our prototype implementation, approximate sequence representations reduced overall alignment time by up to 14-fold compared to that of uncompressed sequences, and without any reduction in alignment accuracy. Despite using heavily approximated sequence representations, our implementation yielded alignments of similar overall accuracy to existing aligners, outperforming all other tools tested at high levels of sequence variation. Our approach was also applied to the de novo assembly of a simulated diverse viral population. We have demonstrated that full sequence resolution is not a prerequisite of accurate sequence alignment and that analytical performance may be retained or even enhanced through appropriate dimensionality reduction of sequences.