MSAC: Compression of multiple sequence alignment files

Sebastian Deorowicz; Joanna Walczyszyn; Agnieszka Debudaj-Grabysz

doi:10.1101/240341

Abstract

Motivation Bioinformatics databases grow rapidly and have reached sizes that were hard to imagine a decade ago. Among the many bioinformatics processes that generate hundreds of gigabytes of data is the multiple sequence alignment of protein families. The largest database of such alignments, Pfam, occupies 40–230 GB, depending on the variant. Storing and transferring such massive data has become a challenge.

Results We propose a novel compression algorithm, MSAC (Multiple Sequence Alignment Compressor), designed specifically for aligned data. It is based on a generalisation of the positional Burrows–Wheeler transform to non-binary alphabets. MSAC handles both FASTA and Stockholm files. It offers up to six times better compression ratios than gzip, a commonly used compressor. Our experiments also yielded an analysis of how protein family size influences the compression ratio.
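To illustrate the idea behind a positional Burrows–Wheeler transform over a non-binary alphabet, the following minimal sketch may help. It is not MSAC's implementation: the function names `pbwt_columns` and `inverse_pbwt_columns` are hypothetical, and the sketch assumes that all rows have equal length and that the binary "zeros first, then ones" partition is replaced by a stable sort on the column symbol. Each output column lists the symbols of one alignment column, with rows ordered by their reversed prefixes, which tends to group identical symbols into runs.

```python
def pbwt_columns(rows):
    """Permute each alignment column by the reversed-prefix order of the rows.

    Illustrative sketch only; assumes all rows have the same length.
    """
    if not rows:
        return []
    width = len(rows[0])
    order = list(range(len(rows)))  # row order before the first column
    columns = []
    for k in range(width):
        # Emit column k with rows visited in the current order.
        columns.append("".join(rows[i][k] for i in order))
        # Stable sort by the symbol at column k: the non-binary analogue
        # of the binary partition used in the original positional BWT.
        order = sorted(order, key=lambda i: rows[i][k])
    return columns


def inverse_pbwt_columns(columns, n_rows):
    """Rebuild the rows from the permuted columns (inverse of pbwt_columns)."""
    rows = [[] for _ in range(n_rows)]
    order = list(range(n_rows))
    for col in columns:
        # The j-th symbol of the column belongs to row order[j].
        for j, symbol in enumerate(col):
            rows[order[j]].append(symbol)
        # Replay the same stable sort, this time keyed on the column itself.
        order = [order[j] for j in sorted(range(n_rows), key=lambda j: col[j])]
    return ["".join(r) for r in rows]


alignment = ["AC-G", "AC-G", "TCAG", "A-AG"]
transformed = pbwt_columns(alignment)
restored = inverse_pbwt_columns(transformed, len(alignment))
```

The transform is reversible because the row order for each column depends only on earlier columns, so a decoder can replay the same stable sorts. In a compressor, the permuted columns would then be passed to later modelling and entropy-coding stages, which are outside the scope of this sketch.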

Availability MSAC is available for free at https://github.com/refresh-bio/msac and http://sun.aei.polsl.pl/REFRESH/msac.

Contact sebastian.deorowicz{at}polsl.pl

Supplementary material Supplementary data are available at the publisher's website.