Abstract
Motivation: Pangenome graphs representing aligned genome assemblies are being shared in the text-based Graphical Fragment Assembly format. As the number of assemblies grows, there is a need for a file format that can store the highly repetitive data space-efficiently.
Results: We propose the GBZ file format based on data structures used in the Giraffe short read aligner. The format provides good compression, and the files can be efficiently loaded into in-memory data structures. We provide compression and decompression tools and libraries for using GBZ graphs, and we show that they can be efficiently used on a variety of systems.
Availability: C++ and Rust implementations are available at https://github.com/jltsiren/gbwtgraph and https://github.com/jltsiren/gbwt-rs, respectively.
Contact: jouni.siren{at}iki.fi
Supplementary information: Supplementary data are available online.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
Compression / decompression results with the 1000GP graph. Text clarifications. A figure giving an overview of the file format. Scaling with the number of compression / decompression jobs (in the supplement).