Reducing reference bias using multiple population reference genomes

Nae-Chyun Chen; Brad Solomon; Taher Mun; Sheila Iyer; Ben Langmead

doi:10.1101/2020.03.03.975219

Abstract

Most sequencing data analyses start by aligning sequencing reads to a linear reference genome. However, failing to account for genetic variation causes reference bias and confounds downstream results. Other approaches replace the linear reference with structures such as graphs that can include genetic variation, but these incur major computational overhead. We propose the “reference flow” alignment method, which uses multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow achieves a similar level of accuracy and bias avoidance, but with 14% of the memory footprint and 5.5 times the speed.

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

We develop an additional "RandFlow-LD-26" approach that achieves higher accuracy with a reasonable increase in computational resources. We stratify the real whole-genome data analysis by GIAB confidence level. We evaluate the bias-reduction performance of variant-aware approaches in repetitive regions. We show that the proposed randomized approaches do not suffer from major variability due to randomness. We provide additional measures for our mapping correctness analysis. We include the unlocalized/random contigs in our major-allele and reference-flow references and update the pre-built indexes.

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.