Ancestry Inference Using Reference Labeled Clusters of Haplotypes

Keith Noto; Yong Wang; Shiya Song; Joshua G. Schraiber; Alisa Sedghifar; Jake K. Byrnes; David A. Turissini; Eurie L. Hong; Catherine A. Ball

doi:10.1101/2020.09.23.310698

Abstract

We present ARCHes, a fast and accurate haplotype-based approach for inferring an individual’s ancestry composition. Our approach models haplotype diversity in a large, admixed cohort of hundreds of thousands of individuals, then annotates those models with population information from reference panels of known ancestry. Because training and testing are separate processes, the running time of ARCHes does not depend on the size of the reference panel, and the inferred population-annotated haplotype models can be written to disk and used to label large test sets in parallel. In our experiments, assigning ancestry from 32 populations to 1,001 sections of a genotype takes less than one minute on average using 10 CPUs. We test ARCHes on public data from the 1000 Genomes Project and HGDP, as well as on simulated examples of known admixture. Our results demonstrate that ARCHes outperforms RFMix at correctly assigning both global and local ancestry at regional levels, regardless of the amount of population admixture.
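The train-then-label separation described above can be illustrated with a toy sketch. This is not the ARCHes implementation: the function names, the window/cluster identifiers, the JSON round trip, and the simple majority rule are all assumptions made for illustration. The abstract does not specify how haplotype clusters are built or how labels are smoothed along the genome. The sketch only shows why labeling cost need not grow with the reference panel: the panel is consulted once to annotate clusters, and test haplotypes are then labeled by lookup.

```python
import json
from collections import Counter, defaultdict


def annotate_clusters(reference_assignments):
    """Annotate haplotype clusters with population proportions.

    reference_assignments: iterable of (window, cluster_id, population)
    tuples, one per reference haplotype per window (hypothetical format).
    """
    counts = defaultdict(Counter)
    for window, cluster, population in reference_assignments:
        counts[(window, cluster)][population] += 1
    annotations = {}
    for (window, cluster), pop_counts in counts.items():
        total = sum(pop_counts.values())
        # String keys so the annotations survive a JSON round trip.
        annotations[f"{window}:{cluster}"] = {
            pop: n / total for pop, n in pop_counts.items()
        }
    return annotations


def label_windows(test_clusters, annotations):
    """Label each window of a test haplotype by table lookup.

    test_clusters: list of cluster ids, one per window. Windows whose
    cluster has no annotation are left unlabeled (None).
    """
    labels = []
    for window, cluster in enumerate(test_clusters):
        probs = annotations.get(f"{window}:{cluster}")
        labels.append(max(probs, key=probs.get) if probs else None)
    return labels


def global_ancestry(labels):
    """Summarize local labels as global proportions over labeled windows."""
    called = [label for label in labels if label is not None]
    if not called:
        return {}
    return {pop: n / len(called) for pop, n in Counter(called).items()}


# Training step: the reference panel is used once, then can be set aside.
reference = [
    (0, "a", "POP1"), (0, "a", "POP1"), (0, "a", "POP2"),
    (0, "b", "POP2"), (1, "a", "POP2"),
]
saved_model = json.dumps(annotate_clusters(reference))  # could be written to disk

# Testing step: only the saved annotations are needed.
model = json.loads(saved_model)
local = label_windows(["a", "a", "z"], model)
summary = global_ancestry(local)
```

Because labeling touches only the saved annotations, independent test samples can be processed in separate workers without access to the reference panel.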

Competing Interest Statement

The authors declare competing financial interests: authors affiliated with AncestryDNA may have equity in Ancestry. The work described in this manuscript is covered by one or more patents, including the US patent entitled "Local Genetic Ethnicity Determination System" (US10558930B2).