RT Journal Article
SR Electronic
T1 Bootstrat: Population Informed Bootstrapping for Rare Variant Tests
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 068999
DO 10.1101/068999
A1 Hailiang Huang
A1 Gina M. Peloso
A1 Daniel Howrigan
A1 Barbara Rakitsch
A1 Carl Johann Simon-Gabriel
A1 Jacqueline I. Goldstein
A1 Mark J. Daly
A1 Karsten Borgwardt
A1 Benjamin M. Neale
YR 2016
UL http://biorxiv.org/content/early/2016/08/11/068999.abstract
AB Recent advances in genotyping and sequencing technologies have made it possible to detect rare variants in large cohorts. Various analytic methods for associating rare variants with disease have been proposed, including burden tests, C-alpha and SKAT. Most of these methods, however, assume that samples come from a homogeneous population, which is not realistic for analyses of large samples. Failing to correct for population stratification causes inflated test statistics and false-positive associations. Here we propose Bootstrat, a population-informed bootstrap resampling method that controls for population stratification in rare variant tests. In essence, the Bootstrat procedure uses genetic distance to create a phenotype probability for each sample. We show that this empirical approach can effectively correct for population stratification while maintaining statistical power comparable to established methods of controlling for population stratification. The Bootstrat scheme can be easily applied to existing rare variant testing methods with reasonable computational complexity. Author Summary: Recent technological advances have enabled large-scale analysis of rare variants, but properly testing rare variants remains a significant challenge because most rare variant testing methods assume a sample of homogeneous ethnicity, an assumption that often does not hold for large cohorts. Failure to account for this heterogeneity increases the type I error rate. Here we propose a bootstrap scheme, applicable to most existing rare variant testing methods, to control for population heterogeneity. This scheme uses a randomization layer to establish a null distribution of the test statistic while preserving the genetic relationships among samples. The null distribution is then used to calculate an empirical p-value that accounts for population heterogeneity. We demonstrate that this scheme successfully controls the type I error rate without loss of statistical power.