RT Journal Article
SR Electronic
T1 Estimation of Pairwise Genetic Distances Under Independent Sampling of Segregating Sites vs. Haplotype Sampling
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 108928
DO 10.1101/108928
A1 Max Shpak
A1 Yang Ni
A1 Jie Lu
A1 Peter Mueller
YR 2017
UL http://biorxiv.org/content/early/2017/02/15/108928.abstract
AB Genetic distance is a standard measure of variation in populations. When genomes are sequenced individually, genetic distances are computed over all pairs of multilocus haplotypes in a sample. However, when next-generation sequencing methods obtain reads from heterogeneous assemblages of genomes (e.g. microbial samples in a biofilm or cells from a tumor), individual reads are often drawn from different genomes. This means that pairwise genetic distances are calculated across independently sampled sites rather than across haplotype pairs. In this paper, we show that while the expected pairwise distance under whole haplotype sampling (WHS) is the same as under independent locus sampling (ILS), the sample variances of pairwise distance differ and depend on the direction and magnitude of linkage disequilibrium (LD) among polymorphic sites. We derive a weighted LD value that, when positive, predicts higher sample variance in estimated genetic distance for WHS. Weighted LD is positive when, on average, the most common alleles at two loci are in positive LD. Using individual-based simulations of an infinite sites model under Fisher-Wright genetic drift, variances of estimated genetic distance are found to be almost always higher under WHS than under ILS, suggesting a reduction in estimation error when sites are sampled independently. We apply these results to haplotype frequencies from a lung cancer tumor to compute weighted LD and the variances in estimated genetic distance under ILS vs. WHS, and find that the relative magnitudes of variances under WHS vs. ILS are sensitive to sampled allele frequencies.
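
As a rough illustration of the abstract's central claim, the following Python sketch computes the exact mean and variance of the distance between one randomly drawn pair, under WHS and under ILS, for a toy set of haplotype frequencies. The function name `pair_distance_moments`, the two-locus frequencies, and the biallelic "0"/"1" coding are assumptions made for illustration. This is not the authors' weighted LD statistic or their simulation code, and it treats a single pair drawn with replacement rather than a full sample.

```python
from itertools import product


def pair_distance_moments(hap_freqs):
    """Exact mean/variance of Hamming distance for one random pair.

    hap_freqs maps haplotype strings over {"0", "1"} to frequencies
    summing to 1 (a hypothetical input format chosen for this sketch).
    Returns (mean_whs, var_whs, mean_ils, var_ils).
    """
    n_loci = len(next(iter(hap_freqs)))

    # WHS: both haplotypes are drawn whole, so alleles at different
    # loci stay coupled through the haplotype frequencies (LD).
    mean_whs = 0.0
    second_moment = 0.0
    for (h1, f1), (h2, f2) in product(hap_freqs.items(), repeat=2):
        d = sum(a != b for a, b in zip(h1, h2))
        mean_whs += f1 * f2 * d
        second_moment += f1 * f2 * d * d
    var_whs = second_moment - mean_whs ** 2

    # ILS: each locus is sampled independently from its marginal
    # allele frequency, so the distance is a sum of independent
    # Bernoulli mismatches with probability 2p(1-p) per locus.
    p = [
        sum(f for h, f in hap_freqs.items() if h[i] == "1")
        for i in range(n_loci)
    ]
    het = [2 * pi * (1 - pi) for pi in p]
    mean_ils = sum(het)
    var_ils = sum(h * (1 - h) for h in het)

    return mean_whs, var_whs, mean_ils, var_ils


# Toy example (assumed values): the common alleles ("0" at both loci)
# are positively associated.
toy_freqs = {"00": 0.6, "01": 0.1, "10": 0.1, "11": 0.2}
mean_whs, var_whs, mean_ils, var_ils = pair_distance_moments(toy_freqs)
print(f"mean WHS={mean_whs:.4f}  mean ILS={mean_ils:.4f}")
print(f"var  WHS={var_whs:.4f}  var  ILS={var_ils:.4f}")
```

In this toy case the two means coincide while the WHS variance exceeds the ILS variance, which is the qualitative pattern the abstract describes when the common alleles are in positive LD. Other frequency choices can reverse the ordering of the variances.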