Approximately independent linkage disequilibrium blocks in human populations

Tomaz Berisa; Joseph K. Pickrell

doi:10.1101/020255

Abstract

We present a method to identify approximately independent blocks of linkage disequilibrium (LD) in the human genome. These blocks enable automated analysis of multiple genome-wide association studies.

Availability (code): http://bitbucket.org/nygcresearch/ldetect

Availability (data): http://bitbucket.org/nygcresearch/ldetect-data

1 Introduction

The genome-wide association study (GWAS) is a commonly used study design for the identification of genetic variants that influence complex traits. In this type of study, millions of genetic variants are genotyped in thousands to millions of individuals, and each variant is tested to see whether an individual’s genotype is predictive of their phenotype. Because of linkage disequilibrium (LD) in the genome [Pritchard and Przeworski, 2001], a single genetic variant with a causal effect on the phenotype leads to multiple statistical (but non-causal) associations at nearby variants. One initial analysis goal in a GWAS is to count the number of independent association signals in the genome while accounting for LD.

The most commonly used approach to counting independent SNPs that influence a trait is to count “peaks” of association signals. This can be done manually when the number of peaks is small (e.g. Wellcome Trust Case Control Consortium [2007]), or in a semi-automated way when the number of peaks is larger (e.g. Jostins et al. [2012]). There are also fully automated methods that use LD patterns estimated from large reference panels of individuals [Yang et al., 2012]. In some contexts, however (for example, when performing identical analyses on multiple GWAS with the goal of comparing phenotypes), it is useful to define approximately independent LD blocks a priori rather than letting them vary across analyses performed on different phenotypes [Loh et al., 2015; Pickrell, 2014].

To define approximately independent LD blocks, Loh et al. [2015] used non-overlapping segments of 1 megabase, and Pickrell [2014] used non-overlapping segments of 5,000 single nucleotide polymorphisms (SNPs). The breakpoints of these fixed-size segments will sometimes fall in regions of strong LD, thus potentially splitting a single association signal over two blocks (and leading to over-counting of the number of associated variants). A better approximation could be obtained by considering the empirical patterns of LD in a reference panel. In the remainder of this paper, we present an efficient signal processing-based heuristic for choosing approximate segment boundaries.

2 Approach and Results

In order to estimate LD between pairs of loci, we use the r² metric. If a genetic variant is in LD with another genetic variant that has a causal influence on disease, then r² is proportional to the association statistic at the non-causal SNP [Pritchard and Przeworski, 2001].
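
As a point of reference for the r² metric, the following is a minimal sketch of the plain sample estimate of r² between two loci, assuming genotypes (or haplotypes) are coded numerically. The function name `r_squared` and the handling of monomorphic loci are illustrative assumptions; the method itself relies on a shrinkage estimator of the covariance rather than this simple sample estimate.

```python
import numpy as np

def r_squared(x, y):
    """Squared Pearson correlation between two numerically coded loci."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Population (biased) covariance and variances across individuals
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    var_x = x.var()
    var_y = y.var()
    if var_x == 0 or var_y == 0:
        # r^2 is undefined for a monomorphic locus; returning 0 is an assumption
        return 0.0
    return cov ** 2 / (var_x * var_y)

# Two perfectly correlated loci and two uncorrelated loci
perfect = r_squared([0, 1, 0, 1], [0, 1, 0, 1])
independent = r_squared([0, 0, 1, 1], [0, 1, 0, 1])
```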

Our approach is a heuristic for choosing segment boundaries, given a mean segment size (which is the required input). Let there be n genetic variants on a chromosome. The method can be broken down into the following basic steps (see the Supplementary Material for details):

1. Calculate the n × n covariance matrix C for all pairs of loci using the shrinkage estimator of C from Wen and Stephens [2010].
2. Convert the covariance matrix to an n × n matrix P of squared Pearson product-moment correlation coefficients.
3. Convert the matrix P = (e_i,j) to a (2n − 1)-dimensional vector V = (v_k) by representing each antidiagonal of P by the sum of its elements (Figs. 1a and 1b).
4. Apply low-pass filters of increasing widths to (i.e., “smooth”) V until the requested number of minima is achieved.
5. Perform a local search in the proximity of each minimum from Step 4 in order to fine-tune the segment boundaries.

Figure 1: (a) and (b) Schematic of the conversion of matrix P to vector V. (c) Example data (blue) with Hann filter applied (red). (d) Example of Crohn’s disease GWAS hits with partially filtered vector V and comparison of breakpoints.
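
The conversion of a covariance matrix to squared correlations and the antidiagonal summation can be sketched as follows. This is an illustrative dense-matrix version under stated assumptions: the function names `cov_to_r2` and `antidiagonal_sums` are hypothetical, indices are 0-based (so entry k of the output sums all e_i,j with i + j = k), and zero-variance loci are assigned a correlation of 0. A real chromosome would need a sparse or banded representation rather than a full n × n array.

```python
import numpy as np

def cov_to_r2(C):
    """Squared Pearson correlations from a covariance matrix (dense sketch)."""
    C = np.asarray(C, dtype=float)
    sd = np.sqrt(np.diag(C))
    denom = np.outer(sd, sd)
    with np.errstate(divide="ignore", invalid="ignore"):
        # Zero-variance loci get correlation 0 (an assumption of this sketch)
        R = np.where(denom > 0, C / denom, 0.0)
    return R ** 2

def antidiagonal_sums(P):
    """Sum each antidiagonal of an n x n matrix into a vector of length 2n - 1."""
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    # Flipping left-right turns antidiagonals into ordinary diagonals;
    # offsets n-1 .. -(n-1) then correspond to i + j = 0 .. 2n-2.
    flipped = np.fliplr(P)
    return np.array([np.trace(flipped, offset=k) for k in range(n - 1, -n, -1)])

# Toy matrix with two perfectly correlated blocks of two loci each
P_toy = np.array([[1, 1, 0, 0],
                  [1, 1, 0, 0],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]])
V_toy = antidiagonal_sums(P_toy)
```

In the toy matrix, the smallest entry of `V_toy` sits at the antidiagonal that passes between the two blocks, which is the property the later steps exploit.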

In reality, matrix P turns out to be sparse, approximately banded, and approximately block-diagonal, with sporadically overlapping blocks [Slatkin, 2008; Wen and Stephens, 2010].

To provide intuition for Step 3, Fig. 1a shows a simplified example of a correlation matrix P, where two loci i and j are either correlated (represented by 1 in element e_ij of the matrix) or uncorrelated (represented by zero, not shown). Representing each antidiagonal of P by the sum of its elements results in the vector shown in Fig. 1b, and identifying segments representing blocks of LD reduces to identifying local (or, more stringently, global) minima in this vector. In reality, the elements e_ij of P are continuous values from the interval [0, 1] and result in an extremely noisy vector V (example in blue in Fig. 1c). Therefore, in order to identify large-scale trends of LD and reduce high-frequency components in the signal, we apply a signal processing technique known as low-pass filtering (using a Hann window [Blackman and Tukey, 1958]) in Step 4. The result of applying a low-pass filter (with width = 100) is shown in red in Fig. 1c.
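
A minimal sketch of Hann-window low-pass filtering followed by detection of local minima is given below. It assumes the filter is applied as a normalized convolution and that a local minimum is a point strictly below both of its neighbours; the function names `hann_smooth` and `local_minima` and the treatment of the vector edges are illustrative choices, not details taken from the ldetect code.

```python
import numpy as np

def hann_smooth(v, width):
    """Low-pass filter v by convolving it with a normalized Hann window."""
    v = np.asarray(v, dtype=float)
    # np.hanning has zero-valued endpoints; drop them to keep `width` nonzero taps
    window = np.hanning(width + 2)[1:-1]
    window /= window.sum()
    # mode="same" keeps the output length equal to len(v) when width <= len(v);
    # values near the edges are attenuated because the window overhangs the data
    return np.convolve(v, window, mode="same")

def local_minima(v):
    """Indices whose value is strictly smaller than both neighbours."""
    v = np.asarray(v, dtype=float)
    interior = (v[1:-1] < v[:-2]) & (v[1:-1] < v[2:])
    return np.flatnonzero(interior) + 1

# Noisy two-dip signal: smoothing suppresses the high-frequency wiggles
rng = np.random.default_rng(0)
x = np.linspace(0, 4 * np.pi, 400)
noisy = 2.0 + np.cos(x) + 0.3 * rng.standard_normal(x.size)
smoothed = hann_smooth(noisy, width=51)
```

Flat-bottomed minima (runs of equal values) are not reported by the strict comparison above; handling them would need an additional rule.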

Applying wider and wider filters to vector V in Step 4 allows us to focus on the large scale structure of LD blocks, but also causes the approach to miss small scale variation around identified minima. In order to counteract this effect, Step 5 conducts a local search in the proximity of each local minimum identified in Step 4 to find the closest locus l with .
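
The widening of the filter in Step 4 can be illustrated with a simple linear scan over filter widths. This is a sketch under several assumptions: the name `smooth_until`, the starting width, and the step size are hypothetical; the requested number of minima is taken as given (presumably it follows from the chromosome length and the mean segment size); and the loop stops at the first width that yields at most that many minima, since the count need not decrease by exactly one at each step. The search strategy actually used may differ.

```python
import numpy as np

def smooth_until(v, n_minima, start_width=3, step=2):
    """Widen a Hann filter until at most `n_minima` strict local minima remain.

    Returns (width, smoothed_vector, minima_indices). Assumes start_width <= len(v).
    """
    v = np.asarray(v, dtype=float)
    width = start_width
    while True:
        window = np.hanning(width + 2)[1:-1]   # drop the zero endpoints
        window /= window.sum()
        smoothed = np.convolve(v, window, mode="same")
        interior = (smoothed[1:-1] < smoothed[:-2]) & (smoothed[1:-1] < smoothed[2:])
        minima = np.flatnonzero(interior) + 1
        # Stop when few enough minima remain, or when the window cannot grow further
        if len(minima) <= n_minima or width + step > len(v):
            return width, smoothed, minima
        width += step

# Noisy signal with two broad dips
rng = np.random.default_rng(1)
x = np.linspace(0, 4 * np.pi, 300)
signal = 2.0 + np.cos(x) + 0.3 * rng.standard_normal(x.size)
width, smoothed, minima = smooth_until(signal, n_minima=2)
```

The returned minima are only coarse candidates; the local search of Step 5 is what refines them into final boundaries.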

We provide an illustrative example in Fig. 1d, showing genome-wide association study (GWAS) results for Crohn’s disease [Jostins et al., 2012] in a region of chromosome 21 between 44.0 Mb and 46.5 Mb. The figure also shows a scaled-to-fit illustration of vector V for this region. This example depicts a situation in which using the uniform breakpoint (in red) would result in two significant SNPs being counted, while the LD-aware breakpoints avoid cutting through stretches of loci in LD.

To test whether this approach is useful more generally, we ran fgwas [Pickrell, 2014] on GWAS of Crohn’s disease [Jostins et al., 2012] and height [Wood et al., 2014], using both uniformly-distributed breakpoints and LD-aware breakpoints. Using the LD-aware breakpoints successfully eliminated double-counting of SNPs in moderate-to-high LD and on opposite sides of uniform breakpoints (Supplementary Materials, Section 6).

A complete list of breakpoints obtained using this method (with mean segment size = 10⁴ SNPs) for the African, Asian, and European populations of the 1000 Genomes Phase 1 dataset is available at [Berisa and Pickrell, 2015] in BED format.
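
As an illustration of how such a breakpoint list might be used downstream, the sketch below assigns a SNP position to its block. It assumes the standard three-column BED layout (chromosome, start, end, with half-open intervals) and tolerates an optional header row; the function names `parse_bed` and `block_index` and the example coordinates are hypothetical and are not taken from the data repository.

```python
from bisect import bisect_right

def parse_bed(lines):
    """Parse BED-like lines into {chromosome: sorted list of (start, end)}."""
    blocks = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or not fields[1].isdigit():
            continue  # skip blank lines and any header row
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        blocks.setdefault(chrom, []).append((start, end))
    for intervals in blocks.values():
        intervals.sort()
    return blocks

def block_index(blocks, chrom, pos):
    """Index of the half-open block [start, end) containing pos, or None."""
    intervals = blocks.get(chrom, [])
    starts = [start for start, _ in intervals]
    i = bisect_right(starts, pos) - 1
    if i >= 0 and intervals[i][0] <= pos < intervals[i][1]:
        return i
    return None

# Made-up coordinates, for illustration only
example_lines = ["chr start stop", "chr21 100 200", "chr21 200 350"]
example_blocks = parse_bed(example_lines)
```

Grouping association statistics by the returned block index gives one set of SNPs per approximately independent block.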

References

Berisa, T. and Pickrell, J. K. (2015). LDetect data repository: http://bitbucket.org/nygcresearch/ldetect-data.

Blackman, R. B. and Tukey, J. W. (1958). The measurement of power spectra from the point of view of communications engineering - Part I. Bell System Technical Journal, 37(1), 185–282.

Jostins, L., Ripke, S., Weersma, R. K., Duerr, R. H., McGovern, D. P., Hui, K. Y., Lee, J. C., Schumm, L. P., Sharma, Y., Anderson, C. A., et al. (2012). Host-microbe interactions have shaped the genetic architecture of inflammatory bowel disease. Nature, 491(7422), 119–124.

Loh, P.-R., Bhatia, G., Gusev, A., Finucane, H. K., Bulik-Sullivan, B. K., Pollack, S. J., de Candia, T. R., Lee, S. H., Wray, N. R., Kendler, K. S., et al. (2015). Contrasting regional architectures of schizophrenia and other complex diseases using fast variance components analysis. bioRxiv, page 016527.

Pickrell, J. K. (2014). Joint analysis of functional genomic data and genome-wide association studies of 18 human traits. The American Journal of Human Genetics, 94(4), 559–573.

Pritchard, J. K. and Przeworski, M. (2001). Linkage disequilibrium in humans: models and data. The American Journal of Human Genetics, 69(1), 1–14.

Slatkin, M. (2008). Linkage disequilibrium - understanding the evolutionary past and mapping the medical future. Nature Reviews Genetics, 9(6), 477–485.

Wellcome Trust Case Control Consortium (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145), 661–678.

Wen, X. and Stephens, M. (2010). Using linear predictors to impute allele frequencies from summary or pooled genotype data. The Annals of Applied Statistics, 4(3), 1158.

Wood, A. R., Esko, T., Yang, J., Vedantam, S., Pers, T. H., Gustafsson, S., Chu, A. Y., Estrada, K., Luan, J., Kutalik, Z., et al. (2014). Defining the role of common variation in the genomic and biological architecture of adult human height. Nature Genetics, 46(11), 1173–1186.

Yang, J., Ferreira, T., Morris, A. P., Medland, S. E., Madden, P. A., Heath, A. C., Martin, N. G., Montgomery, G. W., Weedon, M. N., Loos, R. J., et al. (2012). Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nature Genetics, 44(4), 369–375.