Abstract
We present a method to identify approximately independent blocks of linkage disequilibrium (LD) in the human genome. These blocks enable automated analysis of multiple genome-wide association studies.
Availability (code): http://bitbucket.org/nygcresearch/ldetect
Availability (data): http://bitbucket.org/nygcresearch/ldetect-data
1 Introduction
The genome-wide association study (GWAS) is a commonly-used study design for the identification of genetic variants that influence complex traits. In this type of study, millions of genetic variants are genotyped on thousands to millions of individuals, and each variant is tested to see if an individual’s genotype is predictive of their phenotypes. Because of linkage disequilibrium (LD) in the genome [Pritchard and Przeworski, 2001], a single genetic variant with a causal effect on the phenotype leads to multiple statistical (but non-causal) associations at nearby variants. One initial analysis goal in a GWAS is to count the number of independent association signals in the genome while accounting for LD.
The most commonly-used approach to counting independent SNPs that influence a trait is to count “peaks” of association signals–this can be done manually when the number of peaks is small (e.g. Wellcome Trust Case Control Consortium [2007]), or in a semi-automated way when the number of peaks is larger (e.g. Jostins et al. [2012]). There are also fully automated methods that use LD patterns estimated from large reference panels of individuals [Yang et al., 2012]. In some contexts (for example, when performing identical analysis on multiple GWAS with the goal of comparing phenotypes), however, it is useful to define approximately independent LD blocks a priori rather than letting them vary across analyses performed on different phenotypes [Loh et al., 2015; Pickrell, 2014].
To define approximately-independent LD blocks, Loh et al. [2015] used non-overlapping segments of 1 megabase, and Pickrell [2014] used non-overlapping segments of 5,000 single nucleotide polymorphisms (SNPs). The breakpoints of these segments undoubtedly sometimes fall in regions of strong LD, thus potentially splitting a single association signal over two blocks (and leading to over-counting of the number of associated variants). A better approximation could be obtained by considering the empirical patterns of LD in a reference panel. In the remainder of this paper, we present an efficient signal processing-based heuristic for choosing approximate segment boundaries.
2 Approach and Results
To estimate LD between pairs of loci, we use the r² metric. If a genetic variant is in LD with another genetic variant that has a causal influence on disease, then r² is proportional to the association statistic at the non-causal SNP [Pritchard and Przeworski, 2001].
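As a minimal illustration of the r² metric, the squared Pearson correlation between two variants can be computed directly from genotype dosages. This is only a sketch: the function name `r_squared` and the toy genotypes are hypothetical, and it uses the plain sample correlation rather than the shrinkage estimator described in the steps of this section.

```python
import numpy as np

def r_squared(g1, g2):
    """Squared Pearson correlation between two genotype vectors (0/1/2 dosages).

    Assumes both variants are polymorphic in the sample (non-zero variance).
    """
    g1 = np.asarray(g1, dtype=float)
    g2 = np.asarray(g2, dtype=float)
    r = np.corrcoef(g1, g2)[0, 1]
    return r * r

# Toy genotypes for six individuals (illustrative values only).
snp_a = [0, 1, 2, 0, 1, 2]
snp_b = [0, 1, 2, 0, 1, 2]   # identical to snp_a: perfect LD
snp_c = [2, 1, 0, 2, 1, 0]   # mirror image of snp_a: r = -1, so r^2 is still 1

print(r_squared(snp_a, snp_b), r_squared(snp_a, snp_c))
```

Because r² discards the sign of the correlation, `snp_b` and `snp_c` are equally strongly linked to `snp_a` under this metric.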
Our approach is a heuristic for choosing segment boundaries, given a mean segment size (which is the required input). Let there be n genetic variants on a chromosome. The method can be broken down into the following basic steps (see the Supplementary Material for details):
1. Calculate the n × n covariance matrix C for all pairs of loci using the shrinkage estimator of C from Wen and Stephens [2010].
2. Convert the covariance matrix to an n × n matrix P of squared Pearson product-moment correlation coefficients.
3. Convert the matrix P = (e_{i,j}) to a (2n − 1)-dimensional vector V = (v_k) as follows:
\[ v_k = \sum_{i + j = k + 1} e_{i,j}, \qquad k = 1, \dots, 2n - 1, \quad 1 \le i, j \le n \]
The effect of this step is to represent each antidiagonal of P by the sum of its elements (Figs. 1a and 1b).
4. Apply low-pass filters of increasing widths to (i.e., “smooth”) V until the requested number of minima is achieved.
5. Perform a local search in the proximity of each minimum from Step 4 in order to fine-tune the segment boundaries.
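Steps 2 and 3 can be sketched in a few lines of NumPy. This is a hedged illustration, not the ldetect implementation: the helper names `cov_to_r2` and `antidiagonal_sums` are hypothetical, the covariance matrix is assumed to be dense with strictly positive variances, and the code uses 0-based indices, so antidiagonal k collects the elements with i + j = k.

```python
import numpy as np

def cov_to_r2(C):
    """Convert a covariance matrix into squared Pearson correlations.

    Assumes every diagonal entry (variance) is strictly positive.
    """
    C = np.asarray(C, dtype=float)
    sd = np.sqrt(np.diag(C))
    R = C / np.outer(sd, sd)
    return R ** 2

def antidiagonal_sums(P):
    """Return the (2n - 1)-vector whose k-th entry sums antidiagonal k of P."""
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    # Flipping left-right turns antidiagonals into ordinary diagonals;
    # antidiagonal k (i + j = k) becomes the diagonal with offset n - 1 - k.
    flipped = np.fliplr(P)
    return np.array([np.trace(flipped, offset=n - 1 - k)
                     for k in range(2 * n - 1)])

# Toy matrix with two perfectly correlated blocks of two loci each.
toy_P = np.array([[1, 1, 0, 0],
                  [1, 1, 0, 0],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
toy_V = antidiagonal_sums(toy_P)
```

For the toy matrix, the vector dips to zero at the antidiagonal that passes between the two blocks, which is the kind of minimum the later steps look for. A dense n × n matrix is only practical for small n; the sparsity and banded structure of P described next would be exploited in any real implementation.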
In reality, matrix P turns out to be sparse, approximately banded, and approximately block-diagonal, with sporadically overlapping blocks [Slatkin, 2008; Wen and Stephens, 2010].
To provide intuition for Step 3, Fig. 1a shows a simplified example of a correlation matrix P, where two loci i and j are either correlated (represented by 1 in element e_{i,j} of the matrix) or uncorrelated (represented by zero, not shown). Representing each antidiagonal of P by the sum of its elements results in the vector shown in Fig. 1b, and identifying segments that represent blocks of LD reduces to identifying local (or, more stringently, global) minima in this vector. In reality, the elements e_{i,j} of P are continuous values in the interval [0, 1] and result in an extremely noisy vector V (example in blue in Fig. 1c). Therefore, to identify large-scale trends of LD and reduce high-frequency components in the signal, we apply a signal processing technique known as low-pass filtering (using a Hann window [Blackman and Tukey, 1958]) in Step 4. The result of applying a low-pass filter (with width = 100) is shown in red in Fig. 1c.
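The smoothing in Step 4 can be sketched as a convolution with a normalized Hann window, repeated with growing widths until few enough local minima remain. The code below is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the stopping rule "at most `n_minima` minima" is one possible reading of "until the requested number of minima is achieved", only odd widths are tried so that the window stays centered, and the edges are zero-padded.

```python
import numpy as np

def hann_smooth(v, width):
    """Low-pass filter v by convolving it with a normalized Hann window.

    np.hanning(width + 2) has zeros at both ends; dropping them leaves
    `width` non-zero taps. mode="same" zero-pads the edges of v.
    """
    window = np.hanning(width + 2)[1:-1]
    window = window / window.sum()
    return np.convolve(v, window, mode="same")

def local_minima(v):
    """Indices k of strict interior minima: v[k-1] > v[k] < v[k+1]."""
    v = np.asarray(v, dtype=float)
    return np.flatnonzero((v[1:-1] < v[:-2]) & (v[1:-1] < v[2:])) + 1

def smooth_until(v, n_minima, max_width=None):
    """Widen the filter until at most n_minima local minima remain (assumed rule)."""
    v = np.asarray(v, dtype=float)
    if max_width is None:
        max_width = len(v)
    for width in range(1, max_width + 1, 2):  # width 1 leaves v unchanged
        smoothed = hann_smooth(v, width)
        minima = local_minima(smoothed)
        if len(minima) <= n_minima:
            return width, smoothed, minima
    raise ValueError("no filter width up to max_width gave few enough minima")

# Synthetic signal: a slow oscillation plus sample-to-sample alternating noise.
x = np.arange(200)
noisy = 10 + 5 * np.cos(2 * np.pi * x / 100) + np.where(x % 2 == 0, 1.0, -1.0)
width, smoothed, minima = smooth_until(noisy, n_minima=2)
```

On this synthetic signal the unsmoothed vector has a local minimum at nearly every other position, while the smoothed vector retains only the two troughs of the slow oscillation. Real data would need far wider filters, as the width of 100 used for Fig. 1c suggests.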
Applying wider and wider filters to vector V in Step 4 allows us to focus on the large scale structure of LD blocks, but also causes the approach to miss small scale variation around identified minima. In order to counteract this effect, Step 5 conducts a local search in the proximity of each local minimum identified in Step 4 to find the closest locus l with .
We provide an illustrative example in Fig. 1d, showing GWAS results for Crohn’s disease [Jostins et al., 2012] in a region of chromosome 21 between 44.0 Mb and 46.5 Mb. The figure also shows a scaled-to-fit illustration of vector V for this region. In this example, the uniform breakpoint (in red) falls within a stretch of loci in LD, so that a single association signal would be counted as two significant SNPs, whereas the LD-aware breakpoints avoid such stretches.
To test whether this approach is useful more generally, we ran fgwas [Pickrell, 2014] on GWAS of Crohn’s disease [Jostins et al., 2012] and height [Wood et al., 2014], using both uniformly-distributed breakpoints and LD-aware breakpoints. Using the LD-aware breakpoints successfully eliminated double-counting of SNPs in moderate-to-high LD and on opposite sides of uniform breakpoints (Supplementary Materials, Section 6).
A complete list of breakpoints obtained using this method (with mean segment size = 10^4 SNPs) for the African, Asian, and European populations of the 1000 Genomes Phase 1 dataset is available at [Berisa and Pickrell, 2015] in BED format.
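For readers who want to use such a breakpoint list, the sketch below reads segments from BED-style text, in which the first three whitespace-separated columns are chromosome, start, and end. The function name `read_bed_segments`, the handling of header rows, and the example coordinates are assumptions made for illustration; they are not taken from the published files.

```python
import io

def read_bed_segments(handle):
    """Parse (chrom, start, end) tuples from BED-style lines.

    Lines with fewer than three fields, or whose start/end are not
    non-negative integers (e.g. a column-header row), are skipped.
    """
    segments = []
    for line in handle:
        fields = line.split()
        if len(fields) < 3:
            continue
        chrom, start, end = fields[0], fields[1], fields[2]
        if not (start.isdigit() and end.isdigit()):
            continue
        segments.append((chrom, int(start), int(end)))
    return segments

# Made-up coordinates, used only to show the expected layout.
example_text = "chr\tstart\tstop\nchr1\t0\t1000\nchr1\t1000\t2500\n"
segments = read_bed_segments(io.StringIO(example_text))
```

In BED, start coordinates are 0-based and end coordinates are exclusive, so adjacent segments that share a boundary value do not overlap.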