Abstract
We introduce Genetic Instrumental Variables (GIV) regression, a method to estimate causal effects in non-experimental data, with many possible applications in the social sciences and epidemiology. In non-experimental data, genetic correlation between the outcome and the exposure of interest is a source of bias. Instrumental variable (IV) regression is a potential solution, but valid instruments are scarce. Existing literature proposes to use genes related to the exposure as instruments (i.e., Mendelian Randomization, MR), but this approach is problematic because pleiotropic effects of genes can violate the assumptions of IV regression. In contrast, GIV regression provides accurate estimates of the causal effect of the exposure and of gene-environment interactions involving the exposure under less restrictive assumptions than MR. As a valuable byproduct, GIV regression also provides accurate estimates of the chip heritability of the outcome variable. GIV regression uses polygenic scores (PGS) for the exposure and the outcome of interest, both of which can be constructed from genome-wide association study (GWAS) results. By splitting the GWAS sample for the outcome into non-overlapping subsamples, we obtain multiple indicators of the outcome PGS that can be used as instruments for each other. In two empirical applications, we demonstrate that our approach produces reasonable estimates of the chip heritability of educational attainment (EA). Unlike the results obtained with MR, the GIV regression estimates indicate that the positive relationship between body height and EA is primarily due to genetic confounds that have pleiotropic effects on both traits.
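The idea of using two PGS from non-overlapping GWAS subsamples as instruments for each other can be illustrated with a small simulation. The sketch below is not the full GIV regression (it omits the exposure, its PGS, and any controls); it only shows, under simplifying assumptions, how instrumenting one noisy score with a second, independently estimated score removes attenuation bias. All names (`pgs1`, `pgs2`, `true_beta`, the helper functions) and the data-generating process are hypothetical and chosen for illustration.

```python
import numpy as np

def ols_slope(y, x):
    """Slope from an OLS regression of y on x with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

def two_stage_least_squares(y, x, z):
    """2SLS slope of y on x, using z as the single instrument for x."""
    Z = np.column_stack([np.ones_like(z), z])
    # First stage: fitted values of x from a regression on the instrument.
    x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    # Second stage: regress y on the first-stage fitted values.
    return ols_slope(y, x_hat)

rng = np.random.default_rng(0)
n = 200_000
true_beta = 0.5                      # assumed effect of the true genetic score

g = rng.normal(size=n)               # true (unobserved) genetic score
y = true_beta * g + rng.normal(size=n)

# Two noisy indicators of g, mimicking PGS built from non-overlapping
# GWAS subsamples: same signal, independent estimation noise (assumption).
pgs1 = g + rng.normal(size=n)
pgs2 = g + rng.normal(size=n)

beta_ols = ols_slope(y, pgs1)                      # attenuated toward zero
beta_iv = two_stage_least_squares(y, pgs1, pgs2)   # close to true_beta

print(f"OLS: {beta_ols:.3f}  IV: {beta_iv:.3f}  truth: {true_beta}")
```

In this setup the noise variance equals the signal variance, so the OLS slope converges to half the true coefficient, while the IV estimate is consistent because the noise in `pgs2` is independent of both the noise in `pgs1` and the error in `y`.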
Footnotes
This research was facilitated by the Social Science Genetic Association Consortium (SSGAC) and by the research group on genetic and social causes of life chances at the Zentrum für interdisziplinäre Forschung (ZiF) Bielefeld. Data analyses make use of the UK Biobank resource under application number 11425. We acknowledge data access from the Genetic Investigation of ANthropometric Traits Consortium (GIANT). We used data from the Health and Retirement Study (HRS), which is supported by the National Institute on Aging (NIA U01AG009740, RC2 AG036495, RC4 AG039029). HRS genotype data can be accessed via the database of Genotypes and Phenotypes (dbGaP, accession number phs000428.v1.p1). Researchers who wish to link genetic data with other HRS measures that are not in dbGaP, such as educational attainment, must apply for access from HRS. We are very grateful to Richard Karlsson Linnér for help with the GWAS analyses in the UK Biobank and to Aysu Okbay for providing us with subsets of the GWAS meta-analysis on educational attainment. We thank Patrick Turley, Daniel J. Benjamin, Jonathan P. Beauchamp, Niels Rietveld, Eric Slob, Hans van Kippersluis, Benjamin Domingue, and Lisbeth Trille Loft for productive discussions and comments on earlier versions of the manuscript. Furthermore, we thank Elliot Tucker-Drob for pointing us to the necessary correction of the heritability estimate in our model. The study was supported by funding from an ERC Consolidator Grant (647648 EdGe, Philipp D. Koellinger).
1 This conclusion assumes that the two PGS are estimated from the same population. In principle, the PGS for a trait could vary across subpopulations. Using a PGS from one subpopulation as an instrument for a PGS from another subpopulation could cause a violation of the exclusion restriction. This potential problem is solved if the two scores are estimated from randomly chosen subsamples of a single GWAS sample after randomly excluding related individuals so that the final sample consists only of unrelated individuals. This can be done using the UK Biobank.
2 This issue is similar to the attenuation of predictive accuracy of a PGS that results from an imperfect genetic correlation between the GWAS summary statistics in the hold-out sample and the GWAS summary statistics in the discovery sample [16].
3 We could, for example, imagine replacing each of the variables in equation 8 with the residual of this variable from an OLS regression of that variable on the variables in X.
4 One component is the difference between the true PGS for y, without controls for X and T, and the predicted PGS for y (without controls for X and T) from Sy2. The other component is the difference in the coefficients of G on y in the presence and the absence of X and T.
5 RAND HRS Data, Version O. Produced by the RAND Center for the Study of Aging, with funding from the National Institute on Aging and the Social Security Administration. Santa Monica, CA (August 2016). See http://www.rand.org/labor/aging/dataprod/hrs-data.html for additional information.
7 http://portals.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium_data_files#GWAS_Anthropometric_201
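The residualization described in footnote 3 (replacing each variable with its residual from an OLS regression on the variables in X) can be sketched as follows. This is a minimal illustration on simulated data, not the authors' code: the variables `t`, `y`, and `X`, the coefficients, and the helper `residualize` are all hypothetical, and a plain OLS regression stands in for the equation the footnote refers to. It shows that the coefficient obtained from the residualized variables matches the one from the regression that includes the controls directly.

```python
import numpy as np

def residualize(v, X):
    """Residual of v after an OLS regression on X (X includes an intercept)."""
    coef = np.linalg.lstsq(X, v, rcond=None)[0]
    return v - X @ coef

rng = np.random.default_rng(1)
n = 5_000

# Hypothetical controls (intercept plus two covariates).
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
# Hypothetical regressor that is correlated with the controls.
t = X @ np.array([0.2, 0.5, -0.3]) + rng.normal(size=n)
# Hypothetical outcome that depends on t and on the controls.
y = 0.7 * t + X @ np.array([1.0, -0.4, 0.6]) + rng.normal(size=n)

# Coefficient on t when the controls enter the regression directly.
full = np.linalg.lstsq(np.column_stack([t, X]), y, rcond=None)[0][0]

# Coefficient on t after residualizing both y and t on the controls.
y_r = residualize(y, X)
t_r = residualize(t, X)
partial = (t_r @ y_r) / (t_r @ t_r)

print(f"with controls: {full:.6f}  residualized: {partial:.6f}")
```

The two coefficients agree up to floating-point error, which is why controls can be "partialled out" before the main regression without changing the coefficient of interest.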