Genome-wide association study of depression phenotypes in UK Biobank (n = 322,580) identifies the enrichment of variants in excitatory synaptic pathways

David M. Howard; Mark J. Adams; Masoud Shirali; Toni-Kim Clarke; Riccardo E. Marioni; Gail Davies; Jonathan R. I. Coleman; Clara Alloza; Xueyi Shen; Miruna C. Barbu; Eleanor M. Wigmore; Jude Gibson; Saskia P. Hagenaars; Cathryn M. Lewis; Daniel J. Smith; Patrick F. Sullivan; Chris S. Haley; Gerome Breen; Ian J. Deary; Andrew M. McIntosh

doi:10.1101/168732

Abstract

Depression is a polygenic trait that causes extensive periods of disability and increases the risk of suicide, a leading cause of death in young people. Previous genetic studies have identified common risk variants, with the number identified rising in line with increasing sample sizes. We conducted a genome-wide association study (GWAS) in the largest single population-based cohort to date, UK Biobank. This allowed us to estimate the effects of ≈ 8 million genetic variants in 320,000 people for three depression phenotypes: broad depression, probable major depressive disorder (MDD), and International Classification of Diseases (ICD, version 9 or 10)-coded MDD. Each phenotype was found to be significantly genetically correlated with the results from a previous independent study of clinically defined MDD. We identified 14 independent loci that were significantly associated (P < 5 × 10⁻⁸) with broad depression, two independent variants for probable MDD, and one independent variant for ICD-coded MDD. Gene-based analysis of our GWAS results with MAGMA revealed 46 regions significantly associated (P < 2.77 × 10⁻⁶) with broad depression, two significant regions for probable MDD and one significant region for ICD-coded MDD. Gene region-based analysis of our GWAS results with MAGMA revealed 59 regions significantly associated (P < 6.02 × 10⁻⁶) with broad depression, of which 27 were also detected by gene-based analysis. Variants for broad depression were enriched in pathways for excitatory neurotransmission, mechanosensory behavior, postsynapse, neuron spine and dendrite. This study provides a number of novel genetic risk variants that can be leveraged to elucidate the mechanisms of MDD and low mood.

Introduction

Depression is ranked as the largest contributor to global disability, affecting 322 million people¹. The heritability (h²) of major depressive disorder (MDD) is estimated at 37% from twin studies² and common single nucleotide polymorphisms (SNPs) contribute approximately 9% to variation in liability³, providing strong evidence of a genetic contribution to its causation. Previous genetic association studies have used a number of depression phenotypes, including self-declared depression⁴, clinician-diagnosed MDD⁵ and depression ascertained via hospital records⁶, with some evidence of overlapping genetic architecture between a subset of these definitions. Different definitions of depression are rarely included in large sample studies, although UK Biobank is an exception. The favouring of greater sample size over clinical precision has yielded a steady increase over time in the number of variants for ever more diverse MDD phenotypes³⁻⁵,⁷. In the current paper, we extend this approach to the study of three depression-related phenotypes within the large UK Biobank cohort and identify new disease biology based upon our findings.

The UK Biobank cohort provides data on over 500,000 individuals and represents an opportunity to conduct the largest genome-wide association study (GWAS) of depression to date within a single cohort. This cohort has been extensively phenotyped, allowing us to derive three depression traits: self-reported past help-seeking for problems with ‘nerves, anxiety, tension or depression’ (hereafter termed ‘broad depression’); self-reported depressive symptoms with associated impairment (termed ‘probable MDD’); and MDD identified from International Classification of Diseases (ICD)-9 or ICD-10 hospital admission records (termed ‘ICD-coded MDD’). We also conducted gene-based analyses with the MAGMA software package⁸ to identify genes, regions and pathways associated with each phenotype, and used GTEx⁹ to determine whether the significant variants identified were expression quantitative trait loci (eQTL).

Materials and Methods

The UK Biobank cohort is a population-based cohort consisting of 501,726 individuals, recruited at 23 centres across the United Kingdom. Genotypic data were available for 488,380 individuals and were imputed with IMPUTE4 using the HRC reference panel¹⁰, yielding ≈ 19M variants for 487,409 individuals¹¹. We excluded 131,790 related individuals based on a shared relatedness of up to the third degree using kinship coefficients (> 0.044) calculated using the KING toolset¹², and excluded a further 79,990 individuals who were either not recorded as “white British”, were outliers based on heterozygosity, or had a variant call rate < 98%. We subsequently added back in one member of each group of related individuals by creating a genomic relationship matrix and selecting individuals with a genetic relatedness less than 0.025 with any other participant (n = 55,745). We removed variants with a call rate < 98%, a minor allele frequency < 0.01, those that deviated from Hardy-Weinberg equilibrium (P < 10⁻⁶), or had an imputation accuracy score < 0.1, leaving a total of 7,826,341 variants for 331,374 individuals.
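The sample counts quoted above can be cross-checked with a short calculation. The sketch below assumes that the two exclusion steps and the re-inclusion step are applied in sequence to the 487,409 imputed individuals; the function and variable names are illustrative and do not come from the study's own pipeline.

```python
# Counts quoted in the text; all names here are illustrative assumptions.
imputed = 487_409          # individuals with imputed genotypes
related_removed = 131_790  # removed for relatedness up to the third degree
qc_removed = 79_990        # ancestry, heterozygosity, or call-rate exclusions
added_back = 55_745        # one representative per related group, re-included


def remaining_samples(start, removed, restored):
    """Return the sample size after exclusions and re-inclusions."""
    return start - sum(removed) + restored


final_n = remaining_samples(imputed, [related_removed, qc_removed], added_back)
print(final_n)
```

Under that assumption the result agrees with the 331,374 individuals reported after quality control.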

Extensive phenotypic data were collected for UK Biobank participants using health records, biological sampling, physical measures, and touchscreen tests and questionnaires. We used three definitions of depression in the UK Biobank sample, which are explained in greater depth in the Supplementary Information and are summarised below.

Broad depression phenotype

The broadest phenotype (broad depression) was defined using self-reported help-seeking behaviour for mental health difficulties. Case and control status was determined by the touchscreen response to either of two questions: ‘Have you ever seen a general practitioner (GP) for nerves, anxiety, tension or depression?’ or ‘Have you ever seen a psychiatrist for nerves, anxiety, tension or depression?’. Caseness for broad depression was determined by answering ‘Yes’ to either question at either the initial assessment visit or at any repeat assessment visit, or by a primary or secondary diagnosis of a depressive mood disorder in linked hospital admission records. The remaining respondents were classed as controls if they provided ‘No’ responses to both questions during all assessments that they participated in.
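As an illustration of the case and control rules just described, the following is a minimal sketch. The function name, argument names, and the representation of answers as strings are hypothetical; the additional exclusions applied later to all three phenotypes are not modelled here.

```python
from typing import Optional, Sequence


def broad_depression_status(
    gp_answers: Sequence[str],
    psychiatrist_answers: Sequence[str],
    hospital_depression_dx: bool,
) -> Optional[int]:
    """Return 1 for a case, 0 for a control, None if neither rule applies.

    Each sequence holds one answer per assessment visit attended
    (hypothetical encoding: "Yes", "No", or any other response).
    """
    answers = list(gp_answers) + list(psychiatrist_answers)
    # Case: 'Yes' to either question at any visit, or a linked hospital diagnosis.
    if hospital_depression_dx or "Yes" in answers:
        return 1
    # Control: 'No' to both questions at every visit attended.
    if gp_answers and psychiatrist_answers and all(a == "No" for a in answers):
        return 0
    # Anything else (for example a non-response) is left unclassified.
    return None


print(broad_depression_status(["No", "Yes"], ["No", "No"], False))
```

Returning None for incomplete or ambiguous responses is a design choice of this sketch, made so that such participants are not silently counted as controls.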

Probable MDD phenotype

The second depression phenotype (probable MDD) was derived from touchscreen responses to questions about the presence and duration of low mood and anhedonia, following the definitions from Smith, et al.¹³, whereby the participant had indicated that they were ‘Depressed/down for a whole week; plus at least two weeks duration; plus ever seen a GP or psychiatrist for “nerves, anxiety, or depression”’ OR ‘ever anhedonia for a whole week; plus at least two weeks duration; plus ever seen a GP or psychiatrist for “nerves, anxiety, or depression”’. Cases for the probable MDD definition were supplemented by diagnoses of depressive mood disorder from linked hospital admission records.

ICD-coded phenotype

The ICD-coded MDD phenotype was derived from linked hospital admission records. Participants were classified as cases if they had either an ICD-10 primary or secondary diagnosis for a mood disorder. ICD-coded MDD controls were participants who had linked hospital records, but who did not have any diagnosis of a mood disorder and were not probable MDD cases.

For the three UK Biobank depression phenotypes we excluded: participants who were identified with bipolar disorder, schizophrenia, or personality disorder using self-declared data, touchscreen responses (per Smith, et al. ¹³), or ICD codes from hospital admission records; and participants who reported having a prescription for an antipsychotic medication during a verbal interview. Further exclusions were applied to control individuals if they had a diagnosis of a depressive mood disorder from hospital admission records, had reported having a prescription for antidepressants, or self-reported depression (see Supplementary Information for full phenotype criteria and UK Biobank field codes). This provided a total of 113,769 cases and 208,811 controls (n_total = 322,580, prevalence = 35.27%) for the broad depression phenotype, 30,603 cases and 143,916 controls (n_total = 174,519, prevalence = 17.54%) for the probable MDD phenotype, and 8,276 cases and 209,308 controls (n_total = 217,584, prevalence = 3.80%) for the ICD-coded MDD phenotype.
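The prevalence figures follow directly from the case and control counts. The sketch below recomputes them as cases divided by the total sample, expressed as a percentage; the dictionary and function names are illustrative only.

```python
# (cases, controls) as reported in the text for each phenotype.
counts = {
    "broad depression": (113_769, 208_811),
    "probable MDD": (30_603, 143_916),
    "ICD-coded MDD": (8_276, 209_308),
}


def prevalence_percent(cases, controls):
    """Sample prevalence as a percentage, rounded to two decimal places."""
    return round(100 * cases / (cases + controls), 2)


prevalences = {name: prevalence_percent(*pair) for name, pair in counts.items()}
for name, value in prevalences.items():
    print(f"{name}: n_total = {sum(counts[name]):,}, prevalence = {value:.2f}%")
```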

To validate the three phenotypes we derived for the UK Biobank cohort, genetic correlations were calculated using Linkage Disequilibrium Score regression (LDSR)¹⁴ with summary statistics from the Major Depressive Disorder Working Group of the Psychiatric Genomics Consortium⁵ study, which used a clinically derived phenotype for MDD. We also calculated the genetic correlation with a neuroticism phenotype¹⁵.

Association analysis

We performed a linear association test to assess the effect of each variant using BGENIE v1.1¹¹:

y = Xβ + ε₁

(y – ŷ) = Gb + ε₂

where y was the vector of binary observations for each phenotype (controls coded as 0 and cases coded as 1), β was the matrix of fixed effects, including sex, age, genotyping array, and 8 principal components, and X was the corresponding incidence matrices. (y – ŷ) was a vector of phenotypes residualized on the fixed effect predictors, G was a vector of expected genotype counts of the effect allele (dosages), b was the effect of the genotype on residualized phenotypes, and ε₁ and ε₂ were vectors of normally distributed errors.
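The two-stage model described above, in which the phenotype is first residualized on the fixed effects and the residuals are then regressed on genotype dosages, can be sketched in plain Python as follows. This is an illustration of the general technique under stated assumptions, not a reproduction of BGENIE: the helper names are hypothetical, the dosages are used as given (whether they are centred or also adjusted for covariates is not stated in the text), and the covariate matrix is assumed to include an intercept column.

```python
def ols(X, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(p)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]


def two_step_effect(X, y, G):
    """Stage 1: residualize y on X. Stage 2: regress residuals on dosages G."""
    beta = ols(X, y)
    resid = [yi - sum(b * x for b, x in zip(beta, row)) for row, yi in zip(X, y)]
    # Least-squares slope without intercept: b = G'r / G'G.
    return sum(g * r for g, r in zip(G, resid)) / sum(g * g for g in G)


# Toy data: intercept plus one covariate, and a dosage vector orthogonal to both.
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
G = [1, -1, -1, 1]
y = [1 + 2 * row[1] + 0.5 * g for row, g in zip(X, G)]
print(two_step_effect(X, y, G))
```

In the toy example the dosage vector is constructed to be orthogonal to the covariates, so the second stage recovers the simulated effect of 0.5; with correlated covariates and dosages the two-stage estimate would generally differ from a joint fit.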

Genome-wide statistical significance was determined by the conventional threshold of a P-value of association < 5 × 10⁻⁸. To identify independent significant variants, the clump command in Plink 1.90b4¹⁶ was applied using --clump-p1 1e-4 --clump-p2 1e-4 --clump-r2 0.1 --clump-kb 3000, mirroring the approach of the Major Depressive Disorder Working Group of the Psychiatric Genomics Consortium³. Therefore, variants that were within 3 Mb of each other and shared linkage disequilibrium (r²) greater than 0.1 were clumped together and only the most significant variant was reported. Due to the complexity of the major histocompatibility complex (MHC) region, an approach similar to that of The Schizophrenia Psychiatric Genome-Wide Association Study¹⁷ was taken and only the most significant variant across that region is reported.
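The clumping logic described above can be illustrated with a simplified greedy procedure. This is a sketch of the general idea only and is not Plink's implementation: the function name, the variant identifiers, and the pairwise r² lookup are all hypothetical, and the defaults simply mirror the thresholds quoted in the text.

```python
def clump(variants, r2, p1=1e-4, p2=1e-4, r2_min=0.1, window_bp=3_000_000):
    """Greedy clumping sketch.

    variants: dict mapping variant id -> (chromosome, position, p_value).
    r2: callable (id_a, id_b) -> pairwise r-squared (hypothetical lookup).
    Returns a dict mapping each index variant to the variants clumped with it.
    """
    order = sorted(variants, key=lambda v: variants[v][2])  # most significant first
    assigned = set()
    clumps = {}
    for index in order:
        chrom, pos, p = variants[index]
        if index in assigned or p > p1:
            continue
        assigned.add(index)
        members = []
        for other in order:
            if other in assigned:
                continue
            o_chrom, o_pos, o_p = variants[other]
            if (o_chrom == chrom and abs(o_pos - pos) <= window_bp
                    and o_p <= p2 and r2(index, other) > r2_min):
                members.append(other)
                assigned.add(other)
        clumps[index] = members
    return clumps


# Toy example with made-up identifiers and LD values.
toy = {
    "varA": (1, 1_000_000, 1e-10),
    "varB": (1, 1_500_000, 1e-6),
    "varC": (1, 9_000_000, 1e-9),
    "varD": (2, 1_000_000, 1e-5),
}
ld = {frozenset(("varA", "varB")): 0.8}
result = clump(toy, lambda a, b: ld.get(frozenset((a, b)), 0.0))
independent_hits = [v for v in result if toy[v][2] < 5e-8]
print(result, independent_hits)
```

In this toy data varB is absorbed into the clump led by varA, while varC lies outside the 3 Mb window and varD is on another chromosome, so each forms its own clump; only the index variants below 5 × 10⁻⁸ would be reported as independent genome-wide significant hits.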

LDSR¹⁴ was used to provide a SNP-based estimate of the heritability of the phenotypes using the whole-genome summary statistics obtained by the association analyses. LDSR was also used to examine the data for evidence of inflation of the test statistics, based on the intercept, due to population stratification.
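The idea behind the LDSR estimates can be sketched with a deliberately simplified, unweighted regression of association χ² statistics on LD scores, where the slope is rescaled to a heritability estimate and the intercept is inspected for inflation. The published LDSR method uses a weighted regression with block-jackknife standard errors, which this sketch omits; the function name and toy inputs are assumptions, and no conversion to the liability scale is attempted.

```python
def ldsr_estimates(chi2, ld_scores, n_samples, n_variants):
    """Unweighted sketch of LD score regression.

    Fits chi2_j = intercept + slope * ld_score_j by least squares and
    returns (h2, intercept), with h2 = slope * n_variants / n_samples.
    """
    mean_x = sum(ld_scores) / len(ld_scores)
    mean_y = sum(chi2) / len(chi2)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ld_scores, chi2))
    sxx = sum((x - mean_x) ** 2 for x in ld_scores)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope * n_variants / n_samples, intercept


# Toy inputs generated from h2 = 0.1 and an intercept of 1.0 (no inflation).
toy_ld = [10, 50, 100, 200]
toy_chi2 = [1.0 + 0.01 * l for l in toy_ld]
h2, intercept = ldsr_estimates(toy_chi2, toy_ld, n_samples=100_000, n_variants=1_000_000)
print(round(h2, 3), round(intercept, 3))
```

An intercept close to 1 is the pattern expected when inflation of the test statistics reflects polygenic signal rather than population stratification.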

Gene- and region-based analyses

Two downstream analyses of the results were conducted using MAGMA⁸ (Multi-marker Analysis of GenoMic Annotation) by applying a principal component regression model to the results of our association analyses. In the first downstream analysis, a gene-based analysis was performed for each phenotype using the results from our GWAS. Genetic variants were assigned to genes based on their position according to the NCBI 37.3 build, resulting in a total of 18,033 genes being analysed. The European panel of the 1,000 Genomes data (phase 1, release 3)¹⁸ was used as a reference panel to account for linkage disequilibrium. A genome-wide significance threshold for gene-based associations was calculated using the Bonferroni method (α = 0.05 / 18,033; P < 2.77 × 10⁻⁶).

In the second downstream analysis, a region-based analysis was performed for each phenotype. To determine the regions, haplotype blocks identified by recombination hotspots were used, as described by Shirali, et al.¹⁹ and implemented in an analysis of MDD by Zeng, et al.²⁰ for detecting causal regions. Block boundaries were defined by hotspots of at least 30 cM per Mb based on a European subset of the 1,000 Genomes Project recombination rates. This resulted in a total of 8,308 regions being analysed using the European panel of the 1,000 Genomes data (phase 1, release 3)¹⁸ as a reference panel to account for linkage disequilibrium. A genome-wide significance threshold for region-based associations was calculated using the Bonferroni correction method (α = 0.05 / 8,308; P < 6.02 × 10⁻⁶).
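The two Bonferroni thresholds used for the gene-based and region-based analyses can be reproduced with a few lines. The function name is illustrative; the test counts are those quoted in the text.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold under the Bonferroni method."""
    return alpha / n_tests


gene_threshold = bonferroni_threshold(18_033)   # genes analysed
region_threshold = bonferroni_threshold(8_308)  # regions analysed
print(f"gene-based:   P < {gene_threshold:.2e}")
print(f"region-based: P < {region_threshold:.2e}")
```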

Pathway analysis

The pathway analysis was performed on our gene-based analysis results. The analysis was a gene-set enrichment analysis conducted using gene-annotation files from the Gene Ontology (GO) Consortium (http://geneontology.org/)²¹ taken from the Molecular Signatures Database (MSigDB) v5.2²². The GO Consortium includes gene-sets for three ontologies: molecular function, cellular component and biological process. This annotation file consisted of 5,917 gene-sets, which were corrected for multiple testing using the MAGMA default setting of 10,000 permutations. Visualisation of pathways was obtained using the online tool, GeneMANIA²³.

eQTL identification

The online GTEx portal (https://www.gtexportal.org/home/) was used to determine whether any of the genome-wide significant variants for each phenotype were eQTL⁹.

Results

We conducted a genome-wide association study testing the effect of 7,826,341 variants on three depression phenotypes using up to 322,580 UK Biobank participants. The study demographics for each UK Biobank phenotype and within the case and control groups are provided in Table 1.

Table 1.

Number of individuals, number of each sex, mean age in years, age range in years for each of the assessed UK Biobank phenotypes and within the respective case and control groups

The estimated SNP-based heritabilities, the genetic correlations between each UK Biobank phenotype, and the genetic correlations of each UK Biobank phenotype with a clinically defined MDD phenotype, obtained from the study conducted by the Major Depressive Disorder Working Group of the Psychiatric Genomics Consortium⁵, and with a neuroticism phenotype¹⁵ are provided in Table 2.

Table 2.

The SNP-based heritability (h²), the genetic correlations (r_g) between each UK Biobank phenotype and the r_g with a depression and a neuroticism phenotype obtained from separate studies† for each of the assessed UK Biobank phenotypes.

There were 1,643 variants that were genome-wide significant (P < 5 × 10⁻⁸) for an association with broad depression, of which 14 were independent (Table 3). The association analysis of probable MDD identified 20 variants with P < 5 × 10⁻⁸ and of these two were independent (Table 4). There was one independent genome-wide significant variant for ICD-coded MDD (Table 5). Manhattan plots of all the variants analysed are provided in Figures 1, 2, and 3 for broad depression, probable MDD, and ICD-coded MDD, respectively. Q-Q plots of the observed P-values on those expected are provided in Supplementary Figures 1, 2, and 3 for broad depression, probable MDD, and ICD-coded MDD, respectively. There were 4,390, 189 and 108 variants with P < 1 × 10⁻⁶ for an association with broad depression (see Supplementary Table 1), probable MDD (see Supplementary Table 2), and ICD-coded MDD (see Supplementary Table 3), respectively. None of the phenotypes examined provided evidence of inflation of the test statistics due to population stratification (see Supplementary Table 4).

Table 3.

Independent variants with a genome-wide significant (P < 5 × 10⁻⁸) association with broad depression

Table 4.

Independent variants with a genome-wide significant (P < 5 × 10⁻⁸) association with probable MDD

Table 5.

Independent variants with a genome-wide significant (P < 5 × 10⁻⁸) association with ICD-coded MDD

Figure 1.

Manhattan plot of the observed –log₁₀ P-values of each variant for an association with broad depression in the UK Biobank cohort. Variants are positioned according to the GRCh37 assembly.

Figure 2.

Manhattan plot of the observed –log₁₀ P-values of each variant for an association with probable MDD in the UK Biobank cohort. Variants are positioned according to the GRCh37 assembly.

Figure 3.

Manhattan plot of the observed –log₁₀ P-values of each variant for an association with ICD-coded MDD in the UK Biobank cohort. Variants are positioned according to the GRCh37 assembly.