Abstract
Genome-wide association studies have identified hundreds of risk loci for autoimmune disease, yet only a minority (~25%) share a single genetic effect with changes to gene expression (eQTLs) in primary immune cell types. RNA-Seq based quantification at whole-gene resolution, where abundance is estimated by summing the expression of all transcripts or exons of the same gene, is likely to account for this observed lack of colocalisation, as subtle isoform switches and expression variation in independent exons are concealed. We perform integrative cis-eQTL analysis using association data from twenty autoimmune diseases (846 SNPs; 584 independent loci), with RNA-Seq expression from the GEUVADIS cohort profiled at gene-, isoform-, exon-, junction-, and intron-level resolution. After testing for a shared causal variant, we found that exon- and junction-level analyses produced the greatest frequency of candidate-causal cis-eQTLs, many of which were concealed at whole-gene resolution. In fact, only 9% of autoimmune loci shared a disease-relevant eQTL effect at gene-level. Expression profiling at all resolutions was, however, necessary to capture the full array of eQTL associations, and by doing so, we found 45% of loci were candidate-causal cis-eQTLs. Our findings are provided as a web resource for the functional annotation of autoimmune disease association studies (www.insidegen.com). As an example, we dissect the genetic associations of Ankylosing Spondylitis, as only a handful of loci have documented causative relationships with gene expression. We classified fourteen of the thirty-one associated SNPs as candidate-causal cis-eQTLs. Many of the newly implicated genes had direct relevance to inflammation through regulation of TNF signalling (for example NFATC2IP, PDE4A, and RUSC1), and were supported by integration of functional genomic data from epigenetic and chromatin interaction studies.
We have provided a deeper mechanistic understanding of the genetic regulation of gene expression in autoimmune disease by profiling the transcriptome at multiple resolutions.
Author Summary
It is now well acknowledged that non-coding genetic variants contribute to susceptibility of autoimmune disease through alteration of gene expression levels (eQTLs). Identifying the variants that are causal to both disease risk and changes to expression levels has not been easy, and we believe this is in part due to how expression is quantified using RNA-Sequencing (RNA-Seq). Whole-gene expression, where abundance is estimated by summing the expression of all transcripts or exons of the same gene, is conventionally used in eQTL analysis. This low resolution may conceal subtle isoform switches and expression variation in independent exons. Using isoform-, exon-, and junction-level quantification can not only point to the candidate genes involved, but also the specific transcripts implicated. We make use of existing RNA-Seq expression data profiled at gene-, isoform-, exon-, junction-, and intron-level, and perform eQTL analysis using association data from twenty autoimmune diseases. We find that exon- and junction-level analyses thoroughly outperform gene-level analysis, and by leveraging all five quantification types, we find 45% of autoimmune loci share a single genetic effect with gene expression. We highlight that existing and new eQTL cohorts using RNA-Seq should profile expression at multiple resolutions to maximise the ability to detect causal eQTLs and candidate-genes.
Introduction
The autoimmune diseases are a family of heritable, often debilitating, complex disorders whereby immune system dysfunction leads to loss of tolerance to self-antigens and chronic inflammation [1]. Genome-wide association studies (GWAS) have now detected hundreds of susceptibility loci contributing to risk of autoimmunity [2] yet their biological interpretation still remains challenging [3]. Mapping single nucleotide polymorphisms (SNPs) that influence gene expression (eQTLs) can provide crucial insight into the potential candidate genes and etiological pathways connected to discrete disease phenotypes [4]. For example, such analyses have implicated dysregulation of autophagy in Crohn’s Disease [5], the pathogenic role of CD4+ effector memory T-cells in Rheumatoid Arthritis [6], and an overrepresentation of transcription factors in Systemic Lupus Erythematosus [7].
Expression profiling in appropriate cell types and physiological conditions is necessary to capture the pathologically relevant regulatory changes driving disease risk [8]. Lack of such expression data is thought to explain the observed disparity of shared genetic architecture between disease association and gene expression at certain autoimmune loci [9]. A much overlooked cause of this disconnect however, is not only the use of microarrays to profile gene expression, but also the resolution to which expression is quantified using RNA-Sequencing (RNA-Seq) [10]. Expression estimates of whole-genes, individual isoforms and exons, splice-junctions, and introns are obtainable with RNA-Seq [11–18]. The SNPs that affect these discrete units of expression vary strikingly in their proximity to the target gene, localisation to specific epigenetic marks, and effect on translated isoforms [18]. For example, in over 57% of genes with both an eQTL influencing overall gene expression and a transcript ratio QTL (trQTL) affecting the ratio of each transcript to the gene total, the causal variants for each effect are independent and reside in distinct regulatory elements of the genome [18].
RNA-Seq based eQTL investigations that solely rely on whole-gene expression estimates are likely to mask the allelic effects on independent exons and alternatively-spliced isoforms [16–19]. This is in part due to subtle isoform switches and expression variation in exons that cannot be captured at gene-level [20]. Recent evidence also suggests that exon-level based strategies are more sensitive than conventional gene-level approaches, and allow for detection of moderate but systematic changes in gene expression that are not necessarily derived from alternative-splicing events [15,21]. Furthermore, gene-level summary counts can be biased in the direction of extreme exon outliers [21]. Use of isoform-, exon-, and junction-level quantification in eQTL analysis can also point not only to the candidate genes involved, but also to the specific transcripts or functional domains affected [10,18]. This facilitates the design of targeted functional studies and better illuminates the causative relationship between regulatory genetic variation and disease. Lastly, though intron-level quantification is not often used in conventional eQTL analysis, it can still provide valuable insight into the role of unannotated exons in reference gene annotations, retained introns, and even intronic enhancers [22,23].
Low-resolution expression profiling with RNA-Seq will impede the subsequent identification of causal eQTLs when applying genetic and epigenetic fine-mapping approaches [24]. In this investigation, we aim to increase our knowledge of the regulatory mechanisms and candidate genes of human autoimmune disease through integration of GWAS and RNA-Seq expression data profiled at gene-, isoform-, exon-, junction-, and intron-level in lymphoblastoid cell lines (LCLs). Our findings are provided as a web resource to interrogate the functional effects of autoimmune associated SNPs (www.insidegen.com), and will serve as the basis for targeted follow-up investigations.
Results
Detection of cis-eQTLs and candidate-genes of autoimmune disease using RNA-Seq
Using association data from twenty human autoimmune diseases, we performed integrative cis-eQTL analysis in lymphoblastoid cell lines (LCLs) with RNA-Seq expression data profiled at five resolutions: gene-, isoform-, exon-, junction-, and intron-level. We tested for a shared causal variant between disease and expression at each association. The 846 autoimmune-associated SNPs taken forward for analysis are documented in S1 Table and an overview of the analysis pipeline to detect candidate-causal cis-eQTLs and eGenes is depicted in Fig 1. Expression targets at each level of RNA-Seq quantification were interrogated in cis (+/-1Mb) to the 846 SNPs; comprising a total of 7,969 genes, 28,220 isoforms, 54,043 exons, 49,909 junctions, and 35,662 introns (Fig 2A).
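The windowing rule used to select expression targets can be sketched in a few lines of Python. This is an illustrative sketch only, not the pipeline used in the study (which was run in R). The function name is hypothetical, and two details are assumptions because the text does not specify them: distance is measured from the SNP to the nearest edge of the target, and targets on a different chromosome are treated as trans. The 1 Mb and 5 Mb thresholds are those given in the text and in the Methods.

```python
CIS_WINDOW_BP = 1_000_000   # +/- 1 Mb around the SNP, as stated in the text
TRANS_MIN_BP = 5_000_000    # > 5 Mb from the SNP, as stated in the Methods


def classify_target(snp_chrom, snp_pos, target_chrom, target_start, target_end):
    """Return 'cis', 'trans', or None for one SNP/expression-target pair."""
    if snp_chrom != target_chrom:
        # Assumption: targets on other chromosomes are counted as trans.
        return "trans"
    # Assumption: distance is taken to the nearest edge of the target,
    # and is zero when the SNP lies inside the target.
    if target_start <= snp_pos <= target_end:
        distance = 0
    else:
        distance = min(abs(snp_pos - target_start), abs(snp_pos - target_end))
    if distance <= CIS_WINDOW_BP:
        return "cis"
    if distance > TRANS_MIN_BP:
        return "trans"
    return None  # between 1 Mb and 5 Mb: tested in neither analysis
```

Under these assumptions, a target lying between 1 Mb and 5 Mb from the SNP is tested in neither the cis nor the trans analysis.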
We found that cis-eQTL association analysis using exon-, junction-, and intron-level quantification yielded the greatest frequency of significant (q < 0.05) cis-eQTLs and eGenes (Fig 2B). These findings persisted after testing whether each statistically significant cis-eQTL showed strong evidence for colocalisation with the genetic variant underlying the autoimmune disease association (q < 0.05 and RTC ≥ 0.95). For clarity, we define such eQTLs as candidate-causal cis-eQTLs and we define their targets as eGenes (Fig 2C). Exon-level analysis detected the most candidate-causal cis-eQTLs (235) and eGenes (233) out of all quantification types, followed by junction- and intron-level quantification. Isoform- and gene-level analysis were thoroughly outperformed, with the latter detecting only 70 candidate-causal cis-eQTLs and 65 eGenes. In fact, we observed that gene-level quantification presented the greatest dropout of significant cis-eQTLs that were candidate-causal (Fig 2D). Only 23.8% of significant cis-eQTLs were candidate-causal at gene-level compared to 49.8% at exon-level, suggesting that, in the autoimmune susceptibility loci tested, more strongly associated cis-eQTLs are captured by the exon-level analysis and that they are distinct from gene-level cis-eQTLs. Gene-level analysis therefore underestimated the number of candidate-causal eGenes. Our findings, highlighting the need to profile gene expression at multiple resolutions, are summarised in Fig 2E.
Profiling at all resolutions is necessary to capture the full array of associated cis-eQTLs
We pruned the 846 autoimmune associated SNPs using an r2 cut-off of 0.8 and 100kb limit to create a subset of 584 independent susceptibility loci. By combining all five resolutions of RNA-Seq, we found 267 loci (45.7%) presented a shared genetic effect between disease association and gene expression (Fig 3A). Strikingly, only 9.3% of associated loci shared an underlying causal variant at gene-level, in contrast to the 29.1% classified at exon-level. We mapped the candidate-causal cis-eQTLs detected by RNA-Seq back to the diseases to which they are associated (Fig 3B). On average, 47% of associated SNPs per disease were classified as candidate-causal cis-eQTLs using all five RNA-Seq quantification types. Interestingly, we observed the diseases that fell most below this average comprised autoimmune disorders related to the gut: celiac disease (29%), inflammatory bowel disease (36%), ulcerative colitis (39%), and Crohn’s disease (41%), as well as Type 1 Diabetes (37%). These observations are possibly a result of the cellular expression specificity of associated genes in colonic and pancreatic tissue. This conclusion is supported by the above-average frequency of candidate-causal cis-eQTLs detected in Systemic Lupus Erythematosus (50%) and Rheumatoid Arthritis (54%); diseases in which the pathogenic role of B-lymphocytes is well documented [33,34]. We further broke down our results per disease by RNA-Seq quantification type (Fig 3C) and in almost all cases, the greatest frequency of candidate-causal cis-eQTLs and eGenes were captured by exon- and junction-level analyses.
By separating candidate-causal cis-eQTL associations out by quantification type, we found over half were detected by either exon- or junction-level, and considerable overlap of cis-eQTL associations existed between both types (Fig 3D). The greatest correlation of effect sizes (r2: 0.88) of candidate-causal cis-eQTLs was observed between exon- and junction-level (S1 Fig). Strong correlation also existed between the effect sizes of gene- and isoform-level candidate-causal cis-eQTLs as expected (r2: 0.83); yet gene-level analysis detected only 19% of all candidate-causal associations. Gene- and isoform-level analysis did however capture six and eighteen candidate-causal cis-eQTLs unique to their quantification type respectively. Thus, our data suggest that although exon- and junction-level, and to a lesser extent intron-level analysis, capture the majority of candidate-causal cis-eQTL associations, it is necessary to profile gene expression at all quantification types to avoid misinterpretation of the functional impact of disease associated SNPs.
Web resource for functional interpretation of association studies of autoimmune disease
We provide our data as a web resource (www.insidegen.com) for researchers to look up candidate-causal cis-eQTLs and eGenes of autoimmune diseases detected across the five RNA-Seq quantification types. Data are sub-settable and exportable by SNP ID, gene, RNA-Seq resolution, genomic position, and association to specific autoimmune diseases. Full data are also made available in S2 Table.
Functional dissection of Ankylosing Spondylitis genetic associations using RNA-Seq
We applied the results of our integrative cis-eQTL analysis to functionally dissect the genetic associations of ankylosing spondylitis (AS). By doing so, we highlight the necessity of profiling at all resolutions of RNA-Seq to shed light on novel regulatory variants, candidate genes, and molecular pathways involved in pathogenesis. AS is a heritable inflammatory arthritis with a largely unexplained genetic contribution outside of the HLA-B*27 allele (> 30 risk loci) [35,36]. Only a handful of loci show causative relationships with changes in gene expression [35,36]. Candidate-genes implicated by association studies however suggest that discrete immunological processes such as antigen presentation, lymphocyte differentiation and activation, and regulation of the TNF/NF-κB signalling pathways are involved. Of note, strong genetic overlap exists with psoriasis, psoriatic arthritis, and inflammatory bowel disease, indicating that the pathogenesis of these diseases is tightly connected [37].
Of the 31 AS associated SNPs taken forward for functional interrogation, 14 were classified as candidate-causal cis-eQTLs (Fig 4A; full results found at www.insidegen.com for all diseases). We replicated the association of risk allele rs4129267 [C] with expression reduction of IL6R by junction-level analysis (β = −0.36; P = 1.14 × 10−06) [35]. Interestingly, we found the expression of the neighbouring gene, RUSC1, is also influenced by candidate-causal cis-eQTL rs4129267, where the risk allele also reduced expression of RUSC1 at exon-level (β = −0.24; P = 1.59 × 10−03). RUSC1 is able to polyubiquitinate IKBKG, a key regulator of NF-κB [38].
The effect of independently associated variants within the 5q15 locus on the expression of aminopeptidase genes ERAP1 and ERAP2 was also replicated (S3 Fig) [36]. This includes the association of risk allele rs30187 [T] with increased expression of ERAP1 (β = −1.09; P = 1.60 × 10−71), and the striking effect of protective allele rs2910686 [T] on the near-complete loss of ERAP2 (β = −1.37; P = 1.95 × 10−175). Again however, additional genes at this locus with no previous association to expression changes with regards to AS risk alleles were detected. LNPEP also belongs to the endoplasmic reticulum aminopeptidase family and has been shown to regulate the NF-κB pathway and antigen presentation via peptide trimming [39]. Interestingly, a missense variant in this gene is linked to psoriasis, and the gene is down-regulated in psoriatic lesions relative to healthy skin [40]. We found at junction-level, AS risk allele rs2910686 [C] also contributes to expression reduction of LNPEP (β = −0.41; P = 3.09 × 10−08). Similarly, the risk allele rs30187 [T] correlated strongly with decreased expression of CAST (β = −0.46; P = 2.47 × 10−10), which encodes calpastatin, a calcium-dependent cysteine protease inhibitor. Cysteine protease activity positively correlates with the severity of arthritic lesions and degree of inflammation [41]. Our data support the notion of multiple functional effects at this locus and suggest novel pathological mechanisms, including decreased expression of the inhibitor CAST leading to increased cysteine protease activity.
Other AS susceptibility loci contributing to expression modulation of multiple genes include rs9901869 for TBKBP1 and ITGB3, and rs75301646 for NFATC2IP and TUFM. Candidate genes at the rs9901869 locus are yet to be functionally characterised [35]. Our data suggest the risk allele rs9901869 [A] increases the expression of both TBKBP1, which plays an active role in the NF-κB and IFN-α signalling pathways (β = 0.57; P = 2.47 × 10−17) [42], and ITGB3, which is involved in the intestinal immune pathway for IgA production (β = 0.35; P = 1.53 × 10−6) [43]. Similarly, we found the risk allele rs9901869 [A] increases expression of both novel candidate-causal eGenes NFATC2IP (β = 0.25; P = 6.18 × 10−4) and TUFM (β = 0.60; P = 2.82 × 10−17). TUFM has been reported as the causative gene at this locus for early onset inflammatory bowel disease [44], whereas NFATC2IP (Nuclear Factor of Activated T-cells 2 Interacting Protein) has clear immunological roles in the induction of IL-4 production and regulation of the TNF receptor family of proteins [45]. Our analysis has shed new light on the molecular genetics of AS and can be used in a similar manner for the functional dissection of the remaining 19 autoimmune diseases (www.insidegen.com).
Functional genomic support for candidate-causal cis-eQTLs
The resolution of RNA-Seq can be leveraged to map candidate-genes and isolate specific exons and junctions perturbed by disease-associated variants. Functional genomic data can then be used to support potential causal associations to deduce molecular mechanisms and epigenetically prioritize causal variants.
The remaining AS associated variant rs1128905 is a candidate-causal cis-eQTL for both CARD9 and SNAPC4 (Fig 4A). The candidate-gene at this AS locus is thought to be CARD9 [36]. Our results also draw attention to SNAPC4 (Small Nuclear RNA-activating Complex Polypeptide 4). Using exon-level RNA-Seq, the risk allele rs1128905 [C] decreased the expression of exons 18 (β = −0.37; P = 3.65 × 10−07) and 19 (β = −0.27; P = 4.80 × 10−04) of the canonical transcript of SNAPC4 (Fig 4B). Accordingly, expression of the exon 18-19 boundary was also significantly decreased, as captured by junction-level quantification (β = −0.26; P = 2.74 × 10−04). As rs1128905 lies over 39kb away from the transcription start site of SNAPC4, we used existing promoter capture Hi-C data in lymphoblastoid cell lines to assess whether rs1128905 and associated SNPs may act distally upon SNAPC4 to influence its expression [32]. We found the bait region encompassing rs1128905 interacts with five targets with high confidence (CHiCAGO score > 12) (Fig 4C) [46]. Four of these are located within the SNAPC4 gene itself. Adding further evidence from histone marks in lymphoblastoid cell lines from the RoadMap Epigenomics Project [31], we found two SNPs in near-perfect LD with rs1128905 (r2 > 0.95; rs10870201 and rs10870202) were localised to the peaks of H3K4me3, H3K27ac, and H3K9ac marks, and the region encompassed is predicted to be an active enhancer (S4 Fig). Our data therefore suggest that associated SNPs rs10870201 and rs10870202 may perturb the enhancer-promoter interaction with SNAPC4, affecting its expression. In fact, rs10870201 was the best cis-eQTL in the 1Mb region for exons 18 and 19 of SNAPC4.
Interestingly, although no autoimmune phenotype has been documented with SNAPC4, an uncorrelated SNP rs10781500 (r2 with rs1128905 < 0.5), associated with Crohn’s Disease, inflammatory bowel disease, and ulcerative colitis, has also been classified as a candidate-causal cis-eQTL for SNAPC4 but not CARD9 in ex vivo human B-lymphocytes (the risk allele is also correlated with reduced expression of SNAPC4) [47]. This effect holds true in our analysis - rs10781500 is an eQTL for SNAPC4 but not CARD9.
Our data point to candidate genes and molecular mechanisms, but further functional characterization is necessary to determine the true causative gene(s) at this locus.
Detection of autoimmune associated trans-eQTLs using RNA-Seq
We extended our RNA-Seq based eQTL investigation to include expression targets > 5Mb away from each of the 846 lead autoimmune GWAS variants (S3 Table). Though we were relatively underpowered for a trans-eQTL analysis, we were able to detect 26 trans-eQTLs at isoform-level, eight at exon-level, six at gene-level, and three at junction-level (Fig 5A). Many of the trans-eQTLs detected were only associated with one eGene, and no trans-eQTLs were detected at intron-level. With exon-level quantification however, we were able to identify an interesting effect of trans-eQTL rs7726414, which is associated with Systemic Lupus Erythematosus (SLE) [7]. We found that rs7726414 was a trans-eQTL for eight eGenes (Fig 5B). These comprise SIPA1L2, PDPK1, IVNS1ABP, HES2, JAZF1, ULK4, RP11-51F16.8, and PPM1M. We found the risk allele rs7726414 [T] was associated with increased expression of each of these eight genes (Fig 5C). We highlight PDPK1, a key regulator of IRF4 and inducer of apoptosis [48], and JAZF1, which is genetically associated with many autoimmune diseases including SLE itself [7]. The serine/threonine-protein kinase ULK4 is also of interest as its family member, ULK3, is also an SLE susceptibility gene [7]. Though we did not classify rs7726414 as a candidate-causal cis-eQTL in our dataset, it has been documented as candidate-causal in SLE using a larger eQTL cohort profiled in lymphoblastoid cell lines for eGenes TCF7 (Transcription Factor 7, T-Cell Specific, HMG-Box) and the ubiquitin ligase complex SKP1 [10].
Discussion
Elucidation of the functional consequence of non-coding genetic variation in human disease is a major objective of medical genomics [49]. Integrative studies that map disease-associated eQTLs in relevant cell types and physiological conditions are proving essential in progression towards this goal through identification of causal SNPs, candidate-genes, and illumination of molecular mechanisms [50]. In autoimmune disease, where there is considerable overlap of immunopathology, integrative eQTL investigations have been able to connect discrete aetiological pathways, cell types, and epigenetic modifications, to particular clinical manifestations [2,50–52]. Emerging evidence however suggests that only a minority (~25%) of autoimmune associated SNPs share causal variants with cis-eQTLs in primary immune cell-types [9].
Genetic variation can influence expression at every stage of the gene regulatory cascade - from chromatin dynamics, to RNA folding, stability, and splicing, and protein translation [53]. As RNA-Seq becomes the convention for genome-wide transcriptomics, be it for differential expression or eQTL analysis, it is essential to maximise the ability to resolve and quantify discrete transcriptomic features. It is now well documented that SNPs affecting these units of expression vary strikingly in their genomic location and localisation to specific epigenetic marks [18]. The reasoning for our investigation therefore was to delineate the limits of microarray and RNA-Seq based eQTL cohorts in the functional annotation of autoimmune disease association signals. To map autoimmune disease associated cis-eQTLs, we interrogated RNA-Seq expression data profiled at gene-, isoform-, exon-, junction-, and intron-level, and tested for a shared genetic effect at each significant association. We found exon- and junction-level quantification led to the greatest frequency of candidate-causal cis-eQTLs and eGenes, and thoroughly outperformed gene-level analysis (Fig 2C). We argue however that it is necessary to profile expression at all possible resolutions to diminish the likelihood of overlooking potentially causal cis-eQTLs (Fig 3D). In fact, by combining our results across all resolutions, we found 45% of autoimmune loci were candidate-causal cis-eQTLs for at least one eGene. Our findings can be used as a resource to look up causal eQTLs and candidate genes of autoimmune disease (www.insidegen.com).
Gene-level expression estimates can generally be obtained in two ways – union-exon based approaches [14,17] and transcript-based approaches [11,12]. In the former, all overlapping exons of the same gene are merged into union exons, and intersecting exon and junction reads (including split-reads) are counted to these pseudo-gene boundaries. Using this counting-based approach, it is also possible to quantify meta-exons and junctions easily and with high confidence by preparing the reference annotation appropriately [13,15,54]. Introns can be quantified in a similar manner by inverting the reference annotation between exons and introns [18]. Conversely, transcript-based approaches make use of statistical models and expectation maximization algorithms to distribute reads among gene isoforms - resulting in isoform expression estimates [11,12]. These estimates can then be summed to obtain the entire expression estimate of the gene. Greater biological insight is gained from isoform-level analysis; however, disambiguation of specific transcripts is not trivial due to substantial sequence commonality of exons and junctions. In fact, we found only 15% of autoimmune loci shared a causal variant at transcript-level (Fig 3A). The different approaches used to estimate expression can also lead to significant differences in the reported counts. Union-based approaches, whilst computationally less expensive, can underestimate expression levels relative to transcript-based, and this difference becomes more pronounced when the number of isoforms of a gene increases, and when expression is primarily derived from shorter isoforms [20]. The GEUVADIS study implemented a transcript-based approach to obtain whole-gene expression estimates. A gold standard of eQTL mapping using RNA-Seq is essential therefore for comparative analysis across datasets.
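The two routes to a gene-level estimate described above can be illustrated with a minimal Python sketch. It is not the code used by any of the cited tools. The function names, the example coordinates, and the use of inclusive (start, end) intervals are assumptions made for illustration.

```python
def merge_exons(exons):
    """Merge overlapping (start, end) exon intervals into union exons.

    Assumes inclusive coordinates on a single strand of one gene.
    """
    merged = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous union exon, so extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def gene_level_from_isoforms(isoform_estimates):
    """Transcript-based gene estimate: the sum of the isoform estimates."""
    return sum(isoform_estimates.values())


# Hypothetical gene with two isoforms that share partly overlapping exons.
isoform_a = [(100, 200), (300, 400)]
isoform_b = [(150, 250), (300, 350), (500, 600)]
union_exons = merge_exons(isoform_a + isoform_b)
```

In the union-exon route, reads would then be counted against `union_exons`; in the transcript-based route, reads are first distributed among isoforms by a statistical model and the resulting estimates are summed.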
Our findings support recent evidence that suggests exon-level based strategies are more sensitive and specific than conventional gene-level approaches [21]. Subtle isoform variation and expression of less abundant isoforms are likely to be masked by gene-level analysis. Exon-level allows for detection of moderate but systematic changes in gene expression that are not captured at gene-level, and also, gene-level summary counts can be shifted in the direction of extreme exon outliers [21]. It is therefore important to note that a positive exon-level eQTL association does not necessarily mean a differential exon-usage or splicing mechanism is involved; rather a systematic expression effect across the whole gene may exist that is only captured by the increased sensitivity. Additionally, by combining exon-level with other RNA-Seq quantification types, inferences can be made on the particular isoforms and functional domains affected by the eQTL which can later aid biological interpretation and targeted follow-up investigations [10].
We found intron-level quantification also generated more candidate-causal cis-eQTLs than gene-level. As the library was synthesised from poly-A selection, these associations are unlikely to be due to differences in pre-mRNA abundance. Rather, they are likely derived from either true retained introns in the mature RNA or from coding exons that are not documented in the reference annotation used. We observed multiple instances where a candidate-causal cis-eQTL at intron-level was detected, yet a previous investigation had detected an exonic effect using a different reference annotation. For example, an intronic effect was detected for SLE candidate eGenes IKZF2 and WDFY4 in this analysis (which used the GENCODE v12 basic reference annotation). Using the comprehensive reference annotation of GENCODE v12, we found these effects were in fact driven by transcribed exons located within the intronic region of the basic annotation, and were validated in vitro by qPCR [10]. The choice of reference annotation therefore has a profound effect on expression estimates [55]; and so again, a gold standard is necessary to prevent misinterpretation and increase consistency of eQTL associations.
Lastly, we show how our findings can be leveraged to comprehensively dissect GWAS results of autoimmune diseases. We found 14 of the 31 SNPs associated with Ankylosing Spondylitis (AS) were candidate-causal cis-eQTLs for at least one eGene (Fig 4). The majority of these eQTLs influenced the expression of multiple eGenes which had direct relevance to biological pathways associated with autoimmunity. In fact, the majority of the candidate genes detected (for example RUSC1, TBKBP1, NFATC2IP, TNFRSF1A, and PDE4A) support the involvement of TNF-α and NF-κB in the pathology of AS [35]. We finally show, at the CARD9-SNAPC4 locus, how existing functional genomic data from chromatin interaction and epigenetic modification experiments can strengthen evidence of the eQTL associations detected by RNA-Seq and allow for functional prioritization of causal variants (Fig 4C). We also highlight the benefit of exon-level analysis for detecting disease associated trans-eQTLs (Fig 5).
Taken together, we have provided a deeper mechanistic understanding of the genetic regulation of gene expression in autoimmune disease by profiling the transcriptome at multiple resolutions using RNA-Seq. Similar analyses in new and existing datasets using relevant cell types and context-specific conditions will undoubtedly increase our understanding of how associated variants alter cell physiology and ultimately contribute to disease risk.
Materials and Methods
Autoimmune disease associated SNPs
SNPs were taken from the ImmunoBase resource (www.immunobase.org). It comprises summary case-control association statistics from twenty diseases: twelve originally targeted by the ImmunoChip consortium (Ankylosing Spondylitis, Autoimmune Thyroid Disease, Celiac Disease, Crohn’s Disease, Juvenile Idiopathic Arthritis, Multiple Sclerosis, Primary Biliary Cirrhosis, Psoriasis, Rheumatoid Arthritis, Systemic Lupus Erythematosus, Type 1 Diabetes, Ulcerative Colitis), and eight others (Alopecia Areata, Inflammatory Bowel Disease, IgE and Allergic Sensitization, Narcolepsy, Primary Sclerosing Cholangitis, Sjogren Syndrome, Systemic Scleroderma, Vitiligo). For eQTL analysis, we took the lead SNPs for each disease, defined as a genome-wide significant SNP with the lowest reported P-value (S1 Table). X-chromosome associations and SNPs with minor allele frequency < 5% were omitted from analysis, leaving 846 SNPs. A total of 262 SNPs were pruned using the ‘--indep-pairwise’ function of PLINK 1.9 with a window size of 100kb and an r2 threshold of 0.8, leaving 584 independent loci.
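The idea behind LD pruning can be sketched as follows. This is a simplified greedy illustration in Python, not PLINK's windowed algorithm, and it will not necessarily reproduce the 584 loci reported here. The function names, the tuple layout, and the use of squared Pearson correlation of allele dosages as r2 are assumptions; only the 100kb distance and the 0.8 threshold come from the text.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return sxy * sxy / (sxx * syy)


def greedy_prune(snps, max_distance_bp=100_000, r2_threshold=0.8):
    """Keep a SNP unless it is in high LD with an already kept nearby SNP.

    snps: list of (snp_id, chrom, pos, dosages) tuples (hypothetical layout).
    Returns the IDs of the retained SNPs in genomic order.
    """
    kept = []
    for snp in sorted(snps, key=lambda s: (s[1], s[2])):
        _, chrom, pos, dosages = snp
        in_ld = any(
            k[1] == chrom
            and abs(pos - k[2]) <= max_distance_bp
            and r_squared(dosages, k[3]) > r2_threshold
            for k in kept
        )
        if not in_ld:
            kept.append(snp)
    return [s[0] for s in kept]
```

A SNP is dropped only when it is both within the distance limit of a retained SNP and correlated with it above the threshold, so perfectly correlated SNPs that lie far apart are both kept.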
RNA-Seq gene expression data
Normalised RNA-Seq expression data of 373 lymphoblastoid cell lines from four European sub-populations (CEU, GBR, FIN, TSI) of the 1000Genomes Project (Geuvadis) [18] were obtained from EBI ArrayExpress (E-GEUV-1; full methods can be found in http://geuvadiswiki.crg.es/). In summary, transcripts, splice-junctions, and introns were quantified using Flux Capacitor against the GENCODE v12 basic reference annotation [16]. Reads belonging to single transcripts were predicted by deconvolution per observations of paired-reads mapping across all exonic segments of a locus. Gene-level expression was calculated as the sum of all transcripts per gene. Annotated splice junctions were quantified using split read information, counting the number of reads supporting a given junction. Intronic regions that are not retained in any mature annotated transcript were also quantified, with mapped reads reported in different bins across the intron to distinguish reads stemming from retained introns from those produced by not yet annotated exons. Meta-exons were quantified by merging all overlapping exonic portions of a gene into non-redundant units and counting reads within these bins [15]. Reads were excluded when the read pairs mapped to two different genes. Quantifications were corrected for sequencing depth and gene length (RPKM). Only expression elements quantified in > 50% of individuals were kept, and Probabilistic Estimation of Expression Residuals (PEER) was used to remove technical variation [25].
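For readers unfamiliar with the normalisation and filtering steps, a minimal Python sketch is given here. It uses the standard RPKM definition and is not taken from the GEUVADIS pipeline. The function names are hypothetical, and treating "quantified" as a value greater than zero is an assumption.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


def quantified_in_majority(values, min_fraction=0.5):
    """True if a feature is quantified in more than min_fraction of individuals.

    Assumption: a feature counts as quantified in an individual when its
    expression value is greater than zero.
    """
    n_quantified = sum(1 for v in values if v > 0)
    return n_quantified / len(values) > min_fraction
```

For example, 500 reads on a 2,000 bp feature in a library of 25 million mapped reads gives an RPKM of 10.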
Cis and trans-eQTL analysis
An overview of the integration pipeline is depicted in Fig 1. Genotypes were obtained from EBI ArrayExpress (E-GEUV-1). The 41 individuals genotyped on the Omni 2.5M SNP array were previously imputed to the Phase 1 v3 release as described [18]. PCA of genotype data was performed using the Bioconductor package SNPRelate (S2 Fig) [26]. Only bi-allelic SNPs with MAF > 0.05, imputation call-rates ≥ 0.8, and HWE P < 1 × 10−04 were used. All eQTL association testing was performed with a linear-regression model in R. Normalized expression residuals (PEER factor normalized RPKM) for each quantification type were transformed to standard normal, and the first three principal components were used as covariates in the eQTL model, as well as the binary imputation status. Cis and trans-eQTL mapping was performed for genes within +/-1Mb of the lead SNP and for genes > 5Mb from the lead SNP respectively. Adjustment for multiple testing of eQTL results per quantification type (corrected total genes, isoforms, exons, junctions, and introns) was undertaken using an FDR of 0.05 for cis and 0.01 for trans analysis (MHC associations were excluded in trans).
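The association model can be sketched in Python as follows. The analysis itself was performed in R, so this is only an illustration of the general approach. The function names are hypothetical, the rank-based form of the normal transformation is an assumption (the text states only that residuals were transformed to standard normal), and the covariate tuples stand in for the three principal components and the imputation status.

```python
from statistics import NormalDist


def inverse_normal_transform(values):
    """Rank-based transform of values to standard normal quantiles."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    transformed = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        # (rank - 0.5) / n stays strictly inside (0, 1). Ties are broken by
        # input order, which is a simplification.
        transformed[idx] = NormalDist().inv_cdf((rank - 0.5) / n)
    return transformed


def ols(design, y):
    """Least-squares coefficients via the normal equations (Gauss-Jordan)."""
    p = len(design[0])
    # Augmented matrix [X'X | X'y].
    a = [
        [sum(row[i] * row[j] for row in design) for j in range(p)]
        + [sum(row[i] * yi for row, yi in zip(design, y))]
        for i in range(p)
    ]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(p):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * z for x, z in zip(a[r], a[col])]
    return [a[i][p] / a[i][i] for i in range(p)]


def eqtl_beta(dosage, expression, covariates):
    """Effect of allele dosage on transformed expression, adjusted for covariates."""
    y = inverse_normal_transform(expression)
    # Design matrix columns: intercept, allele dosage, then the covariates.
    design = [[1.0, g] + list(c) for g, c in zip(dosage, covariates)]
    return ols(design, y)[1]
```

The sketch returns only the effect size (β); a full analysis would also compute a P-value for each test and apply the FDR correction described above.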
Analysis of shared causal variant
The Regulatory Trait Concordance (RTC) method was used to assess the likelihood of a shared causal variant between the GWAS SNP and the cis-eQTL signal [27]. SNPs were firstly classified according to their position in relation to recombination hotspots (based on genome-wide estimates of hotspot intervals) [28]. For each significant cis-eQTL association, the residuals from the linear-regression of the best cis-eQTL (lowest association P-value within the hotspot interval) were extracted against the expression quantification for the expression unit in hand. Regression was then performed using all SNPs within the defined hotspot interval against these residuals. The RTC score was then calculated as RTC = (N_SNPs − Rank_GWAS SNP) / N_SNPs, where N_SNPs is the total number of SNPs in the recombination hotspot interval, and Rank_GWAS SNP is the rank of the GWAS SNP association P-value against all other SNPs in the interval from the linear-association against the residuals of the best cis-eQTL. Disease associated SNPs with statistically significant association to gene expression (q < 0.05) and an RTC score > 0.95 were classified as ‘candidate-causal eQTLs’. Genes whose expression is modulated by the eQTL were defined as ‘candidate-causal eGenes’.
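The scoring and classification step can be written as a short Python sketch. The function names are hypothetical, and the sketch takes the rank as an input rather than deriving it from the regression. Two details are assumptions: the rank is taken to lie between 0 and N_SNPs, and the threshold is applied as RTC ≥ 0.95, as stated in the Results (the Methods text gives > 0.95).

```python
def rtc_score(n_snps, rank_gwas_snp):
    """RTC = (N_SNPs - Rank_GWAS SNP) / N_SNPs for one hotspot interval."""
    if n_snps <= 0 or not 0 <= rank_gwas_snp <= n_snps:
        raise ValueError("rank must lie between 0 and the number of SNPs")
    return (n_snps - rank_gwas_snp) / n_snps


def is_candidate_causal(q_value, rtc, q_max=0.05, rtc_min=0.95):
    """Apply the significance and RTC thresholds to one cis-eQTL association."""
    # Assumption: the RTC threshold is inclusive, following the Results text.
    return q_value < q_max and rtc >= rtc_min
```

For example, a GWAS SNP ranked 3 in an interval of 100 SNPs has an RTC score of 0.97 and, if its association has q < 0.05, would be classified as a candidate-causal cis-eQTL.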
Data visualisation and resources
R version 3.3.1 was used to create heatmaps, box-plots (ggplot2), and circularized chromosome diagrams (circlize). Genes were plotted in UCSC Genome Browser [29] and IGV [30]. Roadmap epigenetic data were downloaded from the web resource [31], and chromatin interaction data were taken from the CHiCP web resource [32].
Acknowledgements
We thank Dr David L Morris for helpful discussions throughout this work. We also thank Philip Tombleson for his assistance with data uploading. The GEUVADIS 1000 Genomes RNA-Seq data was downloaded from the EBI ArrayExpress Portal (accession E-GEUV-1).