Abstract
Long intergenic non-coding RNAs (lincRNAs) are a class of non-protein-coding RNA transcripts that have recently been shown to contribute to gene regulatory processes and disease etiology. It has been hypothesized that lincRNAs influence disease risk through the regulation of mRNA transcription [88], possibly by interacting with regulatory proteins such as chromatin-modifying complexes [37, 50]. This hypothesis is based on analyses of a small number of specific lincRNAs; the cellular roles of lincRNA regulation have not been catalogued genome-wide. Relative to mRNAs, lincRNAs tend to be expressed at lower levels and in more tissue-specific patterns, making genome-wide studies of their regulatory capabilities difficult [15]. Here we develop a method for Mendelian randomization that leverages expression quantitative trait loci (eQTLs) regulating the expression levels of lincRNAs (linc-eQTLs) to perform such a study across four primary tissues. We find that linc-eQTLs are largely similar to protein-coding eQTLs (pc-eQTLs) in cis-regulatory element enrichment, which supports the hypothesis that lincRNAs are regulated by the same transcriptional machinery as protein-coding RNAs [15, 80] and validates our linc-eQTLs. We catalog 74 lincRNAs with linc-eQTLs that are in linkage disequilibrium with trait-associated SNPs (TASs) and are in protein-coding gene deserts; the putative lincRNA-regulated traits are highly enriched for adipose-related traits relative to mRNA-regulated traits.
1 Introduction
Long intergenic non-coding RNAs (lincRNAs) are members of a class of non-protein-coding RNA transcripts greater than 200 nucleotides in length that do not overlap protein-coding gene annotations. LincRNAs have been shown to be necessary for various cellular and organismal processes, including mammalian X-chromosome inactivation [106], telomere maintenance [7], and cellular differentiation and specification [34, 51, 91, 95]. The dysregulation of lincRNAs has been noted in several important human phenotypes, including muscle performance, Alzheimer’s disease, and autoimmune diseases [2, 102]. A role for lincRNAs in disease is consistent with sequence conservation analyses showing that they are under purifying selection [80]. However, lincRNA genes are generally less evolutionarily conserved than protein-coding exons or messenger RNA (mRNA) untranslated regions [36]. LincRNAs are thought to be regulated by the same transcriptional and RNA processing machinery as protein-coding mRNAs [15, 80]. Overall, however, lincRNAs are often expressed at lower levels and in more tissue-specific patterns relative to mRNAs [15, 38].
Of the thousands of lincRNAs recently annotated across human tissues using high-throughput sequencing technologies [15, 23, 73], only a small fraction have been functionally characterized [85]. There is a variety of functional mechanisms by which they are thought to act [9]. Nevertheless, a common theme has emerged that many lincRNAs might function to regulate the transcription of protein-coding genes [72, 88] through either cis (e.g., XIST [18]), trans (e.g., Fendrr [34]), or a combination of cis and trans mechanisms (e.g., lincRNA-p21 [25, 46]). The best studied example of a regulatory lincRNA, Xist, is a ~17 Kb transcript expressed in Eutherian mammalian females from only one of two X-chromosomes [13]. Following capping, splicing, and polyadenylation, Xist bypasses nuclear export pathways to localize to the X-chromosome inactivation center, where it accumulates and spreads in cis [18]. Xist coats the X-chromosome and recruits the Polycomb Repressive Complex 2 (PRC2) [106], which promotes the formation of the repressive chromatin mark H3K27me3. Interactions with chromatin-modifying complexes have also been noted for many other lincRNAs [37, 50, 89, 97, 105], although there are other mechanisms by which lincRNAs can act to affect gene expression. In particular, lincRNAs have been experimentally shown to bind to transcription factors and alter their activity [39], stabilize mRNA transcripts [53], and possibly affect the transcription of neighboring mRNA genes by influencing local transcriptional environments [78]. However, despite detailed functional characterization of individual lincRNAs, there is no consensus on the extent to which lincRNAs regulate transcription or the modes by which they might act.
Here we aim to systematically characterize the global transcriptional effects of many lincRNAs simultaneously and the modes by which they regulate transcription. There are two primary statistical challenges to this work. First, correlations between genes do not discriminate between causal interactions, where the lincRNA directly regulates gene transcription, and unobserved regulatory factors, where the transcription of both the lincRNA and another gene is co-regulated. Second, studies on lincRNAs have reduced statistical power because lincRNAs are on average expressed at much lower levels than mRNAs, and lincRNA annotations are of lower quality than mRNA annotations. We are able to overcome both of these challenges by using associations between genotypes and RNA expression levels to resolve causal models and to control for statistical power differences between mRNAs and lincRNAs. Genetic variants that are associated with RNA expression, or expression quantitative trait loci (eQTLs), have been extensively mapped in cell lines and tissue types using genome-wide approaches [28, 32, 64, 70, 74, 79]. Recent eQTL studies that include lincRNAs have shown that genetic variation also contributes to the regulation of lincRNA expression levels [67, 81].
More broadly, non-coding variants are thought to affect organismal phenotypes largely through the modulation of gene expression levels [69]. Support for this comes from the finding that trait-associated SNPs (TASs) are enriched in collections of protein-coding expression quantitative trait loci (eQTLs) across human cell lines and tissues [71]. Genome-wide association studies (GWAS) have shown that the majority of TASs are non-coding. Some of these TASs are located in protein-coding gene deserts [62, 66] and in lincRNAs [15], such as MIAT, which is associated with myocardial infarction [47]. Other TASs include eQTLs for lincRNAs like PTCSC3, which is associated with papillary thyroid carcinoma [49]. However, as with transcription regulation, the role of lincRNAs in regulating complex traits is unclear on a genome-wide level.
To better understand the contribution of lincRNAs expressed in primary tissues to complex traits and mRNA expression levels, we performed a systematic eQTL study to identify genetic variants regulating lincRNA and mRNA expression levels across multiple tissues. We used these genetic associations in statistical enrichment models to study shared transcriptional mechanisms across lincRNAs and mRNAs via their cis-eQTLs. We used similar enrichment analyses to compare the roles of lincRNAs and mRNAs in the regulation of complex trait types. We used a genome-wide Mendelian randomization approach to quantify the effects of lincRNAs on distal mRNA expression levels across the genome, and investigated several possible mechanisms of transcription regulation in the regulatory lincRNAs.
2 Results
We performed a genome-wide eQTL association study using RNA-sequencing data and array-based, imputed genotype data to identify genetic variants that influence lincRNA gene expression levels (linc-eQTLs) and mRNA (protein-coding) gene expression levels (pc-eQTLs). We used data from four primary tissues—adipose (n = 107 samples, 103 individuals), artery (n = 118 samples, 117 individuals), lung (n = 133 samples, 121 individuals), and skin from the lower leg (n = 109 samples, 105 individuals)— from 167 donors to the Genotype-Tissue Expression (GTEx) project pilot v4 data [6].
We adopted the definition of cis- and trans-regulation current in the eQTL literature as association between genotype at a locus and expression of a gene where these genomic elements are separated by less than or greater than some distance threshold (here, < 100 Kb and > 1 Mb), respectively. This definition is agnostic of functional mechanism, and is only a proxy for the true definition of cis-regulation, which is allele-specific regulation; nevertheless it is a useful proxy for the study of genome-wide patterns of transcriptional regulation by lincRNAs.
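The distance-based definition above can be sketched as a small classifier. This is a minimal illustration, not the pipeline used in the study: the function name, the "intermediate" label for pairs between the two thresholds, the choice to measure distance to the nearest gene boundary, and the treatment of pairs on different chromosomes as trans are all assumptions made for the example.

```python
CIS_MAX_BP = 100_000      # "< 100 Kb" threshold quoted in the text
TRANS_MIN_BP = 1_000_000  # "> 1 Mb" threshold quoted in the text


def classify_pair(snp_chrom, snp_pos, gene_chrom, gene_start, gene_end):
    """Label a SNP-gene pair as 'cis', 'trans', or 'intermediate' (hypothetical label)."""
    # Assumption: pairs on different chromosomes are treated as trans.
    if snp_chrom != gene_chrom:
        return "trans"
    # Assumption: distance is measured to the nearest gene boundary,
    # and is zero for SNPs inside the gene body.
    if gene_start <= snp_pos <= gene_end:
        distance = 0
    else:
        distance = min(abs(snp_pos - gene_start), abs(snp_pos - gene_end))
    if distance < CIS_MAX_BP:
        return "cis"
    if distance > TRANS_MIN_BP:
        return "trans"
    return "intermediate"
```

Pairs that fall between 100 Kb and 1 Mb belong to neither class under this definition, which is why the sketch returns a third label rather than forcing a choice.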
2.1 Cis-eQTL discovery
We tested for association between the genotype of single nucleotide polymorphisms (SNPs) and gene expression levels in each tissue separately; we also tested for association jointly across all tissues using a multi-tissue association test [30]. More specifically, we tested for association of a gene with cis-SNPs within the interval from 100 Kb upstream of the transcription start site (TSS) to 100 Kb downstream of the transcription end site (TES; Supplemental Table 1). We computed Bayes factor (BF) thresholds corresponding to FDR ≤ 5% for both methods separately using permutations and took the union of significant SNP-gene associations to produce the full set of eQTLs. We found 126,258 SNPs associated with 1,567 of the 4,368 lincRNA genes, or 36% of the lincRNA genes tested (FDR ≤ 5%; Supplemental Fig. 1, Supplemental Table 2). In contrast, we identified 905,704 SNPs associated with 9,794 of the 18,191 protein-coding genes, or 54% of the mRNA genes tested (FDR ≤ 5%; Supplemental Fig. 1, Supplemental Table 2). We selected the single most significant eQTL for each gene (where the single- and multi-tissue analyses differed, we included both of the most significant eQTLs) to define a set of best eQTLs that included 2,034 cis-linc-eQTLs and 13,522 cis-pc-eQTLs. We found similar enrichment of the best cis-eQTLs proximal to the TSS and TES for both lincRNAs and mRNAs (Fig. 1A and B) [100]. Linc-eQTLs and pc-eQTLs were similarly overrepresented around annotated splice-sites, suggesting that lincRNA expression levels might be subject to the same degree of post-transcriptional regulation as mRNAs (Supplemental Fig. 2).
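The construction of the best eQTL set can be sketched as follows. This is a hedged illustration under assumed inputs: the record layout `(gene, snp, method, log10_bf)`, the method labels, and the function name are hypothetical, and the input is assumed to contain only associations that already passed the FDR threshold.

```python
def best_eqtls(associations):
    """Return {gene: set of best SNPs}, keeping the top SNP per gene and method.

    `associations` is an iterable of (gene, snp, method, log10_bf) tuples,
    where `method` is a label such as "single" or "multi" (assumed names).
    """
    top = {}  # (gene, method) -> (snp, log10_bf)
    for gene, snp, method, log10_bf in associations:
        key = (gene, method)
        if key not in top or log10_bf > top[key][1]:
            top[key] = (snp, log10_bf)
    best = {}
    for (gene, _method), (snp, _bf) in top.items():
        # A set collapses the two methods when they agree on the same SNP.
        best.setdefault(gene, set()).add(snp)
    return best


example = [
    ("lincA", "rs1", "single", 6.2),
    ("lincA", "rs1", "multi", 8.0),
    ("lincB", "rs2", "single", 5.1),
    ("lincB", "rs3", "multi", 7.4),
    ("lincB", "rs2", "multi", 4.0),
]
result = best_eqtls(example)
```

In the toy input, the two methods agree for `lincA` and disagree for `lincB`, so `lincB` contributes two best eQTLs. This is why the best set can hold more eQTLs than there are genes with an eQTL (2,034 best cis-linc-eQTLs for 1,567 lincRNA genes).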
We quantified the relative tissue specificity of linc-eQTLs, and we found that the number of lincRNAs with at least one linc-eQTL was consistent across the four tissues (Fig. 2A, B, and Supplemental Fig. 3A). Moreover, 35% of linc-eQTLs were unique to a single tissue (Fig. 2A). The number of mRNAs with at least one cis-pc-eQTL was also consistent across tissues (Fig. 2B and Supplemental Fig. 3B), and 31% of pc-eQTLs were unique to a single tissue. Controlling for gene length, GC content, number of exons, average expression levels, SNP minor allele frequency (MAF), and tissue-specificity of expression [15], linc-eQTLs were significantly more tissue-specific than pc-eQTLs (Eqn. 1; Student’s t-test; p ≤ 5.1 × 10^-12).
We found that tissue-specific linc- and pc-eQTLs had lower median association significance than non-tissue-specific eQTLs [24] (Mann-Whitney U-test, linc- and pc-eQTLs, p ≤ 1 × 10^-100). Small sample size in the pilot GTEx data, intrinsic tissue specificity of lincRNA expression, and poor lincRNA annotations may affect these results.
The majority of cis-linc-eQTLs and cis-pc-eQTLs were associated with only one gene; however, we found substantial overlap between SNPs identified as both cis-linc-eQTLs and cis-pc-eQTLs. Specifically, 27,524 SNPs were significantly associated with expression levels of both a lincRNA and an mRNA. Shared cis-eQTLs were 2.3 times more likely to have the same direction of effect on lincRNA and mRNA expression levels than the opposite direction of effect (Fisher’s exact test, p ≤ 1 × 10^-100; Supplemental Fig. 4). We did not find any overrepresented patterns of the shared eQTL with respect to the transcription orientation of the lincRNA and mRNA in a genomic locus across matched and unmatched effects. Using conditional analyses, we found that the large majority of eQTLs that were significant cis-linc-eQTLs and cis-pc-eQTLs directly regulated both gene targets (Eqn. 2, Supplemental Fig. 5A). This scenario also allows for a common but unmodeled effect (e.g., a shared transcription factor that regulates both the mRNA and the lincRNA). For a small minority of shared SNPs, we identified a direct and an indirect effect. These results agree with the lack of significant patterns of orientation of the mRNA and lincRNA gene bodies with respect to the shared cis-eQTL. These results suggest that lincRNA and mRNA transcription share similar local non-genic mechanisms of regulation. They also suggest that there are no ubiquitous patterns of local mRNA-lincRNA regulatory interactions, although it is possible that this lack of signal is due to statistical models that are insufficiently powerful to resolve these local interactions.
We considered the possibility that a SNP-gene association may be an artifact of the RNA-seq platform or read mapping pipeline by examining these eQTLs in nine publicly available eQTL data sets from seven different (unmatched) cell types [12]. In particular, we looked for each SNP-gene pair in the association mapping results from each study. We removed all genes not assayed on the gene expression array from the data and all SNPs that were not tested for association in that study. We considered the association to replicate in a study when the gene-SNP pair had a log10 BF ≥ 1. Overall, 44 cis-linc-eQTLs were tested; of these, 20 linc-eQTLs replicated (45%). Similarly, 7,989 cis-pc-eQTLs were tested; of these, 4,361 pc-eQTLs replicated (55%). Although the number of potential associations tested differs substantially, linc-eQTLs and pc-eQTLs replicated across platforms and cohorts at similar frequencies (Fisher’s exact test, p ≤ 0.6), indicating that well-annotated lincRNAs and mRNAs are similarly reproducible across RNA-seq and microarray platforms.
2.2 Similar cis-regulatory element enrichment for linc-eQTLs and pc-eQTLs
Many of the mechanisms by which cis-eQTLs regulate transcription of mRNAs are well understood [32]; to validate our cis-linc-eQTLs, we compared enriched co-localization of cis-linc-eQTLs with specific classes of cis-regulatory elements (CREs) to enriched co-localization of cis-pc-eQTLs with these CREs [12]. We performed enrichment analyses of our cis-linc-eQTLs and cis-pc-eQTLs with genome-wide maps of several indicators of CREs—DNase I hypersensitive sites (DHSs), nine histone modifications, and binding of RNA Polymerase II, EZH2, and CTCF—identified in nearest match cell lines from the ENCODE project [20]. Then we modeled CRE-eQTL co-localization using logistic regression controlling for distance to TSS, average gene expression level, and MAF [12] (Eqn. 3) to statistically quantify enrichment.
The best cis-linc-eQTLs and the best cis-pc-eQTLs were significantly enriched in CREs with features indicative of active gene transcription (putatively active CREs), including DHSs (Fig. 3C), H3K27ac (Fig. 3B), H3K9ac, H3K4me3, and RNA Polymerase II binding sites (Supplemental Fig. 6). Among all CREs, cis-pc-eQTLs and cis-linc-eQTLs were most differentially enriched for overlap with H3K36me3 (Fisher’s exact test, best: p ≤ 2.2 × 10^-16; Fig. 3C). A recent study found that protein-coding genes have elevated H3K36me3 signals downstream of the TSS across cell types, while lncRNA genes do not [90]. The differential enrichment of eQTLs in H3K36me3 sites may indicate differential usage of this histone modification in transcription regulation of protein-coding genes as compared with lincRNA genes. For the remainder of the CREs, however, the signal for enrichment in active CREs was similar for cis-linc-eQTLs and cis-pc-eQTLs, indicating similar allele-specific transcription regulatory mechanisms and globally validating our cis-linc-eQTLs.
2.3 Enrichment of cis-linc-eQTLs with trait associated SNPs
Previous studies have shown that pc-eQTLs are enriched in trait-associated SNPs (TASs), or SNPs identified via genome-wide association studies (GWAS) for organismal traits, suggesting that genetic regulation of mRNA transcription contributes to complex organismal traits [12, 71]. To study the impact of the genetic regulation of lincRNA transcription on organismal traits, we tested for enrichment of TASs from the NHGRI GWAS Catalog [43] among cis-linc-eQTLs and cis-pc-eQTLs. The probability of linkage to a TAS was modeled with logistic regression controlling for distance to TSS, average expression level, and SNP MAF, requiring the TAS and eQTL to have r^2 > 0.8 (Eqn. 4). Both cis-linc-eQTLs and cis-pc-eQTLs were enriched for linkage to TASs (Fig. 3D).
We compiled a set of 74 lincRNAs that had cis-linc-eQTLs in LD with TASs (r^2 ≥ 0.8) that were not in LD (r^2 < 0.2) with any cis-pc-eQTLs or local-pc-eQTLs (local indicates a distance of SNP to gene < 1 Mb) and were also at least 10 Kb from the nearest protein-coding gene, suggesting that the trait may be regulated through the lincRNA (explanatory lincRNAs; Supplemental Table 3). Eighty-eight TASs associated with a total of 96 traits were explained by these 74 explanatory lincRNAs (Supplemental Table 4). Among the several trait types enriched among TASs linked to linc-eQTLs but not to pc-eQTLs, obesity-related traits had the greatest enrichment, making up 8% of the 96 traits (Fisher’s exact test, p ≤ 0.029).
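The three criteria for an explanatory lincRNA can be written as a simple filter. The sketch below is illustrative only: the record type, its field names, and the example values are hypothetical, and the LD statistics are assumed to have been summarized per lincRNA beforehand (the strongest LD between a TAS and any linc-eQTL, and the strongest LD between that TAS and any cis- or local-pc-eQTL).

```python
from dataclasses import dataclass


@dataclass
class LincCandidate:
    # All field names are hypothetical, chosen for this illustration.
    name: str
    max_r2_linc_eqtl_to_tas: float   # strongest LD between a TAS and a cis-linc-eQTL
    max_r2_pc_eqtl_to_tas: float     # strongest LD between that TAS and any pc-eQTL
    dist_to_nearest_pc_gene_bp: int  # distance to the nearest protein-coding gene


def is_explanatory(c, linked_r2=0.8, unlinked_r2=0.2, min_gene_dist_bp=10_000):
    """Apply the three thresholds quoted in the text to one candidate."""
    return (
        c.max_r2_linc_eqtl_to_tas >= linked_r2
        and c.max_r2_pc_eqtl_to_tas < unlinked_r2
        and c.dist_to_nearest_pc_gene_bp >= min_gene_dist_bp
    )


# Made-up records, not values from the study.
candidates = [
    LincCandidate("linc_keep", 0.92, 0.05, 250_000),
    LincCandidate("linc_pc_linked", 0.95, 0.45, 250_000),
    LincCandidate("linc_too_close", 0.90, 0.01, 4_000),
]
explanatory = [c.name for c in candidates if is_explanatory(c)]
```

Exposing the thresholds as keyword arguments makes it straightforward to repeat the filter at other r^2 or distance cutoffs.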
To explore the possible regulatory mechanisms of our explanatory lincRNAs, we tested these lincRNAs for possible mechanistic signatures (see Methods), including secondary structure conservation and protein-coding potential. None of these signatures were found in notable proportions of the explanatory lincRNAs, consistent with the notion that the functional mechanisms of lincRNAs are highly variable [98]. For all mechanistic analyses, we found that explanatory lincRNAs could not be distinguished from non-explanatory lincRNAs with cis-linc-eQTLs after controlling for transcript length. Two of our explanatory lincRNAs are likely to code for protein (LINC00452 and PCAT1; Supplemental Table 3), but this does not appear to be common despite recent work using ribosomal profiling [10] and functional characterization [3]. The variety of possible mechanisms and the lack of evolutionary conservation of lincRNAs between humans and model organisms make experimental characterization of a single explanatory lincRNA difficult.
Three of the TASs that we tested here were previously found to be associated with traits related to adipose tissue (adiposity [56] and visceral adipose tissue/subcutaneous adipose tissue ratio [31]). These three TASs were located in a single genomic locus in strong LD with 132 cis-linc-eQTLs for the explanatory lincRNA RP11-392O17, with no cis- or local-pc-eQTLs in LD with the TAS (Fig. 4). Eight additional TASs are in LD (r^2 ≥ 0.8) with cis-linc-eQTLs for this same lincRNA in this gene desert. Seven of these were associated with adipose-related traits (adiponectin levels [22], fasting insulin-related traits [61], visceral adipose tissue/subcutaneous adipose tissue ratio [31], hip-adjusted BMI [86], and waist-hip ratio [41]), and the remaining one was associated with a sexually dimorphic trait (osteoarthritis [5]). Although these eight additional TASs were also in weak LD (0.2 < r^2 < 0.8) with a singleton cis-pc-eQTL in skin for gene SLC30A10, the explanatory value of this association is weak relative to the explanatory lincRNA. The cis-linc-eQTL association was most significant in adipose tissue. One of these linked cis-linc-eQTLs, rs2605100, has been found to be associated with adiposity [56] and resides over 200 Kb away from the nearest protein-coding gene, LYPLAL1. Furthermore, LYPLAL1 does not have a local-pc-eQTL in LD with any of the adipose-related TASs. Our data show that, of the four tissues in this study, lincRNA RP11-392O17 is most highly expressed in adipose tissue (Fig. 4); the only tissue in which this lincRNA has higher levels of expression is pancreas, which produces a number of sugar-metabolizing hormones, including insulin and glucagon. In aggregate, these results suggest that the cis-linc-eQTLs associated with this lincRNA may be involved in regulating risk of this collection of adipose-related traits.
We mention five other explanatory lincRNAs that are good candidates to mediate disease risk, based on their long distance from the nearest protein coding gene, the number of linked TASs, and the effect size of the cis-linc-eQTLs. First, lincRNAs RP11-815M8.1 and RP11-400N13.2 both were explanatory lincRNAs for TASs for progressive supranuclear palsy [44] and colorectal cancer [45] (Supplemental Table 3, 4). Second, lincRNA RP11-400N13.1 had cis-linc-eQTLs that were strongly linked to one TAS for keloid (scar formation) [68]. Third, lincRNA RP11-345M22.2 was strongly linked to four TASs for urate levels [52], thyroid volume [96], thyroid hormone levels [82], and thyroid function [87] (Supplemental Table 3, 4). Fourth, PCAT1 was strongly linked to two obesity-related trait SNPs [19], but based on our analyses appears to be protein coding (Supplemental Table 3). Experimental characterization of these disease-related lincRNAs is difficult as described earlier due to low sequence conservation and other factors; moreover, none of these five lincRNAs appear to be small peptide hormones. However, our work indicates that including cis-linc-eQTLs in functional analyses of TASs is essential, as certain organismal traits may be regulated through lincRNAs.
In the analyses of enrichment for linkage to TASs among cis-eQTLs, we considered an eQTL and a TAS to be linked if the two SNPs were located within 100 Kb of each other and had a pairwise r^2 ≥ 0.8; however, our results were robust to different lower bounds on r^2 between the TAS and the cis-eQTL. Specifically, the sets of all linc-eQTLs and all pc-eQTLs were enriched for linkage to TASs at r^2 lower bounds from 0.20 to 0.95. The set of best pc-eQTLs was enriched for linkage to TASs at r^2 lower bounds from 0.20 to 0.90. The set of best linc-eQTLs was not enriched for linkage to TASs at any tested r^2 threshold.
2.4 Mechanistic hypotheses for candidate lincRNAs
We found that linc-TASs were significantly enriched for obesity-related traits compared to pc-TASs. This enrichment generally increased with increasing stringency in defining explanatory linc-eQTLs by distance to the nearest protein-coding gene, and conversely in defining explanatory pc-eQTLs by distance to the nearest lincRNA gene (Fisher’s exact test, r^2 > 0.2: p ≤ 0.007; r^2 > 0.8: p ≤ 0.035; r^2 > 0.2 and > 10 Kb constraint: p ≤ 2.1 × 10^-4; r^2 > 0.8 and > 10 Kb constraint: p ≤ 0.028; r^2 > 0.2 and > 100 Kb constraint: p ≤ 2.1 × 10^-5; r^2 > 0.8 and > 100 Kb constraint: p ≤ 0.01). We used thresholds of r^2 > 0.8 to define eQTL-TAS linkage and r^2 < 0.2 to characterize no eQTL-TAS linkage, and required the protein-coding gene to be > 10 Kb away from the TAS. We found that six lincRNAs had linc-eQTLs linked to TASs for obesity-related traits that were not linked to pc-eQTLs or nearby protein-coding genes, including PCAT1 and PWRN1, known to be functionally important in prostate cancer [83] and Prader-Willi syndrome [14], respectively.
Currently, few lincRNAs have experimentally verified functions or disease associations. Of these, only PCAT1 from the Long Non-Coding RNA Database (lncRNAdb v.2.0) [85] was found in our candidate set. In terms of coding potential, the candidate set could not be distinguished from the set of its reverse complement sequences. We tested for coding potential, evolutionary conservation, conserved secondary structure, secondary structure folding energy, number of miRNA binding sites, nuclear versus cytoplasm localization, and signal peptide potential; for none of these tests could the candidate lincRNAs be distinguished from their complement set among all lincRNAs tested for cis-linc-eQTLs after controlling for transcript length. This finding is consistent with the notion that lincRNAs are a heterogeneous group of molecules whose functional mechanisms vary substantially [98].
Several of our candidate lincRNAs are likely protein-coding genes according to CPAT, LINC00452 and PCAT1 in particular (Supplemental Table 4). Additional work is needed to describe the potential action of other promising explanatory lincRNAs, such as RP11-815M8.1 and RP11-345M22.2, which do not have a clear mechanism based on these analyses.
2.5 Experimental validation of linc-eQTLs
To experimentally confirm the regulatory effects of cis-linc-eQTLs on transcription, we selected seven loci that included cis-linc-eQTLs and cis-pc-eQTLs to validate using allele-specific luciferase reporter assays. We chose cis-eQTLs that were located in DHSs shared across cell types that represent the tissues used, and compared relative enhancer activity between the major and minor haplotypes. Of the seven loci tested, four were cis-pc-eQTLs, and four were cis-linc-eQTLs (one was both a cis-pc-eQTL and a cis-linc-eQTL). Three of the four cis-linc-eQTLs had a significant difference in luciferase expression between the major and minor haplotypes, and in all cases the direction of the effect agreed with the association analysis (Student’s t-test, p ≤ 0.05; Fig. 5). Of the four cis-pc-eQTLs, three had a significant effect on luciferase expression. Two of those agreed in direction with the association analysis, and one acted in the opposite direction. The latter cis-pc-eQTL tested a tissue-specific association, potentially explaining the single inconsistent result. Overall, these results serve as positive validation for the association analysis and also support the hypothesis that genetic variation in DHSs is largely responsible for allele-specific transcription regulation.
2.6 Mendelian randomization to test for RNA-mRNA regulation
The predominant hypothesis about the cellular role of lincRNAs is that they regulate the transcription of protein-coding genes. Previous studies have shown that i) lincRNAs interact with chromatin-modifying complexes [37, 50]; ii) lincRNA knockdown affects expression levels of protein-coding genes [37]; iii) lincRNAs, including ROR, MALAT1, MEG3, MEG8, TERC, and others, have been experimentally shown to function as trans-regulators [16, 17, 37, 57, 65, 107]. To evaluate this hypothesis genome-wide, we tested for the trans-regulatory effects of lincRNAs in two ways: a naive approach and an approach using a Mendelian randomization test.
2.6.1 Naive approach for testing for trans-effects of lincRNAs
In our first, naive approach, we tested the best cis-linc-eQTLs and cis-pc-eQTLs for association with the expression levels of all genes with non-zero expression in at least 10% of samples in all four tissues in a cross-tissue manner. We designed this test for a general transcriptional role for lincRNAs using cis-linc-eQTLs for two reasons: first, using the genotypes instead of the lincRNA expression levels allowed us to control for potential loss of statistical power due to lower expression levels of lincRNAs versus mRNAs; second, we were able to calibrate our results against the same results from cis-pc-eQTLs, which had a similar distribution of effect sizes to the cis-linc-eQTLs and a substantial number of known transcription regulators. We did not find a significant difference between the distributions of the magnitude of effect sizes for the eQTLBMA best cis-linc-eQTLs and best cis-pc-eQTLs (Mann-Whitney U-test, p > 0.05; Supplemental Fig. 7A). However, the SNPTEST best cis-linc-eQTLs had slightly but significantly greater absolute effect sizes than cis-pc-eQTLs (Mann-Whitney U-test, p = 1.04 × 10^-4; Supplemental Fig. 7B). If this slight difference in effect sizes affected results, it should advantage cis-linc-eQTLs over cis-pc-eQTLs. This approach let us statistically compare the proportion of lincRNAs having a trans effect on a protein-coding gene to the proportion of protein-coding genes having a trans effect on a protein-coding gene; however, this approach does not control for shared latent confounders.
Using the naive approach, we found 9 cis-linc-eQTLs that were associated with 10 (0.05% of the total) protein-coding genes across four tissue types, resulting in 11 trans-pc-eQTLs (FDR ≤ 20%; Supplemental Table 2). For comparison, we found 7,594 cis-pc-eQTLs were associated with 7,046 (38.7%) protein-coding genes in trans across four tissue types, resulting in 22,527 trans-pc-eQTLs (FDR ≤ 20%; Fig. 7A and Supplemental Table 2). All trans-pc-eQTLs were identified via multi-tissue mapping, while only 39 trans-pc-eQTLs were replicated using single-tissue mapping, pointing to the increased power of multi-tissue analyses in identifying trans-pc-eQTLs. Thus, using this naive approach, we observed a 2,000 times greater chance that an mRNA will be associated in trans with a cis-pc-eQTL than with a cis-linc-eQTL. While 85% of cis-pc-eQTLs were estimated to be shared across all four tissues using eQTLBMA, only 11% of trans-pc-eQTLs (from cis-pc-eQTLs) were estimated to be shared across all four tissues. Our results provide support for the hypothesis that trans-eQTLs are more tissue-specific than cis-eQTLs [29, 35, 84]; however, we have limited power in these pilot data to resolve this question.
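The percentages in this paragraph can be checked with a few lines of arithmetic. The sketch assumes that the denominator is the 18,191 protein-coding genes tested in the cis analysis, an assumption that reproduces both quoted percentages. It also shows one possible reading of the roughly 2,000-fold figure, namely the ratio of trans-pc-eQTL counts; that reading is an interpretation and is not stated explicitly in the text.

```python
PC_GENES_TESTED = 18_191   # protein-coding genes tested (assumed denominator)

linc_targets = 10          # protein-coding genes associated with cis-linc-eQTLs
pc_targets = 7_046         # protein-coding genes associated with cis-pc-eQTLs
linc_trans_eqtls = 11      # trans-pc-eQTLs arising from cis-linc-eQTLs
pc_trans_eqtls = 22_527    # trans-pc-eQTLs arising from cis-pc-eQTLs


def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


linc_pct = percent(linc_targets, PC_GENES_TESTED)
pc_pct = percent(pc_targets, PC_GENES_TESTED)
trans_ratio = pc_trans_eqtls / linc_trans_eqtls

print(f"lincRNA-driven targets: {linc_pct:.3f}% of protein-coding genes")
print(f"mRNA-driven targets:    {pc_pct:.1f}% of protein-coding genes")
print(f"ratio of trans-pc-eQTL counts: {trans_ratio:.0f}")
```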
2.6.2 Mendelian randomization approach for testing for trans-effects of lincRNAs
To study whether lincRNAs affect mRNA expression levels genome-wide in comparison with mRNAs in the presence of shared confounding effects, we developed and applied a Mendelian randomization (MR) test to quantify support for a direct relationship between each RNA that has a cis-eQTL and all local and distal genes.
Mendelian randomization (MR) is a form of causal inference referred to as instrumental variable analysis, where an instrumental variable (here, a cis-eQTL) is used to artificially create randomization from observational data, leading to an analysis that mimics a randomized controlled trial (RCT) [94]. RCT analyses are the gold standard approach to testing for causal effects, so MR is a powerful approach to causal inference in observational genomic data. MR requires that a genotype (instrument) is associated with the independent variable (the cis-regulated gene; Fig. 6). MR explicitly controls for unobserved co-regulators of the two genes via the instrumental variable. In particular, our approach to MR separates variation in the independent variable x (the cis-gene) into variation due to the cis-eQTL and variation due to all other effects, and uses only the genetically regulated variation to test for association with the dependent variable (the trans-gene).
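The idea of using only the genetically regulated part of the cis-gene's expression can be illustrated with textbook two-stage least squares on a single instrument. This is a minimal sketch of the general technique, not the test developed in this study; the function names and the toy data (including the hidden confounder `u`) are invented for the example.

```python
from statistics import mean


def ols(x, y):
    """Ordinary least squares fit of y on x; returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def two_stage_mr(genotype, cis_expr, trans_expr):
    """Estimate the effect of the cis-gene on the trans-gene via the genotype."""
    # Stage 1: keep only the part of cis-gene expression predicted by genotype.
    b1, a1 = ols(genotype, cis_expr)
    predicted = [a1 + b1 * g for g in genotype]
    # Stage 2: regress trans-gene expression on the genetically predicted part.
    b2, _ = ols(predicted, trans_expr)
    return b2


# Toy data: u is an unobserved confounder, uncorrelated with genotype,
# that drives both genes. The true effect of x on y is 0.5.
g = [0, 0, 1, 1, 2, 2]
u = [1, -1, 1, -1, 1, -1]
x = [gi + ui for gi, ui in zip(g, u)]
y = [0.5 * xi + 2.0 * ui for xi, ui in zip(x, u)]

naive_slope, _ = ols(x, y)       # inflated by the shared confounder
mr_slope = two_stage_mr(g, x, y)
```

In this toy data set the direct regression of `y` on `x` gives a slope of 1.7 because the confounder inflates it, whereas the two-stage estimate recovers the true effect of 0.5. The estimate is only valid when the genotype affects the trans-gene solely through the cis-gene.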
The effect sizes of 1,077 best cis-linc-eQTLs and 5,899 best cis-pc-eQTLs were compared to the effect sizes of those same eQTLs in trans by means of Mendelian randomization. We found that 36 (3.3% of tested) lincRNA genes associated with the expression of 753 protein-coding genes, while 278 (4.7% of tested) protein-coding genes associated with the expression of 4,298 protein-coding genes (20% FDR; Fisher’s exact test, p = 4.57 × 10^-2; Fig. 7B). While the proportion of lincRNA genes with trans associations was significantly lower than the proportion of protein-coding genes with trans associations, our results support the hypothesis that lincRNAs participate in dosage-dependent distal regulation of protein-coding gene transcription on a genome-wide scale.
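The comparison of proportions can be sketched with a standard-library implementation of the two-sided Fisher's exact test. The 2x2 table below is built from the counts quoted in the text, under the assumption that the numbers tested (1,077 and 5,899) are the row totals; the p-value this produces may differ from the published one if the authors' test used different counts or a different sidedness.

```python
from math import exp, lgamma


def _log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def log_p(x):
        # Hypergeometric log-probability of x in the top-left cell.
        return (_log_choose(row1, x) + _log_choose(row2, col1 - x)
                - _log_choose(n, col1))

    observed = log_p(a)
    total = 0.0
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        lp = log_p(x)
        # Sum every table that is no more likely than the observed one.
        if lp <= observed + 1e-9:
            total += exp(lp)
    return min(1.0, total)


linc_with, linc_tested = 36, 1_077
pc_with, pc_tested = 278, 5_899

linc_rate = 100.0 * linc_with / linc_tested
pc_rate = 100.0 * pc_with / pc_tested
p_value = fisher_exact_two_sided(
    linc_with, linc_tested - linc_with, pc_with, pc_tested - pc_with
)
print(f"{linc_rate:.1f}% vs {pc_rate:.1f}%, two-sided p = {p_value:.3g}")
```

Working in log space with `lgamma` avoids overflow in the binomial coefficients, which matters for tables with thousands of observations.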
We suspect that it is the additional control in MR for shared confounding effects that leads to different conclusions between the naive approach and the MR approach. Such confounding will be compounded by shared batch and technical effects in the measurements of cis- and trans-gene expression levels, as the assays were shared. We also note that the tests were implemented differently: the MR approach combined gene expression levels across all tissues, whereas the naive approach explicitly used cross-tissue meta-analyses.
To ensure that trans-pc-eQTLs were not long-range cis-pc-eQTLs, we verified that trans-pc-eQTLs from cis-pc-eQTLs were located on the same chromosome as their associated genes at a frequency no greater than expected by chance, when modeling the probability of choosing a SNP or gene as a uniform distribution across the length of each chromosome (Z-test, p > 0.5).
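One way to compute the chance expectation described above is sketched here. It assumes that the SNP and the gene are placed independently and uniformly along the genome, so that the probability of both landing on the same chromosome is the sum of squared length fractions, and it compares the observed fraction to that expectation with a one-sample proportion Z statistic. The function names and chromosome lengths are placeholders, and the exact null model used in the study may differ.

```python
from math import sqrt


def expected_same_chrom_fraction(chrom_lengths):
    """P(SNP and gene share a chromosome) under independent uniform placement."""
    total = sum(chrom_lengths.values())
    return sum((length / total) ** 2 for length in chrom_lengths.values())


def proportion_z(k_same, n_pairs, p_expected):
    """Z statistic for an observed same-chromosome fraction versus p_expected."""
    se = sqrt(p_expected * (1.0 - p_expected) / n_pairs)
    return (k_same / n_pairs - p_expected) / se


# Placeholder lengths in arbitrary units, not real chromosome sizes.
toy_lengths = {"chrA": 2, "chrB": 1, "chrC": 1}
p_same = expected_same_chrom_fraction(toy_lengths)
z = proportion_z(k_same=375, n_pairs=1_000, p_expected=p_same)
```

A Z statistic near zero indicates that SNP-gene pairs share a chromosome about as often as expected by chance, which argues against the trans associations being long-range cis effects.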
Previous work has speculated that lincRNAs, and lncRNAs more generally, compose a novel layer of transcription regulation [37, 63, 88]. Researchers have experimentally shown that a handful of lincRNAs have regulatory functions, for example, in Mus musculus embryonic stem (ES) cell differentiation [37] and in human lung endoderm morphogenesis [42]. Certain lincRNAs with established regulatory functions did not have cis-linc-eQTLs in our data set and thus could not be tested for trans-pc-eQTL association. These excluded functional lincRNAs include TERC, Xist, lincRNA-ROR, MEG3, MEG8, ncRNA-a1, ncRNA-a5, and ncRNA-a7. Another two lincRNAs, NEAT1 and MALAT1/NEAT2, which have been hypothesized to have trans regulatory functions in humans based on experimental characterization of their cellular function, did have cis-linc-eQTLs in our study (166 and 50, respectively). However, NEAT1 and MALAT1/NEAT2 did not significantly associate with the expression of any protein-coding genes in trans. Many of the other well-studied lncRNAs with known regulatory roles are either classified as antisense RNA, miscRNA, or were not annotated in GENCODE and so were not tested here. These lncRNAs include HOTAIR, HOTTIP, Evf2, Kcnq1ot1, Paupar, CCND1-associated ncRNAs, AIR, DHFR minor promoter, NRON, ANRIL, ncRNA-a2, ncRNA-a3, ncRNA-a4, and others.
2.7 Local correlation of mRNAs and lincRNAs
A handful of lincRNAs have been shown to regulate mRNA expression both locally and distally by diverse mechanisms [39]. Mendelian randomization assumes that the cis-eQTL does not have a direct effect on the downstream gene, which we could not guarantee in this scenario, but we suspect that direct distal effects will be uncommon.
To test for local genomic interaction between lincRNAs and mRNAs, we used correlation coefficients to test for co-regulation of gene pairs. The median absolute correlation coefficient (MACC) was significantly higher for gene pairs that were nearer to one another compared to those farther away (Mann-Whitney U-test, 20 Kb versus 100 Kb: p ≤ 1 × 10^-100; 100 Kb versus 1 Mb: p ≤ 1 × 10^-100; 1 Mb versus 5 Mb: p ≤ 1 × 10^-100). However, across all gene pair genomic proximity thresholds, the MACC between pairs of lincRNAs and protein-coding genes was not significantly greater than the MACC for pairs of protein-coding genes nor for pairs of lincRNA genes (Mann-Whitney U-test, p > 0.05). In other words, proximal pairs of lincRNAs and protein-coding genes showed no more correlation in expression than other proximal gene pairs, as has been shown in previous work [15, 99].
3 Discussion
In this study, we applied a Mendelian randomization approach to explore the biological role of lincRNAs through their associated genetic variants, using matched-power analyses on mRNAs as an experimental control. We provided evidence for similar allele-specific regulation of gene expression levels among lincRNA and mRNA genes using CRE enrichment analyses. We find that organismal traits (here, largely adipose-related traits) may be genetically mediated through lincRNA expression levels; we compiled 74 explanatory lincRNAs and highlight six of the most promising lincRNAs that may be involved in the regulation of specific organismal traits. Using Mendelian randomization, we found that there is evidence for lincRNAs having distal effects on the expression levels of protein-coding genes in a dosage-dependent way across the genome in similar proportions to distal effects of mRNA.
Mendelian randomization shows great promise in performing genome-wide studies of a functional role for specific types of epigenetic elements. We contrasted the results using MR with results using standard tests of trans-eQTLs, where the discovery set of SNPs included only cis-eQTLs. The improvement in results of MR was clear: we found vanishingly few trans-eQTL associations from cis-pc-eQTLs (39) and cis-linc-eQTLs (4) in the univariate analysis tissue-by-tissue. Using MR and pooling data across tissues, however, we found 5,213 trans gene-gene associations (from 278 putative regulators; FDR ≤ 20%). There are a number of limitations to our study, including a small sample size for eQTL discovery, noisy lincRNA annotations in the human genome as compared with annotations of mRNA, and lack of experimental approaches to validating trans-eQTLs that act via lincRNAs. None of our results, however, are conditioned on these limitations.
We designed our study to have similar statistical power to detect both the effects of lincRNA regulation of mRNA and mRNA regulation of mRNA. This controlled experiment was possible because we used a matched distribution of cis-eQTL effect sizes and an MR approach that quantifies the magnitude of the effects of the cis-RNA on all trans-mRNAs while controlling for shared confounding effects. We proposed a specific MR test to evaluate distal regulatory effects in the context of this controlled experiment. We feel that these methods can be generalized to increase statistical power in trans-eQTL studies [104], study genetic mechanisms of specific complex traits [33], or infer the causal relationships among epigenetic markers [8]. Here we apply this approach to assess, at a genome-wide scale, the regulatory potential of non-coding RNA, specifically to determine whether or not lincRNAs have distal regulatory effects.
Methodologically, there are many avenues to pursue in the area of methods design for Mendelian randomization. In particular, we are scaling our approach to high-dimensional SNPs and epigenetic markers. Our current method, implemented in an open source Python package on GitHub (https://github.com/PrincetonUniversity/MReQTL), can be applied to data for which summary statistics are available across unpaired samples; in other words, we did not leverage the paired samples here, and we could have equivalently ascertained lincRNA cis-eQTLs in one dataset and mRNA trans-eQTLs in a second dataset. A paired test that simultaneously controls for unobserved confounding and exploits the paired sample design to improve statistical power would be valuable in light of the many paired epigenetic studies on the horizon.
4 Methods
4.1 RNA-seq processing, mapping, quantification, and normalization
RNA-seq reads from the Genotype-Tissue Expression (GTEx) Project consortium Pilot data [58] were trimmed using Trimmomatic (v.0.30) [11]. We mapped reads to human reference genome assembly GRCh37.p13 using STAR aligner (v.2.3.0) [26]. Of 15.93 B total read pairs mapped, 14.52 B read pairs (91%) mapped uniquely, and 1.2 B read pairs (7.5%) were discarded for mapping to more than one locus. Mapped reads were converted into read counts (one read pair per read count) using the software featureCounts [55] on settings that required both ends of a read pair to be mapped to the same gene. For all further analyses, we used expression levels for lincRNAs and mRNAs in autosomal chromosomes with non-zero expression levels in at least 10% of samples in at least one tissue, which filtered 32% of the original 6,416 lincRNA genes and 10.6% of the original 20,345 protein-coding genes. After filtering, we retained 4,377 lincRNA genes and 18,194 protein-coding genes. Read abundances for lincRNA and protein-coding genes were normalized by gene length, GC-content, and library size using the Bioconductor R package cqn [40], which yielded RPKM estimates. We controlled for known and unknown covariates of expression by removing all principal components that explained greater than 0.5% of the variation in expression values across individuals. Principal components analysis was performed using NumPy and SciPy Python packages. Expression values were normalized to the quantiles of the standard normal distribution across individuals within gene and tissue before and after the removal of PCs.
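The normalization steps described above (quantile normalization, removal of high-variance principal components, and a second quantile normalization) can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline used in this study: the function names, the (rank - 0.5) / n quantile offset, the tie handling, and the choice to centre each gene before the SVD are all assumptions.

```python
import numpy as np
from statistics import NormalDist

def to_normal_quantiles(values):
    """Map one gene's values across individuals to standard normal quantiles."""
    values = np.asarray(values, dtype=float)
    ranks = values.argsort().argsort() + 1        # ranks 1..n, ties broken by order
    quantiles = (ranks - 0.5) / len(values)       # assumed offset; keeps q in (0, 1)
    return np.array([NormalDist().inv_cdf(q) for q in quantiles])

def remove_top_pcs(expr, max_var_explained=0.005):
    """Drop every PC explaining more than `max_var_explained` of the variance.

    expr: genes x individuals matrix for a single tissue.
    """
    centered = expr - expr.mean(axis=1, keepdims=True)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    var_explained = s**2 / np.sum(s**2)
    keep = var_explained <= max_var_explained
    # Rebuild the matrix from the retained (low-variance) components only.
    return (u[:, keep] * s[keep]) @ vt[keep, :]

def normalize_tissue(expr):
    """Quantile-normalize per gene, remove top PCs, then quantile-normalize again."""
    expr = np.apply_along_axis(to_normal_quantiles, 1, expr)
    expr = remove_top_pcs(expr)
    return np.apply_along_axis(to_normal_quantiles, 1, expr)
```

With few individuals, every component explains more than 0.5% of the variance and would be removed, so a fixed fractional threshold only makes sense when the number of samples is well above 200.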
4.2 RNA-seq read trimming and preprocessing
Adapter sequences and overrepresented contaminant sequences, identified by FastQC (v.0.10.1) [4], were trimmed using Trimmomatic (v.0.30) [11] with 2 seed mismatches and a simple clip threshold of 20. Leading and trailing nucleotides (low quality or Ns) were trimmed from all reads until a canonical base was encountered with quality greater than 3. For adaptive quality trimming, reads were scanned with a 4-base sliding window, trimming when the average quality per base dropped below 20. Any remaining sequences shorter than 30 nucleotides were discarded. After trimming, the original raw read counts of 5.87 B read pairs in adipose, 6.34 B read pairs in artery, 7.89 B read pairs in lung, 6.45 B read pairs in skin were reduced to 3.65 B read pairs in adipose, 4.02 B read pairs in artery, 4.63 B read pairs in lung, 3.63 B read pairs in skin.
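To make the quality-trimming parameters above concrete, the following is a simplified Python sketch of end trimming, sliding-window trimming, and the minimum-length filter. It is not Trimmomatic's implementation (adapter clipping is omitted, and the tool's window handling differs in detail); the function name and argument names are hypothetical.

```python
def trim_read(seq, quals, window=4, min_avg_q=20, end_q=3, min_len=30):
    """Return the trimmed (seq, quals), or None if the read ends up too short.

    seq:   nucleotide string
    quals: list of Phred quality scores, same length as seq
    """
    # 1. Trim leading/trailing bases that are N or have quality <= end_q.
    start, end = 0, len(seq)
    while start < end and (seq[start] == "N" or quals[start] <= end_q):
        start += 1
    while end > start and (seq[end - 1] == "N" or quals[end - 1] <= end_q):
        end -= 1
    seq, quals = seq[start:end], quals[start:end]

    # 2. Sliding window: cut at the first window whose mean quality < min_avg_q.
    for i in range(max(len(seq) - window + 1, 0)):
        if sum(quals[i:i + window]) / window < min_avg_q:
            seq, quals = seq[:i], quals[:i]
            break

    # 3. Discard reads shorter than min_len.
    return (seq, quals) if len(seq) >= min_len else None
```

For paired-end data, a pair would typically be kept only if both mates survive, which this sketch leaves to the caller.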
4.3 RNA-seq read mapping
We prepared the genome with STAR aligner genomeGenerate mode using a splice junction database (sjdbGTFfile) set to GENCODE v.19 annotation, the splice junction database overhang (sjdbOverhang) set to 75 bp, and defaults for all remaining settings. STAR aligner alignReads mode was then run using default settings, except that outFilterMultimapNmax was set to 1 so that only uniquely mapping reads were retained.
4.4 Filtering and imputing genotype data
In the same subjects from which RNA-seq samples were taken, 3.5 M quality-filtered SNPs were assayed on the Illumina Omni5-Quad array. These SNPs were then used to impute to approximately 10 M SNPs using as reference the 1000 Genomes phase 1 release [1] as described in earlier work [58]. Imputed SNPs with quality score < 0.4 and diverging from Hardy-Weinberg equilibrium (p ≤ 1 × 10^-6) were removed before association mapping.
4.5 Association mapping for cis-effects
Bayesian regression was performed using two separate techniques, eQTLBMA (v. 1.3) [30] and SNPTEST (v. 2.5.b.4). eQTLBMA was used to perform regression jointly on data from all four tissue types, which enabled the estimation of the proportion of expression quantitative trait loci (eQTLs) shared across tissues and allowed for greater power than would be possible by performing regression for each tissue separately.
The parameters we used corresponded to the BFBMA as previously described [30]. After sample test runs, we filtered SNPs with minor allele frequency (MAF) < 0.02 as low MAF SNPs were found to be uninformative when using eQTLBMA. Using SNPTEST, we assumed an additive effects model with a prior effect size modeled as a normal distribution with zero mean and variance modeled as an inverse gamma scaled by a factor of 0.02 and with mean 3 and variance 2. We tested associations between the expression of each lincRNA or mRNA and all SNPs in the interval between 100 Kb upstream of the transcription start site (TSS) to 100 Kb downstream of the transcription end site (TES). For comparison, local association mapping was also performed for a candidate region extending from 1 Mb upstream and downstream of each TSS and TES respectively. For each tested gene-SNP pair, eQTLBMA returned Laplace approximations of log10 Bayes factors (log10BFs) [30] and SNPTEST also returned approximate log10BFs for the observed expression-genotype relationships and for expression-genotype relationships where labels of individuals were randomly permuted. Because of the computational burden, only one permutation was performed for both SNPTEST and eQTLBMA. (For numbers of genes, SNPs and gene-SNPs tested for association, see Supplemental Table 1). The false discovery rate (FDR) was computed as the number of associations identified in the permuted data at each log10BF cutoff value divided by the number of associations identified in the observed data at that same cutoff, as in previous work [60]. Significant cis-eQTLs—both lincRNA (linc-eQTLs) and protein-coding (pc-eQTLs)—were defined as those gene-SNP associations that passed the significance threshold at a 5% FDR.
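The permutation-based FDR estimate described above reduces to a ratio of counts at each cutoff. A minimal sketch, assuming the observed and permuted log10BFs are available as plain lists (the function names and the rule of picking the smallest qualifying cutoff are illustrative assumptions):

```python
def empirical_fdr(observed_bfs, permuted_bfs, cutoff):
    """Estimated FDR at a log10BF cutoff: permuted hits / observed hits."""
    n_obs = sum(bf >= cutoff for bf in observed_bfs)
    n_perm = sum(bf >= cutoff for bf in permuted_bfs)
    if n_obs == 0:
        return 0.0  # no discoveries at this cutoff
    return min(1.0, n_perm / n_obs)

def cutoff_for_fdr(observed_bfs, permuted_bfs, target=0.05):
    """Smallest observed log10BF cutoff whose estimated FDR is <= target."""
    for cutoff in sorted(set(observed_bfs)):
        if empirical_fdr(observed_bfs, permuted_bfs, cutoff) <= target:
            return cutoff
    return None  # target FDR not reachable with these data
```

Because only one permutation was performed, the estimate at stringent cutoffs rests on very few permuted associations and should be read as approximate.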
4.6 Distribution of location of the best cis-eQTLs
For all lincRNA and protein-coding genes with at least one cis-eQTL, the upstream and downstream cis association candidate regions flanking the gene body were divided into ten equal parts into which the best eQTLs were binned (e.g., if the eQTL was 100,000-90,000 bp upstream, then bin 1; if the eQTL was 89,999-80,000 bp upstream, then bin 2; etc.). Gene bodies were split into different numbers of bins according to the mean length of genes in the class under consideration such that the genic bins would be roughly 10 Kb in size, and thus comparable to the flanking bins. This worked out to 9 bins per gene for SNPTEST pc-eQTLs, 7 bins per gene for eQTLBMA pc-eQTLs, 4 bins per gene for SNPTEST linc-eQTLs, and 3 bins per gene for eQTLBMA linc-eQTLs.
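The flanking-region binning rule in the example above can be written as a one-line computation. The sketch below assumes distances are measured in bp from the gene boundary (TSS for upstream, TES for downstream); the function name is hypothetical.

```python
def flank_bin(distance_bp, flank_size=100_000, n_bins=10):
    """Bin index (1 = farthest from the gene) for an eQTL in a flanking region."""
    if not 0 <= distance_bp <= flank_size:
        raise ValueError("SNP lies outside the cis candidate region")
    bin_width = flank_size // n_bins
    # The outer edge (exactly flank_size) is assigned to bin 1.
    return max(1, n_bins - distance_bp // bin_width)
```

With the defaults, distances of 100,000 to 90,000 bp fall in bin 1 and 89,999 to 80,000 bp fall in bin 2, matching the example in the text.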
4.7 Comparison of cis-linc-eQTLs and cis-pc-eQTLs in tissue specificity
To determine whether linc-eQTLs were more tissue-specific than pc-eQTLs, we modeled the probability that an eQTL was tissue-specific (i.e., significant in only one of the four tissues) as a logistic regression model with the following covariates: eQTL type (linc-eQTL or pc-eQTL), absolute effect size, log of distance to TSS/TES, log of average expression level, MAF, log of the transcribed gene length, log of the genomic gene length, tissue specificity of expression score, GC-content, and log of the number of exons.
4.8 Conditional analysis of shared cis-linc-/pc-eQTLs
We allowed for four different scenarios of the regulatory relationship between cis-eQTL, lincRNA, and mRNA for those cis-eQTLs that were both linc-eQTLs and pc-eQTLs:
In each case, we modeled the probability using a Bayesian linear model: where the prior on effect size is the D2 prior from previous work [92]. We summarized this conditional analysis with a modified Bayes factor (BF) that is computed as follows for SNP j: where 1 in the denominator captures the null (none_j) hypothesis in the denominator of all of these separate tests. When iBF_j > 1, there is evidence that the eQTL effect is differential; otherwise it is stable or not present.
4.9 Enrichment for overlap with cis-regulatory elements
Significant cis-linc-eQTLs and cis-pc-eQTLs were tested for enrichment of overlap with various cis-regulatory elements (CREs) in matching ENCODE [20] cell lines. The background distribution consisted of all SNPs tested for association that had MAFs at least as great as the smallest MAF for any significant linc-eQTL or pc-eQTL and nearest to a GENCODE v.19 protein-coding or lincRNA gene with non-zero average expression. There were 2.0 M of these best linc-eQTL background SNPs, 3.0 M for all linc-eQTL background SNPs, 5.8 M for best pc-eQTL background SNPs, and 7.6 M for all pc-eQTL background SNPs. The probability of a SNP overlapping a CRE was modeled in a logistic regression model, including as covariates the logarithm of the distance of SNP to associated gene, average expression levels of the associated gene, MAF, and an indicator of whether the SNP was a cis-eQTL (Eqn. 4). As background SNPs could not be assigned an associated gene (by definition, such SNPs have no associations), they were assigned to the nearest gene, breaking ties by choosing the nearest gene with the highest average expression.
The calculation of bootstrap confidence intervals for odds ratio of probability of overlap with CRE and odds ratio of probability of linkage to trait-associated SNP throughout this work refers to the bias-corrected accelerated bootstrap method [27] and was implemented using scikit.bootstrap [76].
4.10 Enrichment for linkage to trait-associated SNPs (TASs)
Cis-linc-eQTLs and cis-pc-eQTLs were tested for enrichment of linkage to TASs. Genome-wide association study (GWAS) SNPs were downloaded from the NHGRI GWAS catalog (accessed: 04/22/2014) [103]. The background set of SNPs was compiled as described above in enrichment for overlap with cis-regulatory elements. Linkage disequilibrium (LD), or the non-random association of alleles due to shared ancestry, can be measured by the squared correlation coefficient (r^2) between loci. A SNP was considered linked to a TAS if the SNP and the TAS were located within 100 Kb and had a pairwise r^2 ≥ 0.8. The probability of a SNP being linked to a TAS was modeled in a logistic regression (shown below) as a function of the logarithm of distance of SNP to associated gene, average expression of the associated gene, MAF, and an indicator of whether the SNP was a cis-eQTL. As background SNPs could not be assigned an associated gene (by definition, such SNPs have no associations), they were assigned to the nearest gene, breaking ties by choosing the nearest gene with the highest average expression (Eqn. 5).
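The linkage criterion above (within 100 Kb and pairwise r^2 ≥ 0.8) can be sketched as follows. This illustration computes r^2 from unphased genotype dosages, which approximates the haplotype-based statistic; the source does not state which estimator was used, and the function names are hypothetical.

```python
def r_squared(geno_a, geno_b):
    """Squared Pearson correlation between two genotype dosage vectors (0/1/2)."""
    n = len(geno_a)
    mean_a, mean_b = sum(geno_a) / n, sum(geno_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(geno_a, geno_b))
    var_a = sum((a - mean_a) ** 2 for a in geno_a)
    var_b = sum((b - mean_b) ** 2 for b in geno_b)
    if var_a == 0 or var_b == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treat as unlinked
    return cov * cov / (var_a * var_b)

def is_linked(pos_snp, pos_tas, geno_snp, geno_tas,
              max_dist=100_000, min_r2=0.8):
    """True if the SNP and the TAS are within max_dist bp and r^2 >= min_r2."""
    close = abs(pos_snp - pos_tas) <= max_dist
    return close and r_squared(geno_snp, geno_tas) >= min_r2
```

Both positions are assumed to lie on the same chromosome; a caller would check that before comparing coordinates.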
4.11 Evaluating mechanistic hypotheses for explanatory lincRNAs
We compiled a list of TASs that may regulate the associated trait through the regulation of lincRNA expression levels. For this set of lincRNA-TASs, we selected TASs that were at least 10 Kb away from the nearest GENCODE v.19 protein-coding gene and were in LD with a cis-linc-eQTL (r^2 > 0.8) but were not in LD with a cis-pc-eQTL or a local-pc-eQTL (r^2 < 0.2) (Supplemental Table 4). The explanatory lincRNAs were those lincRNAs associated with lincRNA-TASs as described above (Supplemental Table 3). To study the enrichment of certain trait types in the set of lincRNA-TASs, we compared the distribution of traits for lincRNA-TASs to the distribution of traits for TASs linked to pc-eQTLs and not to linc-eQTLs.
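The three selection criteria for lincRNA-TASs can be expressed as a single predicate. In this sketch the record layout and key names are hypothetical; each TAS is assumed to carry its distance to the nearest protein-coding gene and its maximum r^2 with any cis-linc-eQTL and with any cis- or local-pc-eQTL.

```python
def is_lincrna_tas(tas, min_gene_dist=10_000, linked_r2=0.8, unlinked_r2=0.2):
    """Apply the gene-desert and LD criteria to one trait-associated SNP.

    tas: dict with (hypothetical) keys
        dist_to_nearest_pc_gene  distance in bp to the nearest protein-coding gene
        max_r2_linc_eqtl         highest r^2 with any cis-linc-eQTL
        max_r2_pc_eqtl           highest r^2 with any cis- or local-pc-eQTL
    """
    return (tas["dist_to_nearest_pc_gene"] >= min_gene_dist
            and tas["max_r2_linc_eqtl"] > linked_r2
            and tas["max_r2_pc_eqtl"] < unlinked_r2)

# Made-up records for illustration only.
candidates = [
    {"id": "tas_a", "dist_to_nearest_pc_gene": 45_000,
     "max_r2_linc_eqtl": 0.93, "max_r2_pc_eqtl": 0.05},
    {"id": "tas_b", "dist_to_nearest_pc_gene": 2_000,
     "max_r2_linc_eqtl": 0.95, "max_r2_pc_eqtl": 0.01},
    {"id": "tas_c", "dist_to_nearest_pc_gene": 60_000,
     "max_r2_linc_eqtl": 0.90, "max_r2_pc_eqtl": 0.40},
]
selected = [t["id"] for t in candidates if is_lincrna_tas(t)]
```

Here only the first record passes: the second is too close to a protein-coding gene and the third is in LD with a pc-eQTL.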
To study a wide range of possible mechanisms of action for the lincRNAs on an organismal phenotype, we tested the candidate set of lincRNAs for genetic signatures of potential mechanisms of action. In particular, we tested:
coding potential as estimated by CPAT (v1.2.1) [101],
sequence conservation as estimated by phastCons (46-way) and downloaded from the UCSC Genome Browser (accessed: 12/18/2014) [93],
conserved secondary structure as estimated by Evofold and downloaded from the UCSC Genome Browser (accessed: 12/19/2014) [75],
secondary structure folding energy as estimated by RNAfold (v.2.1.3) [59],
number of miRNA binding sites as estimated by TargetScan [54] and compiled in lnCeDB (accessed: 01/14/2015) [21],
miRNA binding site conservation (46-way and primates-only) as estimated by TargetScan [54] and compiled in miRcode database (accessed: 01/14/2015) [48],
nuclear versus cytoplasm localization ratio as measured for selected GENCODE v.7 transcripts [23],
signal peptide potential as measured by signalP (v.4.1) [77].
For the coding potential analysis, the sequences of all lincRNAs tested for linc-eQTLs were compared to their reverse complement RNA sequences, which are assumed to have coding potential equivalent to genomic background but have the same length distribution as the set of lincRNAs, representing our null hypothesis. For other analyses, metrics for candidate lincRNAs were compared to their complement set among all lincRNAs tested. Metrics that were continuous or integer-valued (coding potential, secondary structure folding energy, number of miRNA binding sites, conservation of miRNA binding sites, signal peptide potential, and nuclear versus cytoplasm localization ratio) were compared between sets using a Mann-Whitney U-test. Conserved secondary structure, which was a binary metric, was compared between sets using Fisher’s exact test. Metrics that could be reasonably binarized (coding potential, conservation of miRNA binding sites with varied thresholds, signal peptide potential, and nuclear versus cytoplasm localization ratio ≥ 1 versus < 1) were binarized within set and compared across sets using Fisher’s exact test. Any test that yielded significant results using the methods above was retested using multivariate linear regression or logistic regression to control for transcript length, which influences many of the above metrics.
4.12 Allele-specific luciferase assay
We selected for validation cis-eQTLs that were located in DHS shared across cell types that represented the tissues used: the lung epithelial cell line A549, primary preadipocytes, the skin fibroblast cell line AG04449, and aortic adventitial fibroblasts [20]. We used PCR to amplify the major and minor haplotype of each DHS from the genomes of 1000 Genomes Project donors [1]. (Details of the target regions, haplotypes, and primers used are in Supplemental Table 5.) The DHS were ligated upstream of the Simian virus 40 promoter driving expression of a firefly-derived luciferase reporter gene (Promega pGL4.13) using Gibson assembly. The constructs were then transformed into Escherichia coli (E. coli) and haplotypes were confirmed with Sanger sequencing. The positive clones were expanded, and plasmids purified using standard approaches. We then transiently transfected each luciferase reporter into HepG2 cells, and calculated the relative luciferase expression between the two haplotypes after normalizing to a renilla-derived luciferase expressed from a co-transfected control plasmid. The statistical significance of differences in renilla-normalized enhancer activity between haplotypes was evaluated using a Student’s t-test.
4.13 Association mapping for trans-effects
We performed association mapping (both single-tissue and multi-tissue) for all lincRNA and protein-coding gene expression levels (tested in cis) in all tissue types using only the 2,034 best cis-linc-eQTLs and 13,522 best cis-pc-eQTLs, which had the effect of greatly reducing the multiple hypothesis testing burden to detect trans-eQTLs. To exclude any cis-associations, we did not test for trans effects for any SNPs within 1 Mb of a gene. The same parameters were used for SNPTEST and eQTLBMA trans-association as described above for cis-association mapping. As in cis-association eQTL mapping, one permutation was performed for trans-eQTL analyses. A significance threshold was determined empirically corresponding to a 20% FDR.
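The 1 Mb exclusion rule for trans tests can be sketched as a simple coordinate check. The function name and the convention of measuring distance to the nearer gene boundary are assumptions; the source does not specify whether distance was measured from the gene body, the TSS, or the TES.

```python
def is_trans_candidate(snp_chrom, snp_pos, gene_chrom, gene_start, gene_end,
                       min_dist=1_000_000):
    """True if a SNP-gene pair is eligible for trans-association testing."""
    if snp_chrom != gene_chrom:
        return True                      # different chromosomes: always trans
    if gene_start <= snp_pos <= gene_end:
        return False                     # SNP inside the gene body
    # Distance to the nearer gene boundary.
    dist = min(abs(snp_pos - gene_start), abs(snp_pos - gene_end))
    return dist > min_dist               # SNPs within 1 Mb are excluded
```

Restricting the SNP set to the best cis-eQTLs and then applying this filter determines the number of SNP-gene tests, and therefore the multiple-testing burden.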
4.14 Mendelian randomization
We used an instrumental variable (IV) analysis to identify cis-regulated genes (both mRNA and lincRNA) that affect regulation of protein-coding genes in trans. Let z be the instrumental variable (here, the genotype of a SNP across n individuals) that directly affects the independent variable x (here, the cis-RNA), which, in turn, may or may not affect the dependent variable y (here, the trans-mRNA). Both x and y may be affected by unmeasured variables u (Fig. 6). Mendelian randomization (MR) is an example of IV analysis, where we quantify the degree of the causal relationship from x to y by removing all sources of variation in x other than variation due to instrumental variable z and then testing for a relationship between this denoised x and y. The MR framework requires z to have direct effects on x and indirect effects only on y, and also implicitly controls for all possible unmeasured confounders u that jointly affect x and y.
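First, define the normal equation for linear regression, which estimates the effect size of a linear regression predictor x on a response variable y: $\hat{\beta} = (x^{T}x)^{-1}x^{T}y$. The linear regression model, $y = \beta_{MR}\,x + \epsilon$, may be conditioned on the instrumental variable z, and then expectations taken: $E[y \mid z] = \beta_{MR}\,E[x \mid z] + E[\epsilon \mid z]$, so that $\beta_{MR} = \hat{\beta}_{zy} / \hat{\beta}_{zx}$, where $E[\epsilon \mid z] = 0$ by assumption and $\hat{\beta}_{zy}$ and $\hat{\beta}_{zx}$ are the estimated effect sizes for the trans-eQTL and the cis-eQTL, respectively. Thus, to quantify the direct effects of RNA x on mRNA y using $\beta_{MR}$, we compute the ratio of the trans-eQTL effect size to the cis-eQTL effect size.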
We now test the null hypothesis, that $\beta_{MR} = 0$, compared to the alternative hypothesis, $\beta_{MR} \neq 0$, using Wald’s test. The test statistic has a $\chi^2$ distribution with three degrees of freedom (two linear regression intercepts and two slopes). We compute the variance terms, var($\beta_{MR}$) and residual variance $\sigma^2$, as follows: where ν is the degrees of freedom and n is the number of samples.
We assessed global FDR using a permutation. In particular, the null hypothesis is that there is no relationship between x and y. Correspondingly, we used the estimated effect size from permuted dependent variable data while maintaining the estimated effect sizes from the unpermuted samples to ensure there is still an association between z and x. Then, using the ordered list of p-values computed using Wald’s test, we approximated the global FDR for specific p-value thresholds using the number of p-values in the permutations that were smaller than the threshold (FPs) divided by the number of p-values in the true data that were smaller than the threshold (TPs+FPs).
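To illustrate the ratio estimator described in this section, the following sketch simulates a genotype z, a cis-RNA x, a trans-mRNA y, and an unmeasured confounder u, then compares the naive regression of y on x with the MR ratio of the trans-eQTL slope to the cis-eQTL slope. All simulation parameters (effect sizes, allele frequency, sample size) and function names are invented for illustration; the Wald test is not reproduced here, and this is not the MReQTL implementation.

```python
import random

def slope(x, y):
    """Least-squares slope of y on x (model includes an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def mr_ratio(z, x, y):
    """MR effect of x on y: trans-eQTL effect divided by cis-eQTL effect."""
    return slope(z, y) / slope(z, x)

def simulate(n=20_000, beta_true=0.5, seed=0):
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        g = sum(rng.random() < 0.3 for _ in range(2))  # genotype 0/1/2, allele freq 0.3
        u = rng.gauss(0, 1)                            # unmeasured confounder
        xi = 0.8 * g + u + rng.gauss(0, 1)             # cis-RNA: depends on z and u
        yi = beta_true * xi + u + rng.gauss(0, 1)      # trans-mRNA: depends on x and u
        z.append(g); x.append(xi); y.append(yi)
    return z, x, y

z, x, y = simulate()
beta_naive = slope(x, y)     # inflated because u drives both x and y
beta_mr = mr_ratio(z, x, y)  # uses only the genotype-driven variation in x

# Equivalent two-stage view: regress y on the genotype-predicted part of x.
b_zx = slope(z, x)
x_hat = [b_zx * g for g in z]
beta_two_stage = slope(x_hat, y)
```

In this simulation the naive slope is pulled away from the true effect by the shared confounder, while the ratio estimate (and its two-stage equivalent) stays close to it, which is the behaviour the MR framework relies on.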
4.15 Local correlation of mRNA and lincRNA
We computed pairwise Pearson’s correlation coefficients between gene expression estimates across all samples from the four tissue types for gene pairs within multiple genomic proximity thresholds (20 Kb, 100 Kb, 1 Mb, 5 Mb) using the NumPy function corrcoef.
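A minimal sketch of the proximity-stratified correlation analysis, using numpy.corrcoef as named above. The input layout (a genes x samples array with one coordinate per gene on a single chromosome) and the function name are assumptions made for illustration.

```python
import numpy as np

def macc_by_distance(expr, positions,
                     thresholds=(20_000, 100_000, 1_000_000, 5_000_000)):
    """Median absolute correlation coefficient (MACC) of gene pairs per threshold.

    expr:      genes x samples array (genes on one chromosome)
    positions: one genomic coordinate (bp) per gene, same order as expr rows
    """
    corr = np.corrcoef(expr)                       # gene-by-gene Pearson correlations
    pos = np.asarray(positions)
    dist = np.abs(pos[:, None] - pos[None, :])     # pairwise distances in bp
    upper = np.triu(np.ones_like(dist, dtype=bool), k=1)  # count each pair once
    result = {}
    for t in thresholds:
        pairs = upper & (dist <= t)
        result[t] = float(np.median(np.abs(corr[pairs]))) if pairs.any() else float("nan")
    return result
```

The per-threshold distributions of absolute correlations (rather than only their medians) would be the inputs to the Mann-Whitney U-tests reported in the Results.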
Author Contributions
B.E.E., T.E.R., and I.C.M. conceived the experiments. B.E.E. and I.C.M. designed the experiments. I.C.M. performed the computational experiments. I.C.M. and B.E.E. analyzed the data. C.G., C.M.V., and T.E.R. designed and performed the laboratory experiments. I.C.M., A.A.P., C.D.B., T.E.R., and B.E.E. wrote the paper.
Disclosure Declaration
The authors have no conflicts of interest to disclose.
Acknowledgements
BEE was funded by NIH R00 HG006265 and NIH R01 MH101822. CDB was funded by NIH R01 MH101822. AAP was supported by a postdoctoral fellowship from the Jane Coffin Childs Foundation. CG, CMV, and TER were funded by NIH R01 DK099820. The data reported in this paper are tabulated in the Supporting Online Material and archived at dbGaP, study accession number phs000424.v4.p1, Common Fund Genotype-Tissue Expression (GTEx) Project.