Abstract
Identification of trans-eQTLs has been limited by a heavy multiple testing burden, read-mapping biases, and hidden confounders. To address these issues, we developed GBAT, a powerful gene-based method that allows robust detection of trans gene regulation. Using simulated and real data, we show that GBAT drastically increases detection of trans-gene regulation over standard trans-eQTL analyses.
Main
Identification of long-range trans-gene regulation often leads to the discovery of important disease-causing genes and pathways that are not captured in cis analyses1-4. This is because trans regulation harbors more cell-type-specific effects1, 5, 6 and explains more than twice as much of the variability in gene expression as cis effects5-7. However, robust discovery of trans-eQTLs is challenging and prone to false positives for four reasons. First, genome-wide scans for trans-eQTLs suffer from a heavy multiple testing burden6, 8. Second, trans effects are typically much smaller than cis effects8. Third, sequence read mapping errors, multi-mapped reads, and reads from repeat regions lead to many false trans signals9. Fourth, naïve use of dimensionality reduction techniques to estimate confounding effects (such as PEER10 or SVA11) can both reduce power12, 13 and introduce false positives11, 14 in trans-eQTL studies.
We address these issues through a new gene-based method, GBAT, for detecting trans-regulatory effects. GBAT consists of three main steps (Figure 1). First, to reduce the number of false positives due to mapping issues, GBAT filters out reads that are multi-mapped15. GBAT also removes problematic reads that RNA-seq alignment algorithms do not mark as “multi-mapped” by discarding reads that map to repeat regions (genomic regions with mappability scores lower than 1, see Supplementary Notes).
Second, GBAT uses cvBLUP, a novel gene-based method, to produce predictions of gene expression from SNPs cis to each gene. The cvBLUP method does not rely on external eQTL studies, but builds leave-one-sample-out cross-validated cis-genetic predictions (CVGPi for each gene i) to avoid the overfitting of the standard best linear unbiased predictor (BLUP) (see Methods for details). The cvBLUP method dramatically reduces computing time compared to leave-one-sample-out cross-validation approaches implemented for prediction methods such as BSLMM16 and Elastic-net17. This gain is attained by building our N (N = sample size) leave-one-out CVGP predictions after fitting the model only once instead of N times (Methods).
Finally, we test the association of each CVGPi with the quantile normalized expression levels Ej of every trans gene j (at least 1Mb away from gene i). Cis-eQTL studies typically include covariates such as PEER factors or surrogate variables from SVA that are intended to model confounders17, 18. To prevent false positives13 and power loss in trans-eQTL studies11, 12, the use of supervised versions of PEER and SVA is recommended13. Therefore, for each CVGPi, we run supervised SVA conditional on CVGPi, such that the resulting surrogate variables SVi do not include the genetic effects of gene i. We then use SVi as covariates. While including conditional SVs as covariates is computationally efficient in our gene-based approach, it is impossible in SNP-based trans association testing.
Specifically, GBAT regresses Ej on SVi and uses the quantile normalized residuals Ej’ for gene-based testing using the regression Ej’ ∼ CVGPi. We found that the p-values from this test are well calibrated (Supplementary Notes, Figure S2). However, skipping the normalization caused false-positive inflation, as did using the models Ej ∼ CVGPi + SVi or Ej ∼ CVGPi + SV (where SV is the naïve SVA not conditioning on CVGPi) (Figure S2).
Using real genotypes from a whole blood RNA-seq dataset, the Depression Genes and Networks cohort9 (DGN, sample size N = 913), we performed simulations to assess the power of our gene-based approach (GBAT) in comparison to a SNP-based approach (Methods). We simulated a causal SNP→cis-expression→trans-expression model with realistic effect sizes under different genetic architectures (proportion of causal SNPs = 0.1%, 1% and 10%) and sample sizes (N = 200, 400, 600 and 913) (Supplementary Notes). To better reflect the imperfect genotyping of the individuals, we assumed that only 10% of the SNPs are genotyped, such that not all causal SNPs are observed. We measured the power of both approaches to identify significant trans associations, using Bonferroni correction to account for 1 million × 15,000 SNP-gene pairs for the SNP-based approach or 5,000 × 15,000 (highly heritable gene)-gene pairs for the GBAT approach. Across all simulated genetic architectures of gene expression and combinations of cis- and trans-effects, we observed that the power of the GBAT approach is substantially higher than that of the SNP-based method (Figure 2a, Figure S3).
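The difference in multiple testing burden between the two approaches can be made concrete with a short calculation. The sketch below uses only the test counts quoted above; the family-wise error rate ALPHA = 0.05 is an assumption for illustration and is not stated in the text.

```python
# Bonferroni thresholds implied by the test counts quoted in the text.
# ALPHA = 0.05 is an assumed family-wise error rate (not given in the source).
ALPHA = 0.05

snp_tests = 1_000_000 * 15_000   # SNP-gene pairs (SNP-based approach)
gene_tests = 5_000 * 15_000      # (highly heritable gene)-gene pairs (GBAT)

snp_threshold = ALPHA / snp_tests
gene_threshold = ALPHA / gene_tests
burden_ratio = snp_tests // gene_tests   # how many times fewer tests GBAT runs

print(f"SNP-based: {snp_tests:.1e} tests, threshold {snp_threshold:.2e}")
print(f"GBAT:      {gene_tests:.1e} tests, threshold {gene_threshold:.2e}")
print(f"GBAT performs {burden_ratio}x fewer tests")
```

Under this assumed error rate, the per-test significance threshold for the gene-based scan is less stringent by the same factor as the reduction in the number of tests.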
We next applied GBAT to the DGN dataset to detect trans-gene regulation signals for real expression phenotypes. We built cross-validated cis-genetic (CVGP) expression levels using cvBLUP with variants within 100 kb of the transcription start site. Prediction accuracy was assessed using the squared correlation (prediction R2) between observed and predicted expression levels. On average across all genes, the prediction R2 is 85% of the cis SNP heritability estimated by restricted maximum likelihood (REML) (Figure 2b). We note that our prediction accuracy is comparable to that of prediction methods modeling the sparse genetic architecture of cis-gene regulation (Figure 3 of ref18 and Figure 4 of ref19). After gene-based association testing, we computed q-values20 from the p-values of all inter-chromosomal gene pairs, and applied the resulting significance threshold to all inter- and intra-chromosomal gene pairs. At 10% FDR, the final trans gene regulation signal consists of 411 regulator-trans target gene pairs and 157 unique regulators (Figure 2c, Table S1). Among the 411 trans gene pairs, 290 (70.6%) are inter-chromosomal (corresponding to 253 unique inter-chromosomal trans-eGenes), and 121 (29.4%) are intra-chromosomal (corresponding to 94 unique intra-chromosomal trans-eGenes). In contrast, SNP-based eQTL mapping with Matrix eQTL23 identified only 90 trans-eGenes at 10% FDR in DGN (Supplementary Notes, Table S3). Gene Ontology enrichment analysis with the Database for Annotation, Visualization and Integrated Discovery21 (DAVID v6.8, see URLs) showed that the top two enriched categories of the 157 trans regulators are DNA binding (Benjamini-Hochberg (BH) FDR = 1.7×10−4) and transcription factor activity (BH FDR = 5.3×10−4, Table S2). Both regulators and trans target genes show heritability enrichment in autoimmune diseases, including lupus, ulcerative colitis, Crohn’s disease and inflammatory bowel disease, as estimated by stratified LDSC22 (Figure S4).
Among the 157 unique trans regulators, 20 were found to regulate more than 3 genes (Table S4), supporting the existence of master regulators. For example, we identified NFKBIA (NF-Kappa-B Inhibitor Alpha) as a master regulator that regulates the expression of four other genes, including two that encode subunits of the NF-kappa-B complex: NFKB2 and RELB (Figure 2d). Consistent with the inhibiting effect of NFKBIA on NF-kappa-B subunits, our estimated effect sizes on NFKB2 and RELB are both negative (NFKB2 beta = -0.19, P = 1.0×10−08; RELB beta = -0.17, P = 2.4×10−07). We also identified a master regulator encoded by SRCAP on chromosome 16 that regulates 88 other genes (81 are inter-chromosomal signals, Figure 2e). SRCAP, short for Snf2 Related CREBBP Activator Protein, encodes the core catalytic component of a chromatin-remodeling complex. SRCAP is known to activate the expression of CREBBP, consistent with our positive estimated effect in the DGN dataset (effect size = 0.25; P = 1.3×10−14). The SRCAP complex was shown to regulate key lymphoid fate in the haematopoietic system by remodeling chromatin and enhancing promoter accessibility of target genes23. In the DGN dataset, the expression level of SRCAP is significantly correlated with natural killer cell proportion (Pearson correlation = 0.21, P = 7.9×10−11) and T cell proportion (Pearson correlation = -0.15, P = 4.0×10−06). Remarkably, we found that 19 SRCAP target genes (out of 88 genes, 2.9-fold enrichment, Fisher’s test) overlap with genes associated with blood cell type proportions (Table S5) from the GWAS Catalog (see URLs). Our discovery highlights that the gene co-expression network driven by SRCAP (Figure 2e) is relevant to haematopoiesis and immune cell type proportions, and suggests that SRCAP is a master regulator that controls lineage commitment during haematopoiesis.
GBAT was carefully designed to improve power and reduce false positives in detecting trans signals; failing to properly perform the recommended steps led to an increase in false positives or a loss of power. For example, incomplete removal of problematic sequence reads resulted in 220% more trans signals, most of which are likely to be false positives; correcting for covariates using the regression model Ej ∼ CVGPi + SVi resulted in 39% more trans signals, also likely due to false-positive inflation (Supplementary Notes).
GBAT can be used to detect other trans gene regulatory events, such as splicing, methylation, protein regulation or gene network response to stimulus. As larger studies become available for increasingly diverse populations, tissues, and functional genomic measurements, we foresee that further trans regulation discoveries will reveal new disease genes and mechanisms.
Methods
Cross-validated cis-genetic prediction with cvBLUP
The cross-validated prediction by cvBLUP is a cross-validated version of a standard linear mixed model (LMM) prediction, or best linear unbiased predictor (BLUP). We consider an LMM as below:
$$y = X\beta + Zb + \epsilon \qquad (1)$$
where $y$ is the phenotype, here the expression of a gene, measured on $N$ individuals. $X$ is a matrix of covariates, including an intercept, with fixed effects $\beta$. $Z$ is a standardized $N \times M$ matrix of $M$ SNPs within the cis region of the gene. $b$ is the vector of effect sizes for the SNPs in $Z$, modeled as normally distributed by $b \sim N(0, \frac{\sigma_g^2}{M} I_M)$. The total cis-genetic contribution to the phenotype is then the product $Zb$, with distribution $Zb \sim N(0, \sigma_g^2 K)$, where $K$ is the genetic relationship matrix defined as $K = ZZ^T / M$. Finally, $\epsilon$ is a vector of non-genetic effects, modeled as $\epsilon \sim N(0, \sigma_e^2 I_N)$. Phenotype $y$ therefore has the distribution $y \sim N(X\beta, V)$, with $V = \sigma_g^2 K + \sigma_e^2 I_N$. We use standard REML to get estimates of the LMM variance components, $\hat{\sigma}_g^2$ and $\hat{\sigma}_e^2$. The estimate of the narrow-sense heritability is then the ratio of estimated genetic variance to total variance: $\hat{h}^2 = \hat{\sigma}_g^2 / (\hat{\sigma}_g^2 + \hat{\sigma}_e^2)$.
The BLUPs for the random effects are $\hat{b} = \frac{\hat{\sigma}_g^2}{M} Z^T \hat{V}^{-1} \tilde{y}$, and the genetic predictor, or fitted value of $y$, is calculated as $\hat{y}_{BLUP} = Z\hat{b} = \hat{\sigma}_g^2 K \hat{V}^{-1} \tilde{y}$, where $\hat{V} = \hat{\sigma}_g^2 K + \hat{\sigma}_e^2 I_N$ and $\tilde{y} = y - X\hat{\beta}$ are the phenotypic residuals after removing the contributions of the covariates $X$.
The standard BLUPs, $\hat{y}_{BLUP}$, over-fit the training data, meaning that they are highly correlated with the noise term $\epsilon$. Cross-validation is often used to mitigate overfitting. In our analysis, we use a leave-one-out cross-validation scheme to generate a set of out-of-sample LMM predictions: each subject $k$ is left out of the dataset in turn; the remaining subjects are used to estimate $\hat{b}^{(-k)}$, and then the genetic contribution to the left-out subject’s phenotype is defined using $\hat{y}_{cvBLUP,k} = Z_k \hat{b}^{(-k)}$, where $Z_k$ is the $k$-th row of $Z$.
The resulting collection of cross-validated LMM estimates, or cvBLUPs, is still a strong estimator of the true cis-genetic contribution to the phenotype, but does not have spurious correlations with $\epsilon$. Fortunately, leave-one-out cross-validation is mathematically simple for BLUPs. Given $\hat{y}_{BLUP} = H\tilde{y}$ with $H = \hat{\sigma}_g^2 K \hat{V}^{-1}$, such that the prediction is a linear operator applied to $\tilde{y}$, the out-of-sample predictions and prediction errors can be calculated from a single model fit. The cvBLUPs can therefore be calculated in linear time as:
$$\hat{y}_{cvBLUP,k} = \frac{\hat{y}_{BLUP,k} - H_{kk}\,\tilde{y}_k}{1 - H_{kk}} \qquad (2)$$
For any gene i, the cross-validated cis-genetic prediction value (CVGPi) is calculated from all cis genetic variants (within ±100 kb of the transcription start site) of gene i by using Equation (2).
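The single-fit shortcut can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the variance components are already estimated and held fixed across folds (no REML step is included), that covariates have already been regressed out, and that the leave-one-out prediction takes the standard linear-smoother form (BLUP minus the diagonal of the prediction operator times the residual phenotype, divided by one minus that diagonal). The function names and toy data are hypothetical.

```python
import numpy as np


def blup_and_cvblup(Z, y_resid, sigma_g2, sigma_e2):
    """Return (in-sample BLUP, leave-one-out cvBLUP) from one model fit.

    Z        : N x M standardized cis-genotype matrix
    y_resid  : length-N phenotype with covariates already regressed out
    sigma_g2, sigma_e2 : variance components, treated as known (assumption)
    """
    n, m = Z.shape
    K = Z @ Z.T / m                             # genetic relationship matrix
    V = sigma_g2 * K + sigma_e2 * np.eye(n)     # phenotypic covariance
    H = sigma_g2 * K @ np.linalg.inv(V)         # linear operator: y_blup = H @ y
    y_blup = H @ y_resid
    h = np.diag(H)
    y_cv = (y_blup - h * y_resid) / (1.0 - h)   # leave-one-out shortcut
    return y_blup, y_cv


def naive_loo(Z, y_resid, sigma_g2, sigma_e2):
    """Explicit leave-one-out: refit on N-1 samples, predict the held-out one."""
    n, m = Z.shape
    K = Z @ Z.T / m
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        V_train = sigma_g2 * K[np.ix_(keep, keep)] + sigma_e2 * np.eye(n - 1)
        out[i] = sigma_g2 * K[i, keep] @ np.linalg.solve(V_train, y_resid[keep])
    return out


# Toy data (hypothetical): 60 samples, 30 cis SNPs, true heritability near 0.3.
rng = np.random.default_rng(0)
Z = rng.standard_normal((60, 30))
y = Z @ rng.normal(0.0, np.sqrt(0.3 / 30), 30) + rng.normal(0.0, np.sqrt(0.7), 60)

y_blup, y_cv = blup_and_cvblup(Z, y, sigma_g2=0.3, sigma_e2=0.7)
print("shortcut matches explicit leave-one-out:",
      np.allclose(y_cv, naive_loo(Z, y, 0.3, 0.7)))
```

The explicit loop solves N separate systems, whereas the shortcut reuses the one fitted operator, which is the source of the computational saving described above. In a full analysis the variance components would be estimated once by REML on all samples rather than supplied by hand.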
Gene-based association testing of trans signals (GBAT)
For any gene i, we test the gene-based trans association between CVGPi and the expression level of gene j (Ej). Gene j is in trans to gene i if it is at least 1Mb away from gene i. To improve association power and reduce spurious trans association signals, we used supervised SVA conditioning on CVGPi (SVi) as covariates11, 14, 24. We first regressed out SVi from the quantile normalized expression Ej in a linear model Ej ∼ SVi. Then we used the quantile normalized residuals Ej’ to test for trans association: Ej’ ∼ CVGPi.
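The two-stage test described above can be sketched as follows. This is an illustrative outline under several assumptions: "quantile normalized" is taken to mean a rank-based inverse normal transform, the surrogate variables SVi are supplied as an input (supervised SVA itself is not implemented here), and the p-value uses a large-sample normal approximation to the t-statistic. All function names and the toy data are hypothetical.

```python
import math
from statistics import NormalDist

import numpy as np


def quantile_normalize(x):
    """Rank-based inverse normal transform (ties broken arbitrarily)."""
    x = np.asarray(x, dtype=float)
    ranks = np.argsort(np.argsort(x)) + 1            # ranks 1..N
    nd = NormalDist()
    return np.array([nd.inv_cdf((r - 0.5) / len(x)) for r in ranks])


def gbat_test(e_j, cvgp_i, sv_i):
    """Two-stage test of Ej' ~ CVGPi; returns (slope, two-sided p-value).

    e_j    : length-N expression of trans gene j
    cvgp_i : length-N cross-validated cis-genetic prediction of gene i
    sv_i   : N x k surrogate variables estimated conditional on CVGPi
    """
    # Stage 1: regress the normalized expression on SVi (with intercept).
    e_norm = quantile_normalize(e_j)
    design = np.column_stack([np.ones(len(e_norm)), sv_i])
    coef, *_ = np.linalg.lstsq(design, e_norm, rcond=None)
    resid = quantile_normalize(e_norm - design @ coef)   # Ej'

    # Stage 2: simple linear regression of Ej' on CVGPi.
    x = np.asarray(cvgp_i, dtype=float)
    x = x - x.mean()
    y = resid - resid.mean()
    beta = (x @ y) / (x @ x)
    err = y - beta * x
    se = math.sqrt((err @ err) / (len(y) - 2) / (x @ x))
    p = math.erfc(abs(beta / se) / math.sqrt(2))   # normal approximation
    return beta, p


# Toy data (hypothetical): 500 samples, 3 surrogate variables, true effect 0.3.
rng = np.random.default_rng(1)
n = 500
sv_i = rng.standard_normal((n, 3))
cvgp_i = rng.standard_normal(n)
e_j = 0.3 * cvgp_i + sv_i @ np.array([0.5, -0.4, 0.2]) + rng.standard_normal(n)

beta, p = gbat_test(e_j, cvgp_i, sv_i)
print(f"beta = {beta:.3f}, p = {p:.2e}")
```

Keeping the covariate adjustment and the association test as separate stages, with a normalization step in between, mirrors the order of operations in the text; a single joint regression would be a different model.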
To compute the FDR levels for all trans association tests, we used only the summary association p-values from the inter-chromosomal trans association tests. We used 10% FDR as the threshold for significant trans signals. We further removed gene pairs that are cross-mappable due to sequence similarities surrounding the gene pairs25.
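The thresholding scheme, in which the FDR cutoff is calibrated on inter-chromosomal tests and then applied to every gene pair, might look like the sketch below. As a stand-in for the q-value procedure cited in the main text, it uses Benjamini-Hochberg adjusted p-values (equivalent to q-values with the null proportion fixed at 1), so it is more conservative than a procedure that estimates that proportion. The function names and example p-values are hypothetical.

```python
import numpy as np


def bh_qvalues(p):
    """Benjamini-Hochberg adjusted p-values (q-values with pi0 fixed at 1)."""
    p = np.asarray(p, dtype=float)
    n = len(p)
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downwards.
    q_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.minimum(q_sorted, 1.0)
    return q


def pvalue_cutoff(p_inter, fdr=0.10):
    """Largest inter-chromosomal p-value with q <= fdr (-inf if none pass)."""
    p_inter = np.asarray(p_inter, dtype=float)
    passing = p_inter[bh_qvalues(p_inter) <= fdr]
    return passing.max() if passing.size else -np.inf


# Hypothetical p-values for inter- and intra-chromosomal gene pairs.
p_inter = np.array([0.0001, 0.0004, 0.02, 0.3, 0.6, 0.9])
p_intra = np.array([0.0002, 0.05])

cutoff = pvalue_cutoff(p_inter, fdr=0.10)
significant_inter = p_inter <= cutoff
significant_intra = p_intra <= cutoff    # same cutoff reused for intra pairs
print("cutoff:", cutoff)
print("significant intra-chromosomal pairs:", significant_intra)
```

Calibrating on inter-chromosomal pairs only keeps nearby intra-chromosomal pairs, which may carry residual cis-like signal, from influencing the threshold.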
URLs
GBAT: the pipeline will be available before publication
PLINK 2.0: https://www.cog-genomics.org/plink/2.0/
Michigan Imputation Server: https://imputationserver.sph.umich.edu/index.html
ENCODE 36 k-mer of the reference human genome: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeCrgMapabilityAlign36mer.bigWig
DAVID 6.8: https://david.ncifcrf.gov/tools.jsp
Matrix eQTL: http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/
GWAS Catalog: https://www.ebi.ac.uk/gwas/
Acknowledgements
We thank members of the Zaitlen lab, the Price lab and Y Li for helpful discussions. This work was supported by R01 MH115676.