Abstract
For many complex traits, gene regulation is likely to play a crucial mechanistic role. How the genetic architectures of complex traits vary between populations, and how that variation affects genetic prediction, is not well understood, in part due to the historical paucity of GWAS in populations of non-European ancestry. We used data from the MESA (Multi-Ethnic Study of Atherosclerosis) cohort to characterize the genetic architecture of gene expression within and between diverse populations. Genotype and monocyte gene expression data were available in individuals with African American (AFA, n=233), Hispanic (HIS, n=352), and European (CAU, n=578) ancestry. We performed expression quantitative trait loci (eQTL) mapping in each population and show that genetic correlation of gene expression depends on shared ancestry proportions. Using elastic net modeling with cross-validation to optimize genotypic predictors of gene expression in each population, we show that the genetic architecture of gene expression is sparse across populations. We found that the best-predicted gene, HLA-DRB5, was the same across populations, with R² > 0.81 in each population. However, there were 1094 (11.3%) well-predicted genes in AFA and 372 (3.8%) well-predicted genes in HIS that were poorly predicted in CAU. Using genotype weights trained in MESA to predict gene expression in 1000 Genomes populations, we showed that a training set with ancestry similar to the test set predicts gene expression better in test populations, demonstrating an urgent need for diverse population sampling in genomics. Our predictive models in diverse cohorts are publicly available for use in transcriptome mapping methods at http://predictdb.hakyimlab.org/.
Author summary
Most genome-wide association studies (GWAS) have been conducted in populations of European ancestry, leading to a disparity between populations in our understanding of the genetics of complex traits. For many complex traits, gene regulation is likely to play a critical mechanistic role, given the consistent enrichment of regulatory variants among trait-associated variants. However, it is still unknown how the effects of these key variants differ across populations. We used data from MESA to study the underlying genetic architecture of gene expression by optimizing gene expression prediction within and across diverse populations. Genotype and gene expression data were available for individuals with African American (AFA, n=233), Hispanic (HIS, n=352), and European (CAU, n=578) ancestry. After calculating prediction performance, we found that many genes were well predicted in AFA and HIS but poorly predicted in CAU. We further showed that a training set with ancestry similar to the test set resulted in better gene expression predictions, demonstrating the need to incorporate diverse populations in genomic studies. Our gene expression prediction models are publicly available to facilitate future transcriptome mapping studies in diverse populations.