ABSTRACT
Naturally occurring functional genetic variation is often employed to identify genetic loci that regulate specific traits. Existing approaches to link functional genetic variation to quantitative phenotypic outcomes typically evaluate one or several traits at a time. Advances in high throughput phenotyping now enable datasets which include information on dozens or hundreds of traits scored across multiple environments. Here, we develop an approach to use data from many phenotypic traits simultaneously to identify causal genetic loci. Using data for 260 traits scored across a maize diversity panel, we demonstrate that a distinct set of genes is identified relative to conventional genome wide association. The genes identified using this many-trait approach are more likely to be independently validated than the genes identified by conventional analysis of the same dataset. Genes identified by the new many-trait approach share a number of molecular, population genetic, and evolutionary features with a gold standard set of genes characterized through forward genetics. These features, as well as substantially stronger functional enrichment and purification, separate them from both genes identified by conventional genome wide association and from the overall population of annotated gene models. These results are consistent with a large subset of annotated gene models in maize playing little or no role in determining organismal phenotypes.
Main
Genetics seeks to link individual genes to their roles in determining the characteristics of an organism. Early QTL studies utilized individual phenotypic1 or chromosomal markers2. Genome Wide Association Studies (GWAS) can now scan hundreds of thousands or millions of markers for association with a target trait3-5. Statistical methods for Phenome Wide Association Studies (PheWAS) have also been developed6-8. Unifying GWAS and PheWAS produces multiple testing problems which make retaining statistical power challenging9, 10. Multivariate GWAS methodologies have been shown to increase power11-17. However, these approaches face challenges scaling to hundreds or thousands of traits. Medical record data mining18, 19, high throughput phenotyping20, and scoring of molecular traits such as transcript and metabolite abundance21-24 are making high dimensional trait datasets increasingly common. Here we employ a published dataset of 260 distinct traits of maize (Zea mays)25, 26 to evaluate a multi-trait multi-SNP framework to identify genotype-phenotype associations. Genes identified by our model show greater independent validation27 and increased similarity to a gold standard set of genes characterized by knockout phenotypes in maize28.
Maize HapMap3 SNPs25, 26 were imputed and filtered based on minor allele frequency, linkage disequilibrium, and distance to annotated gene models to produce a set of 557,968 unique SNPs associated with 32,084 maize gene models (See Methods). Filtering of a set of 57 unique traits scored across up to 16 environments resulted in a set of 260 trait datasets with a median missing data rate of 18%29. Unobserved trait datapoints were imputed using PHENIX30 (Supplementary Table S1).
Two widely used GWAS methodologies – GLM and FarmCPU31, 32 – were employed to identify gene-trait associations (Table 1). A given gene may be identified as statistically significantly associated with a phenotype in a single analysis, but fail to survive multiple testing correction when many trait datasets are analyzed sequentially (Figure 1a). We developed an approach based upon stepwise regression model fitting – Genome-Phenome Wide Association Study (GPWAS) – where the set of SNPs falling within a gene is treated as the response variable, and both population structure and individual trait datasets are employed to explain the patterns of genetic variance across the population (See Methods; Figure S1). In principle, this approach should address the challenges of partially correlated traits and genotype matrices (Figure 1c). Given the complexities introduced by the iterative model selection step, we chose to correct for multiple testing using a permutation-based method (See Methods) which has been shown to be robust in controlling false positives in both GWAS and PheWAS studies33, 34. With an estimated false discovery rate (FDR) < 1.00e-3, 1,776 genes were classified as significantly associated with phenome-wide variation (Figure S2). Comparison gene sets were identified by GLM31 and FarmCPU32 using the same dataset (See Methods; Table 1).
The accuracy of each of these three approaches was validated using data from a second much larger dataset, the maize nested association mapping (NAM) population27, 35. The comparison employed the subset of 29,430 gene models with clear 1:1 correspondence between RefGenV4 and RefGenV2 maize gene models. Genes identified by GPWAS showed substantially higher overlap with the validation dataset than those identified by GLM GWAS (p=2.15e-5; Chi-squared test) and FarmCPU GWAS (p=9.17e-3; Chi-squared test).
GPWAS produces a list of the specific traits which have been included in the model for a particular gene (Figure 1b). However, the associations of individual phenotypes identified within the GPWAS model for a given gene are not rigorously controlled for false discovery. Anther ear1 (an1) is a classical maize gene shown to encode an ent-copalyl diphosphate synthase involved in gibberellic acid biosynthesis. Knockout alleles of an1 have been shown to produce reduced or abolished tassel branching, reduced plant height, delayed growth, and delayed flowering36. Anther ear1 is also associated with quantitative variation in tassel spike length37. The an1 gene was not significantly associated with any individual trait in conventional GWAS in this dataset. GPWAS found that the association between this gene and a group of 11 traits was statistically significant (FDR < 0.001). Many traits incorporated into this model were consistent with the characterized function of this gene (Figure S3).
Maize genes with known mutant phenotypes – classical mutants28 – are more likely to show significant mRNA expression in at least one tissue, and tend to be expressed across a broader set of tissues than other gene models (Table 1; Supplementary Table S2). Genes identified by GPWAS were more likely to be expressed > 1 FPKM in at least one of the 92 tissues/timepoints than those identified by GLM and FarmCPU, although the latter difference was not statistically significant (p=0.007, p=0.085; chi-squared test)39, 40 (Table 1). Genes identified by GLM GWAS, FarmCPU GWAS and GPWAS all showed much higher breadth of expression than other gene models (Supplementary Table S2). Genes identified by GPWAS were longer, and contained both more total SNPs and a higher density of SNPs per kb of gene space than genes identified by conventional GWAS (Supplementary Table S3). However, both populations showed the same bias towards greater polymorphism rates relative to all annotated genes (Figure S4; Supplementary Table S3). Permutation testing revealed a modest bias towards the identification of genes with greater numbers of SNPs by GPWAS; however, this bias was insufficient to explain the patterns observed in real data (Supplementary Table S4). Classical maize mutants exhibited the opposite pattern, showing less polymorphism than the overall gene set (Supplementary Table S3). Both genes identified through forward genetics screens and genes identified through analysis of natural variation must play a role in specifying the characteristics of an organism. However, unlike genes identified through forward genetics, genes identified through analysis of natural variation must also be represented by functionally variable alleles in the studied population. It may be this second criterion which explains the additional bias towards the identification of genes with high polymorphism rates in GPWAS and GWAS.
Maize classical mutants are substantially less likely to exhibit presence absence variation (PAV) (Table 2). Gene models identified by both GWAS and GPWAS were much less likely to exhibit PAV than other gene models (Table 2; Supplementary Table S5). GPWAS was less likely to identify gene models with PAV than conventional GWAS (p=1.55e-3; Chi-squared test). Genes exhibiting PAV in maize are less likely to be conserved at syntenic locations in other species41. The frequency of syntenic conservation was the inverse of the pattern observed for PAV (Table 2; Supplementary Table S5), and the increased syntenic conservation of GPWAS identified genes relative to conventional GWAS was statistically significant (p < 2.2e-16; p=1.46e-6; Chi-squared test; GLM and FarmCPU respectively). It was also possible to calculate Ka/Ks ratios for maize genes with syntenic orthologs in sorghum. Classical mutants show much lower Ka/Ks ratios – a sign of stronger purifying selection – than the overall population of conserved genes (Figure 2; Table 2). The Ka/Ks ratios of gene models identified by GLM GWAS were not significantly different from the overall population of conserved gene models (Figure 2; Table 2). Gene models identified by GPWAS showed significantly lower Ka/Ks ratios than all conserved gene models (p=1.24e-9), gene models identified by GLM GWAS (p=1.09e-9) and FarmCPU GWAS (p=4.24e-5; Mann–Whitney U test) (Figure 2; Table 2).
A set of 137 GO terms showed statistically significant (Bonferroni corrected p-value < 0.05) enrichment (119 terms) or purification (18 terms) among genes uniquely identified by GPWAS relative to GLM GWAS. In contrast, only 15 GO terms (11 enriched and 4 purified) were identified in the corresponding set of genes uniquely identified by GLM GWAS (Figure 3a, Supplementary Table S6). Genes annotated as involved in development, response to stimuli, cell wall and cell membrane metabolism, hormone signalling, disease response, and transport were all over represented among genes associated with phenome-wide variation, while those associated with nucleotide metabolism, DNA replication, translation, and telomere organization were disproportionately unlikely to show such associations. Relative to the total number of gene models a given GO term is assigned to (i.e. information content), p values of enriched GO terms tend to be more significant (Figure 3a). Similar results were obtained for FarmCPU even after controlling for the total number of genes identified (see Supplementary Information; Figure 3b; Supplementary Table S6). The number of GO terms per gene and the proportion of genes with no assigned GO term did not differ dramatically between gene populations; however, the median GO term assigned to a gene uniquely identified by GPWAS had higher information content than the median GO term assigned to a gene uniquely identified by GLM GWAS (See Methods; Supplementary Table S7). These results are consistent with unified GPWAS identifying a less random subset of annotated genes than sequential GWAS for each trait.
The statistical method we employ for GPWAS requires a complete absence of missing data. Applying it to this dataset was only possible because advances in kinship based phenotypic imputation are now available30. It also requires binning individual genetic markers into groups associated with individual genes. This binning is likely to be imperfect, as regulatory regions of genes can be separated from coding sequence by tens of kb in maize45, 46. Noncoding regulatory sequences, many distant from annotated genes, have been shown to explain approximately 40% of the total phenotypic variation in maize47. Finally, the present GPWAS algorithm and implementation are quite computationally expensive. We estimate the GPWAS analyses presented here required a total of approximately 6.9 years of CPU time.
Today, there are only a few datasets which contain as many different traits scored for the same population across multiple environments as the Panzea dataset. However, the rapid emergence of high throughput plant phenotyping technologies makes it likely that high dimensional trait datasets – where the number of measured phenotypes exceeds even the number of individuals in the population – will become much more common in the future20. Increases in the total number of phenotypes should increase the power and accuracy of GPWAS. However, if many highly correlated traits are included, the result can be issues with multicollinearity that make the statistical estimation and inference procedures employed unstable. The current statistical procedure also encounters challenges once the number of input traits exceeds the number of individuals in the population. In these cases, it would be best to avoid the common practice of employing BLUP scores48, as this approach strips out information on trait plasticity across environments, and trait plasticity is often controlled by sets of genes distinct from those controlling variation in multi-environment trait means49. Automatic variable selection and/or dimensional reduction could be incorporated into future GPWAS implementations. Here we have developed a new approach to identify genes with statistical links to variation in a large set of diverse plant traits scored for a maize diversity panel across multiple environments, and showed that it exhibits greater consistency with genes identified as controlling organismal phenotypes in an independent population than do genes identified using two conventional GWAS approaches. We also showed that gene models identified by GPWAS exhibit greater structural, molecular, and evolutionary similarity to gold standard maize genes identified through forward genetics than genes identified by conventional trait-by-trait GWAS.
Over the past three decades, without substantial discussion or debate, many in the scientific community have moved from a definition of genes that was based on organismal function to one which is based on molecular features50-52. It may be possible to combine all of these data types to quantitatively determine the subset of gene models likely to be involved in specifying the characteristics of organisms. These patterns could also guide prioritization in future reverse genetics efforts.
Methods
Genotype and Phenotype Sources, Filtering, and Imputation
Raw genotype calls in RefGenV4 coordinates from resequencing data of the maize 282 association panel26 were retrieved from PanZea. Missing genotypes were imputed using Beagle (version: 2018-06-10)53. Only biallelic SNPs with less than 80% missing data points were used as input for imputation. After imputation, SNPs with a MAF (Minor Allele Frequency) less than 0.05 or which were scored as heterozygous in more than 10% of samples were discarded. A phenotype file (traitMatrix maize282NAM v15-130212.txt) containing a total of 285 traits, corresponding to 57 unique types of phenotypes scored in 1 to 16 environments, was downloaded from PanZea. A set of 277 accessions with identical names in the HapMap3 data release and the PanZea trait data were employed for all downstream analyses.
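The post-imputation SNP filter described above can be illustrated with a minimal sketch. It assumes genotypes are stored as a NumPy array of alternate-allele counts (0, 1 or 2) with one row per SNP; that coding and the function name `filter_snps` are illustrative assumptions, not part of the published pipeline.

```python
import numpy as np

def filter_snps(genotypes, min_maf=0.05, max_het=0.10):
    """Return a boolean mask of SNPs (rows) that pass both filters.

    genotypes: array of shape (n_snps, n_samples) coded 0, 1 or 2
    (count of the alternate allele); this coding is an assumption.
    """
    g = np.asarray(genotypes)
    alt_freq = g.mean(axis=1) / 2.0
    maf = np.minimum(alt_freq, 1.0 - alt_freq)
    het_rate = (g == 1).mean(axis=1)
    # Keep SNPs with MAF >= 0.05 that are heterozygous in at most 10% of samples.
    return (maf >= min_maf) & (het_rate <= max_het)
```

The returned mask can be used to subset both the genotype matrix and the accompanying SNP annotation so the two stay aligned.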
Maize gene regions were extracted from AGPv4.39 downloaded from Ensembl. SNPs were clustered based on R2 > 0.8 and only one randomly selected SNP per cluster was retained. If the number of SNPs after collapsing highly correlated clusters exceeded 138 (50% of the number of inbreds scored), a random subsample of 138 SNPs was employed for downstream analyses. Identical final SNP sets were employed for GPWAS and GWAS analyses.
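One way to picture the per-gene SNP thinning is the sketch below. The text gives only the R2 > 0.8 cutoff, the random choice of one SNP per cluster, and the cap of 138 SNPs; the greedy clustering rule, the 0/1/2 genotype coding, and the name `prune_gene_snps` are assumptions made for illustration.

```python
import numpy as np

def prune_gene_snps(genotypes, r2_cutoff=0.8, max_snps=138, seed=0):
    """Return sorted indices of the SNPs retained for one gene.

    genotypes: array (n_snps, n_samples) of polymorphic SNPs coded 0/1/2.
    Greedy clustering is an assumption; the source only states the
    R2 > 0.8 cutoff and the cap of 138 SNPs.
    """
    g = np.asarray(genotypes, dtype=float)
    rng = np.random.default_rng(seed)
    r2 = np.atleast_2d(np.corrcoef(g)) ** 2
    unassigned = np.ones(g.shape[0], dtype=bool)
    kept = []
    for i in range(g.shape[0]):
        if not unassigned[i]:
            continue
        # Seed SNP plus every still-unassigned SNP in high LD with it.
        members = np.flatnonzero(unassigned & (r2[i] > r2_cutoff))
        unassigned[members] = False
        # Retain one randomly chosen representative of the cluster.
        kept.append(int(rng.choice(members)))
    if len(kept) > max_snps:
        kept = rng.choice(kept, size=max_snps, replace=False).tolist()
    return sorted(int(k) for k in kept)
```

Fixing the seed keeps the retained SNP set reproducible, which matters because the same final SNP sets are reused across GPWAS and GWAS.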
Of the 285 initial trait datasets, 25 were removed because the data file contained a recorded trait value for only a single individual among the 277 maize inbreds genotyped, leaving a total of 260 trait datasets. Missing phenotypes were imputed using a Bayesian multiple-phenotype mixed model30 and a kinship matrix calculated from 1.24 million SNPs in GEMMA15. Accuracy of phenotypic imputation was assessed independently for each trait with a sufficient number of real observations for evaluation, using ten iterations of masking 1% of available records for each trait and comparing imputed and masked values for each trait (Supplementary Table S1).
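The masking procedure used to assess imputation accuracy can be sketched as follows. The imputer here is a column-mean placeholder standing in for the mixed-model imputation, and mean absolute error is an assumed comparison metric, since the text does not name one; `mean_impute` and `masked_error` are hypothetical names.

```python
import numpy as np

def mean_impute(trait_matrix):
    """Placeholder imputer (column means); stands in for the mixed model."""
    x = np.array(trait_matrix, dtype=float)
    col_means = np.nanmean(x, axis=0)
    rows, cols = np.where(np.isnan(x))
    x[rows, cols] = col_means[cols]
    return x

def masked_error(trait_matrix, impute=mean_impute, frac=0.01, n_iter=10, seed=0):
    """Mean absolute error per trait between masked and imputed values.

    Assumes every trait (column) has at least two observed values.
    """
    x = np.array(trait_matrix, dtype=float)
    rng = np.random.default_rng(seed)
    abs_err = [[] for _ in range(x.shape[1])]
    for _ in range(n_iter):
        masked = x.copy()
        hidden = []
        for j in range(x.shape[1]):
            observed = np.flatnonzero(~np.isnan(x[:, j]))
            # Hide 1% of the observed records for this trait (at least one).
            n_mask = max(1, int(round(frac * len(observed))))
            rows = rng.choice(observed, size=n_mask, replace=False)
            masked[rows, j] = np.nan
            hidden.append(rows)
        imputed = impute(masked)
        for j, rows in enumerate(hidden):
            abs_err[j].extend(np.abs(imputed[rows, j] - x[rows, j]))
    return [float(np.mean(e)) for e in abs_err]
```

Any imputation function with the same signature (a matrix with NaNs in, a complete matrix out) can be passed as `impute`.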
GPWAS Analysis
We propose a model selection approach to adaptively choose the most significant phenotypes associated with each gene. For a given gene, all of its SNPs are treated jointly as a multivariate response. For the analysis of genes on each chromosome, a separate principal component analysis (PCA) was conducted using markers solely from the other 9 chromosomes to reduce the endogenous correlations between genes and principal components54. A subset of 1.24 million SNPs distributed across both intragenic and intergenic regions on all 10 chromosomes was used to perform PCA for both GPWAS and GWAS. The first three PCs were calculated using the R prcomp function and included in the GPWAS analysis. Let αin and αout be the p-value thresholds for adding and removing phenotypes, respectively. If a phenotype has a p-value smaller than αin, we consider it potentially significant and add it to the regression model. Conversely, if the p-value of a phenotype already in the model is larger than αout, we consider it insignificant and exclude it from the model. By default, we choose αin = αout = 0.01 for each gene.
The stepwise selection procedure is as follows:
(1) Start with the multi-response model with all the SNPs as responses and the first three PC scores as covariates. Search for the most significant phenotype across all the phenotype measurements. Include this phenotype in the model if its p-value is below αin. Otherwise, declare that no phenotype is significant for this gene.
(2) For the 𝓁th step, add each one of the remaining phenotypes into the existing model with the covariates that have already been selected, and calculate its p-value. This p-value measures the effect of this phenotype on the responses given all the selected phenotypes from the previous steps.
(3) Find the remaining phenotype with the minimum p-value. Include this phenotype in the model if its p-value is below αin. Otherwise, add no new phenotype.
(4) The newly added covariate may be correlated with the existing covariates in the model. This may change their corresponding p-values. Fit all the selected phenotypes jointly in the model and drop the phenotype with the largest p-value that is greater than the cutoff value αout.
(5) Repeat steps (2), (3) and (4) until no phenotypes can be added or removed from the model. This is considered the final model for the targeted gene.
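The selection loop above can be sketched in Python as follows. This is an illustration under stated assumptions, not the published R implementation: the design matrix includes an intercept, each single-phenotype p-value comes from the exact F transformation of Wilks' lambda for a one-column hypothesis, and the names `term_pvalue` and `stepwise_select` are hypothetical. The gene-level partial F test is not covered.

```python
import numpy as np
from scipy import stats

def _log_det_resid(X, Y):
    """Log-determinant of the residual cross-product matrix of Y regressed on X."""
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ beta
    return np.linalg.slogdet(resid.T @ resid)[1]

def term_pvalue(X_reduced, x_new, Y):
    """P-value for adding one column to a multi-response linear model.

    Requires linearly independent response columns and n - k - p + 1 > 0.
    """
    X_full = np.column_stack([X_reduced, x_new])
    n, p = Y.shape
    df_err = n - X_full.shape[1]
    wilks = np.exp(_log_det_resid(X_full, Y) - _log_det_resid(X_reduced, Y))
    f_stat = (1.0 - wilks) / wilks * (df_err - p + 1) / p
    return float(stats.f.sf(f_stat, p, df_err - p + 1))

def stepwise_select(Y, pcs, phenos, alpha_in=0.01, alpha_out=0.01, max_iter=35):
    """Return the indices of phenotype columns kept in the final model for one gene.

    Y: SNP matrix (n, p) used as the response; pcs: (n, 3); phenos: (n, n_traits).
    """
    base = np.column_stack([np.ones(Y.shape[0]), pcs])  # intercept is an assumption

    def design(idx):
        return np.column_stack([base] + [phenos[:, j] for j in idx])

    selected = []
    for _ in range(max_iter):
        changed = False
        # Forward step: test each remaining phenotype given the current model.
        rest = [j for j in range(phenos.shape[1]) if j not in selected]
        add_p = {j: term_pvalue(design(selected), phenos[:, j], Y) for j in rest}
        if add_p and min(add_p.values()) < alpha_in:
            selected.append(min(add_p, key=add_p.get))
            changed = True
        # Backward step: refit jointly and drop the weakest phenotype above alpha_out.
        drop_p = {j: term_pvalue(design([k for k in selected if k != j]), phenos[:, j], Y)
                  for j in selected}
        if drop_p and max(drop_p.values()) > alpha_out:
            selected.remove(max(drop_p, key=drop_p.get))
            changed = True
        if not changed:
            break
    return selected
```

Working with log-determinants rather than raw determinants avoids overflow when a gene contributes many SNP columns to the response.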
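The final model can be represented as:
$$ g_{k,i} = \beta_{i1}\,\mathrm{PC1}_{k} + \beta_{i2}\,\mathrm{PC2}_{k} + \beta_{i3}\,\mathrm{PC3}_{k} + \sum_{j=1}^{v_i} \tau_{i(j)}\,\mathrm{Phe}_{k,(j)} + \varepsilon_{k,i} $$
with \varepsilon_{k,i} denoting the residual term.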
Here, the subscripts k and i represent the kth observation and the ith gene, respectively. There are vi selected phenotypes for the ith gene, where vi ≤ 260. The selected phenotypes {Phek,(j)} are a subset of the collection of all the phenotypes {Phek,1, Phek,2, …, Phek,260}, and τi(j) is the coefficient for the selected phenotype Phek,(j) of the ith gene. The first three PC scores PC1, PC2 and PC3 were always included in the model with their effects βi1, βi2 and βi3. Note that gk,i, βi1, βi2, βi3 and τi(j) could be vectors corresponding to the multiple SNPs within the ith gene. Phenotype selection was iterated 35 times for each scanned gene. All the unselected phenotypes were considered insignificant for that gene. The p-value of each gene was determined by a partial F test comparing the final model, containing both the first three PCs and the selected phenotypes, with the initial model containing only the PCs. The 32,084 gene models were split into batches of 200 genes, and the genomic data for each batch were extracted and submitted to a cluster (Intel Xeon E5-2670 2.60GHz, 2 CPU/16 cores per node) at the Holland Computing Center at the University of Nebraska-Lincoln for GPWAS processing with the input phenotypes.
FDR cutoffs for the partial F-test were based on the results from 20 permutation analyses in which the values for each trait were independently shuffled among the 277 genotyped individuals and the entire GPWAS pipeline was rerun for all genes. The code implementing the above analyses in R and associated documentation has been published as “GPWAS”, which is available from the following link: https://github.com/shanwai1234/GPWAS. Selected significant GPWAS genes with incorporated phenotypes are listed in Supplementary Table S8.
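A minimal sketch of how a permutation-based FDR estimate of this kind can be computed is given below. The exact estimator used in the study is not spelled out here, so the ratio of the mean number of genes passing a threshold in permuted data to the number passing in real data is an assumption, as are the names `shuffle_traits` and `permutation_fdr`.

```python
import numpy as np

def shuffle_traits(trait_matrix, rng):
    """Shuffle each trait (column) independently among individuals (rows)."""
    x = np.array(trait_matrix)
    for j in range(x.shape[1]):
        x[:, j] = rng.permutation(x[:, j])
    return x

def permutation_fdr(observed_p, permuted_p, threshold):
    """Estimated FDR at a gene-level p-value threshold.

    observed_p: p-values from the real data, shape (n_genes,)
    permuted_p: p-values from each permutation, shape (n_perm, n_genes)
    """
    observed_p = np.asarray(observed_p)
    permuted_p = np.atleast_2d(permuted_p)
    n_real = np.sum(observed_p <= threshold)
    if n_real == 0:
        return 0.0
    # Average number of genes passing the threshold under the permuted (null) data.
    n_null = np.mean(np.sum(permuted_p <= threshold, axis=1))
    return float(min(1.0, n_null / n_real))
```

Scanning `threshold` over the observed p-values and keeping the largest value whose estimated FDR stays below the target yields a cutoff of the kind described above.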
GWAS Analysis
GLM GWAS analyses were conducted using the algorithm first defined by Price and coworkers31 and FarmCPU GWAS with the algorithm defined by Liu and colleagues32. Both algorithms were run using the R-based software rMVP (A Memory-efficient, Visualization-enhanced, and Parallel-accelerated Tool For Genome-Wide Association Study) (https://github.com/XiaoleiLiuBio/rMVP). Both analysis methods were run using maxLoop = 10 and the variance component method method.bin = “Fast-LMM”55. The first three principal components were considered as additional covariates for population structure control. For comparison to GPWAS results, each gene was assigned the p-value of the single most significant SNP among all the SNPs assigned to that gene across the 260 analyzed phenotypes in the GWAS model.
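The gene-level summary of GWAS results described above amounts to taking a minimum over SNPs and traits. A minimal sketch, assuming the SNP-level p-values are held in an array with one row per SNP and one column per trait (the function name `gene_level_pvalues` is hypothetical):

```python
import numpy as np

def gene_level_pvalues(snp_pvalues, snp_gene):
    """Map each gene to the smallest p-value among its SNPs across all traits.

    snp_pvalues: array (n_snps, n_traits); snp_gene: gene ID for each SNP row.
    """
    p = np.asarray(snp_pvalues, dtype=float)
    best_per_snp = p.min(axis=1)
    result = {}
    for gene, value in zip(snp_gene, best_per_snp):
        result[gene] = min(result.get(gene, 1.0), float(value))
    return result
```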
Nested Association Mapping Comparison
Published associations identified for 41 phenotypes scored across 5,000 maize recombinant inbred lines were retrieved from Panzea (http://cbsusrv04.tc.cornell.edu/users/panzea/download.aspx?filegroupid=14)27. Following the thresholding proposed in that paper, SNP and CNV (copy number variant) hits with a resample model inclusion probability ≥ 0.05 which were either within the longest annotated transcript for each gene (AGPv2.16) or within 15kb upstream or downstream of the annotated transcription start and stop sites were assigned to that gene. Gene models were converted from B73 RefGenV2 to B73 RefGenV4 using a conversion list published on MaizeGDB (https://www.maizegdb.org/search/gene/downloadgenexrefs.php?relative=v4).
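The assignment of NAM hits to genes can be sketched as a simple interval lookup. The data layout (tuples of chromosome, position and resample model inclusion probability; a dictionary of transcript coordinates) and the name `assign_hits_to_genes` are assumptions for illustration only.

```python
def assign_hits_to_genes(hits, genes, flank=15_000, min_rmip=0.05):
    """Assign association hits to genes by position.

    hits: list of (chrom, pos, rmip) tuples.
    genes: dict gene_id -> (chrom, start, end) for the longest transcript.
    Returns dict gene_id -> list of hits within the transcript or its 15 kb flanks.
    """
    kept = [h for h in hits if h[2] >= min_rmip]
    assigned = {}
    for gene_id, (chrom, start, end) in genes.items():
        lo, hi = start - flank, end + flank
        matched = [h for h in kept if h[0] == chrom and lo <= h[1] <= hi]
        if matched:
            assigned[gene_id] = matched
    return assigned
```

For genome-scale inputs an interval tree or a sorted sweep would be faster; the nested scan here is kept for readability.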
Gene Expression Analysis
Raw reads from a published maize expression atlas generated for the inbred B73 were downloaded from the NCBI Sequence Read Archive PRJNA17168439. Reads were trimmed using Trimmomatic-0.38 with default parameters56. Trimmed reads were aligned to the maize B73 RefGenv4 reference genome using GSNAP version 2018-03-2557. Alignment results were converted to sorted BAM file format using Samtools 1.658, and Fragments Per Kilobase of transcript per Million mapped reads (FPKM) were calculated for each gene in the AGPv4.39 maize gene models in each sample using Cufflinks v2.259. Only annotated genes located on the 10 maize pseudomolecules were used for downstream analyses and the visualization of FPKM distribution.
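For orientation, the definitional FPKM formula and the expression-breadth count used in the comparisons can be written as below. This is only the textbook definition; Cufflinks estimates abundances with its own statistical model, so the numbers it reports need not match this naive calculation. Both function names are hypothetical.

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped_fragments)

def expression_breadth(fpkm_by_tissue, min_fpkm=1.0):
    """Number of tissues or timepoints in which a gene exceeds the FPKM cutoff."""
    return sum(value > min_fpkm for value in fpkm_by_tissue)
```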
Ka/Ks Calculations
For each gene listed in a public syntenic gene list60, the coding sequence for the single longest transcript per locus was downloaded from Ensembl Plants and aligned to the single longest transcript of the genes annotated as its syntenic orthologs in Sorghum bicolor v3.161 and Setaria italica v2.262, which were retrieved from Phytozome v12.0, using a codon based alignment as described previously43. The calculation of the ratio of the number of nonsynonymous substitutions per nonsynonymous site (Ka) to the number of synonymous substitutions per synonymous site (Ks) was automated using an in-house software pipeline posted to GitHub (https://github.com/shanwai1234/Grass-KaKs). Genes with a synonymous substitution rate less than 0.05 were excluded from the analyses, as the extremely small number of synonymous substitutions tended to produce quite extreme Ka/Ks ratios. Genes with multiple tandem duplicates were also excluded from Ka/Ks calculations. Calculated Ka/Ks ratios of maize genes are provided in Supplementary Table S9.
Presence/Absence Variation (PAV) Analysis
PAV data were downloaded from a published data file42. Following the thresholding proposed in that paper, a gene was considered to exhibit presence absence variation if at least one inbred line had coverage less than 0.2.
Gene Ontology Enrichment Analysis
All GO analyses used the maize-GAMER GO annotations for B73 RefGenV4 gene models44. Statistical tests for GO term enrichment and purification were performed using the goatools software package63 with support for the Fisher Exact test provided by the fisher_exact function in SciPy. To determine the median information content of individual GO terms, each GO term was assigned a score based on the total number of gene models this GO term was assigned to in the maize-GAMER dataset. This analysis considered only gene models a GO term was specifically applied to in the dataset, but not gene models where the assignment of the GO term may have been implied by the assignment of a child GO term.
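The test applied to each GO term can be illustrated directly with SciPy. The sketch below assumes the study gene set is a subset of the background population and applies a Bonferroni correction by multiplying by the number of terms tested; the function name `go_term_test` and its argument layout are illustrative and are not the goatools interface.

```python
from scipy.stats import fisher_exact

def go_term_test(study_with, study_total, pop_with, pop_total, n_tests=1):
    """Two-sided Fisher exact test for one GO term with Bonferroni correction.

    study_with / study_total: genes carrying the term / all genes in the study set.
    pop_with / pop_total: the same counts for the background population,
    which is assumed to contain the study set.
    Returns (direction, corrected_p).
    """
    table = [
        [study_with, study_total - study_with],
        [pop_with - study_with,
         (pop_total - pop_with) - (study_total - study_with)],
    ]
    _, p_value = fisher_exact(table, alternative="two-sided")
    # Compare the observed fraction in the study set with the background fraction.
    over = study_with * pop_total > pop_with * study_total
    direction = "enriched" if over else "purified"
    return direction, min(1.0, p_value * n_tests)
```

A term would be reported when the corrected p-value falls below 0.05, matching the threshold used in the main text.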
Power and FDR evaluation of GPWAS and GWAS using simulated data
SNP calls for the entire set of 1,210 individuals included in Maize HapMap3 were retrieved from Panzea26, filtered, imputed, and assigned to genes as described above resulting in 1,648,398 SNPs assigned to annotated gene body regions in B73 RefGenV4. 2,000 randomly selected genes associated with 30,547 SNP markers were employed for downstream simulations. In each simulation, 100 genes (5%) were selected as causal genes. For each causal gene in each simulation, a causal SNP was selected for simulating phenotypic effects. A total of 100 phenotypes were simulated in each permutation of the analysis, with 10 traits simulated with heritability of 0.7, 30 traits simulated with heritability of 0.5 and 60 traits simulated with heritability of 0.3. Effect sizes for each SNP for each phenotype in each permutation were drawn from a normal distribution centered on zero using the additive model in GCTA (version 1.91.6)64.
The resulting simulated trait data and genuine genotype calls were analyzed using GLM GWAS, FarmCPU GWAS, and GPWAS as described above, with the exception that population structure principal components were calculated using a sample (1% or 157,748 SNPs) of the total SNPs remaining after filtering, rather than only the subset of SNPs assigned to the 2,000 randomly selected genes included in this analysis. For each analysis, the set of 2,000 genes was ranked from most to least statistically significant based on the significance of the single most significantly associated SNP (for GLM and FarmCPU GWAS) or the significance of the overall model fit relative to a population structure only model (for GPWAS). Power was defined as the ratio of the number of true positive genes to the total number of causal genes, and FDR was defined as the ratio of the number of false positive genes to the total number of positive genes. Power and FDR were calculated in steps of five genes from 5 to 500 total positive genes (i.e. {5,10,…,450,500}).
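The power and FDR curves defined above can be computed from a ranked gene list as in the following sketch; the function name `power_fdr_curve` and the input layout are assumptions.

```python
def power_fdr_curve(ranked_genes, causal_genes, start=5, stop=500, step=5):
    """Power and FDR when the top n ranked genes are called positive.

    ranked_genes: gene IDs sorted from most to least significant.
    Returns a list of (n_positive, power, fdr) tuples.
    """
    causal = set(causal_genes)
    curve = []
    for n_pos in range(start, min(stop, len(ranked_genes)) + 1, step):
        true_pos = sum(g in causal for g in ranked_genes[:n_pos])
        power = true_pos / len(causal)
        fdr = (n_pos - true_pos) / n_pos
        curve.append((n_pos, power, fdr))
    return curve
```

Running it once per method on the same simulated dataset gives curves that can be compared at matched numbers of positive genes.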
Acknowledgements
This work is supported by the Quantitative Life Sciences Initiative at the University of Nebraska-Lincoln, which receives support from a University of Nebraska Program of Excellence, and by the National Science Foundation Awards MCB-1838307 and OIA-1826781 to JCS. The authors thank Andy Dahl for advice and instruction in the use of phenotype imputation, Zheng Xu and Wenlong Ren for consultation on the design of the association study, and the PanZea project (http://www.panzea.org) for gathering the phenotypic and genotypic data employed in this study. This work was completed utilizing the Holland Computing Center of the University of Nebraska, which receives support from the Nebraska Research Initiative.