ABSTRACT
Genome-wide association studies (GWAS) have proven to be a valuable approach for identifying genetic intervals associated with phenotypic variation in Medicago truncatula. These intervals can vary in size, depending on the historical local recombination near each significant interval. Typically, significant intervals span numerous gene models, limiting the ability to resolve high-confidence candidate genes underlying the trait of interest. Additional genomic data, including gene co-expression networks, can be combined with the genetic mapping information to successfully identify candidate genes. Co-expression network analysis provides information about the functional relationships of each gene through its similarity of expression patterns to other well-defined clusters of genes. In this study, we integrated data from GWAS and co-expression networks to pinpoint candidate genes that may be associated with nodule-related phenotypes in Medicago truncatula. We further investigated a subset of these genes and confirmed that several had existing evidence linking them nodulation, including MEDTR2G101090 (PEN3-like), a previously validated gene associated with nodule number.
INTRODUCTION
The ability to convert atmospheric nitrogen into usable forms makes legumes an integral part of the plant ecosystem. Unfortunately, the expected increase in human population size over the next several decades will require a higher amount of nitrogen than current legume cropping systems can fulfill (Smil, 1999). This increase in demand requires that researchers better understand and improve nitrogen fixation in current legume species. One species in particular, Medicago truncatula, is widely considered a model species for understanding nitrogen fixation due to its diploid nature, seed to seed generation time, small genome size, and the vast amount of genomic resources (Young and Udvardi, 2009). Although previous studies have identified genes associated with nodulation (Oldroyd et al., 2001; Curtin et al., 2017; VandenBosch, 2003; Elise et al., 2005; Combier et al., 2006; Wasson, 2006), the trait is highly polygenic, and a large number of genes involved in nodulation remain to be discovered. One way researchers have tried to overcome this obstacle is through the use of Genome-wide association studies (GWAS).
Genetic analysis performed on standing collections of diverse lines or accessions reveals the locations of historical recombination that differentiate each genotype. GWAS leverage this information to discover associations between genetic markers and a phenotype of interest that exhibits variation within the population. However, these strong associations typically implicate genomic regions that are too large to allow for the identification of the specific gene that underlies this variation (Breseghello and Coelho, 2013; Flint-Garcia et al., 2005; Visscher et al., 2012). In most cases, further investigation is required to identify genes surrounding each marker that may be associated with the phenotype. Furthermore, it is possible that numerous markers truly associated with the trait are not identified as significant in GWAS, due to stringent statistical cutoffs (Storey and Tibshirani, 2003; Johnson et al., 2010; Sham and Purcell, 2014). Conversely, lowering the statistical threshold introduces false positives that are problematic for further analysis (Korte and Farlow, 2013).
Advances in next-generation sequencing technologies have allowed researchers to generate numerous reference genomes for a variety of plant species. However, many of the genes within these species remain functionally uncharacterized, limiting the amount of biological information available to interpret a candidate gene’s effect on a specific phenotype. Using technologies such as RNA-seq and microarrays, it is possible to measure quantitative levels of expression throughout the genome across multiple samples. Based on a collection of genome-wide expression profiles collected from various tissues, species, and/or environments, one can construct a co-expression network by measuring similarity between all pairs of genes’ expression profiles, where strongly connected edges indicate that two genes exhibit highly similar patterns of expression (Usadel et al., 2009; Stuart, 2003) (Aoki et al., 2007). These networks provide a powerful resource for understanding gene function, particularly for uncharacterized genes, as the data-derived relationships allow one to establish a functional context for a gene, even when formal annotations do not exist.
A recent study described a new framework to integrate co-expression networks with GWAS as a means to identify candidate genes (Schaefer et al., 2018). In maize, they ran several GWAS to identify a SNPs associated with elemental accumulation in seeds. Although they were able to identify significant markers associated with regions of the genome, in most cases, they were left with hundreds of markers per trait that often implicated linked genomic regions that could not be resolved to individual candidate genes. They further built three co-expression networks, two from publicly available data and one from root tissue designed to represent the phenotype measured in the respective GWAS. Schaefer et al. 2018 integrated the significant markers for each trait with the co-expression networks using their Camoco framework to identify and better prioritize candidate genes associated with elemental accumulation.
Here, we apply this framework to Medicago truncatula using publicly available expression datasets, and markers from a previously published GWAS focused on nodulation traits. We demonstrate that Camoco framework, originally established in maize, indeed generalizes to other species and traits, and provides an effective means of pinpointing candidate causal genes associated with nodulation.
RESULTS AND DISCUSSION
Integration of nodule focused genome-wide association study with co-expression networks
To identify candidate genes associated with nodulation traits, we used a previously published GWAS (Stanton-Geddes et al., 2013) as well as two publicly available RNA-seq datasets. The GWAS consisted of 226 M. truncatula accessions that were previously grown in replicate and phenotyped for five different nodulation traits as well as flowering time, trichrome density and height. By manually inspecting their most significant 50-200 SNPs ranked by p-value, the authors discovered several genes near significant SNPs that were previously associated with nodulation traits (Stanton-Geddes et al., 2013). Similar to other GWAS studies, the authors focused on genes that either contained or were directly adjacent to significant markers even though, in some cases, other genes may also be plausible candidates given their linkage to the significant markers (Branca et al., 2011). We selected a subset of these traits and markers from the study to serve as input for the GWAS/co-expression Camoco pipeline (Table S1).
As a basis for our co-expression networks, we used two publicly available RNA-seq expression data sets. The data consisted of 138 samples consisting of three different genotypes, three different tissues, four different rhizobium treatments, and presence-absence of nitrogen (Table S2). We then built six different co-expression networks using Camoco (https://github.com/schae234/Camoco) (Schaefer et al., 2018). Four of the six networks were constructed from a single tissue type (Leaf, Root, Nodule, JQL_Nodule), and the other two networks (referred to as the “General” network and “JQL” network) were constructed from a combination of different tissue types (Table S3). The diversity of tissue types within each co-expression network allows for the detection of signals corresponding to different biological processes that may have remained undiscovered if all samples were combined into one large network (Schaefer et al., 2014, 2018). The total number of genes that passed the co-expression network construction phase was relatively consistent among the four networks, with each network consisting of roughly 22,000 genes (Table S3). Genes that were excluded from each network were either not expressed, or did not exhibit enough variation in expression between samples to robustly measure covariation. The smaller number of genes within the nodule-specific network was expected, as fewer genes are expressed in nodule tissue relative to others (Benedito et al., 2008).
To test whether the networks were capturing biologically meaningful relationships, we measured the enrichment in each network for known biological relationships. Using sets of genes coannotated to the same Gene Ontology (GO) term, the relative density (how highly an established set of functionally related genes are co-expressed with each other) was measured and compared to density values of randomly sampled gene sets of the same size. All six networks demonstrated functional enrichment of at least ten-fold (Figure S1), indicating that many more GO terms exhibited evidence of co-expression than expected by chance for all six networks.
Using the six co-expression networks and selected GWAS markers, we applied the Camoco pipeline to prioritize candidate causal genes. Briefly, Camoco, which was originally described in Schaefer et al. 2018, evaluates candidate genes linked to significant GWAS markers on the basis of their co-expression with genes linked to other significant GWAS marker based on the assumption that some causal genes should exhibit strong co-expression relationships with other genes associated with the trait. Camoco is depicted in Figure 1, and the details of this analysis are provided in the Methods section. Any genes reported by Camoco with an FDR < 0.35 were considered candidate genes and included in further analysis.
The results of the Camoco framework yielded 489 high-confidence candidate genes across all GWAS trait and network combinations. We also measured the number that persisted at more stringent FDR cutoffs, and indeed we were able to discover genes across a range of FDRs (FDR < 0.2: 172 genes; FDR < 0.1: 32 genes; FDR < 0.05: 3 genes). Analysis of the Nod_A trait (strain occupancy in the top 5 cm of roots) with the Mt_JQL_Nodule network combination, revealed a high amount of network connectivity between genes. To illustrate the basis for highly prioritized candidate genes, we highlight the observed co-expression relationships for MEDTR2G101090 (Figure 2), which was one of the top prioritized candidate genes for the Nod_A. MEDTR2G101090 is linked with a significant GWAS marker and is highly co-expressed with genes linked to significant loci on several other chromosomes (Figure 2), suggesting that the Camoco framework is discovering meaningful relationships.
Importance of trait and tissue specificity in co-expression networks
The number of high-confidence candidate genes discovered by Camoco varied significantly across different combinations of traits, networks, and parameters (Figure 3). Interestingly, the nodule based Mt_JQL_Nodule co-expression network combined with the Nod_A trait yielded the most high-confidence candidate genes across all network-trait combinations, which likely reflects a strong match between the tissue in which expression covariation was measured and the biology of the phenotype of interest (in this case, both focused on nodules). Surprisingly, the root-based network performed the worst even though we expected strong biological relevance for nodulation based traits. It was the poorest performer across all GWAS traits, only producing a few candidate genes for the Nod_B trait (strain occupancy below the top 5 cm of roots). This result could possibly be due to the timepoint at which RNA was extracted from the roots. For example, if RNA was extracted at an earlier timepoint when nodules were still early in development, there may have been more informative expression patterns, allowing for the discovery of candidate genes.
The leaf network was the only network that consistently identified candidates for the height trait. While this is biologically unsurprising, this network also discovered significant genes for a few nodulation traits, suggesting that there are processes detectable in leaf tissue with relevance to nodulation. The General network, which consisted of the largest number of samples and tissue types only generated a few candidates for the Nod_B phenotype.
These results suggest that the context from which the co-expression network was derived, and its relation to the GWAS phenotype, play an important role in determining whether the Camoco approach is able to prioritize high-confidence genes. Notably, our results suggest that combining many different types of tissue into one large network does not perform well as a smaller, more concise, tissue-focused network, even though it is based on a larger set of expression profiles. One reason for this is that combining expression data from very different contexts introduces more variation across each gene’s profile, but that variation likely reveals generic modules that represent large sets of genes that simply are expressed in the same subsets of tissues. In contrast, networks derived from specific tissues capture more subtle covariation that reflects co-regulated genes functioning in processes relevant to that tissue that may otherwise be lost in larger sets of expression profiles.
GWAS marker significance and proximity to genes are variable when integrating coexpression analysis
A common approach to interpreting GWAS studies is to manually inspect the most significant markers and look for candidates that are closest in proximity to the marker of interest. Unfortunately, the closest genes to GWAS markers may not always be the ones that are causally driving the association with the phenotype. When looking at the height trait in the leaf network, we see an increase in signal (i.e., number of Camoco-identified high-confidence candidate genes) as we increase the number of flanking genes surrounding each marker (Figure S2). When the window size is increased from 10 kb to 20kb, we see that the signal drastically increases, indicating that there are genes further out from the marker that are highly co-expressed with a subset of these genes. However, when an even larger 50kb window is used, no high-confidence genes are reported. The loss of signal at the largest interval (50kb) is expected as the number of potential candidate genes per locus increases sharply (the large majority of them being false positive as one considers candidates further from the locus peak). Ultimately, this large number of false candidate genes obscures the identification of co-expression relationships among true causal genes, and the approach no longer works. This analysis suggests that several of the GWAS loci implicated for these traits are likely driven by causal genes that are not directly adjacent to the GWAS peaks.
Similarly, the constraint of only focusing on the most significant markers (e.g. derived from extremely conservative significance cutoffs on the association test) leaves other candidates that are truly associated with the phenotype neglected. The Camoco framework can provide filtering of false positives at lower significance thresholds, by integrating information from the co-expression network. For instance, if we used the common GWAS p-value cutoff of 5 × 10-8 (Fadista et al., 2016; Barsh et al., 2012; Panagiotou and Ioannidis, 2012), this would result in two GWAS markers from the Nod_A phenotype, which does not provide enough context for an approach like Camoco to prioritize candidate genes. Instead, we applied a less conservative threshold (p-value < 3 x 10-5), which resulted in 292 SNPs, which was able to produce several high-confidence candidate causal genes, which would have otherwise been ignored (Table S1). In general, of course the number of markers produced at any confidence threshold will depend on the trait’s genetic architecture and the study design, but this analysis suggests that the Camoco approach can better produce candidate genes with less conservative thresholds on marker association.
Identification of nodulation-related genes using co-expression and GWAS
To identify a small set of the most promising high confidence candidate genes for more investigation, we further narrowed candidate genes lists for the Nod_A trait by focusing on genes that were consistently discovered across different parameter settings. Using the JQL_Nodule network, we narrowed the candidate gene lists by limiting candidate genes to those that appeared in at least three out of the nine (10kb, 20kb, 50kb genome window size by 1,2,5 flanking gene) combinations of parameter settings; this process resulted in 25 genes for further investigation (Table1). When viewing the strength of co-expression between these 25 genes within the nodule network, it was observed that the majority of the genes were connected and formed a single module (Figure 4).
Interestingly, among those 25 candidate genes from the Nod_A analysis, was PEN3-like (MEDTR2G101090; Table 1), a gene that was associated with the most significant GWAS marker for the Nod_A trait (Stanton-Geddes et al., 2013). Functional validation of PEN3-like using CRISPR and Tnt1-mutated plants previously confirmed that loss-of-function of this gene resulted in decreased nodule number (Curtin et al., 2017). Another strong candidate among these 25 within the module was the hub gene (gene with the highest number of connections), MEDTR7G109130, which is annotated as a P-loop nucleoside triphosphate hydrolase superfamily protein and is known to play a role in nodulation (Jayaraman et al., 2017).
Because multiple co-expression networks were able to support the discovery of strong candidate genes for Nod_B, we defined a short list of high-confidence candidates by requiring high confidence genes to be consistently prioritized as high-confidence candidates across all networks for the Nod_B trait and were discovered across 4 or more parameter settings (Table 2). One promising gene, MEDTR1G012530, appeared as a candidate for 9 out of the 20 parameter settings that resulted in at least one candidate gene discovery. This gene is annotated as a TPX2 (targeting protein for Xklp2) family protein and has been shown to be highly expressed during nodule formation (Jardinaud et al., 2016). Another promising candidate, MEDTR4G073400, which also appeared as candidate 9 times, is annotated as Synaptotagmins-1-related, which play a role in the formation of root nodules (Gavrin et al., 2017).
Overall, these results demonstrate that the integration of co-expression networks to interpret GWAS results was able to effectively prioritize genes causally associated with nodulation processes. The genes that are directly connected to PEN3-like would serve as valuable candidates for follow-up studies due to their similarity in expression profiles across tissues. Another approach to prioritizing candidates from among the set produced by the Camoco analysis is to rank based on their linked GWAS marker’s significance value. For instance, the candidate causal gene associated with the marker with the highest significance was the PEN3-like gene while our P-loop nucleoside triphosphate hydrolase superfamily protein hub gene was ranked 270 out of 523 significant markers input into the Camoco analysis.
Conclusions
Using an M. truncatula GWAS focused on nodulation traits as well as expression data from different tissues, rhizobium strains, nitrogen treatments and accessions, we were able to identify a subset of genes surrounding GWAS markers that are highly co-expressed with one another. From these lists, we discovered a previously validated nodulation gene PEN3-like as well as several other genes whose annotations are associated with nodulation. Uncharacterized genes within our high-confidence lists are worthy of more in-depth follow-up studies using Tnt1 or CRISPR knockouts.
Schaefer et al. 2018 developed the Camoco framework and integrated co-expression networks and GWAS in maize in order to capture variation associated with elemental uptake in seeds. Our current study used a higher-density GWAS that focused on a different phenotype, different plant species, and an expression data set that was not explicitly created for this study. One common theme between the studies is that the choice of the co-expression network matters; specifically, tissue-relevant networks derived from expression variation across diverse genotypes appear to perform the best in ranking candidate genes. This was true in maize, and we report here that this is also true in Medicago. We believe this result is likely to generalize to many other contexts, and it suggests as a community, more emphasis in the generation genotype-focused networks would be worthwhile if we hope to build resources for functional interpretation of phenotype-associated variants. It is also important to mention that we were able to generate a panel of high confidence candidate genes using two independent datasets that were not generated specifically for this study.
The majority of candidate genes discovered in this analysis would have mostly likely been neglected by traditional GWAS analyses unless they were under the most significant markers. By combining co-expression networks with GWAS, the functional relationship between genes related to the GWAS phenotype are more likely to be discovered. It is also important to note that based on our analysis, in many cases, the nearest gene to a marker was not the gene predicted to be causally associated with the phenotype.
In general, we demonstrate that the Camoco framework for integrating co-expression networks with GWAS generalizes beyond the species for which it was originally developed and applied (maize ionomic traits), as it shows utility for prioritizing genes related to nodulation in Medicago truncatula. Based on these results, we expect that the approach will generalize to a wide variety of other species and traits as well.
MATERIAL AND METHODS
Medicago experimental design and sample extraction
Three accessions from the Medicago HapMap project (HM56, HM101, HM340) were grown in greenhouse conditions. Rhizobium strains S. meliloti (KH46c) and S. medicae (WSM419), as well as nitrogen, were applied to the soil shortly after planting. Tissues were harvested and frozen in liquid nitrogen 31 days after planting. RNA was extracted using the Qiagen RNeasy Plant mini kit (Product ID: 74903). Individual nodules were pooled and extracted as a single sample for each plant.
Generation of expression data
RNA from 138 samples were sequenced by the University of Minnesota’s Genomic Center using Illumina HiSeq2500 100bp single-end reads. One sample required resequencing (L88), which resulted in 125bp reads. Samples were barcoded and multiplexed using Illumina TruSeq HT adapters. Fastq files were checked with Fastqc version 0.11.5 and adapters were trimmed using cutadapt version 1.8.1 with non-default parameters -m 40 and -q 30 (Andrews, 2010; Martin, 2011). Reads were then aligned to Mt_4.0 gene models, and reference (http://jcvi.org/medicago/) using STAR 2.5.3a (Dobin et al., 2013), then filtered based on unique mapping scores, sorted and indexed using samtools version 1.6 (Li et al., 2009). FPKM values were generated using Cufflinks version 2.2.1 using non-default parameters of -I 20000 and --min-intron-length 5. Raw sequencing files are publicly available on the NCBI SRA (PRJNA327225 and PRJNA449544).
Co-expression network construction and genome-wide association study i ntegration
Methods used were similar to those in the previously mentioned co-expression GWAS integration study (Schaefer 2017). Briefly, Camoco takes a set of SNP’s as input and uses their location within a genome as well the number of genes flanking a marker within a given window size to extract genes lists for testing (Figure 1). If there are multiple significant SNPs appearing within the same window, then all but the most significant SNP is discarded (Table S1). Once genes are selected for testing; each gene is then measured to see how well it is co-expressed with other genes also linked to the significant markers associated with the trait of interest. Once a network statistic (either density or locality, see Schaefer 2017) is generated, Camoco will resample (1000 times) a random set of genes equal in size to the test set to establish a null distribution for estimating significance of the observed statistic. To account for the varying amount of linkage disequilibrium across the genome, we used 10kb, 20kb and 50kb window sizes and 1, 2, and 5 flanking genes (Stanton-Geddes et al., 2013). Any gene that had an FDR < 0.35 was called “candidate” and included in further analysis.
FPKM expression tables were used as input into Camoco (https://github.com/schae234/Camoco) using the Mt_4.1 reference genome. Non-default parameters used to build each network included rawtype=‘RNASEQ’, max_gene_missing_data=0.5, max_accession_missing_data=0.5, min_single_sample_expr=1, min_expr=0.001, quantile=False, max_val=300, sep=‘,’. Network health statistics were generated using GO terms from (http://jcvi.org/medicago) and 1000 bootstraps. SNPs were integrated into Camoco using built-in functions, and per gene, density measurements were run with 1,000 bootstraps. Figures were created using ggplot2 (Wickham, 2006).
Acknowledgments
We would like to thank the University of Minnesota Office of Information Technology for accommodating our data storage needs and the Department of Computer Science at the University of Minnesota for server maintenance and support. This work was supported by funding from the National Science Foundation (IOS-1237993) with partial funding from (IOS-1126950). Funding sources played no role in the design of this study or the collection, analysis, and the interpretation of data and in writing the manuscript.