Abstract
Background Single Nucleotide Polymorphism (SNP) array and re-sequencing technologies have different properties (e.g. calling rate, minor allele frequency profile) and drawbacks (e.g. ascertainment bias), which led us to study the complementarity and consequences of using them separately or combined in diversity analyses and Genome-Wide Association Studies (GWAS). We performed GWAS on three traits (grain yield, plant height and male flowering time) measured in 22 environments on a panel of 247 diverse dent maize inbred lines using three genotyping technologies (Genotyping-By-Sequencing, Illumina Infinium 50K and Affymetrix Axiom 600K arrays).
Results The effects of ascertainment bias of both arrays were negligible for deciphering global genetic trends of diversity in this panel and for estimating relatedness. We developed an original approach based on linkage disequilibrium (LD) extent in order to determine whether SNPs significantly associated with a trait and that are physically linked should be considered as a single QTL or several independent QTLs. Using this approach, we showed that the combination of the three technologies, which have different SNP distribution and density, allowed us to detect more Quantitative Trait Loci (QTLs, gain in power) and potentially refine the localization of the causal polymorphisms (gain in position).
Conclusions Conceptually different technologies are complementary for detecting QTLs by tagging different haplotypes in association studies. Considering LD, marker density and the combination of different technologies (arrays and re-sequencing), the genotypic data presently available were most likely sufficient to represent polymorphisms well in the centromeric regions, whereas using more markers would be beneficial for telomeric regions.
Background
Understanding the genetic bases of complex traits involved in the adaptation to biotic and abiotic stress in plants is a pressing concern, with world-wide drought due to climate change being a major threat to human food supply and agriculture. Recent progress in next generation sequencing and genotyping array technologies contributes to a better understanding of the genetic basis of quantitative trait variation by enabling Genome-Wide Association Studies (GWAS) on large diversity panels [1]. Single Nucleotide Polymorphism (SNP)-based techniques have become the most commonly used genotyping methods for GWAS because SNPs are cheap, numerous, codominant and can be automatically analysed with SNP-arrays or produced by genotyping-by-sequencing (GBS), or sequencing [2-4]. The decreasing cost of genotyping technologies has led to an exponential increase in the number of markers used for GWAS in association panels, thereby raising the question of the computation time needed to perform the association tests. Computational issues were addressed by using either approximate methods that avoid re-estimating variance components for each SNP [5] or exact methods using mathematical tools to save time in matrix inversion [6, 7]. It is noteworthy that using approximate computation in GWAS can produce inaccurate p-values when the SNP effect size is large and/or when the sample structure is strong [8].
Several causes may impact the power of Quantitative Trait Locus (QTL, locus involved in quantitative trait variation) detection in GWAS. Highly diverse panels have in general undergone multiple historical recombinations, leading to a low extent of linkage disequilibrium (LD). However, these panels can present different average and local patterns of LD [9-11]. A high marker density and a proper distribution of SNPs are therefore essential to capture causal polymorphisms. Furthermore, minor allele frequencies (MAF), population stratification and cryptic relatedness are three other important parameters affecting power and false positive detection [12, 14]. These last two factors are substantial in several cultivated species such as maize [15] and grapevine [16], and their impact on LD can be statistically evaluated [17]. Population structure and kinship can be estimated using molecular markers [18-21] and can be modelled to efficiently detect marker-trait associations due to linkage only [12, 22, 23]. These advances have largely increased the power and effectiveness of linear mixed models that can now efficiently account for population structure and relatedness in GWAS [12, 8].
In maize, an Illumina Infinium HD 50,000 SNP-array (50K array) was developed by Ganal et al. [3] and has been used extensively for diversity and association studies [24, 25]. For example, GWAS were conducted to unravel the genetic architecture of phenology and yield component traits and to identify several flowering time QTLs linked to adaptation of tropical maize to temperate climate [26, 27]. With the same array, Rincent et al. [11] showed that LD occurs over a longer distance in a dent than in a flint panel, with appreciable effects on the power of QTL detection. Comparison of LD extent between association panels suggests that genotyping with 50K markers limits the power of GWAS in many panels due to the low LD extent and the correlation between allelic values at some SNPs due to the kinship and population structure [14, 27]. Therefore, higher marker densities are desirable because the maize genome size is large (2.4 Gb), the level of diversity is high (more than one substitution per hundred nucleotides), and LD extent is low [28]. As a consequence, an Affymetrix Axiom 600,000 SNP-array (600K array) was developed and used in association genetics [29, 30] and detection of selective sweeps [4]. Another possibility is whole genome sequencing, but this is currently impractical for large genomes such as maize because of the associated cost. Hence, a Genotyping-By-Sequencing (GBS) procedure has been developed [2] that targets low-copy genomic regions by using cheap restriction enzymes. Genotyping-by-sequencing has been successfully used in maize for genomic prediction [31]. Romay et al. [32] and Gouesnard et al. [33] highlighted the value of GBS for (i) deciphering and comparing the genetic diversity of the inbred lines in seedbanks and (ii) identifying QTLs by GWAS for kernel colour, sweet corn and flowering time. To our knowledge, the respective merits of DNA arrays and GBS for diversity analyses and GWAS have never been compared in plants.
The main drawback of DNA arrays is that they do not allow the discovery of new SNPs. This possibly leads to some ascertainment bias in diversity analysis when the SNPs selected for building arrays come from (i) the sequencing of a set of individuals that did not represent well the diversity explored in the studied panel, or (ii) a subset of SNPs that skews the allelic frequency profile towards the intermediate frequencies [27, 34]. Ascertainment bias can compromise the ability of the SNP arrays to reveal an exact view of the genetic diversity [34]. Genotyping-by-sequencing can overcome ascertainment bias since it is based on sequencing and therefore allows the discovery of alleles in the diversity panel analysed. It can be generalized to any species at a low cost provided that numerous individuals have been sequenced in order to build a representative library of short haplotypes to call SNPs [35]. Non-repetitive regions of genomes can be targeted with two- to three-fold higher efficiency, thereby considerably reducing the computationally challenging problems associated with alignment in species with high repeat content. However, GBS may have low coverage, leading to a high missing data rate (65% in both studies; [32, 33]) and heterozygote under-calling, depending on genome size and structure, and on the number of samples combined in the flow-cell. Furthermore, GBS requires the establishment of demanding bioinformatic pipelines and imputation algorithms [36]. Pipelines have been developed to call SNP genotypes from raw GBS sequence data and to impute the missing data from a haplotype library [35, 36].
Here, we investigated the impact of using GBS and DNA arrays on the quality of the genotyping data, together with the biological properties of data generated by these technologies, and the potential complementarity of these approaches. In particular, we analyzed the impact of increasing the marker density and using different genotyping technologies (sequencing vs array) on (i) the estimates of relatedness and population structure, (ii) the detection of QTLs (power). To address these issues, we performed a GWAS based on genotypic datasets obtained using either GBS or DNA arrays with low (50K) or high (600K) densities on a diversity panel of maize hybrids obtained from a cross of dent lines with a common flint tester. Three traits were considered, namely grain yield, plant height and male flowering time (days to anthesis), measured in 22 different environments (sites × years × treatments) over Europe. We developed an original approach based on LD extent in order to determine whether SNPs significantly associated with a trait should be considered as a single QTL or several independent QTLs.
Results
Combining Tassel and Beagle imputations improved the genotyping quality for GBS
We estimated the genotyping and imputation concordance of the GBS based on common markers with the 50K or 600K arrays (Additional file 1: Figure S1 and Table S1). After SNP calling from reads using AllZeaGBSv2.7 database (direct reads, GBS1, Additional file 1: Figure S1), the call rate was 33.81% on the common SNPs with the 50K array, vs 37% for the whole GBS dataset. The genotyping concordance rate was 98.88% (Additional file 1: Table S1). After imputation using TASSEL by Cornell Institute (GBS2), the concordance rate was 96.04% on the common markers with the 50K array and 11.91% of missing data remained for the whole GBS dataset. In GBS3, all missing data were imputed by Beagle but yielded a lower concordance rate (92.14% and 91.58% for the 50K and the 600K arrays). In an attempt to increase the concordance rate of the genotyping while removing missing data, we tested two additional methods, namely GBS4 where the missing data and heterozygotes of Cornell imputed data (GBS2) were replaced by Beagle imputation, and GBS5 where Cornell homozygous genotypes (GBS2) were completed by imputations from GBS3 (Additional file 1: Figure S1 and Table S1). GBS5 displayed a slightly better concordance rate than GBS2 (96.25% vs 96.04%) and predicted heterozygotes with a higher quality than GBS4. GBS5 was therefore used for all genetic analyses and named GBS hereafter.
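The two quality measures used above, call rate and concordance with an array taken as reference, can be illustrated with a minimal sketch. The genotype coding (0, 1, 2 alternate-allele counts, `None` for a missing call), the function names and the toy data are assumptions made for illustration; they are not taken from the authors' pipeline.

```python
from typing import Optional, Sequence

# Assumed coding: number of alternate alleles (0, 1, 2); None marks a missing call.
Genotype = Optional[int]


def call_rate(calls: Sequence[Genotype]) -> float:
    """Fraction of genotype calls that are not missing."""
    if not calls:
        raise ValueError("no genotype calls supplied")
    return sum(g is not None for g in calls) / len(calls)


def concordance_rate(gbs: Sequence[Genotype], array: Sequence[Genotype]) -> float:
    """Fraction of identical calls among positions genotyped in both datasets."""
    if len(gbs) != len(array):
        raise ValueError("both datasets must describe the same line x SNP positions")
    shared = [(g, a) for g, a in zip(gbs, array) if g is not None and a is not None]
    if not shared:
        raise ValueError("no position is genotyped in both datasets")
    return sum(g == a for g, a in shared) / len(shared)


# Toy data: eight line x SNP positions common to GBS and an array.
gbs_calls = [0, None, 2, 2, None, 0, 1, None]
array_calls = [0, 0, 2, 0, 2, 0, 1, 2]

print(f"GBS call rate: {call_rate(gbs_calls):.3f}")
print(f"GBS/array concordance: {concordance_rate(gbs_calls, array_calls):.3f}")
```

In this sketch, missing GBS calls lower the call rate but do not count against concordance, which is computed only on positions called by both technologies.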
GBS displayed more rare alleles and lower call rate than microarrays
The SNP call rate was higher for the arrays (average values of 96% and >99% for the 50K and 600K arrays, respectively) than for the GBS (37% for the direct reads). The MAF distribution differed between the technologies (Figure 1): while the use of arrays resulted in a near-uniform distribution, GBS resulted in an excess of rare alleles with an L-shaped distribution (22% of SNPs with MAF < 0.05 for the GBS versus 6% and 9% for the 50K and 600K, respectively). This is not surprising since SNP discovery was based on 27 sequenced lines for the 50K array [3] and on 30 lines for the 600K array [4], whereas GBS was based on 31,978 lines, thereby leading to higher discovery of rare alleles. Consistent with MAF distribution, the average gene diversity (He) was lower for GBS (0.27) than for arrays (0.35 and 0.34 for the 50K and 600K arrays, respectively). The distribution of SNP heterozygosity was similar for the three technologies, with a mean of 0.80%, 0.89% and 0.22% for the 50K and 600K arrays and GBS, respectively. The heterozygosity of inbred lines was highly correlated between technologies, with large Spearman correlation coefficients: r50K-600K = 0.90, r50K-GBS = 0.76, r600K-GBS = 0.83. The distribution of the SNPs along the genome was denser in the telomeres for the GBS and in the peri-centromeric regions for the 600K array, whereas the 50K array exhibited a more uniform distribution (top graph in Figure 2 and in Additional file 2: Figure S2).
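To make the link between the MAF profile and gene diversity concrete, the following sketch computes both for a biallelic SNP. It assumes genotypes coded as alternate-allele counts and uses the standard expected heterozygosity for a biallelic locus, He = 2p(1 - p); the exact estimator used by the authors is not stated in this section, and the function names and toy SNPs are hypothetical.

```python
from typing import Optional, Sequence

Genotype = Optional[int]  # assumed coding: 0, 1, 2 alternate alleles; None = missing


def alt_allele_frequency(calls: Sequence[Genotype]) -> float:
    """Frequency of the alternate allele among non-missing diploid calls."""
    observed = [g for g in calls if g is not None]
    if not observed:
        raise ValueError("SNP has no observed genotype")
    return sum(observed) / (2 * len(observed))


def minor_allele_frequency(calls: Sequence[Genotype]) -> float:
    """Frequency of the less common of the two alleles."""
    p = alt_allele_frequency(calls)
    return min(p, 1 - p)


def gene_diversity(calls: Sequence[Genotype]) -> float:
    """Expected heterozygosity of a biallelic SNP, He = 2p(1 - p)."""
    p = alt_allele_frequency(calls)
    return 2 * p * (1 - p)


def rare_fraction(snps: Sequence[Sequence[Genotype]], threshold: float = 0.05) -> float:
    """Proportion of SNPs whose MAF falls below the threshold."""
    return sum(minor_allele_frequency(s) < threshold for s in snps) / len(snps)


common_snp = [0, 0, 2, 2, 0, 0, 0, 0]  # alternate allele carried by 2 of 8 lines
rare_snp = [0] * 39 + [2]              # alternate allele carried by 1 of 40 lines

print(minor_allele_frequency(common_snp), gene_diversity(common_snp))
print(minor_allele_frequency(rare_snp), gene_diversity(rare_snp))
print(rare_fraction([common_snp, rare_snp]))
```

Because He shrinks as p approaches 0 or 1, a dataset enriched in rare alleles has a lower average He, which is the direction of the difference reported between GBS and the arrays.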
Population structure and relatedness were consistent between the three technologies
We used the ADMIXTURE software to analyse the genetic structure within the studied panel based on SNPs from the three technologies, by using two to ten groups. Based on a K-fold cross-validation, the clustering in four genetic groups (NQ = 4) was identified as the best one in the datasets resulting from the three technologies. Considering a threshold of 0.5 (ancestral fraction), the assignment to the four groups was identical except for a few admixed inbred lines (Additional file 3: Figure S3). Based on the 50K, the four groups were constituted by (i) 39 lines in the Non Stiff Stalk (Iodent) family traced by PH207, (ii) 46 lines in the Lancaster family traced by Mo17 and Oh43, (iii) 55 lines in the Stiff Stalk families traced by B73, and (iv) 107 lines that did not fit into the three primary heterotic groups, such as W117 and F7057. This organization appeared consistent with the organization of breeding programs into heterotic groups, generally related to a few key founder lines.
We compared two estimators of relatedness between inbred lines, IBS (Identity-By-State) and K_Freq (Identity-By-Descent), calculated per technology. For IBS, pairs of individuals were on average more related using the GBS data than those from arrays (Additional file 3: Table S2). Relatedness estimates from the two arrays were highly correlated: r = 0.95 and 0.98 for IBS and K_Freq, respectively (Figure 3 and Additional file 3: Figure S4b). The differences between the kinships estimated from the three technologies were reduced if the excess of rare alleles in the GBS was removed (Additional file 3: Figure S4c).
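The contrast between the two relatedness estimators can be sketched as follows. IBS is computed as the mean proportion of shared alleles. For the frequency-weighted kinship, the exact definition of K_Freq is not given in this section, so the sketch assumes a commonly used form in which each SNP contributes (g_i - 2p)(g_j - 2p) / (4p(1 - p)); this formula, the function names and the toy panel are all assumptions for illustration.

```python
from typing import List, Sequence


def ibs(a: Sequence[int], b: Sequence[int]) -> float:
    """Identity-by-state: mean proportion of alleles shared over all SNPs."""
    if len(a) != len(b) or not a:
        raise ValueError("lines must be genotyped at the same, non-empty SNP set")
    return sum(1 - abs(x - y) / 2 for x, y in zip(a, b)) / len(a)


def freq_weighted_kinship(panel: Sequence[Sequence[int]]) -> List[List[float]]:
    """Kinship weighted by allele frequency (assumed form, see lead-in).

    Genotypes are alternate-allele counts (0, 1, 2) without missing data.
    Monomorphic SNPs carry no information and are skipped.
    """
    n, m = len(panel), len(panel[0])
    freqs = [sum(line[s] for line in panel) / (2 * n) for s in range(m)]
    used = [s for s in range(m) if 0 < freqs[s] < 1]
    if not used:
        raise ValueError("no polymorphic SNP in the panel")
    kinship = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            total = sum(
                (panel[i][s] - 2 * freqs[s]) * (panel[j][s] - 2 * freqs[s])
                / (4 * freqs[s] * (1 - freqs[s]))
                for s in used
            )
            kinship[i][j] = kinship[j][i] = total / len(used)
    return kinship


# Toy panel of four fully homozygous lines genotyped at two SNPs.
panel = [[0, 0], [0, 2], [2, 0], [2, 2]]
K = freq_weighted_kinship(panel)

print("IBS(line0, line1) =", ibs(panel[0], panel[1]))
print("K(line0, line1)   =", K[0][1])
print("K(line0, line3)   =", K[0][3])
```

With this weighting, sharing a rare allele raises the kinship more than sharing a common one, which is one plausible reason why an excess of rare alleles, as in GBS, shifts kinship estimates relative to the arrays.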
We further carried out diversity analyses by performing Principal Coordinate Analyses (PCoA) on IBD (K_Freq, weights by allelic frequency) estimated from the three technologies. The first two PCoA axes explained 12.9%, 15.6% and 16.3% of the variability for the GBS, 50K and 600K arrays, respectively. The same pattern was observed regardless of the technology, with the first axis separating the Stiff Stalk from the Iodent lines and the second axis separating the Lancaster from the Stiff Stalk and Iodent lines (see illustration with the 50K kinship, Additional file 3: Figure S5). Key founder lines of the three heterotic groups (Iodent: PH207, Stiff Stalk: B73, Lancaster: Mo17) were found at extreme positions along the axes, which was consistent with the admixture groups previously described.
Long distance linkage disequilibrium was removed by taking into account population structure or relatedness
In order to evaluate the effect of kinship and the genetic structure on linkage disequilibrium (LD), we studied genome-wide LD between 29,257 PANZEA markers from the 50K array within and between chromosomes before and after taking into account the kinship (K_Freq estimated from the 50K array), structure (Number of groups = 4) or both (Additional file 4: Figure S6). Whereas inter-chromosomal LD was only partially removed when the genetic structure was taken into account, it was mostly removed when either the kinship or both kinship and structure were considered (Additional file 4: Figure S6b and c). Accordingly, long distance intra-chromosomal LD was almost totally removed for all chromosomes by accounting for the kinship, structure or both. Interestingly, some pairs of loci located on different chromosomes or very distant on the same chromosome remained in high LD despite correction for genetic structure and kinship (Additional file 4: Figure S6). This can be explained either by genome assembly errors, by chromosomal rearrangements such as translocations or by strong epistatic interactions. Linkage disequilibrium decreased with genetic or physical distance (Additional file 4: Figure S7). The majority of pairs of loci with high LD (r2K>0.4) in spite of long physical distance (>30Mbp) were close genetically (<3cM), notably on chromosomes 3, 5, 7 and to a lesser extent 9 and 10. These loci were located in centromeric and peri-centromeric regions that displayed low recombination rate, suggesting that this pattern was due to variation of recombination rate along the chromosome. Only very few pairs of loci in high LD were genetically distant (>5cM) but physically close (<2Mbp). Linkage disequilibrium (r2K and r2KS) was negligible beyond 1 cM since 99% of LD values were less than 0.12 in this case.
Note that some unplaced SNPs remained in LD after taking into account the kinship and structure with some SNPs with known positions on chromosomes 1, 3 and 4 (Additional file 4: Figure S6). Therefore, LD measurement corrected by the kinship can help to map unplaced SNPs.
Linkage disequilibrium strongly differed between and within chromosomes
We combined the three technologies together to calculate the r2K for all pairs of SNPs, which were genetically distant by less than 1 cM. For any chromosome region, LD extent in terms of genetic and physical distance showed a limited variation over the 100 sets of 500,000 loci pairs (cf. Material). This suggests that the estimation of LD extent did not strongly depend on our set of loci. LD extent varied significantly between chromosomes for both high recombinogenic (>0.5 cM/Mbp) and low recombinogenic regions (<0.5 cM/Mbp, Table 1). Chromosome 1 had the highest LD extent in high recombination regions (0.062 ±0.007 cM) and chromosome 9 the highest LD extent in low recombinogenic regions (898.6±21.7 kbp) (Table 1). Linkage disequilibrium extent relative to genetic and physical distances was highly and positively correlated in high recombinogenic regions (r = 0.86), whereas it was not in low recombinogenic regions (r = -0.64).
The effective population size (Ne) estimated from Hill and Weir's model [37] using genetic distance varied from 7.9±0.04 (Chromosome 1) to 41.2±0.12 (Chromosome 7) in high recombinogenic regions. Notably, the same approach led to unrealistic values in low recombinogenic regions (from 961 on Chromosome 6 to more than 1 million for chromosomes 2 and 10), thereby confirming that the use of genetic distance is not well suited to model local LD in low recombinogenic regions.
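The relationship between Ne, genetic distance and LD extent can be sketched with the expectation of r² that is commonly attributed to Hill and Weir, written in terms of ρ = 4·Ne·c (c in Morgans) and a sample size n. The parametrization actually fitted by the authors (including how inbred lines and the kinship-corrected r2K enter the model) is not described in this section, so the code below is an assumption-laden illustration that is not expected to reproduce the Ne values reported here. The sample size of 247 and the Ne values of 10 and 40 are illustrative choices only.

```python
def expected_r2(ne: float, c_morgans: float, n_sampled: int) -> float:
    """Expected r2 between two loci under drift-recombination equilibrium.

    rho = 4 * Ne * c; the second factor is the finite-sample correction.
    """
    rho = 4.0 * ne * c_morgans
    base = (10.0 + rho) / ((2.0 + rho) * (11.0 + rho))
    correction = 1.0 + ((3.0 + rho) * (12.0 + 12.0 * rho + rho ** 2)) / (
        n_sampled * (2.0 + rho) * (11.0 + rho)
    )
    return base * correction


def ld_extent(ne: float, n_sampled: int, threshold: float = 0.1) -> float:
    """Genetic distance (Morgans) at which the expected r2 drops to the threshold."""
    if not (1.0 / n_sampled < threshold < expected_r2(ne, 0.0, n_sampled)):
        raise ValueError("threshold must lie between 1/n and the expected r2 at distance 0")
    lo, hi = 0.0, 1.0
    while expected_r2(ne, hi, n_sampled) > threshold:  # widen until bracketed
        hi *= 2.0
    for _ in range(100):  # bisection on a decreasing function of distance
        mid = (lo + hi) / 2.0
        if expected_r2(ne, mid, n_sampled) > threshold:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


N_SAMPLED = 247  # illustrative sample size, an assumption of this sketch
for ne in (10.0, 40.0):
    extent_cm = 100.0 * ld_extent(ne, N_SAMPLED)
    print(f"Ne = {ne:g}: expected r2 reaches 0.1 at about {extent_cm:.3f} cM")
```

Because the expectation depends on Ne and c only through their product, doubling Ne halves the LD extent in this model. This is why a larger fitted Ne corresponds to a shorter genetic LD extent, and why regions where LD hardly decays with genetic distance yield implausible Ne estimates.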
Finally, we studied the variation of local LD extent along each chromosome by adjusting Hill and Weir's model against genetic distance within a sliding window of 1cM (Additional file 4: Figure S8). After removing intervals that did not reach our criteria (absence of model convergence, effective population size > 247, low number of loci), the 3,205 remaining intervals (90%) showed a high variation for genetic LD extent along each chromosome, with LD extent varying from 0.019 (Ne = 246) to 0.997 cM (Ne = 0.06) (Additional file 4: Figure S8).
Large differences in genome coverage between technologies
We estimated the percentage of the genome that was covered by LD windows around SNPs, calculated by using either physical or genetic distances (Table 1). We observed a strong difference in coverage between the three technologies at both genome-wide and chromosome scale, as illustrated in Figure 2 on chromosome 3 (Table 1, and Additional file 2: Figure S2). For an LD extent of r2K = 0.1, 74%, 82% and 89% of the physical map, and 42%, 58% and 71% of the genetic map were covered by the 50K array, the GBS and the 600K array, respectively (Table 1). For the combined data (50K + 600K + GBS), the coverage strongly varied between chromosomes, ranging from 83% (chromosome 7) to 98% (chromosome 1) of the physical map, and from 51% (chromosome 7) to 97% (chromosome 1) of the genetic map (Table 1). For the physical map, increasing the LD extent threshold to r2K=0.4 reduced the genome coverage from 89% to 49% for 600K, 82% to 28% for GBS, 74% to 20% for 50K and 90% to 52% for the combined data. Increasing the MAF threshold slightly reduced the genome coverage, with a smaller reduction for the physical map than for the genetic map. Surprisingly, increasing the SNP number by combining the markers from the arrays and GBS did not strongly increase the genome coverage as compared to the 600K, regardless of the threshold for LD extent (Figure 2 and Additional file 2: Figure S2).
We observed a strong variation of genome coverage along each chromosome with contrasted patterns in low and high recombinogenic regions (Figure 2 and Additional file 2: Figure S2). While low recombinogenic regions were totally covered with all the technologies (except for a few intervals using the 50K array), the genome coverage in high recombinogenic regions varied depending on both technology and SNP distribution. In high recombination regions, 47% of the 2Mbp intervals were better covered by the 600K array than by GBS, whereas only 1% were better covered by GBS than by the 600K array.
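The coverage statistic can be illustrated as the fraction of a chromosome lying inside the union of LD windows centred on the SNPs. The sketch below assumes a single window half-width for the whole chromosome, whereas the analysis described here relies on LD extents that differ between regions; the function name and toy positions are hypothetical.

```python
from typing import Iterable


def covered_fraction(positions: Iterable[float], half_window: float,
                     chrom_length: float) -> float:
    """Fraction of [0, chrom_length] covered by windows of +/- half_window around SNPs."""
    intervals = sorted(
        (max(0.0, p - half_window), min(chrom_length, p + half_window))
        for p in positions
    )
    covered = 0.0
    current_start = current_end = None
    for start, end in intervals:
        if current_end is None or start > current_end:
            # Disjoint from the running block: close it and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping windows are merged so that no base is counted twice.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / chrom_length


# Toy chromosome of 200 units with three SNPs and a window half-width of 10.
sparse = covered_fraction([10, 20, 100], half_window=10, chrom_length=200)
# Adding a SNP inside an already covered window leaves the coverage unchanged.
denser = covered_fraction([10, 15, 20, 100], half_window=10, chrom_length=200)
print(sparse, denser)
```

The second call shows why adding markers does not necessarily raise coverage: SNPs that fall inside windows already covered by other SNPs add nothing to the union.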
Number of QTLs detected using genome-wide association studies increases with marker density
We observed a strong variation in the number of SNPs significantly associated with the three traits across the 22 environments (Table 2). The mean number of significant SNPs per environment and trait was 3.7, 44.7, 17.9 and 62.4 for the 50K, 600K, GBS and the three technologies combined, respectively (Table 3). Considering the p-value threshold used, 28, 303 and 204 false positives were expected among the 243, 2,953 and 1,182 associations detected for 50K, 600K and GBS, respectively. The false discovery rate therefore appeared higher for GBS (17.2%) than for DNA arrays (11.5% and 10.2% for 50K and 600K, respectively). This could be explained by the higher genotyping error rate of GBS due to imputation and/or by its higher number of markers with a lower MAF. Both reduce the power of GBS compared to DNA arrays and therefore lead to a higher false discovery rate. Proportionally to the SNP number, 50K and 600K arrays resulted in 1.5- and 1.7-fold more associated SNPs per situation (environment × trait) than GBS (p-value < 2 × 10⁻⁶, Table 3). This difference between arrays and GBS was higher for grain yield (GY) and plant height (plantHT) than for male flowering time (DTA, Table 3).
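The arithmetic behind these figures can be checked with a short sketch: the false discovery rate is the expected number of false positives divided by the number of detected associations, and the mean number of significant SNPs per situation is the number of associations divided by the number of environment × trait combinations. The sketch assumes 22 × 3 = 66 situations, which reproduces the reported means; variable and function names are illustrative.

```python
N_SITUATIONS = 22 * 3  # assumed: 22 environments x 3 traits

# Values quoted in the text for each technology.
reported = {
    "50K": {"expected_false_positives": 28, "associations": 243},
    "600K": {"expected_false_positives": 303, "associations": 2953},
    "GBS": {"expected_false_positives": 204, "associations": 1182},
}


def false_discovery_rate(expected_false_positives: float, associations: int) -> float:
    """Expected share of detected associations that are false positives."""
    return expected_false_positives / associations


def mean_per_situation(associations: int, n_situations: int = N_SITUATIONS) -> float:
    """Mean number of significant SNPs per environment x trait combination."""
    return associations / n_situations


fdr = {tech: false_discovery_rate(**values) for tech, values in reported.items()}
means = {tech: mean_per_situation(values["associations"]) for tech, values in reported.items()}

for tech in reported:
    print(f"{tech}: FDR = {100 * fdr[tech]:.2f}%, mean SNPs per situation = {means[tech]:.1f}")
```

The ratios agree with the percentages given in the text to within rounding of the last digit.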
We used two approaches based on LD for grouping significant SNPs: (i) considering that all SNPs with overlapping LD windows for r2K=0.1 belong to the same QTL (LD_win) and (ii) grouping significant SNPs that are adjacent on the physical map and are in LD (r2K > 0.5, LD_adj). The QTLs defined by using the two approaches were globally consistent since significant SNPs within QTLs were in high LD whereas SNPs from different adjacent QTLs were not (Additional file 6: Figure S9-LD-Adjacent and Additional file 7: Figure S9-LD-Windows). LD_adj detected more QTLs than LD_win for flowering time (242 vs 226), plant height (240 vs 160) and grain yield (433 vs 237). The number of QTLs detected with the LD_adj approach increased strongly when the LD threshold was set above 0.5. Differences in QTL groupings between the two methods were observed for specific LD and recombination patterns. This occurred for instance on chromosome 6 for grain yield (Additional file 6: Figure S9-LD-Adjacent and Additional file 7: Figure S9-LD-Windows). Within this region, the recombination rate was low and the LD pattern between associated SNPs was complex. While LD_adj split several SNPs in high LD into different QTLs (for instance QTL 232, 235, 237, 249), LD_win grouped together associated SNPs that are genetically close but displayed a low LD (Additional file 6: Figure S9-LD-Adjacent and Additional file 7: Figure S9-LD-Windows). Reciprocally, for flowering time, we observed different cases where LD_win separated distant SNPs in high LD into different QTLs whereas LD_adj grouped them (QTL 25 and 26, 51 to 53, 95 to 97, 208 and 209, 218 and 219). As these differences were specific to complex LD and recombination patterns, we used the LD_win approach for the rest of the analyses.
Although a large difference in number of associated SNPs was observed between 600K and GBS, little difference was observed between QTL number after grouping SNPs (Table 2, Table 3). The mean number of QTLs was indeed 1.0, 5.9, 5.2 and 9.5 for the 50K, 600K, GBS, and the three technologies combined, respectively (Table 3). Note that the number of QTLs continued to increase with marker density when SNPs from GBS, 50K and 600K were combined (Figure 4). The number of SNPs associated with each QTL varied according to the technology (on average 3.7, 7.6, 3.4 and 6.6 significant SNPs for the 50K, 600K, GBS, and the combined technologies, respectively). The total number of QTLs detected over all environments by using the 600K array and GBS was close for flowering time (130 vs 133) and plant height (96 vs 90). It was 1.4-fold higher for the 600K than GBS for grain yield (166 vs 120).
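The two grouping rules can be sketched as follows. For LD_win, each significant SNP is given a window and SNPs whose windows overlap are merged into one QTL; for LD_adj, SNPs are walked in map order and a SNP joins the current QTL only if its LD with the preceding significant SNP exceeds the threshold. The sketch assumes a single window half-width and a user-supplied function returning pairwise LD; the function names, positions and LD values are hypothetical and chosen only to show how the two rules can disagree.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def group_ld_win(positions: Sequence[float], half_window: float) -> List[List[float]]:
    """LD_win sketch: SNPs whose +/- half_window intervals overlap form one QTL."""
    groups: List[List[float]] = []
    current_end = float("-inf")
    for pos in sorted(positions):
        start, end = pos - half_window, pos + half_window
        if groups and start <= current_end:
            groups[-1].append(pos)
            current_end = max(current_end, end)
        else:
            groups.append([pos])
            current_end = end
    return groups


def group_ld_adj(positions: Sequence[float],
                 r2: Callable[[float, float], float],
                 threshold: float = 0.5) -> List[List[float]]:
    """LD_adj sketch: adjacent significant SNPs are grouped when their LD exceeds the threshold."""
    groups: List[List[float]] = []
    for pos in sorted(positions):
        if groups and r2(groups[-1][-1], pos) > threshold:
            groups[-1].append(pos)
        else:
            groups.append([pos])
    return groups


# Hypothetical significant SNPs (map positions) and LD between adjacent pairs.
significant = [100, 150, 400, 1000]
pairwise_r2: Dict[Tuple[float, float], float] = {
    (100, 150): 0.3,   # close on the map but in low LD
    (150, 400): 0.8,   # farther apart but in high LD
    (400, 1000): 0.1,
}

qtl_win = group_ld_win(significant, half_window=100)
qtl_adj = group_ld_adj(significant, lambda a, b: pairwise_r2[(a, b)])
print("LD_win:", qtl_win)
print("LD_adj:", qtl_adj)
```

In this toy case LD_win merges the two neighbouring SNPs despite their low LD and separates the distant pair in high LD, while LD_adj does the opposite, mirroring the two kinds of disagreement described in the text.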
The 600K and GBS were highly complementary for association mapping
The 600K and GBS technologies were highly complementary to detect QTLs for the three traits: 78%, 76% and 71% of the QTLs of flowering time, plant height and grain yield, respectively, were specifically detected by 600K or GBS (Figure 5). On the contrary, 50K displayed very few specific QTLs. While only 9 out of 69 QTLs from the 50K array were not detected when the 600K array was used, 39 QTLs detected using the 50K array were not detected when using GBS. When we combined the GBS and 600K markers, 7% of their common QTLs had -log10(Pval) increased by 2 and 21% by 1, potentially indicating a gain in accuracy of the position of the causal polymorphism (Additional file 8: Table S3).
This complementarity between GBS and 600K is well exemplified with two strong association peaks for flowering time on chromosome 1 (QTL32) and 3 (QTL95) detected in several environments (Additional file 8: Table S3 and Figure 6a). In order to better understand the origin of the complementarity between GBS and 600K technologies for GWAS, we scrutinized the LD between SNPs and the haplotypes within these two QTLs (Figure 6b and c, and Additional file 9: Figure S10 for other examples). For example, QTL95 showed a gain in power. It was only identified by the 600K array although the region included numerous SNPs from GBS close to the associated peak. None of these SNPs was in high LD with the most associated marker of the QTL95 (Figure 6b). Another example is QTL32, which was detected by 1 to 10 GBS markers in 9 environments with –log(p-value) ranging from 5 to 7.6, whereas it was detected by only two 600K markers in one environment (Ner12W) with –log(p-value) slightly above the significance threshold (Additional file 8: Table S3 and Figure 6b).
Haplotype analyses showed that the SNPs from the GBS within QTL95 were not able to discriminate all haplotypes (Figure 6c). In QTL95, using the 600K markers allowed one to discriminate the three main haplotypes (H1, H2, H3), whereas using the GBS markers did not allow discrimination of H3 against H1 + H2. As H1 contributed to an earlier flowering time than H2 or H3, associations appeared more significant for the 600K than for GBS (Figure 6c). In QTL32, the use of GBS markers allowed identifying late individuals that mostly displayed H1, H2 and H3 haplotypes, against early individuals that mostly displayed H4 and H5 haplotypes (Figure 6c). The gain of power for GBS markers as compared to 600K markers for QTL32 originated from the ability to discriminate late individuals (black alleles) from early individuals (red alleles) within H4 haplotypes (Figure 6c).
Stability, pleiotropy and distribution of QTLs detected across environments
After combining the three technologies, we identified 226, 160, 238 QTLs for the flowering time, plant height and grain yield, respectively (Table 3 and Additional file 8: Table S3). We highlighted 23 QTLs with the strongest effects on flowering time, plant height and grain yield (-log10(Pval) ≥ 8, Table 4). The strongest association corresponded to the QTL95 for flowering time (-log10(p-value) = 10.03) on chromosome 3 (158,943,646 – 159,005,990 bp), the QTL135 for GY (-log10(p-value) = 18.7) on chromosome 6 (12,258,527 – 29,438,316 bp) and QTL78 on chromosome 6 (12,258,527 – 20,758,095 bp) for plant height (-log10(p-value) = 17.31). The QTL95 for flowering time was the most stable QTL across environments since it was detected in 19 environments (Additional file 8: Table S3). Moreover, this QTL colocalized with QTL74 for grain yield in 5 environments and QTL30 for plant height in 1 environment, suggesting a pleiotropic effect. More globally, 472 QTLs appeared trait-specific whereas 70 QTLs overlapped between at least two traits (6.3%, 5.2% and 3.0% for GY and plantHT, GY and DTA, and DTA and plantHT, respectively), suggesting that some QTLs are pleiotropic (Additional file 10: Figure S11). This is not surprising since average corresponding correlations within environments for these traits were moderate (0.47, 0.54 and 0.45, respectively). Only 0.7% overlapped between the three traits (Additional file 10: Figure S11). Twenty percent of QTLs were detected in at least two environments and 9% in at least three environments (Additional file 10: Figure S12 and Table S4). We observed no significant differences of stability between the three traits (p-value = 0.2). However, 6 out of 7 of the most stable QTLs (number of environments > 5) were found for flowering time, and a higher proportion of QTLs were specific for plant height than for grain yield and flowering time (85% vs 77% for both flowering time and grain yield; p-value = 0.09 and p-value = 0.2, respectively).
We observed that QTLs that displayed a significant effect in more than one environment had larger effects and -log(p-value) values than those significant in a single environment. This difference in -log(p-value) values was stronger for grain yield and plant height than flowering time.
The distribution of QTLs was not homogeneous along the genome since 82%, 77% and 79% of flowering time, plant height and grain yield QTLs, respectively, were located in the high recombinogenic regions, whereas these regions represented only 46% of the physical genome (Additional file 10: Table S5 and Figure S13). The QTLs were more stable (≥ 2 environments) in low than in high recombinogenic regions (12.8% vs 5.8%, p-value = 0.03).
Discussion
GBS required massive imputation but displayed global trends similar to those of DNA arrays for genetic diversity organization
In order to reduce genotyping cost, GBS is most often performed at low depth, leading to a high proportion of missing data and thereby requiring imputation in order to perform GWAS. Imputation can produce genotyping errors that can cause false associations and introduce bias in diversity analysis [33]. We evaluated the quality of genotyping and imputation obtained by different approaches, taking the 50K or 600K arrays as references. The best imputation method that yielded a fully genotyped matrix with a low error rate for the prediction of both heterozygotes and homozygotes was the approach merging the homozygous genotypes from Tassel and the imputation of Beagle for the other data (GBS5 in Additional file 1: Table S1). The quality of imputation was high, with 96% of allelic values consistent with those of the 50K and 600K arrays. This level of concordance is identical to that reported in a study of the USA national maize inbred seed bank by Romay et al. [32]. It is higher than in a diversity study of a European flint maize collection (93%) by Gouesnard et al. [33], which was more distant from the reference AllZeaGBSv2.7 database than the panel presented here.
The ascertainment bias of arrays due to the limited number of lines used for SNP discovery was reinforced by counter-selection of rare alleles during the design process of DNA arrays [3, 4]. For GBS, the polymorphism database used to call polymorphisms included thousands of diverse lines [35]. In our study, we used the AllZeaGBSv2.7 database. After a first step of GBS imputation (GBS2), missing data dropped to 11.9%, i.e. only slightly more than in Romay et al. (10%) [32]. This confirms that the polymorphism database (AllZeaGBSv2.7) adequately covered the genetic diversity of our genetic material.
Although, we observed differences of allelic frequency spectrum between GBS and DNA arrays, these technologies revealed similar trends in the organization of population structure and relatedness (Figure 1, Additional file 3: Figure S3 and S4 and Table S2) suggesting no strong ascertainment bias for deciphering global genetic structure trends in the panel. However, although highly correlated, level of relatedness differed between GBS and DNA arrays, especially when the lines were less related as showed by the deviation (to the left) of the linear regression from the bisector (Figure 3).
The extent of linkage disequilibrium strongly varied along and between chromosomes
Linkage disequilibrium extent in high recombinogenic regions varied to a large extent among chromosomes, ranging from 0.012 to 0.062 cM. Similar variation of genetic LD extent between maize chromosomes has been previously observed by Rincent et al. [14], but their classification of chromosomes was different from ours. This difference could be explained by the fact that we analyzed high and low recombination regions separately. According to the Hill and Weir model [37], the physical LD extent in a genomic region increases when the local recombination rate decreases. As a consequence, chromosomes 1 and 9 had the lowest and highest physical LD extent and displayed the highest and one of the lowest recombination rates in pericentromeric regions, respectively (0.26 vs 0.11 cM / Mbp, Table 1 and Additional file 10: Table S5). Unexpectedly, the genetic LD extent also correlated negatively with the recombination rate. This suggests that chromosomes with a low recombination rate also display a low effective population size. Background selection against deleterious alleles could explain this pattern since it reduces the genetic diversity in low recombinogenic regions [38, 39]. Finally, we observed a strong variation of the LD extent along each chromosome (Additional file 4: Figure S8). As we used a consensus genetic map [40] that represents well the recombination within our population, this suggests, according to Hill and Weir’s model, that the number of ancestors contributing to genetic diversity varied strongly along the chromosomes. This likely reflects the selection of genomic regions for adaptation to environment or agronomic traits [38], which leads to a differential contribution of ancestors according to their allelic effects. Ancestors with strong favorable allele(s) in a genomic region may ultimately lead to large identical-by-descent genomic segments [41].
SNPs were clustered into QTL highlighting interesting genomic regions
In previous GWAS, the closest associated SNPs were grouped into QTLs according to either a fixed physical distance [1] or a fixed genetic distance [30, 42]. These approaches suffer from two drawbacks. First, the physical LD extent can vary strongly along chromosomes according to the variation of recombination rate (Additional file 2: Figure S2). Second, the genetic LD extent depends both on panel composition and the position along the genome (Table 1). These approaches may therefore strongly overestimate or underestimate the number of QTLs. To address both issues, Cormier et al. [43] proposed to group associated SNPs by using a genetic window based on the genetic LD extent estimated by the Hill and Weir model in the genomic regions around the associated peaks [37]. In our study, we improved this last approach (LD_win):
- First, we used r2K that corrected r2 for kinship rather than the classical r2 since r2K reflected the LD addressed in our GWAS mixed models to map QTL [17].
- Second, we took advantage of the availability of both physical and genetic maps of maize to project the genetic LD extent on the physical map. This physical window was useful to retrieve the annotation from B73 reference genome, decipher local haplotype diversity (Figure 6) and estimate physical genome coverage (Table 1, Figure 2).
- Third, we considered an average LD extent estimated separately in the high and low recombinogenic genomic regions. This average was estimated by using several large random sets of pairs of loci in these regions rather than the local LD extent in the genomic regions around each associated peak.
We preferred this approach to using the local LD extent in order to limit the effect of (i) the strong variation of marker density along the chromosome (Additional file 2: Figure S2), (ii) the local ascertainment bias due to marker sampling, (iii) the poor estimation of the local recombination rate using a genetic map, notably for low recombination regions [3, 41], and (iv) errors in locus order due to assembly errors or chromosomal rearrangements.
We compared LD_win with LD_adj, another approach based on LD to group the SNPs associated with trait variation into QTLs. The discrepancies between the two approaches can be explained by the local recombination rate and LD pattern. Since the LD_adj approach was based on the grouping of contiguous SNPs according to their LD, this approach was highly sensitive to (i) errors in marker order or position due to genome assembly errors or structural variations, which are important in maize [44], (ii) genotyping or imputation errors, which we estimated at ca. 1% and ca. 4%, respectively, for GBS (Additional file 1: Table S1), (iii) the presence of allelic series with contrasted effects in different experiments, which are commonly observed in maize [40], and (iv) the LD threshold used. On the other hand, LD_win could either inflate the number of QTLs in high recombinogenic regions, in which SNPs were too distant genetically to be grouped, or deflate their number by grouping associated SNPs in low recombinogenic regions. Since LD_win considered the average LD extent, this method could wrongly separate or group SNPs when the local LD extent differed from the global LD extent.
Note that LD windows should not be considered as confidence intervals since the relationship between LD and recombination is complex due to demography, drift and selection in association panels, contrary to linkage-based QTL mapping [17]. The magnitude of the effect of the causal polymorphism on the estimation of these intervals, which is well established for linkage mapping, should be explored further [45]. Other approaches have been proposed to cluster SNPs according to LD [46, 47]. These approaches aim at segmenting the genome into different haplotype blocks separated by high recombination regions. These methods are difficult to use for estimating putative windows containing the causal polymorphisms because they are not centered on the associated SNP.
Several QTLs identified by LD_win in our study correspond to regions previously identified: in particular six regions associated with female flowering time [27] and 30 regions associated with different traits in the Cornfed dent panel [11]. Conversely, we did not identify in our study any QTL associated with the florigen ZCN8, which showed a significant effect in these two previous studies. This most likely relates to the fact that we narrowed the flowering time range in our study, in particular by eliminating early lines. This reduced the representation of the early allele at ZCN8, leading to a MAF of 0.27 in our study vs. 0.35 in Rincent et al. [11], which can diminish the power of the tests [14].
Complementarity of 600K and GBS for QTL detection resulted mostly from the tagging of different haplotypes rather than the coverage of different genomic regions
The number of significant SNPs and QTLs increased with the number of markers (Table 3, Figure 4). This could be explained partly by a better coverage of some genomic regions by SNPs, notably in high recombinogenic regions, which showed a very short LD extent and were enriched in QTLs (Additional file 10: Figure S13). Numerous new QTLs identified by the 600K array and GBS as compared with those identified by the 50K array were detected in high recombinogenic regions that were considerably less covered by the 50K array than by the 600K array or GBS (Additional file 2: Figure S2).
The high complementarity for QTL detection between GBS and the 600K array was only explained to a limited extent by the difference in SNP distribution and density along the genome, since these two technologies targeted similar regions, as shown by coverage analysis (Figure 2 and Additional file 2: Figure S2). However, at a finer scale, SNPs from the 600K array and GBS could tag close but different genomic regions around genes. SNPs from the 600K array were mostly selected within coding regions of genes [4], whereas SNPs from GBS more broadly targeted low copy regions, which included coding but also regulatory regions of genes [32, 35]. To further analyse the complementarity of the technologies, we analysed local haplotypes. We showed that both technologies captured different haplotypes when similar genomic regions were targeted (Figure 6). Hence, GBS and DNA arrays are highly complementary for QTL detection because they tag different haplotypes rather than different regions (Figure 6). Based on the L-shaped MAF distribution, which suggests no ascertainment bias, and the high number of sequenced lines used for the GBS, we expect this technology to represent the variation present in our panel more closely than the 600K array, but this comes at the cost of an enrichment in rare alleles. Both factors tend to counterbalance each other in terms of GWAS power.
Our results suggest that we did not reach saturation with our c. 800,000 SNPs because (i) some haplotypes certainly remain untagged, (ii) the genome coverage was not complete, and (iii) the number of significant SNPs and QTLs continued to increase with marker density (Figure 4). Considering LD and marker density, the genotypic data presently available were most likely enough to well represent polymorphisms in the centromeric regions, whereas using more markers would be beneficial for telomeric regions. New approaches based on resequencing of representative lines and imputation are currently being developed to achieve this goal.
Methods
Plant Material and Phenotypic Data
The panel of 247 genotypes (Additional file 11: Table S6) includes 164 lines from a wider panel of the CornFed project, composed of dent lines from Europe and America [11], and 83 additional lines derived from public breeding programs in Hungary, Italy and Spain and recent lines free of patent from the USA. Lines were selected within a restricted window of flowering time (10 days). Candidate lines with poor sample quality, i.e. a high level of heterozygosity, or high relatedness with other lines were discarded. Line selection was also guided by pedigree to avoid as far as possible over-representation of some parental materials. These inbred lines were crossed with a common flint tester (UH007) and the hybrids were evaluated for male flowering time (Day To Anthesis, DTA), plant height (plantHT), and grain yield (GY) at seven sites in Europe, during two years (2012 and 2013), and for two water treatments (watered and rainfed) [30]. The adjusted means (Best Linear Unbiased Estimations, BLUEs, https://doi.org/10.15454/IASSTN) of the three traits were estimated per environment (site × year × treatment) using a mixed model with correction for blocks, repetitions and rows and columns in order to take into account spatial variation of micro-environment in each field trial [30]. Variance components and heritability of each trait in each environment were also estimated [30] (Additional file 12: Table S7). Adjusted means of hybrids were combined with genotyping data of the lines to perform GWAS.
Genotyping and Genotyping-By-Sequencing Data
The inbred lines were genotyped using three technologies: a maize Illumina Infinium HD 50K array [3], a maize Affymetrix Axiom 600K array [4], and Genotyping-By-Sequencing [2, 35]. In the arrays, DNA fragments are hybridized with probes attached to the array that flank SNPs previously identified between inbred lines (Additional file 5: Supplementary Text 1 for the description of the data from the two arrays). Genotyping-by-sequencing technology is based on multiplex resequencing of tagged DNA from different individuals for which some genomic regions were targeted using restriction enzymes (Keygene N.V. owns patents and patent applications protecting its Sequence Based Genotyping technologies) [2]. Cornell Institute (NY, USA) processed raw sequence data using a multi-step Discovery and a one-step Production pipeline (TASSEL-GBS) in order to obtain genotypes (Additional file 5: Supplementary Text 2). An imputation step of missing genotypes was carried out by Cornell Institute [36], which used an algorithm that searches for the closest neighbour in small SNP windows across the haplotype library [35], allowing for a 5% mismatch. If the requirements were not met, the SNP was left ungenotyped for the individuals concerned.
We applied different filters (heterozygosity rate, missing data rate, minor allele frequency) for a quality control of the genetic data before performing the diversity and association genetic analyses. For GBS data, the filters were applied after imputation using the method “Compilation of Cornell homozygous genotypes and Beagle genotypes” (GBS5 in Additional file 1: Figure S1; See section “Evaluating Genotyping and Imputation Quality”). We eliminated markers that had an average heterozygosity and missing data rate higher than 0.15 and 0.20, respectively, and a Minor Allele Frequency (MAF) lower than 0.01 for the diversity analyses and 0.05 for the GWAS. Individuals which had heterozygosity and/or missing data rate higher than 0.06 and 0.10, respectively, were eliminated.
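The filtering rules above can be sketched as follows. This is a minimal illustration, not the study's pipeline: the function names are hypothetical, genotypes are assumed to be coded 0, 0.5 and 1 with None for missing calls, and heterozygosity and allele frequency are assumed to be computed on non-missing calls only.

```python
from typing import List, Optional, Tuple

Call = Optional[float]  # 0, 0.5 or 1; None marks a missing genotype


def call_stats(calls: List[Call]) -> Tuple[float, float, float]:
    """Return (missing rate, heterozygosity, MAF) for one marker or one line."""
    observed = [g for g in calls if g is not None]
    missing_rate = 1.0 - len(observed) / len(calls)
    if not observed:
        return missing_rate, 0.0, 0.0
    # Assumption: rates are computed on non-missing calls only.
    het = sum(1 for g in observed if g == 0.5) / len(observed)
    p = sum(observed) / len(observed)  # frequency of the allele coded 1
    return missing_rate, het, min(p, 1.0 - p)


def keep_marker(calls: List[Call], min_maf: float,
                max_het: float = 0.15, max_missing: float = 0.20) -> bool:
    """Marker filter: min_maf is 0.01 for diversity analyses, 0.05 for GWAS."""
    missing_rate, het, maf = call_stats(calls)
    return het <= max_het and missing_rate <= max_missing and maf >= min_maf


def keep_line(calls: List[Call],
              max_het: float = 0.06, max_missing: float = 0.10) -> bool:
    """Individual filter applied across all markers of one inbred line."""
    missing_rate, het, _ = call_stats(calls)
    return het <= max_het and missing_rate <= max_missing


# Toy marker: 98 lines carry allele 0 and 2 lines carry allele 1 (MAF = 0.02).
rare_marker = [0.0] * 98 + [1.0] * 2
print(keep_marker(rare_marker, min_maf=0.01))  # kept for diversity analyses
print(keep_marker(rare_marker, min_maf=0.05))  # dropped for GWAS
```

Using one statistics function for both markers and lines keeps the two filters consistent; only the thresholds differ.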
Evaluating Genotyping and Imputation Quality
The genotyping and imputation quality was estimated using 245 lines since two inbred lines had different seedlots between technologies. The 50K and the 600K arrays were taken as references to assess the concordance of genotyping (genotype matches) with the imputation of GBS based on their position. While SNP position and orientation from GBS were called on the reference maize genome B73 AGP_v2 (release 5a) [48], flanking sequences of SNPs in the 50K array were initially aligned on the first maize genome reference assembly B73 AGP_v1 (release 4a.53) [49]. Both the position and the orientation of scaffolds carrying SNPs from the 50K array can be different in AGP_v2, which could impair correct comparison of genotypes between the 50K array and GBS. Hence, we aligned flanking sequences of SNPs from the 50K array on maize B73 AGP_v2 using the Basic Local Alignment Search Tool (BLAST) to retrieve both positions and genotypes in the same and correct strand orientation (forward) to compare genotyping. The number of common markers between the 50K/600K, 50K/GBS, GBS/600K and 50K/600K/GBS was 36,395, 7,018, 25,572 and 5,947 SNPs, respectively. The comparison of the genotyping and imputation quality between the 50K/GBS, 50K/600K and 600K/GBS was done on 5,336 and 24,286 PANZEA markers [50] in common, and 26,154 markers in common, respectively. The genotyping concordance of the 600K with the 50K array was extremely high (99.50%) but slightly lower for heterozygotes (92.88%). In order to achieve these comparisons, we considered the direct reads from GBS (GBS1) and four approaches for imputation (GBS2 to GBS5, Additional file 1: Figure S1). The GBS2 approach consisted of one imputation step from the direct reads by Cornell University, using TASSEL software, but missing data were still present. The GBS3 approach consisted of a genotype imputation of all the missing data of the direct reads by Beagle v3 [13]. In GBS4, genotype imputation by Beagle was performed on Cornell imputed data after replacing the heterozygous genotypes by missing data. GBS5 consisted of the homozygous genotypes of GBS2 completed by values imputed in GBS3 (Additional file 1: Figure S1).
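As an illustration of the GBS5 compilation and of the concordance measure, the sketch below merges two call vectors for one marker and scores them against an array. It assumes genotypes coded 0, 0.5 and 1 with None for missing data; the function names and toy values are hypothetical, and concordance is assumed to be the share of matching genotypes among positions called in both datasets.

```python
from typing import List, Optional

Call = Optional[float]  # 0, 0.5 or 1; None marks a missing genotype


def compile_gbs5(gbs2: List[Call], gbs3: List[Call]) -> List[Call]:
    """Keep homozygous GBS2 calls; take the GBS3 (Beagle) value elsewhere.

    'Elsewhere' covers both missing and heterozygous GBS2 calls.
    """
    return [g2 if g2 in (0, 1) else g3 for g2, g3 in zip(gbs2, gbs3)]


def concordance(test: List[Call], reference: List[Call]) -> float:
    """Proportion of identical genotypes among positions called in both sets."""
    pairs = [(t, r) for t, r in zip(test, reference)
             if t is not None and r is not None]
    if not pairs:
        return float("nan")
    return sum(1 for t, r in pairs if t == r) / len(pairs)


# Toy example for one marker across four lines (values are made up).
gbs2 = [0, None, 0.5, 1]       # TASSEL-imputed calls, some still missing
gbs3 = [0, 1, 0, 1]            # fully imputed calls
array_calls = [0, 1, 0.5, None]  # reference array, one missing call

gbs5 = compile_gbs5(gbs2, gbs3)
print(gbs5)                           # merged calls
print(concordance(gbs5, array_calls))  # share of matches on shared calls
```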
Diversity Analyses
After excluding the unplaced SNPs and applying the filtering criteria for the diversity analyses (MAF > 0.01), we obtained the final genotyping data of the 247 lines with 44,729 SNPs from the 50K array, 506,662 SNPs from the 600K array, and 395,024 SNPs from the GBS (Figure 1). All markers of the 600K array and GBS5 that passed the quality control were used to perform the diversity analyses (estimation of Q genetic groups and K kinships). For the 50K, we used only the PANZEA markers (29,257 SNPs) [50] in order to reduce the ascertainment bias noted by Ganal et al. [3] when estimating Nei’s index of diversity [51] and relationship coefficients. Genotypic data generated by the three technologies were organized as G matrices with N rows and L columns, N and L being the panel size and number of markers, respectively. The genotype of individual i at marker l ($G_{i,l}$) was coded as 0 (the homozygote for an arbitrarily chosen allele), 0.5 (heterozygote), or 1 (the other homozygote). Identity-By-Descent (IBD) was estimated according to Astle and Balding [19]: $K\_Freq_{i,j} = \frac{1}{L}\sum_{l=1}^{L}\frac{(G_{i,l}-p_l)(G_{j,l}-p_l)}{p_l(1-p_l)}$, where $p_l$ is the frequency of the allele coded 1 of marker l in the panel of interest, and i and j indicate the inbred lines for which the kinship was estimated. We also estimated the Identity-By-State (IBS) by estimating the proportion of shared alleles. For GWAS, we used K_Chr [14], which is computed using a formula similar to that of K_Freq, but with the genotyping data of all the chromosomes except the chromosome of the SNP tested. This formula provides an unbiased estimate of the kinship coefficient and weights by allelic frequency assuming Hardy-Weinberg equilibrium. Hence, relatedness is higher if two individuals share rare alleles than if they share common alleles.
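A minimal sketch of the relatedness estimators follows. It assumes the Astle and Balding estimator takes the form K_Freq(i,j) = (1/L) Σ_l (G_il − p_l)(G_jl − p_l) / (p_l(1 − p_l)) for genotypes coded 0, 0.5 and 1, and that IBS is one minus the mean absolute genotype difference. The function names, the skipping of monomorphic markers and the toy data are illustrative choices, not taken from the study's scripts.

```python
from typing import List, Optional, Sequence


def k_freq(G: List[List[float]],
           markers: Optional[Sequence[int]] = None) -> List[List[float]]:
    """Frequency-weighted kinship on a chosen subset of marker columns.

    Assumption: monomorphic markers (p = 0 or 1) are skipped to avoid
    dividing by zero.
    """
    n = len(G)
    cols = range(len(G[0])) if markers is None else markers
    p = {l: sum(row[l] for row in G) / n for l in cols}
    used = [l for l in cols if 0.0 < p[l] < 1.0]
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            total = sum((G[i][l] - p[l]) * (G[j][l] - p[l]) / (p[l] * (1 - p[l]))
                        for l in used)
            K[i][j] = K[j][i] = total / len(used)
    return K


def ibs(G: List[List[float]]) -> List[List[float]]:
    """Proportion of shared alleles, assumed here to be 1 - mean |G_i - G_j|."""
    n, L = len(G), len(G[0])
    return [[1.0 - sum(abs(G[i][l] - G[j][l]) for l in range(L)) / L
             for j in range(n)] for i in range(n)]


def k_chr(G: List[List[float]], chrom: List[int],
          tested_chrom: int) -> List[List[float]]:
    """Kinship from all markers except those on the tested chromosome."""
    keep = [l for l, c in enumerate(chrom) if c != tested_chrom]
    return k_freq(G, keep)


# Toy panel: 4 lines x 4 markers, markers 0-1 on chromosome 1, 2-3 on chromosome 2.
G = [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0], [1, 1, 0, 0]]
chrom = [1, 1, 2, 2]
print(k_freq(G)[0][1], k_freq(G)[0][2])  # identical lines vs opposite lines
print(k_chr(G, chrom, tested_chrom=1)[0][1])
```

Passing the marker subset as an argument lets the same function serve for the genome-wide estimate and for the per-chromosome variant.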
Genetic structure was analysed using the software ADMIXTURE v1.22 [18] with a number of groups varying from 2 to 10 for the three technologies. We compared the assignment of inbred lines by ADMIXTURE between the three technologies by estimating the proportion of inbred lines consistently assigned between pairs of technologies (50K vs GBS5, 50K vs 600K, 600K vs GBS5) using a threshold of 0.5 for admixture.
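The pairwise comparison of assignments could be sketched as below. This is one possible reading of the 0.5 threshold: a line is assigned to a group only when its largest membership coefficient exceeds 0.5, and is otherwise treated as admixed. The sketch also assumes that group labels have already been aligned between the two runs; function names and Q values are hypothetical.

```python
from typing import List, Optional


def assign_group(q_row: List[float], threshold: float = 0.5) -> Optional[int]:
    """Index of the dominant group, or None when the line is treated as admixed."""
    best = max(range(len(q_row)), key=lambda k: q_row[k])
    return best if q_row[best] > threshold else None


def consistency(q_a: List[List[float]], q_b: List[List[float]],
                threshold: float = 0.5) -> float:
    """Proportion of lines with the same assignment in both Q matrices.

    Assumption: column k refers to the same genetic group in q_a and q_b.
    """
    same = sum(1 for a, b in zip(q_a, q_b)
               if assign_group(a, threshold) == assign_group(b, threshold))
    return same / len(q_a)


# Toy membership coefficients for four lines and two groups (made-up values).
q_50k = [[0.9, 0.1], [0.4, 0.6], [0.5, 0.5], [0.2, 0.8]]
q_gbs = [[0.8, 0.2], [0.7, 0.3], [0.45, 0.55], [0.1, 0.9]]
print(consistency(q_50k, q_gbs))
```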
Expected heterozygosity (He) [51] was estimated at each marker as $2p_l(1 - p_l)$ and was averaged over all the markers for a global characterization of the panel for the three technologies. Principal Coordinate Analyses (PCoA) were performed on the genetic distance matrices [52], estimated as $1_{N,N} - K\_Freq$, where $1_{N,N}$ is a matrix of ones of the same size as K_Freq.
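The two quantities defined above are simple enough to write out directly. The sketch below is illustrative only: the function names and toy values are hypothetical, and the eigen-decomposition step of the PCoA itself is not shown.

```python
from typing import List


def expected_heterozygosity(freqs: List[float]) -> float:
    """Average of 2 * p_l * (1 - p_l) over markers."""
    return sum(2.0 * p * (1.0 - p) for p in freqs) / len(freqs)


def distance_matrix(K: List[List[float]]) -> List[List[float]]:
    """Genetic distance as a matrix of ones minus the kinship matrix."""
    return [[1.0 - k for k in row] for row in K]


# Toy values: two markers with allele frequencies 0.5 and 0.1,
# and a 2 x 2 kinship matrix.
print(expected_heterozygosity([0.5, 0.1]))
print(distance_matrix([[1.0, 0.25], [0.25, 1.0]]))
```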
Linkage Disequilibrium Analyses
We first analyzed the effect of the genetic structure and kinship on linkage disequilibrium (LD) extent within and between chromosomes by estimating genome-wide linkage disequilibrium using the 29,257 PANZEA SNPs from the 50K array. Four estimates of LD were used: the squared correlation (r2) between allelic dose at two markers [53], the squared correlation taking into account global kinship with K_Freq estimator (r2K), the squared correlation taking into account population structure (r2S), and the squared correlation taking into account both (r2KS) [17].
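For reference, the uncorrected estimator can be sketched as the squared Pearson correlation between allelic doses at two markers. Only the classical r2 is shown; the kinship- and structure-corrected versions (r2K, r2S, r2KS) follow [17] and are not reproduced here. The function name and toy vectors are hypothetical.

```python
from typing import List


def r2(x: List[float], y: List[float]) -> float:
    """Squared correlation between allelic doses at two markers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return float("nan")  # undefined when a marker is monomorphic
    return sxy * sxy / (sxx * syy)


# Toy allelic doses for four inbred lines (coded 0 / 1).
print(r2([0, 0, 1, 1], [0, 0, 1, 1]))  # complete LD
print(r2([0, 0, 1, 1], [0, 1, 0, 1]))  # no LD
```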
To explore the variation of LD decay and the stability of LD extent along the chromosomes, we estimated LD between a non-redundant set of 810,580 loci from the GBS, the 50K and 600K arrays. To save computation time, we calculated LD between loci within a sliding window of 1 cM. The genetic position was obtained by projecting the physical position of each locus using the smooth.spline function in R calibrated on the genetic consensus map of the Cornfed Dent Nested Association Mapping (NAM) design [40]. We used the estimators r2 and r2K, the latter using 10 different kinships K_Chr. This last estimator was calculated because it corresponds exactly to the LD used to map QTLs in our GWAS model. It determines the power of GWAS to detect QTLs, considering that causal polymorphisms were in LD with some polymorphisms genotyped in our panel [17]. To study LD extent variation, we estimated LD extent by adjusting Hill and Weir’s model [37] using non-linear regression (nls function in R-package nlme) against both physical and genetic position within each chromosome. Since the recombination rate (cM / Mbp) varied strongly along the genome (Figure 2 and Additional file 2: Figure S2), we defined high (>0.5 cM / Mbp) and low (<0.5 cM / Mbp) recombinogenic genomic regions within each chromosome. We adjusted Hill and Weir’s model [37] separately in low and high recombinogenic regions (Additional file 10: Table S5) by randomly sampling 100 sets of 500,000 pairs of loci less than 1 cM apart. This random sampling avoided over-representation of pairs of loci from low recombinogenic regions due to the sliding-window approach (Additional file 12: Figure S14). The 500,000 pairs of loci represented 0.36% (Chromosome 3 / High rec) to 1.20% (Chromosome 8 / High rec) of all pairs of loci.
For all analyses, we estimated LD extent by calculating the genetic and physical distance for the fitted curve of Hill and Weir’s Model that reached r2K=0.1, r2K=0.2 and r2K=0.4.
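To illustrate how an LD extent can be read off a fitted decay curve, the sketch below uses the commonly cited form of Hill and Weir's expectation, E[r2] = [(10 + ρ) / ((2 + ρ)(11 + ρ))] × [1 + (3 + ρ)(12 + 12ρ + ρ²) / (n(2 + ρ)(11 + ρ))], with ρ = 4Nc. Several points are assumptions made for illustration: the effective size N is taken as already fitted, n is set to the number of lines, the recombination fraction c is approximated by the distance in Morgans, and the threshold crossing is found by bisection rather than by the non-linear regression used in the study.

```python
def expected_r2(rho: float, n: int) -> float:
    """Hill and Weir expectation of r2 for rho = 4 * N * c and sample size n."""
    a = (10.0 + rho) / ((2.0 + rho) * (11.0 + rho))
    b = ((3.0 + rho) * (12.0 + 12.0 * rho + rho ** 2)
         / (n * (2.0 + rho) * (11.0 + rho)))
    return a * (1.0 + b)


def ld_extent_cm(N: float, n: int, threshold: float,
                 rho_max: float = 1e7) -> float:
    """Genetic distance (cM) at which the expected r2 drops to `threshold`.

    Assumption: c (recombination fraction) ~ distance in Morgans, which is
    reasonable for the short distances considered here.
    """
    if not expected_r2(rho_max, n) < threshold < expected_r2(0.0, n):
        raise ValueError("threshold is outside the range reachable by the model")
    lo, hi = 0.0, rho_max
    for _ in range(200):  # bisection on rho
        mid = 0.5 * (lo + hi)
        if expected_r2(mid, n) > threshold:
            lo = mid
        else:
            hi = mid
    rho = 0.5 * (lo + hi)
    return 100.0 * rho / (4.0 * N)


# Hypothetical effective size N = 100 with n = 247 lines.
for t in (0.1, 0.2, 0.4):
    print(t, ld_extent_cm(N=100, n=247, threshold=t))
```

A more stringent threshold gives a shorter extent, and a larger effective size shrinks all three extents proportionally.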
Genome coverage estimation
In order to estimate the genomic regions in which the effect of an underlying causal polymorphism could be captured by GWAS using LD with SNPs from the three technologies, we developed an approach to define LD windows around each SNP with MAF ≥ 5% based on LD extent (Additional file 12: Figure S14). To set the LD window around each SNP, we used the LD extent with r2K=0.1 (negligible LD), r2K=0.2 (intermediate LD) and r2K=0.4 (high LD) estimated in low and high recombinogenic regions for each chromosome. We used the global LD decay estimated for these large chromosomal regions rather than the local LD extent (i) to avoid bias due to SNP sampling within small genomic regions, (ii) to reduce computational time, and (iii) to limit the impact of possible local errors in genome assembly. In low recombinogenic regions, we used the physical LD extent, hypothesizing that the recombination rate is constant along physical distance in these regions. In high recombinogenic regions, we used the genetic LD extent since there is a strong variation of recombination rate per base pair along the physical position (Additional file 2: Figure S2). We then converted genetic LD windows into physical windows by projecting the genetic positions on the physical map using the smooth.spline function implemented in R calibrated on the NAM dent consensus map [40]. Reciprocally, we obtained the genetic positions of LD windows in low recombinogenic regions by projecting the physical boundaries of LD windows on the genetic map.
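The projection between maps can be illustrated with a small sketch. Note the substitution: the study calibrated a smoothing spline (smooth.spline in R), whereas the code below uses plain piecewise-linear interpolation on a toy, strictly increasing map, simply to show how a genetic window around a SNP becomes a physical window. All names and map values are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple


def interpolate(x: float, xs: List[float], ys: List[float]) -> float:
    """Piecewise-linear interpolation; xs must be strictly increasing."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    k = bisect_right(xs, x)
    x0, x1, y0, y1 = xs[k - 1], xs[k], ys[k - 1], ys[k]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def physical_window(snp_mbp: float, extent_cm: float,
                    phys: List[float], gen: List[float]) -> Tuple[float, float]:
    """Physical LD window for a SNP, given a genetic LD extent in cM."""
    snp_cm = interpolate(snp_mbp, phys, gen)       # physical -> genetic
    left = interpolate(snp_cm - extent_cm, gen, phys)   # genetic -> physical
    right = interpolate(snp_cm + extent_cm, gen, phys)
    return left, right


# Toy consensus map: recombination is higher between 10 and 20 Mbp.
phys_mbp = [0.0, 10.0, 20.0, 30.0]
gen_cm = [0.0, 1.0, 5.0, 6.0]
print(physical_window(15.0, 1.0, phys_mbp, gen_cm))
```

In this toy map the same 1 cM extent spans a much wider physical window where recombination is low, which is the reason for converting windows between the two maps.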
To estimate coverage of the three technologies to detect QTLs based on their SNP distribution and density, we calculated cumulative genetic and physical length that are covered by LD windows around the markers, considering different LD extents for each chromosome (r2K=0.1, r2K=0.2, r2K=0.4). In order to explore variation of genome coverage along the chromosome, we estimated the proportion of genome covered using a sliding-windows approach based on physical distance (2Mbp).
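A minimal sketch of the coverage calculation, assuming each marker contributes one (start, end) LD window on a chromosome: overlapping windows are merged so that shared segments are counted once, and the covered length is divided by the length of the region of interest (a whole chromosome or one 2 Mbp window). Function names and coordinates are hypothetical.

```python
from typing import List, Tuple

Window = Tuple[float, float]


def merge_windows(windows: List[Window]) -> List[Window]:
    """Merge overlapping LD windows so that shared segments count once."""
    merged: List[List[float]] = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def coverage(windows: List[Window], region_start: float,
             region_end: float) -> float:
    """Proportion of [region_start, region_end] covered by LD windows."""
    covered = 0.0
    for start, end in merge_windows(windows):
        start, end = max(start, region_start), min(end, region_end)
        if end > start:
            covered += end - start
    return covered / (region_end - region_start)


# Toy LD windows (Mbp) around three markers on a 100 Mbp chromosome.
ld_windows = [(5.0, 15.0), (0.0, 10.0), (30.0, 40.0)]
print(coverage(ld_windows, 0.0, 100.0))  # whole chromosome
print(coverage(ld_windows, 0.0, 20.0))   # one sliding window
```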
Statistical Models for Association Mapping
We compared four statistical models to determine which one best controls the confounding factors (i.e. population structure and relatedness) in GWAS (Additional file 5: Supplementary Texts 3 and 4). We tested different software implementing either approximate (EMMAX) [8] or exact computation of standard test statistics (ASReml and FaST-LMM) [6, 54] for differences in computational time and GWAS results (Additional file 5: Supplementary Text 5). Single-trait, single-environment GWAS was performed for each marker for each environment and all traits using FaST-LMM. We selected the mixed model using K_Chr, estimated from PANZEA markers of the 50K array, to perform GWAS on 66 situations (environment × trait) (Additional file 5: Supplementary Text 4, Additional file 12: Figure S15 and Additional file 12: Figure S16). We developed a GWAS pipeline in R v3.2.1 [55] calling FaST-LMM software and implementing the approaches of Rincent et al. [14] to conduct single-trait and single-environment association tests.
Multiple testing is a major challenge in GWAS using large numbers of markers. The experiment-wise error rate (αe) increases with the number of tests (number of markers) carried out, even when the point-wise error rate (αp) is kept low. Popular methods [56, 57] are overly conservative and can result in overlooking true positive associations. In addition, these corrections assume that the hypothesis tests are independent. To take into account the dependence of the tests in GWAS, αp has to be adjusted in order to keep αe at a nominal level. The corrections of Moskvina and Schmidt [58] and Gao et al. [59, 60] infer the number of independent tests and use the Bonferroni formula to rapidly adjust for multiple testing. Using the Gao approaches, we estimated the number of independent tests for GWAS at 15,780 for the 50K, 92,752 for the 600K, 109,117 for the GBS5 and 191,026 for the combined genetic data, leading to different -log10(p-value) thresholds: 5.49, 6.27, 6.34 and 6.58, respectively. Because of these differences, we used two thresholds of -log10(p-value) = 5 (less stringent) and 8 (highly conservative and slightly above Bonferroni) for comparing GWAS, so that differences in the identification of significant SNPs between the technologies would not be due to the choice of the threshold.
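The reported thresholds can be approximately recovered from the numbers of independent tests with the Bonferroni formula. The sketch below assumes an experiment-wise error rate of 0.05, which is not stated explicitly in the text but matches the reported values to within about 0.01. The function name is hypothetical.

```python
import math


def bonferroni_threshold(n_independent_tests: int, alpha_e: float = 0.05) -> float:
    """-log10 of the point-wise error rate alpha_e / n_independent_tests.

    Assumption: alpha_e = 0.05 (inferred, not stated in the text).
    """
    return -math.log10(alpha_e / n_independent_tests)


# Numbers of independent tests reported for each dataset.
independent_tests = {
    "50K": 15780,
    "600K": 92752,
    "GBS5": 109117,
    "combined": 191026,
}
for name, n_tests in independent_tests.items():
    print(name, round(bonferroni_threshold(n_tests), 2))
```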
Methods for grouping associated SNPs into QTLs
We used two approaches based on LD for grouping significant SNPs. The first approach (LD_win) used the LD windows previously described to group significant SNPs into QTLs, considering that all SNPs with overlapping LD windows of r2K=0.1 belong to the same QTL. We hypothesized that significant SNPs with overlapping LD windows at r2K=0.1 captured the same causal polymorphism and therefore formed a single QTL. The second approach (LD_adj) grouped into a single QTL the significant SNPs that are adjacent on the physical map, provided that their LD was above an LD threshold (r2K > 0.5). We used LD heatmaps for comparing the SNP grouping produced by the two approaches on the three different traits across all environments (Additional file 6: Figure S9-LD-Adjacent and Additional file 7: Figure S9-LD-Windows). All scripts are implemented in R software [55].
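The two grouping rules can be sketched as follows for the significant SNPs of one chromosome. This is an illustration under stated assumptions rather than the study's R scripts: windows that merely touch are counted as overlapping, LD_adj is assumed to compare each SNP only with its immediate physical neighbour, and all identifiers and values are hypothetical.

```python
from typing import Dict, List, Tuple


def group_ld_win(windows: Dict[str, Tuple[float, float]]) -> List[List[str]]:
    """LD_win: SNPs whose LD windows overlap (chained) form one QTL."""
    qtls: List[List[str]] = []
    current: List[str] = []
    current_end = float("-inf")
    for snp, (start, end) in sorted(windows.items(), key=lambda kv: kv[1]):
        if current and start <= current_end:
            current.append(snp)
            current_end = max(current_end, end)
        else:
            if current:
                qtls.append(current)
            current, current_end = [snp], end
    if current:
        qtls.append(current)
    return qtls


def group_ld_adj(snps: List[str], adjacent_ld: List[float],
                 threshold: float = 0.5) -> List[List[str]]:
    """LD_adj: physically adjacent SNPs are grouped while their LD > threshold.

    `snps` must be sorted by physical position; adjacent_ld[k] is the LD
    (e.g. r2K) between snps[k] and snps[k + 1].
    """
    if not snps:
        return []
    qtls = [[snps[0]]]
    for snp, ld in zip(snps[1:], adjacent_ld):
        if ld > threshold:
            qtls[-1].append(snp)
        else:
            qtls.append([snp])
    return qtls


# Toy significant SNPs with LD windows (Mbp) and adjacent LD values.
windows = {"snpA": (0.0, 10.0), "snpB": (8.0, 20.0), "snpC": (50.0, 60.0)}
print(group_ld_win(windows))
print(group_ld_adj(["snpA", "snpB", "snpC", "snpD"], [0.9, 0.2, 0.6]))
```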
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Availability of data and material
The following links to the data will be available upon publication of this paper.
All the genotyping data used in this study can be found at https://doi.org/10.15454/AEC4BN.
The GWAS results can be found at https://doi.org/10.15454/6TL2N4.
The phenotypic dataset can be found at https://doi.org/10.15454/IASSTN.
Competing interests
The authors declare that they have no competing interests.
Funding
This project (Project ID: 244374) was funded under the European FP7-KBBE (CP – IP – Large-scale integrating project, DROPS) and the Agence Nationale de la Recherche project ANR-10-BTBR-01 (ANR-PIA AMAIZING).
Authors’ contributions
S.S.N., S.D.N. and A.C. designed the study and wrote the article. S.S.N. performed genotyping data quality control, imputation and genetic analyses. S.D.N. developed and performed LD analyses. A.C. designed the association panel with the help of S.D.N. and C.W. C.B. participated in assembling the dent inbred lines panel, organizing the germplasms and field work for seed production. E.J.M., C.W. and F.T. collected and analysed the phenotypic data. V.C. and D.M. performed DNA extraction and prepared the samples. All authors critically reviewed and approved the final manuscript.
Authors’ information (optional)
Not applicable.
Additional file legends
Additional file 1 (.docx)
Figure S1: Different approaches used to compare the quality of genotyping and imputation of the GBS. We considered the direct reads from GBS (GBS1) and four approaches for imputation (GBS2 to GBS5). The GBS2 approach consisted of one imputation step from the direct reads by Cornell University, using TASSEL software, but missing data were still present. The GBS3 approach consisted of a genotype imputation of all the missing data of the direct reads by Beagle v3. In GBS4, genotype imputation by Beagle was performed on Cornell imputed data after replacing the heterozygous genotypes by missing data. GBS5 consisted of the homozygous genotypes of GBS2 completed by values imputed in GBS3.
Table S1: Percentage of GBS concordance based on the 50K and 600K arrays (Reference). Call rate of SNPs from GBS are in brackets. * After Beagle inference of missing data, the call rate is 100%. Here the call rate is <100% because the comparison was made against the 50K and the 600K arrays that include few missing data.
Additional file 2 (.pdf)
Figure S2: Variation of the markers density, the recombination rate and the genome coverage in non-overlapping 2 Mbp windows along each chromosome. The percentage SNP coverage (bottom) used the cumulated length of physical LD windows around each SNP. Markers have MAF ≥ 5%. Green, blue, red and black lines represent variation of GBS, 600K, 50K and combined technologies, respectively.
Additional file 3 (.docx)
Figure S3: Contribution of four ancestral populations to 247 inbred lines after ADMIXTURE analysis. Markers from the 50K array (top), 600K array (middle) and GBS (bottom) were used. One vertical bar corresponds to one individual. Lines were ordered according to contributions observed for the 50K array. From left to right, we have Stiff Stalk lines type B73 and B14a (red), Iodent lines type PH207 (green), Lancaster lines type Mo17 and Oh43 (turquoise), a group of lines assembling W117, F7057 type lines (blue).
Table S2: Means and ranges of the two relatedness estimators (IBS and IBD i.e. K_Freq) from the 50K (29,257 PANZEA SNPs only) and 600K arrays, and GBS.
Figure S4: Correlation (r) between the IBS and IBD (K_Freq) for each technology (A) and correlation of IBD between the three technologies (B). (C) Correlation of IBD between the three technologies after removing the excess of rare alleles in the GBS to have the same distribution of MAF as in the 50K and the 600K arrays. The red line is the bisector.
Figure S5: Principal coordinate analyses (PCoA) of the DROPS panel. The PCoA were based on the covariance matrix K_Freq estimated from the 50K Illumina array. The genetic groups identified by ADMIXTURE (NQ = 4) are colored (differently than in Fig. S6). Three key founders are indicated (Iodent: PH207 in red, Stiff Stalk: B73 in blueviolet, Lancaster: Mo17 in turquoise).
Additional file 4 (.docx)
Figure S6: Heatmap of genome-wide linkage disequilibrium (LD) between all markers within and between chromosomes using PANZEA SNPs from the 50K array. All SNPs were ordered according to their position on the genome. Dots represented LD between two loci and were colored according to their strength. Classical LD measurement r2 between loci were represented within triangle below the diagonal. Linkage disequilibrium corrected for structure (r2S, A), relatedness (r2K, B) or both (r2KS, C) were represented within triangle above the diagonal.
Figure S7: Linkage disequilibrium (r2, top) and LD corrected for relatedness (r2k, bottom) as a function of physical distance (left) and genetic distance (right): example of chromosome 1.
Figure S8: Variation of genetic LD extent (Dm, cM) and effective population size (N) along the physical map. A sliding window of 1 cM moving by 0.5 cM at each step was used. Local genetic LD extent (cM) and local effective size (N) were estimated by adjusting Hill and Weir’s model using r2K between all loci located in sliding windows of 1 cM. Each value was plotted on the physical map of each chromosome by projecting the genetic position of the windows on the physical map.
Additional file 5 (.docx)
Supplementary Text 1: Differences between microarrays.
Supplementary Text 2: GBS pipelines.
Supplementary Text 3: Statistical models for GWAS.
Supplementary Text 4: Effects of confounding factors on GWAS.
Supplementary Text 5: Performance of different software.
Additional file 6 (.pdf)
Figure S9-LD_Windows: QTL limits obtained by the LD_win approach projected on heatmaps representing the level of LD between associated SNPs for each trait (DTA: male flowering time, plantHT: plant height and GY: grain yield) and each chromosome. Upper and lower triangles on the heatmaps represented the r2 and r2K values between associated SNPs, respectively. Linkage disequilibrium between loci was colored according to values from weak LD (yellow) to high LD (red). The significant markers were ordered according to their physical positions on the chromosome and were represented by ticks on the four sides of the heatmaps. Limits of QTLs were displayed by gray dotted lines. QTL numbers were indicated in gray on the top and the right of each heatmap.
Additional file 7 (.pdf)
Figure S9-LD_Adjacent: QTL limits obtained by the LD_Adj approach projected on heatmaps representing the level of LD between associated SNPs for each trait (DTA: male flowering time, plantHT: plant height and GY: grain yield) for each chromosome. Upper and lower triangles on the heatmaps represented the r2 and r2K values between associated SNPs, respectively. Linkage disequilibrium between loci was colored according to values from weak LD (yellow) to high LD (red). The significant markers were ordered according to their physical positions on the chromosome and were represented by ticks on the four sides of the heatmaps. Limits of QTLs were displayed by gray dotted lines. QTL numbers were indicated in gray on the top and the right of each heatmap.
Additional file 8 (.pdf)
Table S3: Summary of all the QTLs identified for the male flowering time (DTA), plant height (plantHT) and grain yield (GY). “LowerLimit” and “UpperLimit” columns are the lower and upper physical limits for each QTL. The “Rec” column indicates whether the QTL is located in a high or low recombination region. “NbSNP50”, “LogPvaMax50”, “NbSNP600”, “LogPvaMax600”, “NbSNPGBS”, “LogPvaMaxGBS” are the number of significant SNPs and the most significant -log10(Pval) within the QTL for each technology across all environments. The physical position (“PosMax”), the proportion of the variance explained (“R2_LDMax”) and the effect (“EffectMax”) of the most significant SNP within the QTL are shown. “NbDiffEnv” gives the number of different situations in which the QTL was detected.
Additional file 9 (.docx)
Figure S10: Examples of comparison of QTL detection on chromosomes 1, 6 and 8 for the different traits. Local distribution of the -log10(p-value) and linkage disequilibrium (bottom) corrected by the kinship (r2k) of all SNPs with the strongest associated marker within the chosen QTL for the three technologies. Ticks on different x-axes show the marker density of the three technologies (red for the 50K, blue for the 600K and green for the GBS). The vertical red line marks the position of the SNP with the maximum -log10(p-value) within the QTL.
Additional file 10 (.docx)
Figure S11: Pleiotropy of QTLs between the traits. Number of QTLs specific and shared by the three traits across all environments. Note that several QTLs from one trait were sometimes included in a single QTL of another trait.
Figure S12: Percentage of stable QTLs across environments for the three traits (DTA: male flowering time, plantHT: Plant Height, GY: Grain Yield).
Table S4: Stability of QTL across environments. DTA: male flowering time, plantHT: plant height, GY: grain yield traits.
Table S5: Recombination rate and proportion of low and high recombination regions. Average recombination rate (“RecRate”) and proportion of the physical (“Phys”) and genetic (“Genetic”) map in low (“LowRec”, <0.5 cM / Mbp) and high (“HighRec”, >0.5 cM / Mbp) recombination regions for each chromosome. “Chr” indicates the chromosome. Physical and genetic size columns indicate the size of each chromosome in bp and cM, respectively.
Figure S13: Percentage of QTLs located in high (darkgrey) and low (lightgrey) recombinogenic regions. (a) male flowering time, (b) plant height and (c) grain yield.
Additional file 11 (.pdf)
Table S6: Description of inbred lines. Variety and accession along with the breeders, seeds providers and genetic groups obtained using ADMIXTURE for K=4 (Stiff Stalk, Iodent, Lancaster, Other).
Additional file 12 (.docx)
Table S7: Narrow sense heritability (h2) and variance components (Vg, genetic variance; Ve, residual variance). The heritability and variance components were estimated for all traits (grain yield, male flowering time and plant height) using the R package Heritability [1].
Figure S14: Linkage disequilibrium based approach to delineate a physical window around each SNP, exemplified with chromosome 3. Linkage disequilibrium (LD) windows were defined per chromosome for each SNP based on physical LD extent in low recombinogenic regions (left part) and based on genetic LD extent in high recombinogenic regions (right part). These LD windows were used (i) to group significant SNPs into QTLs when they overlapped, (ii) to estimate genome coverage to detect QTLs by GWAS considering regions not covered by LD windows, and (iii) to identify putative underlying genes involved in trait variations.
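Use (i) of the LD windows, grouping significant SNPs into QTLs when their windows overlap, amounts to merging overlapping intervals. The following is a minimal sketch under stated assumptions, not the authors' implementation: windows are given as (start, end) physical positions in bp on one chromosome, windows that merely touch are treated as overlapping, and the function name `merge_ld_windows` and the example coordinates are hypothetical.

```python
from typing import List, Tuple


def merge_ld_windows(windows: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge overlapping (start, end) LD windows into QTL intervals.

    Assumption: a window starting exactly where the previous one ends
    is considered to overlap it.
    """
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            # Overlap with the current QTL: extend its upper limit if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # No overlap: this window opens a new QTL.
            merged.append((start, end))
    return merged


# Hypothetical LD windows (bp) around four significant SNPs.
example_qtls = merge_ld_windows([(100, 250), (200, 400), (900, 1000), (380, 450)])
```

With these illustrative inputs the first, second and fourth windows chain into one interval, while the third remains a separate QTL.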
Figure S15: QQ-plots representing observed -log10(p-value) against expected -log10(p-value) under the null hypothesis (no association, black line). We tested association between 44,729 SNPs from the 50K array and the male flowering time trait in one environment (Gai12R) using different GWAS models, kinship estimators and programs. (A) Comparison between statistical models: M1 is the model without correction (green dots), M2 takes into account the group structure (blue dots), M3 takes into account kinship (IBD: K_freq) between individuals (purple dots) and M4 takes into account both group structure and kinship (red dots). (B) Comparison between mixed models using different estimates (IBS and IBD, K_freq) of kinship. (C) Comparison with and without the Rincent et al. 2014 approach (using K_Chr vs K_freq). (D) Comparison between different software tools (EMMAX, ASReml, FaST-LMM) that perform GWAS.
Figure S16: Correlations between the GWAS results from the GBS genetic data using a kinship estimated from the PANZEA 50K array (x-axis) and a kinship estimated from the GBS (y-axis). The horizontal and vertical lines are the threshold -log10(p-value) = 5. The correlations were done using the flowering time (DTA) and plant height (plantHT) traits and the two sites, two years and two treatments (Gai12R, Gai12W, Gai13R, Gai13W, Ner12R, Ner12W, Ner13R, Ner13W).
Acknowledgements
We are grateful to key partners from the field: Pierre Dubreuil, Cécile Richard, Jérémy Lopez (Biogemma), Tamás Spitkó (MTA ATK), Therese Welz (KWS), Franco Tanzi, Ferenc Racz, Vincent Schlegel (Syngenta) and Maria Angela Canè (UNIBO). We also acknowledge Björn Usadel and Axel Nagel (MPI) for data management. We thank Willem Kruijer, Fred Van Eeuwijk (WUR), Tristan Mary-Huard and Laurence Moreau (INRA) for helpful discussions and statistical advice. We are grateful to Chris-Carolin Schön (TUM) for providing an early access to the Affymetrix Axiom 600K array and Edward Buckler (USDA) for providing genotyping using GBS. We are also grateful to partners of the CornFed project, Univ. Hohenheim (Germany), CSIC (Spain), CRAG (Spain), MTA ATK (Hungary), NCRPIS (USA), CRB Maize (France) and CRA-MAC (Italy) who contributed to the genetic material.
Footnotes
E-mail addresses: snegro{at}uliege.be; emilie.millet{at}wur.nl; delphine.madur{at}inra.fr; cyril.bauland{at}inra.fr; valerie.combes{at}inra.fr; claude.welcker{at}inra.fr; francois.tardieu{at}inra.fr; alain.charcosset{at}inra.fr; stephane.nicolas{at}inra.fr
List of abbreviations
- DTA = Day to Anthesis
- GY = Grain Yield adjusted at 15% moisture
- plantHT = Plant Height
- GBS = Genotyping By Sequencing
- LD = Linkage disequilibrium
- GWAS = Genome-Wide Association Studies
- MAF = Minor Allele Frequency
- SNP = Single Nucleotide Polymorphism
- HRR = High Recombinogenic Regions
- LRR = Low Recombinogenic Regions
- QTL = Quantitative Trait Locus