ABSTRACT
Estimation of allele dosage in autopolyploids is challenging, and current methods often result in the misclassification of genotypes. Here we propose and compare the use of next generation sequencing read depth as a continuous parameterization for autotetraploid genomic prediction of breeding values, using blueberry (Vaccinium corymbosum) as a model. Additionally, we investigated the influence of different sources of information used to build relationship matrices on phenotype prediction: no relationship, pedigree, and genomic information, considering either diploid or tetraploid parameterizations. A real breeding population composed of 1,847 individuals was phenotyped for eight yield and fruit quality traits over two years. Analyses were based on extensive pedigree (since 1908) and high-density marker data (86K markers). Our results show that marker-based matrices can yield significantly better prediction than pedigree for most of the traits, based on model fitting and expected genetic gain. Models based on continuous genotypes performed as well as the current best models and presented a significantly better goodness-of-fit for all traits analyzed. This approach also reduces the computational time required for marker calling and avoids problems associated with misclassification of genotypic classes when assigning dosage in polyploid species. Accuracies are encouraging for the application of genomic selection (GS) in blueberry breeding. Conservatively, GS could reduce the time for cultivar release by three years. GS could increase the genetic gain per cycle by 86% on average when compared to phenotypic selection, and by 32% when compared with pedigree-based selection.
INTRODUCTION
Polyploidy events are not an exception in plants, as about 70% of Angiosperms and 95% of Pteridophytes underwent at least one polyploidization event (Soltis and Soltis 1999). Polyploids are normally grouped into two categories, autopolyploids and allopolyploids, but intermediate forms are also possible, such as segmental allopolyploids (Spoelhof et al. 2017). Thresholds for polyploid classification have been controversial, but following the general taxonomic definition, autopolyploids arise from within-species whole genome duplication, and allopolyploids arise from whole genome duplication prior to or after an inter-specific hybridization event (Soltis et al. 2007).
Because speciation via ploidy increase can generate new phenotypic variability, this phenomenon is considered a powerful evolutionary source (Hieter and Griffiths 1999; Soltis et al. 2016). Despite the important role of polyploidization in plant evolution, its effects on inheritance of many agronomic traits and population genetics are still poorly understood when compared with diploid species (Dufresne et al. 2014). This especially holds true for autopolyploids. The complex nature of autopolyploid genetics is due to the presence of genotypes with higher allele dosage than diploids, larger number of genotypic classes, possibility of multivalent pairing, and poor knowledge of chromosome behavior during meiosis (Slater et al. 2013; Dufresne et al. 2014; Mollinari et al. 2015).
The advent of high-throughput genotyping methods, associated with the development of genetic and statistical analysis tools, has generated significant genetic gains for diploid species (Desta and Ortiz 2014). However, the application of genomic information to polyploid crops remains a challenge (Comai et al. 2005; Grandke et al. 2016). Although methods for the analysis and interpretation of genetic data in polyploids have recently been described (see review in Bourke et al. 2018), much development is needed, especially for new breeding approaches, such as genomic selection.
Genomic selection (GS) is a method to increase the efficiency of, and accelerate, the selection process in breeding programs. GS is used to capture the simultaneous effects of molecular markers distributed across the genome, based on the premise that linkage disequilibrium between causal polymorphisms and markers allows the prediction of phenotypes from genotypic values (Meuwissen et al. 2001; Zhang et al. 2011; de los Campos et al. 2013). The first GS studies addressing polyploids considered diploid genetic models to circumvent the complexity involved in accurately defining allelic dosage (i.e., the number of copies of each allele at a given polymorphic locus). Promising results have been reported for polyploids (e.g., Gouy et al. 2013; Annicchiarico et al. 2015; Ashraf et al. 2016); however, simplified assumptions were mostly used for genetic and statistical inferences (Garcia et al. 2013). Only a few studies have added factors accounting for polyploid effects (e.g., Slater et al. 2016; Sverrisdottir et al. 2017). Thus, more appropriate methods for GS in polyploids could be evaluated, possibly improving trait prediction.
Polyploidy can affect phenotypes through allelic dosage (additive effect of multiple copies of the same alleles), or by creating more complex interactions between loci or alleles, such as dominance or epistasis (Osborn et al. 2003). Thus, the inclusion of allelic dosage information may improve GS results (e.g., better fit, increased accuracy) by creating a more realistic representation of the effects of each genotypic class. Although there is evidence of dosage effects in the expression of economically important traits (Guo et al. 1996; Birchler et al. 2001; Adams et al. 2003; Osborn et al. 2003), few studies linking dosage effects to phenotype prediction have been reported in autopolyploid species (e.g., Slater et al. 2016; Sverrisdottir et al. 2017; Nyine et al. 2018; Endelman et al. 2018). Genotype classification remains one of the major challenges for polyploids. Studies evaluating genotype calling for autopolyploids with next generation sequencing (NGS) data showed that none of the existing methods performs properly (Grandke et al. 2016), unless high sequencing coverage (60-80x) is used (Uitdewilligen et al. 2013).
Here we evaluate a novel approach to GS in the context of autopolyploids, using Vaccinium corymbosum (southern highbush blueberry, SHB) as a model. The cultivated SHB is an autotetraploid, presenting 2n = 4X = 48 chromosomes (Lyrene et al. 2002). Inbreeding depression is strong in SHB, and population improvement has been achieved by long-term recurrent phenotypic selection, with a long testing phase and slow genetic gain per generation (Lyrene 2008). Our goal was to investigate and compare the influence on phenotype prediction of relationship matrices that consider different ploidy information, using novel genotyping approaches based on next-generation sequencing.
MATERIAL AND METHODS
Population and phenotyping
The population used in this study encompasses one cycle of the University of Florida blueberry breeding program’s recurrent selection, comprising 1,847 SHB individuals. This population originated from 124 biparental controlled crosses among 146 parents that presented superior phenotypic performance (cultivars and advanced-stage breeding material). Phenotypic data for eight yield and fruit quality-related traits were collected during two production seasons (2014 and 2015), when the plants were 2.5 and 3.5 years of age. Yield (rated using a 1-5 scale), weight (g), firmness (g mm−1 of compression force), scar diameter (mm), fruit diameter (mm), flower bud density (reported as buds per 20 cm of shoot), soluble solids content (°Brix), and pH were evaluated. The last three traits were phenotyped in only one year: soluble solids content and pH in 2014, and flower buds in 2015.
Five berries (fully mature and of picking quality) were randomly sampled for the measurement of fruit traits for each individual. Fruit weight was measured using an analytical scale (CP2202S, Sartorius Corp., Bohemia, NY). The FirmTech II firmness tester (BioWorks Inc., Wamego, KS) was used to measure fruit diameter and firmness. The scar diameter was obtained by image analysis of the fruits using FIJI software (Schindelin et al. 2012). The number of flower buds was counted in the top 20 cm of the main upright shoot. A digital pocket refractometer (Atago, U.S.A., Inc., Bellevue, WA) was used to obtain soluble solids measures from 300 μl of berry juice. The pH was measured using a glass pH electrode (Mettler-Toledo, Inc., Schwerzenbach, Switzerland). More details are provided by Amadeu et al. (2016), Cellon et al. (2018), and Ferrão et al. (2018).
Genotyping
Genomic DNA was extracted and genotyped using sequence capture by Rapid Genomics (Gainesville, FL, USA). Polymorphisms were genotyped in genomic regions captured by 31,063 120-mer biotinylated probes, designed based on the 2013 blueberry draft genome sequence (Bian et al. 2014; Gupta et al. 2015). Sequencing was performed on the Illumina HiSeq2000 platform using 100-cycle paired-end sequencing. After trimming (quality score of 20), demultiplexing, and removing barcodes, reads were aligned to the draft genome using Mosaik v.2.2.3 (Lee et al. 2014). Genotypes were called using FreeBayes v.1.0.1 (Garrison and Marth 2012) considering the diploid and tetraploid options. Single-nucleotide polymorphisms (SNPs) were filtered considering i) minimum sequencing depth of 40 (average depth for the population); ii) minimum SNP quality score (QUAL) of 10; iii) only biallelic markers; iv) maximum population missing data of 0.5; and v) minor population allele frequency of 0.05. After filtering, a total of 85,973 SNPs were used in the GS analysis. Further information regarding population composition and the genotyping approach is given in Ferrão et al. (2018). The genotypes for the diploid calling were coded as 0 (AA), 1 (AB), or 2 (BB). For the tetraploid parameterization they were coded as 0 (AAAA), 1 (AAAB), 2 (AABB), 3 (ABBB), and 4 (BBBB). A third parameterization (assumption-free method) was used, which considered the allele ratio #A/(#A + #a), where #A is the allele count (sequencing depth) of the alternative allele and #a is the allele count of the reference allele. No dosage calling was performed in this model (File S1); these data varied continuously between 0 and 1.
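The continuous parameterization can be illustrated with a minimal Python sketch of the ratio #A/(#A + #a). The function name, the handling of sites without reads, and the read counts are hypothetical and chosen only for illustration; they are not taken from the study's pipeline.

```python
from typing import Optional


def allele_ratio(alt_reads: int, ref_reads: int) -> Optional[float]:
    """Return #A / (#A + #a); None marks a site with no reads (treated as missing)."""
    total = alt_reads + ref_reads
    if total == 0:
        return None
    return alt_reads / total


# Hypothetical (alternative, reference) read counts for one SNP in three plants.
counts = [(0, 52), (14, 41), (60, 0)]
ratios = [allele_ratio(alt, ref) for alt, ref in counts]
print(ratios)
```

The ratio is used as is, without rounding to a dosage class, which is what distinguishes this parameterization from the diploid and tetraploid calls.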
Population genetics analysis
To compare the information captured by each genomic-based relationship matrix, we performed linkage disequilibrium (LD) and principal components (PC) analyses. Pairwise LD among SNPs within scaffolds was estimated as the squared Pearson correlation (r2), considering the draft reference genome (Bian et al. 2014; Gupta et al. 2015). One SNP was randomly sampled per probe interval, and a total of 22,914 SNPs were used in the analysis. LD was obtained for all marker-based scenarios: i) diploid (G2); ii) tetraploid (G4); and iii) ratio (i.e., continuous genotypes; Gr). The LD decay over physical distance was determined as the mean distance at the LD threshold of r2 = 0.2. To compare the LD among scenarios, the mean distances (Kb) and their confidence intervals at r2 = 0.2 were compared. The diversity captured by each relationship matrix was compared by PC analysis using the R package adegenet v. 1.3-1 (Jombart and Ahmed 2011).
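As an illustration of the pairwise LD statistic, the following Python sketch computes the squared Pearson correlation between two marker vectors. It works on any numeric coding (0-2, 0-4, or ratios between 0 and 1). The function name and the example vectors are hypothetical, and returning NaN for monomorphic markers is an assumption of this sketch.

```python
from math import fsum


def ld_r2(x, y):
    """Squared Pearson correlation between two equally long marker vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("marker vectors must have the same length (>= 2)")
    mean_x, mean_y = fsum(x) / n, fsum(y) / n
    sxy = fsum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = fsum((a - mean_x) ** 2 for a in x)
    syy = fsum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # a monomorphic marker carries no LD information
    return (sxy * sxy) / (sxx * syy)


# Hypothetical tetraploid dosages for two SNPs scored on five individuals.
snp_a = [0, 1, 2, 3, 4]
snp_b = [0, 1, 2, 4, 4]
print(ld_r2(snp_a, snp_b))
```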
To compare the information present in the marker matrices, we also evaluated the observed heterozygosity in the population. For this, we obtained the ratio between the number of heterozygous genotypes and the total number of individuals. To estimate the heterozygosity for the continuous genotypes, empirical limits were established based on the means and standard deviations observed for the homozygote classes of the tetraploid parameterization.
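A minimal Python sketch of the heterozygosity calculation follows. For dosage calls, any value strictly between 0 and the ploidy counts as heterozygous; for continuous ratios, values strictly between two empirical limits count as heterozygous. The function names and example data are hypothetical, the limits are passed in as parameters rather than derived here, and excluding missing calls from the denominator is an assumption of this sketch.

```python
def observed_heterozygosity(dosages, ploidy):
    """Proportion of called genotypes with 0 < dosage < ploidy."""
    called = [d for d in dosages if d is not None]
    if not called:
        raise ValueError("no called genotypes")
    return sum(1 for d in called if 0 < d < ploidy) / len(called)


def ratio_heterozygosity(ratios, lower, upper):
    """Proportion of ratios strictly inside the (lower, upper) homozygote limits."""
    called = [r for r in ratios if r is not None]
    if not called:
        raise ValueError("no called genotypes")
    return sum(1 for r in called if lower < r < upper) / len(called)


# Hypothetical calls for one SNP scored on five individuals.
print(observed_heterozygosity([0, 1, 2, 4, 3], ploidy=4))
print(ratio_heterozygosity([0.01, 0.30, 0.05, 0.95, 0.50], lower=0.06, upper=0.90))
```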
Models
One-step single-trait Bayesian linear mixed models were used to predict breeding values for each individual in the population, as follows:
$$y = \mu + Xb + Z_1 c + Z_2 r + Z_3 a + Z_4 (b \times a) + e \quad (1)$$
where $y$ is a vector of the phenotypic values of the trait being analyzed, $\mu$ is the population’s overall mean, $b$ is the fixed effect of year, $c$ is the random effect of the ith column position in the field, $c \sim N(0, I\sigma^2_c)$, $r$ is the random effect of the ith row position in the field, $r \sim N(0, I\sigma^2_r)$, and $a$ is the random effect of genotype, $a \sim N(0, G_a\sigma^2_a)$, where $G_a$ was replaced by the different additive relationship matrices as described in the next section. The $b \times a$ term is the random effect of the year by genotype interaction, $b \times a \sim N(0, I\sigma^2_{b \times a})$, and $e$ is the random residual effect, $e \sim N(0, I\sigma^2_e)$. Row and column effects were considered nested within year only for the traits evaluated in two years. For traits measured a single year, the same equation (1) was used without the year and the year by genotype interactions. The variance components for each random variable were: additive ($\sigma^2_a$), column ($\sigma^2_c$), row ($\sigma^2_r$), year-by-genotype interaction ($\sigma^2_{b \times a}$), and residual ($\sigma^2_e$). $X$, $Z_1$, $Z_2$, $Z_3$, and $Z_4$ were incidence matrices for year, column, row, genotype, and year by genotype interaction, respectively. The narrow-sense heritabilities were estimated considering the ratio between the additive variance component and the total phenotypic variance (sum of all variance components).
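The heritability definition above (additive variance divided by the sum of all variance components) can be sketched in Python as follows. The function name and the variance values are hypothetical and serve only to show the arithmetic.

```python
def narrow_sense_heritability(additive_var, other_vars):
    """h2 = additive variance / (additive variance + all other variance components)."""
    total = additive_var + sum(other_vars.values())
    if total <= 0:
        raise ValueError("total phenotypic variance must be positive")
    return additive_var / total


# Hypothetical posterior means of the variance components for one trait.
components = {
    "column": 0.5,
    "row": 0.5,
    "year_by_genotype": 1.0,
    "residual": 4.0,
}
h2 = narrow_sense_heritability(2.0, components)
print(h2)
```

For traits measured in a single year, the year-by-genotype entry would simply be left out of the dictionary.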
Relationship matrices
To quantify the effect of the genetic information used to build the relationship matrices on the predictive ability (PA), we performed analyses considering different approaches to modeling the genotypic values in autotetraploid species (Table 1, File S1). The factors tested were: i) the source of information used to build the relationship matrix (pedigree, genomic, or no relationship information); and ii) ploidy information (diploid, tetraploid, and assumption-free method).
The methods chosen to obtain the relationship matrices are shown in Table 1. The pedigree-based relationship matrices (A) were built considering a diploid model (Henderson 1976) and an autotetraploid model without double-reduction (Kerr et al. 2012). The marker-based relationship matrices (G) were based on the incidence matrices of marker effects (X) according to VanRaden (2008), as adapted by Ashraf et al. (2016). Different assumptions can be made regarding the marker allele dosage in autotetraploids (Table 2). We built the X matrices under three assumptions regarding the additive marker allele dosage effect: i) a pseudo-diploid model, where all the heterozygous genotypes are assumed to form one class, corresponding to a unique effect (data coded as 0, 1, and 2); ii) an additive autotetraploid model, where each genotype has a specific value and a cumulative additive effect is assumed (data coded as 0, 1, 2, 3, and 4); and iii) an assumption-free method based on the ratio of read counts for the alternative and reference alleles (continuous parameterization, taking values between 0 and 1), where a cumulative additive effect is also assumed. For the construction of the relationship matrices based on marker data, missing genotypes were replaced by the marker mean. The R package AGHmatrix v. 0.0.3003 (Amadeu et al. 2016) was used to obtain all relationship matrices.
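To make the construction of a marker-based additive relationship matrix concrete, the Python sketch below mean-imputes missing genotypes, centers each marker, and scales the cross-product by ploidy times the sum of p(1 - p). This is a ploidy-scaled generalization of VanRaden's first method, written here as an assumption; it is not the AGHmatrix code, and the exact scaling used by that package (particularly for ratio data) may differ. All function names and data are hypothetical.

```python
def mean_impute(markers):
    """Replace None entries with the mean of the observed values of that marker."""
    n_markers = len(markers[0])
    means = []
    for j in range(n_markers):
        observed = [row[j] for row in markers if row[j] is not None]
        if not observed:
            raise ValueError(f"marker {j} has no observed genotypes")
        means.append(sum(observed) / len(observed))
    return [[means[j] if row[j] is None else float(row[j])
             for j in range(n_markers)] for row in markers]


def additive_g_matrix(markers, ploidy):
    """G = ZZ' / (ploidy * sum_j p_j (1 - p_j)), with Z the column-centered dosages."""
    x = mean_impute(markers)
    n, m = len(x), len(x[0])
    col_means = [sum(row[j] for row in x) / n for j in range(m)]
    freqs = [mu / ploidy for mu in col_means]  # alternative-allele frequencies
    scale = ploidy * sum(p * (1.0 - p) for p in freqs)
    if scale <= 0:
        raise ValueError("all markers are monomorphic")
    z = [[row[j] - col_means[j] for j in range(m)] for row in x]
    return [[sum(z[a][k] * z[b][k] for k in range(m)) / scale
             for b in range(n)] for a in range(n)]


# Hypothetical tetraploid dosages (rows: individuals, columns: SNPs); None = missing.
dosages = [[0, 4, 2], [2, None, 1], [4, 0, 3]]
g4 = additive_g_matrix(dosages, ploidy=4)
for row in g4:
    print([round(v, 3) for v in row])
```

The same function applied to data coded 0, 1, and 2 with `ploidy=2` corresponds to the pseudo-diploid case.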
Model implementation
The six models described above (Table 1) were fitted using the R (R Core Team 2018) package BGLR v. 1.0.5 (de los Campos and Pérez-Rodríguez 2016). The predictions were based on 30,000 iterations of the Gibbs sampler, of which the first 5,000 were discarded as burn-in, with a thinning interval of five. The number of iterations, burn-in, and thinning interval were evaluated to define the final values used in the analysis (Figure S1). A single-step regression approach was applied to perform all phenotypic BLUP (I matrix), pedigree-BLUP (P-BLUP), and genomic-BLUP (G-BLUP) analyses. The default hyper-parameters have been described previously (Perez and de los Campos 2014).
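As a quick check of how many posterior samples these sampler settings leave for inference, the Python sketch below counts the iterations kept after burn-in and thinning. The function name is hypothetical, and the exact indexing convention of any particular sampler may differ by one sample.

```python
def retained_samples(n_iter, burn_in, thin):
    """Number of iterations kept after discarding burn-in and keeping every thin-th draw."""
    if burn_in >= n_iter or thin < 1:
        raise ValueError("need burn_in < n_iter and thin >= 1")
    return len(range(burn_in, n_iter, thin))


# Settings reported in the text: 30,000 iterations, 5,000 burn-in, thinning of five.
print(retained_samples(30_000, 5_000, 5))
```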
Validation and model comparison
For each trait, models were compared based on their PA, stability (mean square errors), goodness-of-fit, and expected genetic gain. A 10-fold cross-validation scheme was applied to compute model PA. Because each validation group might have a different mean (Resende et al. 2012b), the phenotypic PA was obtained as the Pearson correlation coefficient between the empirical best linear unbiased estimates (eBLUEs), obtained by considering all the variables in equation 1 as fixed (i.e., least-squares means; LSMeans), and the cross-validated breeding values (BV) predicted by the models for each validation fold. The goodness-of-fit of the different models was evaluated using the posterior mean of the log likelihood obtained from the full data set. The model with the highest value for this parameter defined the best fit for the data. For the expected genetic gain estimation we used the following formula: ΔG=(PA · σa · i)/L, where PA is the phenotypic predictive ability, σa is the square root of the additive genetic variance in the population, i is the selection intensity, and L is the breeding cycle length. The selection intensity (i) was considered constant for all methods.
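The expected genetic gain formula can be written as a short Python function. The sketch below uses hypothetical values for PA, σa, and i, and the two cycle lengths are illustrative inputs; it only shows how a shorter cycle changes ΔG when everything else is held constant.

```python
def expected_genetic_gain(pa, sigma_a, intensity, cycle_years):
    """Delta G = (PA * sigma_a * i) / L, expressed per year of breeding cycle."""
    if cycle_years <= 0:
        raise ValueError("cycle length must be positive")
    return pa * sigma_a * intensity / cycle_years


# Hypothetical inputs: same PA, sigma_a, and i, with two illustrative cycle lengths.
gain_long = expected_genetic_gain(pa=0.40, sigma_a=1.0, intensity=1.0, cycle_years=12)
gain_short = expected_genetic_gain(pa=0.40, sigma_a=1.0, intensity=1.0, cycle_years=9)
print(gain_long, gain_short, gain_short / gain_long)
```

With all other terms fixed, shortening the cycle from 12 to 9 years raises ΔG by a factor of 12/9, so differences in PA and σa between methods act on top of that ratio.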
Phenotypic and genotypic data used for the diploid and tetraploid parameterizations are available from the Dryad Digital Repository (doi: 10.5061/dryad.kd4jq6h). Data for the continuous parameterization are available for review upon request and will be made available at the Dryad Digital Repository. The authors affirm that all data necessary for confirming the conclusions of the article are present within the article, figures, and tables.
RESULTS
Population genetics analysis
Linkage disequilibrium decayed below r2 = 0.2 at distances of 88.3 Kb, 92.6 Kb, and 98.2 Kb for the diploid, tetraploid and continuous models, respectively (Figure 1A-C). No significant difference was observed considering the confidence interval for the mean distance (Kb) at r2 = 0.2 among different ploidies and continuous genotyping scenarios (Figure S2).
Similarly, no major differences were found between parameterizations within methodology (i.e., pedigree-based or marker-based methods) in the PC analysis (Figure S3). The first two PC components of the marker-based (G) matrices were consistent across all matrices, explaining approximately 20% of the variation. For example, G2 matrix captured 20.60% of the variation, while G4 captured 21.71%, and Gr captured 23.36% (Figure S3 A-C). The PC results were consistent between pedigree methodologies as well. Approximately 38% of the variation was explained (i.e., 37.74% of the variability was explained for the A2 matrix and 37.86% was explained for the A4 matrix, Figure S3 D-E). The results obtained in the PC analysis did not justify a stratified sampling of cross-validation populations, since no evidence of sub-population structure was detected for any of the relationship matrices.
Considering the heterozygosity observed in each scenario, genotypes assumed to be homozygotes in the diploid parameterization were classified as one of the possible heterozygote classes in the tetraploid and in the assumption-free parameterizations (Figure 1D-F). As a result, the tetraploid parameterization presented 34.48% more heterozygotes than the diploid parameterization. Considering the empirical thresholds established to compare the proportion of “heterozygotes” in the continuous genotypes with the ploidy parameterizations, values equal to or below 0.058 and equal to or above 0.908 were considered “homozygote” classes (dashed lines, Figure 1F). With this, 61.59% of the genotypes were considered “heterozygotes”; thus, the continuous method would have presented 89.92% and 41.23% more heterozygotes than the diploid and the tetraploid parameterizations, respectively. Nevertheless, some misclassification of data into classes in the diploid and tetraploid parameterizations might have occurred (Figure 2A-B).
Variance estimates
The posterior means of the genetic parameters are summarized in Table 3. All the traits presented additive genetic variance significantly higher than zero. A wide range of variance was observed within a given parameter for the different methodologies, and most of the values were significantly different from each other (considering Tukey test results; Table 3, Table S1). Marker-based methodologies generated significantly smaller estimations for variance components when compared with pedigree-based estimations. Within marker-based methodologies, the assumption-free parameterization generated significantly smaller estimations. The effects of the difference in the estimation of variance components are reflected in the estimated heritabilities – smaller values were estimated for marker-based methodologies. The lowest heritability was obtained for soluble solids, flower buds, and pH. Considering all methods, narrow-sense heritability values varied between 0.152 and 0.574, for flower buds and fruit weight, respectively.
Effect of the genetic information to build the relationship matrices
The incorporation of relationship information in the analysis generated better PA results than the phenotypic-BLUP model without it. Overall, we observed that higher values for the phenotypic PA were obtained when marker-based relationship matrices were used, when compared with phenotypic and pedigree BLUP (I and A matrices, respectively). However, the marker-based and pedigree-based results were not always significantly different from each other (Figure 3, Table S1). The use of molecular data yielded phenotypic PA values ranging from 0.27 (pH) to 0.49 (fruit scar) in 2014, and from 0.15 (flower buds) to 0.51 (fruit firmness) in 2015. Lower PA values were obtained for traits with lower heritability and better results were observed for the second year of evaluation. The biggest increase in the PA values can be seen for fruit firmness – when we compared marker and pedigree results, we observed an average increase of 13.37% in 2014. Also, an increase in the PA values of 11% was observed for fruit diameter and yield in 2015 when markers were used instead of pedigree data.
The use of pedigree-based relationship matrices generated higher phenotypic PA values for all the traits, when compared with the assumption of unrelated individuals (i.e., identity matrix). Unlike the identity matrix, the use of pedigree-based matrix assumes that there is relationship (expected values) among individuals. The phenotypic PA obtained for the pedigree methods in 2014 yielded values from 0.20 (flower bud) to 0.49 (fruit firmness). As with marker-based methods, smaller values were observed for traits with lower heritability (i.e., pH, brix, and flower bud). For 2015, the PA results for the phenotypic-BLUP were 0.36, 0.38, and 0.42, for fruit weight, fruit scar, and fruit firmness, respectively. The PA values obtained for the same traits with pedigree-BLUP were 0.40, 0.45, and 0.49, respectively. No significant differences between the models’ stability were observed (Table S1).
Use of dosage information and continuous genotypes
Our results indicate that the importance of dosage in GS will vary depending on the trait being analyzed. For example, in 2014 fruit firmness, fruit scar, and fruit diameter showed modestly better phenotypic PA when the tetraploid and continuous parameterizations were applied, as opposed to the diploid parameterization (Figure 3, Table S1). Although no significant difference was observed between marker-based models, the use of relationship matrices derived from continuous genotype data (ploidy-free parameterization) performed as well as the best models (Figure 3, Table S1). However, the goodness-of-fit statistics show that the use of a relationship matrix obtained from the continuous genotype data significantly improved model fit for all traits (Table 3). This was followed by the tetraploid parameterization using marker-based data.
Expected genetic gain in a perennial fruit tree, blueberry
The results obtained for the expected genetic gain (EGG) are summarized in Table 3. GS offers the possibility to accelerate genetic improvement by decreasing the breeding cycle and selecting superior individuals earlier in the breeding program. Considering a breeding cycle (L) of 12 years (Cellon et al. 2018) we propose that routine genomic selection could be implemented in the second stage of the blueberry breeding program, which would allow the omission of a whole stage (stage III), and a three-year reduction for cultivar release (Figure 4).
Higher EGG was obtained for all traits when marker-based matrices (i.e., genomic selection) were applied (Table 3), which was mainly related to the reduction in cycle time. The implementation of GS in the second stage population would lead to an increase in the EGG varying from 27% (pH) to 119% (scar) when compared with the application of phenotypic BLUP. Considering the comparison of marker-based and pedigree-based models, an increase of 15% (pH) to 41% (fruit weight, fruit scar, and flower buds) in the EGG was observed (Table 3). In addition, the use of continuous data generated EGG values that were not significantly different from those of the best models for all traits (Table 3).
DISCUSSION
In this study, six linear mixed models were applied to predict breeding values for eight yield and fruit-quality traits measured in a real blueberry breeding population used as a model. Analyses were based on phenotypic, pedigree, and high-density marker data from 1,847 individuals. We compared the expected genetic gain, the stability, and the PA of models considering different sources of information to build the relationship matrices (only phenotype=BLUP, phenotypes + pedigree=P-BLUP, phenotypes + genomic=G-BLUP). We also explored models accounting for ploidy information and proposed the use of genotypic data that are independent of assumptions regarding ploidy level (continuous) to perform GS, avoiding the need for a priori parameterization for a given ploidy level.
Continuous data
Our research provides empirical evidence that continuous genotypic data from NGS can be effectively applied in GS models for autotetraploid species. This method was tested and compared with marker calling methodologies at the individual level in genome-wide association studies (Grandke et al. 2016). It was also tested in family pool data for GS (Ashraf et al. 2014; Guo et al. 2018), as well as used at the individual level in tetraploid potato for GS by Sverrisdottir et al. (2017). However, to our knowledge the comparison of continuous genotypes with ploidy parameterizations for genomic selection has not yet been reported. Here we empirically compare diploid, tetraploid, and continuous data at the individual level for the application of genomic selection in an autotetraploid species.
In polyploids, the assignment of genotypic classes based on NGS data is a major challenge, with a high risk of misclassification (Grandke et al. 2016; Bourke et al. 2018). The problem is further exacerbated as the ploidy increases: for a given ploidy level, n, the expected number of genotypic classes at a biallelic locus is n + 1 (five in a tetraploid). As a consequence, the signal distribution derived from each genotypic class increasingly approximates a continuous distribution where no clear separation is observed (Grandke et al. 2016). Despite extensive research to address these challenges (Serang et al. 2012), advances have been mostly limited to SNP arrays in tetraploid data (Carley et al. 2017). Studies that evaluated genotype calling with NGS data obtained from polyploids show that no method works properly, and that misclassification of genotypes can significantly interfere with the results of genetic studies (Grandke et al. 2016). This misclassification can be observed in our results when a diploid or tetraploid parameterization is applied to the genomic data (Figure 2A-B) with standard filtering parameters. The continuous genotyping approach provides a relevant alternative to overcome this issue that is independent of assumptions regarding ploidy level. Models that used continuous genotypic data performed as well as the best models, resulting in modestly better predictive abilities for some of the traits (i.e., fruit firmness, fruit scar, and fruit diameter; Table 3) and better data fit, which could indicate better prediction of future populations. The use of continuous genotypes also reduces the complexity and time of the analysis by eliminating genotype calling and parameterization for a given ploidy, because the ratio of reads assigned to each allele is used instead.
Finally, our results showed that the addition of noise associated with the continuous distribution in the genotypes significantly improved model fitting for all analyzed traits (Table 3), instead of increasing the complexity of the models. The benefits of continuous genotyping could easily be extended to more complex polyploids (higher ploidies), where genotype assignment is even more difficult, although higher sequencing depth would probably be required. Meanwhile, for more complex models, such as those that consider dominance effects, dosage calling is still necessary.
Relationship matrices
Our results also showed that including information based on the genetic merit of the individuals yielded better results when compared with the phenotypic-BLUP analysis (based on the identity matrix; Table 3), corroborating previous studies in the literature (e.g., Muir 2005; Resende et al. 2012a; Muñoz et al. 2014a). In addition, the use of marker-based methodologies generated better predictions than pedigree for most of the traits. Marker-based methods allow the capture of Mendelian segregation. This is especially important in our population, since it was composed of 124 full-sib families. In this context, pedigree-based methods have no power to distinguish variance within families. Another advantage is that marker-based methods allow the computation of genetic similarity among individuals not identified in the pedigree, and the correction of pedigree errors, which can affect parameter estimation and reduce the genetic gain (Muñoz et al. 2014b).
In our results, some differences between pedigree and marker-based methods were not significant, which could be an effect of the extensive pedigree data used, as well as of bias in pedigree-based estimations. Pedigree-based methods can overestimate the reliability of selection and, consequently, the accuracy (Bulmer 1971; Gorjanc et al. 2015). Furthermore, they are less efficient at capturing and estimating genetic relationships among individuals (Resende et al. 2017).
It is worth noting that we used extensive pedigree information that dates back to 1908 for our predictions, which may not be common in other autopolyploid breeding programs. This extensive information can have significant implications for the estimation of relationship coefficients (Amadeu et al. 2016) and, consequently, for breeding value predictions. For breeding programs with shallower pedigree information, the difference between the prediction accuracies of marker- and pedigree-based methodologies could be even larger than what was found in our study.
Allele dosage
The results obtained for both models that assumed more than three genotypic classes (G4 and Gr) demonstrate the importance of considering dosage in the prediction of breeding values. However, this will depend on the trait analyzed, as previously reported by Nyine et al. (2018) and Endelman et al. (2018). For example, modest improvement was verified in the PA for fruit firmness, fruit scar, and fruit diameter when this factor was considered in the models.
In addition, model fitting was significantly better for methods that accounted for dosage information (Figure 3, Table 3, Table S1). The inclusion of nonadditive effects into the models could also improve model accuracy. Endelman et al. (2018) demonstrated that the inclusion of digenic effects, as well as accounting for ploidy information, presented a higher accuracy over diploid models when using a SNP array.
Genomic selection for perennial autopolyploids
We also demonstrate the value of applying GS in a perennial fruit tree, blueberry. One cycle of blueberry breeding takes from 12 to 15 years until the release of a new cultivar (Lyrene 2008; Cellon et al. 2018). By applying selection based on high-density markers at early stages of the program, the time to cultivar release could decrease by three years (Figure 4), significantly improving the expected genetic gain per unit of time. More specifically, the use of GS would lead to an average increase of 86% in the EGG when compared with phenotypic BLUP, and an average increase of 32% over the application of pedigree-based models (Table 3). Implementing GS in this form could eliminate one stage in the breeding and selection process toward cultivar development, which would reduce costs associated with field trials and phenotyping. The implementation of GS would require extra financial outlay for genotyping and accurately phenotyping the training population. However, the savings on phenotyping and field trials of future generations (selection populations) could offset this outlay, making GS a cost-effective application. This financial analysis nevertheless needs to be performed for each crop on a case-by-case basis.
Funding
USDA - Agriculture and Food Research Initiative Grant no. 2014-67013-22418 to Patricio R. Munoz, James W. Olmstead and Jeffrey B. Endelman from the USDA National Institute of Food and Agriculture. Ivone de Bem Oliveira was funded by the CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior) [PSDE scholarship: 88881.131685/2016-01].
Acknowledgments
The authors thank the University of Florida blueberry breeding program for technical support, especially Dr. Paul M. Lyrene, David Norden, and Werner Collante. Special thanks to James Olmstead and Catherine Cellon, who coordinated the phenotyping and genotyping of the population as part of Catherine Cellon's MS degree.