Benchmarking algorithms for genomic prediction of complex traits

Christina B. Azodi; Andrew McCarren; Mark Roantree; Gustavo de los Campos; Shin-Han Shiu

doi:10.1101/614479

Abstract

The usefulness of genomic prediction (GP) in crop and livestock breeding programs has led to efforts to develop new and improved GP approaches, including non-linear algorithms such as artificial neural networks (ANNs; i.e. deep learning) and gradient tree boosting. However, the performance of these algorithms has not been compared systematically across a wide range of GP datasets and models. Using data for 18 traits across six plant species with different marker densities and training population sizes, we compared the performance of six linear and five non-linear algorithms, including ANNs. First, we found that hyperparameter selection was critical for all non-linear algorithms and that feature selection prior to model training was necessary for ANNs when markers greatly outnumbered training lines. Across all species and trait combinations, no single algorithm performed best; however, predictions based on a combination of results from multiple GP algorithms (i.e. ensemble predictions) performed consistently well. While linear and non-linear algorithms performed best for a similar number of traits, the performance of non-linear algorithms varied more between traits than that of linear algorithms. Although ANNs did not perform best for any trait, we identified strategies (i.e. feature selection, seeded starting weights) that boosted their performance to near the level of other algorithms. These results, together with the fact that even small improvements in GP performance could accumulate into large genetic gains over the course of a breeding program, highlight the importance of algorithm selection for the prediction of trait values.

Abbreviations

GP: genomic prediction
ANN: artificial neural network
rrBLUP: ridge regression best linear unbiased prediction
BA: Bayes A
BB: Bayes B
LASSO: least absolute shrinkage and selection operator
BL: Bayesian LASSO
SVR: support vector regression
lin: linear
rbf: radial basis function
poly: polynomial
RF: random forest
GTB: gradient tree boosting
p: number of markers
n: number of lines
MSE: mean squared error
ANOVA: analysis of variance
ReLU: rectified linear unit
EN: elastic net
MWU: Mann-Whitney U
DBH: diameter at breast height

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.