An Evaluation of Machine-learning for Predicting Phenotype: studies in yeast and wheat

Nastasiya F. Grinberg; Ross D. King

doi:10.1101/105528

Abstract

In phenotype prediction, the physical characteristics of an organism are predicted from knowledge of its genotype and environment. Such studies are of great societal importance, as they are now central to medicine, crop breeding, and other fields. We investigated two phenotype prediction problems: one simple and clean (yeast), the other complex and real-world (wheat). We compared standard machine learning methods (forward stepwise regression, ridge regression, lasso regression, random forest, gradient boosting machines (GBM), and support vector machines (SVM)) with two state-of-the-art classical statistical genetics methods (including genomic BLUP). Additionally, using the yeast data, we investigated how performance varied with the complexity of the biological mechanism, the amount of observational noise, the number of examples, the amount of missing data, population structure, genotype drift, and the use of different data representations. We found that, for almost all phenotypes considered, standard machine learning methods outperformed the two methods from classical statistical genetics. On the yeast problem, the most successful method was GBM, followed by lasso regression, and then by the two statistical genetics methods and SVM; GBM was best for phenotypes with greater mechanistic complexity, whilst lasso was best in simpler cases. In the wheat study, the two best methods were SVM and BLUP. Random forest was the most robust method in the presence of noise, missing data, etc. Genomic BLUP performed well on problems with population structure, which suggests one way to improve standard machine learning methods when population structure is present. We conclude that the application of machine learning methods to phenotype prediction problems holds great promise.

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.