Abstract
Crossvalidation is a method for estimating predictive performance and adjudicating between multiple models. The data are divided into k independent subsets. On each of the k folds of the process, k-1 of the subsets (the training set) are used to fit the parameters of each model, and the left-out subset (the test set) is used to estimate predictive performance. The method is statistically efficient because every subset serves for training on some folds and for testing on one fold, and the performance estimates are combined across folds. The method requires no assumptions, provides nearly unbiased (slightly conservative) estimates of predictive performance, and is generally applicable because it amounts to a direct empirical test of each model.
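The procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation: the candidate models are polynomials of different degree, predictive performance is measured as mean squared error on the left-out subset, the samples are assumed independent so that a random partition yields independent subsets, and the function name `k_fold_cv` and the toy data are hypothetical.

```python
import numpy as np


def k_fold_cv(x, y, degrees, k=5, seed=0):
    """Return the mean held-out squared error for each candidate polynomial degree."""
    rng = np.random.default_rng(seed)
    # Randomly partition the sample indices into k disjoint subsets (folds).
    folds = np.array_split(rng.permutation(len(x)), k)
    scores = {}
    for degree in degrees:
        fold_errors = []
        for i in range(k):
            test = folds[i]  # left-out subset
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            # Fit the model parameters on the k-1 training subsets only.
            coeffs = np.polyfit(x[train], y[train], degree)
            # Estimate predictive performance on the left-out subset.
            pred = np.polyval(coeffs, x[test])
            fold_errors.append(np.mean((y[test] - pred) ** 2))
        # Combine the performance estimates across folds.
        scores[degree] = float(np.mean(fold_errors))
    return scores


# Toy data (hypothetical): a noisy quadratic.
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 60)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.2, size=x.size)

scores = k_fold_cv(x, y, degrees=[0, 1, 2, 5, 9])
best_degree = min(scores, key=scores.get)  # model with the lowest held-out error
print(scores, best_degree)
```

Because each model is scored only on data that played no part in fitting its parameters, a more flexible model gains nothing from fitting the noise in the training set, which is what makes the comparison between models fair.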
GLOSSARY
- Generalization performance: the quality of the predictions about new data afforded by a model fitted with a given data set.
- Overfitting: the inevitable effect of measurement error on the estimates of parameters obtained by fitting a model to a given data set.
- Independence (statistical independence): the absence of any relationship, linear or nonlinear, deterministic or stochastic, between two variables. Independence implies that learning either variable does not change our belief (expressed as a probability distribution) about the other variable.
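The definition of independence can also be stated compactly in terms of probability distributions. In the sketch below, the symbols x and y for the two variables are notation introduced here for illustration, and the conditional forms apply wherever the conditioning value has nonzero probability:

```latex
p(x, y) = p(x)\,p(y)
\quad\Longleftrightarrow\quad
p(x \mid y) = p(x)
\;\text{ and }\;
p(y \mid x) = p(y)
```

The left-hand side says that the joint distribution factorizes; the right-hand side expresses the same fact as the statement that learning one variable leaves our belief about the other unchanged.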
Copyright
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.