Abstract
Several recent papers have studied the double descent phenomenon: a classic U-shaped empirical risk curve when the number of parameters is smaller than or equal to the number of data points, followed by a decrease in empirical risk (referred to as the “second descent”) as the number of features is increased past the interpolation threshold (the minimum number of parameters needed to achieve zero training error). Like several of these recent papers, we concentrate here on the special case of over-parameterized linear regression, one of the simplest model classes that exhibits double descent, with the aim of better understanding the nature of the solution in the second descent and how it relates to solutions in the first descent. In this paper, we show that the final second-descent model (obtained using all features) is equivalent to the model estimated using principal component (PC) regression when all PCs of the training data are included. It follows that many properties of double descent can be understood through the relatively simple and well-characterized lens of PC regression. In particular, we identify a set of conditions that guarantee the final second-descent performance to be better than the best first-descent performance: this is the scenario in which PC regression using all features does not suffer from over-fitting and can be guaranteed to outperform any other first-descent model (any linear regression model using no more features than training data points). We also discuss how this work relates to transfer learning, semi-supervised learning, few-shot learning, as well as theoretical concepts in neuroscience.
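The stated equivalence between the final second-descent model and PC regression with all components retained can be checked numerically. The sketch below (not the authors' code, and omitting the centering step often used in PC regression for simplicity) compares the minimum-norm least-squares fit on all features against regression on all PCs of an over-parameterized synthetic problem:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50  # fewer data points than features: past the interpolation threshold
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Minimum-norm least-squares fit using all features
# (the "final second-descent model")
beta_mn = np.linalg.pinv(X) @ y

# PC regression keeping all PCs of the training data:
# project onto the right singular vectors, regress, map back.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt.T                   # scores on all PCs
gamma = np.linalg.pinv(Z) @ y  # regression in PC space
beta_pcr = Vt.T @ gamma        # back to original feature space

print(np.allclose(beta_mn, beta_pcr))  # True: the two solutions coincide
```

Both estimates lie in the row space of the training data, which is why retaining every PC recovers exactly the minimum-norm interpolating solution.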
Competing Interest Statement
The authors have declared no competing interest.