|
Abstract : |
Abstract: We give an analysis of the generalization error of cross validation in terms of two natural measures of the difficulty of the problem under consideration: the approximation rate (the accuracy to which the target function can be ideally approximated as a function of the number of hypothesis parameters), and the estimation rate (the deviation between the training and generalization errors as a function of the number of hypothesis parameters). The approximation rate captures the complexity of the target function with respect to the hypothesis model, and the estimation rate captures the extent to which the hypothesis model suffers from overfitting. Using these two measures, we give a rigorous and general bound on the error of cross validation. The bound clearly shows the tradeoffs involved with making fl--- the fraction of data saved for testing--- too large or too small. By optimizing the bound with respect to fl, we then argue (through a combination of formal analysis, plotting, and controlled experimentation) that the following qualitative properties of cross validation behavior should be quite robust to significant changes in the underlying model selection problem: ffl When the target function complexity is small compared to the sample size, the performance of cross validation is relatively insensitive to the choice of fl. ffl The importance of choosing fl optimally increases, and the optimal value for fl decreases, as the target function becomes more complex relative to the sample size. ffl There is nevertheless a single fixed value for fl that works nearly optimally for a wide range of target function complexity., |