Goodness of fit tests for high-dimensional linear models [by Rajen Shah]

One of the simplest models for high-dimensional data in the regression setting is the ubiquitous high-dimensional linear model,

\displaystyle Y = X\beta + \sigma \varepsilon.

Here {\beta \in \mathbb{R}^p} is sparse and {\varepsilon \sim \mathcal{N}_n(0, I)}. Whilst methods are readily available for estimating and performing inference concerning the unknown vector of regression coefficients, the problem of checking whether the high-dimensional linear model actually holds has received little attention.

In the low-dimensional setting, checks for the goodness of fit of a linear model are typically based on various plots involving the residuals. Writing {P} for the orthogonal projection onto the column space of {X}, we have that the scaled residuals {(I-P)Y / \|(I-P)Y\|_2 = (I-P)\varepsilon / \|(I-P)\varepsilon\|_2} do not depend on any unknown parameters, a fact which allows for easy interpretation of these plots. This property can, however, also be exploited algorithmically.

If {\mathbb{E}(Y)} is not a linear combination of the columns of {X}, then the scaled residuals will contain some signal. The residual sum of squares (RSS) from a nonlinear regression (e.g. random forest) of the scaled residuals on to {X} should be smaller, on average, than if we were fitting to pure noise. Taking this RSS as our test statistic, we can easily simulate from its null distribution and thereby obtain a (finite sample exact) {p}-value. This is the basic idea of Residual Prediction (RP) tests introduced in our paper, where scaled residuals from a linear regression are then predicted using a further regression procedure (an RP method), and some proxy for its prediction error (e.g. RSS) is computed to give the final test statistic. By converting goodness of fit to a prediction problem, we can leverage the predictive power of the variety of machine learning methods available to detect the presence of nonlinearities.
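The low-dimensional recipe above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the RPtests implementation: the function names are hypothetical, and the RP method used here (least squares on the columns of {X} together with their squares) is a simple stand-in for a more flexible learner such as a random forest.

```python
import numpy as np

def scaled_residuals(X, y):
    """OLS residuals of y on X, rescaled to unit Euclidean norm."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ coef
    return r / np.linalg.norm(r)

def rp_statistic(X, r):
    """RSS from the RP method. Assumption: the RP method is least squares
    on X plus its squared columns, chosen purely for illustration."""
    Z = np.hstack([X, X ** 2])
    coef, *_ = np.linalg.lstsq(Z, r, rcond=None)
    return float(np.sum((r - Z @ coef) ** 2))

def rp_test(X, y, n_sim=199, seed=0):
    """Monte Carlo p-value: simulate the statistic from pure Gaussian noise,
    which is valid because the scaled residuals are parameter-free under the null."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    observed = rp_statistic(X, scaled_residuals(X, y))
    null = np.array([
        rp_statistic(X, scaled_residuals(X, rng.standard_normal(n)))
        for _ in range(n_sim)
    ])
    # A small RSS is evidence against the linear model.
    return (1 + np.sum(null <= observed)) / (n_sim + 1)

# Toy data: one response that is linear in X, one with an added quadratic term.
rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])
beta = np.array([1.0, 2.0, -1.0, 0.5])
y_lin = X @ beta + rng.standard_normal(n)
y_quad = y_lin + 2 * X[:, 1] ** 2

p_lin = rp_test(X, y_lin)
p_quad = rp_test(X, y_quad)
print(p_lin, p_quad)
```

With the quadratic term present, the observed RSS falls far below every simulated null value, so the p-value takes its smallest attainable value of 1/(n_sim + 1). Swapping in a different learner only requires changing `rp_statistic`.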

Tests for a variety of different departures from the linear model can be constructed in this framework, including tests that assess the significance of groups of predictors. For these, we take {X} to be a subset of all available predictors {X_{\text{all}}} and the scaled residuals are regressed on to {X_{\text{all}}} rather than just {X}. For example, when {X_{\text{all}}} is moderate or high-dimensional, one can use the Lasso as the RP method. This is particularly powerful against alternatives where the signal includes a sparse linear combination of variables not present in {X}, but in fact tends to outperform the usual {F}-test in a wider variety of settings including fully dense alternatives. Interestingly, using OLS as the RP method is exactly equivalent to performing the {F}-test. With this “two-stage regression” interpretation of the {F}-test we can view the RP framework as a generalisation of the {F}-test to allow for more general regression procedures in the second stage.
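The equivalence with the {F}-test can be checked directly. The notation here is introduced for illustration and is not from the original post: {P_{\text{all}}} denotes the orthogonal projection onto the column space of {X_{\text{all}}}, and {X} and {X_{\text{all}}} are assumed to have full column ranks {p} and {p_{\text{all}}} with {p < p_{\text{all}} < n}. Write {R = (I-P)Y / \|(I-P)Y\|_2} for the scaled residuals. Since the columns of {X} are among those of {X_{\text{all}}}, we have {(I-P_{\text{all}})(I-P) = I-P_{\text{all}}}, so the RSS from an OLS regression of {R} on {X_{\text{all}}} is

\displaystyle \text{RSS}_{\text{RP}} = \|(I-P_{\text{all}})R\|_2^2 = \frac{\|(I-P_{\text{all}})Y\|_2^2}{\|(I-P)Y\|_2^2}.

The usual {F}-statistic for comparing the two nested models is then

\displaystyle F = \frac{\big(\|(I-P)Y\|_2^2 - \|(I-P_{\text{all}})Y\|_2^2\big)/(p_{\text{all}} - p)}{\|(I-P_{\text{all}})Y\|_2^2/(n - p_{\text{all}})} = \frac{n - p_{\text{all}}}{p_{\text{all}} - p}\left(\frac{1}{\text{RSS}_{\text{RP}}} - 1\right),

which is a strictly decreasing function of {\text{RSS}_{\text{RP}}}. Rejecting for small values of the RP statistic is therefore the same as rejecting for large values of {F}.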

To extend the idea to the high-dimensional setting we use the square-root Lasso as the initial regression procedure. Unfortunately, scaled square-root Lasso residuals do depend on the unknown parameter, so simple Monte Carlo cannot be used to obtain a {p}-value. It turns out, however, that this dependence is largely through the signs of the true coefficients rather than through their magnitudes or the noise level (this can be formalised). This motivates a particular bootstrap scheme for calibrating the tests, which yields asymptotic type I error control regardless of the form of the RP method, under minimum signal strength and restricted eigenvalue conditions under the null. Whilst the scheme that achieves this is somewhat cumbersome, we show empirically that it is essentially equivalent to a simple parametric bootstrap approach which also retains type I error control.

We give examples of RP tests for the significance of groups and individual predictors, where the procedure is competitive with debiased Lasso approaches, and also develop tests for nonlinearity and heteroscedasticity. The R package RPtests implements these but also allows users to design their own RP test to target their particular alternative of interest.


Starting the Series B blog

Following discussions with several editors of Series B, and with the support of the Research Section of the Royal Statistical Society, we are starting a blog on the Journal of the Royal Statistical Society, Series B. This opens a new medium for discussions and comments on the papers published (or not published) in Series B, as well as on statistical methodology, theory, and applications, the editorial choices of the journal, and wider statistical issues. All contributions are welcome, subject to editorial filtering for obvious reasons; they can be sent by email to the Blog Editor or posted through the comments. And note that LaTeX mathematical formulas like

\int_0^\infty \exp\{-s^2/2\}\text{d}s = \sqrt{\pi/2}

can be easily inserted by adding the word latex just after the $ sign. (The favoured format for submission is HTML, produced for instance with latex2wp.)