High-dimensional changepoint estimation via sparse projection [comments by Yudong Chen]

I congratulate the authors on a brilliant paper. The data-driven projection approach, also discussed in earlier work (e.g. Aston and Kirch (2014)), turns out to be very useful for offline high-dimensional changepoint estimation. I would like to suggest that the result for single changepoint estimation can be extended to sub-gaussian (see Vershynin (2012)) noise. Only the tail bound for the CUSUM transformation of the noise needs to be relaxed, since the Ornstein-Uhlenbeck process argument can no longer be used. For completeness, I first define sub-gaussianity, then list the corresponding modified results for sub-gaussian noise:
Definition. A real-valued random variable {X} is said to have a sub-gaussian distribution with parameter {\sigma^2>0}, denoted {X\in \textbf{sG}(\sigma^2)} if for all {t \in \mathbb{R}} it holds that

\displaystyle \mathbb{E}\left[\mathrm{e}^{tX}\right]\le \mathrm{e}^{\frac{t^2 \sigma^2}{2}} .
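This moment generating function bound yields, via a standard Chernoff argument (a textbook step, recorded here only because it is used implicitly below), the tail bound

```latex
\mathbb{P}(X \ge u\sigma)
  \;\le\; \inf_{t>0} \mathrm{e}^{-tu\sigma}\,\mathbb{E}\left[\mathrm{e}^{tX}\right]
  \;\le\; \inf_{t>0} \mathrm{e}^{-tu\sigma + t^2\sigma^2/2}
  \;=\; \mathrm{e}^{-u^2/2},
```

with the infimum attained at {t = u/\sigma}; applying the same bound to {-X} gives {\mathbb{P}(|X| \ge u\sigma) \le 2\mathrm{e}^{-u^2/2}}.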

Lemma 4 (online supplement). Let {W=(W_1,\ldots,W_n)} have independent components with {W_i \in \textbf{sG}(\sigma^2)} for all {i}, and write {E} for the CUSUM transformation of {W}. Then for {u>0}, we have

\displaystyle \mathbb{P}\left( \lVert E \rVert_\infty \ge u\sigma \right) \le 2(n-1)\mathrm{e}^{-\frac{u^2}{2}}.

This result follows directly from the tail bound of a sub-gaussian distribution and a union bound over the {n-1} entries of {E}.
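As a sanity check, here is a small simulation (my own Python sketch, not from the discussion; the CUSUM transformation is implemented in its standard form and the noise is taken to be standard Gaussian, a special case of {\textbf{sG}(1)}) verifying that the empirical exceedance frequency respects the bound of Lemma 4:

```python
import numpy as np

def cusum(w):
    """Standard CUSUM transformation of a length-n series; each of the
    n-1 entries is a unit-variance linear combination of the w_i."""
    n = len(w)
    t = np.arange(1, n)                  # candidate split points 1..n-1
    left = np.cumsum(w)[:-1]             # sum of the first t observations
    right = w.sum() - left               # sum of the remaining n-t
    return np.sqrt(t * (n - t) / n) * (right / (n - t) - left / t)

rng = np.random.default_rng(0)
n, sigma, u, reps = 100, 1.0, 4.0, 2000
exceed = sum(np.abs(cusum(rng.normal(0.0, sigma, n))).max() >= u * sigma
             for _ in range(reps))
emp = exceed / reps                          # empirical exceedance frequency
bound = 2 * (n - 1) * np.exp(-u**2 / 2)      # Lemma 4 bound (about 0.066 here)
```

The bound is crude (it ignores the strong dependence between CUSUM entries), so the empirical frequency typically falls far below it.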
Proposition 1 (main paper). Suppose that {\hat{M}} satisfies (15) for either {\mathcal{S}=\mathcal{S}_1} or {\mathcal{S}=\mathcal{S}_2}. Let {\hat{v}\in \mathrm{argmax}_{\tilde{v}\in\mathbb{S}^{p-1}} \lVert \hat{M}^T\tilde{v} \rVert_2} be the leading left singular vector of {\hat{M}}. If we choose {\lambda \ge 2\sigma \sqrt{\log{(pn)}}}, then

\displaystyle \sup_{P \in \mathcal{P}(n,p,k,1,\vartheta, \tau, \sigma^2)} \mathbb{P}_P\left(\sin \angle(\hat{v},v) > \frac{32\lambda\sqrt{k}}{\tau\vartheta\sqrt{n}}\right) \le \frac{2}{pn}.

This result comes from the following modification to eqn (20) in the paper:

\displaystyle \mathbb{P}\left( \lVert E \rVert_\infty \ge 2\sigma \sqrt{\log{(pn)}} \right) \le 2p(n-1)(pn)^{-2} \le \frac{2}{pn}.
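To make the projection step concrete, here is a Python sketch (my own illustration, not the authors' code): entrywise soft thresholding of the CUSUM matrix at level {\lambda}, one simple way of obtaining a matrix of the kind required by (15) for {\mathcal{S}_2}, followed by extraction of the leading left singular vector.

```python
import numpy as np

def cusum_matrix(X):
    """Row-wise CUSUM transformation of a p x n data matrix."""
    p, n = X.shape
    t = np.arange(1, n)
    left = np.cumsum(X, axis=1)[:, :-1]
    right = X.sum(axis=1, keepdims=True) - left
    return np.sqrt(t * (n - t) / n) * (right / (n - t) - left / t)

def projection_direction(X, lam):
    """Soft-threshold the CUSUM matrix entrywise and return the leading
    left singular vector of the result."""
    T = cusum_matrix(X)
    M = np.sign(T) * np.maximum(np.abs(T) - lam, 0.0)  # soft thresholding
    U, s, Vt = np.linalg.svd(M)
    return U[:, 0]

# toy example: a mean change of size theta in the first k coordinates at time z
rng = np.random.default_rng(1)
p, n, k, z, theta, sigma = 50, 200, 5, 100, 1.0, 1.0
mu = np.zeros((p, n))
mu[:k, z:] = theta
X = mu + sigma * rng.normal(size=(p, n))
vhat = projection_direction(X, 2 * sigma * np.sqrt(np.log(p * n)))
# vhat should put nearly all of its squared mass on the first k coordinates
```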

Theorem 1 (main paper). Suppose {\sigma^2>0} is known. Let {\hat{z}} be the output of Algorithm 3 with input {X \sim P \in \mathcal{P}(n,p,k,1,\vartheta, \tau, \sigma^2)} and {\lambda := 2\sigma\sqrt{\log{(pn)}}}. There exist universal constants {C,C'>0} such that if {n \ge 10} is even, {z} is even and

\displaystyle \frac{C\sigma}{\vartheta \tau} \sqrt{\frac{k\log{(pn)}}{n}}\le 1,

then

\displaystyle \mathbb{P}_P\left( \frac{1}{n}|\hat{z}-z|\le \frac{C'\sigma^2 \log n}{n\vartheta^2} \right) \ge 1-\frac{4}{pn}-\frac{4}{n}-\frac{32\log{(n/2)}}{n}.

The proof is very similar to the one in the paper, except that we now choose {\lambda_1 = 2\sigma \sqrt{\log n}} and use Lemma 4, Lemma 5 and Proposition 1 to bound the probability of {\Omega_1^c \cup \Omega_2^c}.
I would also like to remark that the method presented in the paper, though designed for the offline problem, is reasonably efficient even in the online setting. However, much remains to be explored in online changepoint detection. I note that Soh and Chandrasekaran (2017) presented a method combining filtering with convex optimisation; their method can be applied in the online setting, although it relies on a weaker sparsity assumption on the signal.

  1. Aston, J. A. D. and Kirch, C. (2014) Efficiency of change point tests in high dimensional settings. arXiv preprint, arXiv:1409.1771v2.
  2. Soh, Y. S. and Chandrasekaran, V. (2017) High-dimensional change-point estimation: combining filtering with convex optimization. Appl. Comput. Harmon. Anal., 43, 122-147.
  3. Vershynin, R. (2012) Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing, pp. 210-268. Cambridge: Cambridge University Press.

Goodness of fit tests for high-dimensional linear models [by Rajen Shah]

One of the simplest models for high-dimensional data in the regression setting is the ubiquitous high-dimensional linear model,

\displaystyle Y = X\beta + \sigma \varepsilon.

Here {\beta \in \mathbb{R}^p} is sparse and {\varepsilon \sim \mathcal{N}_n(0, I)}. Whilst methods are readily available for estimating and performing inference concerning the unknown vector of regression coefficients, the problem of checking whether the high-dimensional linear model actually holds has received little attention.

In the low-dimensional setting, checks for the goodness of fit of a linear model are typically based on various plots involving the residuals. Writing {P} for the orthogonal projection on to the column space of {X}, we have that under the null model the scaled residuals {(I-P)Y / \|(I-P)Y\|_2 = (I-P)\varepsilon / \|(I-P)\varepsilon\|_2} do not depend on any unknown parameters, a fact which allows for easy interpretation of these plots. This property can, however, also be exploited algorithmically.
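The pivotality of the scaled residuals is easy to check numerically. In this Python sketch (my own, with arbitrary parameter choices), two responses sharing the same noise vector but different {\beta} and {\sigma} produce identical scaled residuals:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = rng.normal(size=(n, p))
eps = rng.normal(size=n)

def scaled_residuals(y, X):
    """(I - P)y / ||(I - P)y||_2, with P the orthogonal projection onto
    the column space of X (computed via least squares)."""
    r = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return r / np.linalg.norm(r)

# same noise eps, different coefficients and noise levels
y1 = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * eps
y2 = X @ np.array([5.0, 0.0, -1.0]) + 2.0 * eps
u1, u2 = scaled_residuals(y1, X), scaled_residuals(y2, X)
# u1 and u2 agree: both equal (I-P)eps / ||(I-P)eps||_2
```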

If {\mathbb{E}(Y)} is not a linear combination of the columns of {X}, then the scaled residuals will contain some signal. The residual sum of squares (RSS) from a nonlinear regression (e.g. random forest) of the scaled residuals on to {X} should be smaller, on average, than if we were fitting to pure noise. Taking this RSS as our test statistic, we can easily simulate from its null distribution and thereby obtain a (finite sample exact) {p}-value. This is the basic idea of Residual Prediction (RP) tests introduced in our paper, where scaled residuals from a linear regression are then predicted using a further regression procedure (an RP method), and some proxy for its prediction error (e.g. RSS) is computed to give the final test statistic. By converting goodness of fit to a prediction problem, we can leverage the predictive power of the variety of machine learning methods available to detect the presence of nonlinearities.
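A minimal Python sketch of this recipe (my own illustration; a least-squares fit on {(X, X^2)} stands in for a more flexible RP method such as a random forest, and all parameter choices are arbitrary):

```python
import numpy as np

def scaled_residuals(y, x):
    """Scaled residuals from OLS of y on (1, x)."""
    Z = np.column_stack([np.ones(len(y)), x])
    r = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return r / np.linalg.norm(r)

def rp_rss(u, x):
    """RP method: regress u on (1, x, x^2) and return the RSS;
    a simple stand-in for a nonlinear learner such as a random forest."""
    Z = np.column_stack([np.ones(len(u)), x, x**2])
    fit = Z @ np.linalg.lstsq(Z, u, rcond=None)[0]
    return np.sum((u - fit)**2)

def rp_test(y, x, B=199, rng=None):
    """Finite-sample exact Monte Carlo p-value: under the null, the scaled
    residuals have the same distribution as those of pure noise."""
    rng = np.random.default_rng() if rng is None else rng
    stat = rp_rss(scaled_residuals(y, x), x)
    null = [rp_rss(scaled_residuals(rng.normal(size=len(y)), x), x)
            for _ in range(B)]
    return (1 + sum(s <= stat for s in null)) / (B + 1)

rng = np.random.default_rng(3)
n = 100
x = rng.normal(size=n)
p_null = rp_test(1.0 + 2.0 * x + rng.normal(size=n), x, rng=rng)        # linear truth
p_alt = rp_test(1.0 + 2.0 * x + x**2 + rng.normal(size=n), x, rng=rng)  # quadratic truth
```

A small RSS indicates that the residuals carry predictable signal, so small values of the statistic count as evidence against the null; p_alt should be tiny, while p_null behaves like a uniform draw.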

Tests for a variety of different departures from the linear model can be constructed in this framework, including tests that assess the significance of groups of predictors. For these, we take {X} to be a subset of all available predictors {X_{\text{all}}} and the scaled residuals are regressed on to {X_{\text{all}}} rather than just {X}. For example, when {X_{\text{all}}} is moderate or high-dimensional, one can use the Lasso as the RP method. This is particularly powerful against alternatives where the signal includes a sparse linear combination of variables not present in {X}, but in fact tends to outperform the usual {F}-test in a wider variety of settings including fully dense alternatives. Interestingly, using OLS as the RP method is exactly equivalent to performing the {F}-test. With this “two-stage regression” interpretation of the {F}-test we can view the RP framework as a generalisation of the {F}-test to allow for more general regression procedures in the second stage.
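The equivalence with the {F}-test is easy to verify numerically. In this sketch (my own, with arbitrary design choices), the OLS-RP statistic orders simulated data sets exactly opposite to the partial {F}-statistic, reflecting the monotone relation {F = \{(n-p_{\text{all}})/q\}(1/\mathrm{RSS} - 1)} that holds because the columns of {X} lie in the span of {X_{\text{all}}}:

```python
import numpy as np

def rp_stat_ols(y, X_sub, X_all):
    """OLS as the RP method: RSS from regressing the scaled sub-model
    residuals onto X_all."""
    r = y - X_sub @ np.linalg.lstsq(X_sub, y, rcond=None)[0]
    u = r / np.linalg.norm(r)
    fit = X_all @ np.linalg.lstsq(X_all, u, rcond=None)[0]
    return np.sum((u - fit)**2)

def f_stat(y, X_sub, X_all):
    """Classical partial F-statistic for the extra columns of X_all."""
    n, q = len(y), X_all.shape[1] - X_sub.shape[1]
    rss0 = np.sum((y - X_sub @ np.linalg.lstsq(X_sub, y, rcond=None)[0])**2)
    rss1 = np.sum((y - X_all @ np.linalg.lstsq(X_all, y, rcond=None)[0])**2)
    return ((rss0 - rss1) / q) / (rss1 / (n - X_all.shape[1]))

rng = np.random.default_rng(4)
n = 60
X_sub = np.column_stack([np.ones(n), rng.normal(size=n)])
X_all = np.column_stack([X_sub, rng.normal(size=(n, 2))])
rp, F = [], []
for _ in range(20):
    y = X_sub @ np.array([1.0, 2.0]) + rng.normal(size=n)
    rp.append(rp_stat_ols(y, X_sub, X_all))
    F.append(f_stat(y, X_sub, X_all))
# small RP statistic <-> large F-statistic: the two orderings are reversed
order_match = np.array_equal(np.argsort(rp), np.argsort(F)[::-1])
```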

To extend the idea to the high-dimensional setting we use the square-root Lasso as the initial regression procedure. Unfortunately, scaled square-root Lasso residuals do depend on the unknown parameters, so simple Monte Carlo cannot be used to obtain a {p}-value. It turns out, however, that this dependence is largely through the signs of the true coefficient vector rather than through their magnitudes or the noise level (this can be formalised). This motivates a particular bootstrap scheme for calibration of the tests which yields asymptotic type I error control, regardless of the form of the RP method, under minimum signal strength and restricted eigenvalue conditions under the null. Whilst the scheme that achieves this is somewhat cumbersome, we show empirically that it is essentially equivalent to a simple parametric bootstrap approach which also retains type I error control.

We give examples of RP tests for the significance of groups and individual predictors, where the procedure is competitive with debiased Lasso approaches, and also develop tests for nonlinearity and heteroscedasticity. The R package RPtests implements these tests and also allows the user to design their own RP test to target their particular alternative of interest.


Starting the Series B blog

Following discussions with several editors of Series B and with the support of the Research Section of the Royal Statistical Society, we are starting a blog on the Journal of the Royal Statistical Society, Series B. This opens a new medium for discussions of and comments on the papers published (or not published) in Series B, as well as statistical methodology, theory and applications, the editorial choices of the journal, and wider statistical issues. All contributions are welcome, subject to editorial filtering for obvious reasons, either by email to the Blog Editor or through the comments. Note that LaTeX mathematical formulas like

\int_0^\infty \exp\{-s^2/2\}\text{d}s = \sqrt{\pi/2}

can easily be inserted by adding the word latex just after the first $ sign. (The favoured format for submission is html, using for instance latex2wp.)