# Recent posts

## High-dimensional changepoint estimation via sparse projection [comments by Yudong Chen]

I congratulate the authors for such a brilliant paper. The data-driven projection approach also discussed in previous papers (e.g. Aston and Kirch (2014)) turns out to be very useful in tackling offline high-dimensional changepoint estimation. I would like to suggest that the result for single changepoint estimation can be extended to sub-gaussian (see Vershynin (2012)) noise structure. Only the tail bound for the CUSUM transformation of noise requires to be relaxed as the Ornstein-Uhlenbeck process can not be used. For completeness, I define the notion of sub-subgaussianity first, then I list the corresponding modified results for sub-gaussian noise:
Definition. A real-valued random variable ${X}$ is said to have a sub-gaussian distribution with parameter ${\sigma^2>0}$, denoted ${X\in \textbf{sG}(\sigma^2)}$ if for all ${t \in \mathbb{R}}$ it holds that

$\displaystyle \mathbb{E}\left[\mathrm{e}^{tX}\right]\le \mathrm{e}^{\frac{t^2 \sigma^2}{2}} .$

Lemma 4.(online supplement) Let ${W=(W_1,...,W_n)}$ have independent components with ${W_i \in \textbf{sG}(\sigma^2)}$ for all ${i}$. Then for ${u>0}$, we have

$\displaystyle \mathbb{P}\left( \lVert E \rVert_\infty \ge u\sigma \right) \le 2(n-1)\mathrm{e}^{-\frac{u^2}{2}}.$

This result comes directly from the tail bound of a sub-gaussian distribution and a union bound.
Proposition 1.(main paper) Suppose that ${\hat{M}}$ satisfies (15) for either ${\mathcal{S}=\mathcal{S}_1}$ or ${\mathcal{S}=\mathcal{S}_2}$. Let ${\hat{v}\in \mathrm{argmax}_{\tilde{v}\in\mathbb{S}^{p-1}} \lVert M^T\tilde{v} \rVert_2}$ be the leading singular vector of ${\hat{M}}$. If we choose ${\lambda \ge 2\sigma \sqrt{\log{(pn)}}}$, then

$\displaystyle \sup_{P \in \mathcal{P}(n,p,k,1,\vartheta, \tau, \sigma^2)} \mathbb{P}_P\left(\sin \angle(\hat{v},v) > \frac{32\lambda\sqrt{k}}{\tau\vartheta\sqrt{n}}\right) \le \frac{2}{pn}.$

This result comes from the following modification to eqn (20) in the paper:

$\displaystyle \mathbb{P}\left( \lVert E \rVert_\infty \ge 2\sigma \sqrt{\log{(pn)}} \right) \le 2p(n-1)(pn)^{-2} \le \frac{2}{pn}.$

Theorem 1.(main paper) Suppose ${\sigma^2>0}$ is known. Let ${\hat{z}}$ be the output of Algorithm 3 with input ${X \sim P \in \mathcal{P}(n,p,k,1,\vartheta, \tau, \sigma^2)}$ and ${\lambda := 2\sigma\sqrt{\log{(pn)}}}$. There exist universal constants ${C,C'>0}$ such that if ${n \ge 10}$ is even, ${z}$ is even and

$\displaystyle \frac{C\sigma}{\vartheta \tau} \sqrt{\frac{k\log{(pn)}}{n}}\le 1,$

then

$\displaystyle \mathbb{P}_P\left( \frac{1}{n}|\hat{z}-z|\le \frac{C'\sigma^2 \log n}{n\vartheta^2} \right) \le 1-\frac{4}{pn}-\frac{4}{n}-\frac{32\log{(n/2)}}{n}.$

The proof is very similar to the one in the paper, while we can only choose ${\lambda_1}$ to be ${2\sigma \sqrt{\log n}}$ and use Lemma 4, Lemma 5 and Proposition 1 to bound the probability of ${\Omega_1^c \cup \Omega_2^c}$ again.
I would also like to remark that the method presented in the paper, though it is offline, is reasonably efficient even in online setting. However there is still very much to be explored in online changepoint detection. I notice that Soh and Chandrasekaran (2017) presented a method using convex optimisation and filtering. The method can be applied in online setting, however the assumption that the signal is sparse is weaker.

1.  Aston, J. A. D. and Kirch, C. (2014) Efficiency of change point tests in high dimensional settings. arXiv preprint, arxiv:1409.1771v2.
2. Soh, Y. S. and Chandrasekaran, V. (2017) High-dimensional change-point estimation: combining filtering with convex optimization. Appl. Comput. Harmon. Anal., 43, 122-147.
3. Vershynin, R. (2012) Introduction to the non-asymptotic analysis of random matrices. Compressed sensing, 210-268, Cambridge Univ. Press, Cambridge.

## High-dimensional changepoint estimation via sparse projection [by Richard Samworth]

Changepoints are a very common feature of Big Data that arrive in the form of a data stream. In this paper, we study high-dimensional time series in which, at certain time points, the mean structure changes in a sparse subset of the coordinates. The challenge is to borrow strength across the coordinates in order to detect smaller changes than could be observed in any individual component series. We propose a two-stage procedure called inspect for estimation of the changepoints: first, we argue that a good projection direction can be obtained as the leading left singular vector of the matrix that solves a convex optimisation problem derived from the CUSUM transformation of the time series. We then apply an existing univariate changepoint estimation algorithm to the projected series. Our theory provides strong guarantees on both the number of estimated changepoints and the rates of convergence of their locations, and our numerical studies validate its highly competitive empirical performance for a wide range of data generating mechanisms. We call our algorithm inspect, short for informative sparse projection for estimation of changepoints; it is implemented in the R package InspectChangepoint.

Left: visualisation of the data matrix. Right: its CUSUM transformation.

Left: overlay of the projected CUSUM statistics for the three changepoints detected. Right: visualisation of thresholding; the three detected changepoints are above the threshold (dotted red line) whereas the remaining numbers are the test statistics obtained if we run the wild binary segmentation to completion without applying a termination criterion.

A brief illustration of the inspect algorithm in action is given in the Figure above. Here, we simulated a 2000 x 1000 data matrix having independent normal columns with identity covariance and with three changepoints in the mean structure at locations 500, 1000 and 1500. Changes occur in 40 coordinates, where consecutive changepoints overlap in half of their coordinates, and the squared ${\ell_2}$ norms of the vectors of mean changes were 0.4, 0.9 and 1.6 respectively.

## Estimation of the False Discovery Proportion with Unknown Dependence

The correlation effect of dependent test statistics in large-scale multiple testing has attracted considerable attention in recent years. Applying standard Benjamini & Hochberg (1995, B-H) or Storey (2002)’s procedures for independent test statistics can lead to inaccurate false discovery control. Statisticians have now reached the conclusion that it is important and necessary to incorporate the dependence information in the multiple testing procedure. A challenging question is how to incorporate the correlation effect in the testing procedure. In our paper, we suppose that the observed data $\{\mathbf{X}_i\}_{i=1}^n$ are p-dimensional independent random vectors with ${\mathbf{X}_i\sim N_p(\mu,\Sigma)}$, where ${\Sigma}$ is unknown. The mean vector ${\mu=(\mu_1,\cdots,\mu_p)^T}$ is a high dimensional sparse vector, but we do not know which ones are the non-vanishing signals. Consider the standardized test statistics

${Z_j=n^{-1/2}\sum_{i=1}^nX_{ij}/\widehat{\sigma}_j}$

where ${\widehat{\sigma}_j}$ is sample standard deviation of the j-th coordinate, we aim to provide a good approximation of the False Discovery Proportion (FDP) for the detection of signals. If ${\Sigma}$ is known, Fan and his colleagues provided an accurate approximation of the FDP, which is a nonlinear function of eigenvalues and eigenvectors of ${\Sigma}$. However, the problem of unknown dependence has at least two fundamental differences from the setting with known dependence. (a) Impact through estimating marginal variances. When the population marginal variances of the observable random variables are unknown, they have to be estimated first for standardization. In such a case, the popular choice of the test statistics will have ${t}$ distribution with dependence rather than the multivariate normal distribution considered in Fan, Han & Gu (2012); (b) Impact through estimating eigenvalues/eigenvectors. Even if the population marginal variances of the observable random variables are known, estimation of eigenvalues/eigenvector can still significantly affect the FDP approximation. In various situations, FDP approximation can have inferior performance even if a researcher chooses the “best” estimator for the unknown matrix. In our paper, we consider a generic estimator ${\widehat{\Sigma}}$. The major regularity conditions to get a good FDP approximation will be on the first k eigenvalues and eigenvectors of ${\widehat{\Sigma}}$. These k eigenvectors ${\{\widehat{\gamma}_i\}}$ needs to be consistently estimated, but the k eigenvalues ${\{\widehat{\lambda}_i\}}$ are not necessarily consistent estimates. This result can be further connected with the covariance matrix estimation, where the dependence structures of ${\Sigma}$ can include banded or sparse covariance matrices and (conditional) sparse precision matrices. Within this framework, we also consider a special example to illustrate our method where data are sampled from an approximate factor model, which encompasses most practical situations. We will recommend a POET-PFA method in our paper for FDP approximation in practice. The proposed method POET-PFA can be easily implemented by the R package “pfa” (version 1.1).

## Goodness of fit tests for high-dimensional linear models [by Rajen Shah]

One of the simplest models for high-dimensional data in the regression setting is the ubiquitous high-dimensional linear model,

$\displaystyle Y = X\beta + \sigma \varepsilon.$

Here ${\beta \in \mathbb{R}^p}$ is sparse and ${\varepsilon \sim \mathcal{N}_n(0, I)}$. Whilst methods are readily available for estimating and performing inference concerning the unknown vector of regression coefficients, the problem of checking whether the high-dimensional linear model actually holds has received little attention.

In the low-dimensional setting, checks for the goodness of fit of a linear model are typically based on various plots involving the residuals. Writing ${P}$ for the orthogonal projection on to ${X}$, we have that the scaled residuals ${(I-P)Y / \|(I-P)Y\|_2 = (I-P)\varepsilon / \|(I-P)\varepsilon\|_2}$ do not depend on any unknown parameters, a fact which allows for easy interpretation of these plots. This property can however also be exploited algorithmically.

If ${\mathbb{E}(Y)}$ is not a linear combination of the columns of ${X}$, then the scaled residuals will contain some signal. The residual sum of squares (RSS) from a nonlinear regression (e.g. random forest) of the scaled residuals on to ${X}$ should be smaller, on average, than if we were fitting to pure noise. Taking this RSS as our test statistic, we can easily simulate from its null distribution and thereby obtain a (finite sample exact) ${p}$-value. This is the basic idea of Residual Prediction (RP) tests introduced in our paper, where scaled residuals from a linear regression are then predicted using a further regression procedure (an RP method), and some proxy for its prediction error (e.g. RSS) is computed to give the final test statistic. By converting goodness of fit to a prediction problem, we can leverage the predictive power of the variety of machine learning methods available to detect the presence of nonlinearities.

Tests for a variety of different departures from the linear model can be constructed in this framework, including tests that assess the significance of groups of predictors. For these, we take ${X}$ to be a subset of all available predictors ${X_{\text{all}}}$ and the scaled residuals are regressed on to ${X_{\text{all}}}$ rather than just ${X}$. For example, when ${X_{\text{all}}}$ is moderate or high-dimensional, one can use the Lasso as the RP method. This is particularly powerful against alternatives where the signal includes a sparse linear combination of variables not present in ${X}$, but in fact tends to outperform the usual ${F}$-test in a wider variety of settings including fully dense alternatives. Interestingly, using OLS as the RP method is exactly equivalent to performing the ${F}$-test. With this “two-stage regression” interpretation of the ${F}$-test we can view the RP framework as a generalisation of the ${F}$-test to allow for more general regression procedures in the second stage.

To extend the idea to the high-dimensional setting we use the square-root Lasso as the initial regression procedure. Unfortunately scaled square-root Lasso residuals do depend on the unknown parameter so simple Monte Carlo cannot be used to obtain a ${p}$-value. It turns out however that this dependence is largely through the signs of the true coefficient vector rather than their magnitudes or on the noise level (this can be formalised). This motivates a particular bootstrap scheme for calibration of the tests which yields asymptotic type I error control regardless of the form of the RP method under minimum signal strength a restricted eigenvalue conditions under the null. Whilst the scheme that achieves this is somewhat cumbersome, we show empirically that it is essentially equivalent to a simple parametric bootstrap approach which also retains type I error control.

We give examples of RP tests for the significance of groups and individual predictors, where the procedure is competitive with debiased Lasso approaches, and also develop tests for nonlinearity and heteroscedasicity. The R package RPtests implements these but also allows the user to design their own RP test to target their particular alternative of interest.

## Hamiltonian Monte Carlo sampling in Bayesian empirical likelihood computation

While working on the Series B blog yesterday I noticed this paper by Chauduri et al. on Hamiltonian Monte Carlo and empirical likelihood: how exciting!!! Here is the abstract of the paper:

We consider Bayesian empirical likelihood estimation and develop an efficient Hamiltonian Monte Car lo method for sampling from the posterior distribution of the parameters of interest.The method proposed uses hitherto unknown properties of the gradient of the underlying log-empirical-likelihood function. We use results from convex analysis to show that these properties hold under minimal assumptions on the parameter space, prior density and the functions used in the estimating equations determining the empirical likelihood. Our method employs a finite number of estimating equations and observations but produces valid semi-parametric inference for a large class of statistical models including mixed effects models, generalized linear models and hierarchical Bayes models. We overcome major challenges posed by complex, non-convex boundaries of the support routinely observed for empirical likelihood which prevent efficient implementation of traditional Markov chain Monte Car lo methods like random-walk Metropolis–Hastings sampling etc. with or without parallel tempering. A simulation study confirms that our method converges quickly and draws samples from the posterior support efficiently. We further illustrate its utility through an analysis of a discrete data set in small area estimation.

It is of particular interest for me [disclaimer: I was not involved in the review of this paper!] as we worked on ABC thru empirical likelihood, which is about the reverse of the current paper in terms of motivation: when faced with a complex model, we substitute an empirical likelihood version for the real thing, run simulations from the prior distribution and use the empirical likelihood as a proxy. With possible intricacies when the data is not iid (an issue we also met with Wasserstein distances.) In this paper the authors instead consider working on an empirical likelihood as their starting point and derive an HMC algorithm to do so. The idea is striking in that, by nature, an empirical likelihood is not a very smooth object and hence does not seem open to producing gradients and Hessians. As illustrated by Figure 1 in the paper . Which is so spiky at places that one may wonder at the representativity of such graphs.

I have always had a persistent worry about the ultimate validity of treating the empirical likelihood as a genuine likelihood, from the fact that it is the result of an optimisation problem to the issue that the approximate empirical distribution has a finite (data-dependent) support, hence is completely orthogonal to the true distribution. And to the one that the likelihood function is zero outside the convex hull of the defining equations…(For one thing, this empirical likelihood is always bounded by one but this may be irrelevant after all!)

The computational difficulty in handling the empirical likelihood starts with its support. Eliminating values of the parameter for which this empirical likelihood is zero amounts to checking whether zero belongs to the above convex hull. A hard (NP hard?) problem. (Although I do not understand why the authors dismiss the token observations of Owen and others. The argument that Bayesian analysis does more than maximising a likelihood seems to confuse the empirical likelihood as a product of a maximisation step with the empirical likelihood as a function of the parameter that can be used as any other function.)

In the simple regression example (pp.297-299), I find the choice of the moment constraints puzzling, in that they address the mean of the white noise (zero) and the covariance with the regressors (zero too). Puzzling because my definition of the regression model is conditional on the regressors and hence does not imply anything on their distribution. In a sense this is another model. But I also note that the approach focus on the distribution of the reconstituted white noises, as we did in the PNAS paper. (The three examples processed in the paper are all simple and could be processed by regular MCMC, thus making the preliminary step of calling for an empirical likelihood somewhat artificial unless I missed the motivation. The paper also does not seem to discuss the impact of the choice of the moment constraints or the computing constraints involved by a function that is itself the result of a maximisation problem.)

A significant part of the paper is dedicated to the optimisation problem and the exclusion of the points on the boundary. Which sounds like a non-problem in continuous settings. However, this appears to be of importance for running an HMC as it cannot evade the support (without token observations). On principle, HMC should not leave this support since the gradient diverges at the boundary, but in practice the leapfrog approximation may lead the path outside. I would have (naïvely?) suggested to reject moves when this happens and start again but the authors consider that proper choices of the calibration factors of HMC can avoid this problem. Which seems to induce a practical issue by turning the algorithm into an adaptive version.

As a last point, I would have enjoyed seeing a comparison of the performances against our (A)BCel version, which would have been straightforward to implement in the simple examples handled by the paper. (This could be a neat undergraduate project for next year!)
~

## On the exact region determined by Kendall’s τ and Spearman’s ρ [by Wolfgang Trutschnig]

Suppose that ${X,Y}$ are random variables with continuous distribution functions ${F}$ and ${G}$ respectively. Then Spearman’s ${\rho}$ is defined as the Pearson correlation coefficient of the ${\mathcal{U}(0,1)}$-distributed random variables ${U:=F \circ X}$ and ${V:=G \circ Y}$, and Kendall’s ${\tau}$ is given by the probability of concordance minus the probability of discordance, i.e.

$\displaystyle \rho(X,Y)=12\big(\mathbb{E}(UV)-\tfrac{1}{4}\big)$

$\displaystyle \tau(X,Y)= \mathbb{P}\big((X_1-X_2)(Y_1 -Y_2)>0) - \mathbb{P}\big((X_1-X_2)(Y_1 -Y_2)<0\big)$

where ${(X_1,Y_1)}$ and ${(X_2,Y_2)}$ are independent and have the same distribution as ${(X,Y)}$. Clearly, ${\tau}$ and ${\rho}$ are the two most famous nonparametric measures of concordance. Despite the fact that ${\tau}$ and ${\rho}$ are both concordance measures, they quantify different aspects of the underlying dependence structure and may vary significantly. Since the early 1950s two universal inequalities between ${\tau}$ and ${\rho}$ are known – the first one goes back to Daniels (1950), the second one to Durbin and Stuart (1951):

$\displaystyle \vert 3 \tau-2 \rho \vert \leq 1 \ \ \ \ \ (1)$

$\displaystyle \frac{(1+\tau)^2}{2} -1 \leq \rho \leq 1- \frac{(1-\tau)^2}{2} \ \ \ \ \ (2)$

Although both inequalities have been known for decades now, a full characterization of the exact ${\tau}$${\rho}$-region ${\Omega}$, defined by

$\Omega=\big\{(\tau(X,Y),\rho(X,Y)):\,\, X,Y \text{ continuous random variables}\big\},$

was first established in our article. Working with copulas and so-called shuffles it was possible to describe all random variables ${(X,Y)}$ with ${(\tau(X,Y),\rho(X,Y)) \in \partial \Omega}$, where ${\partial \Omega}$ denotes the topological boundary of ${\Omega}$. Figure 0 depicts some distributions of uniformly distributed random variables ${X,Y}$ for which we have ${(\tau(X,Y),\rho(X,Y)) \in \partial \Omega}$ – for a small animation we refer to animation ${\partial \Omega}$. Although our main objective was a full characterization of ${\Omega}$ the proofs produced an equally interesting by-product: For every point ${(x,y) \in \Omega}$ there exist random variables ${X,Y}$ and a measurable bijection ${f: \mathbb{R} \rightarrow \mathbb{R}}$ with ${Y=f(X)}$ such that ${(\tau(X,Y),\rho(X,Y))=(x,y)}$ holds. In other words: (Mutually) completely dependent random variables cover all possible ${\tau}$${\rho}$-constellations.

As pointed out in Section 6 in our paper, characterizing the exact ${\tau}$${\rho}$-region for standard families of distributions (or, equivalently, copulas) may in some cases be even more difficult than determining ${\Omega}$ was. The main reason for this fact is that not in all families we may find dense subsets consisting of elements for which ${\tau}$ and ${\rho}$ allow for handy formulas (as it is the case for shuffles). The author conjectures, however, that the classical Hutchinson-Lai inequalities are not sharp for the class of all extreme-value copulas and that it might be possible to derive sharper inequalities in the near future.

## A Coverage Theory for Least Squares [by Simone Garatti]

Least Squares, in its different forms, is probably the most used approach to construct models from a sample of observations. In the article “A Coverage Theory for Least Squares”, we refer to least-squares more generally as a methodology to make decisions. To give a concrete example, consider the problem of deciding the location of a station that serves a population. This can e.g. be a laundry, a library or a supermarket. After obtaining the location of the homes of a sample of the population, according to a least-squares approach the service station can be placed at the barycentric position of the home locations, a decision that minimizes the sum of squared home-service distances. In the article, our main interest lies in assessing the quality of least squares decisions: once the location of the service station has been determined based on a sample, one can ask how good the decision made is for the rest of the population. For example, one can evaluate the levels of satisfaction – as measured by the home-service distance – for the individuals in the sample, and ask how representative these satisfaction levels are of the level of satisfaction of the whole population. Say e.g. that the individual who is the 10th furthest away from the station in a sample of 100 has to walk 0.4 miles to go to the station, what is the proportion of the population that has to walk more than 0.4 miles to go to the station? Answering this question and related ones has an important impact on the usage of the least squares method, and even on the acceptance of the solution obtained from it (if too many people have to walk for too long, the decision to place the station in the least squares location might be rejected in favor of constructing two stations instead of one at an extra cost). This problem has so far received little attention from the statistical community and this article tries to fill this gap by presenting a new theory that is applicable to least squares decisions across diverse fields. In the service location problem, the empirical proportion of members in the sample that pay a cost above a given value is not a valid statistic for quantifying the proportion of the whole population whose cost is above that given value. This is not surprising since the least squares solution introduces a bias towards making small the cost of the members in the sample. On the other hand, it is shown in the article that, by introducing suitable margins, valid and tight statistics can be obtained which hold true distribution-free, that is, these statistics can be applied without availing of extra knowledge on how the population distributes.