High-dimensional changepoint estimation via sparse projection [by Richard Samworth]

Changepoints are a very common feature of Big Data that arrive in the form of a data stream. In this paper, we study high-dimensional time series in which, at certain time points, the mean structure changes in a sparse subset of the coordinates. The challenge is to borrow strength across the coordinates in order to detect smaller changes than could be observed in any individual component series. We propose a two-stage procedure called inspect for estimation of the changepoints: first, we argue that a good projection direction can be obtained as the leading left singular vector of the matrix that solves a convex optimisation problem derived from the CUSUM transformation of the time series. We then apply an existing univariate changepoint estimation algorithm to the projected series. Our theory provides strong guarantees on both the number of estimated changepoints and the rates of convergence of their locations, and our numerical studies validate its highly competitive empirical performance for a wide range of data generating mechanisms. We call our algorithm inspect, short for informative sparse projection for estimation of changepoints; it is implemented in the R package InspectChangepoint.

Left: visualisation of the data matrix. Right: its CUSUM transformation.

 

Left: overlay of the projected CUSUM statistics for the three changepoints detected. Right: visualisation of thresholding; the three detected changepoints are above the threshold (dotted red line) whereas the remaining numbers are the test statistics obtained if we run the wild binary segmentation to completion without applying a termination criterion. 

A brief illustration of the inspect algorithm in action is given in the Figure above. Here, we simulated a 2000 x 1000 data matrix having independent normal columns with identity covariance and with three changepoints in the mean structure at locations 500, 1000 and 1500. Changes occur in 40 coordinates, where consecutive changepoints overlap in half of their coordinates, and the squared {\ell_2} norms of the vectors of mean changes were 0.4, 0.9 and 1.6 respectively.

Estimation of the False Discovery Proportion with Unknown Dependence

The correlation effect of dependent test statistics in large-scale multiple testing has attracted considerable attention in recent years. Applying standard Benjamini & Hochberg (1995, B-H) or Storey (2002)’s procedures for independent test statistics can lead to inaccurate false discovery control. Statisticians have now reached the conclusion that it is important and necessary to incorporate the dependence information in the multiple testing procedure. A challenging question is how to incorporate the correlation effect in the testing procedure. In our paper, we suppose that the observed data \{\mathbf{X}_i\}_{i=1}^n are p-dimensional independent random vectors with {\mathbf{X}_i\sim N_p(\mu,\Sigma)}, where {\Sigma} is unknown. The mean vector {\mu=(\mu_1,\cdots,\mu_p)^T} is a high dimensional sparse vector, but we do not know which ones are the non-vanishing signals. Consider the standardized test statistics

{Z_j=n^{-1/2}\sum_{i=1}^nX_{ij}/\widehat{\sigma}_j}

where {\widehat{\sigma}_j} is sample standard deviation of the j-th coordinate, we aim to provide a good approximation of the False Discovery Proportion (FDP) for the detection of signals. If {\Sigma} is known, Fan and his colleagues provided an accurate approximation of the FDP, which is a nonlinear function of eigenvalues and eigenvectors of {\Sigma}. However, the problem of unknown dependence has at least two fundamental differences from the setting with known dependence. (a) Impact through estimating marginal variances. When the population marginal variances of the observable random variables are unknown, they have to be estimated first for standardization. In such a case, the popular choice of the test statistics will have {t} distribution with dependence rather than the multivariate normal distribution considered in Fan, Han & Gu (2012); (b) Impact through estimating eigenvalues/eigenvectors. Even if the population marginal variances of the observable random variables are known, estimation of eigenvalues/eigenvector can still significantly affect the FDP approximation. In various situations, FDP approximation can have inferior performance even if a researcher chooses the “best” estimator for the unknown matrix. In our paper, we consider a generic estimator {\widehat{\Sigma}}. The major regularity conditions to get a good FDP approximation will be on the first k eigenvalues and eigenvectors of {\widehat{\Sigma}}. These k eigenvectors {\{\widehat{\gamma}_i\}} needs to be consistently estimated, but the k eigenvalues {\{\widehat{\lambda}_i\}} are not necessarily consistent estimates. This result can be further connected with the covariance matrix estimation, where the dependence structures of {\Sigma} can include banded or sparse covariance matrices and (conditional) sparse precision matrices. Within this framework, we also consider a special example to illustrate our method where data are sampled from an approximate factor model, which encompasses most practical situations. We will recommend a POET-PFA method in our paper for FDP approximation in practice. The proposed method POET-PFA can be easily implemented by the R package “pfa” (version 1.1).

On the exact region determined by Kendall’s τ and Spearman’s ρ [by Wolfgang Trutschnig]

Suppose that {X,Y} are random variables with continuous distribution functions {F} and {G} respectively. Then Spearman’s {\rho} is defined as the Pearson correlation coefficient of the {\mathcal{U}(0,1)}-distributed random variables {U:=F \circ X} and {V:=G \circ Y}, and Kendall’s {\tau} is given by the probability of concordance minus the probability of discordance, i.e.

\displaystyle \rho(X,Y)=12\big(\mathbb{E}(UV)-\tfrac{1}{4}\big)

\displaystyle \tau(X,Y)= \mathbb{P}\big((X_1-X_2)(Y_1 -Y_2)>0) - \mathbb{P}\big((X_1-X_2)(Y_1 -Y_2)<0\big)

where {(X_1,Y_1)} and {(X_2,Y_2)} are independent and have the same distribution as {(X,Y)}. Clearly, {\tau} and {\rho} are the two most famous nonparametric measures of concordance. Despite the fact that {\tau} and {\rho} are both concordance measures, they quantify different aspects of the underlying dependence structure and may vary significantly. Since the early 1950s two universal inequalities between {\tau} and {\rho} are known – the first one goes back to Daniels (1950), the second one to Durbin and Stuart (1951):

\displaystyle \vert 3 \tau-2 \rho \vert \leq 1 \ \ \ \ \ (1)

\displaystyle \frac{(1+\tau)^2}{2} -1 \leq \rho \leq 1- \frac{(1-\tau)^2}{2} \ \ \ \ \ (2)

{The classical τ-ρ-region determined by the inequalities (1) and (2) and some copulas (distributing mass uniformly on the blue segments) for which the inequality by Durbin and Stuart is sharp. The red line depicts the true boundary of Ω.

Although both inequalities have been known for decades now, a full characterization of the exact {\tau}{\rho}-region {\Omega}, defined by

\Omega=\big\{(\tau(X,Y),\rho(X,Y)):\,\, X,Y \text{ continuous random variables}\big\},

was first established in our article. Working with copulas and so-called shuffles it was possible to describe all random variables {(X,Y)} with {(\tau(X,Y),\rho(X,Y)) \in \partial \Omega}, where {\partial \Omega} denotes the topological boundary of {\Omega}. Figure 0 depicts some distributions of uniformly distributed random variables {X,Y} for which we have {(\tau(X,Y),\rho(X,Y)) \in \partial \Omega} – for a small animation we refer to animation {\partial \Omega}. Although our main objective was a full characterization of {\Omega} the proofs produced an equally interesting by-product: For every point {(x,y) \in \Omega} there exist random variables {X,Y} and a measurable bijection {f: \mathbb{R} \rightarrow \mathbb{R}} with {Y=f(X)} such that {(\tau(X,Y),\rho(X,Y))=(x,y)} holds. In other words: (Mutually) completely dependent random variables cover all possible {\tau}{\rho}-constellations.

As pointed out in Section 6 in our paper, characterizing the exact {\tau}{\rho}-region for standard families of distributions (or, equivalently, copulas) may in some cases be even more difficult than determining {\Omega} was. The main reason for this fact is that not in all families we may find dense subsets consisting of elements for which {\tau} and {\rho} allow for handy formulas (as it is the case for shuffles). The author conjectures, however, that the classical Hutchinson-Lai inequalities are not sharp for the class of all extreme-value copulas and that it might be possible to derive sharper inequalities in the near future.