# Estimation of the False Discovery Proportion with Unknown Dependence

The effect of correlation among test statistics in large-scale multiple testing has attracted considerable attention in recent years. The procedures of Benjamini & Hochberg (1995, B-H) and Storey (2002) were developed for independent test statistics, and applying them to dependent statistics can lead to inaccurate false discovery control. It is now widely accepted that dependence information should be incorporated into the multiple testing procedure; the challenging question is how to do so. In our paper, we suppose that the observed data $\{\mathbf{X}_i\}_{i=1}^n$ are $p$-dimensional independent random vectors with ${\mathbf{X}_i\sim N_p(\mu,\Sigma)}$, where ${\Sigma}$ is unknown. The mean vector ${\mu=(\mu_1,\cdots,\mu_p)^T}$ is high-dimensional and sparse, but we do not know which coordinates are the non-vanishing signals. Consider the standardized test statistics

${Z_j=n^{-1/2}\sum_{i=1}^nX_{ij}/\widehat{\sigma}_j}$

where ${\widehat{\sigma}_j}$ is the sample standard deviation of the $j$-th coordinate. We aim to provide a good approximation of the False Discovery Proportion (FDP) for the detection of signals. If ${\Sigma}$ is known, Fan and his colleagues provided an accurate approximation of the FDP, which is a nonlinear function of the eigenvalues and eigenvectors of ${\Sigma}$. However, the problem of unknown dependence differs from the setting with known dependence in at least two fundamental ways.

(a) Impact through estimating marginal variances. When the population marginal variances of the observable random variables are unknown, they have to be estimated first for standardization. In that case, the commonly used test statistics follow a ${t}$ distribution with dependence rather than the multivariate normal distribution considered in Fan, Han & Gu (2012).

(b) Impact through estimating eigenvalues and eigenvectors. Even if the population marginal variances of the observable random variables are known, estimation of the eigenvalues and eigenvectors can still significantly affect the FDP approximation. In various situations, the FDP approximation can perform poorly even if a researcher chooses the “best” estimator for the unknown matrix.

In our paper, we consider a generic estimator ${\widehat{\Sigma}}$. The major regularity conditions for a good FDP approximation are placed on the first ${k}$ eigenvalues and eigenvectors of ${\widehat{\Sigma}}$. The ${k}$ eigenvectors ${\{\widehat{\gamma}_i\}}$ need to be consistently estimated, but the ${k}$ eigenvalues ${\{\widehat{\lambda}_i\}}$ need not be consistent estimates. This result can be further connected with covariance matrix estimation, where the dependence structures of ${\Sigma}$ can include banded or sparse covariance matrices and (conditional) sparse precision matrices.
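As a rough illustration of the quantities discussed above, the following Python sketch computes the standardized statistics ${Z_j}$ and extracts the leading ${k}$ eigenvalues and eigenvectors of an estimated covariance matrix. It is only a sketch under stated assumptions: the plain sample covariance stands in for the generic estimator ${\widehat{\Sigma}}$ (other estimators, such as banded or sparse ones, could be substituted), and the function names, the simulated data, and the choice `k = 2` are hypothetical.

```python
import numpy as np

def standardized_statistics(X):
    """Z_j = n^{-1/2} * sum_i X_ij / sigma_hat_j for an (n, p) data matrix X."""
    n = X.shape[0]
    # Sample standard deviation of each coordinate (ddof=1 gives the unbiased variance).
    sigma_hat = X.std(axis=0, ddof=1)
    return X.sum(axis=0) / (np.sqrt(n) * sigma_hat)

def top_k_eigen(matrix, k):
    """Largest k eigenvalues (descending) and matching eigenvectors of a symmetric matrix."""
    eigvals, eigvecs = np.linalg.eigh(matrix)  # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]
    return eigvals[order], eigvecs[:, order]

# Simulated data for illustration only; n, p and k are arbitrary choices.
rng = np.random.default_rng(0)
n, p, k = 50, 200, 2
X = rng.standard_normal((n, p))

Z = standardized_statistics(X)

# Assumption: the sample covariance is used as a stand-in for a generic estimator.
cov_hat = np.cov(X, rowvar=False)
lam_hat, gamma_hat = top_k_eigen(cov_hat, k)
```

Here `gamma_hat` holds the estimated eigenvectors as columns and `lam_hat` the corresponding eigenvalues. Whether either is a good estimate depends on the estimator and the dependence structure, which is the issue the regularity conditions address.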

Within this framework, we also consider a special example to illustrate our method, in which the data are sampled from an approximate factor model, a model that encompasses most practical situations. For FDP approximation in practice, we recommend the POET-PFA method, which can be easily implemented with the R package “pfa” (version 1.1).
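To make the target quantity concrete, the sketch below simulates data with factor-driven dependence and computes the realized FDP at a fixed p-value threshold. This is not the POET-PFA method and does not use the “pfa” package. It only shows what is being approximated, and it can do so because the simulation knows which coordinates are true nulls, which is unavailable in practice. The one-factor design, the uniform loadings, the signal size, the threshold `t = 0.01`, the normal-approximation p-values, and all function names are assumptions made for illustration.

```python
import math
import numpy as np

def two_sided_p_values(z):
    """Two-sided p-values under a standard normal reference: 2 * (1 - Phi(|z|))."""
    return np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])

def realized_fdp(p_values, is_null, t):
    """Proportion of rejections (p <= t) that are true nulls; defined as 0 with no rejections."""
    rejected = p_values <= t
    n_rejected = int(rejected.sum())
    if n_rejected == 0:
        return 0.0
    return float((rejected & is_null).sum()) / n_rejected

# Hypothetical one-factor design: X_i = mu + b * f_i + noise_i.
rng = np.random.default_rng(1)
n, p, n_signals = 100, 500, 20
mu = np.zeros(p)
mu[:n_signals] = 1.0                      # sparse mean vector: the first 20 coordinates are signals
loadings = rng.uniform(-1.0, 1.0, size=p)  # assumed factor loadings
factors = rng.standard_normal(n)
noise = rng.standard_normal((n, p))
X = mu + np.outer(factors, loadings) + noise

# Standardized statistics as defined in the text.
Z = X.sum(axis=0) / (np.sqrt(n) * X.std(axis=0, ddof=1))

p_values = two_sided_p_values(Z)
fdp = realized_fdp(p_values, mu == 0, t=0.01)
```

Repeating the simulation with different seeds gives a sense of how much the realized FDP can vary from one data set to another when the statistics share a common factor.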