Mixtures, envelopes and hierarchical duality [by Nick Polson]

Many signal extraction and estimation problems in machine learning and statistics are solved either by regularisation methods or by probabilistic modelling. The chosen approach is often determined by who is doing the estimation; researchers typically apply one methodology or the other, not both. In our paper, Mixtures, Envelopes and Hierarchical Duality (JRSSB, 2016, pp. 701-727), we study the connection between regularisation and probabilistic modelling. We show that there are many cases in which we can construct a probabilistic approach that is equivalent to a regularisation approach. We call this equivalence “hierarchical duality” and explore how this duality can inform regularisation methods. For example, we show that the hierarchical dual of a fused lasso can inform the choice of penalty function. Thinking about the relationship between the two methods can thus inform the estimation process and improve signal extraction and estimation results.

Our analysis begins by recognising that, in many cases, both approaches can be described using a common set of building blocks. Regularisation requires the researcher to specify a measure of fit (or “empirical risk function”), which we denote by l(x), that measures the deviation of one’s expectation from observed reality. It also requires a regularisation penalty function, which we denote by \phi(x). The same functions appear in a probabilistic modelling framework, where the researcher specifies a likelihood and a prior: in this setup, l(x) and \phi(x) correspond to the negative logarithms of the likelihood and the prior distribution, respectively. Our paper provides conditions, covering a broad variety of practical applications, under which the two approaches are dual to one another. We also show how our hierarchical duality approach can provide useful insights into commonly used approaches to signal processing.

To illustrate the commonality that gives rise to hierarchical duality, consider a typical approach to regularisation and probabilistic modelling. Regularisation leads to an optimisation problem of the form

\underset{x \in \Re^d}{\text{minimise}}\ l(x) + \phi(x) \; . \qquad (1)

The probabilistic approach leads to a Bayesian hierarchical model

p(y \mid x) \propto \exp\{-l(x)\} \; , \quad p(x) \propto \exp\{ -\phi(x) \} \, .

The solution to the regularisation problem (1) corresponds to the posterior mode, \hat{x} = \arg\max_x \, p(x \mid y), where p(x \mid y) denotes the posterior distribution.
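
Written out, the duality is immediate: the posterior is proportional to the product of the two exponential factors, so its mode is exactly the minimiser in (1),

p(x \mid y) \propto \exp\{-l(x)\} \, \exp\{-\phi(x)\} = \exp\left[ -\{ l(x) + \phi(x) \} \right] \, , \qquad \arg\max_x \; p(x \mid y) = \arg\min_{x} \; \{ l(x) + \phi(x) \} \, .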

The archetypal example of regularisation is a regression with a least-squares log-likelihood subject to a penalty such as the L²-norm (ridge), whose probabilistic counterpart is a Gaussian model, or the L¹-norm (lasso), whose counterpart is a double exponential model. Our paper shows that the hierarchical dual of the former leads to a shrinkage estimator, while that of the latter yields a sparse solution commonly used for model selection.
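
Concretely, with a design matrix A and penalty weight \lambda (illustrative notation, not taken from the paper), the two cases read

l(x) = \tfrac{1}{2} \| y - Ax \|_2^2 \, , \quad \phi(x) = \lambda \| x \|_2^2 \;\Leftrightarrow\; p(x) \propto \exp\{ -\lambda \| x \|_2^2 \} \ \text{(Gaussian, ridge)} \, ,

\phi(x) = \lambda \| x \|_1 \;\Leftrightarrow\; p(x) \propto \exp\{ -\lambda \| x \|_1 \} \ \text{(double exponential, lasso)} \, .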

By exploring the dualities between regularisation and probabilistic modelling, we can address the two key questions that arise in any problem: (a) How do we specify effective penalties? and (b) How do we construct efficient algorithms?

Probabilistic modelling has addressed question (a). In the paper we study the connection between mixture distributions, where p(x)=\int_\Lambda p(x,\lambda)\,d\lambda, and envelope functions, where p(x)=\sup_\lambda p(x,\lambda), which profile out the latent variable \lambda. We show that there are many cases in which these two operations are dual to each other, and we study hierarchical duality for both exponential and Gaussian mixtures. Assessing the effectiveness of a penalty is best done probabilistically: out-of-sample predictive error, Bayes risk and classification error are statistical measures whose properties can be studied carefully. For example, it is well known that the regularisation penalty effects a favourable bias-variance trade-off, and uncertainty quantification, such as standard errors, has to be carried out probabilistically.
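
A familiar instance of the mixture representation (a classical result, with \lambda here denoting a scale parameter rather than the latent variable above) is the double exponential (lasso) prior, which arises as a Gaussian scale mixture with an exponential mixing distribution on the variance:

\frac{\lambda}{2} \, e^{-\lambda |x|} = \int_0^{\infty} \frac{1}{\sqrt{2 \pi \tau}} \, e^{-x^2/(2\tau)} \cdot \frac{\lambda^2}{2} \, e^{-\lambda^2 \tau / 2} \, d\tau \, .

Replacing the integral over \tau with a supremum gives the corresponding envelope, and the paper studies when the two constructions are dual.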

Constructing efficient, scalable algorithms has been the focus of the machine learning optimisation literature and addresses question (b). Representing a regularisation penalty as an envelope, e.g. \phi(x) = \sup_z \{ z x - \psi(z) \} (a Fenchel dual) or

\phi(x) = \sup_z \{ - \dfrac{1}{2} (x-z)^2 + \psi(z) \}

(a Geman-Reynolds-Yang dual), provides a way of constructing a proximal operator to solve equation (1). Proximal algorithms generalise the expectation-maximisation (EM) and majorise-minimise (MM) algorithms commonly used in statistics, and lead to the alternating direction method of multipliers (ADMM) algorithm and its variants.
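
As a minimal illustration of the proximal idea (a sketch in Python, not the paper's implementation), assume a squared-error loss l(x) = \tfrac{1}{2} \| y - Ax \|_2^2 and an L¹ penalty \phi(x) = \lambda \| x \|_1. The proximal operator \text{prox}_{\gamma \phi}(v) = \arg\min_z \{ \phi(z) + \tfrac{1}{2\gamma} \| z - v \|_2^2 \} is then elementwise soft-thresholding, and alternating a gradient step on l with a prox step on \phi gives the standard proximal gradient (ISTA) scheme for solving (1):

import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1: elementwise soft-thresholding.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient_lasso(A, y, lam, n_iter=500):
    # Solve min_x 0.5 * ||y - A x||^2 + lam * ||x||_1 by proximal gradient (ISTA).
    # Step size 1/L, with L the largest eigenvalue of A^T A, guarantees descent.
    L = np.linalg.norm(A, 2) ** 2
    gamma = 1.0 / L
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)                           # gradient step on l(x)
        x = soft_threshold(x - gamma * grad, gamma * lam)  # prox step on phi(x)
    return x

# Example usage: recover a sparse signal from noisy linear measurements.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20)
x_true[:3] = [2.0, -1.5, 1.0]
y = A @ x_true + 0.1 * rng.standard_normal(50)
print(proximal_gradient_lasso(A, y, lam=1.0))

Other smooth losses and other penalties with computable proximal maps change only the gradient and the prox step inside the loop.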

Our framework also allows for a sensitivity analysis, via calculation of the full regularisation path, and for acceleration techniques, at little extra cost. We provide a range of proximal operators, denoted by \text{prox}_{\gamma} f(x), in our applications. In particular, we show how to solve the robust fused lasso, non-linear quantile regression via trend filtering, and logit regression with double Pareto regularisation. Proximal methods will surely gain more interest from the statistical community given their flexibility and generality.

Research in this area has a bright future, as there are many other avenues to explore. Much like Gibbs sampling, convex optimisation for envelope penalties grew out of the image processing literature. One current area of interest is relating the last twenty years of research on the convergence of stochastic simulation methods to the optimisation methods that solve regularisation problems.