Hamiltonian Monte Carlo sampling in Bayesian empirical likelihood computation

While working on the Series B blog yesterday I noticed this paper by Chaudhuri et al. on Hamiltonian Monte Carlo and empirical likelihood: how exciting!!! Here is the abstract of the paper:

We consider Bayesian empirical likelihood estimation and develop an efficient Hamiltonian Monte Carlo method for sampling from the posterior distribution of the parameters of interest. The method proposed uses hitherto unknown properties of the gradient of the underlying log-empirical-likelihood function. We use results from convex analysis to show that these properties hold under minimal assumptions on the parameter space, prior density and the functions used in the estimating equations determining the empirical likelihood. Our method employs a finite number of estimating equations and observations but produces valid semi-parametric inference for a large class of statistical models including mixed effects models, generalized linear models and hierarchical Bayes models. We overcome major challenges posed by complex, non-convex boundaries of the support routinely observed for empirical likelihood which prevent efficient implementation of traditional Markov chain Monte Carlo methods like random-walk Metropolis–Hastings sampling etc. with or without parallel tempering. A simulation study confirms that our method converges quickly and draws samples from the posterior support efficiently. We further illustrate its utility through an analysis of a discrete data set in small area estimation.

It is of particular interest to me [disclaimer: I was not involved in the review of this paper!] as we worked on ABC thru empirical likelihood, which is about the reverse of the current paper in terms of motivation: when faced with a complex model, we substitute an empirical-likelihood version for the real thing, run simulations from the prior distribution and use the empirical likelihood as a proxy, with possible intricacies when the data is not iid (an issue we also met with Wasserstein distances). In this paper the authors instead take an empirical likelihood as their starting point and derive an HMC algorithm to sample from the resulting posterior. The idea is striking in that, by nature, an empirical likelihood is not a very smooth object and hence does not seem amenable to producing gradients and Hessians, as illustrated by Figure 1 in the paper, which is so spiky in places that one may wonder about the representativeness of such graphs.

I have always had a persistent worry about the ultimate validity of treating the empirical likelihood as a genuine likelihood: it is the result of an optimisation problem, the approximating empirical distribution has a finite (data-dependent) support and is hence completely orthogonal to the true distribution, and the likelihood function is zero outside the convex hull of the defining estimating equations… (For one thing, this empirical likelihood is always bounded by one, but this may be irrelevant after all!)

The computational difficulty in handling the empirical likelihood starts with its support. Eliminating values of the parameter for which this empirical likelihood is zero amounts to checking whether zero belongs to the above convex hull, a hard (NP-hard?) problem. (Although I do not understand why the authors dismiss the token observations of Owen and others. The argument that Bayesian analysis does more than maximising a likelihood seems to confuse the empirical likelihood as the product of a maximisation step with the empirical likelihood as a function of the parameter that can be used like any other function.)
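For concreteness, here is a minimal sketch (mine, not the paper's) of that feasibility check in Python, assuming the estimating functions g(x_i, θ) have already been evaluated at the current parameter value: the origin lies in the convex hull of the g_i if and only if some convex combination of them equals zero, which is a plain linear programme.

```python
import numpy as np
from scipy.optimize import linprog

def zero_in_convex_hull(G):
    """Check whether the origin lies in the convex hull of the rows of G.

    G is an (n, d) array whose i-th row is the estimating function
    g(x_i, theta) evaluated at the current parameter value.  We look for
    weights w >= 0 with sum(w) = 1 and G'w = 0, a pure feasibility problem
    solved here as a linear programme with a zero objective.
    """
    n, d = G.shape
    A_eq = np.vstack([G.T, np.ones((1, n))])      # G'w = 0 and 1'w = 1
    b_eq = np.concatenate([np.zeros(d), [1.0]])
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n, method="highs")
    return res.status == 0                        # 0 means a feasible w was found

# Example with the mean-zero estimating equation g(x_i, theta) = x_i - theta:
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))
print(zero_in_convex_hull(x - x.mean(axis=0)))    # True: theta at the sample mean
print(zero_in_convex_hull(x - 10.0))              # False: theta far outside the hull
```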

In the simple regression example (pp. 297-299), I find the choice of the moment constraints puzzling, in that they address the mean of the white noise (zero) and its covariance with the regressors (zero too). Puzzling because my definition of the regression model is conditional on the regressors and hence does not imply anything about their distribution; in a sense this is another model. But I also note that the approach focuses on the distribution of the reconstituted white noises, as we did in the PNAS paper. (The three examples processed in the paper are all simple and could be processed by regular MCMC, thus making the preliminary step of calling for an empirical likelihood somewhat artificial, unless I missed the motivation. The paper also does not seem to discuss the impact of the choice of the moment constraints or the computational burden induced by a function that is itself the result of a maximisation problem.)
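To make the "function that is itself the result of a maximisation problem" concrete, here is a rough sketch of how the profile log-empirical-likelihood is usually evaluated through Owen's dual formulation, applied to regression-type estimating equations E[ε]=0 and E[xε]=0; the toy data, the log* smoothing threshold of 1/n and the generic BFGS call are my own choices, not the authors'.

```python
import numpy as np
from scipy.optimize import minimize

def log_el_ratio(G):
    """Profile log empirical likelihood ratio for the estimating functions in G.

    G is an (n, d) array with rows g(x_i, theta).  Owen's dual representation
    is used: log ELR = min over lambda of -sum_i log(1 + lambda' g_i), with the
    logarithm replaced below 1/n by a quadratic extension (the usual log*).
    """
    n, d = G.shape
    eps = 1.0 / n

    def log_star(z):
        # log(z) above eps, smooth quadratic continuation below eps
        return np.where(z >= eps,
                        np.log(np.maximum(z, eps)),
                        np.log(eps) - 1.5 + 2.0 * z / eps - 0.5 * (z / eps) ** 2)

    def dual(lam):
        return -np.sum(log_star(1.0 + G @ lam))

    lam = minimize(dual, np.zeros(d), method="BFGS").x
    u = 1.0 + G @ lam
    if np.any(u <= 0):            # theta outside the convex-hull support
        return -np.inf
    return -np.sum(np.log(u))

# Toy regression data with estimating equations E[eps] = 0 and E[x * eps] = 0:
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(size=200)

def G_reg(beta):
    eps_hat = y - beta[0] - beta[1] * x
    return np.column_stack([eps_hat, x * eps_hat])

print(log_el_ratio(G_reg(np.array([1.0, 2.0]))))  # close to the maximum value of 0
print(log_el_ratio(G_reg(np.array([0.0, 0.0]))))  # much lower, possibly -inf
```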

A significant part of the paper is dedicated to the optimisation problem and the exclusion of the points on the boundary, which sounds like a non-problem in continuous settings. However, this appears to be of importance for running an HMC, as the sampler cannot be allowed to leave the support (without token observations). In principle, HMC should not leave this support since the gradient diverges at the boundary, but in practice the leapfrog approximation may take the path outside. I would have (naïvely?) suggested rejecting moves when this happens and starting again, but the authors consider that proper choices of the calibration factors of HMC can avoid this problem. Which seems to induce a practical issue by turning the algorithm into an adaptive version.
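For the record, here is a bare-bones sketch of the naïve reject-and-restart fix I have in mind, not the calibration strategy adopted in the paper; it assumes user-supplied log_post and grad_log_post functions, with log_post returning -∞ outside the empirical-likelihood support.

```python
import numpy as np

def hmc_step(theta, log_post, grad_log_post, step, n_leap, rng):
    """One HMC transition that rejects any leapfrog path leaving the support.

    log_post is assumed to return -np.inf outside the empirical-likelihood
    support (e.g. when the convex-hull condition fails) and grad_log_post its
    gradient inside; rejecting on exit is the naive fix discussed above, not
    the calibration strategy of the paper.
    """
    def grad_U(t):
        return -grad_log_post(t)                 # potential energy U = -log posterior

    p = rng.normal(size=theta.shape)
    logp0 = log_post(theta) - 0.5 * p @ p        # minus the initial Hamiltonian
    th = theta.copy()

    p = p - 0.5 * step * grad_U(th)              # initial half step for momentum
    for _ in range(n_leap):
        th = th + step * p
        if not np.isfinite(log_post(th)):        # left the support: reject outright
            return theta
        p = p - step * grad_U(th)
    p = p + 0.5 * step * grad_U(th)              # undo half of the last momentum step

    logp1 = log_post(th) - 0.5 * p @ p           # minus the final Hamiltonian
    return th if np.log(rng.uniform()) < logp1 - logp0 else theta
```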

As a last point, I would have enjoyed seeing a comparison of its performance with our (A)BCel version, which would have been straightforward to implement in the simple examples handled by the paper. (This could be a neat undergraduate project for next year!)

A Coverage Theory for Least Squares [by Simone Garatti]

Least squares, in its different forms, is probably the most used approach to construct models from a sample of observations. In the article "A Coverage Theory for Least Squares", we refer to least squares more generally as a methodology to make decisions. To give a concrete example, consider the problem of deciding the location of a station that serves a population; this can be, e.g., a laundry, a library or a supermarket. After obtaining the location of the homes of a sample of the population, according to a least-squares approach the service station can be placed at the barycentric position of the home locations, a decision that minimizes the sum of squared home-service distances.

In the article, our main interest lies in assessing the quality of least-squares decisions: once the location of the service station has been determined based on a sample, one can ask how good the decision made is for the rest of the population. For example, one can evaluate the levels of satisfaction, as measured by the home-service distance, for the individuals in the sample, and ask how representative these satisfaction levels are of the level of satisfaction of the whole population. Say, e.g., that the individual who is the 10th furthest away from the station in a sample of 100 has to walk 0.4 miles to go to the station: what is the proportion of the population that has to walk more than 0.4 miles to go to the station? Answering this question and related ones has an important impact on the usage of the least-squares method, and even on the acceptance of the solution obtained from it (if too many people have to walk for too long, the decision to place the station in the least-squares location might be rejected in favor of constructing two stations instead of one at an extra cost). This problem has so far received little attention from the statistical community and this article tries to fill this gap by presenting a new theory that is applicable to least-squares decisions across diverse fields.

In the service location problem, the empirical proportion of members in the sample that pay a cost above a given value is not a valid statistic for quantifying the proportion of the whole population whose cost is above that value. This is not surprising, since the least-squares solution introduces a bias towards making the cost of the members in the sample small. On the other hand, it is shown in the article that, by introducing suitable margins, valid and tight statistics can be obtained which hold true distribution-free, that is, they can be applied without any extra knowledge of how the population is distributed.
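A quick simulation makes this bias visible (purely illustrative, and not using the margins derived in the article): homes are drawn from a standard bivariate normal, the station is placed at the sample barycentre, and the in-sample proportion of individuals beyond the distance of the 10th furthest sampled individual is compared with the corresponding proportion in a large fresh sample standing in for the population.

```python
import numpy as np

# Station at the barycentre of 100 sampled homes; compare the in-sample and
# out-of-sample proportions of people further away than the 10th furthest
# sampled individual.  Purely illustrative; not the margins of the article.
rng = np.random.default_rng(0)
n, n_rep = 100, 2000
in_sample, out_sample = [], []
for _ in range(n_rep):
    homes = rng.normal(size=(n, 2))               # sampled home locations
    station = homes.mean(axis=0)                  # least-squares location
    d = np.linalg.norm(homes - station, axis=1)
    thr = np.sort(d)[-10]                         # distance of the 10th furthest
    in_sample.append(np.mean(d > thr))            # exactly 9/100 by construction
    pop = rng.normal(size=(100000, 2))            # fresh draws stand in for the population
    out_sample.append(np.mean(np.linalg.norm(pop - station, axis=1) > thr))

print(np.mean(in_sample), np.mean(out_sample))    # the second is slightly larger on average
```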

Parameter stability and semiparametric inference in time-varying ARCH models [by Lionel Truquet]

Forecasting financial data with autoregressive and conditionally heteroscedastic processes has received considerable attention in the literature. However, most of the models used in practice are stationary and have short-memory properties, which is not compatible with the slow decay of the empirical autocorrelations observed for these series. Nonstationary models with time-varying parameters have therefore been introduced for dealing with financial returns. These models avoid the criticism levelled at stationary long-memory processes, which are always fitted over a long history of data. See for instance Mikosch and Stărică (2004) for a discussion of some nonstationarities and a theoretical basis for a possible explanation of the long-memory properties usually found for financial time series. Several types of nonstationary time series models have been introduced: for instance, Stărică and Granger (2005) consider a simple model of independent random variables with a time-varying unconditional variance, Engle and Rangel (2008) a multiplicative model combining a time-varying unconditional variance with a GARCH component, whereas Dahlhaus and Subba Rao (2006) or Fryzlewicz et al. (2008) consider an ARCH model with time-varying parameters.
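As a point of reference, a time-varying ARCH(1) of the kind considered by Dahlhaus and Subba Rao (2006) specifies the conditional variance through coefficient functions of rescaled time t/T; the following minimal simulation (with an illustrative sinusoidal intercept and a constant lag coefficient, choices made only for the sake of the example) shows how a slowly varying unconditional variance can mimic long-memory behaviour in the squared returns.

```python
import numpy as np

def simulate_tvarch1(T, a0, a1, rng):
    """Simulate X_t = sigma_t * eps_t with sigma_t^2 = a0(t/T) + a1(t/T) * X_{t-1}^2.

    a0 and a1 are coefficient functions on [0, 1] (rescaled time), as in
    locally stationary tvARCH models; the choices below are only illustrative.
    """
    x = np.zeros(T)
    for t in range(1, T):
        u = t / T
        sigma2 = a0(u) + a1(u) * x[t - 1] ** 2
        x[t] = np.sqrt(sigma2) * rng.normal()
    return x

rng = np.random.default_rng(0)
x = simulate_tvarch1(
    T=2000,
    a0=lambda u: 0.1 + 0.4 * np.sin(np.pi * u) ** 2,  # slowly varying intercept
    a1=lambda u: 0.2,                                  # constant lag coefficient
    rng=rng,
)
# The slowly varying unconditional variance produces slowly decaying empirical
# autocorrelations in x**2, mimicking long memory.
```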

Deciding which model is the most compatible with a given data set is a problem of major importance, because inference on time-varying parameters is a difficult task and overly complex models can be hard to interpret or unstable.

In this paper, we develop a variety of tools for testing parameter stability in ARCH processes and in particular for deciding between the three types of models discussed above. First, we show that non-time-varying parameters in ARCH processes can be estimated at the usual root-{T} convergence rate and we also study the asymptotic semiparametric efficiency of our procedure. Then we develop two statistical tests: the first one tests whether a subset of parameters is non-time-varying and the second one whether the estimates of the lag coefficients are significant. We also provide an information criterion for selecting the number of lags in time-varying ARCH processes.

Our tools are illustrated with three data sets. For all three series, we found that the assumption of a constant intercept is clearly rejected over the period under study. The conclusion for the lag coefficients depends on the series. For instance, the daily exchange rate between the US Dollar and the Euro seems compatible with the model of Stărică and Granger (2005), with no second-order dynamics. On the other hand, the absence of second-order dynamics is rejected for the FTSE index, though allowing for nonstationarity leads to much smaller lag estimates than in the stationary case. In all cases, a time-varying unconditional variance seems to make an important contribution to volatility and has to be taken into account.

A Bayesian decision theoretic model of sequential experimentation with delayed response [by Stephen Chick]

When is it cost-effective to stop a sequential clinical trial?

Interest in sequential clinical trial design is growing, as health care funders and pharmaceutical companies strive to bring beneficial therapies to patients sooner and stop researching inferior ones earlier. But when is it `cost-effective’ to stop a sequential clinical trial? Might cost-effectiveness criteria help investigators decide whether or not to conduct a sequential trial, or one of a fixed sample size, at design stage?

In A Bayesian Decision Theoretic Model of Sequential Experimentation with Delayed Response (JRSSB, 2017, doi:10.1111/rssb.12225), we model the optimal stopping of a sequential clinical trial which accounts for:

  1. the cost of carrying out research, of treatment and of switching health technologies;
  2. benefits accruing to trial participants and the wider study population;
  3. delay in observing the primary end point.

We seek the policy which maximises the expected benefits of health technology adoption, minus the cost of carrying out the trial itself.

We do this for a two-armed clinical trial in which realisations on the primary outcome start to arrive while the investigator has the option to commit further patients to the trial. The incremental net monetary benefit of the new health technology versus standard, X, is assumed to have a normal distribution with unknown mean W and known variance σ² (we also model unknown sampling variance). We place a Gaussian prior on W and update its expected value sequentially as realisations arrive with delay.
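Concretely, writing the prior variance as σ²/n₀ for an effective prior sample size n₀ (a parameterisation chosen here for illustration and not necessarily that of the paper), each arriving realisation updates the posterior mean and variance as in the following conjugate-update sketch.

```python
import numpy as np

def update_posterior(mu0, n0, sigma2, xs):
    """Sequentially update a Gaussian prior on the mean W of the incremental net benefit.

    The prior has mean mu0 and variance sigma2 / n0 (n0 acting as an effective
    prior sample size); each realisation x has known variance sigma2.  This
    parameterisation is illustrative, not necessarily the paper's.
    """
    mu, n = mu0, n0
    for x in xs:
        mu = (n * mu + x) / (n + 1)    # posterior mean after one more realisation
        n += 1                         # posterior variance is sigma2 / n
    return mu, sigma2 / n

mu, var = update_posterior(mu0=0.0, n0=4.0, sigma2=1.0, xs=[0.3, -0.1, 0.5])
print(mu, var)
```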

The key stage of the trial is when realisations are arriving from subjects `in the pipeline’ at the same time as the investigator is deciding whether or not to randomise further subjects to the two arms. Our proposed trial design is therefore fully sequential, in contrast to designs which comprise a small number of stages. As each realisation arrives, Bayesian updating is used to form a predictive distribution which values the expected reward from stopping immediately, following up the remaining `pipeline subjects’ and choosing the best technology. This expected reward is compared with that from making another pairwise allocation at defined sampling cost, and acting optimally thereafter. Dynamic programming yields the optimal policy in (sample size x posterior mean) space, based on a diffusion process approximation, defining upper and lower stopping boundaries for the trial’s optimal stopping problem. The optimal trial design depends on the extent of the delay, and the resulting value function may be used to define three optimal decisions at trial design stage: `do not experiment’/`fixed sample size experiment’/`sequential experiment’.
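The following stripped-down sketch conveys how such boundaries arise from backward induction on a (sample size × posterior mean) grid; it ignores delay, discounting and the diffusion approximation, and assumes a caricature terminal reward P·max(μ, 0) for adopting the better technology in a population of size P with a per-pair sampling cost c, so it illustrates the mechanism rather than the model of the paper.

```python
import numpy as np

def stopping_boundaries(P=100.0, c=1.0, sigma2=1.0, n0=10.0, T=100,
                        mu_grid=np.linspace(-2.0, 2.0, 401)):
    """Backward induction for a caricature of the sequential stopping problem.

    State: (number of pairwise allocations t, posterior mean mu of W).
    Stopping earns P * max(mu, 0); continuing costs c and moves mu by a
    mean-zero normal increment with variance sigma2 / ((n0 + t) * (n0 + t + 1)),
    the usual preposterior variance.  Delay, discounting and the diffusion
    approximation used in the paper are all ignored here.
    """
    nodes, weights = np.polynomial.hermite_e.hermegauss(21)  # for E[f(Z)], Z ~ N(0,1)
    weights = weights / np.sqrt(2.0 * np.pi)
    v_stop = P * np.maximum(mu_grid, 0.0)
    V = v_stop.copy()                                        # value at the horizon
    lower, upper = np.full(T, np.nan), np.full(T, np.nan)
    for t in range(T - 1, -1, -1):
        sd = np.sqrt(sigma2 / ((n0 + t) * (n0 + t + 1)))
        cont = -c + sum(w * np.interp(mu_grid + sd * z, mu_grid, V)
                        for z, w in zip(nodes, weights))     # expected next-stage value
        V = np.maximum(v_stop, cont)
        go = cont > v_stop                                   # continuation region
        if go.any():
            lower[t], upper[t] = mu_grid[go].min(), mu_grid[go].max()
    return lower, upper

lower, upper = stopping_boundaries()
print(lower[0], upper[0])   # continuation interval for the posterior mean at t = 0
```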

Monte Carlo simulations show that, as expected, the policy outperforms alternative, non-sequential, trial designs in terms of the expected benefit of health technology adoption, net of trial costs. But they also show that the expected sample size of the optimal Bayes sequential policy can be greater than, equal to, or less than, that of comparator designs. Why? Because the optimal policy achieves the sample size which appropriately balances the expected benefit to patients with the cost of learning during the trial. The simulations also show that the policy performs well when judged according to the probability of correctly selecting the best technology. And they show how the policy and its performance change with changes in key parameters such as the rate of recruitment, the size of the population to benefit and the cost of conducting the trial.

With increasing policy interest in `value-based’ health care, we hope that our work will prompt both applied and more theoretically-inclined researchers to dig deeper. Please check out our paper, supplementary material and code and contribute to our B’Log!

Continuous ARMA random fields [by Yasumasa Matsuda]

[This blog entry refers to Continuous auto-regressive moving average random fields on ℝⁿ by Peter J. Brockwell and Yasumasa Matsuda]

We define a new family of random fields on {{\mathbb{R}}^n} by extending the kernel representation of a class of non-causal continuous time auto-regressive and moving average (CARMA) processes on {{\mathbb R}}. (Causality has no natural analogue for data indexed by {{\mathbb{R}}^n} with {n>1}.) Suppose that {a_*(z)=\prod_{i=1}^p(z-\lambda_i)} and {b_*(z)=\prod_{i=1}^q(z-\mu_i)} are polynomials with real coefficients and {q<p}. Suppose also that each of the zeros, {\lambda_1,\ldots,\lambda_p}, of {a_*} has multiplicity one and strictly negative real part and that the zeros, {\mu_1,\ldots,\mu_q}, of {b_*} have negative real parts. The strictly stationary CARMA{(p,q)} process with autoregressive polynomial {a_*}, moving average polynomial {b_*}, and driven by the Lévy process {L}, is the strictly stationary solution of the (suitably interpreted) formal stochastic differential equation,

\displaystyle a_*(D)Y(t)=b_*(D)DL(t),

where {D} denotes differentiation with respect to {t}. If {E[\max(0,\log |L(1)|)]<\infty}, the solution (see Brockwell and Lindner (2009)) is

\displaystyle Y(t)=\int_{\mathbb{R}} g(t-u)dL(u)\ \ \ \ \ (1)

where

\displaystyle g(t)=\sum_{i=1}^p \frac{b_*(\lambda_i)}{a_*^{'}(\lambda_i)}e^{\lambda_i t}{\bf 1}_{(0,\infty)}(t).

If we introduce new polynomials {a(z)=\prod_{i=1}^p(z^2-\lambda_i^2)} and {b(z)=\prod_{i=1}^q(z^2-\mu_i^2)}, then the strictly stationary non-causal CARMA{(2p,2q)} process with autoregressive polynomial {a}, moving average polynomial {b}, and driven by the Lévy process {L}, is given by (1) with

\displaystyle g(t)=\sum_{i=1}^p \frac{b(\lambda_i)}{a^{'}(\lambda_i)}e^{\lambda_i t}{\bf 1}_{(0,\infty)}(t) - \sum_{i=1}^p \frac{b(-\lambda_i)}{a^{'}(-\lambda_i)}e^{-\lambda_i t}{\bf 1}_{(-\infty,0)}(t) =\sum_{\lambda:a_*(\lambda)=0}\frac{b(\lambda)}{a^{'}(\lambda)}e^{\lambda |t|}.

This {(p+q)}-parameter non-causal CARMA kernel extends naturally to a kernel on {{\mathbb{R}}^n} by replacing {|t|, t\in{\mathbb{R}}}, by {||{t}||, {t}\in\mathbb{R}^n}, the Euclidean norm of {{t}}. This leads to the definition of an isotropic CARMA random field {S} on {{\mathbb R}^n} as

\displaystyle S_n(t)=\int_{{\mathbb R}^n}g_n(t-u)dL(u), ~t\in{\mathbb{R}}^n,

where

\displaystyle g_n(t)=\sum_{\lambda:a_*(\lambda)=0}\frac{b(\lambda)}{a^{'}(\lambda)}e^{\lambda ||{t}||}, t\in{\mathbb{R}}^n,\ \ \ \ \ (2)

and {L} is a Lévy sheet on {{\mathbb{R}}^n}. An anisotropic CARMA random field is obtained by replacing the kernel (2) in the definition of {S_n} by

\displaystyle h_n(t):=g_n(\Lambda R t), t\in{\mathbb{R}}^n,\ \ \ \ \ (3)

where {R} is an {n\times n} orthogonal matrix with determinant 1 and {\Lambda} is a strictly positive definite {n\times n} diagonal matrix. Under the assumption that the Lévy sheet has finite second moments, we determine the mean, spectral density and covariance function of {S_n}. As in the case {n=1}, the parameterization by {\lambda_1,\ldots,\lambda_p} and {\mu_1,\ldots,\mu_q} permits a very wide range of possible correlation functions for {S_n} and the choice of Lévy sheet allows for a very wide range of marginal distributions. The use of a compound Poisson sheet in particular makes simulation of a field with specified polynomials {a} and {b} a trivial matter and, from a second-order point of view, incurs no loss of generality. In this case {S_n} can be expressed as

\displaystyle S_n(t)=\sum_{i=1}^\infty g_n(t-X_i)Y_i,\ \ \ \ \ (4)

where {X_i} denotes the location of the {i^{\rm th}} unit point mass of a Poisson random measure on {{\mathbb{R}}^n} and {\{Y_i\}} is a sequence of independently and identically distributed random variables independent of {\{X_i\}}. For estimation based on observations at irregularly spaced locations in {\mathbb R^n} we use Bayesian MCMC with a compound Poisson sheet to provide the prior distribution for the knot locations {X_i}. Kriging is performed in the course of the MCMC procedure.
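As an illustration of how (2) and (4) fit together numerically, the sketch below evaluates the kernel from the zeros {\lambda_i} and {\mu_j} and simulates an isotropic CARMA(2,0)-type field on ℝ² driven by a compound Poisson sheet on a square window; the window, the Poisson rate, the Gaussian jump sizes and the neglect of edge effects are all choices made only for the example, and the Bayesian MCMC and kriging steps of the paper are not attempted.

```python
import numpy as np

def carma_kernel(lams, mus):
    """Kernel g_n(t) = sum over the zeros lambda of a_* of b(lambda)/a'(lambda) exp(lambda ||t||),
    with a(z) = prod_i (z^2 - lam_i^2) and b(z) = prod_j (z^2 - mu_j^2)."""
    lams = np.asarray(lams, dtype=complex)
    mus = np.asarray(mus, dtype=complex)

    def b(z):
        return np.prod(z ** 2 - mus ** 2)

    def a_prime(z):
        # derivative of prod_i (z^2 - lam_i^2), by the product rule
        return sum(2.0 * z * np.prod(np.delete(z ** 2 - lams ** 2, i))
                   for i in range(len(lams)))

    coeffs = np.array([b(lam) / a_prime(lam) for lam in lams])

    def g(t):
        r = np.linalg.norm(np.atleast_2d(t), axis=1)
        return np.real(np.exp(np.outer(r, lams)) @ coeffs)

    return g

# Isotropic CARMA(2,0)-type field on R^2 driven by a compound Poisson sheet as
# in (4): Poisson locations X_i on a square window, i.i.d. N(0,1) marks Y_i.
rng = np.random.default_rng(0)
g = carma_kernel(lams=[-1.0], mus=[])
side, rate = 20.0, 5.0
n_points = rng.poisson(rate * side ** 2)
X = rng.uniform(0.0, side, size=(n_points, 2))   # point locations of the Poisson measure
Y = rng.normal(size=n_points)                    # i.i.d. jump sizes

def field(t):
    return np.sum(g(t - X) * Y)                  # S_n(t) = sum_i g_n(t - X_i) Y_i

print(field(np.array([10.0, 10.0])))
```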