Reducing Reparameterization Gradient Variance
23 May 2017
Nick Foti, Alex D’Amour, Ryan Adams, and I just arxiv’d a paper that focuses on reducing the variance of the reparameterization gradient estimator (RGE) often used in the context of black-box variational inference. I’ll give a high level summary here, but I encourage you to check out the paper and the code.
Reparameterization gradients are often used when the objective of an optimization problem is written as an expectation. A common example is black-box variational inference for approximating an intractable posterior distribution $p(z \mid x)$ with a tractable family $q_\lambda(z)$. The objective we typically maximize is the evidence lower bound (ELBO)

$$\mathcal{L}(\lambda) = \mathbb{E}_{q_\lambda(z)}\left[ \ln p(x, z) - \ln q_\lambda(z) \right].$$
Using gradient-based optimization requires computing the gradient with respect to the variational parameters $\lambda$,

$$\nabla_\lambda \mathcal{L}(\lambda) = \nabla_\lambda \, \mathbb{E}_{q_\lambda(z)}\left[ \ln p(x, z) - \ln q_\lambda(z) \right],$$

which itself can be expressed as an expectation. This expectation is in general analytically intractable, but we can still use gradient-based optimization if we can compute an unbiased estimate of the gradient, $\hat{g}$, such that $\mathbb{E}[\hat{g}] = \nabla_\lambda \mathcal{L}(\lambda)$.
The reparameterization gradient estimator (RGE) is one such unbiased estimator. The RGE uses the “reparameterization trick” to turn a Monte Carlo ELBO estimator into a Monte Carlo ELBO gradient estimator. Recall that the reparameterization gradient estimator depends on a differentiable map, $z = T(\epsilon; \lambda)$, that transforms some seed randomness, $\epsilon \sim p(\epsilon)$, such that $T(\epsilon; \lambda)$ is distributed according to $q_\lambda(z)$. The RGE can be written

$$\hat{g}(\epsilon) = \nabla_\lambda \left[ \ln p(x, T(\epsilon; \lambda)) - \ln q_\lambda(T(\epsilon; \lambda)) \right], \quad \epsilon \sim p(\epsilon).$$
We can easily form an unbiased $L$-sample Monte Carlo estimator of $\nabla_\lambda \mathcal{L}(\lambda)$ by drawing samples $\epsilon_1, \dots, \epsilon_L \sim p(\epsilon)$ and computing

$$\hat{g}_L = \frac{1}{L} \sum_{\ell=1}^{L} \hat{g}(\epsilon_\ell).$$
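To make this concrete, here is a minimal sketch of the $L$-sample estimator for a hypothetical one-dimensional toy model where everything is analytic (the model, parameter values, and function names below are all illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: ln p(x, z) = -z^2 / 2 + const (a standard normal target),
# with a Gaussian variational family q_lambda = N(m, s^2) and
# reparameterization z = T(eps; lambda) = m + s * eps, eps ~ N(0, 1).
m, s = 1.5, 0.8

def grad_m_single(eps):
    """Single-sample reparameterization gradient of the ELBO w.r.t. m.

    d/dm [ln p(x, m + s*eps) - ln q(m + s*eps)] = -(m + s*eps),
    since ln q(T(eps; lambda)) does not depend on m for this family.
    """
    return -(m + s * eps)

# L-sample Monte Carlo estimator: average the single-sample RGE
# over L independent draws of the seed randomness.
L = 10_000
eps = rng.standard_normal(L)
g_hat = grad_m_single(eps).mean()

print(g_hat)  # close to the exact gradient, E[-(m + s*eps)] = -m = -1.5
```

The estimator is unbiased for any $L$; averaging over more seeds only shrinks its variance, which is exactly the quantity the paper targets.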
The utility of this estimator is tied to its variance — generally, the higher the variance, the slower the optimization converges.
We approach this variance reduction task by examining the generating process of the RGE, $\hat{g}(\epsilon)$. To see what I mean, consider a single-sample estimator as a function of the seed randomness $\epsilon$ (see this post for details)

$$\epsilon \sim p(\epsilon), \qquad \hat{g}(\epsilon) = \nabla_\lambda \left[ \ln p(x, T(\epsilon; \lambda)) - \ln q_\lambda(T(\epsilon; \lambda)) \right].$$
The gradient estimator above is a transformation of a simple random variate. It maps $\epsilon$, a random variable with a known distribution, to $\hat{g}(\epsilon)$, a random variable with a complex distribution. We know this complex distribution has the correct mean (that is what unbiasedness guarantees), but we can’t compute that expectation exactly (which is why we’re estimating it in the first place).
The key idea in this paper is that we can approximate the transformation to form a different gradient estimator. This new estimator, $\tilde{g}(\epsilon)$, is biased but has a tractable distribution. Importantly, we use the same $\epsilon$ to form both estimators, which correlates the two random variables. We can then use the biased approximation as a control variate, forming $\hat{g}(\epsilon) - \left(\tilde{g}(\epsilon) - \mathbb{E}[\tilde{g}]\right)$, which has the same mean as $\hat{g}$ but lower variance. This simple relationship is depicted below: the lower left is our standard reparameterization gradient; the lower right is the biased approximation.
We apply this idea to a Gaussian variational distribution $q_\lambda(z) = \mathcal{N}(z; m, \mathrm{diag}(s^2))$. We propose a linear approximation of the transformation that results in a simple estimator, $\tilde{g}(\epsilon)$, with a known distribution (Gaussian, or a shifted and scaled $\chi^2$).
To summarize, the unbiased Monte Carlo estimator $\hat{g}$ has a complex distribution with an uncomputable mean. The biased estimator $\tilde{g}$ has a simpler distribution with a known mean. We construct $\tilde{g}$ by approximating the process that generates $\hat{g}$. Using the same $\epsilon$ for both estimators correlates them, making $\tilde{g}$ a powerful control variate for variance reduction.
The paper describes a way to construct a good $\tilde{g}$ that we can efficiently compute. We can do this by taking advantage of model curvature, efficient Hessian-vector products, and the stability of Gaussian random variables.
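To give a feel for the Hessian-vector product piece: in $D$ dimensions the linearization involves the Hessian of $\ln p$, but we never need the full $D \times D$ matrix, only its product with a single vector. A sketch under my own toy assumptions (a quadratic model and a finite-difference HVP; the paper's actual implementation details may differ):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy quadratic model: ln p(x, z) = -0.5 * z^T P z + const, where P is
# a positive-definite precision matrix. Names here are illustrative.
D = 50
A = rng.standard_normal((D, D))
P = A @ A.T / D + np.eye(D)

def grad_logp(z):
    # gradient of ln p(x, z)
    return -P @ z

def hvp(z, v, h=1e-5):
    """Approximate (Hessian of ln p at z) @ v with two gradient calls,
    via a central finite difference -- O(D) memory, no D x D Hessian."""
    return (grad_logp(z + h * v) - grad_logp(z - h * v)) / (2.0 * h)

m = rng.standard_normal(D)          # variational mean
s = np.full(D, 0.1)                 # variational scales
eps = rng.standard_normal(D)        # shared seed randomness

# Linearized gradient sample around the variational mean z0 = m:
g_tilde = grad_logp(m) + hvp(m, s * eps)

# For this quadratic toy model the linearization is exact, so the
# approximation matches the true gradient sample:
print(np.allclose(g_tilde, grad_logp(m + s * eps)))  # True
```

In a real autodiff framework the finite-difference `hvp` would be replaced by an exact reverse-over-forward Hessian-vector product, which has the same O(D) flavor.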
We compare our gradient estimator, HVP+Local, with the pure Monte Carlo RGE. We apply these gradient estimators to two VI problems with non-conjugate posterior distributions, a hierarchical Poisson GLM and a Bayesian neural network, using a diagonal Gaussian variational approximation with mean parameters $m$, scale parameters $s$, and full parameter vector $\lambda = (m, s)$.
For the Poisson GLM we observe a dramatic reduction in variance for both mean and scale parameters, typically by a factor between 20 and 2,000x (more precise variance reduction measurements are in the paper). This reduction in variance translates into faster convergence: for the hierarchical model, optimization is both quicker and more stable.
And for the (modestly sized) Bayesian neural network, we likewise see faster convergence. Note that the 10-sample HVP+Local estimator outperforms the 50-sample MC estimator in terms of wall-clock time.
While the HVP+Local estimator requires 2–3x the computation of the pure Monte Carlo RGE, we see that this cost can be worth it. More results and details can be found in the paper.
Thanks for reading, and stay cool out there.