## Reducing Reparameterization Gradient Variance

### 23 May 2017

Nick Foti,
Alex D’Amour,
Ryan Adams, and I just
arXiv’d a paper that focuses on
reducing the variance of the *reparameterization gradient estimator* (RGE)
often used in the context of black-box variational inference.
I’ll give a high level summary here, but I encourage you to check out
the paper
and the code.

#### Reparameterization Gradients

Reparameterization gradients are often used when the objective of an
optimization problem is written as an expectation.
A common example is black-box variational inference for approximating an
intractable posterior distribution $p(z \mid x)$ with a tractable family
$q_\lambda(z)$.
The objective we typically maximize is the
*evidence lower bound* (ELBO)

$$\mathcal{L}(\lambda) = \mathbb{E}_{q_\lambda(z)}\left[ \ln p(x, z) - \ln q_\lambda(z) \right].$$

Using gradient-based optimization requires computing the gradient with respect to
the variational parameters $\lambda$,

$$\nabla_\lambda \mathcal{L}(\lambda) = \nabla_\lambda \, \mathbb{E}_{q_\lambda(z)}\left[ \ln p(x, z) - \ln q_\lambda(z) \right],$$

which itself can be expressed as an expectation. This expectation is in general analytically intractable, but we can still use gradient-based optimization if we can compute an unbiased estimate of the gradient, $\hat{g}$, such that $\mathbb{E}[\hat{g}] = \nabla_\lambda \mathcal{L}(\lambda)$.

The *reparameterization gradient estimator* (RGE) is one such unbiased
estimator.
The RGE uses the “reparameterization trick” to turn a Monte Carlo
ELBO estimator into a Monte Carlo ELBO *gradient estimator*.
Recall that the reparameterization gradient estimator depends on
a differentiable map, $z = T(\epsilon; \lambda)$, that transforms some
seed randomness, $\epsilon \sim p(\epsilon)$, such that $z$
is distributed according to $q_\lambda(z)$.
The RGE can be written

$$\nabla_\lambda \mathcal{L}(\lambda) = \mathbb{E}_{p(\epsilon)}\left[ \nabla_\lambda \left( \ln p(x, T(\epsilon; \lambda)) - \ln q_\lambda(T(\epsilon; \lambda)) \right) \right].$$

We can easily form an unbiased $L$-sample Monte Carlo estimator of $\nabla_\lambda \mathcal{L}(\lambda)$ by drawing samples $\epsilon^{(1)}, \dots, \epsilon^{(L)} \sim p(\epsilon)$ and computing

$$\hat{g} = \frac{1}{L} \sum_{\ell=1}^{L} \nabla_\lambda \left( \ln p(x, T(\epsilon^{(\ell)}; \lambda)) - \ln q_\lambda(T(\epsilon^{(\ell)}; \lambda)) \right).$$
The utility of this estimator is tied to its variance — generally, the higher the variance, the slower the optimization converges.
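To make the estimator concrete, here is a minimal NumPy sketch (a toy example of my own, not the paper's code). It estimates the gradient of $F(m, s) = \mathbb{E}_{z \sim \mathcal{N}(m, s^2)}[z^2] = m^2 + s^2$ with respect to $m$ and $s$ via the reparameterization $z = m + s\epsilon$; the true gradients are $2m$ and $2s$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy objective: F(m, s) = E_{z ~ N(m, s^2)}[z^2] = m^2 + s^2,
# so the true gradients are dF/dm = 2m and dF/ds = 2s.
def reparam_grad(m, s, L):
    """L-sample reparameterization gradient estimate of (dF/dm, dF/ds)."""
    eps = rng.standard_normal(L)   # seed randomness, eps ~ N(0, 1)
    z = m + s * eps                # differentiable map z = T(eps; lambda)
    # Per-sample gradients by the chain rule:
    #   d f(z)/dm = f'(z) * dz/dm = 2z,   d f(z)/ds = f'(z) * dz/ds = 2z * eps
    return np.mean(2 * z), np.mean(2 * z * eps)

gm, gs = reparam_grad(m=1.5, s=0.7, L=100_000)
print(gm, gs)   # ≈ (3.0, 1.4)
```

Averaging more samples shrinks the estimator's variance at the usual $1/L$ rate, which is exactly the cost the control variate below is designed to avoid.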

#### Method

We approach this variance reduction task by examining the
*generating process* of the RGE, $\hat{g}$.
To see what I mean, consider a single-sample estimator as a function
of $\epsilon$
(see this post
for details)

$$\hat{g}(\epsilon) = \nabla_\lambda \left( \ln p(x, T(\epsilon; \lambda)) - \ln q_\lambda(T(\epsilon; \lambda)) \right), \qquad \epsilon \sim p(\epsilon).$$

The gradient estimator above is a transformation of a simple random variate. It maps $\epsilon$, a random variable with a known distribution, to $\hat{g}(\epsilon)$, a random variable with a complex distribution. We know this complex distribution has the correct mean (the estimator is unbiased), but we can't compute that mean exactly (which is why we're estimating it in the first place).

The key idea in this paper is that we can
*approximate the transformation* $\hat{g}(\epsilon)$
to form a different gradient estimator.
This new estimator, $\tilde{g}(\epsilon)$, is biased but has a
tractable distribution.
Importantly, we use the same $\epsilon$ to form both estimators, which
correlates the two random variables.
We can use the biased approximation as a control variate to reduce
the variance of the original RGE.
This simple relationship is depicted below: the lower left is our
standard reparameterization gradient; the lower right is the biased
approximation.

We apply this idea to a Gaussian variational distribution $q_\lambda(z) = \mathcal{N}(z \mid m, \Sigma)$. We propose a linear approximation of the transformation that results in a simple estimator, $\tilde{g}(\epsilon)$, with a known distribution (Gaussian for the mean parameters, or a shifted and scaled $\chi^2$ for the scale parameters).
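To see where these distributions come from, here is a rough sketch in generic notation (mine, not necessarily the paper's): for a diagonal Gaussian with $z = m + \sigma \odot \epsilon$, a first-order Taylor expansion of the inner gradient $\nabla_z f$, with $f(z) = \ln p(x, z) - \ln q_\lambda(z)$, around the mean $m$ gives

```latex
% Linearize \nabla_z f around z = m; H(m) is the Hessian of f at the mean:
\nabla_z f(z) \;\approx\; \nabla_z f(m) + H(m)\,(\sigma \odot \epsilon)

% Chain rule for a mean parameter (dz_i / dm_i = 1):
\tilde{g}_{m} \;=\; \nabla_z f(m) + H(m)\,(\sigma \odot \epsilon)
  \qquad \text{linear in } \epsilon \;\Rightarrow\; \text{Gaussian}

% Chain rule for a scale parameter (dz_i / d\sigma_i = \epsilon_i):
\tilde{g}_{\sigma, i} \;=\; \big[\nabla_z f(m) + H(m)\,(\sigma \odot \epsilon)\big]_i\, \epsilon_i
  \qquad \text{contains } \epsilon_i^2 \;\Rightarrow\; \text{shifted, scaled } \chi^2
```

The mean-parameter approximation is a linear function of the Gaussian $\epsilon$, hence Gaussian; the scale-parameter approximation picks up $\epsilon_i^2$ terms, hence the shifted and scaled $\chi^2$.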

To summarize, the **unbiased** Monte Carlo estimator $\hat{g}$
has a complex distribution with an uncomputable mean.
The **biased** estimator $\tilde{g}$ has a simpler distribution with
a known mean.
We construct $\tilde{g}$ by approximating the process that generates
$\hat{g}$.
Using *the same* $\epsilon$ for both estimators correlates them, making
$\tilde{g}$ a powerful control variate for variance reduction.
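As a self-contained illustration of the control-variate mechanics (a toy of my own, not the paper's estimator): take $f(z) = z^4$ with $z = m + s\epsilon$. The exact per-sample RGE for $\partial_m \mathbb{E}[z^4]$ is $4z^3$, and linearizing that transformation of $\epsilon$ around $m$ gives a tractable $\tilde{g} = 4m^3 + 12m^2 s\,\epsilon$ with known mean $4m^3$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, s, L = 1.0, 0.2, 100_000

eps = rng.standard_normal(L)   # shared seed randomness for both estimators
z = m + s * eps

# Unbiased per-sample RGE of d/dm E[z^4]:  g_hat = 4 z^3
g_hat = 4 * z**3

# Biased linear approximation of the same transformation of eps,
# with known mean E[g_tilde] = 4 m^3:
g_tilde = 4 * m**3 + 12 * m**2 * s * eps

# Control-variate estimator: subtract g_tilde, add back its known mean.
# Still unbiased, because E[g_tilde - 4 m^3] = 0.
g_cv = g_hat - (g_tilde - 4 * m**3)

print(np.var(g_hat) / np.var(g_cv))   # variance reduction factor, well above 1
```

Because both estimators are driven by the same `eps`, they are highly correlated, and subtracting the zero-mean residual cancels most of the noise while leaving the mean untouched.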

The paper describes a way to construct a good $\tilde{g}$ that we can efficiently compute, taking advantage of model curvature, efficient Hessian-vector products, and the stability of Gaussian random variables.
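Hessian-vector products can be computed without ever materializing the Hessian; autodiff frameworks do this exactly via double backpropagation, but even a finite-difference sketch (my illustration, not the paper's implementation) conveys the idea that $Hv$ costs only a couple of gradient evaluations.

```python
import numpy as np

# Hessian-vector product without forming the Hessian, via a central
# finite difference of the gradient:  H v ≈ (∇f(x + δv) − ∇f(x − δv)) / (2δ)
def hvp(grad_f, x, v, delta=1e-5):
    return (grad_f(x + delta * v) - grad_f(x - delta * v)) / (2 * delta)

# Check on a quadratic f(x) = 0.5 xᵀAx, where ∇f(x) = Ax and the Hessian is A.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
grad_f = lambda x: A @ x
x = np.array([1.0, -1.0])
v = np.array([0.3, 0.7])

print(hvp(grad_f, x, v))   # ≈ A @ v = [0.95, 0.85]
```

Each call touches only gradient evaluations, so the cost scales with the dimension of $x$ rather than with the quadratic cost of building the full Hessian.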

#### Results

We compare our gradient estimator, HVP+Local, with the pure Monte Carlo RGE. We apply these gradient estimators to two VI problems with non-conjugate posterior distributions, a hierarchical Poisson GLM and a Bayesian neural network, using a diagonal Gaussian variational approximation with mean parameters $m$, scale parameters $\ln \sigma$, and full parameter vector $\lambda = (m, \ln \sigma)$.

For the Poisson GLM we observe a dramatic reduction in variance for both mean and scale parameters, typically a factor between 20x and 2,000x (more precise variance reduction measurements are in the paper). This reduction in variance translates to faster convergence. For the hierarchical model, we see quicker and more stable convergence:

And for the (modestly sized) Bayesian neural network, we see faster convergence. Note that the 10-sample HVP+Local estimator outperforms the 50-sample MC estimator in terms of wall-clock time.

While the HVP+Local estimator requires 2–3x the computation of the pure Monte Carlo RGE, we see that this cost can be worth it. More results and details can be found in the paper.

Thanks for reading, and stay cool out there.