Expectation Maximization and Gradient Updates
15 Oct 2015
There are many ways to obtain maximum likelihood estimates for statistical models, and it’s not always clear which algorithm to choose (or why). There appears to be a natural divide between fixed-point iterative methods, such as Expectation Maximization (EM), and directly optimizing the marginal likelihood with gradient-based methods. One great reference that bridges this gap is the paper Optimization with EM and Expectation-Conjugate-Gradient by Salakhutdinov, Roweis and Ghahramani from ICML in 2003. The main takeaway of the paper is that an EM iteration can be viewed as a preconditioned gradient step on the marginal likelihood.
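The model setup includes observed data, $x$, latent variables, $z$, and model parameters $\theta$. To fit the model, our goal is to maximize the (log) marginal likelihood of the observed data:

$$\mathcal{L}(\theta) = \ln p(x \mid \theta) = \ln \int p(x, z \mid \theta) \, dz$$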
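#### Expectation-Maximization

The standard EM algorithm is often presented in a terse, unintuitive way:

- E-step: compute $Q(\theta, \theta^{(t)}) = \mathbb{E}_{p(z \mid x, \theta^{(t)})}\left[\ln p(x, z \mid \theta)\right]$
- M-step: compute $\theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^{(t)})$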
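To make the two steps concrete, here is a minimal sketch of EM for a two-component, one-dimensional mixture of Gaussians. The model, the synthetic data, and the function names (`e_step`, `m_step`, `log_marginal`) are illustrative assumptions, not something taken from the paper.

```python
import numpy as np

def log_gauss(x, mu, var):
    """Log density of N(mu, var) at x (broadcasts over components)."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def e_step(x, pi, mu, var):
    """Posterior responsibilities p(z = k | x_n, theta), shape (N, K)."""
    log_joint = np.log(pi) + log_gauss(x[:, None], mu, var)  # ln p(x_n, z=k | theta)
    log_joint -= log_joint.max(axis=1, keepdims=True)         # stabilize the exponent
    resp = np.exp(log_joint)
    return resp / resp.sum(axis=1, keepdims=True)

def m_step(x, resp):
    """Closed-form maximizer of the expected complete-data log likelihood."""
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

def log_marginal(x, pi, mu, var):
    """ln p(x | theta) = sum_n ln sum_k pi_k N(x_n | mu_k, var_k)."""
    log_joint = np.log(pi) + log_gauss(x[:, None], mu, var)
    m = log_joint.max(axis=1, keepdims=True)
    return float((m[:, 0] + np.log(np.exp(log_joint - m).sum(axis=1))).sum())

# Synthetic data from two unit-variance Gaussians (assumed for illustration).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

history = []
for _ in range(50):
    history.append(log_marginal(x, pi, mu, var))
    resp = e_step(x, pi, mu, var)      # E-step: posterior over z at current theta
    pi, mu, var = m_step(x, resp)      # M-step: maximize expected complete-data log lik.

print("estimated means:", mu)
```

The `history` list records the log marginal likelihood before each update; EM guarantees that this sequence never decreases.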
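We can unpack the two steps by examining the marginal likelihood itself. By Jensen's inequality,

$$\mathcal{L}(\theta) = \ln \int p(x, z \mid \theta) \, dz \geq \int p(z \mid x, \psi) \ln \frac{p(x, z \mid \theta)}{p(z \mid x, \psi)} \, dz \equiv F(\theta, \psi)$$

where we have introduced auxiliary parameters $\psi$. The EM algorithm then proceeds:

- E-step: maximize $F$ with respect to $\psi$ holding $\theta$ fixed [^{1}]
- M-step: maximize $F$ with respect to $\theta$ holding $\psi$ fixed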
So EM is optimizing a (sometimes tight) lower bound on the log marginal likelihood by alternately maximizing over $\psi$ and $\theta$. We can think of one EM update as a mapping $M$, that is $\theta^{(t+1)} = M(\theta^{(t)})$, and examine the properties of this mapping in terms of the pieces of the functional $F$. First, we note that if EM is at a stable point, $\theta^*$, such that $\theta^* = M(\theta^*)$, then around $\theta^*$, by a Taylor series expansion, the mapping can be approximated as

$$\theta^{(t+1)} - \theta^* \approx M'(\theta^*) \left(\theta^{(t)} - \theta^*\right)$$

and convergence near the optimum relies on the structure of $M'(\theta^*)$, the Jacobian matrix of the EM mapping.
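As a rough numerical illustration of this linearization, the sketch below estimates the Jacobian of the EM map by finite differences at a converged point and reports its largest eigenvalue magnitude, which governs the local convergence rate. It assumes a deliberately simple model (equal mixing weights, unit variances, only the two means unknown); `em_map` and `em_jacobian` are hypothetical helper names.

```python
import numpy as np

def em_map(mu, x):
    """One EM update of the two means (equal weights, unit variances assumed)."""
    logits = -0.5 * (x[:, None] - mu) ** 2
    logits -= logits.max(axis=1, keepdims=True)
    resp = np.exp(logits)
    resp /= resp.sum(axis=1, keepdims=True)          # E-step responsibilities
    return (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)  # M-step for the means

def em_jacobian(mu_star, x, eps=1e-5):
    """Central finite-difference estimate of the Jacobian of em_map at mu_star."""
    d = len(mu_star)
    jac = np.zeros((d, d))
    for j in range(d):
        step = np.zeros(d)
        step[j] = eps
        jac[:, j] = (em_map(mu_star + step, x) - em_map(mu_star - step, x)) / (2 * eps)
    return jac

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-1.0, 1.0, 500), rng.normal(1.0, 1.0, 500)])

mu = np.array([-0.5, 0.5])
for _ in range(5000):            # iterate until the map has (numerically) converged
    mu = em_map(mu, x)

jac = em_jacobian(mu, x)
rate = np.abs(np.linalg.eigvals(jac)).max()
print("fixed point:", mu)
print("spectral radius of the EM Jacobian:", rate)
```

A spectral radius near 0 means the error shrinks very quickly per iteration; a value near 1 means EM crawls.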
#### EM as a gradient update
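Now the question is, how can we relate the EM step, $\theta^{(t+1)} - \theta^{(t)}$, to the gradient of the marginal likelihood, $\nabla_\theta \mathcal{L}(\theta^{(t)})$? The authors of the paper claim that an EM step can be written

$$\theta^{(t+1)} - \theta^{(t)} = P(\theta^{(t)}) \, \nabla_\theta \mathcal{L}(\theta^{(t)})$$

and provide conditions where $P$ exists (C1 and C2 in the paper). In order to gain some insight about the form of $P$, we can differentiate the equation above with respect to $\theta^{(t)}$:

$$M'(\theta^{(t)}) = I + P'(\theta^{(t)}) \, \nabla_\theta \mathcal{L}(\theta^{(t)}) + P(\theta^{(t)}) \, S(\theta^{(t)})$$

where $M'(\theta^{(t)})$ is the Jacobian of the EM map, with entries $M'_{ij} = \partial M_i / \partial \theta_j$, $P'$ is the derivative of $P$ with respect to $\theta$, and

$$S(\theta^{(t)}) = \frac{\partial^2 \mathcal{L}(\theta)}{\partial \theta^2}\Big|_{\theta = \theta^{(t)}}$$

is the Hessian of the log marginal likelihood. The authors argue that when the optimizer is in a flat region of $\mathcal{L}$ (small $\nabla_\theta \mathcal{L}$, and $P'$ is not too big), the rightmost term will dominate. So roughly

$$P(\theta^{(t)}) \approx \left[I - M'(\theta^{(t)})\right] \left[-S(\theta^{(t)})\right]^{-1}$$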
Examining the above equation, convergence is generally controlled by the eigenvalues of $M'$. If the eigenvalues of $M'$ are small, then the EM update looks like a Newton update

$$\theta^{(t+1)} \approx \theta^{(t)} - S(\theta^{(t)})^{-1} \, \nabla_\theta \mathcal{L}(\theta^{(t)})$$

but if the eigenvalues of $M'$ are large (close to 1), then the step size ends up being very small, leading to slow convergence.
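For one special case the preconditioner can be written down exactly, which makes the relationship easy to check numerically. In an equal-weight, unit-variance mixture with unknown means $\mu_k$ and responsibilities $r_{nk}$, the gradient is $\partial \mathcal{L} / \partial \mu_k = \sum_n r_{nk} (x_n - \mu_k)$ and the EM update is $\mu_k^{\text{new}} = \sum_n r_{nk} x_n / \sum_n r_{nk}$, so the EM step is the gradient scaled by $1 / \sum_n r_{nk}$. The sketch below checks this against a finite-difference gradient; the model, data, and helper names are assumptions chosen for illustration.

```python
import numpy as np

def responsibilities(mu, x):
    """p(z = k | x_n, mu) for an equal-weight, unit-variance mixture."""
    logits = -0.5 * (x[:, None] - mu) ** 2
    logits -= logits.max(axis=1, keepdims=True)
    resp = np.exp(logits)
    return resp / resp.sum(axis=1, keepdims=True)

def log_marginal(mu, x):
    """ln p(x | mu) for the same mixture."""
    log_joint = np.log(0.5) - 0.5 * np.log(2 * np.pi) - 0.5 * (x[:, None] - mu) ** 2
    m = log_joint.max(axis=1)
    return float((m + np.log(np.exp(log_joint - m[:, None]).sum(axis=1))).sum())

def numerical_gradient(f, theta, eps=1e-5):
    """Central finite-difference gradient of a scalar function f at theta."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-1.0, 1.0, 300), rng.normal(1.5, 1.0, 300)])
mu = np.array([-0.3, 0.4])                      # an arbitrary, non-converged point

resp = responsibilities(mu, x)
em_step = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0) - mu   # M(theta) - theta
grad = numerical_gradient(lambda m: log_marginal(m, x), mu)
P = np.diag(1.0 / resp.sum(axis=0))             # preconditioner for this special case

print("EM step:         ", em_step)
print("P @ gradient:    ", P @ grad)
```

Here $P$ is diagonal and positive definite, so the EM step is always an ascent direction; in richer models $P$ generally has no such simple closed form.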
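We can gain some insight into the spectral structure of the EM iteration by looking at a result from the original EM paper [^{2}] that discusses the gradient of the EM map. Write $Q(\theta, \psi) = \int p(z \mid x, \psi) \ln p(x, z \mid \theta) \, dz$ for the expected complete-data log likelihood and $H(\theta, \psi) = \int p(z \mid x, \psi) \ln p(z \mid x, \theta) \, dz$, so that $\mathcal{L}(\theta) = Q(\theta, \psi) - H(\theta, \psi)$. If EM converges to some point $\theta^*$, then

$$M'(\theta^*) = \left[\frac{\partial^2 Q(\theta, \theta^*)}{\partial \theta^2}\right]^{-1} \left[\frac{\partial^2 H(\theta, \theta^*)}{\partial \theta^2}\right] \Bigg|_{\theta = \theta^*}$$

which describes the “ratio of missing information” near the optimal value: the derivative of the EM map is equal to the ratio of posterior information and the complete data information (w.r.t. the latent variable posterior expectation). The authors combine the above into the relationship for $P$ to claim that near the point of convergence

$$P(\theta^{(t)}) \approx \left[I - \left(\frac{\partial^2 Q}{\partial \theta^2}\right)^{-1} \left(\frac{\partial^2 H}{\partial \theta^2}\right) \Bigg|_{\theta = \theta^{(t)}}\right] \left[-S(\theta^{(t)})\right]^{-1}$$

which makes it clear that when the ratio of “missing information” to “complete information” is low, EM updates look more like Newton updates that make use of the curvature of the log marginal likelihood. However, when “missing information” is a large fraction of the “complete information”, EM updates take small steps that lead to slow convergence.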
A simple simulation for a 2-component mixture of Gaussians illustrates this point. I ran EM on two different mixtures of Gaussians: a well-separated model and a not-so-well-separated model. Convergence for the not-so-well-separated model (less “information”) is slower in the median, though EM is very quick to converge in both examples. The paper provides empirical evidence that linear dynamical systems and HMMs tend to show markedly slow EM convergence (compared to gradient updates).
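A sketch of this kind of experiment is below. It is not the original simulation: the model (equal weights, unit variances, unknown means), sample sizes, separations, initialization, and tolerance are all assumptions. It counts EM iterations until the means stop moving and reports the median over several random datasets.

```python
import numpy as np

def em_map(mu, x):
    """One EM update of the two means (equal weights, unit variances assumed)."""
    logits = -0.5 * (x[:, None] - mu) ** 2
    logits -= logits.max(axis=1, keepdims=True)
    resp = np.exp(logits)
    resp /= resp.sum(axis=1, keepdims=True)
    return (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

def iterations_to_converge(x, mu0, tol=1e-8, max_iter=10_000):
    """Number of EM updates until the largest change in the means falls below tol."""
    mu = np.asarray(mu0, dtype=float)
    for it in range(1, max_iter + 1):
        new_mu = em_map(mu, x)
        if np.max(np.abs(new_mu - mu)) < tol:
            return it
        mu = new_mu
    return max_iter

def median_iterations(half_gap, n_trials=20, n=250):
    """Median iteration count over datasets with true means at -half_gap and +half_gap."""
    counts = []
    for seed in range(n_trials):
        rng = np.random.default_rng(seed)
        x = np.concatenate([rng.normal(-half_gap, 1.0, n),
                            rng.normal(half_gap, 1.0, n)])
        counts.append(iterations_to_converge(x, mu0=[-0.5, 0.5]))
    return float(np.median(counts))

separated = median_iterations(half_gap=3.0)     # little posterior uncertainty about z
overlapping = median_iterations(half_gap=1.0)   # substantial posterior uncertainty
print("median iterations, well separated:", separated)
print("median iterations, overlapping:   ", overlapping)
```

With overlapping components the posterior over assignments is less certain, so a larger share of the information is "missing" and EM needs more iterations.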
#### Proposed Method: Hybrid EM-CG updates

This intuition gives us a simple rule: if we observe that posterior uncertainty is high in a particular region of our parameter space, we shouldn’t do an EM update and should instead take a gradient-based step. If we observe that posterior uncertainty is low, then we can incorporate computationally inexpensive Hessian information by doing an EM update.
The authors propose using the (normalized) posterior entropy at each step as a heuristic for deciding when to switch between conjugate gradient updates and EM updates (this seems to just be for discrete models). It works well: it seems to almost always improve upon EM (which often slows to a crawl near the point of convergence), and will sometimes improve upon CG-only updates (in general it doesn’t seem to do worse).
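A minimal sketch of such a switching rule for a discrete latent variable is below. Dividing by $N \ln K$ keeps the entropy in $[0, 1]$ for $N$ data points and $K$ latent states. The threshold of 0.5 and the function names are placeholders for illustration, not the settings used in the paper.

```python
import numpy as np

def normalized_posterior_entropy(resp):
    """Average entropy of the rows of resp (shape (N, K)), scaled to lie in [0, 1]."""
    n, k = resp.shape
    safe = np.clip(resp, 1e-300, 1.0)            # avoid log(0); 0 * log(tiny) is 0
    return float(-(resp * np.log(safe)).sum() / (n * np.log(k)))

def choose_update(resp, threshold=0.5):
    """Pick a gradient-based step when the posterior over z is uncertain, else EM."""
    if normalized_posterior_entropy(resp) > threshold:
        return "gradient"
    return "em"

# Two extreme cases: a completely uncertain posterior and a completely certain one.
uniform = np.full((4, 3), 1.0 / 3.0)
one_hot = np.eye(3)[[0, 1, 2, 0]]

print(choose_update(uniform))
print(choose_update(one_hot))
```

In a full optimizer, `resp` would come from the E-step at the current parameters, and the returned label would select between an EM update and a conjugate gradient step.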
###### Questions?

The authors established an approximate relationship between standard gradient-based updates and EM iterations, and what we learned boils down to this: the more (relative) information about $z$ that is available, the more curvature information EM allows us to use for an update. But when $p(z \mid x, \theta)$ doesn’t contain a lot of information, then taking expectations with respect to it won’t get us much traction, and updates will be slow.
But heuristically switching between two different types of optimization methods seems unsatisfying. Can an optimization routine enjoy the curvature information that EM (will sometimes) provide while more continuously relying on pure gradient-based updates? Perhaps one can specify a “coarsened” latent variable model that admits the same marginal likelihood, but for which our current data is highly informative, and iterate in that space.
Also, due to the close relationship between EM iterations and Gibbs updates, what can we learn about MCMC routines from this paper? When does it make sense to perform a Gibbs update (or more realistically, a “something”-within-Gibbs update), and when does it make sense to just perform a marginal MCMC update for $\theta$?

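[^{1}]: To see how this arises, you can refactor $F(\theta, \psi)$ to be $\ln p(x \mid \theta) - \mathrm{KL}\left(p(z \mid x, \psi) \,\|\, p(z \mid x, \theta)\right)$, so clearly maximizing $F$ holding $\theta$ fixed is equivalent to finding the distribution $p(z \mid x, \psi)$ that minimizes the KL between $p(z \mid x, \psi)$ and $p(z \mid x, \theta)$, which is the posterior over $z$ when $\psi = \theta$. Given a fixed $\psi$, the M-step is the typical maximization of the expected complete data log likelihood.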

[^{2}]: Dempster, Arthur P., Nan M. Laird, and Donald B. Rubin. “Maximum likelihood from incomplete data via the EM algorithm.” Journal of the Royal Statistical Society. Series B (Methodological) (1977): 1-38.