Expectation Maximisation Algorithm

Patterns of incomplete data in practice often do not have the forms that allow explicit Maximum Likelihood (ML) estimates to be calculated. Suppose we have a model for the complete data $Y$, with density $f(Y \mid \theta)$, indexed by the set of unknown parameters $\theta$. Writing $Y = (Y_0, Y_1)$ in terms of the observed component $Y_0$ and the missing component $Y_1$, and assuming that the missingness mechanism is Missing At Random (MAR), we want to maximise the likelihood

$$L(\theta \mid Y_0) = \int f(Y_0, Y_1 \mid \theta) \, dY_1$$

with respect to $\theta$. When the likelihood is differentiable and unimodal, ML estimates can be found by solving the likelihood equation

$$D_{\ell}(\theta \mid Y_0) \equiv \frac{\partial \ln L(\theta \mid Y_0)}{\partial \theta} = 0,$$

while, if a closed-form solution cannot be found, iterative methods can be applied. One such method is the popular Expectation Maximisation (EM) algorithm (Dempster, Laird, and Rubin (1977)).

The EM algorithm is a general iterative method for ML estimation in incomplete data problems. The basic idea can be summarised as a sequence of steps:

  • Replace missing values by estimated values

  • Estimate the parameters

  • Re-estimate the missing values assuming the new parameter estimates are correct

  • Re-estimate parameters

The procedure is then iterated until apparent convergence. Each iteration of EM consists of an expectation step (E step) and a maximisation step (M step), which ensure that, under general conditions, each iteration increases the loglikelihood $\ell(\theta \mid Y_0)$. In addition, if the loglikelihood is bounded, the sequence $\{\ell(\theta^{(t)} \mid Y_0),\ t = 0, 1, \dots\}$ converges to a stationary value of $\ell(\theta \mid Y_0)$.
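
As a minimal sketch of this iteration, the loop below alternates a user-supplied E step and M step until the observed-data loglikelihood stops increasing. The callables `e_step`, `m_step`, and `loglik` are hypothetical placeholders standing in for whatever the model at hand requires; this is an illustration of the control flow, not a reference implementation.

```python
from typing import Any, Callable

def em(y_obs: Any,
       theta0: Any,
       e_step: Callable[[Any, Any], Any],
       m_step: Callable[[Any], Any],
       loglik: Callable[[Any, Any], float],
       tol: float = 1e-8,
       max_iter: int = 500) -> Any:
    """Generic EM skeleton: iterate E and M steps until the
    observed-data loglikelihood increases by less than `tol`."""
    theta = theta0
    ll_old = loglik(theta, y_obs)
    for _ in range(max_iter):
        expected_stats = e_step(theta, y_obs)   # E step: expected complete-data quantities
        theta = m_step(expected_stats)          # M step: maximise the expected loglikelihood
        ll_new = loglik(theta, y_obs)
        if ll_new - ll_old < tol:               # monotone increase, so this difference is >= 0
            break
        ll_old = ll_new
    return theta
```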

The E step and the M step

The M step simply consists of performing ML estimation of $\theta$ as if there were no missing data, that is, as if they had been filled in. The E step finds the conditional expectation of the missing values given the observed data and the current parameter estimates. In practice, EM does not necessarily substitute the missing values themselves: the key idea is that what is filled in is generally not $Y_1$ but the functions of $Y_1$ appearing in the complete data loglikelihood $\ell(\theta \mid Y)$. Specifically, let $\theta^{(t)}$ be the current estimate of $\theta$; the E step then finds the expected complete data loglikelihood if $\theta$ were $\theta^{(t)}$:

$$Q(\theta \mid \theta^{(t)}) = \int \ell(\theta \mid Y) \, f(Y_1 \mid Y_0, \theta = \theta^{(t)}) \, dY_1.$$

The M step determines $\theta^{(t+1)}$ by maximising this expected complete data loglikelihood:

$$Q(\theta^{(t+1)} \mid \theta^{(t)}) \geq Q(\theta \mid \theta^{(t)}),$$

for all $\theta$.
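
Why increasing Q in this way increases the observed-data loglikelihood follows from a short standard argument, sketched here for completeness. Since $f(Y \mid \theta) = f(Y_0 \mid \theta)\, f(Y_1 \mid Y_0, \theta)$, the observed-data loglikelihood decomposes as

$$\ell(\theta \mid Y_0) = Q(\theta \mid \theta^{(t)}) - H(\theta \mid \theta^{(t)}), \qquad H(\theta \mid \theta^{(t)}) = \int \ln f(Y_1 \mid Y_0, \theta)\, f(Y_1 \mid Y_0, \theta^{(t)}) \, dY_1,$$

and Jensen's inequality gives $H(\theta \mid \theta^{(t)}) \leq H(\theta^{(t)} \mid \theta^{(t)})$ for every $\theta$. Hence any $\theta^{(t+1)}$ satisfying the M step inequality also satisfies $\ell(\theta^{(t+1)} \mid Y_0) \geq \ell(\theta^{(t)} \mid Y_0)$.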

Univariate Normal Data Example

Suppose $y_i$ form an iid sample from a Normal distribution with population mean $\mu$ and variance $\sigma^2$, for $i = 1, \dots, n_{cc}$ observed units and $i = n_{cc}+1, \dots, n$ missing units. Under the assumption that the missingness mechanism is ignorable, the expectation of each missing $y_i$ given $Y_0$ and $\theta = (\mu, \sigma^2)$ is $\mu$. Since the loglikelihood based on all $y_i$ is linear in the sufficient statistics $\sum_{i=1}^{n} y_i$ and $\sum_{i=1}^{n} y_i^2$, the E step of the algorithm calculates

$$E\left(\sum_{i=1}^{n} y_i \mid \theta^{(t)}, Y_0\right) = \sum_{i=1}^{n_{cc}} y_i + (n - n_{cc})\,\mu^{(t)}$$

and

$$E\left(\sum_{i=1}^{n} y_i^2 \mid \theta^{(t)}, Y_0\right) = \sum_{i=1}^{n_{cc}} y_i^2 + (n - n_{cc})\left((\mu^{(t)})^2 + (\sigma^{(t)})^2\right)$$

for current estimates $\theta^{(t)} = (\mu^{(t)}, (\sigma^{(t)})^2)$ of the parameters. Note that simply substituting $\mu^{(t)}$ for the missing values $y_{n_{cc}+1}, \dots, y_n$ is not correct, since the term $(n - n_{cc})(\sigma^{(t)})^2$ would be omitted. Without missing data, the ML estimates of $\mu$ and $\sigma^2$ are $\sum_{i=1}^{n} y_i / n$ and $\sum_{i=1}^{n} y_i^2 / n - \left(\sum_{i=1}^{n} y_i / n\right)^2$, respectively. The M step uses the same expressions, applied to the current expectations of the sufficient statistics calculated in the E step. Thus, the M step calculates

$$\mu^{(t+1)} = \frac{E\left(\sum_{i=1}^{n} y_i \mid \theta^{(t)}, Y_0\right)}{n}$$

and

$$(\sigma^{(t+1)})^2 = \frac{E\left(\sum_{i=1}^{n} y_i^2 \mid \theta^{(t)}, Y_0\right)}{n} - (\mu^{(t+1)})^2.$$

Setting $\mu^{(t)} = \mu^{(t+1)} = \hat{\mu}$ and $\sigma^{(t)} = \sigma^{(t+1)} = \hat{\sigma}$ in these equations shows that a fixed point of the iterations is $\hat{\mu} = \sum_{i=1}^{n_{cc}} y_i / n_{cc}$ and $\hat{\sigma}^2 = \sum_{i=1}^{n_{cc}} y_i^2 / n_{cc} - \hat{\mu}^2$, which are the ML estimates of the parameters from $Y_0$ assuming MAR and distinctness of the parameters.
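
The two steps above translate directly into code. The sketch below is a minimal illustration on simulated data (the sample size, missingness pattern, and true parameter values are arbitrary choices made here, not taken from the text): it iterates the E and M updates from deliberately poor starting values and checks that the iteration settles at the complete-case estimates $\hat{\mu}$ and $\hat{\sigma}^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: n units in total, of which only the first n_cc are observed.
n, n_cc = 100, 60
y_obs = rng.normal(loc=5.0, scale=2.0, size=n_cc)   # Y_0; the other n - n_cc values are missing

def e_step(mu_t, sigma2_t):
    """E step: expected sufficient statistics given the current parameter estimates."""
    t1 = y_obs.sum() + (n - n_cc) * mu_t
    t2 = (y_obs ** 2).sum() + (n - n_cc) * (mu_t ** 2 + sigma2_t)
    return t1, t2

def m_step(t1, t2):
    """M step: complete-data ML formulas applied to the expected sufficient statistics."""
    mu_new = t1 / n
    sigma2_new = t2 / n - mu_new ** 2
    return mu_new, sigma2_new

mu, sigma2 = 0.0, 1.0                      # deliberately poor starting values
for _ in range(500):
    mu_prev, sigma2_prev = mu, sigma2
    mu, sigma2 = m_step(*e_step(mu, sigma2))
    if abs(mu - mu_prev) < 1e-12 and abs(sigma2 - sigma2_prev) < 1e-12:
        break

# At the fixed point the estimates agree (to numerical tolerance) with the
# observed-data ML estimates computed from Y_0 alone.
print(mu, y_obs.mean())
print(sigma2, (y_obs ** 2).mean() - y_obs.mean() ** 2)
```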

Extensions of EM

There are a variety of applications where the M step does not have a simple computational form. In such cases, one way to avoid an iterative M step is to increase the Q function, rather than maximise it, at each iteration, which corresponds to a Generalised Expectation Maximisation (GEM) algorithm. GEM increases the likelihood at each iteration, but appropriate convergence is not guaranteed without further specification of the process of increasing the Q function. One specific case of GEM is the Expectation Conditional Maximisation (ECM) algorithm (Meng and Rubin (1993)), which replaces the M step with a sequence of $S$ conditional maximisation (CM) steps, each of which maximises the Q function over $\theta$ but with some vector function of $\theta$, say $g_s(\theta)$, fixed at its previous value, for $s = 1, \dots, S$. Very briefly, assume that the parameter $\theta$ can be partitioned into subvectors $\theta = (\theta_1, \dots, \theta_S)$; then the $s$-th CM step can be taken to be maximisation with respect to $\theta_s$ with all other subvectors held fixed. Alternatively, it may be useful to take the $s$-th CM step to be simultaneous maximisation over all of the subvectors except $\theta_s$, which is held fixed. Because ECM increases Q, it belongs to the class of GEM algorithms and therefore monotonically increases the likelihood of $\theta$. When the set of functions $g_s$ is “space filling”, in the sense that it allows unconstrained maximisation over $\theta$ in its parameter space, ECM converges to a stationary point under the same conditions that ensure convergence of EM.
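
To make the CM-step structure concrete, the skeleton below is a purely schematic sketch with $S = 2$ subvectors: `e_step`, `cm_step_1`, and `cm_step_2` are hypothetical placeholders for model-specific routines, where each CM step maximises the Q function over one subvector with the other held fixed.

```python
from typing import Any, Callable, Tuple

def ecm(y_obs: Any,
        theta0: Tuple[Any, Any],
        e_step: Callable[[Tuple[Any, Any], Any], Any],
        cm_step_1: Callable[[Any, Any], Any],   # maximises Q over theta_1 with theta_2 fixed
        cm_step_2: Callable[[Any, Any], Any],   # maximises Q over theta_2 with theta_1 fixed
        n_iter: int = 100) -> Tuple[Any, Any]:
    """Schematic ECM: one E step followed by S = 2 conditional maximisation steps."""
    theta1, theta2 = theta0
    for _ in range(n_iter):
        stats = e_step((theta1, theta2), y_obs)   # E step with the current parameters
        theta1 = cm_step_1(stats, theta2)         # CM step 1: update theta_1, theta_2 held fixed
        theta2 = cm_step_2(stats, theta1)         # CM step 2: update theta_2, theta_1 held fixed
        # each CM step increases Q, so the observed-data loglikelihood cannot decrease
    return theta1, theta2
```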

The Expectation Conditional Maximisation Either (ECME) algorithm (Liu and Rubin (1994)) is another version of GEM, which replaces some of the CM steps of ECM, maximising the constrained expected complete data loglikelihood function, with steps that maximise the correspondingly constrained actual likelihood function. The algorithm has stable monotone convergence and is simple to implement relative to competing faster-converging methods, and can have a faster convergence rate than EM or ECM, measured by either the number of iterations or actual computer time. The Alternating Expectation Conditional Maximisation (AECM) algorithm (Meng and Van Dyk (1997)) builds on the ECME idea by maximising functions other than Q or L in particular CM steps, corresponding to varying definitions of what constitutes missing data. An iteration of AECM consists of cycles, each comprising an E step with a particular definition of complete and missing data, followed by CM steps, which can result in enhanced computational efficiency.
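
In the same schematic terms as the ECM sketch above, one ECME iteration keeps the first CM step but lets the second act on the actual observed-data loglikelihood rather than on Q; `maximise_obs_loglik_2` is a hypothetical routine (for instance, a numerical optimiser over $\theta_2$ alone) standing in for that step.

```python
def ecme_iteration(theta1, theta2, y_obs, e_step, cm_step_1, maximise_obs_loglik_2):
    """One schematic ECME iteration: the first CM step maximises Q over theta_1,
    the second maximises the actual observed-data loglikelihood over theta_2."""
    stats = e_step((theta1, theta2), y_obs)
    theta1 = cm_step_1(stats, theta2)               # as in ECM: maximise Q over theta_1
    theta2 = maximise_obs_loglik_2(y_obs, theta1)   # ECME: maximise L(theta | Y_0) over theta_2
    return theta1, theta2
```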

References

Dempster, Arthur P, Nan M Laird, and Donald B Rubin. 1977. “Maximum Likelihood from Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society: Series B (Methodological) 39 (1): 1–22.
Liu, Chuanhai, and Donald B Rubin. 1994. “The ECME Algorithm: A Simple Extension of EM and ECM with Faster Monotone Convergence.” Biometrika 81 (4): 633–48.
Meng, Xiao-Li, and Donald B Rubin. 1993. “Maximum Likelihood Estimation via the ECM Algorithm: A General Framework.” Biometrika 80 (2): 267–78.
Meng, Xiao-Li, and David Van Dyk. 1997. “The EM Algorithm—an Old Folk-Song Sung to a Fast New Tune.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59 (3): 511–67.
