Joint Multiple Imputation
Multiple Imputation (MI) refers to the procedure of replacing each missing value by a set of imputed values rather than a single one. It proceeds in three steps:
1. Specify an imputation model to generate imputed values, typically taken as random draws from the predictive distribution of the missing values given the observed values, and create completed data sets using these imputations and the observed data.
2. Analyse each completed data set using standard complete data methods based on an analysis model, and derive completed data inferences.
3. Pool together the completed data inferences into a single inference using standard MI formulas, which ensure that missing data uncertainty is taken into account.
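As a rough illustration of the first two steps, the sketch below multiply imputes a single Normal variable and analyses each completed data set. It is only a minimal example under stated assumptions: the function names `multiply_impute_normal` and `analyse`, the default of 5 imputations, the noninformative prior used to draw the parameters, and the assumption that values are missing completely at random are all choices made here for illustration, not part of the text.

```python
import numpy as np

def multiply_impute_normal(y, n_imputations=5, seed=0):
    """Step 1 (sketch): create several completed copies of a 1-D variable.

    Assumes a univariate Normal model with a noninformative prior, so each
    imputation first draws (mu, sigma^2) from their posterior and then draws
    the missing values given those parameters.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    miss = np.isnan(y)
    obs = y[~miss]
    n = obs.size
    completed = []
    for _ in range(n_imputations):
        # Parameter draw: sigma^2 from a scaled inverse chi-square, then mu.
        sigma2 = (n - 1) * obs.var(ddof=1) / rng.chisquare(n - 1)
        mu = rng.normal(obs.mean(), np.sqrt(sigma2 / n))
        # Imputation draw: missing values given the drawn parameters.
        filled = y.copy()
        filled[miss] = rng.normal(mu, np.sqrt(sigma2), size=miss.sum())
        completed.append(filled)
    return completed

def analyse(completed):
    """Step 2 (sketch): complete data analysis of each data set.

    Here the analysis is simply the sample mean and its sampling variance.
    """
    return [(d.mean(), d.var(ddof=1) / d.size) for d in completed]

y = [2.1, np.nan, 3.4, 1.9, np.nan, 2.8, 3.0, 2.2]
completed_sets = multiply_impute_normal(y, n_imputations=5)
results = analyse(completed_sets)
```

Each element of `results` is an (estimate, variance) pair from one completed data set; these pairs are the input to the pooling step.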
MI was first proposed by Rubin (1978) and has become more popular over time (Rubin (1996), Schafer and Graham (2002), Little and Rubin (2019)), as well as the focus of methodological and applied research in a variety of fields (Herzog and Rubin (1983), Rubin and Schenker (1987), Schafer (1999), Carpenter and Kenward (2012), Molenberghs et al. (2014), Van Buuren (2018)). MI shares both advantages of Single Imputation (SI) methods and solves both of their disadvantages. Indeed, like SI, MI methods allow the analyst to use familiar complete data methods when analysing the completed data sets. The only disadvantage of MI compared with SI methods is that it takes more time to generate the imputations and analyse the completed data sets. However, Rubin (2004) showed that a relatively small number of imputations is typically sufficient to obtain sufficiently precise estimates.
In the first step of MI, imputations should ideally be created as repeated draws from the posterior predictive distribution of the missing values
Rubin’s rules
Let
Because the imputations under MI are conditional draws, under a good imputation model, they provide valid estimates for a wide range of estimands. In addition, the averaging over
The total variability associated with
where
where the degrees of freedom
with
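The pooling step can be sketched in a few lines of code. The version below uses the commonly quoted form of Rubin's rules for a scalar estimand: the pooled estimate is the average of the completed data estimates, the total variance is the average within-imputation variance plus (1 + 1/M) times the between-imputation variance, and the degrees of freedom follow Rubin's large-sample formula. The function name `pool_rubin` and the symbols `M`, `w_bar`, and `b` are assumptions introduced here, and the small-sample adjustment to the degrees of freedom is not included.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool M completed data estimates of a scalar quantity (sketch).

    estimates : point estimates, one per completed data set
    variances : their estimated sampling variances, one per completed data set
    Returns (pooled estimate, total variance, degrees of freedom).
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    M = q.size
    q_bar = q.mean()              # pooled point estimate
    w_bar = u.mean()              # average within-imputation variance
    b = q.var(ddof=1)             # between-imputation variance
    total = w_bar + (1.0 + 1.0 / M) * b
    if b > 0:
        # Large-sample degrees of freedom for the t reference distribution.
        df = (M - 1) * (1.0 + w_bar / ((1.0 + 1.0 / M) * b)) ** 2
    else:
        df = np.inf               # no between-imputation variability
    return q_bar, total, df

q_bar, total, df = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

An interval estimate would then use `q_bar` plus or minus a t quantile with `df` degrees of freedom times the square root of `total`.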
The validity of MI rests on how the imputations are created and how that procedure relates to the model used to subsequently analyze the data. Creating MIs often requires special algorithms (Schafer (1997)). In general, they should be drawn from a distribution for the missing data that reflects uncertainty about the parameters of the data model. Recall that with SI methods, it is desirable to impute from the conditional distribution
Joint Multiple Imputation
Joint MI starts from the assumption that the data can be described by a multivariate distribution, which in many cases, mostly for practical reasons, is taken to be a multivariate Normal distribution. The general idea is that, for a general missing data pattern
Consider the multivariate Normal distribution
1. Define some plausible starting values for all parameters.
2. At each iteration, draw imputations for each missing value from the predictive distribution of the missing data given the observed data and the current value of the parameters.
3. Re-estimate the parameters using the observed and imputed data, based on the multivariate Normal model.
Steps 2 and 3 are then repeated until convergence, where the stopping rule typically requires that the change in the parameters between successive iterations is smaller than a small tolerance.
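The iterative scheme above can be sketched as follows for a data matrix with at least two columns. This is only an illustration under assumptions made here: the function name `joint_impute_mvn`, the starting values (available-case means and a diagonal covariance), the tolerance, and the iteration cap are all hypothetical choices. The parameters are re-estimated by their sample values rather than drawn from a posterior, and because the imputations are random draws the change in the parameters is noisy, so the iteration cap will often be what stops the loop.

```python
import numpy as np

def joint_impute_mvn(data, n_iter=100, tol=1e-3, seed=0):
    """Impute missing values (NaN) under a multivariate Normal model (sketch)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(data, dtype=float).copy()
    missing = np.isnan(x)

    # Step 1: starting values from the observed part of each column.
    mu = np.nanmean(x, axis=0)
    sigma = np.diag(np.nanvar(x, axis=0))
    x[missing] = np.take(mu, np.nonzero(missing)[1])

    rows = np.nonzero(missing.any(axis=1))[0]
    for _ in range(n_iter):
        # Step 2: draw each row's missing part given its observed part
        # and the current parameters (conditional Normal distribution).
        for i in rows:
            m = missing[i]
            o = ~m
            if o.any():
                s_mo = sigma[np.ix_(m, o)]
                w = np.linalg.solve(sigma[np.ix_(o, o)], s_mo.T).T
                cond_mean = mu[m] + w @ (x[i, o] - mu[o])
                cond_cov = sigma[np.ix_(m, m)] - w @ s_mo.T
            else:
                cond_mean, cond_cov = mu[m], sigma[np.ix_(m, m)]
            cond_cov = (cond_cov + cond_cov.T) / 2.0  # enforce symmetry
            x[i, m] = rng.multivariate_normal(cond_mean, cond_cov)

        # Step 3: re-estimate the parameters from the completed data.
        new_mu = x.mean(axis=0)
        new_sigma = np.cov(x, rowvar=False)
        change = max(np.abs(new_mu - mu).max(),
                     np.abs(new_sigma - sigma).max())
        mu, sigma = new_mu, new_sigma
        if change < tol:  # stopping rule on the parameter change
            break
    return x, mu, sigma
```

Running the function several times with different seeds would give several completed data sets, each of which can then be analysed with complete data methods.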
The multivariate Normal model is also often applied to categorical data, with different types of specifications that have been proposed in the literature (Schafer (1997), Horton, Lipsitz, and Parzen (2003), Allison (2005), Bernaards, Belin, and Schafer (2007), Yucel, He, and Zaslavsky (2008), Demirtas (2009)). For example, missing data in contingency tables can be imputed using log-linear models (Schafer (1997)); mixed continuous-categorical data can be imputed under the general location model, which combines a log-linear and a multivariate Normal model (Olkin, Tate, and others (1961)); and two-way imputation can be applied to missing test item responses by imputing missing categorical data conditional on the row and column sum scores of the multivariate data (Van Ginkel et al. (2007)).