Implicit Single Imputation
All case deletion methods, such as Complete Case Analysis (CCA) or Available Case Analysis (ACA), make no use of units with partially observed data when estimating the marginal distribution of the variables under study or the covariation between variables. This is clearly inefficient, and a tempting alternative is to impute, or “fill in”, the unobserved data with plausible values. When a single value replaces each missing value, we talk about Single Imputation (SI) methods; different SI methods arise according to the procedure used to generate the imputations. In general, the idea of imputing the missing values is appealing because it recovers the full sample, to which standard complete-data methods can be applied to derive the estimates of interest.
However, it is important to be aware of the potential problems of imputing missing data without a clear understanding of the process underlying the values we want to impute, which is the key factor in determining whether the selected approach is plausible in the context considered. Indeed, imputations should be conceptualised as draws from a predictive distribution of the missing values, and therefore require a method for creating such a predictive distribution based on the observed data. According to Little and Rubin (2019), these predictive distributions can be created using:
Explicit modelling, when the distribution is based on formal statistical models which make the underlying assumptions explicit.
Implicit modelling, when the distribution is based on an algorithm which implicitly relies on some underlying model assumptions.
In this part, we focus on some of the most popular Implicit Single Imputation methods. These include: Hot Deck Imputation (SI-HD), where missing values are imputed using observed values from similar responding units in the sample; Substitution (SI-S), where nonresponding units are replaced with alternative units not yet selected into the sample; Cold Deck Imputation (SI-CD), where missing values are replaced with a constant value from an external source; and Composite Methods, which combine procedures from the previous approaches. We will specifically focus on SI-HD methods, which are the most popular among these.
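As a rough illustration of the hot deck idea, the sketch below fills each missing value with a value drawn at random from responding units in the same adjustment cell. It is only one possible variant (a random hot deck within cells), and the names hot_deck_impute, target and cell_key, as well as the toy data, are assumptions introduced here for illustration.

```python
import random


def hot_deck_impute(records, target, cell_key, rng=None):
    """Random hot deck within adjustment cells (illustrative sketch).

    Missing values of `target` are coded as None and are replaced by the
    observed value of a donor drawn at random from the same cell.
    """
    rng = rng or random.Random(0)  # fixed seed only to make the sketch reproducible

    # Pool the observed values (the donors) by adjustment cell.
    donors = {}
    for rec in records:
        if rec[target] is not None:
            donors.setdefault(rec[cell_key], []).append(rec[target])

    imputed = []
    for rec in records:
        new = dict(rec)  # copy, so the input records are left untouched
        if new[target] is None:
            pool = donors.get(new[cell_key])
            if pool:  # if the cell has no donor, the value stays missing
                new[target] = rng.choice(pool)
        imputed.append(new)
    return imputed


# Hypothetical toy sample: income is missing for two units.
sample = [
    {"sex": "F", "income": 30.0},
    {"sex": "F", "income": 34.0},
    {"sex": "F", "income": None},
    {"sex": "M", "income": 41.0},
    {"sex": "M", "income": None},
]
filled = hot_deck_impute(sample, target="income", cell_key="sex")
```

Because donors are real respondents, every imputed value is a value that was actually observed in the sample. How plausible it is depends on how well the chosen cells capture similarity between units.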
Hot Deck Imputation
The name of SI-HD procedures refers to the deck of matching Hollerith (punched) cards for the donors available for a nonrespondent. Suppose that a sample of
where
where the inner expectations and variances are taken over the distribution of
Predictive Mean Matching
A general approach to hot-deck imputation is to define a metric
where
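As a rough illustration of predictive mean matching, the sketch below makes several assumptions that are not taken from the text: there is a single fully observed covariate x, the predicted means come from a least-squares line fitted to the complete cases, and the metric is the squared difference between predicted means. The function names fit_line and pmm_impute, the parameter k and the toy data are hypothetical.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pmm_impute(x, y, k=1, rng=None):
    """Impute missing y (None) with the observed y of a nearby donor.

    Donors are ranked by the squared difference between predicted means;
    one of the k closest donors is drawn at random.
    """
    rng = rng or random.Random(0)  # fixed seed only for reproducibility
    observed = [i for i, yi in enumerate(y) if yi is not None]

    # Fit the regression on the complete cases only.
    intercept, slope = fit_line([x[i] for i in observed],
                                [y[i] for i in observed])
    predicted = [intercept + slope * xi for xi in x]

    completed = list(y)
    for i, yi in enumerate(y):
        if yi is None:
            nearest = sorted(
                observed,
                key=lambda j: (predicted[i] - predicted[j]) ** 2,
            )[:k]
            # The imputed value is the donor's OBSERVED y, not its prediction.
            completed[i] = y[rng.choice(nearest)]
    return completed


# Hypothetical toy data: the last unit has a missing outcome.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 2.1]
y = [1.1, 2.3, 2.9, 4.2, 5.0, None]
completed = pmm_impute(x, y)
```

With k = 1 the closest donor is always used. A larger k adds randomness to the choice of donor, at the cost of matching less closely.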
Last Value Carried Forward
Longitudinal data are often subject to attrition when units leave the study prematurely. Let
where
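Last value carried forward can be sketched for a single unit as follows. The sketch assumes that the repeated measurements are stored in time order with None marking missing occasions; the function name locf and the example values are hypothetical.

```python
def locf(series):
    """Replace each missing value (None) with the last observed value.

    Values that are missing before the first observation have nothing to
    carry forward and are left as None.
    """
    completed = []
    last_seen = None
    for value in series:
        if value is not None:
            last_seen = value
        completed.append(last_seen)
    return completed


# Hypothetical unit that drops out after the second visit.
visits = [5.1, 4.8, None, None]
carried = locf(visits)  # [5.1, 4.8, 4.8, 4.8]
```

The imputed trajectory is flat after dropout, so the method implicitly assumes that the outcome would not have changed had the unit stayed in the study.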
Conclusions
According to Little and Rubin (2019), imputation should generally be:
Conditional on observed variables, to reduce bias, improve precision and preserve association between variables.
Multivariate, to preserve association between missing variables.
Draws from the predictive distributions rather than means, to provide valid estimates of a wide range of estimands.
Nevertheless, a main problem of SI methods is that inferences based on the imputed data do not account for imputation uncertainty. As a result, standard errors are systematically underestimated, p-values are too small and confidence intervals are too narrow.
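The underestimation of standard errors can be checked with a small simulation. The sketch below is only an illustration under assumptions chosen here: standard normal data, values missing completely at random, a random hot deck with replacement, and the sample mean as the estimand. The function name simulate and all its parameters are hypothetical.

```python
import random
import statistics


def simulate(n=100, n_obs=60, reps=2000, seed=1):
    """Compare the naive standard error of the mean, computed from the
    filled-in data as if nothing had been imputed, with the actual
    variability of the estimate across repeated samples."""
    rng = random.Random(seed)
    estimates, naive_ses = [], []
    for _ in range(reps):
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # The sample is i.i.d., so keeping the first n_obs values
        # amounts to missingness completely at random.
        observed = y[:n_obs]
        # Random hot deck: draw donors with replacement from respondents.
        imputed = [rng.choice(observed) for _ in range(n - n_obs)]
        filled = observed + imputed
        estimates.append(statistics.fmean(filled))
        naive_ses.append(statistics.stdev(filled) / n ** 0.5)
    # Average naive SE versus empirical SD of the estimates.
    return statistics.fmean(naive_ses), statistics.stdev(estimates)


naive_se, empirical_sd = simulate()
```

In this setting the average naive standard error is smaller than the empirical standard deviation of the estimates, because the filled-in data set is treated as if it contained n independent observations.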
References
Little, Roderick JA, and Donald B Rubin. 2019. Statistical Analysis with Missing Data. Vol. 793. John Wiley & Sons.
Pocock, Stuart J. 2013. Clinical Trials: A Practical Approach. John Wiley & Sons.