Explicit Single Imputation

All case deletion methods, such as Complete Case Analysis (CCA) or Available Case Analysis (ACA), make no use of units in which a variable is missing when estimating the marginal distribution of that variable or its covariation with other variables. Clearly, this is inefficient, and a tempting alternative is to impute, or “fill in”, the unobserved data with some plausible values. When a single value is used to replace each missing value, we talk about Single Imputation (SI) methods; different SI methods arise according to the procedure used to generate the imputations. In general, the idea of imputing the missing values is appealing because it recovers the full sample, to which standard complete data methods can then be applied to derive the estimates of interest.

However, it is important to be aware of the potential problems of imputing missing data without a clear understanding of the process underlying the values we want to impute, which is the key factor in determining whether the selected approach is plausible in the context considered. Indeed, imputations should be conceptualised as draws from a predictive distribution of the missing values, which requires a method for creating that predictive distribution based on the observed data. According to Little and Rubin (2019), these predictive distributions can be created using

  1. Explicit modelling, when the distribution is based on formal statistical models which make the underlying assumptions explicit.

  2. Implicit modelling, when the distribution is based on an algorithm which implicitly relies on some underlying model assumptions.

In this part, we focus on some of the most popular Explicit Single Imputation methods. These include: Mean Imputation (SI-M), where means from the observed data are used as imputed values; Regression Imputation (SI-R), where missing values are replaced with values predicted from a regression of the missing variable on some other observed variables; and Stochastic Regression Imputation (SI-SR), where unobserved data are substituted with the predicted values from a regression imputation plus a randomly selected residual drawn to reflect uncertainty in the predicted values.

Mean Imputation

The simplest type of SI-M consists in replacing the missing values in a variable with the mean of the observed units from the same variable, a method known as Unconditional Mean Imputation (Little and Rubin (2019), Schafer and Graham (2002)). Let $y_{ij}$ be the value of variable $j$ for unit $i$, such that the unconditional mean based on the observed values of $y_j$ is given by $\bar{y}_j$. The sample mean of the observed and imputed values is then $\bar{y}_j^{m}=\bar{y}_j^{ac}$, i.e. the estimate from ACA, while the sample variance is given by

$$s_j^{m}=s_j^{ac}\frac{(n^{ac}-1)}{(n-1)},$$

where $s_j^{ac}$ is the sample variance estimated from the $n^{ac}$ available units. Under a Missing Completely At Random (MCAR) assumption, $s_j^{ac}$ is a consistent estimator of the true variance, so the sample variance from the imputed data, $s_j^{m}$, systematically underestimates the true variance by a factor of $(n^{ac}-1)/(n-1)$. This clearly comes from the fact that missing data are imputed using values at the centre of the distribution. The imputation distorts the empirical distribution of the observed values as well as any quantities that are not linear in the data (e.g. variances, percentiles, measures of shape). The sample covariance of $y_j$ and $y_k$ from the imputed data is

$$s_{jk}^{m}=s_{jk}^{ac}\frac{(n_{jk}^{ac}-1)}{(n-1)},$$

where $n_{jk}^{ac}$ is the number of units with both variables observed and $s_{jk}^{ac}$ is the corresponding covariance estimate from ACA. Under MCAR, $s_{jk}^{ac}$ is a consistent estimator of the true covariance, so that $s_{jk}^{m}$ underestimates the magnitude of the covariance by a factor of $(n_{jk}^{ac}-1)/(n-1)$. The obvious adjustments, multiplying the variance by $(n-1)/(n^{ac}-1)$ and the covariance by $(n-1)/(n_{jk}^{ac}-1)$, simply yield the ACA estimates, which could lead to covariance matrices that are not positive definite.
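The shrinkage of the variance can be checked numerically. The following is a minimal Python sketch, assuming NumPy is available; the helper `mean_impute`, the simulated data and the 30% missingness rate are illustrative choices and are not part of the original discussion.

```python
import numpy as np

def mean_impute(y):
    """Replace NaN entries of a 1-D array with the mean of the observed entries."""
    y = np.asarray(y, dtype=float)
    filled = y.copy()
    filled[np.isnan(y)] = np.nanmean(y)
    return filled

rng = np.random.default_rng(0)
n = 200
y = rng.normal(loc=10.0, scale=2.0, size=n)
y[rng.random(n) < 0.3] = np.nan        # MCAR: missingness unrelated to the values

observed = y[~np.isnan(y)]
n_ac = observed.size                   # number of available units
s_ac = observed.var(ddof=1)            # available-case sample variance
s_m = mean_impute(y).var(ddof=1)       # sample variance of the filled-in data
shrink = (n_ac - 1) / (n - 1)          # shrinkage factor from the text

print(f"variance after imputation: {s_m:.4f}")
print(f"s_ac * (n_ac - 1)/(n - 1): {s_ac * shrink:.4f}")
```

The two printed values agree because the imputed units sit exactly at the observed mean and therefore add nothing to the sum of squared deviations, while the divisor grows from $n^{ac}-1$ to $n-1$.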

Regression Imputation

An improvement over SI-M is to impute each missing value using the conditional mean given the observed values, a method known as SI-R or Conditional Mean Imputation. To be precise, it would also be possible to impute conditional means without using a regression approach, for example by grouping individuals into adjustment classes (analogous to weighting methods) based on the observed data and then imputing the missing values using the observed means in each adjustment class (Little and Rubin (2019)). However, for the sake of simplicity, here we will assume that SI-R and conditional mean imputation are the same.

To generate imputations under SI-R, consider a set of $J-1$ fully observed response variables $y_1,\ldots,y_{J-1}$ and a partially observed response variable $y_J$, which has the first $n_{cc}$ units observed and the remaining $n-n_{cc}$ units missing. SI-R computes the regression of $y_J$ on $y_1,\ldots,y_{J-1}$ based on the $n_{cc}$ complete units and then fills in the missing values as predictions from the regression. For example, for unit $i$, the missing value $y_{iJ}$ is imputed using

$$\hat{y}_{iJ}=\hat{\beta}_{J0}+\sum_{j=1}^{J-1}\hat{\beta}_{Jj}y_{ij},$$

where $\hat{\beta}_{J0}$ is the intercept and $\hat{\beta}_{Jj}$ is the $j$-th coefficient of the regression of $y_J$ on $y_1,\ldots,y_{J-1}$ based on the $n_{cc}$ units.
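The imputation rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: NumPy is available, missing values are coded as NaN, and the function name `regression_impute` and the simulated data are hypothetical.

```python
import numpy as np

def regression_impute(X, y):
    """Fill NaN entries of y with OLS predictions from the fully observed columns of X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    miss = np.isnan(y)
    design = np.column_stack([np.ones(len(y)), X])   # intercept plus predictors
    # Fit the regression on the complete units only
    beta, *_ = np.linalg.lstsq(design[~miss], y[~miss], rcond=None)
    filled = y.copy()
    filled[miss] = design[miss] @ beta                # conditional mean imputations
    return filled, beta

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)
y[rng.random(n) < 0.3] = np.nan                       # MCAR missingness in y only

filled, beta = regression_impute(X, y)
print("estimated coefficients (intercept first):", beta)
```

Every imputed value lies exactly on the fitted regression surface, which is why this approach understates the variability of the partially observed variable.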

An extension of regression imputation to a general pattern of missing data is known as Buck’s method (Buck (1960)). This approach first estimates the population mean $\mu$ and covariance matrix $\Sigma$ from the sample mean and covariance matrix of the complete units, and then uses these estimates to calculate the OLS regressions of the missing variables on the observed variables for each missing data pattern. Predictions of the missing data for each observation are obtained by substituting the values of the observed variables into the regressions. The averages of the observed and imputed values from this method are consistent estimates of the means under MCAR and mild assumptions about the moments of the distribution (Buck (1960)). They are also consistent when the missingness mechanism depends on observed variables, i.e. under a Missing At Random (MAR) assumption, although additional assumptions are required in this case (e.g. when linear regressions are used, the “true” regression of the missing variables on the observed variables is assumed to be linear).

The filled-in data from Buck’s method typically yield reasonable estimates of means, while the sample variances and covariances are biased, although the bias is smaller than that associated with unconditional mean imputation. Specifically, the sample variance $\sigma_j^{2,SI\text{-}R}$ from the imputed data underestimates the true variance $\sigma_j^2$ by the quantity $\frac{1}{n-1}\sum_{i=1}^{n}\sigma_{ji}^2$, where $\sigma_{ji}^2$ is the residual variance from regressing $y_j$ on the variables observed in unit $i$ if $y_{ij}$ is missing and zero if $y_{ij}$ is observed. The sample covariance of $y_j$ and $y_k$ falls short of the true covariance by $\frac{1}{n-1}\sum_{i=1}^{n}\sigma_{jki}$, where $\sigma_{jki}$ is the residual covariance from the multivariate regression of $(y_{ij},y_{ik})$ on the variables observed in unit $i$ if both variables are missing and zero otherwise. A consistent estimator of $\Sigma$ can be constructed under MCAR by substituting consistent estimates of $\sigma_{ji}^2$ and $\sigma_{jki}$ into the expressions for the bias and then adding the resulting quantities to the sample covariance matrix of the filled-in data.
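For a single partially observed variable, the variance correction described above reduces to adding the estimated residual variance, weighted by the share of imputed units, to the variance of the filled-in data. The sketch below illustrates this in Python under simplifying assumptions: NumPy is available, there are only two variables, the data are simulated with unit variances and correlation 0.8, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho = 5000, 0.8
y1 = rng.normal(size=n)                                   # fully observed
y2 = rho * y1 + np.sqrt(1 - rho**2) * rng.normal(size=n)  # true variance of y2 is 1
miss = rng.random(n) < 0.4                                # MCAR indicator for y2

# Regression of y2 on y1 fitted on the complete units
design = np.column_stack([np.ones(n), y1])
beta, *_ = np.linalg.lstsq(design[~miss], y2[~miss], rcond=None)
resid = y2[~miss] - design[~miss] @ beta
sigma2_res = resid @ resid / ((~miss).sum() - 2)          # residual variance estimate

# Conditional mean imputation, then the additive correction
filled = np.where(miss, design @ beta, y2)
naive_var = filled.var(ddof=1)
corrected_var = naive_var + miss.sum() / (n - 1) * sigma2_res

print(f"filled-in variance:  {naive_var:.3f}")
print(f"corrected variance:  {corrected_var:.3f}")
```

The residual variance contributes only for the units whose value was imputed, which is why the correction is scaled by the number of missing units over $n-1$.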

Stochastic Regression Imputation

Any type of mean or regression imputation will lead to bias when the interest is in the tails of the distributions because “best prediction” imputation systematically underestimates variability and standard errors calculated from the imputed data are typically too small. These considerations suggest an alternative imputation strategy, where imputed values are drawn from a predictive distribution of a plausible set of values rather than from the centre of the distribution. This is the idea behind SI-SR, which imputes a conditional draw

$$\hat{y}_{iJ}=\hat{\beta}_{J0}+\sum_{j=1}^{J-1}\hat{\beta}_{Jj}y_{ij}+z_{iJ},$$

where $z_{iJ}$ is a random normal deviate with mean 0 and variance $\hat{\sigma}_J^2$, the residual variance from the regression of $y_J$ on $y_1,\ldots,y_{J-1}$ based on the complete units. The addition of the random deviate makes the imputation a random draw from the predictive distribution of the missing values, rather than the mean, which is likely to ameliorate the distortion of the predictive distributions (Little and Rubin (2019)).
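A conditional draw differs from a conditional mean only by the added normal deviate. The following Python sketch makes the comparison concrete, assuming NumPy is available and missing values are coded as NaN; the function `impute_conditional`, its `rng` switch and the simulated bivariate data are illustrative assumptions.

```python
import numpy as np

def impute_conditional(y1, y2, rng=None):
    """Impute NaN entries of y2 from a regression on y1.

    With rng=None the conditional mean is imputed (SI-R); with a NumPy
    Generator a normal residual draw is added to each prediction (SI-SR).
    """
    y1 = np.asarray(y1, dtype=float)
    y2 = np.asarray(y2, dtype=float)
    miss = np.isnan(y2)
    design = np.column_stack([np.ones(len(y1)), y1])
    beta, *_ = np.linalg.lstsq(design[~miss], y2[~miss], rcond=None)
    filled = y2.copy()
    filled[miss] = design[miss] @ beta
    if rng is not None:
        resid = y2[~miss] - design[~miss] @ beta
        # Residual variance from the complete units
        sigma2_hat = resid @ resid / ((~miss).sum() - design.shape[1])
        filled[miss] += rng.normal(0.0, np.sqrt(sigma2_hat), size=miss.sum())
    return filled

rng = np.random.default_rng(42)
n, rho = 5000, 0.8
y1 = rng.normal(size=n)
y2 = rho * y1 + np.sqrt(1 - rho**2) * rng.normal(size=n)   # true variance of y2 is 1
miss = rng.random(n) < 0.4
y2[miss] = np.nan

filled_cm = impute_conditional(y1, y2)          # conditional mean
filled_cd = impute_conditional(y1, y2, rng)     # conditional draw
var_cm = filled_cm.var(ddof=1)
var_cd = filled_cd.var(ddof=1)
print(f"variance with conditional means: {var_cm:.3f}")
print(f"variance with conditional draws: {var_cd:.3f}")
```

In runs of this kind the conditional-draw variance should sit close to the true value, while the conditional-mean variance is visibly smaller, reflecting the missing residual variability.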

Example

Consider bivariate normal data with a monotone missingness pattern, where $y_1$ is fully observed and $y_2$ is missing for a fraction $\lambda=(n-n_{cc})/n$ of the units under a MCAR mechanism. The following table shows the large sample bias of standard OLS estimates obtained from the filled-in data for the mean and the variance of $y_2$, the regression coefficient of $y_2$ on $y_1$, and the regression coefficient of $y_1$ on $y_2$, using four different single imputation methods: unconditional mean (UM), unconditional draw (UD), conditional mean (CM), and conditional draw (CD).

Table 1: Bivariate normal monotone MCAR data; large sample bias of four imputation methods.
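| Method | mu_2 | sigma_2 | beta_21 | beta_12 |
|--------|------|---------|---------|---------|
| UM | 0 | -lambda * sigma_2 | -lambda * beta_21 | 0 |
| UD | 0 | 0 | -lambda * beta_21 | -lambda * beta_12 |
| CM | 0 | -lambda * (1-rho^2) * sigma_2 | 0 | ((lambda * (1-rho^2)) / (1-lambda * (1-rho^2))) * beta_12 |
| CD | 0 | 0 | 0 | 0 |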

Under MCAR, all four methods yield consistent estimates of $\mu_2$, but both UM and CM underestimate the variance $\sigma_2$ and UD leads to attenuation of the regression coefficients, while CD yields consistent estimates of all four parameters. However, CD has some important drawbacks. First, adding random draws to the conditional mean imputations is inefficient, as the large sample variance of the CD estimate of $\mu_2$ can be shown (Little and Rubin (2019)) to be

$$\left[1-\lambda\rho^2+(1-\rho^2)\lambda(1-\lambda)\right]\frac{\sigma_2}{n_{cc}},$$

which is larger than the large sample sampling variance of the CM estimate of $\mu_2$, namely $\left[1-\lambda\rho^2\right]\sigma_2/n_{cc}$. Second, the standard errors of the CD estimates from the imputed data are too small because they do not incorporate imputation uncertainty.
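To get a feel for the size of this efficiency loss, the two large sample variances can be evaluated for specific inputs. The short Python sketch below does this; the function name and the chosen values ($\lambda=0.5$, $\rho=0.5$, $\sigma_2=1$, $n_{cc}=100$) are purely illustrative.

```python
def large_sample_variance_mu2(lam, rho, sigma2, n_cc):
    """Large sample variances of the CM and CD estimates of mu_2.

    lam: fraction of units with y2 missing, rho: correlation of y1 and y2,
    sigma2: variance of y2, n_cc: number of complete units.
    """
    cm = (1 - lam * rho**2) * sigma2 / n_cc
    cd = (1 - lam * rho**2 + (1 - rho**2) * lam * (1 - lam)) * sigma2 / n_cc
    return cm, cd

cm, cd = large_sample_variance_mu2(lam=0.5, rho=0.5, sigma2=1.0, n_cc=100)
print(f"CM variance: {cm:.6f}")
print(f"CD variance: {cd:.6f}")
print(f"relative increase: {cd / cm - 1:.1%}")
```

With these inputs the CD variance is roughly 21% larger than the CM variance. The extra term $(1-\rho^2)\lambda(1-\lambda)$ vanishes when nothing is missing, when everything is missing, or when $y_1$ predicts $y_2$ perfectly.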

When the analysis involves units with some covariates missing and others observed, it is common practice to condition on the observed covariates when generating the imputations for the missing covariates. It is also possible to condition on the outcome $y$ when imputing missing covariates, even if the final objective is to regress $y$ on the full set of covariates. Conditioning on $y$ leads to bias when conditional means are imputed; if predictive draws are imputed instead, this approach yields consistent estimates of the regression coefficients. Imputing missing covariates using means that condition only on the observed covariates (and not also on $y$) also yields consistent estimates of the regression coefficients under certain conditions, although these are typically less efficient than those from CCA, and it yields inconsistent estimates of other parameters such as variances and correlations (Little (1992)).

Conclusions

According to Little and Rubin (2019), imputation should generally be

  1. Conditional on observed variables, to reduce bias, improve precision and preserve association between variables.

  2. Multivariate, to preserve association between missing variables.

  3. Draws from the predictive distributions rather than means, to provide valid estimates of a wide range of estimands.

Nevertheless, a main problem of SI methods is that inferences based on the imputed data do not account for imputation uncertainty and standard errors are therefore systematically underestimated, p-values of tests are too significant and confidence intervals are too narrow.

References

Buck, Samuel F. 1960. “A Method of Estimation of Missing Values in Multivariate Data Suitable for Use with an Electronic Computer.” Journal of the Royal Statistical Society: Series B (Methodological) 22 (2): 302–6.
Little, Roderick JA. 1992. “Regression with Missing x’s: A Review.” Journal of the American Statistical Association 87 (420): 1227–37.
Little, Roderick JA, and Donald B Rubin. 2019. Statistical Analysis with Missing Data. Vol. 793. John Wiley & Sons.
Schafer, Joseph L, and John W Graham. 2002. “Missing Data: Our View of the State of the Art.” Psychological Methods 7 (2): 147.
