Multiply Imputing an Outcome Variable

Posted: 10.12.2011

I've become frustrated lately with the idea that one should not impute the dependent variable. While perhaps difficult to see at first, just a little more thought should reveal that the outcome is far too important to listwise delete. In fact, I'm not sure how one can suggest that MI works for explanatory variables, but not for outcomes. There might be a good argument for it, but I haven't heard one. Below I lay out an intuitive case for multiply imputing outcomes and illustrate with a simulated data set.

Suppose we are interested in estimating the effect of education on income, that survey respondents' education is completely observed, and that richer respondents are less likely to report their income. We have missingness in our outcome variable. Is MI beneficial in this case? Absolutely! Think about it in simple terms. If education has a positive effect on income, as we might guess, then we should observe the income of most of our less educated respondents. Among more educated respondents, however, we'll tend to observe income only for those who make less than their education would predict. Overall, this should cause us to underestimate the effect of education.
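The attenuation argument can be checked with a few lines of base R. This is only a minimal sketch: the variable names (education, income, reported), the sample size, and the rule that the probability of not reporting is pnorm(income) are all assumptions made for illustration, not features of any real survey.

```r
# Hypothetical illustration: missingness in the outcome that depends on the outcome
set.seed(1)
n <- 10000
education <- rnorm(n)               # fully observed explanatory variable
income <- education + rnorm(n)      # assumed true slope of 1

# Assumed rule: richer respondents are less likely to report income
p_missing <- pnorm(income)
reported <- rbinom(n, 1, p_missing) == 0

# Slope using everyone (not available in practice) vs. reporters only
full_slope <- coef(lm(income ~ education))[["education"]]
cc_slope <- coef(lm(income ~ education, subset = reported))[["education"]]

# full_slope should sit near 1; cc_slope should be noticeably smaller
c(full = full_slope, complete_cases = cc_slope)
```

Under these assumptions, the regression that uses only the respondents who reported income should give a visibly smaller slope than the regression on the full data.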

The intuitive solution, of course, is to make a guess about what the missing incomes are. Education is already in the analysis model, so it doesn't offer us any new information, but we might think of some covariates that are causally posterior to income, such as one's occupation or the kind of car one drives. If we can come up with variables (in addition to those we'll use in the final model) to predict our outcome, we can reduce this bias.

To illustrate, I did a quick simulation. Keep in mind that in the simulation below, the missingness is non-ignorable, meaning that the probability that y is missing depends on y. In fact, in this simulation, that is all it depends on, which is one of the worst situations for MI. But because I have several variables that are caused by y, and thus wouldn't be included in the final analysis model, I can leverage these to reduce the bias caused by missingness.


# simulate a data set
n = 100000
x1 = rnorm(n) # x's are explanatory variables
x2 = rnorm(n)
y = 0 + x1 + x2 + rnorm(n) # y is the outcome variable
z1 = y + rnorm(n, 0) # z's are variables caused by y
z2 = y + rnorm(n, 0)
z3 = y + rnorm(n, 0)


# generate missing observations
p.mis = pnorm(y)
mis = rbinom(n, 1, p.mis)
y[mis == 1] = NA
data = data.frame(y, x1, x2, z1, z2, z3)


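# mi and estimate models
library(Amelia) # provides amelia()
library(Zelig) # provides zelig()
a = amelia(data) # impute using all six variables, including the z's
mi = zelig(y ~ x1 + x2, data = a$imputations, model = "ls") # fit on the imputed data sets
ld = zelig(y ~ x1 + x2, data = data, model = "ls") # listwise deletion
summary(mi); summary(ld)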

In this simulation, the true coefficients of x1 and x2 are 1. The listwise deletion estimates are about 0.76, with a tiny standard error of about 0.004. This is way off. The MI estimates for x1 and x2 are much better, at about 0.925. The bias is substantially reduced, although still present, because the simulation generates non-ignorable missingness.
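To see why the z variables help, the mechanism can be sketched in base R without Amelia or Zelig. Note the assumptions: this uses a single stochastic regression imputation (predicted value plus random noise), not proper multiple imputation, so it ignores uncertainty in the imputation model and its standard errors would be too small. The smaller n, the seed, and the object names (dat, imp_fit, si_fit, and so on) are hypothetical choices for this sketch.

```r
# Hypothetical sketch: one stochastic regression imputation using the z's
set.seed(2011)
n <- 20000
x1 <- rnorm(n)
x2 <- rnorm(n)
y_true <- x1 + x2 + rnorm(n)
z1 <- y_true + rnorm(n)             # z's are caused by y
z2 <- y_true + rnorm(n)
z3 <- y_true + rnorm(n)

# Same missingness rule as the simulation: larger y is more likely missing
y <- ifelse(rbinom(n, 1, pnorm(y_true)) == 1, NA, y_true)
dat <- data.frame(y, x1, x2, z1, z2, z3)

# Listwise deletion: lm() drops rows with missing y by default
ld_fit <- lm(y ~ x1 + x2, data = dat)

# Imputation model includes the z's, which carry information about y
imp_fit <- lm(y ~ x1 + x2 + z1 + z2 + z3, data = dat)
miss <- is.na(dat$y)
dat_imp <- dat
dat_imp$y[miss] <- predict(imp_fit, newdata = dat[miss, ]) +
  rnorm(sum(miss), mean = 0, sd = summary(imp_fit)$sigma)

# Analysis model on the filled-in data
si_fit <- lm(y ~ x1 + x2, data = dat_imp)

ld_x1 <- coef(ld_fit)[["x1"]]
si_x1 <- coef(si_fit)[["x1"]]
c(listwise = ld_x1, imputed = si_x1)
```

Because the z's predict y far better than x1 and x2 alone, the filled-in values should pull the x1 coefficient back toward 1, though not all the way, since the imputation model is itself fit only to the observed cases.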

I think it is at least somewhat intuitive that we should multiply impute our outcomes, and this simulation illustrates that it can be helpful and offers some intuition as to why. I'd be interested in hearing from someone who opposes MI for outcomes. Perhaps there are non-trivial conditions in which it can make you worse off.



  • Eric

    I'm a little confused by your introduction. The need for including all variables in a multiple imputation, including the dependent variable, is pretty emphatically and carefully described by all the authoritative sources on MI that I have read. Who is saying otherwise?

    • Carlisle Rainey

      One example occurs here, and I've run across other examples as well. It is my general experience that applied researchers are suspicious of imputing outcomes. I think this is because you are using a model that is presumably similar to the final analysis model to generate data. I believe the fear is that you are simulating data that artificially conforms to your expectations.

  • Scott

    I'm with you! I am a relative newcomer (the past year or so) to MI, but used an implementation of it (van Buuren's mice package for R) in a high-stakes testing research project. We definitely had to impute outcome variables, and we definitely faced some resistance to it. I think it's a special case of fear of the unknown. To those uninitiated to the large MI literature, "making up data" sounds awfully fishy, and somehow this seems worse for dependent measures. But I agree that there's really no logic behind this, since a variable is a variable is a variable. The only reason one little piece of the covariance matrix is seen as "special" is that it ends up on the left side of a regression equation. But the intuition is exactly wrong: leaving the DVs out of the imputation model is more likely to produce worse estimates, and therefore more bias in the imputed data.

    • Zoltan

      "I think it's a special case of fear of the unknown."

      Without commenting much on the MI (using it, happily), I just wanted to highlight that the phrase "fear of the unknown" in discussing missing data is priceless. I know Scott meant something else while using it, but this phrase is just priceless in this context!