Multiply Imputing an Outcome Variable

Posted on October 12th, 2011

I've become frustrated lately with the idea that one should not impute the dependent variable. The claim may seem plausible at first, but a little more thought reveals that the outcome is far too important to listwise delete. In fact, I'm not sure how one can suggest that MI works for explanatory variables but not for outcomes. There might be a good argument for it, but I haven't heard one. Below I lay out an intuitive case for multiply imputing outcomes and illustrate with a simulated data set.

Suppose we are interested in estimating the effect of education on income, that survey respondents' education is completely observed, and that richer respondents are less likely to report their income. We have missingness in our outcome variable. Is MI beneficial in this case? Absolutely! Think about it simplistically. If education has a positive effect on income, as we might guess, then we should observe the income of most of our less educated respondents. Among more educated respondents, however, we'll tend to observe income only for those who make less than expected. Overall, this should cause us to underestimate the effect of education.

The intuitive solution, of course, is to make a guess about what the missing incomes are. Education doesn't offer any new information here, since it's already in the model, but we might think of covariates that are causally posterior to income, such as one's occupation or the kind of car one drives. If we can come up with variables (in addition to those we'll use in the final model) to predict our outcome, we can reduce this bias.
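To see the mechanism in miniature, here's a deliberately crude sketch: a single regression imputation (not proper MI, since it ignores imputation uncertainty) with one explanatory variable and one variable caused by the outcome. All the names and numbers here are mine, invented purely for illustration.

# toy sketch: one regression imputation using a variable caused by y
set.seed(1)
n = 10000
x = rnorm(n) # explanatory variable
y = x + rnorm(n) # outcome, true slope is 1
z = y + rnorm(n) # variable caused by y
y.obs = ifelse(rbinom(n, 1, pnorm(y)) == 1, NA, y) # richer -> more missing
coef(lm(y.obs ~ x)) # listwise deletion: slope biased toward 0
fit = lm(y.obs ~ x + z) # imputation model adds z
y.fill = ifelse(is.na(y.obs), predict(fit, data.frame(x, z)), y.obs)
coef(lm(y.fill ~ x)) # slope noticeably closer to 1

Even this one-shot fill-in recovers much of the attenuated slope; proper MI does the same thing while also propagating the uncertainty in the filled-in values.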

To illustrate with proper multiple imputation, I ran a quick simulation. Keep in mind that in the simulation below, the missingness is non-ignorable, meaning that the probability that y is missing depends on y itself. In fact, in this simulation, that is all it depends on, which is one of the worst situations for MI. But because I have several variables that are caused by y, and thus wouldn't be included in the final analysis model, I can leverage them to reduce the bias caused by missingness.

# simulate a data set
n = 100000
x1 = rnorm(n) # x's are explanatory variables
x2 = rnorm(n)
y = 0 + x1 + x2 + rnorm(n) # y is the outcome variable
z1 = y + rnorm(n, 0) # z's are variables caused by y
z2 = y + rnorm(n, 0)
z3 = y + rnorm(n, 0)

# generate missing observations
p.mis = pnorm(y) # probability of missingness increases with y
mis = rbinom(n, 1, p.mis)
y[mis == 1] = NA
data = data.frame(y, x1, x2, z1, z2, z3)
# mi and estimate models
library(Amelia) # amelia() comes from the Amelia package
library(Zelig)
a = amelia(data)
mi = zelig(y ~ x1 + x2, data = a$imputations, model = "ls") # MI estimates
ld = zelig(y ~ x1 + x2, data = data, model = "ls") # listwise deletion
summary(mi); summary(ld)
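As a sanity check on the imputation model (a quick sketch using the a object from above), Amelia's overimpute() diagnostic treats each observed value of y as if it were missing, re-imputes it, and plots the imputations against the actual values:

# diagnostic: points hugging the 45-degree line mean the z's
# are recovering the outcome well
overimpute(a, var = "y")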

In this simulation, the true coefficients of x1 and x2 are 1. The estimates using listwise deletion are about 0.76, with a tiny standard error of about 0.004. This is way off. The MI estimates of x1 and x2 are much better, at about 0.925. The bias is substantially reduced, although still present, because our simulation generates non-ignorable missingness.
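If you'd rather see where the pooled MI numbers come from, the combination step can be done by hand with Amelia's mi.meld(), which applies Rubin's rules to the per-imputation estimates. This sketch assumes the a object from the simulation above and should agree with what zelig() reports:

# pool per-imputation estimates and standard errors with Rubin's rules
b = se = NULL
for (i in 1:a$m) {
  fit = lm(y ~ x1 + x2, data = a$imputations[[i]])
  b = rbind(b, coef(fit))
  se = rbind(se, sqrt(diag(vcov(fit))))
}
mi.meld(q = b, se = se) # pooled coefficients and standard errors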

I think the case for multiply imputing outcomes is at least somewhat intuitive, and this simulation shows that it can help and offers some sense of why. I'd be interested in hearing from someone who opposes MI for outcomes. Perhaps there are non-trivial conditions in which it can make you worse off.