What should a linear models class cover?
Posted on August 3rd, 2012
I was recently asked what a linear models class for political science graduate students should cover. My somewhat uneducated understanding is that many political science departments have a three-course methods sequence that resembles the following:
- Research design and the basics of inference
- Linear models and OLS
- Generalized linear models and MLE
It seems like the goals of the linear models class are to (1) provide a thorough introduction to linear models and (2) serve as a foundation to introduce generalized linear models. Given this, what should be covered in a linear models class?
- introduce the basic linear model using both scalar and matrix notation
- discuss hypothesis testing using linear models
- consider violations of the assumptions of the model.
This seems great. I think that this framework makes a lot of sense.
However, there are several points I would add. If I taught a linear models course, these are the things I would want my students to learn that the standard curriculum might not emphasize.
A general approach to modeling
I would talk every day about creating models, evaluating models, and learning from models. Estimation is not as important.
- start with a substantive understanding of the phenomenon of interest
- move from this understanding to a model (of which a “null model” is a subset)
- constant effects?
- distribution of errors?
- estimate the model
- least squares
- non-parametric estimation
- evaluate the estimated model
- check the modeling assumptions
- does the simulated data look like the observed data when plotted in several different ways?
- move from the estimated model to an understanding of the substance
- simulation of quantities of interest and uncertainty
- clear explanation of statistical arguments (such as hypothesis tests)
- lots of graphs to clearly communicate the effects of interest
- posterior probabilities
- hypothesis tests
- confidence intervals
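To make the "evaluate the estimated model" step concrete, here is a minimal sketch of a simulation-based check, assuming a single predictor and normal errors. The data, the helper names (`ols_fit`, `predictive_check`), and the choice of the largest absolute residual as the summary are all illustrative assumptions, not a prescribed recipe. The idea is to fit the model, simulate new data sets from the fit, and ask whether the observed data look like the simulated ones.

```python
import math
import random
import statistics


def ols_fit(x, y):
    """Least-squares intercept and slope for one predictor."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def max_abs_residual(x, y):
    """Largest absolute residual after refitting the line to (x, y)."""
    a, b = ols_fit(x, y)
    return max(abs(yi - (a + b * xi)) for xi, yi in zip(x, y))


def predictive_check(x, y, n_sims=500, seed=1):
    """Share of simulated data sets whose largest absolute residual
    is at least as large as the one in the observed data."""
    rng = random.Random(seed)
    a, b = ols_fit(x, y)
    ssr = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    sigma = math.sqrt(ssr / (len(x) - 2))  # residual standard deviation
    observed = max_abs_residual(x, y)
    exceed = 0
    for _ in range(n_sims):
        # Simulate outcomes from the fitted model, assuming normal errors
        y_sim = [a + b * xi + rng.gauss(0, sigma) for xi in x]
        if max_abs_residual(x, y_sim) >= observed:
            exceed += 1
    return exceed / n_sims


# Hypothetical data: a linear trend with normal noise plus one gross outlier
rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(100)]
y = [1 + 2 * xi + rng.gauss(0, 1) for xi in x]
y[0] += 10

share = predictive_check(x, y)
print(f"share of simulations at least as extreme as the data: {share:.3f}")
```

A share near zero says the fitted model almost never produces a residual as large as the one observed, which is a signal to revisit the assumed error distribution. The same loop works with any summary that can be computed on both the real and the simulated data.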
Lots and lots of simulation.
All applied researchers should have simulation skills in their toolkit.
- Simulate the data they would expect if their theory were right
- Simulate data that would occur if their theory were wrong
- Use bootstrapping to compute standard errors. This is motivated by a random sampling perspective.
- Use randomization/permutation to compute standard errors. This is motivated by a random assignment perspective.
- Use simulation from estimated models to compute quantities of interest (both fundamental and estimation uncertainty). This is especially useful when the outcome or predictors are transformed.
- Simulate data in which model assumptions are wrong and see what answers the linear model gives
- non-constant effects
- measurement error (random and biased)
- missing data
- omitted variables
- heterogeneous effects
- non-normal/heavy-tailed errors
- correlated errors
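Two of the exercises above can be sketched in a few lines. This is only an illustration under assumptions I am making up for the example: a single predictor `x`, a hypothetical confounder `z`, normal noise, and a true effect of `x` equal to 1. The first part bootstraps the standard error of a slope by resampling rows; the second part shows what least squares reports when `z` is left out of the model.

```python
import math
import random
import statistics


def ols_fit(x, y):
    """Least-squares intercept and slope for one predictor."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residualize(y, z):
    """Residuals from a least-squares regression of y on z."""
    a, b = ols_fit(z, y)
    return [yi - (a + b * zi) for yi, zi in zip(y, z)]


def analytic_slope_se(x, y):
    """Textbook standard error of the slope under constant error variance."""
    a, b = ols_fit(x, y)
    ssr = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    x_bar = statistics.fmean(x)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return math.sqrt(ssr / (len(x) - 2) / sxx)


def bootstrap_slope_se(x, y, n_boot=300, seed=1):
    """Standard deviation of the slope across resampled data sets."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        slopes.append(ols_fit([x[i] for i in idx], [y[i] for i in idx])[1])
    return statistics.stdev(slopes)


# Hypothetical data-generating process: z drives both x and y
rng = random.Random(0)
n = 1000
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [1.0 * xi + 2.0 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]

# Omitting z: the slope on x absorbs part of z's effect
short_slope = ols_fit(x, y)[1]
# Controlling for z by partialling it out of both x and y
long_slope = ols_fit(residualize(x, z), residualize(y, z))[1]

se_boot = bootstrap_slope_se(x, y)
se_analytic = analytic_slope_se(x, y)
print(f"slope omitting z: {short_slope:.2f}, slope controlling for z: {long_slope:.2f}")
print(f"bootstrap SE: {se_boot:.3f}, analytic SE: {se_analytic:.3f}")
```

In this setup the slope that omits `z` should land near 2 rather than the true value of 1, because `x` and `z` are correlated. The other violations in the list can be studied the same way: change the line that generates `y` and see how the estimates respond.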
Alternative estimation routines/philosophies
I don't care for calling political methodology classes "OLS" and "MLE." These are estimation strategies, and while important, they are far less important than building models and drawing conclusions from them. I would call the three-course sequence "Inference," "Linear Models," and "Generalized Linear Models." This puts more emphasis on the substance of the model than on the particular routine used to estimate it. I would also make it clear that least squares is only one of several ways to estimate linear models. It is by far the most popular, but it is not the best choice in every situation.
- least squares
- maximum likelihood
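To show that the model and the estimation routine are separate choices, here is a small sketch that estimates the same linear model two ways. It assumes a regression through the origin so that there is only one parameter to search over, and the data, the grid of candidate slopes, and the function names are my own illustrative choices. With normal errors, maximizing the likelihood picks the least-squares slope. With Laplace (double-exponential) errors, maximizing the likelihood amounts to minimizing absolute deviations, so the preferred slope can differ.

```python
import math
import random


def slope_through_origin(x, y):
    """Least-squares slope for the model y = slope * x + error."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def gaussian_loglik(slope, x, y, sigma=1.0):
    """Log-likelihood under normal errors with a fixed sigma."""
    ssr = sum((yi - slope * xi) ** 2 for xi, yi in zip(x, y))
    return -0.5 * len(x) * math.log(2 * math.pi * sigma ** 2) - ssr / (2 * sigma ** 2)


def laplace_loglik(slope, x, y, scale=1.0):
    """Log-likelihood under Laplace errors with a fixed scale."""
    sad = sum(abs(yi - slope * xi) for xi, yi in zip(x, y))
    return -len(x) * math.log(2 * scale) - sad / scale


# Hypothetical data: mostly normal noise, with an occasional large shock
rng = random.Random(0)
x = [rng.uniform(1, 10) for _ in range(200)]
y = [2 * xi + rng.gauss(0, 1) + (8 if rng.random() < 0.05 else 0) for xi in x]

ols_slope = slope_through_origin(x, y)

# Candidate slopes on a grid centered on the least-squares estimate.
# The maximizing slope does not depend on the fixed sigma or scale.
offsets = [k * 0.01 for k in range(-50, 51)]
best_gaussian_offset = max(offsets, key=lambda d: gaussian_loglik(ols_slope + d, x, y))
best_laplace_offset = max(offsets, key=lambda d: laplace_loglik(ols_slope + d, x, y))

print(f"least-squares slope: {ols_slope:.3f}")
print(f"normal-likelihood maximizer: {ols_slope + best_gaussian_offset:.3f}")
print(f"Laplace-likelihood maximizer: {ols_slope + best_laplace_offset:.3f}")
```

A grid search is a crude optimizer, but it keeps the point visible: the estimate follows from the assumed distribution of errors, not from the name of the routine.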
Lots and lots of hypothesis testing
The most important step empirical researchers make is drawing conclusions from statistical models. I would make students think through statistical arguments every day. Whether you like it or not, hypothesis testing is an important part of almost every political science paper and students need to understand it backward and forward.
- Hypothesis testing
- Choose an appropriate summary statistic
- Describe the space in which this statistic lies
- Describe where the statistic would lie if the research hypothesis were true
- Describe where the statistic would lie if the research hypothesis were false
- Compute a p-value
- Know the size of tests
- Know the power of tests
- Multiple comparisons
- Understand that 95% confidence intervals don't contain 95% of the data
- Understand that statistical insignificance should never be taken as evidence for no effect.
- Understand that the difference between a statistically significant coefficient and a statistically insignificant one is not always itself statistically significant.
- Make students think really hard and formally about substantive significance.
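The size and power items lend themselves to simulation. As a rough sketch, assume a two-group comparison with normal outcomes, the absolute difference in means as the summary statistic, and a permutation test for the p-value. The sample sizes, effect size, and function names are assumptions chosen for illustration. Size is the rejection rate when there is no effect, and power is the rejection rate when there is one.

```python
import random
import statistics


def permutation_p_value(treated, control, n_perm, rng):
    """Two-sided permutation p-value for a difference in group means."""
    observed = abs(statistics.fmean(treated) - statistics.fmean(control))
    pooled = treated + control
    n_t = len(treated)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(statistics.fmean(pooled[:n_t]) - statistics.fmean(pooled[n_t:]))
        if diff >= observed:
            extreme += 1
    # Count the observed assignment as one of the permutations
    return (extreme + 1) / (n_perm + 1)


def rejection_rate(effect, n_per_group=20, n_sims=200, n_perm=199,
                   alpha=0.05, seed=1):
    """Share of simulated studies in which the test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        treated = [rng.gauss(effect, 1) for _ in range(n_per_group)]
        if permutation_p_value(treated, control, n_perm, rng) <= alpha:
            rejections += 1
    return rejections / n_sims


size = rejection_rate(effect=0.0)   # no true effect: how often do we reject?
power = rejection_rate(effect=1.0)  # true effect of one standard deviation
print(f"estimated size: {size:.3f}, estimated power: {power:.3f}")
```

With only 200 simulated studies the estimates are noisy, so the estimated size will only be roughly 0.05. Rerunning with a smaller `effect` or a smaller `n_per_group` shows how quickly power falls, which is a good way to see why an insignificant result is weak evidence of no effect.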
I would have the students read Tufte and make lots of good graphs.
- Plot their data in several ways
- Plot their models
- Plot model predictions on top of their data
- Run lots of models on tiny subsets of data and plot the results.
I would want students to understand the properties of models using simulation. Every week, I would want students to walk through the process of modeling using real data sets. Even for the methodologically savvy, I think great ideas come from thinking hard about substantive problems, working out a good statistical model for a data set, and clearly communicating the results of that model.