Another Benefit of Publicly Version-Controlled Research

I've been thinking quite a bit lately about why and how political scientists should publicly version control their research projects. By research projects, I mean data, manuscript, and code. And by publicly version control, I mean use Git to version-control and post a public GitHub repository, from the beginning of the project, so that other researchers are free to follow and borrow as needed. Below I quickly summarize some of the benefits of version control and discuss another benefit that Git and GitHub have had on my own research.

My thinking about public version control for research projects began with Zach Jones's discussion of the idea in a recent issue of The Political Methodologist. Teaching the linear models class here at UB last semester solidified its importance in my mind. We used Dropbox for version control and sharing, but Git and GitHub are better.

Several recent articles and posts outline why researchers (as opposed to programmers) might use Git and GitHub. Here's a brief summary:

  1. A paper by Karthik Ram.
  2. A short essay by Zach Jones.
  3. A short essay by Christopher Gandrud.

If I've missed something, please let me know.

There are a lot of good reasons to do this:

  • History. We all use version control, and most of us do it poorly. Using Git/GitHub, I'm learning to do it better. Git/GitHub offers a formal way to easily maintain a complete history of a project. In general, it's good to avoid filenames like new_final_carlisle_v3c_updated.docx. A recent comic makes this point clear. We need a method of updating files while keeping track of the old versions so that we can go back if needed. But the approach of giving different filenames to different versions is inefficient at best, and my approach of keeping "old files" in a designated folder is no better. Git/GitHub solves these issues. Git also allows you to tag a project at certain stages, such as "Initial submission to AJPS." After getting an invitation to revise and resubmit and making the required changes, I can compare the current version of the project to the (now several months old) version I initially submitted. This makes writing response memos much easier.
  • Transparency. Zach Jones most clearly makes the point that Git/GitHub increases transparency in the context of political science research. Git/GitHub essentially allows others to actively monitor the progress of a project or study its past development. Related to the motivation of using GitHub in an open manner is the idea of an "open notebook." Carl Boettiger is one of the most well-known proponents of open notebooks. This kind of openness provides a wonderful opportunity to receive comments and suggestions from a wide audience, and it allows others to catch errors that might otherwise go unnoticed. It also gives readers a formal avenue to make suggestions, not to mention keeping a complete history of the suggestions and any subsequent discussion. GitHub allows the public to open issues, which is a wonderful way to receive and organize feedback on a paper.
  • Accessibility. Christopher Gandrud makes this point clearly in a recent edition of The Political Methodologist, though he discusses accessibility purely in the context of building data sets. But similar arguments can be made for code. I recently had a graduate student express interest in some MRP estimates of state-level opinion on the Affordable Care Act. I told her that I had spent some time collecting surveys and writing code to produce the estimates. I noted that, ideally, she would not duplicate my work but, if possible, build on it. I was able to point her to the GitHub repository for the project, which I hope she'll find useful as a starting point for her own work. In my experience supervising replication projects in graduate methods classes, and in my own experience with replication data, the clean, final versions of the data that researchers typically post publicly do not allow future researchers to easily build on previous work. If authors posted the raw data and all the (possibly long and messy) code to do the cleaning and recoding, it would be much easier for future researchers to build on past contributions. Indeed, research shows that making data and code freely available lowers the barriers to reuse and increases citations.

These are the commonly cited reasons for using Git and GitHub. But in my own practice, I've found another reason, perhaps more important than those above.

One thing that I first noticed in my students, but now see that I'm just as guilty of, is "the race to a regression." That is, I devote the absolute minimum required effort (or less) to everything leading up to the regression. My attitude is usually that I'll go back later and clean up everything, double-checking along the way, if the line of investigation "proves useful" (i.e., provides stars). I rarely go back later. I find that the script let_me_just_try_this_really_quickly.R quickly becomes part of analysis.R. This is bad practice and careless.

Instead of a race to the regression, Git encourages me to develop projects a little more carefully, thinking about projects in tiny steps, each to be made public, and each done right and summarized in a nice commit message. The care in my research has noticeably improved. I think about how to do something better, do it better, and explain it in a commit message that I can refer to later.

In my view, project development with Git/GitHub works best when users make small, discrete changes to a project. This takes some thought and discipline, but it is the best way to go. I'm guilty of coming back from a conference and making dozens of small changes to an existing project, incorporating all the suggestions in a single update. I just did it after the State Politics and Policy Conference. It is a poor way to develop a project and a poor way to keep track of things. It is a poor strategy, but I'm learning.

Creating Marginal Effect Plots for Linear Regression Models in R

Suppose we estimate the model \(E(y) = \beta_0 + \beta_xx + \beta_zz + \beta_{xz}xz\) and want to calculate the marginal effect of \(x\) on \(E(y)\) as \(z\) varies \(\left(\dfrac{\partial E(y)}{\partial x} = \beta_x + \beta_{xz}z\right)\) along with its 90% confidence interval. Brambor, Clark, and Golder describe exactly how to do this and provide Stata code. Below is a bit of R code to do something similar. (Click here to continue reading.)
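The linked post has the full code, but a minimal sketch of the computation might look like the following (the data, variable names, and coefficient values here are illustrative, not the post's own):

```r
# Sketch: marginal effect of x on E(y) as z varies, with a 90% CI.
# Simulated data with illustrative coefficients.
set.seed(1234)
n <- 500
x <- rnorm(n)
z <- rnorm(n)
y <- 1 + 0.5*x - 0.3*z + 0.8*x*z + rnorm(n)

m <- lm(y ~ x*z)
b <- coef(m)
V <- vcov(m)

# dE(y)/dx = beta_x + beta_xz * z, with the exact variance of this
# linear combination of coefficients
z0 <- seq(min(z), max(z), length.out = 100)
me <- b["x"] + b["x:z"]*z0
se <- sqrt(V["x", "x"] + z0^2*V["x:z", "x:z"] + 2*z0*V["x", "x:z"])
lower <- me - qnorm(0.95)*se  # 90% confidence interval
upper <- me + qnorm(0.95)*se

plot(z0, me, type = "l", xlab = "z", ylab = "marginal effect of x")
lines(z0, lower, lty = 2)
lines(z0, upper, lty = 2)
```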

compactr is now on CRAN

I've been working on a package called compactr that helps create nice-looking plots in R, and it is now up on CRAN.

You can get it by typing install.packages("compactr") directly into the command line in R. See how it works by typing example(eplot) or reading the details here. Below I describe the basic structure and functions. (Click here to continue reading.)

More on What Can Be Learned from Statistical Significance

A while back, I reacted to a post by Justin Esarey, in which he argues that not much can be learned from statistical significance. The basic question he's asking is as follows: For a fixed sample size, what does the posterior probability of the null hypothesis look like if we update based on the result of the hypothesis test rather than the entire data set? His answer is that statistical significance doesn't allow us to update much if our prior is appropriately skeptical.

He makes two points:

  1. A small-magnitude but statistically significant result contains virtually no important information.
  2. Even a large-magnitude, statistically significant result is not especially convincing on its own.

In this post, I explore how we can set up the problem in a reasonable way and have the hypothesis test be informative. In particular, I'd like to argue that statistical significance can contain a lot of information about the hypothesis. (Click here to continue reading.)
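To make the updating rule concrete, here is one way to sketch the calculation at the heart of this debate: the posterior probability that the null is true after observing only "significant," via Bayes' rule. The prior and power values are assumptions for illustration, not numbers from either post.

```r
# Sketch: P(null | significant) from significance alone via Bayes' rule.
# alpha, power, and prior are illustrative assumptions.
alpha <- 0.05  # P(significant | null true)
power <- 0.80  # P(significant | null false)
prior <- 0.50  # prior probability the null is true

post <- (alpha*prior) / (alpha*prior + power*(1 - prior))
post  # about 0.059: with an even prior, significance is quite informative
```

With a more skeptical prior of 0.9, the same formula gives a posterior of 0.36, which is the flavor of the argument that significance alone doesn't settle much.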

The Problem with Testing for Heteroskedasticity in Probit Models

A friend recently asked whether I trusted the inferences from heteroskedastic probit models. I said no, because the heteroskedastic probit does not allow a researcher to distinguish between non-constant variance and a mis-specified mean function.

In particular, my friend had a hypothesis that the variance of the latent outcome (commonly called "y-star") should increase with an explanatory variable of interest. He was using the heteroskedastic probit model, which looks something like \(Pr(y_i = 1) = \Phi(X_i\beta, e^{Z_i\gamma})\), where \(\Phi()\) is the cumulative normal with mean \(X_i\beta\) and scale \(e^{Z_i\gamma}\).

He wanted to argue that his explanatory variable increased both the mean function (\(X\beta\)) and the variance function (\(e^{Z\gamma}\)). To do this, he included his variable in both the \(X\) and \(Z\) matrices and tested the statistical significance of the associated coefficients. He found that both were significant. It would seem that his variable increases both the mean and the variance of the latent outcome. He wanted to know if this was good evidence for his theory.

I replied that I did not think so, because a binary outcome variable doesn't contain any direct information about a non-constant variance. Indeed, the variance of a Bernoulli random variable is tied directly to the probability of success. This implies that any inference about changes in \(e^{Z\gamma}\) must come from observed changes in the probability of success (i.e., changes in the mean function). Because we've assumed a specific (i.e., linear) functional form for the mean function, deviations from it will be attributed to the variance function. Because of this structure, the results are driven entirely by our assumption of the linearity of the mean function. Indeed, it would not be hard to find a plausible non-linear mean function (e.g., a quadratic specification) that makes the \(\gamma\) parameter no longer significant.
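A quick simulation makes the Bernoulli point concrete (a sketch; the particular numbers are arbitrary):

```r
# A Bernoulli variable's variance is pinned down by its mean:
# Var(y) = p*(1 - p), so binary data carry no separate information
# about a latent variance.
set.seed(1234)
p <- 0.3
y <- rbinom(10^6, 1, p)
mean(y)  # approximately 0.3
var(y)   # approximately p*(1 - p) = 0.21
```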

Example One

I thought a good way to illustrate this claim would be to show that for a large but plausible sample size of one million, the heteroskedastic probit will suggest a non-constant variance when the relationship is simply a logit.

To see an illustration of this, start by simulating data from a simple logit model. Then estimate a regular probit and a heteroskedastic probit.

# simulate from a simple logit model
set.seed(1234)
n <- 10^6
x <- runif(n)
y <- rbinom(n, 1, plogis(-4 + 3*x))

# regular probit
r1 <- glm(y ~ x, family = binomial(link = "probit"))

# heteroskedastic probit with x in the scale model
# (hetglm() is in the glmx package)
library(glmx)
h1 <- hetglm(y ~ x | x)

Now if the coefficient for x is significant in the model of the scale, then we should conclude there is heteroskedasticity, right? No, because we already know that the latent variance is constant. However, we've slightly mis-specified the link function (we're using a probit; the true model is a logit). This slight mis-specification causes the results to point toward non-constant variance.

Coefficients (binomial model with probit link):
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.090026   0.007269  -287.5   <2e-16 ***
x            1.589261   0.007418   214.2   <2e-16 ***

Latent scale model coefficients (with log link):
  Estimate Std. Error z value Pr(>|z|)
x -0.20780    0.01475  -14.09   <2e-16 ***

Here is a plot of the predicted probabilities from the true, probit, and heteroskedastic probit models. Notice that in the range of the data, the heteroskedastic probit does a great job of representing the relationship. However, that's not because the variance is non-constant, as the heteroskedastic probit would suggest. It's because the link function is slightly mis-specified.
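The plot itself isn't reproduced here, but the comparison is easy to sketch in base R. The snippet below re-runs the simulation and compares the true logit curve to the ordinary probit's fitted probabilities; adding the heteroskedastic probit's predictions (via predict() on the hetglm fit) follows the same pattern. This is a sketch, not the original plotting code:

```r
# Sketch: true logit curve vs. fitted (slightly mis-specified) probit.
set.seed(1234)
n <- 10^5
x <- runif(n)
y <- rbinom(n, 1, plogis(-4 + 3*x))
r1 <- glm(y ~ x, family = binomial(link = "probit"))

x0 <- seq(0, 1, length.out = 100)
p_true <- plogis(-4 + 3*x0)
p_probit <- predict(r1, newdata = data.frame(x = x0), type = "response")

plot(x0, p_true, type = "l", xlab = "x", ylab = "Pr(y = 1)")
lines(x0, p_probit, lty = 2)
# Within the range of the data, the two curves are nearly indistinguishable.
```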


Example Two

I think the logit example makes the point powerfully, but let's look at a second example just for kicks. This time, let's say that we believe there's heteroskedasticity that can be accounted for by x, so we estimate a heteroskedastic probit and include x in the mean and variance function. However, again we're wrong. Actually the true model has a constant variance, but a non-linear mean function (\(\beta_0 + \beta_1x^2\)).

If we simulate the data and estimate the model, we see again that our mis-specified mean function leads us to conclude that the variance is non-constant. In this case, though, the mis-specification is severe enough that you'll find significant results with a much smaller sample. If we drew conclusions about non-constant variance from the statistical significance of coefficients in the model of the variance, then we would be led astray.
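The second simulation's code isn't shown, so the setup below is a guess with illustrative coefficients; it is not meant to reproduce the output reported below exactly. The heteroskedastic probit line requires the glmx package:

```r
# Sketch of Example Two: constant latent variance, quadratic mean.
# Coefficient values are illustrative guesses.
set.seed(1234)
n <- 1000
x <- runif(n)
y <- rbinom(n, 1, pnorm(-2 + 3*x^2))  # true mean is nonlinear in x

# mis-specified linear mean function
m1 <- glm(y ~ x, family = binomial(link = "probit"))

# heteroskedastic probit with x in both the mean and scale models
# (requires the glmx package):
# h2 <- hetglm(y ~ x | x)
```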

Coefficients (binomial model with probit link):
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -3.6122     0.3726  -9.694   <2e-16 ***
x             3.1493     0.2809  11.210   <2e-16 ***

Latent scale model coefficients (with log link):
  Estimate Std. Error z value Pr(>|z|)
x  -0.8194     0.2055  -3.988 6.66e-05 ***

Again, the plot shows that the heteroskedastic probit does a good job of adjusting for the mis-specified mean function (working much like a non-parametric model).


So What Should You Do?

I think that researchers who have a theory that allows them to speculate about the mean and variance of a latent variable should go ahead and estimate a statistical model that maps cleanly onto their theory (like the heteroskedastic probit). However, these researchers should realize that this model does not allow them to distinguish between non-constant variance and a mis-specified mean function.

For more posts, see the Archives.