Software Signals

Posted: 01.07.2013

This blog post by Sean Taylor generated quite a stir. He discussed the signals one sends by using certain software packages and seems to think that R users are more competent. The reactions ranged from amusement to bashing.

 

 

While I don't think this type of post is particularly useful, it is fun (especially the John Myles White line), so I'm writing up my thoughts on the issue.

For better or worse, I think the software one uses certainly sends a signal.

I've heard others apply the same arguments to typesetting programs. LaTeX and Beamer, for example, are said to send a "technically competent" signal compared to Word and PowerPoint. For better or worse, I think I am vulnerable to these signals, although I don't use Beamer.

R and Stata are the software packages that I run into most often in political science, and I certainly have stereotypes of their users, but these stereotypes concern style rather than competence. (These are just my stereotypes.)

  • R users are more likely to be interested in graphics and simulation (e.g. Gelman and Hill). Also, R users are more likely to care about statistical programming. This is how I first became an R user. The first paper that I wrote as a graduate student required a lot of simulation and a few custom graphs, and I needed to do a little programming to get these right. I think programming is a powerful tool and that graphics and simulation are really important in communicating results. Because I think R users are more likely to emphasize these things, I update upward slightly on R users. That said, there are plenty of terrible methodologists who use R.
  • Stata users are much more interested in estimating fancier econometric models (e.g. Cameron and Trivedi). These users put less emphasis on model-checking methods such as cross-validation and value complicated models (e.g. bivariate probit with partial observability) more than R users do. Since I like model checking and think complicated models are over-used (or at least over-trusted), I tend to update downward slightly on Stata users. That said, there are plenty of great methodologists who rely on Stata.

I don't run into users of other software much in political science, but I do in the statistics department. (Again, these are just my stereotypes.)

  • Matlab users work on more theoretical problems. By that I mean building and evaluating new estimators and methods, not proving theorems.
  • SAS users care about analyzing data. They work on real-world problems, probably for a drug company.

I use both R and Stata.

I rely mostly on R in my research. I occasionally use Stata for two purposes.

  1. Recoding data. Whenever I work with huge chunks of (especially survey) data, Stata offers a really useful set of commands for cleaning up the data.
  2. Maximizing a difficult likelihood. Sometimes I'll have a custom model and regular optimization algorithms (e.g. BFGS) fail. In this situation, I use a little magic that is found in Stata's ", difficult" option. I don't quite understand why it works so well, but it is relentless. It is the single best feature of Stata.

I don't update much on users' competence.

While I do update on the methodological style of software users, I don't think I update much (if at all) on their competence. Here are some statements from Taylor's post that I disagree with.

  • "When you don’t have to code your own estimators, you probably won’t understand what you’re doing." I think that many people don't code their own estimators (and couldn't easily start) but understand what they are doing. I also think that plenty of people who do code their own estimators have no clue.
  • "When operating software doesn't require a lot of training, users of that software are likely to be poorly trained." I'm sure that researchers who don't want to learn statistics are much less likely to want to learn software beyond point-and-click, but I think that most people who are using any software and writing about it to the public are not "poorly trained."
  • "Researchers who care about statistics enough should have gravitated toward R at some point." I spent three years in the statistics department at Florida State. People over there care about statistics and most use something other than R. I've also met plenty of political scientists who care about statistics and use Stata exclusively. I do think that those who care about certain styles of analysis (e.g. graphs, simulations, and programming) are likely to be drawn to R, but I don't think it's universal.


  • Walter Belluzzo

    IMHO, you missed an important point: Stata users are more likely to be untroubled by not knowing how some numbers come about, trusting someone else's warrant that they are correct. I think sentences such as "I don't quite understand why it works so well, but it is relentless" are more likely to be used to justify using Stata than to justify using R.

  • Dinre

    I personally found the original article to be hilarious and full of snark. I work in a research department where one half loves SPSS and the other loves SAS. I use R, because I consider myself a mathematician, and if you're going to end up coding anyway, R is actually an easier learning curve than most other packages.

    I have found the comment about "poorly trained" users to be true in some ways but not in relation to the statistical software used.
    I somewhat frequently find myself explaining to colleagues why their numbers don't mean anything. For instance, if the population you are studying is very small (e.g. 25 total) and you are sampling a majority of the population (e.g. 23/25), then you really shouldn't be using any traditional statistical analysis on the group. Statistics is a way of extrapolating a small amount of data into generalized assumptions about the group, but if your sample contains most of the group, you don't really need to extrapolate. In that case, you just need to describe the group, which is the holy grail of data that statistics is designed to approximate.
    I also find myself explaining why a comparative t-test isn't valid if you already know that two groups of data are fundamentally different. If the groups are fundamentally different (e.g. seconds to type an email vs. seconds waiting at a traffic light), then finding the two groups similar or dissimilar doesn't tell you anything at all.

    I think the real failure in research settings isn't the statistics software. I think it's the way publishing and reporting works. People doing drug trials mainly use SAS, because that's what other people in the same field use. If you want someone else to trust your work, you have to use SAS. This means that many people who don't fully understand how SAS works are just using whatever analysis has been published previously. They aren't evaluating whether or not the analysis is the correct choice or requires different parameters. The same goes for SPSS, Minitab, and other software packages. I had one conversation with a colleague who told me, "You can't use that analysis, because no one else uses that analysis. I don't think anyone reading an academic journal would accept that." My colleague was correct, of course, but it still makes me feel a bit sad.