Carlisle's Blog

Firth’s Logit: Some References

Carlisle Rainey — Wed, 30 Aug 2023 04:00:00 GMT

In Rainey and McCaskey (2021), Kelly McCaskey and I offer an accessible and practical (re)introduction to Firth’s penalized maximum likelihood estimator that (1) corrects the small sample bias and (2) reduces the excessive variance of the usual maximum likelihood estimator.

Below, I bookmark other references that might be helpful.

I’m sure there are embarrassing omissions. If you see an omission, please let me know (self-promotion is encouraged, especially not-yet-published papers).

This Stack Exchange answer gives a brief but careful explanation of Firth’s logit. If you’re looking for a quick explanation, start here.

The Two Main Papers

Firth (1993) originally introduced the idea. Kelly and I draw mostly on this paper—it’s a wonderful paper.
Kosmidis and Firth (2021) follow up with additional theoretical results that are relevant for the estimator as used in practice since 1993. This happened to come out while our paper was working its way through the publication process. Most importantly, they discuss the shrinkage property of the estimator, which is what Kelly and I highlight as under-appreciated (and really important!).

From my perspective, these are the two main papers to refer to if you’re concerned about small sample bias in logistic regression models.

Extensions

Beyond these two main papers, there have been a few extensions. Zietkiewicz and Kosmidis (2023) talk about Firth’s logit in very large data sets. Cook, Hays, and Franzese (2018) make a good argument for using Firth’s estimator in panel data sets with binary outcomes and fixed effects. Sterzinger and Kosmidis (2023) apply these ideas to mixed models (or random effects models). Šinkovec et al. (2021) compare Firth’s approach to ridge regression, and suggest that Firth’s is superior in small or sparse data sets. Puhr et al. (2017) study Firth’s logit in the context of rare events and propose FLIC and FLAC as alternatives.

Applications

Röver et al. (2022) offer an application of Firth’s logit to clinical trials.
Turner and Firth (2012) offer an application to Bradley-Terry models with the {BradleyTerry2} R package.

Separation and Finiteness

I learned about Firth’s estimator from Zorn (2005), who follows Heinze and Schemper (2002) in suggesting it as a solution to separation. According to David Firth in this blog post, this is the application that stimulated interest in the approach after it went relatively unnoticed for a few years. (Great post, I highly recommend reading it!) This application piqued my interest in Firth’s estimator. Briefly, I think Firth’s default penalty might not be substantively reasonable in a given application (Rainey 2016) (see also Beiser-McGrath (2020)) and the usual likelihood ratio and score tests work well without the penalty (Rainey 2023).

For more on Firth’s logit, see Ioannis Kosmidis’ research page and Georg Heinze’s Google Scholar page.

References

Beiser-McGrath, Liam F. 2020. “Separation and Rare Events.” Political Science Research and Methods 10 (2): 428–37. https://doi.org/10.1017/psrm.2020.46.

Cook, Scott J., Jude C. Hays, and Robert J. Franzese. 2018. “Fixed Effects in Rare Events Data: A Penalized Maximum Likelihood Solution.” Political Science Research and Methods 8 (1): 92–105. https://doi.org/10.1017/psrm.2018.40.

Firth, David. 1993. “Bias Reduction of Maximum Likelihood Estimates.” Biometrika 80 (1): 27–38. https://doi.org/10.1093/biomet/80.1.27.

Heinze, Georg, and Michael Schemper. 2002. “A Solution to the Problem of Separation in Logistic Regression.” Statistics in Medicine 21 (16): 2409–19. https://doi.org/10.1002/sim.1047.

Kosmidis, Ioannis, and David Firth. 2021. “Jeffreys-Prior Penalty, Finiteness and Shrinkage in Binomial-Response Generalized Linear Models.” Biometrika 108 (1): 71–82. https://doi.org/10.1093/biomet/asaa052.

Puhr, Rainer, Georg Heinze, Mariana Nold, Lara Lusa, and Angelika Geroldinger. 2017. “Firth’s Logistic Regression with Rare Events: Accurate Effect Estimates and Predictions?” Statistics in Medicine. https://doi.org/10.1002/sim.7273.

Rainey, Carlisle. 2016. “Dealing with Separation in Logistic Regression Models.” Political Analysis 24 (3): 339–55. https://doi.org/10.1093/pan/mpw014.

———. 2023. “Hypothesis Tests Under Separation.” http://dx.doi.org/10.31235/osf.io/bmvnu.

Rainey, Carlisle, and Kelly McCaskey. 2021. “Estimating Logit Models with Small Samples.” Political Science Research and Methods 9 (3): 549–64. https://doi.org/10.1017/psrm.2021.9.

Röver, Christian, Moreno Ursino, Tim Friede, and Sarah Zohar. 2022. “A Straightforward Meta-Analysis Approach for Oncology Phase I Dose-Finding Studies.” Statistics in Medicine 41 (20): 3915–40. https://doi.org/10.1002/sim.9484.

Šinkovec, Hana, Georg Heinze, Rok Blagus, and Angelika Geroldinger. 2021. “To Tune or Not to Tune, a Case Study of Ridge Logistic Regression in Small or Sparse Datasets.” BMC Medical Research Methodology 21 (1). https://doi.org/10.1186/s12874-021-01374-y.

Sterzinger, Philipp, and Ioannis Kosmidis. 2023. “Maximum Softly-Penalized Likelihood for Mixed Effects Logistic Regression.” Statistics and Computing 33 (2). https://doi.org/10.1007/s11222-023-10217-3.

Turner, Heather, and David Firth. 2012. “Bradley-Terry Models in R: The BradleyTerry2 Package.” Journal of Statistical Software 48 (9). https://doi.org/10.18637/jss.v048.i09.

Zietkiewicz, Patrick, and Ioannis Kosmidis. 2023. “Bounded-Memory Adjusted Scores Estimation in Generalized Linear Models with Large Data Sets.” https://doi.org/10.48550/ARXIV.2307.07342.

Zorn, Christopher. 2005. “A Solution to Separation in Binary Response Models.” Political Analysis 13 (2): 157–70. https://doi.org/10.1093/pan/mpi009.

Equivalence Tests Using {marginaleffects}

Carlisle Rainey — Fri, 18 Aug 2023 04:00:00 GMT

Background on arguing for a negligible effect

I remember sitting in a talk as a first-year graduate student, and the speaker said something like: “I expect no effect here, and, just as I expected, the difference is not statistically significant.” I was a little bit taken aback—of course, that’s not a compelling argument for a null effect. A lack of statistical significance is an absence of evidence for an effect; it is not evidence of an absence of an effect.

But I saw this approach taken again and again in published work. (And still do!)

My first publication was an AJPS article (Rainey 2014) (Ungated PDF) explaining why this doesn’t work well and how to do it better.

Here’s what I wrote in that paper:

Hypothesis testing is a powerful empirical argument not because it shows that the data are consistent with the research hypothesis, but because it shows that the data are inconsistent with other hypotheses (i.e., the null hypothesis). However, researchers sometimes reverse this logic when arguing for a negligible effect, showing only that the data are consistent with “no effect” and failing to show that the data are inconsistent with meaningful effects. When researchers argue that a variable has “no effect” because its confidence interval contains zero, they take no steps to rule out large, meaningful effects, making the empirical claim considerably less persuasive (Altman and Bland 1995; Gill 1999; Nickerson 2000).

But here’s a critical point: it’s impossible to reject every hypothesis except exactly no effect. Instead, the researcher must define a range of substantively “negligible” effects. The researcher can then reject the null hypothesis that the effect falls outside this range of negligible effects. However, this requires a substantive judgment about which effects are negligible and which are not.

Here’s what I wrote:

Researchers who wish to argue for a negligible effect must precisely define the set of effects that are deemed “negligible” as well as the set of effects that are “meaningful.” This requires defining the smallest substantively meaningful effect, which I denote as m. The definition must be debated by substantive scholars for any given context because the appropriate m varies widely across applications.

Clark and Golder (2006)

Clark and Golder (2006) offer a nice example of this sort of hypothesis. I’ll refer you there and to Rainey (2014) for a complete discussion of their idea, but I’ll motivate it briefly here.

Explaining why a country might have only a few (i.e., two) parties, Clark and Golder write:

First, it could be the case that the demand for parties is low because there are few social cleavages. In this situation, there would be few parties whether the electoral institutions were permissive or not. Second, it could be the case that the electoral system is not permissive. In this situation, there would be a small number of parties even if the demand for political parties were high. Only a polity characterized by both a high degree of social heterogeneity and a highly permissive electoral system is expected to produce a large number of parties. (p. 683)

Thus, they expect that electoral institutions won’t matter in socially homogeneous systems. And they expect that social heterogeneity won’t matter in electoral systems that are not permissive.

Reproducing Clark and Golder (2006)

Before computing their specific quantities of interest, let’s reproduce their regression model. Here’s their table that we’re trying to reproduce.

And here’s a reproduction of their estimates using the cg2006 data from the {crdata} package on GitHub.¹

¹ Run ?crdata::cg2006 for detailed documentation of this data set.

# load packages
library(sandwich)
library(modelsummary)

# install my data packages from github
devtools::install_github("carlislerainey/crdata")  # only updates if newer version available

# load clark and golder's data set
cg <- crdata::cg2006

# reproduce their estimates
f <- enep ~ eneg*log(average_magnitude) + eneg*upper_tier + en_pres*proximity
fit <- lm(f, data = cg)

# regression table
modelsummary(fit, 
             vcov = ~ country, # cluster-robust SE; multiple observations per country
             fmt = 2, 
             shape = term ~ model + statistic)

	(1)
	Est.	S.E.
(Intercept)	2.92	0.35
eneg	0.11	0.14
log(average_magnitude)	0.08	0.23
upper_tier	−0.06	0.03
en_pres	0.26	0.15
proximity	−3.10	0.46
eneg × log(average_magnitude)	0.26	0.17
eneg × upper_tier	0.06	0.02
en_pres × proximity	0.68	0.23

Success!

They use average_magnitude to measure the permissiveness of the electoral system and eneg to measure social heterogeneity.

Using `comparisons()` to compute the effects

Now let’s compute the two quantities of interest. Clark and Golder argue for two negligible effects, which I make really concrete below.

Hypothesis 1 Increasing the effective number of ethnic groups from the 10th percentile (1.06) to the 90th percentile (2.48) will not lead to a substantively meaningful change in the effective number of political parties when the district magnitude is one.
Hypothesis 2 Increasing the district magnitude from one to seven will not lead to a substantively meaningful change in the effective number of political parties when the effective number of ethnic groups is one.

And comparing the U.S. and the U.K., I argue that the smallest substantively interesting effect is 0.62. In Rainey (2014), I made the plot below. I want to reproduce it with {marginaleffects}.

These differences (and the 90% CIs) are really easy to compute using {marginaleffects}!²

² I’m only doing Clark and Golder’s original results, not any of the robustness checks.

# load packages
library(marginaleffects)

# the smallest substantively interesting effect
m <- 0.62

# a data frame setting the values of the "other" variables
X_c <- data.frame(
  eneg = 1.06,  # low value
  average_magnitude = 1,  # low value
  upper_tier = 0,
  en_pres = 0, 
  proximity = 0
)

# compute the comparison for eneg and average magnitude
c <- comparisons(fit,
            vcov = ~ country,
            newdata = X_c, 
            variables = list("eneg" = c(1.06, 2.48),         # low to high value
                             "average_magnitude" = c(1, 7)), # low to high value
            conf_level = 0.90)

Now we can just plot the 90% CIs with ggplot() and check whether the entire interval falls inside the bounds.

# load packages
library(tidyverse)

# bind the comparisons together and plot
ggplot(c, aes(x = estimate,
                 xmin = conf.low,
                 xmax = conf.high, 
                 y = term)) + 
  geom_vline(xintercept = c(-m, m), linetype = "dashed") + 
  geom_errorbarh() + 
  geom_point()

In this case, we conclude that social heterogeneity (eneg) has a negligible effect because the 90% CI only contains substantively negligible values. However, the 90% CI for district magnitude (average_magnitude) contains substantively negligible and meaningful values, so we cannot reject the null hypothesis of a meaningful effect.

Computing the TOST p-values using `hypotheses()`

It’s then almost trivial to use the hypotheses() function to compute the TOST p-values.³ Note, though, that you must resupply the vcov and conf_level arguments. I might have expected them to carry forward from the comparisons() above.

³ Note this warning from ?hypotheses(): Warning #2: For hypothesis tests on objects produced by the marginaleffects package, it is safer to use the hypothesis argument of the original function. Using hypotheses() may not work in certain environments, in lists, or when working programmatically with apply style functions.

Warning

You must resupply the vcov and conf_level arguments.

# hypothesis tests
# note: you must re-supply the vcov and conf_level arguments
hypotheses(c, equivalence = c(-m, m), conf_level = 0.90, vcov = ~ country)


              Term    Contrast Estimate Std. Error     z Pr(>|z|)    S  5.0 %
 average_magnitude 7 - 1          0.696      0.175 3.973   <0.001 13.8  0.408
 eneg              2.48 - 1.06    0.158      0.204 0.779    0.436  1.2 -0.176
 95.0 % p (NonSup) p (NonInf) p (Equiv) eneg average_magnitude upper_tier
  0.984     0.6671     <0.001    0.6671 1.06                 1          0
  0.493     0.0117     <0.001    0.0117 1.06                 1          0
 en_pres proximity
       0         0
       0         0

Columns: rowid, term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted, eneg, average_magnitude, upper_tier, en_pres, proximity, enep, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv 
Type:  response

This doesn’t print super-nicely into this document, so let’s extract the important parts.

# hypothesis tests, extracting the important pieces
hypotheses(c, equivalence = c(-m, m), conf_level = 0.90, vcov = ~ country) %>%
  select(term, contrast, estimate, conf.low, conf.high, p.value.equiv)


              Term    Contrast Estimate CI low CI high p (Equiv)
 average_magnitude 7 - 1          0.696  0.408   0.984    0.6671
 eneg              2.48 - 1.06    0.158 -0.176   0.493    0.0117

Columns: term, contrast, estimate, conf.low, conf.high, p.value.equiv

Checking that the 90% CIs fall within the bounds created by the smallest substantively-meaningful effect is equivalent to checking whether the TOST p-value (i.e., the p(Equiv) column) is less than 0.05, so our conclusions are (and must be) identical.
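To make that equivalence concrete, here is a minimal sketch that recomputes the TOST p-values and the 90% CIs by hand, assuming a normal approximation. The estimates and standard errors are copied from the output above; the object names are mine and are not part of {marginaleffects}. Because the inputs are rounded, the results should match the printed output only approximately.

```r
# estimates and cluster-robust SEs copied from the hypotheses() output above
est <- c(average_magnitude = 0.696, eneg = 0.158)
se  <- c(average_magnitude = 0.175, eneg = 0.204)
m   <- 0.62  # smallest substantively interesting effect

# two one-sided tests (normal approximation)
p_noninf <- pnorm((est + m) / se, lower.tail = FALSE)  # H0: effect <= -m
p_nonsup <- pnorm((est - m) / se)                      # H0: effect >= m
p_equiv  <- pmax(p_noninf, p_nonsup)                   # TOST p-value

# 90% confidence intervals
z <- qnorm(0.95)
ci_low  <- est - z * se
ci_high <- est + z * se
inside  <- ci_low > -m & ci_high < m  # is the whole CI inside (-m, m)?

# the two decision rules agree row by row
data.frame(p_equiv = round(p_equiv, 4), ci_low, ci_high, inside,
           reject = p_equiv < 0.05)
```

The 90% CI sits inside the bounds exactly when both one-sided p-values are below 0.05, which is why the `inside` and `reject` columns always agree.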

Other references

For more on effective arguments for no effect, see the following:

Lakens, Scheel, and Isager (2018) offer an accessible introduction to equivalence tests for psychologists.
Kane (2024) offers an excellent summary of design considerations when arguing for no effect.
McCaskey and Rainey (2015) (Ungated PDF) argue that researchers should make “claims if and only if those claims hold for the entire confidence interval.” This extends the logic of equivalence testing to a broader collection of possible hypotheses.

Final thoughts

{marginaleffects} is a great package (Arel-Bundock 2024). I think it’s the first package in which the syntax matches the way I think about computing quantities of interest. That said, this is just my first try at it. But I’m very impressed so far.
The {marginaleffects} book has a whole chapter on equivalence tests. My only caution is that there is a mismatch between 95% confidence intervals and equivalence tests. By default, {marginaleffects} reports a 95% CI, even when producing a p-value for an equivalence test. However, the 90% confidence interval corresponds to a size-5% equivalence test. So if you’re using {marginaleffects} to do equivalence tests, I recommend setting conf_level = 0.90.⁴
For a more recent example, Jake Jares and Neil Malhotra have a new paper that discusses negligible effects and hypothesis tests in a way that I find clear and compelling. It’s an excellent model to follow. See pp. 27-32. They “show that improved compensation outcomes had negligible impacts on Republican farmers’ midterm turnout and campaign contributions, even though such variation in benefits significantly affected farmers’ propensity to view the intervention as helpful.”

⁴ I would make a similar point about one-sided tests as well, but that’s less correct, because it should be a one-sided 95% CI.

References

Altman, D. G, and J M. Bland. 1995. “Statistics Notes: Absence of Evidence Is Not Evidence of Absence.” BMJ 311 (7003): 485–85. https://doi.org/10.1136/bmj.311.7003.485.

Arel-Bundock, Vincent. 2024. Marginaleffects: Predictions, Comparisons, Slopes, Marginal Means, and Hypothesis Tests. https://marginaleffects.com/.

Clark, William Roberts, and Matt Golder. 2006. “Rehabilitating Duverger’s Theory.” Comparative Political Studies 39 (6): 679–708. https://doi.org/10.1177/0010414005278420.

Gill, Jeff. 1999. “The Insignificance of Null Hypothesis Significance Testing.” Political Research Quarterly 52 (3): 647–74. https://doi.org/10.1177/106591299905200309.

Kane, John V. 2024. “More Than Meets the ITT: A Guide for Anticipating and Investigating Nonsignificant Results in Survey Experiments.” Journal of Experimental Political Science, February, 1–16. https://doi.org/10.1017/xps.2024.1.

Lakens, Daniël, Anne M. Scheel, and Peder M. Isager. 2018. “Equivalence Testing for Psychological Research: A Tutorial.” Advances in Methods and Practices in Psychological Science 1 (2): 259–69. https://doi.org/10.1177/2515245918770963.

McCaskey, Kelly, and Carlisle Rainey. 2015. “Substantive Importance and the Veil of Statistical Significance.” Statistics, Politics and Policy 6 (1-2). https://doi.org/10.1515/spp-2015-0001.

Nickerson, Raymond S. 2000. “Null Hypothesis Significance Testing: A Review of an Old and Continuing Controversy.” Psychological Methods 5 (2): 241–301. https://doi.org/10.1037/1082-989x.5.2.241.

Rainey, Carlisle. 2014. “Arguing for a Negligible Effect.” American Journal of Political Science 58 (4): 1083–91. https://doi.org/10.1111/ajps.12102.

Daily Writing

Carlisle Rainey — Tue, 15 Aug 2023 04:00:00 GMT

I try to write every day.¹

¹ See lots of caveats below!

By “writing,” I mean “pushing the paper closest to publication just a little bit closer.” I want to think about the next step on the journey to the published paper and do it. According to this loose definition of writing, it might involve data collection, data analysis, creating slides, or even writing and polishing text. It might involve organization, planning, or learning new skills. It excludes any tasks that aren’t necessary to complete the project.

By “every day,” I mean at least every weekday, probably at the same time every day and probably first thing in the morning. For better or worse, academics are evaluated by their research productivity.

Urgency and Importance

President Eisenhower famously characterized his duties: “I have two kinds of problems, the urgent and the important. The urgent are not important, and the important are never urgent.”

Following Eisenhower’s Box, we might assign degrees of urgency and importance to academic tasks. In graduate school, I had teaching responsibilities, RA duties, readings for seminars, homework for methods classes, preliminary exams, and administrative tasks. All of these tasks are important. They must be completed. They must be completed well. Yet I was evaluated largely on my papers. As a faculty member, little has changed.

Writing is important, but writing never quite becomes urgent. It’s easy to put off writing to prepare a lecture (or write a blog post).

The Evidence

Robert Boice studied academic productivity carefully. A couple of his studies provide some evidence for my strategy to write every day.

First, he assessed how early-career academics spend their time. The figure below shows the results. Notice that these faculty spend more time in committee meetings (2 hours) than writing (1.5 hours).

Second, Boice conducted an experiment to assess the effect of writing strategies.

Boice randomly divided 27 academics into three groups:²

² This is a small sample, but it supports my claim so it’s okay.

The control group agreed to defer all but the most urgent writing for ten weeks.
The spontaneous group agreed to write when they felt like it.
The contingency group agreed to donate to an anti-charity if they failed to write every day.

The figure below shows that a regular writing routine increases production of both pages and ideas. Notice that the spontaneous writers barely produced more ideas and pages than the group trying to avoid writing.

I find these results compelling, but note that Helen Sword urges some caution.

How I Do It

Everyone is different, and my own approach has evolved over time. Here are the key ingredients (for me):

Write for two hours at a regular time. Consistency is key.³
Avoid writing outside this window. Set your window so that your window is “enough.”
Take breaks. I take long breaks from writing. But these are intentional and planned.⁴
Family permitting, I think it’s helpful to spend a little while pushing the projects forward on the weekends, just to keep the momentum up.⁵

³ Two hours works really well for me. My productivity degrades quickly after two hours, so it’s best to move on to less taxing tasks. But it takes me a while to get warmed up, so I need to keep moving while I’ve got momentum.

⁴ An unfortunate outcome is not writing and being stressed about not writing.

⁵ Just 15 minutes is great. This slot is perfect for proof-reading.

I admit that I deviate from the strategy above (and not always intentionally). But I’ve been at this long enough to know that a regular routine works really well for me.

What if you’re not ready to write yet?

It’s my view that PhD students should write every day, from the first day of their first semester (remember that I have a broad definition of “write”). Most students need some time before they’re ready to jump into the technical details of a solo project, but there are always things to do.

If you can’t identify a specific task to work on, here are some resources to help you brainstorm.

Plan and organize. Start by reading How to Write a Lot. Perhaps read Getting Things Done. Perhaps read Air & Light & Time & Space or Writing for Social Scientists.
Read “Publication, Publication” and the updates.
Before you can jump into a project, you need to know the literature. Spend some writing time exploring literatures that you might want to contribute to. What interests you most? The Annual Review of Political Science is a valuable resource.
Once you have a specific topic of interest, you need to learn that literature. You can spend dozens of “writing” sessions reading and taking notes. I strongly encourage you to read and take notes systematically: Raul Pacheco-Vega suggests using a spreadsheet, Elaine Campbell suggests a similar method, and Katherine Firth suggests using Cornell notes.
Tanya Golash-Boza lists ten ways to write every day if you’ve got a paper in progress.

Benchmarking Firth’s Logit: {brglm2} versus {logistf}

Carlisle Rainey — Fri, 11 Aug 2023 04:00:00 GMT

Firth’s Logit

I like Firth’s logistic regression model (Firth 1993). I talk about that in Rainey and McCaskey (2021) and this Twitter thread. Kosmidis and Firth (2021) offer an excellent, recent follow-up as well.

I’ll refer you to the papers for a careful discussion of the benefits, but Firth’s penalty reduces the bias and variance of the logit coefficients.
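For intuition about what the penalty does, here is a minimal base-R sketch of Firth’s logit computed by iterating on the modified score, X'(y − p + h(1/2 − p)), where h are the hat values (the form described in Heinze and Schemper 2002). This is an illustration only: `firth_logit()` is a hypothetical helper, not the algorithm used inside brglm2 or logistf, it has no step-halving or other safeguards, and the toy data are made up.

```r
# Firth's logit via Fisher scoring on the modified score
# U*(beta) = X'(y - p + h * (1/2 - p)), where h are the hat values
firth_logit <- function(X, y, max_iter = 100, tol = 1e-8) {
  beta <- rep(0, ncol(X))
  for (iter in seq_len(max_iter)) {
    p <- as.vector(plogis(X %*% beta))        # fitted probabilities
    w <- p * (1 - p)                          # working weights
    info_inv <- solve(crossprod(X, w * X))    # inverse Fisher information
    h <- w * rowSums((X %*% info_inv) * X)    # hat values
    score <- crossprod(X, y - p + h * (0.5 - p))
    step <- as.vector(info_inv %*% score)
    beta <- beta + step
    if (max(abs(step)) < tol) break
  }
  setNames(beta, colnames(X))
}

# hypothetical toy data with perfect separation (x > 0 predicts y = 1 exactly)
x <- c(-2, -1, -0.5, 0.5, 1, 2)
y <- c(0, 0, 0, 1, 1, 1)
X <- cbind("(Intercept)" = 1, x = x)

firth_logit(X, y)  # finite estimates, even though the usual ML estimate does not exist
```

With these data the usual maximum likelihood slope diverges, while the penalized estimate stays finite. For real work, use brglm2 or logistf as in the benchmark below.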

Goals for Benchmarking

In this post, I want to compare the brglm2 and logistf packages. Which fits logistic regression models with Firth’s penalty the fastest?

These packages both fit the models almost instantly, so there is no practical difference when fitting just one model. But in large Monte Carlo simulations (or perhaps bootstraps), small differences might add up to a substantial time difference.

Here, I benchmark the two packages for fitting logistic regression models with Firth’s penalty in a small sample–the results might not generalize to a larger sample. The data set comes from Weisiger (2014) (see ?crdata::weisiger2014). It has only 35 observations.

You can find the benchmarking code as a GitHub Gist.

Benchmarking

I benchmark four methods here.

A vanilla glm() logit model.
A Firth’s logit via brglm2 by supplying method = brglm2::brglmFit to glm().
A Firth’s logit via logistf using logistf() with the default settings.
A Firth’s logit via logistf via logistf() with the argument pl = FALSE. This argument is important because it skips hypothesis testing using profile likelihoods, which are computationally costly.

# install crdata package to get weisiger2014 data set
remotes::install_github("carlislerainey/crdata")

# load packages
library(tidyverse)
library(brglm2)
library(logistf)
library(microbenchmark)


# load data
weis <- crdata::weisiger2014

# rescale weisiger2014 explanatory variables using arm::rescale()
rs_weis <- weis %>%
  mutate(across(polity_conq:coord, arm::rescale)) 

# create functions to fit models
f <- resist ~ polity_conq + lndist + terrain + soldperterr + gdppc2 + coord
f1 <- function() {
  glm(f, data = rs_weis, family = "binomial")
}
f2 <- function() {
  glm(f, data = rs_weis, family = "binomial", method = brglmFit)
}
f3 <- function() {
  logistf(f, data = rs_weis)
}
f4 <- function() {
  logistf(f, data = rs_weis, pl = FALSE)
}

# do benchmarking
bm <- microbenchmark("regular glm()" = f1(), 
               "brglm2" = f2(), 
               "logistf (default)" = f3(),
               "logistf (w/ pl = FALSE)" = f4(),
               times = 100)

Warning in microbenchmark(`regular glm()` = f1(), brglm2 = f2(), `logistf
(default)` = f3(), : less accurate nanosecond times to avoid potential integer
overflows

print(bm)

Unit: microseconds
                    expr      min        lq      mean    median        uq
           regular glm()  500.651  547.0425  591.5304  560.0805  584.4550
                  brglm2 2469.266 2546.9405 2735.6438 2591.5690 2672.3800
       logistf (default) 3303.329 3432.6635 3942.9122 3481.6585 3558.1235
 logistf (w/ pl = FALSE)  513.197  553.3155  594.6205  570.8225  608.9115
       max neval
  2713.708   100
  9735.983   100
 17672.148   100
  1698.343   100

In short, logistf is slower than brglm2, but only because it computes the profile likelihood p-values by default. Once we skip those calculations using pl = FALSE, logistf is much faster than brglm2. In this run, it is essentially as fast as glm(): the mean times are about 595 and 592 microseconds, and the medians are about 571 and 560 microseconds.

Here’s a plot showing the computation times of the four fits. Remember that all of these are computed practically instantly, so it only makes a difference when the fits are done thousands of times, like in a Monte Carlo simulation.
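To see how these small differences add up, here is a back-of-the-envelope sketch. The mean times are copied from the benchmark output above; the size of the simulation study (one million fits) is a hypothetical number I chose for illustration.

```r
# mean time per fit in microseconds, copied from the benchmark output above
mean_us <- c("regular glm()"           = 591.5304,
             "brglm2"                  = 2735.6438,
             "logistf (default)"       = 3942.9122,
             "logistf (w/ pl = FALSE)" = 594.6205)

# hypothetical simulation study: 10,000 simulated data sets x 100 scenarios
n_fits <- 10000 * 100

# total fitting time in minutes (microseconds -> seconds -> minutes)
total_min <- mean_us * n_fits / 1e6 / 60
round(total_min, 1)
```

Under these assumptions, brglm2 needs roughly 46 minutes of fitting time and logistf with pl = FALSE roughly 10 minutes, so a difference that is invisible for a single fit becomes noticeable at scale.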

# plot times
bm %>%
  group_by(expr) %>%
  summarize(avg_time = mean(time) * 1e-6) %>%  # convert nanoseconds to milliseconds
  ggplot(aes(x = fct_rev(expr), y = avg_time)) + 
  geom_col() + 
  labs(x = "Method", 
       y = "Avg. Time (in milliseconds)") + 
  coord_flip()

Follow-Up Notes

The models return slightly different estimates. Maybe they are using slightly different convergence tolerances. I didn’t investigate this beyond noticing it.

cbind(coef(f2()), coef(f4()))

                  [,1]       [,2]
(Intercept) -0.4771934 -0.4771935
polity_conq -2.2771109 -2.2771117
lndist       3.4020241  3.4020215
terrain      1.1018696  1.1018713
soldperterr -0.5952096 -0.5952110
gdppc2      -1.1542010 -1.1541998
coord        3.0514481  3.0514473
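To put a number on “slightly different,” here is a small sketch that compares the two coefficient vectors. The values are copied from the output above; the object names are mine.

```r
# coefficients copied from the output above
b_brglm2  <- c(-0.4771934, -2.2771109, 3.4020241, 1.1018696,
               -0.5952096, -1.1542010, 3.0514481)
b_logistf <- c(-0.4771935, -2.2771117, 3.4020215, 1.1018713,
               -0.5952110, -1.1541998, 3.0514473)

# largest absolute and relative differences between the two fits
max_abs_diff <- max(abs(b_brglm2 - b_logistf))
max_rel_diff <- max(abs(b_brglm2 - b_logistf) / abs(b_brglm2))
c(absolute = max_abs_diff, relative = max_rel_diff)
```

The largest absolute gap is on the order of 1e-6, which would be consistent with different convergence tolerances, though I have not verified that explanation.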

Ioannis Kosmidis made me aware of two things.

logistf has a C++ backend (thus explaining the speed).
brglm2 is written entirely in R. He noted that a new version is coming out soon that might be substantially faster. (brglm2 is also more general; it supports a variety of models and corrections).

Computer

Here’s the info on my machine.

system("sysctl -n machdep.cpu.brand_string", intern = TRUE)

[1] "Apple M2 Max"

References

Firth, David. 1993. “Bias Reduction of Maximum Likelihood Estimates.” Biometrika 80 (1): 27–38. https://doi.org/10.1093/biomet/80.1.27.

Rainey, Carlisle, and Kelly McCaskey. 2021. “Estimating Logit Models with Small Samples.” Political Science Research and Methods 9 (3): 549–64. https://doi.org/10.1017/psrm.2021.9.

Weisiger, Alex. 2014. “Victory Without Peace: Conquest, Insurgency, and War Termination.” Conflict Management and Peace Science 31 (4): 357–82. https://doi.org/10.1177/0738894213508691.

Power, Part III: The Rule of 3.64 for Statistical Power

Carlisle Rainey — Mon, 12 Jun 2023 04:00:00 GMT

Background

I’ve wrapped up the argument that you should pursue statistical power in your experiments. In sum, you should do it for you (not a future Reviewer 2) and you shouldn’t see confidence intervals nestle consistently against zero.

In this post, I’d like to develop the intuition for power calculations, two helpful guidelines, and one implication.

Main takeaway: You need the ratio of the true effect and the standard error to be more than 3.64.

Starting Point: The Sampling Distribution

When you run one experiment, you realize one of many possible patterns of randomization. This particular realization produces a single estimate of the treatment effect from a distribution of possible estimates. The distribution of possibilities is called a “sampling distribution.”

Thus, we can think of the estimate as a random variable and its distribution as the sampling distribution. The sampling distribution is key to everything I do in this post, so let’s spend some time with it.

Let’s imagine that we did the exact same study 50 times. Let’s say that we computed a difference-in-means in dollars ($) donated.¹ I refer to this difference-in-means as the estimated treatment effect. It is the estimate of the average treatment effect (in $). This estimate will vary across the many possible patterns of randomization because each pattern puts different respondents in the treatment and control group.

¹ I just want an easy-to-use unit here, and dollars meets that criterion. Other units work fine, too.

We can visualize this with gganimate. We can imagine each iteration of the study as producing a particular estimate. We continue to repeat the study and collect the estimates. Eventually, we can produce a histogram from this collection of estimates. This histogram represents the sampling distribution and is fundamental to the calculations that follow. The figure below shows how we might collect the points into a histogram.
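As a static stand-in for the animation, here is a minimal base-R sketch of the same idea. Every number in it (100 respondents per arm, a true effect of $1, a standard deviation of $5, a control mean of $10) is hypothetical. For simplicity it draws fresh outcomes for each study rather than re-randomizing a fixed set of respondents.

```r
set.seed(1234)

# hypothetical design: 100 respondents per arm, true effect of $1,
# donations with a standard deviation of $5
n_per_arm   <- 100
true_effect <- 1
sd_donation <- 5

run_study <- function() {
  control <- rnorm(n_per_arm, mean = 10, sd = sd_donation)
  treated <- rnorm(n_per_arm, mean = 10 + true_effect, sd = sd_donation)
  mean(treated) - mean(control)  # estimated treatment effect (in $)
}

# repeat the exact same study 50 times and collect the estimates
estimates <- replicate(50, run_study())

# the histogram of estimates approximates the sampling distribution
hist(estimates, main = "50 estimates from repeated studies",
     xlab = "Estimated treatment effect ($)")
```

Each call to `run_study()` plays the role of one realized pattern of randomization, and the spread of the histogram is what the standard error summarizes.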


Teaching Confidence Intervals and Hypothesis Testing with gganimate

Carlisle Rainey — Thu, 01 Jun 2023 04:00:00 GMT

Background

When I give students the formula for a confidence interval, I find that they don’t have a sharp concept of how confidence intervals work, even if I explain the components of the formula well.

Even though they understand—seemingly very well—that the point estimate is noisy, they struggle to conceptualize that a confidence interval can often include values on the incorrect side of zero. Stated differently, they have a hard time understanding how a hypothesis test can fail to reject the null (when the null is incorrect). Their intuition suggests that a “test” should give you the correct answer.

Because their instincts are wrong, I want to undermine their trust in hypothesis tests. I want them to feel the riskiness of poorly-powered experiments that consistently nestle confidence intervals right up against zero. I want them ready and eager to work hard to avoid that risk—to make sure they have adequate statistical power.

To help undermine their confidence in confidence intervals, I like three exercises that mimic a test with 80% power. In each case, we are assuming that we have formulated a correct hypothesis and designed an excellent experiment with 80% power.

First, I have students roll a six-sided die. If the die lands on one designated face (a 1-in-6 chance), then their study fails and they have wasted their opportunity. See this post for details on this perspective. This simulates the riskiness of an experiment with 80% power quite well. They get lots of failed experiments in a short period of time. They become well-aware of the possibility of a failed study and grow increasingly interested in reducing this risk.
Second, I use a computer to produce a plot of many (about 50 seems right) confidence intervals from the same repeated study. I explain that the interval we will get in the study we actually conduct is like a random draw from this collection. (See below for an example of this figure.)
Third, I make the plot dynamic. I find that dynamics make the plot more memorable and convey the (appropriate) idea that hypothesis tests and confidence intervals are chaotic, noisy quantities.

These exercises make it clear that failed studies are real possibilities. Hopefully they clearly see and “feel”:

The hypothesis test is no oracle. It will not consistently reject the null (even when the null is wrong) unless you supply overwhelming evidence. In experimental design, that’s not a task, that’s the task.

Below, I walk through the plots I use in parts 2 and 3.

An Experiment that We Can Repeat

First, let’s choose to conduct an experiment with 80% power. To get this, we’ll suppose that the true effect is 1 and the standard error is 0.4. To obtain 80% power, you can use the guideline that the standard error should be about 40% of the true effect (or the true effect divided by 2.48). I’ll show where these guidelines come from in a future post. With the true effect and standard error in hand, we can compute the long-run properties of the experiment.
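As a rough check on these numbers, here is a minimal sketch in Python. It assumes a one-sided z-test at the 5% level with a normally distributed estimate; the function name `power_one_sided` is my own label, not something from the original code.

```python
from statistics import NormalDist

std_normal = NormalDist()

def power_one_sided(true_effect, std_error, alpha=0.05):
    """Chance that a one-sided z-test rejects the null of no positive effect."""
    critical = std_normal.inv_cdf(1 - alpha)  # about 1.64 for alpha = 0.05
    # Reject when estimate / std_error exceeds the critical value.
    return 1 - std_normal.cdf(critical - true_effect / std_error)

power = power_one_sided(true_effect=1.0, std_error=0.4)
print(f"power: {power:.3f}")

# Under these assumptions, the divisor in the guideline is the critical value
# plus the 80th percentile of the standard normal.
ratio = std_normal.inv_cdf(0.95) + std_normal.inv_cdf(0.80)
print(f"true effect / standard error needed for 80% power: {ratio:.3f}")
```

Under these assumptions, a true effect of 1 and a standard error of 0.4 give power of roughly 80%, and the computed divisor comes out just under 2.49, which agrees with the 2.48 guideline.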

Power, Part II: What Do Confidence Intervals from Well-Powered Studies Look Like?

Carlisle Rainey — Thu, 25 May 2023 04:00:00 GMT

Background

In this post, I address confidence intervals that are nestled right up against zero.¹ These intervals indicate that an estimate is “barely” significant. I want to be clear: “barely significant” is still significant, so you should still reject the null hypothesis.²

¹ This is the second post in a series. In my previous post, I mentioned two new papers that have me thinking about power: Arel-Bundock et al.’s “Quantitative Political Science Research Is Greatly Underpowered” and Kane’s “More Than Meets the ITT: A Guide for Investigating Null Results”. Go check out that post and those papers if you haven’t.

² I’m focusing on confidence intervals here because inference from confidence intervals is a bit more intuitive (see Rainey 2014 and Rainey 2015). In the cases I discuss, checking whether the p-value is less than 0.05 and checking whether the confidence interval excludes zero are equivalent.

But I want to address a feeling that can come along with a confidence interval nestled right up against zero. A feeling of victory. It seems like a perfectly designed study. You rejected the null and collected just enough data to do it.

But instead, it should feel like a near-miss. Like an accident narrowly avoided. A confidence interval nestled right up against zero indicates that one of two things has happened: either you were (1) unlucky or (2) under-powered.

Because “unlucky” is always a possibility, we can’t learn much from a particular confidence interval, but we can learn a lot from a literature. A literature with well-powered studies produces confidence intervals that often fall far from zero. A well-powered literature does not produce confidence intervals that consistently nestle up against zero. Under-powered studies, though, do tend to produce confidence intervals that nestle right up against zero.

A Simulation

I’m going to explore the behavior of confidence intervals with a little simulation. In this simulation, I’m going to assert a standard error rather than create the standard error endogenously through sample size, etc. I use a true effect size of 0.5 and standard errors of 0.5, 0.3, 0.2, and 0.15 to create studies with 25%, 50%, 80%, and 95% power, respectively.³ I think of 80% as “minimally-powered” and 95% as “well-powered.”

³ I’m ignoring how to choose the true effect, estimate the standard error, and compute power. For now, I’m placing all this behind the curtain. See Daniël Lakens’ book Improving Your Statistical Inferences for discussion (h/t Bermond Scoggins).

I’m using a one-sided test (hypothesizing a positive effect), so I’ll use 90% confidence intervals with arms that are 1.64 standard errors wide. Let’s simulate some estimates from each of our four studies and compute their confidence intervals. I simulate 5,000 confidence intervals to explore below.
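The setup just described can be sketched in a few lines of Python. This is an illustration under the stated assumptions (normally distributed estimates, an asserted standard error, 90% intervals with arms of 1.64 standard errors), not the original code; the seed and variable names are arbitrary.

```python
import random
from statistics import NormalDist

rng = random.Random(1234)

true_effect = 0.5
std_errors = [0.5, 0.3, 0.2, 0.15]  # asserted, not derived from a sample size
n_sims = 5000

results = {}
for se in std_errors:
    # Each simulated study yields one estimate and one 90% confidence interval.
    estimates = [rng.gauss(true_effect, se) for _ in range(n_sims)]
    intervals = [(est - 1.64 * se, est + 1.64 * se) for est in estimates]

    # The one-sided test rejects when the lower bound sits above zero.
    share_rejecting = sum(lower > 0 for lower, _ in intervals) / n_sims

    # Analytic power for comparison.
    analytic = 1 - NormalDist().cdf(1.64 - true_effect / se)
    results[se] = (share_rejecting, analytic)
    print(f"SE = {se:<4}  simulated: {share_rejecting:.2f}  analytic: {analytic:.2f}")
```

The simulated share of intervals that exclude zero should land close to the analytic power for each standard error, roughly matching the 25%, 50%, 80%, and 95% figures above.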

Power, Part I: Power Is for You, Not for Reviewer Two

Carlisle Rainey — Sun, 21 May 2023 04:00:00 GMT

Background

There’s been some really good work lately on statistical power. I’ll point you to two really great papers.

Arel-Bundock, Vincent, Ryan C. Briggs, Hristos Doucouliagos, Marco Mendoza Aviña, and T.D. Stanley. 2022. “Quantitative Political Science Research Is Greatly Underpowered.” OSF Preprints. July 5. doi: 10.31219/osf.io/7vy2f.
Kane, John V. 2023. “More Than Meets the ITT: A Guide for Investigating Null Results.” APSA Preprints. doi: 10.33774/apsa-2023-h4p0q-v2.

I’ve long been interested in statistical power (see Rainey 2014¹ and Rainey 2015²), and these new papers have me thinking even more about the importance of power.

¹ Rainey, Carlisle. 2014. “Arguing for a Negligible Effect.” American Journal of Political Science 58(4): 1083-1091.

² McCaskey, Kelly and Carlisle Rainey. 2015. “Substantive Importance and the Veil of Statistical Significance.” Statistics, Politics, and Policy 6(1-2): 77-96.

In this post, I argue that statistical power isn’t something ancillary. Power is primary. I also argue that power isn’t something you–the researcher–build to satisfy an especially cranky Reviewer 2, it’s something you do for yourself, to make sure that your study succeeds.

The Hypothesis Testing Framework

In the hypothesis testing framework, you consider two hypotheses: the null hypothesis and the alternative hypothesis.

The hypothesis test is all about arguing against the null hypothesis (leaving the alternative as the only remaining possibility). You will (try to) show that your data would be “unusual” if the null hypothesis were correct.³

³ When hypothesizing about the average treatment effect (ATE), this can take a variety of forms. The form doesn’t really matter.

If the data would NOT be unusual under the null hypothesis, then you do not reject the null hypothesis.

Interpreting a Failure to Reject

A failure to reject means that the data “would not be unusual under the null hypothesis.” This does not imply that you should conclude the data are only consistent with the null. Indeed, there is a sharp asymmetry in hypothesis testing. I describe this in my 2014 AJPS:

Political scientists commonly interpret a lack of statistical significance (i.e., a failure to reject the null) as evidence for a negligible effect (Gill 1999), but this approach acts as a broken compass… If the sample size is too small, the researcher often concludes that the effect is negligible even though the data are also consistent with large, meaningful effects. This occurs because the small sample leads to a large confidence interval, which is likely to contain both “no effect” and large effects.

Gill (1999)⁴ describes this more forcefully:

⁴ Gill, Jeff. 1999. “The Insignificance of Null Hypothesis Significance Testing.” Political Research Quarterly 52(3): 647-674.

We teach graduate students to be very careful when describing the occurrence of not rejecting the null hypothesis. This is because failing to reject the null hypothesis does not rule out an infinite number of other competing research hypotheses. Null hypothesis significance testing is asymmetric: if the test statistic is sufficiently atypical given the null hypothesis then the null hypothesis is rejected, but if the test statistic is insufficiently atypical given the null hypothesis then the null hypothesis is not accepted. This is a double standard: H1 is held innocent until proven guilty and Ho is held guilty until proven innocent (Rozeboom 1960)…

There are two problems that develop as a result of asymmetry. The first is a misinterpretation of the asymmetry to assert that finding a non-statistically significant difference or effect is evidence that it is equal to zero or nearly zero. Regarding the impact of this acceptance error Schmidt (1996: 126) asserts that this: “belief held by many researchers is the most devastating of all to the research enterprise.” This acceptance of the null hypothesis is damaging because it inhibits the exploration of competing research hypotheses. The second problem pertains to the correct interpretation of failing to reject the null hypotheses. Failing to reject the null hypothesis essentially provides almost no information about the state of the world. It simply means that given the evidence at hand one cannot make an assertion about some relationship: all you can conclude is that you can’t conclude that the null was false (Cohen 1962).

There are many incorrect, but somewhat innocent interpretations of p-values. Interpreting a lack of statistical significance as evidence for the null is incorrect and wildly misleading in many cases.

Important Point

A non-statistically significant difference is not evidence that an effect is equal to zero or nearly zero. Interpreting a non-statistically significant effect otherwise is “devastating.”

The Implication of a Non-Conclusion

If you cannot draw a conclusion, then what exactly has happened? Obtaining a non-significant result will not be an “error” because you won’t make a strong claim that the research hypothesis is wrong. Instead, you will simply admit that you failed to uncover evidence against the null. Failing to uncover evidence isn’t an error.

Indeed, Jones and Tukey (2000)⁵ write:

⁵ Jones, Lyle V., and John W. Tukey. 2000. “A Sensible Formulation of the Significance Test.” Psychological Methods 5(4): 411-414.

A conclusion is in error only when it is “a reversal,” when it asserts one direction while the (unknown) truth is the other direction. Asserting that the direction is not yet established may constitute a wasted opportunity, but it is not an error.

Important Point

Failing to uncover evidence isn’t an “error,” it is a “wasted effort.”

This is worth emphasizing in a different way. Tests are not magical tools that tell you which hypothesis is correct. Instead, tests summarize the evidence against the null. There are two critical pieces to “evidence against the null”: (1) the amount of evidence and (2) whether the evidence is against the null. If you buy your own argument that the null is false (surely you do!), then (2) is taken care of. Only the amount of evidence remains, and you–the researcher–choose the amount of evidence to supply.

The Implication for Power Calculations

This perspective helps motivate power calculations. By their design, tests control the error rate in certain situations (when the null is correct). You do not need to worry about Type I errors. First, the test controls the error rate under the null. Second, you are pretty sure the null is wrong (see your theory section).

Important Point

The hypothesis test takes care of the Type I error rate. If you choose a properly-sized test, you don’t need to worry about those errors any more.

If you aren’t worried about Type I errors, what are you worried about? The only thing left to worry about is wasting your time and money. Statistical power is the chance of not wasting your time and money.

Power isn’t a secondary quantity that you compute for thoroughness or in anticipation of a comment from Reviewer 2. Power is something that you build for yourself.

Statisticians talk a lot about Type I errors because that’s their contribution. It’s your job to bring the power.

And importantly, power is under your control. Kane provides a rich summary of ways to increase the power of your experiment. At a minimum, you have brute force control through sample size.
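To illustrate the brute-force route, here is a minimal sketch in Python of how power rises with sample size. All of the inputs are assumptions for illustration: a difference-in-means design with equal group sizes, a true effect of $1, an outcome standard deviation of $5, and a one-sided test at the 5% level.

```python
from math import sqrt
from statistics import NormalDist

def power_for_n(n_per_group, true_effect=1.0, sd=5.0, alpha=0.05):
    """Approximate power of a one-sided z-test on a difference-in-means."""
    z = NormalDist()
    std_error = sd * sqrt(2 / n_per_group)  # shrinks as the sample grows
    return 1 - z.cdf(z.inv_cdf(1 - alpha) - true_effect / std_error)

sample_sizes = [50, 100, 200, 400, 800]
powers = [power_for_n(n) for n in sample_sizes]
for n, p in zip(sample_sizes, powers):
    print(f"n per group = {n:>3}  power = {p:.2f}")
```

Under these assumptions, quadrupling the sample size halves the standard error, so moving from modest to high power can take a much larger sample.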

Power isn’t an ancillary concern, it’s the entire game from the very beginning of the planning stage. It should be at the forefront of the researcher’s mind from the very beginning. You should want the power as high as possible.⁶

⁶ I hear that 80% is the standard, but I’m pretty uncomfortable spending dozens of hours and thousands of dollars on a study with a 1 in 5 chance of wasting my time. I want that chance as close to zero as I can get it. I want power close to 100%. 99% power and 80% power might both seem “high” or “acceptable,” but these are not the same. 80% power means 1 in 5 studies fail. 99% power means that 1 in 100 studies fail.

You have to supply a test with overwhelming evidence to consistently reject the null. Careful power calculations help you make sure you succeed in this war against the null.

Power isn’t about Type S and M errors (Gelman and Carlin 2014)⁷. Power is about you protecting yourself from a failed study. And that seems like a protection worth pursuing carefully.⁸

⁷ Gelman, Andrew, and John Carlin. 2014. “Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors.” Perspectives on Psychological Science 9(6): 641-651.

⁸ Of course it’s also about Type S and M errors, but those are discipline-level concerns. I’m talking about your incentives as a researcher.

Summary

Here are the takeaways:

Statistical power is the chance of using your time and money productively (i.e., not wasting it).
Statistical power is under your control (see Kane).
Your power might be (much) lower than you think–you should check (see Arel-Bundock et al.).
Power should be a primary concern throughout the design. The researcher should care deeply about power, perhaps more than anything else.

Important Point

The hypothesis test is no oracle. It will not consistently reject the null (even when the null is wrong) unless you supply overwhelming evidence. In experimental design, that’s not a task, that’s the task.
