We can think of statistical power as determined by the ratio $\frac{\tau}{SE}$, where $\tau$ is the treatment effect and $SE$ is the standard error of the estimate. To reason about statistical power, one needs to make assumptions or predictions about the treatment effect and the standard error.
And as data-oriented researchers, we often want to use data to inform these predictions and assumptions. We might want to use pilot data.^{1}
^{1} Here’s how Leon, Davis, and Kraemer (2011) describe the purpose of a pilot study. “The fundamental purpose of conducting a pilot study is to examine the feasibility of an approach that is intended to ultimately be used in a larger scale study. This applies to all types of research studies. Here we use the randomized controlled clinical trial (RCT) for illustration. Prior to initiating a full scale RCT an investigator may choose to conduct a pilot study in order to evaluate the feasibility of recruitment, randomization, retention, assessment procedures, new methods, and/or implementation of the novel intervention. A pilot study, however, is not used for hypothesis testing. Instead it serves as an earlier-phase developmental function that will enhance the probability of success in the larger subsequent RCTs that are anticipated.”
With a predicted standard error in hand, we can predict the minimum detectable effect, the statistical power, or the required sample size in the planned study.
In this post, I give an example of how this can work.
Here’s how I suggest we use pilot data to predict the standard error in the planned study:
Conservatively, the standard error will be about $\sqrt{\frac{n^{pilot}}{n^{planned}}}\left(\sqrt{\frac{1}{n^{pilot}}} + 1\right)\widehat{SE}^{pilot}$, where $n^{pilot}$ is the number of respondents per condition in the pilot data, $\widehat{SE}^{pilot}$ is the estimated standard error using the pilot data, and $n^{planned}$ is the number of respondents per condition in the planned study.
The factor $\left(\sqrt{\frac{1}{n^{pilot}}} + 1\right)$ nudges the standard error from the pilot study in a conservative direction, since it might be an under-estimate of the actual standard error.^{2} For the details, see this early paper; in short, this conservative estimate is approximately the upper bound of a 95% confidence interval for the standard error computed from the pilot data.
^{2} More generally, we can use a bootstrap to conservatively estimate the standard error, without relying on this analytical approximation.
As an example, let’s use half of the experiment conducted by Robbins et al. (2024).
Robbins et al. use a 2x2 factorial vignette design, randomly assigning each respondent to read one of four vignettes. The vignette describes a hypothetical covert operation ordered by the president that ends in either success or failure. Then, the vignette describes a whistleblower coming forward and describes the president’s opposition in Congress as either amplifying or ignoring the whistleblower.
| President's Opposition in Congress | Outcome of Operation: Success | Outcome of Operation: Failure |
|---|---|---|
| Amplifies Whistleblower | Vignette 1: Success & Amplify | Vignette 2: Failure & Amplify |
| Ignores Whistleblower | Vignette 3: Success & Ignore | Vignette 4: Failure & Ignore |
After the vignette, the respondent is asked whether they approve of the opposition in Congress’ actions on a seven-point Likert scale from strongly approve to strongly disapprove.
For a simple example, let’s focus on the effect of amplifying the whistleblower when the operation succeeds. That is, let’s compare responses after Vignette 1 and Vignette 3. How much does amplifying a whistleblower increase approval when the operation succeeds? We expect a small effect here, so we should pay careful attention to power.
We hoped to detect an effect as small as 0.35 points on the seven-point scale and had tentatively planned on 250 respondents per condition. To test the survey instrument and data provider, we conducted a small pilot with about 75 respondents per condition. Let’s use those pilot data to check whether 250 respondents seem sufficient.
In the {crdata} package on GitHub, you can find the pilot data we collected leading up to the main study.
# download the {crdata} package from github
remotes::install_github("carlislerainey/crdata")
Now let’s load the pilot data. To focus on observations where the operation succeeds, we’re going to keep only the observations where the vignette describes a successful operation.
# load the {tidyverse} for glimpse() and other helpers used below
library(tidyverse)

# load pilot data and keep only success condition
robbins2_pilot <- crdata::robbins_pilot |>
subset(failure == "Success") |>
glimpse()
Rows: 147
Columns: 5
$ cong_overall <dbl> 3, 1, -2, 0, -2, -1, 0, -1, 0, 0, 0, 2, -1, 3, -3, 0, 0, …
$ failure <fct> Success, Success, Success, Success, Success, Success, Suc…
$ amplify <fct> Ignore, Ignore, Ignore, Amplify, Ignore, Ignore, Ignore, …
$ pid7 <fct> Strong Democrat, Not very strong Republican, Strong Democ…
$ pid_strength <dbl> 3, 2, 3, 2, 2, 3, 2, 2, 2, 2, 2, 1, 0, 3, 3, 3, 3, 1, 1, …
`cong_overall` is the respondent’s approval of Congress’ actions on a seven-point scale, and `amplify` indicates whether Congress amplified the whistleblower (i.e., criticized the president).
Now let’s analyze the pilot data as we plan to analyze the main data set we will collect later. We’re interested in the average response in the `Amplify` and `Ignore` conditions, so let’s use a t-test.
# t test
fit_pilot <- t.test(cong_overall ~ amplify, data = robbins2_pilot)
It can be really tempting to look at the estimated treatment effect. In this pilot study, it’s actually statistically significant. I intentionally don’t show the estimated treatment effect (or quantities requiring it, like p-values). If we looked at these, we might make one of the following mistakes:

- Claiming that the treatment effect is large because the pilot estimate is statistically significant.
- Claiming that the planned study is well powered because the pilot study already detected the effect.
Both of these claims are misleading. The estimated treatment effect is very noisy, so ignore the estimated treatment effect.
To predict the standard error in the main study, we need two pieces of information from this pilot:
We can get the number of observations per condition using `table()`.
# create a table showing the observations per condition
table(robbins2_pilot$amplify)
Ignore Amplify
70 77
# sample size per condition
n_pilot <- mean(table(robbins2_pilot$amplify))
n_pilot
[1] 73.5
And then we need the estimated standard error, which is computed by `t.test()`.
# get estimated standard error from pilot
se_hat_pilot <- fit_pilot$stderr
se_hat_pilot
[1] 0.2761011
Now we can predict the standard error in the planned study.
For the main study, we planned on about 250 respondents per condition.
n_planned <- 250
Then we can conservatively predict the standard error in the full study as $\sqrt{\frac{n^{pilot}}{n^{planned}}}\left(\sqrt{\frac{1}{n^{pilot}}} + 1\right)\widehat{SE}^{pilot}$.
pred_se_cons <- sqrt(n_pilot/n_planned)*((sqrt(1/n_pilot) + 1)*se_hat_pilot)
pred_se_cons
[1] 0.1671691
But is this standard error small enough?
We can convert the standard error to the minimum detectable effect with 80% power using $MDE \approx 2.5 \times SE$.^{3}
^{3} See Bloom (1995) for an excellent discussion of this rule. I also write about it here.
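Where does the 2.5 come from? For a one-sided test at the 5% level with 80% power, the multiplier is $z_{0.95} + z_{0.80}$, which we can check in R:

```r
# the MDE multiplier for a one-sided 5% test with 80% power
multiplier <- qnorm(0.95) + qnorm(0.80)
multiplier # about 2.49, commonly rounded to the 2.5 rule of thumb
```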
# compute conservative minimum detectable effect
2.5*pred_se_cons
[1] 0.4179227
We hoped to detect an effect as small as 0.35 points on the seven-point scale, so we’re going to need more than 250 respondents per condition!
We can also compute the power to detect an effect of 0.35 points on the seven-point scale.
# compute power as a percent
1 - pnorm(1.64 - 0.35/pred_se_cons)
[1] 0.6749736
Note that these are conservative estimates of the minimum detectable effect and statistical power.
Here’s what things look like if we remove the conservative nudge and predict the standard error as $\sqrt{\frac{n^{pilot}}{n^{planned}}}\,\widehat{SE}^{pilot}$.
# without the conservative nudge
pred_se <- sqrt(n_pilot/n_planned)*se_hat_pilot
2.5*pred_se # best guess of minimum detectable effect
[1] 0.3742673
1 - pnorm(1.64 - 0.35/pred_se) # best guess of power
[1] 0.7573806
As you can see, the minimum detectable effect is a little too large and the power a little too low. We need more respondents!
Our plan of 250 respondents per condition seems too low. If we want, we can predict the sample size we need to get to 80% power using the following rule:
For 80% power to detect the treatment effect $\tau$, we will (conservatively) need about $n^{pilot}\left[\frac{2.5}{\tau}\left(\sqrt{\frac{1}{n^{pilot}}} + 1\right)\widehat{SE}^{pilot}\right]^2$ respondents per condition, where $n^{pilot}$ is the number of respondents per condition in the pilot data and $\widehat{SE}^{pilot}$ is the estimated standard error using the pilot data.
n_pilot*((2.5/0.35)*((sqrt(1/n_pilot) + 1)*se_hat_pilot))^2
[1] 356.4477
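This sample-size rule comes from setting the conservative minimum detectable effect equal to the target effect $\tau$ and solving for the planned sample size:

$$
2.5 \sqrt{\frac{n^{pilot}}{n^{planned}}}\left(\sqrt{\frac{1}{n^{pilot}}} + 1\right)\widehat{SE}^{pilot} = \tau
\quad \Longrightarrow \quad
n^{planned} = n^{pilot}\left[\frac{2.5}{\tau}\left(\sqrt{\frac{1}{n^{pilot}}} + 1\right)\widehat{SE}^{pilot}\right]^2
$$

Plugging in $\tau = 0.35$ reproduces the calculation above.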
Thus to get 80% power, the pilot data suggest that we (conservatively) need about 360 respondents per condition. We used 367 in the full study. Here are the conservative predictions for 367 respondents per condition.
n_planned <- 367
pred_se_cons <- sqrt(n_pilot/n_planned)*((sqrt(1/n_pilot) + 1)*se_hat_pilot)
pred_se_cons
[1] 0.1379726
2.5*pred_se_cons # conservative minimum detectable effect
[1] 0.3449315
1 - pnorm(1.64 - 0.35/pred_se_cons) # conservative power
[1] 0.8150699
We ran the full study.^{4}
^{4} See Robbins et al. (2024) for the full results.
robbins2_main <- crdata::robbins_main |>
subset(failure == "Success") |>
glimpse()
Rows: 735
Columns: 5
$ cong_overall <dbl> 2, -2, -1, -3, 0, -1, -2, -1, 1, 1, 0, -3, -2, 2, 2, -3, …
$ failure <fct> Success, Success, Success, Success, Success, Success, Suc…
$ amplify <fct> Ignore, Amplify, Amplify, Amplify, Ignore, Ignore, Amplif…
$ pid7 <fct> Not very strong Republican, Not very strong Republican, S…
$ pid_strength <dbl> 2, 2, 3, 3, 3, 2, 0, 2, 2, 1, 2, 0, 3, 1, 2, 3, 2, 3, 2, …
fit_main <- t.test(cong_overall ~ amplify, data = robbins2_main)
fit_main$stderr
[1] 0.1322618
1 - pnorm(1.64 - 0.35/fit_main$stderr)
[1] 0.8428563
As you can see, the pilot data gave us a good, slightly conservative prediction. We conservatively predicted a standard error of 0.138 in the planned study, and we estimated a standard error of 0.132 after running the study. We conservatively predicted about 82% power to detect an effect of 0.35 on the seven-point scale; after running the study, it seems we had about 84% power.
We can also use the bootstrap as an alternative. There are a few ways one might approach it.
Here’s one:
Repeat the resampling process many times. For each standard error estimate, compute the implied statistical power. This gives a distribution of power estimates. Find a value near the bottom of this distribution. The factor $\left(\sqrt{\frac{1}{n^{pilot}}} + 1\right)$ we used above nudges the predicted power to about its 2.5th percentile, so we can use the 2.5th percentile here, too.
# number of bootstrap iterations
n_bs <- 10000
bs_se <- numeric(n_bs) # a container
for (i in 1:n_bs){
# resample 367 observations from each condition
bs_data <- robbins2_main %>%
group_by(amplify) %>%
sample_n(size = 367, replace = TRUE)
# run planned analysis
bs_fit <- t.test(cong_overall ~ amplify, data = bs_data)
# grab se
bs_se[i] <- bs_fit$stderr
}
# compute 2.5th percentile of power to obtain conservative estimate
pwr <- 1 - pnorm(1.64 - 0.35/bs_se)
quantile(pwr, probs = 0.025)
2.5%
0.8206062
Using the analytical approximation, we got 0.815 as a conservative estimate of power. The bootstrap gave us 0.820 as a conservative estimate. The actual power in the full study turned out to be about 0.843. (Remember, all of these power calculations are power to detect an effect of 0.35 points on the seven-point scale.)
I have an early draft of a paper on these (and other) ideas. Please test them out in your own work and let me know if you have questions, comments, and suggestions. I’m interested in making the paper as clear and useful as I can.
We can think of statistical power as determined by the ratio $\frac{\tau}{SE}$, where $\tau$ is the treatment effect and $SE$ is the standard error of the estimate.^{1} To reason about statistical power, one needs to make assumptions or predictions about the treatment effect and the standard error.
^{1} Bloom (1995) has a really beautiful paper on this idea. It’s one of my favorites.
In this post, I discuss ways that pilot data should and should not be used as part of a power analysis. I make two points:

1. We should *not* use pilot data to estimate the treatment effect for a power analysis.
2. We *should* use pilot data to predict the standard error in the planned study.
With a predicted standard error in hand, we can predict the minimum detectable effect, the statistical power, or the required sample size in the planned study.
To compute statistical power, researchers need to make an assumption about the size of the treatment effect. It’s easy to feel lost without any guidance on what effects are reasonable to look for, so we might feel tempted to use a small pilot study to estimate the treatment effect and then use that estimate in our power analysis. This is a bad idea because the estimate of the treatment effect from a pilot study is too uncertain for a power analysis.^{2} Leon, Davis, and Kraemer (2011) and Albers and Lakens (2018) discuss this problem in more detail.^{3}
^{2} The estimate of the treatment effect from a well-powered study might be too noisy as well.
^{3} Perugini, Gallucci, and Costantini (2014) offer a potential solution if it’s important to estimate the treatment effect from pilot data, though their approach is data-hungry and very conservative. Pilot data can estimate/predict the standard deviation in the full study quite precisely even with small pilots, such as 10 respondents per condition.
Do not use a small pilot study to estimate the treatment effect and then use that estimate as the treatment effect in a power analysis.
While pilot data might not be useful for estimating the treatment effect, they are useful for estimating the standard error of the planned study. Given that power is a function of the ratio of the treatment effect to the standard error, it’s important to have a good prediction of the standard error. Further, the noisiness of this estimated standard error is predictable, so it’s easy to nudge the estimate slightly to obtain a conservative prediction.
In political science, it’s common to run pilot studies with, say, 100-200 respondents before a full-sized study of, say, 1,000 respondents. It can be very helpful to use these pilot data to confirm any preliminary power calculations.
Here are two helpful rules:
We can use pilot data to predict the standard error of the estimated treatment effect in a planned study. Conservatively, the standard error will be about $\sqrt{\frac{n^{pilot}}{n^{planned}}}\left(\sqrt{\frac{1}{n^{pilot}}} + 1\right)\widehat{SE}^{pilot}$, where $n^{pilot}$ is the number of respondents per condition in the pilot data, $\widehat{SE}^{pilot}$ is the estimated standard error using the pilot data, and $n^{planned}$ is the number of respondents per condition in the planned study.
We can use pilot data to conservatively predict the sample size we will need in a planned study. For 80% power to detect the treatment effect $\tau$, we will (conservatively) need about $n^{pilot}\left[\frac{2.5}{\tau}\left(\sqrt{\frac{1}{n^{pilot}}} + 1\right)\widehat{SE}^{pilot}\right]^2$ respondents per condition, where $n^{pilot}$ is the number of respondents per condition in the pilot data and $\widehat{SE}^{pilot}$ is the estimated standard error using the pilot data.
Note that the factor $\left(\sqrt{\frac{1}{n^{pilot}}} + 1\right)$ nudges the predicted standard error in a conservative direction. See this working paper for more details.
We can use the predicted standard error to find the minimum detectable effect (for 80% power) or the power (for a given treatment effect). Or we can use the pilot data to estimate the required sample size (for 80% power to detect a given treatment effect).
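The two rules can be wrapped into small helper functions. This is just a sketch of the formulas above, with made-up inputs; the function names `predict_se_cons()` and `required_n()` are mine, not from any package:

```r
# Rule 1: conservative prediction of the SE in the planned study
predict_se_cons <- function(n_pilot, se_hat_pilot, n_planned) {
  sqrt(n_pilot / n_planned) * (sqrt(1 / n_pilot) + 1) * se_hat_pilot
}

# Rule 2: conservative required sample size per condition for 80% power
required_n <- function(n_pilot, se_hat_pilot, tau) {
  n_pilot * ((2.5 / tau) * (sqrt(1 / n_pilot) + 1) * se_hat_pilot)^2
}

# hypothetical example: 75 respondents per condition in the pilot,
# an estimated SE of 0.28, and 250 respondents per condition planned
predict_se_cons(n_pilot = 75, se_hat_pilot = 0.28, n_planned = 250) # about 0.171
required_n(n_pilot = 75, se_hat_pilot = 0.28, tau = 0.35)           # about 373
```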
Let’s imagine a setting where a study with 1,000 respondents has 80% power to detect an average treatment effect of 1 unit. I’m imagining that we’re using linear regression with robust standard errors to test the hypothesis that the average treatment effect is positive (aka Welch’s t-test). There’s just one treatment group and one control group with 500 respondents each, for 1,000 respondents total.
# set the treatment effect, SE, and sample size
tau <- 1 # treatment effect
se <- tau/(qnorm(0.95) + qnorm(0.80)) # standard error for 80% power
n_planned <- 500 # sample size per condition in planned full study
# calculate required standard deviation to yield 80% power given the above
sigma <- se*sqrt(2*n_planned)/2
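The last line inverts the usual formula for the standard error of a difference in means with $n$ respondents per condition and outcome standard deviation $\sigma$:

$$
SE = \sqrt{\frac{\sigma^2}{n} + \frac{\sigma^2}{n}} = \sigma\sqrt{\frac{2}{n}}
\quad \Longrightarrow \quad
\sigma = SE\sqrt{\frac{n}{2}} = \frac{SE\sqrt{2n}}{2}
$$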
Let’s confirm that this setting does indeed give us 80% power.
res_list <- NULL # a container to collect results
for (i in 1:10000) {
# simulate study
y0 <- rnorm(2*n_planned, sd = sigma)
y1 <- y0 + tau
d <- sample(rep(0:1, length.out = 2*n_planned))
y <- ifelse(d == 1, y1, y0)
data <- data.frame(y, d)
# fit model and get standard error and p-value
fit <- lm(y ~ d, data = data)
tau_hat <- as.numeric(coef(fit)["d"])
se_hat <- as.numeric(sqrt(diag(sandwich::vcovHC(fit, type = "HC2")))["d"])
p_value <- pnorm(tau_hat/se_hat, lower.tail = FALSE)
# collect results
res_list[[i]] <- data.frame(tau_hat, se_hat, p_value)
}
# compute power (and monte carlo error)
res_list |>
bind_rows() |>
summarize(power = mean(p_value < 0.05),
mc_error = sqrt(power*(1 - power))/sqrt(n()),
lwr = power - 2*mc_error,
upr = power + 2*mc_error)
power mc_error lwr upr
1 0.8023 0.003982646 0.7943347 0.8102653
Nailed it!
Alex Coppock kindly shared how one might run the same simulation using {DeclareDesign}. Code here.
Now let’s simulate 1,000 pilot studies with 10, 30, 60, 90, and 150 respondents per condition. I’m going to grab the standard error from each but throw the estimates of the treatment effects right into the trash.
# sample size per condition in pilot study
n_pilot_values <- c(10, 30, 60, 90, 150)
res_list <- NULL # a container to collect results
iter <- 1 # counter to index the collection
for (i in 1:10000) {
for (j in 1:length(n_pilot_values)) {
# set respondents per condition in the pilot study
n_pilot <- n_pilot_values[j]
# simulate pilot study
y0 <- rnorm(2*n_pilot, sd = sigma)
y1 <- y0 + tau
d <- sample(rep(0:1, length.out = 2*n_pilot))
y <- ifelse(d == 1, y1, y0)
pilot_data <- data.frame(y, d)
# fit model and get standard error
fit_pilot <- lm(y ~ d, data = pilot_data)
tau_hat <- as.numeric(coef(fit_pilot)["d"])
pilot_se_hat <- as.numeric(sqrt(diag(sandwich::vcovHC(fit_pilot, type = "HC2")))["d"])
# collect standard errors
res_list[[iter]] <- data.frame(pilot_se_hat, n_pilot)
iter <- iter + 1 # update counter
}
}
# combine collected results in a data frame
res <- bind_rows(res_list) |>
glimpse()
Rows: 50,000
Columns: 2
$ pilot_se_hat <dbl> 2.7282069, 1.7541542, 1.1426683, 0.9763233, 0.7760267, 3.…
$ n_pilot <dbl> 10, 30, 60, 90, 150, 10, 30, 60, 90, 150, 10, 30, 60, 90,…
Now let’s take a look at these standard errors from the simulated pilot studies. Notice that the standard errors are all larger than the standard error in the full study. And the smaller the pilot, the larger the standard error. This makes sense.
ggplot(res, aes(x = pilot_se_hat)) +
geom_histogram() +
facet_wrap(vars(n_pilot)) +
geom_vline(xintercept = se)
However, we can translate the standard errors from the pilot studies into predictions for the standard errors in the full studies by multiplying the pilot standard error by $\sqrt{\frac{n^{pilot}}{n^{planned}}}$.^{4}
^{4} In this setting, we’re planning on 500 respondents per condition.
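This translation works because, holding the outcome’s standard deviation $\sigma$ fixed, the standard error shrinks with the square root of the sample size:

$$
SE^{planned} \approx \sigma\sqrt{\frac{2}{n^{planned}}}
= \sigma\sqrt{\frac{2}{n^{pilot}}}\sqrt{\frac{n^{pilot}}{n^{planned}}}
\approx \widehat{SE}^{pilot}\sqrt{\frac{n^{pilot}}{n^{planned}}}
$$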
ggplot(res, aes(x = sqrt(n_pilot/n_planned)*pilot_se_hat)) +
geom_histogram() +
facet_wrap(vars(n_pilot)) +
geom_vline(xintercept = se)
That’s spot on! However, notice that we sometimes substantially underestimate the standard error. When we underestimate the standard error, we will overestimate the power (which is bad for us).
As a solution, we can gently nudge the standard error from the pilot up by a factor of $\left(\sqrt{\frac{1}{n^{pilot}}} + 1\right)$, which will make “almost all” of the standard errors over-estimates or “conservative” (details here).
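Here’s a rough intuition for why this particular factor works (my own back-of-the-envelope sketch, not the formal argument in the working paper). With two conditions, the pilot estimates $\sigma$ from roughly $2n^{pilot}$ observations, so the sampling standard deviation of $\hat{\sigma}$ is about $\sigma/\sqrt{2 \cdot 2n^{pilot}} = \sigma/(2\sqrt{n^{pilot}})$. An approximate upper bound of a 95% interval for $\sigma$ is then

$$
\hat{\sigma} + 2 \cdot \frac{\hat{\sigma}}{2\sqrt{n^{pilot}}} = \hat{\sigma}\left(\sqrt{\frac{1}{n^{pilot}}} + 1\right),
$$

which is exactly the nudge applied to the standard error.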
ggplot(res, aes(x = sqrt(n_pilot/n_planned)*(sqrt(1/n_pilot) + 1)*pilot_se_hat)) +
geom_histogram() +
facet_wrap(vars(n_pilot)) +
geom_vline(xintercept = se)
This works super well.
But how should we use this predicted standard error to evaluate or choose a sample size?
We can use these conservative standard errors to compute any of the following (conservatively, as well):
First, we can compute the minimum detectable effect with 80% power. This is about 2.5 times the standard error.
# compute minimum detectable effect
mde <- res %>%
mutate(pred_se_cons = sqrt(n_pilot/n_planned)*(sqrt(1/n_pilot) + 1)*pilot_se_hat,
mde_cons = 2.5*pred_se_cons) %>%
glimpse()
Rows: 50,000
Columns: 4
$ pilot_se_hat <dbl> 2.7282069, 1.7541542, 1.1426683, 0.9763233, 0.7760267, 3.…
$ n_pilot <dbl> 10, 30, 60, 90, 150, 10, 30, 60, 90, 150, 10, 30, 60, 90,…
$ pred_se_cons <dbl> 0.5078358, 0.5081264, 0.4469336, 0.4578814, 0.4597523, 0.…
$ mde_cons <dbl> 1.269590, 1.270316, 1.117334, 1.144704, 1.149381, 1.40411…
# plot minimum detectable effect
ggplot(mde, aes(x = mde_cons)) +
geom_histogram() +
facet_wrap(vars(n_pilot)) +
geom_vline(xintercept = tau)
Second, we can compute the statistical power for a given treatment effect. Power equals $1 - \Phi\left(1.64 - \frac{\tau}{SE}\right)$, where $\Phi$ is the standard normal distribution function (`pnorm()` in R), $SE$ is the standard error of the estimated treatment effect, and $\tau$ is the treatment effect.
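As a quick sanity check on this formula (using the exact $z_{0.95}$ rather than the rounded 1.64; the helper name `power_pred()` is mine):

```r
# predicted power for a one-sided 5% test, given effect tau and standard error se
power_pred <- function(tau, se) 1 - pnorm(qnorm(0.95) - tau / se)

# an effect of (z_0.95 + z_0.80) standard errors should give exactly 80% power
power_pred(tau = qnorm(0.95) + qnorm(0.80), se = 1) # 0.8
```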
# compute the power
pwr <- res %>%
mutate(pred_se_cons = sqrt(n_pilot/n_planned)*(sqrt(1/n_pilot) + 1)*pilot_se_hat,
power_cons = 1 - pnorm(1.64 - tau/pred_se_cons)) %>%
glimpse()
Rows: 50,000
Columns: 4
$ pilot_se_hat <dbl> 2.7282069, 1.7541542, 1.1426683, 0.9763233, 0.7760267, 3.…
$ n_pilot <dbl> 10, 30, 60, 90, 150, 10, 30, 60, 90, 150, 10, 30, 60, 90,…
$ pred_se_cons <dbl> 0.5078358, 0.5081264, 0.4469336, 0.4578814, 0.4597523, 0.…
$ power_cons <dbl> 0.6289751, 0.6285495, 0.7249028, 0.7067695, 0.7037043, 0.…
# plot the power
ggplot(pwr, aes(x = power_cons)) +
geom_histogram() +
facet_wrap(vars(n_pilot)) +
geom_vline(xintercept = .8)
Finally, we can compute the required sample size to obtain 80% power to detect a certain treatment effect.
As I described above, for 80% power to detect the treatment effect $\tau$, we will (conservatively) need about $n^{pilot}\left[\frac{2.5}{\tau}\left(\sqrt{\frac{1}{n^{pilot}}} + 1\right)\widehat{SE}^{pilot}\right]^2$ respondents per condition.
# compute the required sample size
ss <- res %>%
mutate(ss_cons = n_pilot*((2.5/tau)*(sqrt(1/n_pilot) + 1)*pilot_se_hat)^2) %>%
glimpse()
Rows: 50,000
Columns: 3
$ pilot_se_hat <dbl> 2.7282069, 1.7541542, 1.1426683, 0.9763233, 0.7760267, 3.…
$ n_pilot <dbl> 10, 30, 60, 90, 150, 10, 30, 60, 90, 150, 10, 30, 60, 90,…
$ ss_cons <dbl> 805.9289, 806.8515, 624.2176, 655.1731, 660.5380, 985.772…
# plot the required sample size
ggplot(ss, aes(x = ss_cons)) +
geom_histogram() +
facet_wrap(vars(n_pilot)) +
geom_vline(xintercept = n_planned)
Sample size is an especially helpful metric, because it is the constraint and cost that researchers face most directly. Because these required sample sizes are conservative, they tend to be too large—but by how much? For pilots with 60 respondents per condition, the sample sizes tend to be about 30% too large. This means that researchers could have obtained 80% power with 1,000 respondents but instead used 1,300 respondents.
In my view, this 30% waste is not particularly concerning. It’s relatively small and the statistical power will still be less than 90% even if the sample size is increased by 30%.
But most importantly, almost all of the sample sizes exceed what we need for 80% power.
# compute features of the sample sizes
ss %>%
group_by(n_pilot) %>%
mutate(waste = ss_cons/n_planned - 1) %>%
summarize(avg_waste = scales::percent(mean(waste), accuracy = 1),
pct_too_small = scales::percent(mean(ss_cons < 500), accuracy = 1)) %>%
rename(`Respondents per condition in pilot study` = n_pilot,
`Average waste (needed 1,000 respondents and used 1,300 means waste is 30%)` = avg_waste,
`Percent of sample sizes that produce less than 80% power` = pct_too_small) %>%
tinytable::tt()
| Respondents per condition in pilot study | Average waste (needed 1,000 respondents and used 1,300 means waste is 30%) | Percent of sample sizes that produce less than 80% power |
|---|---|---|
| 10 | 75% | 7% |
| 30 | 41% | 4% |
| 60 | 29% | 3% |
| 90 | 24% | 3% |
| 150 | 18% | 3% |
We think of statistical power as determined by the ratio $\frac{\tau}{SE}$, where $\tau$ is the treatment effect and $SE$ is the standard error of the estimate. To reason about statistical power, one needs to make assumptions or predictions about the treatment effect and the standard error.
I make two points in this post:

1. Do not use pilot data to estimate the treatment effect for a power analysis.
2. Do use pilot data to predict the standard error in the planned study.
With a predicted standard error in hand, we can obtain a prediction for the minimum detectable effect, the statistical power, or the required sample size.
You can find more details in this paper.
Below, I bookmark other references that might be helpful.
This Stack Exchange answer gives a brief but careful explanation of Firth’s logit. If you’re looking for a quick explanation, start here.
From my perspective, these are the two main papers to refer to if you’re concerned about small sample bias in logistic regression models.
Beyond these two main papers, there have been a few extensions. Zietkiewicz and Kosmidis (2023) talk about Firth’s logit in very large data sets. Cook, Hays, and Franzese (2018) make a good argument for using Firth’s estimator in panel data sets with binary outcomes and fixed effects. Sterzinger and Kosmidis (2023) apply these ideas to mixed models (or random effects models). Šinkovec et al. (2021) compare Firth’s approach to ridge regression, and suggest that Firth’s is superior in small or sparse data sets. Puhr et al. (2017) study Firth’s logit in the context of rare events and propose FLIC and FLAC as alternatives.
I learned about Firth’s estimator from Zorn (2005), who follows Heinze and Schemper (2002) in suggesting it as a solution to separation. According to David Firth in this blog post, this is the application that stimulated interest in the approach after it went relatively unnoticed for a few years. (Great post, I highly recommend reading it!) This application piqued my interest in Firth’s estimator. Briefly, I think Firth’s default penalty might not be substantively reasonable in a given application (Rainey 2016) (see also Beiser-McGrath (2020)) and the usual likelihood ratio and score tests work well without the penalty (Rainey 2023).
I remember sitting in a talk as a first-year graduate student, and the speaker said something like: “I expect no effect here, and, just as I expected, the difference is not statistically significant.” I was a little bit taken aback—of course, that’s not a compelling argument for a null effect. A lack of statistical significance is an absence of evidence for an effect; it is not evidence of an absence of an effect.
But I saw this approach taken again and again in published work. (And still do!)
My first publication was an AJPS article (Rainey 2014) (Ungated PDF) explaining why this doesn’t work well and how to do it better.
Here’s what I wrote in that paper:
Hypothesis testing is a powerful empirical argument not because it shows that the data are consistent with the research hypothesis, but because it shows that the data are inconsistent with other hypotheses (i.e., the null hypothesis). However, researchers sometimes reverse this logic when arguing for a negligible effect, showing only that the data are consistent with “no effect” and failing to show that the data are inconsistent with meaningful effects. When researchers argue that a variable has “no effect” because its confidence interval contains zero, they take no steps to rule out large, meaningful effects, making the empirical claim considerably less persuasive (Altman and Bland 1995; Gill 1999; Nickerson 2000).
But here’s a critical point: it’s impossible to reject every hypothesis except exactly no effect. Instead, the researcher must define a range of substantively “negligible” effects. The researcher can then reject the null hypothesis that the effect falls outside this range of negligible effects. However, this requires a substantive judgement about which effects are negligible and which are not.
Here’s what I wrote:
Researchers who wish to argue for a negligible effect must precisely define the set of effects that are deemed “negligible” as well as the set of effects that are “meaningful.” This requires defining the smallest substantively meaningful effect, which I denote as $m$. The definition must be debated by substantive scholars for any given context because the appropriate $m$ varies widely across applications.
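With a chosen $m$, the negligible-effect argument becomes a testable hypothesis: the null is that the effect is meaningful, and rejecting it supports a negligible effect.

$$
H_0: |\text{effect}| \geq m \quad \text{versus} \quad H_1: |\text{effect}| < m
$$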
Clark and Golder (2006) offer a nice example of this sort of hypothesis. I’ll refer you there and to Rainey (2014) for a complete discussion of their idea, but I’ll motivate it briefly here.
Explaining why a country might have only a few (i.e., two) parties, Clark and Golder write:
First, it could be the case that the demand for parties is low because there are few social cleavages. In this situation, there would be few parties whether the electoral institutions were permissive or not. Second, it could be the case that the electoral system is not permissive. In this situation, there would be a small number of parties even if the demand for political parties were high. Only a polity characterized by both a high degree of social heterogeneity and a highly permissive electoral system is expected to produce a large number of parties. (p. 683)
Thus, they expect that electoral institutions won’t matter in socially homogeneous systems. And they expect that social heterogeneity won’t matter in electoral systems that are not permissive.
Before computing their specific quantities of interest, let’s reproduce their regression model. Here’s their table that we’re trying to reproduce.
And here’s a reproduction of their estimates using the `cg2006` data from the {crdata} package on GitHub.^{1}
^{1} Run `?crdata::cg2006` for detailed documentation of this data set.
# load packages
library(sandwich)
library(modelsummary)
# install my data packages from github
devtools::install_github("carlislerainey/crdata") # only updates if newer version available
# load clark and golder's data set
cg <- crdata::cg2006
# reproduce their estimates
f <- enep ~ eneg*log(average_magnitude) + eneg*upper_tier + en_pres*proximity
fit <- lm(f, data = cg)
# regression table
modelsummary(fit,
vcov = ~ country, # cluster-robust SE; multiple observations per country
fmt = 2,
shape = term ~ model + statistic)
| | Est. | S.E. |
|---|---|---|
| (Intercept) | 2.92 | 0.35 |
| eneg | 0.11 | 0.14 |
| log(average_magnitude) | 0.08 | 0.23 |
| upper_tier | −0.06 | 0.03 |
| en_pres | 0.26 | 0.15 |
| proximity | −3.10 | 0.46 |
| eneg × log(average_magnitude) | 0.26 | 0.17 |
| eneg × upper_tier | 0.06 | 0.02 |
| en_pres × proximity | 0.68 | 0.23 |
Success!
They use `average_magnitude` to measure the permissiveness of the electoral system and `eneg` to measure social heterogeneity.
## `comparisons()` to compute the effects

Now let’s compute the two quantities of interest. Clark and Golder argue for two negligible effects, which I make really concrete below.
And comparing the U.S. and the U.K., I argue that the smallest substantively interesting effect is 0.62. In Rainey (2014), I made the plot below. I want to reproduce it with {marginaleffects}.
These differences (and the 90% CIs) are really easy to compute using {marginaleffects}!^{2}
^{2} I’m only doing Clark and Golder’s original results, not any of the robustness checks.
# load packages
library(marginaleffects)
# the smallest substantively interesting effect
m <- 0.62
# a data frame setting the values of the "other" variables
X_c <- data.frame(
eneg = 1.06, # low value
average_magnitude = 1, # low value
upper_tier = 0,
en_pres = 0,
proximity = 0
)
# compute the comparison for eneg and average magnitude
c <- comparisons(fit,
vcov = ~ country,
newdata = X_c,
variables = list("eneg" = c(1.06, 2.48), # low to high value
"average_magnitude" = c(1, 7)), # low to high value
conf_level = 0.90)
Now we can just plot the 90% CIs with `ggplot()` and check whether the entire interval falls inside the bounds.
# load packages
library(tidyverse)
# bind the comparisons together and plot
ggplot(c, aes(x = estimate,
xmin = conf.low,
xmax = conf.high,
y = term)) +
geom_vline(xintercept = c(-m, m), linetype = "dashed") +
geom_errorbarh() +
geom_point()
In this case, we conclude that social heterogeneity (`eneg`) has a negligible effect because the 90% CI contains only substantively negligible values. However, the 90% CI for district magnitude (`average_magnitude`) contains both substantively negligible and meaningful values, so we cannot reject the null hypothesis of a meaningful effect.
## `hypotheses()`

It’s then almost trivial to use the `hypotheses()` function to compute the TOST p-values.^{3} Note, though, that you must resupply the `vcov` and `conf_level` arguments; I might have expected them to carry forward from the `comparisons()` call above.
^{3} Note this warning from `?hypotheses()`: "Warning #2: For hypothesis tests on objects produced by the marginaleffects package, it is safer to use the hypothesis argument of the original function. Using hypotheses() may not work in certain environments, in lists, or when working programmatically with apply style functions."
You must resupply the `vcov` and `conf_level` arguments.
# hypothesis tests
# note: you must re-supply the vcov and conf_level arguments
hypotheses(c, equivalence = c(-m, m), conf_level = 0.90, vcov = ~ country)
Term Contrast Estimate Std. Error z Pr(>|z|) S 5.0 %
average_magnitude 7 - 1 0.696 0.175 3.973 <0.001 13.8 0.408
eneg 2.48 - 1.06 0.158 0.204 0.779 0.436 1.2 -0.176
95.0 % p (NonSup) p (NonInf) p (Equiv) eneg average_magnitude upper_tier
0.984 0.6671 <0.001 0.6671 1.06 1 0
0.493 0.0117 <0.001 0.0117 1.06 1 0
en_pres proximity
0 0
0 0
Columns: rowid, term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted, eneg, average_magnitude, upper_tier, en_pres, proximity, enep, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv
Type: response
This doesn’t print super-nicely into this document, so let’s extract the important parts.
# hypothesis tests, extracting the important pieces
hypotheses(c, equivalence = c(-m, m), conf_level = 0.90, vcov = ~ country) %>%
select(term, contrast, estimate, conf.low, conf.high, p.value.equiv)
Term Contrast Estimate CI low CI high p (Equiv)
average_magnitude 7 - 1 0.696 0.408 0.984 0.6671
eneg 2.48 - 1.06 0.158 -0.176 0.493 0.0117
Columns: term, contrast, estimate, conf.low, conf.high, p.value.equiv
Checking that the 90% CIs fall within the bounds created by the smallest substantively meaningful effect is equivalent to checking whether the TOST p-value (i.e., the `p (Equiv)` column) is less than 0.05, so our conclusions are (and must be) identical.
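To see this equivalence concretely, here is a minimal base-R sketch using the `eneg` numbers printed above (estimate 0.158, SE 0.204, m = 0.62). The one-sided z-test construction of the TOST p-value below is my illustration of the logic, not the internals of `hypotheses()`, so small numerical differences from the printed output are expected.

```r
# inputs copied from the output above
est <- 0.158
se  <- 0.204
m   <- 0.62

# rule 1: does the 90% CI fall entirely inside (-m, m)?
z90 <- qnorm(0.95)  # about 1.645
ci_inside <- (est - z90*se > -m) & (est + z90*se < m)

# rule 2: is the TOST p-value less than 0.05?
p_noninf <- pnorm((est + m)/se, lower.tail = FALSE)  # H0: effect <= -m
p_nonsup <- pnorm((m - est)/se, lower.tail = FALSE)  # H0: effect >= m
p_equiv  <- max(p_noninf, p_nonsup)                  # TOST p-value
tost_rejects <- p_equiv < 0.05

# the two rules agree
c(ci_inside, tost_rejects)
```

Both rules return `TRUE` here, matching the conclusion for `eneg` above.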
For more on effective arguments for no effect, see the following:
Remember to supply `conf_level = 0.90`.^{4}

^{4} I would make a similar point about one-sided tests as well, though it's less exact there, because a one-sided test corresponds to a one-sided 95% CI.
I try to write every day.^{1}
^{1} See lots of caveats below!
By “writing,” I mean “pushing the paper closest to publication just a little bit closer.” I want to think about the next step on the journey to the published paper and do it. According to this loose definition of writing, it might involve data collection, data analysis, creating slides, or even writing and polishing text. It might involve organization, planning, or learning new skills. It excludes any tasks that aren’t necessary to complete the project.
By “everyday,” I mean at least every weekday, probably at the same time every day and probably first thing in the morning. For better or worse, academics are evaluated by their research productivity.
President Eisenhower famously characterized his duties: “I have two kinds of problems, the urgent and the important. The urgent are not important, and the important are never urgent.”
Following Eisenhower’s Box, we might assign degrees of urgency and importance to tasks in academic tasks. In graduate school, I had teaching responsibilities, RA duties, readings for seminars, homework for methods classes, preliminary exams, and administrative tasks. All of these tasks are important. They must be completed. They must be completed well. Yet I was evaluated largely on my papers. As a faculty member, little has changed.
Writing is important, but writing never quite becomes urgent. It’s easy to put off writing to prepare a lecture (or write a blog post).
Robert Boice studied academic productivity carefully. A couple of his studies provide some evidence for my strategy to write every day.
First, he assessed how early-career academics spend their time. The figure below shows the results. Notice that these faculty spend more time in committee meetings (2 hours) than writing (1.5 hours).
Second, Boice conducted an experiment to assess the effect of writing strategies.
Boice randomly divided 27 academics into three groups:^{2}
^{2} This is a small sample, but it supports my claim so it’s okay.
The figure below shows that a regular writing routine increases production of both pages and ideas. Notice that the spontaneous writers barely produced more ideas and pages than the group trying to avoid writing.
I find these results compelling, but note that Helen Sword urges some caution.
Everyone is different, and my own approach has evolved over time. Here are the key ingredients (for me):
^{3} Two hours works really well for me. My productivity degrades quickly after two hours, so it’s best to move on to less taxing tasks. But it takes me a while to get warmed up, so I need to keep moving while I’ve got momentum.
^{4} An unfortunate outcome is not writing and being stressed about not writing.
^{5} Just 15 minutes is great. This slot is perfect for proof-reading.
I admit that I deviate from the strategy above (and not always intentionally). But I’ve been at this long enough to know that a regular routine works really well for me.
It’s my view that PhD students should write every day, from the first day of their first semester (remember that I have a broad definition of “write”). Most students need some time before they’re ready to jump into the technical details a solo project, but there are always things to do.
If you can’t identify a specific task to work on, here are some resources to help you brainstorm.
I like Firth’s logistic regression model (Firth 1993). I talk about that in Rainey and McCaskey (2021) and this Twitter thread. Kosmidis and Firth (2021) offer an excellent, recent follow-up as well.
I’ll refer you to the papers for a careful discussion of the benefits, but Firth’s penalty reduces the bias and variance of the logit coefficients.
In this post, I want to compare the brglm2 and logistf packages. Which fits logistic regression models with Firth’s penalty the fastest?
These packages both fit the models almost instantly, so there is no practical difference when fitting just one model. But in large Monte Carlo simulations (or perhaps bootstraps), small differences can add up to a substantial time difference.
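As a rough back-of-envelope, here is how per-fit differences scale up. The timings are approximate median values taken from the benchmark results reported below; the 100,000-fit scenario is hypothetical.

```r
# per-fit differences only matter at scale
fits <- 100000  # e.g., a large Monte Carlo simulation

us_default <- 3482  # logistf(), default settings (microseconds per fit)
us_fast    <-  571  # logistf() with pl = FALSE (microseconds per fit)

# total time in minutes for each method
c(default = fits * us_default / 1e6 / 60,
  fast    = fits * us_fast    / 1e6 / 60)
```

That's roughly six minutes versus about one minute: negligible for a single fit, but noticeable across a full simulation study.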
Here, I benchmark the two packages for fitting logistic regression models with Firth's penalty in a small sample; the results might not generalize to a larger sample. The data set comes from Weisiger (2014) (see `?crdata::weisiger2014`). It has only 35 observations.
You can find the benchmarking code as a GitHub Gist.
I benchmark four methods here.
1. A regular `glm()` logit model.
2. `glm()` with `method = brglm2::brglmFit`.
3. `logistf()` using the default settings.
4. `logistf()` with the argument `pl = FALSE`. This argument is important because it skips hypothesis testing using profile likelihoods, which are computationally costly.

# install crdata package to get weisiger2014 data set
remotes::install_github("carlislerainey/crdata")
# load packages
library(tidyverse)
library(brglm2)
library(logistf)
library(microbenchmark)
# load data
weis <- crdata::weisiger2014
# rescale weisiger2014 explanatory variables using arm::rescale()
rs_weis <- weis %>%
mutate(across(polity_conq:coord, arm::rescale))
# create functions to fit models
f <- resist ~ polity_conq + lndist + terrain + soldperterr + gdppc2 + coord
f1 <- function() {
glm(f, data = rs_weis, family = "binomial")
}
f2 <- function() {
glm(f, data = rs_weis, family = "binomial", method = brglmFit)
}
f3 <- function() {
logistf(f, data = rs_weis)
}
f4 <- function() {
logistf(f, data = rs_weis, pl = FALSE)
}
# do benchmarking
bm <- microbenchmark("regular glm()" = f1(),
"brglm2" = f2(),
"logistf (default)" = f3(),
"logistf (w/ pl = FALSE)" = f4(),
times = 100)
Warning in microbenchmark(`regular glm()` = f1(), brglm2 = f2(), `logistf
(default)` = f3(), : less accurate nanosecond times to avoid potential integer
overflows
print(bm)
Unit: microseconds
expr min lq mean median uq
regular glm() 500.651 547.0425 591.5304 560.0805 584.4550
brglm2 2469.266 2546.9405 2735.6438 2591.5690 2672.3800
logistf (default) 3303.329 3432.6635 3942.9122 3481.6585 3558.1235
logistf (w/ pl = FALSE) 513.197 553.3155 594.6205 570.8225 608.9115
max neval
2713.708 100
9735.983 100
17672.148 100
1698.343 100
In short, logistf is slower than brglm2, but only because it computes the profile-likelihood p-values by default. Once we skip those calculations using `pl = FALSE`, logistf is much faster, essentially matching `glm()` on average. Its worst-case time is even better than `glm()`'s, because `glm()` has the occasional really slow computation.
Here’s a plot showing the computation times of the four fits. Remember that all of these are computed practically instantly, so it only makes a difference when the fits are done thousands of times, like in a Monte Carlo simulation.
# plot times
bm %>%
group_by(expr) %>%
  summarize(avg_time = mean(time)*1e-6) %>% # convert nanoseconds to milliseconds
ggplot(aes(x = fct_rev(expr), y = avg_time)) +
geom_col() +
labs(x = "Method",
y = "Avg. Time (in milliseconds)") +
coord_flip()
The models return slightly different estimates. Maybe they are using slightly different convergence tolerances. I didn’t investigate this beyond noticing it.
cbind(coef(f2()), coef(f4()))
[,1] [,2]
(Intercept) -0.4771934 -0.4771935
polity_conq -2.2771109 -2.2771117
lndist 3.4020241 3.4020215
terrain 1.1018696 1.1018713
soldperterr -0.5952096 -0.5952110
gdppc2 -1.1542010 -1.1541998
coord 3.0514481 3.0514473
Ioannis Kosmidis made me aware of two things.
Here’s the info on my machine.
system("sysctl -n machdep.cpu.brand_string", intern = TRUE)
[1] "Apple M2 Max"
This post turned out to be somewhat popular, so I’ve written up a more formal, careful description of the idea in a full length paper. The initial, early version is on GitHub.
I’ve wrapped up the argument that you should pursue statistical power in your experiments. In sum, you should do it for you (not a future Reviewer 2) and you shouldn’t see confidence intervals nestle consistently against zero.
In this post, I’d like to develop the intuition for power calculations, two helpful guidelines, and one implication.
Main takeaway: You need the ratio of the true effect to the standard error to be more than 3.64.
When you run one experiment, you realize one of many possible patterns of randomization. This particular realization produces a single estimate of the treatment effect from a distribution of possible estimates. The distribution of possibilities is called a “sampling distribution.”
Thus, we can think of the estimate as a random variable and its distribution as the sampling distribution. The sampling distribution is key to everything I do in this post, so let’s spend some time with it.
Let’s imagine that we did the exact same study 50 times. Let’s say that we computed a difference-in-means in dollars ($) donated.^{1} I refer to this difference-in-means as the estimated treatment effect. It is the estimate of the average treatment effect (in $). This estimate will vary across the many possible patterns of randomization because each pattern puts different respondents in the treatment and control group.
^{1} I just want an easy-to-use unit here, and dollars meets that criterion. Other units work fine, too.
We can visualize this with gganimate. We can imagine each iteration of the study as producing a particular estimate. We continue to repeat the study and collect the estimates. Eventually, we can produce a histogram from this collection of estimates. This histogram represents the sampling distribution and is fundamental to the calculations that follow. The figure below shows how we might collect the points into a histogram.
library(tidyverse)
library(gganimate)
library(magick)
# gif pars
duration <- 24 # must be even
fps <- 25
nframes <- duration*fps
scale <- 2.5
width <- 8
height <- 6
res <- 125
# study parameters
true_effect <- 1
se <- 0.4
# number of times to repeat the study
n_studies <- 50 # nframes
# create a data frame of confidence intervals
ests <- tibble(study_id = 1:n_studies,
est = c(rnorm(n_studies, true_effect, se))) %>%
mutate(reject_null = ifelse(est - 1.64*se > 0, "Yes", "No"))
# add two things to the data frame of confidence intervals
# 1. an initial row with study_id = 1 and est = NA so that
# the plot starts empty (gganimate would start with the
# first observation in place otherwise).
# 2. a group variable that defines the row. This is the same
# as the study_id, except the dummy row from (1) and the
# actual first row have different groups.
animate_data <- bind_rows(
tibble(study_id = 1, est = NA), # study_id = 1, est = NA
ests # actual cis
) %>%
mutate(group = 1:n())
split_animate_data <- animate_data %>% # group (row index)
split(.$group) %>%
accumulate(~ bind_rows(.x, .y)) %>%
bind_rows(.id = "frame") %>%
mutate(frame = as.integer(frame))
se_lines <- tribble(
~se_, ~label, ~chance, ~ch_loc_, # trailing _ means not rescaled to study se
0, "True Effect", NA, NA,
1, "+1 SE", scales::percent(pnorm(1) - pnorm(0), accuracy = 1), 0.5,
2, "+2 SE", scales::percent(pnorm(2) - pnorm(1), accuracy = 1), 1.5,
3, "+3 SE", scales::percent(pnorm(3) - pnorm(2), accuracy = 1), 2.5,
-1, "-1 SE", scales::percent(pnorm(0) - pnorm(-1), accuracy = 1), -0.5,
-2, "-2 SE", scales::percent(pnorm(-1) - pnorm(-2), accuracy = 1), -1.5,
-3, "-3 SE", scales::percent(pnorm(-2) - pnorm(-3), accuracy = 1), -2.5,
) %>%
mutate(ch_loc = ch_loc_*se + true_effect,
se = se_*se + true_effect)
# start with a ggplot
gg1 <- ggplot(animate_data, aes(x = est,
y = study_id,
group = group)) +
geom_vline(data = se_lines, aes(xintercept = se,
color = -dnorm(se_)), linetype = "dashed") +
geom_label(data = se_lines, aes(x = se, y = n_studies + 2, label = label, group = NULL, color = -dnorm(se_))) +
geom_text(data = se_lines, aes(x = ch_loc, y = 4, label = chance, group = NULL)) +
geom_point(aes(color = -dnorm((est- true_effect)/se)),
size = 3) +
geom_rug(sides = "b",
aes(x = est,
color = -dnorm((est- true_effect)/se)),
alpha = 0.5,
length = unit(0.025, "npc")) +
theme_bw() +
theme(panel.grid.minor.y = element_blank()) +
labs(x = "Estimate of Effect",
y = "Study Number") +
theme(legend.position = "none")
# add dynamics to the plot
gg1_anim <- gg1 +
transition_states(states = group) +
# how points enter
enter_drift(y_mod = 10) +
enter_grow() +
enter_fade() +
# how points exit/remain
exit_fade(alpha = 0.5) +
exit_shrink(size = 1) +
shadow_mark(alpha = 0.5)
gg1_gif<- animate(gg1_anim, nframes = nframes, duration = duration, width = width, height = height, units = "in", res = res)
anim_save("gg1.gif")
gg1_mgif <- image_read("gg1.gif")
## plot 2: histogram
# start with a ggplot
gg2 <- ggplot(split_animate_data, aes(x = est, group = frame)) +
geom_histogram(binwidth = se, boundary = true_effect, fill = "grey") +
geom_vline(data = se_lines, aes(xintercept = se,
color = -dnorm(se_)), linetype = "dashed") +
geom_label(data = se_lines, aes(x = se, y = Inf, label = label, group = NULL, color = -dnorm(se_)), vjust = 1.5) +
geom_label(data = se_lines, aes(x = ch_loc, y = 0, label = chance, group = NULL), vjust = -1) +
#geom_density(linewidth = 2) +
geom_rug(sides = "b",
aes(x = est,
color = -dnorm((est- true_effect)/se)),
alpha = 0.5,
length = unit(0.025, "npc")) +
theme_bw() +
labs(x = "Estimate of Effect",
y = "Count") +
theme(legend.position = "none")
# add dynamics to the plot
gg2_anim <- gg2 +
transition_states(states = frame)
gg2_gif<- animate(gg2_anim, nframes = nframes, duration = duration, width = width, height = height, units = "in", res = res)
anim_save("gg2.gif")
gg2_mgif <- image_read("gg2.gif")
new_gif <- image_append(c(gg1_mgif[1], gg2_mgif[1]), stack = FALSE)
for(i in 2:nframes){
combined_gif <- image_append(c(gg1_mgif[i], gg2_mgif[i]), stack = FALSE)
new_gif <- c(new_gif, combined_gif)
}
new_gif
Before the study, we can predict two features of this sampling distribution.
For our purposes, then, we can describe the sampling distribution as normally distributed with an assumed mean and SD. The mean is the assumed true effect and the SD is the well-predicted SE.
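A quick simulation illustrates this description. All of the numbers below (the true effect, the noise SD, and the sample size) are hypothetical; the analytic SE formula assumes a simple two-condition difference-in-means with equal variances.

```r
# simulate the sampling distribution of a difference-in-means
set.seed(1234)

n_per_arm   <- 100   # respondents per condition (hypothetical)
true_effect <- 1     # assumed true effect, in dollars
sigma       <- 3     # sd of donations within each condition

# analytic prediction for the SE of the difference-in-means
predicted_se <- sigma * sqrt(2 / n_per_arm)

# repeat the study many times and collect the estimates
ests <- replicate(10000, {
  y0 <- rnorm(n_per_arm, mean = 0,           sd = sigma)  # control group
  y1 <- rnorm(n_per_arm, mean = true_effect, sd = sigma)  # treatment group
  mean(y1) - mean(y0)                                     # difference-in-means
})

# mean of the estimates approximates the true effect;
# their sd approximates the predicted SE
c(mean(ests), sd(ests), predicted_se)
```

The simulated mean and SD land close to the assumed true effect and the analytic SE, which is exactly the "predictable" structure of the sampling distribution described above.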
This is the key claim: In order to build power into your experiment, you must build certain properties into the sampling distribution.
The design of your experiment will not affect the normality of the sampling distribution, but it will change the true effect and the SE. Changing the true effect and the SE will change the power.
I’m leaving aside how to predict the standard error of the experiment or choose the true effect. This post is about the target standard error and true effect.
Predicting the standard error is a mechanical process mixed with a little guesswork.^{2} Choosing a true effect is a mostly substantive decision.^{3}
^{2} I’ll suggest two methods I like. First, use . I like to confirm this estimate with a small pilot. I sample observations from this pilot data set, run the full analysis, and confirm that my prediction is close, and make any needed adjustments.
^{3} Cyrus Samii likes to use the MME; I like a conservatively chosen guess of what the effect actually is.
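The pilot-based prediction in the footnote above can be sketched in a few lines. All numbers here are made up, and the calculation assumes the standard error shrinks with the square root of the per-condition sample size (the usual behavior for a difference-in-means).

```r
# hypothetical sketch: predict the planned study's SE from a pilot
se_pilot  <- 0.8   # estimated SE of the difference-in-means in the pilot
n_pilot   <- 100   # respondents per condition in the pilot
n_planned <- 1000  # respondents per condition in the planned study

# SEs shrink with the square root of the sample size, so scale accordingly
predicted_se <- se_pilot * sqrt(n_pilot / n_planned)
predicted_se
```

With ten times the sample size, the predicted SE is the pilot SE divided by sqrt(10), about 0.25 here.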
But instead of talking about a target power, I like to talk about a target standard error (given a true effect) or a target true effect (given a standard error)—power is just not a very intuitive quantity.
To understand how the true effect and the SE relate to power, we need to introduce the confidence interval.
I like to use 90% confidence intervals to test hypotheses (see Rainey 2014 and Rainey 2015). In short, 90% confidence intervals correspond to one-tailed tests with size 0.05 and equivalence tests with size 0.05. The formula for a 90% CI is the estimate ± 1.64 × SE. That is, we put "arms" around our estimate, one to the left and another to the right. Each arm is 1.64 standard errors long.
I’ll assume we have a one-sided research hypothesis that suggests a positive effect. If the lower bound () is less than zero, then we fail to reject the null hypothesis. If this lower bound is greater than zero, then we reject the null hypothesis.^{4}
^{4} This is equivalent to a z-test in a standard hypothesis testing framework using a p-value of less than 0.05 as the threshold for rejecting the null hypothesis.
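Here's a tiny numeric illustration of the rule, with a hypothetical estimate and standard error:

```r
# hypothetical study results
est <- 1.0  # estimated treatment effect
se  <- 0.4  # standard error

# arms of the 90% CI
lower <- est - 1.64*se
upper <- est + 1.64*se

# one-sided test: reject the null of no positive effect
# only if the lower bound clears zero
reject_null <- lower > 0
c(lower, upper, reject_null)
```

Here the lower bound is 0.344, so this particular (hypothetical) study rejects the null.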
This focus on testing rather than estimation changes the nature of the sampling distribution. Rather than an estimate along a continuous range, we get a binary outcome: either (1) reject the null hypothesis or (2) fail to reject the null hypothesis.
But the sampling distribution of estimates and the associated outcomes of tests are closely related. In particular, the logic of the test implies that *if the estimate falls less than 1.64 SEs above zero, we cannot reject the null*.
We can reconstruct the figure above using this logic. Rather than plot the points continuously along the x-axis, we can color the points (and now error bars) according to whether the lower bound falls above zero or not. And we can use a bar plot showing the number of rejections and non-rejections.
library(tidyverse)
library(gganimate)
library(magick)
# gif pars
duration <- 24 # must be even
fps <- 25
nframes <- duration*fps
scale <- 2.5
width <- 8
height <- 6
res <- 125
# study parameters
true_effect <- 1
se <- 0.4
# number of times to repeat the study
n_studies <- 50 # nframes
# create a data frame of confidence intervals
ests <- tibble(study_id = 1:n_studies,
est = c(rnorm(n_studies, true_effect, se))) %>%
mutate(reject_null = ifelse(est - 1.64*se > 0, "Yes", "No"),
lwr = est - 1.64*se,
upr = est + 1.64*se)
# add two things to the data frame of confidence intervals
# 1. an initial row with study_id = 1 and est = NA so that
# the plot starts empty (gganimate would start with the
# first observation in place otherwise).
# 2. a group variable that defines the row. This is the same
# as the study_id, except the dummy row from (1) and the
# actual first row have different groups.
animate_data <- bind_rows(
tibble(study_id = 1, est = NA), # study_id = 1, est = NA
ests # actual cis
) %>%
mutate(group = 1:n())
split_animate_data <- animate_data %>% # group (row index)
split(.$group) %>%
accumulate(~ bind_rows(.x, .y)) %>%
bind_rows(.id = "frame") %>%
mutate(frame = as.integer(frame))
se_lines <- tribble(
~se_, ~label, ~chance, ~ch_loc_, # trailing _ means not rescaled to study se
0, "True Effect", NA, NA,
1, "+1 SE", scales::percent(pnorm(1) - pnorm(0), accuracy = 1), 0.5,
2, "+2 SE", scales::percent(pnorm(2) - pnorm(1), accuracy = 1), 1.5,
3, "+3 SE", scales::percent(pnorm(3) - pnorm(2), accuracy = 1), 2.5,
-1, "-1 SE", scales::percent(pnorm(0) - pnorm(-1), accuracy = 1), -0.5,
-2, "-2 SE", scales::percent(pnorm(-1) - pnorm(-2), accuracy = 1), -1.5,
-3, "-3 SE", scales::percent(pnorm(-2) - pnorm(-3), accuracy = 1), -2.5,
) %>%
mutate(ch_loc = ch_loc_*se + true_effect,
se = se_*se + true_effect)
# start with a ggplot
gg1 <- ggplot(animate_data, aes(x = est,
y = study_id,
group = group)) +
geom_vline(xintercept = 1.64*se) +
annotate("label", x = 1.64*se, y = 5, label = "1.64 SEs") +
geom_errorbarh(height = 0, aes(xmin = lwr, xmax = upr, color = reject_null)) +
geom_point(aes(color = reject_null),
size = 3) +
geom_rug(sides = "b",
aes(x = est,
color = reject_null),
alpha = 0.5,
length = unit(0.025, "npc")) +
theme_bw() +
theme(panel.grid.minor.y = element_blank()) +
labs(x = "Estimate of Effect",
y = "Study Number") +
theme(legend.position = "none") +
scale_color_manual(values = c("Yes" = "#1b9e77", "No" = "#d95f02"))
# add dynamics to the plot
gg1_anim <- gg1 +
transition_states(states = group) +
# how points enter
enter_drift(y_mod = 10) +
enter_grow() +
enter_fade() +
# how points exit/remain
exit_fade(alpha = 0.5) +
exit_shrink(size = 1) +
shadow_mark(alpha = 0.5)
gg1_gif<- animate(gg1_anim, nframes = nframes, duration = duration, width = width, height = height, units = "in", res = res)
anim_save("gg1.gif")
gg1_mgif <- image_read("gg1.gif")
## plot 2: histogram
# start with a ggplot
gg2 <- ggplot(split_animate_data, aes(x = reject_null, fill = reject_null), na.rm = TRUE) +
geom_bar(na.rm = TRUE) +
theme_bw() +
labs(x = "Reject Null",
y = "Count") +
theme(legend.position = "none") +
scale_x_discrete(na.translate = FALSE) +
scale_fill_manual(values = c("Yes" = "#1b9e77", "No" = "#d95f02"))
# add dynamics to the plot
gg2_anim <- gg2 +
transition_states(states = frame)
gg2_gif<- animate(gg2_anim, nframes = nframes, duration = duration, width = width, height = height, units = "in", res = res)
anim_save("gg2.gif")
gg2_mgif <- image_read("gg2.gif")
new_gif <- image_append(c(gg1_mgif[1], gg2_mgif[1]), stack = FALSE)
for(i in 2:nframes){
combined_gif <- image_append(c(gg1_mgif[i], gg2_mgif[i]), stack = FALSE)
new_gif <- c(new_gif, combined_gif)
}
new_gif
The key to building statistical power into your experiment is to get “almost all” of the sampling distribution above 1.64 standard errors above zero. The portion of the sampling distribution that falls below 1.64 standard errors above zero does not allow the researcher to reject the null.
library(tidyverse)
# study parameters
true_effect <- 1
se <- 0.4
se_lines <- tribble(
~se_, ~label, ~chance, ~ch_loc_, # trailing _ means not rescaled to study se
0, "True Effect", NA, NA,
1, "+1 SE", scales::percent(pnorm(1) - pnorm(0), accuracy = 1), 0.5,
2, "+2 SE", scales::percent(pnorm(2) - pnorm(1), accuracy = 1), 1.5,
3, "+3 SE", scales::percent(pnorm(3) - pnorm(2), accuracy = 1), 2.5,
-1, "-1 SE", scales::percent(pnorm(0) - pnorm(-1), accuracy = 1), -0.5,
-2, "-2 SE", scales::percent(pnorm(-1) - pnorm(-2), accuracy = 1), -1.5,
-3, "-3 SE", scales::percent(pnorm(-2) - pnorm(-3), accuracy = 1), -2.5,
) %>%
mutate(ch_loc = ch_loc_*se + true_effect,
se = se_*se + true_effect)
x <- rnorm(500000, mean = true_effect, sd = se)
df <- data.frame(x)
bg_alpha <- 0.3
ggplot() +
geom_histogram(data = df,
aes(x = x, y = after_stat(density)), binwidth = se, boundary = true_effect, fill = "grey", alpha = bg_alpha) +
geom_vline(data = se_lines, aes(xintercept = se,
color = -dnorm(se_)), linetype = "dashed", alpha = bg_alpha) +
geom_label(data = se_lines, aes(x = se, y = Inf, label = label, group = NULL), vjust = 1.5, color = alpha('black', bg_alpha)) +
geom_label(data = se_lines, aes(x = ch_loc, y = 0, label = chance, group = NULL), vjust = -1, color = alpha('black', bg_alpha)) +
geom_vline(xintercept = 0) +
geom_function(fun = dnorm, args = list(mean = true_effect, sd = se), size = 1) +
theme_bw() +
labs(x = "Estimate of Effect",
y = "Density") +
geom_area(data = tibble(x = seq(1.64*se, 3*se + true_effect, by = 0.1)), aes(x = x),
stat = "function", fun = dnorm, args = list(mean = true_effect, sd = se),
fill = "#d95f02", alpha = 0.1, xlim = c(1.64*se, 3*se + true_effect)) +
annotate("label", x = 1.3, y = .1, label = "fraction rejected", color = "#d95f02", size = 6) +
annotate("segment", x = 1.64*se, xend = 1.64*se, y = 0, yend = dnorm(1.64*se, mean = true_effect, sd = se), color = "#1b9e77", size = 1) +
annotate("label", x = 1.64*se, y = dnorm(1.64*se, mean = true_effect, sd = se)/2, label = "1.64 SEs above zero", color = "#1b9e77") +
annotate("segment", x = 0, xend = 1.64*se,
y = dnorm(1.64*se, mean = true_effect, sd = se),
yend = dnorm(1.64*se, mean = true_effect, sd = se),
color = "#7570b3", size = 1) +
annotate("label", x = 0.5*1.64*se, y = dnorm(1.64*se, mean = true_effect, sd = se), label = "width of 90% CI", color = "#7570b3") +
theme(legend.position = "none") +
xlim(-1, 3)
You’ll remember that “almost all” of the normal distribution falls within two standard errors of its mean. So as a starting point, let’s use this rule: get the sampling distribution two standard errors above 1.64 standard errors above zero. We can just add 1.64 and 2 together to get a sampling distribution 3.64 standard errors above zero. That is, we need . If the true effect is larger than 3.64 standard errors, then the confidence interval will “rarely” overlap zero (about 2% of the time).
library(tidyverse)
# study parameters
true_effect <- 1
se <- 1/3.64  # true effect of 1 sits 3.64 SEs above zero
x <- rnorm(500000, mean = true_effect, sd = se)
df <- data.frame(x)
ggplot() +
geom_histogram(data = df,
aes(x = x, y = after_stat(density)), binwidth = se, boundary = true_effect, fill = "grey", alpha = bg_alpha) +
geom_vline(data = se_lines, aes(xintercept = se,
color = -dnorm(se_)), linetype = "dashed", alpha = bg_alpha) +
geom_label(data = se_lines, aes(x = se, y = Inf, label = label, group = NULL), vjust = 1.5, color = alpha('black', bg_alpha)) +
geom_label(data = se_lines, aes(x = ch_loc, y = 0, label = chance, group = NULL), vjust = -1, color = alpha('black', bg_alpha)) +
geom_vline(xintercept = 0) +
geom_function(fun = dnorm, args = list(mean = true_effect, sd = se), size = 1) +
theme_bw() +
labs(x = "Estimate of Effect",
y = "Density") +
geom_area(data = tibble(x = seq(1.64*se, 3*se + true_effect, by = 0.1)), aes(x = x),
stat = "function", fun = dnorm, args = list(mean = true_effect, sd = se),
fill = "#d95f02", alpha = 0.1, xlim = c(1.64*se, 3*se + true_effect)) +
#annotate("label", x = 1.3, y = .1, label = "fraction rejected", color = "#d95f02", size = 6) +
annotate("segment", x = true_effect, xend = true_effect, y = 0, yend = dnorm(true_effect, mean = true_effect, sd = se), color = "#d95f02", size = 1) +
annotate("label", x = true_effect, y = dnorm(true_effect, mean = true_effect, sd = se)/2, label = "true effect", color = "#d95f02") +
annotate("segment", x = 1.64*se, xend = 1.64*se, y = 0, yend = .75, color = "#1b9e77", size = 1) +
annotate("label", x = 1.64*se, y = .75, label = "1.64 SEs above zero", color = "#1b9e77") +
annotate("segment", x = 0, xend = 1.64*se,
y = .125,
yend = .125,
color = "#7570b3", size = 1,
lineend = "round", linejoin = "round", arrow = arrow(length = unit(0.1, "inches"), ends = "both")) +
annotate("label", x = 0.5*1.64*se, y = .125, label = "1.64 SEs", color = "#7570b3", size = 4) +
annotate("segment", x = true_effect, xend = 1.64*se,
y = .125,
yend = .125,
color = "black", size = 1,
lineend = "round", linejoin = "round", arrow = arrow(length = unit(0.1, "inches"), ends = "both")) +
annotate("label", x = 1.64*se + (true_effect - 1.64*se)/2, y = .125
, label = "ideally 2 SEs", color = "black", size = 4) +
annotate("segment", x = true_effect, xend = 0,
y = 1, yend = 1,
color = "black", size = 1,
lineend = "round", linejoin = "round", arrow = arrow(length = unit(0.1, "inches"), ends = "both")) +
annotate("label", x = true_effect/2, y = 1, label = "ideally 3.64 SEs", color = "black", size = 4) +
theme(legend.position = "none") +
xlim(-0.1, 2.0)
There are a few ways to write the ratio: as a target ratio (effect/SE ≥ 3.64), as a target standard error given the effect (SE ≤ effect/3.64), or as a target effect given the standard error (effect ≥ 3.64 × SE). These are all the same goal. I prefer the first, but they are all equivalent targets for power.
I like this ratio because it points to strategies to increase power. You can do two things:
This ratio gets you thinking not what your power is, but how to increase it. Again, power isn’t a task for the experimenter, it’s the task.^{5} Kane (2023)^{6} suggests several ways researchers can increase the true effect or decrease the standard error.
^{5} I’m leaving aside the question of how to predict the standard error or choose a true effect for a paricular design.
^{6} Kane, John V. 2023. “More Than Meets the ITT: A Guide for Investigating Null Results.” APSA Preprints. doi: 10.33774/apsa-2023-h4p0q-v2.
I’ll highlight a few examples here. To increase the treatment effect, you can:
To decrease the standard error, you can:
If you get the ratio to 3.64, then you'll have 98% power. This is much higher than the cutoff of 80% I hear suggested most often. If you want 80% power rather than 98% power, you can change the 3.64 to 2.48.^{7} But notice that 3.64 and 2.48 are not very different. This has an important implication.
^{7} Find this by adding 1.64 to `-qnorm(0.2)`.
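We can verify the 2.48 figure directly. Because the target ratio is 1.64 + qnorm(0.8), the implied power works out to exactly 80%:

```r
# target ratio for 80% power: 1.64 (to clear zero) plus
# -qnorm(0.2) (so only 20% of the sampling distribution misses)
target_ratio <- 1.64 + (-qnorm(0.2))
target_ratio  # about 2.48

# power when the true effect sits target_ratio SEs above zero
pnorm(target_ratio - 1.64)  # exactly 0.80
```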
A researcher can lower the risk of a failed study from about 1 in 5 (80% power) to about 1 in 50 (98% power) by increasing the ratio from 2.48 to 3.64. This requires increasing the treatment effect by about 50% or shrinking the standard error by about 33%. Stated differently, if you are careless and let your treatment effect fall by 33% or let your standard error increase by 50%, then your risk of a failed study increases 10-fold! Small differences can matter a lot.
The implication is this: when you are on the cusp of a well-powered study (about 80% power), then small increases in the ratio have a disproportionate impact on your risk of a failed study.
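A quick calculation makes this comparison concrete. Working on the standardized scale (SE = 1), the risk of a failed study is the fraction of the sampling distribution that falls below 1.64:

```r
# risk of a failed study (failing to reject) at a given effect/SE ratio
risk <- function(ratio) pnorm(1.64, mean = ratio, sd = 1)

risk(2.48)  # about 0.20, or roughly 1 in 5
risk(3.64)  # about 0.02, or roughly 1 in 50
risk(2.48) / risk(3.64)  # roughly an order-of-magnitude difference
```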
Following the logic of the picture above, we need to compute the fraction of the sampling distribution that lies above 1.64 SEs. The `pnorm()` function returns normal probabilities below specific thresholds. By supplying the argument `lower.tail = FALSE`, we can get the probability of falling above a specific threshold.
true_effect <- 1.00
se <- 0.4
# compute power
pnorm(1.64*se, # want fraction above 1.64 SE
mean = true_effect, # mean of sampling distribution
sd = se, # sd of sampling distribution
lower.tail = FALSE) # fraction above, not below
[1] 0.8051055
Statistical power is an abstract quantity. It's easy to understand what it means, but harder to think about how to manipulate it. I explain why a target ratio of 2.48 is equivalent to a target power of 80%. But you want to "almost always" reject the null, so you should shoot for a ratio of 3.64. Hopefully these guidelines help you build a bit of actionable intuition about statistical power.
When I give students the formula for confidence intervals, I find that they don't have a sharp concept of how those intervals work, even when I explain the components of the formula well.
Even though they understand—seemingly very well—that the point estimate is noisy, they struggle to conceptualize that a confidence interval can often include values on the incorrect side of zero. Stated differently, they have a hard time understanding how a hypothesis test can fail to reject the null (when the null is incorrect). Their intuition suggests that a “test” should give you the correct answer.
Because their instincts are wrong, I want to undermine their trust in hypothesis tests. I want them to feel the riskiness of poorly-powered experiments that consistently nestle confidence intervals right up against zero. I want them ready and eager to work hard to avoid that risk—to make sure they have adequate statistical power.
To help undermine their confidence in confidence intervals, I like three exercises that mimic a test with 80% power. In each case, we are assuming that we have formulated a correct hypothesis and designed an excellent experiment with 80% power.
These exercises make it clear that failed studies are real possibilities. Hopefully, students clearly see and “feel” the takeaway:
The hypothesis test is no oracle. It will not consistently reject the null (even when the null is wrong) unless you supply overwhelming evidence. In experimental design, that’s not a task, that’s the task.
Below, I walk through the plots I use in parts 2 and 3.
First, let’s set up an experiment with 80% power. We’ll suppose that the true effect is 1 and the standard error is 0.4. To obtain 80% power, you can use the guideline that the standard error should be about 40% of the true effect (equivalently, the true effect divided by 2.48). I’ll show where these guidelines come from in a future post. With the true effect and standard error in hand, we can compute the long-run properties of the experiment.
library(tidyverse)
# study parameters
true_effect <- 1
se <- 0.40 # 1/2.48
# identify effects of interest
eoi <- tribble(
~Effect, ~Description,
true_effect, "True Effect (known in this exercise)"
)
# compute quantities of interest regarding power
eoi %>%
mutate(Power = 1 - pnorm(1.64*se, Effect, se),
Power = scales::percent(Power, accuracy = 1),
`Type S` = retrodesign::type_s(Effect, se)$type_s,
`Type S` = scales::number(`Type S`, accuracy = 0.01),
`Type M` = retrodesign::type_m(Effect, se)$type_m,
`Type M` = scales::number(`Type M`, accuracy = 0.01),
Effect = scales::number(Effect, accuracy = 0.01)) %>%
pivot_longer(cols = Effect:`Type M`) %>%
kableExtra::kable(format = "markdown", col.names = NULL)
| | |
|---|---|
| Effect | 1.00 |
| Description | True Effect (known in this exercise) |
| Power | 81% |
| Type S | 0.00 |
| Type M | 1.20 |
But these long-run properties remain a bit abstract and seem “distant” from the practical implications of our particular experiment. This is where the three exercises above come in handy.
The second exercise is a static plot. The plot shows 50 intervals; we can see that several include zero. These are wasted opportunities. We set out to reject a null hypothesis that the effect was less than or equal to zero, and we failed to do that.
# number of studies to simulate
n_studies <- 50
# a data frame of studies
set.seed(123)
ests <- tibble(study_id = 1:n_studies,
est = c(rnorm(n_studies, true_effect, se))) %>%
mutate(reject_null = ifelse(est - 1.64*se > 0, "Yes", "No"))
# plot the confidence intervals for each study
gg <- ggplot(ests, aes(x = est,
y = study_id,
xmin = est - 1.64*se,
xmax = est + 1.64*se,
color = reject_null)) +
geom_vline(xintercept = true_effect,
linetype = "dotted") +
geom_vline(xintercept = 0) +
geom_point() +
geom_errorbarh(height = 0) +
geom_rug(sides = "b",
aes(x = est - 1.64*se, color = NULL),
alpha = 0.5,
length = unit(0.025, "npc")) +
scale_color_manual(values = c("No" = "#d95f02", "Yes" = "#1b9e77")) + # from https://colorbrewer2.org/#type=qualitative&scheme=Dark2&n=3
theme_bw() +
theme(panel.grid = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank()) +
labs(x = "Estimate and 90% Confidence Interval",
y = "Study ID",
color = "Reject Null?",
caption = "Rug shows distribution of lower bound of confidence interval.")
# print plot
print(gg)
The static plot above is nice, but it doesn’t convey an appropriate sense of randomness, the feeling that “you don’t know what you’ll get this time.” To convey this feeling, I like to add dynamics. In particular, I like a confidence interval that moves while you’re unsure whether it will cover zero. This conveys the sense that “something bad might happen” with each draw. Even though only about 10 of the 50 confidence intervals will cross zero, you feel the danger on all 50 simulations.
In designing this plot, two features are in tension: making the plot look exactly the way I want, and keeping the code simple enough to understand and maintain.
In the past, I’ve prioritized making the plot look exactly like I want. But there are concrete downsides to hacky solutions—using functions in ways they weren’t intended. The code becomes brittle and difficult to change. For examples that clearly convey the ideas, see Presidential Plinko and this “raindrop” plot. These are great ways to convey the randomness, but producing these plots requires some “tedious” coding.
With this code, I tried to illustrate the concept well while avoiding hacky solutions to minor problems. This makes the code easier to understand, change, and update.
First, let’s start by creating the data frame to plot. We need to make two small changes to the data frame above. These are both hacks, but worth it.
# load packages
library(gganimate)
# add two things to the data frame of confidence intervals
# 1. an initial row with study_id = 1 and est = NA so that
# the plot starts empty (gganimate would start with the
# first observation in place otherwise).
# 2. a group variable that defines the row. This is the same
# as the study_id, except the dummy row from (1) and the
# actual first row have different groups.
animate_data <- bind_rows(
tibble(study_id = 1, est = NA), # study_id = 1, est = NA
ests # combine dummy row with ests data frame from above
) %>%
mutate(group = 1:n()) # group (row index)
Now let’s plot the confidence intervals much like above, except with expanded scales to give some more room for movement.
# same ggplot, except three annotated changes
gg_exp <- ggplot(animate_data, aes(x = est,
y = study_id,
xmin = est - 1.64*se,
xmax = est + 1.64*se,
color = reject_null,
group = group)) +
geom_vline(xintercept = true_effect,
linetype = "dotted") +
geom_vline(xintercept = 0) +
geom_point() +
geom_errorbarh(height = 0) +
geom_rug(sides = "b",
aes(x = est - 1.64*se, color = NULL),
alpha = 0.5,
length = unit(0.025, "npc")) +
scale_color_manual(values = c("No" = "#d95f02", "Yes" = "#1b9e77")) +
theme_bw() +
theme(panel.grid = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank()) +
labs(x = "Estimate and 90% Confidence Interval",
y = "Study ID",
color = "Reject Null?",
caption = "Rug shows distribution of lower bound of confidence interval.
Solid line shows zero.
Dotted line shows the true effect.") +
# new scales here
scale_x_continuous(expand = expansion(add = c(0.6, 1))) +
scale_y_continuous(expand = expansion(add = 2))
# print plot
print(gg_exp)
Note the new group aesthetic: gganimate uses it to track each element across frames, so I define group explicitly. Otherwise, the dummy row from step (1) and the actual first row would be treated as the same element.
Now let’s add the animation. I like the confidence intervals to be shooting toward zero. This gives the feeling that “it might cross!” and makes the fear of a failed study real for each simulation.
# add dynamics to the plot
anim <- gg_exp +
transition_states(states = group) +
# how points enter
enter_drift(x_mod = 2) +
enter_grow() +
enter_recolor(color = "black") +
ease_aes(color = "exponential-in") +
# how points exit/remain
exit_fade(alpha = 0.3) +
shadow_mark(alpha = 0.3)
# make magic happen!
animate(anim, duration = n_studies, fps = 10,
height = 6, width = 8, units = "in", res = 150)
The transition_states() function from gganimate creates an animation transitioning between different states of the data. In this case, the argument states is set to group. This makes each group (each row in the data frame animate_data) appear in the plot, one at a time.

The enter_drift() function describes how new data points enter the frame. x_mod = 2 makes new data enter by drifting along the x-axis, starting 2 units to the right of their ending position.

The enter_grow() function also describes how new data points enter the frame. In this case, it makes them grow from a size of 0 to their ending size.

The enter_recolor() function again describes how new data points enter the frame. Here, it makes the points and CIs change color from black to their ending color as they appear. I want the final color to be a bit of a surprise, so I start them as black.

The ease_aes() function determines how the aesthetics of the points change over the transitions. color = "exponential-in" means the color change follows an exponential easing curve: slow at the start of the transition and fast at the end, which maintains the surprise of the result.

The exit_fade() function describes how data points leave the frame. alpha = 0.3 specifies that points fade to 30% transparency when they exit (when the next data point reaches its position).

The shadow_mark() function keeps past data points in the frame. alpha = 0.3 sets the transparency of the shadow marks to 30% to match the exit_fade() transparency.
The dynamic plot above is a good tool to help students understand that confidence intervals can quite easily include values on the wrong side of zero. Stated differently, it’s easy for a hypothesis test to fail to reject the null (when the null is wrong). Their intuition suggests that the test should tell you the correct answer.
The exercises above (appropriately) undermine their trust in hypothesis tests. I want them to feel the riskiness of poorly-powered experiments that consistently nestle confidence intervals right up against zero. I want them ready to work hard to avoid that risk—to make sure they have statistical power. Hopefully, they see that carefully building power into your experiment isn’t a task, it’s the task of experimental design.
As a concluding example, here’s the same dynamic plot for a study with about 98% power. Notice how “safe” this study feels compared to the one above with 80% power. I think this plot does a good job of translating probabilities into an appropriate sense of “danger.”
# study parameters
se <- 0.27 # 1/3.64
# identify effects of interest
eoi <- tribble(
~Effect, ~Description,
true_effect, "True Effect (known in this exercise)"
)
# compute quantities of interest regarding power
eoi %>%
mutate(Power = 1 - pnorm(1.64*se, Effect, se),
Power = scales::percent(Power, accuracy = 1),
`Type S` = retrodesign::type_s(Effect, se)$type_s,
`Type S` = scales::number(`Type S`, accuracy = 0.01),
`Type M` = retrodesign::type_m(Effect, se)$type_m,
`Type M` = scales::number(`Type M`, accuracy = 0.01),
Effect = scales::number(Effect, accuracy = 0.01)) %>%
pivot_longer(cols = Effect:`Type M`) %>%
kableExtra::kable(format = "markdown", col.names = NULL)
# a data frame of studies
ests <- tibble(study_id = 1:n_studies,
est = c(rnorm(n_studies, true_effect, se))) %>%
mutate(reject_null = ifelse(est - 1.64*se > 0, "Yes", "No"))
animate_data <- bind_rows(
tibble(study_id = 1, est = NA), # study_id = 1, est = NA
ests # combine dummy row with ests data frame from above
) %>%
mutate(group = 1:n()) # group (row index)
# same ggplot, except three annotated changes
gg_exp <- ggplot(animate_data, aes(x = est,
y = study_id,
xmin = est - 1.64*se,
xmax = est + 1.64*se,
color = reject_null,
group = group)) +
geom_vline(xintercept = true_effect,
linetype = "dotted") +
geom_vline(xintercept = 0) +
geom_point() +
geom_errorbarh(height = 0) +
geom_rug(sides = "b",
aes(x = est - 1.64*se, color = NULL),
alpha = 0.5,
length = unit(0.025, "npc")) +
scale_color_manual(values = c("No" = "#d95f02", "Yes" = "#1b9e77")) +
theme_bw() +
theme(panel.grid = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank()) +
labs(x = "Estimate and 90% Confidence Interval",
y = "Study ID",
color = "Reject Null?",
caption = "Rug shows distribution of lower bound of confidence interval.
Solid line shows zero.
Dotted line shows the true effect.") +
# new scales here
scale_x_continuous(expand = expansion(add = c(0.6, 1))) +
scale_y_continuous(expand = expansion(add = 2))
# add dynamics to the plot
anim <- gg_exp +
transition_states(states = group) +
# how points enter
enter_drift(x_mod = 2) +
enter_grow() +
enter_recolor(color = "black") +
ease_aes(color = "exponential-in") +
# how points exit/remain
exit_fade(alpha = 0.3) +
shadow_mark(alpha = 0.3)
# make magic happen!
animate(anim, duration = n_studies, fps = 10,
height = 6, width = 8, units = "in", res = 150)
| | |
|---|---|
| Effect | 1.00 |
| Description | True Effect (known in this exercise) |
| Power | 98% |
| Type S | 0.00 |
| Type M | 1.02 |
In this post, I address confidence intervals that are nestled right up against zero.^{1} These intervals indicate that an estimate is “barely” significant. I want to be clear: “barely significant” is still significant, so you should still reject the null hypothesis.^{2}
^{1} This is the second post in a series. In my previous post, I mentioned two new papers that have me thinking about power: Arel-Bundock et al.’s “Quantitative Political Science Research Is Greatly Underpowered” and Kane’s “More Than Meets the ITT: A Guide for Investigating Null Results”. Go check out that post and those papers if you haven’t.
^{2} I’m focusing on confidence intervals here because inference from confidence intervals is a bit more intuitive (see Rainey 2014 and Rainey 2015). In the cases I discuss, checking whether the p-value is less than 0.05 and checking whether the confidence interval excludes zero are equivalent.
But I want to address a feeling that can come along with a confidence interval nestled right up against zero. A feeling of victory. It seems like a perfectly designed study. You rejected the null and collected just enough data to do it.
But instead, it should feel like a near-miss. Like an accident narrowly avoided. A confidence interval nestled right up against zero indicates that one of two things has happened: either you were (1) unlucky or (2) under-powered.
Because “unlucky” is always a possibility, we can’t learn much from a particular confidence interval, but we can learn a lot from a literature. A literature with well-powered studies produces confidence intervals that often fall far from zero. A well-powered literature does not produce confidence intervals that consistently nestle up against zero. Under-powered studies, though, do tend to produce confidence intervals that nestle right up against zero.
I’m going to explore the behavior of confidence intervals with a little simulation. In this simulation, I’m going to assert a standard error rather than create the standard error endogenously through sample size, etc. I use a true effect size of 0.5 and standard errors of 0.5, 0.3, 0.2, and 0.15 to create studies with 25%, 50%, 80%, and 95% power, respectively.^{3} I think of 80% as “minimally-powered” and 95% as “well-powered.”
^{3} I’m ignoring how to choose the true effect, estimate the standard error, and compute power. For now, I’m placing all this behind the curtain. See Daniël Lakens’ book [Improving Your Statistical Inferences] for discussion (h/t Bermond Scoggins).
I’m using a one-sided test (hypothesizing a positive effect), so I’ll use 90% confidence intervals with arms that are 1.64 standard errors wide. Let’s simulate some estimates from each of our four studies and compute their confidence intervals. I simulate 5,000 confidence intervals to explore below.
# load packages
library(tidyverse)
# create a parameter for the true effect
true_effect <- 0.5 # just assumed by me
# create a data frame of standard errors (with approximate power)
se_df <- tribble(
~se, ~pwr,
0.5, "about 25% power",
0.3, "about 50% power",
0.2, "about 80% power",
0.15, "about 95% power"
)
# create function to simulate estimates for each standard error
simulate_estimates <- function(se, pwr) {
tibble(
est = rnorm(n_cis, mean = true_effect, sd = se),
se = se,
pwr = pwr
)
}
# simulate the estimates, compute the confidence intervals, and wrangle
n_cis <- 5000 # the number of cis to create
ci_df <- se_df %>%
# simulate estimates
pmap_dfr(simulate_estimates) %>%
# compute confidence intervals
mutate(lwr = est - 1.64*se,
upr = est + 1.64*se) %>%
# summarize the location of the confidence interval
mutate(result = case_when(lwr < 0 ~ "Not significant",
lwr < se ~ "Nestled against zero",
lwr >= se ~ "Not nestled against zero"))
Now let’s quickly confirm my power calculations by computing the proportion of confidence intervals to the right of zero. These are about right. In a later post, I’ll describe how I think about computing these quantities.
# confirm power calculations
ci_df %>%
group_by(se, pwr) %>%
summarize(sim_pwr = 1 - mean(result == "Not significant"),
sim_pwr = scales::percent(sim_pwr, accuracy = 1)) %>%
select(SE = se,
Power = pwr,
`Percent Significant` = sim_pwr) %>%
kableExtra::kable(format = "markdown")
| SE | Power | Percent Significant |
|---|---|---|
| 0.15 | about 95% power | 95% |
| 0.20 | about 80% power | 80% |
| 0.30 | about 50% power | 51% |
| 0.50 | about 25% power | 25% |
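These simulated percentages can also be checked analytically. For a one-sided 5%-level test, power is pnorm(true_effect/se - 1.64). This quick sketch is mine, not part of the original code, but it uses the same parameters as the simulation:

```r
# analytic power for each standard error in the simulation
true_effect <- 0.5
se <- c(0.50, 0.30, 0.20, 0.15)

# reject when the estimate exceeds 1.64 SEs, so power is the normal
# probability of the standardized effect exceeding that threshold
round(pnorm(true_effect / se - 1.64), 2)
# approximately 0.26, 0.51, 0.81, 0.95 -- matching the table above
```

The small discrepancies (e.g., 81% analytic versus 80% simulated) are just Monte Carlo error from the 5,000 simulated intervals.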
Now let’s see what these confidence intervals look like. 5,000 intervals are too many to plot, so I sample 25 from each study—but I apply the statistical significance filter first. This mimics the publication process and makes the plots a little easier to compare. My argument doesn’t depend on this filter, though.
I plotted these 100 intervals below^{4}
^{4} 4 studies x 25 simulated intervals per study = 100 intervals.
There are three important vertical lines in these plots: the solid line at zero, the dashed line at the true effect, and the dotted line one standard error above zero (the boundary for “nestled”).
The intervals are green when the lower bound of the 90% confidence interval falls within one standard error of zero—that’s my definition of “nestled up against zero.” The intervals are orange when the lower bound falls further than one standard error above zero.
Notice how low-powered studies tend to nestle their confidence intervals right up against zero. Almost all of the confidence intervals from the study with 25% power are nestled right up against zero. Very few of the confidence intervals from the study with 95% power are nestled up against zero.
Again, you should apply this standard to a literature, not to a particular study, because even well-powered studies sometimes produce confidence intervals that nestle up against zero. But when you start to see confidence intervals consistently falling close to zero, you should start to suspect that the literature relies on under-powered studies and that the estimates in that literature are inflated due to Type M errors (Gelman and Carlin 2014).
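A small simulation makes this Type M inflation concrete. This sketch is my own; the parameters match the 25%-power study above. Among estimates that clear the significance bar, the average is far larger than the true effect:

```r
# exaggeration ratio (Type M error) for the 25%-power study
set.seed(123)
true_effect <- 0.5
se <- 0.5  # about 25% power

# simulate 5,000 estimates and keep only the significant ones
est <- rnorm(5000, mean = true_effect, sd = se)
significant <- est[est - 1.64 * se > 0]  # one-sided significance filter

# significant estimates more than double the true effect on average
mean(significant) / true_effect  # roughly 2.2 to 2.3
```

This is the same quantity that retrodesign::type_m() computes analytically: low power guarantees that the estimates surviving the significance filter exaggerate the truth.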
gg_df <- ci_df %>%
filter(lwr > 0) %>% # apply significance filter
# sample 25 intervals (from those that are significant)
group_by(se, pwr) %>%
sample_n(25) %>%
# create id (ordered by estimate value)
group_by(se, pwr) %>%
arrange(est) %>%
mutate(ci_id = 1:n())
ggplot(gg_df, aes(x = est, xmin = lwr, xmax = upr, y = ci_id,
color = result)) +
facet_wrap(vars(pwr), ncol = 1, scales = "free_x") +
geom_vline(data = se_df, aes(xintercept = se), linetype = "dotted") +
geom_vline(xintercept = 0) +
geom_vline(xintercept = true_effect, linetype = "dashed") +
geom_errorbarh(height = 0) +
geom_point() +
scale_color_brewer(type = "qual", palette = 2) +
theme_bw() +
labs(x = "Estimate and 90% CI",
y = NULL,
color = "Result")
We can also plot the density of the lower bounds of these 5,000 intervals. This approach shows the “nestling” most clearly. The plots below show that the lower bounds of confidence intervals tend to nestle close to zero when the power is low, and lie further from zero when the power is high.
gg_df <- ci_df %>%
filter(lwr > 0) # apply significance filter
ggplot(gg_df, aes(x = lwr)) +
facet_wrap(vars(pwr), scales = "free_x") +
geom_density(fill = "grey50") +
geom_vline(data = se_df, aes(xintercept = se), linetype = "dotted") +
theme_bw() +
labs(x = "Location of Lower Bound of 90% CI",
y = "Density",
color = "Power")
Lastly, I compute the percent of confidence intervals that are nestled right up against zero. For a well-powered study with 95% power, only about 1 in 5 confidence intervals nestle up against zero. For a poorly-powered study with 25% power, about 4 in 5 confidence intervals nestle up against zero (among those that are above zero). The table below shows the remaining frequencies.
ci_df %>%
group_by(se, pwr, result) %>%
summarize(frac = n()/n_cis, .groups = "drop") %>%
pivot_wider(names_from = result, values_from = frac) %>%
mutate(`Nestled, given significant` = `Nestled against zero`/(1 - `Not significant`),
`Not nestled, given significant` = `Not nestled against zero`/(1 - `Not significant`)) %>%
select(SE = se,
Power = pwr,
`Not significant`,
`Nestled against zero`,
`Not nestled against zero`,
`Nestled, given significant`,
`Not nestled, given significant`) %>%
mutate(across(`Not significant`:`Not nestled, given significant`, ~ scales::percent(., accuracy = 1))) %>%
kableExtra::kable()
| SE | Power | Not significant | Nestled against zero | Not nestled against zero | Nestled, given significant | Not nestled, given significant |
|---|---|---|---|---|---|---|
| 0.15 | about 95% power | 5% | 20% | 75% | 21% | 79% |
| 0.20 | about 80% power | 20% | 37% | 44% | 46% | 54% |
| 0.30 | about 50% power | 49% | 34% | 17% | 67% | 33% |
| 0.50 | about 25% power | 75% | 21% | 5% | 81% | 19% |
In this post, I address confidence intervals that are nestled right up against zero. These intervals can suggest a perfectly powered study—not too much, not too little. But instead, a confidence interval nestled right up against zero indicates that one of two things has happened: either you were (1) unlucky or (2) under-powered.
Because “unlucky” is always a possibility, we can’t learn much from a particular confidence interval, but we can learn a lot from a literature. A literature with well-powered studies produces confidence intervals that often fall far from zero. A well-powered literature does not produce confidence intervals that consistently nestle up against zero. Under-powered studies, though, do tend to produce confidence intervals that nestle right up against zero.
There’s been some really good work on statistical power lately. I’ll point you to two great papers.
I’ve been long interested in statistical power (see Rainey 2014^{1} and Rainey 2015^{2}), and these new papers have me thinking even more about the importance of power.
^{1} Rainey, Carlisle. 2014. “Arguing for a Negligible Effect.” American Journal of Political Science 58(4): 1083-1091.
^{2} McCaskey, Kelly and Carlisle Rainey. 2015. “Substantive Importance and the Veil of Statistical Significance.” Statistics, Politics, and Policy 6(1-2): 77-96.
In this post, I argue that statistical power isn’t something ancillary. Power is primary. I also argue that power isn’t something you, the researcher, build to satisfy an especially cranky Reviewer 2; it’s something you do for yourself, to make sure that your study succeeds.
In the hypothesis testing framework, you consider two hypotheses: the null hypothesis and the alternative hypothesis.
The hypothesis test is all about arguing against the null hypothesis (leaving the alternative as the only remaining possibility). You will (try to) show that your data would be “unusual” if the null hypothesis were correct.^{3}
^{3} When hypothesizing about the average treatment effect (ATE), this can take a variety of forms. The form doesn’t really matter.
If the data would NOT be unusual under the null hypothesis, then you do not reject the null hypothesis.
A failure to reject means that the data “would not be unusual under the null hypothesis.” This does not imply that you should conclude the data are only consistent with the null. Indeed, there is a sharp asymmetry in hypothesis testing. I describe this in my 2014 AJPS:
Political scientists commonly interpret a lack of statistical significance (i.e., a failure to reject the null) as evidence for a negligible effect (Gill 1999), but this approach acts as a broken compass… If the sample size is too small, the researcher often concludes that the effect is negligible even though the data are also consistent with large, meaningful effects. This occurs because the small sample leads to a large confidence interval, which is likely to contain both “no effect” and large effects.
Gill (1999)^{4} describes this more forcefully:
^{4} Gill, Jeff. 1999. “The Insignificance of Null Hypothesis Significance Testing.” Political Research Quarterly 52(3): 647-674.
We teach graduate students to be very careful when describing the occurrence of not rejecting the null hypothesis. This is because failing to reject the null hypothesis does not rule out an infinite number of other competing research hypotheses. Null hypothesis significance testing is asymmetric: if the test statistic is sufficiently atypical given the null hypothesis then the null hypothesis is rejected, but if the test statistic is insufficiently atypical given the null hypothesis then the null hypothesis is not accepted. This is a double standard: H1 is held innocent until proven guilty and Ho is held guilty until proven innocent (Rozeboom 1960)…
There are two problems that develop as a result of asymmetry. The first is a misinterpretation of the asymmetry to assert that finding a non-statistically significant difference or effect is evidence that it is equal to zero or nearly zero. Regarding the impact of this acceptance error Schmidt (1996: 126) asserts that this: “belief held by many researchers is the most devastating of all to the research enterprise.” This acceptance of the null hypothesis is damaging because it inhibits the exploration of competing research hypotheses. The second problem pertains to the correct interpretation of failing to reject the null hypotheses. Failing to reject the null hypothesis essentially provides almost no information about the state of the world. It simply means that given the evidence at hand one cannot make an assertion about some relationship: all you can conclude is that you can’t conclude that the null was false (Cohen 1962).
There are many incorrect, but somewhat innocent interpretations of p-values. Interpreting a lack of statistical significance as evidence for the null is incorrect and wildly misleading in many cases.
A non-statistically significant difference is not evidence that an effect is equal to zero or nearly zero. Interpreting a non-statistically significant effect otherwise is “devastating.”
If you cannot draw a conclusion, then what exactly has happened? A failure to reject will not be an “error,” because you won’t make a strong claim that the research hypothesis is wrong. Instead, you will simply admit that you failed to uncover evidence against the null. Failing to uncover evidence isn’t an error.
Indeed, Jones and Tukey (2000)^{5} write:
^{5} Jones, Lyle V., and John W. Tukey. 2000. “A Sensible Formulation of the Significance Test.” Psychological Methods 5(4): 411-414.
A conclusion is in error only when it is “a reversal,” when it asserts one direction while the (unknown) truth is the other direction. Asserting that the direction is not yet established may constitute a wasted opportunity, but it is not an error.
Failing to uncover evidence isn’t an “error,” it is a “wasted effort.”
This is worth emphasizing in a different way. Tests are not magical tools that tell you which hypothesis is correct. Instead, tests summarize the evidence against the null. There are two critical pieces to “evidence against the null”: (1) the amount of evidence and (2) whether the evidence is against the null. If you buy your own argument that the null is false (surely you do!), then (2) is taken care of. Only the amount of evidence remains, and you, the researcher, choose the amount of evidence to supply.
This perspective helps motivate power calculations. By their design, tests control the error rate in certain situations (when the null is correct). You do not need to worry about Type I errors. First, the test controls the error rate under the null. Second, you are pretty sure the null is wrong (see your theory section).
The hypothesis test takes care of the Type I error rate. If you choose a properly-sized test, you don’t need to worry about those errors any more.
If you aren’t worried about Type I errors, what are you worried about? The only thing left to worry about is wasting your time and money. Statistical power is the chance of not wasting your time and money.
Power isn’t a secondary quantity that you compute for thoroughness or in anticipation of a comment from Reviewer 2. Power is something that you build for yourself.
Statisticians talk a lot about Type I errors because that’s their contribution. It’s your job to bring the power.
And importantly, power is under your control. Kane provides a rich summary of ways to increase the power of your experiment. At a minimum, you have brute force control through sample size.
Power isn’t an ancillary concern; it’s the entire game from the very beginning of the planning stage. It should be at the forefront of the researcher’s mind from the start. You should want the power as high as possible.^{6}
^{6} I hear that 80% is the standard, but I’m pretty uncomfortable spending dozens of hours and thousands of dollars on a study with a 1 in 5 chance of wasting my time. I want that chance as close to zero as I can get it. I want power close to 100%. 99% power and 80% power might both seem “high” or “acceptable,” but they are not the same: 80% power means 1 in 5 studies fail; 99% power means 1 in 100 studies fail.
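To put numbers on this, invert the power formula: for a one-sided 5%-level test, the required effect-to-standard-error ratio is 1.64 + qnorm(power). This sketch is mine, but it is consistent with the guidelines used elsewhere in this series:

```r
# effect/SE ratio required for a target power (one-sided 5%-level test)
required_ratio <- function(power) 1.64 + qnorm(power)

required_ratio(0.80)  # about 2.48: the minimal-power guideline
required_ratio(0.99)  # about 3.97: studies that almost never fail
```

Because the standard error shrinks with the square root of the sample size, moving from 80% to 99% power requires roughly (3.97/2.48)^2, or about 2.6 times, as many respondents.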
You have to supply the test with overwhelming evidence to consistently reject the null. Careful power calculations help you make sure you succeed in this war against the null.
Power isn’t about Type S and M errors (Gelman and Carlin 2014)^{7}. Power is about you protecting yourself from a failed study. And that seems like a protection worth pursuing carefully.^{8}
^{7} Gelman, Andrew, and John Carlin. 2014. “Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors.” Perspectives on Psychological Science 9(6): 641-651.
^{8} Of course it’s also about Type S and M errors, but those are discipline-level concerns. I’m talking about your incentives as a researcher.
Here are the takeaways:
The hypothesis test is no oracle. It will not consistently reject the null (even when the null is wrong) unless you supply overwhelming evidence. In experimental design, that’s not a task, that’s the task.