# Power, Part III: The Rule of 3.64 for Statistical Power

Simple, Actionable Guidelines Using Statistical Power’s Key Ratio

statistical power
hypothesis tests
power analysis
methodology
confidence intervals
computing
R
Statistical power is an abstract quantity. It’s easy to understand what it means, but harder to think about how to manipulate it. To make power more intuitive, I encourage students to think about the ratio of the true effect to the standard error (i.e., signal-to-noise). I explain why a target ratio of 2.48 is equivalent to a target power of 80% (and why you should really shoot for 3.64!). These guidelines help you build actionable intuition about statistical power.
Author

Carlisle Rainey

Published

June 12, 2023

A New Paper

This post turned out to be somewhat popular, so I’ve written up a more formal, careful description of the idea in a full length paper. The initial, early version is on GitHub.

## Background

I’ve wrapped up the argument that you should pursue statistical power in your experiments. In sum, you should do it for you (not a future Reviewer 2) and you shouldn’t see confidence intervals nestle consistently against zero.

In this post, I’d like to develop the intuition for power calculations, two helpful guidelines, and one implication.

Main takeaway: You need the ratio of the true effect and the standard error to be more than 3.64.

## Starting Point: The Sampling Distribution

When you run one experiment, you realize one of many possible patterns of randomization. This particular realization produces a single estimate of the treatment effect from a distribution of possible estimates. The distribution of possibilities is called a “sampling distribution.”

Thus, we can think of the estimate as a random variable and its distribution as the sampling distribution. The sampling distribution is key to everything I do in this post, so let’s spend some time with it.

Let’s imagine that we did the exact same study 50 times. Let’s say that we computed a difference-in-means in dollars ($) donated.1 I refer to this difference-in-means as the estimated treatment effect. It is the estimate of the average treatment effect (in$). This estimate will vary across the many possible patterns of randomization because each pattern puts different respondents in the treatment and control group.

• 1 I just want an easy-to-use unit here, and dollars meets that criteria. Other units work fine, too.

• We can visualize this with ggnaminate. We can imagine each iteration of the study as producing a particular estimate. We continue to repeat the study and collect the estimates. Eventually, we can produce a histogram from this collection of estimates. This histogram represents the sampling distribution and is fundamental to the calculations that follow. The figure belows shows how we might collect the points into a histogram.

Code
library(tidyverse)
library(gganimate)
library(magick)

# gif pars
duration <- 24 # must be even
fps <- 25
nframes <- duration*fps
scale <- 2.5
width <- 8
height <- 6
res <- 125

# study parameters
true_effect <- 1
se <- 0.4

# number of times to repeat the study
n_studies <- 50 # nframes

# create a data frame of confidence intervals
ests <- tibble(study_id = 1:n_studies,
est = c(rnorm(n_studies, true_effect, se))) %>%
mutate(reject_null = ifelse(est - 1.64*se > 0, "Yes", "No"))

# add two things to the data frame of confidence intervals
# 1. an initial row with study_id = 1 and est = NA so that
#    first observation in place otherwise).
# 2. a group variable that defines the row. This is the same
#    as the study_id, except the dummy row from (1) and the
#    actual first row have different groups.
animate_data <- bind_rows(
tibble(study_id = 1, est = NA),  # study_id = 1, est = NA
ests                             # actual cis
) %>%
mutate(group = 1:n())

split_animate_data <- animate_data %>%        # group (row index)
split(.$group) %>% accumulate(~ bind_rows(.x, .y)) %>% bind_rows(.id = "frame") %>% mutate(frame = as.integer(frame)) se_lines <- tribble( ~se_, ~label, ~chance, ~ch_loc_, # trailing _ means not rescaeld to study se 0, "True Effect", NA, NA, 1, "+1 SE", scales::percent(pnorm(1) - pnorm(0), accuracy = 1), 0.5, 2, "+2 SE", scales::percent(pnorm(2) - pnorm(1), accuracy = 1), 1.5, 3, "+3 SE", scales::percent(pnorm(3) - pnorm(2), accuracy = 1), 2.5, -1, "-1 SE", scales::percent(pnorm(0) - pnorm(-1), accuracy = 1), -0.5, -2, "-2 SE", scales::percent(pnorm(-1) - pnorm(-2), accuracy = 1), -1.5, -3, "-3 SE", scales::percent(pnorm(-2) - pnorm(-3), accuracy = 1), -2.5, ) %>% mutate(ch_loc = ch_loc_*se + true_effect, se = se_*se + true_effect) # start with a ggplot gg1 <- ggplot(animate_data, aes(x = est, y = study_id, group = group)) + geom_vline(data = se_lines, aes(xintercept = se, color = -dnorm(se_)), linetype = "dashed") + geom_label(data = se_lines, aes(x = se, y = n_studies + 2, label = label, group = NULL, color = -dnorm(se_))) + geom_text(data = se_lines, aes(x = ch_loc, y = 4, label = chance, group = NULL)) + geom_point(aes(color = -dnorm((est- true_effect)/se)), size = 3) + geom_rug(sides = "b", aes(x = est, color = -dnorm((est- true_effect)/se)), alpha = 0.5, length = unit(0.025, "npc")) + theme_bw() + theme(panel.grid.minor.y = element_blank()) + labs(x = "Estimate of Effect", y = "Study Number") + theme(legend.position = "none") # add dyamics to the plot gg1_anim <- gg1 + transition_states(states = group) + # how points enter enter_drift(y_mod = 10) + enter_grow() + enter_fade() + # how points exit/remain exit_fade(alpha = 0.5) + exit_shrink(size = 1) + shadow_mark(alpha = 0.5) gg1_gif<- animate(gg1_anim, nframes = nframes, duration = duration, width = width, height = height, units = "in", res = res) anim_save("gg1.gif") gg1_mgif <- image_read("gg1.gif") ## plot 2: histogram # start with a ggplot gg2 <- ggplot(split_animate_data, aes(x = est, group = frame)) + geom_histogram(binwidth = se, boundary = true_effect, fill = "grey") + geom_vline(data = se_lines, aes(xintercept = se, color = -dnorm(se_)), linetype = "dashed") + geom_label(data = se_lines, aes(x = se, y = Inf, label = label, group = NULL, color = -dnorm(se_)), vjust = 1.5) + geom_label(data = se_lines, aes(x = ch_loc, y = 0, label = chance, group = NULL), vjust = -1) + #geom_density(linewidth = 2) + geom_rug(sides = "b", aes(x = est, color = -dnorm((est- true_effect)/se)), alpha = 0.5, length = unit(0.025, "npc")) + theme_bw() + labs(x = "Estimate of Effect", y = "Count") + theme(legend.position = "none") # add dyamics to the plot gg2_anim <- gg2 + transition_states(states = frame) gg2_gif<- animate(gg2_anim, nframes = nframes, duration = duration, width = width, height = height, units = "in", res = res) anim_save("gg2.gif") gg2_mgif <- image_read("gg2.gif") new_gif <- image_append(c(gg1_mgif[1], gg2_mgif[1]), stack = FALSE) for(i in 2:nframes){ combined_gif <- image_append(c(gg1_mgif[i], gg2_mgif[i]), stack = FALSE) new_gif <- c(new_gif, combined_gif) } new_gif Before the study, we can predict two features of this sampling distribution. 1. First, it will usually have a bell-shaped, normal distribution. 2. Second, we can predict the standard deviation of this distribution with good accuracy before conducting a single study and excellent accuracy after just one study. We call the standard deviation of the sampling distribution the standard error or SE. For our purposes, then, we can describe the sampling distribution as normally distributed with an assumed mean and SD. The mean is the assumed true effect and the SD is the well-predicted SE. This is the key claim: In order to build power into your experiment, you must build certain properties into the sampling distribution. The design of your experiment will not affect the normality of the sampling distribution, but it will change the true effect and the SE. Changing the true effect and the SE will change the power. ## Thinking about True Effect and Standard Error as Targets I’m leaving aside how to predict the standard error of the experiment or choose the true effect. This post is about the target standard error and true effect. Predicting the standard error is a mechanical process mixed with a little guesswork.2 Choose a true effect is a mostly substantive decision.3 • 2 I’ll suggest two methods I like. First, use $$\text{predicted SE} = \frac{\text{SD of same outcome in different data set}}{0.5 \sqrt{\text{sample size}}}$$. I like to confirm this estimate with a small pilot. I sample observations from this pilot data set, run the full analysis, and confirm that my prediction is close, and make any needed adjustments. • 3 Cyrus Samii likes to use the MME, I like a conservatively choosen guess of what the effect actually is. • But instead of talking about a target power, I like to talk about a target standard error (given a true effect) or a target true effect (given a standard error)—power is just not a very intuitive quantity. To understand how the true effect and the SE relate to power, we need to introduce the confidence interval. ## Testing with Confidence Intervals I like to use 90% confidence intervals to test hypotheses (see Rainey 2014 and Rainey 2015). In short, 90% confidence intervals correspond to one-tailed tests with size 0.05 and equivalence tests with size 0.05.] The formula for a 90% CI is $$\text{estimate} \pm 1.64\text{SE}$$. That is, we put “arms” around our estimate—one to the left and another to the right. Each arm is 1.64 standard errors long. I’ll assume we have a one-sided research hypothesis that suggests a positive effect. If the lower bound ($$\text{estimate} - 1.64\text{SE}$$) is less than zero, then we fail to reject the null hypothesis. If this lower bound is greater than zero, then we reject the null hypothesis.4 • 4 This is equivalent to a z-test in a standard hypothesis testing framework using a p-value of less than 0.05 as the threshold for rejecting the null hypothesis. • This focus on testing rather than estimation changes the nature of the sampling distribution. Rather than an estimate along a continuous range, we get a binary outcome: either (1) reject the null hypothesis or (2) fail to reject the null hypothesis. But the sampling distribution of estimates and the associated outcomes of tests are closely related. In particular, the logic of the test implies that *if the estimate falls less than 1.64 SEs above zero, we cannot reject the null. We can reconstruct the figure above using this logic. Rather than plot the points continuously along the x-axis, we can color the points (and now error bars) according to whether the lower bound falls above zero or not. And we can use a bar plot showing the number of rejections and non-rejections. Code library(tidyverse) library(gganimate) library(magick) # gif pars duration <- 24 # must be even fps <- 25 nframes <- duration*fps scale <- 2.5 width <- 8 height <- 6 res <- 125 # study parameters true_effect <- 1 se <- 0.4 # number of times to repeat the study n_studies <- 50 # nframes # create a data frame of confidence intervals ests <- tibble(study_id = 1:n_studies, est = c(rnorm(n_studies, true_effect, se))) %>% mutate(reject_null = ifelse(est - 1.64*se > 0, "Yes", "No"), lwr = est - 1.64*se, upr = est + 1.64*se) # add two things to the data frame of confidence intervals # 1. an initial row with study_id = 1 and est = NA so that # the plot starts empty (gganimate would start with the # first observation in place otherwise). # 2. a group variable that defines the row. This is the same # as the study_id, except the dummy row from (1) and the # actual first row have different groups. animate_data <- bind_rows( tibble(study_id = 1, est = NA), # study_id = 1, est = NA ests # actual cis ) %>% mutate(group = 1:n()) split_animate_data <- animate_data %>% # group (row index) split(.$group) %>%
accumulate(~ bind_rows(.x, .y)) %>%
bind_rows(.id = "frame") %>%
mutate(frame = as.integer(frame))

se_lines <- tribble(
~se_, ~label, ~chance, ~ch_loc_,  # trailing _ means not rescaeld to study se
0, "True Effect", NA, NA,
1, "+1 SE", scales::percent(pnorm(1) - pnorm(0), accuracy = 1), 0.5,
2, "+2 SE", scales::percent(pnorm(2) - pnorm(1), accuracy = 1), 1.5,
3, "+3 SE", scales::percent(pnorm(3) - pnorm(2), accuracy = 1), 2.5,
-1, "-1 SE", scales::percent(pnorm(0) - pnorm(-1), accuracy = 1), -0.5,
-2, "-2 SE", scales::percent(pnorm(-1) - pnorm(-2), accuracy = 1), -1.5,
-3, "-3 SE", scales::percent(pnorm(-2) - pnorm(-3), accuracy = 1), -2.5,
) %>%
mutate(ch_loc = ch_loc_*se + true_effect,
se = se_*se + true_effect)

gg1 <- ggplot(animate_data, aes(x = est,
y = study_id,
group = group)) +
geom_vline(xintercept = 1.64*se) +
annotate("label", x = 1.64*se, y = 5, label = "1.64 SEs") +
geom_errorbarh(height = 0, aes(xmin = lwr, xmax = upr, color = reject_null)) +
geom_point(aes(color = reject_null),
size = 3) +
geom_rug(sides = "b",
aes(x = est,
color = reject_null),
alpha = 0.5,
length = unit(0.025, "npc")) +
theme_bw() +
theme(panel.grid.minor.y = element_blank()) +
labs(x = "Estimate of Effect",
y = "Study Number") +
theme(legend.position = "none") +
scale_color_manual(values = c("Yes" = "#1b9e77", "No" = "#d95f02"))

# add dyamics to the plot
gg1_anim <- gg1 +
transition_states(states = group) +
# how points enter
enter_drift(y_mod = 10) +
enter_grow() +
# how points exit/remain
exit_shrink(size = 1) +

gg1_gif<- animate(gg1_anim, nframes = nframes, duration = duration, width = width, height = height, units = "in", res = res)
anim_save("gg1.gif")

## plot 2: histogram
gg2 <- ggplot(split_animate_data, aes(x = reject_null, fill = reject_null),  na.rm = TRUE) +
geom_bar(na.rm = TRUE) +
theme_bw() +
labs(x = "Reject Null",
y = "Count") +
theme(legend.position = "none") +
scale_x_discrete(na.translate = FALSE) +
scale_fill_manual(values = c("Yes" = "#1b9e77", "No" = "#d95f02"))

# add dyamics to the plot
gg2_anim <- gg2 +
transition_states(states = frame)

gg2_gif<- animate(gg2_anim, nframes = nframes, duration = duration, width = width, height = height, units = "in", res = res)
anim_save("gg2.gif")

new_gif <- image_append(c(gg1_mgif[1], gg2_mgif[1]), stack = FALSE)
for(i in 2:nframes){
combined_gif <- image_append(c(gg1_mgif[i], gg2_mgif[i]), stack = FALSE)
new_gif <- c(new_gif, combined_gif)
}
new_gif

## The Key Ratio

The key to building statistical power into your experiment is to get “almost all” of the sampling distribution above 1.64 standard errors above zero. The portion of the sampling distribution that falls below 1.64 standard errors above zero does not allow the researcher to reject the null.

Code
library(tidyverse)

# study parameters
true_effect <- 1
se <- 0.4

se_lines <- tribble(
~se_, ~label, ~chance, ~ch_loc_,  # trailing _ means not rescaeld to study se
0, "True Effect", NA, NA,
1, "+1 SE", scales::percent(pnorm(1) - pnorm(0), accuracy = 1), 0.5,
2, "+2 SE", scales::percent(pnorm(2) - pnorm(1), accuracy = 1), 1.5,
3, "+3 SE", scales::percent(pnorm(3) - pnorm(2), accuracy = 1), 2.5,
-1, "-1 SE", scales::percent(pnorm(0) - pnorm(-1), accuracy = 1), -0.5,
-2, "-2 SE", scales::percent(pnorm(-1) - pnorm(-2), accuracy = 1), -1.5,
-3, "-3 SE", scales::percent(pnorm(-2) - pnorm(-3), accuracy = 1), -2.5,
) %>%
mutate(ch_loc = ch_loc_*se + true_effect,
se = se_*se + true_effect)

x <- rnorm(500000, mean = true_effect, sd = se)
df <- data.frame(x)

bg_alpha <- 0.3
ggplot() +
geom_histogram(data = df,
aes(x = x, y = after_stat(density)), binwidth = se, boundary = true_effect, fill = "grey", alpha = bg_alpha) +
geom_vline(data = se_lines, aes(xintercept = se,
color = -dnorm(se_)), linetype = "dashed", alpha = bg_alpha) +
geom_label(data = se_lines, aes(x = se, y = Inf, label = label, group = NULL), vjust = 1.5, color = alpha('black', bg_alpha)) +
geom_label(data = se_lines, aes(x = ch_loc, y = 0, label = chance, group = NULL), vjust = -1, color = alpha('black', bg_alpha)) +
geom_vline(xintercept = 0) +
geom_function(fun = dnorm, args = list(mean = true_effect, sd = se), size = 1) +
theme_bw() +
labs(x = "Estimate of Effect",
y = "Density") +
geom_area(data = tibble(x = seq(1.64*se, 3*se + true_effect, by = 0.1)), aes(x = x),
stat = "function", fun = dnorm, args = list(mean = true_effect, sd = se),
fill = "#d95f02", alpha = 0.1, xlim = c(1.64*se, 3**se + true_effect)) +
annotate("label", x = 1.3, y = .1, label = "fraction rejected", color = "#d95f02", size = 6) +
annotate("segment", x = 1.64*se, xend = 1.64*se, y = 0, yend = dnorm(1.64*se, mean = true_effect, sd = se), color = "#1b9e77", size = 1) +
annotate("label", x = 1.64*se, y = dnorm(1.64*se, mean = true_effect, sd = se)/2, label = "1.64 SEs above zero", color = "#1b9e77") +
annotate("segment", x = 0, xend = 1.64*se,
y = dnorm(1.64*se, mean = true_effect, sd = se),
yend = dnorm(1.64*se, mean = true_effect, sd = se),
color = "#7570b3", size = 1) +
annotate("label", x = 0.5*1.64*se, y = dnorm(1.64*se, mean = true_effect, sd = se), label = "width of 90% CI", color = "#7570b3") +
theme(legend.position = "none") +
xlim(-1, 3)

You’ll remember that “almost all” of the normal distribution falls within two standard errors of its mean. So as a starting point, let’s use this rule: get the sampling distribution two standard errors above 1.64 standard errors above zero. We can just add 1.64 and 2 together to get a sampling distribution 3.64 standard errors above zero. That is, we need $$\frac{\text{true effect}}{\text{standard error}} > 3.64$$. If the true effect is larger than 3.64 standard errors, then the confidence interval will “rarely” overlap zero (about 2% of the time).

Code
library(tidyverse)

# study parameters
true_effect <- 1
se <- 1/3.84

x <- rnorm(500000, mean = true_effect, sd = se)
df <- data.frame(x)

ggplot() +
geom_histogram(data = df,
aes(x = x, y = after_stat(density)), binwidth = se, boundary = true_effect, fill = "grey", alpha = bg_alpha) +
geom_vline(data = se_lines, aes(xintercept = se,
color = -dnorm(se_)), linetype = "dashed", alpha = bg_alpha) +
geom_label(data = se_lines, aes(x = se, y = Inf, label = label, group = NULL), vjust = 1.5, color = alpha('black', bg_alpha)) +
geom_label(data = se_lines, aes(x = ch_loc, y = 0, label = chance, group = NULL), vjust = -1, color = alpha('black', bg_alpha)) +
geom_vline(xintercept = 0) +
geom_function(fun = dnorm, args = list(mean = true_effect, sd = se), size = 1) +
theme_bw() +
labs(x = "Estimate of Effect",
y = "Density") +
geom_area(data = tibble(x = seq(1.64*se, 3*se + true_effect, by = 0.1)), aes(x = x),
stat = "function", fun = dnorm, args = list(mean = true_effect, sd = se),
fill = "#d95f02", alpha = 0.1, xlim = c(1.64*se, 3**se + true_effect)) +
#annotate("label", x = 1.3, y = .1, label = "fraction rejected", color = "#d95f02", size = 6) +
annotate("segment", x = true_effect, xend = true_effect, y = 0, yend = dnorm(true_effect, mean = true_effect, sd = se), color = "#d95f02", size = 1) +
annotate("label", x = true_effect, y = dnorm(true_effect, mean = true_effect, sd = se)/2, label = "true effect", color = "#d95f02") +
annotate("segment", x = 1.64*se, xend = 1.64*se, y = 0, yend = .75, color = "#1b9e77", size = 1) +
annotate("label", x = 1.64*se, y = .75, label = "1.64 SEs above zero", color = "#1b9e77") +
annotate("segment", x = 0, xend = 1.64*se,
y = .125,
yend = .125,
color = "#7570b3", size = 1,
lineend = "round", linejoin = "round", arrow = arrow(length = unit(0.1, "inches"), ends = "both")) +
annotate("label", x = 0.5*1.64*se, y = .125, label = "1.64 SEs", color = "#7570b3", size = 4) +
annotate("segment", x = true_effect, xend = 1.64*se,
y = .125,
yend = .125,
color = "black", size = 1,
lineend = "round", linejoin = "round", arrow = arrow(length = unit(0.1, "inches"), ends = "both")) +
annotate("label", x = 1.64*se + (true_effect - 1.64*se)/2, y = .125

, label = "ideally 2 SEs", color = "black", size = 4) +
annotate("segment", x = true_effect, xend = 0,
y = 1, yend = 1,
color = "black", size = 1,
lineend = "round", linejoin = "round", arrow = arrow(length = unit(0.1, "inches"), ends = "both")) +
annotate("label", x = true_effect/2, y = 1, label = "ideally 3.64 SEs", color = "black", size = 4) +
theme(legend.position = "none") +
xlim(-0.1, 2.0)

## The Resulting Guidelines

There are a few ways to write the ratio:

1. $$\frac{\text{true effect}}{\text{standard error}} > 3.64$$
2. $$\text{true effect} > 3.64 \times \text{standard error}$$
3. $$\text{standard error} < \frac{1}{3.64} \times \text{true effect} \approx 0.27 \times \text{true effect}$$

These are all the same goals. I prefer the first, but they are all equivalent targets for power.

I like this ratio because it points to strategies to increase power. You can do two things:

1. Increase the true effect.
2. Decrease the standard error.

This ratio gets you thinking not what your power is, but how to increase it. Again, power isn’t a task for the experimenter, it’s the task.5 Kane (2023)6 suggests several ways researchers can increase the true effect or decrease the standard error.

• 5 I’m leaving aside the question of how to predict the standard error or choose a true effect for a paricular design.

• 6 Kane, John V. 2023. “More Than Meets the ITT: A Guide for Investigating Null Results.” APSA Preprints. doi: 10.33774/apsa-2023-h4p0q-v2.

• I’ll highlight a few examples here. To increase the treatment effect, you can:

1. Increase attentiveness.
2. Ensure that respondents are not pre-treated.
3. Write strong treatments or “hit them between the eyes” (Kuklinski et al. 2000).

To decrease the standard error, you can:

1. Increase the sample size.
2. Use multiple measurements for the outcome.
3. Use a pre-post design (Clifford, Sheagley, and Piston 2022).

## Implication

If you get the ratio to 3.64, then you’ll have 98%. This is much higher than the cutoff of 80% I hear suggested most often. If you want 80% power, rather than 98% power, you can change the 3.64 to 2.48.7 But notice that 3.64 and 2.48 are not very different. This has an important implication.

• 7 Find this by adding 1.64 to -qnorm(0.2).

• A researcher can lower the risk of a failed study from about 1 in 5 (80% power) to about 1 in 50 (98% power) by increasing the ratio from 2.48 to 3.64. This requires increasing the treatment effect by about 50% or shrinking the standard error by about 33%. Stated differently, if you are careless and let your treatment effect fall by 33% or let your standard error increase by 50%, then your risk of a failed study increases 10-fold! Small differences can matter a lot.

The implication is this: when you are on the cusp of a well-powered study (about 80% power), then small increases in the ratio have a disproportionate impact on your risk of a failed study.

## Computing Power from the Ratio

Following the logic of the picture above, we need to compute the fraction of the sampling distributin that lies above 1.64 SEs. The pnorm() function returns normal probabilities, but below specfic thresholds. By supplying the argument lower.tail = FALSE, we can get the probabilities of falling above a specific threshold.

true_effect <- 1.00
se <- 0.4

# compute power
pnorm(1.64*se,   # want fraction above 1.64 SE
mean = true_effect,  # mean of sampling distribution
sd = se,  # sd of sampling distribution
lower.tail = FALSE)  # fraction above, not below
[1] 0.8051055

## Summary

Statistical power is an abstract quantity. It’s easy to understand what it means, but harder to think about how to manipulate it. I explain why the target $$\frac{\text{true effect}}{\text{standard error}} > 2.48$$ is equivalent to a target power of 80%. But you want to “almost always” reject the null, so you should shoot for $$\frac{\text{true effect}}{\text{standard error}} > 3.64$$. Hopefully these guidelines help you build a bit of actionable intuition about statistical power.