```
# download the {crdata} package from github
::install_github("carlislerainey/crdata") remotes
```

# Statistical Power from Pilot Data: An Example

We can think of statistical power as determined by the ratio \(\frac{\tau}{SE}\), where \(\tau\) is the treatment effect and SE is the standard error of the estimate. To reason about statistical power, one needs to make assumptions or predictions about the treatment effect and the standard error.

And as data-oriented researchers, we often want to use data to inform these predictions and assumptions. We might want to use *pilot* data.^{1}

^{1} Here’s how Leon, Davis, and Kraemer (2011) describe the purpose of a pilot study. “The fundamental purpose of conducting a pilot study is to examine the feasibility of an approach that is intended to ultimately be used in a larger scale study. This applies to all types of research studies. Here we use the randomized controlled clinical trial (RCT) for illustration. Prior to initiating a full scale RCT an investigator may choose to conduct a pilot study in order to evaluate the feasibility of recruitment, randomization, retention, assessment procedures, new methods, and/or implementation of the novel intervention. A pilot study, however, is not used for hypothesis testing. Instead it serves as an earlier-phase developmental function that will enhance the probability of success in the larger subsequent RCTs that are anticipated.”

Usually:

- Pilot data are
*not*useful to predict the treatment effect. - Pilot data are useful to predict the standard error.

With a predicted standard error in hand, we can predict the minimum detectable effect, the statistical power, or the required sample size in the planned study.

In this post, I give an example of how this can work.

## Predicting the SE from pilot data

Here’s how I suggest we use pilot data to predict the standard error in the planned study:

Conservatively, the standard error will be about \(\sqrt{\frac{n^{pilot}}{n^{planned}}} \cdot \left\lbrack \left( \sqrt{\frac{1}{n^{pilot}}} + 1 \right) \cdot {\widehat{SE}}_{\widehat{\tau}}^{pilot} \right\rbrack\), where \(n^{pilot}\) is the number of respondents per condition in the pilot data, \(SE_{\widehat{\tau}}^{pilot}\) is the estimated standard error using the pilot data, and \(n^{planned}\) is the number of respondents per condition in the planned study.

The factor \(\left( \sqrt{\frac{1}{n^{pilot}}} + 1 \right)\) nudges the standard error from the pilot study in a conservative direction, since it might be an under-estimate of the actual standard error.^{2} For the details, see this early paper, but this conservative standard error estimate is approximately the upper bound of a 95% confidence interval for the *standard error* using the pilot data.

^{2} More generally, we can use a bootstrap to conservatively estimate the standard error, without relying on this analytical approximation.

## Example

### The Robbins et al. study

As an example, let’s use half of the experiment conducted by Robbins et al. (2024).

Robbins et al. use a 2x2 factorial vignette design, randomly assigning each respondent to read one of four vignettes. The vignette describes a hypothetical covert operation ordered by the president that ends in either success or failure. Then, the vignette describes a whistleblower coming forward and describes the president’s opposition in Congress as either amplifying or ignoring the whistleblower.

President's Opposition in Congress | Outcome of Operation | |
---|---|---|

Success | Failure | |

Amplifies Whistleblower | Vignette 1: Success & Amplify | Vignette 2: Failure & Amplify |

Ignores Whistleblower | Vignette 3: Success & Ignore | Vignette 4: Failure & Ignore |

After the vignette, the respondent is asked whether they approve of the opposition in Congress’ actions on a seven-point Likert scale from strongly approve to strongly disapprove.

For a simple example, let’s **focus on the effect of amplifying the whistleblower when the operation succeeds**. That is, let’s compare responses after Vignette 1 and Vignette 3. How much does amplifying a whistleblower increase approval *when the opperation succeeds*? We expect a small effect here, so we should pay careful attention to power.

## The task

**We hoped to detect an effect as small as 0.35 points on the seven-point scale and had tentatively planned on 250 respondents per condition.** To test the survey instrument and data provider, we conducted a small pilot with about 75 respondents per condition. Let’s use those pilot data to check whether 250 respondents seem sufficient.

## The data

In the {crdata} package on GitHub, you can find the the pilot data we collected leading up to the main study.

Now let’s load the pilot data. To focus on observations where the operation succeeds, we’re going to keep only the observations where the vignette describes a successful observation.

```
# load pilot data and keep only success condition
<- crdata::robbins_pilot |>
robbins2_pilot subset(failure == "Success") |>
glimpse()
```

```
Rows: 147
Columns: 5
$ cong_overall <dbl> 3, 1, -2, 0, -2, -1, 0, -1, 0, 0, 0, 2, -1, 3, -3, 0, 0, …
$ failure <fct> Success, Success, Success, Success, Success, Success, Suc…
$ amplify <fct> Ignore, Ignore, Ignore, Amplify, Ignore, Ignore, Ignore, …
$ pid7 <fct> Strong Democrat, Not very strong Republican, Strong Democ…
$ pid_strength <dbl> 3, 2, 3, 2, 2, 3, 2, 2, 2, 2, 2, 1, 0, 3, 3, 3, 3, 1, 1, …
```

`cong_overall`

is the respondent’s approval of Congress’ actions on a seven-point scale and `amplify`

indicates whether Congress amplified the whistleblower (i.e., criticized the president).

### Analyzing the pilot data

Now let’s analyze the pilot data as we plan to analyze the main data set that we plan to collect later. We’re interested in the average response in the `Amplify`

and `Ignore`

conditions, so let’s use a t-test.

```
# t test
<- t.test(cong_overall ~ amplify, data = robbins2_pilot) fit_pilot
```

It can be really tempting to look at the estimated treatment effect. In this pilot study, it’s *actually statistically significant.* I intentionally don’t show the estimated treatment effect (or quantities requiring it, like *p*-values). If we looked at these, we might make one of the following mistakes:

- “The pilot got significant results, therefore even the pilot is sufficiently powered.”
- “The estimate from the pilot is significant, therefore we can use the estimated treatment effect in the power analysis.”

Both of these claims are misleading. The estimated treatment effect is very noisy, so ignore the estimated treatment effect.

### Predicting the SE in the main study

To predict the standard error in the main study, we need two pieces of information from this pilot:

- the sample size per condition and
- the estimated standard error.

We can get the number of observations per condition using `table()`

.

```
# create a table showing the observations per condition
table(robbins2_pilot$amplify)
```

```
Ignore Amplify
70 77
```

```
# sample size per condition
<- mean(table(robbins2_pilot$amplify))
n_pilot n_pilot
```

`[1] 73.5`

And then we need the estimated standard error, which is computed by `t.test()`

.

```
# get estimated standard error from pilot
<- fit_pilot$stderr
se_hat_pilot se_hat_pilot
```

`[1] 0.2761011`

Now we can predict the standard error in the planned study.

For the main study, we planned on about 250 respondents per condition.

`<- 250 n_planned `

The we can conservatively predict the standard error in the full study as \(\sqrt{\frac{n^{pilot}}{n^{planned}}} \cdot \left\lbrack \left( \sqrt{\frac{1}{n^{pilot}}} + 1 \right) \cdot {\widehat{SE}}_{\widehat{\tau}}^{pilot} \right\rbrack\).

```
<- sqrt(n_pilot/n_planned)*((sqrt(1/n_pilot) + 1)*se_hat_pilot)
pred_se_cons pred_se_cons
```

`[1] 0.1671691`

But is this standard error small enough?

### Evaluating the predicted SE in the main study

We can convert the standard error to the minimum detectable effect with 80% power using \(2.5 \times SE\).^{3}

```
# compute conservative minimum detectable effect
2.5*pred_se_cons
```

`[1] 0.4179227`

We hoped to detect an effect as small as 0.35 points on the seven-point scale, so we’re going to need more than 250 respondents per condition!

We can also compute the power to detect an effect of 0.35 points on the seven-point scale.

```
# compute power as a percent
1 - pnorm(1.64 - 0.35/pred_se_cons)
```

`[1] 0.6749736`

Note that these are *conservative* estimates of the minimum detectable effect and statistical power.

Here’s what things look like if we remove the conservative nudge \(\left( \sqrt{\frac{1}{n^{pilot}}} + 1 \right)\) and predict the standard error as \(\sqrt{\frac{n^{pilot}}{n^{planned}}} \cdot {\widehat{SE}}_{\widehat{\tau}}^{pilot}\).

```
# without the conservative nudge
<- sqrt(n_pilot/n_planned)*se_hat_pilot
pred_se 2.5*pred_se # best guess of minimum detectable effect
```

`[1] 0.3742673`

`1 - pnorm(1.64 - 0.35/pred_se) # best guess of power`

`[1] 0.7573806`

As you can see, the minimum detectable effect and power are a little too low. We need more respondents!

### Adjusting the Sample Size

Our plan of 250 respondents per condition seems too low. If we want, we can predict the sample size we need to get to 80% power using the following rule:

For 80% power to detect the treatment effect \(\widetilde{\tau}\), we will (conservatively) need about \(n^{pilot} \cdot \left\lbrack \frac{2.5}{\widetilde{\tau}} \cdot \left( \sqrt{\frac{1}{n^{pilot}}} + 1 \right) \cdot {\widehat{SE}}_{\widehat{\tau}}^{pilot} \right\rbrack^{2}\) respondents per condition, where \(n^{pilot}\) is the number of respondents per condition in the pilot data and \(SE_{\widehat{\tau}}^{pilot}\) is the estimated standard error using the pilot data.

`*((2.5/0.35)*((sqrt(1/n_pilot) + 1)*se_hat_pilot))^2 n_pilot`

`[1] 356.4477`

Thus to get 80% power, the pilot data suggest that we (conservatively) need about 360 respondents per condition. We used 367 in the full study. Here are the conservative predictions for 367 respondents per condition.

```
<- 367
n_planned <- sqrt(n_pilot/n_planned)*((sqrt(1/n_pilot) + 1)*se_hat_pilot)
pred_se_cons pred_se_cons
```

`[1] 0.1379726`

`2.5*pred_se_cons # conservative minimum detectable effect`

`[1] 0.3449315`

`1 - pnorm(1.64 - 0.35/pred_se_cons) # conservative power`

`[1] 0.8150699`

## How did we do?

```
<- crdata::robbins_main |>
robbins2_main subset(failure == "Success") |>
glimpse()
```

```
Rows: 735
Columns: 5
$ cong_overall <dbl> 2, -2, -1, -3, 0, -1, -2, -1, 1, 1, 0, -3, -2, 2, 2, -3, …
$ failure <fct> Success, Success, Success, Success, Success, Success, Suc…
$ amplify <fct> Ignore, Amplify, Amplify, Amplify, Ignore, Ignore, Amplif…
$ pid7 <fct> Not very strong Republican, Not very strong Republican, S…
$ pid_strength <dbl> 2, 2, 3, 3, 3, 2, 0, 2, 2, 1, 2, 0, 3, 1, 2, 3, 2, 3, 2, …
```

```
<- t.test(cong_overall ~ amplify, data = robbins2_main)
fit_main $stderr fit_main
```

`[1] 0.1322618`

`1 - pnorm(1.64 - 0.35/fit_main$stderr)`

`[1] 0.8428563`

As you can see, the pilot data gave us a good, slightly conservative prediction. We conservatively predicted a standard error of 0.138 in the planned study and we estimated a standard error of 0.132 after running the study. We conservatively predicted our power would be about 82% to detected an effect of 0.35 on the seven-point scale, but after running the study, it seems like we had about 84% power.

## A bootstrap alternative

We can also use the bootstrap as an alternative. There are a few ways one might approach it.

Here’s one:

- Treat the pilot data as a population. Create a data set with the planned sample size by sampling with replacement from the pilot data.
- Perform the planned analysis on each resampled data set.
- Store the estimated standard error from each analysis.

Repeat the process above many times. For each standard error estimate, compute the implied statistical power. This gives a distribution of power estimates. Find a value near the bottom of this distribution. The factor we used above—The factor \(\left( \sqrt{\frac{1}{n^{pilot}}} + 1 \right)\)—nudges the standard error to about the 2.5th percentile, so we can use that here, too.

```
# number of bootstrap iterations
<- 10000
n_bs
<- numeric(n_bs) # a container
bs_se for (i in 1:n_bs){
# resample 367 observations from each condition
<- robbins2_main %>%
bs_data group_by(amplify) %>%
sample_n(size = 367, replace = TRUE)
# run planned analysis
<- t.test(cong_overall ~ amplify, data = bs_data)
bs_fit # grab se
<- bs_fit$stderr
bs_se[i]
}
# compute 2.5th percentile of power to obtain conservative estimate
<- 1 - pnorm(1.64 - 0.35/bs_se)
pwr quantile(pwr, probs = 0.025)
```

```
2.5%
0.8206062
```

Using the analytical approximation, we got 0.815 as a conservative estimate of power. The bootstrap gave us 0.820 as a conservative estimate. The actual power in the full study turned out to be about 0.843. (Remember, all of these power calculations are *power to detected an effect of 0.35 points on the seven-point scale.*)

## The paper

I have an early draft of a paper on these (and other) ideas. Please test them out in your own work and let me know if you have questions, comments, and suggestions. I’m interested in making the paper as clear and useful as I can.

## References

*Evaluation Review*19 (5): 547–56. https://doi.org/10.1177/0193841x9501900504.

*Journal of Psychiatric Research*45 (5): 626–29. https://doi.org/10.1016/j.jpsychires.2010.10.008.