PhD Candidate @ Princeton University


- For a simple R tutorial on power analyses, click here.
- [Other tutorials under construction…]

Imagine that you tested negative for a fake disease, cucumberitis (congrats!). Should you celebrate?

Maybe. It depends on whether the test is any good. In particular, how likely is the test to turn out negative when you actually have cucumberitis? If it often comes back negative when you are in fact sick, it is a weak, bad test. A good, *powerful* test would almost never do that: if you are sick, it will say so, which means a negative result is something you can trust.

Powerful tests are therefore informative, and weak tests are misleading. In fact, you would probably be **very upset** if you learned that your medical test was weak (imagine a false-negative cancer test!). **In psychology, our tests are often much weaker than we think they are, and we should be upset, too,** because we often take costly actions (e.g., abandoning study designs or rejecting hypotheses) based on weak tests.

To know if you have a weak test, you minimally need to know two things: how big is your effect, and how big is your sample?

"I can't do a power analysis because I have no idea what the effect size is. If I knew the effect size, I wouldn't have to run the study in the first place!"

This is a common objection, and it underlies why many still don't do power analyses. But it turns out we know much more than we think we do. For instance, here is a distribution of effect sizes from a meta-analysis of meta-analyses in social psychology, collecting effect sizes across ~25,000 studies over 100 years in diverse research areas (credits to Jake Westfall for this analysis):

It turns out that, knowing *nothing* about your research question, I can still make an informed guess about your effect size (it's probably ~0.3, if not smaller, since reported effect sizes tend to be inflated).

Moreover, Joe Simmons has run some extremely helpful, large (N ~ 700) studies on the effect sizes of very simple questions. Consider a classic psych question: Do smokers think that smoking is less risky than non-smokers do? What do you think the effect size is for this? It turns out to be ~.3. How about the likelihood that someone who likes eggs eats more egg salad? Effect size ~.5.

The upshot of this is that even very obvious effects in psychology have "small to moderate" effect sizes. If your manipulation is more subtle, you should expect your effect to be weaker.

Often, we'll do a smaller pilot study, and use the effect size estimates from that to do power analyses. This is probably not a good idea unless your pilots are very large, mostly because getting tight bounds on effect sizes requires *very large* samples, e.g., N > 3000 (see below, courtesy of Uri Simonsohn).

This means that your pilot tells you less than you think about how big your effect probably is.
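To get a feel for how loose pilot estimates are, here is a rough sketch using the common large-sample approximation to the standard error of a standardized mean difference (Cohen's d). The helper name `d_ci_halfwidth` and the assumed true effect of 0.3 are mine, purely for illustration; this is not the analysis behind the figure credited to Uri Simonsohn.

```r
# Approximate 95% CI half-width for Cohen's d in a two-group design with
# equal group sizes, using the large-sample variance approximation
#   var(d) ~ (n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2))
# Hypothetical helper; d = 0.3 is an assumed effect size.
d_ci_halfwidth = function(n_per_group, d = 0.3) {
  n1 = n_per_group
  n2 = n_per_group
  se = sqrt((n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2)))
  qnorm(0.975) * se
}

total_n = c(40, 100, 400, 1000, 3000)
data.frame(total_n = total_n,
           halfwidth = round(d_ci_halfwidth(total_n / 2), 2))
```

Under these assumptions, a 40-person pilot only pins the effect size down to within roughly ±0.6, which is wider than most effects we care about. The interval only shrinks to around ±0.07 at a total N of 3000.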

Hopefully you are vaguely convinced that you should try power analyses. Here is my favorite way to do them.

The intuition is simple: Given some effect and sample, we want to know how likely we are to detect a true effect. To do this, you can just create a bunch of fake, random data such that the effect is actually there, and you see if you can detect the effect. If you do this ~1000 times, you get a good estimate of your likelihood of detecting a true effect.

Toy example: Does eating cucumbers make you less thirsty? This should be a very obvious effect, so let's say our effect size = .5 (similar to egg salad from above). What's your likelihood of detecting this in a typical psych sample, at p < .05? Let's say 30 people were given cucumbers and told to eat them; 30 people were not given cucumbers (so total N = 60). They then rated their thirstiness from 0-10.

The no_cucumber group has an average thirstiness of 5. Let's arbitrarily say that the standard deviation of thirstiness is 1. Effect size here is measured in standard deviations, so if it is .5, we would expect the cucumber group to have a thirstiness of 4.5 on average. Let's generate these samples and do a t-test:

```
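# simulate 30 thirstiness ratings per group; the true difference is .5
cucumber = rnorm(n = 30, mean = 4.5, sd = 1)
no_cucumber = rnorm(n = 30, mean = 5, sd = 1)
t.test(cucumber, no_cucumber) # Welch two-sample t-test (R's default)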
```

In this instance, there was enough signal in the randomly generated data that I could find it:

But obviously this won't hold for all randomly generated data. The question is, for what proportion of these datasets will it hold? So we can loop over and randomly generate 1000 cucumber datasets, and collect the p-values:

```
test_pvalues = c() # initiate list to store p values
for(dataset in 1:1000){
  cucumber = rnorm(n = 30, mean = 4.5, sd = 1)
  no_cucumber = rnorm(n = 30, mean = 5, sd = 1)
  test = t.test(cucumber, no_cucumber)
  test_pvalues = c(test_pvalues,
                   test$p.value) # append the p.value to our running list
}
```

Here is the distribution of p-values. Red shows the mean p-value, and blue shows p = .05. All p-values to the right of the blue line are, in this case, false negatives (i.e., we are not detecting an effect that is really there).
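A plot like the one described above can be drawn with a few lines of base R. This is only a sketch: the helper name `simulate_pvalues`, the seed, and the plotting choices are my own assumptions, not necessarily how the original figure was made.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical helper: p-values from n_sims simulated cucumber experiments
simulate_pvalues = function(n_sims = 1000, n_per_group = 30) {
  replicate(n_sims, {
    cucumber = rnorm(n = n_per_group, mean = 4.5, sd = 1)
    no_cucumber = rnorm(n = n_per_group, mean = 5, sd = 1)
    t.test(cucumber, no_cucumber)$p.value
  })
}

test_pvalues = simulate_pvalues()

hist(test_pvalues, breaks = 40,
     main = "p-values across simulated experiments", xlab = "p-value")
abline(v = mean(test_pvalues), col = "red", lwd = 2)  # mean p-value
abline(v = .05, col = "blue", lwd = 2)                # significance threshold
```

Everything to the left of the blue line is a detected effect, and everything to the right is a miss.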

Technically, this is more informative than a single number, but let's say you just want to know what your power is. We can just look at the proportion of our "experiments" in which we found a significant relationship:

```
# proportion of significant tests (mean() counts TRUE as 1, FALSE as 0); I got .45
mean(test_pvalues < .05)
```

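So our power in this case is ~.47. (Estimates from 1,000 simulated experiments wobble by a couple of points between runs, which is why the comment in the code above says .45.) We can increase the number of experiments (e.g., from 1k to 10k) to get more precise estimates, or we can change various parameters (e.g., what if our variance was 2 vs. 1?) to see how they change our power.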
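One way to make this kind of what-if tinkering easy is to wrap the simulation in a function. The following is a sketch; `simulate_power` and its argument names are hypothetical, and the variance of 2 is just the example from the text.

```r
# Hypothetical wrapper around the simulation above
simulate_power = function(n_per_group, mean_diff = .5, sd_thirst = 1,
                          n_sims = 1000, alpha = .05) {
  pvalues = replicate(n_sims, {
    cucumber = rnorm(n = n_per_group, mean = 5 - mean_diff, sd = sd_thirst)
    no_cucumber = rnorm(n = n_per_group, mean = 5, sd = sd_thirst)
    t.test(cucumber, no_cucumber)$p.value
  })
  mean(pvalues < alpha)  # proportion of significant "experiments"
}

set.seed(1)  # arbitrary seed for reproducibility
power_var1 = simulate_power(n_per_group = 30, sd_thirst = 1)
power_var2 = simulate_power(n_per_group = 30, sd_thirst = sqrt(2))  # variance of 2
c(variance_1 = power_var1, variance_2 = power_var2)
```

Doubling the variance shrinks the standardized effect from .5 to about .35, so the same 60 participants buy noticeably less power.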

We can also ask: what sample size would you need to get good (e.g., 90%) power? There are many ways to do this. I find simulations very intuitive, so I just search through possible sample sizes that might make sense, and see what power each one gives. In this case, we need a total sample size of about 170 to detect our cucumber effect with 90% power:

And here is the code for looping over multiple sample sizes:

```
power_estimates = c() # initialize to store power estimates
samplesizes = seq(50,300,10)
for(samplesize in samplesizes){ # loop over possible sample sizes
  test_pvalues = c() # initiate list to store p values
  for(dataset in 1:1000){
    cucumber = rnorm(n = samplesize / 2, mean = 4.5, sd = 1)
    no_cucumber = rnorm(n = samplesize / 2, mean = 5, sd = 1)
    test = t.test(cucumber, no_cucumber)
    test_pvalues = c(test_pvalues,
                     test$p.value) # append the p.value to our running list
  }
  powerest = mean(ifelse(test_pvalues < .05, 1,0))
  power_estimates = c(power_estimates,
                      powerest) # append power estimate to list
}
```

One shortcoming of this method is that it uses loops in R, which are notoriously slow; for more complex analyses, you will either want to optimize the code above, or use an established power analysis library for the particular test you want to run.
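For this particular design, base R already ships an analytic shortcut, `power.t.test` in the `stats` package, which makes a handy cross-check on the simulation. Note that it assumes equal variances, whereas `t.test` defaults to the Welch test, so small discrepancies from the simulated numbers are to be expected. The variable names below are my own.

```r
# Analytic power for 30 per group, mean difference .5, sd 1
analytic_power = power.t.test(n = 30, delta = .5, sd = 1,
                              sig.level = .05)$power
analytic_power

# Per-group n needed for 90% power; round up, then double for the total N
needed = power.t.test(delta = .5, sd = 1, sig.level = .05, power = .90)
total_n_needed = ceiling(needed$n) * 2
total_n_needed
```

Both numbers should land close to what the simulations gave (a little under .5 power at N = 60, and a total N of around 170 for 90% power). If they do not, that is a good hint that something is off in the simulation code.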

I wrote this up because I have run tens of studies with thousands of participants, which, in retrospect, told me a lot less than I thought they did. In the example above, running a "standard" N = 30 per-cell design gives us ~.5 power. So when such a study doesn't work out, we can't tell whether that's because the effect doesn't exist or because we were too underpowered to detect it; the test misses a true effect about half the time. The returns to sample size are not quite linear, but they are large: with the total N of about 170 from the search above (a little under three times the sample), a failure to find an effect would much more likely have meant that there isn't one there.

**I could have learned so much more about which effects are real, and which aren't, by conducting fewer, better-powered studies.** And we would all stop wasting money and time if everyone conducted well-powered studies and reported when they failed.

For more info on why this matters for psychology research, check out the article by Paul Meehl in my list of favorite articles, or check out Uri, Joe, and Jake's posts above. Data Colada in particular has a lot of useful write-ups on this.