Exercise 1 : Estimating sample size

Exercise 2 : Estimating statistical power

Exercise 3 : Estimating sample size (within-subjects)

Exercise 4 : Estimating required effect size (within-subjects)

Exercise 5 : Extension work

Create a new project in RStudio, and call it **int-power**. Now create a new script and call it **power.R**. Put all the commands you use in this worksheet in `power.R` and save regularly. This way it is easier to see what you have done, and refer back to it later for revision.

Collecting data takes time and effort, both for the experimenter and for the participants. So, we don’t want to collect more data than is necessary to answer our question. On the other hand, psychologists often don’t collect enough data to support the conclusions they make. In this worksheet, we’ll cover the basics of how to work efficiently, by collecting *just enough* data to answer your question.

In these *Research Methods in R* materials, we generally use Bayesian techniques (e.g. Bayes Factors), rather than more traditional techniques (e.g. p values). We do this because Bayes Factors are easy to interpret. This is in contrast to p values, which are traditionally used in psychology, but which are widely misunderstood. This worksheet, however, is an exception - it uses traditional techniques. This is because there is no Bayesian method to answer these particular questions that is sufficiently simple for this intermediate-level course.

In traditional analyses, researchers hope to find a *statistically significant* difference between groups, as measured by p values. If an analysis gives you a p value less than .05, psychologists will traditionally believe the difference is real.

How much data we need to collect to find a statistically-significant difference depends on two things:

- **The size of the difference we want to be able to see.** If you want to be able to observe subtle effects, you’ll need to test a lot of people.
- **How sure we want to be of finding the difference.** The more sure we want to be, the more people we’ll need to test.

If we can decide these two things, we can work out how many people we need to test using a *power calculation*. Here, we’re going to use R to do this calculation, using the `pwr.t.test` command. This is from the `pwr` package, so you’ll need to run the command `library(pwr)` before you use it. The command looks like this:

`pwr.t.test(type = "two.sample", power = ?, d = ?, alternative = ?)`

This command won’t work until you replace the question marks with something else, as we’ll cover below.

The first part of the `pwr.t.test` command - `type = "two.sample"` - says that we have two different groups of people in our study. For example, one group of people in a memory experiment might read the items silently, while another group might read them aloud. This would be an experiment with two samples (silent, aloud).

In order to work out how many people we need to test, we have to replace each of the question marks with something else. We’ll cover each of these below.

`power`

Statistical power is a measure of how sure you want to be of finding a difference. Specifically, statistical power is the probability you will find a statistically significant difference (\(p < .05\)), assuming that difference has the effect size you expected.

The convention is that if your statistical power is 0.8, you have collected enough data to be confident of your answers. This is also sometimes called 80% power (0.8, expressed as a percentage, is 80%). 80% power means you have an 80% chance of finding a statistically-significant difference, if that difference is as large as you expected.

80% power might seem a bit low, and in some ways it is - you’ll have a 1 in 5 chance of not finding the effect you expect, even if it’s there. But, as we’ll see later, many psychology experiments don’t even get close to 80% power. The goal in this worksheet is to estimate how many people you need to test to hit this minimal but conventional standard of 80% power.

In summary, set `power = .8`.
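One way to get a feel for what statistical power means is to simulate it. The sketch below is not part of the power calculation itself; it just runs many made-up experiments and counts how often they give \(p < .05\). The group size (30), the effect size (0.8), the number of simulations and the variable names are all assumptions chosen for illustration.

```r
# Simulate many two-group experiments and count how often p < .05.
# Assumed values (illustration only): 30 people per group, true d = 0.8.
set.seed(1)            # make the simulation repeatable
n_per_group <- 30
true_d <- 0.8          # group means differ by 0.8 standard deviations
n_sims <- 2000

p_values <- replicate(n_sims, {
  silent <- rnorm(n_per_group, mean = 0, sd = 1)
  aloud  <- rnorm(n_per_group, mean = true_d, sd = 1)
  t.test(aloud, silent, var.equal = TRUE)$p.value
})

# Proportion of simulated experiments that reached significance
estimated_power <- mean(p_values < .05)
estimated_power
```

The proportion printed at the end is an estimate of the power of this made-up experiment: the chance of a significant result when the effect really is as large as assumed.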

`d`

The amount of data you need to collect depends on the *effect size* you want to be able to observe. Effect size is measured by Cohen’s *d*. If you have an effect size of 0.5, this means that the difference between the group means is half the standard deviation of the groups. Standard deviation is a measure of variability.
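To make the definition concrete, here is a minimal sketch that works out Cohen’s *d* by hand for two small groups. The scores are made up, and the pooled standard deviation is calculated the simple way that applies when both groups are the same size; that choice, and the variable names, are assumptions for illustration.

```r
# Hypothetical recall scores for two groups (made-up numbers)
silent <- c(10, 12, 14, 16, 18)
aloud  <- c(12, 14, 16, 18, 20)

mean_diff <- mean(aloud) - mean(silent)

# Pooled standard deviation (the groups are the same size here,
# so we can simply average the two variances)
pooled_sd <- sqrt((var(silent) + var(aloud)) / 2)

d <- mean_diff / pooled_sd
d
```

Here the means differ by 2 and the pooled standard deviation is a little over 3, so *d* comes out at about 0.63.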

The smaller the effect size you want to be able to observe, the more data you need to collect to be sure of observing it. But what effect size should you expect? We know that the median effect size in psychology is around `d = 0.5`. So, if we’ve nothing else to go on, we assume an effect size of 0.5. However, we can often do better than this. Many psychologists now publish their effect sizes, so a better choice is to find a study close to what you plan to do, and base your estimated effect size on this. Even if the study does not report an effect size, you can often work it out from other information that is reported. How to do this is covered in the “more on power” worksheet. You don’t need to read that worksheet right now, but it’ll be useful later on when you need to estimate an effect size from previous work.

In summary, you should estimate the effect size you want to be able to observe from the effect size reported in previous, relevant, experiments. If there are no previous relevant experiments, use `d = 0.5` because that’s the typical (median) effect size across all of psychology.

If you’ve based your effect size estimate on a previous study, you’re probably also assuming that the effect will be in the same direction as the previous study. For example, if the previous study found in the U.S. that average income was lower for women than for men, with an effect size of 0.4, and you’re using that effect size to work out how big your sample should be to look at the same question in the U.K., then you’re probably assuming you’ll find the same direction of effect as the previous study (i.e. men earning more than women).

If you expect a particular direction, and have good reasons for doing so, you set `alternative = "greater"`. Or, you could set `alternative = "less"` - it doesn’t matter which you choose for this `pwr.t.test` command, as long as the sign of `d` matches: use a positive `d` with `"greater"`, and a negative `d` with `"less"`. If there’s no previous relevant work, you set `alternative = "two.sided"`.
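As a rough illustration of a directional power calculation, the sketch below uses the effect size of 0.4 from the income example. It assumes the `pwr` package is installed, and the object names are made up. It writes the same prediction both ways round, so you can check that the two versions ask for the same number of people.

```r
library(pwr)

# Directional prediction: we expect the first group to score higher,
# so d is positive and alternative = "greater"
expect_higher <- pwr.t.test(type = "two.sample", power = .8, d = 0.4,
                            alternative = "greater")

# The same prediction written the other way round: d is negative,
# so alternative = "less"
expect_lower <- pwr.t.test(type = "two.sample", power = .8, d = -0.4,
                           alternative = "less")

# Required number of people per group, for each version
expect_higher$n
expect_lower$n
```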

Use `pwr.t.test` to work out how many people per group you need to test for 80% power to detect the median effect size in psychology, assuming that you do not know which direction the effect will be in.

If you get it right, your result should look like this:

```
Two-sample t test power calculation
n = 63.76561
d = 0.5
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group
```

So, to the nearest whole person, you need 64 people **per group** (so, 128 people in total) to stand a good chance of finding a typically-sized effect, if it exists. This is about three times as much data as psychologists have traditionally collected, according to Marszalek et al.’s (2011) review of sample sizes.
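If you’d rather let R do the rounding, the result of `pwr.t.test` can be stored and its `n` pulled out. The sketch below assumes the `pwr` package is installed and uses an effect size of 0.8, chosen only to illustrate the rounding step; the object names are made up.

```r
library(pwr)

# Assumed effect size of 0.8, chosen only to illustrate the rounding step
result <- pwr.t.test(type = "two.sample", power = .8, d = 0.8,
                     alternative = "two.sided")

n_per_group <- ceiling(result$n)   # always round up to a whole person
n_total <- 2 * n_per_group         # two groups in a between-subjects design

n_per_group
n_total
```

We use `ceiling` rather than ordinary rounding because rounding down would leave you slightly short of the power you asked for.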

**Copy the R code you used for this exercise into PsycEL.**

Traditionally, psychologists test about 20 participants per group. Calculate the power of such an experiment, assuming the same typical effect size of 0.5. You can do this by removing `power` from the command and adding `n = 20`.

Your answer should look like this:

```
Two-sample t test power calculation
n = 20
d = 0.5
sig.level = 0.05
power = 0.337939
alternative = two.sided
NOTE: n is number in *each* group
```

As you can see, statistical power in traditional psychology experiments is very low … in this case, around 34%. This means that very often, we’ll end up without good evidence there is a difference between groups, even though the groups are in fact different.

**Copy the R code you used for this exercise into PsycEL.**

So far, this is all looking pretty bad for psychology. Effect sizes are typically only medium (`d = .5`) and to test for a difference between groups with that kind of effect size, we should test 64 people per group (see Exercise 1). That’s a lot of testing, and most psychology studies don’t traditionally hit that target – 20 per group is much more typical. At that sample size, the power is very low for medium-sized effects (see Exercise 2).

How can we make things better? One really good option is to use a *within-subjects* design. The production-effect experiment in the example above was a *between-subjects* design - some people read words silently while others read them aloud. To turn this into a *within-subjects* design, we could give **each person** some words to read silently, and other words to read aloud. As long as we designed the experiment well, taking into account things like order effects, we could still test for a production effect.

The reason for switching to a within-subjects design is that it’s much more efficient. In other words, for a given effect size, you can reach 80% power with many fewer participants, as we’ll see below.

Before looking at the power of within-subjects designs, we have to decide how to calculate effect size for this kind of experiment. There are at least five different ways of doing this, but we’ll use the one required by the `pwr.t.test` command - Cohen’s \(d_{z}\). This is the mean difference between conditions, divided by the standard deviation of the *differences*. The `cohen.d` command we used earlier does **not** calculate this sort of within-subjects effect size. We won’t need to calculate \(d_{z}\) in this worksheet but, if you need to in later work, see the “more on power” worksheet.
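For the curious, the definition of \(d_{z}\) can be written out in a few lines of base R. This is only a sketch of the definition given above, using made-up scores and made-up variable names; it is not the method from the “more on power” worksheet.

```r
# Hypothetical scores: each position is one person, who took part
# in both conditions (made-up numbers)
silent <- c(10, 12, 11, 15, 13, 14)
aloud  <- c(12, 13, 14, 16, 13, 17)

# One difference score per person
differences <- aloud - silent

# Mean difference, divided by the standard deviation of the differences
d_z <- mean(differences) / sd(differences)
d_z
```

Notice that the standard deviation here is of the *difference scores*, not of the scores in either condition.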

The `pwr.t.test` command lets us work out how many people we need for a within-subjects design, too. Just set `type = "paired"` and use the command as before. Here’s the answer you should get:

```
Paired t test power calculation
n = 33.36713
d = 0.5
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number of *pairs*
```

In other words, you only need to test 34 people *in total* in a within-subjects design, relative to the 128 people you’d need to test in a *between-subjects* design.

This comparison assumes the effect size would be the same in a within-subjects design, as in a between-subjects design. This is not necessarily the case, a point covered in more detail in the “more on power” worksheet. Nevertheless, the comparison illustrates that within-subject designs tend to get the same power with fewer participants. This is sometimes described as within-subject designs being more *efficient* than between-subject designs.

**Copy the R code you used for this exercise into PsycEL.**

Often, there is a maximum number of people you can test given the time and resources available. For example, if you’re taking PSYC520 at Plymouth University, you will only be able to test 25 people in your first data collection period due to time constraints. When you have this kind of constraint, it’s important to know what effect size you will be able to detect with 80% power. To do this, take your last command and replace `d = .5` with `n = 25`. If you do this correctly, you’ll see this output:

```
Paired t test power calculation
n = 25
d = 0.5840272
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number of *pairs*
```

So, if you can only test 25 people in a within-subjects study, then you will only be able to reliably ‘see’ effect sizes of 0.58 or greater. If you are taking PSYC520 at Plymouth University, the good news is that the course designers chose topic areas where the published effect sizes are at least that large.

**Copy the R code you used for this exercise into PsycEL.**

Whether you use a within- or a between-subjects design, another really good way to improve the statistical power of your experiment is to increase the experiment’s effect size. Effect size is the mean difference divided by the variability, so you can increase effect size by either increasing the mean difference, or decreasing the variability, or both. There are some suggestions of how to do this in the “more on power” worksheet.
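To see why reducing variability helps, here is a small sketch with made-up numbers. It assumes the `pwr` package is installed, a mean difference of 6, and standard deviations of 10 and 8; all of these values and names are assumptions for illustration.

```r
library(pwr)

mean_diff <- 6                 # assumed difference between group means
d_noisy   <- mean_diff / 10    # standard deviation of 10 gives d = 0.6
d_precise <- mean_diff / 8     # standard deviation of 8 gives d = 0.75

# People needed per group for 80% power in each case
n_noisy   <- pwr.t.test(type = "two.sample", power = .8, d = d_noisy)$n
n_precise <- pwr.t.test(type = "two.sample", power = .8, d = d_precise)$n

ceiling(c(noisy = n_noisy, precise = n_precise))
```

The mean difference is the same in both cases; only the variability has changed, and the less variable measure needs fewer people.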

Psychologists are often surprised how much difference small increases in effect size can make. The relationship between effect size and sample size for 80% power is not a straight line. The required sample size drops off very rapidly as effect size increases, as shown in the graph below:

[Graph: total number of participants (\(N\)) needed for 80% power, plotted against effect size, for between-subjects and within-subjects designs.]

\(N\) in this graph is the *total* number of people you’d need to test for 80% power. You should be able to see that small increases above \(d = .5\) lead to substantial reductions in necessary sample size, falling to 10 participants in total for \(d_{z} = 1\) in a within-subjects design. In the other direction, small reductions in \(d\) lead to very large increases in required sample size, with \(d = .3\) requiring over 350 people for a between-subjects design.
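The numbers behind this relationship can be tabulated with a few lines of R. The sketch below stops short of drawing the graph. It assumes the `pwr` package is installed; the particular effect sizes and the object names are choices made for illustration.

```r
library(pwr)

effect_sizes <- c(0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)

# People per group (between-subjects) or number of pairs (within-subjects),
# rounded up to whole people
n_between <- sapply(effect_sizes, function(d) {
  ceiling(pwr.t.test(type = "two.sample", power = .8, d = d)$n)
})
n_within <- sapply(effect_sizes, function(d) {
  ceiling(pwr.t.test(type = "paired", power = .8, d = d)$n)
})

required <- data.frame(
  d = effect_sizes,
  between_total = 2 * n_between,  # two groups, so double the per-group n
  within_total = n_within         # everyone takes part in both conditions
)
required
```

Reading down the table, the required totals fall quickly at first and then level off, which is the curved relationship described above.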

If you’re feeling confident, and have some time, try to recreate the above graph using R, and **copy the R code you used for this exercise into PsycEL.**

This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.