Exercise 1 : Estimating sample size

Exercise 2 : Estimating statistical power

Exercise 3 : Estimating sample size (within-subjects)

Exercise 4 : Estimating required effect size (within-subjects)

Exercise 5 : Extension work

Open the project you created in the revision
worksheet. Now create a new script within that project and call it
**power.R**. Put all the commands you use in this worksheet
in `power.R` and save regularly. This way it is easier to see
what you have done, and refer back to it later for revision.

Collecting data takes time and effort, both for the experimenter and
for the participants. So, we don’t want to collect more data than is
necessary to answer our question. On the other hand, psychologists often
don’t
collect enough data to support the conclusions they make. In this
worksheet, we’ll cover the basics of how to work efficiently, by
collecting *just enough* data to answer your question.

In these *Research Methods in R* materials, we generally use
Bayesian techniques (e.g. Bayes Factors), rather than more traditional
techniques (e.g. p values). We do this because Bayes Factors are easy to
interpret. This is in contrast to p values, which are traditionally used
in psychology, but which are widely misunderstood. This worksheet,
however, is an exception - it uses traditional techniques. This is
because there is no Bayesian method to answer these particular questions
that is sufficiently simple for this intermediate-level course. In the
advanced-level guide, *Going Further with R*, there is a
worksheet on Bayesian power
estimation.

In traditional analyses, researchers hope to find a *statistically
significant* difference between groups, as measured by p values. If
an analysis gives you a p value less than .05, psychologists will
traditionally believe the difference is real.

How much data we need to collect to find a statistically-significant difference depends on two things:

- **The size of the difference we want to be able to see.** If you want to be able to observe subtle effects, you’ll need to test a lot of people.
- **How sure we want to be of finding the difference.** The more sure we want to be, the more people we’ll need to test.

If we can decide these two things, we can work out how many people we
need to test using a *power calculation*. Here, we’re going to
use R to do this calculation, using the `pwr.t.test` command.
This is from the `pwr` package, so you’ll need to run the
command `library(pwr)` before you use it. The command looks
like this:

`pwr.t.test(type = "two.sample", power = ?, d = ?, alternative = ?)`

This command won’t work until you replace the question marks with something else, as we’ll cover below.

The first part of the `pwr.t.test` command,
`type = "two.sample"`, says that we have two different
groups of people in our study. For example, one group of people in a
memory experiment might read the items silently, while another group
might read them aloud. This would be an experiment with two samples
(silent, aloud).

In order to work out how many people we need to test, we have to replace each of the question marks with something else. We’ll cover each of these below.

`power`

Statistical power is a measure of how sure you want to be of finding a difference. Specifically, statistical power is the probability you will find a statistically significant difference (\(p < .05\)), assuming that difference has the effect size you expected.

The convention is that if your statistical power is 0.8, you have collected enough data to be confident of your answers. This is also sometimes called 80% power (0.8, expressed as a percentage, is 80%). 80% power means you have an 80% chance of finding a statistically-significant difference, if that difference is as large as you expected.

80% power might seem a bit low, and in some ways it is - you’ll have a 1 in 5 chance of not finding the effect you expect, even if it’s there. But, as we’ll see later, many psychology experiments don’t even get close to 80% power. The goal in this worksheet is to estimate how many people you need to test to hit this minimal but conventional standard of 80% power.

In summary, set `power = .8`.
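One way to get a feel for what power means is to simulate it. The sketch below is only an illustration, using base R and made-up settings (50 people per group, a true effect size of 0.5, normally distributed scores); none of these values come from the exercises. It runs many simulated experiments and counts how often the result is statistically significant.

```r
# Estimate power by simulation: the proportion of simulated experiments
# that give p < .05 when the assumed effect really exists.
set.seed(1)            # make the simulation reproducible
n_per_group <- 50      # illustrative sample size (an assumption)
d <- 0.5               # assumed true effect size, in standard deviations
n_sims <- 2000         # number of simulated experiments

p_values <- replicate(n_sims, {
  silent <- rnorm(n_per_group, mean = 0, sd = 1)  # simulated "silent" group
  aloud  <- rnorm(n_per_group, mean = d, sd = 1)  # simulated "aloud" group
  t.test(aloud, silent, var.equal = TRUE)$p.value
})

# The proportion of significant results is the estimated power
estimated_power <- mean(p_values < .05)
estimated_power
```

The estimate will vary a little with the random seed and the number of simulations, so treat it as approximate.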

`d`

The amount of data you need to collect depends on the *effect
size* you want to be able to observe. Effect size is measured by
Cohen’s *d*. If you have an effect size of 0.5, this means that
the difference between the group means is half the standard deviation of
the groups. Standard deviation is a measure of variability.
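As a rough illustration of this definition, the sketch below works out Cohen's *d* by hand in base R. The scores are made up for the example, and the pooled standard deviation is one common choice for "the standard deviation of the groups"; other choices exist.

```r
# Hypothetical recall scores for two groups (made-up numbers)
silent <- c(10, 12, 14, 16, 18)
aloud  <- c(12, 14, 16, 18, 20)

# Pooled standard deviation of the two groups
n1 <- length(silent)
n2 <- length(aloud)
pooled_sd <- sqrt(((n1 - 1) * var(silent) + (n2 - 1) * var(aloud)) / (n1 + n2 - 2))

# Cohen's d: the difference between the means, in units of standard deviation
d <- (mean(aloud) - mean(silent)) / pooled_sd
d  # about 0.63
```

Here the means differ by 2 points and the standard deviation is about 3.16, so the difference is a little under two-thirds of a standard deviation.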

The smaller the effect size you want to be able to observe, the more
data you need to collect to be sure of observing it. But what effect
size should you expect? We know that the median
effect size in psychology is around `d = 0.5`. So, if
we’ve nothing else to go on, we assume an effect size of 0.5. However,
we can often do better than this. Many psychologists now publish their
effect sizes, so a better choice is to find a study close to what you
plan to do, and base your estimated effect size on this. Even if the
study does not report an effect size, you can often work it out from
other information that is reported. How to do this is covered in the “more on power” worksheet. You don’t
need to read that worksheet right now, but it’ll be useful later on when
you need to estimate an effect size from previous work.

In summary, you should estimate the effect size you want to be able
to observe from the effect size reported in previous, relevant,
experiments. If there are no previous relevant experiments, use
`d = 0.5` because that’s the median effect size across all
of psychology.

If you’ve based your effect size estimate on a previous study, you’re probably also assuming that the effect will be in the same direction as the previous study. For example, if the previous study found in the U.S. that average income was lower for women than for men, with an effect size of 0.4, and you’re using that effect size to work out how big your sample should be to look at the same question in the U.K., then you’re probably assuming you’ll find the same direction of effect as the previous study (i.e. men earning more than women).

If you expect a particular direction, and have good reasons for doing
so, you set `alternative = "greater"`. Or, you could set
`alternative = "less"` and give `d` a negative sign - the sample
size reported by the `pwr.t.test` command is the same either way.
If there’s no previous
relevant work, you set `alternative = "two.sided"`.
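To show what a completed command can look like, here is a minimal sketch with illustrative values (90% power, an effect size of 0.8, and a predicted direction). These values are assumptions chosen for the example and are not the answer to any exercise in this worksheet. The second command shows the same prediction written the other way round, with a negative `d` and `alternative = "less"`.

```r
library(pwr)

# Illustrative values only (not the exercise values):
# 90% power, a large effect, and a predicted direction.
greater_result <- pwr.t.test(type = "two.sample", power = 0.9,
                             d = 0.8, alternative = "greater")

# The same prediction expressed in the opposite direction:
# note that d is negative to match alternative = "less".
less_result <- pwr.t.test(type = "two.sample", power = 0.9,
                          d = -0.8, alternative = "less")

# Required number of people per group for each version
greater_result$n
less_result$n
```

Both versions should report the same number of people per group, because they describe the same one-directional prediction.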

Enter this comment into your script:

```
# Statistical power
# EXERCISE 1
```

and then add a `pwr.t.test` command to work out how many
people per group you need to test for 80% power to detect the median
effect size in psychology, assuming that you do not know which direction
the effect will be in.

If you get it right, your result should look like this:

```
Two-sample t test power calculation
n = 63.76561
d = 0.5
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group
```

So, to the nearest whole person, you need 64 people **per
group** (so, 128 people in total) to stand a good chance of
finding a typically-sized effect, if it exists. This is about three
times as much data as psychologists have traditionally collected,
according to Marszalek
et al.’s (2011) review of sample sizes.

**Copy the R code (not the comment) you used for this exercise
into PsycEL.**

Enter the following comment into your script:

`# EXERCISE 2`

Traditionally, psychologists test about 20 participants per group.
Enter a command to calculate the power of such an experiment, assuming
the same typical effect size of 0.5. You can do this by removing
`power` from the command and adding `n = 20`.

Your answer should look like this:

```
Two-sample t test power calculation
n = 20
d = 0.5
sig.level = 0.05
power = 0.337939
alternative = two.sided
NOTE: n is number in *each* group
```

As you can see, statistical power in traditional psychology experiments is very low: in this case, around 34%. This means that very often, we’ll end up without good evidence there is a difference between groups, even though the groups are in fact different.

**Copy the R code (not the comment) you used for this exercise
into PsycEL.**

So far, this is all looking pretty bad for psychology. Effect sizes
are typically only medium (`d = .5`) and to test for a
difference between groups with that kind of effect size, we should test
64 people per group (see Exercise 1). That’s a lot of testing, and most
psychology studies don’t traditionally hit that target – 20 per group is
much more typical. At that sample size, the power is very low for
medium-sized effects (see Exercise 2).

How can we make things better? One really good option is to use a
*within-subjects* design. The production-effect experiment in the
example above was a *between-subjects* design - some people read
words silently while others read them aloud. To turn this into a
*within-subjects* design, we could give **each
person** some words to read silently, and other words to read
aloud. As long as we designed the experiment well, taking
into account things like order
effects, we could still test for a production effect.

The reason for switching to a within-subjects design is that it’s much more efficient. In other words, for a given effect size, you can reach 80% power with many fewer participants, as we’ll see below.

Before looking at the power of within-subjects designs, we have to
decide how to calculate effect size for this kind of experiment. There
are at least five
different ways of doing this, but we’ll use the one required by the
`pwr.t.test` command - Cohen’s \(d_{z}\). This is the mean difference
between conditions, divided by the standard deviation of the
*differences*. The `cohen.d` command we used earlier
does **not** calculate this sort of within-subjects effect
size. We won’t need to calculate \(d_{z}\) in this worksheet but, if you need
to in later work, see the “more on
power” worksheet.
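For a sense of what this definition involves, the following sketch computes \(d_{z}\) by hand in base R. The scores are made up for illustration, and the variable names are our own.

```r
# Hypothetical scores: the same five people in both conditions (made-up numbers)
silent <- c(5, 7, 6, 8, 9)
aloud  <- c(7, 6, 9, 8, 10)

# One difference score per person
differences <- aloud - silent

# d_z: the mean difference divided by the standard deviation of the differences
d_z <- mean(differences) / sd(differences)
d_z  # about 0.63
```

Notice that the calculation uses the variability of the difference scores, not the variability of the scores within each condition.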

The `pwr.t.test` command lets us work out how many people
we need for a within-subjects design, too. Just set
`type = "paired"` and use as before. Add the following comment
to your script:

`# EXERCISE 3`

and then add the appropriate command. Here’s the answer you should get:

```
Paired t test power calculation
n = 33.36713
d = 0.5
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number of *pairs*
```

In other words, you only need to test 34 people *in total* in a
within-subjects design, relative to the 128 people you’d need to test in
a *between-subjects* design.

This comparison assumes the effect size would be the same in a
within-subjects design as in a between-subjects design. This is not
necessarily the case, a point covered in more detail in the “more on power” worksheet.
Nevertheless, the comparison illustrates that within-subjects designs
tend to get the same power with fewer participants. This is sometimes
described as within-subjects designs being more *efficient* than
between-subjects designs.

**Copy the R code (not the comment) you used for this exercise
into PsycEL.**

Add the following to your script:

`# EXERCISE 4`

Often, there is a maximum number of people you can test given the
time and resources available. For example, if you’re taking PSYC520 at
Plymouth University, you will only be able to test 25 people in your
first data collection period due to time constraints. When you have this
kind of constraint, it’s important to know what effect size you will be
able to detect with 80% power. To do this, take your last command and
replace `d = .5` with `n = 25`. If you do this
correctly, you’ll see this output:

```
Paired t test power calculation
n = 25
d = 0.5840272
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number of *pairs*
```

So, if you can only test 25 people in a within-subjects study, then you will only be able to reliably ‘see’ effect sizes of 0.58 or greater. If you are taking PSYC520 at Plymouth University, the good news is that the course designers chose topic areas where the published effect sizes are at least that large.

**Copy the R code (not the comment) you used for this exercise
into PsycEL.**

Whether you use a within- or a between-subjects design, another really good way to improve the statistical power of your experiment is to increase the experiment’s effect size. Effect size is the mean difference divided by the variability, so you can increase effect size by either increasing the mean difference, or decreasing the variability, or both. There are some suggestions of how to do this in the “more on power” worksheet.
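The arithmetic behind this can be sketched in a few lines of R. The numbers are made up for illustration: a mean difference of 5 points and a standard deviation of 10.

```r
# Made-up numbers: a 5-point mean difference with a standard deviation of 10
mean_difference <- 5
variability <- 10

# Effect size is the mean difference divided by the variability
d_original <- mean_difference / variability

# Option 1: make the mean difference half as big again
d_bigger_difference <- (mean_difference * 1.5) / variability

# Option 2: halve the variability
d_less_variability <- mean_difference / (variability / 2)

effect_sizes <- c(original = d_original,
                  bigger_difference = d_bigger_difference,
                  less_variability = d_less_variability)
effect_sizes  # 0.5, 0.75 and 1
```

In this example, halving the variability doubles the effect size, without changing the mean difference at all.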

Psychologists are often surprised how much difference small increases in effect size can make. The relationship between effect size and sample size for 80% power is not a straight line. The required sample size drops off very rapidly as effect size increases, as shown in the graph below:

[Graph: total sample size \(N\) needed for 80% power as a function of effect size, for between-subjects and within-subjects designs.]

\(N\) in this graph is the
*total* number of people you’d need to test for 80% power. You
should be able to see that small increases above \(d = .5\) lead to substantial reductions in
necessary sample size, falling to 10 participants in total for \(d_{z} = 1\) in a within-subjects design. In
the other direction, small reductions in \(d\) lead to very large increases in
required sample size, with \(d = .3\)
requiring over 350 people for a between-subjects design.

If you’re feeling confident, and have some time, try to recreate the
above graph using R, and **copy the R code you used for this
exercise into PsycEL.**

This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.