This is an advanced worksheet, which assumes you have completed the Absolute Beginners’ Guide to R, the Research Methods in Practice (Quantitative section), and the Intermediate Guide to R.

You should be able to follow this worksheet if you understand the following materials:

One of the most important things in research is to make sure you have enough evidence/data to make reliable conclusions. Collecting enough data requires time and effort. In psychology, until recently, researchers didn’t collect enough data to make reliable conclusion with traditional frequentist statistics (methods that rely on p-value significance testing).

Throughout your undergraduate course, you relied on Bayesian techniques. One exemption was the worksheet that introduced statistical power. In that worksheet, you used traditional frequentist techniques to estimate how many participants you need to recruit for your experiment. The reason was that simply there is no clear-cut formula for estimating sample size for and with Bayesian techniques. Most techniques were beyond the requirements of an undergraduate course. The technique being introduced here is simple and requires no sophisticated statistical and mathematical skills.

In traditional null hypothesis significance testing, power analysis
estimates the likelihood of rejecting the null hypothesis given the
effect size of the alternative hypothesis. So if you have two groups,
you could use a single effect size, like that of Cohen’s
*d*, to find out how many participants you need to get a get
a p-value below 0.05 with high probability. This is useful, because it
explicitly states how likely to achieve the researchers goal in the
anticipated experiment (Kruschke,
2013). Ideally, you also want to do it before data collection and
not after. If you estimate power after the data are collected and
analysed, you will learn nothing that the Bayes Factor (BF) or a p-value
wouldn’t tell you.

Traditional techniques to estimate sample size are also limited. For example, they don’t allow you to plan for the null hypothesis being true. This is because p-values don’t let you conclude from null results. See the Evidence worksheet for more explanation.

A p-value and traditional techniques do not allow you to incorporate this possibility, even though in some cases it might be important to plan for that outcome as well.

For example, you might want to know how many participants you need to test so that you can know for certain that two treatments (a drug and a palcebo) are no different. Another example is when there are conflicting evidence from underpowered studies that a psychological effect is real. You might want to know how many participants you need to have to show that an effect is present in the data or not.

You can also have multiple estimates of sample size: one for the null hypothesis being true, and one for the alternative being true. In that case, it is not unreasonable to check the data at the lower estimate, to see if you need to recruit more participants.

Rather than a single value, effect sizes are best thought of as a range of possible values with different probabilities - effect sizes are uncertain (Kruschke, 2013), but traditional power calculations use a single point-value. Effect sizes can vary from experiment to experiment (sometimes even substiantially). It is uncertain if you will get the same effect size with a new sample. This experiment-to-experiment variance is not incorporated in traditional methods. Effect sizes are also often overestimated which causes a lot of problems for scientists and researchers.

Many of the limitations of traditional power calculations highlighted so far can be addressed by doing power calculations using BFs.

In Bayesian Power Analysis, we are looking at power from a different angle. Here, we want to estimate precision - the probability of the BF being conclusive or inconclusive. We determine this by checking whether the BF falls within a certain interval. Intervals are groups of numbers lying between two numbers. This probability tells us how likely that we will have conclusive results given different sample sizes.

In psychology, people like to think categorically, therefore define categories of BFs depending on different intervals. Traditionally, a BF of three and above is considered as evidence for a difference, and ⅓ and below is considered as evidence for the absence of a difference.

This might be a bit confusing, because now we will be looking at intervals and not thresholds, as we are interested in an entire distribution of values. For estimating the sample size, we can define two big groups of BFs: conclusive vs inconclusive. Inconclusive BF can be any BF that falls between the lower bound and the upper bound. All other BFs are conclusive. We calculate power for BF as

\[1 - Pr(lower bound < BF < upper bound)\]

Simply put, we take the percentage of BFs that were inconclusive and
subtract it from 100%. We have the probability of BF falling between ⅓
and 3 written as \(Pr(lower bound < BF <
upper bound)\). So if we have 40% of Bayes Factors that were
inconclusive, then 100% - 40% will give us 60% of Bayes Factors that
were conclusive. This means that we have a 60% chance of getting a
conclusive result. Usually, we don’t use percentages, but rather use
values from 0 to 1. So 60% will be the same as 0.6 and 100% will be the
same as 1. We will also refer to these values as probabilities and not
percentages. So 60% chance of getting a conclusive BF will become a
*probability of 0.6* to get a conclusive BF.

**When we think about BFs and Power, we want to know how many
data points we need to get a conclusive BF.**

An algorithm is a list of rules to follow in order to solve a problem.

The algorithm for estimating sample size for more sophisticated
Bayesian analysis was first described by Kruschke (2013) while he was
promoting an alternative to frequentist statistics. Schönbrodt
and Wagenmakers (2018) adapted his algorithm to **fixed-n
designs** with a **BF**. In a fixed-n design, you
test a set number of participants and calculate something like a BF at
the end. Schönbrodt and Wagenmakers (2018) decided to group this
technique under an umbrella term: **Bayes Factor Design
Analysis**.

Here I will outline the algorithm we’ll be using and then walk you through it. The steps in the algorithm:

**Generate a large number (e.g. 2000) of random data points**by taking some assumptions about what to expect.**Subset the data to include a fixed number of data points**(e.g. data from 40 participants). You ideally want to do it for different sample sizes for the same data set.**Run a test on the subset**and tally whether the test is precise enough. You simply run a Bayesian t-test on the subsets and check if the BF is between 1/3 and 3.**Repeat steps 2 and 3 many thousands of times**, for example 10000.**Pick**a well-powered sample size.

In order to estimate a sample size, we need to think about both our experiment, our planned analysis and what other researchers have already learned about the topic of interest.

There is a good chance that someone has done an experiment similar to what you are planning. So you can use their data to find the effect size and estimate something about your future experiment - usually the sample size. If you need it for your own project, look at this material.

Let us say that we are interested in multistable perception and want to do a between-subject replication of George (1935).

George (1935) were interested in how different substances (coffee, antidepressant) affects the perception of a subset of visual illusions. Here we will only focus on the condition where participants viewed an ambiguous figure called the Necker Cube. A necker cube is the frame of the cube with no visual cues on its orientation, which makes it ambiguous. It is ambiguous, because while viewing the figure you can interpret it to have either the lower-left or the upper-right square at the front. If you keep looking at the cube, you can see that the orientiation switches between these two percepts - having the lower-left or upper-right square as its front side.

George (1935) reported that people passively switch between the two percepts at a higher rate after they had coffee compared to when they hadn’t had any other drug. They did not include the data in their Appendix, but they did report some details about their analysis, including some summary statistics.

First, let us start by setting everything up. If you have trouble remembering how to create projects and R scripts, look at this brief intro to Rstudio.

- Create a new project called power-bayesian
- Add the code in this worksheet to a script called power.R
- Right click on the data I made and save the CSV file.
- Create a folder called
`data`

in your R project. - Upload the CSV file to the
`data`

folder.

```
# Bayesian Power Analysis
# Always start with loading all the packages you need for a given analysis
library(BayesFactor, quietly = TRUE)
library(tidyverse)
# set seed for reproducibility
set.seed(1)
```

In most scenarios, you will start from the next step, unless your supervisor has some data you can use. Nowadays people share their data on places like OSF or GitHub. Here, I just made up the data according to some really vague description by the original authors. Let us start by importing said data.

```
# Import dataset
dta <- read_csv("data/george1935pilot.csv")
```

```
Rows: 20 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): condition
dbl (2): ppt, fluctuations
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

```
# Make sure that every column is in the right format
dta$condition <- factor(dta$condition)
dta$ppt <- factor(dta$ppt)
```

The first thing we note is that it is not a within-subject design as in the original paper. Generating within-subject data is more ardous, as it involves some correlation between the two conditions. Here, we just assume that we collected data from a between-subject pilot experiment.

Now, let us have a look at how the data look like. Here is what each column in the dataset contains:

`ppt`

- These are the participant identifiers`condition`

- This column contains the name of the condition.`caffeine`

is our experimental condition: participants drank coffee before starting the task.`control`

is our control condition: participants didn’t take any stimulant before starting the task.`fluctuations`

- This is our dependent variable. It contains the proportion of fluctuations participants reported during the experiment. For this exercise, imagine that we showed the Necker Cube to participants for 100 seconds and they can switch between percepts once in every second. This means that if participants keep switching between percepts as fast as possible, they can report a maximum of 100 switches. The proportion of fluctiation is then whatever they reported divided by a 100.

```
# Inspect our data
dta %>%
head(20)
```

```
# A tibble: 20 × 3
ppt condition fluctuations
<fct> <fct> <dbl>
1 1 caffeine 0.568
2 2 caffeine 0.694
3 3 caffeine 0.677
4 4 caffeine 0.544
5 5 caffeine 0.753
6 6 caffeine 0.829
7 7 caffeine 0.627
8 8 caffeine 0.462
9 9 caffeine 0.693
10 10 caffeine 0.811
11 11 control 0.646
12 12 control 0.429
13 13 control 0.722
14 14 control 0.825
15 15 control 0.579
16 16 control 0.451
17 17 control 0.598
18 18 control 0.432
19 19 control 0.724
20 20 control 0.255
```

Look at the means and standard deviations first.

```
# Look at some descriptive statistics
dta %>%
group_by(condition) %>%
summarise(mean = mean(fluctuations), sd = sd(fluctuations))
```

```
# A tibble: 2 × 3
condition mean sd
<fct> <dbl> <dbl>
1 caffeine 0.666 0.118
2 control 0.566 0.173
```

The means are indeed promising, but we have long learned not to base our conclusion solely on the mean, so we go on and do the analysis we plan to do on our own data. We will use a

`ttestBF(formula = fluctuations ~ condition, data = dta)`

`Warning: data coerced from tibble to data frame`

```
Bayes factor analysis
--------------
[1] Alt., r=0.707 : 0.8560438 ±0%
Against denominator:
Null, mu1-mu2 = 0
---
Bayes factor type: BFindepSample, JZS
```

Our results are not conclusive, as BF = 0.8560438. This is an inconclusive result. Some would say that it is anecdotal evidence for the null, but that simply means that we don’t have enough data to conclude. When the results are inconclusive, the BF can end up going to either direction.

The next step is then to figure out how many participants we need to recruit so that our analysis can conclude.

So we need to come up with some generated data. If you don’t have access to any real data to begin with, you should start with this step. Data should be generated according to how the actual data will be collected in the planned experiment. For example, there will be a fixed number of participants, fixed number of trials, fixed number of conditions. You also want to take some other assumptions, like what is the size of the difference you might expect.

Based on these, we can quantitatively express what we expect to happen in our experiment. We do this by generating some data that will look like the one that we will collect. First, we need to decide our dependent variable. For some tests, you might want to put some extra effort into this. For example, if you need to have count or categorical data, you will need to use different commands and approaches.

For now, we will make it easy and straightforward. We will use
`rnorm`

to create our data. The command looks like this:

```
# Do not copy and paste this
rnorm(n = group_size, mean = mean_from_data, sd = standard_deviation)
```

The next step is to simply give `rnorm`

the means and
standard deviations of the two conditions we already looked at. Remember
in a few steps before, we looked at the mean and standard deviation of
each group in our data. Those will be the values we use in
`rnorm`

.

```
# Generate random data for the control condition
control <- rnorm(n = 1000, mean = 0.566, sd = 0.173)
# Generate random data for the caffeine condition
caffeine <- rnorm(n = 1000, mean = 0.666, sd = 0.118)
# Here we create a vector for the group column
group <- rep(c("control", "caffeine"), each = 1000)
# Create a participant column
ppt <- 1:2000
# Here we combine everything together
ideal_data <- tibble(ppt = ppt,
condition = group,
fluctuations = c(control, caffeine))
```

Now we check whether it looks okay. First, we should have three
columns. `ppt`

should include the participant ID,
`condition`

should include whether it is the control or
caffeine condition, and `fluctuations`

should be the number
of fluctuations. We should briefly inspect the data before moving
on.

```
# View first ten lines of data
ideal_data %>%
head(10)
```

```
# A tibble: 10 × 3
ppt condition fluctuations
<int> <chr> <dbl>
1 1 control 0.458
2 2 control 0.598
3 3 control 0.421
4 4 control 0.842
5 5 control 0.623
6 6 control 0.424
7 7 control 0.650
8 8 control 0.694
9 9 control 0.666
10 10 control 0.513
```

This looks exactly what we want.

```
# Display density plot of data
ideal_data %>% ggplot(aes(x = fluctuations, fill = condition)) +
geom_density(alpha = 0.5) +
theme_void()
```

The distribution of the data also looks exactly what we would expect.

Start with subsetting the data, so we can check power for one sample size at a time. Let’s say that we want to check power for 40 participants.

So we start by selecting 40 participants from each condition. We will do so by randomly generating participant identifiers. This is easy because participants are identified in our ideal data by a single number between 1 and a 2000, inclusive. The first thousand belong to the control condition, while the second thousand belong to the caffeine condition.

We can use `sample`

to generate random participant
numbers. `sample`

essentially picks random elements from a
predefined set. So we want `sample`

to pick 40 random whole
numbers between 1 and a 1000 for the control condition, and another 40
random whole numbers between 1001 and 2000 for the c caffeine
condition.

```
# Random participant numbers, for the control condition
control <- sample(1:1000, size = 40)
# Random participant numbers, for the caffeine condition
caffeine <- sample(1001:2000, size = 40)
```

Here, we created two objects that contain random numbers exactly like our participant numbers. We now filter the data by these numbers.

```
# Select random sample of 'ideal_data'
random_data <- ideal_data %>%
filter(ppt %in% c(control, caffeine))
```

`c(control, caffeine)`

essentially puts the two objects
together.

Now, we can carry out the Bayesian t-test.

```
# Calculate Bayes Factor
test <- ttestBF(formula = fluctuations ~ condition, data = random_data)
```

`Warning: data coerced from tibble to data frame`

The final step is to extract the Bayes Factor.

```
# Extract Bayes Factor
bf <- test %>%
data.frame() %>% # change the format to a data frame for usability
select(bf) %>% # select the bf column
as.numeric() # make sure that R converts BF to a number and not a character
```

Let us now look at the Bayes Factor.

```
# Show Bayes Factor
bf
```

`[1] 47.60528`

This is our whole procedure for extracting a Bayes Factor for a single sample, but we have only sampled from our ideal data once and only for the sample size of 40. Randomly picking participants will result in different BFs every time we do this.

In principle, we could go through our procedure line by line for each
sample size many thousands of times. A more compact way to do this is to
collage it together into a *function*. Then we can use the
*function* to go through the whole procedure for us for all the
thousand iteration of each sample size.

Every function (commands) that you use has been written by a person.
Functions as you have noticed, take something as an input, do something
to that input, and return the result of what they have done to the
input. There is some good material already on how to
write functions. To put it briefly, a **function is an
instruction sequence that takes something as an input and translates it
into an output.**

Here we will write our own function that extracts the BF for our
sample size. The function will need to know `n`

which is our
sample size, and `data`

which is the generated data. Our
function will:

- Take
`n`

number of randomly selected participant from each group in`data`

. We do this by randomly selecting participant identifiers for each group. - Then run the Bayesian t-test on the subset.
- Store the output and put it into a
`data.frame`

format, so we can extract the BF. - At the end, return the BF.

This is exactly what we just did, but described in a simpler way. You can simply copy-paste the code below into your R script.

```
## We take the sample size n and a data set as inputs
get_bf <- function(n, data) {
## 1. We randomly pick n number of participants for each group
control <- sample(1:1000, n) # get participant identifiers for control
caffeine <- sample(1001:2000, n) # get participant identifiers for caffeine
# Subset data
random_data <- data %>% filter(ppt %in% c(control, caffeine))
## 2. Do the Bayesian t-test
test <- ttestBF(formula = fluctuations ~ condition, data = random_data)
## 3. Get the Bayes Factor
## We have to change the format to data.frame to extract the Bayes Factor
bf <- test %>% data.frame() %>% select(bf) %>% as.numeric()
## 4. return it to the user
return(bf)
}
```

This is our function. Let’s try it out for a sample size of 40 with
our `ideal_data`

. So we will have `n = 40`

and
`data = ideal_data`

.

```
# Test function
get_bf(n = 40, data = ideal_data)
```

`Warning: data coerced from tibble to data frame`

`[1] 2.773374`

We get a warning message, but it is nothing to be concerned about. It
simply tells us that the `ttestBF`

changed the format of our
data, so it can do the test we want it to do. Apart from the warning
message, it works flawlessly.

Simulations are pretend games. When we say simulate, we mean to imitate some real-world process over time.

Now we are at the stage when we can actually run a simulation. These simulations will be based on a technique called random sampling or Monte Carlo methods. There are no special equations or complicated math involved here. Monte Carlo is simply a tool that is used to estimate possible outcomes of uncertain events. In our case, the possible outcome is the Bayes factor and the uncertain event is the data we collect. So we imitate the whole scientific process by generating some random data, checking the Bayes Factor for the data, and repeat it many times to get a robust and more accurate estimate.

We have to repeat it many times, because the process of sampling
*n* number of data points is semi-random (not unlike how we
should recruit participants). The more you repeat it and the more data
points you have, the more your distribution will resemble to a normal
distribution due to the central
limit theorem. Monte Carlo methods also address another point we
made before: uncertainty in effect sizes. Because we sample randomly
from a much larger distribution of values, the difference will vary
slightly every time we draw *n* number of data points. The mean
effect size remains the same, but it allows us to make more precise and
reliable estimate on how many participants we need to recruit.

There are two things to decide before we start simulating data collection:

- The sample sizes we want to check. We create all sample sizes as a
`tibble`

. If you want to know more about what tibble is, see More on Tibbles. This means that we will have a one-column data frame with all the sample sizes we want to check. - The number of draws we wish to make for each sample size. This will
simply be the number of times a certain sample size occurs in our
`tibble`

data frame.

First, we will create a `tibble`

with sample sizes from 10
to 200. We will also repeat each sample size 1000 times. After creating
the sample sizes, we need to apply our `get_bf`

function to
every one of the sample sizes in our tibble a 1000 times and somehow
make R remember every BF we get. One option is to do it by hand. We go
and take one sample size at a time in our `tibble`

and record
the Bayes Factor. We would need to repeat it 20000 times to complete our
power calculation.

The better option is to make R do the legwork for us by using
`map_dbl`

. `map_dbl`

is a member of a family of
`map`

commands. They take a *function* and apply it to
every element in the object you select. We can use `map_dbl`

over each sample size in our `tibble`

and run the
`get_bf`

function we just wrote. `get_bf`

needs to
have the data as well, which we will add within the `map_dbl`

command below as `data = ideal_data`

at the end.
`map_dbl`

is smart enough to know that this is something that
our `get_bf`

function needs.

Then we `mutate`

the data frame, which adds another column
that we call `bf`

.

We also need to make sure to store the results in an object.

We will also get a *warning message* from R that it converted
our tibble to a data.frame. This is not an error message. R still
successfully executed the code below.

**The command below will take approximately 5 minutes to
complete.**

```
# Run Monte Carlo simulation (approx. 5 min run time)
monte <- tibble(n = rep(seq(from = 10, to = 200, by = 10), each = 1000)) %>%
mutate(bf = map_dbl(n, get_bf, data = ideal_data))
```

Now we have everything that we need. So let us find out the lowest sample size that gives us a decent chance at getting a conclusive BF.

We will group the data by the sample sizes.

Check whether the BF fell between 1/3 and 3. We do this with the command

`between`

, which will return TRUE if the value of interest falls between the lower and upper bounds of the interval. So it will look like`between(value, lower, upper)`

. For example,`between(30, 18, 60)`

, would return TRUE, because 30 is somewhere between 18 and 60.We then

`sum`

it up, which will count the number of times it returns TRUE. Divide that number by a 100 to get the proportions.Then we move on to subtract it from 1, which will give us the proportion of times we got a BF that was conclusive - outside of the interval 1/3 and 1.

```
# Calculate results of Monte Carlo simulation
power <- monte %>%
group_by(n) %>%
summarise(power = 1 - (sum(between(bf, 1/3, 3)) / 1000))
# Show all the rows where power is larger than 80%
power %>%
filter(power > 0.80)
```

```
# A tibble: 16 × 2
n power
<dbl> <dbl>
1 50 0.811
2 60 0.867
3 70 0.929
4 80 0.956
5 90 0.967
6 100 0.979
7 110 0.991
8 120 0.995
9 130 0.998
10 140 0.999
11 150 1
12 160 1
13 170 0.998
14 180 1
15 190 1
16 200 1
```

The only thing that remains is to find out the actual sample size we
need. In frequentist methods, people usually expect your study to have
at least 80% power. If we accept that convention for this as well, we
can settle on a sample size of 50 **for each group**. You
can see in the output above that this is the first sample size where 80%
of BFs are conclusive. **This means that you need to recruit 100
participants overall.**

Remember that we picked *n* number of people from each group.
So each sample size we checked has to be multipled by 2 as it is a
between-subject experiment - compare two independent groups.

Before moving on to the exercises, it is worth looking at how power improves for each sample size. Let us visualise this via a plot. This plot will be similar to what you already made in the Understanding interactions and Factorial Differences worksheets.

First, we are going to create a new column that will tell us whether
the sample size has at least 80% power or not. We will also save the
object by overwriting `power`

.

```
# Add 'criterion' column
power <- power %>%
mutate(criterion = power > 0.80)
```

Now in order to make a graph, we will use two new functions that will show us which is the smallest sample size that meets our criteria.

`geom_vline`

will simply insert a vertical line onto the
graph. A simple vertical line will tell us exactly where to look on the
plot.

`scale_x_continuous`

will let us specify how many ticks we
want to draw on the x-axis. We can use the `breaks`

option
within this function We want to have more ticks than one, so that
whoever is viewing the plot can have a sense of how the smallest
adequate sample size fairs with other options.

We also need to make sure that axis labels are correct and
informative. On top of that, I also choose to better look for our plot
by using `"theme_classic`

.

```
# Display line plot with sample size on the x-axis and power on the y-axis
ggplot(power, aes(x = n, y = power)) +
# Make a line graph
geom_line(linetype = 2) +
# Add points with different colours for our criterion
# Increase the size of those points for better visibility
geom_point(aes(colour = criterion), size = 2) +
# Put a vertical line at 50 on the x axis
geom_vline(aes(xintercept = 50)) +
# Put a tick on the x-axis at 50, 100, and 200
scale_x_continuous(breaks = sort(c(50, 100, 200))) +
# Rename axis labels
xlab("sample size") +
ylab("power") +
# nice theme
theme_classic()
```

As you can see, the power goes up not linearly, but more as a curve. It resembles a logarithmic function, which has this steep curve that seems to plateau after a certain point. Power after a certain point remains nearly stationary. This means that collecting too much data is not harmful, but can be unnecessary.

In this exercise you will need to estimate the sample size for a different data set.

The data will be on the production effect in memory. More explanation on the data can be found in the Revision worksheet. You can right click on production data and save it as a CSV.

Additionally, you will need to do so for a harsher interval, where the BF falls outside of 1/10 and 10. You will need to find the sample size, that will give you at least 80% of power.

**Hint**: If the effect size is small, it often requires
large samples to reliably detect the effect. Sample size can go
sometimes as high as a 500 participants per group. If you increase the
number of sample sizes you want to check, it will also take longer to
run the command.

**This is a Psyc:EL
task.**

Using the estimate from Exercise 1, create a plot showing power for
each sample size and **upload it to Psych:EL as a PDF**.
You will need to edit the code for the plot we have created at the end
of the worksheet, so you already have a template to go by. Remember to
(a) add the line for the adequate sample size and (b) edit the axis
ticks, so it includes the lowest estimated sample size with at least 80%
power.

These readings will give you a general overview of this exact topic, but can also serve as a good induction to a bit more thought-out power analysis.

- Schönbrodt, F. D. & Wagenmakers, E.-J. (2018). Bayes Factor Design Analysis: Planning for compelling evidence. Psychonomic Bulletin & Review, 25, 128-142.
- Vandekerckhove, J., Rouder, J. N., & Kruschke, J. K. (2018). Bayesian methods for advancing psychological science.

Here is a list of papers already using Bayesian Power Analysis:

They can give you a few examples on how to write up what we have done for your potential report. ___

This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.