# 1 Before you start

This is an advanced worksheet, which assumes you have completed the Absolute Beginners’ Guide to R, the Research Methods in Practice (Quantitative section), and the Intermediate Guide to R.

You should be able to follow this worksheet if you understand the following materials:

# 2 What is the problem we are trying to solve?

One of the most important things in research is to make sure you have enough evidence/data to make reliable conclusions. Collecting enough data requires time and effort. In psychology, until recently, researchers didn’t collect enough data to make reliable conclusion with traditional frequentist statistics (methods that rely on p-value significance testing).

Throughout your undergraduate course, you relied on Bayesian techniques. One exemption was the worksheet that introduced statistical power. In that worksheet, you used traditional frequentist techniques to estimate how many participants you need to recruit for your experiment. The reason was that simply there is no clear-cut formula for estimating sample size for and with Bayesian techniques. Most techniques were beyond the requirements of an undergraduate course. The technique being introduced here is simple and requires no sophisticated statistical and mathematical skills.

# 3 Brushing up on traditional power

In traditional null hypothesis significance testing, power analysis estimates the likelihood of rejecting the null hypothesis given the effect size of the alternative hypothesis. So if you have two groups, you could use a single effect size, like that of Cohen’s d, to find out how many participants you need to get a get a p-value below 0.05 with high probability. This is useful, because it explicitly states how likely to achieve the researchers goal in the anticipated experiment (Kruschke, 2013). Ideally, you also want to do it before data collection and not after. If you estimate power after the data are collected and analysed, you will learn nothing that the Bayes Factor (BF) or a p-value wouldn’t tell you.

Traditional techniques to estimate sample size are also limited. For example, they don’t allow you to plan for the null hypothesis being true. This is because p-values don’t let you conclude from null results. See the Evidence worksheet for more explanation.

A p-value and traditional techniques do not allow you to incorporate this possibility, even though in some cases it might be important to plan for that outcome as well.

For example, you might want to know how many participants you need to test so that you can know for certain that two treatments (a drug and a palcebo) are no different. Another example is when there are conflicting evidence from underpowered studies that a psychological effect is real. You might want to know how many participants you need to have to show that an effect is present in the data or not.

You can also have multiple estimates of sample size: one for the null hypothesis being true, and one for the alternative being true. In that case, it is not unreasonable to check the data at the lower estimate, to see if you need to recruit more participants.

Rather than a single value, effect sizes are best thought of as a range of possible values with different probabilities - effect sizes are uncertain (Kruschke, 2013), but traditional power calculations use a single point-value. Effect sizes can vary from experiment to experiment (sometimes even substiantially). It is uncertain if you will get the same effect size with a new sample. This experiment-to-experiment variance is not incorporated in traditional methods. Effect sizes are also often overestimated which causes a lot of problems for scientists and researchers.

Many of the limitations of traditional power calculations highlighted so far can be addressed by doing power calculations using BFs.

# 4 Power analysis with Bayes Factors

In Bayesian Power Analysis, we are looking at power from a different angle. Here, we want to estimate precision - the probability of the BF being conclusive or inconclusive. We determine this by checking whether the BF falls within a certain interval. Intervals are groups of numbers lying between two numbers. This probability tells us how likely that we will have conclusive results given different sample sizes.

In psychology, people like to think categorically, therefore define categories of BFs depending on different intervals. Traditionally, a BF of three and above is considered as evidence for a difference, and ⅓ and below is considered as evidence for the absence of a difference.

This might be a bit confusing, because now we will be looking at intervals and not thresholds, as we are interested in an entire distribution of values. For estimating the sample size, we can define two big groups of BFs: conclusive vs inconclusive. Inconclusive BF can be any BF that falls between the lower bound and the upper bound. All other BFs are conclusive. We calculate power for BF as

$1 - Pr(lower bound < BF < upper bound)$

Simply put, we take the percentage of BFs that were inconclusive and subtract it from 100%. We have the probability of BF falling between ⅓ and 3 written as $$Pr(lower bound < BF < upper bound)$$. So if we have 40% of Bayes Factors that were inconclusive, then 100% - 40% will give us 60% of Bayes Factors that were conclusive. This means that we have a 60% chance of getting a conclusive result. Usually, we don’t use percentages, but rather use values from 0 to 1. So 60% will be the same as 0.6 and 100% will be the same as 1. We will also refer to these values as probabilities and not percentages. So 60% chance of getting a conclusive BF will become a probability of 0.6 to get a conclusive BF.

When we think about BFs and Power, we want to know how many data points we need to get a conclusive BF.

## 4.1 The Algorithm

An algorithm is a list of rules to follow in order to solve a problem.

The algorithm for estimating sample size for more sophisticated Bayesian analysis was first described by Kruschke (2013) while he was promoting an alternative to frequentist statistics. Schönbrodt and Wagenmakers (2018) adapted his algorithm to fixed-n designs with a BF. In a fixed-n design, you test a set number of participants and calculate something like a BF at the end. Schönbrodt and Wagenmakers (2018) decided to group this technique under an umbrella term: Bayes Factor Design Analysis.

Here I will outline the algorithm we’ll be using and then walk you through it. The steps in the algorithm:

1. Generate a large number (e.g. 2000) of random data points by taking some assumptions about what to expect.
2. Subset the data to include a fixed number of data points (e.g. data from 40 participants). You ideally want to do it for different sample sizes for the same data set.
3. Run a test on the subset and tally whether the test is precise enough. You simply run a Bayesian t-test on the subsets and check if the BF is between 1/3 and 3.
4. Repeat steps 2 and 3 many thousands of times, for example 10000.
5. Pick a well-powered sample size.

## 4.2 Walk-through Bayesian Power Analysis

In order to estimate a sample size, we need to think about both our experiment, our planned analysis and what other researchers have already learned about the topic of interest.

There is a good chance that someone has done an experiment similar to what you are planning. So you can use their data to find the effect size and estimate something about your future experiment - usually the sample size. If you need it for your own project, look at this material.

Let us say that we are interested in multistable perception and want to do a between-subject replication of George (1935).

George (1935) were interested in how different substances (coffee, antidepressant) affects the perception of a subset of visual illusions. Here we will only focus on the condition where participants viewed an ambiguous figure called the Necker Cube. A necker cube is the frame of the cube with no visual cues on its orientation, which makes it ambiguous. It is ambiguous, because while viewing the figure you can interpret it to have either the lower-left or the upper-right square at the front. If you keep looking at the cube, you can see that the orientiation switches between these two percepts - having the lower-left or upper-right square as its front side.

George (1935) reported that people passively switch between the two percepts at a higher rate after they had coffee compared to when they hadn’t had any other drug. They did not include the data in their Appendix, but they did report some details about their analysis, including some summary statistics.

### 4.2.1 Starting with the data

First, let us start by setting everything up. If you have trouble remembering how to create projects and R scripts, look at this brief intro to Rstudio.

• Create a new project called power-bayesian
• Add the code in this worksheet to a script called power.R
• Right click on the data I made and save the CSV file.
• Create a folder called data in your R project.
• Upload the CSV file to the data folder.
# Bayesian Power Analysis
library(BayesFactor, quietly = TRUE)
library(tidyverse)

# set seed for reproducibility
set.seed(1)

In most scenarios, you will start from the next step, unless your supervisor has some data you can use. Nowadays people share their data on places like OSF or GitHub. Here, I just made up the data according to some really vague description by the original authors. Let us start by importing said data.

# Import dataset
dta <- read_csv("data/george1935pilot.csv")
Rows: 20 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): condition
dbl (2): ppt, fluctuations

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
# Make sure that every column is in the right format
dta$condition <- factor(dta$condition)
dta$ppt <- factor(dta$ppt)

The first thing we note is that it is not a within-subject design as in the original paper. Generating within-subject data is more ardous, as it involves some correlation between the two conditions. Here, we just assume that we collected data from a between-subject pilot experiment.

Now, let us have a look at how the data look like. Here is what each column in the dataset contains:

• ppt - These are the participant identifiers
• condition - This column contains the name of the condition. caffeine is our experimental condition: participants drank coffee before starting the task. control is our control condition: participants didn’t take any stimulant before starting the task.
• fluctuations - This is our dependent variable. It contains the proportion of fluctuations participants reported during the experiment. For this exercise, imagine that we showed the Necker Cube to participants for 100 seconds and they can switch between percepts once in every second. This means that if participants keep switching between percepts as fast as possible, they can report a maximum of 100 switches. The proportion of fluctiation is then whatever they reported divided by a 100.
# Inspect our data
dta %>%
head(20)
# A tibble: 20 × 3
ppt   condition fluctuations
<fct> <fct>            <dbl>
1 1     caffeine         0.568
2 2     caffeine         0.694
3 3     caffeine         0.677
4 4     caffeine         0.544
5 5     caffeine         0.753
6 6     caffeine         0.829
7 7     caffeine         0.627
8 8     caffeine         0.462
9 9     caffeine         0.693
10 10    caffeine         0.811
11 11    control          0.646
12 12    control          0.429
13 13    control          0.722
14 14    control          0.825
15 15    control          0.579
16 16    control          0.451
17 17    control          0.598
18 18    control          0.432
19 19    control          0.724
20 20    control          0.255

Look at the means and standard deviations first.

# Look at some descriptive statistics
dta %>%
group_by(condition) %>%
summarise(mean = mean(fluctuations), sd = sd(fluctuations))
# A tibble: 2 × 3
condition  mean    sd
<fct>     <dbl> <dbl>
1 caffeine  0.666 0.118
2 control   0.566 0.173

The means are indeed promising, but we have long learned not to base our conclusion solely on the mean, so we go on and do the analysis we plan to do on our own data. We will use a

ttestBF(formula = fluctuations ~ condition, data = dta)
Warning: data coerced from tibble to data frame
Bayes factor analysis
--------------
[1] Alt., r=0.707 : 0.8560438 ±0%

Against denominator:
Null, mu1-mu2 = 0
---
Bayes factor type: BFindepSample, JZS

Our results are not conclusive, as BF = 0.8560438. This is an inconclusive result. Some would say that it is anecdotal evidence for the null, but that simply means that we don’t have enough data to conclude. When the results are inconclusive, the BF can end up going to either direction.

The next step is then to figure out how many participants we need to recruit so that our analysis can conclude.

### 4.2.2 Generating data

So we need to come up with some generated data. If you don’t have access to any real data to begin with, you should start with this step. Data should be generated according to how the actual data will be collected in the planned experiment. For example, there will be a fixed number of participants, fixed number of trials, fixed number of conditions. You also want to take some other assumptions, like what is the size of the difference you might expect.

Based on these, we can quantitatively express what we expect to happen in our experiment. We do this by generating some data that will look like the one that we will collect. First, we need to decide our dependent variable. For some tests, you might want to put some extra effort into this. For example, if you need to have count or categorical data, you will need to use different commands and approaches.

For now, we will make it easy and straightforward. We will use rnorm to create our data. The command looks like this:

# Do not copy and paste this
rnorm(n = group_size, mean = mean_from_data, sd = standard_deviation)

The next step is to simply give rnorm the means and standard deviations of the two conditions we already looked at. Remember in a few steps before, we looked at the mean and standard deviation of each group in our data. Those will be the values we use in rnorm.

# Generate random data for the control condition
control <- rnorm(n = 1000, mean = 0.566, sd = 0.173)
# Generate random data for the caffeine condition
caffeine <- rnorm(n = 1000, mean = 0.666, sd = 0.118)
# Here we create a vector for the group column
group <- rep(c("control", "caffeine"), each = 1000)
# Create a participant column
ppt <- 1:2000
# Here we combine everything together
ideal_data <- tibble(ppt = ppt,
condition = group,
fluctuations = c(control, caffeine))

Now we check whether it looks okay. First, we should have three columns. ppt should include the participant ID, condition should include whether it is the control or caffeine condition, and fluctuations should be the number of fluctuations. We should briefly inspect the data before moving on.

# View first ten lines of data
ideal_data %>%
head(10)
# A tibble: 10 × 3
ppt condition fluctuations
<int> <chr>            <dbl>
1     1 control          0.458
2     2 control          0.598
3     3 control          0.421
4     4 control          0.842
5     5 control          0.623
6     6 control          0.424
7     7 control          0.650
8     8 control          0.694
9     9 control          0.666
10    10 control          0.513

This looks exactly what we want.

# Display density plot of data
ideal_data %>% ggplot(aes(x = fluctuations, fill = condition)) +
geom_density(alpha = 0.5) +
theme_void()

The distribution of the data also looks exactly what we would expect.

### 4.2.3 Getting the Bayes Factor

Start with subsetting the data, so we can check power for one sample size at a time. Let’s say that we want to check power for 40 participants.

So we start by selecting 40 participants from each condition. We will do so by randomly generating participant identifiers. This is easy because participants are identified in our ideal data by a single number between 1 and a 2000, inclusive. The first thousand belong to the control condition, while the second thousand belong to the caffeine condition.

We can use sample to generate random participant numbers. sample essentially picks random elements from a predefined set. So we want sample to pick 40 random whole numbers between 1 and a 1000 for the control condition, and another 40 random whole numbers between 1001 and 2000 for the c caffeine condition.

# Random participant numbers, for the control condition
control <- sample(1:1000, size = 40)
# Random participant numbers, for the caffeine condition
caffeine <- sample(1001:2000, size = 40)

Here, we created two objects that contain random numbers exactly like our participant numbers. We now filter the data by these numbers.

# Select random sample of 'ideal_data'
random_data <- ideal_data %>%
filter(ppt %in% c(control, caffeine))

c(control, caffeine) essentially puts the two objects together.

Now, we can carry out the Bayesian t-test.

# Calculate Bayes Factor
test <- ttestBF(formula = fluctuations ~ condition, data = random_data)
Warning: data coerced from tibble to data frame

The final step is to extract the Bayes Factor.

# Extract Bayes Factor
bf <- test %>%
data.frame() %>% # change the format to a data frame for usability
select(bf) %>% # select the bf column
as.numeric() # make sure that R converts BF to a number and not a character

Let us now look at the Bayes Factor.

# Show Bayes Factor
bf
[1] 47.60528

This is our whole procedure for extracting a Bayes Factor for a single sample, but we have only sampled from our ideal data once and only for the sample size of 40. Randomly picking participants will result in different BFs every time we do this.

In principle, we could go through our procedure line by line for each sample size many thousands of times. A more compact way to do this is to collage it together into a function. Then we can use the function to go through the whole procedure for us for all the thousand iteration of each sample size.

#### 4.2.3.1 Compressing what we have just done

Every function (commands) that you use has been written by a person. Functions as you have noticed, take something as an input, do something to that input, and return the result of what they have done to the input. There is some good material already on how to write functions. To put it briefly, a function is an instruction sequence that takes something as an input and translates it into an output.

Here we will write our own function that extracts the BF for our sample size. The function will need to know n which is our sample size, and data which is the generated data. Our function will:

1. Take n number of randomly selected participant from each group in data. We do this by randomly selecting participant identifiers for each group.
2. Then run the Bayesian t-test on the subset.
3. Store the output and put it into a data.frame format, so we can extract the BF.
4. At the end, return the BF.

This is exactly what we just did, but described in a simpler way. You can simply copy-paste the code below into your R script.

## We take the sample size n and a data set as inputs
get_bf <- function(n, data) {
## 1. We randomly pick n number of participants for each group
control <- sample(1:1000, n) # get participant identifiers for control
caffeine <- sample(1001:2000, n) # get participant identifiers for caffeine
# Subset data
random_data <- data %>% filter(ppt %in% c(control, caffeine))
## 2. Do the Bayesian t-test
test <- ttestBF(formula = fluctuations ~ condition, data = random_data)
## 3. Get the Bayes Factor
## We have to change the format to data.frame to extract the Bayes Factor
bf <- test %>% data.frame() %>% select(bf) %>% as.numeric()
## 4. return it to the user
return(bf)
}

This is our function. Let’s try it out for a sample size of 40 with our ideal_data. So we will have n = 40 and data = ideal_data.

# Test function
get_bf(n = 40, data = ideal_data)
Warning: data coerced from tibble to data frame
[1] 2.773374

We get a warning message, but it is nothing to be concerned about. It simply tells us that the ttestBF changed the format of our data, so it can do the test we want it to do. Apart from the warning message, it works flawlessly.

### 4.2.4 Determining Sample Size

Simulations are pretend games. When we say simulate, we mean to imitate some real-world process over time.

Now we are at the stage when we can actually run a simulation. These simulations will be based on a technique called random sampling or Monte Carlo methods. There are no special equations or complicated math involved here. Monte Carlo is simply a tool that is used to estimate possible outcomes of uncertain events. In our case, the possible outcome is the Bayes factor and the uncertain event is the data we collect. So we imitate the whole scientific process by generating some random data, checking the Bayes Factor for the data, and repeat it many times to get a robust and more accurate estimate.

We have to repeat it many times, because the process of sampling n number of data points is semi-random (not unlike how we should recruit participants). The more you repeat it and the more data points you have, the more your distribution will resemble to a normal distribution due to the central limit theorem. Monte Carlo methods also address another point we made before: uncertainty in effect sizes. Because we sample randomly from a much larger distribution of values, the difference will vary slightly every time we draw n number of data points. The mean effect size remains the same, but it allows us to make more precise and reliable estimate on how many participants we need to recruit.

There are two things to decide before we start simulating data collection:

1. The sample sizes we want to check. We create all sample sizes as a tibble. If you want to know more about what tibble is, see More on Tibbles. This means that we will have a one-column data frame with all the sample sizes we want to check.
2. The number of draws we wish to make for each sample size. This will simply be the number of times a certain sample size occurs in our tibble data frame.

First, we will create a tibble with sample sizes from 10 to 200. We will also repeat each sample size 1000 times. After creating the sample sizes, we need to apply our get_bf function to every one of the sample sizes in our tibble a 1000 times and somehow make R remember every BF we get. One option is to do it by hand. We go and take one sample size at a time in our tibble and record the Bayes Factor. We would need to repeat it 20000 times to complete our power calculation.

The better option is to make R do the legwork for us by using map_dbl. map_dbl is a member of a family of map commands. They take a function and apply it to every element in the object you select. We can use map_dbl over each sample size in our tibble and run the get_bf function we just wrote. get_bf needs to have the data as well, which we will add within the map_dbl command below as data = ideal_data at the end. map_dbl is smart enough to know that this is something that our get_bf function needs.

Then we mutate the data frame, which adds another column that we call bf.

We also need to make sure to store the results in an object.

We will also get a warning message from R that it converted our tibble to a data.frame. This is not an error message. R still successfully executed the code below.

The command below will take approximately 5 minutes to complete.

# Run Monte Carlo simulation (approx. 5 min run time)
monte <- tibble(n = rep(seq(from = 10, to = 200, by = 10), each = 1000)) %>%
mutate(bf = map_dbl(n, get_bf, data = ideal_data))

Now we have everything that we need. So let us find out the lowest sample size that gives us a decent chance at getting a conclusive BF.

1. We will group the data by the sample sizes.

2. Check whether the BF fell between 1/3 and 3. We do this with the command between, which will return TRUE if the value of interest falls between the lower and upper bounds of the interval. So it will look like between(value, lower, upper). For example, between(30, 18, 60), would return TRUE, because 30 is somewhere between 18 and 60.

3. We then sum it up, which will count the number of times it returns TRUE. Divide that number by a 100 to get the proportions.

4. Then we move on to subtract it from 1, which will give us the proportion of times we got a BF that was conclusive - outside of the interval 1/3 and 1.

# Calculate results of Monte Carlo simulation
power <- monte %>%
group_by(n) %>%
summarise(power = 1 - (sum(between(bf, 1/3, 3)) / 1000))

# Show all the rows where power is larger than 80%
power %>%
filter(power > 0.80)
# A tibble: 16 × 2
n power
<dbl> <dbl>
1    50 0.811
2    60 0.867
3    70 0.929
4    80 0.956
5    90 0.967
6   100 0.979
7   110 0.991
8   120 0.995
9   130 0.998
10   140 0.999
11   150 1
12   160 1
13   170 0.998
14   180 1
15   190 1
16   200 1    

The only thing that remains is to find out the actual sample size we need. In frequentist methods, people usually expect your study to have at least 80% power. If we accept that convention for this as well, we can settle on a sample size of 50 for each group. You can see in the output above that this is the first sample size where 80% of BFs are conclusive. This means that you need to recruit 100 participants overall.

Remember that we picked n number of people from each group. So each sample size we checked has to be multipled by 2 as it is a between-subject experiment - compare two independent groups.

Before moving on to the exercises, it is worth looking at how power improves for each sample size. Let us visualise this via a plot. This plot will be similar to what you already made in the Understanding interactions and Factorial Differences worksheets.

First, we are going to create a new column that will tell us whether the sample size has at least 80% power or not. We will also save the object by overwriting power.

# Add 'criterion' column
power <- power %>%
mutate(criterion = power > 0.80)

Now in order to make a graph, we will use two new functions that will show us which is the smallest sample size that meets our criteria.

geom_vline will simply insert a vertical line onto the graph. A simple vertical line will tell us exactly where to look on the plot.

scale_x_continuous will let us specify how many ticks we want to draw on the x-axis. We can use the breaks option within this function We want to have more ticks than one, so that whoever is viewing the plot can have a sense of how the smallest adequate sample size fairs with other options.

We also need to make sure that axis labels are correct and informative. On top of that, I also choose to better look for our plot by using "theme_classic.

# Display line plot with  sample size on the x-axis and power on the y-axis
ggplot(power, aes(x = n, y = power)) +
# Make a line graph
geom_line(linetype = 2) +
# Add points with different colours for our criterion
# Increase the size of those points for better visibility
geom_point(aes(colour = criterion), size = 2) +
# Put a vertical line at 50 on the x axis
geom_vline(aes(xintercept = 50)) +
# Put a tick on the x-axis at 50, 100, and 200
scale_x_continuous(breaks = sort(c(50, 100, 200))) +
# Rename axis labels
xlab("sample size") +
ylab("power") +
# nice theme
theme_classic()

As you can see, the power goes up not linearly, but more as a curve. It resembles a logarithmic function, which has this steep curve that seems to plateau after a certain point. Power after a certain point remains nearly stationary. This means that collecting too much data is not harmful, but can be unnecessary.

# 5 Exercise 1: Estimate the sample size for a new data set.

In this exercise you will need to estimate the sample size for a different data set.

The data will be on the production effect in memory. More explanation on the data can be found in the Revision worksheet. You can right click on production data and save it as a CSV.

Additionally, you will need to do so for a harsher interval, where the BF falls outside of 1/10 and 10. You will need to find the sample size, that will give you at least 80% of power.

Hint: If the effect size is small, it often requires large samples to reliably detect the effect. Sample size can go sometimes as high as a 500 participants per group. If you increase the number of sample sizes you want to check, it will also take longer to run the command.