In the Absolute Beginners’ Guide to R and in the revision worksheet we used a Bayesian between-subjects t-test to look at the strength of evidence for the presence (or absence) of a difference between groups. For example, in the revision worksheet, we looked at whether the way you experienced a word (silently read, or spoken out loud) affected how good your memory of it was. We found a Bayes Factor close to 1, indicating there was neither good evidence for, nor good evidence against, there being a difference (an indecisive result). Bayes Factors greater than 3 are generally considered to be evidence for a difference worth listening to, while Bayes Factors less than one-third (0.33) are generally considered to be evidence for the absence of a difference that is worth listening to. Others have provided some additional guidelines on how to interpret Bayes Factors; the following comes from Andraszewicz et al. (2014):

A rule-of-thumb interpretation of Bayes Factors
Bayes Factors Interpretation
> 100 Extreme evidence for the alternative hypothesis
30 - 100 Very strong evidence for the alternative hypothesis
10 - 30 Strong evidence for the alternative hypothesis
3 - 10 Moderate evidence for the alternative hypothesis
1 - 3 Anecdotal evidence for the alternative hypothesis
0.33 - 1 Anecdotal evidence for the null hypothesis
0.10 – 0.33 Moderate evidence for the null hypothesis
0.03 – 0.10 Strong evidence for the null hypothesis
0.01 – 0.03 Very strong evidence for the null hypothesis
< 0.01 Extreme evidence for the null hypothesis
Note that this is BF10, which means that the larger the Bayes Factor is, the more evidence we have for the alternative hypothesis.

In this worksheet, we’ll cover how to calculate a Bayes Factor for a simple within-subjects experiment. As we covered in the statistical power worksheet, a within-subjects experiment is one where all the conditions are done by all the participants. Within-subjects designs are often a really good idea, because they are more efficient. In other words, they tend to allow us to give confident answers to questions while testing fewer people than would be needed in a between-subjects experiment.

The technique we’re going to use is called ANOVA, short for ANalysis Of VAriance. It’s probably the most used analysis technique in psychology. Its name (ANOVA) has a long history that isn’t really worth getting into here, particularly as it comes from our history in psychology of using p-values as evidence terms. As we’ve covered previously, p-values have no interpretation that is simultaneously accurate, simple, and useful. In this worksheet, we focus on a form of ANOVA that calculates Bayes Factors, which are more easily interpretable. If you ever need to calculate a traditional ANOVA, take a look at the traditional ANOVA worksheet.

Getting started

In the preprocessing worksheet, you loaded data from a git repository into an R project. Go back to that project now on RStudio; the project was called rminr-data This is very important. Make sure you are in the rminr-data project, changing your Rstudio project using the drop-down list on the top right of Rstudio if you need to (change to the rminr-data project you used for the preprocessing worksheet). If you do not do this, the rest of this worksheet will not work.

Now that you’re in the rminr-data project, create a new script called anova1.R, put all the comments and commands you use in this worksheet into that file, and save regularly.

Clearing the environment

You will notice that your Environment still contains all the data frames from the previous worksheet. You can leave them there if you like, but some people prefer to start with a “clean slate”. You can remove data frames with the rm command (as in remove). For example, to remove the dat data frame you’d type rm(dat).

The rm command can also be used with the ls() command. The ls() command gives you a list of the names of the things in your environment. The command is called ls because it lists them. We can use the ls() and rm() commands together to remove everything from our Environment, as follows.

Add the following comments and commands to your script and run them:

# Within-subject differences
# Clear environment
rm(list = ls())

Notice that your Environment tab in Rstudio is now empty. This command only clears the Environment, it leaves all the files in your project as they were (you can still see them in the Files panel).

Loading data

Now add these comments and commands to your script, and run them, to load this new data set:

# Load tidyverse
# Load data into 'words'
words <- read_csv("wordnaming2.csv")

Now click on the words data frame in the Environment tab, and take a look. You’ll see that its a data frame with 10 columns and 37,800 rows.

Description of data set

We’ll be using this data set over the next few worksheets, so it’s important to take some time here to understand the experiment it comes from.

The data frame contains simulated data for a word-naming experiment. In the experiment, participants just had to name (i.e. speak out loud) each written word as it appeared on the screen. However, on every trial, there was also a picture. Sometimes the picture matched the word (e.g. the word ‘car’ and a picture of a car). This is called a congruent trial. Other times, the picture and word were different (e.g. the word ’ boat’ and a picture of a carrot). This is called an incongruent trial. Participants were instructed to ignore the picture, but were they able to do so? If they can’t help processing the picture, then they should be slower to name the word on incongruent trials than on congruent trials. There are a third type of trials, neutral trials. These are trials on which just the word appears, with no picture.

In this experiment, the word naming task is being used to study the effects of meditation practice. Some psychologists have suggested that learning how to meditate can improve people’s ability to ignore irrelevant information. If so, we should be able to improve people’s performance on this word naming task by training them how to meditate first.

So, in this experiment, 140 participants were randomly allocated to a meditation-training condition, in which they received a week’s training on how to meditate, before doing the word naming experiment. Another 140 were assigned to a no-training condition. These people just did the word naming experiment, with no meditation training. Finally, another 140 people were assigned to a relaxation-training condition, in which they received a week’s training, but on some relaxation techniques, instead of meditation.

Within each group of participants, half were male and half were female. Each participant completed 90 trials of the word naming task. They did 30 trials, took a short break, then did another 30 trials, took another short break, and did the final 30 trials. This gives us 37,800 word naming trials in total. On each trial, the participant either named the word correctly, or they didn’t. Either way, they took a certain amount of time (measured in milliseconds) to name the word.

Here’s a summary of what each of the columns in the data set contains:

Column Description Values
subj Unique anonymous participant number 1-420
sex Biological sex of the participant “male”, “female”
medit Whether the participant has had meditation, relaxation, or no training “meditate”, “relax”, “control”
block Block of trials 1-3
trial Trial number (per participant) 1-90
word The word presented on this trial e.g. “cup”
pic The name of the picture presented e.g. “cake”
congru Experimental condition: congruent, incongruent or neutral trial? ‘cong’, ‘incong’, ‘neutral’
acc Accuracy of naming response 1 = correct, 0 = incorrect
rt Reaction time in milliseconds

DISCLAIMER: This data set is loosely based on the design of experiments that have really been done, and is realistic in terms of its size and complexity. However, it’s important to remember that this is generated data, not data actually collected from participants. So, you should not take the analyses you do of this simulated data as evidence for the effectiveness of meditation in improving attentional control. As is often the case, real data in this area is less clear, and more controversial.


As covered in the preprocessing worksheet, the first thing we need to do with realistic data sets like this is to preprocess them, so the data is in a format that we can analyse. The two steps we need to do for our first analysis are filtering and summarising.


In our first analyses, we’re going to take a look at the participants who had no pre-training in meditation or relaxation (i.e. the control group), so we get a sense of what performance on this task ‘normally’ looks like. So, enter the following comment into your script:

# Filter control condition into 'wordctrl'

and then add and run a command that uses filter to extract just the control condition, and put it into a new data frame called wordctrl. If you need a reminder of how the filter command works, take a look at the preprocessing worksheet.

EXPECTED OUTPUT: There will be new data frame in your Environment window that has 12,600 rows and 10 columns and is called control.

Also, for now, we’re not going to look at the neutral trials, just the congruent and incongruent ones. So, add this comment to your script:

# Filter out neutral trials, pass remainder to 'wordctrlCI'

and add and run a command to filter out the neutral trials from wordctrl and put the remaining data into wordctrlCI. You may find that the command != is the most efficient way to do this. For example, if you were trying to exclude the relaxation condition, you could use filter(medit != "relax"). The command != means ‘not equal to’.

EXPECTED OUTPUT: There will be a new data frame in your Environment window that has 8,400 rows and 10 columns and is called wordctrlCI.


The data frame is still very long, because it contains data for every trial. What we want to know, for each participant, is whether they are quicker to name the words on the congruent trials (where the picture and word match) than on the incongruent trials (where the picture and word are different). So, we need to summarise each person’s data down to just two numbers - their mean reaction time on congruent trials, and their mean reaction time on incongruent trials. So, add the following comment to your script:

# Group by 'subj' and 'congru'; calculate mean RT; pass to 'wordctrlCIsum'

and then write and run a command to group the data by participant (subj) and experimental condition (congru), using the group_by command, get the mean reaction times using the summarise command, and put this summarised data into a new data frame called wordctrlCIsum. If you need a reminder of how to group and summarise data, take a look at the preprocessing worksheet.

Column names with summarize: There is one new thing to learn here, though, relative to the previous worksheet. The column containing the RT data in this new data frame will need to have the name rt, because we use this name later on. You can set the column name within the summarise command. For example, summarise(acc = mean(acc)) takes the mean of the column acc and puts the answer into a column called acc in your new data frame. If you just use summarise(mean(acc)) then the column name you end up with is mean(acc) and this would be a pain to use in later commands, so we rename the column as acc by typing summarise(acc = mean(acc)) instead.

EXPECTED OUTPUT: There will be a new data frame in your Environment window called wordctrlCIsum. It will have 280 rows - two reaction times for each of the 140 participants - one for the congruent trials, and one for the incongruent trials. You’ll notice that the incongruent reaction times tend to be larger, as expected.

Density plots

Add the following comment to your script:

# Density plot of RT, by 'congru'

Now, as you did in the Absolute Beginners’ Guide to R, write and run a command to create a density plot of this summarised data (wordctrlCIsum), showing the distribution of reaction times in the two conditions. Click on the link above if you need to revise how to do this.

If you get it right, it should look like this:

We can see a few different things in this plot. First, there’s quite a lot of variation between people in how fast they react, with mean reaction time for incongruent trials anywhere from about 200 milliseconds to 800 milliseconds. Second, reaction times seem to vary less on congruent trials (averages around 400-600 milliseconds). Third, the peak of the incongruent distribution is slightly to the right of the congruent distribution, suggesting that, overall, incongruent trials might take a bit longer to react to than congruent trials.

But…looking at the data this way misses the most important thing about the design of this experiment, which is that it is within subjects. This is not two different groups of people, one who see only congruent trials, and the other who only see incongruent trials. Every participant sees both types. So, there’s a better way to look at this data, particularly when (as here) people differ a lot in their average reaction time.

What we can do is calculate a difference score for each participant. In other words, how much longer each participant takes to respond to incongruent trials, than congruent ones.

More preprocessing

In order to do this, we have to learn some new types of data preprocessing - pivoting and mutating.


Pivoting is where you take a long data file (lots of rows, few columns) and make it wider. Or where you take a wide data file (lots of columns, few rows) and make it longer. In this case, we need to make wordctrlCIsum wider, for reasons that will become clearer in the next step.

To make a data frame wider, we can use the pivot_wider command from the tidyverse package. To illustrate, enter the following comment and command into your R script and run it:

# Pivot wordctrlCIsum into a wider data frame; place in 'ctrl'
ctrl <- wordctrlCIsum %>% pivot_wider(names_from = congru, values_from = rt)

and look at ctrl in the Environment window, we can see that it now has fewer rows (140). It has the same number of columns, but the column names are different. Specifically, we now have one row for each participant and we have one column for their congruent reaction time, and one column for their incongruent reaction time.

Explanation of command: The pivot_wider command has two components: names_from and values_from. The first one, names_from, tells R which column you want to use to as the column names in your new, wider data frame. In this case, we want to take them from the congru column of wordctrlCIsum. The values_from part tells R which column contains the data you want to put in these new columns. In this case, it’s the rt column of wordctrlCIsum.


Mutation in this context means to take some things and change them to make something new. In this case, we want to mutate the congruent and incongruent reaction times into a difference score, i.e. incongruent RT minus congruent RT. We do this using the mutate command, as follows:

Enter the following comment and command into your R script and run it:

# Calculate RT difference; place in 'ctrldiff'
ctrldiff <- ctrl %>% mutate(diff = incong - cong)

Look at ctrldiff in the Environment window - you will see that it has a fourth column, diff and that each row contains the incongruent reaction minus the congruent reaction time. For most participants, the difference is positive, meaning that incongruent trials are slower. However, some participants are the other way around (e.g. participant 146).

Explanation of command: It’s the mutate command that’s doing all the work here. We give it a name for the new column we want to create, in this case diff. We then tell it how to calculate that new column. In this case, by subtracting the congruent reaction time from the incongruent reaction time, incong - cong.

Density plots of differences

Now we have the difference scores, we can take a look at their distribution. We do this using a density plot, as we have on several previous occasions.

Enter the following comment and commands into your R script and run them:

# Display density plot of RT differences
ctrldiff %>% ggplot(aes(diff)) +
    geom_density(aes(y = ..scaled..)) +
    geom_vline(xintercept = 0, colour = 'red')

The only slightly unusual thing about this command is the geom_vline part, which we use to draw a vertical line on the plot, that intercepts (hits) the x axis at zero (xintercept = 0) - see also the Risk Taking worksheet in the Putting R to Work guide.

In this case, the vertical line allows us to quickly see that most of the distribution is to the right of zero. In other words, for most people, incongruent trials take longer to respond to than congruent trials. On average, it takes around 100 ms longer to react to incongruent trials, although the range is anything from around -200 to +300 milliseconds.

It is these differences, rather than the absolute reaction times for incongruent or congruent trials, that are critical to this within-subjects experiment. In the next section, we use Bayesian analysis to work out how good the evidence is for the claim that incongruent trials are on average slower than congruent trials (i.e. that the difference scores are on average positive).

Bayesian ANOVA (within-subjects)

We can use the command anovaBF to look at the evidence for this within-subjects effect of congruency. This command takes data in the format of our wordctrlCIsum data frame, so we’re almost ready to go. But, before we can do this analysis, we have to tell R which columns of this data frame are factors.


The word factor is a jargon term for “column that contains information about the experimental design”. To understand the meaning of this term, take a look at the wordctrlCIsum data frame in your environment. In wordctrlCIsum, the column congru is a factor because it tells us whether the reaction time comes from an congruent trial (cong) or an incongruent trial (incong). The column subj is also a factor, because it tells us which participant the reaction time comes from. The rt column is not a factor, because it contains the data (reaction times) that we’re analyzing.

We tell R this in the following way.

Enter this comment and these commands into your R script and run them:

# Set 'congru' and 'subj' as factors
wordctrlCIsum$congru <- factor(wordctrlCIsum$congru)
wordctrlCIsum$subj <- factor(wordctrlCIsum$subj)

Explanation of command - The first command means “turn the congru column of the wordctrlCIsum data frame into a factor”. More specifically:

wordctrlCIsum$congru - This means the congru column of the wordctrlCIsum data frame

factor(wordctrlCIsum$congru) - The factor command tells R that the column wordctrlCIsum$congru is a factor.

<- This means put the thing on the right of <- into the thing on the left. In other words, once factor has labelled wordctrlCIsum$congru as a factor, record that change back into wordctrlCIsum$congru (rather than, for example, sending the data to a different column).

The second command works in the same way as the first.


Now we’ve given R this information about factors, we’re ready to calculate a Bayes Factor for the difference between congruent and incongruent trials. To do this, first we load the BayesFactor package as before.

Enter this comment and command into your R script and run it:

# Load BayesFactor package

and then we use the following command.

Enter this comment and command into your R script and run it:

# Calculate w/subj Bayesian ANOVA for effect of 'congru' on RT
anovaBF(formula = rt ~ congru + subj,
        data = data.frame(wordctrlCIsum),
        whichRandom = "subj")
Bayes factor analysis
[1] congru + subj : 43224861 ±1.12%

Against denominator:
  rt ~ subj 
Bayes factor type: BFlinearModel, JZS

Explanation of command

Much of this anovaBF command is the same as for a Bayesian t-test. For example, formula = rt ~ congru means we want to know the effect of the congruency condition (congru) on the dependent measure of reaction time (rt).

The first part that’s new is that the formula contains + subj. Unlike for ttestBF, we always need to include this for anovaBF, so it knows which data came from which participant. What we include is the name of the column that contains the subject IDs – in this case, subj.

The other part that’s new is whichRandom = "subj". The command whichRandom means “tell me which factors are random factors”. A random factor is one where, if you ran the experiment again, the levels of the factor would be different. So, if you ran this experiment again, you’d almost certainly want to test different people, so the levels of the subj factor would be different if you re-ran the experiment. This makes it a random factor. In contrast, if you wanted to re-run this experiment in the same way as before, you would still have congruent and incongruent trials, so the levels of the congru factor aren’t random. They’re generally described as fixed. The command anovaBF assumes factors are fixed unless you tell it they are random.

The remaining part, data = data.frame(wordctrlCIsum), just tells R where to find the data.

Explanation of output

The first part of the output is just a way of showing a ‘progress bar’ on a worksheet. R shows you a progress bar going from 0% to 100% to let you know it’s working on the answer. This analysis may take several seconds to run, so you may end up looking at this progress bar for a while.

The second part of the output gives the Bayes Factor. The key figure here is 44286971, which is the Bayes Factor. Note that your number might be slightly different. This is what the ±1.23% part of the above output indicates. The number 44286971 is the correct answer, to within about plus or minus 1%. It’s like saying “this person is 2 metres tall, give or take 5cm”. The command anovaBF can take a while to run, so we prioritise speed over accuracy to some extent. If it was important to get the Bayes Factor precisely correct, we could let R work on the problem for much longer (e.g. overnight). However, the extra time is usually not worth it, and we don’t cover how to increase accuracy of the output in this worksheet.

The Bayes Factor on this occasion is around 40 million, meaning there is really very strong evidence in support of an effect of congruency. Pretty much however unlikely we thought it was beforehand that an incongruent picture would slow down word naming, we should now be pretty confident that it does.

The rest of the output is not crucial to understand right now, but if you are curious take a look at the more on Bayes factors worksheet.

Bayesian ANOVA with more than two levels

In the above Bayes Factor calculation, we looked at just the congruent and incongruent trials of the experiment. However, this experiment had a third level on the condition factor – there were also neutral trials where no picture was shown. These data are in the wordctrl data frame that you made. In this section, we’re going to do an analysis using all three conditions.

Unlike ttestBF, anovaBF can cope with a factor than has more than two levels. This is true for both between-subject and within-subject factors, although we’ll focus on within-subject factors in this worksheet.


Add the following comment to your script:

# Group 'wordctrl' by 'subj' and 'congru'; calculate mean RT; place in 'wordctrlsum'

and then write and run a command to create a subject-level summary of mean reaction times, as you did before. However, this time, you’ll need all three conditions in that summary, so use your wordctrl data frame, rather than your wordctrlCI data frame. Put your summary into a data frame called wordctrlsum.

EXPECTED OUTPUT: There will be a new data frame wordctrlsum in your Environment, with 420 rows and 3 columns.


Add the following comment to your script:

# Display density plot of RT, by 'congru'

and then write and run a command that produces a density plot that contains all three conditions. You can do this the same way as before, just use wordctrlsum instead of wordctrlsumCI.

EXPECTED OUTPUT: This is what the graph should look like:

The main thing to notice about this density plot is that the reaction times for neutral trials are intermediate between the congruent and incongruent trials, both in terms of their mean, and in terms of their variance.

It’s still the case that it is the difference between conditions that is key here, because this is a within-subjects experiment. However, there are three differences we could work out here (incongruent - neutral, neutral - congruent, incongruent - congruent), so we’re going to leave further graphing until a bit later.


The command anovaBF allows us to calculate a Bayes Factor across all three conditions, all we need to do is give it the appropriate subject-level summary (wordctrlsum in this case). However, don’t forget that, before anovaBF can work, it needs to know which columns of wordctrlsum are factors (you will need to set subj and congru as factors). So add the following comments to your script:

# Set 'congru' and 'subj' as factors

# Calculate w/subj Bayesian ANOVA for effect of 'congru' on RT

and then insert and run some commands to perform a Bayesian ANOVA that includes all three levels of the congruency condition.

EXPECTED OUTPUT: The correct output is as follows (your Bayes Factor may be slightly different):

Bayes factor analysis
[1] congru + subj : 1.741829e+14 ±0.66%

Against denominator:
  rt ~ subj 
Bayes factor type: BFlinearModel, JZS

The Bayes Factor here is approximately 1.8e+14, which is read as \(1.8 \times 10^{14}\) (click here if you need a reminder about scientific notation). So, the Bayes Factor is overwhelmingly in support of the conclusion that there is a difference between these three conditions.

Pairwise comparisons

It’s important to be clear what the above Bayes Factor tell us. It gives us evidence that the three conditions (congruent, incongruent, neutral) are not all the same. This is worth knowing, but often we want evidence for more specific hypotheses. For example, does the presentation of a congruent picture speed responding? If so, there should be evidence that congruent trials are faster than neutral trials. The Bayes Factor we have just calculated does not tell us that – it just tells us the evidence level for the claim that the three conditions are different.

Calculating the evidence for a particular pair in an experiment is called doing a pairwise comparison. You’ve in fact already done this – you removed the neutral trials from the data set and calculated a Bayes Factor for the congruent vs. incongruent comparison. You can use the same techniques to look at any pair.


The goal in this exercise is to look at the reaction time difference between congruent and neutral trials in the control condition of this experiment.

Specifically, the task is to write a new R script that contains all the necessary commands, and only the necessary commands, to do all of the following, in this order:

  1. Start a new script and call it anova1ex.R. Add the following comment:
# Within-subject differences
  1. Start your script with the command rm(list=ls()) (preceded by a suitable comment). This command clears your Environment and so ensures your script is not relying on commands you have previously run.

  2. Load the packages you need (tidyverse and BayesFactor). Precede with an appropriate comment.

  3. Load the data file wordnaming2.csv (using read_csv). Precede with an appropriate comment.

  4. Preprocess the data to include only the control condition, and exclude the incongruent trials. Use the filter command to do this. Precede with an appropriate comment.

  5. Generate a subject-level summary of these filtered data (use the group_by and summarise commands). Precede with an appropriate comment.

  6. Show a density plot of the RT differences for congruent versus neutral trials. You’ll need to use pivot_wider and mutate to get the data in the right format, and then use ggplot to produce the graph. Precede with an appropriate comments.

  7. Calculate a Bayes Factor for this difference. Use anovaBF to do this, and don’t forget to use the factor command to tell R which columns of your data frame are factors. Precede with an appropriate comments.

Other requirements:

Expected output, submitting your work

Your output should look like the example below. Note that your Bayes Factor may be slightly different to the one reported here, for the reasons we covered earlier.

Once it does, copy and paste your R script for this exercise into PsycEL. REMEMBER: This should be all the code you need to complete this exercise, and only the code you need to complete this exercise (not any other code from this worksheet). Even with comments, it shouldn’t be longer than around 25 lines of code. Your code has to run, produce the correct answer, and not include irrelevant code, to pass this exercise.

Bayes factor analysis
[1] congru + subj : 7050364 ±2.15%

Against denominator:
  rt ~ subj 
Bayes factor type: BFlinearModel, JZS

Multiple comparisons and replication

Just before we finish, some notes on the issue of multiple comparisons, just because this is something you may hear about elsewhere.

To explain the concept of multiple comparisons, I’d like you to imagine you had a factor with five levels. The command anovaBF would cope just fine with this. But, with five levels, there are ten pairwise comparisons you could do (1 versus 2, 1 versus 3, etc.). In traditional analysis techniques (i.e. techniques resulting in p values) you will sometimes hear people talking about ‘correcting for multiple comparisons’. This is because the more tests you do, the more likely it is that one of them will be ‘significant’ (\(p < .05\)), even if there is in fact no difference between any of the levels of the factor. One common suggestion is to use a stricter significance level (e.g. \(p < .01\)) if you’re doing a lot of tests.

This is not an approach covered in these worksheets. This is for two reasons:

First, we’re using Bayesian methods, not traditional ANOVA. Although issues around multiple comparisons still occur with Bayesian techniques, the issues are not the same, the solutions are not same, and the whole thing is just a bit too complex for this intermediate-level worksheet.

Second, correcting for multiple comparisons is seldom the answer anyway. If you knew you needed to compare those two conditions before you looked at the data, this is known as a confirmatory analysis, and we don’t generally adjust for multiple comparisons for such analyses. Alternatively, you might be exploring your data after you collected it. For example, you might observe a difference you didn’t expect to see, and do a test on this. This is called exploratory analysis, and it’s fine — it’s good to fully explore your data. But, results from exploratory analysis should normally be replicated before we are particularly convinced by them, whatever the p value (or Bayes Factor). In other words, you should run the study again, predict you would see the difference again, and then test for it. This is because, if you look long and hard enough, you’ll always find some pattern in a set of data, even if there is not actually anything to be found.

This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.