Preprocessing is all the things you have to do to your data before
you can analyse it. In the Absolute Beginners’ Guide to R, the
preprocessing was mostly done for you, and you just used
read_csv to load in the preprocessed data. However, in most
realistic situations, data does not come preprocessed. In this
worksheet, we’ll look at preprocessing data from computerized
experiments. Preprocessing data from these kinds of experiments
typically comes in five parts - loading, tidying, filtering,
summarizing, and combining. We’ll cover these in turn below.
In this first part of the worksheet, the commands you need to do each step are given to you. It’s important to take time to read all the instructions, try out the commands in RStudio for yourself, and read the descriptions and explanations of how they work. This is because, in the next section, you’ll be preprocessing another data set. You won’t be given any of the commands. Instead, you’ll adapt what you’ve learned to that new data set.
In order to get the data for this exercise, we’re going to load it from a git repository. The best-known host for git repositories is GitHub, and that’s the one we’ll use here. Git repositories are a common way of sharing code and data via the internet, and RStudio can easily make use of them. Here’s how:
Create a new RStudio project as before, but click on “Version Control” rather than “New Directory”.
Click on “Git”.
Enter the location of the repository into the first line. It’s
Click “Create Project”
That’s it! You should now see that your Files window in RStudio has a number of files showing, including a folder called rawdata. If so, you have successfully downloaded data from a git repository on GitHub and put it in an RStudio project.
Click on the name ‘rawdata’ in the Files window of RStudio. You’ll
see there are three CSV files, and one file called
README.md. We’ll use the CSV files in a minute.
Each CSV file (e.g.
subject-11.csv) contains the data
from one participant in a facial
prototypes experiment. The experiment was run using OpenSesame. OpenSesame, like R, is a
free and open-source program. Install OpenSesame on your machine and run
the experiment on yourself. To do this, you will need to download the
OpenSesame experiment, which is in the expt-scripts folder
of your R project. Click on
expt-scripts, tick the box next to
facialproto_short.osexp, click ‘More…’ and click
‘Export’. Now open the experiment on your machine using OpenSesame.
In your rminr-data project, create a new script file
and call it preproc.R. Put all your commands for this
worksheet into this script, and save regularly. This will make it easier
to see what you have done, especially when you come back to it after a
break.
In this experiment, people are shown some pictures of male faces. Each is a picture of a real face, but its internal features (eyes, nose, mouth, etc.) have been digitally stretched or compressed either along the x-axis or the y-axis. Each of these manipulations of real faces is shown exactly once. Participants rate each picture for masculinity (1-8 scale, higher numbers = more masculine), just as a way of encouraging them to look at each picture closely.
After participants have been shown 32 pictures, they move to the test
phase. They’re shown another 24 pictures, and have to rate their
confidence they’ve seen that exact picture before (0 =
definitely not seen, 9 = definitely seen). The pictures they’re shown
include the exact distortions they’ve seen before (seen),
the undistorted version of the faces (prototype), which
they have not been shown before, and some other distortions of the same
faces that they haven’t seen before (unseen).
The expected result is that people are confident they’ve seen the
prototype pictures before, even though they haven’t. One
interpretation of this result is that we average across the pictures we
see. The prototype is the average of the pictures we’ve seen of that
face, so it seems familiar even though we’ve not seen it before.
In order to work out whether we get the expected result, we need to
know the mean confidence rating each participant gave for each face type
(seen, unseen, prototype).
We’ll start by loading the data for one of the participants,
subject-11.csv. Recall that we can do this using the
read_csv command. In this case, the CSV file we want to
load is inside the rawdata folder of our project, so we write
rawdata/subject-11.csv rather than just
subject-11.csv. This tells RStudio where to look for the
file.
Add the following comments and commands to preproc.R, and run the script:
    # Preprocessing worksheet

    # Load tidyverse
    library(tidyverse)

    # Load data
    dat <- read_csv("rawdata/subject-11.csv")
Click on dat in the Environment tab and take a look at
the file you’ve loaded. This is a typical output file for OpenSesame,
but it seems quite overwhelming at first sight. It has 101 columns of
data, many with unclear names. There are 56 rows of data, so 5656 pieces
of data in total, and this is just for one participant in one short
experiment.
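A quick way to confirm numbers like these from the console is sketched below. It uses a small made-up data frame (`toy`, which is not part of the worksheet data) so that it runs on its own; with the real data you would use `dat` in place of `toy`.

```r
library(tidyverse)

# Hypothetical stand-in for 'dat': 4 trials (rows), 3 columns
toy <- tibble(
  subject_nr = c(11, 11, 11, 11),
  live_row   = c(0, 1, 2, 3),
  response   = c(5, 2, 7, 4)
)

n_rows  <- nrow(toy)        # number of rows (trials)
n_cols  <- ncol(toy)        # number of columns
n_cells <- n_rows * n_cols  # total pieces of data

print(c(rows = n_rows, cols = n_cols, cells = n_cells))
```

For the file described above, the same calculation is 56 rows × 101 columns = 5656 pieces of data.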
Our first job is to tidy up this dataset so it’s clear and readable to the average human being.
The first thing you need to know to make sense of this dataset is that each row is one trial of the experiment. In a typical experiment, a trial begins with the presentation of a stimulus (in this case, a face) and ends with the participant making a response (in this case, a rating). Participants rate 32 pictures for masculinity, and then 24 pictures for confidence they’ve seen them before, leading to a total of 56 trials in this experiment, and hence 56 rows in this data frame.
This is what’s sometimes called tidy data, which means there is one row for each observation (each trial, in this case). It is also called long format data, because it has more rows and fewer columns than the main alternative, which is to put all data from a single participant on the same row. This is called wide format, and in this case would result in a dataset with 1 row and 5656 columns.
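To make the difference between long and wide format concrete, here is a minimal sketch using made-up data (the data frame `long` and its values are hypothetical). It uses the tidyverse function `pivot_wider`, which is not needed for this worksheet, purely to show what the wide layout would look like.

```r
library(tidyverse)

# Hypothetical long-format data: one row per trial
long <- tibble(
  subj     = c(1, 1, 1, 2, 2, 2),
  trial    = c(0, 1, 2, 0, 1, 2),
  response = c(5, 3, 8, 2, 6, 4)
)

# Wide format: one row per participant, one column per trial
wide <- long %>%
  pivot_wider(names_from = trial, values_from = response,
              names_prefix = "trial_")

print(long)  # 6 rows, 3 columns
print(wide)  # 2 rows, 4 columns
```

The long version has more rows and fewer columns; the wide version squeezes each participant onto a single row.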
Most commands in R assume your data is in long format, so it’s good news our raw data is also in that format, even if it still needs a bunch of tidying up before we can analyse it. The first step is to go through the 101 columns and find those we actually need.
It’s important that we are able to anonymously identify the
participant who generated the data. So, we’re going to need their
participant number, which is in the subject_nr column.
Columns to keep: subject_nr
Although this data file contains the participant number, this may not always be the case for other data sets. If you ever need to add participant numbers yourself, take a look at the more on preprocessing worksheet.
One thing we’ll definitely need to know is how the participant
responded on each trial of the experiment. If you scroll through the
columns of dat, which are organised alphabetically, you’ll
find a column called
response. You’ll see that each row
contains a number from 0 to 9. So, this is the rating the participant
made on that trial. We’ll need to know that to analyse this data, so
response is one of the columns we need to keep.
Columns to keep: subject_nr, response
We also need to know what kind of response the participant was making
– a masculinity rating (first part of the experiment), or a confidence
rating (second part). When an experiment has two or more different
parts, we call those parts phases. And you’ll find there is a
column called phase, which has the entries
exposure (which is the first part of the experiment) and
test (second part). So
phase is another column we need to keep.
Columns to keep: subject_nr, phase, response
In the second phase of the experiment, there are three different
types of trial - unseen faces, seen faces, and prototypes. We’ll need to
know what type of stimulus was presented in order to analyse these data.
Scroll to the column called
type, and you’ll see that for
the second phase (rows 33 onwards), it contains this information. For
the first phase it contains
NA, meaning this is not
relevant information for the first phase. So type is
another column we’ll need to keep.
Columns to keep: subject_nr, phase, type, response
There are many trials in each phase of this experiment, so it makes
sense to include a column that says which trial within the phase each
row refers to. Take a look at the data: click on dat in
the Environment panel if you have not already done so, and this will
open a spreadsheet-like view of the data in the top left window of
RStudio. If you scroll through the columns, you’ll find one called
live_row. You’ll notice it counts up from zero to 31, and
then from 0 to 23. So, this column contains trial numbers, first for the
masculinity-rating part of the experiment, and then for the
confidence-rating part of the experiment. Counting from zero might seem
a bit weird, but it’s quite common in data collected by a computer.
Columns to keep: subject_nr, phase, live_row, type, response
You may have noticed that the list of ‘columns to keep’ is not in
alphabetical order. Instead, it follows the conventional ordering of
participant (subject_nr), position in the experiment
(phase, live_row), stimulus
(type), response. Most experiments report their
data in this order. Having conventions like these makes it easier for
others to read and understand our analyses.
Having worked out which columns we need, we now tidy things up by
selecting just the columns we need and putting them into a new data
frame. This is done using R’s
select command. Add the
following to your script and run it:
    # Select columns; place into 'dat_subset'
    dat_subset <- dat %>% select(subject_nr, phase, live_row, type, response)
Explanation of command - Our original data frame
dat is piped (sent to, %>%) the
select command, which picks just the columns we name. These
selected columns are put (<-) into a new dataframe called
dat_subset. Click on
dat_subset in the Environment window.
You’ll see we have a much easier-to-read data frame, still with 56 rows,
but now only 5 columns.
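The behaviour of select can be seen in miniature with the sketch below. The data frame `raw` and its columns `acc` and `width` are made up for illustration; they stand in for the many unneeded columns in the real file.

```r
library(tidyverse)

# Hypothetical raw data, with columns in alphabetical order
raw <- tibble(
  acc        = c(NA, NA, NA),
  live_row   = c(0, 1, 2),
  response   = c(5, 3, 8),
  subject_nr = c(11, 11, 11),
  width      = c(1024, 1024, 1024)
)

# Keep only the named columns, in the order we name them
kept <- raw %>% select(subject_nr, live_row, response)

print(names(kept))
```

Notice that select also reorders the columns to match the order in which they are named, which is how the conventional column ordering ends up in the new data frame.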
When working with data, it’s useful to have meaningful column names,
because it makes it easier to remember what they contain. The column
live_row could be more clearly named as trial,
as that’s the information it contains. Also, subject_nr is
longer than it needs to be;
subj is clear, and quicker to
type. We rename columns in R using the set_names command.
Add the following to your script and run it:
    # Rename columns; place into 'tidydat'
    tidydat <- dat_subset %>% set_names(c("subj", "phase", "trial", "type", "response"))
Explanation of command - The command
c() means concatenate i.e. put together. So
c("subj", "phase", "trial", "type", "response") is a way of
putting the five names together so they can be sent somewhere. We use the
set_names function to change the column names of the
dataframe to be this list. The result is sent (<-) to, and
saved in, a new variable, tidydat.
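Because set_names replaces every column name by position, the names you give must be in the same order as the columns. The sketch below, using a made-up data frame `toy`, shows this alongside dplyr's `rename`, an alternative (not used in this worksheet) that changes only the columns you mention.

```r
library(tidyverse)

# Hypothetical three-column data frame
toy <- tibble(subject_nr = c(11, 11), live_row = c(0, 1), response = c(5, 3))

# set_names: replaces all column names, matched by position
by_position <- toy %>% set_names(c("subj", "trial", "response"))

# rename: changes only the named columns (new name = old name)
by_name <- toy %>% rename(subj = subject_nr, trial = live_row)

print(names(by_position))
print(names(by_name))
```

Both approaches give the same column names here; set_names is shorter when you want to rename everything at once.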
We now have a clear, tidy dataframe (
tidydat), which we
can start to analyse. The expected result is that the participant will
show high confidence ratings for having seen the prototype, despite not
having seen it before. Does this participant show this pattern of
results?
Our predictions are about the test phase, but this data frame also
contains data about the exposure phase (the masculinity ratings). So,
the first thing we have to do is filter the data so that it
only contains the test phase. As covered in Absolute Beginners’ Guide to R,
we use the
filter command to do this, telling it which
parts of the data we want to keep. Add the following to your script and
run it:
    # Filter test phase data into 'testdat'
    testdat <- tidydat %>% filter(phase == "test")
Explanation of command - The tidydat
data is passed (%>%) to the filter command,
which keeps only those rows (trials) where
phase == "test",
i.e. where the column
phase contains the word
test. This filtered data is then written
(<-) to a new data frame called testdat. Click on
testdat in the Environment window. You’ll see
that now we just have 24 rows, all from the test phase.
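Here is the same filtering step on a tiny made-up data frame (`toy` is hypothetical), so you can see exactly which rows survive.

```r
library(tidyverse)

# Hypothetical data with two phases
toy <- tibble(
  phase    = c("exposure", "exposure", "test", "test", "test"),
  trial    = c(0, 1, 0, 1, 2),
  response = c(6, 4, 7, 2, 9)
)

# Keep only the rows where phase is exactly "test"
# (note the double equals sign, which tests for equality)
test_only <- toy %>% filter(phase == "test")

print(test_only)
```

The two exposure rows are dropped and the three test rows are kept, with all columns unchanged.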
We can now use the group_by and summarise
commands covered in Absolute
Beginners’ Guide to R to answer our question. Add the following to
your script and run it:
    # Group data by 'type', display mean of 'response'
    testdat %>% group_by(type) %>% summarise(mean(response))
    # A tibble: 3 × 2
      type      `mean(response)`
      <chr>                <dbl>
    1 prototype             4.75
    2 seen                  3.88
    3 unseen                5
Explanation of command - Our test-phase data
(testdat) is piped to group_by, which
groups it into the three parts given by the type column
(seen, unseen, prototype). This grouped data is then sent to
summarise to work out a summary value for each group. We
tell summarise what summary we want, in this case we want a
mean. And we tell the
mean command that the
data we want a mean of is in the response column.
As before, you can safely ignore the “ungrouping” message that you receive. Looking at the output, we can see that this participant was more confident they’d seen the prototypes (4.75) than that they’d seen the pictures they’d actually seen before (3.88). This fits our hypothesis. But, somewhat oddly, they were even more confident about other distortions they hadn’t seen (5). An alternative hypothesis (and the correct one in this case) is that this particular participant just randomly pressed the buttons, so any difference in the three scores is down to chance.
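The sketch below repeats the group-and-summarise step on made-up data (`toy` is hypothetical). It also shows two optional refinements that are not required by the worksheet: giving the summary column a name of your own (here `mean_response`), and using the `.groups = "drop"` argument of summarise, which is one way to state explicitly that the result should not remain grouped.

```r
library(tidyverse)

# Hypothetical test-phase data: two trials of each type
toy <- tibble(
  type     = c("seen", "seen", "unseen", "unseen", "prototype", "prototype"),
  response = c(2, 4, 5, 7, 6, 8)
)

means <- toy %>%
  group_by(type) %>%                          # one group per trial type
  summarise(mean_response = mean(response),   # named summary column
            .groups = "drop")                 # return an ungrouped result

print(means)
```

The groups come out in alphabetical order (prototype, seen, unseen), just as in the output above.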
Experiments generally involve testing relatively large numbers of people, in order to achieve good statistical power. This means that another important part of preprocessing is to combine the data from different participants into a single data frame, so we can analyse the data from all the participants at the same time.
We’ll start by loading in data from a second participant into a different dataframe. Add the following to your script and run it:
    # Load data from second participant
    dat2 <- read_csv("rawdata/subject-12.csv")
To combine these two participants into a single dataframe, we use the
bind_rows command. Add the following to your script and run
it:
    # Combine the two data sets; place into 'alldat'
    alldat <- bind_rows(dat, dat2)
Explanation of command - The bind_rows
command takes the
dat2 dataframe and adds it to the end of the
dat dataframe. This combined data set is then written
(<-) to a new dataframe called alldat. Click on
alldat in the Environment window; you’ll
see it now has 112 rows, the 56 rows from subject-11 and then the
56 rows from subject-12.
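A minimal sketch of bind_rows, using two made-up participants (`p1` and `p2` are hypothetical data frames):

```r
library(tidyverse)

# Hypothetical data from two participants, with the same columns
p1 <- tibble(subj = c(11, 11), response = c(5, 3))
p2 <- tibble(subj = c(12, 12), response = c(7, 1))

# Stack the rows of p2 underneath the rows of p1
both <- bind_rows(p1, p2)

print(both)
```

The result has the rows of the first data frame followed by the rows of the second, so the row counts add up (2 + 2 = 4 here, 56 + 56 = 112 for the real data).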
Generally, we have a lot more than two people in an experiment. You could use the same technique to combine the data of tens, hundreds, or thousands of participants. However, that would take a really long time and be quite tedious. Fortunately, R allows us to speed up this process so that, even if we have tens of thousands of participants, we can quickly combine their data without error.
The first thing we need to do to combine many participants is to get
a list of the names of their data files (e.g.
subject-11.csv). We do this using the
list.files command. Add the following to your script and
run it:
    # Display list of filenames
    tibble(filename = list.files("rawdata", "*.csv", full.names = TRUE))
    # A tibble: 3 × 1
      filename
      <chr>
    1 rawdata/subject-11.csv
    2 rawdata/subject-12.csv
    3 rawdata/subject-13.csv
Explanation of command: We say
tibble(filename= to tell R to make a new dataframe, with a
column containing filenames. The first part of list.files,
"rawdata", tells R which folder the data files are in.
Generally, it’s a good idea to keep all your data files inside a single
folder, so it is easier for commands like this to find them. The second
part, "*.csv", tells R to only
give the names of files that end in .csv (here the
"*" stands for “any characters”; strictly speaking, list.files
treats this argument as a regular expression rather than a wildcard, but
it has the intended effect on these files). This is useful,
because sometimes raw data folders contain other files, too. For
example, our rawdata folder contains README.md, a file
that generally provides information relating to the data, rather than
being data itself. The third part,
full.names=TRUE, tells R
to give the name of the file including the name of the folder it is in,
e.g. rawdata/subject-11.csv rather than just
subject-11.csv.
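The sketch below shows list.files picking out CSV files from a folder that also contains another file. The folder and file names are made up, and are created in a temporary location so the example runs anywhere. Because the second argument of list.files is a regular expression, this sketch uses the stricter pattern "\\.csv$", which means “ends in .csv”; treat this as one possible alternative to the pattern used in the worksheet, not a required change.

```r
# Hypothetical folder, created in a temporary location for this sketch
folder <- file.path(tempdir(), "rawdata_demo")
dir.create(folder, showWarnings = FALSE)

# Create two empty CSV files and one non-data file
file.create(file.path(folder, c("subject-1.csv", "subject-2.csv", "README.md")))

# List only the files whose names end in .csv, including the folder name
csv_files <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)

print(basename(csv_files))  # file names without the folder part
```

README.md is left out of the list, and the names come back in alphabetical order.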
We know how to make a new dataframe containing a list of filenames to process. We can use this to load each file in turn and combine them into a single large dataset.
To do this, we make use of another tidyverse function called
do. The do function performs an operation for each group in
a dataframe. All of the results are joined into a large, combined
data frame. All we need to do is:
Make a dataframe containing the list of file names to be used
Remind R to read the files one at a time (explanation below)
Tell it which function to use to read the raw files
The entire command looks like this. Enter it into your script and run it:
    # Combine data from all participants into 'alldat'
    alldat <- tibble(filename = list.files("rawdata", "*.csv", full.names = TRUE)) %>%
      group_by(filename) %>%
      do(read_csv(.$filename))
Explanation of command: At the start of line 1 we
write alldat <-. This means the results will be saved
with the name
alldat. Then we reuse the code from above to
create a new dataframe with one column:
filename. We then use
group_by to tell R to process each filename
individually. If we didn’t do this then R would ask
read_csv to open all the files at once, and
read_csv would be confused!
When we write
do(read_csv(.$filename)) we are telling
do to apply the
read_csv function to each file. The
".$filename" part is shorthand: The period
"." means “the group in the data I am working with now”.
The "$filename" part means use the values from the
filename column. So read_csv(.$filename)
means, read the csv file using the value in the
filename column.
If you look at
alldat in the Environment window, you’ll
see that it has 168 rows. Using
do has combined all three
participants, each with 56 trials, into one big data frame. Although
with three participants we could have done this as quickly in other
ways, generally we have much more data than this - typically somewhere
between 30 and 300 participants in a single experiment. Using
do saves a lot of time in these situations.
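To see the whole load-and-combine pattern work from start to finish, here is a self-contained sketch. It writes two tiny made-up data files to a temporary folder (the folder name, file names, and values are all hypothetical) and then combines them with the same group_by and do steps. The final `ungroup()` is an addition in this sketch, not part of the worksheet's command.

```r
library(tidyverse)

# Write two small hypothetical data files to a temporary folder
folder <- file.path(tempdir(), "combine_demo")
dir.create(folder, showWarnings = FALSE)
write_csv(tibble(subject_nr = 1, response = c(5, 3)),
          file.path(folder, "subject-1.csv"))
write_csv(tibble(subject_nr = 2, response = c(7, 1)),
          file.path(folder, "subject-2.csv"))

# One group per file; read each file; combine the results
combined <- tibble(filename = list.files(folder, "\\.csv$", full.names = TRUE)) %>%
  group_by(filename) %>%
  do(read_csv(.$filename)) %>%
  ungroup()   # drop the grouping by filename

print(combined)
```

The combined data frame contains a filename column in addition to the columns read from the files. Without the `ungroup()`, the result stays grouped by filename, and later commands such as select would then keep the filename column even if you did not ask for it, so it is worth checking the number of columns before renaming them by position.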
Of course, this combined data file has the same problems as the individual files that make it up – there are over 100 columns, most of which we don’t need. We can fix this with the same commands as before.
Add this comment to your script:
# Select and rename columns; place into 'tidydat'
Next add some commands to select the relevant
columns, put them into
tidydat and then rename the columns.
Run those commands. Now, if you click on
tidydat, you’ll see a
human-readable data frame with 168 rows, and 5 columns. This is the full
dataset for the experiment, so we can now start analyzing it.
Now we have this combined file, we summarise the test phase data for
each participant using the same commands as before. We just need to
include subj in the
group_by command, i.e.
group_by(subj, type), so we get a separate summary for each
participant.
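Grouping by two columns gives one summary row for every combination of their values. The sketch below shows this on made-up data for two participants and two trial types (`toy` and its values are hypothetical, and the summary column name `mean_response` is just a choice made for this sketch).

```r
library(tidyverse)

# Hypothetical data: 2 participants x 2 trial types x 2 trials
toy <- tibble(
  subj     = c(1, 1, 1, 1, 2, 2, 2, 2),
  type     = c("seen", "seen", "unseen", "unseen",
               "seen", "seen", "unseen", "unseen"),
  response = c(2, 4, 6, 8, 1, 3, 5, 9)
)

by_both <- toy %>%
  group_by(subj, type) %>%                    # one group per participant and type
  summarise(mean_response = mean(response),
            .groups = "drop")

print(by_both)
```

With 2 participants and 2 trial types there are 4 rows in the summary; with 3 participants and 3 trial types there would be 9.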
Add the following comment to your script:
    # Filter test phase data into 'test'
    # Group 'test' by 'type' and 'subj', calculate mean of 'response', place into 'test.sum'
Now, fill in the blanks, i.e. insert and run some
commands to filter the data to the test phase, and put that
data into a data frame called
test. Now use
test to group by subject and trial type, and summarize
using the mean response, and put the resulting summary into
test.sum. If you’ve got this right, you’ll end up with this
summary. You can view it either by typing
test.sum, or by clicking on
test.sum in the Environment
window.
    # A tibble: 9 × 3
    # Groups:   filename
      filename               type      `mean(response)`
      <chr>                  <chr>                <dbl>
    1 rawdata/subject-11.csv prototype             4.75
    2 rawdata/subject-11.csv seen                  3.88
    3 rawdata/subject-11.csv unseen                5
    4 rawdata/subject-12.csv prototype             5.5
    5 rawdata/subject-12.csv seen                  4.75
    6 rawdata/subject-12.csv unseen                5.12
    7 rawdata/subject-13.csv prototype             4.88
    8 rawdata/subject-13.csv seen                  4.38
    9 rawdata/subject-13.csv unseen                4.38
Looking at the above summary, we can see that all three participants were more confident they’d seen the prototype than the pictures they’d actually seen, which supports our hypothesis. But they’re also all at least as confident about other unseen distortions, too. So, again, an alternative hypothesis is that participants were just pressing keys randomly. This was in fact the case for these data. In the next part of the worksheet, you’ll look at some data from a different experiment where the participants took the task more seriously.
In this section of the worksheet, you’ll preprocess the data from a
different experiment. You’ll find the data in the folder called
lexdec, and the OpenSesame implementation of this
experiment in the expt-scripts folder.
In this experiment, words are shown on the screen one at a time. Some are real words, others are made-up words (non-words). The participant’s job is to decide whether each is a word or non-word, as quickly as possible. The computer records for each decision whether they got it correct, and how long it took them to respond (their response time). The experiment begins with a practice phase, so participants can get used to the task. This is followed by the test phase, which is the part we analyse.
Your task is to write R code that gives the mean reaction time for each participant, for both words and non-words. You should analyse only reaction times for trials in which they made the correct response, and you should not include the practice phase in your analysis.
This can, and should, be done with fewer than 15 lines of R code (plus comments, see below). When your code is working correctly, it will give you the following numbers:
    # A tibble: 6 × 3
    # Groups:   subj
       subj type    `mean(rt)`
      <dbl> <chr>        <dbl>
    1    11 nonword      1158.
    2    11 word          888.
    3    12 nonword      1339
    4    12 word         1818
    5    13 nonword      1458.
    6    13 word         1526
Use short, meaningful names for data frames and column names.
Use comments in your code to make it more human readable.
Comments are any lines that begin with
#; they are ignored by
R. They are there to make your code easier for humans to understand. For
example:
    # Create a dataframe containing all the filenames we want to read
    raw.data.files <- tibble(filename = list.files("rawdata", "*.csv", full.names = TRUE))
Hints: Your code should first get a list of files, then use
do to load all those files in and combine them. Next
you’ll have to find the relevant columns among the 100+ columns of the
data file. Once you’ve found them, use
select to select
just those columns and use
set_names to give the selected
columns better names. Use
filter to remove the practice
phase and keep only the correct responses. Note that accuracy in this
data file is a number, not a word, so the correct phrase is something
like acc == 1, rather than
acc == "1". Use group_by and
summarise to report
the mean response time for words and non-words for each participant.
Once you’re getting the right answers, paste your R code into PsycEL.
This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.