Preprocessing is all the things you have to do to your data before you can analyse it. In the Absolute Beginners’ Guide to R, the preprocessing was mostly done for you, and you just used
read_csv to load in the preprocessed data. However, in most realistic situations, data does not come preprocessed. In this worksheet, we’ll look at preprocessing data from computerised experiments. Preprocessing data from these kinds of experiments typically comes in five parts - loading, tidying, filtering, summarising, and combining. We’ll cover these in turn below.
In this first part of the worksheet, the commands you need to do each step are given to you. It’s important to take time to read all the instructions, try out the commands in RStudio for yourself, and read the descriptions and explanations of how they work. This is because, in the next section, you’ll be preprocessing another data set. You won’t be given any of the commands. Instead, you’ll adapt what you’ve learned to that new data set.
In order to get the data for this exercise, we’re going to load it from a git repository. The best-known git repository is github, and that’s the one we’ll use here. Git repositories are a common way of sharing code and data via the internet, and RStudio can easily make use of them. Here’s how:
Create a new RStudio project as before, but click on “Version Control” rather than “New Directory”.
Click on “Git”.
Enter the location of the repository into the first line. It’s
Click “Create Project”
That’s it! You should now see that your Files window in RStudio has a number of files showing, including a folder called rawdata. If so, you have successfully downloaded data from a git repository on github and put it in an RStudio project.
Click on the name ‘rawdata’ in the Files window of RStudio. You’ll see there are three CSV files, and one file called
README.md. We’ll use the CSV files in a minute.
Each CSV file (e.g.
subject-11.csv) contains the data from one participant in a facial prototypes experiment. The experiment was run using OpenSesame. OpenSesame, like R, is a free and open-source program. Install OpenSesame on your machine and run the experiment on yourself. To do this, you will need to download the OpenSesame experiment, which is in the
expt-scripts folder of your R project. Click on
expt-scripts, tick the box next to
facialproto_short.osexp, click ‘More…’ and click ‘Export’. Now open the experiment on your machine using OpenSesame.
Create a new script file and call in preproc.R. Put all your commands for this worksheet into this script, and save regularly. This will make it easier to see what you have done, espcially when you come back to it after a break.
In this experiment, people are shown some pictures of male faces. Each is a picture of a real face, but its internal features (eyes, nose, mouth, etc.) have been digitally stretched or compressed either along the x-axis or the y-axis. Each of these manipulations of real faces is shown exactly once. Participants rate each picture for masculinity (1-8 scale, higher numbers = more masculine), just as a way of encouraging them to look at each picture closely.
After participants have been shown 32 pictures, they move to the test phase. They’re shown another 24 pictures, and have to rate their confidence they’ve seen that exact picture before (0 = definitely not seen, 9 = definitely seen). The pictures they’re shown include the exact distortions they’ve seen before (
seen), the undistorted version of the faces (
prototype), which they have not been shown before, and some other distortions of the same faces that they haven’t seen before (
The expected result is that people are confident they’ve seen the
prototype pictures before, even though they haven’t. One interpretation of this result is that we average across the pictures we see. The prototype is the average of the pictures we’ve seen of that face, so it seems familiar even though we’ve not seen it before.
In order to work out whether we get the expected result, we need to know the mean confidence rating each participant gave for each face type (
seen, unseen, prototype).
We’ll start by loading the data for one of the participants,
subject-11.csv. Recall that we can do this using the
read_csv command. In this case, the CSV file we want to load is inside the
rawdata folder of our project so we have to say
rawdata/subject-11.csv rather than just
subject-11.csv, so RStudio knows where to look for the file.
Add the following commands to preproc.R, and run the script:
library(tidyverse) dat <- read_csv("rawdata/subject-11.csv")
dat in the Environment tab and take a look at the file you’ve loaded. This is a typical output file for OpenSesame, but it seems quite overwhelming at first sight. It has 101 columns of data, many with unclear names. There are 56 rows of data, so 5656 pieces of data in total, and this is just for one participant in one short experiment!
Our first job is to tidy up this dataset so it’s clear and readable to the average human being.
The first thing you need to know to make sense of this dataset is that each row is one trial of the experiment. In a typical experiment, a trial begins with the presentation of a stimulus (in this case, a face) and ends with the participant making a response (in this case, a rating). Participants rate 32 pictures for masculinity, and then 24 pictures for confidence they’ve seen them before, leading to a total of 56 trials in this experiment, and hence 56 rows in this data frame.
This is what’s sometimes called tidy data, which means there is one row for each observation (each trial, in this case). It is also called long format data, because it has more rows and fewer columns than the main alternative, which is to put all data from a single participant on the same row. This is called wide format, and in this case would result in a dataset with 1 row and 5656 columns.
Most commands in R assume your data is in long format, so it’s good news our raw data is also in that format, even if it still needs a bunch of tidying up before we can analyse it. The first step is to go through the 101 columns and find those we actually need.
It’s important that we are able to anonymously identify the participant who generated the data. So, we’re going to need their participant number, which is in the
Columns to keep:
Although this data file contains the participant number, this may not always be the case for other data sets. If you ever need to add participant numbers yourself, take a look at the more on preprocessing worksheet.
One thing we’ll definitely need to know is how the participant responded on each trial of the experiment. If you scroll through the columns of
dat, which are organised alphabetically, you’ll find a column called
response. You’ll see that each row contains a number from 0 to 9. So, this is the rating the participant made on that trial. We’ll need to know that to analyse this data, so
response is one of the columns we need to keep.
Columns to keep:
We also need to know what kind of response the participant was making – a masculinity rating (first part of the experiment), or a confidence rating (second part). When an experiment has two or more different parts, we call those parts phases. And you’ll find there is a column called
phase, which has the entries
exposure (which is the first part of the experiment) and
test (second part). So
phase is another column we’ll need.
Columns to keep:
In the second phase of the experiment, there are three different types of trial - unseen faces, seen faces, and prototypes. We’ll need to know what type of stimulus was presented in order to analyse these data. Scroll to the column called
type, and you’ll see that for the second phase (rows 33 onwards), it contains this information. For the first phase it contains
NA, meaning this is not relevant information for the first phase. So
type is another column we’ll need to keep.
Columns to keep:
There are many trials in each phase of this experiment, so it makes sense to include a column that says which trial within the phase each row refers to. Take a look at the data — click on
dat in the Environment panel if you have not already done so, and this will open a spreadsheet-like view of the data in the top left window of Rstudio. If you scroll through the columns, you’ll find one called
live_row. You’ll notice it counts up from zero to 31, and then from 0 to 23. So, this column contains trial numbers, first for the masculinity-rating part of the experiment, and then for the confidence-rating part of the experiment. Counting from zero might seem a bit weird, but it’s quite common in data collected by a computer.
Columns to keep:
You may have noticed that the list of ‘columns to keep’ is not in alphabetical order. Instead, it follows the conventional ordering of participant (
subject_nr), position in experiment (
phase, live_row), stimulus (
type), response. Most experiments report their data in this order. Having conventions like these make it easier for others to read and understand our analyses.
Having worked out which columns we need, we now tidy things up by selecting just the columns we need and putting them into a new data frame. This is done using R’s
dat_subset <- dat %>% select(subject_nr, phase, live_row, type, response)
Explanation of command - Our original data frame
dat is piped (sent to,
select command, which picks just the columns we name. These selected columns are put (
<-) into a new dataframe called
dat_subset in the Environment window. You’ll see we have a much easier-to-read data frame, still with 56 rows, but now only 5 columns.
When working with data, it’s useful to have meaningful column names, because it makes it easier to remember what they contain. The column
live_row could be more clearly named as
trial, as that’s the information it contains. Also,
subject_nr is longer than it needs to be,
subj is clear, and quicker to type. We rename columns in R using the
tidydat <- dat_subset %>% set_names(c("subj", "phase", "trial", "type", "response"))
Explanation of command - The command
c() means concatenate i.e. put together. So
c("subj", "phase", "trial", "type", "response") is a way of putting the five names together so they can be sent somewhere. We use the
set_names function to change the column names of the dataframe to be this list. The result is sent (
<-) and saved in a new variable,
We now have a clear, tidy dataframe (
tidydat), which we can start to analyse. The expected result is that the participant will show high confidence ratings for having seen the prototype, despite not having seen it before. Does this participant show this pattern of results?
Our predictions are about the test phase, but this data frame also contains data about the exposure phase (the masculinity ratings). So, the first thing we have to do is filter the data so that it only contains the test phase. As covered in Absolute Beginners’ Guide to R, we use the
filter command to do this, telling it which parts of the data we want to keep
testdat <- tidydat %>% filter(phase == "test")
Explanation of command - The
tidydat data is passed (
%>%) to the
filter command, which keeps only those rows (trials) where
phase == "test", i.e. where the column
phase contains the word
test. This filtered data is then written (
<-) to a new data frame called
testdat in the Environment window. You’ll see that now we just have 24 rows, all from the test phase.
We can now use the
summarise commands covered in Absolute Beginners’ Guide to R to answer our question.
testdat %>% group_by(type) %>% summarise(mean(response))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 2 type `mean(response)` <chr> <dbl> 1 prototype 4.75 2 seen 3.88 3 unseen 5
Explanation of command - Our test-phase data (
testdat) is piped to
group_by(type), which groups it into the three parts given by the
type column (seen, unseen, prototype). This grouped data is then sent to
summarise to work out a summary value for each group. We tell
summarise what summary we want, in this case we want a
mean. And we tell the
mean command that the data we want a mean of is in the
As before, you can safely ignore the “ungrouping” message that you receive. Looking at the output, we can see that this participant was more confident they’d seen the prototypes (4.75) than that they’d seen the pictures they’d actually seen before (3.88). This fits our hypothesis. But, somewhat oddly, they were even more confident about other distortions they hadn’t seen (5). An alternative hypothesis (and the correct one in this case) is that this particular participant just randomly pressed the buttons, so any difference in the three scores is down to chance.
Experiments generally involve testing relatively large numbers of people, in order to achieve good statistical power. This means that another important part of preprocessing is to combine the data from different participants into a single data frame, so we can analyse the data from all the participants at the same time.
We’ll start by loading in data from a second participant into a different dataframe:
dat2 <- read_csv("rawdata/subject-12.csv")
To combine these two participants into a single dataframe, we use the
alldat <- bind_rows(dat, dat2)
Explanation of command - The
bind_rows command takes the
dat2 dataframe and adds it to the end of the
dat dataframe. This combined data set is then written to (
<-) a new dataframe called
alldat in Environment window; you’ll see it now has 112 rows, the 56 rows from subject-11 and then the 56-rows from subject-12.
Generally, we have a lot more than two people in an experiment. You could use the same technique to combine the data of tens, hundreds, or thousands of participants. However, that would take a really long time and be quite tedious. Fortunately, R allows us to speed up this process so that, even if we have tens of thousands of participants, we can quickly combine their data without error.
The first thing we need to do to combine many participants is to get a list of the names of their data files (e.g.
subject-11.csv). We do this using the
tibble(filename = list.files("rawdata", "*.csv", full.names=TRUE))
# A tibble: 3 x 1 filename <chr> 1 rawdata/subject-11.csv 2 rawdata/subject-12.csv 3 rawdata/subject-13.csv
Explanation of command: We say
tibble(filename= to tell R to make a new dataframe, with a column containing filenames. The first part of
"rawdata", tells R which folder the data files are in. Generally, it’s a good idea to keep all your data files inside a single folder, so it is easier for commands like this to find them. The second part of
"*.csv" tells R to only give the names of files that end in
"*" means “any character or number”). This is useful, because sometimes raw data folders contain other files, too. For example,
README.md, a file that generally provides information relating to the data, rather than being data itself. The third part,
full.names=TRUE, tells R to give the name of the file including the name of the folder it is in, so
rawdata/subject-11.csv rather than just
We know how to make a new dataframe containing a list of filenames to process. We can use this to load each file in turn and combine them into a single large datset.
To do this, we make use another tidyverse function called
do function performs an operation for each group in a dataframe. All of the results are joined into a large, combined datafile. All we need to do is:
The entire command looks like this:
alldat <- tibble(filename = list.files("rawdata", "*.csv", full.names=TRUE)) %>% group_by(filename) %>% do(read_csv(.$filename))
Explanation of command: At the start of line 1 we write
alldat <-. This means the results will be saved with the name
alldat. Then we reuse the code from above to create a new dataframe with one column:
filename. We then use
group_by to tell R to process each filename individually. If we didn’t do this then R would ask
read_csv to open all the files at once, and
read_csv would be confused!
When we write
do(read_csv(.$filename)) we are telling
do to apply the
read_csv function to each filename. The
".$filename" part is shorthand: The period
"." means “the group in the data I am working with now”. The
"$filename" part means use the values from the
filename column. So
read_csv(.$filename) means, read the csv file using the value in the
If you look at
alldat in the Environment window, you’ll see that it has 168 rows. Using
do has combined all three participants, each with 56 trials, into one big data frame. Although with three participants we could have done this as quickly in other ways, generally we have much more data than this - typically somewhere between 30 and 300 participants in a single experiment. Using
do saves a lot of time in these situations.
Of course, this combined data file has the same problems as the individual files that make it up – there are over 100 columns, most of which we don’t need. We can fix this with the same commands as before.
WRITE SOME COMMANDS to select the relevant columns, put them into
tidydat and then rename the columns.
tidydat, you’ll see a human-readable data frame with 168 rows, and 5 columns. This is the full dataset for the experiment, so we can now start analyzing it.
Now we have this combined file, we summarise the test phase data for each participant using the same commands as before. We just need to include
subj in the
group_by command, so
group_by(subj, type), so we get a separate summary for each person.
WRITE SOME COMMANDS to filter the data to the test phase, and put that data into a data fram called
test. Now use
test to group by subject and trial type, and summarize using the mean response, and put the resulting summary into
test.sum. If you’ve got this right, you’ll end up with this summary. You can view it either by typing
test.sum, or by clicking on
test.sum in the Environment window:
# A tibble: 9 x 3 # Groups: filename  filename type `mean(response)` <chr> <chr> <dbl> 1 rawdata/subject-11.csv prototype 4.75 2 rawdata/subject-11.csv seen 3.88 3 rawdata/subject-11.csv unseen 5 4 rawdata/subject-12.csv prototype 5.5 5 rawdata/subject-12.csv seen 4.75 6 rawdata/subject-12.csv unseen 5.12 7 rawdata/subject-13.csv prototype 4.88 8 rawdata/subject-13.csv seen 4.38 9 rawdata/subject-13.csv unseen 4.38
Looking at the above summary, we can see that all three participants were more confident they’d seen the prototype than the pictures they’d actually seen, which supports our hypothesis. But they’re also all at least as confident about other unseen distortions, too. So, again, an alternative hypothesis is that participants were just pressing keys randomly. This was in fact the case for these data. In the next part of the worksheet, you’ll look at some data from a different experiment where the participants took the task more seriously.
In this section of the worksheet, you’ll preprocess the data from a different experiment. You’ll find the data in the folder called
lexdec, and the OpenSesame implementation of this experiment in the
In this experiment, words are shown on the screen one at a time. Some are real words, others are made-up words (non-words). The participant’s job is to decide whether each is a word or non-word, as quickly as possible. The computer records for each decision whether they got it correct, and how long it took them to respond (their response time). The experiment begins with a practice phase, so participants can get used to the task. This is followed by the test phase, which is the part we analyse.
Your task is to write R code that gives the mean reaction time for each participant, for both words and non-words. You should analyse only reaction times for trials in which they made the correct response, and you should not include the practice phase in your analysis.
This can, and should, be done with less than 15 files of R code (plus comments, see below). In fact, excluding the
library(tidverse) command and comments, it can be done in 10 lines. When your code is working correctly, it will give you the following numbers:
# A tibble: 6 x 3 # Groups: subj  subj type `mean(rt)` <dbl> <chr> <dbl> 1 11 nonword 1158. 2 11 word 888. 3 12 nonword 1339 4 12 word 1818 5 13 nonword 1458. 6 13 word 1526
Use short, meaningful names for data frames and column names.
Use comments in your code to make it more human readable. Comments are any line that begins with
## and are ignored by R. They are there to make your code easier for humans to understand. For example:
# Create a dataframe containing all the filenames we want to read raw.data.files <- tibble(filename = list.files("rawdata", "*.csv", full.names=TRUE))
Hints: Your code should first get a list of files, then
do to load all those files in and combine them. Next you’ll have to find the relevant columns among the 100+ columns of the data file. Once you’ve found them, use
select to select just those columns and use
set_names to give the selected columns better names. Use
filter to remove the practice phase and keep only the correct responses. Note that accuracy in this data file is a number, not a word, so the correct phrase is something like
acc == 1, rather than
acc == "1". Finally, use
summarise to report the mean response time for words and non words for each participant. Good luck!
Once you’re getting the right answers, paste your R code into PsycEL.
This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.