In this worksheet, we’ll cover some additional techniques for preprocessing experimental data.
To prepare for this worksheet:
1. Open the rminr-data project we used previously.

2. Look at the Files tab. You should see a folder named going-further which contains the files for this worksheet, including perruchet-raw.csv. If you don’t see the folder or files, ask git to “pull” the latest version of the repository. Select the Git tab, which is located in the row of tabs which includes the Environment tab. Click the Pull button with a downward pointing arrow. A window will open showing the files which have been pulled from the repository. Close the Git pull window.

3. Create a script named preproc-experiments.R in the rminr-data folder (the folder above case-studies). Put all the commands from this worksheet into this file, and run them from there. Save your script regularly.
We’ll start by loading some data from an undergraduate student dissertation on the Perruchet Effect.
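A sketch of the commands described below (lines 1 to 5 in the explanation); the renaming and printing commands (colnames and head) are plausible assumptions consistent with the explanation and output, rather than the worksheet’s exact code:

rm(list = ls())                                      # line 1: clear the environment
library(tidyverse)                                   # line 2: load the tidyverse
raw <- read_csv("going-further/perruchet-raw.csv")   # line 3: read the raw data
colnames(raw) <- c("subj", "time", "key", "value")   # line 4: rename the columns (names taken from the output below)
head(raw)                                            # line 5: print the first few rows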
# A tibble: 6 x 4
   subj  time key   value
  <dbl> <dbl> <chr> <dbl>
1   200    50 GS        0
2   200   106 GS        0
3   200   153 GS        0
4   200   201 GS        0
5   200   253 GS        0
6   200   301 GS        0
Explanation of commands:
Line 1 clears the environment. Lines 2 and 3 should be familiar from previous worksheets; we nearly always need the tidyverse package, and we use the read_csv command to load the data. The going-further/ part of the filename going-further/perruchet-raw.csv says that the file perruchet-raw.csv is to be found in the directory (folder) called going-further. Line 4 renames the columns, as covered in the Preprocessing data from experiments worksheet. Line 5 prints the first few rows of the raw data frame.
Explanation of output:
Our data is in long format, i.e. one observation per row.
We’re now going to get the data into a state where we can meaningfully analyse it:
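The two commands, quoted in the explanation that follows:

raw <- distinct(raw)                 # remove exact duplicate rows
raw <- raw %>% filter(subj > 161)    # keep only the main-study participants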
raw <- distinct(raw) is a de-duplication command. This experiment was run by three people over many different days. They tried to manually stitch all their data files together, by copying-and-pasting them into one big file in Excel. This type of manual manipulation of data is notoriously error prone, and in the process, they accidentally duplicated some data. In this case, it’s easy to detect and fix in R, because no two rows of this data can be identical - the subject number or the time is always going to be different. So, we can remove duplicates by telling R to give us just one copy of each of the different rows in the data set. The command
distinct does that for us.
raw <- raw %>% filter(subj > 161) should be familiar from previous worksheets; we are keeping only those participants whose subject number is greater than 161. This is because, in this study, we first ran a short pilot experiment, looked at the data, and made a few changes to the experiment. Here, we only wanted to analyze the main study, so we removed the pilot participants.
As part of this experiment, participants were told to rate their expectancy of hearing an unpleasant noise, on every trial. However, they only had 8 seconds to do this, and the experimenters noticed that sometimes people failed to make a rating. The experimental apparatus records this as a rating of zero.
Participants who miss a lot of ratings aren’t really following the instructions, making their data hard to interpret. In such cases, we normally exclude those participants from further analysis. First, we need to find out how widespread this problem is. One way to do this is to calculate the number of zero ratings each participant makes:
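A sketch of how this can be done; the filter conditions and the table command are quoted in the explanation below, although the exact layout across lines is an assumption:

inspect <- raw %>%
  filter(key == "ER") %>% filter(value == 0)   # line 2: keep the zero expectancy ratings
table(inspect$subj)                            # line 3: count zero ratings per participant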
163 164 166 176 184 188 191 196 199 202 223 233 235 236 239 246
  1  19  46  16   2   3   2   1  32   3   3   1   3   9  34   1
Line 2 filters the raw data down to those rows that contain expectancy ratings (key == "ER"), and then to those cases where the rating is zero (value == 0). The result is stored in the data frame inspect. One could look at this manually, but it’s more efficient and less error prone to get R to tell us how many times each participant has a zero rating. This is what line 3, table(inspect$subj), does - it counts the number of times each subject number occurs in inspect.
The output shows the number of trials on which each participant failed to give an expectancy rating. If a participant never made a zero rating, their participant number does not appear in the table. We can see that sixteen participants missed at least one rating, and that six participants missed more than three (one participant never made a rating, missing all 46 trials of the experiment!).
In this case, we decided to exclude all participants who missed more than three ratings:
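A sketch of the exclusion code described below. The participant numbers are inferred from the table above (those with more than three zero ratings), and the comment stands in for line 1, whose content isn’t shown here:

# Exclude participants who missed more than three expectancy ratings
exclude <- c(164, 166, 176, 199, 236, 239)
raw <- raw %>% filter(!(subj %in% exclude))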
Line 2 creates a list of the participant numbers we wish to exclude. In line 3, we remove those participants from the
raw data frame. The
filter command will be familiar from previous worksheets. The command
subj %in% exclude means ‘any participant whose subject number is in our list
exclude’. The !() in the filter statement negates this, so filter(!(subj %in% exclude)) means ‘keep the participants whose subject number is not in the exclude list’.
There are still a few trials where the participant makes no expectancy rating, and the computer records this as an expectancy of zero. There’s a good case to be made for recoding the rating on those trials as either
NA (meaning ‘missing data’) or
3, meaning “participant is unsure whether the noise will occur”. For brevity, we don’t make these changes here, but once you’ve gone through the whole worksheet, perhaps try to work out how one might do this.
The other measurement in this study was Galvanic Skin Response (GSR). Studies of GSR data normally log transform it before analysis. Log transforming data means to take the logarithm of the data, in this case the natural logarithm. Two reasonable questions about this preprocessing are: (a) what does that mean, and (b) why would you do it?
To answer the first question, a log transform compresses the range of the data. The following graph should help visualize this:
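If you want to draw a graph like this yourself, a minimal sketch using ggplot2 (part of the tidyverse; the data frame and column names here are just for illustration):

library(tidyverse)                        # provides tibble and ggplot2
log_demo <- tibble(x = 1:50, y = log(x))  # the numbers 1 to 50 and their natural logarithms
ggplot(log_demo, aes(x = x, y = y)) + geom_line()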
On the x-axis we have the numbers 1 to 50. On the y-axis, we have the natural logarithm of those numbers. As you can see, while the x-axis rises from 1 to 50, the y-axis rises from zero to around 4. Notice also that for each increase in x, the increase in y gets smaller. So, for example, where x rises from 1 to 2, log(x) rises from 0 to about 0.7; but where x rises from 2 to 3, log(x) rises by only about 0.4.
As a result, a log transform doesn’t change small numbers very much, but reduces large numbers a lot more. This can be useful if, for example, a few participants have unusually large GSRs. These very large GSRs would increase the variance of the data substantially, and this can make it harder to detect real differences between conditions. A log transform can reduce that problem, and hence can make it easier to find real effects in such situations. So, this also answers the second part of the question - we apply a log transform to GSR data because it can make it easier to find real effects.
We log transform our GSR data using the
mutate command that we used earlier:
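A sketch of the two lines described below; the filename and the new column name are assumptions for illustration, not the worksheet’s actual names:

sum.data <- read_csv("going-further/perruchet-sum.csv")    # line 1: filename assumed for illustration
sum.data <- sum.data %>% mutate(log_gsr = log(cval + 1))   # line 2: log transform; column name log_gsr assumed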
Line 1 loads the data from the same experiment, after it’s undergone some preprocessing. The
cval (for “corrected value”) column contains the GSR measurement that we need to log transform. You can follow the preprocessing steps that produced
sum.data in the Perruchet Effect worksheet.
The log command takes the natural logarithm. We calculate log(cval+1) rather than log(cval) because log(0) is negative infinity (R reports it as -Inf), which is not something that can be meaningfully analyzed. By adding 1 we avoid this problem (unless cval is -1 or smaller, which it seldom is).
This exercise uses data from a study which compared emotion regulation strategies between fans of mainstream (control group), goth, metal and emo music. Before and after each group listened to a clip of their preferred music, measurements were taken using the 20-item Positive and Negative Affect Schedule (PANAS). The data is in wide format, with one row per participant.
We start by loading the data and counting the rows.
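A sketch of this step, assuming the data file sits in going-further; the filename and data frame name here are for illustration only:

music <- read_csv("going-further/panas-music.csv")   # filename assumed for illustration
music %>% count()                                    # count the rows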
# A tibble: 1 x 1
      n
  <int>
1   297
Like the Perruchet Effect experiment, the 297 rows in this data include some duplicates. De-duplicate, then count the rows again. The result should look like this:
# A tibble: 1 x 1
      n
  <int>
1   294
This version of the PANAS used 10 questions to measure positive affect, and 10 to measure negative affect. Answers to questions were scored from 1 to 5, meaning that scores could range from 10 to 50. For the second negative affect measurement, the researchers calculated and entered the scores by hand. Use
filter() to find any participants with a
post_na score greater than 50. Use another
filter() command to exclude any participants who meet this criterion.
The ‘reciprocal’ of a number is the result of dividing 1 by that number. So the reciprocal of 10 is 1 / 10, or 0.1. More generally, if x is our number, 1 / x is the reciprocal. Like the log transformation, taking a reciprocal reduces the disproportionate effect that a small number of extreme values have on the variance.
Notice that taking a reciprocal reverses the order of your data: 5 is smaller than 10, but its reciprocal, 1 / 5 = 0.2, is larger than 1 / 10 = 0.1. This is easy to correct for by subtracting each value of x from the maximum value before the division; you don’t need to do that for this exercise. However, because dividing by zero does not give a usable number (in R, 1 / 0 gives Inf), we’ll use the formula 1 / (x + 1) to calculate the reciprocal.
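A quick demonstration of the formula on a few values, showing both the compression of large values and the reversal of order:

x <- c(0, 1, 4, 9)
1 / (x + 1)    # gives 1.0 0.5 0.2 0.1 - the larger x is, the smaller the result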
Use this formula to transform pre_na and post_na. Then use summarise() to calculate mean pre- and post- negative affect by subculture. The results should look like this:
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 4 x 3
  subculture `mean(pre_na)` `mean(post_na)`
  <chr>               <dbl>           <dbl>
1 Emo                0.0574          0.0672
2 Goth               0.0685          0.0768
3 Mainstream         0.0705          0.0786
4 Metal              0.0670          0.0784
Copy the R code you used for this exercise into PsycEL.
This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.