In this worksheet, we’ll cover some additional techniques for preprocessing experimental data.

Getting started

To prepare for this worksheet:

  1. Open the rminr-data project we used previously.

  2. Open the Files tab. You should see a going-further which contains the files perruchet-raw.csv and perruchet-sum.csv.

  3. If you don’t see the folder or files, ask git to “pull” the latest version of the repository. Select the Git tab, which is located in the row of tabs which includes the Environment tab. Click the Pull button with a downward pointing arrow. A window will open showing the files which have been pulled from the repository. Close the Git pull window.

  4. Create a script named preproc-experiments.R in the rminr-data folder (the folder above case-studies). Put all the commands from this worksheet into this file, and run them from there. Save your script regularly.

Loading data

We’ll start by loading some data from an undergraduate student dissertation on the Perruchet Effect.

# A tibble: 6 x 4
   subj  time key   value
  <dbl> <dbl> <chr> <dbl>
1   200    50 GS        0
2   200   106 GS        0
3   200   153 GS        0
4   200   201 GS        0
5   200   253 GS        0
6   200   301 GS        0

Explanation of commands:

Line 1 clears the environment. Lines 2 and 3 should be familiar from previous worksheets; we nearly always need the tidyverse package, and we use the read_csv command to load the data. The going-further/ part of the filename going-further/perruchet-raw.csv says that the file perruchet-raw.csv is to be found in the directory (folder) called going-further.

Line 4 renames the columns, as covered in the Preprocessing data from experiments worksheet. Line 5 prints the first few rows of raw.

Explanation of output:

Our data is in long format i.e. one observation per row.

De-duplicating data

We’re now going to get the data into a state where we can meaningfully analyse it:

Explanation of commands

Line 2, raw <- distinct(raw) is a de-duplication command. This experiment was run by three people over many different days. They tried to manually stitch all their data files together, by copying-and-pasting them into one big file in Excel. This type of manual manipulation of data is notoriously error prone, and in the process, they accidentally duplicated some data. In this case, it’s easy to detect and fix in R, because no two rows of this data can be identical - the subject number or the time is always going to be different. So, we can remove duplicates by telling R to give us just one copy of each of the different rows in the data set. The command distinct does that for us.

Line 3, raw <- raw %>% filter(subj > 161) should be familiar from previous worksheets; we are keeping only those participants whose subject number is greater than 161. This is because, in this study, we first ran a short pilot experiment, looked at the data, and made a few changes to the experiment. Here, we only wanted to analyze the main study, so we removed the pilot participants.

Excluding participants

As part of this experiment, participants were told to rate their expectancy of hearing an unpleasant noise, on every trial. However, they only had 8 seconds to do this, and the experimenters noticed that sometimes people failed to make a rating. The experimental apparatus records this as a rating of zero.

Participants who miss a lot of ratings aren’t really following the instructions, making their data hard to interpret. In such cases, we normally exclude those participants from further analysis. First, we need to find out how widespread this problem is. One way to do this is to calculate the number of zero ratings each participant makes:

163 164 166 176 184 188 191 196 199 202 223 233 235 236 239 246 
  1  19  46  16   2   3   2   1  32   3   3   1   3   9  34   1 

Explanation of commands

Line 2 filters those rows of the raw data that contain expectancy ratings (key == "ER") and then to cases where that rating is zero (value == 0). This is then stored in the data frame inspect. One could look at this manually, but it’s more efficient and less error prone to get R to tell us how many times each participant has a zero rating. This is what line 3, table(inspect$subj) does - it counts the number of times each subject number occurs in inspect.

Explanation of output

The output shows the number of trials on which each participant failed to give an expectancy rating. If a participant never made a zero rating, their participant number does not appear on the table. We can see that sixteen participants missed at least one rating, and that six participants missed more than 3 (one participant never made a rating, missing all 46 trials of the experiment!).

Removing participants from the dataset

In this case, we decided to exclude all participants who missed more than three ratings:

Explanation of commands

Line 2 creates a list of the participant numbers we wish to exclude. In line 3, we remove those participants from the raw data frame. The filter command will be familiar from previous worksheets. The command subj %in% exclude means ‘any participant whose subject number is in our list exclude’. The use of !() in the filter statement means not. So, filter(!(subj %in% exclude)) means keep the participants whose subject number is not in the exclude list.


There are still a few trials where the participant makes no expectancy rating, and the computer records this as an expectancy of zero. There’s a good case to be made for recoding the rating on those trials as either NA (meaning ‘missing data’) or 3, meaning “participant is unsure whether the noise will occur”. For brevity, we don’t make these changes here, but once you’ve gone through the whole worksheet, perhaps try to work out how one might do this.

Log transform

The other measurement in this study was Galvanic Skin Response (GSR). Studies of GSR data normally log transform it before analysis. Log transforming data means to take the logarithm of the data, in this case the natural logarithm. Two reasonable questions about this preprocessing are: (a) what does that mean, and (b) why would you do it?

To answer the first question, a log transform compresses the range of the data. The following graph should help visualize this:

On the x-axis we have the numbers 1 to 50. On the y-axis, we have the natural logarithm of those numbers. As you can see, while the x-axis rises from 1 to 50, the y-axis rises from zero to around 4. You should also be able to notice that for each increase in x, the increase in y gets smaller. So, for example, where x rises from 1 to 2, log(x) rises from 0 to about .8; but where x rises from 2 to 3, log(x) rises by about 0.4.

As a result, a log transform doesn’t change small numbers very much, but reduces large numbers a lot more. This can be useful if, for example, a few participants have unusually large GSRs. These very large GSRs would increase the variance of the data substantially, and this can make it harder to detect real differences between conditions. A log transform can reduce that problem, and hence can make it easier to find real effects in such situations. So, this also answers the second part of the question - we apply a log transform to GSR data because it can make it easier to find real effects.

We log transform our GSR data using the mutate command that we used earlier:

Explanation of commands

Line 1 loads the data from the same experiment, after it’s undergone some preprocessing. The cval (for “corrected value”) column contains the GSR measurement that we need to log transform. You can follow the preprocessing steps that produced in the Perruchet Effect worksheet.

The command log takes the natural logarithm. We calculate log(cval+1) rather than log(cval) because the log(0) is an infinitely small number, which is something that cannot be analyzed. By adding 1 we don’t hit this problem (unless cval is -1 or smaller, which it seldom is).


This exercise uses data from a study which compared emotion regulation strategies between fans of mainstream (control group), goth, metal and emo music. Before and after each group listened to a clip of their preferred music, measurements were taken using the 20-item Positive and Negative Affect Schedule (PANAS). The data is in wide format, with one row per participant.

We start by loading the data and counting the rows.

# A tibble: 1 x 1
1   297

Step 1

Like the Perruchet Effect experiment, the 297 rows in this data includes some duplicates. De-duplicate then count the rows. The result should look like this:

# A tibble: 1 x 1
1   294

Step 2

This version of the PANAS used 10 questions to measure positive affect, and 10 to measure negative affect. Answers to questions were scored from 1 to 5, meaning that scores could range from 10-50. For the second negative affect measurement, the researchers calculated and entered the scores by hand. Use filter() to find any participants with a post_na score greater than 50. Use another filter() command to exclude any participants who meet this criterion.

Step 3

The ‘reciprocal’ of a number is result of dividing 1 by that number. So the reciprocal of 10 is 1 / 10, or 0.1. More generally, if x is our number, 1 / x is the reciprocal. Like the log transformation, taking a reciprocal reduces the disproprotional effect that a small number of extreme values have on the variance.

Notice that taking a reciprocal reverses the order of your data e.g. 1 / 5 = 0.2, which is larger than 0.1. This is easy to correct for by subtracting each value of x from the maximum value before the division. You don’t need to do that for this exercise. However, because 1 / 0 will produce an error, we’ll use the formula 1 / (x + 1) to calculate a reciprocal.

Use this formula to transform pre_na and post_na. Then use summarise() to calculate mean pre- and post- negative affect by subculture. The results should look like this.

`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 4 x 3
  subculture `mean(pre_na)` `mean(post_na)`
  <chr>               <dbl>           <dbl>
1 Emo                0.0574          0.0672
2 Goth               0.0685          0.0768
3 Mainstream         0.0705          0.0786
4 Metal              0.0670          0.0784

Copy the R code you used for this exercise into PsycEL.

This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.