Adding participant numbers

Sometimes, a data file does not contain a participant number within it, it’s just provided as part of the filename. If you encounter this issue, here’s how to resolve it using the add_column command. The following assumes you have a project in Rstudio associated with the git repository used in the preprocessing worksheet.

library(tidyverse)
subj.11 <- read_csv('rawdata/subject-11.csv') %>%
  add_column(subj = 11, .before = "acc")  # .before = "acc" means: 'insert the new column before the existing column called "acc"'

In the case that you need to read multiple participants’ datafiles at once, we saw how to use do with read_csv in the preproc worksheet:

alldat <- tibble(filename = list.files("rawdata", "*.csv", full.names=TRUE)) %>% 
  group_by(filename) %>% 
  do(read_csv(.$filename))

Explanation of command: This is the same code we saw in the preproc worksheet. We use list.files to produce a list of all the files in the rawdata directory which end in .csv. This list is used to make a column in a new dataframe, which is piped to the group_by(filename) function. The grouped data is then piped to the do function. This works on each group (in this case, each filename) in turn and uses the filename column as input the read_csv command.
Because read_csv produces a dataframe as output, these are automatically combined into a single dataframe of all participants. The filename column remains and provides a record of where the data came from.


When you run this code, you should notice that alldat has a new column, filename. This contains the original file name of the raw data.

That’s OK, but it would be better if we could just have the participant number (e.g. 11) because it’s more compact and easy to use like that. So, we need to be able to cut out the participant number 11 from the filename. We can do this using the str_sub command. Here’s an example of how str_sub works:

str_sub("investment", 3, 6)
[1] "vest"

Explanation of command: str_sub is short for “string subset”, with a string being a collection of characters (e.g. a word) and a subset being part of that string. The first number, 3 is the start of the substring, and the second number 6 is the end of the substring. So, if we take from the third to the sixth character in “investment”, we get “vest”.

Looking at the filename rawdata/subject-11.csv, we can see that the participant number starts at the 17th position and ends at the 18th. This will be true for any two-digit participant number (a good reason to start subject numbers at 11 rather than at 1). So, putting this all together, we get:

alldat <- tibble(filename = list.files("rawdata", "*.csv", full.names=TRUE)) %>% 
  group_by(filename) %>% 
  do(read_csv(.$filename)) %>% 
  mutate(subj = str_sub(filename, 17, 18), .before="filename")

These four lines of code load and combine every data file, and extract the participant number for each row.

Matching patterns

If you didn’t always use 2-digit subject numbers in your experiment (e.g. you used 1..9, 10, 11, 12 and so on), or have more than 99 participants, there is another more advanced trick which can be useful.

The str_extract function uses a special language to define patterns in a string. These can be used to identify and extact regular or repeating patterns in your filenames. These patterns are called regular expressions. To give one example:

str_extract("participant-9999", "(\\d+)")
[1] "9999"

Explanation of the code: str_extract is being used to match patterns in the text "participant-9999". The pattern used is "\\d+". The \\d part means ’match any digit from 0 to 9. The + means, match as many of what went before as you can. So \\d+ means match as many digits as you can.

Exercise

Adapt the code from above to use str_extract rather than str_sub.

Optionally, if you think matching patterns in your text data might be a useful skill, see this guide for lots more detail: https://www.princeton.edu/~mlovett/reference/Regular-Expressions.pdf


This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.