Sometimes a data file does not contain a participant number; the number is provided only as part of the filename. If you encounter this issue, here’s how to resolve it using the add_column command. The following assumes you have a project in RStudio associated with the git repository used in the preprocessing worksheet.
library(tidyverse)

subj.11 <- read_csv('rawdata/subject-11.csv') %>%
  add_column(subj = 11, .before = "acc")
# .before = "acc" means: 'insert the new column before the existing column called "acc"'
If you need to read several participants’ data files at once, recall how we used read_csv in the preproc worksheet:
alldat <- tibble(filename = list.files("rawdata", "*.csv", full.names=TRUE)) %>%
  group_by(filename) %>%
  do(read_csv(.$filename))
Explanation of command: This is the same code we saw in the preproc worksheet. We use list.files to produce a list of all the files in the rawdata directory which end in .csv. This list is used to make a column in a new dataframe, which is piped to the group_by(filename) function. The grouped data is then piped to the do function. This works on each group (in this case, each filename) in turn and uses the filename column as input to read_csv. Each call to read_csv produces a dataframe as output, and these are automatically combined into a single dataframe of all participants. The filename column remains and provides a record of where the data came from.
When you run this code, you should notice that
alldat has a new column,
filename. This contains the original file name of the raw data.
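If you would like to see the filename column appear without touching your own data, the following is a minimal sketch that builds two tiny made-up files in a temporary folder and runs the same pipeline on them. The folder name filename-demo and the columns acc and rt are assumptions made purely for illustration, and the file pattern is written as "\\.csv$", which is a stricter way of saying "ends in .csv".

```r
library(tidyverse)

# Work in a fresh temporary folder so nothing in your project is touched.
# (The folder name "filename-demo" is hypothetical.)
demo_dir <- file.path(tempdir(), "filename-demo")
dir.create(file.path(demo_dir, "rawdata"), recursive = TRUE, showWarnings = FALSE)
old_wd <- setwd(demo_dir)

# Two tiny made-up data files; the columns acc and rt are invented for this demo.
write_csv(tibble(acc = c(1, 0), rt = c(512, 634)), "rawdata/subject-11.csv")
write_csv(tibble(acc = c(1, 1), rt = c(480, 455)), "rawdata/subject-12.csv")

# The same pipeline as in the worksheet: list the files, then read each one.
alldat <- tibble(filename = list.files("rawdata", "\\.csv$", full.names = TRUE)) %>%
  group_by(filename) %>%
  do(read_csv(.$filename))

# Return to the folder we started in.
setwd(old_wd)

# Each row is labelled with the file it came from.
alldat %>% ungroup() %>% count(filename)
```

The final line counts the rows contributed by each file, which is a quick way to check that every file was read.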
That’s OK, but the participant number alone (e.g. 11) would be more compact and easier to work with. So, we need to cut the participant number 11 out of the filename. We can do this using the str_sub command. Here’s an example of how it works:
str_sub("investment", 3, 6)
Explanation of command: str_sub is short for “string subset”, with a string being a collection of characters (e.g. a word) and a subset being part of that string. The first number, 3, is the start of the substring, and the second number, 6, is the end of the substring. So, if we take from the third to the sixth character in “investment”, we get “vest”.
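To get a feel for str_sub before applying it to filenames, here is a short sketch you can paste into the console. The variable names words and last_four are just for illustration.

```r
library(stringr)

# The worksheet's example: characters 3 to 6 of "investment".
str_sub("investment", 3, 6)
# gives "vest"

# str_sub works on a whole vector of strings at once,
# which is why it can later be applied to an entire column.
words <- c("investment", "participant")
str_sub(words, 1, 4)
# gives "inve" "part"

# Negative positions count backwards from the end of the string.
last_four <- str_sub("investment", -4, -1)
last_four
# gives "ment"
```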
Looking at the filename
rawdata/subject-11.csv, we can see that the participant number starts at the 17th position and ends at the 18th. This will be true for any two-digit participant number (a good reason to start subject numbers at 11 rather than at 1). So, putting this all together, we get:
alldat <- tibble(filename = list.files("rawdata", "*.csv", full.names=TRUE)) %>%
  group_by(filename) %>%
  do(read_csv(.$filename)) %>%
  mutate(subj = str_sub(filename, 17, 18), .before="filename")
These four lines of code load and combine every data file, and extract the participant number for each row.
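One detail worth knowing is that str_sub returns text, so subj holds the characters "11" rather than the number 11. If you want a number (for example, to match the subj = 11 used with add_column earlier), you can wrap the call in as.numeric. The sketch below shows this on a small hand-made table of filenames, so it runs without any data files; the names files and with_subj are hypothetical.

```r
library(dplyr)
library(stringr)

# A stand-in for the filename column of alldat.
files <- tibble(filename = c("rawdata/subject-11.csv", "rawdata/subject-12.csv"))

# Cut out characters 17 to 18 and convert the text to a number.
with_subj <- files %>%
  mutate(subj = as.numeric(str_sub(filename, 17, 18)), .before = "filename")

with_subj
```

Whether you convert is a matter of taste: text is fine for a label, while numbers are handier if you later want to sort or filter by participant.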
If you didn’t always use 2-digit subject numbers in your experiment (e.g. you used 1..9, 10, 11, 12 and so on), or have more than 99 participants, there is another more advanced trick which can be useful.
The str_extract function uses a special language to define patterns in a string. These can be used to identify and extract regular or repeating patterns in your filenames. These patterns are called regular expressions. To give one example:
str_extract("participant-9999", "\\d+")
Explanation of the code: str_extract is being used to match patterns in the text "participant-9999". The pattern used is \\d+. The \\d part means ‘match any digit from 0 to 9’. The + means ‘match as many of what went before as you can’. So \\d+ means ‘match as many digits as you can’, which here gives "9999".
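To see how \\d+ copes with numbers of different lengths, here is a small sketch using some made-up labels (the strings and variable names are hypothetical, chosen only for illustration). It also shows two behaviours that are easy to trip over: str_extract returns NA when nothing matches, and it returns only the first run of digits it finds.

```r
library(stringr)

# Participant labels with one-, two- and three-digit numbers.
ids <- c("participant-7", "participant-42", "participant-123")
numbers <- str_extract(ids, "\\d+")
numbers
# gives "7" "42" "123"

# If there are no digits at all, the result is NA.
no_match <- str_extract("participant-x", "\\d+")

# Only the first run of digits is returned, so check what comes first in your text.
first_run <- str_extract("session-2-participant-15", "\\d+")
first_run
# gives "2"

# Like str_sub, str_extract returns text; convert if you need numbers.
as.numeric(numbers)
```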
Adapt the code from above to use str_extract rather than str_sub.
Optionally, if you think matching patterns in your text data might be a useful skill, see this guide for lots more detail: https://www.princeton.edu/~mlovett/reference/Regular-Expressions.pdf
This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.