Introduction

Intelligence, personality, and many other psychological constructs are often measured using scales. This type of data is normally collected using questionnaires (also called surveys). Answers to the questions are given numerical values, most commonly using a Likert scale. Likert scales associate numbers with a set of answers which express some degree of agreement with each question e.g. 0=Not at all, 1=A little, 2=Somewhat, 3=A lot, 4=Extremely. A formula is applied to the scores for some or all of the questions to calculate an overall score for the scale. The formula often just consists of adding up the individual scores (more on this below). This worksheet assumes that your survey software has recorded Likert responses as numbers. Refer back to the Cleaning up questionnaire data worksheet if you need a reminder of how to convert text responses to numbers.
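As a minimal sketch of the scoring idea described above, the code below looks up a number for each text answer and adds the numbers together. The labels and values are the ones from the example scale above; the object names likert_key and answers, and the three answers themselves, are made up for illustration.

```r
# Hypothetical answer key, using the example labels given above
likert_key <- c("Not at all" = 0, "A little" = 1, "Somewhat" = 2,
                "A lot" = 3, "Extremely" = 4)

# Made-up answers from one participant to a three-item scale
answers <- c("A little", "Extremely", "Not at all")

# Look up the number for each answer, then add them up
item_scores <- unname(likert_key[answers])
total_score <- sum(item_scores)

item_scores  # 1 4 0
total_score  # 5
```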

A psychometric scale is a scale which has undergone some degree of testing to ensure that it is a valid and reliable measure of the underlying construct. For example, a valid intelligence scale would truly measure intelligence, rather than some other construct (e.g. memory). A reliable scale gives consistent results, i.e. a person who completed the scale at different times would produce similar scores, as would two people who are similar in terms of the construct measured by the scale. Most published scales have been tested to ensure they are valid and reliable, so it’s advisable to use an existing scale, if one exists, rather than creating your own.

Surveys can be created using JISC, Gorilla Survey, OpenSesame, The Experiment Factory, Qualtrics and many other software packages. Most software will allow you to save your data as a CSV file. The precise structure of the data varies between packages, so you are likely to have to start by preprocessing your data.

In this worksheet, we’ll cover some common techniques you are likely to use to preprocess psychometric scale data. These techniques should be useful regardless of the software you used to administer your survey, although they will need slight modifications depending on the way your raw data is organised.

Getting started

To prepare for this worksheet:

1. Open the rminr-data project we used previously.

2. If you don’t see a folder named going-further, it means you created your project before the data required for this worksheet was added to the rminr-data git repository. You can get the latest files by asking git to “pull” the repository. Select the Git tab, which is located in the row of tabs which includes the Environment tab. Click the Pull button with a downward pointing arrow. A window will open showing the files which have been pulled from the repository. Close the Git pull window.

3. Open the Files tab. The going-further folder should contain the files dass21.csv and sses.csv.

4. Create a script named scales.R in the rminr-data folder (the folder above going-further). Add the comments and code to this script as you work through each section of the worksheet.

We start with some lines to clear the workspace and load tidyverse.

# Data preprocessing for scale
# Clear the environment
rm(list = ls())
library(tidyverse)

Our first step will be to load the data and remove columns from the raw survey data which aren’t needed for analysis. We’ll demonstrate this using some real data from the Depression Anxiety Stress Scales—21 (DASS-21, Henry & Crawford, 2005), a 21-item scale for measuring depression, anxiety and stress.

# Load data
# Select relevant columns of data
dass21_raw <- read_csv('going-further/dass21.csv') %>%
  select(partID, Age:DASS21)

Explanation of command:

1. We read the DASS-21 CSV file into the data frame dass21_raw.

2. We then select() just the columns in dass21_raw that we want to keep. The first column we select() is the participant ID, which is stored in the partID column. Arguments to select() can also be consecutive ranges of columns in a data frame, consisting of the first and last column name (ordered from left to right), separated by a :. This avoids having to type out long lists of column names. Here we use Age:DASS21 to select all columns from Age to DASS21, inclusive.
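To see how a column range behaves, here is a small sketch on a made-up data frame. The data frame toy_raw and its StartDate and Notes columns are hypothetical; they stand in for the extra columns that survey software often adds.

```r
library(tidyverse)

# Made-up data frame; StartDate and Notes are hypothetical unwanted columns
toy_raw <- tibble(
  StartDate = c("2020-01-01", "2020-01-02"),
  partID = c(1, 2),
  Age = c(19, 21),
  Gender = c(2, 1),
  DASS1 = c(0, 3),
  DASS2 = c(1, 2),
  Notes = c("", "")
)

# Keep partID, then every column from Age to DASS2 (left to right)
toy <- toy_raw %>% select(partID, Age:DASS2)

names(toy)  # "partID" "Age" "Gender" "DASS1" "DASS2"
```

The range includes both end columns, and anything outside it (here StartDate and Notes) is dropped.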

The table below shows the first few rows from dass21_raw. In this study, the data was recorded in “wide” format (one row for each participant). Notice that our data frame contains only the columns that we selected in the commands above. The DASS-21 scores are in columns DASS1-DASS21.

partID Age Gender Stage DASS1 DASS2 DASS3 DASS4 DASS5 DASS6 DASS7 DASS8 DASS9 DASS10 DASS11 DASS12 DASS13 DASS14 DASS15 DASS16 DASS17 DASS18 DASS19 DASS20 DASS21
34 18 2 2 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
35 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
36 37 2 2 2 1 3 1 3 1 0 1 2 1 1 1 2 0 2 3 1 2 1 0 2
37 19 2 1 1 1 1 1 2 2 0 1 1 0 1 1 1 0 1 1 1 1 0 0 0
38 20 2 1 1 0 1 0 2 1 0 1 0 0 1 0 1 0 0 0 1 1 1 0 0
39 20 2 2 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0

Handling missing data

If participants don’t complete (or partially complete) a survey, you may want to exclude their data from your analyses. Here are some rows from dass21_raw.

partID Age Gender Stage DASS1 DASS2 DASS3 DASS4 DASS5 DASS6 DASS7 DASS8 DASS9 DASS10 DASS11 DASS12 DASS13 DASS14 DASS15 DASS16 DASS17 DASS18 DASS19 DASS20 DASS21
106 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
107 19 1 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
108 18 0 2 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0
109 20 2 3 1 0 1 1 1 1 0 0 1 0 2 2 1 3 0 1 0 1 1 0 0

We can see that participants 108 and 109 (rows 3 and 4) have numbers in all columns, indicating that their data is complete. However, participants 106 and 107 have cells containing the value NA, which means these cells in the CSV file were empty. For participant 106, all cells are NA (perhaps they dropped out of the study), and for participant 107, all of the DASS-21 cells are NA (perhaps they skipped this survey).

If you select dass21_raw in the Environment pane and look through the rest of the data, you’ll see that participants 35, 49, 61, and 77 also have no data for this survey. We exclude these participants from the data frame.

# Exclude participants with no data
exclude <- c(35,49,61,77,106,107)
dass21 <- dass21_raw %>% filter(!(partID %in% exclude))

Explanation of commands:

Line 1 creates a vector of the participant IDs we wish to exclude. In line 2, we remove those participants from dass21_raw and store the result in dass21. The filter command will be familiar from previous worksheets. The expression partID %in% exclude is TRUE for any participant whose ID is in our vector exclude. Wrapping it in !() means ‘not’. So, filter(!(partID %in% exclude)) keeps the participants whose ID is not in exclude.
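The following sketch shows what %in% and !() produce on their own, before they are used inside filter(). The data frame toy and the shortened exclude vector are made up for illustration.

```r
library(tidyverse)

exclude <- c(35, 106)                       # IDs to drop (toy example)
toy <- tibble(partID = c(34, 35, 36, 106))  # made-up data frame

in_list <- toy$partID %in% exclude
in_list    # FALSE  TRUE FALSE  TRUE
!in_list   #  TRUE FALSE  TRUE FALSE

# filter() keeps the rows where the condition is TRUE
kept <- toy %>% filter(!(partID %in% exclude))
kept$partID  # 34 36
```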

If you look at the Value column in the Environment pane, you’ll see that dass21 now has six fewer rows than dass21_raw.
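Typing participant numbers by hand works for a handful of cases. As an alternative sketch (not the method used in this worksheet), participants with no item responses could be found by counting the non-missing answers in each row. The data frame below is made up and has only three item columns.

```r
library(tidyverse)

# Made-up data with the same pattern of missing values as described above
toy_raw <- tibble(
  partID = c(106, 107, 108),
  Age    = c(NA, 19, 18),
  DASS1  = c(NA, NA, 0),
  DASS2  = c(NA, NA, 0),
  DASS3  = c(NA, NA, 0)
)

# Count the non-missing item responses in each row
n_answered <- rowSums(!is.na(select(toy_raw, DASS1:DASS3)))

# Participant IDs with no item responses at all
exclude <- toy_raw$partID[n_answered == 0]
exclude  # 106 107

toy <- toy_raw %>% filter(!(partID %in% exclude))
nrow(toy_raw) - nrow(toy)  # 2 rows removed
```

Comparing the row counts before and after filtering is a quick check that the expected number of participants was removed.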

Calculating subscale scores

Our next step is to calculate the scores for the constructs measured by our scale. Many scales consist of groups of questions which measure multiple, distinct constructs. The DASS-21 is an example of a scale with subscale scores for depression, anxiety and stress. These are calculated by adding together responses for specific items, which we can do using the rowSums() function.

# Calculate depression subscale score
dass21 <- dass21 %>%
  mutate(depression = rowSums(dass21[4 + c(3,5,10,13,16,17,21)]))
# Add relevant columns to 'dass21_total'
dass21_total <- dass21 %>% select(partID, Age, Gender, depression)

Explanation of command:

• We use mutate() to create a depression column which is the sum of items 3, 5, 10, 13, 16, 17 and 21. Item 1 of the DASS-21 is in column 5 of dass21, so we add 4 to each item number to get the correct column numbers. The expression 4 + c(3,5,10,13,16,17,21) is an example of “vectorised addition”: it adds 4 to each number in the vector to the right of the +. Putting the result inside dass21[ ] selects those columns. For each row, the values in the selected columns are added together using rowSums(). We assign the result back to dass21, which now has a depression score for each participant.
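The sketch below works through the column arithmetic on two rows copied from the table of dass21_raw shown earlier (participants 34 and 36). The object names are made up, and selecting the columns by name with paste0() is an alternative we assume you might prefer, not the worksheet's method.

```r
library(tidyverse)

# Item responses for participants 34 and 36, copied from the table above
responses <- rbind(
  c(1, 1, 0, 0, 0, 1, rep(0, 15)),
  c(2, 1, 3, 1, 3, 1, 0, 1, 2, 1, 1, 1, 2, 0, 2, 3, 1, 2, 1, 0, 2)
)
colnames(responses) <- paste0("DASS", 1:21)

toy <- bind_cols(
  tibble(partID = c(34, 36), Age = c(18, 37), Gender = c(2, 2), Stage = c(2, 2)),
  as_tibble(responses)
)

dep_items <- c(3, 5, 10, 13, 16, 17, 21)

# Vectorised addition turns item numbers into column numbers
dep_cols <- 4 + dep_items
dep_cols  # 7 9 14 17 20 21 25

# Sum by column position, as in the worksheet
by_position <- rowSums(toy[dep_cols])

# Sum by column name (an alternative that does not depend on column order)
by_name <- rowSums(toy[paste0("DASS", dep_items)])

unname(by_position)  # 0 15
```

Both approaches give the same totals here. Selecting by name has the advantage that it still works if columns are added or removed before the items.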

Exercise 1

Use similar commands to add scores for anxiety and stress to dass21. The anxiety subscale is the sum of questions 2,4,7,9,15,19 and 20. The stress subscale is the sum of questions 1,6,8,11,12,14 and 18. After running your commands, the first few rows of dass21_total should look like this:

partID Age Gender depression anxiety stress
34 18 2 0 1 2
36 37 2 15 7 8
37 19 2 6 4 7
38 20 2 5 1 5
39 20 2 1 0 2
40 18 1 2 1 5
41 20 1 1 0 0
42 20 2 3 5 4
43 19 0 0 0 0
44 19 2 19 17 14

Copy the R code you used for this exercise, along with appropriate comments, into PsycEL.

Tidying survey data

Some data benefits from a little more tidying than simply removing columns which aren’t required. We’ll demonstrate this more advanced preprocessing using a different dataset. This data came from an experiment in which self-esteem was measured before and after participants completed one of two mental imagery conditions, or a control condition.

• Condition 1: participants visualised a negative mental image of themself
• Condition 2: participants visualised a negative mental image of someone else
• Condition 0 (control): participants did a card sorting task and did not think of any images.

The experiment used the State Self-Esteem Scale (SSES, Heatherton & Polivy, 1991), a 20-item scale used to measure short-lived (state) changes in self-esteem.

Enter this comment and command into your script, and run it:

# Load data into 'sses'
sses <- read_csv('going-further/sses.csv')

Here are the first three rows of sses:

partID Age Gender Stage Pre_SSE_1 Pre_SSE_2 Pre_SSE_3 Pre_SSE_4 Pre_SSE_5 Pre_SSE_6 Pre_SSE_7 Pre_SSE_8 Pre_SSE_9 Pre_SSE_10 Pre_SSE_11 Pre_SSE_12 Pre_SSE_13 Pre_SSE_14 Pre_SSE_15 Pre_SSE_16 Pre_SSE_17 Pre_SSE_18 Pre_SSE_19 Pre_SSE_20 Condition Post_SSE_1 Post_SSE_2 Post_SSE_3 Post_SSE_4 Post_SSE_5 Post_SSE_6 Post_SSE_7 Post_SSE_8 Post_SSE_9 Post_SSE_10 Post_SSE_11 Post_SSE_12 Post_SSE_13 Post_SSE_14 Post_SSE_15 Post_SSE_16 Post_SSE_17 Post_SSE_18 Post_SSE_19 Post_SSE_20
47 20 male 2 3 4 2 2 3 3 0 1 3 0 4 3 2 3 0 2 3 2 1 3 control 3 2 3 2 1 2 0 1 3 0 4 3 2 2 0 1 2 1 1 1
51 22 female 2 2 3 3 2 1 3 0 0 3 2 2 3 3 3 0 0 2 0 2 2 control 3 3 3 3 1 3 1 1 4 1 2 2 2 3 0 0 2 0 2 2
57 20 female 2 3 0 0 0 0 0 4 0 2 0 2 3 0 4 0 0 0 0 0 0 control 3 0 0 0 0 2 4 0 2 0 2 3 0 4 0 0 0 0 0 0

The data will be easier to analyse if we rename the columns. It will also be useful to divide the data into two data frames, one for the pre-intervention SSES, the other for the post-intervention SSES. We’ll do this in stages.

Enter this comment and command into your script, and run it:

# Place pre-intervention SSES into 'sses_pre_raw'
sses_pre_raw  <- sses %>% select(1, 5:25)

Explanation of command:

• sses_pre_raw <- sses %>% select(1, 5:25) - We select() column 1, and columns 5:25 from sses, and store the resulting data frame in sses_pre_raw. Column 1 is the participant ID, columns 5:24 are the pre-intervention SSES scores, and column 25 indicates which of the three experimental conditions the participant was assigned to.
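Selecting by position is compact, but it relies on counting columns correctly. As a sketch of an alternative (an assumption on our part, not the approach this worksheet uses), the same columns can be picked by name with the starts_with() helper. The data frame toy is a cut-down, made-up version of sses with two items per time point.

```r
library(tidyverse)

# Cut-down, made-up version of the sses column layout
toy <- tibble(
  partID = c(47, 51), Age = c(20, 22),
  Pre_SSE_1 = c(3, 2), Pre_SSE_2 = c(4, 3),
  Condition = c("control", "control"),
  Post_SSE_1 = c(3, 3), Post_SSE_2 = c(2, 3)
)

# By position: column 1, then columns 3 to 5
by_position <- toy %>% select(1, 3:5)

# By name: the same columns, chosen using a name prefix
by_name <- toy %>% select(partID, starts_with("Pre_SSE_"), Condition)

names(by_position)  # "partID" "Pre_SSE_1" "Pre_SSE_2" "Condition"
identical(names(by_position), names(by_name))  # TRUE
```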

Here are the first three participants of our pre-intervention data:

partID Pre_SSE_1 Pre_SSE_2 Pre_SSE_3 Pre_SSE_4 Pre_SSE_5 Pre_SSE_6 Pre_SSE_7 Pre_SSE_8 Pre_SSE_9 Pre_SSE_10 Pre_SSE_11 Pre_SSE_12 Pre_SSE_13 Pre_SSE_14 Pre_SSE_15 Pre_SSE_16 Pre_SSE_17 Pre_SSE_18 Pre_SSE_19 Pre_SSE_20 Condition
47 3 4 2 2 3 3 0 1 3 0 4 3 2 3 0 2 3 2 1 3 control
51 2 3 3 2 1 3 0 0 3 2 2 3 3 3 0 0 2 0 2 2 control
57 3 0 0 0 0 0 4 0 2 0 2 3 0 4 0 0 0 0 0 0 control

Next, we’ll rename the SSES columns based on their question number. This will make them easier to refer to in the rest of our code.

Enter this comment and command into your script, and run them:

# Rename columns
sses_pre_raw <- sses_pre_raw %>%
  set_names(~ str_to_lower(.) %>% str_replace_all("pre_sse_", "q"))

Explanation of command:

• set_names(~ str_to_lower(.) %>% str_replace_all("pre_sse_", "q")) - We use the function set_names() to rename our columns. The ~ is a way of telling set_names() to apply a function to the column names. The remainder of the command is a “sub-pipeline” which tidies up each column name. The command str_to_lower(.) converts a string (the . stands for the current column names) to lower case. The lower case name is piped to str_replace_all("pre_sse_", "q"), which replaces the text pre_sse_ with q wherever it appears in a name.

All our columns are now lowercase, and the SSES questions are named q1:q20:

partid q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q13 q14 q15 q16 q17 q18 q19 q20 condition
47 3 4 2 2 3 3 0 1 3 0 4 3 2 3 0 2 3 2 1 3 control
51 2 3 3 2 1 3 0 0 3 2 2 3 3 3 0 0 2 0 2 2 control
57 3 0 0 0 0 0 4 0 2 0 2 3 0 4 0 0 0 0 0 0 control
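To see what the renaming “sub-pipeline” does, it can help to run the two string functions on a plain vector of names first. In this sketch the vector old_names and the one-row data frame toy are made up for illustration.

```r
library(tidyverse)

# Apply the two string functions directly to some column names
old_names <- c("partID", "Pre_SSE_1", "Pre_SSE_20", "Condition")
new_names <- old_names %>%
  str_to_lower() %>%
  str_replace_all("pre_sse_", "q")
new_names  # "partid" "q1" "q20" "condition"

# The same renaming applied to the columns of a made-up data frame
toy <- tibble(partID = 47, Pre_SSE_1 = 3, Pre_SSE_20 = 3, Condition = "control")
toy <- toy %>%
  set_names(~ str_to_lower(.) %>% str_replace_all("pre_sse_", "q"))
names(toy)
```

Converting to lower case first matters: the pattern "pre_sse_" would not match the original "Pre_SSE_" prefix.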

Now we’ll convert some columns to factors.

Enter this comment and command into your script, and run them:

# Convert columns to factors; add factor column 'time', set to 'pre'; select relevant columns
sses_pre_raw <- sses_pre_raw %>%
  mutate(subj = factor(partid), condition = factor(condition),
         time = factor('pre')) %>%
  select(subj, condition, time, q1:q20)

Explanation of commands:

1. mutate(subj = factor(partid), condition = factor(condition), time = factor('pre')) - We use mutate to add and modify some columns. The argument subj = factor(partid) creates a new column named subj (which is a bit clearer than partid) by copying the partid column and making it a factor. The argument condition = factor(condition) makes the condition column a factor. The argument time = factor('pre') creates a new factor called time and sets all values to pre.

2. select(subj, condition, time, q1:q20) puts our columns in a more logical order and drops partid, which is no longer needed.
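The sketch below runs the same kind of mutate() on a made-up data frame and then inspects the result. Participant 60 and the condition label 'self' are hypothetical, and only one question column is included.

```r
library(tidyverse)

# Made-up data; participant 60 and the label 'self' are hypothetical
toy <- tibble(
  partid = c(47, 51, 60),
  condition = c("control", "control", "self"),
  q1 = c(3, 2, 1)
)

toy <- toy %>%
  mutate(subj = factor(partid), condition = factor(condition),
         time = factor('pre')) %>%
  select(subj, condition, time, q1)

is.factor(toy$subj)    # TRUE
levels(toy$condition)  # "control" "self"
levels(toy$time)       # "pre"
names(toy)             # "subj" "condition" "time" "q1"
```

The single value 'pre' is repeated for every row, so time has one level. Because partid is not listed in select(), it is no longer in the data frame.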

Our data is now much tidier:

subj condition time q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q13 q14 q15 q16 q17 q18 q19 q20
47 control pre 3 4 2 2 3 3 0 1 3 0 4 3 2 3 0 2 3 2 1 3
51 control pre 2 3 3 2 1 3 0 0 3 2 2 3 3 3 0 0 2 0 2 2
57 control pre 3 0 0 0 0 0 4 0 2 0 2 3 0 4 0 0 0 0 0 0

Note that we could do all of these steps in a single pipeline. DO NOT enter these commands; you do not need to do the same thing twice. This is just an illustration of how the previous commands could be combined.

sses_pre_raw <- select(sses, 1, 5:25) %>%
  set_names(~ str_to_lower(.) %>% str_replace_all("pre_sse_", "q")) %>%
  mutate(subj = factor(partid), condition = factor(condition),
         time = factor('pre')) %>%
  select(subj, condition, time, q1:q20)

Exercise 2

Write a similar pipeline (including comments) to create a data frame named sses_post_raw containing the post-intervention SSES data. The condition and post-intervention SSES data are in columns 25:45. The SSES columns have the prefix post_sse_ rather than pre_sse_. Set the value in the time factor to post. After running your commands, the first few rows of sses_post_raw should look like this:

subj condition time q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q13 q14 q15 q16 q17 q18 q19 q20
47 control post 3 2 3 2 1 2 0 1 3 0 4 3 2 2 0 1 2 1 1 1
51 control post 3 3 3 3 1 3 1 1 4 1 2 2 2 3 0 0 2 0 2 2
57 control post 3 0 0 0 0 2 4 0 2 0 2 3 0 4 0 0 0 0 0 0

Copy the R code and comments you used for this exercise into PsycEL.

References

Heatherton, T. F., & Polivy, J. (1991). Development and validation of a scale for measuring state self-esteem. Journal of Personality and Social Psychology, 60(6), 895.

Henry, J. D., & Crawford, J. R. (2005). The short-form version of the Depression Anxiety Stress Scales (DASS-21): Construct validity and normative data in a large non-clinical sample. British Journal of Clinical Psychology, 44(2), 227–239.