Data preprocessing for scales

Introduction
Getting started
Loading and selecting data
Handling missing data
Calculating subscale scores
Exercise 1
Tidying survey data
Exercise 2

Introduction

Intelligence, personality, and many other psychological constructs are often measured using scales. This type of data is normally collected using questionnaires (also called surveys). Answers to the questions are given numerical values, most commonly using a Likert scale. Likert scales associate numbers with a set of answers which express some degree of agreement with each question e.g. 0=Not at all, 1=A little, 2=Somewhat, 3=A lot, 4=Extremely. A formula is applied to the scores for some or all of the questions to calculate an overall score for the scale. The formula often just consists of adding up the individual scores (more on this below). This worksheet assumes that your survey software has recorded Likert responses as numbers. Refer back to the Cleaning up questionnaire data worksheet if you need a reminder of how to convert text responses to numbers.
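As a minimal sketch of the scoring idea described above, the following base R code converts text answers to Likert numbers and adds them up. The five answers are invented for illustration, and the names likert, answers, scores and total are hypothetical rather than part of the worksheet's script.

```r
# Likert coding taken from the example in the text
likert <- c("Not at all" = 0, "A little" = 1, "Somewhat" = 2,
            "A lot" = 3, "Extremely" = 4)

# Invented answers from one participant to a five-item scale
answers <- c("Somewhat", "Not at all", "A lot", "A little", "Extremely")

# Look up the number for each answer, dropping the labels
scores <- unname(likert[answers])

# A simple overall score: the sum of the item scores
total <- sum(scores)
total
# 10
```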

A psychometric scale is a scale which has undergone some degree of testing to ensure that it is a valid and reliable measure of the underlying construct. For example, a valid intelligence scale would truly measure intelligence, rather than some other construct (e.g. memory). A reliable scale gives consistent results, i.e. a person who completed the scale at different times would produce similar scores, as would two people who are similar in terms of the construct measured by the scale. Most published scales have been tested to ensure they are valid and reliable, so it’s advisable to use an existing scale, if one exists, rather than creating your own.

Surveys can be created using JISC, Gorilla Survey, OpenSesame, The Experiment Factory, Qualtrics and many other software packages. Most software will allow you to save your data as a CSV file. The precise structure of the data varies between packages, so you are likely to have to start by preprocessing your data.

In this worksheet, we’ll cover some common techniques you are likely to use to preprocess psychometric scale data. These techniques should be useful regardless of the software you used to administer your survey, although they will need slight modifications depending on the way your raw data is organised.

Getting started

To prepare for this worksheet:

Open the rminr-data project we used previously.
If you don’t see a folder named going-further, it means you created your project before the data required for this worksheet was added to the rminr-data git repository. You can get the latest files by asking git to “pull” the repository. Select the Git tab, which is located in the row of tabs which includes the Environment tab. Click the Pull button with a downward pointing arrow. A window will open showing the files which have been pulled from the repository. Close the Git pull window.
Open the Files tab. The going-further folder should contain the files dass21.csv and sses.csv.
Create a script named scales.R in the rminr-data folder (the folder above going-further). Add the comments and code to this script as you work through each section of the worksheet.

We start with some lines to clear the workspace and load tidyverse.

Enter these comments and commands into your script, and run them:

# Data preprocessing for scales
# Clear the environment
rm(list = ls()) 
# Load tidyverse
library(tidyverse)

Loading and selecting data

Our first step will be to load the data and remove columns from the raw survey data which aren’t needed for analysis. We’ll demonstrate this using some real data from the Depression Anxiety Stress Scales—21 (DASS-21, Henry & Crawford, 2005), a 21-item scale for measuring depression, anxiety and stress.

Enter these comments and commands into your script, and run them:

# Load data
dass21_raw <- read_csv("going-further/dass21.csv")
# Select relevant columns of data
dass21_raw <- dass21_raw %>% select(partID, Age:DASS21)

Explanation of command:

We read the DASS-21 CSV file into the data frame dass21_raw.
We then select() just the columns in dass21_raw that we want to keep. The first column we select() is the participant ID, which is stored in the partID column. Arguments to select() can also be consecutive ranges of columns in a data frame, consisting of the first and last column name (ordered from left to right), separated by a :. This avoids having to type out long lists of column names. Here we use Age:DASS21 to select all columns from Age to DASS21, inclusive.
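To see how a column range behaves, here is a small sketch on a toy data frame. The data frame toy, its values, and the Start and Notes columns are invented for illustration. You do not need to add this to scales.R.

```r
library(tidyverse)

# Toy data frame with invented values (not the real DASS-21 data)
toy <- tibble(
  partID = c(1, 2),
  Start  = c("09:00", "09:05"),
  Age    = c(18, 20),
  Gender = c(2, 1),
  DASS1  = c(0, 1),
  DASS2  = c(1, 3),
  Notes  = c("", "late")
)

# Keep partID, plus every column from Age to DASS2 (left to right)
toy_selected <- toy %>% select(partID, Age:DASS2)

names(toy_selected)
# "partID" "Age" "Gender" "DASS1" "DASS2"
```

The Start and Notes columns are dropped because they fall outside the range and were not named individually.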

The table below shows the first few rows from dass21_raw. In this study, the data was recorded in “wide” format (one row for each participant). Notice that our data frame contains only the columns that we selected in the commands above. The DASS-21 scores are in columns DASS1-DASS21.

partID	Age	Gender	Stage	DASS1	DASS2	DASS3	DASS4	DASS5	DASS6	DASS7	DASS8	DASS9	DASS10	DASS11	DASS12	DASS13	DASS14	DASS15	DASS16	DASS17	DASS18	DASS19	DASS20	DASS21
34	18	2	2	1	1	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
35	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
36	37	2	2	2	1	3	1	3	1	0	1	2	1	1	1	2	0	2	3	1	2	1	0	2
37	19	2	1	1	1	1	1	2	2	0	1	1	0	1	1	1	0	1	1	1	1	0	0	0
38	20	2	1	1	0	1	0	2	1	0	1	0	0	1	0	1	0	0	0	1	1	1	0	0
39	20	2	2	0	0	0	0	1	0	0	1	0	0	0	0	0	0	0	0	0	1	0	0	0

Handling missing data

If participants don’t complete (or only partially complete) a survey, you may want to exclude their data from your analyses. Here are some rows from dass21_raw.

partID	Age	Gender	Stage	DASS1	DASS2	DASS3	DASS4	DASS5	DASS6	DASS7	DASS8	DASS9	DASS10	DASS11	DASS12	DASS13	DASS14	DASS15	DASS16	DASS17	DASS18	DASS19	DASS20	DASS21
106	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
107	19	1	0	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
108	18	0	2	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	1	0	0
109	20	2	3	1	0	1	1	1	1	0	0	1	0	2	2	1	3	0	1	0	1	1	0	0

We can see that participants 108 and 109 (rows 3 and 4) have numbers in all columns, indicating that their data is complete. However, participants 106 and 107 have cells containing the value NA, which means these cells in the CSV file were empty. For participant 106, all cells are NA (perhaps they dropped out of the study), and for participant 107, all of the DASS-21 cells are NA (perhaps they skipped this survey).

If you select dass21_raw in the Environment pane and look through the rest of the data, you’ll see that participants 35, 49, 61, and 77 also have no data for this survey. We exclude these participants from the data frame.
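Looking through the data by eye works for a small file, but the same check can be done in code. The sketch below is one possible approach, shown on invented toy data with only two DASS columns. It assumes a version of dplyr that provides if_all() (1.0.4 or later). You do not need to add this to scales.R.

```r
library(tidyverse)

# Toy data with invented values; the real file has 21 DASS columns
toy <- tibble(
  partID = c(106, 107, 108),
  Age    = c(NA, 19, 18),
  DASS1  = c(NA, NA, 0),
  DASS2  = c(NA, NA, 1)
)

# Keep only the rows where every column starting with "DASS" is NA
no_data <- toy %>% filter(if_all(starts_with("DASS"), is.na))

no_data$partID
# 106 107
```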

Enter these comments and commands into your script, and run them:

# Exclude participants with no data
exclude <- c(35,49,61,77,106,107)
dass21 <- dass21_raw %>% filter(!(partID %in% exclude))

Explanation of commands:

Line 1 creates a vector of the participant IDs we wish to exclude. In line 2, we remove those participants from dass21_raw. The filter command will be familiar from previous worksheets. The expression partID %in% exclude means ‘any participant whose ID is in our vector exclude’. The use of !() in the filter statement means not. So, filter(!(partID %in% exclude)) means keep the participants whose ID is not in exclude.

If you look at the Value column in the Environment pane, you’ll see that dass21 now has six fewer rows than dass21_raw.
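If the combination of %in% and !() is unfamiliar, it may help to try them on a short vector outside of filter(). The names toy_ids, toy_exclude and kept below are invented for this illustration.

```r
# Three participant IDs, and a vector of IDs to exclude
toy_ids <- c(34, 35, 36)
toy_exclude <- c(35, 49)

# TRUE where the ID appears in toy_exclude
toy_ids %in% toy_exclude
# FALSE  TRUE FALSE

# !() flips each TRUE to FALSE and each FALSE to TRUE
!(toy_ids %in% toy_exclude)
# TRUE FALSE  TRUE

# Keeping only the IDs that are not excluded
kept <- toy_ids[!(toy_ids %in% toy_exclude)]
kept
# 34 36
```

filter() does the same thing row by row: it keeps the rows for which the expression is TRUE.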

Calculating subscale scores

Our next step is to calculate the scores for the constructs measured by our scale. Many scales consist of groups of questions which measure multiple, distinct constructs. The DASS-21 is an example of a scale with subscale scores for depression, anxiety and stress. These are calculated by adding together responses for specific items, which we can do using the rowSums() function.

Enter these comments and commands into your script, and run them:

# Calculate depression subscale score
dass21 <- dass21 %>%
  mutate(depression = rowSums(dass21[4 + c(3,5,10,13,16,17,21)]))
# Add relevant columns to 'dass21_total'
dass21_total <- dass21 %>% select(partID, Age, Gender, depression)

Explanation of command:

We use mutate() to create a depression column which is the sum of items 3, 5, 10, 13, 16, 17 and 21. Item 1 of the DASS-21 data is in column 5 of dass21, so we add 4 to each item number to select the correct columns to add together. The expression 4 + c(3,5,10,13,16,17,21) is an example of “vectorised addition”: it adds 4 to each item number in the vector to the right of the +, giving the column numbers 7, 9, 14, 17, 20, 21 and 25. Wrapping this in dass21[ ] selects those columns. For each row, the values in the selected columns are added together using rowSums(). We assign the result back to dass21, thereby adding a depression score to each row.
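The sketch below shows the vectorised addition on its own, and then applies the same idea to an invented three-item toy data frame. It also shows one possible alternative, building the column names with paste0() instead of relying on column positions. The names items, toy and toy_items, and the toy values, are assumptions made for this illustration.

```r
library(tidyverse)

# Item numbers for the depression subscale (from the worksheet)
items <- c(3, 5, 10, 13, 16, 17, 21)

# Vectorised addition: 4 is added to every element of the vector
4 + items
# 7  9 14 17 20 21 25

# Toy data: item 1 is in column 2, so the offset here is 1, not 4
toy <- tibble(partID = c(1, 2),
              DASS1 = c(0, 1), DASS2 = c(1, 2), DASS3 = c(3, 0))
toy_items <- c(1, 3)

toy <- toy %>%
  mutate(
    # Select columns by position, as in the worksheet
    score = rowSums(toy[1 + toy_items]),
    # Select the same columns by name instead
    score_by_name = rowSums(toy[paste0("DASS", toy_items)])
  )

toy$score
# 3 1
```

Selecting by name gives the same scores here, and would keep working if columns were added or reordered before the DASS items.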

Exercise 1

Use similar commands to add scores for anxiety and stress to dass21. The anxiety subscale is the sum of questions 2,4,7,9,15,19 and 20. The stress subscale is the sum of questions 1,6,8,11,12,14 and 18. After running your commands, the first few rows of dass21_total should look like this:

partID	Age	Gender	depression	anxiety	stress
34	18	2	0	1	2
36	37	2	15	7	8
37	19	2	6	4	7
38	20	2	5	1	5
39	20	2	1	0	2
40	18	1	2	1	5
41	20	1	1	0	0
42	20	2	3	5	4
43	19	0	0	0	0
44	19	2	19	17	14

Copy the R code you used for this exercise, along with appropriate comments, into PsycEL.

Tidying survey data

Some data benefits from a little more tidying than simply removing columns which aren’t required. We’ll demonstrate this more advanced preprocessing using a different dataset. This data came from an experiment in which self-esteem was measured before and after participants completed one of two mental imagery conditions, or a control condition.

Condition 1: participants visualised a negative mental image of themselves.
Condition 2: participants visualised a negative mental image of someone else.
Condition 0 (control): participants did a card sorting task and did not think of any images.

The experiment used the State Self-Esteem Scale (SSES, Heatherton & Polivy, 1991), a 20-item scale used to measure short-lived (state) changes in self-esteem.

Enter this comment and command into your script, and run it:

# Load data into 'sses'
sses <- read_csv('going-further/sses.csv')

partID	Age	Gender	Stage	Pre_SSE_1	Pre_SSE_2	Pre_SSE_3	Pre_SSE_4	Pre_SSE_5	Pre_SSE_6	Pre_SSE_7	Pre_SSE_8	Pre_SSE_9	Pre_SSE_10	Pre_SSE_11	Pre_SSE_12	Pre_SSE_13	Pre_SSE_14	Pre_SSE_16	Pre_SSE_17	Pre_SSE_18	Pre_SSE_19	Pre_SSE_20	Condition	Post_SSE_1	Post_SSE_2	Post_SSE_3	Post_SSE_4	Post_SSE_5	Post_SSE_6	Post_SSE_7	Post_SSE_8	Post_SSE_9	Post_SSE_10	Post_SSE_11	Post_SSE_12	Post_SSE_13	Post_SSE_14	Post_SSE_16	Post_SSE_17	Post_SSE_18	Post_SSE_19	Post_SSE_20
47	20	male	2	3	4	2	2	3	3	0	1	3	0	4	3	2	3	2	3	2	1	3	control	3	2	3	2	1	2	0	1	3	0	4	3	2	2	1	2	1	1	1
51	22	female	2	2	3	3	2	1	3	0	0	3	2	2	3	3	3	0	2	0	2	2	control	3	3	3	3	1	3	1	1	4	1	2	2	2	3	0	2	0	2	2
57	20	female	2	3	0	0	0	0	0	4	0	2	0	2	3	0	4	0	0	0	0	0	control	3	0	0	0	0	2	4	0	2	0	2	3	0	4	0	0	0	0	0

The data will be easier to analyse if we rename the columns. It will also be useful to divide the data into two data frames, one for the pre-intervention SSES, the other for the post-intervention SSES. We’ll do this in stages.

Enter this comment and command into your script, and run it:

# Place pre-intervention SSES into 'sses_pre_raw'
sses_pre_raw  <- sses %>% select(1, 5:25)

Explanation of command:

sses_pre_raw <- sses %>% select(1, 5:25) - We select() column 1 and columns 5:25 from sses, and store the resulting data frame in sses_pre_raw. Column 1 is the participant ID, columns 5:24 are the SSES scores, and column 25 records which of the three experimental conditions the participant was assigned to.
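Selecting by position and selecting by name are interchangeable, as this sketch on an invented toy data frame suggests. The column names Pre_1 and Pre_2 and all values are made up for illustration. You do not need to add this to scales.R.

```r
library(tidyverse)

# Toy data with invented values
toy <- tibble(
  partID    = c(1, 2),
  Age       = c(20, 22),
  Pre_1     = c(3, 2),
  Pre_2     = c(4, 3),
  Condition = c("control", "control")
)

# Column 1, plus columns 3 to 5, chosen by position
by_position <- toy %>% select(1, 3:5)

# The same columns, chosen by name
by_name <- toy %>% select(partID, Pre_1:Condition)

names(by_position)
# "partID" "Pre_1" "Pre_2" "Condition"

identical(names(by_position), names(by_name))
# TRUE
```

Positions are quicker to type when column names are long, but they silently select the wrong columns if the layout of the raw file changes, so it is worth checking the result.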

Here are the first three participants of our pre-intervention data:

partID	Pre_SSE_1	Pre_SSE_2	Pre_SSE_3	Pre_SSE_4	Pre_SSE_5	Pre_SSE_6	Pre_SSE_7	Pre_SSE_8	Pre_SSE_9	Pre_SSE_10	Pre_SSE_11	Pre_SSE_12	Pre_SSE_13	Pre_SSE_14	Pre_SSE_16	Pre_SSE_17	Pre_SSE_18	Pre_SSE_19	Pre_SSE_20	Condition
47	3	4	2	2	3	3	0	1	3	0	4	3	2	3	2	3	2	1	3	control
51	2	3	3	2	1	3	0	0	3	2	2	3	3	3	0	2	0	2	2	control
57	3	0	0	0	0	0	4	0	2	0	2	3	0	4	0	0	0	0	0	control

Next, we’ll rename the SSES columns based on their question number. This will make them easier to refer to in the rest of our code.

Enter this comment and command into your script, and run it:

# Rename columns
sses_pre_raw  <- sses_pre_raw %>%
  set_names(~ str_to_lower(.) %>% str_replace_all("pre_sse_", "q"))

Explanation of command:

set_names(~ str_to_lower(.) %>% str_replace_all("pre_sse_", "q")) - We use the function set_names() to rename our columns. The ~ is a way of telling set_names() to apply a function to the column names. The remainder of the command is a “sub-pipeline” which tidies up each column name. The command str_to_lower(.) converts the names to lower case (the . stands for the current column names). The lower case names are piped to str_replace_all("pre_sse_", "q"), which replaces the text pre_sse_ with q wherever it occurs in a name. All our columns are now lowercase, and the SSES questions are named q1 to q20.

partid	q1	q2	q3	q4	q5	q6	q7	q8	q9	q10	q11	q12	q13	q14	q16	q17	q18	q19	q20	condition
47	3	4	2	2	3	3	0	1	3	0	4	3	2	3	2	3	2	1	3	control
51	2	3	3	2	1	3	0	0	3	2	2	3	3	3	0	2	0	2	2	control
57	3	0	0	0	0	0	4	0	2	0	2	3	0	4	0	0	0	0	0	control
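To see what the renaming step does, you can run the same two string functions on a plain character vector, and then on a small toy data frame. The vector old_names and the data frame toy are invented for this illustration.

```r
library(tidyverse)

# A few column names in the style of the raw SSES data
old_names <- c("partID", "Pre_SSE_1", "Pre_SSE_20", "Condition")

# Lower-case each name, then swap the text "pre_sse_" for "q"
new_names <- old_names %>%
  str_to_lower() %>%
  str_replace_all("pre_sse_", "q")

new_names
# "partid" "q1" "q20" "condition"

# The same transformation applied to the names of a toy data frame
toy <- tibble(partID = 1, Pre_SSE_1 = 3)
toy <- toy %>%
  set_names(~ str_to_lower(.) %>% str_replace_all("pre_sse_", "q"))

names(toy)
# "partid" "q1"
```

Names that do not contain pre_sse_, such as partid and condition, are only changed by the conversion to lower case.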

Now we’ll convert some columns to factors.

Enter this comment and command into your script, and run it:

# Convert columns to factors; add factor column 'time', set to 'pre'; select relevant columns
sses_pre_raw <- sses_pre_raw %>%
  mutate(subj = factor(partid), condition = factor(condition),
         time = factor('pre')) %>%
  select(subj, condition, time, q1:q20)

Explanation of commands:

mutate(subj = factor(partid), condition = factor(condition), time = factor('pre')) - We use mutate() to add and modify some columns. The argument subj = factor(partid) creates a new column named subj (which is a bit clearer than partid) by copying the partid column and making it a factor. The argument condition = factor(condition) makes the condition column a factor. The argument time = factor('pre') creates a new factor called time and sets all values to pre.
select(subj, condition, time, q1:q20) drops the partid column, which we no longer need, and puts the remaining columns in a more logical order.

Our data is now much tidier:

subj	condition	time	q1	q2	q3	q4	q5	q6	q7	q8	q9	q10	q11	q12	q13	q14	q16	q17	q18	q19	q20
47	control	pre	3	4	2	2	3	3	0	1	3	0	4	3	2	3	2	3	2	1	3
51	control	pre	2	3	3	2	1	3	0	0	3	2	2	3	3	3	0	2	0	2	2
57	control	pre	3	0	0	0	0	0	4	0	2	0	2	3	0	4	0	0	0	0	0
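The sketch below repeats the factor conversion on a toy data frame, so you can inspect the result. The participant numbers match the table above, but the condition labels self and other are hypothetical, invented here so that the factor has more than one level.

```r
library(tidyverse)

# Toy data; the labels "self" and "other" are invented for illustration
toy <- tibble(
  partid    = c(47, 51, 57),
  condition = c("control", "self", "other")
)

toy <- toy %>%
  mutate(subj = factor(partid), condition = factor(condition),
         time = factor("pre")) %>%
  select(subj, condition, time)

# By default, factor levels are the unique values in sorted order
levels(toy$condition)
# "control" "other" "self"

# The single value "pre" is repeated for every row
levels(toy$time)
# "pre"

is.factor(toy$subj)
# TRUE
```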

Note that we could do all of these steps in a single pipeline. DO NOT enter these commands; you do not need to do the same thing twice. This is just an illustration of how the previous commands could be combined.

sses_pre_raw  <- select(sses, 1, 5:25) %>%
  set_names(~ str_to_lower(.) %>% str_replace_all("pre_sse_", "q")) %>%
  mutate(subj = factor(partid), condition = factor(condition),
         time = factor('pre')) %>%
  select(subj, condition, time, q1:q20)

Exercise 2

Write a similar pipeline (including comments) to create a data frame named sses_post_raw containing the post-intervention SSES data. The condition and post-intervention SSES data are in columns 25:45. The SSES columns have the prefix post_sse_ rather than pre_sse_. Set the value in the time factor to post. After running your commands, the first few rows of sses_post_raw should look like this:

subj	condition	time	q1	q2	q3	q4	q5	q6	q7	q8	q9	q10	q11	q12	q13	q14	q16	q17	q18	q19	q20
47	control	post	3	2	3	2	1	2	0	1	3	0	4	3	2	2	1	2	1	1	1
51	control	post	3	3	3	3	1	3	1	1	4	1	2	2	2	3	0	2	0	2	2
57	control	post	3	0	0	0	0	2	4	0	2	0	2	3	0	4	0	0	0	0	0

Copy the R code and comments you used for this exercise into PsycEL.

References

Heatherton, T. F., & Polivy, J. (1991). Development and validation of a scale for measuring state self-esteem. Journal of Personality and Social Psychology, 60(6), 895.

Henry, J. D., & Crawford, J. R. (2005). The short-form version of the Depression Anxiety Stress Scales (DASS-21): Construct validity and normative data in a large non-clinical sample. British Journal of Clinical Psychology, 44(2), 227–239.

This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.

Paul Sharpe, Andy Wills, Sophie Homer
