Introduction

This worksheet covers the Wilcoxon rank-sum test, which is an alternative to the between-subjects t-test, and the Kruskal-Wallis H test, which is an alternative to the between-subjects one-factor ANOVA.

The Wilcoxon rank-sum and Kruskal-Wallis H tests are both non-parametric tests. This means they make fewer assumptions about your data than do standard parametric tests (such as t-tests and ANOVA). Specifically, where your sample size is small, parametric tests assume that your population is approximately normally distributed. However, it’s important to realise that as your sample size increases, this assumption matters less and less, due to the central limit theorem. It’s also important to realise that parametric tests generally have greater statistical power than non-parametric tests when their assumptions are met, so we should use parametric tests in those cases.

Putting this all together, there’s a relatively small set of situations where it makes sense to use a non-parametric test such as the Wilcoxon rank-sum or Kruskal-Wallis H. These are when:

1. your sample size is small (N < 30 per group), and
2. you do not know whether the population distribution is approximately normal, and
3. you have reason to expect the effect size will be large (d > 1).

For example, you wouldn’t use a non-parametric test on a small sample of IQ scores, because IQ is known to be normally distributed. The point about effect size follows from the other two: if your sample size is small, you will only be able to detect large effects (see the statistical power worksheet).
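The link between small samples and large effects can be illustrated with a rough power calculation. The sketch below uses base R's power.t.test, so it describes a between-subjects t-test rather than a rank-based test, and the group size of 15 and the two effect sizes are hypothetical values chosen purely for illustration.

```r
# Hypothetical small study: 15 participants per group.
n_per_group <- 15

# Power of a two-sample t-test for a large effect (d = 1) and a medium effect (d = 0.5).
# With sd = 1, delta is the effect size in standard deviation units.
power_large  <- power.t.test(n = n_per_group, delta = 1.0, sd = 1, sig.level = 0.05)$power
power_medium <- power.t.test(n = n_per_group, delta = 0.5, sd = 1, sig.level = 0.05)$power

round(c(large = power_large, medium = power_medium), 2)
```

With the same small sample, the power for the medium effect is well below the power for the large effect, which is why small studies can realistically only detect large effects.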

Where’s the Bayes Factor? The Wilcoxon rank-sum and Kruskal-Wallis tests are traditional tests, in the sense that they give us a p-value rather than a Bayes Factor. As we have previously covered, p-values are widely misinterpreted by psychologists, and can never provide evidence for the null hypothesis. For this reason, we have generally advised that you instead use the Bayesian equivalents of these tests, such as Bayesian t-tests, Bayesian ANOVA, and Bayesian chi-square. However, it is not straightforward to calculate Bayesian equivalents of the Wilcoxon rank-sum (also known as Mann-Whitney) and Kruskal-Wallis tests in R at the moment. So, in this case, we’ll stick to the traditional tests.

Getting started

To prepare for this worksheet:

1. Open the rminr-data project we used previously.

2. Open the Files tab. You should see a folder called going-further. This folder should contain the files picture-naming-long.csv and music-emotion-preproc.csv.

3. If you don’t see the folder or the files, it means you created your project before the data required for this worksheet was added to the rminr-data git repository. You can fix this by asking git to “pull” the repository. Select the Git tab, which is located in the row of tabs which includes the Environment tab. Click the Pull button with a downward pointing arrow. A window will open showing the files which have been pulled from the repository. Close the Git pull window.

4. Create a script named non-parametric.R in the rminr-data folder (the folder above going-further). Add the code to this script as you work through each section of the worksheet.

Wilcoxon rank-sum test

We’ll demonstrate the Wilcoxon rank-sum test using data from an experiment which evaluated children’s language development using the Words in Game (WinG) test. WinG consists of a set of picture cards that are used in four tests: noun comprehension, noun production, predicate comprehension, and predicate production. The Italian and English versions of the WinG cards use different pictures to depict the associated concepts. The experiment tested whether English-speaking children aged approximately 30 months produce different responses for the two sets of cards.

rm(list = ls()) # clear the environment
library(tidyverse)
wing_preproc <- read_csv('going-further/picture-naming-long.csv')

Explanation of commands:

These commands should be familiar from previous worksheets. Line 1 clears the environment. Line 2 loads the tidyverse package. Line 3 reads the data.

The first few lines of wing_preproc look like this:

1 female english nc 12
1 female english np 4
1 female english pc NA
1 female english pp NA
2 male italian nc 18
2 male italian np 12
2 male italian pc 17
2 male italian pp 9
3 female english nc 18
3 female english np 13
3 female english pc 17
3 female english pp 9

The first three columns are the participant ID number, gender of the participant, and the type of card presented. The fourth column is the test (e.g. “nc” = “Noun comprehension”). The final column is the number of correct responses. Some data is missing - indicated as “NA”.

In the next section, we are going to compare English and Italian on the noun comprehension task. So, we filter the data (which contains all four tasks) to include just that task. We also remove any missing data.

# Filter
nc_include <- wing_preproc %>% filter(task == 'nc') %>% drop_na()

Explanation of command: The filter command should be familiar from many previous worksheets. drop_na() removes any row that contains an NA in it.
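To see what filter() followed by drop_na() does, here is a minimal sketch on a made-up data frame. The toy values are hypothetical; only the column names cards, task and correct are taken from the worksheet's code.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in the same long format as the worksheet's data.
toy <- tibble(
  cards   = c("english", "english", "italian", "italian"),
  task    = c("nc", "pc", "nc", "pc"),
  correct = c(12, NA, NA, 17)
)

# Keep only the noun comprehension rows, then drop any row containing an NA.
toy_nc <- toy %>% filter(task == "nc") %>% drop_na()
toy_nc
```

Of the two 'nc' rows in the toy data, only the one with a recorded score survives, so toy_nc has a single row.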

Calculating descriptive statistics

When we report non-parametric tests, we normally report the median (rather than the mean) as our descriptive statistic. This makes practical sense, because the mean can be misleading when the distribution is skewed and, if we have chosen to do a non-parametric test, we are (presumably) uncertain whether the distribution is skewed or not.
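A quick illustration of why the mean can mislead for skewed data, using a small set of hypothetical scores with one extreme value:

```r
# Hypothetical scores: five small values and one extreme value.
scores <- c(2, 3, 3, 4, 5, 40)

mean(scores)    # 9.5, pulled upwards by the single extreme score
median(scores)  # 3.5, close to where most of the scores lie
```

The median describes the typical score here far better than the mean does.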

# summary statistics
nc_include %>%
  group_by(cards) %>%
  summarise(median = median(correct))
summarise() ungrouping output (override with .groups argument)
# A tibble: 2 x 2
cards   median
<chr>    <dbl>
1 english     19
2 italian     17

Explanation of commands: These commands should be familiar from several previous worksheets. We group the data by cards, and use summarise, to calculate the median() score for each group.

Calculating the Wilcoxon rank-sum test

As we said earlier in this worksheet, the Wilcoxon rank-sum test is a non-parametric equivalent of a between-subjects t-test. It works by ranking all of the scores in the two groups together, adding the ranks in each group, and comparing these “summed ranks” to determine if they differ.
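The ranking procedure can be sketched by hand on a small example. The two groups of scores below are hypothetical and contain no tied values. The W that R reports is the rank sum for the first group minus the smallest value that rank sum could possibly take.

```r
# Hypothetical scores for two small groups (no tied values).
group_a <- c(12, 15, 18, 20)
group_b <- c(11, 13, 14, 16, 17)

# Rank all scores together, then add up the ranks belonging to group_a.
all_ranks  <- rank(c(group_a, group_b))
rank_sum_a <- sum(all_ranks[seq_along(group_a)])

# Subtract the smallest possible rank sum for a group of this size.
n_a       <- length(group_a)
w_by_hand <- rank_sum_a - n_a * (n_a + 1) / 2

# Compare with the statistic reported by wilcox.test().
w_from_r <- unname(wilcox.test(group_a, group_b)$statistic)
c(by_hand = w_by_hand, from_r = w_from_r)
```

Both values are 14: the ranks for group_a are 2, 5, 8 and 9, which sum to 24, and the smallest possible rank sum for four scores is 10.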

We’ll run a Wilcoxon rank-sum test to see if there were any significant differences between scores for the Italian and English cards on the noun comprehension (nc) task.

# Wilcoxon rank-sum test (Mann-Whitney U) for noun comprehension
wilcox.test(correct ~ cards, nc_include)
Warning in wilcox.test.default(x = c(12, 18, 17, 17, 20, 19, 19, 20, 19), :
cannot compute exact p-value with ties

Wilcoxon rank sum test with continuity correction

data:  correct by cards
W = 58, p-value = 0.125
alternative hypothesis: true location shift is not equal to 0

Explanation of commands:

The command wilcox.test(correct ~ cards, nc_include) runs the test to compare the correct scores for the Italian and English cards.

Explanation of output:

The phrase with continuity correction is a technical detail that can be safely ignored. If you’re curious, you can read more in the help file, by typing ?wilcox.test into the console.

The phrase correct by cards reminds you that you compared the values in correct between the levels in the cards factor. A p value of less than .05 is generally considered by psychologists to be evidence that the two groups are different.

The phrase alternative hypothesis: true location shift is not equal to 0 can be safely ignored, as it’s just another (rather obscure) way of saying that you are testing whether the two groups are different.

The warning cannot compute exact p-value with ties lets you know that the method used to calculate p will not be exact, because some scores were identical (tied) and so share the same rank. R instead reports an approximate p value. It is possible to calculate an exact p value for tied data with other tools (for example, the coin package), but that’s beyond what we’ll cover in this worksheet. If your p-value is sufficiently close to .05 that it would matter if the estimate is a bit off, then a better solution would be to attempt to replicate your finding in a second study with a larger sample.
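If you would rather not see the warning, one option is to ask wilcox.test for the approximate p value directly, using its exact argument. The sketch below uses hypothetical scores in which several values appear in both groups. It shows that the p value is the same either way; only the warning disappears.

```r
# Hypothetical scores containing tied values (e.g. 12, 17, 18 and 19 occur in both groups).
english <- c(12, 18, 17, 20, 19, 19)
italian <- c(18, 12, 17, 9, 19, 16)

# Default call: R warns about ties and falls back to the approximation.
p_default <- suppressWarnings(wilcox.test(english, italian)$p.value)

# Requesting the approximation explicitly avoids the warning.
p_approx <- wilcox.test(english, italian, exact = FALSE)$p.value

all.equal(p_default, p_approx)
```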

We now have all of the information we need to report the results. For the noun comprehension task, there was no significant difference in accuracy between the Italian (Mdn = 17) and English (Mdn = 19) cards, W = 58, p = 0.13.

In some journal articles, you may see a non-parametric test called a “Mann-Whitney U”. This is exactly the same test as computed in the Wilcoxon rank-sum test above, just with a different name (and represented by a U rather than a W).

Exercise 1

Calculate summary statistics and the Wilcoxon rank-sum test for the noun production task. Your results should look like this:

summarise() ungrouping output (override with .groups argument)
# A tibble: 2 x 2
cards   median
<chr>    <dbl>
1 english     13
2 italian     11
Warning in wilcox.test.default(x = c(4, 13, 15, 10, 14, 12, 10, 16, 13), :
cannot compute exact p-value with ties

Wilcoxon rank sum test with continuity correction

data:  correct by cards
W = 47.5, p-value = 0.5619
alternative hypothesis: true location shift is not equal to 0

Copy the R code you used for this exercise into PsycEL.

Kruskal-Wallis H test

The Kruskal-Wallis H test is a non-parametric equivalent of a one-way between-subjects ANOVA. It extends the Mann-Whitney test to situations where there are more than two groups. Like the Mann-Whitney test, the Kruskal-Wallis test works on ranked data.
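As a rough illustration of how the test works on ranks, the sketch below calculates H by hand using the standard formula for data without ties, and compares it with kruskal.test. The scores and group labels are hypothetical. The p value is approximated from a chi-square distribution with degrees of freedom equal to the number of groups minus one.

```r
# Hypothetical scores for three groups of three (no tied values).
scores <- c(1.2, 2.5, 3.1,  2.0, 3.6, 4.4,  3.9, 4.8, 5.0)
group  <- factor(rep(c("a", "b", "c"), each = 3))

# Rank all scores together, then sum the ranks within each group.
ranks       <- rank(scores)
n_total     <- length(scores)
rank_sums   <- tapply(ranks, group, sum)
group_sizes <- tapply(ranks, group, length)

# H statistic (this form of the formula assumes there are no ties).
h_by_hand <- 12 / (n_total * (n_total + 1)) * sum(rank_sums^2 / group_sizes) -
  3 * (n_total + 1)

# Approximate p value from the chi-square distribution.
p_by_hand <- pchisq(h_by_hand, df = nlevels(group) - 1, lower.tail = FALSE)

# Compare with R's built-in test.
result <- kruskal.test(scores, group)
c(h_by_hand = h_by_hand, h_from_r = unname(result$statistic))
c(p_by_hand = p_by_hand, p_from_r = result$p.value)
```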

We’ll demonstrate the Kruskal-Wallis test using data from a study which compared emotion regulation strategies between fans of mainstream (control group), goth, metal and emo music. Participants were measured using the Emotion Regulation Strategies for Artistic Creative Activities Scale (ERS-ACA), an 18-item inventory, with each item scored from 1 (‘strongly disagree’) to 5 (‘strongly agree’). The ERS-ACA gives an overall measure of the strategy people use to regulate their emotions when they engage in artistic, creative activities, and scores on three strategy sub-scales: avoidance, approach and self-development.

# Kruskal-Wallis H test
ers_l <- read_csv('going-further/music-emotion-preproc.csv')
Parsed with column specification:
cols(
subj = col_double(),
subculture = col_character(),
ers = col_character(),
score = col_double()
)

Explanation of command:

This data has already undergone some preprocessing and is in long format. The first few lines of ers_l look like this:

subj subculture ers score
17 Goth avoidance 1.286
17 Goth approach 4.167
17 Goth development 4.2
17 Goth total 3.056
18 Metal avoidance 4
18 Metal approach 3.5
18 Metal development 3.4
18 Metal total 3.667

Descriptive statistics

We’ll start by calculating the medians for the ‘approach’ subscale.

approach <- ers_l %>% filter(ers == 'approach') %>% drop_na()
approach %>%
  group_by(subculture) %>%
  summarise(median = median(score))
summarise() ungrouping output (override with .groups argument)
# A tibble: 4 x 2
subculture median
<chr>       <dbl>
1 Emo          3.83
2 Goth         3.67
3 Mainstream   3.67
4 Metal        3.83

Explanation of commands:

Line 1 filters the data to only include measurements for the ‘approach’ subscale and removes missing data. Lines 2-4 are very similar to the summary statistics we generated for the Wilcoxon rank-sum test. In this case we group by music subculture.

Explanation of output:

The differences in medians between groups look quite small.

Calculating the Kruskal-Wallis

We can now run the Kruskal-Wallis test:

kruskal.test(score ~ subculture, data = approach)

Kruskal-Wallis rank sum test

data:  score by subculture
Kruskal-Wallis chi-squared = 5.2313, df = 3, p-value = 0.1556

Explanation of commands:

The command kruskal.test(score ~ subculture, data = approach) runs the test to compare the ERS-ACA approach scores for the four groups in subculture.

Explanation of output:

The string score by subculture reminds you that you compared the values in score between the levels in the subculture factor. The Kruskal-Wallis H statistic is 5.2313. R describes it as chi-squared because it is possible to estimate the relevant p-value using a chi-square distribution, with the degrees of freedom (df) for that distribution set to one less than the number of groups, in this case df = 3. The p value tells us whether there was a significant difference between the four medians. It does not tell us which pairs of groups differ significantly from each other (for that, you could use Wilcoxon rank-sum tests on pairs of groups).

The results of this test are as follows:

There was no significant difference in approach style between the mainstream (Mdn = 3.67), goth (Mdn = 3.67), metal (Mdn = 3.83) and emo (Mdn = 3.83) groups, H = 5.23, p = 0.16.
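When a Kruskal-Wallis test is significant, one way to find out which pairs of groups differ is to run a Wilcoxon rank-sum test on each pair. The sketch below does this with base R's pairwise.wilcox.test on hypothetical scores for three groups. The Holm correction for multiple comparisons is an assumption made for this example, not something prescribed by this worksheet.

```r
# Hypothetical scores for three groups of four (no tied values).
scores <- c(1.2, 2.5, 3.1, 1.7,
            2.0, 3.6, 4.4, 4.1,
            3.9, 4.8, 5.0, 5.3)
group <- factor(rep(c("a", "b", "c"), each = 4))

# One rank-sum test per pair of groups, with Holm-adjusted p values.
pairwise <- pairwise.wilcox.test(scores, group, p.adjust.method = "holm")

# Matrix of adjusted p values: rows and columns name the groups being compared.
pairwise$p.value
```

Each cell in the lower triangle of the matrix is the adjusted p value for one pair of groups; the remaining cell is NA because that comparison appears elsewhere in the table.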

Exercise 2

Calculate summary statistics and Kruskal-Wallis H for the self-development emotional response subscale of the ERS-ACA. Your results should look like this:

summarise() ungrouping output (override with .groups argument)
# A tibble: 4 x 2
subculture median
<chr>       <dbl>
1 Emo           3.8
2 Goth          4
3 Mainstream    3.6
4 Metal         3.8

Kruskal-Wallis rank sum test

data:  score by subculture
Kruskal-Wallis chi-squared = 8.5011, df = 3, p-value = 0.03671

Copy the R code you used for this exercise into PsycEL.