This worksheet covers the Wilcoxon rank-sum test, which is an alternative to the between-subjects t-test, and the Kruskal-Wallis H test, which is an alternative to the between-subjects one-factor ANOVA.

The Wilcoxon rank-sum and Kruskal-Wallis H tests are both *non-parametric* tests. This means they make fewer assumptions about your data than do standard *parametric* tests (such as t-tests and ANOVA). Specifically, when your sample size is small, parametric tests assume that your population is approximately normally distributed. However, it’s important to realise that as your sample size increases, parametric tests make fewer and fewer assumptions about your population distribution, due to the central limit theorem. It’s also important to realise that parametric tests have greater statistical power than non-parametric tests, so we should use parametric tests when their assumptions are met.

Putting this all together, there’s **a relatively small set of situations where it makes sense to use a non-parametric test** such as the Wilcoxon rank-sum or Kruskal-Wallis H. These are when:

- Your sample size is small (N < 30 per group), **and**
- you do not know whether the population distribution is approximately normal, **and**
- you have reason to expect the effect size will be large (d > 1).

For example, you wouldn’t use a non-parametric test on a small sample of IQ scores, because IQ is known to be normally distributed. The point about effect size follows from the other two - if your sample size is small, you will only be able to detect large effects - see the statistical power worksheet.

**Where’s the Bayes Factor?** The Wilcoxon rank-sum and Kruskal-Wallis are traditional tests, in the sense that they give us a p-value rather than a Bayes Factor. As we have previously covered, p-values are widely misinterpreted by psychologists, and can never provide evidence for the null hypothesis. For this reason, we have generally advised that you instead use the Bayesian equivalents of these tests - such as Bayesian t-tests, Bayesian ANOVA, and Bayesian chi-square. However, it is not straightforward to calculate Bayesian equivalents of the Mann-Whitney and Kruskal-Wallis tests in R at the moment. So, in this case, we’ll stick to the traditional tests.

To prepare for this worksheet:

- Open the `rminr-data` project we used previously.
- Open the `Files` tab. You should see a folder called `going-further`. This folder should contain the files `picture-naming-long.csv` and `music-emotion-preproc.csv`.
- If you don’t see the folder or the files, it means you created your project *before* the data required for this worksheet was added to the `rminr-data` git repository. You fix this by asking git to “`pull`” the repository. Select the `Git` tab, which is located in the row of tabs which includes the `Environment` tab. Click the `Pull` button with a downward-pointing arrow. A window will open showing the files which have been pulled from the repository. Close the `Git pull` window.
- Create a script named `non-parametric.R` in the `rminr-data` folder (the folder above `going-further`). Add the code to this script as you work through each section of the worksheet.

We’ll demonstrate the Wilcoxon rank-sum test using data from an experiment which evaluated children’s language development using the Words in Game (WinG) test. WinG consists of a set of picture cards that are used in four tests: noun comprehension, noun production, predicate comprehension, and predicate production. The Italian and English versions of the WinG cards use different pictures to depict the associated concepts. The experiment tested whether English-speaking children aged approximately 30 months produce different responses for the two sets of cards.

We start by loading the data:

```
rm(list = ls()) # clear the environment
library(tidyverse)
wing_preproc <- read_csv('going-further/picture-naming-long.csv')
```

**Explanation of commands:**

These commands should be familiar from previous worksheets. Line 1 clears the environment. Line 2 loads the `tidyverse` package. Line 3 reads the data.

The first few lines of `wing_preproc` look like this:

| subj | gender | cards | task | correct |
|---|---|---|---|---|
| 1 | female | english | nc | 12 |
| 1 | female | english | np | 4 |
| 1 | female | english | pc | NA |
| 1 | female | english | pp | NA |
| 2 | male | italian | nc | 18 |
| 2 | male | italian | np | 12 |
| 2 | male | italian | pc | 17 |
| 2 | male | italian | pp | 9 |
| 3 | female | english | nc | 18 |
| 3 | female | english | np | 13 |
| 3 | female | english | pc | 17 |
| 3 | female | english | pp | 9 |

The first three columns are the participant ID number, gender of the participant, and the type of card presented. The fourth column is the test (e.g. “nc” = “Noun comprehension”). The final column is the number of correct responses. Some data is missing - indicated as “NA”.

In the next section, we are going to compare English and Italian on the noun comprehension task. So, we filter the data (which contains all four tasks) to include just that task. We also remove any missing data.
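A minimal sketch of this filtering step (the data frame name `nc_include` is the one used with `wilcox.test` later in this worksheet; `task == 'nc'` selects the noun comprehension rows):

```
# Keep only the noun comprehension task, and drop rows with missing data
nc_include <- wing_preproc %>%
  filter(task == 'nc') %>%
  drop_na()
```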

**Explanation of command:** The `filter` command should be familiar from many previous worksheets. `drop_na()` removes any row that contains an `NA` in it.

When we report non-parametric tests, we normally report the median (rather than the mean) as our descriptive statistic. This makes practical sense, because the mean can be misleading when the distribution is skewed and, if we have chosen to do a non-parametric test, we are (presumably) uncertain whether the distribution is skewed or not.
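A minimal sketch of the grouping-and-summarising command that produces the output below, assuming the filtered data frame is named `nc_include` (the name used with `wilcox.test` later in this worksheet):

```
# Median correct responses for each card set
nc_include %>%
  group_by(cards) %>%
  summarise(median = median(correct))
```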

```
`summarise()` ungrouping output (override with `.groups` argument)
```

```
# A tibble: 2 x 2
cards median
<chr> <dbl>
1 english 19
2 italian 17
```

**Explanation of commands:** These commands should be familiar from several previous worksheets. We group the data by `cards` and use `summarise` to calculate the `median()` score for each group.

As we said earlier in this worksheet, the Wilcoxon rank-sum test is a non-parametric equivalent of a between-subjects t-test. It works by ranking all of the scores in the two groups, adding the ranks in each group, and comparing these “summed ranks” to determine if they differ.
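A toy illustration of the ranking idea, using made-up scores (not the worksheet data): `rank()` ranks all scores across both groups combined, and the ranks are then summed within each group.

```
a <- c(12, 15, 17)         # made-up scores, group A
b <- c(14, 18, 20)         # made-up scores, group B
ranks <- rank(c(a, b))     # ranks across the combined scores: 1 3 4 2 5 6
sum(ranks[1:3])            # summed ranks for group A: 8
sum(ranks[4:6])            # summed ranks for group B: 13
```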

We’ll run a Wilcoxon rank-sum test to see if there were any significant differences between scores for the Italian and English cards on the noun comprehension (`nc`) task.

```
# Wilcoxon rank-sum test (Mann-Whitney U) for noun comprehension
wilcox.test(correct ~ cards, nc_include)
```

```
Warning in wilcox.test.default(x = c(12, 18, 17, 17, 20, 19, 19, 20, 19), :
cannot compute exact p-value with ties
```

```
Wilcoxon rank sum test with continuity correction
data: correct by cards
W = 58, p-value = 0.125
alternative hypothesis: true location shift is not equal to 0
```

**Explanation of commands:**

The command `wilcox.test(correct ~ cards, nc_include)` runs the test to compare the `correct` scores for the Italian and English `cards`.

**Explanation of output:**

The phrase `with continuity correction` is a technical detail that can be safely ignored. If you’re curious, you can read more in the help file, by typing `?wilcox.test` into the console.

The phrase `correct by cards` reminds you that you compared the values in `correct` between the levels in the `cards` factor. A `p` value of less than .05 is generally considered by psychologists to be evidence that the two groups are different.

The phrase `alternative hypothesis: true location shift is not equal to 0` can be safely ignored, as it’s just another (rather obscure) way of saying that you are testing whether the two groups are different.

The warning `cannot compute exact p-value with ties` lets you know that the method used to calculate *p* will not be exact, because some items in the Italian and English scores had identical rankings. It is possible to calculate the exact p value using the `pwilcox` command, but that’s beyond what we’ll cover in this worksheet. If your p-value is sufficiently close to .05 that it would matter if the estimate is a bit off, then a better solution would be to attempt to replicate your finding in a second study with a larger sample.

We now have all of the information we need to report the results. For the noun comprehension task, there was no significant difference in accuracy between the Italian (*Mdn* = 17) and English (*Mdn* = 19) cards, *W* = 58, *p* = 0.13.

In some journal articles, you may see a non-parametric test called a “Mann-Whitney U”. This is exactly the same test as computed in the Wilcoxon rank-sum test above, just with a different name (and represented by a U rather than a W).

Calculate summary statistics and the Wilcoxon rank-sum test for the noun production task. Your results should look like this:

```
`summarise()` ungrouping output (override with `.groups` argument)
```

```
# A tibble: 2 x 2
cards median
<chr> <dbl>
1 english 13
2 italian 11
```

```
Warning in wilcox.test.default(x = c(4, 13, 15, 10, 14, 12, 10, 16, 13), :
cannot compute exact p-value with ties
```

```
Wilcoxon rank sum test with continuity correction
data: correct by cards
W = 47.5, p-value = 0.5619
alternative hypothesis: true location shift is not equal to 0
```

**Copy the R code you used for this exercise into PsycEL.**

The Kruskal-Wallis H test is a non-parametric equivalent of a one-way between subjects ANOVA. It extends the Mann-Whitney test to situations where there are more than two groups. Like the Mann-Whitney test, the Kruskal-Wallis test works on ranked data.

We’ll demonstrate the Kruskal-Wallis test using data from a study which compared emotion regulation strategies between fans of mainstream (control group), goth, metal and emo music. Participants were measured using the Emotion Regulation Strategies for Artistic Creative Activities Scale (ERS-ACA), an 18-item inventory, with each item scored from 1 (‘strongly disagree’) to 5 (‘strongly agree’). The ERS-ACA gives an overall measure of the strategy people use to regulate their emotions when they engage in artistic, creative activities, and scores on three strategy sub-scales: avoidance, approach and self-development.

We start by loading the data:
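The loading command will have been along these lines (the data frame name `ers_l` and the file `music-emotion-preproc.csv` are the ones used elsewhere in this worksheet):

```
# Read the preprocessed, long-format ERS-ACA data
ers_l <- read_csv('going-further/music-emotion-preproc.csv')
```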

```
Parsed with column specification:
cols(
subj = col_double(),
subculture = col_character(),
ers = col_character(),
score = col_double()
)
```

**Explanation of command:**

This data has already undergone some preprocessing and is in long format. The first few lines of `ers_l` look like this:

| subj | subculture | ers | score |
|---|---|---|---|
| 17 | Goth | avoidance | 1.286 |
| 17 | Goth | approach | 4.167 |
| 17 | Goth | development | 4.2 |
| 17 | Goth | total | 3.056 |
| 18 | Metal | avoidance | 4 |
| 18 | Metal | approach | 3.5 |
| 18 | Metal | development | 3.4 |
| 18 | Metal | total | 3.667 |

We’ll start by calculating the medians for the ‘approach’ subscale.

```
approach <- ers_l %>% filter(ers == 'approach') %>% drop_na()
approach %>%
  group_by(subculture) %>%
  summarise(median = median(score))
```

```
`summarise()` ungrouping output (override with `.groups` argument)
```

```
# A tibble: 4 x 2
subculture median
<chr> <dbl>
1 Emo 3.83
2 Goth 3.67
3 Mainstream 3.67
4 Metal 3.83
```

**Explanation of commands:**

Line 1 filters the data to only include measurements for the ‘approach’ subscale and removes missing data. Lines 2-4 are very similar to the summary statistics we generated for the Mann-Whitney test. In this case we group by music `subculture`.

**Explanation of output:**

The differences in medians between groups look quite small.

We can now run the Kruskal-Wallis test:
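As the explanation below describes, the command is:

```
# Kruskal-Wallis H test for the 'approach' subscale
kruskal.test(score ~ subculture, data = approach)
```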

```
Kruskal-Wallis rank sum test
data: score by subculture
Kruskal-Wallis chi-squared = 5.2313, df = 3, p-value = 0.1556
```

**Explanation of commands:**

The command `kruskal.test(score ~ subculture, data = approach)` runs the test to compare the ERS-ACA `score` values for the four groups in `subculture`.

**Explanation of output:**

The string `score by subculture` reminds you that you compared the values in `score` between the levels in the `subculture` factor. The Kruskal-Wallis *H* statistic is 5.2313. R describes it as `chi-squared` because it is possible to estimate the relevant p-value using a chi-square distribution, with the degrees of freedom (`df`) for that distribution set to one less than the number of groups; in this case `df` = 3. The `p` value tells us whether there was a significant difference between the four medians. It does not tell us which pairs of groups differ significantly from each other (for that, use a Wilcoxon rank-sum test on each pair of groups).
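If you did want pairwise comparisons, one option is base R’s `pairwise.wilcox.test()`, which runs a Wilcoxon rank-sum test on every pair of groups and adjusts the p-values for multiple comparisons (Holm correction by default). A minimal sketch, using the `approach` data from above:

```
# Wilcoxon rank-sum test for each pair of subcultures, with adjusted p-values
pairwise.wilcox.test(approach$score, approach$subculture)
```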

The results of this test are as follows:

There was no significant difference in approach style between the mainstream (*Mdn* = 3.67), goth (*Mdn* = 3.67), metal (*Mdn* = 3.83) and emo (*Mdn* = 3.83) groups, *H* = 5.23, *p* = 0.16.

Calculate summary statistics and Kruskal-Wallis H for the self-development emotional response subscale of the ERS-ACA. Your results should look like this:

```
`summarise()` ungrouping output (override with `.groups` argument)
```

```
# A tibble: 4 x 2
subculture median
<chr> <dbl>
1 Emo 3.8
2 Goth 4
3 Mainstream 3.6
4 Metal 3.8
```

```
Kruskal-Wallis rank sum test
data: score by subculture
Kruskal-Wallis chi-squared = 8.5011, df = 3, p-value = 0.03671
```

**Copy the R code you used for this exercise into PsycEL.**

This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.