This worksheet covers the Wilcoxon rank-sum test, which is an alternative to the between-subjects t-test, and the Kruskal-Wallis H test, which is an alternative to the between-subjects one-factor ANOVA.

The Wilcoxon rank-sum and Kruskal-Wallis H tests are both
*non-parametric* tests. This means they make fewer assumptions
about your data than do standard *parametric* tests (such as
t-tests and ANOVA). Specifically, where your sample size is small,
parametric tests assume that your population is approximately normally
distributed. However, it’s important to realise that as your sample
size increases, parametric tests make fewer and fewer assumptions about
your population distribution, due to the central
limit theorem. It’s also important to realise that parametric tests
have greater statistical power than
non-parametric tests, so we should use parametric tests when their
assumptions are met.

Putting this all together, there’s **a relatively small set of
situations where it makes sense to use a non-parametric test**
such as the Wilcoxon rank-sum or Kruskal-Wallis H. These are when:

- Your sample size is small (N < 30 per group), **and**
- you do not know whether the population distribution is approximately normal, **and**
- you have reason to expect the effect size will be large (d > 1)

For example, you wouldn’t use a non-parametric test on a small sample of IQ scores, because IQ is known to be normally distributed. The point about effect size follows from the other two - if your sample size is small, you will only be able to detect large effects - see the statistical power worksheet.

**Where’s the Bayes Factor?** The Wilcoxon rank-sum and
Kruskal-Wallis are traditional tests, in the sense that they give us a
p-value rather than a Bayes Factor. As we have previously covered, p-values are widely misinterpreted by
psychologists, and can never provide evidence for the null
hypothesis. For this reason, we have generally advised that you instead
use the Bayesian equivalents of these tests - such as Bayesian t-tests, Bayesian ANOVA, and Bayesian chi-square. However, it is not
straightforward to calculate Bayesian equivalents of the Wilcoxon rank-sum
and Kruskal-Wallis tests in R at the moment. So, in this case, we’ll
stick to the traditional tests.

To prepare for this worksheet:

- Open the `rminr-data` project we used previously.

- Open the `Files` tab. You should see a folder called `going-further`. This folder should contain the files `picture-naming-long.csv` and `music-emotion-preproc.csv`.

- If you don’t see the folder or the files, it means you created your project *before* the data required for this worksheet was added to the `rminr-data` git repository. You fix this by asking git to “`pull`” the repository. Select the `Git` tab, which is located in the row of tabs which includes the `Environment` tab. Click the `Pull` button with a downward pointing arrow. A window will open showing the files which have been pulled from the repository. Close the `Git pull` window.

- Create a script named `non-parametric.R` in the `rminr-data` folder (the folder above `going-further`). Add the comments and commands to this script as you work through each section of the worksheet.

We’ll demonstrate the Wilcoxon rank-sum test using data from an experiment which evaluated children’s language development using the Words in Game (WinG) test. WinG consists of a set of picture cards that are used in four tests: noun comprehension, noun production, predicate comprehension, and predicate production. The Italian and English versions of the WinG cards use different pictures to depict the associated concepts. The experiment tested whether English-speaking children aged approximately 30 months produce different responses for the two sets of cards.

We start by loading the data. **Enter these commands into your
script, and run them:**

```
# Non-parametric tests
# Clear environment
rm(list = ls())
# Load tidyverse
library(tidyverse)
# Load data
wing_preproc <- read_csv('going-further/picture-naming-long.csv')
```

**Explanation of commands:**

These commands should be familiar from previous worksheets. Command
line 1 clears the environment. Command line 2 loads the
`tidyverse` package. Command line 3 reads the data.

The first few lines of `wing_preproc` look like this:

subj | gender | cards | task | correct |
---|---|---|---|---|
1 | female | english | nc | 12 |
1 | female | english | np | 4 |
1 | female | english | pc | NA |
1 | female | english | pp | NA |
2 | male | italian | nc | 18 |
2 | male | italian | np | 12 |
2 | male | italian | pc | 17 |
2 | male | italian | pp | 9 |
3 | female | english | nc | 18 |
3 | female | english | np | 13 |
3 | female | english | pc | 17 |
3 | female | english | pp | 9 |

The first three columns are the participant ID number, gender of the participant, and the type of card presented. The fourth column is the test (e.g. “nc” = “Noun comprehension”). The final column is the number of correct responses. Some data is missing - indicated as “NA”.

In the next section, we are going to compare the English and Italian cards on the noun comprehension task. So, we filter the data (which contains all four tasks) to include just that task. We also remove any missing data.

**Enter this comment and command into your script, and run
it:**

```
# Select 'nc' task; drop NA entries
nc_include <- wing_preproc %>% filter(task == 'nc') %>% drop_na()
```

**Explanation of command**: The `filter` command should be familiar from many previous worksheets.
`drop_na()` removes any row that contains an `NA` in it.

When we report non-parametric tests, we normally report the median (rather than the mean) as our descriptive statistic. This makes practical sense, because the mean can be misleading when the distribution is skewed and, if we have chosen to do a non-parametric test, we are (presumably) uncertain whether the distribution is skewed or not.
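To see why the mean can mislead with skewed data, here is a minimal sketch using a handful of invented scores (they are not from the WinG data, and this code is not part of the worksheet script). One extreme value is enough to pull the mean well away from the typical score, while the median barely notices it.

```r
# Hypothetical scores, invented for illustration:
# four low values and one extreme value
skewed_scores <- c(1, 2, 2, 3, 50)

# The mean is pulled towards the extreme value
mean_score <- mean(skewed_scores)

# The median stays with the bulk of the scores
median_score <- median(skewed_scores)

print(mean_score)
print(median_score)
```

Here the mean is 11.6, which is larger than four of the five scores, whereas the median is 2.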

**Enter these comments and commands into your script, and run
them:**

```
# Display medians
nc_include %>%
  group_by(cards) %>%
  summarise(median = median(correct))
```

```
# A tibble: 2 × 2
cards median
<chr> <dbl>
1 english 19
2 italian 17
```

**Explanation of commands:** These commands should be
familiar from several previous worksheets. We group the data by
`cards`, and use `summarise` to calculate the
`median()` score for each group.

As we said earlier in this worksheet, the Wilcoxon rank-sum test is a non-parametric equivalent of a between-subjects t-test. It works by ranking all of the scores in the two groups, adding the ranks in each group, and comparing these “summed ranks” to determine if they differ.
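The ranking procedure just described can be sketched by hand in base R. The scores below are invented for illustration (they are not the WinG data), and the variable names are our own. The sketch ranks all eight scores together, sums the ranks within each group, and then checks the result against `wilcox.test()`. The *W* that R reports is the summed rank of the first group minus the smallest value that sum could possibly take, which is n(n + 1)/2 for a group of size n.

```r
# Hypothetical scores for two small groups (invented, no tied values)
english <- c(12, 18, 15, 20)
italian <- c(10, 14, 11, 16)

# Rank all scores together, ignoring group membership
all_ranks <- rank(c(english, italian))

# Sum the ranks within each group
rank_sum_english <- sum(all_ranks[seq_along(english)])
rank_sum_italian <- sum(all_ranks[-seq_along(english)])

# R's W: rank sum of the first group minus its minimum possible value
n_english <- length(english)
w_by_hand <- rank_sum_english - n_english * (n_english + 1) / 2

# The same statistic, as computed by wilcox.test()
w_from_r <- unname(wilcox.test(english, italian)$statistic)

print(c(english = rank_sum_english, italian = rank_sum_italian))
print(c(by_hand = w_by_hand, from_r = w_from_r))
```

With these invented scores the summed ranks are 23 and 13, and both routes give *W* = 13.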

We’ll run a Wilcoxon rank-sum test to see if there were any
significant differences between scores for the Italian and English cards
on the noun comprehension (`nc`) task.

**Enter this comment and command into your script, and run
it:**

```
# Perform Wilcoxon rank-sum test for noun comprehension
wilcox.test(correct ~ cards, nc_include)
```

```
Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot
compute exact p-value with ties
```

```
Wilcoxon rank sum test with continuity correction
data: correct by cards
W = 58, p-value = 0.125
alternative hypothesis: true location shift is not equal to 0
```

**Explanation of commands:**

The command `wilcox.test(correct ~ cards, nc_include)` runs the test to
compare the `correct` scores for the Italian
and English `cards`.

**Explanation of output:**

The phrase `with continuity correction` is a technical
detail that can be safely ignored. If you’re curious, you can read more
in the help file, by typing `?wilcox.test` into the
console.

The phrase `correct by cards` reminds you that you
compared the values in `correct` between the levels in the
`cards` factor. A `p` value of less than .05 is
generally considered by psychologists to be evidence that the two groups
are different.

The phrase `alternative hypothesis: true location shift is not equal to 0`
can be safely ignored, as it’s just another (rather obscure) way of
saying that you are testing whether the two groups are different.

The warning `cannot compute exact p-value with ties` lets
you know that the method used to calculate *p* will not be exact,
because some scores in the Italian and English groups were identical
and so shared the same rank. It is possible to calculate the exact p value using the
`pwilcox` command, but that’s beyond what we’ll cover in this
worksheet. If your p-value is sufficiently close to .05 that it would
matter if the estimate is a bit off, then a better solution would be to
attempt to replicate your finding in a second study with a larger
sample.

We now have all of the information we need to report the results. For
the noun comprehension task, there was no significant difference in
accuracy between the Italian (*Mdn* = 17) and English
(*Mdn* = 19) cards, *W* = 58, *p* = 0.13.

In some journal articles, you may see a non-parametric test called a “Mann-Whitney U”. This is exactly the same test as computed in the Wilcoxon rank-sum test above, just with a different name (and represented by a U rather than a W).

Calculate summary statistics and the Wilcoxon rank-sum for the noun production task. Your results should look like this:

```
# A tibble: 2 × 2
cards median
<chr> <dbl>
1 english 13
2 italian 11
```

```
Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot
compute exact p-value with ties
```

```
Wilcoxon rank sum test with continuity correction
data: correct by cards
W = 47.5, p-value = 0.5619
alternative hypothesis: true location shift is not equal to 0
```

**Copy the R code you used for this exercise into
PsycEL.**

The Kruskal-Wallis H test is a non-parametric equivalent of a one-way between-subjects ANOVA. It extends the Wilcoxon rank-sum (Mann-Whitney) test to situations where there are more than two groups. Like the Wilcoxon rank-sum test, the Kruskal-Wallis test works on ranked data.
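As a rough illustration of what “works on ranked data” means here, the sketch below computes *H* by hand from summed ranks and compares it with `kruskal.test()`. The scores, group labels and variable names are all invented for this example. It uses the standard formula for data without ties, H = 12 / (N(N + 1)) × Σ(R² / n) − 3(N + 1), where N is the total number of scores, R is the summed rank for a group, and n is the size of that group. When scores are tied, R applies an additional correction that this sketch leaves out.

```r
# Hypothetical scores for three groups of three (invented, no tied values)
scores <- c(3.1, 4.2, 2.8,   # group a
            3.9, 4.5, 4.8,   # group b
            2.5, 3.0, 3.3)   # group c
groups <- factor(rep(c("a", "b", "c"), each = 3))

# Rank all scores together, then sum the ranks within each group
ranks <- rank(scores)
rank_sums <- tapply(ranks, groups, sum)
group_sizes <- tapply(ranks, groups, length)
n_total <- length(scores)

# H for untied data: 12 / (N(N + 1)) * sum(R^2 / n) - 3(N + 1)
h_by_hand <- 12 / (n_total * (n_total + 1)) *
  sum(rank_sums^2 / group_sizes) - 3 * (n_total + 1)

# The same statistic, as computed by kruskal.test()
h_from_r <- unname(kruskal.test(scores, groups)$statistic)

print(rank_sums)
print(c(by_hand = h_by_hand, from_r = h_from_r))
```

With these invented scores the summed ranks are 13, 23 and 9, and both routes give *H* of about 4.62.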

We’ll demonstrate the Kruskal-Wallis test using data from a study which compared emotion regulation strategies between fans of mainstream (control group), goth, metal and emo music. Participants were measured using the Emotion Regulation Strategies for Artistic Creative Activities Scale (ERS-ACA), an 18-item inventory, with each item scored from 1 (‘strongly disagree’) to 5 (‘strongly agree’). The ERS-ACA gives an overall measure of the strategy people use to regulate their emotions when they engage in artistic, creative activities, and scores on three strategy sub-scales: avoidance, approach and self-development.

We start by loading the data; **enter these comments and
command into your script, and run it:**

```
# Kruskal-Wallis H test
# Load data
ers_l <- read_csv('going-further/music-emotion-preproc.csv')
```

**Explanation of command:**

This data has already undergone some preprocessing and is in long
format. The first few lines of `ers_l` look like this:

subj | subculture | ers | score |
---|---|---|---|
17 | Goth | avoidance | 1.286 |
17 | Goth | approach | 4.167 |
17 | Goth | development | 4.2 |
17 | Goth | total | 3.056 |
18 | Metal | avoidance | 4 |
18 | Metal | approach | 3.5 |
18 | Metal | development | 3.4 |
18 | Metal | total | 3.667 |

We’ll start by calculating the medians for the ‘approach’ subscale.

**Enter these commands into your script, and run
them:**

```
# Select 'approach' subscale; drop NAs
approach <- ers_l %>% filter(ers == 'approach') %>% drop_na()
# Display medians by 'subculture'
approach %>%
  group_by(subculture) %>%
  summarise(median = median(score))
```

```
# A tibble: 4 × 2
subculture median
<chr> <dbl>
1 Emo 3.83
2 Goth 3.67
3 Mainstream 3.67
4 Metal 3.83
```

**Explanation of commands:**

Command line 1 filters the data to only include measurements for the
‘approach’ subscale and removes missing data. The remaining commands are very
similar to the summary statistics we generated for the Wilcoxon rank-sum
test. In this case we group by music `subculture`.

**Explanation of output:**

The differences in medians between groups look quite small.

We can now run the Kruskal-Wallis test.

**Enter this comment and command into your script, and run
it:**

```
# Perform Kruskal-Wallis test
kruskal.test(score ~ subculture, data = approach)
```

```
Kruskal-Wallis rank sum test
data: score by subculture
Kruskal-Wallis chi-squared = 5.2313, df = 3, p-value = 0.1556
```

**Explanation of commands:**

The command `kruskal.test(score ~ subculture, data = approach)` runs the
test to compare the ERS-ACA `score` values for the four
groups in `subculture`.

**Explanation of output:**

The string `score by subculture` reminds you that you
compared the values in `score` between the levels in the
`subculture` factor. The Kruskal-Wallis *H* statistic
is 5.2313. R describes it as `chi-squared` because it is
possible to estimate the relevant p-value using a chi-square
distribution with the degrees of freedom (`df`) for that
distribution set to one less than the number of groups, in this case
`df` = 3. The `p` value tells us whether there was
a significant difference between the four medians. It does not tell us
which pairs of groups differ significantly from each other (for that,
you could run a Wilcoxon rank-sum test on each pair of groups).
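The link between *H* and the chi-square distribution can be checked directly. The short sketch below takes the *H* value and number of groups from the output above and looks up the upper-tail probability with base R’s `pchisq()`. The variable names are our own, and because *H* is typed in as the rounded value R printed, the result may differ very slightly from R’s internal calculation.

```r
# Values copied from the kruskal.test() output above
h_statistic <- 5.2313
n_groups <- 4

# Degrees of freedom: one less than the number of groups
degrees_of_freedom <- n_groups - 1

# Probability of a chi-square value at least this large
p_value <- pchisq(h_statistic, df = degrees_of_freedom, lower.tail = FALSE)
print(p_value)
```

This gives a p-value of approximately 0.156, in line with the `p-value = 0.1556` reported by `kruskal.test()`.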

The results of this test are as follows:

There was no significant difference in approach style between the
mainstream (*Mdn* = 3.67), goth (*Mdn* = 3.67), metal
(*Mdn* = 3.83) and emo (*Mdn* = 3.83) groups, *H* =
5.23, *p* = 0.16.

Calculate summary statistics and Kruskal-Wallis H for the self-development emotional response subscale of the ERS-ACA. Your results should look like this:

```
# A tibble: 4 × 2
subculture median
<chr> <dbl>
1 Emo 3.8
2 Goth 4
3 Mainstream 3.6
4 Metal 3.8
```

```
Kruskal-Wallis rank sum test
data: score by subculture
Kruskal-Wallis chi-squared = 8.5011, df = 3, p-value = 0.03671
```

**Copy the R code you used for this exercise into
PsycEL.**

This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.