Before you start

This is an advanced worksheet, which assumes you have completed the Absolute Beginners’ Guide to R course, the Research Methods in Practice (Quantitative section) course, and the Intermediate Guide to R course.

Introduction

This worksheet describes a full analysis pipeline for an undergraduate student dissertation on children’s language development. This study was an experiment which evaluated the Words in Game (WinG) test. WinG consists of a set of picture cards which are used in four tests: noun comprehension, noun production, predicate comprehension, and predicate production. The Italian and English versions of the WinG cards use different pictures to depict the associated words.

An earlier study found a difference between the English and Italian cards, for adults’ ratings of how well each picture represented the underlying construct. In this study, the researchers hypothesised that this difference would influence children’s WinG task scores, depending on which set of cards they were tested with. The experiment compared WinG performance of English-speaking children, aged approximately 30 months, tested with either the Italian or English cards.

Loading data

Open the rminr-data project we used previously.

Ensure you have the latest files by asking git to “pull” the repository. Select the Git tab, which is located in the row of tabs which includes the Environment tab. Click the Pull button with a downward pointing arrow. A window will open showing the files which have been pulled from the repository. Close the Git pull window. The case-studies folder should contain a folder named allegra-cattani.

Next, create a new, empty R script and save it in the rminr-data folder as wing.R. Put all the commands from this worksheet into this file, and run them from there. Save your script regularly.

We start by reading the data:

Explanation of commands:

  1. We clear the workspace, and load the tidyverse package.
  2. Data for each of the WinG tasks is stored in its own CSV file. There is also a CSV, ‘demographics’, which contains participant age and gender, among other things. We read each file into its own data frame.

Preprocessing WinG data

Next we preprocess the WinG data. Preprocessing is generally easier if our data is in long format (many rows, few columns), and is all contained in a single data frame. For now, we’ll combine the data frames for the four WinG tasks:

Explanation of commands:

The key commands here are:

Putting this all together, we make each of the four data frames longer, add a column indicating which task they are from, and then combine them into a single data frame called wing.
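As a rough illustration of the steps just described, here is a minimal sketch using two small made-up data frames in place of the real task files. The names nc_raw, np_raw, lengthen and wing_toy are hypothetical, and the worksheet's own commands may differ in detail.

```r
library(tidyverse)

# Toy stand-ins for two of the four task data frames (values are made up)
nc_raw <- tibble(subj = c(1, 2), Cards = c("english", "italian"),
                 Mountain = c("D", "C"), Penguin = c("C", "C"))
np_raw <- tibble(subj = c(1, 2), Cards = c("english", "italian"),
                 Mountain = c("NTS", "C"), Penguin = c("C", "C*"))

# Make one data frame longer and label it with its task
lengthen <- function(df, task_name) {
  df %>%
    pivot_longer(cols = -c(subj, Cards)) %>%  # one row per child per card
    mutate(task = task_name)                  # record which task this is
}

# Stack the lengthened data frames into a single data frame
wing_toy <- bind_rows(lengthen(nc_raw, "nc"), lengthen(np_raw, "np"))
print(wing_toy)
```

By default, pivot_longer puts the old column names in a column called name and the cell contents in a column called value, which is why those two column names appear in wing.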

Here are the first few rows of wing:

subj Cards name value task
1 english Mountain D nc
1 english Motorbike C nc
1 english Penguin C nc

Later on in our analysis, we will need to be able to refer to each card by its number (e.g. “card 18”) rather than the word it represents (e.g. “Mountain”). So, we’ll add another column, numbering each card from 1 to 20:

The first few rows of wing now look like this:

subj Cards name value task card
1 english Mountain D nc 1
1 english Motorbike C nc 2
1 english Penguin C nc 3

Explanation of commands:

The only new command here is rep, which means “repeat”. So, for example, rep(1, 3) gives us three ones: 1 1 1. We also need to know that e.g. 1:3 gives us the numbers from 1 to 3, i.e. 1 2 3. So, rep(1:20, 1520/20) gives us the numbers 1 to 20, 76 times, which is one number for each of the 1520 rows in wing. We need them 76 times because each of 19 participants completed four tasks (19*4 = 76).
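To check this arithmetic for yourself, you could run something like the following short sketch in the console. The variable name card is just for illustration.

```r
# 1:3 is shorthand for the whole numbers from 1 to 3
print(1:3)

# rep() repeats its first argument the given number of times
print(rep(1, 3))

# Number 20 cards for each of 76 task sessions (19 children x 4 tasks)
card <- rep(1:20, 1520 / 20)
print(length(card))    # total number of card numbers generated
print(head(card, 22))  # the sequence restarts at 1 after reaching 20
```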

Tidy up

The data frames we originally loaded the CSV files into are no longer needed, so we can remove them from our environment:

Recoding accuracy

Now we need to recode the data. The child’s response to each card has been represented by a letter code in wing. There are quite a few codes, but the only ones we need to worry about here are:

  • C or C* - C indicates that the child responded correctly for the picture on the card, C* indicates that the response was a correct synonym. In this experiment, both of these values are considered correct responses.

  • NTS - This stands for “non-target but semantically related”. This code is used in the noun and predicate production tests, for example if the picture on the card was a house but the child said “hut”.

  • N/A (not to be confused with the R data type NA) - the researchers used this code to indicate that the task was interrupted for some reason (e.g. the child began crying).

In order to analyze these data more easily, we’re going to convert these letter codes into numbers. First, we’re going to create a new column which will contain a 1 if the letter code is C or C*, and a 0 otherwise. This is going to be useful later, because we can then just add up the numbers in this column to work out how many questions each child got right on each task.

This is how we create that new column:

Explanation of command:

We’ve recoded data before, in the cleaning up questionnaire data worksheet. First, we tell R how we want each value to be recoded, in this case in cormap. Then we use mutate to add a column called correct that recodes the value column using the mapping in cormap.

New to the current worksheet is .default, which allows us to give a default value for the recoding. That way, we don’t have to explicitly say that all the other letters should be recoded as 0, we can just write .default = 0.
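The following is a minimal sketch of this kind of recoding, run on a handful of made-up response codes rather than the real data. The data frame wing_toy is hypothetical, and “D” simply stands in for any code that is not listed in cormap.

```r
library(tidyverse)

# Toy response codes (made up); "D" stands in for any other code
wing_toy <- tibble(value = c("C", "C*", "NTS", "D", "N/A"))

# Codes that count as correct; everything else falls through to .default
cormap <- c("C" = 1, "C*" = 1)

# Add a column that is 1 for correct responses and 0 otherwise
wing_toy <- wing_toy %>%
  mutate(correct = recode(value, !!!cormap, .default = 0))
print(wing_toy)
```

Note that the text code “N/A” is treated like any other unlisted code here, so it is recoded as 0.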

Further recoding

We can use the same technique to create two further columns. The first new column, related, contains a 1 if the answer is wrong but semantically related. The second new column, inter, contains a 1 if the task was interrupted for some reason.

Applying exclusion criteria

The authors of this dissertation decided to exclude a child’s answers for a task if there was an interruption at any point during the first 17 questions. Such interruptions make the task hard to interpret, so removing the data before analysis was thought to be the best option.

In order to do this, we need to work out which participants were interrupted during the first 17 cards of each task. It would be possible to do this by hand, but it would be tedious and error prone. Instead, we get R to tell us who was interrupted:

`summarise()` regrouping output by 'subj' (override with `.groups` argument)
# A tibble: 8 x 3
# Groups:   subj [3]
   subj task  inter
  <dbl> <chr> <dbl>
1     1 pc       15
2     1 pp       15
3    10 pc       17
4    10 pp       17
5    18 nc        9
6    18 np        9
7    18 pc       17
8    18 pp       17

Explanation of commands: We’ve used all these commands many times before, with the possible exception of sum - sum is a command like mean except that it adds up the numbers rather than taking their average. So, this series of commands groups the data by subj and task, then filters it to contain just the first 17 questions. It adds up the number of interruptions in each case, and filters to include just those where the number of interruptions was greater than zero.
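Here is a minimal sketch of that sequence of steps, using made-up data for two children on a single task. The data frame wing_toy and the pattern of interruptions are invented for illustration, and the exact commands in the original analysis may differ.

```r
library(tidyverse)

# Toy data: two children, one task, 20 cards each (values are made up).
# Child 1 is interrupted at cards 16 and 17; child 2 only at card 19.
wing_toy <- tibble(
  subj  = rep(c(1, 2), each = 20),
  task  = "pc",
  card  = rep(1:20, 2),
  inter = c(rep(0, 15), 1, 1, rep(0, 3),   # child 1
            rep(0, 18), 1, 0)              # child 2
)

interrupted <- wing_toy %>%
  group_by(subj, task) %>%
  filter(card <= 17) %>%             # keep only the first 17 cards
  summarise(inter = sum(inter)) %>%  # count interruptions per child and task
  filter(inter > 0)                  # keep only those with at least one
print(interrupted)
```

Child 2 does not appear in the result, because their only interruption came after card 17.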

Removing participants from the dataset

We can see from the above list that participant 18 was interrupted in all four tasks, while participants 1 and 10 were both interrupted in the two predicate tasks. We can remove these participants like this:

Explanation of commands: We’ve excluded participants before, in the preprocessing experiments worksheet. The first line uses ! (meaning “not”) in order to keep all participants except participant 18. The second line in addition uses & (meaning AND), and %in%, to keep all the data except the pc and pp tasks of participant 1. The third line does the same for participant 10.
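The logic of these filters can be tried out on a small made-up data frame, as in the sketch below. The names wing_toy and wing_kept are hypothetical, and only two of the three filters are shown.

```r
library(tidyverse)

# Toy data: one row per child per task (values are made up)
wing_toy <- tibble(
  subj = rep(c(1, 2, 18), each = 4),
  task = rep(c("nc", "np", "pc", "pp"), 3)
)

wing_kept <- wing_toy %>%
  # keep everyone except child 18
  filter(!(subj == 18)) %>%
  # keep everything except the pc and pp tasks of child 1
  filter(!(subj == 1 & task %in% c("pc", "pp")))
print(wing_kept)
```

The brackets after ! matter: they make “not” apply to the whole condition, rather than just to its first part.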

Note: In the original report, some participants were also excluded for poor performance. For reasons of brevity, and also because this practice is of somewhat debatable validity in this case, we have not included this step in the current worksheet.

Calculate scores

Next, we calculate how many questions each participant got right in each task, and also how many semantically-related errors they made. We can do this using a small set of commands we have used many times before:

`summarise()` regrouping output by 'subj', 'Cards' (override with `.groups` argument)

Here are the first few rows of our summarized data:

subj Cards task correct related
1 english nc 12 0
1 english np 4 5
2 italian nc 18 0

Pivot

Our preprocessing is now nearly over, but some of the analyses we do later will be easier to perform with a wider data frame, so we’ll widen it now, using the pivot_wider command that we’ve come across before, in the within-subject differences worksheet:

The first few rows of our new, wider, data frame look like this:

subj Cards correct_nc correct_np correct_pc correct_pp related_nc related_np related_pc related_pp
1 english 12 4 NA NA 0 5 NA NA
2 italian 18 12 17 9 0 2 0 3
3 english 18 13 17 9 0 3 0 0

Notice how pivot_wider sets cells to the value NA for participants whose responses were excluded for that particular task.
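To see how the scoring and widening steps produce these NA cells, here is a minimal sketch on made-up item-level data. The data frames wing_toy, task_by_subj_toy and wide_toy are hypothetical, and child 1 has deliberately been given no rows for the pc task.

```r
library(tidyverse)

# Toy item-level data (values are made up); child 1 has no "pc" rows,
# as if that task had been excluded
wing_toy <- tibble(
  subj    = c(1, 1, 2, 2, 2, 2),
  Cards   = c("english", "english", "italian", "italian", "italian", "italian"),
  task    = c("nc", "nc", "nc", "nc", "pc", "pc"),
  correct = c(1, 0, 1, 1, 1, 0),
  related = c(0, 0, 0, 0, 0, 1)
)

# One row per child per task, with totals for each score
task_by_subj_toy <- wing_toy %>%
  group_by(subj, Cards, task) %>%
  summarise(correct = sum(correct), related = sum(related))

# One row per child, with one column per score per task
wide_toy <- task_by_subj_toy %>%
  pivot_wider(names_from = task, values_from = c(correct, related))
print(wide_toy)
```

Because two columns are given to values_from, pivot_wider builds the new column names by joining the score name and the task name with an underscore, e.g. correct_nc.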

Combine

The final step of preprocessing is to combine some information we have in the demographics data frame into task_by_subj:

Explanation of commands: The first line picks the columns we need from the demographics data frame. The command right_join joins two data frames together, using a column they have in common (in this case, subj). It’s called a right join because it keeps every row of the second (right-hand) data frame (in this case task_by_subj), and adds the matching columns from the first (left-hand) data frame. We do a “right join” because there are some participants who appear in demo but not in task_by_subj (because we excluded some participants due to interruptions); a right join leaves those participants out of the result.
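The behaviour of right_join can be checked with a small sketch like the one below. All values are made up, and demo_toy, scores_toy and the age_months column are hypothetical names used only for this illustration.

```r
library(tidyverse)

# Toy demographics (values made up): includes child 18, who has no task scores
demo_toy <- tibble(subj = c(1, 2, 18),
                   Gender = c("Female", "Male", "Male"),
                   CDI_U = c(62, 60, 40),
                   age_months = c(30, 31, 29))

# Toy task scores: child 18 is absent, as if excluded earlier
scores_toy <- tibble(subj = c(1, 2), correct_nc = c(12, 18))

joined <- demo_toy %>%
  select(subj, Gender, CDI_U) %>%      # keep only the columns we need
  right_join(scores_toy, by = "subj")  # keep every row of scores_toy
print(joined)
```

Child 18 is in demo_toy but not in scores_toy, so they do not appear in the joined data frame.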

Our data is now fully preprocessed:

subj Gender CDI_U CDI_S Cards correct_nc correct_np correct_pc correct_pp related_nc related_np related_pc related_pp
1 Female 62 38 english 12 4 NA NA 0 5 NA NA
2 Male 60 59 italian 18 12 17 9 0 2 0 3
3 Female 97 85 english 18 13 17 9 0 3 0 0
4 Male 82 45 italian 17 11 15 12 0 4 0 2
5 Female 66 66 english 17 15 15 10 0 2 0 0
6 Male 47 32 italian 18 11 15 7 0 2 0 1
7 Male 39 27 english 17 10 13 9 0 7 0 3
8 Female 35 31 italian 18 14 19 11 0 2 0 3
9 Male 22 39 english 20 14 16 9 0 2 0 6
10 Male 34 10 italian 7 2 NA NA 0 1 NA NA
11 Female 49 28 english 19 12 14 8 0 4 0 4
12 Female 98 85 italian 19 14 16 6 0 2 0 5
13 Female 50 36 english 19 10 16 7 0 5 0 3
14 Female 62 56 italian 17 11 17 5 0 3 0 2
15 Female 81 60 english 20 16 17 10 0 3 0 1
16 Male 83 59 italian 17 13 13 5 0 3 0 2
17 Female 87 88 english 19 13 13 8 0 3 0 1
19 Female 63 63 italian 16 11 18 10 0 4 0 1

Randomization check

In our preprocessed data frame, task_by_subj, we included two columns, CDI_U and CDI_S. These are the parent’s ratings of their child’s level of mastery of a list of words, both in terms of the child understanding the words (CDI_U) and speaking the words (CDI_S).

In this first analysis, we’re going to use these measures as a check of whether the random allocation of children to the two conditions of the experiment (English cards versus Italian cards) was successful in eliminating pre-experimental differences in language mastery between those two groups. If it was, we should be able to demonstrate evidence for the null hypothesis that the two groups do not differ in their CDI_U or CDI_S scores. We do this using Bayesian between-subjects t-tests of their parents’ CDI ratings. Bayesian t-tests were introduced in the Evidence worksheet:

Bayes factor analysis
--------------
[1] Alt., r=0.707 : 0.4133763 ±0%

Against denominator:
  Null, mu1-mu2 = 0 
---
Bayes factor type: BFindepSample, JZS
Bayes factor analysis
--------------
[1] Alt., r=0.707 : 0.4228484 ±0%

Against denominator:
  Null, mu1-mu2 = 0 
---
Bayes factor type: BFindepSample, JZS

Explanation of commands:

First we load the BayesFactor package. Next, we run a Bayesian t-test which compares CDI ‘understands’ (CDI_U) for the two card sets. We then run a second Bayesian t-test which compares CDI ‘says’ (CDI_S) for the two card sets.

Explanation of output:

Here, we’re hoping to find evidence for the null hypothesis i.e. no differences in the means for the two groups. Our Bayes factors are in the indeterminate range 0.33 < BF < 3, which means we do not have clear evidence for or against the null hypothesis. We cannot be confident our randomization worked. This is unsurprising given the very small sample in this study.
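For reference, a Bayesian between-subjects t-test of this kind might look something like the sketch below. The data are made up, so the Bayes factor it prints will not match the output above, and the names toy, bf and bf_value are hypothetical.

```r
library(BayesFactor)

# Toy data (made up): parent ratings for children in two card groups
toy <- data.frame(
  Cards = factor(rep(c("english", "italian"), each = 6)),
  CDI_U = c(55, 70, 62, 48, 81, 66, 58, 73, 60, 50, 77, 69)
)

# Bayesian between-subjects t-test comparing the two groups
bf <- ttestBF(formula = CDI_U ~ Cards, data = toy)
print(bf)

# Pull out the Bayes factor as a plain number
bf_value <- extractBF(bf)$bf
print(bf_value)
```

The toy data are stored in a plain data.frame, with the grouping column as a factor, as a cautious choice for passing data to ttestBF.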

Gender differences?

Table of descriptive statistics

The authors were interested in whether there were gender differences on any of the four WinG tasks. We’ll start by making a table of descriptive statistics (means, standard deviations) by gender, for each task. This part of the pipeline for this dissertation was discussed in detail in the Better tables worksheet, so we won’t discuss it again here. Instead, we’ll just list the commands and show the final output. For further explanation, see the Better tables worksheet.

Task                     Female (M)  Female (SD)  Male (M)  Male (SD)
Noun Comprehension       17.64       2.20         16.29     4.23
Noun Production          12.09       3.24         10.43     3.95
Predicate Comprehension  16.20       1.81         14.83     1.60
Predicate Production     8.40        1.96         8.50      2.35

Bayesian t-tests

We can examine whether there is evidence for gender differences, or their absence, using a Bayesian t-test. Let’s look at the noun comprehension task first:

Bayes factor analysis
--------------
[1] Alt., r=0.707 : 0.5490203 ±0%

Against denominator:
  Null, mu1-mu2 = 0 
---
Bayes factor type: BFindepSample, JZS

Explanation of command: We’re using the long-format version of our data task_by_subj_l, which we generated as part of making the table of descriptives. We filter to include just the noun comprehension task, and remove any missing data using drop_na().

Explanation of output: The Bayes Factor is indeterminate - we have no substantial evidence for or against our hypothesis of a gender difference. This is unsurprising given the small sample size.

We can then use basically the same commands to look at gender differences in our other three tasks:

Bayes factor analysis
--------------
[1] Alt., r=0.707 : 0.5774268 ±0%

Against denominator:
  Null, mu1-mu2 = 0 
---
Bayes factor type: BFindepSample, JZS
Bayes factor analysis
--------------
[1] Alt., r=0.707 : 0.9101201 ±0.01%

Against denominator:
  Null, mu1-mu2 = 0 
---
Bayes factor type: BFindepSample, JZS
Bayes factor analysis
--------------
[1] Alt., r=0.707 : 0.4376651 ±0%

Against denominator:
  Null, mu1-mu2 = 0 
---
Bayes factor type: BFindepSample, JZS

Summary: There is no substantial evidence for or against gender differences in these tasks. This lack of conclusion is unsurprising given the small sample size.

Correlations between WinG and parents’ ratings

Are parents able to estimate the level of word mastery in their children? If so, we would expect to observe a significant correlation between, for example, CDI_U scores and performance on the noun-comprehension task. Do we?

We can calculate both the correlation co-efficient, and a Bayes Factor for that correlation, using the following two commands. We covered these commands in the relationships, part 2, worksheet; take a look back at that worksheet if you need a reminder. The only new thing here is use="complete.obs". We need this extra bit in this case because we have some missing data. The option use="complete.obs" means only use those cases where we have both a parent’s rating (CDI_U) and a task performance score (nc):

CDI_U and noun comprehension:

[1] 0.2356934
Bayes factor analysis
--------------
[1] Alt., r=0.333 : 0.6982793 ±0%

Against denominator:
  Null, rho = 0 
---
Bayes factor type: BFcorrelation, Jeffreys-beta*

This particular correlation is relatively small (around 0.2), and the evidence for a relationship is inconclusive (0.33 < BF < 3).
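The effect of use="complete.obs" can be seen in a small sketch like the following, which uses made-up numbers with one missing score. The vectors cdi and nc are hypothetical, and missing values are removed by hand before calling correlationBF so that both commands clearly use the same children.

```r
library(BayesFactor)

# Toy data (made up); the NA mimics a child excluded from the task
cdi <- c(62, 60, 97, 82, 66, 47, 39, 35)
nc  <- c(12, 18, 18, 17, NA, 18, 17, 18)

# Correlation coefficient, using only children with both values present
r <- cor(cdi, nc, use = "complete.obs")
print(r)

# Bayes factor for the same correlation, computed on the complete pairs
keep <- complete.cases(cdi, nc)
bf <- correlationBF(y = nc[keep], x = cdi[keep])
print(bf)
```

Without use="complete.obs", cor would return NA here, because one of the scores is missing.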

We can go on and do the same thing for the other three relevant correlations:

CDI_U and predicate comprehension:

[1] -0.1132625
Ignored 2 rows containing missing observations.
Bayes factor analysis
--------------
[1] Alt., r=0.333 : 0.5543404 ±0%

Against denominator:
  Null, rho = 0 
---
Bayes factor type: BFcorrelation, Jeffreys-beta*

CDI_S and noun production:

[1] 0.5700877
Bayes factor analysis
--------------
[1] Alt., r=0.333 : 5.094892 ±0%

Against denominator:
  Null, rho = 0 
---
Bayes factor type: BFcorrelation, Jeffreys-beta*

CDI_S and predicate production:

[1] -0.1548794
Ignored 2 rows containing missing observations.
Bayes factor analysis
--------------
[1] Alt., r=0.333 : 0.5874633 ±0%

Against denominator:
  Null, rho = 0 
---
Bayes factor type: BFcorrelation, Jeffreys-beta*

Summary: There is evidence of a positive correlation in the case of noun production. In the other three cases, the analysis is inconclusive. This is unsurprising given the small sample size.

Comparing accuracy on English and Italian cards

We’re now ready to examine our main hypothesis, which predicts that there will be a difference in WinG task scores, depending on which set of cards the children were tested with.

Half-violin plots

We’ll start by creating plots to show the distribution of scores for the two card sets on the WinG tasks.

Warning: Removed 4 rows containing non-finite values (stat_ydensity).

Explanation of commands:

Line 2 recodes the task labels, to make them more meaningful on the plot’s x axis. Line 3 loads the see package which provides the half_violin() function. Line 4 defines the x axis of our plot to be the WinG task, the y axis to be task accuracy (correct), and to use the Cards factor for the fill colour. Line 5 creates a “half violin” plot. As the name suggests, this shows one half of a violin plot. position = position_identity() plots the two distributions on top of each other, making it easy to see how much they overlap. alpha=0.7 changes the transparency, again to help us see the overlapping area. size=0 removes the outline around the distributions. Line 6 gives our axes meaningful labels.

Explanation of output:

The warning Removed 4 rows... is just a reminder that some data is missing. We already know this, and so we can safely ignore the warning.

The plot gives a visual indication of whether there were differences between the Italian and English cards on each of the tests. Given the extensive overlap in scores between the card sets, this seems unlikely.

Non-parametric tests

The authors of this report chose to perform non-parametric tests of their central hypotheses. The conditions under which such tests are a good choice are discussed in the traditional non-parametric worksheet. The example of a Wilcoxon test in that worksheet uses the noun comprehension data from this dissertation, so we’ll just reproduce the commands here - take a look at the worksheet if you need further explanation:

Noun comprehension:

`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 2
  Cards   median
  <chr>    <dbl>
1 english     19
2 italian     17
Warning in wilcox.test.default(x = c(12, 18, 17, 17, 20, 19, 19, 20, 19), :
cannot compute exact p-value with ties

    Wilcoxon rank sum test with continuity correction

data:  correct by Cards
W = 58, p-value = 0.125
alternative hypothesis: true location shift is not equal to 0

Explanation of output: The difference between conditions is not significant. This is unsurprising given the small sample size and a lack of any clear prior expectation of the effect size.
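As a reminder of the general shape of these commands, here is a minimal sketch on made-up scores for eight children. The names toy, medians and result are hypothetical, and the scores were chosen to have no ties, so this example does not produce the warning seen above.

```r
library(tidyverse)

# Toy data (made up): number correct for eight children, four per card set
toy <- tibble(
  Cards   = rep(c("english", "italian"), each = 4),
  correct = c(19, 18, 20, 16, 17, 15, 14, 13)
)

# Median score for each card set
medians <- toy %>%
  group_by(Cards) %>%
  summarise(median = median(correct))
print(medians)

# Wilcoxon rank sum test of the difference between card sets
result <- wilcox.test(correct ~ Cards, data = toy)
print(result)
```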

Unlike traditional tests, a Bayesian t-test can assess evidence for the null hypothesis. It is easy to apply a Bayesian t-test to these data, although the small sample size again makes it unsurprising that the result is inconclusive:

Bayes factor analysis
--------------
[1] Alt., r=0.707 : 0.6069638 ±0%

Against denominator:
  Null, mu1-mu2 = 0 
---
Bayes factor type: BFindepSample, JZS

We can apply the same commands to the other three tests. Once again, we find the unsurprising result that all the analyses are inconclusive:

Predicate comprehension:

`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 2
  Cards   median
  <chr>    <dbl>
1 english   15.5
2 italian   16.5
Warning in wilcox.test.default(x = c(17, 15, 13, 16, 14, 16, 17, 13), y =
c(17, : cannot compute exact p-value with ties

    Wilcoxon rank sum test with continuity correction

data:  correct by Cards
W = 21, p-value = 0.2623
alternative hypothesis: true location shift is not equal to 0
Bayes factor analysis
--------------
[1] Alt., r=0.707 : 0.7225958 ±0%

Against denominator:
  Null, mu1-mu2 = 0 
---
Bayes factor type: BFindepSample, JZS

Noun production:

`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 2
  Cards   median
  <chr>    <dbl>
1 english     13
2 italian     11
Warning in wilcox.test.default(x = c(4, 13, 15, 10, 14, 12, 10, 16, 13), :
cannot compute exact p-value with ties

    Wilcoxon rank sum test with continuity correction

data:  correct by Cards
W = 47.5, p-value = 0.5619
alternative hypothesis: true location shift is not equal to 0
Bayes factor analysis
--------------
[1] Alt., r=0.707 : 0.4528176 ±0%

Against denominator:
  Null, mu1-mu2 = 0 
---
Bayes factor type: BFindepSample, JZS

Predicate production:

`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 2
  Cards   median
  <chr>    <dbl>
1 english      9
2 italian      8
Warning in wilcox.test.default(x = c(9, 10, 9, 9, 8, 7, 10, 8), y = c(9, :
cannot compute exact p-value with ties

    Wilcoxon rank sum test with continuity correction

data:  correct by Cards
W = 36, p-value = 0.7097
alternative hypothesis: true location shift is not equal to 0
Bayes factor analysis
--------------
[1] Alt., r=0.707 : 0.4834645 ±0%

Against denominator:
  Null, mu1-mu2 = 0 
---
Bayes factor type: BFindepSample, JZS

Conclusion

There’s not a great deal we can conclude from these data. Unless the authors had reasons to expect a large effect size (\(d > 1.3\)), these inconclusive results are unsurprising, and probably due to the small sample size. There does seem to be some evidence that a child’s WinG performance on noun production is moderately correlated with their parent’s rating of that child’s level of mastery in noun production.


This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.