Before you start

This is an advanced worksheet, which assumes you have completed the Absolute Beginners’ Guide to R course, the Research Methods in Practice (Quantitative section) course, and the Intermediate Guide to R course.

Introduction

This worksheet describes a full analysis pipeline for an undergraduate student dissertation on children’s language development. This study was an experiment which evaluated the Words in Game (WinG) test. WinG consists of a set of picture cards which are used in four tests: noun comprehension, noun production, predicate comprehension, and predicate production. The Italian and English versions of the WinG cards use different pictures to depict the associated words.

An earlier study found a difference between the English and Italian cards, for adults’ ratings of how well each picture represented the underlying construct. In this study, the researchers hypothesised that this difference would influence children’s WinG task scores, depending on which set of cards they were tested with. The experiment compared WinG performance of English-speaking children, aged approximately 30 months, tested with either the Italian or English cards. Therefore, this was a 4 (WinG task) x 2 (cards) mixed design.

Loading data

Open the rminr-data project we used previously.

Ensure you have the latest files by asking git to “pull” the repository. Select the Git tab, which is located in the row of tabs which includes the Environment tab. Click the Pull button with a downward pointing arrow. A window will open showing the files which have been pulled from the repository. Close the Git pull window. The case-studies folder should contain a folder named allegra-cattani.

Next, create a new, empty R script and save it in the rminr-data folder as wing.R. Put all the commands from this worksheet into this file, and run them from there. Save your script regularly.

We start by reading the data.

Explanation of commands:

  1. We clear the workspace and load the tidyverse package.
  2. Data for each of the WinG tasks is stored in its own CSV file. There is also a file containing ‘demographics’ data. We read each file into its own data frame.
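
The commands being described might look something like this (a sketch; the exact file names inside the allegra-cattani folder are assumptions, so check them against the files you pulled):

rm(list = ls())
library(tidyverse)

# File names below are illustrative; use the names in case-studies/allegra-cattani
demographics <- read_csv('case-studies/allegra-cattani/demographics.csv')
nc <- read_csv('case-studies/allegra-cattani/noun_comprehension.csv')
np <- read_csv('case-studies/allegra-cattani/noun_production.csv')
pc <- read_csv('case-studies/allegra-cattani/predicate_comprehension.csv')
pp <- read_csv('case-studies/allegra-cattani/predicate_production.csv')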

Preprocessing demographics data

Next we preprocess the demographics data.

Explanation of commands:

We convert the column names to lower case using set_names(~ str_to_lower(.)). We select the gender, cdi_u and cdi_s columns. These latter two columns contain scores for the Communicative Development Inventory (CDI). The CDI is a list of 100 words with columns ‘understands’ and ‘says’. For each word, the child’s parent placed a tick in these columns if they thought their child could understand and/or say the word. The scores for ‘understands’ (cdi_u) and ‘says’ (cdi_s) are the totals of the ticked boxes in the two columns. We convert gender to a factor and recode its levels to lower case. Finally, we select() the columns we want to keep.
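
Based on that description, the preprocessing might look like this (a sketch, assuming the demographics data frame name used above):

demographics <- demographics %>%
  set_names(~ str_to_lower(.)) %>%                    # lower-case column names
  mutate(gender = factor(str_to_lower(gender))) %>%   # factor with lower-case levels
  select(subj, gender, cdi_u, cdi_s)                  # keep only the columns we need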

Our demographics data frame now contains the child’s subject number, their gender and their parents’ CDI scores:

subj gender cdi_u cdi_s
1 female 62 38
2 male 60 59
3 female 97 85

Preprocessing WinG noun comprehension data

Next we preprocess the WinG data.

Explanation of commands:

We convert the column names to lower case. The cards column indicates whether the child was tested with the Italian or English cards, so we convert it to a factor, along with subj. We select these columns and the columns containing response data for the words used in the noun comprehension task. These are in the column range mountain:wellyboots.
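
A sketch of these commands, assuming the noun comprehension data was read into nc:

nc <- nc %>%
  set_names(~ str_to_lower(.)) %>%                  # lower-case column names
  mutate(subj = factor(subj), cards = factor(cards)) %>%
  select(subj, cards, mountain:wellyboots)          # noun comprehension words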

Here are the first few rows of nc:

subj cards mountain motorbike penguin box apple iron cow doll hat watch sofa clouds hammer glasses elephant backyard spoon bib sink wellyboots
1 english D C C C C P C C C P D C D C C P C P P C
2 italian C C C C C C C C C C C C C NR C D C C C C
3 english C C C C C C C C C C C C C C NR NR C C C C

Applying exclusion criteria

Before we go any further, we need to apply some exclusion criteria to our data. When the experimenters test children with WinG, they use a textual coding scheme to record each child’s responses. The codes we’re interested in here are C (a correct response), C* (a correct synonym, which also counts as correct), N/A (no answer recorded, for example because the task was interrupted), and NTS (a response that was related to the target word, but not correct).

In this experiment, the exclusion criteria were as follows:

  1. Any subject with ‘N/A’ for one of the first 17 words in a task. This criterion excludes children with insufficient data for analysis.
  2. Any subject in the remaining data whose accuracy was more than two standard deviations below the mean. This is a more general criterion to exclude subjects who had especially low scores for various reasons.

Defining a function

Although these exclusion criteria are quite easy to describe, writing the code to apply them is a bit more complex. Furthermore, we need to apply these criteria separately for each of the four WinG tasks. Writing the same code four times would make our program harder to read and is error prone. If we were to discover that we’d made a mistake, we would have to fix it in four different places. Fortunately, we can just write the code once and then reuse it. We do this using a function.

Functions are reusable pieces of code which perform a particular function (hence the name), and are one of the most powerful features of programming languages such as R. At this point you will be familiar with using functions. For example, rm() is a function for removing objects from the R environment, and is part of the base R language. The rm() function is reusable in the sense that you can use the same function to do slightly different things. For example, rm('foo') would remove the single object foo from your environment. The string 'foo' is called an argument, and is the data processed by rm(). The same function can also remove several objects at once via its list argument: calling rm(list = ls()) first calls ls(), which returns the names of all objects in your environment, and passes the result to rm() as its list argument, thereby clearing your environment of all objects. Similarly, you’ve used read_csv() with different filename arguments to read different data files. The read_csv() function is part of the readr package, which is loaded when you call library(tidyverse) (library() is another base R function).

Removing objects from the environment is such a common task that R includes a function for it. Reading files is almost as common, which is why it is part of the readr package that you load with library(tidyverse). However, R also includes a special function called function(), which you can use to define your own functions. This is not particularly complicated, but requires a little explanation if you’ve never encountered the idea before (you can find more details in the R manual). We’ll explain the steps by writing a function called exclude(), allowing us to apply our exclusion criteria to each of the four WinG tests. A sketch of the complete function follows the numbered explanation below.

Explanation of commands:

  1. Our exclude() function accepts a data frame argument df, containing the data for a single WinG task.
  2. logical_matrix <- df == 'N/A' converts df to a matrix containing the value TRUE where a cell in df contained the string N/A, or FALSE otherwise.
  3. which(logical_matrix, arr.ind = TRUE) creates a matrix with columns row (this will be the same as our subject number) and col, containing the row and column number of any cells with the value TRUE (the cells that originally contained N/A).
  4. We convert this to a data frame which we group by row.
  5. summarise(min = min(col)) summarises the grouped data by creating a single row containing the lowest column number that contained N/A.
  6. mutate(subj = factor(row)) %>% select(subj, min) creates a factor subj from row, and selects this along with the column number of the first column that contained N/A. We have now converted our original data into a data frame with a row for each subject who had at least one N/A, and the number of the first word where the task was interrupted.
  7. q17 <- left_join(df, q17, by='subj') joins this data frame to the data frame we passed to exclude(). For rows where there is no matching subj, min gets the value NA. replace_na(list(min = 20)) converts these NAs to the value 20, indicating the child provided an answer for all words.
  8. To complete this exclusion criterion, q17 <- q17 %>% filter(min > 17) %>% select(-min) selects only children who provided answers beyond word 17, and then removes the min column, as it’s no longer needed.
  9. Next we calculate accuracy scores for each subject across all words. q17 <- q17 %>% mutate(correct = rowSums(. == 'C' | . == 'C*')) creates a new column correct with a value for each row which is the sum of the number of columns containing C or C* (| means ‘or’). Cells containing C indicate that the child responded correctly for the picture on the card for the word in this column. Cells containing C* indicate that the response was a correct synonym. In this experiment, both of these values are considered correct responses. Similarly, we calculate the total number of columns containing NTS and store this in the column related. We’ll use this value in a later analysis.
  10. Now we apply the second exclusion criterion. q17 %>% filter(correct > mean(correct) - 2 * sd(correct)) calculates the mean and standard deviation for correct, and then keeps only those subjects whose value for correct was no more than two standard deviations below the mean.
  11. Finally, we return the data frame without rows meeting our two exclusion criteria.
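
Putting those steps together, the function might look like this (a sketch based on the explanation above; the final select() of subj, correct and related is an assumption, made so the result matches the nc_by_subj output shown below):

exclude <- function(df) {
  # TRUE wherever a cell contained 'N/A'
  logical_matrix <- df == 'N/A'
  # For each subject, find the first word with an 'N/A' response
  q17 <- which(logical_matrix, arr.ind = TRUE) %>%
    as.data.frame() %>%
    group_by(row) %>%
    summarise(min = min(col)) %>%
    mutate(subj = factor(row)) %>%
    select(subj, min)
  # Subjects with no 'N/A' get min = 20, i.e. they answered every word
  q17 <- left_join(df, q17, by = 'subj') %>%
    replace_na(list(min = 20))
  # Criterion 1: keep only children who got beyond word 17
  q17 <- q17 %>% filter(min > 17) %>% select(-min)
  # Count correct (C or C*) and related (NTS) responses
  q17 <- q17 %>%
    mutate(correct = rowSums(. == 'C' | . == 'C*'),
           related = rowSums(. == 'NTS'))
  # Criterion 2: drop subjects more than 2 SDs below the mean
  q17 %>%
    filter(correct > mean(correct) - 2 * sd(correct)) %>%
    select(subj, correct, related)
}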

Now we can use exclude() to apply the exclusion criteria to our data.
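
For the noun comprehension data, this is a single line (nc_by_subj is the name used below):

nc_by_subj <- nc %>% exclude()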

`summarise()` ungrouping output (override with `.groups` argument)

Explanation of commands:

We pipe nc into exclude(), which applies our exclusion criteria. As a side-effect, it also calculates the correct and related scores for each subject.

Looking at nc_by_subj, we can see that subject 18 was excluded from the noun comprehension task. The related scores are all 0 because this column is only relevant for the word production tasks.

subj correct related
1 12 0
2 18 0
3 18 0
4 17 0
5 17 0
6 18 0
7 17 0
8 18 0
9 20 0
10 7 0
11 19 0
12 19 0
13 19 0
14 17 0
15 20 0
16 17 0
17 19 0
19 16 0

We repeat these steps to apply exclusion criteria for the noun production, predicate comprehension and predicate production tasks.
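
Sketched out, the repeated code might look like this (the np, pc and pp data frame names are assumptions; the column ranges come from the explanation below):

np_by_subj <- np %>%
  set_names(~ str_to_lower(.)) %>%
  mutate(subj = factor(subj), cards = factor(cards)) %>%
  select(subj, cards, beach:gloves) %>%
  exclude()

pc_by_subj <- pc %>%
  set_names(~ str_to_lower(.)) %>%
  mutate(subj = factor(subj), cards = factor(cards)) %>%
  select(subj, cards, big:pulling) %>%
  exclude()

pp_by_subj <- pp %>%
  set_names(~ str_to_lower(.)) %>%
  mutate(subj = factor(subj), cards = factor(cards)) %>%
  select(subj, cards, small:pushing) %>%
  exclude()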

`summarise()` ungrouping output (override with `.groups` argument)
`summarise()` ungrouping output (override with `.groups` argument)
`summarise()` ungrouping output (override with `.groups` argument)

Explanation of commands:

This code is the same as for the noun comprehension task, apart from the range of columns we select() for each task. For noun production, the scores are in beach:gloves, for predicate comprehension they’re in big:pulling, and for predicate production they’re in small:pushing.

Having preprocessed the demographics data and applied our exclusions to the WinG data, we now join these data frames together in preparation for calculating our descriptive statistics.

Explanation of commands:

We start by creating a data frame containing subj and cards for all participants, and joining it with the demographics data frame. We use nc to create the cards data, as this was a data frame before we applied any exclusion criteria. If we didn’t do this, in the next stage we would end up with NA values in cards for excluded participants.
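
A sketch of this step:

task_by_subj <- nc %>%
  select(subj, cards) %>%                 # cards for all participants
  full_join(demographics, by = 'subj')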

Here’s task_by_subj:

subj cards gender cdi_u cdi_s
1 english female 62 38
2 italian male 60 59
3 english female 97 85
4 italian male 82 45
5 english female 66 66
6 italian male 47 32
7 english male 39 27
8 italian female 35 31
9 english male 22 39
10 italian male 34 10
11 english female 49 28
12 italian female 98 85
13 english female 50 36
14 italian female 62 56
15 english female 81 60
16 italian male 83 59
17 english female 87 88
18 italian female 49 19
19 italian female 63 63

We join task_by_subj to the other WinG task data frames:

Explanation of commands:

Each full_join() joins an additional data frame by matching values in subj. The suffix = c('_nc','_np') argument adds suffixes to disambiguate columns with the same name, in this case correct and related. We use mutate() with select() to remove the correct_ prefix from the ‘correct’ columns. We leave the related_ prefix on the ‘syntactically related’ columns.
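
A sketch of these joins (the mutate()/select() step renames correct_nc to nc, and so on, to match the table below):

task_by_subj <- task_by_subj %>%
  full_join(nc_by_subj, by = 'subj') %>%
  full_join(np_by_subj, by = 'subj', suffix = c('_nc', '_np')) %>%
  full_join(pc_by_subj, by = 'subj') %>%
  full_join(pp_by_subj, by = 'subj', suffix = c('_pc', '_pp')) %>%
  mutate(nc = correct_nc, np = correct_np, pc = correct_pc, pp = correct_pp) %>%
  select(subj, gender, cards, nc, np, pc, pp, cdi_u, cdi_s,
         related_nc, related_np, related_pc, related_pp)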

Our data is now fully preprocessed. Notice how the join sets cells to the value NA for subjects who were excluded for that particular task.

subj gender cards nc np pc pp cdi_u cdi_s related_nc related_np related_pc related_pp
1 female english 12 4 NA NA 62 38 0 5 NA NA
2 male italian 18 12 17 9 60 59 0 2 0 3
3 female english 18 13 17 9 97 85 0 3 0 0
4 male italian 17 11 15 12 82 45 0 4 0 2
5 female english 17 15 15 10 66 66 0 2 0 0
6 male italian 18 11 15 7 47 32 0 2 0 1
7 male english 17 10 13 9 39 27 0 7 0 3
8 female italian 18 14 19 11 35 31 0 2 0 3
9 male english 20 14 16 9 22 39 0 2 0 6
10 male italian 7 2 NA NA 34 10 0 1 NA NA
11 female english 19 12 14 8 49 28 0 4 0 4
12 female italian 19 14 16 6 98 85 0 2 0 5
13 female english 19 10 16 7 50 36 0 5 0 3
14 female italian 17 11 17 5 62 56 0 3 0 2
15 female english 20 16 17 10 81 60 0 3 0 1
16 male italian 17 13 13 5 83 59 0 3 0 2
17 female english 19 13 13 8 87 88 0 3 0 1
18 female italian NA NA NA NA 49 19 NA NA NA NA
19 female italian 16 11 18 10 63 63 0 4 0 1

Comparing parents’ ratings of their child’s language ability

For our first analysis, we want to check that there were no differences in language ability between the children assigned to the Italian and English card groups. We do this using between-subjects t-tests of their parents’ CDI ratings. Bayesian t-tests were introduced in the Evidence worksheet.
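
The tests themselves might look like this (a sketch using ttestBF()’s formula interface):

library(BayesFactor)

# CDI 'understands' by card set
ttestBF(formula = cdi_u ~ cards, data = task_by_subj)
# CDI 'says' by card set
ttestBF(formula = cdi_s ~ cards, data = task_by_subj)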


Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':

    expand, pack, unpack
************
Welcome to BayesFactor 0.9.12-4.2. If you have questions, please contact Richard Morey (richarddmorey@gmail.com).

Type BFManual() to open the manual.
************
Warning: data coerced from tibble to data frame
Bayes factor analysis
--------------
[1] Alt., r=0.707 : 0.404659 ±0%

Against denominator:
  Null, mu1-mu2 = 0 
---
Bayes factor type: BFindepSample, JZS
Warning: data coerced from tibble to data frame
Bayes factor analysis
--------------
[1] Alt., r=0.707 : 0.4518936 ±0%

Against denominator:
  Null, mu1-mu2 = 0 
---
Bayes factor type: BFindepSample, JZS

Explanation of commands:

First we load the BayesFactor package. Next, we run a t-test which compares CDI ‘understands’ (cdi_u) for the two card sets. We run another t-test which compares CDI ‘says’ (cdi_s) for the two card sets.

Explanation of output:

Here, we’re hoping to find evidence for the null hypothesis, i.e. no difference in the means for the two groups. Our Bayes factors (CDI understands = 0.4; CDI says = 0.45) indicate that it’s 2-2.5 times more likely that there’s no difference in language ability between the children tested using the Italian and English cards than that there is a difference.

Comparing gender on WinG accuracy

We would also like to check that there were no gender differences on any of the tests. We’ll create a table of descriptive statistics with four rows corresponding to the four WinG tasks, and columns containing accuracy means and standard deviations for male and female children. This table is described in detail in the Better tables worksheet. It’s convenient to see differences between groups next to the descriptive statistics. We’ll do this by adding a column to our table. Each cell will contain the result of a t-test comparing male and female accuracy for the WinG task in that table row.

We start by calculating means and standard deviations for each WinG task by gender.

`summarise()` regrouping output by 'task' (override with `.groups` argument)

Explanation of commands:

We convert task_by_subj to long format by using pivot_longer() on the correct score columns nc, np, pc and pp. We use this to calculate a mean and standard deviation by gender within task. When calculating these summary statistics, we need the argument na.rm = TRUE to ignore missing data for participants who were excluded from any of the tests. We round the results to 2 decimal places.
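
A sketch of these commands:

descriptives <- task_by_subj %>%
  pivot_longer(c(nc, np, pc, pp), names_to = 'task', values_to = 'correct') %>%
  group_by(task, gender) %>%
  summarise(mean = round(mean(correct, na.rm = TRUE), 2),
            sd = round(sd(correct, na.rm = TRUE), 2))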

Our table of descriptive statistics now looks like this:

task gender mean sd
nc male 16.29 4.231
nc female 17.64 2.203
np male 10.43 3.952
np female 12.09 3.239
pc male 14.83 1.602
pc female 16.2 1.814
pp male 8.5 2.345
pp female 8.4 1.955

Next we do some preparation for displaying our table in APA format.

task mean_female sd_female mean_male sd_male
nc 17.64 2.203 16.29 4.231
np 12.09 3.239 10.43 3.952
pc 16.2 1.814 14.83 1.602
pp 8.4 1.955 8.5 2.345

Explanation of commands:

We widen the table using the pivot_wider() command we used previously in the within-subject differences worksheet. We use select() to order the columns so that the means are next to their associated standard deviations.
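
A sketch:

descriptives <- descriptives %>%
  pivot_wider(names_from = gender, values_from = c(mean, sd)) %>%
  select(task, mean_female, sd_female, mean_male, sd_male)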

Now we run some Bayesian t-tests to test for gender differences on each of the tasks. Between subjects t-tests were covered in the Evidence worksheet.

Explanation of commands:

We’ll be doing the same t-test four times, once for the data from each task. Like the code for applying exclusion criteria that you saw earlier, this makes it a good candidate for a function. Line 1 begins the definition of the t_gender() function. The function runs the t-test on the data frame passed in as df. We won’t use the group argument, but we’ll explain why it’s there when we call the function. Line 2 uses drop_na() to remove rows for participants who were excluded for the task in df. Recall that we set these values to NA. Line 3 uses ttestBF() to run a Bayesian t-test which compares task accuracy (correct) between the genders. Line 4 returns the Bayes factor.
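
The function might look like this (a sketch; extractBF() is one way to pull the Bayes factor out as a number):

t_gender <- function(df, group) {
  df <- df %>% drop_na(correct)               # drop excluded participants
  bf <- ttestBF(formula = correct ~ gender, data = data.frame(df))
  data.frame(bf = extractBF(bf)$bf)           # return the Bayes factor
}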

Now we can use the function to run the four t-tests.

task bf
nc 0.4897
np 0.6805
pc 0.9101
pp 0.4377

Line 3 calls t_gender() for each of the groups created by group_by(task) in line 2. group_modify() calls a function for each group of a grouped data frame. It always passes the function two values: the first is a data frame containing the subset of rows for the group, and the second is the name of the group. The function must accept two arguments, which explains why we defined group in t_gender(), even though we didn’t use it. The data supplied by group_modify() is replaced with the data frame returned by the function it calls. The column which defines the groups is preserved. So here, we’ve used t_gender() to replace the data for each task with a data frame containing the Bayes factor for the associated t-test.
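
A sketch of the call (gender_bf is an assumed name):

gender_bf <- task_by_subj %>%
  pivot_longer(c(nc, np, pc, pp), names_to = 'task', values_to = 'correct') %>%
  group_by(task) %>%
  group_modify(t_gender)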

Finally, we join this to our descriptives data frame, using the task values to match up the rows.
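
For example:

descriptives <- descriptives %>% full_join(gender_bf, by = 'task')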

task mean_female sd_female mean_male sd_male bf
nc 17.64 2.203 16.29 4.231 0.4897
np 12.09 3.239 10.43 3.952 0.6805
pc 16.2 1.814 14.83 1.602 0.9101
pp 8.4 1.955 8.5 2.345 0.4377

We can now tidy the table up ready for printing.

Explanation of commands:

Lines 1-7 use the approach for renaming factor levels that was introduced in the cleaning up questionnaire data worksheet. The last 2 lines give the columns more meaningful headings.
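
A sketch of the column-renaming step (the factor-level renaming in lines 1-7 is not reproduced here; the new column names are taken from the printed table below):

descriptives <- descriptives %>%
  rename(Task = task,
         `Female (M)` = mean_female, `Female (SD)` = sd_female,
         `Male (M)` = mean_male, `Male (SD)` = sd_male,
         BF = bf)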

Now we can print our finished table.

Task Female (M) Female (SD) Male (M) Male (SD) BF
nc 17.64 2.20 16.29 4.23 0.49
np 12.09 3.24 10.43 3.95 0.68
pc 16.20 1.81 14.83 1.60 0.91
pp 8.40 1.96 8.50 2.35 0.44

Explanation of commands:

We load the kableExtra package and pipe our data into kable(). The digits = 2 argument ensures that every number is reported to two decimal places. kable_styling() prints the table to the Viewer window in RStudio. This table could be included in a report by copy-pasting it into a word processor, and then styling it according to APA guidelines.
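
A sketch of these commands:

library(kableExtra)

descriptives %>%
  kable(digits = 2) %>%
  kable_styling()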

Explanation of output:

The Bayes factors mean that it’s about twice as likely that there are no gender differences than that there are, for noun comprehension and predicate production. For noun production, it’s about 1.5 times as likely that there isn’t a gender difference as that there is. For predicate comprehension, the evidence for or against a gender difference is about equal.

Exploring correlations between CDI and WinG

We would also like to know whether parents’ ratings of their child’s ability on the CDI were related to their child’s performance on the WinG tasks. We start by looking at correlations between CDI ‘understands’ scores and comprehension scores on the noun and predicate tests.

[1] 0.06225014
Ignored 1 rows containing missing observations.
Bayes factor analysis
--------------
[1] Alt., r=0.333 : 0.6982793 ±0%

Against denominator:
  Null, rho = 0 
---
Bayes factor type: BFcorrelation, Jeffreys-beta*
[1] -0.100163
Ignored 3 rows containing missing observations.
Bayes factor analysis
--------------
[1] Alt., r=0.333 : 0.5543404 ±0%

Against denominator:
  Null, rho = 0 
---
Bayes factor type: BFcorrelation, Jeffreys-beta*

Explanation of commands:

We use a Spearman correlation to test the relationship between task_by_subj$cdi_u and task_by_subj$nc (noun comprehension). This was introduced in the More on relationships, part 2 worksheet. The option use="complete.obs" deletes any cases with a value of NA. (Remember that we assigned the value NA to cells for subjects who were excluded from a test.) We also calculate a Bayes Factor to test the reliability of the relationship. We use similar commands to test the relationship between cdi_u and predicate comprehension (pc).
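
A sketch of these commands:

# CDI 'understands' vs. noun comprehension
cor(task_by_subj$cdi_u, task_by_subj$nc,
    method = 'spearman', use = 'complete.obs')
correlationBF(task_by_subj$cdi_u, task_by_subj$nc)

# CDI 'understands' vs. predicate comprehension
cor(task_by_subj$cdi_u, task_by_subj$pc,
    method = 'spearman', use = 'complete.obs')
correlationBF(task_by_subj$cdi_u, task_by_subj$pc)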

Explanation of output:

There doesn’t appear to be a relationship between parents’ CDI ‘understands’ scores and children’s noun comprehension (rs = 0.06), although the evidence for this conclusion is weak (BF = 0.7). There seems to be a weak negative correlation between parents’ CDI ‘understands’ scores and children’s predicate comprehension (rs = -0.1); however, the Bayes factor of 0.55 means it’s almost twice as likely that there isn’t a relationship between these two variables than that there is.

Now we look at correlations between parents’ ratings of their child’s ability to say words (CDI ‘says’), and the children’s noun and predicate production scores.

[1] 0.5673528
Ignored 1 rows containing missing observations.
Bayes factor analysis
--------------
[1] Alt., r=0.333 : 5.094892 ±0%

Against denominator:
  Null, rho = 0 
---
Bayes factor type: BFcorrelation, Jeffreys-beta*
[1] -0.02759308
Ignored 3 rows containing missing observations.
Bayes factor analysis
--------------
[1] Alt., r=0.333 : 0.5874633 ±0%

Against denominator:
  Null, rho = 0 
---
Bayes factor type: BFcorrelation, Jeffreys-beta*

Explanation of commands:

These calculations are identical to those above, except we are looking at the evidence for relationships between cdi_s and task_by_subj$np (noun production), and between cdi_s and task_by_subj$pp (predicate production).
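
A sketch:

# CDI 'says' vs. noun production
cor(task_by_subj$cdi_s, task_by_subj$np,
    method = 'spearman', use = 'complete.obs')
correlationBF(task_by_subj$cdi_s, task_by_subj$np)

# CDI 'says' vs. predicate production
cor(task_by_subj$cdi_s, task_by_subj$pp,
    method = 'spearman', use = 'complete.obs')
correlationBF(task_by_subj$cdi_s, task_by_subj$pp)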

Explanation of output:

There is evidence of a strong, positive correlation between parents’ CDI ‘says’ scores and children’s noun production (rs = 0.57, BF = 5.09). There doesn’t appear to be a relationship between CDI ‘says’ and children’s predicate production (rs = -0.03, BF = 0.59).

In summary, noun production was the only task in which parents’ ratings of their children’s ability matched the children’s accuracy.

Plotting WinG accuracy by card set

We’re now ready to test our main hypothesis, which predicts a difference in WinG task scores depending on which set of cards the children were tested with. We’ll start by creating plots to show the distribution of scores for the two card sets on the WinG tasks.

Warning: Removed 8 rows containing non-finite values (stat_ydensity).

Explanation of commands:

Line 2 recodes the task factor levels, to make them more meaningful on the plot’s x axis. Line 3 loads the see package, which provides the geom_violinhalf() function. Line 4 defines the x axis of our plot to be the WinG tasks, the y axis to be task accuracy (1-20), and uses the cards factor for the fill colour. Line 5 creates a “half violin” plot. As the name suggests, this shows one half of a violin plot. position = position_identity() plots the two distributions on top of each other, making it easy to see how much they overlap. alpha = 0.7 changes the transparency, again to help us see the overlapping area. size = 0 removes the outline around the distributions. Line 6 uses a grey palette for filling in the two distributions. Line 7 gives our axes meaningful labels.
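
The plotting code might look like this (a sketch; task_long is an assumed name for the long-format data, which we reuse below):

task_long <- task_by_subj %>%
  pivot_longer(c(nc, np, pc, pp), names_to = 'task', values_to = 'correct') %>%
  mutate(task = recode(task,
                       nc = 'Noun Comprehension',
                       np = 'Noun Production',
                       pc = 'Predicate Comprehension',
                       pp = 'Predicate Production'))

library(see)

task_long %>%
  ggplot(aes(x = task, y = correct, fill = cards)) +
  geom_violinhalf(position = position_identity(), alpha = 0.7, size = 0) +
  scale_fill_grey() +
  labs(x = 'WinG task', y = 'Accuracy')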

Explanation of output:

This plot gives a visual indication of whether there were differences between the Italian and English cards on each of the tests. Given the extensive overlap in scores between the card sets, this seems unlikely. The plot also shows that the data were slightly skewed on the noun tasks due to some low scores.

Comparing WinG accuracy by card set

Although the plot suggests that the predicate data were normally distributed, given the small sample sizes, non-parametric tests were considered most appropriate for analysing these data. We’ll use Mann-Whitney U tests to compare each task score for the Italian and English cards. The Mann-Whitney U test is explained in more detail in the Traditional non-parametric tests worksheet.

We start by creating some summary statistics.

`summarise()` regrouping output by 'task' (override with `.groups` argument)
# A tibble: 8 x 6
# Groups:   task [4]
  task                    cards       n median mean_rank sum_rank
  <fct>                   <fct>   <int>  <dbl>     <dbl>    <dbl>
1 Noun Comprehension      english     9   19       11.4     103  
2 Noun Comprehension      italian     9   17        7.56     68  
3 Noun Production         english     9   13       10.3      92.5
4 Noun Production         italian     9   11        8.72     78.5
5 Predicate Comprehension english     8   15.5      7.12     57  
6 Predicate Comprehension italian     8   16.5      9.88     79  
7 Predicate Production    english     8    9        9        72  
8 Predicate Production    italian     8    8        8        64  

Explanation of commands:

Lines 1-3 rank the accuracy scores for each of the four tests. Line 6 groups the data by cards within task. Line 7 removes any rows where correct contains the value NA. These were the children we excluded earlier. Lines 8-11 calculate, for each group, the number of children, the median, and the mean and sum of the ranked items.
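
A sketch, reusing the long-format task_long data frame from the plot above (here drop_na() is applied before ranking, which gives the same group statistics):

task_long %>%
  drop_na(correct) %>%                 # remove excluded children
  group_by(task) %>%
  mutate(rank = rank(correct)) %>%     # rank scores within each task
  group_by(task, cards) %>%
  summarise(n = n(), median = median(correct),
            mean_rank = mean(rank), sum_rank = sum(rank))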

Now we run Mann-Whitney tests to make a direct comparison between the Italian and English cards for each WinG task.

# A tibble: 4 x 3
# Groups:   task [4]
  task                        U     p
  <fct>                   <dbl> <dbl>
1 Noun Comprehension       58    0.13
2 Noun Production          47.5  0.56
3 Predicate Comprehension  21    0.26
4 Predicate Production     36    0.71

Explanation of commands:

The Mann-Whitney U test is a non-parametric equivalent of a t-test. It’s explained in detail in the Traditional non-parametric tests worksheet. Lines 2-6 define a function called mann_whitney to run a Mann-Whitney test comparing the correct column against the factor cards. The function returns the values for U and p as a data frame.

Lines 8-10 group our data by task, then use group_modify(mann_whitney) to replace the data for each group with the Mann-Whitney test results. Line 11 displays the results.
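
A sketch of the function and its use (wilcox.test() is base R’s Mann-Whitney implementation; its W statistic corresponds to the U values reported above):

mann_whitney <- function(df, group) {
  test <- wilcox.test(correct ~ cards, data = df)
  data.frame(U = test$statistic, p = round(test$p.value, 2))
}

task_long %>%
  drop_na(correct) %>%
  group_by(task) %>%
  group_modify(mann_whitney)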

Explanation of output:

As this is a traditional statistical test, the p-value indicates whether there was a significant difference between the cards on any of the tasks. All p values were > 0.05, suggesting there were no differences, contrary to our hypothesis. (Note that we’ve removed the ‘cannot compute exact p-value with ties’ warnings to make the output easier to read.)

Conclusion

In summary, these data suggest that WinG scores should be comparable, regardless of whether the English or Italian picture cards are used. It also suggests that adults’ ratings of how well the pictures map to the underlying concepts are quite different from children’s.


This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.