Reliability of Targets in a Picture Naming Task

Before you start

This is an advanced worksheet, which assumes you have completed the Absolute Beginners’ Guide to R course, the Research Methods in Practice (Quantitative section) course, and the Intermediate Guide to R course.

Introduction
Loading data
Preprocessing demographics data
Preprocessing WinG noun comprehension data
Applying exclusion criteria
Comparing parents’ ratings of their child’s language ability
Comparing gender on WinG accuracy
Exploring correlations between CDI and WinG
Plotting WinG accuracy by card set
Comparing WinG accuracy by card set
Comparing semantically related production errors by card set

Introduction

This worksheet describes a full analysis pipeline for an undergraduate student dissertation on children’s language development. This study was an experiment which evaluated the Words in Game (WinG) test. WinG consists of a set of picture cards which are used in four tests: noun comprehension, noun production, predicate comprehension, and predicate production. The Italian and English versions of the WinG cards use different pictures to depict the associated words.

An earlier study found a difference between the English and Italian cards, for adults’ ratings of how well each picture represented the underlying construct. In this study, the researchers hypothesised that this difference would influence children’s WinG task scores, depending on which set of cards they were tested with. The experiment compared WinG performance of English-speaking children, aged approximately 30 months, tested with either the Italian or English cards. Therefore, this was a 4 (WinG task) x 2 (cards) mixed design.

Loading data

Open the rminr-data project we used previously.

Ensure you have the latest files by asking git to “pull” the repository. Select the Git tab, which is located in the row of tabs which includes the Environment tab. Click the Pull button with a downward pointing arrow. A window will open showing the files which have been pulled from the repository. Close the Git pull window. The case-studies folder should contain a folder named allegra-cattani.

Next, create a new, empty R script and save it in the rminr-data folder as wing.R. Put all the commands from this worksheet into this file, and run them from there. Save your script regularly.

We start by reading the data.

rm(list = ls()) # clear the environment
library(tidyverse)

# read data
demographics <- read_csv('case-studies/allegra-cattani/demographics.csv')
nc <- read_csv('case-studies/allegra-cattani/noun_comprehension.csv')
np <- read_csv('case-studies/allegra-cattani/noun_production.csv')
pc <- read_csv('case-studies/allegra-cattani/predicate_comprehension.csv')
pp <- read_csv('case-studies/allegra-cattani/predicate_production.csv')

Explanation of commands:

We clear the workspace, and load the tidyversepackage.
Data for each of the WinG tasks is stored in its own CSV file. There is also file containing ‘demographics’ data. We read each file into its own data frame.

Preprocessing demographics data

Next we preprocess the demographics data.

# preprocess demographics data
demographics <- demographics %>%
  set_names(~ str_to_lower(.)) %>%
  mutate(subj = factor(subj), gender = factor(gender)) %>%
  mutate(gender = recode_factor(.$gender, Male = 'male',
                                Female = 'female')) %>%
  select(subj, gender, cdi_u, cdi_s)

Explanation of commands:

We convert the column names to lower case using set_names(~ str_to_lower(.)). We select the gender, cdi_u and cdi_s columns. These latter two columns contain scores for the Communicative Development Inventory (CDI). The CDI is a list of 100 words with columns ‘understands’ and ‘says’. For each word, the child’s parent placed a tick in these columns if they thought their child could understand and/or say the word. The scores for ‘understands’ (cdi_u) and ‘says’ (cdi_s) are a total of the ticked boxes in the two columns. We convert gender to a factor and recode it’s levels to lower case. Finaly we select() the columns we want to keep.

Our demographics data frame now contains the child’s subject number, their gender and their parents’ CDI scores:

subj	gender	cdi_u	cdi_s
1	female	62	38
2	male	60	59
3	female	97	85

Preprocessing WinG noun comprehension data

Next we preprocess the WinG data.

# preprocess noun comprehension data
nc <- nc %>%
  set_names(~ str_to_lower(.)) %>%
  mutate(subj = factor(subj), cards = factor(cards)) %>%
  select(subj, cards, mountain:wellyboots)

Explanation of commands:

We convert the column names to lower case. The cards column indicates whether the child was tested with the Italian or English cards, so we convert it to a factor along with subj. We select these columns and the columns containing rating data for the words used in the the noun comprehension task. These are in the column range mountain:wellyboots.

Here are the first few rows of nc:

subj	cards	mountain	motorbike	penguin	box	apple	iron	cow	doll	hat	watch	sofa	clouds	hammer	glasses	elephant	backyard	spoon	bib	sink	wellyboots
1	english	D	C	C	C	C	P	C	C	C	P	D	C	D	C	C	P	C	P	P	C
2	italian	C	C	C	C	C	C	C	C	C	C	C	C	C	NR	C	D	C	C	C	C
3	english	C	C	C	C	C	C	C	C	C	C	C	C	C	C	NR	NR	C	C	C	C

Applying exclusion criteria

Before we go any further, we need to apply some exclusion criteria to our data. When the experimenters test children with WinG, they use a textual coding scheme to record each child’s responses. The codes we’re interested in here are:

N/A (not to be confused with the R data type NA) - researchers used this code to indicate that the task was interrupted for some reason (e.g. the child began crying).
C or C* - C indicates that the child responded correctly for the picture on the card, C* indicates that the response was a correct synonym. In this experiment, both of these values are considered correct responses.
NTS - This stands for “non-target but semantically related”. This code is used in the noun and predicate production tests, for example if the picture on the card was a house but the child said “hut”.

In this experiment, the exclusion criteria were as follows:

Any subject with ‘N/A’ for one of the first 17 words in a task. This criterion excludes children with insufficient data for analysis.
Any subject in the remaining data whose accuracy was less than two standard deviations below the mean. This is a more general criterion to exlude subjects who had especially low scores for various reasons.

Defining a function

Although these exclusion criteria are quite easy to describe, writing the code to apply them is a bit more complex. Furthermore, we need to apply these criteria separately for each of the four WinG tasks. Writing the same code four times would make our program harder to read and is error prone. If we were to discover that we’d made a mistake, we would have to fix it in four different places. Fortunately, we can just write the code once and then reuse it. We do this using a function.

Functions are reusable lines of code which perform a particular function (hence the name), and are one of the most powerful features of programming languages such as R. At this point you will be familiar with using functions. For example rm() is a function for removing objects from the R environment, and is part of the base R language. The rm() function is reusable in the sense that you can use the same function to do slightly different things. For example, rm('foo') would remove the single object foo from your environment. The string foo is called an argument and is the data processed by rm(). The same function can also remove multiple variables provided in a list, so calling rm(ls()), first calls ls() which lists all objects in your environment, and passes the results as an argument to rm() which removes them, thereby cleaning your environment of all objects. Similarly, you’ve used read_csv(), with different filename arguments to read different data files. The read_csv() function is part of the readr package which is loaded when you call library(tidyverse) (library() is another base R function).

Removing objects from the environment is such a common task that it’s a function included with R. Reading files is almost as common, which is why it is part of the readr package that you load with library(tidyverse). However, R also includes a special function called function(), which you can use to define your own functions. This is not particularly complicated, but requires a little explanation if you’ve never encountered the idea before (you can find more details in the R manual). We’ll explain the steps by writing a function called exclude(), allowing us to apply our exclusion criteria for each of the four WinG tests.

# apply WinG exclusion criteria
exclude <- function(df) {
  # exclude if N/A in item 17 or lower
  logical_matrix <- df == 'N/A'
  q17 <- logical_matrix %>%
    which(arr.ind = TRUE) %>%
    data.frame() %>%
    group_by(row) %>%
    summarise(min = min(col)) %>%
    mutate(subj = factor(row)) %>%
    select(subj, min)
  q17 <- left_join(df, q17, by='subj') %>%
    replace_na(list(min = 20))
  q17 <- q17 %>% filter(min > 17) %>% select(-min)
  
  # calculate total correct and semantically related
  q17 <- q17 %>% mutate(correct = rowSums(. == 'C' | . == 'C*'),
                        related = rowSums(. == 'NTS'))
  
  # exclude participants with scores < 2 sd below the mean
  q17 <- q17 %>% filter(correct < mean(correct) + 2 * sd(correct))

  return(q17)
}

Explanation of commands:

Our exclude() function accepts a data frame argument df, containing the data for a single WinG task.
logical_matrix <- df == 'N/A' converts df to a matrix containing the value TRUE where a cell in df contained the string N/A, or FALSE otherwise.
which(arr.ind = TRUE) creates a matrix with columns row (this will be the same as our subject number) and col containing the row and column number of any cells with the value TRUE (the cells that originally contained N/A).
We convert this to a data frame which we group by row.
summarise(min = min(col)) summarises the grouped data by creating a single row containing the lowest column number that contained N/A.
mutate(subj = factor(row)) %>% select(subj, min) creates a factor subj from row, and selects this along with the column number of the first column that contained N/A. We have now converted our original data into a data from with rows for any subjects who had at least one N/A and the number of the first word where the task was interrupted.
q17 <- left_join(df, q17, by='subj') joins this data frame to the data frame we passed to exclude(). For rows where there is no matching subj, min gets the value NA. replace_na(list(min = 20)) converts these NAs to the value 20, indicating the child provided an answer for all words.
To complete this exclusion criterion, q17 <- q17 %>% filter(min > 17) %>% select(-min) selects only children who provided answers beyond word 17, and then removes the min column, as it’s no longer needed.
Next we calculate accuracy scores for each subject across all words. q17 <- q17 %>% mutate(correct = rowSums(. == 'C' | . == 'C*')) creates a new column correct with a value for each row which is the sum of the number of columns containing C or C* (| means ‘or’). Cells containing C indicate that the child responded correctly for the picture on the card for the the word in this column. Cells containing C* indicates that the response was a correct synonym. In this experiment, both of these values are considered correct responses. Similarly, we calculate the total number columns containing NTS and store this in the column related. We’ll use this value in a later analysis.
Now we apply the second exclusion criterion. q17 %>% filter(correct < mean(correct) + 2 * sd(correct)) calculates the mean and standard deviation for correct, and then filters any subjects whose value for correct was less than two standard deviations below the mean.
Finally, we return the data frame without rows meeting our two exclusion critera.

Now we can use exclude() to apply the exclusion criteria to our data.

nc_by_subj <- nc %>% exclude() %>% select(subj, correct, related)

`summarise()` ungrouping output (override with `.groups` argument)

Explanation of commands:

We pipe nc into exclude(), which applies our exclusion criteria. As a side-effect, it also calculates the correct and related scores for each subject.

Looking at nc_by_subj, we can see that subject 18 was excluded from the noun comprehension task. The related scores are all 0 because this column is only relevant for the word production tasks.

subj	correct	related
1	12	0
2	18	0
3	18	0
4	17	0
5	17	0
6	18	0
7	17	0
8	18	0
9	20	0
10	7	0
11	19	0
12	19	0
13	19	0
14	17	0
15	20	0
16	17	0
17	19	0
19	16	0

We repeat these steps to apply exclusion critera for the noun production, predicate comprehension and predicate production tasks.

np <- np %>%
  set_names(~ str_to_lower(.)) %>%
  mutate(subj = factor(subj), cards = factor(cards)) %>%
  select(subj, cards, beach:gloves)
np_by_subj <- np %>% exclude() %>% select(subj, correct, related)

`summarise()` ungrouping output (override with `.groups` argument)

pc <- pc %>%
  set_names(~ str_to_lower(.)) %>%
  mutate(subj = factor(subj), cards = factor(cards)) %>%
  select(subj, cards, big:pulling)
pc_by_subj <- pc %>% exclude() %>% select(subj, correct, related)

`summarise()` ungrouping output (override with `.groups` argument)

pp <- pp %>%
  set_names(~ str_to_lower(.)) %>%
  mutate(subj = factor(subj), cards = factor(cards)) %>%
  select(subj, cards, small:pushing)
pp_by_subj <- pp %>% exclude() %>% select(subj, correct, related)

`summarise()` ungrouping output (override with `.groups` argument)

Explanation of commands:

This code is the same as for the noun comprehension task, apart the range of columns we select() for each task. For noun production, the scores are in beach:gloves, for predication comprehension they’re in big:pulling, and for predication production they’re in small:pushing.

Having preprocessed the demographics data and applied our exclusions to the WinG data, we now join these data frames together in preparation for calculating our descriptive statistics.

# join data
task_by_subj <- nc %>% select(subj, cards) %>%
  full_join(., demographics, by='subj')

Explanation of commands:

We start by creating a data frame containing subj and cards for all participants, and joining it with the demographics data frame. We use nc to create the cards data, as this was a data frame before we applied any exclusion critera. If we didn’t do this, in the next stage we would end up with NA values in cards for excluded participants.

Here’s task_by_subj:

subj	cards	gender	cdi_u	cdi_s
1	english	female	62	38
2	italian	male	60	59
3	english	female	97	85
4	italian	male	82	45
5	english	female	66	66
6	italian	male	47	32
7	english	male	39	27
8	italian	female	35	31
9	english	male	22	39
10	italian	male	34	10
11	english	female	49	28
12	italian	female	98	85
13	english	female	50	36
14	italian	female	62	56
15	english	female	81	60
16	italian	male	83	59
17	english	female	87	88
18	italian	female	49	19
19	italian	female	63	63

We join task_by_subj to the other WinG task data frames:

task_by_subj <- task_by_subj %>%
  full_join(., nc_by_subj, by='subj') %>%
  full_join(., np_by_subj, by='subj', suffix = c('_nc','_np')) %>%
  full_join(., pc_by_subj, by='subj') %>%
  full_join(., pp_by_subj, by='subj', suffix = c('_pc', '_pp')) %>%
  mutate(nc = correct_nc, np = correct_np, pc = correct_pc, pp = correct_pp) %>%
  select(subj, gender, cards, nc, np, pc, pp, cdi_u, cdi_s,
         related_nc, related_np, related_pc, related_pp)

Explanation of commands:

Each full_join(), joins an additional data frame by matching values in subj. The suffix = c('_nc','_np') argument adds suffixes to disambiguate columns with the same name, in this case correct and related. We use mutate() with select() to remove the correct_ prefix from the ‘correct’ columns. We leave the related_ prefix on the ‘syntactically related’ columns.

Our data is now fully preprocessed. Notice how the join sets cells to the value NA for subjects who were excluded for that particular task.

subj	gender	cards	nc	np	pc	pp	cdi_u	cdi_s	related_nc	related_np	related_pc	related_pp
1	female	english	12	4	NA	NA	62	38	0	5	NA	NA
2	male	italian	18	12	17	9	60	59	0	2	0	3
3	female	english	18	13	17	9	97	85	0	3	0	0
4	male	italian	17	11	15	12	82	45	0	4	0	2
5	female	english	17	15	15	10	66	66	0	2	0	0
6	male	italian	18	11	15	7	47	32	0	2	0	1
7	male	english	17	10	13	9	39	27	0	7	0	3
8	female	italian	18	14	19	11	35	31	0	2	0	3
9	male	english	20	14	16	9	22	39	0	2	0	6
10	male	italian	7	2	NA	NA	34	10	0	1	NA	NA
11	female	english	19	12	14	8	49	28	0	4	0	4
12	female	italian	19	14	16	6	98	85	0	2	0	5
13	female	english	19	10	16	7	50	36	0	5	0	3
14	female	italian	17	11	17	5	62	56	0	3	0	2
15	female	english	20	16	17	10	81	60	0	3	0	1
16	male	italian	17	13	13	5	83	59	0	3	0	2
17	female	english	19	13	13	8	87	88	0	3	0	1
18	female	italian	NA	NA	NA	NA	49	19	NA	NA	NA	NA
19	female	italian	16	11	18	10	63	63	0	4	0	1

Comparing parents’ ratings of their child’s language ability

For our first analysis, we want to check that there were no differences in language ability between the children assigned to the Italian and English card groups. We do this using between-subjects t-tests of their parents’ CDI ratings. Bayesian t-tests were introduced in the Evidence worksheet.

# compare children's language ability using CDI
library(BayesFactor, quietly=TRUE)


Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack

************
Welcome to BayesFactor 0.9.12-4.2. If you have questions, please contact Richard Morey (richarddmorey@gmail.com).

Type BFManual() to open the manual.
************

cdi_u_bf <- ttestBF(formula=cdi_u ~ cards, data = task_by_subj)

Warning: data coerced from tibble to data frame

cdi_u_bf

Bayes factor analysis
--------------
[1] Alt., r=0.707 : 0.404659 ±0%

Against denominator:
  Null, mu1-mu2 = 0 
---
Bayes factor type: BFindepSample, JZS

cdi_s_bf <- ttestBF(formula=cdi_s ~ cards, data = task_by_subj)

Warning: data coerced from tibble to data frame

cdi_s_bf

Bayes factor analysis
--------------
[1] Alt., r=0.707 : 0.4518936 ±0%

Against denominator:
  Null, mu1-mu2 = 0 
---
Bayes factor type: BFindepSample, JZS

Explanation of commands:

First we load the BayesFactor package. Next, we run a t-test which compares CDI ‘understands’ (cdi_u) for the two card sets. We another t-test which compares CDI ‘says’ (cdi_s) for the two card sets.

Explanation of output:

Here, we’re hoping to find evidence for the null hypothesis i.e. no differences in the means for the two groups. Our Bayes factors (CDI understands = 0.4; CDI says = 0.45) indicate that it’s 2-2.5 times more likely that there’s no difference, than there is a difference in language ability between the children tested using the Italian and English cards.

Comparing gender on WinG accuracy

We would also like to check that there were no gender differences on any of the tests. We’ll create a table of descriptive statistics with four rows corresponding to the WinG task, and columns containing acurracy means and standard deviations, for male and female children. This table is described in detail in the Better tables worksheet. It’s convenient to see differences between groups next to the descriptive statistics. We’ll do this by adding a column to our table. Each cell will contain the result of a t-test comparing male and female accurcy, for the WinG task in the table row.

We start by calculating means and standard deviations for each WinG task by gender.

# table of descriptives and t-tests for WinG by gender
task_by_subj_l <- task_by_subj %>%
  pivot_longer(cols = c(nc, np, pc, pp),
               names_to = 'task',
               values_to = 'correct')

descript <- task_by_subj_l %>%
  group_by(task, gender) %>%
  summarise(mean = mean(correct, na.rm = TRUE), sd = sd(correct, na.rm = TRUE))

`summarise()` regrouping output by 'task' (override with `.groups` argument)

Explanation of commands:

We convert task_by_subj to long format by using pivot_longer() on the correct score columns nc, np, pc and pp. We use this to calculate a mean and standard deviation by gender within task. When calculating these summary statistics, we need the argument na.rm = TRUE to ignore missing data for participants who were excluded from any of the tests. We round the results to 2 decimal places. We use pivot_wider() so that descriptives has a row for each task, and columns containing the mean and standard deviation by gender.

Our table of descriptive statistics now looks like this:

task	gender	mean	sd
nc	male	16.29	4.231
nc	female	17.64	2.203
np	male	10.43	3.952
np	female	12.09	3.239
pc	male	14.83	1.602
pc	female	16.2	1.814
pp	male	8.5	2.345
pp	female	8.4	1.955

Next we do some preparation for displaying our table in APA format.

descript_table <- descript %>%
  pivot_wider(names_from = gender, values_from = c(mean, sd))
descript_table <- descript_table %>% select(task, mean_female, sd_female, mean_male, sd_male) 
descript_table %>% pander()

task	mean_female	sd_female	mean_male	sd_male
nc	17.64	2.203	16.29	4.231
np	12.09	3.239	10.43	3.952
pc	16.2	1.814	14.83	1.602
pp	8.4	1.955	8.5	2.345

Explanation of commands:

We widen the table, using the pivot_wider command we used previously in the within-subject differences worksheet. We use select to order the columns so that the means are next to their associated standard deviations.

Now we run some Bayesian t-tests to test for gender differences on each of the tasks. Between subjects t-tests were covered in the Evidence worksheet.

# t-test for accuracy by gender
t_gender <- function(df, group) {
  df <- drop_na(df) # remove data for excluded subjects
  bf <- ttestBF(formula=correct ~ gender, data = data.frame(df))
  extractBF(bf)
}

Explanation of commands:

We’ll be doing the same t-test four times, once for the data from each task. Like the code for applying exclusion criteria that you saw earlier, this makes it a good candidate for a function. Line 1 begins the definition of the t_gender() function. The function will run the t-test using the data it assigns to the variable df. We won’t use the group argument, but we’ll explain why it’s there when we call the function. Line 2 uses drop_na() to remove rows for participants who were excluded for the task in df. Recall that we set these values to NA. Line 3 uses ttestBF() to run a Bayesian t-test which compares task accuracy (correct) for each gender. Line 4 returns the Bayes factor.

Now we can use the function to run the four t-tests.

gender_bf <- task_by_subj_l %>%
  group_by(task) %>%
  group_modify(t_gender) %>%
  select(task, bf)
gender_bf %>% pander()

task	bf
nc	0.4897
np	0.6805
pc	0.9101
pp	0.4377

Line 3 calls t_gender() for each of the groups created by group_by(task) in line 2. group_modify() calls a function for each group of a grouped data frame. It always passes the function two values. The first is a data frame containing the subset of rows for the group. The second is the name of the group. The function must accept two arguments, which explains why we defined group in t_gender(), even though we didn’t use it. The data supplied by group_modify() is replaced with the data frame returned by the function it calls. The column which defines the groups is preserved. So here, we’ve use t_gender(), to replace the data for each test, with a data frame containing the Bayes factor for the associated t-test.

Finally, we join this to our descriptives data frame, using the task values to match up the rows.

descript_table <- full_join(descript_table, gender_bf, by = 'task')
descript_table %>% pander()

task	mean_female	sd_female	mean_male	sd_male	bf
nc	17.64	2.203	16.29	4.231	0.4897
np	12.09	3.239	10.43	3.952	0.6805
pc	16.2	1.814	14.83	1.602	0.9101
pp	8.4	1.955	8.5	2.345	0.4377

We can now tidy the table up ready for printing.

task_names <- c(
  nc = 'Noun Comprehension',
  np = 'Noun Production',
  pc = 'Predicate Comprehension',
  pp = 'Predicate Production'  
)
descript$task <- descript$task %>% recode(!!!task_names)

colnames(descript_table) <-
  c("Task", "Female (M)", "Female (SD)", "Male (M)", "Male (SD)", "BF")

Explanation of commands:

Lines 1-7 use the approach for renaming factor levels that was introduced in the cleaning up questionnaire data worksheet. The last 2 lines give the columns more meaningful headings.

Now we can print our finished table.

library(kableExtra)
descript_table %>% kable(digits=2) %>% kable_styling()

Task	Female (M)	Female (SD)	Male (M)	Male (SD)	BF
nc	17.64	2.20	16.29	4.23	0.49
np	12.09	3.24	10.43	3.95	0.68
pc	16.20	1.81	14.83	1.60	0.91
pp	8.40	1.96	8.50	2.35	0.44

Explanation of commands:

We load the kableExtra package and pipe our data into kable(). The digits=2 part ensures that every number is reported to two decimal places. kable_styling() prints the table to the Viewer window in RStudio. This table could be included in a reporty by copy-pasting into a word processor, and then styled according to APA guidelines.

Explanation of output:

The Bayes factors mean that it’s about twice as likely that there are no gender differences than that there are, for noun comprehension and predicate production. For noun production, it’s about 1.5 times as like that there isn’t a gender difference, than that there is. For predicate comprehension, the evidence for or against a gender difference is about equal.

Exploring correlations between CDI and WinG

We would also like to know if parents’ ratings of their child’s ability using the CDI was related to their child’s performance on the WinG tasks. We start by looking at correlations between CDI ‘understands’ comprehension scores on the noun and predicate tests.

# correlations between CDI and WinG
cdi_u_nc_cor <- cor(method = 'spearman', task_by_subj$cdi_u,
                    task_by_subj$nc, use="complete.obs")
cdi_u_nc_cor

[1] 0.06225014

cdi_u_nc_bf <- correlationBF(task_by_subj$cdi_u, task_by_subj$nc)

Ignored 1 rows containing missing observations.

cdi_u_nc_bf

Bayes factor analysis
--------------
[1] Alt., r=0.333 : 0.6982793 ±0%

Against denominator:
  Null, rho = 0 
---
Bayes factor type: BFcorrelation, Jeffreys-beta*

cdi_u_pc_cor <- cor(method = 'spearman', task_by_subj$cdi_u,
                    task_by_subj$pc, use="complete.obs")
cdi_u_pc_cor

[1] -0.100163

cdi_u_pc_bf <- correlationBF(task_by_subj$cdi_u, task_by_subj$pc)

Ignored 3 rows containing missing observations.

cdi_u_pc_bf

Bayes factor analysis
--------------
[1] Alt., r=0.333 : 0.5543404 ±0%

Against denominator:
  Null, rho = 0 
---
Bayes factor type: BFcorrelation, Jeffreys-beta*

Explanation of commands:

We use a Spearman correlation to test the relationship between task_by_subj$cdi_u and task_by_subj$nc (noun comprehension). This was introduced in the More on relationships, part 2 worksheet. The option use="complete.obs" deletes any cases with a value of NA. (Remember that we assigned the value NA to cells for subjects who were excluded from a test.) We also calculate a Bayes Factor to test the reliability of the relationship. We use similar commands to test the relationship between cdi_u and predicate comprehension (pc).

Explanation of output:

There doesn’t appear to be a relationship between parents’ CDI ‘understands’ scores and children’s noun comprehension (rs = 0.06), although there is weak evidence against this conclusion (BF = 0.7). There seems to be a weak negative correlation between parents’ CDI ‘understands’ scores and children’s predicate comprehension (rs = -0.1), however, the Bayes factor of 0.55 means it’s almost twice as likely that there isn’t a relationship between these two variables, than that there is.

Now we look at correlations between parents’ ratings of their child’s ability to say words (CDI says), and the childrens’ noun and predicate production scores.

cdi_s_np_cor <- cor(method = 'spearman', task_by_subj$cdi_s,
                    task_by_subj$np, use="complete.obs")
cdi_s_np_cor

[1] 0.5673528

cdi_s_np_bf <- correlationBF(task_by_subj$cdi_s, task_by_subj$np)

Ignored 1 rows containing missing observations.

cdi_s_np_bf

Bayes factor analysis
--------------
[1] Alt., r=0.333 : 5.094892 ±0%

Against denominator:
  Null, rho = 0 
---
Bayes factor type: BFcorrelation, Jeffreys-beta*

cdi_s_pp_cor <- cor(method = 'spearman', task_by_subj$cdi_s,
                    task_by_subj$pp, use="complete.obs")
cdi_s_pp_cor

[1] -0.02759308

cdi_s_pp_bf <- correlationBF(task_by_subj$cdi_s, task_by_subj$pp)

Ignored 3 rows containing missing observations.

cdi_s_pp_bf

Bayes factor analysis
--------------
[1] Alt., r=0.333 : 0.5874633 ±0%

Against denominator:
  Null, rho = 0 
---
Bayes factor type: BFcorrelation, Jeffreys-beta*

Explanation of commands:

These calculations are identical to those above, except we are looking at the evidence for relationships between cdi_u and task_by_subj$np (noun production), and task_by_subj$pp (predicate production).

Explanation of output:

There is evidence of a strong, positive correlation between parents’ CDI ‘says’ scores and children’s noun production (rs = 0.57, BF = 5.09). There doesn’t appear to be a relationship between CDI ‘says’ and children’s predicate production (rs = -0.03, BF = 0.59).

In summary, noun production was the only task in which parents’ ratings of their childrens’ ability matched the childrens’ accuracy.

Plotting WinG accuracy by card set

We’re now ready to test our main hypothesis, which predicts that there will be a difference WinG task scores, depending on which set of cards the children were tested with. We’ll start by creating plots to show the distribution of scores for the two card sets on the WinG tasks.

# plot WinG accuracy by card set
task_by_subj_l$task <- task_by_subj_l$task %>% recode(!!!task_names)
source('themeapa.R')
library(see)
task_by_subj_l %>% ggplot(aes(x = task, y = correct, fill = cards)) +
  geom_violinhalf(position = position_identity(), alpha=0.7, size=0) +
  scale_fill_grey() +
  theme_APA +
  theme(axis.text = element_text(size = 10)) +
  xlab('WinG Task') + ylab('Accuracy (max = 20)')

Warning: Removed 8 rows containing non-finite values (stat_ydensity).

Explanation of commands:

Line 2 recodes the task factor levels, to make them more meaningful on the plot’s x axis. Line 3 loads the see package which provides the half_violin() function. Line 4 defines the x axis of our plot to be the WinG tasks, the y axis to be task accuracy (1-20), and to use the cards factor for the fill colour. Line 5 creates a “half violin” plot. As the name suggests, this shows one half of a violin plot. position = position_identity() plots the two distributions on top of each other, making it easy to see how much they overlap. alpha=0.7 changes to transparency, again to help us see the overlapping area. size=0 removes the outline around the distributions. Line 6 uses a grey palette for filling in the two distributions. Line 7 gives our axes meaningful labels.

Explanation of output:

This plot gives a visual indication of whether there were differences between the Italian and English cards on each of the tests. Given the extensive overlap in scores between the card sets, this seems unlikely. The plot also shows that the data were slightly skewed on the noun tasks due to some low scores.

Comparing WinG accuracy by card set

Although the plot suggests that the predicate data was normally distributed, given the small sample sizes, a non-parametric tests was considered most appropriate for analysing this data. We’ll use Mann-Whitney U tests to compare each task score for the Italian and English cards. The Mann-Whitney U test is explained in more detail in the Traditional non-parametric tests worksheet.

We start by creating some summary statistics.

# compare WinG accuracy by card set
task_by_subj_rank <- task_by_subj_l %>%
  group_by(task) %>%
  mutate(rank = rank(correct))

task_summary <- task_by_subj_rank %>%
  group_by(task, cards) %>%
  drop_na('correct') %>%
  summarise(n = n(),
            median = median(correct),
            mean_rank = mean(rank),
            sum_rank = sum(rank))

`summarise()` regrouping output by 'task' (override with `.groups` argument)

task_summary

# A tibble: 8 x 6
# Groups:   task [4]
  task                    cards       n median mean_rank sum_rank
  <fct>                   <fct>   <int>  <dbl>     <dbl>    <dbl>
1 Noun Comprehension      english     9   19       11.4     103  
2 Noun Comprehension      italian     9   17        7.56     68  
3 Noun Production         english     9   13       10.3      92.5
4 Noun Production         italian     9   11        8.72     78.5
5 Predicate Comprehension english     8   15.5      7.12     57  
6 Predicate Comprehension italian     8   16.5      9.88     79  
7 Predicate Production    english     8    9        9        72  
8 Predicate Production    italian     8    8        8        64

Explanation of commands:

Lines 1-3 rank the accuracy scores for each of the four tests. Line 6 groups the data by cards within task. Line 7 removes any rows where correct contains the value NA. These were the children we excluded earlier. Lines 8-11 calculate, for each group, the number of children, the median, and the mean and sum of the ranked items.

Now we run Mann-Whitney tests to make a direct comparison between the Italian and English cards for each WinG task.

# independent 2-group Mann-Whitney U Test
mann_whitney <- function(df, group) {
  df <- df %>% drop_na('correct') # remove excluded subjects
  wilcox.test(correct ~ cards, df) %>%
    with(data.frame(U = statistic, p = round(p.value, 2)))
}

wing_mann_whitney <- task_by_subj_l %>%
  group_by(task) %>%
  group_modify(mann_whitney)
wing_mann_whitney

# A tibble: 4 x 3
# Groups:   task [4]
  task                        U     p
  <fct>                   <dbl> <dbl>
1 Noun Comprehension       58    0.13
2 Noun Production          47.5  0.56
3 Predicate Comprehension  21    0.26
4 Predicate Production     36    0.71

Explanation of commands:

The Mann-Whitney U test is a non-parametric equivalent of a t-test. It’s explained in detail in the Traditional non-parametric tests worksheet. Lines 2-6 define a function called mann_whitney to run a Mann-Whitney test comparing the correct column against the factor cards. The function returns the values for U and p as a data frame.

Lines 8-10 group our data by task, then use group_modify(mann_whitney) to replace the data for each group with the Mann-Whitney test results. Line 11 displays the results.

Explanation of output:

As this is a traditional statistical test, the p-value indicates whether there was a significant difference between the cards on any of the tasks. All p values were > 0.05, suggesting there were no differences, contrary to our hypothesis. (Note that we’ve removed the cannot compute exact p-value with ties warnings to make the output easier to read.)

Comparing semantically related production errors by card set

We would also like to know if there were any differences for the semantically related errors between the two sets of cards.

Again, we start with a plot.

# compare semantically related production errors by card set
semantically_related <- task_by_subj %>%
  select(subj, cards, related_np, related_pp) %>%
  rename(np = related_np, pp = related_pp) %>%
  pivot_longer(cols = c(np, pp),
               names_to = 'task',
               values_to = 'correct')
semantically_related$task <- semantically_related$task %>% recode(!!!task_names)

ggplot(semantically_related, aes(x = task, y = correct, fill = cards)) +
  geom_violinhalf(position = position_identity(), alpha=0.7, size=0) +
  scale_fill_grey() +
  theme_APA +
  theme(axis.text = element_text(size = 10)) +
  xlab('WinG Task') + ylab('Semantically related errors (max = 20)')

Warning: Removed 4 rows containing non-finite values (stat_ydensity).

Explanation of commands:

Recall that ‘semantically related’ only applies to the production tasks, so we select just the data for these tasks in line 2. Line 3 renames the task columns. This will allow us to recode the factors using task_names that we defined earlier. Lines 4-6 convert the semantically related error scores from wide to long format. Notice that we use values_to to move these scores to a column named correct. This will allow us to reuse the mann_whitney function later. Line 7 gives the factor levels meaningful names. Lines 9-12 generate half violin plots for the production tasks, using the same approach we took for the task accuracy plot.

Explanation of output:

The extensive overlap in the distribution of semantically scores suggests there were no differences between the Italian and English cards for semantically related errors on the production tasks.

We generate some summary statistics, and run some Mann-Whitney U tests to test this.

nts_rank <- semantically_related %>%
  group_by(task) %>%
  mutate(rank = rank(correct))

nts_summary <- nts_rank %>%
  group_by(task, cards) %>%
  drop_na('correct') %>%
  summarise(n = n(),
            median = median(correct),
            mean_rank = mean(rank),
            sum_rank = sum(rank))
nts_summary

# A tibble: 4 x 6
# Groups:   task [2]
  task                 cards       n median mean_rank sum_rank
  <fct>                <fct>   <int>  <dbl>     <dbl>    <dbl>
1 Noun Production      english     9      3     11.6       104
2 Noun Production      italian     9      2      7.44       67
3 Predicate Production english     8      2      8.12       65
4 Predicate Production italian     8      2      8.88       71

production_mann_whitney <- semantically_related %>%
  group_by(task) %>%
  group_modify(mann_whitney)
production_mann_whitney

# A tibble: 2 x 3
# Groups:   task [2]
  task                     U     p
  <fct>                <dbl> <dbl>
1 Noun Production         59  0.1 
2 Predicate Production    29  0.79

Explanation of commands:

These commands are virtually identical to those we used to for the accuracy scores.

Explanation of output:

The group difference for noun production was larger than for predicate production. However, the p values for both tests are > 0.05, which is evidence against a difference between the card sets for semantically related errors on the production tasks. Again, this is contrary to our hypothesis.

Conclusion

In summary, these data suggest that WinG scores should be comparable, regardless of whether the English or Italian picture cards are used. It also shows that adults’ ratings of how the pictures map to the underlying concepts are quite different to children’s.

This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.

Reliability of Targets in a Picture Naming Task

Allegra Cattani, Adele Conn, Paul Sharpe and Andy Wills

Before you start

Contents

Introduction

Loading data

Preprocessing demographics data

Preprocessing WinG noun comprehension data

Applying exclusion criteria

Defining a function

Comparing parents’ ratings of their child’s language ability

Comparing gender on WinG accuracy

Exploring correlations between CDI and WinG

Plotting WinG accuracy by card set

Comparing WinG accuracy by card set

Conclusion

subj	correct	related
1	12	0
2	18	0
3	18	0
4	17	0
5	17	0
6	18	0
7	17	0
8	18	0
9	20	0
10	7	0
11	19	0
12	19	0
13	19	0
14	17	0
15	20	0
16	17	0
17	19	0
19	16	0

subj	correct	related
1	12	0
2	18	0
3	18	0
4	17	0
5	17	0
6	18	0
7	17	0
8	18	0
9	20	0
10	7	0
11	19	0
12	19	0
13	19	0
14	17	0
15	20	0
16	17	0
17	19	0
19	16	0

subj	correct	related
1	12	0
2	18	0
3	18	0
4	17	0
5	17	0
6	18	0
7	17	0
8	18	0
9	20	0
10	7	0
11	19	0
12	19	0
13	19	0
14	17	0
15	20	0
16	17	0
17	19	0
19	16	0