Contents

Before you start…

Before starting this exercise, you should have completed all the previous Absolute Beginners’, Part 1 workshop exercises. Each section below indicates which of the earlier worksheets are particularly relevant.

Getting the data into R

Relevant worksheet: Introduction to RStudio, Exploring data

Download this CSV file, which contains the all the data you need for this worksheet. Then, create or open an appropriate project on RStudio Server for this analysis (Plymouth University students: use the project ‘psyc414’ created in the inter-rater reliability worksheet), upload your CSV to your project, and create a new R script called chi.R.

Now, add these commands to your script and run them; the will load the tidyverse package, and load your data.

library(tidyverse)
friends <- read_csv("chi.csv")

Look at the data by clicking on it in the Environment tab in RStudio. Each row is one participant in an interview about friendships. Here’s what each of the columns in the data set contain:

Column Description Values
subj Anonymous ID number of participant a number
age Age of the participant One of: “7 years”, “9 years”, “12 years”, “15 years”
gender Gender of the participant One of: “male”, “female”
culture Culture of participant One of: “China”, “East Germany”, “Iceland”, or “Russia”
coded How their interview response was coded One of “activity”, “feelings”, “helping”, “length”, “norms”, or “trust”

This is a large dataset comprising over 700 participants of different ages, genders, and cultures. It is based on, but not identical to, real data on this topic analysed by Michaela (Gummerum et al., 2008). An R script was used to generate these data from Michaela’s more complex data set.

Frequency tables

Let’s start by looking at how often each of the coded responses (i.e. actitivites, feelings, helping, length, norms, and trust) appear in the interviews. We could do this by hand, but it would be slow and error prone. Instead, we use the table command in R to do it for us.

Add this command to your script and run it (CTRL+ENTER):

table(friends$coded)

activity feelings  helping   length    norms    trust 
     255       96      133      192       53       55 

R gives us a table, which reports how often each of the coded resposnes occurred in the data set. We can see that activity was used the most, norms the least. In fact, activity was used more than feelings, norms, and trust combined.

Explanation of command

Here’s a step-by-step explanation of how the above command works. You’ll need this in a moment to calculate some frequency tables for yourself.

  1. table() - This command counts how many times each thing occurs (in this case, how often each type of coded response occurs).

  2. friends$coded - We need to tell table() where to find the data we are interested in. In this case, it’s the coded column of the friends dataframe that we loaded earlier. We tell R this by typing friends$coded. Yes, that’s $, the same symbol as we use to indicate US Dollars. However, it doesn’t mean “dollars” in R. It means column. So, friends$coded means the coded column of the friends dataframe.

Exercise 1

Now produce frequency tables for each of the other variables in this dataframe (i.e. age, gender, and culture). You do this by changing the command table(friends$coded) so that it now refers to a different column in the friends dataframe. Re-read the above Explanation of command section if you’re stuck.

Enter the command for creating a frequency table for culture into your script, and also into your lab book.

Contingency tables

Do childrens’ ideas about friendship differ across cultures? We can use the table command to look at this, too. We use it to produce a frequency table for each of the different cultures in our sample, like this:

Add the following commands to your script and run them:

cont <- table(friends$culture, friends$coded)
cont
              
               activity feelings helping length norms trust
  China              47       33      52     28    28     8
  East Germany       75       19      26     58     6    12
  Iceland            78       17      13     62     6    20
  Russia             55       27      42     44    13    15

Explanation of command

Here’s an explanation of each part of that command:

  1. cont <- Store this table as cont, so we can use it later. The command <- stores the thing on its right in the thing on its left.

  2. table(rows, columns) - The R command for producing tables. We replace the word rows with the name of the variable we want to appear on the rows of the table, and we replace the word columns with the name of the variable we want to appear in the columns of the table.

  3. friends$culture - The culture column of the friends data frame. We’ve put this first in our table command, so culture appears as rows.

  4. friends$coded - The coded column of the friends data frame. This appears second in our table command, so coded appears as columns.

  5. cont - Lastly, we type cont on its own to display the contingency table in the Console (clicking on cont in the Environment tab in RStudio won’t work in this case).

Explanation of output

R gives us a table, showing how many of each response were made in each culture. This is called a contingency table. The name contingency table comes from the word contingent, as in, for example “Getting your degree is contingent on passing your exams”. A contingency table gives the frequencies for one variable (e.g. the interview responses) contingent on another variable (e.g. the culture of the participants).

Close inspection of the contingency table reveals that, for example, the “helping” response is more common in China than in Iceland. The “activity” response is more common in Iceland than in Russia. So, it does look like childrens’ conceptions of friendship vary between cultures. Of course, not everyone in the same culture responded the same way but, overall, some types of response are more or less likely in some cultures than others.

Mosaic plots

Some people find it quite hard to notice these kinds of patterns in contingency tables, and the patterns are certainly harder to spot in a table than in a good visualization. The visualization we’re going to use here is called a mosaic plot. The command to do this in R is as follows:

Add the following command to your script and run it:

mosaicplot(cont)

Explanation of output

It’s called a mosaic plot because it’s made up of tiles.

In the above example, the width of each tile represents the number of participants from each culture. We collected data from approximately the same number of people from each culture, so all tiles are approximately the same width.

The height of each tile is determined by the frequency of each of the responses (feelings, helping, etc.) within each culture – the more common a response within a particular culture, the taller the tile.

Looking at this mosaic plot, it’s visually obvious that “length” is a less common response in China than in other countries.

Evidence

So, it looks like there’s some kind of relationship between culture and conceptions of friendship … but how good is the evidence that this is a real result, and not just some kind of fluke we can put down to chance? As we covered in the Evidence worksheet, the best way to answer this question is to calculate a Bayes Factor (BF). In R, we can calculate the Bayes Factor for a contingency table like this:

Add the following commands to your script and run them:

library(BayesFactor, quietly = TRUE)
contingencyTableBF(cont, fixedMargin = "rows", sampleType = "indepMulti")
Bayes factor analysis
--------------
[1] Non-indep. (a=1) : 107633530 ±0%

Against denominator:
  Null, independence, a = 1 
---
Bayes factor type: BFcontingencyTable, independent multinomial

The Bayes Factor is reported on the third line, towards the right. The Bayes Factor in this example is about 107.6 million. This means it’s more than 100 million times more likely that there is a relationship between culture and friendship concepts, than there isn’t.

Psychologists generally agree to believe the relationship is real if the Bayes Factor exceeds 3, and generally agree to believe the relationship is not real if the Bayes Factor is less than 0.33. So, in this example, we have very strong evidence for the existence of a relationship.

If you’re curious about what the rest of the output means, see more on relationships.

Explanation of command

  1. The first line, library(BayesFactor, quietly = TRUE) loads the BayesFactor package, which is a set of extra commands that allows R to calculate Bayes Factors.

  2. contingencyTableBF() - The command for calculating a BF (Bayes Factor) for a contingency table.

  3. cont - Our contingency table (we stored it in cont earlier on in this worksheet).

  4. fixedMargin = "rows", sampleType = "indepMulti" - This tells R that the different groups in your sample (in this case, different cultures) appear as the rows of your contingency table. If you’d put them as the columns (e.g. if you’d used table(friends$coded, friends$culture) then you would change this to fixedMargin = "cols". For a more detailed explanation, see more on relationships.

Traditional analysis

There’s a long history in psychology of performing a contingency-table chi-square test to examine the level of evidence for a relationship. The results of such tests are widely misinterpreted by psychologists, but some still like to see them anyway. Here’s how to calculate one for these data:

Add the following command to your script and run it:

chisq.test(cont)

    Pearson's Chi-squared test

data:  cont
X-squared = 89.169, df = 15, p-value = 1.417e-12

The key result here is the p-value. It’s important to emphasise that this p value is not the probability that the observed relationship is due to chance. As we covered in the Evidence worksheet, there is no way to explain this p value that is simple, useful, and accurate.

Nonetheless, the convention is that if the p value is less than 0.05, psychologists will generally believe you when you assert that the relationship is not due to chance. If the p value is greater than 0.05, they will generally be skeptical.

The p value in this example is very small, so has been reported in standard form, and is read as 1.417 x 10-12. You would have been taught standard notation in school but, as a reminder, 1.417 x 10-12 = .000000000001417. See this BBC bitesize revision guide on standard form if you need a bit more explanation than that.

The reported p value is less than .05 in this example, and so psychologists will generally believe your result is real.

Reporting your results

In addition to the p value, psychologists will generally record at least two further numbers in their articles. The first is the chi-square value, written as X-squared in the above output, but as X2 in articles.

The second is the degrees of freedom (df in the above output). In this case, degrees of freedom relates to the size of the contingency table, and is the number of columns, minus one, multiplied by the number of rows, minus one (i.e. (rows - 1) x (cols -1)).

In you were writing up this analysis in a report, you would write something like:

The coded friendship concepts occurred with different frequency across cultures, BF = 1.08 x 108, X2(15) = 89.2, p < .001, see Table 1.

“Table 1” would be the contingency table you’d produced with the table command.

As discussed in the Evidence worksheet, it is also important to report the method by which you calculated your Bayes Factor. So, somewhere in you report, you should say something like:

Bayes Factors were calculated using the BayesFactor package (Morey & Rouder, 2018), within the R environment (R Core Team, 2021).

You can get the references for these citations by typing citation("BayesFactor") and citation().

Exercise 2

Each step in this exercise can be completed by slightly modifying a command you have already used.

Add these modified commands to your script and run them.

Here are the things you should do:

  1. Produce a contingency table that shows the relationship between gender and concepts of friendship in this data set. Do this by modifying cont <- table(friends$culture, friends$coded) appropriately.

If your modified command still uses cont, the commands you used before should now work without having to modify them:

  1. Produce a mosaic plot from this contingency table.

  2. Calculate the Bayes Factor for the relationship. Enter your Bayes Factor into your lab book.

  3. Peform a contingency chi-square test.

Further reading

When you write up an experiment, you often need to provide some summary information about the sample, including the exact number of participants, and the gender balance. R makes it easy to work these things out, as this worksheet shows: sample characteristics.

For more detailed information on the analyses covered in this worksheet, see more on relationships.


This material is distributed under a Creative Commons licence. CC-BY-SA 4.0. It is part of Research Methods in R, by Andy Wills