Before starting this exercise, you should have completed **all** the previous Absolute Beginners’, Part 1 workshop exercises. Each section below indicates which of the earlier worksheets are particularly relevant.

**Relevant worksheet:** Using RStudio projects, Exploring data

Download this CSV file, which contains all the data you need for this worksheet. Then, set up a new project on RStudio Server for this analysis, and upload your CSV to your project.

Now, load the *tidyverse* package, and load your data.

```
library(tidyverse)
friends <- read_csv("chi.csv")
```

Look at the data by clicking on it in the *Environment* tab in RStudio. Each row is one participant in an interview about friendships. Here’s what each of the columns in the data set contains:

Column | Description | Values |
---|---|---|
subj | Anonymous ID number of participant | a number |
age | Age of the participant | One of: “7 years”, “9 years”, “12 years”, “15 years” |
gender | Gender of the participant | One of: “male”, “female” |
culture | Culture of participant | One of: “China”, “East Germany”, “Iceland”, or “Russia” |
coded | How their interview response was coded | One of: “activity”, “feelings”, “helping”, “length”, “norms”, or “trust” |

This is a large dataset comprising over 700 participants of different ages, genders, and cultures. It is based on, but not identical to, real data on this topic analysed by Michaela (Gummerum et al., 2008). An R script was used to generate these data from Michaela’s more complex data set.

Let’s start by looking at how often each of the coded responses (i.e. *activity, feelings, helping, length, norms, and trust*) appears in the interviews. We could do this by hand, but it would be slow and error prone. Instead, we use the `table` command in R to do it for us:

`table(friends$coded)`

```
activity feelings helping length norms trust
255 96 133 192 53 55
```

R gives us a table, which reports how often each of the coded responses occurred in the data set. We can see that *activity* was used the most, *norms* the least. In fact, *activity* was used more than *feelings, norms, and trust* combined.

Here’s a step-by-step explanation of how the above command works. You’ll need this in a moment to calculate some frequency tables for yourself.

`table()` - This command counts how many times each thing occurs (in this case, how often each type of coded response occurs).

`friends$coded` - We need to tell `table()` where to find the data we are interested in. In this case, it’s the `coded` column of the `friends` *dataframe* that we loaded earlier. We tell R this by typing `friends$coded`. Yes, that’s `$`, the same symbol as we use to indicate US Dollars. However, it doesn’t mean “dollars” in R. It means column. So, `friends$coded` means the `coded` column of the `friends` dataframe.
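To see `table()` and `$` at work on something small enough to check by eye, here is a minimal sketch using a made-up data frame. The name `toy` and the values in it are invented for illustration; they are not part of the worksheet data.

```r
# A made-up data frame with four participants (hypothetical values).
toy <- data.frame(
  subj  = 1:4,
  coded = c("trust", "activity", "trust", "helping")
)

# toy$coded picks out the coded column of the toy data frame.
toy$coded

# table() counts how often each value occurs in that column.
toy_counts <- table(toy$coded)
toy_counts
```

Here “trust” appears twice, and “activity” and “helping” once each, which is what the table should report.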

Now produce frequency tables for each of the other *variables* in this *dataframe* (i.e. `age`, `gender`, and `culture`). You do this by changing the command `table(friends$coded)` so that it now refers to a different column in the `friends` dataframe. Re-read the explanation of the command above if you’re stuck.

**Enter the command for creating a frequency table for culture into your lab book.**

Do children’s ideas about friendship differ across cultures? We can use the `table` command to look at this, too. We use it to produce a *frequency table* for each of the different cultures in our sample, like this:

```
cont <- table(friends$culture, friends$coded)
cont
```

```
activity feelings helping length norms trust
China 47 33 52 28 28 8
East Germany 75 19 26 58 6 12
Iceland 78 17 13 62 6 20
Russia 55 27 42 44 13 15
```

Here’s an explanation of each part of that command:

`cont <-` - Store this table as `cont`, so we can use it later. The command `<-` stores the thing on its right in the thing on its left.

`table(rows, columns)` - The R command for producing tables. We replace the word `rows` with the name of the variable we want to appear on the rows of the table, and we replace the word `columns` with the name of the variable we want to appear in the columns of the table.

`friends$culture` - The `culture` column of the `friends` data frame. We’ve put this first in our `table` command, so `culture` appears as rows.

`friends$coded` - The `coded` column of the `friends` data frame. This appears second in our `table` command, so `coded` appears as columns.

`cont` - Lastly, we type `cont` on its own to display the contingency table in the Console (clicking on `cont` in the Environment tab in RStudio won’t work in this case).

R gives us a table, showing how many of each response were made in each culture. This is called a *contingency table*. The name *contingency table* comes from the word *contingent*, as in, for example “Getting your degree is *contingent* on passing your exams”. A contingency table gives the frequencies for one variable (e.g. the interview responses) *contingent* on another variable (e.g. the culture of the participants).
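The effect of argument order in `table(rows, columns)` is easy to check on a tiny example. The sketch below uses a made-up data frame (`toy`, with invented values) and builds the table both ways round.

```r
# Made-up data: six participants, two cultures, two response types.
toy <- data.frame(
  culture = c("China", "China", "China", "Iceland", "Iceland", "Iceland"),
  coded   = c("helping", "helping", "trust", "trust", "trust", "helping")
)

# First argument goes on the rows, second on the columns.
by_culture <- table(toy$culture, toy$coded)
by_culture

# Swapping the arguments swaps rows and columns; the counts are unchanged.
by_coded <- table(toy$coded, toy$culture)
by_coded
```

Both tables hold the same counts; only the layout differs.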

Close inspection of the contingency table reveals that, for example, the “helping” response is more common in China than in Iceland. The “activity” response is more common in Iceland than in Russia. So, it does look like children’s conceptions of friendship vary between cultures. Of course, not everyone in the same culture responded the same way but, overall, some types of response are more or less likely in some cultures than others.
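One way to make these comparisons easier is to convert each row of counts into percentages. The sketch below is an optional extra, not a required step of the worksheet. So that it runs without `chi.csv`, it rebuilds `cont` from the counts printed above; if you have already created `cont` with `table()`, you can skip that part and go straight to `prop.table()`.

```r
# Rebuild the contingency table from the counts printed above,
# so this sketch runs without chi.csv.
cont <- as.table(matrix(
  c(47, 33, 52, 28, 28,  8,
    75, 19, 26, 58,  6, 12,
    78, 17, 13, 62,  6, 20,
    55, 27, 42, 44, 13, 15),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("China", "East Germany", "Iceland", "Russia"),
    c("activity", "feelings", "helping", "length", "norms", "trust")
  )
))

# margin = 1 divides each count by its row total, so every row sums to 1.
row_props <- prop.table(cont, margin = 1)

# Show as percentages, rounded to one decimal place.
round(100 * row_props, 1)
```

Each row now shows the percentage of that culture’s responses falling into each category, so rows can be compared directly.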

Some people find it quite hard to notice these kinds of patterns in contingency tables, and the patterns are certainly harder to spot in a table than in a good visualization. The visualization we’re going to use here is called a *mosaic plot*. The command to do this in R is:

`mosaicplot(cont)`

It’s called a *mosaic* plot because it’s made up of *tiles*.

In the above example, the *width* of each tile represents the number of participants from each `culture`. We collected data from approximately the same number of people from each culture, so all tiles are approximately the same width.

The *height* of each tile is determined by the frequency of each of the responses (feelings, helping, etc.) within each culture – the more common a response within a particular culture, the taller the tile.

Looking at this mosaic plot, it’s visually obvious that “length” is a less common response in China than in other countries.
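The numbers behind the tile widths and heights can be worked out directly. This is a rough sketch of what the mosaic plot represents, not how `mosaicplot()` is implemented; `cont` is rebuilt from the counts printed earlier so that the code runs on its own.

```r
# Rebuild the contingency table from the counts printed earlier.
cont <- as.table(matrix(
  c(47, 33, 52, 28, 28,  8,
    75, 19, 26, 58,  6, 12,
    78, 17, 13, 62,  6, 20,
    55, 27, 42, 44, 13, 15),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("China", "East Germany", "Iceland", "Russia"),
    c("activity", "feelings", "helping", "length", "norms", "trust")
  )
))

# Tile widths: each culture's share of all participants.
widths <- rowSums(cont) / sum(cont)
widths

# Tile heights for "length": its share of responses within each culture.
length_share <- cont[, "length"] / rowSums(cont)
round(length_share, 2)
```

With these counts, each culture makes up a quarter of the sample, and the share of “length” responses is smallest for China.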

So, it looks like there’s some kind of relationship between culture and conceptions of friendship … but how good is the evidence that this is a real result, and not just some kind of fluke we can put down to chance? As we covered in the *Evidence* worksheet, the best way to answer this question is to calculate a Bayes Factor (BF). In R, we can calculate the Bayes Factor for a contingency table like this:

```
library(BayesFactor, quietly = TRUE)
contingencyTableBF(cont, fixedMargin = "rows", sampleType = "indepMulti")
```

```
Bayes factor analysis
--------------
[1] Non-indep. (a=1) : 107633530 ±0%
Against denominator:
Null, independence, a = 1
---
Bayes factor type: BFcontingencyTable, independent multinomial
```

The Bayes Factor is reported on the third line, towards the right. The Bayes Factor in this example is about 107.6 *million*. This means it’s more than 100 million times more likely that there is a relationship between culture and friendship concepts, than there isn’t.

Psychologists generally agree to believe the relationship is real if the Bayes Factor exceeds 3, and generally agree to believe the relationship is *not* real if the Bayes Factor is less than 0.33. So, in this example, we have very strong evidence for the existence of a relationship.
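The convention can be written as a small function. `interpret_bf` is a hypothetical helper made up for this sketch, not part of the *BayesFactor* package, and labelling values between the two thresholds as “inconclusive” is an assumption rather than something stated above.

```r
# Hypothetical helper: applies the conventional thresholds described above.
interpret_bf <- function(bf) {
  if (bf > 3) {
    "evidence for a relationship"
  } else if (bf < 0.33) {
    "evidence against a relationship"
  } else {
    # Assumption: anything between the thresholds is treated as inconclusive.
    "inconclusive"
  }
}

interpret_bf(107633530)  # the Bayes Factor from the output above
interpret_bf(0.2)
interpret_bf(1.5)
```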

If you’re curious about what the rest of the output means, see more on relationships.

The first line, `library(BayesFactor, quietly = TRUE)`, loads the *BayesFactor* package, which is a set of extra commands that allows R to calculate Bayes Factors.

`contingencyTableBF()` - The command for calculating a BF (Bayes Factor) for a contingency table.

`cont` - Our contingency table (we stored it in `cont` earlier on in this worksheet).

`fixedMargin = "rows", sampleType = "indepMulti"` - This tells R that the different groups in your sample (in this case, different cultures) appear as the `rows` of your contingency table. If you’d put them as the columns (e.g. if you’d used `table(friends$coded, friends$culture)`), then you would change this to `fixedMargin = "cols"`. For a more detailed explanation, see more on relationships.

There’s a long history in psychology of performing a *contingency-table chi-square* test to examine the level of evidence for a relationship. The results of such tests are widely misinterpreted by psychologists, but some still like to see them anyway. Here’s how to calculate one for these data:

`chisq.test(cont)`

```
Pearson's Chi-squared test
data: cont
X-squared = 89.169, df = 15, p-value = 1.417e-12
```

The key result here is the `p-value`. It’s important to emphasise that this *p value* is **not** the probability that the observed relationship is due to chance. As we covered in the *Evidence* worksheet, there is no way to explain this *p value* that is simple, useful, and accurate.

Nonetheless, the convention is that if the *p value* is less than 0.05, psychologists will generally believe you when you assert that the relationship is not due to chance. If the *p value* is greater than 0.05, they will generally be skeptical.

The *p value* in this example is very small, so has been reported in *standard form*, and is read as 1.417 x 10^{-12}. You would have been taught standard form in school but, as a reminder, 1.417 x 10^{-12} = .000000000001417. See this BBC bitesize revision guide on standard form if you need a bit more explanation than that.

The reported *p value* is less than .05 in this example, and so psychologists will generally believe your result is real.
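If you want to convince yourself what `1.417e-12` means, R can do the conversion for you. This is a small optional check, using only base R.

```r
p <- 1.417e-12               # standard form, as R prints and reads it

# The same number written out as a power of ten.
p_long <- 1.417 * 10^(-12)
all.equal(p, p_long)

# Written out in full, to 15 decimal places.
sprintf("%.15f", p)

# The comparison that matters for the convention described above.
p < 0.05
```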

In addition to the *p value*, psychologists will generally record at least two further numbers in their articles. The first is the chi-square value, written as `X-squared` in the above output, but as X^{2} in articles.

The second is the *degrees of freedom* (`df` in the above output). In this case, *degrees of freedom* relates to the size of the contingency table, and is the number of columns, minus one, multiplied by the number of rows, minus one (i.e. `(rows - 1) x (cols - 1)`). Here, the table has 4 rows and 6 columns, so the degrees of freedom are 3 x 5 = 15.
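The degrees of freedom can be checked by hand in R, and the individual numbers can be pulled out of the `chisq.test()` result by name. In this sketch, `cont` is rebuilt from the counts printed earlier so that the code runs without `chi.csv`; the names `df_by_hand` and `result` are just for illustration.

```r
# Rebuild the contingency table from the counts printed earlier.
cont <- as.table(matrix(
  c(47, 33, 52, 28, 28,  8,
    75, 19, 26, 58,  6, 12,
    78, 17, 13, 62,  6, 20,
    55, 27, 42, 44, 13, 15),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("China", "East Germany", "Iceland", "Russia"),
    c("activity", "feelings", "helping", "length", "norms", "trust")
  )
))

# Degrees of freedom by hand: (rows - 1) x (cols - 1).
df_by_hand <- (nrow(cont) - 1) * (ncol(cont) - 1)
df_by_hand

# chisq.test() returns a list; its parts can be extracted by name.
result <- chisq.test(cont)
result$parameter   # degrees of freedom
result$statistic   # the chi-square value
result$p.value     # the p value
```

The hand calculation and `result$parameter` should both give 15, matching the `df` in the output above.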

If you were writing up this analysis in a report, you would write something like:

*The coded friendship concepts occurred with different frequency across cultures, BF = 1.08 x 10^{8}, X^{2}(15) = 89.2, p < .001, see Table 1.*

“Table 1” would be the contingency table you’d produced with the `table` command.
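The rounded numbers in the report sentence above can be produced in R rather than by hand. This is one possible way of doing it, using base R functions; the variable names are just for illustration, and the values are copied from the outputs earlier in this worksheet.

```r
bf   <- 107633530   # Bayes Factor from the contingencyTableBF output
chi2 <- 89.169      # X-squared from the chisq.test output
p    <- 1.417e-12   # p value from the chisq.test output

# Bayes Factor to three significant figures.
signif(bf, 3)
format(bf, scientific = TRUE, digits = 3)

# Chi-square value to one decimal place.
round(chi2, 1)

# Very small p values are conventionally reported as p < .001.
p < .001
```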

As discussed in the *Evidence* worksheet, it is also important to report the method by which you calculated your Bayes Factor. So, somewhere in your report, you should say something like:

*Bayes Factors were calculated using the BayesFactor package (Morey & Rouder, 2018), within the R environment (R Core Team, 2018).*

You can get the references for these citations by typing `citation("BayesFactor")` and `citation()`.

Each step in this exercise can be completed by slightly modifying a command you have already used.

- Produce a contingency table that shows the relationship between gender and concepts of friendship in this data set. Do this by modifying `cont <- table(friends$culture, friends$coded)` appropriately.

If your modified command still uses `cont`, the commands you used before should now work without having to modify them:

- Produce a mosaic plot from this contingency table.

- Calculate the Bayes Factor for the relationship. **Enter your Bayes Factor into your lab book.**

- Perform a contingency chi-square test.

When you write up an experiment, you often need to provide some summary information about the sample, including the exact number of participants, and the gender balance. R makes it easy to work these things out, as this worksheet shows: sample characteristics.

For more detailed information on the analyses covered in this worksheet, see more on relationships.

This material is distributed under a Creative Commons licence. CC-BY-SA 4.0. It is part of Research Methods in R, by Andy Wills