Evidence (briefly), part 1

Before you start…

Before starting this worksheet, you should have had a brief introduction to using RStudio. You should also have also completed the worksheets Exploring Data and Group Differences. If not, take a look these earlier worksheets before continuing.

If you have completed those worksheets, then you’ll have set up an R project, and you’ll have a script in it that looks like this:

# Exploring data (briefly)
# Load package
library(tidyverse)
# Load data into 'cpsdata'
cpsdata <- read_csv("cps2.csv")
# Display mean income
cpsdata %>% summarise(mean(income))
# Calculate mean hours per week
cpsdata %>% summarise(mean(hours, na.rm = TRUE))

# Group differences (briefly)
# Group by sex, display mean income
cpsdata %>% group_by(sex) %>% summarise(mean(income))
# Display density plot of income, by sex
cpsdata %>% ggplot(aes(income, colour = factor(sex))) + geom_density(aes(y = ..scaled..)) 
# Filter people with income < $150K into 'cpslow'
cpslow <- cpsdata %>% filter(income < 150000)
# Display density plot of incomes below $150K, by sex
cpslow %>% ggplot(aes(income, colour = factor(sex))) + geom_density(aes(y = ..scaled..)) 
# EXERCISE
# Group by 'native', display mean income below $150K
cpslow %>% group_by(native) %>% summarise(mean(income))
# Display density plot of incomes below $150K, by 'native'
cpslow %>% ggplot(aes(income, colour = factor(native))) + geom_density(aes(y = ..scaled..))

Evidence
Bayesian t-test
Traditional t-test
Exercise

Evidence

We saw in the last worksheet that in our sample of 10,000 US residents men, on average, earned more than women. We also saw that there was a lot of variability in our sample, and that the range of incomes earned by men largely overlapped the range of incomes earned by women. Obviously, there are a lot more than 10,000 people in the US. So, how likely is it that our sample is representative of the US population? To put that another way, how strong is the evidence, on the basis of our data, that US men do in fact earn more than US women? Basically, should we believe what our data seem to be telling us?

Comparing hypotheses

The standard way to answer this sort of question is to come up with two hypotheses and work out which is the more likely. The two hypotheses we normally compare are:

Hypothesis Zero (aka. ‘null hypothesis’). There is no difference in the means. In our example, the hypothesis is that the mean income for men and women in the US is exactly the same.
Hypothesis One (aka. ‘experimental hypothesis’). There is a difference in the means. In our experiment, the hypothesis is that the mean income for men and women in the US is not exactly the same.

Now we have our hypotheses, we use the data to work out which hypothesis is the more likely. The answer we end up with is called a Bayes Factor. A Bayes Factor is a number that tells us how much more likely Hypothesis One is than Hypothesis Zero.

For example, a Bayes Factor of 10 means it’s ten times more likely there is a difference (Hypothesis One) than there isn’t (Hypothesis Zero).

A Bayes Factor less than 1 is also informative. For example, a Bayes Factor of 0.1 (1/10) tells you it’s ten times more likely that there isn’t a difference than there is.

All the above assumes that, before you collected the data, you thought the presence or absence of a difference were equally likely outcomes. If that is not the case, see more on Bayes Factors.

Psychologists love a “line in the sand”, so a convention has emerged that we believe there is a difference if the Bayes Factor is greater than 3, and believe there isn’t a difference if the Bayes Factor is less than 0.33. We sometimes describe these lines in the sand as “substantial evidence for a difference” (BF > 3) and “substantial evidence for the null” (BF < 0.33).

Others, like Kass & Raftery (1993), have provided some additional guidelines on how to interpret Bayes Factors. It is a more detailed guide than a simple “line in a sand” and can prove very useful when comparing multiple studies. The following comes from Andraszewicz et al. (2014):

Bayes Factors	Interpretation
A rule-of-thumb interpretation of Bayes Factors
> 100	Extreme evidence for the alternative hypothesis
30 - 100	Very strong evidence for the alternative hypothesis
10 - 30	Strong evidence for the alternative hypothesis
3 - 10	Moderate evidence for the alternative hypothesis
1 - 3	Anecdotal evidence for the alternative hypothesis
0.33 - 1	Anecdotal evidence for the null hypothesis
0.10 – 0.33	Moderate evidence for the null hypothesis
0.03 – 0.10	Strong evidence for the null hypothesis
0.01 – 0.03	Very strong evidence for the null hypothesis
< 0.01	Extreme evidence for the null hypothesis
Note that this is BF₁₀, which means that the larger the Bayes Factor is, the more evidence we have for the alternative hypothesis.

Bayesian t-test

The easiest way to calculate a Bayes Factor in R is to use something called a between-subjects Bayesian t-test. This test looks at how big the difference in means between groups is, how much the variation in the two groups overlaps, and how big the sample is. It then does some calculations, based on some assumptions about how large real differences tend to be in psychology, and works out a Bayes Factor. The exact nature of that calculation is largely a matter for mathematicians and methods experts; other psychologists generally take the answer on trust.

We start by loading the BayesFactor package. Recall, that you load a package using the library command.

# Evidence (briefly), part 1
# Load BayesFactor package
library(BayesFactor, quietly = TRUE)

The quietly = TRUE bit is there because this package is very chatty when it loads, which gets a bit annoying after a while. Telling it to load quietly helps a bit, although you’ll still get a welcome message (not shown here).

If you get an error here, please see common errors.

The command to calculate a Bayes Factor for our gender pay gap example is:

# Calculate Bayesian t-test for effect of 'sex', on 'income'
ttestBF(formula = income ~ sex, data = data.frame(cpsdata))

Bayes factor analysis
--------------
[1] Alt., r=0.707 : 18.25138 ±0%

Against denominator:
  Null, mu1-mu2 = 0 
---
Bayes factor type: BFindepSample, JZS

The Bayes Factor is reported on the third line, towards the right. Our Bayes Factor is about 18.25. This means it’s about 18 times more likely that there is a gender pay gap in the US population than that there isn’t.

If you’re curious about what the rest of the output means, see more on Bayes Factors.

Explanation of command

The ttestBF() (short for “Bayes Factor t-test”) command has two components, separated by a comma:

formula = - Here we tell R what we want to analyse (the income column of our data frame), and which group each income belongs to (which is found in the sex column). The tilde, ~ means “as a function of”. So income ~ sex means look at income as a function of biological sex.
data = - Here we tell R which data frame to use; in our case cpsdata. Due to a limitation of the BayesFactor package, we have to specifically tell it to treat our data as a data frame (hence data.frame(cpsdata) rather than just cpsdata).

Traditional t-test

Until the 2010s, nearly all psychologists used what we now call “traditional” t-tests, rather than Bayesian t-tests. They did this because traditional t-tests used to be easier to calculate than Bayesian t-tests (until faster computers and better software came along). This convenience was bought at the cost of the output of traditional t-tests being much harder to interpret. Indeed, most professional psychologists misinterpret p values. If you’ve been taught about p values by a psychologist in the past, what you were told is much more likely to have been wrong than right.

For historical reasons, and so you can understand and critique older papers in our field, we’ll now briefly show you how to run a traditional t-test in R. The command is:

# Calculate traditional t-test for effect of 'sex' on 'income'
t.test(cpsdata$income ~ cpsdata$sex)


    Welch Two Sample t-test

data:  cpsdata$income by cpsdata$sex
t = -3.6654, df = 9991.3, p-value = 0.0002482
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
 -14518.10  -4400.64
sample estimates:
mean in group female   mean in group male 
            82677.29             92136.66

and the key figure in the output is the p-value. The historical convention, based on a number of misconceptions about what p values are, was as follows:

If p < .05 then psychologists would believe your observed difference was real and allow you to proclaim it as “statistically significant”, and to publish it.
If p >= .05 then psychologists would be skeptical. They probably wouldn’t believe your observed difference is real, and they’d expect you to describe it as “statistically non-significant”. Unlike Bayes Factors, a non-significant p-value does not mean you have evidence for the null. Traditional t-tests cannot provide evidence for the null.

Exercise

Start by adding the following comment to your script:

# EXERCISE

In this exercise, you’ll consolidate what you’ve learned so far. Add comments and commands to your script to complete the exercise.

The task is to further examine this sample of participants who are living in the US, and earning less than $150k (cpslow).

Specifically, the task is to perform a Bayesian t-test and a traditional t-test to address the question of whether people born in the US earn more. Your output should look like the below if you’ve got it right.

As you can see, the Bayesian evidence for a difference is pretty overwhelming in this case – it’s about 3.5 million times more likely there is a difference than there isn’t!

Expected output

Bayes factor analysis
--------------
[1] Alt., r=0.707 : 3534729 ±0%

Against denominator:
  Null, mu1-mu2 = 0 
---
Bayes factor type: BFindepSample, JZS


    Welch Two Sample t-test

data:  cpslow$income by cpslow$native
t = -6.5669, df = 1473.8, p-value = 7.102e-11
alternative hypothesis: true difference in means between group foreign and group native is not equal to 0
95 percent confidence interval:
 -9716.752 -5246.980
sample estimates:
mean in group foreign  mean in group native 
             51422.69              58904.56

Note: 7.102e-11, what does that mean? It’s scientific notation (a.k.a. “Standard Form”), so is read 7.102 x 10^-11. You would have been taught scientific notation in school, but here’s a reminder if you need it BBC bitesize revision guide on standard form.

This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.