Before starting this exercise, you should have had a brief introduction to getting and using RStudio – Introduction to RStudio. You should also have also completed the workshop exercises for Exploring Data and Group Differences. If not, take a look these earlier worksheets before continuing.

First, log in to
RStudio
server, and make sure you are in your `psyc411`

project.

Next, create a **new** script file in your project,
called `evidence.R`

.

Now, you load *tidyverse* and you load the income *data
frame*, and create the `cpslow`

data frame by adding the
following commands and comments to your R script, and using CTRL+ENTER
to run each command in turn:

```
## EVIDENCE
# Load package
library(tidyverse)
# Load data
cpsdata <- read_csv(url("https://www.andywills.info/cps2.csv"))
# Filter to incomes < $150K
cpslow <- cpsdata %>% filter(income < 150000)
```

The first thing we’re going to go in this class is calculate a *p
value.* In its most basic form, a p value is just a
*probability*. A probability is a number between 0 and 1 that
represents how likely something is to happen. For example, if you flip a
fair coin, P(Heads) = 0.5, and P(Tails) = 0.5. If you roll a six-sided
dice, the probability of rolling a 3 (or any other number) is 1/6.

In psychology, when people talk about *p values*, they
normally mean something a bit more specific. They mean the probability
that we would have got the data we did, given a theory that we have.
When psychologists talk about *p values*, that theory is nearly
always a theory of no difference (a *null hypothesis*).

For example, we might have the null hypothesis that men and women in
the US have exactly the same mean income, and we collect some data to
test this. For example, we have our `cpsdata`

dataset, which
is data on the incomes of 10,000 US men and women.

When we started looking at this data (see the previous Group Differences workshop), we observed that the mean salary for women was several thousand dollars lower than the mean salary for men.

Our sample of 10,000 people is less than 1% of the US population, and
the effect size in our sample is small (the difference in means is about
one-fifth of the standard deviation). Still, on the face of it, it seems
unlikely that our data (our *sample*) would have a gender pay gap
of several thousands dollars if, in the US as a whole (i.e. in the US
*population*), male and female mean salaries were exactly the
same. So, the probability of our data, given the *null*
hypothesis, seems like it should be quite low.

R allows us to estimate how low this probability is, under the
assumption that we have a *random* sample of the population. One
way to do this is to use a between-subjects t-test.

We are aiming to calculate the probability of the income data we have collected, under the theory that men and women have identical mean incomes. One way we can do that is to perform a t-test. A t-test takes two things into account when calculating this probability:

the effect size, see Group Differences

the sample size (i.e. the number of people in our sample, in this case 10,000).

If we hope to find a low probability despite a low effect size, we will need a large sample. Although our effect size is small (about 0.2, see Group Differences), our sample is quite large (10,000 people).

In R, the command for a t-test is very similar to the command for
calculating effect size (see Group
Differences). In fact, all we need to do is replace
`cohen.d`

with `t.test`

.

**So, the comment and command to add to your script and run
is:**

```
# Perform t-test on income, by sex
t.test(cpsdata$income ~ cpsdata$sex)
```

```
Welch Two Sample t-test
data: cpsdata$income by cpsdata$sex
t = -3.6654, df = 9991.3, p-value = 0.0002482
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
-14518.10 -4400.64
sample estimates:
mean in group female mean in group male
82677.29 92136.66
```

The key figure in the output is the `p-value`

. We can see
here that the p value is, as we expected, very low, it’s about .0002.
So, given a theory that male and female mean incomes are identical in
the US population (our null hypothesis), the probability of observing
the data we did (our sample), is very low (less than 1 in 1000).

**Note:** Technically, it’s the probability of observing
*at least* the difference in means that we did (i.e. a difference
in means equal or larger than the one we observed), under the assumption
the null hypothesis is correct.

Psychologists will generally record this probability in their
articles. So, in this case, they would write`p = .0002`

or,
alternatively, `p < .05`

. The latter form,
`p < .05`

, reads “the p value is less than .05”. More on
this later.

Psychologists usually also report at least two further numbers when
reporting a t-test in their articles. The first is the t value,
`t`

. Our t value in this case is about 3.67 (we can safely
ignore the minus sign).

The second is the “degrees of freedom” (`df`

). *Degrees
of freedom* are a way of talking about the size of the dataset used
in the t test. You’ll see it isn’t quite the sample size, because we
know our sample size was 10,000, yet the `df`

are a bit lower
than that: 9991.3.

In summary, `t = 3.67`

, `df = 9991.3`

, and
`p = .0002`

. Psychologists have a standard shorthand for
this, which is to write, `t(9991.3) = 3.67, p = .0002`

.
You’ll see this sort of thing in most psychology articles. In more
recent articles, you’ll also see a report of the effect size, in this
case `d = .020`

, as we calculated previously.

The whole sentence would be something like:

`Men earned more than women, d = .20, t(9991.3) = 3.67, p < .05`

.

There’s quite a bit of other output from our t-test that we haven’t discussed. If you’re curious about what the rest of the output means, take at look at more on t-tests. In particular, there is a section on confidence intervals, which may come up in some of your other teaching.

As we covered above, psychologists often report any p value less than
.05 as simply `p < .05`

. This comparison to .05 (1 in 20)
is an arbitrary convention in our subject. The convention, in practice,
works as follows:

If

`p < .05`

then psychologists will believe your observed difference is real and allow you to proclaim it as “statistically significant”, and to publish it.If

`p >= .05`

then psychologists will be sceptical. They probably won’t believe your observed difference is real, and they’ll expect you to describe it as “statistically non-significant”. Sometimes, if p is between .05 and about 0.15, authors will describe their finding as “marginally significant”.

Psychologists will also tell you (quite accurately) that a “non
significant” p value gives you no more information than you had before
you collected the data. So, a non-significant p-value **does
not** mean you have evidence for the null. Instead, you have an
absence of evidence. This seems crazy – after all, you now have some
data, surely that tells you something? This is a big problem with p
values, but it is not the only one, as we’ll see in a moment.

The *p value* is

*P(data | null)* – which means the probability of our data,
assuming the null hypothesis is correct.

The traditional way of interpreting this *p value* (see above)
treats it is as if it were the:

*P(null | data)* – the probability of the null hypothesis,
given the data.

These two things are **not** the same, but psychologists
have a long
history of not realising this. So, if you can’t see what the problem
is just yet, you’re in good company! Below, I’ll try and illustrate the
error with an analogy. If you understand the analogy, you understand (at
some level) what the problem is. If you don’t get it, don’t worry too
much – in the next section, we’ll look at a different sort of test that
allows us to avoid the error.

To make this all a bit less abstract, let’s make the data “your car
doesn’t start”, and my theory “there’s a banana in your exhaust pipe”.
Sticking a banana up your car’s exhaust pipe (not recommended!!) is a
reasonably effective way of making it so your car won’t start. So the
probability your car won’t start, given there’s a banana in the exhaust
pipe, is quite high. Let’s say that *P(nostart | banana)* =
0.95.

However, the probability your car has a banana in its exhaust pipe,
given the fact it doesn’t start, is quite low. There are lots of reasons
a car might not start (flat battery, no fuel, …), so the probability
that my theory (that there’s a banana in your exhaust) is correct, given
the data that your car does not start, is quite low. You’d be better off
checking the battery and the fuel gauge first. You can save searching
for fruit in your exhaust pipe for later. Let’s say *P(banana |
nostart )* = .01.

In summary, P(nostart | banana) = 0.95, but P(banana | nostart) = .01. The only thing to take from this example is that these two numbers are not the same. Similarly, P(data|null) will seldom be the same as P(null|data).

It is possible to calculate what we want to know, P(null|data), given
what a t-test gives us, P(data|null), but we have to know P(null) – the
probability the null hypothesis was correct *before* we collected
our data. We normally don’t know this precisely but we can make an
informed guess.

For example, say you have the hypothesis that if people read a few sentences related to old age, they will walk more slowly. Believe it or not, this is a real example of a psychological hypothesis.

Before collecting any data, this hypothesis seems unlikely. If the experiment is well designed, we might conclude that the probability of the null hypothesis was rather high. Let’s say, P(null) = 0.95.

We run the experiment, do a t-test, and get a p value of .049. So,
P(data|null) = .049. Traditionally, psychologists would call this a
*significant* result, believe you, and allow you to publish it.
This is the wrong response because, in this case, P(null|data) is about
0.5 (not .05, .5! 50:50 !) .

In other words, instead of running the experiment, you’d have done just as well to flip a coin and say your hypothesis was right if the coin came up heads. This is no way to run a research programme, and probably goes some way to explain why psychology is full of reports of weird findings that aren’t real (aka. the “replication crisis”“).

We can escape this whole confusing mess of the traditional t-test by
calculating a *Bayes Factor* rather than a *p value*. A
Bayes Factor is a number that tells us how likely it is there’s a
difference between groups

For example, a Bayes Factor of 10 means it’s ten times more likely there is a difference than there isn’t.

A Bayes Factor less than 1 is also informative. For example, a Bayes
Factor of 0.1 (1/10) tells you it’s ten times more likely that there
*isn’t* a difference than there is.

A Bayes Factor of exactly 1 tells you that the presence or absence of a difference is equally likely.

All the above assumes that, before you collected the data, you thought the presence or absence of a difference were equally likely outcomes. If that is not the case, see more on Bayes Factors.

A Bayesian t-test is used for the same sorts of things as a traditional t-test (see above), but returns a Bayes Factor rather than a p value.

The easiest way to do a Bayesian t-test in R is to use the
*BayesFactor* package. Recall, that you load a package using the
`library`

command.

**So, the next comment and command to add to your script and
run is:**

```
# Load package for Bayes Factors
library(BayesFactor)
```

R will return a welcome message and various other text in response to this command (we do not show this text here, as it is quite long). You can safely ignore that text. However, if you get an error here, please see common errors.

The command for a Bayesian t-test is similar, but not identical, to
the commands for effect size and for a traditional t-test. For our
gender pay gap example, **we add the following command to our
script and run it:**

```
# Perform Bayesian t-test on income, by sex
ttestBF(formula = income ~ sex, data = data.frame(cpsdata))
```

```
Bayes factor analysis
--------------
[1] Alt., r=0.707 : 18.25138 ±0%
Against denominator:
Null, mu1-mu2 = 0
---
Bayes factor type: BFindepSample, JZS
```

The Bayes Factor is reported on the third line, towards the right. Our Bayes Factor is about 18.25. This means it’s about 18 times more likely that there is a gender pay gap in the US population than that there isn’t.

If you’re curious about what the rest of the output means, see more on Bayes Factors.

The `ttestBF()`

(short for “Bayes Factor t-test”) command
has two components, separated by a comma:

`formula =`

- Here we tell R what we want to analyse (the`income`

column of our data frame), and which group each income belongs to (which is found in the`sex`

column). The tilde,`~`

is used in the same way as in the effect size calculation and the standard t-test, i.e. it means “as a function of”. So`income ~ sex`

means look at income as a function of biological sex. Unlike`cohen.d`

, we don’t need to say`cpsdata$income`

because of the second component of the command:`data =`

- Here we tell R which data frame to use; in our case`cpsdata`

. Due to a limitation of the BayesFactor package, we have to specifically tell it to treat our data as a data frame (hence`data.frame(cpsdata)`

rather than just`cpsdata`

).

Psychologists love a “line in the sand”, so a convention has emerged that we believe there is a difference if the Bayes Factor is greater than 3, and believe there isn’t a difference if the Bayes Factor is less than 0.33. We sometimes describe these lines in the sand as “substantial evidence for a difference” (BF > 3) and “substantial evidence for the null” (BF < 0.33). In good, recent papers, you will see a Bayes Factor reported alongside the effect size and results of a traditional t-test. The full sentence might be something like:

`Men earned more than women, d = .20, BF = 18.25, t(9991.3) = 3.67, p < .05`

.

Others, like Kass & Raftery (1993), have provided some additional guidelines on how to interpret Bayes Factors. It is a more detailed guide than a simple “line in a sand” and can prove very useful when comparing multiple studies. The following comes from Andraszewicz et al. (2014):

A rule-of-thumb interpretation of Bayes Factors | |

Bayes Factors | Interpretation |
---|---|

> 100 | Extreme evidence for the alternative hypothesis |

30 - 100 | Very strong evidence for the alternative hypothesis |

10 - 30 | Strong evidence for the alternative hypothesis |

3 - 10 | Moderate evidence for the alternative hypothesis |

1 - 3 | Anecdotal evidence for the alternative hypothesis |

0.33 - 1 | Anecdotal evidence for the null hypothesis |

0.10 – 0.33 | Moderate evidence for the null hypothesis |

0.03 – 0.10 | Strong evidence for the null hypothesis |

0.01 – 0.03 | Very strong evidence for the null hypothesis |

< 0.01 | Extreme evidence for the null hypothesis |

Note that this is BF_{10}, which means that the larger the Bayes Factor is, the more evidence we have for the alternative hypothesis. |

Sometimes, you’ll see Bayes Factor written as BF_{10}, which
means the same thing as BF. You’ll also occasionally see
BF_{01}, which is the same idea but flipped, so BF_{01}
< 1/3 means substantial evidence for a difference, and
BF_{01} > 3 means substantial evidence for the null.

In order to calculate a Bayes Factor, R has to make some assumptions
about how big the difference is likely to be. The `ttestBF`

command does this for you, making some broad assumptions that cover the
range of effect sizes typically seen in psychology. Other commands may
make different assumptions, including estimating the likely effect size
from previous experiments. So, when reporting Bayes Factors, it’s
particularly important to report, somewhere in your Results section, the
specific method that you used. For the `ttestBF`

function
you’d write something like:

“Bayesian t-tests were calculated using the BayesFactor package (Morey & Rouder, 2022), within the R environment (R Core Team, 2022).”

You can get the references for these citations by typing
`citation("BayesFactor")`

and `citation()`

.

In this exercise, you’ll consolidate what you’ve learned so far.

**The task is to further examine this sample of participants
who are living in the US ( cpsdata) , and earning less than
$150k (cpslow).**

`# EXERCISE`

Specifically, the task is to perform a traditional t-test and a
Bayesian t-test to address the question of whether people born in the US
earn more. **So, after adding the comment above, add commands to
the end of your script to do this, and run them**. Your output
should look like the below if you’ve got it right.

As you can see, the Bayesian evidence for a difference is pretty overwhelming in this case – it’s about 3.5 million times more likely there is a difference than there isn’t!

```
Welch Two Sample t-test
data: cpslow$income by cpslow$native
t = -6.5669, df = 1473.8, p-value = 7.102e-11
alternative hypothesis: true difference in means between group foreign and group native is not equal to 0
95 percent confidence interval:
-9716.752 -5246.980
sample estimates:
mean in group foreign mean in group native
51422.69 58904.56
```

```
Bayes factor analysis
--------------
[1] Alt., r=0.707 : 3534729 ±0%
Against denominator:
Null, mu1-mu2 = 0
---
Bayes factor type: BFindepSample, JZS
```

**Note:** 7.102e-11, what does that mean? It’s
scientific notation (a.k.a. “Standard Form”), so is read 7.102 x
10^{-11}. You would have been taught scientific notation in
school, but here’s a reminder if you need it BBC
bitesize revision guide on standard form.

Now write a single sentence that reports what you’ve found in the standard way. It will be of the form:

Immigrants earn less than those born in the US,
`BF = , t( ) = , p =`

(or you could replace
`p=`

with `p <`

).

Fill in the blanks!

*Question*: So, what if your p-value is less than .05, but
your Bayes Factor is less than 3?

*Answer*: Believe the Bayes Factor, and ignore the p value.
The p value is there for historical reasons and doesn’t tell you
anything you actually want to know. If you’d like to read more about
this issue, take a look at more on Bayes
Factors.

This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.