Previous worksheets introduced linear regression using a single predictor variable, and multiple regression with two predictors. This worksheet builds on this foundation, by explaining how to build models with more than two predictors.

Open the `rminr-data`

project we used previously.

Ensure you have the latest files by asking git to “`pull`

”
the repository. Select the `Git`

tab, which is located in the
row of tabs which includes the `Environment`

tab. Click the
`Pull`

button with a downward pointing arrow. A window will
open showing the files which have been pulled from the repository. Close
the `Git pull`

window. The `going-further`

folder
should contain the file `religion-preproc.csv`

.

Next, create a new, empty R script and save it in the
`rminr-data`

folder as `more-regression.R`

. Put
all the commands from this worksheet into this file, and run them from
there. Save your script regularly.

We’ll use some data from an undergraduate student dissertation which explored relationships between Polish (n=93) and British (n=104) people’s religious orientation, spiritual beliefs and emotional intelligence. We start by loading the data.

**Enter these comments and commands into your script, and run
them:**

```
# Clear the environment
rm(list = ls())
# Load tidyverse
library(tidyverse)
# Load data
data <- read_csv('going-further/religion-preproc.csv')
```

In order to follow the rest of this worksheet, you need to understand
the data you are analyzing. Open the `data`

data frame you
have just loaded, by clicking on it in the Environment pane of RStudio.
Here’s what each of the columns contains:

Column | Description | Values |
---|---|---|

subj | Unique anonymous participant number | 1-197 |

age | Age of participant | 18-74 |

sex | Biological sex of participant | male, female |

education | Highest education level of participant | no formal quals, GCSE or equiv, A level or equiv, degree, Technical, HNC, FD or equiv |

religious | Did the participant identify with a recognised religion? | yes, no |

nationality | Nationality of participant | British, Polish |

We need to do a little bit of preprocessing of the
`education`

variable. Specifically, `education`

is
an *ordered* variable – people in this sample have a reported
level of education which ranges from low (`no formal quals`

)
to high (`degree`

). R does not know what order to put these
levels in unless we tell it. The easiest way to do this is to recode the
data, as you did previously in the cleaning
up questionnaire data worksheet.

**Enter these comments and commands into your script, and run
them:**

```
# Recode data
edu_map <- c('no formal quals' = 1,
'GCSE or equiv' = 2,
'A level or equiv' = 3,
'Technical, HNC, FD or equiv' = 4,
'degree' = 5)
data <- data %>% mutate(edu_sc = recode(education, !!!edu_map))
```

**Explanation of commands**: As we have done previously,
we define a set of mappings between text in the data frame
(e.g.`GCSE or equiv`

) and a number (for GCSEs,
`2`

). We then use `mutate`

and `recode`

to create a new variable called `edu_sc`

which gives us a
score for educational level that ranges from 1 to 5.

We now turn to the three scales that the participants completed. In the the data we loaded, the scale scores have already been calculated. In practice, you would need to calculate the scores for each scale from the raw data. This is explained in the Preprocessing scales worksheet.

Religious orientation (RO) was was measured with the amended Religious Orientation Scale (ROS). According to the ROS, people who are intrinsically religious treat religion as a spiritual end in and of itself. Those who are extrinsically religious practise religion for self-serving reasons, such as social status. Therefore, the ROS has two subscales.

Column | Description | Values |
---|---|---|

ro_i | Intrinsic relgious orientation | 1-3 |

ro_e | Extrinsic relgious orientation | 1-3 |

Spirituality was measured with the Spiritual Connection Questionnaire (SCQ). The SCQ has five subscales:

Column | Description | Values |
---|---|---|

happiness | Extent to which spirituality brings the participant happiness | 1-7 |

places | Extent to which the participant feels spiritually connected to places | 1-7 |

others | Extent to which the participant feels spiritually connected to others | 1-7 |

nature | Extent to which the participant feels spiritually connected to nature | 1-7 |

universe | Extent to which the participant feels spiritually connected to the universe | 1-7 |

In this study, emotional intelligence was treated as a trait; a personality factor relating to various aspects of emotions. This was measured using the Short Form Trait Emotional Intelligence Questionnaire (TEIQue-SF). The TEIQue-SF has four subscales: wellbeing, self-control, emotionality and sociability.

Column | Description | Values |
---|---|---|

tei | Total trait emotional intelligence score | 1-7 |

wellbeing | Wellbeing | 1-7 |

self_control | Self-control | 1-7 |

emotionality | Emotionality | 1-7 |

sociability | Sociability | 1-7 |

The linear model you built in the multiple regression worksheet worksheet used two predictors. It’s straightforward to add additional predictors to a model. We’ll start with a model which predicts intrinsic religious orientation from the demographics variables in our Polish sample.

**Enter these comments and commands into your script, and run
them:**

```
# Select Polish participants; remove NA entries
polish <- data %>% filter(nationality == 'Polish')
polish <- polish %>% drop_na()
# Perform regression
oi_lm1 <- lm(ro_i ~ age + sex + edu_sc + religious, data = polish)
# Display results of regression
oi_lm1
```

```
Call:
lm(formula = ro_i ~ age + sex + edu_sc + religious, data = polish)
Coefficients:
(Intercept) age sexmale edu_sc religiousyes
1.675430 0.008393 0.006171 0.017407 -0.693304
```

**Explanation of commands:**

Command line 1 filters the data to only include Polish participants.

Command line 2 removes any participants for which we have missing
data (shown as `NA`

in the data frame). We have to do this
because the `lmBF`

function we’ll use later cannot cope with
missing data.

Command line 3 builds a regression model that predicts intrinsic
religious orientation, defined by `ro_i ~`

. The variables on
the right of the `~`

are the predictors. As you can see,
these are the four demographics variables
`age + sex + edu_sc + religious`

.

Command line 4 prints the result of fitting the regression model.

**Explanation of output**: We are given the equation
of the line for the regression model. We can see from this output
that the best fitting line involves a positive relation between
`age`

and `ro_i`

, i.e. older people score higher
on intrinsic religious orientation in this sample. For *binary*
predictors, for example `religious`

, which only have two
levels in our dataset (`yes`

or `no`

for
`religious`

), the output indicates which way round R has
decided to set these up. In this case, it says
`religiousyes`

, meaning that `yes`

has been taken
as the positive value. The co-efficient of `-0.69`

is
negative, so this means that people who answer `yes`

to this
question scored lower on intrinsic religious orientation that those who
replied `no`

. Similarly, `sex`

was set up with
`male`

as the positive value (`sexmale`

), and the
co-efficient (`.006`

) is positive, so men scored higher than
women on intrinsic religious orientation in this sample.

How much of the variance in intrinsic religious orientation is explained by the demographic variables? As explained in the how good was our model? worksheet, we can assess this using the \(R^2\) statistic.

**Enter these comments and commands into your script, and run
them:**

```
# Load 'broom' package, for 'glance' command
library(broom)
glance(oi_lm1)
```

```
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.306 0.273 0.509 9.46 0.00000219 4 -65.0 142. 157.
# … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

**Explanation of commands:** Line 1 loads the
`broom`

library which provides the `glance()`

function. Line 2 uses `glance()`

to print some statistics for
our model.

**Explanation of output:**

As explained in previous worksheets, the value of 0.273 in
`adj.r.squared`

tells us how much of the variance in
`ro_i`

is explained by the demographics variables. A model
containing demographics variables explains 27.3% of the variance in
intrinsic religious orientation. The other values in this output are
beyond the scope of this worksheet and are not discussed here.

The multiple regression worksheet also showed you how to calculate a Bayes factor to decide whether a regression model is any better than a simpler model with no predictors.

**Enter these commands into your script, and run
them:**

```
library(BayesFactor, quietly = TRUE)
oi_lmbf1 <- lmBF(ro_i ~ age + sex + edu_sc + religious,
data = data.frame(polish))
oi_lmbf1
```

```
Bayes factor analysis
--------------
[1] age + sex + edu_sc + religious : 8447.598 ±1.65%
Against denominator:
Intercept only
---
Bayes factor type: BFlinearModel, JZS
```

**Explanation of commands:**

Line 1 loads the `BayesFactor`

package. Lines 2-3 build a
Bayesian model using the same outcome and predictors as above. Line 3
displays information about the model.

**Explanation of output:**

The output `[1] age + sex + edu_sc + religious`

reminds
you of the predictors in your model. The number after the `:`

is a Bayes Factor for the hypothesis that your model is a better
predictor of the outcome than a simpler model with no predictors. That
is, a model which just computes the average of all outcome scores
(sometimes called the ‘Intercept only’ model). The Bayes Factor for this
hypothesis is almost 9,000, which is strong evidence that these
demographics variables help predict intrinsic religiosity.

**Note:** This Bayes Factor is for the *whole*
model (`ro_i ~ age + sex + edu_sc + religious`

), not for the
individual predictors that make it up (e.g. `age`

). So, this
analysis does not tell you that, for example, that there is evidence
that `age`

on its own predicts intrinsic religiosity. We’ll
return to this point towards the end of this worksheet.

One use of multiple regression is to test a sequence of related hypotheses. Groups of variables (sometimes referred to as ‘blocks’) are added to a model in steps. At each step, a comparison is made to see if the model with more variables explains more or less variance than the simpler model from the previous step.

The order in which variables are added to the model is not arbitrary. Based on previous research, variables known to predict the outcome are entered first. Variables which test new hypotheses are added in subsequent steps. Each step builds on the previous one, which is why the approach is known as ‘hierarchical regression’.

So far, we have a model which shows that demographics help to predict intrinsic religiosity. According to the definition we gave above, intrinsically religious people treat religion as a spiritual end in and of itself. If that’s true, we would expect a model that includes spirituality predictor variables to be better than one with just the demographics variables. Again, by ‘better’, we mean ‘explain more variance’ in the outcome variable.

We can test this hypothesis by extending the demographics model to include the spirituality variables measured by the SCQ.

**Enter these commands into your script, and run
them:**

```
oi_lm2 <- lm(ro_i ~ age + sex + edu_sc + religious +
happiness + universe + others + nature + places,
data = polish)
oi_lm2
```

```
Call:
lm(formula = ro_i ~ age + sex + edu_sc + religious + happiness +
universe + others + nature + places, data = polish)
Coefficients:
(Intercept) age sexmale edu_sc religiousyes
1.349620 0.003387 -0.037815 -0.046594 -0.586144
happiness universe others nature places
0.295375 0.004245 -0.043822 -0.045118 -0.020146
```

`glance(oi_lm2)`

```
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.622 0.580 0.387 14.8 6.90e-14 9 -37.4 96.8 124.
# … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

**Explanation of commands:**

Lines 1-3 build our new model. This includes the demographics
variables, and adds the spirituality variables
`happiness + universe + others + nature + places`

. Line 4
shows the output of the model, and Line 5 gives us the \(R^2\) value.

**Explanation of output:**

For this model, `adj.r.squared`

= 0.58, which means it
explains 58% of the variance in intrinsic religious orientation, an
increase of 30.7% over the model from step 1.

Earlier, we found that a model with demographics predictors was better than a model with no predictors. To test the hypothesis that the spirituality variables help to explain the variance in intrinsic religiosity, we can compare our second model against the first.

We’ve compared Bayesian models in a couple of previous worksheets. For example, in the multiple regression worksheet, we compared a model where the relationship between grades and study time was assumed to be different for men and women, with one where it was assumed to be the same. You were also comparing two Bayesian models when you tested for an interaction in a Bayesian ANOVA.

We can use the same approach here.

**Enter these commands into your script, and run
them:**

```
oi_lmbf2 <- lmBF(ro_i ~ age + sex + edu_sc + religious +
happiness + universe + others + nature + places,
data = data.frame(polish))
oi_lmbf2 / oi_lmbf1
```

```
Bayes factor analysis
--------------
[1] age + sex + edu_sc + religious + happiness + universe + others + nature + places : 26864252 ±2.08%
Against denominator:
ro_i ~ age + sex + edu_sc + religious
---
Bayes factor type: BFlinearModel, JZS
```

**Explanation of commands:**

Lines 1-3 build a Bayesian regression model with the demographic and
spirituality predictors. Line 4 compares the model with the demographics
and spirituality variables, `oi_lmbf2`

, against the model
with just the demographics variables, `oi_lmbf1`

. You can
think of this as dividing a more complex model by a simpler one, but
remember that this is actually a comparison of the evidence for the two
models.

**Explanation of output:**

The Bayes Factor for this hypothesis is over 25 million, so we can confidently claim that a model containing these spirituality variables explains more variance in intrinsic religiosity, than a simpler model with just demographics variables.

In this study, the researchers also hypothesized that trait emotional intelligence and extrinsic religiosity would further explain the variance in intrinsic religiosity. These variables were all added to the third model.

**Enter these commands into your script, and run
them:**

```
oi_lm3 <- lm(ro_i ~ age + sex + edu_sc + religious +
happiness + universe + others + nature + places +
wellbeing + self_control + emotionality + sociability + ro_e,
data = polish)
oi_lm3
```

```
Call:
lm(formula = ro_i ~ age + sex + edu_sc + religious + happiness +
universe + others + nature + places + wellbeing + self_control +
emotionality + sociability + ro_e, data = polish)
Coefficients:
(Intercept) age sexmale edu_sc religiousyes
0.386301 0.004410 0.044791 0.011721 -0.303004
happiness universe others nature places
0.219556 0.057779 -0.095376 -0.037245 0.024456
wellbeing self_control emotionality sociability ro_e
-0.002467 -0.091592 -0.004978 -0.006091 0.643421
```

`glance(oi_lm3)`

```
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.746 0.699 0.327 15.9 3.20e-17 14 -19.3 70.6 111.
# … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

**Explanation of commands:**

Lines 1-4 builds our third model. This contains all of the variables
in `oi_lmbf2`

, and adds the emotional intelligence variables
`wellbeing + self_control + emotionality + sociability`

, and
`ro_e`

for extrinsic religiosity. Line 5 shows the best
fitting line, and Line 6 gives us the \(R^2\) value.

**Explanation of output:**

For this model, `adj.r.squared`

= 0.7, which means it
explains 70% of the variance in intrinsic religious orientation, an
increase of 12% over the model from step 2.

Finally, we compare the Bayesian model, with the one from step 2.

**Enter these commands into your script, and run
them:**

```
oi_lmbf3 <- lmBF(ro_i ~ age + sex + edu_sc + religious +
happiness + universe + others + nature + places +
wellbeing + self_control + emotionality + sociability + ro_e,
data = data.frame(polish))
oi_lmbf3 / oi_lmbf2
```

```
Bayes factor analysis
--------------
[1] age + sex + edu_sc + religious + happiness + universe + others + nature + places + wellbeing + self_control + emotionality + sociability + ro_e : 4124.3 ±2.23%
Against denominator:
ro_i ~ age + sex + edu_sc + religious + happiness + universe + others + nature + places
---
Bayes factor type: BFlinearModel, JZS
```

**Explanation of commands:**

Lines 1-4 build the Bayesian model. Line 5 compares this new model against the previous one.

**Explanation of output:**

The Bayes Factor for this hypothesis is over 4,000, which is strong evidence that a model with the additional variables explains more variance than the simpler model in step 2.

All of the regression models we’ve been creating and comparing were
built based on the data from our sample. However, we’re normally looking
for a model that will predict outcomes for *any* sample. Adding
variables to a model can explain more variance in a particular sample,
at the expense of being able to explain data in other samples. This is
known as overfitting. As a
general principle, we can reduce the risk of overfitting by preferring a
simpler model over a more complex one, when adding variables doesn’t
increase *R ^{2}* by much. Other methods, such as cross-validation,
provide more precise tests of overfitting, but these are outside the
scope of this worksheet.

We earlier saw that there was strong evidence (\(BF > 8000\)) for the model
`ro_i ~ age + sex + edu_sc + religious`

, and we noted that
this is evidence for the whole model, not for any one predictor
(e.g. `age`

). Nonetheless, you might sometimes be asked to
report whether there is evidence for a particular predictor within a
model *individually* having an effect on the outcome variable.
You can answer such questions using the same approach as above. In other
words, calculate the evidence for a model omitting that one predictor –
`ro_i ~ sex + edu_sc + religious`

– and compare it to the
original model.

**Enter these commands into your script, and run
them:**

```
oi_lmbf1_minus_age <- lmBF(ro_i ~ sex + edu_sc + religious,
data = data.frame(polish))
oi_lmbf1 / oi_lmbf1_minus_age
```

```
Bayes factor analysis
--------------
[1] age + sex + edu_sc + religious : 1.10057 ±4.13%
Against denominator:
ro_i ~ sex + edu_sc + religious
---
Bayes factor type: BFlinearModel, JZS
```

In this case, the result of our analysis is inconclusive. The Bayes
Factor is close to 1, so the evidence for and against `age`

being an effective predictor in the model is about evenly matched. In
contrast, there is strong evidence for `relgion`

being an
effective predictor within the model.

**Enter these commands into your script, and run
them:**

```
oi_lmbf1_minus_rel <- lmBF(ro_i ~ age + sex + edu_sc, data = data.frame(polish))
oi_lmbf1 / oi_lmbf1_minus_rel
```

```
Bayes factor analysis
--------------
[1] age + sex + edu_sc + religious : 60358.28 ±2.06%
Against denominator:
ro_i ~ age + sex + edu_sc
---
Bayes factor type: BFlinearModel, JZS
```

Using just the data from British participants, build a model which
predicts trait emotional intelligence from the demographics variables.
The adjusted *R ^{2}* and Bayes Factor should look like
this:

```
Call:
lm(formula = tei ~ age + sex + edu_sc + religious, data = data.frame(british))
Coefficients:
(Intercept) age sexmale edu_sc religiousyes
4.02501 0.02486 0.62775 -0.08031 0.17461
```

```
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.164 0.130 0.934 4.76 0.00150 4 -135. 282. 298.
# … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```
Bayes factor analysis
--------------
[1] age + sex + edu_sc + religious : 7.177979 ±1.43%
Against denominator:
Intercept only
---
Bayes factor type: BFlinearModel, JZS
```

**Note:** For reasons previously explained, your Bayes Factor
may not be exactly the same as the one shown here. Any value between
about 7 and 8 is fine.

**Copy the R code you used for this exercise into
PsycEL.**

Build a second model which predicts trait emotional intelligence from
the demographics and spirituality variables. Compare this against the
model you built in the previous exercise. The adjusted
*R ^{2}* and Bayes Factor should look like this:

```
Call:
lm(formula = tei ~ age + sex + edu_sc + religious + happiness +
universe + others + nature + places, data = data.frame(british))
Coefficients:
(Intercept) age sexmale edu_sc religiousyes
4.26541 0.02580 0.62405 -0.05840 0.08868
happiness universe others nature places
-0.20410 -0.00543 0.21218 0.01457 -0.07620
```

```
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.196 0.118 0.940 2.50 0.0133 9 -133. 288. 317.
# … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```
Bayes factor analysis
--------------
[1] age + sex + edu_sc + religious + happiness + universe + others + nature + places : 0.04024518 ±3.74%
Against denominator:
tei ~ age + sex + edu_sc + religious
---
Bayes factor type: BFlinearModel, JZS
```

**Copy the R code you used for this exercise into
PsycEL.**

**Write a sentence interpreting R^{2} and the
Bayes Factor for the model comparison.**

This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.