Previous worksheets introduced linear regression using a single predictor variable, and multiple regression with two predictors. This worksheet builds on this foundation, by explaining how to build models with more than two predictors.

Open the `rminr-data`

project we used previously.

Ensure you have the latest files by asking git to “`pull`

” the repository. Select the `Git`

tab, which is located in the row of tabs which includes the `Environment`

tab. Click the `Pull`

button with a downward pointing arrow. A window will open showing the files which have been pulled from the repository. Close the `Git pull`

window. The `going-further`

folder should contain the file `religion-preproc.csv`

.

Next, create a new, empty R script and save it in the `rminr-data`

folder as `more-regression.R`

. Put all the commands from this worksheet into this file, and run them from there. Save your script regularly.

We’ll use some data from an undergraduate student dissertation which explored relationships between Polish (n=93) and British (n=104) people’s religious orientation, spiritual beliefs and emotional intelligence. We start by loading the data.

```
rm(list = ls()) # clear the environment
library(tidyverse)
<- read_csv('going-further/religion-preproc.csv') data
```

In order to follow the rest of this worksheet, you need to understand the data you are analyzing. Open the `data`

data frame you have just loaded, by clicking on it in the Environment pane of RStudio. Here’s what each of the columns contains:

Column | Description | Values |
---|---|---|

subj | Unique anonymous participant number | 1-197 |

age | Age of participant | 18-74 |

sex | Biological sex of participant | male, female |

education | Highest education level of participant | no formal quals, GCSE or equiv, A level or equiv, degree, Technical, HNC, FD or equiv |

religious | Did the participant identify with a recognised religion? | yes, no |

nationality | Nationality of participant | British, Polish |

We need to do a little bit of preprocessing of the `education`

variable. Specifically, `education`

is an *ordered* variable – people in this sample have a reported level of education which ranges from low (`no formal quals`

) to high (`degree`

). R does not know what order to put these levels in unless we tell it. The easiest way to do this is to recode the data, as you did previously in the cleaning up questionnaire data worksheet:

```
<- c('no formal quals' = 1,
edu_map 'GCSE or equiv' = 2,
'A level or equiv' = 3,
'Technical, HNC, FD or equiv' = 4,
'degree' = 5)
<- data %>% mutate(edu_sc = recode(education, !!!edu_map)) data
```

**Explanation of commands**: As we have done previously, we define a set of mappings between text in the data frame (e.g.`GCSE or equiv`

) and a number (for GCSEs, `2`

). We then use `mutate`

and `recode`

to create a new variable called `edu_sc`

which gives us a score for educational level that ranges from 1 to 5.

We now turn to the three scales that the participants completed. In the the data we loaded, the scale scores have already been calculated. In practice, you would need to calculate the scores for each scale from the raw data. This is explained in the Preprocessing scales worksheet.

Religious orientation (RO) was was measured with the amended Religious Orientation Scale (ROS). According to the ROS, people who are intrinsically religious treat religion as a spiritual end in and of itself. Those who are extrinsically religious practise religion for self-serving reasons, such as social status. Therefore, the ROS has two subscales.

Column | Description | Values |
---|---|---|

ro_i | Intrinsic relgious orientation | 1-3 |

ro_e | Extrinsic relgious orientation | 1-3 |

Spirituality was measured with the Spiritual Connection Questionnaire (SCQ). The SCQ has five subscales:

Column | Description | Values |
---|---|---|

happiness | Extent to which spirituality brings the participant happiness | 1-7 |

places | Extent to which the participant feels spiritually connected to places | 1-7 |

others | Extent to which the participant feels spiritually connected to others | 1-7 |

nature | Extent to which the participant feels spiritually connected to nature | 1-7 |

universe | Extent to which the participant feels spiritually connected to the universe | 1-7 |

In this study, emotional intelligence was treated as a trait; a personality factor relating to various aspects of emotions. This was measured using the Short Form Trait Emotional Intelligence Questionnaire (TEIQue-SF). The TEIQue-SF has four subscales: wellbeing, self-control, emotionality and sociability.

Column | Description | Values |
---|---|---|

tei | Total trait emotional intelligence score | 1-7 |

wellbeing | Wellbeing | 1-7 |

self_control | Self-control | 1-7 |

emotionality | Emotionality | 1-7 |

sociability | Sociability | 1-7 |

The linear model you built in the multiple regression worksheet worksheet used two predictors. It’s straightforward to add additional predictors to a model. We’ll start with a model which predicts intrinsic religious orientation from the demographics variables in our Polish sample.

```
polish <- data %>% filter(nationality == 'Polish')
polish <- polish %>% drop_na()
oi_lm1 <- lm(ro_i ~ age + sex + edu_sc + religious, data = polish)
oi_lm1
```

```
Call:
lm(formula = ro_i ~ age + sex + edu_sc + religious, data = polish)
Coefficients:
(Intercept) age sexmale edu_sc religiousyes
1.675430 0.008393 0.006171 0.017407 -0.693304
```

**Explanation of commands:**

Line 1 filters the data to only include Polish participants.

Line 2 removes any participants for which we have missing data (shown as `NA`

in the data frame). We have to do this because the `lmBF`

function we’ll use later cannot cope with missing data.

Line 4 builds a regression model that predicts intrinsic religious orientation, defined by `ro_i ~`

. The variables on the right of the `~`

are the predictors. As you can see, these are the four demographics variables `age + sex + edu_sc + religious`

.

Line 5 prints the result of fitting the regression model.

**Explanation of output**: We are given the equation of the line for the regression model. We can see from this output that the best fitting line involves a positive relation between `age`

and `ro_i`

, i.e. older people score higher on intrinsic religious orientation in this sample. For *binary* predictors, for example `religious`

, which only have two levels in our dataset (`yes`

or `no`

for `religious`

), the output indicates which way round R has decided to set these up. In this case, it says `religiousyes`

, meaning that `yes`

has been taken as the positive value. The co-efficient of `-0.69`

is negative, so this means that people who answer `yes`

to this question scored lower on intrinsic religious orientation that those who replied `no`

. Similarly, `sex`

was set up with `male`

as the positive value (`sexmale`

), and the co-efficient (`.006`

) is positive, so men scored higher than women on intrinsic religious orientation in this sample.

How much of the variance in intrinsic religious orientation is explained by the demographic variables? As explained in the how good was our model? worksheet, we can assess this using the \(R^2\) statistic:

```
# A tibble: 1 x 11
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.306 0.273 0.509 9.46 0.00000219 5 -65.0 142. 157.
# … with 2 more variables: deviance <dbl>, df.residual <int>
```

**Explanation of commands:** Line 1 loads the `broom`

library which provides the `glance()`

function. Line 2 uses `glance()`

to print some statistics for our model.

**Explanation of output:**

As explained in previous worksheets, the value of 0.273 in `adj.r.squared`

tells us how much of the variance in `ro_i`

is explained by the demographics variables. A model containing demographics variables explains 27.3% of the variance in intrinsic religious orientation. The other values in this output are beyond the scope of this worksheet and are not discussed here.

The multiple regression worksheet also showed you how to calculate a Bayes factor to decide whether a regression model is any better than a simpler model with no predictors.

```
library(BayesFactor, quietly = TRUE)
oi_lmbf1 <- lmBF(ro_i ~ age + sex + edu_sc + religious,
data = data.frame(polish))
oi_lmbf1
```

```
Bayes factor analysis
--------------
[1] age + sex + edu_sc + religious : 8423.42 ±2.3%
Against denominator:
Intercept only
---
Bayes factor type: BFlinearModel, JZS
```

**Explanation of commands:**

Line 1 loads the `BayesFactor`

package. Lines 2-3 build a Bayesian model using the same outcome and predictors as above. Line 3 displays information about the model.

**Explanation of output:**

The output `[1] age + sex + edu_sc + religious`

reminds you of the predictors in your model. The number after the `:`

is a Bayes Factor for the hypothesis that your model is a better predictor of the outcome than a simpler model with no predictors. That is, a model which just computes the average of all outcome scores (sometimes called the ‘Intercept only’ model). The Bayes Factor for this hypothesis is almost 9,000, which is strong evidence that these demographics variables help predict intrinsic religiosity.

**Note:** This Bayes Factor is for the *whole* model (`ro_i ~ age + sex + edu_sc + religious`

), not for the individual predictors that make it up (e.g. `age`

). So, this analysis does not tell you that, for example, that there is evidence that `age`

on its own predicts intrinsic religiosity. We’ll return to this point towards the end of this worksheet.

One use of multiple regression is to test a sequence of related hypotheses. Groups of variables (sometimes referred to as ‘blocks’) are added to a model in steps. At each step, a comparison is made to see if the model with more variables explains more or less variance than the simpler model from the previous step.

The order in which variables are added to the model is not arbitrary. Based on previous research, variables known to predict the outcome are entered first. Variables which test new hypotheses are added in subsequent steps. Each step builds on the previous one, which is why the approach is known as ‘hierarchical regression’.

So far, we have a model which shows that demographics help to predict intrinsic religiosity. According to the definition we gave above, intrinsically religious people treat religion as a spiritual end in and of itself. If that’s true, we would expect a model that includes spirituality predictor variables to be better than one with just the demographics variables. Again, by ‘better’, we mean ‘explain more variance’ in the outcome variable.

We can test this hypothesis by extending the demographics model to include the spirituality variables measured by the SCQ.

```
<- lm(ro_i ~ age + sex + edu_sc + religious +
oi_lm2 + universe + others + nature + places,
happiness data = polish)
oi_lm2
```

```
Call:
lm(formula = ro_i ~ age + sex + edu_sc + religious + happiness +
universe + others + nature + places, data = polish)
Coefficients:
(Intercept) age sexmale edu_sc religiousyes
1.349620 0.003387 -0.037815 -0.046594 -0.586144
happiness universe others nature places
0.295375 0.004245 -0.043822 -0.045118 -0.020146
```

`glance(oi_lm2)`

```
# A tibble: 1 x 11
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.622 0.580 0.387 14.8 6.90e-14 10 -37.4 96.8 124.
# … with 2 more variables: deviance <dbl>, df.residual <int>
```

**Explanation of commands:**

Lines 1-3 build our new model. This includes the demographics variables, and adds the spirituality variables `happiness + universe + others + nature + places`

. Line 4 shows the output of the model, and Line 5 gives us the \(R^2\) value.

**Explanation of output:**

For this model, `adj.r.squared`

= 0.58, which means it explains 58% of the variance in intrinsic religious orientation, an increase of 30.7% over the model from step 1.

Earlier, we found that a model with demographics predictors was better than a model with no predictors. To test the hypothesis that the spirituality variables help to explain the variance in intrinsic religiosity, we can compare our second model against the first.

We’ve compared Bayesian models in a couple of previous worksheets. For example, in the multiple regression worksheet, we compared a model where the relationship between grades and study time was assumed to be different for men and women, with one where it was assumed to be the same. You were also comparing two Bayesian models when you tested for an interaction in a Bayesian ANOVA.

We can use the same approach here:

```
oi_lmbf2 <- lmBF(ro_i ~ age + sex + edu_sc + religious +
happiness + universe + others + nature + places,
data = data.frame(polish))
oi_lmbf2 / oi_lmbf1
```

```
Bayes factor analysis
--------------
[1] age + sex + edu_sc + religious + happiness + universe + others + nature + places : 26831761 ±2.61%
Against denominator:
ro_i ~ age + sex + edu_sc + religious
---
Bayes factor type: BFlinearModel, JZS
```

**Explanation of commands:**

Lines 1-3 build a Bayesian regression model with the demographic and spirituality predictors. Line 4 compares the model with the demographics and spirituality variables, `oi_lmbf2`

, against the model with just the demographics variables, `oi_lmbf1`

. You can think of this as dividing a more complex model by a simpler one, but remember that this is actually a comparison of the evidence for the two models.

**Explanation of output:**

The Bayes Factor for this hypothesis is over 25 million, so we can confidently claim that a model containing these spirituality variables explains more variance in intrinsic religiosity, than a simpler model with just demographics variables.

In this study, the researchers also hypothesized that trait emotional intelligence and extrinsic religiosity would further explain the variance in intrinsic religiosity. These variables were all added to the third model:

```
<- lm(ro_i ~ age + sex + edu_sc + religious +
oi_lm3 + universe + others + nature + places +
happiness + self_control + emotionality + sociability + ro_e,
wellbeing data = polish)
oi_lm3
```

```
Call:
lm(formula = ro_i ~ age + sex + edu_sc + religious + happiness +
universe + others + nature + places + wellbeing + self_control +
emotionality + sociability + ro_e, data = polish)
Coefficients:
(Intercept) age sexmale edu_sc religiousyes
0.386301 0.004410 0.044791 0.011721 -0.303004
happiness universe others nature places
0.219556 0.057779 -0.095376 -0.037245 0.024456
wellbeing self_control emotionality sociability ro_e
-0.002467 -0.091592 -0.004978 -0.006091 0.643421
```

`glance(oi_lm3)`

```
# A tibble: 1 x 11
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.746 0.699 0.327 15.9 3.20e-17 15 -19.3 70.6 111.
# … with 2 more variables: deviance <dbl>, df.residual <int>
```

**Explanation of commands:**

Lines 1-4 builds our third model. This contains all of the variables in `oi_lmbf2`

, and adds the emotional intelligence variables `wellbeing + self_control + emotionality + sociability`

, and `ro_e`

for extrinsic religiosity. Line 5 shows the best fitting line, and Line 6 gives us the \(R^2\) value.

**Explanation of output:**

For this model, `adj.r.squared`

= 0.7, which means it explains 70% of the variance in intrinsic religious orientation, an increase of 12% over the model from step 2.

Finally, we compare the Bayesian model, with the one from step 2.

```
<- lmBF(ro_i ~ age + sex + edu_sc + religious +
oi_lmbf3 + universe + others + nature + places +
happiness + self_control + emotionality + sociability + ro_e,
wellbeing data = data.frame(polish))
/ oi_lmbf2 oi_lmbf3
```

```
Bayes factor analysis
--------------
[1] age + sex + edu_sc + religious + happiness + universe + others + nature + places + wellbeing + self_control + emotionality + sociability + ro_e : 4121.668 ±2.19%
Against denominator:
ro_i ~ age + sex + edu_sc + religious + happiness + universe + others + nature + places
---
Bayes factor type: BFlinearModel, JZS
```

**Explanation of commands:**

Lines 1-4 build the Bayesian model. Line 5 compares this new model against the previous one.

**Explanation of output:**

The Bayes Factor for this hypothesis is over 4,000, which is strong evidence that a model with the additional variables explains more variance than the simpler model in step 2.

All of the regression models we’ve been creating and comparing were built based on the data from our sample. However, we’re normally looking for a model that will predict outcomes for *any* sample. Adding variables to a model can explain more variance in a particular sample, at the expense of being able to explain data in other samples. This is known as overfitting. As a general principle, we can reduce the risk of overfitting by preferring a simpler model over a more complex one, when adding variables doesn’t increase *R ^{2}* by much. Other methods, such as cross-validation, provide more precise tests of overfitting, but these are outside the scope of this worksheet.

We earlier saw that there was strong evidence (\(BF > 8000\)) for the model `ro_i ~ age + sex + edu_sc + religious`

, and we noted that this is evidence for the whole model, not for any one predictor (e.g. `age`

). Nonetheless, you might sometimes be asked to report whether there is evidence for a particular predictor within a model *individually* having an effect on the outcome variable. You can answer such questions using the same approach as above. In other words, calculate the evidence for a model omitting that one predictor – `ro_i ~ sex + edu_sc + religious`

– and compare it to the original model:

```
<- lmBF(ro_i ~ sex + edu_sc + religious,
oi_lmbf1_minus_age data = data.frame(polish))
/ oi_lmbf1_minus_age oi_lmbf1
```

```
Bayes factor analysis
--------------
[1] age + sex + edu_sc + religious : 1.040865 ±6.75%
Against denominator:
ro_i ~ sex + edu_sc + religious
---
Bayes factor type: BFlinearModel, JZS
```

In this case, the result of our analysis is inconclusive. The Bayes Factor is close to 1, so the evidence for and against `age`

being an effective predictor in the model is about evenly matched. In contrast, there is strong evidence for `relgion`

being an effective predictor within the model:

```
<- lmBF(ro_i ~ age + sex + edu_sc, data = data.frame(polish))
oi_lmbf1_minus_rel / oi_lmbf1_minus_rel oi_lmbf1
```

```
Bayes factor analysis
--------------
[1] age + sex + edu_sc + religious : 57751.08 ±2.62%
Against denominator:
ro_i ~ age + sex + edu_sc
---
Bayes factor type: BFlinearModel, JZS
```

Using just the data from British participants, build a model which predicts trait emotional intelligence from the demographics variables. The adjusted *R ^{2}* and Bayes Factor should look like this:

```
Call:
lm(formula = tei ~ age + sex + edu_sc + religious, data = data.frame(british))
Coefficients:
(Intercept) age sexmale edu_sc religiousyes
4.02501 0.02486 0.62775 -0.08031 0.17461
```

```
# A tibble: 1 x 11
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.164 0.130 0.934 4.76 0.00150 5 -135. 282. 298.
# … with 2 more variables: deviance <dbl>, df.residual <int>
```

```
Bayes factor analysis
--------------
[1] age + sex + edu_sc + religious : 7.339347 ±2.3%
Against denominator:
Intercept only
---
Bayes factor type: BFlinearModel, JZS
```

**Note:** For reasons previously explained, your Bayes Factor may not be exactly the same as the one shown here. Any value between about 7 and 8 is fine.

**Copy the R code you used for this exercise into PsycEL.**

Build a second model which predicts trait emotional intelligence from the demographics and spirituality variables. Compare this against the model you built in the previous exercise. The adjusted *R ^{2}* and Bayes Factor should look like this:

```
Call:
lm(formula = tei ~ age + sex + edu_sc + religious + happiness +
universe + others + nature + places, data = data.frame(british))
Coefficients:
(Intercept) age sexmale edu_sc religiousyes
4.26541 0.02580 0.62405 -0.05840 0.08868
happiness universe others nature places
-0.20410 -0.00543 0.21218 0.01457 -0.07620
```

```
# A tibble: 1 x 11
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.196 0.118 0.940 2.50 0.0133 10 -133. 288. 317.
# … with 2 more variables: deviance <dbl>, df.residual <int>
```

```
Bayes factor analysis
--------------
[1] age + sex + edu_sc + religious + happiness + universe + others + nature + places : 0.03843934 ±3.28%
Against denominator:
tei ~ age + sex + edu_sc + religious
---
Bayes factor type: BFlinearModel, JZS
```

**Copy the R code you used for this exercise into PsycEL.**

**Write a sentence interpreting R^{2} and the Bayes Factor for the model comparison.**

This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.