It can be helpful to present data in tables, rather than text, especially when you need to refer to the same data in different parts of a report. Although tables can be produced manually using a word processor, generating them directly from your data ensures they are up to date and reduces copy-paste errors. This worksheet explains how to use R to produce some of the types of table used to report psychological research.

Getting started

To prepare for this worksheet:

  1. Open the rminr-data project we used previously.

  2. If you don’t see a folder named going-further, it means you created your project before the data required for this worksheet was added to the rminr-data git repository. You can get the latest files by asking git to “pull” the repository. Select the Git tab, which is located in the row of tabs which includes the Environment tab. Click the Pull button with a downward pointing arrow. A window will open showing the files which have been pulled from the repository. Close the Git pull window.

  3. Open the Files tab. The going-further folder should contain the file picture-naming-preproc.csv.

  4. Create a script named tables.R in the rminr-data folder (the folder above going-further). Add the code to this script as you work through each section of the worksheet.

Creating a correlation matrix

We’ll start by producing a correlation matrix. A correlation matrix shows correlations between all combinations of a set of variables, which is often required in research reports. We’ll demonstrate an easy way to produce correlation matrices, with APA styling, in a format that can be read by Microsoft Word or LibreOffice Writer. A similar approach can be used to produce other common table types.

We’ll generate a correlation matrix using the attitude dataset, which is included with R. These data are the percentage of favourable attitudes given by employees, in relation to seven questions regarding their department (you can find out a bit more about these data by typing ?attitude). Here are the first few rows of the data frame:

rating complaints privileges learning raises critical advance
43 51 30 39 61 92 45
63 64 51 54 63 73 47
71 70 68 69 76 86 48
61 63 45 47 54 84 35
81 78 56 66 71 83 47

We’ll use the apaTables package to generate the correlation matrix.

Enter these commands into your script, and run them:

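rm(list = ls()) # clear the environment
library(tidyverse) # provides read_csv() and the pipe, used later in this worksheet
library(apaTables) # provides apa.cor.table()
apa.cor.table(attitude, filename="table1.doc", table.number = 1)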

Table 1 

Means, standard deviations, and correlations with confidence intervals

  Variable      M     SD    1           2           3           4           5          6          
  1. rating     64.63 12.17                                                                       
  2. complaints 66.60 13.31 .83**                                                                 
                            [.66, .91]                                                            
  3. privileges 53.13 12.24 .43*        .56**                                                     
                            [.08, .68]  [.25, .76]                                                
  4. learning   56.37 11.74 .62**       .60**       .49**                                         
                            [.34, .80]  [.30, .79]  [.16, .72]                                    
  5. raises     64.63 10.40 .59**       .67**       .45*        .64**                             
                            [.29, .78]  [.41, .83]  [.10, .69]  [.36, .81]                        
  6. critical   74.77 9.89  .16         .19         .15         .12         .38*                  
                            [-.22, .49] [-.19, .51] [-.22, .48] [-.25, .46] [.02, .65]            
  7. advance    42.93 10.29 .16         .22         .34         .53**       .57**      .28        
                            [-.22, .49] [-.15, .54] [-.02, .63] [.21, .75]  [.27, .77] [-.09, .58]

Note. M and SD are used to represent mean and standard deviation, respectively.
Values in square brackets indicate the 95% confidence interval.
The confidence interval is a plausible range of population correlations 
that could have caused the sample correlation (Cumming, 2014).
 * indicates p < .05. ** indicates p < .01.

Explanation of commands:

We load the apaTables package. The function to generate a correlation matrix is apa.cor.table(). We pass the attitude data frame as the first argument, and use filename to specify that the output should be saved in the file table1.doc. The table.number argument sets the number in the table heading output, in this case “Table 1”. If you omit this argument, the text will be “Table XX”.

Explanation of output:

Export table1.doc from RStudio and open it using a word processor.

The first thing to notice is that the styling (spacing, use of italics, horizontal lines, positioning of captions and footnotes etc.) complies with the APA guidelines for tables.

The table number and caption are above the table itself - you will need to edit the caption by hand to make it more meaningful, for example “Means, standard deviations, and correlations with confidence intervals, for the attitude measures of Study 1”.

The Variable column contains a number and the column name for the seven attitude variables. The next two columns show the mean and standard deviation for each variable. The remaining columns use the numbers from items in the Variable column as headings, indicating that they refer to the same variable. The cells show the correlation between the column variables and each of the variables in the rows. Cells are left empty where a variable would otherwise be correlated with itself. The 95% confidence interval for the correlation is shown in square brackets.

For example, the correlation between rating and complaints in this sample is .83. The confidence interval indicates that the population value is likely to be between .66 and .91.
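If you would like to check a single cell of the table by hand, base R's cor.test() reports a sample correlation and a 95% confidence interval for one pair of variables. This is only a sketch using the built-in attitude data; the object name rating_complaints is our own, and we are not claiming that apaTables computes its intervals in exactly this way.

```r
# Correlation between two of the attitude variables, with a 95% confidence interval
rating_complaints <- cor.test(attitude$rating, attitude$complaints)

round(rating_complaints$estimate, 2)   # sample correlation
round(rating_complaints$conf.int, 2)   # lower and upper bounds of the interval
```

The rounded values should correspond to the .83 and [.66, .91] shown for rating and complaints in Table 1.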

Evidence for the correlation is calculated using traditional statistics, rather than the Bayes factors described in the Relationships, part 2 worksheet. One asterisk (*) indicates p < .05. Two asterisks (**) signify p < .01. These calculations assumed a two-tailed test; one-tailed tests for correlations are explained in the More on relationships, part 2 worksheet. Also recall that p-values are widely misinterpreted, so it would be better to edit this part of the table by hand to reflect Bayes factors you have already calculated. We suggest using * for BF > 3, ** for BF > 10, o for BF < 0.33, and oo for BF < 0.1. Change the text at the bottom of the table accordingly.
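If you have many Bayes factors to mark up, a small helper can apply the thresholds suggested above. This is a minimal sketch: bf_marker is a hypothetical function name of our own, and the example Bayes factors are made up for illustration.

```r
# Hypothetical helper: convert Bayes factors into the markers suggested above
bf_marker <- function(bf) {
  ifelse(bf > 10,   "**",
  ifelse(bf > 3,    "*",
  ifelse(bf < 0.1,  "oo",
  ifelse(bf < 0.33, "o", ""))))   # empty string when the evidence is inconclusive
}

# Made-up Bayes factors, purely to show the mapping
bf_marker(c(25, 4.2, 1, 0.2, 0.05))
```

You would still type the resulting markers into the table in your word processor by hand.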

Exercise 1

For this exercise, we’ll load some data from a study which measured aspects of participants’ personality.

Enter these commands into your script, and run them:

# Exercise 1
big5 <- read_csv('case-studies/jon-may/big5_total.csv')

The first few rows show that the scale used measured the ‘big 5’ personality factors: openness to experience, conscientiousness, extraversion, agreeableness and neuroticism (OCEAN).

subj openness conscientiousness extraversion agreeableness neuroticism
1 29 28 14 36 20
2 22 22 28 28 26
3 33 33 21 37 25
4 17 34 14 39 13
5 27 27 30 40 25

Create a correlation matrix for the five personality factors. Number the table as “Table 2”, and save the results in table2.doc. Your table should look like this in RStudio:

Table 2 

Means, standard deviations, and correlations with confidence intervals

  Variable             M     SD   1           2           3           4          
  1. openness          23.15 6.78                                                
  2. conscientiousness 25.10 7.23 .15                                            
                                  [-.14, .42]                                    
  3. extraversion      21.50 7.86 .27         -.01                               
                                  [-.02, .51] [-.29, .28]                        
  4. agreeableness     33.54 4.55 .27         .20         .43**                  
                                  [-.01, .52] [-.09, .46] [.17, .64]             
  5. neuroticism       16.00 7.41 .34*        .28         .13         .07        
                                  [.06, .57]  [-.00, .52] [-.16, .40] [-.22, .34]

Note. M and SD are used to represent mean and standard deviation, respectively.
Values in square brackets indicate the 95% confidence interval.
The confidence interval is a plausible range of population correlations 
that could have caused the sample correlation (Cumming, 2014).
 * indicates p < .05. ** indicates p < .01.

…and it should be APA formatted in the file table2.doc.

Copy the R code you used for this exercise into PsycEL.

Creating a custom table of descriptive statistics

As with graphs, there is often an element of design involved in presenting tabular data in a format most useful for your reader. Packages like apaTables are useful for producing APA tables where there is a standard way to present data. However, you often need a table which is customised to present your data in the most useful format. The cost of custom tables is that the content requires a little more preprocessing, and styling the table according to APA standards will require some hand-formatting in your word processor.

We’ll demonstrate this process by producing a table of descriptive statistics. The data we’ll use comes from an experiment which evaluated children’s language development using the Words in Game (WinG) test. WinG consists of a set of picture cards which are used in four tests: noun comprehension, noun production, predicate comprehension, and predicate production. The Italian and English versions of the WinG cards use different pictures to depict the associated words. The experiment tested whether English-speaking children aged approximately 30 months produce similar responses for the two sets of cards. We would like to produce a single table, containing descriptive statistics for all four tests.

We start by loading the data; enter this command into your script, and run it:

# Load data
wing_preproc <- read_csv('going-further/picture-naming-preproc.csv')

The first few rows of wing_preproc look like this:

subj gender cards nc np pc pp cdi_u cdi_s related_nc related_np related_pc related_pp
1 female english 12 4 NA NA 62 38 0 5 NA NA
2 male italian 18 12 17 9 60 59 0 2 0 3
3 female english 18 13 17 9 97 85 0 3 0 0
4 male italian 17 11 15 12 82 45 0 4 0 2
5 female english 17 15 15 10 66 66 0 2 0 0
6 male italian 18 11 15 7 47 32 0 2 0 1

Table of descriptives

Our test scores are currently in wide format (lots of columns, few rows), but R generally requires data to be in long format (lots of rows, few columns). This means we first have to make the data frame longer, so we can calculate summary statistics.

Enter these commands into your script, and run them:

# wide to long
task_by_subj <- wing_preproc %>%
  pivot_longer(cols = c(nc, np, pc, pp),
               names_to = 'task',
               values_to = 'correct') %>%
  select(subj, gender, cards, task, correct)

Explanation of command:

In the Within-subject differences worksheet, you learned how to use pivot_wider() to widen long data frames. The pivot_longer() command does the reverse – it lengthens wide data frames. cols = c(nc, np, pc, pp) selects the columns we want to pivot. Each value in these columns is added to a row in a new column called correct (values_to = 'correct'). In the same row, a new column task is set to the name of the column which the value came from (names_to = 'task'). All of the values in the other columns are duplicated for each row. We select just the columns we want for our table of descriptive statistics.
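To see the reshaping on something small enough to check by eye, here is a minimal sketch using a made-up data frame. The names toy_wide and toy_long and the scores in them are our own inventions, not part of the WinG data.

```r
library(dplyr)  # for the pipe, %>%
library(tidyr)  # for pivot_longer()

# Made-up scores for two children on two tasks (wide format)
toy_wide <- data.frame(
  subj = c(1, 2),
  nc   = c(12, 18),
  np   = c(4, 12)
)

# One row per child per task (long format)
toy_long <- toy_wide %>%
  pivot_longer(cols = c(nc, np),
               names_to = 'task',
               values_to = 'correct')
toy_long
```

The two rows of toy_wide become four rows in toy_long, with the subj value repeated for each task.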

The first few rows of task_by_subj look like this:

subj gender cards task correct
1 female english nc 12
1 female english np 4
1 female english pc NA
1 female english pp NA
2 male italian nc 18

Now we can calculate some summary statistics, using commands that we’ve already used in previous worksheets.

Enter these commands into your script, and run them:

# Table of descriptive statistics

descript <- task_by_subj %>%
  group_by(task, gender) %>%
  summarise(mean = mean(correct, na.rm = TRUE), sd = sd(correct, na.rm = TRUE))

`summarise()` has grouped output by 'task'. You can override using the `.groups` argument.

Explanation of commands:

  1. We’ve come across group_by before; here we use it to group the data by two variables at the same time, task and gender, giving us eight groups overall.

  2. We’ve also come across summarise before, including the use of na.rm = TRUE to deal with missing data.
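The message that summarise() prints about grouped output is informational rather than an error. If you would prefer not to see it, one option is to say explicitly what should happen to the grouping. The sketch below uses made-up scores (toy_scores is our own name) and the .groups argument mentioned in the message.

```r
library(dplyr)

# Made-up long-format scores; one score is missing
toy_scores <- data.frame(
  task    = c('nc', 'nc', 'nc', 'nc', 'nc'),
  gender  = c('female', 'female', 'male', 'male', 'male'),
  correct = c(18, 16, NA, 14, 12)
)

toy_descript <- toy_scores %>%
  group_by(task, gender) %>%
  summarise(mean = mean(correct, na.rm = TRUE),
            sd = sd(correct, na.rm = TRUE),
            .groups = 'drop')   # return an ungrouped result, without the message
toy_descript
```

Because of na.rm = TRUE, the missing score is ignored, so the male mean is based on the two remaining scores.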

Our data now looks like this:

task gender mean sd
nc female 17.64 2.203
nc male 16.29 4.231
np female 12.09 3.239
np male 10.43 3.952
pc female 16.2 1.814
pc male 14.83 1.602
pp female 8.4 1.955
pp male 8.5 2.345

Meaningful labels

The descript data frame contains just the numbers we want to include in our report - the means and standard deviations for each of the eight groups. However, the row labels (np, etc.) are not particularly clear, so we replace them with something more human readable.

Enter these commands into your script, and run them:

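# Human-readable names for each task code
task_names <- c(
  nc = 'Noun Comprehension',
  np = 'Noun Production',
  pc = 'Predicate Comprehension',
  pp = 'Predicate Production'
)

# Replace the codes in the task column with the readable names
descript$task <- descript$task %>% recode(!!!task_names)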

Explanation of commands:

We’re using the recode command that we’ve previously used in the Cleaning up questionnaire data worksheet:

  1. We start by telling R what each of the codes, nc etc., mean. So, for example nc = 'Noun Comprehension'. We combine the four ‘translations’ together into task_names using c() (short for ‘concatenate’, i.e. put things together).

  2. We then take the task column of the descript data frame (descript$task) and pipe (%>%) it to recode, which uses task_names to do the recoding. We write (<-) that result back into descript$task.
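If the !!! in front of task_names looks mysterious, it may help to see that it simply hands each named element of the vector to recode as a separate argument. The sketch below compares the two ways of writing the call; task_codes is a made-up vector of codes used only for illustration.

```r
library(dplyr)

task_codes <- c('nc', 'pp', 'np')   # made-up codes to translate

task_names <- c(
  nc = 'Noun Comprehension',
  np = 'Noun Production',
  pc = 'Predicate Comprehension',
  pp = 'Predicate Production'
)

# Using !!! to pass every name = value pair in task_names at once
spliced <- recode(task_codes, !!!task_names)

# Writing the same pairs out by hand
written_out <- recode(task_codes,
                      nc = 'Noun Comprehension',
                      np = 'Noun Production',
                      pc = 'Predicate Comprehension',
                      pp = 'Predicate Production')

identical(spliced, written_out)   # do the two calls give the same result?
```

Keeping the translations in one named vector means you only have to edit them in one place if a label changes.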

Our table now looks like this:

task gender mean sd
Noun Comprehension female 17.64 2.203
Noun Comprehension male 16.29 4.231
Noun Production female 12.09 3.239
Noun Production male 10.43 3.952
Predicate Comprehension female 16.2 1.814
Predicate Comprehension male 14.83 1.602
Predicate Production female 8.4 1.955
Predicate Production male 8.5 2.345

APA-format tables

Our table is now clear and easy to read. We could include it in a report without much further effort, and the reader would be able to easily see what we wanted to show them. However, it is not quite in the format that psychologists are most familiar with (which is APA format). In APA format, the table would look more like this:

Task Female (M) Female (SD) Male (M) Male (SD)
Noun Comprehension 17.64 2.2 16.29 4.23
Noun Production 12.09 3.24 10.43 3.95
Predicate Comprehension 16.2 1.81 14.83 1.6
Predicate Production 8.4 1.96 8.5 2.35

In other words, it would be wider: more columns and fewer rows.

We can widen the table, using the pivot_wider command we have previously used in the Within-subject differences worksheet.

Enter these commands into your script, and run them:

# Widen table
descript_table <- descript %>%
  pivot_wider(names_from = gender, values_from = c(mean, sd)) 

Our table now has the same format as an APA table…

task mean_female mean_male sd_female sd_male
Noun Comprehension 17.64 16.29 2.203 4.231
Noun Production 12.09 10.43 3.239 3.952
Predicate Comprehension 16.2 14.83 1.814 1.602
Predicate Production 8.4 8.5 1.955 2.345

…but the columns are in a different order. APA format dictates that means should be placed next to their associated standard deviations in a table (APA format is weirdly specific). Fortunately, we can rearrange columns using the select command that we’ve come across before.

Enter this command into your script, and run it:

# Re-order columns
descript_table <- descript_table %>% select(task, mean_female, sd_female, mean_male, sd_male)

task mean_female sd_female mean_male sd_male
Noun Comprehension 17.64 2.203 16.29 4.231
Noun Production 12.09 3.239 10.43 3.952
Predicate Comprehension 16.2 1.814 14.83 1.602
Predicate Production 8.4 1.955 8.5 2.345
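As an alternative to re-ordering with select, recent versions of tidyr (1.2.0 and later, as far as we are aware) let pivot_wider() place each mean next to its standard deviation through the names_vary argument. This is a sketch on made-up numbers; toy_descript and toy_table are our own names, and you should check that your installed tidyr supports names_vary before relying on it.

```r
library(dplyr)
library(tidyr)

# Made-up descriptive statistics in long format
toy_descript <- data.frame(
  task   = c('nc', 'nc', 'np', 'np'),
  gender = c('female', 'male', 'female', 'male'),
  mean   = c(17.6, 16.3, 12.1, 10.4),
  sd     = c(2.2, 4.2, 3.2, 4.0)
)

# names_vary = 'slowest' keeps the mean and sd for each gender side by side
toy_table <- toy_descript %>%
  pivot_wider(names_from = gender,
              values_from = c(mean, sd),
              names_vary = 'slowest')
colnames(toy_table)
```

The select approach shown above has the advantage of working on any version of tidyr, and of making the intended column order explicit in your script.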

Finally, we can replace the column names with something a bit more human readable, using the colnames function.

Enter this command into your script, and run it:

# Column names
colnames(descript_table) <- c("Task", "Female (M)", "Female (SD)", "Male (M)", "Male (SD)")

Task Female (M) Female (SD) Male (M) Male (SD)
Noun Comprehension 17.64 2.203 16.29 4.231
Noun Production 12.09 3.239 10.43 3.952
Predicate Comprehension 16.2 1.814 14.83 1.602
Predicate Production 8.4 1.955 8.5 2.345

Note that it would arguably be clearer to write “mean” rather than “M”, but it’s another quirk of APA style that we write “M” to stand for mean.

Copying into your word processor

There are a number of different ways to get a table in R into your word processor. We’re going to use the kableExtra package, because it’s really flexible, so it’s capable of producing almost any table you might need. We’re only going to use it in the most basic way here; for some other examples of what it can do, see the kableExtra website.

To get a version of descript_table that you can cut-and-paste into your word processor, enter these commands into your script, and run them:


library(kableExtra)

Attaching package: 'kableExtra'
The following object is masked from 'package:dplyr':

descript_table %>% kable(digits=2) %>%  kable_styling()

Explanation of commands:

  1. library(kableExtra) loads the kableExtra package.
  2. We pipe our data into kable(). The digits=2 part ensures that every number is reported to two decimal places.
  3. We then pipe kable() into kable_styling(). This command prints the table to the Viewer window in RStudio.
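kable() can also set the column headings and a caption in the same step, which may save some hand-editing later. The sketch below uses a made-up two-row table (toy_table and its numbers are our own); col.names and caption are arguments of kable(), and full_width is an argument of kable_styling().

```r
library(dplyr)
library(kableExtra)

# Made-up summary table with unrounded values
toy_table <- data.frame(
  task        = c('Noun Comprehension', 'Noun Production'),
  mean_female = c(17.636, 12.091),
  sd_female   = c(2.203, 3.239)
)

toy_html <- toy_table %>%
  kable(digits = 2,                                           # round to 2 decimal places
        col.names = c('Task', 'Female (M)', 'Female (SD)'),   # headings shown in the table
        caption = 'Made-up descriptive statistics') %>%
  kable_styling(full_width = FALSE)                           # do not stretch across the page
toy_html
```

You will probably still need to adjust the caption and styling in your word processor to match APA format.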

Explanation of output:

Try copying the table into your word processor now. In the Viewer pane, select all of the rows and columns in the table, then right-click and select Copy. Open your word processor and select Paste. (For this to work on a Mac, you will need to be working with RStudio in Chrome rather than Safari.)

Exercise 2

Starting with the data in task_by_subj, generate a table of descriptive statistics showing task accuracy for the Italian and English cards. It should look like this:

Task English (M) English (SD) Italian (M) Italian (SD)
Noun Comprehension 17.89 2.47 16.33 3.61
Noun Production 11.89 3.59 11.00 3.61
Predicate Comprehension 15.12 1.64 16.25 1.91
Predicate Production 8.75 1.04 8.12 2.75

Copy the R code you used for this exercise into PsycEL.

R Markdown

You can avoid copy-pasting tables (and all other analyses) by writing your reports using R Markdown instead of a word processor. R Markdown is a language for writing documents which include R code. The code is run, and the output is included in the document. R Markdown can be used to produce different types of document (e.g. reports, presentations, web pages), in various formats (e.g. Microsoft Word, PDF, HTML). The Research Methods in R worksheets are written using R Markdown, and although we don’t teach it in these materials, there are other courses which make it easy to learn.

This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.