Compute Summary Statistics in R

Easy

40 mins

Data Manipulation in R

This tutorial shows how to compute statistical summaries in R using the dplyr package.

You will learn how to:

Compute summary statistics for ungrouped data, as well as for data grouped by one or multiple variables. R functions: summarise() and group_by().
Summarise multiple variable columns. R functions:
- summarise_all(): apply summary functions to every column in the data frame.
- summarise_at(): apply summary functions to specific columns selected with a character vector.
- summarise_if(): apply summary functions to columns selected with a predicate function that returns TRUE.

Contents:

Required packages
Demo dataset
Summary statistics of ungrouped data
Summary statistics of grouped data
- Group by one variable
- Group by multiple variables
Summarise multiple variables
- Key R functions
- Summarise variables
Useful statistical summary functions
Summary

Required packages

Load the tidyverse packages, which include dplyr:

library(tidyverse)

Demo dataset

We’ll use R’s built-in iris data set, first converting it into a tibble data frame (tbl_df) for easier data analysis.

my_data <- as_tibble(iris)
my_data

## # A tibble: 150 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## 6          5.4         3.9          1.7         0.4 setosa 
## # ... with 144 more rows

Summary statistics of ungrouped data

Compute the mean of Sepal.Length and Petal.Length as well as the number of observations using the function n():

my_data %>%
  summarise(
    count = n(),
    mean_sep = mean(Sepal.Length, na.rm = TRUE),
    mean_pet = mean(Petal.Length, na.rm = TRUE)
  )

## # A tibble: 1 x 3
##   count mean_sep mean_pet
##   <int>    <dbl>    <dbl>
## 1   150     5.84     3.76

Note that we used the additional argument na.rm = TRUE to remove missing values (NA) before computing the means.
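To see why na.rm matters, here is a minimal sketch on a small made-up vector. The vector x is hypothetical and only for illustration; iris itself contains no missing values, so na.rm does not change the results above.

```r
# Hypothetical vector with one missing value (not part of iris)
x <- c(1, 2, NA, 4)

# Without na.rm, the missing value propagates and the result is NA
mean_with_na <- mean(x)

# With na.rm = TRUE, NAs are dropped first: (1 + 2 + 4) / 3
mean_without_na <- mean(x, na.rm = TRUE)

mean_with_na
mean_without_na
```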

Summary statistics of grouped data

Key R functions: group_by() and summarise()

Group by one variable

my_data %>%
  group_by(Species) %>%
  summarise(
    count = n(),
    mean_sep = mean(Sepal.Length),
    mean_pet = mean(Petal.Length)
  )

## # A tibble: 3 x 4
##   Species    count mean_sep mean_pet
##   <fct>      <int>    <dbl>    <dbl>
## 1 setosa        50     5.01     1.46
## 2 versicolor    50     5.94     4.26
## 3 virginica     50     6.59     5.55

Note that it’s possible to chain multiple operations using the magrittr forward-pipe operator, %>%. For example, x %>% f is equivalent to f(x).

In the R code above:

First, my_data is passed to the group_by() function.
Next, the output of group_by() is passed to the summarise() function.
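The equivalence between the piped and the nested form can be checked directly. The following is a small sketch; the object names piped and nested are chosen here for illustration only.

```r
library(dplyr)

my_data <- as_tibble(iris)

# Piped form: reads left to right, one step per line
piped <- my_data %>%
  group_by(Species) %>%
  summarise(mean_sep = mean(Sepal.Length))

# Nested form: the same two calls, read from the inside out
nested <- summarise(group_by(my_data, Species), mean_sep = mean(Sepal.Length))

# Both forms should give the same result
all.equal(as.data.frame(piped), as.data.frame(nested))
```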

Group by multiple variables

# ToothGrowth demo data set
head(ToothGrowth)

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

# Summarize
ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(
    n = n(),
    mean = mean(len),
    sd = sd(len)
  )

## # A tibble: 6 x 5
## # Groups:   supp [?]
##   supp   dose     n  mean    sd
##   <fct> <dbl> <int> <dbl> <dbl>
## 1 OJ      0.5    10 13.2   4.46
## 2 OJ      1      10 22.7   3.91
## 3 OJ      2      10 26.1   2.66
## 4 VC      0.5    10  7.98  2.75
## 5 VC      1      10 16.8   2.52
## 6 VC      2      10 26.1   4.80
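The "# Groups: supp" line in the output indicates that the result is still grouped by supp: summarise() removes only the last grouping variable. The sketch below assumes dplyr 1.0.0 or later, where summarise() has a .groups argument; the object names by_supp and ungrouped are illustrative.

```r
library(dplyr)

# Default behaviour: only the last grouping variable (dose) is dropped
by_supp <- ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(mean = mean(len))
group_vars(by_supp)

# .groups = "drop" returns an ungrouped tibble (assumes dplyr >= 1.0.0)
ungrouped <- ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(mean = mean(len), .groups = "drop")
group_vars(ungrouped)
```

Calling ungroup() on the result is an alternative that also works in older dplyr versions.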

Summarise multiple variables

Key R functions

The functions summarise_all(), summarise_at() and summarise_if() can be used to summarise multiple columns at once.

The simplified formats are as follows:

summarise_all(.tbl, .funs, ...)
summarise_if(.tbl, .predicate, .funs, ...)
summarise_at(.tbl, .vars, .funs, ...)

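.tbl: a tbl data frame.
.vars: the columns to summarise, given as a character vector of column names or generated by vars() (summarise_at() only).
.funs: a function, a character vector of function names, or a list of functions. Older dplyr versions also accepted function calls generated by funs(), which has been deprecated since dplyr 0.8.0.
...: additional arguments for the function calls in .funs.
.predicate: a predicate function to be applied to the columns, or a logical vector. The variables for which .predicate is or returns TRUE are selected (summarise_if() only).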
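Passing a named list of functions to .funs applies each function to every selected column. This is a minimal sketch, assuming dplyr 0.8.0 or later (where plain lists are accepted); the object name multi is illustrative.

```r
library(dplyr)

multi <- as_tibble(iris) %>%
  group_by(Species) %>%
  summarise_at(
    c("Sepal.Length", "Sepal.Width"),
    list(mean = mean, sd = sd),  # list names are used as column name suffixes
    na.rm = TRUE                 # extra argument passed on to mean() and sd()
  )

# One column per variable/function pair, e.g. Sepal.Length_mean
names(multi)
```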

Summarise variables

Summarise all variables - compute the mean of all variables:

my_data %>%
  group_by(Species) %>%
  summarise_all(mean)

## # A tibble: 3 x 5
##   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
##   <fct>             <dbl>       <dbl>        <dbl>       <dbl>
## 1 setosa             5.01        3.43         1.46       0.246
## 2 versicolor         5.94        2.77         4.26       1.33 
## 3 virginica          6.59        2.97         5.55       2.03

Summarise specific variables selected with a character vector:

my_data %>%
  group_by(Species) %>%
  summarise_at(c("Sepal.Length", "Sepal.Width"), mean, na.rm = TRUE)

Summarise specific variables selected with a predicate function:

my_data %>%
  group_by(Species) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE)
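In dplyr 1.0.0 and later, the package documentation describes summarise_all(), summarise_at() and summarise_if() as superseded by across(). The three examples above could be written as follows; this is a sketch assuming a recent dplyr version, and the object names are illustrative.

```r
library(dplyr)

my_data <- as_tibble(iris)

# Counterpart of summarise_all(mean): every non-grouping column
all_cols <- my_data %>%
  group_by(Species) %>%
  summarise(across(everything(), mean))

# Counterpart of summarise_at(): columns named explicitly
some_cols <- my_data %>%
  group_by(Species) %>%
  summarise(across(c(Sepal.Length, Sepal.Width), ~ mean(.x, na.rm = TRUE)))

# Counterpart of summarise_if(): columns selected by a predicate
numeric_cols <- my_data %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

all_cols
```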

Useful statistical summary functions

This section presents some R functions for computing statistical summaries.

Measure of location:

mean(x): sum of x divided by the length
median(x): 50% of x is above and 50% is below

Measure of variation:

sd(x): standard deviation
IQR(x): interquartile range (robust equivalent of sd when outliers are present in the data)
mad(x): median absolute deviation (robust equivalent of sd when outliers are present in the data)

Measure of rank:

min(x): minimum value of x
max(x): maximum value of x
quantile(x, 0.25): 25% of x is below this value
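The location, variation and rank functions listed above can be combined in a single summarise() call. The following sketch uses Sepal.Length from iris; the column names (mean_len, iqr_len, and so on) are chosen here for illustration.

```r
library(dplyr)

spread <- as_tibble(iris) %>%
  group_by(Species) %>%
  summarise(
    mean_len   = mean(Sepal.Length),            # location
    median_len = median(Sepal.Length),
    sd_len     = sd(Sepal.Length),              # variation
    iqr_len    = IQR(Sepal.Length),
    mad_len    = mad(Sepal.Length),
    min_len    = min(Sepal.Length),             # rank
    q25_len    = quantile(Sepal.Length, 0.25),
    max_len    = max(Sepal.Length)
  )

spread
```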

Measure of position:

first(x): equivalent to x[1]
nth(x, 2): equivalent to n<-2; x[n]
last(x): equivalent to x[length(x)]

Counts:

n(): the number of observations in the current group (takes no argument)
sum(!is.na(x)): count non-missing values
n_distinct(x): count the number of unique values

Counts and proportions of logical values:

sum(x > 10): count the number of elements where x > 10
mean(y == 0): proportion of elements where y = 0
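The position, count and proportion helpers can be tried on the ToothGrowth data used earlier. This is an illustrative sketch: the threshold of 20 and the column names are arbitrary choices, not part of the original examples.

```r
library(dplyr)

counts <- ToothGrowth %>%
  group_by(supp) %>%
  summarise(
    n_obs        = n(),               # rows in the group; n() takes no argument
    n_non_na     = sum(!is.na(len)),  # non-missing values of len
    n_doses      = n_distinct(dose),  # number of distinct dose levels
    first_len    = first(len),        # position helpers
    second_len   = nth(len, 2),
    last_len     = last(len),
    n_over_20    = sum(len > 20),     # count of TRUE values
    prop_over_20 = mean(len > 20)     # proportion of TRUE values
  )

counts
```

The last two columns rely on logical values being treated as 0 and 1, so sum() counts the TRUE values and mean() gives their proportion.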

Summary

In this tutorial, we described how to compute statistical summaries using the R functions summarise() and group_by() from the dplyr package.

