This article describes how to compute summary statistics, such as the mean, standard deviation (sd), and quantiles, across multiple numeric columns.
Key R functions and packages
The dplyr package (version >= 1.0.0) is required. We'll use the function across() to apply the same computation to multiple columns.
Usage:
across(.cols = everything(), .fns = NULL, ..., .names = NULL)
- .cols: Columns you want to operate on. You can pick columns by position, name, function of name, type, or any combination thereof using Boolean operators.
- .fns: Function or list of functions to apply to each column.
- ...: Additional arguments for the function calls in .fns. Note that as of dplyr 1.1.0 this argument is deprecated; supply extra arguments inside an anonymous function instead, for example ~ mean(.x, na.rm = TRUE).
- .names: A glue specification that describes how to name the output columns. This can use {col} to stand for the selected column name, and {fn} to stand for the name of the function being applied. The default (NULL) is equivalent to "{col}" for the single function case and "{col}_{fn}" for the case where a list is used for .fns.
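As a minimal sketch of how .cols and .names work together, the example below selects columns by name prefix and by type, combined with the Boolean operator &, and places the function name before the column name in the output. The list name avg, the object name res, and the "{fn}.{col}" pattern are illustrative choices, not part of the original article.

```r
library(dplyr)

df <- as_tibble(iris)

# Select columns whose name starts with "Sepal" AND that are numeric,
# then name each output column as <function name>.<column name>
res <- df %>%
  summarise(across(
    .cols = starts_with("Sepal") & where(is.numeric),
    .fns = list(avg = ~ mean(.x, na.rm = TRUE)),
    .names = "{fn}.{col}"
  ))

names(res)
## [1] "avg.Sepal.Length" "avg.Sepal.Width"
```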
# Load required R packages
library(dplyr)
# Data preparation
df <- as_tibble(iris)
head(df)
## # A tibble: 6 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
# Compute the mean of multiple columns
# (na.rm is set inside a lambda; passing it through ... is deprecated as of dplyr 1.1.0)
df %>%
  group_by(Species) %>%
  summarise(across(Sepal.Length:Petal.Length, ~ mean(.x, na.rm = TRUE)))
## # A tibble: 3 x 4
## Species Sepal.Length Sepal.Width Petal.Length
## * <fct> <dbl> <dbl> <dbl>
## 1 setosa 5.01 3.43 1.46
## 2 versicolor 5.94 2.77 4.26
## 3 virginica 6.59 2.97 5.55
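To make clear what across() does with a column range, the sketch below compares the grouped mean computed with across() against the same summary written out one column at a time. The object names with_across and by_hand are illustrative; the comparison converts both results to plain data frames before calling all.equal().

```r
library(dplyr)

df <- as_tibble(iris)

# Column-wise summary using across()
with_across <- df %>%
  group_by(Species) %>%
  summarise(across(Sepal.Length:Petal.Length, ~ mean(.x, na.rm = TRUE)))

# The same summary written by hand, one expression per column
by_hand <- df %>%
  group_by(Species) %>%
  summarise(
    Sepal.Length = mean(Sepal.Length, na.rm = TRUE),
    Sepal.Width = mean(Sepal.Width, na.rm = TRUE),
    Petal.Length = mean(Petal.Length, na.rm = TRUE)
  )

# Check that both approaches give the same values
all.equal(as.data.frame(with_across), as.data.frame(by_hand))
```

The across() version scales better: adding a column to the range does not require another summarise expression.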
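# Compute the mean and the sd of all numeric columns
# (where() wraps the type predicate; na.rm is set inside each lambda)
df %>%
  group_by(Species) %>%
  summarise(across(
    .cols = where(is.numeric),
    .fns = list(Mean = ~ mean(.x, na.rm = TRUE), SD = ~ sd(.x, na.rm = TRUE)),
    .names = "{col}_{fn}"
  ))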
## # A tibble: 3 x 9
## Species Sepal.Length_Me… Sepal.Length_SD Sepal.Width_Mean Sepal.Width_SD Petal.Length_Me… Petal.Length_SD
## * <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 5.01 0.352 3.43 0.379 1.46 0.174
## 2 versic… 5.94 0.516 2.77 0.314 4.26 0.470
## 3 virgin… 6.59 0.636 2.97 0.322 5.55 0.552
## # … with 2 more variables: Petal.Width_Mean <dbl>, Petal.Width_SD <dbl>
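The same pattern extends to quantiles. The following is a minimal sketch, assuming you want the first quartile, median, and third quartile of every numeric column by group; the names Q1, Median, Q3, and quartiles are illustrative. Each lambda returns a single value per group, and names = FALSE stops quantile() from attaching a label such as "25%" to the result.

```r
library(dplyr)

df <- as_tibble(iris)

# Compute quartiles of all numeric columns by group
quartiles <- df %>%
  group_by(Species) %>%
  summarise(across(
    .cols = where(is.numeric),
    .fns = list(
      Q1 = ~ quantile(.x, probs = 0.25, na.rm = TRUE, names = FALSE),
      Median = ~ median(.x, na.rm = TRUE),
      Q3 = ~ quantile(.x, probs = 0.75, na.rm = TRUE, names = FALSE)
    ),
    .names = "{col}_{fn}"
  ))

# One row per species; one column per (variable, statistic) pair plus Species
dim(quartiles)
```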