Display Beautiful Summary Statistics in R Using the skimr Package

25 Jan

This article describes how to quickly display summary statistics using the R package skimr.

skimr handles different data types and returns a skim_df object, which can be used in a tidyverse pipeline or printed as a readable summary.

Key features of skimr:

Provides a larger set of statistics than the base R function summary(), including the number of missing values (n_missing), the completeness rate (complete_rate), and the standard deviation (sd).
Reports each data type separately.
Handles dates, logicals, and a variety of other types.
Supports inline spark bars and spark lines.
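
To illustrate how each data type is reported separately, here is a minimal sketch on a small made-up data frame. The data frame df and its columns are hypothetical and exist only for this example; the column names skim_type and n_missing assume skimr 2.x.

```r
library(skimr)

# Hypothetical toy data with four column types and one missing value each
df <- data.frame(
  id     = c("a", "b", "c", NA),
  joined = as.Date(c("2021-01-25", "2021-02-01", NA, "2021-03-15")),
  active = c(TRUE, FALSE, TRUE, NA),
  score  = c(1.5, 2.5, NA, 4.0),
  stringsAsFactors = FALSE
)

skimmed <- skim(df)

# Printing shows one table per variable type
skimmed

# The underlying object has one row per column of df;
# skim_type records which set of statistics was applied
table(skimmed$skim_type)
skimmed$n_missing
```

Each type gets its own set of statistics, so the date column is summarized differently from the numeric one.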

Contents:

Prerequisite
Summarize a whole dataset
Select specific columns to summarize
Handle grouped data
Specify your own statistics and classes

Prerequisite

Install the stable version from CRAN:

install.packages("skimr")

Load the package:

library(skimr)
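
The output shown in this article (columns such as n_missing and complete_rate) and the sfl() helper used later belong to the 2.x series of skimr. A small sketch, not part of the original article, that installs the package only when it is missing and checks the version:

```r
# Install only when the package is not already available
if (!requireNamespace("skimr", quietly = TRUE)) {
  install.packages("skimr")
}
library(skimr)

# TRUE when the installed version belongs to the 2.x series or later
packageVersion("skimr") >= "2.0.0"
```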

Summarize a whole dataset

skim(iris)

Data summary
Name	iris
Number of rows	150
Number of columns	5
_______________________
Column type frequency:
factor	1
numeric	4
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
Species	0	1	FALSE	3	set: 50, ver: 50, vir: 50

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Sepal.Length	0	1	5.84	0.83	4.3	5.1	5.80	6.4	7.9	▆▇▇▅▂
Sepal.Width	0	1	3.06	0.44	2.0	2.8	3.00	3.3	4.4	▁▆▇▂▁
Petal.Length	0	1	3.76	1.77	1.0	1.6	4.35	5.1	6.9	▇▁▆▇▂
Petal.Width	0	1	1.20	0.76	0.1	0.3	1.30	1.8	2.5	▇▁▇▅▃
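
Because skim() returns a skim_df, the result can be processed like any other data frame. The sketch below assumes skimr 2.x, where statistics are stored in columns prefixed by their type (for example numeric.sd), and that dplyr is installed. The threshold of 0.8 is arbitrary and chosen only for illustration.

```r
library(skimr)

skimmed <- skim(iris)

# yank() extracts the summary table for a single variable type
numeric_stats <- yank(skimmed, "numeric")
numeric_stats

# Keep only variables whose standard deviation exceeds 0.8;
# rows for non-numeric variables have NA in numeric.sd and are dropped
high_sd <- dplyr::filter(skimmed, numeric.sd > 0.8)
high_sd$skim_variable
```

Here the filter keeps Sepal.Length and Petal.Length, the two variables with the largest standard deviations in the table above.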

Select specific columns to summarize

skim(iris, Sepal.Length, Petal.Length)

Data summary
Name	iris
Number of rows	150
Number of columns	5
_______________________
Column type frequency:
numeric	2
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Sepal.Length	0	1	5.84	0.83	4.3	5.1	5.80	6.4	7.9	▆▇▇▅▂
Petal.Length	0	1	3.76	1.77	1.0	1.6	4.35	5.1	6.9	▇▁▆▇▂
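
Columns can also be chosen with tidyselect helpers instead of being listed by name. A minimal sketch, assuming dplyr is installed (the helper is called with its namespace so that only skimr needs to be attached):

```r
library(skimr)

# Summarize every column whose name starts with "Petal"
petal <- skim(iris, dplyr::starts_with("Petal"))
petal

petal$skim_variable
```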

Handle grouped data

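skim() respects groups created with dplyr::group_by(). Each statistic is computed per group, and the grouping variable (here Species) appears as a column in the result instead of being summarized itself.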

iris %>% 
  dplyr::group_by(Species) %>% 
  skim()

Data summary
Name	Piped data
Number of rows	150
Number of columns	5
_______________________
Column type frequency:
numeric	4
________________________
Group variables	Species

Variable type: numeric

skim_variable	Species	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Sepal.Length	setosa	0	1	5.01	0.35	4.3	4.80	5.00	5.20	5.8	▃▃▇▅▁
Sepal.Length	versicolor	0	1	5.94	0.52	4.9	5.60	5.90	6.30	7.0	▂▇▆▃▃
Sepal.Length	virginica	0	1	6.59	0.64	4.9	6.23	6.50	6.90	7.9	▁▃▇▃▂
Sepal.Width	setosa	0	1	3.43	0.38	2.3	3.20	3.40	3.68	4.4	▁▃▇▅▂
Sepal.Width	versicolor	0	1	2.77	0.31	2.0	2.52	2.80	3.00	3.4	▁▅▆▇▂
Sepal.Width	virginica	0	1	2.97	0.32	2.2	2.80	3.00	3.18	3.8	▂▆▇▅▁
Petal.Length	setosa	0	1	1.46	0.17	1.0	1.40	1.50	1.58	1.9	▁▃▇▃▁
Petal.Length	versicolor	0	1	4.26	0.47	3.0	4.00	4.35	4.60	5.1	▂▂▇▇▆
Petal.Length	virginica	0	1	5.55	0.55	4.5	5.10	5.55	5.88	6.9	▃▇▇▃▂
Petal.Width	setosa	0	1	0.25	0.11	0.1	0.20	0.20	0.30	0.6	▇▂▂▁▁
Petal.Width	versicolor	0	1	1.33	0.20	1.0	1.20	1.30	1.50	1.8	▅▇▃▆▁
Petal.Width	virginica	0	1	2.03	0.27	1.4	1.80	2.00	2.30	2.5	▂▇▆▅▇
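
The grouped result keeps the grouping variable as a regular column, which makes it easy to pull out a compact per-group comparison. A minimal sketch, assuming skimr 2.x column names (numeric.mean) and that dplyr is installed:

```r
library(skimr)

# Summarize a single variable within each species
grouped <- skim(dplyr::group_by(iris, Species), Petal.Length)

# Convert to a plain data frame and keep only the group and its mean
means <- as.data.frame(grouped)[, c("Species", "numeric.mean")]
means
```

The means match the Petal.Length rows of the grouped table above (1.46, 4.26, and 5.55 after rounding).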

Specify your own statistics and classes

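You can specify your own statistics by passing sfl() objects (skimr function lists) to skim_with(), which returns a new skimming function. Statistics are defined per class, so this works for any named class found in your data. Setting append = FALSE replaces the default statistics for that class instead of adding to them; the base columns n_missing and complete_rate are still reported.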

my_skim <- skim_with(
  numeric = sfl(iqr = IQR, mad = mad, p99 = ~ quantile(., probs = .99)),
  append = FALSE
)
my_skim(iris, Sepal.Length)

Data summary
Name	iris
Number of rows	150
Number of columns	5
_______________________
Column type frequency:
numeric	1
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	iqr	mad	p99
Sepal.Length	0	1	1.3	1.04	7.7
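
For comparison, here is a sketch of two other ways to adjust the defaults, assuming skimr 2.x. Leaving append at its default of TRUE adds a statistic to the built-in ones, and setting a statistic to NULL removes it. The names with_iqr and no_hist are hypothetical and used only in this example.

```r
library(skimr)

# Add the interquartile range to the default numeric statistics
with_iqr <- skim_with(numeric = sfl(iqr = IQR))

# Drop the inline histogram from the default numeric statistics
no_hist <- skim_with(numeric = sfl(hist = NULL))

a <- with_iqr(iris, Sepal.Length)
b <- no_hist(iris, Sepal.Length)

# Statistic columns are prefixed with the variable type
names(a)
names(b)
```

Note that IQR() stops with an error if the column contains missing values; for such data a formula like ~ IQR(., na.rm = TRUE) can be used inside sfl() instead.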
