Data Preparation and R Packages for Cluster Analysis


In this chapter, we start by presenting the data format and preparation for cluster analysis. Next, we introduce two main R packages - cluster and factoextra - for computing and visualizing clusters.

Related Book

Practical Guide to Cluster Analysis

Data preparation

To perform a cluster analysis in R, the data should generally be prepared as follows:

Rows are observations (individuals) and columns are variables
Any missing value in the data must be removed or estimated.
The data must be standardized (i.e., scaled) to make variables comparable. Recall that standardization consists of transforming the variables so that they have a mean of zero and a standard deviation of one. Read more about data standardization in the chapter on clustering distance measures.

Here, we’ll use the built-in R data set “USArrests”, which contains statistics on arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. It also includes the percentage of the population living in urban areas.

data("USArrests")  # Load the data set
df <- USArrests    # Use df as shorter name

To remove any missing value that might be present in the data, type this:

df <- na.omit(df)
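Before dropping rows, it can help to see how many values are actually missing. The sketch below uses only base R; the names `n_missing`, `df_complete`, and `n_dropped` are illustrative and do not appear in the original text.

```r
# Count missing values per variable in the raw data
data("USArrests")
n_missing <- colSums(is.na(USArrests))
print(n_missing)

# na.omit() drops every row that contains at least one NA
df_complete <- na.omit(USArrests)

# Compare row counts before and after to see how many rows were dropped
n_dropped <- nrow(USArrests) - nrow(df_complete)
print(n_dropped)
```

If a large share of rows would be dropped, estimating the missing values may be preferable to removing them.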

Because we don’t want the clustering algorithm to depend on an arbitrary variable unit, we start by scaling/standardizing the data using the R function scale():

df <- scale(df)
head(df, n = 3)

##         Murder Assault UrbanPop     Rape
## Alabama 1.2426   0.783   -0.521 -0.00342
## Alaska  0.5079   1.107   -1.212  2.48420
## Arizona 0.0716   1.479    0.999  1.04288
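As a quick sanity check, the scaled columns should have a mean of (numerically) zero and a standard deviation of one. The following base R sketch also reproduces scale() by hand; `x`, `df_scaled`, and `df_manual` are illustrative names.

```r
data("USArrests")
x <- as.matrix(USArrests)
df_scaled <- scale(x)

# Column means should be ~0 and column standard deviations should be 1
print(round(colMeans(df_scaled), 10))
print(apply(df_scaled, 2, sd))

# Same transformation written out:
# subtract each column mean, then divide by each column standard deviation
df_manual <- sweep(x, 2, colMeans(x), "-")
df_manual <- sweep(df_manual, 2, apply(x, 2, sd), "/")

# Compare values only (scale() attaches extra attributes to its result)
print(all.equal(df_manual, df_scaled, check.attributes = FALSE))
```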

Required R Packages

In this book, we’ll mainly use the following R packages:

cluster for computing clustering algorithms, and
factoextra for ggplot2-based elegant visualization of clustering results. The official online documentation is available at: https://rpkgs.datanovia.com/factoextra/.

factoextra contains many functions for cluster analysis and visualization, including:

Functions	Description
dist(fviz_dist, get_dist)	Distance Matrix Computation and Visualization
get_clust_tendency	Assessing Clustering Tendency
fviz_nbclust(fviz_gap_stat)	Determining the Optimal Number of Clusters
fviz_dend	Enhanced Visualization of Dendrogram
fviz_cluster	Visualize Clustering Results
fviz_mclust	Visualize Model-based Clustering Results
fviz_silhouette	Visualize Silhouette Information from Clustering
hcut	Computes Hierarchical Clustering and Cut the Tree
hkmeans	Hierarchical k-means clustering
eclust	Visual enhancement of clustering analysis
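As a brief illustration of the first row of the table, the sketch below computes a Euclidean distance matrix with get_dist() and builds a heat map with fviz_dist(). It assumes factoextra is already installed; if it is not, the code falls back to base R's dist() and skips the plot. The object names `d` and `p` are illustrative.

```r
data("USArrests")
df <- scale(USArrests)

if (requireNamespace("factoextra", quietly = TRUE)) {
  # Distance matrix between the rows (states)
  d <- factoextra::get_dist(df, method = "euclidean")
  # ggplot2 heat map of the distance matrix; call print(p) to display it
  p <- factoextra::fviz_dist(d)
} else {
  # Fallback when factoextra is unavailable: stats::dist(), no plot
  d <- dist(df, method = "euclidean")
}

# Distances among the first three states
print(round(as.matrix(d)[1:3, 1:3], 2))
```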

To install the two packages, type this:

install.packages(c("cluster", "factoextra"))
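To confirm that both packages are available before loading them, a base R check along these lines can be used. It only reports the status and attaches what it finds; it does not install anything. `pkgs` and `installed` are illustrative names.

```r
pkgs <- c("cluster", "factoextra")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Attach only the packages that are available
for (pkg in pkgs[installed]) {
  library(pkg, character.only = TRUE)
}
```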

Summary

This chapter describes how to prepare your data for cluster analysis and introduces the essential R packages for computing and visualizing clusters.


Comments ( 3 )

Alexis Idlette-Wilson

02 Dec 2018

This site is awesome. Thank you!

PS

23 Jan 2019

Hi,

I love this site. It really is helping me out in a “Cluster Analysis” project.

I wanted to understand what kinds of techniques should be used to perform clustering on very large datasets (my data set has about 3 million rows). I am stuck because functions like “get_clust_tendency”, and even the kmeans and hclust algorithms, are throwing a “cannot allocate vector of 17000 Gb” error.

Is there a better way to approach this problem with clustering on big datasets?

Julian

12 May 2019

Dear Dr Kassambara,
as many others have already said – thank you very much for this great site! It is indeed a resource I see myself coming back to again and again.
Regarding data preprocessing, I have been wondering how to deal with skewed data – should some form of power transformation be applied to get them into a more “Gaussian” shape, or are different distance metrics better suited than the Euclidean distance, or does it not matter in the end?



Teacher

Alboukadel Kassambara

Role : Founder of Datanovia