DENDROGRAMS: GRAPH SERIES
Hello, Letícia from Minha Estatística here :). Click here to go to the Portuguese version of this post.
As we continue our Graph Series, this week we dive into the expansive world of Dendrograms, a powerful visualization tool for illustrating hierarchical relationships between objects. Dendrograms are commonly used in hierarchical clustering, an unsupervised machine learning method, as well as in supervised algorithms like decision trees. They are also widely applied outside of machine learning, for example in heatmaps and standalone dendrogram visualizations.
Today, I'll lay the foundation for understanding dendrograms, providing the essential background you will need before exploring machine learning methods. This base knowledge will equip you to move forward into advanced applications. All the scripts are available on my GitHub, so you can easily replicate these examples on your own and adjust them as you wish!
With dendrograms, understanding how data or objects are grouped is crucial. This grouping is achieved through agglomerative hierarchical clustering, an unsupervised machine learning technique. Initially, each data point is treated as its own individual cluster. The clustering process begins by merging the two closest clusters (the data vectors with the smallest distance between them), and this iterative process continues until only one cluster remains (a minimal code sketch of this loop appears after the method descriptions below). There are several methods to determine how clusters are merged, and the most commonly used ones (out of a total of 11 methods) include:
- Single linkage (nearest neighbor) method
- Complete linkage (furthest neighbor) method
- Average linkage method (UPGMA)
- Centroid clustering method (UPGMC)
- Median clustering method (WPGMC)
- Ward's method, or increase in sum of squares method
- Weighted average clustering method (WPGMA)
In the Single Linkage method, the distance between two clusters is determined by the closest pair of data points (vectors). At each step of the clustering process, the two clusters with the smallest distance between their closest points are merged.
\[ D(A,B) = \min_{a \in A,\, b \in B} d(a,b) \]
In the Complete Linkage method, the distance between two clusters is defined as the maximum distance between any point in one cluster and any point in the other. This approach tends to produce more tightly bound clusters, as it focuses on minimizing the largest distances between clusters during the merging process.
\[ D(A,B) = \max_{a \in A,\, b \in B} d(a,b) \]
In the Average Linkage method, the distance between two clusters is determined by the average distance between all pairs of points, one from each cluster. At each step of the clustering process, the two clusters with the smallest average linkage distance are merged. This method balances the distance between clusters by considering all point pairs rather than just the nearest or farthest ones.
\[ D_{\text{avg}}(A,B) = \frac{1}{m \, n} \sum_{i=1}^{m} \sum_{j=1}^{n} d(a_i, b_j), \qquad a_i \in A, \; b_j \in B \]
where \(m\) and \(n\) are the numbers of points in \(A\) and \(B\). In the Centroid clustering method, also known as the Unweighted Pair Group Method with Centroid (UPGMC), the distance between two clusters is the squared distance between their centroids (mean vectors). This method is ideally applied when working with a matrix of squared distances.
\[ d_{AB} = \| \bar{X}_A - \bar{X}_B \|^2 \]
The Median Linkage method calculates the distance between the medians of the vectors in each cluster. It gives equal weight to clusters of different sizes, unlike the centroid method, which is influenced by the number of vectors in each cluster. This makes it more robust against outliers; at each step, the two clusters with the smallest distance between their medians are merged.
\[ d_{AB} = \left\| \bar{m}_A - \bar{m}_B \right\|^2 \]
Ward's method calculates the distance between two clusters as the increase in the total sum of squares (SSE) caused by merging them. Initially, the sum of squares is zero, as each data point is treated as its own cluster. As clusters are merged, the sum of squares grows, and Ward's method keeps this increase as small as possible at each step.
\[ \Delta(A, B) = \frac{n_A \cdot n_B}{n_A + n_B} \cdot \| \tilde{m}_A - \tilde{m}_B \|^2 \]
where \(\tilde{m}_j\) is the center of cluster \(j\), \(n_j\) is the number of points in it, and \(\Delta(A, B)\) is referred to as the merging cost of combining clusters \(A\) and \(B\).
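To see the merging cost in action, here is a small worked case (added purely for illustration): for two singleton clusters \(A = \{a\}\) and \(B = \{b\}\), we have \(n_A = n_B = 1\), so
\[ \Delta(A, B) = \frac{1 \cdot 1}{1 + 1} \, \| a - b \|^2 = \tfrac{1}{2} \, \| a - b \|^2, \]
that is, half the squared distance between the two points.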
The Weighted Pair Group Method with Arithmetic Mean (WPGMA) calculates the distance between two clusters by averaging the distances between their elements. When clusters \(A\) and \(B\) are merged, the distance from the merged cluster to any other cluster \(C\) is the average of the distances from \(C\) to \(A\) and from \(C\) to \(B\), as shown below:
\[ D(C, A \cup B) = \frac{1}{2} \left( D(C, A) + D(C, B) \right) \]
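As promised above, here is a minimal sketch of the agglomerative loop, written in plain Python with single linkage on a toy dataset. This is purely illustrative and separate from the R and Python walkthroughs later in the post; every name in it (euclidean, single_linkage, points) is made up for the example.

import math

def euclidean(p, q):
    # Euclidean distance between two points
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def single_linkage(A, B):
    # D(A, B) = min over all pairs (a, b) of d(a, b)
    return min(euclidean(a, b) for a in A for b in B)

# Toy data: five 2D points; each starts as its own cluster
points = [(1.0, 1.0), (1.5, 1.2), (5.0, 5.0), (5.2, 4.8), (9.0, 1.0)]
clusters = [[p] for p in points]

# Agglomerative loop: repeatedly merge the two closest clusters
while len(clusters) > 1:
    pairs = [(i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))]
    i, j = min(pairs, key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]))
    height = single_linkage(clusters[i], clusters[j])  # dendrogram node height
    print(f"merge {clusters[i]} + {clusters[j]} at height {height:.2f}")
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]

Each printed merge corresponds to one node of the dendrogram, with the reported height as its vertical position.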
After understanding the different linkage methods in hierarchical clustering, the next step is to visualize how the objects or data points are grouped together through a dendrogram. A dendrogram serves as a visual representation of the hierarchical relationships between clusters, illustrating how groups are formed and how similar or different they are from each other during the clustering process. The anatomy of a dendrogram consists of several key components:
- Branches: These represent the relationships between clusters. The longer the branches, the greater the dissimilarity between the clusters being merged.
- Leaves: Located at the bottom of the dendrogram, the leaves correspond to the individual data points or objects. At the beginning of the clustering process, each data point is considered an independent cluster, represented as a leaf in the diagram.
- Nodes: Nodes are the points where two branches meet, signifying the merging of two clusters. The lower a node appears in the dendrogram, the more similar or closely related the clusters being merged.
- Height: The height of the nodes and branches reflects the distance or dissimilarity between clusters. A higher node indicates a larger distance between merged clusters, whereas a lower node suggests that the clusters being merged are more similar.
Dendrograms are useful for examining the structure of the clusters and for deciding how many clusters (or groups) you want to identify. By cutting the dendrogram at a certain height, you define the level of dissimilarity at which the final clusters are formed. These concepts are essential for applying dendrograms to your data according to your goals.
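As a quick preview of the Python tooling used later in this post, here is a short sketch of cutting a dendrogram at a chosen height with SciPy; the random toy data and the cut height of 4.0 are arbitrary choices for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
X = rng.normal(size=(10, 2))        # toy data: 10 points in 2D
Z = linkage(X, method="average")    # build the hierarchy

# "Cut" at height 4.0: each point gets the id of the flat cluster
# it belongs to once all merges above that height are discarded
print(fcluster(Z, t=4.0, criterion="distance"))

# Alternatively, ask directly for (at most) 3 clusters
print(fcluster(Z, t=3, criterion="maxclust"))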
Now that we’ve thoroughly covered the methods for constructing dendrograms and how to interpret them, the next step is to learn how to create them using R and Python, including interactive plots for both.
1) Plot in R
To get started, ensure you have the following libraries installed:
# Load required libraries
library(datasets)
library(tidyr)
library(dplyr)
library(collapsibleTree) # For interactive tree plot
library(htmlwidgets) # Save the plot widget
library(htmltools)
# Load dataset
data("USArrests")
To create the dendrograms, the USArrests dataset was chosen; it contains the rates of three different violent crimes in each U.S. state. We select only the variables for those crimes, excluding the "UrbanPop" variable, so that all variables share the same unit and there is no need to scale the data.
The function dist() calculates the distance matrix between the data points; by default, the distance metric used is "euclidean". Next, hclust() forms the hierarchical clusters using one of the methods mentioned above, followed by the plot function for visualization.
USArrests %>%
  select(Murder, Assault, Rape) %>%
  dist() %>%
  hclust(method = "average") %>%
  as.dendrogram() -> dend

plot(dend, type = "rectangle",
     main = "Dendrogram of Violent Crime Rates by US State (Average Linkage)")
The dendrograms can be displayed in two different styles. The one on the right uses type = "triangle", while the one on the left, which is the default, uses type = "rectangle".
Other possible customizations include adding colors and markers to indicate where each cluster and its leaves begin, or changing the dendrogram to a horizontal layout with horiz = TRUE.
plot(dend,
     nodePar = list(pch = 1, cex = 0.5 * 2:1, col = 2:3),
     horiz = FALSE,
     main = "Dendrogram of Violent Crime Rates by US State (Average Linkage)")
Moving forward to the interactive dendrogram, the same steps for organizing the data will be followed, computing the distance matrix and creating the hierarchical clusters. What changes here is the setup for the cuts or branches that emerge from the root, as well as deciding on the root name immediately after.
dist_matrix <- dist(USArrests)                 # Compute distance matrix
hc <- hclust(dist_matrix, method = "average")  # Average linkage, matching the earlier plots
clusters <- cutree(hc, k = 3)                  # Cut the tree into 3 clusters
dend <- as.dendrogram(hc)
parents <- paste0("Region", clusters)          # Parent node label for each state
After saving the parent nodes and dendrogram as objects, a new data.frame must be created to arrange the labels, parents, and roots. For this example, I chose the "Assault" variable, but the dendrogram for other variables is available in the GitHub repository.
# Keep the original row order so states, clusters, and Assault values stay aligned
hierarchy_df <- data.frame(
  State = rownames(USArrests),
  Parent = parents,             # Each state is a child of its region/cluster node
  Assault = USArrests$Assault
)
Now you can create the object for the interactive plot. It's important to make adjustments for the title and save the plot in HTML to maintain its interactive or collapsible features.
p <- collapsibleTree(
  hierarchy_df,
  root = "Regions",
  hierarchy = c("Parent", "State", "Assault"),
  fill = "black"
)

# Assign the result back to p so the titles are kept when the widget is saved
p <- prependContent(p,
  tags$h2("Interactive Dendrogram of Regional Assault Rates in the USA",
          style = "text-align: center;"),
  tags$h3("(Average Linkage)", style = "text-align: center;")
)
saveWidget(p,
  file = "path_to_your_folder/dendrogram_interactive_ASSLT.html"
) # Save the plot in HTML
The final result for the collapsibleTree looks like this:
Now that we've covered the key functions and aspects of creating a dendrogram in R, we can move on to learning how to create dendrograms in Python, including an interactive one, but using a different approach.
2) Plot in Python
Start by importing the necessary libraries for data manipulation and plot creation, along with the ones for the interactive dendrogram.
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy
import plotly.figure_factory as ff    # For the interactive dendrogram
import scipy.cluster.hierarchy as sch # For the interactive dendrogram
Load the dataset and perform the necessary steps to prepare it as needed. Datasets for every post from now on, along with all scripts used, will be available for download in the GitHub repository.
# Load dataset
USArrests = pd.read_csv('path_to_your_dataset/USArrests.csv', delimiter=',')
del USArrests['UrbanPop']                        # Keep only the three crime variables
USArrests.set_index('Unnamed: 0', inplace=True)  # Use the state names as the index
USArrests.index.name = None
print(USArrests)
Start by setting the hierarchical clustering method, creating the "Z" linkage object, and plotting the result with the hierarchy.dendrogram function.
Z = hierarchy.linkage(USArrests, 'average')
hierarchy.dendrogram(Z, leaf_rotation = 90, leaf_font_size = 6,
labels = USArrests.index)
# labels is an n-sized sequence, with n == Z.shape[0] + 1
plt.title('Dendrogram of Violent Crime Rates by US State (Average Linkage)')
plt.show()
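As an optional aside not in the original workflow: with all 50 states as leaves the labels can get crowded, and SciPy's dendrogram function can truncate the plot to the last few merges (this reuses the Z object and the imports from the snippet above).

# Show only the last 12 merges; contracted branches are marked on the axis
hierarchy.dendrogram(Z, truncate_mode='lastp', p=12, show_contracted=True)
plt.title('Truncated Dendrogram (last 12 merges)')
plt.show()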
For the interactive plot, the steps are almost the same, with one caveat: ff.create_dendrogram expects the raw data rather than a precomputed linkage matrix, and the linkage method is passed through its linkagefun argument. (Passing the Z object directly creates a mismatch between its 49 rows and the 50 state labels, which is why a label would otherwise have to be dropped.)
labels = USArrests.index.tolist()  # All 50 state names

# Pass the data itself; linkagefun tells create_dendrogram to use average linkage
fig = ff.create_dendrogram(
    USArrests.values,
    labels = labels,
    linkagefun = lambda x: sch.linkage(x, method = 'average')
)
fig.update_layout(
xaxis = dict(tickangle = -90),
title = "Interactive Dendrogram of Violent Crime Rates by US State"
)
fig.show()
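If you also want to keep the interactive figure as a standalone file, mirroring the saveWidget step from the R section, Plotly figures provide write_html (the path below is a placeholder):

# Save the interactive dendrogram as a self-contained HTML file
fig.write_html('path_to_your_folder/dendrogram_interactive_USArrests.html')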
The result is a dendrogram like this, where you can zoom in and out for better visualization of the desired points.
Conclusion
Dendrograms are powerful tools for visualizing hierarchical clustering, helping us to understand the relationships and similarities between data points. By using R and Python, we can create both static and interactive dendrograms to gain deeper insights into our data. It is important to understand the various methods for hierarchical clustering, such as single linkage, complete linkage, average linkage, and Ward's method, as each has unique characteristics and applications. Mastering these methods and visualization techniques allows us to make more informed decisions when analyzing and interpreting data.
Not only are dendrograms valuable for visualizing hierarchical clustering and understanding the data, but they also play an important role in machine learning methods. They provide a clear way to explore relationships between data points, which is essential for tasks like feature selection, dimensionality reduction, and anomaly detection. In machine learning, hierarchical clustering can help uncover hidden patterns, group similar data points, and assist in making decisions about how to organize datasets for further analysis or model training. By applying dendrograms in these contexts, we can gain deeper insights and enhance the overall performance of machine learning algorithms.
I hope this guide encourages you to use dendrograms for analysis as well as to explore the available methods. Stay tuned for upcoming posts in the Graphics Series, where we'll continue to explore the fascinating world of data and its visual representation!
Follow me on Instagram @minhaestatistica and on GitHub!
Letícia - Minha Estatística.