GRAPH SERIES: CORRELOGRAMS

January 13, 2025


Hello, Letícia from Minha Estatística here :). Click here to go to the post in Portuguese.

This week, we’re back with the Graph Series, and we’re diving into correlation plots — or as they’re called, correlograms. So far, we’ve covered Violin Plots for analyzing data distributions and Circular Barplots for displaying ranked data. Now, it’s time to explore how correlograms can help us understand the relationships between variables.

A variant of the correlogram is the heatmap, which visually represents the correlations when the dataset is in matrix form (i.e., a correlation matrix). These plots use colors to represent the values within the matrix and can also use dendrograms to display hierarchical clustering. Heatmaps accept data in two formats: long or wide. The long format gives a y-axis with every observation and its density (in this case, the correlation matrix isn't the input); in the wide format, the correlation matrix is the input.

Correlograms, much like heatmaps, display the correlation matrix, and they usually don't accept categorical variables, except when a classic correlogram is created using a scatterplot matrix. The correlogram is more flexible than the heatmap, allowing more variation in how it displays data; what both have in common is that they display the correlations within the data for exploratory purposes.

Correlation is a coefficient written as \(\rho(X, Y)\). It is the standardized version of covariance, which measures how two variables vary together. The advantage of correlation is that it doesn't depend on the units of the variables. Its basic formula is:

\[ \rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X) \text{Var}(Y)}}\quad\text{with}\quad-1 \leq \rho(X,Y) \leq 1 \]

Here \(\rho = 1\) indicates a perfect positive correlation, \(\rho = 0\) indicates no linear correlation, and \(\rho = -1\) indicates a perfect negative correlation. Since correlation itself isn't the focus of this post, we'll stick to correlograms and heatmaps and leave the details for a future post about correlations only.
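To make the formula concrete, here is a minimal Python sketch that computes the coefficient directly from the covariance and the variances, and then repeats the calculation after changing the units. The measurements and the helper name `correlation` are made up for illustration; they don't come from the datasets used in this post.

```python
import numpy as np

# Hypothetical measurements, made up for illustration
x_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])  # height in cm
y_kg = np.array([52.0, 58.0, 69.0, 75.0, 88.0])       # weight in kg

def correlation(x, y):
    """Covariance divided by the square root of the product of the variances."""
    cov_xy = np.cov(x, y)[0, 1]   # sample covariance
    var_x = np.var(x, ddof=1)     # sample variance of x
    var_y = np.var(y, ddof=1)     # sample variance of y
    return cov_xy / np.sqrt(var_x * var_y)

rho = correlation(x_cm, y_kg)

# Changing the units (cm -> m, kg -> g) rescales the covariance but not the correlation
rho_rescaled = correlation(x_cm / 100, y_kg * 1000)

print("correlation:", rho, rho_rescaled)
print("covariance: ", np.cov(x_cm, y_kg)[0, 1], np.cov(x_cm / 100, y_kg * 1000)[0, 1])
```

The two correlations agree while the two covariances differ, which is why the correlation is the quantity shown in correlograms.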

As always, I'll show the steps to create both correlogram and heatmap plots in R and Python. For this, I've chosen two datasets: anorexia, which includes one categorical variable (type of treatment) and two numerical variables (weight pre-treatment and post-treatment); and crabs, which includes two categorical variables (sex and species) and five numerical ones describing observed characteristics, such as body depth. So, let's begin.

1) Plot in R

Start by loading the necessary libraries for dataset access (MASS) and plotting (GGally, ggplot2).


# Load required libraries
library(MASS)
library(GGally)
library(ggplot2)

# Load dataset
data("anorexia")

Within the GGally library, the ggpairs function is a great tool for creating the classic correlogram, a scatterplot matrix.


    
ggpairs(anorexia, columns = 1:ncol(anorexia),
  mapping = aes(color = Treat, alpha = 0.5),
  axisLabels = "show",
  upper = list(continuous = wrap("points", alpha = 1), 
               combo = wrap("dot_no_facet", alpha = 1)),              
  lower = list(continuous = wrap("points", alpha = 1), 
               combo = wrap("box_no_facet", alpha = 1)),
  #diag = list(continuous = "densityDiag", 
               #discrete = "barDiag", na = "naDiag"),
  title = "Correlogram of Anorexia Dataset"
  ) + 
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

With this, you can adjust the lower, upper, and diag arguments (diag is left at its default here, so the diagonal is still drawn). When both variables are continuous, the continuous entry applies; it can take values such as "cor", which displays the correlation coefficient for each pair of variables, or "points", which displays a scatterplot of each pair at the corresponding position in the matrix.

Meanwhile, the combo entry, in both the lower and upper positions, controls the panels where a numerical variable meets the categorical one: here, boxplots in the lower triangle and dot plots in the upper triangle. The result is a scatterplot matrix with the observations colored by treatment.

Now, besides the classic one, two other ways to create a correlogram will be shown. For the next steps you'll need the crabs dataset (also from the MASS library), the corrgram package for the first plot, and the GGally package for the second.


# Load dataset
data("crabs")
library(corrgram)
# Remove columns with categorical variables
crabs = crabs[,-c(1:3)]

If the categorical variables aren't removed, they'll be ignored automatically. The panel arguments (lower.panel and upper.panel) accept functions that display the data in different ways, such as: panel.pts, panel.pie, panel.shade, panel.fill, panel.bar, panel.ellipse, panel.conf and panel.cor. There's also the option to display values on the diagonal, where the variable names are, as shown in the code comments.


corrgram(crabs, lower.panel = panel.pie, 
         upper.panel = panel.cor, 
         #diag.panel=panel.minmax,  # Min and max values on diagonal
         col.regions = colorRampPalette(c("darkgreen")),
         order = TRUE, # Order variables based on correlation
         main = "Correlogram of Crabs Dataset",  
         cex.main = 1.5, cex.axis = 1.2, cex.cor = 1.2) # Increase label size

This next correlogram is more like a matrix, showing one triangle of the matrix and the correlation values. The scale can be adjusted, including its midpoint, as well as the geometry (which can also be a circle, for example). This is a much simpler correlogram that keeps its analytical value, working much like a heatmap, except that the input isn't a correlation matrix and doesn't include categorical variables.


# from GGally library
ggcorr(crabs, 
       method = c("everything", "pearson"), # Set correlation method
       label = TRUE,  
       #geom = "circle", min_size = 2, max_size = 15, # Change display to circles
       label_size = 3,  # Adjust label
       label_color = "white",
       low = "white",
       mid = "navy",
       high = "red",
       midpoint = 0.9, # Set the midpoint
       limits = c(0.8, 1), # Plot limits
       legend.position = "right",
       legend.size = 8) + 
  ggtitle("Correlation Matrix of Crabs Dataset") + # Add title
  theme(plot.title = element_text(size = 15, hjust = 0.5)) # Style the title

Notice that, up to this point, the input was kept in its original format. For the heatmap, it has to be converted to a correlation matrix:

  
# For the following plots you'll need these libraries:
library(pheatmap)
library(RColorBrewer) # For color palette 

crabs_matrix = as.matrix(crabs)
cor_matrix <- cor(crabs_matrix)

With the pheatmap function, you can create a complete heatmap with all its features, including dendrograms (which were removed in the second line of code). These plots typically display the correlation values for better readability.

  
    
pheatmap(cor_matrix, display_numbers = TRUE,
         cluster_rows = FALSE, cluster_cols = FALSE,
         color = brewer.pal(5, "Set2"))

These are a few options for displaying and visualizing correlations, demonstrated in R. Now, let's move on to how to create them in Python.

2) Plot in Python

To create the plots in Python, first import the necessary libraries and the dataset.


# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
anorexia = pd.read_csv('path_to_your_dataset/anorexia.csv', delimiter=',')

To create the classic correlogram, use the pairplot function from the seaborn library. The kind argument sets the type of plot drawn off the diagonal, with options 'scatter', 'kde', 'hist', and 'reg'. Similarly, the diag_kind argument controls the type of graph displayed on the diagonal, accepting 'auto', 'hist', 'kde' and None.


sns.pairplot(anorexia, kind='scatter', diag_kind='kde', hue='Treat', palette='Set2')
plt.suptitle('Correlogram of Anorexia Dataset', size=13) 
plt.subplots_adjust(top=0.95)
plt.show()

The argument hue assigns colors to observations based on their categories.

Another way to represent the correlogram, besides its classic form, is as a matrix-like display. For this, load the new dataset and remove categorical variables, since we're working with correlations. The next step is to create a DataFrame with the data's correlations.


# Load dataset
crabs = pd.read_csv('path_to_your_dataset/crabs.csv', delimiter=',')
# Remove the row labels, the index column and the categorical variables (sp, sex)
crabs = crabs.drop(columns=['rownames', 'sp', 'sex', 'index'])
corr_matrix = crabs.corr()

Now, with the DataFrame containing the correlation matrix, make the plot triangular: create a matrix of zeros with the same shape as the correlation matrix, then set the entries at the upper-triangle indices (diagonal included) to True, which is stored as \(1\). The plot is drawn only where the mask contains \(0\)'s. The function for this plot is heatmap. Since we are already working with a correlation matrix, we might as well refer to it as a heatmap.

  
mask = np.zeros_like(corr_matrix) # Matrix of zeros with the same shape as corr_matrix
mask[np.triu_indices_from(mask)] = True # Mark the upper triangle (and diagonal) to be hidden

array([[1., 1., 1., 1., 1.],
       [0., 1., 1., 1., 1.],
       [0., 0., 1., 1., 1.],
       [0., 0., 0., 1., 1.],
       [0., 0., 0., 0., 1.]])
 

sns.heatmap(corr_matrix, mask=mask, square=True, annot=True, fmt='.1f')
plt.yticks([])
plt.title('Correlogram of Crabs Dataset')
plt.show()
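The same masking idea can also be written with a boolean mask. The sketch below is only an illustration: the data are random numbers standing in for the crabs measurements, and the variable names `mask_with_diag` and `mask_keep_diag` are made up here. It also shows how the `k=1` offset of `np.triu` leaves the diagonal visible.

```python
import numpy as np
import pandas as pd

# Random stand-in for the five numerical crabs columns (illustration only)
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(50, 5)), columns=["FL", "RW", "CL", "CW", "BD"])
corr = data.corr()

# True marks the cells that will be hidden
mask_with_diag = np.triu(np.ones(corr.shape, dtype=bool))        # upper triangle and diagonal
mask_keep_diag = np.triu(np.ones(corr.shape, dtype=bool), k=1)   # upper triangle only

print(mask_with_diag.astype(int))
print(mask_keep_diag.astype(int))

# Cells that stay visible with the first mask: the strictly lower triangle
visible = corr.mask(mask_with_diag)
print(visible.round(2))
```

Either mask can be passed to the mask argument of sns.heatmap in the same way as the numeric mask above.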

If you don't want the triangular heatmap, simply call the function on the correlation matrix without the mask argument:

  
sns.heatmap(corr_matrix, cmap='Set2')
plt.title('Correlation Heatmap of Crabs Dataset')
plt.show()
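The dendrograms mentioned for heatmaps come from hierarchical clustering of the variables. As a rough sketch of that step in Python, assuming SciPy is available, the code below turns correlations into distances and reorders the matrix so that strongly correlated variables sit next to each other. The data are synthetic and the column names v1 to v4 are hypothetical; using one minus the absolute correlation as the distance is one common choice, not the only one.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

# Synthetic data: v1 and v3 share a common signal, v2 and v4 are independent noise
rng = np.random.default_rng(1)
signal = rng.normal(size=100)
data = pd.DataFrame({
    "v1": signal + rng.normal(scale=0.1, size=100),
    "v2": rng.normal(size=100),
    "v3": signal + rng.normal(scale=0.1, size=100),
    "v4": rng.normal(size=100),
})
corr = data.corr()

# Strongly correlated variables (positive or negative) get a small distance
dist = 1 - corr.abs()
condensed = squareform(dist.to_numpy(), checks=False)  # condensed form expected by linkage

# Average-linkage hierarchical clustering and the resulting leaf order
Z = linkage(condensed, method="average")
order = leaves_list(Z)

corr_clustered = corr.iloc[order, order]
print(corr_clustered.round(2))
```

The reordered matrix can be passed to sns.heatmap as before. seaborn also provides a clustermap function that draws the heatmap together with its dendrograms.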

Heatmaps can be useful when working with time-series data as well, as demonstrated in the plot of the AirPassengers dataset below:
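A time-series heatmap usually needs the series reshaped into a grid first, for example one row per year and one column per month. Here is a minimal sketch of that reshaping with pandas; the monthly values are synthetic stand-ins (not the real AirPassengers numbers), and the column names `date` and `passengers` are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series standing in for AirPassengers (values are made up)
months = np.arange(36)
dates = pd.date_range("1949-01-01", periods=36, freq="MS")
values = 100 + 2 * months + 10 * np.sin(2 * np.pi * months / 12)
ts = pd.DataFrame({"date": dates, "passengers": values})

# Split the date into the two axes of the heatmap
ts["year"] = ts["date"].dt.year
ts["month"] = ts["date"].dt.month

# Wide grid: one row per year, one column per month
grid = ts.pivot(index="year", columns="month", values="passengers")
print(grid.round(1))
```

Passing `grid` to sns.heatmap then colors each cell by the value for that month and year, so trend and seasonality show up as color gradients.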

These are a few ways to apply correlograms and heatmaps in practice, providing valuable insights into correlations and patterns within the data, whether for exploratory analysis or to reveal relationships between variables!

Conclusion

In this post, we explored the utility and versatility of correlograms and heatmaps in analyzing correlations and visualizing data. These tools are essential for uncovering patterns, identifying relationships, and deepening the understanding of correlations between variables in a dataset. Using libraries in R and Python, we demonstrated how to create different types of correlograms and heatmaps, from the classic scatterplot matrix to matrix-like visualizations.

Whether for exploratory analysis or presenting complex relationships in a visually clear manner, correlograms and heatmaps are powerful additions to your analytical toolkit. By adapting input data and visualization methods to your specific needs, you can reveal new insights and communicate your findings effectively.

We hope this guide inspires you to incorporate these visualizations into your own projects. Stay tuned for more posts in the Graph Series, as we continue to explore the intersection of data and visualization!

We are also on Instagram at @minhaestatistica, and I look forward to seeing you there!

Letícia - Minha Estatística.

References

Dekking, F.M., Kraaikamp, C., Lopuhaä, H.P. & Meester, L.E. (2005). A Modern Introduction to Probability and Statistics: Understanding Why and How. Springer.
R Graph Gallery
Seaborn: Statistical Data Visualization
