GRAPH SERIES: VIOLIN PLOT

Hello, Letícia from Minha Estatística here :). Click here to go to the post in Portuguese.

This is the first series on the blog, and it's all about graphs. Every Monday I'll post about a type of graph that you can make using 1) R or 2) Python (I'll include solutions for both in every post). The point is to learn when and how a certain type of graph can or should be applied, depending on the type of data you have. So let's begin!

The graph that inaugurates this series is the Violin plot. It is used with numeric data to compare the distribution of a variable across different groups. Each violin is a group, and its shape is the kernel density estimate (KDE) of that group's distribution: a smoothed curve estimated from the data rather than a raw count, which keeps the plot readable for larger datasets as well. The amount of smoothing is something that can be adjusted along the way.
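To make the idea of smoothing a bit more concrete, here is a minimal sketch of a Gaussian KDE written with only the Python standard library. It is not what ggplot2 or Seaborn run internally, just an illustration under simple assumptions: the `gaussian_kde` helper, the sample values, and the 0.3 factor are all made up for this example, and the bandwidth follows Scott's rule of thumb.

```python
import math
import statistics


def gaussian_kde(sample, bandwidth):
    """Return a function that estimates the density of `sample` at a point x."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian "bump" per observation, each centred on that observation.
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density


# Hypothetical measurements, made up for illustration only.
sample = [38.1, 39.5, 40.2, 41.0, 41.3, 45.8, 46.5, 47.2]

# Scott's rule of thumb for one-dimensional data: std * n^(-1/5).
scott_bw = statistics.stdev(sample) * len(sample) ** (-1 / 5)

smooth = gaussian_kde(sample, scott_bw)        # rule-of-thumb smoothing
wiggly = gaussian_kde(sample, scott_bw * 0.3)  # less smoothing, more detail

# Evaluate both curves on a grid from 35.0 to 50.0.
# In a violin plot, these density values set the width at each height.
grid = [35 + 0.1 * i for i in range(151)]
smooth_curve = [smooth(x) for x in grid]
wiggly_curve = [wiggly(x) for x in grid]
```

A smaller bandwidth follows the individual observations more closely, so the curve gets bumpier and its peaks get taller; a larger one blurs them into a single smooth shape.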

The plot is wider where the data is denser and narrower where it is sparser, giving a complete view of the distribution. It can be displayed with or without a boxplot inside, and with or without colors and a legend. Next, the plot will be created in R using ggplot2, and then in Python with Seaborn and Matplotlib.

1) Plot in R

    
# Load libraries:
library(ggplot2) 
library(palmerpenguins) # For data
library(tidyr) # To remove NA values
    
  

Note that ggplot2 expects input data to be in a long format, where each row represents a single observation. To create the plot, you need a categorical (factor) variable for the x-axis and a numeric variable for the y-axis, each in its own column.

    
data("penguins", package = "palmerpenguins")
penguins <- drop_na(penguins) # Remove NA
summary(penguins)
    
  

Some adjustments can be made to the default settings, such as the range and scaling of the violins, for improved visualization. By default (trim = TRUE), the tails of each violin are trimmed to the range of the data; setting trim = FALSE leaves them untrimmed, so the tails become longer and extend past the observed minimum and maximum on the y-axis. The scale argument defaults to "area", meaning that all violins have the same area regardless of the number of observations. With scale = "count", the areas are scaled proportionally to the number of observations in each group, while with scale = "width", all violins have the same maximum width.

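Another important aspect of Violin plots, as mentioned before, is the smoothing. It is controlled by adjust, a multiplier applied to the default bandwidth: adjust = 1 (the default) leaves the bandwidth unchanged, smaller values give a more detailed curve, and larger values give a smoother one.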

    
ggplot(penguins, aes(x = species, y = bill_length_mm, fill = species)) +
  geom_violin(trim = FALSE, scale = "width", adjust = 1) +  # Create violin plot
  labs(
    x = "Species", 
    y = "Bill Length (mm)", 
    title = "Distribution of Bill Length by Species"
   ) + 
  scale_fill_discrete(name = "Species") +   # Set legend title 
  theme_minimal() +  # Clean theme for better visualization
  theme(
    plot.title = element_text(hjust = 0.5)  # Center title
  )
  

Adding a boxplot:

    
ggplot(penguins, aes(x = species, y = bill_length_mm, fill = species)) +
  geom_violin(trim = FALSE, scale = "width", adjust = 1) + # Create violin plot
  labs(
    x = "Species", 
    y = "Bill Length (mm)", 
    title = "Distribution of Bill Length by Species"
  ) + 
  scale_fill_discrete(name = "Species") +   # Set legend title 
  geom_boxplot(width = 0.1, # Add boxplot
               show.legend = FALSE) +  # Without legend to avoid overlap
  theme_minimal() +  # Clean theme for better visualization
  theme(
    plot.title = element_text(hjust = 0.5)  # Center title
  )    
  

I believe I have covered the essential concepts and coding required for creating a Violin plot in R. Next, I'll show you how to create the same plot in Python.

2) Plot in Python

Start by importing Seaborn and Matplotlib, plus palmerpenguins for the data:

    
# Imports:
import seaborn as sea
import matplotlib.pyplot as plt
from palmerpenguins import load_penguins # For data
    
  
    
# Load data:
penguins = load_penguins()
penguins = penguins.dropna() # Remove NA
penguins.head()
    
  

In Python, the main arguments for customizing the violins are density_norm and cut. The density_norm argument (its name in seaborn 0.13 and later) accepts 'area', the default, where every violin has the same area regardless of the number of observations in each group; 'count', where the width is proportional to the number of observations; and 'width', where every violin has the same maximum width. To control the violin's length, the cut argument sets how far, in units of bandwidth, the density extends past the most extreme data points along the y-axis (if the plot is vertical); cut = 0 limits the violin to the range of the data.
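To see what the three density_norm options mean for the widths, here is a small sketch that rescales toy density curves by hand. It mirrors the behaviour described in the Seaborn documentation, not necessarily Seaborn's internal code, and the `scale_widths` helper, the curves, and the counts are all hypothetical.

```python
def scale_widths(curves, counts, density_norm="area"):
    """Rescale density curves so that the widest violin has half-width 1.0.

    `curves` maps group name -> list of density values;
    `counts` maps group name -> number of observations in that group.
    """
    if density_norm == "area":
        # One shared divisor for all groups keeps their relative areas intact.
        top = max(max(c) for c in curves.values())
        return {g: [v / top for v in c] for g, c in curves.items()}
    if density_norm == "width":
        # Each curve is divided by its own peak, so all violins are equally wide.
        return {g: [v / max(c) for v in c] for g, c in curves.items()}
    if density_norm == "count":
        # Like "width", then shrunk in proportion to the group's sample size.
        biggest = max(counts.values())
        return {
            g: [v / max(c) * counts[g] / biggest for v in c]
            for g, c in curves.items()
        }
    raise ValueError(f"unknown density_norm: {density_norm!r}")


# Toy curves and group sizes, made up for illustration (not real densities).
curves = {"A": [0.1, 0.4, 0.1], "B": [0.2, 0.2, 0.2]}
counts = {"A": 50, "B": 100}

# Maximum half-width of each violin under each option.
peaks = {
    mode: {g: max(c) for g, c in scale_widths(curves, counts, mode).items()}
    for mode in ("area", "width", "count")
}
```

With these toy numbers, 'width' makes both violins equally wide, 'area' leaves the flatter group B at half the width of A, and 'count' makes the smaller group A half as wide as B.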

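Lastly, the smoothing is controlled by two arguments: bw_method, which sets how the bandwidth is determined (by default 'scott'; a number is used directly as the bandwidth scale factor), and bw_adjust, a multiplier that by default is 1. The Python version I'm currently using is 3.11.19.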
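Seaborn's documentation expresses cut in units of the KDE bandwidth, so the tail length depends on the smoothing as well. The sketch below estimates how far a violin would reach under that assumption. The `violin_extent` helper and the values are hypothetical, and the bandwidth is taken as the scale factor times the sample standard deviation, with Scott's rule as the default factor.

```python
import statistics


def violin_extent(values, cut=2, bw_factor=None):
    """Return (low, high): the range over which the density curve is drawn."""
    n = len(values)
    if bw_factor is None:
        bw_factor = n ** (-1 / 5)  # Scott's rule for one-dimensional data
    bandwidth = bw_factor * statistics.stdev(values)  # in data units
    return min(values) - cut * bandwidth, max(values) + cut * bandwidth


# Hypothetical bill lengths in mm, made up for illustration.
values = [36.0, 38.0, 40.0, 42.0, 44.0]

trimmed = violin_extent(values, cut=0)  # stops exactly at the data range
default = violin_extent(values, cut=2)  # two bandwidths past each extreme
longer = violin_extent(values, cut=3)   # three bandwidths past each extreme
```

So a larger cut or a larger bandwidth both push the tails further past the observed minimum and maximum, while cut = 0 keeps the violin inside the data range.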

    
sea.violinplot(
    data = penguins, 
    x = 'species', 
    y = 'bill_length_mm', 
    hue = 'species',
    inner = "box",  # Add the box plot inside the violins 
    inner_kws = dict(box_width = 15, whis_width = 2, color = "0.8"),  # Customization
    density_norm = 'width',  # Scale violins by width
    cut = 3,  # Length of tail, in bandwidths past the extreme values
    bw_method = 1,  # Bandwidth scale factor for KDE smoothing
    palette="Set1" 
    )
plt.xlabel('Species') 
plt.ylabel('Bill Length (mm)') 
plt.title('Distribution of Bill Length by Species')
plt.show()

  

The plot can also be created using only Matplotlib, which is handy when the data is already a list or array: plt.violinplot expects a sequence of arrays (one per violin) rather than a DataFrame with a grouping column. Matplotlib is a good choice when little customization is needed; it has its own bw_method argument, but no direct equivalents of Seaborn's density_norm or cut:

    
# Prepare the data:
species = penguins['species'].unique()
data = [] 
for sp in species:  # Group bill_length_mm by species
    data.append(penguins[penguins['species'] == sp]['bill_length_mm'].values) 
type(data) # The data should be a list or array

  
    
plt.violinplot(
    data,
    showmedians=True
    )
plt.xticks(range(1, len(species) + 1), species)  # Label each violin in the same order the data was grouped
plt.xlabel('Species')
plt.ylabel('Bill Length (mm)')
plt.title('Distribution of Bill Length by Species')
plt.show()

  

Conclusion

In this post, we explored how to visualize data distributions using Violin plots in both R and Python. We generated the plots with Seaborn and Matplotlib in Python, and with ggplot2 in R.

In Python, Matplotlib is great for quickly generating violin plots with minimal customization, while Seaborn provides more control for creating detailed, customized visualizations. In R, ggplot2 offers equally effective and flexible plotting capabilities.

Now you have a clear picture of how to create Violin plots in both R and Python, so you can choose the tool that best suits you and your data visualization needs.

I hope you enjoyed this post, as it’s just the beginning of our Graph Series!

Thank you for being here, and feel free to comment with questions or suggestions! Stay tuned for more insights next week.

Letícia - Minha Estatística.
