GRAPH SERIES: VIOLIN PLOT
Hello, Letícia from Minha Estatística here :). Click here to read this post in Portuguese.
This is the first series on the blog, and it's all about graphs. Every Monday I'll post about one type of graph that you can make using 1) R or 2) Python (every post will include solutions for both). The point is to learn when and how a certain type of graph can or should be applied, depending on the type of data you have. So let's begin!
The graph to inaugurate this series is the Violin plot. It is used with numeric data to compare the distribution of a variable across different groups, or across the levels of a category they share. Each violin is a group, and its outline is the kernel density estimate (KDE) of that group's distribution, meaning the raw data is smoothed into a continuous curve. This keeps the plot readable for larger datasets as well. How much smoothing to apply is something that can be adjusted along the way.
The violin is wider where the data is more frequent (denser) and narrower where it is sparse, giving a complete view of the distribution. It can be displayed with or without a boxplot inside, and with or without colors and a legend indicating the value of the density. Next, the plot will be created in R using ggplot2, and then in Python with Seaborn and Matplotlib.
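To make the idea of smoothing a bit more concrete, here is a minimal sketch of a Gaussian KDE in plain Python. It is only an illustration under my own assumptions: the data is made up, the helper name make_kde is hypothetical, and the plotting libraries use their own, more careful implementations. The point is simply that a small bandwidth follows the data closely, while a large one smooths it out.

```python
import math


def make_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` at a point x.

    Each observation contributes a small Gaussian bump of width `bandwidth`;
    the estimate is the average of all the bumps.
    """
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Made-up data: one cluster around 2 and a smaller one around 5.
values = [1.0, 1.2, 1.9, 2.0, 2.1, 2.3, 4.8, 5.0]

narrow = make_kde(values, bandwidth=0.3)  # little smoothing
wide = make_kde(values, bandwidth=1.0)    # a lot of smoothing

# Compare both estimates on a coarse grid from 0 to 6.
for i in range(13):
    x = i * 0.5
    print(f"x={x:3.1f}  narrow={narrow(x):.3f}  wide={wide(x):.3f}")
```

With the narrow bandwidth the gap between the two clusters stays almost empty, while the wide bandwidth fills it in. That is the same trade-off the smoothing arguments control in the plots.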
1) Plot in R
Note that ggplot2 expects input data in long format, where each row represents a single observation. To create the plot, you need a categorical (factor) variable for the x-axis and a numeric variable for the y-axis, each in its own column.
Some adjustments can be made to the default settings, such as changing the range and scaling of the violins for better visualization. By default (trim = TRUE), each violin is cut off at the minimum and maximum of its group's data; setting trim = FALSE makes the tails longer, extending them beyond the observed range along the y-axis so the density can taper off. The scale argument is "area" by default, meaning that every violin has the same area, regardless of the number of observations. With scale = "count", the area of each violin becomes proportional to the number of observations in its group, while with scale = "width", all violins have the same maximum width and only their shape changes.
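Another important aspect of Violin plots, as mentioned before, is the smoothing, which can be adjusted with the adjust argument. By default adjust = 1; it multiplies the default bandwidth, so values above 1 give smoother violins and values below 1 follow the data more closely.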
Adding a boxplot:
I believe I have covered the essential concepts and coding required for creating a Violin plot in R. Next, I'll show you how to create the same plot in Python.
2) Plot in Python
Start by loading imports for data, matplotlib and seaborn:
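Here is a minimal sketch of that setup. Since I can't assume any particular dataset, the data is made up with NumPy; the column names group and value are just placeholders for your own categorical and numeric variables.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up example data in long format: one row per observation.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "group": np.repeat(["A", "B", "C"], [200, 80, 30]),
    "value": np.concatenate([
        rng.normal(10, 2, 200),
        rng.normal(14, 3, 80),
        rng.normal(8, 1, 30),
    ]),
})

# Basic violin plot with Seaborn's default settings.
fig, ax = plt.subplots(figsize=(6, 4))
sns.violinplot(data=df, x="group", y="value", ax=ax)
ax.set_title("Violin plot with Seaborn defaults")
# plt.show()  # uncomment to display the figure in an interactive session
```

Just like in ggplot2, the DataFrame is in long format: one categorical column for the x-axis and one numeric column for the y-axis.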
In Seaborn, the main arguments for customizing and adjusting the violins are density_norm and cut. density_norm accepts 'area' (the default), where every violin has the same area regardless of the number of observations in each group; 'count', where the width is proportional to the number of observations; and 'width', where every violin has the same maximum width. For controlling the violin's length, the cut argument determines how far the violins extend past the most extreme data points along the y-axis (if the plot is vertical), measured in units of the bandwidth; cut = 0 keeps each violin within the range of its data.
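To compare the three options side by side, here is a small sketch. It assumes Seaborn 0.13 or newer (where density_norm is available) and uses made-up data with very different group sizes, so the difference between the normalizations is easy to see.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up data with unequal group sizes so the normalizations differ visibly.
rng = np.random.default_rng(42)
sizes = {"A": 200, "B": 80, "C": 30}
df = pd.DataFrame({
    "group": np.repeat(list(sizes), list(sizes.values())),
    "value": np.concatenate([
        rng.normal(10, 2, sizes["A"]),
        rng.normal(14, 3, sizes["B"]),
        rng.normal(8, 1, sizes["C"]),
    ]),
})
counts = df["group"].value_counts()
print(counts)

# One panel per density_norm option; cut=0 keeps violins inside the data range.
norms = ["area", "count", "width"]
fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)
for ax, norm in zip(axes, norms):
    sns.violinplot(
        data=df, x="group", y="value",
        density_norm=norm, cut=0, ax=ax,
    )
    ax.set_title(f"density_norm={norm!r}, cut=0")
fig.tight_layout()
# plt.show()  # uncomment to display the figure in an interactive session
```

In the 'count' panel, group C (only 30 observations) should look much thinner than group A, while in the 'width' panel all three reach the same maximum width.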
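Lastly, the smoothing can be customized with bw_method, which sets the rule (or a number) used to compute the bandwidth, and with bw_adjust, a multiplier on that bandwidth that by default is = 1. The Python version I'm currently using is 3.11.19.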
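A quick way to see what the smoothing does is to draw the same data with a few different values. The sketch below assumes Seaborn 0.13 or newer and made-up data; the values 0.3, 1 and 2 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up data: two groups, one of them with two bumps.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "group": np.repeat(["A", "B"], [150, 150]),
    "value": np.concatenate([
        rng.normal(10, 2, 150),
        rng.normal(6, 1, 75),
        rng.normal(12, 1, 75),
    ]),
})

# Same data, three levels of smoothing (1 is the default).
adjust_values = [0.3, 1, 2]
fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)
for ax, adj in zip(axes, adjust_values):
    sns.violinplot(
        data=df, x="group", y="value",
        bw_method="scott", bw_adjust=adj, ax=ax,
    )
    ax.set_title(f"bw_adjust={adj}")
fig.tight_layout()
# plt.show()  # uncomment to display the figure in an interactive session
```

Smaller values give a bumpier violin that follows the data closely, and larger values give a smoother one that can hide details such as the two bumps in group B.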
The plot can also be created using only Matplotlib, which is convenient when the data is already in lists or arrays. In this case, the data should be passed as a sequence of arrays (one per violin) rather than as a long-format DataFrame. Using Matplotlib is a good choice if there's no need for Seaborn options like density_norm or cut (Matplotlib's violinplot does accept its own bw_method):
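A minimal sketch with Matplotlib alone could look like this. The data is made up again, and the labels A, B and C are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data: one array per violin instead of a long-format DataFrame.
rng = np.random.default_rng(42)
groups = [
    rng.normal(10, 2, 200),
    rng.normal(14, 3, 80),
    rng.normal(8, 1, 30),
]
labels = ["A", "B", "C"]

fig, ax = plt.subplots(figsize=(6, 4))
# showmedians adds a line at each median; showextrema marks min and max.
parts = ax.violinplot(groups, showmedians=True, showextrema=True)

# By default the violins are placed at x = 1, 2, 3, ...
ax.set_xticks([1, 2, 3])
ax.set_xticklabels(labels)
ax.set_ylabel("value")
ax.set_title("Violin plot with Matplotlib only")
# plt.show()  # uncomment to display the figure in an interactive session
```

The call returns a dictionary of the drawn pieces (for example parts["bodies"] holds the violin shapes), which is what you would use to change colors by hand.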
Conclusion
In this post, we explored how to visualize data distributions using Violin plots in both R and Python. We saw how to generate these plots with Seaborn and Matplotlib in Python, and with ggplot2 in R.
In Python, Matplotlib is great for quickly generating violin plots with minimal customization, while Seaborn provides more control for creating detailed, customized visualizations. In R, ggplot2 offers equally effective and flexible plotting capabilities.
Now you know how to create Violin plots in both R and Python, so you can choose the tool that best suits you and your data visualization needs.
I hope you enjoyed this post, as it’s just the beginning of our Graph Series!
Thank you for being here, and feel free to leave questions or suggestions in the comments! Stay tuned for more insights next week,
Letícia - Minha Estatística.



