sns.displot(tips, x="total_bill", kind="kde", cut=0)
../_images/distributions_47_0.png
The KDE approach also fails for discrete data or when data are naturally continuous but specific values are over-represented. The important thing to keep in mind is that the KDE will always show you a smooth curve, even when the data themselves are not smooth. For example, consider this distribution of diamond weights:
diamonds = sns.load_dataset("diamonds")
sns.displot(diamonds, x="carat", kind="kde")
../_images/distributions_49_0.png
While the KDE suggests that there are peaks around specific values, the histogram reveals a much more jagged distribution:
sns.displot(diamonds, x="carat")
../_images/distributions_51_0.png
As a compromise, it is possible to combine these two approaches. While in histogram mode, displot() (as with histplot()) has the option of including the smoothed KDE curve (note kde=True, not kind="kde"):
sns.displot(diamonds, x="carat", kde=True)
../_images/distributions_53_0.png
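To see why a KDE stays smooth even when the data are heaped on a few values, it can help to compute one by hand. The following is a minimal sketch using NumPy only, not seaborn's implementation; the weights, the `gaussian_kde` helper, and the bandwidth of 0.05 are all made up for illustration:

```python
import numpy as np

# Hypothetical weights heaped on a few round values (made up, not the diamonds data)
weights = np.repeat([0.3, 0.5, 0.7, 1.0], [40, 25, 20, 15])

def gaussian_kde(data, grid, bandwidth):
    """Average of Gaussian bumps centered on each observation."""
    z = (grid[:, None] - data[None, :]) / bandwidth
    norm = data.size * bandwidth * np.sqrt(2 * np.pi)
    return np.exp(-0.5 * z**2).sum(axis=1) / norm

grid = np.linspace(0, 1.3, 131)
density = gaussian_kde(weights, grid, bandwidth=0.05)  # bandwidth is an arbitrary choice

# No observation equals 0.4, yet the smooth estimate is clearly positive there
at_gap = density[np.argmin(np.abs(grid - 0.4))]
print(f"observations equal to 0.4: {(weights == 0.4).sum()}")
print(f"estimated density at 0.4: {at_gap:.3f}")
```

The estimate assigns density to values that never occur in the data, which is the kind of smoothing artifact that the histogram makes visible.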
Empirical cumulative distributions
A third option for visualizing distributions computes the “empirical cumulative distribution function” (ECDF). This plot draws a monotonically increasing curve through each datapoint such that the height of the curve reflects the proportion of observations with a smaller value:
sns.displot(penguins, x="flipper_length_mm", kind="ecdf")
../_images/distributions_55_0.png
The ECDF plot has two key advantages. Unlike the histogram or KDE, it directly represents each datapoint. That means there is no bin size or smoothing parameter to consider. Additionally, because the curve is monotonically increasing, it is well-suited for comparing multiple distributions:
sns.displot(penguins, x="flipper_length_mm", hue="species", kind="ecdf")
../_images/distributions_57_0.png
The major downside to the ECDF plot is that it represents the shape of the distribution less intuitively than a histogram or density curve. Consider how the bimodality of flipper lengths is immediately apparent in the histogram, but to see it in the ECDF plot, you must look for varying slopes. Nevertheless, with practice, you can learn to answer all of the important questions about a distribution by examining the ECDF, and doing so can be a powerful approach.
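Because the ECDF has no bin size or smoothing parameter, it can be computed in a couple of lines. The sketch below uses NumPy only and made-up flipper lengths; the `ecdf` helper is hypothetical, and it assumes the convention that each point counts itself (the proportion of observations at or below each value):

```python
import numpy as np

def ecdf(values):
    """Return sorted values and the proportion of observations <= each one."""
    x = np.sort(np.asarray(values, dtype=float))
    # The i-th sorted point (1-based) has i of n observations at or below it
    y = np.arange(1, x.size + 1) / x.size
    return x, y

# Hypothetical flipper lengths in mm (made-up values, not the penguins data)
flipper_mm = [181, 186, 195, 193, 190, 210, 215, 222]
x, y = ecdf(flipper_mm)

for xi, yi in zip(x, y):
    print(f"{xi:.0f} mm -> {yi:.3f}")
```

Steep stretches of the curve correspond to values where observations are dense, and flat stretches to gaps, which is how modes show up as varying slopes.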
Visualizing bivariate distributions
All of the examples so far have considered univariate distributions: distributions of a single variable, perhaps conditional on a second variable assigned to hue. Assigning a second variable to y, however, will plot a bivariate distribution:
sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm")
../_images/distributions_60_0.png
A bivariate histogram bins the data within rectangles that tile the plot and then shows the count of observations within each rectangle with the fill color (analogous to a heatmap()). Similarly, a bivariate KDE plot smooths the (x, y) observations with a 2D Gaussian. The default representation then shows the contours of the 2D density:
sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", kind="kde")
../_images/distributions_62_0.png
Assigning a hue variable will plot multiple heatmaps or contour sets using different colors. For bivariate histograms, this will only work well if there is minimal overlap between the conditional distributions:
sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", hue="species")
../_images/distributions_64_0.png
The contour approach of the bivariate KDE plot lends itself better to evaluating overlap, although a plot with too many contours can get busy:
sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", hue="species", kind="kde")
../_images/distributions_66_0.png
Just as with univariate plots, the choice of bin size or smoothing bandwidth will determine how well the plot represents the underlying bivariate distribution. The same parameters apply, but they can be tuned for each variable by passing a pair of values:
sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", binwidth=(2, .5))
../_images/distributions_68_0.png
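The counts behind such a heatmap can be approximated with `numpy.histogram2d`. This is a rough sketch on synthetic data, not what seaborn does internally: the `bin_edges` helper is hypothetical, and aligning the edges to multiples of the bin width is a choice made here for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for bill length and depth in mm (not the penguins data)
bill_length = rng.normal(44, 5, size=300)
bill_depth = rng.normal(17, 2, size=300)

def bin_edges(values, width):
    """Edges of equal-width bins that cover the full range of the data."""
    start = np.floor(values.min() / width) * width
    n_bins = max(int(np.ceil((values.max() - start) / width)), 1)
    return start + width * np.arange(n_bins + 1)

# A different bin width per variable, mirroring binwidth=(2, .5)
x_edges = bin_edges(bill_length, 2)
y_edges = bin_edges(bill_depth, 0.5)

# counts[i, j] is the number of observations in rectangle (i, j)
counts, _, _ = np.histogram2d(bill_length, bill_depth, bins=[x_edges, y_edges])
print(counts.shape, int(counts.sum()), int(counts.max()))
```

Each cell of `counts` is what the fill color encodes, so wider bins in either direction trade resolution along that axis for less noisy counts.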
To aid interpretation of the heatmap, add a colorbar to show the mapping between counts and color intensity:
sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", binwidth=(2, .5), cbar=True)
../_images/distributions_70_0.png
The meaning of the bivariate density contours is less straightforward. Because the density is not directly interpretable, the contours are drawn at iso-proportions of the density, meaning that each curve shows a level set such that some proportion p of the density lies below it. The p values are evenly spaced, with the lowest level controlled by the thresh parameter and the number controlled by levels:
sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", kind="kde", thresh=.2, levels=4)
../_images/distributions_72_0.png
The levels parameter also accepts a list of values, for more control:
sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", kind="kde", levels=[.01, .05, .1, .8])
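One way to make the iso-proportion idea concrete is to convert proportions into density heights on a grid. The sketch below is an illustration under stated assumptions rather than the library's own code: the `iso_proportion_levels` helper is hypothetical, the density is a standard bivariate Gaussian rather than a fitted KDE, and the p values are assumed to be evenly spaced between the threshold and 1.

```python
import numpy as np

def iso_proportion_levels(density, proportions):
    """Density heights such that roughly a fraction p of the mass lies below each."""
    values = np.sort(np.ravel(density))            # grid densities, ascending
    cumulative = np.cumsum(values) / values.sum()  # mass fraction at or below each height
    idx = np.searchsorted(cumulative, proportions)
    return values[np.clip(idx, 0, values.size - 1)]

# Stand-in density: a standard bivariate Gaussian evaluated on a grid
grid = np.linspace(-4, 4, 201)
xx, yy = np.meshgrid(grid, grid)
density = np.exp(-(xx**2 + yy**2) / 2) / (2 * np.pi)

# Evenly spaced p values starting at a threshold of 0.2, four in total
proportions = np.linspace(0.2, 1, 4)
levels = iso_proportion_levels(density, proportions)

for p, level in zip(proportions, levels):
    below = density[density < level].sum() / density.sum()
    print(f"p={p:.2f}  level={level:.4f}  mass below={below:.3f}")
```

Read this way, the outermost contour with a threshold of 0.2 encloses roughly 80% of the estimated density, and each inner contour encloses progressively less.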