EDA (Exploratory Data Analysis) is the process of organizing, plotting, and summarizing data to find trends, patterns, and outliers using statistical and visual methods. Here, we have already discussed various methods of performing EDA on an underlying dataset, along with their pros and cons. The ECDF plot is another visual method of performing EDA on a given feature. In this post, we will learn what an ECDF is and how to create an ECDF plot in Python. Before going deeper, let's start with the definition.
Empirical cumulative distribution function
ECDF stands for empirical cumulative distribution function. It models the cumulative probability of the sample data and gives us an estimate of the underlying cumulative distribution function. The ECDF is a step function: for a sample of n data points, it jumps up by 1/n at each data point (and by a multiple of 1/n where several data points share the same value).
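As a minimal sketch of this definition, the snippet below evaluates the ECDF at a single point by computing the fraction of sample values that are less than or equal to it. The helper name `ecdf_at` and the five-value sample are made up for illustration.

```python
import numpy as np

def ecdf_at(sample, x):
    """Return the fraction of values in `sample` that are <= x."""
    sample = np.asarray(sample)
    return np.count_nonzero(sample <= x) / sample.size

# Hypothetical sample with n = 5, so each data point adds a step of 1/5 = 0.2
sample = [2.0, 3.5, 3.5, 4.1, 5.0]

print(ecdf_at(sample, 1.0))  # below the minimum -> 0.0
print(ecdf_at(sample, 3.5))  # three of five values are <= 3.5 -> 0.6
print(ecdf_at(sample, 5.0))  # at the maximum -> 1.0
```

Note that the value 3.5 appears twice, so the function rises by 2/5 at that point rather than 1/5.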
What is an ECDF plot?
An ECDF plot visualizes every data point of a given feature in the sample. The x-axis shows the measured values in ascending order. The y-axis shows the fraction (or percentage) of data points that are less than or equal to the corresponding x-axis value.
This is an example of an ECDF plot. This plot displays all the sepal width values from the iris dataset.
When to use ECDF plots?
If we have a large sample dataset, an ECDF plot lets us visualize every single data point as part of graphical EDA. Unlike histograms, it does not suffer from binning bias. Unlike bee swarm plots, it does not suffer from overlapping data points. If needed, we can draw multiple ECDFs on the same plot and compare them, which helps us see how similar or different two samples are for a given feature.
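To illustrate the comparison use case, here is a small sketch that overlays the ECDFs of two samples on the same axes. The helper name `plot_ecdfs` is hypothetical, and the two samples are synthetic random data generated only for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_ecdfs(samples):
    """Draw one ECDF per named sample on a shared set of axes."""
    fig, ax = plt.subplots()
    for label, values in samples.items():
        x = np.sort(values)                      # values in ascending order
        y = np.arange(1, len(x) + 1) / len(x)    # cumulative fraction, 1/n .. 1
        ax.plot(x, y, marker='.', linestyle='none', label=label)
    ax.set_xlabel('value')
    ax.set_ylabel('Fraction of values')
    ax.legend()
    return ax

# Synthetic data for illustration only: two normal samples with different means
rng = np.random.default_rng(0)
samples = {
    'sample A': rng.normal(loc=5.0, scale=0.5, size=200),
    'sample B': rng.normal(loc=6.0, scale=0.5, size=200),
}
ax = plot_ecdfs(samples)
```

Calling plt.show() afterwards displays the figure. A curve that sits further to the right belongs to the sample with the larger values, and two curves that lie on top of each other suggest similar distributions.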
How to create an ECDF plot in Python?
We can write our own logic to create an ECDF plot, or we can simply use the seaborn library, which provides a function called seaborn.ecdfplot for drawing an ECDF plot.
Method 1 – Using custom functions
To draw an ECDF plot, we sort the values in ascending order with np.sort and use them as the x-axis values. For the y-axis, we use np.arange(1, len(x) + 1) / len(x). Then we can plot the x and y values using the plt.plot function of matplotlib.pyplot.
To create an ECDF plot, let's follow the steps below:
- Create a function that takes x data as an input parameter.
- Sort the input values in ascending order. We can use the np.sort function of the NumPy module to sort the values in a vectorized way.
- Generate the y-axis values using the np.arange function of the NumPy module. We use np.arange(1, len(x) + 1) / len(x) so that the values run from 1/n up to 1.
- Return the x and y values from the function.
Below is the code we can use:
import numpy as np
#ECDF function to generate x and y axis data
def ecdf(xdata):
    xdataecdf = np.sort(xdata)
    ydataecdf = np.arange(1, len(xdata) + 1) / len(xdata)
    return xdataecdf, ydataecdf
Then, we can use this function to get the x and y data, which can be plotted to generate an ECDF plot. Let's use the iris dataset (which comes bundled with the scikit-learn library) as a sample dataset and plot the ECDF of the sepal length.
#Import iris data set from sklearn and other libraries
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import pandas as pd
#Get the iris dataset
dataset = load_iris()
#Load the data into a pandas DataFrame
df = pd.DataFrame(dataset.data, columns = dataset.feature_names)
#Get the x and y data for ecdf plot from ecdf method
x,y = ecdf(df['sepal length (cm)'])
#Plot the data using matplotlib
plt.plot(x, y, marker = '.', linestyle = 'none')
plt.xlabel('sepal length (cm)')
plt.ylabel('Fraction of values')
plt.margins(0.1)
#Display the plot (needed when running as a script rather than in a notebook)
plt.show()
The above plot shows, for any particular x value, what fraction of the total values are less than or equal to it. We have plotted all the data points without any overlapping or binning bias issues.
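To make this reading concrete, the following sketch computes the same quantities numerically instead of reading them off the plot. The measurements and the helper name `fraction_at_or_below` are made up for illustration; np.searchsorted is used here as one convenient way to count values on sorted data.

```python
import numpy as np

# Made-up measurements, for illustration only
values = np.array([4.9, 5.1, 5.4, 5.8, 6.0, 6.3, 6.7, 7.1])
sorted_values = np.sort(values)
n = sorted_values.size

def fraction_at_or_below(x):
    """Height of the ECDF at x: the share of observations <= x."""
    # side='right' counts every element that is <= x
    return np.searchsorted(sorted_values, x, side='right') / n

print(fraction_at_or_below(6.0))   # 5 of 8 values are <= 6.0 -> 0.625
print(fraction_at_or_below(4.0))   # nothing is <= 4.0 -> 0.0

# Going the other way: the smallest observed value whose ECDF height
# reaches a target fraction p
p = 0.5
index = int(np.ceil(p * n)) - 1
print(sorted_values[index])        # 4th smallest value -> 5.8
```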
Method 2 – Using Seaborn library
If you don't want to write a custom function to generate and display the ECDF plot, you can use the seaborn library. It provides the seaborn.ecdfplot function to draw an ECDF plot without writing any custom code. This function was added in seaborn 0.11.0, so please make sure that your installed version is at least that recent.
#Import iris data set from sklearn and other libraries
from sklearn.datasets import load_iris
import pandas as pd
#Get the iris dataset
dataset = load_iris()
#Load the data into a pandas DataFrame
df = pd.DataFrame(dataset.data, columns = dataset.feature_names)
#Import seaborn library and generate ecdf plot here
import seaborn as sns
sns.ecdfplot(x = df['sepal length (cm)'])
Here, we only need to import the seaborn library and call sns.ecdfplot(x = df['sepal length (cm)']) to get the above ECDF plot.
Thanks for reading. Please share your thoughts in the comment section.