Exploratory data analysis (EDA) is an important step in any data science project. It helps us visualize the data and spot trends that summary statistics alone might not reveal. In Python, the matplotlib and seaborn libraries are the usual tools for this. The pandas.plotting module of the “pandas” library, which uses “matplotlib” under the hood, also provides some handy functions that produce useful plots in very few lines of code. In this post, we will create pair plots using the scatter_matrix function from the pandas.plotting module.
Pandas is a Python library used for data analysis and data wrangling. It is fast, powerful, flexible, and open source. It provides DataFrame objects for easy data manipulation, along with built-in methods that help with data visualization.
What is a scatter plot
A scatter plot is a chart that shows the relationship between two numerical attributes or variables, e.g. x and y. Each data point is drawn as a dot whose coordinates are its x and y values. This makes it easy to see whether x and y are correlated. For example, let’s look at the scatter plot below, created from the iris dataset that ships with the scikit-learn library.
In the above image, each data point represents one iris flower. The petal length of the flower is shown on the x-axis and the petal width on the y-axis.
We can use the code below to create this scatter plot:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
# Get the Iris dataset from the scikit-learn library
dataset = load_iris()
data = dataset.data
df_iris = pd.DataFrame(data, columns = dataset.feature_names)
plt.figure(figsize=(8,6))
plt.plot(df_iris['petal length (cm)'], df_iris['petal width (cm)'], marker = '.', linestyle = 'none')
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
plt.margins(0.2)
plt.show()
# Check the correlation coefficient using NumPy
import numpy as np
corr = np.corrcoef(df_iris['petal length (cm)'], df_iris['petal width (cm)'])[0][1]
print("-" * 100)
print("The correlation coefficient between petal length (cm) and petal width (cm) attributes is {0:.5f}".format(corr))
print("-" * 100)
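Since this post is about the plotting helpers built into pandas, here is a minimal sketch of the same plot and correlation check done with pandas alone. It assumes the same iris DataFrame as above; the variable names x_col and y_col are only illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

# Build the same iris DataFrame as in the code above
dataset = load_iris()
df_iris = pd.DataFrame(dataset.data, columns=dataset.feature_names)

# Illustrative names for the two columns being compared
x_col, y_col = "petal length (cm)", "petal width (cm)"

# DataFrame.plot.scatter draws one point per row using matplotlib
ax = df_iris.plot.scatter(x=x_col, y=y_col, figsize=(8, 6))

# Series.corr computes the Pearson correlation coefficient by default
corr = df_iris[x_col].corr(df_iris[y_col])
print(f"Pearson correlation between {x_col} and {y_col}: {corr:.5f}")

plt.show()
```

Both approaches should report the same coefficient, because np.corrcoef and Series.corr both compute the Pearson correlation here.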
What is the scatter_matrix function
Generating scatter plots manually for every combination of numerical attributes can be time-consuming, especially for a wide dataset. The scatter_matrix function generates these plots automatically for all pairs of numerical attributes in a DataFrame, so we can visually check the relationship between each pair of attributes at a glance. The function is available in the pandas.plotting module of the pandas library.
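As a numeric companion to the visual check, the pairwise correlation coefficients can be computed in one call with DataFrame.corr. The sketch below assumes the iris DataFrame used throughout this post and also counts how many panels the scatter matrix will contain.

```python
import pandas as pd
from sklearn.datasets import load_iris

dataset = load_iris()
df_iris = pd.DataFrame(dataset.data, columns=dataset.feature_names)

# scatter_matrix only plots numeric columns, so select the same subset here
numeric = df_iris.select_dtypes(include="number")

# Pairwise Pearson correlation between every pair of numeric columns
corr_matrix = numeric.corr()
print(corr_matrix.round(3))

# One panel per (row attribute, column attribute) pair
n = numeric.shape[1]
print(f"{n} numeric columns -> {n} x {n} = {n * n} panels in the scatter matrix")
```

Values close to 1 or -1 in this table should correspond to panels where the points fall close to a straight line.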
How to create a scatter_matrix plot in Python
We can use the code below to generate the scatter matrix for a given DataFrame.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
# Get the Iris dataset from the scikit-learn library
dataset = load_iris()
data = dataset.data
df_iris = pd.DataFrame(data, columns = dataset.feature_names)
pd.plotting.scatter_matrix(df_iris, figsize = (12,12))
# Needed to display the figure when running as a script
plt.show()
The output is this:
In the above image, each numerical attribute appears once as a row and once as a column. The diagonal panels, where the same attribute is on both the x and y axes, show a histogram of that attribute. The remaining panels show the scatter plot for each pair of numerical variables in the dataset.
Thanks for reading. Please share your feedback in the comments section.
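The default output can be adjusted through a few keyword arguments of scatter_matrix. The sketch below is one possible variation, not the only way to do it: it swaps the diagonal histograms for density curves and colors the points by species. Coloring by dataset.target is an addition that is not part of the code above.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

dataset = load_iris()
df_iris = pd.DataFrame(dataset.data, columns=dataset.feature_names)

axes = pd.plotting.scatter_matrix(
    df_iris,
    figsize=(12, 12),
    diagonal="kde",    # density curves on the diagonal instead of histograms (uses SciPy)
    alpha=0.8,         # point opacity
    marker="o",
    c=dataset.target,  # extra keyword arguments are passed to the scatter call; colors points by species code
)

# scatter_matrix returns a 2-D array of Axes:
# axes[i, j] has column i on the y-axis and column j on the x-axis
print(axes.shape)

plt.show()
```

Coloring by species can make clusters visible that a single-color scatter matrix hides.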