In the previous post, we created our first Machine Learning model using Logistic Regression to solve a classification problem. We used the “Wisconsin Breast Cancer dataset” for demonstration purposes. Now, in this post, “Building Decision Tree model in python from scratch – Step by step”, we will be using the IRIS dataset, a standard dataset that comes with the Scikit-learn library. Let’s have a quick look at the IRIS dataset.
The IRIS dataset
The IRIS dataset is a multi-class classification dataset introduced by the British statistician and biologist Ronald Fisher in 1936. The dataset has 150 observations, consisting of 50 samples of each of three species of Iris flower: “setosa“, “versicolor“, and “virginica“. It is a standard, cleansed, and preprocessed multivariate dataset which comes preloaded with the Scikit-learn library. Each sample has four input features:
- Sepal length (cm)
- Sepal width (cm)
- Petal length (cm), and
- Petal width (cm)
The target variable defines the species of the iris flower, which can be “setosa“, “versicolor“, or “virginica“. We need to build a classifier (using the Decision Tree Classifier) which can predict the species of an iris flower for unseen data based on the given input features – sepal length, sepal width, petal length, and petal width.
Let’s have a look at the sample data:
Data preprocessing and Exploration
Now, we are going to load and analyze this dataset in Python using the pandas library, a very powerful and handy library for data analysis.
from sklearn import datasets  #import datasets from sklearn library
import pandas as pd  #import pandas under alias pd

data = datasets.load_iris()  #load Iris dataset in a variable named data
Using the above code, we have loaded the Iris dataset into a variable named “data”. It is of type <class ‘sklearn.utils.Bunch’>. Bunch is a dictionary-like object whose main keys/properties are:
- DESCR – Displays the full description of the dataset
- data – Contains the input feature data in a numpy array with shape (150, 4)
- feature_names – Contains the names of the features in a Python list
- target – Contains the target values (dependent variable values) for all 150 rows in a numpy array with shape (150,)
- target_names – Contains the names of the target classes in a string array
We can access these properties using syntax like data.<property_name> or data[‘<property_name>’].
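For example, we can print a few of these properties to get familiar with the dataset (a quick sketch, assuming the dataset has been loaded into “data” as above):
print(data.feature_names)  #names of the four input features
print(data.target_names)  #names of the three target classes
print(data.data.shape)  #shape of the input feature array: (150, 4)
print(data['target'][:5])  #dictionary-style access to the first five target values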
Now, let’s create a pandas dataframe using Iris data.
df = pd.DataFrame(data.data, columns = data.feature_names)  #create a dataframe df with features as column names
print(df.head())  #print top 5 rows of the dataframe
Output:
Let’s use the .info() method on the dataframe to get the column names, data types, and non-null value counts along with memory usage. Also, use the .describe() method to get a statistical summary of each column.
print(df.info())  #print column names, data types and non-null value counts for each column
print(df.describe())  #print statistical summary of the columns
Output:
Now, let’s use the .isnull() and .isna() methods to verify the null and NaN (Not a Number) values in this dataset:
print(df.isnull().sum())  #print the count of null values in each column
print(df.isna().sum())  #print the count of NaN values in each column
Output:
The dataframe does not have any null or NaN values, and all the input features in this dataset are numeric (though CART also supports categorical variables as input features). So, we can say that this dataset satisfies the tidy data principles and can be used in a Machine Learning model. Before fitting this data into the model, let’s do some EDA on this dataset.
EDA (Exploratory Data Analysis)
As all the input features of this dataset are numeric, we can draw a scatter matrix plot, which displays the pairwise relationships between the features of the dataset. To draw a scatter matrix plot, we can use this code:
import matplotlib.pyplot as plt

_ = pd.plotting.scatter_matrix(df, c = data.target, figsize = [6, 6], s = 25, marker = 'D')
plt.show()
Output:
In the above plot, we can see that petal length and petal width are highly correlated.
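We can confirm this observation numerically by printing the pairwise correlation matrix of the features; this is a quick check using pandas’ built-in .corr() method:
print(df.corr())  #pairwise Pearson correlation between the numeric features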
Now, let’s draw the histogram of each feature.
_ = df.hist(bins = 4, figsize = (6, 6))
plt.show()
Output:
We can also apply some more EDA techniques on this dataframe (like box plots, violin plots, and strip plots) before fitting the data into our Machine Learning model; a box plot example is sketched below. Visit this link to know more about EDA (Exploratory Data Analysis) techniques.
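For instance, here is a minimal box plot sketch using pandas’ built-in .boxplot() method (the figure size and label rotation are arbitrary choices):
_ = df.boxplot(figsize = (6, 6), rot = 45)  #box plot of each numeric feature
plt.show()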
Classification and Regression Tree – CART
Classification and Regression Tree, or CART, is a supervised Machine Learning algorithm which is used to solve classification (categorical output) and regression (continuous output) tasks. It uses a Decision Tree, which consists of a hierarchy of nodes. Each node involves either a question or a prediction. There are three types of nodes:
- Root node: It has no parent node and involves a question which gives rise to two child nodes
- Internal node: It has one parent node and involves a question which gives rise to two child nodes
- Leaf node: It has one parent node but no child nodes (because it involves no question). It is also known as the prediction node.
During training, a Decision Tree learns patterns in the data so that it can produce the purest leaves (a leaf node predominated by one class). Decision Trees implicitly perform feature selection, and they can handle both numerical and categorical data as input features. They also eliminate the need for data normalization/standardization (used to bring all the input features onto the same scale). In addition, Decision Trees can capture non-linear relationships. A typical Decision Tree model looks like this.
To know more about Decision Tree models, click here.
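Before building the model, it may help to see how leaf purity is measured. The code below is a minimal sketch of the Gini impurity formula (1 minus the sum of squared class proportions) that the ‘gini’ criterion used later relies on; the helper function name and the sample class counts are hypothetical:
import numpy as np

def gini_impurity(class_counts):
    #Gini impurity: 1 - sum(p_i^2), where p_i is the proportion of class i in the node
    proportions = np.array(class_counts) / np.sum(class_counts)
    return 1.0 - np.sum(proportions ** 2)

print(gini_impurity([50, 50, 50]))  #maximally mixed 3-class node -> ~0.667
print(gini_impurity([50, 0, 0]))  #pure node -> 0.0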
Building Decision Tree Classification model using scikit-learn
As with our previous model, we need to split the given dataset into two parts: training data and test data. The training data will be used to train the model, and the test data will be used to evaluate the model’s performance on unseen data. We can use this code to train the model and test its performance using the Decision Tree Classifier.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

seed = 22  #set seed value for reproducibility
x_data = data.data  #Assign input features in x_data
y_data = data.target  #Assign target/dependent variable values in y_data

#Now, split the x and y data into train and test datasets.
#Use stratify = y_data to have the same proportion of the classes in the training sample as in the input dataset
(x_train, x_test, y_train, y_test) = train_test_split(
    x_data, y_data, random_state = seed, stratify = y_data, test_size = 0.30)

#Instantiate decision tree classifier
dt = DecisionTreeClassifier(criterion = 'gini', max_depth = 2,
    min_samples_leaf = 0.10, random_state = seed)

dt.fit(x_train, y_train)  #Train the model
y_pred = dt.predict(x_test)  #Predict the values on test data
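Once the model is trained, we can optionally inspect the learned splits as plain text. This is a quick sketch using scikit-learn’s export_text helper (available in scikit-learn 0.21 and later):
from sklearn.tree import export_text

#Print the learned questions and leaf predictions of the trained tree
print(export_text(dt, feature_names = list(data.feature_names)))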
While instantiating the Decision Tree, we have used criterion = ‘gini’ and max_depth = 2, which are hyperparameters (parameter values that must be set before fitting the data into the Machine Learning model). We can use the GridSearchCV or RandomizedSearchCV techniques to find the optimal values of these parameters.
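As an illustration, here is a minimal GridSearchCV sketch for tuning max_depth and min_samples_leaf; the parameter grid values below are arbitrary choices for demonstration:
from sklearn.model_selection import GridSearchCV

#Hypothetical parameter grid - adjust the ranges to suit the problem
param_grid = {'max_depth': [2, 3, 4],
    'min_samples_leaf': [0.05, 0.10, 0.20]}

grid = GridSearchCV(DecisionTreeClassifier(random_state = seed),
    param_grid, cv = 5, scoring = 'accuracy')
grid.fit(x_train, y_train)

print(grid.best_params_)  #best hyperparameter combination found
print(grid.best_score_)  #mean cross-validated accuracy of the best combination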
Evaluate the model performance
Now, let’s evaluate our model’s performance. As this is a balanced classification problem (each class has 50 samples in the input dataset), we can use accuracy as a performance metric.
from sklearn.metrics import accuracy_score

print('Accuracy of the model is {0}'.format(accuracy_score(y_test, y_pred)))
The accuracy of our model is approximately 93.3%, which is an acceptable score. In addition to the accuracy metric, we can also use a confusion matrix to measure the model’s performance.
from sklearn.metrics import confusion_matrix

print(pd.DataFrame(confusion_matrix(y_test, y_pred),
    index = ['Actual setosa', 'Actual versicolor', 'Actual virginica'],
    columns = ['Pred setosa', 'Pred versicolor', 'Pred virginica']))
As we have three classes in the target, we get a 3 x 3 matrix in the output. The diagonal values of the matrix represent correct predictions, and the off-diagonal values represent incorrect predictions.
We can also print the classification report, which is especially useful when we have an imbalanced class problem in the input dataset.
from sklearn.metrics import classification_report

print('Classification report is {0}'.format(classification_report(y_test, y_pred)))
Thanks for reading. Please share your inputs in the comments.