Introduction to k-fold Cross-Validation in Python

This post shows how we can use k-fold cross-validation to evaluate the performance of a Machine Learning model with the Scikit-learn library in Python. We know that the performance of a Machine Learning model depends on the training dataset. If the training dataset has peculiarities, a model trained on it will not generalize well. The cross-validation technique measures the performance of a Machine Learning model by dividing the data into folds.

What is k-fold cross-validation?

K-fold cross-validation is a model validation technique used to assess how well a model generalizes to unseen data. Normally, we split the given dataset into training and test datasets, train the model on the training dataset, and then test the model’s performance on the test dataset. However, this single random split may not be representative, and our model may overfit the training data. To estimate how the model will perform on independent data that it has never seen, we use the k-fold cross-validation method. The overall goal of k-fold CV is to get an insight into how well a model generalizes to unseen data.
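
To see why a single random split can be misleading, the minimal sketch below (an illustration, not part of the main demo) trains the same decision tree on a few different random splits of the IRIS dataset and prints the resulting test accuracies, which can differ noticeably from split to split.

#Sketch: the score of a single train/test split varies with the random split
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = datasets.load_iris()
for seed in (0, 1, 2):
    X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target,
                                              test_size=0.2, random_state=seed)
    clf = DecisionTreeClassifier(max_depth=2, random_state=42)
    clf.fit(X_tr, y_tr)
    print("split seed {0}: test accuracy = {1:.3f}".format(seed, clf.score(X_te, y_te)))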

How k-fold cross-validation works

The k-fold cross-validation method splits the input dataset into k folds. It then trains the model on k-1 folds and tests its performance on the held-out fold. This process is repeated k times, so that each fold serves as the test set exactly once. Here, k is a fixed positive integer, typically 5 or 10.

In other words, it first splits the data into k groups/folds and holds out the first fold as the test data. It then fits the model on the remaining k-1 folds and predicts on the test set (the fold held out initially). Finally, it computes the metric of interest. The next iteration holds out the second fold as the test set and fits the model on the remaining folds as training data. It again predicts on the test set and computes the metric of interest. This iteration is performed for all the folds. As a result, we get multiple values of the metric of interest (5 values in a 5-fold CV). These values are then averaged to get a more reliable estimate of the model performance. The image below illustrates how the k-fold cross-validation method works.

K-fold cross-validation
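
The fold assignment itself is easy to inspect. The minimal sketch below (an illustrative example with 10 dummy samples) prints which indices KFold holds out as the test set in each of 5 iterations; every index appears in the test set exactly once.

#Sketch: inspecting which indices each fold holds out
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(-1, 1)   #10 dummy samples
kf_demo = KFold(n_splits=5)             #5 folds of 2 samples each

for fold, (train_idx, test_idx) in enumerate(kf_demo.split(X_demo), start=1):
    print("Fold {0}: train={1}, test={2}".format(fold, train_idx, test_idx))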

The IRIS dataset – Sample dataset

The IRIS dataset comes bundled with the Scikit-learn library. It has 150 observations, consisting of 50 samples of each of three species of Iris flower: “setosa”, “versicolor”, and “virginica”. This dataset is a standard, cleansed, and preprocessed multivariate dataset. Each sample has four input features, listed below; a short snippet for inspecting these details follows the list:

  1. Sepal length (cm)
  2. Sepal width (cm)
  3. Petal length (cm), and
  4. Petal width (cm)
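
These details are easy to verify directly, as in the short snippet below, which loads the dataset and prints its shape, feature names, and target class names.

#Sketch: inspecting the bundled IRIS dataset
from sklearn import datasets

iris = datasets.load_iris()
print(iris.data.shape)      #(150, 4): 150 samples, 4 features
print(iris.feature_names)   #the four features listed above
print(iris.target_names)    #['setosa' 'versicolor' 'virginica']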

Using k-fold cross-validation to evaluate a model

Let’s start the demo to understand k-fold cross-validation in detail. We will use the scikit-learn library and its built-in IRIS dataset. First, we will implement cross-validation using a custom method. Then, we will demonstrate how we can implement k-fold cross-validation using the built-in cross_val_score method.

k-fold cross-validation using a custom for loop

It is essential to use cross-validation to avoid overfitting models. The sample code below implements k-fold cross-validation with a custom for loop on the IRIS dataset.

#import datasets from sklearn library
from sklearn import datasets
data = datasets.load_iris()

#Import decision tree classification model and cross validation
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import accuracy_score

#Extract a hold-out set at the very beginning
X_train_set, X_holdout, y_train_set, y_holdout = train_test_split(data.data, data.target, 
                                stratify = data.target, random_state = 42, test_size = .20)

#Assign the input features and the target values to X and y
X = X_train_set
y = y_train_set 

#Initialize the k-fold cross-validation configuration
#Note: shuffle=True is required for random_state to take effect
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
dt = DecisionTreeClassifier(criterion='gini', max_depth = 2, \
                        min_samples_leaf = 0.10, random_state = 42)
for train_index, test_index in kf.split(X):
    #print("Train index: {0}, \nTest index: {1}".format(train_index, test_index))
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    dt.fit(X_train, y_train)
    scores.append(dt.score(X_test, y_test))
print("\n" + ("*" * 100))
print("The cross-validation scores using custom method are \n{0}".format(scores))
print("*" * 100)

import numpy as np
print("\n" + ("*" * 100))
print("Mean of k-fold scores using custom method is {0}".format(np.mean(scores)))
print("*" * 100)
print("\n")

Output

Let’s have a look at the scores produced by the custom for-loop method of cross-validation.

Output using the custom method

k-fold cross-validation using built-in cross_val_score method

Instead of writing custom code, we can also use the built-in cross_val_score method of the sklearn library. Below is sample code which shows how we can use this method.

#import datasets from sklearn library
from sklearn import datasets
data = datasets.load_iris()

#Import decision tree classification model and cross validation
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.metrics import accuracy_score

#Extract a hold-out set at the very beginning
X_train_set, X_holdout, y_train_set, y_holdout = train_test_split(data.data, data.target, 
                                stratify = data.target, random_state = 42, test_size = .20)

#Assign the input features and the target values to X and y
X = X_train_set
y = y_train_set 

dt = DecisionTreeClassifier(criterion='gini', max_depth = 2, \
                        min_samples_leaf = 0.10, random_state = 42)

scores = cross_val_score(dt, X, y, cv = 5)
print("\n" + ("*" * 100))
print("The cross-validation scores using cross_val_score method are \n{0}".format(scores))
print("*" * 100)

import numpy as np
print("\n" + ("*" * 100))
print("Mean of k-fold scores using cross_val_score method is {0}".format(np.mean(scores)))
print("*" * 100)
print("\n")

Output

Let’s have a look at the output of the scores using the cross_val_score built-in method of the sklearn library.

Output using the cross_val_score method
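
One detail worth noting: when an integer such as cv=5 is passed and the estimator is a classifier, cross_val_score uses stratified folds (StratifiedKFold) rather than plain KFold. To reproduce the shuffled KFold splits from the custom loop above, a splitter object can be passed explicitly, as in this sketch (reusing dt, X, and y from the code above):

#Sketch: passing an explicit KFold splitter to cross_val_score
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores_kf = cross_val_score(dt, X, y, cv=kf)
print("Scores with an explicit KFold splitter: {0}".format(scores_kf))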

Notes:

  1. Do not use all the labeled data for cross-validation.
  2. Always keep some unseen data to verify the model’s predictive capability on never-before-seen data.
  3. At the very beginning, split the dataset into a training set and a hold-out set.
  4. Use the training dataset for cross-validation and the hold-out set for final validation, as shown in the sketch below.
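
A minimal sketch of that final step, reusing the dt, X, y, X_holdout, and y_holdout variables from the cross_val_score demo above: after cross-validation, refit the model on the full training set and score it once on the hold-out set.

#Sketch: final validation on the hold-out set
dt.fit(X, y)                           #refit on the full training set
holdout_preds = dt.predict(X_holdout)
print("Hold-out accuracy: {0:.3f}".format(accuracy_score(y_holdout, holdout_preds)))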

Pros and cons of using k-fold cross-validation

Below are the pros of using k-fold cv:

  1. Helps to get a more reliable, generalized estimate of the validation score.
  2. Easy to configure, as we only have to choose a single parameter: k.
  3. Helps to avoid overfitting the model.

Below are the cons of using k-fold cv:

  1. Be cautious when choosing the value of the hyperparameter k. As k increases, the procedure becomes more computationally expensive, because the model must be trained k times.
  2. The value of k also affects the quality of the estimate, so a poorly chosen k may lead to misleading validation scores. The sketch below compares the scores for a few values of k.
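
To get a feel for this trade-off, the sketch below (reusing dt, X, and y from the demo above) compares the mean and standard deviation of the cross-validation scores for a few values of k:

#Sketch: comparing cross-validation scores for a few values of k
for k in (3, 5, 10):
    k_scores = cross_val_score(dt, X, y, cv=k)
    print("k={0}: mean={1:.3f}, std={2:.3f}".format(k, k_scores.mean(), k_scores.std()))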

Thanks for reading. Please share your inputs in the comments section.
