Building first Machine Learning model using Logistic Regression in Python – Step by Step

This post shows how to create our first machine learning predictive model using Logistic Regression in Python. When we start working on a Machine Learning project, we first perform some data wrangling and transformation to get a tidy dataset. Then, we perform some EDA to find trends, patterns, and outliers in the given dataset. Once we have machine-interpretable data in place, we choose an algorithm and train the model. Then, we evaluate it on the test data. Next, we can tune the hyperparameters of the model and retrain it to get a more robust model. Once the model performance is acceptable, we deploy it to make predictions. Typically, we follow these steps when creating a Machine Learning model:

Machine Learning process workflow
Now, let’s have a quick look at the dataset which we are going to use.

Wisconsin Breast Cancer dataset

The Wisconsin Breast Cancer dataset is a standard, preprocessed, cleaned binary classification dataset that comes with the Scikit-learn library. It contains 569 samples, of which 212 are malignant and 357 are benign. Each sample has 30 features. The target variable contains the diagnosis – 0 for malignant, 1 for benign. As a result, we have to create a model that can predict whether a given sample is malignant or benign.
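
The figures quoted above can be checked directly against the copy of the dataset that ships with Scikit-learn. This is only a quick sketch; the variable names n_samples, n_features, and class_counts are illustrative and not part of the original post.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Number of rows (samples) and columns (features) in the feature matrix
n_samples, n_features = data.data.shape

# Count how many samples fall into each target class (0 and 1)
counts = np.bincount(data.target)
class_counts = dict(zip(data.target_names.tolist(), counts.tolist()))

print(n_samples, n_features)
print(class_counts)
```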

Below is the attribute information:

Ten real-valued features are computed for each cell nucleus:

  1. radius – mean of distances from center to points on the perimeter
  2. texture – standard deviation of gray-scale values
  3. perimeter
  4. area
  5. smoothness – local variation in radius lengths
  6. compactness – perimeter^2 / area – 1.0
  7. concavity – severity of concave portions of the contour
  8. concave points – number of concave portions of the contour
  9. symmetry
  10. fractal dimension – “coastline approximation” – 1

Mean, standard error, and worst values of the above ten attributes are computed for each image. Now, let’s have a look at the sample data:

Top 5 sample rows
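
The three statistics described above (mean, standard error, worst) are reflected in the column names of the dataset. As a small sketch, the 30 feature names can be split into three groups of ten; the names mean_cols, error_cols, and worst_cols are chosen here for illustration only.

```python
from sklearn.datasets import load_breast_cancer

# Feature names as a plain Python list of strings
names = load_breast_cancer().feature_names.tolist()

# The columns are ordered as: 10 means, 10 standard errors, 10 worst values
mean_cols = names[0:10]
error_cols = names[10:20]
worst_cols = names[20:30]

print(mean_cols)
print(error_cols)
print(worst_cols)
```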

Data preprocessing and Exploration

Let’s load this dataset and analyze it in Python using a pandas dataframe.

from sklearn import datasets #import datasets from sklearn library
import pandas as pd #import pandas under alias pd
data = datasets.load_breast_cancer() #load breast cancer dataset in a variable named data

The variable named "data" is of type <class 'sklearn.utils.Bunch'> and is a dictionary-like object. The exact set of keys depends on the Scikit-learn version; the five keys/properties we care about here are:

  1. DESCR – Displays the description of the dataset
  2. data – Contains input features data in a numpy array with shape 569 x 30
  3. feature_names – Contains the name of the features
  4. target – Contains the target values for each of the 569 rows – shape (569, )
  5. target_names – Contains name of the target classes

We can access these properties using syntax like:

data.<property_name> or data['<property_name>'].
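
As a short illustration of the two access styles, the sketch below reads the same property both ways and prints the shapes mentioned in the list above. The names by_attribute and by_key are illustrative.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# List the keys available on this Bunch object
print(sorted(data.keys()))

# Attribute-style and dictionary-style access return the same values
by_attribute = data.target_names
by_key = data["target_names"]
print(by_attribute, by_key)

# Shapes of the feature matrix and the target vector
print(data.data.shape, data.target.shape)
```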

We can see that the input feature values and target variable values are stored separately, because Machine Learning algorithms expect input features and target variables in two different arrays.

Pandas dataframe is a very powerful and handy tool used for data analysis. It has many built-in methods that make the data analysis process very smooth. So, let’s create a dataframe using this data and perform a quick data analysis on this dataset. 

df = pd.DataFrame(data.data, columns = data.feature_names) #create a dataframe df with features as column names

To display a quick summary of the features:

print(df.head()) #print top 5 rows of the dataframe


Sample dataframe rows

Let’s generate a quick overview of the columns, their data types, value counts, and memory usage using the .info() method. Also, generate a quick statistical summary of the input columns using the .describe() method.

df.info() #print column names, datatypes and non-null value counts for each column (info() prints its own output)
print(df.describe()) #print statistical summary of the columns


Quick info and statistical summary of the dataframe

We can also use .isnull() and .isna() methods to verify the Null and NaN (Not a number) values in this dataset:
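
A minimal sketch of such a check, rebuilding the dataframe so the snippet runs on its own. In pandas, .isnull() is an alias of .isna(), so both calls report the same thing; the names null_counts and total_missing are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Missing values per column (isnull and isna are equivalent in pandas)
null_counts = df.isnull().sum()
print(null_counts)

# Total number of missing values across the whole dataframe
total_missing = int(df.isna().sum().sum())
print("Total missing values:", total_missing)
```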



Null and NaN values

All features in this dataset are numeric, which is what a Machine Learning model requires. Also, we don’t have any null or NaN values in this dataset. So, we can say that this dataset satisfies the tidy data principles and can be used in a Machine Learning model. However, before fitting this data into the model, let’s do some EDA on this dataset.

EDA (Exploratory Data Analysis) before building first Machine Learning model using Logistic Regression

Let’s plot a histogram for each feature. We can use this script:

#import pyplot from matplotlib library
import matplotlib.pyplot as plt
#Create a function to draw histograms for each feature
def draw_hist_all():
    #Lets split the dataframe in 3 dataframes - (1 - Mean, 2 - Standard Error, 3 - Worst)
    df1 = df.iloc[:,0:10]
    df2 = df.iloc[:,10:20]
    df3 = df.iloc[:,20:30]
    #Draw histogram of all features
    _ = df1.hist(xlabelsize = 8, ylabelsize = 8, bins = 4, figsize = (6, 4))
    _ = plt.tight_layout()
    _ = df2.hist(xlabelsize = 8, ylabelsize = 8, bins = 4, figsize = (6, 4))
    _ = plt.tight_layout()
    _ = df3.hist(xlabelsize = 8, ylabelsize = 8, bins = 4, figsize = (6, 4))
    _ = plt.tight_layout()
    plt.show() #display the three figures
#Call the function to draw the histograms
draw_hist_all()


Histogram of each feature

First, we created three dataframes of 10 columns each by splitting the main dataframe. Then, we used the .hist() method of the dataframe to plot a histogram of each feature. We can apply more EDA techniques to this dataframe before fitting the data into our Machine Learning model. Visit this link to know more about Exploratory data analysis techniques. Likewise, we can also do some feature engineering before fitting this data into a Machine Learning model. However, in this post, we are not going to demonstrate feature engineering.

What is Logistic Regression

In spite of its name, Logistic regression is used for classification problems and not for regression problems. That is to say, in its basic form it is a binomial regression that has a dependent variable with two possible outcomes, for example, True/False, Pass/Fail, healthy/sick, dead/alive, or 0/1.
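
To make the two-outcome idea concrete, here is a minimal sketch of the standard logistic (sigmoid) function, which maps a linear score to a probability between 0 and 1, followed by a 0.5 threshold to pick a class. The scores below are made-up values for illustration, not outputs of the model built in this post.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (w.x + b) for three samples
scores = np.array([-4.0, 0.0, 4.0])

# Probability of belonging to class 1 for each sample
probabilities = sigmoid(scores)

# Assign class 1 when the probability is at least 0.5, otherwise class 0
labels = (probabilities >= 0.5).astype(int)

print(probabilities)
print(labels)
```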

Types of Logistic Regression

  1. Binary Logistic Regression: The target variable has two possible outcomes only.
  2. Multinomial Logistic Regression: The target variable has three or more classes without ordering. 
  3. Ordinal Logistic Regression: The target variable has three or more categories with ordering.

To know more about Logistic regression, visit this link.

Building first Machine Learning model using Logistic Regression

First, we need to split the given data into two parts: the training dataset and the test dataset. We use the training dataset to train the model and the test dataset to evaluate its performance, because it is always recommended to evaluate a model on data it has not seen during training. Below is the code to train the model and predict on the test dataset. Run it after the code lines shown earlier, since it reuses the variable named data.

#import required modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
#Assign the feature and target data in two different variables
x_data = data.data
y_data = data.target
#Split the input dataset into two parts: Training dataset and Test dataset
(x_train, x_test, y_train, y_test) = train_test_split(
            x_data, y_data, stratify = y_data, test_size = 0.3, random_state = 21)
#Instantiate a logistic regression model
logreg = LogisticRegression()
logreg.fit(x_train, y_train) #Fit method is used to train the model with training dataset
y_pred = logreg.predict(x_test) #Predict method is used to predict the outcome on unseen data
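
As a quick sanity check on the split settings used above (test_size = 0.3, stratify on the target, random_state = 21), the sketch below prints the sizes of the two parts and the class proportions in each. With stratification, the proportions should be nearly identical. The names train_share and test_share are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, stratify=data.target, test_size=0.3, random_state=21)

# Number of samples in the training and test parts
print(x_train.shape, x_test.shape)

# Share of class 0 and class 1 in each part
train_share = np.bincount(y_train) / len(y_train)
test_share = np.bincount(y_test) / len(y_test)
print(train_share, test_share)
```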

Accuracy check using Confusion matrix

Now, it’s time to check the accuracy of our classification model. We can use different metrics to check the accuracy of our models. For example, we can use the confusion matrix, classification report, accuracy score, and roc_auc_score. Let’s check the model performance using a few of these metrics.

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
#Print confusion matrix
print('Confusion matrix is as:')
print(confusion_matrix(y_test, y_pred))


Confusion Matrix

As we have a binary classifier with two classes, 0 and 1, the dimension of this matrix is 2 x 2. The diagonal values of the matrix represent correct predictions, and the off-diagonal values represent incorrect predictions. We can use a dataframe to print this confusion matrix in a more readable form.

print(pd.DataFrame(confusion_matrix(y_test, y_pred), index = ['Actual 0', 'Actual 1'], columns = ['Predicted 0', 'Predicted 1' ]))


Confusion Matrix with labels

Of the 64 test samples whose actual class is 0 (row 1), the model predicted 58 as 0 and 6 as 1. Of the 107 test samples whose actual class is 1 (row 2), it predicted 3 as 0 and 104 as 1. Now, let’s print the classification report and accuracy score of the model:

print('Classification report is as:')
print(classification_report(y_test, y_pred))
acc = accuracy_score(y_test, y_pred)
print('Accuracy of model is {0}'.format(acc))


Classification report and accuracy score
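
The accuracy, precision, and recall values can also be worked out by hand from the four confusion matrix counts quoted earlier (58, 6, 3, and 104). The sketch below does that arithmetic with NumPy; the matrix is typed in from those counts rather than recomputed from the model.

```python
import numpy as np

# Rows are actual classes (0, 1); columns are predicted classes (0, 1)
cm = np.array([[58, 6],
               [3, 104]])

# Accuracy: correct predictions (the diagonal) divided by all predictions
accuracy = np.trace(cm) / cm.sum()

# Recall per class: correct predictions divided by the actual count (row sums)
recall = np.diag(cm) / cm.sum(axis=1)

# Precision per class: correct predictions divided by the predicted count (column sums)
precision = np.diag(cm) / cm.sum(axis=0)

print(round(float(accuracy), 3))
print(np.round(recall, 3))
print(np.round(precision, 3))
```

The accuracy here is 162 / 171, which matches the figure of about 94.7% reported for the model.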

We can see that we have built our first Machine Learning model using Logistic Regression. The accuracy of this classification model is approximately 94.7%, which is an acceptable score. However, we can further improve the accuracy of this model using feature engineering and some other techniques.
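
The roc_auc_score metric listed earlier is not demonstrated in this post. A hedged sketch of how it could be computed is shown below. It retrains a model so the snippet runs on its own, and it sets max_iter = 10000 as an assumption to give the solver room to converge on unscaled features, so the fitted model (and any number it prints) may differ from the one evaluated above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, stratify=data.target, test_size=0.3, random_state=21)

# max_iter is an assumption made for this sketch, not a setting from the post
logreg = LogisticRegression(max_iter=10000)
logreg.fit(x_train, y_train)

# roc_auc_score needs scores or probabilities, so use the probability of class 1
y_prob = logreg.predict_proba(x_test)[:, 1]
auc = roc_auc_score(y_test, y_prob)
print("ROC AUC of model is {0}".format(auc))
```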

Thanks for reading. Please share your inputs in the comments.

