Categorical variables are very common in machine learning datasets. These variables represent non-numeric values such as days of the week, gender, colors, etc. However, we typically need to convert these categorical variables to a numerical format before using them in machine learning algorithms. One-hot encoding is a widely used technique for this transformation, including for multi-valued categorical variables. In this blog, we’ll explore what one-hot encoding is, why it’s essential in the ML landscape, and how we can implement it using a Pandas DataFrame.
One-Hot Encoding for Multi-Valued Categorical Variables in Pandas DataFrame
Before we start, let’s quickly understand what One-Hot Encoding is.
What is One-Hot Encoding?
In machine learning, we utilize One-Hot Encoding as a data preprocessing technique. This technique involves transforming categorical variables into numerical values that enable ML algorithms to process them more effectively. It is especially beneficial for algorithms incapable of directly handling categorical data.
One-Hot Encoding transforms each category in a categorical variable into a distinct binary column. For each category, a new binary column is created, and for each data point, the column corresponding to its category is marked as “1” while the rest are marked as “0”. That is why a single categorical column may become multiple columns after applying One-Hot Encoding.
For example, a “Gender” categorical variable with values “Male” and “Female” will create two new binary columns: “Gender_Male” and “Gender_Female” after applying OneHotEncoding. The “Gender_Male” column would be marked as “1” if the data point represents Male, and the “Gender_Female” column would be marked as “0”. Similarly, if the data point represents a Female value, the “Gender_Female” column would be marked as “1”, and “Gender_Male” would be marked as “0”.
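The Gender example above can be sketched with pandas' get_dummies function. This is a minimal illustration on a made-up three-row DataFrame (the data is hypothetical, not from the dataset used later in this blog); dtype=int is passed so the output shows 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical sample data, used only to illustrate the Gender example
df = pd.DataFrame({"Gender": ["Male", "Female", "Female"]})

# One new binary column per category, named <column>_<category>
encoded = pd.get_dummies(df, columns=["Gender"], dtype=int)
print(encoded)
```

Each row has exactly one “1”, in the column that matches its original category.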
One-Hot Encoding allows machine learning algorithms to work with categorical data effectively, as it eliminates any unintended ordinal relationships and ensures that the algorithm treats each category independently. It is a common technique for preparing categorical data for machine learning.
Why is One-Hot Encoding essential?
As we know, machine learning algorithms typically work with numerical data. If we have a categorical variable in our dataset, we either need to remove it before fitting a machine learning algorithm or replace it with numerical values. The One-Hot Encoding technique helps us convert these categorical variables into numerical values without losing the information they carry. As a result, One-Hot Encoding supports accurate and meaningful model training by providing the following benefits:
- Makes data compatible with algorithms
- Removes ordinal assumptions that might mislead the algorithm
- Helps to preserve the information
- Prevents biases that might arise from assigning arbitrary numerical values to categories
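To make the point about ordinal assumptions concrete, here is a small sketch comparing arbitrary integer codes with one-hot columns. The Color values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical categorical values with no natural order
colors = pd.Series(["red", "green", "blue"], name="Color")

# Integer codes follow alphabetical category order (blue=0, green=1, red=2),
# which suggests an ordering that the data does not actually have
codes = colors.astype("category").cat.codes
print(codes.tolist())

# One-hot columns treat every category independently
one_hot = pd.get_dummies(colors, prefix="Color", dtype=int)
print(one_hot)
```

With the integer codes, a model could read “red” as larger than “blue”. The one-hot version carries no such ordering.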
How to implement One-Hot Encoding for Multi-Valued Categorical Variables in Pandas DataFrame
Suppose we have the dataset below and the desired output.
This dataset has a feature named “PurchaseType”. Each sample of the dataset represents one purchase type of a customer. However, a single customer ID can have multiple purchase types, so the same customer appears in several rows. If we one-hot encode this feature row by row, we end up with several rows per customer instead of one. To get a single one-hot encoded row per entity in datasets like this, we can use the logic below in Python.
import pandas as pd

if __name__ == "__main__":
    # One row per (customer, purchase type) pair
    df = pd.DataFrame({
        "CustomerID": ["CustID_1", "CustID_1", "CustID_2", "CustID_2"],
        "PurchaseType": ["Website Purchase", "Mobile App Purchase", "Mobile App Purchase", "Supermarket Purchase"]
    })
    # Pivot to one row per customer and one column per purchase type;
    # len counts the rows in each group, and missing combinations become 0
    df_transformed = df.pivot_table(index=["CustomerID"], columns=["PurchaseType"], aggfunc=[len], fill_value=0)
    print(df_transformed.head())
How does the above code work?
Step 1:
import pandas as pd: Imports the pandas library and assigns it the alias pd.
Step 2:
if __name__ == "__main__": This line is the standard Python idiom for checking whether the script is being run as the main program rather than imported as a module.
Step 3:
df = pd.DataFrame(…): This line creates a pandas DataFrame named df with two columns, “CustomerID” and “PurchaseType”. The DataFrame contains four rows of data, representing different customer IDs and the types of purchases made by those customers.
Step 4:
df_transformed = df.pivot_table(…): This line creates a new DataFrame, df_transformed, using the .pivot_table() method from the pandas library. We use this method to reshape and aggregate the data. Here, it groups the original DataFrame by the “CustomerID” and “PurchaseType” columns and counts the occurrences in each group using the len function.
The arguments (index=["CustomerID"], columns=["PurchaseType"], aggfunc=[len], fill_value=0) specify that the “CustomerID” column is used as the row index of the pivoted DataFrame and that the “PurchaseType” column supplies its columns. The aggregation function aggfunc=[len] counts the number of occurrences in each group. Finally, fill_value=0 specifies that missing values are filled with 0.
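The same reshaping can also be sketched with pd.crosstab, which counts how often each pair of values from two columns occurs together. This is an alternative to the code above, not a replacement for it. It returns a single level of column names, whereas passing aggfunc as a list to .pivot_table() adds an extra “len” level on top of the columns. It also sidesteps any question of how .pivot_table() behaves when the DataFrame has no value column left to aggregate, which is worth checking on your installed pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    "CustomerID": ["CustID_1", "CustID_1", "CustID_2", "CustID_2"],
    "PurchaseType": ["Website Purchase", "Mobile App Purchase",
                     "Mobile App Purchase", "Supermarket Purchase"],
})

# Rows come from CustomerID, columns from PurchaseType;
# each cell is the number of matching rows, and absent pairs are 0
counts = pd.crosstab(df["CustomerID"], df["PurchaseType"])
print(counts)
```

Here CustID_1 gets a 1 under “Website Purchase” and “Mobile App Purchase” and a 0 under “Supermarket Purchase”.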
Step 5:
print(df_transformed.head()): This line prints the first few rows of the transformed DataFrame df_transformed. The .head() function displays the top rows of the DataFrame.
In summary, the code creates a DataFrame df with customer ID and purchase type columns, then uses the .pivot_table() function to reshape and aggregate it into a new DataFrame, df_transformed, which shows the count of each purchase type for each customer ID. The len function counts the rows in each (customer ID, purchase type) group, so it yields 1 when a purchase type appears once for a customer. Combinations that never occur have no group at all, which would leave NaN in the pivoted table, so we use the fill_value option to replace those NaN values with 0. That is how we implement One-Hot Encoding for multi-valued categorical variables using a Pandas DataFrame. Finally, the print statement displays the first few rows of the transformed DataFrame.
Thanks for reading. Please share your thoughts in the comments section.
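One caveat is that counting rows gives a value above 1 when a customer repeats the same purchase type, which is no longer a strict one-hot (0/1) encoding. A minimal sketch of one way to force binary indicators is shown below, assuming we only care whether a purchase type occurred at all. The duplicated “Website Purchase” row is hypothetical and added only to show the effect.

```python
import pandas as pd

# Hypothetical data: CustID_1 has "Website Purchase" twice
df = pd.DataFrame({
    "CustomerID": ["CustID_1", "CustID_1", "CustID_1", "CustID_2"],
    "PurchaseType": ["Website Purchase", "Website Purchase",
                     "Mobile App Purchase", "Supermarket Purchase"],
})

# One 0/1 column per purchase type, still one row per original sample
dummies = pd.get_dummies(df["PurchaseType"], dtype=int)

# Collapse to one row per customer; max keeps 1 if the type occurred at all
one_hot = dummies.groupby(df["CustomerID"]).max()
print(one_hot)
```

Taking the maximum rather than the count keeps every cell at 0 or 1 regardless of how many times a purchase type repeats.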