One Hot Encoding: Understanding the "Hot" in Data

Preparing categorical data correctly is a fundamental step in machine learning, particularly when using linear models. One Hot Encoding stands out as a key technique, enabling the transformation of categorical variables into a machine-understandable format. This post explains why you cannot use a categorical variable directly and demonstrates the use of One Hot Encoding to identify the most predictive categorical features for linear regression.

Let’s get started.

Overview

This post is divided into three parts; they are:

What is One Hot Encoding?
Identifying the Most Predictive Categorical Feature
Evaluating Individual Features’ Predictive Power

What is One Hot Encoding?

In data preprocessing for linear models, “One Hot Encoding” is a crucial technique for managing categorical data. In this method, “hot” signifies a category’s presence (encoded as one), while “cold” (or zero) signals its absence, using binary vectors for representation.

From the angle of levels of measurement, categorical data are nominal data, which means that if we used numbers as labels (e.g., 1 for male and 2 for female), operations such as addition and subtraction would not make sense. And if the labels are not numbers, you cannot do any math with them at all.

One Hot Encoding separates each category of a variable into distinct features, preventing the misinterpretation of categorical data as having some ordinal significance in linear regression and other linear models. After encoding, each number carries a clear meaning (the category is present or absent), so it can readily be used in a math equation.

For instance, consider a categorical feature like “Color” with the values Red, Blue, and Green. One Hot Encoding translates this into three binary features (“Color_Red,” “Color_Blue,” and “Color_Green”), each indicating the presence (1) or absence (0) of a color for each observation. Such a representation clarifies to the model that these categories are distinct, with no inherent order.

Why does this matter? Many machine learning models, including linear regression, operate on numerical data and assume a numerical relationship between values. Directly encoding categories as numbers (e.g., Red=1, Blue=2, Green=3) could imply a non-existent hierarchy or quantitative relationship, potentially skewing predictions. One Hot Encoding sidesteps this issue, preserving the categorical nature of the data in a form that models can accurately interpret.
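As a small illustration of the “Color” example, the sketch below encodes a toy column with pandas. The DataFrame and its values are made up for illustration and are not part of the Ames dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Ames dataset
colors = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue"]})

# get_dummies creates one binary column per category, named <column>_<category>
encoded = pd.get_dummies(colors, columns=["Color"], dtype=int)
print(encoded)
```

Each row of the result contains exactly one 1, in the column that matches the row's original color, and no column implies that one color ranks above another.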

Let’s apply this technique to the Ames dataset, demonstrating the transformation process with an example:

# Load only categorical columns without missing values from the Ames dataset
import pandas as pd
Ames = pd.read_csv("Ames.csv").select_dtypes(include=["object"]).dropna(axis=1)
print(f"The shape of the DataFrame before One Hot Encoding is: {Ames.shape}")

# Import OneHotEncoder and apply it to Ames
# (scikit-learn 1.2+ uses sparse_output=False; older releases use sparse=False)
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
Ames_One_Hot = encoder.fit_transform(Ames)

# Convert the encoded result back to a DataFrame
Ames_encoded_df = pd.DataFrame(Ames_One_Hot, columns=encoder.get_feature_names_out(Ames.columns))

# Display the new DataFrame and its expanded shape
print(Ames_encoded_df.head())
print(f"The shape of the DataFrame after One Hot Encoding is: {Ames_encoded_df.shape}")

This will output:

The shape of the DataFrame before One Hot Encoding is: (2579, 27)
   MSZoning_A (agr)  ...  SaleCondition_Partial
0               0.0  ...                    0.0
1               0.0  ...                    0.0
2               0.0  ...                    0.0
3               0.0  ...                    0.0
4               0.0  ...                    0.0

[5 rows x 188 columns]
The shape of the DataFrame after One Hot Encoding is: (2579, 188)

As seen, the Ames dataset’s categorical columns are converted into 188 distinct features, illustrating the expanded complexity and detailed representation that One Hot Encoding provides. This expansion, while increasing the dimensionality of the dataset, is a crucial preprocessing step when modeling the relationship between categorical features and the target variable in linear regression.
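The number of columns after encoding equals the total number of distinct categories across the original columns. A minimal sketch on a made-up DataFrame (the column names and values are hypothetical) shows the relationship:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: two categorical columns
toy = pd.DataFrame({
    "Color": ["Red", "Blue", "Green", "Blue"],
    "Size": ["S", "M", "S", "M"],
})

# Default settings: one output column per category, returned as a sparse matrix
encoder = OneHotEncoder()
toy_encoded = encoder.fit_transform(toy)

# 3 distinct colors + 2 distinct sizes = 5 encoded columns
n_categories = int(toy.nunique().sum())
print(toy_encoded.shape[1], n_categories)
```

For the Ames data, the same check would be Ames.nunique().sum(), which should match the 188 columns reported above because the selected columns contain no missing values.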

Identifying the Most Predictive Categorical Feature

After understanding the basic premise and application of One Hot Encoding in linear models, the next step in our analysis involves identifying which categorical feature contributes most significantly to predicting our target variable. In the code snippet below, we iterate through each categorical feature in our dataset, apply One Hot Encoding, and evaluate its predictive power using a linear regression model in conjunction with cross-validation. Here, the drop="first" parameter of the OneHotEncoder class plays a vital role:

# Building on the code above to identify the top categorical feature
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Set 'SalePrice' as the target variable
y = pd.read_csv("Ames.csv")["SalePrice"]

# Dictionary to store feature names and their corresponding mean CV R² scores
feature_scores = {}

for feature in Ames.columns:
    # Encode one feature at a time, dropping its first category
    encoder = OneHotEncoder(drop="first")
    X_encoded = encoder.fit_transform(Ames[[feature]])

    # Initialize the linear regression model
    model = LinearRegression()

    # Perform 5-fold cross-validation (the default) and calculate R² scores
    scores = cross_val_score(model, X_encoded, y)
    mean_score = scores.mean()

    # Store the mean R² score
    feature_scores[feature] = mean_score

# Sort features based on their mean CV R² scores in descending order
sorted_features = sorted(feature_scores.items(), key=lambda item: item[1], reverse=True)
print("Feature selected for highest predictability:", sorted_features[0][0])

The drop="first" parameter is used to mitigate perfect collinearity. When every category gets its own column, the columns for a feature always sum to one, which duplicates the model’s intercept. By dropping the first category (encoding it implicitly as zeros across all remaining columns for that feature), we reduce redundancy and the number of input variables without losing any information. This simplifies the model and makes it easier to interpret, since each coefficient is measured relative to the dropped category. The code above will output:

Feature selected for highest predictability: Neighborhood

Our analysis reveals that “Neighborhood” is the categorical feature with the highest predictability in our dataset. This finding highlights the significant impact of location on housing prices within the Ames dataset.
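To see the redundancy that drop="first" removes, the sketch below compares the two encodings on a made-up column and checks the rank of the resulting design matrix once an intercept column is added. The toy data is an assumption for illustration only.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column with three categories
X = np.array([["Red"], ["Blue"], ["Green"], ["Blue"]])

full = OneHotEncoder().fit_transform(X).toarray()                 # 3 dummy columns
reduced = OneHotEncoder(drop="first").fit_transform(X).toarray()  # 2 dummy columns

# Full encoding: every row sums to 1, which duplicates an intercept column
print(full.sum(axis=1))

# Prepend an intercept column and compare matrix rank with column count
ones = np.ones((X.shape[0], 1))
design_full = np.hstack([ones, full])        # 4 columns
design_reduced = np.hstack([ones, reduced])  # 3 columns
rank_full = np.linalg.matrix_rank(design_full)
rank_reduced = np.linalg.matrix_rank(design_reduced)
print(design_full.shape[1], rank_full)
print(design_reduced.shape[1], rank_reduced)
```

The full design has more columns than its rank, so one column is redundant. The reduced design has as many columns as its rank, and the dropped category is still recoverable as the row where all remaining dummies are zero.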

Evaluating Individual Features’ Predictive Power

With One Hot Encoding understood and the most predictive categorical feature identified, we now expand our analysis to the top five categorical features that significantly impact housing prices. This step is essential for fine-tuning our predictive model, enabling us to focus on the features that offer the most value in forecasting outcomes. By evaluating each feature’s mean cross-validated R² score, we can not only determine the importance of these features individually but also gain insights into how different aspects of a property contribute to its overall valuation.

Let’s delve into this evaluation:

# Building on the code above to determine the performance of the top 5 categorical features
print("Top 5 Categorical Features:")
for feature, score in sorted_features[0:5]:
    print(f"{feature}: Mean CV R² = {score:.4f}")

The output from our analysis presents a revealing snapshot of the factors that play pivotal roles in determining housing prices:

Top 5 Categorical Features:
Neighborhood: Mean CV R² = 0.5407
ExterQual: Mean CV R² = 0.4651
KitchenQual: Mean CV R² = 0.4373
Foundation: Mean CV R² = 0.2547
HeatingQC: Mean CV R² = 0.1892

This result accentuates the importance of the feature “Neighborhood” as the top predictor, reinforcing the idea that location significantly influences housing prices. Following closely are “ExterQual” (Exterior Material Quality) and “KitchenQual” (Kitchen Quality), which highlight the premium buyers place on the quality of construction and finishes. “Foundation” and “HeatingQC” (Heating Quality and Condition) also emerge as significant, albeit with lower predictive power, suggesting that structural integrity and comfort features are critical considerations for home buyers.
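The ranking above scores one feature at a time, with the encoder fitted on all rows before cross-validation. A common variation, sketched below on synthetic data, wraps the encoder and the model in a scikit-learn pipeline so that the encoding is refit on the training portion of each fold. This is not the procedure used in this post: the data, the column names (strong, noise), and the effect sizes are invented purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 300

# Synthetic data (hypothetical): "strong" drives the target, "noise" does not
X = pd.DataFrame({
    "strong": rng.choice(["A", "B", "C"], size=n),
    "noise": rng.choice(["X", "Y", "Z"], size=n),
})
effect = X["strong"].map({"A": 0.0, "B": 10.0, "C": 20.0})
y = effect + rng.normal(scale=1.0, size=n)

scores = {}
for feature in X.columns:
    # The pipeline refits the encoder on the training part of each fold;
    # handle_unknown="ignore" encodes unseen categories as all zeros
    pipe = make_pipeline(
        OneHotEncoder(handle_unknown="ignore"),
        LinearRegression(),
    )
    scores[feature] = cross_val_score(pipe, X[[feature]], y, cv=5).mean()

# Rank features by mean cross-validated R², highest first
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for feature, score in ranked:
    print(f"{feature}: Mean CV R² = {score:.4f}")
```

On this synthetic data the feature that drives the target ranks first, while the unrelated feature scores close to zero. One Hot Encoding never looks at the target, so the difference from the loop above is mainly in how categories missing from a training fold are handled.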

Summary

In this post, we focused on the critical process of preparing categorical data for linear models. Starting with an explanation of One Hot Encoding, we showed how this technique makes categorical data interpretable for linear regression by creating binary vectors. Our analysis identified “Neighborhood” as the categorical feature with the highest impact on housing prices, underscoring location’s pivotal role in real estate valuation.

Specifically, you learned:

One Hot Encoding’s role in converting categorical data to a format usable by linear models, preventing the algorithm from misinterpreting the data’s nature.
The importance of the drop="first" parameter in One Hot Encoding to avoid perfect collinearity in linear models.
How to evaluate the predictive power of individual categorical features and rank their performance within the context of linear models.

Do you have any questions? Please ask your questions in the comments below, and I will do my best to answer.

About Vinod Chugani

Born in India and nurtured in Japan, I am a Third Culture Kid with a global perspective. My academic journey at Duke University included majoring in Economics, with the honor of being inducted into Phi Beta Kappa in my junior year. Over the years, I’ve gained diverse professional experiences, spending a decade navigating Wall Street’s intricate Fixed Income sector, followed by leading a global distribution venture on Main Street.
Currently, I channel my passion for data science, machine learning, and AI as a Mentor at the New York City Data Science Academy. I value the opportunity to ignite curiosity and share knowledge, whether through Live Learning sessions or in-depth 1-on-1 interactions.
With a foundation in finance/entrepreneurship and my current immersion in the data realm, I approach the future with a sense of purpose and assurance. I anticipate further exploration, continuous learning, and the opportunity to contribute meaningfully to the ever-evolving fields of data science and machine learning, especially here at MLM.
