Exploring LightGBM: Leaf-Wise Growth with GBDT and GOSS

LightGBM is a highly efficient gradient boosting framework. It has gained traction for its speed and performance, particularly with large and complex datasets. Developed by Microsoft, this powerful algorithm is known for its unique ability to handle large volumes of data with significant ease compared to traditional methods.

In this post, we will experiment with LightGBM framework on the Ames Housing dataset. In particular, we will shed some light on its versatile boosting strategies—Gradient Boosting Decision Tree (GBDT) and Gradient-based One-Side Sampling (GOSS). These strategies offer distinct advantages. Through this post, we will compare their performance and characteristics.

We begin by setting up LightGBM and proceed to examine its application in both theoretical and practical contexts.

Let’s get started.

LightGBM
Photo by Marcus Dall Col. Some rights reserved.

Overview

This post is divided into four parts; they are:

Introduction to LightGBM and Initial Setup
Testing LightGBM’s GBDT and GOSS on the Ames Dataset
Fine-Tuning LightGBM’s Tree Growth: A Focus on Leaf-wise Strategy
Comparing Feature Importance in LightGBM’s GBDT and GOSS Models

Introduction to LightGBM and Initial Setup

LightGBM (Light Gradient Boosting Machine) was developed by Microsoft. It is a machine learning framework that provides the necessary components and utilities to build, train, and deploy machine learning models. The models are based on decision tree algorithms and use gradient boosting at its core. The framework is open source and can be installed on your system using the following command:

This command will download and install the LightGBM package along with its necessary dependencies.

While LightGBM, XGBoost, and Gradient Boosting Regressor (GBR) are all based on the principle of gradient boosting, several key distinctions set LightGBM apart due to both its default behaviors and a range of optional parameters that enhance its functionality:

Exclusive Feature Bundling (EFB): As a default feature, LightGBM employs EFB to reduce the number of features, which is particularly useful for high-dimensional sparse data. This process is automatic, helping to manage data dimensionality efficiently without extensive manual intervention.
Gradient-Based One-Side Sampling (GOSS): As an optional parameter that can be enabled, GOSS retains instances with large gradients. The gradient represents how much the loss function would change if the model’s prediction for that instance changed slightly. A large gradient means that the current model’s prediction for that data point is far from the actual target value. Instances with large gradients are considered more important for training because they represent areas where the model needs significant improvement. In the GOSS algorithm, instances with large gradients are often referred to as “under-trained” because they indicate areas where the model’s performance is poor and needs more focus during training. The GOSS algorithm specifically retains all instances with large gradients in its sampling process, ensuring that these critical data points are always included in the training subset. On the other hand, instances with small gradients are considered “well-trained” because the model’s predictions for these points are closer to the actual values, resulting in smaller errors.
Leaf-wise Tree Growth: Whereas both GBR and XGBoost typically grow trees level-wise, LightGBM default tree growth strategy is leaf-wise. Unlike level-wise growth, where all nodes at a given depth are split before moving to the next level, LightGBM grows trees by choosing to split the leaf that results in the largest decrease in the loss function. This approach can lead to asymmetric, irregular trees of larger depth, which can be more expressive and efficient than balanced trees grown level-wise.

These are a few characteristics of LightGBM that differentiate it from the traditional GBR and XGBoost. With these unique advantages in mind, we are prepared to delve into the empirical side of our exploration.

Testing LightGBM’s GBDT and GOSS on the Ames Dataset

Building on our understanding of LightGBM’s distinct features, this segment shifts from theory to practice. We will utilize the Ames Housing dataset to rigorously test two specific boosting strategies within the LightGBM framework: the standard Gradient Boosting Decision Tree (GBDT) and the innovative Gradient-based One-Side Sampling (GOSS). We aim to explore these techniques and provide a comparative analysis of their effectiveness.

Before we dive into the model building, it’s crucial to prepare the dataset properly. This involves loading the data and ensuring all categorical features are correctly processed, taking full advantage of LightGBM’s handling of categorical variables. Like XGBoost, LightGBM can natively handle missing values and categorical data, simplifying the preprocessing steps and leading to more robust models. This capability is crucial as it directly influences the accuracy and efficiency of the model training process.

# Import libraries to run LightGBM import pandas as pd import lightgbm as lgb from sklearn.model_selection import cross_val_score # Load the Ames Housing Dataset data = pd.read_csv(‘Ames.csv’) X = data.drop(‘SalePrice’, axis=1) y = data[‘SalePrice’] # Convert categorical columns to ‘category’ dtype categorical_cols = X.select_dtypes(include=[‘object’]).columns X[categorical_cols] = X[categorical_cols].apply(lambda x: x.astype(‘category’)) # Define the default GBDT model gbdt_model = lgb.LGBMRegressor() gbdt_scores = cross_val_score(gbdt_model, X, y, cv=5) print(f”Average R² score for default Light GBM (with GBDT): {gbdt_scores.mean():.4f}”) # Define the GOSS model goss_model = lgb.LGBMRegressor(boosting_type=”goss”) goss_scores = cross_val_score(goss_model, X, y, cv=5) print(f”Average R² score for Light GBM with GOSS: {goss_scores.mean():.4f}”)

# Import libraries to run LightGBM

import pandas as pd

import lightgbm as lgb

from sklearn.model_selection import cross_val_score

# Load the Ames Housing Dataset

data = pd.read_csv(‘Ames.csv’)

X = data.drop(‘SalePrice’, axis=1)

y = data[‘SalePrice’]

# Convert categorical columns to ‘category’ dtype

categorical_cols = X.select_dtypes(include=[‘object’]).columns

X[categorical_cols] = X[categorical_cols].apply(lambda x: x.astype(‘category’))

# Define the default GBDT model

gbdt_model = lgb.LGBMRegressor()

gbdt_scores = cross_val_score(gbdt_model, X, y, cv=5)

print(f“Average R² score for default Light GBM (with GBDT): {gbdt_scores.mean():.4f}”)

# Define the GOSS model

goss_model = lgb.LGBMRegressor(boosting_type=‘goss’)

goss_scores = cross_val_score(goss_model, X, y, cv=5)

print(f“Average R² score for Light GBM with GOSS: {goss_scores.mean():.4f}”)

Results:

Average R² score for default Light GBM (with GBDT): 0.9145 Average R² score for Light GBM with GOSS: 0.9109

Average R² score for default Light GBM (with GBDT): 0.9145

Average R² score for Light GBM with GOSS: 0.9109

The initial results from our 5-fold cross-validation experiments provide intriguing insights into the performance of the two models. The default GBDT model achieved an average R² score of 0.9145, demonstrating robust predictive accuracy. On the other hand, the GOSS model, which specifically targets instances with large gradients, recorded a slightly lower average R² score of 0.9109.

The slight difference in performance might be attributed to the way GOSS prioritizes certain data points over others, which can be particularly beneficial in datasets where mispredictions are more concentrated. However, in a relatively homogeneous dataset like Ames, the advantages of GOSS may not be fully realized.

Fine-Tuning LightGBM’s Tree Growth: A Focus on Leaf-wise Strategy

One of the distinguishing features of LightGBM is its ability to construct decision trees leaf-wise rather than level-wise. This leaf-wise approach allows trees to grow by optimizing loss reductions, potentially leading to better model performance but posing a risk of overfitting if not properly tuned. In this section, we explore the impact of varying the number of leaves in a tree.

We start by defining a series of experiments to systematically test how different settings for num_leaves affect the performance of two LightGBM variants: the traditional Gradient Boosting Decision Tree (GBDT) and the Gradient-based One-Side Sampling (GOSS). These experiments are crucial for identifying the optimal complexity level of the models for our specific dataset—the Ames Housing Dataset.

# Experiment with Leaf-wise Tree Growth import pandas as pd import lightgbm as lgb from sklearn.model_selection import cross_val_score # Load the Ames Housing Dataset data = pd.read_csv(‘Ames.csv’) X = data.drop(‘SalePrice’, axis=1) y = data[‘SalePrice’] # Convert categorical columns to ‘category’ dtype categorical_cols = X.select_dtypes(include=[‘object’]).columns X[categorical_cols] = X[categorical_cols].apply(lambda x: x.astype(‘category’)) # Define a range of leaf sizes to test leaf_sizes = [5, 10, 15, 31, 50, 100] # Results storage results = {} # Experiment with different leaf sizes for GBDT results[‘GBDT’] = {} print(“Testing different ‘num_leaves’ for GBDT:”) for leaf_size in leaf_sizes: model = lgb.LGBMRegressor(boosting_type=”gbdt”, num_leaves=leaf_size) scores = cross_val_score(model, X, y, cv=5, scoring=’r2′) results[‘GBDT’][leaf_size] = scores.mean() print(f”num_leaves = {leaf_size}: Average R² score = {scores.mean():.4f}”) # Experiment with different leaf sizes for GOSS results[‘GOSS’] = {} print(“\nTesting different ‘num_leaves’ for GOSS:”) for leaf_size in leaf_sizes: model = lgb.LGBMRegressor(boosting_type=”goss”, num_leaves=leaf_size) scores = cross_val_score(model, X, y, cv=5, scoring=’r2′) results[‘GOSS’][leaf_size] = scores.mean() print(f”num_leaves = {leaf_size}: Average R² score = {scores.mean():.4f}”)

# Experiment with Leaf-wise Tree Growth

import pandas as pd

import lightgbm as lgb

from sklearn.model_selection import cross_val_score

# Load the Ames Housing Dataset

data = pd.read_csv(‘Ames.csv’)

X = data.drop(‘SalePrice’, axis=1)

y = data[‘SalePrice’]

# Convert categorical columns to ‘category’ dtype

categorical_cols = X.select_dtypes(include=[‘object’]).columns

X[categorical_cols] = X[categorical_cols].apply(lambda x: x.astype(‘category’))

# Define a range of leaf sizes to test

leaf_sizes = [5, 10, 15, 31, 50, 100]

# Results storage

results = {}

# Experiment with different leaf sizes for GBDT

results[‘GBDT’] = {}

print(“Testing different ‘num_leaves’ for GBDT:”)

for leaf_size in leaf_sizes:

model = lgb.LGBMRegressor(boosting_type=‘gbdt’, num_leaves=leaf_size)

scores = cross_val_score(model, X, y, cv=5, scoring=‘r2’)

results[‘GBDT’][leaf_size] = scores.mean()

print(f“num_leaves = {leaf_size}: Average R² score = {scores.mean():.4f}”)

# Experiment with different leaf sizes for GOSS

results[‘GOSS’] = {}

print(“\nTesting different ‘num_leaves’ for GOSS:”)

for leaf_size in leaf_sizes:

model = lgb.LGBMRegressor(boosting_type=‘goss’, num_leaves=leaf_size)

scores = cross_val_score(model, X, y, cv=5, scoring=‘r2’)

results[‘GOSS’][leaf_size] = scores.mean()

print(f“num_leaves = {leaf_size}: Average R² score = {scores.mean():.4f}”)

Results:

Testing different ‘num_leaves’ for GBDT: num_leaves = 5: Average R² score = 0.9150 num_leaves = 10: Average R² score = 0.9193 num_leaves = 15: Average R² score = 0.9158 num_leaves = 31: Average R² score = 0.9145 num_leaves = 50: Average R² score = 0.9111 num_leaves = 100: Average R² score = 0.9101 Testing different ‘num_leaves’ for GOSS: num_leaves = 5: Average R² score = 0.9151 num_leaves = 10: Average R² score = 0.9168 num_leaves = 15: Average R² score = 0.9130 num_leaves = 31: Average R² score = 0.9109 num_leaves = 50: Average R² score = 0.9117 num_leaves = 100: Average R² score = 0.9124

Testing different ‘num_leaves’ for GBDT:

num_leaves = 5: Average R² score = 0.9150

num_leaves = 10: Average R² score = 0.9193

num_leaves = 15: Average R² score = 0.9158

num_leaves = 31: Average R² score = 0.9145

num_leaves = 50: Average R² score = 0.9111

num_leaves = 100: Average R² score = 0.9101

Testing different ‘num_leaves’ for GOSS:

num_leaves = 5: Average R² score = 0.9151

num_leaves = 10: Average R² score = 0.9168

num_leaves = 15: Average R² score = 0.9130

num_leaves = 31: Average R² score = 0.9109

num_leaves = 50: Average R² score = 0.9117

num_leaves = 100: Average R² score = 0.9124

The results from our cross-validation experiments provide insightful data on how the num_leaves parameter influences the performance of GBDT and GOSS models. Both models perform optimally at a num_leaves setting of 10, achieving the highest R² scores. This indicates that a moderate level of complexity suffices to capture the underlying patterns in the Ames Housing dataset without overfitting. This finding is particularly interesting, given that the default setting for num_leaves in LightGBM is 31.

For GBDT, increasing the number of leaves beyond 10 leads to a decrease in performance, suggesting that too much complexity can detract from the model’s generalization capabilities. In contrast, GOSS shows a slightly more tolerant behavior towards higher leaf counts, although the improvements plateau, indicating no further gains from increased complexity.

This experiment underscores the importance of tuning num_leaves in LightGBM. By carefully selecting this parameter, we can effectively balance model accuracy and complexity, ensuring robust performance across different data scenarios. Further experimentation with other parameters in conjunction with num_leaves could potentially unlock even better performance and stability.

Comparing Feature Importance in LightGBM’s GBDT and GOSS Models

After fine-tuning the num_leaves parameter and assessing the basic performance of the GBDT and GOSS models, we now shift our focus to understanding the influence of individual features within these models. In this section, we explore the most important features by each boosting strategy through visualization.

Here is the code that achieves this:

# Importing libraries to compare feature importance between GBDT and GOSS: import pandas as pd import numpy as np import lightgbm as lgb from sklearn.model_selection import KFold import matplotlib.pyplot as plt import seaborn as sns # Prepare data data = pd.read_csv(‘Ames.csv’) X = data.drop(‘SalePrice’, axis=1) y = data[‘SalePrice’] categorical_cols = X.select_dtypes(include=[‘object’]).columns X[categorical_cols] = X[categorical_cols].apply(lambda x: x.astype(‘category’)) # Set up K-fold cross-validation kf = KFold(n_splits=5) gbdt_feature_importances = [] goss_feature_importances = [] # Iterate over each split for train_index, test_index in kf.split(X): X_train, X_test = X.iloc[train_index], X.iloc[test_index] y_train, y_test = y.iloc[train_index], y.iloc[test_index] # Train GBDT model with optimal num_leaves gbdt_model = lgb.LGBMRegressor(boosting_type=”gbdt”, num_leaves=10) gbdt_model.fit(X_train, y_train) gbdt_feature_importances.append(gbdt_model.feature_importances_) # Train GOSS model with optimal num_leaves goss_model = lgb.LGBMRegressor(boosting_type=”goss”, num_leaves=10) goss_model.fit(X_train, y_train) goss_feature_importances.append(goss_model.feature_importances_) # Average feature importance across all folds for each model avg_gbdt_feature_importance = np.mean(gbdt_feature_importances, axis=0) avg_goss_feature_importance = np.mean(goss_feature_importances, axis=0) # Convert to DataFrame feat_importances_gbdt = pd.DataFrame({‘Feature’: X.columns, ‘Importance’: avg_gbdt_feature_importance}) feat_importances_goss = pd.DataFrame({‘Feature’: X.columns, ‘Importance’: avg_goss_feature_importance}) # Sort and take the top 10 features top_gbdt_features = feat_importances_gbdt.sort_values(by=’Importance’, ascending=False).head(10) top_goss_features = feat_importances_goss.sort_values(by=’Importance’, ascending=False).head(10) # Plotting plt.figure(figsize=(16, 12)) plt.subplot(1, 2, 1) sns.barplot(data=top_gbdt_features, y=’Feature’, x=’Importance’, orient=”h”, palette=”viridis”) plt.title(‘Top 10 LightGBM GBDT Features’, fontsize=18) plt.xlabel(‘Importance’, fontsize=16) plt.ylabel(‘Feature’, fontsize=16) plt.xticks(fontsize=13) plt.yticks(fontsize=14) plt.subplot(1, 2, 2) sns.barplot(data=top_goss_features, y=’Feature’, x=’Importance’, orient=”h”, palette=”viridis”) plt.title(‘Top 10 LightGBM GOSS Features’, fontsize=18) plt.xlabel(‘Importance’, fontsize=16) plt.ylabel(‘Feature’, fontsize=16) plt.xticks(fontsize=13) plt.yticks(fontsize=14) plt.tight_layout() plt.show()

# Importing libraries to compare feature importance between GBDT and GOSS:

import pandas as pd

import numpy as np

import lightgbm as lgb

from sklearn.model_selection import KFold

import matplotlib.pyplot as plt

import seaborn as sns

# Prepare data

data = pd.read_csv(‘Ames.csv’)

X = data.drop(‘SalePrice’, axis=1)

y = data[‘SalePrice’]

categorical_cols = X.select_dtypes(include=[‘object’]).columns

X[categorical_cols] = X[categorical_cols].apply(lambda x: x.astype(‘category’))

# Set up K-fold cross-validation

kf = KFold(n_splits=5)

gbdt_feature_importances = []

goss_feature_importances = []

# Iterate over each split

for train_index, test_index in kf.split(X):

X_train, X_test = X.iloc[train_index], X.iloc[test_index]

y_train, y_test = y.iloc[train_index], y.iloc[test_index]

# Train GBDT model with optimal num_leaves

gbdt_model = lgb.LGBMRegressor(boosting_type=‘gbdt’, num_leaves=10)

gbdt_model.fit(X_train, y_train)

gbdt_feature_importances.append(gbdt_model.feature_importances_)

# Train GOSS model with optimal num_leaves

goss_model = lgb.LGBMRegressor(boosting_type=‘goss’, num_leaves=10)

goss_model.fit(X_train, y_train)

goss_feature_importances.append(goss_model.feature_importances_)

# Average feature importance across all folds for each model

avg_gbdt_feature_importance = np.mean(gbdt_feature_importances, axis=0)

avg_goss_feature_importance = np.mean(goss_feature_importances, axis=0)

# Convert to DataFrame

feat_importances_gbdt = pd.DataFrame({‘Feature’: X.columns, ‘Importance’: avg_gbdt_feature_importance})

feat_importances_goss = pd.DataFrame({‘Feature’: X.columns, ‘Importance’: avg_goss_feature_importance})

# Sort and take the top 10 features

top_gbdt_features = feat_importances_gbdt.sort_values(by=‘Importance’, ascending=False).head(10)

top_goss_features = feat_importances_goss.sort_values(by=‘Importance’, ascending=False).head(10)

# Plotting

plt.figure(figsize=(16, 12))

plt.subplot(1, 2, 1)

sns.barplot(data=top_gbdt_features, y=‘Feature’, x=‘Importance’, orient=‘h’, palette=‘viridis’)

plt.title(‘Top 10 LightGBM GBDT Features’, fontsize=18)

plt.xlabel(‘Importance’, fontsize=16)

plt.ylabel(‘Feature’, fontsize=16)

plt.xticks(fontsize=13)

plt.yticks(fontsize=14)

plt.subplot(1, 2, 2)

sns.barplot(data=top_goss_features, y=‘Feature’, x=‘Importance’, orient=‘h’, palette=‘viridis’)

plt.title(‘Top 10 LightGBM GOSS Features’, fontsize=18)

plt.xlabel(‘Importance’, fontsize=16)

plt.ylabel(‘Feature’, fontsize=16)

plt.xticks(fontsize=13)

plt.yticks(fontsize=14)

plt.tight_layout()

plt.show()

Using the same Ames Housing dataset, we applied a k-fold cross-validation method to maintain consistency with our previous experiments. However, this time, we concentrated on extracting and analyzing the feature importance from the models. Feature importance, which indicates how useful each feature is in constructing the boosted decision trees, is crucial for interpreting the behavior of machine learning models. It helps in understanding which features contribute most to the predictive power of the model, providing insights into the underlying data and the model’s decision-making process.

Here’s how we performed the feature importance extraction:

Model Training: Each model (GBDT and GOSS) was trained across different folds of the data with the optimal num_leaves parameter set to 10.
Importance Extraction: After training, each model’s feature importance was extracted. This importance reflects the number of times a feature is used to make key decisions with splits in the trees.
Averaging Across Folds: The importance was averaged over all folds to ensure that our results were stable and representative of the model’s performance across different subsets of the data.

The following visualizations succinctly present these differences in feature importance between the GBDT and GOSS models:

The analysis revealed interesting patterns in feature prioritization by each model. Both the GBDT and GOSS models exhibited a strong preference for “GrLivArea” and “LotArea,” highlighting the fundamental role of property size in determining house prices. Additionally, both models ranked ‘Neighborhood’ highly, underscoring the importance of location in the housing market.

However, the models began to diverge in their prioritization from the fourth feature onwards. The GBDT model showed a preference for “BsmtFinSF1,” indicating the value of finished basements. On the other hand, the GOSS model, which prioritizes instances with larger gradients to correct mispredictions, emphasized “OverallQual” more strongly.

As we conclude this analysis, it’s evident that the differences in feature importance between the GBDT and GOSS models provide valuable insights into how each model perceives the relevance of various features in predicting housing prices.

Summary

This blog post introduced you to LightGBM’s capabilities, highlighting its distinctive features and practical application on the Ames Housing dataset. From the initial setup and comparison of GBDT and GOSS boosting strategies to an in-depth analysis of feature importance, we’ve uncovered valuable insights that not only demonstrate LightGBM’s efficiency but also its adaptability to complex datasets.

Specifically, you learned:

Exploration of model variants: Comparing the default GBDT with the GOSS model provided insights into how different boosting strategies can be leveraged depending on the data characteristics.
How to experiment with leaf-wise strategy: Adjusting the num_leaves parameter influences model performance, with an optimal setting providing a balance between complexity and accuracy.
How to visualize feature importance: Understanding and visualizing which features are most influential in your models can significantly impact how you interpret the results and make decisions. This process not only clarifies the model’s internal workings but also aids in improving model transparency and trustworthiness by identifying which variables most strongly influence the outcome.

Do you have any questions? Please ask your questions in the comments below, and I will do my best to answer.

Get Started on The Beginner’s Guide to Data Science!

Learn the mindset to become successful in data science projects

…using only minimal math and statistics, acquire your skill through short examples in Python

Discover how in my new Ebook:
The Beginner’s Guide to Data Science

It provides self-study tutorials with all working code in Python to turn you from a novice to an expert. It shows you how to find outliers, confirm the normality of data, find correlated features, handle skewness, check hypotheses, and much more…all to support you in creating a narrative from a dataset.

Kick-start your data science journey with hands-on exercises

See What’s Inside

Source link

Exploring LightGBM: Leaf-Wise Growth with GBDT and GOSS

Overview

Introduction to LightGBM and Initial Setup

Testing LightGBM’s GBDT and GOSS on the Ames Dataset

Fine-Tuning LightGBM’s Tree Growth: A Focus on Leaf-wise Strategy

Comparing Feature Importance in LightGBM’s GBDT and GOSS Models

Further Reading

Tutorials

Ames Housing Dataset & Data Dictionary

Summary

Get Started on The Beginner’s Guide to Data Science!

Learn the mindset to become successful in data science projects

Kick-start your data science journey with hands-on exercises

Comments

Leave a Reply Cancel reply