When we analyze relationships between variables in machine learning, we often find that a straight line doesn’t tell the whole story. That’s where polynomial transformations come in, adding layers to our regression models without complicating the calculation process. By transforming our features into their polynomial counterparts—squares, cubes, and other higher-degree terms—we give linear models the flexibility to curve and twist, fitting snugly to the underlying trends of our data.
This blog post will explore how we can move beyond simple linear models to capture more complex relationships in our data. You’ll learn about the power of polynomial and cubic regression techniques, which allow us to see beyond the apparent and uncover the underlying patterns that a straight line might miss. We will also delve into the balance between adding complexity and maintaining predictability in your models, ensuring that they are both powerful and practical.
Let’s get started.
Overview
This post is divided into three parts; they are:
- Establishing a Baseline with Linear Regression
- Capturing Curves with Polynomial Regression
- Experimenting with a Cubic Regression
Establishing a Baseline with Linear Regression
When we talk about relationships between two variables, linear regression is often the first step because it is the simplest. It models the relationship by fitting a straight line to the data. This line is described by the simple equation y = mx + b, where y is the dependent variable, x is the independent variable, m is the slope of the line, and b is the y-intercept. Let’s demonstrate this by predicting “SalePrice” in the Ames dataset from the overall quality of the house (“OverallQual”), an integer rating from 1 to 10.
```python
# Import the necessary libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

# Prepare data for linear regression
Ames = pd.read_csv("Ames.csv")
X = Ames[["OverallQual"]]  # Predictor
y = Ames["SalePrice"]      # Response

# Create and fit the linear regression model
linear_model = LinearRegression()
linear_model.fit(X, y)

# Coefficients
intercept = int(linear_model.intercept_)
slope = int(linear_model.coef_[0])
eqn = f"Fitted Line: y = {slope}x - {abs(intercept)}"

# Perform 5-fold cross-validation to evaluate model performance
cv_score = cross_val_score(linear_model, X, y).mean()

# Visualize Best Fit and display CV results
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color="blue", alpha=0.5, label="Data points")
plt.plot(X, linear_model.predict(X), color="red", label=eqn)
plt.title("Linear Regression of SalePrice vs OverallQual", fontsize=16)
plt.xlabel("Overall Quality", fontsize=12)
plt.ylabel("Sale Price", fontsize=12)
plt.legend(fontsize=14)
plt.grid(True)
plt.text(1, 540000, f"5-Fold CV R²: {cv_score:.3f}", fontsize=14, color="green")
plt.show()
```
With a basic linear regression, our model came up with the following equation: y = 43383x - 84264. This means that each additional point in quality is associated with an increase of approximately $43,383 in the sale price. To evaluate the performance of our model, we used 5-fold cross-validation, resulting in an R² of 0.618. This value indicates that about 61.8% of the variability in sale prices can be explained by the overall quality of the house using this simple model.
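If you want to look behind that average, you can print the score from each fold. The snippet below is a minimal sketch that reuses the `linear_model`, `X`, and `y` objects from the listing above; it relies on the fact that `cross_val_score` defaults to 5 folds and to the regressor’s R² score.

```python
from sklearn.model_selection import cross_val_score

# Per-fold R² scores (5 folds; scoring defaults to R² for regressors)
fold_scores = cross_val_score(linear_model, X, y, cv=5)
for i, score in enumerate(fold_scores, start=1):
    print(f"Fold {i}: R² = {score:.3f}")
print(f"Mean R² across folds: {fold_scores.mean():.3f}")
```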
Linear regression is straightforward to understand and implement. However, it assumes that the relationship between the independent and dependent variables is linear, which might not always be the case, as seen in the scatterplot above. While linear regression provides a good starting point, real-world data often require more complex models to capture curved relationships, as we’ll see in the next section on polynomial regression.
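One quick way to check whether a straight line is enough is to inspect the residuals of the fitted linear model: a systematic curve in the residuals hints that higher-order terms may help. Here is a minimal sketch, again assuming the `linear_model`, `X`, and `y` objects defined above.

```python
import matplotlib.pyplot as plt

# Residuals of the linear fit: a curved pattern suggests non-linearity
residuals = y - linear_model.predict(X)

plt.figure(figsize=(8, 5))
plt.scatter(X, residuals, alpha=0.5)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Overall Quality")
plt.ylabel("Residual (actual - predicted)")
plt.title("Residuals of the Linear Model")
plt.show()
```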
Capturing Curves with Polynomial Regression
Real-world relationships are often not straight lines but curves. Polynomial regression allows us to model these curved relationships. For a third-degree polynomial, this method takes our simple linear equation and adds terms for each power of x: y = ax + bx^2 + cx^3 + d. We can implement this using the PolynomialFeatures class from the sklearn.preprocessing module, which generates a new feature matrix consisting of all polynomial combinations of the features with a degree less than or equal to the specified degree. Here’s how we can apply it to our dataset:
```python
# Import the necessary libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt

# Load the data
Ames = pd.read_csv("Ames.csv")
X = Ames[["OverallQual"]]
y = Ames["SalePrice"]

# Transform the predictor variable to polynomial features up to the 3rd degree
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X)

# Create and fit the polynomial regression model
poly_model = LinearRegression()
poly_model.fit(X_poly, y)

# Extract model coefficients that form the polynomial equation
intercept = int(poly_model.intercept_)
coefs = np.rint(poly_model.coef_).astype(int)
eqn = f"Fitted Line: y = {coefs[0]}x^1 - {abs(coefs[1])}x^2 + {coefs[2]}x^3 - {abs(intercept)}"

# Perform 5-fold cross-validation
cv_score = cross_val_score(poly_model, X_poly, y).mean()

# Generate data to plot curve
X_range = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
X_range_poly = poly.transform(X_range)

# Plot
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color="blue", alpha=0.5, label="Data points")
plt.plot(X_range, poly_model.predict(X_range_poly), color="red", label=eqn)
plt.title("Polynomial Regression (3rd Degree) of SalePrice vs OverallQual", fontsize=16)
plt.xlabel("Overall Quality", fontsize=12)
plt.ylabel("Sale Price", fontsize=12)
plt.legend(fontsize=14)
plt.grid(True)
plt.text(1, 540000, f"5-Fold CV R²: {cv_score:.3f}", fontsize=14, color="green")
plt.show()
```
First, we transform our predictor variable into polynomial features up to the third degree. This expands our feature set from just x (Overall Quality) to x, x^2, and x^3 (i.e., one feature becomes three different but correlated features), allowing our linear model to fit a more complex, curved relationship in the data. We then fit a linear regression model to this transformed data to capture the nonlinear relationship between overall quality and sale price.
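To see exactly what this expansion produces, you can inspect the transformed matrix directly. The short sketch below reuses the `poly` and `X_poly` objects from the listing above; it assumes a recent scikit-learn version where `get_feature_names_out` is available.

```python
# Show the generated feature names: OverallQual, OverallQual^2, OverallQual^3
print(poly.get_feature_names_out(["OverallQual"]))

# First five rows of the expanded matrix: x, x^2, x^3 for each house
print(X_poly[:5])
```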
Our new model has the equation y = 65966x^1 - 11619x^2 + 1006x^3 - 31343. The curve fits the data points more closely than the straight line, indicating a better model. Our 5-fold cross-validation gave us an R² of 0.681, which is an improvement over our linear model. This suggests that including the squared and cubic terms helps our model to capture more of the complexity in the data. Polynomial regression introduces the ability to fit curves, but sometimes focusing on a specific power, like the cubic term, can reveal deeper insights, as we will explore in cubic regression.
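If you want to confirm that the gain really comes from the added terms, one way is to sweep over several degrees and compare the cross-validated R² for each. The sketch below is a hedged illustration that assumes the same "Ames.csv" file and columns as above; a pipeline keeps the transformation and the model together so cross-validation stays honest.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

Ames = pd.read_csv("Ames.csv")
X = Ames[["OverallQual"]]
y = Ames["SalePrice"]

# Compare mean 5-fold CV R² for polynomial degrees 1 through 4
for degree in range(1, 5):
    model = make_pipeline(PolynomialFeatures(degree=degree, include_bias=False),
                          LinearRegression())
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"Degree {degree}: mean CV R² = {score:.3f}")
```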
Experimenting with a Cubic Regression
Sometimes, we may suspect that a specific power of x is particularly important. In these cases, we can focus on that power alone. Cubic regression is a special case where we model the relationship with the cube of the independent variable: y = ax^3 + b. To focus on this power, we can use the FunctionTransformer class from the sklearn.preprocessing module, which allows us to create a custom transformer that applies a specific function to the data. This approach is useful for isolating and highlighting the impact of a higher-degree term like x^3 on the response variable, providing a clear view of how the cubic term alone explains the variability in the data.
```python
# Import the necessary libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import FunctionTransformer
import matplotlib.pyplot as plt

# Load data
Ames = pd.read_csv("Ames.csv")
X = Ames[["OverallQual"]]
y = Ames["SalePrice"]

# Function to apply cubic transformation
def cubic_transformation(x):
    return x ** 3

# Apply transformation
cubic_transformer = FunctionTransformer(cubic_transformation)
X_cubic = cubic_transformer.fit_transform(X)

# Fit model
cubic_model = LinearRegression()
cubic_model.fit(X_cubic, y)

# Get coefficients and intercept
intercept_cubic = int(cubic_model.intercept_)
coef_cubic = int(cubic_model.coef_[0])
eqn = f"Fitted Line: y = {coef_cubic}x^3 + {intercept_cubic}"

# Cross-validation
cv_score_cubic = cross_val_score(cubic_model, X_cubic, y).mean()

# Generate data to plot curve
X_range = np.linspace(X.min(), X.max(), 300)
X_range_cubic = cubic_transformer.transform(X_range)

# Plot
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color="blue", alpha=0.5, label="Data points")
plt.plot(X_range, cubic_model.predict(X_range_cubic), color="red", label=eqn)
plt.title("Cubic Regression of SalePrice vs OverallQual", fontsize=16)
plt.xlabel("Overall Quality", fontsize=12)
plt.ylabel("Sale Price", fontsize=12)
plt.legend(fontsize=14)
plt.grid(True)
plt.text(1, 540000, f"5-Fold CV R²: {cv_score_cubic:.3f}", fontsize=14, color="green")
plt.show()
```
We applied a cubic transformation to our independent variable and obtained a cubic model with the equation y = 361x^3 + 85579. This represents a slightly simpler approach than the full polynomial regression model, focusing solely on the cubic term’s predictive power.
With cubic regression, our 5-fold cross-validation yielded an R² of 0.678. This performance is slightly below the full polynomial model but still notably better than the linear one. Cubic regression is simpler than a higher-degree polynomial regression and can be sufficient for capturing the relationship in some datasets. It’s less prone to overfitting than a higher-degree polynomial model but more flexible than a linear model. The coefficient in the cubic regression model, 361, indicates the rate at which sale prices increase as the quality cubed increases. This emphasizes the substantial influence that very high-quality levels have on the price, suggesting that properties with exceptional quality see a disproportionately higher increase in their sale price. This insight is particularly valuable for investors or developers focused on high-end properties where quality is a premium.
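To make that interpretation concrete, you can plug a few quality levels into the fitted equation y = 361x^3 + 85579 and compare the predicted prices between consecutive ratings. The sketch below uses a hypothetical helper, `predict_price`, defined only for this illustration; the jumps it prints grow sharply at the top of the quality scale, which is the behavior described above.

```python
# Predicted sale price from the fitted cubic equation y = 361 * x**3 + 85579
def predict_price(quality):
    return 361 * quality ** 3 + 85579

# Compare the price jump between consecutive quality ratings
for q in (5, 6, 7, 8, 9, 10):
    jump = predict_price(q) - predict_price(q - 1)
    print(f"Quality {q}: predicted ${predict_price(q):,} (+${jump:,} vs quality {q - 1})")
```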
As you may imagine, this technique is not limited to polynomial terms. You can introduce more exotic transformations, such as logarithms or exponentials, if they make sense for your scenario.
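For example, a logarithmic feature can be wired in with exactly the same pattern as the cubic example. The sketch below is a minimal illustration, assuming the same "Ames.csv" columns as before and using NumPy’s `np.log1p` as the transformation; it makes no claim about how well this particular choice will score.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import FunctionTransformer

Ames = pd.read_csv("Ames.csv")
X = Ames[["OverallQual"]]
y = Ames["SalePrice"]

# Same pattern as the cubic example, but with a logarithmic feature instead
log_transformer = FunctionTransformer(np.log1p)
X_log = log_transformer.fit_transform(X)

log_model = LinearRegression()
log_model.fit(X_log, y)
print(f"5-Fold CV R² (log feature): {cross_val_score(log_model, X_log, y).mean():.3f}")
```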
Summary
This blog post explored different regression techniques suited for modeling relationships in data across varying complexities. We started with linear regression to establish a baseline for predicting house prices based on quality ratings. Visuals accompanying this section demonstrate how a linear model attempts to fit a straight line through the data points, illustrating the basic concept of regression. Advancing to polynomial regression, we tackled more intricate, non-linear trends, which enhanced model flexibility and accuracy. The accompanying graphs showed how a polynomial curve adjusts to fit the data points more closely than a simple linear model. Finally, we focused on cubic regression to examine the impact of a specific power of the predictor variable, isolating the effects of higher-degree terms on the dependent variable. The cubic model proved to be particularly effective, capturing the essential characteristics of the relationship with sufficient precision and simplicity.
Specifically, you learned:
- How to identify non-linear trends using visualization techniques.
- How to model non-linear trends using polynomial regression techniques.
- How cubic regression can capture similar predictability with less model complexity.
Do you have any questions? Please ask your questions in the comments below, and I will do my best to answer.