Photo by Michael Burrows from Pexels

Feature scaling in machine learning

Harish Daryani
Analytics Vidhya
Published in
3 min readOct 28, 2021

--

Let’s say you run into a dataset of house prices in your locality with their corresponding area in square feet. You being a data enthusiast want to understand the relationship between area and price.

The first thing you probably would do is create a scatter plot. Here is how it looks.

Scatter plot of area and price

Motivated by what you see, you now want to fit a regression line to this data. Here is how that looks.

Regression line
Regression coefficients

While this looks fine on the face of it, observing a little more carefully you would ask the following questions.

  1. What is the intercept value of -369.6 really conveying?

While it is impractical to have house of zero square feet area, the intercept is telling that such houses go for negative 369 thousand dollars. That doesn’t make any sense.

2. Why is the confidence interval on the intercept so large i.e. -580 to -158 with a very high standard error?

As you can see from the scatter plot, the data points on the explanatory variable ‘area’ are pretty further away from zero. The minimum value of area is around 1700–1800 square feet.

By definition, intercept is the value of price when area is zero (a hypothetical situation). With virtually no data points from 0–1700 square feet for the model to train on, the intercept becomes unstable. What that means is that the intercept is susceptible to more fluctuations were we to add a few more houses to our dataset. Hence, making the model unstable from a prediction standpoint.

One way to solve this is to scale the explanatory variable i.e. ‘area’ in this case in a way that data points are not clustered far away from zero.

A simplest way is to make values of ‘area’ relative to the mean by subtracting them from the mean area. Here the mean area is around 2000 square feet, so this is how your data would look.

Data with area centered variable (a sample)

A positive value for ‘area_centered’ means the house is above 2000 square feet and vice versa.

Here is how your scatter plot, regression line and model equation looks like now.

Regression line and scatter plot with area centered
New model coefficients

Observations

  • While the coefficient on area has remained unchanged, the confidence interval on the intercept is narrower now with a much lower standard error
  • The intercept is also interpretable i.e. houses with mean area are priced at 1.053 million dollars

This is a simple example to demonstrate the value of feature scaling in making your model more robust and interpretable.

Credits : Codecademy

Where have you used feature scaling?

--

--