Improve your data to boost your regression results
Features that are as close to normally distributed as possible often lead to better regression results. That is why scaling and normalizing features matters so much in regression modeling. There are several ways to scale your features, and in this blog post I am going to help you decide whether normalization and/or standardization is appropriate for a particular dataset or model, and walk you through the different approaches to each.
Sometimes the features in a dataset differ greatly in magnitude. If you leave those magnitudes untouched, the coefficient sizes can vary just as widely, which may give the misleading impression that some features are less significant than others. If you are only evaluating linear regression models, this is not necessarily a problem, because the predictions of an ordinary least-squares fit do not depend on the scale of the features. It can be a problem for more complex machine learning techniques, because many algorithms use the Euclidean distance between two data points in their calculations, so features on larger scales dominate unless all features have comparable scales. I’m getting carried away. Anyway, a common guideline is to check your features for normality and to scale them so that they have comparable magnitudes, even for a basic model such as linear regression. I will keep the descriptions of the common transformations brief for the sake of simplicity and give the mathematical formula where it helps.
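To make the point about coefficient sizes concrete, here is a minimal sketch using NumPy. The data is synthetic and the feature names (rooms, area_sqft, price) are made up for illustration. It fits the same least-squares model on raw and on standardized features so you can compare the coefficients and the predictions.

```python
import numpy as np

# Synthetic, made-up data: two features on very different scales.
rng = np.random.default_rng(0)
n = 200
rooms = rng.uniform(1, 6, size=n)            # small-magnitude feature
area_sqft = rng.uniform(500, 3000, size=n)   # large-magnitude feature
price = 10.0 * rooms + 0.05 * area_sqft + rng.normal(0, 1, size=n)


def fit_ols(features, target):
    """Least-squares fit with an intercept; returns [intercept, coef_1, ...]."""
    design = np.column_stack([np.ones(len(target)), features])
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coefs


raw = np.column_stack([rooms, area_sqft])
raw_coefs = fit_ols(raw, price)

# Standardize each column to mean 0 and standard deviation 1.
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)
scaled_coefs = fit_ols(scaled, price)

# Predictions from both fits, to check that scaling did not change the model.
raw_pred = np.column_stack([np.ones(n), raw]) @ raw_coefs
scaled_pred = np.column_stack([np.ones(n), scaled]) @ scaled_coefs

print("raw coefficients:   ", raw_coefs[1:])
print("scaled coefficients:", scaled_coefs[1:])
```

On the raw features, the coefficient for rooms is far larger than the one for area_sqft simply because area is measured in bigger numbers. After standardizing, the coefficients are on a comparable footing, while the predictions stay the same.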
If you have data that simply does not fit a normal distribution, a log transformation is a very valuable tool. Log transformations can reduce skewness in right-skewed data and can also reduce the variability of the data. Keep in mind that the logarithm is only defined for positive values.
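As a rough illustration, the sketch below applies a natural log to a small, made-up right-skewed feature and compares the skewness before and after. The skewness helper and the sample values are my own assumptions for the example, not part of any particular dataset.

```python
import numpy as np


def skewness(values):
    """Skewness as the mean cubed z-score (0 for perfectly symmetric data)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))


# Hypothetical right-skewed feature, e.g. house prices in thousands.
prices = np.array([120.0, 135.0, 150.0, 160.0, 180.0, 210.0, 250.0, 400.0, 950.0])

# The natural log requires strictly positive values.
# np.log1p (log of 1 + x) is a common alternative when zeros are present.
log_prices = np.log(prices)

print("skewness before:", skewness(prices))
print("skewness after: ", skewness(log_prices))
```

For this sample the skewness drops after the transformation, although the result is still not perfectly symmetric. A log transformation helps, but it is not guaranteed to produce a normal distribution.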
The formula for standardization is:
- Z = (x - μ) / σ
Here x is the observation, mu (μ) is the mean, and sigma (σ) is the standard deviation. One important thing to understand is that standardization doesn’t really make data more normal. It only shifts the mean to 0 and rescales the standard deviation to 1, so the shape of the distribution stays the same.
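Here is a minimal sketch of standardization written out by hand with NumPy, assuming a made-up incomes feature. It also checks the skewness before and after, to show that the shape of the distribution is not changed.

```python
import numpy as np


def standardize(x):
    """Subtract the mean and divide by the (population) standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()


def skewness(values):
    """Skewness as the mean cubed z-score (0 for perfectly symmetric data)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))


# Hypothetical skewed feature, e.g. incomes in thousands.
incomes = np.array([20.0, 22.0, 25.0, 30.0, 38.0, 55.0, 120.0])
z = standardize(incomes)

print("mean:", z.mean(), "std:", z.std())
print("skewness before:", skewness(incomes), "after:", skewness(z))
```

The mean ends up at 0 (up to floating-point error) and the standard deviation at 1, but the skewness is the same before and after. If the data was skewed going in, it is just as skewed coming out.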
The formula for min-max scaling is:
- Z = (x - min(x)) / (max(x) - min(x))
This scaling approach maps all values into the range from 0 to 1, with the smallest value landing on 0 and the largest on 1.
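A hand-written version of min-max scaling might look like the sketch below. The ages array is made up, and the check for a constant feature is my own addition, because the formula divides by zero when the maximum equals the minimum.

```python
import numpy as np


def min_max_scale(x):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    x = np.asarray(x, dtype=float)
    value_range = x.max() - x.min()
    if value_range == 0:
        # Every value is identical, so the formula would divide by zero.
        raise ValueError("min-max scaling is undefined for a constant feature")
    return (x - x.min()) / value_range


# Hypothetical feature: ages in years.
ages = np.array([18.0, 25.0, 40.0, 62.0, 90.0])
scaled_ages = min_max_scale(ages)

print(scaled_ages)
```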
The formula for mean normalization is:
- Z = (x - mean(x)) / (max(x) - min(x))
This method is very similar to min-max scaling, with the major difference being that the mean, rather than the minimum, is subtracted from each value. The resulting values lie between -1 and 1 and have a mean of 0.
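For comparison with min-max scaling, here is a minimal sketch of mean normalization, again using a made-up ages array as the input.

```python
import numpy as np


def mean_normalize(x):
    """Center values on the mean, then divide by the range (max - min)."""
    x = np.asarray(x, dtype=float)
    value_range = x.max() - x.min()
    if value_range == 0:
        # Every value is identical, so the formula would divide by zero.
        raise ValueError("mean normalization is undefined for a constant feature")
    return (x - x.mean()) / value_range


# Hypothetical feature: ages in years.
ages = np.array([18.0, 25.0, 40.0, 62.0, 90.0])
normalized_ages = mean_normalize(ages)

print(normalized_ages)
print("mean:", normalized_ages.mean())
```

The spread between the largest and smallest value is still 1, just like with min-max scaling, but the values are now centered on 0.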
The beauty of Python is that when you want to use these transformations with your data, you don’t have to code the formulas by hand for whichever type of transformation you want to use. Scikit-learn provides classes for many transformations, such as StandardScaler and Normalizer. (Note that Normalizer rescales each sample, meaning each row, to unit norm, rather than scaling each feature.) Check out their API documentation if you’re interested in learning more.
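As a quick sketch of what that looks like in practice, the example below runs a small made-up feature matrix through three transformers from sklearn.preprocessing. MinMaxScaler is not mentioned above; I am including it on the assumption that you want the scikit-learn counterpart of min-max scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Hypothetical feature matrix: rows are samples, columns are features.
X = np.array([
    [1.0, 200.0],
    [2.0, 400.0],
    [3.0, 900.0],
    [4.0, 1500.0],
])

# Standardization: each column gets mean 0 and standard deviation 1.
X_std = StandardScaler().fit_transform(X)

# Min-max scaling: each column is mapped into the range 0 to 1.
X_minmax = MinMaxScaler().fit_transform(X)

# Normalizer works on rows, not columns: each sample is rescaled
# to unit (L2) norm by default.
X_unit = Normalizer().fit_transform(X)

print(X_std)
print(X_minmax)
print(X_unit)
```

When you are building a model, one common practice is to call fit on the training data only and then reuse the fitted scaler with transform on the test data, so that information from the test set does not leak into the scaling.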
Thank you for reading!