How to automate the process of selecting features


In data science, there are many different approaches to building features that model complicated relationships, and that flexibility can sometimes be troublesome. In this blog post, you will learn about the strategies you can use to keep only the features that are most important to your model!

Defining Feature Selection

Feature selection is the process by which you choose a subset of features suited to the model you are building. This process comes with many advantages, the most noticeable being improved performance of a machine learning algorithm.

A second advantage includes decreasing…
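
Since the full post is not shown here, the following is only a rough sketch of one way the selection step can be automated, using scikit-learn's SelectKBest; the diabetes dataset and the choice of k = 4 are assumptions made for illustration rather than details taken from the post.

```python
# A minimal sketch of automated feature selection with scikit-learn.
# The dataset and k are illustrative assumptions, not the post's own example.
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression

X, y = load_diabetes(return_X_y=True, as_frame=True)

# Score every feature against the target and keep only the 4 strongest.
selector = SelectKBest(score_func=f_regression, k=4)
X_selected = selector.fit_transform(X, y)

# Inspect which features survived the selection step.
print(X.columns[selector.get_support()].tolist())
```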


Hooray for calculus!


If you have studied linear regression, you have probably come across the central principle of mathematical functions. You can illustrate it with the following example: assume that you use the number of bathrooms in a house as a predictor and the house's rental price as the target variable. The mathematical function for this example would be rental price = f(bathrooms), or more generically, y = f(x). Now, let's assume that the rental price is set in a very simplistic manner and the relationship between the number of bathrooms and the rental price is…
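
To make the y = f(x) idea concrete, here is a toy sketch of such a pricing function in Python; the $500 base rent and $350-per-bathroom figures are invented for illustration, since the actual relationship is cut off above.

```python
# A toy illustration of rental price as a function of bathrooms, y = f(x).
# The dollar amounts below are made-up assumptions, not real data.
def rental_price(bathrooms: int) -> float:
    """Hypothetical pricing rule: a $500 base rent plus $350 per bathroom."""
    return 500 + 350 * bathrooms

for n in range(1, 4):
    print(f"{n} bathroom(s): ${rental_price(n)}")
```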


A quick discussion on computational complexity


In this blog post, I will introduce you to some of the computational complexity behind OLS regression. You will see why it may not be the most efficient algorithm for estimating regression parameters when working with large datasets. This lays the groundwork for an optimization algorithm called gradient descent, which will be discussed later.

In the case of simple linear regression, the OLS formula works perfectly well because it requires only a small number of operations. But it becomes computationally very…
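
As a rough illustration of the trade-off being set up here, the sketch below fits the same simple linear regression two ways, with the closed-form OLS normal equation and with gradient descent; the synthetic data, learning rate, and iteration count are all assumptions chosen for the example.

```python
# A rough sketch contrasting closed-form OLS with gradient descent
# on the same synthetic simple linear regression problem.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=1000)
y = 2.5 * x + 1.0 + rng.normal(0, 1, size=1000)
X = np.column_stack([np.ones_like(x), x])  # add an intercept column

# Closed-form OLS: beta = (X'X)^-1 X'y. The matrix inversion is what
# becomes expensive as the number of features grows.
beta_ols = np.linalg.inv(X.T @ X) @ X.T @ y

# Gradient descent: repeatedly step against the gradient of the squared error.
beta_gd = np.zeros(2)
lr = 0.01
for _ in range(5000):
    grad = (2 / len(y)) * X.T @ (X @ beta_gd - y)
    beta_gd -= lr * grad

print("OLS estimate:", beta_ols)
print("Gradient descent estimate:", beta_gd)
```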


Improve your data to boost your regression results


Features that are as normally distributed as possible tend to lead to better outcomes, which is what makes scaling and normalization so significant in regression modeling. There are a number of ways to scale your features. In this blog post, I will help you evaluate whether normalization and/or standardization is appropriate for a particular dataset or model, while also walking you through the different approaches to each.

Sometimes there will be features that differ greatly in magnitude in…
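
For a sense of what these transformations do, here is a minimal sketch using scikit-learn's StandardScaler and MinMaxScaler on two features of very different magnitude; the square-footage and bathroom values are made up for the example.

```python
# A small sketch of two common rescaling approaches:
# standardization (zero mean, unit variance) and min-max normalization
# (values squeezed into [0, 1]). The feature values are assumptions.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Two features on wildly different scales: square footage and bathrooms.
X = np.array([[850, 1], [1200, 2], [2400, 3], [3100, 4]], dtype=float)

print(StandardScaler().fit_transform(X))
print(MinMaxScaler().fit_transform(X))
```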


The biggest issue hindering quality results in regression modeling


If you are familiar with data science, especially with regression modeling, then you are probably familiar with the concepts of covariance and correlation. This post will go over the issue of multicollinearity in multiple linear regression, show how to create and interpret scatterplots and correlation matrices, and teach you how to identify when two or more predictors are collinear.

So… Why is Multicollinearity Bad?

The key purpose of a regression analysis is to evaluate the relationship between each predictor and the outcome variable. The idea behind a regression coefficient is that for every 1…
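
As one possible illustration of the checks described above, the sketch below builds a correlation matrix and a scatterplot for a handful of made-up housing predictors; the column names and values are assumptions, not data from the post.

```python
# A minimal sketch of screening predictors for collinearity with a
# correlation matrix and a scatterplot. The columns are invented.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "sqft":      [850, 1200, 1500, 2400, 3100],
    "bedrooms":  [1, 2, 2, 4, 5],
    "bathrooms": [1, 1, 2, 3, 4],
})

# Pairwise correlations close to 1 (or -1) flag potentially collinear predictors.
print(df.corr())

# A scatterplot of two suspect predictors makes the relationship visible.
df.plot.scatter(x="sqft", y="bedrooms")
plt.show()
```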


How to test the required assumptions


Regression diagnostics are a set of regression analysis techniques that test the validity of a model in a variety of ways. These techniques can include an examination of the model's underlying mathematical assumptions, an exploration of the model structure by considering formulas with fewer, more, or different explanatory variables, or an analysis of subsets of observations, such as searching for those that are poorly represented by the model, like outliers, or that have a disproportionately large effect on the regression model's predictions. …
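
To give a flavor of what such diagnostics can look like in practice, here is a small sketch using statsmodels on synthetic data; the Q-Q plot and residuals-vs-fitted plot shown are common examples of these techniques, not necessarily the ones covered in the full post.

```python
# A brief sketch of two common regression diagnostics on synthetic data:
# a Q-Q plot of the residuals and a residuals-vs-fitted plot.
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = 3 * x + rng.normal(0, 2, size=200)

model = sm.OLS(y, sm.add_constant(x)).fit()

# Q-Q plot: checks whether the residuals are roughly normally distributed.
sm.qqplot(model.resid, line="45", fit=True)

# Residuals vs. fitted values: a funnel or curve suggests a violated assumption.
plt.figure()
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```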


Concisely discussing one of the more important statistical philosophies in data science


If I had to guess, most of the statistical theory you've learned has been taught through a Frequentist lens. T-tests, z-tests, p-values, and ANOVA, just to name a few, all come from the Frequentist viewpoint. In this blog post, I will have you consider an alternative viewpoint, the Bayesian one, as I compare the statistical frameworks of Bayesians and Frequentists and then discuss Bayes' Theorem.

Bayesians Vs. Frequentists — Philosophical Differences

Talking about their understanding of probability itself is a normal place to begin when explaining the distinctions between Bayesians and…
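
For readers who want to see Bayes' Theorem in action before the philosophical comparison, here is a tiny sketch applying it to a classic diagnostic-test example; the prevalence, sensitivity, and false positive rate are assumptions chosen purely for illustration.

```python
# A tiny sketch of Bayes' Theorem on a diagnostic-test example.
# The prevalence and accuracy numbers are illustrative assumptions.
def bayes_posterior(prior: float, sensitivity: float, false_positive_rate: float) -> float:
    """P(disease | positive test) via P(A|B) = P(B|A) * P(A) / P(B)."""
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

# 1% prevalence, 95% sensitivity, 5% false positive rate.
print(round(bayes_posterior(0.01, 0.95, 0.05), 3))  # ~0.161
```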


What it is and how to correct it


The discovery of an amazingly low p-value does not guarantee that the null hypothesis is false. A p-value of 0.0001 means that, if the null hypothesis were true, there would still be a 1 in 10,000 chance of observing a result at least this extreme. Moreover, p-values alone can be deceptive. For instance, if you run repeated tests, you are likely to come upon a tiny p-value at some point, whether the null hypothesis is valid or not. Imagine 100 scientific experiments, each reporting a p-value of 0.05. Are all of those findings legitimate? Probably not. There is already a 5 percent probability…
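
The repeated-testing problem described above is easy to simulate. The sketch below runs many t-tests on samples drawn from the same distribution, so every null hypothesis is true, and counts how many still come out "significant"; the sample sizes and number of tests are arbitrary assumptions.

```python
# A quick simulation of the multiple-testing problem: even when the null
# hypothesis is true for every test, some p-values fall below 0.05 by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
false_positives = 0
n_tests = 100

for _ in range(n_tests):
    # Both samples come from the same distribution, so the null is true.
    a = rng.normal(0, 1, size=30)
    b = rng.normal(0, 1, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(f"{false_positives} of {n_tests} tests were 'significant' by chance")
```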


Discussing power and what its influences are in hypothesis testing


If you have studied hypothesis testing, you are probably familiar with p-values and how they're used to reject or fail to reject the null hypothesis. The power of a statistical test measures an experiment's ability to detect a difference when one actually exists. For this post, I will use testing the fairness of a coin as the example, and my null hypothesis will be "this coin is fair". In this case, the probability of rejecting the null hypothesis when the coin is in fact unfair is the power of our statistical…
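
One way to estimate that power is by simulation. The sketch below repeatedly flips a coin whose true heads probability is 0.6 and counts how often a binomial test rejects the fairness hypothesis; the flip count, effect size, and alpha level are assumptions for illustration.

```python
# A rough sketch of estimating statistical power for the coin example:
# how often a binomial test rejects "the coin is fair" when the coin's
# true heads probability is 0.6.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_flips, true_p, alpha, n_sims = 100, 0.6, 0.05, 2000

rejections = 0
for _ in range(n_sims):
    heads = rng.binomial(n_flips, true_p)
    p_value = stats.binomtest(heads, n_flips, p=0.5).pvalue
    if p_value < alpha:
        rejections += 1

print(f"Estimated power: {rejections / n_sims:.2f}")
```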


Deciphering the differences between alpha and beta


When you conduct statistical tests to decide whether you believe a claim is true or false, you are hypothesis testing. The null hypothesis is the initial statement that you are testing, and it is assumed to be true unless there is overwhelming evidence to the contrary. One common example is the assumption that two groups are not statistically different from each other.

However, there are times when scientists reject the null hypothesis when they should not have. The reverse can also happen, when the null hypothesis is not rejected even though it should have been. Data scientists…
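
As a rough illustration of these two error types, the sketch below simulates both situations with t-tests: rejecting a true null (a Type I error, governed by alpha) and failing to reject a false null (a Type II error, governed by beta); the sample sizes, effect size, and threshold are assumptions.

```python
# A small simulation sketch of Type I (alpha) and Type II (beta) errors.
# Sample sizes, effect size, and the 0.05 threshold are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n_sims, n, alpha = 2000, 30, 0.05

type_1 = type_2 = 0
for _ in range(n_sims):
    # Null is true: both groups share the same mean.
    a, b = rng.normal(0, 1, n), rng.normal(0, 1, n)
    if stats.ttest_ind(a, b).pvalue < alpha:
        type_1 += 1

    # Null is false: the second group's mean is shifted by 0.5.
    c, d = rng.normal(0, 1, n), rng.normal(0.5, 1, n)
    if stats.ttest_ind(c, d).pvalue >= alpha:
        type_2 += 1

print(f"Type I error rate (alpha): {type_1 / n_sims:.2f}")
print(f"Type II error rate (beta): {type_2 / n_sims:.2f}")
```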

Acusio Bivona

Aspiring Data Scientist — Recent Graduate of Flatiron School’s Online Data Science Bootcamp
