How to automate the process of selecting features

In data science, there are many approaches to building features that model complicated relationships, but a large feature set can sometimes cause trouble. In this blog post, you will learn strategies for using only the features that matter most to your model!

**Feature selection** is the process of choosing a subset of features to use when building a model. It comes with many advantages, the most noticeable being improved performance of a machine learning algorithm.

A second advantage includes decreasing…
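
To make the idea concrete, here is a minimal sketch of one filter-style selection strategy (not necessarily the approach covered in the post): keep the k features most correlated with the target. All feature names and values below are invented for illustration.

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def select_k_best(features, target, k):
    """features: dict of name -> list of values; keep top-k by |r| with target."""
    ranked = sorted(features, key=lambda name: -abs(pearson_r(features[name], target)))
    return ranked[:k]

features = {
    "sqft":  [800, 1000, 1200, 1500],
    "noise": [3, 1, 4, 1],            # unrelated to the target
}
target = [1200, 1500, 1800, 2200]     # e.g. rental prices
print(select_k_best(features, target, k=1))  # ['sqft']
```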

Hooray for calculus!

If you have studied linear regression, you have probably encountered the central principle of **mathematical functions**. Consider the following example: suppose you use the number of bathrooms in a house as a predictor and the house's rental price as the target variable. The mathematical function for this example would be **rental price = f(bathrooms)**, or more generically,
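
The function above can be sketched in a few lines of Python; the intercept and slope values below are invented purely for illustration.

```python
# rental price = f(bathrooms), here a simple linear function.
# The intercept (base rent) and slope (price per bathroom) are made up.
def rental_price(bathrooms, intercept=500.0, slope=350.0):
    """Linear function: price = intercept + slope * bathrooms."""
    return intercept + slope * bathrooms

print(rental_price(2))  # 500 + 350 * 2 = 1200.0
```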

A quick discussion on computational complexity

In this blog post, I will introduce some of the computational complexity behind OLS regression. You will see why OLS might not be the most efficient algorithm for estimating regression parameters when working with large datasets. This lays the groundwork for an optimization algorithm called **gradient descent**, which will be discussed later.

In the case of simple linear regression, the OLS formula works perfectly well because it requires only a small number of operations. But it becomes computationally very…
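
As a point of comparison, the simple-linear-regression case really is cheap: the closed-form OLS slope and intercept need only a few passes over the data. A minimal sketch:

```python
# Closed-form OLS for simple linear regression. Cost grows linearly with the
# number of observations; the multivariate normal-equations solution, by
# contrast, also involves a matrix inversion.
def ols_simple(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    intercept = my - slope * mx
    return intercept, slope

# Perfectly linear toy data: y = 2x + 1
intercept, slope = ols_simple([1, 2, 3, 4], [3, 5, 7, 9])
print(intercept, slope)  # 1.0 2.0
```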

Improve your data to boost your regression results

Features that are as normally distributed as possible tend to lead to better outcomes. This is what makes **scaling** and **normalization of features** in regression modeling so significant. There are a number of ways to scale your features, and in this blog post, I will help you evaluate whether normalization and/or standardization is appropriate for a particular dataset or model, while also considering the different approaches to each.

Sometimes there will be features that differ greatly in magnitude in…
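
As a quick illustration of the two techniques named above, here is a minimal sketch of min-max normalization and z-score standardization (using the population standard deviation):

```python
def min_max_normalize(values):
    """Rescale values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score standardization: mean 0, (population) standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

print(min_max_normalize([10, 20, 30]))  # [0.0, 0.5, 1.0]
print(standardize([10, 20, 30]))        # symmetric around 0
```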

The biggest issue hindering quality results in regression modeling

If you are familiar with data science, especially with regression modeling, then you are probably familiar with the concepts of covariance and correlation. This post will go over the issue of **multicollinearity** in *multiple linear regression*, show how to create and interpret scatterplots and correlation matrices, and teach you how to identify whether two or more predictors are collinear.

The key purpose of a **regression analysis** is to evaluate the relation between each predictor and the outcome variable. The concept of a *regression coefficient* is that for every 1…
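
To make collinearity detection concrete, here is a minimal sketch that computes pairwise correlations for a few invented predictors and flags pairs above a commonly used |r| threshold of 0.8 (the predictors, values, and threshold are assumptions, not from the post):

```python
def corr(x, y):
    """Pearson correlation coefficient between two lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sum((a - mx) ** 2 for a in x)
                  * sum((b - my) ** 2 for b in y)) ** 0.5

predictors = {
    "sqft":     [800, 1000, 1200, 1500, 1700],
    "bedrooms": [1, 2, 2, 3, 4],
    "age":      [30, 5, 18, 22, 9],
}
names = list(predictors)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        r = corr(predictors[a], predictors[b])
        flag = "  <- possible collinearity" if abs(r) > 0.8 else ""
        print(f"{a} vs {b}: r = {r:+.2f}{flag}")
```

Here `sqft` and `bedrooms` move together, so their pair gets flagged, while `age` correlates only weakly with either.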

How to test the required assumptions

**Regression diagnostics** are a series of regression analysis techniques that test the validity of a model in a variety of ways. These techniques can include examining the model's underlying mathematical assumptions, reviewing the model structure by considering formulas with fewer, more, or different explanatory variables, or analyzing subsets of observations, such as searching for observations that are poorly represented by the model, like outliers, or that have a disproportionately large effect on the regression model's predictions. …
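
One of the observation-level diagnostics mentioned above, checking for outliers, can be sketched as a simple residual scan; the fitted coefficients, data, and the 2.0 cutoff below are invented for illustration:

```python
# Flag observations whose residual (actual minus predicted) is unusually large.
def residuals(x, y, intercept, slope):
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

x = [1, 2, 3, 4, 5]
y = [2.1, 4.0, 6.2, 7.9, 15.0]          # the last point deviates badly
res = residuals(x, y, intercept=0.0, slope=2.0)
flagged = [i for i, r in enumerate(res) if abs(r) > 2.0]
print(flagged)  # [4] -- the suspicious fifth observation
```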

Concisely discussing one of the more important statistical philosophies in data science

If I had to guess, most of the statistical theory you’ve learned has been taught through a Frequentist lens. T-tests, z-tests, p-values, and ANOVA, just to name a few, all come from the Frequentist viewpoint. In this blog post, I will have you consider an alternative, the Bayesian viewpoint, as I compare the statistical frameworks of Bayesians and Frequentists, and then later discuss Bayes’ Theorem.

How each camp understands probability itself is a natural place to begin when explaining the distinctions between Bayesians and…
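
Ahead of the later discussion, Bayes’ Theorem itself, P(H|E) = P(E|H) P(H) / P(E), can be demonstrated with a small worked example; the sensitivity, false-positive rate, and prior below are invented numbers:

```python
# Posterior probability of a hypothesis H given evidence E, via Bayes' Theorem.
# Example framing: a diagnostic test with 99% sensitivity, a 5% false-positive
# rate, and a 1% prior probability of the condition.
def posterior(prior, sensitivity, false_positive_rate):
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

print(round(posterior(0.01, 0.99, 0.05), 3))  # 0.167
```

Even with a strong test, a small prior keeps the posterior modest, which is exactly the kind of reasoning the Bayesian framework makes explicit.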

What it is and how to correct it

The discovery of an amazingly low p-value does not guarantee that the null hypothesis is false. A p-value of 0.0001 means there is still a 1 in 10,000 chance of observing a result at least this extreme if the null hypothesis were true. And p-values *alone* can be deceptive. For instance, if you run repeated tests, you are likely to come upon a tiny p-value at some point, whether the null hypothesis is true or not. Imagine 100 scientific experiments, each reporting a p-value of 0.05. Are all of these findings legitimate? Probably not. There is already a 5 percent probability…
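
The multiple-testing trap described above follows directly from the arithmetic: with independent tests at a 0.05 threshold, the chance of at least one false positive compounds quickly.

```python
# Probability of at least one false positive across n independent tests,
# each run at significance level alpha, when the null hypothesis is true.
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    return 1 - (1 - alpha) ** n_tests

for n in (1, 10, 100):
    print(n, round(prob_at_least_one_false_positive(n), 3))
# 1 0.05
# 10 0.401
# 100 0.994
```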

Discussing power and what influences it in hypothesis testing

If you have studied hypothesis testing, you are probably familiar with p-values and how they are used to reject, or fail to reject, the null hypothesis. The **power** of a statistical test measures an experiment’s ability to detect a difference when one truly exists. For this post, I will use testing the fairness of a coin as the example, and my null hypothesis will be “this coin is *fair*”. In this case, the likelihood of *rejecting* the null hypothesis when the coin is actually *unfair* is the *power* of our statistical…
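
The coin example can be made concrete with exact binomial arithmetic; the sample size (20 flips), rejection rule (15 or more heads), and alternative bias (75% heads) below are assumptions for illustration:

```python
from math import comb

def binom_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n, k = 20, 15                      # reject "fair" if 15+ heads out of 20
alpha = binom_tail(n, 0.5, k)      # Type I error rate under a fair coin
power = binom_tail(n, 0.75, k)     # power against a 75%-heads coin
print(round(alpha, 3), round(power, 3))  # 0.021 0.617
```

So this particular test rarely accuses a fair coin, but it also catches a heavily biased coin only about 62% of the time, which is why power analysis matters before collecting data.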

When you conduct statistical tests to decide whether a claim is true or false, you are **hypothesis testing**. The **null hypothesis** is the initial statement being tested, and it is assumed to be true unless there is overwhelming evidence to the contrary. One common example is the assumption that two groups are *not* statistically different from each other.

However, there are times when scientists reject the null hypothesis when they should not have. The reverse can also happen: failing to reject the null hypothesis when it should have been rejected. Data scientists…

Aspiring Data Scientist — Recent Graduate of Flatiron School’s Online Data Science Bootcamp