Just why is Bayes so naive?

In this blog post, we’ll look at how to apply some machine learning techniques rooted in Bayes’ theorem and the foundational principles of Bayesian statistics. Classification problems are a natural application of Bayes’ theorem: when you predict a class label from other data, you are effectively reasoning about conditional probability. I will help you understand how to make a classification using the probabilities produced by Naive Bayes.

By assuming that the features are independent of one another, Naive Bayes algorithms apply Bayes’ formula to several variables. The…
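To make that independence assumption concrete, here is a minimal sketch in plain Python. The classes ("spam"/"ham"), the priors, and the per-word probabilities are all invented purely for illustration; a real model would estimate them from training data.

```python
# Minimal Naive Bayes sketch. All probabilities below are hypothetical.

def naive_bayes_posterior(priors, likelihoods, features):
    """Score each class as P(class) * product of P(feature_i | class),
    relying on the naive assumption that features are independent."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for i, present in enumerate(features):
            p = likelihoods[cls][i]
            score *= p if present else (1 - p)
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}  # normalize

priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {"spam": [0.8, 0.7], "ham": [0.1, 0.3]}  # P(word_i present | class)

posterior = naive_bayes_posterior(priors, likelihoods, [1, 1])
prediction = max(posterior, key=posterior.get)  # "spam"
```

Multiplying the prior by each per-feature likelihood is exactly where the "naive" part lives: no interactions between features are modeled.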

A quick rundown of a really cool supervised learning algorithm

**K-Nearest Neighbors**, or KNN, is a supervised learning algorithm that can be applied to *classification* and *regression* problems. KNN is a **distance-based classifier**, which means it assumes that the closer two points are, the more similar they are. Euclidean distance, Minkowski distance, and Manhattan distance are all examples of different distance metrics. Each feature in KNN serves as a dimension. In a dataset with two columns, we can conveniently visualize this by treating the values of one column as X coordinates and the other as Y coordinates. Because this is…
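The three distance metrics named above fit in a few lines, and a toy majority-vote classifier shows how they drive KNN. The 2-D training points and labels below are made up for illustration.

```python
from collections import Counter

# Minkowski distance generalizes both Euclidean (p=2) and Manhattan (p=1).
def minkowski(a, b, p):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def euclidean(a, b):
    return minkowski(a, b, 2)

def manhattan(a, b):
    return minkowski(a, b, 1)

# Toy KNN: label a query point by majority vote among its k nearest neighbors.
def knn_predict(train, query, k=3):
    nearest = sorted(train, key=lambda pt: euclidean(pt[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical 2-D training data: each feature is one coordinate/dimension.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]
knn_predict(train, (1.5, 1.5))  # "A"
```

Sorting every training point by distance is the simplest possible implementation; real libraries use spatial indexes (e.g., KD-trees) to avoid the full scan.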

Discussing two of the most useful metrics used to describe a model’s performance

In my most recent blog post, I went over two of the simpler and more common metrics used to evaluate model performance in machine learning, *precision* and *recall*. In this blog post, I will be discussing two *better* choices for evaluating model performance, *accuracy* and *F1 score*, and going over how to compute them.

The most intuitive metric is probably **accuracy**. Accuracy is helpful because it lets us count the correct predictions a model makes, since it includes true positives…
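Both metrics can be computed directly from the four confusion-matrix counts. The counts used here are hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented confusion counts: 40 TP, 45 TN, 5 FP, 10 FN.
accuracy(40, 45, 5, 10)  # 0.85
f1_score(40, 5, 10)      # ~0.842
```

Note that F1 ignores true negatives entirely, which is why it behaves better than accuracy on imbalanced classes.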

Precision and recall are two of the most fundamental evaluation metrics we have at our disposal.

It’s imperative to compare your models to each other and pick the best-fitting ones when performing **classification** tasks. When you are estimating values in regression, it makes sense to talk about error as deviation from the true values, that is, how far off the predictions were. But in classification, you are either right or wrong when classifying a *binary variable*. Consequently, we prefer to think in terms of how many false positives and false negatives a model has. In…
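A minimal sketch of counting false positives and false negatives from predicted versus actual labels; the two label lists below are invented for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for one positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def precision(tp, fp):
    """Of everything predicted positive, how much really was positive?"""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everything truly positive, how much did the model catch?"""
    return tp / (tp + fn)

y_true = [1, 1, 1, 0, 0, 0]   # hypothetical actual labels
y_pred = [1, 1, 0, 1, 0, 0]   # hypothetical predictions
tp, fp, fn, tn = confusion_counts(y_true, y_pred)  # (2, 1, 1, 2)
```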

How to automate the process of selecting features

In data science, there are many different approaches to building features that model complicated relationships, although this can sometimes be troublesome. In this blog post, you will learn about the various strategies you can use to keep only the features that are most important to your model!

**Feature selection** is the process by which you choose a subset of features for building your models. This process comes with many advantages, the most noticeable being the performance boost it gives a machine learning algorithm.

A second advantage includes decreasing…
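One simple filter-style strategy for this is to score every feature by the absolute value of its Pearson correlation with the target and keep only the top k. The feature names and values in this sketch are entirely hypothetical.

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def select_k_best(features, target, k):
    """Keep the k features most correlated (in absolute value) with the target."""
    ranked = sorted(features,
                    key=lambda name: abs(pearson(features[name], target)),
                    reverse=True)
    return ranked[:k]

features = {                          # hypothetical feature columns
    "sqft":      [1000, 1500, 2000, 2500],
    "noise":     [3, 1, 4, 1],
    "bathrooms": [1, 2, 2, 3],
}
target = [200, 290, 410, 500]         # hypothetical rental prices
select_k_best(features, target, k=2)  # ['sqft', 'bathrooms']
```

This is only the filter-method family; wrapper methods (e.g., recursive feature elimination) instead score subsets by refitting the model.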

Hooray for calculus!

You have probably encountered the central concept of **mathematical functions** if you have studied linear regression. You can illustrate it with the following example: assume that you have used the number of bathrooms in a house as a predictor and the house’s rental price as the target variable. The mathematical function for this example would be **rental price = f(bathrooms)**, or more generically,
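A toy version of that function, with made-up coefficients, makes the idea concrete: the function maps the predictor straight to the target.

```python
def rental_price(bathrooms, intercept=500, per_bathroom=250):
    """rental_price = f(bathrooms): a linear function with invented coefficients."""
    return intercept + per_bathroom * bathrooms

rental_price(2)  # 1000
```

In linear regression, fitting the model just means estimating the intercept and slope of exactly this kind of function from data.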

A quick discussion on computational complexity

In this blog post, I will introduce you to computational complexity as it relates to OLS regression. You will see why OLS might not be the most efficient algorithm for estimating regression parameters when working with large datasets. This lays the groundwork for an optimization algorithm called **gradient descent**, which will be discussed later.

In the case of simple linear regression, the OLS formula works perfectly well because it requires only a small number of operations. But it becomes computationally very…
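For simple linear regression, the closed-form OLS solution really is just a handful of operations, as this sketch shows; the data are fabricated to lie exactly on y = 2x + 1.

```python
def ols_fit(x, y):
    """Closed-form OLS for one predictor: a single O(n) pass, no iteration."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

x = [1, 2, 3, 4]
y = [3, 5, 7, 9]                 # fabricated to lie exactly on y = 2x + 1
slope, intercept = ols_fit(x, y)  # (2.0, 1.0)
```

With d features, however, the matrix form of the normal equation involves inverting a d-by-d matrix, costing roughly O(nd² + d³), which is what makes an iterative method like gradient descent attractive at scale.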

Improve your data to boost your regression results

Normal features, or features that are as normally distributed as possible, lead to better outcomes. This is what makes **scaling** and the **normalization of features** in regression modeling so significant. There are a number of ways to scale your features, and in this blog post I am going to help you evaluate whether normalization and/or standardization is appropriate for a particular dataset or model, while also considering the different approaches to each.

Sometimes there will be features that differ greatly in magnitude in…
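Two common remedies for features of very different magnitudes are min-max normalization (rescaling to the [0, 1] range) and standardization (rescaling to zero mean and unit variance). Minimal sketches of both, on invented values:

```python
def min_max_scale(values):
    """Rescale values linearly into the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

min_max_scale([10, 20, 30])  # [0.0, 0.5, 1.0]
```

Min-max scaling preserves the shape of the distribution but is sensitive to outliers (they define the range); standardization is the usual choice when a roughly normal shape is assumed.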

The biggest issue hindering quality results in regression modeling

If you are familiar with data science, especially with regression modeling, then you are probably familiar with the concepts of covariance and correlation. This post will go over the issue of **multicollinearity** in *multiple linear regression*, show how to create and interpret scatterplots and correlation matrices, and teach you how to identify when two or more predictors are collinear.

The key purpose of a **regression analysis** is to evaluate the relationship between each predictor and the outcome variable. The interpretation of a *regression coefficient* is that for every 1…
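One way to spot collinear predictors is to compute every pairwise Pearson correlation and flag pairs above a threshold. The predictors below are hypothetical, and "rooms" is deliberately constructed as a linear function of "sqft" so that one pair gets flagged.

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def high_correlation_pairs(features, threshold=0.8):
    """Return predictor pairs whose |correlation| meets the threshold."""
    names = list(features)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if abs(pearson(features[a], features[b])) >= threshold]

features = {                       # hypothetical predictors
    "sqft":  [1000, 1500, 2000, 2500],
    "rooms": [2, 3, 4, 5],         # deliberately a linear function of sqft
    "age":   [30, 5, 20, 12],
}
high_correlation_pairs(features)  # [('sqft', 'rooms')]
```

The 0.8 cutoff here is a common rule of thumb, not a law; the right threshold depends on the model and the domain.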

How to test the required assumptions

**Regression diagnostics** are a series of regression analysis techniques that test the validity of a model in a variety of ways. These techniques can include examining the model’s underlying mathematical assumptions, reviewing the model structure by considering formulas with fewer, more, or different explanatory variables, or analyzing subsets of observations, such as searching for those that are poorly represented by the model (like outliers) or that have a disproportionately large effect on the regression model’s predictions. …
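A crude diagnostic of that last kind can be sketched in a few lines: compute residuals against a fitted line and flag observations whose residual sits more than two standard deviations from the mean residual. The data and the fitted coefficients below are assumed for illustration.

```python
def residuals(x, y, slope, intercept):
    """Observed minus fitted values for a given line."""
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]

def flag_outliers(res, z=2.0):
    """Indices of residuals more than z standard deviations from the mean."""
    n = len(res)
    mean = sum(res) / n
    std = (sum((r - mean) ** 2 for r in res) / n) ** 0.5
    return [i for i, r in enumerate(res) if abs(r - mean) > z * std]

x = [1, 2, 3, 4, 5, 6]
y = [3.1, 4.9, 7.0, 9.1, 11.0, 25.0]        # last point badly off y = 2x + 1
res = residuals(x, y, slope=2.0, intercept=1.0)
flag_outliers(res)  # [5]
```

Fuller diagnostic suites (e.g., in statsmodels) also test normality and homoscedasticity of the residuals and compute leverage/influence measures such as Cook’s distance.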