The biggest issue hindering quality results in regression modeling

If you are familiar with data science, especially with regression modeling, then you are probably familiar with the concepts of covariance and correlation. This post will go over the issue of **multicollinearity** in *multiple linear regression*, show how to create and interpret scatterplots and correlation matrices, and teach you how to identify whether two or more predictors are collinear.

The key purpose of a **regression analysis** is to evaluate the relationship between each predictor and the outcome variable. A *regression coefficient* represents the average change in the dependent variable for every one-unit change in a predictor, assuming all other predictor variables are held constant. It’s precisely for that reason that multicollinearity causes problems. Because regression theory rests on changing one variable while holding the others stable, it is a concern when changes in one predictor tend to move together with changes in another. When that happens, even minor changes to the model or the data can produce major swings in the coefficient estimates. …
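One quick way to spot collinearity is the correlation matrix mentioned above. Here is a minimal sketch with NumPy, using made-up data in which one predictor is deliberately constructed from another:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical predictors: x2 is deliberately built to be nearly
# collinear with x1, while x3 is independent of both.
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)

# Correlation matrix across the three predictors
X = np.column_stack([x1, x2, x3])
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))
```

An off-diagonal entry close to 1 (here, between `x1` and `x2`) is a red flag that two predictors carry nearly the same information, which is exactly the situation that destabilizes coefficient estimates.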

How to test the required assumptions

**Regression diagnostics** are a series of regression analysis techniques that test the validity of a model in a variety of ways. These techniques can include an examination of the model’s underlying mathematical assumptions, an exploration of the model’s structure by considering formulas with fewer, more, or different explanatory variables, or an analysis of subsets of observations, such as searching for those that are poorly represented by the model, like outliers, or that have a disproportionately large effect on the regression model’s predictions. …
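One of the simplest diagnostics described above is examining residuals for poorly fit observations. A minimal sketch, using simulated data with one outlier injected at a known position (all values here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)
y[10] += 8.0  # inject a single outlier at a known index

# Fit a simple linear model and inspect the residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# A basic diagnostic: flag observations whose standardized residual
# is more than three standard deviations from zero
z = residuals / residuals.std()
outliers = np.flatnonzero(np.abs(z) > 3)
print(outliers)
```

In practice you would follow a flag like this with a closer look at the observation itself, and perhaps an influence measure such as leverage or Cook’s distance, rather than dropping it automatically.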

Concisely discussing one of the more important statistical philosophies in data science

If I had to guess, most of the statistical theory you’ve learned has been taught through a Frequentist lens. T-tests, z-tests, p-values, and ANOVA, just to name a few, all come from the Frequentist viewpoint. In this blog post, I will have you consider an alternative viewpoint — the Bayesian viewpoint — as I compare the statistical frameworks of Bayesians versus Frequentists, and then discuss Bayes’ Theorem.

A natural place to begin when explaining the distinction between Bayesians and Frequentists is their understanding of probability itself. For Frequentists, if the same situation, with all its details and assumptions, were replicated indefinitely, the probability of an event is the limit of its relative frequency across those replications. Bayesians, by contrast, define probability as a degree of belief, or confidence, that a single event will occur. In some cases this gives a more natural interpretation for rare or unusual events that cannot recur under the same background and conditions. The practical difference between Bayesians and Frequentists shows up in how each makes claims about unknown quantities. …
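Bayes’ Theorem, discussed later in the post, is easy to see in action with a classic worked example. The numbers below are hypothetical: a disease affecting 1% of a population, and a test that is 99% sensitive and 95% specific.

```python
# Bayes' Theorem: P(disease | positive) =
#   P(positive | disease) * P(disease) / P(positive)
prior = 0.01            # P(disease)
sensitivity = 0.99      # P(positive | disease)
false_positive = 0.05   # P(positive | no disease) = 1 - specificity

# P(positive) via the law of total probability
p_positive = sensitivity * prior + false_positive * (1 - prior)

# Posterior probability of disease given a positive test
posterior = sensitivity * prior / p_positive
print(round(posterior, 3))  # 0.167
```

Even with a highly accurate test, the posterior is only about 17%, because the prior is so low — a result that feels surprising under Frequentist intuition but falls straight out of the Bayesian update.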

What it is and how to correct it

The discovery of an impressively low p-value does not guarantee that the null hypothesis is false. A p-value of 0.0001 means that, *if the null hypothesis were true*, data at least this extreme would occur only 1 time in 10,000; it is not the probability that the null hypothesis is true. Moreover, p-values *alone* can be deceptive. If you run enough tests, you are likely to stumble upon a tiny p-value at some point, whether the null hypothesis is true or not. Imagine 100 experiments, each reporting a p-value of 0.05. Are all of those findings legitimate? Probably not. For any single experiment with a p-value of 0.05, there is a 5 percent chance of seeing such a result even when the null hypothesis holds. Collectively, then, the probability that *all* of these null hypotheses are false is very small. …
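The multiple-testing effect described above is easy to demonstrate by simulation. The sketch below runs 1,000 hypothetical experiments in which the null hypothesis is *true* by construction (both samples come from the same distribution), and counts how many still come out “significant” at the 0.05 level:

```python
import numpy as np

rng = np.random.default_rng(42)

n_experiments = 1000
n = 50
false_positives = 0
for _ in range(n_experiments):
    # Both groups are drawn from the SAME distribution: the null is true
    a = rng.normal(size=n)
    b = rng.normal(size=n)
    # Two-sample test statistic; approximately N(0, 1) under the null
    t = (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
    if abs(t) > 1.96:  # nominal alpha = 0.05
        false_positives += 1

print(false_positives)  # close to 5% of 1,000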

Discussing power and what its influences are in hypothesis testing

If you have studied hypothesis testing, you are probably familiar with p-values and how they’re used to reject, or fail to reject, the null hypothesis. The **power** of a statistical test is the probability that it detects a difference when one truly exists. For this post, I will use testing the fairness of a coin as the example, and my null hypothesis will be “this coin is *fair*”. In this case, the probability of *rejecting* the null hypothesis when the coin is actually *unfair* is the *power* of our statistical test. …
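The coin example above can be computed exactly with the binomial distribution. The sketch below assumes a hypothetical experiment of 100 flips, a significance level of 0.05, and a true (unknown to the tester) heads probability of 0.6:

```python
from math import comb

def binom_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success prob p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 100       # number of coin flips in the experiment
alpha = 0.05  # significance level

# Build a symmetric two-sided rejection region under the null (fair coin):
# reject when the head count is at least c away from 50, with c chosen so
# the total null probability of the region is at most alpha.
null = [binom_pmf(k, n, 0.5) for k in range(n + 1)]
c, tail = 0, 1.0
while tail > alpha:
    c += 1
    tail = sum(p for k, p in enumerate(null) if abs(k - 50) >= c)

# Power: probability of landing in the rejection region when the coin is
# actually biased, here with an assumed alternative of p = 0.6
power = sum(binom_pmf(k, n, 0.6) for k in range(n + 1) if abs(k - 50) >= c)
print(round(power, 2))
```

For these settings the power comes out under 50%, which illustrates a key point of the post: detecting a modest bias reliably requires far more flips than intuition suggests.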

When you’re conducting statistical tests to decide whether you think a claim is true or false, you’re **hypothesis testing**. The **null hypothesis** is the initial statement that you are testing, and it is assumed to be true unless there is overwhelming evidence to the contrary. One common example is the assumption that two groups are *not* statistically different from each other.

However, there are times when scientists reject the null hypothesis when they should not have rejected it. The reverse could also happen if the null hypothesis is not rejected when it should have been. Data scientists refer to these errors as Type I and Type II errors, respectively. …

The central limit theorem — if you are studying statistics or data science, then this is definitely a term you have heard before. But given its importance, it can be a bit confusing to understand when you are first learning it (I know it was for me!). In this blog post, I’m going to explain the central limit theorem in a short, concise way that will hopefully stick with you and help you become a better statistician or data scientist!

By definition, the central limit theorem states that the sum of independent random variables tends toward a normal distribution **as the number of variables increases**, regardless of how the individual variables are distributed. This is useful and important, especially when you want to use sample statistics to estimate the parameters of whatever population you may be studying. An immediate consequence is that when the means (averages) of many samples are calculated, those sample means will themselves form an approximately normal distribution. Knowing these two facts allows you to set up boundaries, in a sense, when you are making your estimates for your population. This knowledge can also be used to estimate the probability that certain samples possess outliers or extreme values that differ significantly from the mean of your population. …
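The sampling-distribution consequence above can be demonstrated in a few lines. The sketch below draws from a deliberately non-normal population (uniform on [0, 1]) and checks that the sample means cluster around the population mean with the spread the CLT predicts:

```python
import numpy as np

rng = np.random.default_rng(7)

# A non-normal population: Uniform(0, 1), whose mean is 0.5
population_mean = 0.5
sample_size = 40
n_samples = 2000

# 2,000 samples of size 40; take the mean of each sample
means = rng.uniform(0, 1, size=(n_samples, sample_size)).mean(axis=1)

# The CLT predicts the means spread with sd = sigma / sqrt(n);
# for Uniform(0, 1), sigma = 1 / sqrt(12)
predicted_sd = (1 / 12) ** 0.5 / sample_size ** 0.5
print(round(means.mean(), 3), round(means.std(), 4), round(predicted_sd, 4))
```

Plotting a histogram of `means` would show the familiar bell shape, even though no individual draw is remotely normal.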

Often in data science, we focus and obsess over obtaining the holy grail of distributions — the normal distribution. However, two other distributions, the Bernoulli and the binomial, have many real-world applications and must be clearly understood when we encounter them. In this post, I will discuss the differences between the two and provide simple, real examples of their occurrence in our data-driven world.

Also called the **binary distribution**, this is the kind of distribution that is present when flipping a coin. Personally, I much prefer referring to this as the binary distribution because of that prefix, *bi*. …
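The relationship between the two distributions is simple: a Bernoulli trial is a single binary outcome, and a binomial distribution counts successes across many independent Bernoulli trials. A short illustration using the coin-flip example:

```python
from math import comb

# Bernoulli: one trial with success probability p (e.g., one coin flip).
# Binomial: the number of successes in n independent Bernoulli trials.
def binomial_pmf(k, n, p):
    """P(exactly k successes in n trials, each succeeding with prob p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 5 heads in 10 fair coin flips
p_five = binomial_pmf(5, 10, 0.5)
print(round(p_five, 4))  # 0.2461
```

Note that even the most likely single outcome (5 heads) occurs less than a quarter of the time, which is why reasoning about the full distribution matters more than reasoning about its peak.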

Probability is a vital area of study to understand in order to be an effective data scientist. It may not be the most fun, but having an understanding of the math underlying all the amazing work your models do will allow you to better explain and better develop all of your models. In this post, I will specifically be talking about sets, and will be covering these topics:

- Defining what a set is
- Explaining universal sets and subsets
- Discussing the following set operations: unions, intersections, relative complements, and absolute complements

But first, what exactly is a set? Well, it’s generally described as a well-defined collection of objects. In mathematics, sets are usually represented by 𝑆. If you have an object X and it belongs to the set, then you would say that X ∈ 𝑆. But, if object X does *not* belong to the defined set, then you would say that X ∉ 𝑆. For example, if you define 𝑆 as a set of odd numbers, and if X = 1, then X ∈ 𝑆. …
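Python’s built-in `set` type maps directly onto these ideas, including all four operations listed above. A small illustration, using a universal set of the digits 0–9 as the hypothetical universe:

```python
# A universal set for this example, and two subsets of it
U = set(range(10))
S = {1, 3, 5, 7, 9}   # S: the odd numbers in U
T = {1, 2, 3, 4}

print(1 in S)   # membership, i.e. 1 ∈ S  -> True
print(S | T)    # union of S and T
print(S & T)    # intersection: elements in both S and T
print(S - T)    # relative complement: elements of S not in T
print(U - S)    # absolute complement of S (taken within U)
```

Each operator returns a new set, so these expressions can be chained, e.g. `(S | T) - U` is the empty set because both S and T are subsets of U.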

In the real world of data science, gathering some or all of our data will often be a task that falls to us. As much as we love to work with clean, organized datasets from Kaggle, that perfection is rarely replicated in day-to-day work. That’s why knowing how to scrape data is a very valuable skill to possess, and today I’m going to demonstrate how to do just that with images, finishing by displaying your image results in a Pandas DataFrame.

To start, I’m going to scrape from the website that I first learned to scrape images from, which is books.toscrape.com. This is a great site to practice all of your scraping skills on, not just image scraping. Now, the first thing you’ll want to do is import some necessary packages — BeautifulSoup and requests. …
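A minimal sketch of the approach follows. In a real run you would fetch the live page with `requests.get("http://books.toscrape.com")`; here a hardcoded HTML snippet, modeled on that site’s markup, stands in for the response so the example is self-contained and runs without a network connection:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Stand-in for requests.get("http://books.toscrape.com").text
html = """
<ol class="row">
  <li><article class="product_pod">
    <img src="media/cache/fe/72/cover1.jpg" alt="A Light in the Attic" class="thumbnail">
  </article></li>
  <li><article class="product_pod">
    <img src="media/cache/08/e9/cover2.jpg" alt="Tipping the Velvet" class="thumbnail">
  </article></li>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect each image's alt text (the title) and its full URL
rows = [
    {"title": img["alt"],
     "image_url": "http://books.toscrape.com/" + img["src"]}
    for img in soup.find_all("img", class_="thumbnail")
]

df = pd.DataFrame(rows)
print(df)
```

The `class_="thumbnail"` filter and the relative `src` paths match the markup in the snippet above; when scraping the live site, inspect the page source first to confirm the tags and classes you need to target.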
