Regression, Big O Notation, and Gradient Descent — How Are They All Related?
A quick discussion on computational complexity
In this blog post, I will be exposing you to some of the complexity in computation in relation to OLS regression. You will read about this concept and see that this might not be the most powerful algorithm for estimating regression parameters while regression is being done with large datasets. This will lay the groundwork for an algorithm for optimization called gradient descent that will later be discussed.
In the case of simple linear regression, the OLS formula functions perfectly well because of a small number of computed operations. But, it gets computationally very costly for datasets that are quite large (which are known as big data sets) since it would possibly require a vast amount of complex mathematical dimensions. This is where knowing what Big O Notation is becomes very important.
Big O Notation is used to explain how quickly an algorithm develops by measuring the number of operations within the algorithm. In other words, Big O Notation allows you to see the worst-case algorithm situation. Since we’re focused on how slowly a given algorithm could run at its worst, we’re most concerned with the Big O time.
Let’s demonstrate this with an example. Imagine that your friend lost their dog — sad, I know — and they’re asking you to help find them. But, the only thing they tell you about the dog is its breed. With this limited information, you decide that you are going to call every dog shelter in your town to see if they have any pups matching your friend’s breed and, hopefully, finding them. This would be a simple search. If your friend’s dog is a rare breed, like a Tibetan Mastiff, if you happen to come across one at a shelter then it is pretty likely it belongs to your friend. But what if it’s a more common dog breed, like a bulldog?
Because of their popularity and the various types of bulldogs that exist, there could be hundreds or even thousands of bulldogs that match what your friend told you about the dog. The number of bulldogs listed would be your dataset, and as your dataset increases, the total amount of time it would take to perform an exhaustive simple search would increase in a linear relationship with your dataset. Relaying this back to the main point, Big O Notation helps you to define what the worst scenario is — which in this case would be having to call every shelter and getting the details of every bulldog.
The Solution — Gradient Descent
Gradient descent is an iterative technique used to reduce model error when training a machine learning model. This is an optimization algorithm centered on a convex function that adjusts its parameters recursively to reduce a certain function towards its local minimum. Gradient descent is used in regression to obtain the values of model parameters (coefficients) that reduce, as much as possible, a cost function, such as root mean square error (RMSE). The key reason why gradient descent is used for linear regression is the reduction in computational complexity, in that, in several situations, finding the answer utilizing gradient descent is computationally cheaper.
The full understanding of how gradient descent works involves the knowledge of what a gradient actually is and being familiar with calculus. If that is not you, that’s okay — because I’m one of those people. As long as you really study and understand that last section, then you can take that knowledge and apply to any machine learning work you may perform as a data scientist.
I hope this helped you better understand the importance of knowing what Big O Notation is and why gradient descent is so commonly used in machine learning. Thank you for reading!