Accuracy and F1 Score — The Better Choices for Evaluating Model Success

Discussing the two most useful metrics that are used to describe a model’s performance


In my most recent blog post, I went over two of the easier and more common metrics used to explore model performance in machine learning, precision and recall. In this blog post, I will be discussing the two better choices for evaluating model performance — accuracy and F1 score — and going over how to evaluate them.

The most intuitive metric is likely accuracy. Accuracy counts both true positives and true negatives, so it tells us what share of a model’s predictions were correct. The formula for accuracy is: Accuracy = (True Positives + True Negatives) / Total Predictions. Accuracy answers the following question: “What percentage of all of our model’s predictions were correct?” It is the most commonly applied metric for classification tasks because it gives us a clear overview of our model’s overall results.
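To make the formula concrete, here is a minimal sketch in plain Python. The function name `accuracy` and the counts passed to it are made up for illustration; they do not come from a real model.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    total = tp + tn + fp + fn  # total predictions
    if total == 0:
        raise ValueError("no predictions to score")
    return (tp + tn) / total


# Hypothetical counts, chosen only to illustrate the arithmetic
acc = accuracy(tp=40, tn=45, fp=5, fn=10)
print(f"accuracy = {acc:.2f}")
```

Note that the numerator is the sum of both kinds of correct prediction, and only then is it divided by the total.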

The F1 score is more difficult to understand, but it is also more insightful because it is the harmonic mean of precision and recall. Basically, the F1 score cannot be high unless both precision and recall are high. Whenever a model’s F1 score is high, you can be confident that it is performing well on both. The formula for F1 score is: F1 score = 2 * (Precision * Recall) / (Precision + Recall). If a model’s precision and recall are badly out of balance, with one much lower than the other, the F1 score penalizes it severely. As a result, the F1 score is one of the most widely used single-number summaries of a classifier’s performance.
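A small sketch of that penalty, again in plain Python. The helper name `f1_from_precision_recall` and the precision and recall values are hypothetical, picked only to contrast a balanced model with a lopsided one.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention: F1 is 0 when both inputs are 0
    return 2 * (precision * recall) / (precision + recall)


# Hypothetical values: one balanced model, one lopsided model
balanced = f1_from_precision_recall(0.80, 0.80)
skewed = f1_from_precision_recall(0.95, 0.10)

# A plain average would hide the weak recall; the harmonic mean does not
plain_average = (0.95 + 0.10) / 2

print(f"balanced F1   = {balanced:.3f}")
print(f"skewed F1     = {skewed:.3f}")
print(f"plain average = {plain_average:.3f}")
```

The skewed model’s F1 score lands far below the plain average of its precision and recall, which is exactly the behavior described above.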

As I mentioned in my precision and recall post, the most critical metrics for a project are often determined by the business use case or priorities for the model. That’s why it’s important to know why you’re doing your particular machine learning task and how the model results will be put into practice. Otherwise, the model could be optimized for the wrong metric. It’s worth noting that, when in doubt, it’s a smart idea to measure all applicable metrics.

When it comes to most classification projects, you have no idea which model will do best before you get started. The standard workflow is to train several types of classifier and then compare their results to determine which is the best. You can then build a comparison table in Python, with the best performer for each metric highlighted. One of my favorite ways to calculate all of the major evaluation metrics (precision, recall, accuracy, and F1 score) is by creating a confusion matrix. All of these metrics can be calculated once we know the true positives, true negatives, false positives, and false negatives arising from a model’s predictions. Scikit-learn also has a built-in function that generates what’s known as a classification report: classification_report(), in the sklearn.metrics module. This function takes the true labels and the model’s predictions and reports the precision, recall, F1 score, and support (the number of instances of each label in y_true), broken down by individual class.
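Here is a rough sketch of both approaches with Scikit-learn. The labels and predictions are made up for illustration, and I’m assuming a binary task with classes 0 and 1; a real project would pass in the test labels and the model’s predictions instead.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true labels and model predictions (binary: 0 or 1)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# For binary 0/1 labels, ravel() returns the counts as tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Every metric in this post follows from those four counts
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")

# The same numbers, broken down per class, in one call
print(classification_report(y_true, y_pred))

# output_dict=True returns the report as a nested dict instead of text
report = classification_report(y_true, y_pred, output_dict=True)
```

The dictionary form is handy when you want to collect the scores from several classifiers into one comparison table.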

After reading this and my previous blog post, I hope you now have a more solid understanding of the various types of evaluation metrics used when observing the performance of machine learning models. Thank you for reading!


