Machine learning continues to be an increasingly integral part of our lives, whether we are applying it to research questions or to business problems. For a machine learning model to create genuine value for an organization, it must give accurate predictions. That is why model evaluation is such an important step in the machine learning development process.
For a machine learning solution to add value, it is crucial to evaluate it with the right set of metrics before you push it to production. Believe me, it is one of the trickiest steps in the machine learning pipeline. A model evaluated with the wrong metrics is of no value at all!
In our trade, we consider different types of metrics to evaluate our models. The choice of metric depends entirely on the type of model, the use case, the data that is used, and so on. This article will focus primarily on evaluation metrics for classification models!
Classification Model Evaluation:

In machine learning, we often use classification models to predict which class new observations belong to. Classification, as you know, is one of the two branches of supervised learning and deals with categorical target data. There are multiple classification algorithms, each with its own style of execution and technique of prediction. But irrespective of the underlying technique, the evaluation approach more or less remains the same (comparing actual vs. predicted values). This article will discuss some of the common evaluation methods used for classification problems!
Confusion Matrix:
It is one of the foundational concepts we must understand before moving on to the other metrics used for classification model evaluation. The confusion matrix is simply a tabular representation of the number of correct and incorrect predictions for each class label in the data. The image below gives an idea of what the confusion matrix table looks like:

To understand it, let's take an example where we need to predict whether a person has cancer or not. You have done some medical examinations and, with the help of the data you have gathered, you are going to predict whether a person has cancer. So you want to check whether the hypothesis of declaring a person as having cancer holds. Say, out of 165 people, you predict 110 people to have cancer. In reality, only 105 people have cancer, and among those 105 people, you have diagnosed 100 correctly. If we put the result into a confusion matrix, it looks like the following:

Let us use figures 1 and 2 to understand the following terminology:
- TP (True Positive): a positive class correctly predicted as positive (in our case, the 100 cancer patients you predicted correctly).
- TN (True Negative): a negative class correctly predicted as negative (in our case, the 50 patients you correctly predicted as not having cancer).
- FP (False Positive): a negative class incorrectly predicted as positive. You predicted these 10 people as having cancer, but in reality they do not. FP is also called a type-I error.
- FN (False Negative): a positive class incorrectly predicted as negative. In our case, 5 people predicted to be negative actually are cancer positive. This is dangerous. FN is also called a type-II error.
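To make this concrete, here is a minimal sketch using scikit-learn that reproduces the counts from the cancer example above; the label arrays are synthetic and exist only to recreate TP = 100, TN = 50, FP = 10 and FN = 5.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# 1 = has cancer (positive class), 0 = does not have cancer (negative class)
y_true = np.array([1] * 105 + [0] * 60)   # 105 actual positives, 60 actual negatives
y_pred = np.array([1] * 100 + [0] * 5     # 100 true positives, 5 false negatives
                  + [1] * 10 + [0] * 50)  # 10 false positives, 50 true negatives

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=100, TN=50, FP=10, FN=5
```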
The most important decision a data scientist has to make during evaluation is which of these errors poses the greater risk for the given use case. In practice, we strike a balance between avoiding type-I and type-II errors based on the business risks involved!
Accuracy
Accuracy is the most intuitive performance measure: it is the ratio of correctly predicted observations to the total observations. One may think that if we have high accuracy then our model is the best. Accuracy is indeed a useful measure, but only when you have symmetric datasets where the counts of false positives and false negatives are roughly the same. Otherwise, you must look at other metrics to evaluate the performance of your model. For example, if our model achieves an accuracy of 0.803, it means our model is approximately 80% accurate.
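In terms of the confusion-matrix counts defined earlier, accuracy is computed as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

For the cancer example above, that works out to (100 + 50) / 165 ≈ 0.91.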

Still, accuracy does not always work and, as a best practice, should not be used in isolation. In cases of heavy class imbalance, accuracy is dominated by the majority class and will give you a misleading picture of the robustness of the model. Avoid relying on accuracy when there is a class imbalance.
Precision and Recall
Precision is the proportion of correctly predicted positive observations out of all predicted positive observations. In other words, precision tells you how precise/accurate your model is: of those predicted positive, how many are actually positive?
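In confusion-matrix terms, precision is:

$$\text{Precision} = \frac{TP}{TP + FP}$$

For the cancer example, that is 100 / (100 + 10) ≈ 0.91.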

Precision is a good metric to choose when the cost of an FP is high. Take email spam detection: an FP means that a non-spam email has been flagged as spam. The email user may lose important emails if the precision of the spam detection model is not high.
Recall (also known as sensitivity) is the ratio of correctly predicted positive observations to all observations in the actual positive class.
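In confusion-matrix terms, recall is:

$$\text{Recall} = \frac{TP}{TP + FN}$$

For the cancer example, that is 100 / (100 + 5) ≈ 0.95.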

It essentially measures how many of the actual positives our model captures by labelling them as positive (TP). By the same reasoning, recall is the metric to use for selecting the best model when there is a high cost associated with an FN.
To fully evaluate a model, you must analyse both precision and recall. Unfortunately, precision and recall are often in tension: improving precision generally reduces recall and vice versa. In other words, there is a trade-off between precision and recall.
So when we need to balance recall and precision, we can combine the two metrics using the F1 score.
F1 Score
The F1 score is the harmonic mean of precision and recall and is used for models where both precision and recall matter. It gives equal weight to the two, and is a specific case of the general Fβ measure, where β can be adjusted to give more weight to either precision or recall.
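Written out, the F1 score and the general Fβ measure are:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$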

The F-measure provides a way to combine precision and recall into a single measure that captures both properties. Alone, neither precision nor recall tells the whole story: we can have excellent precision with poor recall, or poor precision with excellent recall. The F-measure expresses both concerns in a single score.
Once precision and recall have been computed for a binary or multiclass classification problem, the two scores can be combined into the F-measure.
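As a minimal sketch, scikit-learn can compute all three scores directly; the arrays below are the same synthetic labels used earlier to reproduce the cancer-example counts.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Synthetic labels reproducing the cancer example: TP=100, TN=50, FP=10, FN=5
y_true = np.array([1] * 105 + [0] * 60)
y_pred = np.array([1] * 100 + [0] * 5 + [1] * 10 + [0] * 50)

print(f"Precision: {precision_score(y_true, y_pred):.3f}")  # 100 / 110 ≈ 0.909
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")     # 100 / 105 ≈ 0.952
print(f"F1 score:  {f1_score(y_true, y_pred):.3f}")         # ≈ 0.930
```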
ROC Curve and AUC:
ROC and AUC tell us how well a classification model is able to distinguish between classes. Using the AUC-ROC curve, we can even visualize the performance of a multi-class classification problem.
- AUC – Area Under Curve
- ROC – Receiver Operating Characteristics
Before going deeper into this topic, let us understand some of the terminology we will be using later.
Specificity
Specificity tells us how good our model is at correctly identifying the actual negatives (it is also called the true negative rate).
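In confusion-matrix terms:

$$\text{Specificity} = \frac{TN}{TN + FP}$$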

Sensitivity:
Also called recall or the true positive rate (TPR), sensitivity tells us how good our model is at correctly identifying the actual positives.
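In confusion-matrix terms, sensitivity is the same quantity as recall:

$$\text{Sensitivity} = \text{Recall} = \frac{TP}{TP + FN}$$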

To put it in simple terms:
- ROC is a probability curve
- AUC represents the degree of separability
Note: When we decrease the threshold value, we get more positive predictions, which increases sensitivity and decreases specificity. Similarly, when we increase the threshold value, we get more negative predictions, which increases specificity and decreases sensitivity.
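A small sketch with made-up probability scores (the labels and scores below are purely illustrative) shows this trade-off:

```python
import numpy as np

# Hypothetical true labels and predicted probabilities, for illustration only
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.45, 0.4, 0.35, 0.6, 0.55, 0.2, 0.7, 0.1])

def sensitivity_specificity(threshold):
    """Sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP) at a given threshold."""
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return tp / (tp + fn), tn / (tn + fp)

for t in (0.3, 0.5, 0.7):
    sens, spec = sensitivity_specificity(t)
    print(f"threshold={t}: sensitivity={sens:.2f}, specificity={spec:.2f}")
# Raising the threshold lowers sensitivity and raises specificity, and vice versa.
```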
FPR (False Positive Rate): tells us, out of all the observations that are actually negative, how many the model wrongly classified as positive.
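In confusion-matrix terms, FPR is the complement of specificity:

$$\text{FPR} = \frac{FP}{FP + TN} = 1 - \text{Specificity}$$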

Now let's understand the ROC-AUC method with an example:
Consider a situation where you need to classify whether a received email is spam or not.
(Note: the red curve is the positive class, i.e. the mail is spam, and the green curve is the negative class, i.e. the mail is not spam.)
Case I:

This is the best situation. When the two curves do not intersect at all, the model has an ideal measure of separability: it is perfectly able to distinguish between the positive class and the negative class. Here AUC = 1, the best possible scenario, where the model's scores separate the two classes completely.
Case II:

Here, as you can see, the two curves intersect, which results in FPs and FNs, i.e. type-I and type-II errors (which is not a good thing).
When the AUC is 0.7, it means there is a 70% chance that the model will rank a randomly chosen positive example above a randomly chosen negative example, i.e. it can distinguish between the positive and negative classes reasonably well.
Case III:

Here comes the worst case for discrimination. When the AUC is approximately 0.5, the model has no capacity to distinguish between the positive and negative classes; it is no better than random guessing.
Case IV:

Last but not least, when the AUC is close to 0, the model is reciprocating the classes: it is predicting the negative class as positive and vice versa.
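As a minimal sketch of how such curves are produced in practice, here is one way to plot a ROC curve and compute the AUC with scikit-learn; the synthetic dataset and logistic regression model below are illustrative choices, not part of the spam example above.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, purely for illustration
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]   # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, probs)  # FPR/TPR at every threshold
auc = roc_auc_score(y_test, probs)
print(f"AUC = {auc:.3f}")

plt.plot(fpr, tpr, label=f"ROC curve (AUC = {auc:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random guess (AUC = 0.5)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Sensitivity)")
plt.legend()
plt.show()
```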
Conclusion
Model evaluation is one of the most critical steps in the machine learning development process. A model that is evaluated with the wrong metrics is a great risk to the business. In this article, we discussed some of the common metrics and evaluation methods used for classification-based models. But this is just an introduction to the topic! Stay tuned to Digital Tesseract to learn more!
I hope you found this article helpful! I would love to hear your feedback on it!