Imagine that you work at a credit card company. To protect your customers from credit card fraud, you check card usage that differs markedly from typical behavior. For example, if a purchase amount is much larger than usual and the purchase occurs far from the owner’s home city, then the purchase is suspicious. You have to detect such transactions as soon as they occur and contact the credit card owner for verification. Most credit card transactions are normal, but if a credit card is stolen, its transaction pattern usually changes dramatically. The essential idea behind credit card fraud detection is to identify those transactions that are very different from the normal ones. This is essentially outlier detection.

Contrary to what most data science courses would have you believe, not every dataset is a perfectly curated set of observations with no missing values or outliers (for example, the iris or mtcars datasets). Real-life datasets are quite messy, and a lot of pre-processing is needed before their analysis. Outliers are one of the hurdles we will certainly face. They are also highly diverse, and their behavior may change over time, so learning a stable decision boundary between outliers and non-outliers is often difficult in practice.

In this article, I will brief you about outliers, their impact on data analysis, several detection techniques, and how to handle them!

What are outliers?

Outliers are the data objects that stand out amongst other data objects and do not conform to the expected behavior in a dataset.

Hence, an outlier is always defined in the context of the other objects in the dataset. By nature, outliers are rare, and hence they stand out amongst other data points. For example, a high-income individual may be an outlier in a dataset of a middle-class neighbourhood, but not in a dataset of luxury vehicle owners.

Causes of outliers:

Outliers in a dataset can originate either from errors in the data or from valid, inherent variability in the data. However, pinpointing exactly what caused an outlier is a tedious task, and it may be impossible to determine the cause at all. Here are some of the most common reasons why outliers occur in a dataset.

1. Data errors:

Outliers may be part of the dataset because of measurement errors, human errors, or data collection errors. For example, in a dataset of human heights, a reading such as 1.70 cm is an error and was most likely entered into the system wrongly (perhaps 1.70 m recorded in the wrong unit). Such data points are usually discarded.

2. Normal variance in the data:

In a normal distribution, 99.7% of data points lie within three standard deviations of the mean. In other words, about 0.27% of data points, roughly 1 in 370, lie outside three standard deviations of the mean. Such records are rare, but they too are part of legitimate data. For example, an individual earning a crore rupees in a year, or someone who is more than 7 ft tall, falls under the category of an outlier in an income dataset or a human height dataset, respectively. These outliers skew some of the descriptive statistics, such as the mean of the dataset. Regardless, they are legitimate data points.

3. Data from other distribution classes:

The number of daily page views for an e-commerce website from a single IP address usually ranges from one to several dozen. But a few IP addresses reach hundreds of page views in a day. Such an outlier could be an automated program scraping the content of the site or accessing one of its utilities. Even though they are outliers, it is quite “normal” for bots to register thousands of page views on a website.

4. Distributional assumptions:

Outlier data points can originate from incorrect assumptions made about the data or its distribution. For example, if the data measured is the usage of a library in a college, then during final exams there will be outliers because of a surge in library usage. Similarly, there will be a surge in retail sales during festivals. Here the outlier is expected, but that doesn’t mean it isn’t a part of the dataset.

Impacts of outliers:

Outliers can drastically change the results of data modeling. There are numerous unfavorable impacts of outliers on a dataset:

  • They increase the variance error and reduce the power of statistical tests
  • They can decrease normality if they are non-randomly distributed
  • They increase the bias error
  • They can violate the basic assumptions of regression and other statistical models

Here is a simple example of how a single outlier affects data analysis. Let’s consider two samples:

  A. 12, 34, 55, 32, 25, 45, 78, 43, 345
  B. 12, 34, 55, 32, 25, 45, 78, 43, 19

Let’s check the mean, median, and standard deviation of each sample to understand the outlier’s impact.
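A minimal sketch with NumPy reproduces these summary statistics (the variable names are my own):

```python
import numpy as np

sample_a = np.array([12, 34, 55, 32, 25, 45, 78, 43, 345])  # contains the outlier 345
sample_b = np.array([12, 34, 55, 32, 25, 45, 78, 43, 19])   # outlier replaced by 19

for name, sample in [("A", sample_a), ("B", sample_b)]:
    print(f"Sample {name}: mean={sample.mean():.1f}, "
          f"median={np.median(sample):.1f}, std={sample.std():.1f}")
```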

As you can see, the sample with the outlier looks very different: the mean of sample A is 74.3, whereas the mean of sample B is 38.1. That is a huge difference caused by a single outlier, and it changes the data analysis completely.

Types of outliers:

Outliers can be of two types:

1) Univariate
2) Multivariate

Univariate outliers can be found by looking at the distribution of a single variable. Multivariate outliers are outliers in an n-dimensional space; to find them, you have to look at distributions in multiple dimensions.


For univariate analysis, I have plotted box plots for both BMI and blood glucose levels. Here we check for outliers in each variable separately, and it seems that there are none present.
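Here is a sketch of how such box plots can be drawn, assuming a pandas DataFrame with “BMI” and “Glucose” columns (the file name diabetes.csv and the column names are assumptions; adjust them to your data):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("diabetes.csv")  # assumed file with BMI and Glucose columns

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df.boxplot(column="BMI", ax=axes[0])      # box plot for BMI
df.boxplot(column="Glucose", ax=axes[1])  # box plot for blood glucose
plt.show()
```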


For multivariate analysis, I have plotted a scatter plot between the two variables, as shown below. This time the plot shows two extreme records which can be considered outliers.
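A matching sketch for the scatter plot, continuing with the same assumed DataFrame:

```python
import matplotlib.pyplot as plt  # df comes from the previous snippet

plt.scatter(df["BMI"], df["Glucose"], s=15)
plt.xlabel("BMI")
plt.ylabel("Glucose")
plt.show()
```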


Outlier detection techniques:

A) For univariate outliers:

1) Inter-Quartile Range method (IQR):

The concept of the interquartile range (IQR) is used to build boxplot graphs. The IQR is a statistical measure of dispersion and data variability obtained by dividing the dataset into quartiles.

In simple words, any set of observations is divided into four defined intervals based upon the values of the data. The quartiles Q1, Q2, and Q3 are the three cut points that create those four intervals.

The IQR is the difference between the third quartile and the first quartile (IQR = Q3 − Q1). Outliers, in this case, are defined as observations below Q1 − 1.5 × IQR (the boxplot’s lower whisker) or above Q3 + 1.5 × IQR (the boxplot’s upper whisker). This can be represented visually with a box plot.

Let’s define a function to find out the IQR, lower, and upper whisker.

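A minimal sketch of such a function (the name iqr_bounds is my own):

```python
def iqr_bounds(series):
    """Return the IQR and the lower/upper whiskers of a pandas Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr  # boxplot lower whisker
    upper = q3 + 1.5 * iqr  # boxplot upper whisker
    return iqr, lower, upper
```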

For example, I’ll take up the Heart Disease UCI dataset (https://www.kaggle.com/ronitf/heart-disease-uci) for analysis. I will check for outliers in the variable “chol” using the IQR method.

Let’s check the outliers:

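A sketch of the check, assuming the Kaggle CSV is saved as heart.csv and using the iqr_bounds function defined above:

```python
import pandas as pd

df = pd.read_csv("heart.csv")  # assumed file name for the UCI heart data

iqr, lower, upper = iqr_bounds(df["chol"])
outliers = df[(df["chol"] < lower) | (df["chol"] > upper)]
print(outliers["chol"])
```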

As per the IQR method, there are 5 outliers.

Visual representation:

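One way to reproduce such a plot is to shade the region beyond the upper whisker in red (a sketch continuing from the snippet above):

```python
import matplotlib.pyplot as plt

plt.scatter(df.index, df["chol"], s=10)
plt.axhline(upper, color="red", linestyle="--")
plt.axhspan(upper, df["chol"].max() + 20, color="red", alpha=0.2)  # outlier zone
plt.xlabel("record index")
plt.ylabel("chol")
plt.show()
```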

Here the red zone represents the outlier zone! The records present in that zone are considered outliers.

Remedial Measure:

Remove the records above the upper bound and those below the lower bound!

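A one-line sketch of the removal:

```python
df_clean = df[(df["chol"] >= lower) & (df["chol"] <= upper)]
```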

2) Standard deviation method:

This method applies only if the data follow a normal distribution. That distribution has the property that the number of standard deviations from the mean reliably summarizes the percentage of values captured in the sample. We cover more of the data sample as we expand the range:

– One standard deviation from the mean covers 68% of the data.

– Two standard deviations from the mean cover 95% of the data.

– Three standard deviations from the mean cover 99.7% of the data.

Together these make up the familiar 68-95-99.7 rule of the normal distribution.

Sometimes the data are standardized first (e.g., to z-scores with zero mean and unit variance) so that outlier detection can be performed using standard z-score cut-off values. This is a convenience and is not required in general; here we will perform the calculations on the original scale of the data to keep things clear.

Let’s define a function to find the lower and upper cut-offs using the standard deviation method:

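A minimal sketch, with three standard deviations as the default cut-off (the function name and the k parameter are my own choices):

```python
def sd_bounds(series, k=3):
    """Return the lower/upper cut-offs k standard deviations from the mean."""
    mean, std = series.mean(), series.std()
    return mean - k * std, mean + k * std
```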

I will check the outliers in the variable “trestbps” using the SD method.

Let’s check the outliers:

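The check itself, continuing with the same df (a sketch):

```python
lower, upper = sd_bounds(df["trestbps"])
outliers = df[(df["trestbps"] < lower) | (df["trestbps"] > upper)]
print(outliers["trestbps"])
```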

As per the SD method, there are 2 outliers.

Visual representation:

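The same shading idea as before visualizes the cut-offs (a sketch):

```python
import matplotlib.pyplot as plt

plt.scatter(df.index, df["trestbps"], s=10)
plt.axhline(upper, color="red", linestyle="--")
plt.axhspan(upper, df["trestbps"].max() + 10, color="red", alpha=0.2)  # outlier zone
plt.xlabel("record index")
plt.ylabel("trestbps")
plt.show()
```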

Here the red zone represents the outlier zone! The records present in that zone are considered outliers.

Remedial Measure:

Remove the records above the upper bound and those below the lower bound!

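And the corresponding removal (a sketch):

```python
df_clean = df[(df["trestbps"] >= lower) & (df["trestbps"] <= upper)]
```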

3) Isolation forest method:

Isolation forest is an algorithm for detecting outliers. It works well on large datasets, in one-dimensional or multi-dimensional feature spaces. It partitions the data using a set of trees and assigns each point an anomaly score based on how isolated the point is in the resulting structure. The anomaly score is then used to separate outliers from normal observations. Like random forest, it is built on an ensemble of binary trees. Partitions are created by selecting a feature at random and then selecting a random split value between the minimum and maximum of that feature. Outliers are less frequent than regular observations and lie further away from them, so under such random partitioning they tend to be isolated closer to the root of the tree, with fewer splits necessary. An anomaly score is calculated from this, and the outliers are identified accordingly.

Let’s find the outlier region in the columns “chol” and “age” using the isolation forest method:

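A sketch using scikit-learn’s IsolationForest; the contamination rate of 5% and the random seed are assumptions:

```python
from sklearn.ensemble import IsolationForest

X = df[["chol", "age"]].values
iso = IsolationForest(contamination=0.05, random_state=42)
labels = iso.fit_predict(X)  # -1 marks outliers, 1 marks normal points
print((labels == -1).sum(), "outliers found")
```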

Visual representation:
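One way to draw the anomaly-score landscape over the two features (a sketch continuing from the snippet above):

```python
import numpy as np
import matplotlib.pyplot as plt

xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 10, X[:, 0].max() + 10, 200),
                     np.linspace(X[:, 1].min() - 5, X[:, 1].max() + 5, 200))
zz = iso.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, zz, cmap="RdBu")  # low scores mark the outlier regions
plt.scatter(X[:, 0], X[:, 1], c=np.where(labels == -1, "red", "blue"), s=15)
plt.xlabel("chol")
plt.ylabel("age")
plt.show()
```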

In the figure above, we have trained our IsolationForest on the data, computed the anomaly score for each observation, and classified each observation as an outlier or non-outlier. The chart shows the anomaly scores and the regions where the outliers lie. As expected, the anomaly score reflects the shape of the underlying distribution, and the outlier regions correspond to low-probability areas.

B) For multivariate outliers:

1) DBSCAN (Density-Based Spatial Clustering of Applications with Noise):

Outliers, by definition, occur less frequently than normal data points. This means that in the data space, outliers occupy low-density areas while normal data points occupy high-density areas. Density is a count of data points in a normalized unit of space and is inversely proportional to the distances between data points. The objective of a density-based outlier algorithm is to identify those data points in low-density areas. There are a few different ways to assign an outlier score to a data point. One is the inverse of the average distance to all k nearest neighbors; distance between data points and density are inversely proportional. Neighborhood density can also be calculated as the number of data points within a normalized unit distance. The approach for density-based outliers is similar to the approaches used for density-based clustering and the k-NN classification algorithm.

Since distance is the inverse of density, the approach of a density-based outlier detector can be explained with two parameters: distance (d) and proportion of data points (p). A point X is considered an outlier if at least a fraction p of all points lie more than distance d away from it; by this definition, X occupies a low-density area. The parameter p is usually set to a high value, above 95%. One of the key issues in this implementation is specifying d. It is important to normalize the attributes so that the distance makes sense, particularly when attributes involve different measures and units. If the distance is specified too low, more outliers will be detected, meaning normal points risk being labeled as outliers, and vice versa.

I’ll consider the “age” and “oldpeak” columns of the dataset for evaluation.

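A sketch using scikit-learn’s DBSCAN; the eps and min_samples values are assumptions, and the features are standardized first so that distance is meaningful:

```python
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

X = df[["age", "oldpeak"]].values
X_scaled = StandardScaler().fit_transform(X)  # normalize before distance-based work

db = DBSCAN(eps=0.5, min_samples=5).fit(X_scaled)
print(db.labels_)  # -1 labels the noise points, i.e. the outliers
```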

Here the -1‘s represent the outliers!

Let’s plot to differentiate the outliers. I’ll use blue for normal records and red for outliers.

Visual Representation:

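A sketch of the plot, continuing from the snippet above:

```python
import numpy as np
import matplotlib.pyplot as plt

colors = np.where(db.labels_ == -1, "red", "blue")
plt.scatter(X[:, 0], X[:, 1], c=colors, s=15)
plt.xlabel("age")
plt.ylabel("oldpeak")
plt.show()
```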

The downside of this method is that the higher the dimension, the less accurate it becomes. You also need to make a few assumptions, such as estimating the right value for eps, which can be challenging.

2) Local outlier factor (LOF) method:

The LOF technique is a variation of density-based outlier detection that addresses one of its key limitations: detecting outliers in regions of varying density. Varying density is a problem for simple density-based methods, including DBSCAN clustering. The LOF technique was proposed in the paper LOF: Identifying Density-Based Local Outliers.

LOF takes into account both the density of a data point and the density of its neighbourhood. A key feature of the LOF technique is that the outlier score reflects the relative density of the data point.

Once the outlier scores are calculated, the data points can be sorted to find the outliers in the dataset. The core of LOF lies in the calculation of relative density: by comparing the density of a data point with the density of the data points in its neighbourhood, we can determine whether the point’s density is lower than that of its neighbourhood, which indicates an outlier.

I will consider “age” and “chol” variables for the analysis of outliers using the LOF method.

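A sketch using scikit-learn’s LocalOutlierFactor; the n_neighbors value is an assumption:

```python
from sklearn.neighbors import LocalOutlierFactor

X = df[["age", "chol"]].values
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)  # -1 marks outliers, 1 marks normal points
print(labels)
```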

Here -1 represents the outliers! Now let’s plot and visualize them. I’ve set blue for the normal records and red for outliers.

Visual Representation:

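The plot itself (a sketch continuing from the snippet above):

```python
import numpy as np
import matplotlib.pyplot as plt

colors = np.where(labels == -1, "red", "blue")
plt.scatter(X[:, 0], X[:, 1], c=colors, s=15)
plt.xlabel("age")
plt.ylabel("chol")
plt.show()
```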

Conclusion:

As you saw, there are many ways to identify outliers. My philosophy is that you must use your in-depth knowledge of all the variables when analyzing data. Part of this knowledge is knowing which values are typical, unusual, and impossible. While outlier removal is an essential part of preparing a dataset, it is important to make sure the assumptions that drive the removal are sound.

I find that when you have this in-depth knowledge, it’s best to use the more straightforward, visual methods. At a glance, data points that are potential outliers will pop out under your knowledgeable gaze. Consequently, I’ll often use boxplots, histograms, and good old-fashioned data sorting! These simple tools provide enough information for me to find unusual data points for further investigation.

There are some more advanced outlier detection processes like:

1) Elliptic Envelope [https://scikit-learn.org/stable/modules/generated/sklearn.covariance.EllipticEnvelope.html]

2) One-Class Support Vector Machines [http://rvlasveld.github.io/blog/2013/07/12/introduction-to-one-class-support-vector-machines/]

3) Robust Random Cut Forest [https://github.com/awslabs/amazon-sagemaker-examples/tree/master/introduction_to_amazon_algorithms/random_cut_forest]

We live in a world where the data is getting bigger by the second.

The value of data can diminish over time if it is not used properly. Finding anomalies in a dataset is crucial, whether to identify problems in the business, to build a proactive solution that discovers a problem before it happens, or simply as part of the exploratory data analysis (EDA) phase to prepare a dataset for ML.