Automatically detecting abnormal data points and correctly classifying them as anomalies has been a challenge for many years. The problem gets harder with high-dimensional data, because traditional machine learning approaches often fail to capture the complex structure of imbalanced data. This is where deep learning methods for anomaly detection can help. Among the deep learning techniques available, we use an autoencoder for anomaly detection.

So, in this blog, we will learn about the following:

- What is an Autoencoder?
- Applications of Autoencoders
- Architecture of an Autoencoder
- Different types of autoencoders
  - Simple autoencoders
  - Sparse autoencoders
  - Deep autoencoders
  - Convolutional autoencoders
  - Denoising autoencoders
  - Variational autoencoders
- Advantages of Autoencoders
- How can autoencoders be used for Anomaly Detection?
- Anomaly Detection in the Cardio dataset using TensorFlow

#### Application of Autoencoders:

**Anomaly Detection:** An autoencoder is trained to learn what normal behavior looks like. When an outlier arrives, the autoencoder cannot encode it well, because it has only learned to represent patterns that do not exist in this data point. When it tries to reconstruct the outlier from its compact representation, the reconstruction does not resemble the original data, and this mismatch is what helps us detect anomalies when they occur. Reconstructing the original input from the encoded data is therefore the critical step in building an anomaly detection module.

**Dimensionality Reduction:** With non-linear activation functions and multiple layers, an autoencoder can learn non-linear transformations. Its objective is to learn a compressed, distributed representation of the given data, which we can use for dimensionality reduction.

**Feature Extraction:** While reducing the reconstruction error, the encoder part of the autoencoder learns important hidden features present in the input data. In this way, the model generates a new set of features that are combinations of the original ones.

**Sequence to sequence prediction:** LSTM-based autoencoders use the Encoder-Decoder model, which can capture the temporal structure of a sentence in machine translation.

**Recommendation system:** We can use deep autoencoders to understand user preferences and recommend movies, books or other items. The steps to build a recommender system using an autoencoder are:

- The input data is a clustering of similar users based on their interests.
- User interests are categorized by videos watched, watch time for each video, and so on.
- A cluster can be created from similar data selected by the above filter.
- The encoder part captures the interests of the user.
- The decoder part tries to project those interests onto two kinds of content:
  - Existing unseen content
  - New content from content creators

#### What is AutoEncoder?

Autoencoders are neural networks that can discover low-dimensional representations of high-dimensional data. From this low-dimensional representation, the network should be able to reconstruct the original input.

There are 3 major parts in an Autoencoder Architecture, as below:

- **An Encoder**, which reduces the dimensionality of a high-dimensional dataset to a low-dimensional one.
- **Code**, which contains the reduced representation of the input that is fed into the decoder.
- **A Decoder**, which expands the low-dimensional data back to high-dimensional data.

Here, the compression and decompression functions are:

- Data-specific, which means the model can only compress data similar to the data it was trained on. This property helps the model compress the majority class of an imbalanced dataset.
- Lossy, which means that the decompressed outputs are degraded compared to the original inputs.
- Learned automatically from examples, which means the model works best on the specific type of input it was trained on.

#### Understanding of Architecture

The encoder part is a feature extraction function $f$ that computes a feature vector $h(x_i)$ from an input $x_i$. We define $h(x_i) = f(x_i)$, where $h(x_i)$ is the feature representation.

The decoder part is a recovery function $g$ that maps the feature space back to the input space, producing a reconstruction $\tilde{x}_i = g(h(x_i))$.

The autoencoder tries to learn an approximation such that $\tilde{x}_i$ is similar to $x_i$. In other words, it tries to attain the lowest possible reconstruction error $E(x_i, \tilde{x}_i)$, which measures the discrepancy between $x_i$ and $\tilde{x}_i$:

$$E(x_i, \tilde{x}_i) = \lVert x_i - \tilde{x}_i \rVert$$

Autoencoders were mainly developed as multi-layer perceptrons (MLPs). The most commonly used forms for the encoder and decoder are affine transformations (which keep collinearity) followed by a nonlinearity:

$$f(x_i) = s_f(b + W x_i)$$

$$g(h(x_i)) = s_g(c + \tilde{W} h(x_i))$$

where

- $s_f$ and $s_g$ are the encoder and decoder activation functions, e.g. sigmoid and tanh,
- $b$ and $c$ are the encoder and decoder bias vectors,
- $W$ and $\tilde{W}$ are the encoder and decoder weight matrices.
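To make the encoder, decoder and reconstruction error above concrete, here is a minimal sketch in plain Python. The layer sizes, the random weights and the helper names (`affine`, `encode`, `decode`, `reconstruction_error`) are assumptions made for illustration; a real model would learn the weights rather than draw them at random.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(weights, vector, bias):
    """Return bias + weights @ vector, with the matrix stored as a list of rows."""
    return [b + sum(w * v for w, v in zip(row, vector))
            for row, b in zip(weights, bias)]


def encode(x, W, b):
    # Encoder: tanh activation applied to b + W x
    return [math.tanh(v) for v in affine(W, x, b)]


def decode(h, W_tilde, c):
    # Decoder: sigmoid activation applied to c + W_tilde h
    return [sigmoid(v) for v in affine(W_tilde, h, c)]


def reconstruction_error(x, x_tilde):
    # Euclidean norm of the difference between input and reconstruction
    return math.sqrt(sum((a - r) ** 2 for a, r in zip(x, x_tilde)))


rng = random.Random(0)
n_inputs, n_code = 4, 2  # hypothetical sizes: 4 features compressed to 2
W = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_code)]
W_tilde = [[rng.uniform(-1, 1) for _ in range(n_code)] for _ in range(n_inputs)]
b = [0.0] * n_code
c = [0.0] * n_inputs

x = [0.2, 0.9, 0.4, 0.7]
h = encode(x, W, b)              # low-dimensional code
x_tilde = decode(h, W_tilde, c)  # reconstruction in the input space
error = reconstruction_error(x, x_tilde)
print(len(h), len(x_tilde), error)
```

Training consists of adjusting `W`, `W_tilde`, `b` and `c` so that this error becomes small on the training data.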

#### Types of Autoencoder:

##### Simple Autoencoders:

An encoder network takes an input and converts it into a smaller, dense representation, also known as the latent representation of the input. The decoder network then uses this representation to reconstruct the original input as closely as possible. When the input has many features, generating a compressed representation helps compress the training samples. As the neural network goes through the training data and fine-tunes the weights of the hidden layer nodes, the weights come to represent the kind of input that we typically see. As a result, if we feed in a different kind of data, such as data with some noise, the autoencoder will not reproduce the noise well and will remove at least part of it when generating the output. This is useful because it gives us a way to remove noise from our data.

We need to create a single fully-connected neural layer as the encoder and another as the decoder, and compile the model with an optimizer, a loss and evaluation metrics. The loss function is usually either the mean squared error or the cross-entropy between the output and the input, which we call the 'reconstruction loss'. It penalizes the network for creating outputs that differ from the input. Then we fit the model on the training data and validate it on the test data.

##### Steps to create a simple Autoencoder

We will build a single fully-connected neural layer as the encoder and another as the decoder, to reconstruct images of digits.

- Import the building blocks: `from tensorflow.keras.layers import Input, Dense` and `from tensorflow.keras.models import Model`
- Let's define the size of the encoded representation.
- `encoding_dim = 32  # assuming an input size of 882 floats, the compression factor is about 27.6`
- `input_img = Input(shape=(882,))  # this is our input placeholder`
- `encoded = Dense(encoding_dim, activation='relu')(input_img)  # 'encoded' is the encoded representation of the input`
- `decoded = Dense(882, activation='sigmoid')(encoded)  # 'decoded' is the lossy reconstruction of the input`
- `autoencoder = Model(input_img, decoded)  # this model maps an input to its reconstruction`

- Let's create a separate encoder model.
- `encoder = Model(input_img, encoded)  # this model maps an input to its encoded representation`

- Let's create a separate decoder model.
- `encoded_input = Input(shape=(encoding_dim,))  # placeholder for an encoded (32-dimensional) input`
- `decoder_layer = autoencoder.layers[-1]  # retrieve the last layer of the autoencoder model`
- `decoder = Model(encoded_input, decoder_layer(encoded_input))  # create the decoder model`

- Now, let's train our autoencoder to reconstruct the digits.
- `autoencoder.compile(optimizer='adam', loss='binary_crossentropy')`

- Prepare the training data `x_train` and the test data `x_test`.
- Let's train our autoencoder for 50 epochs.
- `autoencoder.fit(x_train, x_train, epochs=50, batch_size=250, shuffle=True, validation_data=(x_test, x_test))`

- Now, we will visualize the reconstructed input and the encoded representation.
- `encoded_img = encoder.predict(x_test)`
- `decoded_img = decoder.predict(encoded_img)`
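To show what training on the reconstruction loss amounts to, here is a framework-free sketch. It is not how Keras implements `fit`: it assumes a toy dataset, linear units, mean squared error and plain gradient descent, and every name in it (`data`, `mean_loss`, `train`) is hypothetical. The point is only that the input also serves as the target.

```python
# Toy data: points on the line x2 = 2 * x1, so a single latent number suffices.
data = [(t, 2.0 * t) for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]


def mean_loss(w_enc, w_dec):
    """Mean squared reconstruction error over the toy data."""
    total = 0.0
    for x in data:
        h = w_enc[0] * x[0] + w_enc[1] * x[1]  # 1-dimensional code
        total += sum((w_dec[j] * h - x[j]) ** 2 for j in range(2))
    return total / len(data)


def train(epochs=200, learning_rate=0.05):
    w_enc = [0.1, 0.1]  # encoder weights: 2 inputs -> 1 code value
    w_dec = [0.1, 0.1]  # decoder weights: 1 code value -> 2 outputs
    for _ in range(epochs):
        grad_enc = [0.0, 0.0]
        grad_dec = [0.0, 0.0]
        for x in data:
            h = w_enc[0] * x[0] + w_enc[1] * x[1]
            residual = [w_dec[j] * h - x[j] for j in range(2)]
            # Backpropagate the squared error through the decoder to the code
            grad_h = 2.0 * sum(residual[j] * w_dec[j] for j in range(2))
            for j in range(2):
                grad_dec[j] += 2.0 * residual[j] * h / len(data)
                grad_enc[j] += grad_h * x[j] / len(data)
        w_enc = [w - learning_rate * g for w, g in zip(w_enc, grad_enc)]
        w_dec = [w - learning_rate * g for w, g in zip(w_dec, grad_dec)]
    return w_enc, w_dec


initial_loss = mean_loss([0.1, 0.1], [0.1, 0.1])
w_enc, w_dec = train()
final_loss = mean_loss(w_enc, w_dec)
print(initial_loss, final_loss)
```

Because the toy data lies on a line, one code value is enough to reconstruct it, so the loss falls close to zero. Real data rarely compresses this cleanly, which is why the reconstruction is called lossy.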

##### Sparse Autoencoder

A sparse autoencoder is an autoencoder whose training criterion involves a sparsity penalty. We construct the loss function by penalizing the activations of the hidden layers, so that only a few nodes are encouraged to activate when a single sample is fed into the network. With fewer nodes active, the autoencoder is pushed to learn latent representations instead of redundant information in the input data.

It's important to note that which nodes of a trained model activate is data-dependent: different inputs will result in activations of different nodes through the network.

One result of this fact is that we allow our network to sensitize individual hidden layer nodes toward specific attributes of the input data. An undercomplete autoencoder will use the entire network for every observation, whereas a sparse autoencoder will selectively activate regions of the network depending on the input data. As a result, we've limited the network's capacity to memorize the input data without limiting the network's capability to extract features from the data. This allows us to consider the latent state representation and the regularization of the network separately. We can then choose a latent state representation that makes sense given the context of the data, while imposing regularization through the sparsity constraint. There are two main ways to impose this sparsity constraint; both involve measuring the hidden layer activations for each training batch and adding a term to the loss function that penalizes excessive activations. These terms are:

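**L1 Regularization:** The L1 penalty adds the absolute value of each penalized quantity w to the loss. Its gradient is either 1 or -1 except when w=0, which means that L1 regularization always moves w towards zero with the same step size, regardless of the value of w. When w=0, the gradient is taken as zero and no further update is made. In a sparse autoencoder, the penalized quantities are the hidden layer activations.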

**KL-Divergence:** The Kullback-Leibler (KL) divergence penalty term quantifies how much one probability distribution differs from another probability distribution.

The KL divergence between two distributions P and Q is written KL(P || Q), where the "||" operator indicates "*divergence*", here P's divergence from Q.

$$KL(P \parallel Q) = -\sum_{x \in X} P(x) \log \frac{Q(x)}{P(x)}$$

This is the negative sum of the probability of each event in P multiplied by the log of the probability of the event in Q over the probability of the event in P. Equivalently,

$$KL(P \parallel Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}$$

which is the positive sum of the probability of each event in P multiplied by the log of the probability of the event in P over the probability of the event in Q.

When the probability of an event in P is large but the probability of the same event in Q is small, there is a large divergence. When the probability in P is small and the probability in Q is large, there is also a large divergence, but not as large as in the first case.
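The KL divergence formula above can be checked numerically with a short sketch. The two distributions `p` and `q` are made-up examples, and the function assumes that Q assigns non-zero probability to every event that has non-zero probability under P.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) in nats for two discrete distributions over the same events."""
    # Events with P(x) = 0 contribute nothing to the sum.
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)


p = [0.10, 0.40, 0.50]  # hypothetical distribution P
q = [0.80, 0.15, 0.05]  # hypothetical distribution Q

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
print(kl_pq, kl_qp)
```

The two directions give different values, so the KL divergence is not symmetric, and the divergence of a distribution from itself is zero.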

In the simple autoencoder example above, the hidden layer learns an approximation of PCA (principal component analysis), a dimensionality reduction technique. Another way to constrain the representations to be compact is to add a sparsity constraint on the activity of the hidden representations, by adding an `activity_regularizer` to our `Dense` layer:

##### Steps to create a Sparse Autoencoder

- `from tensorflow.keras import regularizers`
- `encoded = Dense(encoding_dim, activation='relu', activity_regularizer=regularizers.l1(10e-5))(input_img)`
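As a rough illustration of what an L1 activity penalty adds to the loss, the sketch below computes the penalty for two hypothetical batches of hidden activations. The helper names are made up, and averaging over the batch is an assumption of this sketch rather than a statement about what Keras does internally.

```python
def l1_activity_penalty(activations, l1=10e-5):
    """L1 penalty on a batch of hidden activations, averaged over the batch."""
    total = sum(abs(a) for row in activations for a in row)
    return l1 * total / len(activations)


def l1_subgradient(a):
    """Slope of |a|: +1 or -1 away from zero, taken as 0 at zero."""
    return (a > 0) - (a < 0)


# Two hypothetical batches of 4-unit codes for 2 samples each
sparse_code = [[0.0, 0.0, 0.9, 0.0], [0.0, 0.7, 0.0, 0.0]]
dense_code = [[0.3, 0.4, 0.9, 0.2], [0.5, 0.7, 0.1, 0.6]]

sparse_penalty = l1_activity_penalty(sparse_code)
dense_penalty = l1_activity_penalty(dense_code)
print(sparse_penalty, dense_penalty)
```

The code with many active units pays a larger penalty, so minimizing the total loss pushes the network toward codes in which only a few units are non-zero. The constant slope returned by `l1_subgradient` is what gives the same step size towards zero regardless of magnitude.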

##### Convolutional Autoencoders:

Whenever our inputs are images, it makes sense to use convolutional neural networks (convnets or CNNs) as encoders and decoders. An autoencoder should be sensitive enough to the inputs to recreate the original observation, but insensitive enough to the training data that the model learns a generalizable encoding and decoding. One way to obtain a generalizable model is to slightly corrupt the input data while still keeping the uncorrupted data as the target output. With this approach, the model cannot simply develop a mapping that memorizes the training data, because the input and the target output are no longer the same. Rather, the model learns a vector field for mapping the input data towards a lower-dimensional manifold. If this manifold accurately describes the natural data, we've effectively "canceled out" the added noise.

##### Steps to create a Convolutional Autoencoder

- The encoder consists of a stack of Conv2D and MaxPooling2D layers.
- `x = Conv2D(20, (3, 3), activation='relu', padding='same')(input_img)`
- `x = MaxPooling2D((2, 2), padding='same')(x)`
- Here `padding='same'` pads the input so that the filter, at the specified stride, fully covers it. With a stride of 1 the output has the same spatial size as the input, which is why it is called SAME padding; with this 2x2 pooling the size is halved and rounded up.
- `x = Conv2D(10, (3, 3), activation='relu', padding='same')(x)`
- `x = MaxPooling2D((2, 2), padding='same')(x)`
- `x = Conv2D(10, (3, 3), activation='relu', padding='same')(x)`
- `encoded = MaxPooling2D((2, 2), padding='same')(x)`

- The decoder consists of a stack of Conv2D and UpSampling2D layers.
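When pairing the pooling layers of the encoder with upsampling layers in a decoder, it helps to trace the spatial size through the network. The sketch below does this arithmetic in plain Python. The 28x28 input and the decoder layout (two upsamplings, one unpadded 3x3 convolution, a final upsampling) are assumptions for illustration, not a layout given in this article.

```python
import math


def same_pool(size, pool=2):
    """Spatial size after pooling with padding='same' and stride equal to pool."""
    return math.ceil(size / pool)


def upsample(size, factor=2):
    """Spatial size after UpSampling2D-style repetition."""
    return size * factor


def valid_conv(size, kernel=3):
    """Spatial size after an unpadded ('valid') convolution with stride 1."""
    return size - kernel + 1


side = 28  # hypothetical input of 28x28 pixels

# Encoder: 'same' convolutions keep the size, each pooling halves it (rounded up)
encoder_sizes = [side]
for _ in range(3):
    encoder_sizes.append(same_pool(encoder_sizes[-1]))

# Decoder: one possible arrangement that returns to the input size
decoder_sizes = [encoder_sizes[-1]]
decoder_sizes.append(upsample(decoder_sizes[-1]))
decoder_sizes.append(upsample(decoder_sizes[-1]))
decoder_sizes.append(valid_conv(decoder_sizes[-1]))
decoder_sizes.append(upsample(decoder_sizes[-1]))

print(encoder_sizes, decoder_sizes)
```

Because the third pooling rounds 7 up to 4, three plain upsamplings would give 32 instead of 28. The unpadded convolution in the middle is one way to compensate for that rounding.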

##### Variational Autoencoders

In Variational Autoencoders, encodings that come from some known probability distribution can be decoded to produce reasonable outputs, even if they are not encodings of actual images. If we sample points from this distribution, we can generate new input data samples: a VAE is a “generative model”.

First, an encoder network turns the input samples x into two parameters in a latent space: z_mean and z_log_sigma. Then, we randomly sample similar points z from the latent normal distribution that is assumed to generate the data, via z = z_mean + exp(z_log_sigma) * epsilon, where epsilon is a random normal tensor. Finally, a decoder network maps these latent space points back to the original input data. The parameters of the model are trained via two loss functions:

- A reconstruction loss forcing the decoded samples to match the initial inputs.
- KL divergence between the learned latent distribution and the prior distribution, acting as a regularization term.
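The sampling step and the KL regularization term described above can be sketched in plain Python. This assumes, as the sampling formula does, that `z_log_sigma` holds the log of the standard deviation, and that the prior is a standard normal distribution; the function names are hypothetical.

```python
import math
import random


def sample_latent(z_mean, z_log_sigma, rng):
    """Draw z = z_mean + exp(z_log_sigma) * epsilon, with epsilon ~ N(0, 1)."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0)
            for m, ls in zip(z_mean, z_log_sigma)]


def kl_to_standard_normal(z_mean, z_log_sigma):
    """KL( N(mean, sigma^2) || N(0, 1) ), summed over the latent dimensions."""
    return -0.5 * sum(1.0 + 2.0 * ls - m * m - math.exp(2.0 * ls)
                      for m, ls in zip(z_mean, z_log_sigma))


rng = random.Random(0)
z_mean = [0.5, -1.0]       # hypothetical encoder outputs for one sample
z_log_sigma = [-0.5, 0.2]

z = sample_latent(z_mean, z_log_sigma, rng)
kl = kl_to_standard_normal(z_mean, z_log_sigma)
print(z, kl)
```

The KL term is zero only when the learned distribution equals the prior (mean 0, log sigma 0), so it pulls the encodings toward a region from which the decoder can later generate new samples.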

##### Steps to create a Variational Autoencoder

- First, in the encoder model, we map the inputs to our latent distribution parameters.
- `x = Input(batch_shape=(batch_size, original_dim))`
- `h = Dense(intermediate_dim, activation='relu')(x)`
- `z_mean = Dense(latent_dim)(h)`
- `z_log_sigma = Dense(latent_dim)(h)`

- Now, we can use these parameters to sample new, similar points in the latent space.
- Next, we map these sampled latent points back to reconstructed inputs.
- Now, we can instantiate 3 models:
  - An end-to-end autoencoder mapping inputs to reconstructions.
  - An encoder mapping inputs to the latent space.
  - A generator that takes points in the latent space and outputs the corresponding reconstructed samples.

- After this, we train the end-to-end model on the training data and validate it on the test data.

##### Denoising Autoencoders

Denoising autoencoders add some noise to the input image and learn to remove it. This prevents the network from simply copying the input to the output without learning features of the data. During training, these autoencoders take a partially corrupted input and learn to recover the original undistorted input. The model learns a vector field that maps the input data towards a lower-dimensional manifold describing the natural data, which cancels out the added noise. In this way, the encoder extracts the most important features and learns a more robust representation of the data.

##### Steps to create a Denoising Autoencoder

- The model creation steps are similar to those of the Convolutional Autoencoder. The only difference is that we generate synthetic noisy digits: we apply a Gaussian noise matrix to the images and clip the results to between 0 and 1.
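The noise step described above can be sketched in a few lines. The `noise_factor` of 0.5, the seed and the flat list of pixel values are assumptions for illustration; in practice the same operation would be applied to whole image arrays.

```python
import random


def add_gaussian_noise(pixels, noise_factor=0.5, seed=0):
    """Add scaled Gaussian noise to pixel values and clip the result to [0, 1]."""
    rng = random.Random(seed)
    noisy = []
    for p in pixels:
        value = p + noise_factor * rng.gauss(0.0, 1.0)
        noisy.append(min(1.0, max(0.0, value)))  # clip to the valid pixel range
    return noisy


clean = [0.0, 0.25, 0.5, 0.75, 1.0]  # hypothetical pixel intensities
noisy = add_gaussian_noise(clean)
print(noisy)
```

During training, the noisy values are the model input and the clean values are the target, so the network is rewarded for removing the noise rather than copying it.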

#### Use Case:

We will build an anomaly detection model for the Cardiotocography (Cardio) data set.

**Dataset:** Here, we will build a model using the Cardiotocography (Cardio) dataset, available in the UCI Machine Learning Repository. It consists of measurements of fetal heart rate (FHR) and uterine contraction (UC) features on cardiotocograms. Expert obstetricians evaluated all the features and classified each example as normal, suspect, or pathologic in the attribute NSP. For outlier detection, the normal class is treated as inliers and the pathologic class as outliers. We have discarded the suspect class.

The dataset is present here:

Following is a good repository of Anomaly Detection datasets:

2126 cardiotocograms (CTGs) were automatically processed and their diagnostic features measured. The CTGs were also classified by three expert obstetricians, and a classification label was assigned to each of them. Classification was done both with respect to a morphologic pattern (A, B, C, ...) and to a fetal state (N, S, P). Therefore, the dataset can be used for either 10-class or 3-class experiments, and we use the 3-class fetal state labels to build our model.


**Data Dictionary:**

Attribute Information:

- LB – FHR baseline (beats per minute)
- AC – No. of accelerations per second
- FM – No. of fetal movements per second
- UC – No. of uterine contractions per second
- DL – No. of light decelerations per second
- DS – No. of severe decelerations per second
- DP – No. of prolonged decelerations per second
- ASTV – Percentage of time with abnormal short term variability
- MSTV – Mean value of short term variability
- ALTV – Percentage of time with abnormal long term variability
- MLTV – Mean value of long term variability
- Width – Width of FHR histogram
- Min – Minimum of FHR histogram
- Max – Maximum of FHR histogram
- Nmax – No. of histogram peaks
- Nzeros – No. of histogram zeros
- Mode – Histogram mode
- Mean – Histogram mean
- Median – Histogram median
- Variance – Histogram variance
- Tendency – Histogram tendency
- CLASS – FHR pattern class code (codes from 1 to 10)
- NSP – Fetal state class code, where N=normal; S=suspect; P=pathologic

Here, we are considering NSP=1 as Inliers and NSP=3 as Outliers

Total number of Inliers=1658

Total Number of Outliers=178

**Anomaly Detection using Autoencoder:**

Download full code :

### Anomaly Detection using Deep Learning Technique

Step 1: Import all the libraries required to build the model.

Step 2: Upload the dataset to Google Colab. If we are using Jupyter Notebook, we can read the dataset directly from our local system using read_csv().

Step 3: Get more information about the dataset.

Step 4: Fill the missing values of the target variable, NSP, with the mode.

Step 5: Get the data distribution for the different values of the target variable 'NSP'. The rest of the features present in the dataset are predictors.

Step 6: Discard all the data points having NSP=2, as they are not of interest to us.

Output shape: (1831, 23)

Step 7: Count the number of data points in the 'Normal' and 'Outlier' classes.

Output: 1655 rows with NSP = 1.0 and 176 rows with NSP = 3.0.

Step 8: Split the data into train and test sets.

Output shapes (train, test): ((1155, 22), (550, 22))

Step 9: Build the model with input_dim equal to the number of predictor columns (22 here) and encoding_dim=12. Use tanh as the activation function for the encoder layer and sigmoid for the decoder layer.

Step 10: Compile the model with optimizer='adam', loss='binary_crossentropy' and metrics=['accuracy'].

Step 11: Fit the model with batch_size=32 and epochs=20.

Step 12: Evaluate the model.

Step 13: Reconstruction Error and True Class

Step 14: Plotting the Confusion Matrix

Step 15: Printing the Classification Report
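Steps 13 to 15 can be sketched without any library as follows. The rows, reconstructions, labels and the threshold of 0.05 are made-up values for illustration, not results from the Cardio model; in practice the reconstructions come from the trained autoencoder and the threshold is chosen from the error distribution.

```python
def reconstruction_errors(inputs, reconstructions):
    """Mean squared error between each row and its reconstruction."""
    return [sum((a - r) ** 2 for a, r in zip(x, x_hat)) / len(x)
            for x, x_hat in zip(inputs, reconstructions)]


def confusion_matrix(true_labels, predicted_labels):
    """Count true/false positives and negatives, with 1 meaning 'outlier'."""
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for truth, pred in zip(true_labels, predicted_labels):
        key = ("t" if truth == pred else "f") + ("p" if pred == 1 else "n")
        counts[key] += 1
    return counts


# Hypothetical scaled rows, their reconstructions, and the true classes
inputs = [[0.2, 0.4], [0.5, 0.5], [0.9, 0.1], [0.3, 0.8], [0.6, 0.6], [0.7, 0.2]]
reconstructions = [[0.21, 0.38], [0.5, 0.52], [0.4, 0.6],
                   [0.32, 0.79], [0.3, 0.9], [0.6, 0.3]]
true_labels = [0, 0, 1, 0, 0, 1]  # 0 = normal (NSP=1), 1 = outlier (NSP=3)

threshold = 0.05  # hypothetical cut-off on the reconstruction error
errors = reconstruction_errors(inputs, reconstructions)
predicted = [1 if e > threshold else 0 for e in errors]

cm = confusion_matrix(true_labels, predicted)
precision = cm["tp"] / (cm["tp"] + cm["fp"])
recall = cm["tp"] / (cm["tp"] + cm["fn"])
print(cm, precision, recall)
```

Lowering the threshold flags more rows as outliers, which tends to raise recall at the cost of precision, so the threshold is the main knob to tune for the use case.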

#### Conclusion:

In this blog, we built an autoencoder model to detect anomalies in the Cardio data, treating it as a two-class problem: Normal and Anomaly. In our next blog, we will learn a new technique for building an anomaly detection model.

