“article am I an reading”… what? It doesn’t make any sense at all. But what if I ask you to rearrange the words so that they do make sense? Without much effort you will quickly answer “I am reading an article”. Easy, isn’t it? But how did you do that? You probably didn’t spend time recalling the rules of English grammar; you predicted the correct word order just by looking at the words. What if we could build a model that does not need English grammar hard-coded into it, but can learn how sentences are written? Yup, there is a type of model that can do this. They are called Recurrent Neural Networks, and this is the first part of a series of articles discussing the foundational concepts of this topic!

Because they handle sequential data so well, they are used not only for sentences but also in speech recognition (converting audio to text), sentiment analysis, machine translation, predicting future stock market values, and more. They are also good at generating sequential data like text and audio. We can even train an RNN to generate text that mimics the writing style of a poet, or music that mimics the playing style of a guitarist.

In this article we will be diving deep into Recurrent Neural Networks, what is so special about them, and how they are trained. So, let’s get started!

### What is a Recurrent Neural Network or RNN?

Recurrent Neural Networks (RNNs) are a modification of ordinary Artificial Neural Networks (ANNs). An ANN is constrained to take a fixed-size vector (like an image) as input and return a fixed-size vector (like the probabilities of different classes). Say, it may accept only a vector of 3 elements and return only a vector of 1 element. But what if we want a sequence of inputs or outputs, or we are working with time series data? In that case we cannot simply use an ANN, because it expects a fixed-size vector while our input is sequential. We need a model that can remember the earlier part of the sequence and use it to make future predictions. This is what an RNN can do and an ANN can’t!

If we look up the definition of the term “recurrent”, it means “occurring often or repeatedly”. A Recurrent Neural Network is accordingly a neural network that repeats itself: it is designed so that the same network, with the same weights, is applied once for every element of the sequence.

Now you might think: what if I design a single neural network and simply use it multiple times in a loop?

That alone won’t fix the problem of dealing with sequential data, where the occurrence of an element is determined by the elements before it. An ANN does not consider what it computed for previous inputs while processing the current input. That is what RNNs specialize in: the decisions of an RNN are influenced by the past.

To understand it much better, let’s look at the architecture of an RNN.

First, let’s understand the figure. The green blocks are the inputs to the model, the yellow blocks are the hidden layers, and the red ones are the outputs. The weights connecting the input layer to the hidden layer are labeled ‘U’, the weights connecting the hidden layer to the output layer are labeled ‘W’, and the weights connecting the previous hidden layer to the current hidden layer are labeled ‘V’. The subscripts t-1, t, and t+1 on X represent the time step. At time ‘t’ the model takes input not only from Xt but also from the hidden state at time ‘t-1’. This lets the model use information from previous computations while dealing with the current input, and it is what makes an RNN different from an ANN.

#### The Equations

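For a simple ANN with an input layer, a hidden layer, and an output layer, the equations of the model can be written as follows, where $f$ and $g$ stand for whatever activation functions the hidden and output layers use:

$$Z_h = U^T x, \qquad h = f(Z_h), \qquad \hat{y} = g(W^T h)$$

So you can clearly see that the hidden layer, or hidden state, depends only on the current input. (Here the ‘T’ in the superscript means the matrix is transposed.) But if we look at the equations of an RNN model,

$$Z_h = U^T x_t + V^T h_{t-1}, \qquad h_t = f(Z_h), \qquad \hat{y}_t = g(W^T h_t)$$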

You can see that a new term, $V^T h_{t-1}$, is added to $Z_h$, where $h_{t-1}$ is the previous hidden state and $V$ is the weight matrix between the previous hidden state and the current one. This shows that the current hidden state depends on the previous hidden state, which in turn depends on the one before it. So at every time step, the hidden state depends on all the preceding hidden states. To understand this concept better, let’s take an example.
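Before moving to that example, the difference between the two hidden-state computations can be sketched in a few lines of NumPy. This is only an illustration under assumptions that are not in the article: the activation is taken to be `tanh`, the layer sizes are arbitrary, and the function names `ann_hidden` and `rnn_hidden` are made up for this sketch.

```python
import numpy as np

def ann_hidden(x, U):
    """Hidden state of a plain feed-forward layer: depends only on x."""
    return np.tanh(U.T @ x)            # tanh is an assumed activation

def rnn_hidden(x, h_prev, U, V):
    """Hidden state of a recurrent layer: adds the V^T h_{t-1} term."""
    return np.tanh(U.T @ x + V.T @ h_prev)

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3         # illustrative sizes only
U = rng.normal(scale=0.1, size=(input_size, hidden_size))
V = rng.normal(scale=0.1, size=(hidden_size, hidden_size))

x = np.array([0.0, 1.0, 0.0, 0.0])     # the current input
h_zero = np.zeros(hidden_size)         # "no past" hidden state
h_other = np.array([0.5, -0.5, 0.2])   # some non-zero past hidden state

# With an all-zero previous state the recurrent term disappears,
# so the RNN step reduces to the plain ANN layer.
same_as_ann = np.allclose(rnn_hidden(x, h_zero, U, V), ann_hidden(x, U))

# With a non-zero previous state the same input gives a different result.
changes_with_past = not np.allclose(rnn_hidden(x, h_other, U, V), ann_hidden(x, U))

print(same_as_ann, changes_with_past)
```

The same input `x` produces a different hidden state as soon as the previous hidden state is non-zero, which is the dependence on the past that the extra term introduces.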

Let’s take the sentence “please stay at home”, and say we want to train an RNN to predict the word “home” given the words “please”, “stay”, “at”. The expected RNN model will look like the following,

Here we provide the data to the model sequentially. The model is supposed to predict “stay” when it is given the input “please”, “at” when given “stay”, and “home” when given “at”. As the model learns, it picks up the pattern that the word “home” comes after the words “please”, “stay”, and “at”. Next we will see how the model learns to predict these words. Just like the training of any neural network, the first step is forward propagation.

#### Forward Propagation

Before we go into forward propagation, we need to convert the string data to numeric data, since a deep learning model performs mathematical operations and those require numeric values. One way to do so is to one-hot encode the data. Assuming the vocabulary is ordered as the words appear in the sentence, our data becomes:

– “please” → [1, 0, 0, 0]

– “stay” → [0, 1, 0, 0]

– “at” → [0, 0, 1, 0]

– “home” → [0, 0, 0, 1]

Just like in the training of an ANN, we first initialize the weights of the model. You can use any random initialization method you want. In ‘figure_3’ you might have noticed the hidden state ‘h0’. It is there because we need a previous hidden state to compute the very first hidden state, but at the very first time step we have no knowledge of past states. So when we start the model, we initialize ‘h0’ as a vector of 0’s and use it to calculate the first hidden state. But can’t we just assign random values to this one as well? That is generally not what we want: there is no information before the first element of the sequence, and a vector of zeros says exactly that.

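At time t = 1 we have,

$$Z_1 = U^T x_1 + V^T h_0, \qquad h_1 = f(Z_1)$$

At time t = 2 we have,

$$Z_2 = U^T x_2 + V^T h_1 = U^T x_2 + V^T f(U^T x_1 + V^T h_0), \qquad h_2 = f(Z_2)$$

where $f$ is the activation function of the hidden layer.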

From $Z_2$ you can clearly see that, as we forward propagate through time, the model considers the previous input and hidden state along with the current values. This is what lets the model remember information from the past. You can derive $Z_3$ in the same way. Now that we’ve seen how forward propagation works, let’s see how the model learns with backpropagation, or as it is called here, backpropagation through time.
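The forward pass described in this section can be sketched as follows. Treat it as a rough illustration rather than a reference implementation: the vocabulary order, the hidden size of 5, the `tanh` and softmax activations, and the helper names `one_hot`, `softmax`, and `forward` are all assumptions made for this sketch. The weights are random and untrained, so the predicted words are meaningless at this point.

```python
import numpy as np

vocab = ["please", "stay", "at", "home"]        # assumed vocabulary order
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Encode a word as a vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

def softmax(z):
    """Turn raw scores into probabilities (shifted for numerical stability)."""
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(inputs, U, V, W):
    """Run the RNN over a sequence, starting from h0 = 0."""
    h = np.zeros(V.shape[0])                    # h0: no past information yet
    hidden_states, outputs = [], []
    for x in inputs:
        z = U.T @ x + V.T @ h                   # Z_t = U^T x_t + V^T h_{t-1}
        h = np.tanh(z)                          # assumed hidden activation
        hidden_states.append(h)
        outputs.append(softmax(W.T @ h))        # assumed output activation
    return hidden_states, outputs

rng = np.random.default_rng(42)
hidden_size = 5                                 # illustrative size only
U = rng.normal(scale=0.1, size=(len(vocab), hidden_size))
V = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W = rng.normal(scale=0.1, size=(hidden_size, len(vocab)))

words = ["please", "stay", "at"]
inputs = [one_hot(word) for word in words]
hidden_states, outputs = forward(inputs, U, V, W)

for word, y_hat in zip(words, outputs):
    print(word, "->", vocab[int(np.argmax(y_hat))], y_hat.round(3))
```

Each output is a probability distribution over the four words, and every hidden state after the first one is computed from both the current word and the hidden state before it.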

#### Backpropagation Through Time

Training an RNN follows much the same steps as training any ANN. First we compute the forward pass and calculate the loss, then we find the gradients of the loss function with respect to all the weights of the model, and finally we use those gradients to update the weights. The real difference in an RNN comes from the recurrent nature of the weights, which link the time steps together. As you can see in ‘figure_4’, the loss L3 is computed and its gradient is calculated with respect to the third hidden state ‘h3’. But the gradient of ‘h3’ with respect to the previous hidden state ‘h2’ is also calculated, which measures the rate of change of ‘h3’ with respect to a change in ‘h2’. So the loss incurred by the model at the current time step also flows back to all the previous time steps connected to it.

If you look carefully at ‘figure_4’, you will see that the computation of the gradients travels back in time, from ‘h3’ to ‘h0’, each of which represents the hidden state at a given time step. So we are propagating from the state at the present time back to the state at t = 0. That is why it is called backpropagation through time.

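The mathematics behind computing the gradients for all the weights is quite long, so to keep it short let’s just focus on the weight W. The loss at time t = 3 can be represented as,

$$L_3 = -\sum_i y_{3,i} \log \hat{y}_{3,i}$$

Here we are using the cross-entropy loss function, where $y_3$ is the one-hot target and $\hat{y}_3$ is the predicted distribution at t = 3. To compute the gradient of $L_3$ with respect to the weight W, we apply the chain rule. Writing $Z_{y,3} = W^T h_3$ for the input to the output activation (a symbol introduced here only for convenience), we need to compute,

$$\frac{\partial L_3}{\partial W} = \frac{\partial L_3}{\partial \hat{y}_3} \cdot \frac{\partial \hat{y}_3}{\partial Z_{y,3}} \cdot \frac{\partial Z_{y,3}}{\partial W}$$

Similarly, we compute the gradients of $L_2$ and $L_1$ and sum them up to get the total gradient of L with respect to W:

$$\frac{\partial L}{\partial W} = \sum_{t=1}^{3} \frac{\partial L_t}{\partial W}$$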

Now that we have our gradients, the last step is to update the weights of the model. To do so, we choose a learning rate $\alpha$, multiply it with the gradient, and subtract the result from the weight matrix:

$$W \leftarrow W - \alpha \frac{\partial L}{\partial W}$$

We use the same method to update all the other weights as well, and we repeat all these steps in a loop until the model is optimized. This completes the training of an RNN model, but it is not as easy as it sounds once the unrolled model becomes deeper, that is, once it has more time steps in it.
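Before looking at what goes wrong with longer sequences, here is a minimal sketch of the gradient and update step for W on the “please stay at home” example. It rests on several assumptions that are not stated in the article: a softmax output layer, a `tanh` hidden layer, random weights, a learning rate of 0.1, and made-up helper names. With softmax and cross-entropy, the per-step gradient with respect to W works out to the outer product of $h_t$ and $(\hat{y}_t - y_t)$, and the sketch checks one entry of it against a finite-difference estimate. Only W is updated here; U and V are left fixed to keep the example short.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(xs, U, V, W):
    """Return the hidden state and prediction for every time step."""
    h = np.zeros(V.shape[0])                    # h0 is a vector of zeros
    hs, y_hats = [], []
    for x in xs:
        h = np.tanh(U.T @ x + V.T @ h)
        hs.append(h)
        y_hats.append(softmax(W.T @ h))
    return hs, y_hats

def total_loss(xs, ys, U, V, W):
    """Cross-entropy summed over the time steps: L = L1 + L2 + L3."""
    _, y_hats = forward(xs, U, V, W)
    return -sum(np.log(y_hat @ y) for y_hat, y in zip(y_hats, ys))

def grad_W(xs, ys, U, V, W):
    """dL/dW: sum over time of outer(h_t, y_hat_t - y_t)."""
    hs, y_hats = forward(xs, U, V, W)
    return sum(np.outer(h, y_hat - y) for h, y_hat, y in zip(hs, y_hats, ys))

rng = np.random.default_rng(0)
vocab_size, hidden_size = 4, 5                  # illustrative sizes only
U = rng.normal(scale=0.5, size=(vocab_size, hidden_size))
V = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W = rng.normal(scale=0.5, size=(hidden_size, vocab_size))

eye = np.eye(vocab_size)
xs = [eye[0], eye[1], eye[2]]                   # "please", "stay", "at"
ys = [eye[1], eye[2], eye[3]]                   # "stay", "at", "home"

dW = grad_W(xs, ys, U, V, W)

# Sanity check: compare one entry with a central finite difference.
eps = 1e-5
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
numeric = (total_loss(xs, ys, U, V, W_plus)
           - total_loss(xs, ys, U, V, W_minus)) / (2 * eps)

# Gradient descent step on W only.
learning_rate = 0.1
loss_before = total_loss(xs, ys, U, V, W)
W_updated = W - learning_rate * dW
loss_after = total_loss(xs, ys, U, V, W_updated)

print("analytic vs numeric:", dW[0, 0], numeric)
print("loss before / after:", loss_before, loss_after)
```

A full training loop would compute the corresponding gradients for U and V as well, which is where the chain of hidden-state derivatives discussed next comes in.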

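The number of states in an RNN can vary with the length of the sequence. Suppose we have a model with 50 hidden states; unrolled through time, it is effectively a very deep network. Previously we learned how to compute the gradient of one state with respect to the previous one. So to compute the gradient of the last state, say ‘h50’, with respect to ‘h0’, we have to calculate,

$$\frac{\partial h_{50}}{\partial h_0} = \frac{\partial h_{50}}{\partial h_{49}} \cdot \frac{\partial h_{49}}{\partial h_{48}} \cdots \frac{\partial h_1}{\partial h_0}$$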

Now that’s a lot of calculations, and it is very troublesome for our model. Why? Let’s assume a situation where all 50 gradients on the right-hand side of the equation are equal to 0.9. Then the value of

$$\frac{\partial h_{50}}{\partial h_0} = 0.9^{50} \approx 0.005$$

which is far smaller than what we started with. As the gradient travels further back through the network, its value diminishes and it becomes almost irrelevant to the model. This reduces how much the earliest time steps, the ones reached last during backpropagation, can learn. This is called the problem of Vanishing Gradients. It is like sitting on the last bench of a very long hall: we might not be able to hear the teacher, and so we are unable to learn.

Now let’s think of another situation: instead of 0.9, let’s say our gradients are all equal to 1.9. Then the value of

$$\frac{\partial h_{50}}{\partial h_0} = 1.9^{50} \approx 8.7 \times 10^{13}$$

which is extremely large for our model and might cause it to crash or produce very unexpected outcomes. This is called the problem of Exploding Gradients.
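The two situations above are easy to reproduce numerically. The short sketch below multiplies 50 identical per-step gradients together, exactly as in the simplified scenario (real per-step gradients are matrices and differ from step to step, so this is only the idealized case from the text).

```python
steps = 50
per_step_gradients = {"vanishing": 0.9, "exploding": 1.9}

products = {}
for name, gradient in per_step_gradients.items():
    product = 1.0
    for _ in range(steps):
        product *= gradient        # one factor per time step in the chain
    products[name] = product
    print(f"{name}: {gradient}^{steps} = {product:.3e}")
```

The first product shrinks to a few thousandths while the second grows beyond ten trillion, even though the two per-step values differ by only 1.0.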

So a long sequence of data is very troublesome for an RNN. RNNs are not good at retaining information over long spans and do not perform well when given long sequences. On top of that, as the sequence length increases, the computation becomes slower and more resource intensive.

To address these problems, there are modified RNN models that handle them much more effectively, namely:

– LSTM (Long Short Term Memory)

– GRU (Gated Recurrent Units)

They are advanced modifications of RNNs. They go a long way toward mitigating the vanishing gradient problem, and they are much better at retaining information over long spans. Even so, they tend to perform well with sequences of hundreds of steps, not thousands.

### Conclusion

RNNs are powerful models with many real-world use cases. This article is intended as an introduction to the foundational concepts of RNNs. In the next article, I will provide a code example for an RNN use case and also discuss some alternatives to RNNs. I hope you found this post useful, and I will be happy to hear your feedback! Happy Learning!

Please see my other article, where I have used an RNN for sentence generation: Link