In the previous article, we discussed the foundational concepts behind how a Recurrent Neural Network (RNN) works and what makes it great for predicting sequential data.

In this article, we will implement an RNN using the TensorFlow framework that can generate some awesome poems by learning from the poems of the great poet William Shakespeare. We will also introduce you to GRUs and LSTMs. Before we jump right in, let’s first formulate a strategy for tackling this task.

Training Data:

The first thing we need to take care of is the data. We are going to use a text corpus from Project Gutenberg for educational purposes. You can download the text from here. Once you have downloaded it, open it in a text editor, keep the text from line 253 to line 2867, and remove everything else.

If you want, you can take any other section instead, but keep an eye on the size of the text you choose, as a very large corpus may cause memory issues later on. After that, we will remove anything unnecessary from the data, design our model, and train it on the data. After training, we will use it to generate some awesome poems. So let’s get started.

Model Creation:

STEP 1: Import Necessary Packages
STEP 2: Data Preprocessing

First, we will load the data and see how it looks.


We can see that there are some capital letters, punctuation marks, and escape characters. We don’t need them for now; we only want sentences made of lowercase letters. So first we convert all capital letters to lowercase, and then we remove all the punctuation and escape characters. As this is a word-level generator, we also split the text into a list of words.
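
The cleaning steps described above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not necessarily the exact code used for this article. The function name `clean_and_split` and the two-line sample are our own assumptions, and `string.punctuation` only covers ASCII punctuation, so curly quotes in the real corpus would need extra handling.

```python
import string

def clean_and_split(text):
    """Lowercase the text, delete ASCII punctuation, and split into words."""
    text = text.lower()
    # Build a translation table that deletes every ASCII punctuation character.
    table = str.maketrans("", "", string.punctuation)
    text = text.translate(table)
    # split() with no arguments splits on any whitespace, so escape
    # characters such as "\n" and "\t" disappear at this point.
    return text.split()

# A tiny sample standing in for the full corpus (assumption for illustration).
sample = "From fairest creatures we desire increase,\nThat thereby beauty's rose might never die,"
words = clean_and_split(sample)

print(words)
print("total words:", len(words))
print("unique words:", len(set(words)))
```

Running the same function on the full text and taking `len(words)` and `len(set(words))` is one way to obtain the total and unique word counts.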


Now you can see that the text corpus is split into a list of words. In this corpus, we have 17582 total words and 3178 unique words. Next we will create the training sentences for our model, each having 11 words: 10 words as input and 1 as output. We will create them by sliding a window over the list of words, one word at a time, so that consecutive sentences overlap. For example,

Let’s say from the list of words we want training samples of length 4, then they would look like:

  • from fairest creatures we
  • fairest creatures we desire
  • creatures we desire increase
  • we desire increase that
  • desire increase that thereby …

In our case, we will take 11 words and generate sentences like above.
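
The sliding-window idea can be sketched in a few lines of plain Python. The helper name `make_windows` is hypothetical, and the example uses a window of 4 so that it reproduces the sample sentences listed above; for our model the window size would be 11.

```python
def make_windows(words, window_size):
    """Return every run of `window_size` consecutive words, moving one word at a time."""
    return [
        " ".join(words[i:i + window_size])
        for i in range(len(words) - window_size + 1)
    ]

words = "from fairest creatures we desire increase that thereby".split()
sentences = make_windows(words, window_size=4)

for sentence in sentences:
    print(sentence)
```

With this scheme, a list of n words yields n - window_size + 1 overlapping sentences.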

STEP 3: Tokenize the sentences

Now you can see that our corpus is divided into sentences of 11 words each, which makes all of the data samples the same length. But as we know, machine learning models only understand numeric values, so we will encode the words of the sentences as numbers, with each word assigned its own number.

To do this we will use the predefined Tokenizer class in TensorFlow. First, we create an instance of that class and then fit the tokenizer on the entire dataset, during which it finds all the unique words and assigns each of them a unique integer. After tokenization, we can look up the number assigned to a word using the “word_index” attribute of the tokenizer object. It is a dictionary with words as keys and their encoded numbers as values.
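
To make the idea of a word index concrete, here is a pure-Python stand-in that builds a similar mapping. It is only a sketch of the concept, not the Tokenizer class itself: the helper names `build_word_index` and `encode` are hypothetical, and we assume (as the Keras tokenizer does) that indices start at 1 and that more frequent words get smaller numbers.

```python
from collections import Counter

def build_word_index(sentences):
    """Map each unique word to an integer, starting at 1, most frequent first."""
    counts = Counter(word for s in sentences for word in s.split())
    # most_common() orders by count (highest first); ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(sentence, word_index):
    """Replace every word in the sentence with its integer."""
    return [word_index[word] for word in sentence.split()]

sentences = ["from fairest creatures we", "fairest creatures we desire"]
word_index = build_word_index(sentences)

print(word_index)
print(encode(sentences[0], word_index))
```

Note that no word is ever assigned the number 0, which matters for the vocabulary size used later.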


In our model, we will be using an Embedding layer to learn a vector representation for each word. The tokenizer numbers the words in the vocabulary from index 1 up to the last index (in our case 3178) and leaves index 0 unused, so the embedding needs 3178 + 1 rows to cover every index. That’s why we set the vocabulary size to 1 more than the actual number of unique words.
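
The “+ 1” is easier to see if we picture the embedding as a plain lookup table with one row per index. The sketch below is an illustration only, not how the Embedding layer is implemented: the toy vocabulary, the embedding size of 4, and the random initial values are all assumptions.

```python
import random

word_index = {"fairest": 1, "creatures": 2, "we": 3}  # toy mapping, starts at 1
embedding_dim = 4                                      # assumed size for illustration
vocab_size = len(word_index) + 1                       # + 1 because index 0 is unused

random.seed(0)
# One row of numbers per index 0..vocab_size-1; training would adjust these values.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed(indices):
    """Look up the vector for each word index."""
    return [embedding_table[i] for i in indices]

vectors = embed([word_index["fairest"], word_index["we"]])
print(len(embedding_table), "rows,", len(vectors[0]), "values per word")
```

Without the extra row, looking up the largest index (3 here, 3178 in our corpus) would fall outside the table.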

Now we will split our data into input data X and output data y. We convert the numeric sentences from a list to an array and slice them so that the last element of each goes into y (output) and the rest into X (input). Then we one-hot encode the labels using a TensorFlow utility function called “to_categorical”.
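
Here is a small sketch of the split and the one-hot encoding in plain Python, so that each step is visible. The helper names are hypothetical, and `one_hot` is a simplified stand-in for what “to_categorical” produces; the toy sentences have 4 numbers each instead of 11.

```python
def split_inputs_and_labels(encoded_sentences):
    """Everything except the last number is input; the last number is the label."""
    X = [seq[:-1] for seq in encoded_sentences]
    y = [seq[-1] for seq in encoded_sentences]
    return X, y

def one_hot(labels, num_classes):
    """Turn each integer label into a vector with a single 1.0 at that position."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

encoded = [[4, 1, 2, 3], [1, 2, 3, 5]]  # toy encoded sentences
vocab_size = 5 + 1                      # 5 unique words + 1 for the unused index 0

X, y = split_inputs_and_labels(encoded)
y_one_hot = one_hot(y, vocab_size)

print(X)
print(y)
print(y_one_hot)
```

Each label vector is as long as the vocabulary size, which is why very large vocabularies can become expensive in memory.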

STEP 4: Build and train the model

With that, our data is completely ready. We can now build our model and train it. The first layer will be the Embedding layer, to which we pass the vocabulary size and the length of the input sequences. It also takes the embedding dimension, which specifies how many dimensions are used to represent each word.

Then we will have 2 SimpleRNN layers with 25 cells (units) each. More cells give a layer more capacity, while stacking more layers makes the network deeper. On top of these sits a Dense layer with ‘relu’ activation, followed by the output layer with ‘softmax’ activation. The softmax output layer gives a probability for each class, that is, for each word in the vocabulary.

As we are classifying over the vocabulary, we will use the ‘categorical cross entropy’ loss function and ‘adam’ as the model’s optimization algorithm. We also want to keep a check on the accuracy of the model during training, so we add ‘accuracy’ as a metric and compile our model.
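
To see what the softmax output and the categorical cross entropy loss actually compute, here is a small standalone sketch in plain Python. It is not the model code: the three raw scores stand in for the output layer of a model with a vocabulary of just 3 words, and the numbers are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(one_hot_target, probs, eps=1e-12):
    """Negative log of the probability given to the correct class."""
    return -sum(t * math.log(p + eps) for t, p in zip(one_hot_target, probs))

scores = [2.0, 1.0, 0.1]   # made-up raw outputs for a 3-word vocabulary
probs = softmax(scores)
target = [1.0, 0.0, 0.0]   # the correct next word is class 0
loss = categorical_cross_entropy(target, probs)

print(probs)
print(loss)
```

The loss is small when the model puts a high probability on the correct word and grows as that probability shrinks, which is what the optimizer tries to push down during training.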


Now our model is ready for training. Before that, we set the batch size and the number of iterations (epochs) the model will train for. Then we fit the model with our input data (X) and output data (y).


With that, our model is trained and ready to use. But before using it, let’s take a look at the loss and accuracy of our model over the whole training process.


From the above plots, we can see that the loss of the model decreases and the accuracy increases over the iterations. After 150 iterations we are left with a model that reaches 98.71% accuracy on our training data. Keep in mind that this is measured on the same poems the model was trained on, so it shows how well the model has fit them rather than how well it generalizes. We should definitely save our trained model in case we want to use it in the future. To do so, simply call the ‘save’ method of the model and specify the path. We would also like to save the tokenizer, which holds all of our word-to-number mappings. To save it, use the ‘dump’ function in the pickle module.
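
As a rough sketch of the pickle part, the snippet below saves a mapping to disk and loads it back. A plain dictionary stands in for the fitted tokenizer here, and the file name `tokenizer.pkl` is a hypothetical choice; a temporary directory is used only so that the example cleans up after itself.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted tokenizer (assumption for illustration).
word_index = {"fairest": 1, "creatures": 2, "we": 3}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tokenizer.pkl")  # hypothetical file name

    # Save: 'dump' writes the object to a file opened in binary mode.
    with open(path, "wb") as f:
        pickle.dump(word_index, f)

    # Load: 'load' reads it back when we want to reuse the mappings.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == word_index)
```

Loading the same mappings later matters because the model only understands the numbers it was trained with.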


Now the only thing left is to use our model to generate some sentences seeded by user input. That is, starting from the user’s input, the model predicts the next word and continues to do so until we have the number of words we want.

To do so, we will define a function called ‘poem_generator’ in which we ask for user input to seed our model. Then we define a loop that runs once for each word we want in our poem. Inside the loop, we first encode the input with the tokenizer, just as we did before training. Then we pad it in case its length doesn’t match the length of our model input, which is 10. If the input is shorter than 10 words, zeros are added as padding, and if it is longer than 10, it is clipped to a length of 10.
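
The padding and clipping rule can be sketched as a tiny helper. The name `pad_or_clip` is hypothetical, and we assume that zeros are added at the front and that extra words are dropped from the front, which matches the default behaviour of Keras’s `pad_sequences` and keeps the most recent words.

```python
def pad_or_clip(sequence, length=10):
    """Return exactly `length` numbers: zero-pad short inputs, clip long ones."""
    recent = list(sequence)[-length:]       # keep at most the last `length` items
    padding = [0] * (length - len(recent))  # zeros go in front (assumed 'pre' padding)
    return padding + recent

short_input = [5, 6, 7]
long_input = list(range(1, 13))  # 12 numbers, two more than the model accepts

print(pad_or_clip(short_input))
print(pad_or_clip(long_input))
```

Using 0 for padding is safe because the tokenizer never assigns 0 to a real word.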

After that, we use the model to predict the number of the next word from the encoded input. Then we search through the tokenizer’s mapping to find the word that corresponds to that number. Once we have our word, we print it out.

STEP 5: Use model to generate sentences
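
The generation loop described above might look roughly like the sketch below. To keep it runnable without a trained model, it differs from the real ‘poem_generator’ in a few labelled ways: the seed is passed as an argument instead of being read from the user, the words are returned instead of printed, and `toy_predict` is a made-up stand-in for the model’s prediction that simply cycles through a 4-word vocabulary.

```python
def generate(seed_text, n_words, word_index, predict_probs, input_length=10):
    """Repeatedly predict the next word and feed it back in as input."""
    index_word = {i: w for w, i in word_index.items()}  # reverse lookup
    words = seed_text.lower().split()
    generated = []
    for _ in range(n_words):
        # Encode the known words, then clip or zero-pad to the model's input length.
        encoded = [word_index[w] for w in words if w in word_index][-input_length:]
        encoded = [0] * (input_length - len(encoded)) + encoded
        probs = predict_probs(encoded)
        # Pick the index with the highest probability.
        next_id = max(range(len(probs)), key=lambda i: probs[i])
        next_word = index_word.get(next_id, "")
        generated.append(next_word)
        words.append(next_word)
    return " ".join(generated)

word_index = {"from": 1, "fairest": 2, "creatures": 3, "we": 4}

def toy_predict(encoded):
    """Stand-in for the model: always predicts the word after the last one."""
    probs = [0.0] * (len(word_index) + 1)
    probs[encoded[-1] % len(word_index) + 1] = 1.0
    return probs

poem = generate("from fairest", 4, word_index, toy_predict)
print(poem)
```

With a trained model, `predict_probs` would wrap the model’s prediction for a single padded input, and the rest of the loop would stay the same.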

Wonderful, our model is now able to write poems. It makes some mistakes, but we can reduce them by training on more data or using more powerful models.

Now that we have successfully implemented a word-level sentence generator, let’s look at ways to improve the model so that it can handle longer sequences of data. To build this sentence generator we used a SimpleRNN, which returned good results. But as we discussed in the previous article, longer sentences may lead to vanishing or exploding gradients, which hamper the performance of a SimpleRNN. Also, for longer sequences, the model must retain information that may be useful for predictions much later on. For example,

“dog who ate …(long sequence)…, is now sleeping”

“dogs who ate …(long sequence)…, are now sleeping”

As you can see, the word ‘dog’ is responsible for the word ‘is’ and the word ‘dogs’ is responsible for the word ‘are’. So the model must learn to remember such information long enough to predict such words.

In such cases, models like GRUs and LSTMs perform much better than simple RNNs. But what makes them perform so well on long sequences?

To answer that question let’s take a look at their architecture.

Recurrent Neural Networks: 
(Architecture of an RNN cell)

RNNs are good at keeping information for a short time, but when the sequences become longer, it becomes very difficult for them to remember it. To cope with this, the more powerful sequence models use ‘gates’, which decide what information is to be remembered and what is to be forgotten. Let’s take a look at that in the LSTM model.
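
A toy example helps to show how quickly a simple RNN can lose track of early inputs. The sketch below uses a single cell with scalar weights, which is a big simplification of a real SimpleRNN layer, and the weight values are hand-picked assumptions rather than learned ones. Only the first input is non-zero, so whatever remains in the hidden state is the “memory” of that first word.

```python
import math

def rnn_step(x_t, h_prev, w_x=1.0, w_h=0.5, b=0.0):
    """One step of a scalar RNN cell: h_t = tanh(w_x * x_t + w_h * h_prev + b)."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

inputs = [1.0] + [0.0] * 19  # a signal at step 1, then 19 steps of nothing
h = 0.0
states = []
for x_t in inputs:
    h = rnn_step(x_t, h)
    states.append(h)

print("after step 1: ", states[0])
print("after step 5: ", states[4])
print("after step 20:", states[-1])
```

With these weights the trace of the first input shrinks at every step until almost nothing is left, which is the kind of forgetting that gates are designed to prevent.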

Long Short Term Memory:
(Architecture of LSTM)

The important part of an LSTM is the cell state, along with the gates that act on it. The cell state can be considered the memory of the network: it retains information that is important for the future and drops it once its job is done.

The addition and removal of information are controlled by the gates of the network. The forget gate controls which information in the cell state to keep and which to remove or forget. The input gate is used to update the cell state: it passes the previous hidden state and the current input through a sigmoid function, and the output of the sigmoid determines which values to update.

If the output is closer to 0, the value is not important, and if it’s closer to 1, it is important for the update. The new cell state holds the information that remains after passing through the ‘forget gate’ and the ‘input gate’. Finally, the output gate controls what the next hidden state should be, and that hidden state is passed on to the next time step and to the next layer.

With the combination of all these gates, the LSTM model becomes very powerful in retaining information for a long time.
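
The gate logic can be written out for a single LSTM cell with scalar weights. This is a simplified sketch of the standard LSTM equations, not the TensorFlow implementation. The parameter names and values are hand-picked assumptions: the forget gate is biased to stay open and the input gate to stay shut, so the cell should hold on to its stored value.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; `p` holds the (hypothetical) weights."""
    f = sigmoid(p["wf_x"] * x_t + p["wf_h"] * h_prev + p["bf"])          # forget gate
    i = sigmoid(p["wi_x"] * x_t + p["wi_h"] * h_prev + p["bi"])          # input gate
    c_new = math.tanh(p["wc_x"] * x_t + p["wc_h"] * h_prev + p["bc"])    # candidate values
    c_t = f * c_prev + i * c_new                                         # updated cell state
    o = sigmoid(p["wo_x"] * x_t + p["wo_h"] * h_prev + p["bo"])          # output gate
    h_t = o * math.tanh(c_t)                                             # new hidden state
    return h_t, c_t

# Hand-picked values: forget gate ~1 (keep memory), input gate ~0 (ignore new input).
params = {"wf_x": 0.0, "wf_h": 0.0, "bf": 10.0,
          "wi_x": 0.0, "wi_h": 0.0, "bi": -10.0,
          "wc_x": 1.0, "wc_h": 0.0, "bc": 0.0,
          "wo_x": 0.0, "wo_h": 0.0, "bo": 0.0}

h, c = 0.0, 1.0  # start with the value 1.0 stored in the cell state
for _ in range(50):
    h, c = lstm_step(0.0, h, c, params)

print("cell state after 50 steps:", c)
```

Because the cell state is updated by a gated addition rather than being squashed at every step, the stored value survives 50 steps almost untouched; in a trained network the gates learn when to open and close.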

Gated Recurrent Units:

(Architecture of GRU)

The GRU is quite similar to the LSTM, but it has fewer gates and no separate cell state. It has two gates, the ‘reset gate’ and the ‘update gate’. The update gate decides how much of the previous hidden state (h(t-1)) to carry over to the next step. The reset gate decides how much of the previous information to forget when computing the new candidate state. As a GRU performs fewer operations than an LSTM, it is usually somewhat faster to train.
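
Here is the same kind of scalar sketch for a GRU cell. Again the weights are hand-picked assumptions, and the code follows one common convention in which an update gate close to 1 keeps the previous hidden state; some descriptions swap the roles of z and 1 - z, so treat this as an illustration rather than the definitive formulation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x_t, h_prev, p):
    """One step of a scalar GRU cell; `p` holds the (hypothetical) weights."""
    r = sigmoid(p["wr_x"] * x_t + p["wr_h"] * h_prev + p["br"])            # reset gate
    z = sigmoid(p["wz_x"] * x_t + p["wz_h"] * h_prev + p["bz"])            # update gate
    h_new = math.tanh(p["wh_x"] * x_t + p["wh_h"] * (r * h_prev) + p["bh"])  # candidate state
    return z * h_prev + (1.0 - z) * h_new  # blend of old state and candidate

base = {"wr_x": 0.0, "wr_h": 0.0, "br": 0.0,
        "wz_x": 0.0, "wz_h": 0.0, "bz": 0.0,
        "wh_x": 1.0, "wh_h": 0.0, "bh": 0.0}

keep = dict(base, bz=10.0)      # update gate ~1: carry the old state forward
replace = dict(base, bz=-10.0)  # update gate ~0: overwrite with the candidate

h_keep = gru_step(1.0, 0.8, keep)
h_replace = gru_step(1.0, 0.8, replace)

print("update gate open:  ", h_keep)
print("update gate closed:", h_replace)
```

Compared with the LSTM sketch, there is one fewer gate and no separate cell state to track, which is where the savings in computation come from.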

Both models are very powerful and perform very well in many applications. They have been used with great success in applications like image captioning, speech recognition, sentiment analysis, etc.

I hope this article was helpful! In the code, you can change the length of the sentences and swap in LSTMs and GRUs to compare their performance with SimpleRNNs. Download the source code from below: