Let’s delve into how computers deal with human language.
Natural Language Processing (NLP) is the part of AI (Artificial Intelligence) that deals with human language. We want computers to grasp the meaning behind words, which is not straightforward since computers only deal with sequences of zeros and ones.
Sentiment analysis is one of the essential tasks in NLP. We will build a sentiment analyzer using a CNN (Convolutional Neural Network). It takes a tweet as input and outputs whether the tweet conveys a positive or negative sentiment.
CNN was first used to process images. For example, at the very beginning, it was used to recognize digits that were handwritten and also to recognize the contents of an image. Researchers soon realized that CNN could also be a powerful tool for language processing. Classification tasks are easily addressed with CNNs.
First, we will go into the origins of CNNs: how they work on images and what their components are. Then we will see how text relates to images and how CNNs are adapted to text. A CNN takes an image as input and outputs a label (the image class).
Steps in CNN:
The first and most crucial step is convolution. Convolution applies many feature detectors that scan the entire image and produce a list of feature maps, which tell us whether and where a specific feature appears in the image.
The next operation is max pooling, where we apply the maximum function over small windows of each feature map to make the maps smaller and reduce the cost of the model.
Then comes the straightforward flattening phase where we make a massive vector out of all the maps which are matrices.
We end with a feed-forward neural network which learns from the feature extraction phase.
Convolution consists of feature detectors applying over the entire image. We go through patches in the image that have the same size as that of our feature detector and perform an element-wise multiplication and sum all the products. This result might not make any sense for us humans, but when this is connected with a feed-forward neural network, CNN can achieve great results with it.
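The patch-by-patch multiplication and sum described above can be sketched in NumPy as follows. This is a minimal illustration, not code from the original notebook: the function name convolve2d_valid, the toy image and the hand-picked detector are all assumptions made for the example (in a real CNN the detector values are learned).

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            # Element-wise multiplication of patch and detector, then sum the products.
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# Toy 4x4 "image" with a vertical edge between columns 1 and 2 (made-up data).
image = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# Hand-picked 2x2 detector that responds to a dark-to-bright vertical edge.
kernel = np.array([
    [-1, 1],
    [-1, 1],
], dtype=float)

feature_map = convolve2d_valid(image, kernel)
print(feature_map)  # the middle column is 2, the others are 0
```

The feature map is non-zero only where the patch straddles the edge, which is the "whether and where" information a feature detector provides.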
A CNN consists of a lot of feature detectors, and each detector will provide a feature map. The feature detector contains randomly initialized values, and those are the variables learned throughout the process of training.
After comparing the results after one cycle of prediction with the actual correct labels, the model will tune those numbers to learn to detect more useful and accurate features.
Next, in the max-pooling operation, we don’t take the global maximum of a feature map but the maximum within each window of the pooling size. Max pooling reduces size and computation. The idea is that we need not know everything about where a feature appeared in the image. The most important thing is to know whether it appears or not. But we still need to keep some locality in this process.
We still want to keep the relations between the positions of the features. There exists a tradeoff between getting rid of the information we don’t need and also making the map smaller so that we can improve the process and reduce the costs. We apply this to several feature maps, and we get the same number of pooled maps.
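As a rough sketch of the pooling step, assuming non-overlapping 2x2 windows (the window size, the function name max_pool2d and the sample values are assumptions for illustration):

```python
import numpy as np

def max_pool2d(feature_map, pool_size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h, w = feature_map.shape
    out_h, out_w = h // pool_size, w // pool_size
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[
                i * pool_size:(i + 1) * pool_size,
                j * pool_size:(j + 1) * pool_size,
            ]
            # Keep only the strongest response inside this window.
            pooled[i, j] = window.max()
    return pooled

# Made-up 4x4 feature map.
feature_map = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
], dtype=float)

pooled = max_pool2d(feature_map)
print(pooled)  # 2x2 map: 4 and 2 on the first row, 2 and 8 on the second
```

The pooled map is four times smaller, yet each value still tells us in which quarter of the map the strong response occurred, which is the locality the text refers to.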
Flattening converts the pooled maps into a one-dimensional vector that serves as input to the feed-forward neural network. We transform each pooled map into a vector and join them end to end. Each position in the vector still corresponds to a fixed location in a pooled map, so the locality information is kept.
We can add as many hidden layers as required in the feed-forward neural network. The higher the output value for a class, the higher the probability that the input belongs to that class.
Similar to the way we looked for features in the images, we can look for features in the text. For that, a sentence should be converted into a matrix as images were matrices.
The easiest way of representing a sentence as a vector is also the most inefficient one. It’s called one-hot encoding. Vocabulary consists of all the unique words in our corpus (text dataset). Each vector is of vocabulary size. All the entries in each vector will be all zeroes except for one entry which will have a value of one. It doesn’t convey any relation between words.
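A small sketch of one-hot encoding on a made-up two-sentence corpus (the corpus and the helper name one_hot are assumptions for illustration):

```python
import numpy as np

# Made-up corpus; the vocabulary is the set of unique words in it.
corpus = ["i love this movie", "i hate this movie"]
vocab = sorted({word for sentence in corpus for word in sentence.split()})
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Vector of vocabulary size: all zeros except a single one."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

love, hate = one_hot("love"), one_hot("hate")
print(vocab)
print(love, hate)
# The dot product of two different one-hot vectors is always 0,
# so the encoding says nothing about how related two words are.
print(love @ hate)
```

With a real corpus the vectors would have tens of thousands of entries, almost all of them zero, which is why this representation is so inefficient.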
Reducing the size of the vectors gives the model less freedom, and that forces it to encode relations between words. Instead of being binary, each entry of the vector is a real number learned during training. This dense representation is called a word embedding. Words with similar meaning end up closer together in the embedding space.
Image matrices carry meaning both from left to right and from top to bottom. A sentence matrix, with one row per word, only carries meaning from top to bottom, because moving from left to right just walks through the dimensions of a single word vector. So a 2D convolution doesn’t make sense here. Instead, the width of the filter (kernel) is set equal to the embedding dimension, and the filter slides along the word axis only. This is a 1D convolution: feature extraction along one dimension.
After the convolution, we take the global maximum of each feature map, since the position of a feature in a sentence is not as important as in an image. We only care whether the feature is present or not. Unlike with images, filters of several different sizes are used for the convolution, so that the model looks at word groups of different lengths.
Flattening phase is not required as the output of the convolution is a vector and not a matrix.
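To make the text version concrete, here is a NumPy sketch of a 1D convolution over a sentence matrix followed by global max pooling. The sentence length, embedding size and random values are assumptions for illustration; in the real model both the embeddings and the filters are learned.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, emb_dim = 6, 4
# Made-up sentence matrix: one row per word, one column per embedding dimension.
sentence = rng.normal(size=(seq_len, emb_dim))

def conv1d_global_max(sentence, kernel):
    """Slide a (k, emb_dim) filter down the word axis, then keep the largest response."""
    k = kernel.shape[0]
    n_positions = sentence.shape[0] - k + 1
    activations = np.array([
        np.sum(sentence[t:t + k] * kernel) for t in range(n_positions)
    ])
    return activations, activations.max()

features = []
for k in (2, 3, 4):  # bigram, trigram and fourgram filters
    kernel = rng.normal(size=(k, emb_dim))  # filter width equals the embedding dimension
    activations, pooled = conv1d_global_max(sentence, kernel)
    print(k, activations.shape, pooled)
    features.append(pooled)

# One number per filter: already a vector, so no flattening step is needed.
features = np.array(features)
print(features.shape)
```

Each filter produces a vector whose length depends on the filter size, and global max pooling reduces it to a single number regardless of sentence length.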
We are done with the theory, so let’s jump into the implementation. The implementation will be done in Google Colab. It offers free GPU and TPU instances, which speed up code execution. You will also learn how to use Google Colab in this post.
Download the code!
tweets_data.csv contains the data with sentiments of all tweets.
Stage 1: Importing dependencies.
Stage 2: Data preprocessing
The first step is to load our files. Google Drive is a convenient way of accessing files in Google Colab. The mount method of drive connects your Drive to Colab.
‘/content/drive’ is the path at which Drive is mounted.
When you run the mount call, it asks you to go to a URL for authentication. Once you land on the page, copy the code shown there, paste it in the textbox titled ‘Enter your authorization code’ and hit enter.
Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly
Enter your authorization code: ··········
Mounted at /content/drive
Next, let’s create a list ‘cols’ that holds the column names of our dataset.
sentiment : indicates if the sentiment is positive or negative. 0 denotes negative; positive is stored as 4 in the raw file and is remapped to 1 during preprocessing
id : ID of the tweet
date : date on which the tweet was sent
query : this column is not very useful. All the values for this column are ‘NO_QUERY’ which means none of the tweets have any query.
user : the user who tweeted (Twitter handle)
text : the tweet
We will be using Pandas to work with our data. Pandas stores the data in a dataframe which is a neat tabular representation of the data.
Our dataset is a CSV file (Comma separated value) which means all our values in a line (a row if you think of the data as being in a table) are separated by commas.
We use the read_csv method to read in the CSV file.
First argument: The Path to the file
Second argument: We don’t have column names in the first row, so there is no header and we don’t want Pandas to use the first row as the header. So, header = None
Third argument: names = column names which are stored in the cols list. So, we set names = cols
Fourth argument: an optional argument that specifies which engine to use to parse the data. We select the Python engine
Fifth argument: the encoding to use for the file, that is, how the bytes in the file are mapped to characters. We specify latin1, a single-byte encoding in which every byte value is valid, so reading does not fail on tweets that contain accented or other non-ASCII characters (although such characters may not be decoded exactly as they were written).
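Putting the five arguments together, the call might look like the sketch below. It does not read the real tweets_data.csv: it writes two made-up rows to a temporary file (the file name sample_tweets.csv and the row contents are assumptions) so that the example runs anywhere.

```python
import os
import tempfile

import pandas as pd

cols = ["sentiment", "id", "date", "query", "user", "text"]

# Two made-up rows in the same column order as the dataset, with no header row.
rows = (
    "0,1,Sat Apr 18 15:13:59 PDT 2009,NO_QUERY,user_a,what a long day\n"
    "4,2,Mon Jun 15 08:30:28 PDT 2009,NO_QUERY,user_b,café time\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_tweets.csv")
    with open(path, "w", encoding="latin1") as f:
        f.write(rows)

    train_data = pd.read_csv(
        path,               # first argument: path to the file
        header=None,        # the first row is data, not column names
        names=cols,         # use our own column names
        engine="python",    # parser engine
        encoding="latin1",  # how bytes are decoded into characters
    )

print(train_data.shape)  # (2, 6)
print(train_data.head())
```

In Colab, the path would instead point to the file inside the mounted Drive folder.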
shape returns the number of rows and columns in our dataset.
There are 30000 rows and 6 columns
Let’s see the first five rows of our dataset
| | sentiment | id | date | query | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1553795194 | Sat Apr 18 15:13:59 PDT 2009 | NO_QUERY | t_win | It’s been the longest day ever! I still haven’… |
| 1 | 0 | 2179002334 | Mon Jun 15 08:30:28 PDT 2009 | NO_QUERY | badsotheynv | I feel uber bad little ol lady is sick wanted … |
| 2 | 0 | 1936039755 | Wed May 27 07:20:42 PDT 2009 | NO_QUERY | mubi_just_do_it | goose just died…saddest scene i’ve seen… |
| 3 | 0 | 2185132296 | Mon Jun 15 16:56:05 PDT 2009 | NO_QUERY | walkthistown | @alexamarzi I KNOWW dont move |
| 4 | 0 | 2180496762 | Mon Jun 15 10:33:02 PDT 2009 | NO_QUERY | clare666 | @Piewacket1 awwww pie… the ‘once in a lifeti… |
Let’s keep our master copy as it is, in case we want to return to it or see how the original train.csv looked. So, we create a variable called data from train_data. Note that a plain assignment only gives the same dataframe a second name; train_data.copy() creates an independent copy.
The only columns important for our text analysis are the tweet and its sentiment. We can remove all other columns. This has 2 advantages:
- Smaller dataframes lead to higher performance
- We can focus on what is required and need not be distracted by unimportant columns
Pandas gives a method called drop to drop rows or columns from the dataframe.
We need to specify which columns to drop (remove) and the axis: 0 removes rows, 1 removes columns.
When this cell is executed, drop returns a new dataframe and leaves the original unchanged. To make the change permanent, there are 2 methods:
1. Assigning this statement to data (the variable containing our data). In this we overwrite the old data
2. Using the inplace argument, setting it to True tells it to modify it in the original dataframe.
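The behaviour of drop and the two ways of keeping its result can be checked on a tiny made-up dataframe (the row values below are assumptions; only the column names come from the article):

```python
import pandas as pd

# Made-up rows with the same six columns as the dataset.
train_data = pd.DataFrame({
    "sentiment": [0, 4],
    "id": [1, 2],
    "date": ["Sat Apr 18 15:13:59 PDT 2009", "Mon Jun 15 08:30:28 PDT 2009"],
    "query": ["NO_QUERY", "NO_QUERY"],
    "user": ["user_a", "user_b"],
    "text": ["what a long day", "love it"],
})

data = train_data.copy()  # independent working copy
unused = ["id", "date", "query", "user"]

# drop returns a new dataframe; `data` itself is untouched here.
preview = data.drop(unused, axis=1)
print(list(data.columns))     # still all six columns
print(list(preview.columns))  # ['sentiment', 'text']

# Method 1: assign the result back to the variable.
data = data.drop(unused, axis=1)

# Method 2 (equivalent): data.drop(unused, axis=1, inplace=True)
print(list(data.columns))
```

Because data was created with copy(), the master copy train_data keeps all six columns.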
Tweets may contain punctuation, extra white space and URL links. They may be in any case, upper or lower. Since we have a lot of tweets and every tweet needs to be cleaned in the same way, this is the perfect situation to use a function. To bring all tweets to a standard form, let’s define a function called clean_tweet. It takes a tweet as input.
BeautifulSoup is a Python library used to parse HTML. HTML stands for HyperText Markup Language. This is the language behind the web: webpages use HTML to structure their layout and define components. HTML is noisy. Apart from the text we need, it contains a lot of tags, formatting and structuring components. This is where BeautifulSoup comes in. It parses HTML using a parser of our choice, such as lxml, html5lib or html.parser. We use lxml, which is generally the fastest of the three.
Line 1: Creates a BeautifulSoup object with lxml parser. The first argument is the data we want to parse which is the tweet in our case. get_text() returns only the text.
For text cleaning, we use regular expressions (regex) through Python’s re module. It allows us to specify patterns and search for strings that follow a pattern. We have imported it using the line ‘import re’ in the beginning.
Line 2: Many tweets mention other users using ‘@’. These mentions need to be removed. Removing can also be thought of as replacing that text with whitespace. The re module provides a function called sub, which replaces every match of the pattern given as 1st argument with the string given as 2nd argument, in the text given as 3rd argument.
1st arg: the r prefix marks a raw string, so that backslashes in the pattern are not treated as escape characters. The pattern within the quotes is ‘@’ followed by upper or lower case letters or digits (A to Z, a to z, 0 to 9 is written as [A-Za-z0-9]; the square brackets specify a class of characters). ‘+’ means one or more of them
2nd arg: a space in quotes => Whitespace
3rd arg: tweet
The next step is to remove any links in the tweet. Links start with https or http followed by :// and any number of letters, digits and characters such as ‘.’ and ‘/’. As before, the pattern is written as a raw string within quotes.
1st arg: Since links can start with either http or https, we need to make the s optional. ‘?’ makes the preceding character optional, so it is placed right after the s in https. Then comes ‘://’, followed by a character class such as [A-Za-z0-9] for the content of the URL (‘.’ and ‘/’ have to be part of this class for the whole link to match). ‘+’ means one or more such characters.
Next, we keep only letters and common punctuation used in text, like ‘.’, ‘!’, ‘?’ and the apostrophe, and replace everything else with whitespace. The same explanation as for the above 2 steps works here too.
Since we have replaced all of the unnecessary stuff with whitespace, there will be runs of several spaces. The last step is to collapse each run into a single space. To match one or more spaces, we use ‘ +’ (a space followed by +).
After the manipulation, return tweet.
Now, we have to call this function on all our tweets. A for loop can be used, but a much more compact way is list comprehension (only takes 1 line).
We call the function clean_tweet on each tweet in the text column of data.
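The regex steps and the list comprehension described above can be sketched as follows. The exact patterns are reconstructed from the description and are therefore assumptions, the sample tweets are made up, and the BeautifulSoup step is shown only as a comment so that the sketch runs with the standard library alone.

```python
import re

def clean_tweet(tweet):
    # The article's first step, BeautifulSoup(tweet, "lxml").get_text(),
    # is left out here so the sketch needs only the standard library.

    # Remove @mentions: '@' followed by one or more letters or digits.
    tweet = re.sub(r"@[A-Za-z0-9]+", " ", tweet)
    # Remove links: http or https, '://', then letters, digits, '.' and '/'.
    tweet = re.sub(r"https?://[A-Za-z0-9./]+", " ", tweet)
    # Keep only letters and . ! ? ' ; everything else becomes a space.
    tweet = re.sub(r"[^a-zA-Z.!?']", " ", tweet)
    # Collapse runs of spaces into a single space.
    tweet = re.sub(r" +", " ", tweet)
    return tweet

# Made-up tweets standing in for data.text.
tweets = [
    "@alexamarzi I KNOWW   dont move!! http://example.com/abc",
    "goose just died...saddest scene i've seen",
]

# List comprehension: clean every tweet in one line.
data_clean = [clean_tweet(tweet) for tweet in tweets]
print(data_clean)
```

Note that a removed mention or link at the start or end of a tweet leaves a single leading or trailing space; calling strip() on the result would remove it if that matters for your use case.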
We can see the possible values of a column and their respective count using value_counts(). Let’s apply this to the sentiment column. We see that 4 is used instead of 1 to denote positive sentiments. So, all the occurrences of 4 have to be replaced by 1. Using the values attribute on any column returns an array with all column values.
Name: sentiment, dtype: int64
A very powerful concept used in Pandas is boolean masking, and we use it here. ‘data_labels == 4’ returns True where a value is 4 and False where it is not, which here means it is 0 (since 0 and 4 are the only possible values). We can use this array of True and False values as an index to decide which values need to be replaced by 1. Wherever there is True, the value is replaced by 1. The values with zero remain as is.
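A minimal sketch of the masking step on a made-up label array (the five values are assumptions; the variable name data_labels follows the text):

```python
import numpy as np

# Made-up labels; in the article they come from data.sentiment.values.
data_labels = np.array([0, 4, 4, 0, 4])

mask = data_labels == 4  # True where the label is 4, False where it is 0
print(mask)

data_labels[mask] = 1    # replace only the positions where the mask is True
print(data_labels)
```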
Now that we are done with the preprocessing steps, let’s get into the meat of this project. Tokenization is the first part of any NLP task. A tokenizer separates text into components: a word tokenizer splits by word, and a sentence tokenizer splits by sentence. In our case it is also what converts text into a list of numbers. If we manually made up a number for every word, it would be tedious, and we would not be able to convert an arbitrary new word into numbers.
Luckily for us, TensorFlow Datasets, which we imported before, can do this for us. We just give it the corpus (the tweets) and a target vocabulary size. The vocabulary size can even be smaller than the number of unique words in the corpus. In that case, the encoder composes a word out of smaller pieces when it has no unique representation for it. This is useful for words that appear very few times in the corpus. The result is an encoder, an object that can transform any string into a list of numbers. We will set our vocabulary size to 64000. It takes quite some time to create the tokenizer.
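To see what "transforming a string into a list of numbers" means, here is a toy word-level encoder in plain Python. It is only a stand-in for illustration and not the TensorFlow Datasets encoder the article uses: it works on whole words, and the corpus and the function name encode are assumptions.

```python
# Made-up corpus standing in for the cleaned tweets.
corpus = ["i love this movie", "i hate this movie"]

# Build the vocabulary. Ids start at 1 so that 0 stays free for padding.
vocab = {}
for sentence in corpus:
    for word in sentence.split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1

def encode(sentence):
    """Map each known word to its id. Unknown words are skipped in this toy
    version, whereas a subword encoder would split them into known pieces."""
    return [vocab[word] for word in sentence.split() if word in vocab]

data_inputs = [encode(sentence) for sentence in corpus]
print(vocab)
print(data_inputs)
```

The limitation in the docstring is exactly what a subword vocabulary addresses: rare or unseen words can still be encoded from smaller pieces.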
We don’t train with a single example at a time. We train in batches. But in order to do that, we need all tweets to have the same length. A simple way is to add zeros at the end of each encoded tweet so that they all have the same length. Importantly, zero is not a number used by our tokenizer, so it doesn’t correspond to any word and can be added without altering the meaning of our sentences. The first thing is to determine the maximum length of our tweets. After encoding, the length of a tweet is its number of tokens rather than its number of characters.
pad_sequences is the method used for padding. It takes the corpus, the value to pad with, the padding mode (‘post’ indicates we want to pad at the end) and the maximum length, which is the one we just calculated.
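What post-padding does can be shown with a small NumPy stand-in. This is not the Keras pad_sequences function itself; the helper name pad_post and the encoded tweets are assumptions, and the helper expects max_len to be at least the length of the longest sequence.

```python
import numpy as np

def pad_post(sequences, max_len, value=0):
    """Right-pad every sequence with `value` up to `max_len`.
    Assumes max_len >= the length of the longest sequence."""
    padded = np.full((len(sequences), max_len), value, dtype=int)
    for row, seq in enumerate(sequences):
        padded[row, :len(seq)] = seq
    return padded

# Made-up encoded tweets of different lengths.
data_inputs = [[1, 2, 3, 4], [5, 6], [7]]

MAX_LEN = max(len(seq) for seq in data_inputs)
padded = pad_post(data_inputs, MAX_LEN)
print(padded)
```

Every row now has MAX_LEN entries, so the tweets can be stacked into one array and fed to the model in batches.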
Splitting into training/testing set
We need to split our dataset into training and testing sets. Our data is ordered: the first half has negative sentiments while the latter half has positive sentiments. We will take 10% of the data as the test set. To keep the same proportion of positive and negative sentiments in the test set, so that the test accuracy is representative, we need 1500 positive and 1500 negative tweets (3000 tweets in total).
We can generate 1500 random integers between 0 and 15000, and 1500 random integers between 15001 and 30000. This can be done using the random.randint() method of numpy. First arg: starting number (inclusive), second arg: ending number (exclusive), third arg: number of integers required.
Using these indices, we obtain the test rows by indexing both data_inputs and data_labels like an array. Training inputs are obtained by deleting the test indices (the randomly generated indices).
Axis specifies if we want to remove along row or column. Since we need to remove row-wise, set axis to 0. Delete from both data_inputs and data_labels to generate data and targets in their respective variables.
Axis is not required for labels since it’s a vector and the only possible way to delete is row-wise.
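The split can be tried on a small made-up dataset. The sizes (20 rows, 2 test rows per class) and the seed are assumptions chosen so the arrays are easy to inspect; the article uses 30000 rows and 1500 test rows per class.

```python
import numpy as np

np.random.seed(0)  # only to make this sketch reproducible

n_rows, half, n_test_per_class = 20, 10, 2

# Made-up data: first half negative (0), second half positive (1).
data_inputs = np.arange(n_rows * 3).reshape(n_rows, 3)
data_labels = np.array([0] * half + [1] * half)

# randint(low, high, size): low is inclusive, high is exclusive.
test_idx = np.random.randint(0, half, n_test_per_class)
test_idx = np.concatenate(
    (test_idx, np.random.randint(half, n_rows, n_test_per_class))
)

test_inputs = data_inputs[test_idx]
test_labels = data_labels[test_idx]

# Remove the test rows to obtain the training set.
train_inputs = np.delete(data_inputs, test_idx, axis=0)  # axis=0: delete rows
train_labels = np.delete(data_labels, test_idx)          # 1-D, no axis needed

print(test_idx, test_labels)
print(train_inputs.shape, train_labels.shape)
```

One caveat: randint samples with replacement, so the same index can be drawn twice. The test set then contains a repeated row and slightly fewer rows are removed from the training set. If that matters, np.random.choice with replace=False draws distinct indices.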
Stage 3: Model building
Architecture: Apply an embedding layer. Then apply 3 different kinds of 1D convolution, of size two (bigram), three (trigram), and four (fourgram). 1D convolution is used because the width of each filter equals the embedding dimension, so the filter is applied along one dimension only. We apply a certain number of filters of each size. After applying the activation function, the output of each filter is a vector. We take the max of each of those vectors via max pooling and concatenate the results. Then we apply a linear function (Dense layer), which performs the classification.
Our class is called DCNN, which stands for Deep Convolutional Neural Network, and it inherits from keras.Model. Basically, we are building our own model.
First is the __init__ method, which is the initialization function and has to be implemented for every custom model or layer in TensorFlow. The first parameter is self (it refers to the current object when a class is instantiated). This is followed by all the variables that we need to build our model.
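One way the architecture described above could be written with the Keras subclassing API is sketched below. This is an assumed reconstruction, not the article's original code: the parameter names (emb_dim, nb_filters, ffn_units, dropout_rate), their default values, the ReLU activations and the single sigmoid output are all illustrative choices.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

class DCNN(tf.keras.Model):
    def __init__(self, vocab_size, emb_dim=128, nb_filters=50,
                 ffn_units=512, dropout_rate=0.1, name="dcnn"):
        super().__init__(name=name)
        self.embedding = layers.Embedding(vocab_size, emb_dim)
        # Three families of 1D filters: 2, 3 and 4 tokens wide.
        self.bigram = layers.Conv1D(nb_filters, 2, padding="valid", activation="relu")
        self.trigram = layers.Conv1D(nb_filters, 3, padding="valid", activation="relu")
        self.fourgram = layers.Conv1D(nb_filters, 4, padding="valid", activation="relu")
        self.pool = layers.GlobalMaxPooling1D()  # global max over the token axis
        self.dense_1 = layers.Dense(ffn_units, activation="relu")
        self.dropout = layers.Dropout(dropout_rate)
        self.last_dense = layers.Dense(1, activation="sigmoid")  # P(positive)

    def call(self, inputs, training=False):
        x = self.embedding(inputs)
        x_1 = self.pool(self.bigram(x))
        x_2 = self.pool(self.trigram(x))
        x_3 = self.pool(self.fourgram(x))
        merged = tf.concat([x_1, x_2, x_3], axis=-1)  # shape (batch, 3 * nb_filters)
        merged = self.dense_1(merged)
        merged = self.dropout(merged, training=training)
        return self.last_dense(merged)

# Tiny made-up batch of padded token ids, just to check the shapes.
model = DCNN(vocab_size=100, emb_dim=8, nb_filters=4, ffn_units=16)
sample_batch = np.array([[1, 2, 3, 4, 0, 0], [5, 6, 7, 0, 0, 0]])
probs = model(sample_batch, training=False).numpy()
print(probs.shape)  # (2, 1)
```

The training flag is passed to the dropout layer so that dropout is active during training and switched off at prediction time.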
Stage 4: Application
Let’s create a checkpoint before training our model. This is a way to store our model once it’s trained so that we need not train from scratch when we want to use it later.
Epoch 1/5
849/849 [==============================] - 56s 66ms/step - loss: 0.5422 - accuracy: 0.7159
Epoch 2/5
849/849 [==============================] - 57s 67ms/step - loss: 0.4216 - accuracy: 0.8064
Epoch 3/5
849/849 [==============================] - 56s 66ms/step - loss: 0.2888 - accuracy: 0.8797
Epoch 4/5
849/849 [==============================] - 62s 73ms/step - loss: 0.1313 - accuracy: 0.9518
Epoch 5/5
849/849 [==============================] - 56s 65ms/step - loss: 0.0799 - accuracy: 0.9704
Let’s see how our model performs on new or unknown data.
94/94 [==============================] - 1s 16ms/step - loss: 0.9547 - accuracy: 0.7293
[0.9547498822212219, 0.7293333411216736]
For testing our own sentences, we have to encode the sentence using the tokenizer and convert it into a numpy array. Set the training argument to False as we aren’t training and dropout isn’t required. The output will be a tensor, which is hard to read. So, let’s convert it into numpy format.
The model reached about 73% accuracy on the test dataset, and it seems to work pretty well on our sentences. Of course, the data used is far from perfect: it is just a few tweets, and they may not contain all the possible words. You have to get your own dataset depending on the task at hand. Still, this was a good dataset to show that our model works. You can also try with your own sentences or even your own dataset. In this project, we started from scratch, learned the theory, and implemented a sentiment analyzer. Hope you liked it!