Enormous amounts of text data are generated every minute. Text categorization, which organizes documents into groups based on their underlying structure, helps capture the large number of activities and the diversity of a multidisciplinary field.
The internet produces vast amounts of data that must be classified before big data problems can be solved, and text categorization and document classification are the basic first step. A familiar real-time application of this technique is email filtering, where emails are sorted into categories such as Spam, Social, and Promotions.
Traditional methods such as Bayesian classifiers, decision trees, K-NN, and the vector space model fall short of newer approaches like LSA and the word2vec model. Their weakness is that the resulting vectors cannot represent documents well, because a plain term-document matrix does not fully capture the meaning of words. The LSA approach, combined with the word2vec model, overcomes this and lets us represent words with better vectors. We also use a CNN, an efficient deep learning model that further increases categorization accuracy. So in this article, we go through latent semantic analysis (LSA), word2vec, and the CNN model for text and document categorization.
In this article, we perform text categorization with LSA and document classification with the word2vec model; the system flow is shown in the following figure.
Fig 1. System architecture for text categorization
As shown in the figure above, we first pre-process the given document. Depending on the document, pre-processing includes data cleaning, tokenization, sentence splitting, stop-word removal, and so on. The next step is synonym replacement, which reduces the number of unique words in a document. The next phase is LSA itself, which comes in two parts: standard LSA and modified LSA.
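A minimal sketch of this pre-processing stage is shown below. The stop-word list and the synonym table are illustrative stand-ins (a real pipeline would typically use a library such as NLTK or spaCy and a much larger lexicon).

```python
import re

# Hypothetical stop-word list and synonym table, assumed for illustration.
STOP_WORDS = {"the", "is", "a", "an", "of", "to", "and", "in"}
SYNONYMS = {"movie": "film"}  # synonym replacement shrinks the vocabulary

def preprocess(document: str) -> list[str]:
    # Clean: lowercase and strip punctuation / non-alphanumeric characters.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", document.lower())
    # Tokenize on whitespace.
    tokens = cleaned.split()
    # Remove stop words, then replace synonyms with a canonical word.
    return [SYNONYMS.get(t, t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The movie is a great movie, and the plot is good!"))
# → ['film', 'great', 'film', 'plot', 'good']
```

Both occurrences of "movie" map to "film", so the document contributes one unique content word instead of two.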
Word Vector (term-document matrix)
Word vectors are trained by the word2vec model, while TF-IDF weighting describes the influence of each word on a document. Together, TF-IDF weights and word vectors are used to represent documents.
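The TF-IDF weight can be computed directly from its definition; the toy corpus below is assumed purely for illustration.

```python
import math

# Toy corpus: documents are already-tokenized word lists (assumed data).
docs = [["cat", "sat", "mat"], ["cat", "ran"], ["dog", "ran", "fast"]]

def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    tf = doc.count(term) / len(doc)           # term frequency in this document
    df = sum(1 for d in corpus if term in d)  # number of documents containing the term
    idf = math.log(len(corpus) / df)          # inverse document frequency
    return tf * idf

# "cat" appears in 2 of 3 documents, so its weight is low;
# "mat" appears in only 1, so it carries more weight in docs[0].
print(tf_idf("cat", docs[0], docs))
print(tf_idf("mat", docs[0], docs))
```

The rarer term gets the larger weight, which is exactly the "influence on the document" the text describes.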
Fig 2. Word vector
Each element of the term-document matrix in the vector space model is governed by a TF-IDF weight, while the word vectors are trained by the word2vec model. We multiply each word vector by its TF-IDF weight, producing a term-document representation of the relation between words and documents. The result is an m × n × v matrix, where m is the number of words, n is the number of documents, and v is the length of the vectors trained by word2vec.
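The shape bookkeeping can be sketched with NumPy broadcasting; the sizes and the random stand-in values below are assumptions (a real run would use trained word2vec vectors and computed TF-IDF weights).

```python
import numpy as np

rng = np.random.default_rng(0)

m, n, v = 4, 3, 5                    # words, documents, word2vec vector length (assumed)
tfidf = rng.random((m, n))           # TF-IDF weight of word i in document j (stand-in)
word_vecs = rng.random((m, v))       # word vectors "trained by word2vec" (stand-in)

# Scale each word's vector by its TF-IDF weight in each document,
# giving the m × n × v representation described above.
tensor = tfidf[:, :, None] * word_vecs[:, None, :]
print(tensor.shape)  # → (4, 3, 5)
```

Entry `tensor[i, j]` is word i's vector scaled by its importance to document j.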
LSA is widely used in NLP for accurate text categorization. Its main task is to analyze the relationship between two documents by building a vector space model over the words and terms that occur in them.
Fig 3. Standard LSA architecture
Here we use the word vector discussed above. LSA can identify important relationships by counting word frequencies in documents. Let us consider the sentences A1, A2, and A3 as documents, apply LSA to them, and obtain the representation matrix below.
Fig 4. Sample sentences
Fig 5. TF-IDF matrix representation
The matrix is generated based on TF-IDF (term frequency-inverse document frequency). In this matrix, word occurrence is recorded as 1 or 0: 1 indicates the presence and 0 the absence of that word in the sentence. Here we apply this to sentences and words, but the same idea applies to documents and word vectors for document classification.
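Since the sample sentences of Fig 4 are not reproduced here, the sketch below builds the same kind of 1/0 presence matrix over three stand-in answers A1-A3 (assumed text).

```python
# Stand-in answers; the actual A1-A3 from Fig 4 are not reproduced here.
answers = {
    "A1": "the cell divides",
    "A2": "the cell stores energy",
    "A3": "the prisoner left the cell",
}

# Vocabulary = every distinct word across all answers, in sorted order.
vocab = sorted({w for s in answers.values() for w in s.split()})

# 1 marks the presence and 0 the absence of a word in each answer.
matrix = {a: [1 if w in s.split() else 0 for w in vocab]
          for a, s in answers.items()}

print(vocab)
for a, row in matrix.items():
    print(a, row)
```

Each row is the term-document column for one answer; stacking them gives the matrix LSA operates on.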
Singular Value Decomposition:
SVD is used to reduce the dimensionality of the word matrix. By reducing dimensions we can remove noise and expose a cleaner semantic structure. SVD is an algebraic method that can identify the relationship between words and sentences.
A is the m × n term-document matrix, which SVD factors as A = U Σ Vᵀ. U is an m × m matrix whose columns are orthogonal feature vectors, Σ is the diagonal matrix of singular values, and V is an n × n matrix whose columns are orthogonal feature vectors. We used the scipy.sparse.linalg Python library to apply SVD to the input matrix. After SVD, we compute cosine similarity as

cos(A, B) = (A · B) / (‖A‖ ‖B‖)

where A is the frequency sequence of answer Ai and B is the frequency sequence of answer Ai+1.
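A short sketch of these two steps, truncated SVD followed by cosine similarity, is given below. The 4 × 3 count matrix and the choice k = 2 are assumptions for illustration; the article's named `scipy.sparse.linalg` routine `svds` does the factorization.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Toy 4-word x 3-document term matrix A (values assumed for illustration).
A = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 0.],
              [0., 0., 1.]])

# Truncated SVD: keep k=2 latent dimensions to remove noise.
U, s, Vt = svds(A, k=2)
A_k = U @ np.diag(s) @ Vt          # rank-2 reconstruction of A

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # cos(A, B) = (A · B) / (‖A‖ ‖B‖)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similarity between the first two documents in the reduced space.
print(cosine(A_k[:, 0], A_k[:, 1]))
```

Comparing columns of the rank-k reconstruction (rather than of A itself) is what lets LSA score two answers as similar even when they share few surface words.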
LSA uses TF-IDF to represent word vectors. Its disadvantage is that it ignores the syntactic aspects of words. Take the word "cell", which has multiple meanings: these meanings cannot be resolved without syntactic information. Before identifying the meaning, we first need to identify each word's syntactic tag as a noun, verb, or adjective. Modified LSA studies documents using these syntactic features.
Fig 6. Modified LSA architecture
Part of Speech tagging:
In modified LSA, we use TF-POS (term frequency-part of speech). Part-of-speech tagging assigns each word to one of three fixed parts of speech: noun, verb, or adjective. The aim of POS tagging is to determine the exact tag for each word in the document. Instead of listing all words in a single list, TF-POS divides them into three separate vectors, as shown in the following figure.
Fig 7. TF-POS matrix representation
Here we take A1, A2, and A3 as given in standard LSA and represent them with the TF-POS matrix. As shown in the representation above, all words are divided among the three tags, and then the term frequency (TF) is computed. The advantage of modified LSA is that it needs no Singular Value Decomposition (SVD), which demands high computation power and complex processing. TF-POS follows a divide-and-conquer strategy, using multiple vectors based on tags, so there is no need to reduce the dimensionality of the word vector. Finally, cosine similarity is applied to measure the similarity between answers.
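The TF-POS split can be sketched as follows. A tiny hand-made lexicon stands in for a real POS tagger (such as NLTK's `pos_tag`); both the lexicon and the sample sentence are assumptions for illustration.

```python
# Hypothetical POS lexicon standing in for a real tagger.
POS = {"cell": "noun", "energy": "noun", "prisoner": "noun",
       "divides": "verb", "stores": "verb", "left": "verb",
       "small": "adj", "dark": "adj"}

def tf_pos(tokens: list[str]) -> dict[str, dict[str, int]]:
    # Divide and conquer: one term-frequency vector per tag
    # instead of one long vector over the whole vocabulary.
    vectors = {"noun": {}, "verb": {}, "adj": {}}
    for t in tokens:
        tag = POS.get(t)
        if tag:
            vectors[tag][t] = vectors[tag].get(t, 0) + 1
    return vectors

print(tf_pos(["the", "prisoner", "left", "the", "small", "dark", "cell"]))
# → {'noun': {'prisoner': 1, 'cell': 1}, 'verb': {'left': 1}, 'adj': {'small': 1, 'dark': 1}}
```

Cosine similarity can then be applied per tag vector, so no single large matrix ever needs SVD.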
Convolutional Neural Network (CNN)
After a matrix based on the term-document relationship has been constructed by LSA or modified LSA, all document vectors are trained with the CNN architecture.
Fig 8. CNN structure
The CNN structure is shown in the figure above. It categorizes documents from a low-level local feature extraction layer up to a high-level classification layer. The first input layer takes the 2-D vector of a document as an n × n matrix. The structure contains two convolution layers followed by a max-pooling layer. For maximum accuracy, we use the ReLU activation function in the first convolution layer and the sigmoid activation function in the second. Finally, a fully connected layer performs the classification.
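The layer stack can be sketched as a plain NumPy forward pass, with no training, just the shapes and activations named above. The 8 × 8 input, 3 × 3 kernels, and three output classes are assumed sizes.

```python
import numpy as np

rng = np.random.default_rng(1)

def conv2d(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    # Valid 2-D convolution (no padding, stride 1).
    kh, kw = kernel.shape
    h, w = x.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(x: np.ndarray, size: int = 2) -> np.ndarray:
    # Non-overlapping max pooling over size x size windows.
    h, w = x.shape
    return x[:h - h % size, :w - w % size] \
        .reshape(h // size, size, w // size, size).max(axis=(1, 3))

relu = lambda z: np.maximum(z, 0)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

x = rng.random((8, 8))                          # n x n document matrix from LSA
h1 = relu(conv2d(x, rng.random((3, 3))))        # first conv layer, ReLU
h2 = sigmoid(conv2d(h1, rng.random((3, 3))))    # second conv layer, sigmoid
p = max_pool(h2)                                # max-pooling layer
logits = p.flatten() @ rng.random((p.size, 3))  # fully connected layer, 3 classes
print(logits.shape)  # → (3,)
```

An 8 × 8 input shrinks to 6 × 6 and then 4 × 4 under the valid convolutions, to 2 × 2 after pooling, and the fully connected layer maps the flattened result to one score per category.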
We conclude that newer techniques should be preferred over older ones for better results: use LSA in place of classification methods such as K-NN or decision trees, and, where appropriate, use modified LSA with part-of-speech (POS) tagging in place of SVD. Once this is done, a CNN containing hidden convolution and pooling layers followed by a fully connected layer yields better categorization accuracy.
- A Hybrid Method of Syntactic Feature and Latent Semantic Analysis for Automatic Arabic Essay Scoring. DOI: 10.3923/jas.2016.209.215
- An Enhanced Latent Semantic Analysis Approach for Arabic Document Summarization. DOI: 10.1007/s13369-018-3286-z
- An Application of Latent Semantic Analysis for Text Categorization. DOI: 10.15837/ijccc.2015.3.1923
- Using Latent Semantic Analysis to Identify Research Trends in OpenStreetMap. DOI: 10.3390/ijgi6070195
- An Efficient Method for Document Categorization Based on Word2vec and Latent Semantic Analysis. DOI: 10.1109/CIT/IUCC/DASC/PICOM.2015.336
- Latent Semantic Analysis for Text Categorization Using Neural Network. DOI: 10.1016/j.knosys.2008.03.045