The Impact of Stemming and Lemmatization Applied to Word Vector Based Models in Sentiment Analysis
Sentiment analysis is a commercially attractive field in which a great deal of research has already been conducted. Millions of people across the world share their opinions on the web, which makes sentiment analysis a popular subject among researchers. The first step in sentiment analysis is text preprocessing. This research focuses on two preprocessing methods, stemming and lemmatization. These are text normalization methods that attempt to obtain the root forms of inflected words. Prior work has highlighted the importance of preprocessing.
However, while preprocessing in classification models has been studied extensively, little work has been done on preprocessing in word vector based models. Therefore, the goal of this work is to examine the role of stemming and lemmatization when applied at the training phase of word vector based models. The following research question is addressed: “Which stemming or lemmatization method is most suitable for predicting sentiment polarity when integrated at the training phase of word vector based models?” This thesis uses a training corpus consisting of 142,570 news articles and the IMDB movie review dataset for classification. First, stemming and lemmatization are applied to the training corpus. Second, Word2Vec’s CBOW and Skip-gram models are trained. Next, stemming and lemmatization are applied to the classification dataset to obtain a compatible vocabulary. Finally, sentiment is classified using an LSTM model with an embedding layer. The results of the experiments show that all methods outperformed the baseline for both word embedding models. For the CBOW model, lemmatization is the preferred method, and the Snowball stemmer is the preferred stemmer. For the Skip-gram model, the Snowball stemmer achieved the best performance, with the Porter stemmer and lemmatization close behind. Therefore, this research concludes that the Snowball stemmer is the best-performing stemming method, while lemmatization achieves similar results.