The Impact of Stemming and Lemmatization Applied to Word Vector Based Models in Sentiment Analysis

  • Youri Senders

Sentiment analysis is a commercially attractive field in which a great deal of research has already been conducted. Millions of people across the world share their opinions on the web, which makes sentiment analysis a popular subject among researchers. The first step in sentiment analysis is text preprocessing. This research focuses on two preprocessing methods, stemming and lemmatization. Both are text normalization methods that attempt to reduce inflected words to their root forms. Prior work has highlighted the importance of preprocessing.
However, while preprocessing in classification models has been studied extensively, little work has been done on preprocessing in word vector based models. Therefore, the goal of this work is to examine the role of stemming and lemmatization when applied at the training phase of word vector based models. The following research question is addressed: “Which stemming or lemmatization method is most suitable for predicting sentiment polarity when integrated at the training phase of word vector based models?” This thesis uses a training corpus consisting of 142,570 news articles and the IMDB movie review dataset for classification. First, stemming and lemmatization are applied to the training corpus. Second, Word2Vec’s CBOW and Skip-gram models are trained. Next, stemming and lemmatization are applied to the classification dataset to obtain a compatible vocabulary. Finally, sentiment is classified using an LSTM model with an embedding layer. The results of the experiments show that all methods outperformed the baseline for both word embedding models. For the CBOW model, lemmatization is the preferred method overall, whereas the Snowball stemmer is the preferred stemming method. For the Skip-gram model, the Snowball stemmer achieved the best performance, with the Porter stemmer and lemmatization close behind. Therefore, this research concludes that the Snowball stemmer is the best performing stemming method, while lemmatization achieves similar results.
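The reason the same normalization has to be applied to both the training corpus and the classification dataset can be illustrated with a small sketch. The abstract does not name the tooling that was used, so everything below is an assumption for illustration: `toy_stem` is a hypothetical suffix stripper standing in for a real stemmer (Porter, Snowball) or a lemmatizer, and the two tiny text samples are invented. The sketch only shows how normalizing both sides yields a compatible vocabulary for an embedding lookup; it does not train Word2Vec or an LSTM.

```python
from collections import Counter

# Hypothetical toy normalizer (an assumption, not the thesis's method):
# it strips a few common English suffixes and stands in for a real
# stemmer such as Porter or Snowball, or for a lemmatizer.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(token: str) -> str:
    for suffix in SUFFIXES:
        # Keep at least three characters so short words stay intact.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(text: str) -> list[str]:
    # Lowercase, split on whitespace, then normalize each token.
    return [toy_stem(tok) for tok in text.lower().split()]


def build_vocab(corpus: list[str]) -> dict[str, int]:
    counts = Counter(tok for doc in corpus for tok in normalize(doc))
    # Index 0 is reserved for out-of-vocabulary tokens.
    return {tok: i + 1 for i, (tok, _) in enumerate(counts.most_common())}


def encode(text: str, vocab: dict[str, int]) -> list[int]:
    # Map normalized tokens to embedding indices (0 = unknown).
    return [vocab.get(tok, 0) for tok in normalize(text)]


# Invented stand-ins for the news corpus and a movie review.
training_corpus = ["the actors played well", "critics liked the acting"]
review = "the actor plays and the critic likes it"

vocab = build_vocab(training_corpus)
encoded = encode(review, vocab)

# Compare vocabulary coverage without and with normalization.
raw_vocab = {tok for doc in training_corpus for tok in doc.lower().split()}
raw_hits = sum(tok in raw_vocab for tok in review.lower().split())
norm_hits = sum(idx != 0 for idx in encoded)

print(raw_hits, norm_hits)  # prints "2 6"
```

In this toy example, only two of the eight review tokens are found in the unnormalized training vocabulary, against six once both sides pass through the same normalizer. If only the training corpus were normalized, surface forms such as "plays" or "likes" in the reviews would have no matching vector.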

