The Impact of Stemming and Lemmatization Applied to Word Vector Based Models in Sentiment Analysis

  • Youri Senders

Sentiment analysis is a commercially attractive field in which a great deal of research has already been conducted. Millions of people across the world share their opinions on the web, which makes sentiment analysis a popular subject among researchers. The first step in sentiment analysis is text preprocessing. This research focuses on two preprocessing methods, stemming and lemmatization. Both are text normalization methods that attempt to obtain the root forms of inflected words. Prior work has highlighted the importance of preprocessing. However, while preprocessing in classification models has been studied extensively, little work has been done on preprocessing in word vector based models. Therefore, the goal of this work is to examine the role of stemming and lemmatization when applied at the training phase of word vector based models. The following research question is addressed: “Which stemming or lemmatization method is most suitable for predicting sentiment polarity when integrated at the training phase of word vector based models?” This thesis uses a training corpus of 142,570 news articles to train the word vectors and the IMDB movie review dataset for classification. First, stemming and lemmatization are applied to the training corpus. Second, Word2Vec’s CBOW and Skip-gram models are trained on the normalized corpus. Next, the same stemming and lemmatization methods are applied to the classification dataset to obtain a compatible vocabulary. Finally, sentiment is classified using an LSTM model with an embedding layer. The results of the experiments show that all methods outperformed the baseline for both word embedding models. For the CBOW model, lemmatization is the preferred method, while the Snowball stemmer is the preferred stemming method. For the Skip-gram model, the Snowball stemmer achieved the best performance, with the Porter stemmer and lemmatization close behind. Therefore, this research concludes that the Snowball stemmer is the best performing stemming method, while lemmatization achieves similar results.
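The step of normalizing the classification dataset "to obtain a compatible vocabulary" can be illustrated with a small sketch. It is not the thesis's code: the suffix stripper `toy_stem`, its rule set and the two tiny text collections are hypothetical stand-ins for a real stemmer (such as Porter or Snowball), the news corpus and the IMDB reviews. It only shows why the same normalization has to be applied on both sides, since otherwise fewer review tokens would find an entry in the embedding vocabulary.

```python
# Toy illustration only: a crude suffix stripper stands in for a real stemmer.
# The rule set and the example texts are assumptions, not taken from the thesis.

SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical rules, tried in this order


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Lowercase and split on whitespace, without any normalization."""
    return text.lower().split()


def normalize(text):
    """Tokenize and then stem every token."""
    return [toy_stem(tok) for tok in tokenize(text)]


# Hypothetical stand-ins for the training corpus and the review dataset.
training_corpus = ["The actors performed well", "Critics praised the films"]
reviews = ["The actor performs well", "A critic praises the film"]

# Vocabulary an embedding model would learn from the normalized training corpus.
embedding_vocab = {tok for doc in training_corpus for tok in normalize(doc)}


def coverage(docs, tokenizer):
    """Fraction of tokens in docs that have an entry in embedding_vocab."""
    tokens = [tok for doc in docs for tok in tokenizer(doc)]
    return sum(tok in embedding_vocab for tok in tokens) / len(tokens)


raw_coverage = coverage(reviews, tokenize)       # reviews left unnormalized
stemmed_coverage = coverage(reviews, normalize)  # same stemmer on both sides

print(f"raw: {raw_coverage:.2f}, stemmed: {stemmed_coverage:.2f}")
```

In this toy setting the stemmed reviews match more vocabulary entries than the raw ones. How large that effect is with real stemmers and real data is an empirical question that the sketch does not address.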
