2020

Classification of imbalanced data sets with sampling methods for predicting online shopping intentions

MSc 01/2020-017

Mustafa Bulca

With the increasing use of the internet, online shopping is becoming more popular, and this development opens new possibilities in data science. Data on visitors to online webshops are increasingly collected for prediction purposes. An important property of webshop data is that it is imbalanced: because it is easy to gather information and compare products on the internet, most visits to a webshop end without a purchase. Handling imbalanced data sets is an important part of building machine learning classifiers. Methods for handling imbalanced data sets can be split into two categories. The first is algorithm-based approaches, where the focus is mainly on improving the algorithm to enhance prediction performance. The second is sampling-based approaches, where the data set is oversampled or downsampled before training classifiers. The data used in this study was obtained from the UCI machine learning repository and was previously used in the study of Sakar et al. (2018), in which the researchers randomly oversampled the data set. This study extends the study of Sakar et al. (2018) by applying different sampling methods to the same classifiers. In this research, sampling-based approaches are used to enhance the prediction of online shopping intentions. The data set is randomly downsampled, oversampled with the SMOTE algorithm, and resampled with a combination of both methods, and each variant is used to train a decision tree, a support vector machine, and a multilayer perceptron. The results show that the proposed combination of downsampling and the SMOTE algorithm does not outperform the SMOTE algorithm on its own. Further research on combining the two sampling methods is needed to develop the combination further or to adapt it to other data sets and machine learning classifiers.
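The three sampling variants named in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the toy data, the target class size, and the neighbour count `k` are all assumptions chosen for illustration, and the SMOTE step is a simplified pure-Python version of the usual interpolation between a minority row and one of its nearest minority neighbours.

```python
import math
import random


def random_downsample(rows, n_keep, rng):
    """Keep n_keep rows chosen uniformly at random, without replacement."""
    return rng.sample(rows, n_keep)


def smote(minority, n_new, k, rng):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (simplified SMOTE).
    Assumes at least two minority rows with numeric features."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [row for row in minority if row is not base]
        neighbours = sorted(others, key=lambda row: math.dist(base, row))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def combine(majority, minority, target, k=5, seed=0):
    """Downsample the majority class and oversample the minority class so
    that both end up with `target` rows (hypothetical balancing rule)."""
    rng = random.Random(seed)
    kept = random_downsample(majority, min(target, len(majority)), rng)
    extra = smote(minority, max(0, target - len(minority)), k, rng)
    X = kept + minority + extra
    y = [0] * len(kept) + [1] * (len(minority) + len(extra))
    return X, y


# Toy data: 20 "no purchase" visits (label 0) and 4 "purchase" visits (label 1).
majority = [[float(i), float(i % 3)] for i in range(20)]
minority = [[5.0, 10.0], [6.0, 11.0], [7.0, 10.0], [6.0, 12.0]]

X, y = combine(majority, minority, target=10, k=3)
print(y.count(0), y.count(1))  # prints "10 10"
```

Calling `random_downsample` alone or `smote` alone corresponds to the two single-method variants; `combine` applies both. Resampling would normally be applied to the training split only, so that the test split keeps the original class distribution.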
