Effects of Severe Data Imbalance on Evaluation of Support Vector Machines and Decision Trees
Belenko, Vera (2022)
Master's Programme in Computational Big Data Analytics
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of approval
2022-11-22
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202210267908
Abstract
Data imbalance refers to the phenomenon in which one class is much better represented in a dataset than the others. Researchers encounter data imbalance in many fields, including medicine, fraud and device failure detection, and the prediction of conversions from user behavior data. Although a significant number of papers are devoted to predicting user behavior, this literature lacks a closer analysis of data imbalance and of how it affects classifier performance and model evaluation. This thesis attempts to fill this gap.
In this thesis, support vector machines and decision trees are employed to predict whether a website user is interested in purchasing a certain product. Each classifier is evaluated using four strategies: balanced training and testing data, balanced training and unbalanced testing data, unbalanced training and balanced testing data, and unbalanced training and testing data. The metrics used to evaluate model performance are Accuracy, Precision, F1, MCC, Sensitivity, Specificity and ROC-AUC. Learning curves are built for each metric to evaluate how performance changes as the training sample size increases. Hierarchical clustering is applied to evaluate how dimensionality reduction affects performance.
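To illustrate why the class balance of the testing data matters for these metrics, the following is a minimal sketch and not the thesis's own code. It assumes a hypothetical classifier with a fixed sensitivity of 0.8 and specificity of 0.9, and the test-set sizes are invented for illustration. ROC-AUC is left out because it needs ranked scores rather than confusion-matrix counts.

```python
from math import sqrt


def confusion_counts(n_pos, n_neg, sensitivity, specificity):
    """Confusion-matrix counts for a classifier with fixed per-class rates."""
    tp = round(n_pos * sensitivity)
    fn = n_pos - tp
    tn = round(n_neg * specificity)
    fp = n_neg - tn
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Threshold-based metrics computed from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    pr_sum = precision + sensitivity
    f1 = 2 * precision * sensitivity / pr_sum if pr_sum else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical classifier: sensitivity 0.8, specificity 0.9 (assumed values).
# Balanced test set: 1000 positives and 1000 negatives.
balanced = metrics(*confusion_counts(1000, 1000, 0.8, 0.9))
# Unbalanced test set: 100 positives and 9900 negatives (1% positives).
unbalanced = metrics(*confusion_counts(100, 9900, 0.8, 0.9))

for name in balanced:
    print(f"{name:12s} balanced={balanced[name]:.3f} unbalanced={unbalanced[name]:.3f}")
```

Under these assumptions, Sensitivity and Specificity are identical on both test sets, while Precision, F1 and MCC drop sharply on the unbalanced set and Accuracy rises slightly. The same classifier can therefore look very different depending on which metric and which test distribution is reported.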
The predictions yielded by the classifiers are to be used by a company to target its marketing efforts. This thesis places an emphasis on utility measures and how they can be used to evaluate and compare the usefulness of the classifiers for the marketing task.
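The abstract does not define the utility measures it uses, so the sketch below is only one plausible form and an assumption throughout: every user predicted positive is contacted at a fixed cost, and each true positive yields a fixed value. The function name `campaign_utility`, the parameters, and all numbers are hypothetical.

```python
def campaign_utility(tp, fp, value_per_conversion, cost_per_contact):
    """Net utility of contacting every user predicted positive.

    Hypothetical utility model: each contact costs `cost_per_contact`,
    and each correctly targeted user (true positive) is worth
    `value_per_conversion`.
    """
    contacted = tp + fp
    return tp * value_per_conversion - contacted * cost_per_contact


# Two hypothetical classifiers evaluated on the same unbalanced test set
# (100 interested users, 9900 uninterested users); all counts are invented.
cautious = campaign_utility(tp=50, fp=100, value_per_conversion=20.0, cost_per_contact=1.0)
eager = campaign_utility(tp=80, fp=990, value_per_conversion=20.0, cost_per_contact=1.0)

print(f"cautious classifier utility: {cautious:.1f}")
print(f"eager classifier utility:    {eager:.1f}")
```

With these assumed costs, the classifier that finds fewer interested users yields the higher net utility because it wastes far fewer contacts. A utility measure of this kind can rank classifiers differently from Sensitivity alone.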