Probabilistic Calibration of Bi-Class Machine Learning Algorithms for Analysing Breast Cancer Datasets
Raji, Wareez Olalekan (2021)
Master's Programme in Computational Big Data Analytics
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of approval
2021-04-28
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202104253468
Abstract
Breast cancer is one of the most widespread diseases and the second leading cause of cancer death among women worldwide. According to a 2013 World Health Organization report, over 508,000 women worldwide died of breast cancer in 2011. Breast tumours are of two types, benign and malignant. Breast cancer in women can be cured and prevented in its primary stages, that is, when it is detected early. However, many women are diagnosed only when it is too late.
In machine learning, it is important that probabilistic classifiers and predictive models produce reliable probabilities, because decision-making depends heavily on them. Especially in the medical field, where a wrong diagnosis or prediction may result in death, classifiers need to be calibrated in order to obtain more reliable probabilities. Metrics such as the Brier score, the F1-score (based on precision and recall) and AUC-ROC (area under the receiver operating characteristic curve) are used to evaluate machine learning models on real-world problems. This thesis examines the probability calibration of machine learning algorithms and compares different algorithms on two breast cancer datasets (one empirical, one simulated) in order to determine the best-performing ones. There are several methods for calibrating probabilistic predictions, but the two main ones are Platt's sigmoid scaling (a parametric approach) and isotonic regression (a non-parametric approach). Five machine learning methods (logistic regression, naïve Bayes, support vector machine classifier (SVC), random forest and k-nearest neighbours (KNN)) were applied to the empirical and simulated binary breast cancer datasets and then calibrated. The empirical and simulated datasets gave different results after calibration.
Overall, on the empirical dataset, calibration with both sigmoid scaling and isotonic regression improved the results for KNN and random forest. Logistic regression and SVC gave worse results, while naïve Bayes improved only with isotonic regression and not with sigmoid scaling. On the simulated dataset, calibration with sigmoid scaling and isotonic regression improved the results for naïve Bayes, KNN and SVC, while random forest gave a worse result. Logistic regression improved only with isotonic regression and not with sigmoid scaling.
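To make the two calibration methods named in the abstract concrete, the following is a minimal pure-Python sketch, not the thesis's own code. It assumes a simplified form of Platt's sigmoid scaling (a two-parameter sigmoid fitted by plain gradient descent on the log loss, without Platt's target smoothing) and isotonic regression via the pool-adjacent-violators algorithm, and it scores both with the Brier score. The function names and the twelve toy scores and labels are hypothetical, and the calibrators are evaluated on the same data they were fitted on purely for illustration; in practice a held-out set would be used.

```python
import math

def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

def fit_sigmoid(scores, y_true, lr=0.1, n_iter=5000):
    """Simplified Platt-style scaling: fit p = 1 / (1 + exp(-(a*s + b)))
    by gradient descent on the average log loss (assumption: no target smoothing)."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, y_true):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return lambda s: 1.0 / (1.0 + math.exp(-(a * s + b)))

def fit_isotonic(scores, y_true):
    """Isotonic regression via pool-adjacent-violators.
    Returns a non-decreasing step function; assumes the scores are distinct."""
    blocks = []  # each block holds [sum of labels, count, largest score in block]
    for s, y in sorted(zip(scores, y_true)):
        blocks.append([float(y), 1, s])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            sum_y, count, max_s = blocks.pop()
            blocks[-1][0] += sum_y
            blocks[-1][1] += count
            blocks[-1][2] = max_s

    def predict(s):
        for sum_y, count, max_s in blocks:
            if s <= max_s:
                return sum_y / count
        return blocks[-1][0] / blocks[-1][1]  # scores above the training range

    return predict

# Hypothetical uncalibrated scores and true labels (1 = malignant, 0 = benign).
scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
labels = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1]

sigmoid_cal = fit_sigmoid(scores, labels)
isotonic_cal = fit_isotonic(scores, labels)

raw_brier = brier_score(labels, scores)
sigmoid_brier = brier_score(labels, [sigmoid_cal(s) for s in scores])
isotonic_brier = brier_score(labels, [isotonic_cal(s) for s in scores])

print("Brier score, uncalibrated:", round(raw_brier, 4))
print("Brier score, sigmoid:     ", round(sigmoid_brier, 4))
print("Brier score, isotonic:    ", round(isotonic_brier, 4))
```

A lower Brier score indicates probabilities closer to the observed outcomes. On its own training data the isotonic fit cannot score worse than the raw scores, since it is the least-squares monotone fit; this guarantee does not carry over to unseen data, which is consistent with the abstract's finding that calibration helped some classifiers and hurt others.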