Early risk factor prediction in chronic kidney disease diagnosis incorporating feature selection and machine learning algorithms
Prima, Chowdhury Nazia Enam (2024)
Master's Programme in Computing Sciences
Informaatioteknologian ja viestinnän tiedekunta - Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Date of approval
2024-10-25
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202410249455
Abstract
Chronic kidney disease (CKD) is a long-lasting kidney illness in which kidney function declines over time. Predicting early risk factors for this illness is extremely challenging, yet accurate detection of risk factors is a crucial step in its diagnosis. This research addresses the issue and shows how different supervised and ensemble machine learning classifiers can support the effective identification of risk factors and the treatment of CKD.
A CKD-focused dataset of 1032 patient records and 14 features is used for this research. Noisy data can cause inaccuracies and miscalculations, to which medical diagnosis is especially sensitive, so the data are first prepared by handling missing values and outliers. Feature scaling is then applied so that all features are on a uniform scale and the impact of outliers is mitigated.
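The abstract does not name the specific preprocessing techniques, so the following is only a minimal sketch of one plausible pipeline for a single numeric feature. Median imputation, IQR-based clipping, and robust (median/IQR) scaling are all assumptions chosen for illustration, and the function name `preprocess_column` and the sample values are hypothetical.

```python
from statistics import median, quantiles


def preprocess_column(values):
    """Impute, clip, and scale one numeric feature (None marks a missing value)."""
    observed = [v for v in values if v is not None]
    med = median(observed)

    # Step 1 (assumed technique): replace missing entries with the median.
    filled = [med if v is None else v for v in values]

    # Step 2 (assumed technique): clip outliers to the 1.5 * IQR fences.
    q1, _, q3 = quantiles(observed, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    clipped = [min(max(v, low), high) for v in filled]

    # Step 3 (assumed technique): robust scaling, centred on the median
    # and divided by the IQR so that extreme values have limited influence.
    scale = iqr if iqr > 0 else 1.0
    return [(v - med) / scale for v in clipped]


# Made-up hemoglobin readings with one missing value and one outlier.
hemoglobin = [15.4, 11.3, None, 9.6, 12.1, 40.0, 13.0]
scaled = preprocess_column(hemoglobin)
```

In this sketch the missing reading is mapped to the centre of the scaled feature, and the extreme reading is pulled back to the upper fence before scaling, so it no longer dominates the range.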
This thesis emphasizes identifying the risk factors of CKD using two feature selection procedures: feature importance (for tree-based models) combined with a sequential feature selector, and the ReliefF algorithm. Based on the rankings from both procedures, hemoglobin is the most significant predictor and specific gravity the least important. To enhance the predictive diagnosis of CKD, eight classifiers are used: Random Forest, Support Vector Machine, Naïve Bayes, Decision Tree, Logistic Regression, Gradient Boosting, K-Nearest Neighbors, and a Voting ensemble classifier. The classifiers are trained using stratified 5-fold cross-validation with grid search, and their performance is assessed using evaluation metrics. The classifiers performed very well in distinguishing class 0 (not CKD) from class 1 (CKD), the latter indicating individuals at elevated risk of developing this renal disease. Using the selected features, all classifiers proved effective in CKD prediction, achieving high accuracy, F1 score, precision, recall, and AUC.
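To illustrate how stratified 5-fold cross-validation and grid search fit together, here is a small self-contained sketch. It is not the thesis's implementation: the hand-written one-feature K-Nearest Neighbors classifier, the candidate grid for `k`, the helper names, and the synthetic hemoglobin-like values are all assumptions made for the example.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds=5):
    """Split sample indices into folds that keep class proportions similar."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        # Deal each class's samples round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k nearest training points (single feature)."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    votes = sum(train_y[i] for i in order[:k])
    return 1 if 2 * votes > k else 0


def cross_val_accuracy(x, y, k, n_folds=5):
    """Mean accuracy of k-NN over the stratified folds."""
    scores = []
    for held_out in stratified_folds(y, n_folds):
        held = set(held_out)
        train_idx = [i for i in range(len(x)) if i not in held]
        train_x = [x[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(knn_predict(train_x, train_y, x[i], k) == y[i]
                      for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / n_folds


def grid_search(x, y, k_grid, n_folds=5):
    """Score every candidate k and return the best one with all scores."""
    results = {k: cross_val_accuracy(x, y, k, n_folds) for k in k_grid}
    best_k = max(results, key=results.get)
    return best_k, results


# Synthetic data: class 1 (CKD) has lower values than class 0 (not CKD).
x = [8.0 + 0.3 * i for i in range(10)] + [13.0 + 0.3 * i for i in range(10)]
y = [1] * 10 + [0] * 10
folds = stratified_folds(y)
best_k, results = grid_search(x, y, k_grid=[1, 3, 5])
```

Stratification matters here because every fold receives the same share of CKD and non-CKD samples, so each candidate setting is scored on held-out data that reflects the overall class balance.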