Automated Risk Prediction from Health Data
Tommola, Janne (2022)
Master's Programme in Information Technology
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of approval
2022-12-14
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202212018787
Abstract
This work investigates predicting diagnoses for patients based on their electronic health records (EHRs). It compares traditional risk calculators, which are based on scientific research, with newer machine learning methods built on neural networks. The material used is the US-based MIMIC-III patient database, and part of the work is determining its suitability for risk prediction. MIMIC-III is an intensive care database of 46,520 patients that is available for research use under certain conditions. One goal of the work is to prepare for predicting diagnoses from the data in the Finnish Kanta patient database.
The research questions are 1) whether it is feasible to apply conventional risk models to EHRs, and 2) how machine learning methods compare to the conventional models. Part of the work is determining the availability and validity of the input variables that the risk calculators need, and making use of the less structured text of the patient report. That text is fed to a BERT-based neural network model that processes natural language to determine whether the patient smokes.
One finding of the study is that most traditional risk calculators cannot be used for the majority of patients because their EHRs lack necessary data, such as the family history of various cardiovascular diseases. The neural networks used in the work need only the patient’s ICD-9 diagnostic code history and basic demographic information as inputs, so they can be used for any patient.
The work compares the predictive power of different methods for two diagnoses: heart failure and stroke. The risk calculators had poor predictive power, but neural networks were able to predict a future diagnosis of heart failure to some extent. None of the methods tested succeeded in predicting the diagnosis of stroke, which is rarer than heart failure.
Finally, the work considers how suitable the methods are for health care use, such as population screening with the methods as they are. With further improvements, the methods could possibly also be used at the individual level to warn of risks or to assist the doctor. On the other hand, regulations and ethical challenges may limit the use of neural networks because, for example, they are harder to interpret and to predict and liability for their errors is unclear. Lastly, the work considers areas that require further development, such as the requirements for EHRs and potential changes or additions that may improve the accuracy of the models.