BioBERT for Dietary Compounds and Cancer Relation Extraction
Wongwarawipatr, Weerin (2022)
Master's Programme in Computing Sciences
Informaatioteknologian ja viestinnän tiedekunta - Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Acceptance date
2022-01-21
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202111148394
Abstract
Relation extraction from biomedical publications is an essential natural language processing task for constructing biomedical networks. Such a network allows genes, diseases, and food compounds to be studied together holistically, which helps solve many biomedical problems. Biomedical papers, whose number grows exponentially over time, are the primary resource for network construction, so natural language processing is an integral part of building this network from biomedical text. A fully connected biomedical knowledge graph that links many sub-fields will enable better management and analysis of text data in the biomedical domain.
While there is considerable research on relation extraction for genes and diseases from natural language text, few works address diseases and food compounds. This research aims to build natural language relation extraction models for dietary compounds and cancer.
This research uses transfer learning to acquire pre-trained models and fine-tune them on the prepared biomedical text dataset. The state-of-the-art natural language understanding model, BERT, is explored along with two BERT-based variants, DistilBERT and BioBERT. The fine-tuned BioBERT model is expected to give the best result, since it is a specialized model pre-trained on biomedical text documents. In addition, to benchmark model performance effectively, we include two traditional machine learning classifiers, Support Vector Machine and Gaussian Naive Bayes, as baselines for comparison with the proven state-of-the-art language models. The text data is acquired from the PubMed search engine, and the articles are abstracts of biomedical papers about cancer and food compounds. The food and cancer entities are annotated manually.
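As a minimal sketch of the fine-tuning approach described above (not the thesis code), the snippet below shows how a BioBERT checkpoint can be fine-tuned for binary relation classification with the Hugging Face Transformers library. The checkpoint name, the entity-marker convention in the example sentence, the label scheme, and the hyperparameters are all illustrative assumptions rather than the exact setup used in this work.

```python
# Sketch: fine-tuning BioBERT for food-compound / cancer relation classification.
# Assumptions: public BioBERT checkpoint, binary labels (1 = relation, 0 = none),
# and <e1>/<e2> entity markers; the thesis may use a different annotation scheme.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "dmis-lab/biobert-base-cased-v1.1"  # assumed checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Hypothetical training example with marked entities.
texts = ["<e1>Curcumin</e1> inhibits proliferation of <e2>colorectal cancer</e2> cells."]
labels = [1]

encodings = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")

class RelationDataset(torch.utils.data.Dataset):
    """Wraps tokenized sentences and relation labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: v[idx] for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

train_dataset = RelationDataset(encodings, labels)

training_args = TrainingArguments(
    output_dir="biobert-relation",      # where checkpoints are written
    num_train_epochs=3,                 # illustrative hyperparameters
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```

The same loop applies to the BERT and DistilBERT baselines by swapping the checkpoint name; evaluation with an F1-score metric would then compare the fine-tuned models against the SVM and Gaussian Naive Bayes baselines.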
The results show that BioBERT has the best performance with an F1-score of 0.7855, which is 0.0011 higher than the original BERT model. DistilBERT also achieves acceptable performance with an F1-score of 0.7338: its score is only about 0.05 lower than that of the original BERT model, while fine-tuning is roughly 50% faster thanks to its smaller model size. We have also constructed learning curves to visualize each model's learning process. All BERT-based models could be improved with more data in future research.