Automating the Discipline Analysis with Latent Dirichlet Allocation: A Case Study on 30 Core Journals of Library and Information Science Published in 2015
Ylikruuvi, Kaisa (2023)
Master's Programme in Computational Big Data Analytics
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Acceptance date
2023-06-12
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202306086623
Abstract
Discipline analysis is an interesting and important research area, especially in interdisciplinary and multidisciplinary fields of science such as library and information science (LIS). It helps to identify current trends, the evolution of research topics, and the main methodologies employed within a field of study. In this thesis, discipline analysis is conducted by building a topic model of LIS articles. The latent Dirichlet allocation (LDA) algorithm is applied to a set of LIS articles that has previously been classified intellectually by LIS researchers. The thesis aims to compare the LDA model with the results of the intellectual content analysis, with previous LDA models of LIS, and with the co-citation analysis model of the same data set.
The data consist of 1,440 articles and conference papers published in 30 core journals of LIS in 2015. The selection of journals, and the decision to use only titles, abstracts, and keywords in the analysis, are the same as in the intellectual content analysis. Most of the data could be fetched via the Scopus API; the rest were downloaded from ProQuest or collected manually from the journals' homepages. The data preprocessing phase included the correction of errors caused by optical character recognition and XML encoding; the removal of platform-specific metadata, numbers, stopwords, and extra whitespace; and lemmatization. The data were analysed in R, using the topicmodels package to perform latent Dirichlet allocation. The quality assessment values of perplexity and topic coherence were calculated with functions from the topicmodels and topicdoc packages, respectively. The final LDA model consists of 14 topics: Impact Indicators, Education in LIS Studies and Education as LIS Service, Academic Libraries, Information Retrieval, Computation-Assisted Analysis (analysis method), Scientific Collaboration, Public Libraries, Interactive Information Retrieval, Knowledge and Patent Management, Bibliometrics (analysis method), Open Access, Information History, Social Media, and User Behaviour in Digital Environment.
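As an illustration of two of the steps named above, the following is a minimal sketch in plain Python. It is not the thesis's R code. It shows a simplified cleaning step (removal of numbers, stopwords, and extra whitespace) and the textbook definition of perplexity for a fitted topic model. The function names, the tiny stopword list, and the toy distributions are assumptions made for this example. Lemmatization and OCR/XML error correction are omitted, and the perplexity reported by the topicmodels package may be computed differently in detail.

```python
import math
import re

# Hypothetical, deliberately tiny stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "of", "and", "in", "a", "to", "for", "is", "on"}


def preprocess(text):
    """Lowercase, remove numbers and punctuation, collapse whitespace, drop stopwords.

    Simplification: only ASCII letters and hyphens are kept, and no
    lemmatization is performed.
    """
    text = text.lower()
    text = re.sub(r"\d+", " ", text)        # remove numbers
    text = re.sub(r"[^a-z\s-]", " ", text)  # remove punctuation, keep hyphens
    # split() with no arguments also collapses runs of whitespace
    tokens = [t.strip("-") for t in text.split()]
    return [t for t in tokens if t and t not in STOPWORDS]


def perplexity(docs, theta, phi):
    """Textbook perplexity: exp(-(total log-likelihood) / (number of tokens)).

    docs  : list of token lists
    theta : theta[d][k] = P(topic k | document d)
    phi   : phi[k][w]   = P(word w | topic k), one dict per topic
    """
    log_lik = 0.0
    n_tokens = 0
    for d, tokens in enumerate(docs):
        for w in tokens:
            # Marginal word probability: mix the topic-word distributions
            # by the document's topic proportions.
            p = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_lik += math.log(p)
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)


# Toy usage with made-up values (not thesis data).
tokens = preprocess("The 30 core   journals of LIS in 2015")
uniform = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
toy_perplexity = perplexity([["a", "b", "c", "d"]], [[0.5, 0.5]], [uniform, uniform])
```

Lower perplexity indicates that the model assigns higher probability to the observed words. In the toy case every word has probability 0.25, so the perplexity equals the vocabulary size of 4.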
The LDA model is of good quality and succeeds in describing the different aspects of LIS well. The model compares well to the content analysis, which was conducted using the same data set, and to previous topic models of LIS. The LDA model outperforms the result of the co-citation analysis, which was performed on the same data set and which selects labels for its clusters automatically from the titles in the data. LDA topic modelling is a suitable method for pursuing discipline analysis. Further development is still recommended to automate the process further, by developing a comprehensive preprocessing framework and especially by implementing high-quality automatic topic labelling for various platforms.