Unsupervised Machine Learning for Event Categorization in Business Intelligence
Valtonen, Laura (2019)
Industrial Engineering and Management
Faculty of Engineering and Natural Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Date of approval
2019-05-28
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tty-201905311818
Abstract
The data and information available for business intelligence purposes are increasing rapidly. Data quality and quantity are important for making correct business decisions, but the volume of data is becoming difficult to process. Machine learning methods are becoming increasingly powerful tools for dealing with this volume. One such machine learning approach is the automatic annotation and location of business-intelligence-relevant actions and events in news data.
While studying the literature of this field, however, it became clear that there is little standardization and objectivity regarding the categories into which these events and actions are sorted. The categorization was often done in a subjective, arduous manner. The goal of this thesis is to provide information and recommendations on how to create more objective, less time-consuming initial categorizations of actions and events by studying common unsupervised learning methods for this task.
The literature and theory needed to understand the research and methodology are studied. The context and evolution of business intelligence up to the present are considered, especially its relationship to today's big data problem. This in turn relates to the fields of machine learning, artificial intelligence, and especially natural language processing. The relevant methods of these fields are covered to explain the steps taken to achieve the goal of this thesis. All approaches aided in understanding the behaviour of unsupervised learning methods and how it should be taken into account in categorization creation.
Different natural language preprocessing steps are combined with different text vectorization methods. Specifically, three text tokenization methods (plain, N-gram, and chunk tokenization) are tested with two popular vectorization methods: bag-of-words and term frequency-inverse document frequency (TF-IDF). Two types of unsupervised methods are tested on these vectorizations: clustering, a more traditional data subcategorization process, and topic modelling, a fuzzy, probability-based method for the same task. For both learning methods, three different algorithms are studied in terms of the interpretability and categorization value of the top representative terms of their clusters or topics. The top-term representations are also compared with the true contents of these topics or clusters via content analysis.
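The tokenization and vectorization steps named above can be illustrated with a minimal, standard-library-only sketch. It is not the thesis's implementation: the function names, the example headlines, and the plain `log(N / df)` form of the inverse document frequency are assumptions made for illustration, and chunk tokenization is omitted because it requires a part-of-speech tagger.

```python
import math
from collections import Counter

def plain_tokens(text):
    """Plain tokenization: lowercase words with surrounding punctuation removed."""
    stripped = (word.strip(".,;:!?\"'()").lower() for word in text.split())
    return [w for w in stripped if w]

def ngram_tokens(text, n=2):
    """N-gram tokenization: every run of n consecutive plain tokens."""
    words = plain_tokens(text)
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def bag_of_words(token_lists):
    """Return the sorted vocabulary and one term-count row per document."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    index = {t: i for i, t in enumerate(vocab)}
    rows = []
    for tokens in token_lists:
        row = [0] * len(vocab)
        for term, count in Counter(tokens).items():
            row[index[term]] = count
        rows.append(row)
    return vocab, rows

def tf_idf(rows):
    """Weight raw counts by log(N / df), one common TF-IDF variant (assumed here)."""
    n_docs = len(rows)
    n_terms = len(rows[0]) if rows else 0
    # Every vocabulary term occurs in at least one document, so df >= 1.
    df = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in rows]

# Hypothetical news headlines, invented for this example.
docs = [
    "Acme acquires Beta Corp",
    "Beta Corp launches new product",
    "Acme launches new factory",
]
vocab, counts = bag_of_words([plain_tokens(d) for d in docs])
weights = tf_idf(counts)
```

In this sketch a term that appears in only one headline, such as "acquires", receives a higher TF-IDF weight than one shared by two headlines, such as "acme", whereas bag-of-words counts both as 1.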
Of the studied tokenization methods, plain and chunk tokenization yielded the results most comprehensible to a human reader. The choice of vectorization made no major difference to top-term interpretability or to the congruence between contents and top terms. Of the methods studied, K-means clustering and Latent Dirichlet Allocation were deemed the most useful for creating event and action categorizations. K-means clustering created a good basis for an initial categorization framework, with top terms congruent with the contents of the clusters, and Latent Dirichlet Allocation found latent topics in the text documents that provided serendipitous, fruitful insights for a category creator to take into account.
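To make the idea of cluster top terms concrete, the following is a rough sketch of K-means on small term-count vectors, with the top terms of each cluster read from its centroid. It is an illustration under stated assumptions, not the thesis's code: the vocabulary, the vectors, the helper names, and the seeded random initialization are all hypothetical.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def k_means(vectors, k, iterations=20, seed=0):
    """Basic K-means: assign each vector to its nearest centroid, then re-average."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean_vector(members)
    return labels, centroids

def top_terms(centroid, vocab, n=3):
    """Terms with the largest centroid weights; ties are broken alphabetically."""
    ranked = sorted(range(len(vocab)), key=lambda j: (-centroid[j], vocab[j]))
    return [vocab[j] for j in ranked[:n]]

# Hypothetical term-count vectors for four short documents.
vocab = ["acquire", "merger", "launch", "product"]
vectors = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
]
labels, centroids = k_means(vectors, k=2)
for cluster, centroid in enumerate(centroids):
    print(cluster, top_terms(centroid, vocab, n=2))
```

Reading the highest-weighted centroid terms is one simple way to obtain a label candidate for each cluster; checking those terms against the documents actually assigned to the cluster corresponds to the congruence comparison described above.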