Multi-label classification for industry-specific text document sets
Rajala, Juhana (2021)
Konetekniikan DI-ohjelma - Master's Programme in Mechanical Engineering
Tekniikan ja luonnontieteiden tiedekunta - Faculty of Engineering and Natural Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Date of approval
2021-08-24
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202107266342
Abstract
Multi-label classification is a generalization of multi-class classification, in which each instance (a single piece of data) belongs to exactly one class. In multi-label problems, the mapping between the class space and the instances is less restricted, since one instance may be assigned an arbitrary number of classes. This is especially common in the categorization of movies, which are often influenced by multiple genres, yet the same idea applies to other types of media, too. In this work the media type is text documents, specifically documents that are created and stored by companies in various specialized areas of industry. Within the scope of this study, the class space of each company’s document set is very different from the others, and the document sets are relatively small.
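The difference between the two problem types can be illustrated with a minimal Python sketch. The label names, the movie identifiers and the helper `to_indicator` are hypothetical and chosen only for illustration; they are not taken from the thesis.

```python
# Hypothetical label space; not taken from the thesis.
LABELS = ["action", "comedy", "drama"]

# Multi-class: every instance maps to exactly one class.
multi_class = {"movie_a": "action", "movie_b": "drama"}

# Multi-label: every instance maps to a set of classes of varying size.
multi_label = {
    "movie_a": {"action", "comedy"},
    "movie_b": {"drama"},
    "movie_c": {"action", "comedy", "drama"},
}


def to_indicator(label_set, labels=LABELS):
    """Encode a label set as a 0/1 vector with one position per class."""
    return [1 if label in label_set else 0 for label in labels]


# One indicator row per instance, a common input format for multi-label models.
indicator_rows = {name: to_indicator(s) for name, s in multi_label.items()}
```

In this encoding a multi-class data set is the special case in which every row contains exactly one 1.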
To benefit from machine learning, a suitable model and training method must be found that are robust to scarce training data yet still easy to deploy and use. To address this problem, this study comprises a literature review of the common theories and practices in the domain of multi-label problems, an empirical study of an algorithmic multi-label classifier, and a performance analysis of the implemented classifier.
As a result of this study, a NetClass-based multi-label classifier suitable for relatively small text document data sets was implemented. This classifier performs effectively with pure multi-label data sets but loses some of its effectiveness with low-cardinality data sets. In addition, owing to the algorithmic nature of the classifier, the training process was found to be reversible, which is a rare feature in machine learning applications. Although this opportunity was touched on in this thesis, a more detailed examination was left for future work.
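The abstract does not define cardinality. Assuming it refers to label cardinality as commonly used in the multi-label literature (the mean number of labels per instance), the measure can be sketched as follows. The function names and the toy data sets are hypothetical and do not come from the thesis.

```python
def label_cardinality(label_sets):
    """Mean number of labels per instance (assumed standard definition)."""
    if not label_sets:
        raise ValueError("label_sets must not be empty")
    return sum(len(s) for s in label_sets) / len(label_sets)


def label_density(label_sets, num_labels):
    """Label cardinality normalized by the size of the label space."""
    return label_cardinality(label_sets) / num_labels


# Toy data: most instances carry a single label (close to multi-class).
mostly_single = [{"a"}, {"b"}, {"a"}, {"a", "c"}]

# Toy data: every instance carries several labels.
richly_labelled = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}, {"a", "c", "d"}]

low = label_cardinality(mostly_single)      # 5 labels over 4 instances
high = label_cardinality(richly_labelled)   # 10 labels over 4 instances
```

Under this assumption, a data set whose cardinality is close to 1 behaves almost like a multi-class problem, which is one plausible reading of "low cardinality" here.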