Enhancing Outlier Detection with Semantic Feature Extraction
Salmimaa, Linnea (2024)
Master's Programme in Information Technology
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of approval
2024-10-31
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:tuni-202410319743
Abstract
This thesis explores whether outlier detection accuracy can be meaningfully improved by using semantic feature extraction instead of statistical feature extraction. The experiments are run on several different types of data sets. Both statistical and semantic feature vectors are generated using multiple techniques, such as Term Frequency-Inverse Document Frequency (TF-IDF) and the Sentence-BERT network. These vectors are then used for outlier detection, and the results are compared. Isolation Forest and Local Outlier Factor are used as the outlier detection methods.
The results show that semantic feature extraction enables more accurate outlier detection. However, the outlier detector must be compatible with the selected feature extraction method. Extracting features from text usually produces very high-dimensional data, which limits the options when choosing a suitable outlier detector.
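The statistical half of the pipeline described in the abstract can be illustrated with a small sketch. It is not the thesis's implementation: the function names (`tfidf_vectors`, `cosine_distance`, `lof_scores`), the toy documents, the whitespace tokenisation, the smoothed IDF formula, the cosine distance and the choice of `k=2` are all assumptions made for illustration. The Local Outlier Factor is simplified to use exactly `k` neighbours, ignoring distance ties. Sentence-BERT embeddings and Isolation Forest are not shown.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return L2-normalised TF-IDF vectors (term -> weight dicts)."""
    tokenised = [doc.lower().split() for doc in docs]  # naive tokeniser (assumption)
    n_docs = len(tokenised)
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Smoothed IDF: ln((1 + N) / (1 + df)) + 1 (an assumed variant).
        vec = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def cosine_distance(a, b):
    """Cosine distance between two unit-length sparse vectors."""
    return 1.0 - sum(w * b.get(term, 0.0) for term, w in a.items())


def lof_scores(vectors, k=2):
    """Simplified Local Outlier Factor; scores well above 1 suggest outliers."""
    n = len(vectors)
    dist = [[cosine_distance(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]
    neighbours, k_dist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(order[:k])           # exactly k neighbours, ties ignored
        k_dist.append(dist[i][order[k - 1]])   # distance to the k-th neighbour

    def local_reachability_density(i):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbours[i]]
        return len(reach) / max(sum(reach), 1e-12)

    lrd = [local_reachability_density(i) for i in range(n)]
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]


# Toy data (hypothetical): four similar log-like sentences and one unrelated one.
docs = [
    "the server returned an error code",
    "the server returned a timeout error",
    "the server returned an error message",
    "the server returned a connection error",
    "fresh strawberries taste sweet in summer",
]
scores = lof_scores(tfidf_vectors(docs), k=2)
most_anomalous = max(range(len(docs)), key=lambda i: scores[i])
print(docs[most_anomalous])
```

On this toy data the unrelated sentence shares no terms with the others, so it receives the highest score. A purely term-based representation would treat a paraphrase with different vocabulary the same way, which is the kind of case where the thesis compares semantic feature vectors against statistical ones.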