Algorithms for Clustering High-Dimensional Data
Kampman, Ilari (2017)
Faculty of Natural Sciences (Teknis-luonnontieteellinen tiedekunta)
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of approval
2017-12-07
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tty-201711222201
Abstract
In the context of artificial intelligence, clustering refers to a machine learning method in which a data set is grouped into subsets based on similarities in the features of the data samples. Since the concept of clustering is broad and complex, clustering algorithms are often designed to solve specific problems. In particular, overcoming the challenges posed by high-dimensional data requires specialized techniques. The existing algorithms for clustering high-dimensional data often depend too heavily on prior information about the data, which is why they are not necessarily suitable for autonomous applications of artificial intelligence.
This thesis introduces two new algorithms for clustering high-dimensional shapes: CHUNX and CRUSHES. The clusters of both algorithms are based on a hierarchical tree structure which is formed using matrix diagonalization done in Principal Component Analysis (PCA). Due to its expressiveness, the constructed tree structure can be used efficiently for detecting noise clusters and outliers from the data. In addition, the clustering parameter setup in CHUNX and CRUSHES is more flexible than in other PCA-based clustering algorithms.
In the empirical part of this thesis, CHUNX and CRUSHES are compared with the k-means, DBSCAN and ORCLUS clustering algorithms. The algorithms are evaluated by their usability in situations where there is no prior information about the structure of the data to be clustered. The data used in the empirical evaluation consists of energy curve shapes from the Finnish electrical grid. According to the evaluation, CHUNX and CRUSHES produce more detailed results than the other evaluated algorithms and are more suitable for situations where there is no prior information about the data.
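
The abstract does not specify how CHUNX and CRUSHES build their trees, so the following is only a generic sketch of the underlying idea: diagonalize the covariance matrix of the data, as PCA does, and split the samples recursively along the leading principal axis to obtain a hierarchical tree. It is not the CHUNX or CRUSHES algorithm. The function names (principal_axes, build_tree, leaves) and the parameters min_size and max_depth are hypothetical and chosen for illustration.

```python
import numpy as np


def principal_axes(X):
    """Diagonalize the sample covariance of X (rows are samples).

    Returns eigenvalues in descending order and the matching
    eigenvectors as columns.
    """
    centered = X - X.mean(axis=0)
    cov = centered.T @ centered / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]


def build_tree(X, indices=None, min_size=5, depth=0, max_depth=3):
    """Recursively split samples by the sign of their projection
    onto the first principal axis of the current subset.

    Illustrative assumption: a node is left unsplit when it is too
    small, too deep, or when a split would create a tiny child.
    """
    if indices is None:
        indices = np.arange(len(X))
    node = {"indices": indices, "children": []}
    if len(indices) < 2 * min_size or depth >= max_depth:
        return node

    subset = X[indices]
    _, eigvecs = principal_axes(subset)
    scores = (subset - subset.mean(axis=0)) @ eigvecs[:, 0]
    left, right = indices[scores <= 0], indices[scores > 0]
    if len(left) < min_size or len(right) < min_size:
        return node

    node["children"] = [
        build_tree(X, left, min_size, depth + 1, max_depth),
        build_tree(X, right, min_size, depth + 1, max_depth),
    ]
    return node


def leaves(node):
    """Collect the index arrays stored in the leaf nodes of the tree."""
    if not node["children"]:
        return [node["indices"]]
    return [leaf for child in node["children"] for leaf in leaves(child)]


if __name__ == "__main__":
    # Synthetic data: two well-separated groups in 10 dimensions.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(20, 1, (50, 10))])
    tree = build_tree(X, max_depth=1)
    for leaf in leaves(tree):
        print(len(leaf), "samples in leaf")
```

In this sketch the leaves of the tree play the role of clusters. Very small leaves could be treated as candidate outliers or noise, which is one plausible reading of how a tree structure helps with noise detection, though the thesis itself may do this differently.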