Finite mixture models in comparison to k-means clustering in both simulated and real world data
Luoma, Juho (2019)
Degree Programme in Mathematics and Statistics
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Date of approval
2019-06-03
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:tuni-201907182694
Abstract
Finite mixture models are finite-dimensional generalizations of probabilistic models, which express the existence of groups or sub-populations that form the sample. In this thesis, multivariate normal mixtures are examined and compared to k-means clustering in different experimental situations. The comparison is carried out with simulations and with a real-world repeated-measurements data set. A special extension of k-means clustering, k-means for longitudinal data (KmL), is used for the longitudinal data set. The goal of these experiments is to investigate whether there is evidence that one method is better than the other in some respect.
Simulations were conducted to test the performance of the methods when increasing the number of outliers, the average overlap between the clusters, the number of dimensions, and the number of observations. The data used in this thesis were collected as part of the iLiNS project, which studied the effects of nutrient supplements on children’s growth and mothers’ health in rural areas of Malawi. A total of 1391 Malawian mothers were enrolled in the study, and the data consist of measurements of their children, who were measured seven times from birth up to 30 months after birth.
In simulations, mixture models performed at least as well as k-means clustering in terms of correctly clustered individual data points, although they required a non-random initialization of the algorithm. The parameter estimates from the mixture models were also at least as close to the true cluster centers as the estimates from k-means. In the real data, the participants were divided into clusters based on weight, using all the time points except the last one to form the clusters. The last measurement point was used to determine the growth status of the child at 30 months. The dependency between the cluster identity of a participant and the growth status at the last time point was tested with the chi-square test. Both approaches yielded clusters in which the cluster membership of a participant was significantly related to growth status at 30 months, although the optimal number of clusters differed between the methods.
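The chi-square test of independence between cluster membership and growth status can be illustrated with a minimal sketch. The counts below are hypothetical and are not taken from the thesis data. The function name `chi_square_independence`, the three clusters, and the two growth-status categories are assumptions made only for illustration.

```python
def chi_square_independence(table):
    """Return (statistic, degrees_of_freedom) for a contingency table.

    `table` is a list of rows; each row holds the observed counts for one
    cluster across the growth-status categories.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of cluster and status.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical counts (NOT from the thesis): rows are clusters,
# columns are two assumed growth-status categories at 30 months.
counts = [
    [30, 10],
    [20, 20],
    [10, 30],
]

statistic, dof = chi_square_independence(counts)

# Critical value of the chi-square distribution at the 0.05 level, 2 d.f.
CRITICAL_05_DOF2 = 5.991
reject_independence = dof == 2 and statistic > CRITICAL_05_DOF2
print(statistic, dof, reject_independence)
```

A statistic above the critical value would suggest, for these made-up counts, that cluster membership and growth status are not independent. In practice a statistics library would usually be used to obtain the p-value directly.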