Improvements to k-means clustering
Ranganathan, Sindhuja (2013)
Master's Degree Programme in Information Technology
Faculty of Computing and Electrical Engineering
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of acceptance
2013-12-04
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tty-201312171495
Abstract
Working with huge amounts of data and learning from them by extracting useful information is one of the prime challenges of the Internet era. Machine learning algorithms provide an automatic way to accomplish such tasks. These algorithms are classified into supervised, unsupervised, and semi-supervised algorithms. Some of the most widely used algorithms belong to the class of unsupervised learning, because obtaining labeled training data is a challenge in many practical applications.
Machine learning algorithms for which the classes of the input are unknown are called unsupervised algorithms. The k-means algorithm is one of the simplest and most widely used algorithms for the unsupervised learning procedure of clustering. The k-means algorithm works by grouping similar data points based on a similarity measure; the k in k-means denotes the number of such groups. This study starts from the standard k-means algorithm and goes through some of the algorithmic improvements suggested in the machine learning and data mining literature. The traditional k-means algorithm is an iterative refinement algorithm with an assignment step and an update step: the distance from each data point to each cluster centroid is calculated, points are assigned to their nearest centroid, and the centroids are iteratively refined. The computational complexity of the k-means algorithm arises mainly from these distance calculations, the so-called nearest-neighbor queries.
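To make the two steps concrete, the following is a minimal NumPy sketch of the standard (Lloyd's) k-means iteration. It is an illustration written for this summary, not code from the thesis; the function name and parameters are chosen freely.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Standard k-means: alternate an assignment step and an update step."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct points drawn from the data.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: the nearest-neighbor query that dominates the
        # cost, one distance per (point, centroid) pair.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned
        # points, keeping the old centroid if a cluster becomes empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Converged once the centroids stop moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Even when the algorithm converges in few iterations, each iteration performs a distance computation for every point-centroid pair, which is what motivates the accelerated variants discussed next.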
Certain formulations of the k-means problem are NP-hard. For a dataset of n points in d dimensions clustered into k groups, the computational complexity of a single iteration is O(nkd). Some k-means clustering runs take many iterations and still fail to produce an acceptable clustering error. Several authors have studied different ways of reducing the complexity of the k-means algorithm. In particular, this study focuses on the algorithmic improvements suggested in the related works of Hamerly and Elkan. These improvements were chosen because they are generic in nature and can be applied to many available datasets. The improved algorithms due to Elkan and Hamerly are applied to various available datasets and their computational performance is presented.
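The common idea behind both improvements is to use the triangle inequality to skip distance calculations that provably cannot change an assignment: if d(c, c') >= 2 d(x, c) for the currently best centroid c, then d(x, c') >= d(x, c), so c' can be ignored without computing d(x, c'). The sketch below shows this pruning test inside a single assignment step. It is a simplified illustration of the principle, not the full Elkan or Hamerly algorithm, which additionally maintain per-point upper and lower bounds across iterations.

```python
import numpy as np

def assign_with_pruning(X, centroids):
    """Assignment step with triangle-inequality pruning: if
    d(best, j) >= 2 * d(x, best), centroid j cannot be closer than best."""
    # Centroid-to-centroid distances, computed once per iteration.
    cc = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    labels = np.empty(len(X), dtype=int)
    for i, x in enumerate(X):
        best = 0
        best_dist = np.linalg.norm(x - centroids[0])
        for j in range(1, len(centroids)):
            # Pruning test: skip the exact distance computation when the
            # triangle inequality already rules centroid j out.
            if cc[best, j] >= 2.0 * best_dist:
                continue
            d = np.linalg.norm(x - centroids[j])
            if d < best_dist:
                best, best_dist = j, d
        labels[i] = best
    return labels
```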