Discretization in Subgroup Discovery
Simonen, Niina (2016)
Degree Programme in Information Technology
Faculty of Computing and Electrical Engineering
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of approval
2016-06-08
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tty-201605254107
Abstract
Subgroup discovery is a data mining technique for discovering interesting subgroups in a selected population. It seeks interesting relationships between objects in a set with respect to a specific property. The discovered patterns are called subgroups and are represented as rules. Discretization is a technique for replacing numerical attributes with nominal ones, which makes it possible to use them with algorithms that do not support numerical attributes.
In this thesis, two datasets are discretized for use in subgroup discovery. Four discretization methods and three bin counts were applied. The datasets are the heart disease and Australian credit approval datasets from the UCI Machine Learning Repository. Subgroup discovery produced eleven subgroup sets, eight from the heart disease dataset and three from the Australian credit approval dataset. We observed that the number of bins greatly affects the results. In addition, binary discretization yields subgroup sets with a high share of subgroups with discretized attributes. The importance of expert guidance is also emphasized.
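As an illustration of what discretization with a chosen number of bins means, the following is a minimal sketch of equal-width binning. The abstract does not name the four methods used in the thesis, so equal-width binning is an assumption chosen for simplicity, and the function name `equal_width_bins`, the label format, and the sample values are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Replace numeric values with nominal labels using equal-width bins.

    Illustrative only: equal-width binning is an assumed method, not
    necessarily one of the methods used in the thesis.
    """
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values are identical, so everything falls into one bin.
        return ["bin0"] * len(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        # The maximum value would land one past the last bin; clamp it.
        idx = min(int((v - lo) / width), n_bins - 1)
        labels.append(f"bin{idx}")
    return labels


# Hypothetical numeric attribute values (not taken from either dataset).
ages = [29, 35, 41, 52, 60, 77]
three_bins = equal_width_bins(ages, 3)
two_bins = equal_width_bins(ages, 2)
```

With the same input, changing `n_bins` from 3 to 2 moves the value 52 from the middle bin into the lowest one. This shows how the bin count alone can change which nominal values a rule-based algorithm sees.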