Choosing Compressor for Computing Normalized Compression Distance
Hahne, Lauri (2013)
Hahne, Lauri
2013
Signaalinkäsittelyn ja tietoliikennetekniikan koulutusohjelma
Tieto- ja sähkötekniikan tiedekunta - Faculty of Computing and Electrical Engineering
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Hyväksymispäivämäärä
2013-12-04
Julkaisun pysyvä osoite on
https://urn.fi/URN:NBN:fi:tty-201312181497
https://urn.fi/URN:NBN:fi:tty-201312181497
Tiivistelmä
Modern measurement technologies in sciences such as biology produce huge amounts of measurement data. A major problem with this is that the data need to be classified so that it becomes possible to tell, for example, cancerous cells from regular cells. However, there is seldom enough information about the processes producing the data that is getting measured, so traditional feature based classification algorithms might prove useless.
Normalized information distance is a universal metric than can be used to theoretically cluster all kinds of data based on their algorithmic similarity without knowing the interesting features beforehand. However normalized information distance depends on algorithmic Kolmogorov complexity which cannot be computed. Normalized compression distance is a construct based on normalized information distance. Normalized compression distance substitutes the uncomputable Kolmogorov complexity for compressed file size.
Theoretically any good enough data compressor can be used to compute normalized compression distance. In practice different compressors are known to produce dissimilar results. The purpose of this work is to construct a framework for evaluating the performance of different compressors when computing the normalized compression distance.
The evaluation framework that is presented in this work is able to uncover differences in performance of different compressors given different input lengths and input types. Four different compressors were evaluated using the framework and the results show how dif- ferent features of the compressors affect their performance. In addition, it became possible to give recommendations about which compressors to use for different kinds of inputs.
Normalized information distance is a universal metric than can be used to theoretically cluster all kinds of data based on their algorithmic similarity without knowing the interesting features beforehand. However normalized information distance depends on algorithmic Kolmogorov complexity which cannot be computed. Normalized compression distance is a construct based on normalized information distance. Normalized compression distance substitutes the uncomputable Kolmogorov complexity for compressed file size.
Theoretically any good enough data compressor can be used to compute normalized compression distance. In practice different compressors are known to produce dissimilar results. The purpose of this work is to construct a framework for evaluating the performance of different compressors when computing the normalized compression distance.
The evaluation framework that is presented in this work is able to uncover differences in performance of different compressors given different input lengths and input types. Four different compressors were evaluated using the framework and the results show how dif- ferent features of the compressors affect their performance. In addition, it became possible to give recommendations about which compressors to use for different kinds of inputs.