Managing Missing Data in Data Integration
Jokipii, Mervi (2023)
Master's Programme in Computer Science
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of approval
2023-05-28
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202305155792
Abstract
The amount of data in the world is growing at an enormous pace, driven in particular by the expansion of the internet. Data is stored in different formats in various source systems. The goal of data integration is to give users unified access to heterogeneous, independent data without requiring them to understand the logic of the source systems. Users submit queries against a mediated schema, which translates them into queries on the source systems. Integrated data is rarely complete: it may contain incorrect or entirely missing values. Missing data can be managed and enriched using various methods.
The literature review of this thesis explores data integration and its challenges, as well as missing data mechanisms and strategies for dealing with missing data. The experimental section analyses these strategies in the context of online car dealerships. Cars are increasingly purchased directly online, or at least with the internet as a strong support in the purchasing process. Incomplete car data can cause a car not to appear in potential buyers' search results, and may even result in the car not being sold.
The results show that finding a similar car in a dataset is crucial in managing missing car data, and that doing so is not always straightforward. String matching is an essential part of finding a similar car, but it does not always give a perfectly accurate result. For this reason, the work presents a model for managing missing car data in which string matching is used only when necessary. According to the model, string matching can also be strengthened by comparing other values belonging to the same attribute group. External sources, such as pre-existing commercial databases or a company's self-built database, should also be used, if needed, to find a similar car.
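The idea of using string matching only when necessary can be illustrated with a minimal sketch. It is not the model presented in the thesis: the field names, the reference data, the 0.8 threshold and the use of Python's difflib as the string matcher are all assumptions made for illustration. An exact match on key attributes is tried first; only if that fails is fuzzy matching on the version text used, and candidates that conflict on other known values are discarded before scoring.

```python
from difflib import SequenceMatcher

# Hypothetical reference data; all names and values are invented for illustration.
REFERENCE = [
    {"make": "Acme", "model": "Roadster", "version": "Acme Roadster 2.0 Diesel",
     "fuel": "diesel", "body": "coupe"},
    {"make": "Acme", "model": "Roadster", "version": "Acme Roadster 2.0 Petrol",
     "fuel": "petrol", "body": "coupe"},
    {"make": "Acme", "model": "Hauler", "version": "Acme Hauler 3.0 Diesel",
     "fuel": "diesel", "body": "van"},
]


def similarity(a, b):
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_similar_car(car, reference,
                     key_fields=("make", "model", "version"),
                     name_field="version",
                     group_fields=("make", "fuel"),
                     threshold=0.8):
    """Find a reference car similar to `car`, or return None."""
    # Step 1: exact match on the key fields, so no string matching is needed.
    for ref in reference:
        if all(car.get(f) is not None and car.get(f) == ref.get(f)
               for f in key_fields):
            return ref

    # Step 2: fuzzy string matching, used only because step 1 failed.
    best, best_score = None, threshold
    for ref in reference:
        # Strengthen the match: skip candidates that disagree with any
        # known value in the (assumed) group of related attributes.
        if any(car.get(f) is not None and car.get(f) != ref.get(f)
               for f in group_fields):
            continue
        score = similarity(car.get(name_field) or "", ref.get(name_field) or "")
        if score >= best_score:
            best, best_score = ref, score
    return best


def fill_missing(car, similar):
    """Copy values from `similar` into fields that are missing in `car`."""
    filled = dict(car)
    for field, value in similar.items():
        if filled.get(field) is None:
            filled[field] = value
    return filled


if __name__ == "__main__":
    incomplete = {"make": "Acme", "model": None,
                  "version": "acme roadster 2,0 diesl",
                  "fuel": "diesel", "body": None}
    match = find_similar_car(incomplete, REFERENCE)
    if match is not None:
        print(fill_missing(incomplete, match))
```

In this sketch the misspelled version text still resolves to the diesel Roadster, because the petrol variant is ruled out by the known fuel value before any string comparison takes place. An external source, as mentioned above, would simply be another list passed as `reference`.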