Data Vault 2.0 Automation Solutions for Commercial Use
Jenni, Laukkanen (2020)
Jenni, Laukkanen
2020
Master's Degree Programme in Computational Big Data Analytics, MSc
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of approval
2020-08-28
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202008286755
Abstract
As the volume of data and the need to process and store it have grown, methods for managing and reporting on it have been developed intensively. However, these methods demand considerable skill, time, and manual work.
Efforts have been made to fully automate various areas of data warehousing, such as loading data at the different stages of the warehouse. However, few solutions automate the construction of the data warehouse itself, and learning to use those that do requires a certain amount of expertise and time.
In this research, we discuss different options for automating data warehouse construction. From an organization's point of view, the study identifies three options: purchasing a solution, obtaining one in collaboration with other organizations, or building one. In addition to the market analysis, we create and implement an automated tool for building a Data Vault 2.0 data warehouse. The tool leverages metadata and the relationships defined in the source RDBMS to predict critical components of Data Vault 2.0 data warehousing, most of which are usually defined by experts.
Based on the metadata collected and processed, the classification algorithm correctly classified an average of 85.89% of all given observations, but only 55.11% of the business keys. In other words, the algorithm classified observations that were not business keys more accurately than the business keys themselves. However, the correctness of the classification mainly affects what the automation tool that builds Data Vault 2.0 inserts into the target tables of the data model, rather than which kinds of tables are created and which source tables they are built from. The model generated by the tool corresponded well to the target model implemented at the beginning of the study. As for hubs and satellites, apart from a couple of missing hubs and the content of some hubs, both caused by shortcomings in the classification of business keys, the model could have been used as an enterprise data warehouse. Links differed more from the original target, but in testing the link variations produced by the tool worked well either way.
The tool created and implemented in this research still has many shortcomings and areas for development, but these have been taken into account in its logic and structure. The tool can be implemented with only a small amount of financial capital, but doing so requires considerable experience and expertise in the subject.
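The gap between the two classification figures reported in the abstract (85.89% overall, 55.11% for business keys) is typical when the class of interest is rare. The following minimal sketch shows how the two measures can diverge. It assumes that the second figure is the share of true business keys that were also labelled as business keys (that is, recall for that class); the function names and the toy data are hypothetical and are not taken from the thesis.

```python
from typing import Sequence


def overall_accuracy(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Share of all observations whose predicted label matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one observation is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def business_key_recall(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Share of true business keys that were also predicted as business keys."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Keep only the predictions made for observations that really are business keys.
    predictions_for_keys = [p for t, p in zip(y_true, y_pred) if t]
    if not predictions_for_keys:
        raise ValueError("y_true contains no business keys")
    return sum(predictions_for_keys) / len(predictions_for_keys)


# Toy data invented for illustration (NOT results from the study):
# ten source columns, of which the first three are true business keys.
y_true = [True, True, True, False, False, False, False, False, False, False]
y_pred = [True, False, False, False, False, False, False, False, False, True]

print(f"overall accuracy:    {overall_accuracy(y_true, y_pred):.2%}")
print(f"business key recall: {business_key_recall(y_true, y_pred):.2%}")
```

In this toy case seven of the ten columns are labelled correctly, yet only one of the three business keys is found. Because non-keys dominate the data, overall accuracy stays high even when most business keys are missed, which is why the per-class figure is the more informative one for judging hub content.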