Technical debt dataset cleansing
Islam, Sazzad Ul (2020)
Master's Programme in Information Technology
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Date of approval
2020-11-25
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202011208111
Abstract
Technical debt is a common topic among developers and researchers in the software industry. The concept is used to describe and identify poor-quality code and weak architecture that result from shortcuts and that will be costly to refactor in the future. Several tools analyze code against a set of rules and return technical debt data. This data helps developers find code violations and fix them at an early stage, avoiding later refactoring complications. Researchers also use such technical debt datasets in their work. Not all companies provide public access to the technical debt data of their projects, but some already analyzed technical debt datasets have been made available through research work. The Technical Debt Dataset is one of them: it contains technical debt data for 33 Java projects of the Apache Software Foundation, analyzed using different tools and methods, and it is available both as CSV files and as an SQLite database. The dataset contains hundreds of thousands of records, but it has been found that they are not properly formatted. It includes technical debt data from different branches, and the analysis data of those branches is not consistent. As a result, the dataset contains missing data as well as non-matching data.
In this thesis, ‘The Technical Debt Dataset Cleansing’, we describe how the technical debt data was fetched from SonarQube, why the redundant data had to be cleaned, how we analyzed the data, and how we removed unnecessary data from the dataset. To do so, we fetch the available technical debt data of 33 Java projects of the Apache Software Foundation from SonarQube and save it into CSV files. Next, we check the data of each individual project and look for missing information and non-matching data by comparing the related files of that project. We fill in the missing data and try to find matches for the non-matching data in the project's dataset. We also check whether any duplicate data is present and remove it. The result is a perfectly matched, non-redundant dataset for those 33 projects. Finally, we make the cleaned dataset available in CSV format and as an SQLite database, which lets others run queries and get results conveniently. The Technical Debt Dataset Cleansing aims to provide researchers with a non-duplicated, non-redundant, perfectly matched dataset of technical debt data. After the cleansing, the new dataset contains more than 67k clean Sonar analysis records, more than 1M clean Sonar issue records, and more than 66k clean Sonar measure records.
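The cleansing steps described in the abstract (removing duplicates, detecting non-matching records between related files, and loading the result into SQLite) can be illustrated with a minimal sketch. It is not the thesis implementation: the column names (`project`, `analysis_key`, `issue_key`), the table name `sonar_analysis`, and the helper functions are hypothetical, and the inline CSV strings stand in for the per-project CSV files.

```python
import csv
import io
import sqlite3


def clean_rows(rows, key_fields):
    """Drop rows whose key fields repeat an earlier row, keeping the first."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


def find_unmatched(child_rows, parent_rows, field):
    """Return child rows whose `field` value does not occur in parent_rows."""
    known = {row[field] for row in parent_rows}
    return [row for row in child_rows if row[field] not in known]


def load_into_sqlite(conn, table, rows, columns):
    """Create the table if needed and insert the rows (untyped columns)."""
    cols = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
    conn.executemany(
        f'INSERT INTO "{table}" ({cols}) VALUES ({placeholders})',
        [tuple(row[c] for c in columns) for row in rows],
    )
    conn.commit()


# Hypothetical sample data standing in for two related CSV files of one project.
ANALYSES_CSV = """project,analysis_key,date
demo,A1,2020-01-01
demo,A2,2020-01-02
demo,A2,2020-01-02
"""
ISSUES_CSV = """project,analysis_key,issue_key
demo,A1,I1
demo,A3,I2
"""

analyses = clean_rows(
    list(csv.DictReader(io.StringIO(ANALYSES_CSV))),
    ["project", "analysis_key"],
)
issues = list(csv.DictReader(io.StringIO(ISSUES_CSV)))

# Issues that refer to an analysis missing from the analysis file.
unmatched = find_unmatched(issues, analyses, "analysis_key")

conn = sqlite3.connect(":memory:")
load_into_sqlite(conn, "sonar_analysis", analyses, ["project", "analysis_key", "date"])
count = conn.execute("SELECT COUNT(*) FROM sonar_analysis").fetchone()[0]
print(count, [row["issue_key"] for row in unmatched])
```

In this sketch, duplicates are identified by a chosen set of key columns rather than by whole-row equality, so the key must be picked per file; how the thesis defines a duplicate or resolves a non-matching record is not stated in the abstract.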