Measuring the impact of Sonarqube on the development velocity using regression analysis
Robredo Manero, Mikel (2023)
Master's Programme in Computing Sciences
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Date of approval
2023-05-28
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202305226030
Abstract
The study of development velocity has gained importance in software engineering research in recent decades. Software development projects, like many other fields, are interested in analysing the impact that specific factors have on development velocity, since it is a useful metric for measuring the productivity of teams working on different types of tasks. One of these factors is SonarQube, one of the code analysis tools most widely used by software developers.
This thesis aims to analyse the impact of SonarQube as a factor affecting the variance of development velocity in software development projects. In addition, based on expert knowledge from the field, a set of confounding variables believed to affect development velocity is included in the analysis. An additional goal of the thesis is therefore to determine which relationship between the considered variables and development velocity best describes its variance. Regression analysis was chosen as the method, with the statistical software R as the computational tool. The collected data covered 337 mature software development projects in the Apache Software Foundation, obtained through a cohort study design.
The analysis follows a complete regression analysis process, first examining the shape of the data and drawing initial distributional assumptions. Accordingly, it considers Linear Models as well as Generalized Linear Models under those assumptions. After backward variable selection in models under different distributional assumptions, the results showed low statistical significance for the exposure of projects to SonarQube. Moreover, all the observed models showed low predictive power for development velocity, and hence a low ability to describe its variance. Additionally, ensemble learning was used, and the results behaved in the same way under an agglomerating approach.
Likewise, model selection showed a better fit for models assuming highly skewed distributions. These results suggest that further work could examine non-parametric methods that account for the observed skewness in the distribution of development velocity. Furthermore, the obtained results do not show that the use of SonarQube is significant for describing development velocity, a finding that differs from what is expected in the software development field. This suggests exploring alternative data collection designs that may capture the connection between SonarQube and development velocity more accurately. These could include periodic measurements of the velocity level, as well as a different variable structure when performing regression analysis, among many other options.