Impact of Semi-Synthetic Multimodal Data on Audio-Visual Segmentation
Tötterström, Sophie (2026)
Master's Programme in Computing Sciences and Electrical Engineering
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Date of approval
2026-03-26
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:tuni-202603263518
Abstract
Segmentation is a traditional computer vision problem in which objects are separated from the background by annotating every pixel. In binary segmentation, each pixel is labeled as either background or object. In semantic segmentation, each object pixel additionally receives a class label.
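The difference between the binary and semantic segmentation described above can be illustrated with a minimal sketch. The tiny mask, the class ids (0 = background, 1 = dog, 2 = guitar) and the helper names `to_binary` and `class_pixel_counts` are hypothetical and chosen only for illustration; they are not taken from the thesis or from the S-AVS dataset.

```python
# Hypothetical 3x4 per-pixel annotation; class ids are illustrative only.
# 0 = background, 1 = dog, 2 = guitar
semantic_mask = [
    [0, 1, 1, 0],
    [0, 1, 2, 2],
    [0, 0, 2, 2],
]


def to_binary(mask):
    """Collapse a semantic mask into a binary one: any object class becomes 1."""
    return [[1 if label != 0 else 0 for label in row] for row in mask]


def class_pixel_counts(mask):
    """Count how many pixels carry each label."""
    counts = {}
    for row in mask:
        for label in row:
            counts[label] = counts.get(label, 0) + 1
    return counts


binary_mask = to_binary(semantic_mask)
print(binary_mask)
print(class_pixel_counts(semantic_mask))
print(class_pixel_counts(binary_mask))
```

The binary mask keeps only the object/background distinction, while the semantic mask also records which class each object pixel belongs to.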
Audio-visual segmentation (AVS) is a new field of machine learning research in which the goal is to segment the sound sources. This multimodal problem can be divided into three tasks: segmenting single sound sources, segmenting multiple sound sources, and semantic segmentation, where objects are also assigned class labels. Audio-visual segmentation requires video and audio data as well as ground-truth segmentation annotations. Gathering such data manually is resource-heavy.
This thesis investigates whether a new and diverse semi-synthetic multimodal dataset can be compiled from existing methods and data, and whether such data can improve the performance of an existing AVS model. The dataset was generated with a custom data generation pipeline that uses foundation models and state-of-the-art generative models.
This work presents a data generation pipeline for multimodal data and introduces a new multimodal audio-visual segmentation dataset, S-AVS. The dataset was used to pre-train an AVS model, which was subsequently fine-tuned on a traditional AVS dataset. The impact of the pre-training was evaluated against the baseline model. The experiments show that S-AVS pre-training improves model performance, demonstrating that partially synthetic multimodal data can be used to improve existing AVS models.
The research revealed many potential directions for future AVS research. The data generation pipeline can be further developed and refined to produce higher-quality data. As AVS is a relatively new field, the data, benchmarks and evaluation metrics it relies on ought to be studied further. Such work would also reveal more about the performance and potential over-fitting of current state-of-the-art (SOTA) AVS methods.
