Temporally Aligned Audio for Video with Autoregression
Viertola, Ilpo; Iashin, Vladimir; Rahtu, Esa (2025)
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202510099745
Description
Peer reviewed
Abstract
We introduce V-AURA, the first autoregressive model to achieve high temporal alignment and relevance in video-to-audio generation. V-AURA uses a high-framerate visual feature extractor and a cross-modal audio-visual feature fusion strategy to capture fine-grained visual motion events and ensure precise temporal alignment. Additionally, we propose VisualSound, a benchmark dataset with high audio-visual relevance. VisualSound is based on VGGSound, a video dataset consisting of in-the-wild samples extracted from YouTube. During curation, we remove samples where auditory events are not aligned with the visual ones. V-AURA outperforms current state-of-the-art models in temporal alignment and semantic relevance while maintaining comparable audio quality. Code, samples, VisualSound and models are available at v-aura.notion.site.
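To make the idea of temporally aligned cross-modal fusion concrete, the sketch below shows one simple way to combine high-framerate visual features with audio features at the audio token rate: each audio step is paired with its nearest visual frame before concatenation. This is a hypothetical illustration only; the function name, feature shapes, and nearest-frame strategy are assumptions, not the paper's actual fusion mechanism.

```python
import numpy as np

def fuse_features(visual, audio):
    """Hypothetical cross-modal fusion sketch (not V-AURA's actual method).

    visual: (T_v, D_v) per-frame visual features at a high frame rate
    audio:  (T_a, D_a) audio features at the audio token rate
    Returns a (T_a, D_v + D_a) fused sequence at the audio rate.
    """
    t_a = audio.shape[0]
    # Nearest visual-frame index for each audio step, so fine-grained
    # motion cues stay time-aligned with the audio tokens.
    idx = np.minimum(np.arange(t_a) * visual.shape[0] // t_a,
                     visual.shape[0] - 1)
    # Concatenate the aligned visual feature onto each audio feature.
    return np.concatenate([visual[idx], audio], axis=1)

# Example: 30 visual frames (512-d) fused with 120 audio steps (256-d)
fused = fuse_features(np.zeros((30, 512)), np.zeros((120, 256)))
```

In an autoregressive setup such a fused sequence would condition the next-token prediction, which is why keeping the visual index computation monotone in time matters for alignment.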
Collections
- TUNICRIS-julkaisut [22206]
