Multimodal audio dataset creation with crowdsourcing
Lipping, Samuel (2019)
Information Technology
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of approval
2019-05-23
Permanent address of the publication
https://urn.fi/URN:NBN:fi:tty-201905061515
Abstract
Creating large multimodal datasets for machine learning tasks can be difficult. Annotating large amounts of data is costly and time-consuming when done by recruiting and hiring participants directly. This thesis outlines a method for gathering multimodal annotations with the crowdsourcing platform Amazon Mechanical Turk (AMT). Specifically, the method in this thesis is designed for annotating audio files with five captions each, together with subjective scores for description accuracy and fluency for each caption. The durations of the audio files used in this thesis are uniformly distributed between 15 and 30 seconds. The method divides the whole annotation task into three separate subtasks: audio description, description editing, and description scoring. The editing and scoring tasks were introduced to catch and correct errors from the preceding tasks.
The inputs for the audio description task are the audio files that are to be annotated. The inputs for the description editing task are the descriptions from the audio description task, and the inputs for the description scoring task are the descriptions from both previous tasks. Each audio file is described five times, each description is edited once, and each set of descriptions is scored three times. At the end of the process there are ten descriptions for each audio file, with three accuracy scores and three fluency scores for each description. The scores are used to rank the descriptions, and the top five descriptions are used as the final captions for each file. This thesis creates an audio captioning dataset of 5,000 audio files using this method.
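The final selection step described above can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: the abstract does not specify how the accuracy and fluency scores are combined, so summing the two mean scores is an assumption, and the function and variable names are hypothetical.

```python
from statistics import mean

def select_top_captions(descriptions, k=5):
    """Rank candidate captions for one audio file and keep the top k.

    `descriptions` is a list of (caption, accuracy_scores, fluency_scores)
    tuples, where each score list holds the three crowdworker ratings.
    Combining the scores by summing the two means is an assumption;
    the thesis abstract does not state the exact combination rule.
    """
    ranked = sorted(
        descriptions,
        key=lambda d: mean(d[1]) + mean(d[2]),
        reverse=True,
    )
    # Keep only the caption text of the k highest-ranked descriptions.
    return [caption for caption, _, _ in ranked[:k]]

# Example: ten candidate descriptions for one audio file, each with
# three accuracy and three fluency ratings (hypothetical values).
candidates = [(f"caption {i}", [i, i, i], [i, i, i]) for i in range(10)]
final_captions = select_top_captions(candidates)
```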
Collections
- Bachelor's theses [8324]