Systematic Literature Review of Deep Visual and Audio Captioning
Nag, Indrani (2021)
Master's Programme in Information Technology
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of approval
2021-10-06
Permanent address of the publication:
https://urn.fi/URN:NBN:fi:tuni-202108266803
Abstract
Deep learning has become a very prevalent field in recent years, and new applications emerge every day. Artificial intelligence performs well in many fields, and in some areas even better than humans. Deep visual and audio captioning, which refers to generating a human-readable textual description for a given image or audio clip, has become a prominent research area within artificial intelligence. Deep captioning plays a critical role in many real-life applications, which increases the technical demand for deep learning methods in this field. Because information on this topic expands so rapidly, it is difficult to keep track of the state of the art in deep captioning. This study presents a systematic literature review of deep visual and audio captioning, offering a comprehensive overview of the improvements in this field over the last seven years.
Caption generation requires understanding the significant objects in an image or audio clip, their features or attributes, and the relationships between them. Deep learning-based methods make it possible to produce syntactically and semantically accurate sentences. The goal of this study is to describe the different existing methods and models, their advantages and shortcomings, popular datasets, evaluation metrics, the growth of the captioning field, technical difficulties, and future directions, based on six well-structured research questions.
After collecting articles from the major digital libraries, 170 articles were finally selected for the review through several stages of analysis and synthesis. Encoder-decoder architectures and attention mechanisms are the most widely used methods for captioning. CNN-based networks are used as the encoder, and RNNs were initially employed as the decoder; later, LSTMs replaced plain RNNs and improved captioning accuracy. In addition, the MSCOCO dataset and the BLEU evaluation metric are popular and are utilized in most of the studies. The review also found that reinforcement learning-based methods can be a good option for future research. Overall, this paper serves as a guideline for researchers and compares the results of the published papers, supporting the development of more powerful approaches that achieve better and more accurate results.
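To illustrate the encoder-decoder pattern that the review identifies (a CNN encoder paired with an LSTM decoder), the following is a minimal PyTorch sketch. It is not taken from any of the reviewed papers; the class name, layer sizes, and ResNet backbone are assumptions chosen only for illustration.

import torch
import torch.nn as nn
from torchvision import models


class CaptioningModel(nn.Module):
    """Minimal CNN-encoder / LSTM-decoder captioner (illustrative sketch only)."""

    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        # Encoder: a ResNet backbone whose classification head is replaced
        # by a linear projection into the word-embedding space.
        cnn = models.resnet50(weights=None)  # pretrained weights would normally be loaded
        cnn.fc = nn.Linear(cnn.fc.in_features, embed_dim)
        self.encoder = cnn
        # Decoder: word embeddings fed to an LSTM, projected to vocabulary logits.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # Teacher forcing: the image feature is prepended to the caption embeddings,
        # and the LSTM predicts the next word at every position.
        features = self.encoder(images).unsqueeze(1)        # (B, 1, embed_dim)
        embeddings = self.embed(captions)                   # (B, T, embed_dim)
        inputs = torch.cat([features, embeddings], dim=1)   # (B, T+1, embed_dim)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                             # (B, T+1, vocab_size)

Attention-based and reinforcement learning-based variants discussed in the review extend this basic scheme, for example by attending over spatial CNN feature maps instead of a single pooled image vector, or by optimizing evaluation metrics such as CIDEr directly with policy-gradient training.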