Event Classification from Video Sequence Data
Venkatakrishnan, Arjun Venkat (2019)
Degree Programme in Information Technology
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Date of approval: 2019-10-08
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-201909203436
Abstract
Identifying a specific event in a video is a relatively easy task for humans, but it is challenging for a machine. Deep neural networks, trained on huge datasets with adequate processing power, have achieved commendable results in image classification tasks. A natural extension is to build on these approaches, with additional processing steps, to classify events in videos. In this thesis, three neural network based models are implemented for the task: a Convolutional Neural Network (CNN), a combination of a CNN with a Long Short-Term Memory network (CNN-LSTM), and a three-dimensional CNN (3D-CNN). Multiple experiments are performed to find the most suitable model architecture. The CNN and 3D-CNN architectures are trained from scratch. In the case of CNN-LSTM, a pre-trained CNN model is used for extracting features, while the LSTM part is trained from scratch.
The models are trained to identify specific events in two video clip datasets. The first is the publicly available UCF101 dataset, which consists of video clips of humans performing 101 different actions, four of which are examined here. The second dataset consists of video clips in which humans walk through a doorway; the events of interest are a person's entry and exit. This dataset was collected on Tampere University premises. The tested architectures are compared by estimating their classification accuracy, and additional factors such as resource utilization and memory consumption are also examined.
Of the three architectures, CNN-LSTM performs best, producing the highest accuracy on both datasets: an average of 98.14 % on the UCF101 dataset and 98.9 % for detecting entries and exits. Additional experiments are performed to determine the minimum amount of data required to reach a given accuracy; using 50 % of the total data, CNN-LSTM was still able to achieve 98.8 % accuracy.
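The CNN-LSTM idea described above (a pre-trained CNN maps each frame to a feature vector, and an LSTM consumes the frame-feature sequence to classify the event) can be illustrated with a minimal sketch. This is not the thesis implementation: the CNN is stood in for by a fixed random projection, the LSTM is an untrained numpy cell, and all sizes (frame shape, feature and hidden dimensions) are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

FRAME_SHAPE = (16, 16)   # toy grayscale frames (hypothetical size)
FEAT_DIM = 32            # per-frame feature size
HIDDEN = 16              # LSTM hidden size
NUM_CLASSES = 4          # e.g. the four UCF101 actions examined

# Fixed random projection standing in for the pre-trained CNN extractor
W_proj = rng.normal(size=(FRAME_SHAPE[0] * FRAME_SHAPE[1], FEAT_DIM)) * 0.1
# LSTM parameters: input, recurrent weights and bias with the
# input/forget/output/candidate gates stacked along the first axis
W = rng.normal(size=(4 * HIDDEN, FEAT_DIM)) * 0.1
U = rng.normal(size=(4 * HIDDEN, HIDDEN)) * 0.1
b = np.zeros(4 * HIDDEN)
# Output classifier on the final hidden state
W_out = rng.normal(size=(NUM_CLASSES, HIDDEN)) * 0.1
b_out = np.zeros(NUM_CLASSES)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def extract_features(frame):
    # Placeholder for the pre-trained CNN: one linear projection + tanh
    return np.tanh(frame.flatten() @ W_proj)

def lstm_step(x, h, c):
    # One LSTM time step over a single frame-feature vector x
    z = W @ x + U @ h + b
    i = sigmoid(z[:HIDDEN])                  # input gate
    f = sigmoid(z[HIDDEN:2 * HIDDEN])        # forget gate
    o = sigmoid(z[2 * HIDDEN:3 * HIDDEN])    # output gate
    g = np.tanh(z[3 * HIDDEN:])              # candidate state
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def classify_clip(frames):
    # Run the LSTM over the per-frame features and softmax the final state
    h = np.zeros(HIDDEN)
    c = np.zeros(HIDDEN)
    for frame in frames:
        h, c = lstm_step(extract_features(frame), h, c)
    logits = W_out @ h + b_out
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

clip = rng.normal(size=(8, *FRAME_SHAPE))  # dummy 8-frame clip
probs = classify_clip(clip)
```

In the thesis setup, only the recurrent part (here `lstm_step` and the output layer) would be trained from scratch, while `extract_features` would be replaced by a frozen pre-trained CNN.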