Video event classification using 3D convolutional neural networks
Teivas, Iikka Tapio (2017)
Information Technology
Faculty of Computing and Electrical Engineering
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Acceptance date
2017-05-03
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tty-201704061251
Abstract
The objective of this thesis is to study the capabilities of 3D convolutional neural networks (CNNs) in human action classification. By performing 3D convolution operations over a stack of adjacent video frames, motion can be captured in the resulting features. These are called spatio-temporal features, since they take both the spatial frame data and the motion between frames into account. For example, in a golf video the model recognizes the typical swinging motion in addition to the player.
Here, we present a 3D CNN model for action recognition and test it with four different 10-class datasets consisting of sports and daily human interactions. We compare this fairly new method with an image-based 2D CNN method, in which only spatial frame data is considered. In addition, we analyze and visualize the 3D CNN features and discuss their strengths and weaknesses.
The achieved video classification accuracies were lower than expected, and we learned that neural network architecture design and scaling is difficult and time consuming. Still, the relative results were comparable to previous studies. On average, the 3D CNN performs around 10% better than the 2D CNN and offers compact features. The resulting spatio-temporal features work well with strong motions, such as those in sports, but struggle to separate human interactions that involve only small movements, e.g. eating and speaking.
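The key idea in the abstract, convolving along the time axis as well as the spatial axes so that motion ends up in the features, can be illustrated with a minimal sketch. The function name `conv3d` and the dense-list clip representation are illustrative choices, not taken from the thesis; a real model would use an optimized deep learning library rather than nested Python loops.

```python
def conv3d(clip, kernel):
    """Valid-mode 3D cross-correlation over a grayscale video clip.

    clip:   nested lists indexed [frame][row][col]
    kernel: nested lists indexed [t][h][w]

    Because the kernel also slides along the frame (time) axis, the
    response depends on how pixels change between adjacent frames,
    which is what makes the resulting features spatio-temporal.
    """
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    t, h, w = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for i in range(T - t + 1):          # slide over time
        plane = []
        for j in range(H - h + 1):      # slide over rows
            row = []
            for k in range(W - w + 1):  # slide over columns
                row.append(sum(
                    clip[i + a][j + b][k + c] * kernel[a][b][c]
                    for a in range(t)
                    for b in range(h)
                    for c in range(w)))
            plane.append(row)
        out.append(plane)
    return out


# A temporal-difference kernel (-1 then +1 along time) responds only
# to change between consecutive frames, i.e. to motion.
motion_kernel = [[[-1]], [[1]]]

# Three 2x2 frames with increasing brightness (a "moving" scene).
clip = [
    [[0, 0], [0, 0]],
    [[1, 1], [1, 1]],
    [[3, 3], [3, 3]],
]
response = conv3d(clip, motion_kernel)
```

With this kernel, a static clip produces an all-zero response, while the brightening clip above yields frame-to-frame differences of 1 and then 2, a toy version of how 3D convolutions distinguish strong motion from near-static content.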