Visual Scene Classification Using Object Detection Histograms
Kesti, Matias (2024)
Tieto- ja sähkötekniikan kandidaattiohjelma - Bachelor's Programme in Computing and Electrical Engineering
Informaatioteknologian ja viestinnän tiedekunta - Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Acceptance date
2024-10-01
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:tuni-202408208192
Abstract
Visual scene classification is a challenging task in machine learning. It requires a higher-level understanding of the environment and the context of an image or video. It is therefore reasonable to approach the task using higher-level semantic information, such as the objects visible in an image, instead of only the low-level features of the image. In this study, the research topic is scene classification based on the objects detected in a video. Scene classification has been studied with numerous methods, but in the field of computer vision the research has focused on classifying still images. The material used in this study, however, is a set of videos from a public dataset called TAU Urban Audio-Visual Scenes 2021. It contains a total of 12,292 video segments recorded in different urban environments in European cities. The segments are divided into 10 environmental scene classes, including airports, parks, and metro stations.
The method proposed in this study can be broken into two stages. In the first stage, all the video material in the dataset is processed with an object detection algorithm; the implementation used a detector called YOLOv3, which can detect 80 different classes of objects in a video, such as humans, cars, suitcases, and traffic lights. After running the algorithm on a video from the dataset, the number of occurrences of each object class is saved in a histogram. In the second stage, the histograms of detected objects from all the videos in the dataset are used to train a neural network-based classifier. The principle behind this method is that the neural network learns to classify the scenes based on the relative numbers of occurrences and co-occurrences of detected objects in each scene.
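To make the two-stage pipeline concrete, the following is a minimal Python sketch of the approach, not the exact implementation from the thesis: the run_detector callable stands in for a YOLOv3 inference step that returns the detected COCO class indices for a frame, a small scikit-learn MLP is used as a stand-in for the neural network classifier described above, and the histogram normalisation is likewise an assumption.

# Minimal sketch of the histogram-based scene classification pipeline.
# Assumptions (not from the thesis): run_detector() represents a YOLOv3
# inference call yielding detected COCO class indices per frame, and the
# classifier is a small scikit-learn MLP rather than the exact network
# used in the study.
import numpy as np
from sklearn.neural_network import MLPClassifier

NUM_OBJECT_CLASSES = 80   # YOLOv3 trained on COCO detects 80 object classes

def video_to_histogram(frames, run_detector):
    """Count how often each object class is detected across a video."""
    hist = np.zeros(NUM_OBJECT_CLASSES, dtype=np.float32)
    for frame in frames:
        for class_id in run_detector(frame):  # detected class indices in this frame
            hist[class_id] += 1
    # Normalising makes videos of different lengths comparable
    # (an assumption, not stated in the abstract).
    total = hist.sum()
    return hist / total if total > 0 else hist

def train_scene_classifier(histograms, scene_labels):
    """Fit a small feed-forward network on the object-count histograms."""
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
    clf.fit(np.stack(histograms), scene_labels)
    return clf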
The results of this thesis suggest that this histogram-based method is a viable approach to scene classification. There are some limiting factors to consider, though. For example, the accuracy of the method depends on the output of the object detection algorithm. The overall accuracy over all classes in the test dataset was 53.2%, but the accuracies of the individual scene classes varied considerably. The highest accuracy, 90.0%, was attained for videos depicting street traffic. The lowest accuracy, 23.1%, was reached for videos recorded inside trams.
Collections
- Bachelor's theses [8709]