Improving Temporal Consistency for Semantic Segmentation of Videos
Hassankhani, Amirhossein (2023)
Tietojenkäsittelyopin maisteriohjelma - Master's Programme in Computer Science
Informaatioteknologian ja viestinnän tiedekunta - Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Acceptance date
2023-12-20
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-2023121210766
Abstract
Understanding a visual scene is crucial for many applications, such as self-driving cars and robotics. The first step in understanding a natural scene is generating a 3D semantic map that represents the environment's geometry and semantics. To obtain this map, the video sequence from the sensor is passed through a semantic segmentation model, which assigns a label to each pixel and produces a 2D semantic map for each frame. However, when dealing with video sequences, the consistency of the predictions across frames matters in addition to the accuracy of individual frames. For example, two models may have the same per-frame accuracy while one is more temporally consistent than the other. The more consistent model is preferred in 3D mapping applications because it makes the fusion of the per-frame semantic maps easier and more accurate. This work focuses on improving the temporal consistency of semantic segmentation models for videos.
Researchers have proposed several ways to improve the temporal consistency of segmentation models. Many of these methods use optical flow at test or training time because optical flow captures the movement of pixels across time. However, computing optical flow is expensive, and the accuracy of the estimated flow limits the performance of the segmentation model. Other methods use more sophisticated mechanisms to extract relevant information from previous frames. All of these methods address the temporal inconsistency problem by changing the model to incorporate more information about past frames.
A second class of algorithms instead fine-tunes the model after it has been trained on the target dataset. Since fine-tuning works with any model, it can be applied on top of the methods above that incorporate information from past frames, compounding their benefits. This fine-tuning can happen online during test time or offline after training. AuxAdapt is an online adaptation method that improves the model's confidence over time, resulting in more consistent predictions. The algorithm uses one frozen network (MainNet) and one active network (AuxNet), where the latter is fine-tuned during test time. For each incoming frame at test time, the algorithm computes the joint prediction of both networks and trains the active network on that joint prediction, as sketched below. Since the algorithm uses only its own predictions and no ground-truth labels, it is an unsupervised method. In this work, we address some of the limitations of AuxAdapt.
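As a concrete illustration, the following is a minimal PyTorch-style sketch of one AuxAdapt adaptation step as described above. The names (`auxadapt_step`, `main_net`, `aux_net`) are illustrative, and the exact combination rule and loss (summed logits, hard pseudo-labels with cross-entropy) are assumptions based on this description, not the thesis's exact implementation.

```python
import torch
import torch.nn.functional as F

def auxadapt_step(main_net, aux_net, optimizer, frame):
    """One AuxAdapt test-time step on a single video frame.

    main_net stays frozen; aux_net is fine-tuned online using the
    joint prediction as a pseudo-label, so no ground truth is needed.
    """
    with torch.no_grad():
        main_logits = main_net(frame)          # frozen network, no gradients
    aux_logits = aux_net(frame)                # active network, keeps gradients

    # Joint prediction: ensemble the two networks by summing their logits
    # (an assumed combination rule; see lead-in).
    joint_logits = main_logits + aux_logits
    pseudo_label = joint_logits.argmax(dim=1)  # per-pixel hard label, (N, H, W)

    # Self-supervised update: fit the active network to the pseudo-label.
    loss = F.cross_entropy(aux_logits, pseudo_label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return pseudo_label                        # final segmentation for this frame
```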
In AuxAdapt, one of the networks (MainNet) is frozen. This means the frozen network can become a liability for the active network, since the final output is the joint prediction of both networks: the active network may be unable to correct the final prediction if the frozen network's prediction is wrong and never changes. For this reason, we propose updating the MainNet so that it keeps up with the predictions of the AuxNet. We use a momentum update that blends the weights of the AuxNet into the MainNet, making the MainNet an exponential moving average of the AuxNet (see the sketch below). For this approach to work, both networks must have identical architectures. Our qualitative and quantitative experiments on three datasets, Cityscapes, KITTI, and SceneNet RGB-D, show that our algorithm, Momentum Adapt, outperforms AuxAdapt at the same overall computation. Additionally, our method is more stable in harsh conditions.
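The momentum update itself can be sketched as follows, again in PyTorch style; the decay value `momentum=0.999` is a placeholder, not the value used in the thesis.

```python
import torch

@torch.no_grad()
def momentum_update(main_net, aux_net, momentum=0.999):
    """Blend AuxNet weights into MainNet as an exponential moving average.

    main_net and aux_net must share an identical architecture so that
    their parameters pair up one-to-one.
    """
    for p_main, p_aux in zip(main_net.parameters(), aux_net.parameters()):
        # p_main <- momentum * p_main + (1 - momentum) * p_aux
        p_main.mul_(momentum).add_(p_aux, alpha=1.0 - momentum)
```

Called once per frame after the AuxNet update, this makes the MainNet a slowly moving average that follows the AuxNet without tracking its frame-to-frame noise.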
In AuxAdapt, however, the frozen MainNet allows the two networks to have different architectures, so a much smaller active network (AuxNet) can be used for faster predictions; Momentum Adapt gives up this flexibility by requiring identical networks. To recover it, we developed a mixed algorithm called Aux-Momentum Adapt that uses three networks: one large frozen network, one small active network, and one identical small momentum network (sketched below). This again allows networks of different sizes. When the size difference is large, the run-time of Aux-Momentum Adapt is very close to that of AuxAdapt, while its performance and stability are better.
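A sketch of one Aux-Momentum Adapt step under the same assumptions as the previous snippets. How the three outputs are combined is not specified in this abstract, so the sketch assumes all three logits are summed for the joint prediction, with the small momentum network tracking the small active network.

```python
import torch
import torch.nn.functional as F

def aux_momentum_step(frozen_large, active_small, momentum_small,
                      optimizer, frame, momentum=0.999):
    """One Aux-Momentum Adapt step: a large frozen network plus a small
    active network and its small momentum copy (assumed combination)."""
    with torch.no_grad():
        frozen_logits = frozen_large(frame)      # large frozen network
        momentum_logits = momentum_small(frame)  # EMA copy of the active net
    active_logits = active_small(frame)          # small trainable network

    # Assumed joint prediction: sum the logits of all three networks.
    joint_logits = frozen_logits + momentum_logits + active_logits
    pseudo_label = joint_logits.argmax(dim=1)

    # Fine-tune only the small active network on the pseudo-label.
    loss = F.cross_entropy(active_logits, pseudo_label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # The momentum network tracks the active network (identical architecture).
    with torch.no_grad():
        for p_m, p_a in zip(momentum_small.parameters(),
                            active_small.parameters()):
            p_m.mul_(momentum).add_(p_a, alpha=1.0 - momentum)

    return pseudo_label
```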