Enhancing Image Coding for Machines with Compressed Feature Residuals
Seppälä, Joni (2021)
Master's Programme in Information Technology
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display, and print it for your own personal use. Commercial use is prohibited.
Date of approval
2021-09-09
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202107096282
Abstract
Media data, such as images and videos, are produced and consumed in enormous quantities on a daily basis. Their volume has increased by several orders of magnitude over the years, and this trend is expected to continue. However, storing and transmitting media consumes resources that are costly and almost always limited.
Traditionally, video and image data have been produced for human consumption. Conventional image and video codecs have therefore been optimized to save as much storage space as possible through compression while minimizing human-perceived deterioration of the data. As computer-vision-based technologies have become increasingly common over the last decade, a new paradigm has emerged in which data is compressed to be used primarily by machines. Codecs designed for machine consumption commonly compress more aggressively than traditional codecs, saving more storage space. The image quality metric is usually the computer vision task performance, in which case the perceptual quality of the images may be ignored, leading to a decline in human-perceivable quality. Although machines are the primary consumers in many use cases, human involvement can still be convenient or even mandatory. For example, in some expert systems, the machines must provide visualizations that justify the generated results to human observers.
This thesis proposes a novel image coding technique targeted at machines that maintains the capability for human consumption. The proposed codec generates two bitstreams: a human bitstream, produced by a traditional codec and optimized for human consumption, and a machine bitstream, produced by an end-to-end learned neural-network-based codec and optimized for machine tasks. Instead of working in the image domain, the machine bitstream is derived from feature residuals: the difference between the features extracted from the input image and the features extracted from the reconstructed image generated by the traditional codec.
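The feature-residual idea can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in chosen for illustration and is not taken from the thesis: `toy_traditional_codec` replaces the traditional codec with coarse pixel quantization, `toy_feature_extractor` replaces the task network's feature extractor with 2x2 average pooling, and plain uniform quantization replaces the end-to-end learned codec that produces the machine bitstream.

```python
import numpy as np

def toy_traditional_codec(image, step=32):
    """Hypothetical stand-in for a traditional codec: coarse uniform quantization."""
    return np.round(image / step) * step

def toy_feature_extractor(image):
    """Hypothetical stand-in for a feature extractor: 2x2 average pooling."""
    h, w = image.shape
    return image.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def quantize_residual(residual, step=4):
    """Stand-in for the learned residual encoder (assumption: uniform quantizer)."""
    return np.round(residual / step).astype(np.int32)

def dequantize_residual(symbols, step=4):
    """Stand-in for the learned residual decoder."""
    return symbols.astype(np.float64) * step

rng = np.random.default_rng(0)
image = rng.uniform(0, 255, size=(8, 8))

# Encoder side: the human bitstream yields a reconstructed image.
reconstructed = toy_traditional_codec(image)
features_input = toy_feature_extractor(image)
features_recon = toy_feature_extractor(reconstructed)

# The machine bitstream carries the compressed feature residual.
residual = features_input - features_recon
symbols = quantize_residual(residual)

# Decoder side: features of the decoded image plus the decoded residual.
enhanced = toy_feature_extractor(reconstructed) + dequantize_residual(symbols)

err_base = np.abs(features_input - features_recon).max()
err_enhanced = np.abs(features_input - enhanced).max()
print(f"max feature error without residual: {err_base:.3f}")
print(f"max feature error with residual:    {err_enhanced:.3f}")
```

In this toy setup the decoder needs only the reconstructed image and the residual symbols, and the enhanced features are never further from the input features than the features of the reconstructed image alone.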
With the help of the machine bitstream, machine task performance improved significantly in the low bitrate range. The proposed system outperforms the state-of-the-art traditional codec, Versatile Video Coding (VVC/H.266), achieving an average Bjøntegaard delta bitrate of -40.5% for bitrates up to 0.07 bits per pixel (BPP) in the object detection task on the Cityscapes dataset.
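For readers unfamiliar with the metric, a Bjøntegaard delta bitrate is commonly computed by fitting a cubic polynomial to log-bitrate as a function of quality for each codec and averaging the gap between the two fits over the shared quality range. The sketch below follows that common formulation; it is not necessarily the exact procedure used in the thesis, and the rate and quality values are made-up illustrative numbers, not results from the thesis.

```python
import numpy as np

def bd_rate(anchor_rates, anchor_quality, test_rates, test_quality):
    """Average bitrate change (%) of the test codec relative to the anchor
    at equal quality. Negative values mean the test codec needs fewer bits."""
    qa = np.asarray(anchor_quality, dtype=float)
    qt = np.asarray(test_quality, dtype=float)
    log_ra = np.log(np.asarray(anchor_rates, dtype=float))
    log_rt = np.log(np.asarray(test_rates, dtype=float))

    # Cubic fit of log-rate as a function of quality for each codec.
    poly_a = np.polyfit(qa, log_ra, 3)
    poly_t = np.polyfit(qt, log_rt, 3)

    # Integrate both fits over the overlapping quality interval.
    lo = max(qa.min(), qt.min())
    hi = min(qa.max(), qt.max())
    int_a = np.polyint(poly_a)
    int_t = np.polyint(poly_t)
    area_a = np.polyval(int_a, hi) - np.polyval(int_a, lo)
    area_t = np.polyval(int_t, hi) - np.polyval(int_t, lo)

    # Mean log-rate difference, converted back to a percentage.
    avg_log_diff = (area_t - area_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0

# Made-up example: the test codec reaches each quality level at half the bitrate.
quality = [0.10, 0.18, 0.25, 0.30]        # e.g. a task metric such as mAP
anchor_bpp = [0.02, 0.04, 0.06, 0.08]     # bits per pixel, illustrative only
test_bpp = [r / 2 for r in anchor_bpp]

example_bd_rate = bd_rate(anchor_bpp, quality, test_bpp, quality)
print(f"BD-rate: {example_bd_rate:.1f}%")
```

Because the example test codec uses exactly half the anchor bitrate at every quality level, the computed value is -50%, which shows how a negative BD-rate corresponds to a bitrate saving.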