Real-time Single-shot Face Recognition using Machine Learning
Hintikka, Topi (2019)
Information Technology
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Date of approval
2019-05-08
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tty-201904181414
Abstract
Face recognition is one of the earliest applications in the field of computer vision. In general, face recognition refers to recognizing the identity of a person from a facial image by comparing the image to a set of facial images of known identities. Numerous approaches to the face recognition problem based on traditional machine learning methods have been proposed over the years. However, these methods have not been capable of solving the problem in an unconstrained environment. Due to the increase in available data and computational power over the past 5 years, convolutional neural network (CNN) based deep learning approaches have become the state-of-the-art solution to the face recognition problem. While some of these methods can even surpass humans in recognition accuracy, the computational complexity of the deep learning models used has also increased significantly, and a graphics processing unit (GPU) is usually required to run face recognition in real time.
This thesis presents a pipeline for solving the face recognition problem in an unconstrained environment, in real time, from a video stream. The pipeline is divided into 4 steps, each of which solves one smaller, more specific problem. For each step, multiple approaches are tested to find an optimal solution to the problem solved in that step. Finally, a solution is proposed for each step of the pipeline, and the solutions are combined to demonstrate the complete system. The system is designed to run on embedded NVIDIA Jetson TX2 hardware.
Face detection is the first step in the pipeline, where faces are detected in the video stream and extracted. In the second step, the effect of applying different image alignments to the extracted face crops before the subsequent steps of the pipeline is tested. Alignment is performed by training machine learning models for facial landmark detection and applying an image transformation based on the detections. The third step, feature extraction from the facial images, is solved by training CNNs with different alignments and loss functions to find an optimal solution for extraction. In the last step, an embedding of the extracted features is classified using nearest neighbor search over the set of known embeddings extracted earlier. Different clustering approaches are tested for accelerating the search by finding the nearest neighbor only from a subset of the known embeddings.
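The alignment step described above can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes only two landmarks (the eye centers) and computes a similarity transform (rotation, uniform scale and translation, a special case of an affine transform) that moves them to fixed target positions. The function names and the target coordinates are hypothetical.

```python
def similarity_from_eyes(left_eye, right_eye, target_left, target_right):
    """Return a 2x3 matrix mapping the detected eye centers onto the targets.

    Points are (x, y) tuples. Treating each point as a complex number,
    the transform is z -> a * z + b, where a encodes rotation and scale
    and b encodes translation.
    """
    p1, p2 = complex(*left_eye), complex(*right_eye)
    q1, q2 = complex(*target_left), complex(*target_right)
    if p1 == p2:
        raise ValueError("eye landmarks must be distinct")
    a = (q2 - q1) / (p2 - p1)  # rotation and uniform scale
    b = q1 - a * p1            # translation
    return [[a.real, -a.imag, b.real],
            [a.imag, a.real, b.imag]]


def apply_transform(matrix, point):
    """Apply a 2x3 affine matrix to a single (x, y) point."""
    x, y = point
    return (matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
            matrix[1][0] * x + matrix[1][1] * y + matrix[1][2])


# Hypothetical landmark detections in a face crop and hypothetical
# canonical eye positions in the aligned output image.
M = similarity_from_eyes((30, 40), (70, 60), (38, 52), (74, 52))
aligned_left = apply_transform(M, (30, 40))
aligned_right = apply_transform(M, (70, 60))
```

A 2x3 matrix of this form is what image warping routines such as OpenCV's `cv2.warpAffine` accept. A full affine fit over more landmarks would need a least-squares solve instead of this closed form.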
For face detection, all the tested lightweight CNN object detectors achieved similar accuracy, and any of them could be used. For the extracted facial images, the trained ONet of the Multi-task Cascaded Convolutional Networks (MTCNN) framework achieved superior accuracy in facial landmark detection, and it was used to implement the different alignments for facial images. The choice of alignment and loss function had a significant effect on nearest neighbor search. The highest accuracy was achieved by applying an affine transformation based on the detected facial landmarks before feature extraction performed by a CNN trained with Additive Angular Margin (ArcFace) loss. For accelerating nearest neighbor search, the hierarchical correlation clustering algorithm CHUNX achieved a very accurate approximation of the nearest neighbor with a significantly smaller slope than the exact nearest neighbor search. Finally, the implemented demo application runs in real time on Jetson TX2 hardware at over 10 frames per second (FPS).
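The idea of classifying an embedding by nearest neighbor search restricted to a subset of the known embeddings can be sketched as follows. This is a simplified illustration, not CHUNX and not the thesis code: a flat partition around given centroids stands in for the hierarchical clustering, cosine similarity is assumed as the comparison measure, and the function names, the threshold and the two-dimensional toy embeddings are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def build_clusters(gallery, centroids):
    """Assign each (label, embedding) pair to its most similar centroid."""
    clusters = [[] for _ in centroids]
    for label, emb in gallery:
        best = max(range(len(centroids)),
                   key=lambda i: cosine_similarity(emb, centroids[i]))
        clusters[best].append((label, emb))
    return clusters


def classify(query, centroids, clusters, threshold=0.5):
    """Approximate nearest neighbor: search only the closest cluster.

    Returns the label of the most similar known embedding in that cluster,
    or None if the cluster is empty or the similarity is below the
    (assumed) threshold.
    """
    best = max(range(len(centroids)),
               key=lambda i: cosine_similarity(query, centroids[i]))
    candidates = clusters[best]
    if not candidates:
        return None
    label, emb = max(candidates,
                     key=lambda item: cosine_similarity(query, item[1]))
    return label if cosine_similarity(query, emb) >= threshold else None


# Toy 2-D "embeddings"; real face embeddings have far more dimensions.
gallery = [("alice", [1.0, 0.1]), ("bob", [0.1, 1.0]), ("carol", [0.9, 0.3])]
centroids = [[1.0, 0.0], [0.0, 1.0]]
clusters = build_clusters(gallery, centroids)
```

Because only one cluster is searched, the result is an approximation: a true nearest neighbor that lies just across a cluster boundary can be missed, which is the accuracy versus speed trade-off the thesis evaluates.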