Abstract
Face recognition involves identifying or verifying a person from a digital image or video frame and is still one of the most challenging tasks in computer vision today. The conventional face recognition pipeline consists of face detection, face alignment, feature extraction, and classification. This page further explains three exemplary state-of-the-art architectures: DeepID3 (6), FaceNet (9), and Sparse ConvNet (11).
Introduction
The task of face recognition involves identifying or verifying a person from a digital image or video frame. Computer applications capable of performing this task, known as facial recognition systems, have been around for decades. The general idea of face recognition is to identify facial features by extracting facial landmarks and then to compare them with other images by matching those features.
However, face recognition is still one of the most relevant and challenging research areas in computer vision and pattern recognition due to variations in facial expressions, poses, and illumination. (1)
Overview
The conventional face recognition pipeline consists of four stages: face detection, face alignment (or preprocessing), feature extraction (or face representation) and classification, as illustrated in figure 1.
A milestone in the area of face detection was the contribution by Viola & Jones (2) in 2001, which provided an object detection framework that operated in real time and was well suited to human faces. The remaining multi-view face detection problem was first tackled by Farfade, Saberian, & Li (3) in 2015, who used deep CNNs instead of the cascade-based approach of Viola & Jones. Current state-of-the-art approaches use region-based CNNs to enable faster and more reliable detection. (4)
To simplify the feature extraction stage, proper alignment is crucial. If facial points can be identified correctly, features can be matched in a region around them. Recently, CNN-based architectures have shown success in this area. (5)
The feature extraction part is often considered the most challenging and important of all, since any matching algorithm is limited by the quality of the underlying features.
Notable networks
There is a variety of successful architectures. This section focuses on three different models and explains their idiosyncrasies. Evaluations of face recognition approaches are almost always performed on the Labeled Faces in the Wild (LFW) (12) data set, with face verification accuracy as the most common metric. In the verification task, given a pair of face images, the goal is to determine whether or not they show the same subject.
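The verification metric described above can be sketched as follows. This is a minimal illustration, not the official LFW evaluation protocol: it assumes each image pair has already been reduced to a single distance score, and the function name `verification_accuracy`, the toy distances, and the threshold are all hypothetical.

```python
import numpy as np

def verification_accuracy(distances, same_identity, threshold):
    """Fraction of pairs classified correctly by thresholding their distance.

    A pair is predicted to show the same subject when its distance is at or
    below the threshold (an assumed decision rule for this sketch).
    """
    distances = np.asarray(distances, dtype=float)
    same_identity = np.asarray(same_identity, dtype=bool)
    predicted_same = distances <= threshold
    return float(np.mean(predicted_same == same_identity))

# Hypothetical toy data: distances for six pairs and their ground-truth labels.
toy_distances = [0.3, 0.9, 1.4, 0.7, 1.8, 1.0]
toy_labels = [True, True, False, True, False, False]

accuracy = verification_accuracy(toy_distances, toy_labels, threshold=1.1)
print(f"verification accuracy: {accuracy:.3f}")
```

In this toy example the last pair (distance 1.0, different subjects) falls below the threshold and is misclassified, so five of the six pairs are correct.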
DeepID3
DeepID3 is the third generation of the DeepID architecture, which was one of the first publications to propose learning discriminative deep face representations (DFR) through large-scale face identity classification. The second generation proposed DFR by joint face identification-verification, which finally brought the networks up to human performance.
In this third approach (shown in figure 2), Sun et al. (6) drew on insights from the most successful architectures of the 2014 ImageNet challenge: the inception layers of GoogLeNet (7) and the stacked convolutions of VGG (8). They also added joint identification-verification supervisory signals to multiple layers to further reduce the intra-personal variance of the representation. The publication shows that very deep neural networks achieve state-of-the-art performance on face recognition tasks and slightly outperform their shallower counterparts. A further increase in effectiveness is expected once the architectures are exposed to larger-scale training data.
Figure 2: Layers of DeepID3 network. Source: (6)
FaceNet
The FaceNet publication by Google researchers (9) introduced a novel approach to the field by directly learning a mapping from face images to a compact Euclidean space. The distance between two representation vectors is a direct measure of their similarity, with 0.0 corresponding to two identical pictures and 4.0 marking the opposite end of the spectrum. The representation also reduces each face to only 128 bytes. This general-purpose embedding differs significantly from other approaches, which are trained over a set of known faces and then generalized via an intermediate bottleneck layer. Figure 3 shows example distances for pairs of test images.
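The 0.0 to 4.0 range can be made concrete with a small sketch. It assumes that embeddings are scaled to unit length and compared by squared Euclidean distance, which is consistent with the L2 normalization in the model structure; in that case the squared distance equals 2 - 2 * (a · b) and is therefore bounded by 4. The helper names and the random vector are hypothetical stand-ins, not output of the real network.

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def squared_distance(a, b):
    """Squared Euclidean distance between the unit-length versions of a and b."""
    diff = l2_normalize(a) - l2_normalize(b)
    return float(np.dot(diff, diff))

# Hypothetical 128-dimensional embedding standing in for a network output.
rng = np.random.default_rng(0)
embedding = rng.normal(size=128)

same = squared_distance(embedding, embedding)       # identical inputs
opposite = squared_distance(embedding, -embedding)  # antipodal points on the unit sphere
print(same, opposite)
```

Identical embeddings give a distance of zero, and embeddings pointing in exactly opposite directions reach the upper bound of four (up to floating-point rounding).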
The architecture combines the multiple interleaved convolution layers of Zeiler & Fergus (10) with the inception model of GoogLeNet (7). These models are combined into a deep architecture, which is represented as a black box in figure 4. The most important part of the approach lies in the end-to-end learning of the whole system. The Triplet Loss, which is explained and shown in figure 5, was used as the loss function.
At the time of publication, FaceNet set a new record accuracy of 99.63% on the LFW (12) dataset. The drawback of this model is its demand for a large training data set (200 million training samples in this case).
Figure 3: Illumination and pose invariance. Pose and illumination have been a long-standing problem in face recognition. This figure shows the output distances of FaceNet between pairs of faces under different pose and illumination combinations. A distance of 0.0 means the faces are identical; 4.0 corresponds to the opposite end of the spectrum, two different identities. A threshold of 1.1 would classify every pair correctly. Source: (9)
Figure 4: Model structure. This network consists of a batch input layer and a deep CNN followed by L2 normalization, which results in the face embedding. This is followed by the triplet loss during training. Source: (9)
Figure 5: The Triplet Loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity. Source: (9)
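The idea behind the Triplet Loss can be sketched in a few lines. This is a simplified illustration rather than the training code of (9): it assumes squared Euclidean distances, a hinge with a margin (the default of 0.2 is an assumption of this sketch), and it averages over triplets. The function name and the toy vectors are hypothetical.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on squared Euclidean distances.

    The loss is zero once the negative is farther from the anchor than the
    positive by at least `margin`; otherwise it grows with the violation.
    """
    anchor, positive, negative = (
        np.asarray(x, dtype=float) for x in (anchor, positive, negative)
    )
    pos_dist = np.sum((anchor - positive) ** 2, axis=-1)
    neg_dist = np.sum((anchor - negative) ** 2, axis=-1)
    return float(np.mean(np.maximum(pos_dist - neg_dist + margin, 0.0)))

# Hypothetical unit-length 2-D embeddings for illustration.
anchor = [1.0, 0.0]
close = [0.8, 0.6]   # squared distance to the anchor: 0.4
far = [0.0, 1.0]     # squared distance to the anchor: 2.0

easy = triplet_loss(anchor, positive=close, negative=far)  # constraint satisfied
hard = triplet_loss(anchor, positive=far, negative=close)  # constraint violated
print(easy, hard)
```

When the positive is already much closer than the negative, the triplet contributes no loss; swapping the roles yields a loss of 2.0 - 0.4 + 0.2 = 1.8, which pushes the embedding to pull the positive in and the negative away.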
Sparse ConvNet
In this more recent publication, Sun et al. (11) tried to further improve on their results with DeepID3 (6) by taking a trained, dense CNN, sparsifying its connections, and training it further to improve performance. This architecture raises the baseline accuracy of DeepID3 from 98.95% to 99.30%, which corresponds to an error rate reduction of 33%. It is important to note that although it did not outperform FaceNet (9), it required only 300,000 training samples and can therefore be considered more data-efficient.
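The 33% figure follows directly from the two accuracies stated above, taking the error rate as 100% minus the verification accuracy:

```latex
\begin{align*}
e_{\text{before}} &= 100\% - 98.95\% = 1.05\% \\
e_{\text{after}}  &= 100\% - 99.30\% = 0.70\% \\
\frac{e_{\text{before}} - e_{\text{after}}}{e_{\text{before}}}
  &= \frac{1.05 - 0.70}{1.05} = \frac{0.35}{1.05} \approx 33\%
\end{align*}
```

The reduction is relative to the original error rate, so a gain of only 0.35 percentage points in accuracy removes about a third of the remaining errors.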
Literature
1) Kasar, M. M., Bhattacharyya, D., & Kim, T. H. (2016). Face Recognition Using Neural Network: A Review. International Journal of Security and Its Applications, 10(3), 81-100.
2) Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on (Vol. 1, pp. I-I). IEEE.
3) Farfade, S. S., Saberian, M. J., & Li, L. J. (2015, June). Multi-view Face Detection Using Deep Convolutional Neural Networks. In Proceedings of the 5th ACM on International Conference on Multimedia Retrieval (pp. 643-650). ACM.
4) Jiang, H., & Learned-Miller, E. (2016). Face detection with the faster R-CNN. arXiv preprint.
5) Sun, Y., Wang, X., & Tang, X. (2013). Deep convolutional network cascade for facial point detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3476-3483).
6) Sun, Y., Liang, D., Wang, X., & Tang, X. (2015). DeepID3: Face recognition with very deep neural networks. arXiv preprint.
7) Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1-9).
8) Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint.
9) Schroff, F., Kalenichenko, D., & Philbin, J. (2015). Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 815-823).
10) Zeiler, M. D., & Fergus, R. (2014, September). Visualizing and understanding convolutional networks. In European conference on computer vision (pp. 818-833). Springer International Publishing.
11) Sun, Y., Wang, X., & Tang, X. (2016). Sparsifying neural network connections for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4856-4864).