Face detection is a fundamental problem that has been around since the early ’70s. Faces hold a lot of information, e.g. age, identity and emotion. To be able to extract this information from photos or videos, it is important to first accurately localize faces. For straightforward cases with many constraints, e.g. a face on an ID card, detection is already very accurate. However, the harder, unconstrained cases, such as occluded faces, remain more problematic.
Moreover, it is very common for detectors to return multiple bounding box suggestions that refer to the same object/face. To keep all these overlapping boxes from negatively influencing the model’s performance, we’d like to suppress all but the best box(es). Commonly we use non-maximum suppression (NMS) for this purpose. Throughout this post we will discover and address the downside of the traditional Greedy NMS strategy, which uses classification confidence to select the ‘best’ box.
So, how do we measure if one approach is better than another? Whether a face is accurately localized? And is that the only important metric when it comes to face detection?
Robert-Jan Bruintjes wrote a nice piece about object detection metrics here, but I will recap some bits that are most relevant for this post.
Intersection over Union (IoU)
IoU, also known as Jaccard index, is a popular metric used to measure the degree to which two bounding boxes overlap.
In the field of object or face detection, IoU is often used to determine if two bounding boxes belong to the same object. IoU is straightforward to compute and, as the name says, it is the area of the intersection of the two bounding boxes divided by the area of their union.
Figure 1 illustrates this.
The resulting IoU is in range [0, 1], where IoU = 0 means the two boxes don’t overlap and IoU = 1 means the boxes overlap perfectly and are essentially the same.
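As a minimal sketch of this computation, the function below calculates the IoU of two axis-aligned boxes. The (x1, y1, x2, y2) corner format and the function name `iou` are assumptions made for this illustration, not something prescribed by a particular library.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height are clamped at 0 for boxes that do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Two 2x2 boxes shifted by 1 along x: intersection 2, union 6.
    print(iou((0, 0, 2, 2), (1, 0, 3, 2)))
```

Identical boxes give 1, disjoint boxes give 0, and everything in between reflects partial overlap.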
(mean) Average Precision (mAP)
The general definition of Average Precision (AP) is the area under the Precision-Recall curve (AUC), which is the average precision over all recall values in the range [0, 1]. Mean Average Precision (mAP) is the mean AP over all classes. So for face detection mAP is the same as AP, as we only have a single class: faces.
Simplified, the mAP is a single metric that shows how well a model performs at detecting ground truths (with IoU ≥ 0.5) and how well these detections are classified. Here mAP = 0 means the model didn’t correctly classify any of its detections and/or no ground truths were detected, while mAP = 1 means all ground truths were detected and all detections were correctly classified. Of course, incorrect detections are also punished, so the model can’t just fill an image with detections.
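To make the "area under the Precision-Recall curve" a bit more tangible, here is a rough sketch of an AP computation for a single class. It assumes the detections have already been matched to ground truths (e.g. at IoU ≥ 0.5) and are passed in as hypothetical (confidence, is_true_positive) pairs. It also uses the all-point interpolation known from Pascal VOC style evaluation; other benchmarks interpolate differently, so treat this as one possible variant rather than the exact evaluation protocol used in this post.

```python
def average_precision(detections, num_ground_truths):
    """AP for one class; detections is a list of (confidence, is_true_positive)."""
    if num_ground_truths == 0:
        return 0.0
    # Rank detections from most to least confident.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions, recalls = [], []
    tp = fp = 0
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truths)
    # Replace each precision by the maximum precision at any higher recall,
    # which makes the curve monotonically non-increasing.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangles under the resulting step curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

A false positive ranked above a true positive lowers the precision at that recall level, which is how incorrect detections get punished.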
Now that we have a better understanding of how models are evaluated, and which box is better than another box, we can look into this non-maximum suppression (NMS) we mentioned earlier.
Let’s start with the traditional Greedy NMS. This version uses a greedy strategy to simply pick the box with the highest classification confidence of all boxes that overlap more than a given threshold (e.g. IoU(box A, box B) > Nt). All the other boxes that overlap more than the threshold are ‘suppressed’/discarded. Figure 2 illustrates this procedure.
Additionally, it is common practice to discard all detections that have a classification confidence score lower than a given threshold (i.e. Ot). This is done to only consider boxes that the model actually assumes are of a desired class (e.g. a face).
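The procedure described above can be sketched in a few lines of plain Python. The function and parameter names are made up for this example, `nms_threshold` plays the role of Nt and `score_threshold` the role of Ot, and the default values are assumptions rather than the settings used in the experiments.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, nms_threshold=0.5, score_threshold=0.05):
    """Return indices of the kept boxes, highest classification confidence first."""
    # Discard detections the model is not confident about (threshold Ot).
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    keep = []
    while candidates:
        best = candidates.pop(0)  # highest remaining classification confidence
        keep.append(best)
        # Suppress every box that overlaps the chosen box by more than Nt.
        candidates = [i for i in candidates
                      if iou(boxes[best], boxes[i]) <= nms_threshold]
    return keep
```

Note that the only thing deciding which of two overlapping boxes survives is the classification confidence; how well each box is localized plays no role.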
Next up is another popular variation on the traditional Greedy NMS, the Soft NMS. This variation was designed by Bodla et al. to work better for overlapping objects. It no longer discards the boxes with a lower classification confidence; instead, it reduces their classification confidence based on the overlap. A higher IoU with the selected box means the confidence is reduced more. Figure 3 illustrates this procedure.
It is important to note that the two strategies commonly use different thresholds to determine which boxes are considered to overlap enough.
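For comparison, here is a minimal sketch of Soft NMS using the Gaussian penalty, one of the rescoring functions proposed by Bodla et al. The names and the default values for `sigma` and `score_threshold` are assumptions for this example. In this variant the hard overlap threshold is replaced by `sigma`, which controls how quickly the confidence decays with the overlap.

```python
import math


def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Return (index, rescored confidence) pairs in selection order."""
    scores = list(scores)  # work on a copy, the caller's list stays untouched
    remaining = list(range(len(boxes)))
    selected = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        if scores[best] < score_threshold:
            break  # all remaining boxes score even lower
        selected.append((best, scores[best]))
        # Decay the confidence of the other boxes based on their overlap.
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            scores[i] *= math.exp(-(overlap ** 2) / sigma)
    return selected
```

A box that overlaps heavily with a selected box is kept, but ends up with a much lower confidence, while a box without any overlap keeps its original score.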
Problem: Misalignment classification confidence – IoU
Jiang et al. describe a misalignment between classification confidence and overlap (IoU) between a detection and its matched ground truth, for object detection. They find that for object detection classification confidence is not a reliable reflection of localization accuracy. Yet, most bounding box suppression and refinement methods use classification confidence to decide which detections are more accurately localized than others. Figure 4 illustrates the problem this misalignment can cause.
The graph below illustrates the extent of this misalignment for face detection (on the popular WIDER FACE dataset).
As can be seen in the graph, when we have a low overlap (IoU<0.5) the classification confidence will likely be ~0.
When we have a high overlap (IoU>0.5) the classification confidence will likely be ~1.
As to why it occurs: this is expected behavior. When a detector learns to recognize objects of a certain class (e.g. faces), we train it to maximize the classification confidence of that class. This means that even for partially localized faces we want the resulting classification confidence to be 1. This leads to the model becoming more accurate, but also overconfident.
Addressing the misalignment
We now know that using classification confidence for a localization purpose is inadvisable. So what can we use instead to determine which is the better detection?
Jiang et al. suggest training a model to predict a localization confidence, and call this model the IoU-Net. Essentially, this localization confidence is the IoU between a ground truth and a detection, where loc.conf. = 1 means a 100% perfect localization and loc.conf. = 0 means the detection doesn’t overlap with a ground truth.
In other words, the input to the model during training is the feature map of a detection, and the target is the IoU between a ground truth and the detection.
To predict the localization confidence we extend a simple Faster R-CNN model (Ren et al.) with an extra branch. This allows the model to predict a bounding box and the box’s accompanying localization and classification confidence. Figure 5 illustrates the Faster R-CNN model and in blue the extra branch.
Now that we know how the model looks and how it’s supposed to work, we need to think about the type of training data we want to feed it. Essentially, the only extra training data we need to add is the target localization confidence accompanying some Region of Interest (RoI). We could, for example, just calculate the IoU between a ground truth box and a detection suggested by the region proposal network (RPN) of the Faster R-CNN. However, this is undesirable: training the RPN will result in better proposals with a higher IoU, and thus we would be creating a data imbalance which can result in a bias. We don’t want that, as it can cause the model to predict incorrect localization confidence scores.
Instead, as the figure of the network architecture above suggests, we feed the model jittered RoIs. For these jittered RoIs we basically use the ground truth boxes and add some noise to the coordinates to ‘jitter’ their final location. The amount of noise we add depends on the target localization confidence we want to feed the model.
i.e. if we want to feed the model a region with localization confidence of 0.95, we want the IoU between the ground truth (G) and the jittered RoI (G’) to be 0.95 (i.e. IoU(G, G’) = 0.95).
G = (x1, y1, x2, y2),
G’ = (x’1, y’1, x’2, y’2).
x’1 = x1 + d1,
y’1 = y1 + d2,
x’2 = x2 + d3,
y’2 = y2 + d4.
Here d1, d2, d3 and d4 are the noise values added to the box coordinates to create the jittered RoI. Figure 6 illustrates this process a bit further.
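One simple way to obtain such a jittered RoI is rejection sampling: draw random noise values d1 to d4, and keep the result only if its IoU with the ground truth is close enough to the target. The sketch below does exactly that. Note that this sampling scheme, the noise scale heuristic, the `tolerance` parameter and all names are assumptions for the sake of illustration; it is not necessarily how the jittering was implemented for the experiments.

```python
import random


def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def jitter_box(gt, target_iou, tolerance=0.02, max_tries=10000, rng=None):
    """Return G' with |IoU(G, G') - target_iou| <= tolerance, or None on failure."""
    rng = rng or random.Random()
    x1, y1, x2, y2 = gt
    w, h = x2 - x1, y2 - y1
    # Heuristic (assumption): a lower target IoU needs larger noise.
    scale = 1.0 - target_iou
    for _ in range(max_tries):
        d1 = rng.uniform(-scale, scale) * w
        d2 = rng.uniform(-scale, scale) * h
        d3 = rng.uniform(-scale, scale) * w
        d4 = rng.uniform(-scale, scale) * h
        jittered = (x1 + d1, y1 + d2, x2 + d3, y2 + d4)
        # Skip degenerate boxes where the corners have crossed.
        if jittered[0] >= jittered[2] or jittered[1] >= jittered[3]:
            continue
        if abs(iou(gt, jittered) - target_iou) <= tolerance:
            return jittered
    return None
```

Sampling the target IoU itself from a uniform range would then give the model a balanced set of localization confidence targets, instead of the skewed distribution that RPN proposals would produce.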
Now that we have predicted a localization confidence to address the misalignment we can use it during NMS. Jiang et al. have designed a greedy approach using localization confidence instead of classification confidence. Now we select the box with the highest localization confidence of all boxes that overlap more than a given threshold (e.g. IoU(box A, box B) > Nt) and we assign it the highest classification confidence of these same overlapping boxes. Afterwards, all non-maximum boxes are ‘suppressed’/discarded.
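A minimal sketch of this IoU-guided NMS, following the description above, could look as follows. The function and parameter names and the default threshold are assumptions for this example; `loc_scores` holds the predicted localization confidences and `cls_scores` the classification confidences.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def iou_guided_nms(boxes, cls_scores, loc_scores, nms_threshold=0.5):
    """Return (index, classification confidence) pairs of the kept boxes."""
    # Rank by localization confidence instead of classification confidence.
    remaining = sorted(range(len(boxes)), key=lambda i: loc_scores[i], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        best_cls = cls_scores[best]
        survivors = []
        for i in remaining:
            if iou(boxes[best], boxes[i]) > nms_threshold:
                # The suppressed box passes on its classification confidence
                # if it is higher than what the kept box has so far.
                best_cls = max(best_cls, cls_scores[i])
            else:
                survivors.append(i)
        remaining = survivors
        kept.append((best, best_cls))
    return kept
```

This way the best localized box of a cluster is kept, while it still ends up with the highest classification confidence found in that cluster.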
Soft IoU-guided NMS
Based on the earlier findings of Soft NMS and the traditional Greedy NMS, we have also experimented with combining Soft NMS and IoU-guided NMS into Soft IoU-guided NMS. Like with Soft NMS, we reduce the confidence of the overlapping boxes based on their IoU with the higher confidence boxes; however, we do this for both localization confidence and classification confidence. And like IoU-guided NMS, we select based on localization confidence instead of classification confidence. For those who are interested, the algorithm below shows the pseudocode for this novel NMS strategy and points at the difference between it and traditional Soft NMS.
We see here the pseudocode for the Soft NMS and Soft IoU-guided NMS algorithms. The red and blue colors indicate the difference between these two strategies.
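As a rough Python rendering of the idea, the sketch below implements only what is described in the text: select by localization confidence, and decay both the localization and the classification confidence of the remaining boxes. The Gaussian penalty, the `sigma` and `score_threshold` defaults and all names are assumptions, and details of the original pseudocode that are not spelled out in the text are not covered here.

```python
import math


def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_iou_guided_nms(boxes, cls_scores, loc_scores,
                        sigma=0.5, score_threshold=0.001):
    """Return (index, classification conf., localization conf.) triples."""
    cls_scores, loc_scores = list(cls_scores), list(loc_scores)  # copies
    remaining = list(range(len(boxes)))
    selected = []
    while remaining:
        # Difference with Soft NMS: select on localization confidence.
        best = max(remaining, key=lambda i: loc_scores[i])
        remaining.remove(best)
        if cls_scores[best] < score_threshold:
            continue  # assumption: too unlikely to be a face, so skip it
        selected.append((best, cls_scores[best], loc_scores[best]))
        for i in remaining:
            decay = math.exp(-(iou(boxes[best], boxes[i]) ** 2) / sigma)
            # Difference with Soft NMS: decay both confidences.
            cls_scores[i] *= decay
            loc_scores[i] *= decay
    return selected
```

Compared to the Soft NMS strategy, only the selection criterion and the extra decay of the localization confidence change.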
So, does this all work? Do we see a higher performance when predicting a localization confidence and using it during NMS instead of classification confidence?
Though it is still difficult for a model to learn to predict the IoU between a region and a ground truth, we do see that there is a stronger correlation between localization confidence and IoU. Here this correlation is visualized by the slope of the datapoints better resembling the red line (the unit line y=x) than in the earlier graph plotting classification confidence vs. IoU.
Given this stronger correlation we expect the model to perform better when using localization confidence instead of classification confidence.
We put this to the test and compared the IoU-Net to the baseline Faster R-CNN for all images in the WIDER FACE test set. These results are shown in the Table on the right.
We see that IoU-Net outperforms the Faster R-CNN baseline, while the NMS strategies seem to perform similarly. This is the case when comparing Greedy and IoU-guided NMS, and Soft and Soft IoU-guided NMS.
However, when considering all faces, most are normal. We expect the biggest difference in faces with occlusions, where some facial features are not visible. As seen in the research in my thesis, the lower correlation between classification confidence and IoU for occluded faces leaves more room for improvement than with faces overall.
When looking at the results for normally (i.e. partially) and heavily occluded faces in the Table on the left, we see a more noticeable difference in performance.
We see a relative improvement of 1.4~2.2% for normally occluded faces and a relative improvement of 3.9~4.8% for heavily occluded faces when comparing IoU-Net’s NMS strategies to those of the baseline.
More importantly, we see that IoU-guided NMS and Soft IoU-guided NMS perform better than their classification confidence based counterparts.
In conclusion, we see that face detection can benefit from learning to predict a localization confidence and especially for occluded faces we find that using this localization confidence during NMS will yield better results.
However, these models have not been entirely optimized for face detection and can benefit from more experimentation and fine-tuning, as well as from further improvements to the input data during training so that they work better for crowded images.
For more info on the subject and details about the research I would like to refer you to my thesis.
Jiang, Borui et al. (2018). “Acquisition of Localization Confidence for Accurate Object Detection”. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–799.
Ren, Shaoqing et al. (2015). “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. In: Advances in Neural Information Processing Systems, pp. 91–99.
Bodla, Navaneeth et al. (2017). “Soft-NMS – Improving Object Detection With One Line of Code”. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5561–5569.