ECCV 2018: take-home-messages from one of the largest conferences on Computer Vision
A month has passed since the European Conference on Computer Vision (ECCV) 2018, which took place in Munich. In this blog post, I summarize some of the most interesting trends in the field. Since Computer Vision (CV) is quite a broad field, I have restricted the scope of this article to “everything that could be relevant to Face Analysis”. Although this might seem like a harsh restriction, the contributions on this topic are many and very interesting.
ECCV this year was fully booked, with 3200 attendees and an increased number of paper submissions: 2439 valid papers, of which only 776 were accepted (31.8%). In this post, I will review just a few of them, grouped into three topics plus one bonus topic.
Without further ado, I present to you the “Top-3, plus 1, Topics of Interest for Face Analysis” at the European Conference on Computer Vision.
1. The Advances in Object Detection
Object Detection is one of the milestone topics in CV and is the first step for any Face Analysis system, so it deserves particular attention. Listing all the relevant contributions to this topic would be unreasonable. Here, I will report on a couple of works that I think represent the main trends in the field.
A place of honor goes to “Group Normalization” [1], an alternative to Batch Normalization that promises better results when training with small batch sizes. Data normalization has always been an essential procedure in Machine Learning, and Batch Normalization is a very common and effective method to normalize the statistics of the activations throughout the layers of a neural network. It is a well-known issue that Batch Normalization degrades when training with small batch sizes: to normalize properly, you need to estimate the statistics of your data, and a small batch does not contain enough data for a reliable estimate.
Figure 1: Over iterations, the localization can be wrong when it is based on classification confidence (top), while it is improved by the proposed method [2].
However, when optimizing very deep models on a single GPU, small batch sizes are often unavoidable because of the limited memory of these devices. Group Normalization addresses this problem by dividing the channels of each convolutional layer into groups and computing the normalization statistics within each group, separately for each input, which makes the method independent of the batch size. Group Normalization performs well across different batch sizes and also improves results for Object Detection systems, so we expect it to become a common practice for training very deep models.
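To make the grouping idea concrete, here is a minimal sketch in plain Python. It is an illustration under simplifying assumptions, not the authors' implementation: it handles a single input, omits the learned scale and shift parameters, and the function name `group_norm` and the toy values are made up for this example.

```python
import math

def group_norm(sample, num_groups, eps=1e-5):
    """Normalize one input with per-group statistics (no learned scale/shift).

    sample: list of C channels, each a flat list of H*W activations.
    """
    num_channels = len(sample)
    if num_channels % num_groups != 0:
        raise ValueError("num_groups must divide the number of channels")
    per_group = num_channels // num_groups

    normalized = []
    for g in range(num_groups):
        channels = sample[g * per_group:(g + 1) * per_group]
        # Mean and variance are computed over all activations in the group.
        values = [v for ch in channels for v in ch]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        for ch in channels:
            normalized.append([(v - mean) * scale for v in ch])
    return normalized

# Toy input (made-up numbers): 4 channels with 2 spatial positions each.
sample = [[1.0, 3.0], [5.0, 7.0], [10.0, 10.0], [20.0, 40.0]]
out = group_norm(sample, num_groups=2)
```

Note that the function only ever looks at one input, so the result is the same whether the batch contains one image or a hundred.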
On the same Object Detection topic, we have noticed a trend toward improving the loss function that is currently used. An example is a paper that proposes to include a Localization Confidence for Object Detection [2]. For those of you not familiar with Object Detection, the loss of these systems is usually composed of two terms: one for the class that you are predicting (which kind of object you are detecting) and one for localization (where this object is). Only the class term gives you a confidence (how sure you are that the object belongs to that class), and this confidence is used to select which predictions to keep among all those that the model outputs. However, the classification confidence can sometimes be misleading: if you are predicting a cat with a tail, the class confidence will steer your prediction toward the features most relevant to the cat class and will ignore the tail (Figure 1).
This reduces your localization accuracy, because the predicted bounding box will fit tightly around the cat's body instead of covering the full annotation. This problem can be solved by using the localization confidence in two parts of the object localization procedure. First, it guides the selection step, usually done with Non-Maximum Suppression (NMS), by ranking candidate boxes by their predicted IoU (intersection-over-union) with the ground truth instead of by classification confidence. This allows the more accurate bounding boxes to be preserved during the selection process. Second, the predicted IoU is also used as the objective when refining the bounding boxes. The experimental results show that these two methods improve the overall accuracy on the MS-COCO dataset.
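The selection step can be sketched as follows. This is a simplified illustration rather than the paper's code: the localization scores are assumed to come from some IoU-predicting branch of the detector, the function names and all numbers are invented, and letting the surviving box inherit the highest class score among the boxes it suppresses is my reading of the method.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def iou_guided_nms(boxes, cls_scores, loc_scores, threshold=0.5):
    """Rank boxes by localization confidence instead of class score.

    Returns a list of (box, class_score) pairs.
    """
    order = sorted(range(len(boxes)), key=lambda i: loc_scores[i], reverse=True)
    kept = []
    while order:
        best = order.pop(0)
        best_cls = cls_scores[best]
        remaining = []
        for i in order:
            if iou(boxes[best], boxes[i]) > threshold:
                # Suppressed box: keep its class score if it is higher.
                best_cls = max(best_cls, cls_scores[i])
            else:
                remaining.append(i)
        order = remaining
        kept.append((boxes[best], best_cls))
    return kept

# Toy example (made-up numbers): the first box is tight around the body and
# has the best class score; the second covers the whole object and has the
# best localization score; the third is a separate object.
boxes = [(10, 10, 50, 50), (10, 10, 60, 50), (100, 100, 140, 140)]
cls_scores = [0.95, 0.80, 0.70]
loc_scores = [0.60, 0.90, 0.85]
detections = iou_guided_nms(boxes, cls_scores, loc_scores)
```

With classic NMS, the tight first box would win because of its class score. Here the better-localized second box survives instead.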
I believe that both ideas will be taken into strong consideration in future work on Object Detection.
2. Inference Time as a metric for Deep Learning models
When choosing between speed and accuracy, comparing models has always been a controversial topic, since the former is quite difficult to estimate. Indeed, how fast a model is depends in practice on the hardware it runs on. Therefore, when a new model comes out, a common practice is to report the number of floating-point operations (FLOPs, not to be confused with FLOPS, the per-second throughput of the hardware) and the number of parameters of the model. The downside is that models with the same FLOPs can have very different inference times. One possible reason is that some operations are optimized for specific platforms and therefore run much faster at runtime. Another is that FLOPs do not account for memory access time, which can be far from negligible. At ECCV 2018, the community seemed more aware of this issue, and several works tried to report the actual inference time of their models.
One example is ShuffleNet V2 [3], where the authors specifically focused on inference-time efficiency across multiple platforms.
As a matter of fact, the authors proposed several guidelines to reduce the inference time of Neural Networks without affecting performance. For example, using the same number of input and output channels in a convolutional block minimizes the memory access cost (MAC). Another guideline concerns the multiple branches used by complex architectures such as the Inception family. Although these architectures have a small number of parameters, they turn out to be rather slow because the many branches in each module reduce the degree of parallelism.
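The channel guideline can be checked with a little arithmetic. The sketch below uses the cost model from the ShuffleNet V2 paper for a 1x1 convolution, where FLOPs are counted as multiply-adds and MAC counts reads and writes of the input, the output, and the weights while ignoring caching. The function name and the layer sizes are invented for illustration.

```python
def conv1x1_cost(h, w, c_in, c_out):
    """Return (FLOPs, MAC) for a 1x1 convolution on an h x w feature map."""
    flops = h * w * c_in * c_out
    # Input feature map + output feature map + kernel weights.
    mac = h * w * (c_in + c_out) + c_in * c_out
    return flops, mac

# Same FLOPs budget, different input/output channel ratios (made-up sizes).
for c_in, c_out in [(128, 128), (64, 256), (32, 512)]:
    flops, mac = conv1x1_cost(56, 56, c_in, c_out)
    print(f"{c_in}:{c_out}  FLOPs={flops}  MAC={mac}")
```

All three configurations perform the same number of operations, yet the memory access cost grows as the channel ratio moves away from 1:1, which is one way two models with equal FLOPs can end up with different inference times.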
The same inference-time trend can also be seen in the model compression field. Two collaborative works between MIT and Google [4, 5] proposed model compression methods for deployment on specific hardware, such as mobile devices, that are driven by the measured inference time.
Hence, inference time has become a critical metric in many research fields, especially when deployment on real devices is involved.
3. Action Units are back
The Facial Action Coding System (FACS) was introduced in the late 70s to code human facial movements. It defines, roughly, one action unit for each muscle or group of muscles in the face: if the muscle is contracted, the corresponding action unit is set to one. A facial expression is encoded by multiple action units. These facial expressions can then be used to classify emotions. With the advances of Deep Learning, we can now design an artificial system that is able to predict which Action Units are active on a face. At ECCV 2018, several works proposed systems that rely on the Action Units framework.
One example is the work from the University of Barcelona, where the authors proposed a modular architecture to predict the activation of action units [6] (Figure 2). Action Units prediction is not an easy problem because of its intrinsically multi-label nature, for which the standard softmax loss is not applicable. Moreover, there are strong correlations between the activations of the units, which makes the problem even more complex. Here, the authors proposed a recurrent neural network that combines a history of predictions to exploit the correlations among the activations.
Figure 2: Example of Action Units prediction from [6].
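To see why softmax does not fit a multi-label problem, consider the small sketch below. It contrasts softmax with independent per-unit sigmoids trained with binary cross-entropy, which is a common choice for multi-label problems in general and not necessarily the loss used in [6]. The logits and targets are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over independent per-unit sigmoids."""
    loss = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        loss -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return loss / len(logits)

# Made-up logits for three action units; the first two are active together.
logits = [3.0, 3.0, -3.0]
targets = [1, 1, 0]

per_unit = [sigmoid(z) for z in logits]  # each probability is independent
shared = softmax(logits)                 # probabilities compete and sum to 1
loss = multilabel_bce(logits, targets)
```

With softmax, the two active units have to split the probability mass, so neither can get above one half. With per-unit sigmoids, both can be close to one at the same time.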
Another application of Action Units is the automatic animation of pictures [7]. The proposed system morphs a face into a specific facial expression, taking as input a vector of the Action Units to activate. The authors used GANs to generate realistic faces that are much closer to the original image than those of other similar systems. This is achieved thanks to a specific component of the loss function that minimizes the difference between the original image and the generated one.
Action Units are an interesting coding system that recent advances in Deep Learning have brought back into use, so we expect to see more technologies using them in the future.
Plus one! The data bias problem
Data has become one of the most valuable resources in every sector. Nowadays, AI technologies all depend on the kind of data they have been trained on. For a Deep Learning based technology, more data means better and more accurate models that are able to outperform the state of the art. However, collecting a large and noise-free dataset is a very hard job that sometimes only large corporations can afford. Therefore, in most cases, model training relies on old public datasets that are far from well balanced across all the classes the models are supposed to work with.
One well-known example is the Labeled Faces in the Wild (LFW) dataset used for Face Recognition problems: since the majority of the faces come from white men, automatic systems trained on LFW can fail dramatically on women and people with dark skin. This issue has become very serious, since several AI technologies are used to identify who you are in order to grant access to online services. Potentially, these biases could cut off the majority of the population from these services. A project from MIT (gendershades.org) tries to tackle this problem by reporting gender and ethnicity biases in commercial services such as those from Microsoft, IBM, and Face++. At ECCV 2018, a specific workshop was organized to discuss how to properly address this problem from a scientific point of view.
How do we correct our models for the biases present in the dataset? The paper “Women also Snowboard: Overcoming Bias in Captioning Models” [8] addresses the problem for captioning systems, where a sentence is generated from an image. The paper introduces a model called Equalizer that forces the prediction to be right for the right reasons. Indeed, some elements in a picture can be wrongly correlated with a specific gender (such as computers with men or kitchens with women). In this case, a model could predict the gender of a person only from the surrounding objects.
The Equalizer steers the attention of the model toward the person in the picture, so that gender is predicted only from the features of the body. We strongly believe that addressing the data bias problem is crucial for the correct development of AI technologies in our lives. We have therefore reserved a specific place for it, and we hope it will resonate across the whole community.
That’s it for my recap of the European Conference on Computer Vision 2018. Of course, there are probably many other works that deserve equal attention and perhaps I’ll touch on those in future articles. If you want to suggest some other papers to include please add them in the comments.
[1] Wu, Y., & He, K. (2018). Group Normalization, 3–19. https://arxiv.org/abs/1803.08494
[2] Jiang, B., Luo, R., Mao, J., Xiao, T., & Jiang, Y. (2018). Acquisition of Localization Confidence for Accurate Object Detection. https://arxiv.org/abs/1807.11590
[3] Ma, N., Zhang, X., Zheng, H. T., & Sun, J. (2018). ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. https://arxiv.org/abs/1807.11164
[4] He, Y., Lin, J., Liu, Z., Wang, H., Li, L. J., & Han, S. (2018, September). AMC: AutoML for Model Compression and Acceleration on Mobile Devices. https://arxiv.org/abs/1802.03494
[5] Yang, T. J., Howard, A., Chen, B., Zhang, X., Go, A., Sandler, M., … & Adam, H. (2018). NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications. https://arxiv.org/abs/1804.03230
[6] Corneanu, C. A., Madadi, M., & Escalera, S. (2018). Deep Structure Inference Network for Facial Action Unit Recognition. https://arxiv.org/abs/1803.05873
[7] Pumarola, A., Agudo, A., Martinez, A. M., Sanfeliu, A., & Moreno-Noguer, F. (2018, July). GANimation: Anatomically-aware Facial Animation from a Single Image. https://arxiv.org/abs/1807.09251
[8] Anne Hendricks, L., Burns, K., Saenko, K., Darrell, T., & Rohrbach, A. (2018). Women also Snowboard: Overcoming Bias in Captioning Models. https://arxiv.org/abs/1803.09797