Facial expression recognition (FER), which classifies facial expressions from input images, has advanced considerably thanks to deep learning. However, deep learning requires large-scale datasets, and collecting accurately annotated data is challenging in FER: discretizing expressions is difficult due to the subtlety and complexity of facial expressions and the subjectivity of annotators. In this paper, we collect a large-scale set of reaction mashup (RM) videos (without expression annotations) from YouTube, each of which contains multiple persons’ facial reactions to the same film. Based on this, we propose a novel contrastive learning framework for FER composed of two stages: an inter-sample attention learning (IAL) stage and an attention-based contrastive learning (ACL) stage. In IAL, we train a baseline FER network and learn the expression similarity of sample pairs using a benchmark dataset and its discretized expression annotations. In ACL, we apply contrastive learning to the collected RM videos using priors combined with the learned expression similarities: given an anchor face, different persons’ faces in nearby frames that exhibit high similarity are used as positive samples, while the same person’s faces in distant frames that exhibit low similarity are used as negative samples. Experimental results show that the proposed method effectively improves the distribution of learned features to reflect continuous variations of facial expressions, thereby outperforming previous state-of-the-art methods on three FER benchmark datasets (i.e., AffectNet, RAF-DB, and FERPlus).
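The positive/negative sampling rule of the ACL stage can be illustrated with a minimal sketch. Note this is an assumption-laden toy version, not the paper's implementation: the thresholds (`window`, `tau_pos`, `tau_neg`), the use of cosine similarity, and the InfoNCE-style loss are all hypothetical choices standing in for the learned expression similarity and the paper's actual objective.

```python
# Toy sketch of ACL's pair selection: positives are different persons' faces in
# nearby frames with high (learned) similarity; negatives are the same person's
# faces in distant frames with low similarity. All thresholds and the InfoNCE
# form are illustrative assumptions, not the paper's exact method.
import numpy as np

def cosine_sim(a, b):
    # Stand-in for the expression similarity learned in the IAL stage.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_pairs(anchor, faces, window=2, tau_pos=0.8, tau_neg=0.2):
    """faces: list of dicts with 'person', 'frame', 'emb' keys."""
    pos, neg = [], []
    for f in faces:
        s = cosine_sim(anchor['emb'], f['emb'])
        near = abs(f['frame'] - anchor['frame']) <= window
        if f['person'] != anchor['person'] and near and s >= tau_pos:
            pos.append(f)          # different person, nearby frame, high similarity
        elif f['person'] == anchor['person'] and not near and s <= tau_neg:
            neg.append(f)          # same person, distant frame, low similarity
    return pos, neg

def info_nce(anchor, pos, neg, temp=0.1):
    """InfoNCE-style contrastive loss over the selected pairs."""
    losses = []
    for p in pos:
        num = np.exp(cosine_sim(anchor['emb'], p['emb']) / temp)
        den = num + sum(np.exp(cosine_sim(anchor['emb'], n['emb']) / temp)
                        for n in neg)
        losses.append(-np.log(num / den))
    return float(np.mean(losses)) if losses else 0.0
```

Pulling positives across different persons reacting to the same film is what makes the unlabeled RM data useful: synchronized reactions provide a weak-supervision signal without any expression annotation.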
Publisher: Ulsan National Institute of Science and Technology (UNIST)