When Human Coders (and Machines) Disagree on the Meaning of Facial Affect
This paper describes the challenges of obtaining ground-truth affective labels for spontaneous video and discusses the implications for systems, such as virtual agents, that rely on automated facial analysis. We first present a dataset from an intelligent tutoring application and describe the most prevalent approach to labeling such data. We then present an alternative labeling approach that closely models how most automated facial analysis systems are designed. We show that while participants, peers, and trained judges report high inter-rater agreement on expressions of delight, confusion, flow, frustration, boredom, surprise, and neutral when shown the entire 30 minutes of video for each participant, inter-rater agreement drops below chance when human coders are asked to watch and label short 8-second clips using the same set of labels. We also perform a discriminative analysis of facial action units for each affective state represented in the clips. The results emphasize that human coders rely heavily on factors such as familiarity with the person and the context of the interaction to correctly infer a person's affective state; without this information, the reliability with which humans, as well as machines, attribute affective labels to spontaneous facial and head movements drops significantly.
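The abstract does not specify which chance-corrected agreement statistic is used, so the following is a minimal illustrative sketch, assuming a Fleiss' kappa computation over multiple coders assigning one of the seven affective labels to each clip. It is intended only to make concrete what "agreement below chance" means (kappa at or below zero); the function name, data layout, and example labels are assumptions, not the paper's actual pipeline.

```python
from collections import Counter

# Affective labels named in the abstract (illustrative set)
LABELS = ["delight", "confusion", "flow", "frustration",
          "boredom", "surprise", "neutral"]


def fleiss_kappa(ratings, categories=LABELS):
    """Fleiss' kappa for a fixed number of raters per item.

    ratings: list of per-clip label lists, one label per coder,
             e.g. [["delight", "delight", "flow"], ...].
    Returns a value in [-1, 1]; values at or below 0 indicate
    agreement no better than (or worse than) chance.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    cats = list(categories)

    # n_ij: number of raters assigning category j to clip i
    counts = [[Counter(item)[c] for c in cats] for item in ratings]

    # Mean per-clip observed agreement
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Expected (chance) agreement from the marginal label distribution
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(len(cats))]
    p_e = sum(p * p for p in p_j)

    return (p_bar - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Hypothetical ratings: three coders labeling four short clips
    clips = [
        ["delight", "delight", "surprise"],
        ["confusion", "frustration", "neutral"],
        ["flow", "boredom", "neutral"],
        ["neutral", "neutral", "confusion"],
    ]
    print(f"Fleiss' kappa: {fleiss_kappa(clips):.3f}")
```

Under this formulation, high agreement on full 30-minute videos would correspond to kappa values well above zero, while the reported drop "below chance" for 8-second clips corresponds to kappa values at or below zero.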