Evaluating the quality of generated nonverbal behavior remains a major challenge in the development of generative models. While human evaluations are reliable, they are costly and impractical for large-scale or iterative optimization. In this work, we propose an objective evaluation framework based on aggregated ranks across multiple fidelity and diversity metrics, computed from both raw features and learned latent representations. Compared with existing work, our framework emphasizes consistency across these metrics, aiming to provide a more holistic assessment.
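To illustrate the idea of aggregating ranks across metrics, the following is a minimal sketch rather than the exact procedure used in this work. It assumes that each model is ranked per metric (1 = best, ties sharing the average rank) and that the composite score is the unweighted mean of those ranks. The function names, metric names, model labels, and values are all hypothetical and chosen only for illustration.

```python
from statistics import mean


def rank_models(values, higher_is_better):
    """Map model name -> rank (1 = best); tied models share the average rank."""
    ordered = sorted(values, key=values.get, reverse=higher_is_better)
    ranks = {}
    i = 0
    while i < len(ordered):
        # Extend j over the run of models tied with the model at position i.
        j = i
        while j + 1 < len(ordered) and values[ordered[j + 1]] == values[ordered[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for name in ordered[i:j + 1]:
            ranks[name] = avg_rank
        i = j + 1
    return ranks


def aggregate_ranks(scores, higher_is_better):
    """scores: metric -> {model -> value}. Returns model -> mean rank (assumed rule)."""
    per_metric = [
        rank_models(vals, higher_is_better[metric]) for metric, vals in scores.items()
    ]
    models = list(per_metric[0])
    return {name: mean(r[name] for r in per_metric) for name in models}


# Hypothetical metric values for three hypothetical model configurations.
scores = {
    "fidelity_distance_raw": {"A": 0.30, "B": 0.20, "C": 0.25},
    "fidelity_distance_latent": {"A": 0.10, "B": 0.15, "C": 0.12},
    "diversity_coverage": {"A": 0.60, "B": 0.80, "C": 0.70},
}
# Assumption: distances are better when lower, coverage when higher.
higher_is_better = {
    "fidelity_distance_raw": False,
    "fidelity_distance_latent": False,
    "diversity_coverage": True,
}

composite = aggregate_ranks(scores, higher_is_better)
final_order = sorted(composite, key=composite.get)  # best configuration first
print(composite)
print(final_order)
```

Under this assumed rule, a configuration that is best on one metric but worst on the others does not automatically come out on top, which reflects the emphasis on consistency across metrics. Other aggregation rules, such as a weighted mean or a median of ranks, would fit the same structure.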
Radar chart comparing the performance of four CWGAN-GP variants. Each polygon represents a model configuration defined by specific values of the loss parameters α and β. The legend indicates the final aggregated ranking for each configuration based on the composite score.
The videos present segments of nonverbal behaviors generated by one of the CWGAN-GP variants. The nonverbal behaviors are then replayed on a virtual character through the Retorik platform developed by Davi. The original audio from the video is preserved.
This video presents a segment of nonverbal behaviors from a real-life recorded sequence. The nonverbal behaviors were extracted using the OpenFace tool and then replayed on a virtual character through the Retorik platform developed by Davi. The original audio from the video is preserved.