Vision-Language Models Mistake Head Orientation for Gaze Direction: Nonverbal Conversation Cues

Abstract

Where someone looks is a nonverbal communication cue that children and adults readily use. How well can Vision-Language Models (VLMs) infer gaze targets? To construct evaluation stimuli, we captured 1,360 real-world photos of scenes in which a person gazes at one of several objects on a table. Importantly, we also controlled the gazer's head orientation: sometimes it was directed toward the gaze target, sometimes toward a distractor object, and sometimes left unconstrained.

We found a substantial performance gap between VLMs and humans, ruled out alternative explanations such as image resolution and object-naming skills, and traced the gap mainly to VLMs inferring gaze direction from head orientation rather than eye appearance. This bias is likely due to data rather than architecture, as suggested by a proof-of-concept experiment finetuning a transformer-based vision model. Future work should investigate whether these findings hold broadly across deep learning methods trained on existing data, and whether better data mitigates the problem for all architectures. Pinpointing the cause sets the stage for technologies that can interpret gaze targets and thereby interact more efficiently with humans.
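One way to check for the head-orientation shortcut described above is to compare accuracy across the head-orientation conditions and to count how often wrong answers land on the object the head points at. The following is a minimal sketch under stated assumptions, not the analysis code used in the paper: the `Trial` record, its field names, the condition labels, and the toy trials are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trial:
    """One evaluation item (hypothetical schema, not the paper's data format)."""
    condition: str              # "congruent", "incongruent", or "unconstrained"
    gaze_target: str            # object the person is actually looking at
    head_target: Optional[str]  # object the head points toward, if controlled
    response: str               # object named by the model or participant


def accuracy_by_condition(trials):
    """Fraction of responses that match the true gaze target, per condition."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t in trials:
        total[t.condition] += 1
        correct[t.condition] += int(t.response == t.gaze_target)
    return {c: correct[c] / total[c] for c in total}


def head_following_rate(trials):
    """On incongruent trials, fraction of responses that name the head target."""
    incongruent = [t for t in trials if t.condition == "incongruent"]
    if not incongruent:
        return None
    followed = sum(t.response == t.head_target for t in incongruent)
    return followed / len(incongruent)


# Toy, made-up trials purely to exercise the functions; not real results.
toy_trials = [
    Trial("congruent", "cup", "cup", "cup"),
    Trial("congruent", "pen", "pen", "pen"),
    Trial("incongruent", "cup", "pen", "pen"),
    Trial("incongruent", "pen", "cup", "cup"),
    Trial("incongruent", "cup", "book", "cup"),
    Trial("incongruent", "book", "cup", "cup"),
    Trial("unconstrained", "book", None, "book"),
    Trial("unconstrained", "cup", None, "pen"),
]

if __name__ == "__main__":
    print(accuracy_by_condition(toy_trials))
    print(head_following_rate(toy_trials))
```

In this framing, a responder that reads the eyes should score similarly across conditions, whereas one that relies on head orientation should do well on congruent trials, poorly on incongruent ones, and show a high head-following rate.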

Human Survey Interfaces (via JsPsych + Prolific)

The instruction page, shown after the browser enters fullscreen mode. We ask participants to press different buttons to ensure that they read and follow the instructions.

This page explains the presence of attention checks. All participants were paid regardless of their performance on these checks.

A question page where participants click one of the buttons to make their choice and proceed to the next question.

BibTeX

@inproceedings{vlmGaze2026,
  title={Vision-Language Models Mistake Head Orientation for Gaze Direction: Nonverbal Conversation Cues},
  author={Zhang, Zory and Feng, Pinyuan and Wang, Bingyang and Zhao, Tianwei and Yu, Suyang and Gao, Qingying and Deng, Hokin and Ma, Ziqiao and Li, Yijiang and Luo, Dezhi},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2026},
  publisher={Association for Computational Linguistics},
  year={2026},
  url={https://arxiv.org/abs/2506.05412}
}

Accepted by ACL 2026

TL;DR: Vision-Language Models are bad at telling where someone is looking, primarily because they rely on the wrong cue. Instead of using eye appearance, they use head orientation as a shortcut to find the gaze target.