👁️ Vision-Language Models Mistake Head Orientation for Gaze Direction: Nonverbal Conversation Cues

Accepted by ACL 2026

1Brown University, 2Columbia University, 3Emory University, 4Johns Hopkins University,
5University of Washington, 6Carnegie Mellon University, 7University of Michigan, 8UC San Diego

Co-leading, *Co-advising

TL;DR: Vision-Language Models are poor at inferring where someone is looking, primarily because they rely on the wrong cue. Instead of using eye appearance, they use head orientation as a shortcut to find the gaze target.

Abstract

Where someone looks is a nonverbal communication cue that children and adults readily use. How well can Vision-Language Models (VLMs) infer gaze targets? To construct evaluation stimuli, we captured 1,360 real-world photos of scenes in which a person gazes at one of several objects on a table. Importantly, we also controlled the gazer's head orientation: sometimes it was directed toward the gaze target, sometimes toward a distractor object, and sometimes left unconstrained.

We found a substantial performance gap between VLMs and humans, ruled out alternative explanations such as resolution and object-naming skills, and identified the main reason for the gap as VLMs inferring gaze direction using head orientation rather than eye appearance. Such a bias is likely due to data rather than architecture, as suggested by a proof-of-concept experiment finetuning a transformer-based vision model. Future work should investigate whether these findings hold broadly across various deep learning methods trained on existing data, and whether better data mitigates this problem for all architectures. Pinpointing the reason sets the stage for technologies that can interpret gaze targets to have more efficient interactions with humans.

Task setup for gaze referent inference

We controlled the gazer's head orientation: sometimes it was directed toward the gaze target, sometimes toward a distractor object, and sometimes left unconstrained.

Example controlled stimuli with objects on a table

Other controlled variables. We found that VLMs rely on heuristics that break down when objects are closer or more numerous, but not when the gazer is shown in a side view. Side views are harder for humans but not for VLMs, because VLMs mostly rely on head orientation, which is still detectable in side views.

Accuracy comparison for related methods and variants

We tested 111 VLMs and 65 humans and found a substantial performance gap: o1 reached 50% accuracy, whereas humans reached 89%. Larger or newer VLMs do not perform better.

Analysis showing head orientation predicts VLM choices

Why did they fail? We individually diagnosed 4 strong VLMs (and other baselines). Compared with alternative explanations such as resolution and object-naming skills, the strongest explanatory factor is the head orientation bias.
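One way to probe for a head orientation bias in this kind of data is to break accuracy down by head-orientation condition and, in the trials where the head points at a distractor, count how often the response follows the head instead of the eyes. The snippet below is a minimal sketch of that idea, not the analysis used in the paper. The `Trial` record, the condition labels (`"aligned"`, `"misaligned"`, `"unconstrained"`), the function names, and the toy data are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trial:
    # All field names and condition labels are illustrative assumptions.
    condition: str               # "aligned", "misaligned", or "unconstrained"
    gaze_target: str             # object the gazer is actually looking at
    head_target: Optional[str]   # object the head points toward (None if unconstrained)
    response: str                # object chosen by the model or participant


def accuracy_by_condition(trials):
    """Return the fraction of correct responses within each condition."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t in trials:
        total[t.condition] += 1
        correct[t.condition] += int(t.response == t.gaze_target)
    return {c: correct[c] / total[c] for c in total}


def head_following_rate(trials):
    """Among misaligned trials, return the fraction of responses that
    pick the object the head points toward rather than the gaze target."""
    misaligned = [t for t in trials if t.condition == "misaligned"]
    if not misaligned:
        return float("nan")
    return sum(t.response == t.head_target for t in misaligned) / len(misaligned)


# Toy data, made up purely to exercise the functions.
toy_trials = [
    Trial("aligned", "cup", "cup", "cup"),
    Trial("aligned", "pen", "pen", "pen"),
    Trial("misaligned", "cup", "pen", "pen"),   # follows the head: wrong
    Trial("misaligned", "pen", "cup", "cup"),   # follows the head: wrong
    Trial("misaligned", "cup", "key", "cup"),   # follows the eyes: correct
    Trial("unconstrained", "key", None, "key"),
]

acc = accuracy_by_condition(toy_trials)
rate = head_following_rate(toy_trials)
print(acc)
print(rate)
```

Under this sketch, a responder that relies on head orientation would show high accuracy in the aligned condition, low accuracy in the misaligned condition, and a high head-following rate.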

Fine-tuning result reducing head-orientation bias

Why do VLMs show head orientation bias? We think it is due to the data, not the architecture. Fine-tuning GazeLLE on our stimulus set (which contains more instances requiring fine-grained processing of eye details beyond head orientation) greatly improves its performance in cases where head orientation does not align with gaze direction. This serves as a proof-of-concept experiment showing that head orientation bias can be mitigated. We recommend targeted training data in which gaze cues, rather than shortcuts such as contextual cues or head/body orientation, must be used to get the correct answer.

Human Survey Interfaces (via JsPsych + Prolific)

BibTeX

@inproceedings{vlmGaze2026,
  title={Vision-Language Models Mistake Head Orientation for Gaze Direction: Nonverbal Conversation Cues},
  author={Zhang, Zory and Feng, Pinyuan and Wang, Bingyang and Zhao, Tianwei and Yu, Suyang and Gao, Qingying and Deng, Hokin and Ma, Ziqiao and Li, Yijiang and Luo, Dezhi},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2026},
  publisher={Association for Computational Linguistics},
  year={2026},
  url={https://arxiv.org/abs/2506.05412},
}