Where someone looks is a nonverbal communication cue that children and adults readily use. How well can Vision-Language Models (VLMs) infer gaze targets? To construct evaluation stimuli, we captured 1,360 real-world photos of scenes in which a person gazes at one of several objects on a table. Importantly, we also controlled the gazer's head orientation: sometimes it was directed toward the gaze target, sometimes toward a distractor object, and sometimes left unconstrained.
We found a substantial performance gap between VLMs and humans, ruled out alternative explanations such as image resolution and object-naming skills, and traced the gap mainly to VLMs inferring gaze direction from head orientation rather than eye appearance. This bias likely stems from the training data rather than the architecture, as suggested by a proof-of-concept experiment in which we fine-tuned a transformer-based vision model. Future work should investigate whether these findings hold broadly across deep learning methods trained on existing data, and whether better data mitigates the problem for all architectures. Pinpointing the cause sets the stage for technologies that interpret gaze targets and can therefore interact with humans more efficiently.
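The head-orientation analysis described above can be illustrated with a short sketch. It assumes each trial is recorded as a dictionary with hypothetical fields `condition`, `gaze_target`, `head_target`, and `prediction`; these names and the toy trials are illustrative assumptions, not the format or data of the actual stimulus set. For each head-orientation condition, it reports accuracy alongside the rate at which the model picks the object the head points toward.

```python
from collections import defaultdict


def summarize_by_condition(trials):
    """Return {condition: (accuracy, head_follow_rate)} for a list of trials.

    Each trial is a dict with hypothetical keys:
      condition    - "aligned", "misaligned", or "unconstrained"
      gaze_target  - object the person actually looks at
      head_target  - object the head points toward (None if unconstrained)
      prediction   - object the model named
    """
    counts = defaultdict(lambda: {"n": 0, "correct": 0, "head_follow": 0})
    for trial in trials:
        c = counts[trial["condition"]]
        c["n"] += 1
        # Correct when the prediction matches the true gaze target.
        c["correct"] += int(trial["prediction"] == trial["gaze_target"])
        # "Head follow" when the prediction matches where the head points.
        head = trial.get("head_target")
        c["head_follow"] += int(head is not None and trial["prediction"] == head)
    return {
        cond: (c["correct"] / c["n"], c["head_follow"] / c["n"])
        for cond, c in counts.items()
    }


# Toy trials (made up for illustration only).
toy_trials = [
    {"condition": "aligned", "gaze_target": "mug", "head_target": "mug", "prediction": "mug"},
    {"condition": "aligned", "gaze_target": "pen", "head_target": "pen", "prediction": "pen"},
    {"condition": "misaligned", "gaze_target": "mug", "head_target": "pen", "prediction": "pen"},
    {"condition": "misaligned", "gaze_target": "cup", "head_target": "book", "prediction": "book"},
    {"condition": "misaligned", "gaze_target": "book", "head_target": "mug", "prediction": "book"},
    {"condition": "unconstrained", "gaze_target": "pen", "head_target": None, "prediction": "pen"},
]

summary = summarize_by_condition(toy_trials)
for condition, (accuracy, head_follow) in summary.items():
    print(f"{condition}: accuracy={accuracy:.2f}, head_follow={head_follow:.2f}")
```

Under these assumptions, a model that relies on head orientation would show high accuracy in the aligned condition but low accuracy and a high head-follow rate in the misaligned condition, which is the pattern the comparison is meant to expose.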
Why do VLMs show head orientation bias? We think it is due to the data, not the architecture. Fine-tuning GazeLLE on our stimulus set, which contains more instances that require fine-grained processing of eye details beyond head orientation, greatly improves its performance in cases where head orientation does not align with gaze direction. This serves as a proof of concept that head orientation bias can be mitigated. We therefore recommend targeted training data in which the correct answer can only be reached from gaze cues, not from shortcuts such as contextual cues or head and body orientation.
@inproceedings{vlmGaze2026,
title={Vision-Language Models Mistake Head Orientation for Gaze Direction: Nonverbal Conversation Cues},
author={Zhang, Zory and Feng, Pinyuan and Wang, Bingyang and Zhao, Tianwei and Yu, Suyang and Gao, Qingying and Deng, Hokin and Ma, Ziqiao and Li, Yijiang and Luo, Dezhi},
booktitle={Findings of the Association for Computational Linguistics: ACL 2026},
publisher={Association for Computational Linguistics},
year={2026},
url={https://arxiv.org/abs/2506.05412},
}