HandVQA: Diagnosing Fine-Grained Spatial Reasoning Failures in Vision-Language Models via Hand Pose Question Answering

Author(s)
Sayem, MD Khalequzzaman Chowdhury
Advisor
Baek, Seungryul
Issued Date
2025-08
URI
https://scholarworks.unist.ac.kr/handle/201301/88295
http://unist.dcollection.net/common/orgView/200000903978
Abstract
Understanding the nuanced articulation of human hands is essential for high-stakes applications such as robot-assisted surgery, chip manufacturing, and human-AI interaction in AR/VR. Despite achieving near-human performance on general vision-language benchmarks, current vision-language models (VLMs) struggle with fine-grained spatial reasoning, especially in interpreting complex, articulated hand poses. We introduce HandVQA, a large-scale diagnostic benchmark designed to evaluate VLMs’ understanding of detailed hand anatomy through visual question answering. Built upon high-quality 3D hand datasets (FreiHAND, InterHand2.6M, FPHA), our benchmark includes over 1.6M controlled multiple-choice questions that probe spatial relationships between hand joints, such as angles, distances, and relative positions. We evaluate several state-of-the-art VLMs (LLaVA, DeepSeek, Qwen-VL, mPLUG) in both base and fine-tuned settings, using lightweight fine-tuning via LoRA. Our findings reveal systematic limitations in current models, including hallucinated finger parts, incorrect geometric interpretations, and poor generalization. HandVQA not only exposes critical reasoning gaps but also offers a concrete path toward improving spatial grounding in multimodal language models.
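As an illustration of how a question about joint angles or distances might be derived from 3D hand keypoints, the following is a minimal sketch in Python. It is not the thesis's actual pipeline: the function names, the angle thresholds (150 and 90 degrees), the answer labels, and the question wording are all hypothetical choices made for this example.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at joint b, between segments b->a and b->c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, dot / (n1 * n2)))
    return math.degrees(math.acos(cos_theta))


def joint_distance(p, q):
    """Euclidean distance between two 3D joints."""
    return math.dist(p, q)


# Hypothetical answer labels, not taken from the thesis.
OPTIONS = ["nearly straight", "slightly bent", "strongly bent"]


def angle_bin(angle_deg):
    """Map a joint angle to a coarse label (thresholds are assumptions)."""
    if angle_deg >= 150:
        return OPTIONS[0]
    if angle_deg >= 90:
        return OPTIONS[1]
    return OPTIONS[2]


def make_question(finger, a, b, c):
    """Build one multiple-choice item from three consecutive joints."""
    return {
        "question": f"How bent is the {finger} finger at its middle joint?",
        "options": list(OPTIONS),
        "answer": angle_bin(joint_angle(a, b, c)),
    }


if __name__ == "__main__":
    # Three collinear joints: the finger is fully extended.
    item = make_question("index", (0, 0, 0), (1, 0, 0), (2, 0, 0))
    print(item["question"], item["options"], item["answer"])
```

Deriving the correct option from the 3D annotations rather than from the image is what would keep such questions controlled: the ground-truth answer follows deterministically from the joint coordinates.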
Publisher
Ulsan National Institute of Science and Technology
Degree
Master
Major
Department of Computer Science and Engineering

Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.