File Download

There are no files associated with this item.

Full metadata record

DC Field Value Language
dc.contributor.advisor Baek, Seungryul -
dc.contributor.author Sayem, MD Khalequzzaman Chowdhury -
dc.date.accessioned 2025-09-29T11:31:31Z -
dc.date.available 2025-09-29T11:31:31Z -
dc.date.issued 2025-08 -
dc.description.abstract Understanding the nuanced articulation of human hands is essential for high-stakes applications such as robot-assisted surgery, chip manufacturing, and human-AI interaction in AR/VR. Despite achieving near-human performance on general vision-language benchmarks, current vision-language models (VLMs) struggle with fine-grained spatial reasoning—especially in interpreting complex, articulated hand poses. We introduce HandVQA, a large-scale diagnostic benchmark designed to evaluate VLMs’ understanding of detailed hand anatomy through visual question answering. Built upon high-quality 3D hand datasets (FreiHAND, InterHand2.6M, FPHA), our benchmark includes over 1.6M controlled multiple-choice questions that probe spatial relationships between hand joints, such as angles, distances, and relative positions. We evaluate several state-of-the-art VLMs (LLaVA, DeepSeek, Qwen-VL, mPLUG) in both base and fine-tuned settings, using lightweight fine-tuning via LoRA. Our findings reveal systematic limitations in current models, including hallucinated finger parts, incorrect geometric interpretations, and poor generalization. HandVQA not only exposes critical reasoning gaps but also offers a concrete path toward improving spatial grounding in multimodal language models. -
dc.description.degree Master -
dc.description Department of Computer Science and Engineering -
dc.identifier.uri https://scholarworks.unist.ac.kr/handle/201301/88295 -
dc.identifier.uri http://unist.dcollection.net/common/orgView/200000903978 -
dc.language ENG -
dc.publisher Ulsan National Institute of Science and Technology -
dc.rights.embargoReleaseDate 9999-12-31 -
dc.rights.embargoReleaseTerms 9999-12-31 -
dc.subject Vision Language Models, VQA, Hand Pose -
dc.title HandVQA: Diagnosing Fine-Grained Spatial Reasoning Failures in Vision-Language Models via Hand Pose Question Answering -
dc.type Thesis -
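
To make the abstract's description of "angles, distances, and relative positions" between hand joints concrete, the following is a minimal, hypothetical sketch of how such joint-level quantities could be derived from 3D hand keypoints and turned into a multiple-choice question. The 21-joint layout, joint names, distractor scheme, and question format below are illustrative assumptions and do not reproduce the benchmark's actual generation code.

# Illustrative sketch only: the 21-joint layout, joint names, and question
# format are assumptions for illustration, not the benchmark's pipeline.
import numpy as np

# A common 21-keypoint hand layout (wrist + 4 joints per finger), assumed here.
JOINT_NAMES = [
    "wrist",
    "thumb_cmc", "thumb_mcp", "thumb_ip", "thumb_tip",
    "index_mcp", "index_pip", "index_dip", "index_tip",
    "middle_mcp", "middle_pip", "middle_dip", "middle_tip",
    "ring_mcp", "ring_pip", "ring_dip", "ring_tip",
    "pinky_mcp", "pinky_pip", "pinky_dip", "pinky_tip",
]
IDX = {name: i for i, name in enumerate(JOINT_NAMES)}

def joint_angle(joints: np.ndarray, a: str, b: str, c: str) -> float:
    """Angle in degrees at joint b, formed by the segments b->a and b->c."""
    v1 = joints[IDX[a]] - joints[IDX[b]]
    v2 = joints[IDX[c]] - joints[IDX[b]]
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def joint_distance(joints: np.ndarray, a: str, b: str) -> float:
    """Euclidean distance between two joints, in the keypoints' own units."""
    return float(np.linalg.norm(joints[IDX[a]] - joints[IDX[b]]))

def angle_question(joints: np.ndarray) -> dict:
    """Build one multiple-choice question about the index-finger PIP angle."""
    angle = joint_angle(joints, "index_mcp", "index_pip", "index_dip")
    correct = round(angle / 10) * 10          # snap to a 10-degree bin
    distractors = [correct - 30, correct + 30, correct + 60]
    options = sorted(set([correct] + distractors))
    return {
        "question": "What is the approximate angle at the index finger PIP joint?",
        "options": [f"{o} degrees" for o in options],
        "answer": f"{correct} degrees",
    }

if __name__ == "__main__":
    joints = np.random.rand(21, 3)            # stand-in for real 3D annotations
    print(angle_question(joints))
    print("thumb-index tip distance:",
          round(joint_distance(joints, "thumb_tip", "index_tip"), 3))

Under these assumptions, a controlled question is produced by computing a geometric quantity from annotated keypoints and pairing the snapped value with fixed-offset distractors; relative-position questions (e.g., which fingertip is closest to the thumb tip) would follow the same pattern using joint_distance.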

Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.