Understanding the nuanced articulation of human hands is essential for high-stakes applications such as robot-assisted surgery, chip manufacturing, and human-AI interaction in AR/VR. Despite achieving near-human performance on general vision-language benchmarks, current vision-language models (VLMs) struggle with fine-grained spatial reasoning, especially when interpreting complex, articulated hand poses. We introduce HandVQA, a large-scale diagnostic benchmark designed to evaluate VLMs' understanding of detailed hand anatomy through visual question answering. Built upon high-quality 3D hand datasets (FreiHAND, InterHand2.6M, FPHA), our benchmark comprises over 1.6M controlled multiple-choice questions that probe spatial relationships between hand joints, such as joint angles, inter-joint distances, and relative positions. We evaluate several state-of-the-art VLMs (LLaVA, DeepSeek, Qwen-VL, mPLUG) in both base and fine-tuned settings, using lightweight fine-tuning via LoRA. Our findings reveal systematic limitations in current models, including hallucinated finger parts, incorrect geometric interpretations, and poor generalization. HandVQA not only exposes critical reasoning gaps but also offers a concrete path toward improving spatial grounding in multimodal language models.
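To make the question-generation idea concrete, the following is a minimal sketch of how multiple-choice items about joint angles could be derived from 3D joint annotations. It is an illustration under assumptions, not the authors' released pipeline: the 21-joint layout, the `joint_angle` and `make_angle_question` helpers, the distractor offsets, and the question template are all hypothetical.

```python
# Hypothetical sketch of deriving HandVQA-style multiple-choice questions
# from 3D hand keypoints; helper names and the question template are
# illustrative, not the authors' actual generation pipeline.
import numpy as np

def joint_angle(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> float:
    """Angle at joint b (in degrees) formed by segments b->a and b->c."""
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def joint_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance between two joints."""
    return float(np.linalg.norm(a - b))

def make_angle_question(joints: np.ndarray, i: int, j: int, k: int,
                        rng: np.random.Generator) -> dict:
    """Build one controlled multiple-choice item about the angle at joint j."""
    truth = joint_angle(joints[i], joints[j], joints[k])
    # Distractors: offset the true angle so exactly one option is correct.
    offsets = rng.choice([-40, -20, 20, 40], size=3, replace=False)
    options = [truth] + [truth + d for d in offsets]
    rng.shuffle(options)
    return {
        "question": f"What is the approximate angle at joint {j} "
                    f"formed by joints {i} and {k}?",
        "options": [f"{o:.0f} degrees" for o in options],
        "answer": f"{truth:.0f} degrees",
    }

rng = np.random.default_rng(0)
joints = rng.normal(size=(21, 3))  # stand-in for a real annotated hand
print(make_angle_question(joints, 5, 6, 7, rng))
```

Because both the question and the answer are computed directly from ground-truth 3D annotations, items generated this way are controlled by construction: distractors stay a fixed angular offset from the truth, which keeps difficulty consistent across the dataset.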
Publisher: Ulsan National Institute of Science and Technology