While multimodal research has progressed rapidly with Vision-Language Models (VLMs) such as LLaVA and GPT-4, linking 3D hand geometry to natural language remains largely uncharted territory. Existing VLMs struggle to capture joint-specific details of hand poses, primarily due to the absence of fine-grained datasets that articulate the intricacies of 3D hand structure. Bridging this gap could unlock a range of valuable applications, including the generation of posed hands for animation, improved form correction in remote physical therapy, and beyond. To address this need, this work introduces a novel framework that generates precise, joint-level text annotations from 3D hand data through an automatic, geometry-based captioning pipeline, establishing a bridge between hand geometry and natural-language descriptions. In experiments on two publicly available 3D hand datasets, these joint-level captions significantly improved reconstruction accuracy, validating the proposed approach as the first multimodal hand mesh reconstruction model. This framework advances the capabilities of VLM-driven 3D hand representation and sets the stage for more nuanced multimodal applications.
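To make the idea of geometry-based, joint-level captioning concrete, the following is a minimal sketch of what such a step could look like. It assumes a hypothetical 21-keypoint hand skeleton, finger indexing, angle thresholds, and phrase templates chosen purely for illustration; the actual captioning pipeline described in this work may differ in all of these details.

```python
import numpy as np

# Hypothetical joint ordering (21-keypoint hand: wrist + 4 joints per finger).
# This indexing is an assumption for illustration only.
FINGERS = {
    "thumb":  [1, 2, 3, 4],
    "index":  [5, 6, 7, 8],
    "middle": [9, 10, 11, 12],
    "ring":   [13, 14, 15, 16],
    "pinky":  [17, 18, 19, 20],
}
WRIST = 0

def joint_angle(a, b, c):
    """Interior angle (degrees) at joint b formed by segments b->a and b->c."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def caption_hand(joints):
    """Map 3D keypoints (21 x 3) to joint-level phrases via simple angle thresholds."""
    phrases = []
    for name, idx in FINGERS.items():
        chain = [WRIST] + idx
        # Angle at the proximal joint of the finger chain (thresholds are illustrative).
        ang = joint_angle(joints[chain[0]], joints[chain[1]], joints[chain[2]])
        if ang > 150:
            phrases.append(f"the {name} finger is fully extended")
        elif ang > 100:
            phrases.append(f"the {name} finger is slightly bent")
        else:
            phrases.append(f"the {name} finger is curled toward the palm")
    return ", ".join(phrases) + "."

if __name__ == "__main__":
    # Random keypoints stand in for real 3D hand data.
    print(caption_hand(np.random.rand(21, 3)))
```

A pipeline of this kind is automatic because every phrase is derived directly from joint geometry rather than human annotation, which is what allows joint-level captions to be produced at dataset scale.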
Publisher: Ulsan National Institute of Science and Technology