
3D Pose and Shape Estimation for Clothed Multi-Person from a Single Image

Author(s)
Cha, Junuk
Advisor
Baek, Seungryul
Issued Date
2025-02
URI
https://scholarworks.unist.ac.kr/handle/201301/86407
http://unist.dcollection.net/common/orgView/200000865232
Abstract
Estimating 3D poses and shapes in the form of meshes from monocular RGB images is a challenging task, particularly in multi-person scenarios where occlusions and complex interactions introduce significant ambiguities. This paper proposes a novel coarse-to-fine pipeline to address these challenges, extending traditional 3D pose and shape estimation to clothed human mesh reconstruction. By tackling occlusions and missing geometry, the method delivers a comprehensive solution for reconstructing accurate and physically plausible 3D human meshes. The pipeline combines robust 3D skeleton estimation with advanced mesh refinement techniques to achieve significant improvements in scenarios involving multiple interacting individuals.

The pipeline begins by estimating occlusion-resistant 3D skeletons for multiple persons from a single RGB image. These skeletons, designed to handle partial visibility, are transformed into deformable 3D mesh parameters through inverse kinematics, providing an initial coarse representation. To refine these meshes, a Transformer-based relation-aware module is employed, which considers both intra-person dynamics (e.g., spatial consistency of body parts) and inter-person interactions (e.g., spatial relations between individuals). This multi-level refinement enhances the realism and accuracy of the resulting meshes, even in complex scenes.

To handle clothed human meshes in globally coherent scene spaces, the pipeline addresses critical issues such as missing body parts and physical implausibilities, including self-penetration and person-to-person penetration. Two innovative human priors are introduced to overcome these challenges. The geometry prior uses an encoder-decoder architecture to recover detailed 3D features from incomplete body geometry and combines these with a surface normal map to reconstruct realistic, detailed clothed meshes. The contact prior, on the other hand, employs an image-space contact detector to enforce physical plausibility by estimating and maintaining realistic surface contacts between individuals.

Extensive experiments conducted on benchmark datasets, including 3DPW, MuPoTS, AGORA, and MultiHuman, demonstrate the pipeline's effectiveness. For pose and shape estimation, the method excels at managing occlusions and interactions, while for clothed mesh reconstruction it achieves penetration-free results with detailed textures and surface features. These results confirm the superiority of the approach over existing methods, highlighting its ability to handle diverse and challenging scenarios in 3D human mesh reconstruction.

In conclusion, this paper presents a significant advancement in 3D human mesh reconstruction from monocular RGB images. By integrating robust skeletal estimation, Transformer-based refinement, and innovative human priors, the proposed pipeline delivers accurate, coherent, and physically plausible clothed human meshes. This work not only addresses long-standing challenges such as occlusions and physical implausibilities but also opens new opportunities for applications in virtual reality, gaming, and digital human modeling, setting a new benchmark in multi-person 3D reconstruction.
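The four stages summarized above (skeleton estimation, inverse kinematics, relation-aware refinement, and prior-based mesh recovery) can be sketched as a simple data-flow pipeline. This is only an illustrative sketch of the structure described in the abstract: every function name, tensor shape, and placeholder computation below is a hypothetical assumption (an SMPL-style 24-joint, 6890-vertex body model is assumed), not the thesis's actual implementation or API.

```python
# Illustrative data-flow sketch of the coarse-to-fine pipeline from the
# abstract. All names, shapes, and stage bodies are hypothetical placeholders.
import numpy as np

NUM_JOINTS = 24       # assumed SMPL-style joint count
NUM_VERTICES = 6890   # assumed SMPL mesh resolution

def estimate_skeletons(image, num_people):
    """Stage 1 (assumed): occlusion-resistant 3D skeletons per person."""
    rng = np.random.default_rng(0)  # dummy joints in place of a real network
    return rng.standard_normal((num_people, NUM_JOINTS, 3))

def inverse_kinematics(skeletons):
    """Stage 2 (assumed): map 3D joints to deformable mesh parameters."""
    # Placeholder: flatten joints into one parameter vector per person.
    return skeletons.reshape(skeletons.shape[0], -1)

def relation_aware_refine(params):
    """Stage 3 (assumed): Transformer-style intra-/inter-person refinement,
    approximated here by mixing each person's parameters with the scene mean."""
    scene_context = params.mean(axis=0, keepdims=True)  # inter-person cue
    return 0.9 * params + 0.1 * scene_context

def apply_priors(params):
    """Stage 4 (assumed): geometry + contact priors yield final clothed meshes.
    Placeholder decoder: broadcast parameters onto a dummy vertex set."""
    num_people = params.shape[0]
    offsets = params[:, :3, None].mean(axis=1, keepdims=True)  # (P, 1, 1)
    return np.zeros((num_people, NUM_VERTICES, 3)) + offsets

def reconstruct(image, num_people=2):
    """Full coarse-to-fine pass: image -> per-person clothed meshes."""
    skeletons = estimate_skeletons(image, num_people)
    coarse = inverse_kinematics(skeletons)
    refined = relation_aware_refine(coarse)
    return apply_priors(refined)

meshes = reconstruct(image=None, num_people=3)
print(meshes.shape)  # (3, 6890, 3): one vertex array per person
```

The point of the sketch is the staged structure: a coarse, parameter-level representation is produced first, and scene-level context (other people) only enters at the refinement and prior stages, which is what lets the method remain robust when individual joints are occluded.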
Publisher
Ulsan National Institute of Science and Technology
Degree
Doctor
Major
Graduate School of Artificial Intelligence

Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.