
Generalizing Human-Centric Representations For Pose Estimation and Hand-Object Interaction

Author(s)
Jeong, Uyoung
Advisor
Baek, Seungryul
Issued Date
2025-08
URI
https://scholarworks.unist.ac.kr/handle/201301/88259
http://unist.dcollection.net/common/orgView/200000904455
Abstract
We study human representations to enhance generalization in downstream applications, focusing on three challenging tasks: 2D multi-person pose estimation, unified multi-dataset training for 2D pose estimation, and photorealistic 3D hand-object interaction generation. Representation learning forms the foundational scaffold of deep learning systems, and its refinement has become essential in the era of general-purpose AI. Accordingly, we address critical challenges in three domains: instance-level discrimination in 2D multi-person pose estimation, representation unification across heterogeneous pose datasets, and photorealistic 3D hand-object interaction generation using large-scale generative models. The first study proposes BoIR, a bounding-box-level instance representation learning framework that enhances robustness in densely populated scenes. Through a multi-task learning scheme that integrates contrastive instance embeddings with spatially enriched keypoint estimation, BoIR achieves state-of-the-art performance in multi-person pose estimation under occlusion. The second contribution, PoseBH, tackles the longstanding issue of skeletal heterogeneity in multi-dataset training. By introducing nonparametric keypoint prototypes within a unified embedding space and leveraging cross-type self-supervision, PoseBH aligns semantically similar keypoints across diverse pose datasets, improving generalization to novel datasets while maintaining high accuracy on established benchmarks. The final study introduces THOM, a pipeline for text-guided generation of 3D hand-object interacting meshes. Addressing limitations in shape diversity and physical plausibility, THOM employs a two-stage optimization strategy grounded in Gaussian representation learning and enhanced with optimization of the compositional Gaussians and their interactions.
This method enables the synthesis of topologically coherent and photorealistic 3D interactions, significantly outperforming existing approaches in semantic disentanglement and physical plausibility. Collectively, these contributions extend the frontiers of human representation learning from discriminative perception to generative modeling, suggesting a paradigm shift toward integrating large-scale multi-modal models in downstream human-centric tasks. This dissertation advocates a thoughtful balance between domain specificity and general-purpose modeling to ensure robust and scalable research in human representation learning. As a final remark, we propose several directions for future work that would further expand the boundaries of human-centric tasks.
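The contrastive instance embeddings described for BoIR can be illustrated with a minimal InfoNCE-style objective over per-pixel embeddings: vectors sampled from the same person are pulled together, while vectors from different people are pushed apart. This is a generic sketch of that idea, not the dissertation's actual loss; the function name, temperature, and sampling scheme are assumptions for illustration.

```python
import numpy as np

def infonce_instance_loss(embeddings, instance_ids, temperature=0.1):
    """Generic InfoNCE-style contrastive loss over instance embeddings.

    embeddings:   (N, D) embedding vectors sampled from predicted feature maps.
    instance_ids: (N,) integer person-instance label for each embedding.
    Returns the mean negative log-probability of positive (same-instance) pairs.
    """
    # L2-normalize so dot products are cosine similarities.
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = embeddings @ embeddings.T / temperature        # (N, N) scaled similarities
    np.fill_diagonal(sim, -np.inf)                       # exclude self-pairs
    # Row-wise log-softmax over all other embeddings.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    same = instance_ids[:, None] == instance_ids[None, :]
    np.fill_diagonal(same, False)
    # Average log-probability of each anchor's positives, negated.
    losses = [-log_prob[i, same[i]].mean()
              for i in range(len(sim)) if same[i].any()]
    return float(np.mean(losses))
```

With well-separated instances (same-person embeddings identical, different persons orthogonal) the loss approaches zero, while shuffled instance labels drive it up, which is the behavior a discriminative instance representation relies on.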
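The nonparametric keypoint prototypes in PoseBH can be pictured as a nearest-prototype assignment in a shared embedding space: keypoints from any skeleton format are matched to the prototype they are most similar to, which is what lets semantically equivalent keypoints from different datasets align. The sketch below shows only this matching step under assumed shapes and cosine similarity; it is not the dissertation's implementation.

```python
import numpy as np

def match_keypoints_to_prototypes(keypoint_embs, prototypes):
    """Assign each keypoint embedding to its nearest prototype.

    keypoint_embs: (K, D) embeddings of keypoints from some skeleton format.
    prototypes:    (P, D) keypoint prototypes shared across datasets.
    Returns (assignments, similarities): the index of the best-matching
    prototype for each keypoint and the corresponding cosine similarity.
    """
    k = keypoint_embs / np.linalg.norm(keypoint_embs, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sim = k @ p.T                      # (K, P) cosine similarities
    assignments = sim.argmax(axis=1)   # nearest prototype per keypoint
    return assignments, sim.max(axis=1)
```

In a unified embedding space, two datasets' "left wrist" keypoints would map to the same prototype even if their skeleton definitions differ, which is the alignment effect described above.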
Publisher
Ulsan National Institute of Science and Technology
Degree
Doctor
Major
Graduate School of Artificial Intelligence


Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.