
Generalizing Human-Centric Representations For Pose Estimation and Hand-Object Interaction

Author(s)
Jeong, Uyoung
Advisor
Baek, Seungryul
Issued Date
2025-08
URI
https://scholarworks.unist.ac.kr/handle/201301/88259
http://unist.dcollection.net/common/orgView/200000904455
Abstract
We study human representations to enhance generalization in downstream applications, focusing on three challenging tasks: 2D multi-person pose estimation, unified multi-dataset training for 2D pose estimation, and photorealistic 3D hand-object interaction generation. Representation learning forms the foundational scaffold of deep learning systems, and its refinement has become essential in the era of general-purpose AI. Accordingly, we address critical challenges in three domains: instance-level discrimination in 2D multi-person pose estimation, representation unification across heterogeneous pose datasets, and photorealistic 3D hand-object interaction generation using large-scale generative models. The first study proposes BoIR, a bounding-box-level instance representation learning framework that enhances robustness in densely populated scenes. Through a multi-task learning scheme that integrates contrastive instance embeddings with spatially enriched keypoint estimation, BoIR achieves state-of-the-art performance in multi-person pose estimation under occlusion. The second contribution, PoseBH, tackles the longstanding issue of skeletal heterogeneity in multi-dataset training. By introducing nonparametric keypoint prototypes within a unified embedding space and leveraging cross-type self-supervision, PoseBH aligns semantically similar keypoints across diverse pose datasets, improving generalization to novel datasets while maintaining high accuracy on established benchmarks. The final study introduces THOM, a pipeline for text-guided generation of 3D hand-object interacting meshes. Addressing limitations in shape diversity and physical plausibility, THOM employs a two-stage optimization strategy grounded in Gaussian representation learning and enhanced with optimization of the compositional Gaussians and their interactions.
This method enables the synthesis of topologically coherent and photorealistic 3D interactions, significantly outperforming existing approaches in semantic disentanglement and physical plausibility. Collectively, these contributions extend the frontiers of human representation learning from discriminative perception to generative modeling, suggesting a paradigm shift toward integrating large-scale multi-modal models in downstream human-centric tasks. This dissertation advocates a thoughtful balance between domain specificity and general-purpose modeling to ensure robust and scalable research in human representation learning. As a final remark, we propose several directions for future work that would further expand the boundaries of human-centric tasks.
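The contrastive instance embeddings described for BoIR can be illustrated with a minimal InfoNCE-style objective over per-pixel embeddings: vectors sampled from the same person are pulled together, while vectors from different people are pushed apart. This is a generic sketch of that idea, not the dissertation's actual loss; the function name, temperature, and sampling scheme are assumptions for illustration.

```python
import numpy as np

def infonce_instance_loss(embeddings, instance_ids, temperature=0.1):
    """Generic InfoNCE-style contrastive loss over instance embeddings.

    embeddings:   (N, D) embedding vectors sampled from predicted feature maps.
    instance_ids: (N,) integer person-instance label for each embedding.
    Returns the mean negative log-probability of positive (same-instance) pairs.
    """
    # L2-normalize so dot products are cosine similarities.
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = embeddings @ embeddings.T / temperature        # (N, N) scaled similarities
    np.fill_diagonal(sim, -np.inf)                       # exclude self-pairs
    # Row-wise log-softmax over all other embeddings.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    same = instance_ids[:, None] == instance_ids[None, :]
    np.fill_diagonal(same, False)
    # Average log-probability of each anchor's positives, negated.
    losses = [-log_prob[i, same[i]].mean()
              for i in range(len(sim)) if same[i].any()]
    return float(np.mean(losses))
```

With well-separated instances (same-person embeddings identical, different persons orthogonal) the loss approaches zero, while shuffled instance labels drive it up, which is the behavior a discriminative instance representation relies on.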
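The nonparametric keypoint prototypes in PoseBH can be pictured as a nearest-prototype assignment in a shared embedding space: keypoints from any skeleton format are matched to the prototype they are most similar to, which is what lets semantically equivalent keypoints from different datasets align. The sketch below shows only this matching step under assumed shapes and cosine similarity; it is not the dissertation's implementation.

```python
import numpy as np

def match_keypoints_to_prototypes(keypoint_embs, prototypes):
    """Assign each keypoint embedding to its nearest prototype.

    keypoint_embs: (K, D) embeddings of keypoints from some skeleton format.
    prototypes:    (P, D) keypoint prototypes shared across datasets.
    Returns (assignments, similarities): the index of the best-matching
    prototype for each keypoint and the corresponding cosine similarity.
    """
    k = keypoint_embs / np.linalg.norm(keypoint_embs, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sim = k @ p.T                      # (K, P) cosine similarities
    assignments = sim.argmax(axis=1)   # nearest prototype per keypoint
    return assignments, sim.max(axis=1)
```

In a unified embedding space, two datasets' "left wrist" keypoints would map to the same prototype even if their skeleton definitions differ, which is the alignment effect described above.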
Publisher
Ulsan National Institute of Science and Technology
Degree
Doctor
Major
Graduate School of Artificial Intelligence


Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.