
Detailed Information


Full metadata record

DC Field Value Language
dc.contributor.advisor Baek, Seungryul -
dc.contributor.author Jeong, Uyoung -
dc.date.accessioned 2025-09-29T11:31:15Z -
dc.date.available 2025-09-29T11:31:15Z -
dc.date.issued 2025-08 -
dc.description.abstract We study human representations to enhance generalization capabilities in downstream applications, specifically focusing on three challenging tasks: 2D multi-person pose estimation, unified multi-dataset training for 2D pose estimation, and photorealistic 3D hand-object interaction generation. Representation learning forms the foundational scaffold of deep learning systems, and its refinement has become essential in the era of general-purpose AI. In this work, we address critical challenges in three domains: instance-level discrimination in 2D multi-person pose estimation, representation unification across heterogeneous pose datasets, and photorealistic 3D hand-object interaction generation using large-scale generative models. The first study proposes BoIR, a bounding box-level instance representation learning framework that enhances robustness in densely populated scenes. Through a multi-task learning scheme that integrates contrastive instance embeddings with spatially enriched keypoint estimation, BoIR achieves state-of-the-art performance in multi-person pose estimation under occlusions. The second contribution, PoseBH, tackles the longstanding issue of skeletal heterogeneity in multi-dataset training. By introducing nonparametric keypoint prototypes within a unified embedding space and leveraging cross-type self-supervision, PoseBH effectively aligns semantically similar keypoints across diverse pose datasets. This approach demonstrates improved generalization to novel datasets while maintaining high accuracy on established benchmarks. The final study introduces THOM, a novel pipeline for text-guided generation of 3D hand-object interacting meshes. Addressing limitations in shape diversity and physical plausibility, THOM employs a two-stage optimization strategy grounded in Gaussian representation learning and enhanced with optimization of the compositional Gaussians and interactions. This method enables the synthesis of topologically coherent and photorealistic 3D interactions, significantly outperforming existing approaches in semantic disentanglement and physical plausibility. Collectively, these contributions extend the frontiers of human representation learning from discriminative perception to generative modeling, suggesting a paradigm shift towards integrating large-scale multi-modal models in downstream human-centric tasks. This dissertation advocates for a thoughtful balance between domain specificity and general-purpose modeling to ensure robust and scalable research in human representation learning. As a final remark, we propose several future work directions that would further expand the boundaries of human-centric tasks. -
dc.description.degree Doctor -
dc.description Graduate School of Artificial Intelligence -
dc.identifier.uri https://scholarworks.unist.ac.kr/handle/201301/88259 -
dc.identifier.uri http://unist.dcollection.net/common/orgView/200000904455 -
dc.language ENG -
dc.publisher Ulsan National Institute of Science and Technology -
dc.rights.embargoReleaseDate 9999-12-31 -
dc.rights.embargoReleaseTerms 9999-12-31 -
dc.subject computer vision,human pose estimation,multi-dataset training,multi-person pose estimation,hand-object interaction generation -
dc.title Generalizing Human-Centric Representations For Pose Estimation and Hand-Object Interaction -
dc.type Thesis -

Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.