<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel rdf:about="https://scholarworks.unist.ac.kr/handle/201301/84">
    <title>Repository Collection:</title>
    <link>https://scholarworks.unist.ac.kr/handle/201301/84</link>
    <description />
    <items>
      <rdf:Seq>
        <rdf:li rdf:resource="https://scholarworks.unist.ac.kr/handle/201301/91046" />
        <rdf:li rdf:resource="https://scholarworks.unist.ac.kr/handle/201301/88259" />
        <rdf:li rdf:resource="https://scholarworks.unist.ac.kr/handle/201301/88258" />
        <rdf:li rdf:resource="https://scholarworks.unist.ac.kr/handle/201301/86407" />
      </rdf:Seq>
    </items>
    <dc:date>2026-04-08T00:28:18Z</dc:date>
  </channel>
  <item rdf:about="https://scholarworks.unist.ac.kr/handle/201301/91046">
    <title>Reliable and Interpretable Evaluation in Deep Representational Models</title>
    <link>https://scholarworks.unist.ac.kr/handle/201301/91046</link>
    <description>Title: Reliable and Interpretable Evaluation in Deep Representational Models
Author(s): Kim, Pum Jun
Abstract: Recent advances in deep generative models in computer vision have extended their capabilities from image generation to diverse domains such as video and 3D object generation. At the core of these advances is the development of reliable and accurate evaluation metrics. These metrics assess generative models from a human perceptual perspective, measuring how closely the generated data resembles real-world data and effectively highlighting their differences. This thesis investigates recent advances in evaluation metrics by examining the key contributions of Article 1, Article 2, and Article 3. In addition, it identifies open challenges in evaluation that remain critical for the development of more powerful and reliable deep generative models.
Article 1 introduces a novel evaluation metric for image generative models that measures the level of realism along two key aspects: fidelity and diversity. Existing metrics typically estimate the distributions of real and generated data in model embedding spaces that reflect human perception, and compute scores by comparing these distributions. However, generative models that are not properly trained often produce noisy data, and in the presence of such noise, existing metrics are unable to provide reliable and accurate evaluations. To address this issue, this work proposes a robust evaluation approach by estimating statistically and topologically significant supports for both real and generated data. This distribution estimation method is sensitive to subtle variations in the data distribution and provides more accurate and reliable evaluation results, even in the presence of noise.
Article 2 introduces a novel evaluation metric for video generative models that measures realism along three aspects: fidelity, diversity, and temporal naturalness. Existing video metrics have largely relied on techniques developed for image generation models, which often fail to capture the temporal characteristics inherent in video data, resulting in incomplete or unreliable evaluations. To address this limitation, this work leverages the observation that frame-wise changes in typical videos exhibit amplitude distributions following a power law in the Fourier domain. By estimating this power-law distribution, the proposed metric quantitatively measures the deviation of generated videos from the natural distribution, providing the first principled evaluation of temporal consistency in video generation.
Article 3 proposes a benchmark that enables comparison between object recognition models and humans, and allows model analysis from a human visual perspective. The existing benchmark, which uses stylized images that blend shape and texture within a single image, suggests that humans primarily rely on shape, whereas models focus on texture. However, this prior work suffers from several limitations: (1) it does not utilize data representing pure shape and pure texture, (2) it does not consider images in which shape and texture are present in equal proportion (50:50), and (3) it employs evaluation measures that are not well-suited for model analysis and comparison. To address these limitations, Article 3 generates disentangled datasets that contain pure shape and texture cues and proposes a new metric that enables reliable and precise evaluation of models. This benchmark provides a clear and unbiased assessment of current object recognition models, enabling accurate measurement of how closely their reliance on shape and texture aligns with human perception.
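As an editorial illustration of the Article 2 idea above: amplitudes that follow a power law are linear in log-log space, so a least-squares line fit over the Fourier amplitudes of frame-wise changes exposes how far a generated video strays from natural temporal statistics. This is a minimal Python sketch with assumed names and an assumed residual-based deviation measure, not the thesis's actual metric:

import numpy as np

def temporal_powerlaw_deviation(frames: np.ndarray) -> float:
    """frames: (T, H, W) grayscale video; returns a naturalness deviation score.
    Hypothetical helper for illustration only."""
    diffs = np.diff(frames, axis=0)                          # frame-wise changes
    signal = diffs.reshape(diffs.shape[0], -1).mean(axis=1)  # mean change per step
    amps = np.abs(np.fft.rfft(signal))[1:]                   # Fourier amplitudes, DC dropped
    freqs = np.fft.rfftfreq(signal.shape[0])[1:]
    # A power law amps ~ c * freqs**(-alpha) is a straight line in log-log space.
    slope, intercept = np.polyfit(np.log(freqs), np.log(amps + 1e-12), 1)
    fitted = slope * np.log(freqs) + intercept
    # Mean squared residual from the fitted line: larger = less natural dynamics.
    return float(np.mean((np.log(amps + 1e-12) - fitted) ** 2))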
Major: Graduate School of Artificial Intelligence</description>
    <dc:date>2026-01-31T15:00:00Z</dc:date>
  </item>
  <item rdf:about="https://scholarworks.unist.ac.kr/handle/201301/88259">
    <title>Generalizing Human-Centric Representations For Pose Estimation and Hand-Object Interaction</title>
    <link>https://scholarworks.unist.ac.kr/handle/201301/88259</link>
    <description>Title: Generalizing Human-Centric Representations For Pose Estimation and Hand-Object Interaction
Author(s): Jeong, Uyoung
Abstract: We study human representations to enhance generalization capabilities in downstream applications, specifically focusing on three challenging tasks: 2D multi-person pose estimation, unified multi-dataset training for 2D pose estimation, and photorealistic 3D hand-object interaction generation. Representation learning forms the foundational scaffold of deep learning systems, and its refinement has become essential in the era of general-purpose AI. In this work, we address critical challenges in three domains: instance-level discrimination in 2D multi-person pose estimation, representation unification across heterogeneous pose datasets, and photorealistic 3D hand-object interaction generation using large-scale generative models.
The first study proposes BoIR, a bounding box-level instance representation learning framework that enhances robustness in densely populated scenes. Through a multi-task learning scheme that integrates contrastive instance embeddings with spatially enriched keypoint estimation, BoIR achieves state-of-the-art performance in multi-person pose estimation under occlusions.
The second contribution, PoseBH, tackles the longstanding issue of skeletal heterogeneity in multi-dataset training. By introducing nonparametric keypoint prototypes within a unified embedding space and leveraging cross-type self-supervision, PoseBH effectively aligns semantically similar keypoints across diverse pose datasets. This approach demonstrates improved generalization to novel datasets while maintaining high accuracy on established benchmarks.
The final study introduces THOM, a novel pipeline for text-guided generation of 3D hand-object interacting meshes. Addressing limitations in shape diversity and physical plausibility, THOM employs a two-stage optimization strategy grounded in Gaussian representation learning and enhanced with optimization of the compositional Gaussians and interactions. This method enables the synthesis of topologically coherent and photorealistic 3D interactions, significantly outperforming existing approaches in semantic disentanglement and physical plausibility.
Collectively, these contributions extend the frontiers of human representation learning from discriminative perception to generative modeling, suggesting a paradigm shift towards integrating large-scale multi-modal models in downstream human-centric tasks. This dissertation advocates for a thoughtful balance between domain specificity and general-purpose modeling to ensure robust and scalable research in human representation learning. As a final remark, we propose several future work directions that would further expand the boundaries of human-centric tasks.
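As a concrete (assumed) illustration of the contrastive instance embeddings mentioned for BoIR above, the following minimal InfoNCE-style loss treats embeddings of the same person instance under two views as positives and all other instances as negatives; the shapes, names, and loss form are illustrative, not the paper's exact objective:

import torch
import torch.nn.functional as F

def instance_contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                              temperature: float = 0.1) -> torch.Tensor:
    """emb_a, emb_b: (N, D) embeddings of the same N person instances under
    two views; row i of emb_a matches row i of emb_b (positive pair)."""
    a = F.normalize(emb_a, dim=1)
    b = F.normalize(emb_b, dim=1)
    logits = a @ b.t() / temperature                 # (N, N) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)          # diagonal entries are positives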
Major: Graduate School of Artificial Intelligence</description>
    <dc:date>2025-07-31T15:00:00Z</dc:date>
  </item>
  <item rdf:about="https://scholarworks.unist.ac.kr/handle/201301/88258">
    <title>Harnessing the Time and Channel Dynamics of Motion Time Series Classification</title>
    <link>https://scholarworks.unist.ac.kr/handle/201301/88258</link>
    <description>Title: Harnessing the Time and Channel Dynamics of Motion Time Series Classification
Author(s): Kim, Jaeho
Abstract: Time series data consists of two axes: time and channel. The time axis records information about temporal progression and events, while different channels (i.e., sensors) encode unique information based on their measurement properties and positioning. These time and channel characteristics vary significantly depending on the type of time series, and effectively harnessing them is essential for developing deep learning methodologies suited to the specific time series task. Our thesis focuses on motion time series, time series that capture different human motion activities with different sensors. We specifically examine how to effectively utilize the time and channel axes to develop methodologies that address the practical challenges of each domain. We focus on motion time series for two key reasons: their repetitive temporal periodicity and strong cross-sensor correlation present unique methodological challenges in the time series domain, while their applications in healthcare, sports science, and human-computer interaction offer practical real-world usage.
This work addresses three fundamental research questions at the intersection of deep learning and motion time series classification: (1) how to effectively harness time and channel characteristics to enable representation learning from unlabeled time series data, reducing dependence on scarce labeled datasets; (2) how to identify and validate the contribution of individual channels to classification performance, enabling sensor redundancy detection and enhancing model interpretability; and (3) how to develop transfer learning approaches that adapt models across users with different temporal patterns and channel dynamics, accounting for time- and channel-wise variability.
To address these challenges, this dissertation is structured as follows. Chapter 2 introduces PPT (Patch-order aware Pretext Task), a novel self-supervised learning strategy that explicitly supervises the ordering of patches across both temporal and channel dimensions, enabling effective representation learning from unlabeled motion time series. Chapter 3 presents CAFO (Channel Attention and Feature Orthogonalization), an explainable framework that identifies and validates channel importance through channel attention, enabling the identification of important or redundant sensors. Chapter 4 develops TransPL (Transitional Pseudo-Labeling), a domain adaptation methodology that generates high-quality pseudo-labels by explicitly modeling temporal and channel dynamics between source and target domains. TransPL outperforms traditional pseudo-labeling approaches by incorporating domain-specific temporal patterns and channel-wise shifts, enabling effective knowledge transfer across different users.
Through this thesis, we highlight the need to thoroughly understand and leverage the unique time and channel dynamics of motion time series to develop novel and suitable deep learning methodologies. Our approaches demonstrably reduce reliance on labeled data, improve model interpretability, and enhance transferability across different domains, ultimately providing practical deep learning methodologies for real-world motion time series challenges.
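To ground the PPT description above, here is a minimal Python sketch of a patch-order pretext task along the time axis: split a series into temporal patches, shuffle them, and let a model classify which permutation was applied. The 4-patch setup and all names are assumptions; the thesis itself supervises ordering over both the time and channel axes:

import itertools
import torch

PERMS = list(itertools.permutations(range(4)))       # 4 patches -> 24 permutation classes

def make_patch_order_sample(series: torch.Tensor):
    """series: (C, T) motion time series with T divisible by 4.
    Returns (shuffled series, permutation class id) as a self-supervised sample."""
    C, T = series.shape
    patches = series.reshape(C, 4, T // 4)           # split time axis into 4 patches
    label = int(torch.randint(len(PERMS), (1,)))
    order = torch.tensor(PERMS[label])
    shuffled = patches[:, order, :].reshape(C, T)    # apply the sampled patch order
    return shuffled, label                           # model is trained to predict label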
Major: Graduate School of Artificial Intelligence</description>
    <dc:date>2025-07-31T15:00:00Z</dc:date>
  </item>
  <item rdf:about="https://scholarworks.unist.ac.kr/handle/201301/86407">
    <title>3D Pose and Shape Estimation for Clothed Multi-Person from a Single Image</title>
    <link>https://scholarworks.unist.ac.kr/handle/201301/86407</link>
    <description>Title: 3D Pose and Shape Estimation for Clothed Multi-Person from a Single Image
Author(s): Cha, Junuk
Abstract: Estimating 3D poses and shapes in the form of meshes from monocular RGB images is a challenging task, particularly in multi-person scenarios where occlusions and complex interactions introduce significant ambiguities. This paper proposes a novel coarse-to-fine pipeline to address these challenges, extending traditional 3D pose and shape estimation to clothed human mesh reconstruction. By tackling occlusions and missing geometry, the method delivers a comprehensive solution to reconstruct accurate and physically plausible 3D human meshes. The pipeline combines robust 3D skeleton estimation with advanced mesh refinement techniques to achieve significant improvements in scenarios involving multiple interacting individuals.
The pipeline begins by estimating occlusion-resistant 3D skeletons for multiple persons from a single RGB image. These skeletons, designed to handle partial visibility, are transformed into deformable 3D mesh parameters through inverse kinematics, providing an initial coarse representation. To refine these meshes, a Transformer-based relation-aware module is employed, which considers both intra-person dynamics (e.g., spatial consistency of body parts) and inter-person interactions (e.g., spatial relations between individuals). This multi-level refinement enhances the realism and accuracy of the resulting meshes, even in complex scenes.
To handle clothed human meshes in globally coherent scene spaces, the pipeline addresses critical issues like missing body parts and physical implausibilities, such as self-penetration and person-to-person penetration. Two innovative human priors are introduced to overcome these challenges. The geometry prior uses an encoder-decoder architecture to recover detailed 3D features from incomplete body geometry and combines these with a surface normal map to reconstruct realistic, detailed clothed meshes. The contact prior, on the other hand, employs an image-space contact detector to enforce physical plausibility by estimating and maintaining realistic surface contacts between individuals.
Extensive experiments conducted on benchmark datasets, including 3DPW, MuPoTS, AGORA, and MultiHuman, demonstrate the pipeline’s effectiveness. For pose and shape estimation, the method excels in managing occlusions and interactions, while for clothed mesh reconstruction, it achieves penetration-free results with detailed textures and surface features. These results confirm the superiority of the approach compared to existing methods, highlighting its ability to handle diverse and challenging scenarios in 3D human mesh reconstruction.
In conclusion, this paper presents a significant advancement in the field of 3D human mesh reconstruction from monocular RGB images. By integrating robust skeletal estimation, Transformer-based refinement, and innovative human priors, the proposed pipeline delivers accurate, coherent, and physically plausible clothed human meshes. This work not only addresses long-standing challenges like occlusions and physical implausibilities but also opens new opportunities for applications in virtual reality, gaming, and digital human modeling, setting a new benchmark in multi-person 3D reconstruction.
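As a toy illustration of the physical-plausibility goal above, the following Python margin loss penalizes vertex pairs of two people's meshes that come closer than a threshold; the actual pipeline's contact prior uses an image-space contact detector, so this simplification and its names are assumptions:

import torch

def penetration_penalty(verts_a: torch.Tensor, verts_b: torch.Tensor,
                        margin: float = 0.01) -> torch.Tensor:
    """verts_a: (Na, 3), verts_b: (Nb, 3) mesh vertices of two people, in meters.
    Illustrative only: penalizes any vertex pair closer than `margin`."""
    dists = torch.cdist(verts_a, verts_b)            # (Na, Nb) pairwise distances
    violation = (margin - dists).clamp(min=0.0)      # positive only where too close
    return violation.sum()                           # minimizing pushes surfaces apart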
Major: Graduate School of Artificial Intelligence</description>
    <dc:date>2025-01-31T15:00:00Z</dc:date>
  </item>
</rdf:RDF>

