dc.description.abstract
Recent advances in human–robot interaction highlight the growing importance of understanding and generating human motion for intelligent humanoid control. This thesis presents two studies that approach this problem from the perspectives of data collection and motion representation.
First, a data generation framework for learning-based human-to-humanoid motion retargeting is proposed. Traditional approaches rely on expensive equipment or handcrafted mappings, which limits their scalability and their generality across robot morphologies. To overcome these limitations, this work introduces a reverse-wise data pairing method that generates robot-side poses within the feasible pose domain and reconstructs the corresponding human poses while filtering out physically invalid ones. This approach produces diverse, high-quality paired datasets suitable for deep learning. Using these data, a two-stage motion retargeting network is trained via supervised learning; it achieves higher accuracy in the predicted robot link positions than an unsupervised baseline while also generating more natural robot motions in qualitative evaluations. In addition, ablation studies demonstrate the effectiveness of extreme-pose filtering during data generation and confirm that the proposed two-stage architecture is well suited to motion retargeting.
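To make the pairing procedure concrete, the following Python sketch illustrates the reverse-wise idea: robot poses are sampled inside assumed joint limits, corresponding human poses are reconstructed, and extreme or physically invalid samples are discarded. The joint limits, the placeholder forward kinematics, the rescaling-based human reconstruction, and the validity criterion are all illustrative assumptions, not the thesis's implementation.

    # Minimal sketch of reverse-wise data pairing (illustrative assumptions only).
    import numpy as np

    JOINT_LIMITS = np.array([[-1.5, 1.5]] * 10)  # assumed per-joint limits (rad)

    def sample_robot_pose(rng):
        """Sample a joint configuration inside the feasible pose domain."""
        return rng.uniform(JOINT_LIMITS[:, 0], JOINT_LIMITS[:, 1])

    def forward_kinematics(q):
        """Placeholder FK: map joint angles to 3D link positions.
        A real implementation would use the robot's kinematic chain."""
        return np.cumsum(np.stack([np.cos(q), np.sin(q), q], axis=1), axis=0)

    def reconstruct_human_pose(link_positions):
        """Placeholder reconstruction of a human pose from robot link
        positions, e.g., by rescaling limb lengths to a human skeleton."""
        return link_positions * 1.1  # assumed scale factor

    def is_valid(human_pose, q):
        """Assumed filter: reject extreme poses near the joint limits and
        poses with any link below the ground plane."""
        margin = 0.05 * (JOINT_LIMITS[:, 1] - JOINT_LIMITS[:, 0])
        inside = np.all(q > JOINT_LIMITS[:, 0] + margin) and \
                 np.all(q < JOINT_LIMITS[:, 1] - margin)
        return inside and np.all(human_pose[:, 2] > -1e-3)

    rng = np.random.default_rng(0)
    pairs = []
    while len(pairs) < 1000:
        q = sample_robot_pose(rng)
        human = reconstruct_human_pose(forward_kinematics(q))
        if is_valid(human, q):
            pairs.append((human, q))  # (input, target) pair for supervised training

Because the pairing starts from the robot side, every accepted sample is feasible for the robot by construction, which is the property that makes the resulting dataset suitable for supervised retargeting.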
Second, a keyframe-based motion tokenization framework is presented as an alternative to the conventional 1D-convolutional VQ-VAE tokenizers used in transformer-based motion models. Instead of encoding motion through fixed receptive fields, the proposed method constructs discrete tokens from keyframes that explicitly encode pose, velocity, and duration information, ensuring clear traceability to the original motion. To enable end-to-end learning, this work introduces a differentiable soft interpolation method and develops a Keyframe Motion VQ-VAE that quantizes keyframe segments and reconstructs full-length motion through duration prediction, recover-length expansion, and convolutional decoding. Experiments on the HumanML3D dataset show that the proposed keyframe selection method outperforms random and uniform sampling in preserving semantic structure.
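As an illustration of how such a tokenizer can remain end-to-end differentiable, the sketch below implements one plausible form of soft interpolation: each output frame is a softmax-weighted blend of keyframe features, with blend centers derived from predicted durations so that gradients flow back to both. The shapes, the distance-based weighting, and the temperature parameter are assumptions for illustration and may differ from the thesis's Keyframe Motion VQ-VAE.

    # Minimal sketch of differentiable soft interpolation (assumed form).
    import torch

    def soft_interpolate(keyframes, durations, total_len, temperature=2.0):
        """keyframes: (K, D) features; durations: (K,) positive frame counts.
        Places each keyframe at the cumulative time implied by the durations
        and blends keyframes with softmax weights, so gradients reach the
        durations as well as the features."""
        ends = torch.cumsum(durations, dim=0)
        centers = ends - durations / 2.0                    # (K,) keyframe times
        t = torch.arange(total_len, dtype=keyframes.dtype)  # (T,) output frames
        dist = (t[:, None] - centers[None, :]).abs()        # (T, K)
        weights = torch.softmax(-dist / temperature, dim=1) # soft assignment
        return weights @ keyframes                          # (T, D) reconstruction

    K, D, T = 8, 16, 64
    keyframes = torch.randn(K, D, requires_grad=True)
    durations = torch.full((K,), T / K, requires_grad=True)  # e.g., a duration head
    recon = soft_interpolate(keyframes, durations, T)
    recon.sum().backward()  # gradients reach both keyframes and durations

Lowering the temperature sharpens the weights toward a hard nearest-keyframe assignment while keeping the operation differentiable, which is what allows duration prediction and decoding to be trained jointly.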
Together, these studies explore alternative approaches to the challenges of data acquisition and motion representation in deep learning–based human–robot interaction. Moreover, by integrating the two approaches, this research suggests the potential for semantic motion retargeting, in which robots may reproduce human motion while preserving both its meaning and its physical fidelity.