Unsupervised Subgoal Decomposition for Robot Policy Learning

Author(s)
Kim, Seong Hyeon
Advisor
Joo, Kyungdon
Issued Date
2026-02
URI
https://scholarworks.unist.ac.kr/handle/201301/91057
http://unist.dcollection.net/common/orgView/200000965091
Abstract
Existing approaches to solving complex, long-horizon robot tasks via imitation learning face clear limitations. Current methods either lack an explicit intermediate goal structure or rely on empirically chosen heuristics, such as fixed temporal offsets (e.g., Seer’s [1] 8-frame conditioning) or motion-based keyframes (e.g., gripper state changes [2]), that fail to capture the variable-length semantic phases inherent in manipulation tasks. To address these challenges, this thesis proposes a three-stage learning pipeline for automatic semantic labeling and multi-task policy learning.

First, we develop a Semantic Tokenizer designed with a sensory information bottleneck, which discovers discrete semantic tokens from language and action sequences using Vector Quantization (VQ) [3] with a small codebook. By intentionally excluding visual input, the Semantic Tokenizer learns object-invariant action primitives independent of specific objects or visual contexts; enforcing this sensory bottleneck compels the model to learn motion-centric representations rather than object-specific routines. Second, a Token Refinement Module filters transient noise by merging short segments, stabilizing semantic boundaries and producing per-frame semantic labels for the training dataset. Importantly, the Semantic Tokenizer and Token Refinement Module serve as an offline data annotation pipeline: they are used only during training to generate pseudo-labels for each demonstration frame, and at inference time only the VLA model is deployed. Third, this VLA decoder autoregressively generates an interleaved sequence of VQ action indices and semantic tokens through co-generation. By training the model to generate both actions and semantic boundary markers from visual observations, proprioceptive state, and noun embeddings, we encourage the decoder to learn richer visual-action associations without adding computational overhead at inference.

Experimental evaluations on the LIBERO-90 benchmark show that the proposed pipeline achieves an 88.1% success rate, outperforming VLA models including OpenVLA [4] (73.5%) and our base model VQ-VLA [5] (81.0%). This improvement validates the efficacy of co-generating semantic tokens with actions, which enhances policy robustness for long-horizon manipulation tasks.
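The Token Refinement Module is described only at a high level in the abstract. As a reading aid, the Python sketch below illustrates one plausible form of that step, assuming segments shorter than a minimum length are absorbed into the preceding segment and then expanded back into per-frame pseudo-labels; the function name, the length threshold, and the merge rule are illustrative assumptions rather than the thesis's actual implementation.

import numpy as np

def refine_token_sequence(frame_tokens, min_segment_len=5):
    # Hypothetical sketch: merge short runs of per-frame semantic tokens
    # into their neighbors to stabilize semantic boundaries.
    # frame_tokens: per-frame discrete token ids (length T)
    # min_segment_len: assumed threshold below which a run is absorbed
    segments = []  # contiguous runs as [token, length]
    for tok in frame_tokens:
        if segments and segments[-1][0] == tok:
            segments[-1][1] += 1
        else:
            segments.append([tok, 1])

    # Absorb too-short runs into the previous segment.
    refined = []
    for tok, length in segments:
        if length < min_segment_len and refined:
            refined[-1][1] += length
        else:
            refined.append([tok, length])
    # A short leading run is merged forward instead.
    if len(refined) > 1 and refined[0][1] < min_segment_len:
        refined[1][1] += refined[0][1]
        refined.pop(0)

    # Expand refined segments back to per-frame pseudo-labels.
    return np.concatenate([np.full(length, tok) for tok, length in refined])

# Example: a transient 2-frame blip of token 7 is absorbed into token 3.
print(refine_token_sequence([3, 3, 3, 3, 3, 7, 7, 3, 3, 3, 3, 3], min_segment_len=3))

In this example call, the two-frame blip of token 7 is treated as transient noise, so all twelve frames receive the label 3; the actual module in the thesis may use a different merging criterion.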
Publisher
Ulsan National Institute of Science and Technology
Degree
Master
Major
Graduate School of Artificial Intelligence (Artificial Intelligence)

