| dc.description.abstract |
Existing approaches to solving complex, long-horizon robot tasks via imitation learning face clear limitations. Current methods either lack explicit intermediate goal structure or rely on empirically chosen heuristics, such as fixed temporal offsets (e.g., Seer's [1] 8-frame conditioning) or motion-based keyframes (e.g., gripper state changes [2]), that fail to capture the variable-length semantic phases inherent in manipulation tasks. To address these challenges, this thesis proposes a three-stage learning pipeline for automatic semantic labeling and multi-task policy learning. First, we develop a Semantic Tokenizer built around a sensory information bottleneck: it discovers discrete semantic tokens from language and action sequences using Vector Quantization (VQ) [3] with a small codebook. By intentionally excluding visual input, the Semantic Tokenizer learns action primitives that are invariant to specific objects and visual contexts; enforcing this sensory bottleneck compels the model to learn motion-centric representations rather than object-specific routines. Second, a Token Refinement Module filters transient noise by merging short segments, stabilizing semantic boundaries and producing per-frame semantic labels for the training dataset. Importantly, the Semantic Tokenizer and Token Refinement Module serve purely as an offline data annotation pipeline: they are used only during training to generate pseudo-labels for each demonstration frame. Third, at inference time only the vision-language-action (VLA) model is deployed. This decoder autoregressively generates an interleaved sequence of VQ action indices and semantic tokens through co-generation. By training the model to generate both actions and semantic boundary markers from visual observations, proprioceptive state, and noun embeddings, we encourage the decoder to learn richer visual-action associations without adding computational overhead at inference. 
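As a rough illustration of the VQ step described above, the quantization can be sketched as a nearest-neighbor lookup against a small codebook. The shapes, codebook size, and variable names here are assumptions for illustration, not details taken from the thesis implementation:

```python
import numpy as np

def vq_quantize(z, codebook):
    """Nearest-neighbor vector quantization: map each continuous latent
    vector in z to the index of its closest codebook entry.
    z: (T, D) per-frame latents from a language/action encoder (assumed shape).
    codebook: (K, D) embeddings; a small K encourages a compact vocabulary
    of reusable, motion-centric primitives."""
    # Squared Euclidean distance between every latent and every code: (T, K).
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)   # discrete semantic token index per frame
    z_q = codebook[idx]      # quantized latents passed to the decoder
    return idx, z_q

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))  # small codebook, e.g. K = 16
z = rng.normal(size=(50, 8))         # a 50-frame latent sequence
idx, z_q = vq_quantize(z, codebook)
```

In a full VQ model the codebook itself is learned (e.g., with a commitment loss and straight-through gradients [3]); the sketch shows only the forward lookup that produces the discrete tokens.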
Experimental evaluations on the LIBERO-90 benchmark demonstrate that the proposed pipeline achieves an 88.1% success rate, outperforming VLA models including OpenVLA [4] (73.5%) and our base model VQ-VLA [5] (81.0%). This improvement validates semantic co-generation, showing that jointly generating semantic tokens with actions enhances policy robustness for long-horizon manipulation tasks. |
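The Token Refinement step described in the abstract, merging transient short segments to stabilize semantic boundaries, can be sketched as a run-length filter. The minimum-length threshold and the rule of absorbing a short run into its left neighbor are assumptions for illustration; the thesis may use different heuristics:

```python
def merge_short_segments(labels, min_len=3):
    """Absorb per-frame token runs shorter than min_len into a neighboring
    run, smoothing transient label flicker while preserving sequence length."""
    # Run-length encode the per-frame token sequence: [[token, count], ...].
    runs = []
    for t in labels:
        if runs and runs[-1][0] == t:
            runs[-1][1] += 1
        else:
            runs.append([t, 1])
    # Repeatedly absorb the first short run into a neighbor until stable.
    changed = True
    while changed and len(runs) > 1:
        changed = False
        for i, (tok, n) in enumerate(runs):
            if n < min_len:
                j = i - 1 if i > 0 else i + 1  # prefer the left neighbor
                runs[j][1] += n
                del runs[i]
                changed = True
                break
    # Expand the cleaned runs back to per-frame labels.
    out = []
    for tok, n in runs:
        out.extend([tok] * n)
    return out

labels = [2, 2, 2, 2, 7, 2, 2, 5, 5, 5, 5]
print(merge_short_segments(labels))  # [2, 2, 2, 2, 2, 2, 2, 5, 5, 5, 5]
```

In the example, the lone `7` and the two-frame `2` run are absorbed into the preceding segment, leaving two stable segments whose boundaries can then serve as per-frame pseudo-labels.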