File Download

There are no files associated with this item.

  • Find it @ UNIST can give you direct access to the published full text of this article. (UNISTARs only)
Related Researcher

Ahn, Hyemin (안혜민)


Full metadata record

DC Field Value
dc.citation.endPage 585
dc.citation.number 6
dc.citation.startPage 581
dc.citation.title Journal of Institute of Control, Robotics and Systems
dc.citation.volume 31
dc.contributor.author Kim, Seong Hyeon
dc.contributor.author Ahn, Hyemin
dc.date.accessioned 2026-04-22T18:00:11Z
dc.date.available 2026-04-22T18:00:11Z
dc.date.created 2026-04-22
dc.date.issued 2025-06
dc.description.abstract A world model allows robots to understand and predict the interplay between their actions and environmental dynamics. Recent advancements in diffusion models have significantly improved the quality of image frame generation in simulated environments, contributing to the development of more robust and generalized world models. However, these diffusion-based world models often depend on discrete inputs, such as keyboard commands, which limit their applicability to continuous real-world robotic control. To address this limitation, we propose a novel framework that integrates contrastive learning to align visual and proprioceptive modalities (e.g., joint positions) within a shared latent space. This shared latent space facilitates accurate cross-modal predictions between visual scenes and proprioceptive states. By combining this latent representation with a diffusion model, our world model can generate long-term future visual scenes by leveraging both initial visual observations and proprioceptive states. Experimental results demonstrate that the proposed framework generates high-fidelity, long-term future visual scenes when provided with target proprioceptive data. This capability allows robots to plan their motions solely from the generated images, i.e., imagination-based planning. © ICROS 2025.
dc.identifier.bibliographicCitation Journal of Institute of Control, Robotics and Systems, v.31, no.6, pp.581 - 585
dc.identifier.doi 10.5302/J.ICROS.2025.25.0050
dc.identifier.issn 1976-5622
dc.identifier.scopusid 2-s2.0-105007990021
dc.identifier.uri https://scholarworks.unist.ac.kr/handle/201301/91452
dc.identifier.url https://www.dbpia.co.kr/journal/articleDetail?nodeId=NODE12246460
dc.language English
dc.publisher Institute of Control, Robotics and Systems
dc.title.alternative 고유 감각 정보 기반 시각적 장면 생성을 통한 로봇 세계 모델링을 가능케하는 대조 학습 및 디퓨전 모델
dc.title Proprioception-conditioned Visual Scene Generation for Robot World Modeling via Contrastive Learning and Diffusion
dc.type Article
dc.description.isOpenAccess FALSE
dc.identifier.kciid ART003208269
dc.type.docType Article
dc.description.journalRegisteredClass scopus
dc.description.journalRegisteredClass kci
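
The following is a minimal, hypothetical sketch of the contrastive alignment step described in the abstract above: a visual encoder and a proprioceptive encoder are trained so that a frame and the joint state recorded at the same timestep map to nearby points in a shared latent space. All module names, dimensions, and the symmetric InfoNCE formulation are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of visual-proprioceptive contrastive alignment;
# names and sizes are assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualEncoder(nn.Module):
    """Embeds an RGB frame into the shared latent space (toy CNN backbone)."""

    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, frames):
        return self.net(frames)


class ProprioEncoder(nn.Module):
    """Embeds a proprioceptive state (e.g., 7 joint positions) into the same space."""

    def __init__(self, joint_dim: int = 7, latent_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(joint_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, joints):
        return self.net(joints)


def contrastive_loss(z_img, z_prop, temperature=0.07):
    """Symmetric InfoNCE: the frame and joint state from the same timestep
    form the positive pair; every other pairing in the batch is a negative."""
    z_img = F.normalize(z_img, dim=-1)
    z_prop = F.normalize(z_prop, dim=-1)
    logits = z_img @ z_prop.t() / temperature  # (B, B) cosine similarities
    targets = torch.arange(z_img.size(0))      # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


# Toy usage on a random batch of synchronized (frame, joint-state) pairs.
frames = torch.randn(16, 3, 64, 64)
joints = torch.randn(16, 7)
loss = contrastive_loss(VisualEncoder()(frames), ProprioEncoder()(joints))
loss.backward()
```

In the full framework, the aligned latents would then condition a diffusion denoiser so that, given an initial observation and a target proprioceptive trajectory, the model can roll out long-term future visual scenes for imagination-based planning; that generation stage is beyond this sketch.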


Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.