Related Researcher

Yoon, Sangwoong (윤상웅)


Full metadata record

DC Field Value Language
dc.citation.conferencePlace BL -
dc.citation.title International Conference on Learning Representations -
dc.contributor.author Tang, Xiaohang -
dc.contributor.author Dolga, Rares -
dc.contributor.author Yoon, Sangwoong -
dc.contributor.author Bogunovic, Ilija -
dc.date.accessioned 2026-02-23T15:47:00Z -
dc.date.available 2026-02-23T15:47:00Z -
dc.date.created 2026-02-23 -
dc.date.issued 2026-04-23 -
dc.description.abstract Improving the reasoning capabilities of diffusion-based large language models (dLLMs) through reinforcement learning (RL) remains an open problem. The intractability of dLLMs' likelihood function necessitates approximating the current, old, and reference policy likelihoods at each policy optimization step. This reliance introduces additional computational overhead and leads to potentially large bias, particularly when approximation errors occur in the denominator of the policy ratios used for importance sampling. To mitigate these issues, we introduce wd1, a novel policy optimization approach that reformulates the objective as a weighted likelihood, requiring only a single approximation for the current parametrized policy likelihood. Experiments on widely used reasoning benchmarks demonstrate that wd1, without supervised fine-tuning (SFT) or any supervised data, outperforms existing RL methods for dLLMs, achieving up to 16% higher accuracy. wd1 also delivers computational gains, including reduced training time and fewer function evaluations (NFEs) per gradient step. These findings, combined with the simplicity of the method's implementation and its R1-Zero-like training (no SFT), position wd1 as a more effective and efficient method for applying RL to dLLM reasoning. -
dc.identifier.bibliographicCitation International Conference on Learning Representations -
dc.identifier.uri https://scholarworks.unist.ac.kr/handle/201301/90535 -
dc.language English -
dc.publisher Proceedings of International Conference on Learning Representations (ICLR) -
dc.title wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models -
dc.type Conference Paper -
dc.date.conferenceDate 2026-04-23 -
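The weighted-likelihood objective described in the abstract can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the softmax-of-reward weighting (reward-weighted-regression style), the `beta` temperature, and the function name `wd1_style_loss` are all assumptions introduced here for illustration. The point it demonstrates is the one the abstract makes: only the current policy's (approximate) log-likelihood enters the loss, with no old-policy or reference-policy ratios.

```python
import math

def wd1_style_loss(log_probs, rewards, beta=1.0):
    """Weighted likelihood loss over a group of sampled completions.

    log_probs : approximate log-likelihoods of each completion under the
                *current* policy (the only quantity that must be approximated).
    rewards   : scalar reward for each completion.

    Illustrative sketch only: the weighting here is a softmax over group
    rewards (an assumption); wd1's actual weighting is defined in the paper.
    """
    # Softmax weights over rewards, shifted by the max for numerical stability.
    m = max(rewards)
    exps = [math.exp(beta * (r - m)) for r in rewards]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Negative weighted log-likelihood: high-reward completions are
    # up-weighted; no importance-sampling ratio (and hence no approximated
    # likelihood in a denominator) appears anywhere.
    return -sum(w * lp for w, lp in zip(weights, log_probs))
```

With equal rewards this reduces to a plain mean negative log-likelihood; as `beta` grows, the loss concentrates on the highest-reward completion, which is the sense in which a weighted likelihood can stand in for a policy-ratio objective.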


Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.