Penalty relaxation techniques for efficient offline reinforcement learning

Alternative Title
효율적 오프라인 강화학습을 위한 패널티 감쇄 기법 (Penalty relaxation techniques for efficient offline reinforcement learning)
Author(s)
Yeom, Junghyuk
Advisor
Han, Seungyul
Issued Date
2024-02
URI
https://scholarworks.unist.ac.kr/handle/201301/82157
http://unist.dcollection.net/common/orgView/200000744587
Abstract
In reinforcement learning, an agent receives observations from the environment, makes decisions based on these observations, and receives rewards for its actions. Through repeated trial and error, the agent learns a policy that yields higher rewards. However, traditional reinforcement learning heavily depends on continuous interaction with the environment to acquire new data. This reliance often incurs significant time and cost, and even raises safety concerns, particularly in real-world applications such as robotics and autonomous vehicles. Offline reinforcement learning (Offline RL) has emerged as a viable alternative to these challenges. By utilizing pre-collected data, Offline RL greatly diminishes the need for continuous data collection, thereby reducing time and safety risks. This approach provides a safer and more controlled learning environment, as it depends only on data that is already available.

However, Offline RL faces specific issues arising from its reliance on a fixed dataset. A notable challenge occurs when the pre-existing data differs greatly from the target policy to be learned, leading to out-of-distribution problems that present significant obstacles to effective offline reinforcement learning. To address this problem, offline RL algorithms must be designed conservatively, so that the learned policy stays close to the behavior policy. Our baseline method, Conservative Q-Learning [1], addresses this challenge by applying penalties to state-action pairs generated by the policy, which enables the learning of a conservative Q-function that serves as a lower bound on the true value function. Yet this method may degrade performance through excessive constraints. Moreover, if the imposed penalties do not accurately reflect the dataset's characteristics, the algorithm's performance may become overly dependent on the quality of the batch data.

In this paper, we propose a penalty relaxation technique based on an analysis of the penalty characteristics of conservative Q-learning. By adjusting the penalties to align with the batch data's characteristics, we reduce the dependence of performance on the dataset, thereby boosting the efficiency of Offline RL. This increased efficiency is achievable with a smaller number of networks, ensuring greater effectiveness. Furthermore, to handle the suboptimal portions of the batch data, we propose a strategy for updating the behavior policy that goes beyond simple replication of the current policy. Our method learns the behavior policy in a way that reflects the quality of the batch data: it categorizes states and predicts a more optimized policy. By incorporating this into the Bellman updates and the conservative Q-learning objective, we enhance the performance of the behavior policy and mitigate the bias inherent in the dataset.
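For context, the conservative training objective of the baseline CQL [1], whose penalty term the proposed technique analyzes and relaxes, can be written in the CQL paper's notation ($\alpha$ is the penalty weight, $\mu$ the action-sampling distribution, typically the current policy, $\hat{\mathcal{B}}^{\pi}$ the empirical Bellman operator, and $\mathcal{D}$ the offline dataset):

$\hat{Q}^{k+1} \leftarrow \arg\min_{Q} \; \alpha \Big( \mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(\cdot \mid s)}\big[Q(s,a)\big] - \mathbb{E}_{(s,a) \sim \mathcal{D}}\big[Q(s,a)\big] \Big) + \tfrac{1}{2}\, \mathbb{E}_{(s,a,s') \sim \mathcal{D}}\Big[ \big( Q(s,a) - \hat{\mathcal{B}}^{\pi} \hat{Q}^{k}(s,a) \big)^{2} \Big]$

The first term is the conservative penalty: it pushes Q-values down on actions drawn from $\mu$ and up on actions in the dataset, which makes the learned Q-function a lower bound on the true value function but can over-constrain the policy when $\alpha$ or the penalty shape does not match the data. The second term is the standard Bellman error.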
Publisher
Ulsan National Institute of Science and Technology
Degree
Master
Major
Graduate School of Artificial Intelligence
