[RL] Value-based algorithm VS Policy-based algorithm

Scheduling/Paper 2021. 4. 19. 09:46

Topic

Value-based algorithm과 policy-base algorithm 비교

Description

-Value-based
Goal : maximize state-value function(expected return for each state s_t)
→ action value function을 maximize 하는 policy P(s_t |a_t)를 도출하여 Goal 달성
→ action이 Q-function을 최대로 하는 a_t^∗로 deterministic로 결정
→ ε-greedy을 이용하여 탐색 영역 확대
ex) SARSA, Q-learning, DQN

-Policy-based
Goal : 처음부터 episode 종료시까지 주어지는 reward의 총합의 기댓값을 최대화
→ J_θ를 최대로 하기 위하여 policy gradient(θ← θ+α∆θJ_θ)를 시행하면서 P_θ (a_t |s_t)의 parameter를 추정
→ policy distribution에 의해 action이 stochastic하게 발생하면서 exploitation & exploration 수행
ex) REINFORCEMENT algorithm, Actor-critic algorithm, ⋯

2. Value-based algorithm

Algorithm	Q-learning
Theory	-Q-learning → Target policy는 delta function를 이용해 Q function을 greedy하게 업데이트 하면서 behavior policy는 ε-greedy알고리즘을 사용하여 충분히 탐험하며 최적값에 수렴
Pros	1. target policy를 사람 or agent가 만든 policy로 사용 가능 2. 충분한 exploration을 통해 optimal policy를 수립 3. 재평가 가능 → 충분히 업데이트 된 Q를 가지고 같은 상황에서 target policy를 동작 시켰을 때 새로운 수를 놓음.
Cons	1. Q를 업데이트 하는 과정에서 이전 관측치가 다음 관측치에 영향을 주어 학습이 잘 되지 않는 상황 발생 2. 학습에 도움이 되는 데이터가 한 번 사용되고 버려짐

Algorithm	DQN(Deep Q-Network)
Theory	-DQN(Playing Atari with Deep Reinforcement Learning, 2015) Q-learning에서 Q function을 DNN(Deep Neural Network)를 이용하여 추정 → State를 DNN의 input으로 넣고 action에 따른 Q 값이 가장 큰 action을 선택 Issue 1. 순차적인 데이터 간의 높은 상관관계 Experience Replay → mini-batch SGD : N개의 state를 replay memory에 자징하고 uniform random하게 추출하여 Q - function을 업데이트 2. Q를 업데이트 할 때 target이 흔들리는 문제 발생 : → Target net과 main net을 분리하여 w- 은 고정하고 일정 iteration 후 w로 업데이트
Pros	1.CNN을 이용하여 사람이 게임을 하는 원리와 유사하게 학습 2.Experience replay를 이용한 state간 correlation 완화
Cons	1. Replay memory 사용 -> 많은 메모리 사용 2. 느린 학습 속도 3. 불안정한 학습과정 4. state와 action의 수가 많은 경우 학습이 어려움

3. Policy-based algorithm

Algorithm	REINFORCE algorithm
Theory	-REINFORCE : Policy-based의 시발점이 되는 algorithm Goal : max⁡〖〖 J〗_θ 〗 처음부터 episode 종료시까지 주어지는 reward의 총합의 기댓값을 최대화 → J_θ=E[G_0 ]=E[R_0+γR_1+γ^2 R_(2 )+⋯] = ∫16_(s_0,a_0,s_0,a_1,⋯)▒〖G_0 P_θ (s_0,a_0,s_0,a_1,⋯)ds_0,a_0,s_0,a_1,⋯=∫16_τ▒〖G_0 P_θ (τ)dτ〗〗 → J_θ 의 max값을 구하기 위해 policy gradient 사용(J_θ에 대한 gradient ascent algorithm : θ ← θ+α∇_θ J_θ) → ∇_θ J_θ= ∫16_τ▒∑_(t=0)^∞▒〖∇_θ lnP_θ (a_t \|s_t)G_t P_θ (τ)dτ〗 - -Algorithm 0. initialize θ Repeat 1,2 1. Get τ from P_θ (a_t│s_t ) 2. θ←θ+ α∑_(t=0)^∞▒〖∇_θ lnP_θ (a_t \|s_t)G_t 〗 algorithm repeat 단위 : episode
Pros	Low biased Continuous action 학습 가능
Cons	monte-carlo 방법에서는 Q-function을 return을 통해 가져올 수 없음 → high variance 문제 발생, episode가 길거나 복잡한 문제는 오래 걸리거나 수렴하지 않음

Algorithm	Actor-Critic algorithm
Theory	-Actor-Critic algorithm : REINFORCEMENT algorithm에서 Q function을 구할 때 발생하는 high variance 문제를 보완하기 위하여 action-value function을 estimate하는 방법으로 critic을 도입. → Critic은 TD(0)의 방법으로 매 step마다 parameter를 업데이트 하여 function을 estimate. Goal : max⁡〖〖 J〗_θ 〗 - policy gradient → ∇_θ J_θ= ∫16_τ▒∑_(t=0)^∞▒〖∇_θ lnP_θ (a_t \|s_t)G_t P_θ (τ)dτ〗 =∫16_(s_t,a_t)▒∑_(t=0)^∞▒〖∇_θ lnP_θ (a_t│s_t )Q(s_t,a_t)P_θ (s_t,a_t)ds_t,a_t 〗 →〖 G〗_t P_θ (τ)를 전개하여 Q(s_t,a_t)과 P_θ (s_t,a_t)도출 - Critic network :〖 Q〗_w (s_t,a_t ) update - Actor network : P_θ (s_t,a_t ) update -Algorithm 0. initialize θ, w Repeat 1,2 1. Actor update θ←θ+ α∇_θ lnP_θ (a_t│s_t ) 〖 Q〗_w (s_t,a_t ) 2. Critic update θ←θ-β(R_t+γ〖 Q〗_w (s_(t+1 ),a_(t+1 ) )-〖 Q〗_w (s_t,a_t ))^2 : SARSA algorithm - algorithm repeat 단위 : episode 내 1-step 단위
Pros	Low variance than REINFORCEMENT algorithm
Future works	Variance 및 학습속도 등을 줄이기 위한 algorithm으로 Actor-Critic algorithm을 기반으로 A2C, A3C등의 알고리즘 등장

'Scheduling > Paper' 카테고리의 다른 글

[RL] Toward optimal assembly line order sequencing with reinforcement learning a case study(2020) (0)	2021.04.19
[RL] Actor-critic deep reinforcement learning for solving job shop scheduling problems(2020) (0)	2021.04.19
[RL] Simulation study on reward function of reinforcement learning in gantry work cell scheduling(2018) (0)	2021.04.19
[RL] Gantry Work Cell Scheduling through Reinforcement Learning with knowledge-guided Reward Setting(2018) (0)	2021.04.19
[GNN] A graph neural network assisted monte carlo tree search approach to traveling salesman problem(2020) (0)	2021.04.19

ABOUT ME

우우 우우

'Scheduling > Paper' 카테고리의 다른 글

티스토리툴바

ABOUT ME

'Scheduling > Paper' 카테고리의 다른 글

관련글 관련글 더보기

티스토리툴바