
[CS234] Lecture 2. Markov Decision Processes

mingyung 2024. 7. 10. 17:31

Last time, we noted that discrete sequential decision problems can be formulated and solved mathematically with a Markov Decision Process (MDP). Also remember that, in solving a sequential decision problem, our goal is to maximize total expected future reward.

 

In this lecture, to understand the Markov Decision Process clearly, we trace the progression from the Markov process up to the MDP, and then look at evaluation and control in MDPs.

 

 

0. Evolution of the Markov Process

The Markov Decision Process (MDP) starts from the Markov process and extends it.

It developed in the order Markov Process > Markov Reward Process > Markov Decision Process.

 

Briefly, they differ as follows.

Markov Process
- A stochastic process that considers only state transitions.
- Assumes the "Markov property": only the current state influences the next state.

Markov Reward Process (MRP)
- Adds the notion of a reward to the Markov process.
- In addition to state transitions, it also considers the reward received in each state.

Markov Decision Process (MDP)
- Adds the notion of an action to the Markov reward process.
- Extends the MRP so that an agent can choose an action in each state.
- Solves the sequential decision problem by choosing optimal actions that maximize the total expected reward.

 

 

0. Markov Assumption

Before going further, let's restate the Markov assumption.

A state $s_t$ is Markov if and only if

$$p(s_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid h_t, a_t)$$

where $h_t$ denotes the full history up to time $t$.

That is, the future state is independent of the past states and depends only on the current state. In other words, we assume that we do not need to consider past states in order to determine the future state.

 

A process satisfying this assumption is said to have the Markov property.

 

 


1. Markov Process

A Markov process describes transitions from a given state s to a next state s'.

Here we assume that the future state is independent of past states and depends only on the current state s.

In other words, it is a process that satisfies the Markov assumption.

 

์ •์˜

Sequence of Random States with Markov Property

👉 Characteristics
No Reward, No Actions, Memoryless property

Memoryless property: to determine the future state, we do not need any of the history other than the current state.
Transition probability: a Markov process can be thought of as a probability map from the current state to each possible next state.

S: (finite) set of states
P: transition model that specifies $p(s_{t+1} = s' \mid s_t = s)$ (no rewards or actions yet)

 

 

If the number of states is finite (N), the transition model P can be written as a matrix.

This matrix is called the Markov chain transition matrix.
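With states $s_1, \dots, s_N$, row $i$ of the matrix holds the probabilities of moving from $s_i$ to each possible next state, so every row sums to 1:

$$P = \begin{pmatrix} P(s_1 \mid s_1) & P(s_2 \mid s_1) & \cdots & P(s_N \mid s_1) \\ P(s_1 \mid s_2) & P(s_2 \mid s_2) & \cdots & P(s_N \mid s_2) \\ \vdots & & \ddots & \vdots \\ P(s_1 \mid s_N) & P(s_2 \mid s_N) & \cdots & P(s_N \mid s_N) \end{pmatrix}$$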

 

For easier understanding, consider the example from the lecture.

There are 7 individual states in total, and the transition probabilities to the next states are shown in a simple diagram.

In this case, P is a 7x7 matrix whose entries are exactly those transition probabilities.

In this way, a Markov process is fully specified by the transition probabilities between states.
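As a minimal sketch of what such a 7-state chain looks like in code (the transition probabilities below are made up for illustration, not the lecture's actual numbers):

```python
import numpy as np

# Hypothetical 7-state Markov chain: each state mostly stays put,
# otherwise moves to a neighboring state (illustrative values only).
N = 7
P = np.zeros((N, N))
for s in range(N):
    P[s, s] = 0.6
    P[s, max(s - 1, 0)] += 0.2          # move "left" (clipped at the boundary)
    P[s, min(s + 1, N - 1)] += 0.2      # move "right" (clipped at the boundary)

assert np.allclose(P.sum(axis=1), 1.0)  # each row is a probability distribution

def sample_episode(P, start_state, steps, rng=np.random.default_rng(0)):
    """Sample a state trajectory from the chain: s_{t+1} ~ P[s_t, :]."""
    states = [start_state]
    for _ in range(steps):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

print(sample_episode(P, start_state=3, steps=10))
```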

 

 

 


2. Markov Reward Process (MRP)

์œ„์˜ Markov Process์—์„œ๋Š” Reward๊ณผ Action์ด ํฌํ•จ๋˜์ง€ ์•Š์€ ๊ฐœ๋…์ด์—ˆ๋‹ค. 

It simply specified the transition probabilities between states.

 

A Markov reward process (MRP) can be thought of as a Markov chain with rewards added.

 

An MRP is defined as follows.

S: (finite) set of states (N states)
P: transition model that specifies $p(s_{t+1} = s' \mid s_t = s)$, exactly as in the Markov process
R: reward function $R(s) = \mathbb{E}[r_t \mid s_t = s]$; if the number of states is finite (N), R can be represented as a vector
Discount factor $\gamma \in [0, 1]$: determines how much future rewards are taken into account

 

Remember that there is still no notion of an action here.

 

Then, in a Markov reward process, what should we regard as the "optimal" reward, i.e., our objective?

As before, what we care about is the expected value of the reward.

However, in most sequential decision problems, the reward is not determined by the immediate transition alone.

Therefore, we account for a transition by adding up the immediate reward together with all the rewards that can be expected further in the future.

 

Return Function (G)

Horizon: Number of time steps in each episode.

That is, it determines how far into the future we look at a time.

Horizon์€ infinite์ผ์ˆ˜๋„ ์žˆ๊ณ , finite๊ฐ’์ผ ์ˆ˜๋„ ์žˆ๋‹ค. ๋งŒ์•ฝ horizon์ด finite์ธ ๊ฒฝ์šฐ์—๋Š” MRP๋ฅผ finite MRP๋ผ๊ณ  ๋ถ€๋ฅธ๋‹ค.

 

Return: Discounted Sum of Rewards from timestep t to horizon H

It is the value obtained by summing the rewards from time t up to the horizon H, with the discount factor γ applied.

The formula makes it easier to understand:

$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{H-1} r_{t+H-1}$$

 

That is, the return is the discounted sum of all the rewards expected to be received from the current time t onward.
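As a quick sanity check, the formula translates directly into code (the reward values below are made up):

```python
def discounted_return(rewards, gamma):
    """G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for a finite list of rewards starting at time t."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 0.0, 10.0], gamma=0.9))  # 1 + 0.9**3 * 10 = 8.29
```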

 

State Value Function (V)

Value: Expected Return from starting in state s

์‹œ๊ฐ„ t์ผ๋•Œ์˜ state ๊ฐ€ s์ผ๋•Œ

Return์˜ ๊ธฐ๋Œ€๊ฐ’์„ V(s)๋ผ๊ณ  ํ•œ๋‹ค.

$$V(s) = \mathbb{E}[G_t \mid s_t = s]$$

 

That is, the value is the expected return when we are in state s at the current time t.

 

The value is later used to evaluate a policy and to determine the optimal policy.

Since what we want is to maximize the total future reward, the quantity we actually need to examine is the value.

 

Computing Value of MRP

Now let's look at how to compute this value function.

In an MRP, the value function V satisfies the following recursion:
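$$V(s) = R(s) + \gamma \sum_{s' \in S} P(s' \mid s)\, V(s')$$

i.e., the immediate reward plus the discounted expected value of the next state.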

 

 

Method 1. Bellman Equation

The value function can be expressed recursively using the Bellman equation.

Earlier we saw that, for a finite state space (N states), the transition model P and the reward R can be written as a matrix and a vector, respectively.

๋”ฐ๋ผ์„œ ์ด๋ฅผ Matrix Form์œผ๋กœ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ๋‹ค.

 

Of these quantities, R and P are the ones we already know: R is the vector of rewards for each state, and P is the Markov chain transition matrix whose entries are the probabilities of moving from state s to state s'. The unknown is V.

 

Since what we want to find is V, rearranging the equation gives:
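$$V - \gamma P V = R \;\Longrightarrow\; V = (I - \gamma P)^{-1} R$$

Solving this directly requires inverting an N x N matrix, which costs on the order of $O(N^3)$.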

 

Through this process, we can compute the value directly.

 

Method 2. Dynamic Programming (DP)

Above we used the recursive form and solved it analytically; we can also compute the value step by step with dynamic programming.
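A minimal sketch of the iterative scheme: start from V_0(s) = 0 and repeatedly apply the Bellman update V_k(s) = R(s) + γ Σ_{s'} P(s' | s) V_{k-1}(s') until the values stop changing (the toy numbers and tolerance below are my own, for illustration).

```python
import numpy as np

def mrp_value_iterative(P, R, gamma, tol=1e-8, max_iters=10_000):
    """Iteratively compute the MRP value function V(s) = R(s) + gamma * sum_s' P(s'|s) V(s').

    P: (N, N) transition matrix with P[s, s'] = p(s' | s)
    R: (N,) reward vector with R[s] = expected immediate reward in state s
    """
    V = np.zeros(len(R))
    for _ in range(max_iters):
        V_new = R + gamma * P @ V          # one Bellman backup for every state at once
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

# Tiny illustrative 3-state example (made-up numbers): only the last state is rewarding.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([0.0, 0.0, 1.0])
print(mrp_value_iterative(P, R, gamma=0.9))

# Cross-check against the analytic solution V = (I - gamma P)^{-1} R from Method 1.
print(np.linalg.solve(np.eye(3) - 0.9 * P, R))
```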

 

 

 


3. Markov Decision Process (MDP)

MDP์˜ ๊ฒฝ์šฐ MRP์— Action์ด ์ถ”๊ฐ€๋œ ๊ฒƒ์ด๋‹ค.

 

๋”ฐ๋ผ์„œ Reward๋ฅผ ๊ณ„์‚ฐํ•  ๋•Œ Action๋˜ํ•œ ๊ณ ๋ คํ•˜๊ฒŒ ๋œ๋‹ค.

 

 

An MDP is described by a tuple of five elements (S, A, P, R, γ):

S: a finite set of states

A: a finite set of actions the agent can take

P = P(s' | s, a): the probability of transitioning to state s' when taking action a in state s

R = R(s, a, s'): the reward obtained when taking action a in state s and ending up in state s'

γ: the discount factor in [0, 1], which determines how much future rewards matter relative to the present reward
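A minimal container for these five elements (the array shapes here are just one convenient convention, my own choice):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MDP:
    """Finite MDP with N states and M actions, stored as arrays."""
    P: np.ndarray        # shape (N, M, N): P[s, a, s'] = p(s' | s, a); each P[s, a, :] sums to 1
    R: np.ndarray        # shape (N, M, N): R[s, a, s'] = reward for that transition
    gamma: float         # discount factor in [0, 1]
```

Averaging R over the next state, R(s, a) = Σ_{s'} P(s' | s, a) R(s, a, s'), gives the simpler R(s, a) form used in the formulas below.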

 

 

Policy 

With actions added, unlike in the MRP, we now also have to decide how to act in each state s.

A policy is the rule that specifies how the agent acts in a given state.

This policy can be deterministic, or it can be made stochastic by using probabilities.

 

A policy is written as follows:
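$$\pi(a \mid s) = P(a_t = a \mid s_t = s)$$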

That is, the policy is specified as the probability of taking action a when in state s.

 

A stochastic policy is expressed as a probability, as above. A deterministic policy, since the action for each state s is fixed, is written as follows:
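$$\pi(s) = a$$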

์ƒํƒœ s์ผ๋–„ action์€ a๋ฅผ ํ•œ๋‹ค.(ํ™•๋ฅ x)

 

Bellman Backup

Earlier we saw that an MDP can be viewed as an MRP plus a policy.

๋”ฐ๋ผ์„œ Policy๊ฐ€ ๊ฒฐ์ •๋˜๊ธฐ๋งŒ ํ•œ๋‹ค๋ฉด, MDP๋ฅผ MRP๋กœ ์ถ•์†Œํ•˜์—ฌ ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ๋‹ค.

 

If we denote the policy by π, the MDP induces an MRP with the following reward function and transition model:
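$$R^{\pi}(s) = \sum_{a \in A} \pi(a \mid s)\, R(s, a), \qquad P^{\pi}(s' \mid s) = \sum_{a \in A} \pi(a \mid s)\, P(s' \mid s, a)$$

Here $R(s, a)$ is the expected immediate reward for taking action $a$ in state $s$.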

๋”ฐ๋ผ์„œ MRP์—์„œ Value๋ฅผ ๊ณ„์‚ฐํ–ˆ๋˜ ๊ฒƒ๊ณผ ๊ฐ™์€ ๋กœ์ง์„ ํ†ตํ•ด ํŠน์ • Policy ฯ€์ผ๋•Œ์˜ Value๋ฅผ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ๊ณ , ์ด๋ฅผ ํ†ตํ•ด ์ •์ฑ…์„ ํ‰๊ฐ€ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•œ๋‹ค.

 

๋”ฐ๋ผ์„œ ์ •์ฑ…์ด ๊ฒฐ์ •๋˜๋ฉด ์šฐ๋ฆฌ๊ฐ€ P(sโ€ฒ|s,ฯ€(s))๋กœ ํ‘œํ˜„ํ•˜๋˜ ๊ฒƒ์„ Pโ€ฒ(sโ€ฒ|s)๋กœ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋˜๋Š” ๊ฒƒ์ด๋‹ค.

 

MDP Control - Optimal Policy

What we ultimately want is not just to describe the MDP itself, but to know what the optimal policy is.

If asked what the optimal policy is, we can state it as below:
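$$\pi^{*}(s) = \arg\max_{\pi} V^{\pi}(s)$$

That is, the optimal policy is the one whose value is at least as high as that of every other policy, in every state.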

 

 

Policy Search

So how do we actually find this optimal policy?

One method for finding the optimal policy is Policy Iteration.

Policy Iteration

Policy Iteration ์˜ ๊ณผ์ •์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

 

Policy๋ฅผ ์–ด๋–ป๊ฒŒ Improve์‹œํ‚ฌ๊นŒ?

To do this, we introduce a new value function called Q, the state-action value:
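$$Q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^{\pi}(s')$$

That is, take action a right now (whatever a is), and follow the policy π from the next state onward.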

 

In other words, unlike the state value V we saw before, the state-action value Q^π(s, a) is computed for every action a in state s, not only for the action the policy would pick: we take a first, and only afterwards follow π. Policy improvement then picks actions greedily with respect to Q.

 

๋”ฐ๋ผ์„œ Qฯ€(s,a)๋Š” ฯ€(s)๋ฅผ ๋”ฐ๋ฅธ ๊ฒฝ์šฐ์˜ Value๋ฅผ ํฌํ•จํ•˜๊ฒŒ ๋œ๋‹ค.

์ด ๋ง์€ Q๊ฐ€ max๊ฐ€ ๋˜๋Š” ๊ฒฝ์šฐ์˜ policy๊ฐ€ ๊ธฐ์กด์˜ policy์™€ ๋™์ผํ•˜๊ฑฐ๋‚˜ ๋” ๋‚˜์€ policy๊ฐ€ ๋œ๋‹ค.

 

 

Now, with this in mind, let's look at the policy iteration procedure above again.

The while loop means that we stop iterating once no better policy can be obtained through Q, that is, once the policy no longer changes and is therefore optimal.

 

 

 

It is an obvious point, but whenever max_a Q(s, a) is larger than the current policy's V(s), we discard the current policy and use the new greedy policy instead.
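A minimal end-to-end sketch of policy iteration, assuming the same array layout as in the policy-evaluation sketch above (illustrative code, not the lecture's own):

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Policy iteration for a finite MDP.

    P: (N, M, N) array, P[s, a, s'] = p(s' | s, a)
    R: (N, M) array, R[s, a] = expected immediate reward
    Returns a deterministic policy pi (pi[s] = action index) and its value V.
    """
    N, M = R.shape
    pi = np.zeros(N, dtype=int)                 # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: the MDP under pi is an MRP, so solve V = R_pi + gamma * P_pi V.
        R_pi = R[np.arange(N), pi]
        P_pi = P[np.arange(N), pi]
        V = np.linalg.solve(np.eye(N) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to Q^{pi}(s, a).
        Q = R + gamma * np.einsum('sab,b->sa', P, V)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):          # no state changes its action: pi is optimal
            return pi, V
        pi = new_pi
```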

 

 
