[DL Reading Group] Bridging the Gap Between Value and Policy Based Reinforcement Learning

Published on 2017/3/10
Deep Learning JP:
http://deeplearning.jp/workshop/

Published in: Technology

  1. One-step Bellman consistency for the on-policy value $Q^\pi$, and the corresponding squared TD loss: $Q^\pi(s,a) = r(s,a) + \gamma\,\mathbb{E}_\pi[Q^\pi(s',a')]$; $L = \big(r(s,a) + \gamma Q^\pi_\theta(s',a') - Q^\pi_\theta(s,a)\big)^2$
  2. Hard-max Bellman optimality for $Q^\circ$, and the Q-learning loss: $Q^\circ(s,a) = r(s,a) + \gamma\max_{a'} Q^\circ(s',a')$; $L = \big(r(s,a) + \gamma\max_{a'} Q^\circ_\theta(s',a') - Q^\circ_\theta(s,a)\big)^2$
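As a concrete illustration of the two one-step targets above — the expected target under $\pi$ versus the hard-max Q-learning target — a minimal Python sketch with purely hypothetical tabular Q-values (the states, actions, and numbers are invented for the example):

```python
def td_loss_onpolicy(q, r, s, a, s2, policy, gamma=0.9):
    """Squared error against the on-policy target r + γ E_π[Q(s',a')]."""
    target = r + gamma * sum(policy[s2][a2] * q[(s2, a2)] for a2 in policy[s2])
    return (target - q[(s, a)]) ** 2

def td_loss_qlearning(q, r, s, a, s2, actions, gamma=0.9):
    """Squared error against the hard-max target r + γ max_a' Q(s',a')."""
    target = r + gamma * max(q[(s2, a2)] for a2 in actions)
    return (target - q[(s, a)]) ** 2

# Hypothetical example: two states, two actions, uniform policy at s1.
q = {("s0", 0): 1.0, ("s0", 1): 0.5, ("s1", 0): 2.0, ("s1", 1): 0.0}
policy = {"s1": {0: 0.5, 1: 0.5}}
loss_pi = td_loss_onpolicy(q, r=1.0, s="s0", a=0, s2="s1", policy=policy)
loss_max = td_loss_qlearning(q, r=1.0, s="s0", a=0, s2="s1", actions=[0, 1])
```

The hard-max loss bootstraps from the greedy action only, while the on-policy loss averages over the policy's action distribution at the next state.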
  5. The same on-policy consistency extended over $n$ steps: $Q^\pi(s,a) = r(s,a) + \gamma\,\mathbb{E}_\pi[Q^\pi(s',a')]$; $L = \big(\sum_{i=0}^{n-1}\gamma^i r(s_i,a_i) + \gamma^n Q^\pi_\theta(s_n,a_n) - Q^\pi_\theta(s_0,a_0)\big)^2$
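The $n$-step target above can be sketched in a few lines; the reward sequence and bootstrap values below are hypothetical, chosen only to exercise the formula:

```python
def nstep_td_loss(rewards, q_first, q_boot, gamma=0.9):
    """Squared n-step consistency error:
    (Σ_{i<n} γ^i r_i + γ^n Q(s_n,a_n) − Q(s_0,a_0))²."""
    n = len(rewards)
    target = sum(gamma ** i * r for i, r in enumerate(rewards)) + gamma ** n * q_boot
    return (target - q_first) ** 2

# Hypothetical 3-step segment bootstrapped with Q(s_3,a_3) = 1.5.
loss = nstep_td_loss([1.0, 0.0, 1.0], q_first=2.0, q_boot=1.5, gamma=0.9)
```

With `n = 0` this degenerates to comparing the bootstrap value directly against the first estimate, which is a quick sanity check on the discount bookkeeping.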
  7. The hard-max loss again, for contrast: $L = \big(r(s,a) + \gamma\max_{a'} Q^\circ_\theta(s',a') - Q^\circ_\theta(s,a)\big)^2$
  8. Soft (entropy-regularized) Bellman optimality for $Q^*$, with the max replaced by a log-sum-exp at temperature $\tau$: $Q^*(s,a) = r(s,a) + \gamma\tau\log\sum_{a'}\exp\big(Q^*(s',a')/\tau\big)$
  9. The log-sum-exp can be evaluated stably by factoring out the maximizing action $a_M = \arg\max_{a'} Q^*(s',a')$: $\tau\log\sum_{a'}\exp(Q^*(s',a')/\tau) = \tau\log\big(\exp(Q^*(s',a_M)/\tau)\sum_{a'}\exp((Q^*(s',a') - Q^*(s',a_M))/\tau)\big) = \max_{a'}Q^*(s',a') + \tau\log\sum_{a'}\exp\big((Q^*(s',a') - Q^*(s',a_M))/\tau\big)$
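A small numeric sketch of the identity above, using hypothetical Q-values: the max-subtraction rewrite leaves the soft backup unchanged, and as $\tau \to 0$ the soft maximum collapses to the hard maximum:

```python
import math

def soft_backup(qs, tau):
    """τ log Σ_a' exp(Q(s',a')/τ), computed via the max-subtraction identity:
    max_a' Q + τ log Σ_a' exp((Q − max Q)/τ)."""
    m = max(qs)  # factor out the maximizing action for numerical stability
    return m + tau * math.log(sum(math.exp((q - m) / tau) for q in qs))

qs = [1.0, 2.0, 0.5]               # hypothetical next-state Q-values
naive = math.log(sum(math.exp(q) for q in qs))  # direct form at τ = 1
stable = soft_backup(qs, tau=1.0)
# soft_backup(qs, tau) → max(qs) as tau → 0, and exceeds max(qs) for tau > 0.
```

The direct form overflows once `q / tau` is large, which is why the factored form is the one worth implementing.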
  10. Path consistency relating the optimal value and optimal policy: $V^*(s) = -\tau\log\pi^*(a\mid s) + r(s,a) + \gamma V^*(s')$
  11. One-decision example: from $s_0$ (value $v_0$), actions $\{a_1,\dots,a_n\}$ lead to states $\{s_1,\dots,s_n\}$ with values $\{v_1,\dots,v_n\}$. Maximum-reward objective: $O_{MR}(\pi) = \sum_{i=1}^{n}\pi(a_i)(r_i + \gamma v_i^\circ)$; $v_0^\circ = O_{MR}(\pi^\circ) = \max_i(r_i + \gamma v_i^\circ)$
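In this one-decision setting the maximizer of $O_{MR}$ is simply the greedy one-hot policy; a sketch with hypothetical rewards and next-state values:

```python
def o_mr(policy, rewards, values, gamma=0.9):
    """Maximum-reward objective Σ_i π(a_i)(r_i + γ v_i)."""
    return sum(p * (r + gamma * v) for p, r, v in zip(policy, rewards, values))

rewards, values = [1.0, 0.0, 2.0], [0.5, 3.0, 0.0]   # hypothetical
scores = [r + 0.9 * v for r, v in zip(rewards, values)]
greedy = [1.0 if s == max(scores) else 0.0 for s in scores]  # one-hot argmax
v0 = o_mr(greedy, rewards, values)   # equals max_i (r_i + γ v_i)
```

Putting all probability on the best action recovers $v_0^\circ = \max_i(r_i + \gamma v_i^\circ)$, matching the closed form above.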
  12. Entropy-regularized objective and its closed-form maximizer: $O_{ENT}(\pi) = \sum_{i=1}^{n}\pi(a_i)\big(r_i + \gamma v_i^* - \tau\log\pi(a_i)\big)$; with $S = \sum_{i'=1}^{n}\exp((r_{i'} + \gamma v_{i'}^*)/\tau)$ this rewrites as $O_{ENT}(\pi) = -\tau\sum_{i=1}^{n}\pi(a_i)\log\dfrac{\pi(a_i)}{\exp((r_i + \gamma v_i^*)/\tau)/S} + \tau\log S$, which is maximized by the softmax policy $\pi^*(a_i) = \dfrac{\exp((r_i + \gamma v_i^*)/\tau)}{\sum_{i'=1}^{n}\exp((r_{i'} + \gamma v_{i'}^*)/\tau)}$
  13. Plugging $\pi^*$ back in: $v_0^* = O_{ENT}(\pi^*) = \tau\log\sum_{i=1}^{n}\exp((r_i + \gamma v_i^*)/\tau)$, with $\pi^*(a_i) = \exp((r_i + \gamma v_i^*)/\tau)\big/\sum_{i'=1}^{n}\exp((r_{i'} + \gamma v_{i'}^*)/\tau)$; hence the path-consistency relation $v_0^* = -\tau\log\pi^*(a_i) + r_i + \gamma v_i^*$ holds for every action $i$.
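The claim above — that the same consistency equation holds for *every* action under $\pi^*$, not just the greedy one — is easy to verify numerically. A sketch with hypothetical rewards and values ($\tau = 0.5$, $\gamma = 0.9$):

```python
import math

def ent_bandit(rewards, values, tau=0.5, gamma=0.9):
    """Closed-form maximizer of the entropy-regularized objective O_ENT:
    returns v0* = τ log Σ exp((r_i + γ v_i)/τ) and the softmax policy π*."""
    scores = [r + gamma * v for r, v in zip(rewards, values)]
    m = max(scores)  # stabilized log-sum-exp, as on slide 9
    log_s = m / tau + math.log(sum(math.exp((s - m) / tau) for s in scores))
    v0 = tau * log_s
    pi = [math.exp(s / tau - log_s) for s in scores]
    return v0, pi

rewards, values = [1.0, 0.0, 2.0], [0.5, 3.0, 0.0]   # hypothetical
v0, pi = ent_bandit(rewards, values)
# Path consistency: v0 + τ log π*(a_i) − r_i − γ v_i = 0 for every i.
```

Because $-\tau\log\pi^*(a_i) = -(r_i + \gamma v_i^*) + \tau\log S$, the residual cancels exactly for each action, which is precisely the path-consistency property PCL later turns into a training signal.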
  14. Telescoping the one-step relation $V^*(s) = -\tau\log\pi^*(a\mid s) + r(s,a) + \gamma V^*(s')$ along a sub-trajectory gives the multi-step consistency $-V^*(s_1) + \gamma^{t-1}V^*(s_t) + R(s_{1:t}) - \tau G(s_{1:t},\pi^*) = 0$, where $R(s_{m:n}) = \sum_{i=0}^{n-m-1}\gamma^i r(s_{m+i},a_{m+i})$ and $G(s_{m:n},\pi) = \sum_{i=0}^{n-m-1}\gamma^i\log\pi(a_{m+i}\mid s_{m+i})$.
  15. Path Consistency Learning (PCL) minimizes the squared inconsistency $C_{\theta,\phi}(s_{1:t}) = -V_\phi(s_1) + \gamma^{t-1}V_\phi(s_t) + R(s_{1:t}) - \tau G(s_{1:t},\pi_\theta)$ via the updates $\Delta\theta \propto C_{\theta,\phi}(s_{1:t})\,\nabla_\theta G(s_{1:t},\pi_\theta)$ and $\Delta\phi \propto C_{\theta,\phi}(s_{1:t})\big(\nabla_\phi V_\phi(s_1) - \gamma^{t-1}\nabla_\phi V_\phi(s_t)\big)$
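A sketch of the inconsistency $C_{\theta,\phi}$ for one sub-trajectory. In practice $V_\phi$ and $\pi_\theta$ are neural networks and the $\Delta\theta$, $\Delta\phi$ updates come from automatic differentiation; here the value estimates, rewards, and log-probabilities are hypothetical scalars, so only the scalar $C$ itself is computed:

```python
import math

def path_inconsistency(v_first, v_last, rewards, log_pis, tau=0.5, gamma=0.9):
    """C = −V(s_1) + γ^{t−1} V(s_t) + R(s_{1:t}) − τ G(s_{1:t}, π),
    where R and G are the discounted sums of rewards and log-probs."""
    t_minus_1 = len(rewards)  # a path s_1..s_t has t−1 transitions
    big_r = sum(gamma ** i * r for i, r in enumerate(rewards))
    big_g = sum(gamma ** i * lp for i, lp in enumerate(log_pis))
    return -v_first + gamma ** t_minus_1 * v_last + big_r - tau * big_g

# Hypothetical 3-step sub-trajectory; both PCL updates push C toward 0.
c = path_inconsistency(
    v_first=1.0, v_last=0.8,
    rewards=[0.5, 0.0, 1.0],
    log_pis=[math.log(0.6), math.log(0.3), math.log(0.9)],
)
```

Note that $C$ depends on both networks, so a single sampled sub-trajectory yields correlated updates to the policy and the value function, which is the sense in which PCL bridges value- and policy-based learning.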
  16. For comparison, the advantage actor-critic (A2C) updates: $A_{\theta,\phi}(s_{1:d+1}) = -V_\phi(s_1) + \gamma^d V_\phi(s_{d+1}) + R(s_{1:d+1})$; $\Delta\theta \propto \mathbb{E}_{s_{0:T}}\big[\sum_{i=0}^{T-1} A_{\theta,\phi}(s_{i:i+d})\,\nabla_\theta\log\pi_\theta(a_i\mid s_i)\big]$; $\Delta\phi \propto \mathbb{E}_{s_{0:T}}\big[\sum_{i=0}^{T-1} A_{\theta,\phi}(s_{i:i+d})\,\nabla_\phi V_\phi(s_i)\big]$; versus the PCL error $C_{\theta,\phi}(s_{1:t}) = -V_\phi(s_1) + \gamma^{t-1}V_\phi(s_t) + R(s_{1:t}) - \tau G(s_{1:t},\pi_\theta)$.
