Learning Continuous Control Policies by Stochastic Value Gradients

1. Learning Continuous Control Policies by Stochastic Value Gradients ▶ NIPS 2015 reading group ▶ Yasuhiro Fujita, Preferred Networks Inc. ▶ January 20, 2016

2. Speaker ▶ Yasuhiro Fujita ▶ Preferred Networks Inc. ▶ Twitter: @mooopan ▶ GitHub: muupan ▶ Interested in reinforcement learning and AI ▶ Recent work: https://twitter.com/hillbig/status/684813252484698112

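3. Paper being read ▶ Learning Continuous Control Policies by Stochastic Value Gradients ▶ Nicolas Heess, Greg Wayne, David Silver, Timothy Lillicrap, Yuval Tassa, Tom Erez (Google DeepMind) ▶ Proposes reinforcement learning algorithms for stochastic control problems in which states and actions take continuous values ▶ The model, value function, and policy are represented by neural networks (NNs) ▶ Uses the reparameterization trick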

4. Video ▶ https://www.youtube.com/watch?v=PYdL7bcn_cM

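5. Problem setting ▶ Markov Decision Process ▶ State $s^t \in \mathbb{R}^{N_S}$ ▶ Action $a^t \in \mathbb{R}^{N_A}$ ▶ Initial state distribution $s^0 \sim p^0(\cdot)$ ▶ Transition distribution $s^{t+1} \sim p(\cdot \mid s^t, a^t)$ ▶ $s^{t+1}$ is determined stochastically from $s^t$ and $a^t$ ▶ Reward function $r^t = r(s^t, a^t, t)$ (time-dependent) ▶ What we seek ▶ A (stochastic) policy $a^t \sim p(\cdot \mid s^t; \theta)$ ▶ $a^t$ is determined stochastically from $s^t$ ▶ What we maximize ▶ The expected sum of rewards $J(\theta) = E\left[\sum_{t=0}^{T} \gamma^t r^t \mid \theta\right]$ ▶ $\gamma \in [0, 1]$ is the discount factor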
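
The objective above is an expectation over trajectories; for a single trajectory, the quantity inside the expectation is a discounted sum of rewards. A minimal sketch of that inner sum follows. The function name `discounted_return` and the reward values are made up for illustration and do not come from the slides.

```python
def discounted_return(rewards, gamma):
    """Return sum_{t=0}^{T} gamma**t * rewards[t] for one trajectory."""
    total = 0.0
    # Accumulate from the last reward backwards: G^t = r^t + gamma * G^{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rewards = [1.0, 0.0, 2.0]  # hypothetical rewards r^0, r^1, r^2
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 * 0 + 0.25 * 2 = 1.5
```

Averaging this quantity over many sampled trajectories would give a Monte Carlo estimate of $J(\theta)$.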

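7. Notes on notation ▶ Subscripts denote partial derivatives ▶ $\pi_\theta = \frac{\partial \pi}{\partial \theta}$ ▶ It does not mean "$\pi$ parameterized by $\theta$" ▶ (although in one place $\pi_\theta$ is used in the latter sense ...) ▶ Superscripts denote time, not exponents ▶ Expected sum of rewards (restated) $J(\theta) = E\left[\sum_{t=0}^{T} \gamma^t r^t \mid \theta\right]$ ▶ $r^t$ is the reward at time $t$ ▶ $\gamma^t$ is $\gamma$ raised to the power $t$ ▶ Whether a superscript is a time index or an exponent has to be judged from context ...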

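8. Continuous-valued actions ▶ "Why not just use DQN?" ▶ DQN [Mnih et al. 2013; Mnih et al. 2015] learns a state-action value $Q(s, a; \theta)$ and selects the action $\arg\max_a Q(s, a; \theta)$ ▶ When $a$ is continuous, the arg max cannot be computed directly! ▶ Instead, represent the policy directly (with an NN) ▶ $a^t \sim p(\cdot \mid s^t; \theta)$ ▶ Selecting an action then requires no arg max ▶ To update $\theta$, the paper uses the family of methods known as policy gradient methods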

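9. Policy Gradient Methods ▶ Goal: find the policy parameters $\theta$ that maximize $J(\theta) = E\left[\sum_{t=0}^{T} \gamma^t r^t \mid \theta\right]$ ▶ Compute $\nabla_\theta J(\theta)$ (the policy gradient) ▶ Once it is available, the policy can be optimized by gradient ascent (policy gradient methods) ▶ How can it be computed? ▶ likelihood ratio methods ▶ value gradient methods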

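11. Likelihood Ratio Methods (2) ▶ Estimate $\nabla_\theta J(\theta)$ using the following identity [Williams 1992; Sutton et al. 1999]: $\nabla_\theta J(\theta) = E_{s \sim \rho^\pi,\, a \sim p(\cdot \mid s; \theta)}\left[Q(s, a)\, \nabla_\theta \log p(a \mid s; \theta)\right]$ ▶ Widely used as a way to compute the policy gradient ▶ Drawbacks ▶ Does not use gradient information of $Q(s, a)$ ▶ High variance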
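
As a rough illustration of the likelihood ratio estimator, the sketch below applies it to a single-step problem with no state. The Gaussian policy $a \sim N(\theta, 1)$, the quadratic `q_value`, and all function names are assumptions chosen so that the exact gradient is known; they are not taken from the paper.

```python
import random


def q_value(a):
    """Hypothetical action value, largest when the action equals 3."""
    return -(a - 3.0) ** 2


def likelihood_ratio_gradient(theta, num_samples, rng):
    """Monte Carlo estimate of dJ/dtheta for the policy a ~ N(theta, 1)."""
    total = 0.0
    for _ in range(num_samples):
        a = rng.gauss(theta, 1.0)
        score = a - theta            # d/dtheta log N(a; theta, 1)
        total += q_value(a) * score  # uses only the value of Q, never dQ/da
    return total / num_samples


rng = random.Random(0)
estimate = likelihood_ratio_gradient(theta=0.0, num_samples=200_000, rng=rng)
# Here J(theta) = -((theta - 3)^2 + 1), so the exact gradient at theta = 0 is 6.
print(estimate)
```

Note that the estimator touches `q_value` only through its values, which is the first drawback listed on the slide.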

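12. Deterministic Value Gradients (1) ▶ Compute the gradient of the value function (the value gradient) by backpropagation (value gradient methods) ▶ Since $J(\theta) = E_{s^0 \sim p^0} V^0(s^0)$, it suffices to compute $V^0_\theta$ ▶ When the MDP and the policy are deterministic ($s' = f(s, a)$, $a = \pi(s)$), the value gradient is obtained by differentiating the deterministic Bellman equation $V(s) = r(s, \pi(s)) + \gamma V'(f(s, \pi(s)))$: $V_s = r_s + r_a \pi_s + \gamma V'_{s'} (f_s + f_a \pi_s)$ (3), $V_\theta = r_a \pi_\theta + \gamma V'_{s'} f_a \pi_\theta + \gamma V'_\theta$ (4) $= Q_a \pi_\theta + \gamma V'_\theta$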

13. Deterministic Value Gradients (2) ▶ With Eqs. (3) and (4), $V^0_\theta(s^0)$ can be computed from a trajectory $(s^0, a^0, s^1, a^1, \ldots)$ in the same way as backpropagation through an RNN
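
The backward recursion of Eqs. (3) and (4) can be sketched on a scalar toy problem. Everything concrete here is an assumption made for illustration: the dynamics $f(s, a) = s + a$, the linear policy $\pi(s; \theta) = \theta s$, the reward $r(s, a) = -(s^2 + a^2)$, and the function names.

```python
GAMMA = 0.9


def rollout(theta, s0, T):
    """Deterministic rollout of s' = s + a with a = theta * s for t = 0..T."""
    states, actions = [s0], []
    for _ in range(T + 1):
        a = theta * states[-1]
        actions.append(a)
        states.append(states[-1] + a)
    return states, actions


def value(theta, s0, T):
    """V^0(s^0): discounted sum of r(s, a) = -(s^2 + a^2) along the rollout."""
    states, actions = rollout(theta, s0, T)
    return sum(GAMMA ** t * -(states[t] ** 2 + actions[t] ** 2)
               for t in range(T + 1))


def value_gradient(theta, s0, T):
    """V^0_theta(s^0) via the backward recursion of Eqs. (3) and (4)."""
    states, actions = rollout(theta, s0, T)
    v_s, v_theta = 0.0, 0.0  # V'_{s'} and V'_theta are zero past the horizon
    for t in reversed(range(T + 1)):
        s, a = states[t], actions[t]
        r_s, r_a = -2.0 * s, -2.0 * a
        f_s, f_a = 1.0, 1.0
        pi_s, pi_theta = theta, s
        # Eq. (4): uses V'_{s'} and V'_theta from step t + 1.
        new_v_theta = r_a * pi_theta + GAMMA * v_s * f_a * pi_theta + GAMMA * v_theta
        # Eq. (3)
        new_v_s = r_s + r_a * pi_s + GAMMA * v_s * (f_s + f_a * pi_s)
        v_s, v_theta = new_v_s, new_v_theta
    return v_theta


theta, s0, T = -0.5, 1.0, 5
eps = 1e-6
finite_difference = (value(theta + eps, s0, T) - value(theta - eps, s0, T)) / (2 * eps)
print(value_gradient(theta, s0, T), finite_difference)  # the two should agree closely
```

Comparing against a central finite difference is a cheap sanity check that the recursion carries $V'_{s'}$ and $V'_\theta$ backwards correctly.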

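14. Deterministic Value Gradients (3) ▶ Drawbacks ▶ Cannot handle stochastic MDPs or stochastic policies ▶ When distinct states cannot be told apart (state aliasing), a stochastic policy is needed ▶ Example: if the gray states are indistinguishable, a deterministic policy may, depending on the start position, never reach the gold ▶ Solved with the reparameterization trick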

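15. Reparameterization Trick ▶ Another way to compute $\nabla_x E_{p(y \mid x)}\, g(y)$ ▶ Rewrite $p(y \mid x)$ using a deterministic function $f$ and a noise variable $\xi$: $y = f(x, \xi)$, $\xi \sim \rho(\cdot)$ ▶ Example: for $p(y \mid x) = N(\mu(x), \sigma^2(x))$, write $y = \mu(x) + \sigma(x)\xi$, $\xi \sim N(0, 1)$ ▶ The derivative then becomes $\nabla_x E_{p(y \mid x)}\, g(y) = E_{\rho(\xi)}\, g_y f_x \approx \frac{1}{M} \sum_{i=1}^{M} g_y f_x \big|_{\xi = \xi_i}$ (5) ▶ Unlike likelihood ratio methods, this uses the gradient information of $g$ and has lower variance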
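
The Gaussian example and the variance claim can be checked numerically. In the sketch below, the choices $\mu(x) = 2x$, $\sigma(x) = x$, and $g(y) = y^2$ are assumptions picked so that $E[g(y)] = 5x^2$ and the exact gradient is $10x$; the score expression in `likelihood_ratio_sample` was derived by hand for this particular Gaussian, and all names are hypothetical.

```python
import random


def g(y):
    return y * y


def g_y(y):
    return 2.0 * y  # dg/dy


def pathwise_sample(x, xi):
    """One sample of g_y * f_x for y = f(x, xi) = 2x + x * xi."""
    y = 2.0 * x + x * xi
    f_x = 2.0 + 1.0 * xi  # mu'(x) + sigma'(x) * xi
    return g_y(y) * f_x


def likelihood_ratio_sample(x, xi):
    """One sample of g(y) * d/dx log N(y; 2x, x^2), written in terms of xi."""
    y = 2.0 * x + x * xi
    score = (xi * xi + 2.0 * xi - 1.0) / x
    return g(y) * score


def mean(values):
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


rng = random.Random(0)
x = 1.0
xis = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
pathwise = [pathwise_sample(x, xi) for xi in xis]
likelihood_ratio = [likelihood_ratio_sample(x, xi) for xi in xis]

# Both estimators target the same gradient (10 at x = 1); compare their spread.
print("pathwise:        ", mean(pathwise), variance(pathwise))
print("likelihood ratio:", mean(likelihood_ratio), variance(likelihood_ratio))
```

In this toy setting the per-sample variance of the pathwise estimator comes out far smaller than that of the likelihood ratio estimator, in line with the last bullet; how large the gap is in general depends on $g$ and on the distribution.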

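16. Stochastic Value Gradients ▶ Reparameterizing the transition distribution as $s' = f(s, a, \xi)$ and the policy as $a = \pi(s, \eta; \theta)$ gives $V_s = E_{\rho(\eta)}\left[r_s + r_a \pi_s + \gamma E_{\rho(\xi)} V'_{s'} (f_s + f_a \pi_s)\right]$ (7), $V_\theta = E_{\rho(\eta)}\left[r_a \pi_\theta + \gamma E_{\rho(\xi)}\left[V'_{s'} f_a \pi_\theta + V'_\theta\right]\right]$ (8) $= E_{\rho(\eta)}\left[Q_a \pi_\theta + \gamma E_{\rho(\xi)} V'_\theta\right]$ ▶ The value gradient can be computed even when the MDP is stochastic and even when the policy is stochastic! (stochastic value gradient)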
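
The step from Eq. (8) to its compact form can be spelled out. The block below assumes the usual definition of the action-value function under the reparameterized transition; the slide uses $Q_a$ without stating this definition, so treat it as an assumption.

```latex
\begin{align*}
Q(s, a) &= r(s, a) + \gamma\, E_{\rho(\xi)}\!\left[ V'\big(f(s, a, \xi)\big) \right], \\
Q_a &= r_a + \gamma\, E_{\rho(\xi)}\!\left[ V'_{s'} f_a \right], \\
V_\theta &= E_{\rho(\eta)}\!\left[ r_a \pi_\theta
          + \gamma\, E_{\rho(\xi)}\!\left[ V'_{s'} f_a \pi_\theta + V'_\theta \right] \right] \\
         &= E_{\rho(\eta)}\!\left[ \big( r_a + \gamma\, E_{\rho(\xi)}[ V'_{s'} f_a ] \big) \pi_\theta
          + \gamma\, E_{\rho(\xi)} V'_\theta \right] \\
         &= E_{\rho(\eta)}\!\left[ Q_a \pi_\theta + \gamma\, E_{\rho(\xi)} V'_\theta \right].
\end{align*}
```

Setting the noise to a constant removes both expectations and recovers the deterministic Eq. (4).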

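17. Recovering the noise ▶ The transition function $f$ is unknown in practice and has to be learned, but the gradient should be computed with the actually observed $s'$ rather than the prediction $\hat{s}' = \hat{f}(s, a, \xi)$ ▶ This limits the effect of prediction errors in $\hat{f}$ ▶ Actions $a^k = \pi(s^k, \eta; \theta^k)$ chosen under old parameters $\theta^k$ should also be usable for computing the gradient at the current $\theta^t$ ▶ This makes experience replay (reuse of past experience) possible ▶ As a result, the noise has to be recovered from the outcome: $\xi \sim p(\xi \mid s, a, s')$, $\eta \sim p(\eta \mid s, a)$ ▶ In the Gaussian case it can be computed as $\eta = (a^k - \mu(s^k)) / \sigma(s^k)$ (confirmed with the authors)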
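
For a Gaussian policy, the recovery step amounts to inverting $a = \mu(s) + \sigma(s)\eta$. The sketch below assumes a hypothetical scalar policy with $\mu(s) = w s$ and a state-independent $\sigma$, and it assumes that $\mu$ and $\sigma$ are evaluated under the current parameters when replaying an old action; the slide does not state which parameters are meant.

```python
import random


def policy(s, eta, theta):
    """Hypothetical Gaussian policy a = w * s + sigma * eta, theta = (w, sigma)."""
    w, sigma = theta
    return w * s + sigma * eta


def recover_eta(s, a, theta):
    """Invert the policy: eta = (a - mu(s)) / sigma(s)."""
    w, sigma = theta
    return (a - w * s) / sigma


rng = random.Random(0)
theta_old = (0.5, 0.2)  # parameters in use when the action was taken
theta_now = (0.7, 0.3)  # current parameters (assumed values)

s = 1.5
a = policy(s, rng.gauss(0.0, 1.0), theta_old)  # action stored in the replay buffer

eta = recover_eta(s, a, theta_now)
replayed_action = policy(s, eta, theta_now)
print(a, replayed_action)  # the current policy reproduces the stored action
```

With the recovered $\eta$ held fixed, $\pi(s, \eta; \theta)$ is a deterministic function of $\theta$ that passes through the stored action, so $\pi_\theta$ can be evaluated at the replayed sample.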

18. Three algorithms ▶ Three algorithms are proposed that differ in how the value gradient is computed ▶ SVG(∞) ▶ SVG(1) ▶ SVG(0)

19. SVG(∞) ▶ Jointly learns the transition function $\hat{f}(s, a, \xi)$ and the policy $\pi(s, \eta)$

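20. SVG(1) ▶ Jointly learns the transition function $\hat{f}(s, a, \xi)$, the policy $\pi(s, \eta)$, and the value function $\hat{V}(s)$ ▶ Uses $\hat{f}$ for one step only and $\hat{V}$ for the remainder ▶ When experience replay is used, the method is written SVG(1)-ER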

21. SVG(0) ▶ Jointly learns the policy $\pi(s, \eta)$ and the action-value function $\hat{Q}(s, a)$ ▶ Does not use a transition function
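
To make the difference between the one-step and zero-step variants concrete, the sketch below writes out a per-sample policy gradient for each, following the structure of Eq. (8): SVG(1) differentiates through one step of the model and then the critic $\hat{V}$, while SVG(0) differentiates through $\hat{Q}$ only. The model, critic, reward, and policy are hand-written stand-ins for what would be learned networks, and importance weighting for experience replay is left out; all of these are assumptions for illustration.

```python
GAMMA = 0.9


def reward_a(s, a):
    return -2.0 * a  # d/da of the hypothetical reward r(s, a) = -(s^2 + a^2)


def model(s, a, xi):
    return s + a + 0.1 * xi  # hypothetical f_hat(s, a, xi)


def model_a(s, a, xi):
    return 1.0  # d/da of f_hat


def critic_s(s):
    return -4.0 * s  # d/ds of the hypothetical V_hat(s) = -2 s^2


def q_a(s, a):
    # d/da of Q_hat(s, a) = r(s, a) + GAMMA * E_xi[V_hat(f_hat(s, a, xi))],
    # chosen here to be consistent with the model and critic above.
    return -2.0 * a + GAMMA * (-4.0 * (s + a))


def policy(s, eta, theta):
    return theta * s + 0.1 * eta  # hypothetical pi(s, eta; theta)


def policy_theta(s, eta, theta):
    return s  # d/dtheta of pi


def svg1_sample(s, eta, xi, theta):
    """(r_a + gamma * V_hat'_{s'} * f_hat_a) * pi_theta for one (eta, xi)."""
    a = policy(s, eta, theta)
    s_next = model(s, a, xi)
    return (reward_a(s, a) + GAMMA * critic_s(s_next) * model_a(s, a, xi)) \
        * policy_theta(s, eta, theta)


def svg0_sample(s, eta, theta):
    """Q_hat_a * pi_theta for one eta; no transition model involved."""
    a = policy(s, eta, theta)
    return q_a(s, a) * policy_theta(s, eta, theta)


s, eta, theta = 1.2, 0.3, -0.4
print(svg1_sample(s, eta, 0.0, theta), svg0_sample(s, eta, theta))
```

Because `q_a` was constructed from the same reward, model, and critic, the two samples coincide when the transition noise is zero; with separately learned networks they would generally differ.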

22. Evaluation ▶ AC [Wawrzynski 2009] and DPG [Silver et al. 2014] are existing methods (both learn a policy and a value function) ▶ SVG(1)-ER performs best overall

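23. When the model is degraded ▶ Evaluated by reducing the number of hidden units in $\hat{f}$ ▶ SVG(∞) degrades severely, while SVG(1) changes little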

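24. When the value function is degraded ▶ Evaluated by reducing the number of hidden units in the value function ▶ DPG degrades severely, while SVG(1) is affected less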

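25. Summary ▶ By using the reparameterization trick in place of likelihood ratio methods, ▶ value gradients can be computed for stochastic MDPs and stochastic policies (stochastic value gradients) ▶ Among the proposed algorithms, SVG(1)-ER performed well in the experiments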

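26. Impressions ▶ The reparameterization trick is handy ▶ Worth using in place of likelihood ratio methods wherever it applies ▶ With discrete actions the reparameterization trick cannot be used, so is there no choice but to rely on likelihood ratio methods? ▶ Curious how SVG(0)-ER would perform ▶ Experience replay matters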

27. References I
[1] Volodymyr Mnih et al. "Human-level control through deep reinforcement learning". In: Nature 518.7540 (2015), pp. 529–533.
[2] Volodymyr Mnih et al. "Playing Atari with Deep Reinforcement Learning". In: NIPS Deep Learning Workshop. 2013, pp. 1–9. arXiv:1312.5602v1.
[3] David Silver et al. "Deterministic Policy Gradient Algorithms". In: ICML 2014. 2014, pp. 387–395.
[4] Richard S. Sutton et al. "Policy Gradient Methods for Reinforcement Learning with Function Approximation". In: Advances in Neural Information Processing Systems 12. 1999, pp. 1057–1063.
[5] Pawel Wawrzynski. "Real-time reinforcement learning by sequential Actor-Critics and experience replay". In: Neural Networks 22.10 (2009), pp. 1484–1497.
[6] Ronald J. Williams. "Simple statistical gradient-following algorithms for connectionist reinforcement learning". In: Machine Learning 8.3-4 (1992), pp. 229–256.
