These are notes on DDPG, from the paper "Continuous control with deep reinforcement learning".
Contents
- Summary
- Details
Summary
DDPG fuses the DQN replay buffer with an actor-critic setup that keeps two networks each for the actor and the critic, and it handles continuous actions (the update uses $\nabla_\theta\mu(s)$ instead of $\nabla_\theta\pi(a|s)$, where $\mu$ is the actor network) as well as continuous states.
Details
Earlier methods assume a discrete action space; DDPG handles continuous actions.
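To make this concrete, below is a minimal PyTorch sketch of a deterministic actor that maps a state directly to a continuous action; the layer sizes and the `max_action` scaling are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy mu(s): maps a state to a continuous action."""
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),  # squash to [-1, 1]
        )
        self.max_action = max_action

    def forward(self, state):
        # Scale the tanh output to the environment's action range.
        return self.max_action * self.net(state)
```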
There are two networks each for the actor and the critic: actor networks $\mu, \mu'$ and critic networks $Q, Q'$. The update steps are as follows (a PyTorch sketch of the whole update follows the list):
- Select an action from the current actor $\mu$ plus exploration noise $\mathcal{N}_t$: $a_t = \mu(s_t) + \mathcal{N}_t$
- Execute $a_t$, observe the reward $r_t$ and the next state $s_{t+1}$
- Store the transition $(s_t, a_t, r_t, s_{t+1})$ in the replay buffer
- Sample a minibatch of $N$ transitions from the buffer
- Compute the targets $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}))$
- Update the critic network by minimizing $L = \frac{1}{N}\sum_i (y_i - Q(s_i, a_i))^2$
- Update the actor network: $\nabla_{\theta_\mu} J \approx \frac{1}{N}\sum_i \nabla_a Q(s, a)\big|_{s=s_i,\, a=\mu(s_i)} \nabla_{\theta_\mu}\mu(s)\big|_{s=s_i}$. For comparison, the earlier (stochastic) policy gradient update is $\nabla_\theta J \approx \frac{1}{N}\sum_i Q(s_i, a_i)\,\nabla_\theta \log\pi(a_i|s_i)$.
- Soft-update the critic and actor target networks:
$$\theta^{Q'} \leftarrow \tau\theta^Q + (1 - \tau)\theta^{Q'}$$
$$\theta^{\mu'} \leftarrow \tau\theta^\mu + (1 - \tau)\theta^{\mu'}$$
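Putting these steps together, here is a minimal sketch of one DDPG update, reusing the `Actor` class from the earlier sketch. The `Critic` architecture, the state/action dimensions, and the hyperparameters (`gamma`, `tau`, learning rates) are illustrative assumptions, not values from the paper.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Critic(nn.Module):
    """Q(s, a): takes a state-action pair and returns a scalar value estimate."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# Networks and their target copies (mu, mu', Q, Q'); dimensions are placeholders.
actor, critic = Actor(state_dim=3, action_dim=1), Critic(state_dim=3, action_dim=1)
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(batch, gamma=0.99, tau=0.005):
    # `batch` is a minibatch of N transitions sampled from the replay buffer:
    # tensors (s, a, r, s2), each with leading batch dimension N (r has shape (N, 1)).
    s, a, r, s2 = batch

    # y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})), computed with the target nets.
    with torch.no_grad():
        y = r + gamma * critic_target(s2, actor_target(s2))

    # Critic update: minimize L = (1/N) * sum_i (y_i - Q(s_i, a_i))^2.
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: maximize Q(s, mu(s)), i.e. the deterministic policy gradient.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft target updates: theta' <- tau * theta + (1 - tau) * theta'.
    for net, target in ((critic, critic_target), (actor, actor_target)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```

Note that the actor loss $-\frac{1}{N}\sum_i Q(s_i, \mu(s_i))$ is backpropagated through the critic, which yields exactly the chain-rule form $\nabla_a Q(s, a)\,\nabla_{\theta_\mu}\mu(s)$ from the list above.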