IEEE TAI 2024 paper: Weighted TD3_BC method. In the offline phase, the algorithm builds on TD3_BC and adds a Q-function-based weight, which mitigates overestimation to some extent:
$$J_{\mathrm{offline}}(\theta) = \mathbb{E}_{(s,a)\sim\mathcal{B}}\left[\zeta\, Q_\phi\big(s, \pi_\theta(s)\big) - \|\pi_\theta(s) - a\|^2\right]$$
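A minimal PyTorch sketch of this weighted actor objective is given below. The names `actor` and `critic`, and the use of the TD3+BC-style normalizer $\alpha / \overline{|Q|}$ as the weight $\zeta$, are illustrative assumptions, not the paper's exact weight function.

```python
import torch.nn.functional as F

def weighted_td3_bc_actor_loss(actor, critic, states, actions, alpha=2.5):
    """Sketch of the offline actor objective
        J(theta) = E_{(s,a)~B}[ zeta * Q_phi(s, pi_theta(s)) - ||pi_theta(s) - a||^2 ].
    zeta is taken here as the TD3+BC-style normalizer alpha / mean|Q| (assumption);
    the paper's Q-based weight function may differ.
    """
    pi = actor(states)                        # pi_theta(s)
    q = critic(states, pi)                    # Q_phi(s, pi_theta(s))
    zeta = alpha / q.abs().mean().detach()    # scale so the Q term and BC term are comparable
    bc = F.mse_loss(pi, actions)              # behaviour-cloning term ||pi_theta(s) - a||^2 (mean over batch)
    return -(zeta * q.mean() - bc)            # minimise the negative objective
```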
Policy Gradient review:
$$\nabla \overline{R}_\theta = \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\left(\sum_{t'=t}^{T_n}\gamma^{t'-t}\, r_{t'}^{n} - b\right)\nabla \log p_\theta\big(a_t^n \mid s_t^n\big)$$
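As a concrete illustration, the sketch below computes the corresponding surrogate loss from discounted rewards-to-go and a constant baseline $b$; the `episodes` layout (a list of per-trajectory log-probabilities and rewards) is a hypothetical convention.

```python
import torch

def reinforce_with_baseline_loss(episodes, gamma=0.99, baseline=0.0):
    """Monte-Carlo policy gradient with a baseline:
        grad = (1/N) sum_n sum_t (G_t^n - b) * grad log pi_theta(a_t^n | s_t^n),
    where G_t^n = sum_{t'=t}^{T_n} gamma^{t'-t} r_{t'}^n is the discounted reward-to-go.
    `episodes` is a list of (log_probs, rewards) pairs, one per trajectory (assumed layout).
    """
    losses = []
    for log_probs, rewards in episodes:
        returns, g = [], 0.0
        for r in reversed(rewards):           # accumulate reward-to-go backwards in time
            g = r + gamma * g
            returns.append(g)
        returns.reverse()
        returns = torch.tensor(returns)
        log_probs = torch.stack(log_probs)
        losses.append(-((returns - baseline) * log_probs).sum())
    return torch.stack(losses).mean()         # average over the N trajectories
```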
Bringing Fairness to Actor-Critic Reinforcement Learning for Network Utility Optimization (reading notes). Contents: Problem Formulation; Learning Algorithm; Learning with Multiplicative-Adjusted Rewards; Solving Fairness Uti…
Abstract: The controller is based on the Actor-Critic (AC) algorithm, inspired by reinforcement learning and optimal control theory. Its main features are: it adjusts the insulin basal rate and the bolus dose simultaneously; it is initialized from clinical protocols; and it performs real-time personalization.
Reference: Reinforcement Learning: An Introduction, Second Edition, by Richard S. Sutton and Andrew G. Barto. Problems with non-policy-gradient methods: the earlier algorithms, whether MC, TD, SARSA, Q-learning, or DQN, Double DQN, and Dueling DQN, share at least two problems: they all deal with discrete state…
gym-0.26.1, CartPole-v1, Actor-Critic. The temporal-difference residual $\psi_t = r_t + \gamma V_{\pi_\theta}(s_{t+1}) - V_{\pi_\theta}(s_t)$ is used here.
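A minimal sketch of this TD residual under the gym 0.26.1 step API is shown below; the `critic` value network, the state tensors, and the discount value are illustrative assumptions.

```python
import torch

def td_residual(critic, reward, state, next_state, terminated, gamma=0.98):
    """One-step TD residual psi_t = r_t + gamma * V(s_{t+1}) - V(s_t),
    used as the advantage signal for the actor update.
    `critic` maps a state tensor to V(s) (hypothetical V-network)."""
    with torch.no_grad():
        v_next = critic(next_state) * (1.0 - float(terminated))  # do not bootstrap past terminal states
    return reward + gamma * v_next - critic(state)

# gym-0.26.1 step API: obs, reward, terminated, truncated, info = env.step(action)
```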