1. From Tabular Methods to Parametric Function Approximation
When the state space has many dimensions, or some of them are continuous, tabular reinforcement learning runs into the curse of dimensionality. One remedy is to give up the one-to-one mapping between states and values and instead fit the state-value relationship with a function, which is where machine learning and deep learning models come in. Note that in tabular methods, evaluating a state updates only that state's value, whereas in a value-function model the update is applied to the parameters $\theta$, which changes the values of all states at once; the update rule therefore needs to be designed with some care.
1.1 Gradient-Based Methods
Sample trajectories from the problem, and let $s^i_t, G^i_t$ denote a Monte Carlo sample at time $t$ and the return computed from that sample.
First consider incremental updates, where $\theta$ is updated once per sample. Let $\theta^i_t$ be the parameter estimate at time $t$ of iteration $i$; the objective for the next sample is $\min_{\theta}(G^i_t - v_{\theta}(s^i_t))^2$, and we move in the negative gradient direction with learning rate $\alpha$.
- The Monte Carlo gradient update is $\Delta\theta = \alpha\,(G^i_t - v_{\theta^i_t}(s_t))\, v'_{\theta^i_t}(s_t)$.
- Let $\theta^{i-1}$ be the final parameter estimate after iteration $i-1$; the temporal-difference (TD) gradient update is $\Delta\theta = \alpha\,(R^i_t + \gamma v_{\theta^{i-1}}(s_{t+1}) - v_{\theta^i_t}(s_t))\, v'_{\theta^i_t}(s_t)$.
- The SARSA gradient update is the action-value analogue: $\Delta\theta = \alpha\,(R^i_t + \gamma q_{\theta^{i-1}}(s_{t+1}, a_{t+1}) - q_{\theta^i_t}(s_t, a_t))\, q'_{\theta^i_t}(s_t, a_t)$.
Next is the batch update, where each iteration's data is used as one batch to update $\theta$. Let $\theta^i$ be the parameter estimate after iteration $i$; the objective is $\min_{\theta}\sum_t (G^i_t - v_{\theta}(s^i_t))^2$, again minimized by moving in the negative gradient direction with learning rate $\alpha$, analogously to the incremental case.
When we assume $v_{\theta}(s) = \theta^{\top}\phi(s)$, this is called linear function approximation, and the gradient can be computed directly (it is simply $\phi(s)$). Otherwise the method is a nonlinear one; the most common example is DQN.
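As a concrete illustration, here is a minimal NumPy sketch of the incremental updates above under the linear assumption $v_{\theta}(s) = \theta^{\top}\phi(s)$; the feature map phi and the toy samples are hypothetical placeholders, not part of any library:
import numpy as np

def phi(s, n_features=4):
    # hypothetical feature map: a one-hot-style encoding of a discrete state
    x = np.zeros(n_features)
    x[s % n_features] = 1.0
    return x

def mc_update(theta, s_t, G_t, alpha=0.1):
    # Monte Carlo update: theta <- theta + alpha * (G_t - v_theta(s_t)) * grad v_theta(s_t);
    # in the linear case the gradient is simply phi(s_t)
    v = theta @ phi(s_t)
    return theta + alpha * (G_t - v) * phi(s_t)

def td_update(theta, theta_prev, s_t, r_t, s_next, alpha=0.1, gamma=0.99):
    # TD update: the bootstrap target uses the previous iteration's
    # parameters theta^{i-1}, matching the formula above
    target = r_t + gamma * (theta_prev @ phi(s_next))
    v = theta @ phi(s_t)
    return theta + alpha * (target - v) * phi(s_t)

theta = np.zeros(4)
theta = mc_update(theta, s_t=2, G_t=1.5)                          # toy sample
theta = td_update(theta, theta.copy(), s_t=2, r_t=1.0, s_next=3)  # toy transition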
1.2 The DQN Method
DQN is built on Q-Learning, i.e. off-policy temporal-difference learning. Its key ingredients are:
- Off-policy: the behavior policy is $\epsilon$-greedy, while evaluation and improvement use the greedy policy.
- Temporal difference: the value function at the next time step is used to estimate the current one. DQN computes the parameters used in the TD target separately from those of the value function being trained, which reduces correlation problems; the target parameters are synced to the value-function parameters every fixed number of steps (the target network).
- Nonlinear function: a deep convolutional network is used as the approximator. Experience replay is used to break the temporal correlation of the Markov chain: transitions are stored in a fixed-size buffer and a random mini-batch is drawn for each training step (a sketch of this and of the target network follows this list).
- Prioritized replay: samples with a larger TD error are drawn with higher probability; to correct the resulting bias, each sample is weighted by an importance-sampling coefficient.
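A minimal sketch of experience replay and a periodically-synced target network, assuming the network parameters live in a NumPy array; the class names are illustrative and this is not the stable-baselines implementation:
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    # fixed-size buffer; uniform random sampling breaks temporal correlation
    def __init__(self, size=50000):
        self.storage = deque(maxlen=size)

    def add(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.storage, batch_size)

class TargetNetwork:
    # holds a delayed copy of the online parameters used for TD targets
    def __init__(self, params):
        self.params = np.array(params, copy=True)

    def maybe_sync(self, online_params, step, update_freq=500):
        # copy the online parameters every update_freq steps so that
        # the bootstrap target changes slowly
        if step % update_freq == 0:
            self.params = np.array(online_params, copy=True)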
Double DQN addresses the overestimation problem of Q-Learning: action selection and action evaluation use different value functions (the online network picks the greedy action, the target network evaluates it).
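A hedged sketch of the difference between the two targets, operating on hypothetical NumPy arrays of next-state Q-values:
import numpy as np

def dqn_target(r, q_target_next, gamma=0.99):
    # vanilla Q-Learning/DQN: the same (target) values both select and
    # evaluate the next action, which tends to overestimate
    return r + gamma * np.max(q_target_next)

def double_dqn_target(r, q_online_next, q_target_next, gamma=0.99):
    # Double DQN: the online network selects the action,
    # the target network evaluates it
    a_star = np.argmax(q_online_next)
    return r + gamma * q_target_next[a_star]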
Dueling DQN decomposes the $q(s,a)$ function into two parts, $v(s) + A(s,a)$, estimated by separate neural-network heads. The benefit is that, since the action space $|A|$ is usually far smaller than the state space $|S|$, the decomposition greatly reduces what the output layer has to represent.
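A small sketch of the dueling aggregation on hypothetical head outputs; the commonly used form also subtracts the mean advantage so that the v/A split is identifiable:
import numpy as np

def dueling_q(v, advantages):
    # q(s, a) = v(s) + A(s, a) - mean_a A(s, a)
    return v + advantages - advantages.mean()

q_values = dueling_q(v=1.2, advantages=np.array([0.3, -0.1, 0.5]))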
2. A DQN Example with stable-baselines
We use the high-level stable-baselines wrapper, whose API is very similar to sklearn's:
import gym
from stable_baselines.deepq.policies import MlpPolicy
from stable_baselines import DQN

# train a DQN agent on CartPole-v1 for 25000 timesteps
model = DQN(MlpPolicy, 'CartPole-v1', verbose=1).learn(25000)

# run the trained policy; the DummyVecEnv wrapper resets automatically at episode end
obs = model.env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = model.env.step(action)
    model.env.render()
The output is:
Creating environment from the given name, wrapped in a DummyVecEnv.
--------------------------------------
| % time spent exploring | 27 |
| episodes | 100 |
| mean 100 episode reward | 18.7 |
| steps | 1848 |
--------------------------------------
--------------------------------------
| % time spent exploring | 2 |
| episodes | 200 |
| mean 100 episode reward | 126 |
| steps | 14405 |
--------------------------------------
--------------------------------------
| % time spent exploring | 2 |
| episodes | 300 |
| mean 100 episode reward | 100 |
| steps | 24423 |
--------------------------------------
......
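After training, the model can be saved and reloaded with the standard stable-baselines calls (the file name here is arbitrary):
model.save("dqn_cartpole")
loaded_model = DQN.load("dqn_cartpole")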
The parameters of DQN and their default values are as follows:
class stable_baselines.deepq.DQN(policy, env, gamma=0.99, learning_rate=0.0005, buffer_size=50000, exploration_fraction=0.1, exploration_final_eps=0.02, train_freq=1, batch_size=32, checkpoint_freq=10000, checkpoint_path=None, learning_starts=1000, target_network_update_freq=500, prioritized_replay=False, prioritized_replay_alpha=0.6, prioritized_replay_beta0=0.4, prioritized_replay_beta_iters=None, prioritized_replay_eps=1e-06, param_noise=False, verbose=0, tensorboard_log=None, _init_setup_model=True)
- policy – (DQNPolicy or str) The policy model to use (MlpPolicy, CnnPolicy, LnMlpPolicy, …)
- env – (Gym environment or str) The environment to learn from (if registered in Gym, can be str)
- gamma – (float) discount factor
- learning_rate – (float) learning rate for adam optimizer
- buffer_size – (int) size of the replay buffer
- exploration_fraction – (float) fraction of entire training period over which the exploration rate is annealed
- exploration_final_eps – (float) final value of random action probability
- train_freq – (int) update the model every train_freq steps.
- batch_size – (int) size of a batch sampled from replay buffer for training
- checkpoint_freq – (int) how often to save the model. This is so that the best version is restored at the end of the training. If you do not wish to restore the best version at the end of the training, set this variable to None.
- checkpoint_path – (str) replacement path used if you need to log to somewhere other than a temporary directory.
- learning_starts – (int) how many steps of the model to collect transitions for before learning starts
- target_network_update_freq – (int) update the target network every target_network_update_freq steps.
- prioritized_replay – (bool) if True, a prioritized replay buffer will be used.
- prioritized_replay_alpha – (float) alpha parameter for prioritized replay buffer. It determines how much prioritization is used, with alpha=0 corresponding to the uniform case.
- prioritized_replay_beta0 – (float) initial value of beta for prioritized replay buffer
- prioritized_replay_beta_iters – (int) number of iterations over which beta will be annealed from its initial value to 1.0. If set to None, it equals max_timesteps.
- prioritized_replay_eps – (float) epsilon to add to the TD errors when updating priorities.
- param_noise – (bool) Whether or not to apply noise to the parameters of the policy.
- verbose – (int) the verbosity level: 0 none, 1 training information, 2 tensorflow debug
- tensorboard_log – (str) the log location for tensorboard (if None, no logging)
- _init_setup_model – (bool) Whether or not to build the network at the creation of the instance
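For instance, the prioritized replay and target-network settings discussed in Section 1.2 map directly onto these constructor arguments; a hedged example (the hyperparameter values are arbitrary choices, not recommendations):
from stable_baselines.deepq.policies import MlpPolicy
from stable_baselines import DQN

model = DQN(
    MlpPolicy,
    'CartPole-v1',
    gamma=0.99,
    learning_rate=5e-4,
    buffer_size=50000,                # size of the experience replay buffer
    target_network_update_freq=500,   # how often the TD-target network is synced
    prioritized_replay=True,          # sample large-TD-error transitions more often
    prioritized_replay_alpha=0.6,
    prioritized_replay_beta0=0.4,     # importance-sampling correction strength
    verbose=1,
)
model.learn(total_timesteps=25000)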