In this chapter we consider methods for learning the policy parameter $\theta$ based on the gradient of some performance measure $J(\theta)$ with respect to the policy parameter. These methods seek to maximize performance, so their updates approximate gradient ascent in $J$:

$$
\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)},
$$

where $\widehat{\nabla J(\theta_t)}$ is a stochastic estimate whose expectation approximates the gradient of the performance measure with respect to $\theta_t$.
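As a minimal illustration of this update (a sketch, not code from the text), the step can be written as a one-line NumPy function; `estimate_policy_gradient` below is a hypothetical placeholder for any stochastic estimator of $\nabla J(\theta)$.

```python
import numpy as np

def gradient_ascent_step(theta, grad_estimate, alpha=0.01):
    """One approximate gradient-ascent update: theta <- theta + alpha * grad_J_hat."""
    return theta + alpha * grad_estimate

# Illustrative usage with a hypothetical estimator of the performance gradient:
# theta = np.zeros(d)
# for _ in range(num_iterations):
#     grad_hat = estimate_policy_gradient(theta)   # placeholder, e.g. a REINFORCE-style estimate
#     theta = gradient_ascent_step(theta, grad_hat)
```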
How do we measure whether a policy is good (i.e., some performance measure of the policy)? In the episodic setting, we look at the expected cumulative return: is it large or small? A better policy raises this expectation, so there exists a parameter $\theta$ that maximizes $J(\theta)$. Different policies lead to different trajectories through the environment. How should a trajectory be defined? If it consisted only of states and rewards, it would make no explicit reference to the policy; we need to bring the policy in: in the same state, different policies choose different actions, and the return obtained by acting under one policy differs from the return obtained under another. This difference only shows up as it accumulates over time, so it is the trajectory that distinguishes one policy from another (see "Going Deeper Into Reinforcement Learning: Fundamentals of Policy Gradients"). The derivations that follow are built on this idea. I think the author's notion of a trajectory, a state-action sequence $\tau = (s_0, a_0, s_1, a_1, \ldots)$, is particularly well chosen: it ties the process both to the environment and to the policy.
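To make precise how a trajectory ties the environment and the policy together, the usual factorization of the probability of a trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T)$ under a policy $\pi_\theta$ is (a standard definition, sketched here rather than quoted from the cited post):

$$
\pi_\theta(\tau) = p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t),
$$

where $p(s_0)$ and $p(s_{t+1} \mid s_t, a_t)$ are supplied by the environment and $\pi_\theta(a_t \mid s_t)$ by the policy, which is exactly why trajectories can distinguish one policy from another.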
The return of a single trial under a given policy: since the external environment is stochastic, even with the same states and the same policy, the trajectory obtained in each trial is different, and so is its return. To model each individual trial accurately, we need to modify the objective above: change it to an expectation over trajectories,

$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big],
$$

where $R(\tau)$ is the total reward accumulated along trajectory $\tau$.
Think of the trajectory as a kind of meta-data: for now we do not care about the step-by-step state-action-reward-state-action details of each trial, only about the longer chain of transitions taken as a whole.
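As a concrete sketch of treating trajectories as opaque units, the following Python code estimates $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$ by Monte Carlo: sample whole trajectories, keep only each one's total return, and average. The `env` interface (`reset()` and `step()` returning state, reward, done) and the `policy(theta, s)` sampler are assumptions made for illustration, not part of the text.

```python
import numpy as np

def rollout(env, policy, theta, max_steps=1000):
    """Sample one trajectory tau under pi_theta and return its total reward R(tau).
    Intermediate (s, a, r) details are discarded; only the return is kept."""
    s = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        a = policy(theta, s)        # action sampled from pi_theta(. | s)
        s, r, done = env.step(a)    # assumed environment interface
        total_reward += r
        if done:
            break
    return total_reward

def estimate_J(env, policy, theta, num_trajectories=100):
    """Monte Carlo estimate of J(theta) = E_{tau ~ pi_theta}[R(tau)]."""
    returns = [rollout(env, policy, theta) for _ in range(num_trajectories)]
    return float(np.mean(returns))
```

Because each run of `rollout` produces a different trajectory and return, averaging many of them is what turns the noisy per-trial returns into an estimate of the expectation.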
The choice $\Psi_t = A^{\pi}(s_t, a_t)$, where $\Psi_t$ denotes the weight on $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ in the policy gradient estimator, yields almost the lowest possible variance, though in practice, the advantage function is not known and
must be estimated. This statement can be intuitively justified by the
following interpretation of the policy gradient: that a step in the
policy gradient direction should increase the probability of
better-than-average actions and decrease the probability of
worse-than-average actions. The advantage function, by its definition $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$, measures whether the action is better or worse than the policy's default behavior. Hence, we should choose $\Psi_t$ to be the advantage function $A^{\pi}(s_t, a_t)$, so that the gradient term $\Psi_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ points in the direction of increased $\pi_\theta(a_t \mid s_t)$ if and only if $A^{\pi}(s_t, a_t) > 0$. See Greensmith et al. (2004) for a more rigorous analysis of
the variance of policy gradient estimators and the effect of using a
baseline.
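Since the advantage function must be estimated in practice, here is a small Python sketch of one common estimator, the one-step TD residual $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ computed from an approximate value function, together with how the resulting estimates weight the score-function terms (the role of $\Psi_t$ above). The helper names and array shapes are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def advantage_estimates(rewards, values, gamma=0.99):
    """One-step TD-residual advantage estimates along a single trajectory:
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    `values` has length len(rewards) + 1 because it includes V(s_T)."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    return rewards + gamma * values[1:] - values[:-1]

def policy_gradient_estimate(grad_log_probs, advantages):
    """Sum_t A_hat_t * grad_theta log pi_theta(a_t | s_t): actions with positive
    estimated advantage have their probability pushed up, negative ones down."""
    return sum(a * g for a, g in zip(advantages, grad_log_probs))
```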