Evaluate state values according to the current critic. The story below is
from Hackernoon:
This function acts essentially as reflection: imagine a fox called
Cranberry doing the reflection. At first, it does not know the exact
values of the different state-action pairs (s, a). It has to predict those
values Q(s, a) to guide its next actions. But after taking several actions
it stops to reflect: "Several steps ago I predicted the value as q, but the
rewards-to-go I actually got is g. The difference is too large, so the next
time I have to predict for the same state-action pair, I should adjust q to
a level closer to g!" So in Cranberry's mind there is a state-action value
table that tells it the values given states, and this table is always being
updated.
Cranberry uses the latest state-action value table to predict the value of
the next state-action pair, but given a certain state and its candidate
state-action values, deciding which action to actually take is the job of
another table: the policy. Taking actions depends both on historical habits
and on newly observed rewards, including rewards that have never been seen
before.
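To make the reflection concrete, here is a minimal, hypothetical Python sketch of such a state-action value table being nudged toward the observed rewards-to-go; the names (ValueTable, reflect, learning_rate) and the update rule are illustrative assumptions, not code from the original article.

```python
from collections import defaultdict


class ValueTable:
    """Cranberry's table of predicted state-action values q ~ Q(s, a)."""

    def __init__(self, learning_rate=0.1):
        self.q = defaultdict(float)   # (state, action) -> predicted value q
        self.lr = learning_rate       # how strongly each reflection corrects q

    def predict(self, state, action):
        # "I predicted as q ..."
        return self.q[(state, action)]

    def reflect(self, state, action, g):
        # "... but the rewards-to-go I got is g": move q a step closer to g.
        q = self.q[(state, action)]
        self.q[(state, action)] = q + self.lr * (g - q)


def rewards_to_go(rewards, gamma=0.99):
    """Discounted sum of the rewards observed after each step (the 'g' above)."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return list(reversed(out))
```

After an episode, the fox would call reflect for every visited (state, action) pair with the matching rewards-to-go entry, which is the "always updating" table described above.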
Because this fox has a family to raise, there are two constraints:
1. it does not want to change its historical policy too easily;
2. it wants to find more rewards to provide a better life for its family.
There should be a balance between the two, so an objective (target)
combining 1 and 2 emerges: penalize big policy changes while maximizing
some predicted (estimated) value (it could be Q(s, a), or the advantage:
rewards-to-go minus the predicted value). If the fox is a:
simple policy gradient fox: "Next time when I am at state s, because
I know Q(s, a) is the largest one, I tune π(a|s) to be the biggest among
those actions." If the expected value Q(s, a) of a is not correct, this
causes less exploration, and the total reward will not be the maximum.
intelligent policy gradient fox: "Next time when I am at state s, I
should not only consider Q(s, a) but also the actual rewards after the
action is taken. I should compare how different the expected reward
Q(s, a) of taking that action is from the actual rewards. If the expected
reward is high (that means Q(s, a) is high) but the rewards-to-go are low,
it means I overestimated Q(s, a); in this case, I should not choose a. On
the contrary, if the expected reward is low (that means Q(s, a) is low)
but the rewards-to-go are high, it means I underestimated this action, and
I should take it more often when I am at state s in the future. So in my
mind, I regard (rewards-to-go - Q(s, a)) as the criterion for taking
actions." This strategy will not only do more exploration but also keep
the policy moving in a direction that is at least not bad (TRPO); a rough
sketch of this idea in code follows below.
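Assuming the "penalize big policy changes, follow the rewards-to-go minus the prediction" idea is implemented in the PPO-clip style, a rough NumPy sketch could look as follows; the function names and the clip_eps value are illustrative, not the article's code.

```python
import numpy as np


def advantages(rewards_to_go, predicted_values):
    """The intelligent fox's criterion: actual rewards-to-go minus the prediction."""
    return np.asarray(rewards_to_go) - np.asarray(predicted_values)


def clipped_surrogate(new_logp, old_logp, adv, clip_eps=0.2):
    """Maximize advantage-weighted improvement while penalizing big policy changes."""
    ratio = np.exp(new_logp - old_logp)                 # how much pi(a|s) changed
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Take the pessimistic minimum so large policy jumps earn no extra credit.
    return np.mean(np.minimum(ratio * adv, clipped * adv))
```

Maximizing this surrogate pushes up actions whose rewards-to-go beat the prediction, while the clipping keeps the new policy close to the old one, which is the balance between constraints 1 and 2.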
Actor-Critic Framework: A concept essential for
understanding the original PPO paper by Schulman et al. is the
actor-critic framework [4]. Already in the 1980s,
Sutton et al. argued that it is inefficient for the agent to evaluate
its own actions [5].
Instead, they proposed splitting the agent into two roles:
the actor, which decides actions and learns the policy
π(a∣s), and the critic, which evaluates actions by
estimating the state value V(s). This framework makes the algorithm
versatile and balances exploration vs exploitation. The two roles can be
designed using neural networks with a shared backbone. (Excerpted from
Felix Verstraete.)
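As a concrete illustration of the shared-backbone design, here is a minimal PyTorch sketch of an actor-critic module; the class name, layer sizes, and activation choices are assumptions for illustration, not the architecture from the PPO paper.

```python
import torch
import torch.nn as nn


class ActorCritic(nn.Module):
    """Actor and critic heads on top of one shared feature backbone."""

    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        # Shared backbone: both roles read the same state features.
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.actor = nn.Linear(hidden, n_actions)  # logits for pi(a|s)
        self.critic = nn.Linear(hidden, 1)         # estimate of V(s)

    def forward(self, obs):
        features = self.backbone(obs)
        dist = torch.distributions.Categorical(logits=self.actor(features))
        value = self.critic(features).squeeze(-1)
        return dist, value
```

The actor head returns a distribution over actions (the policy π(a|s)) and the critic head returns the state-value estimate V(s), so both roles are trained from the same shared features.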