The Deep Deterministic Policy Gradient (DDPG) algorithm is summarized in Algorithm 8. First, the critic network, $Q(s, a, \theta^Q)$, is randomly initialized with weights and biases $\theta^Q$. Then the actor network, $\mu(s, \theta^\mu)$, is initialized with a separate set of weights and biases $\theta^\mu$. Both target networks, $Q'$ and $\mu'$, receive copies of the weights of their corresponding online networks. A replay buffer, $B$, of a chosen fixed size is created and waits to be populated with sample transitions.
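The initialization described above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in this work: the networks are replaced by flat parameter lists, and `init_weights`, the parameter count of 8, the scale of 0.1, and the buffer capacity are all hypothetical choices.

```python
import copy
import random
from collections import deque


def init_weights(n_params, scale=0.1, rng=random):
    """Return a small random parameter list (a stand-in for a network's weights)."""
    return [rng.uniform(-scale, scale) for _ in range(n_params)]


# Online networks: critic and actor get independently drawn parameters.
critic_weights = init_weights(8)
actor_weights = init_weights(8)

# Target networks start as exact, independent copies of the online networks.
target_critic_weights = copy.deepcopy(critic_weights)
target_actor_weights = copy.deepcopy(actor_weights)

# Replay buffer B: a bounded FIFO that is empty until transitions are stored.
# The capacity is an assumed value, not one taken from the text.
replay_buffer = deque(maxlen=100_000)
```

Copying rather than aliasing matters here: if the target lists were the same objects as the online lists, every online update would also change the targets.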
Algorithm 8 Deep Deterministic Policy Gradient
1: Randomly initialize critic network $Q(s, a, \theta^Q)$ with weights $\theta^Q$
2: Randomly initialize actor network $\mu(s, \theta^\mu)$ with weights $\theta^\mu$
3: Initialize target networks $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^Q$, $\theta^{\mu'} \leftarrow \theta^\mu$
4: Initialize replay buffer $B$
5: for n episodes do
6:     Initialize a random process $\mathcal{N}$ for action exploration
7:     Observe initial state $s_1$
8:     for each step of episode do
9:         $a_t \leftarrow \mu(s_t, \theta^\mu) + \mathcal{N}_t$
10:        Take action $a_t$, observe reward $r_t$ and next state $s_{t+1}$
11:        Store transition $(s_t, a_t, r_t, s_{t+1})$ in $B$
12:        Sample a random minibatch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $B$
13:        Set $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}, \theta^{\mu'}), \theta^{Q'})$
14:        Update critic by minimizing the loss: $L = \frac{1}{N} \sum_i (Q(s_i, a_i, \theta^Q) - y_i)^2$
15:        Update the actor policy using the sampled policy gradient:
           $\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q(s, a, \theta^Q)\big|_{s=s_i,\, a=\mu(s_i)} \, \nabla_{\theta^\mu} \mu(s, \theta^\mu)\big|_{s_i}$
16:        Update target critic network, $\theta^{Q'} \leftarrow \tau \theta^Q + (1 - \tau)\theta^{Q'}$
17:        Update target actor network, $\theta^{\mu'} \leftarrow \tau \theta^\mu + (1 - \tau)\theta^{\mu'}$
The following process loops for n episodes or until the agent's training has converged. At the start of each episode, the Ornstein-Uhlenbeck exploration noise process is randomly re-initialized and the initial state is observed. A loop over the steps of the episode then begins, up to the maximum number of steps per episode, starting with the first action predicted by the actor network combined with the exploration noise. The initial outputs of the networks are nearly zero because of the weight and bias initialization specified by Lillicrap. Early in training the action is therefore dominated by the noise, which keeps the randomly initialized weights from biasing learning in one direction or another.
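To make the exploration step concrete, a discretized Ornstein-Uhlenbeck process might look like the sketch below. The class name `OUNoise`, the helper `noisy_action`, the parameter values (`theta=0.15`, `sigma=0.2`, `dt=1.0`), and the action bounds are assumptions for illustration. Resetting the state to the mean is also only one possible choice; drawing a random initial state is another.

```python
import math
import random


class OUNoise:
    """Discretized Ornstein-Uhlenbeck process:
    x <- x + theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, seed=None):
        self.size = size
        self.mu = mu
        self.theta = theta
        self.sigma = sigma
        self.dt = dt
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # Called at the start of every episode (assumed: reset to the mean).
        self.x = [self.mu] * self.size

    def sample(self):
        # Mean-reverting drift plus Gaussian diffusion, applied per dimension.
        self.x = [
            x
            + self.theta * (self.mu - x) * self.dt
            + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for x in self.x
        ]
        return list(self.x)


def noisy_action(actor_output, noise, low=-1.0, high=1.0):
    """a_t = mu(s_t) + N_t, clipped to the assumed action bounds."""
    return [min(high, max(low, a + n)) for a, n in zip(actor_output, noise.sample())]
```

Because consecutive samples are correlated, the noise pushes the action in a similar direction over several steps, which is the usual motivation for choosing this process over independent Gaussian noise.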
The action is then taken in the environment, which returns the reward and the next state. The transition is stored in the replay buffer, $B$. If the replay buffer holds fewer transitions than the batch size, or if a warm-up period has not yet elapsed, the agent continues to interact with the environment without updating any weights. Once the algorithm is allowed to sample from the replay buffer, the critic is trained first using backpropagation. Given a random minibatch of, say, 64 transitions, the online critic network forward-propagates each sampled state-action pair to predict its Q-value.
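The gating logic for the replay buffer can be sketched as follows. The class `ReplayBuffer`, its method names, the capacity, and the warm-up length of 80 steps are hypothetical. Only the batch size of 64 echoes the example in the text.

```python
import random
from collections import deque


class ReplayBuffer:
    """Bounded FIFO store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity, seed=None):
        self.storage = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, state, action, reward, next_state):
        self.storage.append((state, action, reward, next_state))

    def ready(self, batch_size, warmup_steps):
        # Updates begin only once the buffer holds at least a full batch
        # and the warm-up period has elapsed.
        return len(self.storage) >= max(batch_size, warmup_steps)

    def sample(self, batch_size):
        # Uniform sampling without replacement.
        return self.rng.sample(list(self.storage), batch_size)


buffer = ReplayBuffer(capacity=1000, seed=0)
first_update_step = None
batch = None
for t in range(100):
    buffer.store(float(t), 0.0, 1.0, float(t + 1))  # dummy transition
    if buffer.ready(batch_size=64, warmup_steps=80):
        if first_update_step is None:
            first_update_step = t
        batch = buffer.sample(64)  # network updates would happen here
```

Sampling uniformly from a large buffer breaks the correlation between consecutive transitions within an episode.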
The labels for the critic predictions derive from the Bellman equation; the key insight is that the target networks are used to compute them: $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}, \theta^{\mu'}), \theta^{Q'})$. The mean squared error over the minibatch is $L = \frac{1}{N} \sum_i (Q(s_i, a_i, \theta^Q) - y_i)^2$. The partial derivatives of the loss are then calculated and propagated through all the layers with the chain rule to update the weights and biases of the critic network. The target networks are used only when updating the critic network, where they provide consistent target values that help keep the critic training from diverging.
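A small worked example may help to check the target and loss formulas by hand. The toy functions standing in for the networks, the two-transition batch, and `gamma=0.5` are all made up for illustration.

```python
def td_targets(batch, target_actor, target_critic, gamma):
    """y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))."""
    return [
        r + gamma * target_critic(s_next, target_actor(s_next))
        for (_, _, r, s_next) in batch
    ]


def critic_loss(batch, targets, critic):
    """L = (1/N) * sum_i (Q(s_i, a_i) - y_i)^2."""
    n = len(batch)
    return sum((critic(s, a) - y) ** 2 for (s, a, _, _), y in zip(batch, targets)) / n


# Toy stand-ins for the target actor, target critic, and online critic.
def target_actor(s):
    return 0.5 * s


def target_critic(s, a):
    return s + a


def critic(s, a):
    return 2.0 * s + a


# Each transition is (s_i, a_i, r_i, s_{i+1}).
batch = [(1.0, 0.0, 1.0, 2.0), (0.0, 1.0, 0.0, 1.0)]
targets = td_targets(batch, target_actor, target_critic, gamma=0.5)
loss = critic_loss(batch, targets, critic)
print(targets, loss)  # [2.5, 0.75] 0.15625
```

For the first transition, the target actor proposes 1.0 at the next state, the target critic scores it 3.0, and the label is 1.0 + 0.5 * 3.0 = 2.5. The online critic predicts 2.0, giving a squared error of 0.25.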
Next, the online actor network takes the same minibatch of states and predicts the corresponding actions. The gradient of the Q-values from the online critic network with respect to these predicted actions is then calculated. The actor network gradients are determined from the action predictions, the online actor network variables, and the negative gradient of the Q-values with respect to the actions. The negative gradient is used because the optimizer minimizes its objective: minimizing the negative of the Q-value is equivalent to performing gradient ascent on the Q-value.
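The chain rule behind the actor update can be illustrated with a one-parameter example. It assumes a toy critic $Q(s, a) = -(a - 2s)^2$ and a toy linear actor $\mu(s, \theta) = \theta s$, neither of which comes from this work. The best action is $a = 2s$, so gradient ascent should drive $\theta$ toward 2.

```python
def dq_da(s, a):
    """Gradient of the toy critic Q(s, a) = -(a - 2*s)**2 with respect to a."""
    return -2.0 * (a - 2.0 * s)


def actor(s, theta):
    """Toy linear actor mu(s, theta) = theta * s."""
    return theta * s


def dmu_dtheta(s, theta):
    """Gradient of the toy actor with respect to its single parameter."""
    return s


def policy_gradient(states, theta):
    """(1/N) * sum_i dQ/da at (s_i, mu(s_i)) times dmu/dtheta at s_i."""
    n = len(states)
    return sum(dq_da(s, actor(s, theta)) * dmu_dtheta(s, theta) for s in states) / n


theta = 0.0
states = [0.5, 1.0, 1.5]  # a fixed toy minibatch of states
learning_rate = 0.1
for _ in range(200):
    # Gradient ascent on Q. A minimizing optimizer would instead be handed
    # the negated gradient, which yields the same parameter change.
    theta += learning_rate * policy_gradient(states, theta)
```

After the loop, `theta` is very close to 2: the actor has learned the action that the critic scores highest in every sampled state.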
There is no explicit loss for updating the weights of the actor network. Instead, the idea is to update the neural network in the direction that maximizes the expected return; this is the policy gradient. This can also be thought of as maximizing the Q-function, which is why the critic usually needs to be trained first. This update reinforces good policies and penalizes bad ones. Figure 6.5 shows the agent in the DDPG algorithm.
[Figure 6.5 diagram: Environment and Agent blocks; within the Agent, the Actor and Critic with their Update paths (Critic Loss, Actor Gradient), linked by the signals s, a, r, and Q.]
Figure 6.5: The DDPG algorithm conveniently uses continuous actions from the actor, and the critic provides a measure of how well the actor did; hence the interplay between the actor and the critic.
After the weights of the actor are changed, the target critic network is updated slowly using $\theta^{Q'} \leftarrow \tau \theta^Q + (1 - \tau)\theta^{Q'}$. Immediately afterward, the target actor network is updated using $\theta^{\mu'} \leftarrow \tau \theta^\mu + (1 - \tau)\theta^{\mu'}$. The next step of the current episode then interacts with the environment and updates the networks in the same way. Once the episode is terminated by a certain condition or the maximum number of steps has been taken, the next episode begins, and the noise and initial conditions are reset. The process repeats until the episodes run out or training is deemed converged according to a convergence criterion discussed later.
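The soft target update is simple enough to verify numerically. In this sketch the weights are plain lists, and `soft_update` and the value `tau = 0.001` are illustrative assumptions rather than settings taken from this work.

```python
def soft_update(target, online, tau):
    """theta' <- tau * theta + (1 - tau) * theta', applied element-wise."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]


online = [1.0, -2.0]  # online weights, held fixed here for clarity
target = [0.0, 0.0]   # target weights, deliberately started far away
tau = 0.001           # assumed value

for _ in range(1000):
    target = soft_update(target, online, tau)

# With fixed online weights, after k updates the target has closed a
# fraction 1 - (1 - tau)**k of the initial gap.
fraction_closed = 1.0 - (1.0 - tau) ** 1000
```

After 1000 updates the target has covered only about 63% of the distance to the online weights, which shows how slowly the targets track the online networks when $\tau$ is small.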