
The critic class primarily houses the critic neural network and provides the training method. Constructing the critic begins with initializing the number of states, the number of actions, the discount factor, the learning rate, the number of neurons, etc. The portion of the computational graph that belongs to the critic is also constructed when the class is initialized.

The critic's variables and tensorflow operations are placed inside a variable scope named ‘Critic’ so that they can be easily found on the computational graph. The actor also exists on the same computational graph, so it makes sense to organize variables this way. First, several terms are set up as tensorflow placeholders, which are the entry points for feeding data to the computational graph. Their values are not known ahead of time and change constantly. The placeholders for the critic include the state, next state, action, predicted action, target label and the training phase. Table 6.3 summarizes the variables mentioned above.

Table 6.3: This table summarizes the placeholder values for the critic implementation.

Placeholder variable    Description
s                       state
s'                      next state
a                       action
a'                      predicted action from target actor
y                       target label for critic
train_phase_critic      boolean used for toggling batch normalization mean and variance

Nested inside the ‘Critic’ variable scope are a few more variable scopes. The ‘Q_online_network’ variable scope is where the main neural network is built using the state and action as inputs. The critic network needs to be built twice, once as the online network and once as the target network, so construction is handled by a class method called build_network(). Similarly, the variable scope for the target network is ‘Q_target_network’.

The build_network() function takes the states, the actions, boolean flags for making the parameters trainable and for enabling regularization, and a string that selects whether the target or the online network is built. Specifying which network is built is necessary only in the critic, because the class is written so that L2 weight decay can be switched on or off.

First, the weights are initialized as a tensorflow variable as mentioned before. If the online network is being built, an if-else statement checks whether L2 weight decay has been set. If so, tf.contrib.layers.l2_regularizer() is set as the regularizer; if not, the regularizer is set to None. A variable scope titled ‘reg’ is created to contain the weights of the online network with the regularizer. The tensorflow function get_variable() is used to create new variables with the regularizer using the initialized weights. This specific syntax was found to work in order to obtain proper seed initialization and regularization. If the target network is being built, no regularization is needed.

The layers of the neural networks are then built with the proper number of neurons. What is unique to the critic is that the action enters the network at the second hidden layer, where it is added to the layer's pre-activation. The following code snippet shows the details of the augmentation.

    with tf.variable_scope("hidden2"):
        # Add the action contribution (a * w3) to the pre-activation of hidden layer 2.
        augment = tf.matmul(hidden1, w2) + tf.matmul(a, w3) + b2
        hidden2 = tf.nn.relu(augment)
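The augmentation can be checked outside of tensorflow with a small NumPy stand-in. This is only a sketch: the function name critic_hidden2 and all layer sizes are hypothetical and chosen for illustration, not taken from the implementation.

```python
import numpy as np

def critic_hidden2(hidden1, a, w2, w3, b2):
    """Second hidden layer with the action added to the pre-activation."""
    augment = hidden1 @ w2 + a @ w3 + b2
    return np.maximum(augment, 0.0)  # ReLU

# Hypothetical sizes: batch of 4, 8 first-layer units, 2 actions, 6 second-layer units.
rng = np.random.default_rng(0)
hidden1 = rng.standard_normal((4, 8))
a = rng.standard_normal((4, 2))
w2 = rng.standard_normal((8, 6))
w3 = rng.standard_normal((2, 6))
b2 = np.zeros(6)

hidden2 = critic_hidden2(hidden1, a, w2, w3, b2)
print(hidden2.shape)  # (4, 6)
```

Summing the two matrix products is equivalent to concatenating hidden1 and a along the feature axis and multiplying by the stacked weights [w2; w3], which is one way to see why the action is said to be augmented to the layer.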

Next, the trainable parameter variables of the online and target networks are collected so that the target can be updated slowly from the online network in a tensorflow operation. Finding the trainable parameter variables was the simplest method for determining the neural network parameters from the computational graph even though the target network is not trainable. In order to use the tensorflow operation, the session needs to run the name of the tensorflow operation on the computational graph. Thus, a class method called slow_update_to_target() is used.

    self.vars_Q_online = tf.get_collection(
        tf.GraphKeys.TRAINABLE_VARIABLES,
        scope='Critic/Q_online_network')
    self.vars_Q_target = tf.get_collection(
        tf.GraphKeys.TRAINABLE_VARIABLES,
        scope='Critic/Q_target_network')

    with tf.name_scope("slow_update"):
        slow_update_ops = []
        for i, var in enumerate(self.vars_Q_target):
            # target <- tau * online + (1 - tau) * target
            slow_update_ops.append(var.assign(
                tf.multiply(self.vars_Q_online[i], self.tau) +
                tf.multiply(self.vars_Q_target[i], 1.0 - self.tau)))
        # Group the per-variable assignments into a single operation.
        self.slow_update_2_target = tf.group(*slow_update_ops,
                                             name="slow_update_2_target")

    def slow_update_to_target(self):
        self.sess.run(self.slow_update_2_target)
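The effect of repeatedly running the slow update can be illustrated without a session. The following is a minimal NumPy sketch, assuming the parameters are plain arrays; the helper name soft_update and the value of tau are hypothetical.

```python
import numpy as np

def soft_update(online_params, target_params, tau):
    """Return new target parameters: tau * online + (1 - tau) * target."""
    return [tau * w_on + (1.0 - tau) * w_tg
            for w_on, w_tg in zip(online_params, target_params)]

online = [np.ones((2, 2)), np.ones(2)]    # stand-in for the online weights
target = [np.zeros((2, 2)), np.zeros(2)]  # stand-in for the target weights
tau = 0.001                               # hypothetical value

for _ in range(1000):
    target = soft_update(online, target, tau)

print(target[1])  # each entry equals 1 - (1 - tau)**1000, roughly 0.632
```

With the online weights held fixed, the target closes only about 63% of the gap after 1/tau updates, which shows how slowly the target network tracks the online network.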

The critic loss is the mean squared temporal difference error between the prediction of the online Q network and the target label, which uses the discount factor for future reward. If regularization is used, the regularization term is added to the loss. The tensorflow API for backpropagation is set up as the training operation using the Adam optimizer, which is where the learning rate enters. In order to use the optimizer operation from the computational graph, a function named train() runs it from the session.

    with tf.name_scope("Critic_Loss"):
        # Squared TD error per sample, averaged over the batch.
        td_error = tf.square(self.y - self.Q_online)
        self.loss = tf.reduce_mean(td_error)
        if self.l2:
            reg_term = tf.losses.get_regularization_loss()
            self.loss += reg_term

        optimizer = tf.train.AdamOptimizer(self.alpha)
        self.training_op = optimizer.minimize(self.loss)

    def train(self, s_batch, a_batch, y_batch, train_phase=None):
        self.sess.run(self.training_op, feed_dict={
            self.s: s_batch, self.a: a_batch, self.y: y_batch,
            self.train_phase_critic: train_phase})
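As a quick numeric illustration of the loss, the sketch below reproduces the calculation in NumPy. The function name critic_loss, the l2_scale argument and the example values are hypothetical; the penalty is assumed to follow the convention scale * sum(w**2) / 2, which is how tf.contrib.layers.l2_regularizer() is understood to scale it.

```python
import numpy as np

def critic_loss(y, q_online, weights=(), l2_scale=0.0):
    """Mean squared TD error, plus an optional L2 weight penalty."""
    td_error = np.square(np.asarray(y) - np.asarray(q_online))
    loss = td_error.mean()
    if l2_scale > 0.0:
        # Assumed convention: scale * sum(w**2) / 2 for each weight array.
        loss += l2_scale * sum(0.5 * np.sum(np.square(w)) for w in weights)
    return loss

y = np.array([[1.0], [2.0]])         # target labels
q_online = np.array([[0.5], [2.5]])  # online critic predictions
weights = [np.array([[1.0, 2.0]])]   # stand-in for regularized weights

print(critic_loss(y, q_online))                         # 0.25
print(critic_loss(y, q_online, weights, l2_scale=0.1))  # 0.5
```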

It is worth mentioning that running a tensorflow operation automatically runs every other operation it depends on. Since the training operation requires the loss to be calculated, it also runs a prediction of the online Q network. Care simply needs to be taken that all of the necessary placeholders are fed into the session run.

The last operation to set up for the computational graph is the ability to calculate the gradients of the Q value with respect to the actions from the actor. Just as before, in order to use this operation a class method was used to run this in the session.

    with tf.name_scope("qa_grads"):
        self.qa_gradients = tf.gradients(self.Q_online, self.a)

    def get_qa_grads(self, s, a, train_phase=None):
        return self.sess.run(self.qa_gradients, feed_dict={
            self.s: s, self.a: a,
            self.train_phase_critic: train_phase})
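To make concrete what the gradient operation returns, the sketch below computes the derivative of Q with respect to the action for a tiny hand-written network and checks it against finite differences. Everything here is an assumption made for illustration: the network sizes, the function names, and the use of tanh in place of ReLU so that the finite-difference check is smooth.

```python
import numpy as np

def q_value(s, a, params):
    """Tiny two-hidden-layer critic; the action enters at the second layer."""
    w1, b1, w2, w3, b2, w_out = params
    hidden1 = np.tanh(s @ w1 + b1)
    hidden2 = np.tanh(hidden1 @ w2 + a @ w3 + b2)
    return hidden2 @ w_out  # shape (batch, 1)

def qa_grads(s, a, params):
    """Analytic dQ/da via the chain rule, one row per sample."""
    w1, b1, w2, w3, b2, w_out = params
    hidden1 = np.tanh(s @ w1 + b1)
    hidden2 = np.tanh(hidden1 @ w2 + a @ w3 + b2)
    return ((1.0 - hidden2 ** 2) * w_out.T) @ w3.T  # shape (batch, n_actions)

def qa_grads_numeric(s, a, params, eps=1e-5):
    """Central finite differences, perturbing one action dimension at a time."""
    grads = np.zeros_like(a)
    for k in range(a.shape[1]):
        step = np.zeros_like(a)
        step[:, k] = eps
        diff = q_value(s, a + step, params) - q_value(s, a - step, params)
        grads[:, k] = diff[:, 0] / (2.0 * eps)
    return grads

rng = np.random.default_rng(1)
n_s, n_a, n_h1, n_h2, batch = 3, 2, 5, 4, 6  # hypothetical sizes
params = (rng.standard_normal((n_s, n_h1)), np.zeros(n_h1),
          rng.standard_normal((n_h1, n_h2)), rng.standard_normal((n_a, n_h2)),
          np.zeros(n_h2), rng.standard_normal((n_h2, 1)))
s = rng.standard_normal((batch, n_s))
a = rng.standard_normal((batch, n_a))

analytic = qa_grads(s, a, params)
numeric = qa_grads_numeric(s, a, params)
print(analytic.shape, np.allclose(analytic, numeric, atol=1e-6))
```

The result has one row per sample and one column per action dimension, which is the shape the actor needs when the gradient is passed back through its own network.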

The critic’s computational graph has now been created. The variable scopes that were discussed can be visualized in the tensorboard visualization. The scopes have been expanded to display what has been created in Figure 6.18.

Figure 6.18: The visualization of the critic’s portion of the computational graph shows how tensors are passed around for tensorflow operations.

The DDPG class calculates the target labels, y, for updating the critic network because it also depends on the target network of the actor. In order for the DDPG class to calculate the target labels, it uses the critic class to predict the Q values for a batch. Both the online and target critic networks are needed in this process, so class methods were created for them to run the neural networks in the session.

    def predict_online_batch(self, s, a, train_phase=None):
        return self.sess.run(self.Q_online, feed_dict={
            self.s: s, self.a: a,
            self.train_phase_critic: train_phase})

    def predict_target_batch(self, s_, a_, train_phase=None):
        return self.sess.run(self.Q_target, feed_dict={
            self.s_: s_, self.a_: a_,
            self.train_phase_critic: train_phase})

The prediction operations run from the computational graph, so each one pulls in all of the operations preceding the output of the neural network that are needed to compute a forward pass. The online prediction uses the state and action, whereas the target prediction uses the next state and the action prediction from the target actor. The full implementation of the critic class can be found in Appendix C.
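For context, the target labels that the DDPG class builds from the target prediction can be sketched as follows. This assumes the standard one-step form y = r + gamma * Q_target(s', a'); the function name td_targets and the optional done mask for terminal transitions are hypothetical and not described in this section.

```python
import numpy as np

def td_targets(rewards, q_target_next, gamma, done=None):
    """One-step targets y = r + gamma * Q_target(s', a') as a column vector."""
    rewards = np.asarray(rewards, dtype=float).reshape(-1, 1)
    q_target_next = np.asarray(q_target_next, dtype=float).reshape(-1, 1)
    if done is None:
        return rewards + gamma * q_target_next
    # Assumption: terminal transitions carry no future reward.
    not_done = 1.0 - np.asarray(done, dtype=float).reshape(-1, 1)
    return rewards + gamma * not_done * q_target_next

rewards = [1.0, 0.0]
q_next = [[2.0], [4.0]]  # stand-in for the output of predict_target_batch()
y = td_targets(rewards, q_next, gamma=0.9, done=[0, 1])
print(y.ravel())  # approximately [2.8, 0.0]
```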
