Neural networks for machine learning are universal approximators that, with a hidden layer of sufficient size, can learn to approximate any continuous function. They are inspired by biological neural networks in the brain, composed of many interconnected neuron cells that carry electrical impulses across the nervous system. A neuron receives impulses from connected neurons and if the strength of those impulses is above a threshold it is ’activated’ and itself fires an output impulse. An individual neuron is modelled mathematically as a simple module that outputs an activation value based on a weighted sum of input values it receives. Fig. 4.1 shows this structure.
A variety of non-linear functions are used for neuron activation, such as the sigmoid function:
$$f(x) = \frac{1}{1 + e^{-x}} \tag{4.1}$$
30 Learning a Human Pose Descriptor
Fig. 4.1 Neuron model. Outputs $x_i$ from the preceding neurons are combined in a weighted sum with weights $w_i$, and a bias value $b$ is added. This is passed to an activation function to generate the neuron output.
Fig. 4.2 Neural network structure. Neurons are organised into layers forming an acyclic graph. These layers are fully connected such that a neuron is connected to every neuron in neighbouring layers, but not connected within a layer.
And the rectified linear unit (ReLU), which sets any negative number to zero:
$$f(x) = \max(0, x) \tag{4.2}$$
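The neuron model and the two activation functions can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the sigmoid is assumed to take the standard logistic form $1 / (1 + e^{-x})$, and the function names and numeric values are hypothetical.

```python
import math


def sigmoid(x):
    """Sigmoid activation: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Rectified linear unit: negative inputs become zero."""
    return max(0.0, x)


def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)


# Illustrative values only (not taken from the text); the weighted sum is 0.0.
inputs = [0.5, -1.0]
weights = [2.0, 1.0]
bias = 0.0
out_sigmoid = neuron_output(inputs, weights, bias, sigmoid)
out_relu = neuron_output(inputs, weights, bias, relu)
```

The same `neuron_output` function serves both activations, which mirrors the model in Fig. 4.1: only the final non-linearity changes.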
A neural network organises a number of neurons in an acyclic graph of connected layers, as depicted in Fig. 4.2.
4.1 Introduction 31
The size of the input layer is dictated by the dimensions of the input data. If, say, the network is being used to approximate a multivariable function with two input variables, the input layer would be of size two, as in the figure. Accordingly, the dimensionality of the output layer depends on the size of the desired output, e.g. one neuron for a single-valued function or, for a classification task, multiple neurons each representing an individual class probability.
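The layered structure can be illustrated with a small forward pass. The sketch below assumes a hypothetical network with two inputs, three hidden neurons and a single linear output neuron; the weights are arbitrary and all names are chosen for illustration only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def identity(x):
    """Linear output activation (an assumption made for this sketch)."""
    return x


def layer_forward(inputs, weights, biases, activation):
    """Fully connected layer: each row of `weights` belongs to one neuron."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


def network_forward(inputs, layers):
    """Pass the input through each layer in turn; no connection loops back."""
    activations = inputs
    for weights, biases, activation in layers:
        activations = layer_forward(activations, weights, biases, activation)
    return activations


# Hypothetical 2-3-1 network: two inputs, three hidden neurons, one output.
layers = [
    ([[0.1, 0.2], [-0.3, 0.4], [0.5, -0.6]], [0.0, 0.0, 0.0], sigmoid),
    ([[1.0, 1.0, 1.0]], [0.0], identity),
]
output = network_forward([0.0, 0.0], layers)
```

The length of the input list matches the size of the input layer, and the length of `output` matches the number of neurons in the final layer.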
Training a neural network means optimising the parameters of all the neurons in the network (their weights and biases) so that the final output best approximates the desired function. This is done by iteratively passing labelled data through the network, calculating the loss (a measure of how far the output is from the desired result), and incrementally updating the parameters to reduce the loss and improve the result.
The most common classification task is to correctly assign a single class to an input from a fixed set of $K$ distinct classes. These networks typically use a Softmax classifier with a cross-entropy loss function. In this case, the output of the network for the $i$th input sample $x_i$, denoted $y_i = f(x_i)$, is a vector of neuron outputs from the output layer, $y_i \in \mathbb{R}^K$, interpreted as the unnormalised log probabilities of the input belonging to each of the $K$ classes. The loss function takes the form:
$$L_i = -\log\left(\frac{e^{y_{ij}}}{\sum_{k=1}^{K} e^{y_{ik}}}\right) \tag{4.3}$$
where $j$ is the index in $y_i$ of the correct class, according to the data label. This function calculates the normalised probability of correct classification of input $x_i$ and penalises deviation from a perfect score, i.e. a probability of one for the target class and zero for all others.
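The Softmax cross-entropy loss for a single sample can be sketched as follows. The scores are hypothetical, and subtracting the maximum score before exponentiating is a common numerical safeguard added here as an assumption; it is not part of the formula itself and does not change the result.

```python
import math


def softmax(scores):
    """Turn unnormalised log probabilities into probabilities that sum to one."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_loss(scores, correct_class):
    """Negative log of the Softmax probability assigned to the correct class."""
    return -math.log(softmax(scores)[correct_class])


# Hypothetical network output for one sample with K = 3 classes.
scores = [2.0, 1.0, 0.1]
loss_if_class_0 = cross_entropy_loss(scores, 0)  # label agrees with top score
loss_if_class_2 = cross_entropy_loss(scores, 2)  # label is the lowest score
```

The loss is small when the labelled class already receives the highest score and grows as the probability assigned to that class falls.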
For regression tasks, where the objective is to output a vector of real numbers, the data label $z_i \in \mathbb{R}^K$ matches the form of the output $y_i$, and the L2 loss function, equivalent to the squared Euclidean distance between them, is commonly used:
$$L_i = \sum_{k=1}^{K} (z_{ik} - y_{ik})^2 \tag{4.4}$$
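As a short worked example, the L2 loss for one sample can be computed as below. The label and output vectors are hypothetical values chosen so that the result is easy to verify by hand.

```python
def l2_loss(label, output):
    """Squared Euclidean distance between the label and the network output."""
    if len(label) != len(output):
        raise ValueError("label and output must have the same length")
    return sum((z - y) ** 2 for z, y in zip(label, output))


# Differences are 0, 2 and -2, so the loss is 0 + 4 + 4.
loss = l2_loss([1.0, 2.0, 3.0], [1.0, 0.0, 5.0])
```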
The networks are optimised by variants of gradient descent to minimise the loss. Gradients are calculated by the process of backpropagation, which leverages the chain rule of differentiation. The network as a whole can be expressed as a series of nested differentiable functions of the form $y = f(g(x))$. The chain rule states that, if we substitute $u = g(x)$, the derivative of $y$ with respect to $x$ can be calculated by the following product:
$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} \tag{4.5}$$
This rule can be applied at every stage of the network to calculate the partial derivative of the loss with respect to each of the parameters of the network, and hence the direction and magnitude for their update for the next training iteration.
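To make the chain rule concrete, the sketch below backpropagates through a single sigmoid neuron with a squared-error loss and compares the result with a finite-difference estimate. The choice of neuron, loss and numeric values is an assumption made for illustration; a full network repeats the same multiplication of local derivatives layer by layer.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def loss_fn(w, b, x, z):
    """Squared error of a single sigmoid neuron: (z - sigmoid(w*x + b))^2."""
    return (z - sigmoid(w * x + b)) ** 2


def gradients(w, b, x, z):
    """Backpropagate through the nested functions loss(y(u(w, b)))."""
    u = w * x + b                # forward pass: weighted sum
    y = sigmoid(u)               # forward pass: activation
    dloss_dy = -2.0 * (z - y)    # derivative of the squared error
    dy_du = y * (1.0 - y)        # derivative of the sigmoid
    dloss_du = dloss_dy * dy_du  # chain rule, as in (4.5)
    return dloss_du * x, dloss_du  # du/dw = x and du/db = 1


# Hypothetical parameter, input and label values.
w, b, x, z = 0.5, -0.2, 1.5, 1.0
grad_w, grad_b = gradients(w, b, x, z)

# Central finite difference as an independent check on grad_w.
eps = 1e-6
numeric_grad_w = (loss_fn(w + eps, b, x, z) - loss_fn(w - eps, b, x, z)) / (2 * eps)
```

Agreement between `grad_w` and `numeric_grad_w` is a common sanity check when implementing backpropagation by hand.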
The simplest optimisation strategy is standard gradient descent, which shifts each network parameter along the negative gradient direction, scaled by the learning rate γ, a global network hyperparameter. For example, for parameter w at training iteration i:
$$w_{i+1} = w_i - \gamma \nabla w_i \tag{4.6}$$
where $\nabla w_i = \partial y_i / \partial w_i$ is the partial derivative of the loss, here denoted $y$, with respect to parameter $w$. Stochastic gradient descent, often preferred in practice, follows this form but performs each weight update over a small subset of training samples, calculating the gradient from the sum of the losses of those samples. The batch size of this subset becomes another network hyperparameter.
More sophisticated variants incorporate a concept of momentum into the weight update which can help speed up optimisation convergence. A ’velocity’ factor v, initialised to zero, is added to the update calculation:
$$v_{i+1} = \mu v_i - \gamma \nabla w_i \tag{4.7}$$
adding another hyperparameter, the momentum µ. The updated velocity is then used for the parameter update:
$$w_{i+1} = w_i + v_{i+1} \tag{4.8}$$
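The two momentum equations translate directly into code. The sketch below applies them to a hypothetical one-dimensional loss $(w - 3)^2$; the loss, the hyperparameter values and the function names are assumptions made for illustration.

```python
def momentum_step(w, v, grad, learning_rate, momentum):
    """One update following (4.7) and (4.8)."""
    v_next = momentum * v - learning_rate * grad  # velocity update (4.7)
    w_next = w + v_next                           # parameter update (4.8)
    return w_next, v_next


def grad_fn(w):
    """Gradient of the hypothetical loss (w - 3)^2."""
    return 2.0 * (w - 3.0)


w, v = 0.0, 0.0  # the velocity is initialised to zero
for _ in range(500):
    w, v = momentum_step(w, v, grad_fn(w), learning_rate=0.1, momentum=0.9)
```

With a momentum of zero the update reduces to plain gradient descent, since the velocity is then just the scaled negative gradient.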
A popular variant, Nesterov momentum, uses the current velocity to approximate the next position of the parameters and calculates the gradient with respect to these values before updating the velocity.
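One common way of writing the Nesterov update is sketched below. It assumes the formulation in which the gradient is evaluated at the look-ahead point $w + \mu v$; other equivalent rearrangements exist, and the toy loss $(w - 3)^2$ and hyperparameter values are hypothetical.

```python
def nesterov_step(w, v, grad_fn, learning_rate, momentum):
    """One Nesterov momentum update using a look-ahead gradient."""
    lookahead = w + momentum * v  # approximate next position of the parameter
    v_next = momentum * v - learning_rate * grad_fn(lookahead)
    w_next = w + v_next
    return w_next, v_next


def grad_fn(w):
    """Gradient of the hypothetical loss (w - 3)^2."""
    return 2.0 * (w - 3.0)


w, v = 0.0, 0.0
for _ in range(200):
    w, v = nesterov_step(w, v, grad_fn, learning_rate=0.1, momentum=0.9)
```

The only difference from standard momentum is where the gradient is evaluated, which is why the step takes the gradient function rather than a precomputed gradient.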
There is a further family of optimisers that use per-parameter, adaptive learning rates, as opposed to a single global value. Examples include Adagrad, RMSprop and Adam. All of these tune learning rates automatically as training progresses, removing a degree of user-dependent hyperparameter tuning.
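As an illustration of a per-parameter learning rate, the sketch below implements the basic Adagrad rule, in which each parameter's step is divided by the square root of its accumulated squared gradients. The small constant `eps`, the badly scaled toy gradient and all names are assumptions made for this example; RMSprop and Adam modify how the accumulation is performed and are not shown.

```python
import math


def adagrad_step(w, grad_sq_sum, grad, learning_rate, eps=1e-8):
    """One basic Adagrad update over a list of parameters."""
    # Accumulate the squared gradient separately for every parameter.
    grad_sq_sum = [s + g * g for s, g in zip(grad_sq_sum, grad)]
    # Scale each step by that parameter's own gradient history.
    w = [
        wi - learning_rate * g / (math.sqrt(s) + eps)
        for wi, g, s in zip(w, grad, grad_sq_sum)
    ]
    return w, grad_sq_sum


def grad_fn(w):
    """Gradient of the hypothetical loss 100*(w0 - 1)^2 + 0.01*(w1 - 1)^2."""
    return [200.0 * (w[0] - 1.0), 0.02 * (w[1] - 1.0)]


w = [0.0, 0.0]
grad_sq_sum = [0.0, 0.0]
w, grad_sq_sum = adagrad_step(w, grad_sq_sum, grad_fn(w), learning_rate=0.5)
```

Although the two gradient components differ by four orders of magnitude, the first step moves both parameters by almost the same amount, which is the effect a single global learning rate cannot achieve.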