
Instead of following a set of rules specified by a third party, the network teaches itself the necessary rules from the data set. In the field of neural networks, learning is understood as choosing the right weights and network topology so that the network performs the desired task efficiently.

Learning is performed by iteratively presenting the training data set and adjusting the network parameters, thus ideally minimizing the network’s error. The goal of the learning process is to approximate the function that best describes the desired behavior, i.e. that provides the best outputs for the given inputs.

A notable drawback of ANNs is their black-box nature, which makes the resulting model very hard to interpret and understand. The training phase can also be computationally heavy.

3.4.1 Initialization of the Training Procedure

Before the network starts learning anything, it needs to be initialized. The initialization consists of setting all the weights and biases. The key is to start with different weights, because neurons with the same incoming and outgoing weights would receive the same gradient during training and could therefore never become different from one another. The most basic form of initialization is setting the parameters to random values. One possibility is to draw the weights from a Gaussian distribution with a mean of 0 and a standard deviation of 1. Bengio [33], however, proposes to scale the initial weights according to the square root of the neuron’s fan-in and fan-out (the indegree and outdegree of the node). As described in section 1.2.5, it is better to keep the weights as small as possible.
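The two initialization options mentioned above can be sketched as follows. This is a minimal illustration, not the procedure prescribed in [33]: the function names are hypothetical, and the scaled variant assumes the widely used uniform range with r = sqrt(6 / (fan_in + fan_out)), which may differ from the exact formula in the cited work. Starting the biases at zero is likewise an assumption.

```python
import math
import random

def gaussian_init(fan_in, fan_out, rng, std=1.0):
    """Weights drawn from N(0, std^2); one row per neuron, one column per input."""
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

def fan_scaled_init(fan_in, fan_out, rng):
    """Uniform weights in [-r, r] with r = sqrt(6 / (fan_in + fan_out)).

    Assumption: this particular scaling is used only for illustration;
    the formula recommended in [33] may differ.
    """
    r = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-r, r) for _ in range(fan_in)] for _ in range(fan_out)]

rng = random.Random(0)                  # fixed seed for reproducibility
W_gauss = gaussian_init(4, 3, rng)      # layer with 4 inputs and 3 neurons
W_scaled = fan_scaled_init(4, 3, rng)
biases = [0.0] * 3                      # assumption: biases start at zero
```

Because every weight is drawn independently, no two neurons start with identical incoming weights, which is the property the paragraph above asks for.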

The number of layers and the number of neurons in each layer are important; however, only general empirical observations exist, and choosing these parameters usually relies on the experience of a good neural network architect. In simplified terms, too few neurons in the layers lead to insufficient training, whereas too many neurons lead to a loss of generalization capability.

3.4.2 Backpropagation Algorithm

A change in the activation of one neuron influences many other activations in the network, and these effects must be combined. Backpropagation does so by following a concept similar to the network’s forward pass.

The forward pass takes an input and computes the activations of the individual neurons layer by layer until it reaches the last layer and produces the network’s output. Backpropagation computes the gradients of the weights for the last layer and propagates them back through the network using the same weights as the forward pass. For each node i in layer l, one would like to calculate an error term $\delta_i^{(l)}$ that measures how much that node was responsible for any errors in the output.

The backpropagation algorithm is derived using the chain rule of differentiation and the gradient descent technique from section 1.2.3. The algorithm is described in detail in Algorithm 3.1, the core of which is taken from [34].

One also has the following possibilities regarding how often to update the network’s parameters [15]:

• Online update: weights are updated after each training case.

• Full-batch update: weights are updated only after every epoch.

• Mini-batch update: weights are updated after a small sample of training cases.

A forward and backward pass in which the neural network is fed the entire training set is called an epoch. A forward and backward pass in which the network is fed a single batch is known as a cycle or an iteration. The difference between the two terms shows in the update policies: with the full-batch update, for example, the iteration and epoch counts are identical.
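The relation between epochs and iterations under the three update policies can be illustrated with a short sketch. The helper names and the data set size of 1000 are hypothetical and chosen only for the example.

```python
import math

def iterations_per_epoch(n_samples, batch_size):
    """Number of parameter updates (iterations) in one pass over the data."""
    return math.ceil(n_samples / batch_size)

def minibatches(data, batch_size):
    """Yield consecutive slices of `data`; the last slice may be smaller."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

n = 1000                                   # hypothetical training set size
online = iterations_per_epoch(n, 1)        # one update per training case
full_batch = iterations_per_epoch(n, n)    # one update per epoch
mini_batch = iterations_per_epoch(n, 32)   # ceil(1000 / 32) updates per epoch
```

With the full-batch policy the value is 1, so counting iterations and counting epochs gives the same number, as stated above.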

3.4.3 Dropout

Dropout is a recent invention in the field of neural networks [35]. It can be seen as a type of regularization, together with techniques like L1 and L2 regularization, and constraining the maximum value of weights. It tries to address the problem of overfitting.

The name dropout comes from the idea of dropping out some unit activations in a layer during training. Dropping out means setting them to zero. This can be seen as sampling a neural ’subnetwork’ from the full neural network and only updating the parameters of the sampled network for the current training input.
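A minimal sketch of dropping out activations is shown below. It assumes the common "inverted" variant, which rescales the surviving activations by 1 / (1 - drop_prob) so that nothing needs to change at test time; the text above does not prescribe this detail, and the function name and parameters are hypothetical.

```python
import random

def dropout(activations, drop_prob, rng, training=True):
    """Zero each activation with probability `drop_prob` during training.

    Assumption: inverted dropout, i.e. surviving units are divided by the
    keep probability so the expected activation stays unchanged.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must lie in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    # A fresh mask is sampled on every call: a new 'subnetwork' each time.
    mask = [1.0 if rng.random() < keep_prob else 0.0 for _ in activations]
    return [a * m / keep_prob for a, m in zip(activations, mask)]

rng = random.Random(0)
hidden = [0.5, 1.2, 0.3, 0.9, 0.7, 0.1]
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
unchanged = dropout(hidden, drop_prob=0.5, rng=rng, training=False)
```

Units whose activation is zeroed contribute nothing to the forward pass and receive no gradient through that activation, which is what restricts the update to the sampled subnetwork.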

3.4.4 Momentum

The adaptation of weights can be determined not only by the current training examples but also, with less intensity, by the previous examples; this is called momentum. Momentum is added to the update equations to reduce the risk of getting stuck in a local minimum [33].
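One common way to express this idea is the classical momentum update, sketched below. The formulation (a velocity that decays by a factor `momentum` and accumulates the scaled negative gradient) is an assumption for illustration; [33] may use a different variant. The quadratic objective f(w) = w^2 is a hypothetical toy problem.

```python
def momentum_step(weights, grads, velocity, learning_rate=0.1, momentum=0.9):
    """Classical momentum: v <- momentum * v - learning_rate * g, w <- w + v."""
    new_velocity = [momentum * v - learning_rate * g
                    for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

# Single step: the velocity starts at zero, so only the gradient contributes.
w_one, v_one = momentum_step([1.0], [2.0], [0.0])

# Toy problem: minimize f(w) = w^2, whose gradient is 2 * w.
w, v = [5.0], [0.0]
for _ in range(300):
    w, v = momentum_step(w, [2.0 * w[0]], v)
```

The velocity carries a decaying sum of earlier gradients, so previous examples keep influencing the current update with decreasing intensity.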

Algorithm 3.1: Learning of the neural network with the online backpropagation algorithm.

1 Choose the topology of the neural network

2 Perform initialization of the network’s parameters

3 while stopping condition is not met do

1. Perform a feedforward pass, computing the activations for all layers.

2. For each output unit i in layer $n_l$ (the output layer), set

$$\delta_i^{(n_l)} = \frac{\partial}{\partial z_i^{(n_l)}} \frac{1}{2} \lVert y - h_{W,b}(x) \rVert^2 = -(y_i - a_i^{(n_l)}) \cdot f'(z_i^{(n_l)}) \tag{3.15}$$

where $\delta_i^{(n_l)}$ is the error term, $z_i^{(n_l)}$ is the value of neuron i in layer $n_l$ after the basis function is applied, $y$ is the ground truth, $h_{W,b}(x)$ is the calculated output value of the network, $a_i^{(n_l)}$ is the neuron’s output value after the activation function has been applied, and $f$ is the activation function. It is therefore necessary that the activation function has a derivative.

3. For $l = n_l - 1, n_l - 2, n_l - 3, \ldots, 2$: for each node i in layer l, set

$$\delta_i^{(l)} = \left( \sum_{j=1}^{s_{l+1}} W_{ji}^{(l)} \delta_j^{(l+1)} \right) \cdot f'(z_i^{(l)}) \tag{3.16}$$

where $W^{(l)}$ is the matrix of weights and $s_{l+1}$ is the number of nodes in layer $l+1$.

4. Compute the desired partial derivatives, which are given as:

$$\frac{\partial}{\partial W_{ij}^{(l)}} J(W, b; x, y) = a_j^{(l)} \delta_i^{(l+1)} \tag{3.17}$$

$$\frac{\partial}{\partial b_i^{(l)}} J(W, b; x, y) = \delta_i^{(l+1)} \tag{3.18}$$

where $J(W, b; x, y)$ is the mean squared error with respect to a single example.

5. Update the weights as follows:

$$W_{ij}^{(l)} = W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} J(W, b; x, y) \tag{3.19}$$

$$b_i^{(l)} = b_i^{(l)} - \alpha \frac{\partial}{\partial b_i^{(l)}} J(W, b; x, y) \tag{3.20}$$

where $\alpha$ is the learning rate.
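The five steps of Algorithm 3.1 can be sketched in plain Python for a small fully connected network. The sketch makes several assumptions that the algorithm leaves open: the activation function is the logistic sigmoid (so f'(z) = a(1 - a)), the topology 2-3-1 and the single training example are hypothetical, and the stopping condition is simply a fixed number of updates. Layers are indexed from 0 here, whereas the algorithm counts from 1.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W, b, x):
    """Return the activations of every layer; acts[0] is the input itself."""
    acts = [x]
    for Wl, bl in zip(W, b):
        prev = acts[-1]
        z = [sum(w * a for w, a in zip(row, prev)) + bi
             for row, bi in zip(Wl, bl)]
        acts.append([sigmoid(v) for v in z])
    return acts

def online_step(W, b, x, y, alpha):
    """One online update (steps 1 to 5) for a single example, in place."""
    acts = forward(W, b, x)                                   # step 1
    # Step 2: output error term, equation 3.15 with f'(z) = a * (1 - a).
    delta = [-(yi - ai) * ai * (1.0 - ai) for yi, ai in zip(y, acts[-1])]
    for l in range(len(W) - 1, -1, -1):
        a_in = acts[l]
        if l > 0:
            # Step 3: error term of the layer below, using the old weights.
            below = [sum(W[l][j][i] * delta[j] for j in range(len(delta)))
                     * a_in[i] * (1.0 - a_in[i]) for i in range(len(a_in))]
        # Steps 4 and 5: gradient a_j * delta_i, then the descent update.
        for i, d in enumerate(delta):
            for j, a in enumerate(a_in):
                W[l][i][j] -= alpha * a * d
            b[l][i] -= alpha * d
        if l > 0:
            delta = below

def loss(W, b, x, y):
    out = forward(W, b, x)[-1]
    return 0.5 * sum((yi - oi) ** 2 for yi, oi in zip(y, out))

rng = random.Random(0)
sizes = [2, 3, 1]  # hypothetical topology: 2 inputs, 3 hidden units, 1 output
W = [[[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
     for n_in, n_out in zip(sizes[:-1], sizes[1:])]
b = [[0.0] * n_out for n_out in sizes[1:]]
x, y = [0.5, -0.2], [0.8]          # hypothetical training example

loss_before = loss(W, b, x, y)
for _ in range(500):               # stopping condition: fixed update count
    online_step(W, b, x, y, alpha=0.5)
loss_after = loss(W, b, x, y)
```

The error term of the lower layer is computed before the weights of the current layer are changed, so the backward pass uses the same weights as the forward pass.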

Chapter 4

Neural Network Based Anomaly Detection Methods

Anomaly detection can be done through many different techniques. The main purpose of this thesis, however, is to research only a small subset of these possibilities, namely those based on neural network models. The rich field of neural networks has resulted in an equally vast number of neural network based anomaly detection methods. It is, therefore, natural that each of the numerous variations of the outlier detection problem can be addressed well by only certain types of neural networks. Most neural network approaches work in either semi-supervised or supervised mode [1]. In this chapter, the architecture of only several types of neural networks is described. The chapter also includes examples of how the particular networks are used in novelty detection.

Unfortunately, there has been little effort to survey and summarize the particular field of neural network based anomaly detection methods. Most summarizing has therefore been done in the general anomaly detection surveys [1, 19] from 2007 and [18] from 2014. The last and only specialized review of neural network based techniques was published in 2003 [36]. It is important to note that this review was conducted before the renaissance of neural networks in the last decade. Thus the general reviews may serve as an equally important source, if not a better one.

4.1 Restricted Boltzmann Machines

Restricted Boltzmann machines (RBM), as the name implies, are derived from Boltzmann machines (BM). The name of this model stems from statistical mechanics, since the probability of a particular state in thermal equilibrium follows the Boltzmann distribution [20]. The learning process of a BM is computationally demanding, which is why RBMs are often used instead. The RBM is also used as a building block of deep belief networks (section 4.2). Subsections 4.1.1 and 4.1.2 are both paraphrased from [37].

4.1.1 Topology and Operating Principles

The aforementioned restriction appears in the topology of the network. Boltzmann machines are symmetrical networks of stochastically working units which can be interpreted as a neural network. A BM is fully connected, and no restriction is imposed on the chromatic number of the network. An RBM is arranged in two layers and must be a bipartite undirected graph (the difference in topology can be seen in Figure 4.1). The first layer has visible neurons and is the observational part of the network. The second layer contains the hidden neurons and models the relations between the particular features. Each neuron is connected with all the neurons of the other layer. The network is fully defined by a matrix of weights associated with the connections between hidden and visible units, as well as the biases for the visible and hidden units. The standard type of RBM has binary-valued (boolean/Bernoulli) hidden and visible units; thus the name Bernoulli-Bernoulli restricted Boltzmann machine is sometimes used. There also exists the Gaussian-Bernoulli restricted Boltzmann machine, which uses real-valued Gaussian units in the visible input layer. Naturally, the use of the Gaussian-Bernoulli RBM is advisable when the input data is real-valued. Its hidden layer is the same as in the Bernoulli-Bernoulli RBM, as it uses binary stochastic neurons. For simplicity, the formulae described in this section assume the Bernoulli-Bernoulli RBM. For more information on the Gaussian-Bernoulli RBM, please see [37].

Figure 4.1: Comparison of (a) a classic Boltzmann machine and (b) its restricted version. Source: [3].

The RBM also belongs to the family of energy-based models. Energy-based models associate an energy with each configuration of the variables. Learning then modifies the energy function so that its shape has desirable properties; in particular, plausible configurations should have low energy. The plausibility is derived from the occurrence in the training data. Energy-based models define a probability distribution over the hidden vector h and the visible vector v (in this case, both boolean vectors) through an energy function [37]:

$$P(v, h) = \frac{1}{Z} e^{-E(v,h)} \tag{4.1}$$

where Z is a partition function defined as:

$$Z = \sum_{v,h} e^{-E(v,h)} \tag{4.2}$$

and E(v, h) is the energy of a configuration, defined as:

$$E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_i \sum_j v_i w_{i,j} h_j \tag{4.3}$$

Similarly, the marginal probability of visible boolean units over all possible hidden layer configurations is defined in the following manner:

$$P(v) = \frac{1}{Z} \sum_h e^{-E(v,h)} \tag{4.4}$$
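For a very small RBM, equations 4.1 to 4.4 can be evaluated by brute force, which is a convenient sanity check that the probabilities sum to one. The sketch below enumerates all binary configurations; the function names and the parameter values (two visible and two hidden units) are hypothetical. The enumeration grows exponentially with the number of units, so this is only feasible for toy models.

```python
import itertools
import math

def energy(v, h, a, b, W):
    """E(v, h) from equation 4.3; W[i][j] couples visible i and hidden j."""
    visible_term = sum(ai * vi for ai, vi in zip(a, v))
    hidden_term = sum(bj * hj for bj, hj in zip(b, h))
    pair_term = sum(v[i] * W[i][j] * h[j]
                    for i in range(len(v)) for j in range(len(h)))
    return -visible_term - hidden_term - pair_term

def partition_function(a, b, W):
    """Z from equation 4.2, summed over every binary (v, h) configuration."""
    return sum(math.exp(-energy(v, h, a, b, W))
               for v in itertools.product([0, 1], repeat=len(a))
               for h in itertools.product([0, 1], repeat=len(b)))

def joint_probability(v, h, a, b, W):
    """P(v, h) from equation 4.1."""
    return math.exp(-energy(v, h, a, b, W)) / partition_function(a, b, W)

def marginal_visible(v, a, b, W):
    """P(v) from equation 4.4, summing over all hidden configurations."""
    z = partition_function(a, b, W)
    return sum(math.exp(-energy(v, h, a, b, W))
               for h in itertools.product([0, 1], repeat=len(b))) / z

# Hypothetical parameters of a toy RBM.
a = [0.1, -0.2]                  # visible biases
b = [0.0, 0.3]                   # hidden biases
W = [[0.5, -0.4], [0.2, 0.1]]    # weights w_{i,j}

p_v = {v: marginal_visible(v, a, b, W)
       for v in itertools.product([0, 1], repeat=len(a))}
```

Configurations with lower energy receive a larger share of the probability mass, which is the property that learning exploits.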

Because of the RBM’s topology, the visible unit activations are mutually independent given the hidden unit activations, and conversely the hidden unit activations are mutually independent given the visible unit activations. This permits the following formulation of the conditional probabilities [37]:

$$P(v|h) = \prod_{i=1}^{m} P(v_i|h) \tag{4.5}$$

and

$$P(h|v) = \prod_{j=1}^{n} P(h_j|v) \tag{4.6}$$

where the individual activation probabilities are defined as:

$$P(h_j = 1|v) = \sigma\left( b_j + \sum_{i=1}^{m} w_{i,j} v_i \right) \tag{4.7}$$

and

$$P(v_i = 1|h) = \sigma\left( a_i + \sum_{j=1}^{n} w_{i,j} h_j \right) \tag{4.8}$$

where σ denotes the logistic function. The probability of a unit turning on is determined by the weighted input from the other units (plus a bias).
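The activation probabilities of equations 4.7 and 4.8 translate directly into code. The sketch below is an illustration with hypothetical function names and toy parameters; `W[i][j]` is assumed to hold the weight between visible unit i and hidden unit j.

```python
import math

def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def p_hidden_given_visible(v, b, W):
    """P(h_j = 1 | v) for every hidden unit j, equation 4.7."""
    return [sigmoid(b[j] + sum(W[i][j] * v[i] for i in range(len(v))))
            for j in range(len(b))]

def p_visible_given_hidden(h, a, W):
    """P(v_i = 1 | h) for every visible unit i, equation 4.8."""
    return [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]

# Hypothetical parameters of a toy RBM with two visible and two hidden units.
a = [0.1, -0.2]
b = [0.0, 0.3]
W = [[0.5, -0.4], [0.2, 0.1]]

p_h = p_hidden_given_visible([0, 0], b, W)   # only the biases contribute
p_v = p_visible_given_hidden([1, 1], a, W)
```

Because the units within one layer are conditionally independent, each probability can be computed on its own, and a whole layer can be sampled in a single step.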

4.1.2 Training

An RBM can be used to learn significant aspects of an unknown probability distribution based on provided training samples. Learning adjusts the parameters so that the probability distribution represented by the network corresponds to the training data and so that the arrangement expresses the relations between input features well. After successful learning, the RBM provides a finite representation of the distribution of the observations.

The training algorithm most often used is called contrastive divergence (CD). The algorithm performs Gibbs sampling, which is combined with gradient descent optimization methods (section 1.2.3) to determine the weight updates. One also needs to decide how many Gibbs sampling iterations are performed for one training data point; the k in CD-k denotes this number.

The CD-1 learning step for one sample can be summarized as follows [37]:

1. Initialize the visible units to a training sample v.

2. Compute the probabilities of the hidden units according to equation 4.6 and sample a hidden activation vector h.

3. Calculate the outer product of v and h and call this the positive gradient.

4. Compute the probabilities of the visible units according to equation 4.5 and sample a reconstruction v' of the visible units.

5. Resample the hidden activations h' based on v' (Gibbs sampling step).

6. Calculate the outer product of v' and h' and call this the negative gradient.

7. Update the weight matrix W with the difference between the positive and negative gradients, which raises the probability of the training sample:

$$W = W + \alpha (v h^T - v' h'^T) \tag{4.9}$$

8. Update the biases a and b analogously:

$$a = a + \alpha (v - v') \tag{4.10}$$

and

$$b = b + \alpha (h - h') \tag{4.11}$$
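The eight steps above can be sketched for a Bernoulli-Bernoulli RBM as follows. This is an illustrative sketch rather than the implementation from [37]: the function names, the layer sizes, the learning rate and the training pattern are hypothetical. Note the sign: the sketch adds α times the difference between the positive and negative gradients, which is the usual form of the CD update and increases the probability of the training sample.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample(probs, rng):
    """Draw a binary vector with independent activation probabilities."""
    return [1 if rng.random() < p else 0 for p in probs]

def hidden_probs(v, b, W):
    return [sigmoid(b[j] + sum(W[i][j] * v[i] for i in range(len(v))))
            for j in range(len(b))]

def visible_probs(h, a, W):
    return [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]

def cd1_step(v, a, b, W, alpha, rng):
    """One CD-1 update for a single training sample v; parameters change in place."""
    h = sample(hidden_probs(v, b, W), rng)              # step 2
    v_rec = sample(visible_probs(h, a, W), rng)         # step 4
    h_rec = sample(hidden_probs(v_rec, b, W), rng)      # step 5
    for i in range(len(a)):
        for j in range(len(b)):
            # Steps 3, 6 and 7: positive minus negative outer product.
            W[i][j] += alpha * (v[i] * h[j] - v_rec[i] * h_rec[j])
        a[i] += alpha * (v[i] - v_rec[i])               # step 8, visible bias
    for j in range(len(b)):
        b[j] += alpha * (h[j] - h_rec[j])               # step 8, hidden bias
    return v_rec

rng = random.Random(0)
n_visible, n_hidden = 4, 2                 # hypothetical layer sizes
W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)] for _ in range(n_visible)]
a = [0.0] * n_visible
b = [0.0] * n_hidden
pattern = [1, 1, 0, 0]                     # step 1: the training sample

for _ in range(500):
    reconstruction = cd1_step(pattern, a, b, W, alpha=0.1, rng=rng)
```

In this toy run the visible biases of units that are on in the pattern can only grow and those of units that are off can only shrink, so the model gradually prefers reconstructions that resemble the training sample.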