
In physics, speed is defined as the distance covered per unit of time. The average speed over a given time interval is the quotient of the distance travelled during that interval and the length of the interval; the instantaneous speed is the limit of the average speed as the length of the interval tends to zero. Speed is therefore measured in units of length per unit of time: metres per second, miles per hour, kilometres per hour, etc.

If the definition used in physics is extended directly to the training of neural networks, the training speed could be defined as the number of patterns learned per unit of time.

Unfortunately, no two problems are the same and it is very difficult to compare the complexity of two different problems. Then, perhaps the problem can be clearly stated and two algorithms could be compared by running them on the same problem. In this case the speed would be measured in problems per unit of time, where there is only one problem and this problem is clearly stated. In these conditions one has only to quote the time in order to give a measure of the speed. If a certain algorithm trains XOR in 100 seconds and another does it in 50 seconds, it is clear that the second one is twice as fast as the first one. Or is it?

Unfortunately, algorithms are executed on machines and different machines have different performance. A comparison between the times needed by particular algorithms to train on a particular problem is meaningful only if the machine is clearly specified. Furthermore, for a precise comparison one should state the compiler and the operating system, to say nothing of the programmer's skills. Clearly this is very unsatisfactory. Nevertheless, this method of reporting the speed of an algorithm by quoting only the time necessary to train on a given problem (and the machine) is used very often. [Baum, 1991], [Baba, 1989], [Musavi, 1992], [Weymaere, 1991], [Brent, 1991] are just a few references in which the speed is reported in this way. Even if all the important factors were stated, this approach would still be inconvenient because it requires the reproduction of the same experimental conditions for a meaningful comparison between different training algorithms. Thus, if a new algorithm A, tested on a new machine M, is to be compared with a set of existing algorithms A1, A2, ..., Ak tested on machines M1, M2, ..., Mk, there are two possibilities for performing this comparison. The first possibility is to implement the algorithms A1, A2, ..., Ak on the machine M and the second possibility is to implement the algorithm A on the machines M1, M2, ..., Mk. Both possibilities require a total of k+1 implementations, which is extremely inconvenient.

Sometimes the training time is quoted more to illustrate the effect of varying some parameters than to offer a comparison with other techniques, as in [Romaniuk, 1993], [Baba, 1989], [Wilensky, 1990] for example.

Epochs, pattern presentations

A measure of the speed which is independent of the machine the algorithm is run on would be very useful. In looking for such a measure, one could consider the fact that a particular algorithm started with the same data should end up with the same result no matter which machine it is run on, and it will do so by performing the same number of operations. If the convergence process uses a cycle, the cycle will be performed the same number of times. In the framework of training, the most natural choice is the cycle over the training set.

The epoch, defined as a single presentation of the entire training set, is a possible measurement unit for the training speed. However, there are algorithms which do not cycle through the pattern set, and situations in which the size of the pattern set varies during the training. In order to cope with these situations, one could use the pattern presentation as a speed measurement unit. For those techniques which use a fixed, finite training set, one can easily calculate the training speed in epochs if it is given in pattern presentations, and vice versa. This is the measure proposed by Falhman in [Falhman, 1988].
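For a fixed, finite training set the conversion between the two units is a single multiplication or division. The following is a minimal sketch of that conversion; the function names and the XOR figures used in the usage lines are illustrative assumptions, not taken from the cited work.

```python
def presentations_to_epochs(presentations: int, training_set_size: int) -> float:
    """Convert a count of pattern presentations into epochs.

    Assumes a fixed, finite training set (hypothetical helper).
    """
    if training_set_size <= 0:
        raise ValueError("training_set_size must be positive")
    return presentations / training_set_size


def epochs_to_presentations(epochs: int, training_set_size: int) -> int:
    """Convert a count of epochs into pattern presentations."""
    if training_set_size <= 0:
        raise ValueError("training_set_size must be positive")
    return epochs * training_set_size


# Illustrative figures: XOR has 4 patterns.
xor_patterns = 4
print(presentations_to_epochs(2000, xor_patterns))  # 500.0
print(epochs_to_presentations(500, xor_patterns))   # 2000
```

When the training set grows or shrinks during training, only the pattern presentation count remains well defined, which is why it is the more general of the two units.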

Connection-crossings

Comparisons of training algorithms which use the number of epochs or pattern presentations as a machine-independent measure are appropriate only when the algorithms being compared involve a similar amount of work per epoch or pattern presentation [Brent, 1991].

There are algorithms which need to propagate the error only in a limited part of the network and/or in only one direction. This is the reason Falhman in [Falhman, 1990] proposes the number of connection-crossings as a measure of the learning time and, implicitly, of the learning speed. According to Falhman, the learning time measured in connection-crossings is the number of multiply-accumulate steps necessary to propagate activation values forward through the network and error values backward.
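As an illustration, the connection-crossings of a training run can be counted as follows. This is a sketch under stated assumptions: the network is layered and fully connected between consecutive layers, every connection is crossed once per pass, and bias weights are counted as connections. The function names and the 2-2-1 XOR network are hypothetical.

```python
def count_weights(layer_sizes, bias=True):
    """Number of connections in a layered, fully connected network.

    Assumption: each unit connects to every unit in the next layer;
    a bias input per layer is counted as one extra incoming connection.
    """
    extra = 1 if bias else 0
    return sum((n_in + extra) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def connection_crossings(layer_sizes, presentations, passes=2, bias=True):
    """Connection-crossings for a run: weights * passes * presentations.

    passes=2 corresponds to one forward and one backward pass per pattern;
    an algorithm that only propagates forward would use passes=1.
    """
    return count_weights(layer_sizes, bias) * passes * presentations


# Illustrative 2-2-1 network trained on the 4 XOR patterns for 100 epochs.
weights = count_weights([2, 2, 1])                     # 9
crossings = connection_crossings([2, 2, 1], 4 * 100)   # 7200
print(weights, crossings)
```

An algorithm that updates only part of the network would need a per-pass weight count in place of the full `count_weights` value.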

This is appropriate for those algorithms which perform operations of the same computational load for each connection. However, there are algorithms which do different things. Some algorithms might converge in very few epochs, propagate values only forward through the network and therefore have few connection-crossings. However, the same algorithm could need, for instance, the calculation of the inverse of a large matrix for each of the connection-crossings, and thus it could require a lot of CPU time.

Number of operations

A better measure for the training speed which can be applied to virtually all types of training algorithms is the number of operations as defined in the field of algorithmic analysis and design. This can be done by counting certain operations (the ones which are estimated to take a significant time) and expressing the performance modulo some multiplicative constant.

As opposed to algorithm analysis, in which the performance is associated with the algorithm, in the assessment of the speed of a training session the number of operations is used to give an indication of the computation involved in that particular training session. For instance, Brent in [Brent, 1991] uses the number of operations to characterise the general performance of the algorithm and to compare it with standard backpropagation. In doing this, he is forced to make some assumptions about the problem (a generic problem) which cannot be sustained by theoretical reasons. Later on, when reporting the performance, Brent uses the more common, but less informative, training time in seconds on a particular machine. The approach presented in this thesis proposes using the number of operations both in assessing the performance of the algorithm and in reporting trial results. Eventually, the latter could sustain some assumptions used by the former.

Let us consider the example of an algorithm which performs a global operation on the weight matrix. Let us assume this operation needs w*log(w) operations where w is the number of weights in the network (the architecture is fixed). This operation is performed for each pattern presentation. Let us suppose this algorithm uses a training set which increases linearly so that the first training set contains 1 pattern, the second 2 patterns and so on. Furthermore, this algorithm uses k passes through the network for each iteration and the total number of patterns is n. In a particular case in which e epochs (an epoch is defined as a presentation of the entire current training set) are necessary for each training set, the training time could be expressed as:

$$
e\,[\,kw\log w + 2kw\log w + \dots + nkw\log w\,] = e\sum_{i=1}^{n} i\,kw\log w = e\,kw\log w\sum_{i=1}^{n} i = e\,kw\log w\,\frac{n(n+1)}{2} \tag{1}
$$

This shows how this measure can be used to characterise the performance of a given trial independently of the machine it is used on and in the situation in which the algorithm's processing is non-standard. Any other performance measure would be misleading if applied to this algorithm.
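The closed form in equation (1) can be checked numerically against a direct summation over the growing training sets. The sketch below assumes the natural logarithm (the base is not stated here and only changes the multiplicative constant); the function names and the parameter values are illustrative.

```python
import math


def ops_closed_form(e, k, w, n):
    """Operation count from the closed form: e * k * w * log(w) * n(n+1)/2."""
    return e * k * w * math.log(w) * n * (n + 1) / 2


def ops_by_summation(e, k, w, n):
    """Same count obtained by summing over the growing training sets.

    Training set i holds i patterns; each pattern presentation costs
    k * w * log(w) operations, and each set is presented for e epochs.
    """
    per_presentation = k * w * math.log(w)
    total = 0.0
    for i in range(1, n + 1):
        total += e * i * per_presentation
    return total


# Illustrative values: 10 epochs per set, 2 passes, 50 weights, 20 patterns.
a = ops_closed_form(e=10, k=2, w=50, n=20)
b = ops_by_summation(e=10, k=2, w=50, n=20)
print(math.isclose(a, b))  # True
```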

In the case of standard backpropagation, quickprop and cascade correlation, this measure reduces to the connection-crossings measure because there are no complex operations associated with a connection and there are no global operations on the weight matrix or pattern set. Thus, the number of operations measure would give the same comparison between backpropagation, quickprop and cascade correlation as the one given by the connection-crossings measure. For instance, the value of this measure for backpropagation is w*2*n*e, where w is the number of weights, 2 is the number of passes through the weights for a pattern presentation, n is the number of patterns and e is the number of epochs. A comparison is now possible between backpropagation and the hypothetical algorithm considered before.
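Such a comparison might be sketched as follows. The backpropagation count is the w*2*n*e expression given above; the count for the hypothetical algorithm follows from its description earlier (w*log(w) operations per pass, k passes per presentation, training sets of 1 to n patterns, e epochs per set). The natural logarithm, the function names and all numeric values are assumptions made for illustration, and the two epoch counts are kept as separate parameters because they need not be equal.

```python
import math


def backprop_ops(w, n, e):
    """Operations for standard backpropagation: w * 2 * n * e."""
    return w * 2 * n * e


def hypothetical_ops(w, n, e_per_set, k):
    """Operations for the hypothetical algorithm with a growing training set."""
    return e_per_set * k * w * math.log(w) * n * (n + 1) / 2


# Illustrative values only: 50 weights, 20 patterns.
bp = backprop_ops(w=50, n=20, e=1000)
hyp = hypothetical_ops(w=50, n=20, e_per_set=10, k=2)
print(f"backpropagation: {bp:.3e} operations")
print(f"hypothetical:    {hyp:.3e} operations")
print(f"ratio hypothetical/backpropagation: {hyp / bp:.3f}")
```

With equal epoch counts the ratio simplifies to k*log(w)*(n+1)/4, so the hypothetical algorithm is only competitive if it needs far fewer epochs per training set than backpropagation needs in total.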

In those cases in which the training depends very much on the problem and perhaps on the initial state, this measure cannot be estimated a priori as a function of the various parameters, but it can easily be reported by the implementation of the algorithm.
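One way an implementation could report this measure is to increment a counter wherever a significant operation is performed. The sketch below instruments a simple perceptron trained on the AND problem; the perceptron, the `OpCounter` class and the choice to count one operation per weight touched are illustrative assumptions, not the algorithm discussed in this thesis.

```python
class OpCounter:
    """Accumulates the number of counted operations (hypothetical helper)."""

    def __init__(self):
        self.total = 0

    def add(self, n):
        self.total += n


def train_perceptron(patterns, max_epochs=100, lr=1.0):
    """Train a threshold unit and report (weights, epochs, operations)."""
    counter = OpCounter()
    n_inputs = len(patterns[0][0])
    weights = [0.0] * (n_inputs + 1)           # last weight is the bias
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for inputs, target in patterns:
            x = list(inputs) + [1.0]           # constant input for the bias
            activation = sum(w * xi for w, xi in zip(weights, x))
            counter.add(len(weights))          # forward multiply-accumulates
            output = 1 if activation > 0 else 0
            delta = target - output
            if delta != 0:
                errors += 1
                weights = [w + lr * delta * xi for w, xi in zip(weights, x)]
                counter.add(len(weights))      # weight-update operations
        if errors == 0:                        # a full epoch without errors
            return weights, epoch, counter.total
    return weights, max_epochs, counter.total


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, epochs, operations = train_perceptron(AND)
print(f"converged in {epochs} epochs using {operations} counted operations")
```

Because the count depends on how many patterns are misclassified along the way, it cannot be predicted from the network size alone, but it is reproduced exactly on any machine.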
