THE PERUVIAN STATE AND ITS PARTICIPATION IN THE PROCESS OF FORMING THE OC.

CHAPTER II: “THE SIGNIFICANCE OF THE ADVISORY WORK OF THE INTER-AMERICAN COURT OF


14 Schwarz et al, 1990

Generalisation in Neural Networks Generalisation in the Literature

The validation technique is the most practical of the generalisation theories discussed in section 3.3. It is also the standard technique used in the neural network community, and is based on the method of cross-validation in statistics. There are many implementable variants on a basic theme. The data are divided into two sets: a training set and a validation set. Only the training set is used to determine the values of the weights. The validation set is used, during training, to measure the generalisation ability of the network, and to decide when to stop training. The algorithm proceeds as follows:

[The validation technique] employs a network with an excessive number of free parameters and stops training before the network reaches overlearning on the training set. ... Training is stopped when the performance on the validation set ceases to improve.16

Weigend et al17 actually suggest dividing the data into three parts, and using a prediction set (not used for training or for assessing when to stop training) which gives a measure of the expected future performance.
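The three-way division suggested by Weigend et al can be sketched as follows. This is only an illustrative sketch: the function name `three_way_split`, the 60/20/20 proportions and the random shuffling are assumptions made here, not details given in the source.

```python
import random


def three_way_split(patterns, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle the patterns and divide them into three disjoint sets.

    The proportions are hypothetical defaults; the source does not
    prescribe how much data goes into each part.
    """
    if train_frac <= 0 or valid_frac <= 0 or train_frac + valid_frac >= 1:
        raise ValueError("fractions must be positive and sum to less than 1")
    shuffled = list(patterns)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    training = shuffled[:n_train]                      # used to set the weights
    validation = shuffled[n_train:n_train + n_valid]   # used to decide when to stop
    prediction = shuffled[n_train + n_valid:]          # never seen during training
    return training, validation, prediction
```

Because the prediction set plays no part in either fitting the weights or choosing the stopping point, its error is the least biased of the three estimates, although it still describes the available data only.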

Over-fit and under-fit relate to training time for back-propagation. The longer a network is trained, the more peaks and troughs appear in the function implemented by the network. This is because back-propagation starts from small random initial weights, which lead to low excitation values for each unit. This yields a roughly uniform initial function. As training proceeds, the weights increase in magnitude, and the excitation values increase. This gives rise to functions with more peaks and troughs.

Using too many hidden units during training ensures that there can be no under-fit in the final solution, since there will definitely be more than enough hidden units to realise the underlying function. Initially, the validation error will be high, as the network under-fits the data. As training progresses, and the under-fit is reduced, the validation error goes down. As the network is trained beyond the point of under-fitting the data, the validation error increases, as there are more incorrect outputs given for the members of the validation set. The network is now beginning to over-fit the data. This is a good time to stop training, since there is likely to be a good compromise between over-fit and under-fit.
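The stopping rule described above can be sketched as a loop around two callables. The names `train_one_cycle` and `validation_error`, and the `patience` parameter (how many non-improving cycles to tolerate), are hypothetical and introduced here for illustration; the source simply stops when the validation error ceases to improve, which corresponds to `patience=0`.

```python
def train_with_early_stopping(train_one_cycle, validation_error,
                              max_cycles, patience=0):
    """Train until the validation error stops improving.

    train_one_cycle:  callable performing one training cycle (assumed).
    validation_error: callable returning the current validation error (assumed).
    Returns (best_cycle, best_error). A real implementation would also
    save the weights at best_cycle so that they can be restored.
    """
    best_error = validation_error()   # error of the untrained network
    best_cycle = 0
    for cycle in range(1, max_cycles + 1):
        train_one_cycle()
        error = validation_error()
        if error < best_error:
            best_error, best_cycle = error, cycle
        elif cycle - best_cycle > patience:
            break                     # validation error has ceased to improve
    return best_cycle, best_error


# Scripted validation curve: falls while under-fit is reduced, then rises.
curve = iter([0.9, 0.6, 0.4, 0.35, 0.37, 0.5])
result = train_with_early_stopping(lambda: None, lambda: next(curve), max_cycles=5)
print(result)  # (3, 0.35)
```

With `patience=0` the loop halts at the first minimum of validation error, which is exactly the choice whose consequences are discussed later in this section.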

16 Hasegawa et al, 1992, p. 2459
17 Weigend et al, 1991a, p. 108


Validation need not be restricted to an equilibration of over-fit and under-fit with over-sized topologies, however. With topologies that have too few units to fit the data exactly, a minimum of validation error can be used to indicate when there is a balance between the errors on the validation set and the training set. This balance at the minimum of validation error means that further training, although it leads to an improved training error, will result in poorer performance on the validation set.

Training too far with any topology, over- or under-sized, will lead to an excessive fit to the training set, and hence an imbalance between the fits to the training set and the validation set. This is undesirable, since it is likely to lead to poor generalisation. Lang et al give the following report of training too far with their neural network:

Peak generalisation occurred after ... 10 000 epochs, at which point the network got 95.4% of the training cases, and 91.4% of the testing cases correct. During an additional 10 000 epochs of training, the network's performance increased to 98.1% on the training set, but generalisation fell to 88.1%.18
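The imbalance in the figures reported by Lang et al can be made explicit with a little arithmetic. The helper name `generalisation_gap` is an assumption made here; the percentages are those quoted above.

```python
def generalisation_gap(train_percent, test_percent):
    """Difference between training and testing accuracy, in percentage points."""
    return round(train_percent - test_percent, 1)


gap_at_peak = generalisation_gap(95.4, 91.4)    # after 10 000 epochs
gap_after_more = generalisation_gap(98.1, 88.1)  # after a further 10 000 epochs
print(gap_at_peak, gap_after_more)  # 4.0 10.0
```

The gap between the two fits grows from 4.0 to 10.0 percentage points, even though the training accuracy itself improves by 2.7 points.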

Figure 3.6 shows the effect on the IO of overtraining, for a simple problem (a linearly separable set) whose targets have been corrupted by noise. An overlarge topology has been used for the simple problem, and if training continues sufficiently beyond the first minimum of the validation error, then the network begins to fit the noise.


[Figure 3.6 image not recovered: IO pictures at successive numbers of training cycles, with a plot of training error and validation error against cycles.]

Figure 3.6: The change in the black and white picture produced by an overlarge topology for a noisy pattern set, over a period of 3 000 cycles. The topology had a single hidden layer of 5 units. Light grey indicates an output of 0 by the network, and dark grey indicates an output of 1. The pattern set is taken from a linearly separable pattern set of two hundred patterns, with 25% noise on the targets. The patterns are indicated by black and white dots on the IO picture. A black dot indicates a target of 1, and a white dot a target of 0. (The patterns are shown in more detail in figure 3.7.)

The IO graph fits the noise more and more closely as training proceeds beyond the first minimum of validation error. At 37 cycles, the first minimum of validation error, the network realises a linear separation close to the underlying function. By 545 cycles, the separation is distorted slightly as it begins to fit the noise in the training set. By 1 992 cycles, the separation is severely distorted.
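A pattern set of the kind used in figure 3.6 could be generated along the following lines. This is a sketch under stated assumptions: the separating line x + y = 1, the unit-square input range, and the reading of "25% noise on the targets" as each target being flipped with probability 0.25 are all choices made here, since the source does not specify them.

```python
import random


def make_noisy_linear_patterns(n_patterns=200, noise=0.25, seed=0):
    """Generate 2-D patterns from a linearly separable rule with noisy targets.

    Returns a list of ((x, y), target, clean_target) tuples, where
    clean_target follows the underlying rule and target may be flipped.
    """
    rng = random.Random(seed)
    patterns = []
    for _ in range(n_patterns):
        x, y = rng.random(), rng.random()
        clean_target = 1 if x + y > 1.0 else 0   # hypothetical underlying rule
        flipped = rng.random() < noise           # assumed noise model: target flip
        target = 1 - clean_target if flipped else clean_target
        patterns.append(((x, y), target, clean_target))
    return patterns
```

A network that reproduces `clean_target` has found the underlying function; one that reproduces every `target` in the training set has fitted the noise as well.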

[Figure 3.7 image not recovered.]

Figure 3.7: Close-up of the pattern set used in figure 3.6. Solid black squares indicate a target of 1, and the unfilled squares indicate a target of 0.

One problem with the technique is that the minimum of validation error at which training is stopped need not be the best minimum in all cases. Suppose that the desired decision region in the above problem were not a linear partition, but the decision region shown at 1 992 cycles in figure 3.6. The same data set is used as for the simple problem, but now it is assumed that there is less noise. The same topology and initial weight state could also be used for this problem. However, the desired decision region, given these conditions, does not occur until the third minimum of validation error, which has a lower training error than the first minimum of validation error. The decision to be content with a given minimum therefore represents an assumption about the noise in the data.
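Since a validation curve may have several minima, it can help to list them all before deciding where to stop. The sketch below assumes the validation error has been recorded once per cycle in a list; the function name `validation_minima` and the example curve are hypothetical.

```python
def validation_minima(errors):
    """Return the cycle indices at which the validation error has a
    strict local minimum (lower than both neighbouring cycles).

    Flat stretches of equal error are not reported by this simple test.
    """
    minima = []
    for i in range(1, len(errors) - 1):
        if errors[i] < errors[i - 1] and errors[i] < errors[i + 1]:
            minima.append(i)
    return minima


# Hypothetical recorded curve with three minima.
curve = [0.9, 0.5, 0.6, 0.55, 0.7, 0.4, 0.45]
print(validation_minima(curve))  # [1, 3, 5]
```

Stopping at index 1 rather than index 5 in this made-up curve is the kind of choice that encodes an assumption about how noisy the data are.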

Hence, the choice of which minimum to stop at is a bias. (See section 3.2.2, earlier.) This relates to the problem with validation discussed by Denker et al. Denker et al refer to validation as "rule extraction"19 (and do not assert any requirement for an over-sized topology). Given a training set, M, and a validation set, X, both of which are assumed to be representative samples of the rule to be extracted, they give a theoretical basis for expecting validation to work as follows:

The idea is to extract the rule from... M, and extend it to ... X.20

19 Denker et al, 1987, pp. 897-901
20 Denker et al, 1987, p. 897


Defining the extraction score as the accuracy with which the network predicts the validation set, Denker et al then go on to acknowledge a problem with the validation technique:

We emphasise that rule extraction is rather a slippery concept, since it is possible to change a network's extraction score (without changing the network) simply by changing one's mind about what rule was "supposed" to be extracted.21
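The point made by Denker et al can be illustrated with a small sketch in which a fixed network is scored against two different ideas of the "supposed" rule. The function `extraction_score`, the toy `network` and both target lists are hypothetical, introduced here only for illustration.

```python
def extraction_score(network, inputs, targets):
    """Fraction of validation patterns on which the network matches the targets."""
    correct = sum(1 for x, t in zip(inputs, targets) if network(x) == t)
    return correct / len(inputs)


def network(pattern):
    """A fixed toy network: outputs 1 when the two inputs sum to more than 1."""
    x, y = pattern
    return 1 if x + y > 1.0 else 0


validation_inputs = [(0.2, 0.3), (0.9, 0.8), (0.6, 0.7), (0.1, 0.4)]
rule_a_targets = [0, 1, 1, 0]   # one view of the rule to be extracted
rule_b_targets = [0, 1, 0, 0]   # a changed mind about the third pattern

print(extraction_score(network, validation_inputs, rule_a_targets))  # 1.0
print(extraction_score(network, validation_inputs, rule_b_targets))  # 0.75
```

The network is identical in both calls; only the targets assigned to the validation set differ, and the score changes accordingly.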

This is a rather weak criticism of the technique, however. The assumption of representativeness must be broken if there is any major change of mind about the rule to be extracted. However, when the data are noisy, which must be taken to be the norm in real-world problems, the assumption of representativeness becomes an assumption of the degree of representativeness of the data. For the same data, differing assumptions of the amount of noise may yield different desired fits to the data, as is illustrated by the example in figure 3.6.

The validation technique makes no assumptions about the underlying function, other than that the data in the training set and the validation set are representative. It provides a means of deciding when to terminate training, at which point there is a balance between the memorisation of the training set, and the generalisation ability.

The validation technique provides no guarantees about the expected degree of fit beyond the available data. Even using a prediction set as per Weigend et al gives information about the available data only. The degree of expected fit to new data may only be based on the assumption that the data used for training and validation were representative. If this assumption is valid, then the technique should give satisfactory generalisation performance, but only in so far as the training regime produces the best fit to the training and validation data.

