
The relation of Bayesian probability theory to neural networks is discussed in several papers.[33] The Bayesian approach to generalisation is based on the Bayes optimal classification algorithm.[34] Given p inputs whose targets (∈ {0,1}) are known, this algorithm finds the most probable output for a new input, q. Let Vp be the volume of weight space that correctly classifies the p patterns seen so far. Let Pq+ be the proportion of Vp that gives an output of 1 to q, and let Pq− be the proportion of Vp that gives an output of 0 to q. If Pq+ > Pq− then q is given output 1; otherwise q is given output 0. Evaluating Pq± is equivalent to evaluating the Bayesian posterior probability that q has the given classification.[35] Eisenstein and Kanter refer to this as the Bayes_1 algorithm,[36] since one new input is tested at a time.
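The procedure above can be illustrated with a minimal Monte Carlo sketch. It assumes a single threshold unit without a bias term and estimates Vp by rejection sampling from a Gaussian over weights; the function names, the sampling scheme, and the example patterns are illustrative assumptions, not taken from the papers cited.

```python
import numpy as np

def sample_version_space(X, t, n_samples=20000, seed=0):
    """Rejection-sample weight vectors that classify all p patterns correctly.

    Assumption: one threshold unit, output 1 if w.x > 0, otherwise 0.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_samples, X.shape[1]))
    outputs = (W @ X.T > 0).astype(int)        # shape (n_samples, p)
    consistent = np.all(outputs == t, axis=1)  # agrees with every target
    return W[consistent]

def bayes_1(V, q):
    """Return (output, Pq+) for a new input q, given samples V of Vp."""
    p_plus = float(np.mean(V @ q > 0))         # estimate of Pq+
    # Pq- = 1 - Pq+, so Pq+ > Pq- is the same test as Pq+ > 0.5.
    return int(p_plus > 0.5), p_plus

# Hypothetical example: two known patterns and one new input.
X = np.array([[1.0, 0.2], [-1.0, 0.5]])
t = np.array([1, 0])
q = np.array([1.0, 1.0])

V = sample_version_space(X, t)
label, p_plus = bayes_1(V, q)
print(label, round(p_plus, 2))
```

The sketch only works when the version space is large enough to be hit by random sampling, which is unlikely to hold for larger networks.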

An important disadvantage of this technique, as with the average generalisation error technique, is that it requires knowledge of all the weight states that correctly classify the p patterns. Opper and Haussler[37] sample Vp by independently training N networks to minimum error from several different random initial weight states, which results in different final trained weight states. A committee machine is then constructed which takes the outputs of the N networks as inputs, and gives the output given by the majority of them. N must be odd so that there is always a clear majority one way or the other. For sufficiently large N, the output of the committee machine converges to the Bayes estimate. Compared with the validation technique, which trains only one network, Opper and Haussler's algorithm is very costly, particularly for large problems and for large N.
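A committee machine of this kind might be sketched as follows. The sketch assumes single threshold units trained with the classical perceptron rule on linearly separable data; the training rule and all function names are assumptions made for illustration, and Opper and Haussler's own training procedure may differ.

```python
import numpy as np

def train_perceptron(X, t, rng, max_epochs=1000):
    """Train one threshold unit from a random initial weight state."""
    w = rng.standard_normal(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(X, t):
            y = int(w @ x > 0)
            if y != target:
                w += (target - y) * x   # perceptron learning rule
                errors += 1
        if errors == 0:                 # all p patterns classified correctly
            break
    return w

def build_committee(X, t, n_networks=11, seed=0):
    """Independently train N networks; N must be odd to avoid tied votes."""
    if n_networks % 2 == 0:
        raise ValueError("n_networks must be odd")
    rng = np.random.default_rng(seed)
    return [train_perceptron(X, t, rng) for _ in range(n_networks)]

def committee_output(committee, q):
    """Output given by the majority of the member networks for input q."""
    votes = sum(int(w @ q > 0) for w in committee)
    return int(votes > len(committee) / 2)

# Hypothetical example patterns.
X = np.array([[1.0, 0.2], [-1.0, 0.5]])
t = np.array([1, 0])
committee = build_committee(X, t, n_networks=11)
print(committee_output(committee, np.array([1.0, 1.0])))
```

The cost noted above is visible in `build_committee`: every member requires a full training run, so the expense grows linearly with N.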

[33] E.g. Kanaya & Miyake, 1991; Levin et al., 1990; MacKay, 1992a-d; Opper & Haussler, 1991; Watkin et al., 1993.

[34] See Opper & Haussler, 1991, p. 2678, and references therein.

[35] Watkin et al., 1993, p. 510.

[36] Eisenstein & Kanter, 1993, p. 3668.

[37] Opper & Haussler, 1991.

Generalisation in Neural Networks Generalisation in the Literature

In the case of a set, Q, of n > 1 patterns to be generalised to, there are two possibilities for deciding the outputs for the members of Q. Firstly, one can apply the Bayes_1 algorithm n times. This means selecting the largest proportion of Vp to be the target for the first member of Q, q_1, then, of that sub-volume, selecting the largest proportion for q_2, and so on up to q_n. The second possibility is to consider the proportions of Vp for each possible set of targets for the members of Q, and to select the largest proportion. This is termed the Bayes_n algorithm.[38] The difference between these algorithms is indicated in figure 3.14.

Figure 3.14: The difference between Bayes_1 (a) and Bayes_n (b). In (a) each pattern is assessed sequentially, and the largest portion of the available weight states selected. In (b) all possible target sequences with the current architecture are assessed at once, and the largest area selected. Each pattern has a boundary hyperplane in version space which separates those weight states which classify the pattern as 1, and those which classify it as 0. The arrows indicate on which side of the boundary each pattern is classified as 1.
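The two procedures can be contrasted in a short sketch that operates on a finite sample V of weight states from version space. The threshold-unit model, the function names, and the hand-built sample are assumptions chosen so that the two procedures disagree.

```python
import numpy as np
from collections import Counter

def bayes_1_sequential(V, Q):
    """Apply the one-input rule to each member of Q in turn,
    keeping only the sub-volume that agrees with each chosen target."""
    targets = []
    for q in Q:
        out = V @ q > 0
        label = int(out.mean() > 0.5)   # larger proportion of what remains
        targets.append(label)
        V = V[out] if label else V[~out]
    return tuple(targets)

def bayes_n(V, Q):
    """Choose the complete target sequence that occupies the largest
    proportion of the sampled version space."""
    outputs = (V @ Q.T > 0).astype(int)
    counts = Counter(map(tuple, outputs.tolist()))
    return counts.most_common(1)[0][0]

# Hypothetical sample: 8 weight states give (1, 0), 7 give (0, 1), 5 give (0, 0).
V = np.array([[1.0, -1.0]] * 8 + [[-1.0, 1.0]] * 7 + [[-1.0, -1.0]] * 5)
Q = np.eye(2)

print(bayes_1_sequential(V, Q))  # (0, 1)
print(bayes_n(V, Q))             # (1, 0)
```

In this example the sequential procedure first assigns q_1 the target 0 (12 of 20 weight states) and then, within that sub-volume, assigns q_2 the target 1 (7 of 12). The joint procedure instead selects (1, 0), which is the single largest region with 8 of 20 weight states.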

Eisenstein and Kanter distinguish between three kinds of generalisation in order to show that the Bayes_1 and Bayes_n algorithms are not necessarily optimal:


(i) To maximise the probability of correct classification of all members of Q. In this case, the aim is to choose the maximum probability outputs for all members of Q simultaneously. For this, the Bayes_n algorithm is optimal.[39]

(ii) To maximise the average number of correct classifications of Q. Here, rather than considering all members of Q simultaneously, the aim is that, on average, each member of Q will be correctly classified. For this, applying Bayes_1 n times is optimal.[40]

(iii) As (ii), but at least m patterns must be correctly classified. Here, neither Bayes_n nor Bayes_1 is optimal.[41] This is because neither algorithm can guarantee a given number of correct classifications from Q, since there is no a priori basis for selecting among the possibilities according to the proportion of Vp that they occupy. Figure 3.15 shows this. It is hard to imagine, however, that any algorithm could guarantee the correct classification of a given number of patterns. Yet Eisenstein and Kanter claim to have an algorithm which is optimal in this case, though their description of it is not straightforward:

The optimal strategy for [case (iii)] is a table which indicates for each type of partition of the VS to make a prediction following a particular part of the VS, among 2^n.[42] [VS stands for version space, and is equivalent to Vp in the above discussion.]

MacKay has an alternative Bayesian approach which uses a single neural network.[43] The general approach is to find the simplest, most probable interpolant of the data.[44] This embodies Occam's razor, the principle that, given a choice, simple rules that explain a phenomenon are more likely to be true than complex rules.

[39] Eisenstein & Kanter, 1993, p. 3669.

[40] Eisenstein & Kanter, 1993, p. 3669.

[41] Eisenstein & Kanter, 1993, p. 3670.

[42] Eisenstein & Kanter, 1993, p. 3670.

[43] MacKay, 1992a-d.


In the classification paradigm, MacKay's approach to finding the maximum probability classification is to estimate the probability of each class for a given input, and choose the class with the highest probability.[45] Each output unit is dedicated to a particular class, and the analogue output of each unit is treated as an estimate of the probability that the current input has that class. This is in line with the view of Richard and Lippmann that neural network classifiers trained using back-propagation are good estimators of Bayesian a posteriori probabilities,[46] though MacKay suggests several refinements to back-propagation.[47]
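The decision step described above, with one output unit per class and the largest analogue output selected, reduces to an argmax over the output activations. The following is a minimal sketch; the function name and the example activations are hypothetical, and the network that produces the activations is not shown.

```python
import numpy as np

def most_probable_class(outputs):
    """Treat each output unit's activation as a class-probability estimate
    and return (class index, estimated probability) for the largest one."""
    outputs = np.asarray(outputs, dtype=float)
    k = int(np.argmax(outputs))
    return k, float(outputs[k])

# Hypothetical activations of three output units for one input.
print(most_probable_class([0.1, 0.7, 0.2]))
```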

Figure 3.15: Regions that have at least 4 correct patterns for a true target sequence of T = {0, 0, 1, 0, 0, 0}. These regions cannot be identified from the relative volumes of weight space they occupy, and hence neither the Bayes_1 nor the Bayes_n algorithm finds an acceptable solution.

MacKay's approach to classification is therefore closer to surface fitting than to raw classification, in which only the class boundaries are required. Telfer and Szu point out that more complex topologies are needed to estimate the Bayesian a posteriori probabilities than to perform raw classification.[48]

This thesis is focused on estimating the class boundaries, rather than the a posteriori probabilities. This is because there is a better understanding of the topology needed to realise a given set of class boundaries than to fit a given probability surface. MacKay deals with this by training using various topologies, and then choosing the best among these.[49]

[45] MacKay, 1992d.

[46] Richard & Lippmann, 1991.

[47] MacKay, 1992b.

The technique in chapter 6 has the topology as a bias of the user, and it is therefore important that the user has a reasonable idea of the capabilities of a given topology before using the technique. These capabilities are discussed in detail in chapter 4.

Bayesian learning in general embodies certain a priori ideas of what constitutes a good generalisation. The most probable generalisation, however, cannot be guaranteed to be the best generalisation.

The Bayesian approach is akin to the Mitchellian approach for partially learned concepts. When the Mitchellian version space does not converge to a singleton, there will be some instances on which the various members of S and G disagree as to whether the instance matches the target concept. For such an instance, one of Mitchell's strategies is to take the decision of the majority of the members of S and G. The main difference between the Bayesian approach and the Mitchellian approach is that Mitchell uses the boundaries of version space only, whereas the Bayesian approach samples the whole of version space.
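Mitchell's majority strategy can be sketched as follows. The sketch assumes hypotheses are conjunctions of attribute values with "?" as a wildcard; the attribute values, the tie-breaking rule, and the function names are assumptions for illustration only.

```python
def matches(hypothesis, instance):
    """A conjunctive hypothesis matches if every non-wildcard value agrees."""
    return all(h == "?" or h == x for h, x in zip(hypothesis, instance))

def majority_vote(S, G, instance):
    """Decision of the majority of the boundary-set members S and G.
    Assumption: a tie is treated as 'does not match'."""
    votes = [matches(h, instance) for h in list(S) + list(G)]
    return sum(votes) > len(votes) / 2

# Hypothetical boundary sets of a version space that has not converged.
S = [("sunny", "warm", "?", "strong")]
G = [("sunny", "?", "?", "?"), ("?", "warm", "?", "?")]

print(majority_vote(S, G, ("sunny", "warm", "high", "weak")))    # True
print(majority_vote(S, G, ("sunny", "cold", "normal", "strong")))  # False
```

Only the members of S and G vote here, in contrast to the Bayesian procedures, which weigh the whole of version space.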

Generalisation in Neural Networks The Mitchellian View
