To learn structure using classification error, we must adopt a strategy that searches the space of all structures efficiently while avoiding local optima. In this section, we propose a method that can effectively search for better structures with an explicit focus on classification. As we have no simple closed-form expression that relates structure to classification error, it would be difficult to design a gradient descent algorithm or a similar iterative method. Even if we could, a gradient search algorithm would be likely to get stuck in a local minimum of the error because of the size of the search space.
First we define a measure over the space of structures which we want to maximize:
Definition 7.1 The inverse error measure for structure $S$ is
\[ \mathrm{inv}_e(S) = \frac{1 / p_{S}(\hat{c}(X) \neq C)}{\sum_{S'} 1 / p_{S'}(\hat{c}(X) \neq C)}, \tag{7.13} \]
where the summation is over the space of possible structures and $p_S(\hat{c}(X) \neq C)$ is the probability of error of the best classifier learned with structure $S$.
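As a small illustration of Definition 7.1, the sketch below normalizes the reciprocal error probabilities over a tiny set of structures that can be enumerated. The structure names and error values are made up for the example; in practice the sum runs over far too many structures to compute, which is why sampling is used instead.

```python
def inverse_error_measure(error_probs):
    """Normalize 1/p_error over an enumerable set of structures.

    error_probs maps a structure label to its (assumed known) probability
    of error, which must be strictly positive.
    """
    reciprocals = {s: 1.0 / p for s, p in error_probs.items()}
    total = sum(reciprocals.values())
    return {s: r / total for s, r in reciprocals.items()}


# Hypothetical error probabilities for three candidate structures.
errors = {"S_a": 0.10, "S_b": 0.20, "S_c": 0.40}
measure = inverse_error_measure(errors)
```

Structures with lower error receive proportionally more mass: halving the error doubles the value of the measure.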
We use Metropolis-Hastings sampling [Metropolis et al., 1953] to generate samples from the inverse error measure, without having to ever compute it for all possible structures. For constructing the Metropolis-Hastings sampling, we define a neighborhood of a structure as the set of directed acyclic graphs to which we can transit in the next step. Transition is done using a predefined set of possible changes to the structure; at each transition a change consists of a
single edge addition, removal, or reversal. We define the acceptance probability of a candidate structure, $S^{new}$, to replace a previous structure, $S_t$, as follows:
\[ \min\left(1, \left(\frac{\mathrm{inv}_e(S^{new})}{\mathrm{inv}_e(S_t)}\right)^{1/T} \frac{q(S_t \mid S^{new})}{q(S^{new} \mid S_t)}\right) = \min\left(1, \left(\frac{p^{t}_{error}}{p^{new}_{error}}\right)^{1/T} \frac{N_t}{N_{new}}\right), \tag{7.14} \]
where $q(S' \mid S)$ is the transition probability from $S$ to $S'$, and $N_t$ and $N_{new}$ are the sizes of the neighborhoods of $S_t$ and $S^{new}$, respectively; this choice corresponds to equal probability of transition to each member in the neighborhood of a structure. This choice of neighborhood and transition probability creates a Markov chain which is aperiodic and irreducible, thus satisfying the Markov chain Monte Carlo (MCMC) conditions [Madigan and York, 1995]. The algorithm, which we name stochastic structure search (SSS), is presented in Box 7.6.
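The right-hand side of Eq. (7.14) needs only the two error estimates, the two neighborhood sizes, and the temperature. A minimal sketch follows; the function and parameter names are chosen here for illustration, and both error probabilities are assumed to be strictly positive.

```python
def acceptance_probability(p_err_t, p_err_new, n_t, n_new, temperature=1.0):
    """Right-hand side of Eq. (7.14).

    p_err_t, p_err_new: error probabilities of the current and candidate
    structures (assumed > 0). n_t, n_new: sizes of their neighborhoods.
    """
    error_ratio = (p_err_t / p_err_new) ** (1.0 / temperature)
    return min(1.0, error_ratio * n_t / n_new)


# A candidate with twice the error and an equally sized neighborhood.
a_warm = acceptance_probability(0.1, 0.2, 10, 10, temperature=1.0)
a_cold = acceptance_probability(0.1, 0.2, 10, 10, temperature=0.5)
```

Lowering the temperature makes the same worsening move less likely to be accepted, while a move that reduces the error (with equal neighborhood sizes) is always accepted.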
Box 7.6 (Stochastic Structure Search Algorithm)
Fix the network structure to some initial structure, $S_0$.
Estimate the parameters of the structure $S_0$ and compute the probability of error $p^{0}_{error}$.
Set $t = 0$.
Repeat, until a maximum number of iterations is reached ($MaxIter$):
– Sample a new structure $S^{new}$ from the neighborhood of $S_t$ uniformly, with probability $1/N_t$.
– Learn the parameters of the new structure using maximum likelihood estimation. Compute the probability of error of the new classifier, $p^{new}_{error}$.
– Accept $S^{new}$ with probability given in Eq. (7.14).
– If $S^{new}$ is accepted, set $S_{t+1} = S^{new}$ and $p^{t+1}_{error} = p^{new}_{error}$, and change $T$ according to the temperature decrease schedule. Otherwise, $S_{t+1} = S_t$.
– $t = t + 1$.
Return the structure $S_j$ such that $j = \arg\min_{0 \le j \le MaxIter} p^{j}_{error}$.
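The steps of Box 7.6 can be sketched in Python as follows. This is an illustrative sketch under several assumptions, not the authors' implementation: a structure is represented as a frozenset of directed edges over nodes 0..n-1, `error_of` is a hypothetical callable standing in for "learn the parameters by maximum likelihood and estimate the probability of error" (assumed to return a value in (0, 1]), and the acceptance test is evaluated in the log domain to avoid overflow at low temperatures.

```python
import math
import random
from itertools import permutations


def is_acyclic(n_nodes, edges):
    """Return True if the directed graph has no cycle (peel off source nodes)."""
    remaining = set(range(n_nodes))
    while remaining:
        sources = {v for v in remaining
                   if not any(b == v and a in remaining for a, b in edges)}
        if not sources:
            return False  # every remaining node has a parent: a cycle exists
        remaining -= sources
    return True


def neighborhood(n_nodes, edges):
    """All DAGs reachable by one edge addition, removal, or reversal."""
    result = []
    for u, v in permutations(range(n_nodes), 2):
        if (u, v) in edges:
            result.append(edges - {(u, v)})               # removal
            flipped = (edges - {(u, v)}) | {(v, u)}       # reversal
            if is_acyclic(n_nodes, flipped):
                result.append(flipped)
        elif (v, u) not in edges:
            added = edges | {(u, v)}                      # addition
            if is_acyclic(n_nodes, added):
                result.append(added)
    return result


def stochastic_structure_search(n_nodes, error_of, initial=frozenset(),
                                max_iter=200, temperature=lambda k: 1.0,
                                rng=None):
    """Sketch of Box 7.6; temperature(k) is indexed by accepted moves."""
    rng = rng or random.Random(0)
    current, p_cur = initial, error_of(initial)
    best, p_best = current, p_cur
    accepted = 0
    for _ in range(max_iter):
        nbrs = neighborhood(n_nodes, current)
        candidate = rng.choice(nbrs)                      # uniform, prob. 1/N_t
        p_new = error_of(candidate)
        n_t, n_new = len(nbrs), len(neighborhood(n_nodes, candidate))
        T = temperature(accepted)
        # log of the right-hand side of Eq. (7.14), before the min with 1
        log_a = (math.log(p_cur) - math.log(p_new)) / T + math.log(n_t / n_new)
        if log_a >= 0 or rng.random() < math.exp(log_a):
            current, p_cur = candidate, p_new
            accepted += 1
            if p_cur < p_best:
                best, p_best = current, p_cur
    return best, p_best


# Toy stand-in for the classification error: distance to a made-up target.
TARGET = frozenset({(0, 1), (1, 2)})


def toy_error(edges):
    return 0.05 + 0.1 * len(edges ^ TARGET)


best, p_best = stochastic_structure_search(3, toy_error)
```

Recomputing the candidate's neighborhood at every step is the simplest way to obtain $N_{new}$; for larger networks, that count and the acyclicity checks would be the first things to optimize.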
We add $T$ as a temperature factor in the acceptance probability. Roughly speaking, $T$ close to 1 would allow acceptance of more structures with higher probability of error than previous structures. $T$ close to 0 mostly allows acceptance of structures that improve probability of error. A fixed $T$ amounts to changing the distribution being sampled by the MCMC, while a decreasing $T$ is a simulated annealing run, aimed at finding the maximum of the inverse error measure. The rate of decrease of the temperature determines the rate of convergence. Asymptotically in the number of data, a logarithmic decrease of $T$ guarantees convergence to a global maximum with probability that tends to one [Hajek, 1988].
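To give a feel for what a logarithmic decrease looks like, here is a minimal sketch of one such schedule. The specific form $T_k = c / \log(k + 2)$, the constant `c`, and the offset of 2 (used so that $T_0$ is finite) are assumptions made for illustration; the text does not specify the exact schedule.

```python
import math


def logarithmic_temperature(k, c=1.0):
    """Assumed schedule T_k = c / log(k + 2) for step index k = 0, 1, 2, ..."""
    return c / math.log(k + 2)


def steps_to_reach(target, c=1.0):
    """Smallest k for which the assumed schedule satisfies T_k <= target."""
    return max(0, math.ceil(math.exp(c / target) - 2))


k_needed = steps_to_reach(0.1)
```

With `c = 1`, bringing the temperature down to 0.1 takes on the order of $e^{10}$ steps, which illustrates how slowly a logarithmic schedule cools.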
The SSS algorithm, with a logarithmic cooling schedule for $T$, can find a structure that is close to minimum probability of error. There are two caveats, though. First, the logarithmic cooling schedule is very slow. Second, we never have access to the true probability of error for each structure; we estimate it from a limited pool of training data. To avoid the problem of overfitting, we can take several approaches. Cross-validation can be performed by splitting the labeled training set into smaller sets. However, this approach can significantly slow down the search, and is suitable only if the labeled training set is moderately large. A different approach is to change the error measure using known bounds on the empirical classification error, which account for model complexity. We describe such an approach in the next section.
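The cross-validation option mentioned above can be sketched as a k-fold estimate of the classification error. The `fit` and `predict` callables are hypothetical placeholders for parameter learning and classification with a fixed structure, and the majority-class classifier in the usage example is only a stand-in.

```python
def cross_validated_error(samples, labels, fit, predict, k=5):
    """Estimate the error rate by k-fold cross-validation.

    fit(train_samples, train_labels) returns a model;
    predict(model, sample) returns a predicted label.
    """
    n = len(samples)
    folds = [list(range(i, n, k)) for i in range(k)]  # interleaved folds
    mistakes = 0
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(n) if i not in held]
        model = fit([samples[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        mistakes += sum(predict(model, samples[i]) != labels[i]
                        for i in held_out)
    return mistakes / n


# Stand-in classifier: always predict the majority label of the training fold.
def fit_majority(train_samples, train_labels):
    return max(set(train_labels), key=train_labels.count)


def predict_majority(model, sample):
    return model


xs = list(range(10))
ys = [0] * 7 + [1] * 3
cv_error = cross_validated_error(xs, ys, fit_majority, predict_majority, k=5)
```

Each candidate structure then requires k rounds of parameter learning instead of one, which is the slowdown the text warns about.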