, then any Maximum Weight Spanning Tree (MWST) where the weight of the branch connecting $X_i$ and $X_j$ is defined by
$$I(X_i, X_j) = \sum_{x_i, x_j} P(x_i, x_j) \log \frac{P(x_i, x_j)}{P(x_i)\,P(x_j)}$$
will unambiguously recover the skeleton of the polytree.
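The skeleton recovery step can be sketched in a few lines. The following is a minimal illustration, not the CASTLE implementation: it assumes discrete data held in plain Python lists, estimates each branch weight by the empirical mutual information, and builds the MWST with Kruskal's algorithm. The names `mutual_information` and `mwst_skeleton` are hypothetical.

```python
import itertools
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical I(X;Y) in nats from two equal-length sequences of discrete values."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    # (c/n) * log[(c/n) / ((px/n) * (py/n))] simplifies to the expression below.
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def mwst_skeleton(columns):
    """Kruskal's algorithm on pairwise mutual-information weights, heaviest first.

    columns: dict mapping a variable name to its list of observed values.
    Returns a list of undirected branches (name_a, name_b).
    """
    names = list(columns)
    weighted = sorted(
        ((mutual_information(columns[a], columns[b]), a, b)
         for a, b in itertools.combinations(names, 2)),
        reverse=True,
    )
    parent = {v: v for v in names}  # union-find forest

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edges = []
    for _, a, b in weighted:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # adding this branch does not close a cycle
            parent[root_a] = root_b
            edges.append((a, b))
    return edges


# Toy data in which A and C are related only through B.
data = {
    "A": [0, 0, 0, 0, 1, 1, 1, 1],
    "B": [0, 0, 0, 1, 1, 1, 1, 0],
    "C": [0, 0, 1, 1, 1, 1, 0, 0],
}
skeleton = mwst_skeleton(data)
```

With finite samples the estimated weights are noisy, so the recovered tree is only guaranteed to match the true skeleton in the large-sample limit.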
Having found the skeleton of the polytree we move on to find the directionality of the branches. To recover the directions of the branches we use the following facts: nondegeneracy implies that for any pair of variables $(X_i, X_j)$ that do not have a common descendant we have
$$I(X_i, X_j) > 0.$$
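Furthermore, for the pattern
$$X_i \rightarrow X_k \leftarrow X_j \tag{4.7}$$
we have $I(X_i, X_j) = 0$ and $I(X_i, X_j \mid X_k) > 0$, where
$$I(X_i, X_j \mid X_k) = \sum_{x_i, x_j, x_k} P(x_i, x_j, x_k) \log \frac{P(x_i, x_j \mid x_k)}{P(x_i \mid x_k)\,P(x_j \mid x_k)},$$
and for any of the patterns
$$X_i \leftarrow X_k \leftarrow X_j, \qquad X_i \leftarrow X_k \rightarrow X_j \qquad \text{and} \qquad X_i \rightarrow X_k \rightarrow X_j$$
we have $I(X_i, X_j) > 0$ and $I(X_i, X_j \mid X_k) = 0$.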
Taking all these facts into account we can recover the head–to–head patterns, (4.7), which are the really important ones. The rest of the branches can be assigned any direction as long as we do not produce more head–to–head patterns. The algorithm to direct the skeleton can be found in Pearl (1988).
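The test that separates a head–to–head pattern from the other three patterns can be illustrated as follows. This is a sketch under stated assumptions, not the algorithm of Pearl (1988): the function names and the threshold `eps` are hypothetical, and with real samples the comparison against zero would have to be replaced by a statistical test.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical I(X;Y) in nats."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def conditional_mutual_information(xs, ys, ks):
    """Empirical I(X;Y|K) in nats."""
    n = len(xs)
    c_xyk = Counter(zip(xs, ys, ks))
    c_xk = Counter(zip(xs, ks))
    c_yk = Counter(zip(ys, ks))
    c_k = Counter(ks)
    # P(x,y|k) / (P(x|k) P(y|k)) = c_xyk * c_k / (c_xk * c_yk)
    return sum(
        (c / n) * math.log(c * c_k[k] / (c_xk[(x, k)] * c_yk[(y, k)]))
        for (x, y, k), c in c_xyk.items()
    )


def is_head_to_head(xs, ys, ks, eps=1e-9):
    """True when X and Y look marginally independent but dependent given K."""
    return (mutual_information(xs, ys) < eps
            and conditional_mutual_information(xs, ys, ks) > eps)


# X and Y are independent bits and K = X xor Y, so K is a common child.
x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
k = [0, 1, 1, 0]
collider = is_head_to_head(x, y, k)

# In a chain X -> K -> Y with exact copies, the signs are reversed.
chain = is_head_to_head(x, x, x)
```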
The program to estimate causal polytrees used in our trials is CASTLE (CAusal STructures From Inductive LEarning). It has been developed at the University of Granada for the ESPRIT project StatLog (Acid et al. (1991a); Acid et al. (1991b)). See Appendix B for availability.
4.6.1 Example
We now illustrate the use of the Bayesian learning methodology in a simple model, the digit recognition in a calculator.
Digits are ordinarily displayed on electronic watches and calculators using seven horizontal and vertical lights in on–off configurations (see Figure 4.7). We number the lights as shown in Figure 4.7. We take $Z = (Cl, Z_1, Z_2, \ldots, Z_7)$ to be an eight-dimensional vector where $Cl = i$ denotes the $i$th digit, $i = 0, 1, 2, \ldots, 9$, and when fixing $Cl$ to $i$ the remaining $(Z_1, Z_2, \ldots, Z_7)$ is a seven-dimensional vector of zeros and ones with $z_m = 1$ if the light in the $m$th position is on for the $i$th digit and $z_m = 0$ otherwise.

Fig. 4.7: Digits.
We generate examples from a faulty calculator. The data consist of outcomes from the random vector $(Cl, Z_1, Z_2, \ldots, Z_7)$ where $Cl$ is the class label, the digit, and assumes the values in $\{0, 1, 2, \ldots, 9\}$ with equal probability and the $Z_1, Z_2, \ldots, Z_7$ are zero-one variables. Given the value of $Cl$, the $Z_1, Z_2, \ldots, Z_7$ are each independently equal to the value corresponding to that digit with probability 0.9 and are in error with probability 0.1. Our aim is to build up the polytree displaying the (in)dependencies in $Z$.
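A learning sample of this kind can be simulated as in the sketch below. The light numbering is an assumption (1 top, 2 upper left, 3 upper right, 4 middle, 5 lower left, 6 lower right, 7 bottom) and need not agree with Figure 4.7; the digit shapes in `SEGMENTS` and the function name `faulty_sample` are likewise illustrative.

```python
import random

# Assumed on/off pattern of lights 1..7 for each digit (numbering is an assumption):
# 1 top, 2 upper left, 3 upper right, 4 middle, 5 lower left, 6 lower right, 7 bottom.
SEGMENTS = {
    0: (1, 1, 1, 0, 1, 1, 1),
    1: (0, 0, 1, 0, 0, 1, 0),
    2: (1, 0, 1, 1, 1, 0, 1),
    3: (1, 0, 1, 1, 0, 1, 1),
    4: (0, 1, 1, 1, 0, 1, 0),
    5: (1, 1, 0, 1, 0, 1, 1),
    6: (1, 1, 0, 1, 1, 1, 1),
    7: (1, 0, 1, 0, 0, 1, 0),
    8: (1, 1, 1, 1, 1, 1, 1),
    9: (1, 1, 1, 1, 0, 1, 1),
}


def faulty_sample(n, p_correct=0.9, seed=0):
    """Draw n rows (digit, z1, ..., z7) from the faulty calculator model."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        digit = rng.randrange(10)  # class label, uniform on 0..9
        lights = tuple(
            # each light independently shows its correct state with prob. p_correct
            z if rng.random() < p_correct else 1 - z
            for z in SEGMENTS[digit]
        )
        rows.append((digit,) + lights)
    return rows


sample = faulty_sample(400)
```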
We generate four hundred samples of this distribution and use them as a learning sample. After reading in the sample, estimating the skeleton and directing the skeleton, the polytree estimated by CASTLE is the one shown in Figure 4.8. CASTLE then tells us what we had expected: $Z_i$ and $Z_j$ are conditionally independent given $Cl$, $i, j = 1, 2, \ldots, 7$.
Finally, we examine the predictive power of this polytree. The posterior probabilities of each digit given some observed patterns are shown in Figure 4.9.
Fig. 4.8: Obtained polytree.
Digit     0     1     2     3     4     5     6     7     8     9
        463     0     2     0     0     0   519     0    16     0
          0   749     0     0     0     0     0   251     0     0
          1     0   971     0     6     0     1    12     0     0
          1     0     0   280     0   699    19     2     0     0
          0    21     0     0   913     0     0     1     2    63
        290     0     0     0     0   644    51     5    10     0

Fig. 4.9: Probabilities × 1000 for some ‘digits’ (one row per observed pattern).
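To see how posterior probabilities of this kind arise, the sketch below applies Bayes' rule directly to the generating model of the example (equal priors, each light correct with probability 0.9). It is not the propagation scheme used by CASTLE, which works from the estimated polytree; the function name and the two-digit pattern table are hypothetical and use an assumed light numbering.

```python
def posterior_times_1000(observed, patterns, p_correct=0.9):
    """Posterior probability (x 1000, rounded) of each label given observed lights.

    observed: tuple of 0/1 light states
    patterns: dict mapping a label to its fault-free tuple of light states
    """
    likelihood = {}
    for label, pattern in patterns.items():
        like = 1.0
        for z, s in zip(observed, pattern):
            like *= p_correct if z == s else 1.0 - p_correct
        likelihood[label] = like  # equal priors cancel in the normalisation
    total = sum(likelihood.values())
    return {label: round(1000 * like / total) for label, like in likelihood.items()}


# Two digits that differ in a single light (assumed numbering: light 1 is the top bar).
patterns = {1: (0, 0, 1, 0, 0, 1, 0), 7: (1, 0, 1, 0, 0, 1, 0)}
result = posterior_times_1000((1, 0, 1, 0, 0, 1, 0), patterns)
```

Because the two candidate patterns differ in one light only, the posterior odds are 0.9 to 0.1 in favour of the digit whose pattern matches the observation.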
4.7 OTHER RECENT APPROACHES
The methods discussed in this section are available via anonymous ftp from statlib, internet address 128.2.241.142. A version of ACE for nonlinear discriminant analysis is available as the S coded function gdisc. MARS is available in a FORTRAN version. Since these algorithms were not formally included in the StatLog trials (for various reasons), we give only a brief introduction.
4.7.1 ACE
Nonlinear transformation of variables is a commonly used practice in regression problems. The Alternating Conditional Expectation algorithm (Breiman & Friedman, 1985) is a simple iterative scheme using only bivariate conditional expectations, which finds those transformations that produce the best fitting additive model.
Suppose we have two random variables: the response, $Y$, and the predictor, $X$, and we seek transformations $\theta(Y)$ and $f(X)$ so that $E\{\theta(Y) \mid X\} \approx f(X)$. The ACE algorithm approaches this problem by minimising the squared-error objective
$$E\{\theta(Y) - f(X)\}^2. \tag{4.8}$$
For fixed $\theta$, the minimising $f$ is $f(X) = E\{\theta(Y) \mid X\}$, and conversely, for fixed $f$ the minimising $\theta$ is $\theta(Y) = E\{f(X) \mid Y\}$. The idea is to begin with some starting functions and alternate these two steps until convergence. With multiple predictors $X_1, \ldots, X_p$
, ACE seeks to minimise
$$e^2 = E\Big\{\theta(Y) - \sum_{j=1}^{p} f_j(X_j)\Big\}^2. \tag{4.9}$$
In practice, given a dataset, estimates of the conditional expectations are constructed using an automatic smoothing procedure. In order to stop the iterates from shrinking to zero functions, which trivially minimise the squared error criterion, $\theta(Y)$ is scaled to have unit variance in each iteration. Also, without loss of generality, the condition
$$E\theta = Ef_1 = \cdots = Ef_p = 0$$
is imposed. The algorithm minimises Equation (4.9) through a series of single-function minimisations involving smoothed estimates of bivariate conditional expectations. For a given set of functions $f_1, \ldots, f_p$, minimising (4.9) with respect to $\theta(Y)$ yields a new $\theta(Y)$
$$\theta(Y) := \theta_{\mathrm{new}}(Y) = \frac{E\big[\sum_{j=1}^{p} f_j(X_j) \mid Y\big]}{\big\| E\big[\sum_{j=1}^{p} f_j(X_j) \mid Y\big] \big\|} \tag{4.10}$$
with $\|\cdot\| = [E(\cdot)^2]^{1/2}$. Next $e^2$ is minimised for each $f_i$ in turn with given $\theta(Y)$ and $f_j$, $j \neq i$, yielding the solution
$$f_i(X_i) := f_{i,\mathrm{new}}(X_i) = E\Big[\theta(Y) - \sum_{j \neq i} f_j(X_j) \,\Big|\, X_i\Big]. \tag{4.11}$$
This constitutes one iteration of the algorithm, which terminates when an iteration fails to decrease $e^2$.
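The alternating updates (4.10) and (4.11) can be sketched as follows. This is an illustration under simplifying assumptions, not the published ACE code: every variable is assumed to take a modest number of distinct values, so that averaging within each level stands in for the automatic smoother, and the names `cond_mean` and `ace` are hypothetical. Each pass updates the $f_i$ first and then $\theta$, which lets the iteration start from $f_i = 0$ and a standardised response.

```python
import numpy as np


def cond_mean(values, by):
    """Estimate E[values | by] by averaging within each distinct level of `by`."""
    out = np.empty_like(values, dtype=float)
    for level in np.unique(by):
        mask = by == level
        out[mask] = values[mask].mean()
    return out


def ace(y, xs, n_iter=50, tol=1e-10):
    """Alternating conditional expectations for discrete-valued data.

    y: 1-D array (response); xs: list of 1-D arrays (predictors).
    Returns theta(y), the list of f_i(x_i), and the final squared error e2.
    """
    theta = (y - y.mean()) / y.std()      # zero mean, unit variance start
    fs = [np.zeros_like(theta) for _ in xs]
    prev = np.inf
    e2 = np.inf
    for _ in range(n_iter):
        # (4.11): update each f_i with theta and the other f_j held fixed
        for i, x in enumerate(xs):
            partial = theta - sum(f for j, f in enumerate(fs) if j != i)
            fs[i] = cond_mean(partial, x)
        # (4.10): update theta, then rescale to unit variance
        theta = cond_mean(sum(fs), y)
        theta = theta / theta.std()
        e2 = float(np.mean((theta - sum(fs)) ** 2))
        if prev - e2 < tol:               # stop when e2 no longer decreases
            break
        prev = e2
    return theta, fs, e2


# y is an exact (non-monotone) function of x, so a perfect additive fit exists.
x = np.tile(np.array([-2, -1, 0, 1, 2]), 20)
y = (x ** 2).astype(float)
theta, fs, e2 = ace(y, [x])
```

Group averaging preserves the zero-mean condition automatically, so only the unit-variance rescaling of $\theta$ has to be applied explicitly.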
ACE places no restriction on the type of each variable. The transformation functions $\theta(Y), f_1(X_1), \ldots, f_p(X_p)$ assume values on the real line but their arguments may assume values on any set, so ordered real, ordered and unordered categorical and binary variables can all be incorporated in the same regression equation. For categorical variables, the procedure can be regarded as estimating optimal scores for each of their values.
For use in classification problems, the response is replaced by a categorical variable representing the class labels. ACE then finds the transformations that make the relationship of $\theta$ of the class variable to the $f_i(X_i)$ as linear as possible.
4.7.2 MARS
The MARS (Multivariate Adaptive Regression Spline) procedure (Friedman, 1991) is based on a generalisation of spline methods for function fitting. Consider the case of only one predictor variable, $x$. An approximating $q$th order regression spline function $\hat{f}_q(x)$ is obtained by dividing the range of $x$ values into $K+1$ disjoint regions separated by $K$ points called “knots”. The approximation takes the form of a separate $q$th degree polynomial in each region, constrained so that the function and its $q-1$ derivatives are continuous. Each $q$th degree polynomial is defined by $q+1$ parameters so there are a total of $(K+1)(q+1)$ parameters to be adjusted to best fit the data. Generally the order of the spline is taken to be low $(q \leq 3)$. Continuity requirements place $q$ constraints at each knot location making a total of $Kq$ constraints.
While regression spline fitting can be implemented by directly solving this constrained minimisation problem, it is more usual to convert the problem to an unconstrained optimisation by choosing a set of basis functions that span the space of all $q$th order spline functions (given the chosen knot locations) and performing a linear least squares fit of the response on this basis function set. In this case the approximation takes the form
$$\hat{f}_q(x) = \sum_{k=0}^{K+q} a_k B_k^{(q)}(x) \tag{4.12}$$
where the values of the expansion coefficients $\{a_k\}_0^{K+q}$ are unconstrained and the continuity constraints are intrinsically embodied in the basis functions $\{B_k^{(q)}(x)\}_0^{K+q}$. One such basis, the “truncated power basis”, is comprised of the functions
$$\{x^j\}_{j=0}^{q}, \qquad \{(x - t_k)_+^q\}_{k=1}^{K} \tag{4.13}$$
where $\{t_k\}_1^K$ are the knot locations defining the $K+1$ regions and the truncated power functions are defined
$$(x - t_k)_+^q = \begin{cases} 0 & x \leq t_k \\ (x - t_k)^q & x > t_k \end{cases} \tag{4.14}$$
The flexibility of the regression spline approach can be enhanced by incorporating an automatic knot selection strategy as part of the data fitting process. A simple and effective strategy for automatically selecting both the number and locations for the knots was described by Smith (1982), who suggested using the truncated power basis in a numerical minimisation of the least squares criterion
$$\sum_{i=1}^{N} \Big[ y_i - \sum_{j=0}^{q} b_j x_i^j - \sum_{k=1}^{K} a_k (x_i - t_k)_+^q \Big]^2 \tag{4.15}$$
Here the coefficients $\{b_j\}_0^q$, $\{a_k\}_1^K$ can be regarded as the parameters associated with a multiple linear least squares regression of the response $y$ on the “variables” $\{x^j\}_1^q$ and $\{(x - t_k)_+^q\}_1^K$. Adding or deleting a knot is viewed as adding or deleting the corresponding variable $(x - t_k)_+^q$. The strategy involves starting with a very large number of eligible knot locations $\{t_1, \ldots, t_{K_{\max}}\}$ (we may choose one at every interior data point) and considering the corresponding variables $\{(x - t_k)_+^q\}_1^{K_{\max}}$ as candidates to be selected through a statistical variable subset selection procedure. This approach to knot selection is both elegant and powerful. It automatically selects the number of knots $K$ and their locations $t_1, \ldots, t_K$, thereby estimating the global amount of smoothing to be applied as well as estimating the separate relative amount of smoothing to be applied locally at different locations.
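For a fixed set of knots, fitting a regression spline in the truncated power basis reduces to ordinary least squares on the columns of (4.13). The sketch below shows only that step; the variable subset selection that chooses the knots is not implemented, and the function names are hypothetical.

```python
import numpy as np


def truncated_power_design(x, knots, q=3):
    """Design matrix with columns x^0..x^q followed by (x - t_k)_+^q for each knot."""
    powers = [x ** j for j in range(q + 1)]
    hinges = [np.where(x > t, (x - t) ** q, 0.0) for t in knots]
    return np.column_stack(powers + hinges)


def fit_regression_spline(x, y, knots, q=3):
    """Least squares fit of y on the truncated power basis; returns (coef, fitted)."""
    design = truncated_power_design(x, knots, q)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef, design @ coef


# A piecewise linear response with a single kink at 0.5 is reproduced exactly
# by a first-order spline (q = 1) with one knot placed at the kink.
x = np.linspace(0.0, 1.0, 21)
y = np.maximum(x - 0.5, 0.0)
coef, fitted = fit_regression_spline(x, y, knots=[0.5], q=1)
```

The design matrix has $K + q + 1$ columns, which equals the $(K+1)(q+1)$ polynomial parameters minus the $Kq$ continuity constraints.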
The multivariate adaptive regression spline method (Friedman, 1991) can be viewed as a multivariate generalisation of this strategy. An approximating spline function $\hat{f}_q(\mathbf{x})$ of $n$ variables is defined analogously to that for one variable. The $n$-dimensional space $R^n$ is divided into a set of disjoint regions and within each one $\hat{f}_q(\mathbf{x})$ is taken to be a polynomial in $n$ variables with the maximum degree of any single variable being $q$. The approximation and its derivatives are constrained to be everywhere continuous. This places constraints on the approximating polynomials in separate regions along the $(n-1)$-dimensional region boundaries. As in the univariate case, $\hat{f}_q(\mathbf{x})$ is most easily constructed using a basis function set that spans the space of all $q$th order $n$-dimensional spline functions.
MARS implements a forward/backward stepwise selection strategy. The forward selection begins with only the constant basis function $B_0(\mathbf{x}) = 1$ in the model. In each iteration we consider adding two terms to the model
$$B_j\,(x - t)_+, \qquad B_j\,(t - x)_+ \tag{4.16}$$
where $B_j$ is one of the basis functions already chosen, $x$ is one of the predictor variables not represented in $B_j$ and $t$ is a knot location on that variable. The two terms of this form which cause the greatest decrease in the residual sum of squares are added to the model. The forward selection process continues until a relatively large number of basis functions is included, in a deliberate attempt to overfit the data. The backward “pruning” procedure, standard stepwise linear regression, is then applied with the basis functions representing the stock of “variables”. The best fitting model is chosen with the fit measured by a cross-validation criterion.
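A single forward selection step can be sketched as an exhaustive search over parent basis function, variable and knot. This is a naive illustration, not Friedman's implementation (which uses fast updating formulae); the function name and the bookkeeping list `used_vars` are hypothetical, and the backward pruning and cross-validation stages are not shown.

```python
import numpy as np


def forward_step(X, y, basis, used_vars):
    """Find the pair of terms (4.16) giving the smallest residual sum of squares.

    X: (N, n) data matrix; y: (N,) response
    basis: list of 1-D arrays, current basis functions evaluated on the data
    used_vars: list of sets; used_vars[j] holds the variable indices in basis[j]
    Returns (rss, parent index, variable index, knot, plus column, minus column).
    """
    best = None
    for j, parent in enumerate(basis):
        for v in range(X.shape[1]):
            if v in used_vars[j]:
                continue  # the variable must not already appear in B_j
            for t in np.unique(X[:, v])[1:-1]:  # interior data points as knots
                plus = parent * np.maximum(X[:, v] - t, 0.0)
                minus = parent * np.maximum(t - X[:, v], 0.0)
                design = np.column_stack(basis + [plus, minus])
                coef, *_ = np.linalg.lstsq(design, y, rcond=None)
                rss = float(np.sum((y - design @ coef) ** 2))
                if best is None or rss < best[0]:
                    best = (rss, j, v, t, plus, minus)
    return best


# The response depends on the first variable only, through a hinge at 0.5.
x0 = np.linspace(0.0, 1.0, 11)
x1 = (np.arange(11) * 7 % 11) / 10.0
X = np.column_stack([x0, x1])
y = 2.0 * np.maximum(x0 - 0.5, 0.0)

# Start from the constant basis function B_0 = 1, which involves no variables.
rss, j, v, t, plus, minus = forward_step(X, y, [np.ones(11)], [set()])
```

To continue the forward pass, the two returned columns would be appended to `basis` and `used_vars[j] | {v}` appended twice to `used_vars`.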
MARS is able to incorporate variables of different types: continuous, discrete and categorical.