In this section, the focus is two-class classification. We discuss statistical discriminant analysis formulated by linear combination of variables with compression of the information in the observed data into sample mean vectors and sample variance-covariance matrices.
6.1.1 Basic Concept
Let us assume that certain evergreen trees can be broadly divided into two varieties on the basis of their leaf shape, and consider the use of width and length as two properties (variables) in constructing a formula for classification of newly obtained data known to be from one of the two varieties A and B but not from which one. We first take sample leaves for which the variety is known and measure their length (x1) and width (x2), resulting in the 23 two-dimensional observed data (Table 6.1).

Table 6.1 The 23 two-dimensional observed data from the varieties A and B.
No.          1   2   3   4   5   6   7   8   9  10  11  12
A: L (x1)    5   7   6   8   5   9   6   9   7   6   7   9
   W (x2)    5   7   7   6   6   8   6   7   5   5   8   4
B: L (x1)    6   8   7   9   7  10   8  10   9   8   7
   W (x2)    2   4   4   5   3   5   5   6   6   3   6
This dataset for the known varieties is used to construct the classification formula, and is thus called training data. Here, let us consider a linear equation
$$ y = w_1 x_1 + w_2 x_2. \tag{6.1} $$
If we were to perform the classification based solely on length $x_1$, we would take $w_1 = 1$ and $w_2 = 0$ as the variable coefficients and thus perform the classification by projecting the two-dimensional data onto the axis $y = x_1$, as shown in Figure 6.1. Similarly, if we were to perform the classification based solely on width $x_2$, we would take the coefficients as $w_1 = 0$ and $w_2 = 1$ and thus project the two-dimensional data onto the axis $y = x_2$, as also shown in Figure 6.1. To perform the classification based on the information presented by both variables, the question is then what kind of axis to use for the projection. This question can be reduced to variable weighting based on a criterion, by projecting the two-dimensional data onto $y = w_1 x_1 + w_2 x_2$ in the figure and selecting the weighting that yields the best separation of the two classes.
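The projection in (6.1) can be illustrated with a minimal Python sketch on the Table 6.1 data. The names `A`, `B`, and `project`, and the third weight pair `(-1, 1)`, are illustrative choices and not part of the original text.

```python
# Table 6.1 training data as (length x1, width x2) pairs.
A = [(5, 5), (7, 7), (6, 7), (8, 6), (5, 6), (9, 8),
     (6, 6), (9, 7), (7, 5), (6, 5), (7, 8), (9, 4)]
B = [(6, 2), (8, 4), (7, 4), (9, 5), (7, 3), (10, 5),
     (8, 5), (10, 6), (9, 6), (8, 3), (7, 6)]

def project(data, w1, w2):
    """Project each (x1, x2) onto the axis y = w1*x1 + w2*x2, as in (6.1)."""
    return [w1 * x1 + w2 * x2 for x1, x2 in data]

# Length only, width only, and an arbitrary (assumed) mix of both variables.
for w1, w2 in [(1, 0), (0, 1), (-1, 1)]:
    print((w1, w2), project(A, w1, w2), project(B, w1, w2))
```

Each choice of weights reduces the two-dimensional data to one-dimensional values; how well the two lists of projected values separate depends on the weights.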
Suppose that for random variables $\boldsymbol{x} = (x_1, x_2)^T$ we have $n_1$ two-dimensional data $\boldsymbol{x}_i^{(1)} = (x_{i1}^{(1)}, x_{i2}^{(1)})^T$ $(i = 1, 2, \cdots, n_1)$ from class $G_1$ and $n_2$ two-dimensional data $\boldsymbol{x}_i^{(2)} = (x_{i1}^{(2)}, x_{i2}^{(2)})^T$ $(i = 1, 2, \cdots, n_2)$ from class $G_2$. In Figure 6.2 the total $n = n_1 + n_2$ training data are plotted in the $x_1$–$x_2$ plane, and we tentatively assume three projection axes and express the distribution of the data when projected on each one. In this figure we take the values $\mu_{y1}$ and $\mu_{y2}$ on axis $y$ in (b) as representing the class $G_1$ and class $G_2$ means, respectively. Similarly, we take the values $\mu_{z1}$ and $\mu_{z2}$ on axis $z$ in (c) as representing the class $G_1$ and class $G_2$ means, respectively.

Figure 6.1 Projecting the two-dimensional data in Table 6.1 onto the axes y = x1, y = x2 and y = w1x1 + w2x2.
Figure 6.2 then shows the following.
(1) Greater separation between the two class means is obtained by projection onto axis $z$ rather than onto axis $y$. That is,
$$ (\mu_{y1} - \mu_{y2})^2 < (\mu_{z1} - \mu_{z2})^2. \tag{6.2} $$
In the projection of the data onto axis $y$, $(\mu_{y1} - \mu_{y2})^2$ can be regarded as a measure of the degree of separation of the two classes, referred to as the between-class variance or between-class variation on axis $y$.
(2) The variance of class $G_1$ is smaller on axis $y$ than on axis $z$. The same holds true for $G_2$. To determine the degree of data dispersion within each class with projection onto axis $y$, we consider the sum weighted for the number of data
$$ \frac{(n_1 - 1)(\text{var. of } G_1 \text{ on } y) + (n_2 - 1)(\text{var. of } G_2 \text{ on } y)}{n_1 + n_2 - 2} \tag{6.3} $$
referred to as the within-class variance or within-class variation on axis $y$. The within-class variance around the mean within each class serves as a measure of the degree of data concentration.

Figure 6.2 Three projection axes (a), (b), and (c) and the distributions of the class G1 and class G2 data when projected on each one.
A high degree of separation between the two class means generally facilitates classification. If it is attributable to a large dispersion, however, the region of overlap between the two classes will also tend to be large, with an adverse effect on the performance of the classification. The question then becomes how to determine the optimum axis with respect to these advantageous and adverse effects. One approach is to set coefficients $w_1$ and $w_2$ so as to obtain a high ratio of between-class variance to within-class variance in the projection onto axis $y = w_1 x_1 + w_2 x_2$ as follows:
$$ \lambda = \frac{\text{between-class variance}}{\text{within-class variance}}. \tag{6.4} $$
In this approach, the projection axis is selected to obtain the largest possible between-class variance in the numerator together with the smallest possible within-class variance in the denominator. For this purpose, the ratio of between-class variance to within-class variance is expressed on the basis of the training data and the optimum projection axis is determined by the maximum ratio as next described.
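As a rough illustration of the ratio (6.4), the following Python sketch evaluates it on the Table 6.1 data for three candidate axes. The function name `variance_ratio` and the candidate weights are assumptions made for this example; the within-class variance uses the weighting in (6.3).

```python
from statistics import mean, variance  # variance() uses the n - 1 divisor

A = [(5, 5), (7, 7), (6, 7), (8, 6), (5, 6), (9, 8),
     (6, 6), (9, 7), (7, 5), (6, 5), (7, 8), (9, 4)]
B = [(6, 2), (8, 4), (7, 4), (9, 5), (7, 3), (10, 5),
     (8, 5), (10, 6), (9, 6), (8, 3), (7, 6)]

def variance_ratio(w, g1, g2):
    """Ratio (6.4) of between-class to within-class variance on y = w1*x1 + w2*x2."""
    y1 = [w[0] * x1 + w[1] * x2 for x1, x2 in g1]
    y2 = [w[0] * x1 + w[1] * x2 for x1, x2 in g2]
    between = (mean(y1) - mean(y2)) ** 2
    n1, n2 = len(y1), len(y2)
    within = ((n1 - 1) * variance(y1) + (n2 - 1) * variance(y2)) / (n1 + n2 - 2)
    return between / within

for w in [(1, 0), (0, 1), (-1, 1)]:
    print(w, round(variance_ratio(w, A, B), 2))
# prints 0.61 for (1, 0), 1.69 for (0, 1) and 3.57 for (-1, 1)
```

For these data, width alone separates the varieties better than length alone, and a suitable combination of the two does better than either. The ratio is unchanged when the weights are multiplied by a nonzero constant, so only the direction of the axis matters.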
6.1.2 Linear Discriminant Function
In the above approach, we first determine the projection axis that best separates the observed data. By projecting the $i$-th two-dimensional data $\boldsymbol{x}_i^{(1)} = (x_{i1}^{(1)}, x_{i2}^{(1)})^T$ of class $G_1$ onto $y = w_1 x_1 + w_2 x_2$, we have
$$ y_i^{(1)} = w_1 x_{i1}^{(1)} + w_2 x_{i2}^{(1)}, \quad i = 1, 2, \cdots, n_1. \tag{6.5} $$
Similarly the projection of the $i$-th data $\boldsymbol{x}_i^{(2)} = (x_{i1}^{(2)}, x_{i2}^{(2)})^T$ of class $G_2$ onto axis $y$ is given by
$$ y_i^{(2)} = w_1 x_{i1}^{(2)} + w_2 x_{i2}^{(2)}, \quad i = 1, 2, \cdots, n_2. \tag{6.6} $$
In this way, by projection onto $y$, the two-dimensional data are reduced to one-dimensional data.
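The sample means of classes $G_1$ and $G_2$, as obtained from the one-dimensional data on $y$, are given by
$$ \bar{y}^{(1)} = \frac{1}{n_1}\sum_{i=1}^{n_1} y_i^{(1)} = \boldsymbol{w}^T \bar{\boldsymbol{x}}_1, \qquad \bar{y}^{(2)} = \frac{1}{n_2}\sum_{i=1}^{n_2} y_i^{(2)} = \boldsymbol{w}^T \bar{\boldsymbol{x}}_2, \tag{6.7} $$
where $\boldsymbol{w} = (w_1, w_2)^T$ and $\bar{\boldsymbol{x}}_1$ and $\bar{\boldsymbol{x}}_2$ are the sample mean vectors of $G_1$ and $G_2$. Accordingly, the between-class variance defined by the formula (6.2) can be expressed as
$$ (\bar{y}^{(1)} - \bar{y}^{(2)})^2 = \{\boldsymbol{w}^T(\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2)\}^2. \tag{6.8} $$
Also, when we project the data of class $G_1$ onto $y$, the sample variance on $y$ is given by
$$ \frac{1}{n_1 - 1}\sum_{i=1}^{n_1} (y_i^{(1)} - \bar{y}^{(1)})^2 = \frac{1}{n_1 - 1}\sum_{i=1}^{n_1} \{\boldsymbol{w}^T(\boldsymbol{x}_i^{(1)} - \bar{\boldsymbol{x}}_1)\}^2 = \boldsymbol{w}^T S_1 \boldsymbol{w}, \tag{6.9} $$
and similarly we have the sample variance on $y$ for the data of class $G_2$
$$ \frac{1}{n_2 - 1}\sum_{i=1}^{n_2} (y_i^{(2)} - \bar{y}^{(2)})^2 = \boldsymbol{w}^T S_2 \boldsymbol{w}, \tag{6.10} $$
where $S_1$ and $S_2$ are, respectively, the sample variance-covariance matrices of $G_1$ and $G_2$ given by
$$ S_k = \frac{1}{n_k - 1}\sum_{i=1}^{n_k} (\boldsymbol{x}_i^{(k)} - \bar{\boldsymbol{x}}_k)(\boldsymbol{x}_i^{(k)} - \bar{\boldsymbol{x}}_k)^T, \quad k = 1, 2. \tag{6.11} $$
Hence the within-class variance defined by the formula (6.3) can be written as
$$ \frac{1}{n_1 + n_2 - 2}\{(n_1 - 1)\boldsymbol{w}^T S_1 \boldsymbol{w} + (n_2 - 1)\boldsymbol{w}^T S_2 \boldsymbol{w}\} = \boldsymbol{w}^T S \boldsymbol{w}, \tag{6.12} $$
where
$$ S = \frac{1}{n_1 + n_2 - 2}\{(n_1 - 1)S_1 + (n_2 - 1)S_2\}. \tag{6.13} $$
The matrix $S$ is called the pooled sample variance-covariance matrix.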
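Therefore it follows from (6.8) and (6.12) that the ratio of between-class variance to within-class variance in the projection onto axis $y = w_1 x_1 + w_2 x_2$ can be expressed as
$$ \lambda = \frac{\{\boldsymbol{w}^T(\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2)\}^2}{\boldsymbol{w}^T S \boldsymbol{w}}. \tag{6.14} $$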
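It can be shown from the result (6.90) in Section 6.4 that the coefficient vector $\boldsymbol{w}$ which maximizes the ratio $\lambda$ is
$$ \hat{\boldsymbol{w}} = S^{-1}(\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2). \tag{6.15} $$
Thus, we obtain the optimum projection axis
$$ y = \hat{w}_1 x_1 + \hat{w}_2 x_2 = \hat{\boldsymbol{w}}^T \boldsymbol{x} = (\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2)^T S^{-1} \boldsymbol{x} \tag{6.16} $$
for maximum separation of the two classes. This linear function is called Fisher’s linear discriminant function.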
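The coefficient vector in (6.15) can be computed directly for two variables. The sketch below is one possible way to do so in plain Python, using the closed-form inverse of a symmetric 2×2 matrix; the helper names `mean_vec`, `scatter`, and `fisher_direction` are hypothetical.

```python
A = [(5, 5), (7, 7), (6, 7), (8, 6), (5, 6), (9, 8),
     (6, 6), (9, 7), (7, 5), (6, 5), (7, 8), (9, 4)]
B = [(6, 2), (8, 4), (7, 4), (9, 5), (7, 3), (10, 5),
     (8, 5), (10, 6), (9, 6), (8, 3), (7, 6)]

def mean_vec(data):
    """Sample mean vector of two-dimensional data."""
    n = len(data)
    return (sum(x1 for x1, _ in data) / n, sum(x2 for _, x2 in data) / n)

def scatter(data):
    """Sums of squares and cross-products about the class mean: (s11, s12, s22)."""
    m1, m2 = mean_vec(data)
    s11 = sum((x1 - m1) ** 2 for x1, _ in data)
    s12 = sum((x1 - m1) * (x2 - m2) for x1, x2 in data)
    s22 = sum((x2 - m2) ** 2 for _, x2 in data)
    return s11, s12, s22

def fisher_direction(g1, g2):
    """Return w = S^{-1}(mean1 - mean2) with S the pooled matrix of (6.13)."""
    dof = len(g1) + len(g2) - 2
    # (n1-1)S1 + (n2-1)S2 is the sum of the two class scatter matrices.
    s11, s12, s22 = ((a + b) / dof for a, b in zip(scatter(g1), scatter(g2)))
    (a1, a2), (b1, b2) = mean_vec(g1), mean_vec(g2)
    d1, d2 = a1 - b1, a2 - b2
    det = s11 * s22 - s12 * s12
    # Closed-form inverse of a symmetric 2x2 matrix applied to (d1, d2).
    return ((s22 * d1 - s12 * d2) / det, (s11 * d2 - s12 * d1) / det)

w1, w2 = fisher_direction(A, B)
print(round(w1, 2), round(w2, 2))  # prints -1.12 1.46
```

The closed-form inverse is used only because the data here have two variables; with more variables a general linear solver would be needed.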
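An observation $\boldsymbol{x}$ can be classified to one of the two classes $G_1$ and $G_2$ based on the projected point $\hat{\boldsymbol{w}}^T \boldsymbol{x}$. Projecting the sample mean vectors $\bar{\boldsymbol{x}}_1$ and $\bar{\boldsymbol{x}}_2$ onto axis $y$ gives
$$ G_1: (\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2)^T S^{-1} \bar{\boldsymbol{x}}_1, \qquad G_2: (\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2)^T S^{-1} \bar{\boldsymbol{x}}_2. \tag{6.17} $$
Then $\boldsymbol{x}$ is classified to the class whose projected sample mean $\hat{\boldsymbol{w}}^T \bar{\boldsymbol{x}}_i$ is closer to $\hat{\boldsymbol{w}}^T \boldsymbol{x}$. This is equivalent to comparing $\hat{\boldsymbol{w}}^T \boldsymbol{x}$ with the midpoint
$$ \frac{1}{2}\{(\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2)^T S^{-1} \bar{\boldsymbol{x}}_1 + (\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2)^T S^{-1} \bar{\boldsymbol{x}}_2\} = \frac{1}{2}(\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2)^T S^{-1}(\bar{\boldsymbol{x}}_1 + \bar{\boldsymbol{x}}_2), $$
and consequently we have the classification rule
$$ h(\boldsymbol{x}) = (\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2)^T S^{-1} \boldsymbol{x} - \frac{1}{2}(\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2)^T S^{-1}(\bar{\boldsymbol{x}}_1 + \bar{\boldsymbol{x}}_2) \begin{cases} \geq 0 \Rightarrow G_1 \\ < 0 \Rightarrow G_2. \end{cases} $$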
Figure 6.3 Fisher’s linear discriminant function.
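Example 6.1 (Linear discriminant function) Consider the 23 two-dimensional leaf shape data in Table 6.1 in which the measurements are taken on the length ($x_1$) and width ($x_2$) for the classes A and B. The sample mean vectors and the sample variance-covariance matrices are
$$ A: \ \bar{\boldsymbol{x}}_1 = \begin{pmatrix} 7.00 \\ 6.17 \end{pmatrix}, \ S_1 = \begin{pmatrix} 2.18 & 0.36 \\ 0.36 & 1.61 \end{pmatrix}, \qquad B: \ \bar{\boldsymbol{x}}_2 = \begin{pmatrix} 8.09 \\ 4.45 \end{pmatrix}, \ S_2 = \begin{pmatrix} 1.69 & 1.15 \\ 1.15 & 1.87 \end{pmatrix}. $$
The pooled sample variance-covariance matrix in (6.13) is then
$$ S = \begin{pmatrix} 1.95 & 0.74 \\ 0.74 & 1.73 \end{pmatrix}. \tag{6.22} $$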
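Therefore, the coefficient vector in (6.16) is
$$ \hat{\boldsymbol{w}} = S^{-1}(\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2) = \begin{pmatrix} -1.12 \\ 1.46 \end{pmatrix}. $$
Thus, we obtain the following optimum projection axis for maximum separation of the two classes
$$ y = -1.12 x_1 + 1.46 x_2, \tag{6.23} $$
and consequently we have the classification rule
$$ h(\boldsymbol{x}) = -1.12 x_1 + 1.46 x_2 + 0.59 \begin{cases} \geq 0 \Rightarrow A \\ < 0 \Rightarrow B, \end{cases} \tag{6.24} $$
where $(\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2)^T S^{-1}(\bar{\boldsymbol{x}}_1 + \bar{\boldsymbol{x}}_2)/2 = -0.59$.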
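To see how the rule (6.24) behaves, the following sketch applies it, with the rounded coefficients as printed, back to the training data of Table 6.1 and to one new leaf. The function names and the new measurement (6, 7) are hypothetical.

```python
A = [(5, 5), (7, 7), (6, 7), (8, 6), (5, 6), (9, 8),
     (6, 6), (9, 7), (7, 5), (6, 5), (7, 8), (9, 4)]
B = [(6, 2), (8, 4), (7, 4), (9, 5), (7, 3), (10, 5),
     (8, 5), (10, 6), (9, 6), (8, 3), (7, 6)]

def h(x1, x2):
    """Classification rule (6.24) with the rounded coefficients from the text."""
    return -1.12 * x1 + 1.46 * x2 + 0.59

def classify(x1, x2):
    """Assign to variety A when h >= 0 and to variety B otherwise."""
    return "A" if h(x1, x2) >= 0 else "B"

# Training leaves that the rule places in the wrong variety.
errors_A = [x for x in A if classify(*x) != "A"]
errors_B = [x for x in B if classify(*x) != "B"]
print(errors_A, errors_B)  # prints [(9, 4)] [(7, 6)]

# A hypothetical new leaf with length 6 and width 7.
print(classify(6, 7))      # prints A
```

Counting errors on the training data itself tends to give an optimistic picture of performance on future leaves, so the two misclassified leaves should be read only as a quick check of the rule.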
6.1.3 Summary of Fisher’s Linear Discriminant Analysis
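Suppose that we have $n_1$ $p$-dimensional data from class $G_1$ and $n_2$ $p$-dimensional data from class $G_2$, and represent the total $n = n_1 + n_2$ training data as
$$ G_1: \boldsymbol{x}_1^{(1)}, \boldsymbol{x}_2^{(1)}, \cdots, \boldsymbol{x}_{n_1}^{(1)}, \qquad G_2: \boldsymbol{x}_1^{(2)}, \boldsymbol{x}_2^{(2)}, \cdots, \boldsymbol{x}_{n_2}^{(2)}. \tag{6.25} $$
Then the sample mean vectors and variance-covariance matrices for each class are given by
$$ \bar{\boldsymbol{x}}_k = \frac{1}{n_k}\sum_{i=1}^{n_k} \boldsymbol{x}_i^{(k)}, \qquad S_k = \frac{1}{n_k - 1}\sum_{i=1}^{n_k} (\boldsymbol{x}_i^{(k)} - \bar{\boldsymbol{x}}_k)(\boldsymbol{x}_i^{(k)} - \bar{\boldsymbol{x}}_k)^T, \quad k = 1, 2. \tag{6.26} $$
To determine the projection axis that best separates the observed data, we project the $n$ $p$-dimensional data in (6.25) onto axis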
$$ y = w_1 x_1 + w_2 x_2 + \cdots + w_p x_p = \boldsymbol{w}^T \boldsymbol{x}. \tag{6.27} $$
The ratio of between-class variance to within-class variance in the projection onto axis $y = \boldsymbol{w}^T \boldsymbol{x}$ can be expressed as
$$ \lambda = \frac{\{\boldsymbol{w}^T(\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2)\}^2}{\boldsymbol{w}^T S \boldsymbol{w}} \tag{6.28} $$
with the pooled sample variance-covariance matrix $S$ given by $S = (n_1 + n_2 - 2)^{-1}\{(n_1 - 1)S_1 + (n_2 - 1)S_2\}$. We find the $p$-dimensional coefficient vector such that the between-class variance is maximized relative to the within-class variance. It follows from the result (6.90) in Section 6.4 that the solution is
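$$ \hat{\boldsymbol{w}} = S^{-1}(\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2). \tag{6.29} $$
Thus we obtain Fisher’s linear discriminant function
$$ y = \hat{\boldsymbol{w}}^T \boldsymbol{x} = (\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2)^T S^{-1} \boldsymbol{x}, \tag{6.30} $$
the optimum projection axis for maximum separation of the two classes.

A future observation $\boldsymbol{x}$ is classified to the class whose projected sample mean $\hat{\boldsymbol{w}}^T \bar{\boldsymbol{x}}_i$ is closer to $\hat{\boldsymbol{w}}^T \boldsymbol{x}$.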
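Noting that the midpoint between the projected sample means $(\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2)^T S^{-1} \bar{\boldsymbol{x}}_1$ and $(\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2)^T S^{-1} \bar{\boldsymbol{x}}_2$ is
$$ \frac{1}{2}(\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2)^T S^{-1}(\bar{\boldsymbol{x}}_1 + \bar{\boldsymbol{x}}_2), $$
we have the classification rule based on Fisher’s linear discriminant function in the form
$$ h(\boldsymbol{x}) = (\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2)^T S^{-1} \boldsymbol{x} - \frac{1}{2}(\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2)^T S^{-1}(\bar{\boldsymbol{x}}_1 + \bar{\boldsymbol{x}}_2) \begin{cases} \geq 0 \Rightarrow G_1 \\ < 0 \Rightarrow G_2. \end{cases} $$
Substituting $\hat{\boldsymbol{w}} = S^{-1}(\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2)$ into the ratio $\lambda$ gives its maximum value
$$ \lambda = (\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2)^T S^{-1}(\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2), $$
which is the (squared) Mahalanobis distance between the sample mean vectors $\bar{\boldsymbol{x}}_1$ and $\bar{\boldsymbol{x}}_2$. This distance measure is described in detail in Section 6.2.

It is also possible, as next described, to adjust the decision boundary using the concept of prior probability and loss in cases in which the classification depends on whether the value of the linear discriminant function is positive or negative.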
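The summary above extends to any number of variables. The following is a minimal sketch in plain Python, assuming the pooled matrix is nonsingular; it solves the linear system by Gaussian elimination rather than forming the inverse explicitly. All function names and the small three-variable data set are hypothetical and serve only as illustration.

```python
def mean_vector(X):
    """Sample mean vector of a list of p-dimensional tuples."""
    n = len(X)
    return [sum(col) / n for col in zip(*X)]

def scatter(X, m):
    """Sum over the data of the outer products (x - m)(x - m)^T."""
    p = len(m)
    C = [[0.0] * p for _ in range(p)]
    for x in X:
        d = [xj - mj for xj, mj in zip(x, m)]
        for j in range(p):
            for k in range(p):
                C[j][k] += d[j] * d[k]
    return C

def solve(M, b):
    """Solve M z = b by Gaussian elimination with partial pivoting."""
    p = len(b)
    aug = [row[:] + [bi] for row, bi in zip(M, b)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, p):
            f = aug[r][c] / aug[c][c]
            for k in range(c, p + 1):
                aug[r][k] -= f * aug[c][k]
    z = [0.0] * p
    for r in range(p - 1, -1, -1):
        s = sum(aug[r][k] * z[k] for k in range(r + 1, p))
        z[r] = (aug[r][p] - s) / aug[r][r]
    return z

def fisher_lda(X1, X2):
    """Return the coefficient vector w and the rule h(x); h(x) >= 0 means G1."""
    n1, n2 = len(X1), len(X2)
    m1, m2 = mean_vector(X1), mean_vector(X2)
    C1, C2 = scatter(X1, m1), scatter(X2, m2)
    p = len(m1)
    # Pooled matrix: {(n1-1)S1 + (n2-1)S2}/(n1+n2-2) = (C1 + C2)/(n1+n2-2).
    S = [[(C1[j][k] + C2[j][k]) / (n1 + n2 - 2) for k in range(p)]
         for j in range(p)]
    w = solve(S, [a - b for a, b in zip(m1, m2)])
    midpoint = 0.5 * sum(wj * (a + b) for wj, a, b in zip(w, m1, m2))

    def h(x):
        return sum(wj * xj for wj, xj in zip(w, x)) - midpoint

    return w, h

# Hypothetical three-variable training data for two classes.
X1 = [(1, 2, 1), (2, 1, 3), (3, 3, 2), (2, 4, 4)]
X2 = [(5, 6, 4), (6, 5, 7), (7, 8, 5), (6, 7, 8)]
w, h = fisher_lda(X1, X2)
print(w, h((2, 2, 2)), h((6, 6, 6)))
```

By construction, the rule evaluated at the two sample mean vectors gives values of equal size and opposite sign, positive for the first class and negative for the second.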
6.1.4 Prior Probability and Loss
Up to this point, we have considered neither the cost of the loss incurred by incorrect classification nor the incidence, or frequency of occurrence, of data in the two classes. Let us consider the case of stomach ulcers and stomach cancer. We first organize the medical test data in relation to several properties and divide the patients into class ($G_1$) for those with stomach ulcers and class ($G_2$) for those with stomach cancer, and construct a linear discriminant function on this basis. Using this function, we then attempt to assign new patients to one or the other class on the basis of their medical test data. It is safe to presume that the number of stomach ulcer patients is inherently quite different from the number of stomach cancer patients. In other words, a large difference in incidence naturally exists between the two diseases. In one approach, the incidences represent a form of information acquired in advance that is incorporated into the construction of the discriminant function as a prior probability.
Let the relative incidence of stomach ulcers and stomach cancer be represented by $\pi_1$ and $\pi_2$ ($\pi_1 + \pi_2 = 1$), respectively. In ordinary linear classification, as discussed above, the assignment for future data is based on the value of the linear discriminant function with 0 as the classification point. In linear discrimination incorporating prior probability, in contrast, assignment to class $G_1$ is performed when
$$ h(\boldsymbol{x}) = (\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2)^T S^{-1} \boldsymbol{x} - \frac{1}{2}(\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2)^T S^{-1}(\bar{\boldsymbol{x}}_1 + \bar{\boldsymbol{x}}_2) > \log\frac{\pi_2}{\pi_1}. $$
The stomach ulcer incidence $\pi_1$ is higher than the stomach cancer incidence $\pi_2$, and accordingly the value of $\log(\pi_2/\pi_1)$ is negative. Shifting the classification point from 0 toward the class with the lower incidence (in this case, the stomach cancer class) will presumably facilitate classification-based assignment to the higher-incidence class $G_1$ (the stomach ulcer patient class). This type of classification point operation may be considered in cases such as plant variety classification in which one variety is extremely rare and the relative proportion of observation data acquired for it is inherently quite small. In such a case, incorporating this difference can be presumed to be meaningful.
The question remains, however, whether this would also be true in cases such as classification between the stomach ulcer patient class ($G_1$) and the stomach cancer patient class ($G_2$). When incidence is incorporated, it effectively raises the bar for assignment of patients to the stomach cancer class. In this regard, however, it must be noted that if a patient who actually has stomach cancer is judged to have a stomach ulcer, the loss may be irreparable and very large. If, on the other hand, a stomach ulcer is mistakenly judged to be stomach cancer, the loss will presumably be substantially smaller. Some method is therefore necessary to incorporate the concept of the cost of loss due to mistaken classification and thereby further adjust the classification point.
For this purpose, let the cost of the loss be $c(2 \mid 1)$ if a determination results in assignment to the stomach cancer class $G_2$ for a patient who actually belongs in the stomach ulcer class $G_1$, and $c(1 \mid 2)$ if the reverse occurs and assignment is made to the stomach ulcer class $G_1$ when the patient actually belongs in the stomach cancer class $G_2$. It is of course necessary to make $c(1 \mid 2)$ large relative to $c(2 \mid 1)$. On this basis, if the value of the linear discriminant function for medical test data from a new patient is found to be
$$ h(\boldsymbol{x}) = (\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2)^T S^{-1} \boldsymbol{x} - \frac{1}{2}(\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2)^T S^{-1}(\bar{\boldsymbol{x}}_1 + \bar{\boldsymbol{x}}_2) > \log\left\{\frac{\pi_2\, c(1 \mid 2)}{\pi_1\, c(2 \mid 1)}\right\}, $$
then the patient will be assigned to $G_1$. In cases such as this example of stomach ulcer and stomach cancer, the value of $\log\{\pi_2 c(1 \mid 2)/\pi_1 c(2 \mid 1)\}$ becomes positive by making $c(1 \mid 2)$ relatively large and effectively lowers the bar for assignment to the class having the lower incidence.
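The effect of prior probabilities and losses on the classification point can be sketched as follows. The incidences (0.9 and 0.1), the loss values, and the function names are assumed purely for illustration; they are not taken from any data in the text.

```python
import math

def classification_point(pi1, pi2, c12=1.0, c21=1.0):
    """Threshold log{pi2 * c(1|2) / (pi1 * c(2|1))} for the value of h(x).

    c12 is the loss c(1|2) of assigning a G2 case to G1;
    c21 is the loss c(2|1) of assigning a G1 case to G2.
    """
    return math.log((pi2 * c12) / (pi1 * c21))

def assign(h_value, pi1, pi2, c12=1.0, c21=1.0):
    """Assign to G1 when h(x) exceeds the classification point, else to G2."""
    return "G1" if h_value > classification_point(pi1, pi2, c12, c21) else "G2"

# Assumed incidences: G1 (ulcer) common, G2 (cancer) rare.
pi1, pi2 = 0.9, 0.1

# Equal losses: the point is negative, which favours the common class G1.
print(classification_point(pi1, pi2))

# A large assumed loss c(1|2) moves the point above 0, which favours G2.
print(classification_point(pi1, pi2, c12=20.0, c21=1.0))

# The same discriminant value can therefore be assigned differently.
print(assign(0.5, pi1, pi2), assign(0.5, pi1, pi2, c12=20.0, c21=1.0))
```

With equal incidences and equal losses the point reduces to log 1 = 0, which recovers the ordinary rule based on the sign of the discriminant function.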
It is actually possible to estimate the incidence from the number of observed data: if the numbers of observations in classes $G_1$ and $G_2$ are taken as $n_1$ and $n_2$, respectively, the estimated incidence is then $\pi_1 = n_1/(n_1 + n_2)$ and $\pi_2 = n_2/(n_1 + n_2)$. This estimation, however, can be applied only if the data acquisition is random for both classes. If it is not random, then the estimate will not correctly represent the true incidence, and considerable care is required in this regard. The appropriate method for determination of the classification point varies with the problem, and it is always necessary to make this determination carefully on the basis of an appropriate variety of information obtained by effective information gathering.