Genetic characterisation of RadA/Sms and DisA

Results

4.1. Genetic characterisation of RadA/Sms and DisA

This section focuses on two-class classification. We discuss statistical discriminant analysis formulated as a linear combination of variables, in which the information in the observed data is compressed into sample mean vectors and sample variance-covariance matrices.

6.1.1 Basic Concept

Let us assume that certain evergreen trees can be broadly divided into two varieties on the basis of their leaf shape, and consider the use of width and length as two properties (variables) in constructing a formula for classification of newly obtained data known to be from one of the two varieties A and B but not from which one. We first take sample leaves for which the variety is known and measure their length ($x_1$) and width ($x_2$), resulting in the 23 two-dimensional observed data (Table 6.1).

Table 6.1 The 23 two-dimensional observed data from the varieties A and B.

            1   2   3   4   5   6   7   8   9  10  11  12
A: L (x1)   5   7   6   8   5   9   6   9   7   6   7   9
   W (x2)   5   7   7   6   6   8   6   7   5   5   8   4
B: L (x1)   6   8   7   9   7  10   8  10   9   8   7
   W (x2)   2   4   4   5   3   5   5   6   6   3   6

This dataset for the known varieties is used to construct the classification formula, and is thus called training data. Here, let us consider a linear equation

\[ y = w_1 x_1 + w_2 x_2. \tag{6.1} \]

If we were to perform the classification based solely on length $x_1$, we would take $w_1 = 1$ and $w_2 = 0$ as the variable coefficients and thus perform the classification by projecting the two-dimensional data onto the axis $y = x_1$, as shown in Figure 6.1. Similarly, if we were to perform the classification based solely on width $x_2$, we would take the coefficients as $w_1 = 0$ and $w_2 = 1$ and thus project the two-dimensional data onto the axis $y = x_2$, as also shown in Figure 6.1. To perform the classification based on the information presented by both variables, the question is then what kind of axis to use for the projection. This question can be reduced to variable weighting based on a criterion, by projecting the two-dimensional data onto $y = w_1 x_1 + w_2 x_2$ in the figure and selecting the weighting that yields the best separation of the two classes.
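The projection described above can be tried directly on the data of Table 6.1. The following is a minimal sketch, not part of the original text: the function name `project` and the mixed weighting (-1, 1) are illustrative assumptions only.

```python
# Training data from Table 6.1 as (length x1, width x2) pairs.
A = [(5, 5), (7, 7), (6, 7), (8, 6), (5, 6), (9, 8),
     (6, 6), (9, 7), (7, 5), (6, 5), (7, 8), (9, 4)]
B = [(6, 2), (8, 4), (7, 4), (9, 5), (7, 3), (10, 5),
     (8, 5), (10, 6), (9, 6), (8, 3), (7, 6)]


def project(points, w1, w2):
    """Project each two-dimensional point onto y = w1*x1 + w2*x2."""
    return [w1 * x1 + w2 * x2 for x1, x2 in points]


# Length only (w1 = 1, w2 = 0) and width only (w1 = 0, w2 = 1).
length_only_A = project(A, 1, 0)
width_only_A = project(A, 0, 1)

# A mixed weighting, chosen arbitrarily here for illustration.
mixed_A = project(A, -1, 1)
mixed_B = project(B, -1, 1)

print(length_only_A)
print(width_only_A)
print(mixed_A)
print(mixed_B)
```

Each projection reduces the two-dimensional observations to one-dimensional values, so choosing a classifier amounts to choosing the pair (w1, w2).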

Suppose that for random variables $x = (x_1, x_2)^T$ we have $n_1$ two-dimensional data $x_i^{(1)} = (x_{i1}^{(1)}, x_{i2}^{(1)})^T$ ($i = 1, 2, \ldots, n_1$) from class $G_1$ and $n_2$ two-dimensional data $x_i^{(2)} = (x_{i1}^{(2)}, x_{i2}^{(2)})^T$ ($i = 1, 2, \ldots, n_2$) from class $G_2$. In Figure 6.2 the total $n = n_1 + n_2$ training data are plotted in the $x_1$–$x_2$ plane, and we tentatively assume three projection axes and express the distribution of the data when projected on each one. In this figure we take the values $\mu_{y1}$ and $\mu_{y2}$ on axis y in (b) as representing the class $G_1$ and class $G_2$ means, respectively. Similarly, we take the values $\mu_{z1}$ and $\mu_{z2}$ on axis z in (c) as representing the class $G_1$ and class $G_2$ means, respectively.

Figure 6.1 Projecting the two-dimensional data in Table 6.1 onto the axes $y = x_1$, $y = x_2$, and $y = w_1 x_1 + w_2 x_2$.

Figure 6.2 then shows the following.

(1) Greater separation between the two class means is obtained by projection onto axis z rather than onto axis y. That is,

\[ (\mu_{y1} - \mu_{y2})^2 < (\mu_{z1} - \mu_{z2})^2. \tag{6.2} \]

In the projection of the data onto axis y, $(\mu_{y1} - \mu_{y2})^2$ can be regarded as a measure of the degree of separation of the two classes, referred to as the between-class variance or between-class variation on axis y.

(2) The variance of class $G_1$ is smaller on axis y than on axis z. The same holds true for $G_2$. To determine the degree of data dispersion within each class with projection onto axis y, we consider the sum weighted for the number of data

\[ \frac{(n_1 - 1)(\text{var. of } G_1 \text{ on } y) + (n_2 - 1)(\text{var. of } G_2 \text{ on } y)}{n_1 + n_2 - 2}, \tag{6.3} \]

referred to as the within-class variance or within-class variation on axis y. The within-class variance around the mean within each class serves as a measure of the degree of data concentration.

Figure 6.2 Three projection axes (a), (b), and (c) and the distributions of the class G1 and class G2 data when projected on each one.

A high degree of separation between the two class means generally facilitates classification. If it is attributable to a large dispersion, however, the region of overlap between the two classes will also tend to be large, with an adverse effect on the performance of the classification. The question then becomes how to determine the optimum axis with respect to these advantageous and adverse effects. One approach is to set coefficients $w_1$ and $w_2$ so as to obtain a high ratio of between-class variance to within-class variance in the projection onto axis $y = w_1 x_1 + w_2 x_2$ as follows:

\[ \lambda = \frac{\text{between-class variance}}{\text{within-class variance}}. \tag{6.4} \]

In this approach, the projection axis is selected to obtain the largest possible between-class variance in the numerator together with the smallest possible within-class variance in the denominator. For this purpose, the ratio of between-class variance to within-class variance is expressed on the basis of the training data and the optimum projection axis is determined by the maximum ratio as next described.
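To make the criterion (6.4) concrete, the ratio can be evaluated for a few candidate axes using the data of Table 6.1. This is a rough sketch under stated assumptions: the helper name `separation_ratio` and the three candidate weightings are illustrative, and the variances use the divisor n - 1 implied by the weights in (6.3).

```python
from statistics import mean, variance

# Training data from Table 6.1 as (length x1, width x2) pairs.
A = [(5, 5), (7, 7), (6, 7), (8, 6), (5, 6), (9, 8),
     (6, 6), (9, 7), (7, 5), (6, 5), (7, 8), (9, 4)]
B = [(6, 2), (8, 4), (7, 4), (9, 5), (7, 3), (10, 5),
     (8, 5), (10, 6), (9, 6), (8, 3), (7, 6)]


def separation_ratio(w):
    """Ratio (6.4) of between-class to within-class variance on y = w1*x1 + w2*x2."""
    ya = [w[0] * x1 + w[1] * x2 for x1, x2 in A]
    yb = [w[0] * x1 + w[1] * x2 for x1, x2 in B]
    # Between-class variance: squared difference of the projected class means.
    between = (mean(ya) - mean(yb)) ** 2
    # Within-class variance: weighted sum (6.3) of the two sample variances.
    within = ((len(ya) - 1) * variance(ya)
              + (len(yb) - 1) * variance(yb)) / (len(ya) + len(yb) - 2)
    return between / within


# Length only, width only, and one arbitrary mixed axis.
for w in [(1, 0), (0, 1), (-1, 1)]:
    print(w, separation_ratio(w))
```

Comparing the printed ratios shows how a weighting that uses both variables can separate the classes better than either variable alone, which is what motivates searching for the maximizing axis.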

6.1.2 Linear Discriminant Function

In the above approach, we first determine the projection axis that best separates the observed data. By projecting the i-th two-dimensional data $x_i^{(1)} = (x_{i1}^{(1)}, x_{i2}^{(1)})^T$ of class $G_1$ onto $y = w_1 x_1 + w_2 x_2$, we have

\[ y_i^{(1)} = w_1 x_{i1}^{(1)} + w_2 x_{i2}^{(1)}, \quad i = 1, 2, \ldots, n_1. \tag{6.5} \]

Similarly the projection of the i-th data $x_i^{(2)} = (x_{i1}^{(2)}, x_{i2}^{(2)})^T$ of class $G_2$ onto axis y is given by

\[ y_i^{(2)} = w_1 x_{i1}^{(2)} + w_2 x_{i2}^{(2)}, \quad i = 1, 2, \ldots, n_2. \tag{6.6} \]

In this way, by projection onto y, the two-dimensional data are reduced to one-dimensional data.

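The sample means of classes $G_1$ and $G_2$, as obtained from the one-dimensional data on y, may be given by

\[ \bar{y}^{(1)} = \frac{1}{n_1} \sum_{i=1}^{n_1} y_i^{(1)} = w^T \bar{x}_1, \qquad \bar{y}^{(2)} = \frac{1}{n_2} \sum_{i=1}^{n_2} y_i^{(2)} = w^T \bar{x}_2, \]

where $w = (w_1, w_2)^T$ and $\bar{x}_j = n_j^{-1} \sum_{i=1}^{n_j} x_i^{(j)}$ is the sample mean vector of class $G_j$ ($j = 1, 2$). Accordingly, the between-class variance defined by the formula (6.2) can be expressed as

\[ (\bar{y}^{(1)} - \bar{y}^{(2)})^2 = \{ w^T (\bar{x}_1 - \bar{x}_2) \}^2. \tag{6.8} \]

Also, when we project the data of class $G_1$ onto y, the sample variance on y is given by

\[ \frac{1}{n_1 - 1} \sum_{i=1}^{n_1} (y_i^{(1)} - \bar{y}^{(1)})^2 = \frac{1}{n_1 - 1} \sum_{i=1}^{n_1} \{ w^T (x_i^{(1)} - \bar{x}_1) \}^2 = w^T S_1 w, \]

and similarly we have the sample variance on y for the data of class $G_2$

\[ \frac{1}{n_2 - 1} \sum_{i=1}^{n_2} (y_i^{(2)} - \bar{y}^{(2)})^2 = w^T S_2 w, \]

where $S_1$ and $S_2$ are, respectively, the sample variance-covariance matrices of $G_1$ and $G_2$ given by

\[ S_j = \frac{1}{n_j - 1} \sum_{i=1}^{n_j} (x_i^{(j)} - \bar{x}_j)(x_i^{(j)} - \bar{x}_j)^T, \quad j = 1, 2. \]

Hence the within-class variance defined by the formula (6.3) can be written as

\[ \frac{(n_1 - 1) w^T S_1 w + (n_2 - 1) w^T S_2 w}{n_1 + n_2 - 2} = w^T S w, \tag{6.12} \]

\[ S = \frac{(n_1 - 1) S_1 + (n_2 - 1) S_2}{n_1 + n_2 - 2}. \tag{6.13} \]

The matrix S is called the pooled sample variance-covariance matrix.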
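The step from the projected variances to the pooled matrix S can be checked numerically. The sketch below is an illustration only: the helper names and the test axis w = (-1, 1) are assumptions. It computes S from Table 6.1 with the divisor n - 1, then compares the quadratic form w^T S w with the within-class variance obtained directly from the projected data.

```python
from statistics import variance

# Training data from Table 6.1 as (length x1, width x2) pairs.
A = [(5, 5), (7, 7), (6, 7), (8, 6), (5, 6), (9, 8),
     (6, 6), (9, 7), (7, 5), (6, 5), (7, 8), (9, 4)]
B = [(6, 2), (8, 4), (7, 4), (9, 5), (7, 3), (10, 5),
     (8, 5), (10, 6), (9, 6), (8, 3), (7, 6)]


def mean_vector(points):
    """Component-wise sample mean of a list of equal-length tuples."""
    n = len(points)
    return [sum(x[j] for x in points) / n for j in range(len(points[0]))]


def covariance_matrix(points):
    """Sample variance-covariance matrix with divisor n - 1."""
    n, m = len(points), mean_vector(points)
    p = len(m)
    return [[sum((x[j] - m[j]) * (x[k] - m[k]) for x in points) / (n - 1)
             for k in range(p)] for j in range(p)]


def pooled_covariance(g1, g2):
    """Pooled matrix S = {(n1-1)S1 + (n2-1)S2} / (n1+n2-2)."""
    n1, n2 = len(g1), len(g2)
    s1, s2 = covariance_matrix(g1), covariance_matrix(g2)
    p = len(s1)
    return [[((n1 - 1) * s1[j][k] + (n2 - 1) * s2[j][k]) / (n1 + n2 - 2)
             for k in range(p)] for j in range(p)]


def quadratic_form(w, s):
    """Evaluate w^T S w."""
    p = len(w)
    return sum(w[j] * s[j][k] * w[k] for j in range(p) for k in range(p))


w = (-1.0, 1.0)  # arbitrary axis, for illustration only
S = pooled_covariance(A, B)

# Within-class variance computed directly from the projected data, as in (6.3).
ya = [w[0] * x1 + w[1] * x2 for x1, x2 in A]
yb = [w[0] * x1 + w[1] * x2 for x1, x2 in B]
within_direct = ((len(ya) - 1) * variance(ya)
                 + (len(yb) - 1) * variance(yb)) / (len(ya) + len(yb) - 2)

print(S)
print(within_direct, quadratic_form(w, S))  # the two values should agree
```

Because the identity holds for any w, the within-class variance on every candidate axis is determined by the single matrix S, which is why the data can be compressed into mean vectors and S.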

Therefore it follows from (6.8) and (6.12) that the ratio of between-class variance to within-class variance in the projection onto axis $y = w_1 x_1 + w_2 x_2$ can be expressed as

\[ \lambda = \frac{\{ w^T (\bar{x}_1 - \bar{x}_2) \}^2}{w^T S w}. \tag{6.14} \]

It can be shown from the result (6.90) in Section 6.4 that the coefficient vector $w$ which maximizes the ratio $\lambda$ is

\[ \hat{w} = S^{-1} (\bar{x}_1 - \bar{x}_2). \tag{6.15} \]

Thus, we obtain the optimum projection axis

\[ y = \hat{w}_1 x_1 + \hat{w}_2 x_2 = \hat{w}^T x = (\bar{x}_1 - \bar{x}_2)^T S^{-1} x \tag{6.16} \]

for maximum separation of the two classes. This linear function is called Fisher’s linear discriminant function.

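An observation $x$ can be classified to one of the two classes $G_1$ and $G_2$ based on the projected point $\hat{w}^T x$. Projecting the sample mean vectors $\bar{x}_1$ and $\bar{x}_2$ onto axis y gives

\[ G_1: (\bar{x}_1 - \bar{x}_2)^T S^{-1} \bar{x}_1, \qquad G_2: (\bar{x}_1 - \bar{x}_2)^T S^{-1} \bar{x}_2. \tag{6.17} \]

Then $x$ is classified to the class whose projected sample mean ($\hat{w}^T \bar{x}_i$) is closer to $\hat{w}^T x$. This is equivalent to comparing $\hat{w}^T x$ with the midpoint

\[ \frac{1}{2} \{ (\bar{x}_1 - \bar{x}_2)^T S^{-1} \bar{x}_1 + (\bar{x}_1 - \bar{x}_2)^T S^{-1} \bar{x}_2 \} = \frac{1}{2} (\bar{x}_1 - \bar{x}_2)^T S^{-1} (\bar{x}_1 + \bar{x}_2), \]

and consequently we have the classification rule

\[ h(x) = (\bar{x}_1 - \bar{x}_2)^T S^{-1} x - \frac{1}{2} (\bar{x}_1 - \bar{x}_2)^T S^{-1} (\bar{x}_1 + \bar{x}_2) \begin{cases} \ge 0 \Rightarrow G_1 \\ < 0 \Rightarrow G_2. \end{cases} \]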

Figure 6.3 Fisher’s linear discriminant function.

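Example 6.1 (Linear discriminant function) Consider the 23 two-dimensional leaf shape data in Table 6.1 in which the measurements are taken on the length ($x_1$) and width ($x_2$) for the classes A and B. The sample mean vectors and the sample variance-covariance matrices are

\[ A: \ \bar{x}_1 = \begin{pmatrix} 7.00 \\ 6.17 \end{pmatrix}, \quad S_1 = \begin{pmatrix} 2.18 & 0.36 \\ 0.36 & 1.61 \end{pmatrix}, \qquad B: \ \bar{x}_2 = \begin{pmatrix} 8.09 \\ 4.45 \end{pmatrix}, \quad S_2 = \begin{pmatrix} 1.69 & 1.15 \\ 1.15 & 1.87 \end{pmatrix}. \]

The pooled sample variance-covariance matrix in (6.13) is then

\[ S = \begin{pmatrix} 1.95 & 0.74 \\ 0.74 & 1.73 \end{pmatrix}. \tag{6.22} \]

Therefore, the coefficient vector in (6.16) is

\[ \hat{w} = S^{-1} (\bar{x}_1 - \bar{x}_2) = \begin{pmatrix} -1.12 \\ 1.46 \end{pmatrix}. \]

Thus, we obtain the following optimum projection axis for maximum separation of the two classes

\[ y = -1.12 x_1 + 1.46 x_2, \tag{6.23} \]

and consequently we have the classification rule

\[ h(x) = -1.12 x_1 + 1.46 x_2 + 0.59 \begin{cases} \ge 0 \Rightarrow A \\ < 0 \Rightarrow B, \end{cases} \tag{6.24} \]

where $(\bar{x}_1 - \bar{x}_2)^T S^{-1} (\bar{x}_1 + \bar{x}_2)/2 = -0.59$.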
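Example 6.1 can be reproduced from Table 6.1 in a few lines. The following is a minimal sketch rather than the author's computation: all function and variable names are assumptions, and the 2x2 inverse is written out by hand to keep the block dependency-free. With the table values as listed, the coefficients agree with (6.23) to two decimals, while the constant term evaluates to roughly 0.65 rather than the printed 0.59, so the printed constant is worth rechecking.

```python
# Training data from Table 6.1 as (length x1, width x2) pairs.
A = [(5, 5), (7, 7), (6, 7), (8, 6), (5, 6), (9, 8),
     (6, 6), (9, 7), (7, 5), (6, 5), (7, 8), (9, 4)]
B = [(6, 2), (8, 4), (7, 4), (9, 5), (7, 3), (10, 5),
     (8, 5), (10, 6), (9, 6), (8, 3), (7, 6)]


def mean_vector(points):
    """Sample mean vector of a list of (x1, x2) pairs."""
    n = len(points)
    return [sum(x[j] for x in points) / n for j in range(2)]


def scatter(points, m):
    """Sum of (x - m)(x - m)^T over the points, as a 2x2 nested list."""
    return [[sum((x[j] - m[j]) * (x[k] - m[k]) for x in points)
             for k in range(2)] for j in range(2)]


m1, m2 = mean_vector(A), mean_vector(B)
sc1, sc2 = scatter(A, m1), scatter(B, m2)

# Pooled matrix (6.13): the scatters equal (n_j - 1) S_j, so divide by n1 + n2 - 2.
dof = len(A) + len(B) - 2
S = [[(sc1[j][k] + sc2[j][k]) / dof for k in range(2)] for j in range(2)]

# Explicit inverse of the 2x2 matrix S.
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
S_inv = [[S[1][1] / det, -S[0][1] / det],
         [-S[1][0] / det, S[0][0] / det]]

# Coefficient vector (6.15): w = S^{-1} (mean1 - mean2).
d = [m1[0] - m2[0], m1[1] - m2[1]]
w = [S_inv[0][0] * d[0] + S_inv[0][1] * d[1],
     S_inv[1][0] * d[0] + S_inv[1][1] * d[1]]

# Midpoint between the projected class means.
midpoint = 0.5 * (w[0] * (m1[0] + m2[0]) + w[1] * (m1[1] + m2[1]))


def h(x):
    """Discriminant value: projection of x minus the midpoint."""
    return w[0] * x[0] + w[1] * x[1] - midpoint


def classify(x):
    """Assign to A when h(x) >= 0, otherwise to B."""
    return "A" if h(x) >= 0 else "B"


print("w =", w)
print("midpoint =", midpoint)
print("training errors:",
      sum(classify(x) != "A" for x in A) + sum(classify(x) != "B" for x in B))
```

A new leaf is then classified by calling `classify((length, width))`; a long, narrow leaf such as (9, 4) falls on the B side of the boundary.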

6.1.3 Summary of Fisher’s Linear Discriminant Analysis

Suppose that we have $n_1$ p-dimensional data from class $G_1$ and $n_2$ p-dimensional data from class $G_2$, and represent the total $n = n_1 + n_2$ training data as

\[ G_1: x_1^{(1)}, x_2^{(1)}, \ldots, x_{n_1}^{(1)}, \qquad G_2: x_1^{(2)}, x_2^{(2)}, \ldots, x_{n_2}^{(2)}. \tag{6.25} \]

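Then the sample mean vectors and variance-covariance matrices for each class are given by

\[ \bar{x}_j = \frac{1}{n_j} \sum_{i=1}^{n_j} x_i^{(j)}, \qquad S_j = \frac{1}{n_j - 1} \sum_{i=1}^{n_j} (x_i^{(j)} - \bar{x}_j)(x_i^{(j)} - \bar{x}_j)^T, \quad j = 1, 2. \tag{6.26} \]

To determine the projection axis that best separates the observed data, we project the n p-dimensional data in (6.25) onto axis

\[ y = w_1 x_1 + w_2 x_2 + \cdots + w_p x_p = w^T x. \tag{6.27} \]

The ratio of between-class variance to within-class variance in the projection onto axis $y = w^T x$ can be expressed as

\[ \lambda = \frac{\{ w^T (\bar{x}_1 - \bar{x}_2) \}^2}{w^T S w} \tag{6.28} \]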

with $S$ the pooled variance-covariance matrix given by $S = (n_1 + n_2 - 2)^{-1} \{ (n_1 - 1) S_1 + (n_2 - 1) S_2 \}$. We find the p-dimensional coefficient vector such that the between-class variance is maximized relative to the within-class variance. It follows from the result (6.90) in Section 6.4 that the solution is

\[ \hat{w} = S^{-1} (\bar{x}_1 - \bar{x}_2). \tag{6.29} \]

Thus we obtain Fisher’s linear discriminant function

\[ y = \hat{w}^T x = (\bar{x}_1 - \bar{x}_2)^T S^{-1} x, \tag{6.30} \]

the optimum projection axis for maximum separation of the two classes.
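The maximization result quoted from Section 6.4 is not reproduced in this section. As a hedged sketch, one standard route to the same solution uses the Cauchy-Schwarz inequality, assuming S is positive definite so that a symmetric square root $S^{1/2}$ exists. The shorthand $d$ and this particular argument are assumptions of this note and may differ from the derivation behind (6.90).

```latex
% Shorthand (an assumption of this sketch): d = \bar{x}_1 - \bar{x}_2, S positive definite.
\begin{align*}
\lambda(w) &= \frac{(w^T d)^2}{w^T S w}
  = \frac{\{ (S^{1/2} w)^T (S^{-1/2} d) \}^2}{w^T S w} \\
&\le \frac{(w^T S w)(d^T S^{-1} d)}{w^T S w}
  = d^T S^{-1} d .
\end{align*}
% Equality in Cauchy--Schwarz holds iff S^{1/2} w \propto S^{-1/2} d, i.e. w \propto S^{-1} d,
% so \hat{w} = S^{-1} d attains the maximum value \lambda(\hat{w}) = d^T S^{-1} d.
```

The ratio is unchanged when $w$ is rescaled, so any nonzero multiple of $S^{-1} d$ defines the same projection axis; taking the multiple equal to one gives the form used in the text.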

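A future observation $x$ is classified to the class whose projected sample mean ($\hat{w}^T \bar{x}_i$) is closer to $\hat{w}^T x$.

Noting that the midpoint between the projected sample means $(\bar{x}_1 - \bar{x}_2)^T S^{-1} \bar{x}_1$ and $(\bar{x}_1 - \bar{x}_2)^T S^{-1} \bar{x}_2$ is

\[ \frac{1}{2} (\bar{x}_1 - \bar{x}_2)^T S^{-1} (\bar{x}_1 + \bar{x}_2), \]

we have the classification rule based on Fisher’s linear discriminant function in the form

\[ h(x) = (\bar{x}_1 - \bar{x}_2)^T S^{-1} x - \frac{1}{2} (\bar{x}_1 - \bar{x}_2)^T S^{-1} (\bar{x}_1 + \bar{x}_2) \begin{cases} \ge 0 \Rightarrow G_1 \\ < 0 \Rightarrow G_2. \end{cases} \]

It is also possible, as next described, to adjust the decision boundary using the concept of prior probability and loss in cases in which the classification depends on whether the value of the linear discriminant function is positive or negative.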

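Substituting $\hat{w} = S^{-1} (\bar{x}_1 - \bar{x}_2)$ into (6.28) gives the maximum value of the ratio, $\lambda = (\bar{x}_1 - \bar{x}_2)^T S^{-1} (\bar{x}_1 - \bar{x}_2)$, which is the Mahalanobis distance between the sample mean vectors $\bar{x}_1$ and $\bar{x}_2$. This distance measure is described in detail in Section 6.2.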

6.1.4 Prior Probability and Loss

Up to this point, we have considered neither the cost of the loss incurred by incorrect classification nor the data incidence, or frequency of occurrence, in the two classes. Let us consider the case of stomach ulcers and stomach cancer. We first organize the medical test data in relation to several properties and divide the patients into class ($G_1$) for those with stomach ulcers and class ($G_2$) for those with stomach cancer, and construct a linear discriminant function on this basis. Using this function, we then attempt to assign new patients to one or the other class on the basis of their medical test data. It is safe to presume that the number of stomach ulcer patients is inherently quite different from the number of stomach cancer patients. In other words, a large difference in incidence naturally exists between the two diseases. In one approach, the incidences represent a form of information acquired in advance that is incorporated into the construction of the discriminant function as a prior probability.

Let the relative incidence of stomach ulcers and stomach cancer be represented by $\pi_1$ and $\pi_2$ ($\pi_1 + \pi_2 = 1$), respectively. In ordinary linear classification, as discussed above, the assignment for future data is based on the value of the linear discriminant function with 0 as the classification point. In linear discrimination incorporating prior probability, in contrast, assignment to class $G_1$ is performed when

\[ h(x) = (\bar{x}_1 - \bar{x}_2)^T S^{-1} x - \frac{1}{2} (\bar{x}_1 - \bar{x}_2)^T S^{-1} (\bar{x}_1 + \bar{x}_2) > \log \frac{\pi_2}{\pi_1}. \]

The stomach ulcer incidence $\pi_1$ is higher than the stomach cancer incidence $\pi_2$, and accordingly the value of $\log(\pi_2/\pi_1)$ is negative. Shifting the classification point from 0 toward the class with the lower incidence (in this case, the stomach cancer class) will presumably facilitate classification-based assignment to the higher-incidence class $G_1$ (the stomach ulcer patient class). This type of classification point operation may be considered in cases such as plant variety classification in which one variety is extremely rare and the relative proportion of observation data acquired for it is inherently quite small. In such a case, incorporating this difference can be presumed to be meaningful.

The question remains, however, whether this would also be true in cases such as classification between the stomach ulcer patient class ($G_1$) and the stomach cancer patient class ($G_2$). When incidence is incorporated, it effectively raises the bar for assignment of patients to the stomach cancer class. In this regard, however, it must be noted that if a patient who actually has stomach cancer is judged to have a stomach ulcer, the loss may be irreparable and very large. If, on the other hand, a stomach ulcer is mistakenly judged to be stomach cancer, the loss will presumably be substantially smaller. Some method is therefore necessary to incorporate the concept of the cost of loss due to mistaken classification and thereby further adjust the classification point.

For this purpose, let the cost of the loss be $c(2 \mid 1)$ if a determination results in assignment to the stomach cancer class $G_2$ for a patient who actually belongs in the stomach ulcer class $G_1$, and $c(1 \mid 2)$ if the reverse occurs and assignment is made to the stomach ulcer class $G_1$ when the patient actually belongs in the stomach cancer class $G_2$. It is of course necessary to make $c(1 \mid 2)$ large relative to $c(2 \mid 1)$. On this basis, if the value of the linear discriminant function for medical test data from a new patient is found to be

\[ h(x) = (\bar{x}_1 - \bar{x}_2)^T S^{-1} x - \frac{1}{2} (\bar{x}_1 - \bar{x}_2)^T S^{-1} (\bar{x}_1 + \bar{x}_2) > \log \left[ \frac{\pi_2 \, c(1 \mid 2)}{\pi_1 \, c(2 \mid 1)} \right], \]

then the patient will be assigned to $G_1$. In cases such as this example of stomach ulcer and stomach cancer, the value of $\log [\pi_2 c(1 \mid 2) / \pi_1 c(2 \mid 1)]$ becomes positive when $c(1 \mid 2)$ is made sufficiently large, which effectively lowers the bar for assignment to the class having the lower incidence.
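The effect of incidence and cost on the classification point can be seen with a small numerical sketch. Everything in it is hypothetical: the function names, the incidences 0.9 and 0.1, and the cost ratio of 50 are invented for illustration and are not taken from any medical data.

```python
import math


def classification_point(pi1, pi2, c12=1.0, c21=1.0):
    """Threshold log[pi2 * c(1|2) / (pi1 * c(2|1))] for the discriminant value h(x).

    c12 is the cost of assigning a G2 case to G1; c21 is the cost of the reverse.
    With equal costs this reduces to log(pi2 / pi1).
    """
    return math.log((pi2 * c12) / (pi1 * c21))


def assign(h_value, pi1, pi2, c12=1.0, c21=1.0):
    """Assign to G1 when h(x) exceeds the classification point, otherwise to G2."""
    return "G1" if h_value > classification_point(pi1, pi2, c12, c21) else "G2"


# Hypothetical incidences: G1 (ulcer) common, G2 (cancer) rare.
pi1, pi2 = 0.9, 0.1

# Equal incidences and costs recover the ordinary classification point 0.
print(classification_point(0.5, 0.5))

# Incidence only: the point becomes negative, favouring assignment to G1.
print(classification_point(pi1, pi2))

# A hypothetical 50-fold cost for missing a G2 case makes the point positive.
print(classification_point(pi1, pi2, c12=50.0, c21=1.0))

# The same discriminant value can therefore be assigned differently.
print(assign(0.5, pi1, pi2), assign(0.5, pi1, pi2, c12=50.0, c21=1.0))
```

In this sketch a patient with h(x) = 0.5 is assigned to G1 when only incidence is used, but to G2 once the large cost of a missed G2 case is included, which is the adjustment the text describes.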

It is actually possible to estimate the incidence from the number of observed data: if the numbers of observations in classes $G_1$ and $G_2$ are taken as $n_1$ and $n_2$, respectively, the estimated incidence is then $\pi_1 = n_1/(n_1 + n_2)$ and $\pi_2 = n_2/(n_1 + n_2)$. This estimation, however, can be applied only if the data acquisition is random for both classes. If it is not random, then the estimate will not correctly represent the true incidence, and considerable care is required in this regard. The appropriate method for determination of the classification point varies with the problem, and it is always necessary to make this determination carefully on the basis of an appropriate variety of information obtained by effective information gathering.
