percent to discrimination between 3 groups and only 4
percent to discrimination between 4 and more groups. 97
percent of the abstracts indicated that prediction was the chief purpose of the discriminant while only 3 percent of the reported studies were concerned with analysing the
underlying structure of the data. In 84 percent no
crossvalidation of estimates was carried out. Leaving-one-out crossvalidation or separate training and test sets were used in only 16 percent of reported studies.
The following example may help to illustrate the ubiquity of discrete data in medical research. A frequent question in obstetrics, which is also of general relevance to maternal and child health services, concerns the management of premature delivery. The causes are still not fully understood. No single cause has been identified that accounts for all premature births, which strongly suggests multifactorial effects. The factors most likely to be relevant, such as demographic characteristics, psychosocial effects like social class, smoking and stress due to one-parent-family rearing, and past obstetric history, are often inherently difficult to measure precisely and are therefore often dichotomised, i.e. one is dealing with discrete data. It makes more sense to classify a woman as a smoker if she smokes heavily than to attempt fine gradings of the number of cigarettes smoked per day, as these will be confounded with reporting bias. Another typical example of discrete data is given by Cole et al. (1991), who derived a scoring system to quantify illness in babies under 6 months of age using logistic regression on four dichotomous variables.
It was already pointed out for the data in figure 2.2-1 that curvilinear separations may occasionally be required to improve separation in two dimensions. The hypothetical data plotted in figure 2.2-3 show such a scenario.
Figure 2.2-3: Curvilinear separation
Figure 2.2-3 shows the obvious discriminant rule for hypothetical bivariate discrete data in two populations as:
"Allocate any new object whose population membership is unknown to population 2 if its coordinates for X1 and X2 lie in the region above the curved (discriminant) line - or equivalently, if variables X1 and X2 jointly exceed about
1.5." The arc represents perfect separation between both populations. A curve need not be the only solution, however. To see this, assume that the discrete levels of X1 and X2 refer to low, medium and high. Next assume that the distance between the levels medium and high is greater than that between low and medium. When the data are replotted taking this into account, all points except the one at the lower left corner drift away from the coordinates (0,0). Figure 2.2-4 shows how a straight line, mathematically the simpler model, is now again sufficient for complete separation.
Figure 2.2-4: The effect of scaling
Figure 2.2-4 shows essentially the same data as figure 2.2-3, but with the scale value 2.0 of X1 and X2 shifted to about 2.8. All but one point move away from the coordinates (0,0). Now the mathematically simpler straight line is again sufficient for perfect separation. The obvious discriminant rule may now be expressed more simply in terms of a straight line: "Allocate any new object whose population membership is unknown to population 2 if its coordinates for X1 and X2 lie in the region above the straight (discriminant) line."
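The effect of the scale shift can be sketched in a few lines of Python. The grid points below are hypothetical and are not read off figures 2.2-3 or 2.2-4; they are merely chosen so that an arc of radius 1.5 separates the two groups. The function names and the restriction to lines of the form X1 + X2 = cut are assumptions made for illustration. On the original scale no straight line at all can separate these points, because the population 1 point (1, 1) is the midpoint of the population 2 points (0, 2) and (2, 0).

```python
# Hypothetical 3x3 grid of discrete levels (0 = low, 1 = medium, 2 = high).
# These points are illustrative assumptions, not the data of figure 2.2-3.
POP1 = [(0, 0), (0, 1), (1, 0), (1, 1)]
POP2 = [(0, 2), (2, 0), (1, 2), (2, 1), (2, 2)]


def above_arc(point, radius=1.5):
    """Curvilinear rule: True if the point lies outside a circle about (0, 0)."""
    x1, x2 = point
    return x1 ** 2 + x2 ** 2 > radius ** 2


def above_line(point, cut):
    """Straight-line rule: True if x1 + x2 exceeds the cut value."""
    x1, x2 = point
    return x1 + x2 > cut


def rescale(point, high=2.8):
    """Move the scale value of the level 'high' from 2.0 to `high`."""
    return tuple(high if v == 2 else v for v in point)


def line_cut(pop1, pop2):
    """Return a cut so that x1 + x2 = cut separates the groups, else None.

    Only lines of this one form are examined (an assumption of the sketch).
    """
    top1 = max(x1 + x2 for x1, x2 in pop1)
    bottom2 = min(x1 + x2 for x1, x2 in pop2)
    return (top1 + bottom2) / 2 if top1 < bottom2 else None


if __name__ == "__main__":
    print("arc separates original data:",
          all(above_arc(p) for p in POP2) and not any(above_arc(p) for p in POP1))
    print("line cut on original scale:", line_cut(POP1, POP2))
    shifted1 = [rescale(p) for p in POP1]
    shifted2 = [rescale(p) for p in POP2]
    print("line cut after scale shift:", line_cut(shifted1, shifted2))
```

With these assumed points the arc rule allocates every observation correctly on the original scale, `line_cut` finds no separating line there, and after the shift of 2.0 to 2.8 a cut between the two groups becomes available.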
The scale shift example has shown how a change of scale may lead to a more parsimonious solution of the discriminant problem. The law of parsimony states that one should generally opt for the simpler explanation when there is no obvious evidence pointing to the more complex one. The next example also demonstrates parsimony, but this time in relation to sampling. Assume again a bivariate distribution for 2
populations with some overlap such that observations with
high values on variable X2 are predominantly from
population 1 and observations with low X2 values are
predominantly from population 2. Assume further that a sample of an equal number of observations from both populations exists (figure 2.2-5).
Circles to the right of vertical grid lines refer to population 1; circles to the left of vertical grid lines refer to population 2.
Figure 2.2-5: Optimal fitting of discriminant
Figure 2.2-5 shows a hypothetical sample of bivariate discrete data. The oscillating dashed line has been drawn such that separation leads to the fewest misallocations. Given only this sample, an immediate discriminant rule may be: "Allocate any new object whose population membership is unknown to population 1 if its coordinates for X1 and X2 lie in the region above the broken line." Small circles represent 1 observation, medium ones 3 observations and large ones 5 observations. To enable distinction, observations from population 1 are displaced to the upper right of grid intersections and observations belonging to population 2 to the lower left. Separation based on this line in figure 2.2-5 would result in a minimum number of objects from population 1 being allocated to population 2 and vice versa.
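What "least misallocations for the given sample" means can be made concrete with a small sketch. The cell counts below are hypothetical (24 observations per population, in cells of 1, 3 or 5 observations) and are not taken from figure 2.2-5. The rule form "population 1 if X2 is at or above a cut", the candidate cuts and all function names are likewise assumptions. The sketch compares a broken line, with a separate cut for every level of X1, against a single horizontal straight line, counting misallocations within the sample only.

```python
# Hypothetical cell counts (1, 3 or 5 observations per cell), NOT read off
# figure 2.2-5. SAMPLE[population][(x1, x2)] = observations in that grid cell.
SAMPLE = {
    1: {(0, 3): 5, (0, 2): 3, (0, 1): 1,
        (1, 3): 3, (1, 2): 3, (1, 1): 3,
        (2, 3): 5, (2, 2): 1},
    2: {(0, 2): 1, (0, 1): 3, (0, 0): 5,
        (1, 2): 1, (1, 1): 1, (1, 0): 5,
        (2, 3): 1, (2, 2): 3, (2, 1): 3, (2, 0): 1},
}
X1_LEVELS = (0, 1, 2)
CUTS = (0, 1, 2, 3, 4)  # candidate cut values on X2 (assumed)


def column_errors(x1, cut):
    """Misallocations in column x1 under the rule 'population 1 if x2 >= cut'."""
    wrong1 = sum(n for (a, b), n in SAMPLE[1].items() if a == x1 and b < cut)
    wrong2 = sum(n for (a, b), n in SAMPLE[2].items() if a == x1 and b >= cut)
    return wrong1 + wrong2


def best_oscillating_rule():
    """Choose a separate cut for every level of X1 (a broken, oscillating line)."""
    cuts = {x1: min(CUTS, key=lambda c: column_errors(x1, c)) for x1 in X1_LEVELS}
    errors = sum(column_errors(x1, c) for x1, c in cuts.items())
    return cuts, errors


def best_straight_rule():
    """Choose one common cut for all levels of X1 (a horizontal straight line)."""
    totals = {c: sum(column_errors(x1, c) for x1 in X1_LEVELS) for c in CUTS}
    cut = min(totals, key=totals.get)
    return cut, totals[cut]


if __name__ == "__main__":
    print("oscillating rule (cuts, misallocations):", best_oscillating_rule())
    print("straight rule (cut, misallocations):", best_straight_rule())
```

For these assumed counts the broken line misallocates 6 of the 48 sampled observations and the best horizontal line 10. Both figures are in-sample counts and by themselves say nothing about how either rule behaves on further samples.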
As far as the given sample in figure 2.2-5 is concerned this line represents an optimal solution to the
discriminant problem of separating population 1 from
population 2. Next assume that further samples become