
Table 5.6 Explore output for income

88 Summarizing data

In 1977, John Tukey published a highly influential book entitled Exploratory Data Analysis, which sought to introduce readers to a variety of techniques he had developed which emphasize simple arithmetic computation and diagrammatic displays of data. Although the approach he advocates is antithetical to many of the techniques conventionally employed by data analysts, including the bulk of techniques examined in this book, some of Tukey’s displays can be usefully appended to more orthodox procedures. Two diagrammatic presentations of data are very relevant to the present discussion: the stem and leaf display and the boxplot (sometimes called the box and whisker plot).

The stem and leaf display

The stem and leaf display is an extremely simple means of presenting data on an interval variable in a manner similar to a histogram, but without the loss of information that a histogram necessarily entails. It can be easily constructed by hand, although this would be more difficult with very large amounts of data. In order to illustrate the stem and leaf display, data on one indicator of local authority performance are taken. For a number of years, the British government has given the Audit Commission the task of collecting data on the performance of local authorities, so that their performance can be compared. One of the criteria of performance relates to the percentage of special needs reports issued within six months. A good deal of variation could be discerned with respect to this criterion, as the author of an article in The Times noted:

If a child in Sunderland needs a report drawn up on its special educational needs, it has no chance of receiving this within six months. If the child moved a mile or two down the road into Durham, there would be an 80 per cent chance that the report would be issued in that time. (Murray 1995:32)

Whether such data really measure efficiency is, of course, a matter of whether the measure is valid (see Chapter 4), but there is no doubt that there is a great deal of variation with respect to the percentage of reports issued within six months. As Table 5.7 shows, the percentage varies between 0 and 95 per cent.

Table 5.7 Percentage of special needs reports issued within six months in local authorities in England and Wales, 1993–4

Note: *=missing or doubtful information.

Figure 5.5 provides a stem and leaf display for this variable, which we call ‘needs’. The display has two main components. First, the digits in the middle column make up the stem. These constitute the starting parts for presenting each value in a distribution. Each of the digits that form the stem represents the percentage in tens, i.e. 0 refers to single digit numbers; 1 to tens; 2 to twenties; 3 to thirties; and so on. Second, to the right of the stem are the leaves, each of which represents an item of data which is linked to the stem. Thus, the 0 to the right of the 0 refers to the lowest value in the distribution, namely the percentage figure of 0. We can see that three authorities failed to issue any reports within six months and four issued only 1 per cent of reports within six months. When we come to the row starting with 1, we can see that five managed to issue 10 per cent of reports within six months. It is important to ensure that all of the leaves (the digits to the right of the stem) are vertically aligned. It is not necessary for the leaves to be ordered in magnitude, i.e. from 0 to 9, but the display is easier to read if they are. We can see that the distribution is very bunched at the low end of the range.

The appearance of the diagram has been controlled by requesting that incremental jumps are in tens, i.e. first tens, then twenties, then thirties, and so on. The output can also be controlled by requesting that any outliers are separately positioned. Practitioners of exploratory data analysis use a specific criterion for the identification of outliers. Outliers at the low end of the range are identified by the formula:

first quartile − (1.5 × the inter-quartile range)

and at the high end of the range by the formula:

third quartile + (1.5 × the inter-quartile range).

The first quartile for ‘needs’ is 8.0 and the third quartile is 36.0, so the inter-quartile range is 28.0. Substituting in these two simple equations means that outliers will need to be below −34.0 or above 78.0. Using this criterion, four outliers (Extremes) are identified (see Figure 5.5).

To produce a stem and leaf display, we follow almost exactly the same procedure as we did for producing measures of central tendency and dispersion (see Box 5.5):

→Statistics →Summarize →Explore… [opens Explore dialog box shown in Box 5.8]

→needs →►button by Dependent List: [puts needs in Dependent List: box] →Plots in box by Display →OK

The output is in Figure 5.5. The figures in the column to the left of the starting parts represent the frequency for each. We can also see that there are missing data for two authorities.
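The layout just described (a frequency column, then the stem, then the leaves) can also be built by hand or with a few lines of code. The following is a minimal sketch in Python, not the routine SPSS uses. The function name `stem_and_leaf` and the sample values are invented for illustration and are not the Table 5.7 data. The sketch assumes whole-number values and stems that count in tens.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the rows of a stem and leaf display for whole-number values.

    Each row reads 'frequency  stem . leaves', with stems counting in tens.
    """
    leaves_by_stem = defaultdict(list)
    for value in sorted(values):              # sorting orders the leaves 0 to 9
        stem, leaf = divmod(value, 10)        # e.g. 37 -> stem 3, leaf 7
        leaves_by_stem[stem].append(str(leaf))

    rows = []
    # Include empty stems so that gaps in the distribution stay visible.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = leaves_by_stem.get(stem, [])
        rows.append(f"{len(leaves):>4}  {stem:>2} . {''.join(leaves)}")
    return rows


# Hypothetical percentages, invented for illustration (NOT from Table 5.7).
sample = [0, 0, 1, 4, 7, 10, 10, 12, 15, 23, 28, 31, 36, 45, 80]
display = stem_and_leaf(sample)
print("\n".join(display))
```

Because every leaf is kept, each original value can be read back from the display by joining a stem to one of its leaves, which is what a histogram cannot offer.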

The stem and leaf display provides a similar presentation to a histogram, in that it gives a sense of the shape of the distribution (such as whether values tend to be bunched at one end), the degree of dispersion, and whether there are outlying values. However, unlike the histogram it retains all the information, so that values can be directly examined to see whether particular ones tend to predominate.
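Whether a value counts as outlying is decided by the criterion given earlier: 1.5 times the inter-quartile range below the first quartile or above the third. As a rough check on the arithmetic, the sketch below applies that rule to the quartiles reported for ‘needs’ (8.0 and 36.0). The function name `outlier_limits` and the list of values being screened are assumptions made for this example only.

```python
def outlier_limits(first_quartile, third_quartile, multiplier=1.5):
    """Return the (lower, upper) limits beyond which a value is an outlier."""
    iqr = third_quartile - first_quartile          # inter-quartile range
    lower = first_quartile - multiplier * iqr      # low end: subtract
    upper = third_quartile + multiplier * iqr      # high end: add
    return lower, upper


# Quartiles quoted in the text for 'needs'.
lower, upper = outlier_limits(8.0, 36.0)
print(lower, upper)  # -34.0 78.0

# Hypothetical percentages (NOT from Table 5.7) screened against the limits.
values = [0, 5, 20, 36, 78, 80, 95]
outliers = [v for v in values if v < lower or v > upper]
print(outliers)  # [80, 95]
```

Since a percentage cannot fall below zero, only the upper limit can be exceeded for a variable of this kind.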

The boxplot

Figure 5.6 provides the skeletal outline of a basic boxplot. The box comprises the middle 50 per cent of observations. Thus the lower end of the box, in terms of the measure to which it refers, is the first quartile and the upper end is the third quartile. In other words, the box comprises the inter-quartile range. The line in the box is the median. The broken lines (the whiskers) extend downwards to the lowest value in the distribution and upwards to the largest value excluding outliers, i.e. extreme values, which are separately indicated.

The boxplot has a number of advantages. Like the stem and leaf display, it provides information about the shape and dispersion of a distribution. For example, is the box closer to one end or is it near the middle? The former would denote that values tend to bunch at one end. In this case, the bulk of the observations are at the lower end of the distribution, as is the median. This provides further information about the shape of the distribution, since it raises the question of whether the median is closer to one end of the box, as it is in this case. On the other hand, unlike the stem and leaf display, the boxplot does not retain the individual values.

Figure 5.7 provides a boxplot of the data from Table 5.7. The four outliers are signalled, using the previously discussed criterion. It is clear that in half the authorities (all those below the line representing the median) 20 per cent or fewer reports are issued within six months.

Figure 5.7 Boxplot for Needs (SPSS output)
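The components of a boxplot described above (the box, the median line, the whiskers and the separately marked outliers) can be computed directly. The sketch below is one way of doing so in Python, under stated assumptions. It uses the "inclusive" quartile method of the standard library's `statistics.quantiles`, which may differ slightly from the quartile convention SPSS applies. The function name `boxplot_summary` is hypothetical. The sample values are invented so that the quartiles come out at 8 and 36, and they are not the Table 5.7 data.

```python
from statistics import median, quantiles


def boxplot_summary(values, multiplier=1.5):
    """Return the quantities needed to draw a basic boxplot."""
    data = sorted(values)
    # Quartile conventions vary between packages; "inclusive" is one choice.
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - multiplier * iqr
    upper_limit = q3 + multiplier * iqr
    # Whiskers stop at the most extreme values that are not outliers.
    inside = [v for v in data if lower_limit <= v <= upper_limit]
    return {
        "q1": q1,
        "median": median(data),
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < lower_limit or v > upper_limit],
    }


# Hypothetical percentages, invented for illustration (NOT from Table 5.7).
sample = [0, 2, 8, 10, 20, 30, 36, 40, 95]
summary = boxplot_summary(sample)
print(summary)
```

In this invented sample the median (20) sits nearer the bottom of the box than the top, and the single value of 95 is set apart as an outlier rather than stretching the upper whisker.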

When a stem and leaf display is requested as above, a boxplot will also be produced and will appear in the SPSS Output Viewer. In other words, following the Explore sequence given above will generate both a stem and leaf display and a boxplot.

Both of these exploratory data analysis techniques can be recommended as providing useful first steps in gaining a feel for data when you first start to analyze them. Should they be used as alternatives to histograms and other more common diagrammatic approaches? Here they suffer from the disadvantage of not being well known. The stem and leaf display is probably the easier of the two to assimilate, since the box and whisker plot requires an understanding of quartiles and the median. If used in relation to audiences who are likely to be unfamiliar with these techniques, they may generate some discomfort even if a full explanation is provided. On the other hand, for audiences who are (or should be) familiar with these ideas, they have much to recommend them.
