
There are many graphical ways to indicate the distribution of a numerical variable, but the two we prefer and will discuss in this subsection are histograms and box plots (also called box-whisker plots). Each of these is useful primarily for cross-sectional variables. If they are used for time series variables, the time dimension gets buried. Therefore, we will discuss time series graphs for time series variables separately in the next section.

Histograms

A histogram is the most common type of chart for showing the distribution of a numerical variable. It is based on binning the variable, that is, dividing it up into discrete categories. The histogram is then a column chart of the counts in the various categories (with no gaps between the bars). In general, a histogram is great for showing the shape of a distribution. We are particularly interested in whether the distribution is symmetric or is skewed in one direction. The concept is a simple one, as illustrated in the following example with the baseball salary data.

If you open a file with errors in StatTools outputs, close the file, load StatTools, and reopen the file.

The term distribution refers to the way the data are distributed in the various categories. It is common to refer to a skewed distribution, say, rather than a skewed histogram. However, either term can be used.

A histogram can be created with Excel tools only, but the process is quite tedious. It is much easier to use StatTools.

FUNDAMENTAL INSIGHT

Histograms Versus Summary Measures

It is important to remember that each of the summary measures we have discussed for a numerical variable (the mean, the median, the standard deviation, and others) describes only one aspect of a numerical variable. In contrast, a histogram provides the complete picture. It indicates the “center” of the distribution, the variability, the skewness, and other aspects, all in one convenient chart.

EXAMPLE 2.3 BASEBALL SALARIES (CONTINUED)

We have already mentioned that the baseball salaries are skewed to the right. How does this show up in a histogram of salaries?

Objective To see the shape of the salary distribution through a histogram.

Solution

It is possible to create a histogram with Excel tools only (no add-ins), but it is a tedious process. First, the bins must be defined. If you do it yourself, you will probably choose “nice” bins, such as $400,000 to $800,000, $800,000 to $1,200,000, and so on. But there is also the question of how many bins there should be and what their endpoints should be, and these are not always easy choices. In any case, once the bins have been selected, the number of observations in each bin must be counted. This can be done in Excel with the COUNTIF function. (You can also use the COUNTIFS and FREQUENCY functions, but we won’t discuss them here.) The resulting table of counts is usually called a frequency table. Finally, a column chart of the counts must be created. If you are interested, we have indicated the steps in the Histogram sheet of the finished version of the baseball file.
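The manual procedure just described (choose bins, count the observations in each bin, then chart the counts) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the salary values and the bin endpoints below are made up rather than taken from the baseball file, and the helper frequency_table is a hypothetical name. Each bin is assumed to include its right endpoint but not its left.

```python
# Hypothetical salaries, invented for illustration (not the baseball data).
salaries = [400_000, 400_000, 450_000, 750_000, 800_000,
            950_000, 1_200_000, 1_500_000]

# "Nice" bin endpoints chosen by hand; these are also an assumption.
edges = [0, 400_000, 800_000, 1_200_000, 1_600_000]


def frequency_table(values, edges):
    """Return (left, right, count) per bin, counting values with left < v <= right."""
    table = []
    for left, right in zip(edges, edges[1:]):
        count = sum(1 for v in values if left < v <= right)
        table.append((left, right, count))
    return table


table = frequency_table(salaries, edges)
for left, right, count in table:
    # A row of '#' characters stands in for one column of the column chart.
    print(f"${left:>9,} to ${right:>9,}: {count:2d} {'#' * count}")
```

The list of counts plays the role of the frequency table, and the text bars are a crude stand-in for the column chart. The hard part, as noted above, is not the counting but deciding how many bins to use and where to put their endpoints.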

It is much easier to create a histogram with StatTools, as we now illustrate. As with all StatTools analyses, the first step is to designate a StatTools data set, which has already been done for the salary data. To create a histogram, select the Histogram item from the Summary Graphs dropdown list to obtain the dialog box in Figure 2.18. At this point, all you really need to do is select the Salary variable and click on OK. This gives you the default bins, indicated by “auto” values. Essentially, StatTools checks your data and chooses “good” settings for the bins. The resulting histogram, along with the bin data it is based on, appears in Figure 2.19. StatTools has used 11 bins, with the endpoints indicated in columns B and C. The histogram is then a column chart (with no gaps between the bars) of the counts in column E. (These counts are also called frequencies.)

2.4 Descriptive Measures for Numerical Variables 49

Figure 2.18 StatTools Histogram Dialog Box

You could argue that the bins chosen by StatTools aren’t very “nice.” For example, the upper limit of the first bin is $3,363,636.36. If you want to fine-tune these, you can enter your own bins instead of the “auto” values in Figure 2.18. We will illustrate this in the next example, but it is largely beside the point for the main question about baseball salaries. The StatTools default histogram shows very clearly that the salaries are skewed to the right, and fine-tuning bins won’t change this primary finding. The vast majority of the players are in the lowest two categories, and the salaries of the stars account for the long tail to the right. This big picture finding is all you typically want from a histogram.
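The “not very nice” endpoints are what you get when a range is cut into equal-width bins. The sketch below divides the range from $400,000 to $33,000,000 into 11 equal bins and uses the frequencies listed in Figure 2.19. Two points are assumptions rather than documented StatTools behavior: that the automatic bins are simply equal-width over the range shown, and that the Prb. Density column is the relative frequency divided by the bin width. Both are consistent with the values in the figure.

```python
# Frequencies as listed in the Freq. column of Figure 2.19.
counts = [574, 102, 49, 43, 32, 8, 7, 2, 0, 0, 1]

# Smallest and largest bin endpoints shown in Figure 2.19.
low, high = 400_000.0, 33_000_000.0

n_bins = len(counts)
width = (high - low) / n_bins
edges = [low + i * width for i in range(n_bins + 1)]
midpoints = [(a + b) / 2 for a, b in zip(edges, edges[1:])]

total = sum(counts)
rel_freq = [c / total for c in counts]

# Assumed definition: relative frequency per unit (dollar) of bin width.
density = [r / width for r in rel_freq]

for i in range(n_bins):
    print(f"Bin #{i + 1}: {edges[i]:,.2f} to {edges[i + 1]:,.2f}  "
          f"midpoint {midpoints[i]:,.2f}  freq {counts[i]}  "
          f"rel. freq {rel_freq[i]:.4f}")
```

Because each bin is 32,600,000/11 dollars wide, no endpoint other than the first and last is a round number. Supplying your own minimum, maximum, and number of bins is how you would get rounder endpoints.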

When is it useful to fine-tune the StatTools histogram bins? One good example is when the values of the variable are integers, as illustrated next.

In many situations, you can accept the StatTools defaults for histogram bins. They generally show the big picture quite well, which is the main goal.

Histogram data for Salary / Data Set #1

Histogram  Bin Min        Bin Max        Midpoint       Freq.  Rel. Freq.  Prb. Density
Bin #1     $400000.00     $3363636.36    $1881818.18    574    0.7017      0.000000237
Bin #2     $3363636.36    $6327272.73    $4845454.55    102    0.1247      0.000000042
Bin #3     $6327272.73    $9290909.09    $7809090.91    49     0.0599      0.000000020
Bin #4     $9290909.09    $12254545.45   $10772727.27   43     0.0526      0.000000018
Bin #5     $12254545.45   $15218181.82   $13736363.64   32     0.0391      0.000000013
Bin #6     $15218181.82   $18181818.18   $16700000.00   8      0.0098      0.000000003
Bin #7     $18181818.18   $21145454.55   $19663636.36   7      0.0086      0.000000003
Bin #8     $21145454.55   $24109090.91   $22627272.73   2      0.0024      0.000000001
Bin #9     $24109090.91   $27072727.27   $25590909.09   0      0.0000      0.000000000
Bin #10    $27072727.27   $30036363.64   $28554545.45   0      0.0000      0.000000000
Bin #11    $30036363.64   $33000000.00   $31518181.82   1      0.0012      0.000000000

[Chart: Histogram of Salary / Data Set #1. Vertical axis: Frequency, 0 to 700. Horizontal axis: bin midpoints, $1881818.18 to $31518181.82.]

Figure 2.19 Histogram of Salaries

EXAMPLE 2.4 LOST OR LATE BAGGAGE AT AIRPORTS

The file Late or Lost Baggage.xlsx contains information on 456 flights into an airport. (This is not real data.) For each flight, it lists the number of bags that were either late or lost. A sample is shown in Figure 2.20. What is the most natural histogram for this data set?

Objective To fine-tune a histogram for a variable with integer counts.

Solution

From a scan of the data (sort from lowest to highest), it is apparent that all flights had from 0 to 8 late or lost bags. Therefore, the most natural histogram is one that shows the count of each possible value. If you try using the default settings in StatTools, this is not what you will get. However, if you fill in the Histogram dialog box as shown in Figure 2.21, you will get exactly what you want. The resulting histogram appears in Figure 2.22. Do you see the trick? When you request 9 bins and set the min and max to -0.5 and 8.5, StatTools divides the range from -0.5 to 8.5 into 9 equal-length bins: -0.5 to 0.5, 0.5 to 1.5, and on up to 7.5 to 8.5. Of course, each bin contains only one possible value, the integer in the middle. So you get the count of 0s, the count of 1s, and so on. As an extra benefit, StatTools always labels the horizontal axis with the midpoints of the bins, which are exactly the integers you want. (For an even nicer look, we formatted these horizontal axis values with no decimals.)
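A short sketch shows why centering the bins on the integers works. It uses only the 15 flights visible in Figure 2.20, not the full file of 456 flights, so the counts will not match Figure 2.22. The variable names are illustrative.

```python
from collections import Counter

# Bags late or lost for the first 15 flights, as listed in Figure 2.20.
bags = [0, 3, 5, 0, 2, 2, 1, 5, 1, 3, 3, 4, 5, 4, 3]

# Nine equal-length bins covering -0.5 to 8.5.
low, high, n_bins = -0.5, 8.5, 9
width = (high - low) / n_bins
edges = [low + i * width for i in range(n_bins + 1)]

# Each midpoint is the single integer that can fall in its bin.
midpoints = [(a + b) / 2 for a, b in zip(edges, edges[1:])]

# Since each bin holds exactly one integer, bin counts are counts per value.
tally = Counter(bags)
counts = [tally[int(m)] for m in midpoints]

for m, c in zip(midpoints, counts):
    print(f"{int(m)} bags: {c} flights")
```

Each bin has width 1 and an integer at its center, so the axis labels (the midpoints) and the counts line up with the values 0 through 8.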


Flight  Bags late or lost
1       0
2       3
3       5
4       0
5       2
6       2
7       1
8       5
9       1
10      3
11      3
12      4
13      5
14      4
15      3

Figure 2.20 Data on Late or Lost Baggage

Figure 2.21 Histogram Dialog Box with Desired Bins

For a quick analysis, feel free to accept StatTools’s automatic histogram options. However, don’t be afraid to experiment with these options in defining your own bins. The goal is to make the histogram as meaningful and easy to read as possible.

The point of this example is that you do have control over the histogram bins if you are not satisfied with the StatTools defaults. Just keep one technical detail in mind. If a bin extends, say, from 2.7 to 3.4, then its count is the number of observations greater than 2.7 and less than or equal to 3.4. In other words, observations equal to the right endpoint are counted, but observations equal to the left endpoint are not. (They would be counted in the previous bin.) So in this example, if we had designated the minimum and maximum as -1 and 8 in Figure 2.21, we would have gotten the same counts, because each integer would then sit at the right endpoint of its own bin. ■
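The endpoint rule can be checked with a small sketch. It assumes exactly the convention stated above (a bin counts values greater than its left endpoint and less than or equal to its right endpoint) and compares nine bins from -0.5 to 8.5 with nine bins from -1 to 8. The function names are hypothetical, and the data are again the 15 flights shown in Figure 2.20.

```python
def count_in_bin(values, left, right):
    """Count values with left < v <= right (right endpoint included)."""
    return sum(1 for v in values if left < v <= right)


def bin_counts(values, low, high, n_bins):
    """Split [low, high] into n_bins equal-length bins and count each one."""
    width = (high - low) / n_bins
    edges = [low + i * width for i in range(n_bins + 1)]
    return [count_in_bin(values, a, b) for a, b in zip(edges, edges[1:])]


# Bags late or lost for the first 15 flights, as listed in Figure 2.20.
bags = [0, 3, 5, 0, 2, 2, 1, 5, 1, 3, 3, 4, 5, 4, 3]

centered = bin_counts(bags, -0.5, 8.5, 9)  # integers at the bin midpoints
shifted = bin_counts(bags, -1, 8, 9)       # integers at the right endpoints

print(centered)
print(shifted)

# The bin from 2.7 to 3.4 counts 3.0 and 3.4 but not 2.7.
print(count_in_bin([2.7, 3.0, 3.4], 2.7, 3.4))
```

The two sets of counts agree. Only the bin midpoints differ, so the horizontal axis labels would be -0.5, 0.5, and so on in the second case instead of the integers.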

Histogram data for Bags late or lost / Data Set #1

Histogram  Bin Min  Bin Max  Bin Midpoint  Freq.  Rel. Freq.  Prb. Density
Bin #1     -0.500   0.500    0.000         16     0.0351      0.04
Bin #2     0.500    1.500    1.000         67     0.1469      0.15
Bin #3     1.500    2.500    2.000         113    0.2478      0.25
Bin #4     2.500    3.500    3.000         101    0.2215      0.22
Bin #5     3.500    4.500    4.000         77     0.1689      0.17
Bin #6     4.500    5.500    5.000         44     0.0965      0.10
Bin #7     5.500    6.500    6.000         23     0.0504      0.05
Bin #8     6.500    7.500    7.000         13     0.0285      0.03
Bin #9     7.500    8.500    8.000         2      0.0044      0.00

[Chart: Histogram of Bags late or lost / Data Set #1. Vertical axis: Frequency, 0 to 120. Horizontal axis: number of bags, 0 to 8.]

Figure 2.22 Histogram of Counts

Box Plots

A box plot (also called a box-whisker plot) is an alternative type of chart for showing the distribution of a variable. For the distribution of a single variable, a box plot is not nearly as popular as a histogram, but as you will see in the next chapter, side-by-side box plots are very popular for comparing distributions, such as salaries for men versus salaries for women. As with histograms, box plots are “big picture” charts. They show you at a glance some of the key features of a distribution. We explain how they do this in the following continuation of the baseball salary example.


EXAMPLE 2.3 BASEBALL SALARIES (CONTINUED)

A histogram of the salaries clearly indicated the skewness to the right. Does a box plot of salaries indicate the same behavior?

Objective To illustrate the features of a box plot, particularly how it indicates skewness.

Solution

This time you must rely on StatTools. There is no easy way to create a box plot with Excel tools only. Fortunately, it is easy with StatTools. Select the Box-Whisker Plot item from the Summary Graphs dropdown list and fill in the resulting dialog box as in Figure 2.23. There are no other choices to make. The box plot appears in Figure 2.24. (StatTools also lists some mysterious values below the box plot. You can ignore these, but don’t delete them. They are the basis for the box plot itself.)

Excel has no built-in box plot chart type. In this case, you must rely on StatTools.

Figure 2.23 StatTools Box-Whisker Plot Dialog Box

[Chart: Box Plot of Salary / Data Set #1. Horizontal axis: 0 to 35000000.]

Figure 2.24 Box Plot of Salaries

To help you understand the elements of a box plot, StatTools provides the generic box plot shown in Figure 2.25. (It is not drawn to scale.) You can get this by checking the Include Key Describing Plot Elements option in Figure 2.23, although you will probably want to do this only once or twice. As this generic diagram indicates, the box itself extends, left to right, from the 1st quartile to the 3rd quartile. This means that it contains the middle half of the data. The line inside the box is positioned at the median, and the x inside the box is positioned at the mean. The lines (whiskers) coming out either side of the box extend to 1.5 IQRs (interquartile ranges) from the quartiles. These generally include most of the data outside the box. More distant values, called outliers, are denoted separately with small squares. They are hollow for “mild” outliers, and solid for “extreme” outliers, as indicated in the explanation.
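The elements just listed can be computed directly, which may make the diagram easier to read. The sketch below is not a description of StatTools internals. The data are invented, the quartile method (Python's "inclusive" method) is an assumption because packages differ on how they compute quartiles, and the cutoff of 3 IQRs between “mild” and “extreme” outliers is a common convention that is assumed here rather than taken from the text.

```python
import statistics

# Hypothetical right-skewed values, invented for illustration.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 60]

# Quartile conventions vary by package; "inclusive" is an assumption here.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
mean = statistics.mean(data)

# Whisker limits: 1.5 IQRs beyond the quartiles.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Assumed convention: more than 3 IQRs from the box is "extreme".
lower_extreme = q1 - 3 * iqr
upper_extreme = q3 + 3 * iqr

mild = [x for x in data
        if lower_extreme <= x < lower_limit or upper_limit < x <= upper_extreme]
extreme = [x for x in data if x < lower_extreme or x > upper_extreme]

print(f"box from {q1} to {q3}, median {median}, mean {mean:.2f}")
print(f"whisker limits {lower_limit} and {upper_limit}")
print(f"mild outliers {mild}, extreme outliers {extreme}")
```

With these made-up values, the mean lies well above the median and all of the outliers are on the high side, the same pattern a right-skewed variable produces in a box plot.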

Figure 2.25 Elements of a Generic Box Plot

The box plot of salaries in Figure 2.24 should now make more sense. It is typical of an extremely right-skewed distribution. The mean is much larger than the median, as we explained earlier; there is virtually no whisker out of the left side of the box (because the first quartile is barely above the minimum value; remember all the players earning $400,000?); and there are many outliers to the right (the stars). In fact, many of these outliers overlap one another. You can decide whether you prefer the histogram of salaries to this box plot or vice versa, but both are clearly telling the same story.

Box plots have been around for several decades, and they are probably more popular now than ever. The implementation of box plots in StatTools is just one version of what you might see. Some packages draw box plots vertically, not horizontally. Also, some vary the height of the box to indicate some other feature of the distribution. (The height of the box is irrelevant in StatTools’s box plots.) Nevertheless, they all follow the same basic rules and provide the same basic information. ■

FUNDAMENTAL INSIGHT

Box Plots Versus Histograms

Box plots and histograms are complementary ways of displaying the distribution of a numerical variable. Although histograms are much more popular and are arguably more intuitive, box plots are still informative. Besides, side-by-side box plots are very useful for comparing two or more populations.


PROBLEMS

Level A

6. The file P02_06.xlsx lists the average time (in minutes)
