Population quantities defined by quantiles can also be estimated by the plug-in principle. Again, suppose that X1, ..., Xn ∼ P and that we observe a sample ~x = {x1, ..., xn}.
Definition 7.5 The plug-in estimate of a population quantile is the corresponding quantile of the empirical distribution. In particular, the sample median is the median of the empirical distribution. The sample interquartile range is the interquartile range of the empirical distribution.
Example 7.4 Consider the experiment of drawing a sample of size n = 20 from Uniform(1, 5). This probability distribution has a population median of 3 and a population interquartile range of 4 − 2 = 2. I simulated this experiment (and listed the sample in increasing order) with the following R command:
> x <- sort(runif(20,min=1,max=5))
This resulted in the following sample:
1.124600 1.161286 1.445538 1.828181 1.853359
1.934939 1.943951 2.107977 2.372500 2.448152
2.708874 3.297806 3.418913 3.437485 3.474940
3.698471 3.740666 4.039637 4.073617 4.195613
The sample median is
\[ \frac{2.448152 + 2.708874}{2} = 2.578513, \]
which also can be computed with the following R command:
> median(x)
[1] 2.578513
Notice that the sample median does not exactly equal the population median.
This is another example of sampling variation.
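Sampling variation can be made visible by repeating the experiment many times. The following is only a sketch: the seed, the number of repetitions (1000), and the name medians are our own choices and are not part of the example above.

```r
set.seed(1)  # arbitrary seed, used only to make the sketch reproducible

# Population median of Uniform(1, 5), from the quantile function.
pop_median <- qunif(0.5, min = 1, max = 5)

# Repeat the experiment 1000 times and record each sample median.
medians <- replicate(1000, median(runif(20, min = 1, max = 5)))

pop_median
summary(medians)  # the sample medians scatter around the population median
```

Each repetition produces a different sample median, and the spread of these values indicates how far a single sample median may plausibly lie from the population median.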
To compute the sample interquartile range, we require the first and third sample quartiles, i.e., the α = 0.25 and α = 0.75 sample quantiles.
We must now confront the fact that Definition 6.5 may not specify unique quantile values. For the empirical distribution of the sample above, any number in [1.853359, 1.934939] is a sample first quartile and any number in [3.474940, 3.698471] is a sample third quartile.
The statistical community has not agreed on a convention for resolving the ambiguity in the definition of quartiles. One natural and popular possibility is to use the central value in each interval of possible quartiles. If we adopt that convention here, then the sample interquartile range is
\[ \frac{3.474940 + 3.698471}{2} - \frac{1.853359 + 1.934939}{2} = 1.692556. \]
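The central-value convention can be checked in R. In the sketch below, the sample is retyped from the listing above, and the names q1_mid and q3_mid are ours. The use of type = 2 is our assumption about how best to reproduce this convention: the documentation of quantile() describes type 2 as averaging at discontinuities of the empirical distribution function.

```r
# The sorted sample from Example 7.4, retyped from the listing above.
x <- c(1.124600, 1.161286, 1.445538, 1.828181, 1.853359,
       1.934939, 1.943951, 2.107977, 2.372500, 2.448152,
       2.708874, 3.297806, 3.418913, 3.437485, 3.474940,
       3.698471, 3.740666, 4.039637, 4.073617, 4.195613)

# Central values of the two intervals of possible quartiles.
q1_mid <- mean(x[5:6])
q3_mid <- mean(x[15:16])
q3_mid - q1_mid

# The same quartiles requested from quantile() with type = 2.
q <- quantile(x, probs = c(.25, .75), type = 2, names = FALSE)
q[2] - q[1]
```

Both differences agree with the value 1.692556 computed by hand.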
R adopts a slightly different convention, illustrated below. The following command computes the 0.25 and 0.75 quantiles:
> quantile(x,probs=c(.25,.75))
25% 75%
1.914544 3.530823
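These values can be reproduced by hand. According to the documentation of quantile(), the default (type = 7) interpolates linearly between adjacent order statistics at position (n − 1)α + 1. The sketch below applies that rule to the four order statistics that bracket the quartiles; the variable names are ours.

```r
n <- 20

# Order statistics that bracket the quartiles, from the sorted sample above.
x5  <- 1.853359; x6  <- 1.934939
x15 <- 3.474940; x16 <- 3.698471

# Fractional positions of the quartiles among the order statistics.
h1 <- (n - 1) * 0.25 + 1   # 5.75: three quarters of the way from x5 to x6
h3 <- (n - 1) * 0.75 + 1   # 15.25: one quarter of the way from x15 to x16

# Linear interpolation between the bracketing order statistics.
q1 <- x5  + (h1 - 5)  * (x6  - x5)
q3 <- x15 + (h3 - 15) * (x16 - x15)
c(q1, q3)
```

Both interpolated values lie inside the intervals of possible quartiles, so R's convention is one particular way of resolving the ambiguity.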
The following command computes several useful sample quantities:
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.124600 1.914544 2.578513 2.715325 3.530823 4.195613
If we use the R definition of quantile, then the sample interquartile range is 3.530823 − 1.914544 = 1.616279. Rather than typing the quartiles into R, we can compute the sample interquartile range as follows:
> q <- as.vector(quantile(x,probs=c(.25,.75)))
> q[2]-q[1]
[1] 1.616279
This is sufficiently complicated that we might prefer to create a function that computes the interquartile range of a sample:
> iqr <- function(x) {
+ q <- as.vector(quantile(x,probs=c(.25,.75)))
+ return(q[2]-q[1])
+ }
> iqr(x)
[1] 1.616279
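R's stats package also supplies a built-in function, IQR(), which uses the same default quantile convention. The sketch below compares it with the function defined above on a freshly simulated sample; the seed is arbitrary and the sample is not the one listed in Example 7.4.

```r
set.seed(1)  # arbitrary seed, used only to make the sketch reproducible
x <- runif(20, min = 1, max = 5)

# The function from the text, written without console prompts.
iqr <- function(x) {
  q <- as.vector(quantile(x, probs = c(.25, .75)))
  q[2] - q[1]
}

iqr(x)  # hand-written version
IQR(x)  # built-in version from the stats package
```

Writing the function by hand remains useful when a different quartile convention is wanted.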
Notice that the sample quantities do not exactly equal the population quantities that they estimate, regardless of which convention we adopt for defining quartiles. This is another example of sampling variation.
Used judiciously, sample quantiles can be extremely useful when trying to discern various features of the population from which the sample was drawn. The remainder of this section describes two graphical techniques for assimilating and displaying sample quantile information.
7.3.1 Box Plots
Information about sample quartiles is often displayed visually, in the form of a box plot. A box plot of a sample consists of a rectangle that extends from the first to the third sample quartile, thereby drawing attention to the central 50% of the data. Thus, the length of the rectangle equals the sample interquartile range. The location of the sample median is also identified, and its location within the rectangle often provides insight into whether or not the population from which the sample was drawn is symmetric. Whiskers extend from the ends of the rectangle, either to the extreme values of the data or to 1.5 times the sample interquartile range, whichever is less. Values that lie beyond the whiskers are called outliers and are individually identified.
Figure 7.2: A box plot of a sample from χ²(3).
Example 7.5 The pdf of the asymmetric distribution χ²(3) was graphed in Figure 5.8. The following R commands draw a random sample of n = 100 observed values from this population, then construct a box plot of the sample:
> x <- rchisq(100,df=3)
> boxplot(x)
An example of a box plot produced by these commands is displayed in Figure 7.2. In this box plot, the numerical values in the sample are represented by the vertical axis.
The third quartile of the box plot in Figure 7.2 is farther above the median than the first quartile is below it. The short lower whisker extends
from the first quartile to the minimal value in the sample, whereas the long upper whisker extends 1.5 interquartile ranges beyond the third quartile.
Furthermore, there are 4 outliers beyond the upper whisker. Once we learn to discern these key features of the box plot, we can easily recognize that the population from which the sample was drawn is not symmetric.
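The features read from a box plot can also be computed numerically. The sketch below is an illustration under stated assumptions: the seed is arbitrary, so the sample differs from the one in Figure 7.2, and the names five, box_len, and the two fences are ours. As documented for boxplot.stats(), R takes the ends of the rectangle from fivenum() (the "hinges"), which may differ slightly from the quartiles returned by quantile(), and it ends each whisker at the most extreme observation that lies within 1.5 box lengths of the rectangle.

```r
set.seed(1)  # arbitrary seed; a different seed gives a different sample
x <- rchisq(100, df = 3)

five <- fivenum(x)             # min, lower hinge, median, upper hinge, max
box_len <- five[4] - five[2]   # length of the rectangle

# Limits beyond which an observation is flagged as an outlier.
lower_fence <- five[2] - 1.5 * box_len
upper_fence <- five[4] + 1.5 * box_len

outliers <- x[x < lower_fence | x > upper_fence]
whiskers <- range(x[x >= lower_fence & x <= upper_fence])

# The values that boxplot() itself uses, for comparison.
bs <- boxplot.stats(x)
bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # observations drawn individually as outliers
```

For a sample from a right-skewed population, one would typically expect the upper hinge to lie farther from the median than the lower hinge does, and any outliers to lie above the upper whisker.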
The frequency of outliers in a sample often provides useful diagnostic information. Recall that, in Section 6.3, we computed that the interquartile range of a normal distribution is 1.34898 standard deviations. A value is an outlier if it lies more than
\[ z = \frac{1.34898}{2} + 1.5 \cdot 1.34898 = 2.69796 \]
standard deviations from the mean. Hence, the probability that an observation drawn from a normal distribution is an outlier is
> 2*pnorm(-2.69796)
[1] 0.006976582
and we would expect a sample drawn from a normal distribution to contain approximately 7 outliers per 1000 observations. A sample that contains a dramatically different proportion of outliers, as in Example 7.5, is not likely to have been drawn from a normal distribution.
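The same calculation can be carried out without typing rounded constants, by letting R supply the normal quartiles. The sketch below is an illustration, not part of the original derivation. The variable names are ours, and the comparison with χ²(3) is an added check that applies the outlier rule to population quartiles rather than to sample quartiles.

```r
# Standard normal: the quartiles are -q and +q, so the interquartile range is 2q.
q <- qnorm(0.75)
iqr_norm <- 2 * q
fence <- q + 1.5 * iqr_norm   # distance from the mean, in standard deviations
p_norm <- 2 * pnorm(-fence)   # probability of lying beyond either fence

# The same rule applied to the chi-squared distribution with 3 degrees of freedom.
q1 <- qchisq(0.25, df = 3)
q3 <- qchisq(0.75, df = 3)
iqr_chisq <- q3 - q1
p_chisq <- pchisq(q1 - 1.5 * iqr_chisq, df = 3) +
  pchisq(q3 + 1.5 * iqr_chisq, df = 3, lower.tail = FALSE)

c(normal = p_norm, chisq3 = p_chisq)
```

The second probability comes out several times larger than the first, which is consistent with the 4 outliers observed among 100 values in Example 7.5.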
Box plots are especially useful for comparing several populations.
Example 7.6 We drew samples of 100 observations from three normal populations: Normal(0, 1), Normal(2, 1), and Normal(1, 4). To attempt to discern in the samples the various differences in population mean and standard deviation, we examined side-by-side box plots. This was accomplished by the following R commands:
> z1 <- rnorm(100)
> z2 <- rnorm(100,mean=2,sd=1)
> z3 <- rnorm(100,mean=1,sd=2)
> boxplot(z1,z2,z3)
An example of the output of these commands is displayed in Figure 7.3.
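The visual comparison can be supplemented with a numerical one. In the sketch below, the seed, the list name samples, the table name tab, and the axis labels passed through names are our own choices; the samples will not match those in the figure.

```r
set.seed(1)  # arbitrary seed, used only to make the sketch reproducible
z1 <- rnorm(100)
z2 <- rnorm(100, mean = 2, sd = 1)
z3 <- rnorm(100, mean = 1, sd = 2)

samples <- list(z1 = z1, z2 = z2, z3 = z3)

# Sample median and interquartile range of each sample, one column per sample.
tab <- sapply(samples, function(z) c(median = median(z), IQR = IQR(z)))
tab

# Side-by-side box plots with a label under each box.
boxplot(samples, names = c("Normal(0,1)", "Normal(2,1)", "Normal(1,4)"))
```

Differences in the medians correspond to vertical shifts of the boxes, and differences in the interquartile ranges correspond to differences in box length.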
Figure 7.3: Box plots of samples from three normal distributions.
7.3.2 Normal Probability Plots
Another powerful graphical technique that relies on quantiles is the quantile-quantile (QQ) plot, which plots the quantiles of one distribution against the quantiles of another. QQ plots are used to compare the shapes of two distributions, most commonly by plotting the observed quantiles of an empirical distribution against the corresponding quantiles of a theoretical normal distribution. In this case, a QQ plot is often called a normal probability plot. If the shape of the empirical distribution resembles a normal distribution, then the points in a normal probability plot should tend to fall on a straight line.
If they do not, then we should be skeptical that the sample was drawn from a normal distribution. Extracting useful information from normal probability plots requires some practice, but the patient data analyst will be richly rewarded.
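To see what a normal probability plot contains, it can be built by hand. The sketch below is a minimal illustration: the seed, the sample size of 50, and the names theoretical and observed are ours. It pairs the sorted sample with standard normal quantiles evaluated at the plotting positions returned by ppoints(), which is how qqnorm() is documented to choose its horizontal coordinates.

```r
set.seed(1)  # arbitrary seed, used only to make the sketch reproducible
x <- rnorm(50)
n <- length(x)

theoretical <- qnorm(ppoints(n))  # normal quantiles at n plotting positions
observed <- sort(x)               # quantiles of the empirical distribution

plot(theoretical, observed,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")

# The coordinates that qqnorm() would plot, returned without drawing.
qq <- qqnorm(x, plot.it = FALSE)
```

The points drawn by plot() are the same pairs that qqnorm() uses, only listed in sorted order.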
Example 7.5 (continued) A normal probability plot of the sample generated in Example 7.5 against a theoretical normal distribution is displayed in Figure 7.4. This plot was created using the following R command:
> qqnorm(x)
Figure 7.4: A normal probability plot of a sample from χ²(3), titled "Normal Q−Q Plot", with Theoretical Quantiles on the horizontal axis and Sample Quantiles on the vertical axis.
Notice the systematic and asymmetric bending away from linearity in this plot. In particular, the smaller quantiles are much closer to the central values than should be the case for a normal distribution. This suggests that this sample was drawn from a nonnormal distribution that is skewed to the right. Of course, we know that this sample was drawn from χ²(3), which is in fact skewed to the right.
When using normal probability plots, one must guard against overinterpreting slight departures from linearity. Remember: some departures from linearity will result from sampling variation. Consequently, before drawing definitive conclusions, the wise data analyst will generate several random samples from the theoretical distribution of interest in order to learn how much sampling variation is to be expected. Before dismissing the possibility that the sample in Example 7.5 was drawn from a normal distribution, one should generate several normal samples of the same size for comparison.
The normal probability plots of four such samples are displayed in Figure 7.5. In none of these plots did the points fall exactly on a straight line.
However, upon comparing the normal probability plot in Figure 7.4 to the normal probability plots in Figure 7.5, it is abundantly clear that the sample in Example 7.5 was not drawn from a normal distribution.
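A comparison of this kind might be produced as follows. This is a sketch under our own assumptions: the seed is arbitrary, the 2 × 2 panel layout is our choice, and the plots it produces will differ in detail from any particular set of four reference plots.

```r
set.seed(1)  # arbitrary seed, used only to make the sketch reproducible
n <- 100     # same size as the sample in Example 7.5

# Four independent samples from Normal(0, 1), kept as a list.
samples <- replicate(4, rnorm(n), simplify = FALSE)

op <- par(mfrow = c(2, 2))   # arrange the four plots in a 2 x 2 grid
for (z in samples) qqnorm(z)
par(op)                      # restore the previous graphical settings
```

Rerunning the commands with a different seed gives a sense of how much wobble around a straight line is ordinary for normal samples of this size.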
Figure 7.5: Normal probability plots of four samples from Normal(0, 1). Each panel is titled "Normal Q−Q Plot", with Theoretical Quantiles on the horizontal axis and Sample Quantiles on the vertical axis.