Aplicación del procedimiento de ajuste de datos a una línea recta denominado regresión lineal.
4.2 CALIDAD Y CANTIDAD DEL AGUA 1 Calidad del agua.
One of the more useful features of R is the ability to visualize your data. Visualizing your data is important because it provides you with an intuitive
feel for your data that can help you in applying and evaluating statistical tests. It is tempting to believe that a statistical analysis is foolproof, particu- larly if the probability for incorrectly rejecting the null hypothesis is small. Looking at a visual display of your data, however, can help you determine whether your data is normally distributed—a requirement for most of the significance tests in this chapter—and can help you identify potential out- liers. There are many useful ways to look at your data, four of which we consider here.
To plot data in R will use the package “lattice,” which you will need to load using the following command.
> library(“lattice”)
To demonstrate the types of plots we can generate, we will use the object “penny,” which contains the masses of the 100 pennies in Table 4.13.
Our first display is a histogram. To construct the histogram we use mass to divide the pennies into bins and plot the number of pennies or the percent of pennies in each bin on the y-axis as a function of mass on the x-axis. Figure 4.28a shows the result of entering
> histogram(penny, type = “percent”, xlab = “Mass (g)”,
ylab = “Percent of Pennies”, main = “Histogram of Data in Table 4.13”)
Figure 4.27 Output of an R session for Dixon’s Q-test and Grubb’s test for outliers. The p-values for both tests show that we can treat as an outlier the penny with a mass of 2.514 g.
You can download the file “Penny.Rdata” from the textbook’s web site.
To create a histogram showing the num- ber of pennies in each bin, change “per- cent” to “count.”
You do not need to use the command in- stall.package this time because lattice was automatically installed on your computer when you downloaded R.
Visualizing data is important, a point we will return to in Chapter 5 when we con- sider the mathematical modeling of data. > penny3=c(3.067,3.049, 3.039, 2.514, 3.048, 3.079, 3.094, 3.109, 3.102)
> dixon.test(penny3, type=10, two.sided=TRUE) Dixon test for outliers
data: penny3
Q = 0.8824, p-value < 2.2e-16
alternative hypothesis: lowest value 2.514 is an outlier > grubbs.test(penny3, type=10, two.sided=TRUE) Grubbs test for one outlier
data: penny3
G = 2.6430, U = 0.0177, p-value = 1.938e-06
A histogram allows us to visualize the data’s distribution. In this ex- ample the data appear to follow a normal distribution, although the larg- est bin does not include the mean of 3.095 g and the distribution is not perfectly symmetric. One limitation of a histogram is that its appearance depends on how we choose to bin the data. Increasing the number of bins and centering the bins around the data’s mean gives a histogram that more closely approximates a normal distribution (Figure 4.10).
An alternative to the histogram is a kernel density plot, which is basically a smoothed histogram. In this plot each value in the data set is replaced with a normal distribution curve whose width is a function of the data set’s standard deviation and size. The resulting curve is a summation
Figure 4.28 Four different ways to plot the data in Table 4.13: (a) histogram; (b) kernel density plot showing smoothed distribution and individual data points; (c) dot chart; and (d) box plot.
Histogram of Data in Table 4.13
Mass of Pennies (g) Pe rce n t o f Pe n n ie s 0 10 20 30 3.00 3.05 3.10 3.15 3.20
Kernel Density Plot of Data in Table 4.13
Mass of Pennies (g) Dens it y 0 2 4 6 8 10 12 2.95 3.00 3.05 3.10 3.15 3.20 3.00 3.05 3.10 3.15 3.20
Dotchart of Data in Table 4.13
Mass of Pennies (g) Pe n n y N u mb e r
Boxplot of Data in Table 4.13
Mass of Pennies (g)
3.00 3.05 3.10 3.15 3.20
(a) (b)
of the individual distributions. Figure 4.28b shows the result of entering the command
> densityplot(penny, xlab = “Mass of Pennies (g)”, main = “Kernel Density Plot of Data in Table 4.13”)
The circles at the bottom of the plot show the mass of each penny in the data set. This display provides a more convincing picture that the data in
Table 4.13 are normally distributed, although we can see evidence of a small clustering of pennies with a mass of approximately 3.06 g.
We analyze samples to characterize the parent population. To reach a meaningful conclusion about a population, the samples must be represen- tative of the population. One important requirement is that the samples must be random. A dot chart provides a simple visual display that allows us look for non-random trends. Figure 4.28c shows the result of entering
> dotchart(penny, xlab = “Mass of Pennies (g)”, ylab = “Penny Number”, main = “Dotchart of Data in Table 4.13”)
In this plot the masses of the 100 pennies are arranged along the y-axis in the order of sampling. If we see a pattern in the data along the y-axis, such as a trend toward smaller masses as we move from the first penny to the last penny, then we have clear evidence of non-random sampling. Because our data do not show a pattern, we have more confidence in the quality of our data.
The last plot we will consider is a box plot, which is a useful way to identify potential outliers without making any assumptions about the data’s distribution. A box plot contains four pieces of information about a data set: the median, the middle 50% of the data, the smallest value and the largest value within a set distance of the middle 50% of the data, and pos- sible outliers. Figure 4.28d shows the result of entering
> bwplot(penny, xlab = “Mass of Pennies (g)”, main = “Boxplot of Data in Table 4.13)”
The black dot (
•
) is the data set’s median. The rectangular box shows the range of masses for the middle 50% of the pennies. This also is known as the interquartile range, or IQR. The dashed lines, which are called “whiskers,” extend to the smallest value and the largest value that are within ±1.5×IQR of the rectangular box. Potential outliers are shown as open circles (º
). For normally distributed data the median will be near the center of the box and the whiskers will be equidistant from the box. As is often true in statistics, the converse is not true—finding that a boxplot is perfectly symmetric does not prove that the data are normally distributed.The box plot in Figure 4.28d is consistent with the histogram (Figure 4.28a) and the kernel density plot (Figure 4.28b). Together, the three plots provide evidence that the data in Table 4.13 are normally distributed. The potential outlier, whose mass of 3.198 g, is not sufficiently far away from the upper whisker to be of concern, particularly as the size of the data set
To find the interquartile range you first find the median, which divides the data in half. The median of each half provides the limits for the box. The IQR is the me- dian of the upper half of the data minus the median for the lower half of the data. For the data in Table 4.13 the median is 3.098. The median for the lower half of the data is 3.068 and the median for the upper half of the data is 3.115. The IQR is 3.115 – 3.068 = 0.047. You can use the command “summary(penny)” in R to ob- tain these values.
The lower “whisker” extend to the first data point with a mass larger than
3.068 – 1.5 × IQR = 3.068 – 1.5 × 0.047
= 2.9975
which for this data is 2.998 g. The upper “whisker” extends to the last data point with a mass smaller than
3.115+1.5×IQR = 3.115 + 1.5×0.047 = 3.1855
which for this data is 3.181 g.
Note that the dispersion of points along the x-axis is not uniform, with more points occurring near the center of the x- axis than at either end. This pattern is as expected for a normal distribution.
(n = 100) is so large. A Grubb’s test on the potential outlier does not pro- vide evidence for treating it as an outlier.