1. Motivation: Graphs in general offer an overall view that makes the
subject much easier to understand and lets numerical data be observed
and grasped quickly. This type of graph, the box-and-whisker plot, is a
relatively recent addition to the 4º ESO syllabus. In fact, this is the
first school year in which I will be teaching this type of graph.
2. Objectives: To link the concepts of median, quartiles, minimum and
maximum, which the students already handle individually but not as a
whole.
3. Timing: The planned duration is 1 hour.
4. Teaching resources: PC or digital whiteboard with Internet access.
5. Methodology: Theoretical explanation on the board. Viewing of two
short, practical videos in English, each with a clearly explained, very
didactic example. Once the students have watched them, the examples can
be reproduced on the board to check that they have been understood.
6. Theoretical/practical content (in Spanish):
Point 0: Historical note (5 minutes)
John Wilder Tukey (born 16 June 1915, died 26 July 2000) was a
statistician born in New Bedford, Massachusetts. Tukey obtained a Bachelor
of Arts in 1936 and a Master of Science in 1937, both in chemistry, at
Brown University, before moving to Princeton University, where he received
a doctorate in mathematics. During the Second World War, Tukey worked at the
Fire Control Research Office and collaborated with Samuel Wilks and William
Cochran. After the war he returned to Princeton, dividing his time between
the university and AT&T Bell Laboratories.
He introduced the box plot in his 1977 book, Exploratory Data Analysis.
He retired in 1985. Tukey died in New Brunswick, New Jersey, in 2000.
Point 1: Box-and-whisker plots (10 minutes)
This type of graph, created by John Tukey in 1977, is very useful because
it summarises a frequency distribution using 5 statistical measures: the
minimum, the first quartile, the median, the third quartile and the
maximum.
The box-and-whisker plot consists of a rectangle (the BOX) whose longer
sides show the INTERQUARTILE RANGE (IQR; RIC in Spanish). This "box" is
divided by a vertical segment that marks where the median lies and how it
relates to the first and third quartiles.
The rectangle is drawn to scale over a segment whose endpoints are the
minimum and maximum values of the distribution. The segments that extend
from the box in opposite directions, to its left and right, are called
WHISKERS.
The WHISKERS have a maximum length, so that outliers which lie apart from
the main body of the data are marked individually. Unlike other ways of
presenting data, box plots show the outliers of the variable. We will call
outliers those values that lie so far from the main body of the data that
they may well reflect extraneous causes, such as a measurement or recording
error. Removing them is not justified, since the purpose of the box plot is
to give us a better understanding of how the data are distributed.
Point 2: TUKEY's criterion for setting the ends of the WHISKERS (10 minutes)
Tukey introduces a criterion for setting the ends of the whiskers. To do
so, he computes 4 fences, two inner and two outer:
Lower inner fence = First quartile – 1.5 × IQR
Upper inner fence = Third quartile + 1.5 × IQR
Lower outer fence = First quartile – 3 × IQR
Upper outer fence = Third quartile + 3 × IQR
Recall that the IQR (interquartile range, RIC in Spanish) is the difference
between the third quartile and the first.
Among the values of the variable that lie between the two inner fences,
the smallest and the largest are the ends of the whiskers.
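The fence rule above can be sketched in a few lines of Python. This is only an illustration: the function names `tukey_fences` and `whisker_ends` are hypothetical, and the quartiles are assumed to have been computed beforehand by whatever rule the class uses.

```python
def tukey_fences(q1, q3):
    """Return (inner_low, inner_high, outer_low, outer_high) for given quartiles."""
    iqr = q3 - q1  # interquartile range (RIC)
    return (q1 - 1.5 * iqr, q3 + 1.5 * iqr, q1 - 3 * iqr, q3 + 3 * iqr)


def whisker_ends(values, q1, q3):
    """Smallest and largest values lying between the two inner fences."""
    inner_low, inner_high, _, _ = tukey_fences(q1, q3)
    inside = [v for v in values if inner_low <= v <= inner_high]
    return min(inside), max(inside)


# Made-up sample; the quartiles 3 and 7 are assumed as given.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 30]
fences = tukey_fences(3, 7)
ends = whisker_ends(sample, 3, 7)
```

In this made-up sample the value 30 lies beyond the upper inner fence, so the right whisker stops at 8 and 30 would be drawn as an individual point.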
Point 3: INFORMATION ABOUT SYMMETRY (5 minutes)
This type of graph also tells us about the symmetry or asymmetry of the
distribution. The following criteria are used:
• if the median is at or near the centre of the box, this suggests that
the data are symmetric
• if the median is considerably closer to the first quartile, the data
are positively skewed
• if it is closer to the third quartile, the data are negatively skewed
• likewise, the relative length of the whiskers can be used as an
indication of skewness.
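As a rough illustration of the first three criteria, the position of the median inside the box can be turned into a simple check. The function name `skew_hint` and the `tolerance` parameter are assumptions made for this sketch; the text does not say how close to the centre counts as "near", so 10% of the box width is an arbitrary choice.

```python
def skew_hint(q1, median, q3, tolerance=0.1):
    """Classify the box by where the median sits between Q1 and Q3.

    tolerance is a hypothetical cut-off: the fraction of the box width
    within which the median is treated as being near the centre.
    """
    width = q3 - q1
    if width == 0:
        return "undetermined"
    offset = (median - q1) / width - 0.5  # 0 when the median is centred
    if abs(offset) <= tolerance:
        return "roughly symmetric"
    # Median nearer Q1 -> positive skew; nearer Q3 -> negative skew.
    return "positively skewed" if offset < 0 else "negatively skewed"


centred = skew_hint(2, 5, 8)
near_q1 = skew_hint(2, 3, 8)
near_q3 = skew_hint(2, 7, 8)
```

This only looks at the box; as the last criterion notes, the whisker lengths should be checked as well.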
Point 4: WATCHING THE VIDEOS (in English) (10 minutes)
http://www.youtube.com/watch?v=Fhk5lDGpivo
(duration: approx. 2 min)
http://www.youtube.com/watch?v=GMb6HaLXmjY
(duration: approx. 6 min)
Point 5: BASIC QUESTIONS (5 minutes)
Once the plot has been drawn, what kind of questions should we ask to
understand it better?
Some questions could be the following:
• What percentage of the data is represented by the box?
• What percentage does each whisker represent?
• Can one whisker be longer than the other? What does that mean?
• Is the median always in the centre of the box?
Point 6: PRACTICAL EXERCISE (15 minutes)
Hildebrand (1997) proposes the following problem, which shows how the
inner and outer fences work:
-24.6, -2.6, 2.4, 2.7, 3.8, 5.6, 5.9, 6.7, 7.0, 7.2, 7.5, 8.0, 8.2, 8.5,
8.6, 8.8, 9.0, 9.2, 9.7, 10.0, 20.5
Draw a box plot for these data, marking the outliers.
Solution
From the data we obtain:
Median: 7.5
First quartile (Q1): 5.6
Third quartile (Q3): 8.8
IQR: 3.2
The fences are:
Lower outer fence = Q1 - 3.0 × IQR = 5.6 - 3.0 × 3.2 = -4.0
Upper outer fence = Q3 + 3.0 × IQR = 8.8 + 3.0 × 3.2 = 18.4
Lower inner fence = Q1 - 1.5 × IQR = 5.6 - 1.5 × 3.2 = 0.8
Upper inner fence = Q3 + 1.5 × IQR = 8.8 + 1.5 × 3.2 = 13.6
The fence test identifies two major outliers, -24.6 and 20.5, and one
possible outlier, -2.6. (A plot of the data shows that the major outliers
are clearly extreme values and that the doubtful value is possibly
excluded.)
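The solution can be checked with a short script. It is a sketch under two assumptions: the quartiles are taken directly from the solution rather than recomputed, and the second data value is read as -2.6, which is the value the solution itself refers to. The variable names are illustrative only.

```python
# Data from the exercise (second value assumed to be -2.6, as in the solution).
data = [-24.6, -2.6, 2.4, 2.7, 3.8, 5.6, 5.9, 6.7, 7.0, 7.2, 7.5,
        8.0, 8.2, 8.5, 8.6, 8.8, 9.0, 9.2, 9.7, 10.0, 20.5]

q1, q3 = 5.6, 8.8          # quartiles as stated in the solution
iqr = q3 - q1              # interquartile range (RIC)

inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # about (0.8, 13.6)
outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)   # about (-4.0, 18.4)

# Beyond the outer fences: major outliers.
extreme = [v for v in data if v < outer[0] or v > outer[1]]
# Between inner and outer fences: possible outliers.
possible = [v for v in data
            if v not in extreme and (v < inner[0] or v > inner[1])]
# Whiskers end at the smallest and largest values inside the inner fences.
inside = [v for v in data if inner[0] <= v <= inner[1]]
whiskers = (min(inside), max(inside))
```

With these inputs the whiskers would run from 2.4 to 10.0, and the three flagged values would be drawn as separate points.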
Point 7: NOTES IN ENGLISH
I have selected two web pages that seemed the most suitable to me: the
first for its brevity, and the second because it is more complete and
includes exercises.
1. Box and Whisker Diagrams
(http://www.mathsrevision.net/alevel/pages.php?page=50)
Given some data, we can draw a box and whisker diagram (or box plot) to
show the spread of the data. The diagram shows the quartiles of the data,
using these as an indication of the spread.
The diagram is made up of a "box", which lies between the upper and lower
quartiles. The median can also be indicated by dividing the box into two.
The "whiskers" are straight lines extending from the ends of the box to the
maximum and minimum values.
Outliers
Skewness
If the whisker to the right of the box is longer than the one to the left,
there are more extreme values towards the positive end and so the
distribution is positively skewed.
Similarly, if the whisker to the left is longer, the distribution is negatively skewed.
2. Box-and-Whisker Plots: Quartiles, Boxes, and Whiskers
(http://www.purplemath.com/modules/boxwhisk.htm)
Sections: Quartiles, boxes, and whiskers, Five-number summary, Interquartile ranges and outliers
Statistics assumes that your data points (the numbers in your list) are clustered around some central value. The "box" in the box-and-whisker plot contains, and thereby highlights, the middle half of these data points.
To create a box-and-whisker plot, you start by ordering your data (putting the values in numerical order), if they aren't ordered already. Then you find the median of your data. The median divides the data into two halves. To divide the data into quarters, you then find the medians of these two halves. Note: If you have an even number of values, so the first median was the average of the two middle values, then you include the middle values in your sub-median computations. If you have an odd number of values, so the first median was an actual data point, then you do not include that value in your sub-median computations. That is, to find the sub-medians, you're only looking at the values that haven't yet been used.
You have three points: the first middle point (the median), and the middle points of the two halves (what I call the "sub-medians"). These three points divide the entire data set into quarters, called "quartiles". The top point of each quartile has a name, being a "Q" followed by the number of the quarter. So the top point of the first quarter of the data points is "Q1", and so forth. Note that Q1 is also the middle number for the first half of the list, Q2 is also the middle number for the whole list, Q3 is the middle number for the second half of the list, and Q4 is the largest value in the list.
Once you have these three points, Q1, Q2, and Q3, you have all you need in order to draw a simple box-and-whisker plot. Here's an example of how it works.
• Draw a box-and-whisker plot for the following data set:
4.3, 5.1, 3.9, 4.5, 4.4, 4.9, 5.0, 4.7, 4.1, 4.6, 4.4, 4.3, 4.8, 4.4, 4.2, 4.5, 4.4
Putting the values in order gives:
3.9, 4.1, 4.2, 4.3, 4.3, 4.4, 4.4, 4.4, 4.4, 4.5, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0, 5.1
The first number I need is the median of the entire set. Since there are seventeen values in this list, I need the ninth value:
3.9, 4.1, 4.2, 4.3, 4.3, 4.4, 4.4, 4.4, 4.4, 4.5, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0, 5.1
The median is Q2 = 4.4.
The next two numbers I need are the medians of the two halves. Since I used the "4.4" in the middle of the list, I can't re-use it, so my two remaining data sets are:
3.9, 4.1, 4.2, 4.3, 4.3, 4.4, 4.4, 4.4 and 4.5, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0, 5.1
The first half has eight values, so the median is the average of the middle two:
Q1 = (4.3 + 4.3)/2 = 4.3
The median of the second half is:
Q3 = (4.7 + 4.8)/2 = 4.75
Copyright © Elizabeth Stapel 1999-2009 All Rights Reserved
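The median-of-halves procedure described above can be sketched in Python as follows. The function name `quartiles` is hypothetical, and the sketch assumes at least two values. It drops the middle value when the count is odd, as the text describes; other books and packages use different rules and may give different quartiles.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule.

    With an odd number of values the overall median is left out of
    both halves; with an even number the list is simply split in two.
    Assumes len(values) >= 2.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


data = [4.3, 5.1, 3.9, 4.5, 4.4, 4.9, 5.0, 4.7, 4.1,
        4.6, 4.4, 4.3, 4.8, 4.4, 4.2, 4.5, 4.4]
q1, q2, q3 = quartiles(data)
```

For the seventeen values above this should reproduce the three quartiles worked out by hand, up to floating-point rounding.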
Since my list values have one decimal place and range from 3.9 to 5.1, I won't use a scale of, say, zero to ten, marked off by ones. Instead, I'll draw a number line from 3.5 to 5.5, and mark off by tenths.
Now I'll mark off the minimum and maximum values, and Q1, Q2, and Q3:
The "box" part of the plot goes from Q1 to Q3:
By the way, box-and-whisker plots don't have to be drawn horizontally as I did above; they can be vertical, too.
More terminology: The top end of your box may also be called the "upper hinge"; the lower end may also be called the "lower hinge". The lower hinge is also called "the 25th percentile"; the median is "the 50th percentile"; the upper hinge is "the 75th percentile". This means that 25%, 50% and 75% of the data, respectively, is at or below that point. The distance between the hinges may be referred to as the "H-spread" or, as you will see on the following page, the "Interquartile Range", abbreviated "IQR". ("Hinge" actually has a different technical definition, but the term is sometimes used informally.)
Also, some books and software will include the overall median (Q2) when computing Q1 and Q3 for data sets with an odd number of elements. The Texas Instruments calculators do not include Q2 in this case, so you may encounter a book answer that doesn't match the calculator answer. And different software packages use all different sorts of formulas. Be careful to use the formula from your book when doing your homework!
Additionally, the box-and-whisker plot may include a cross or an "X" marking the mean value of the data, in addition to the line inside the box that marks the median. The difference between the "X" and the median line can then be used as a measure of "skew".
Please don't ask me to explain "skew".
• Draw the box-and-whisker plot for the following data set:
77, 79, 80, 86, 87, 87, 94, 99
My first step is to find the median. Since there are eight data points, the median will be the average of the two middle values: (86 + 87) ÷ 2 = 86.5 = Q2
This splits the list into two halves: 77, 79, 80, 86 and 87, 87, 94, 99. Since the halves of the data set each contain an even number of values, the sub-medians will be the average of the middle two values.
Q1 = (79 + 80) ÷ 2 = 79.5
Q3 = (87 + 94) ÷ 2 = 90.5
The minimum value is 77 and the maximum value is 99, so I have:
min: 77, Q1: 79.5, Q2: 86.5, Q3: 90.5, max: 99
As you can see, you only need the five values listed above (min, Q1, Q2, Q3, and max) in order to draw your box-and-whisker plot. This set of five values has been given the name "the five-number summary".
• Give the five-number summary of the following data set:
79, 53, 82, 91, 87, 98, 80, 93
The five-number summary consists of the numbers I need for the box-and-whisker plot: the minimum value, Q1 (the bottom of the box), Q2 (the median of the set), Q3 (the top of the box), and the maximum value (which is also Q4). So I need to order the set, find the median and the sub-medians, and then list the required values in order.
ordering the list: 53, 79, 80, 82, 87, 91, 93, 98, so the minimum is 53 and the maximum is 98
finding the median: (82 + 87) ÷ 2 = 84.5 = Q2
lower half of the list: 53, 79, 80, 82, so Q1 = (79 + 80) ÷ 2 = 79.5
upper half of the list: 87, 91, 93, 98, so Q3 = (91 + 93) ÷ 2 = 92
five-number summary: 53, 79.5, 84.5, 92, 98
Part of the point of a box-and-whisker plot is to show how spread out your values are. But what if one or another of your values is way out of line? For this, we need to consider "outliers"....
The "interquartile range", abbreviated "IQR", is just the width of the box in the box-and-whisker plot. That is, IQR = Q3 – Q1. The IQR can be used as a measure of how spread-out the values are. Statistics assumes that your values are clustered around some central value. The IQR tells how spread out the "middle" values are; it can also be used to tell when some of the other values are "too far" from the central value. These "too far away" points are called "outliers", because they "lie outside" the range in which we expect them.
The IQR is the length of the box in your box-and-whisker plot. An outlier is any value that lies more than one and a half times the length of the box from either end of the box. That is, if a data point is below Q1 – 1.5×IQR or above Q3 + 1.5×IQR, it is viewed as being too far from the central values to be reasonable. Maybe you bumped the weigh-scale when you were making that one measurement, or maybe your lab partner is an idiot and you should never have let him touch any of the equipment. Who knows? But whatever their cause, the outliers are those points that don't seem to "fit".
(Why one and a half times the width of the box? Why does that particular value mark the difference between "acceptable" and "unacceptable" values? Because, when John Tukey was inventing the box-and-whisker plot in 1977 to display these values, he picked 1.5×IQR as the demarcation line for outliers. This has worked well, so we've continued using that value ever since.)
• Find the outliers, if any, for the following data set:
10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4
To find out if there are any outliers, I first have to find the IQR. There are fifteen data points, so the median will be at position (15 + 1) ÷ 2 = 8. Then Q2 = 14.6. There are seven data points on either side of the median, so Q1 is the fourth value in the list and Q3 is the twelfth: Q1 = 14.4 and Q3 = 14.9. Then IQR = 14.9 – 14.4 = 0.5.
Outliers will be any points below Q1 – 1.5×IQR = 14.4 – 0.75 = 13.65 or above Q3 + 1.5×IQR = 14.9 + 0.75 = 15.65.
Then the outliers are at 10.2, 15.9, and 16.4.
The values for Q1 – 1.5×IQR and Q3 + 1.5×IQR are the "fences" that mark off the "reasonable" values from the outlier values. Outliers lie outside the fences.
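The fence check for the fifteen-value data set can be reproduced with Python's standard library. This sketch swaps in `statistics.quantiles` with `method="exclusive"` in place of the hand rule from the text. For this particular data set it lands on the fourth and twelfth values, but in general it is a different formula and may disagree with the median-of-halves rule, which is the kind of book-versus-software mismatch mentioned earlier.

```python
from statistics import quantiles

data = [10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6,
        14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4]

# Three cut points dividing the data into four groups (Python 3.8+).
q1, q2, q3 = quantiles(data, n=4, method="exclusive")

iqr = q3 - q1
low_fence = q1 - 1.5 * iqr    # about 13.65
high_fence = q3 + 1.5 * iqr   # about 15.65

# Outliers lie outside the fences.
outliers = [v for v in data if v < low_fence or v > high_fence]
```

None of the data values sits close to either fence, so small floating-point differences in the fence values do not change which points are flagged.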
If your assignment is having you consider outliers and "extreme values", then the values for Q1 – 1.5×IQR and Q3 + 1.5×IQR are the "inner" fences and the values for Q1 – 3×IQR and Q3 + 3×IQR are the "outer" fences. The outliers (marked with asterisks or open dots) are between the inner and outer fences, and the extreme values (marked with whichever symbol you didn't use for the outliers) are outside the outer fences.
By the way, your book may refer to the value of "1.5×IQR" as being a "step". Then the outliers will be the numbers that are between one and two steps from the hinges, and extreme values will be the numbers that are more than two steps from the hinges.
Looking again at the previous example, the outer fences would be at 14.4 – 3×0.5 = 12.9 and 14.9 + 3×0.5 = 16.4. Since 16.4 is right on the upper outer fence, this would be considered to be only an outlier, not an extreme value. But 10.2 is fully below the lower outer fence, so 10.2 would be an extreme value.
Your graphing calculator may or may not indicate whether a box-and-whisker plot includes outliers. For instance, the above problem includes the points 10.2, 15.9, and 16.4 as outliers. One setting on my graphing calculator gives the simple box-and-whisker plot which uses only the five-number summary, so the furthest outliers are shown as being the endpoints of the whiskers:
A different calculator setting gives the box-and-whisker plot with the outliers specially marked (in this case, with a simulation of an open dot), and the whiskers going only as far as the highest and lowest values that aren't outliers:
Note that my calculator makes no distinction between outliers and extreme values.
If you're using your graphing calculator to help with these plots, make sure you know which setting you're supposed to be using and what the results mean, or the calculator may give you a perfectly correct but "wrong" answer.
• Find the outliers and extreme values, if any, for the following data set, and draw the box-and-whisker plot. Mark any outliers with an asterisk and any extreme values with an open dot.
21, 23, 24, 25, 29, 33, 49
To find the outliers and extreme values, I first have to find the IQR. Since there are seven values in the list, the median is the fourth value, so Q2 = 25. The first half of the list is 21, 23, 24, so Q1 = 23; the second half is 29, 33, 49, so Q3 = 33. Then IQR = 33 – 23 = 10.
The outliers will be any values below 23 – 1.5×10 = 23 – 15 = 8 or above 33 + 1.5×10 = 33 + 15 = 48. The extreme values will be those below 23 – 3×10 = 23 – 30 = –7 or above 33 + 3×10 = 33 + 30 = 63.
.So I have an outlier at