
One common way data will be displayed on the GRE is in bar charts. ETS calls a chart a “bar chart” whether the bars are horizontal or vertical. It’s good to know that many sources call charts with horizontal bars “bar charts” and charts with vertical bars “column charts.” Regardless of the names we use, there’s a subtle difference.

Bar charts

Typically, in a bar chart, when the bars are horizontal, each bar represents a completely different item from some overarching category. For example:

Here, the bars are different fruits. Why these fruits were selected and not others isn’t obvious. The order here is simply alphabetical because there is no predetermined way to put fruits into order, whatever that would mean. If there is no inherent order to the categories, and the representatives chosen don’t exhaust the category, then the data typically would be displayed in horizontal bars or what many sources would call a bar chart.

[Figure: Various Fruits, Calories per Serving. Horizontal bar chart of calories (axis from 0 to 120) for apple, banana, cantaloupe, grapes (a cup), kiwi, orange, and pear.]

Column charts

If the set has a logical order to it, such as days of the week or months of the year, and/or the representatives shown encompass all in the category, the data typically would be displayed in vertical bars or what many would call a column chart. For example:

Here, days of the week have a well-defined order, in which they’re displayed.

Assuming this business only operates during weekdays, this is also a complete set of all the days on which they do business. That’s why the vertical columns are used.

Segmented bars and columns

The following is a more detailed version of the “sales by day of week” chart given above:

This type of chart gives more nuanced information. Apparently, this company has two divisions, and how each division performs during different days of the week varies considerably. For example, Division 1 clearly has its best days on Wednesdays, while for Division 2, Mondays and Thursdays appear to be neck-and-neck for the best days. Here, the individual pieces are displayed as segments of a column because you might be interested in knowing either the revenue of each division separately or the total revenue of the company, which equals the sum of the revenues of the two divisions.

[Figure: Sales Over a Five-Week Period, by Day of Week. Column chart of sales (axis from 0 to 50,000) for Monday through Friday.]

[Figure: Sales Over a Five-Week Period, by Day of Week. Segmented columns showing Division 1 and Division 2 sales (axis from 0 to 50,000) for Monday through Friday.]

Edward Tufte (1942– ) is widely recognized as the world’s expert in the visual display of data. Once, Tufte said: “The only thing worse than a pie chart is several pie charts.” Admittedly, the pie chart is an extremely simple diagram that generally conveys far less than most of the other charts and diagrams discussed in this chapter.

GRE Quantitative Reasoning

Clustered bars and columns

Sometimes you’ll care about the sum of the parts and sometimes you won’t. If, instead of being two divisions of the same company, that same data was interpreted as the revenues of two different companies competing in the same market, then the sum of the revenues would be virtually meaningless. In this case, the columns or bars are “clustered,” that is to say, displayed side by side. For example:

Here, the side-by-side comparison makes it very easy to compare which company outperforms the other on each day of the week.

Scatterplots

One of the most common types of graphs in statistics and in the quantitative sciences is a scatterplot. A scatterplot is a way of displaying data in which two different variables are measured for each participant. For example, suppose you ask several people to identify their age and their weight, or both their annual income and the amount of debt they carry, or both their number of kids and number of credit cards. Suppose you measure both the weight and the gas mileage of several cars, or the annual revenue and the share price of publicly traded companies. In all of those cases, each individual (each person, each car, each company) would be a single dot on the graph, and the graph would have as many dots as individuals surveyed or measured.

[Figure: Sales Over a Five-Week Period, by Day of Week. Clustered columns comparing Company 1 and Company 2 (axis from 0 to 30,000) for Monday through Friday.]

An example of a scatterplot

Below is a scatterplot on which the individual dots represent countries.

On this graph, the x-axis is the gross domestic product (GDP) per capita of the country. The GDP is a measure of the amount of business the country conducts. The size of the GDP depends on both the inherent wealth of the country and the population. When you divide that by the population of the country, you get GDP per capita, which is an excellent measure of the average wealth of the country. The y-axis is life expectancy at birth in that country. The sideways L shape tells a story: for countries with a GDP per capita above $20K, life expectancy at birth is between 70 and 80 years, but for the poor countries, those with a GDP per capita less than about $20K, life expectancy at birth varies considerably, and is in many cases considerably less than the 70+ years that’s standard for most of the world.

Now, as an example of a scatterplot with two different marks on the graph, here’s the same graph again, with some of the points marked differently.

[Figure: scatterplot of Life Expectancy (axis from 40 to 90) against GDP Per Capita in thousands (axis from 0 to 80).]

[Figure: the same scatterplot of Life Expectancy against GDP Per Capita in thousands, with points marked by region: Africa and Elsewhere.]


On this graph, the circles are countries on the continent of Africa, and the triangles are countries in the rest of the world. Notice that essentially, the entire continent of Africa is in the vertical arm of the L shape on the left side, while the rest of the world predominantly makes the horizontal arm of the L at the top of the graph. In other words, for most African countries, a person’s odds from birth are worse than those of a person born in a non-African country. Suffice it to say, displaying data in a scatterplot can make truly important information visually apparent.

Boxplots

Statisticians like to chunk data in sets. One way to do this is by labeling data using the five-number summary. The five-number summary looks at data like this:

1. Maximum

2. Third quartile, Q3, the 75th percentile

3. Median, 50th percentile

4. First quartile, Q1, the 25th percentile

5. Minimum

The beauty of the five-number summary is that it divides the entire data set into quarters: between any two adjacent numbers on the five-number summary is 25% of the data.
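As a rough illustration of the five-number summary, here is a minimal Python sketch. The data set and the function name `five_number_summary` are invented for this example, and the quartile convention used (`method="inclusive"`) is only one of several in common use, so Q1 and Q3 may differ slightly from the values produced by other tools.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # With n=4, statistics.quantiles returns the three quartile cut points.
    # method="inclusive" is an assumption of this sketch; other quartile
    # conventions can give slightly different Q1 and Q3.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

# Hypothetical data set, invented for illustration only.
scores = [2, 4, 4, 5, 7, 8, 9, 11, 12]
print(five_number_summary(scores))
```

For this small list, the summary comes out as minimum 2, Q1 4, median 7, Q3 9, and maximum 12.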

Because statisticians, like many human beings, are highly visual, they created a visual way to display the five-number summary. This visual form is called a boxplot. The five vertical lines represent the five numbers of the five-number summary, and the box in the middle, from Q1 to Q3, represents the middle 50% of the data. Between any two adjacent vertical lines are 25% of the data points.

Strikeouts

Here’s an example of a boxplot using real baseball data. The data here is the total strikeouts pitched by all National League (NL) pitchers who pitched at least 75 innings in the 2012 season.

[Figure: Strikeouts, NL Pitchers 2012. Boxplot on an axis running from 40 to 240 strikeouts.]

Half of all the NL pitchers here recorded between Q1 = 83 and Q3 = 161 strikeouts in the year; these are the pitchers in the big grey box in the center, which spans the interquartile range (IQR). Meanwhile, only 25% of the pitchers in this data struck out fewer than 83 batters; this bottom 25% is on the lower arm from 38 to 83. And finally, only 25% of the pitchers struck out more than 161 batters in the 2012 season; this top 25% is on the upper arm from 161 to 230. A data interpretation question on the GRE could give you a boxplot and expect you to read all the five-number summary information, including percentiles, from it.

Histogram

Histograms aren’t simple bar or column charts. A histogram, like a boxplot, shows the distribution of a single quantitative variable. The makers of this graph asked each high school student, “How many hours of TV did you watch last week?” and each high school student gave a numerical answer. Interviewing 86 students produced a list of 86 numbers. The histogram is a way to visually display the distribution of those 86 numbers.

The histogram “chunks” the values into sections that occupy equal ranges of the variable, and it tells how many numbers on the list fall into that particular chunk. For example, the left-most column on this chart has a height of 13. This means that, of the 86 students surveyed, 13 of them gave a numerical response somewhere from 1 to 5 hours. Similarly, each other bar tells us how many responses were in that particular range of hours of TV watched.
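The “chunking” step can be sketched in a few lines of Python. This is only an illustration: the responses below are invented (they are not the 86 survey answers), the helper names `bin_label` and `histogram_counts` are hypothetical, and the sketch assumes whole-number answers of at least 1 hour, grouped into ranges 5 hours wide as in the chart.

```python
from collections import Counter

def bin_label(hours, width=5):
    """Map a whole number of hours (1 or more) to its range label, e.g. 7 -> '6-10'."""
    low = ((hours - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"

def histogram_counts(responses, width=5):
    """Count how many responses fall into each equal-width range."""
    return Counter(bin_label(h, width) for h in responses)

# Hypothetical responses, invented for illustration only.
responses = [1, 3, 5, 6, 7, 7, 9, 10, 12, 18, 23]
counts = histogram_counts(responses)
for label in ("1-5", "6-10", "11-15", "16-20", "21-25"):
    print(label, counts[label])
```

Each count is the height of one column: here the 1-5 column would have height 3 and the 6-10 column height 5.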

The median

The median is the middle of the list. In this same histogram problem, there’s an even number of entries on the list, so the median would be the average of the two middle terms—the average of the 43rd and 44th numbers on the list. We can tell that the first column accounts for the first 13 people on the list, and that the first two columns account for the first 13 + 35 = 48 people on the list, so by the time we got to the last person on the list in the second column, we would have already passed the 43rd and 44th entries, which means the median would be somewhere in that second column, somewhere between 6 and 10. But we don’t know the exact value of the median.
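The same running-total reasoning can be written as a short Python sketch. Only the first two column heights (13 and 35) and the total of 86 come from the discussion above; the remaining heights are placeholders invented so that the total is 86, and the function name `median_bins` is hypothetical.

```python
def median_bins(labels, counts):
    """Return the range(s) that contain the middle position(s) of the list."""
    total = sum(counts)
    # 1-based positions of the middle entries; they coincide when total is odd.
    middle_positions = ((total + 1) // 2, total // 2 + 1)
    found = []
    running = 0
    for label, count in zip(labels, counts):
        start = running + 1      # first position covered by this column
        running += count         # last position covered by this column
        for pos in middle_positions:
            if start <= pos <= running and label not in found:
                found.append(label)
    return found

labels = ["1-5", "6-10", "11-15", "16-20", "21-25", "26-30", "31-35"]
# 13 and 35 are from the text; the other five heights are assumed values
# chosen only so that the total is 86.
counts = [13, 35, 18, 10, 6, 3, 1]
print(median_bins(labels, counts))
```

With these heights, the 43rd and 44th entries both fall in the 6-10 column, which matches the reasoning above: the histogram locates the median within a range but does not give its exact value.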

[Figure: histogram of Number of Students (axis from 0 to 40) against Number of Hours of TV Watched Last Week, in ranges 1–5, 6–10, 11–15, 16–20, 21–25, 26–30, and 31–35.]


The mean

To calculate the mean, you would have to add up the exact values of all 86 entries on the list, and then divide that sum by 86. In a histogram, you don’t have access to exact values; you only know the ranges of numbers. Therefore, it’s impossible to calculate the mean from a histogram.

Median vs. mean

If it’s impossible to calculate the mean, then how can the GRE expect you to compare the mean to the median? Well, here you need to know a slick little bit of statistical reasoning. Consider the following two lists:

List A = {1, 2, 3, 4, 5}: median = 3 and mean = 3

List B = {1, 2, 3, 4, 100}: median = 3 and mean = 22

In changing from List A to List B, we took the last point and slid it out on the scale from x = 5 to x = 100. We made it an outlier, or a point that’s noticeably far from the other points. Notice that the median didn’t change at all. The median doesn’t care about outliers. Outliers simply don’t affect the median. By contrast, the mean changed substantially, because, unlike the median, the mean is sensitive to outliers.
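A quick way to check this behavior is with Python’s standard `statistics` module. The sketch below simply recomputes the median and mean of the two lists; the variable names are chosen for this example.

```python
import statistics

list_a = [1, 2, 3, 4, 5]
list_b = [1, 2, 3, 4, 100]   # same list, with the last value moved out to 100

for name, data in (("List A", list_a), ("List B", list_b)):
    median = statistics.median(data)
    mean = statistics.mean(data)
    # A positive gap means the mean sits above the median.
    print(name, "median =", median, "mean =", mean, "gap =", mean - median)
```

The median is 3 for both lists, while the mean moves from 3 to 110 / 5 = 22: the single outlier drags the mean toward itself and leaves the median alone.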

Now, consider a symmetrical distribution of numbers—it could be a perfect bell curve, or it could be any other symmetrical distribution. In any symmetrical distribution, the mean equals the median. Now, consider an asymmetrical distribution. If the outliers are yanked out to one side, then the median will stay put, but the mean will be yanked out in the same direction as the outliers. Outliers pull the mean away from the median. Therefore, if you simply notice on which side the outliers lie, then you know in which direction the mean was pulled away from the median. That makes it very easy to compare the two. The comparison is purely visual, and involves absolutely no calculations of any sort.

mantra

The standardized test is a learnable thing. You can become much better at the GRE in a very short time. Just remember to repeat this mantra when that little voice in your head wants to say things like “I’m no good at this.”

Normal distribution

A distribution is a graph that shows what values of a variable are more or less common in a population. A higher region of the graph represents more people meeting the variable criteria, and a lower region of the graph, one close to 0, represents fewer people meeting the criteria.

By far, the most famous and most useful distribution is the normal distribution, better known as the bell curve. It shows up everywhere, with an almost eerie universality. Suppose you measured one genetically determined bodily measurement such as thumb length or distance between pupils, for every single human being on the planet, and then graphed the distribution. You would wind up with a normal distribution. The same goes for any genetically determined bodily measurement you could make on an animal or a plant: you’d end up with a normal distribution. The normal distribution is the shape of the distribution of any naturally occurring variable of any natural population.

Properties of the normal distribution

All normal distributions on earth, from giraffe height to ant height, share certain central properties.

It’s important to appreciate that any normal distribution comes with its own yardstick. For a normal distribution, that yardstick is the standard deviation. The very center of the normal distribution is the mean, median, and mode all in one. We use the standard deviation to measure distances from the mean.

In the graph below, the mean (and median and mode) is at x = 0, and the units on the x-axis mark off distances in standard deviations. Thus, x = 1 is one standard deviation above the mean, and x = −2 is two standard deviations below the mean.

From x = 0 to x = 1, from the mean to one standard deviation above the mean, we find 34% of the population. Because the curve is completely symmetrical, the same is true on the other side: another 34% of the population is between x = −1 and x = 0. This means that between x = −1 and x = 1, we find 68% of the population, just more than two-thirds: this accounts for all the people that fall within one standard deviation of the mean. More than two thirds of the population is located at a distance from the mean of one standard deviation or less.

[Figure: normal distribution curve, with the x-axis marked in standard deviations from −4 to 4.]

The people who are more than four standard deviations above the mean are the people who are truly exceptional, among the greatest on earth, in one way or another: the professional athletes, the concert pianists and violinists, or the Nobel Laureates of various fields.


If we go two standard deviations from the mean in either direction, from x = −2 to x = 2, that always includes 95% of the population. In other words, 95% of the population is located at a distance from the mean of two standard deviations or less.

Only 5% of the population is more than two standard deviations from the mean, and that's symmetrically divided between a 2.5% “tail” on the left and a 2.5% “tail” on the right. Folks who are in the upper tail are in the top 2.5% of the population. For example, these would be the folks who score a 168 or more out of the 170 on the GRE math.

If we go out to three standard deviations from the mean in either direction, from x = −3 to x = 3, that includes 99.7% of the population, with only 0.15% (i.e., 15 people out of 10,000) falling in each tail beyond this. The data points that lie more than three standard deviations above the mean are the true outliers.

If you simply remember these two numbers, then you’ll have the ability to figure out any GRE math question that addresses the normal distribution:

● 68% within one standard deviation of the mean (which means 34% on each side)

● 95% within two standard deviations of the mean
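These percentages are rounded. As a check, the sketch below uses `NormalDist` from Python’s standard `statistics` module to compute the share of a normal population within one, two, and three standard deviations of the mean; the helper name `share_within` is invented for this example.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
standard = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal population within k standard deviations of the mean."""
    return standard.cdf(k) - standard.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.2%}")
```

The more precise values are about 68.27%, 95.45%, and 99.73%, so the rounded figures of 68%, 95%, and 99.7% are accurate enough for GRE purposes.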

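[Figure: two normal distribution curves, with the x-axis marked in standard deviations from −3 to 3. The first marks the 68% of the population within one standard deviation of the mean; the second marks the 95% within two standard deviations.]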

Here’s a practice question:

In the country of Dilandia, adult female height follows a normal distribution, with a mean of 170 cm and a standard deviation of 12 cm. What percent of the adult females in Dilandia are taller than 182 cm?

(A) 6%

(B) 16%

(C) 25%

(D) 34%

(E) 44%

A height of 182 cm is one standard deviation (12 cm) above the mean (170 cm). The mean is the median, so 50% of the population is below the mean, shorter than 170 cm. We learned in the discussion above that between the mean and one standard deviation above the mean, heights between 170 cm and 182 cm, we’ll find 34% of the population. Adding these, we find that 50% + 34% = 84% of the heights will be below 182 cm. If 84% are below 182 cm, then the other 16% must be above 182 cm. Thus, 16% of the adult women in Dilandia are taller than 182 cm. Answer = (B).

How do we know whether the 34% region is “less than” or “less than or equal to”? In other words, how do we know whether to include the height of exactly 182 cm? Most of the variables that follow normal distributions are real-world continuous variables, such as human height. Typically, people report their height in an integer number of inches or centimeters, but if we were to take hyperaccurate scientific measurements, every real person would have some decimal height (e.g., 176.48251 . . . cm). No one ever would have an exact height of 182 cm, a height that equaled 182.000000 cm, with an infinite number of decimals of precision. Such precision is a mathematical fiction that simply doesn’t exist in the real world of measurement. Thus, in most normal distribution problems, we don’t worry about the endpoints of regions: in practical terms, it doesn’t matter at all whether we’re discussing heights that are “greater than 182 cm” or “greater than or equal to 182 cm.” Again, as regards real-world measurements, this is a fictional mathematical distinction that’s absolutely meaningless where the rubber meets the road.
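The 16% answer to the Dilandia question can also be checked numerically. The sketch below uses `NormalDist` from Python’s standard `statistics` module with the mean and standard deviation given in the question; the variable names are chosen for this example.

```python
from statistics import NormalDist

# Adult female height in Dilandia: mean 170 cm, standard deviation 12 cm.
heights = NormalDist(mu=170, sigma=12)

# Share of the population above 182 cm, one standard deviation above the mean.
taller = 1 - heights.cdf(182)
print(f"{taller:.1%}")
```

The computed share is a little under 16% (about 15.9%), because the 34% figure is itself rounded; among the answer choices, 16% is still clearly the closest.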