
Sports articles and broadcasts often give much attention to “unusual” results, results that seem to require an inordinate amount of luck or coincidence. In some cases, it is possible to use probability theory to evaluate the likelihood of a particular event occurring. We have seen one instance of this type of analysis in Section 3.15 on streaks. In this section, we consider three examples of events that might be considered unusual and show how to use probability to obtain a rough idea of how likely these events are.

In the May 2, 2011 edition of Bleacher Report (http://www.bleacherreport.com), Peter Wardell contributed an entertaining article on “20 Statistical Oddities from the 2011 MLB Season So Far.” Here we look at three of these “oddities”: the fact that 2 teams had a sub-50% save conversion rate; the fact that the Nationals had only 1 pinch hit in 26 pinch-hit at bats; and the fact that Kurt Suzuki had a 55% caught-stealing percentage.

First, consider the save conversion rates. After the first month of the 2011 MLB season, the White Sox had a save conversion rate of 33%, and the Astros had a save conversion rate of 36%. As noted in the article, in the previous decade only two teams had a season save conversion rate of less than 50% (and both were close, at 48% and 49%).

The save conversion rate is the ratio of saves to the sum of saves plus blown saves, expressed as a percentage. For convenience, I refer to saves plus blown saves as “save attempts.” In a month of an MLB season, teams average about 10 save attempts; the average save conversion rate of MLB teams is about 68%. Let X denote the number of save conversions for a team. Assuming the probability of a save conversion is 0.68 and that a team has 10 save attempts, we can model X as a random variable with a binomial distribution with n = 10 and π = 0.68. The probability that the team has a save conversion rate of less than 50% is P(X ≤ 4). Note that if X is less than or equal to 4, the save conversion rate is 40% or lower.

Using the probability function of the binomial distribution, it can be shown that

P(X ≤ 4) = 0.064.

Therefore, it appears that having two teams with such a low save conversion rate is unlikely. However, it is important to keep in mind that there are 30 MLB teams, each of which might have a low save conversion rate.
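The value 0.064 can be checked by summing the binomial probability function directly. The following is a minimal Python sketch of that calculation, not part of the original analysis; the helper names `binom_pmf` and `binom_cdf` are chosen here for illustration.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X with a binomial distribution with parameters n and p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(k, n, p):
    """P(X <= k), obtained by summing the probability function from 0 to k."""
    return sum(binom_pmf(j, n, p) for j in range(k + 1))

# Save conversion example: n = 10 save attempts, pi = 0.68.
p_low = binom_cdf(4, 10, 0.68)
print(round(p_low, 3))  # 0.064
```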

Because the probability of a “low” save conversion rate (i.e., one that is less than 50%) is 0.064, the probability of a “normal” save conversion rate (i.e., one that is 50% or greater) is

1 − 0.064 = 0.936.

The chance that all 30 MLB teams have a normal save conversion rate, assuming that they all have 10 save attempts, that the probability of a save conversion is 0.68 for all teams, and that the results of different teams are independent, is

(0.936)^30 = 0.139.

Using properties of the binomial distribution, it may be shown that the probability that all but 1 team has a normal save conversion rate is 0.283. Therefore, the probability that 2 or more teams have a low save conversion rate is

1 − 0.139 − 0.283 = 0.578.

That is, it is actually more likely that at least 2 teams will have a low save conversion rate in the first month of the season than it is that all teams, or all but 1 team, will have a normal save conversion rate in the first month. Although the probability 0.578 is based on a number of assumptions that are unlikely to be exactly true, it seems reasonable to conclude that the fact that 2 teams had a sub-50% save conversion rate in the first month of the season is not particularly unusual.
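The two-stage reasoning in the save conversion example can be sketched in Python as follows. This is an illustrative check under the same assumptions as the text (10 save attempts per team, π = 0.68, 30 independent teams); the variable names are chosen here, and the unrounded value of P(X ≤ 4) is carried through rather than 0.064.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X with a binomial distribution with parameters n and p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Stage 1: chance that a single team converts 4 or fewer of 10 save attempts.
p_low = sum(binom_pmf(k, 10, 0.68) for k in range(5))

# Stage 2: the number of "low" teams among 30 is modeled as binomial
# with n = 30 and success probability p_low.
n_teams = 30
p_none = binom_pmf(0, n_teams, p_low)   # all 30 teams are "normal"
p_one = binom_pmf(1, n_teams, p_low)    # all but 1 team is "normal"
p_two_or_more = 1 - p_none - p_one      # at least 2 "low" teams

print(round(p_none, 3), round(p_one, 3), round(p_two_or_more, 3))
# 0.139 0.283 0.578
```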

Now, consider the fact that the Nationals were 1 for 26 in pinch-hit at bats in the first month of the 2011 season. Because only National League teams have a large number of pinch-hit at bats, the analysis here refers only to National League teams.

The batting average of pinch hitters for the league overall was .214 in 2011. Let X denote the number of pinch hits a team has in 26 pinch-hit at bats; then, we can model X as a binomial random variable with n = 26 and π = 0.214. It follows that the probability that a team has 0 hits in 26 at bats is

P(X = 0) = (1 − 0.214)^26 = 0.0019;

the probability that the team has 1 hit in 26 at bats is 0.0135. Therefore, the probability that a team has less than or equal to 1 hit in 26 pinch-hit at bats is 0.0154; it follows that the event that a team goes 1 for 26 in pinch hitting is unusual.

However, as in the save conversion example, we must take into account the fact that there are 16 National League teams, each of which had a chance to go 1 for 26 in pinch hitting. Because the probability that a given team has at least 2 hits in 26 at bats is 1 − 0.0154 = 0.9846, the probability that all 16 teams have at least 2 hits in 26 at bats is

(0.9846)^16 = 0.780.

Stated another way, the probability that at least 1 team has 0 or 1 hit in 26 pinch-hit at bats is 0.22. We expect this to occur about once every 1/0.22 = 4.5 seasons. Therefore, according to this analysis, the fact that a team went 1 for 26 in pinch hitting to start the season is fairly unusual. Furthermore, part of the “oddity” of this result, as described by Wardell, is that the Nationals had 9 strikeouts in 26 at bats, making their first-month pinch-hitting performance even more unusual.
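The pinch-hitting calculation follows the same pattern and can be reproduced with a short Python sketch. It assumes, as in the text, 26 at bats per team, a hit probability of 0.214, and 16 independent teams; the variable names are illustrative only.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X with a binomial distribution with parameters n and p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n_at_bats, p_hit, n_teams = 26, 0.214, 16

p_zero = binom_pmf(0, n_at_bats, p_hit)   # 0 hits in 26 at bats
p_one = binom_pmf(1, n_at_bats, p_hit)    # exactly 1 hit in 26 at bats
p_at_most_one = p_zero + p_one

# Probability that every one of the 16 teams has at least 2 hits,
# and its complement: at least 1 team has 0 or 1 hit.
p_all_two_plus = (1 - p_at_most_one) ** n_teams
p_some_team = 1 - p_all_two_plus

print(round(p_zero, 4), round(p_one, 4))            # 0.0019 0.0135
print(round(p_all_two_plus, 3), round(p_some_team, 2))  # 0.78 0.22
```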

Finally, consider the fact that Kurt Suzuki threw out 16 of 29 players attempting to steal in the first month of the 2011 season. What makes this event unusual is the fact that Suzuki had a caught-stealing percentage of only 22% in 2010 and 25% in 2009. Therefore, this oddity can be interpreted as one in which a catcher with a relatively poor record of throwing out those attempting to steal has a stretch of 29 attempts in which he throws out 16 (or more).

Of the 30 catchers in 2010 with the most playing time, 15 had a caught-stealing percentage of less than 30%. The average caught-stealing percentage of these 15 catchers was 22.4%, which coincidentally is the same as Suzuki’s caught-stealing percentage in 2010. Let X denote the number of runners caught stealing by a given catcher; we can model X as a random variable with a binomial distribution with n = 29 and π = 0.224. Under this assumption,

P(X ≥ 16) = 0.000128.

Therefore, the probability that a given one of these 15 catchers throws out no more than 15 of 29 runners is

1 − 0.000128 = 0.999872

and the probability that all 15 catchers throw out 15 or fewer runners is

(0.999872)^15 = 0.99808.

It follows that the probability that at least 1 of these 15 catchers with a poor record of throwing out those attempting to steal throws out 16 (or more) of 29 runners attempting to steal is

1 − 0.99808 = 0.00192;

that is, this is an extremely rare event that, according to the analysis in this section, can be expected to occur only once every 1/0.00192 = 521 seasons.

Although we should not take the 521 seasons result too seriously, given the number of assumptions used in this analysis, it is clear that Suzuki’s start to the 2011 season deserves to be called a statistical oddity. It is worth noting that for the remainder of 2011, Suzuki threw out only 22 of 107 runners attempting to steal, and he ended the season with a caught-stealing percentage of about 28%.
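The caught-stealing calculation relies on an upper-tail binomial probability, which can be checked with the following Python sketch. It uses the assumptions stated above (29 attempts, π = 0.224, 15 independent catchers); the variable names are illustrative. Because the sketch carries unrounded values through each step, the final figures may differ slightly in the last digit from those quoted in the text.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X with a binomial distribution with parameters n and p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n_attempts, p_caught, n_catchers = 29, 0.224, 15

# Upper tail: a given catcher throws out 16 or more of 29 runners.
p_16_plus = sum(binom_pmf(k, n_attempts, p_caught)
                for k in range(16, n_attempts + 1))

# At least 1 of the 15 catchers does so, assuming independence.
p_any_catcher = 1 - (1 - p_16_plus) ** n_catchers

print(f"{p_16_plus:.6f}")      # 0.000128
print(f"{p_any_catcher:.5f}")  # 0.00192
print(1 / p_any_catcher)       # roughly 520 seasons between occurrences
```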

3.17 COMPUTATION

Probability calculations for the two distributions considered in this chapter, the binomial and the normal distributions, can be easily carried out in Excel.

First, consider the binomial distribution. In Section 3.16, when analyzing the save conversion oddity, it was noted that if X is a random variable with a binomial distribution with n = 10 and π = 0.68, then

P(X ≤ 4) = 0.064.

In Excel, this probability can be calculated using the function BINOM.DIST. For a random variable Y that has a binomial distribution with parameters n and π, P(Y ≤ a) can be obtained from

BINOM.DIST(a, n, π, TRUE);

the “TRUE” in the statement refers to the fact that we want the cumulative probability P(Y ≤ a) rather than the individual probability P(Y = a), which would be calculated using

BINOM.DIST(a, n, π, FALSE).

Therefore, in the save conversion example, P(X ≤ 4) can be obtained using

BINOM.DIST(4, 10, 0.68, TRUE),

which returns the value 0.063715.
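For readers working outside Excel, the distinction between the cumulative and individual forms can be mirrored in Python. The function `binom_dist` below is a hypothetical analogue written for this illustration; it is not an Excel or Python library function, and it only imitates the argument order of BINOM.DIST.

```python
from math import comb

def binom_dist(a, n, p, cumulative):
    """Hypothetical analogue of BINOM.DIST(a, n, p, cumulative)."""
    def pmf(k):
        # Individual probability P(Y = k).
        return comb(n, k) * p**k * (1 - p)**(n - k)

    if cumulative:
        return sum(pmf(k) for k in range(a + 1))  # P(Y <= a)
    return pmf(a)                                 # P(Y = a)

print(f"{binom_dist(4, 10, 0.68, True):.6f}")   # 0.063715
print(f"{binom_dist(4, 10, 0.68, False):.4f}")  # 0.0482
```

The second call returns only the probability of exactly 4 save conversions, which is why it is smaller than the cumulative value.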

Now, consider the normal distribution. Let X denote a random variable with a normal distribution with mean µ and standard deviation σ. Then, P(X < a) can be obtained using the Excel command

NORM.DIST(a, µ, σ, TRUE).

A probability of the form P(−a < X < a) can be expressed as

P(X < a) − P(X < −a).

Therefore, it can be calculated using

NORM.DIST(a, µ, σ, TRUE) − NORM.DIST(−a, µ, σ, TRUE).

For instance, Table 3.4 gives P(−1 < Z < 1) where Z is a standard normal random variable, that is, a random variable with a normal distribution with mean 0 and standard deviation 1. This probability can be calculated using

NORM.DIST(1, 0, 1, TRUE) − NORM.DIST(−1, 0, 1, TRUE).
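The same difference of cumulative probabilities can be computed in Python with the standard library's `statistics.NormalDist` class. The wrapper `norm_dist` below is a hypothetical helper named here to parallel the cumulative form of NORM.DIST; it is a sketch, not an Excel function.

```python
from statistics import NormalDist

def norm_dist(a, mu, sigma):
    """P(X <= a) for a normal X, paralleling NORM.DIST(a, mu, sigma, TRUE)."""
    return NormalDist(mu, sigma).cdf(a)

# P(-1 < Z < 1) for a standard normal random variable Z.
p_within_one_sd = norm_dist(1, 0, 1) - norm_dist(-1, 0, 1)
print(round(p_within_one_sd, 4))  # 0.6827
```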

3.18 SUGGESTIONS FOR FURTHER READING

Probability theory is an important area of mathematics with applications in a wide range of fields. There are two aspects of probability: the technical side, which focuses on the mathematical properties of probability functions and random variables, and the intuitive side, which focuses on understanding randomness and the role it plays in many fields of study, including sports. For further reading, the works of Grinstead and Snell (1997) and Ross (2006) are detailed introductions to the mathematics of probability, suitable for readers with strong math backgrounds and a desire to understand the technical details behind probability theory. Mlodinow (2008) does an excellent job of describing the intuition behind probability theory and randomness; this book is suitable for a general audience and does not require a background in mathematics.

Win probabilities and expected points are important general techniques used in many sports. See the works of Tango, Lichtman, and Dolphin (2007, Chapter 1); Click (2006); and Woolner (2006) for applications in baseball; for applications in football, see Winston (2009, Chapters 21 and 24), Goldner (2012), and the Advanced Football Analytics website (http://www.advancedfootballanalytics.com).

The method of adjustment described in Section 3.11 is known as poststratification; see the work of Wainer (1989) for a detailed discussion of the pros and cons of this approach, as well as some alternative methods. Poststratification can also be used to adjust for a continuous variable by grouping the variable into classes, similar to the method used in Section 3.12 (although in that case, the continuous variable, the length of a field goal attempt, is not readily available); see Cochran’s (1968) work.

The Z-score approach to comparing players from different eras presented in Section 3.14 was used by Lependorf (2012) to compare MLB players from different eras and by Silver (2006a) in his comparison of Babe Ruth and Barry Bonds. Lederer (2009) uses Z-scores in his method for ranking MLB pitchers.

Streaks are a popular topic for discussion in sports. Gilovich (1991, Chapter 2) discusses the tendency to try to explain streaks in terms of some underlying pattern or theory, rather than as simply random occurrences. Moskowitz and Wertheim (2011, pp. 215–229) and Winston (2009, Chapter 11) discuss the properties of streaks in the context of sports. The technical results on streaks in Section 3.15 are based on Schilling’s (1990) work.


4 Statistical Methods

4.1 INTRODUCTION

In an ideal world, we would have an unlimited amount of data, and all relevant questions could be answered with certainty. Is Tom Brady better than Peyton Manning? Have them play hundreds and hundreds of games with the same teammates against similar opponents and analyze the results. Of course, in the real world, this is not possible, and we have to base our analyses on the available data.

Statistical methods play at least two roles in these situations. First, they provide methods for extracting the maximum amount of information from a set of data. Second, they give us a way to quantify the uncertainty that results from having to base these conclusions on such limited data.

The goal of this chapter is to give an overview of statistical reasoning and the type of statistical methods that are useful in analyzing sports data.

4.2 USING THE MARGIN OF ERROR