Probability theory is useful for understanding the role of randomness in sports statistics. In this section, we consider a detailed example of this by looking at how the variability in outcomes naturally leads to “streaks.” Streaks, such as consecutive game hitting streaks or consecutive pass completion streaks, always seem to be of great interest in sports. However, an analysis of the probability theory underlying such streaks shows that, even when the outcomes are totally random, relatively long streaks are not uncommon.
We consider the same basic scenario we considered when describing the binomial distribution: We have an experiment and event A of interest. Define a random variable X such that X = 1 if A occurs and X = 0 otherwise; let π = P(A) = P(X = 1). Suppose that we observe a sequence of n independent experiments and let $X_1, X_2, \ldots, X_n$ be the corresponding values of X. Then, $X_1, X_2, \ldots, X_n$ is a sequence of independent random variables, each taking the value 0 or 1 depending on whether or not A occurs in the experiment. In Section 3.13, we considered $S = X_1 + X_2 + \cdots + X_n$, which has a binomial distribution.
Here, we are concerned with the longest consecutive streak of ones in $X_1, X_2, \ldots, X_n$, a random variable that we will denote by L. The distribution of L will depend on two parameters: n, the number of experiments under consideration, and π, the probability that A occurs in any one experiment. For instance, consider a hitting streak in baseball. We take as the experiment a game in which the batter plays, so that n is the number of games played and π is the probability of a hit in any given game.
For given values of n and π, it is possible to calculate the probability distribution of L. For instance, suppose that n = 3 and π = 1/2. There are eight possibilities for a sequence of 3 ones and zeros: (0,0,0), (0,0,1), (0,1,0), (0,1,1), (1,0,0), (1,0,1), (1,1,0), (1,1,1). Each of these has probability 1/8. The corresponding values of L are 0, 1, 1, 2, 1, 1, 2, 3, respectively. Therefore, P(L = 0) = 1/8, P(L = 1) = 1/2, P(L = 2) = 1/4, P(L = 3) = 1/8. The case for general n and π is much more complicated, and this simple method of determining the distribution cannot be used, but the basic idea is the same, and there are published formulas that give the distribution.
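The enumeration for n = 3 and π = 1/2 can be checked with a few lines of Python. This is only an illustrative sketch: the function names `longest_run` and `streak_distribution_by_enumeration` are made up for this example, and listing all 2^n sequences is practical only for small n.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def longest_run(seq):
    """Length of the longest run of consecutive ones in a 0/1 sequence."""
    best = current = 0
    for x in seq:
        current = current + 1 if x == 1 else 0
        best = max(best, current)
    return best


def streak_distribution_by_enumeration(n, pi):
    """Distribution of L found by listing all 2**n sequences (small n only)."""
    dist = Counter()
    for seq in product((0, 1), repeat=n):
        ones = sum(seq)
        # Independent trials: probability is pi^(#ones) * (1 - pi)^(#zeros).
        dist[longest_run(seq)] += pi**ones * (1 - pi) ** (n - ones)
    return dict(dist)


dist = streak_distribution_by_enumeration(3, Fraction(1, 2))
for a in sorted(dist):
    print(a, dist[a])  # prints 0 1/8, 1 1/2, 2 1/4, 3 1/8 (one pair per line)
```

Using `Fraction` keeps the probabilities exact, so they can be compared directly with the values 1/8, 1/2, 1/4, and 1/8 given above.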
Consider the case of a hitting streak. The analysis depends heavily on the player under consideration, so let us look at Miguel Cabrera in the 2011 season. Cabrera played in 161 games, so take n = 161. Of those 161 games, Cabrera had at least one hit in 127 games, so let us take π = 127/161 = 0.79. Using these values, the distribution of L, the longest hitting streak of the season, is given in Table 3.9. This distribution gives the probability of a hitting streak of length L = a under the assumptions that the probability that Cabrera gets a hit in a given game is 0.79 and that the game-to-game results are independent.
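The text does not say how the distribution for n = 161 was computed. As one possible approach (an assumption of this sketch, not necessarily the published formulas mentioned earlier), P(L ≤ k) can be found by tracking the length of the current streak game by game and discarding any path whose streak exceeds k. The function names below are hypothetical.

```python
def prob_longest_streak_at_most(n, pi, k):
    """P(L <= k) for n independent trials with success probability pi.

    state[j] is the probability that the current streak has length j
    and that no streak longer than k has occurred so far.
    """
    state = [1.0] + [0.0] * k
    for _ in range(n):
        new_state = [0.0] * (k + 1)
        # A failure resets the current streak to zero from any state.
        new_state[0] = (1 - pi) * sum(state)
        # A success extends the streak by one; streaks that would exceed k are dropped.
        for j in range(k):
            new_state[j + 1] = pi * state[j]
        state = new_state
    return sum(state)


def prob_longest_streak_equals(n, pi, a):
    """P(L = a), obtained as P(L <= a) - P(L <= a - 1)."""
    below = prob_longest_streak_at_most(n, pi, a - 1) if a > 0 else 0.0
    return prob_longest_streak_at_most(n, pi, a) - below


# Values used in the text for Cabrera's 2011 season.
n, pi = 161, 0.79
print(f"P(L = 17)  = {prob_longest_streak_equals(n, pi, 17):.3f}")
print(f"P(L >= 20) = {1 - prob_longest_streak_at_most(n, pi, 19):.3f}")
```

The printed values can be compared with the corresponding entries in Table 3.9. For n = 3 and π = 1/2, the same functions give the probabilities 1/8, 1/2, 1/4, and 1/8 found by enumeration.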
Based on this distribution, we see that the probability that Cabrera has a hitting streak of at least 20 games is about 0.25 (obtained by adding the probabilities of a streak of length 20 games, 21 games, and so on), so that Cabrera’s hitting streak of 17 games in 2011 is not particularly long given his batting ability. In fact, the mean value of L based on n = 161 and π = 0.79 is 16.9, so that Cabrera’s 17-game hitting streak is almost exactly what would be expected from a batter with his ability if results in different games are independent. Based on this analysis, we could conclude that there is little evidence of “streakiness” in Cabrera’s batting.
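The figure of about 0.25 can be checked by summing the relevant entries of Table 3.9. The short sketch below copies the table's probabilities into a dictionary; the variable names are illustrative only, and the keys "<10" and ">30" stand for the table's two open-ended rows.

```python
# Probabilities P(L = a) copied from Table 3.9.
table_3_9 = {
    "<10": 0.018, 10: 0.031, 11: 0.053, 12: 0.074, 13: 0.089, 14: 0.096,
    15: 0.095, 16: 0.089, 17: 0.080, 18: 0.069, 19: 0.059, 20: 0.049,
    21: 0.040, 22: 0.033, 23: 0.026, 24: 0.021, 25: 0.017, 26: 0.013,
    27: 0.010, 28: 0.008, 29: 0.006, 30: 0.005, ">30": 0.019,
}

# Add the rows for streaks of 20 through 30 games plus the ">30" row.
p_at_least_20 = sum(
    p for a, p in table_3_9.items()
    if a == ">30" or (isinstance(a, int) and a >= 20)
)
print(round(p_at_least_20, 3))  # 0.247
```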
The distribution in Table 3.9 is based on Cabrera’s statistics. Properties of hitting streaks in general can be derived from some basic assumptions. Take n = 162, the length of an MLB season. To determine π, suppose that the player has 4 at bats per game, and the probability of a hit on any given at bat is r; for example, for a .300 hitter, r = 0.3. Then, the probability of no hits in 4 at bats is (1 − r)(1 − r)(1 − r)(1 − r); therefore, the probability of at least one hit in a game is
$$\pi = 1 - (1 - r)^4.$$
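As a small illustration, the per-game hit probability can be computed for the three batting averages considered in this section. The function name `prob_hit_in_game` is made up for this sketch, and the default of 4 at bats per game is the assumption stated in the text.

```python
def prob_hit_in_game(r, at_bats=4):
    """Probability of at least one hit in a game, assuming independent at bats."""
    return 1 - (1 - r) ** at_bats


for r in (0.200, 0.250, 0.300):
    print(f"{r:.3f} hitter: pi = {prob_hit_in_game(r):.3f}")
# prints 0.590, 0.684, and 0.760 for the three hitters
```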
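TABLE 3.9 Distribution of the longest hitting streak based on Cabrera’s data

| a | P(L = a) |
|---|---|
| <10 | 0.018 |
| 10 | 0.031 |
| 11 | 0.053 |
| 12 | 0.074 |
| 13 | 0.089 |
| 14 | 0.096 |
| 15 | 0.095 |
| 16 | 0.089 |
| 17 | 0.080 |
| 18 | 0.069 |
| 19 | 0.059 |
| 20 | 0.049 |
| 21 | 0.040 |
| 22 | 0.033 |
| 23 | 0.026 |
| 24 | 0.021 |
| 25 | 0.017 |
| 26 | 0.013 |
| 27 | 0.010 |
| 28 | 0.008 |
| 29 | 0.006 |
| 30 | 0.005 |
| >30 | 0.019 |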
Table 3.10 gives some general properties of the distribution of L for a .200 hitter, a .250 hitter, and a .300 hitter. From this, we see that the probability that a .200 hitter has a hitting streak of more than 20 games in a season is about 0.001; that is, it is extremely unlikely for a .200 hitter to have a hitting streak longer than 20 games based solely on the random nature of batting. For a .300 hitter, on the other hand, the average longest hitting streak for the season is about 15 games, and a hitting streak of more than 20 games is not uncommon (probability of about 0.1) based only on the randomness of game-to-game results.
Exact calculation of the distribution of the longest streak is difficult and requires specialized software that is not typically available in spreadsheets or statistical packages. However, there is a simple expression for a “typical streak” based on the values of n and π, given by
$$-\frac{\ln(n(1 - \pi))}{\ln(\pi)}.$$
Here, ln represents the natural logarithm function. For instance, for the Cabrera example, where n = 161 and π = 0.79,
$$-\frac{\ln(161(1 - 0.79))}{\ln(0.79)} = 14.9,$$
which is a reasonable value, given the information in Table 3.9.
In addition to usefulness for numerical calculations, this expression for a typical streak is useful for understanding the properties of streaks. Note that the typical value does not depend heavily on the value of n because the ln function changes fairly slowly with n for the values of n typically encountered in sports.
For instance, for a given season, the typical length of a hitting streak for Miguel Cabrera was just shown to be 14.9 games. Now, suppose that we are interested in his longest hitting streak over the past 8 years, assuming that the values of n and π that we used for 2011 apply for each year in that entire period. Then, n = 1288 and π = 0.79 and using the preceding equation, the typical length of his longest hitting streak over that period is 23.8 games. That is, although the time period is eight times as long, the typical longest streak is less than twice that of a single year. In fact, during that time period, Cabrera’s longest hitting streak was 20 games.
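The two typical-streak values quoted above can be reproduced directly from the expression. This is a minimal sketch; the function name `typical_streak` is chosen here for illustration.

```python
import math


def typical_streak(n, pi):
    """Typical longest streak, -ln(n(1 - pi)) / ln(pi), for n trials."""
    return -math.log(n * (1 - pi)) / math.log(pi)


print(round(typical_streak(161, 0.79), 1))   # 14.9 (one season)
print(round(typical_streak(1288, 0.79), 1))  # 23.8 (eight seasons)
```

Because n enters only through a logarithm, multiplying n by 8 adds the fixed amount ln(8)/(−ln(0.79)), roughly 8.8 games, to the typical streak rather than multiplying it.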
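TABLE 3.10 Distributions of streak length

| Length | Batting average .200 | Batting average .250 | Batting average .300 |
|---|---|---|---|
| <10 | 0.563 | 0.167 | 0.017 |
| 10 to 15 | 0.424 | 0.730 | 0.609 |
| 16 to 20 | 0.012 | 0.087 | 0.269 |
| >20 | 0.001 | 0.015 | 0.105 |
| Mean | 8.6 | 11.4 | 15.0 |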