INSTITUTO TECNOLOGICO Y DE ESTUDIOS SUPERIORES DE MONTERREY
CAMPUS MONTERREY
DIVISION DE ELECTRONICA, COMPUTACION, INFORMACION Y COMUNICACIONES
PROGRAMA DE GRADUADOS EN ELECTRONICA, COMPUTACION, INFORMACION Y COMUNICACIONES
DISCRETE HEAVY TAILED DISTRIBUTIONS FOR NETWORK TRAFFIC MODELING
THESIS
PRESENTED AS PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE IN ELECTRONIC ENGINEERING MAJOR IN TELECOMMUNICATIONS
SALVADOR VILLARREAL REYES
JUNE 2001
I want to thank the Instituto Tecnológico y de Estudios Superiores de Monterrey for the opportunity to study for a Master's degree in Electronic Engineering, with a major in Telecommunications, at this institute.
I am grateful to my thesis advisor, David Muñoz Rodriguez, Ph. D., for his support, collaboration and interest in the realization of this thesis.
I sincerely thank the members of my thesis committee, Cesar Vargas Rosales, Ph. D., and Jose Ramon Rodriguez Cruz, Ph. D., for their comments and help, which contributed to improving this thesis.
To all my teachers at the ITESM.
I also want to thank Engineer Liborio Alba Torres for his help at the beginning of my graduate studies.
To my ITD teachers Francisco Godínez, Miguel Esparza, Aurelio Castillo, Raul Barraza, among others not mentioned here.
To all my friends from Durango: Marco, Billy, Escalante, Ruano, Zaira, Polo, Salvador, Gabriel, Adanai....
To all my friends at the Center for Electronics and Telecommunications: Ramon, Fernando, Shogun, Mayela, Luis, Elsa, Quique, Anabel, Enrique, Daniel, Gerson, Jhonathan, Esteban, Fredy, Samuel, Francisco....
In this work we derive discrete heavy tailed distributions that can be used to model the holding times or the ON and OFF periods observed in present-day network traffic. We obtain two discrete distributions, namely the zeta prime distribution and the beta CC distribution. The zeta prime distribution fits the continuous Pareto distribution (normally used to fit the empirical distributions obtained from the statistical analysis of computer network traffic) better than the Zipf distribution does. The beta CC distribution is obtained from an urn model that could be applied to several traffic characteristics such as the holding times, the ON and OFF periods, etc. Finally, we present an entropy analysis for the ON periods that can be used to estimate the probability of either continuing the holding time or stopping the call in the next time slot.
SUMMARY
In this work we derive discrete heavy tailed distributions that can be used to model the holding times or the ON and OFF periods observed in current network traffic. Two discrete distributions were obtained, called zeta prime and beta CC. The zeta prime distribution fits the continuous Pareto distribution (normally used to fit empirical data obtained from the statistical analysis of computer network traffic) better than the Zipf distribution does. The beta CC distribution is obtained from an urn model that can be applied to several traffic characteristics such as the holding times, the ON and OFF periods, the lengths of navigation trails on the Internet, etc. Finally, an entropy analysis is presented that can be used to estimate the probability of continuing or not continuing a call in the next time slot.
Contents
Chapter 1
Introduction
1.1 Objective
1.2 Justification
1.3 Contribution
Chapter 2
Heavy Tailed Phenomena on Computer Networks
2.1 A Brief Introduction to Heavy Tailed Distributions
2.2 Self-Similarity
2.3 Heavy Tails on Network Traffic
2.3.1 Heavy Tails on the Ethernet Traffic
2.3.2 Heavy Tails on the WWW
2.3.3 Heavy Tails on the Telephone Network
Chapter 3
Zipf-Pareto Distributions
3.1 The Pareto Distribution
3.2 The Zipf Distribution
3.3 LLCD Plots
3.3.1 Examples of Empirically Obtained LLCD Plots
Chapter 4
A Discrete Heavy Tailed Distribution for Network Traffic Modeling
4.1 ON and OFF Periods
4.1.1 Discrete ON and OFF Periods
4.2 Derivation of the Zeta Prime Distribution
4.3 Graphical Comparison between the Pareto, Zipf and Zeta Prime Distributions
Chapter 5
A Heavy Tailed Urn Model for Network Traffic Modeling
5.1 The Residual Lifetimes
5.2 The Traffic Source Urn Model
5.2.1 The Urn Model
5.2.2 Derivation of the Holding Time Probability Distribution
5.3 The Beta CC Distribution
5.3.1 CDF and CCDF of the Beta CC Distribution
5.3.2 Verification of the Heavy Tail Behavior of the Beta CC Distribution
5.4 Graphical Comparison between the Beta CC, Zipf and Pareto Distributions
5.5 Other Possible Applications of the Urn Model
5.5.1 The Urn Model Applied to the ON/OFF Sources
5.5.2 The Urn Model Applied to World Wide Web Surfing
Chapter 6
Entropy Analysis for Heavy Tailed Distributed Discrete Holding Times
6.1 Entropy of a Traffic Source
6.2 Entropy of a Traffic Source with Geometrically Distributed Holding Times
6.3 Entropy of a Traffic Source with Zipf Distributed Holding Times
6.4 Entropy of a Traffic Source with Zeta Prime Distributed Holding Times
6.5 Entropy of a Traffic Source with Beta CC Distributed Holding Times
Conclusions and Future Work
References
Appendix A
Long-Range Dependence and the Noah Effect
A.1 Long Range Dependence
A.2 The Noah Effect
Appendix B
The Exponential and the Geometric Distributions
B.1 The Exponential Distribution
B.2 The Geometric Distribution
B.2.1 Geometrically Distributed Discrete Holding Times
B.2.2 Discrete Time ON/OFF Source with Geometrically Distributed ON and OFF Periods
Appendix C
Checking for Short or Long Range Dependence of the Urn Model
Appendix D
The Waring Distribution
List of Figures
Figure 3.1. The probability density function (pdf) and cumulative distribution function (cdf) of the Pareto distribution
Figure 3.2. The generalized Pareto distribution for $\alpha = 1.2$
Figure 3.3. Plot of the pmf of a random variable with Zipf distribution and $\alpha = 1.2$
Figure 3.4. Plot of the cdf of a random variable with Zipf distribution and $\alpha = 1.2$
Figure 3.5. LLCD plot of the Pareto distribution for several values of $\alpha$
Figure 3.6. LLCD plot of the Zipf distribution for several values of $\alpha$
Figure 3.7. LLCD plot of the Exponential and Pareto distributions
Figure 3.8. LLCD plots taken from [Cro97]
Figure 3.9. LLCD plot taken from [Cro98]
Figure 3.10. LLCD plots taken from [Wil97]
Figure 4.1. Discrete OFF Period
Figure 4.2. Discrete ON Period
Figure 4.3. pmf of a random variable X with zeta prime distribution for $\alpha = 1.2$
Figure 4.4. cdf of a random variable X with zeta prime distribution for $\alpha = 1.2$
Figure 4.5. LLCD plot of the zeta prime distribution for several values of $\alpha$
Figure 4.6. LLCD plot of the Geometric and zeta prime distributions
Figure 4.7. Plot of the zeta prime and Zipf pmfs for $\alpha = 1.2$
Figure 4.8. Plot of the zeta prime, Zipf and Pareto cdfs for $\alpha = 1.2$
Figure 4.9. LLCD plot of the zeta prime, Zipf and Pareto distributions
Figure 5.1. Probability that a Rayleigh distributed holding time exceeds a residual lifetime, x = 1, once it has lasted for time t, for $\alpha = 1$
Figure 5.2. Probability that a Pareto distributed holding time exceeds a residual lifetime, x = 1, once it has lasted for time t, for $\alpha = 1$
Figure 5.3. The urn model
Figure 5.4. pmf of a beta CC distributed random variable, X, for several values of $\alpha$ and $\delta$
Figure 5.5. LLCD of a beta CC distributed random variable, X, for several values of $\alpha$ and $\delta$
Figure 5.6. Plots of the beta CC ccdf (5.41) and the approximation (5.45)
Figure 5.7. Plots of the beta CC, Zipf and Pareto cdfs for $\alpha = 1.2$
Figure 5.8. LLCD plots of the beta CC, Zipf and Pareto distributions for $\alpha = 1.2$
Figure 5.9. LLCD plots of the beta CC, Zipf and log-logistic distributions for $\alpha = 1.2$
Figure 6.1. Information source
Figure 6.2. Entropy of a traffic source with Zipf distributed holding times for $\alpha = 1.2$
Figure 6.3. Probabilities of stopping or continuing once we have lasted for time k, for Zipf distributed holding times with $\alpha = 1.2$
Figure 6.4. Entropy of a traffic source with zeta prime distributed holding times for $\alpha = 1.2$
Figure 6.5. Probabilities of stopping or continuing once we have lasted for time k, for zeta prime distributed holding times with $\alpha = 1.2$
Figure 6.6. Entropy and conditional probabilities for a traffic source with beta CC distributed holding time
Figure B.1. The probability density function (pdf) and cumulative distribution function (cdf) of the exponential distribution
Figure B.2. The probability mass function (pmf) and cumulative distribution function (cdf) of the geometric distribution
Figure B.3. Markov chain model of an ON/OFF source
List of Tables
Table 4.1. Values of the cdf of a random variable X with zeta prime, Zipf and Pareto distributions
Table 5.1. Comparison between the beta CC ccdf (5.41) and the approximation (5.54) for two values of $\alpha$ and $\delta$
Table 5.2. Values of the cdf of a random variable X with beta CC, Zipf and Pareto distributions for $\alpha = 1.2$
Chapter 1
Introduction
Over the last decade, computer networks have experienced explosive growth, mainly due to their adoption by commercial, educational, governmental and home users as a way to exchange, distribute and obtain information that is rich in multimedia content.
This growth, the increasing demand for digital transmission of real-time services such as voice or video, and the deep penetration of the Internet into all areas of today's world make the problems of network engineering, capacity planning, network dimensioning and performance evaluation particularly acute. This is why we need to develop new models that allow us to understand the behavior of the traffic present in networks such as Ethernet and in applications such as the WWW, as well as to carry out the design and analysis of networks.
1.1 Objective
The objective of this work is to develop models that lead to discrete heavy tailed probability distributions, which can be used to model the holding times or the ON and OFF periods observed in present-day network traffic. We also introduce an entropy analysis that can be used to estimate the length of the ON periods given the time that has already elapsed.
1.2 Justification
The probability distributions commonly used to model voice traffic, such as the Poisson or exponential distributions, do not fit well when we try to use them to model some characteristics of present computer network traffic, such as the holding times, the lengths of the ON and OFF periods, the sizes of the files available on web servers, the transmission durations of files, etc.; see [Bar99], [Cro98], [Pax95], [Wil97].
Moreover, several studies show that for this kind of traffic the so-called heavy tailed distributions appear to be the proper distributions to use for the holding times, file lengths, Ethernet packet count per time unit, etc.; see [Bar99], [Cro98], [Duf94], [Pax95], [Res99]. Most of these works assume a continuous heavy tailed distribution for these quantities. The justification for this work is that we derive two discrete distributions, the zeta prime distribution and the beta CC distribution, whose main characteristic is that they are heavy tailed and which can therefore be used to fit the empirical distributions obtained in [Bar99], [Cro98], [Duf94], [Pax95], [Res99].
1.3 Contribution
At the time of writing this thesis we had not found any published work that treated the distribution of the holding times or the ON/OFF periods from a discrete point of view. The vast majority of works dealing with discrete heavy-tailed distributions refer to the distribution of the access frequencies of WWW URLs when they are ranked by relative popularity [Bar99], [Cro98], [Cun95]. Also, in [Lev01] a discrete heavy tailed distribution was derived from a model describing the navigation trails (e.g., the number of hyperlinks followed in a navigation session) made by WWW surfers.
In addition, the methodology presented in this thesis can be applied to other types of discrete heavy tailed phenomena present on the WWW, such as the number of file requests per user or the Ethernet packet count per time unit.
Chapter 2
Heavy Tailed Phenomena on Computer Networks
Over the last decade, several experimental observations showed that Poisson models fail to describe the behavior and/or the statistics of computer network traffic such as Ethernet traffic [Pax95]. This is why, before describing the model proposed here, we give a brief introduction to several important results from the statistical analysis of the traffic on current computer networks, such as the Internet, Ethernet LANs, etc.
It is well known that for the telephone service the traffic processes are either independent or have temporal correlations that decay exponentially; it is also known that the traffic distributions have exponentially decaying tails. For data networks, however, we encounter high temporal variability in traffic processes, captured by long-range dependence, i.e., the autocorrelations have a power-law decay. Also, extreme forms of spatial variability can be described using heavy-tailed distributions with infinite variance.
It turns out that power-law behavior in time or space of some statistical traffic descriptors often causes the corresponding traffic processes to exhibit fractal characteristics.
In this chapter we give a brief introduction to the concepts of heavy-tailed distributions and self-similarity, and discuss the main research findings related to traffic behavior.
In the following chapters we will relate the distributions obtained in this thesis to the concepts introduced in this chapter.
2.1 A Brief Introduction to Heavy Tailed Distributions
Basically, a random variable X is said to be heavy tailed if its complementary cumulative distribution function (ccdf), also known as the survival function, satisfies
$$P[X > x] \sim c\,x^{-\alpha}, \quad \text{as } x \to \infty, \tag{2.1}$$
where $0 < \alpha < 2$ is the tail index or shape parameter, c is a positive constant, and the notation $a(x) \sim b(x)$ means $\lim_{x \to \infty} a(x)/b(x) = 1$. As we can see, the tail of the distribution decays hyperbolically, in contrast to distributions such as the exponential and Gaussian, which possess an exponentially decreasing tail.
An important aspect of heavy-tailed distributions is that for $0 < \alpha < 2$ they have infinite variance, and if $0 < \alpha < 1$ they also have unbounded mean. In traffic modeling and network analysis we are interested in the case $1 < \alpha < 2$ (the case of bounded mean) [Par00].
The distributions that satisfy (2.1) are known as Zipf-Paretian distributions [Zol86], [Lev01]. The simplest distributions exhibiting this property are the continuous Pareto distribution and the discrete Zipf distribution. The former has probability density function (pdf)
$$f(x) = \alpha k^{\alpha} x^{-(\alpha+1)}, \quad \alpha, k > 0,\ x \ge k, \tag{2.2}$$
where $0 < \alpha < 2$ is the shape parameter and k is called the location parameter. Its cumulative distribution function (cdf) is given by
$$F(x) = P[X \le x] = 1 - \left(\frac{k}{x}\right)^{\alpha}, \quad \alpha, k > 0,\ x \ge k. \tag{2.3}$$
The discrete Zipf distribution has probability mass function (pmf) given by
$$p(x) = \Pr[X = x] = c\,x^{-(\alpha+1)}, \quad \alpha > 0,\ x = 1, 2, \ldots, \tag{2.4}$$
where c satisfies
$$c = \left[\zeta(\alpha+1)\right]^{-1}, \tag{2.5}$$
and $\zeta(\alpha+1)$ is the Riemann zeta function evaluated at $\alpha+1$.
We leave the discussion of the basic properties of this kind of distribution for Chapter 3, while in the rest of this chapter we describe how these distributions are related to network traffic.
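As a numerical illustration of the tail condition (2.1), the following sketch checks that the ccdf of the Zipf distribution approaches a pure power law. It is only a sketch under stated assumptions: the helper names (`zeta_tail`, `zipf_ccdf`, `tail_ratio`) are hypothetical, the zeta sums are approximated with a truncated sum plus an Euler-Maclaurin remainder, and the comparison constant $c/\alpha$ comes from approximating the tail sum by an integral.

```python
def zeta_tail(s, start, terms=1000):
    """Approximate sum_{k >= start} k**(-s) for s > 1.

    The first `terms` terms are summed directly; the remainder is
    estimated with the leading Euler-Maclaurin corrections.
    """
    n = start + terms
    head = sum(k ** (-s) for k in range(start, n))
    remainder = n ** (1 - s) / (s - 1) + 0.5 * n ** (-s) + s * n ** (-s - 1) / 12
    return head + remainder


def zipf_ccdf(x, alpha):
    """Pr[X > x] for the Zipf pmf c * k**-(alpha + 1), k = 1, 2, ..."""
    s = alpha + 1
    return zeta_tail(s, x + 1) / zeta_tail(s, 1)


def tail_ratio(x, alpha):
    """Ratio of the exact ccdf to the power law (c / alpha) * x**(-alpha)."""
    c = 1.0 / zeta_tail(alpha + 1, 1)
    return zipf_ccdf(x, alpha) / ((c / alpha) * x ** (-alpha))


if __name__ == "__main__":
    # The ratio should move towards 1 as x grows, as required by (2.1).
    for x in (10, 100, 1000):
        print(x, tail_ratio(x, alpha=1.2))
```

The ratio tending to 1 is exactly the meaning of the symbol ~ in (2.1); the value $\alpha = 1.2$ is used only because it appears throughout the examples of this thesis.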
2.2 Self-Similarity
In this section we give a fundamental definition of self-similarity from [Sta97], to ease the presentation of the following sections.
From the traffic control point of view, self-similarity implies the existence of a correlation structure at a distance. Self-similarity is the conservation of a property of an object (a time series, for example) with respect to scaling in space and/or time [Par00].
Consider a stochastic process $\{X(t),\ t > 0\}$. The process is statistically self-similar (exactly self-similar) with parameter H, $0.5 < H < 1$, if for any real $a > 0$ the rescaled process $a^{-H} X(at)$ has the same statistical properties as $X(t)$, that is [Sta97],
$$E[X(t)] = \frac{E[X(at)]}{a^{H}}, \quad \text{mean}, \tag{2.6}$$
$$\mathrm{Var}[X(t)] = \frac{\mathrm{Var}[X(at)]}{a^{2H}}, \quad \text{variance}, \tag{2.7}$$
$$R_X(t,s) = \frac{R_X(at,as)}{a^{2H}}, \quad \text{correlation}, \tag{2.8}$$
where H is known as the Hurst parameter and is a measure of the degree of self-similarity, i.e., H = 0.5 means no self-similarity and H close to 1 means high self-similarity.
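As a small worked example of the variance scaling relation $\mathrm{Var}[X(at)] = a^{2H}\,\mathrm{Var}[X(t)]$, consider stretching the time axis by a factor $a = 10$ for a process with $H = 0.8$. Both values are arbitrary choices made here for illustration and are not taken from the cited works.

```latex
\mathrm{Var}[X(10t)] = 10^{2H}\,\mathrm{Var}[X(t)]
                     = 10^{1.6}\,\mathrm{Var}[X(t)]
                     \approx 39.8\,\mathrm{Var}[X(t)]
```

For comparison, with $H = 0.5$ the same stretching would multiply the variance only by $10^{1} = 10$, so a larger Hurst parameter means that variability persists much more strongly across time scales.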
2.3 Heavy Tails on Network Traffic
In this section we give an overview of some of the most representative work on the statistical analysis and modeling of network traffic. We briefly discuss the presence of heavy-tailed distributions in Ethernet traffic, the WWW and telephone networks.
2.3.1 Heavy Tails on the Ethernet Traffic
In [Lel94] it is shown that Ethernet traffic behavior is not well described by the Poisson model assumption. This work was continued in other papers such as [Wil97] and [Wil98].
In [Lel94] the statistical analysis of Ethernet traffic measurements taken at Bellcore over a 4-year period is reported. From this analysis, the authors concluded that Ethernet traffic is statistically self-similar. They proposed a stochastic model for this self-similar phenomenon by means of a renewal reward process, built through the aggregation of a sequence of independent identically distributed (iid) random variables (rv's) whose main characteristic is being heavy tailed.
We can think of [Wil97] and [Wil98] as the continuation of [Lel94]. In these works the authors explore the physical causes of the self-similar behavior of Ethernet traffic. They analyzed the traffic generated by individual sources or source-destination pairs and showed that the distributions of the strictly alternating ON (source transmitting) and OFF (source inactive) periods exhibit the so-called "Noah effect" (see Appendix A). In fact, they showed that the distribution of these ON and OFF periods is heavy tailed, with $\alpha \approx 1.7$ for ON periods and $\alpha \approx 1.2$ for OFF periods. Also, they proved that the superposition of many of these "ON/OFF" sources produces aggregated network traffic that exhibits "long-range dependence" (see Appendix A). They state that the self-similarity observed on the Ethernet is governed by the period (ON or OFF) having the heaviest tailed distribution.
The three papers referenced above also reported the presence of the Noah effect in WAN traffic, with an $\alpha$ value typically around 1.0 and often even below 1.0 (infinite mean) for the OFF periods and an $\alpha$ value around 2.0 for the ON periods.
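The strictly alternating ON/OFF source described above can be sketched in a few lines. This is only an illustrative sketch, not the model used in [Wil97]: it assumes continuous Pareto durations with minimum value 1, uses the shape values 1.7 (ON) and 1.2 (OFF) quoted above as defaults, and the function names are hypothetical.

```python
import random


def pareto_period(rng, alpha, k=1.0):
    """Draw one Pareto-distributed duration with shape alpha and minimum k."""
    return k * rng.paretovariate(alpha)


def on_off_periods(n_cycles, alpha_on=1.7, alpha_off=1.2, seed=0):
    """Return a list of (state, duration) pairs, strictly alternating ON/OFF."""
    rng = random.Random(seed)
    periods = []
    for _ in range(n_cycles):
        periods.append(("ON", pareto_period(rng, alpha_on)))
        periods.append(("OFF", pareto_period(rng, alpha_off)))
    return periods


if __name__ == "__main__":
    for state, duration in on_off_periods(5):
        print(f"{state:>3} {duration:8.3f}")
```

Running the sketch for many cycles typically shows a few very long periods among many short ones, which is the qualitative signature of the Noah effect.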
2.3.2 Heavy Tails on the WWW
The references used in this work on heavy tails in the WWW are [Bar99], [Cro98], [Cun95] and [Pax95]. All these works focus mainly on the empirical study of the WWW and the distributions obtained from that study. In contrast, [Lel94] and [Wil97] focus on the distributions that the empirical data present and only mention the self-similarity of the network traffic as a consequence of the ON or OFF periods being heavy tailed.
The studies in [Bar99], [Cro98] and [Cun95] showed that the distributions of the transmission times, the sizes of the files available on the web servers, the number of files transmitted through the network, the average number of requests vs. file size, the relative popularity of the web pages, etc., are heavy tailed.
The authors of [Cro98] argue that this "heaviness" of the tail of the distributions of WWW traffic is the main cause of the long-range dependence (i.e., self-similarity) observed in WWW traffic. They showed that the tail of the distribution of the ON times (corresponding to the transmission times) is heavier than that of the distribution of the OFF times (corresponding to the silent times), which means that the self-similarity of the WWW traffic is governed by the parameter $\alpha$ of the transmission times.
In [Pax95] it is also mentioned that holding times, packet interarrival times and frame sizes for variable bit rate video are heavy tailed.
2.3.3 Heavy Tails on the Telephone Network
In [Duf94] statistical analyses of CCSN/SS7 traffic data are reported. The result that most strongly caught our attention is that the exponential approximation for the holding time model seriously underestimates the number of very long calls. Already in 1994 they showed that the distribution of the holding times was heavy tailed, with an $\alpha$ value a little below 2.
Please note that in 1994 the Internet did not have the penetration that it has today.
Nowadays, since many telephone subscribers connect to Internet service providers via phone modem, their calls to the Internet tend to be longer than their voice calls. For this reason, we can safely assume that the heaviness of the holding time distribution tail has been increasing since 1994.
In Chapter 3 we give a more detailed definition of the continuous Pareto and discrete Zipf distributions and their main characteristics. We also introduce the log-log complementary distribution plot (LLCD plot), which is a useful tool to determine whether a given distribution is heavy-tailed.
Chapter 3
Zipf-Pareto Distributions
In this chapter we describe the main characteristics of the so-called Zipf-Pareto distributions. We also discuss the log-log complementary distribution plots (LLCD plots) as a means to check the heavy-tail behavior of a given distribution.
Most of the works mentioned in Chapter 2 use the continuous Pareto and discrete Zipf distributions to fit their empirical observations. For example, in [Pax95] it is shown that TELNET interpacket times and the sizes of FTPDATA bursts fit a Pareto distribution; in [Cun95] it is shown that the file requests and the relative popularity of web documents fit a Zipf distribution; in [Cro97] each Web browser is analyzed as an ON/OFF source and the data are found to fit a Pareto distribution; etc. For this reason, we use the Pareto and Zipf distributions as a reference for comparison with the distributions obtained in Chapter 4 of this work. We also use the LLCD plots to check whether the distributions that we obtain are heavy tailed.
As we stated before, a random variable X has a heavy tailed distribution if its complementary distribution function (also known as the survival function) satisfies
$$\bar F(x) = \Pr[X > x] \sim c\,x^{-\alpha}, \quad \text{as } x \to \infty,\ 0 < \alpha < 2, \tag{3.1}$$
where c is a constant, or a function that varies slowly at infinity and is asymptotically constant, and ~ is as in Chapter 2. The simplest distributions satisfying (3.1) are the continuous Pareto distribution and the discrete Zipf distribution.
Among the continuous heavy tailed distributions we can mention the Burr, beta of the second kind (generalized Pareto), log-gamma, log-hyperbolic, log-logistic and Fréchet distributions [Teu96].
3.1 The Pareto Distribution
The Pareto distribution is named after the economist Vilfredo Pareto, who in 1897 formulated a law stating that the distribution of income over a population should satisfy
$$N = k\,x^{-\theta}, \tag{3.2}$$
where N is the number of persons having income greater than or equal to x, and k and $\theta$ are parameters.
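There are several forms of the Pareto distribution [Joh94]. The most commonly used forms of the Pareto pdf are
$$f(x) = \alpha k^{\alpha} x^{-(\alpha+1)}, \quad x \ge k,\ \alpha, k > 0, \tag{3.3}$$
$$f(x) = \alpha (x+1)^{-\alpha-1}, \quad x \ge 0,\ \alpha > 0, \tag{3.4}$$
with corresponding cumulative distribution functions (cdf)
$$F(x) = 1 - \left(\frac{k}{x}\right)^{\alpha}, \quad x \ge k, \tag{3.5}$$
$$F(x) = 1 - (x+1)^{-\alpha}, \quad x \ge 0, \tag{3.6}$$
and ccdf
$$\bar F(x) = \left(\frac{k}{x}\right)^{\alpha}, \quad x \ge k, \tag{3.7}$$
$$\bar F(x) = (x+1)^{-\alpha}, \quad x \ge 0. \tag{3.8}$$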
Figure 3.1 shows the plots of the pdf (a) and cdf (b) of the Pareto distribution given in (3.4) and (3.6), respectively, for several values of $\alpha$. We can see in Figure 3.1(b) that as $\alpha$ gets smaller the cdf decreases.
A very important characteristic of a random variable (rv) X with Pareto distribution is that the m-th moment, $E[X^m]$, only exists for $m < \alpha$, meaning that for $\alpha < 2$ the distribution does not have finite variance, i.e., it is closely related to the Noah effect. Also, for $\alpha < 1$ it does not have a bounded mean.
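The moment condition can be verified with a short calculation. The following worked derivation is a sketch that assumes the Pareto form with location parameter k, i.e., the pdf $\alpha k^{\alpha} x^{-(\alpha+1)}$ on $x \ge k$.

```latex
E[X^m] = \int_k^{\infty} x^m \,\alpha k^{\alpha} x^{-(\alpha+1)}\,dx
       = \alpha k^{\alpha} \int_k^{\infty} x^{m-\alpha-1}\,dx
       = \frac{\alpha k^{m}}{\alpha - m}, \qquad m < \alpha
```

For $m \ge \alpha$ the exponent $m - \alpha - 1$ is at least $-1$, so the integral diverges. With $m = 1$ this gives the mean $\alpha k/(\alpha - 1)$ when $\alpha > 1$, and with $m = 2$ it shows that the second moment, and hence the variance, is infinite whenever $\alpha \le 2$.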
Figure 3.1. The probability density function (pdf) and cumulative distribution function (cdf) of the Pareto distribution. Panels: (a) pdf, (b) cdf; horizontal axis: x.
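There is another distribution called the generalized Pareto [Fel71], [Teu96] or beta prime [Joh94] distribution, with pdf
$$f(x) = \frac{\Gamma(\alpha+\theta)}{\Gamma(\alpha)\,\Gamma(\theta)}\, x^{\theta-1} (1+x)^{-(\alpha+\theta)}, \quad x > 0, \tag{3.9}$$
where $\Gamma(\cdot)$ is the gamma function, and $\alpha$ and $\theta$ are as in the previous cases.
There is no closed form for the cdf or ccdf of (3.9), and thus the cdf has the form
$$F(x) = \frac{\Gamma(\alpha+\theta)}{\Gamma(\alpha)\,\Gamma(\theta)} \int_0^x u^{\theta-1} (1+u)^{-(\alpha+\theta)}\,du. \tag{3.10}$$
Note that when $\theta = 1$, (3.9) and (3.10) reduce to (3.4) and (3.6), respectively, i.e., the generalized Pareto distribution reduces to the Pareto distribution.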
Figure 3.2. The generalized Pareto distribution for $\alpha = 1.2$.
Figure 3.2 shows the pdf (3.9) for fixed $\alpha$ and several values of $\theta$. Please note that the distribution appears to have the same tail behavior for different values of $\theta$, while there is a notable difference in the pdf for values near 0.
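The reduction of the generalized Pareto distribution to the Pareto distribution can be checked numerically. The sketch below assumes the standard beta prime density, with the second shape parameter written `theta` here, and approximates the cdf integral with a simple midpoint rule; the function names are hypothetical and the quadrature is chosen for brevity rather than efficiency.

```python
import math


def beta_prime_pdf(x, alpha, theta):
    """Beta prime density with shape parameters alpha (tail) and theta."""
    norm = math.gamma(alpha + theta) / (math.gamma(alpha) * math.gamma(theta))
    return norm * x ** (theta - 1) * (1 + x) ** (-(alpha + theta))


def pareto_pdf(x, alpha):
    """Pareto density alpha * (x + 1)**(-alpha - 1) on x >= 0."""
    return alpha * (x + 1) ** (-alpha - 1)


def beta_prime_cdf(x, alpha, theta, steps=20000):
    """Midpoint-rule approximation of the integral of the density on [0, x]."""
    h = x / steps
    return h * sum(beta_prime_pdf((i + 0.5) * h, alpha, theta) for i in range(steps))


if __name__ == "__main__":
    alpha = 1.2
    for x in (0.5, 1.0, 2.0):
        # With theta = 1 both pdf columns and both cdf columns should agree.
        print(x, beta_prime_pdf(x, alpha, 1.0), pareto_pdf(x, alpha),
              beta_prime_cdf(x, alpha, 1.0), 1 - (x + 1) ** (-alpha))
```

The midpoint rule never evaluates the density at x = 0, which keeps the sketch usable for values of `theta` below 1, where the density is unbounded at the origin.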
3.2 The Zipf Distribution
The Zipf distribution is named after G. K. Zipf, a Harvard linguistics professor, who observed that the relative occurrence of words used in English texts is inversely proportional to their rank; that is, the relative frequency of the r-th most popular word is given by
$$f(r) = \frac{C}{r^{\alpha}}, \tag{3.11}$$
where r stands for the rank of the word according to the number of times it appears, C is a constant, and $\alpha$ is a constant close to 1.
Equation (3.11) is known as Zipf's law, and there are several works trying to explain its theoretical foundation [Poo97]. There are also some works showing that the relative popularity of web page accesses follows this law [Bar99], [Cro98], [Cun95].
In this work, we are interested in the distribution derived from (3.11) (the Zipf distribution) and in the relationship, stated by some authors, that makes it the discrete analogue of the Pareto distribution [Joh92], [Pax95].
In particular, we are interested in the Zipf distribution whose pmf has the form
$$p(x) = \Pr[X = x] = c\,x^{-(\alpha+1)}, \quad x = 1, 2, \ldots,\ \alpha > 0, \tag{3.12}$$
or, in its shifted version,
$$p(x) = \Pr[X = x] = c\,(x+1)^{-(\alpha+1)}, \quad x = 0, 1, 2, \ldots,\ \alpha > 0, \tag{3.13}$$
with
$$c = \left[\zeta(\alpha+1)\right]^{-1}, \tag{3.14}$$
where $\zeta(\cdot)$ denotes the Riemann zeta function and $\alpha > 0$. Figure 3.3 shows the pmf plot of a random variable with Zipf distribution of the form (3.13) for $\alpha = 1.2$.
Figure 3.3. Plot of the pmf of a random variable with Zipf distribution and $\alpha = 1.2$.
As in the continuous case, for a random variable X with Zipf distribution the r-th moment, $E[X^r]$, only exists for $r < \alpha$. Thus, for $\alpha < 2$ the Zipf distribution does not have finite variance, and for $\alpha < 1$ it does not have a bounded mean.
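Unfortunately, there are no closed-form expressions for the cdf and ccdf of (3.12) and (3.13), because the summation of probabilities is a harmonic series. The cdf and ccdf of (3.12) are then given by
$$F(x) = \Pr[X \le x] = c \sum_{k=1}^{x} k^{-(\alpha+1)}, \quad x = 1, 2, 3, \ldots, \tag{3.15}$$
$$\bar F(x) = \Pr[X > x] = 1 - c \sum_{k=1}^{x} k^{-(\alpha+1)}, \quad x = 1, 2, 3, \ldots, \tag{3.16}$$
and for the shifted version (3.13) we obtain
$$F(x) = \Pr[X \le x] = c \sum_{k=0}^{x} (k+1)^{-(\alpha+1)}, \quad x = 0, 1, 2, \ldots, \tag{3.17}$$
$$\bar F(x) = \Pr[X > x] = 1 - c \sum_{k=0}^{x} (k+1)^{-(\alpha+1)}, \quad x = 0, 1, 2, \ldots. \tag{3.18}$$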
Figure 3.4 shows the cdf plot of a random variable with Zipf distribution of the form (3.16) with $\alpha = 1.2$.
Figure 3.4. Plot of the cdf of a random variable with Zipf distribution and $\alpha = 1.2$.
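Since the Zipf cdf has no closed form, in practice it is tabulated as a running sum of the pmf. The following is a minimal sketch of that computation for the unshifted form on x = 1, 2, ...; the helper names are hypothetical, and the zeta function is approximated by a truncated sum with an Euler-Maclaurin remainder rather than taken from a numerical library.

```python
from itertools import accumulate


def zeta(s, terms=1000):
    """Riemann zeta for s > 1: direct sum plus Euler-Maclaurin remainder."""
    n = terms + 1
    head = sum(k ** (-s) for k in range(1, n))
    return head + n ** (1 - s) / (s - 1) + 0.5 * n ** (-s) + s * n ** (-s - 1) / 12


def zipf_pmf(x, alpha):
    """p(x) = c * x**-(alpha + 1) for x = 1, 2, ..., with c = 1 / zeta(alpha + 1)."""
    return x ** (-(alpha + 1)) / zeta(alpha + 1)


def zipf_cdf_table(x_max, alpha):
    """Return [F(1), ..., F(x_max)] as running partial sums of the pmf."""
    c = 1.0 / zeta(alpha + 1)
    return list(accumulate(c * k ** (-(alpha + 1)) for k in range(1, x_max + 1)))


if __name__ == "__main__":
    cdf = zipf_cdf_table(10, alpha=1.2)
    for x, value in enumerate(cdf, start=1):
        print(x, round(value, 4), round(1 - value, 4))  # x, cdf, ccdf
```

The ccdf is obtained as one minus the cdf, so the same table serves for both quantities.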
3.3 LLCD Plots
An important tool used in [Bar99], [Cro98], [Cun95], [Wil97] and by several other authors is the so-called log-log complementary distribution plot (LLCD plot), which is a plot of the ccdf on log-log axes. The LLCD plot is used because heavy tailed distributions plotted on log-log axes have the property that
$$\frac{d \log \bar F(x)}{d \log x} \sim -\alpha \tag{3.19}$$
for large x [Cro98]. The LLCD plot shows a straight line on the right side of the graph when we plot the ccdf of a heavy tailed distribution. As an example, consider the ccdf defined by (3.7) with k = 1,
$$\bar F(x) = x^{-\alpha}, \quad x \ge 1.$$
As can be seen, $\log \bar F(x) = -\alpha \log(x)$ and Equation (3.19) is satisfied with equality; the LLCD plot is then defined by the ordered pairs
$$(\log x,\ -\alpha \log x),$$
which produce a straight line on a log-log plane, see Figure 3.5.
Figure 3.5. LLCD plot of the Pareto distribution for several values of $\alpha$.
Figure 3.6 shows the LLCD plot for a random variable with Zipf distribution and ccdf of the form (3.16),
$$\bar F(x) = 1 - c \sum_{k=1}^{x} k^{-(\alpha+1)}, \quad x = 1, 2, 3, \ldots.$$
Due to the form of (3.16), we first need to calculate $\bar F(x)$ and then plot the ordered pairs
$$(\log x,\ \log \bar F(x)),$$
whose points follow a straight line as in the case of the Pareto distribution (Figure 3.6).
Figure 3.6. LLCD plot of the Zipf distribution for several values of $\alpha$.
Note that $\alpha$ determines the slope of the straight line (the slope is $-\alpha$); thus we can use the slope as an estimator of $\alpha$ when we are dealing with empirical data plotted in this way.
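The slope-based estimate can be sketched as follows. This is an illustrative sketch only: it draws synthetic Pareto samples with a known shape of 1.2, builds the empirical LLCD points, and fits an ordinary least-squares line through all of them. The function names are hypothetical, and for real traffic data one would normally restrict the fit to the upper tail, where the plot is approximately linear.

```python
import math
import random


def llcd_points(samples):
    """Return (log10 x, log10 ccdf) pairs from the empirical ccdf of samples."""
    xs = sorted(samples)
    n = len(xs)
    # The largest sample is skipped because its empirical ccdf is zero.
    return [(math.log10(x), math.log10((n - i) / n))
            for i, x in enumerate(xs[:-1], start=1)]


def least_squares_slope(points):
    """Ordinary least-squares slope of y against x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


def estimate_alpha(samples):
    """Estimate the tail index as minus the LLCD slope."""
    return -least_squares_slope(llcd_points(samples))


if __name__ == "__main__":
    rng = random.Random(1)
    data = [rng.paretovariate(1.2) for _ in range(20000)]
    print(estimate_alpha(data))  # expected to be close to the true shape 1.2
```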
To see the differences between the LLCD plots of light tailed and heavy tailed random variables, Figure 3.7 shows the LLCD plot of an exponentially distributed random variable (see Appendix B) and of a random variable with Pareto distribution (as defined above). The pdf, cdf and ccdf of an exponentially distributed random variable are given by
$$f(x) = \lambda e^{-\lambda x}, \quad x \ge 0,\ \lambda > 0, \tag{3.20}$$
$$F(x) = 1 - e^{-\lambda x}, \quad x \ge 0,\ \lambda > 0, \tag{3.21}$$
$$\bar F(x) = e^{-\lambda x}, \quad x \ge 0,\ \lambda > 0, \tag{3.22}$$
respectively. The LLCD plot is now defined by the ordered pairs
$$(\log x,\ \log e^{-\lambda x}),$$
which do not define a straight line on the log-log plane.
Figure 3.7. LLCD plot of the Exponential and Pareto distributions.
We can see that the exponential distribution decays much faster in comparison with the Pareto distribution, meaning that the exponential distribution is light tailed.
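The contrast between the two tails can also be seen by computing the local slope of the LLCD plot at a few points. In the sketch below the slope is approximated by a finite difference on log-log axes; the function names, the rate 1.0 for the exponential distribution and the step ratio 1.01 are illustrative assumptions.

```python
import math


def local_llcd_slope(ccdf, x, ratio=1.01):
    """Finite-difference slope of log ccdf against log x around x."""
    x_hi = x * ratio
    return (math.log(ccdf(x_hi)) - math.log(ccdf(x))) / (math.log(x_hi) - math.log(x))


def pareto_ccdf(x, alpha=1.2):
    """ccdf x**(-alpha) on x >= 1."""
    return x ** (-alpha)


def exponential_ccdf(x, lam=1.0):
    """ccdf exp(-lam * x) on x >= 0."""
    return math.exp(-lam * x)


if __name__ == "__main__":
    for x in (1.0, 10.0, 100.0):
        print(x, local_llcd_slope(pareto_ccdf, x), local_llcd_slope(exponential_ccdf, x))
```

For the Pareto ccdf the slope stays at minus the shape parameter for every x, while for the exponential ccdf it becomes steeper in proportion to x, which is why its LLCD plot bends downwards instead of following a straight line.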
3.3.1 Examples of Empirically Obtained LLCD Plots
As we stated before, LLCD plots are useful when we want to determine whether an empirical distribution is heavy tailed. Next we present some figures taken from [Cro97], [Cro98] and [Wil97] showing the use of this kind of plot to reveal the heaviness of the tail of some traffic characteristics.
Figure 3.8 shows two plots taken from [Cro97]. The authors studied Web traffic comprising 130,140 transfers of Web documents. The data was obtained from 37 workstation-based Web browsers in the Boston University Computer Science Department; the NCSA Mosaic Web browser was modified to collect raw data on WWW traffic behavior. Figure 3.8(a) is the LLCD plot of the empirical distribution of transmission times in the WWW; we can see on the right side of the figure that the plot is almost linear, with a slope of -1.21 corresponding to $\alpha = 1.21$. Figure 3.8(b) corresponds to the LLCD plot of the inter-arrival times of URL requests, with a measured value of $\alpha = 1.5$.
Figure 3.8. LLCD plots taken from [Cro97]. Panels: (a) horizontal axis Log10(Transmission Time in Seconds); (b) horizontal axis log10(URL Interarrival Time in Seconds).
Figure 3.9 shows the LLCD plot of Web document popularity vs. number of references, taken from [Cro98]. The data was obtained by analyzing the number of references to 46,830 Web documents; a slope of -0.986 was calculated.
Figure 3.9. LLCD plot taken from [Cro98]. Horizontal axis: Document Rank.
Figure 3.10 was taken from [W1197]. The data used in this study was collected from two Ethernet traffic traces, generated by about 100 and 3,200 individual sources as the Bellcore Morristown Research and Engineering Center.
[Figure 3.10: panels (a) and (b); horizontal axis log10(x).]
Figure 3.10. LLCD plots taken from [Wil97]
Figure 3.10.(a) corresponds to the LLCD plot of the OFF periods. This plot is almost a straight line and can be well fitted with a Pareto distribution with parameter α = 1.2 (Figure 3.10.(b)). An important point is that [Cro97], [Wil97] and [Cro98] state that the physical explanation for the self-similarity observed in Ethernet and WWW traffic is the presence of these heavy tailed distributions.
In the next chapter we derive a discrete heavy tailed distribution based on (3.1), then we compare it with the distributions explained in this chapter, and finally we obtain its LLCD plot to check whether it is heavy tailed.
Chapter 4
A Discrete Heavy Tailed Distribution for Network Traffic Modeling
Teletraffic engineering emerged with the statistical characterization of telephone communications at the call level. In the early years of telephony it was found that call arrivals behave more or less as a Poisson process and that the voice call holding time can be well modeled by an exponential distribution. These models allowed efficient planning of the resources needed to obtain a certain quality of service, i.e., small blocking probability and small delay. In fact, traffic modeling is the main tool of any performance evaluation of telecommunication networks, [Mic97].
When computer networks were introduced in the 1960s it was assumed that the traffic generated by them was similar to that observed on telephone networks. However, due to the explosive growth of computer networks and the integration of multimedia services, it became necessary to introduce new models in order to capture the behavior of actual network traffic. These new models need to be accurate and capture the statistical properties of real-life traffic in order to provide a realistic foundation for network planning and traffic engineering, [Den96], [Mic97]. Traditional models for network traffic include ON/OFF sources with exponential or geometric distributions, the Markov Modulated Poisson Process (MMPP), fluid flow and Markov modulated fluid models, etc., [Ada97], [Mic97].
Nontraditional network traffic models are closely related to the self-similarity and the long-range dependence observed in network traffic measurements reported in works such as those described in Chapter 2. Among the models for self-similar and long-range dependent processes we can count Fractional Brownian Motion (FBM), Fractional Autoregressive Integrated Moving Average (ARIMA) processes, the superposition of ON/OFF sources with heavy tailed distributions and the M/G/∞ queue with heavy tailed service time distribution, [Ada97], [Mic97], [Pax95].
As mentioned in Chapter 2 and Chapter 3, most of these new models lead to traffic characterizations with heavy tailed distributions. In particular, it was found that the continuous Pareto distribution gives a good fit for the length of ON and OFF periods, holding times, WWW document sizes, packet interarrival times, WWW packet flow duration (we consider a flow to be the set of packets with the same source and destination IP address), etc., [Bar99], [Cha00], [Cro98], [Cro97], [Dun97], [Pax95], [Wil97].
In this chapter we explain the derivation of a discrete heavy tailed pmf, named the zeta prime distribution, that can be used as a discrete model of network traffic quantities such as those mentioned above. We also show that the zeta prime distribution offers a better fit to the continuous Pareto distribution than the Zipf distribution (considered by some authors as the discrete analogue of the Pareto distribution, [Joh92]). Our starting point is property (3.1), observed in the statistical analysis of Ethernet and WWW traffic reported in works such as [Bar99], [Cha00], [Cro98], [Dun97], [Pax95], [Wil97], etc.
In order to work with a specific problem, we focus our derivation to find a discrete heavy tailed distribution suitable for the ON/OFF sources widely used in WWW and Ethernet traffic modeling [Ada97], [Cro98], [Mic97], [Wil97]. However, we need to recall that this distribution can be used as a discrete model for any data fitted to a continuous Pareto distribution.
Before we start with the mathematical calculations we need to give a definition of what we consider is an ON and an OFF period in the next section.
4.1 ON and OFF Periods
Basically, an ON/OFF source sends information during active ON periods that are separated by silent OFF periods, [Mic97]. Traditional ON/OFF source models assume an exponential or geometric distribution for their ON and OFF periods (see Appendix B), [Ada97], [Cro98], [Wil97]. However, this assumption results in aggregated traffic inconsistent with the Ethernet and WWW traffic measurements reported in works such as [Cro98], [Wil97], [Pax95], etc. Moreover, the authors of [Wil97] show that the superposition of many ON/OFF sources with independent and identically Pareto distributed ON or OFF periods results in self-similar traffic consistent with the aforementioned measurements. In particular, if the ON and OFF periods are heavy tailed with parameters α_N and α_F respectively, the resulting series will be self-similar with H = (3 - min(α_N, α_F))/2, [Cro98].
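As a small worked illustration of the relation H = (3 - min(α_N, α_F))/2, the following Python sketch evaluates H for a pair of tail indices. The function name and the sample values 1.2 and 1.5 are our own illustrative assumptions; they are not taken from [Cro98].

```python
def hurst_from_tail_indices(alpha_on: float, alpha_off: float) -> float:
    """Hurst parameter H = (3 - min(alpha_on, alpha_off)) / 2.

    alpha_on and alpha_off are the heavy tail indices of the ON and OFF
    periods (hypothetical argument names chosen for this sketch).
    """
    if alpha_on <= 0 or alpha_off <= 0:
        raise ValueError("tail indices must be positive")
    return (3.0 - min(alpha_on, alpha_off)) / 2.0


if __name__ == "__main__":
    # Illustrative values only: the heavier tail (smaller index) dominates.
    print(hurst_from_tail_indices(1.2, 1.5))
```

Note that only the smaller of the two indices matters, so the period type with the heavier tail determines the degree of self-similarity.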
As an example, an ON period can consist of the transmission of a single packet, a packet train, a Web file, the duration of the transmission requests from a Web browser needed to download a WWW document, etc., [Dun97], [Cro98], [Sta97].
Next we provide a definition of the discrete ON and OFF periods.
4.1.1 Discrete ON and OFF Periods
To explain how we define the discrete ON and OFF periods, suppose that we check the activity of an ON/OFF source at periodic instants t_n = nΔt, for a fixed Δt > 0, and that at each instant the source is or is not transmitting a packet according to some discrete probability distribution. A single packet can use several "time slots" for its transmission. Thus we say that an ON/OFF source generates an OFF period of k time slots if and only if it does not transmit any packet in k time slots and begins the transmission of one packet at the (k + 1)th time slot; see Figure 4.1.
[Figure 4.1 labels: packet source; 0 packets; 1 packet; OFF period of k time slots.]
Figure 4.1. Discrete OFF Period
Similarly, if the source is transmitting a packet or a packet train during i time slots followed by a time slot without transmission, we say that the source generates an ON period of i time slots (see Figure 4.2). For simplicity, in the next sections we will assume that each time slot has a duration of 1 time unit.
[Figure 4.2 labels: packet source; 0 packets; 1 packet; ON period of i time slots.]
Figure 4.2. Discrete ON Period
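The definitions of the discrete ON and OFF periods can be sketched in Python as a run-length computation over a slot-by-slot activity record. This is only an illustration under our own assumptions: the input is a list with 1 for a slot with transmission and 0 for a silent slot, the record is assumed to start at a period boundary, and the function name is hypothetical.

```python
from itertools import groupby


def on_off_periods(slots):
    """Return (on_lengths, off_lengths) of the completed periods.

    slots: sequence of 0/1 values, one per time slot (1 = transmitting).
    A period is counted only when it is followed by a slot of the opposite
    state, as in the definitions above, so the final run is discarded.
    Assumption: the record starts exactly at the beginning of a period.
    """
    runs = [(state, sum(1 for _ in group)) for state, group in groupby(slots)]
    on_lengths, off_lengths = [], []
    for state, length in runs[:-1]:  # last run has no terminating slot yet
        if state:
            on_lengths.append(length)
        else:
            off_lengths.append(length)
    return on_lengths, off_lengths


if __name__ == "__main__":
    record = [0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0]
    print(on_off_periods(record))
```

For the sample record, the source stays silent for 3 slots, transmits for 2, stays silent for 1 and transmits for 4; the trailing two silent slots are not counted because the next transmission has not been observed.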
In the next section we will derive a discrete distribution suitable for modeling the discrete ON and OFF periods.
4.2 Derivation of the Zeta Prime Distribution
As mentioned above, models with exponential or geometric distributions for the ON and OFF periods are not well suited to computer network traffic, since both distributions are memoryless. Then, based on the work described in Chapter 2, we assume that the discrete ccdf of an ON or OFF period is given by
$$\bar{F}(x) = \Pr[X > x] = (x+2)^{-\alpha}, \qquad x = 0, 1, 2, \ldots \qquad (4.1)$$
Using this function we will calculate the pmf of the ON or OFF period. Note that (4.1) is similar to (3.8), and that (4.1) satisfies the heavy tailed condition given in (3.1).
We can calculate the pmf by considering the following
$$\Pr[X = x] = \Pr[X \ge x] - \Pr[X \ge x+1] = \Pr[X > x-1] - \Pr[X > x], \qquad x = 0, 1, 2, \ldots \qquad (4.2)$$
then, substituting (4.1) into (4.2) we obtain
$$\Pr[X = x] = (x+1)^{-\alpha} - (x+2)^{-\alpha}, \qquad x = 0, 1, 2, \ldots \qquad (4.3)$$
Equation (4.3) gives the pmf of a discrete-valued random variable X with sample space x = 0, 1, 2, ... and parameter α > 1. Equation (4.3) must satisfy the conditions of a probability function, so we need to verify that it is indeed a pmf. To do this we just need to verify that the sum of the probabilities is 1, that is,
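$$\sum_{n=0}^{\infty} \left[ (n+1)^{-\alpha} - (n+2)^{-\alpha} \right] = 1. \qquad (4.4)$$
In order to do this, define the partial sum S as follows
$$S = \sum_{n=0}^{k} (n+1)^{-\alpha} = 1 + 2^{-\alpha} + \ldots + (k+1)^{-\alpha}, \qquad (4.5)$$
similarly we can define the sum S_k as
$$S_k = \sum_{n=1}^{k+1} (n+1)^{-\alpha} = \sum_{n=0}^{k} (n+2)^{-\alpha} = 2^{-\alpha} + \ldots + (k+1)^{-\alpha} + (k+2)^{-\alpha}, \qquad (4.6)$$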
then, using equations (4.5) and (4.6) we can write
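$$S - S_k = \sum_{n=0}^{k} (n+1)^{-\alpha} - \sum_{n=0}^{k} (n+2)^{-\alpha} = 1 - (k+2)^{-\alpha} \qquad (4.7)$$
and finally, we can write Equation (4.4) by using (4.7) as follows
$$\sum_{n=0}^{\infty} \left[ (n+1)^{-\alpha} - (n+2)^{-\alpha} \right] = \lim_{k \to \infty} \left\{ 1 - (k+2)^{-\alpha} \right\} = 1, \qquad (4.8)$$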
and we can conclude that (4.3) is a valid pmf.
Now, using (4.7) we can see that the cdf of (4.3) is given by
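$$F(x) = \Pr[X \le x] = \sum_{n=0}^{x} \left[ (n+1)^{-\alpha} - (n+2)^{-\alpha} \right] = 1 - (x+2)^{-\alpha}, \qquad x = 0, 1, 2, \ldots \qquad (4.9)$$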
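The telescoping argument used in (4.7) and (4.8) can be checked numerically. The Python sketch below assumes the pmf (x+1)^(-α) - (x+2)^(-α) on x = 0, 1, 2, ... and the cdf 1 - (x+2)^(-α), as implied by (4.7); the function names are our own, and the value α = 1.2 is the one used in the figures of this chapter.

```python
import math


def zeta_prime_pmf(x: int, alpha: float) -> float:
    """pmf on x = 0, 1, 2, ...: (x+1)^(-alpha) - (x+2)^(-alpha)."""
    return (x + 1) ** -alpha - (x + 2) ** -alpha


def zeta_prime_cdf(x: int, alpha: float) -> float:
    """Closed-form cdf on x = 0, 1, 2, ...: 1 - (x+2)^(-alpha)."""
    return 1.0 - (x + 2) ** -alpha


if __name__ == "__main__":
    alpha = 1.2
    for k in (0, 10, 1000, 100000):
        # Partial sum of the pmf up to k, compared with the closed form.
        partial = math.fsum(zeta_prime_pmf(n, alpha) for n in range(k + 1))
        print(k, partial, zeta_prime_cdf(k, alpha))
```

The two printed columns should agree up to floating point error, and both approach 1 as k grows, although slowly because of the heavy tail.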
The form of (4.3) is similar to the so-called Haight's zeta distribution [Joh92], encountered by Frank A. Haight when he was studying the frequency of appearance of words in groups of texts [Hai66]. The pmf of Haight's zeta distribution is
$$\Pr[X = x] = (2x-1)^{-\sigma} - (2x+1)^{-\sigma}, \qquad x = 1, 2, \ldots, \quad \sigma > 0. \qquad (4.10)$$
The method used here to obtain (4.3) is quite different from that shown in [Hai66], because the author derives his zeta distribution using the "Zipf conjecture" that, in the ranked tabulation, the relative frequency of the nth most frequent reply is proportional to n^{-a}, where a is a constant (see [Hai66] or [Joh92] for further details). Because of this, and because of the differences between (4.3) and (4.10), we would rather refer to (4.3) as the zeta prime distribution. Figure 4.3 and Figure 4.4 show the plots of the pmf and the cdf of a random variable X with zeta prime distribution, defined by equations (4.3) and (4.9), respectively.
[Figure 4.3: horizontal axis x.]
Figure 4.3. pmf of a random variable X with zeta prime distribution for α = 1.2.
Figure 4.4. cdf of a random variable X with zeta prime distribution for α = 1.2.
It may sound illogical to talk about an ON or an OFF period with a duration of 0 time units; therefore we can shift (4.3) so that it starts from x = 1, obtaining the pmf
$$\Pr[X = x] = x^{-\alpha} - (x+1)^{-\alpha}, \qquad \alpha > 0, \quad x = 1, 2, \ldots \qquad (4.11)$$
Following the procedure used to obtain (4.9), the cdf of the shifted version is
$$F(x) = \sum_{n=1}^{x} \left[ n^{-\alpha} - (n+1)^{-\alpha} \right] = 1 - (x+1)^{-\alpha}, \qquad x = 1, 2, \ldots \qquad (4.12)$$
and the complementary cdf is given by
$$\bar{F}(x) = \Pr[X > x] = (x+1)^{-\alpha}, \qquad x = 1, 2, \ldots \qquad (4.13)$$
Note that the only difference between (4.13) and (3.8) is that (4.13) has a discrete support; hence (4.13) satisfies the heavy tail condition given in (3.1). Figure 4.5 shows the LLCD plot of a random variable X with zeta prime distribution defined by equations (4.11), (4.12) and (4.13) for several values of α. Now the ordered pair defining the LLCD plot is given by
$$(\log x, \; -\alpha \log(x+1)),$$
which approaches a straight line of slope -α in the log-log plane as x grows. We can see that for the three values of α the LLCD plot shows a straight line, indicating the heavy tailed behavior of the zeta prime distribution.
[Figure 4.5: log-log axes; horizontal axis from 10^0 to 10^2.]
Figure 4.5. LLCD plot of the zeta prime distribution for several values of α.
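To see how quickly the LLCD points of the shifted zeta prime distribution line up, the following Python sketch computes the slope of the chord between the LLCD points at x and 2x, using the ccdf (x+1)^(-α) on x = 1, 2, .... The function names and the choice of comparing x with 2x are our own assumptions made for illustration.

```python
import math


def llcd_point(x: int, alpha: float):
    """LLCD point (log10 x, log10 Pr[X > x]) for the ccdf (x+1)^(-alpha)."""
    return math.log10(x), -alpha * math.log10(x + 1)


def local_slope(x: int, alpha: float) -> float:
    """Slope of the chord between the LLCD points at x and 2x."""
    u0, v0 = llcd_point(x, alpha)
    u1, v1 = llcd_point(2 * x, alpha)
    return (v1 - v0) / (u1 - u0)


if __name__ == "__main__":
    alpha = 1.2
    for x in (1, 10, 100, 1000):
        print(x, local_slope(x, alpha))
```

The printed slopes move toward -α as x increases, which is the straight-line behavior expected of a heavy tailed distribution in an LLCD plot.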
To show the difference between the LLCD plots of a light tailed and a heavy tailed discrete random variable, Figure 4.6 shows the LLCD plot of two geometrically distributed random variables (see Appendix B) and of a random variable with zeta prime distribution (as defined above). The pmf, cdf and ccdf of a geometrically distributed random variable are given by
$$\Pr[X = x] = (1-p)\,p^{x-1}, \qquad 0 < p < 1 \text{ and } x = 1, 2, 3, \ldots \qquad (4.14)$$
$$F(x) = 1 - p^{x}, \qquad 0 < p < 1 \text{ and } x = 1, 2, 3, \ldots \qquad (4.15)$$
$$\bar{F}(x) = p^{x}, \qquad 0 < p < 1 \text{ and } x = 1, 2, 3, \ldots \qquad (4.16)$$
respectively. Now the LLCD plot is defined by the ordered pair
$$(\log x, \; x \log p),$$
which does not define a straight line on the log-log plane.
We can see that, like the exponential distribution (see Figure 3.7), the geometric distribution decays faster in comparison with the zeta prime distribution, meaning that the geometric distribution is light tailed.
[Figure 4.6: log-log axes; horizontal axis x from 10^0 to 10^2; one curve labeled Geometric, p = 0.5.]
Figure 4.6. LLCD plot of the Geometric and zeta prime distributions.
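The contrast between the two tails can also be checked numerically. The Python sketch below assumes the geometric ccdf p^x and the shifted zeta prime ccdf (x+1)^(-α), both on x = 1, 2, ..., and computes the slope of the LLCD chord between x and 2x for each; the function names and the values p = 0.5 and α = 1.2 are illustrative choices.

```python
import math


def geometric_ccdf(x: int, p: float) -> float:
    """Pr[X > x] = p^x for the geometric distribution on x = 1, 2, ..."""
    return p ** x


def zeta_prime_ccdf(x: int, alpha: float) -> float:
    """Pr[X > x] = (x+1)^(-alpha) for the shifted zeta prime distribution."""
    return (x + 1) ** -alpha


def llcd_chord_slope(ccdf, x: int) -> float:
    """Slope between the LLCD points at x and 2x for a given ccdf."""
    rise = math.log10(ccdf(2 * x)) - math.log10(ccdf(x))
    run = math.log10(2 * x) - math.log10(x)
    return rise / run


if __name__ == "__main__":
    for x in (1, 5, 10, 50):
        geo = llcd_chord_slope(lambda v: geometric_ccdf(v, 0.5), x)
        zp = llcd_chord_slope(lambda v: zeta_prime_ccdf(v, 1.2), x)
        print(x, geo, zp)
```

For the geometric case the chord slope equals x log(p) / log(2), so it keeps getting steeper as x grows, while the zeta prime slope settles near -α.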
In the following section we compare the zeta prime distribution with the continuous Pareto distribution and the Zipf distribution by means of the cdf and LLCD plots.
4.3 Graphical Comparison between the Pareto, Zipf and Zeta Prime Distributions
Now we compare the plots of the pmfs, cdfs and ccdfs of the Pareto, Zipf and zeta prime distributions.
As mentioned in Section 4.1, the connection between the distribution of the ON and OFF periods and the self-similarity observed in computer network traffic is the distribution's heavy tail index α (see also Appendix A). Note that the α parameter also defines the tail heaviness of the Pareto, Zipf and zeta prime distributions, and that it is equivalent in the three cases.
For example, consider the continuous Pareto ccdf given by Equation (3.8)
$$\bar{F}(x) = (x+1)^{-\alpha}, \qquad x > 0, \qquad (4.17)$$
then, substituting (4.17) into Equation (3.19) we obtain
$$\frac{d \log \bar{F}(x)}{d \log x} = \frac{d \left[ -\alpha \log(x+1) \right]}{d \log x} = -\alpha \frac{x}{x+1} \approx -\alpha \qquad (4.18)$$
for large x. Applying the same procedure to the zeta prime distribution's ccdf defined by Equation (4.1) we obtain
$$\frac{d \log \bar{F}(x)}{d \log x} = \frac{d \left[ -\alpha \log(x+2) \right]}{d \log x} = -\alpha \frac{x}{x+2} \approx -\alpha \qquad (4.19)$$
for large x. Hence α defines the same slope for the continuous Pareto distribution and the zeta prime distribution in an LLCD plot, and is therefore an equivalent parameter for both distributions. With respect to the Zipf distribution, we will see later that its α parameter is equivalent to the α of the Pareto and zeta prime distributions.
Figure 4.7 shows the plot of the zeta prime (4.3) and Zipf (3.13) pmfs. For small values of x the probabilities differ, but as x increases this difference becomes negligible. This suggests that both pmfs behave similarly at the tail.
Figure 4.8 shows the plot of the Pareto (3.6), zeta prime (4.12) and Zipf (3.15) cdfs. We can see that the points defined by the zeta prime cdf lie on the line defined by the Pareto cdf, offering a better fit than the Zipf cdf.
[Figure 4.7: curves labeled Zipf and Zeta Prime.]
Figure 4.7. Plot of the zeta prime and Zipf pmfs for α = 1.2.
Figure 4.8. Plot of the zeta prime, Zipf and Pareto cdfs for α = 1.2.
The data presented in Table 4.1 confirm that the zeta prime cdf (4.12) offers a better fit to the Pareto cdf (3.6) than the Zipf cdf (3.15). Table 4.1 also shows that the zeta prime and the Pareto cdfs have the same value for integer values of x, while there is always a difference between the Zipf and Pareto cdfs.
Table 4.1. Values of the cdf of a random variable X with Zeta prime, Zipf and Pareto distributions.
x     Zipf      Zeta prime   Pareto
0     0         0            0
1     0.6709    0.5647       0.5647
2     0.8169    0.7324       0.7324
3     0.8768    0.8105       0.8105
4     0.9085    0.8550       0.8550
5     0.9280    0.8835       0.8835
6     0.9410    0.9032       0.9032
7     0.9503    0.9175       0.9175
8     0.9572    0.9284       0.9284
9     0.9625    0.9369       0.9369
10    0.9668    0.9437       0.9437
11    0.9702    0.9493       0.9493
12    0.9730    0.9539       0.9539
13    0.9754    0.9579       0.9579
14    0.9774    0.9612       0.9612
15    0.9792    0.9641       0.9641
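The columns of Table 4.1 can be recomputed with a short Python sketch. The zeta prime and Pareto columns use the closed form 1 - (x+1)^(-α) with α = 1.2. For the Zipf column we assume a pmf proportional to x^(-(α+1)) on x = 1, 2, ..., normalized by the Riemann zeta function ζ(α+1); this form is our assumption, chosen because it appears consistent with the tabulated values, and should be checked against (3.13). The helper riemann_zeta is a hypothetical name for a simple series approximation, not a library routine.

```python
import math

ALPHA = 1.2  # tail index assumed for Table 4.1


def riemann_zeta(s: float, terms: int = 1000) -> float:
    """Approximate zeta(s) for s > 1: direct sum plus an Euler-Maclaurin tail."""
    head = math.fsum(k ** -s for k in range(1, terms))
    tail = (terms ** (1 - s) / (s - 1)
            + 0.5 * terms ** -s
            + s * terms ** (-s - 1) / 12)
    return head + tail


def zipf_cdf(x: int, alpha: float, norm: float) -> float:
    """Assumed Zipf cdf: sum of k^-(alpha+1) for k = 1..x, divided by norm."""
    return math.fsum(k ** -(alpha + 1) for k in range(1, x + 1)) / norm


def zeta_prime_cdf(x: int, alpha: float) -> float:
    """Shifted zeta prime cdf, Equation (4.12): 1 - (x+1)^(-alpha)."""
    return 1.0 - (x + 1) ** -alpha


def pareto_cdf(x: float, alpha: float) -> float:
    """Continuous Pareto cdf with ccdf (x+1)^(-alpha)."""
    return 1.0 - (x + 1.0) ** -alpha


if __name__ == "__main__":
    norm = riemann_zeta(ALPHA + 1)
    for x in range(16):
        print(f"{x:2d}  {zipf_cdf(x, ALPHA, norm):.4f}  "
              f"{zeta_prime_cdf(x, ALPHA):.4f}  {pareto_cdf(x, ALPHA):.4f}")
```

Under these assumptions the zeta prime and Pareto columns coincide at every integer x by construction, while the Zipf column stays above them.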
Finally, Figure 4.9 shows the LLCD plot of the Pareto ccdf (3.8), the zeta prime ccdf (4.13) and the Zipf ccdf (3.16) for (a) α = 1.2 and (b) α = 1.6. Note that the points defined by the zeta prime ccdf always lie on the line defined by the Pareto ccdf, while the points defined by the Zipf ccdf run parallel to the Pareto and zeta prime ccdfs. The first conclusion we can draw from Figure 4.9 is that the zeta prime distribution always offers a better fit to the Pareto distribution than the Zipf distribution. Second, since the Zipf LLCD plot looks parallel to the Pareto and zeta prime LLCD plots, we can conclude that the three plots have almost the same slope and, hence, that the parameter α is equivalent for the three distributions.
[Figure 4.9: (a) α = 1.2; (b) α = 1.6.]
Figure 4.9. LLCD plot of the zeta prime, Zipf and Pareto distributions.
With the results seen in Table 4.1, Figure 4.8 and Figure 4.7 we can conclude that the zeta prime distribution given by (4.11), (4.12) and (4.13) has a heavy tail behavior that resembles that of the Pareto distribution given by (3.4), (3.6) and (3.8). This behavior is also closer to the Pareto distribution than that of the Zipf distribution given by (3.12), (3.15) and (3.16).
Another advantage of the zeta prime distribution over the Zipf distribution is that it has a closed form for its cdf and ccdf. This allows ease in simulations and computations with the zeta prime distribution in comparison with the Zipf distribution.
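As an illustration of how the closed-form ccdf eases simulation, the following Python sketch draws samples from the shifted zeta prime distribution by inverse transform sampling. It relies on the observation that if Y is continuous Pareto with ccdf (y+1)^(-α), then the ceiling of Y satisfies Pr[ceil(Y) > x] = (x+1)^(-α) for integer x, which is (4.13). The function name is hypothetical and this is only one possible sampling scheme, not a method taken from the references.

```python
import math
import random


def sample_zeta_prime(alpha: float, rng: random.Random) -> int:
    """Draw one sample with Pr[X > x] = (x+1)^(-alpha), x = 1, 2, ...

    Inverse transform: U uniform on (0, 1], Y = U^(-1/alpha) - 1 is
    continuous Pareto with ccdf (y+1)^(-alpha), and X = ceil(Y).
    """
    u = 1.0 - rng.random()            # rng.random() is in [0, 1), so u is in (0, 1]
    y = u ** (-1.0 / alpha) - 1.0
    return max(1, math.ceil(y))       # guard the zero-probability case u == 1


if __name__ == "__main__":
    rng = random.Random(2001)
    alpha = 1.2
    samples = [sample_zeta_prime(alpha, rng) for _ in range(100000)]
    for x in (1, 2, 5, 10):
        empirical = sum(s > x for s in samples) / len(samples)
        print(x, empirical, (x + 1) ** -alpha)
```

The empirical tail frequencies should be close to the closed-form values. No such shortcut is available when the cdf requires summing a series, as in the Zipf case.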
In the next chapter we describe an urn model that leads us to a heavy tailed distribution, which can also be used to model the ON/OFF sources, the holding times, the Web file requests, etc.
Chapter 5
A Heavy Tailed Urn Model for Network Traffic Modeling
As mentioned in Chapter 4, traffic models are the main tool used to design and evaluate the performance of telecommunication networks. Recall that traditional models assume exponential or geometric distributions for the holding times or for the ON and OFF periods of an ON/OFF source. The literature contains several simplified models of traffic source behavior that lead to a geometric distribution for holding times, ON and OFF periods, etc.; see [Fel68], [Mic97], [Leo94], [Ros96]. A brief explanation of two traffic source models that lead to geometrically distributed holding times and geometrically distributed ON and OFF periods is presented in Appendix B. Recall that the traffic source can be a person making a phone call, an ON/OFF source, a computer terminal connected to the Internet, etc.
However, as mentioned in Chapter 2 and Chapter 4, models that assume a geometric or exponential distribution for the holding times, the ON and OFF periods, packet interarrival times, etc., are inconsistent with actual network traffic measurements, [Bar99], [Cha00], [Cro98], [Pax95], etc. Also, studies such as those reported in [Bol94] and [Duf94] show that exponential or geometric models for CCSN/SS7 traffic seriously underestimate the number of very long calls. Moreover, in [Bol94] and [Duf94] it was shown that the distribution of the holding times of CCSN/SS7 traffic is heavy tailed.
The models presented in Appendix B are based on the flip of a skewed coin at periodic instants to decide whether or not to continue with the phone call, the ON period or the OFF period. We have to mention that besides the use of a coin we can use an urn containing black and white balls. Then, at periodic instants we draw a ball from the urn in order to decide whether or not to continue depending on the color of the ball drawn. Finally, we have to return the ball to the urn in order to maintain the same black and white balls ratio each time we draw a ball from the urn. Note that with this ball drawing scheme we resemble the results obtained with the coin tossing.
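The drawing scheme with replacement described above can be sketched in Python as follows. This is a simplified illustration under our own assumptions (a white ball means "continue", a black ball means "stop", and the holding time counts the slot in which the black ball is drawn); the exact models are the ones given in Appendix B, and the function name is hypothetical.

```python
import random


def urn_holding_time(white: int, black: int, rng: random.Random) -> int:
    """Holding time, in time slots, of one call.

    At each slot a ball is drawn and returned to the urn, so the ratio of
    white to black balls never changes. A white ball continues the call,
    a black ball ends it.
    """
    slots = 1
    while rng.randrange(white + black) < white:  # white ball drawn
        slots += 1
    return slots


if __name__ == "__main__":
    rng = random.Random(5)
    white, black = 1, 1                  # continue with probability p = 0.5
    n = 100000
    times = [urn_holding_time(white, black, rng) for _ in range(n)]
    p = white / (white + black)
    for x in (1, 2, 3, 5):
        empirical = sum(t > x for t in times) / n
        print(x, empirical, p ** x)      # geometric ccdf p^x, as in (4.16)
```

Because the urn composition is constant, the empirical tail follows the geometric ccdf p^x, which is the light tailed behavior that the rest of this chapter seeks to avoid.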
However, as mentioned above, the models presented in Appendix B lead to a geometric distribution, which is completely inconsistent with the traffic measurements reported in works such as [Bar99], [Bol94], [Duf94], [Cha00], [Cro98], [Cro97], [Dun97], [Pax95], [Wil97], etc. This is why in this chapter we present a traffic source model based on an urn drawing scheme that leads to a discrete heavy tailed distribution named beta CC. Our goal is to provide a simple traffic source model, similar to those described in Appendix B, that also captures the heavy tail nature of actual network traffic. In this way we can provide a useful tool for network planning and traffic engineering based on the characteristics of actual network traffic.
In Chapter 4 we focused our derivation of the zeta prime distribution on finding a discrete distribution suitable for the ON and OFF periods. However, recall that the zeta prime distribution is also suitable for any data originally fitted to a Pareto distribution. In order to give another example of the applicability of the distributions and models presented in this work, in this chapter we center our attention on finding a discrete heavy tailed distribution suitable for the holding times by means of a traffic source urn model. However, this model can also be applied to any traffic source generating heavy tailed distributed network traffic.
Before we start with the explanation of our urn model, it is necessary to give an introduction to the so called residual lifetimes, in order to explain some assumptions made in Section 5.2 about the expected traffic source behavior. This is why in Section 5.1 we give a brief analysis of the residual lifetime of a random variable when it is light, medium or heavy tailed.
5.1 The Residual Lifetimes
An important issue arising when we work with probabilistic models for the holding times is the distribution of their residual lifetimes, [Cha00], [Fel68], [Fel71], [Pax95].
To explain what a residual lifetime is, consider that a phone line is being used to make a voice or data call for a period of exactly X minutes. If we know that the current call time is t, where t < X , then the call's residual lifetime is given by X -t . However, we recall that this is just a deterministic example presented in order to understand the concept of a residual lifetime.
In particular, we are interested in how the residual lifetime behavior is affected by the current length of the holding time, when we assume that the holding time's length is governed by a given random variable. In order to calculate this, we make use of the conditional probability formula. That is, we need to calculate the probability that the holding time's length, X, exceeds some quantity, t + x, given that it has already lasted for a time t.
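This is given by
$$\Pr[X > t + x \mid X > t] = \frac{\Pr[X > t + x, \; X > t]}{\Pr[X > t]} = \frac{\Pr[X > t + x]}{\Pr[X > t]}. \qquad (5.1)$$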
Recall that when we apply the conditional probability formula we "generate" a new random variable whose ccdf is given by Equation (5.1), i.e., the ccdf of the residual lifetime; see [Fel68], [Fel71], [Leo94]. Note that the ccdf of the residual lifetime is a function of x, while t can be considered a "parameter" that depends on the current length of the holding time.
Recall that we are interested in the residual lifetime behavior when we assume heavy tailed distributed holding times. However, in order to understand the assumptions made for the urn model presented ahead, we analyze the behavior of the residual lifetimes when we assume that the holding time is not heavy tailed distributed.
Before beginning the analysis, we need to mention the classification of probability distributions according to their tail behavior, as defined in [Pax95] and [Teu96]. Usually, the exponential and the geometric distributions are taken as the reference point because of their memoryless property (see Appendix B); in particular, in [Pax95] the exponential and the geometric distributions are classified as medium tailed because of that property. Distributions with a lighter tail than the exponential or the geometric distributions are known as light tailed [Pax95] or super-exponential [Teu96] distributions; examples are the Rayleigh, the uniform (continuous and discrete) and the Benktander II distributions; see [Pax95] and [Teu96]. Distributions with a heavier tail than the exponential or the geometric distributions are known as sub-exponential distributions [Teu96], and if they also satisfy Equation (3.1) they are known as heavy tailed or Pareto type distributions, [Pax95], [Teu96]. Next, we show the differences in the residual lifetime behavior when the holding time is assumed to be light (e.g., Rayleigh), medium (e.g., exponential) and heavy tailed (e.g., Pareto) distributed.
First, assume that the holding time is Rayleigh distributed with pdf, cdf and ccdf given by
$$f(x) = \frac{x}{\sigma^{2}} e^{-x^{2}/(2\sigma^{2})}, \qquad x > 0; \; \sigma > 0, \qquad (5.2)$$
$$F(x) = 1 - e^{-x^{2}/(2\sigma^{2})}, \qquad x > 0; \; \sigma > 0, \qquad (5.3)$$
$$\bar{F}(x) = e^{-x^{2}/(2\sigma^{2})}, \qquad x > 0; \; \sigma > 0, \qquad (5.4)$$
respectively, where σ is the distribution's parameter. Recall that this distribution is lighter tailed than the exponential distribution, [Teu96]. Then, the residual lifetime ccdf is given by
$$\Pr[X > t + x \mid X > t] = \frac{e^{-(t+x)^{2}/(2\sigma^{2})}}{e^{-t^{2}/(2\sigma^{2})}} = e^{-(x^{2} + 2tx)/(2\sigma^{2})}, \qquad (5.5)$$
which depends on t, i.e., on the time that the holding time has already lasted. Figure 5.1 shows the probability that a Rayleigh distributed holding time, X, exceeds a residual lifetime x = 1 once it has lasted for a time t. We can conclude from Figure 5.1 that the larger t is, the smaller the probability that the holding time continues for at least another time unit. In other words, for light tailed distributed holding times, the longer you have lasted, the sooner you are likely to be done, [Pax95], [Teu96].
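The residual lifetime ccdf of Equation (5.1) can be evaluated directly for the three tail classes. The Python sketch below assumes the standard Rayleigh ccdf exp(-x^2/(2σ^2)), an exponential ccdf exp(-λx) and the Pareto ccdf (x+1)^(-α); the parameter values σ = 2, λ = 0.5 and α = 1.2 and all function names are our own choices for illustration.

```python
import math


def rayleigh_ccdf(x: float, sigma: float = 2.0) -> float:
    """Light tailed example: exp(-x^2 / (2 sigma^2))."""
    return math.exp(-x * x / (2.0 * sigma ** 2))


def exponential_ccdf(x: float, rate: float = 0.5) -> float:
    """Medium tailed (memoryless) example: exp(-rate * x)."""
    return math.exp(-rate * x)


def pareto_ccdf(x: float, alpha: float = 1.2) -> float:
    """Heavy tailed example: (x + 1)^(-alpha)."""
    return (x + 1.0) ** -alpha


def residual_ccdf(ccdf, t: float, x: float) -> float:
    """Pr[X > t + x | X > t] = ccdf(t + x) / ccdf(t), Equation (5.1)."""
    return ccdf(t + x) / ccdf(t)


if __name__ == "__main__":
    # Probability of lasting at least one more time unit after time t.
    for t in (0.0, 1.0, 5.0, 10.0):
        print(t,
              residual_ccdf(rayleigh_ccdf, t, 1.0),
              residual_ccdf(exponential_ccdf, t, 1.0),
              residual_ccdf(pareto_ccdf, t, 1.0))
```

With these assumed parameters the Rayleigh column decreases with t, the exponential column stays constant, and the Pareto column increases with t, which is the qualitative difference between light, medium and heavy tailed holding times.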