CHAPTER III: EXPERIMENTAL DEVELOPMENT
3.5 EXPERIMENTAL DESIGN
3.5.1 CALCULATIONS AND CONSIDERATIONS PRIOR TO EXPERIMENTATION
Many methods can be used to select models, and they vary widely in statistical sophistication. The crudest, such as the Kolmogorov-Smirnov test and the mean-squared error, simply consider the distance between two distributions in order to select one model over another. More sophisticated techniques include the Akaike Information Criterion [Akaike, 1974], which aims to estimate the Kullback-Leibler divergence [Kullback and Leibler, 1951] between the proposed distribution and the actual distribution in question by calculating the likelihood of the data given the proposed distribution. Here we compare simple distance-based techniques, such as the average absolute deviance between the cdf of the proposed distribution and that of the true distribution, with likelihood-based techniques, to see which gives us a better fit to the Blue Sheep data, as measured using the estimated final size of the outbreak in the workplaces.
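For reference, the standard textbook forms of the three quantities named above can be written as follows. The notation is assumed here rather than taken from the source: F_n is the empirical cdf, F the cdf of the proposed distribution, P and Q the true and proposed probability mass functions, k the number of fitted parameters and \hat{L} the maximised likelihood.

```latex
\begin{align}
D_{\mathrm{KS}} &= \sup_x \lvert F_n(x) - F(x) \rvert, \\
D_{\mathrm{KL}}(P \,\|\, Q) &= \sum_x P(x) \ln \frac{P(x)}{Q(x)}, \\
\mathrm{AIC} &= 2k - 2 \ln \hat{L}.
\end{align}
```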
Figure 4.7: Summed final sizes in workplaces for different values of ε by ward. Panels: (a) ε = 0.01, (b) ε = 0.03, (c) ε = 0.05, (d) ε = 0.1.
with the highest value of the likelihood may not give us the most informative result. This is because we are interested in how the change in transmission rates will affect the size of the epidemics in the workplaces. This change in transmission rates will take effect through an increase in the number of infecteds from workplaces with a high number of employees. Therefore one of the major factors that we must consider is the probability of selecting a ‘high’ number of employees. The fact that there are relatively few of these in the Blue Sheep data (≈ 84% are 1-10, ≈ 98.75% are 1-100) means that the likelihood calculations will be weighted heavily by accuracy in the bulk of the distribution, so that we can achieve the greatest likelihood with a very inaccurate approximation of the tail of the distribution.
As mentioned above, an alternative to calculating likelihoods is to sum the absolute error between the cdf of the distribution in question and the cdf of our empirical data, and then to choose the set of parameters which minimises this sum. This approach enables us to choose the distribution which is most accurate to the Blue Sheep data over the whole range of the data, meaning that the quality of the fit must be good, at least in a relative sense, in all regimes of the data, but biased towards the tail due to the use of the cdf rather than the probability mass function (pmf).
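The summed absolute cdf error described above can be sketched as follows. This is a minimal illustration, not the thesis code: the candidate family (a discrete power law on 1..x_max), the grid of exponents and the synthetic "empirical" data are all assumptions made for the example, and the function names are hypothetical.

```python
import numpy as np

def truncated_power_law_pmf(alpha, x_max):
    """pmf proportional to x**(-alpha) on sizes 1..x_max (illustrative family)."""
    sizes = np.arange(1, x_max + 1, dtype=float)
    weights = sizes ** (-float(alpha))
    return weights / weights.sum()

def summed_abs_cdf_error(model_pmf, empirical_pmf):
    """Sum over all sizes of |model cdf - empirical cdf|."""
    return np.abs(np.cumsum(model_pmf) - np.cumsum(empirical_pmf)).sum()

def fit_by_cdf_error(empirical_pmf, alphas):
    """Grid search: return the exponent with the smallest summed cdf error."""
    x_max = len(empirical_pmf)
    errors = [
        summed_abs_cdf_error(truncated_power_law_pmf(a, x_max), empirical_pmf)
        for a in alphas
    ]
    best = int(np.argmin(errors))
    return alphas[best], errors[best]

# Synthetic stand-in for the empirical pmf (NOT the Blue Sheep data):
# generated from the same family with exponent 2, so the fit should recover it.
empirical = truncated_power_law_pmf(2.0, 50)
alphas = np.linspace(1.0, 3.0, 21)
best_alpha, best_error = fit_by_cdf_error(empirical, alphas)
```

A grid search is used only to keep the sketch short; any one-dimensional optimiser could be substituted.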
If we wish to fit the data to this distribution using a value of xmin ≠ 1, then we must adjust the above method. To do this, we first work out the cdf of the whole dataset. Then, to fit to the data with any distance-based method, for example the least absolute error of the cdf, we take into account only the differences in the cdfs from xmin upwards. This results in us choosing parameters which line up closely with the data from the value of xmin onwards, but without the constraint of minimising the error for lower values.
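A minimal sketch of this restriction, assuming both cdfs are stored as arrays whose entry i holds the cdf at size i + 1. The function name and the toy cdf values are hypothetical, chosen only to show that errors below xmin no longer contribute.

```python
import numpy as np

def tail_abs_cdf_error(model_cdf, empirical_cdf, x_min):
    """Sum |model cdf - empirical cdf| over sizes x_min, x_min + 1, ..., x_max.

    Entry i of each array is assumed to hold the cdf at size i + 1.
    """
    model_cdf = np.asarray(model_cdf, dtype=float)
    empirical_cdf = np.asarray(empirical_cdf, dtype=float)
    return np.abs(model_cdf[x_min - 1:] - empirical_cdf[x_min - 1:]).sum()

# Toy cdfs on sizes 1..4 (made-up numbers): the model is poor for small
# sizes but matches the data exactly from size 3 upwards.
empirical_cdf = np.array([0.5, 0.75, 0.9, 1.0])
model_cdf = np.array([0.8, 0.85, 0.9, 1.0])

full_error = tail_abs_cdf_error(model_cdf, empirical_cdf, x_min=1)
tail_error = tail_abs_cdf_error(model_cdf, empirical_cdf, x_min=3)
```

With x_min = 3 the mismatch at sizes 1 and 2 is ignored, so a model that is poor in the bulk can still score well.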
To adapt likelihood methods, we could weight the likelihood provided by the larger workplaces more heavily than that from lower values. However, we have chosen simply to calculate the likelihood from the data which is greater than or equal to xmin, as there is no way of incorporating the information contained in the data below xmin without explicitly including it in our calculations. When we then take the parameters which have the greatest likelihood in this regime and try to produce the whole of the workplace distribution, it is possible for the resulting distribution to differ greatly from the Blue Sheep data. This is due to the fact that the majority of the distribution can be entirely ignored if xmin ≫ 1.
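The restricted likelihood can be illustrated with the sketch below. It assumes, purely for the example, a discrete power law with pmf proportional to x^(-alpha) on xmin..xmax; the workplace sizes are made-up numbers and the function name is hypothetical.

```python
import numpy as np

def tail_log_likelihood(sizes, alpha, x_min, x_max):
    """Log-likelihood of the observations in [x_min, x_max] under a discrete
    power law with pmf proportional to x**(-alpha) on x_min..x_max.
    Observations below x_min are discarded, as described in the text."""
    sizes = np.asarray(sizes)
    tail = sizes[(sizes >= x_min) & (sizes <= x_max)].astype(float)
    support = np.arange(x_min, x_max + 1, dtype=float)
    log_norm = np.log(np.sum(support ** (-alpha)))
    return -alpha * np.log(tail).sum() - tail.size * log_norm

# Made-up workplace sizes (NOT the Blue Sheep data).
sizes = np.array([1, 1, 1, 2, 2, 3, 5, 12, 40, 150])
alphas = np.linspace(1.1, 3.0, 20)
log_liks = np.array(
    [tail_log_likelihood(sizes, a, x_min=5, x_max=200) for a in alphas]
)
best_alpha = alphas[np.argmax(log_liks)]
```

Because every observation below x_min is dropped before the likelihood is evaluated, two datasets that differ only below x_min produce identical fits, which is why the fitted distribution can differ greatly from the data when extended to the full range.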
We are interested in fitting the workplace data to a distribution, in order to investigate how the assumption of different distributions for the workplace sizes can affect the way an epidemic will spread throughout the workplaces. As mentioned previously, we use methodology from Virkar and Clauset [2012] to do this. Code for these calculations is also provided [Clauset and Virkar, 2012]. However, this paper deals with real-valued data and allows us to fit a distribution to any dataset, without the use of a maximum value. As we wish to fit discrete data and have also imposed a maximum value onto the distribution, we develop our own techniques to do this.
To get the ‘best’ fit of a power law, or another distribution, to a dataset, some portion of the data is often ignored in favour of fitting only the tail of the data to a distribution [Clementi and Gallegati, 2005; Willinger and Paxson, 1998; Redner, 1998]; this practice, along with the appropriateness of these fits, is discussed by Clauset et al. [2009]. In general we are interested in characterising the whole of the distribution as well as the epidemiologically important tail, though will
xmin > 1.
Note that the use of bins causes problems when xmin > 1. This is because if we set our value of xmin to be inside a bin, it is unclear what the best course of action is, as we are unable to split the occurrences within the bin into those below xmin and those above. The course of action taken is described with the following example: if we have a bin which groups readings from 51 to 100 and xmin = 100, we define this to mean that we include the readings in this bin in our calculation. When we then fit our distribution using distance-based methods, we normalise for values from 51 to xmax.
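The binning convention in the example above can be sketched as a small helper. Only the 51 to 100 bin comes from the text; the other bin edges and the function name are hypothetical and chosen for illustration.

```python
def effective_lower_bound(bins, x_min):
    """Return the lower edge of the bin that contains x_min.

    bins is a list of (low, high) pairs with inclusive integer edges.
    The whole bin containing x_min is kept, so fitting and normalisation
    start from that bin's lower edge rather than from x_min itself.
    """
    for low, high in bins:
        if low <= x_min <= high:
            return low
    raise ValueError("x_min does not fall inside any bin")

# Hypothetical bin edges; only (51, 100) is taken from the text.
bins = [(1, 10), (11, 50), (51, 100), (101, 500)]

start = effective_lower_bound(bins, x_min=100)
included_bins = [b for b in bins if b[0] >= start]
```

With xmin = 100 the helper returns 51, so the readings in the 51 to 100 bin are included and the range used runs from 51 to xmax.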
The distributions we have considered for the workplace size distribution are: 1) ‘offset truncated power law’ distributions (which are put forward as a good candidate for workplace sizes by Ferguson et al. [2006] based on a model from Riley and Ferguson [2006]), 2) discrete power laws and 3) log-normal distributions. There are of course many different distributions which we have not considered which could be as good or better candidates for modelling this dataset. However, the distributions above are often cited as candidates for modelling heavy-tailed distributions [Crovella et al., 1998; Mitzenmacher, 2004; Clauset et al., 2009] and so are also considered likely candidates here.
In each of the three following subsections of the thesis, one of the distributions which we are attempting to fit to the data (offset truncated power law, discrete power law and log-normal) is first described. This leads into the calculation of the likelihood of the distribution in question, which is followed by the fitting of this distribution to the Blue Sheep data using the likelihood method, along with the minimum total absolute error in the cdf. The number of predicted infections from the best-fitting distributions is then calculated and compared to the predicted infections for the Blue Sheep data for various forms of the transmission rates, defined by the selection of ε. To do this we introduce a single infected individual into each workplace, and then calculate the expected final size using (4.2) for the comparative value of R0 defined by ε, which is again constrained by requiring R0 = 1 in the ε = 0 case.
Each of the following subsections therefore contains both methods and results, to allow for full understanding of the fitting of each distribution in turn. These three subsections are followed by a summary of the results and a discussion of the relative successes and failures of these distributions in fitting the Blue Sheep data.