CHAPTER III: EXPERIMENTAL DEVELOPMENT
4.4 JUSTIFICATION OF THE MATHEMATICAL MODEL
When we consider the best fit in terms of likelihood value, the offset truncated power law distribution gives us the best fit to the data. The set of parameters with the greatest likelihood is found when we fit to the data using likelihood methods for the first set of bins. This gives a log-likelihood value of -4,457,400. For the discrete power law, the best log-likelihood is -4,735,419 and for the log-normal it is -5,833,272. This means that, simply considering likelihoods, the offset truncated power law is the best fit to the data by a substantial margin, with the parameter values for this distribution being a = 4.40 and c = 1.47.
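The ranking by likelihood can be checked directly from the three log-likelihood values quoted above. The following is only a minimal sketch: the dictionary and variable names are illustrative and are not taken from the original analysis.

```python
# Best log-likelihood quoted in the text for each candidate distribution.
log_likelihoods = {
    "offset truncated power law": -4_457_400,
    "discrete power law": -4_735_419,
    "log-normal": -5_833_272,
}

# Rank the candidates from largest (best) to smallest (worst) log-likelihood.
ranking = sorted(log_likelihoods, key=log_likelihoods.get, reverse=True)
best = ranking[0]

# Gap in log-likelihood between the best candidate and every candidate.
gaps = {name: log_likelihoods[best] - ll for name, ll in log_likelihoods.items()}

for name in ranking:
    print(f"{name}: log-likelihood {log_likelihoods[name]:,}, gap to best {gaps[name]:,}")
```

The gaps make the "substantial margin" concrete: the discrete power law trails the offset truncated power law by 278,019 log-likelihood units and the log-normal by 1,375,872.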
However, as we want to be able to say something about the seriousness of a potential epidemic in the population, it may be more useful to consider which set of parameters gives the closest agreement with the Blue Sheep data in terms of the final epidemic size.
To make this comparison, we sum the absolute values of the difference between the Blue Sheep prediction and our fitted prediction at 15 different values of ε, spread evenly from 0.00001 to 0.1. We can then choose the set of parameters that minimises this sum for each distribution, and then compare these minima to each other to find the ‘best’ fit by this criterion.
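The comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the code used for the analysis: "spread evenly" is taken to mean linear spacing, and the functions `epsilon_grid`, `cumulative_abs_difference` and `best_parameters` are hypothetical names. Each final-size curve is supplied as a callable of ε, since the epidemic model that produces it is not reproduced here.

```python
def epsilon_grid(lo=0.00001, hi=0.1, n=15):
    """Return n values from lo to hi inclusive (linear spacing is an assumption)."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def cumulative_abs_difference(reference_size, fitted_size, epsilons):
    """Sum |reference - fitted| of the predicted final size over the epsilon grid."""
    return sum(abs(reference_size(e) - fitted_size(e)) for e in epsilons)


def best_parameters(reference_size, candidates, epsilons):
    """Return the label with the smallest cumulative difference, plus all scores.

    candidates maps a label (for example a parameter set) to a callable
    final_size(epsilon) for the fitted distribution.
    """
    scores = {
        label: cumulative_abs_difference(reference_size, fitted, epsilons)
        for label, fitted in candidates.items()
    }
    return min(scores, key=scores.get), scores
```

In use, `reference_size` would be the final size predicted from the Blue Sheep data and each entry of `candidates` the final size predicted from one fitted parameter set; the returned scores can then be compared across distributions.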
When we perform this analysis for xmin = 1, we find that the offset truncated power law distribution is again by far the best choice. However, the set of parameters which gives the closest agreement to the Blue Sheep data is obtained by minimising the difference in the cdfs between the Blue Sheep data and the distribution for binning 2, and the parameter values are a = 3.10 and c = 1.20. The minimum cumulative difference we [...] the parameters which give the best likelihood give the greatest discrepancy in the number of infections when considering only the offset truncated power law, and the difference in the final sizes is 17,300,000 on average.
Despite this large gap between the best and worst fits, the set of parameters which performs worst for the offset truncated power law distribution still beats all parameter fits for the log-normal and discrete power law distributions. For the log-normal, the best fit is the likelihood fit to binning 2 and the difference is 23,200,000, whilst for the discrete power law, our best fit is from the minimum cdf error to binning 1 and the error is 43,500,000. This shows how poorly the log-normal and discrete power law perform on this measure, which tells us that the offset truncated power law distribution is the more descriptive of the data.
To achieve a more predictive fit to the amount of infection in the population for the discrete power law, we increased xmin and then attempted to minimise the absolute error in the cdf between the distribution in question and the Blue Sheep data. Doing this, we can get a set of parameters which is far more accurate in terms of the number of infecteds, as can be seen best in figure 4.20a. Here, using xmin = 1,500 for the discrete power law, the cumulative error is 3,500,000, which is far better than using lower values of xmin, but the best offset truncated power law distribution is still more than twice as good.
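A fit of this kind, for a fixed xmin, can be sketched as a grid search over the power law exponent. Several details here are assumptions rather than the procedure actually used: the exponent is called `alpha`, the support is cut off at a finite `xmax` so that the normalising sum is finite, the cdf error is aggregated as a sum of pointwise absolute differences, and the empirical cdf is taken to be conditional on sizes of at least xmin and evaluated at every integer from xmin to xmax.

```python
def power_law_cdf(alpha, xmin, xmax):
    """CDF of a discrete power law p(x) proportional to x**(-alpha) on xmin..xmax.

    The finite upper bound xmax is an assumption made for this sketch.
    """
    weights = [x ** (-alpha) for x in range(xmin, xmax + 1)]
    total = sum(weights)
    cdf, running = [], 0.0
    for w in weights:
        running += w
        cdf.append(running / total)
    return cdf


def cdf_abs_error(empirical_cdf, alpha, xmin, xmax):
    """Sum of pointwise absolute differences between empirical and model cdf."""
    model = power_law_cdf(alpha, xmin, xmax)
    return sum(abs(e - m) for e, m in zip(empirical_cdf, model))


def fit_alpha(empirical_cdf, xmin, xmax, alphas):
    """Return the exponent on the grid `alphas` with the smallest cdf error."""
    return min(alphas, key=lambda a: cdf_abs_error(empirical_cdf, a, xmin, xmax))
```

Repeating `fit_alpha` for several values of xmin, and scoring each resulting fit by the final-size criterion, would mirror the comparison across xmin values reported above. A grid search is used only for simplicity; any one-dimensional minimiser could replace it.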
We have seen that these chosen distributions fit the cdf of the data with varying degrees of success, from well for the offset truncated power law to poorly for the discrete power law. This translated into a strong or poor agreement with the Blue Sheep data for the final sizes of an epidemic in a population for various values of ε. It was shown that the strength of the agreement here was highly dependent on the tail rather than the bulk of the distribution. In terms of fitting the data as closely as possible, we therefore have to decide which measure best captures the success of the fit. If we were simply interested in fitting the data to a certain distribution, so that we could say that the data followed this distribution, then we could simply choose the distribution and set of parameters which gave us the largest value for the likelihood.
We can therefore conclude that the offset truncated power law distribution fits the data more satisfactorily than the discrete power law or log-normal distribution, as it outperforms both by a considerable margin in terms of likelihood and predictive power.
However, we note that the set of parameters we would report as fitting the data best depends on what we are interested in doing with the distribution. If we simply wish to report the most probable set of parameters in terms of the ability to describe the cdf of the data, we would report that a = 4.40 and c = 1.47. However, if we were interested in studying the effect of the parameters on the final size of an epidemic, we would report that a = 3.10 and c = 1.20. What is to be done with the information is therefore an important consideration to keep in mind when attempting to fit a distribution to data.
To increase the agreement between the worst performing distribution, the discrete power law, and the Blue Sheep data, in terms of the predicted number of infecteds, we have also fitted to the cdf for different values of xmin. For a value of xmin = 1,500, we achieved a good agreement between the power law and the actual data for the predicted number of infecteds. This was much improved when compared to any set of parameters we get from using xmin = 1. This tells us that the tail of the distribution is of great importance in [...] and it is not instructive to simply find a parameter set which fits well in the bulk of the distribution and be confident that it describes the data well in the way in which you are interested.
In general, it is interesting to see how the profile of the different distributions affects not only the fit of the distributions to the cdf, but also the way in which the total number of infections predicted differs from that of the Blue Sheep data. For the discrete power law, the large workplace sizes are over-sampled when we choose parameters to match the empirical cdf. On the other hand, the log-normal and offset truncated power law distributions select more small workplace sizes. This means that the number of infecteds that these distributions predict can be greater than the data suggests (discrete power law, 4.17) or fewer (offset truncated power law, 4.10 and log-normal, 4.23).
If we are interested in including workplaces in the spread of epidemics in countries for which we have no data on the workplace size distribution and no idea what the distribution may be, then it is plausible that we may select a discrete power law, as this will in all likelihood not underestimate the severity of a potential epidemic, though it may produce a worst-case scenario which is difficult to believe. However, as it has been shown that for the UK the offset truncated power law gives us the best fit (of the distributions considered), and this was the distribution produced for the US in Ferguson et al. [2006], it is likely that, for economically developed Western countries at least, this is a fair choice of workplace size distribution.