

In terms of likelihood, the offset truncated power law distribution gives the best fit to the data. The set of parameters with the greatest likelihood is found when we fit to the data using likelihood methods for the first set of bins, which gives a log-likelihood of -4,457,400. For the discrete power law, the best log-likelihood is -4,735,419, and for the log-normal it is -5,833,272. Considering likelihoods alone, the offset truncated power law is therefore the best fit to the data by a substantial margin (roughly 278,000 log-likelihood units over the discrete power law and about 1.38 million over the log-normal), with parameter values a = 4.40 and c = 1.47.
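As a rough illustration of how a likelihood comparison on binned data might be set up, the sketch below computes a multinomial log-likelihood for binned counts under a candidate discrete power law. The bins, counts, candidate exponents, the finite cut-off `x_max` and the helper names `discrete_power_law_pmf` and `binned_log_likelihood` are all assumptions made for illustration; they are not the Blue Sheep data or the fitting code used in this work.

```python
import math

def discrete_power_law_pmf(alpha, x_min, x_max):
    """Return {x: p(x)} with p(x) proportional to x**(-alpha) on x_min..x_max.

    The finite cut-off x_max is an assumption that keeps the normalising
    sum finite without needing a special-function library.
    """
    weights = {x: x ** (-alpha) for x in range(x_min, x_max + 1)}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

def binned_log_likelihood(pmf, bins, counts):
    """Multinomial log-likelihood (up to an additive constant) of binned counts.

    bins is a list of (low, high) inclusive size ranges and counts[i] is the
    number of observations that fall in bins[i].
    """
    log_lik = 0.0
    for (low, high), n in zip(bins, counts):
        p_bin = sum(pmf.get(x, 0.0) for x in range(low, high + 1))
        if n > 0:
            if p_bin <= 0.0:
                return float("-inf")  # observed data in a bin the model forbids
            log_lik += n * math.log(p_bin)
    return log_lik

# Hypothetical binned workplace sizes (NOT the Blue Sheep data).
bins = [(1, 4), (5, 9), (10, 49), (50, 1000)]
counts = [700, 150, 120, 30]

# Hypothetical candidate exponents; the largest log-likelihood wins.
candidates = [1.5, 2.0, 2.5]
scores = {
    a: binned_log_likelihood(discrete_power_law_pmf(a, 1, 1000), bins, counts)
    for a in candidates
}
best_alpha = max(scores, key=scores.get)
```

The same `binned_log_likelihood` function could in principle be reused for any candidate distribution by swapping in a different probability mass function, which is what makes the log-likelihood values directly comparable across distributions.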

However, since we want to be able to say something about the seriousness of a potential epidemic in the population, it may be more useful to consider which set of parameters gives the closest agreement with the Blue Sheep data in terms of the final epidemic size.

To make this comparison, we sum the absolute values of the differences between the Blue Sheep prediction and our fitted prediction at 15 different values of ε, spread evenly from 0.00001 to 0.1. We can then choose the set of parameters that minimises this sum for each distribution, and compare these minimum values with each other to find the ‘best’ fit by this criterion.
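The comparison just described can be sketched as follows. This is a minimal illustration under stated assumptions: the ε values are taken to be linearly spaced (the text says only "spread evenly", so logarithmic spacing is also possible), and `reference_final_size` and `make_fitted` are hypothetical placeholder curves standing in for the Blue Sheep and fitted final-size predictions, not epidemic models from this work.

```python
def epsilon_grid(lo=0.00001, hi=0.1, n=15):
    """n values spread evenly (linearly, by assumption) from lo to hi inclusive."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

def cumulative_abs_difference(reference, fitted, eps_values):
    """Sum over eps of |reference(eps) - fitted(eps)|."""
    return sum(abs(reference(e) - fitted(e)) for e in eps_values)

# Placeholder final-size curves (hypothetical, for illustration only).
def reference_final_size(eps):
    return 5.0e7 * eps / (eps + 0.01)

def make_fitted(scale):
    return lambda eps: scale * eps / (eps + 0.01)

# Two hypothetical parameter sets for one distribution.
candidates = {"fit_A": make_fitted(4.0e7), "fit_B": make_fitted(4.8e7)}

eps_values = epsilon_grid()
errors = {
    name: cumulative_abs_difference(reference_final_size, f, eps_values)
    for name, f in candidates.items()
}
best = min(errors, key=errors.get)  # parameter set closest to the reference
```

Repeating this for each distribution and comparing the minimum values of `errors` gives the ranking used in the rest of this section.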

When we perform this analysis for xmin = 1, we find that the offset truncated power law distribution is again by far the best choice. However, the set of parameters which gives the closest agreement with the Blue Sheep data is obtained by minimising the difference between the cdfs of the Blue Sheep data and the distribution for binning 2, and the parameter values are a = 3.10 and c = 1.20. The minimum cumulative difference we

the parameters which give the best likelihood give the greatest discrepancy in the number of infections when considering only the offset truncated power law, and the difference in the final sizes is 17,300,000 on average.

Despite this large leap between the best and worst fits, the worst-performing set of parameters for the offset truncated power law distribution still beats every parameter fit for the log-normal and discrete power law distributions. For the log-normal, the best fit is the likelihood fit to binning 2, with a difference of 23,200,000, whilst for the discrete power law the best fit comes from minimising the cdf error for binning 1, with an error of 43,500,000. This shows how poorly the log-normal and discrete power law perform on this measure, and tells us that the offset truncated power law distribution is clearly more descriptive of the data.

To achieve a more predictive fit to the amount of infection in the population for the discrete power law, we increased xmin and then minimised the absolute error in the cdf between the distribution and the Blue Sheep data. Doing this, we obtain a set of parameters which is far more accurate in terms of the number of infecteds, as can be seen best in figure 4.20a. Here, using xmin = 1,500 for the discrete power law, the cumulative error is 3,500,000, which is far better than using lower values of xmin, but the best offset truncated power law distribution is still more than twice as good.
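A fit of this kind, where the absolute cdf error is minimised over the part of the data at or above a chosen xmin, might look like the sketch below. It assumes a discrete power law with probability proportional to x^(-alpha) and a finite upper cut-off `x_max`, and it uses a simple grid search; the data, the grid and the function names are hypothetical and are not the procedure or the Blue Sheep data used here.

```python
def empirical_cdf(sizes, x_min):
    """Empirical cdf of the observations with size >= x_min, as {x: F(x)}."""
    tail = sorted(s for s in sizes if s >= x_min)
    n = len(tail)
    cdf = {}
    for i, s in enumerate(tail, start=1):
        cdf[s] = i / n  # repeated sizes overwrite, leaving F at the last copy
    return cdf

def power_law_cdf(alpha, x_min, x_max):
    """Cdf of p(x) proportional to x**(-alpha) on x_min..x_max (assumed cut-off)."""
    weights = [x ** (-alpha) for x in range(x_min, x_max + 1)]
    total = sum(weights)
    cdf, running = {}, 0.0
    for x, w in zip(range(x_min, x_max + 1), weights):
        running += w
        cdf[x] = running / total
    return cdf

def abs_cdf_error(sizes, alpha, x_min, x_max):
    """Sum of |empirical cdf - model cdf| over the observed sizes >= x_min.

    Every observed size must be <= x_max.
    """
    emp = empirical_cdf(sizes, x_min)
    model = power_law_cdf(alpha, x_min, x_max)
    return sum(abs(emp[x] - model[x]) for x in emp)

def fit_alpha(sizes, x_min, x_max, alphas):
    """Grid search for the exponent with the smallest absolute cdf error."""
    return min(alphas, key=lambda a: abs_cdf_error(sizes, a, x_min, x_max))

# Hypothetical workplace sizes (NOT the Blue Sheep data).
sizes = [1, 1, 1, 1, 2, 2, 3, 5, 8, 20, 50]
x_max = max(sizes)
alphas = [1.0 + 0.1 * k for k in range(21)]  # hypothetical grid, 1.0 to 3.0

alpha_all = fit_alpha(sizes, 1, x_max, alphas)   # fit using all of the data
alpha_tail = fit_alpha(sizes, 5, x_max, alphas)  # fit using the tail only
```

Raising `x_min` discards the small sizes, so the fitted exponent is determined by the tail alone; in general `alpha_all` and `alpha_tail` need not agree.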

We have seen that the chosen distributions fit the cdf of the data with varying degrees of success, from well for the offset truncated power law to poorly for the discrete power law. This translated into strong or poor agreement with the Blue Sheep data for the final sizes of an epidemic in a population for various values of ε. It was shown that the strength of this agreement was highly dependent on the tail rather than the bulk of the distribution. In fitting the data as closely as possible, we therefore have to decide which measure to judge the success of the fit by. If we were simply interested in fitting the data to a certain distribution, so that we could say that the data followed this distribution, then we could simply choose the distribution and set of parameters which gave the largest value of the likelihood.

We can therefore conclude that the offset truncated power law distribution fits the data more satisfactorily than the discrete power law or log-normal distribution, as it outperforms both by a considerable margin in terms of likelihood and predictive power.

However, we note that the set of parameters that we would report as fitting the data best depends on what we are interested in doing with the distribution. If we simply wish to report the most probable set of parameters in terms of the ability to describe the cdf of the data, we would report that a = 4.40 and c = 1.47. If instead we were interested in studying the effect of the parameters on the final size of an epidemic, we would report that a = 3.10 and c = 1.20. What is to be done with the information is therefore an important consideration to keep in mind when attempting to fit a distribution to data.

To increase the agreement between the worst-performing distribution, the discrete power law, and the Blue Sheep data in terms of the predicted number of infecteds, we have also fitted to the cdf for different values of xmin. For a value of xmin = 1,500, we achieved a good agreement between the power law and the actual data for the predicted number of infecteds. This was much improved compared to any set of parameters obtained using xmin = 1. This tells us that the tail of the distribution is of great importance in

and it is not instructive to simply find a parameter set which fits well in the bulk of the distribution and be confident that it describes the data well in the respect we are interested in.

In general, it is interesting to see how the profile of the different distributions affects not only the fit of the distributions to the cdf, but also the way in which the total number of infections predicted differs from that of the Blue Sheep data. For the discrete power law, the large workplace sizes are oversampled when we choose parameters to match the empirical cdf. On the other hand, the log-normal and offset truncated power law distributions select more small workplace sizes. This means that the number of infecteds that these distributions predict can be greater than the data suggests (discrete power law, 4.17) or fewer (offset truncated power law, 4.10 and log-normal, 4.23).

If we are interested in including workplaces in the spread of epidemics in countries for which we have no data on the workplace size distribution and no idea what the distribution may be, then it is plausible that we would select a discrete power law, as this will in all likelihood not underestimate the severity of a potential epidemic, though it may produce a worst-case scenario which is difficult to believe. However, as it has been shown that for the UK the offset truncated power law gives the best fit (of the distributions considered), and this was the distribution produced for the US in Ferguson et al. [2006], it is likely that, for economically developed Western countries at least, this is a fair choice of workplace size distribution.
