The spatial microsimulation modelling was conducted using R, an open-source programming language (R Core Team, 2015). R is an object-oriented language with a particular focus on statistics and graphical output. The object-oriented nature of the language makes it highly suited to data analysis and manipulation, which are particularly useful for population analysis. The analysis conducted in this research was based on methods used in Lovelace and Dumont’s (2016) book ‘Spatial Microsimulation with R’. The annotated source code used for the spatial microsimulation in this study is available in full online (https://github.com/tombro1987/SMStoothdecay).

The main function used for the analysis came from the ‘ipfp’ package (Blocker, 2015), which provides a fast implementation of the IPF procedure written in the C programming language. The function requires three types of input data: a data frame containing aggregated Census data for the geographical area of interest (the constraint variables, or ‘wide’ data; Figure 12); a data frame containing individuals from the ADHS, with their associated constraint data and the variables that the user wishes to ‘create’ in the new dataset (the ‘long’ data; Figure 13); and a data frame of the same survey individuals containing the constraint data in a Boolean (or model matrix) format (Figure 14). In the Boolean format, a ‘1’ is placed in the column that applies to the individual and a ‘0’ in the others. For instance, if an individual is 25 years old, a ‘1’ will be placed in the ‘25-34’ age grouping, with zeros in the other seven age columns. The row sum for each individual was checked to make sure that it equalled 6, as there should be exactly one ‘1’ for each of the six constraint variables. The column order of the aggregated Census data and the Boolean format survey data should be the same, and should preferably follow the order of constraint application (i.e. least influential to most).
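The Boolean coding described above can be sketched in a few lines. This is a minimal illustration in Python rather than the R code used in the study; the category names, the two constraint variables, the helper functions `age_band` and `to_model_row`, and the three survey records are all hypothetical (the study used six constraints and eight age bands).

```python
# Hypothetical constraint categories; only two constraint variables are shown.
CATEGORIES = {
    "age": ["16-24", "25-34", "35-44"],
    "sex": ["male", "female"],
}

# Column order that the census table and the model matrix would both share.
COLUMNS = [f"{var}_{cat}" for var, cats in CATEGORIES.items() for cat in cats]


def age_band(age):
    """Map an age in years to one of the illustrative age bands."""
    if 16 <= age <= 24:
        return "16-24"
    if 25 <= age <= 34:
        return "25-34"
    if 35 <= age <= 44:
        return "35-44"
    raise ValueError(f"age {age} is outside the illustrative bands")


def to_model_row(person):
    """Return a 0/1 row with exactly one 1 per constraint variable."""
    membership = {"age": age_band(person["age"]), "sex": person["sex"]}
    return [
        1 if membership[var] == cat else 0
        for var, cats in CATEGORIES.items()
        for cat in cats
    ]


# Hypothetical 'long' survey data: one record per individual.
survey = [
    {"id": 1, "age": 25, "sex": "female"},
    {"id": 2, "age": 19, "sex": "male"},
    {"id": 3, "age": 41, "sex": "female"},
]

model_matrix = [to_model_row(p) for p in survey]

# Each row must sum to the number of constraint variables (6 in the study, 2 here).
assert all(sum(row) == len(CATEGORIES) for row in model_matrix)
```

Deriving the column names and the rows from the same `CATEGORIES` mapping is one way of keeping the column order identical between the two tables.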

The reason for this extra Boolean formatted data frame is that the ADHS and Census data are not directly comparable in their original formats, and therefore this third data frame ‘flattens’ the individual level data in the ADHS, so that it matches the format of the Census (Lovelace and Dumont, 2016). It then becomes clear which categories an individual belongs to for each constraint variable, and the two data sets can then be compared.
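As a rough illustration of why the flattened format makes the two sources comparable, the column sums of a dummy-coded survey table have the same layout as one row of census counts. The sketch below is in Python, not the study's R code, and every number and column name in it is invented for the example.

```python
# Hypothetical shared column order.
COLUMNS = ["age_16-24", "age_25-34", "sex_male", "sex_female"]

# Hypothetical census counts for one area, in that column order.
census_row = [120, 80, 95, 105]

# Hypothetical dummy-coded survey rows, one per individual.
model_matrix = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
]

# Column sums give the number of survey individuals in each category.
survey_totals = [sum(col) for col in zip(*model_matrix)]

# The two vectors now line up category by category.
for name, census, survey in zip(COLUMNS, census_row, survey_totals):
    print(f"{name}: census={census}, survey={survey}")
```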

Figure 12 – Constraint variables in R (each row represents the population total of an LSOA)

Figure 13 – Participant data from the ADHS (each row represents an individual)

Figure 14 – Model matrix of survey constraint data in dummy coded format

The reweighting process used in the ‘ipfp’ package is similar in nature to the method undertaken by Anderson (2007). All individuals in the survey are given an initial weight. Sometimes this is automatically set to 1, but it can also be calculated by dividing the number of individuals in an area (from the Census data) by the number of households from the survey data. Anderson (2007) suggests using a regional weighting technique, which involves excluding (or giving a weight of 0 to) individuals from outside the region being simulated, the idea being that this ‘avoids filling, for example, Sheffield with Londoners’ (p.12). In principle this sounds like a reasonable approach; however, the ADHS data are only available at the Strategic Health Authority (SHA) level, meaning that the lowest spatial scale relevant to Sheffield was the Yorkshire and the Humber region.

The Yorkshire and the Humber region is far from homogeneous, and Sheffield is just as likely to have characteristics in common with towns and regions from other parts of the UK as with the rest of the Yorkshire and the Humber region. This assumption is in line with work conducted by the ONS on ‘statistical neighbours’ (Office for National Statistics, 2011c). According to this classification, Leeds is the most similar local authority to Sheffield, followed by Newcastle-upon-Tyne, Cardiff, Preston and Derby. Sheffield is therefore more similar to a number of towns and cities outside the Yorkshire and the Humber region than to some of those within it. It did not necessarily make sense to exclude other regions from the analysis, and doing so would also have reduced the sample size significantly. The accuracy of spatial microsimulation models can suffer when sample sizes are reduced: with a smaller pool of individuals, and a potential reduction in the variety of characteristics within the sample, it may be harder for the method to create the target variables as accurately as desired. Target variables are those that are simulated from the survey data and that do not currently exist in any data sources produced for small area geographies (e.g. tooth decay). This view is supported by Ryan et al. (2009), who found that ‘as input sample size increases, resulting populations experience gains in accuracy’ (p.201).
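The trade-off between the two initial weighting options can be made concrete with a small sketch. It is written in Python for illustration and is not taken from the study's code. The function `initial_weights`, the record layout and the numbers are hypothetical, and dividing the area population by the number of eligible survey records is an assumption about how the regional variant would be scaled.

```python
def initial_weights(survey, area_population, region=None):
    """Give every eligible survey record the same starting weight.

    If `region` is given, records from other regions receive a weight of 0
    (the regional approach, which the study described above chose not to use).
    """
    eligible = [region is None or rec["region"] == region for rec in survey]
    n_eligible = sum(eligible)
    if n_eligible == 0:
        raise ValueError("no survey records available for this region")
    share = area_population / n_eligible
    return [share if ok else 0.0 for ok in eligible]


# Hypothetical survey records with the region each respondent lives in.
survey = [
    {"id": 1, "region": "Yorkshire and the Humber"},
    {"id": 2, "region": "Yorkshire and the Humber"},
    {"id": 3, "region": "London"},
    {"id": 4, "region": "North East"},
]

all_regions = initial_weights(survey, area_population=1500)
regional = initial_weights(
    survey, area_population=1500, region="Yorkshire and the Humber"
)

# The regional variant halves the pool of usable individuals in this toy case.
print(sum(w > 0 for w in all_regions), sum(w > 0 for w in regional))
```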

The reweighting method used in the ‘ipfp’ package is the same as that displayed in the worked example of the IPF procedure in Section 4.3, where the formula below is applied to the six constraint variables iteratively until convergence of the datasets is achieved.

$$n_i = w_i \times \frac{s_{ij}}{m_{ij}} \qquad \text{(Equation 1)}$$

Anderson (2007) states that 20 iterations were enough to achieve convergence, whereas other authors have suggested that only 10 are required (Ballas et al., 2005a). As a compromise, 15 iterations were used in this research.

More than one target variable needed to be simulated in this research, because individual-level data were required to fill out the non-neighbourhood-based variables in the pathways. This was a fairly simple process: the additional target variables were added to the dataset of individuals from the ADHS before data manipulation took place, and the number of people was then checked to make sure that it matched between the two input data sources taken from the survey data (the participant data and the model matrix data). The original sample size for a model simulating only tooth decay was 5388 individuals; however, once the additional target variables had been added, this was reduced (through missing data) to a final sample size of 4840 individuals.
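The iterative reweighting in Equation 1 can be sketched as a short loop. This is a plain Python illustration and not the ‘ipfp’ package or the study's R code. It assumes that n_i is the new weight, w_i the current weight, s_ij the census total for category j and m_ij the current weighted survey total for that category. The function `ipf`, the four survey rows and the census counts are hypothetical.

```python
def ipf(census_row, model_matrix, constraints, iterations=15):
    """Reweight survey individuals towards one area's census totals.

    census_row   : target counts, one per category column
    model_matrix : 0/1 rows, one per individual, same column order
    constraints  : list of column-index lists, one per constraint variable
    """
    weights = [1.0] * len(model_matrix)  # initial weight of 1 per individual
    for _ in range(iterations):
        for cols in constraints:  # apply one constraint variable at a time
            for j in cols:
                # m_ij: current weighted survey total for category j
                m = sum(w * row[j] for w, row in zip(weights, model_matrix))
                if m == 0:
                    continue  # nobody in the survey falls in this category
                ratio = census_row[j] / m  # s_ij / m_ij
                weights = [
                    w * ratio if row[j] else w
                    for w, row in zip(weights, model_matrix)
                ]
    return weights


# Columns: young, old, male, female (two hypothetical constraint variables).
model_matrix = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
]
census_row = [6, 4, 7, 3]      # invented counts for one area
constraints = [[0, 1], [2, 3]]  # age columns, then sex columns

weights = ipf(census_row, model_matrix, constraints, iterations=15)

# Weighted survey totals, for comparison with census_row.
fitted = [
    sum(w * row[j] for w, row in zip(weights, model_matrix))
    for j in range(len(census_row))
]
print([round(w, 3) for w in weights], [round(f, 3) for f in fitted])
```

A fixed number of iterations mirrors the approach described above. A tolerance-based stopping rule on the difference between `fitted` and `census_row` would be an alternative design.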
