The local area deprivation indices used to represent place in the previous section are single statistics, limited by their original development (they are based on a working-age population) and by the purpose and theories of poverty/deprivation underpinning them. Comparisons across indices, and the inferences drawn about their effect on six-month survival, gave unclear results with large unexplained variation. The relationship between local area deprivation indices and population density was also unclear, which makes it difficult to decipher the meaning of the relationships uncovered. This led to the next stage of exploration into place effect: in this section, individual features of place are combined to represent place. The local area census variables considered within this analysis are listed in Table 4.8.
Table 4.8: Local area census variables by LSOA descriptions, means and variances

| Variable description | Code | Women, mean (sd) | Men, mean (sd) |
|---|---|---|---|
| Proportion of persons in single parent household | Singleparent | 10.1 (5.1) | 10.0 (5.1) |
| Proportion of unskilled | Unskill | 19.1 (7.9) | 19.1 (8.0) |
| Proportion of unemployed | Unemployed | 3.2 (2.1) | 3.2 (2.2) |
| Proportion of >1 per person per room | Overcrowd | 5.1 (4.4) | 5.1 (4.6) |
| Proportion of immigrants | Immigrants | 7.9 (2.8) | 7.9 (2.7) |
| % people that are white within the area | Perwhite | 88.0 (11.3) | 88.1 (11.3) |
| % on state benefits, unemployed, lowest grade workers | Statebenefits | 18.3 (7.3) | 18.1 (7.1) |
| % with a car or van | Car | 70.7 (15.4) | 71.1 (15.6) |
| % white in the area | Whitearea | 94.1 (12.) | 93.6 (13.8) |
| % of houses occupied with 1 person | Onepersonhouse | 29.2 (8.7) | 28.8 (8.7) |
| % lowest floor level at street level | Lowfloorstreetlevel | 86.5 (12.1) | 86.8 (12.3) |
| % single (never married) in the area | Singlearea | 41.8 (7.4) | 41.7 (7.5) |
| Density (Number of Persons per Hectare) | Popdensity | 27.3 (22.7) | 27.0 (23.2) |
| % Christians in the area | Christians | 74.3 (11.9) | 74.1 (12.8) |
| % of pensioners that own their house | Penownhoue | 64.7 (22.8) | 65.9 (22.9) |
| % people that own their house | Ownhouse | 71.5 (20.3) | 72.1 (20.3) |
| % of over 65 years old | Over65 | 16.9 (5.7) | 16.8 (5.6) |
| Combined Living Environment Indicator | Liven | 27.2 (17.9) | 26.9 (18.0) |
| % with central heating | Centralheat | 12.0 (11.3) | 11.9 (11.3) |
| Urban/rural classifier | Urbrur | 4.0 (0.9) | 4.0 (0.9) |
| Combined Air Quality Indicator | Airqual | 1.3 (0.2) | 1.3 (0.2) |
| Mean age of population in the area | Meanage | 39.3 (4.2) | 39.3 (4.3) |
| Disability Living Allowance Claimants; Total | Disallow | 92.3 (45.7) | 92.6 (46.3) |
| Income Support Claimants; Total | Incomesupport | 50.2 (46.5) | 50.5 (43.2) |
| Jobseekers Allowance Claimants; Total | Joballow | 44.1 (27.6) | 44.4 (29.3) |
| Pension Credit Claimants; Total | Pencred | 83.3 (45.8) | 81.9 (44.8) |
| Combined Barriers to Housing and Services Indicator | Barrierhouse | 18.9 (8.6) | 18.9 (8.8) |
| Population Average Road Distance to Food Store | Avdisfood | 1.6 (1.8) | 1.6 (1.8) |
| Population Average Road Distance to GP Premises | AvdisGP | 1.4 (1.1) | 1.4 (1.1) |
| Population Average Road Distance to Post Office | AvdisPO | 0.9 (0.5) | 0.9 (0.5) |
| Combined Education, Skills and Training Indicator | Edu | 28.5 (23.3) | 28.4 (23.3) |
| Pupil Absence Rate | Absencerate | 9.2 (2.2) | 9.2 (2.2) |
| Combined Employment Indicator | Employ | 0.1 (0.1) | 0.1 (0.1) |
| Housing In Poor Condition | Housepoorcon | 0.4 (0.1) | 0.4 (0.1) |
| % of people with no qualifications | Noqual | 34.2 (11.8) | 34.2 (11.8) |
A broader range of census data relating to the features of the local area, such as Population Average Road Distance to Food Store, was subsequently drawn upon (displayed in the table above). The aim was to build up a picture of a wide range of interacting local area dimensions that together might have an effect on six-month survival.
4.5.1 Choice of statistical techniques
Survival analysis was the approach taken, given the format of the health outcome data available. A similar logical ordering was followed for the modelling process. This explored the underlying structure of the local area census data. The insight about the underlying data structure was then used to progress the analysis.
Exploration into the data structure was undertaken using cluster analysis, by clustering together local area census variables (discussed in section 4.5.2). The strong relationships present between variables were identified and an appropriate number of natural groupings was determined.
The number of local area census variables was large, so any standard regression model fitted to them would have been sizeable and complex. In addition, because of the nature of the information they contain, the local area census variables were highly correlated with one another, so the resulting model would have suffered from collinearity and confounding. For example, the number of single parents in an area is highly correlated with the number of unskilled workers (confidence interval for the correlation coefficient: men 0.73 to 0.74, p<0.001; women 0.74 to 0.76, p<0.001). Cox proportional hazards regression fitted directly to these variables was therefore not suitable in this context.
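As an illustration of how a correlation coefficient and its confidence interval of the kind quoted above might be obtained, the following is a minimal Python sketch. It assumes a Pearson correlation with a Fisher z-transform interval, which the text does not confirm as the method used, and the data values are hypothetical rather than taken from the study.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% interval for r via Fisher's z-transform (an assumed method)."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Hypothetical area-level percentages, for illustration only
singleparent = [4.0, 6.5, 9.0, 10.5, 12.0, 15.5, 18.0, 21.0]
unskill = [9.0, 14.0, 15.5, 20.0, 19.0, 26.0, 27.5, 33.0]

r = pearson_r(singleparent, unskill)
lo, hi = fisher_ci(r, len(singleparent))
print(f"r = {r:.2f}, interval ({lo:.2f}, {hi:.2f})")
```

With many thousands of areas the interval becomes very narrow, which is consistent with the tight ranges reported in the text.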
By first clustering the local area census variables (forming classes), the size of the model is reduced and the collinearity problems are no longer present. These classes are then regressed upon survival within a latent class regression model. Personal characteristics and clinical factors are also considered, having a different influence on survival when considered for the different classes fitted. How this is achieved is discussed in sections 4.5.3 and 4.5.4.
4.5.2 Clustering
Everitt, Landau and Leese (2001) define clustering as deriving a useful division of data into an optimal number of groupings (clusters). The properties of the local area census variables are used to determine homogeneity within the clusters and heterogeneity between clusters. Clustering is quick to conduct and easy and flexible to administer, with the outcome determined by the criteria followed. The inferences drawn can vary greatly depending on the clustering approach taken, as they are sensitive to the choices made and subject to the analyst’s interpretation; this makes validation of results problematic.
Two clustering approaches were considered, so that a picture of the level of variability could be gained and overall interpretations made, in order to form an overview of the underlying structure of the interconnections between the 36 local area census variables (descriptions of each are given in Table 5.12).
Initially, a connectivity model using hierarchical clustering was fitted, clustering on the local area census variables, a technique known as variable clustering (Everitt, Landau and Leese, 2001). This starts with each local area census variable considered as a separate grouping, then merges them into larger groupings, either by how ‘close’ their values are in terms of distance or by p-values obtained via multi-scale bootstrap re-sampling, which assesses the uncertainty by providing approximately unbiased (AU) p-values as well as bootstrap probability (BP) values. The process is repeated multiple times to form a hierarchy of groupings that represents the complex underlying relationships between local area census variables. Hierarchical clustering drew attention to differences between men and women, and indicated that between two and three clusters should be used.
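The agglomerative idea behind variable clustering can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the analysis: the distance between two variables is assumed to be 1 - |r|, average linkage is assumed, the bootstrap AU/BP p-values are not computed, and the data values are hypothetical.

```python
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def cluster_variables(data, n_clusters):
    """Average-linkage agglomerative clustering of variables.

    data maps a variable name to its list of area-level values;
    the distance between two variables is taken as 1 - |r| (an assumption).
    """
    names = list(data)
    dist = {frozenset((a, b)): 1 - abs(pearson_r(data[a], data[b]))
            for i, a in enumerate(names) for b in names[i + 1:]}
    clusters = [[name] for name in names]  # every variable starts alone

    def linkage(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

# Hypothetical values for six areas, for illustration only
data = {
    "Singleparent": [5, 8, 12, 6, 15, 10],
    "Unskill": [10, 15, 22, 13, 28, 19],
    "Avdisfood": [0.5, 2.5, 1.0, 3.0, 1.5, 0.8],
    "AvdisGP": [0.6, 2.0, 1.1, 2.6, 1.2, 0.7],
}
clusters = cluster_variables(data, n_clusters=2)
print(clusters)
```

Stopping the merging at a chosen number of clusters corresponds to cutting the hierarchy at one level; recording every merge instead would give the full hierarchy of groupings.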
Then mixture clusters were fitted. These treat each cluster as a separate normal distribution with its own mean and covariance parameters; the overall data distribution is a combination of these distributions. Mixture clustering was able to capture correlations in the data and produce optimum clusters independent of their size. Assumptions of normality can be problematic within a small sample of cases, unless restrictions are put in place. Although this is less of a problem in such a large dataset, care had to be taken not to overfit the data. There are many different ways of fitting mixture models. An ellipsoidal, equal-shaped cluster model was chosen, to maximise the fit of the model to the data while accounting for complexity. The number of clusters had to be chosen before fitting the models: three were chosen on the basis of the insight from the hierarchical model, and so that comparisons could be made with the fit of previous models.
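To make the mixture idea concrete, the sketch below fits a two-component normal mixture to a single variable with a simple expectation-maximisation loop. It is a deliberately reduced illustration: one dimension, two components and freely varying variances are assumptions made here for brevity, and it does not reproduce the ellipsoidal, equal-shaped multivariate model described above. The data values are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_two_component_mixture(xs, n_iter=100):
    """Fit a 1-D, two-component normal mixture by EM."""
    # Crude start: means at the extremes, shared overall variance, equal weights
    overall_mean = sum(xs) / len(xs)
    overall_var = sum((x - overall_mean) ** 2 for x in xs) / len(xs)
    means = [min(xs), max(xs)]
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: probability that each observation belongs to each component
        resp = []
        for x in xs:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: update weights, means and variances from those probabilities
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var_k, 1e-6)  # floor guards against collapse
    return weights, means, variances

# Hypothetical values forming two well-separated groups
xs = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.3, 4.8, 5.1, 4.9]
weights, means, variances = fit_two_component_mixture(xs)
print(weights, means, variances)
```

The variance floor is one simple example of the kind of restriction mentioned above; constraining all components to share a shape is another.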
Clustering provides valuable insight into the underlying relationships present in the local area census data. Hierarchical clustering determined that two or three clusters were optimum and uncovered potential groupings.
4.5.3 Latent class survival regression analysis
Latent class regression analysis combines the benefits of being able to cluster together local area census variables, according to their underlying data structure, with being able to regress on survival. Inferences about place effect on survival can be drawn from the latent class regression models. Collinearity is not present, due to the structure of the model.
Mixture clustering was used because it was able to fit the data structure sufficiently well. Mixed-type clusters were formed in which people were grouped together if they lived in areas that were similarly represented by local area statistics. The cases within a cluster were assumed to have come from the same probability distribution, whose parameters were estimated from the data. The regression model formed is similar to the Cox proportional hazards model discussed in section 4.4.3.2. The clustering of local area census data (latent classes) is fitted in this model, which is also known as a finite mixture model for structured data (Vermunt and Magidson, 2000; Latent Gold, 2011). Personal characteristics and clinical factors are fitted in this model and their influences on survival rate can vary across the different classes formed. For example, if two classes are formed within a model, then a predictor variable (for example, increased age) could have a negative influence on survival within one class and a positive influence on survival within the second class, as the classes represent different subsets of data.
Latent class regression models are parametric, and therefore make assumptions about the distribution relating the predictor and class formation variables to six-month survival. Parameters are estimated using an expectation maximisation (EM) algorithm, which finds the maximum likelihood estimates under the assumed distribution. Because of this model complexity (and the large increase in the number of distribution parameters as latent classes are added), the modelling technique, although known for many years, has only recently become feasible through advances in computing.
To reduce complexity, a parsimonious model was sought, while maximising the survival variation explained with the model. A similar modelling strategy to the Cox proportional hazards models was used (as discussed in section 4.4.3.2). Models were built manually, considering the clinical and statistical significance of potential predictor variables (personal and clinical factors) and covariates (local area census variables that form classes within the model). This process was undertaken in three stages:
1) Determining the optimal number of classes
2) Choosing covariates that influence class formation
3) Choosing predictor variables that influence survival at six months
The number of classes was chosen by initially fitting models for one to four classes; this was restricted to four based on the information gained during the cluster analysis. These models were then compared using best-fit statistics. Akaike information criterion (AIC), Bayesian information criterion (BIC), likelihood ratio chi-square (L²), classification error and R² (coefficient of determination) were used to choose the model that had the optimal number of classes.
AIC is a single number that balances the complexity of the model against how well the model fits the data. BIC makes a similar comparison to that of AIC, although it penalises more complex models more heavily. L² makes a comparison between the likelihoods of the models considered. Classification error is a measure of how often the model predicts survival incorrectly and, finally, R² is used to gauge how much variation the model leaves unexplained. Choosing the number of classes was not straightforward, as different statistics pointed to different numbers of classes. Compromises were made: the optimal number of classes is unknown, and the accuracy of this discussion therefore rests on interpretation of the models.
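How two fit statistics can point to different numbers of classes is easy to show with AIC and BIC. The sketch below assumes the standard log-likelihood-based definitions (AIC = 2k - 2 ln L, BIC = k ln n - 2 ln L); the log-likelihoods, parameter counts and sample size are hypothetical and are not results from this analysis.

```python
import math

def aic(log_lik, n_params):
    """Akaike information criterion (lower is better)."""
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion; the penalty grows with sample size."""
    return n_params * math.log(n_obs) - 2 * log_lik

# Hypothetical fits: (number of classes, log-likelihood, number of parameters)
candidates = [(1, -5210.0, 10), (2, -5105.0, 21), (3, -5070.0, 32), (4, -5062.0, 43)]
n_obs = 4000  # hypothetical sample size

table = [(k, aic(ll, p), bic(ll, p, n_obs)) for k, ll, p in candidates]
best_by_aic = min(table, key=lambda row: row[1])[0]
best_by_bic = min(table, key=lambda row: row[2])[0]

for k, a, b in table:
    print(f"classes={k}  AIC={a:.1f}  BIC={b:.1f}")
print("preferred by AIC:", best_by_aic, " preferred by BIC:", best_by_bic)
```

In this made-up example the heavier BIC penalty favours fewer classes than AIC does, which is the kind of disagreement that requires a judgement about which statistic to prioritise.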
Three classes were chosen. Simplification began with the latent class regression model, determined by all 36 local area census covariates and a variety of predictor variables. The number of local area census covariates was then reduced, using backwards elimination. This process was repeated, using personal characteristics and clinical factors, and the previously mentioned comparison statistics, to choose a model from which to make inference. A more complex modelling strategy was not attempted, due to the size and complexity of the models involved.
Once simplified, the models could then be used to make inferences about how locality formation influences survival at six months, exploring the different characteristics of the people who reside within these localities.
4.5.4 Approximation of a latent class survival regression model
Fitting a latent class survival regression model was complex and time consuming, and resulted in multiple problems. To reduce these issues, a simpler and approximately equivalent model was fitted instead: the survival model was estimated using a Poisson regression model.
A Cox model is equivalent to a Poisson regression model in the form of a piecewise exponential survival model. This requires the data to be in the form of episode records whose end points correspond to the times at which events occur (see Vermunt, 1997, for further detail).
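The episode-record layout can be sketched as follows. This is an illustrative Python helper, not the data preparation used in the analysis: the function name, the tuple layout and the cut points are hypothetical, and fixed cut points are used here for simplicity where the text describes end points placed at the observed event times.

```python
def split_into_episodes(time, died, cut_points):
    """Split one follow-up record into piecewise episode records.

    time       : follow-up time until death or end of observation
    died       : True if the follow-up ended in death
    cut_points : increasing interval end points (hypothetical values below)
    Returns a list of (interval_index, exposure_time, death_indicator).
    """
    episodes = []
    start = 0.0
    for j, end in enumerate(cut_points):
        if time <= start:
            break  # follow-up finished before this interval began
        exposure = min(time, end) - start
        event = int(died and time <= end)  # death falls in this interval
        episodes.append((j, exposure, event))
        start = end
    return episodes

cut_points = [1.0, 3.0, 6.0]  # hypothetical interval ends, in months
dead_at_2_5 = split_into_episodes(2.5, True, cut_points)
alive_at_6 = split_into_episodes(6.0, False, cut_points)
print(dead_at_2_5)
print(alive_at_6)
```

Each person contributes one row per interval in which they were at risk, with the death indicator set only in the interval where the death occurred.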
Holford (1980), and Laird and Oliver (1981), originally realised that a piecewise proportional hazards model can be fitted as a certain Poisson regression model. The correspondence arises because the log-likelihood for censored exponential data coincides, up to a term that does not involve the model parameters, with the log-likelihood that would be obtained by treating the number of deaths as a Poisson random variable whose mean is the hazard multiplied by the exposure time. The only difference between the two is a term in the log of the total observation (or exposure) time. This is a constant and therefore has no effect on the inferences drawn from the model.
When fitting the approximate generalised Poisson regression model, binary mortality at six months was regressed on the predictors and covariates (latent classes), with the time to death used as the exposure time so that survival is accounted for. This leads to the same estimates and standard errors as treating the exposure times as censored observations from an exponential distribution.
Thus, the piecewise exponential proportional hazards model is equivalent to a Poisson log-linear model for the pseudo observations (one for each combination of individual and interval), where the death indicator is the response and the log of exposure time enters as an offset.
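The equivalence can be checked numerically. The Python sketch below compares the censored exponential log-likelihood with the Poisson log-likelihood for the same records and shows that their difference does not change with the hazard, so both are maximised by the same estimate. The records and hazard values are hypothetical, and a single constant hazard is assumed for simplicity.

```python
import math

def exponential_loglik(d, t, hazard):
    """Log-likelihood of one censored exponential observation.

    d is the death indicator (1 = died, 0 = censored), t the exposure time.
    """
    return d * math.log(hazard) - hazard * t

def poisson_loglik(d, t, hazard):
    """Log-likelihood treating d as Poisson with mean hazard * t."""
    mu = hazard * t
    return d * math.log(mu) - mu - math.lgamma(d + 1)

# Hypothetical (death indicator, exposure in months) records
records = [(1, 2.5), (0, 6.0), (1, 0.7), (0, 6.0)]

def total_gap(hazard):
    """Poisson minus exponential log-likelihood, summed over records."""
    return sum(poisson_loglik(d, t, hazard) - exponential_loglik(d, t, hazard)
               for d, t in records)

gap_low = total_gap(0.05)
gap_high = total_gap(0.40)
expected_gap = sum(d * math.log(t) for d, t in records)
print(gap_low, gap_high, expected_gap)
```

The gap equals the sum of d log(t) over the records, which depends only on the data; this is the role played by the log of exposure time entering the Poisson model as an offset.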