Chapter 3. Methodological strategy
3.2. Data collection method
3.2.3. Validity and reliability
In this thesis, we propose copula-based models to impute missing data, and we evaluate the models on two clinical data sets described below: the Quality in Acute Stroke Care (QASC) study (Middleton et al., 2011) and the survey of Menstrual Disorder of Teenagers (MDOT) (Parker et al., 2010). Because these data are sensitive and confidential, we do not provide public access to the data sets. The R code to reproduce the simulation results is available upon request.
1.3 Application data sets
1.3.1 Quality in Acute Stroke Care (QASC) study
The QASC study was a randomized controlled trial conducted in 2005-2007 which implemented a multidisciplinary intervention to manage fever, hyperglycaemia and swallowing dysfunction in acute stroke patients. It was one of the largest rigorously evaluated clinical trials to show that organized stroke unit care significantly reduced death and disability among stroke patients. Nineteen acute stroke units in New South Wales, Australia, participated in the study and were randomly assigned to an intervention group (10 units) or a control group (9 units). A pre-intervention and a post-intervention cohort of patients were recruited, and their demographic variables, process of care variables and health outcome variables were recorded. The two cohorts contained 595 and 885 patients respectively. The variables were of mixed type (continuous, ordinal and nominal), and the data structure was multilevel, with patients nested within hospitals. Almost all of the 15 variables contained missing values, with missing percentages ranging from 10% to 16%, leaving 75% complete cases in the study population. The researchers were primarily interested in whether the health outcome variables differed between the treatment and control groups. Our proposed imputation models in Chapter 2 and Chapter 3 were applied to the QASC data set.
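Modest per-variable missingness can still remove a sizeable share of patients from a complete-case analysis, because a patient is dropped as soon as any one variable is missing. The following minimal sketch shows how the two summaries quoted above (missing percentage per variable and percentage of complete cases) can be computed. The records, variable names and helper functions are hypothetical and are not taken from the QASC data.

```python
# Toy records; None marks a missing value.
# Variable names and values are hypothetical, not QASC data.
records = [
    {"age": 71, "mrs": 2, "temp": 37.1},
    {"age": None, "mrs": 3, "temp": 36.8},
    {"age": 64, "mrs": None, "temp": 37.4},
    {"age": 80, "mrs": 1, "temp": 36.9},
]


def missing_percentages(rows):
    """Percentage of missing (None) entries for each variable."""
    variables = rows[0].keys()
    return {
        v: 100.0 * sum(r[v] is None for r in rows) / len(rows)
        for v in variables
    }


def complete_case_rate(rows):
    """Percentage of rows with no missing entry in any variable."""
    complete = sum(all(val is not None for val in r.values()) for r in rows)
    return 100.0 * complete / len(rows)


print(missing_percentages(records))
print(complete_case_rate(records))
```

In this toy example two variables are each missing for a quarter of the rows, yet only half of the rows are complete, because the missing entries fall on different patients.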
1.3.2 Menstrual Disorder of Teenagers (MDOT) study
The Menstrual Disorder of Teenagers (MDOT) survey (Parker et al., 2010) was conducted in 2005 and 2016 to collect data on the menstrual patterns of teenage girls. Both surveys were conducted in the Australian Capital Territory (ACT) using the same questionnaire. The two cohorts of participants were teenage girls aged 15 to 19 from 4 senior high schools in 2005 and 3 senior high schools in 2016. The participating schools were selected based on their number of enrollments and were located across the ACT region. Consent forms were signed by the parents of all eligible girls before they participated in the surveys. The quality of the data was maximized by careful design of the questionnaire, support from the participating schools, and time allocated during class to fill in the questionnaires. Consistency between the two cohorts was ensured by using the same questionnaire and by following in 2016 the same data collection procedure as in 2005. There were more than 100 questions in the questionnaire, covering personal information, typical menstruation characteristics, menstrual symptoms, life interference, menstrual experiences, and knowledge and diagnosis of some menstruation diseases. Because of the large number of questions, fewer than 2% of participating girls answered every question. The MDOT data set can be used to address a range of clinical questions, such as which menstrual characteristics are changing over time and whether the MDOT questionnaire can identify girls at higher risk of developing endometriosis. We will investigate some questions of particular interest in Chapter 4.
These two data sets serve as our motivating examples, and we believe that the methods we develop can be adapted to other data sets with a similar structure, for example those requiring a three-level hierarchical model. Our proposed models can be used not only as imputation engines for missing data but also, more generally, as an innovative approach to the joint modeling of variables of mixed type.
Chapter 2
Copula-based imputation model for multilevel data sets
2.1 Introduction
Multivariate analysis often involves understanding the relationships among variables of different types. Our motivating data set is from a randomized controlled trial, the Quality in Acute Stroke Care (QASC) study (Middleton et al., 2011), which implemented a multidisciplinary intervention to manage fever, hyperglycaemia and swallowing dysfunction in acute stroke patients. Most of the variables in this multilevel data set contained missing values, and they were of mixed type (Table 2.1). In the ‘variable group’ column, ‘outcomes’ refers to the primary outcomes that assess the patients’ health status, and ‘process of care’ refers to the secondary outcomes recorded during the patients’ stay in hospital. ‘Allocations’ tracks the patients’ assignment to cohorts, treatment groups and hospitals. Discarding all patients with missing values is a commonly used approach to handling missing data, but it may lead to biased estimates and reduced statistical power (Van Buuren et al., 2011). The smaller sample size decreases the power to detect significant treatment effects, and this is especially serious in multilevel data sets because of the potential for positive dependence among units within the same cluster, such as patients in a hospital. In this chapter, we use the multiple imputation (MI) approach: we fill in missing values using our proposed imputation models and then perform statistical analyses on the imputed complete data sets.
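After each imputed data set has been analysed, the separate results have to be combined into a single estimate. This passage does not state how that is done. The sketch below assumes the standard choice, Rubin's rules, in which the pooled estimate is the mean of the per-imputation estimates and the total variance adds the between-imputation variance, inflated by (1 + 1/m), to the average within-imputation variance. The function name and the numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine m point estimates and their squared standard errors
    using Rubin's rules (assumed here; not stated in the text)."""
    m = len(estimates)
    q_bar = mean(estimates)      # pooled point estimate
    w = mean(variances)          # within-imputation variance
    b = variance(estimates)      # between-imputation variance (divisor m - 1)
    t = w + (1 + 1 / m) * b      # total variance of the pooled estimate
    return q_bar, t


# Hypothetical treatment-effect estimates from m = 5 imputed data sets.
est = [1.0, 1.2, 0.8, 1.1, 0.9]
var = [0.04, 0.05, 0.04, 0.05, 0.04]

q, t = pool_rubin(est, var)
print(q, t)
```

The between-imputation term is what carries the extra uncertainty caused by the missing values; analysing a single filled-in data set as if it were fully observed would omit it.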
Variable group    Variable name              Variable type   Missing percentage
Outcomes          modified Rankin Scale      ordinal         9.48%
                  Barthel Index              ordinal         15.14%
                  physical health score      continuous      15.74%
                  mental health score        continuous      15.74%
Allocations       hospcode                   indicator       0%
                  id                         indicator       0%
                  treatment                  binary          0%
                  period                     binary          0%
Demographic       gender                     binary          0%
                  age                        continuous      5.89%
                  marital status             nominal         14.8%
                  highest education level    ordinal         15.95%
                  ATSI                       binary          17%
Process of care   time to presentation       continuous      1.69%
                  length of stay             count           4.53%
                  mean temperature           continuous      4.73%
Table 2.1: Summary of variables in the QASC data set
Current imputation models for handling missing data are potentially inadequate for the QASC study, which is complicated by the clustering of patients within acute stroke units and by the mix of variable types (Goldstein et al., 2009). Hoff (2007) proposed a semiparametric copula model based on the extended rank likelihood to analyse multivariate data of mixed types. We extend the work of Hoff (2007) by adding random effects to introduce correlation among individuals within clusters, and by allowing for unordered nominal variables through a multinomial probit model.
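The idea behind the extended rank likelihood can be illustrated in a few lines. Each observed value is tied to a latent Gaussian value, and only the ordering of the observed values is used: within a variable, the latent values must be ordered in the same way as the observations. In a Gibbs sampler of the kind described by Hoff (2007), each latent value is therefore drawn from a normal distribution truncated to the interval computed below. This is a minimal sketch of that rank constraint only; the function name and data are hypothetical, and the random effects and multinomial probit extension proposed in this chapter are not shown.

```python
import math


def rank_bounds(y, z, i):
    """Interval in which z[i] must lie so that the ordering of the latent
    values z agrees with the ordering of the observed values y.

    Missing observations (None) impose no constraint, and a missing y[i]
    leaves z[i] unconstrained.
    """
    if y[i] is None:
        return -math.inf, math.inf
    lower = max(
        (z[k] for k in range(len(y)) if y[k] is not None and y[k] < y[i]),
        default=-math.inf,
    )
    upper = min(
        (z[k] for k in range(len(y)) if y[k] is not None and y[k] > y[i]),
        default=math.inf,
    )
    return lower, upper


# Hypothetical ordinal variable and current latent values.
y = [2, 0, 2, None, 4]
z = [0.1, -1.3, 0.4, 0.0, 1.7]

for i in range(len(y)):
    print(i, rank_bounds(y, z, i))
```

Tied observations do not constrain each other, which is why the same construction applies to ordinal variables with few categories as well as to continuous ones.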
The structure of this chapter is as follows. In Section 2.2 we review copula models and the extended rank likelihood for semiparametric Gaussian copula