
direction of effects operating between variables. However, such studies may be biased by the loss of participants to follow-up (also known as data attrition; Carlin et al., 2008). Missing values are therefore common in longitudinal data sets. Certain subgroups have been found to be more likely to drop out of longitudinal studies, which may introduce bias in effect estimation; these subgroups include males and youths from low socio-economic backgrounds or with high levels of behaviour problems (Wolke et al., 2009).

The extent to which longitudinal data are biased depends on the causes underlying participant drop-out (Graham, 2009). Three situations can be distinguished: (1) data missing completely at random (MCAR), (2) data missing at random (MAR) and (3) data missing not at random (MNAR). Rubin (1976) gave the following definitions: MCAR is present if missingness depends neither on the observed nor on the unobserved data; MAR is present if missingness depends on the observed data but not on the unobserved data (unobserved due to loss to follow-up); MNAR is present if missingness depends on the unobserved data. If data are not missing completely at random, results of a complete case analysis (an analysis restricted to the observed data) may be biased (Sterne et al., 2009). To address this issue, imputation of the missing data is advised (Sterne et al., 2009).
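The three mechanisms can be illustrated with a small simulation. The following is a minimal sketch in Python, not part of the original analysis: the variable names, the drop-out probabilities and the function `simulate` are all assumptions chosen for illustration. An always-observed covariate `x` predicts an outcome `y`, and `y` is deleted under each mechanism in turn.

```python
import random

def simulate(mechanism, n=20000, seed=1):
    """Return (mean of y in the full data, mean of y among complete cases).

    Hypothetical setup: x is always observed, y = x + noise may be missing.
    """
    rng = random.Random(seed)
    full, observed = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = x + rng.gauss(0, 1)
        if mechanism == "MCAR":
            p_missing = 0.3                      # unrelated to any data
        elif mechanism == "MAR":
            p_missing = 0.6 if x > 0 else 0.1    # depends on observed x only
        elif mechanism == "MNAR":
            p_missing = 0.6 if y > 0 else 0.1    # depends on the unobserved y
        else:
            raise ValueError(mechanism)
        full.append(y)
        if rng.random() >= p_missing:            # y is observed for this case
            observed.append(y)
    return sum(full) / len(full), sum(observed) / len(observed)

for mechanism in ("MCAR", "MAR", "MNAR"):
    true_mean, cc_mean = simulate(mechanism)
    print(mechanism, round(true_mean, 3), round(cc_mean, 3))
```

Under these assumed settings the complete-case mean stays close to the full-data mean only for MCAR; under MAR and MNAR it is pulled downwards because cases with high values are lost more often.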

7.1.2.1 Characteristics of the Multiple Imputation by Chained Equation model

Various papers have compared data imputation methods to inform researchers about the preferred ways to deal with missing data. Stuart and colleagues (2009) compared multiple imputation, single imputation and maximum likelihood approaches. They reported distinct advantages of multiple imputation over the other two techniques: single imputation does not sufficiently account for uncertainty in the missing values, and maximum likelihood approaches are difficult to implement in complex models (Stuart et al., 2009). Shrive and colleagues (2006) compared six imputation methods (multiple imputation, single regression, individual mean, overall mean, participant’s preceding response and random selection of a value). They reported that multiple imputation provided the most valid results both in data with 10% missing values and in data with 30% missing values. More specifically, Engels and Diehr (2004) reported that multiple imputation based on the variables assessed in a participant before their records became incomplete, plus any non-missing data available for them after the missing value occurred, was superior to multiple imputation based only on the participant’s values before the missing value occurred, because the later data provided additional information for the imputation. Therefore, I conducted multiple imputation, whereby the imputation model included participants’ information from before and after the missing value.

Multiple imputation was based on the Multiple Imputation by Chained Equation (MICE) approach (Van Buuren et al., 1999) using the ice command (Royston, 2009; Royston, 2007; Royston, 2004) in Stata 11 for Windows. MICE is a flexible approach to dealing with missing data which has been implemented in available software packages such as Stata (Royston, 2004). MICE allows for uncertainty in missing data by creating multiple data sets in which the missing values are replaced by imputed values and by appropriately combining the results from each data set. Sterne and colleagues (2009) stated that the process of multiple imputation consists of two separate stages:

Stage 1: Creation of multiple data sets in which the missing values are replaced by imputed values. The imputed values are based on Bayesian inference. Under this approach, the plausibility of a hypothesis (in our case, the values of the missing data items) is determined by the observed evidence (in our case, the values of the observed data); the resulting distribution is called the posterior distribution of the hypothesis. Therefore, for each variable containing missing data, values are imputed using an equation. This is equivalent to a regression model in which the variable with missing values serves as the outcome variable and the observed data of other variables serve as predictor variables. More specifically, missing values of a variable are imputed using information from the observed values of this variable as well as from other observed variables, which the researcher specifies because they are likely to be associated with the variable with missing values. The imputation procedure has to be a multiple, iterative process; otherwise it would fail to add appropriate variability to the imputed values and so would not fully account for the uncertainty in predicting missing values. This is important because it is never possible to impute the true values of the missing data.
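The iterative procedure described in Stage 1 can be sketched for two variables as follows. This is a simplified illustration in Python, not the ice implementation: the function names (`fit_line`, `impute_once`, `mice`) are hypothetical, each variable is imputed from a simple linear regression on the other, and, unlike a full MICE procedure, the regression coefficients themselves are not drawn from their posterior distribution. Only random residual noise is added to the predictions.

```python
import random

def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (rss / (n - 2)) ** 0.5

def impute_once(rows, cycles=10, rng=random):
    """Return one completed copy of rows, a list of [a, b] with None = missing."""
    data = [list(r) for r in rows]
    miss = [[v is None for v in r] for r in rows]
    # Starting values: random draws from the observed values of each variable.
    for j in (0, 1):
        observed = [r[j] for r in rows if r[j] is not None]
        for i, r in enumerate(data):
            if miss[i][j]:
                r[j] = rng.choice(observed)
    # Chained equations: re-impute each variable in turn from the other one.
    for _ in range(cycles):
        for j in (0, 1):
            k = 1 - j
            train = [(r[k], r[j]) for i, r in enumerate(data) if not miss[i][j]]
            a, b, sd = fit_line([t[0] for t in train], [t[1] for t in train])
            for i, r in enumerate(data):
                if miss[i][j]:
                    # Prediction plus noise, so imputed values keep variability.
                    r[j] = a + b * r[k] + rng.gauss(0, sd)
    return data

def mice(rows, m=5, seed=0):
    """Create m completed data sets from the same incomplete rows."""
    rng = random.Random(seed)
    return [impute_once(rows, rng=rng) for _ in range(m)]
```

Each call to `impute_once` yields a different completed data set, which is what allows the later analysis to reflect the uncertainty about the missing values.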

Stage 2: Standard analyses are run on each data set and the results are combined according to Little and Rubin’s theory to obtain the final estimates and their standard errors (Little & Rubin, 2002; Rubin, 1987). This theory is based on a set of rules called Rubin’s rules, which calculate a combined variance-covariance matrix incorporating within-imputation variability (reflecting the uncertainty about the results from each imputed data set) and between-imputation variability (reflecting the uncertainty due to missing values; White et al., 2011).
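For a single parameter, Rubin’s rules reduce to a short calculation: the pooled estimate is the mean of the m estimates, and the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. The sketch below shows this scalar case in Python; the function name `pool` and its inputs are illustrative assumptions, and the matrix version for several parameters is not shown.

```python
from statistics import mean, variance

def pool(estimates, variances):
    """Pool one parameter across m imputed data sets using Rubin's rules.

    estimates: the m point estimates, one per imputed data set
    variances: the m squared standard errors, one per imputed data set
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    q_bar = mean(estimates)           # pooled point estimate
    within = mean(variances)          # within-imputation variance
    between = variance(estimates)     # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, total ** 0.5

# Hypothetical results from three imputed data sets.
estimate, se = pool([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what makes the pooled standard error larger than the standard error from any single imputed data set when the imputations disagree.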

It has been argued that it is statistically impossible to prove that missing data are MAR (Sterne et al., 2009). Therefore, to avoid bias in imputed data some authors advocate including auxiliary variables in the imputation equation (Graham, 2009).

Auxiliary variables are variables which are included in the imputation equation to provide additional information whilst not being included in any subsequent statistical analyses of the imputed data (that is, the statistical analyses that will be conducted to answer research questions, once the imputation process has been completed).

MICE is based on creating multiple imputed data sets, whose estimates are then combined; a widely debated issue is the number of imputed data sets (m > 1) to be created. Early work advocated that m = 10 imputed data sets would be sufficient to achieve reliable results in the imputed data (Schafer & Graham, 2002). However, more recently it has been advocated that the number of imputed data sets should be based on the percentage of missing values. More specifically, with 40% of missing values approximately m = 40 imputed data sets should be created to achieve reliable imputed data and sufficient power (Graham, 2009; White et al., 2011). However, it is sufficient to take into account the percentage of missingness in the main variables of interest; variables which are used as confounders in the later analyses do not have to be considered (Royston, 2004).
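The rule of thumb above can be written as a small helper. This is a sketch in Python under stated assumptions: the function name `suggested_m` is hypothetical, missing values are coded as `None`, and where several main variables are supplied the one with the highest percentage of missing values is used, which is one possible reading of the rule rather than something the cited sources prescribe.

```python
def suggested_m(main_variables):
    """Suggest a number of imputed data sets from the percentage of missing values.

    main_variables: dict mapping a variable name to its list of values,
    with None marking a missing value. Only the main variables of interest
    should be passed in, not confounders.
    """
    worst_percent = 0
    for values in main_variables.values():
        n_missing = sum(v is None for v in values)
        # Integer ceiling of the percentage missing, avoiding float rounding.
        percent = -(-100 * n_missing // len(values))
        worst_percent = max(worst_percent, percent)
    return max(2, worst_percent)      # multiple imputation needs m > 1

# Hypothetical data: 4 of 10 outcome values are missing, so m = 40.
m = suggested_m({
    "outcome": [1, None, 3, None, 5, 6, None, 8, None, 10],
    "exposure": [1, 2, 3, 4, 5, 6, 7, 8, 9, None],
})
```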

With regard to outcome variables, two issues should be kept in mind. First, outcome variables need to be included in the imputation equation because of the information they provide for imputing missing values (Sterne et al., 2009; White et al., 2011). Second, when analysing imputed data, it has been advocated that imputed predictor, mediator, moderator and confounding variables should be analysed together with unimputed outcome variables, that is, outcome data from complete cases only (White et al., 2011). It is important not to include imputed outcome variables in the analyses as they might provide biased estimates (White et al., 2011).

In Stata, MICE was first conducted by using the mvis command for multivariate imputations (Royston, 2004), which was later replaced by the ice command (Royston, 2009; Royston, 2007; White et al., 2011). To run the ice command in Stata, the command needs to be adjusted according to the variables used in the imputation equation. Different regression commands are used to impute different types of variables: to impute normally distributed continuous variables a linear regression model is used, to impute binary variables a logistic regression model is used, to impute unordered categorical variables a multinomial logistic regression model is used, and ordered categorical variables can be imputed with an ordered logistic regression model (White et al., 2011). Which type of regression model should be used for which variable can be specified with the cmd() option of the ice command.

When the imputation equation includes continuous variables that are not normally distributed because of the concept they measure (e.g., financial difficulties, where one would expect most people to report none), it is desirable for the imputed continuous variable to follow the same non-normal distribution as the complete case-based continuous variable. This is achieved with the match() option of the ice command. When planning to test interaction effects in the imputed data, interaction terms need to be included in the imputation equation to ensure that any possible interaction effects are not lost after imputing the data. This is best done by creating interaction terms and including these in the imputation command; furthermore, if any of the variables forming the interaction term are categorical, dummy variables need to be created. The interaction terms then need to be specified in the ice command by using the passive() option. In addition, it is necessary to specify that the dummy variables are derived from the categorical variable and therefore do not need to be imputed separately. This is specified with the substitute() option of the ice command. To avoid collinearity issues during the imputation process, which occur if variables are highly associated with each other, it is possible to specify which variables are used to impute another variable; this is specified by using the eq() option of the ice command (Royston, 2009; Royston, 2007; Royston, 2004).
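The match() option of ice requests predictive mean matching, in which each missing value is replaced by an observed value borrowed from a case with a similar predicted value. The sketch below illustrates the idea in Python and is not the ice implementation: the function name `pmm_impute`, the single predictor, the choice among the `k` nearest donors and the example values are all assumptions made for illustration, and the regression coefficients are not drawn from their posterior distribution as a full procedure would do.

```python
import random

def pmm_impute(x_obs, y_obs, x_mis, k=3, seed=0):
    """Impute y for cases with predictor values x_mis by predictive mean matching.

    x_obs, y_obs: predictor and outcome for cases where y is observed
    x_mis: predictor values for cases where y is missing
    k: number of closest donors to choose from (an assumption of this sketch)
    """
    rng = random.Random(seed)
    n = len(x_obs)
    mx, my = sum(x_obs) / n, sum(y_obs) / n
    sxx = sum((x - mx) ** 2 for x in x_obs)
    slope = sum((x - mx) * (y - my) for x, y in zip(x_obs, y_obs)) / sxx
    intercept = my - slope * mx
    predicted_obs = [intercept + slope * x for x in x_obs]
    imputed = []
    for x in x_mis:
        target = intercept + slope * x
        # Donors: observed cases whose predicted value is closest to the target.
        donors = sorted(range(n), key=lambda i: abs(predicted_obs[i] - target))[:k]
        # Borrow an actually observed value, so the original scale is preserved.
        imputed.append(y_obs[rng.choice(donors)])
    return imputed

# Hypothetical skewed score: most observed values are zero.
x_obs = [0, 1, 2, 3, 4, 5, 6, 7]
y_obs = [0, 0, 0, 0, 1, 0, 3, 9]
values = pmm_impute(x_obs, y_obs, [1.5, 6.5])
```

Because every imputed value is copied from an observed case, a skewed variable with many zeros cannot receive impossible values such as negative scores, which a plain linear regression imputation could produce.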
