Missing values for dependent and independent variables can cause biased estimates, biased standard errors and inefficiency. Missing data mechanisms can be classified as Missing Completely At Random (MCAR), Missing At Random (MAR) or Missing Not At Random (MNAR). If the probability that data are missing does not depend on the values of the observed or missing data, then the data are MCAR. In other words, data are MCAR when there is no systematic reason why data are missing, other than random chance. In contrast, MAR is much more common, and occurs when the probability of being missing depends only on observed data. For example, if data on SES were more likely to be missing in men than in women, and this were the only reason (other than chance) for them to be missing, then data on SES would be MAR, provided information on gender was also available in the dataset. Finally, MNAR is the situation in which the probability of being missing depends on the missing values themselves, even after the observed data have been taken into account. For example, if data on SES were missing depending on the level of SES itself (so that those of lower SES were less likely to complete the responses to these questions), then these data would be MNAR, as missingness would depend on a factor that was unobserved. Figure 4.6 illustrates the different missing data mechanisms.
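The three mechanisms can be illustrated with a small simulation. The sketch below is not taken from the thesis data: the cohort, the SES score, the 0.5-unit difference between men and women and all missingness probabilities are assumptions chosen purely for illustration.

```python
import random
import statistics

random.seed(0)

# Hypothetical cohort: gender is always recorded; SES is a standardised score
# that is, by assumption, 0.5 units higher on average in men.
people = []
for _ in range(20000):
    male = random.random() < 0.5
    ses = random.gauss(0.5 if male else 0.0, 1.0)
    people.append((male, ses))

def keep_observed(people, p_missing):
    """Drop each person's SES with probability p_missing(male, ses)."""
    return [(male, ses) for male, ses in people
            if random.random() >= p_missing(male, ses)]

def mean_ses(rows):
    """Mean SES over a list of (male, ses) rows."""
    return statistics.mean(ses for _, ses in rows)

# MCAR: the same probability of missingness for everyone.
mcar = keep_observed(people, lambda male, ses: 0.3)
# MAR: missingness depends only on gender, which is observed.
mar = keep_observed(people, lambda male, ses: 0.5 if male else 0.1)
# MNAR: missingness depends on the SES value itself.
mnar = keep_observed(people, lambda male, ses: 0.5 if ses < 0 else 0.1)

true_mean = mean_ses(people)
true_mean_men = mean_ses([r for r in people if r[0]])
mar_mean_men = mean_ses([r for r in mar if r[0]])

print(f"true mean SES:           {true_mean:.3f}")
print(f"MCAR observed mean:      {mean_ses(mcar):.3f}")
print(f"MAR observed mean:       {mean_ses(mar):.3f}")
print(f"MAR observed mean, men:  {mar_mean_men:.3f} (all men: {true_mean_men:.3f})")
print(f"MNAR observed mean:      {mean_ses(mnar):.3f}")
```

In this setup the observed mean under MCAR stays close to the truth. Under MAR the crude observed mean is pulled towards the women's mean, but the mean among men is recovered once gender is conditioned on. Under MNAR the observed mean is too high and no observed variable can correct it.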
Figure 4.6: Missing data mechanisms
There are ways to assess whether data are MAR rather than MCAR (for example, comparing the characteristics of individuals with observed data and individuals with missing data). However, there is no statistical way to test whether the data are MNAR rather than MAR, because the distinction depends on the values that were not observed, so it is necessary to make a judgement based on knowledge of the situation.
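As an illustration of such a comparison, the sketch below tests whether the proportion with missing SES differs between men and women, using a two-proportion z-test. The counts are hypothetical, and the helper name `two_proportion_z` is introduced here only for the example.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: equal proportions in two independent groups."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical counts: SES is missing for 120 of 400 men and 60 of 400 women.
z = two_proportion_z(120, 400, 60, 400)
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value, normal approximation

print(f"z = {z:.2f}, p = {p_value:.2g}")
```

A small p-value here is evidence against MCAR, since missingness is associated with an observed characteristic. It says nothing about whether the data are MAR or MNAR.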
There are two methods used in this thesis to deal with the issue of missing data: Complete-Case analysis (CC) and Multiple Imputation (MI).
[Figure 4.6 panels: MCAR, MAR, MNAR. Each panel relates Missingness to the observed values (Xobs) and the unobserved values (Xmiss).]
Complete-case analysis
Complete-case analysis is a method whereby only individuals with complete data are included in the analysis, and individuals with missing values are excluded. This method may be employed when few data are missing, since omitting cases would then not severely diminish the analysis population and would be unlikely to cause substantial bias. However, note that in multivariable analysis, if an individual has missing information on the dependent variable or on any of the independent variables being considered, then their entire record is necessarily excluded from the analysis. Thus, small amounts of missing data across a range of variables can lead to a large proportion of individuals being excluded overall.
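The arithmetic behind this last point can be sketched as follows. The calculation assumes, purely for illustration, that each variable is missing independently with the same probability; real data rarely satisfy this exactly.

```python
def complete_case_fraction(p_missing_per_variable, n_variables):
    """Expected fraction of complete records if each variable is missing
    independently with the same probability (an illustrative assumption)."""
    return (1 - p_missing_per_variable) ** n_variables

# Hypothetical model with 10 variables, each missing for 5% of individuals.
fraction = complete_case_fraction(0.05, 10)
print(f"complete cases: {fraction:.1%}, excluded: {1 - fraction:.1%}")
```

With only 5% missing per variable, about 40% of individuals would be excluded from a model that uses all ten variables.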
Multiple imputation
Multiple imputation by chained equations (MICE) is a method whereby missing data are completed using an iterative process [536]. Each missing observation is assigned (imputed) an initial value using some arbitrary method, for example the mean for that variable. These values are then replaced in turn using univariate imputation models, i.e. regression of the observed values of that variable on all other observed and currently imputed variables. This model may be improved by including auxiliary variables, which are variables that are not part of the analysis model of interest but that predict either the incomplete variables or the probability that they are missing. The process is iterative and yields a completed dataset with no missing data; it is then repeated to produce a number of imputed datasets (say 𝑚 datasets). The main analysis is performed on each of the imputed datasets, resulting in 𝑚 estimates of the RR/OR/PR. These are then combined into a single overall estimate of the association being studied using Rubin’s rules [537].
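The pooling step can be sketched as below. This is a minimal illustration of Rubin's rules for a single parameter, not the software used in this thesis; the function name `pool_rubin` and the input values are hypothetical. For ratio measures such as an OR, pooling is usually done on the log scale, which is what the example assumes.

```python
import math
import statistics

def pool_rubin(estimates, variances):
    """Combine m per-dataset estimates and their variances with Rubin's rules.

    Returns the pooled estimate and its standard error.
    """
    m = len(estimates)
    q_bar = statistics.mean(estimates)      # pooled point estimate
    within = statistics.mean(variances)     # average within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)

# Hypothetical log odds ratios from m = 5 imputed datasets, each with SE 0.10.
log_ors = [0.40, 0.50, 0.45, 0.55, 0.60]
variances = [0.10 ** 2] * 5

pooled_log_or, pooled_se = pool_rubin(log_ors, variances)
print(f"pooled OR = {math.exp(pooled_log_or):.2f}, SE(log OR) = {pooled_se:.3f}")
```

The pooled standard error exceeds the per-dataset value of 0.10 because the between-imputation variance carries the extra uncertainty due to the missing data.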
Before undertaking MI, some details require consideration. First, the number of imputations must be chosen. The literature suggests that 𝑚 should be at least equal to the percentage of incomplete cases (so, if 20% of cases in a dataset have missing data, then 20 imputed datasets should be created) [536;538]. Monte Carlo error (MCE) can be used to determine an adequate number of imputations to obtain stable results [536]. MCE reflects how much the results would vary if the imputation procedure were repeated, which arises from using a finite number of imputations. The MCEs for the estimates, test statistics and p values need to be sufficiently small in order to make reasonably reliable inferences following the imputation of the missing values. The literature suggests that the MCE of each estimate should be <10% of the corresponding standard error and that the MCEs of the test statistics should be approximately 0.1 [536]. Secondly, a sufficient number of the initial iterations should be discarded so that the estimates are unaffected by the arbitrary method used for the first imputation and the process has converged to produce stable estimates. This is called the burn-in period. It is also important to take into account the potential presence of perfect prediction, which arises when there is a level of a categorical variable for which the outcome is certain to occur or not occur. This calls for a few low-weighted observations to be added to the dataset so that no prediction is perfect (augmented regression), in order to avoid biased results.
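The MCE check on a pooled estimate can be sketched as follows. The sketch assumes the commonly used formula in which the MCE of the pooled point estimate is the square root of the between-imputation variance divided by 𝑚; the function name `mce_check` and the input values are hypothetical.

```python
import math
import statistics

def mce_check(estimates, variances, max_ratio=0.10):
    """Return (mce, pooled_se, ok), where ok is True when the MCE of the
    pooled estimate is below max_ratio times its standard error."""
    m = len(estimates)
    between = statistics.variance(estimates)   # between-imputation variance
    within = statistics.mean(variances)        # within-imputation variance
    pooled_se = math.sqrt(within + (1 + 1 / m) * between)
    mce = math.sqrt(between / m)               # assumed formula: sqrt(B / m)
    return mce, pooled_se, mce < max_ratio * pooled_se

# Hypothetical log odds ratios from m = 5 imputed datasets, each with SE 0.10.
log_ors = [0.40, 0.50, 0.45, 0.55, 0.60]
variances = [0.10 ** 2] * 5

mce, pooled_se, ok = mce_check(log_ors, variances)
print(f"MCE = {mce:.4f}, SE = {pooled_se:.4f}, ratio = {mce / pooled_se:.2f}, adequate: {ok}")
```

In this hypothetical case the MCE is about 27% of the standard error, so five imputations would not meet the 10% guideline and 𝑚 would need to be increased.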
It is possible to account for the bias caused by data being MNAR using MI methods with weighted Rubin’s rules. However, this requires knowledge of the nature of the missing data and the reasons why they are missing. As this is usually difficult to ascertain, one needs to make assumptions, which are often unverifiable and add uncertainty to the results.
Choice of method for handling missing data
MI is more efficient than CC because all cases are included, so the estimates of association are likely to be more precise. This is particularly true when a large proportion of data are missing [539]. However, when the missing data mechanism is MCAR, CC has been shown to give unbiased estimates of association and is a much simpler method to use. Under the MAR mechanism, MI has negligible bias, whereas CC is generally biased because it ignores systematically missing data. However, there are other situations in which CC has negligible bias, such as when the variables in the model are uncorrelated or only weakly correlated. When the missing values are in the response variable, CC would generally be used, because including these cases adds no value to the analysis unless auxiliary variables are available for imputation [540;541]. It is often unclear which of MI or CC is more appropriate for the missing data mechanism [539], so it may be helpful to evaluate the results of both approaches.