2. CHAPTER II METHODOLOGY
2.12. Proposed solutions
2.12.3. Development of 5S for CRP3 and CRL4: lack of order and cleanliness
In this chapter, we conducted some basic EDA to explore the missing data pattern of the FNES data, and imputed the missing data by applying the imputation methods introduced in previous chapters, with a focus on Bayesian Multiple Imputation (MI). Compared with applying imputation methods to the simple SURF data, dealing with the real FNES data raised several challenges.
The first challenge is that the FNES has far more variables and observations than the simple SURF. This makes it very difficult to investigate the missing data patterns by inspecting the datasets alone. Hence, we carried out EDA to investigate the missing data patterns, plotting the response rate of each variable on bar charts.
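The response-rate summary described above can be sketched as follows. This is a minimal illustration, not the code used in the project: the variable names (`age_group`, `income_band`, `region`) and the toy records are hypothetical, `None` is assumed to mark item nonresponse, and a plain-text bar stands in for the bar charts.

```python
def response_rates(records):
    """Return {variable: share of records with an observed (non-None) value}."""
    if not records:
        return {}
    n = len(records)
    return {
        var: sum(1 for r in records if r.get(var) is not None) / n
        for var in records[0]
    }


def text_bar_chart(rates, width=40):
    """Render response rates as horizontal text bars, lowest rate first."""
    lines = []
    for var, rate in sorted(rates.items(), key=lambda kv: kv[1]):
        bar = "#" * round(rate * width)
        lines.append(f"{var:<12} {bar} {rate:.0%}")
    return "\n".join(lines)


# Hypothetical toy survey records; None marks item nonresponse.
survey = [
    {"age_group": "25-34", "income_band": None, "region": "North"},
    {"age_group": "35-44", "income_band": "mid", "region": "South"},
    {"age_group": None, "income_band": None, "region": "North"},
    {"age_group": "45-54", "income_band": "high", "region": "South"},
]

rates = response_rates(survey)
print(text_bar_chart(rates))
```

Sorting by response rate puts the variables with the most missing data at the top, which is usually where the inspection starts.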
The second challenge is that we impute the FNES missing data under the assumption that the missingness is MAR, that is, related to other variables that have been observed, but this is only an assumption. In previous chapters, we created the MCAR or MAR missingness for the replicate SURF datasets ourselves, so the variables related to the missingness were already known. For the real FNES data, we do not know whether the missingness is MCAR, MAR or NMAR. This means we do not know which variables are related to the missingness, which makes it difficult to construct the best imputation model. Hence, in this chapter, we introduced the univariate comparison method and the logistic regression assessment method to help identify the variables that are related to the missingness.
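One simple reading of a univariate comparison is to tabulate how often the target variable is missing within each level of an observed variable. The sketch below takes that reading as an assumption; it is not necessarily the exact procedure used in this project, and the variable names and records are hypothetical. Clearly different missing rates across levels suggest the missingness is related to that variable, which is evidence against MCAR. Observed data alone cannot separate MAR from NMAR.

```python
from collections import defaultdict


def missing_rate_by_group(records, target, covariate):
    """Share of records with `target` missing, within each level of `covariate`."""
    counts = defaultdict(lambda: [0, 0])  # level -> [n_missing, n_total]
    for r in records:
        level = r[covariate]
        if level is None:
            continue  # skip records where the covariate itself is unobserved
        counts[level][1] += 1
        if r[target] is None:
            counts[level][0] += 1
    return {level: miss / total for level, (miss, total) in counts.items()}


# Hypothetical records: income_band is missing more often in the North.
records = (
    [{"region": "North", "income_band": None}] * 3
    + [{"region": "North", "income_band": "mid"}]
    + [{"region": "South", "income_band": None}]
    + [{"region": "South", "income_band": "low"}] * 3
)

by_region = missing_rate_by_group(records, "income_band", "region")
print(by_region)
```

A logistic regression of the missingness indicator on several observed variables extends the same idea to more than one predictor at a time.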
The third challenge is to adapt our imputation methods to interrelated variables. As described, the difficulty is that once a variable with missing data is imputed, the other variables related to it need to be updated as well; otherwise, the imputation results do not make practical sense. Our solution is to combine the deductive imputation method with other imputation methods to impute missing data in the interrelated variables. However, we have only imputed two variables that are related to each other. How to find an efficient algorithm for a large number of interrelated variables needs further study.
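The idea of pairing a deductive step with another imputation method can be sketched for two interrelated variables. Everything specific here is an assumption made for illustration: the variables `employed` and `hours_band`, the logical rule linking them (not employed implies an hours band of "0"), and the use of a random hot deck as the second method. None of these are the FNES variables or the exact scheme used in this project.

```python
import random


def impute_interrelated(records, rng):
    """Fill `employed` and `hours_band` so that the pair stays consistent."""
    employed_donors = [r["employed"] for r in records if r["employed"] is not None]
    hours_donors = [
        r["hours_band"]
        for r in records
        if r["employed"] == "yes" and r["hours_band"] is not None
    ]
    completed = []
    for r in records:
        r = dict(r)  # copy so the input records are left untouched
        if r["employed"] is None:
            if r["hours_band"] is not None:
                # Deductive step: observed hours determine employment status.
                r["employed"] = "no" if r["hours_band"] == "0" else "yes"
            else:
                # Random hot deck: draw from the observed donors.
                r["employed"] = rng.choice(employed_donors)
        if r["hours_band"] is None:
            if r["employed"] == "no":
                r["hours_band"] = "0"  # deductive step
            else:
                r["hours_band"] = rng.choice(hours_donors)
        completed.append(r)
    return completed


# Hypothetical records with missing values in one or both variables.
data = [
    {"employed": "yes", "hours_band": "30+"},
    {"employed": "no", "hours_band": "0"},
    {"employed": None, "hours_band": "1-29"},
    {"employed": None, "hours_band": None},
    {"employed": "yes", "hours_band": None},
    {"employed": "no", "hours_band": None},
]

completed = impute_interrelated(data, random.Random(0))
for row in completed:
    print(row)
```

The order of the steps matters: the dependent variable is filled only after the variable it depends on is complete, so no imputed pair can contradict the rule.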
The fourth challenge, or improvement, is to utilize the extra information we get by matching the 2007 and 2009 FNES datasets. The matched datasets are very useful for enhancing our imputation methods. However, if they are used only for cold deck imputation, it is as though we are taking the jewel box but throwing away the jewellery within. Hence, we have proposed incorporating the matched dataset in the Bayesian MI scheme. Doing this, we maximize the information we can get from the matched and unmatched parts of both the 2007 and 2009 FNES data.
This chapter also displays the imputation results for the selected FNES variables. As expected, the resampling method applied to the imputed incomplete data and the Bayesian MI have larger variances of estimates than the single imputation method and the EM algorithm.
To sum up, our investigation suggests that Bayesian MI is the best imputation method for the FNES data. This is because it produces estimates similar to those of other imputation methods; it properly propagates the imputation uncertainty; and it is extremely flexible in incorporating extra information from the matched datasets. It is also because we can construct familiar and reliable logistic models using the FNES variables, which might not be the case for other datasets.
Chapter 12
Some final thoughts
This chapter summarizes the previous chapters in this project, and proposes some thoughts on future work and improvements. Specifically, the first section summarizes the main points and findings from previous chapters, and the second section lists things that we have not done, but that could be done and improved in the future.
12.1 Summary of previous chapters
Chapter 2 introduces the three missing data mechanisms (MCAR, MAR, and NMAR). We have shown that missing data cause no bias only if they are MCAR; both MAR and NMAR introduce bias into the estimates. This chapter lays the foundation for our discussion of how to deal with missing data in later chapters.
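The contrast between MCAR and MAR can be illustrated with a small simulation. The population below is hypothetical and is not the SURF data: `x` is always observed, `y` depends on `x`, and the estimate is the mean of the observed `y` values (a complete-case analysis). Under this setup the true mean of `y` is 12.5, while the expected complete-case mean under the MAR mechanism shown is 10 + 5/3, about 11.67.

```python
import random
from statistics import mean

rng = random.Random(2024)
n = 20000

# Hypothetical population: x is always observed, y depends on x.
x = [rng.random() for _ in range(n)]
y = [10 + 5 * xi + rng.gauss(0, 1) for xi in x]

# MCAR: every y has the same 50% chance of being missing.
y_mcar = [yi for yi in y if rng.random() > 0.5]

# MAR: the chance that y is missing equals x, an observed variable.
y_mar = [yi for xi, yi in zip(x, y) if rng.random() > xi]

full_mean = mean(y)
mcar_mean = mean(y_mcar)
mar_mean = mean(y_mar)
print(f"full {full_mean:.2f}  MCAR {mcar_mean:.2f}  MAR {mar_mean:.2f}")
```

The MCAR mean stays close to the full-data mean, whereas the MAR mean is pulled down because records with large `x`, and therefore large `y`, are more likely to be missing.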
Chapter 3 presents the most commonly used data deletion and imputation methods. This chapter also gives an in-depth discussion of the concepts of non-response bias and imputation uncertainty. The main point is that imputation methods are developed to tackle the bias issue when the missingness is MAR, but most of them underestimate the imputation uncertainty because they treat the imputed values as true observed values.
Chapter 4 demonstrates in detail how the various single imputation methods work by applying them to the replicate SURF datasets with an incomplete Income variable. We have shown that imputation methods can reduce bias if they properly incorporate the MAR mechanism, which means the imputation model includes the variables that are related to the missingness. We have also shown that imputation methods with a random sampling mechanism, such as the stochastic regression model and hot deck imputation, perform better than imputation methods without one. However, although some single imputation methods can deal with bias, none of them can reflect the imputation uncertainty.
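The role of the random component can be sketched by comparing deterministic and stochastic regression imputation on simulated data. This is an illustration under stated assumptions, not the SURF analysis: the predictor `age`, the linear model, the normal residuals, and the MAR nonresponse rule are all hypothetical.

```python
import random
from statistics import mean, pvariance


def fit_line(xs, ys):
    """Least-squares intercept, slope and residual standard deviation."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid = [b - (intercept + slope * a) for a, b in zip(xs, ys)]
    sigma = (sum(e ** 2 for e in resid) / (len(xs) - 2)) ** 0.5
    return intercept, slope, sigma


def regression_impute(xs, ys, rng=None):
    """Fill None entries of ys from a line fitted on the complete cases.

    With rng=None the fitted value is used (deterministic regression
    imputation); with an rng, a normal residual is added (stochastic).
    """
    obs = [(a, b) for a, b in zip(xs, ys) if b is not None]
    intercept, slope, sigma = fit_line([a for a, _ in obs], [b for _, b in obs])
    out = []
    for a, b in zip(xs, ys):
        if b is None:
            b = intercept + slope * a
            if rng is not None:
                b += rng.gauss(0, sigma)
        out.append(b)
    return out


rng = random.Random(7)
age = [rng.uniform(20, 60) for _ in range(2000)]
income = [1000 + 50 * a + rng.gauss(0, 400) for a in age]
# MAR nonresponse: older respondents are more likely to withhold income.
observed = [
    b if rng.random() > (a - 20) / 40 * 0.6 else None
    for a, b in zip(age, income)
]

deterministic = regression_impute(age, observed)
stochastic = regression_impute(age, observed, random.Random(11))
print(round(pvariance(deterministic)), round(pvariance(stochastic)))
```

Deterministic imputation places every imputed value exactly on the fitted line, so the completed data are less spread out than the real data. Adding a random residual restores that spread, although a single stochastic imputation still does not reflect the uncertainty about the imputed values themselves.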
Chapter 5 discusses two popular resampling methods (the bootstrap and the jackknife), and applies them to the incomplete replicate SURF data to properly account for the imputation uncertainty. These methods are effective at capturing imputation uncertainty, but they require large samples to achieve the desired results.
Chapter 6 introduces the EM algorithm, which is considered to be one of the best techniques for handling missing data. Our introduction of the EM algorithm covers the case of multivariate missing data problems. Dealing with the multivariate missing data problem is one of EM's advantages over single imputation methods. We go through the EM algorithm because researchers normally use it to find the initial estimates for the Bayesian MI.
Chapter 7 discusses the Bayesian iterative simulation methods that underlie the Bayesian MI. We focus on how to apply the Metropolis-Hastings (MH) algorithm and the Gibbs sampling algorithm to impute missing data, and compare the pros and cons of these two methods. Again, we have extended our introduction of these algorithms to the case of multivariate missing data problems. This chapter also lists a few convergence diagnostic methods, and it lays the foundation for the Bayesian MI we apply to the replicate SURF and the FNES data.
Chapter 8 shows exactly how Bayesian MI works, and how we pool the estimates from multiple imputed datasets to compute the final MI estimates and the variances of the estimates. This chapter also gives mathematical proofs and simulation evidence of why and how improper MI underestimates the variance of the estimate.
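The pooling step can be sketched with Rubin's rules, the standard combining rules for multiple imputation. This assumes that these are the rules the project uses for pooling. The function name `pool_mi` and the numbers in the usage lines are hypothetical.

```python
from statistics import mean, variance


def pool_mi(estimates, within_variances):
    """Pool m point estimates and their variances using Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need m >= 2 estimates with matching variances")
    q_bar = mean(estimates)          # pooled point estimate
    w_bar = mean(within_variances)   # average within-imputation variance
    b = variance(estimates)          # between-imputation variance (divisor m - 1)
    total = w_bar + (1 + 1 / m) * b  # total variance of the pooled estimate
    return q_bar, total


# Hypothetical estimates and variances from m = 3 imputed datasets.
q_bar, total = pool_mi([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(q_bar, round(total, 4))
```

The between-imputation term is what carries the imputation uncertainty. Treating imputed values as if they were observed amounts to reporting only the within-imputation variance, which understates the total.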
Chapter 9 shows how to apply the imputation methods introduced in previous chapters to missing categorical data. Until then, these methods had only been applied to continuous numerical missing data. We show that, although the fundamental concepts of these imputation methods are the same, variations are needed in order to apply them to missing categorical data. This chapter also prepares these imputation methods for use with the FNES data, as all of its variables are categorical.
Chapter 10 simply describes the sample design of the FNES data.
Chapter 11 uses EDA to investigate the missing data pattern of the FNES data. It then introduces the univariate comparison method and the logistic regression assessment method to detect the missing data mechanism and the variables that are related to the missingness. We introduce these detection methods only at this point because the need to detect the missing data mechanism and the variables related to the missingness arises only when we deal with real-life social survey data. Finally, we have applied several imputation methods introduced in previous chapters to a few FNES variables. The results indicate that Bayesian MI produces estimates similar to those of other imputation methods, and variances of estimates similar to those of the bootstrap resampling method. Furthermore, we propose the use of Bayesian MI for the case of partially matched datasets. Bayesian MI maximizes the information we can get from the matched and unmatched datasets in order to find the best imputation values. The Bayesian MI model for partially matched datasets is the new development in this project.