Amelia II performs multiple imputation, a general-purpose approach to data with missing values [1]. Multiple imputation reduces bias and increases efficiency compared to listwise deletion. Moreover, ad hoc methods of imputation, such as mean imputation, can lead to serious bias in variances and covariances. However, because of the technical nature of the algorithms involved, creating multiple imputations can be cumbersome. Amelia provides a simple way to create and implement an imputation model, generate imputed datasets, and check the model's fit using diagnostics. Furthermore, the expectation-maximization with bootstrapping (EMB) algorithm included in Amelia II can impute many variables, with many observations, in a short amount of time. The simplicity and power of the EMB algorithm make it possible to write Amelia II so that it virtually never crashes, which is unique among existing multiple imputation software, and so that it runs much faster than the alternatives. Additionally, Amelia II has features that make valid and much more accurate imputations for cross-sectional, time-series, and time-series-cross-section data, and it allows the incorporation of observation-level and data-matrix-cell-level prior information. It also provides diagnostic functions that help in checking the validity of the imputation model. The Amelia II software implements the ideas developed by Honaker and King [2].
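The bias that mean imputation introduces into variances and covariances can be illustrated with a small simulation. The following is a minimal sketch in Python with NumPy, not part of Amelia II; the simulated data, the 40% missingness rate, and all variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (hypothetical): two correlated standard-normal variables.
n = 10_000
true_cov = np.array([[1.0, 0.8], [0.8, 1.0]])
data = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=n)

# Remove 40% of the second variable completely at random.
missing = rng.random(n) < 0.4
y_obs = data[~missing, 1]

# Mean imputation: every missing cell receives the observed mean.
y_imputed = data[:, 1].copy()
y_imputed[missing] = y_obs.mean()

var_observed = y_obs.var(ddof=1)
var_mean_imputed = y_imputed.var(ddof=1)
cov_complete = np.cov(data[:, 0], data[:, 1])[0, 1]
cov_mean_imputed = np.cov(data[:, 0], y_imputed)[0, 1]

print("variance:   observed cases vs. mean-imputed:", var_observed, var_mean_imputed)
print("covariance: complete data vs. mean-imputed: ", cov_complete, cov_mean_imputed)
```

Because every imputed cell sits exactly at the mean, the filled-in values add no spread and no co-movement with the other variable, so both the variance and the covariance are pulled toward zero.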
2.3.1.1 How Amelia Works
Multiple imputation involves creating m completed data sets by imputing m values for each missing cell in the data matrix. Across the completed data sets, the observed values are the same, whereas the missing values are filled in with a distribution of imputations that reflects the uncertainty about the missing data. After imputation with Amelia II's EMB algorithm, any statistical method can be applied to each of the m data sets as if there had been no missing values, and a simple procedure is used to combine the results. Normally, imputation is done once, and the m imputed data sets can then be analyzed as many times and for as many purposes as wished. The advantage of Amelia II is that it combines the comparative speed and ease of use of the EMB algorithm with the power of multiple imputation. Unless the rate of missingness is very high, m = 5 (the program default) is probably adequate.
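The combining step is not spelled out in the text. A common choice is Rubin's rules: average the m point estimates, and add the average within-imputation variance to an inflated between-imputation variance. The sketch below assumes that choice; the function name `combine_estimates` and the example numbers are hypothetical.

```python
import numpy as np

def combine_estimates(estimates, std_errors):
    """Combine one scalar estimate from each of m imputed data sets (Rubin's rules)."""
    q = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    m = q.size
    q_bar = q.mean()                  # combined point estimate
    within = np.mean(se ** 2)         # average within-imputation variance
    between = q.var(ddof=1)           # variance of the estimates across imputations
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, np.sqrt(total)

# Hypothetical results of the same regression coefficient from m = 5 imputed data sets.
coef, coef_se = combine_estimates([1.0, 1.2, 0.9, 1.1, 1.0], [0.2, 0.2, 0.2, 0.2, 0.2])
print(coef, coef_se)
```

The combined standard error is larger than any single within-imputation standard error whenever the estimates differ across data sets, which is how the uncertainty due to missing data enters the final inference.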
2.3.1.2 Assumptions in Amelia
The imputation model in Amelia II assumes that the complete data (both observed and unobserved) follow a multivariate normal distribution. If the (n × k) dataset is denoted D (with observed part Dobs and unobserved part Dmis), then this assumption is
$$D \sim \mathcal{N}_k(\mu, \Sigma) \tag{1}$$
stating that D has a multivariate normal distribution with mean vector µ and covariance matrix Σ. The multivariate normal distribution is a crude approximation of the true distribution of the data. It has been shown that this model works as well as other, more complicated models even in the face of categorical or mixed data [3, 4]. Furthermore,
transformations of many types of variables can often make this normality assumption more plausible (the available transformations include ordinal, nominal, natural log, square root, and logistic). Essentially, the problem of imputation is that only Dobs is observed, not the entirety of D. In order to gain traction, the usual assumption in multiple imputation is made: that the data are missing at random (MAR). This assumption means that the pattern of missingness depends only on the observed data Dobs and not on the unobserved data Dmis. Let M be the missingness matrix, with cells mij = 1 if dij ∈ Dmis and mij = 0 otherwise. Put simply, M is a matrix that indicates whether or not each cell in the data is missing. With this, the MAR assumption can be defined as:
$$p(M \mid D) = p(M \mid D_{obs}) \tag{2}$$
Importantly, MAR includes the case in which missing values are created randomly, but it also includes many more sophisticated missingness models. When missingness does not depend on the data at all, the data are missing completely at random (MCAR). Amelia requires both the multivariate normality and the MAR assumption (or the simpler special case of MCAR). Additionally, the MAR assumption can be made more plausible by including more variables in the imputation dataset D than just those eventually envisioned for use in the analysis model.
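The missingness matrix M and the difference between MCAR and MAR can be made concrete with simulated data. The sketch below is illustrative only: the data, the missingness probabilities, and the helper name `missingness_matrix` are assumptions, and missing cells are encoded as NaN.

```python
import numpy as np

def missingness_matrix(D):
    """Return M with m_ij = 1 where cell (i, j) of D is missing (NaN), else 0."""
    return np.isnan(D).astype(int)

rng = np.random.default_rng(1)

# Hypothetical complete data with n = 1000 rows and k = 2 columns.
n = 1000
D_complete = rng.normal(size=(n, 2))

# MCAR: column 1 goes missing with a fixed probability, unrelated to the data.
mcar_mask = rng.random(n) < 0.3
D_mcar = D_complete.copy()
D_mcar[mcar_mask, 1] = np.nan

# MAR: column 1 goes missing with a probability that depends only on
# column 0, which is always observed.
p_missing = 1.0 / (1.0 + np.exp(-2.0 * D_complete[:, 0]))
mar_mask = rng.random(n) < p_missing
D_mar = D_complete.copy()
D_mar[mar_mask, 1] = np.nan

M_mcar = missingness_matrix(D_mcar)
M_mar = missingness_matrix(D_mar)

# Under MAR the always-observed column differs between rows with and without
# missing cells; under MCAR it does not differ systematically.
print("MAR:  mean of column 0 by missingness:",
      D_complete[mar_mask, 0].mean(), D_complete[~mar_mask, 0].mean())
print("MCAR: mean of column 0 by missingness:",
      D_complete[mcar_mask, 0].mean(), D_complete[~mcar_mask, 0].mean())
```

In the MAR case the missingness can be predicted from the observed column, which is why adding informative variables to the imputation dataset makes the MAR assumption more plausible.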
2.3.1.3 The Amelia Algorithm
Multiple imputation is concerned with the complete-data parameters, θ = (µ, Σ). When writing down a model of the data, the observed data is actually Dobs and M, the missingness matrix.
Thus, the likelihood of the observed data is p(Dobs, M|θ). Using the MAR assumption, this can
be broken up as:
$$p(D_{obs}, M \mid \theta) = p(M \mid D_{obs})\, p(D_{obs} \mid \theta) \tag{3}$$
Because inference concerns only the complete-data parameters, the likelihood can be written as:
$$L(\theta \mid D_{obs}) \propto p(D_{obs} \mid \theta) \tag{4}$$
which can be rewritten using the law of iterated expectations as:
$$p(D_{obs} \mid \theta) = \int p(D \mid \theta)\, dD_{mis} \tag{5}$$
With this likelihood and a flat prior on θ, the posterior is
$$p(\theta \mid D_{obs}) \propto p(D_{obs} \mid \theta) \tag{6}$$
The main computational difficulty in the analysis of incomplete data is taking draws from this posterior. The EM algorithm is a computationally simple approach to finding the mode of the posterior [5] (Figure 2). Amelia II's EMB algorithm combines the classic EM algorithm with a bootstrap approach to take draws from this posterior. For each draw, the data are bootstrapped to simulate estimation uncertainty, and the EM algorithm is then run to find the mode of the posterior for the bootstrapped data, which gives fundamental uncertainty as well [1].
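The bootstrap-then-EM idea can be sketched in a few dozen lines of NumPy. This is a simplified illustration under the multivariate normal and MAR assumptions, not Amelia II's implementation: it omits priors, transformations, and the time-series features, and the function names `conditional`, `em_mvn`, and `emb_impute` are hypothetical. Missing cells are encoded as NaN.

```python
import numpy as np

def conditional(mu, sigma, x, mis):
    """Mean and covariance of x[mis] given x[~mis] under N(mu, sigma)."""
    obs = ~mis
    if not obs.any():                      # nothing observed in this row
        return mu, sigma
    # Regression coefficients of the missing cells on the observed cells.
    B = np.linalg.solve(sigma[np.ix_(obs, obs)], sigma[np.ix_(obs, mis)])
    mean = mu[mis] + (x[obs] - mu[obs]) @ B
    cov = sigma[np.ix_(mis, mis)] - sigma[np.ix_(mis, obs)] @ B
    return mean, (cov + cov.T) / 2.0       # symmetrize against round-off

def em_mvn(X, n_iter=100, tol=1e-6):
    """EM estimate of (mu, Sigma) for multivariate normal data containing NaNs."""
    n, k = X.shape
    miss = np.isnan(X)
    rows = np.flatnonzero(miss.any(axis=1))
    mu = np.nanmean(X, axis=0)
    sigma = np.diag(np.nanvar(X, axis=0))
    for _ in range(n_iter):
        X_hat = X.copy()
        C = np.zeros((k, k))               # summed conditional covariances
        for i in rows:                     # E-step: expected sufficient statistics
            mean, cov = conditional(mu, sigma, X[i], miss[i])
            X_hat[i, miss[i]] = mean
            C[np.ix_(miss[i], miss[i])] += cov
        mu_new = X_hat.mean(axis=0)        # M-step: update the parameters
        dev = X_hat - mu_new
        sigma_new = (dev.T @ dev + C) / n
        change = max(np.abs(mu_new - mu).max(), np.abs(sigma_new - sigma).max())
        mu, sigma = mu_new, sigma_new
        if change < tol:
            break
    return mu, sigma

def emb_impute(X, m=5, seed=0):
    """Return m completed copies of X using bootstrap + EM + conditional draws."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    miss = np.isnan(X)
    completed = []
    for _ in range(m):
        boot = X[rng.integers(0, n, size=n)]   # bootstrap rows with replacement
        mu, sigma = em_mvn(boot)               # one draw of theta = (mu, Sigma)
        X_imp = X.copy()
        for i in np.flatnonzero(miss.any(axis=1)):
            mean, cov = conditional(mu, sigma, X[i], miss[i])
            X_imp[i, miss[i]] = rng.multivariate_normal(mean, cov)
        completed.append(X_imp)
    return completed
```

Each pass through the loop in `emb_impute` yields one completed data set: the bootstrap supplies the estimation uncertainty in θ, and the random draw from the conditional normal distribution supplies the fundamental uncertainty in the imputed cells.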
Once draws from the posterior of the complete-data parameters are available, imputations are made by drawing values of Dmis from its distribution conditional on Dobs and the draws of θ, which is a linear
Figure 2: Schematic of Multiple Imputation Approach with the EMB Algorithm, Adapted from Honaker et al. 2011