In applied research, it almost never happens that a data set is collected without missing data. No matter how well the process of data collection is planned, not all data will be available at the end. Particularly in lifespan research, people are sometimes tracked over years and one of the
most common reasons for missingness is the drop-out of participants through relocation, illness, or death. However, a variety of reasons for missingness is capable of surprising the researcher at the end of data collection, starting from illegibility of test results to the unwillingness of certain participants to answer certain questions in a questionnaire. The latter reason states a relationship between the missingness and a variable under investigation, whereas the former example gives reason to believe that the two are independent. According to Rubin (1976), mechanisms of missingness are seen as probabilistic models and can be divided in three categories according to assumptions about their correlational structure with variables under investigation. Let Y be an observed variable that contains missing values and X be a fully observed variable. Let R be a dichotomous random variable which indicates the occurrence of missingness in
Y . If the distribution of missingness depends on observed data but not on missing data, the
mechanism of missingness is termed missing at random (MAR). If R is independent of any observed variable, it is missing completely at random (MCAR). The most adverse case is not
missing at random (NMAR), whereby the pattern of missingness depends on the unobserved
missing values themselves.
Under the assumption of MAR processes, SEM Trees can handle missing values in the ob- served variables as well as in the covariates. Furthermore, SEM Trees can serve as a tool for model-based imputation of observed variables.
3.6.1 Missing Values in Observed Variables
McArdle (1994) makes a strong point that “structural equation models do not require all vari- ables to be measured on all individuals under all conditions” (p. 409). A common approach to estimate model parameters under missing data is the Full Information Maximum Likelihood (FIML) approach. The maximum likelihood fit function as presented so far is not able to handle data sets with missing values since a mean vector and covariance matrix cannot be computed. However, there is a reformulation of the same function that is able to obtain ML parameter estimates under missing values (Finkbeiner, 1979).
The FIML fit function is defined without resorting to an aggregated mean vector and a covariance matrix. This allows the calculation of a likelihood for single observations that arise for unique patterns of missing values in a larger data set
−2LLF IM L(x1, . . . , xN|µ, Σ) = N · p · ln (2π) + N X i=1 h ln |Σi| + (xi− µi) TΣ−1 i (xi− µi) i
where Σi is the model-implied covariance matrix with rows and columns deleted according to the pattern of missingness in the i-th observation xi, and likewise µi is the model-implied mean vector with elements deleted according to the respective pattern of missingness.
The implementation of a model with missing data can be integrated into the existing frame- work in an elegant way. The model is decomposed into a set of sub-models according to the patterns of missingness in the data. A pattern of missingness is simply a vector of Boolean values representing the missingness of the according variable. All data rows with the same pattern of missingness are tied together to form subsets of the original data set.
3.6.2 Missing Values in Covariates
Missing values in covariates can be dealt with by introducing only slight modifications to the likelihood ratio statistic for candidate split evaluation and to the traversal function. First, we modify some previous definitions so that we can incorporate missing values:
Definition 67. The symbol Ø represents a missing value2. The set R
Øextends the real numbers
by a missing value Ø, RØ= R∪ {Ø} . A missing-value data set D = (O, C) has elements oij ∈ RØ
and cij ∈ RØ.
For a given observation of a data set (o, c) ∈ D, the traversal function traverses the tree according to the decision rules for the values of c that are encoded by the tree. Whenever the traversal function encounters a node that requires a decision on a value ci that is missing, we cannot decide whether to continue with the left or right sub-tree. In that case, the parameter estimate of the current inner node is returned as the best model for the respective observation
o. The modified traversal function that can handle missing covariate values is given in the
following definition:
Definition 68. The missing-value traversal function φΥ: RØc → Rkof a SEM Tree Υ is defined
as Lθ(r) if Υ has only one node. Otherwise, let rlef t and rright be the children of r and Υlef t and Υright be the corresponding sub-trees.
φΥ(x) = φΥlef t(x) xLN(r)∈ LE((r, rlef t)) φΥright(x) xLN(r)∈ LE((r, rright)) Lθ(r) xLN(r)= Ø
This modified traversal function subsumes the previously introduced traversal function and can generally replace the previous definition. In order to deal with missing values, the evaluation of split candidates has to be modified. It makes sense that, during the evaluation of each candidate, the likelihood ratio of only the non-missing values is considered for the split. This avoids comparison of a larger number of observations in the pre-split model against a decreased number of observations in the post-split model. This necessitates a modification of the likelihood ratio test statistic for split candidate evaluation. Previously, we were able to calculate the negative two log-likelihood of the pre-split model once for each evaluation of a set of split candidates. With missing values, the situation changes slightly. The pre-split model has to be evaluated once for every split candidate, since the number of observations changes depending on the pattern of missing values in the covariate:
Theorem 69. Let Υ be a SEM Tree with template model M that is described by a expectations
vector µ and a covariance matrix Σ. Let n∈ N be a node with data set (o′, c′) = D′ ⊆ D. Let Dlef t be the left data set of the split candidate and Dright the right data set. Let ˆθ be found by
minimizing −2LLDlef t∪ Dright|M, ˆθ
, ˆθlef t be found by minimizing −2LL
Dlef t|M, ˆθlef t
, and ˆθright be found by minimizing −2LL
Dright|M, ˆθright
. Under the null hypothesis that a split candidate is uninformative, the likelihood ratio Λ between a pre-split and a post-split model under covariates with missing values is
2
Λ = −2LLDlef t∪ Dright|M, ˆθ + 2LLDlef t|M, ˆθlef t + 2LLDright|M, ˆθright
Proof. The proof follows from the definition of cross-validation and Theorems 51–54 concerning
the pre-split and post-split models.
Note that the same applies to the CV criterion if applied for model selection. For the sake of brevity this criterion is not re-iterated here.
Alternatively, a surrogate approach can be employed to deal with missing values (cf. Hastie, Tibshirani, & Friedman, 2001). The approach is based on finding another covariate that most closely describes the same split. This can be thought of as finding another covariate that best predicts the outcome of the missing-value covariate. If the value for the surrogate is also missing, the process is continued. If the value for all covariates is missing, the variable can be attributed to the larger class.
3.6.3 Variable Imputation With SEM Trees
At the stage of data analysis, one might want to infer missing variables of the observations
o ∈ O. For example, this could be a necessary pre-processing step for further analyses that
require a data set without missing values. Variable imputation based on a SEM Tree improves classic model-based imputation as SEM Trees provide a model that partitions the data set into groups with significant differences in the data set. Assuming that the tree recovers the group structure of the population, or at least parts of it, a more accurate retrieval of missing variables is to be expected.
The following description follows the single imputation procedure as described by Rubin (1976). In a maximum likelihood setting, we may choose to draw missing values according to a multivariate distribution conditioned on the observed values. The sampling distribution for missing values is accordingly given by the conditional multivariate normal. For a given pattern of missingness, let a data point x be partitioned in
x =
"
x1 x2
#
and let the mean vector µ and the covariance matrix Σ be partitioned as follows
µ = " µ1 µ2 # , Σ = " Σ11 Σ12 Σ21 Σ22 #
with x1, µ1, Σ11corresponding to the current pattern of missingness and x2, µ2, Σ22 correspond-
ing to the non-missing values. The missing values of the vector x1 can then be inferred by the
maximum likelihood solution of the conditional multivariate normal, which is the mean value of the conditional normal:
ˆ
x1= µ1+ Σ12Σ−122 (x2− µ2)
A caveat of the imputation of the maximum likely value is an artificial reduction of the variance in the imputed values which leads to a reduction of confidence intervals for parameter estimates and an overestimation of the precision of subsequent analyses. A variant of the above
method is the introduction of systematic variation by drawing new values from the conditional distribution with ˆ µ = µ1+ Σ12Σ−122 (x2− µ2) ˆ Σ = Σ11− Σ12Σ−122Σ21 ˆ x1 ∼ N ˆ µ, ˆΣ (3.6.1)
As a refinement to this general approach, model-based variable imputation offers the recon- struction of missing data sets according to hypotheses about the data. However, Schafer and Graham (2002) state that there is no need for a scientific theory underlying an imputation model. An example of variable imputation is given in Section 5.