Caracterización y efecto de las sillas

In document Document downloaded from: (página 177-200)

SEM Trees, as described up to here, operate on dichotomous covariates, which ensures that the output is a binary tree. In psychological data sets, however, covariates are often not dichotomous. They may be categorial, like occupation or type of education; ordinal, like the number of children or age in years; or continuous, like reaction time on a cognitive task. To handle these kinds of covariates, SEM Trees employ the approach taken in many conventional decision tree methods (Quinlan, 1992, 1996; Zeileis et al., 2008): the original data set is mapped onto a proxy data set that replaces categorial, ordinal, and continuous covariates with sets of dichotomous covariates, the binary split candidates. The advantage of this procedure is that the same split-selection algorithm can be used regardless of the type of variable; only the respective mapping function has to be applied. Furthermore, splits into multiple categories are still possible. For example, a tree that splits a covariate into three subgroups can be represented as a binary tree that first splits into two child nodes, representing the categories {1, 2} and {3}, and then, on the next level, splits {1, 2} into {1} and {2}.

The mapping function depends on the type of the original variable. For categorial data, the variable is mapped onto a set of dichotomous variables representing all possible partitions of its values into two subsets. Ordinal variables are transformed into a set of dichotomous variables that represent a “smaller or equal” relation at all possible split points. Continuous data are transformed in the same way as ordinal variables, with a slight adjustment: the split point is placed between each two successive observed values. In the worst case, a continuous variable in a data set of size N leads to a representation with N − 1 dichotomous variables. In the following, the mapping procedure is formalized for the different types of covariates.

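Formally, for an ordinal variable $C_{\text{ordinal}}$ with the $n$ distinct and ordered observed values $v_1 < v_2 < \dots < v_n$, thresholds are placed at the first $n - 1$ observed values, resulting in $n - 1$ split candidates:

\[
\tilde{C}_i :=
\begin{cases}
0 & C_{\text{ordinal}} \le v_i \\
1 & \text{otherwise}
\end{cases},
\qquad i \in \{1, 2, \dots, n - 1\}
\]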
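The ordinal mapping can be illustrated with a minimal Python sketch. The function name `ordinal_split_candidates` and the example data are hypothetical and serve only as an illustration; the sketch assumes the covariate is given as a plain list of comparable values.

```python
def ordinal_split_candidates(values):
    """Map an ordinal covariate onto its dichotomous split candidates.

    Returns a dict that maps each threshold v_i (every distinct observed
    value except the largest) to a 0/1 indicator list, where 0 means
    "value <= v_i" and 1 means "value > v_i".
    """
    levels = sorted(set(values))           # v_1 < v_2 < ... < v_n
    candidates = {}
    for threshold in levels[:-1]:          # the first n - 1 values
        candidates[threshold] = [0 if x <= threshold else 1 for x in values]
    return candidates


# Hypothetical example: number of children for five participants.
children = [0, 2, 1, 3, 1]
ordinal_candidates = ordinal_split_candidates(children)
for threshold, indicator in ordinal_candidates.items():
    print(threshold, indicator)
```

With four distinct observed values, the sketch yields three split candidates, in line with the $n - 1$ candidates stated above.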

For a continuous variable $C_{\text{continuous}}$, a similar selection scheme is chosen. In order to reduce the bias of the threshold selection towards observed values, the mean of each two successive observations is selected as threshold:

\[
\tilde{C}_i :=
\begin{cases}
0 & C_{\text{continuous}} < \tfrac{1}{2}\,(v_i + v_{i+1}) \\
1 & \text{otherwise}
\end{cases},
\qquad i \in \{1, 2, \dots, n - 1\}
\]
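As a minimal sketch of the midpoint scheme, the following Python function derives the thresholds and the resulting indicators. The name `continuous_split_candidates` and the example values are assumptions made for illustration only.

```python
def continuous_split_candidates(values):
    """Map a continuous covariate onto its dichotomous split candidates.

    Thresholds are the means of each two successive distinct observed
    values; the indicator is 0 below the threshold and 1 otherwise.
    """
    levels = sorted(set(values))
    thresholds = [(low + high) / 2 for low, high in zip(levels, levels[1:])]
    return {t: [0 if x < t else 1 for x in values] for t in thresholds}


# Hypothetical example: reaction times in seconds.
reaction_times = [0.5, 1.5, 1.0, 2.5]
continuous_candidates = continuous_split_candidates(reaction_times)
for threshold, indicator in continuous_candidates.items():
    print(threshold, indicator)
```

Because every threshold lies strictly between two observed values, no observation falls exactly on a threshold, so the choice between a strict and a non-strict comparison does not change the resulting split.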

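For a categorial variable $C_{\text{categorial}}$ with values $c_1, \dots, c_n$, we define the implied covariate $\tilde{C}_{A \cup B}$ for a partition $A \cup B = \{c_1, \dots, c_n\}$ into two disjoint, non-empty subsets $A$ and $B$ as

\[
\tilde{C}_{A \cup B} :=
\begin{cases}
0 & C_{\text{categorial}} \in A \\
1 & C_{\text{categorial}} \in B
\end{cases}
\]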

For each categorial variable, the number of implied covariates is $2^{n-1} - 1$.
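The count of implied covariates can be checked by enumeration. The following Python sketch is one possible way to list the partitions; the function names and the example categories are hypothetical. To avoid counting a partition and its mirror image twice, the sketch always places the first category in subset A.

```python
from itertools import combinations


def categorial_partitions(values):
    """Enumerate all partitions of the observed categories into two
    disjoint, non-empty subsets (A, B), ignoring mirror images."""
    categories = sorted(set(values))
    if not categories:
        return []
    first, rest = categories[0], categories[1:]
    partitions = []
    # k = number of additional categories in A; k < len(rest) keeps B non-empty.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            subset_a = {first, *extra}
            subset_b = set(categories) - subset_a
            partitions.append((subset_a, subset_b))
    return partitions


def implied_covariate(values, subset_b):
    """Indicator for one partition: 0 if the value is in A, 1 if it is in B."""
    return [1 if x in subset_b else 0 for x in values]


# Hypothetical example with n = 3 categories.
occupations = ["clerk", "nurse", "teacher", "nurse"]
partitions = categorial_partitions(occupations)
for subset_a, subset_b in partitions:
    print(sorted(subset_a), sorted(subset_b), implied_covariate(occupations, subset_b))
```

For three categories the enumeration yields $2^{2} - 1 = 3$ partitions, and for four categories $2^{3} - 1 = 7$.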

In the following, we will sometimes refer to sub-models as the left sub-model and the right sub-model, analogous to the notion of left and right data sets. By definition, for a continuous or ordinal covariate, the left sub-model represents the subset of the sample with values lower than or equal to the specified threshold, and the right sub-model corresponds to the subset with larger values. For categorial variables, the right sub-model represents the subset of the sample whose covariate values match the specified set of values, and the left sub-model represents the subset whose covariate values do not appear in that set. Graphical representations of SEM Trees adhere to this convention.
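The left/right convention can be made concrete with a short Python sketch. The helper `split_sample`, the row layout, and the threshold of 35 are assumptions chosen for illustration; the sketch only shows how a dichotomous indicator divides a sample into the left data set (indicator 0) and the right data set (indicator 1).

```python
def split_sample(rows, indicator):
    """Divide rows into the left (indicator 0) and right (indicator 1) data set."""
    left = [row for row, flag in zip(rows, indicator) if flag == 0]
    right = [row for row, flag in zip(rows, indicator) if flag == 1]
    return left, right


# Hypothetical sample with an ordinal covariate "age" and threshold 35.
rows = [{"id": 1, "age": 23}, {"id": 2, "age": 41}, {"id": 3, "age": 35}]
indicator = [0 if row["age"] <= 35 else 1 for row in rows]
left, right = split_sample(rows, indicator)
print([row["id"] for row in left], [row["id"] for row in right])
```

The participant with age exactly 35 ends up in the left data set, reflecting the "lower than or equal" rule for ordinal and continuous covariates.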

3.4.3 Time Complexity

In this section, I will analyze the time complexity of the SEM Tree algorithm. There are two essential variables whose influence on the time complexity is of interest: the number of observed samples N and the number of split candidates, that is, dichotomous covariates. We denote the number of these candidate covariates by M. During the generation of a SEM Tree, a large number of models are fitted to data. Therefore, we have to estimate how often the optimizer is called and to what extent the run time of the optimizer depends on M and N. Generally, the optimizer for parameter estimation can be regarded as a black box that estimates the model parameters. For a realistic approximation of the run time, we have to consider that the optimizer will likely use a numerical procedure to find the maximum likelihood estimate. The log-likelihood formulation for data without missing values (cf. Equation 3.1.8) can be evaluated in constant time with respect to N. Later on, the FIML (Full Information Maximum Likelihood) fit function will replace the ML fit function because it is able to handle missing values in the observed variables. A deeper treatment of missing values in data sets follows below. The evaluation of the FIML log-likelihood is feasible in O(N) because it sums up the log-likelihoods of the individual observations. The number of iterations of the optimizer generally does not depend on N; on the contrary, if the data are normally distributed, a larger N might provide a more stable estimate of the covariance matrix and mean vector and even reduce the number of steps of the optimization process.

For a time complexity analysis, we have to determine how often the optimizer is called in the process of generating a SEM Tree. During split candidate evaluation in each node, the pre-split model is fitted once. For each dichotomous covariate, the post-split model has to be evaluated once; that is, M calls to the optimizer are executed for the evaluation of the post-split models. We assume that each selected split variable separates the observations into two halves during SEM Tree generation. Under this assumption, the depth of the tree is bounded by $O(\log N)$. A second bound on the depth of the tree is given by $O(M)$, because a dichotomous covariate cannot split the same subsample a second time. Taken together, we obtain the following time complexity for SEM Trees. When employing the likelihood function $F_{ML}$ for data without missing values:

\[
T_{ML}(N, M) = O\left( M N + M \cdot 2^{M} \right)
\]
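The counting argument can be sketched in Python under stated assumptions. The function `optimizer_calls` is a hypothetical helper and not part of the original analysis: it assumes a full binary tree whose depth is the smaller of the two depth bounds, and it counts one pre-split fit plus M post-split evaluations for every node in which split candidates are evaluated.

```python
import math


def optimizer_calls(n_obs, n_candidates):
    """Rough count of optimizer calls under the balanced-split assumption.

    Assumptions (illustrative only): the tree is a full binary tree with
    depth min(ceil(log2(N)), M), and each split node costs one pre-split
    fit plus M post-split evaluations.
    """
    depth = min(math.ceil(math.log2(n_obs)), n_candidates)
    split_nodes = 2 ** depth - 1           # internal nodes of a full binary tree
    return split_nodes * (1 + n_candidates)


for n_obs, n_candidates in [(1000, 5), (1000, 20)]:
    print(n_obs, n_candidates, optimizer_calls(n_obs, n_candidates))
```

With few candidates the depth is limited by M, so the count grows with $2^{M}$; with many candidates the depth is limited by $\log_2 N$, so the number of split nodes grows roughly linearly in N.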

When using the likelihood function for data sets including missing data:

\[
T_{FIML}(N, M) = O\left( N^{2} M + N M \cdot 2^{M} \right)
\]

In typical research situations, missing data will be present and the number of binary covariates will typically exceed the logarithm of the number of observations. For example, in psychological data sets N = 1000 already constitutes a large sample size. In this example, as soon as more than $\log_2(1000) \approx 10$ covariates are included, the run time will be determined linearly by the number of dichotomous covariates and quadratically by the number of observations.
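The threshold in this example can be checked numerically. The following Python sketch uses a hypothetical helper, `crossover_candidates`, that returns the smallest number of dichotomous covariates M for which $2^{M}$ reaches N; integer arithmetic is used to avoid floating-point rounding in the logarithm.

```python
def crossover_candidates(n_obs):
    """Smallest M with 2**M >= n_obs, computed with integer arithmetic."""
    return (n_obs - 1).bit_length()


for n_obs in (100, 1000, 10000):
    m = crossover_candidates(n_obs)
    print(n_obs, m, 2 ** m)
```

For N = 1000 the helper returns 10, since $2^{10} = 1024$ is the first power of two that is not smaller than 1000.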
