The stopping criteria for SEM Trees are based on a statistic whose distribution is known under the null hypothesis that a split candidate variable provides no information about differences in the parameters of a template SEM. Interestingly, this criterion can also be interpreted as an information-theoretic criterion: it leads to the selection of covariates that maximize information gain, that is, the reduction in entropy of the predicted observations when considering a split of the tree. SEM Trees therefore maximize essentially the same criterion as decision trees that maximize information gain, such as ID3 (Quinlan, 1986). The difference between the two is that ID3 predicts categorical outcomes, choosing the split candidate that maximizes the information gain when partitioning a categorical outcome variable, whereas SEM Trees predict continuous observations based on a model. The following section formally derives this relation.

We have previously used the notion of the entropy $H(X)$ of a random variable $X$. Usually, the entropy is estimated from a finite sample drawn from this random variable. In the following, we denote estimates of the entropy from discrete samples $x_1, \ldots, x_N$ of $X$ by $\hat{H}(x_1, \ldots, x_N)$, whereas the entropy of $X$ is denoted by $H(X)$.

The information gain of $X$ knowing $Y$ is defined as the estimated reduction of the entropy of $X$ when knowing $Y$, formalized as

$$\mathrm{Gain}(x_1, \ldots, x_N, y_1, \ldots, y_N) = \hat{H}(x_1, \ldots, x_N) - \hat{H}(x_1, \ldots, x_N \mid y_1, \ldots, y_N)$$

(Cover & Thomas, 1991). This information gain, as used in ID3, is the reduction in entropy of the target variable after splitting the target variable according to the discrete split candidate variable. Let the target variable $X$ be a discrete random variable. Let the split candidate variable $Y$ be a discrete random variable whose elementary outcomes are given by $\mathrm{Values}(Y)$. Let $N$ be the number of observations of the target variable.

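The information gain about a variable $X$ when knowing the state $y$ of a variable $Y$ is defined as the decrease of the entropy estimate $\hat{H}(x_1, \ldots, x_N)$ of $X$ for known values of $Y$,

$$\mathrm{Gain}(x_1, \ldots, x_N, Y) = \hat{H}(x_1, \ldots, x_N) - \sum_{y \in \mathrm{Values}(Y)} \frac{N_y}{N}\, \hat{H}(x_1, \ldots, x_N \mid Y = y),$$

where $N_y$ denotes the number of observations for which $Y = y$.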
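The definition above can be illustrated with a small sketch for discrete samples. The function names `entropy` and `information_gain` and the toy data are assumptions made for this illustration, not part of the original text; the entropy estimate used here is the simple plug-in estimate from relative frequencies, measured in bits.

```python
from collections import Counter
from math import log2

def entropy(samples):
    """Plug-in entropy estimate (in bits) from relative frequencies."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())

def information_gain(x, y):
    """Gain(x, Y) = H(x) - sum over y of (N_y / N) * H(x | Y = y)."""
    n = len(x)
    conditional = 0.0
    for value in set(y):
        # Observations of the target that fall into the partition Y = value.
        subset = [xi for xi, yi in zip(x, y) if yi == value]
        conditional += (len(subset) / n) * entropy(subset)
    return entropy(x) - conditional

# Hypothetical toy data: one split candidate separates the target
# perfectly, the other carries no information about it.
x = ["a", "a", "b", "b"]
y_informative = [0, 0, 1, 1]
y_uninformative = [0, 1, 0, 1]

gain_informative = information_gain(x, y_informative)
gain_uninformative = information_gain(x, y_uninformative)
print(gain_informative, gain_uninformative)
```

In this toy case the informative candidate removes all uncertainty about the target, while the uninformative candidate leaves the entropy unchanged, so a gain-maximizing procedure such as ID3 would pick the former.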

SEM Trees operationalize the choice of a split candidate by the maximization of a log-likelihood ratio statistic. Interestingly, we can express the likelihood ratio as a variant of the information gain criterion. This links the log-likelihood ratio criterion back to the same fundamental rule of maximizing information gain. Indeed, by maximizing the likelihood ratio criterion, we are maximizing the information gain in the model-predicted distribution of the observations with respect to split candidates $Y$.

Lemma 63. Let $M$ be a SEM. Let $X$ be a random variable and $x_1, \ldots, x_N$ a finite set of samples of $X$. The negative two log-likelihood is a linear function of the entropy estimate of the model-predicted distribution of $M$:

$$-2LL(x_1, \ldots, x_N \mid M) = 2N \cdot \hat{H}(x_1, \ldots, x_N \mid M)$$

Proof. The log-likelihood yields an estimator of the entropy of a continuous multivariate random variable $X$. Let $x_1, \ldots, x_N$ be $N$ samples of $X$, and let $M$ be a multivariate SEM. We recall that the negative two log-likelihood of the observations is $-2$ times the sum of the logarithms of the likelihood function evaluated at each observation:

$$-2LL(x_1, \ldots, x_N \mid M) = -2 \sum_{i=1}^{N} \log L(x_i \mid M)$$

The samples $x_i$ are drawn according to the distribution of $X$. For $N \to \infty$, the average log-likelihood converges to the negative entropy of the distribution that is predicted by the model $M$ (cf. Cover & Thomas, 1991):

$$\lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} \log L(x_i \mid M) = \int_X L(x \mid M) \log L(x \mid M)\, dx = -H(X \mid M)$$

Equivalently, we can write the fundamental relation between the likelihood function and the entropy of the model-implied distribution as

$$-2LL(x_1, \ldots, x_N \mid M) = 2N \cdot \hat{H}(x_1, \ldots, x_N \mid M)$$

Note that in maximum likelihood settings, the logarithm is usually taken to base $e$, whereas in information-theoretic settings, logarithms are often taken to base 2, which expresses entropy in bits. To account for this discrepancy, a multiplicative correction factor has to be included; it does not change the central relation between the two measures.
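As a hedged numerical illustration of Lemma 63 and of the base correction, the following sketch uses a univariate Gaussian fitted by maximum likelihood as a stand-in for the model $M$. This stand-in, the simulated data, and all variable names are assumptions of this example and do not come from the original text. The entropy estimate is taken to be the negative average log-likelihood, and it is compared with the closed-form differential entropy of the fitted Gaussian.

```python
import math
import random

def gaussian_log_density(x, mu, sigma2):
    """Natural-log density of a univariate Gaussian."""
    return -0.5 * (math.log(2 * math.pi * sigma2) + (x - mu) ** 2 / sigma2)

random.seed(1)
xs = [random.gauss(0.0, 2.0) for _ in range(5000)]  # simulated observations
n = len(xs)

# Maximum likelihood estimates (the variance divides by N, not N - 1).
mu_hat = sum(xs) / n
sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n

log_lik = sum(gaussian_log_density(x, mu_hat, sigma2_hat) for x in xs)
minus_two_ll = -2.0 * log_lik

# Entropy estimate in nats: negative average log-likelihood.
h_hat_nats = -log_lik / n
# Closed-form differential entropy of the fitted Gaussian, in nats.
h_closed = 0.5 * math.log(2 * math.pi * math.e * sigma2_hat)
# The same estimate expressed in bits (base-2 logarithm).
h_hat_bits = h_hat_nats / math.log(2)

print(minus_two_ll, 2 * n * h_hat_nats)               # agree in nats
print(minus_two_ll, 2 * n * math.log(2) * h_hat_bits)  # bits need the factor ln 2
```

When the entropy is reported in bits, the multiplicative correction factor mentioned above is $\ln 2$ in this sketch; the linear relation itself is unaffected.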

Theorem 64. Let $M$ be a SEM and $D$ be a data set. Let $LLR_Y$ be the log-likelihood ratio statistic obtained by comparing a pre-split and a post-split model with respect to the split candidate, represented by the random variable $Y$. The likelihood ratio statistic can be expressed as a linear function of the information gain about the model-predicted distribution $L(D \mid M)$ when knowing the states of $Y$:

$$LLR_Y(D \mid M) = 2N \cdot \mathrm{Gain}(L(D \mid M), Y)$$

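Proof. We begin by considering the information gain for a binary variable $Y$, representing a split candidate in a binary SEM Tree:

$$\begin{aligned}
\mathrm{Gain}(x_1, \ldots, x_N, Y) &= \hat{H}(x_1, \ldots, x_N) - \sum_{y \in \mathrm{Values}(Y)} \frac{N_y}{N}\, \hat{H}(x_1, \ldots, x_N \mid Y = y) \\
&= \hat{H}(x_1, \ldots, x_N) - \frac{N_{\mathrm{left}}}{N}\, \hat{H}(x_1, \ldots, x_N \mid Y = \mathrm{left}) - \frac{N_{\mathrm{right}}}{N}\, \hat{H}(x_1, \ldots, x_N \mid Y = \mathrm{right})
\end{aligned}$$

Multiplying both sides by $2N$ gives

$$2N \cdot \mathrm{Gain}(x_1, \ldots, x_N, Y) = 2N\, \hat{H}(x_1, \ldots, x_N) - 2N_{\mathrm{left}}\, \hat{H}(x_1, \ldots, x_N \mid Y = \mathrm{left}) - 2N_{\mathrm{right}}\, \hat{H}(x_1, \ldots, x_N \mid Y = \mathrm{right})$$

To apply Lemma 63, observe that $\hat{H}(x_1, \ldots, x_N \mid Y = \mathrm{left})$ is the entropy estimate for the observations in the left partition $D_{\mathrm{left}}$ of the data set $D$, and $\hat{H}(x_1, \ldots, x_N \mid Y = \mathrm{right})$ is the entropy estimate for the observations in the right partition $D_{\mathrm{right}}$. Therefore, we can conclude that

$$2N \cdot \mathrm{Gain}(x_1, \ldots, x_N, Y) = -2LL(D \mid M(\hat{\theta})) + 2LL(D_{\mathrm{left}} \mid M(\hat{\theta}_{\mathrm{left}})) + 2LL(D_{\mathrm{right}} \mid M(\hat{\theta}_{\mathrm{right}})) = LLR_Y(D \mid M)$$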
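The identity in the proof can be checked numerically with a minimal sketch. As an assumption made purely for illustration, a univariate Gaussian fitted by maximum likelihood stands in for the template SEM, the two partitions are simulated, and the names `fitted_log_lik` and `entropy_estimate` are hypothetical helpers. The entropy estimate of each data set is its negative average log-likelihood at the maximum likelihood estimates, as in Lemma 63.

```python
import math
import random

def fitted_log_lik(xs):
    """Log-likelihood of a univariate Gaussian evaluated at its ML estimates."""
    n = len(xs)
    mu = sum(xs) / n
    sigma2 = sum((x - mu) ** 2 for x in xs) / n
    # At the ML estimates the squared-residual term sums to exactly n.
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1.0)

def entropy_estimate(xs):
    """Entropy estimate in nats: negative average fitted log-likelihood."""
    return -fitted_log_lik(xs) / len(xs)

random.seed(7)
left = [random.gauss(0.0, 1.0) for _ in range(300)]   # observations with Y = left
right = [random.gauss(1.5, 1.0) for _ in range(200)]  # observations with Y = right
data = left + right
n = len(data)

# Log-likelihood ratio of the pre-split model against the post-split model.
llr = (-2 * fitted_log_lik(data)
       + 2 * fitted_log_lik(left)
       + 2 * fitted_log_lik(right))

# Information gain with partition weights N_left / N and N_right / N.
gain = (entropy_estimate(data)
        - len(left) / n * entropy_estimate(left)
        - len(right) / n * entropy_estimate(right))

print(llr, 2 * n * gain)  # the two quantities coincide up to rounding
```

Because the post-split model nests the pre-split model, the statistic is non-negative in this sketch, and ranking split candidates by `llr` or by `gain` gives the same order for a fixed data set.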

Corollary 65. When choosing a split candidate that maximizes the log-likelihood ratio of a pre-split and a post-split model in the process of SEM Tree generation, the split variable that maximizes the information gain between the model-predicted distribution and the split variable is chosen.

The expected value of the information gain is the mutual information measure, a general measure of statistical dependence between two random variables (Cover & Thomas, 1991). Mutual information measures how much the joint distribution of two random variables differs from the product of their marginal distributions.

The mutual information is a measure of the degree of statistical dependence. If the target variable and the split candidate variable are statistically independent, that is, the product of their marginal distributions is equal to their joint distribution, the mutual information is zero.
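The independence property can be illustrated with a short sketch for the fully discrete case. This is a simplification assumed for the example: both variables are discrete and their joint distribution is known exactly, whereas the SEM Tree setting involves a continuous model-predicted distribution. The function name `mutual_information` and the two probability tables are hypothetical.

```python
from math import log2

def mutual_information(joint):
    """Mutual information in bits for a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]        # marginal distribution of X
    py = [sum(col) for col in zip(*joint)]  # marginal distribution of Y
    total = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:  # cells with zero probability contribute nothing
                total += p * log2(p / (px[i] * py[j]))
    return total

# Joint table equal to the product of the marginals (0.4, 0.6) and (0.3, 0.7).
independent = [[0.12, 0.28],
               [0.18, 0.42]]
# Joint table in which Y determines X completely.
dependent = [[0.5, 0.0],
             [0.0, 0.5]]

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
print(mi_independent, mi_dependent)
```

For the independent table the result is zero up to floating-point rounding, while the fully dependent table yields one bit, the entropy of either variable.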

In analogy to the previous reasoning, we can conclude that by maximizing an estimator of the expected likelihood ratio when using cross-validation for candidate selection, we are choosing split candidates that maximize the mutual information between the model-predicted distribution and the split variable.

Corollary 66. When choosing a split candidate that maximizes the expected log-likelihood ratio of a pre-split and a post-split model in the process of SEM Tree generation, the split variable that maximizes the mutual information $I(X, Y)$ between the model-predicted distribution $X$ and the split variable $Y$ is chosen, with

$$I(X, Y) = \int_X \sum_{y \in \mathrm{Values}(Y)} \Pr(x, y) \log \frac{\Pr(x, y)}{\Pr(x)\, \Pr(y)}\, dx$$

These elaborations show that the evaluation of split candidates in SEM Trees is reasonable from both a statistical and an information-theoretic point of view and, finally, that evaluating split candidates by the log-likelihood ratio and by the classic information gain criterion is rooted in the same fundamental concepts.
