Portal Único de Transparencia Gubernamental

Plan de Acción de Gobierno Abierto de la República Dominicana


implemented in several statistical software packages, among them the R package mixtools. The implementation of the APEM algorithm and the calculation of its estimators are more complicated and require more time. A detailed comparison is provided by the simulation study in chapter 4. Before that, some techniques for obtaining suitable starting values are presented.

3.5 Techniques for Choosing Initial Values

Several investigations into the choice of initial values have been made previously. For instance, Karlis and Xekalaki (2003) compared several techniques for obtaining starting values for the EM algorithm for finite mixtures. Since these techniques assume individual observations, none of them can be used directly in the context presented here, where the observations are only available in grouped form. An intensive literature search did not reveal previous work on this topic; thus, new strategies for obtaining initial values for mixtures given grouped observations are proposed. Most of them are based on modifications of techniques presented in Karlis and Xekalaki (2003). In particular, these techniques are presented for mixtures with two Gaussian components.

Starting Value Technique S1

First, the technique of Finch et al. (1989), briefly introduced in section 2.4, is adapted and modified to determine initial values for mixtures given grouped observations. Finch et al. (1989) claimed that “[...] the optimal estimates of the means are closely approximated by the arithmetic means of the split sample and the optimal estimated variance is close to the weighted average of the variance of the split sample” (p. 1021). They obtained the split by randomly generating a value for π₁ from a uniform distribution. The smallest n·π₁ observations then form the first part and the remaining observations form the second part. Since this procedure turns out to be too vague, Karlis and Xekalaki (2003) proposed to use equal mixing proportions. To achieve the split, the number of observations in the first part of the sample is denoted by n₁, where

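$$
n_1 := \begin{cases} \dfrac{n}{2}, & \text{if } n \text{ is even}, \\[4pt] \dfrac{n+1}{2}, & \text{if } n \text{ is odd}. \end{cases}
$$

Then, the starting values for the component means and variances are

$$
\mu_1^{(0)} = \frac{1}{n_1} \sum_{s=1}^{n_1} x_s, \tag{3.42}
$$

$$
\mu_2^{(0)} = \frac{1}{n - n_1} \sum_{s=n_1+1}^{n} x_s, \tag{3.43}
$$

$$
\sigma_{1,2}^{2\,(0)} = \frac{\sum_{s=1}^{n_1} \bigl(x_s - \mu_1^{(0)}\bigr)^2 + \sum_{s=n_1+1}^{n} \bigl(x_s - \mu_2^{(0)}\bigr)^2}{n - 2}. \tag{3.44}
$$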


When n is odd, the sum in (3.43) and the second sum in (3.44) start from n₁ instead of n₁ + 1.

Since the individual observations are not available in our context, a data mapping like the one presented in section 3.2.1 is proposed, where the midpoint of each interval is repeated according to the number of observations in that interval.
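The mapping can be illustrated with a minimal Python sketch. The function name `expand_grouped` and the input layout (a vector of interval boundaries plus a vector of frequencies) are assumptions made for illustration only; the mapping of section 3.2.1 itself is not reproduced here.

```python
import numpy as np

def expand_grouped(edges, counts):
    """Map grouped data to a pseudo-sample of interval midpoints.

    edges  : m + 1 increasing interval boundaries (assumed input layout)
    counts : m observed frequencies, one per interval
    """
    edges = np.asarray(edges, dtype=float)
    counts = np.asarray(counts, dtype=int)
    if edges.size != counts.size + 1:
        raise ValueError("need exactly one more edge than counts")
    midpoints = (edges[:-1] + edges[1:]) / 2.0
    # Repeat each midpoint as often as its interval was observed.
    return np.repeat(midpoints, counts)

# Three intervals [0,2), [2,4), [4,6) with 3, 1 and 2 observations.
x = expand_grouped([0, 2, 4, 6], [3, 1, 2])
print(x)  # [1. 1. 1. 3. 5. 5.]
```

The resulting pseudo-sample has the same size n as the original grouped sample, so any technique that expects individual observations can be applied to it.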

A similar procedure was proposed by Schader and Schmidt (1988) for a single Gaussian distribution. They also recommended using the midpoint $\bar{a}_i$ of each interval and suggested $\mu^{(0)} = \frac{1}{n} \sum_{i=1}^{m} n_i \bar{a}_i$ as an initial value for the mean of the Gaussian distribution. Their idea is adapted here, and with the resulting data set the starting values for $\mu_j$ and $\sigma_j^2$, j = 1, 2, can be calculated from (3.42), (3.43), and (3.44). This technique is denoted by S1.
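A minimal Python sketch of S1 on a pseudo-sample of midpoints follows. The function name `start_values_s1` and the example data are hypothetical. For odd n the sketch lets the second part start at n₁, as stated above, and takes the plain arithmetic mean of that part; this choice of normalizing constant is an assumption.

```python
import numpy as np

def start_values_s1(x):
    """Split-sample starting values in the spirit of (3.42)-(3.44).

    x is a pseudo-sample (e.g. repeated interval midpoints). Returns
    (pi1, mu1, mu2, var) with a common variance for both components.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    if n < 3:
        raise ValueError("need at least three observations")
    n1 = (n + 1) // 2                      # n/2 if n is even, (n+1)/2 if n is odd
    # Second part starts at n1 + 1 (1-based) for even n and at n1 for odd n.
    start2 = n1 if n % 2 == 0 else n1 - 1  # 0-based index
    lower, upper = x[:n1], x[start2:]
    mu1 = lower.mean()
    mu2 = upper.mean()                     # assumption: plain mean of the second part
    var = (np.sum((lower - mu1) ** 2) + np.sum((upper - mu2) ** 2)) / (n - 2)
    return 0.5, mu1, mu2, var

# Hypothetical grouped data: four interval midpoints and their frequencies.
midpoints = np.array([1.0, 3.0, 5.0, 7.0])
counts = np.array([2, 3, 3, 2])
x = np.repeat(midpoints, counts)
pi1, mu1, mu2, var = start_values_s1(x)
print(pi1, mu1, mu2, var)
```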

Starting Value Technique S2

This approach is adapted from Böhning et al. (1994) and denoted by S2. They used equal mixing proportions, too, and made the following suggestions for the component means

$$
\mu_1^{(0)} = x_1 + \tfrac{1}{2}, \tag{3.45}
$$

$$
\mu_2^{(0)} = x_n - \tfrac{1}{2}, \tag{3.46}
$$

“[..] since well separated values have often turned out to be a good strategy for avoiding local maxima which are not global ones” (p. 381). Although they did not make further specifications on the choice of the initial values for the variance, Karlis and Xekalaki (2003) proposed to use the following values

$$
\sigma_{1,2}^{2\,(0)} = \bar{\sigma}^2 - \frac{\bigl(\mu_1^{(0)} - \bar{x}\bigr)^2 + \bigl(\mu_2^{(0)} - \bar{x}\bigr)^2}{2}, \tag{3.47}
$$

where $\bar{x}$ and $\bar{\sigma}^2$ are the mean and the variance of the sample with individual observations. To obtain individual observations, the aforementioned mapping is suggested. Note that if the variance estimate $\sigma_{1,2}^{2\,(0)}$ turns out to be negative, $\bar{\sigma}^2/2$ is used instead. This procedure is equivalent to the one used by Karlis and Xekalaki (2003).
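The following Python sketch illustrates S2, including the fallback for a negative variance. The function name `start_values_s2` and the example data are hypothetical, and the divisor n − 1 for the sample variance is an assumption, since the text does not state which divisor is used.

```python
import numpy as np

def start_values_s2(x):
    """Starting values following (3.45)-(3.47) for a pseudo-sample x."""
    x = np.asarray(x, dtype=float)
    mu1 = x.min() + 0.5            # smallest observation plus 1/2
    mu2 = x.max() - 0.5            # largest observation minus 1/2
    xbar = x.mean()
    var_all = x.var(ddof=1)        # assumption: divisor n - 1
    var = var_all - ((mu1 - xbar) ** 2 + (mu2 - xbar) ** 2) / 2.0
    if var < 0:
        var = var_all / 2.0        # fallback described in the text
    return 0.5, mu1, mu2, var

# Hypothetical pseudo-sample built from interval midpoints.
x = np.repeat([1.0, 3.0, 5.0, 7.0], [2, 3, 3, 2])
pi1, mu1, mu2, var = start_values_s2(x)
print(pi1, mu1, mu2, var)
```

In this example the means are placed so far apart that (3.47) becomes negative, so the sketch returns half of the overall sample variance.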

Starting Value Technique S3

The following technique, which is based on the range of the sample and was likewise considered by Karlis and Xekalaki (2003), is indexed by S3. Starting again with equal mixing proportions, the initial values for the component means are now

$$
\mu_1^{(0)} = \min_s (x_s) + \frac{d}{3}, \tag{3.48}
$$

$$
\mu_2^{(0)} = \min_s (x_s) + \frac{2d}{3}, \tag{3.49}
$$


where s = 1, …, n and d is the range of the sample. Once again, this approach requires individual observations. Hence, the previously suggested mapping is applied here, too. The starting values for the component standard deviations are chosen to be

$$
\sigma_1 = \sigma_2 = \tfrac{1}{2} \cdot \bar{\sigma}, \tag{3.50}
$$

where ¯σ is the standard deviation of the transformed sample.
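A short Python sketch of S3 is given below. The function name `start_values_s3` and the example data are hypothetical, and the divisor n − 1 for the standard deviation is an assumption.

```python
import numpy as np

def start_values_s3(x):
    """Range-based starting values following (3.48)-(3.50)."""
    x = np.asarray(x, dtype=float)
    lo = x.min()
    d = x.max() - lo                 # range of the pseudo-sample
    mu1 = lo + d / 3.0
    mu2 = lo + 2.0 * d / 3.0
    sigma = 0.5 * x.std(ddof=1)      # assumption: divisor n - 1
    return 0.5, mu1, mu2, sigma      # note: a standard deviation, not a variance

# Hypothetical pseudo-sample built from interval midpoints.
x = np.repeat([1.0, 3.0, 5.0, 7.0], [2, 3, 3, 2])
pi1, mu1, mu2, sigma = start_values_s3(x)
print(pi1, mu1, mu2, sigma)
```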

Starting Value Technique S4

The fourth considered approach was proposed by Seidel et al. (2000). This technique also involves the assumption that the mixing proportions are equal. The initial values for the component means are now

$$
\mu_1 = \bar{x} - \tfrac{1}{2} \cdot \bar{\sigma}, \tag{3.51}
$$

$$
\mu_2 = \bar{x} + \tfrac{1}{2} \cdot \bar{\sigma}, \tag{3.52}
$$

where $\bar{x}$ and $\bar{\sigma}$ are the mean and the standard deviation of the sample. Since the individual observations are required here as well, a data mapping is suggested as described earlier. The starting values for the component variances equal the variance of the whole transformed sample. This technique is denoted by S4.
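S4 can be sketched in Python as follows. The function name `start_values_s4` and the example data are hypothetical, and the divisor n − 1 for the standard deviation is again an assumption.

```python
import numpy as np

def start_values_s4(x):
    """Starting values following (3.51)-(3.52) with a common variance."""
    x = np.asarray(x, dtype=float)
    xbar = x.mean()
    sd = x.std(ddof=1)               # assumption: divisor n - 1
    mu1 = xbar - 0.5 * sd
    mu2 = xbar + 0.5 * sd
    var = sd ** 2                    # variance of the whole transformed sample
    return 0.5, mu1, mu2, var

# Hypothetical pseudo-sample built from interval midpoints.
x = np.repeat([1.0, 3.0, 5.0, 7.0], [2, 3, 3, 2])
pi1, mu1, mu2, var = start_values_s4(x)
print(pi1, mu1, mu2, var)
```

Because both means are placed symmetrically around the overall mean, their average always equals the sample mean.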

Starting Value Technique S5

Finally, a new technique indexed by S5 is proposed. Once again, a data mapping is essential. This technique is the only one that uses estimated values for the mixing proportions. To this end, the midpoint of the sample range d is determined first, which is

$$
\tilde{d} := \min_s (x_s) + \frac{d}{2}. \tag{3.53}
$$

Then, the sample is split at this midpoint. The number of observations in the first part of the sample, which lies below the midpoint, is denoted by N₁; divided by the sample size n, it is chosen as the starting value for the first mixing proportion. The number of observations in the second part, denoted by N₂, divided by n, analogously yields a first guess for the second mixing proportion. Both are summarized as

$$
\pi_1 = \frac{N_1}{n}, \tag{3.54}
$$

$$
\pi_2 = \frac{N_2}{n}. \tag{3.55}
$$

The starting values for the component means are calculated in a way similar to that suggested by Böhning et al. (1994). The authors used the smallest observation and added 0.5 to obtain a starting value for the first component mean. Since the minimum of the sample may be an outlier, which would be inappropriate, the proposed technique instead considers the 1% quantile of the sample. The second component mean is calculated in the same manner, but with the 99% quantile. The component means are

$$
\mu_1 = q_{0.01} \quad \text{and} \quad \mu_2 = q_{0.99}. \tag{3.56}
$$

The starting values for the common component variance are determined by the approach of Finch et al. (1989) given in equation (3.44).
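A minimal Python sketch of S5 follows. Several details are assumptions made for illustration: the function name `start_values_s5`, counting observations equal to the midpoint as part of the first group, the default linear interpolation of `numpy.quantile`, and evaluating (3.44) with the equal split and split-sample means of technique S1.

```python
import numpy as np

def start_values_s5(x):
    """Starting values following (3.53)-(3.56) plus the variance (3.44)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    if n < 3:
        raise ValueError("need at least three observations")
    mid = x[0] + (x[-1] - x[0]) / 2.0          # midpoint of the range, (3.53)
    n_first = int(np.sum(x <= mid))            # assumption: ties go to the first part
    pi1 = n_first / n                          # (3.54); pi2 = 1 - pi1
    mu1 = np.quantile(x, 0.01)                 # assumption: linear interpolation
    mu2 = np.quantile(x, 0.99)
    # Common variance as in (3.44), using the equal split of technique S1.
    n1 = (n + 1) // 2
    start2 = n1 if n % 2 == 0 else n1 - 1
    lower, upper = x[:n1], x[start2:]
    ss = np.sum((lower - lower.mean()) ** 2) + np.sum((upper - upper.mean()) ** 2)
    return pi1, mu1, mu2, ss / (n - 2)

# Hypothetical skewed pseudo-sample built from interval midpoints.
x = np.repeat([1.0, 3.0, 5.0, 7.0], [4, 3, 2, 1])
pi1, mu1, mu2, var = start_values_s5(x)
print(pi1, mu1, mu2, var)
```

With repeated midpoints as input, the extreme quantiles usually coincide with or lie close to the outermost occupied midpoints, so the interval width directly limits how finely the means can be placed.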

In the following, the results of the simulation studies are presented. One investigation (section 4.7) compares these five techniques against each other, and a recommendation is given for different mixture models. Additionally, the previously presented methods for estimating the parameters of a Gaussian mixture are compared. First, the true values are used as starting values; subsequently, starting values obtained by techniques S1 to S5 are applied.

4 Simulation Comparison

In this chapter the procedure and the results of the simulation study are outlined. Starting in section 4.1 with the description of the implementation, the chapter then turns to the specification of the comparison criteria, followed by the presentation of the investigated situations and mixture models. For convenience, the simulation study is presented for Gaussian mixtures with two components. Since the mixing proportions π₁ and π₂ sum to one, one of them is redundant. Hence π₂ will not be specified and Ψ = (π₁, μ₁, μ₂, σ₁, σ₂).

The main simulation results are shown in sections 4.2, 4.3, and 4.4, where the aforementioned mixture models are considered. A variation of the sample size can be found in section 4.6. Section 4.7 addresses the problem of different initial values by presenting results for starting values obtained by the techniques proposed in the previous section. Finally, the influence of the interval width on the estimation is investigated and presented in section 4.8.
