implemented in several statistical software packages, among others in the R package mixtools. The implementation of the APEM algorithm and the calculation of the estimators are more complicated and require more time. A detailed comparison is provided by the simulation study in chapter 4. Before that, some techniques for obtaining suitable starting values are presented.
3.5 Techniques for Choosing Initial Values
Several investigations into the choice of initial values have been made previously. For instance, Karlis and Xekalaki (2003) compared some techniques for obtaining starting values for the EM algorithm for finite mixtures. Since these techniques assume individual observations, none of them can be used directly in the context presented here, where the observations are only available in grouped form. An intensive literature search did not reveal previous work on this topic; thus, new strategies for obtaining initial values for mixtures given grouped observations are proposed. Most of them are based on modifications of techniques presented in Karlis and Xekalaki (2003). In particular, these techniques are presented for mixtures with two Gaussian components.
Starting Value Technique S1
First, the technique of Finch et al. (1989), briefly introduced in section 2.4, is adapted and modified to determine initial values for mixtures given grouped observations. Finch et al. (1989) claimed that “[...] the optimal estimates of the means are closely approximated by the arithmetic means of the split sample and the optimal estimated variance is close to the weighted average of the variance of the split sample” (p. 1021). They obtained the split by randomly generating a value for $\pi_1$ from a uniform distribution. Then, the smallest $n \cdot \pi_1$ observations are taken to be the first part and the remaining observations form the second part. Since this procedure turns out to be too vague, Karlis and Xekalaki (2003) proposed to use equal mixing proportions. To achieve the split, the number of observations in the first part of the sample is denoted by $n_1$, where
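\[
n_1 := \begin{cases} \dfrac{n}{2}, & \text{if } n \text{ is even}, \\[4pt] \dfrac{n+1}{2}, & \text{if } n \text{ is odd}. \end{cases}
\]
Then, the starting values for the component means and variances are
\[
\mu_1^{(0)} = \frac{1}{n_1} \sum_{s=1}^{n_1} x_s, \tag{3.42}
\]
\[
\mu_2^{(0)} = \frac{1}{n - n_1} \sum_{s=n_1+1}^{n} x_s, \tag{3.43}
\]
\[
\sigma_{1,2}^{2(0)} = \frac{\sum_{s=1}^{n_1} \bigl(x_s - \mu_1^{(0)}\bigr)^2 + \sum_{s=n_1+1}^{n} \bigl(x_s - \mu_2^{(0)}\bigr)^2}{n - 2}. \tag{3.44}
\]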
When $n$ is odd, the sum in (3.43) and the second sum in (3.44) start from $n_1$ instead of $n_1 + 1$.
Since the individual observations are not available in our context, a data mapping like the one presented in section 3.2.1 is proposed, in which the midpoint of each interval is repeated according to the number of observations in that interval.
A similar procedure was proposed by Schader and Schmidt (1988) for a single Gaussian distribution. They also recommended using the midpoint of each interval and suggested $\mu^{(0)} = \frac{1}{n} \sum_{i=1}^{m} n_i \bar a_i$ as an initial value for the mean of the Gaussian distribution. Their idea is adapted here, and with the resulting data set the starting values for $\mu_j$ and $\sigma_j^2$, $j = 1, 2$, can be calculated from (3.42), (3.43), and (3.44). This technique is denoted by S1.
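The mapping and the split-sample formulas can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: the function and argument names (`expand_grouped`, `start_s1`, `bounds`, `counts`) are hypothetical, and for odd $n$ the sketch divides the second sum by the number of terms actually summed, which is an assumption about how the shared middle observation is handled.

```python
from typing import List, Sequence, Tuple


def expand_grouped(bounds: Sequence[float], counts: Sequence[int]) -> List[float]:
    """Map grouped data to pseudo-observations.

    Each interval midpoint is repeated as often as the interval has
    observations. `bounds` (m + 1 increasing limits) and `counts`
    (m frequencies) are hypothetical argument names.
    """
    if len(bounds) != len(counts) + 1:
        raise ValueError("need exactly one more bound than counts")
    xs: List[float] = []
    for lo, hi, n_i in zip(bounds[:-1], bounds[1:], counts):
        xs.extend([(lo + hi) / 2.0] * n_i)
    return xs


def start_s1(xs: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mu1, mu2, common variance) in the spirit of (3.42)-(3.44)."""
    xs = sorted(xs)
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three observations")
    n1 = (n + 1) // 2  # n/2 for even n, (n+1)/2 for odd n
    first = xs[:n1]
    # For odd n the second part starts at x_{n1}, so the middle value is shared.
    second = xs[n1:] if n % 2 == 0 else xs[n1 - 1:]
    mu1 = sum(first) / len(first)
    # Assumption: divide by the number of terms actually summed.
    mu2 = sum(second) / len(second)
    ss = sum((x - mu1) ** 2 for x in first) + sum((x - mu2) ** 2 for x in second)
    return mu1, mu2, ss / (n - 2)
```

For example, `start_s1(expand_grouped([0, 1, 2, 3, 4], [2, 2, 2, 2]))` splits the eight repeated midpoints into two halves of four and computes the two half-sample means together with the pooled variance.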
Starting Value Technique S2
This approach is adapted from Böhning et al. (1994) and denoted by S2. They used equal mixing proportions, too, and made the following suggestions for the component means
\[
\mu_1^{(0)} = x_1 + \tfrac{1}{2}, \tag{3.45}
\]
\[
\mu_2^{(0)} = x_n - \tfrac{1}{2}, \tag{3.46}
\]
“[..] since well separated values have often turned out to be a good strategy for avoiding local maxima which are not global ones” (p. 381). Although they did not make further specifications on the choice of the initial values for the variance, Karlis and Xekalaki (2003) proposed to use the following values
\[
\sigma_{1,2}^{2(0)} = \bar\sigma^2 - \frac{\bigl(\mu_1^{(0)} - \bar x\bigr)^2 + \bigl(\mu_2^{(0)} - \bar x\bigr)^2}{2}, \tag{3.47}
\]
where $\bar x$ and $\bar\sigma^2$ are the mean and the variance of the sample of individual observations. To obtain individual observations, the aforementioned mapping is suggested. Note that if the variance estimate $\sigma_{1,2}^{2(0)}$ turns out to be negative, $\bar\sigma^2/2$ is used instead. This procedure is equivalent to the one used by Karlis and Xekalaki (2003).
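A minimal Python sketch of S2 follows, assuming the grouped data have already been mapped to repeated midpoints. The function name `start_s2` is hypothetical, and the sample variance uses the divisor $n - 1$, which is an assumption since the text does not state the divisor.

```python
from typing import Sequence, Tuple


def start_s2(xs: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mu1, mu2, common variance) following (3.45)-(3.47)."""
    xs = sorted(xs)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(xs) / n
    # Assumption: sample variance with divisor n - 1.
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    mu1 = xs[0] + 0.5   # smallest observation plus 1/2
    mu2 = xs[-1] - 0.5  # largest observation minus 1/2
    between = ((mu1 - mean) ** 2 + (mu2 - mean) ** 2) / 2.0
    var0 = var - between
    if var0 < 0:
        var0 = var / 2.0  # fallback named in the text for a negative value
    return mu1, mu2, var0
```

The fallback branch is reached when the two starting means lie far apart relative to the overall spread, for instance for a sample concentrated around its centre with a single extreme value on each side.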
Starting Value Technique S3
The following technique, which is based on the range of the sample and was likewise considered by Karlis and Xekalaki (2003), is denoted by S3. Starting again with equal mixing proportions, the initial values for the component means are now
\[
\mu_1^{(0)} = \min_s (x_s) + \frac{d}{3}, \tag{3.48}
\]
\[
\mu_2^{(0)} = \min_s (x_s) + \frac{2d}{3}, \tag{3.49}
\]
where $s = 1, \dots, n$ and $d$ is the range of the sample. Once again, this approach requires individual observations; hence, the previously suggested mapping is applied here, too. The starting values for the component standard deviations are chosen to be
\[
\sigma_1 = \sigma_2 = \tfrac{1}{2} \cdot \bar\sigma, \tag{3.50}
\]
where $\bar\sigma$ is the standard deviation of the transformed sample.
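The range-based rule can be sketched as follows in Python, again assuming the data have already been mapped to repeated midpoints. The name `start_s3` is hypothetical and the standard deviation uses the divisor $n - 1$ as an assumption.

```python
import math
from typing import Sequence, Tuple


def start_s3(xs: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mu1, mu2, common standard deviation) following (3.48)-(3.50)."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    lo = min(xs)
    d = max(xs) - lo  # range of the sample
    mean = sum(xs) / n
    # Assumption: sample standard deviation with divisor n - 1.
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    mu1 = lo + d / 3.0
    mu2 = lo + 2.0 * d / 3.0
    return mu1, mu2, sd / 2.0
```

The two means sit at one third and two thirds of the sample range, so they always lie inside the observed data.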
Starting Value Technique S4
The fourth considered approach was proposed by Seidel et al. (2000). This technique also involves the assumption that the mixing proportions are equal. The initial values for the component means are now
\[
\mu_1 = \bar x - \tfrac{1}{2} \cdot \bar\sigma, \tag{3.51}
\]
\[
\mu_2 = \bar x + \tfrac{1}{2} \cdot \bar\sigma, \tag{3.52}
\]
where $\bar x$ and $\bar\sigma$ are the mean and the standard deviation of the sample. Since individual observations are required here as well, a data mapping as described earlier is suggested. The starting values for the component variances equal the variance of the whole transformed sample. This technique is denoted by S4.
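As a rough Python sketch under the same assumptions as before (data already mapped to repeated midpoints, divisor $n - 1$ for the variance, hypothetical function name `start_s4`), S4 reduces to:

```python
import math
from typing import Sequence, Tuple


def start_s4(xs: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mu1, mu2, common variance) following (3.51)-(3.52)."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(xs) / n
    # Assumption: sample variance with divisor n - 1.
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    sd = math.sqrt(var)
    # Means are placed half a standard deviation on either side of the mean;
    # both components start with the variance of the whole sample.
    return mean - sd / 2.0, mean + sd / 2.0, var
```

Because the means are placed symmetrically around the overall mean, the starting point carries no information about skewness in the sample.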
Starting Value Technique S5
Finally, a new technique denoted by S5 is proposed. Once again, a data mapping is essential. This technique is the only one that uses estimated values for the mixing proportions. To this end, the midpoint of the sample range $d$ is determined first, which is
\[
\bar d := \min_s (x_s) + d/2. \tag{3.53}
\]
Then, the number of observations in the first part of the sample, denoted by $N_1$, divided by the sample size $n$ is chosen as the starting value for the first mixing proportion. Analogously, the number of observations in the second part, denoted by $N_2$, divided by $n$ yields a first guess for the second mixing proportion. In summary,
\[
\pi_1 = N_1 / n, \tag{3.54}
\]
\[
\pi_2 = N_2 / n. \tag{3.55}
\]
The starting values for the component means are calculated in a way similar to that suggested by Böhning et al. (1994). The authors used the smallest observation and added 0.5 to obtain a starting value for the first component mean. Since the minimum of the sample may be an outlier, which would make it an inappropriate starting value, the proposed technique instead considers the 1% quantile of the sample. The second component mean is calculated in the same manner, but using the 99% quantile. The component means are
\[
\mu_1 = q_{0.01} \quad \text{and} \quad \mu_2 = q_{0.99}. \tag{3.56}
\]
The starting value for the common component variance is determined by the approach of Finch et al. (1989) given in equation (3.44).
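The mixing proportions and means of S5 can be sketched in Python as below. Several details are assumptions rather than statements from the text: the split point $\bar d$ is taken to separate the two parts of the sample, observations equal to $\bar d$ are counted towards the first part, and the quantiles use the nearest-rank definition. The names `start_s5` and `nearest_rank_quantile` are hypothetical, and the common variance is left out of the sketch.

```python
import math
from typing import Sequence, Tuple


def nearest_rank_quantile(sorted_xs: Sequence[float], p: float) -> float:
    """Nearest-rank quantile (an assumed definition; the text names none)."""
    rank = max(1, math.ceil(p * len(sorted_xs)))
    return sorted_xs[rank - 1]


def start_s5(xs: Sequence[float]) -> Tuple[float, float, float, float]:
    """Return (pi1, pi2, mu1, mu2) following (3.53)-(3.56)."""
    xs = sorted(xs)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one observation")
    lo = xs[0]
    d = xs[-1] - lo        # range of the sample
    d_bar = lo + d / 2.0   # midpoint of the range, (3.53)
    # Assumption: observations equal to d_bar count towards the first part.
    n_first = sum(1 for x in xs if x <= d_bar)
    n_second = n - n_first
    pi1 = n_first / n
    pi2 = n_second / n
    mu1 = nearest_rank_quantile(xs, 0.01)
    mu2 = nearest_rank_quantile(xs, 0.99)
    return pi1, pi2, mu1, mu2
```

With few distinct midpoints, the 1% and 99% quantiles often coincide with the smallest and largest midpoint, so the protection against outliers only takes effect for larger samples.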
In the following, the results of the simulation studies are presented. One investigation (subsection 4.7) compares these five techniques against each other, and a recommendation is given for different mixture models. Additionally, the previously presented methods for estimating the parameters of a Gaussian mixture are compared. First, the true values are used as starting values; subsequently, starting values obtained by techniques S1 to S5 are applied.
4 Simulation Comparison
In this chapter, the procedure and the results of the simulation study are outlined. Starting in section 4.1 with the description of the implementation, the chapter turns to the specification of the comparison criteria, followed by the presentation of the investigated situations and mixture models. For convenience, the simulation study is presented for Gaussian mixtures with two components. Since the $(\pi_j)_{j=1,2}$ sum to one, one of them is redundant. Hence, $\pi_2$ will not be specified and $\Psi = (\pi_1, \mu_1, \mu_2, \sigma_1, \sigma_2)$.
The main simulation results are shown in sections 4.2, 4.3, and 4.4, where the aforementioned mixture models are considered. A variation of the sample size can be found in section 4.6. Section 4.7 addresses the problem of different initial values by presenting results for starting values obtained by the techniques proposed in the previous section. Finally, the influence of the interval width on the estimation is investigated in section 4.8.