The robust version of COSA-KNN proved to be faster (Section 3.1.1) and better at capturing the clustering structure (Section 3.1.2) than its non-robust version. However, we have seen that when the structure in the data is subtle, a method for choosing the values of λ and K is required to obtain good performance from COSA-KNN. In this section we work with the robust version of COSA-KNN and propose to find the values of the tuning parameters via a procedure based on the Gap statistic (Tibshirani et al., 2001). The Gap statistic was originally developed for selecting the number of clusters in standard K-means clustering algorithms. However, the procedure has also been applied successfully by Witten and Tibshirani (2010) and Arias-Castro and Pu (2017) to closely related tuning problems.
The Gap statistic compares the value of the criterion obtained on the observed data with the expected value of the criterion for an appropriate null reference model (Tibshirani et al., 2001). Define the Gap statistic as

$$\mathrm{Gap}_{\log}(\lambda, K) = \mathrm{E}_{\tilde{Q}^{\circ}}\left[\log \tilde{Q}^{\circ}(\lambda, K)\right] - \log \tilde{Q}(\lambda, K), \tag{3.17}$$

where $\mathrm{E}_{\tilde{Q}^{\circ}}\left[\log \tilde{Q}^{\circ}(\lambda, K)\right]$ denotes the expectation of the criterion for data sets that originate from an appropriate null reference model. We estimate the expectation by taking the average of B copies of $\log \tilde{Q}^{\circ}(\lambda, K)$, where each copy is computed from a Monte Carlo sample drawn from the reference distribution.

We draw samples from the reference distribution for $\log \tilde{Q}^{\circ}(\lambda, K)$ by computing the COSA-KNN criterion on B permuted data sets $\mathbf{X}^{\circ}_{1}, \ldots, \mathbf{X}^{\circ}_{B}$. Each permuted data set is generated by independent permutations of the observations within each attribute. As a result, attributes that are correlated in the original data become uncorrelated in the permuted data sets. Thus, the Gap statistic quantifies the strength of the clustering obtained on the real data set, compared to the clustering result obtained from data sets that do not contain any clustering structure. Finally, we select the values of λ and K for which the Gap is largest, or the smallest Gap that is within one standard error (1SE) of the largest Gap and corresponds to higher values of λ and K.
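The permutation scheme described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `permute_within_attributes` and `reference_datasets` are hypothetical, and the data are assumed to be stored as an N × p NumPy array with one column per attribute.

```python
import numpy as np

def permute_within_attributes(X, rng):
    """Return a copy of X in which every column (attribute) is shuffled
    independently of the other columns."""
    X = np.asarray(X)
    X_perm = np.empty_like(X)
    for j in range(X.shape[1]):
        # Each attribute gets its own permutation, which breaks the
        # dependence between attributes but keeps each marginal intact.
        X_perm[:, j] = rng.permutation(X[:, j])
    return X_perm

def reference_datasets(X, B, seed=0):
    """Generate B permuted copies of X to serve as the null reference sample."""
    rng = np.random.default_rng(seed)
    return [permute_within_attributes(X, rng) for _ in range(B)]
```

Each permuted copy keeps the marginal distribution of every attribute exactly, so any difference in the criterion between the observed and the permuted data can be attributed to the joint structure across attributes.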
3.3.1 1SE Rule and the Simpler COSA-KNN Model
Although the one standard error (1SE) rule was originally proposed for a different tuning parameter problem (Breiman, Friedman, Stone, & Olshen, 1984), its purpose remains similar in the Gap statistic procedure. With the 1SE rule we wish to choose tuning parameter values of a simpler model that would still be comparable to the optimal model.
We define the COSA-KNN model to become simpler for higher values of λ or larger values of K, which can be explained conceptually as follows. The higher the value of λ, the more difficult it becomes to find unique subsets of attributes for each cluster, and thus the fewer ‘degrees of freedom’ there are for the candidate solutions of the subsets of attributes. Similarly, for larger values of K, the COSA distances have more difficulty (fewer ‘degrees of freedom’) in capturing a clustering structure of smaller clusters. In the most extreme case, where λ → ∞ and K → N − 1, the COSA distances reduce to a ‘simple’ sum over the attribute distances, e.g., the ordinary Manhattan distances. Conversely, we define the COSA model to be more ‘complex’ for smaller λ and smaller K, since it then has more degrees of freedom to represent more complex clustering structures: smaller clusters, each with its own unique subset of important attributes.
Let λ∗ and K∗ be the values of λ and K that correspond to the largest Gap statistic. We then select the smallest Gap statistic that corresponds to λ ≥ λ∗ and K ≥ K∗ and lies within 1SE of the largest Gap statistic. Note that, in general, for these higher values of λ and K the standard error of the criterion will be lower, and therefore the variance of the Gap statistic will also be lower (since these are simpler COSA-KNN models). Another way to look at the 1SE rule is that it acknowledges the error with which the maximum Gap statistic itself is estimated.
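As a minimal sketch of this selection rule, suppose the Gap statistics and their standard errors have already been computed on a grid. The function name `select_one_se` is hypothetical, and two assumptions are made that the text does not state explicitly: the candidate values of λ and K are sorted in increasing order, and the 1SE band uses the standard error at the maximizing pair (λ∗, K∗).

```python
import numpy as np

def select_one_se(gap, se, lambdas, ks):
    """Pick (lambda, K) by the 1SE rule on a grid.

    gap[i, j] and se[i, j] belong to lambdas[i] and ks[j]; both
    lambdas and ks are assumed to be sorted in increasing order.
    """
    gap = np.asarray(gap, dtype=float)
    se = np.asarray(se, dtype=float)
    # Location of the largest Gap statistic.
    i_star, j_star = np.unravel_index(np.argmax(gap), gap.shape)
    # Assumption: the band is one standard error of the maximum itself.
    threshold = gap[i_star, j_star] - se[i_star, j_star]
    best = (i_star, j_star)
    # Only simpler models are eligible: lambda >= lambda*, K >= K*.
    for i in range(i_star, gap.shape[0]):
        for j in range(j_star, gap.shape[1]):
            if gap[i, j] >= threshold and gap[i, j] < gap[best]:
                best = (i, j)
    return lambdas[best[0]], ks[best[1]]
```

When no other eligible grid point falls inside the band, the function simply returns (λ∗, K∗).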
3.3.2 With or Without the Natural Logarithm
The original version of the Gap statistic, in (3.17), is based on the natural logarithm. Mohajer, Englmeier, and Schmid (2011) corroborated the findings of Dudoit and Fridlyand (2002) that this particular use of the logarithm causes the procedure to prefer more complex models (by overestimating the number of clusters), compared to the Gap statistic that is not based on the natural logarithm, i.e.

$$\mathrm{Gap}(\lambda, K) = \mathrm{E}_{\tilde{Q}^{\circ}}\left[\tilde{Q}^{\circ}(\lambda, K)\right] - \tilde{Q}(\lambda, K). \tag{3.18}$$

Moreover, Mohajer et al. (2011) proved, for criteria that have a monotone relationship with their tuning parameters, that there are situations in which it becomes impossible to find the optimal Gap statistic with the original procedure. Their proof also shows that whenever the original $\mathrm{Gap}_{\log}(\lambda, K)$ yields a solution, this solution is always a possible solution with $\mathrm{Gap}(\lambda, K)$ as well, but the reverse is not necessarily true. However, it is not clear whether the proof of Mohajer et al. (2011) also holds for $\tilde{Q}^{\circ}(\lambda, K)$.
The motivation that Tibshirani et al. (2001) provide for taking the logarithm seems to be based solely on interpretability from likelihood theory. In cases where the Gap statistic is applied to results from a K-means clustering algorithm, the Gap statistic behaves as a likelihood-ratio statistic based on mixture models (e.g., Scott and Symons, 1971). However, this is not an advantage we know how to exploit for COSA, nor does the logarithm provide us with computational advantages. Nevertheless, we will compare both versions of the Gap statistic: with and without the logarithm, as in equations (3.17) and (3.18), respectively.
3.3.3 The Algorithm for the Gap Statistic Procedure
For computing the Gap statistic with COSA-KNN, the following steps are required:

1. Compute the criterion obtained by performing COSA-KNN on the data $\mathbf{X}$ for each candidate combination of the tuning parameter values K and λ.

2. Obtain permuted data sets $\mathbf{X}^{\circ}_{1}, \ldots, \mathbf{X}^{\circ}_{B}$ by independently permuting the observations within each attribute.

(a) For b = 1, 2, ..., B, compute $\log \tilde{Q}^{\circ}_{b}(\lambda, K)$, the criterion obtained by performing COSA-KNN with candidate tuning parameter values λ and K on the data $\mathbf{X}^{\circ}_{b}$.

(b) Compute

$$\mathrm{Gap}_{\log}(\lambda, K) = \frac{1}{B} \sum_{b=1}^{B} \log \tilde{Q}^{\circ}_{b}(\lambda, K) - \log \tilde{Q}(\lambda, K), \tag{3.19}$$

or

$$\mathrm{Gap}(\lambda, K) = \frac{1}{B} \sum_{b=1}^{B} \tilde{Q}^{\circ}_{b}(\lambda, K) - \tilde{Q}(\lambda, K). \tag{3.20}$$

3. Choose λ∗ and K∗ corresponding to the largest value of Gap(λ, K). Then, choose the simplest COSA-KNN model (smallest standard error) that is within one standard error of the value of Gap(λ∗, K∗). The standard error is computed such that it additionally accounts for the simulation error, i.e.

$$\mathrm{se}_{\log(\tilde{Q}^{\circ}(\lambda, K))} = \sqrt{\left(1 + \frac{1}{B}\right) \mathrm{VAR}_{b}\left[\log \tilde{Q}^{\circ}_{b}(\lambda, K)\right]}, \tag{3.21}$$

for $\mathrm{Gap}_{\log}(\lambda, K)$, or

$$\mathrm{se}_{\tilde{Q}^{\circ}(\lambda, K)} = \sqrt{\left(1 + \frac{1}{B}\right) \mathrm{VAR}_{b}\left[\tilde{Q}^{\circ}_{b}(\lambda, K)\right]}, \tag{3.22}$$

for $\mathrm{Gap}(\lambda, K)$.
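The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used here: `gap_statistics` and `gap_grid` are hypothetical names, the COSA-KNN criterion is passed in as a user-supplied callable `criterion(X, lam, k)` that returns a positive number, and VAR_b is taken to be the population variance (divisor B), which the text does not specify. The function `toy_criterion` is only a stand-in to make the example runnable; it is not COSA-KNN and ignores λ.

```python
import numpy as np

def gap_statistics(q_obs, q_perm):
    """Equations (3.19)-(3.22) for one (lambda, K) pair.

    q_obs: criterion on the observed data; q_perm: B criterion values
    obtained on the permuted data sets."""
    q_perm = np.asarray(q_perm, dtype=float)
    inflate = np.sqrt(1.0 + 1.0 / q_perm.size)  # accounts for simulation error
    return {
        "gap_log": np.mean(np.log(q_perm)) - np.log(q_obs),  # (3.19)
        "gap": np.mean(q_perm) - q_obs,                      # (3.20)
        "se_log": inflate * np.std(np.log(q_perm)),          # (3.21), ddof=0 assumed
        "se": inflate * np.std(q_perm),                      # (3.22), ddof=0 assumed
    }

def gap_grid(X, criterion, lambdas, ks, B=20, use_log=False, seed=0):
    """Evaluate the Gap statistic and its standard error on a grid."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Step 2: permute the observations within each attribute, B times.
    perms = [np.column_stack([rng.permutation(col) for col in X.T])
             for _ in range(B)]
    gap = np.empty((len(lambdas), len(ks)))
    se = np.empty_like(gap)
    suffix = "_log" if use_log else ""
    for i, lam in enumerate(lambdas):
        for j, k in enumerate(ks):
            q_perm = [criterion(Xp, lam, k) for Xp in perms]   # step 2(a)
            stats = gap_statistics(criterion(X, lam, k), q_perm)  # steps 1, 2(b)
            gap[i, j] = stats["gap" + suffix]
            se[i, j] = stats["se" + suffix]
    return gap, se

def toy_criterion(X, lam, k):
    """Stand-in criterion (NOT COSA-KNN): mean Manhattan distance from
    each point to its k nearest neighbours. lam is ignored."""
    d = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
    nearest = np.sort(d, axis=1)[:, 1:k + 1]  # column 0 is the self-distance
    return float(nearest.mean())
```

The same permuted data sets are reused for every (λ, K) pair, so differences across the grid are not driven by different Monte Carlo draws; whether to reuse or redraw the permutations is a design choice that the text leaves open.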
In what follows, the term ‘Gap statistic’ may have two meanings: either it refers to the statistic computed in equation (3.20), or it will be clear from the context that we refer to the Gap statistic in general, i.e., to both versions. Moreover, we refer to the Gap statistic computed in equation (3.19) as the $\mathrm{Gap}_{\log}$ statistic.