Inspired by the eGK and eGKL clustering methods proposed in [FG10 and DS11], we propose using a mutual information formulation of the Mahalanobis distance as the core similarity measure of the new clustering algorithm. The thresholds used to evaluate the degree of similarity in cluster assignment, creation, and merging are unified through the inverse of the Chi-square distribution, with a fixed dimensionality but different probability levels. The major steps of the algorithm are summarized as follows:
1. Determine the thresholds $\chi^2_{n,\beta}$ and $\chi^2_{n,\beta_M}$, the values of the inverse Chi-square cumulative distribution function (CDF) that are applied in the decisions on cluster assignment or creation and on cluster merging, respectively. Table 4 shows the values for a fixed probability β = 0.95 and n ranging from 2 to 10. The dimensionality of the data, n, should also be specified at this step: the data pre-processing procedures should already have been identified, and that part of the calculation is outside the scope of the clustering algorithm.
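Table 4: $\chi^2_{n,\beta}$ values for dimensionality n = 2 to 10 with β = 0.95

| n | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| $\chi^2_{n,\beta}$ | 5.991 | 7.815 | 9.488 | 11.070 | 12.592 | 14.067 | 15.507 | 16.919 | 18.307 |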
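The thresholds in Step 1 can be reproduced numerically. The following is a minimal sketch, not the authors' implementation: the function names `chi2_cdf` and `chi2_inv` are hypothetical, and the inverse CDF is obtained by bisection on a series expansion of the regularized incomplete gamma function so that only the standard library is needed. In practice a statistics library routine such as `scipy.stats.chi2.ppf` would normally be used instead.

```python
import math


def chi2_cdf(x, n):
    """Chi-square CDF with n degrees of freedom, i.e. P(n/2, x/2)."""
    if x <= 0.0:
        return 0.0
    a, z = n / 2.0, x / 2.0
    # Series: P(a, z) = z^a e^{-z} / Gamma(a + 1) * sum_j z^j / ((a+1)...(a+j))
    term, total, j = 1.0, 1.0, 0
    while term > 1e-16 * total:
        j += 1
        term *= z / (a + j)
        total += term
    return total * math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))


def chi2_inv(beta, n):
    """Inverse Chi-square CDF: the x for which chi2_cdf(x, n) equals beta."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, n) < beta:  # grow the bracket until it contains the root
        hi *= 2.0
    for _ in range(100):  # plain bisection; the CDF is monotone in x
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, n) < beta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Thresholds for beta = 0.95 and n = 2..10, as in Table 4
thresholds = {n: chi2_inv(0.95, n) for n in range(2, 11)}
```

The merging threshold would be obtained the same way by passing the merging probability level β_M in place of β; its value is not specified in this section.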
2. Determine the default initial inverse covariance matrix $\Sigma_0^{-1}$ by specifying the parameter γ (for example, a default value of γ = 100):
$$\Sigma_0^{-1} = \gamma I \qquad (19)$$
The value of γ needs to be chosen carefully, depending on whether or not the data is treated with a predefined normalization procedure. If γ is too small (equivalently, the diagonal entries of the initial covariance matrix are large), the algorithm may never produce additional clusters, because every newly created cluster, starting with the first, has an overly extensive coverage. Conversely, if γ is too large, every newly created cluster, starting with the first, is overly tight, and in the worst case every data point results in a new cluster.
3. Set C = 0 and read the first data point $x_1$. C is the total number of clusters generated so far; before the first data point is processed, this number should be zero.
4. Initialize the first cluster using the first data point and the default initial inverse covariance matrix:
$$\mu_1 = x_1; \quad \Sigma_1^{-1} = \Sigma_0^{-1} = \gamma I; \quad C = C + 1; \quad \det\Sigma_1 = \det(\Sigma_0); \quad N_1 = 1$$
Note that $N_i$ represents the number of data points assigned to the i-th cluster so far. Since the first data point is used to initialize the first cluster, $N_1$ is set to 1 as part of the initialization.
5. Read the next data point $x_k$ and calculate the similarity measure using the mutual information formulation of the Mahalanobis distance. This is done by assigning a temporary inverse covariance matrix to the current data point, $\Sigma_{k,temp}^{-1} = \gamma I$, to facilitate the calculations described in (MM1 and 2). In other words, given cluster i, $\aleph_i$, and the temporary cluster, $\aleph_{k,temp}$ (written $\aleph_k$ in the formulas), we can compute the mutual information based Mahalanobis distance through a temporary mixture distribution $\aleph_M$. For the temporary cluster associated with $x_k$, the centroid is simply $x_k$ and the inverse covariance is set to $\gamma I$.
$$D_{i,k}(\aleph_i \| \aleph_k) = \tfrac{1}{2} D_M(\aleph_i \| \aleph_M) + \tfrac{1}{2} D_M(\aleph_k \| \aleph_M), \quad i = 1, \dots, C$$
where
$$\aleph_M = \tfrac{1}{2}\aleph_i + \tfrac{1}{2}\aleph_k$$
and $D_M$ is the original Mahalanobis distance, with $\aleph_0$ standing for either component:
$$D_M(\aleph_0 \| \aleph_M) = \sqrt{(\mu_0 - \mu_M)^T \Sigma_M^{-1} (\mu_0 - \mu_M)}$$
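The distance in Step 5 can be sketched as follows, assuming NumPy is available. The mixture moments are defined by (MM1 and 2), which this section does not restate, so the sketch takes them from a caller-supplied function. The `averaged_moments` helper is a hypothetical placeholder that simply averages the means and covariances; the actual mixture moments may contain additional terms and should be substituted. All function names are illustrative.

```python
import numpy as np


def mahalanobis(mu_0, mu_m, cov_m):
    """Original Mahalanobis distance of mu_0 from the mixture (mu_m, cov_m)."""
    d = mu_0 - mu_m
    # Solve cov_m * y = d instead of forming the inverse explicitly
    return float(np.sqrt(d @ np.linalg.solve(cov_m, d)))


def averaged_moments(mu_a, cov_a, mu_b, cov_b):
    """PLACEHOLDER (assumption): stands in for (MM1 and 2), not given here."""
    return 0.5 * (mu_a + mu_b), 0.5 * (cov_a + cov_b)


def mi_distance(mu_i, cov_i, mu_k, cov_k, moments=averaged_moments):
    """D_{i,k} = 1/2 D_M(N_i || N_M) + 1/2 D_M(N_k || N_M)."""
    mu_m, cov_m = moments(mu_i, cov_i, mu_k, cov_k)
    return (0.5 * mahalanobis(mu_i, mu_m, cov_m)
            + 0.5 * mahalanobis(mu_k, mu_m, cov_m))


# Example: a freshly initialized cluster and a temporary cluster for x_k,
# both with inverse covariance gamma * I (covariance I / gamma), gamma = 100
gamma = 100.0
cov_0 = np.eye(2) / gamma
mu_1 = np.array([0.0, 0.0])
x_k = np.array([0.2, 0.0])
d_1k = mi_distance(mu_1, cov_0, x_k, cov_0)
```

Because the mixture is equally weighted, the measure is symmetric in its two arguments, and it is zero when the two centroids coincide.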
6. Identify the active (closest) cluster p with respect to the current data point $x_k$, that is, the cluster with the smallest distance $D_{i,k}$.
7. Determine whether the current data point $x_k$ should be assigned to the active cluster, which triggers the subsequent updates. The p-th cluster obtained from Step 6 warrants the assignment if
$$D_{p,k} < \chi^2_{n,\beta}$$
7.a. Update the active cluster if $x_k$ is successfully assigned to the p-th cluster. The update of the p-th cluster is straightforward and follows (4 - 7). Go to Step 8.
7.b. Create a new cluster if $x_k$ is not assigned to the p-th cluster. The creation of a new cluster is identical to Step 4: the new cluster centroid is set to $x_k$, the initial inverse covariance matrix and the determinant of the covariance matrix are set to their default values, the initial cluster size is set to 1, and C is incremented to C + 1 to account for the newly added cluster. After this step, Step 8 is skipped and the algorithm continues with Step 9.
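The decision made in Steps 6 and 7 can be illustrated with a small sketch. It assumes the distances from the current data point to all existing clusters have already been computed; the function name `assign_or_create` and the returned labels are hypothetical, and the cluster update (4 - 7) and the creation of the new cluster are left to the caller.

```python
def assign_or_create(distances, threshold):
    """Choose between assigning x_k to the closest cluster and creating one.

    distances[i] holds D_{i,k} for existing cluster i (at least one cluster
    exists after Step 4); threshold is the Chi-square assignment threshold.
    """
    # Step 6: the active cluster p is the one closest to x_k
    p = min(range(len(distances)), key=lambda i: distances[i])
    # Step 7: assign only if the closest cluster is within the threshold
    if distances[p] < threshold:
        return "assign", p      # 7.a: update cluster p, then go to Step 8
    return "create", None       # 7.b: start a new cluster, skip Step 8


decision_near = assign_or_create([7.2, 3.1, 9.0], 5.991)  # ("assign", 1)
decision_far = assign_or_create([7.2, 6.5], 5.991)        # ("create", None)
```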
8. Determine whether the active cluster should be merged with an existing cluster. This involves computing the similarity between the active cluster and each of the remaining clusters:
$$D_{q,p}(\aleph_q \| \aleph_p) = \tfrac{1}{2} D_M(\aleph_q \| \aleph_M) + \tfrac{1}{2} D_M(\aleph_p \| \aleph_M), \quad q = 1, \dots, C, \; q \neq p$$
where
$$\aleph_M = \tfrac{1}{2}\aleph_q + \tfrac{1}{2}\aleph_p$$
and $D_M$ is the original Mahalanobis distance:
$$D_M(\aleph_0 \| \aleph_M) = \sqrt{(\mu_0 - \mu_M)^T \Sigma_M^{-1} (\mu_0 - \mu_M)}$$
The top candidate for merging with the active cluster is determined by the following criterion:
$$g = \arg\min_{q} D_{q,p}, \quad q = 1, \dots, C, \; q \neq p$$
Whether merging takes place is then decided by comparing the similarity value between the closest cluster g and the active cluster p against the merging threshold:
$$D_{g,p} < \chi^2_{n,\beta_M}$$
8.a. If the above criterion is met, the active cluster p and the closest cluster g are combined using formulas (NRV 2 – 3). Since this reduces the total number of clusters by 1, C must be modified accordingly:
$$C = C - 1$$
If the merged cluster takes the index previously used by the active cluster p, its data count must also be updated:
$$N_{p,new} = N_{p,old} + N_g$$
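The bookkeeping of the merging step can be sketched as below. This is only an illustration under stated assumptions: the function name `merge_step` and the dictionary layout are hypothetical, the distances to the active cluster are assumed to be precomputed, and the merging of the centroids and covariance matrices via (NRV 2 – 3) is deliberately omitted because those formulas are not restated in this section.

```python
def merge_step(p, distances_to_p, counts, merge_threshold):
    """Merge the closest cluster g into the active cluster p if close enough.

    distances_to_p maps each cluster index q != p to D_{q,p};
    counts maps each cluster index to its data count N.
    Returns the index g that was merged into p, or None if no merge occurred.
    """
    if not distances_to_p:      # only one cluster exists, nothing to merge
        return None
    # Top merging candidate: g = arg min_q D_{q,p}
    g = min(distances_to_p, key=distances_to_p.get)
    if distances_to_p[g] >= merge_threshold:
        return None
    # The merged cluster keeps index p: N_p,new = N_p,old + N_g.
    # Removing g from the dictionary lowers C = len(counts) by 1.
    counts[p] = counts[p] + counts.pop(g)
    return g


counts = {0: 5, 1: 3, 2: 8}
merged = merge_step(1, {0: 4.0, 2: 9.5}, counts, 7.815)
total_clusters = len(counts)    # C after the merge
```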
9. The processing of the current data point $x_k$ is now complete, and the procedure continues with the next data point: go to Step 5.