
Inspired by the eGK and eGKL clustering methods proposed in [FG10 and DS11], we propose the Mutual Information formulation of the Mahalanobis distance as the core similarity measure for the new clustering algorithm. The thresholds used to evaluate the degree of similarity during cluster assignment, creation and merging are treated in a unified way: each is an inverse of the Chi-square distribution with a fixed dimensionality but a different probability level. The major steps of the algorithm are summarized as follows:

1. Determine the thresholds $\chi^2_{n,\beta}$ and $\chi^2_{n,\beta_M}$, both values of the inverse Chi-square cumulative distribution function (CDF), which are applied in the decisions on cluster assignment or creation and on cluster merging, respectively. Table 4 lists the values for a fixed probability β = 0.95 with n ranging from 2 to 10. The dimensionality of the data, n, must also be specified at this step; the data pre-processing procedures should already have been identified, and that portion of the calculation is outside the scope of the clustering algorithm.

Table 4: $\chi^2_{n,\beta}$ values for dimensionality n = 2 to 10 with β = 0.95

n                    2      3      4      5       6       7       8       9       10
$\chi^2_{n,\beta}$   5.991  7.815  9.488  11.070  12.592  14.067  15.507  16.919  18.307
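The thresholds in Table 4 can be reproduced numerically to within about 0.001. The following is a minimal sketch using only the Python standard library: it evaluates the Chi-square CDF through the series expansion of the regularized lower incomplete gamma function and inverts it by bisection. The function names `chi2_cdf` and `chi2_inv` are illustrative and do not come from the original work; in practice a statistics library would normally provide the inverse CDF directly.

```python
import math

def chi2_cdf(x, n, terms=500):
    """Chi-square CDF with n degrees of freedom, computed as the regularized
    lower incomplete gamma function P(n/2, x/2) via its power series."""
    if x <= 0.0:
        return 0.0
    a, z = n / 2.0, x / 2.0
    term = 1.0 / a              # k = 0 term of sum z^k / (a (a+1) ... (a+k))
    total = term
    for k in range(1, terms):
        term *= z / (a + k)
        total += term
        if term < total * 1e-15:
            break
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))

def chi2_inv(beta, n, tol=1e-10):
    """Inverse Chi-square CDF by bisection (illustrative, not optimized)."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, n) < beta:   # grow the bracket until it contains the root
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, n) < beta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Thresholds for beta = 0.95 and n = 2 ... 10
for n in range(2, 11):
    print(n, round(chi2_inv(0.95, n), 3))
```

The same function can supply the merging threshold by passing a different probability level in place of 0.95.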

2. Determine the default initial inverse covariance matrix $\Sigma_0^{-1}$ by specifying the parameter γ (for example, a default value of γ = 100):

$\Sigma_0^{-1} = \gamma I$    (19)

The value of γ needs to be chosen carefully, depending on whether or not the data has been treated with a predefined normalization procedure. If γ is too small (large values on the diagonal of the original covariance matrix), the coverage of a newly created cluster, starting from the 1st cluster, is overly extensive and the algorithm may not produce any additional clusters. If, on the other hand, γ is too large, each newly created cluster, starting from the 1st cluster, is overly tight, and in the worst case every data point results in a new cluster.

3. Set C = 0 and read the 1st data $x_1$. C is the total number of clusters generated so far; before the 1st data is processed, this number is zero.
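The influence of γ described in Step 2 can be illustrated with the plain Mahalanobis distance. When the inverse covariance is $\gamma I$, the distance reduces to $\sqrt{\gamma}\,\lVert x - \mu \rVert$, so the same Euclidean offset appears further away as γ grows. The sketch below only illustrates this scaling; the function name and the sample values are hypothetical.

```python
import math

def mahalanobis_isotropic(x, mu, gamma):
    """Mahalanobis distance when the inverse covariance matrix is gamma * I."""
    squared = sum((a - b) ** 2 for a, b in zip(x, mu))
    return math.sqrt(gamma * squared)

x, mu = [0.3, 0.4], [0.0, 0.0]      # Euclidean distance of 0.5
for gamma in (1.0, 100.0, 10000.0):
    # Larger gamma -> larger distance -> tighter cluster coverage
    print(gamma, mahalanobis_isotropic(x, mu, gamma))
```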

4. Initialize the 1st cluster using the 1st data and the default initial inverse covariance matrix:

$\mu_1 = x_1$;  $\Sigma_1^{-1} = \Sigma_0^{-1} = \gamma I$;  $C = C + 1$;  $\det \Sigma_1 = \det(\Sigma_0)$;  $N_1 = 1$

Note that $N_i$ represents the number of data assigned to the i'th cluster so far. Since the 1st data is used to initialize the 1st cluster, $N_1$ is set to 1 at this step as part of the initialization.
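The initialization in Steps 3 and 4 can be sketched as follows, assuming NumPy is available. The `Cluster` record and the `new_cluster` helper are hypothetical names introduced for illustration. The determinant uses the fact that $\Sigma_0 = I/\gamma$, so $\det(\Sigma_0) = \gamma^{-n}$.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Cluster:
    mu: np.ndarray        # centroid
    inv_cov: np.ndarray   # inverse covariance matrix
    det_cov: float        # determinant of the covariance matrix
    n: int                # number of data assigned so far (N_i)

def new_cluster(x, gamma=100.0):
    """Create a cluster from a single data point with the default matrices."""
    x = np.asarray(x, dtype=float)
    dim = x.shape[0]
    # Sigma_0^{-1} = gamma * I, hence det(Sigma_0) = gamma ** (-dim)
    return Cluster(mu=x.copy(), inv_cov=gamma * np.eye(dim),
                   det_cov=gamma ** (-dim), n=1)

clusters = []                               # Step 3: C = 0
clusters.append(new_cluster([0.2, 0.7]))    # Step 4: 1st data starts the 1st cluster
C = len(clusters)                           # C = C + 1
print(C, clusters[0])
```

The same helper covers the creation of later clusters, since a new cluster always starts from a single data point and the default matrices.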

5. Read the next data $x_k$ and calculate the similarity measure using the Mutual Information formulation of the Mahalanobis distance. This is done by assigning a temporary inverse covariance matrix to the current data, $\Sigma_{k,temp}^{-1} = \gamma I$, to facilitate the calculations described in (MM1 and 2). In other words, given cluster i, $\aleph_i$, and the temporary cluster, $\aleph_{k,temp}$, we can compute the mutual-information-based Mahalanobis distance through a temporary mixture distribution $\aleph_M$. For the temporary cluster associated with $x_k$, the centroid is simply $x_k$ and the inverse covariance is set to $\gamma I$.

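$D_{i,k}(\aleph_i \| \aleph_k) = \tfrac{1}{2} D_M(\aleph_i \| \aleph_M) + \tfrac{1}{2} D_M(\aleph_k \| \aleph_M)$,  i = [1, C]

where $\aleph_M = \tfrac{1}{2}\aleph_i + \tfrac{1}{2}\aleph_k$

and $D_M$ is the original Mahalanobis distance:

$D_M(\aleph_0 \| \aleph_M) = \sqrt{(\mu_0 - \mu_M)^T \Sigma_M^{-1} (\mu_0 - \mu_M)}$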
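The distance in Step 5 can be sketched in code under one explicit assumption: the mean and covariance of the mixture $\aleph_M$ are obtained here by moment matching of an equal-weight two-component mixture. The text defers these quantities to (MM1 and 2), which are not reproduced in this section, so the actual formulas may differ. The function names are hypothetical and NumPy is assumed.

```python
import numpy as np

def mixture_moments(mu_a, cov_a, mu_b, cov_b):
    """Mean and covariance of the equal-weight mixture.
    ASSUMPTION: moment matching; the source refers to (MM1 and 2) instead."""
    mu_m = 0.5 * (mu_a + mu_b)
    da, db = mu_a - mu_m, mu_b - mu_m
    cov_m = 0.5 * (cov_a + np.outer(da, da)) + 0.5 * (cov_b + np.outer(db, db))
    return mu_m, cov_m

def mahalanobis(mu_0, mu_m, inv_cov_m):
    """Original Mahalanobis distance D_M between a centroid and the mixture."""
    d = mu_0 - mu_m
    return float(np.sqrt(d @ inv_cov_m @ d))

def mi_mahalanobis(mu_i, inv_cov_i, x_k, gamma=100.0):
    """D_{i,k}: average of the two Mahalanobis distances to the mixture."""
    mu_i = np.asarray(mu_i, dtype=float)
    x_k = np.asarray(x_k, dtype=float)
    cov_i = np.linalg.inv(inv_cov_i)
    cov_k = np.eye(len(x_k)) / gamma     # temporary cluster: inverse covariance gamma * I
    mu_m, cov_m = mixture_moments(mu_i, cov_i, x_k, cov_k)
    inv_cov_m = np.linalg.inv(cov_m)
    return (0.5 * mahalanobis(mu_i, mu_m, inv_cov_m)
            + 0.5 * mahalanobis(x_k, mu_m, inv_cov_m))

# Distance from a default cluster at the origin to a nearby and a distant point
inv_cov = 100.0 * np.eye(2)
print(mi_mahalanobis([0.0, 0.0], inv_cov, [0.2, 0.0]))
print(mi_mahalanobis([0.0, 0.0], inv_cov, [2.0, 0.0]))
```

Because the mixture mean lies midway between the two centroids, both terms of the sum are equal in this sketch; that is a consequence of the equal weights, not an additional assumption.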

6. Identify the active (closest) cluster, denoted p, with respect to the current data $x_k$, that is, the cluster with the smallest $D_{i,k}$ over i = [1, C].

7. Determine whether the current data $x_k$ should be assigned to the active cluster, which triggers the subsequent updates. The p'th cluster obtained from Step 6 warrants the assignment if:

$D_{p,k} < \chi^2_{n,\beta}$

7.a. Update the active cluster if $x_k$ is successfully assigned to the p'th cluster. The update of the p'th cluster is straightforward and follows (4 - 7). Go to Step 8.

7.b. Create a new cluster if $x_k$ is not assigned to the p'th cluster. The creation of a new cluster is identical to the description in Step 4: the new cluster centroid is set to $x_k$, the initial inverse covariance matrix and the determinant of the covariance matrix are set to their default values, the initial cluster size is set to 1, and C is incremented to C + 1 to account for the newly added cluster. After this step, Step 8 is skipped and the algorithm continues with Step 9.
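The decision logic of Steps 6 and 7 can be sketched as a small function that takes the already computed distances from the current data to each cluster. The function name, the return convention and the sample distances are illustrative; the threshold 5.991 is the Chi-square value for n = 2 and β = 0.95.

```python
def assign_or_create(distances, threshold):
    """distances[i] holds D_{i,k} for cluster i.
    Returns ("assign", p) or ("create", None)."""
    # Step 6: the active cluster p is the closest one
    p = min(range(len(distances)), key=distances.__getitem__)
    # Step 7: compare against the assignment threshold
    if distances[p] < threshold:
        return "assign", p       # 7.a: update cluster p, then go to Step 8
    return "create", None        # 7.b: start a new cluster, skip Step 8

print(assign_or_create([4.2, 1.3, 9.0], threshold=5.991))   # ('assign', 1)
print(assign_or_create([7.5, 6.4], threshold=5.991))        # ('create', None)
```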

8. Determine whether the active cluster should be merged with an existing cluster. This involves computing the similarity between the active cluster and each of the remaining clusters:

$D_{q,p}(\aleph_q \| \aleph_p) = \tfrac{1}{2} D_M(\aleph_q \| \aleph_M) + \tfrac{1}{2} D_M(\aleph_p \| \aleph_M)$,  q = [1, C], q ≠ p

where $\aleph_M = \tfrac{1}{2}\aleph_q + \tfrac{1}{2}\aleph_p$

and $D_M$ is the original Mahalanobis distance:

$D_M(\aleph_0 \| \aleph_M) = \sqrt{(\mu_0 - \mu_M)^T \Sigma_M^{-1} (\mu_0 - \mu_M)}$

The top candidate for merging with the active cluster is determined by the following criterion:

$g = \arg\min_q D_{q,p}$,  q = [1, C], q ≠ p

Whether merging takes place is then decided by comparing the similarity value between the closest cluster g and the active cluster p against the merging threshold:

$D_{g,p} < \chi^2_{n,\beta_M}$
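The selection of the merge candidate in Step 8 can be sketched in the same style. The function receives the distances from every cluster to the active cluster p and ignores the entry for p itself. The name `merge_candidate` and the numeric threshold in the example are hypothetical.

```python
def merge_candidate(p, distances_to_p, threshold_m):
    """distances_to_p[q] holds D_{q,p}; the entry at index p is ignored.
    Returns the index g of the cluster to merge with p, or None."""
    others = [q for q in range(len(distances_to_p)) if q != p]
    if not others:
        return None                                   # p is the only cluster
    g = min(others, key=lambda q: distances_to_p[q])  # g = arg min D_{q,p}, q != p
    return g if distances_to_p[g] < threshold_m else None

print(merge_candidate(1, [3.0, 0.0, 8.0], threshold_m=4.0))   # 0
print(merge_candidate(1, [6.0, 0.0, 8.0], threshold_m=4.0))   # None
```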

8.a. If the above criterion is met, the active cluster p and the closest cluster g are combined using formula (NRV 2 – 3). Since this reduces the total number of clusters by 1, C is modified accordingly:

$C = C - 1$

If the merged cluster takes the index previously used by the active cluster p, the data count of the merged cluster is also adjusted:

$N_{p,new} = N_{p,old} + N_g$
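The bookkeeping in Step 8.a can be sketched as follows. Only the data counts and the cluster total are handled here; the combined centroid and covariance follow (NRV 2 – 3), which is not reproduced in this section. The list-based storage and the index shift after removing cluster g are assumptions of this sketch, not details given in the text.

```python
def merge_bookkeeping(counts, p, g):
    """counts[i] holds N_i. Cluster g is absorbed into the active cluster p.
    Returns the new list of counts and the new index of the merged cluster."""
    merged = counts[p] + counts[g]                 # N_{p,new} = N_{p,old} + N_g
    new_counts = [n for i, n in enumerate(counts) if i != g]
    new_p = p - 1 if g < p else p                  # removing g shifts later indices down
    new_counts[new_p] = merged
    return new_counts, new_p

counts = [5, 3, 7]                                 # C = 3
counts, p = merge_bookkeeping(counts, p=2, g=0)
print(counts, p, len(counts))                      # [3, 12] 1 2, so C = C - 1
```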

9. At this point the processing of the current data $x_k$ is complete, and the procedure continues with the next data. Go to Step 5.
