Given a set $K = [F_1, \ldots, F_N]$ of $N$ input entries (columns), where $F_i = \{F_i^{t_1}, F_i^{t_2}, \ldots, F_i^{t_M}\}$, each input is either a normal event pattern or a faulty one. Groups of sequences or patterns that exhibit similar characteristics, because similar faults generate similar message types within a sequence or pattern, can be identified through clustering. Hence, to group such similar sequences, we employ distance-based clustering.
For the purpose of faulty sequence detection in event logs, it is important to define a distance metric that could capture the informativeness of the sequence and/or correlation patterns between the events. Two metrics are defined as explained below.
• The Jensen-Shannon Divergence (JSD) metric measures the divergence or similarity between two or more probability distributions. Events in log data are sometimes infrequent and spatial in distribution (they can occur randomly in different sequences). JSD has been shown to be effective in capturing relationships among tokens of such distributions [97], hence we utilised it in this work. Considering two distributions $F_i$, $F_j$ (note that we established earlier that the input vectors are probability distributions), JSD shows how much information is lost when using one of $F_i$ or $F_j$ to approximate the other.
Hence, given two sequences (in this case, the input column vectors) $F_i$, $F_j$, JSD is defined by

$$JSD(F_i, F_j) = \frac{1}{2} KLD(F_i \,\|\, E) + \frac{1}{2} KLD(F_j \,\|\, E) \tag{4.2}$$

where KLD is the Kullback-Leibler divergence [34], given as

$$KLD(F_i \,\|\, F_j) = \sum_{k=1}^{M} t_k^{F_i} \log\left(\frac{t_k^{F_i}}{t_k^{F_j}}\right), \qquad E = \frac{F_i + F_j}{2}.$$
Hence, the similarity between $F_i$ and $F_j$ is given by

$$sim(F_i, F_j) = |1 - JSD(F_i, F_j)|. \tag{4.3}$$

The values range between 0 and 1, with values closer to 0 implying more dissimilar sequences and values closer to 1 implying more similar sequences.
• The second metric, the Correlation Metric (Corr), is based on the correlation between sequences. Given any two columns of matrix $K$, the correlation distance between them is given below, where cov is the covariance and std is the standard deviation:

$$cov(F_i, F_j) = \frac{1}{M} \sum_{k=1}^{M} (F_{i,k} - \bar{F}_i)(F_{j,k} - \bar{F}_j) \tag{4.4}$$

$$std(F_i) = \sqrt{\frac{1}{M} \sum_{k=1}^{M} (F_{i,k} - \bar{F}_i)^2} \tag{4.5}$$

where $\bar{F}_i = \frac{1}{M} \sum_{k=1}^{M} F_{i,k}$, and

$$sim(F_i, F_j) = 1 - \frac{cov(F_i, F_j)}{std(F_i) \cdot std(F_j)} \tag{4.6}$$

where $F_{i,k} = t_k^{F_i}$ is the frequency of message term $t_k$ in sequence $F_i$. We treat $sim(\cdot)$ as the similarity or distance measure between two feature vectors.

Two clustering algorithms are proposed, as explained below and shown in Algorithms 2 and 3.
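Before turning to the algorithms, the two measures defined in Equations 4.2 to 4.6 can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names are hypothetical, the logarithm is assumed to be base 2 (which keeps JSD, and hence Equation 4.3, within 0 and 1), and terms with zero probability are assumed to contribute nothing to the KLD sum.

```python
import math


def kld(p, q):
    """KLD(p || q) as in the text, using base-2 logarithms (assumption)."""
    # Terms with p_k == 0 are skipped, i.e. 0 * log(0) is taken as 0 (assumption).
    return sum(pk * math.log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)


def jsd(p, q):
    """Jensen-Shannon divergence of two probability vectors (Equation 4.2)."""
    e = [(pk + qk) / 2 for pk, qk in zip(p, q)]  # E = (F_i + F_j) / 2
    return 0.5 * kld(p, e) + 0.5 * kld(q, e)


def sim_jsd(p, q):
    """JSD-based similarity (Equation 4.3)."""
    return abs(1 - jsd(p, q))


def sim_corr(x, y):
    """Correlation-based measure (Equations 4.4 to 4.6).

    Raises ZeroDivisionError if either vector is constant (std == 0).
    """
    m = len(x)
    mean_x = sum(x) / m
    mean_y = sum(y) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / m
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / m)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / m)
    return 1 - cov / (std_x * std_y)


if __name__ == "__main__":
    print(sim_jsd([0.5, 0.5], [0.5, 0.5]))   # identical distributions
    print(sim_jsd([1.0, 0.0], [0.0, 1.0]))   # disjoint distributions
    print(sim_corr([1, 2, 3], [2, 4, 6]))    # perfectly correlated vectors
```

Note that, as written in Equation 4.6, the correlation-based measure is small for strongly correlated vectors and large for anti-correlated ones, so a threshold on it has to be read as a distance rather than a similarity.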
Naïve Clustering Algorithm
Different faults may induce similar error manifestations in the system. Therefore, the basic assumption of this algorithm is that similar sequences are likely to have been generated by the same fault type; all data points (sequences) that are close enough according to a similarity metric are clustered together (Algorithm 2). The algorithm first initialises each data point (sequence) as a cluster of its own, so that all the sequences form a set of clusters, and a copy of the set of clusters C is created. All clusters with high similarity values, sm (greater than or equal to the threshold), are considered to contain similar patterns and are hence grouped together. Patterns that differ from all others are kept alone as singleton clusters. The result of this algorithm is a set of clusters of similar patterns.
Algorithm 2 Naïve clustering of event sequences
1: procedure NaïveClustering(event sequences K = F_1, F_2, ..., F_N, SimilarityThreshold δ)
2:   Initialize each F_i as a cluster of its own, all belonging to cluster set C; the set of clusters C is the output.
3:   for each F_i ∈ K do
4:     for each cluster c_k ∈ C do
5:       for all members V_j ∈ c_k do
6:         sm = sim(F_i, V_j);
7:         if sm >= δ then
8:           add F_i to cluster c_k
9:         else
10:          if (last cluster) then
11:            create a new cluster, c = F_i
12:            add c to C
13:          end if
14:        end if
15:        if V_j = last cluster point of c_k then
16:          add c_k to C
17:        end if
18:      end for
19:    end for
20:  end for
21:  Repeat step 3 until all sequences are clustered
22:  Return() {outputs C, the clusters containing event sequences}
23: end procedure
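The following Python sketch shows one possible reading of Algorithm 2; it is an illustration under stated assumptions rather than the implementation used in this work. It assumes that a sequence joins the first cluster containing at least one member whose similarity to it reaches δ, and that a sequence matching no cluster becomes a singleton. The names `naive_clustering` and `toy_sim` are hypothetical, and `toy_sim` is only a stand-in for the measures of Equations 4.3 and 4.6.

```python
def naive_clustering(sequences, sim, delta):
    """Greedy threshold clustering (one reading of Algorithm 2).

    sequences: list of feature vectors F_1..F_N.
    sim:       function sim(F_i, F_j) returning a similarity value.
    delta:     similarity threshold.
    Returns a list of clusters, each a list of indices into `sequences`.
    """
    clusters = []
    for i, seq in enumerate(sequences):
        for cluster in clusters:
            # Assumption: one sufficiently similar member is enough to join.
            if any(sim(seq, sequences[j]) >= delta for j in cluster):
                cluster.append(i)
                break
        else:
            # No existing cluster was similar enough: keep as a singleton.
            clusters.append([i])
    return clusters


def toy_sim(p, q):
    """Stand-in similarity for the demo only: 1 minus total variation distance."""
    return 1 - 0.5 * sum(abs(a - b) for a, b in zip(p, q))


if __name__ == "__main__":
    data = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
    print(naive_clustering(data, toy_sim, delta=0.8))  # [[0, 1], [2]]
```

Because each sequence is assigned to the first matching cluster, the result of this sketch depends on the order of the input, and no sequence can belong to more than one cluster.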
Hierarchical Agglomerative Clustering Variant
The motivation for this algorithm is twofold. First, the time at which a fault is experienced may affect the nature of the events, even when the sequences contain similar faults. Second, sequences have a high tendency to belong to more than one cluster; that is, patterns can be similar because similar computers executing similar jobs are likely to produce similar messages [106]. To obtain more cohesive clusters of sequences, we introduce a variant of hierarchical clustering, which we refer to as HAC. The HAC algorithm (Algorithm 3) targets what we refer to as borderline sequences (sequences that could belong to more than one cluster, owing to the presence of similar error messages), so that they are placed in the right group and the different sequence characteristics can be obtained for detection. In HAC, such “borderline” sequences are captured as sub-clusters of the cluster with the higher closeness value. A cluster SC is a sub-cluster of cluster C if $|SC| < |C|$, the validity index of C, val_ind(C), is greater than that of SC, and the similarity between their centroids is greater than or equal to the threshold $\delta$. The validity index [58], referred to in our algorithm as val_ind(C), measures the goodness of cluster C: how compact the cluster is (how close its elements are to one another) and how well separated it is from other clusters. We calculated this using the Silhouette coefficient [58], given by Equation 4.7.
$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)} \tag{4.7}$$
where $s_i$ is the silhouette coefficient for sequence $i$, $b_i$ is the minimum of the average distances from sequence $i$ to the sequences of each other cluster, and $a_i$ is the average distance from $i$ to the sequences in its own cluster. Hence, the goodness of a cluster is the average of the silhouettes $s_i$ of its sequences, with $-1 \le s_i \le 1$; values of $s_i$ close to 1 indicate that the sequences clustered together are similar, and values closer to -1 indicate less similar sequences.
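As an illustration of Equation 4.7 and of the sub-cluster condition stated above, the sketch below computes the silhouette of a sequence, the validity index of a cluster, and the sub-cluster test. It is a minimal sketch, not the implementation used in this work: the function names are hypothetical, clusters are assumed to be lists of indices, at least two clusters are assumed to exist, a silhouette of 0 is assumed for singleton clusters, and the centroid similarity is assumed to be computed elsewhere and passed in.

```python
def silhouette(i, clusters, dist):
    """Silhouette s_i of item i (Equation 4.7).

    clusters: list of clusters, each a list of item indices (needs >= 2 clusters).
    dist:     function dist(i, j) returning the distance between items i and j.
    """
    own = next(c for c in clusters if i in c)
    peers = [j for j in own if j != i]
    if not peers:
        return 0.0  # assumption: singleton clusters get a silhouette of 0
    a = sum(dist(i, j) for j in peers) / len(peers)
    b = min(sum(dist(i, j) for j in c) / len(c)
            for c in clusters if c is not own)
    denom = max(a, b)
    return (b - a) / denom if denom > 0 else 0.0


def val_ind(cluster, clusters, dist):
    """Validity index of a cluster: mean silhouette of its members."""
    return sum(silhouette(i, clusters, dist) for i in cluster) / len(cluster)


def is_subcluster(sc, c, clusters, dist, centroid_sim, delta):
    """Sub-cluster test: |SC| < |C|, val_ind(C) > val_ind(SC), centroid sim >= delta.

    centroid_sim is assumed to be precomputed by the caller.
    """
    return (len(sc) < len(c)
            and val_ind(c, clusters, dist) > val_ind(sc, clusters, dist)
            and centroid_sim >= delta)


if __name__ == "__main__":
    values = [0.0, 0.1, 0.2, 5.0, 5.1]            # toy one-dimensional data
    d = lambda i, j: abs(values[i] - values[j])   # toy distance
    good = [[0, 1, 2], [3, 4]]
    bad = [[0, 1, 3], [2, 4]]
    print(silhouette(0, good, d))  # close to 1: well clustered
    print(silhouette(3, bad, d))   # negative: likely in the wrong cluster
```

In the toy data, item 3 placed with items 0 and 1 receives a negative silhouette, which is the kind of signal the validity index uses to distinguish cohesive clusters from poorly formed ones.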