
Given a set $K = [F_1, \ldots, F_N]$ of $N$ input entries (columns), where $F_i = \{F_{i,t_1}, F_{i,t_2}, \ldots, F_{i,t_M}\}$, each input is either a normal event pattern or a faulty one. Sequences or patterns that exhibit similar characteristics, because similar faults generate similar message types within a sequence or pattern, can be identified through clustering. Hence, to group such similar sequences, we employ distance-based clustering.

For the purpose of faulty sequence detection in event logs, it is important to define a distance metric that could capture the informativeness of the sequence and/or correlation patterns between the events. Two metrics are defined as explained below.

• The Jensen-Shannon Divergence (JSD) metric measures the divergence or similarity between two or more probability distributions. Events in log data are sometimes infrequent and spatial in distribution (they can occur randomly in different sequences). JSD has been shown to be effective in capturing relationships among tokens of such distributions [97], hence we utilised it in this work. Considering two distributions $F_i$, $F_j$ (note that we established earlier that the input vectors are probability distributions), JSD shows how much information is lost when using one of $F_i$ or $F_j$ to approximate the other.

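Hence, given two sequences (in this case, the input column vectors) $F_i$, $F_j$, JSD is defined by

$$JSD(F_i, F_j) = \frac{1}{2} KLD(F_i \,\|\, E) + \frac{1}{2} KLD(F_j \,\|\, E) \qquad (4.2)$$

where KLD is the Kullback-Leibler divergence [34], given as

$$KLD(F_i \,\|\, F_j) = \sum_{k=1}^{M} t_k^{F_i} \log\left(\frac{t_k^{F_i}}{t_k^{F_j}}\right)$$

and $E = \frac{F_i + F_j}{2}$.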

Hence, the similarity between $F_i$ and $F_j$ is given by:

$$sim(F_i, F_j) = |1 - JSD(F_i, F_j)|. \qquad (4.3)$$

The values range between 0 and 1, with values closer to 0 implying more dissimilar sequences and values closer to 1 implying more similar sequences.

• The second metric, the Correlation Metric (Corr), is based on the correlation between sequences. Given any two columns from matrix $K$, the correlation distance between them is given below, where $cov$ is the covariance and $std$ is the standard deviation.

$$cov(F_i, F_j) = \frac{1}{M} \sum_{k=1}^{M} (F_{i,k} - \bar{F_i})(F_{j,k} - \bar{F_j}) \qquad (4.4)$$

$$std(F_i) = \sqrt{\frac{1}{M} \sum_{k=1}^{M} (F_{i,k} - \bar{F_i})^2} \qquad (4.5)$$

where $\bar{F_i} = \frac{1}{M} \sum_{k=1}^{M} F_{i,k}$, and

$$sim(F_i, F_j) = 1 - \frac{cov(F_i, F_j)}{std(F_i) \cdot std(F_j)} \qquad (4.6)$$

where $F_{i,k} = t_k^{F_i}$ is the frequency of message term $t_k$ in sequence $F_i$. We treat $sim(\cdot)$ as a similarity or distance measure between two feature vectors.

Two clustering algorithms are proposed, as explained below and shown in Algorithms 2 and 3.
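The two metrics defined in Equations 4.2 to 4.6 can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this work: the function names `kld`, `jsd_similarity` and `corr_measure` are hypothetical, the logarithm base (2) is an assumption since the text does not state it, and the correlation measure assumes that neither input vector is constant (otherwise the standard deviation is zero).

```python
import math


def kld(p, q):
    """Kullback-Leibler divergence KLD(p || q).

    Assumption: log base 2. Terms with p_k == 0 contribute 0; q_k must be
    positive wherever p_k is positive (always true when q is the mixture E).
    """
    return sum(pk * math.log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)


def jsd_similarity(fi, fj):
    """sim(Fi, Fj) = |1 - JSD(Fi, Fj)|, following Eqs. (4.2) and (4.3)."""
    e = [(a + b) / 2 for a, b in zip(fi, fj)]  # E = (Fi + Fj) / 2
    jsd = 0.5 * kld(fi, e) + 0.5 * kld(fj, e)
    return abs(1 - jsd)


def corr_measure(fi, fj):
    """sim(Fi, Fj) = 1 - cov / (std_i * std_j), following Eqs. (4.4)-(4.6).

    Assumption: neither vector is constant, so both deviations are non-zero.
    """
    m = len(fi)
    mean_i, mean_j = sum(fi) / m, sum(fj) / m
    cov = sum((a - mean_i) * (b - mean_j) for a, b in zip(fi, fj)) / m
    std_i = math.sqrt(sum((a - mean_i) ** 2 for a in fi) / m)
    std_j = math.sqrt(sum((b - mean_j) ** 2 for b in fj) / m)
    return 1 - cov / (std_i * std_j)
```

With base-2 logarithms, JSD lies between 0 and 1, which matches the stated range of Equation 4.3: identical distributions give a similarity of 1 and distributions with disjoint support give 0.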

Naïve Clustering Algorithm

Different faults may induce similar error manifestations in the system. The basic assumption of this algorithm is therefore that similar sequences are likely to have been generated by the same fault type, so all data points (sequences) that are close enough according to a similarity metric are clustered together (Algorithm 2). The algorithm first initialises each data point (sequence) as a cluster of its own, so that all the sequences form a set of clusters, and a copy of the set of clusters $C$ is created. All clusters with high similarity values, $sm$ (greater than or equal to the threshold), are considered to contain similar patterns and are hence grouped together. Patterns that differ from all others are kept alone as singleton clusters. The result of this algorithm is a set of clusters of similar patterns.

Algorithm 2 Naïve clustering of event sequences

 1: procedure NaïveClustering(event sequences K = F_1, F_2, ..., F_N, SimilarityThreshold δ)
 2:   Initialize each F_i as a cluster of its own, all belonging to cluster set C; the set of clusters C is the output.
 3:   for each F_i ∈ K do
 4:     for each cluster c_k ∈ C do
 5:       for all members V_j ∈ c_k do
 6:         sm = sim(F_i, V_j);
 7:         if sm >= δ then
 8:           add F_i to cluster c_k
 9:         else
10:           if (last cluster) then
11:             create a new cluster, c = F_i
12:             add c to C
13:           end if
14:         end if
15:         if V_j = last cluster point of c_k then
16:           add c_k to C
17:         end if
18:       end for
19:     end for
20:   end for
21:   Repeat step 3 until all sequences are clustered
22:   Return() {outputs C, clusters containing event sequences}
23: end procedure
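One possible reading of Algorithm 2 is sketched below in Python. It is a simplified illustration under stated assumptions, not the implementation used in this work: the function name `naive_clustering` is hypothetical, clusters are built incrementally rather than by copying an initial set of singletons, and a sequence joins only the first cluster that contains a member with similarity at least δ. The similarity function is passed in as a parameter, so either metric defined earlier can be used.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def naive_clustering(
    sequences: List[Vector],
    sim: Callable[[Vector, Vector], float],
    delta: float,
) -> List[List[int]]:
    """Group sequence indices by pairwise similarity.

    Assumption: a sequence joins the first cluster holding at least one
    member V_j with sim(F_i, V_j) >= delta; otherwise it starts a new
    (possibly singleton) cluster.
    """
    clusters: List[List[int]] = []
    for i, fi in enumerate(sequences):
        for cluster in clusters:
            if any(sim(fi, sequences[j]) >= delta for j in cluster):
                cluster.append(i)
                break
        else:
            # No existing cluster was similar enough: keep F_i on its own.
            clusters.append([i])
    return clusters
```

The result is a list of clusters, each holding the indices of its member sequences. Sequences that differ from all others end up as singleton clusters, as described in the text.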

Hierarchical Agglomerative Clustering Variant

The motivation for this algorithm is twofold. First, the time at which a fault is experienced may affect the nature of the events, even when the sequences stem from similar faults. Second, sequences have a high tendency to belong to more than one cluster; that is, patterns can be similar because similar computers executing similar jobs are likely to produce similar messages [106]. To obtain more cohesive clusters of sequences, we introduce a variant of hierarchical clustering, which we refer to as HAC. The HAC algorithm (Algorithm 3) targets what we refer to as borderline sequences, that is, sequences that could belong to more than one cluster because they contain similar error messages, so that they are clustered in the right group and the different sequence characteristics can be obtained for detection. In HAC, such “borderline” sequences are captured as sub-clusters of the cluster with the higher closeness value. A cluster $SC$ is a sub-cluster of cluster $C$ if $|SC| < |C|$, the validity index of $C$, $val\_ind(C)$, is greater than that of $SC$, and the similarity between their centroids is greater than or equal to the threshold $\delta$. The validity index [58], referred to in our algorithm as $val\_ind(C)$, measures the goodness of cluster $C$ through its compactness (how close the elements of the cluster are) and its separation from other clusters. We calculated this using the Silhouette coefficient [58], given by equation 4.7.

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)} \qquad (4.7)$$

where $s_i$ is the silhouette coefficient for sequence $i$, $b_i$ is the minimum of the average distances of sequence $i$ to the sequences of each other cluster, and $a_i$ is the average distance of $i$ to the sequences in its own cluster. Hence, the goodness of a cluster is the average of the silhouettes, $s_i$, of its sequences, with $-1 \le s_i \le 1$: values of $s_i$ close to 1 indicate that the sequences clustered together are similar, and values closer to -1 indicate less similar sequences.
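The silhouette-based validity index can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the implementation used in this work: the names `silhouette` and `val_ind` and the `dist` parameter are hypothetical, $a_i$ is averaged over the other members of the sequence's own cluster (excluding the sequence itself), singleton clusters are given a silhouette of 0, and at least two non-empty clusters are assumed to exist.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Dist = Callable[[Vector, Vector], float]


def silhouette(i: int, own: int, clusters: List[List[Vector]], dist: Dist) -> float:
    """s_i of Eq. (4.7) for the i-th sequence of clusters[own]."""
    x = clusters[own][i]
    peers = [y for k, y in enumerate(clusters[own]) if k != i]
    if not peers:
        return 0.0  # assumption: convention for singleton clusters
    # a_i: average distance to the other sequences in the same cluster
    a = sum(dist(x, y) for y in peers) / len(peers)
    # b_i: smallest average distance to the sequences of any other cluster
    b = min(
        sum(dist(x, y) for y in c) / len(c)
        for k, c in enumerate(clusters)
        if k != own and c
    )
    denom = max(a, b)
    return 0.0 if denom == 0 else (b - a) / denom


def val_ind(own: int, clusters: List[List[Vector]], dist: Dist) -> float:
    """Validity index of clusters[own]: the mean silhouette of its sequences."""
    n = len(clusters[own])
    return sum(silhouette(i, own, clusters, dist) for i in range(n)) / n
```

Any distance can be supplied through `dist`. For instance, a distance derived from the similarity of Equation 4.3 (such as one minus the similarity) would be one option, although that particular choice is an assumption here.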
