
As already mentioned, the software re-modularisation problem is sometimes referred to in the literature as software clustering, since it concerns the clustering of related software entities.

Although this definition is rather simple, it emphasises that the problem has several aspects in common with a typical clustering problem as understood in the machine learning literature (Section 3.3).

Software clustering belongs to the family of hard clustering tasks, since every entity, namely every class of the system, is associated with one and only one cluster (Section 2.3). Moreover, as with any other unsupervised machine learning approach, one of the key issues of the technique is the choice of the similarity measure, which is crucial for the clustering performance since it provides the criterion for deciding whether two software entities are similar enough to be put into the same cluster [131].

In the defined vector space model, the similarity between two classes is typically computed with the well-known cosine similarity (Remark 2.8), expressed as the cosine of the angle between the two vectors representing them. Nevertheless, the clustering of software entities introduces some constraints imposed by the specific domain. The most important one is that the clusters of an automatically produced partition should be neither too large (i.e., containing hundreds of software entities) nor too small (i.e., containing very few software entities) [187].

For these reasons, standard algorithms may not be effective unless they are (slightly) modified to enforce such constraints.
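As a minimal sketch of the two ingredients discussed above, the following Python fragment computes the cosine similarity between two class vectors and checks a cluster against size bounds. The names `min_size` and `max_size` are hypothetical parameters introduced only for illustration, since the text gives no concrete thresholds, and returning 0 for an all-zero vector is a convention assumed here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_u * norm_v)


def is_extreme(cluster, min_size, max_size):
    """True if the cluster is smaller or larger than the (hypothetical) bounds."""
    return len(cluster) < min_size or len(cluster) > max_size
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, regardless of their lengths.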

In the remainder of this Section, the proposed customisations of two well-known clustering algorithms are described. In particular, the K-Medoids clustering algorithm is described in Section 3.3.1, while Group Average Agglomerative Clustering is discussed in Section 3.3.2.

3.3.1 K-medoids

As described in Section 2.3.2, the K-Medoids algorithm is a well-known variation of the classical K-Means algorithm that is more robust with respect to noise and outliers. The initial medoids are selected at random, and the resulting clustering strongly depends on this initial choice. To avoid unbalanced solutions, we therefore introduced a novel halting criterion that reduces the risk of obtaining extremely small or extremely large clusters, a criterion that makes sense in the context of software re-modularisation.

Indeed, the original K-medoids algorithm starts with a random choice of the k medoids and then iterates, at each step assigning every entity to its most similar medoid and then recomputing the set of medoids.

clusters. An algorithmic description of the K-medoids algorithm is reported in Algorithm 1.
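The basic loop described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a transcription of Algorithm 1: the similarity matrix `sim`, the helper `assign`, the parameter `max_steps`, and the rule of stopping when the medoids no longer change are all choices made for this example.

```python
import random


def assign(sim, medoids):
    """Assign every entity to its most similar medoid (medoids stay in their own cluster)."""
    clusters = {m: [m] for m in medoids}
    for i in range(len(sim)):
        if i not in clusters:
            clusters[max(medoids, key=lambda m: sim[i][m])].append(i)
    return list(clusters.values())


def k_medoids(sim, k, max_steps=100, rng=None):
    """Plain K-medoids on a precomputed similarity matrix sim[i][j]."""
    rng = rng or random.Random()
    medoids = rng.sample(range(len(sim)), k)  # random initial medoids
    for _ in range(max_steps):
        clusters = assign(sim, medoids)
        # The new medoid of a cluster maximises total similarity to its members.
        new_medoids = [
            max(c, key=lambda x: sum(sim[x][j] for j in c)) for c in clusters
        ]
        if set(new_medoids) == set(medoids):  # assumed stopping rule
            break
        medoids = new_medoids
    return assign(sim, medoids)
```

Working on a precomputed matrix keeps the sketch independent of how the class vectors are built; any similarity, including the cosine similarity, can be plugged in.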

However, the main drawback of the algorithm is that the resulting clusters strongly depend on the initial configuration. Thus, an unlucky configuration could result in a partition that includes clusters that are too small. In the proposed variant of the algorithm, the whole procedure is repeated until a solution without extreme clusters is attained or a maximum number of iterations has been performed.

Even when the procedure halts due to the latter condition, the algorithm provides the best solution among those found across all iterations.
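The restart scheme can be sketched as a wrapper around any randomised clustering routine. The function name `balanced_restarts`, the callable `cluster_once`, and the bounds `min_size` and `max_size` are hypothetical. The text does not state how the "best" solution is ranked, so this sketch assumes that the solution with the fewest extreme clusters is the best one.

```python
def balanced_restarts(cluster_once, min_size, max_size, max_restarts=50):
    """Repeat a randomised clustering until no cluster is extreme.

    cluster_once: zero-argument callable returning a list of clusters.
    Returns the run with the fewest extreme clusters (assumed ranking).
    """
    best, best_extreme = None, None
    for _ in range(max_restarts):
        clusters = cluster_once()
        extreme = sum(
            1 for c in clusters if not min_size <= len(c) <= max_size
        )
        if best is None or extreme < best_extreme:
            best, best_extreme = clusters, extreme
        if extreme == 0:  # halting criterion: a balanced solution was found
            break
    return best
```

A caller would typically pass a closure that runs one randomly initialised K-medoids pass, so that each restart starts from different medoids.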

3.3.2 Group Average Agglomerative Clustering

In addition to the K-Medoids algorithm, the Group Average Agglomerative Clustering (GAAC) algorithm has also been considered, which belongs to the category of hierarchical clustering algorithms (Section 2.3.3).

In particular, the GAAC algorithm employs a linkage strategy that merges two clusters based on the average similarity of all pairs of entities belonging to them (see Table 2.1). The main advantage of this strategy is that it is more robust with respect to outliers and tends to produce more balanced dendrograms [130]. An example of a dendrogram resulting from the application of the GAAC algorithm is reported in Figure 2.2.
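The linkage criterion can be illustrated with a short Python sketch. It assumes a precomputed similarity matrix `sim` and averages over cross-cluster pairs only; formulations that also count the pairs inside the merged cluster exist, and since Table 2.1 is not reproduced here, which of the two the original definition adopts is left open. Both function names are hypothetical.

```python
from itertools import product


def group_average_similarity(cluster_a, cluster_b, sim):
    """Mean similarity over all pairs (a, b) with a in cluster_a and b in cluster_b."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(sim[a][b] for a, b in pairs) / len(pairs)


def most_similar_pair(clusters, sim):
    """Indices of the two clusters an agglomerative step would merge next."""
    candidates = [
        (i, j)
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    ]
    return max(
        candidates,
        key=lambda ij: group_average_similarity(
            clusters[ij[0]], clusters[ij[1]], sim
        ),
    )
```

Repeating the merge step until a single cluster remains, and recording each merge, yields the dendrogram.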

The main feature of hierarchical clustering algorithms is that they are deterministic and do not require multiple random initialisations (as partitional clusterings such as K-medoids do). Moreover, although the asymptotic time complexity of the HAC approach is worse than that of K-medoids (Section 3.3.2), in the experiments we performed K-medoids was slower, because it was run a large number of times from different initial points.

Conversely, from a software re-modularisation perspective, the main drawback of HAC is that, due to its agglomerative nature, it does not provide a flat partition of the system. Therefore, to obtain such a partition, the dendrogram has to be properly cut [132]. To this aim, the proposed customisation of the HAC algorithm consists of a specialised cutting strategy.

Algorithm 5 GAAC Cutting Strategy

Input: Λ : the maximum number of elements admitted in a single cluster.
Input: T : the dendrogram to be cut.
Input: k : the number of partitions to generate.
Output: P : the set of k different partitions.

function GAACutStrategy(Λ, T, k)
    r ← root(T)
    if r = null then
        return P
    end if
    if (isLeaf(r)) ∨ (|P| ≥ k) then
        P ← P ∪ {T}
        return P
    end if
    leftT ← subtree(left(T))      ▷ Get the left subtree rooted in T
    rightT ← subtree(right(T))    ▷ Get the right subtree rooted in T
    if (|leftT| ≥ Λ) ∧ (|rightT| ≥ Λ) then
        P ← P ∪ GAACutStrategy(Λ, leftT, k)
        P ← P ∪ GAACutStrategy(Λ, rightT, k)
    else if (|leftT| ≥ Λ) ∧ (|rightT| < Λ) then
        P ← P ∪ {rightT}
        P ← P ∪ GAACutStrategy(Λ, leftT, k)
    else if (|leftT| < Λ) ∧ (|rightT| ≥ Λ) then
        P ← P ∪ {leftT}
        P ← P ∪ GAACutStrategy(Λ, rightT, k)
    else                          ▷ Neither of the two partitions is extreme
        P ← P ∪ {leftT}
        P ← P ∪ {rightT}
    end if
    return P
end function

The proposed strategy generates at most k clusters. It is worth noting that this latter aspect is very important for the assessment of the approach, as it makes the two clustering solutions, namely K-medoids and HAC, fairly comparable. The algorithm for the proposed cutting strategy is reported in Algorithm 5.
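The branches of Algorithm 5 can be mirrored in Python as a rough sketch. The dendrogram representation is an assumption made for this example: an internal node is a 2-tuple `(left, right)` and a leaf is any non-tuple value. The accumulator argument `clusters` plays the role of P, and the names `leaves` and `gaac_cut` are hypothetical. The code follows the comparisons as printed and makes no stronger guarantee on the number of clusters than the pseudocode itself.

```python
def leaves(tree):
    """All entities under a dendrogram node (a leaf is any non-tuple value)."""
    if not isinstance(tree, tuple):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def gaac_cut(max_size, tree, k, clusters=None):
    """Top-down cut of a dendrogram, following the branches of Algorithm 5."""
    clusters = [] if clusters is None else clusters
    if tree is None:
        return clusters
    # A leaf cannot be split; once k clusters exist, stop splitting.
    if not isinstance(tree, tuple) or len(clusters) >= k:
        clusters.append(leaves(tree))
        return clusters
    left, right = tree
    left_big = len(leaves(left)) >= max_size
    right_big = len(leaves(right)) >= max_size
    if left_big and right_big:
        gaac_cut(max_size, left, k, clusters)
        gaac_cut(max_size, right, k, clusters)
    elif left_big:
        clusters.append(leaves(right))
        gaac_cut(max_size, left, k, clusters)
    elif right_big:
        clusters.append(leaves(left))
        gaac_cut(max_size, right, k, clusters)
    else:  # neither subtree is extreme: keep both as clusters
        clusters.append(leaves(left))
        clusters.append(leaves(right))
    return clusters
```

For instance, cutting the six-class dendrogram `((("A", "B"), ("C", "D")), ("E", "F"))` with `max_size=3` and `k=10` keeps `("E", "F")` as a cluster and splits the four-class subtree into two clusters of two classes each.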
