

Dynamic programming is one of the most commonly used techniques for solving optimization problems. Recall that in Chapter 4 we presented a dynamic programming algorithm for the k-MEDIAN-OUTLIER problem, which gives an optimum solution under the assumption that the instance is perturbation resilient, that is, the optimum solution remains unchanged even when the pairwise distances between points are changed by a multiplicative factor of 2 (refer to Section 4.6 for details). Although real instances of clustering problems may not satisfy this condition, the algorithm itself can be used as a heuristic for solving clustering with outliers. In this section, we present this dynamic-programming-based heuristic for k-MEANS-OUTLIER (it is a minor modification of the algorithm in Section 4.6):

• Step 1. Given the input set of points $V$ and distance function $d$, construct the minimum spanning tree (MST) of $V$ based on $d$. Let $\bar{T}$ be the MST thus constructed. The tree $\bar{T}$ is rooted at an arbitrary vertex $\mathit{root}$.

• Step 2. Transform $\bar{T}$ into a binary tree $T$ with dummy vertices. The procedure is as follows: while there is a vertex $v$ with more than two children, pick any two children $v_1$ and $v_2$ of $v$; create a new child (dummy vertex) $u$ of $v$; reattach the subtrees rooted at $v_1$ and $v_2$ as children of $u$. At the end of this process, let $U$ be the set of dummy vertices added. For each dummy vertex $u \in U$, set $d(u, v) = 0$ for every $v \in U \cup V$.

• Step 3. Using dynamic programming, partition the binary tree $T$ into $k$ subtrees $P'_1, \ldots, P'_k$ with centers $c'_1, \ldots, c'_k$ (each $c'_i \in P'_i \cap V$), and identify the remaining vertices $Z'$ of the tree as outliers, where $|Z' \cap V| \le z$, such that the cost function $\sum_{i=1}^{k} \sum_{u \in P'_i} d^2(c'_i, u)$ is minimized. Note that when the instance is perturbation resilient, this partitioning of $T$ and these outliers are in fact the optimal clustering and outliers.

The dynamic programming is as follows. Let $T_u$ denote the subtree rooted at $u$. Further, let $\ell_u$ and $r_u$ denote the left child and the right child of $u$, respectively.

Let $\mathrm{opt}(u, j, t, c)$ be the minimum cost of partitioning the points in subtree $T_u$ into $j$ clusters after discarding $t$ points as outliers. Here $c$ can be any vertex in $V$, or it can be null (denoted by $\emptyset$). The clustering satisfies the following constraints:

• If $c = \emptyset$, then $u$ is marked as an outlier.

• If $c \neq \emptyset$, then the cluster to which $u$ belongs has center $c$.

• Each cluster forms a subtree in $T_u$.

We can define $\mathrm{opt}(u, j, t, c)$ using the following recursive formula.

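• $c = \emptyset$, $u \in V$. Here $u$ is an outlier. Hence $\ell_u$ and $r_u$ are assigned to centers $c' \in T_{\ell_u}$ and $c'' \in T_{r_u}$, respectively (or are themselves outliers). Further, since $u$ is already marked as an outlier, there can be $t - 1$ outliers in total in $T_{\ell_u}$ and $T_{r_u}$.

$$
\mathrm{opt}(u, j, t, c) = \min\bigl\{ \mathrm{opt}(\ell_u, j', t', c') + \mathrm{opt}(r_u, j'', t'', c'') : j' + j'' = j,\ t' + t'' = t - 1,\ c' \in T_{\ell_u} \cup \{\emptyset\},\ c'' \in T_{r_u} \cup \{\emptyset\} \bigr\} \tag{5.1}
$$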

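• $c = \emptyset$, $u \in U$. Here $u$ is a dummy vertex, so discarding it does not use up the outlier budget, and there can be $t$ outliers in $T_u$.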

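$$
\mathrm{opt}(u, j, t, c) = \min\bigl\{ \mathrm{opt}(\ell_u, j', t', c') + \mathrm{opt}(r_u, j'', t'', c'') : j' + j'' = j,\ t' + t'' = t,\ c' \in T_{\ell_u} \cup \{\emptyset\},\ c'' \in T_{r_u} \cup \{\emptyset\} \bigr\} \tag{5.2}
$$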

• $c \notin T_{\ell_u} \cup T_{r_u}$. The recursive formula is defined by four cases (lines 1-4 in the formula). The explanation for each case is as follows: (1) Neither $\ell_u$ nor $r_u$ is assigned to the same cluster as $u$. They are either outliers or assigned to centers $c'$, $c''$ in the subtrees $T_{\ell_u}$, $T_{r_u}$, respectively. (2) $r_u$ is assigned to the same cluster as $u$, but $\ell_u$ is not; $\ell_u$ is either an outlier or assigned to a center $c' \in T_{\ell_u}$. (3) $\ell_u$ is assigned to the same cluster as $u$, but $r_u$ is not; $r_u$ is either an outlier or assigned to a center $c'' \in T_{r_u}$. (4) Both $\ell_u$ and $r_u$ are assigned to the same cluster as $u$.

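$$
\mathrm{opt}(u, j, t, c) = d^2(u, c) + \min \left\{
\begin{array}{l}
\min\{ \mathrm{opt}(\ell_u, j', t', c') + \mathrm{opt}(r_u, j'', t'', c'') : j' + j'' = j - 1,\ t' + t'' = t,\ c' \in T_{\ell_u} \cup \{\emptyset\},\ c'' \in T_{r_u} \cup \{\emptyset\} \}, \\
\min\{ \mathrm{opt}(\ell_u, j', t', c') + \mathrm{opt}(r_u, j'', t'', c) : j' + j'' = j,\ t' + t'' = t,\ c' \in T_{\ell_u} \cup \{\emptyset\} \}, \\
\min\{ \mathrm{opt}(\ell_u, j', t', c) + \mathrm{opt}(r_u, j'', t'', c'') : j' + j'' = j,\ t' + t'' = t,\ c'' \in T_{r_u} \cup \{\emptyset\} \}, \\
\min\{ \mathrm{opt}(\ell_u, j', t', c) + \mathrm{opt}(r_u, j'', t'', c) : j' + j'' = j + 1,\ t' + t'' = t \}
\end{array}
\right\} \tag{5.3}
$$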

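• $c \in T_{\ell_u}$. Since the center lies in the left subtree, $\ell_u$ must belong to the same cluster as $u$. The recursive formula in this case is obtained by removing lines (1) and (2) from the above formula.

• $c \in T_{r_u}$. Since the center lies in the right subtree, $r_u$ must belong to the same cluster as $u$. The recursive formula in this case is obtained by removing lines (1) and (3) from the above formula.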

Finally, the cost of the optimal partitioning of $T$ is given by $\min_{c \in V} \mathrm{opt}(\mathit{root}, k, z, c)$, where $\mathit{root}$ is the root of the tree. Let $B$ be the set of centers corresponding to the partitions obtained by the DP algorithm.

• Step 4. Let $Z$ be the $z$ points furthest from the centers $B$. Assign the remaining points $V \setminus Z$ to the nearest center in $B$ (ties broken arbitrarily) to obtain the clustering.
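Steps 1, 2 and 4 are simple to prototype. The sketch below is illustrative only and rests on several assumptions that are not in the text: points are coordinate tuples, $d$ is the Euclidean distance (computed with `math.dist`), the MST is rooted at index 0, and the function names `minimum_spanning_tree`, `binarize` and `assign` are hypothetical. The center set $B$ produced by Step 3 is taken as an input.

```python
from math import dist


def minimum_spanning_tree(points):
    """Step 1 (Prim's algorithm): children lists of the MST rooted at index 0."""
    n = len(points)
    children = {u: [] for u in range(n)}
    # best[v] = (cheapest distance from v to the tree so far, tree endpoint)
    best = {v: (dist(points[0], points[v]), 0) for v in range(1, n)}
    while best:
        u = min(best, key=lambda v: best[v][0])
        _, parent = best.pop(u)
        children[parent].append(u)
        for v in best:
            w = dist(points[u], points[v])
            if w < best[v][0]:
                best[v] = (w, u)
    return children


def binarize(children):
    """Step 2: add dummy vertices until every vertex has at most two children."""
    tree = {u: list(kids) for u, kids in children.items()}
    dummies = set()
    next_id = max(tree) + 1
    for v in list(tree):
        while len(tree[v]) > 2:
            v1, v2 = tree[v].pop(), tree[v].pop()
            u, next_id = next_id, next_id + 1
            tree[u] = [v1, v2]      # the dummy vertex adopts v1 and v2
            tree[v].append(u)       # and becomes a child of v
            dummies.add(u)
    return tree, dummies


def assign(points, centers, z):
    """Step 4: discard the z points furthest from the centers, assign the rest."""
    nearest = {
        p: min(centers, key=lambda c: dist(points[p], points[c]))
        for p in range(len(points))
    }
    order = sorted(nearest, key=lambda p: dist(points[p], points[nearest[p]]))
    keep = len(points) - z
    clusters = {c: [] for c in centers}
    for p in order[:keep]:
        clusters[nearest[p]].append(p)
    return clusters, set(order[keep:])


points = [(0, 0), (1, 0), (0, 1), (-1, 0), (10, 12)]
tree, dummies = binarize(minimum_spanning_tree(points))
clusters, outliers = assign(points, centers=[0], z=1)
```

Each pass of the inner loop in `binarize` reduces the number of children of `v` by one, and every dummy vertex ends up with exactly two children, so the loop stops with a binary tree.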

The running time of the algorithm is $O((n \cdot k \cdot z)^2)$, where $n$ is the size of the input set and $k$, $z$ are the input parameters. Thus, to scale the algorithm to large datasets, we need data summarization techniques similar to those described in the previous sections.
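For small inputs, the recurrences (5.1)-(5.3) of Step 3 can be evaluated directly with memoization. The following is a minimal sketch rather than the author's implementation. The function name `tree_dp_cost`, the `children` / `is_real` encoding of the binary tree, and the helper names are hypothetical. The base cases for leaves and single-child vertices are assumptions, since the text does not spell them out. A child that joins its parent's cluster is asked for one extra cluster, because the shared cluster is counted once at the parent. The sketch returns only the optimal cost, not the center set $B$, and makes no attempt to match the stated running time.

```python
from functools import lru_cache
from math import inf


def tree_dp_cost(children, is_real, root, k, z, d2):
    """Return the min over c in V of opt(root, k, z, c) on a binary tree.

    children[u] lists the 0, 1 or 2 children of u; is_real[u] is False for
    dummy vertices; d2(u, c) is the squared distance between real vertices.
    """
    real_below = {}  # real vertices contained in each subtree T_u

    def collect(u):
        found = {u} if is_real[u] else set()
        for v in children[u]:
            found |= collect(v)
        real_below[u] = frozenset(found)
        return real_below[u]

    collect(root)

    def separate(v, a, t):
        # v is not in its parent's cluster: it is an outlier (None) or it
        # uses a center inside its own subtree.
        return min(opt(v, a, t, c) for c in (None, *real_below[v]))

    def contribution(v, a, t, c):
        # Cheapest way for T_v to add `a` clusters and `t` outliers below a
        # parent whose center is c (None when the parent is an outlier).
        if c is None:
            return separate(v, a, t)
        shared = opt(v, a + 1, t, c)  # v joins the parent's cluster
        if c in real_below[v]:
            return shared             # the center lies below v: v must join
        return min(shared, separate(v, a, t))

    @lru_cache(maxsize=None)
    def opt(u, j, t, c):
        if c is None:
            # u is discarded; only real vertices use the outlier budget.
            rest_j, rest_t, base = j, t - (1 if is_real[u] else 0), 0.0
        else:
            rest_j, rest_t = j - 1, t
            base = float(d2(u, c)) if is_real[u] else 0.0
        if rest_j < 0 or rest_t < 0:
            return inf
        kids = children[u]
        if not kids:  # assumed base case: nothing left to distribute
            return base if (rest_j, rest_t) == (0, 0) else inf
        if len(kids) == 1:
            return base + contribution(kids[0], rest_j, rest_t, c)
        left, right = kids
        return base + min(
            contribution(left, a, b, c)
            + contribution(right, rest_j - a, rest_t - b, c)
            for a in range(rest_j + 1)
            for b in range(rest_t + 1)
        )

    return min(opt(root, k, z, c) for c in real_below[root])


# Five points on a line; their MST is the path 0 - 1 - 2 - 3 - 4.
xs = [0, 1, 10, 11, 50]
path = {0: [1], 1: [2], 2: [3], 3: [4], 4: []}
cost = tree_dp_cost(path, {u: True for u in path}, 0, k=2, z=1,
                    d2=lambda u, c: (xs[u] - xs[c]) ** 2)
print(cost)  # 2.0: clusters {0, 1} and {10, 11}, with 50 discarded
```

Folding the four lines of (5.3) into a per-child choice between joining the parent's cluster and starting afresh keeps the sketch short, and it handles vertices with fewer than two children without extra case analysis.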