
To test our MOCA initially, we compared it with k-means on a large number of prefabricated data sets for which a desired clustering solution exists. This initial comparison allows us to check whether MOCA produces correct clustering solutions and performs at least in line with the benchmark clustering algorithm. In later chapters we perform more complex experiments against other Multi-Objective Evolutionary Algorithms (MOEAs) to determine whether our MOCA is more efficient than other MOEA implementations for clustering.

First, we constructed a series of synthetic data sets; then we defined an experimental methodology for comparing the algorithm's performance against k-means; finally, we report our results.

CHAPTER 5. A NOVEL MO CLUSTERING ALGORITHM 106

Algorithm 5.1 MOCA

• Initialise the population, S.

• Randomly create solutions by drawing objects from D to form sets of medoids, and insert them into S_1 to serve as the start population.

• g = 1.

• while g < number of generations:
  – ∀ s ∈ S_g:
    ∗ Calculate awgss(P).
    ∗ Calculate abgss(P).
    ∗ Calculate connectivity(P).
  – Calculate the dominance depth of the solutions in S as in Algorithm 4.2.
  – Calculate the crowding distance of the solutions in S as in Algorithm 4.3.
  – Select solutions to mutate using binary tournament selection with ≺_n.
  – Mutate each selected solution with a randomly selected mutation sub-operator:
    ∗ Decrease the number of clusters in the solution.
    ∗ Increase the number of clusters in the solution.
    ∗ Recompute the cluster prototypes.
  – Select solutions to crossover using binary tournament selection with ≺_n.
  – Crossover the selected solutions by exchanging cluster prototypes.
  – Add the mutated and crossed-over solutions into the population.
  – Sort S with ≺_n.
  – Add the fittest solutions from S_g to S_{g+1} until it is full.
  – g = g + 1.
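The variation steps of Algorithm 5.1 can be illustrated with a minimal Python sketch. It is not the implementation used in this work. It assumes that a solution is a list of medoid indices into the data set, that ≺_n is the usual NSGA-II crowded comparison (lower dominance depth first, then larger crowding distance), that "recompute the cluster prototypes" means re-selecting each medoid as the most central member of its cluster, and that crossover swaps a single prototype. The function names and the dictionary keys `depth` and `crowding` are hypothetical.

```python
import math
import random


def crowded_less(a, b):
    """Return True if solution a is preferred to b under the crowded comparison."""
    if a["depth"] != b["depth"]:
        return a["depth"] < b["depth"]       # lower dominance depth wins
    return a["crowding"] > b["crowding"]     # tie: larger crowding distance wins


def binary_tournament(population, rng):
    """Draw two distinct solutions at random and keep the preferred one."""
    a, b = rng.sample(population, 2)
    return a if crowded_less(a, b) else b


def recompute_medoids(data, medoids):
    """Assign each object to its nearest medoid, then re-select each medoid as
    the cluster member with the smallest total distance to the other members."""
    clusters = {m: [] for m in medoids}
    for i, x in enumerate(data):
        nearest = min(medoids, key=lambda m: math.dist(x, data[m]))
        clusters[nearest].append(i)
    new_medoids = []
    for m, members in clusters.items():
        if not members:                      # possible only with duplicate points
            new_medoids.append(m)
            continue
        new_medoids.append(min(
            members,
            key=lambda c: sum(math.dist(data[c], data[j]) for j in members),
        ))
    return sorted(set(new_medoids))


def mutate(data, medoids, rng, k_min=2):
    """Apply one randomly selected mutation sub-operator (k_min is an assumption)."""
    medoids = list(medoids)
    op = rng.choice(("decrease", "increase", "recompute"))
    if op == "decrease" and len(medoids) > k_min:
        medoids.remove(rng.choice(medoids))
    elif op == "increase":
        unused = [i for i in range(len(data)) if i not in medoids]
        if unused:
            medoids.append(rng.choice(unused))
    elif op == "recompute":
        medoids = recompute_medoids(data, medoids)
    return medoids


def crossover(parent_a, parent_b, rng):
    """Exchange one randomly chosen prototype, skipping swaps that would
    duplicate a medoid inside either child."""
    a, b = list(parent_a), list(parent_b)
    i, j = rng.randrange(len(a)), rng.randrange(len(b))
    if b[j] not in a and a[i] not in b:
        a[i], b[j] = b[j], a[i]
    return a, b
```

Representing a solution by medoid indices rather than coordinates keeps every prototype an actual object drawn from D, which matches the initialisation step of the algorithm.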

5.3.1 Construction of Synthetic Data Sets

In Section 3.2 we described a method for generating synthetic data sets based upon the work of Milligan and Cooper [107, 108]. The method can be used to generate data sets in which the following factors are varied: the number of naturally occurring clusters, the number of dimensions, the distribution of the membership of objects to clusters, and the proportion of outliers within the data set.

We varied each of the four factors to produce different data designs. Three data sets were generated from each design, giving twenty thousand and seven data sets for this experiment. Each data set contained five hundred objects. These data sets were newly generated and are not identical to those in Chapter 3.
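To make the four factors concrete, the following Python sketch generates one labelled data set from a design. It is a simplified stand-in and not the Milligan and Cooper procedure of Section 3.2: clusters are plain Gaussian blobs around uniformly drawn centres, outliers are drawn uniformly and counted within the total number of objects, and the three membership distributions follow the "a", "b" and "c" designs described in Section 5.3.3. All function names, the coordinate range and the unit spread are assumptions made for illustration.

```python
import random


def cluster_sizes(n_objects, k, design):
    """Split n_objects over k clusters (k >= 2).

    'a': as even as possible; 'b': one cluster holds 10% of the objects;
    'c': one cluster holds 60%; the remaining clusters are as even as possible.
    """
    if design == "a":
        base, extra = divmod(n_objects, k)
        return [base + (i < extra) for i in range(k)]
    special = round(n_objects * {"b": 0.10, "c": 0.60}[design])
    base, extra = divmod(n_objects - special, k - 1)
    return [special] + [base + (i < extra) for i in range(k - 1)]


def generate(n_objects, k, dims, design, outlier_fraction, rng):
    """Return (points, labels); outliers carry the label -1."""
    n_outliers = round(n_objects * outlier_fraction)
    sizes = cluster_sizes(n_objects - n_outliers, k, design)
    points, labels = [], []
    for label, size in enumerate(sizes):
        # Assumed layout: centres uniform in [0, 100]^dims, unit-variance spread.
        centre = [rng.uniform(0.0, 100.0) for _ in range(dims)]
        for _ in range(size):
            points.append([rng.gauss(c, 1.0) for c in centre])
            labels.append(label)
    for _ in range(n_outliers):
        points.append([rng.uniform(0.0, 100.0) for _ in range(dims)])
        labels.append(-1)
    return points, labels
```

Unlike the method of Section 3.2, this sketch makes no attempt to guarantee that clusters are well separated, so randomly placed centres may overlap.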

5.3.2 Experimental Method

We set the population size for our version of NSGA-II to 100, the number of generations to 1,000, and both the mutation probability and the crossover probability to 0.5. These choices were based upon preliminary work in which we varied the mutation and crossover probabilities in the range [0.1 : 0.9] in increments of 0.1, the population size in the range [50 : 200] in increments of 10, and the number of generations in the range [100 : 2000] in increments of 100.

Our MOCA was executed on each of the previously described synthetic data sets with these parameters. The result of each execution is a set of clustering solutions. We test each solution against the optimal clustering solution using the Rand Statistic, R, previously defined in Section 2.4. We extract the highest, lowest and mean values of R recorded for each Pareto set of solutions returned by an execution of MOCA. The values of k associated with the solutions that generated the minimum and maximum values of R are also reported, as is the average value of R. We also compare performance against the k-means algorithm. For each synthetic data set, we execute k-means for values of k ranging from 2 to 40 in increments of 1. We report the highest value of R recorded for each pool of solutions associated with a data set, together with the associated value of k.
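As a reminder of what R measures, the sketch below computes the Rand statistic in its standard pair-counting form: the fraction of object pairs on which two partitions agree, either by placing both objects together or by keeping them apart. It is assumed here that this matches the definition given in Section 2.4. The function name is hypothetical, and the quadratic loop is chosen for clarity rather than speed.

```python
from itertools import combinations


def rand_statistic(labels_a, labels_b):
    """Fraction of object pairs treated the same way by both partitions."""
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must label the same objects")
    if len(labels_a) < 2:
        raise ValueError("at least two objects are required")
    agreements = 0
    pairs = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agreements += same_a == same_b   # together in both, or apart in both
        pairs += 1
    return agreements / pairs


# The statistic ignores how clusters are named: relabelling gives R = 1.
print(rand_statistic([0, 0, 1, 1], [1, 1, 0, 0]))
```

Because R depends only on pairwise co-membership, it can compare solutions with different numbers of clusters, which is what allows MOCA solutions with varying k to be scored against a single intended partition.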

5.3.3 Comparison to DBSCAN

We will also compare MOCA against a second clustering algorithm. In addition to k-means, we have chosen DBSCAN [35], a density-based clustering algorithm discussed in Section 2.2.3.

We will draw a subset of the data sets described in Sections 3.2 and 5.3.1. The number of clusters in the data sets will be between two and twenty in increments of two. The number of dimensions will be between two and ten in increments of two. All three data set designs (df) are used: an even distribution of clusters is denoted "a"; a cluster consisting of 10% of the objects, with the rest as evenly distributed as possible, is denoted "b"; and a cluster consisting of 60% of the objects, with the rest as evenly distributed as possible, is denoted "c". The proportion of outliers is either 0% or 40%, denoted "a" and "b" respectively.
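The factor levels listed above can be enumerated directly. The following sketch builds their full cross product, under the assumption that every combination of levels is a valid design; the text does not state how many data sets are drawn per design, so only the number of combinations is computed. The dictionary keys are illustrative names.

```python
from itertools import product

cluster_counts = range(2, 21, 2)          # 2, 4, ..., 20
dimension_counts = range(2, 11, 2)        # 2, 4, ..., 10
distributions = ("a", "b", "c")           # membership designs (df)
outlier_levels = {"a": 0.0, "b": 0.4}     # proportion of outliers

designs = [
    {"clusters": k, "dimensions": d, "df": df, "outliers": outlier_levels[o]}
    for k, d, df, o in product(
        cluster_counts, dimension_counts, distributions, outlier_levels
    )
]

# 10 cluster counts x 5 dimension counts x 3 distributions x 2 outlier levels
print(len(designs))  # 300
```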

For each data set we execute the DBSCAN algorithm. This algorithm returns a single clustering solution without the need for a pre-determined value of k to be provided. We will then calculate the value of R of this solution compared to the intended clustering solution. We also report the value of k.
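The way DBSCAN yields k as a by-product can be seen in a compact pure-Python sketch of the algorithm. It is a didactic version with a brute-force neighbourhood search, not the implementation used in these experiments, and the parameter names `eps` and `min_pts` as well as the values in the usage lines are assumptions, since the settings used here are not stated.

```python
import math


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(points)

    def neighbours(i):
        # Brute-force eps-neighbourhood; includes the point itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE               # may later become a border point
            continue
        cluster += 1                        # i is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster         # border point: claim, do not expand
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:   # j is also a core point: expand
                queue.extend(reachable)
    return labels


points = [(0, 0), (0, 0.2), (0.2, 0), (0.2, 0.2),
          (5, 5), (5, 5.2), (5.2, 5), (5.2, 5.2), (10, 0)]
labels = dbscan(points, eps=0.5, min_pts=3)
k = len(set(labels) - {-1})                 # number of clusters found
```

The value of k is read off the returned labels rather than supplied in advance, and points labelled as noise are excluded from the count.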

k-means is run in the same fashion as we described in Section 5.3.2. For each synthetic data set, we run the k-means algorithm with values of k from 2 to 40. Again we report the highest value of R for each pool of k-means solutions and the value of k associated with it.

MOCA is executed in almost exactly the same way as described in Section 5.3.2, except that the number of generations is reduced to 100. Again, from each set of solutions returned by MOCA we extract the solutions with the highest and lowest values of R and report the value of k for each of these solutions. We also report the mean and standard deviation (S.D.) of k and R for each set of solutions returned by MOCA.
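The reported quantities for one set of MOCA solutions can be computed as in the sketch below. It assumes that each solution has already been reduced to a pair (k, R), and it uses the sample standard deviation, which is an assumption because the text does not say whether the sample or population form is used. The function name and result keys are hypothetical.

```python
import statistics


def summarise(solutions):
    """Summarise a non-empty list of (k, R) pairs from one MOCA execution."""
    best = max(solutions, key=lambda s: s[1])    # solution with highest R
    worst = min(solutions, key=lambda s: s[1])   # solution with lowest R
    ks = [k for k, _ in solutions]
    rs = [r for _, r in solutions]
    many = len(solutions) > 1                    # stdev needs two or more values
    return {
        "max_R": best[1], "k_at_max_R": best[0],
        "min_R": worst[1], "k_at_min_R": worst[0],
        "mean_k": statistics.mean(ks),
        "sd_k": statistics.stdev(ks) if many else 0.0,
        "mean_R": statistics.mean(rs),
        "sd_R": statistics.stdev(rs) if many else 0.0,
    }


summary = summarise([(2, 0.5), (4, 0.9), (6, 0.7)])
```

Reporting the spread of k alongside R shows whether a set of solutions agrees on the number of clusters or merely contains one good solution among many poor ones.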