
Clustering has become a widely accepted synonym for a broad array of exploratory data analysis and model development activities in science, engineering, life sciences, business and economics, and biological and medical disciplines (Oliveira, 2007). Clustering techniques can be used to organize data (numerical, categorical, or a mixture of both) into groups based on similarities among the individual data items. In other words, clustering is a tool for discovering hidden structure in a data set. There are two types of clustering: hard (classical) and fuzzy. In classical hard clustering, each data point $x_i$ in a data set $X = \{x_1, \dots, x_n\}$ of size $n$ belongs to exactly one of $J$ clusters. In fuzzy clustering, every data point $x_i$ is assigned to all clusters, but with different membership degrees. The membership degree expresses how ambiguously or definitely the data point belongs to the cluster (Höppner et al., 1999). Fuzzy clustering includes many algorithms; the most accurate and frequently used is fuzzy C-means clustering (Höppner et al., 1999).
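The distinction between the two types can be written compactly with a membership degree $u_{ij}$ of data point $x_i$ in cluster $j$. The symbol $u_{ij}$ is introduced here only for illustration, and the sum-to-one constraint shown for the fuzzy case is the one commonly used by fuzzy C-means; other fuzzy methods may relax it.

```latex
\begin{align*}
\text{hard:}  \quad & u_{ij} \in \{0, 1\}, & \sum_{j=1}^{J} u_{ij} &= 1, \\
\text{fuzzy:} \quad & u_{ij} \in [0, 1],   & \sum_{j=1}^{J} u_{ij} &= 1, \qquad i = 1, \dots, n.
\end{align*}
```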

The main advantage of fuzzy C-means (FCM) clustering is that it allows gradual memberships of data points in clusters, measured as degrees in [0,1]. This gives the flexibility to express that a data point can belong to more than one cluster. FCM attempts to find the most characteristic point of each cluster, which can be considered the “centroid” of the cluster, together with the grade of membership of each object in the clusters. However, another question remains: how to determine the optimal number of clusters? To solve this problem, cluster validity indices are used. Many cluster validity indices have been proposed in the literature for evaluating the number of FCM clusters. In the current research we use the PC (Dunn, 1974), XB (Xie & Beni, 1991) and E (Makhalova, 2015) indices. The maximum value of PC and the minimum values of XB and E correspond to the best fuzzy partition (they indicate the optimal number of clusters).
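The following is a minimal sketch of the FCM iteration and of two of the validity indices, not the implementation used in this research (the calculations here were done in MATLAB and R). It assumes the usual fuzzifier m = 2, the textbook forms of the partition coefficient and the Xie-Beni index, and purely synthetic data; the function names are hypothetical. The E index is omitted because its definition is not given in the text.

```python
import math
import random


def fuzzy_c_means(points, c, m=2.0, iters=200, tol=1e-8, seed=0):
    """Fuzzy C-means with Euclidean distance; returns (centers, memberships)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships; every row is normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    for _ in range(iters):
        # Centroids: means of the points weighted by u_ij ** m.
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            centers.append([sum(w[i] * points[i][d] for i in range(n)) / sum(w)
                            for d in range(dim)])
        # Memberships: u_ij = 1 / sum_k (d_ij / d_ik) ** (2 / (m - 1)).
        shift = 0.0
        for i in range(n):
            dist = [max(math.dist(points[i], v), 1e-12) for v in centers]
            row = [1.0 / sum((dj / dk) ** (2.0 / (m - 1.0)) for dk in dist)
                   for dj in dist]
            shift = max(shift, max(abs(a - b) for a, b in zip(row, u[i])))
            u[i] = row
        if shift < tol:
            break
    return centers, u


def partition_coefficient(u):
    """PC: mean of the squared memberships; larger means a crisper partition."""
    return sum(v * v for row in u for v in row) / len(u)


def xie_beni(points, centers, u, m=2.0):
    """XB: compactness divided by separation; smaller is better."""
    compact = sum(u[i][j] ** m * math.dist(points[i], centers[j]) ** 2
                  for i in range(len(points)) for j in range(len(centers)))
    separation = min(math.dist(a, b) ** 2
                     for k, a in enumerate(centers) for b in centers[k + 1:])
    # Guard against coincident centroids (an addition for numerical safety).
    return compact / (len(points) * max(separation, 1e-12))


# Synthetic, purely illustrative data: two well-separated 2-D groups.
rng = random.Random(1)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)]
data += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(30)]

for c in (2, 3, 4):
    centers, u = fuzzy_c_means(data, c)
    print(c, round(partition_coefficient(u), 4),
          round(xie_beni(data, centers, u), 4))
```

Running the loop for several candidate values of c and comparing the printed indices mirrors the selection procedure described above: the preferred c is the one with the largest PC and the smallest XB.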

Let  be the probability of finding a new job in the analysed year. This quantity is allowed to have a value anywhere in the interval between 0 and 1. We use a continuous beta-

curve to model this parameter (or prior beta distribution). The beta-curve depends on two scale parameters, a and b, and Bayesians usually use this beta-curve for the modelling of distribution of probabilities because of its suitable sample space and flexible shape (Albert, 2001). The prior probability distribution was based on our external information. If we use a beta curve for prior density beta a b

 

; , the posterior density could be modeled also by beta- curve. In this case no numeric procedure is necessar, the prior parameters are updated using the information from the specific (usually small) dataset (Bolstadt, 2007). The traditional Bayesian formula to find the posterior probability density of the parameter

posterior  prior likelihood

was used. Values are given in the tables. Expected value of posterior distribution is used as the Bayesian estimator of the probability of reemployment. All calculations were done in MATLAB and R (R Core Team. 2015).
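As a sketch of why no numerical procedure is needed, write $\pi$ for the probability of finding a new job and suppose that the specific dataset contains $m$ unemployed persons, of whom $y$ found a new job, with a binomial likelihood. The symbols $m$ and $y$ and the binomial form of the likelihood are assumptions introduced here for illustration; they are not stated explicitly in the text.

```latex
\begin{align*}
\text{prior:}      \quad & g(\pi) \propto \pi^{a-1}(1-\pi)^{b-1}, \\
\text{likelihood:} \quad & f(y \mid \pi) \propto \pi^{y}(1-\pi)^{m-y}, \\
\text{posterior:}  \quad & g(\pi \mid y) \propto \pi^{a+y-1}(1-\pi)^{b+m-y-1}.
\end{align*}
```

Under these assumptions the posterior is again a beta density with parameters $a + y$ and $b + m - y$, and its expected value, used as the Bayesian estimator, is $(a + y)/(a + b + m)$.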

3. Results

Unemployment rates in the Czech Republic in the five analysed quarters (according to CZSO, 2015) were 6.7 %, 6.8 %, 5.9 % and 5.7 %; these values reflect the economic recovery after the crisis and the associated improvement in the labour market.

There were 4,409 unemployed people in the analysed dataset, of whom 1,078 (24.4 %) found a new job. Fuzzy C-means clustering was applied to three variables (number of household members, age and unemployment duration), with Euclidean distance as the distance between points. As Bezdek (1981) stated, the selected distance measure in fuzzy C-means clustering does not influence the accuracy of the results. The results of fuzzy C-means clustering for the chosen variables are given in Table 1.

Tab. 1 Number of clusters estimated by validity indices

Number of clusters    PC        XB        E
2                     0.6523    0.2344    0.1173
3                     0.5442    0.3210    0.2564
4                     0.3211    0.4431    0.2645
5                     0.2078    0.4875    0.3501
6                     0.2065    0.5001    0.4780

Source: Own calculations
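The selection of the number of clusters can be checked directly against the values in Table 1. The short sketch below assumes that PC is to be maximised and that XB and E are to be minimised, which is the reading under which all three indices agree; the variable names are illustrative.

```python
# Validity index values copied from Table 1: number of clusters -> (PC, XB, E).
table_1 = {
    2: (0.6523, 0.2344, 0.1173),
    3: (0.5442, 0.3210, 0.2564),
    4: (0.3211, 0.4431, 0.2645),
    5: (0.2078, 0.4875, 0.3501),
    6: (0.2065, 0.5001, 0.4780),
}

best_pc = max(table_1, key=lambda c: table_1[c][0])  # largest PC
best_xb = min(table_1, key=lambda c: table_1[c][1])  # smallest XB
best_e = min(table_1, key=lambda c: table_1[c][2])   # smallest E

print(best_pc, best_xb, best_e)  # prints: 2 2 2
```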

From Table 1 we can see that the optimal number of clusters is 2 (based on all three indices). To give a detailed account of the results, we can examine which objects are assigned to these clusters. The first cluster can be referred to as ‘hopelessly unemployed’. To this group the fuzzy clustering process assigned young people without high school education who had been unemployed for more than 18 months, and older people (more than 45 years old) with unemployment duration longer than 18 months. The second cluster can be described as ‘unemployed, but hopeful’. Young people with high school education who had been unemployed for less than 18 months were assigned to this cluster. The education level is crucial for the interpretation of the clusters, although this variable was not used in the clustering procedure, because it is a qualitative variable and only quantitative variables can be used in C-means procedures.

If we model the density of the parameter $\pi$ (the probability of finding a new job for a specific group of unemployed persons in the analysed year), the most important and common problem is the lack of data. If we model the probability that a 23-year-old man with high school education finds a new job, there are usually not enough relevant observations in our sample. It is reasonable to address this problem with the Bayesian approach, which is well suited to models where there are not enough observations for the classical (frequentist) statistical approach. Making inferences from a small dataset (when only a few observations are available, as in our problem) does not make sense from the frequentist point of view. In the Bayesian approach we can use external information to construct a prior density for the estimated parameter. In the case of the 23-year-old man, we collect observations of men of neighbouring ages (21–23) in the same region.

The process is illustrated by two examples: estimation of the probability of finding a new job (in the analysed year) for a 23-year-old man with secondary education from the town of Semily (Liberecký kraj, NUTS3 CZ051) and for a 36-year-old woman with high school education living in Karlovy Vary (Karlovarský kraj, NUTS3 CZ041).

There are only 4 unemployed men aged 23 from Semily in the sample. If the prior information is derived from all male respondents of neighbouring ages (21–23 years) from the whole Liberecký kraj region, we obtain 15 observations.
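A minimal sketch of the beta update for this example follows. Only the sample sizes (15 external and 4 specific observations) come from the text; the split into persons who did and did not find a job is HYPOTHETICAL and chosen for illustration. Building the prior from a uniform Beta(1, 1) plus the external counts, and treating the two samples as separate, are likewise assumptions rather than the documented procedure.

```python
def beta_mean(a, b):
    """Expected value of a Beta(a, b) distribution."""
    return a / (a + b)


def beta_variance(a, b):
    """Variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1))


def update_beta(a, b, successes, failures):
    """Conjugate update of Beta(a, b) with binomial counts."""
    return a + successes, b + failures


# HYPOTHETICAL counts: 5 of the 15 external respondents found a job.
a0, b0 = update_beta(1.0, 1.0, successes=5, failures=10)  # prior Beta(6, 11)

# HYPOTHETICAL counts: 1 of the 4 men aged 23 from Semily found a job.
a1, b1 = update_beta(a0, b0, successes=1, failures=3)     # posterior Beta(7, 14)

print("prior mean:", round(beta_mean(a0, b0), 4),
      "variance:", round(beta_variance(a0, b0), 5))
print("posterior mean:", round(beta_mean(a1, b1), 4),
      "variance:", round(beta_variance(a1, b1), 5))
```

With these illustrative counts the posterior variance is smaller than the prior variance, because the posterior combines the external and the specific observations.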

Source: Own calculations

Fig. 1 Probability distribution of the probability of reemployment (man, 23 years, living in Semily)

The prior density, likelihood and posterior density are presented in Figure 1. We can see how the information contained in the external dataset shifts the expected value upwards (the difference equals 0.021) and reduces the variability of the estimated probability. The remaining high variability of the estimated probability is caused by the small number of observations in the specific dataset (despite the use of the Bayesian estimate).

Tab. 2 Comparison of prior and posterior distributions of the estimated probability
