Whatever definition of ‘local stations’ is adopted, the stations it covers will inevitably form a diverse group. While some of this variation can be captured by model parameters, partitioning the group into smaller groups of stations with similar characteristics could potentially improve model performance. One way to do this is through cluster analysis, a technique for examining patterns within datasets. While it has not previously been explicitly applied in rail demand modelling, cluster analysis has been used to enhance understanding in an extremely wide range of applications. Examples range from distinguishing between different types of tissue in medical imaging (Lasch et al., 2004) to determining target groups for market research (Harrigan, 1985) and finding structural similarities between chemical compounds (Harrison, 1968).

There are three theoretical approaches to cluster analysis. The first is clustering by division, which involves starting with one cluster containing all objects, working out the best way to divide it in two, and repeating this to achieve larger numbers of clusters. This method is impractical for anything other than a tiny dataset, because a dataset of size $n$ has $2^{n-1}-1$ possible first divisions, all of which have to be tested to find the best one (Waterson, 2009).
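To give a sense of how quickly the number of first divisions grows, the following sketch counts them, assuming that a first division means an unordered split of the objects into two non-empty clusters. The function names are illustrative only; the closed form is checked against brute-force enumeration for small datasets.

```python
from itertools import combinations


def count_first_divisions(n: int) -> int:
    """Closed-form count of unordered splits of n objects into two non-empty clusters."""
    return 2 ** (n - 1) - 1


def enumerate_first_divisions(n: int) -> int:
    """Brute-force count of the same splits; only feasible for small n."""
    objects = frozenset(range(n))
    seen = set()
    for size in range(1, n):
        for subset in combinations(objects, size):
            left = frozenset(subset)
            right = objects - left
            # The split {left, right} is the same division as {right, left}.
            seen.add(frozenset((left, right)))
    return len(seen)


# The closed form agrees with enumeration where enumeration is practical.
for n in (2, 5, 10):
    assert count_first_divisions(n) == enumerate_first_divisions(n)

print(count_first_divisions(20))   # 524287
print(count_first_divisions(100))  # roughly 6.3e29
```

Even at 20 objects there are over half a million candidate first divisions, which is why exhaustive divisive clustering is rarely practical.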

The second approach, known as partition clustering, uses an iterative nearest-neighbour approach. The number of clusters is chosen in advance, and random points are then chosen to represent cluster centroids. Each data point is assigned to its nearest centroid, and the resulting clusters are then used to calculate new centroids. This process is repeated until the clusters do not change from one iteration to the next (Hawkins et al., 1982). While this method can give good results, Waterson (2009) identified several issues affecting its use. Firstly, the choice of initial centroids may affect results, as the procedure may only identify local optima rather than minimising global error. There may also be problems representing multi-dimensional observations as a single point. Finally, specifying the number of clusters in advance may result in an artificial structure being imposed on the dataset.
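The iterative procedure described above can be sketched in a few lines. This is a minimal illustration and not the implementation used by any of the cited authors: the function names, the made-up observations and the choice of ten random restarts are all assumptions. The restarts are one common way of reducing the risk of stopping at a poor local optimum.

```python
import random
from math import dist


def k_means(points, initial_centroids, max_iter=100):
    """Assign points to their nearest centroid, recompute centroids, repeat until stable."""
    centroids = list(initial_centroids)
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Each point takes the index of its nearest centroid.
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break  # no point changed cluster, so the procedure has converged
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


def within_cluster_ss(points, labels, centroids):
    """Total squared distance from each point to the centroid of its cluster."""
    return sum(dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


# Made-up two-variable observations, purely for illustration.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]

# Because the starting centroids can change the outcome, run several random
# starts and keep the solution with the lowest within-cluster error.
rng = random.Random(0)
runs = [k_means(points, rng.sample(points, 2)) for _ in range(10)]
best_labels, best_centroids = min(runs, key=lambda r: within_cluster_ss(points, *r))
print(best_labels)
```

Note that the number of clusters (here two, implied by the number of starting centroids) still has to be supplied by the analyst, which is the third of the issues raised above.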

The third approach, hierarchical clustering, involves the creation of a series of partitions in the dataset (Everitt et al., 2001) running from a single cluster containing all individuals to $n$ clusters each containing a single individual. This gives a graph called a dendrogram where the vertical scale represents the dissimilarity of the clusters being combined (Waterson, 2009). This method can be used to discover structure in data that is not readily apparent by visual inspection (Aldenderfer & Blashfield, 1984), an important consideration for data on phenomena such as rail demand which are determined by a large number of independent but related variables. While hierarchical clustering provides a convenient way of partitioning datasets, Hawkins et al. (1982) warn that care should be taken when applying such methods to datasets where there is not necessarily an underlying hierarchical structure. This is not to say that cluster analysis is useless in such cases, merely that the cluster solutions produced should not be reified (Aldenderfer & Blashfield, 1984).
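As a rough illustration of how the series of partitions is built, the sketch below works agglomeratively: it starts from single-observation clusters and repeatedly merges the two closest clusters until one remains. It assumes nearest-neighbour linkage and Euclidean distance; the function name and the one-variable data are made up for the example. The recorded merge distances correspond to the heights on the vertical scale of a dendrogram.

```python
from math import dist


def nearest_neighbour_merges(points):
    """Return the sequence of merges as (cluster_a, cluster_b, dissimilarity)."""
    clusters = [[i] for i in range(len(points))]  # start with n singleton clusters
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Nearest-neighbour linkage: the closest pair across the two clusters.
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)] + [merged]
    return merges


# Made-up one-variable observations, purely for illustration.
points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
merges = nearest_neighbour_merges(points)
print([height for _, _, height in merges])  # [1.0, 1.0, 4.0, 14.0]
```

The large jump in merge height before the final merge suggests that the last observation sits apart from the rest, but as noted above such a structure is a product of the method as much as of the data and should be interpreted with care.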

A number of agglomerative clustering methods are available for use with hierarchical cluster analysis, and Everitt et al. (2001) summarised these as shown in Table 3.2. They concluded that no one hierarchical clustering method can be recommended above others, but that different methods may give very different results on the same data. It is therefore sensible to test multiple methods to identify which gives optimal results for a particular dataset.

Table 3.2: Agglomerative clustering methods for hierarchical cluster analysis

| Name | Distance between clusters defined as | Comments |
| --- | --- | --- |
| Nearest neighbour | Minimum distance between pair of objects, one in one cluster, one in the other | Tends to produce unbalanced and straggly clusters, particularly in large data sets. Does not take account of cluster structure. Sensitive to observational error. |
| Furthest neighbour | Maximum distance between pair of objects, one in one cluster, one in the other | Tends to find compact clusters with equal maximum distance between objects. Does not take account of cluster structure. |
| Between-groups linkage | Average distance between pair of objects, one in one cluster, one in the other | Tends to join clusters with small variances. Intermediate between nearest and furthest neighbour. Takes account of cluster structure. Relatively robust. |
| Within-groups linkage | Weighted average distance between pair of objects, one in one cluster, one in the other, according to inverse of number of objects in each class | Intermediate between nearest and furthest neighbour. Takes account of cluster structure. Relatively robust. |
| Centroid clustering | Squared Euclidean distance between mean vectors (centroids) | Assumes points can be represented in Euclidean space for geometrical interpretation. More numerous group dominates merged cluster. Subject to reversals. |
| Median clustering | Squared Euclidean distance between weighted centroids | Assumes points can be represented in Euclidean space for geometrical interpretation. New group intermediate in position between merged groups. Subject to reversals. |
| Ward’s method | Increase in sum of squares within clusters, after fusion, summed over all variables | Assumes points can be represented in Euclidean space for geometrical interpretation. Tends to find same size, spherical clusters. Sensitive to outliers. |
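Several of the definitions in Table 3.2 can be written down directly. The sketch below is an illustrative reading of five of them (nearest neighbour, furthest neighbour, between-groups linkage, centroid clustering and Ward's method) for two clusters given as lists of points; the function names and the example clusters are assumptions, and within-groups linkage and median clustering are left out.

```python
from math import dist
from statistics import fmean


def pair_distances(a, b):
    """Euclidean distances for every pair with one point in each cluster."""
    return [dist(p, q) for p in a for q in b]


def nearest_neighbour(a, b):
    return min(pair_distances(a, b))


def furthest_neighbour(a, b):
    return max(pair_distances(a, b))


def between_groups(a, b):
    return fmean(pair_distances(a, b))


def centroid(cluster):
    """Mean vector of a cluster."""
    return tuple(fmean(col) for col in zip(*cluster))


def centroid_distance(a, b):
    """Squared Euclidean distance between the two cluster centroids."""
    return dist(centroid(a), centroid(b)) ** 2


def sum_of_squares(cluster):
    """Squared distances of the points from their own centroid, summed over all variables."""
    c = centroid(cluster)
    return sum(dist(p, c) ** 2 for p in cluster)


def ward_increase(a, b):
    """Increase in within-cluster sum of squares caused by fusing a and b."""
    return sum_of_squares(a + b) - sum_of_squares(a) - sum_of_squares(b)


# Made-up clusters, purely for illustration.
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(4.0, 0.0), (4.0, 2.0)]
for measure in (nearest_neighbour, furthest_neighbour, between_groups,
                centroid_distance, ward_increase):
    print(measure.__name__, round(measure(a, b), 3))
```

Because the measures are on different scales and respond differently to cluster shape, the same pair of clusters can rank very differently against other candidate merges depending on which one is used, which is one reason different methods can give different results on the same data.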

A number of measures of similarity between observations are also available, and their advantages and disadvantages are summarised in Table 3.3, based on a discussion by Aldenderfer & Blashfield (1984). Once clusters have been produced, they can be examined to establish what it is about the observations within each cluster that makes them similar. This allows a decision to be made on whether or not the dataset should be

Table 3.3: Comparison of measures of similarity between observations

| Method | Advantages | Disadvantages |
| --- | --- | --- |
| Pearson correlation | Not affected by dispersion and size differences between variables (can be advantage or disadvantage) | Two profiles may have correlation of +1.0 but not be identical. Use to calculate correlation of cases does not make statistical sense. |
| Euclidean distance | Simple to calculate. Has intuitive appeal. | Involves use of square root. Estimation of similarity between cases strongly affected by size of variables (can be overcome by standardisation of variables). |
| Squared Euclidean distance | Simple to calculate. Has intuitive appeal. Avoids use of square root. | Estimation of similarity between cases strongly affected by size of variables (can be overcome by standardisation of variables). |
| Minkowski metric distance function | | Estimation of similarity between cases strongly affected by size of variables (can be overcome by standardisation of variables). |
| Manhattan distance | | May impose non-existent structure on relationships between variables. |
| Chebychev distance | Emphasises extreme values if these are expected to be important in clustering | If extreme values unimportant will give them undue weight. |
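The measures compared in Table 3.3 are simple to compute, and a short sketch makes some of the listed properties concrete. The function names and the two example profiles are assumptions made for illustration; `standardise` uses z-scores, which is one common way of standardising variables and not necessarily the one the cited authors had in mind.

```python
from math import sqrt
from statistics import fmean, pstdev


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def euclidean(x, y):
    return sqrt(squared_euclidean(x, y))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def chebychev(x, y):
    """Largest difference on any single variable."""
    return max(abs(a - b) for a, b in zip(x, y))


def minkowski(x, y, p):
    """General form: p=1 gives Manhattan distance, p=2 gives Euclidean distance."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


def pearson(x, y):
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def standardise(rows):
    """Convert each variable (column) to z-scores so that no variable dominates by scale."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    sds = [pstdev(col) for col in columns]  # assumes no variable is constant
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds)) for row in rows]


# Two made-up profiles: y is exactly twice x.
x = (1.0, 2.0, 3.0)
y = (2.0, 4.0, 6.0)
print(pearson(x, y))                     # 1.0
print(round(euclidean(x, y), 3))         # 3.742
print(manhattan(x, y), chebychev(x, y))  # 6.0 3.0
```

The two profiles have a correlation of +1.0 although they are clearly not identical, while every distance measure reports a non-zero separation, with the third variable contributing most because its values are largest.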
