Whatever definition of ‘local stations’ is adopted, the stations it describes will inevitably be a diverse group. While some of this variation can be captured by model parameters, partitioning the group into smaller groups of stations with similar characteristics could potentially improve model performance. One way to do this is through cluster analysis, a technique for examining patterns within datasets. While it has not previously been explicitly applied in rail demand modelling, cluster analysis has been used to enhance understanding in an extremely wide range of applications. Examples range from distinguishing between different types of tissue in medical imaging (Lasch et al., 2004) to determining target groups for market research (Harrigan, 1985) and finding structural similarities between chemical compounds (Harrison, 1968).
There are three theoretical approaches to cluster analysis. The first is clustering by division, which involves starting with one cluster containing all objects, working out the best way to divide it in two, and repeating this to achieve larger numbers of clusters. This method is impractical for anything other than a tiny dataset because, for a dataset of size n, there are 2^(n-1) - 1 possible first divisions, all of which have to be tested to find the best division (Waterson, 2009).
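The scale of this problem can be illustrated with a short sketch. The number of ways of splitting n labelled objects into two non-empty groups is 2^(n-1) - 1, and the Python below checks this closed form against a brute-force enumeration for a small n. The function names are hypothetical and chosen purely for illustration.

```python
from itertools import combinations


def count_first_divisions(n):
    """Closed form: ways to split n labelled objects into two non-empty groups."""
    return 2 ** (n - 1) - 1


def enumerate_first_divisions(n):
    """Brute-force list of every two-way split; only feasible for small n."""
    objects = list(range(n))
    rest = objects[1:]
    divisions = []
    # Object 0 is always placed in the first group so that each unordered
    # split is generated exactly once. Taking fewer than len(rest) extra
    # objects guarantees the second group is never empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            first = {0, *extra}
            second = set(objects) - first
            divisions.append((first, second))
    return divisions


print(len(enumerate_first_divisions(4)))  # 7
print(count_first_divisions(30))          # 536870911
```

Even a dataset of 30 objects would therefore require over 500 million candidate divisions to be evaluated at the first step alone.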
The second approach, known as partition clustering, uses an iterative nearest-neighbour approach. The number of clusters is chosen in advance, and random points are then chosen to represent cluster centroids. Each data point is assigned to its nearest centroid, and the resulting clusters are then used to calculate new centroids. This process is repeated until the clusters do not change from one iteration to the next (Hawkins et al., 1982). While this method can give good results, Waterson (2009) identified several issues affecting its use. Firstly, the choice of initial centroids may affect results, as the procedure may only identify local optima rather than minimising global error. There may also be problems representing multi-dimensional observations as a single point. Finally, specifying the number of clusters in advance may result in an artificial structure being imposed on the dataset.
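The iterative procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in any of the cited studies: the function names, the use of squared Euclidean distance, and the choice to draw the random starting centroids from the data points themselves are all assumptions made for the example. An optional `initial_centroids` argument (also hypothetical) is included so that the sensitivity to starting positions can be explored.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def partition_cluster(points, k, initial_centroids=None, seed=0, max_iter=100):
    """Assign points to k clusters by repeated nearest-centroid assignment."""
    rng = random.Random(seed)
    # Assumption: random starting centroids are sampled from the data itself.
    centroids = list(initial_centroids) if initial_centroids else rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign every point to the index of its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # clusters unchanged since the last iteration
        assignment = new_assignment
        # Recalculate each centroid from its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean_point(members)
    return assignment, centroids


stations = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centres = partition_cluster(
    stations, k=2, initial_centroids=[(0, 0), (10, 10)]
)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Rerunning the function with different seeds or starting centroids on less clearly separated data is a simple way of checking whether a solution is stable or merely a local optimum.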
The third approach, hierarchical clustering, involves the creation of a series of partitions in the dataset (Everitt et al., 2001) running from a single cluster containing all individuals to n clusters each containing a single individual. This gives a graph called a dendrogram where the vertical scale represents the dissimilarity of the clusters being combined (Waterson, 2009). This method can be used to discover structure in data that is not readily apparent by visual inspection (Aldenderfer & Blashfield, 1984), an important consideration for data on phenomena such as rail demand which are determined by a large number of independent but related variables. While hierarchical clustering provides a convenient way of partitioning datasets, Hawkins et al. (1982) warn that care should be taken when applying such methods to datasets where there is not necessarily an underlying hierarchical structure. This is not to say that cluster analysis is useless in such cases, merely that the cluster solutions produced should not be reified (Aldenderfer & Blashfield, 1984).
A number of agglomerative clustering methods are available for use with hierarchical cluster analysis, and Everitt et al. (2001) summarised these as shown in Table 3.2. They concluded that no one hierarchical clustering method can be recommended above the others, but that different methods may give very different results on the same data. It is therefore sensible to test multiple methods to identify which gives the best results for a particular dataset.
Table 3.2: Agglomerative clustering methods for hierarchical cluster analysis

| Name | Distance between clusters defined as | Comments |
|---|---|---|
| Nearest neighbour | Minimum distance between pair of objects, one in one cluster, one in the other | Tends to produce unbalanced and straggly clusters, particularly in large data sets; does not take account of cluster structure; sensitive to observational error |
| Furthest neighbour | Maximum distance between pair of objects, one in one cluster, one in the other | Tends to find compact clusters with equal maximum distance between objects; does not take account of cluster structure |
| Between-groups linkage | Average distance between pair of objects, one in one cluster, one in the other | Tends to join clusters with small variances; intermediate between nearest and furthest neighbour; takes account of cluster structure; relatively robust |
| Within-groups linkage | Weighted average distance between pair of objects, one in one cluster, one in the other, according to inverse of number of objects in each class | Intermediate between nearest and furthest neighbour; takes account of cluster structure; relatively robust |
| Centroid clustering | Squared Euclidean distance between mean vectors (centroids) | Assumes points can be represented in Euclidean space for geometrical interpretation; more numerous group dominates merged cluster; subject to reversals |
| Median clustering | Squared Euclidean distance between weighted centroids | Assumes points can be represented in Euclidean space for geometrical interpretation; new group intermediate in position between merged groups; subject to reversals |
| Ward’s method | Increase in sum of squares within clusters, after fusion, summed over all variables | Assumes points can be represented in Euclidean space for geometrical interpretation; tends to find same size, spherical clusters; sensitive to outliers |
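To illustrate how the choice of agglomerative method changes the dissimilarity at which clusters are joined, the sketch below implements the first three methods in Table 3.2 (nearest neighbour, furthest neighbour and between-groups linkage) in plain Python. It is a deliberately naive illustration under stated assumptions: Euclidean distance is used between observations, the function and dictionary names are hypothetical, and every pair of clusters is re-evaluated at each step, so it is only suitable for very small datasets.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# How the pairwise distances between two clusters are summarised.
LINKAGES = {
    "nearest neighbour": min,
    "furthest neighbour": max,
    "between-groups": lambda ds: sum(ds) / len(ds),
}


def cluster_distance(c1, c2, points, linkage):
    """Distance between two clusters, each a tuple of point indices."""
    dists = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    return LINKAGES[linkage](dists)


def agglomerate(points, linkage, n_clusters=1):
    """Merge the closest pair of clusters until n_clusters remain.

    Returns the remaining clusters and the list of merges made, each
    recorded as (cluster_a, cluster_b, distance_at_merge).
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], points, linkage),
        )
        merges.append((a, b, cluster_distance(a, b, points, linkage)))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return clusters, merges


data = [(0,), (2,), (6,), (9,), (10,)]
for name in LINKAGES:
    _, merges = agglomerate(data, name)
    print(name, [round(d, 2) for _, _, d in merges])
# nearest neighbour [1.0, 2.0, 3.0, 4.0]
# furthest neighbour [1.0, 2.0, 4.0, 10.0]
# between-groups [1.0, 2.0, 3.5, 7.33]
```

The merge distances are the heights that would be plotted on the vertical axis of a dendrogram. In this small example all three methods produce the same sequence of merges but at different heights; on larger datasets the partitions themselves can differ. Library routines such as SciPy's `scipy.cluster.hierarchy.linkage` provide these and the remaining methods (centroid, median and Ward) far more efficiently.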
A number of measures of similarity between observations are also available, and their advantages and disadvantages are summarised in Table 3.3, based on a discussion by Aldenderfer & Blashfield (1984). Once clusters have been produced, they can be
examined to establish what it is about the observations within each cluster that makes them similar. This allows a decision to be made on whether or not the dataset should be
Table 3.3: Comparison of measures of similarity between observations

| Method | Advantages | Disadvantages |
|---|---|---|
| Pearson correlation | Not affected by dispersion and size differences between variables (can be advantage or disadvantage) | Two profiles may have correlation of +1.0 but not be identical; use to calculate correlation of cases does not make statistical sense |
| Euclidean distance | Simple to calculate; has intuitive appeal | Involves use of square root; estimation of similarity between cases strongly affected by size of variables (can be overcome by standardisation of variables) |
| Squared Euclidean distance | Simple to calculate; has intuitive appeal; avoids use of square root | Estimation of similarity between cases strongly affected by size of variables (can be overcome by standardisation of variables) |
| Minkowski metric distance function | | Estimation of similarity between cases strongly affected by size of variables (can be overcome by standardisation of variables) |
| Manhattan distance | | May impose non-existent structure on relationships between variables |
| Chebychev distance | Emphasises extreme values if these are expected to be important in clustering | If extreme values unimportant will give them undue weight |
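The measures in Table 3.3 can be written out as a short Python sketch, which also illustrates two of the points made in the table: that two profiles can have a correlation of +1.0 without being identical, and that standardising variables removes the dominance of large-valued variables in distance calculations. The function names and the example values are hypothetical, and the standardisation shown (subtracting the mean and dividing by the population standard deviation) is one common choice assumed here rather than the method prescribed by the cited authors.

```python
import math
import statistics


def minkowski(a, b, p):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    return math.sqrt(squared_euclidean(a, b))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def chebychev(a, b):
    """Largest absolute difference on any single variable."""
    return max(abs(x - y) for x, y in zip(a, b))


def pearson(a, b):
    """Pearson correlation between two profiles (cases)."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)


def standardise(rows):
    """Convert each variable (column) to z-scores; assumes no constant column."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, sds))
        for row in rows
    ]


# Two profiles that are perfectly correlated but clearly not identical.
a, b = (1, 2, 3), (2, 4, 6)
print(round(pearson(a, b), 6))   # 1.0
print(squared_euclidean(a, b))   # 14
print(manhattan(a, b))           # 6
print(chebychev(a, b))           # 3
```

Passing a list of observations through `standardise` before calculating distances gives each variable a mean of zero and unit variance, so that no single variable dominates purely because of the units in which it is measured.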