

6.2 On the Determinants of Security and Sensitivity

6.2.1 Mechanisms of Accountability, Group Strength and Security

While developments in theoretical cluster analysis allow an improved cluster set to be determined from available data, there is further scope for advancing cluster analysis by improving data preparation techniques so that they reflect the assumptions intrinsic to the method. Here, a novel methodology is presented for incorporating Cartesian data into analyses where spatially discrete clusters are required.

For demonstration purposes an example is given. A random set of four-dimensional data is generated, with variates i and ii representing Cartesian data and variates iii and iv representing non-Cartesian data, recorded at 30 nodes (Figure 3.1). In order to demonstrate the lack of preparation required for non-Cartesian data, variate i is sampled from the uniform distribution 𝑈(2000,3000), variate ii is sampled from the uniform distribution 𝑈(5000,10000), and variates iii and iv are sampled from the uniform distribution 𝑈(0,1).
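The example data described above can be generated along the following lines. This is a minimal sketch using NumPy; the seed and the array names `cartesian` and `non_cartesian` are assumptions introduced for illustration and do not come from the original analysis.

```python
import numpy as np

# The seed is an arbitrary choice, used only to make the sketch repeatable.
rng = np.random.default_rng(seed=0)
n_nodes = 30

# Cartesian variates: i ~ U(2000, 3000), ii ~ U(5000, 10000)
cartesian = np.column_stack([
    rng.uniform(2000, 3000, n_nodes),   # variate i
    rng.uniform(5000, 10000, n_nodes),  # variate ii
])

# Non-Cartesian variates: iii and iv ~ U(0, 1)
non_cartesian = rng.uniform(0, 1, size=(n_nodes, 2))
```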

Spatially discrete clusters are defined as those where all members of all non-singleton cluster sets are adjacent to at least one member of the same cluster set. In many cases the desired outcome of a cluster analysis is an optimal fit of clusters to the data present. However, there are cases in which a secondary consideration is the generation of spatially discrete clusters, for example where a single node representing a gauging site is to be selected to represent a number of other nodes (Lin & Chen 2006) or where cluster analysis is implemented as a precursor to interpolation (Abedini et al. 2008). While it is accepted that this represents a departure from the optimal empirical cluster set, the use of spatially discrete clusters reduces the complexity of further analysis and may give a more physically representative clustering where data with high noise are being evaluated.

Even where spatially discrete clustering is not a stated objective, cluster analyses frequently incorporate Cartesian data for this purpose. The inclusion of two variables providing information based on the same observation introduces a bias towards this information in clustering, as well as compromising independence in input error. Scaling of Cartesian data is also problematic: the variables should not be scaled separately, as this distorts their information content, since a unit of distance in one variable will not necessarily be consistently scaled in the other. Joint scaling is likewise not straightforward, and care must be taken to preserve the relationship between the variables. In none of the papers reviewed where spatial data were included was this practice followed. Abedini et al. (2008) implement an iterative weighting algorithm, which can generate spatially discrete clusters at the cost of introducing bias to the clustering, yet this methodology does not guarantee spatially discrete clusters.

A straightforward approach is introduced here which ensures spatially discrete clustering, does not introduce bias, and reduces the complexity of the clustering problem by reducing the problem dimensions. The methodology is in effect a modified distance metric, applicable to hierarchical clustering or the k-medoids approach.

i) The data 𝐷 are separated into Cartesian (𝐷𝐶) and non-Cartesian (𝐷𝑛) data subsets, where all data refer to the set of nodes Ν.

ii) Cartesian data are used to generate a Delaunay triangulation map 𝐷𝑇(𝐷𝐶) (Figure 3.2).

iii) The Delaunay map is used to determine adjacency between nodes, with node pairs sharing an edge being ascribed an adjacency of 1, with adjacency being 0 in all other cases, such that

$$a(x, y) = \begin{cases} 1 & \text{if } (x, y) \in DT(D_C) \\ 0 & \text{otherwise} \end{cases}$$

where x and y are members of Ν.

iv) The data are then considered in (variable iii, variable iv) space representative of the non-Cartesian data (Figure 3.3).

v) A modified Euclidean distance is applied such that the distance between nodes is the Euclidean distance for those with an adjacency of 1 and the minimum distance passing through only nodes connected in Cartesian space in all other cases, such that
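One formalisation consistent with the verbal description in step v is sketched below. It is an assumption rather than the author's exact expression: the symbol d' for the modified distance, d_E for the Euclidean distance in (variable iii, variable iv) space, and the reading of m as the maximum number of Delaunay edges permitted in a path are all introduced here for illustration.

```latex
d'(x, y) =
\begin{cases}
d_E(x, y) & \text{if } a(x, y) = 1 \\[6pt]
\min\limits_{\substack{x = p_0, p_1, \dots, p_k = y \\ k \le m,\; a(p_{j-1}, p_j) = 1}}
\; \sum_{j=1}^{k} d_E(p_{j-1}, p_j) & \text{otherwise}
\end{cases}
```

Under this reading, every term in the sum is the distance between two nodes that share a Delaunay edge, so the path never leaves the network defined in Cartesian space.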


In the practical application of this method, setting the maximum 𝑚 to the total number of nodes can substantially increase processing time with little benefit. Here, a maximum value of 5 was used, with no discernible loss of model skill. However, hypothetically this may bias the clustering toward linking points which are close in Cartesian space but further apart in non-Cartesian space.
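Steps i to v, together with the cap on 𝑚, can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used in the study: 𝑚 is interpreted as the maximum number of Delaunay edges in a path, the function names `delaunay_adjacency` and `modified_distance` are hypothetical, and SciPy's `Delaunay` and `cdist` are used for the triangulation and the Euclidean distances.

```python
import numpy as np
from scipy.spatial import Delaunay
from scipy.spatial.distance import cdist


def delaunay_adjacency(cartesian):
    """Boolean matrix, True where two nodes share a Delaunay edge."""
    n = len(cartesian)
    adj = np.zeros((n, n), dtype=bool)
    for simplex in Delaunay(cartesian).simplices:
        for i in simplex:
            for j in simplex:
                if i != j:
                    adj[i, j] = True
    return adj


def modified_distance(cartesian, non_cartesian, max_steps=5):
    """Shortest path length in non-Cartesian space, restricted to paths of
    at most max_steps Delaunay edges (assumed meaning of m)."""
    adj = delaunay_adjacency(cartesian)
    euclid = cdist(non_cartesian, non_cartesian)

    # One-step distances: Euclidean for adjacent nodes, infinite otherwise.
    edge = np.where(adj, euclid, np.inf)
    np.fill_diagonal(edge, 0.0)

    dist = edge.copy()
    for _ in range(max_steps - 1):
        # Min-plus product: allow each path one further edge. Adjacent pairs
        # keep their direct distance because of the triangle inequality.
        dist = np.min(dist[:, :, None] + edge[None, :, :], axis=1)
    return dist
```

Pairs of nodes that cannot be connected within `max_steps` edges are left at infinity in this sketch, so they would need to be capped or otherwise handled before being passed to a clustering routine.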

In cases such as the Isle of Wight dataset, where spatial locations are not random but are restricted to the onshore regions of the mapped area, a maximum edge length or minimum triangle angle filter can be applied to the Delaunay triangulation results to remove linkages considered unrealistic, in this case those crossing the sea. This distance metric is compatible only with clustering methodologies that accept a distance matrix derived a priori, rather than those requiring ad hoc recalculation of distances between points sampled from the continuous projection space between variables. In essence, this includes hierarchical approaches but excludes many partition approaches. k-medoids is a notable exception and is used to compare clustering results with those from the hierarchical Ward’s method approach.
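The two ideas in this paragraph, filtering implausible Delaunay edges and clustering from a distance matrix computed a priori, can be sketched as below. The function names and the `max_edge_length` parameter are hypothetical. Average linkage is used as a stand-in: SciPy's Ward implementation formally assumes Euclidean distances, so this sketch does not reproduce the Ward's method results referred to above. Infinite distances are capped at an arbitrary large value, which is an assumption of this sketch.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial import Delaunay
from scipy.spatial.distance import squareform


def filtered_delaunay_edges(cartesian, max_edge_length):
    """Delaunay edges (i, j) with i < j, no longer than max_edge_length."""
    cartesian = np.asarray(cartesian, dtype=float)
    edges = set()
    for simplex in Delaunay(cartesian).simplices:
        for a in range(3):
            i, j = sorted((int(simplex[a]), int(simplex[(a + 1) % 3])))
            if np.linalg.norm(cartesian[i] - cartesian[j]) <= max_edge_length:
                edges.add((i, j))
    return edges


def cluster_from_matrix(dist, n_clusters, method="average"):
    """Hierarchical clustering from a precomputed square distance matrix."""
    dist = np.array(dist, dtype=float)
    finite = np.isfinite(dist)
    # linkage() rejects non-finite values, so unreachable pairs are capped
    # above the largest finite distance (an arbitrary choice).
    dist[~finite] = 2.0 * dist[finite].max() + 1.0
    dist = (dist + dist.T) / 2.0  # remove any rounding asymmetry
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method=method)
    return fcluster(tree, t=n_clusters, criterion="maxclust")
```

A triangle angle filter could be written in the same way by testing each simplex rather than each edge.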


Figure 3.1. Example data plotted in four dimensions. Variable IV is shown on the colour bar.

Figure 3.2. Variables I and II of the example data plotted and linked using Delaunay triangulation.


Figure 3.3. Variables III and IV plotted showing linkage generated in Figure 3.2.

Figure 3.4. Information from Figure 3.3 is clustered, with colours representing clusters.


Figure 3.5. Nodes are plotted in Variable I, II space, and are again coloured by cluster. Note that every node is linked to at least one of the same colour while incorporating the information in Variables III and IV in the clustering. These can now be considered as groups of point information or interpolated. Care must be taken as even these random data generate clusters.
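The property noted for Figure 3.5, that every node in a non-singleton cluster is linked to at least one node of the same cluster, can be checked programmatically. The following is a minimal sketch; the function name `is_spatially_discrete` and its inputs (a vector of cluster labels and a Boolean adjacency matrix such as one derived from the Delaunay triangulation) are assumptions made for illustration.

```python
import numpy as np


def is_spatially_discrete(labels, adjacency):
    """True if every member of each non-singleton cluster is adjacent
    to at least one other member of the same cluster."""
    labels = np.asarray(labels)
    adjacency = np.asarray(adjacency, dtype=bool)
    for cluster in np.unique(labels):
        members = np.flatnonzero(labels == cluster)
        if members.size == 1:
            continue  # singletons are exempt by definition
        sub = adjacency[np.ix_(members, members)].copy()
        np.fill_diagonal(sub, False)  # a node is not its own neighbour
        if not sub.any(axis=1).all():
            return False
    return True


# Four nodes linked in a chain: 0-1-2-3
chain = np.zeros((4, 4), dtype=bool)
for i in range(3):
    chain[i, i + 1] = chain[i + 1, i] = True

print(is_spatially_discrete([1, 1, 2, 2], chain))  # True
print(is_spatially_discrete([1, 2, 1, 2], chain))  # False
```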