The 2-D profile allows for a much clearer clustering depiction and a more precise display and comparison of individual cluster properties. It requires less user interaction to explore the data and conveys structure at both a global and a local scale.

Global structure. The 2-D profile has the same topology as its input merge tree in the sense that every horizontal line cut through the profile intersects as many hills as the tree has superarcs containing that value. It thus quickly reveals the number of clusters and their hierarchy, and whether clusters are separated or surrounded by noise. However, it is important to recall that Euclidean distance does not reflect similarity in the topological context. In a topological landscape, structural information is conveyed only by the hierarchy of the hills and the valleys between them. Figure 4.9 provides an example: if multiple neighboring hills share valleys at zero height, this indicates only that these regions are well separated. It says nothing about how far apart they are in the original domain. Because the profile’s topology is invariant under changes to the position of these hills, we cannot say that a particular cluster is “closer” to the cluster of a neighboring hill than to any other separated cluster. Such information about distances is not captured by the density function’s topology. If at all, spatial relations can be derived only from valleys above zero height. In this case, the non-zero density saddles indicate a spatial overlap, e.g. subclusters, in terms of the adjusted filter radius σ. Note that a large filter radius could also merge regions that are actually separated.

Therefore, the first step to reading and understanding topology-based visualizations correctly is to resist interpreting distances between hills in any way other than through the valleys between them. While this may appear counterintuitive at first sight, it is precisely this abstraction that allows us to preserve the clustering in lower dimensions.
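The cut property described above can be illustrated with a minimal Python sketch. The encoding of a superarc as a `(low, high)` pair of density values, the function name `hills_cut_at`, and the half-open boundary convention are assumptions made for illustration; they are not taken from the thesis.

```python
from typing import List, Tuple

# Hypothetical encoding: a superarc is the density range (low, high) it spans.
Superarc = Tuple[float, float]


def hills_cut_at(superarcs: List[Superarc], height: float) -> int:
    """Count the superarcs whose density range contains `height`.

    Assumed convention: a superarc contains its upper end but not its lower end.
    """
    return sum(1 for low, high in superarcs if low < height <= high)


# Toy merge tree: root at density 0, one saddle at 2 with maxima at 5 and 3,
# and one separate maximum at 4 attached directly to the root.
example_arcs = [(0.0, 2.0), (2.0, 5.0), (2.0, 3.0), (0.0, 4.0)]
cuts = [hills_cut_at(example_arcs, h) for h in (1.0, 2.5, 3.5, 4.5)]
```

In this toy tree, a cut just above zero meets two hills, which corresponds to two well-separated regions; the count says nothing about how far apart those regions lie in the original domain.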

There is little room for optimization without changing the profile’s topology. One possible modification is to accentuate well-separated clusters by increasing the gap between those hills that are separated by a valley at zero height (cf. Figure 4.9b). Another optimization is to sort the subtrees of saddle nodes. This changes the position of the corresponding hills but still preserves the profile’s topology. For example, in Figure 4.9b, all subtrees were sorted by persistence. This places the most prominent features on the left and gives the profile a global downward trend from left to right. Placing similarly persistent hills next to each other makes them easy to compare. Nevertheless, sorting the subtrees of saddle nodes cannot exchange arbitrary hills, since such an exchange would quickly destroy the profile’s topology. Sorting subtrees by cluster size or stability is also possible. Hills could even be sorted by inter-cluster distance in the original domain. However, the benefit is very limited because each hill has only two neighbors. Sorting by topology-driven quality measures better reflects the topological context of the profile itself, and these measures can also be preserved without loss in two dimensions.
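A minimal Python sketch of sorting subtrees at saddle nodes follows. The `Node` class, the helper names, and the choice to measure a subtree's persistence as its highest density minus the density of the parent saddle are assumptions made for illustration, not the thesis's implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    # Hypothetical merge-tree node: leaves are density maxima, inner nodes are saddles.
    name: str
    density: float
    children: List["Node"] = field(default_factory=list)


def max_density(node: Node) -> float:
    """Highest density found in the subtree rooted at `node`."""
    return max([node.density] + [max_density(c) for c in node.children])


def sort_by_persistence(node: Node) -> None:
    """Reorder children in place so the most persistent subtree comes first."""
    for child in node.children:
        sort_by_persistence(child)
    # Assumed measure: subtree maximum minus the density of this node.
    node.children.sort(key=lambda c: max_density(c) - node.density, reverse=True)


def leaf_order(node: Node) -> List[str]:
    """Left-to-right order of the hills (leaves) in the profile."""
    if not node.children:
        return [node.name]
    return [name for c in node.children for name in leaf_order(c)]


saddle = Node("S", 2.0, [Node("C", 4.0), Node("A", 6.0)])
root = Node("root", 0.0, [Node("B", 3.0), saddle])
sort_by_persistence(root)
order = leaf_order(root)
```

Only siblings are reordered, so the tree structure, and with it the profile's topology, is unchanged. In this example no sorting can place hill B between A and C, because A and C hang below the same saddle.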

Local cluster information. Similar to the 3-D landscape, hills in the profile describe individual cluster properties. While a hill’s height and the width of its base line still denote the cluster’s persistence and size, respectively, a hill’s shape additionally reflects the cluster’s stability. This is because the profile is essentially a topology-based serialization of the input points on the x-axis together with their densities on the y-axis. The shape of the profile results solely from ordering the implicitly stored regular nodes so that they form a hill for leaf superarcs and a slope for superarcs that connect two saddles. The convention is that, at each height, the width of a hill reflects the number of points that have at least this density. Therefore, the hill’s shape accurately reflects the density distribution of the points, i.e. the cluster’s stability as defined in Chapter 3.2.4. This implies that the hill of a “stable” cluster, where many points are close to the density maximum, is rectangular, whereas less stable hills, with many points close to the saddle density, are more triangular or peak-shaped. Plateaus at different height levels indicate suspicious subfeatures, but are also a typical effect of topological simplification. The alternating two-tone coloring scheme of the hills accentuates their hierarchy if they are nested.
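The width convention can be made concrete with a minimal Python sketch. The function name `hill_width` and the sample densities are invented for illustration; the stability measure itself is defined in Chapter 3.2.4 and is not reproduced here.

```python
from typing import Sequence


def hill_width(densities: Sequence[float], height: float) -> int:
    """Width of a hill at `height`: number of points with at least this density."""
    return sum(1 for d in densities if d >= height)


# Invented sample data for two clusters of six points each.
stable_cluster = [8.0, 9.0, 9.0, 9.0, 9.0, 10.0]   # most points near the maximum
peaked_cluster = [1.0, 2.0, 3.0, 5.0, 8.0, 10.0]   # many points near the saddle

heights = (2.0, 5.0, 8.0)
stable_widths = [hill_width(stable_cluster, h) for h in heights]
peaked_widths = [hill_width(peaked_cluster, h) for h in heights]
```

The first cluster keeps its full width of six points up to height 8, which yields a nearly rectangular hill. The second narrows from five to three to two points and therefore appears peak-shaped.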

Figure 4.10: 2-D topological landscape profile of the 9-D Reuters data set. Hills in the landscape represent clusters in the data. Their height, width, and area reflect a cluster’s persistence, size, and stability, respectively. Alternating colors represent the hierarchy of nested regions; cluster separation is emphasized by an additional gap between those hills separated by a valley at zero height. Histograms represent the data points at the height of their densities and can be colored by class. Excentric labeling for histogram fragments and hill-based labeling provide additional meta-information.

The data points are added to the profile as (horizontal) histograms, both for annotation and to show the point distributions. Individual points are represented by the fragments of the histograms. The lengths of all histograms on a particular hill sum to the cluster’s size. If classification information is available, the histograms become stacked bar charts with one bar and color per class (cf. Figure 4.9b). With this representation, the analyst can quickly determine whether classes correspond to clusters. As already mentioned for the Euclidean distance between the hills, the distance between histogram fragments does not indicate spatial similarity. While the fragments of a particular bar do have similar densities, they are probably located at very different positions inside the cluster, likely arranged in a circular fashion around the density maximum (cf. Figure 5.10 on page 131).
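The histogram annotation can be sketched in Python as follows. Binning points by a fixed density step, the function name `stacked_histogram`, and the sample points are assumptions made for illustration; the thesis does not specify how bar heights are chosen.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def stacked_histogram(
    points: Iterable[Tuple[float, str]], bin_height: float
) -> Dict[int, Counter]:
    """Group (density, class label) pairs into horizontal bars.

    Returns a mapping from bin index to per-class counts, i.e. one stacked bar
    per density level. The fixed-step binning is an assumption of this sketch.
    """
    bars: Dict[int, Counter] = defaultdict(Counter)
    for density, label in points:
        bars[int(density // bin_height)][label] += 1
    return dict(bars)


# Invented points of a single cluster: (density, class label).
cluster_points = [(0.5, "a"), (0.7, "b"), (1.2, "a"), (1.9, "a"), (2.1, "b")]
bars = stacked_histogram(cluster_points, bin_height=1.0)
total_length = sum(sum(counts.values()) for counts in bars.values())
```

The bar lengths on the hill add up to the cluster size, here five points, and the per-class counts within each bar show how well classes and clusters agree.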

Advantages. The 2-D landscape profile has several advantages over the 3-D landscape: no view-dependent occlusion that hides features, no perspective distortion that complicates feature comparison, no strangely distorted base areas, a more accurate depiction of individual cluster properties that simplifies feature comparison, an additional cluster property conveyed by a hill’s shape, no invisible and randomly placed data points, a more compact and discriminable display of the data points as colored histograms, less user interaction required to navigate through the scene, less complex geometry and a faster construction scheme, and no expensive metric-based distortion. Drawbacks, on the other hand, are a slight decrease in the expressive power of the landscape metaphor compared to the more natural-looking hills in 3-D, and a slightly less efficient use of screen space by the left-to-right layout compared to the spiral layout in 3-D.

Algorithm 2: Pseudo-code to construct the 2-D landscape profile.
Input: root node of the merge tree
Output: topological landscape profile

procedure PaintLandscapeProfile(root)
    x ← 0.0
    PaintPart(root, x)

procedure PaintPart(node, x)
    if HasParentArc(node) then
        arc ← GetParentArc(node)
        if IsLeaf(node) then
            DrawHill(arc, x)          // cf. Figure 4.11a
        else
            DrawInnerPart(arc, x)     // cf. Figure 4.11b
    for i ← 1 to NumberChildNodes(node) do
        childNode ← GetChildNode(node, i)
        PaintPart(childNode, x)
        x ← x + SubTreeSize(childNode)
    end

Figure 4.10 shows a 2-D topological landscape profile of the Reuters data set. Compared to the 3-D landscape shown in Figure 4.3, the profile contains more hierarchical features. This is because we removed vague hierarchies in the 3-D landscape by re-balancing the merge tree’s branch decomposition (cf. Chapter 3.1.2), for two reasons. First, we wanted to avoid massively distorted base areas for hierarchical features in the initial example illustrations. Second, these small differences in height values would not have been noticed in the 3-D visualization. Both reasons are actually drawbacks of the 3-D landscape metaphor. In contrast, the 2-D landscape profile easily reveals vague hierarchies because even small differences in height values are visible in the valleys, and because this hierarchy is additionally emphasized by alternating hill colors.
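The traversal of Algorithm 2 can be sketched in Python as follows. The `TreeNode` class, the `arc_points` field, and the reading of SubTreeSize as the number of points stored on all superarcs of a subtree are assumptions of this sketch. Instead of drawing, it records which part would be drawn at which x-offset; the geometry of DrawHill and DrawInnerPart (Figure 4.11) is not reproduced.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TreeNode:
    # Hypothetical node: `arc_points` counts the points stored on the superarc
    # to the parent; the root has no parent arc.
    name: str
    arc_points: int = 0
    children: List["TreeNode"] = field(default_factory=list)
    has_parent_arc: bool = True


def subtree_size(node: TreeNode) -> int:
    """Assumed SubTreeSize: points on the node's parent arc plus all arcs below."""
    return node.arc_points + sum(subtree_size(c) for c in node.children)


def paint_part(node: TreeNode, x: float, out: List[Tuple[str, str, float]]) -> None:
    if node.has_parent_arc:
        # Leaves become hills; nodes with children become inner parts (slopes).
        kind = "hill" if not node.children else "inner"
        out.append((kind, node.name, x))
    for child in node.children:
        paint_part(child, x, out)
        x += subtree_size(child)  # advance past the space the child occupies


def paint_landscape_profile(root: TreeNode) -> List[Tuple[str, str, float]]:
    out: List[Tuple[str, str, float]] = []
    paint_part(root, 0.0, out)
    return out


saddle = TreeNode("S", 4, [TreeNode("A", 3), TreeNode("C", 2)])
root = TreeNode("root", 0, [saddle, TreeNode("B", 5)], has_parent_arc=False)
parts = paint_landscape_profile(root)
```

As in the pseudo-code, `x` is passed by value, so each call advances only its own offset and siblings are laid out side by side from left to right. In the example, subtree S occupies 9 units (4 + 3 + 2), so hill B starts at x = 9.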
