
As discussed in Section 2.3.3, density-based clustering applies a local optimization criterion. Clusters are regarded as regions of the graph in which the nodes are dense and which are separated by regions of low node density. To detect regions of higher density, DENGRAPH computes neighborhoods that have a given radius (ε) and must contain a minimum number of nodes (η) to ensure that the neighborhood is dense. A node having such a neighborhood is termed a core node. A node without such a neighborhood is a border node if it lies in the neighborhood of a core node, and a noise node otherwise.

To build a cluster, DENGRAPH traverses the graph by randomly picking nodes and places all density-connected (cf. Figure 4.1) nodes it encounters into the same cluster. If a node is not density-connected to the nodes seen thus far, it is assigned to the next cluster candidate. Not every node becomes a member of a cluster: if a node does not have an adequately dense neighborhood with respect to ε and η and is not density-connected to any other node, then it is termed a noise node and its cluster candidate is dropped [57].

To cluster similar vertices into groups, a measure of similarity or dissimilarity is needed. The similarity between two vertices is a numerical measure of the degree to which the two objects are alike. Accordingly, dissimilarity is a numerical measure of the degree to which the two objects differ. The more alike the objects are, the higher the similarity; the less alike they are, the higher the dissimilarity. When speaking of dissimilarity, the term distance is usually used. Frequently used distance functions are the Euclidean distance and the cosine distance.
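As an illustration of the two distance functions named above, the following minimal sketch computes both for plain numeric vectors. The function names and the vector representation are assumptions made for this example only; the cosine distance is taken here as one minus the cosine similarity and assumes non-zero vectors.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equally long numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus the cosine similarity; both vectors must be non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Orthogonal vectors are maximally dissimilar under the cosine distance
# (for non-negative data), while parallel vectors have distance close to 0.
print(euclidean_distance([0, 0], [3, 4]))
print(cosine_distance([1, 0], [0, 1]))
```

The Euclidean distance depends on vector magnitudes, whereas the cosine distance depends only on direction, which is one reason the choice of function should follow the semantics of the data.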

To group actors according to their closeness in graph structures, we need to define a function that determines the distance between two actors. Similarity or closeness in social networks depends on the relationship between actors. We therefore define the distance based on the semantics of the relationship. In the case of the Enron data set, we assume that the closeness of actors is reflected in the number of interactions between them and define a distance function based on the frequency of interaction (cf. Section 4.4.2). For the Last.fm data, we define a distance function based on the similarity of user profiles (cf. Section 4.5.2).

The approach discussed in the following works with any distance function, which can be chosen depending on the data set and the goal of the analysis. Given such a distance function, the ε-neighborhood of a vertex u is defined as follows:

Definition 4.1 The ε-neighborhood of a vertex u, denoted by N_ε(u), is defined by

N_ε(u) = {v ∈ V | ∃(u, v) ∈ E ∧ dist(u, v) ≤ ε} (4.1)
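Equation (4.1) can be sketched in a few lines of code. The sketch assumes, purely for illustration, that the undirected weighted graph is stored as a dictionary of dictionaries in which graph[u][v] holds dist(u, v) for every edge (u, v); the function name and the example graph are hypothetical.

```python
def eps_neighborhood(graph, u, eps):
    """Return N_eps(u): the vertices adjacent to u with dist(u, v) <= eps."""
    return {v for v, d in graph[u].items() if d <= eps}


# Hypothetical example graph; each edge is stored in both directions.
graph = {
    "a": {"b": 0.2, "c": 0.9},
    "b": {"a": 0.2, "c": 0.4},
    "c": {"a": 0.9, "b": 0.4},
}

print(sorted(eps_neighborhood(graph, "a", 0.5)))  # ['b']
print(sorted(eps_neighborhood(graph, "b", 0.5)))  # ['a', 'c']
```

Note that only vertices joined to u by an edge are candidates, so the neighborhood is computed from the adjacency of u rather than from distances to all vertices in V.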

4.1 DENGRAPH: Density-based Graph Clustering

The definition of an ε-neighborhood leads to the definition of two types of vertices in a cluster: core vertices and border vertices. A vertex that does not belong to any cluster is called a noise vertex.

Definition 4.2 u ∈ V is a “core vertex” if and only if |N_ε(u)| ≥ η, where η denotes the minimal number of neighbors required. If |N_ε(u)| < η, then u is a “noise vertex”, unless there is a core vertex v such that u ∈ N_ε(v). Then, u is a “border vertex”.
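The three vertex types of Definition 4.2 can be illustrated with a short sketch. As an assumption for this example, the graph is stored as a dictionary of dictionaries with graph[u][v] = dist(u, v); the function names and the example data are hypothetical and not taken from the DENGRAPH implementation.

```python
def eps_neighborhood(graph, u, eps):
    """Vertices adjacent to u whose distance to u is at most eps."""
    return {v for v, d in graph[u].items() if d <= eps}


def classify_vertices(graph, eps, eta):
    """Label every vertex as 'core', 'border' or 'noise' (Definition 4.2)."""
    core = {u for u in graph if len(eps_neighborhood(graph, u, eps)) >= eta}
    labels = {}
    for u in graph:
        if u in core:
            labels[u] = "core"
        elif any(u in eps_neighborhood(graph, v, eps) for v in core):
            labels[u] = "border"  # not dense itself, but next to a core vertex
        else:
            labels[u] = "noise"
    return labels


# Hypothetical example: a dense triangle a-b-c, d attached to c, e far away.
graph = {
    "a": {"b": 0.1, "c": 0.2},
    "b": {"a": 0.1, "c": 0.3},
    "c": {"a": 0.2, "b": 0.3, "d": 0.4},
    "d": {"c": 0.4, "e": 0.9},
    "e": {"d": 0.9},
}

labels = classify_vertices(graph, eps=0.5, eta=2)
print(labels)
```

With ε = 0.5 and η = 2, the vertices a, b and c are core vertices, d is a border vertex because it lies in the ε-neighborhood of c, and e is a noise vertex.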

We use the notion of core vertices to define reachability among vertices. We define directly density-reachable, density-reachable and density-connected vertices. However, we do so on the basis of ε-neighborhoods rather than using a distance function over the whole set of vertices. The concepts of direct density-reachability, density-reachability and density-connectivity are illustrated in Figure 4.1.

[Figure 4.1, three panels. Left: u is directly density-reachable from v (u ∈ N_ε(v) and |N_ε(v)| ≥ η). Middle: u is density-reachable from v. Right: u is density-connected to v via a vertex m. Vertices are marked as core or border.]

Figure 4.1: The concepts of direct density-reachability, density-reachability and density-connectedness, which determine whether nodes are density-connected.

Definition 4.3 Let u, v ∈ V be two vertices. u is “directly density-reachable” from v within V with respect to ε and η if and only if v is a core vertex and u is in its neighborhood, i.e. u ∈ N_ε(v).

Direct density-reachability is a symmetric relation for pairs of core vertices. If a core vertex and a border vertex are involved, it is not symmetric (cf. Figure 4.1 (left)).

Definition 4.4 Let u, v ∈ V be two vertices. u is “density-reachable” from v within V with respect to ε and η if there is a chain of vertices p_1, . . . , p_n such that p_1 = v, p_n = u and for each i = 2, . . . , n it holds that p_i is directly density-reachable from p_{i−1} within V.

The density-reachability relation is transitive, but it is symmetric only for core vertices. Figure 4.1 (middle) shows an example where vertex u is density-reachable from v via three other core vertices. By this definition, a vertex u cannot be density-reachable from a vertex v unless v is a core vertex. This restriction is removed by introducing the notion of density-connectivity between vertices, none of which needs to be a core vertex.

Definition 4.5 Let u, v ∈ V be two vertices. u is “density-connected” to v within V with respect to ε and η if and only if there is a vertex m ∈ V such that u is density-reachable from m and v is density-reachable from m.

Density-connectivity is a symmetric relation (cf. Figure 4.1 (right)). Based on the connectivity of vertices, we can now define a cluster in a graph as a “community” composed of all vertices that are density-connected within V with respect to ε and η.
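Definitions 4.3 to 4.5 can be made concrete with a small sketch. It assumes, for illustration only, a dictionary-of-dictionaries graph with graph[u][v] = dist(u, v); all function names and the example graph are hypothetical. The sketch only considers chains with at least one step, so it does not treat the trivial chain p_1 = v = u as a case of reachability.

```python
def eps_neighborhood(graph, u, eps):
    """Vertices adjacent to u whose distance to u is at most eps."""
    return {v for v, d in graph[u].items() if d <= eps}


def reachable_from(graph, v, eps, eta):
    """Vertices density-reachable from v via a chain of at least one step."""
    reached, frontier = set(), [v]
    while frontier:
        p = frontier.pop()
        neighbors = eps_neighborhood(graph, p, eps)
        if len(neighbors) < eta:
            continue  # p is not a core vertex, so no chain continues through p
        for q in neighbors - reached:
            reached.add(q)
            frontier.append(q)
    return reached


def density_connected(graph, u, v, eps, eta):
    """True if some vertex m density-reaches both u and v (Definition 4.5)."""
    return any({u, v} <= reachable_from(graph, m, eps, eta) for m in graph)


# Hypothetical example: core triangle a-b-c, border vertices d and f, noise e.
graph = {
    "a": {"b": 0.1, "c": 0.2, "f": 0.3},
    "b": {"a": 0.1, "c": 0.3},
    "c": {"a": 0.2, "b": 0.3, "d": 0.4},
    "d": {"c": 0.4, "e": 0.9},
    "e": {"d": 0.9},
    "f": {"a": 0.3},
}

print("d" in reachable_from(graph, "a", 0.5, 2))  # True: chain a, c, d
print("a" in reachable_from(graph, "d", 0.5, 2))  # False: d is not a core vertex
print(density_connected(graph, "d", "f", 0.5, 2))  # True: both reachable from a
```

The example shows the asymmetry discussed above: the border vertex d is density-reachable from the core vertex a, but not vice versa, while the two border vertices d and f are density-connected through a.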

Definition 4.6 Let G(V, E) be an undirected, weighted graph. A non-empty set C ⊆ V is a “community” with respect to ε and η if and only if:

• For all u, v ∈ V it holds that if u ∈ C and v is density-reachable from u, then v ∈ C (Maximality condition).

• For all u, v ∈ C it holds that u is density-connected to v within V with respect to ε and η (Connectivity condition).

Definition 4.7 Let C_1, . . . , C_k be the communities with respect to ε and η. We define “noise” as the set of vertices in the graph not belonging to any community, i.e. noise = {u ∈ V | u ∉ C_i for i = 1, . . . , k}.
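To illustrate Definitions 4.6 and 4.7, the following sketch grows one community from each core vertex that has not yet been assigned and reports the remaining vertices as noise. It is a simplified illustration under stated assumptions, not the DENGRAPH procedure itself: the graph is assumed to be a dictionary of dictionaries with graph[u][v] = dist(u, v), vertices are visited in dictionary order rather than randomly, and a border vertex adjacent to core vertices of two communities would appear in both. All names and data are hypothetical.

```python
def eps_neighborhood(graph, u, eps):
    """Vertices adjacent to u whose distance to u is at most eps."""
    return {v for v, d in graph[u].items() if d <= eps}


def find_communities(graph, eps, eta):
    """Return (communities, noise) following Definitions 4.6 and 4.7."""
    core = {u for u in graph if len(eps_neighborhood(graph, u, eps)) >= eta}
    communities, assigned_core = [], set()
    for u in graph:
        if u not in core or u in assigned_core:
            continue
        community, frontier = {u}, [u]
        while frontier:
            p = frontier.pop()
            if p not in core:
                continue  # border vertices join but do not extend a community
            for q in eps_neighborhood(graph, p, eps) - community:
                community.add(q)
                frontier.append(q)
        communities.append(community)
        assigned_core |= community & core
    noise = set(graph) - set().union(*communities)
    return communities, noise


# Hypothetical example with two dense groups and one isolated vertex e.
graph = {
    "a": {"b": 0.1, "c": 0.2},
    "b": {"a": 0.1, "c": 0.3},
    "c": {"a": 0.2, "b": 0.3, "d": 0.4},
    "d": {"c": 0.4, "e": 0.9},
    "e": {"d": 0.9},
    "x": {"y": 0.1, "z": 0.2},
    "y": {"x": 0.1, "z": 0.2},
    "z": {"x": 0.2, "y": 0.2},
}

communities, noise = find_communities(graph, eps=0.5, eta=2)
print([sorted(c) for c in communities])
print(sorted(noise))
```

For ε = 0.5 and η = 2, the sketch yields the communities {a, b, c, d} and {x, y, z}, with e as the only noise vertex.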