Hierarchical clustering is a class of clustering algorithms that was widely used to analyze social networks [173, 162] before the newer generation of clustering algorithms emerged. Clustering here means grouping nodes into clusters or communities such that the nodes within each cluster are closer, more similar, or more related to one another than to nodes belonging to other clusters or communities. Similarity is defined in terms of structural equivalence [110, 171]: two nodes i and j are structurally equivalent if they have the same pattern of relationships with all other nodes. In other words, they have identical entries in their corresponding rows (or columns) of the adjacency matrix A.
There are different ways to measure structural similarity. One of these measures is the Euclidean distance [173]. Given two nodes i and j, the distance between their corresponding rows (or columns) of A is defined as:
$$d_{ij} = \sqrt{\sum_{x=1}^{n} (a_{ix} - a_{jx})^2}. \tag{2.1}$$
When $d_{ij} = 0$, nodes i and j are structurally equivalent, which means that the entries in their respective rows of A are identical. As the first step in hierarchical clustering algorithms, $d_{ij}$ is computed for all pairs of nodes in the graph and the results are stored in a distance matrix.
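The distance matrix described above can be sketched in a few lines of Python. This is only an illustration: the function name `euclidean_distance_matrix` and the 4-node example graph are assumptions, and A is assumed to be a square adjacency matrix stored as a list of rows.

```python
import math

def euclidean_distance_matrix(A):
    """Return D with D[i][j] = sqrt(sum_x (A[i][x] - A[j][x])**2), as in Eq. (2.1)."""
    n = len(A)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((A[i][x] - A[j][x]) ** 2 for x in range(n)))
            D[i][j] = D[j][i] = d  # the distance is symmetric
    return D

# Hypothetical undirected graph with edges 0-2, 0-3, 1-2, 1-3.
# Nodes 0 and 1 have identical rows, so they are structurally equivalent.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
D = euclidean_distance_matrix(A)
print(D[0][1])  # 0.0
print(D[0][2])  # 2.0
```

Only the upper triangle is computed explicitly because the distance between rows i and j does not depend on their order.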
An alternative measure of structural equivalence is based on the correlation between the rows (or columns) of nodes i and j in A. The Pearson correlation coefficient [173] between the two nodes' rows is:
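$$R_{ij} = \frac{\sum_{x=1}^{n} (a_{ix} - \bar{a}_i)(a_{jx} - \bar{a}_j)}{\sqrt{\sum_{x=1}^{n} (a_{ix} - \bar{a}_i)^2}\,\sqrt{\sum_{x=1}^{n} (a_{jx} - \bar{a}_j)^2}} \tag{2.2}$$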
where $\bar{a}_i$ is the mean of the values in row i of A. The coefficient satisfies $-1 \le R_{ij} \le 1$, with $R_{ij} = 1$ if nodes i and j are structurally equivalent. $R_{ij}$ is computed for all pairs of nodes in the graph, and the results are stored in a similarity matrix to be used by the hierarchical clustering algorithms.
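A minimal sketch of Eq. (2.2) follows, assuming A is stored as a list of rows. The function name `pearson_similarity` and the example graph are hypothetical. Returning `None` when a row is constant is a choice made here, since the denominator of Eq. (2.2) is zero in that case; the text does not say how such rows should be handled.

```python
import math

def pearson_similarity(A, i, j):
    """Pearson correlation between rows i and j of A, following Eq. (2.2)."""
    n = len(A)
    mean_i = sum(A[i]) / n
    mean_j = sum(A[j]) / n
    num = sum((A[i][x] - mean_i) * (A[j][x] - mean_j) for x in range(n))
    den_i = math.sqrt(sum((A[i][x] - mean_i) ** 2 for x in range(n)))
    den_j = math.sqrt(sum((A[j][x] - mean_j) ** 2 for x in range(n)))
    if den_i == 0 or den_j == 0:
        return None  # assumption: leave the coefficient undefined for a constant row
    return num / (den_i * den_j)

# Hypothetical undirected graph with edges 0-2, 0-3, 1-2, 1-3.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
print(pearson_similarity(A, 0, 1))  # 1.0  (identical rows)
print(pearson_similarity(A, 0, 2))  # -1.0 (complementary rows)
```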
CHAPTER 2 Network Clustering Algorithms

Alternatively, we can measure structural similarity by counting the number of common neighbors that vertices i and j have. If network G has edge set E, then the (exclusive) neighborhood of vertex i is
Γ(i)≡ {x|(i, x)∈E} (2.3)
and the inclusive neighborhood of vertex i is the set containing the vertex itself and its neighbors, that is:
Γ+(i)≡ {i} ∪Γ(i) (2.4)
The common neighbors set (CNS) of i and j is therefore Γ(i) ∩ Γ(j). Structural similarity can then be calculated either from A or from the size of the common neighbors set; this measure is known as the common neighbors similarity or the common neighbors index (CNI):
$$\mathrm{CNI}_{ij} \equiv |\Gamma(i) \cap \Gamma(j)| = \sum_{x} a_{ix} a_{jx}. \tag{2.5}$$
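The two sides of Eq. (2.5) can be checked against each other with a short sketch. It assumes an unweighted graph whose adjacency matrix contains only 0 and 1; the function names and the 5-node example graph are illustrative, not part of the original text.

```python
def neighborhood(A, i):
    """Exclusive neighborhood of i, Eq. (2.3): all x with an edge (i, x)."""
    return {x for x, a in enumerate(A[i]) if a}

def common_neighbors_index(A, i, j):
    """Right-hand side of Eq. (2.5): sum over x of a_ix * a_jx."""
    return sum(A[i][x] * A[j][x] for x in range(len(A)))

# Hypothetical undirected graph with edges 0-1, 0-2, 1-2, 1-3, 2-3, 3-4.
A = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
# Set-based and matrix-based computations agree.
print(len(neighborhood(A, 0) & neighborhood(A, 3)))  # 2
print(common_neighbors_index(A, 0, 3))               # 2
print(common_neighbors_index(A, 0, 4))               # 0
```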
Hierarchical clustering algorithms are classified into two types: the agglomerative method and the divisive method. Both methods produce a series of partitions.
Agglomerative method
In the agglomerative method [81], given a network of n nodes, each node i ∈ V is initially assigned to its own cluster c(i), so at the initial stage there are as many clusters as nodes in the network. Next, the similarity between all pairs of clusters (single nodes at this stage) is calculated according to the chosen similarity measure. The two closest (most similar) clusters are then merged into a single cluster, and the similarity between the new merged cluster and each of the remaining clusters is computed using the same structural equivalence measure.
The steps of measuring similarity for the newly merged cluster and merging the closest pair are repeated until every cluster created at the initial stage has been grouped. At the end, all nodes belong to a single cluster of size n. In this technique nodes are added to larger and larger clusters, and the hierarchical tree is built from the bottom up.
The agglomerative technique can be constructed in three different ways, depending on how the distance (or similarity) between clusters is defined:
1. Single-linkage clustering, where the distance between two clusters is the shortest distance between any member of one cluster and any member of the other. In terms of similarity, the similarity between two clusters is the greatest similarity between any member of one cluster and any member of the other.
2. Complete-linkage clustering, where the distance between two clusters is the longest distance between any member of one cluster and any member of the other. The similarity between two clusters is the smallest similarity between any member of one cluster and any member of the other.
3. Average-linkage clustering, where the distance between two clusters is the average distance between the members of one cluster and the members of the other. The similarity between two clusters is the average similarity between the members of one cluster and the members of the other.
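The agglomerative procedure and the three linkage rules can be sketched together as follows. This is a naive illustration rather than an efficient implementation: it works on a precomputed distance matrix D, recomputes every cluster-to-cluster distance in each round, and breaks ties by taking the first closest pair found. The function name `agglomerative`, the tie-breaking rule, and the example distances (four hypothetical points on a line) are assumptions.

```python
def agglomerative(D, linkage="single"):
    """Merge clusters bottom-up; return the merges as (cluster_a, cluster_b, distance)."""
    combine = {
        "single": min,                             # shortest member-to-member distance
        "complete": max,                           # longest member-to-member distance
        "average": lambda ds: sum(ds) / len(ds),   # mean over all member pairs
    }[linkage]
    clusters = [frozenset([i]) for i in range(len(D))]  # one cluster per node
    merges = []
    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                ds = [D[u][v] for u in clusters[p] for v in clusters[q]]
                d = combine(ds)
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        merges.append((sorted(clusters[p]), sorted(clusters[q]), d))
        merged = clusters[p] | clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)] + [merged]
    return merges

# Hypothetical example: pairwise distances between four points on a line.
points = [0, 1, 3, 7]
D = [[abs(p - q) for q in points] for p in points]
print([m[2] for m in agglomerative(D, "single")])    # [1, 2, 4]
print([m[2] for m in agglomerative(D, "complete")])  # [1, 3, 7]
```

The list of merges is the bottom-up history of the clustering, so it carries the same information as the hierarchical tree. To work with a similarity matrix instead, the closest pair would be the one with the largest value, and the roles of `min` and `max` would swap.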
Divisive method
The divisive method works by identifying edges with high betweenness and cutting them out. The method starts with a network of m edges and calculates the betweenness value for each e ∈ E. The edge with the highest betweenness is then removed. This step allows a dendrogram¹ to be constructed from the resulting partition of the nodes.
The betweenness values of the remaining edges are then recalculated, and the edge with the highest betweenness is again removed. These steps are repeated until the network breaks up into n disconnected nodes, yielding a dendrogram whose leaves are the nodes of the network.
In this method the network divides into progressively smaller clusters. The challenge is to select the inter-cluster edges, which carry the highest betweenness. To this end, Girvan and Newman [64, 130] considered three definitions of edge betweenness centrality:
1. Geodesic edge betweenness, which counts the number of shortest paths between all pairs of nodes that run through the given edge.
2. Random-walk edge betweenness, which is defined by how frequently a random walker moving on the network passes across the edge.
3. Current-flow edge betweenness, which is the average value of the current carried by the edge, where each edge e in the network carries some amount of current.
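The divisive procedure can be sketched with the geodesic definition (item 1). The sketch assumes an undirected, unweighted graph without self-loops. When several shortest paths join a pair of nodes, each path contributes an equal fraction, which is a common convention but an assumption with respect to the text above. All function names and the example graph (two triangles joined by one bridge) are illustrative.

```python
from collections import deque

def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def edge_betweenness(adj):
    """Geodesic edge betweenness: BFS from every node, then back-propagate path counts."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, order = {s: 0}, []
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in adj}  # predecessors on shortest paths
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):     # farthest nodes first
            for u in preds[w]:
                share = sigma[u] / sigma[w] * (1 + delta[w])
                bet[frozenset((u, w))] += share
                delta[u] += share
    return {e: b / 2 for e, b in bet.items()}  # each node pair was counted twice

def components(adj):
    seen, comps = set(), []
    for s in adj:
        if s not in seen:
            comp, stack = set(), [s]
            while stack:
                u = stack.pop()
                if u not in comp:
                    comp.add(u)
                    stack.extend(adj[u] - comp)
            seen |= comp
            comps.append(sorted(comp))
    return comps

def divisive(edges):
    """Remove the highest-betweenness edge repeatedly; record each new partition."""
    adj = build_adjacency(edges)
    history = [components(adj)]
    while any(adj.values()):          # while at least one edge remains
        bet = edge_betweenness(adj)   # recomputed after every removal
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
        parts = components(adj)
        if len(parts) > len(history[-1]):
            history.append(parts)
    return history

# Hypothetical example: two triangles joined by the bridge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
history = divisive(edges)
print(history[0])  # [[0, 1, 2, 3, 4, 5]]
print(history[1])  # [[0, 1, 2], [3, 4, 5]]
```

The `history` list records the successive partitions from the whole network down to single nodes, which is the information a dendrogram encodes. The bridge is removed first because all nine shortest paths between the two triangles run through it.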
¹ The dendrogram is a binary tree or in-memory data structure that describes the history of the algorithm.