The map equation method proposed by Rosvall and Bergstrom [159], now known as Infomap, identifies communities according to the flow of information in a network.
Infomap aims to explain the behavior of an integrated system by relating the structure of the network to the dynamic flow on it. The algorithm has two main steps; the most recent description can be found in [114].
This approach uses the Shannon limit of a Huffman code [79], which gives short codewords to frequently visited nodes and long codewords to rarely visited nodes. The quality function used to evaluate a partition is the minimum description length (MDL) [68]. It measures the average length $L(C)$, in bits per step, of the description of a random walk on the network with a node partition $C = \{c_1, \dots, c_\ell\}$:
$$L(C) = q_{\curvearrowright} H(C) + \sum_{i=1}^{\ell} p_i H(P_i) \tag{2.7}$$

This equation has two parts. The first describes movements between communities, where $q_{\curvearrowright}$ is the probability that the random walker switches communities in a given step and $H(C)$ is the entropy of the community index codewords. The second describes movements within communities, where $p_i$ is the fraction of movements that take place within community $c_i$ (in the original formulation [159] this weight also includes the probability of exiting $c_i$) and $H(P_i)$ is the entropy of the movements within community $c_i$. The map equation (2.7) provides a theoretical lower limit on the length needed to specify a network path given a cluster structure. Thus, to compare candidate partitions of the network it is sufficient to evaluate the map equation for each of them. The complexity of the Infomap algorithm is O(m) [60].
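As an illustration, the map equation (2.7) can be evaluated directly for a small network. The following is a minimal sketch, not the Infomap implementation. The function names `entropy` and `map_equation` and the toy network are hypothetical. The network is assumed to be undirected and unweighted, so the visit frequency of a node is its degree divided by twice the number of edges. The weight $p_i$ of each community is assumed to include its exit probability, as in the original formulation [159].

```python
import math
from collections import defaultdict


def entropy(weights):
    """Shannon entropy (bits) of the distribution obtained by normalizing weights."""
    total = sum(weights)
    if total <= 0:
        return 0.0
    return -sum((w / total) * math.log2(w / total) for w in weights if w > 0)


def map_equation(edges, membership):
    """Average description length L(C) in bits per step (hypothetical helper).

    edges: list of undirected, unweighted (u, v) pairs.
    membership: dict mapping each node to its community label.
    """
    two_m = 2.0 * len(edges)
    visit = defaultdict(float)   # stationary visit frequency of each node
    exit_p = defaultdict(float)  # probability of leaving each community in one step
    for u, v in edges:
        visit[u] += 1.0 / two_m
        visit[v] += 1.0 / two_m
        if membership[u] != membership[v]:
            exit_p[membership[u]] += 1.0 / two_m
            exit_p[membership[v]] += 1.0 / two_m

    members = defaultdict(list)
    for node, p in visit.items():
        members[membership[node]].append(p)

    # First term: q * H(C), the cost of describing moves between communities.
    q = sum(exit_p.values())
    length = q * entropy(list(exit_p.values()))

    # Second term: sum over communities of p_i * H(P_i), where the codebook of
    # community i holds its node visit frequencies plus its exit probability.
    for label, probs in members.items():
        codebook = probs + [exit_p[label]]
        length += sum(codebook) * entropy(codebook)
    return length


# Toy network (an assumption for illustration): two triangles joined by one edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_triangles = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(map_equation(edges, two_triangles))
```

For this toy network the sketch assigns a shorter description length to the partition into the two triangles than to a single community or to six singleton communities, which is the behavior the quality function is meant to capture.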
In Infomap, the identification of clusters rests on compressing the description of a dynamic process on the network; this dynamic process is a random walk. Infomap is also a multilevel algorithm: the network is treated recursively to detect clusters until no further improvement occurs in the average code length, which captures the community structure of the network. The algorithm is run repeatedly until the shortest description length is achieved.

CHAPTER 2 Network Clustering Algorithms

Figure 2.1.3: Community detection using the Infomap random walk technique. The algorithm has two levels. Level one starts in (A), which shows the trajectory of an example random walker over the nodes of the network. (B) shows the Huffman code used to give each node a fingerprint; the 314 bits at the bottom of (B) encode the movement of the random walker. For example, the walker begins at the node labelled 1111100 in the upper left corner, then moves to the node labelled 1100, and so on.
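The random walk that Infomap compresses can be illustrated by computing how often a walker visits each node in the long run. The sketch below is only an illustration under stated assumptions. The function name `visit_frequencies` and the toy network are hypothetical. The network is assumed to be undirected, unweighted, connected, and not bipartite, so that the iteration converges. Plain power iteration is used here as a stand-in for whatever procedure an actual Infomap implementation applies.

```python
def visit_frequencies(edges, iterations=1000, tol=1e-12):
    """Long-run visit frequency of each node for an unbiased random walk.

    edges: list of undirected, unweighted (u, v) pairs (hypothetical input format).
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, []).append(v)
        neighbors.setdefault(v, []).append(u)

    nodes = sorted(neighbors)
    # Start from the uniform distribution over nodes.
    p = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for u in nodes:
            # The walker at u moves to each neighbor with equal probability.
            share = p[u] / len(neighbors[u])
            for v in neighbors[u]:
                nxt[v] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Toy network (an assumption for illustration): two triangles joined by one edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
for node, freq in visit_frequencies(edges).items():
    print(node, round(freq, 4))
```

On an undirected network the resulting frequencies are proportional to node degree, so better-connected nodes are visited more often and are the ones that benefit most from short codewords.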
The mechanism of Infomap can be described in two levels. The first level of the algorithm, shown in Figure 2.1.3 (from [159])³, starts by feeding a network as input to the algorithm. A random walk takes place on the nodes of the network in order to describe the trajectory (the sequence of locations) of the walker. Each node in the network is then assigned a codeword according to how frequently it is visited by the random walk. This can be done by calculating the ergodic node visit frequencies using the improved version of the greedy search technique by Clauset et al. [32]. As a result, a transition matrix is obtained that determines the stationary distribution of random walks on the network. The novelty of Infomap is that it uses a Huffman code to describe where on the network the random walker is. The path of the random walk through the network is thus encoded with Huffman codewords, which are unique and prefix-free and which exploit the regularity in the patterns of movement on the network⁴. A lookup⁵ table is used for encoding and decoding node labels in the network.

³ Figure reprinted with permission from Ref. [159]. © Copyright (2008) National Academy of Sciences, U.S.A. Reproduced by permission of PNAS Publishing.

Figure 2.1.4: The second level of the Infomap algorithm. (A) shows the result of several iterations of merging neighboring clusters, each merge giving the maximum decrease in the MDL. In (B), the algorithm treats the previous clusters as super nodes and tries to merge super nodes to obtain a smaller MDL. The code under (B) allows the random walker to switch between clusters, since each cluster has a unique entry code and exit code. For instance, 111 is the entry code for the red cluster and 0001 is its exit code, while 0 is the entry code for the orange cluster and 1011 is its exit code.
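The assignment of short codewords to frequently visited nodes can be illustrated with a standard Huffman construction. This is a minimal sketch and not code from Infomap. The function name `huffman_codes` and the example visit frequencies are hypothetical, and the frequencies are assumed to be given as a dictionary from node to probability.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Build a binary Huffman code for a dict {node: visit frequency}."""
    if len(freqs) == 1:
        # A single node still needs one bit to be written down.
        return {node: "0" for node in freqs}

    tie = count()  # tie-breaker so the heap never has to compare dictionaries
    heap = [(f, next(tie), {node: ""}) for node, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codewords.
        f1, _, codes1 = heapq.heappop(heap)
        f2, _, codes2 = heapq.heappop(heap)
        merged = {n: "0" + c for n, c in codes1.items()}
        merged.update({n: "1" + c for n, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


# Hypothetical visit frequencies for four nodes.
freqs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
codes = huffman_codes(freqs)
for node in sorted(codes):
    print(node, codes[node])

# Average codeword length in bits per step of the walk.
print(sum(freqs[n] * len(codes[n]) for n in freqs))
```

Because no codeword is a prefix of another, a sequence of codewords can be decoded without separators, which is what allows the walk to be written as a single bit string.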
The second level of the algorithm, shown in Figure 2.1.4 (from [159]), starts by merging two neighboring clusters (each cluster is a single node at this stage) into a single cluster such that
L(C) in Equation 2.7 shows the largest decrease in bits. The result is then refined using simulated annealing (see [159, Supplementary Information, Appendix]) to minimize the description length, which quantifies the flow of information on the network using Shannon entropy [163]. However, if no such decrease of the MDL is possible, the node stays in its original cluster. This process is repeated, each time “in a new random sequential
order” [157], until no further decrease of the MDL can be achieved. Note that clusters formed at this stage are never separated again, since no further decrease in the MDL value is available. The algorithm then treats the previous clusters as super nodes and repeats the process of merging and comparing super nodes, choosing the merge that gives the largest decrease in MDL. Reusing the node labels inside the clusters and giving each cluster unique entry and exit codes yields, on average, a 32% shorter description length for the network.
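The merging step described above can be sketched as a greedy agglomeration. This is a simplified illustration and not the Infomap implementation. It omits the simulated annealing refinement and the random sequential ordering, always takes the single best merge, and assumes an undirected, unweighted network. The names `entropy`, `map_equation`, and `greedy_merge` are hypothetical.

```python
import math
from collections import defaultdict


def entropy(weights):
    """Shannon entropy (bits) of the distribution obtained by normalizing weights."""
    total = sum(weights)
    if total <= 0:
        return 0.0
    return -sum((w / total) * math.log2(w / total) for w in weights if w > 0)


def map_equation(edges, membership):
    """Description length L(C) for an undirected, unweighted edge list."""
    step = 1.0 / (2.0 * len(edges))
    exit_p = defaultdict(float)    # exit probability of each cluster
    codebook = defaultdict(list)   # node visit frequencies grouped by cluster
    visit = defaultdict(float)
    for u, v in edges:
        visit[u] += step
        visit[v] += step
        if membership[u] != membership[v]:
            exit_p[membership[u]] += step
            exit_p[membership[v]] += step
    for node, p in visit.items():
        codebook[membership[node]].append(p)
    length = sum(exit_p.values()) * entropy(list(exit_p.values()))
    for label, probs in codebook.items():
        book = probs + [exit_p[label]]
        length += sum(book) * entropy(book)
    return length


def greedy_merge(edges):
    """Repeatedly apply the cluster merge that decreases L(C) the most."""
    membership = {n: n for edge in edges for n in edge}  # every node starts alone
    best = map_equation(edges, membership)
    while True:
        # Only clusters joined by at least one edge are candidates for merging.
        pairs = {tuple(sorted((membership[u], membership[v])))
                 for u, v in edges if membership[u] != membership[v]}
        best_trial, best_len = None, best
        for a, b in sorted(pairs):
            trial = {n: (a if c == b else c) for n, c in membership.items()}
            length = map_equation(edges, trial)
            if length < best_len - 1e-12:
                best_trial, best_len = trial, length
        if best_trial is None:
            return membership, best  # no merge shortens the description
        membership, best = best_trial, best_len


# Toy network (an assumption for illustration): two triangles joined by one edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
membership, length = greedy_merge(edges)
print(membership, round(length, 4))
```

On the toy network the sketch stops at the two triangles, because merging them into one cluster would lengthen the description. Re-evaluating the full map equation for every candidate merge is chosen here for clarity; it is far more expensive than the incremental updates a practical implementation would need.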
Additionally, this random-walk-based algorithm can be adapted to reveal the hierarchical structure of large-scale networks [160], as it agglomerates clusters into super nodes.