Scalable influence maximization in social networks


Scalable Influence Maximization in Social Networks

A thesis submitted by

Carlos Campo

In partial fulfillment of the requirements for the degree of

Master of Science

Universidad De Los Andes

Date

January 2012

Advisor


Abstract

Influence maximization is the problem of selecting a subset of individuals in a social network such that they maximize the spread of an influence. It is desirable to generate an idea diffusion cascade by making these initial influencers adopt it first. In other words, using the word-of-mouth “power” of these initial individuals should allow us to maximize the spread of an idea over the network.

The problem of finding the most influential nodes in a network is NP-hard. Kempe, Kleinberg, and Tardos (2003) proved that a greedy algorithm obtains a solution that is provably within 63% of optimal for several cases; however, it is computationally prohibitive to run this algorithm on large social networks. Finding scalable solutions for this problem is critical for enabling applications in viral social marketing, among other topics such as detection sensor placement or trend monitoring.

In this thesis, a new heuristic algorithm is presented for mining top-K influencers based on a local influence region probability spread score. The proposed algorithm is composed of: 1) a bounded variation of the breadth-first search traversal algorithm for calculating the mentioned score of each individual; and 2) an iterative greedy select-and-update algorithm for choosing top influencers. Experimental results show that this new algorithm can in several cases perform better (in influence spread and computational efficiency) than previous works.


Chapter 1 Introduction

The invention of the World Wide Web marked the beginning of an age of massive digital information sharing. The web has grown to become the biggest source of information. As a result, several challenges have arisen related to this phenomenon. One of the most important challenges is knowledge discovery; it is fundamental to be able to connect and mine all of this data to produce new valuable knowledge [1].

Online social networks, in particular, are seen as a marketing platform with great potential to be exploited. Social network analysis and marketing are aware of this opportunity because diverse organizations could find very valuable information for decision making. For example, all the information generated in real time may allow companies to monitor and offer their products based on user needs and wishes.

The similarities between virtual social networks and real ones allow the word-of-mouth principle to be applied to promote products, services and ideas over the Internet. The word-of-mouth principle states that people near you, like family and friends, significantly bias the decisions you make when you acquire a product or use a service. Word-of-mouth is said to be the most effective marketing strategy [2], so companies are eager to use this technique in the virtual world. Indeed, this is one scenario where influence maximization becomes valuable.

Influence maximization is about taking advantage of the word-of-mouth principle. If a company wants to market a new product, it would like to have a small target group of individuals to market to. The company would give free trials of the product to these people, hoping they will recommend it to their friends. Then, those friends would also suggest the product to their friends, and so on. This would create a cascade effect, which is the idea behind viral marketing. The problem here is to select a subset of individuals that makes this spread as big as possible, taking into account that there is a limited budget for motivating the initial set of individuals.

Influence spread and key player identification have been studied by economics and organization theory scientists. Social network books [3, 4] have been published describing basic heuristics used to solve this problem. Recent works like Borgatti’s [5] have studied the problem in detail. Borgatti has proposed some greedy approaches to find key players, but these solutions are computationally expensive. Finally, other authors like Delre et al. [6, 7] and Peyton [8] have made important progress on how network structure and diffusion models affect influence spread on social networks.

From the computer science point of view, maximization of influence in social networks can be specified as a graph optimization problem. Using this approach, Kempe, Kleinberg, and Tardos [9] proved that this problem is NP-hard. In addition, they proposed a greedy algorithm which obtains a solution that is provably within (1 - 1/e) of the optimal. However, it is computationally prohibitive to run this algorithm on large social networks. Consequently, finding scalable and efficient solutions for this problem has motivated several recent research efforts, including Chen et al. [10, 11], Leskovec et al. [12], Saito et al. [13] and others [14–20].

Objective and Contributions

This research is focused on addressing scalability issues related to the influence maximization problem in social networks while minimizing any eventual trade-off in effective spread. To that end, a heuristic algorithm for maximizing influence in a social network is presented. This algorithm is based on calculating an influence probability spread score for each individual in the network to estimate its influence potential. Scores are computed using a specialized version of the breadth-first search (BFS) graph traversal algorithm. Then, individuals are ranked by score and the top influencers are selected. Although selection is an iterative greedy algorithm, it penalizes the score of individuals close to already selected ones to avoid losing total spread because of influence superposition.

A local influence region probability spread (LIPS) model is proposed to describe how to calculate the above-mentioned scores in limited influence regions. The region size can be set through parameters, so a trade-off between the efficiency and the accuracy of the algorithm can be made by parameter configuration. The use of a limited local influence region coupled with an efficient graph traversal algorithm gives the LIPS approximation a gain in scalability.

In this document, an analysis of the most important existing approaches is also introduced, and experiments are run to compare these algorithms. Finally, experimental results show that the LIPS approach can in several cases outperform previous solutions in total influence spread and computational efficiency.

Thesis Outline

The rest of this thesis is structured as follows:

Chapter 2 sets the stage by briefly reviewing the basic concepts of social network graph representation and influence diffusion models. It also introduces a formal statement of the problem.

Chapter 3 describes the most relevant algorithms previously developed for solving the influence maximization problem.

Chapter 4 outlines the LIPS approach for maximizing the spread of influence, combining basic graph traversal algorithms and ideas from degree discount heuristics.

Chapter 5 describes the findings obtained while testing the LIPS algorithm in different configurations. A comparison between LIPS and other algorithms is shown.

Chapter 6 concludes by summarizing the findings, contributions and implications of this work on influence maximization.


Chapter 2 Influence on Social Network

Graph Basics

It is assumed that the reader is familiar with the fundamental concepts of graph theory. If not, see [21] for a comprehensive introduction to graph theory. This chapter provides an introduction to some concepts of social network graphs, diffusion models and the influence maximization problem.

2.1 Social Network Model

A social network can be represented as an undirected graph G = (V, E) where the vertices (nodes) V are people and the edges E are the connections between those people [3]. Depending on the kind of social network and the situation being studied, connections may have different meanings (e.g. friendship, collaboration, etc). In the influence maximization problem, each undirected edge (u, v) corresponds to an influence that u exerts over v to adopt a desired behavior. This influence has probability pp(u, v) of being successful in a given context or topic.

2.2 Diffusion Models

As described in [9], influence diffusion in a social network can be represented as a stochastic process. Diffusion models specify the operational way in which influence is propagated through the graph (e.g. in a simulation). In applied viral marketing, defining diffusion models for social networks is very important because the top influencers found by an influence maximization algorithm depend on the selected influence propagation approach.

The most popular diffusion models for influence maximization are the Independent Cascade Model and the Linear Threshold Model. As proved in [22], these two models can be unified with proper parameter initialization. So, the LIPS strategy would be applicable to both models, although the Independent Cascade Model is the one used in this research.

2.2.1 Independent Cascade Model

In the Independent Cascade Model, nodes are classified as active if they have been influenced, or inactive if they have not. Only active nodes can try to influence inactive nodes. When a node u first becomes active in step t of the diffusion process, it is given a single chance to activate each of its inactive neighbors v. The success of the activation depends on the edge propagation probability $pp_{u,v}$. If u succeeds, v will be active at step $t+1$.

If this single attempt fails, u cannot try to activate v at any future time. Deciding whether node u succeeds in activating a neighbor node v is as simple as simulating the toss of a biased coin.

Let $S_t$ denote the set of nodes activated at step t. If a seed set $S_0$ is selected, then the Independent Cascade Model process runs until $S_{t_f} = \emptyset$ at the final step $t_f$.
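The diffusion process described above can be sketched in a few lines of Python. This is only an illustrative sketch: the adjacency-list representation (a dict mapping each node to a list of `(neighbor, probability)` pairs) and the function name `simulate_ic` are assumptions made for the example, not part of the thesis.

```python
import random


def simulate_ic(graph, seeds, rng=None):
    """Run one Independent Cascade diffusion and return the set of active nodes.

    graph: dict mapping node -> list of (neighbor, propagation_probability).
    seeds: iterable with the initially active nodes (the seed set S_0).
    """
    rng = rng or random.Random()
    active = set(seeds)
    frontier = list(active)  # nodes activated in the previous step
    while frontier:  # the process stops when no node was newly activated
        next_frontier = []
        for u in frontier:
            for v, p in graph.get(u, []):
                # u gets exactly one chance to activate each inactive neighbor:
                # the toss of a biased coin with success probability p.
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active
```

Averaging `len(simulate_ic(graph, seeds))` over many runs gives a Monte Carlo estimate of the expected number of activated nodes for a given seed set.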

Weighted Cascade Model

Weighted Cascade Model is a special case of the Independent Cascade Model in which the sum of the influence probabilities of all of the predecessors of a given node v is equal to one (1). This condition can also be written as:

$$\sum_{u \in V} pp_{u,v} = 1 \qquad (2.1)$$

In cases where every neighbor of v influences it equally:

$$pp_{u,v} = \frac{1}{indegree(v)} \qquad (2.2)$$

where indegree(v) is the number of incoming edges to v.
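As a minimal sketch of equation (2.2), the following Python function assigns weighted cascade probabilities to a list of directed edges. The function name and the edge-list input format are assumptions for illustration, and the edge list is assumed to contain no duplicate edges.

```python
from collections import defaultdict


def weighted_cascade_probabilities(edges):
    """Return {(u, v): 1 / indegree(v)} for a list of directed edges (u, v)."""
    indegree = defaultdict(int)
    for _, v in edges:
        indegree[v] += 1  # count the incoming edges of each node
    return {(u, v): 1.0 / indegree[v] for u, v in edges}
```

With this assignment, the probabilities on the incoming edges of every node with at least one predecessor add up to one, as required by equation (2.1).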


2.2.2 Linear Threshold Model

In this model, nodes are also classified as active or inactive. Each node v has a threshold $\theta_v$ in the interval (0, 1]. This threshold is the minimum accumulated incoming influence required for v to be successfully influenced. Each edge from neighbor u to v has a previously assigned probability $pp_{u,v}$ such that $\sum_{u \in V} pp_{u,v} \le 1$ for every $v \in V$. If the sum of the probabilities of the edges from the active neighbors of v is greater than the activation threshold of v, then v becomes active. The diffusion ends when there are no more nodes to activate. If active(u) is true only when u is active, then the following condition is satisfied for every $v \in V$ at model completion:

$$\sum_{u \in V,\ active(u) = true} pp_{u,v} > \theta_v \rightarrow active(v) = true \qquad (2.3)$$
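A small Python sketch of this process follows. It takes the thresholds as given input; how they are chosen is outside the scope of the sketch. The input format (a dict of incoming edges per node) and the function name `simulate_lt` are assumptions for illustration.

```python
def simulate_lt(in_edges, thresholds, seeds):
    """Run the Linear Threshold diffusion until no more nodes can be activated.

    in_edges: dict mapping node v -> list of (u, pp_uv) for each edge u -> v.
    thresholds: dict mapping node v -> activation threshold theta_v.
    seeds: iterable with the initially active nodes.
    """
    active = set(seeds)
    changed = True
    while changed:  # repeat until a full pass activates nothing new
        changed = False
        for v, predecessors in in_edges.items():
            if v in active:
                continue
            # accumulated influence coming from the active neighbors of v
            incoming = sum(w for u, w in predecessors if u in active)
            if incoming > thresholds[v]:
                active.add(v)
                changed = True
    return active
```

When the loop ends, condition (2.3) holds for every node that has an entry in `in_edges`.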

2.3 Influence Maximization Problem

Definition

A social network is considered to be an undirected graph G = (V, E) where the nodes V are people and the edges E are the connections between those people. For every (u, v) ∈ E, pp(u, v) denotes the propagation probability of the edge, which is the probability that node u influences node v to adopt a desired behavior. In a particular propagation model, the expected influence spread of S, which is the expected number of activated nodes given seed set S, is denoted as $\sigma_G(S)$.

For a small number k (k << |V|), the influence maximization problem is to select a subset $S_0$ from V such that $|S_0| = k$ and $\sigma_G(S_0)$ is maximized.


Chapter 3 Influence Maximization Algorithms

In this chapter, a detailed explanation of the most relevant algorithms is presented. This set of algorithms includes basic centrality based heuristics, greedy algorithms and other mixed approaches.

3.1 Graph Centrality Heuristics

Basic heuristic solutions for influence maximization are based on the centrality of a node with respect to a graph [3]. Centrality is a measurement of the relative importance of a node within a graph given a particular criterion. There are many alternative criteria to define centrality, but the most used measures of centrality for influence maximization in large social networks are degree centrality, weighted degree centrality and closeness centrality [11, 23]. These heuristics work by calculating the centrality value for each node, sorting the nodes by that value, and finally selecting the top k nodes as the solution seed set.

Borgatti [5] shows these approaches have many limitations. Most centrality measures were not developed taking into account graph structure issues. Also, these solutions do not evaluate the impact of selecting seeds which are close to each other.


Degree Centrality

This is the simplest centrality measure. Degree centrality assigns the importance of a node based on its number of outgoing edges (outgoing degree). For a graph G = (V, E), the degree centrality measure for a vertex u is:

$$dc_u = degree(u) \qquad (3.1)$$

Its ease of calculation makes it a natural candidate heuristic for selecting influencers, though its influence spread results are far from optimal [9–11].

Weighted Degree Centrality

Weighted degree centrality is very similar to degree centrality. Weighted degree centrality assigns the importance of a node based on the sum of the weights of its outgoing edges.

For a graph G = (V, E), let weight(u, v) be the weight of the edge going from u to v. Then, the weighted degree centrality measure for a vertex u is:

$$wdc_u = \sum_{(u,v) \in E} weight(u, v) \qquad (3.2)$$

It is as easy to calculate as the previous measure. Its influence spread results are not optimal either, but they are better than those of degree centrality [9–11].
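Both degree-based heuristics reduce to scoring every node and keeping the k best. The sketch below illustrates this with Python's `heapq.nlargest`; the adjacency-list format and the function names are assumptions made for the example.

```python
import heapq


def top_k_by_degree(graph, k):
    """Select the k nodes with the highest outgoing degree (equation 3.1)."""
    scores = {u: len(neighbors) for u, neighbors in graph.items()}
    return heapq.nlargest(k, scores, key=scores.get)


def top_k_by_weighted_degree(graph, k):
    """Select the k nodes with the highest sum of outgoing weights (equation 3.2)."""
    scores = {u: sum(w for _, w in neighbors) for u, neighbors in graph.items()}
    return heapq.nlargest(k, scores, key=scores.get)


# graph: dict mapping node -> list of (neighbor, weight)
example = {"a": [("b", 0.1), ("c", 0.1)], "b": [("c", 0.9)], "c": []}
```

On `example`, the two measures disagree: node `a` has the most outgoing edges, while node `b` has the largest total outgoing weight.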

Closeness Centrality

This particular centrality measure is based on the minimum distance between a vertex and all of the other reachable vertices. The distance between two nodes is inversely related to the exerted influence.

For a graph G = (V, E), let shortestDistance(u, v) be a function that returns the length of the shortest path from u to v. The closeness centrality measure for a vertex u is:

$$cc_u = \sum_{v \in V - \{u\}} \cdots$$

Closeness centrality is very time consuming because an all-pairs shortest path algorithm must be run. Also, the top ranked nodes according to this measure tend to be very close to each other, so the total spread is affected. Consequently, this centrality is discarded for influence maximization in large graphs, as described in [9, 10].

3.2 Greedy Approaches

Kempe et al. [9] proposed a greedy algorithm using a hill climbing strategy that guarantees an influence spread within (1 - 1/e) of the optimal influence spread when the Independent Cascade Model or the Linear Threshold Model is used. This lower bound gives the algorithm better result guarantees than the basic centrality based heuristics. However, it fails to be an efficient, scalable approach, as demonstrated below.

As can be seen in Algorithm 1, Kempe’s algorithm builds a solution seed set S in k steps. At each step, the node that increases the expected influence spread the most is added to the seed set. To obtain the expected influence spread augmentation for a node v ∈ V − S, the results of calling function calcAugment(S ∪ {v}) R times are averaged.

Algorithm 1 Greedy Algorithm (Kempe et al. [9])

1: set S = ∅
2: for i = 1 to k do
3:     for all v ∈ (V − S) do
4:         augment_v = 0
5:         for j = 1 to R do
6:             augment_v += calcAugment(S ∪ {v})
7:         end for
8:         augment_v /= R
9:     end for
10:    S = S ∪ {argmax_{v ∈ V−S} {augment_v}}
11: end for
12: return S

Given a seed set S and a cascade model (Linear Threshold Model or Independent Cascade Model), calcAugment(S) returns the number of active nodes after simulating (stochastically) the diffusion process. As it is a stochastic simulation, Kempe et al. suggest that the value of R should be twenty thousand (20000) to guarantee a stable average value of the influence spread augmentation.

This greedy algorithm is very time consuming because each simulation has to be run R times for almost every one of the |V| nodes during k iterations. In addition, each simulation is done using a graph traversal algorithm, whose time complexity is O(|V| + |E|). The time complexity of this algorithm is therefore O(k · |V| · R · (|V| + |E|)). As a result, the execution of Kempe’s algorithm can take days or weeks for a graph with several thousand nodes.
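To make the cost structure concrete, the greedy strategy can be sketched in Python as below, using the Independent Cascade Model for the simulations. This is an illustrative sketch only: the names `ic_spread` and `greedy_seeds`, the adjacency-list format and the parameter `runs` (playing the role of R) are assumptions, and it expects k to be at most the number of nodes.

```python
import random


def ic_spread(graph, seeds, rng):
    """Simulate one Independent Cascade run and return the number of active nodes."""
    active = set(seeds)
    frontier = list(active)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v, p in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active)


def greedy_seeds(graph, k, runs, rng=None):
    """Hill-climbing seed selection: k rounds, each one simulating every candidate."""
    rng = rng or random.Random(0)
    nodes = set(graph) | {v for nbrs in graph.values() for v, _ in nbrs}
    seeds = []
    for _ in range(k):
        best_node, best_spread = None, -1.0
        for v in sorted(nodes - set(seeds)):
            # average spread of the current seed set extended with candidate v
            total = sum(ic_spread(graph, seeds + [v], rng) for _ in range(runs))
            if total / runs > best_spread:
                best_node, best_spread = v, total / runs
        seeds.append(best_node)
    return seeds
```

The three nested loops (k rounds, all candidates, `runs` simulations, each one a graph traversal) correspond directly to the factors of the complexity expression above.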

In 2007, Leskovec et al. proposed an optimization of Kempe’s algorithm using the submodularity property (see Appendix A) of the objective function $\sigma_G(S)$. They made the greedy algorithm execute around 700 times faster simply by noticing that it is not necessary to recalculate the expected influence spread for all nodes every time. This optimization is called CELF (Cost Effective Lazy Forward).

Algorithm 2 Greedy Algorithm + CELF (Leskovec et al. [12])

1: set S = ∅
2: for i = 1 to k do
3:     maxAug = 0
4:     for all v ∈ (V − S) do
5:         if i = 1 or maxAug < augment_v then
6:             augment_v = 0
7:             for j = 1 to NumSims do
8:                 augment_v += calcAugment(S ∪ {v})
9:             end for
10:            augment_v /= NumSims
11:            maxAug = max(maxAug, augment_v)
12:        end if
13:    end for
14:    S = S ∪ {argmax_{v ∈ V−S} {augment_v}}
15: end for
16: return S

Algorithm 2 shows how this optimized greedy version works. In each iteration over the candidate nodes (lines 4-13), only the node which augments the expected influence spread the most is needed. In the first iteration, the influence spread value of each vertex is calculated. At any later iteration, if the previously calculated influence spread of a node v (augment_v) is less than the current iteration's maximum influence spread (maxAug), there is no need to calculate augment_v again: augment_v cannot improve because of the submodularity property, since the seed set of the current iteration is a superset of the seed set of the previous iteration.

For the CELF greedy algorithm, the maximum achieved spread is the same as that of Kempe’s original algorithm, but the results are obtained faster. Even so, this solution cannot be scaled beyond thousands of vertices: execution time remains in the order of hours or days [11]. The time complexity of this algorithm is the same as that of Kempe’s original algorithm, O(k · |V| · R · (|V| + |E|)).
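The lazy evaluation idea is often implemented with a priority queue of possibly stale marginal gains. The Python sketch below follows that common formulation, which differs in bookkeeping from the loop shown in Algorithm 2; the function names, the adjacency-list format and the parameter `runs` are assumptions for illustration.

```python
import heapq
import random


def expected_spread(graph, seeds, runs, rng):
    """Monte Carlo estimate of the Independent Cascade spread of a seed set."""
    total = 0
    for _ in range(runs):
        active = set(seeds)
        frontier = list(active)
        while frontier:
            next_frontier = []
            for u in frontier:
                for v, p in graph.get(u, []):
                    if v not in active and rng.random() < p:
                        active.add(v)
                        next_frontier.append(v)
            frontier = next_frontier
        total += len(active)
    return total / runs


def celf_seeds(graph, k, runs, rng=None):
    """Lazy greedy selection: re-evaluate a candidate only when it reaches the top."""
    rng = rng or random.Random(0)
    nodes = sorted(set(graph) | {v for nbrs in graph.values() for v, _ in nbrs})
    # heap entries: (-marginal_gain, node, size of the seed set when it was computed)
    heap = [(-expected_spread(graph, [v], runs, rng), v, 0) for v in nodes]
    heapq.heapify(heap)
    seeds, spread = [], 0.0
    while len(seeds) < k and heap:
        neg_gain, v, computed_at = heapq.heappop(heap)
        if computed_at == len(seeds):
            # the gain is up to date, and by submodularity no stale entry can beat it
            seeds.append(v)
            spread += -neg_gain
        else:
            gain = expected_spread(graph, seeds + [v], runs, rng) - spread
            heapq.heappush(heap, (-gain, v, len(seeds)))
    return seeds
```

After the first round, most candidates stay in the heap with their old gains and are never simulated again unless they rise to the top, which is where the speed-up comes from.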

3.3 PMIA model

Because of the long execution times of the greedy algorithms, Chen et al. [10, 11] developed a heuristic algorithm that can achieve influence spread values very close to those of the greedy algorithms in less time. This heuristic does not have to stochastically simulate the spread to make a decision. Instead of using the Independent Cascade Model to calculate influence, they use an approximated model called PMIA. Using this model, on inputs where the greedy algorithm can take hours, their approach can take seconds.

Let pp(P) be the influence propagation probability of path P. pp(P) is calculated as the product of the influence propagation probabilities of all the edges of the path. Also, assume $\mathcal{P}(G, u, v)$ is the set of all paths going from u to v. Then, the maximum influence path from u to v ($MIP_G(u, v)$) can be defined as:

$$MIP_G(u, v) = \arg\max_{P} \{ pp(P) \mid P \in \mathcal{P}(G, u, v) \} \qquad (3.4)$$

PMIA is defined as a propagation model in which a node u can only influence a node v through the maximum influence path $MIP_G(u, v)$ when $pp(MIP_G(u, v))$ is above a parameterizable threshold θ. Every node u has a local influence out-arborescence composed of every node v it can reach in this way. Also, if a node v is influenced by u, then node u is said to belong to the in-arborescence of v.
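Because the probability of a path is a product of edge probabilities, maximizing it is equivalent to minimizing the sum of the negative logarithms of those probabilities, so a lowest cost path algorithm such as Dijkstra's can be used. The sketch below illustrates this for a single source; the function name, the adjacency-list format and the strict comparison against `theta` are assumptions for the example, not the implementation of [10, 11].

```python
import heapq
import math


def max_influence_paths(graph, source, theta):
    """Return {v: pp(MIP(source, v))} for every v whose best path exceeds theta.

    graph: dict mapping node -> list of (neighbor, propagation_probability).
    """
    best = {source: 1.0}
    heap = [(0.0, source)]  # (sum of -log(p) along the path, node)
    done = set()
    while heap:
        _, u = heapq.heappop(heap)
        if u in done:
            continue  # stale entry: u was already settled with a better path
        done.add(u)
        for v, p in graph.get(u, []):
            if p <= 0.0:
                continue
            prob = best[u] * p
            # Path probabilities only shrink along a path, so paths at or below
            # the threshold can be pruned without losing any valid result.
            if prob > theta and prob > best.get(v, 0.0):
                best[v] = prob
                heapq.heappush(heap, (-math.log(prob), v))
    return best
```

The returned dictionary contains the nodes of the local out-arborescence of `source` together with the probability of their maximum influence path.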

Given a seed set S, the activation probability of each node v ∈ V can be calculated using the out-arborescences of the seeds. The sum of the activation probabilities of all nodes influenced by the seed set is the objective function for this model, $\sigma_M(S)$.

Algorithm 3 Generic PMIA Structure (Chen et al. [11])

1: initialize S = ∅
2: for all v ∈ V do
3:     influence_v = PMIACalculateInfluence(v)
4: end for
5: for i = 1 to k do
6:     u = arg max_{v ∈ (V−S)} {influence_v}
7:     S = S ∪ {u}
8:     for all v in in-arborescence(u) do
9:         PMIADiscountInfluence(v, u)
10:    end for
11: end for
12: return S

Algorithm 3 (PMIA) can be separated into two steps. First (lines 2-4), the activation probability sum (influence) for each vertex's out-arborescence is calculated by function PMIACalculateInfluence(v). This is done using Dijkstra’s algorithm [24].

In the second step (lines 5-11), k seeds are selected. In each iteration, the vertex u that has the highest influence is selected as part of the seed set. Once vertex u is added to the seed set, the influence of each of the nodes belonging to u’s in-arborescence is updated to discount the influence propagated through u.

To calculate this algorithm’s time complexity [11], let $n_{i\theta}$ be the size of the biggest in-arborescence set and let $n_{o\theta}$ be the size of the biggest out-arborescence set. Assume the maximum running time to calculate an arborescence using Dijkstra’s lowest cost path algorithm is $t_{i\theta}$. Then, the initialization step is $O(|V| \cdot t_{i\theta})$ and the seed selection step is $O(k \cdot n_{i\theta} \cdot n_{o\theta} \cdot \log|V|)$. So, the PMIA algorithm time complexity is $O(|V| \cdot t_{i\theta} + k \cdot n_{i\theta} \cdot n_{o\theta} \cdot \log|V|)$.

Although the complexity of $t_{i\theta}$ is proportional to the complexity of Dijkstra’s algorithm ($O(t_{i\theta}) \propto O((|E| + |V|) \cdot \log|V|)$), this expression is not explicitly included in PMIA’s complexity. This is because a reasonable value of the threshold θ is assumed such that $n_{i\theta}$, $n_{o\theta}$ and $t_{i\theta}$ are significantly smaller than |V|.


Chapter 4 Influence Maximization Based On Local Influence Region Probability Spread

In this chapter, LIPS is introduced. LIPS is a new approach to the social network influence maximization problem. It is based on the calculation of a local influence region probability spread score for each node. LIPS is composed of a diffusion model and an algorithm; both of them are described in this chapter.

4.1 General Ideas

The LIPS approach is inspired by the research done by Chen et al. [10, 11], as it follows the same general algorithm structure shown in Algorithm 3. However, LIPS has been built on the following hypotheses:

• When only maximum influence paths are used to propagate influence, as in PMIA, valuable information in the form of propagation probabilities may be lost when certain paths are discarded.

• As propagation probabilities tend to be low in most cases, the accumulated propagation probability through a single path may decrease very fast.

• More rigorous approximated propagation probabilities calculated within a compact region (unweighted shortest path) of a node could be better than using a sparser, farther-reaching region like the one proposed by PMIA.

• Calculating unweighted shortest paths (BFS) has less computational complexity than calculating lowest cost paths (Dijkstra’s).

4.2 LIPS Model

The LIPS model is an approximation of the Independent Cascade Model that privileges influence spread in local regions around each node. Unlike the Independent Cascade Model, which counts the number of activated nodes, the LIPS model calculates an approximate activation probability for the nodes in the influence region of a seed. So, LIPS does not propagate influence as a discrete value (active or inactive) but as an influence probability. Consequently, if a node is said to be influenced, it means there is a probability that this node is influenced (activated) in the Independent Cascade Model. The (LIPS) influence region in which influence probability is propagated is limited using two parameters given to the model: the propagation threshold θ and the maximum level $d_{max}$.

In the LIPS model, influence probability is propagated in successive iterations, also called levels. In the first level, only the initial seed s is influenced, with probability one (1). Then, the seed tries to propagate accumulated influence probability to its out-neighbors. If it succeeds, those neighbors are in level two (2). Nodes in the third level are those which have been successfully influenced by nodes at level two (2), and so on. The accumulated propagation influence probability from a node u to v is the product of the activation probability of node u and the propagation probability of edge (u, v). A node u succeeds in influencing another node v only if the applied accumulated propagation influence probability is greater than the propagation threshold θ and the level of node v is not greater than the maximum level $d_{max}$.

Nodes at a particular level cannot influence nodes in previous levels because cycles would make the task of calculating activation probabilities very complex. On the other hand, nodes in the same level can influence each other, but only using the component of the activation probability which has been caused by nodes of the previous level. Again, this is to avoid cyclic probabilities between nodes in the same level.


4.2.1 Formal Definition

For a graph G(V, E) and a seed s in seed set S, assume $p_a(s, v)$ is the probability that a node v in (V − S) is activated by seed s using the LIPS model. Then, nodes can be organized around s in levels, where $Q_{s,d}$ is the set of nodes at level d, defined as:

$$Q_{s,1} = \{s\} \qquad (4.1)$$

$$Q_{s,d} = \bigcup_{\substack{\forall_{i=1}^{d-1}\, v \notin Q_{s,i} \,\wedge\, u \in Q_{s,d-1} \\ \wedge\, (u,v) \in E \,\wedge\, p_a(s,u) \cdot pp(u,v) > \theta}} \{v\} \qquad (4.2)$$

Then the influence region of seed s ($Q_s$) is:

$$Q_s = \bigcup_{i=1}^{d_{max}} Q_{s,i} \qquad (4.3)$$

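The definition of the activation probability $p_a(s, v)$ is divided into two components: a transmitted probability $p_{tx}(s, v)$, which is the propagated probability caused by nodes of the previous level, and a same-level probability $p_l(s, v)$, which is the probability caused by nodes of the same level.

$$p_{tx}(s, v : Q_{s,d}) = 1 - \prod_{\substack{u \in Q_{s,d-1} \,\wedge\, (u,v) \in E \\ \wedge\, p_a(s,u) \cdot pp(u,v) > \theta}} \left(1 - p_a(s, u) \cdot pp(u, v)\right) \qquad (4.4)$$

$$p_l(s, v : Q_{s,d}) = 1 - \prod_{\substack{u \in Q_{s,d} \,\wedge\, (u,v) \in E \\ \wedge\, p_{tx}(s,u) \cdot pp(u,v) > \theta}} \left(1 - p_{tx}(s, u) \cdot pp(u, v)\right) \qquad (4.5)$$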

Using the probability addition rule, the total activation probability of a node is defined as:

$$p_a(s, s) = 1 \qquad (4.6)$$

$$p_a(s, v) = p_{tx}(s, v) + p_l(s, v) - p_{tx}(s, v) \cdot p_l(s, v) \qquad (4.7)$$
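As a small worked illustration (the numbers are assumed for the example and are not taken from the experiments), suppose a node v receives a transmitted probability of 0.3 and a same-level probability of 0.2. The addition rule then gives:

$$p_a(s, v) = 0.3 + 0.2 - 0.3 \cdot 0.2 = 0.44 = 1 - (1 - 0.3)(1 - 0.2)$$

The second form shows that the rule treats the two components as independent events: v remains inactive only if both the transmitted and the same-level influence fail.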

Finally, the objective function $\sigma_L(S)$ to be maximized is the sum of all calculated probabilities in the model. The sum of the activation probabilities of one seed s is called the spread score of s. The definition of $\sigma_L(S)$ is:

$$\sigma_L(S) = \sum_{s \in S \,\wedge\, v \in Q_{d_{max}}} p_a(s, v) \qquad (4.8)$$

Using LIPS model, it is possible to calculate the spread score for each node in the network. Then seeds are selected using a hill climbing strategy as it is described in the next section.

4.3 The Algorithm

In this section, the LIPS algorithm is presented in two blocks: Algorithm 4 shows the main iterative greedy select-and-update algorithm for selecting the top-k influencers, and Algorithm 5 is the bounded variation of the breadth-first search traversal algorithm for calculating local influence spread scores according to the LIPS model.

The first step of the LIPS algorithm is to calculate the score of each vertex v ∈ V by calling function CalculateLIPSScore(G, v, θ, d_max) (lines 3-5). While calculating scores, I_v is also filled with all nodes u which have v in their local influence region. Then, the algorithm iterates k times to build the seed set. In each iteration, the highest score vertex s in (V − S) is chosen as a seed (lines 7-8). Finally, the score is updated for each v in I_s.

Algorithm 4 LIPS(G, k, θ, d_max)

1: set global S = ∅
2: set global I_v = ∅ for each v ∈ V
3: for all v in V do
4:     score_v = CalculateLIPSScore(G, v, θ, d_max)
5: end for
6: for i = 1 to k do
7:     s = arg max_v {score_v | v ∈ (V − S)}
8:     S = S ∪ {s}
9:     for all v in I_s do
10:        score_v = CalculateLIPSScore(G, v, θ, d_max)
11:    end for
12: end for


| Name | Description |
|---|---|
| s | source node to calculate the score for |
| θ | minimum propagated influence threshold |
| d_max | maximum level to be reached when building the local influence region |
| score | local influence region probability spread score of s |
| level(v) | level of node v; zero (0) means node v has not been influenced |
| p_active(v) | activation probability of node v |
| p_tx(v) | transmitted component of the activation probability of node v |
| p_level(v) | same-level component of the activation probability of node v |
| Q_i | set of nodes at level i |
| I_v | (global) set of influencers of v |
| S | (global) seed set |

Table 4.1: Important variables used in LIPS algorithms

Score recalculation is done to avoid influence superposition. As any node s in the seed set S is an influencer, any influence propagated through a path arriving at or passing through node s must be discarded from the influence spread score of other nodes. That is why CalculateLIPSScore is called again for each v in I_s (lines 9-11).

CalculateLIPSScore in Algorithm 5 calculates the local influence region spread score using a breadth-first search algorithm, which allows the graph to be traversed in level order. The variables used are described in Table 4.1.

This algorithm starts by initializing variables in lines 2 to 5. Then, the source node s is added to the set of elements in the first level Q_1 and its transmitted probability is set to one (1). As the LIPS model is bounded to d_max levels of influence, a loop is declared in line 8 which traverses only the first d_max levels, one at a time.

Inside the level traversal loop, the first step is to calculate activation probability updates contributed between nodes of the same level. After that, for each node u in the current level i: 1) its total activation probability is calculated; 2) total score for s is updated and s is set as an influencer of u; and finally, 3) transmitted probabilities are calculated for the next level.
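The level-order traversal described above can be sketched in Python as follows. This is a hedged illustration rather than the thesis implementation: the function name `calculate_lips_score`, the adjacency-list format and the choice of returning the region (so that a caller can maintain the influencer sets) are assumptions, and self-loops are assumed to be absent.

```python
def calculate_lips_score(out_edges, s, theta, d_max, seeds=frozenset()):
    """Return (score, p_active) for source s, where p_active maps each node of
    the local influence region of s to its approximate activation probability.

    out_edges: dict mapping u -> list of (v, pp_uv) for each edge u -> v.
    seeds: nodes already selected as seeds; no influence is sent to them.
    """
    p_tx = {s: 1.0}   # transmitted component, set by nodes of the previous level
    level = {s: 1}    # a node missing from this dict has not been influenced
    levels = [[s]]    # levels[i - 1] holds the nodes at level i
    p_active = {}
    score = 0.0
    for i in range(1, d_max + 1):
        if i > len(levels):
            break  # nothing was reached at this level
        current = levels[i - 1]
        # Same-level component: uses only the transmitted part of the sender.
        p_level = {v: 0.0 for v in current}
        for u in current:
            for v, p in out_edges.get(u, []):
                if level.get(v) == i and p_tx[u] * p > theta:
                    p_level[v] += p_tx[u] * p * (1.0 - p_level[v])
        next_level = []
        for u in current:
            p_active[u] = p_tx[u] + p_level[u] - p_tx[u] * p_level[u]
            score += p_active[u]
            if i == d_max:
                continue  # the region is bounded to d_max levels
            for v, p in out_edges.get(u, []):
                contribution = p_active[u] * p
                if contribution <= theta or v in seeds:
                    continue
                if v not in level:  # first time v is reached
                    p_tx[v] = contribution
                    level[v] = i + 1
                    next_level.append(v)
                elif level[v] > i:  # v already sits in the next level
                    p_tx[v] += contribution * (1.0 - p_tx[v])
        if next_level:
            levels.append(next_level)
    return score, p_active
```

For example, with edges s→a, s→b, a→b and b→c, all with probability 0.5, θ = 0.1 and d_max = 2, node b collects 0.5 from s plus a same-level contribution of 0.25 from a, which gives 0.625, and the score of s is 1 + 0.5 + 0.625 = 2.125.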

Algorithm 5 CalculateLIPSScore(G, s, θ, d_max)

1: /* Initialization */
2: set score = 0
3: set p_tx(v) = 0 for each node v ∈ V
4: set level(v) = 0 for each node v ∈ V
5: set Q_i = ∅ for each i ∈ [1, d_max]
6: insert s into Q_1
7: set p_tx(s) = 1, level(s) = 1
8: for i = 1 to d_max do
9:     /* Calculate same-level probability */
10:    for all v in Q_i do
11:        set p_level(v) = 0
12:        for all (u, v) in E do
13:            if level(u) = level(v) and p_tx(u) · pp(u, v) > θ then
14:                p_level(v) += p_tx(u) · pp(u, v) · (1 − p_level(v))
15:            end if
16:        end for
17:    end for
18:    for all u in Q_i do
19:        /* Calculate activation probability */
20:        p_active(u) = p_tx(u) + p_level(u) − p_tx(u) · p_level(u)
21:        /* Update score and influencers */
22:        score += p_active(u)
23:        I_u = I_u ∪ {s}
24:        /* Calculate transmitted propagation probability in nodes of the next level */
25:        if i < d_max then
26:            for all (u, v) in E do
27:                if p_active(u) · pp(u, v) > θ and v ∉ S then
28:                    if level(v) = 0 then
29:                        p_tx(v) = p_active(u) · pp(u, v)
30:                        insert v into Q_{level(u)+1}
31:                        level(v) = level(u) + 1
32:                    else if level(u) < level(v) then
33:                        p_tx(v) += p_active(u) · pp(u, v) · (1 − p_tx(v))
34:                    end if
35:                end if
36:            end for
37:        end if
38:    end for
39: end for

4.3.1 Complexity Analysis

For calculating the LIPS algorithm time complexity, let $n_i$ be the size of the biggest influencers set and let $t_i$ be the maximum running time to calculate a local influence propagation score using Algorithm 5 (BFS variation). Assume a binary heap is used to store the scores of all the nodes such that finding the top influencer is O(1). Then, the initialization step is $O(|V| \cdot (t_i + \log|V|))$, which is the complexity of calculating |V| LIPS scores and storing them in a binary heap. The greedy select-and-update step is $O(k \cdot n_i \cdot (t_i + \log|V|))$ because each one of the k times an influencer is selected, the scores of all of its influencers ($n_i$) must be calculated again. So, the total LIPS algorithm time complexity is:

$$O\Big(\underbrace{|V| \cdot (t_i + \overbrace{\log|V|}^{\text{heap-insert}})}_{\text{Initialization}} + \underbrace{k \cdot n_i \cdot (t_i + \overbrace{\log|V|}^{\text{heap-decreaseKey}})}_{\text{Select-and-update}}\Big) \qquad (4.9)$$

LIPS performs best when $n_i$ and $t_i$ are significantly smaller than |V|. This occurs when local influence regions are small. In other words, the most favorable performance scenario happens when the social graph is sparse and reasonable values for θ and $d_{max}$ have been chosen.

Comparing this result against the complexity of the PMIA algorithm (O(|V| · tiθ + k · niθ · noθ · log|V|)), it may not be obvious which algorithm is faster. However, it is important to notice that ti in LIPS is proportional to the complexity of the BFS algorithm (O(ti) ∝ O(|E| + |V|)), which is lighter than the complexity of Dijkstra's algorithm in the PMIA case. The experimentation section shows that LIPS performs better than PMIA in most cases.
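To make the BFS-like cost of a single score computation concrete, Algorithm 5 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the thesis implementation: the graph is assumed to be given as two hypothetical adjacency dictionaries (`out_edges[u][v]` and `in_edges[v][u]`, both holding pp(u, v)), `seeds` stands for the already selected set S, and instead of updating Iu in place the function returns the set of reached nodes so that the caller can record s as one of their influencers.

```python
from collections import defaultdict


def calculate_lips_score(out_edges, in_edges, s, theta, d_max, seeds=frozenset()):
    """Level-bounded BFS sketch of Algorithm 5; returns (score, reached_nodes)."""
    p_tx = defaultdict(float)   # probability transmitted from previous levels
    level = defaultdict(int)    # 0 means "not visited yet"
    queues = {i: [] for i in range(1, d_max + 1)}
    queues[1].append(s)
    p_tx[s], level[s] = 1.0, 1

    score = 0.0
    reached = set()
    for i in range(1, d_max + 1):
        # Same-level probability: contributions from neighbors on level i.
        p_level = {}
        for v in queues[i]:
            p = 0.0
            for u, pp in in_edges.get(v, {}).items():
                if level[u] == level[v] and p_tx[u] * pp > theta:
                    p += p_tx[u] * pp * (1.0 - p)
            p_level[v] = p

        for u in queues[i]:
            # Activation probability combines both sources of influence.
            p_active = p_tx[u] + p_level[u] - p_tx[u] * p_level[u]
            score += p_active
            reached.add(u)
            if i == d_max:
                continue  # do not expand beyond the maximum level
            # Transmit probability to nodes of the next level.
            for v, pp in out_edges.get(u, {}).items():
                contribution = p_active * pp
                if contribution > theta and v not in seeds:
                    if level[v] == 0:
                        p_tx[v] = contribution
                        queues[level[u] + 1].append(v)
                        level[v] = level[u] + 1
                    elif level[u] < level[v]:
                        p_tx[v] += contribution * (1.0 - p_tx[v])
    return score, reached
```

Every node enters at most one queue and each of its edges is inspected a bounded number of times, which is why the work per score stays within the BFS bound; θ and dmax only shrink the region that is actually visited.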


Chapter 5 Influence Spread Experiment

This chapter presents experimental results that evaluate the performance of the LIPS algorithm in comparison to other algorithms. First, different values for the parameters θ and dmax of the LIPS algorithm are tested. Then, its influence spread is compared to that of other algorithms. Finally, tests are run to evaluate the scalability of LIPS.

5.1 Experiment Setup

5.1.1 Datasets

Three real-world networks are used in the experimental stage, following [11] as a guide to keep the results comparable between projects. These social graph datasets were published by Jure Leskovec in the Stanford Large Network Dataset Collection. The first dataset is the Arxiv High Energy Physics paper citation network, called NetHEPT from now on [25]. The second dataset is the who-trusts-whom network of Epinions.com, which will be called epinions for simplicity [26]. The third dataset is the Amazon product co-purchasing network from March 2, 2003 [27].

All of these datasets are directed and contain only the graph structure, so non-uniform propagation probabilities for the edges must be generated. Consequently, a Weighted Cascade Model and a Random Trivalency Model are used to complete the datasets by adding edge probabilities. In the weighted cascade model, as explained in section 2.2.1, the propagation probability from node u to v (pp(u, v)) is 1/indegree(v). In the random trivalency model, pp(u, v) is a value selected at random from the set {0.1, 0.01, 0.001}.
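The two generation models can be sketched in a few lines of Python. This is an illustrative sketch, not the code used in the experiments: the function names and the edge-list representation are hypothetical, the graph is assumed to have no duplicate edges, and the trivalency values are assumed to be drawn uniformly, which matches an expected average of (0.1 + 0.01 + 0.001)/3 = 0.037.

```python
import random


def weighted_cascade(edges):
    """Assign pp(u, v) = 1 / indegree(v) to every directed edge (u, v)."""
    indegree = {}
    for _, v in edges:
        indegree[v] = indegree.get(v, 0) + 1
    return {(u, v): 1.0 / indegree[v] for u, v in edges}


def trivalency(edges, rng=None, values=(0.1, 0.01, 0.001)):
    """Assign each edge one of three probabilities, chosen uniformly at random."""
    rng = rng or random.Random()
    return {(u, v): rng.choice(values) for u, v in edges}
```

Passing a seeded `random.Random` instance makes the trivalency assignment reproducible between runs.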

The resulting datasets summary information is:

                NetHEPT wc  epinions wc  amazon wc  NetHEPT 3  epinions 3  amazon 3
Node Count      9875        75879        262111     9875       75879       262111
Edge Count      25973       405740       899792     25973      405740      899792
Avg Degree      2.63018     5.3472       3.43287    2.63018    5.3472      3.43287
Max Degree      65          3044         420        65         3044        420
Avg pp(u,v)     0.19010     0.10211      0.21226    0.037000   0.03702     0.0369985424
# Components    3011        44523        92874      3011       44523       92874
Max Component   6059        31357        169238     6059       31357       169238
Avg Component   3.2794      1.70427      2.82222    3.2794     1.70427     2.82222

5.1.2 Algorithms and Infrastructure

The list of algorithms compared in the experiment is:

• LIPS: Algorithm based on the LIPS model with different threshold and max level values. Several values for these parameters are tested to check their influence on the algorithm's performance.

• PMIA: PMIA model algorithm with threshold values 101 and 1001 [11].

• PageRank: Algorithm to rank web pages [28].

• Greedy: Kempe’s algorithm [9] using CELF optimization [12].

• Weighted Degree: Basic weighted degree centrality heuristic.

• Degree: Basic degree centrality heuristic.

• Random: It serves as a baseline comparison by selecting random nodes.

Besides evaluating different parameter values for LIPS and PMIA, two different k values are tested when evaluating influence spread. These values of k are |V|/100 and |V|/1000.

To test the seed sets returned by each algorithm, the influence spread is calculated as the average result of running a stochastic simulation on the datasets 20,000 times. The diffusion model used for the simulations was the Independent Cascade Model. The experiments were run on a machine with an AMD Phenom II X6 1100T 3.30 GHz processor and 4 GB of memory, using Windows 7 Ultimate x64.
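The evaluation procedure described above can be sketched as a plain Monte Carlo estimate under the Independent Cascade Model. This is a minimal sketch and not the simulator used in the experiments: the adjacency layout `out_edges[u][v] = pp(u, v)` and the function names are assumptions made for illustration.

```python
import random


def simulate_ic_once(out_edges, seeds, rng):
    """Run one Independent Cascade diffusion and return the number of active nodes."""
    active = set(seeds)
    frontier = list(active)
    while frontier:
        newly_active = []
        for u in frontier:
            for v, pp in out_edges.get(u, {}).items():
                # Each newly active node gets a single chance to activate each neighbor.
                if v not in active and rng.random() < pp:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active
    return len(active)


def estimate_spread(out_edges, seeds, runs=20000, seed=None):
    """Average the cascade size over `runs` independent simulations."""
    rng = random.Random(seed)
    total = sum(simulate_ic_once(out_edges, seeds, rng) for _ in range(runs))
    return total / runs
```

With 20,000 runs the standard error of the estimate shrinks with the square root of the number of runs, which is why the averages are stable enough to compare algorithms.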

With the exception of LIPS, the implementations of the other algorithms were kindly provided by Chi Wang [10, 11].


5.2 Experiment Results

The results of the experimentation phase show the performance of the LIPS algorithm in comparison to other algorithms for influence maximization. In this section, LIPS is tested using different parameter values, its influence spread is calculated, and its ability to handle the proposed datasets is evaluated.

5.2.1 Parameters Tuning

Threshold (θ). For the threshold evaluation, the maximum level limit is held constant while four different thresholds (1/80, 1/120, 1/550 and 1/1001) are tested on the three weighted cascade datasets. As can be seen in the charts of Figure 5.1, the threshold parameter matters more when k is small. As k grows, the effect of the threshold diminishes. Threshold values between 1/80 and 1/120 are recommended; when the threshold (θ) is smaller than 1/120, influence spread quality is reduced.

Figure 5.1: LIPS algorithm using different threshold values. Panels: (a) NetHEPT data (k = |V|/100), (b) NetHEPT data (k = |V|/1000), (c) epinions data (k = |V|/100), (d) epinions data (k = |V|/1000), (e) amazon data (k = |V|/100), (f) amazon data (k = |V|/1000).

Figure 5.2a shows a comparison of the execution times for the threshold values being evaluated. It is clear that the algorithm is faster when the threshold is higher, because a higher threshold bounds the local influence region size and reduces the number of nodes visited and evaluated. It should also be noted that a higher threshold does not always result in better spreads.

Figure 5.2: LIPS algorithm execution time (k = |V|/100). Panels: (a) time for different threshold values, (b) time for different max level values.

Figure 5.3: LIPS algorithm using different dmax values. Panels: (a) NetHEPT data (k = |V|/100), (b) NetHEPT data (k = |V|/1000), (c) epinions data (k = |V|/100), (d) epinions data (k = |V|/1000), (e) amazon data (k = |V|/100), (f) amazon data (k = |V|/1000).

Maximum influence level (dmax). This parameter shows a very different pattern from the threshold. Figure 5.3 shows how changing dmax from 2 to 3 makes a significant difference in spread results. Increasing dmax from 3 to 4 also shows a small but appreciable gain in spread. According to Figure 5.2b, the execution time difference made by increasing the value of dmax by one (1) is around 5%, so it may be worth the time cost to have at least dmax = 3. Based on these datasets, dmax = 3 or dmax = 4 seem to be the ideal settings for this parameter.

5.2.2 Influence Spread

The influence spread results comparing the algorithms are presented below. Although most of the algorithms vary in effectiveness across scenarios, PMIA and LIPS are very consistent: they are always among the best ranked results. The other heuristic algorithms behave well on some datasets, but in other scenarios they deviate widely from the best ones. It can also be seen that, for the same graph structure, results may change considerably with the propagation probability generation model (WC or TRIVALENCY).

Figure 5.4: Spread influence in NetHEPT. Panels: (a) WC (k = |V|/1000), (b) WC (k = |V|/100), (c) TRI (k = |V|/1000), (d) TRI (k = |V|/100).


In the NetHEPT dataset (Figure 5.4), the greedy algorithm is the one that produces the greatest influence spread. However, its advantage over PMIA and LIPS is barely noticeable in the charts. In addition, as k grows, the relative advantage of the greedy algorithm over the others decreases.

Any test run that did not finish after four (4) days is discarded from the results. That is why the greedy algorithm is only compared on the NetHEPT dataset, which is the smallest of the datasets.

Figure 5.5: Spread influence in epinions. Panels: (a) WC (k = |V|/1000), (b) WC (k = |V|/100), (c) TRI (k = |V|/1000), (d) TRI (k = |V|/100).

For the epinions dataset (Figure 5.5) in its weighted cascade version, the best spread is achieved by the LIPS algorithm by a very small margin over PMIA. Comparing these results to those of the NetHEPT dataset, it can be seen that the gap between the best algorithms and the others is not significantly reduced as k grows.

In the TRIVALENCY version of the epinions dataset, LIPS performs poorly when k is small (k = |V|/1000). As k grows, LIPS is again comparable to PMIA. In this dataset, PageRank performs outstandingly well.


Figure 5.6: Spread influence in amazon. Panels: (a) WC (k = |V|/1000), (b) WC (k = |V|/100), (c) TRI (k = |V|/1000), (d) TRI (k = |V|/100).

Finally, in the last datasets, shown in Figure 5.6, PMIA and LIPS are the algorithms that produce the most influence spread. In these cases, the gap between the top algorithms and the rest increases as k increases. The consistency shown by the PMIA and LIPS algorithms across all the datasets is truly remarkable.

5.2.3 Scalability

Figure 5.7 presents the total time taken by each algorithm to solve the problem for k = |V|/100. All the degree centrality heuristics except Weighted Degree are absent because their execution times are too small to be visible at the scale of the chart. Focusing on the top algorithms, PMIA and LIPS, when they are compared head to head using the same threshold parameter θ = 1/120, LIPS is faster than PMIA in 4 out of 6 cases.


Figure 5.7: Execution time for different datasets (k = |V|/100). Panels: (a), (b), (c).

To confirm the results in Figure 5.7, Figure 5.8 shows k vs. t for three datasets in which time consumption grows faster in PMIA than in LIPS.

Figure 5.8: Scalability of algorithms against k. Panels: (a) NetHEPT WC dataset (t in log scale), (b) Epinions WC dataset, (c) Amazon WC dataset.

The results of the experimentation phase show that LIPS is an effective and efficient algorithm for finding influencers in a social network. Its spread results were always among the best in all the datasets. Moreover, it is clearly faster than the greedy algorithm, and in several cases it is also faster than other heuristics such as PMIA. In summary, LIPS is a real alternative to previous works on the subject.


Chapter 6 Conclusions

The main contribution of this research is LIPS, a new heuristic algorithm for influence maximization which is more efficient than previously researched approaches in several cases. LIPS is based on calculating an influence score on local influence regions. These regions are defined by two parameters: θ, the minimum influence probability value for an influence to be propagated; and dmax, the maximum distance at which a node can be influenced by another node. Defining the region size through parameters makes it possible to choose a balance between efficiency and accuracy.

In a social graph, the close neighborhood of an individual is one of the most important factors in deciding whether this individual is a top influencer. This is why a specialized version of the breadth-first search algorithm, with less computational complexity, behaves so well compared to greedy and lowest-cost-path based heuristics. It is therefore important to get as much information as possible from a local influence region, because influence propagation probability decreases very fast with distance.


Chapter 7 Future Work

There is a massive amount of work to be done about influence maximization and related social network mining for knowledge discovery topics.

First, as can be seen in the results of this research, estimating how good a particular algorithm is on a given graph is complex. Graph indicators have to be developed in order to quickly decide which approach is better for each case. These indicators should also weigh the real benefit of increasing the influence spread against the effort required to achieve it.

Another possible line of research should focus on developing more accurate diffusion models for social networks, as in [15, 16]. These models should include variables for time, similarity between nodes, and multi-topic influence, among others. In addition, it would be interesting for these models to support dynamic graphs whose structure and attributes change over time. This could be done through stream processing techniques.

Finally, scalability remains a challenge for this problem, and distributed computing should be the path to take. Algorithms like LIPS and PMIA are based on local region calculations; this should make them easier to parallelize using frameworks like Map-Reduce [29] and Pregel [30].


Appendix A

Objective Function Properties

In the best scenario, the maximization objective function σG(S) should have some properties that can be used to guarantee that a solution is within 1 − 1/e of the optimal using a hill climbing approach. Kempe et al. [9] prove that this guarantee holds if the objective function σG(S) is submodular and monotonic. These properties are described as follows.

A.1 Monotonic Function

This property states that a function preserves a given order. Given a function f, it is monotonic if:

x ≤ y → f(x) ≤ f(y)    (A.1)

For influence maximization, this implies that any time a new vertex is added to a seed set S, the resulting influence spread is at least as large as the spread of S.

A.2 Submodular Function

Given a function f, it is submodular if it satisfies

f(S ∪ {v}) − f(S) ≥ f(T ∪ {v}) − f(T),    (A.2)

for all elements v and all pairs of sets S ⊆ T. For the objective function σG(S), this means that each time a new vertex v is added to the seed set S, the marginal gain in σG(S) obtained from any of the other vertices not in S (V − S) diminishes or stays the same.
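Both properties can be checked by brute force on a small example. The sketch below uses a set-coverage function as a stand-in for σG(S); this choice, and every name in the code, is an assumption made for illustration. The check is exponential in the size of the ground set, so it is only meant for toy inputs, and it tests (A.2) for elements v outside T.

```python
from itertools import combinations


def coverage(sets, chosen):
    """Number of items covered by the union of the chosen sets."""
    covered = set()
    for i in chosen:
        covered |= sets[i]
    return len(covered)


def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def is_monotone(f, ground):
    """Check f(S) <= f(T) for every pair S ⊆ T (the set version of A.1)."""
    return all(f(S) <= f(T) for T in subsets(ground) for S in subsets(T))


def is_submodular(f, ground):
    """Check (A.2) for every pair S ⊆ T and every element v outside T."""
    ground = frozenset(ground)
    for T in subsets(ground):
        for S in subsets(T):
            for v in ground - T:
                if f(S | {v}) - f(S) < f(T | {v}) - f(T):
                    return False
    return True
```

For instance, with `sets = {0: {1, 2}, 1: {2, 3}, 2: {3}}` and `f = lambda S: coverage(sets, S)`, both checks pass, whereas `lambda S: len(S) ** 2` is monotone but fails the submodularity check.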


Bibliography

[1] C. Bizer, T. Heath, and T. Berners-Lee, “Linked data - the story so far,” Int. J. Semantic Web Inf. Syst., vol. 5, no. 3, pp. 1–22, 2009.

[2] I. R. Misner, The World’s best known marketing secret: Building your business with word-of-mouth marketing. Bard Press, 1999.

[3] S. Wasserman and K. Faust, Social Network Analysis. Methods and Applications. 1994.

[4] M. O. Jackson, Social and Economic Networks. Princeton, NJ, USA: Princeton University Press, 2008.

[5] S. P. Borgatti, “Identifying sets of key players in a social network,” Comput. Math. Organ. Theory, vol. 12, pp. 21–34, Apr. 2006.

[6] S. A. Delre, W. Jager, and M. A. Janssen, “Diffusion dynamics in small-world networks with heterogeneous consumers,” Comput. Math. Organ. Theory, vol. 13, pp. 185–202, June 2007.

[7] S. A. Delre, W. Jager, T. H. A. Bijmolt, and M. A. Janssen, “Will it spread or not? the effects of social influences and network topology on innovation diffusion,” Journal of Product Innovation Management, vol. 27, no. 2, pp. 267–282, 2010.

[8] H. P. Young, “Innovation diffusion in heterogeneous populations: Contagion, social influence, and social learning,” American Economic Review, vol. 99, pp. 1899–1924, December 2009.

[9] D. Kempe, J. Kleinberg, and E. Tardos, “Maximizing the spread of influence through a social network,” in Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’03, (New York, NY, USA), pp. 137–146, ACM, 2003.


[10] W. Chen, Y. Wang, and S. Yang, “Efficient influence maximization in social networks,” in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’09, (New York, NY, USA), pp. 199–208, ACM, 2009.

[11] W. Chen, C. Wang, and Y. Wang, “Scalable influence maximization for prevalent viral marketing in large-scale social networks,” in Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’10, (New York, NY, USA), pp. 1029–1038, ACM, 2010.

[12] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance, “Cost-effective outbreak detection in networks,” in Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’07, (New York, NY, USA), pp. 420–429, ACM, 2007.

[13] K. Saito, R. Nakano, and M. Kimura, “Prediction of information diffusion probabilities for independent cascade model,” in Proceedings of the 12th international conference on Knowledge-Based Intelligent Information and Engineering Systems, Part III, KES ’08, (Berlin, Heidelberg), pp. 67–75, Springer-Verlag, 2008.

[14] D. Ediger, K. Jiang, E. J. Riedy, D. A. Bader, C. Corley, R. Farber, and W. N. Reynolds, “Massive social network analysis: Mining twitter for social good,” in 39th International Conference on Parallel Processing (ICPP), (San Diego, CA), Sept. 2010.

[15] A. Goyal, F. Bonchi, and L. V. Lakshmanan, “Learning influence probabilities in social networks,” in Proceedings of the third ACM international conference on Web search and data mining, WSDM ’10, (New York, NY, USA), pp. 241–250, ACM, 2010.

[16] J. Tang, J. Sun, C. Wang, and Z. Yang, “Social influence analysis in large-scale networks,” in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’09, (New York, NY, USA), pp. 807–816, ACM, 2009.

[17] D. Gruhl, D. Liben-Nowell, R. Guha, and A. Tomkins, “Information diffusion through blogspace,” SIGKDD Explor. Newsl., vol. 6, pp. 43–52, December 2004.


[18] M. Magnani, D. Montesi, and L. Rossi, “Information propagation analysis in a social network site,” in Proceedings of the 2010 International Conference on Advances in Social Networks Analysis and Mining, ASONAM ’10, (Washington, DC, USA), pp. 296–300, IEEE Computer Society, 2010.

[19] S. Bharathi, D. Kempe, and M. Salek, “Competitive influence maximization in social networks,” in Proceedings of the 3rd international conference on Internet and network economics, WINE’07, (Berlin, Heidelberg), pp. 306–311, Springer-Verlag, 2007.

[20] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts, “Everyone’s an influencer: quantifying influence on twitter,” in Proceedings of the fourth ACM international conference on Web search and data mining, WSDM ’11, (New York, NY, USA), pp. 65–74, ACM, 2011.

[21] J. Bondy and U. Murty, Graph theory with applications. American Elsevier Pub. Co., 1976.

[22] D. Kempe, J. Kleinberg, and É. Tardos, “Influential nodes in a diffusion model for social networks,” in ICALP, pp. 1127–1138, Springer Verlag, 2005.

[23] T. Opsahl, F. Agneessens, and J. Skvoretz, “Node centrality in weighted networks: Generalizing degree and shortest paths,” Social Networks, vol. 32, no. 3, pp. 245–251, 2010.

[24] E. W. Dijkstra, “A note on two problems in connexion with graphs,” Numerische Mathematik, vol. 1, pp. 269–271, 1959. 10.1007/BF01386390.

[25] J. Leskovec, “High-energy physics theory citation network.” http://snap.stanford.edu/data/cit-HepTh.html.

[26] J. Leskovec, “Epinions social network.” http://snap.stanford.edu/data/soc-Epinions1.html.

[27] J. Leskovec, “Amazon product co-purchasing network, march 02 2003.” http://snap.stanford.edu/data/amazon0302.html.

[28] S. Brin and L. Page, “The anatomy of a large-scale hypertextual web search engine,” Comput. Netw. ISDN Syst., vol. 30, pp. 107–117, April 1998.


[29] F. Chierichetti, R. Kumar, and A. Tomkins, “Max-cover in map-reduce,” in Proceedings of the 19th international conference on World wide web, WWW ’10, (New York, NY, USA), pp. 231–240, ACM, 2010.

[30] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: a system for large-scale graph processing,” in Proceedings of the 2010 international conference on Management of data, SIGMOD ’10, (New York, NY, USA), pp. 135–146, ACM, 2010.
